ClickHouse Example Datasets


    Examples

    GitHub Events Dataset

    The dataset contains all events on GitHub from 2011 through December 6, 2020, about 3.1 billion records in total. The download size is 75 GB, and it requires up to 200 GB of disk space when stored in a table with lz4 compression.
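
    Once the events have been imported, a simple aggregation is enough to sanity-check the data. A minimal sketch, assuming the dataset was loaded into a table named github_events with an event_type column (the import itself and the exact schema are not shown in this section):

    -- Hypothetical names: adjust github_events / event_type to your actual schema.
    SELECT event_type, count() AS c
    FROM github_events
    GROUP BY event_type
    ORDER BY c DESC;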


    Anonymized Yandex.Metrica Data

    The dataset consists of two tables containing anonymized data about hits (hits_v1) and visits (visits_v1) of Yandex.Metrica. You can read more about Yandex.Metrica in the ClickHouse history section.

    Either of the two tables can be downloaded as a compressed tsv.xz file or as prepared partitions. In addition, an extended version of the hits table containing 100 million rows is available as TSV at https://datasets.clickhouse.com/hits/tsv/hits_100m_obfuscated_v1.tsv.xz and as prepared partitions at https://datasets.clickhouse.com/hits/partitions/hits_100m_obfuscated_v1.tar.xz.

    Obtaining Tables from Prepared Partitions

    Download and import the hits table:

    curl -O https://datasets.clickhouse.com/hits/partitions/hits_v1.tar
    tar xvf hits_v1.tar -C /var/lib/clickhouse # path to ClickHouse data directory
    # check permissions on unpacked data, fix if required
    sudo service clickhouse-server restart
    clickhouse-client --query "SELECT COUNT(*) FROM datasets.hits_v1"
    

    Download and import the visits table:

    curl -O https://datasets.clickhouse.com/visits/partitions/visits_v1.tar
    tar xvf visits_v1.tar -C /var/lib/clickhouse # path to ClickHouse data directory
    # check permissions on unpacked data, fix if required
    sudo service clickhouse-server restart
    clickhouse-client --query "SELECT COUNT(*) FROM datasets.visits_v1"
    

    Obtaining Tables from Compressed TSV Files

    Download and import hits from a compressed TSV file:

    curl https://datasets.clickhouse.com/hits/tsv/hits_v1.tsv.xz | unxz --threads=`nproc` > hits_v1.tsv
    # now create table
    clickhouse-client --query "CREATE DATABASE IF NOT EXISTS datasets"
    clickhouse-client --query "CREATE TABLE datasets.hits_v1 ( WatchID UInt64,  JavaEnable UInt8,  Title String,  GoodEvent Int16,  EventTime DateTime,  EventDate Date,  CounterID UInt32,  ClientIP UInt32,  ClientIP6 FixedString(16),  RegionID UInt32,  UserID UInt64,  CounterClass Int8,  OS UInt8,  UserAgent UInt8,  URL String,  Referer String,  URLDomain String,  RefererDomain String,  Refresh UInt8,  IsRobot UInt8,  RefererCategories Array(UInt16),  URLCategories Array(UInt16), URLRegions Array(UInt32),  RefererRegions Array(UInt32),  ResolutionWidth UInt16,  ResolutionHeight UInt16,  ResolutionDepth UInt8,  FlashMajor UInt8, FlashMinor UInt8,  FlashMinor2 String,  NetMajor UInt8,  NetMinor UInt8, UserAgentMajor UInt16,  UserAgentMinor FixedString(2),  CookieEnable UInt8, JavascriptEnable UInt8,  IsMobile UInt8,  MobilePhone UInt8,  MobilePhoneModel String,  Params String,  IPNetworkID UInt32,  TraficSourceID Int8, SearchEngineID UInt16,  SearchPhrase String,  AdvEngineID UInt8,  IsArtifical UInt8,  WindowClientWidth UInt16,  WindowClientHeight UInt16,  ClientTimeZone Int16,  ClientEventTime DateTime,  SilverlightVersion1 UInt8, SilverlightVersion2 UInt8,  SilverlightVersion3 UInt32,  SilverlightVersion4 UInt16,  PageCharset String,  CodeVersion UInt32,  IsLink UInt8,  IsDownload UInt8,  IsNotBounce UInt8,  FUniqID UInt64,  HID UInt32,  IsOldCounter UInt8, IsEvent UInt8,  IsParameter UInt8,  DontCountHits UInt8,  WithHash UInt8, HitColor FixedString(1),  UTCEventTime DateTime,  Age UInt8,  Sex UInt8,  Income UInt8,  Interests UInt16,  Robotness UInt8,  GeneralInterests Array(UInt16), RemoteIP UInt32,  RemoteIP6 FixedString(16),  WindowName Int32,  OpenerName Int32,  HistoryLength Int16,  BrowserLanguage FixedString(2),  BrowserCountry FixedString(2),  SocialNetwork String,  SocialAction String,  HTTPError UInt16, SendTiming Int32,  DNSTiming Int32,  ConnectTiming Int32,  ResponseStartTiming Int32,  ResponseEndTiming Int32,  FetchTiming Int32,  RedirectTiming Int32, DOMInteractiveTiming Int32,  DOMContentLoadedTiming Int32,  DOMCompleteTiming Int32,  LoadEventStartTiming Int32,  LoadEventEndTiming Int32, NSToDOMContentLoadedTiming Int32,  FirstPaintTiming Int32,  RedirectCount Int8, SocialSourceNetworkID UInt8,  SocialSourcePage String,  ParamPrice Int64, ParamOrderID String,  ParamCurrency FixedString(3),  ParamCurrencyID UInt16, GoalsReached Array(UInt32),  OpenstatServiceName String,  OpenstatCampaignID String,  OpenstatAdID String,  OpenstatSourceID String,  UTMSource String, UTMMedium String,  UTMCampaign String,  UTMContent String,  UTMTerm String, FromTag String,  HasGCLID UInt8,  RefererHash UInt64,  URLHash UInt64,  CLID UInt32,  YCLID UInt64,  ShareService String,  ShareURL String,  ShareTitle String,  ParsedParams Nested(Key1 String,  Key2 String, Key3 String, Key4 String, Key5 String,  ValueDouble Float64),  IslandID FixedString(16),  RequestNum UInt32,  RequestTry UInt8) ENGINE = MergeTree() PARTITION BY toYYYYMM(EventDate) ORDER BY (CounterID, EventDate, intHash32(UserID)) SAMPLE BY intHash32(UserID) SETTINGS index_granularity = 8192"
    # import data
    cat hits_v1.tsv | clickhouse-client --query "INSERT INTO datasets.hits_v1 FORMAT TSV" --max_insert_block_size=100000
    # optionally you can optimize table
    clickhouse-client --query "OPTIMIZE TABLE datasets.hits_v1 FINAL"
    clickhouse-client --query "SELECT COUNT(*) FROM datasets.hits_v1"
    

    Download and import visits from a compressed TSV file:

    curl https://datasets.clickhouse.com/visits/tsv/visits_v1.tsv.xz | unxz --threads=`nproc` > visits_v1.tsv
    # now create table
    clickhouse-client --query "CREATE DATABASE IF NOT EXISTS datasets"
    clickhouse-client --query "CREATE TABLE datasets.visits_v1 ( CounterID UInt32,  StartDate Date,  Sign Int8,  IsNew UInt8,  VisitID UInt64,  UserID UInt64,  StartTime DateTime,  Duration UInt32,  UTCStartTime DateTime,  PageViews Int32,  Hits Int32,  IsBounce UInt8,  Referer String,  StartURL String,  RefererDomain String,  StartURLDomain String,  EndURL String,  LinkURL String,  IsDownload UInt8,  TraficSourceID Int8,  SearchEngineID UInt16,  SearchPhrase String,  AdvEngineID UInt8,  PlaceID Int32,  RefererCategories Array(UInt16),  URLCategories Array(UInt16),  URLRegions Array(UInt32),  RefererRegions Array(UInt32),  IsYandex UInt8,  GoalReachesDepth Int32,  GoalReachesURL Int32,  GoalReachesAny Int32,  SocialSourceNetworkID UInt8,  SocialSourcePage String,  MobilePhoneModel String,  ClientEventTime DateTime,  RegionID UInt32,  ClientIP UInt32,  ClientIP6 FixedString(16),  RemoteIP UInt32,  RemoteIP6 FixedString(16),  IPNetworkID UInt32,  SilverlightVersion3 UInt32,  CodeVersion UInt32,  ResolutionWidth UInt16,  ResolutionHeight UInt16,  UserAgentMajor UInt16,  UserAgentMinor UInt16,  WindowClientWidth UInt16,  WindowClientHeight UInt16,  SilverlightVersion2 UInt8,  SilverlightVersion4 UInt16,  FlashVersion3 UInt16,  FlashVersion4 UInt16,  ClientTimeZone Int16,  OS UInt8,  UserAgent UInt8,  ResolutionDepth UInt8,  FlashMajor UInt8,  FlashMinor UInt8,  NetMajor UInt8,  NetMinor UInt8,  MobilePhone UInt8,  SilverlightVersion1 UInt8,  Age UInt8,  Sex UInt8,  Income UInt8,  JavaEnable UInt8,  CookieEnable UInt8,  JavascriptEnable UInt8,  IsMobile UInt8,  BrowserLanguage UInt16,  BrowserCountry UInt16,  Interests UInt16,  Robotness UInt8,  GeneralInterests Array(UInt16),  Params Array(String),  Goals Nested(ID UInt32, Serial UInt32, EventTime DateTime,  Price Int64,  OrderID String, CurrencyID UInt32),  WatchIDs Array(UInt64),  ParamSumPrice Int64,  ParamCurrency FixedString(3),  ParamCurrencyID UInt16,  ClickLogID UInt64,  ClickEventID Int32,  ClickGoodEvent Int32,  ClickEventTime DateTime,  ClickPriorityID Int32,  ClickPhraseID Int32,  ClickPageID Int32,  ClickPlaceID Int32,  ClickTypeID Int32,  ClickResourceID Int32,  ClickCost UInt32,  ClickClientIP UInt32,  ClickDomainID UInt32,  ClickURL String,  ClickAttempt UInt8,  ClickOrderID UInt32,  ClickBannerID UInt32,  ClickMarketCategoryID UInt32,  ClickMarketPP UInt32,  ClickMarketCategoryName String,  ClickMarketPPName String,  ClickAWAPSCampaignName String,  ClickPageName String,  ClickTargetType UInt16,  ClickTargetPhraseID UInt64,  ClickContextType UInt8,  ClickSelectType Int8,  ClickOptions String,  ClickGroupBannerID Int32,  OpenstatServiceName String,  OpenstatCampaignID String,  OpenstatAdID String,  OpenstatSourceID String,  UTMSource String,  UTMMedium String,  UTMCampaign String,  UTMContent String,  UTMTerm String,  FromTag String,  HasGCLID UInt8,  FirstVisit DateTime,  PredLastVisit Date,  LastVisit Date,  TotalVisits UInt32,  TraficSource    Nested(ID Int8,  SearchEngineID UInt16, AdvEngineID UInt8, PlaceID UInt16, SocialSourceNetworkID UInt8, Domain String, SearchPhrase String, SocialSourcePage String),  Attendance FixedString(16),  CLID UInt32,  YCLID UInt64,  NormalizedRefererHash UInt64,  SearchPhraseHash UInt64,  RefererDomainHash UInt64,  NormalizedStartURLHash UInt64,  StartURLDomainHash UInt64,  NormalizedEndURLHash UInt64,  TopLevelDomain UInt64,  URLScheme UInt64,  OpenstatServiceNameHash UInt64,  OpenstatCampaignIDHash UInt64,  OpenstatAdIDHash UInt64,  OpenstatSourceIDHash UInt64,  UTMSourceHash UInt64,  
UTMMediumHash UInt64,  UTMCampaignHash UInt64,  UTMContentHash UInt64,  UTMTermHash UInt64,  FromHash UInt64,  WebVisorEnabled UInt8,  WebVisorActivity UInt32,  ParsedParams    Nested(Key1 String,  Key2 String,  Key3 String,  Key4 String, Key5 String, ValueDouble    Float64),  Market Nested(Type UInt8, GoalID UInt32, OrderID String,  OrderPrice Int64,  PP UInt32,  DirectPlaceID UInt32,  DirectOrderID  UInt32,  DirectBannerID UInt32,  GoodID String, GoodName String, GoodQuantity Int32,  GoodPrice Int64),  IslandID FixedString(16)) ENGINE = CollapsingMergeTree(Sign) PARTITION BY toYYYYMM(StartDate) ORDER BY (CounterID, StartDate, intHash32(UserID), VisitID) SAMPLE BY intHash32(UserID) SETTINGS index_granularity = 8192"
    # import data
    cat visits_v1.tsv | clickhouse-client --query "INSERT INTO datasets.visits_v1 FORMAT TSV" --max_insert_block_size=100000
    # optionally you can optimize table
    clickhouse-client --query "OPTIMIZE TABLE datasets.visits_v1 FINAL"
    clickhouse-client --query "SELECT COUNT(*) FROM datasets.visits_v1"
    

    Example Queries

    The ClickHouse tutorial is based on the Yandex.Metrica dataset, and going through the tutorial is the recommended way to get started with it.

    Additional examples of queries against these tables can be found among the ClickHouse stateful tests (where they are named test.hits and test.visits).
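
    For instance, counting hits per domain over the imported table (a minimal sketch that uses only columns from the hits_v1 schema shown above):

    SELECT URLDomain, count() AS hits
    FROM datasets.hits_v1
    GROUP BY URLDomain
    ORDER BY hits DESC
    LIMIT 10;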


    Star Schema Benchmark

    Compiling dbgen:

    $ git clone git@github.com:vadimtk/ssb-dbgen.git
    $ cd ssb-dbgen
    $ make
    

    Generating data:

    !!! warning "Note" With -s 100, dbgen generates 600 million rows (67 GB); with -s 1000 it generates 6 billion rows (which takes a long time).

    $ ./dbgen -s 1000 -T c
    $ ./dbgen -s 1000 -T l
    $ ./dbgen -s 1000 -T p
    $ ./dbgen -s 1000 -T s
    $ ./dbgen -s 1000 -T d
    

    Creating tables in ClickHouse:

    CREATE TABLE customer
    (
            C_CUSTKEY       UInt32,
            C_NAME          String,
            C_ADDRESS       String,
            C_CITY          LowCardinality(String),
            C_NATION        LowCardinality(String),
            C_REGION        LowCardinality(String),
            C_PHONE         String,
            C_MKTSEGMENT    LowCardinality(String)
    )
    ENGINE = MergeTree ORDER BY (C_CUSTKEY);
    
    CREATE TABLE lineorder
    (
        LO_ORDERKEY             UInt32,
        LO_LINENUMBER           UInt8,
        LO_CUSTKEY              UInt32,
        LO_PARTKEY              UInt32,
        LO_SUPPKEY              UInt32,
        LO_ORDERDATE            Date,
        LO_ORDERPRIORITY        LowCardinality(String),
        LO_SHIPPRIORITY         UInt8,
        LO_QUANTITY             UInt8,
        LO_EXTENDEDPRICE        UInt32,
        LO_ORDTOTALPRICE        UInt32,
        LO_DISCOUNT             UInt8,
        LO_REVENUE              UInt32,
        LO_SUPPLYCOST           UInt32,
        LO_TAX                  UInt8,
        LO_COMMITDATE           Date,
        LO_SHIPMODE             LowCardinality(String)
    )
    ENGINE = MergeTree PARTITION BY toYear(LO_ORDERDATE) ORDER BY (LO_ORDERDATE, LO_ORDERKEY);
    
    CREATE TABLE part
    (
            P_PARTKEY       UInt32,
            P_NAME          String,
            P_MFGR          LowCardinality(String),
            P_CATEGORY      LowCardinality(String),
            P_BRAND         LowCardinality(String),
            P_COLOR         LowCardinality(String),
            P_TYPE          LowCardinality(String),
            P_SIZE          UInt8,
            P_CONTAINER     LowCardinality(String)
    )
    ENGINE = MergeTree ORDER BY P_PARTKEY;
    
    CREATE TABLE supplier
    (
            S_SUPPKEY       UInt32,
            S_NAME          String,
            S_ADDRESS       String,
            S_CITY          LowCardinality(String),
            S_NATION        LowCardinality(String),
            S_REGION        LowCardinality(String),
            S_PHONE         String
    )
    ENGINE = MergeTree ORDER BY S_SUPPKEY;
    

    Inserting data:

    $ clickhouse-client --query "INSERT INTO customer FORMAT CSV" < customer.tbl
    $ clickhouse-client --query "INSERT INTO part FORMAT CSV" < part.tbl
    $ clickhouse-client --query "INSERT INTO supplier FORMAT CSV" < supplier.tbl
    $ clickhouse-client --query "INSERT INTO lineorder FORMAT CSV" < lineorder.tbl
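
    To verify the load, you can count the rows in each table (a small sketch; the expected counts depend on the scale factor passed to dbgen):

    SELECT 'customer' AS tbl, count() AS cnt FROM customer
    UNION ALL SELECT 'part', count() FROM part
    UNION ALL SELECT 'supplier', count() FROM supplier
    UNION ALL SELECT 'lineorder', count() FROM lineorder;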
    

    Converting the star schema into a denormalized flat schema:

    SET max_memory_usage = 20000000000;
    
    CREATE TABLE lineorder_flat
    ENGINE = MergeTree
    PARTITION BY toYear(LO_ORDERDATE)
    ORDER BY (LO_ORDERDATE, LO_ORDERKEY) AS
    SELECT
        l.LO_ORDERKEY AS LO_ORDERKEY,
        l.LO_LINENUMBER AS LO_LINENUMBER,
        l.LO_CUSTKEY AS LO_CUSTKEY,
        l.LO_PARTKEY AS LO_PARTKEY,
        l.LO_SUPPKEY AS LO_SUPPKEY,
        l.LO_ORDERDATE AS LO_ORDERDATE,
        l.LO_ORDERPRIORITY AS LO_ORDERPRIORITY,
        l.LO_SHIPPRIORITY AS LO_SHIPPRIORITY,
        l.LO_QUANTITY AS LO_QUANTITY,
        l.LO_EXTENDEDPRICE AS LO_EXTENDEDPRICE,
        l.LO_ORDTOTALPRICE AS LO_ORDTOTALPRICE,
        l.LO_DISCOUNT AS LO_DISCOUNT,
        l.LO_REVENUE AS LO_REVENUE,
        l.LO_SUPPLYCOST AS LO_SUPPLYCOST,
        l.LO_TAX AS LO_TAX,
        l.LO_COMMITDATE AS LO_COMMITDATE,
        l.LO_SHIPMODE AS LO_SHIPMODE,
        c.C_NAME AS C_NAME,
        c.C_ADDRESS AS C_ADDRESS,
        c.C_CITY AS C_CITY,
        c.C_NATION AS C_NATION,
        c.C_REGION AS C_REGION,
        c.C_PHONE AS C_PHONE,
        c.C_MKTSEGMENT AS C_MKTSEGMENT,
        s.S_NAME AS S_NAME,
        s.S_ADDRESS AS S_ADDRESS,
        s.S_CITY AS S_CITY,
        s.S_NATION AS S_NATION,
        s.S_REGION AS S_REGION,
        s.S_PHONE AS S_PHONE,
        p.P_NAME AS P_NAME,
        p.P_MFGR AS P_MFGR,
        p.P_CATEGORY AS P_CATEGORY,
        p.P_BRAND AS P_BRAND,
        p.P_COLOR AS P_COLOR,
        p.P_TYPE AS P_TYPE,
        p.P_SIZE AS P_SIZE,
        p.P_CONTAINER AS P_CONTAINER
    FROM lineorder AS l
    INNER JOIN customer AS c ON c.C_CUSTKEY = l.LO_CUSTKEY
    INNER JOIN supplier AS s ON s.S_SUPPKEY = l.LO_SUPPKEY
    INNER JOIN part AS p ON p.P_PARTKEY = l.LO_PARTKEY;
    

    Running the queries:

    Q1.1

    SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue
    FROM lineorder_flat
    WHERE toYear(LO_ORDERDATE) = 1993 AND LO_DISCOUNT BETWEEN 1 AND 3 AND LO_QUANTITY < 25;
    

    Q1.2

    SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue
    FROM lineorder_flat
    WHERE toYYYYMM(LO_ORDERDATE) = 199401 AND LO_DISCOUNT BETWEEN 4 AND 6 AND LO_QUANTITY BETWEEN 26 AND 35;
    

    Q1.3

    SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue
    FROM lineorder_flat
    WHERE toISOWeek(LO_ORDERDATE) = 6 AND toYear(LO_ORDERDATE) = 1994
      AND LO_DISCOUNT BETWEEN 5 AND 7 AND LO_QUANTITY BETWEEN 26 AND 35;
    

    Q2.1

    SELECT
        sum(LO_REVENUE),
        toYear(LO_ORDERDATE) AS year,
        P_BRAND
    FROM lineorder_flat
    WHERE P_CATEGORY = 'MFGR#12' AND S_REGION = 'AMERICA'
    GROUP BY
        year,
        P_BRAND
    ORDER BY
        year,
        P_BRAND;
    

    Q2.2

    SELECT
        sum(LO_REVENUE),
        toYear(LO_ORDERDATE) AS year,
        P_BRAND
    FROM lineorder_flat
    WHERE P_BRAND >= 'MFGR#2221' AND P_BRAND <= 'MFGR#2228' AND S_REGION = 'ASIA'
    GROUP BY
        year,
        P_BRAND
    ORDER BY
        year,
        P_BRAND;
    

    Q2.3

    SELECT
        sum(LO_REVENUE),
        toYear(LO_ORDERDATE) AS year,
        P_BRAND
    FROM lineorder_flat
    WHERE P_BRAND = 'MFGR#2239' AND S_REGION = 'EUROPE'
    GROUP BY
        year,
        P_BRAND
    ORDER BY
        year,
        P_BRAND;
    

    Q3.1

    SELECT
        C_NATION,
        S_NATION,
        toYear(LO_ORDERDATE) AS year,
        sum(LO_REVENUE) AS revenue
    FROM lineorder_flat
    WHERE C_REGION = 'ASIA' AND S_REGION = 'ASIA' AND year >= 1992 AND year <= 1997
    GROUP BY
        C_NATION,
        S_NATION,
        year
    ORDER BY
        year ASC,
        revenue DESC;
    

    Q3.2

    SELECT
        C_CITY,
        S_CITY,
        toYear(LO_ORDERDATE) AS year,
        sum(LO_REVENUE) AS revenue
    FROM lineorder_flat
    WHERE C_NATION = 'UNITED STATES' AND S_NATION = 'UNITED STATES' AND year >= 1992 AND year <= 1997
    GROUP BY
        C_CITY,
        S_CITY,
        year
    ORDER BY
        year ASC,
        revenue DESC;
    

    Q3.3

    SELECT
        C_CITY,
        S_CITY,
        toYear(LO_ORDERDATE) AS year,
        sum(LO_REVENUE) AS revenue
    FROM lineorder_flat
    WHERE (C_CITY = 'UNITED KI1' OR C_CITY = 'UNITED KI5') AND (S_CITY = 'UNITED KI1' OR S_CITY = 'UNITED KI5') AND year >= 1992 AND year <= 1997
    GROUP BY
        C_CITY,
        S_CITY,
        year
    ORDER BY
        year ASC,
        revenue DESC;
    

    Q3.4

    SELECT
        C_CITY,
        S_CITY,
        toYear(LO_ORDERDATE) AS year,
        sum(LO_REVENUE) AS revenue
    FROM lineorder_flat
    WHERE (C_CITY = 'UNITED KI1' OR C_CITY = 'UNITED KI5') AND (S_CITY = 'UNITED KI1' OR S_CITY = 'UNITED KI5') AND toYYYYMM(LO_ORDERDATE) = 199712
    GROUP BY
        C_CITY,
        S_CITY,
        year
    ORDER BY
        year ASC,
        revenue DESC;
    

    Q4.1

    SELECT
        toYear(LO_ORDERDATE) AS year,
        C_NATION,
        sum(LO_REVENUE - LO_SUPPLYCOST) AS profit
    FROM lineorder_flat
    WHERE C_REGION = 'AMERICA' AND S_REGION = 'AMERICA' AND (P_MFGR = 'MFGR#1' OR P_MFGR = 'MFGR#2')
    GROUP BY
        year,
        C_NATION
    ORDER BY
        year ASC,
        C_NATION ASC;
    

    Q4.2

    SELECT
        toYear(LO_ORDERDATE) AS year,
        S_NATION,
        P_CATEGORY,
        sum(LO_REVENUE - LO_SUPPLYCOST) AS profit
    FROM lineorder_flat
    WHERE C_REGION = 'AMERICA' AND S_REGION = 'AMERICA' AND (year = 1997 OR year = 1998) AND (P_MFGR = 'MFGR#1' OR P_MFGR = 'MFGR#2')
    GROUP BY
        year,
        S_NATION,
        P_CATEGORY
    ORDER BY
        year ASC,
        S_NATION ASC,
        P_CATEGORY ASC;
    

    Q4.3

    SELECT
        toYear(LO_ORDERDATE) AS year,
        S_CITY,
        P_BRAND,
        sum(LO_REVENUE - LO_SUPPLYCOST) AS profit
    FROM lineorder_flat
    WHERE S_NATION = 'UNITED STATES' AND (year = 1997 OR year = 1998) AND P_CATEGORY = 'MFGR#14'
    GROUP BY
        year,
        S_CITY,
        P_BRAND
    ORDER BY
        year ASC,
        S_CITY ASC,
        P_BRAND ASC;

    WikiStat

    See: http://dumps.wikimedia.org/other/pagecounts-raw/

    Creating the table:

    CREATE TABLE wikistat
    (
        date Date,
        time DateTime,
        project String,
        subproject String,
        path String,
        hits UInt64,
        size UInt64
    ) ENGINE = MergeTree(date, (path, time), 8192);
    

    Loading data:

    $ for i in {2007..2016}; do for j in {01..12}; do echo $i-$j >&2; curl -sSL "http://dumps.wikimedia.org/other/pagecounts-raw/$i/$i-$j/" | grep -oE 'pagecounts-[0-9]+-[0-9]+\.gz'; done; done | sort | uniq | tee links.txt
    $ cat links.txt | while read link; do wget http://dumps.wikimedia.org/other/pagecounts-raw/$(echo $link | sed -r 's/pagecounts-([0-9]{4})([0-9]{2})[0-9]{2}-[0-9]+\.gz/\1/')/$(echo $link | sed -r 's/pagecounts-([0-9]{4})([0-9]{2})[0-9]{2}-[0-9]+\.gz/\1-\2/')/$link; done
    $ ls -1 /opt/wikistat/ | grep gz | while read i; do echo $i; gzip -cd /opt/wikistat/$i | ./wikistat-loader --time="$(echo -n $i | sed -r 's/pagecounts-([0-9]{4})([0-9]{2})([0-9]{2})-([0-9]{2})([0-9]{2})([0-9]{2})\.gz/\1-\2-\3 \4-00-00/')" | clickhouse-client --query="INSERT INTO wikistat FORMAT TabSeparated"; done
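
    After loading, an aggregation such as the following returns the most-requested paths (a simple sketch; 'en' is assumed to occur as a project code in the data):

    -- Top pages by total hits for one project
    SELECT path, sum(hits) AS total_hits
    FROM wikistat
    WHERE project = 'en'
    GROUP BY path
    ORDER BY total_hits DESC
    LIMIT 10;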

    Terabyte of Click Logs from Criteo

    The data can be downloaded from http://labs.criteo.com/downloads/download-terabyte-click-logs/

    Create a table for the raw data:

    CREATE TABLE criteo_log (date Date, clicked UInt8, int1 Int32, int2 Int32, int3 Int32, int4 Int32, int5 Int32, int6 Int32, int7 Int32, int8 Int32, int9 Int32, int10 Int32, int11 Int32, int12 Int32, int13 Int32, cat1 String, cat2 String, cat3 String, cat4 String, cat5 String, cat6 String, cat7 String, cat8 String, cat9 String, cat10 String, cat11 String, cat12 String, cat13 String, cat14 String, cat15 String, cat16 String, cat17 String, cat18 String, cat19 String, cat20 String, cat21 String, cat22 String, cat23 String, cat24 String, cat25 String, cat26 String) ENGINE = Log
    

    Load the data:

    $ for i in {00..23}; do echo $i; zcat datasets/criteo/day_${i#0}.gz | sed -r 's/^/2000-01-'${i/00/24}'\t/' | clickhouse-client --host=example-perftest01j --query="INSERT INTO criteo_log FORMAT TabSeparated"; done
    

    Create a table for the converted data:

    CREATE TABLE criteo
    (
        date Date,
        clicked UInt8,
        int1 Int32,
        int2 Int32,
        int3 Int32,
        int4 Int32,
        int5 Int32,
        int6 Int32,
        int7 Int32,
        int8 Int32,
        int9 Int32,
        int10 Int32,
        int11 Int32,
        int12 Int32,
        int13 Int32,
        icat1 UInt32,
        icat2 UInt32,
        icat3 UInt32,
        icat4 UInt32,
        icat5 UInt32,
        icat6 UInt32,
        icat7 UInt32,
        icat8 UInt32,
        icat9 UInt32,
        icat10 UInt32,
        icat11 UInt32,
        icat12 UInt32,
        icat13 UInt32,
        icat14 UInt32,
        icat15 UInt32,
        icat16 UInt32,
        icat17 UInt32,
        icat18 UInt32,
        icat19 UInt32,
        icat20 UInt32,
        icat21 UInt32,
        icat22 UInt32,
        icat23 UInt32,
        icat24 UInt32,
        icat25 UInt32,
        icat26 UInt32
    ) ENGINE = MergeTree(date, intHash32(icat1), (date, intHash32(icat1)), 8192)
    

    Transform the raw data from the first table and insert it into the second one:

    INSERT INTO criteo SELECT date, clicked, int1, int2, int3, int4, int5, int6, int7, int8, int9, int10, int11, int12, int13, reinterpretAsUInt32(unhex(cat1)) AS icat1, reinterpretAsUInt32(unhex(cat2)) AS icat2, reinterpretAsUInt32(unhex(cat3)) AS icat3, reinterpretAsUInt32(unhex(cat4)) AS icat4, reinterpretAsUInt32(unhex(cat5)) AS icat5, reinterpretAsUInt32(unhex(cat6)) AS icat6, reinterpretAsUInt32(unhex(cat7)) AS icat7, reinterpretAsUInt32(unhex(cat8)) AS icat8, reinterpretAsUInt32(unhex(cat9)) AS icat9, reinterpretAsUInt32(unhex(cat10)) AS icat10, reinterpretAsUInt32(unhex(cat11)) AS icat11, reinterpretAsUInt32(unhex(cat12)) AS icat12, reinterpretAsUInt32(unhex(cat13)) AS icat13, reinterpretAsUInt32(unhex(cat14)) AS icat14, reinterpretAsUInt32(unhex(cat15)) AS icat15, reinterpretAsUInt32(unhex(cat16)) AS icat16, reinterpretAsUInt32(unhex(cat17)) AS icat17, reinterpretAsUInt32(unhex(cat18)) AS icat18, reinterpretAsUInt32(unhex(cat19)) AS icat19, reinterpretAsUInt32(unhex(cat20)) AS icat20, reinterpretAsUInt32(unhex(cat21)) AS icat21, reinterpretAsUInt32(unhex(cat22)) AS icat22, reinterpretAsUInt32(unhex(cat23)) AS icat23, reinterpretAsUInt32(unhex(cat24)) AS icat24, reinterpretAsUInt32(unhex(cat25)) AS icat25, reinterpretAsUInt32(unhex(cat26)) AS icat26 FROM criteo_log;
    
    DROP TABLE criteo_log;
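
    As a quick check of the converted table, you can compute the per-day click-through rate (a minimal sketch using only columns from the schema above):

    SELECT date, count() AS impressions, avg(clicked) AS ctr
    FROM criteo
    GROUP BY date
    ORDER BY date;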


    AMPLab Big Data Benchmark

    See https://amplab.cs.berkeley.edu/benchmark/

    You need a free account at Amazon; registration requires a credit card, email address, and phone number. Afterwards, obtain a new access key in the Amazon AWS Console.

    Run the following in the console:

    $ sudo apt-get install s3cmd
    $ mkdir tiny; cd tiny;
    $ s3cmd sync s3://big-data-benchmark/pavlo/text-deflate/tiny/ .
    $ cd ..
    $ mkdir 1node; cd 1node;
    $ s3cmd sync s3://big-data-benchmark/pavlo/text-deflate/1node/ .
    $ cd ..
    $ mkdir 5nodes; cd 5nodes;
    $ s3cmd sync s3://big-data-benchmark/pavlo/text-deflate/5nodes/ .
    $ cd ..
    

    Run the following queries in ClickHouse:

    CREATE TABLE rankings_tiny
    (
        pageURL String,
        pageRank UInt32,
        avgDuration UInt32
    ) ENGINE = Log;
    
    CREATE TABLE uservisits_tiny
    (
        sourceIP String,
        destinationURL String,
        visitDate Date,
        adRevenue Float32,
        UserAgent String,
        cCode FixedString(3),
        lCode FixedString(6),
        searchWord String,
        duration UInt32
    ) ENGINE = MergeTree(visitDate, visitDate, 8192);
    
    CREATE TABLE rankings_1node
    (
        pageURL String,
        pageRank UInt32,
        avgDuration UInt32
    ) ENGINE = Log;
    
    CREATE TABLE uservisits_1node
    (
        sourceIP String,
        destinationURL String,
        visitDate Date,
        adRevenue Float32,
        UserAgent String,
        cCode FixedString(3),
        lCode FixedString(6),
        searchWord String,
        duration UInt32
    ) ENGINE = MergeTree(visitDate, visitDate, 8192);
    
    CREATE TABLE rankings_5nodes_on_single
    (
        pageURL String,
        pageRank UInt32,
        avgDuration UInt32
    ) ENGINE = Log;
    
    CREATE TABLE uservisits_5nodes_on_single
    (
        sourceIP String,
        destinationURL String,
        visitDate Date,
        adRevenue Float32,
        UserAgent String,
        cCode FixedString(3),
        lCode FixedString(6),
        searchWord String,
        duration UInt32
    ) ENGINE = MergeTree(visitDate, visitDate, 8192);
    

    Go back to the console and run:

    $ for i in tiny/rankings/*.deflate; do echo $i; zlib-flate -uncompress < $i | clickhouse-client --host=example-perftest01j --query="INSERT INTO rankings_tiny FORMAT CSV"; done
    $ for i in tiny/uservisits/*.deflate; do echo $i; zlib-flate -uncompress < $i | clickhouse-client --host=example-perftest01j --query="INSERT INTO uservisits_tiny FORMAT CSV"; done
    $ for i in 1node/rankings/*.deflate; do echo $i; zlib-flate -uncompress < $i | clickhouse-client --host=example-perftest01j --query="INSERT INTO rankings_1node FORMAT CSV"; done
    $ for i in 1node/uservisits/*.deflate; do echo $i; zlib-flate -uncompress < $i | clickhouse-client --host=example-perftest01j --query="INSERT INTO uservisits_1node FORMAT CSV"; done
    $ for i in 5nodes/rankings/*.deflate; do echo $i; zlib-flate -uncompress < $i | clickhouse-client --host=example-perftest01j --query="INSERT INTO rankings_5nodes_on_single FORMAT CSV"; done
    $ for i in 5nodes/uservisits/*.deflate; do echo $i; zlib-flate -uncompress < $i | clickhouse-client --host=example-perftest01j --query="INSERT INTO uservisits_5nodes_on_single FORMAT CSV"; done
    

    Simple query examples:

    SELECT pageURL, pageRank FROM rankings_1node WHERE pageRank > 1000
    
    SELECT substring(sourceIP, 1, 8), sum(adRevenue) FROM uservisits_1node GROUP BY substring(sourceIP, 1, 8)
    
    SELECT
        sourceIP,
        sum(adRevenue) AS totalRevenue,
        avg(pageRank) AS pageRank
    FROM rankings_1node ALL INNER JOIN
    (
        SELECT
            sourceIP,
            destinationURL AS pageURL,
            adRevenue
        FROM uservisits_1node
        WHERE (visitDate > '1980-01-01') AND (visitDate < '1980-04-01')
    ) USING pageURL
    GROUP BY sourceIP
    ORDER BY totalRevenue DESC
    LIMIT 1

    New York Taxi Data

    The New York City taxi data can be obtained in two ways:

    • Import from raw data
    • Download of prepared partitions

    How to Import the Raw Data

    See https://github.com/toddwschneider/nyc-taxi-data and http://tech.marksblogg.com/billion-nyc-taxi-rides-redshift.html for a description of the dataset and instructions for downloading it.

    Downloading results in about 227 GB of uncompressed CSV files. The download takes about an hour over a 1 Gbit connection (parallel downloading from s3.amazonaws.com recovers at least half of a 1 Gbit channel). Some of the files might not download fully; check the file sizes and re-download anything that seems suspicious.

    Some of the files contain invalid rows. You can fix them as follows:

    sed -E '/(.*,){18,}/d' data/yellow_tripdata_2010-02.csv > data/yellow_tripdata_2010-02.csv_
    sed -E '/(.*,){18,}/d' data/yellow_tripdata_2010-03.csv > data/yellow_tripdata_2010-03.csv_
    mv data/yellow_tripdata_2010-02.csv_ data/yellow_tripdata_2010-02.csv
    mv data/yellow_tripdata_2010-03.csv_ data/yellow_tripdata_2010-03.csv

    The data must then be pre-processed in PostgreSQL. This performs point-in-polygon selections (to match points on the map with the boroughs of New York City) and joins all the data into a single denormalized flat table. To do this, you need PostgreSQL installed with PostGIS support.

    Be careful when running initialize_database.sh, and manually re-check that all the tables were created correctly.

    Processing each month of data in PostgreSQL takes about 20-30 minutes, roughly 48 hours in total.

    You can check the number of downloaded rows as follows:

    $ time psql nyc-taxi-data -c "SELECT count(*) FROM trips;"
       count
    ------------
    1298979494
    (1 row)

    real 7m9.164s

    (This is slightly more than the 1.1 billion rows reported by Mark Litwintschik in his series of blog posts.)

    The data in PostgreSQL uses about 370 GB of disk space.

    Exporting the data from PostgreSQL:

    COPY
    (
    SELECT trips.id,
    trips.vendor_id,
    trips.pickup_datetime,
    trips.dropoff_datetime,
    trips.store_and_fwd_flag,
    trips.rate_code_id,
    trips.pickup_longitude,
    trips.pickup_latitude,
    trips.dropoff_longitude,
    trips.dropoff_latitude,
    trips.passenger_count,
    trips.trip_distance,
    trips.fare_amount,
    trips.extra,
    trips.mta_tax,
    trips.tip_amount,
    trips.tolls_amount,
    trips.ehail_fee,
    trips.improvement_surcharge,
    trips.total_amount,
    trips.payment_type,
    trips.trip_type,
    trips.pickup,
    trips.dropoff,

    cab_types.type cab_type,

    weather.precipitation_tenths_of_mm rain,
    weather.snow_depth_mm,
    weather.snowfall_mm,
    weather.max_temperature_tenths_degrees_celsius max_temp,
    weather.min_temperature_tenths_degrees_celsius min_temp,
    weather.average_wind_speed_tenths_of_meters_per_second wind,

    pick_up.gid pickup_nyct2010_gid,
    pick_up.ctlabel pickup_ctlabel,
    pick_up.borocode pickup_borocode,
    pick_up.boroname pickup_boroname,
    pick_up.ct2010 pickup_ct2010,
    pick_up.boroct2010 pickup_boroct2010,
    pick_up.cdeligibil pickup_cdeligibil,
    pick_up.ntacode pickup_ntacode,
    pick_up.ntaname pickup_ntaname,
    pick_up.puma pickup_puma,

    drop_off.gid dropoff_nyct2010_gid,
    drop_off.ctlabel dropoff_ctlabel,
    drop_off.borocode dropoff_borocode,
    drop_off.boroname dropoff_boroname,
    drop_off.ct2010 dropoff_ct2010,
    drop_off.boroct2010 dropoff_boroct2010,
    drop_off.cdeligibil dropoff_cdeligibil,
    drop_off.ntacode dropoff_ntacode,
    drop_off.ntaname dropoff_ntaname,
    drop_off.puma dropoff_puma
    FROM trips
    LEFT JOIN cab_types
    ON trips.cab_type_id = cab_types.id
    LEFT JOIN central_park_weather_observations_raw weather
    ON weather.date = trips.pickup_datetime::date
    LEFT JOIN nyct2010 pick_up
    ON pick_up.gid = trips.pickup_nyct2010_gid
    LEFT JOIN nyct2010 drop_off
    ON drop_off.gid = trips.dropoff_nyct2010_gid
    ) TO '/opt/milovidov/nyc-taxi-data/trips.tsv';

    The data snapshot is created at a speed of about 50 MB per second. While creating the snapshot, PostgreSQL reads from disk at about 28 MB per second. This takes about 5 hours. The resulting TSV file is 590612904969 bytes.

    Create a temporary table in ClickHouse:

    CREATE TABLE trips
    (
    trip_id UInt32,
    vendor_id String,
    pickup_datetime DateTime,
    dropoff_datetime Nullable(DateTime),
    store_and_fwd_flag Nullable(FixedString(1)),
    rate_code_id Nullable(UInt8),
    pickup_longitude Nullable(Float64),
    pickup_latitude Nullable(Float64),
    dropoff_longitude Nullable(Float64),
    dropoff_latitude Nullable(Float64),
    passenger_count Nullable(UInt8),
    trip_distance Nullable(Float64),
    fare_amount Nullable(Float32),
    extra Nullable(Float32),
    mta_tax Nullable(Float32),
    tip_amount Nullable(Float32),
    tolls_amount Nullable(Float32),
    ehail_fee Nullable(Float32),
    improvement_surcharge Nullable(Float32),
    total_amount Nullable(Float32),
    payment_type Nullable(String),
    trip_type Nullable(UInt8),
    pickup Nullable(String),
    dropoff Nullable(String),
    cab_type Nullable(String),
    precipitation Nullable(UInt8),
    snow_depth Nullable(UInt8),
    snowfall Nullable(UInt8),
    max_temperature Nullable(UInt8),
    min_temperature Nullable(UInt8),
    average_wind_speed Nullable(UInt8),
    pickup_nyct2010_gid Nullable(UInt8),
    pickup_ctlabel Nullable(String),
    pickup_borocode Nullable(UInt8),
    pickup_boroname Nullable(String),
    pickup_ct2010 Nullable(String),
    pickup_boroct2010 Nullable(String),
    pickup_cdeligibil Nullable(FixedString(1)),
    pickup_ntacode Nullable(String),
    pickup_ntaname Nullable(String),
    pickup_puma Nullable(String),
    dropoff_nyct2010_gid Nullable(UInt8),
    dropoff_ctlabel Nullable(String),
    dropoff_borocode Nullable(UInt8),
    dropoff_boroname Nullable(String),
    dropoff_ct2010 Nullable(String),
    dropoff_boroct2010 Nullable(String),
    dropoff_cdeligibil Nullable(String),
    dropoff_ntacode Nullable(String),
    dropoff_ntaname Nullable(String),
    dropoff_puma Nullable(String)
    ) ENGINE = Log;

    Next, the fields need to be converted to more correct data types and, where possible, NULLs eliminated.

    $ time clickhouse-client --query="INSERT INTO trips FORMAT TabSeparated" < trips.tsv

    real 75m56.214s

    Data is read at a speed of 112-140 MB/second. Loading the data into a Log table in a single stream took 76 minutes. The data in this table uses 142 GB of disk space.

    (Importing data directly from Postgres is also possible, using COPY ... TO PROGRAM.)
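
    A sketch of that variant (not tested here; it assumes clickhouse-client is installed on the PostgreSQL host, and the placeholder SELECT must be replaced with the full export query shown above):

    COPY (
        SELECT 1 -- replace with the full export SELECT with joins from above
    ) TO PROGRAM 'clickhouse-client --query="INSERT INTO trips FORMAT TabSeparated"';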

    All the weather-related fields in the data (precipitation ... average_wind_speed) were filled with NULLs, so we will remove them from the final dataset.

    To start, we create the table on a single server; later on, we will make these tables distributed across multiple servers.

    Create the table and insert the data:

    CREATE TABLE trips_mergetree
    ENGINE = MergeTree(pickup_date, pickup_datetime, 8192)
    AS SELECT

    trip_id,
    CAST(vendor_id AS Enum8('1' = 1, '2' = 2, 'CMT' = 3, 'VTS' = 4, 'DDS' = 5, 'B02512' = 10, 'B02598' = 11, 'B02617' = 12, 'B02682' = 13, 'B02764' = 14)) AS vendor_id,
    toDate(pickup_datetime) AS pickup_date,
    ifNull(pickup_datetime, toDateTime(0)) AS pickup_datetime,
    toDate(dropoff_datetime) AS dropoff_date,
    ifNull(dropoff_datetime, toDateTime(0)) AS dropoff_datetime,
    assumeNotNull(store_and_fwd_flag) IN ('Y', '1', '2') AS store_and_fwd_flag,
    assumeNotNull(rate_code_id) AS rate_code_id,
    assumeNotNull(pickup_longitude) AS pickup_longitude,
    assumeNotNull(pickup_latitude) AS pickup_latitude,
    assumeNotNull(dropoff_longitude) AS dropoff_longitude,
    assumeNotNull(dropoff_latitude) AS dropoff_latitude,
    assumeNotNull(passenger_count) AS passenger_count,
    assumeNotNull(trip_distance) AS trip_distance,
    assumeNotNull(fare_amount) AS fare_amount,
    assumeNotNull(extra) AS extra,
    assumeNotNull(mta_tax) AS mta_tax,
    assumeNotNull(tip_amount) AS tip_amount,
    assumeNotNull(tolls_amount) AS tolls_amount,
    assumeNotNull(ehail_fee) AS ehail_fee,
    assumeNotNull(improvement_surcharge) AS improvement_surcharge,
    assumeNotNull(total_amount) AS total_amount,
    CAST((assumeNotNull(payment_type) AS pt) IN ('CSH', 'CASH', 'Cash', 'CAS', 'Cas', '1') ? 'CSH' : (pt IN ('CRD', 'Credit', 'Cre', 'CRE', 'CREDIT', '2') ? 'CRE' : (pt IN ('NOC', 'No Charge', 'No', '3') ? 'NOC' : (pt IN ('DIS', 'Dispute', 'Dis', '4') ? 'DIS' : 'UNK'))) AS Enum8('CSH' = 1, 'CRE' = 2, 'UNK' = 0, 'NOC' = 3, 'DIS' = 4)) AS payment_type_,
    assumeNotNull(trip_type) AS trip_type,
    ifNull(toFixedString(unhex(pickup), 25), toFixedString('',25))AS pickup,
    ifNull(toFixedString(unhex(dropoff),25), toFixedString('',25))AS dropoff,
    CAST(assumeNotNull(cab_type)AS Enum8('yellow'=1,'green'=2,'uber'=3))AS cab_type,

    assumeNotNull(pickup_nyct2010_gid)AS pickup_nyct2010_gid,
    toFloat32(ifNull(pickup_ctlabel,'0'))AS pickup_ctlabel,
    assumeNotNull(pickup_borocode)AS pickup_borocode,
    CAST(assumeNotNull(pickup_boroname)AS Enum8('Manhattan'=1,'Queens'=4,'Brooklyn'=3,''=0,'Bronx'=2,'Staten Island'=5))AS pickup_boroname,
    toFixedString(ifNull(pickup_ct2010,'000000'),6)AS pickup_ct2010,
    toFixedString(ifNull(pickup_boroct2010,'0000000'),7)AS pickup_boroct2010,
    CAST(assumeNotNull(ifNull(pickup_cdeligibil,' '))AS Enum8(' '=0,'E'=1,'I'=2))AS pickup_cdeligibil,
    toFixedString(ifNull(pickup_ntacode,'0000'),4)AS pickup_ntacode,

    CAST(assumeNotNull(pickup_ntaname)AS Enum16(''=0,'Airport'=1,'Allerton-Pelham Gardens'=2,'Annadale-Huguenot-Prince\'s Bay-Eltingville'=3,'Arden Heights'=4,'Astoria'=5,'Auburndale'=6,'Baisley Park'=7,'Bath Beach'=8,'Battery Park City-Lower Manhattan'=9,'Bay Ridge'=10,'Bayside-Bayside Hills'=11,'Bedford'=12,'Bedford Park-Fordham North'=13,'Bellerose'=14,'Belmont'=15,'Bensonhurst East'=16,'Bensonhurst West'=17,'Borough Park'=18,'Breezy Point-Belle Harbor-Rockaway Park-Broad Channel'=19,'Briarwood-Jamaica Hills'=20,'Brighton Beach'=21,'Bronxdale'=22,'Brooklyn Heights-Cobble Hill'=23,'Brownsville'=24,'Bushwick North'=25,'Bushwick South'=26,'Cambria Heights'=27,'Canarsie'=28,'Carroll Gardens-Columbia Street-Red Hook'=29,'Central Harlem North-Polo Grounds'=30,'Central Harlem South'=31,'Charleston-Richmond Valley-Tottenville'=32,'Chinatown'=33,'Claremont-Bathgate'=34,'Clinton'=35,'Clinton Hill'=36,'Co-op City'=37,'College Point'=38,'Corona'=39,'Crotona Park East'=40,'Crown Heights North'=41,'Crown Heights South'=42,'Cypress Hills-City Line'=43,'DUMBO-Vinegar Hill-Downtown Brooklyn-Boerum Hill'=44,'Douglas Manor-Douglaston-Little Neck'=45,'Dyker Heights'=46,'East Concourse-Concourse Village'=47,'East Elmhurst'=48,'East Flatbush-Farragut'=49,'East Flushing'=50,'East Harlem North'=51,'East Harlem South'=52,'East New York'=53,'East New York (Pennsylvania Ave)'=54,'East Tremont'=55,'East Village'=56,'East Williamsburg'=57,'Eastchester-Edenwald-Baychester'=58,'Elmhurst'=59,'Elmhurst-Maspeth'=60,'Erasmus'=61,'Far Rockaway-Bayswater'=62,'Flatbush'=63,'Flatlands'=64,'Flushing'=65,'Fordham South'=66,'Forest Hills'=67,'Fort Greene'=68,'Fresh Meadows-Utopia'=69,'Ft. Totten-Bay Terrace-Clearview'=70,'Georgetown-Marine Park-Bergen Beach-Mill Basin'=71,'Glen Oaks-Floral Park-New Hyde Park'=72,'Glendale'=73,'Gramercy'=74,
    'Grasmere-Arrochar-Ft. Wadsworth'=75,'Gravesend'=76,'Great Kills'=77,'Greenpoint'=78,'Grymes Hill-Clifton-Fox Hills'=79,'Hamilton Heights'=80,'Hammels-Arverne-Edgemere'=81,'Highbridge'=82,'Hollis'=83,'Homecrest'=84,'Hudson Yards-Chelsea-Flatiron-Union Square'=85,'Hunters Point-Sunnyside-West Maspeth'=86,'Hunts Point'=87,'Jackson Heights'=88,'Jamaica'=89,'Jamaica Estates-Holliswood'=90,'Kensington-Ocean Parkway'=91,'Kew Gardens'=92,'Kew Gardens Hills'=93,'Kingsbridge Heights'=94,'Laurelton'=95,'Lenox Hill-Roosevelt Island'=96,'Lincoln Square'=97,'Lindenwood-Howard Beach'=98,'Longwood'=99,'Lower East Side'=100,'Madison'=101,'Manhattanville'=102,'Marble Hill-Inwood'=103,'Mariner\'s Harbor-Arlington-Port Ivory-Graniteville'=104,'Maspeth'=105,'Melrose South-Mott Haven North'=106,'Middle Village'=107,'Midtown-Midtown South'=108,'Midwood'=109,'Morningside Heights'=110,'Morrisania-Melrose'=111,'Mott Haven-Port Morris'=112,'Mount Hope'=113,'Murray Hill'=114,'Murray Hill-Kips Bay'=115,'New Brighton-Silver Lake'=116,'New Dorp-Midland Beach'=117,'New Springville-Bloomfield-Travis'=118,'North Corona'=119,'North Riverdale-Fieldston-Riverdale'=120,'North Side-South Side'=121,'Norwood'=122,'Oakland Gardens'=123,'Oakwood-Oakwood Beach'=124,'Ocean Hill'=125,'Ocean Parkway South'=126,'Old Astoria'=127,'Old Town-Dongan Hills-South Beach'=128,'Ozone Park'=129,'Park Slope-Gowanus'=130,'Parkchester'=131,'Pelham Bay-Country Club-City Island'=132,'Pelham Parkway'=133,'Pomonok-Flushing Heights-Hillcrest'=134,'Port Richmond'=135,'Prospect Heights'=136,'Prospect Lefferts Gardens-Wingate'=137,'Queens Village'=138,'Queensboro Hill'=139,'Queensbridge-Ravenswood-Long Island City'=140,'Rego Park'=141,'Richmond Hill'=142,'Ridgewood'=143,'Rikers Island'=144,'Rosedale'=145,'Rossville-Woodrow'=146,'Rugby-Remsen Village'=147,'Schuylerville-Throgs Neck-Edgewater Park'=148,'Seagate-Coney Island'=149,'Sheepshead Bay-Gerritsen Beach-Manhattan Beach'=150,'SoHo-TriBeCa-Civic Center-Little Italy'=151,'Soundview-Bruckner'=152,'Soundview-Castle Hill-Clason Point-Harding Park'=153,'South Jamaica'=154,'South Ozone Park'=155,'Springfield Gardens North'=156,'Springfield Gardens South-Brookville'=157,'Spuyten Duyvil-Kingsbridge'=158,'St. Albans'=159,'Stapleton-Rosebank'=160,'Starrett City'=161,'Steinway'=162,'Stuyvesant Heights'=163,'Stuyvesant Town-Cooper Village'=164,'Sunset Park East'=165,'Sunset Park West'=166,'Todt Hill-Emerson Hill-Heartland Village-Lighthouse Hill'=167,'Turtle Bay-East Midtown'=168,'University Heights-Morris Heights'=169,'Upper East Side-Carnegie Hill'=170,'Upper West Side'=171,'Van Cortlandt Village'=172,'Van Nest-Morris Park-Westchester Square'=173,'Washington Heights North'=174,'Washington Heights South'=175,'West Brighton'=176,'West Concourse'=177,'West Farms-Bronx River'=178,'West New Brighton-New Brighton-St. George'=179,'West Village'=180,'Westchester-Unionport'=181,'Westerleigh'=182,'Whitestone'=183,'Williamsbridge-Olinville'=184,'Williamsburg'=185,'Windsor Terrace'=186,'Woodhaven'=187,'Woodlawn-Wakefield'=188,'Woodside'=189,'Yorkville'=190,'park-cemetery-etc-Bronx'=191,'park-cemetery-etc-Brooklyn'=192,'park-cemetery-etc-Manhattan'=193,'park-cemetery-etc-Queens'=194,'park-cemetery-etc-Staten Island'=195))AS pickup_ntaname,

    toUInt16(ifNull(pickup_puma,'0'))AS pickup_puma,

    assumeNotNull(dropoff_nyct2010_gid)AS dropoff_nyct2010_gid,
    toFloat32(ifNull(dropoff_ctlabel,'0'))AS dropoff_ctlabel,
    assumeNotNull(dropoff_borocode)AS dropoff_borocode,
    CAST(assumeNotNull(dropoff_boroname)AS Enum8('Manhattan'=1,'Queens'=4,'Brooklyn'=3,''=0,'Bronx'=2,'Staten Island'=5))AS dropoff_boroname,
    toFixedString(ifNull(dropoff_ct2010,'000000'),6)AS dropoff_ct2010,
    toFixedString(ifNull(dropoff_boroct2010,'0000000'),7)AS dropoff_boroct2010,
    CAST(assumeNotNull(ifNull(dropoff_cdeligibil,' '))AS Enum8(' '=0,'E'=1,'I'=2))AS dropoff_cdeligibil,
    toFixedString(ifNull(dropoff_ntacode,'0000'),4)AS dropoff_ntacode,

    CAST(assumeNotNull(dropoff_ntaname)AS Enum16(''=0,'Airport'=1,'Allerton-Pelham Gardens'=2,'Annadale-Huguenot-Prince\'s Bay-Eltingville'=3,'Arden Heights'=4,'Astoria'=5,'Auburndale'=6,'Baisley Park'=7,'Bath Beach'=8,'Battery Park City-Lower Manhattan'=9,'Bay Ridge'=10,'Bayside-Bayside Hills'=11,'Bedford'=12,'Bedford Park-Fordham North'=13,'Bellerose'=14,'Belmont'=15,'Bensonhurst East'=16,'Bensonhurst West'=17,'Borough Park'=18,'Breezy Point-Belle Harbor-Rockaway Park-Broad Channel'=19,'Briarwood-Jamaica Hills'=20,'Brighton Beach'=21,'Bronxdale'=22,'Brooklyn Heights-Cobble Hill'=23,'Brownsville'=24,'Bushwick North'=25,'Bushwick South'=26,'Cambria Heights'=27,'Canarsie'=28,'Carroll Gardens-Columbia Street-Red Hook'=29,'Central Harlem North-Polo Grounds'=30,'Central Harlem South'=31,'Charleston-Richmond Valley-Tottenville'=32,'Chinatown'=33,'Claremont-Bathgate'=34,'Clinton'=35,'Clinton Hill'=36,'Co-op City'=37,'College Point'=38,'Corona'=39,'Crotona Park East'=40,'Crown Heights North'=41,'Crown Heights South'=42,'Cypress Hills-City Line'=43,'DUMBO-Vinegar Hill-Downtown Brooklyn-Boerum Hill'=44,'Douglas Manor-Douglaston-Little Neck'=45,'Dyker Heights'=46,'East Concourse-Concourse Village'=47,'East Elmhurst'=48,'East Flatbush-Farragut'=49,'East Flushing'=50,'East Harlem North'=51,'East Harlem South'=52,'East New York'=53,'East New York (Pennsylvania Ave)'=54,'East Tremont'=55,'East Village'=56,'East Williamsburg'=57,'Eastchester-Edenwald-Baychester'=58,'Elmhurst'=59,'Elmhurst-Maspeth'=60,'Erasmus'=61,'Far Rockaway-Bayswater'=62,'Flatbush'=63,'Flatlands'=64,'Flushing'=65,'Fordham South'=66,'Forest Hills'=67,'Fort Greene'=68,'Fresh Meadows-Utopia'=69,'Ft. Totten-Bay Terrace-Clearview'=70,'Georgetown-Marine Park-Bergen Beach-Mill Basin'=71,'Glen Oaks-Floral Park-New Hyde Park'=72,'Glendale'=73,'Gramercy'=74,
    'Grasmere-Arrochar-Ft. Wadsworth'=75,'Gravesend'=76,'Great Kills'=77,'Greenpoint'=78,'Grymes Hill-Clifton-Fox Hills'=79,'Hamilton Heights'=80,'Hammels-Arverne-Edgemere'=81,'Highbridge'=82,'Hollis'=83,'Homecrest'=84,'Hudson Yards-Chelsea-Flatiron-Union Square'=85,'Hunters Point-Sunnyside-West Maspeth'=86,'Hunts Point'=87,'Jackson Heights'=88,'Jamaica'=89,'Jamaica Estates-Holliswood'=90,'Kensington-Ocean Parkway'=91,'Kew Gardens'=92,'Kew Gardens Hills'=93,'Kingsbridge Heights'=94,'Laurelton'=95,'Lenox Hill-Roosevelt Island'=96,'Lincoln Square'=97,'Lindenwood-Howard Beach'=98,'Longwood'=99,'Lower East Side'=100,'Madison'=101,'Manhattanville'=102,'Marble Hill-Inwood'=103,'Mariner\'s Harbor-Arlington-Port Ivory-Graniteville'=104,'Maspeth'=105,'Melrose South-Mott Haven North'=106,'Middle Village'=107,'Midtown-Midtown South'=108,'Midwood'=109,'Morningside Heights'=110,'Morrisania-Melrose'=111,'Mott Haven-Port Morris'=112,'Mount Hope'=113,'Murray Hill'=114,'Murray Hill-Kips Bay'=115,'New Brighton-Silver Lake'=116,'New Dorp-Midland Beach'=117,'New Springville-Bloomfield-Travis'=118,'North Corona'=119,'North Riverdale-Fieldston-Riverdale'=120,'North Side-South Side'=121,'Norwood'=122,'Oakland Gardens'=123,'Oakwood-Oakwood Beach'=124,'Ocean Hill'=125,'Ocean Parkway South'=126,'Old Astoria'=127,'Old Town-Dongan Hills-South Beach'=128,'Ozone Park'=129,'Park Slope-Gowanus'=130,'Parkchester'=131,'Pelham Bay-Country Club-City Island'=132,'Pelham Parkway'=133,'Pomonok-Flushing Heights-Hillcrest'=134,'Port Richmond'=135,'Prospect Heights'=136,'Prospect Lefferts Gardens-Wingate'=137,'Queens Village'=138,'Queensboro Hill'=139,'Queensbridge-Ravenswood-Long Island City'=140,'Rego Park'=141,'Richmond Hill'=142,'Ridgewood'=143,'Rikers Island'=144,'Rosedale'=145,'Rossville-Woodrow'=146,'Rugby-Remsen Village'=147,'Schuylerville-Throgs Neck-Edgewater Park'=148,'Seagate-Coney Island'=149,'Sheepshead Bay-Gerritsen Beach-Manhattan Beach'=150,'SoHo-TriBeCa-Civic Center-Little Italy'=151,'Soundview-Bruckner'=152,'Soundview-Castle Hill-Clason Point-Harding Park'=153,'South Jamaica'=154,'South Ozone Park'=155,'Springfield Gardens North'=156,'Springfield Gardens South-Brookville'=157,'Spuyten Duyvil-Kingsbridge'=158,'St. Albans'=159,'Stapleton-Rosebank'=160,'Starrett City'=161,'Steinway'=162,'Stuyvesant Heights'=163,'Stuyvesant Town-Cooper Village'=164,'Sunset Park East'=165,'Sunset Park West'=166,'Todt Hill-Emerson Hill-Heartland Village-Lighthouse Hill'=167,'Turtle Bay-East Midtown'=168,'University Heights-Morris Heights'=169,'Upper East Side-Carnegie Hill'=170,'Upper West Side'=171,'Van Cortlandt Village'=172,'Van Nest-Morris Park-Westchester Square'=173,'Washington Heights North'=174,'Washington Heights South'=175,'West Brighton'=176,'West Concourse'=177,'West Farms-Bronx River'=178,'West New Brighton-New Brighton-St. George'=179,'West Village'=180,'Westchester-Unionport'=181,'Westerleigh'=182,'Whitestone'=183,'Williamsbridge-Olinville'=184,'Williamsburg'=185,'Windsor Terrace'=186,'Woodhaven'=187,'Woodlawn-Wakefield'=188,'Woodside'=189,'Yorkville'=190,'park-cemetery-etc-Bronx'=191,'park-cemetery-etc-Brooklyn'=192,'park-cemetery-etc-Manhattan'=193,'park-cemetery-etc-Queens'=194,'park-cemetery-etc-Staten Island'=195))AS dropoff_ntaname,

    toUInt16(ifNull(dropoff_puma,'0'))AS dropoff_puma

    FROM trips

    This takes 3030 seconds at a speed of about 428,000 rows per second. To load the data faster, you can create the table with the Log engine instead of MergeTree; in that case, the load completes in under 200 seconds.

    This table uses 126 GB of disk space.

    SELECT formatReadableSize(sum(bytes)) FROM system.parts WHERE table = 'trips_mergetree' AND active
    ┌─formatReadableSize(sum(bytes))─┐
    │ 126.18 GiB │
    └────────────────────────────────┘

    Among other things, you can run an OPTIMIZE query on the MergeTree table. It is not required, though; everything will be fine without it.
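
    For example, following the same pattern used for the Metrica tables above:

    OPTIMIZE TABLE trips_mergetree FINAL;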

    Download of Prepared Partitions

    $ curl -O https://datasets.clickhouse.com/trips_mergetree/partitions/trips_mergetree.tar
    $ tar xvf trips_mergetree.tar -C /var/lib/clickhouse # path to ClickHouse data directory
    $ # check permissions of unpacked data, fix if required
    $ sudo service clickhouse-server restart
    $ clickhouse-client --query "select count(*) from datasets.trips_mergetree"

    !!! info "Info" If you run the SQL queries described below against the downloaded partitions, you have to use the full table name, datasets.trips_mergetree.

    Results on a Single Server

    Q1:

    SELECT cab_type, count(*) FROM trips_mergetree GROUP BY cab_type

    0.490 seconds.

    Q2:

    SELECT passenger_count, avg(total_amount) FROM trips_mergetree GROUP BY passenger_count

    1.224 seconds.

    Q3:

    SELECT passenger_count, toYear(pickup_date) AS year, count(*) FROM trips_mergetree GROUP BY passenger_count, year

    2.104 seconds.

    Q4:

    SELECT passenger_count, toYear(pickup_date) AS year, round(trip_distance) AS distance, count(*)
    FROM trips_mergetree
    GROUP BY passenger_count, year, distance
    ORDER BY year, count(*) DESC

    3.593 seconds.

    The following server configuration was used:

    Two Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz, 16 physical cores in total, 128 GiB RAM, 8x6 TB HDs in RAID-5.

    Execution time is the best of three runs. Starting from the second run, queries read data from the file system cache; no further caching occurs, and the data is read out and processed on every run.

    Creating the tables on a cluster of three servers:

    On each server, run:

    CREATE TABLE default.trips_mergetree_third ( trip_id UInt32, vendor_id Enum8('1' = 1, '2' = 2, 'CMT' = 3, 'VTS' = 4, 'DDS' = 5, 'B02512' = 10, 'B02598' = 11, 'B02617' = 12, 'B02682' = 13, 'B02764' = 14), pickup_date Date, pickup_datetime DateTime, dropoff_date Date, dropoff_datetime DateTime, store_and_fwd_flag UInt8, rate_code_id UInt8, pickup_longitude Float64, pickup_latitude Float64, dropoff_longitude Float64, dropoff_latitude Float64, passenger_count UInt8, trip_distance Float64, fare_amount Float32, extra Float32, mta_tax Float32, tip_amount Float32, tolls_amount Float32, ehail_fee Float32, improvement_surcharge Float32, total_amount Float32, payment_type_ Enum8('UNK' = 0, 'CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4), trip_type UInt8, pickup FixedString(25), dropoff FixedString(25), cab_type Enum8('yellow' = 1, 'green' = 2, 'uber' = 3), pickup_nyct2010_gid UInt8, pickup_ctlabel Float32, pickup_borocode UInt8, pickup_boroname Enum8('' = 0, 'Manhattan' = 1, 'Bronx' = 2, 'Brooklyn' = 3, 'Queens' = 4, 'Staten Island' = 5), pickup_ct2010 FixedString(6), pickup_boroct2010 FixedString(7), pickup_cdeligibil Enum8(' ' = 0, 'E' = 1, 'I' = 2), pickup_ntacode FixedString(4), pickup_ntaname Enum16('' = 0, 'Airport' = 1, 'Allerton-Pelham Gardens' = 2, 'Annadale-Huguenot-Prince\'s Bay-Eltingville' = 3, 'Arden Heights' = 4, 'Astoria' = 5, 'Auburndale' = 6, 'Baisley Park' = 7, 'Bath Beach' = 8, 'Battery Park City-Lower Manhattan' = 9, 'Bay Ridge' = 10, 'Bayside-Bayside Hills' = 11, 'Bedford' = 12, 'Bedford Park-Fordham North' = 13, 'Bellerose' = 14, 'Belmont' = 15, 'Bensonhurst East' = 16, 'Bensonhurst West' = 17, 'Borough Park' = 18, 'Breezy Point-Belle Harbor-Rockaway Park-Broad Channel' = 19, 'Briarwood-Jamaica Hills' = 20, 'Brighton Beach' = 21, 'Bronxdale' = 22, 'Brooklyn Heights-Cobble Hill' = 23, 'Brownsville' = 24, 'Bushwick North' = 25, 'Bushwick South' = 26, 'Cambria Heights' = 27, 'Canarsie' = 28, 'Carroll Gardens-Columbia Street-Red Hook' = 29, 'Central Harlem North-Polo Grounds' = 30, 'Central Harlem South'=31,'Charleston-Richmond Valley-Tottenville'=32,'Chinatown'=33,'Claremont-Bathgate'=34,'Clinton'=35,'Clinton Hill'=36,'Co-op City'=37,'College Point'=38,'Corona'=39,'Crotona Park East'=40,'Crown Heights North'=41,'Crown Heights South'=42,'Cypress Hills-City Line'=43,'DUMBO-Vinegar Hill-Downtown Brooklyn-Boerum Hill'=44,'Douglas Manor-Douglaston-Little Neck'=45,'Dyker Heights'=46,'East Concourse-Concourse Village'=47,'East Elmhurst'=48,'East Flatbush-Farragut'=49,'East Flushing'=50,'East Harlem North'=51,'East Harlem South'=52,'East New York'=53,'East New York (Pennsylvania Ave)'=54,'East Tremont'=55,'East Village'=56,'East Williamsburg'=57,'Eastchester-Edenwald-Baychester'=58,'Elmhurst'=59,'Elmhurst-Maspeth'=60,'Erasmus'=61,'Far Rockaway-Bayswater'=62,'Flatbush'=63,'Flatlands'=64,'Flushing'=65,'Fordham South'=66,'Forest Hills'=67,'Fort Greene'=68,'Fresh Meadows-Utopia'=69,'Ft. Totten-Bay Terrace-Clearview'=70,'Georgetown-Marine Park-Bergen Beach-Mill Basin'=71,'Glen Oaks-Floral Park-New Hyde Park'=72,'Glendale'=73,'Gramercy'=74,
    'Grasmere-Arrochar-Ft. Wadsworth'=75,'Gravesend'=76,'Great Kills'=77,'Greenpoint'=78,'Grymes Hill-Clifton-Fox Hills'=79,'Hamilton Heights'=80,'Hammels-Arverne-Edgemere'=81,'Highbridge'=82,'Hollis'=83,'Homecrest'=84,'Hudson Yards-Chelsea-Flatiron-Union Square'=85,'Hunters Point-Sunnyside-West Maspeth'=86,'Hunts Point'=87,'Jackson Heights'=88,'Jamaica'=89,'Jamaica Estates-Holliswood'=90,'Kensington-Ocean Parkway'=91,'Kew Gardens'=92,'Kew Gardens Hills'=93,'Kingsbridge Heights'=94,'Laurelton'=95,'Lenox Hill-Roosevelt Island'=96,'Lincoln Square'=97,'Lindenwood-Howard Beach'=98,'Longwood'=99,'Lower East Side'=100,'Madison'=101,'Manhattanville'=102,'Marble Hill-Inwood'=103,'Mariner\'s Harbor-Arlington-Port Ivory-Graniteville'=104,'Maspeth'=105,'Melrose South-Mott Haven North'=106,'Middle Village'=107,'Midtown-Midtown South'=108,'Midwood'=109,'Morningside Heights'=110,'Morrisania-Melrose'=111,'Mott Haven-Port Morris'=112,'Mount Hope'=113,'Murray Hill'=114,'Murray Hill-Kips Bay'=115,'New Brighton-Silver Lake'=116,'New Dorp-Midland Beach'=117,'New Springville-Bloomfield-Travis'=118,'North Corona'=119,'North Riverdale-Fieldston-Riverdale'=120,'North Side-South Side'=121,'Norwood'=122,'Oakland Gardens'=123,'Oakwood-Oakwood Beach'=124,'Ocean Hill'=125,'Ocean Parkway South'=126,'Old Astoria'=127,'Old Town-Dongan Hills-South Beach'=128,'Ozone Park'=129,'Park Slope-Gowanus'=130,'Parkchester'=131,'Pelham Bay-Country Club-City Island'=132,'Pelham Parkway'=133,'Pomonok-Flushing Heights-Hillcrest'=134,'Port Richmond'=135,'Prospect Heights'=136,'Prospect Lefferts Gardens-Wingate'=137,'Queens Village'=138,'Queensboro Hill'=139,'Queensbridge-Ravenswood-Long Island City'=140,'Rego Park'=141,'Richmond Hill'=142,'Ridgewood'=143,'Rikers Island'=144,'Rosedale'=145,'Rossville-Woodrow'=146,'Rugby-Remsen Village'=147,'Schuylerville-Throgs Neck-Edgewater Park'=148,'Seagate-Coney Island'=149,'Sheepshead Bay-Gerritsen Beach-Manhattan Beach'=150,'SoHo-TriBeCa-Civic Center-Little Italy'=151,'Soundview-Bruckner'=152,'Soundview-Castle Hill-Clason Point-Harding Park'=153,'South Jamaica'=154,'South Ozone Park'=155,'Springfield Gardens North'=156,'Springfield Gardens South-Brookville'=157,'Spuyten Duyvil-Kingsbridge'=158,'St. Albans'=159,'Stapleton-Rosebank'=160,'Starrett City'=161,'Steinway'=162,'Stuyvesant Heights'=163,'Stuyvesant Town-Cooper Village'=164,'Sunset Park East'=165,'Sunset Park West'=166,'Todt Hill-Emerson Hill-Heartland Village-Lighthouse Hill'=167,'Turtle Bay-East Midtown'=168,'University Heights-Morris Heights'=169,'Upper East Side-Carnegie Hill'=170,'Upper West Side'=171,'Van Cortlandt Village'=172,'Van Nest-Morris Park-Westchester Square'=173,'Washington Heights North'=174,'Washington Heights South'=175,'West Brighton'=176,'West Concourse'=177,'West Farms-Bronx River'=178,
    'West New Brighton-New Brighton-St. George'=179,'West Village'=180,'Westchester-Unionport'=181,'Westerleigh'=182,'Whitestone'=183,'Williamsbridge-Olinville'=184,'Williamsburg'=185,'Windsor Terrace'=186,'Woodhaven'=187,'Woodlawn-Wakefield'=188,'Woodside'=189,'Yorkville'=190,'park-cemetery-etc-Bronx'=191,'park-cemetery-etc-Brooklyn'=192,'park-cemetery-etc-Manhattan'=193,'park-cemetery-etc-Queens'=194,'park-cemetery-etc-Staten Island'=195), pickup_puma UInt16, dropoff_nyct2010_gid UInt8, dropoff_ctlabel Float32, dropoff_borocode UInt8, dropoff_boroname Enum8(''=0,'Manhattan'=1,'Bronx'=2,'Brooklyn'=3,'Queens'=4,'Staten Island'=5), dropoff_ct2010 FixedString(6), dropoff_boroct2010 FixedString(7), dropoff_cdeligibil Enum8(' '=0,'E'=1,'I'=2), dropoff_ntacode FixedString(4), dropoff_ntaname Enum16(''=0,'Airport'=1,'Allerton-Pelham Gardens'=2,'Annadale-Huguenot-Prince\'s Bay-Eltingville'=3,'Arden Heights'=4,'Astoria'=5,'Auburndale'=6,'Baisley Park'=7,'Bath Beach'=8,'Battery Park City-Lower Manhattan'=9,'Bay Ridge'=10,'Bayside-Bayside Hills'=11,'Bedford'=12,'Bedford Park-Fordham North'=13,'Bellerose'=14,'Belmont'=15,'Bensonhurst East'=16,'Bensonhurst West'=17,'Borough Park'=18,'Breezy Point-Belle Harbor-Rockaway Park-Broad Channel'=19,'Briarwood-Jamaica Hills'=20,'Brighton Beach'=21,'Bronxdale'=22,'Brooklyn Heights-Cobble Hill'=23,'Brownsville'=24,'Bushwick North'=25,'Bushwick South'=26,'Cambria Heights'=27,'Canarsie'=28,'Carroll Gardens-Columbia Street-Red Hook'=29,'Central Harlem North-Polo Grounds'=30,'Central Harlem South'=31,'Charleston-Richmond Valley-Tottenville'=32,'Chinatown'=33,'Claremont-Bathgate'=34,'Clinton'=35,'Clinton Hill'=36,'Co-op City'=37,'College Point'=38,'Corona'=39,'Crotona Park East'=40,'Crown Heights North'=41,'Crown Heights South'=42,'Cypress Hills-City Line'=43,'DUMBO-Vinegar Hill-Downtown Brooklyn-Boerum Hill'=44,'Douglas Manor-Douglaston-Little Neck'=45,'Dyker Heights'=46,'East Concourse-Concourse Village'=47,'East Elmhurst'=48,'East Flatbush-Farragut'=49,'East Flushing'=50,'East Harlem North'=51,'East Harlem South'=52,'East New York'=53,'East New York (Pennsylvania Ave)'=54,'East Tremont'=55,'East Village'=56,'East Williamsburg'=57,'Eastchester-Edenwald-Baychester'=58,'Elmhurst'=59,'Elmhurst-Maspeth'=60,'Erasmus'=61,'Far Rockaway-Bayswater'=62,'Flatbush'=63,'Flatlands'=64,'Flushing'=65,'Fordham South'=66,'Forest Hills'=67,'Fort Greene'=68,'Fresh Meadows-Utopia'=69,'Ft. Totten-Bay Terrace-Clearview'=70,'Georgetown-Marine Park-Bergen Beach-Mill Basin'=71,'Glen Oaks-Floral Park-New Hyde Park'=72,'Glendale'=73,'Gramercy'=74,
    'Grasmere-Arrochar-Ft. Wadsworth'=75,'Gravesend'=76,'Great Kills'=77,'Greenpoint'=78,'Grymes Hill-Clifton-Fox Hills'=79,'Hamilton Heights'=80,'Hammels-Arverne-Edgemere'=81,'Highbridge'=82,'Hollis'=83,'Homecrest'=84,'Hudson Yards-Chelsea-Flatiron-Union Square'=85,'Hunters Point-Sunnyside-West Maspeth'=86,'Hunts Point'=87,'Jackson Heights'=88,'Jamaica'=89,'Jamaica Estates-Holliswood'=90,'Kensington-Ocean Parkway'=91,'Kew Gardens'=92,'Kew Gardens Hills'=93,'Kingsbridge Heights'=94,'Laurelton'=95,'Lenox Hill-Roosevelt Island'=96,'Lincoln Square'=97,'Lindenwood-Howard Beach'=98,'Longwood'=99,'Lower East Side'=100,'Madison'=101,'Manhattanville'=102,'Marble Hill-Inwood'=103,'Mariner\'s Harbor-Arlington-Port Ivory-Graniteville'=104,'Maspeth'=105,'Melrose South-Mott Haven North'=106,'Middle Village'=107,'Midtown-Midtown South'=108,'Midwood'=109,'Morningside Heights'=110,'Morrisania-Melrose'=111,'Mott Haven-Port Morris'=112,'Mount Hope'=113,'Murray Hill'=114,'Murray Hill-Kips Bay'=115,'New Brighton-Silver Lake'=116,'New Dorp-Midland Beach'=117,'New Springville-Bloomfield-Travis'=118,'North Corona'=119,'North Riverdale-Fieldston-Riverdale'=120,'North Side-South Side'=121,'Norwood'=122,'Oakland Gardens'=123,'Oakwood-Oakwood Beach'=124,'Ocean Hill'=125,'Ocean Parkway South'=126,'Old Astoria'=127,'Old Town-Dongan Hills-South Beach'=128,'Ozone Park'=129,'Park Slope-Gowanus'=130,'Parkchester'=131,'Pelham Bay-Country Club-City Island'=132,'Pelham Parkway'=133,'Pomonok-Flushing Heights-Hillcrest'=134,'Port Richmond'=135,'Prospect Heights'=136,'Prospect Lefferts Gardens-Wingate'=137,'Queens Village'=138,'Queensboro Hill'=139,'Queensbridge-Ravenswood-Long Island City'=140,'Rego Park'=141,'Richmond Hill'=142,'Ridgewood'=143,'Rikers Island'=144,'Rosedale'=145,'Rossville-Woodrow'=146,'Rugby-Remsen Village'=147,'Schuylerville-Throgs Neck-Edgewater Park'=148,'Seagate-Coney Island'=149,'Sheepshead Bay-Gerritsen Beach-Manhattan Beach'=150,'SoHo-TriBeCa-Civic Center-Little Italy'=151,'Soundview-Bruckner'=152,'Soundview-Castle Hill-Clason Point-Harding Park'=153,'South Jamaica'=154,'South Ozone Park'=155,'Springfield Gardens North'=156,'Springfield Gardens South-Brookville'=157,'Spuyten Duyvil-Kingsbridge'=158,'St. Albans'=159,'Stapleton-Rosebank'=160,'Starrett City'=161,'Steinway'=162,'Stuyvesant Heights'=163,'Stuyvesant Town-Cooper Village'=164,'Sunset Park East'=165,'Sunset Park West'=166,'Todt Hill-Emerson Hill-Heartland Village-Lighthouse Hill'=167,'Turtle Bay-East Midtown'=168,'University Heights-Morris Heights'=169,'Upper East Side-Carnegie Hill'=170,'Upper West Side'=171,'Van Cortlandt Village'=172,'Van Nest-Morris Park-Westchester Square'=173,'Washington Heights North'=174,'Washington Heights South'=175,'West Brighton'=176,'West Concourse'=177,'West Farms-Bronx River'=178,'West New Brighton-New Brighton-St. George'=179,'West Village'=180,'Westchester-Unionport'=181,'Westerleigh'=182,'Whitestone'=183,'Williamsbridge-Olinville'=184,'Williamsburg'=185,'Windsor Terrace'=186,'Woodhaven'=187,'Woodlawn-Wakefield'=188,'Woodside'=189,'Yorkville'=190,'park-cemetery-etc-Bronx'=191,'park-cemetery-etc-Brooklyn'=192,'park-cemetery-etc-Manhattan'=193,'park-cemetery-etc-Queens'=194,'park-cemetery-etc-Staten Island'=195), dropoff_puma UInt16)ENGINE= MergeTree(pickup_date, pickup_datetime,8192)

    On the previous servers, run:

    CREATE TABLE trips_mergetree_x3 AS trips_mergetree_third ENGINE = Distributed(perftest, default, trips_mergetree_third, rand())

    Run the following query to redistribute the data (the Distributed engine arguments above are the cluster name, the database, the table, and the sharding key):

    INSERT INTO trips_mergetree_x3 SELECT * FROM trips_mergetree

    This query takes 2454 seconds to run.

    The results on the three-server cluster:

    Q1: 0.212 sec. Q2: 0.438 sec. Q3: 0.733 sec. Q4: 1.241 sec.

    No surprises here, since the queries scale linearly: for example, Q4 drops from 3.593 sec on one server to 1.241 sec on three, close to the ideal 3x speedup.

    We also ran the same queries on a cluster of 140 servers:

    Q1: 0.028 sec. Q2: 0.043 sec. Q3: 0.051 sec. Q4: 0.072 sec.

    In this case, the query processing time is determined above all by network latency.

    Summary

    Servers  Q1     Q2     Q3     Q4
    1        0.490  1.224  2.104  3.593
    3        0.212  0.438  0.733  1.241
    140      0.028  0.043  0.051  0.072

    OnTime

    The airline on-time performance data can be obtained in two ways:

    • Import from raw data
    • Download of prepared partitions

    Import from Raw Data

    Download the data:

    wget --no-check-certificate --continue https://transtats.bts.gov/PREZIP/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_{1987..2021}_{1..12}.zip

    Create the table:

    CREATE TABLE `ontime`
    (
    `Year` UInt16,
    `Quarter` UInt8,
    `Month` UInt8,
    `DayofMonth` UInt8,
    `DayOfWeek` UInt8,
    `FlightDate` Date,
    `Reporting_Airline` String,
    `DOT_ID_Reporting_Airline` Int32,
    `IATA_CODE_Reporting_Airline` String,
    `Tail_Number` String,
    `Flight_Number_Reporting_Airline` String,
    `OriginAirportID` Int32,
    `OriginAirportSeqID` Int32,
    `OriginCityMarketID` Int32,
    `Origin` FixedString(5),
    `OriginCityName` String,
    `OriginState` FixedString(2),
    `OriginStateFips` String,
    `OriginStateName` String,
    `OriginWac` Int32,
    `DestAirportID` Int32,
    `DestAirportSeqID` Int32,
    `DestCityMarketID` Int32,
    `Dest` FixedString(5),
    `DestCityName` String,
    `DestState` FixedString(2),
    `DestStateFips` String,
    `DestStateName` String,
    `DestWac` Int32,
    `CRSDepTime` Int32,
    `DepTime` Int32,
    `DepDelay` Int32,
    `DepDelayMinutes` Int32,
    `DepDel15` Int32,
    `DepartureDelayGroups` String,
    `DepTimeBlk` String,
    `TaxiOut` Int32,
    `WheelsOff` Int32,
    `WheelsOn` Int32,
    `TaxiIn` Int32,
    `CRSArrTime` Int32,
    `ArrTime` Int32,
    `ArrDelay` Int32,
    `ArrDelayMinutes` Int32,
    `ArrDel15` Int32,
    `ArrivalDelayGroups` Int32,
    `ArrTimeBlk` String,
    `Cancelled` UInt8,
    `CancellationCode` FixedString(1),
    `Diverted` UInt8,
    `CRSElapsedTime` Int32,
    `ActualElapsedTime` Int32,
    `AirTime` Nullable(Int32),
    `Flights` Int32,
    `Distance` Int32,
    `DistanceGroup` UInt8,
    `CarrierDelay` Int32,
    `WeatherDelay` Int32,
    `NASDelay` Int32,
    `SecurityDelay` Int32,
    `LateAircraftDelay` Int32,
    `FirstDepTime` String,
    `TotalAddGTime` String,
    `LongestAddGTime` String,
    `DivAirportLandings` String,
    `DivReachedDest` String,
    `DivActualElapsedTime` String,
    `DivArrDelay` String,
    `DivDistance` String,
    `Div1Airport` String,
    `Div1AirportID` Int32,
    `Div1AirportSeqID` Int32,
    `Div1WheelsOn` String,
    `Div1TotalGTime` String,
    `Div1LongestGTime` String,
    `Div1WheelsOff` String,
    `Div1TailNum` String,
    `Div2Airport` String,
    `Div2AirportID` Int32,
    `Div2AirportSeqID` Int32,
    `Div2WheelsOn` String,
    `Div2TotalGTime` String,
    `Div2LongestGTime` String,
    `Div2WheelsOff` String,
    `Div2TailNum` String,
    `Div3Airport` String,
    `Div3AirportID` Int32,
    `Div3AirportSeqID` Int32,
    `Div3WheelsOn` String,
    `Div3TotalGTime` String,
    `Div3LongestGTime` String,
    `Div3WheelsOff` String,
    `Div3TailNum` String,
    `Div4Airport` String,
    `Div4AirportID` Int32,
    `Div4AirportSeqID` Int32,
    `Div4WheelsOn` String,
    `Div4TotalGTime` String,
    `Div4LongestGTime` String,
    `Div4WheelsOff` String,
    `Div4TailNum` String,
    `Div5Airport` String,
    `Div5AirportID` Int32,
    `Div5AirportSeqID` Int32,
    `Div5WheelsOn` String,
    `Div5TotalGTime` String,
    `Div5LongestGTime` String,
    `Div5WheelsOff` String,
    `Div5TailNum` String
    ) ENGINE = MergeTree
    PARTITION BY Year
    ORDER BY (IATA_CODE_Reporting_Airline, FlightDate)
    SETTINGS index_granularity = 8192;

    Load the data (the sed command strips trailing .00 fractions so that numeric fields parse as integers):

    ls -1 *.zip | xargs -I{} -P $(nproc) bash -c "echo {}; unzip -cq {} '*.csv' | sed 's/\.00//g' | clickhouse-client --input_format_with_names_use_header=0 --query='INSERT INTO ontime FORMAT CSVWithNames'"
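
    As a quick sanity check after loading (a minimal example; the counts depend on how many of the monthly files you downloaded):

    SELECT count(*), min(Year), max(Year) FROM ontime;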

    Download of Prepared Partitions

    $ curl -O https://datasets.clickhouse.com/ontime/partitions/ontime.tar
    $ tar xvf ontime.tar -C /var/lib/clickhouse # path to ClickHouse data directory
    $ # check permissions of unpacked data, fix if required
    $ sudo service clickhouse-server restart
    $ clickhouse-client --query "select count(*) from datasets.ontime"

    !!! info "Info"
        If you want to run the SQL queries described below, you have to use the full table name, datasets.ontime.
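
    For example, you can qualify the table name explicitly, or switch the default database once in the interactive client:

    USE datasets;
    SELECT count(*) FROM ontime;

    -- equivalent to the fully qualified form:
    SELECT count(*) FROM datasets.ontime;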

    Queries:

    Q0.

    SELECT avg(c1)
    FROM
    (
    SELECT Year, Month, count(*) AS c1
    FROM ontime
    GROUP BY Year, Month
    );

    Q1. The number of flights per day of the week from 2000 to 2008

    SELECT DayOfWeek, count(*) AS c
    FROM ontime
    WHERE Year>=2000 AND Year<=2008
    GROUP BY DayOfWeek
    ORDER BY c DESC;

    Q2. The number of flights delayed by more than 10 minutes, grouped by day of the week, from 2000 to 2008.

    SELECT DayOfWeek, count(*) AS c
    FROM ontime
    WHERE DepDelay>10 AND Year>=2000 AND Year<=2008
    GROUP BY DayOfWeek
    ORDER BY c DESC;

    Q3. The number of delays of more than 10 minutes, by airport, from 2000 to 2008

    SELECT Origin, count(*) AS c
    FROM ontime
    WHERE DepDelay>10 AND Year>=2000 AND Year<=2008
    GROUP BY Origin
    ORDER BY c DESC
    LIMIT 10;

    Q4. The number of delays of more than 10 minutes, by carrier, for 2007

    SELECT IATA_CODE_Reporting_Airline AS Carrier, count(*)
    FROM ontime
    WHERE DepDelay>10 AND Year=2007
    GROUP BY Carrier
    ORDER BY count(*) DESC;

    Q5. The percentage of flights delayed by more than 10 minutes, by carrier, for 2007

    SELECT Carrier, c, c2, c*100/c2 as c3
    FROM
    (
    SELECT
    IATA_CODE_Reporting_Airline AS Carrier,
    count(*) AS c
    FROM ontime
    WHERE DepDelay>10
    AND Year=2007
    GROUP BY Carrier
    ) q
    JOIN
    (
    SELECT
    IATA_CODE_Reporting_Airline AS Carrier,
    count(*) AS c2
    FROM ontime
    WHERE Year=2007
    GROUP BY Carrier
    ) qq USING Carrier
    ORDER BY c3 DESC;

    A better version of the same query (in ClickHouse, DepDelay>10 evaluates to 0 or 1, so avg(DepDelay>10) is directly the fraction of delayed flights):

    SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(DepDelay>10)*100 AS c3
    FROM ontime
    WHERE Year=2007
    GROUP BY Carrier
    ORDER BY c3 DESC

    Q6. The same as the previous query, but for a broader range of years: 2000 to 2008

    SELECT Carrier, c, c2, c*100/c2 as c3
    FROM
    (
    SELECT
    IATA_CODE_Reporting_Airline AS Carrier,
    count(*) AS c
    FROM ontime
    WHERE DepDelay>10
    AND Year>=2000 AND Year<=2008
    GROUP BY Carrier
    ) q
    JOIN
    (
    SELECT
    IATA_CODE_Reporting_Airline AS Carrier,
    count(*) AS c2
    FROM ontime
    WHERE Year>=2000 AND Year<=2008
    GROUP BY Carrier
    ) qq USING Carrier
    ORDER BY c3 DESC;

    A better version of the same query:

    SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(DepDelay>10)*100 AS c3
    FROM ontime
    WHERE Year>=2000 AND Year<=2008
    GROUP BY Carrier
    ORDER BY c3 DESC;

    Q7. The percentage of flights delayed by more than 10 minutes, per year

    SELECT Year, c1/c2
    FROM
    (
    select
    Year,
    count(*)*100 as c1
    from ontime
    WHERE DepDelay>10
    GROUP BY Year
    ) q
    JOIN
    (
    select
    Year,
    count(*) as c2
    from ontime
    GROUP BY Year
    ) qq USING (Year)
    ORDER BY Year;

    A better version of the same query:

    SELECT Year, avg(DepDelay>10)*100
    FROM ontime
    GROUP BY Year
    ORDER BY Year;

    Q8. The most popular destinations by the number of directly connected cities, from 2000 to 2010

    SELECT DestCityName, uniqExact(OriginCityName) AS u
    FROM ontime
    WHERE Year >= 2000 and Year <= 2010
    GROUP BY DestCityName
    ORDER BY u DESC LIMIT 10;

    Q9.

    SELECT Year, count(*) AS c1
    FROM ontime
    GROUP BY Year;

    Q10.

    SELECT
    min(Year), max(Year), IATA_CODE_Reporting_Airline AS Carrier, count(*) AS cnt,
    sum(ArrDelayMinutes>30) AS flights_delayed,
    round(sum(ArrDelayMinutes>30)/count(*),2) AS rate
    FROM ontime
    WHERE
    DayOfWeek NOT IN (6,7) AND OriginState NOT IN ('AK', 'HI', 'PR', 'VI')
    AND DestState NOT IN ('AK', 'HI', 'PR', 'VI')
    AND FlightDate < '2010-01-01'
    GROUP by Carrier
    HAVING cnt>100000 and max(Year)>1990
    ORDER by rate DESC
    LIMIT 1000;

    Bonus:

    SELECT avg(cnt)
    FROM
    (
    SELECT Year,Month,count(*) AS cnt
    FROM ontime
    WHERE DepDel15=1
    GROUP BY Year,Month
    );

    SELECT avg(c1) FROM
    (
    SELECT Year,Month,count(*) AS c1
    FROM ontime
    GROUP BY Year,Month
    );

    SELECT DestCityName, uniqExact(OriginCityName) AS u
    FROM ontime
    GROUP BY DestCityName
    ORDER BY u DESC
    LIMIT 10;

    SELECT OriginCityName, DestCityName, count() AS c
    FROM ontime
    GROUP BY OriginCityName, DestCityName
    ORDER BY c DESC
    LIMIT 10;

    SELECT OriginCityName, count() AS c
    FROM ontime
    GROUP BY OriginCityName
    ORDER BY c DESC
    LIMIT 10;

    This performance test was created by Vadim Tkachenko.

    Brown University Benchmark

    MgBench is a new analytical benchmark for machine-generated log data, created by Andrew Crotty.

    Download the data:

    wget https://datasets.clickhouse.com/mgbench{1..3}.csv.xz

    Unpack the data:

    xz -v -d mgbench{1..3}.csv.xz

    Create tables:

    CREATE DATABASE mgbench;


    CREATE TABLE mgbench.logs1 (
    log_time DateTime,
    machine_name LowCardinality(String),
    machine_group LowCardinality(String),
    cpu_idle Nullable(Float32),
    cpu_nice Nullable(Float32),
    cpu_system Nullable(Float32),
    cpu_user Nullable(Float32),
    cpu_wio Nullable(Float32),
    disk_free Nullable(Float32),
    disk_total Nullable(Float32),
    part_max_used Nullable(Float32),
    load_fifteen Nullable(Float32),
    load_five Nullable(Float32),
    load_one Nullable(Float32),
    mem_buffers Nullable(Float32),
    mem_cached Nullable(Float32),
    mem_free Nullable(Float32),
    mem_shared Nullable(Float32),
    swap_free Nullable(Float32),
    bytes_in Nullable(Float32),
    bytes_out Nullable(Float32)
    )
    ENGINE = MergeTree()
    ORDER BY (machine_group, machine_name, log_time);


    CREATE TABLE mgbench.logs2 (
    log_time DateTime,
    client_ip IPv4,
    request String,
    status_code UInt16,
    object_size UInt64
    )
    ENGINE = MergeTree()
    ORDER BY log_time;


    CREATE TABLE mgbench.logs3 (
    log_time DateTime64,
    device_id FixedString(15),
    device_name LowCardinality(String),
    device_type LowCardinality(String),
    device_floor UInt8,
    event_type LowCardinality(String),
    event_unit FixedString(1),
    event_value Nullable(Float32)
    )
    ENGINE = MergeTree()
    ORDER BY (event_type, log_time);

    Insert data:

    clickhouse-client --query "INSERT INTO mgbench.logs1 FORMAT CSVWithNames" < mgbench1.csv
    clickhouse-client --query "INSERT INTO mgbench.logs2 FORMAT CSVWithNames" < mgbench2.csv
    clickhouse-client --query "INSERT INTO mgbench.logs3 FORMAT CSVWithNames" < mgbench3.csv
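
    As a quick sanity check that all three tables were populated (a minimal example; the exact counts depend on the downloaded files):

    SELECT
        (SELECT count() FROM mgbench.logs1) AS logs1_rows,
        (SELECT count() FROM mgbench.logs2) AS logs2_rows,
        (SELECT count() FROM mgbench.logs3) AS logs3_rows;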

    Run benchmark queries:

    -- Q1.1: What is the CPU/network utilization for each web server since midnight?

    SELECT machine_name,
    MIN(cpu) AS cpu_min,
    MAX(cpu) AS cpu_max,
    AVG(cpu) AS cpu_avg,
    MIN(net_in) AS net_in_min,
    MAX(net_in) AS net_in_max,
    AVG(net_in) AS net_in_avg,
    MIN(net_out) AS net_out_min,
    MAX(net_out) AS net_out_max,
    AVG(net_out) AS net_out_avg
    FROM (
    SELECT machine_name,
    COALESCE(cpu_user, 0.0) AS cpu,
    COALESCE(bytes_in, 0.0) AS net_in,
    COALESCE(bytes_out, 0.0) AS net_out
    FROM logs1
    WHERE machine_name IN ('anansi','aragog','urd')
    AND log_time >= TIMESTAMP '2017-01-11 00:00:00'
    ) AS r
    GROUP BY machine_name;


    -- Q1.2: Which computer lab machines have been offline in the past day?

    SELECT machine_name,
    log_time
    FROM logs1
    WHERE (machine_name LIKE 'cslab%' OR
    machine_name LIKE 'mslab%')
    AND load_one IS NULL
    AND log_time >= TIMESTAMP '2017-01-10 00:00:00'
    ORDER BY machine_name,
    log_time;


    -- Q1.3: What are the hourly average metrics during the past 10 days for a specific workstation?

    SELECT dt,
    hr,
    AVG(load_fifteen) AS load_fifteen_avg,
    AVG(load_five) AS load_five_avg,
    AVG(load_one) AS load_one_avg,
    AVG(mem_free) AS mem_free_avg,
    AVG(swap_free) AS swap_free_avg
    FROM (
    SELECT CAST(log_time AS DATE) AS dt,
    EXTRACT(HOUR FROM log_time) AS hr,
    load_fifteen,
    load_five,
    load_one,
    mem_free,
    swap_free
    FROM logs1
    WHERE machine_name = 'babbage'
    AND load_fifteen IS NOT NULL
    AND load_five IS NOT NULL
    AND load_one IS NOT NULL
    AND mem_free IS NOT NULL
    AND swap_free IS NOT NULL
    AND log_time >= TIMESTAMP '2017-01-01 00:00:00'
    ) AS r
    GROUP BY dt,
    hr
    ORDER BY dt,
    hr;


    -- Q1.4: Over 1 month, how often was each server blocked on disk I/O?

    SELECT machine_name,
    COUNT(*) AS spikes
    FROM logs1
    WHERE machine_group = 'Servers'
    AND cpu_wio > 0.99
    AND log_time >= TIMESTAMP '2016-12-01 00:00:00'
    AND log_time < TIMESTAMP '2017-01-01 00:00:00'
    GROUP BY machine_name
    ORDER BY spikes DESC
    LIMIT 10;


    -- Q1.5: Which externally reachable VMs have run low on memory?

    SELECT machine_name,
    dt,
    MIN(mem_free) AS mem_free_min
    FROM (
    SELECT machine_name,
    CAST(log_time AS DATE) AS dt,
    mem_free
    FROM logs1
    WHERE machine_group = 'DMZ'
    AND mem_free IS NOT NULL
    ) AS r
    GROUP BY machine_name,
    dt
    HAVING MIN(mem_free) < 10000
    ORDER BY machine_name,
    dt;


    -- Q1.6: What is the total hourly network traffic across all file servers?

    SELECT dt,
    hr,
    SUM(net_in) AS net_in_sum,
    SUM(net_out) AS net_out_sum,
    SUM(net_in) + SUM(net_out) AS both_sum
    FROM (
    SELECT CAST(log_time AS DATE) AS dt,
    EXTRACT(HOUR FROM log_time) AS hr,
    COALESCE(bytes_in, 0.0) / 1000000000.0 AS net_in,
    COALESCE(bytes_out, 0.0) / 1000000000.0 AS net_out
    FROM logs1
    WHERE machine_name IN ('allsorts','andes','bigred','blackjack','bonbon',
    'cadbury','chiclets','cotton','crows','dove','fireball','hearts','huey',
    'lindt','milkduds','milkyway','mnm','necco','nerds','orbit','peeps',
    'poprocks','razzles','runts','smarties','smuggler','spree','stride',
    'tootsie','trident','wrigley','york')
    ) AS r
    GROUP BY dt,
    hr
    ORDER BY both_sum DESC
    LIMIT 10;


    -- Q2.1: Which requests have caused server errors within the past 2 weeks?

    SELECT *
    FROM logs2
    WHERE status_code >= 500
    AND log_time >= TIMESTAMP '2012-12-18 00:00:00'
    ORDER BY log_time;


    -- Q2.2: During a specific 2-week period, was the user password file leaked?

    SELECT *
    FROM logs2
    WHERE status_code >= 200
    AND status_code < 300
    AND request LIKE '%/etc/passwd%'
    AND log_time >= TIMESTAMP '2012-05-06 00:00:00'
    AND log_time < TIMESTAMP '2012-05-20 00:00:00';


    -- Q2.3: What was the average path depth for top-level requests in the past month?

    SELECT top_level,
    AVG(LENGTH(request) - LENGTH(REPLACE(request, '/', ''))) AS depth_avg
    FROM (
    SELECT SUBSTRING(request FROM 1 FOR len) AS top_level,
    request
    FROM (
    SELECT POSITION(SUBSTRING(request FROM 2), '/') AS len,
    request
    FROM logs2
    WHERE status_code >= 200
    AND status_code < 300
    AND log_time >= TIMESTAMP '2012-12-01 00:00:00'
    ) AS r
    WHERE len > 0
    ) AS s
    WHERE top_level IN ('/about','/courses','/degrees','/events',
    '/grad','/industry','/news','/people',
    '/publications','/research','/teaching','/ugrad')
    GROUP BY top_level
    ORDER BY top_level;


    -- Q2.4: During the last 3 months, which clients have made an excessive number of requests?

    SELECT client_ip,
    COUNT(*) AS num_requests
    FROM logs2
    WHERE log_time >= TIMESTAMP '2012-10-01 00:00:00'
    GROUP BY client_ip
    HAVING COUNT(*) >= 100000
    ORDER BY num_requests DESC;


    -- Q2.5: What are the daily unique visitors?

    SELECT dt,
    COUNT(DISTINCT client_ip)
    FROM (
    SELECT CAST(log_time AS DATE) AS dt,
    client_ip
    FROM logs2
    ) AS r
    GROUP BY dt
    ORDER BY dt;


    -- Q2.6: What are the average and maximum data transfer rates (Gbps)?

    SELECT AVG(transfer) / 125000000.0 AS transfer_avg,
    MAX(transfer) / 125000000.0 AS transfer_max
    FROM (
    SELECT log_time,
    SUM(object_size) AS transfer
    FROM logs2
    GROUP BY log_time
    ) AS r;


    -- Q3.1: Did the indoor temperature reach freezing over the weekend?

    SELECT *
    FROM logs3
    WHERE event_type = 'temperature'
    AND event_value <= 32.0
    AND log_time >= '2019-11-29 17:00:00.000';


    -- Q3.4: Over the past 6 months, how frequently were each door opened?

    SELECT device_name,
    device_floor,
    COUNT(*) AS ct
    FROM logs3
    WHERE event_type = 'door_open'
    AND log_time >= '2019-06-01 00:00:00.000'
    GROUP BY device_name,
    device_floor
    ORDER BY ct DESC;


    -- Q3.5: Where in the building do large temperature variations occur in winter and summer?

    WITH temperature AS (
    SELECT dt,
    device_name,
    device_type,
    device_floor
    FROM (
    SELECT dt,
    hr,
    device_name,
    device_type,
    device_floor,
    AVG(event_value) AS temperature_hourly_avg
    FROM (
    SELECT CAST(log_time AS DATE) AS dt,
    EXTRACT(HOUR FROM log_time) AS hr,
    device_name,
    device_type,
    device_floor,
    event_value
    FROM logs3
    WHERE event_type = 'temperature'
    ) AS r
    GROUP BY dt,
    hr,
    device_name,
    device_type,
    device_floor
    ) AS s
    GROUP BY dt,
    device_name,
    device_type,
    device_floor
    HAVING MAX(temperature_hourly_avg) - MIN(temperature_hourly_avg) >= 25.0
    )
    SELECT DISTINCT device_name,
    device_type,
    device_floor,
    'WINTER'
    FROM temperature
    WHERE dt >= DATE '2018-12-01'
    AND dt < DATE '2019-03-01'
    UNION
    SELECT DISTINCT device_name,
    device_type,
    device_floor,
    'SUMMER'
    FROM temperature
    WHERE dt >= DATE '2019-06-01'
    AND dt < DATE '2019-09-01';


    -- Q3.6: For each device category, what are the monthly power consumption metrics?

    SELECT yr,
    mo,
    SUM(coffee_hourly_avg) AS coffee_monthly_sum,
    AVG(coffee_hourly_avg) AS coffee_monthly_avg,
    SUM(printer_hourly_avg) AS printer_monthly_sum,
    AVG(printer_hourly_avg) AS printer_monthly_avg,
    SUM(projector_hourly_avg) AS projector_monthly_sum,
    AVG(projector_hourly_avg) AS projector_monthly_avg,
    SUM(vending_hourly_avg) AS vending_monthly_sum,
    AVG(vending_hourly_avg) AS vending_monthly_avg
    FROM (
    SELECT dt,
    yr,
    mo,
    hr,
    AVG(coffee) AS coffee_hourly_avg,
    AVG(printer) AS printer_hourly_avg,
    AVG(projector) AS projector_hourly_avg,
    AVG(vending) AS vending_hourly_avg
    FROM (
    SELECT CAST(log_time AS DATE) AS dt,
    EXTRACT(YEAR FROM log_time) AS yr,
    EXTRACT(MONTH FROM log_time) AS mo,
    EXTRACT(HOUR FROM log_time) AS hr,
    CASE WHEN device_name LIKE 'coffee%' THEN event_value END AS coffee,
    CASE WHEN device_name LIKE 'printer%' THEN event_value END AS printer,
    CASE WHEN device_name LIKE 'projector%' THEN event_value END AS projector,
    CASE WHEN device_name LIKE 'vending%' THEN event_value END AS vending
    FROM logs3
    WHERE device_type = 'meter'
    ) AS r
    GROUP BY dt,
    yr,
    mo,
    hr
    ) AS s
    GROUP BY yr,
    mo
    ORDER BY yr,
    mo;

    Cell Towers

    This dataset is from OpenCellid - The world's largest Open Database of Cell Towers.

    As of 2021, it contains more than 40 million records about cell towers (GSM, LTE, UMTS, etc.) around the world with their geographical coordinates and metadata (country code, network, etc).

    OpenCelliD Project is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License, and we redistribute a snapshot of this dataset under the terms of the same license. The up-to-date version of the dataset is available for download after signing in.

    Get the Dataset

    1. Download the snapshot of the dataset from February 2021: https://datasets.clickhouse.com/cell_towers.csv.xz (729 MB).

    2. Validate the integrity (optional step):

    md5sum cell_towers.csv.xz
    8cf986f4a0d9f12c6f384a0e9192c908 cell_towers.csv.xz
    3. Decompress it with the following command:
    xz -d cell_towers.csv.xz
    4. Create a table:
    CREATE TABLE cell_towers
    (
    radio Enum8('' = 0, 'CDMA' = 1, 'GSM' = 2, 'LTE' = 3, 'NR' = 4, 'UMTS' = 5),
    mcc UInt16,
    net UInt16,
    area UInt16,
    cell UInt64,
    unit Int16,
    lon Float64,
    lat Float64,
    range UInt32,
    samples UInt32,
    changeable UInt8,
    created DateTime,
    updated DateTime,
    averageSignal UInt8
    )
    ENGINE = MergeTree ORDER BY (radio, mcc, net, created);
    5. Insert the dataset:
    clickhouse-client --query "INSERT INTO cell_towers FORMAT CSVWithNames" < cell_towers.csv

    Examples

    1. The number of cell towers by type:
    SELECT radio, count() AS c FROM cell_towers GROUP BY radio ORDER BY c DESC

    ┌─radio─┬────────c─┐
    │ UMTS  │ 20686487 │
    │ LTE   │ 12101148 │
    │ GSM   │  9931312 │
    │ CDMA  │   556344 │
    │ NR    │      867 │
    └───────┴──────────┘

    5 rows in set. Elapsed: 0.011 sec. Processed 43.28 million rows, 43.28 MB (3.83 billion rows/s., 3.83 GB/s.)
    2. Cell towers by mobile country code (MCC):
    SELECT mcc, count() FROM cell_towers GROUP BY mcc ORDER BY count() DESC LIMIT 10

    ┌─mcc─┬─count()─┐
    │ 310 │ 5024650 │
    │ 262 │ 2622423 │
    │ 250 │ 1953176 │
    │ 208 │ 1891187 │
    │ 724 │ 1836150 │
    │ 404 │ 1729151 │
    │ 234 │ 1618924 │
    │ 510 │ 1353998 │
    │ 440 │ 1343355 │
    │ 311 │ 1332798 │
    └─────┴─────────┘

    10 rows in set. Elapsed: 0.019 sec. Processed 43.28 million rows, 86.55 MB (2.33 billion rows/s., 4.65 GB/s.)

    So, the top countries are: the USA, Germany, and Russia.

    You may want to create an External Dictionary in ClickHouse to decode these values.
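
    As a sketch of that idea: assuming you prepare a hypothetical mcc_codes.csv file (columns mcc and name) in the server's user_files directory, the dictionary and the lookup could look like this:

    CREATE DICTIONARY mcc_names
    (
        mcc UInt64,
        name String
    )
    PRIMARY KEY mcc
    SOURCE(FILE(path '/var/lib/clickhouse/user_files/mcc_codes.csv' format 'CSVWithNames'))
    LAYOUT(FLAT())
    LIFETIME(MIN 0 MAX 0);

    -- mcc_codes.csv is a hypothetical mapping you prepare yourself;
    -- the FLAT layout requires the key to be UInt64:
    SELECT dictGet('mcc_names', 'name', toUInt64(mcc)) AS country, count() AS c
    FROM cell_towers
    GROUP BY country
    ORDER BY c DESC
    LIMIT 10;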

    Use case

    Using pointInPolygon function.

    1. Create a table where we will store polygons:
    CREATE TEMPORARY TABLE moscow (polygon Array(Tuple(Float64, Float64)));
    2. This is a rough shape of Moscow (without "new Moscow"):
    INSERT INTO moscow VALUES ([(37.84172564285271, 55.78000432402266), (37.8381207618713, 55.775874525970494), (37.83979446823122, 55.775626746008065), (37.84243326983639, 55.77446586811748), (37.84262672750849, 55.771974101091104), (37.84153238623039, 55.77114545193181), (37.841124690460184, 55.76722010265554), (37.84239076983644, 55.76654891107098), (37.842283558197025, 55.76258709833121), (37.8421759312134, 55.758073999993734), (37.84198330422974, 55.75381499999371), (37.8416827275085, 55.749277102484484), (37.84157576190186, 55.74794544108413), (37.83897929098507, 55.74525257875241), (37.83739676451868, 55.74404373042019), (37.838732481460525, 55.74298009816793), (37.841183997352545, 55.743060321833575), (37.84097476190185, 55.73938799999373), (37.84048155819702, 55.73570799999372), (37.840095812164286, 55.73228210777237), (37.83983814285274, 55.73080491981639), (37.83846476321406, 55.729799917464675), (37.83835745269769, 55.72919751082619), (37.838636380279524, 55.72859509486539), (37.8395161005249, 55.727705075632784), (37.83897964285276, 55.722727886185154), (37.83862557539366, 55.72034817326636), (37.83559735744853, 55.71944437307499), (37.835370708803126, 55.71831419154461), (37.83738169402022, 55.71765218986692), (37.83823396494291, 55.71691750159089), (37.838056931213345, 55.71547311301385), (37.836812846557606, 55.71221445615604), (37.83522525396725, 55.709331054395555), (37.83269301586908, 55.70953687463627), (37.829667367706236, 55.70903403789297), (37.83311126588435, 55.70552351822608), (37.83058993121339, 55.70041317726053), (37.82983872750851, 55.69883771404813), (37.82934501586913, 55.69718947487017), (37.828926414016685, 55.69504441658371), (37.82876530422971, 55.69287499999378), (37.82894754100031, 55.690759754047335), (37.827697554878185, 55.68951421135665), (37.82447346292115, 55.68965045405069), (37.83136543914793, 55.68322046195302), (37.833554015869154, 55.67814012759211), (37.83544184655761, 55.67295011628339), (37.837480388885474, 55.6672498719639), (37.838960677246064, 55.66316274139358), (37.83926093121332, 55.66046999999383), (37.839025050262435, 55.65869897264431), (37.83670784390257, 55.65794084879904), (37.835656529083245, 55.65694309303843), (37.83704060449217, 55.65689306460552), (37.83696819873806, 55.65550363526252), (37.83760389616388, 55.65487847246661), (37.83687972750851, 55.65356745541324), (37.83515216004943, 55.65155951234079), (37.83312418518067, 55.64979413590619), (37.82801726983639, 55.64640836412121), (37.820614174591, 55.64164525405531), (37.818908190475426, 55.6421883258084), (37.81717543386075, 55.64112490388471), (37.81690987037274, 55.63916106913107), (37.815099354492155, 55.637925371757085), (37.808769150787356, 55.633798276884455), (37.80100123544311, 55.62873670012244), (37.79598013491824, 55.62554336109055), (37.78634567724606, 55.62033499605651), (37.78334147619623, 55.618768681480326), (37.77746201055901, 55.619855533402706), (37.77527329626457, 55.61909966711279), (37.77801986242668, 55.618770300976294), (37.778212973541216, 55.617257701952106), (37.77784818518065, 55.61574504433011), (37.77016867724609, 55.61148576294007), (37.760191219573976, 55.60599579539028), (37.75338926983641, 55.60227892751446), (37.746329965606634, 55.59920577639331), (37.73939925396728, 55.59631430313617), (37.73273665739439, 55.5935318803559), (37.7299954450912, 55.59350760316188), (37.7268679946899, 55.59469840523759), (37.72626726983634, 55.59229549697373), (37.7262673598022, 55.59081598950582), (37.71897193121335, 55.5877595845419), 
(37.70871550793456, 55.58393177431724), (37.700497489410374, 55.580917323756644), (37.69204305026244, 55.57778089778455), (37.68544477378839, 55.57815154690915), (37.68391050793454, 55.57472945079756), (37.678803592590306, 55.57328235936491), (37.6743402539673, 55.57255251445782), (37.66813862698363, 55.57216388774464), (37.617927457672096, 55.57505691895805), (37.60443099999999, 55.5757737568051), (37.599683515869145, 55.57749105910326), (37.59754177842709, 55.57796291823627), (37.59625834786988, 55.57906686095235), (37.59501783265684, 55.57746616444403), (37.593090671936025, 55.57671634534502), (37.587018007904, 55.577944600233785), (37.578692203704804, 55.57982895000019), (37.57327546607398, 55.58116294118248), (37.57385012109279, 55.581550362779), (37.57399562266922, 55.5820107079112), (37.5735356072979, 55.58226289171689), (37.57290393054962, 55.582393529795155), (37.57037722355653, 55.581919415056234), (37.5592298306885, 55.584471614867844), (37.54189249206543, 55.58867650795186), (37.5297256269836, 55.59158133551745), (37.517837865081766, 55.59443656218868), (37.51200186508174, 55.59635625174229), (37.506808949737554, 55.59907823904434), (37.49820432275389, 55.6062944994944), (37.494406071441674, 55.60967103463367), (37.494760001358024, 55.61066689753365), (37.49397137107085, 55.61220931698269), (37.49016528606031, 55.613417718449064), (37.48773249206542, 55.61530616333343), (37.47921386508177, 55.622640129112334), (37.470652153442394, 55.62993723476164), (37.46273446298218, 55.6368075123157), (37.46350692265317, 55.64068225239439), (37.46050283203121, 55.640794546982576), (37.457627470916734, 55.64118904154646), (37.450718034393326, 55.64690488145138), (37.44239252645875, 55.65397824729769), (37.434587576721185, 55.66053543155961), (37.43582144975277, 55.661693766520735), (37.43576786245721, 55.662755031737014), (37.430982915344174, 55.664610641628116), (37.428547447097685, 55.66778515273695), (37.42945134592044, 55.668633314343566), (37.42859571562949, 55.66948145750025), (37.4262836402282, 55.670813882451405), (37.418709037048295, 55.6811141674414), (37.41922139651101, 55.68235377885389), (37.419218771842885, 55.68359335082235), (37.417196501327446, 55.684375235224735), (37.41607020370478, 55.68540557585352), (37.415640857147146, 55.68686637150793), (37.414632153442334, 55.68903015131686), (37.413344899475064, 55.690896881757396), (37.41171432275391, 55.69264232162232), (37.40948282275393, 55.69455101638112), (37.40703674603271, 55.69638690385348), (37.39607169577025, 55.70451821283731), (37.38952706878662, 55.70942491932811), (37.387778313491815, 55.71149057784176), (37.39049275399779, 55.71419814298992), (37.385557272491454, 55.7155489617061), (37.38388335714726, 55.71849856042102), (37.378368238098155, 55.7292763261685), (37.37763597123337, 55.730845879211614), (37.37890062088197, 55.73167906388319), (37.37750451918789, 55.734703664681774), (37.375610832015965, 55.734851959522246), (37.3723813571472, 55.74105626086403), (37.37014935714723, 55.746115620904355), (37.36944173016362, 55.750883999993725), (37.36975304365541, 55.76335905525834), (37.37244070571134, 55.76432079697595), (37.3724259757175, 55.76636979670426), (37.369922155757884, 55.76735417953104), (37.369892695770275, 55.76823419316575), (37.370214730163575, 55.782312184391266), (37.370493611114505, 55.78436801120489), (37.37120164550783, 55.78596427165359), (37.37284851456452, 55.7874378183096), (37.37608325135799, 55.7886695054807), (37.3764587460632, 55.78947647305964), (37.37530000265506, 55.79146512926804), 
(37.38235915344241, 55.79899647809345), (37.384344043655396, 55.80113596939471), (37.38594269577028, 55.80322699999366), (37.38711208598329, 55.804919036911976), (37.3880239841309, 55.806610999993666), (37.38928977249147, 55.81001864976979), (37.39038389947512, 55.81348641242801), (37.39235781481933, 55.81983538336746), (37.393709457672124, 55.82417822811877), (37.394685720901464, 55.82792275755836), (37.39557615344238, 55.830447148154136), (37.39844478226658, 55.83167107969975), (37.40019761214057, 55.83151823557964), (37.400398790382326, 55.83264967594742), (37.39659544313046, 55.83322180909622), (37.39667059524539, 55.83402792148566), (37.39682089947515, 55.83638877400216), (37.39643489154053, 55.83861656112751), (37.3955338994751, 55.84072348043264), (37.392680272491454, 55.84502158126453), (37.39241188227847, 55.84659117913199), (37.392529730163616, 55.84816071336481), (37.39486835714723, 55.85288092980303), (37.39873052645878, 55.859893456073635), (37.40272161111449, 55.86441833633205), (37.40697072750854, 55.867579567544375), (37.410007082016016, 55.868369880337), (37.4120992989502, 55.86920843741314), (37.412668021163924, 55.87055369615854), (37.41482461111453, 55.87170587948249), (37.41862266137694, 55.873183961039565), (37.42413732540892, 55.874879126654704), (37.4312182698669, 55.875614937236705), (37.43111093783558, 55.8762723478417), (37.43332105622856, 55.87706546369396), (37.43385747619623, 55.87790681284802), (37.441303050262405, 55.88027084462084), (37.44747234260555, 55.87942070143253), (37.44716141796871, 55.88072960917233), (37.44769797085568, 55.88121221323979), (37.45204320500181, 55.882080694420715), (37.45673176190186, 55.882346110794586), (37.463383999999984, 55.88252729504517), (37.46682797486874, 55.88294937719063), (37.470014457672086, 55.88361266759345), (37.47751410450743, 55.88546991372396), (37.47860317658232, 55.88534929207307), (37.48165826025772, 55.882563306475106), (37.48316434442331, 55.8815803226785), (37.483831555817645, 55.882427612793315), (37.483182967125686, 55.88372791409729), (37.483092277908824, 55.88495581062434), (37.4855716508179, 55.8875561994203), (37.486440636245746, 55.887827444039566), (37.49014203439328, 55.88897899871799), (37.493210285705544, 55.890208937135604), (37.497512451065035, 55.891342397444696), (37.49780744510645, 55.89174030252967), (37.49940333499519, 55.89239745507079), (37.50018383334346, 55.89339220941865), (37.52421672750851, 55.903869074155224), (37.52977457672118, 55.90564076517974), (37.53503220370484, 55.90661661218259), (37.54042858064267, 55.90714113744566), (37.54320461007303, 55.905645048442985), (37.545686966066306, 55.906608607018505), (37.54743976120755, 55.90788552162358), (37.55796999999999, 55.90901557907218), (37.572711542327866, 55.91059395704873), (37.57942799999998, 55.91073854155573), (37.58502865872187, 55.91009969268444), (37.58739968913264, 55.90794809960554), (37.59131567193598, 55.908713267595054), (37.612687423278814, 55.902866854295375), (37.62348079629517, 55.90041967242986), (37.635797880950896, 55.898141151686396), (37.649487626983664, 55.89639275532968), (37.65619302513125, 55.89572360207488), (37.66294133862307, 55.895295577183965), (37.66874564418033, 55.89505457604897), (37.67375601586915, 55.89254677027454), (37.67744661901856, 55.8947775867987), (37.688347, 55.89450045676125), (37.69480554232789, 55.89422926332761), (37.70107096560668, 55.89322256101114), (37.705962965606716, 55.891763491662616), (37.711885134918205, 55.889110234998974), (37.71682005026245, 55.886577568759876), 
(37.7199315476074, 55.88458159806678), (37.72234560316464, 55.882281005794134), (37.72364385977171, 55.8809452036196), (37.725371142837474, 55.8809722706006), (37.727870902099546, 55.88037213862385), (37.73394330422971, 55.877941504088696), (37.745339592590376, 55.87208120378722), (37.75525267724611, 55.86703807949492), (37.76919976190188, 55.859821640197474), (37.827835219574, 55.82962968399116), (37.83341438888553, 55.82575289922351), (37.83652584655761, 55.82188784027888), (37.83809213491821, 55.81612575504693), (37.83605359521481, 55.81460347077685), (37.83632178569025, 55.81276696067908), (37.838623105812026, 55.811486181656385), (37.83912198147584, 55.807329380532785), (37.839079078033414, 55.80510270463816), (37.83965844708251, 55.79940712529036), (37.840581150787344, 55.79131399999368), (37.84172564285271, 55.78000432402266)]);
    3. Check how many cell towers are in Moscow:
    SELECT count() FROM cell_towers WHERE pointInPolygon((lon, lat), (SELECT * FROM moscow))

    ┌─count()─┐
    │  310463 │
    └─────────┘

    1 rows in set. Elapsed: 0.067 sec. Processed 43.28 million rows, 692.42 MB (645.83 million rows/s., 10.33 GB/s.)

    The data is also available for interactive queries in the Playground.

    Although you cannot create temporary tables there.

    New York Public Library "What's on the Menu?" Dataset

    The dataset is created by the New York Public Library. It contains historical data on the menus of hotels, restaurants and cafes, with dishes and their prices.

    Source: http://menus.nypl.org/data The data is in public domain.

    The data is from the library's archive, and it may be incomplete and difficult for statistical analysis. Nevertheless, it is also very yummy. The size is just 1.3 million records about dishes in the menus; it's a very small data volume for ClickHouse, but it's still a good example.

    Download the Dataset

    Run the command:

    wget https://s3.amazonaws.com/menusdata.nypl.org/gzips/2021_08_01_07_01_17_data.tgz

    Replace the link with an up-to-date one from http://menus.nypl.org/data if needed. The download size is about 35 MB.

    Unpack the Dataset

    tar xvf 2021_08_01_07_01_17_data.tgz

    Uncompressed size is about 150 MB.

    The data is normalized and consists of four tables:

    • Menu — Information about menus: the name of the restaurant, the date when menu was seen, etc.
    • Dish — Information about dishes: the name of the dish along with some characteristic.
    • MenuPage — Information about the pages in the menus, because every page belongs to some menu.
    • MenuItem — A menu item: a dish along with its price on some menu page, with links to the dish and the menu page.

    Create the Tables

    We use the Decimal data type to store prices.

    CREATE TABLE dish
    (
    id UInt32,
    name String,
    description String,
    menus_appeared UInt32,
    times_appeared Int32,
    first_appeared UInt16,
    last_appeared UInt16,
    lowest_price Decimal64(3),
    highest_price Decimal64(3)
    ) ENGINE = MergeTree ORDER BY id;

    CREATE TABLE menu
    (
    id UInt32,
    name String,
    sponsor String,
    event String,
    venue String,
    place String,
    physical_description String,
    occasion String,
    notes String,
    call_number String,
    keywords String,
    language String,
    date String,
    location String,
    location_type String,
    currency String,
    currency_symbol String,
    status String,
    page_count UInt16,
    dish_count UInt16
    ) ENGINE = MergeTree ORDER BY id;

    CREATE TABLE menu_page
    (
    id UInt32,
    menu_id UInt32,
    page_number UInt16,
    image_id String,
    full_height UInt16,
    full_width UInt16,
    uuid UUID
    ) ENGINE = MergeTree ORDER BY id;

    CREATE TABLE menu_item
    (
    id UInt32,
    menu_page_id UInt32,
    price Decimal64(3),
    high_price Decimal64(3),
    dish_id UInt32,
    created_at DateTime,
    updated_at DateTime,
    xpos Float64,
    ypos Float64
    ) ENGINE = MergeTree ORDER BY id;

    Import the Data

    To upload the data into ClickHouse, run:

    clickhouse-client --format_csv_allow_single_quotes 0 --input_format_null_as_default 0 --query "INSERT INTO dish FORMAT CSVWithNames" < Dish.csv
    clickhouse-client --format_csv_allow_single_quotes 0 --input_format_null_as_default 0 --query "INSERT INTO menu FORMAT CSVWithNames" < Menu.csv
    clickhouse-client --format_csv_allow_single_quotes 0 --input_format_null_as_default 0 --query "INSERT INTO menu_page FORMAT CSVWithNames" < MenuPage.csv
    clickhouse-client --format_csv_allow_single_quotes 0 --input_format_null_as_default 0 --date_time_input_format best_effort --query "INSERT INTO menu_item FORMAT CSVWithNames" < MenuItem.csv

    We use the CSVWithNames format, as the data is CSV with a header row.

    We disable format_csv_allow_single_quotes because only double quotes are used to enclose data fields; single quotes may occur inside values and must not confuse the CSV parser.

    We disable input_format_null_as_default because our data has no NULLs; otherwise ClickHouse would try to parse \N sequences and could get confused by \ characters in the data.

    The setting date_time_input_format best_effort allows parsing DateTime fields in a wide variety of formats. For example, ISO-8601 without seconds, like '2000-01-01 01:02', will be recognized. Without this setting, only the fixed DateTime format is allowed.
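
    The same relaxed parser is also exposed as the parseDateTimeBestEffort function, so you can check which formats best_effort accepts directly:

    SELECT parseDateTimeBestEffort('2000-01-01 01:02') AS t;
    -- t is the DateTime value 2000-01-01 01:02:00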

    Denormalize the Data

    The data is spread across multiple tables in normalized form. This means you have to perform a JOIN if you want to query, e.g., dish names from menu items. For typical analytical tasks it is far more efficient to work with pre-JOINed data, to avoid doing the JOIN every time. Such data is called "denormalized".

    We will create a table menu_item_denorm that contains all the data JOINed together:

    CREATE TABLE menu_item_denorm
    ENGINE = MergeTree ORDER BY (dish_name, created_at)
    AS SELECT
    price,
    high_price,
    created_at,
    updated_at,
    xpos,
    ypos,
    dish.id AS dish_id,
    dish.name AS dish_name,
    dish.description AS dish_description,
    dish.menus_appeared AS dish_menus_appeared,
    dish.times_appeared AS dish_times_appeared,
    dish.first_appeared AS dish_first_appeared,
    dish.last_appeared AS dish_last_appeared,
    dish.lowest_price AS dish_lowest_price,
    dish.highest_price AS dish_highest_price,
    menu.id AS menu_id,
    menu.name AS menu_name,
    menu.sponsor AS menu_sponsor,
    menu.event AS menu_event,
    menu.venue AS menu_venue,
    menu.place AS menu_place,
    menu.physical_description AS menu_physical_description,
    menu.occasion AS menu_occasion,
    menu.notes AS menu_notes,
    menu.call_number AS menu_call_number,
    menu.keywords AS menu_keywords,
    menu.language AS menu_language,
    menu.date AS menu_date,
    menu.location AS menu_location,
    menu.location_type AS menu_location_type,
    menu.currency AS menu_currency,
    menu.currency_symbol AS menu_currency_symbol,
    menu.status AS menu_status,
    menu.page_count AS menu_page_count,
    menu.dish_count AS menu_dish_count
    FROM menu_item
    JOIN dish ON menu_item.dish_id = dish.id
    JOIN menu_page ON menu_item.menu_page_id = menu_page.id
    JOIN menu ON menu_page.menu_id = menu.id;

    Validate the Data

    Query:

    SELECT count() FROM menu_item_denorm;

    Result:

    ┌─count()─┐
    │ 1329175 │
    └─────────┘

    Run Some Queries

    Averaged historical prices of dishes (in the query, bar(avg(price), 0, 100, 100) draws a bar proportional to the average price on a 0 to 100 scale, up to 100 characters wide)

    Query:

    SELECT
    round(toUInt32OrZero(extract(menu_date, '^\\d{4}')), -1) AS d,
    count(),
    round(avg(price), 2),
    bar(avg(price), 0, 100, 100)
    FROM menu_item_denorm
    WHERE (menu_currency = 'Dollars') AND (d > 0) AND (d < 2022)
    GROUP BY d
    ORDER BY d ASC;

    Result:

    ┌────d─┬─count()─┬─round(avg(price), 2)─┬─bar(avg(price), 0, 100, 100)─┐
    │ 1850 │ 618 │ 1.5 │ █▍ │
    │ 1860 │ 1634 │ 1.29 │ █▎ │
    │ 1870 │ 2215 │ 1.36 │ █▎ │
    │ 1880 │ 3909 │ 1.01 │ █ │
    │ 1890 │ 8837 │ 1.4 │ █▍ │
    │ 1900 │ 176292 │ 0.68 │ ▋ │
    │ 1910 │ 212196 │ 0.88 │ ▊ │
    │ 1920 │ 179590 │ 0.74 │ ▋ │
    │ 1930 │ 73707 │ 0.6 │ ▌ │
    │ 1940 │ 58795 │ 0.57 │ ▌ │
    │ 1950 │ 41407 │ 0.95 │ ▊ │
    │ 1960 │ 51179 │ 1.32 │ █▎ │
    │ 1970 │ 12914 │ 1.86 │ █▋ │
    │ 1980 │ 7268 │ 4.35 │ ████▎ │
    │ 1990 │ 11055 │ 6.03 │ ██████ │
    │ 2000 │ 2467 │ 11.85 │ ███████████▋ │
    │ 2010 │ 597 │ 25.66 │ █████████████████████████▋ │
    └──────┴─────────┴──────────────────────┴──────────────────────────────┘

    Take it with a grain of salt.

    Burger Prices

    Query:

    SELECT
    round(toUInt32OrZero(extract(menu_date, '^\\d{4}')), -1) AS d,
    count(),
    round(avg(price), 2),
    bar(avg(price), 0, 50, 100)
    FROM menu_item_denorm
    WHERE (menu_currency = 'Dollars') AND (d > 0) AND (d < 2022) AND (dish_name ILIKE '%burger%')
    GROUP BY d
    ORDER BY d ASC;

    Result:

    ┌────d─┬─count()─┬─round(avg(price), 2)─┬─bar(avg(price), 0, 50, 100)───────────┐
    │ 1880 │ 2 │ 0.42 │ ▋ │
    │ 1890 │ 7 │ 0.85 │ █▋ │
    │ 1900 │ 399 │ 0.49 │ ▊ │
    │ 1910 │ 589 │ 0.68 │ █▎ │
    │ 1920 │ 280 │ 0.56 │ █ │
    │ 1930 │ 74 │ 0.42 │ ▋ │
    │ 1940 │ 119 │ 0.59 │ █▏ │
    │ 1950 │ 134 │ 1.09 │ ██▏ │
    │ 1960 │ 272 │ 0.92 │ █▋ │
    │ 1970 │ 108 │ 1.18 │ ██▎ │
    │ 1980 │ 88 │ 2.82 │ █████▋ │
    │ 1990 │ 184 │ 3.68 │ ███████▎ │
    │ 2000 │ 21 │ 7.14 │ ██████████████▎ │
    │ 2010 │ 6 │ 18.42 │ ████████████████████████████████████▋ │
    └──────┴─────────┴──────────────────────┴───────────────────────────────────────┘

    Vodka

    Query:

    SELECT
    round(toUInt32OrZero(extract(menu_date, '^\\d{4}')), -1) AS d,
    count(),
    round(avg(price), 2),
    bar(avg(price), 0, 50, 100)
    FROM menu_item_denorm
    WHERE (menu_currency IN ('Dollars', '')) AND (d > 0) AND (d < 2022) AND (dish_name ILIKE '%vodka%')
    GROUP BY d
    ORDER BY d ASC;

    Result:

    ┌────d─┬─count()─┬─round(avg(price), 2)─┬─bar(avg(price), 0, 50, 100)─┐
    │ 1910 │ 2 │ 0 │ │
    │ 1920 │ 1 │ 0.3 │ ▌ │
    │ 1940 │ 21 │ 0.42 │ ▋ │
    │ 1950 │ 14 │ 0.59 │ █▏ │
    │ 1960 │ 113 │ 2.17 │ ████▎ │
    │ 1970 │ 37 │ 0.68 │ █▎ │
    │ 1980 │ 19 │ 2.55 │ █████ │
    │ 1990 │ 86 │ 3.6 │ ███████▏ │
    │ 2000 │ 2 │ 3.98 │ ███████▊ │
    └──────┴─────────┴──────────────────────┴─────────────────────────────┘

    To get vodka we have to write ILIKE '%vodka%' and this definitely makes a statement.

    Caviar

    Let's print caviar prices. Let's also print the name of any dish with caviar.

    Query:

    SELECT
    round(toUInt32OrZero(extract(menu_date, '^\\d{4}')), -1) AS d,
    count(),
    round(avg(price), 2),
    bar(avg(price), 0, 50, 100),
    any(dish_name)
    FROM menu_item_denorm
    WHERE (menu_currency IN ('Dollars', '')) AND (d > 0) AND (d < 2022) AND (dish_name ILIKE '%caviar%')
    GROUP BY d
    ORDER BY d ASC;

    Result:

    ┌────d─┬─count()─┬─round(avg(price), 2)─┬─bar(avg(price), 0, 50, 100)──────┬─any(dish_name)──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
    │ 1090 │ 1 │ 0 │ │ Caviar │
    │ 1880 │ 3 │ 0 │ │ Caviar │
    │ 1890 │ 39 │ 0.59 │ █▏ │ Butter and caviar │
    │ 1900 │ 1014 │ 0.34 │ ▋ │ Anchovy Caviar on Toast │
    │ 1910 │ 1588 │ 1.35 │ ██▋ │ 1/1 Brötchen Caviar │
    │ 1920 │ 927 │ 1.37 │ ██▋ │ ASTRAKAN CAVIAR │
    │ 1930 │ 289 │ 1.91 │ ███▋ │ Astrachan caviar │
    │ 1940 │ 201 │ 0.83 │ █▋ │ (SPECIAL) Domestic Caviar Sandwich │
    │ 1950 │ 81 │ 2.27 │ ████▌ │ Beluga Caviar │
    │ 1960 │ 126 │ 2.21 │ ████▍ │ Beluga Caviar │
    │ 1970 │ 105 │ 0.95 │ █▊ │ BELUGA MALOSSOL CAVIAR AMERICAN DRESSING │
    │ 1980 │ 12 │ 7.22 │ ██████████████▍ │ Authentic Iranian Beluga Caviar the world's finest black caviar presented in ice garni and a sampling of chilled 100° Russian vodka │
    │ 1990 │ 74 │ 14.42 │ ████████████████████████████▋ │ Avocado Salad, Fresh cut avocado with caviare │
    │ 2000 │ 3 │ 7.82 │ ███████████████▋ │ Aufgeschlagenes Kartoffelsueppchen mit Forellencaviar │
    │ 2010 │ 6 │ 15.58 │ ███████████████████████████████▏ │ "OYSTERS AND PEARLS" "Sabayon" of Pearl Tapioca with Island Creek Oysters and Russian Sevruga Caviar │
    └──────┴─────────┴──────────────────────┴──────────────────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

    At least they have caviar with vodka. Very nice.

    Crowdsourced air traffic data from The OpenSky Network 2020

    The data in this dataset is derived and cleaned from the full OpenSky dataset to illustrate the development of air traffic during the COVID-19 pandemic. It spans all flights seen by the network's more than 2500 members since 1 January 2019. More data will be periodically included in the dataset until the end of the COVID-19 pandemic.

    Source: https://zenodo.org/record/5092942#.YRBCyTpRXYd

    Martin Strohmeier, Xavier Olive, Jannis Lübbe, Matthias Schäfer, and Vincent Lenders "Crowdsourced air traffic data from the OpenSky Network 2019–2020" Earth System Science Data 13(2), 2021 https://doi.org/10.5194/essd-13-357-2021

    Download the Dataset

    Run the command:

    wget -O- https://zenodo.org/record/5092942 | grep -oP 'https://zenodo.org/record/5092942/files/flightlist_\d+_\d+\.csv\.gz' | xargs wget

    The download takes about 2 minutes over a good internet connection. There are 30 files, with a total size of 4.3 GB.

    Create the Table

    CREATE TABLE opensky
    (
    callsign String,
    number String,
    icao24 String,
    registration String,
    typecode String,
    origin String,
    destination String,
    firstseen DateTime,
    lastseen DateTime,
    day DateTime,
    latitude_1 Float64,
    longitude_1 Float64,
    altitude_1 Float64,
    latitude_2 Float64,
    longitude_2 Float64,
    altitude_2 Float64
    ) ENGINE = MergeTree ORDER BY (origin, destination, callsign);

    Import Data

    Upload the data into ClickHouse in parallel:

    ls -1 flightlist_*.csv.gz | xargs -P100 -I{} bash -c 'gzip -c -d "{}" | clickhouse-client --date_time_input_format best_effort --query "INSERT INTO opensky FORMAT CSVWithNames"'
    • Here we pass the list of files (ls -1 flightlist_*.csv.gz) to xargs for parallel processing. xargs -P100 specifies up to 100 parallel workers, but since we only have 30 files, there will be only 30 workers.
    • For every file, xargs runs a script with bash -c. The script uses a substitution in the form of {}, and xargs substitutes each filename into it (we asked for this with -I{}).
    • The script decompresses the file (gzip -c -d "{}") to standard output (the -c parameter), and the output is piped to clickhouse-client.
    • We also ask clickhouse-client to parse DateTime fields with the extended parser (--date_time_input_format best_effort) so that it recognizes ISO-8601 with timezone offsets.

    Finally, clickhouse-client performs the insertion, reading the input data in CSVWithNames format.

    Parallel upload takes 24 seconds.

    If you prefer not to upload in parallel, here is the sequential variant:

    for file in flightlist_*.csv.gz; do gzip -c -d "$file" | clickhouse-client --date_time_input_format best_effort --query "INSERT INTO opensky FORMAT CSVWithNames"; done

    Validate the Data

    Query:

    SELECT count() FROM opensky;

    Result:

    ┌──count()─┐
    │ 66010819 │
    └──────────┘

    The size of the dataset in ClickHouse is just 2.66 GiB; let's check it.

    Query:

    SELECT formatReadableSize(total_bytes) FROM system.tables WHERE name = 'opensky';

    Result:

    ┌─formatReadableSize(total_bytes)─┐
    │ 2.66 GiB │
    └─────────────────────────────────┘

    Run Some Queries

    The total distance travelled is 68 billion kilometers (geoDistance returns the distance in meters, hence the division by 1000 in the query).

    Query:

    SELECT formatReadableQuantity(sum(geoDistance(longitude_1, latitude_1, longitude_2, latitude_2)) / 1000) FROM opensky;

    Result:

    ┌─formatReadableQuantity(divide(sum(geoDistance(longitude_1, latitude_1, longitude_2, latitude_2)), 1000))─┐
    │ 68.72 billion │
    └──────────────────────────────────────────────────────────────────────────────────────────────────────────┘

    Average flight distance is around 1000 km.

    Query:

    SELECT avg(geoDistance(longitude_1, latitude_1, longitude_2, latitude_2)) FROM opensky;

    Result:

    ┌─avg(geoDistance(longitude_1, latitude_1, longitude_2, latitude_2))─┐
    │ 1041090.6465708319 │
    └────────────────────────────────────────────────────────────────────┘

    The busiest origin airports and the average distance seen

    Query:

    SELECT
    origin,
    count(),
    round(avg(geoDistance(longitude_1, latitude_1, longitude_2, latitude_2))) AS distance,
    bar(distance, 0, 10000000, 100) AS bar
    FROM opensky
    WHERE origin != ''
    GROUP BY origin
    ORDER BY count() DESC
    LIMIT 100;

    Result:

         ┌─origin─┬─count()─┬─distance─┬─bar────────────────────────────────────┐
    1. │ KORD │ 745007 │ 1546108 │ ███████████████▍ │
    2. │ KDFW │ 696702 │ 1358721 │ █████████████▌ │
    3. │ KATL │ 667286 │ 1169661 │ ███████████▋ │
    4. │ KDEN │ 582709 │ 1287742 │ ████████████▊ │
    5. │ KLAX │ 581952 │ 2628393 │ ██████████████████████████▎ │
    6. │ KLAS │ 447789 │ 1336967 │ █████████████▎ │
    7. │ KPHX │ 428558 │ 1345635 │ █████████████▍ │
    8. │ KSEA │ 412592 │ 1757317 │ █████████████████▌ │
    9. │ KCLT │ 404612 │ 880355 │ ████████▋ │
    10. │ VIDP │ 363074 │ 1445052 │ ██████████████▍ │
    11. │ EDDF │ 362643 │ 2263960 │ ██████████████████████▋ │
    12. │ KSFO │ 361869 │ 2445732 │ ████████████████████████▍ │
    13. │ KJFK │ 349232 │ 2996550 │ █████████████████████████████▊ │
    14. │ KMSP │ 346010 │ 1287328 │ ████████████▋ │
    15. │ LFPG │ 344748 │ 2206203 │ ██████████████████████ │
    16. │ EGLL │ 341370 │ 3216593 │ ████████████████████████████████▏ │
    17. │ EHAM │ 340272 │ 2116425 │ █████████████████████▏ │
    18. │ KEWR │ 337696 │ 1826545 │ ██████████████████▎ │
    19. │ KPHL │ 320762 │ 1291761 │ ████████████▊ │
    20. │ OMDB │ 308855 │ 2855706 │ ████████████████████████████▌ │
    21. │ UUEE │ 307098 │ 1555122 │ ███████████████▌ │
    22. │ KBOS │ 304416 │ 1621675 │ ████████████████▏ │
    23. │ LEMD │ 291787 │ 1695097 │ ████████████████▊ │
    24. │ YSSY │ 272979 │ 1875298 │ ██████████████████▋ │
    25. │ KMIA │ 265121 │ 1923542 │ ███████████████████▏ │
    26. │ ZGSZ │ 263497 │ 745086 │ ███████▍ │
    27. │ EDDM │ 256691 │ 1361453 │ █████████████▌ │
    28. │ WMKK │ 254264 │ 1626688 │ ████████████████▎ │
    29. │ CYYZ │ 251192 │ 2175026 │ █████████████████████▋ │
    30. │ KLGA │ 248699 │ 1106935 │ ███████████ │
    31. │ VHHH │ 248473 │ 3457658 │ ██████████████████████████████████▌ │
    32. │ RJTT │ 243477 │ 1272744 │ ████████████▋ │
    33. │ KBWI │ 241440 │ 1187060 │ ███████████▋ │
    34. │ KIAD │ 239558 │ 1683485 │ ████████████████▋ │
    35. │ KIAH │ 234202 │ 1538335 │ ███████████████▍ │
    36. │ KFLL │ 223447 │ 1464410 │ ██████████████▋ │
    37. │ KDAL │ 212055 │ 1082339 │ ██████████▋ │
    38. │ KDCA │ 207883 │ 1013359 │ ██████████▏ │
    39. │ LIRF │ 207047 │ 1427965 │ ██████████████▎ │
    40. │ PANC │ 206007 │ 2525359 │ █████████████████████████▎ │
    41. │ LTFJ │ 205415 │ 860470 │ ████████▌ │
    42. │ KDTW │ 204020 │ 1106716 │ ███████████ │
    43. │ VABB │ 201679 │ 1300865 │ █████████████ │
    44. │ OTHH │ 200797 │ 3759544 │ █████████████████████████████████████▌ │
    45. │ KMDW │ 200796 │ 1232551 │ ████████████▎ │
    46. │ KSAN │ 198003 │ 1495195 │ ██████████████▊ │
    47. │ KPDX │ 197760 │ 1269230 │ ████████████▋ │
    48. │ SBGR │ 197624 │ 2041697 │ ████████████████████▍ │
    49. │ VOBL │ 189011 │ 1040180 │ ██████████▍ │
    50. │ LEBL │ 188956 │ 1283190 │ ████████████▋ │
    51. │ YBBN │ 188011 │ 1253405 │ ████████████▌ │
    52. │ LSZH │ 187934 │ 1572029 │ ███████████████▋ │
    53. │ YMML │ 187643 │ 1870076 │ ██████████████████▋ │
    54. │ RCTP │ 184466 │ 2773976 │ ███████████████████████████▋ │
    55. │ KSNA │ 180045 │ 778484 │ ███████▋ │
    56. │ EGKK │ 176420 │ 1694770 │ ████████████████▊ │
    57. │ LOWW │ 176191 │ 1274833 │ ████████████▋ │
    58. │ UUDD │ 176099 │ 1368226 │ █████████████▋ │
    59. │ RKSI │ 173466 │ 3079026 │ ██████████████████████████████▋ │
    60. │ EKCH │ 172128 │ 1229895 │ ████████████▎ │
    61. │ KOAK │ 171119 │ 1114447 │ ███████████▏ │
    62. │ RPLL │ 170122 │ 1440735 │ ██████████████▍ │
    63. │ KRDU │ 167001 │ 830521 │ ████████▎ │
    64. │ KAUS │ 164524 │ 1256198 │ ████████████▌ │
    65. │ KBNA │ 163242 │ 1022726 │ ██████████▏ │
    66. │ KSDF │ 162655 │ 1380867 │ █████████████▋ │
    67. │ ENGM │ 160732 │ 910108 │ █████████ │
    68. │ LIMC │ 160696 │ 1564620 │ ███████████████▋ │
    69. │ KSJC │ 159278 │ 1081125 │ ██████████▋ │
    70. │ KSTL │ 157984 │ 1026699 │ ██████████▎ │
    71. │ UUWW │ 156811 │ 1261155 │ ████████████▌ │
    72. │ KIND │ 153929 │ 987944 │ █████████▊ │
    73. │ ESSA │ 153390 │ 1203439 │ ████████████ │
    74. │ KMCO │ 153351 │ 1508657 │ ███████████████ │
    75. │ KDVT │ 152895 │ 74048 │ ▋ │
    76. │ VTBS │ 152645 │ 2255591 │ ██████████████████████▌ │
    77. │ CYVR │ 149574 │ 2027413 │ ████████████████████▎ │
    78. │ EIDW │ 148723 │ 1503985 │ ███████████████ │
    79. │ LFPO │ 143277 │ 1152964 │ ███████████▌ │
    80. │ EGSS │ 140830 │ 1348183 │ █████████████▍ │
    81. │ KAPA │ 140776 │ 420441 │ ████▏ │
    82. │ KHOU │ 138985 │ 1068806 │ ██████████▋ │
    83. │ KTPA │ 138033 │ 1338223 │ █████████████▍ │
    84. │ KFFZ │ 137333 │ 55397 │ ▌ │
    85. │ NZAA │ 136092 │ 1581264 │ ███████████████▋ │
    86. │ YPPH │ 133916 │ 1271550 │ ████████████▋ │
    87. │ RJBB │ 133522 │ 1805623 │ ██████████████████ │
    88. │ EDDL │ 133018 │ 1265919 │ ████████████▋ │
    89. │ ULLI │ 130501 │ 1197108 │ ███████████▊ │
    90. │ KIWA │ 127195 │ 250876 │ ██▌ │
    91. │ KTEB │ 126969 │ 1189414 │ ███████████▊ │
    92. │ VOMM │ 125616 │ 1127757 │ ███████████▎ │
    93. │ LSGG │ 123998 │ 1049101 │ ██████████▍ │
    94. │ LPPT │ 122733 │ 1779187 │ █████████████████▋ │
    95. │ WSSS │ 120493 │ 3264122 │ ████████████████████████████████▋ │
    96. │ EBBR │ 118539 │ 1579939 │ ███████████████▋ │
    97. │ VTBD │ 118107 │ 661627 │ ██████▌ │
    98. │ KVNY │ 116326 │ 692960 │ ██████▊ │
    99. │ EDDT │ 115122 │ 941740 │ █████████▍ │
    100. │ EFHK │ 114860 │ 1629143 │ ████████████████▎ │
    └────────┴─────────┴──────────┴────────────────────────────────────────┘

    Number of flights from three major Moscow airports, weekly

    Query:

    SELECT
    toMonday(day) AS k,
    count() AS c,
    bar(c, 0, 10000, 100) AS bar
    FROM opensky
    WHERE origin IN ('UUEE', 'UUDD', 'UUWW')
    GROUP BY k
    ORDER BY k ASC;

    Recipes Dataset

    The RecipeNLG dataset is available for download from https://recipenlg.cs.put.poznan.pl/dataset. It contains 2.2 million recipes. The size is slightly less than 1 GB.

    Download and Unpack the Dataset

    1. Go to the download page https://recipenlg.cs.put.poznan.pl/dataset.
    2. Accept the Terms and Conditions and download the zip file.
    3. Unpack the zip file with unzip. You will get the full_dataset.csv file.

    Create a Table

    Run clickhouse-client and execute the following CREATE query:

    CREATE TABLE recipes
    (
    title String,
    ingredients Array(String),
    directions Array(String),
    link String,
    source LowCardinality(String),
    NER Array(String)
    ) ENGINE = MergeTree ORDER BY title;

    Insert the Data

    Run the following command:

    clickhouse-client --query "
    INSERT INTO recipes
    SELECT
    title,
    JSONExtract(ingredients, 'Array(String)'),
    JSONExtract(directions, 'Array(String)'),
    link,
    source,
    JSONExtract(NER, 'Array(String)')
    FROM input('num UInt32, title String, ingredients String, directions String, link String, source LowCardinality(String), NER String')
    FORMAT CSVWithNames
    " --input_format_with_names_use_header 0 --format_csv_allow_single_quote 0 --input_format_allow_errors_num 10 < full_dataset.csv

    This is a showcase of how to parse custom CSV, as it requires multiple tweaks.

    Explanation:

    • The dataset is in CSV format, but it requires some preprocessing on insertion; we use the table function input to perform the preprocessing;
    • The structure of the CSV file is specified in the argument of the table function input;
    • The field num (row number) is not needed - we parse it from the file and ignore it;
    • We use FORMAT CSVWithNames, but the header in the CSV will be ignored (via the command line parameter --input_format_with_names_use_header 0), because the header does not contain the name of the first field;
    • The file uses only double quotes to enclose CSV strings; some strings are not enclosed in double quotes, and a single quote must not be parsed as a string enclosure - that's why we also add the --format_csv_allow_single_quote 0 parameter;
    • Some strings from the CSV cannot be parsed, because they contain the \M/ sequence at the beginning of the value; the only value that may start with a backslash in CSV is \N, which is parsed as SQL NULL. We add the --input_format_allow_errors_num 10 parameter, so up to ten malformed records can be skipped;
    • There are arrays for the ingredients, directions and NER fields; these arrays are represented in an unusual form: they are serialized into strings as JSON and then placed in CSV - so we parse them as String and then use the JSONExtract function to transform them to Array (see the sketch right after this list).
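
    As a quick illustration of that last step, JSONExtract turns such a JSON-serialized string into a native array (a minimal sketch with a made-up literal):

    SELECT JSONExtract('["salt", "sugar", "butter"]', 'Array(String)') AS parsed;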

    Validate the Inserted Data

    By checking the row count:

    Query:

    SELECT count() FROM recipes;

    Result:

    ┌─count()─┐
    │ 2231141 │
    └─────────┘
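
    Beyond the total row count, a cheap extra sanity check is a breakdown over the LowCardinality source column (a sketch; the exact counts depend on your copy of the dataset):

    SELECT
        source,
        count() AS c
    FROM recipes
    GROUP BY source
    ORDER BY c DESC;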

    Example Queries

    Top Components by the Number of Recipes:

    In this example, we learn how to use the arrayJoin function to expand an array into a set of rows.

    Query:

    SELECT
        arrayJoin(NER) AS k,
        count() AS c
    FROM recipes
    GROUP BY k
    ORDER BY c DESC
    LIMIT 50

    Result:

    ┌─k────────────────────┬──────c─┐
    │ salt │ 890741 │
    │ sugar │ 620027 │
    │ butter │ 493823 │
    │ flour │ 466110 │
    │ eggs │ 401276 │
    │ onion │ 372469 │
    │ garlic │ 358364 │
    │ milk │ 346769 │
    │ water │ 326092 │
    │ vanilla │ 270381 │
    │ olive oil │ 197877 │
    │ pepper │ 179305 │
    │ brown sugar │ 174447 │
    │ tomatoes │ 163933 │
    │ egg │ 160507 │
    │ baking powder │ 148277 │
    │ lemon juice │ 146414 │
    │ Salt │ 122557 │
    │ cinnamon │ 117927 │
    │ sour cream │ 116682 │
    │ cream cheese │ 114423 │
    │ margarine │ 112742 │
    │ celery │ 112676 │
    │ baking soda │ 110690 │
    │ parsley │ 102151 │
    │ chicken │ 101505 │
    │ onions │ 98903 │
    │ vegetable oil │ 91395 │
    │ oil │ 85600 │
    │ mayonnaise │ 84822 │
    │ pecans │ 79741 │
    │ nuts │ 78471 │
    │ potatoes │ 75820 │
    │ carrots │ 75458 │
    │ pineapple │ 74345 │
    │ soy sauce │ 70355 │
    │ black pepper │ 69064 │
    │ thyme │ 68429 │
    │ mustard │ 65948 │
    │ chicken broth │ 65112 │
    │ bacon │ 64956 │
    │ honey │ 64626 │
    │ oregano │ 64077 │
    │ ground beef │ 64068 │
    │ unsalted butter │ 63848 │
    │ mushrooms │ 61465 │
    │ Worcestershire sauce │ 59328 │
    │ cornstarch │ 58476 │
    │ green pepper │ 58388 │
    │ Cheddar cheese │ 58354 │
    └──────────────────────┴────────┘

    50 rows in set. Elapsed: 0.112 sec. Processed 2.23 million rows, 361.57 MB (19.99 million rows/s., 3.24 GB/s.)
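
    Note that salt and Salt appear as separate keys in this result. If you want case-insensitive grouping, a minimal variation (a sketch) normalizes each key before aggregating:

    SELECT
        lowerUTF8(arrayJoin(NER)) AS k,
        count() AS c
    FROM recipes
    GROUP BY k
    ORDER BY c DESC
    LIMIT 10;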

    The Most Complex Recipes with Strawberry

    SELECT
        title,
        length(NER),
        length(directions)
    FROM recipes
    WHERE has(NER, 'strawberry')
    ORDER BY length(directions) DESC
    LIMIT 10

    Result:

    ┌─title────────────────────────────────────────────────────────────┬─length(NER)─┬─length(directions)─┐
    │ Chocolate-Strawberry-Orange Wedding Cake │ 24 │ 126 │
    │ Strawberry Cream Cheese Crumble Tart │ 19 │ 47 │
    │ Charlotte-Style Ice Cream │ 11 │ 45 │
    │ Sinfully Good a Million Layers Chocolate Layer Cake, With Strawb │ 31 │ 45 │
    │ Sweetened Berries With Elderflower Sherbet │ 24 │ 44 │
    │ Chocolate-Strawberry Mousse Cake │ 15 │ 42 │
    │ Rhubarb Charlotte with Strawberries and Rum │ 20 │ 42 │
    │ Chef Joey's Strawberry Vanilla Tart │ 7 │ 37 │
    │ Old-Fashioned Ice Cream Sundae Cake │ 17 │ 37 │
    │ Watermelon Cake │ 16 │ 36 │
    └──────────────────────────────────────────────────────────────────┴─────────────┴────────────────────┘

    10 rows in set. Elapsed: 0.215 sec. Processed 2.23 million rows, 1.48 GB (10.35 million rows/s., 6.86 GB/s.)

    In this example, we use the has function to filter by array elements and sort by the number of directions.

    There is a wedding cake that requires a whole 126 steps to produce! Let's look at those directions:

    Query:

    SELECT arrayJoin(directions)
    FROM recipes
    WHERE title = 'Chocolate-Strawberry-Orange Wedding Cake'
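
    If you prefer the steps numbered, one way to do it (a sketch) is to ARRAY JOIN over arrayEnumerate and index into the array:

    SELECT
        n AS step,
        directions[n] AS direction
    FROM recipes
    ARRAY JOIN arrayEnumerate(directions) AS n
    WHERE title = 'Chocolate-Strawberry-Orange Wedding Cake';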

    UK Property Price Paid

    The dataset contains data about prices paid for real-estate property in England and Wales. The data is available from 1995 onwards. The size of the dataset in uncompressed form is about 4 GiB, and it takes about 278 MiB in ClickHouse.

    Source: https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads
    Description of the fields: https://www.gov.uk/guidance/about-the-price-paid-data

    Contains HM Land Registry data © Crown copyright and database right 2021. This data is licensed under the Open Government Licence v3.0.

    Download the Dataset

    Run the command:

    wget http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-complete.csv

    The download will take about 2 minutes with a good internet connection.

    Create the Table

    CREATE TABLE uk_price_paid
    (
        price UInt32,
        date Date,
        postcode1 LowCardinality(String),
        postcode2 LowCardinality(String),
        type Enum8('terraced' = 1, 'semi-detached' = 2, 'detached' = 3, 'flat' = 4, 'other' = 0),
        is_new UInt8,
        duration Enum8('freehold' = 1, 'leasehold' = 2, 'unknown' = 0),
        addr1 String,
        addr2 String,
        street LowCardinality(String),
        locality LowCardinality(String),
        town LowCardinality(String),
        district LowCardinality(String),
        county LowCardinality(String),
        category UInt8
    ) ENGINE = MergeTree ORDER BY (postcode1, postcode2, addr1, addr2);
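
    Once the data is loaded, the Enum8 columns can be filtered by their symbolic names while storing only a single byte per value; a minimal sketch:

    SELECT count()
    FROM uk_price_paid
    WHERE type = 'detached' AND duration = 'freehold';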

    Preprocess and Import Data

    We will use the clickhouse-local tool for data preprocessing and clickhouse-client to upload the result.

    In this example, we define the structure of the source data from the CSV file and specify a query to preprocess the data with clickhouse-local.

    The preprocessing steps are:

    • splitting the postcode into two different columns, postcode1 and postcode2, which is better for storage and queries (illustrated in the sketch after this list);
    • converting the time field to date, as it only contains 00:00 time;
    • ignoring the uuid field, because we don't need it for analysis;
    • transforming the type and duration fields to more readable Enum fields with the transform function;
    • transforming the is_new and category fields from single-character strings (Y/N and A/B) to UInt8 fields with 0 and 1.
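
    The two key transformations can be tried on literals first (a minimal sketch with a made-up postcode and property-type code):

    SELECT
        splitByChar(' ', 'SW1A 1AA') AS p,
        p[1] AS postcode1,
        p[2] AS postcode2,
        transform('T', ['T', 'S', 'D', 'F', 'O'], ['terraced', 'semi-detached', 'detached', 'flat', 'other']) AS type;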

    The preprocessed data is piped directly to clickhouse-client and inserted into the ClickHouse table in a streaming fashion.

    clickhouse-local --input-format CSV --structure '
        uuid String,
        price UInt32,
        time DateTime,
        postcode String,
        a String,
        b String,
        c String,
        addr1 String,
        addr2 String,
        street String,
        locality String,
        town String,
        district String,
        county String,
        d String,
        e String
    ' --query "
        WITH splitByChar(' ', postcode) AS p
        SELECT
            price,
            toDate(time) AS date,
            p[1] AS postcode1,
            p[2] AS postcode2,
            transform(a, ['T', 'S', 'D', 'F', 'O'], ['terraced', 'semi-detached', 'detached', 'flat', 'other']) AS type,
            b = 'Y' AS is_new,
            transform(c, ['F', 'L', 'U'], ['freehold', 'leasehold', 'unknown']) AS duration,
            addr1,
            addr2,
            street,
            locality,
            town,
            district,
            county,
            d = 'B' AS category
        FROM table" --date_time_input_format best_effort < pp-complete.csv | clickhouse-client --query "INSERT INTO uk_price_paid FORMAT TSV"

    It will take about 40 seconds.

    Validate the Data

    Query:

    SELECT count() FROM uk_price_paid;

    Result:

    ┌──count()─┐
    │ 26321785 │
    └──────────┘

    The size of the dataset in ClickHouse is just 278 MiB; let's check it.

    Query:

    SELECT formatReadableSize(total_bytes) FROM system.tables WHERE name = 'uk_price_paid';

    Result:

    ┌─formatReadableSize(total_bytes)─┐
    │ 278.80 MiB                      │
    └─────────────────────────────────┘
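
    To see where that space goes, per-column compressed and uncompressed sizes can be read from the system.columns table (a sketch; the exact numbers will differ):

    SELECT
        name,
        formatReadableSize(data_compressed_bytes) AS compressed,
        formatReadableSize(data_uncompressed_bytes) AS uncompressed
    FROM system.columns
    WHERE table = 'uk_price_paid'
    ORDER BY data_compressed_bytes DESC
    LIMIT 5;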

    Run Some Queries

    Query 1. Average Price Per Year

    Query:

    SELECT toYear(date) AS year, round(avg(price)) AS price, bar(price, 0, 1000000, 80) FROM uk_price_paid GROUP BY year ORDER BY year;

    Result:

    ┌─year─┬──price─┬─bar(round(avg(price)), 0, 1000000, 80)─┐
    │ 1995 │ 67932 │ █████▍ │
    │ 1996 │ 71505 │ █████▋ │
    │ 1997 │ 78532 │ ██████▎ │
    │ 1998 │ 85436 │ ██████▋ │
    │ 1999 │ 96037 │ ███████▋ │
    │ 2000 │ 107479 │ ████████▌ │
    │ 2001 │ 118885 │ █████████▌ │
    │ 2002 │ 137941 │ ███████████ │
    │ 2003 │ 155889 │ ████████████▍ │
    │ 2004 │ 178885 │ ██████████████▎ │
    │ 2005 │ 189351 │ ███████████████▏ │
    │ 2006 │ 203528 │ ████████████████▎ │
    │ 2007 │ 219378 │ █████████████████▌ │
    │ 2008 │ 217056 │ █████████████████▎ │
    │ 2009 │ 213419 │ █████████████████ │
    │ 2010 │ 236109 │ ██████████████████▊ │
    │ 2011 │ 232805 │ ██████████████████▌ │
    │ 2012 │ 238367 │ ███████████████████ │
    │ 2013 │ 256931 │ ████████████████████▌ │
    │ 2014 │ 279915 │ ██████████████████████▍ │
    │ 2015 │ 297266 │ ███████████████████████▋ │
    │ 2016 │ 313201 │ █████████████████████████ │
    │ 2017 │ 346097 │ ███████████████████████████▋ │
    │ 2018 │ 350116 │ ████████████████████████████ │
    │ 2019 │ 351013 │ ████████████████████████████ │
    │ 2020 │ 369420 │ █████████████████████████████▌ │
    │ 2021 │ 386903 │ ██████████████████████████████▊ │
    └──────┴────────┴────────────────────────────────────────┘
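
    Averages are pulled up by expensive outliers; a variation of the same query using the median instead (a sketch, output not shown) gives a more robust picture:

    SELECT toYear(date) AS year, round(median(price)) AS price, bar(price, 0, 1000000, 80) FROM uk_price_paid GROUP BY year ORDER BY year;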

    Query 2. Average Price per Year in London

    Query:

    SELECT toYear(date) AS year, round(avg(price)) AS price, bar(price, 0, 2000000, 100) FROM uk_price_paid WHERE town = 'LONDON' GROUP BY year ORDER BY year;

    Result:

    ┌─year─┬───price─┬─bar(round(avg(price)), 0, 2000000, 100)───────────────┐
    │ 1995 │ 109116 │ █████▍ │
    │ 1996 │ 118667 │ █████▊ │
    │ 1997 │ 136518 │ ██████▋ │
    │ 1998 │ 152983 │ ███████▋ │
    │ 1999 │ 180637 │ █████████ │
    │ 2000 │ 215838 │ ██████████▋ │
    │ 2001 │ 232994 │ ███████████▋ │
    │ 2002 │ 263670 │ █████████████▏ │
    │ 2003 │ 278394 │ █████████████▊ │
    │ 2004 │ 304666 │ ███████████████▏ │
    │ 2005 │ 322875 │ ████████████████▏ │
    │ 2006 │ 356191 │ █████████████████▋ │
    │ 2007 │ 404054 │ ████████████████████▏ │
    │ 2008 │ 420741 │ █████████████████████ │
    │ 2009 │ 427753 │ █████████████████████▍ │
    │ 2010 │ 480306 │ ████████████████████████ │
    │ 2011 │ 496274 │ ████████████████████████▋ │
    │ 2012 │ 519442 │ █████████████████████████▊ │
    │ 2013 │ 616212 │ ██████████████████████████████▋ │
    │ 2014 │ 724154 │ ████████████████████████████████████▏ │
    │ 2015 │ 792129 │ ███████████████████████████████████████▌ │
    │ 2016 │ 843655 │ ██████████████████████████████████████████▏ │
    │ 2017 │ 982642 │ █████████████████████████████████████████████████▏ │
    │ 2018 │ 1016835 │ ██████████████████████████████████████████████████▋ │
    │ 2019 │ 1042849 │ ████████████████████████████████████████████████████▏ │
    │ 2020 │ 1011889 │ ██████████████████████████████████████████████████▌ │
    │ 2021 │ 960343 │ ████████████████████████████████████████████████ │
    └──────┴─────────┴───────────────────────────────────────────────────────┘

    Something happened in 2013. I don't have a clue. Maybe you have a clue what happened in 2020?

    Query 3. The Most Expensive Neighborhoods

    Query:

    SELECT
        town,
        district,
        count() AS c,
        round(avg(price)) AS price,
        bar(price, 0, 5000000, 100)
    FROM uk_price_paid
    WHERE date >= '2020-01-01'
    GROUP BY
        town,
        district
    HAVING c >= 100
    ORDER BY price DESC
    LIMIT 100;
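
    To zoom in on a single area, the same aggregation can be narrowed with an extra predicate (a sketch; 'GREATER LONDON' is an assumed value of the county column - verify the exact spelling in your copy of the data):

    SELECT
        district,
        count() AS c,
        round(avg(price)) AS price
    FROM uk_price_paid
    WHERE date >= '2020-01-01' AND county = 'GREATER LONDON'
    GROUP BY district
    HAVING c >= 100
    ORDER BY price DESC
    LIMIT 10;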