• 使用Data Lake Analytics + OSS分析CSV格式的TPC-H数据集


    0. Data Lake Analytics(DLA)简介

    关于Data Lake的概念,更多阅读可以参考:
    https://en.wikipedia.org/wiki/Data_lake

    以及AWS和Azure关于Data Lake的解读:
    https://amazonaws-china.com/big-data/datalakes-and-analytics/what-is-a-data-lake/
    https://azure.microsoft.com/en-us/solutions/data-lake/

    终于,阿里云现在也有了自己的数据湖分析产品:https://www.aliyun.com/product/datalakeanalytics

    可以点击申请使用(目前公测阶段还属于邀测模式,我们会尽快审批申请),体验本教程的TPC-H CSV数据格式的数据分析之旅。

    产品文档:https://help.aliyun.com/product/70174.html

    1. 开通Data Lake Analytics与OSS服务

    如果您已经开通,可以跳过该步骤。如果没有开通,可以参考:https://help.aliyun.com/document_detail/70386.html
    进行产品开通服务申请。

    2. 下载TPC-H测试数据集

    可以从这下载TPC-H 100MB的数据集:
    https://public-datasets-cn-hangzhou.oss-cn-hangzhou.aliyuncs.com/tpch_100m_data.zip

    3. 上传数据文件到OSS

    登录阿里云官网的OSS控制台:https://oss.console.aliyun.com/overview
    规划您要使用的OSS bucket,创建或选择好后,点击“文件管理”,因为有8个数据文件,为每个数据文件创建对应的文件目录:

    创建好8个目录如下:

    点击进入目录,上传相应的数据文件,例如,customer目录,则上传customer.tbl文件。

    上传好后,如下图。然后,依次把其他7个数据文件也上传到对应的目录下。

    至此,8个数据文件都上传到了您的OSS bucket中:

    oss://xxx/tpch_100m/customer/customer.tbl
    oss://xxx/tpch_100m/lineitem/lineitem.tbl
    oss://xxx/tpch_100m/nation/nation.tbl
    oss://xxx/tpch_100m/orders/orders.tbl
    oss://xxx/tpch_100m/part/part.tbl
    oss://xxx/tpch_100m/partsupp/partsupp.tbl
    oss://xxx/tpch_100m/region/region.tbl
    oss://xxx/tpch_100m/supplier/supplier.tbl
    

    4. 登录Data Lake Analytics控制台

    https://openanalytics.console.aliyun.com/
    点击“登录数据库”,输入开通服务时分配的用户名和密码,登录Data Lake Analytics控制台。

    5. 创建Schema和Table

    输入创建SCHEMA的语句,点击“同步执行”。

    CREATE SCHEMA tpch_100m with DBPROPERTIES(
      LOCATION = 'oss://test-bucket-julian-1/tpch_100m/',
      catalog='oss'
    );
    
    

    (注意:目前在同一个阿里云region,Data Lake Analytics的schema名全局唯一,建议schema名尽量根据业务定义,已有重名schema,在创建时会提示报错,则请换一个schema名字。)

    Schema创建好后,在“数据库”的下拉框中,选择刚刚创建的schema。然后在SQL文本框中输入建表语句,点击同步执行。
    建表语句语法参考:https://help.aliyun.com/document_detail/72006.html

    TPC-H对应的8个表的建表语句如下,分别贴入文档框中执行(LOCATION子句中的数据文件位置请根据您的实际OSS bucket目录相应修改)。(注意:目前控制台中还不支持多个SQL语句执行,请单条语句执行。)

    CREATE EXTERNAL TABLE nation (
        N_NATIONKEY INT, 
        N_NAME STRING,
           N_ID STRING,
        N_REGIONKEY INT, 
        N_COMMENT STRING
    ) 
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' 
    STORED AS TEXTFILE 
    LOCATION 'oss://test-bucket-julian-1/tpch_100m/nation';
    
    CREATE EXTERNAL TABLE lineitem (
        L_ORDERKEY INT, 
        L_PARTKEY INT, 
        L_SUPPKEY INT, 
        L_LINENUMBER INT, 
        L_QUANTITY DOUBLE, 
        L_EXTENDEDPRICE DOUBLE, 
        L_DISCOUNT DOUBLE, 
        L_TAX DOUBLE, 
        L_RETURNFLAG STRING, 
        L_LINESTATUS STRING, 
        L_SHIPDATE DATE, 
        L_COMMITDATE DATE, 
        L_RECEIPTDATE DATE, 
        L_SHIPINSTRUCT STRING, 
        L_SHIPMODE STRING, 
        L_COMMENT STRING
    ) 
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' 
    STORED AS TEXTFILE 
    LOCATION 'oss://test-bucket-julian-1/tpch_100m/lineitem';
    
    CREATE EXTERNAL TABLE orders (
        O_ORDERKEY INT, 
        O_CUSTKEY INT, 
        O_ORDERSTATUS STRING, 
        O_TOTALPRICE DOUBLE, 
        O_ORDERDATE DATE, 
        O_ORDERPRIORITY STRING, 
        O_CLERK STRING, 
        O_SHIPPRIORITY INT, 
        O_COMMENT STRING
    ) 
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' 
    STORED AS TEXTFILE 
    LOCATION 'oss://test-bucket-julian-1/tpch_100m/orders';
    
    CREATE EXTERNAL TABLE supplier (
        S_SUPPKEY INT, 
        S_NAME STRING, 
        S_ADDRESS STRING, 
        S_NATIONKEY INT, 
        S_PHONE STRING, 
        S_ACCTBAL DOUBLE, 
        S_COMMENT STRING
    ) 
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' 
    STORED AS TEXTFILE 
    LOCATION 'oss://test-bucket-julian-1/tpch_100m/supplier';
    
    CREATE EXTERNAL TABLE partsupp (
        PS_PARTKEY INT, 
        PS_SUPPKEY INT, 
        PS_AVAILQTY INT, 
        PS_SUPPLYCOST DOUBLE, 
        PS_COMMENT STRING
    ) 
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' 
    STORED AS TEXTFILE 
    LOCATION 'oss://test-bucket-julian-1/tpch_100m/partsupp';
    
    CREATE EXTERNAL TABLE customer (
        C_CUSTKEY INT, 
        C_NAME STRING, 
        C_ADDRESS STRING, 
        C_NATIONKEY INT, 
        C_PHONE STRING, 
        C_ACCTBAL DOUBLE, 
        C_MKTSEGMENT STRING, 
        C_COMMENT STRING
    ) 
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' 
    STORED AS TEXTFILE 
    LOCATION 'oss://test-bucket-julian-1/tpch_100m/customer';
    
    CREATE EXTERNAL TABLE part (
        P_PARTKEY INT, 
        P_NAME STRING, 
        P_MFGR STRING, 
        P_BRAND STRING, 
        P_TYPE STRING, 
        P_SIZE INT, 
        P_CONTAINER STRING, 
        P_RETAILPRICE DOUBLE, 
        P_COMMENT STRING
    ) 
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' 
    STORED AS TEXTFILE 
    LOCATION 'oss://test-bucket-julian-1/tpch_100m/part';
    
    CREATE EXTERNAL TABLE region (
        R_REGIONKEY INT, 
        R_NAME STRING, 
        R_COMMENT STRING
    ) 
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' 
    STORED AS TEXTFILE 
    LOCATION 'oss://test-bucket-julian-1/tpch_100m/region';
    

    建表完毕后,刷新页面,在左边导航条中能看到schema下的8张表。

    6. 执行TPC-H查询

    TPC-H总共22条查询,如下:
    Q1:

    SELECT   l_returnflag,
             l_linestatus,
             Sum(l_quantity)                                       AS sum_qty,
             Sum(l_extendedprice)                                  AS sum_base_price,
             Sum(l_extendedprice * (1 - l_discount))               AS sum_disc_price,
             Sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
             Avg(l_quantity)                                       AS avg_qty,
             Avg(l_extendedprice)                                  AS avg_price,
             Avg(l_discount)                                       AS avg_disc,
             Count(*)                                              AS count_order
    FROM     lineitem
    WHERE    l_shipdate <= date '1998-12-01' - INTERVAL '93' day
    GROUP BY l_returnflag,
             l_linestatus
    ORDER BY l_returnflag,
             l_linestatus
    LIMIT    1;
    

    Q2:

    SELECT   s_acctbal,
             s_name,
             n_name,
             p_partkey,
             p_mfgr,
             s_address,
             s_phone,
             s_comment
    FROM     part,
             supplier,
             partsupp,
             nation,
             region
    WHERE    p_partkey = ps_partkey
    AND      s_suppkey = ps_suppkey
    AND      p_size = 35
    AND      p_type LIKE '%NICKEL'
    AND      s_nationkey = n_nationkey
    AND      n_regionkey = r_regionkey
    AND      r_name = 'MIDDLE EAST'
    

    Q3:

    SELECT   l_orderkey,
             Sum(l_extendedprice * (1 - l_discount)) AS revenue,
             o_orderdate,
             o_shippriority
    FROM     customer,
             orders,
             lineitem
    WHERE    c_mktsegment = 'AUTOMOBILE'
    AND      c_custkey = o_custkey
    AND      l_orderkey = o_orderkey
    AND      o_orderdate < date '1995-03-31'
    AND      l_shipdate >  date '1995-03-31'
    GROUP BY l_orderkey,
             o_orderdate,
             o_shippriority
    ORDER BY revenue DESC,
             o_orderdate
    LIMIT    10;
    

    Q4:

    SELECT   o_orderpriority,
             Count(*) AS order_count
    FROM     orders,
             lineitem
    WHERE    o_orderdate >= date '1997-10-01'
    AND      o_orderdate <  date '1997-10-01' + INTERVAL '3' month
    AND      l_orderkey = o_orderkey
    AND      l_commitdate < l_receiptdate
    GROUP BY o_orderpriority
    ORDER BY o_orderpriority
    LIMIT    1;
    

    Q5:

    SELECT   n_name,
             Sum(l_extendedprice * (1 - l_discount)) AS revenue
    FROM     customer,
             orders,
             lineitem,
             supplier,
             nation,
             region
    WHERE    c_custkey = o_custkey
    AND      l_orderkey = o_orderkey
    AND      l_suppkey = s_suppkey
    AND      c_nationkey = s_nationkey
    AND      s_nationkey = n_nationkey
    AND      n_regionkey = r_regionkey
    AND      r_name = 'ASIA'
    AND      o_orderdate >= date '1995-01-01'
    AND      o_orderdate <  date '1995-01-01' + INTERVAL '1' year
    GROUP BY n_name
    ORDER BY revenue DESC
    LIMIT    1;
    

    Q6:

    SELECT sum(l_extendedprice * l_discount) AS revenue
    FROM lineitem
    WHERE l_shipdate >= date '1995-01-01'
    AND l_shipdate < date '1995-01-01' + interval '1' year
    AND l_discount between 0.04 - 0.01 AND 0.04 + 0.01
    AND l_quantity < 24
    LIMIT 1;
    

    Q7:

    SELECT   supp_nation,
             cust_nation,
             l_year,
             Sum(volume) AS revenue
    FROM     (
                    SELECT n1.n_name                          AS supp_nation,
                           n2.n_name                          AS cust_nation,
                           Extract(year FROM l_shipdate)      AS l_year,
                           l_extendedprice * (1 - l_discount) AS volume
                    FROM   supplier,
                           lineitem,
                           orders,
                           customer,
                           nation n1,
                           nation n2
                    WHERE  s_suppkey = l_suppkey
                    AND    o_orderkey = l_orderkey
                    AND    c_custkey = o_custkey
                    AND    s_nationkey = n1.n_nationkey
                    AND    c_nationkey = n2.n_nationkey
                    AND    ( (
                                         n1.n_name = 'GERMANY'
                                  AND    n2.n_name = 'INDIA')
                           OR     (
                                         n1.n_name = 'INDIA'
                                  AND    n2.n_name = 'GERMANY') )
                    AND    l_shipdate BETWEEN date '1995-01-01' AND    date '1996-12-31' ) AS shipping
    GROUP BY supp_nation,
             cust_nation,
             l_year
    ORDER BY supp_nation,
             cust_nation,
             l_year
    LIMIT    1;
    

    Q8:

    SELECT   o_year,
             Sum(
             CASE
                      WHEN nation = 'INDIA' THEN volume
                      ELSE 0
             end) / Sum(volume) AS mkt_share
    FROM     (
                    SELECT Extract(year FROM o_orderdate)     AS o_year,
                           l_extendedprice * (1 - l_discount) AS volume,
                           n2.n_name                          AS nation
                    FROM   part,
                           supplier,
                           lineitem,
                           orders,
                           customer,
                           nation n1,
                           nation n2,
                           region
                    WHERE  p_partkey = l_partkey
                    AND    s_suppkey = l_suppkey
                    AND    l_orderkey = o_orderkey
                    AND    o_custkey = c_custkey
                    AND    c_nationkey = n1.n_nationkey
                    AND    n1.n_regionkey = r_regionkey
                    AND    r_name = 'ASIA'
                    AND    s_nationkey = n2.n_nationkey
                    AND    o_orderdate BETWEEN date '1995-01-01' AND    date '1996-12-31'
                    AND    p_type = 'STANDARD ANODIZED STEEL' ) AS all_nations
    GROUP BY o_year
    ORDER BY o_year
    LIMIT    1;
    

    Q9:

    SELECT   nation,
             o_year,
             Sum(amount) AS sum_profit
    FROM     (
                    SELECT n_name                                                          AS nation,
                           Extract(year FROM o_orderdate) AS o_year,
                           l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity AS amount
                    FROM   part,
                           supplier,
                           lineitem,
                           partsupp,
                           orders,
                           nation
                    WHERE  s_suppkey = l_suppkey
                    AND    ps_suppkey = l_suppkey
                    AND    ps_partkey = l_partkey
                    AND    p_partkey = l_partkey
                    AND    o_orderkey = l_orderkey
                    AND    s_nationkey = n_nationkey
                    AND    p_name LIKE '%aquamarine%' ) AS profit
    GROUP BY nation,
             o_year
    ORDER BY nation,
             o_year DESC
    LIMIT    1;
    

    Q10:

    SELECT   c_custkey,
             c_name,
             Sum(l_extendedprice * (1 - l_discount)) AS revenue,
             c_acctbal,
             n_name,
             c_address,
             c_phone,
             c_comment
    FROM     customer,
             orders,
             lineitem,
             nation
    WHERE    c_custkey = o_custkey
    AND      l_orderkey = o_orderkey
    AND      o_orderdate >= date '1994-08-01'
    AND      o_orderdate <  date '1994-08-01' + INTERVAL '3' month
    AND      l_returnflag = 'R'
    AND      c_nationkey = n_nationkey
    GROUP BY c_custkey,
             c_name,
             c_acctbal,
             c_phone,
             n_name,
             c_address,
             c_comment
    ORDER BY revenue DESC
    LIMIT    20;
    

    Q11:

    SELECT   ps_partkey,
             Sum(ps_supplycost * ps_availqty) AS value
    FROM     partsupp,
             supplier,
             nation
    WHERE    ps_suppkey = s_suppkey
    AND      s_nationkey = n_nationkey
    AND      n_name = 'PERU'
    GROUP BY ps_partkey
    HAVING   Sum(ps_supplycost * ps_availqty) >
    (
    SELECT Sum(ps_supplycost * ps_availqty) * 0.0001000000 as sum_value
      FROM partsupp,
           supplier,
           nation
      WHERE  ps_suppkey = s_suppkey
      AND    s_nationkey = n_nationkey
      AND    n_name = 'PERU'
    )
    ORDER BY value DESC
    LIMIT    1;
    

    Q12:

    SELECT l_shipmode, sum(case when o_orderpriority = '1-URGENT' or o_orderpriority = '2-HIGH' then 1
        else 0
    end) AS high_line_count, sum(case when o_orderpriority <> '1-URGENT' and o_orderpriority <> '2-HIGH' then 1
    else 0
    end) AS low_line_count
    FROM orders,
         lineitem
    WHERE o_orderkey = l_orderkey
    AND l_shipmode in ('MAIL', 'TRUCK')
    AND l_commitdate < l_receiptdate
    AND l_shipdate < l_commitdate
    AND l_receiptdate >= date '1996-01-01'
    AND l_receiptdate < date '1996-01-01' + interval '1' year
    GROUP BY l_shipmode
    ORDER BY l_shipmode
    LIMIT 1;
    

    Q13:

    SELECT c_count, count(*) AS custdist
    FROM (
        SELECT c_custkey, count(o_orderkey) AS c_count
        FROM customer,
             orders
        WHERE c_custkey = o_custkey
        AND o_comment NOT LIKE '%pending%accounts%'
        GROUP BY c_custkey ) AS c_orders
    GROUP BY c_count
    ORDER BY custdist DESC, c_count DESC
    LIMIT 1;
    

    Q14:

    SELECT 100.00 * sum(case when p_type like 'PROMO%' then l_extendedprice * (1 - l_discount)
        else 0
    end) / sum(l_extendedprice * (1 - l_discount)) AS promo_revenue
    FROM lineitem,
         part
    WHERE l_partkey = p_partkey
    AND l_shipdate >= date '1996-01-01'
    AND l_shipdate < date '1996-01-01' + interval '1' month
    LIMIT 1;
    

    Q15:

    WITH revenue0 AS
    (
    SELECT l_suppkey AS supplier_no, sum(l_extendedprice * (1 - l_discount)) AS total_revenue
    FROM lineitem
    WHERE l_shipdate >= date '1993-01-01'
    AND l_shipdate < date '1993-01-01' + interval '3' month
    GROUP BY l_suppkey
    )
    SELECT s_suppkey, s_name, s_address, s_phone, total_revenue
    FROM supplier, revenue0
    WHERE s_suppkey = supplier_no
    AND total_revenue IN (
        SELECT max(total_revenue)
        FROM revenue0 )
    ORDER BY s_suppkey;
    

    Q16:

    SELECT p_brand, p_type, p_size, count(distinct ps_suppkey) AS supplier_cnt
    FROM partsupp,
         part
    WHERE p_partkey = ps_partkey
    AND p_brand <> 'Brand#23'
    AND p_type NOT LIKE 'PROMO BURNISHED%'
    AND p_size IN (1, 13, 10, 28, 21, 35, 31, 11)
    AND ps_suppkey NOT IN (
        SELECT s_suppkey
        FROM supplier
        WHERE s_comment LIKE '%Customer%Complaints%' )
    GROUP BY p_brand, p_type, p_size
    ORDER BY supplier_cnt DESC, p_brand, p_type, p_size
    LIMIT 1;
    

    Q17:

    SELECT
        sum(l_extendedprice) / 7.0 AS avg_yearly
    FROM
        lineitem,
        part
    WHERE p_partkey = l_partkey
        AND p_brand = 'Brand#44'
        AND p_container = 'WRAP PKG'
        AND l_quantity < (
            SELECT
                0.2 * avg(l_quantity)
            FROM
                lineitem, part
            WHERE
                l_partkey = p_partkey
        );
    

    Q18:

    SELECT c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice, sum(l_quantity)
    FROM customer,
         orders,
         lineitem
    WHERE o_orderkey IN (
          SELECT l_orderkey
          FROM lineitem
          GROUP BY l_orderkey
          HAVING sum(l_quantity) > 315 )
    AND c_custkey = o_custkey
    AND o_orderkey = l_orderkey
    GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice
    ORDER BY o_totalprice DESC, o_orderdate
    LIMIT 100;
    

    Q19:

    SELECT sum(l_extendedprice* (1 - l_discount)) AS revenue
    FROM lineitem,
         part
    WHERE ( p_partkey = l_partkey and p_brand = 'Brand#12'
            and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
            and l_quantity >= 6 and l_quantity <= 6 + 10
            and p_size between 1 and 5
            and l_shipmode in ('AIR', 'AIR REG')
            and l_shipinstruct = 'DELIVER IN PERSON' )
        or ( p_partkey = l_partkey and p_brand = 'Brand#13'
            and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
            and l_quantity >= 10 and l_quantity <= 10 + 10
            and p_size between 1 and 10
            and l_shipmode in ('AIR', 'AIR REG')
            and l_shipinstruct = 'DELIVER IN PERSON' )
        or ( p_partkey = l_partkey and p_brand = 'Brand#24'
            and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
            and l_quantity >= 21 and l_quantity <= 21 + 10
            and p_size between 1 and 15
            and l_shipmode in ('AIR', 'AIR REG')
            and l_shipinstruct = 'DELIVER IN PERSON' )
    LIMIT 1;
    

    Q20:

    with temp_table as
    (
     select 0.5 * sum(l_quantity) as col1
     from lineitem,
          partsupp
     where l_partkey = ps_partkey and l_suppkey = ps_suppkey
     and l_shipdate >= date '1993-01-01'
     and l_shipdate < date '1993-01-01' + interval '1' year
    )
    select s_name, s_address
    from supplier,
         nation
    where s_suppkey in (
        select ps_suppkey
        from partsupp,
             temp_table
        where ps_partkey in (
            select p_partkey
            from part
            where p_name like 'dark%' )
            and ps_availqty > temp_table.col1 )
        and s_nationkey = n_nationkey and n_name = 'JORDAN'
    order by s_name
    limit 1;
    

    Q21:

    select
        s_name,
        count(*) as numwait
    from
        supplier,
        lineitem l1,
        orders,
        nation
    where
        s_suppkey = l1.l_suppkey
        and o_orderkey = l1.l_orderkey
        and o_orderstatus = 'F'
        and l1.l_receiptdate > l1.l_commitdate
        and exists (
            select
                *
            from
                lineitem l2
            where
                l2.l_orderkey = l1.l_orderkey
                and l2.l_suppkey <> l1.l_suppkey
        )
        and not exists (
            select
                *
            from
                lineitem l3
            where
                l3.l_orderkey = l1.l_orderkey
                and l3.l_suppkey <> l1.l_suppkey
                and l3.l_receiptdate > l3.l_commitdate
        )
        and s_nationkey = n_nationkey
        and n_name = 'SAUDI ARABIA'
    group by
        s_name
    order by
        numwait desc,
        s_name
    limit 100;
    

    Q22:

    with temp_table_1 as
    (
      select avg(c_acctbal) as avg_value
      from customer
      where c_acctbal > 0.00 and substring(c_phone from 1 for 2)
      in ('33', '29', '37', '35', '25', '27', '43')
    ),
    temp_table_2 as
    (
      select count(*) as count1
      from orders, customer
      where o_custkey = c_custkey
    )
    select cntrycode, count(*) as numcust, sum(c_acctbal) as totacctbal
    from (
        select substring(c_phone from 1 for 2) as cntrycode, c_acctbal
        from customer, temp_table_1, temp_table_2
        where substring(c_phone
            from 1
            for 2) in ('33', '29', '37', '35', '25', '27', '43')
            and c_acctbal > temp_table_1.avg_value
            and temp_table_2.count1 = 0) as custsale
    group by cntrycode
    order by cntrycode
    limit 1;
    

    7. 异步执行查询

    Data Lake Analytics支持“同步执行”模式和“异步执行”模式。“同步执行”模式下,控制台界面等待执行结果返回;“异步执行”模式下,立刻返回查询任务的ID。

    点击“执行状态”,可以看到该异步查询任务的执行状态,主要分为:“RUNNING”,“SUCCESS”,“FAILURE”。

    点击“刷新”,当STATUS变为“SUCCESS”时,表示查询成功,同时可查看查询耗时“ELAPSE_TIME”和查询扫描的数据字节数“SCANNED_DATA_BYTES”。

    8. 查看查询历史

    点击“执行历史”,可以看到您执行的查询的历史详细信息,包括:
    1)查询语句;
    2)查询耗时与执行具体时间;
    3)查询结果返回行数;
    4)查询状态;
    5)查询扫描的字节数;
    6)结果集回写到的目标OSS文件(Data Lake Analytics会将查询结果集保存用户的bucket中)。

    查询结果文件自动上传到用户同region的OSS bucket中,其中包括结果数据文件和结果集元数据描述文件。

    {QueryLocation}/{query_name}|Unsaved}/{yyyy}/{mm}/{dd}/{query_id}/xxx.csv
    {QueryLocation}/{query_name}|Unsaved}/{yyyy}/{mm}/{dd}/{query_id}/xxx.csv.metadata
    

    其中QueryLocation为:

    aliyun-oa-query-results-<your_account_id>-<oss_region>
    

    9. 后续

    至此,本教程一步一步教您如何利用Data Lake Analytics云产品分析您OSS上的CSV格式的数据文件。除了CSV文件外,Data Lake Analytics还支持Parquet、ORC、json、RCFile、AVRO等多种格式文件的数据分析能力。特别是Parquet、ORC,相比CSV文件,有极大的性能和成本优势(同样内容的数据集,拥有更小的存储空间、更快的查询性能,这也意味着更低的分析成本)。
    后续陆续会有更多教程和文章,手把手教您轻松使用Data Lake Analytics进行数据湖上数据分析和探索,开启您的云上低成本、即存即用的数据分析和探索之旅。

     

     

    原文链接
    更多技术干货 请关注阿里云云栖社区微信号 :yunqiinsight

  • 相关阅读:
    mysql数据库
    Mysql之sql语句操作
    mysql修改root密码的多种方法
    kvm虚拟化
    清华AIOps算法:KPI聚类
    有点扯的预测方法
    内网安全运营的逻辑体系架构
    SpringBoot定时消费Kafka消息
    kafka的consumer消费能力很低的情况下的处理方案
    Kafka_Kafka 消费者 偏移量 与 积压 查询脚本 kafka-consumer-groups.sh
  • 原文地址:https://www.cnblogs.com/zhaowei121/p/10511336.html
Copyright © 2020-2023  润新知