• drill 学习 四 drill参考案例


    具体案例来自官方文档,只是简单的说明下,详细的可以参考官方文档

    环境准备

    • 运行drill
      测试环境可以通过docker运行,简单方便

    简单查询

    • 查询json文件
     
    SELECT * FROM cp.`employee.json` LIMIT 3;
    • 查询parquet文件
    SELECT * FROM dfs.`Users/drilluser/apache-drill/sample-data/region.parquet`;

    分析Yelp Academic Dataset

    注意数据下载地址好像不存在或者变动了

    • 查看业务数据
    select * from dfs.`<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json
    • 总数统计
    select sum(review_count) as totalreviews from dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json`;
    • top 10 统计
    select state, city, count(*) totalreviews from dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json` group by state, city order by count(*) desc limit 10
    • 统计平均值
      trunc 函数的意思是返回包含小数点.0 而不是返回一个整数
     
    select stars,trunc(avg(review_count)) reviewsavg 
    from dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json`
    group by stars order by stars desc;
    • 查询json 数组包含Restaurants的统计
    select name,repeated_count(categories) as categorycount, categories from dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json` where true=repeated_contains(categories,'Restaurants') order by repeated_count(categories) desc limit 10;
    • json 数据的扁平化
      json 数组转为table
     
    select name, flatten(categories) as category 
    from dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json`  limit 20;

    分析动态数据

    核心还是一个分析json文件的,只是我们的json是一个比较复杂的,我们需要基于drill提供的一些函数方便的解析数据

    • 参考数据格式
    check-in
    {
        'type': 'checkin',
        'business_id': (encrypted business id),
        'checkin_info': {
            '0-0': (number of checkins from 00:00 to 01:00 on all Sundays),
            '1-0': (number of checkins from 01:00 to 02:00 on all Sundays),
            ...
            '14-4': (number of checkins from 14:00 to 15:00 on all Thursdays),
            ...
            '23-6': (number of checkins from 23:00 to 00:00 on all Saturdays)
        }, # if there was no checkin for a hour-day block it will not be in the dataset
    }
    • key-value 映射
      基于KVGEN函数
     
    SELECT KVGEN(checkin_info) checkins FROM dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_checkin.json` LIMIT 2;
    • 统计总数
      集合KVGEN,FLATTEN 以及SUM函数
     
    SELECT SUM(checkintbl.checkins.`value`) AS TotalCheckins FROM (
    SELECT FLATTEN(KVGEN(checkin_info)) checkins FROM dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_checkin.json` ) checkintbl;
    • 统计一天每个小时的总数
    SELECT SUBSTR(checkintbl.checkins.key,1,strpos(checkintbl.checkins.key,'-')-1) AS HourOfTheDay ,SUM(checkintbl.checkins.`value`) TotalCheckins FROM (
    SELECT FLATTEN(KVGEN(checkin_info)) checkins FROM dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_checkin.json` ) checkintbl GROUP BY SUBSTR(checkintbl.checkins.key,1,strpos(checkintbl.checkins.key,'-')-1) ORDER BY TotalCheckins;

    窗口函数使用

    drill 支持的窗口函数:
    使用over 从句的partition by
    sum,max,min,count,avg sql 函数
    分析函数first_value,last_value,Lead,Lag,Ntitle,Row_number,Rank

    • 基于window函数的简单查询
      统计每个城市基于浏览数的信息
     
    SELECT name, city, review_count, row_number()
    OVER (PARTITION BY city ORDER BY review_count DESC) AS rownum 
    FROM `business.json` LIMIT 15; 
    • 复杂高级查询
      基于with 语句
      按评论数列出前10个城市及其排名最高的企业
     
    WITH X
    AS
    (SELECT name, city, review_count,
    RANK()
    OVER (PARTITION BY city
    ORDER BY review_count DESC) AS review_rank
    FROM `business.json`)
    SELECT X.name, X.city, X.review_count
    FROM X
    WHERE X.review_rank =1 ORDER BY review_count DESC LIMIT 10;
    • 将每个企业的评论数与城市中排名最高和最低的评论数进行比较
    SELECT name, city, review_count,
    FIRST_VALUE(review_count)
    OVER(PARTITION BY city ORDER BY review_count DESC) AS top_review_count,
    LAST_VALUE(review_count)
    OVER(PARTITION BY city ORDER BY review_count DESC) AS bottom_review_count
    FROM `business.json` limit 15;
    • 将评论数与上一个和下一个业务的评论数进行比较
    SELECT city, review_count, name,
    LAG(review_count, 1) OVER(PARTITION BY city ORDER BY review_count DESC) 
    AS preceding_count,
    LEAD(review_count, 1) OVER(PARTITION BY city ORDER BY review_count DESC) 
    AS following_count
    FROM `business.json` limit 15;

    说明

    drill 提供的sql函数还是很强大的,丰富而且灵活,更多的我们可以参考官方文档

    参考资料

    https://drill.apache.org/docs/tutorials-introduction/

  • 相关阅读:
    elasticsearch + kibana 安装
    JSON Web Token
    JConsole和VisualVM远程访问JMX
    DNS域名解析
    Spring Cloud(2)主要组件应用实例
    Docker(三):Docker入门教程-CentOS Docker 安装
    Spring Cloud(1)相关概念
    Zookeeper实现哨兵机制
    大数据学习笔记之初识Hadoop
    大数据学习笔记之大数据概述
  • 原文地址:https://www.cnblogs.com/rongfengliang/p/13766706.html
Copyright © 2020-2023  润新知