• OLAP之Druid之查询


    数据查询

    Druid的聚合查询主要有三种形式:

    • Timeseries
    • TopN
    • GroupBy

    一般而言,OLAP系统最核心的能力是GroupBy查询,Druid也不例外。 但是GroupBy查询资源消耗较多,TopNTimeseries作为GroupBy的有益补充,能够改善查询的性能。我们建议:如果TopNTimeseries能够满足业务的应用场景,那么尽量采用这两种查询,而非GroupBy

    Druid提供RESTful的查询接口,用户使用JSON表达查询意图。

    查询命令:

    curl -X POST 'broker:<port>/druid/v2/?pretty' -H 'Content-Type:application/json' -d @<query_json_file>

    注意点

    在Druid查询中,过滤条件是所有查询都可能涉及的部分,并且有一些使用技巧,需要特别注意。请参考Filters

    指标聚合这部分也是非常重要的,Aggregations也提供了系统的介绍,此处就不再赘述了。我们需要指出的是,这一页文档中Filtered Aggregator能够提供非常强大的查询功能,比如在查询过程中根据维度取值定制指标。

    GroupBy

    示例

    {
      "queryType": "groupBy",
      "dataSource": "sample_datasource",
      "granularity": "day",
      "dimensions": ["country", "device"], #需要聚合的维度列
      "limitSpec": { "type": "default", "limit": 5000, "columns": ["country", "data_transfer"] }, #limit语句
      "filter": { #过滤条件
        "type": "and",
        "fields": [
          { "type": "selector", "dimension": "carrier", "value": "AT&T" },
          { "type": "or", 
            "fields": [
              { "type": "selector", "dimension": "make", "value": "Apple" },
              { "type": "selector", "dimension": "make", "value": "Samsung" }
            ]
          }
        ]
      },
      "aggregations": [ #返回的指标列
        { "type": "longSum", "name": "total_usage", "fieldName": "user_count" },
        { "type": "doubleSum", "name": "data_transfer", "fieldName": "data_transfer" }
      ],
      "postAggregations": [ #这部分是可选的
        { "type": "arithmetic",
          "name": "avg_usage",
          "fn": "/",
          "fields": [
            { "type": "fieldAccess", "fieldName": "data_transfer" },
            { "type": "fieldAccess", "fieldName": "total_usage" }
          ]
        }
      ],
      "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ], #本次查询需要覆盖的时间范围
      "having": { #having语句,这部分是可选的
        "type": "greaterThan",
        "aggregation": "total_usage",
        "value": 100
      }
    }

    Timeseries

    示例

    {
      "queryType": "timeseries",
      "dataSource": "sample_datasource",
      "granularity": "day",
      "descending": "true", #是否排序
      "filter": { #过滤条件
        "type": "and",
        "fields": [
          { "type": "selector", "dimension": "sample_dimension1", "value": "sample_value1" },
          { "type": "or",
            "fields": [
              { "type": "selector", "dimension": "sample_dimension2", "value": "sample_value2" },
              { "type": "selector", "dimension": "sample_dimension3", "value": "sample_value3" }
            ]
          }
        ]
      },
      "aggregations": [ #返回的指标列
        { "type": "longSum", "name": "sample_name1", "fieldName": "sample_fieldName1" },
        { "type": "doubleSum", "name": "sample_name2", "fieldName": "sample_fieldName2" }
      ],
      "postAggregations": [ #这部分是可选的
        { "type": "arithmetic",
          "name": "sample_divide",
          "fn": "/",
          "fields": [
            { "type": "fieldAccess", "name": "postAgg__sample_name1", "fieldName": "sample_name1" },
            { "type": "fieldAccess", "name": "postAgg__sample_name2", "fieldName": "sample_name2" }
          ]
        }
      ],
      "intervals": [ "2012-01-01T00:00:00.000/2012-01-04T00:00:00.000" ] #本次查询覆盖的时间范围
    }

    Timeseries query通常对空的查询时间段返回0作为查询结果

    TopN

    • TopN查询返回的是根据某一维度进行group by后再排序,返回结果集
    • 为了提高执行效率,TopN的查询是近似查询(从我们使用经验来看,返回结果基本是比较准确的)

    示例

    {
      "queryType": "topN",
      "dataSource": "sample_data",
      "dimension": "sample_dim", #需要聚合的维度列
      "threshold": 5,
      "metric": "count", #作为排序依据的指标列
      "granularity": "all",
      "filter": { #过滤条件
        "type": "and",
        "fields": [
          {
            "type": "selector",
            "dimension": "dim1",
            "value": "some_value"
          },
          {
            "type": "selector",
            "dimension": "dim2",
            "value": "some_other_val"
          }
        ]
      },
      "aggregations": [ #返回的指标列
        {
          "type": "longSum",
          "name": "count",
          "fieldName": "count"
        },
        {
          "type": "doubleSum",
          "name": "some_metric",
          "fieldName": "some_metric"
        }
      ],
      "postAggregations": [ #后处理逻辑,这部分是可选的
        {
          "type": "arithmetic",
          "name": "sample_divide",
          "fn": "/",
          "fields": [
            {
              "type": "fieldAccess",
              "name": "some_metric",
              "fieldName": "some_metric"
            },
            {
              "type": "fieldAccess",
              "name": "count",
              "fieldName": "count"
            }
          ]
        }
      ],
      "intervals": [
        "2013-08-31T00:00:00.000/2013-09-03T00:00:00.000" #查询覆盖的时间范围
      ]
    }
  • 相关阅读:
    openstack项目【day23】:glance基础
    openstack项目【day23】:openstack组件介绍
    openstack项目【day23】:云计算介绍(一)
    学习有五个层次和境界
    gdb 初步学习记录
    Linux samba 服务的配置
    QT 5.7.0 移植之 tslib 编译配置
    tiny4412 u-boot 启动参数的设置
    Windows 只能安装32位虚拟机问题
    Tiny4412 虚拟机交叉编译环境的设置以及编译u-boot 和 kernel
  • 原文地址:https://www.cnblogs.com/029zz010buct/p/12663498.html
Copyright © 2020-2023  润新知