• 星型数据仓库olap工具kylin介绍


    星型数据仓库olap工具kylin介绍

    数据仓库是目前企业级BI分析的重要平台,尤其在互联网公司,每天都会产生数以百G的日志,如何从这些日志中发现数据的规律很重要. 数据仓库是数据分析的重要工具, 每个大公司都花费数百万每年的资金进行数据仓库的运维.

    本文介绍一个基于hadoop的数据仓库, 它基于hadoop(HIVE, HBASE)水平扩展的特性, 客服传统olap受限于关系型数据库数据容量的问题. Kylin是ebay推出的olap星型数据仓库的开源实现. 

    首先请安装Kylin, 和它的运行环境(Hadoop, yarn, hive, hbase). 如果安装成功, 登陆(http://<KYLIN_HOST>:7070/), 用户名:ADMIN, 密码(KYLIN). 安装过程请参考(http://kylin.incubator.apache.org/download/,  注意下载编译后的二进制包, 免去很多编译烦恼).

    在创建数据仓库前, 我们先聊一下, 什么是数据仓库.

    从业务过程的角度考虑, 信息系统可以划分为两个主要类别, 一类用于支持业务过程的执行, 代表作品是mysql; 另一类用于支持业务过程的分析, 代表作品是hive, 还有就是今天的主角kylin.

    首先, 数据仓库的设计

    下图展示了一个简单的基于订单流程中事实和维度的星型模型.

    这是一个典型的星型结构, 订单的事实表有3个度量值(messures)(订单数量, 订单金额, 和订单成本); 另外有4个度量维度(dimession), 分别是时间, 产品, 销售员, 客户. 这里时间以天为单位,  这里注意day_key必须是(YYYY-MM-DD)格式(这是kylin的规定). 

    其次, 根据数据仓库的设计创建hive表

    1. 创建事实表并插入数据

    DROP TABLE IF EXISTS DEFAULT.fact_order ;
    
    create table DEFAULT.fact_order (
        time_key string,
        product_key string,
        salesperson_key string,
        custom_key string,
        quantity_ordered bigint,
        order_dollars bigint,
        cost_dollars bigint
    
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;
    load data local inpath 'fact_order.csv' overwrite into table DEFAULT.fact_order; 
    

     

    fact_order.csv

    2015-05-01,pd001,sp001,ct001,100,101,51
    2015-05-01,pd001,sp002,ct002,100,101,51
    2015-05-01,pd001,sp003,ct002,100,101,51
    2015-05-01,pd002,sp001,ct001,100,101,51
    2015-05-01,pd003,sp001,ct001,100,101,51
    2015-05-01,pd004,sp001,ct001,100,101,51
    2015-05-02,pd001,sp001,ct001,100,101,51
    2015-05-02,pd001,sp002,ct002,100,101,51
    2015-05-02,pd001,sp003,ct002,100,101,51
    2015-05-02,pd002,sp001,ct001,100,101,51
    2015-05-02,pd003,sp001,ct001,100,101,51
    2015-05-02,pd004,sp001,ct001,100,101,51

    2. 创建天维度表day_dim

    DROP TABLE IF EXISTS DEFAULT.dim_day ;
    
    create table DEFAULT.dim_day (
        day_key string,
        full_day string,
        month_name string,
        quarter string, 
        year string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;
    
    load data local inpath 'dim_day.csv' overwrite into table DEFAULT.dim_day; 

    dim_day.csv

    2015-05-01,2015-05-01,201505,2015q2,2015
    2015-05-02,2015-05-02,201505,2015q2,2015
    2015-05-03,2015-05-03,201505,2015q2,2015
    2015-05-04,2015-05-04,201505,2015q2,2015
    2015-05-05,2015-05-05,201505,2015q2,2015

    3. 创建售卖员的维度表salesperson_dim

    DROP TABLE IF EXISTS DEFAULT.dim_salesperson ;
    
    create table DEFAULT.dim_salesperson (
        salesperson_key string,
        salesperson string,
        salesperson_id string, 
        region string,
        region_code string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;
    
    load data local inpath 'dim_salesperson.csv' overwrite into table DEFAULT.dim_salesperson; 

    dim_salesperson.csv

    sp001,hongbin,sp001,beijing,10086
    sp002,hongming,sp002,beijing,10086
    sp003,hongmei,sp003,beijing,10086

    4. 创建客户维度 custom_dim

    DROP TABLE IF EXISTS DEFAULT.dim_custom ;
    
    create table DEFAULT.dim_custom (
            custom_key string,
            custom_name string,
            custorm_id string, 
            headquarter_states string,
            billing_address string,
        billing_city string,
        billing_state string,
        industry_name string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;
    
    load data local inpath 'dim_custom.csv' overwrite into table DEFAULT.dim_custom; 
    

      

    dim_custom.csv

    ct001,custom_john,ct001,beijing,zgx-beijing,beijing,beijing,internet                     
    ct002,custom_herry,ct002,henan,shlinjie,shangdang,henan,internet     
    

    5. 创建产品维度表并插入数据

    DROP TABLE IF EXISTS DEFAULT.dim_product ;                                               
                                                                                             
    create table DEFAULT.dim_product (                                                       
        product_key string,                                                                  
        product_name string,                                                                 
        product_id string,                                                                   
        product_desc string,                                                                 
        sku string,                                                                          
        brand string,                                                                        
        brand_code string,                                                                   
        brand_manager string,                                                                
        category string,                                                                     
        category_code string                                                                 
    )                                                                                        
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','                                            
    STORED AS TEXTFILE;                                                                      
                                                                                             
    load data local inpath 'dim_product.csv' overwrite into table DEFAULT.dim_product;       
    

    dim_product.csv

    pd001,Box-Large,pd001,Box-Large-des,large1.0,brand001,brandcode001,brandmanager001,Packing,cate001
    pd002,Box-Medium,pd001,Box-Medium-des,medium1.0,brand001,brandcode001,brandmanager001,Packing,cate001
    pd003,Box-small,pd001,Box-small-des,small1.0,brand001,brandcode001,brandmanager001,Packing,cate001
    pd004,Evelope,pd001,Evelope_des,large3.0,brand001,brandcode001,brandmanager001,Pens,cate002
    

    这样一个星型的结构表在hive中创建完毕, 实际上一个离线的数据仓库已经完成, 它包含一个主题, 即商品订单.

    关于商品订单的统计需求可以使用hive命令产生. 比如:

    1. 统计20150501到20150502所有的订单数.

    Hive> select dday.full_day, sum(quantity_ordered) from fact_order as fact inner join dim_day  as dday on fact.time_key == dday.day_key and dday.full_day >= "2015-05-01" and dday.full_day <= "2015-05-02" group by dday.full_day order by dday.full_day;

    2015-05-01      600

    2015-05-02      600

     

    2. 统计20150501到20150502各个销售员的销售订单数

    select dday.full_day, dsp.salesperson_key, sum(quantity_ordered) from fact_order as fact 

        inner join dim_day  as dday on fact.time_key == dday.day_key 

        inner join dim_salesperson as dsp on fact.salesperson_key == dsp.salesperson_key  

        where dday.full_day >= "2015-05-01" and dday.full_day <= "2015-05-02" 

        group by dday.full_day, dsp.salesperson_key 

        order by dday.full_day;

    2015-05-01      sp003   100

    2015-05-01      sp002   100

    2015-05-01      sp001   400

    2015-05-02      sp003   100

    2015-05-02      sp002   100

    2015-05-02      sp001   400

    然后,导入kylin数据仓库中

    kylin在hive的基础上仓库olap数据cube, 完成实时数据仓库服务的任务. kylin在hive的基础上完成:

    1. 将星型数据库部署在hbase上实现实时的查询服务

    2. 提供restful查询接口

    3. 集成BI

    首先, 创建一个数据仓库工程(kylin_test_project)

    其次, 点击tables标签,点击"load hive table"按钮, 同步上述的所有hive表

     

    完成hive表和kylin的同步.

    接着, 简历kylin的数据cube

    点击cube 和新增cube按钮.

    1. 命名cube order_cube

    2. 增加fact 和 dim 表

    3. 增加维度

    4. 增加mesure值

    5. 不用选filter条件

    6. 选择开始开始时间

    7. 完成

    然后, build cube 

    可以在jobs中查看build状态. build过程实际上是把cube存到hbase中, 方便快速检索.

  • 相关阅读:
    c#: List.Sort()实现稳固排序(stable sort)
    c# dt.AsEnumerable ().Join用法
    C#中new的两种用法"public new"和"new public"
    简说设计模式——观察者模式
    mysql中explain的type的解释
    mysql 查询优化 ~explain解读之select_type的解读
    代理
    charle
    like语句防止SQL注入
    java学习网站
  • 原文地址:https://www.cnblogs.com/hsydj/p/4515057.html
Copyright © 2020-2023  润新知