• Hive dynamic partitioning implementation (hive-1.1.0)


    The Hive version used here is hive-1.1.0.

    The default dynamic partition implementation in hive-1.1.0 is map-only, with no reduce phase; you can see this from the execution plan (the full plan for the statement below is shown at the end of this post).

    insert overwrite table public_t_par partition(delivery_datekey) select * from public_oi_fact_partition;
    

    Hive's default dynamic partition implementation: no shuffle needed

    So how does Hive manage dynamic partitioning with map tasks alone? Stage-1 decides the number of maps from the FileInputSplits; if the input is small there may be only one map, and since the job has no reduce phase there is no sort/merge/combine at all.
    Suppose the data this map reads contains delivery_datekey values with a cardinality of 10. Each time the map meets a delivery_datekey it has not seen before, it opens another file writer, so it ends up holding ten file writers open at once, which is quite wasteful of file handles. This is why Hive's strict mode forbids dynamic partitioning, and why, even with strict mode off, the number of partitions a job may write is capped, and even capped per node: the limits keep a single dynamic partition job from exhausting the system's file handles and disrupting other tasks.
    After the Stage-1 maps have written the data into the relevant partitions, Stage-2 launches one map per generated partition (in fact, at most that many) to merge the small files. Since our Stage-1 has only one map, no partition ends up with more than one file, so Stage-2 (the small-file merge) never needs to start.
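
    Before running such a load, it is worth estimating how many writers each map might open. A minimal check, using the example tables above: the cardinality of the dynamic partition column is an upper bound on the number of file writers a single map may hold.

    -- upper bound on the number of file writers one map may open:
    -- the number of distinct values of the dynamic partition column
    SELECT COUNT(DISTINCT delivery_datekey) FROM public_oi_fact_partition;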

    Hive's optimized dynamic partition implementation: a reduce phase, and therefore a shuffle, is required

    # enable the dynamic partition optimization with this parameter; in effect it forces a reduce phase
    SET hive.optimize.sort.dynamic.partition=true; 
    

    The idea is to use the partition column delivery_datekey as the shuffle key and the remaining columns as the value (the details may differ from the source code, which I have not read; whether the rows are additionally sorted on other columns is an implementation choice and does not affect the overall picture). Rows are partitioned on delivery_datekey and shuffled to the reducers, so the small-file merge is now done by the reduce phase: one reducer may write several partitions, but each partition is written by a single writer, so no small files appear.
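
    For intuition, a similar effect can be produced by hand, without the optimizer flag, by forcing a shuffle on the partition column with DISTRIBUTE BY (a sketch using the example tables above, not what the optimizer literally generates):

    -- rows with the same delivery_datekey all land on the same reducer,
    -- so each partition is written by exactly one task
    INSERT OVERWRITE TABLE public_t_par PARTITION (delivery_datekey)
    SELECT * FROM public_oi_fact_partition
    DISTRIBUTE BY delivery_datekey;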

    Each of the two implementations has its pros and cons, to be summarized later.

    # Summary
    To be continued...
    

    Also, if the target table is stored as ORC or Parquet, dynamic partitioning sometimes triggers a Java heap OOM.
    The likely cause, in brief (details to follow): these columnar formats buffer a whole stripe (ORC) or row group (Parquet) in memory for every open file writer before flushing, so a map holding one writer per partition multiplies that buffer, often tens of megabytes each, by the partition cardinality.
    To be continued...

    Solutions

    An OOM here means you are using Hive's default (map-only) implementation and the target table's format is ORC or Parquet.
    
    1. Force a reduce phase with the optimization flag; this resolves it:
    SET hive.optimize.sort.dynamic.partition=true; 
    
    2. Lower the max split size, which increases the number of maps so that the partition cardinality is spread over more of them, reducing the memory pressure on each individual map; how much this helps also depends on the data distribution.
    # set a value below the 128 MB block size, e.g. 64 MB (illustrative)
    SET mapred.max.split.size=67108864;
    
    3. Increase the map task heap (the values below are illustrative; size them for your cluster):
    SET mapreduce.map.memory.mb=4096;
    SET mapreduce.map.java.opts=-Xmx3686m;
    
    # parameters Hive needs for dynamic partitioning; set the last two to values that suit your cluster
    set hive.mapred.mode=nonstrict;
    SET hive.exec.dynamic.partition.mode=nonstrict;
    SET hive.exec.max.dynamic.partitions.pernode = 1500;
    SET hive.exec.max.dynamic.partitions=3000;
    
    hive (public)> explain insert overwrite table public_t_par partition(delivery_datekey) select * from public_oi_fact_partition;
    OK
    STAGE DEPENDENCIES:
      Stage-1 is a root stage
      Stage-7 depends on stages: Stage-1 , consists of Stage-4, Stage-3, Stage-5
      Stage-4
      Stage-0 depends on stages: Stage-4, Stage-3, Stage-6
      Stage-2 depends on stages: Stage-0
      Stage-3
      Stage-5
      Stage-6 depends on stages: Stage-5
    STAGE PLANS:
      Stage: Stage-1
        Map Reduce
          Map Operator Tree:
              TableScan
                alias: public_oi_fact_partition
                Statistics: Num rows: 110000 Data size: 35162211 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: order_datekey (type: int), oiid (type: bigint), custom_order_id (type: bigint), ciid (type: int), bi_name (type: string), siid (type: int), si_name (type: string), classify (type: string), status (type: int), status_text (type: string), class1_id (type: int), class1_name (type: string), class2_id (type: int), class2_name (type: string), city_id (type: int), city_name (type: string), operate_area (type: int), company_id (type: int), standard_item_num (type: int), package_num (type: double), expect_num (type: decimal(30,6)), price (type: decimal(30,6)), order_weight (type: decimal(30,6)), order_amount (type: decimal(30,6)), order_money (type: decimal(30,6)), ci_weight (type: decimal(30,6)), c_t (type: string), u_t (type: string), real_num (type: decimal(30,6)), real_weight (type: decimal(30,6)), real_money (type: decimal(30,6)), cost_price (type: decimal(30,6)), cost_money (type: decimal(30,6)), price_unit (type: string), order_money_coupon (type: decimal(30,6)), real_money_coupon (type: decimal(30,6)), real_price (type: decimal(30,6)), f_activity (type: int), activity_type (type: tinyint), is_activity (type: tinyint), original_price (type: decimal(30,6)), car_group_id (type: bigint), driver_id (type: string), expect_pay_way (type: int), desc (type: string), coupon_score_amount (type: decimal(30,6)), sale_area_id (type: int), delivery_area_id (type: int), tag (type: int), promote_tag_id (type: bigint), promote_tag_name (type: string), pop_id (type: bigint), delivery_datekey (type: string)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27, _col28, _col29, _col30, _col31, _col32, _col33, _col34, _col35, _col36, _col37, _col38, _col39, _col40, _col41, _col42, _col43, _col44, _col45, _col46, _col47, _col48, _col49, _col50, _col51, _col52
                  Statistics: Num rows: 110000 Data size: 35162211 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 110000 Data size: 35162211 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                        name: public.public_t_par
      Stage: Stage-7
        Conditional Operator
      Stage: Stage-4
        Move Operator
          files:
              hdfs directory: true
              destination: hdfs://ns1/user/hive/warehouse/public.db/public_t_par/.hive-staging_hive_2018-06-08_15-41-18_222_4176438830382881060-1/-ext-10000
      Stage: Stage-0
        Move Operator
          tables:
              partition:
                delivery_datekey 
              replace: true
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: public.public_t_par
      Stage: Stage-2
        Stats-Aggr Operator
      Stage: Stage-3
        Map Reduce
          Map Operator Tree:
              TableScan
                File Output Operator
                  compressed: false
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                      name: public.public_t_par
      Stage: Stage-5
        Map Reduce
          Map Operator Tree:
              TableScan
                File Output Operator
                  compressed: false
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                      name: public.public_t_par
      Stage: Stage-6
        Move Operator
          files:
              hdfs directory: true
              destination: hdfs://ns1/user/hive/warehouse/public.db/public_t_par/.hive-staging_hive_2018-06-08_15-41-18_222_4176438830382881060-1/-ext-10000
    
    


    Once all the partition directories are written, _tmp.-ext-10000 is renamed to -ext-10000; see the Move Operator in Stage-6 above.
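
    You can watch this happen from the Hive CLI while the job runs (a sketch; the path is the staging directory shown in the plan above):

    dfs -ls hdfs://ns1/user/hive/warehouse/public.db/public_t_par/.hive-staging_hive_2018-06-08_15-41-18_222_4176438830382881060-1/;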

• Original post: https://www.cnblogs.com/jiangxiaoxian/p/9565500.html