hive2


    
    
    4. Hive optimization
    1) Parameters Hive prints when running a SQL query:
    
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number> 
      If the input exceeds <number> bytes, additional reducers are created.
      e.g. <number> = 1024: an input under 1 KB gets a single reducer;
      with <number> = 1 MB, roughly 10 MB of input gets 10 reducers.
      
      set hive.exec.reducers.bytes.per.reducer=20000;
      select user_id,count(1) as order_cnt 
      from orders group by user_id limit 10;
    -- Result: number of mappers: 1; number of reducers: 1009
      
      
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
      set hive.exec.reducers.max=10;
     -- number of mappers: 1; number of reducers: 10 
     
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
      set mapreduce.job.reduces=5;
      --number of mappers: 1; number of reducers: 5
      set mapreduce.job.reduces=15;
      --number of mappers: 1; number of reducers: 15
      Takes effect only in your current session (window), or for the duration of the task/script being executed.
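The way these three settings interact can be sketched as a small model of Hive's reducer estimate. This is a simplification under assumed defaults (256 MB per reducer and a cap of 1009 are the common Hive 0.14+ defaults); the real planner considers more inputs:

```python
import math

def estimate_reducers(input_bytes, bytes_per_reducer=256 * 1024 * 1024,
                      max_reducers=1009, forced_reducers=None):
    """Simplified model of how Hive picks a reducer count."""
    if forced_reducers is not None:
        # mapreduce.job.reduces overrides the estimate outright
        return forced_reducers
    # one reducer per bytes_per_reducer of input,
    # capped by hive.exec.reducers.max
    n = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(n, max_reducers))

# ~25 MB of input with bytes.per.reducer=20000 wants 1250 reducers,
# but the default cap of 1009 kicks in -- matching the 1009 seen above
print(estimate_reducers(25_000_000, bytes_per_reducer=20000))                   # 1009
print(estimate_reducers(25_000_000, bytes_per_reducer=20000, max_reducers=10))  # 10
print(estimate_reducers(25_000_000, forced_reducers=5))                         # 5
```

This also shows why `set mapreduce.job.reduces=N` wins over the other two knobs: it bypasses the size-based estimate entirely.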
    
      2) WHERE conditions can make GROUP BY work redundant
      The map and reduce phases coordinate like a synchronous call:
      synchronous: a phone call (you wait for the answer);
      asynchronous: a text message (send and move on).
      1: reduce can only finish after all maps complete, since it needs every
         map's output (the way 1+2=3 needs both the 1 and the 2);
      2: map and reduce still overlap in time, because reducers start copying
         map output early -- hence progress reports like map 60%, reduce 3%.
      
      3) Cases that always get a single reducer
      a. No GROUP BY
      set mapreduce.job.reduces=5;
      select count(1) from orders where order_dow='0';
      --number of mappers: 1; number of reducers: 1
      b.order by
      set mapreduce.job.reduces=5;
      select user_id,order_dow 
      from orders where order_dow='0'
      order by user_id
      limit 10;
      -- number of mappers: 1; number of reducers: 1
      c. Cartesian product (cross join)
      tmp_d
    1
    2
    3
    4
    5
    select * from tmp_d 
    join (select * from tmp_d)t 
    where tmp_d.user_id=t.user_id; -- the WHERE acts like an ON clause
    
    A JOIN without an ON condition pairs every row with every row:
    1    1
    2    1
    3    1
    1    2
    2    2
    3    2
    1    3
    2    3
    3    3
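The 9 rows above are just the cross product of the two 3-row sets; a minimal Python sketch of the blow-up (n rows joined to m rows with no ON condition yields n*m rows):

```python
from itertools import product

left = [1, 2, 3]
right = [1, 2, 3]

# a JOIN with no ON condition pairs every left row with every right row
rows = list(product(left, right))
print(len(rows))   # 9 = 3 * 3
```

With the 5-row tmp_d above joined to itself, that would already be 25 rows; on real tables this explodes, which is why a cartesian product is forced onto a single reducer.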
    Example where this matters: recommendation. From the full product catalog,
    recall (match) a small candidate set (~1000 items) the user is likely to
    want, then keep the top 10. E.g. users of mother-and-baby categories joined
    with products: you need information from both users and products to do that
    coarse-grained filtering, which is where joins like the above show up.
    
    5)map join
    select /*+ MAPJOIN(aisles) */ a.aisle as aisle,p.product_id as product_id 
    from aisles a join product p 
    on a.aisle_id=p.aisle_id limit 10;
    
    
    The small table (aisles) is loaded into an in-memory hash map
    {aisle_id: aisle}; the big table (products) is then streamed past it on the
    map side, with no reduce phase needed:
    
    # aisle_by_id = {aisle_id: aisle}, built from the small table
    for line in products:
        ss = line.split('\t')
        aisle_id = ss[0]
        product_id = ss[1]
        aisle = aisle_by_id[aisle_id]
        print('%s\t%s' % (aisle, product_id))
        
    6) UNION ALL + DISTINCT == UNION
    -- runtime: 74.712 seconds, 2 jobs
    select count(*) c
    from (
    select order_id,user_id,order_dow from orders where order_dow='0' union all
    select order_id,user_id,order_dow from orders where order_dow='0' union all 
    select order_id,user_id,order_dow from orders where order_dow='1'
    )t;
    
    -- runtime: 122.996 seconds, 3 jobs
    select *
    from(
    select order_id,user_id,order_dow from orders where order_dow='0' 
    union 
    select order_id,user_id,order_dow from orders where order_dow='0'
    union 
    select order_id,user_id,order_dow from orders where order_dow='1')t;
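The semantics the two queries exercise can be sketched in Python: UNION ALL merely concatenates, while UNION also deduplicates, which is what costs the extra job:

```python
# two identical dow='0' result sets plus one dow='1' set
dow0 = [(1, 'u1', '0'), (2, 'u2', '0')]
dow1 = [(3, 'u3', '1')]

union_all = dow0 + dow0 + dow1            # duplicates kept
union = sorted(set(dow0 + dow0 + dow1))   # duplicates removed

print(len(union_all), len(union))   # 5 3
```

So when the branches cannot overlap (disjoint WHERE filters, as with dow='0' vs dow='1'), prefer UNION ALL and skip the deduplication job.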
    
    7) set hive.groupby.skewindata=true;
    Splits one MapReduce job into two.
    Typical skew: a placeholder key '-' (likewise '', -1, 0, NULL) dominating,
    e.g. 100M rows of which '-' alone has 60M, all routed to a single reducer:
    1 reducer grinds through the 60M '-' rows (about 1% of the distinct keys),
    while the other 29 reducers share the remaining 40M rows (99% of the keys).
    
    With the flag on:
    1. rows are first distributed randomly across reducers, each producing a
       partial aggregate (partial counts);
    2. a final reducer combines the partial results into the answer
       (e.g. summing 2M partial rows into 1 row).
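The same two-phase plan can be imitated by hand with a random salt; a minimal sketch (the data and salt width are made up for illustration):

```python
import random
from collections import Counter

# one hot key '-' (6000 rows) plus 4000 well-distributed keys
rows = ['-'] * 6000 + ['k%d' % i for i in range(4000)]

# phase 1: prepend a random salt so the hot key spreads across
# many "reducers", each computing a partial count
salted = Counter('%d_%s' % (random.randint(0, 9), k) for k in rows)

# phase 2: strip the salt and sum the partial counts per real key
final = Counter()
for salted_key, cnt in salted.items():
    final[salted_key.split('_', 1)[1]] += cnt

print(final['-'])   # 6000
```

Phase 1 leaves at most ~1/10th of the hot key's rows on any one worker; phase 2 only has to sum a handful of partial counts per key, so neither phase is skewed.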
        
    select add_to_cart_order,count(1) as cnt 
    from order_products_prior 
    group by add_to_cart_order
    limit 10; 
    
    
    select user_id,count(1) as cnt 
    from order_products_prior 
    group by user_id
    limit 10; 
    
    
    
    
    
        
    -- without set hive.groupby.skewindata=true;
    -- Launching Job 1 out of 1
    -- 1m 41s
    
    -- with set hive.groupby.skewindata=true;
    -- Launching Job 1 out of 2
    -- 2m 50s
    
    If the skew is not actually making a reducer fail or hang, leave this
    variable off: the extra job adds overhead, as the timings above show.
    Set it only when one reducer receives so much data that the job keeps
    failing or runs far too long.
    
    Real case: a 3 AM scheduled job building a last-7-days report ran for
    3 hours; the cleaned base table due at 3 AM came out at 7 AM, delaying the
    ~70 downstream tasks that depended on it.
      
      8) Number of MR jobs
     --Launching Job 1 out of 1
     select 
     ord.order_id order_id,
     tra.product_id product_id,
     pri.reordered reordered
    from orders ord 
    join train tra on ord.order_id=tra.order_id
    join order_products_prior pri on ord.order_id=pri.order_id
    limit 10;
    
    -- two MR jobs: this second join keys on tra.product_id, a different key,
    -- so Hive cannot merge it into one job (above, both joins share
    -- ord.order_id, so a single job suffices)
     select 
     ord.order_id,
     tra.product_id,
     pro.aisle_id
    from orders ord
    join trains tra on ord.order_id=tra.order_id
    join products pro on tra.product_id=pro.product_id
    limit 10;
    
    9) /*+ STREAMTABLE(a) */ -- a is the large table
    Like the map-join hint, it goes in the SELECT; the difference is that it
    names the large table, which is streamed row by row while the smaller
    tables are buffered.
    select /*+STREAMTABLE(pr)*/ ord.order_id,pr.product_id,pro.aisle_id
    from orders ord
    join order_products_prior pr on ord.order_id=pr.order_id
    join products pro on pr.product_id=pro.product_id
    limit 10;
    
    10)LEFT OUTER JOIN
    select od.user_id,
    od.order_id,
    tr.product_id
    from
    (select user_id,order_id,order_dow from orders limit 100)od
    left outer join
    (select order_id,product_id,reordered from train)tr 
    on (od.order_id=tr.order_id and od.order_dow='0' and tr.reordered=1)
    limit 30;
    
    -- JOIN defaults to INNER JOIN
    
    11) set hive.exec.parallel=true
    1: dependent jobs still run in sequence (job 3 needs jobs 1 and 2: 1+2=3);
    2: independent jobs, e.g. the branches of a UNION ALL, can run at the
       same time.
    
    
    12) Handling skewed keys
    1. Throw the dirty key away with a WHERE filter, e.g. age <> '-' simply
    drops those rows:
    select age, count(1) from <table> where age <> '-' group by age;
    
    2. Or salt the skewed key with a random prefix (1_-, 2_-, 3_-) so it
    spreads across reducers; aggregate, then strip the prefix and aggregate
    again.
    
    How do you locate which specific keys are skewed?
    Take a sample:
    SELECT COUNT(1) FROM (SELECT * FROM lxw1 TABLESAMPLE (200 ROWS)) x;
    SELECT * FROM udata TABLESAMPLE (50 PERCENT);
    select * from table_name where col=xxx order by rand() limit num; 
    SELECT * FROM lxw1 TABLESAMPLE (30M);
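Once you have a sample, locating the skewed keys is just a frequency count over it; a sketch (the sample data and the 10% threshold are placeholders):

```python
from collections import Counter

# pretend this came out of a TABLESAMPLE of the grouped-by column
sample = ['-'] * 600 + ['a'] * 5 + ['b'] * 3 + ['c'] * 2

counts = Counter(sample)
total = len(sample)

# flag any key holding more than 10% of the sampled rows
hot_keys = [k for k, c in counts.items() if c / total > 0.10]
print(hot_keys)   # ['-']
```

Keys flagged here are the candidates for the WHERE filter or random-prefix salting described above.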
    
    Long-tail data
     
  • Original post: https://www.cnblogs.com/hejunhong/p/11186393.html