Hive optimization: MR stage optimization – adjusting the number of tasks
Hive optimization: MR stage optimization – Reduce stage
Set mapreduce.job.reduces directly
Parameters that affect num_reduce_tasks
• hive.exec.reducers.max, default: 999 (1009 from Hive 0.14.0 onward)
• hive.exec.reducers.bytes.per.reducer, default: 1G (256 MB from Hive 0.14.0 onward)
Reducer count calculation
• numRTasks = min[maxReducers, input.size / perReducer]
• maxReducers = ${hive.exec.reducers.max}
• perReducer = ${hive.exec.reducers.bytes.per.reducer}
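A minimal sketch of how these settings might be applied in a Hive session. The 50,000,000,000-byte input size is a hypothetical figure chosen only to make the arithmetic concrete, and the parameter values are illustrative rather than recommendations.

```sql
-- Let Hive estimate the reducer count from the input size.
-- Roughly 1 GB of input per reducer:
SET hive.exec.reducers.bytes.per.reducer=1000000000;
-- Upper bound on the estimate:
SET hive.exec.reducers.max=999;
-- Hypothetical 50,000,000,000-byte input:
--   numRTasks = min(999, 50000000000 / 1000000000) = 50

-- Alternatively, bypass the estimate and fix the count directly.
SET mapreduce.job.reduces=50;

-- Setting it back to -1 hands control back to the estimate above.
SET mapreduce.job.reduces=-1;
```

Lowering bytes.per.reducer raises the estimated reducer count for the same input, while reducers.max caps it.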
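Hive optimization: overall optimization – compression
Raw logs: BZ2 compression
MR intermediate output: LZO compression
Intermediate tables: SEQUENCEFILE, ORCFile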
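The compression choices above could be configured roughly as follows. This is a sketch under assumptions: the LZO codec class comes from the separately installed hadoop-lzo library, and the table names and columns (tmp_iplog_orc, tmp_iplog_seq, id, ip) are hypothetical.

```sql
-- Compress data passed between the MR jobs of a multi-stage query.
SET hive.exec.compress.intermediate=true;

-- Compress map output with LZO (requires hadoop-lzo on the cluster).
SET mapreduce.map.output.compress=true;
SET mapreduce.map.output.compress.codec=com.hadoop.compression.lzo.LzoCodec;

-- Hypothetical intermediate table stored as ORC.
CREATE TABLE tmp_iplog_orc (id BIGINT, ip STRING)
STORED AS ORC;

-- The same hypothetical table stored as a SequenceFile.
CREATE TABLE tmp_iplog_seq (id BIGINT, ip STRING)
STORED AS SEQUENCEFILE;
```

Raw .bz2 log files generally need no extra settings on the read side, since text-format tables pick the codec from the file extension.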
Hive optimization: SQL job optimization – parallel SQL execution
hive.exec.parallel=true (default false)
hive.exec.parallel.thread.number=8 (default 8)
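A sketch of a query that can benefit from these settings. The two branches of the UNION ALL do not depend on each other, so their MR jobs are candidates for concurrent execution. The table name is taken from the count distinct example in this deck; the dates are illustrative.

```sql
-- Allow independent stages of one query to run at the same time.
SET hive.exec.parallel=true;
-- At most 8 stages run concurrently.
SET hive.exec.parallel.thread.number=8;

SELECT t.log_date, SUM(t.cnt) AS cnt
FROM (
  -- Branch 1: aggregated by its own MR job.
  SELECT log_date, COUNT(1) AS cnt
  FROM acorn_3g.iplog
  WHERE log_date = '2013-12-01'
  GROUP BY log_date
  UNION ALL
  -- Branch 2: independent of branch 1.
  SELECT log_date, COUNT(1) AS cnt
  FROM acorn_3g.iplog
  WHERE log_date = '2013-12-02'
  GROUP BY log_date
) t
GROUP BY t.log_date;
```

Parallel stages compete for the same cluster resources, so the gain depends on how much spare capacity is available.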
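Hive optimization: overall optimization – table partitioning
Choose partitions by query dimensions and business requirements, e.g. partition by date or by type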
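A minimal sketch of date and type partitioning. The table access_log, its columns, and the partition value 'web' are hypothetical names introduced only for illustration.

```sql
-- Hypothetical log table partitioned by date and by type.
CREATE TABLE access_log (id BIGINT, ip STRING)
PARTITIONED BY (log_date STRING, log_type STRING);

-- Filtering on the partition columns lets Hive read only the matching
-- partition directories instead of scanning the whole table.
SELECT COUNT(1)
FROM access_log
WHERE log_date = '2013-12-01'
  AND log_type = 'web';
```

Partition columns are worth choosing from the filters that queries use most often, since only those filters benefit from pruning.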
Hive optimization: data skew – count distinct
SELECT count(DISTINCT id) FROM acorn_3g.iplog WHERE log_date LIKE '2013-12%';
Elapsed time: 1600 s
SELECT count(1) FROM (SELECT DISTINCT id FROM acorn_3g.iplog WHERE log_date LIKE '2013-12%' AND id>0) tmp;
Elapsed time: 260 s
Reason: a global count(DISTINCT ...) sends every id to a single reducer, whereas the subquery spreads the DISTINCT across many reducers and leaves only a cheap count for the final stage.
Note: the second query also adds id>0, which drops NULL and non-positive ids, so the two counts agree only when every non-NULL id is positive.