The Meituan-Dianping engineering blog describes how Hive compiles SQL into MapReduce:

1. Antlr defines the SQL grammar and performs lexical and syntactic analysis, turning the SQL text into an abstract syntax tree (AST Tree).
2. The AST Tree is traversed and abstracted into QueryBlocks, the basic units of a query.
3. Each QueryBlock is traversed and translated into an OperatorTree.
4. The logical optimizer transforms the OperatorTree, for example merging unnecessary ReduceSinkOperators to reduce the amount of shuffled data.
5. The OperatorTree is traversed and translated into MapReduce tasks.
6. The physical optimizer transforms the MapReduce tasks and produces the final execution plan.
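The result of this pipeline can be inspected with Hive's `EXPLAIN` statement, which prints the stage graph and operator trees the compiler produced without running the job. A minimal sketch (the table and column names here are illustrative, not from the source):

```sql
-- Show the compiled plan for a simple aggregation.
-- The output lists the stages (e.g. a map-reduce stage) and the
-- operator tree inside each stage (TableScan, GroupBy, ReduceSink, ...).
EXPLAIN
SELECT dept, count(*)
FROM employees
GROUP BY dept;

-- EXPLAIN EXTENDED additionally prints the AST and physical-level details.
```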
Reference:
https://tech.meituan.com/2014/02/12/hive-sql-to-mapreduce.html
However, not every SQL statement needs to be compiled into an MR job. For example:
select * from xx.xx limit 1
can be answered by Hive simply reading the files and streaming the rows to the console.
The hive-default.xml configuration file exposes two relevant parameters: hive.fetch.task.conversion and hive.fetch.task.conversion.threshold.
With hive.fetch.task.conversion set to more, simple queries (full-table scans, column projections, LIMIT queries, and the like) skip MapReduce entirely.
hive.fetch.task.conversion.threshold is the maximum input size for which the fetch task applies; the default is 1073741824 bytes = 1 GB.
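Both parameters can also be changed per session with `set`, which is a quick way to observe the difference (the table name below is illustrative):

```sql
-- Force every query through MapReduce:
set hive.fetch.task.conversion=none;
select * from emp limit 1;   -- now launches an MR job even for this trivial query

-- Restore the default and fetch directly:
set hive.fetch.task.conversion=more;
select * from emp limit 1;   -- no MR job; Hive reads the files directly
```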
<property>
  <name>hive.fetch.task.conversion</name>
  <value>more</value>
  <description>
    Expects one of [none, minimal, more].
    Some select queries can be converted to single FETCH task minimizing latency.
    Currently the query should be single sourced not having any subquery and
    should not have any aggregations or distincts (which incurs RS),
    lateral views and joins.
    0. none    : disable hive.fetch.task.conversion
    1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
    2. more    : SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)
  </description>
</property>
<property>
  <name>hive.fetch.task.conversion.threshold</name>
  <value>1073741824</value>
  <description>
    Input threshold for applying hive.fetch.task.conversion. If target table is
    native, input length is calculated by summation of file lengths. If it's not
    native, storage handler for the table can optionally implement
    org.apache.hadoop.hive.ql.metadata.InputEstimator interface.
  </description>
</property>
Reference:
Hive Quick Start Series (14) | Hive Performance Tuning, Part 1: Fetch Conversion and Local Mode