a. Create a database
create database word;
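The remaining statements assume this is the current database, so switch to it right away (otherwise every table name would need the word. prefix, as the final query in step d uses):
use word;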
b. Create the table
create external table word_data(line string) row format delimited fields terminated by '\n' stored as textfile location '/home/hadoop/worddata';
Here we assume the data already sits in HDFS under /home/hadoop/worddata; the directory holds a few plain-text word files whose content looks roughly like this:
+-------------------------+--+
| word_data.line |
+-------------------------+--+
| |
| hello man |
| what are you doing now |
| my running |
| hello |
| kevin |
| hi man |
| hadoop hive es |
| storm hive es |
| |
| |
+-------------------------+--+
Running the HQL above creates the table word_data; its content is every line of those files, with each line stored in the column line.
select * from word_data;
# shows the following data:
+-------------------------+--+
| word_data.line |
+-------------------------+--+
| |
| hello man |
| what are you doing now |
| my running |
| hello |
| kevin |
| hi man |
| hadoop hive es |
| storm hive es |
| |
| |
+-------------------------+--+
c. Following the MapReduce approach, the lines need to be split
Each line has to be split into individual words. For this Hive offers a built-in table-generating function (UDTF), explode(array): it takes an array and turns a single row into multiple rows:
create table words(word string);
insert into table words select explode(split(line, " ")) as word from word_data;
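If you want to see the array that split produces before explode flattens it, you can peek at it directly first (just a quick check; the limit of 3 is arbitrary):
select split(line, " ") from word_data limit 3;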
0: jdbc:hive2://bd004:10000> select * from words;
+-------------+--+
| words.word |
+-------------+--+
| |
| hello |
| man |
| what |
| are |
| you |
| doing |
| now |
| my |
| running |
| hello |
| kevin |
| hi |
| man |
| hadoop |
| hive |
| es |
| storm |
| hive |
| es |
| |
| |
+-------------+--+
split is the splitting function and behaves like Java's String.split(); here the lines are split on spaces, so once the HQL has run, the words table contains nothing but single words.
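If you would rather not materialize the intermediate words table at all, the same split-and-flatten can be written inline with a LATERAL VIEW; the sketch below is equivalent to the create/insert pair above (the alias t is arbitrary):
select t.word from word_data lateral view explode(split(line, " ")) t as word;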
d. The word count itself
Since HQL supports GROUP BY, the final counting statement is:
select word, count(*) from word.words group by word;
# word.words is database.table; the word in group by word refers to the column word string created by create table words(word string)
+----------+------+--+
| word | _c1 |
+----------+------+--+
| | 3 |
| are | 1 |
| doing | 1 |
| es | 2 |
| hadoop | 1 |
| hello | 2 |
| hi | 1 |
| hive | 2 |
| kevin | 1 |
| man | 2 |
| my | 1 |
| now | 1 |
| running | 1 |
| storm | 1 |
| what | 1 |
| you | 1 |
+----------+------+--+
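Note the first row: the blank lines in the source files are counted as an empty-string word (3 occurrences). A small variation that drops them and sorts by frequency, with the alias cnt added purely for readability:
select word, count(*) as cnt from word.words where word != '' group by word order by cnt desc;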
Summary: compared with writing the MapReduce job by hand, Hive is clearly simpler; for more complicated statistics you can build a few intermediate tables or views along the way.
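For example, the whole count could be wrapped in a view (the name word_count below is made up) so that later queries can simply reuse it:
create view word.word_count as select word, count(*) as cnt from word.words group by word;
select * from word.word_count;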