• Implementing WordCount in Hive


    a. Create a database

    create database word;
    

    b. Create the table

    create external table word_data(line string)
    row format delimited fields terminated by '\n'
    stored as textfile location '/home/hadoop/worddata';

    Here we assume the data lives in HDFS under /home/hadoop/worddata. The directory holds plain word files whose content looks roughly like:
    
    
    hello man
    what are you doing now
    my running
    hello
    kevin
    hi man
    hadoop hive es
    storm hive es
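
    As a minimal sketch (the local filename `words.txt` is an assumption, not part of the Hive setup), the sample input could be prepared like this before uploading it into /home/hadoop/worddata with `hdfs dfs -put`:

    ```python
    # Write the sample lines to a local file; in a real setup this file
    # would then be uploaded to HDFS under /home/hadoop/worddata.
    lines = [
        "hello man",
        "what are you doing now",
        "my running",
        "hello",
        "kevin",
        "hi man",
        "hadoop hive es",
        "storm hive es",
    ]
    with open("words.txt", "w") as f:
        f.write("\n".join(lines) + "\n")
    ```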
    

    Running the HQL above creates the table word_data; every line of those files becomes one row, stored in the column line.

    	select * from word_data;
    	-- shows these rows:
    +-------------------------+--+
    |     word_data.line      |
    +-------------------------+--+
    |                         |
    | hello man               |
    | what are you doing now  |
    | my running              |
    | hello                   |
    | kevin                   |
    | hi man                  |
    | hadoop hive es          |
    | storm hive es           |
    |                         |
    |                         |
    +-------------------------+--+
    

    c. Split the lines, following the MapReduce pattern

    Split each line into individual words. This uses one of Hive's built-in table-generating functions (UDTF), explode(array), which takes an array and turns a single row into multiple rows:

    create table words(word string);
    insert into table words select explode(split(line, " ")) as word from word_data;
    
    0: jdbc:hive2://bd004:10000> select * from words;
    +-------------+--+
    | words.word  |
    +-------------+--+
    |             |
    | hello       |
    | man         |
    | what        |
    | are         |
    | you         |
    | doing       |
    | now         |
    | my          |
    | running     |
    | hello       |
    | kevin       |
    | hi          |
    | man         |
    | hadoop      |
    | hive        |
    | es          |
    | storm       |
    | hive        |
    | es          |
    |             |
    |             |
    +-------------+--+
    

    split works like Java's String.split; here we split on spaces, so after the statement runs, the words table holds one word per row.
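
    The split-then-explode step is just "flatten each line into its words". A rough Python sketch of the same transformation (the two sample lines are taken from the data above):

    ```python
    # Each input row holds one line; splitting on spaces and flattening
    # mimics: select explode(split(line, " ")) from word_data;
    word_data = [
        "hello man",
        "what are you doing now",
    ]

    words = [word for line in word_data for word in line.split(" ")]
    print(words)  # → ['hello', 'man', 'what', 'are', 'you', 'doing', 'now']
    ```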

    d. Basic implementation

    Since HQL supports group by, the final counting statement is:

    select word, count(*) from word.words group by word;
    -- word.words is database.table; the word in "group by word" is the
    -- word string column defined by create table words(word string).
    -- (Blank lines in the source files explode into empty strings, which is
    -- why the empty word appears below with a count of 3.)
    
    
    +----------+------+--+
    |   word   | _c1  |
    +----------+------+--+
    |          | 3    |
    | are      | 1    |
    | doing    | 1    |
    | es       | 2    |
    | hadoop   | 1    |
    | hello    | 2    |
    | hi       | 1    |
    | hive     | 2    |
    | kevin    | 1    |
    | man      | 2    |
    | my       | 1    |
    | now      | 1    |
    | running  | 1    |
    | storm    | 1    |
    | what     | 1    |
    | you      | 1    |
    +----------+------+--+
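
    The group by plus count(*) aggregation is a plain word-frequency count. A minimal Python sketch of the same logic, using a few of the sample words from the table above:

    ```python
    from collections import Counter

    # Equivalent of: select word, count(*) from word.words group by word;
    words = ["hello", "man", "hello", "hive", "es", "hive", "es"]
    counts = Counter(words)
    print(counts["hello"])  # → 2
    print(counts["hive"])   # → 2
    ```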
    

    Summary: compared with writing the MapReduce job by hand, Hive is much simpler. For more complex statistics, you can build intermediate tables or views along the way.

  • Source: https://www.cnblogs.com/ernst/p/12819169.html