Using Hive Streaming to Compute WordCount


    I. Write the processing scripts

    The scripts are written in Python.

    1. Write the map script

    map.py

    #!/usr/bin/env python
    # map.py: read raw text lines from stdin and emit one "word<TAB>1" record per word
    import sys

    for line in sys.stdin:
        words = line.strip().split()
        for word in words:
            print("%s\t1" % word.lower())
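
    Before wiring it into Hive, the map script can be sanity-checked from the shell (a quick local test, assuming python is on the PATH and map.py is in the current directory):

    [hduser@yjt test]$ echo "Hadoop test hadoop" | python map.py
    hadoop	1
    test	1
    hadoop	1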
    

    2. Write the reduce script

    reduce.py

    #!/usr/bin/env python
    # reduce.py: count how many "word<TAB>1" records arrive for each word and emit
    # "word:count"; the ':' matches the wordcount table's field delimiter
    import sys

    word_count = dict()
    for line in sys.stdin:
        parts = line.split()
        if not parts:
            continue            # skip blank lines defensively
        word = parts[0]
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1

    for word in word_count:
        print(word + ":" + str(word_count[word]))
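
    The two scripts can be chained locally to preview the whole job; sort stands in for the shuffle that CLUSTER BY triggers between the two stages in Hive (a local sketch, assuming the f.txt sample file prepared below sits in the same directory):

    [hduser@yjt test]$ cat f.txt | python map.py | sort | python reduce.py

    The output should match the word:count pairs that end up in the wordcount table at the end. Since the Hive query in section III invokes both scripts by path, they may also need the execute bit (chmod +x map.py reduce.py).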
    

    II. Create the tables

    1. Create the docs table

    hive (default)> create table docs(line string);
    

    2. Prepare the data for the docs table

    [hduser@yjt test]$ cat f.txt 
    hadoop beijin
    yjt hadoop test
    spark hadoop shanghai
    yjt
    mm
    jj
    gg
    gg
    this is a beijin
    

    3. Load the data into the docs table

    hive (default)> load data local inpath '/data1/studby/test/f.txt' into table docs;
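
    To confirm the load, each line of f.txt should now come back as a row of docs:

    hive (default)> select * from docs limit 3;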
    

    4. Create the wordcount table to hold the final results (note that its ':' field delimiter matches the separator reduce.py uses in its output)

    hive (default)> create  table wordcount(word string, count int) row format delimited fields terminated by ':';
    

    III. Run the job

    hive (default)> from (from docs select transform (line) using "/data1/studby/test/map.py" as (word, count) cluster by word) as wc insert overwrite table wordcount  select transform(wc.word, wc.count) using "/data1/studby/test/reduce.py" as (word, count);
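
    The same statement, spread over multiple lines to show its structure: the inner FROM streams every line of docs through map.py, CLUSTER BY word redistributes and sorts the map output so that all records for a given word reach the same reducer, and the outer SELECT streams that output through reduce.py and writes the result into wordcount. Because reduce.py emits "word:count" as a single tab-free field, the whole pair lands in the first output column, and it is the wordcount table's ':' delimiter that splits it back into word and count when the table is read.

    from (
        from docs
        select transform (line) using "/data1/studby/test/map.py" as (word, count)
        cluster by word
    ) as wc
    insert overwrite table wordcount
    select transform(wc.word, wc.count) using "/data1/studby/test/reduce.py" as (word, count);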
    

    Check the results

    hive (default)> select * from wordcount;
    OK
    wordcount.word	wordcount.count
    a	1
    mm	1
    is	1
    shanghai	1
    beijin	2
    hadoop	3
    gg	2
    this	1
    yjt	2
    jj	1
    test	1
    spark	1
    