• Implementing MapReduce with Python and Shell/Hadoop


    The basic pipeline is:

    cat data | map | sort | reduce

    cat devProbe | ./mapper.py | sort | ./reducer.py

    echo "foo foo quux labs foo bar quux" | ./mapper.py | sort -k1,1 | ./reducer.py

    # -k, --key=POS1[,POS2]     the sort key starts at POS1 and ends at POS2; -k1,1 sorts on the first field only
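
The same map, sort, reduce flow can be imitated inside a single Python process, which can help when checking logic before touching the shell. The sketch below is only an illustration: `map_words`, `reduce_counts` and `run_pipeline` are hypothetical names, and the map step here splits on whitespace to match the `echo` sample above rather than reproducing mapper.py.

```python
from itertools import groupby
from operator import itemgetter


def map_words(lines):
    """Map step: emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_counts(sorted_pairs):
    """Reduce step: sum the counts of consecutive pairs that share a key."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield key, sum(count for _, count in group)


def run_pipeline(lines):
    """Imitate `cat data | map | sort -k1,1 | reduce` in one process."""
    # Sorting by the first element plays the role of `sort -k1,1`.
    pairs = sorted(map_words(lines), key=itemgetter(0))
    return list(reduce_counts(pairs))


if __name__ == "__main__":
    for word, total in run_pipeline(["foo foo quux labs foo bar quux"]):
        print("%s\t%d" % (word, total))
```

The sort step matters because the reduce step only merges adjacent pairs; without it, equal keys that are not neighbours would be reported separately.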

    If you skip the commands below, prefix each .py file with python when invoking it (for example, python mapper.py):

    chmod +x mapper.py
    chmod +x reducer.py

    In a distributed environment, submit the job through Hadoop Streaming with the following command:

    hadoop jar /[YOUR_PATH]/hadoop/tools/lib/hadoop-streaming-2.6.0-cdh5.4.4.jar \
     -file mapper.py -mapper mapper.py \
     -file reducer.py -reducer reducer.py \
     -input [IN_FILE]    -output [OUT_DIR]
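
When the job is launched from a script rather than typed by hand, it can be convenient to assemble the argument list in Python. The following is a minimal sketch that uses only the flags shown in the command above; `build_streaming_command` is a hypothetical helper, and the bracketed values are the same placeholders as in the original command, to be replaced with real paths.

```python
import shlex


def build_streaming_command(jar_path, mapper, reducer, input_path, output_dir):
    """Return the Hadoop Streaming invocation as an argument list."""
    return [
        "hadoop", "jar", jar_path,
        "-file", mapper, "-mapper", mapper,
        "-file", reducer, "-reducer", reducer,
        "-input", input_path,
        "-output", output_dir,
    ]


if __name__ == "__main__":
    cmd = build_streaming_command(
        "/[YOUR_PATH]/hadoop/tools/lib/hadoop-streaming-2.6.0-cdh5.4.4.jar",
        "mapper.py",
        "reducer.py",
        "[IN_FILE]",   # placeholder, as in the command above
        "[OUT_DIR]",   # placeholder, as in the command above
    )
    # Print a shell-safe rendering instead of running it.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Once the placeholders are filled in, the list can be handed to `subprocess.run(cmd, check=True)` on a machine where the `hadoop` executable is on the PATH.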

    mapper.py

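    #!/usr/bin/python
    # -*- coding: UTF-8 -*-
    
    __author__ = 'Manhua'
    
    import sys
    
    # Emit "<field0>`<field1><TAB>1" for every backtick-delimited input line.
    for line in sys.stdin:
        line = line.strip()
        item = line.split('`')
        if len(item) < 2:  # skip lines that do not have at least two fields
            continue
        # print() with a single formatted string works on Python 2 and 3
        print("%s\t%s" % (item[0] + '`' + item[1], 1))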
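
The mapper above keys each record on its first two backtick-separated fields. A hedged sketch of the same logic as a plain function makes it easier to try on sample lines without piping through stdin. The name `map_line`, the `sep` parameter, the choice to return `None` for short lines, and the sample values are all assumptions made for illustration.

```python
def map_line(line, sep="`"):
    """Return the 'key<TAB>1' record for one line, or None if it has fewer than two fields."""
    item = line.strip().split(sep)
    if len(item) < 2:
        return None
    return "%s\t%s" % (item[0] + sep + item[1], 1)


if __name__ == "__main__":
    # Hypothetical sample lines; real input fields will differ.
    sample = ["devA`probe1`extra", "devA`probe1", "malformed"]
    for raw in sample:
        record = map_line(raw)
        if record is not None:
            print(record)
```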

    reducer.py

    #!/usr/bin/python
    # -*- coding: UTF-8 -*-
    
    __author__ = 'Manhua'
    
    
    import sys
    
    current_word = None
    current_count = 0
    
    # Input must be sorted by key so that equal keys arrive consecutively.
    for line in sys.stdin:
        line = line.strip()
        try:
            word, count = line.split('\t', 1)
            count = int(count)
        except ValueError:  # skip lines with no tab or a non-numeric count
            continue
        if current_word == word:
            current_count += count
        else:
            if current_word is not None:
                print("%s\t%s" % (current_word, current_count))
            current_count = count
            current_word = word
    
    if current_word is not None:  # do not forget to emit the last key
        print("%s\t%s" % (current_word, current_count))
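
The reducer relies on its input being sorted, so that all records for one key are adjacent. The same idea can be expressed with `itertools.groupby`, which groups consecutive items with equal keys. This is an alternative sketch, not the author's code: `parse_records` and `reduce_sorted` are hypothetical names, and malformed lines are silently skipped by assumption.

```python
from itertools import groupby


def parse_records(lines):
    """Yield (key, count) from 'key<TAB>count' lines, skipping malformed ones."""
    for line in lines:
        key, sep, value = line.strip().partition("\t")
        if not sep:  # no tab separator on this line
            continue
        try:
            yield key, int(value)
        except ValueError:  # count is not an integer
            continue


def reduce_sorted(lines):
    """Sum counts per key; equal keys must already be adjacent in the input."""
    for key, group in groupby(parse_records(lines), key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)


if __name__ == "__main__":
    import sys

    for key, total in reduce_sorted(sys.stdin):
        print("%s\t%d" % (key, total))
```

Because `groupby` only merges neighbouring items, this version still needs the `sort` step in front of it, exactly like the loop-based reducer.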
  • Original source: https://www.cnblogs.com/manhua/p/6593185.html