• Python结合Shell/Hadoop实现MapReduce


    基本流程为: 

    cat data | map | sort | reduce

    cat devProbe | ./mapper.py | sort| ./reducer.py

    echo "foo foo quux labs foo bar quux" | ./mapper.py | sort -k1,1 | ./reducer.py

    # -k, -key=POS1[,POS2]     键以pos1开始,以pos2结束

    如不执行下述命令,可以再py文件前加上python调用

    chmod +x mapper.py
    chmod +x reducer.py

    对于分布式环境下,可以使用以下命令:

    hadoop jar /[YOUR_PATH]/hadoop/tools/lib/hadoop-streaming-2.6.0-cdh5.4.4.jar
     -file mapper.py -mapper mapper.py
     -file reducer.py -reducer reducer.py
     -input [IN_FILE]    -output [OUT_DIR]

    mapper.py

    #!/usr/bin/python
    # -*- coding: UTF-8 -*-
    
    __author__ = 'Manhua'
    
    import sys
    for line in sys.stdin:
        line = line.strip()
        item = line.split('`')
        print "%s	%s" % (item[0]+'`'+item[1], 1)

    reducer.py

    #!/usr/bin/python
    # -*- coding: UTF-8 -*-
    
    __author__ = 'Manhua'
    
    
    import sys
    
    current_word = None
    current_count = 0
    word = None
    
    for line in sys.stdin:
        line = line.strip()
        word, count = line.split('	', 1)
        try:
            count = int(count)
        except ValueError:  #count如果不是数字的话,直接忽略掉
            continue
        if current_word == word:
            current_count += count
        else:
            if current_word:
                print "%s	%s" % (current_word, current_count)
            current_count = count
            current_word = word
    
    if word == current_word:  #不要忘记最后的输出
        print "%s	%s" % (current_word, current_count)
  • 相关阅读:
    魔法方法中的__str__和__repr__区别
    新建分类目录后,点击显示错误页面?
    3.用while和for循环分别计算100以内奇数和偶数的和,并输出。
    2.for循环实现打印1到10
    1.while循环实现打印1到10
    021_for语句
    014_运算符_字符串连接
    020_while语句
    019_增强switch语句
    018_switch语句
  • 原文地址:https://www.cnblogs.com/manhua/p/6593185.html
Copyright © 2020-2023  润新知