Implementing MapReduce Jobs in Python with Hadoop Streaming

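Hadoop Streaming makes it possible to implement the map and reduce operations in any programming language that can read from standard input and write to standard output.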

To make the map and reduce code easy to test, here is a shell command for a Linux environment:

cat inputFileName | python map.py | sort | python reduce.py > outputFileName

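The pipeline imitates what Hadoop does between the two phases (sorting the mapper output by key), so the scripts can be tested easily on a machine without a Hadoop installation.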

The rest of this post shows how to write the Map and Reduce code in Python for a Hadoop environment.

Example task

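As in most articles about MapReduce, the WordCount task serves as the example here. The task is to count how many times each word occurs in a large collection of documents, which amounts to computing term frequency (tf).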
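For a small input, the expected result can be computed directly in memory, which gives a reference to compare the MapReduce output against. The following is a minimal sketch; the function name `reference_word_count` is made up for illustration and is not part of Hadoop.

```python
from collections import Counter


def reference_word_count(lines):
    """Count whitespace-separated words across all lines, sorted by word."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return sorted(counts.items())


if __name__ == "__main__":
    docs = ["hello world", "hello hadoop"]
    for word, total in reference_word_count(docs):
        print("{}\t{}".format(word, total))
```

This approach only works while all counts fit in the memory of one machine, which is exactly the limitation that the map and reduce scripts are meant to avoid.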

map.py

import sys

# Emit one "word<TAB>1" line for every word read from standard input.
for line in sys.stdin:
    words = line.strip().split()
    for word in words:
        print("{}\t{}".format(word, 1))

reduce.py

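import sys

# The input arrives sorted by key, so all lines for one word are adjacent.
curWord, wordCount = None, 0

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue  # skip blank lines
    word, count = line.split('\t', 1)
    count = int(count)
    if word == curWord:
        wordCount += count
    else:
        # A new word starts: output the total of the previous word, if any.
        if curWord is not None:
            print("{}\t{}".format(curWord, wordCount))
        curWord, wordCount = word, count

# Output the total of the last word.
if curWord is not None:
    print("{}\t{}".format(curWord, wordCount))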
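Because the reducer input is sorted by key, the same logic can also be expressed with `itertools.groupby`, which groups consecutive items that share a key. The following is an alternative sketch, not the version used in this post; the function names `parse` and `reduce_stream` are made up for illustration. Wrapping the logic in a function that accepts any iterable of lines also makes it testable without standard input.

```python
import sys
from itertools import groupby


def parse(lines):
    """Yield (word, count) pairs from tab-separated lines, skipping blanks."""
    for line in lines:
        line = line.strip()
        if line:
            word, count = line.split("\t", 1)
            yield word, int(count)


def reduce_stream(lines):
    """Sum the counts of consecutive lines that share the same word."""
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


if __name__ == "__main__":
    for word, total in reduce_stream(sys.stdin):
        print("{}\t{}".format(word, total))
```

Note that `groupby` only merges adjacent items, so this version relies on sorted input in the same way the hand-written loop does.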

Once the command shown earlier runs correctly on a single machine, submit the job to Hadoop with the following shell command:

hadoop jar $HADOOP_STREAMING \
-D mapred.job.name="your custom job name" \
-D mapred.map.tasks=1024 \
-D mapred.reduce.tasks=1024 \
-files map.py,reduce.py \
-mapper "python map.py" \
-reducer "python reduce.py" \
-input /user/rte/hdfs_in/* \
-output /user/rte/hdfs_out

Original article: https://www.cnblogs.com/crackpotisback/p/10548693.html