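Writing a basic MapReduce program in Python

A MapReduce program has roughly three parts: the mapper file, the reducer file, and the hadoop command that runs the job.

The thing that confused me longest was the path to hadoop-streaming, that is, the streaming jar, which the hadoop command needs.

The path looks roughly like this: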

cd ~
cd /usr/local/hadoop-2.7.3/share/hadoop/tools/lib
# this directory contains hadoop-streaming-2.7.3.jar

I took this path from a reference post.
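The jar name changes with the Hadoop version, so it can help to search for it rather than type it by hand. Below is a minimal sketch, assuming the same share/hadoop/tools/lib layout as above; `find_streaming_jars` is a hypothetical helper written for this post, not part of Hadoop.

```python
import glob
import os


def find_streaming_jars(hadoop_home):
    """Return hadoop-streaming jars found under hadoop_home, sorted by path.

    Assumes the share/hadoop/tools/lib layout shown above; other
    distributions may keep the jar somewhere else.
    """
    pattern = os.path.join(
        hadoop_home, "share", "hadoop", "tools", "lib", "hadoop-streaming-*.jar"
    )
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    # /usr/local/hadoop-2.7.3 is the install directory used in this post.
    for jar in find_streaming_jars("/usr/local/hadoop-2.7.3"):
        print(jar)
```

An empty result means the jar is not under that layout, in which case a broader search of the install directory is the next thing to try.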

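For this basic MapReduce program I mainly relied on three blog posts:

The first: I followed its way of writing the mapper and reducer. One of its exercises also gives an example that uses only a mapper.

The second: I followed its way of writing run.sh.

The third: I followed its steps for uploading a local file to HDFS.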

First, the mapper file:
mapper.py

#!/usr/bin/env python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        # (the parenthesized print works on both Python 2 and 3)
        print('%s\t%s' % (word, 1))

# The mapper prints one line per word, each paired with the number 1.

Next, the reducer file: reducer.py

#!/usr/bin/env python

import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py and convert
    # count (currently a string) to int
    try:
        word, count = line.split('\t', 1)
        count = int(count)
    except ValueError:
        # the line had no tab or count was not a number,
        # so silently ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word is not None:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word; the None check
# avoids printing a bogus line when the input was empty
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))
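The same reduce step can also be written with `itertools.groupby`, which groups adjacent lines that share a key. This is a sketch of an alternative, not the version used in the rest of this post; `parse_pairs` and `reduce_sorted` are names made up here, and the input is still assumed to arrive sorted by word.

```python
import sys
from itertools import groupby
from operator import itemgetter


def parse_pairs(lines):
    """Yield (word, count) from tab-separated lines, skipping malformed ones."""
    for line in lines:
        word, sep, count = line.strip().partition("\t")
        if not sep:
            continue  # no tab on this line
        try:
            yield word, int(count)
        except ValueError:
            continue  # count was not a number


def reduce_sorted(lines):
    """Sum the counts per word; the input must already be sorted by word."""
    for word, group in groupby(parse_pairs(lines), key=itemgetter(0)):
        yield word, sum(count for _, count in group)


if __name__ == "__main__":
    for word, total in reduce_sorted(sys.stdin):
        print("%s\t%d" % (word, total))
```

Because `groupby` only merges neighbouring items, it has the same dependency on sorted input as the hand-written loop.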

Test the two scripts locally first:

vim test.txt
foo foo quux labs foo bar quux

cat test.txt|python mapper.py

cat test.txt|python mapper.py|sort|python reducer.py
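## Note that the mapper output is sorted before it reaches the reducer, so identical words sit on adjacent lines. That is what lets the reducer code stay this simple.

Then run the code on the Hadoop cluster.

First upload test.txt to the target directory in HDFS, using a command like this: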
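To see why the sort matters, the pipeline can be imitated in memory. This is a rough sketch: `map_words` and `sum_adjacent` are hypothetical stand-ins for mapper.py and reducer.py, and `sorted` plays the role of the `sort` command.

```python
def map_words(text):
    """Emit one (word, 1) pair per whitespace-separated word, like mapper.py."""
    return [(word, 1) for word in text.split()]


def sum_adjacent(pairs):
    """Sum counts over runs of adjacent equal keys, like reducer.py."""
    results = []
    for word, count in pairs:
        if results and results[-1][0] == word:
            # same key as the previous pair: extend the current run
            results[-1] = (word, results[-1][1] + count)
        else:
            # key changed: start a new run
            results.append((word, count))
    return results


if __name__ == "__main__":
    pairs = map_words("foo foo quux labs foo bar quux")
    # Without sorting, 'foo' and 'quux' each show up in two separate runs.
    print(sum_adjacent(pairs))
    # With sorting, every word gets exactly one total.
    print(sum_adjacent(sorted(pairs)))
```

On the test line above, the unsorted call reports `foo` twice (2, then 1), while the sorted call reports `foo` once with a count of 3.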

hadoop fs -put ./test.txt /dw_ext/weibo_bigdata_ugrowth/mds/

Then write a run.sh:


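HADOOP_CMD="/usr/local/hadoop-2.7.3/bin/hadoop"  # path to the hadoop binary
STREAM_JAR_PATH="/usr/local/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar"  # path to the streaming jar

INPUT_FILE_PATH="/dw_ext/weibo_bigdata_ugrowth/mds/test.txt"  # input path on the cluster (the file uploaded above)
# Note that the input file must live in HDFS on the Hadoop cluster, so the local file has to be uploaded first.
OUTPUT_PATH="/dw_ext/weibo_bigdata_ugrowth/mds/output"
# Note that the output directory must not exist yet. I had already run the job once, so I delete the directory here first.
# (-rmr is deprecated in Hadoop 2.x; -rm -r is the current form.)

$HADOOP_CMD fs -rm -r "$OUTPUT_PATH"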

$HADOOP_CMD jar $STREAM_JAR_PATH \
    -input $INPUT_FILE_PATH \
    -output $OUTPUT_PATH \
    -mapper "python mapper.py" \
    -reducer "python reducer.py" \
    -file ./mapper.py \
    -file ./reducer.py

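# -mapper: the user's own mapper program; it can be an executable or a script
# -reducer: the user's own reducer program; it can be an executable or a script
# -file: packs a file into the submitted job; it can be anything the mapper or reducer needs, such as a config file or a dictionary.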
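If the job is launched from Python instead of a shell script, the same command can be assembled as an argument list, which avoids shell quoting problems with values like "python mapper.py". This is only a sketch: `build_streaming_command` is a hypothetical helper, and it uses just the options already shown in run.sh.

```python
def build_streaming_command(hadoop_cmd, jar, input_path, output_path,
                            mapper="mapper.py", reducer="reducer.py"):
    """Return the hadoop streaming invocation from run.sh as an argument list."""
    return [
        hadoop_cmd, "jar", jar,
        "-input", input_path,
        "-output", output_path,
        "-mapper", "python " + mapper,
        "-reducer", "python " + reducer,
        "-file", "./" + mapper,
        "-file", "./" + reducer,
    ]


if __name__ == "__main__":
    cmd = build_streaming_command(
        "/usr/local/hadoop-2.7.3/bin/hadoop",
        "/usr/local/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar",
        "/dw_ext/weibo_bigdata_ugrowth/mds/test.txt",
        "/dw_ext/weibo_bigdata_ugrowth/mds/output",
    )
    # Printing only; on a machine with Hadoop installed the list could be
    # passed to subprocess.check_call(cmd) to submit the job.
    print(cmd)
```

The input path in the example assumes the test.txt uploaded earlier; substitute whatever file was actually put into HDFS.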

To read next:
https://www.cnblogs.com/shay-zhangjin/p/7714868.html
https://www.cnblogs.com/kaituorensheng/p/3826114.html

Original post: https://www.cnblogs.com/lzida9223/p/10536253.html