Using Hadoop Streaming

In this section we implement WordCount in both C++ and Python.

First, a brief introduction to Hadoop Streaming.

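The mapper and reducer read user data from standard input, process it line by line, and write the results to standard output. The Streaming utility creates a MapReduce job, submits it to the tasktrackers, and monitors the job's execution throughout.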

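When a file (an executable or a script) is used as the mapper, each mapper task launches that file as a separate process when the mapper is initialized. While the task runs, it splits its input into lines and feeds each line to the process's standard input. At the same time, the mapper collects the process's standard output and converts each line into a key/value pair, which becomes the mapper's output. By default, the part of a line before the first tab is the key and the rest of the line (excluding the tab) is the value. If a line contains no tab, the whole line is the key and the value is null.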
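The default splitting rule can be sketched in a few lines of Python. This only illustrates the rule as described here and is not Hadoop's own code; the function name `split_key_value` is hypothetical, and `None` stands in for the "null" value.

```python
def split_key_value(line, sep="\t"):
    """Split one output line into (key, value) at the first separator."""
    line = line.rstrip("\n")
    key, found, value = line.partition(sep)
    if not found:
        # No tab on the line: the whole line becomes the key.
        return line, None
    # Only the first tab separates key and value; later tabs stay in the value.
    return key, value


for sample in ["hello\t1", "a\tb\tc", "no tab here"]:
    print(split_key_value(sample))
```

Using `str.partition` rather than `str.split` keeps any further tabs inside the value, which matches the "first tab" wording of the rule.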

The reducer works in a similar way. (See Dong's blog.)
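One detail of the reducer side is worth illustrating: Hadoop sorts the map output by key before passing it to the reducer, so all lines for one key arrive consecutively. The following is a minimal sketch that relies on this, assuming well-formed `word<TAB>count` lines; `reduce_sorted` is a hypothetical helper name, not part of Hadoop.

```python
from itertools import groupby


def reduce_sorted(lines):
    """Yield (word, total) pairs from lines that are already sorted by key."""
    # Assumes every line contains exactly one key, a tab, and an integer count.
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


sample = ["a\t1\n", "a\t1\n", "b\t1\n"]
for word, total in reduce_sorted(sample):
    print("%s\t%d" % (word, total))
```

Because the counts are summed one group at a time, this style of reducer does not need to hold every word in memory at once.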

The C++ code is as follows:

　　　
```
// Map side
#include <iostream>
#include <string>

using namespace std;

int main() {
    string key;
    const string value = "1";
    // cin >> key splits the input on whitespace, so each word is emitted once.
    while (cin >> key) {
        cout << key << "\t" << value << "\n";
    }
    return 0;
}
```

```
// Reduce side
#include <iostream>
#include <map>
#include <string>

using namespace std;

int main() {
    string key;
    int value;
    map<string, int> word2count;  // word -> accumulated count
    // Each input line is "word<TAB>count"; add the count instead of assuming it is 1.
    while (cin >> key >> value) {
        word2count[key] += value;
    }
    // Emit tab-separated pairs so that the word is treated as the key.
    for (map<string, int>::const_iterator it = word2count.begin(); it != word2count.end(); ++it) {
        cout << it->first << "\t" << it->second << "\n";
    }
    return 0;
}
```
The Python code:

```
#!/usr/bin/env python
# Map side

import sys

for line in sys.stdin:
    # split() with no arguments never returns empty strings, so no filtering is needed
    for word in line.split():
        print('%s\t%s' % (word, 1))
```

```
#!/usr/bin/env python
# Reduce side

import sys

word2count = {}

for line in sys.stdin:
    try:
        word, count = line.strip().split()
        word2count[word] = word2count.get(word, 0) + int(count)
    except ValueError:
        # Skip lines that are not "word<TAB>count" with an integer count
        continue

# Emit the totals sorted by word
for word, count in sorted(word2count.items()):
    print('%s\t%s' % (word, count))
```
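Before submitting a job, it can help to check the map and reduce logic locally. The sketch below imitates the streaming data flow (map, sort by key, reduce) inside one Python process. The function names `map_lines`, `reduce_lines`, and `run_local` are hypothetical, and the in-memory `sorted` call is only a stand-in for Hadoop's shuffle phase.

```python
def map_lines(lines):
    """Mapper logic: emit 'word<TAB>1' for every word."""
    for line in lines:
        for word in line.split():
            yield "%s\t%s" % (word, 1)


def reduce_lines(lines):
    """Reducer logic: sum the counts per word, returned sorted by word."""
    word2count = {}
    for line in lines:
        word, count = line.split("\t", 1)
        word2count[word] = word2count.get(word, 0) + int(count)
    return sorted(word2count.items())


def run_local(lines):
    """Map, sort (stand-in for the shuffle), then reduce."""
    return reduce_lines(sorted(map_lines(lines)))


for word, count in run_local(["hello world", "hello hadoop"]):
    print("%s\t%s" % (word, count))
```

This checks only the counting logic; it does not exercise input splitting, multiple reducers, or anything else the cluster does.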
　　
　　
Original article: https://www.cnblogs.com/xingxing1024/p/7459158.html