[hadoop2.6.0] Writing MapReduce in C++

Hadoop supports MapReduce code written in languages other than Java through Hadoop Streaming. For someone like me who doesn't know any Java at all, this is really great news.

The official introduction to Hadoop Streaming is at: http://hadoop.apache.org/docs/r2.6.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopStreaming.html

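We'll use the wordcount example to illustrate. As the input file I used the English edition of the seventh Harry Potter book, downloaded from the internet and named h.txt.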

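A map program written in C++ only has to read its input from standard input and write <key, value> pairs to standard output. By default, Hadoop Streaming treats everything up to the first tab on an output line as the key and the rest of the line as the value.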

For wordcount the map program is very simple: print each word followed by a 1, meaning that the word occurred once.

wordcount_map.cpp is as follows:
```
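#include <iostream>
#include <string>
using namespace std;

int main(int argc, char** argv)
{
    string word;
    // cin >> word splits the input on whitespace, one word at a time
    while(cin >> word)
    {
        // emit "word<TAB>1"; the tab separates the key from the value
        cout << word << "\t" << "1" << endl;
    }
    return 0;
}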
```
The reduce program has to read the key-value pairs emitted by map, merge the pairs that share the same key (word), and output the merged result.

wordcount_reduce.cpp is as follows:
```
#include <iostream>
#include <string>
#include <map>
using namespace std;

int main(int argc, char** argv)
{
    string key;
    int num;
    map<string, int> count;  // word -> total count
    map<string, int>::iterator it;
    // each input line is "word<TAB>count", as written by the mapper
    while(cin >> key >> num)
    {
        // operator[] creates the entry with 0 if the word is new;
        // add the value itself rather than assuming it is always 1
        count[key] += num;
    }

    for(it = count.begin(); it != count.end(); it++)
    {
        cout << it->first << "\t" << it->second << endl;
    }
    return 0;
}
```
Compile the two .cpp files into executables and put both executables in the Hadoop root directory.
```
g++ -o mapperC wordcount_map.cpp
g++ -o reduceC wordcount_reduce.cpp
```
Upload the input file h.txt to /user/kzy/input on HDFS:
```
bin/hdfs dfs -put h.txt  /user/kzy/input
```
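Running Hadoop Streaming requires hadoop-streaming-2.6.0.jar, which is located at hadoop-2.6.0/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar. At first I couldn't get anything to run, precisely because the files in the new version are in different places than before.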

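Run the MapReduce job. I don't fully understand all of the options, but this works. Note that the -jobconf option from older versions has been replaced by -D. Since -D is a generic option, it has to come before the streaming options such as -input.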
```
bin/hadoop jar share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
    -D mapred.job.name="word count~" \
    -input /user/kzy/input/h.txt \
    -output /user/output/c++_out \
    -mapper ./mapperC \
    -reducer ./reduceC \
    -file mapperC -file reduceC
```
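View the result. In sort, -k 2 sorts on the second field (the count that follows the tab), -n compares the values numerically, and -r sorts from largest to smallest; head -20 shows the first 20 lines of the result.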
```
bin/hdfs dfs -cat /user/output/c++_out/* | sort -k 2 -n -r | head -20
```
The result is as follows: [output listing not preserved]
Original post: https://www.cnblogs.com/dplearning/p/4207931.html