Atitit Hadoop usage notes
HDFS is the distributed file system of the Hadoop big data platform. It provides data storage for upper-layer applications and for other big data components such as Hive, MapReduce, Spark, and HBase.
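As a quick illustration of how an application talks to HDFS, below is a minimal sketch using the Hadoop FileSystem API. The NameNode address (hdfs://localhost:9000) and the file path (/tmp/hello.txt) are placeholder assumptions and must be replaced with real cluster values.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDemo {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace with your cluster's fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // Write a small text file to HDFS (example path).
        Path file = new Path("/tmp/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back line by line.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}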
To compile and run the MapReduce example below, add the following jars to the project classpath:
hadoop-2.4.1\share\hadoop\common\hadoop-common-2.4.1.jar
hadoop-2.4.1\share\hadoop\common\lib\ (all jar files)
hadoop-2.4.1\share\hadoop\mapreduce\lib\ (all jar files)
---------------------
package hadoopDemo;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Generic signature: Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // 1. Map phase (one call per record of the input split):
    //    1) Extend the Mapper class and declare the input key/value types.
    //    2) Declare the output key/value types.
    //    3) Override the map method.
    //    map() receives the byte offset of the line, the line text, and the context used to emit output.
    @Override
    protected void map(LongWritable key, Text value_line, Context context)
            throws IOException, InterruptedException {
        String line = value_line.toString();
        String[] words = line.split(" ");
        for (String word : words) {
            Text key_Text = new Text();
            IntWritable val_IntWritable = new IntWritable(1);
            key_Text.set(word);
            context.write(key_Text, val_IntWritable);
        }
    }
}
package hadoopDemo;

import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import com.alibaba.fastjson.JSON;
import com.google.common.collect.Maps;

public class WCReduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0; // running count of how many times this word has appeared
        for (IntWritable num : values) { // iterate over all counts emitted for this key
            sum += num.get();
            // Debug output: dump the current key, value and running sum as JSON.
            Map<String, Object> m = Maps.newConcurrentMap();
            m.put("key", key);
            m.put("num", num);
            m.put("sum_curr", sum);
            System.out.println(JSON.toJSONString(m));
        }
        context.write(key, new IntWritable(sum));
    }
}
The purpose of the driver is to specify the user's Map class and Reduce class in the program, and to configure the parameters used when the job is submitted to Hadoop. The WCDriver class below drives the word-count job; its core code is as follows:
package hadoopDemo;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WCDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Needed when running locally on Windows.
        System.load("D:\\haddop\\hadoop-3.1.1\\bin\\hadoop.dll");

        // Create the job.
        Job job = Job.getInstance(new Configuration());

        // Set the driver class.
        job.setJarByClass(WCDriver.class);

        // Set the mapper and reducer classes.
        job.setMapperClass(WCMapper.class);
        job.setReducerClass(WCReduce.class);

        // Set the key/value types output by the map phase.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Set the key/value types output by the reduce phase.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Set the input file path and the output path.
        String path_ipt = "D:\\workspace\\hadoopDemo\\ipt.txt";
        FileInputFormat.setInputPaths(job, new Path(path_ipt));
        String path_out = "D:\\workspace\\hadoopDemo\\out.txt";
        FileOutputFormat.setOutputPath(job, new Path(path_out));

        // Submit the job and wait for it to finish.
        boolean result = job.waitForCompletion(true);
        System.out.println(result);

        // Keep the process alive so the console output can be inspected.
        while (true) {
            Thread.sleep(5000);
            System.out.println("..");
        }
        // System.exit(result ? 0 : 1);
    }
}
For comparison, the word-count driver MyWordCount.java from the CUUG example has the following core code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyWordCount {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(MyWordCount.class);
        job.setMapperClass(WordcountMapper.class);
        // The reducer is also used as a combiner, since summing partial counts is associative.
        job.setCombinerClass(WordcountReducer.class);
        job.setReducerClass(WordcountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output paths are taken from the command-line arguments.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
As the core code above shows, the input and output paths are set in the main function, and a Job object is required to submit the job; the job name, the Map class, the Reduce class, and the key/value types are all specified on that Job object. Source: CUUG official website.
Hadoop environment variables can be set by appending the following line to the ~/.bashrc file:
export HADOOP_HOME=/usr/local/hadoop
In Eclipse, these environment variables can only be configured in the Run Configuration.
Sample input file (D:\workspace\hadoopDemo\ipt.txt):

aaa bbb ccc aaa

Debug output printed by WCReduce (the key field shows the Text bytes base64-encoded, e.g. "YWFh" is "aaa"):
{"num":{},"sum_curr":1,"key":{"bytes":"YWFh","length":3}}
{"num":{},"sum_curr":2,"key":{"bytes":"YWFh","length":3}}
{"num":{},"sum_curr":1,"key":{"bytes":"YmJi","length":3}}
{"num":{},"sum_curr":1,"key":{"bytes":"Y2Nj","length":3}}
Contents of the output file D:\workspace\hadoopDemo\out.txt\part-r-00000:
aaa 2
bbb 1
ccc 1
1. If running in jar mode, package the project into a jar and upload it to the virtual machine.
2. Run the jar file on the cluster, e.g. hadoop jar <your-jar> <driver-class> <input-path> <output-path>.