- Developing MapReduce programs in Java
- Set the system environment variable HADOOP_HOME to point to the Hadoop installation directory (to save yourself unnecessary trouble, do not include spaces or Chinese characters in the path).
- Add HADOOP_HOME/bin to the PATH environment variable (not required, just for convenience).
- If you develop on Windows, you also need the Windows native library files:
  - Overwrite HADOOP_HOME/bin with the bin directory shared on the course disk.
  - If it still does not work, copy hadoop.dll from that directory into c:\windows\system32; you may need to reboot the machine.
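A quick way to sanity-check the Windows setup from code is sketched below; it only assumes that HADOOP_HOME is set and that hadoop.dll was copied into HADOOP_HOME/bin as described above (the class name is just illustrative).

```java
import java.io.File;

public class HadoopEnvCheck {
    public static void main(String[] args) {
        // HADOOP_HOME must point at the Hadoop installation directory (no spaces or Chinese characters).
        String home = System.getenv("HADOOP_HOME");
        if (home == null || home.isEmpty()) {
            System.err.println("HADOOP_HOME is not set");
            return;
        }
        // hadoop.dll should be in HADOOP_HOME/bin after copying in the shared bin directory.
        File dll = new File(home, "bin" + File.separator + "hadoop.dll");
        System.out.println("HADOOP_HOME = " + home);
        System.out.println("hadoop.dll present: " + dll.exists());
    }
}
```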
- Create a new project and add the JAR files that Hadoop requires.
- WordMapper code:
```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value,
            Mapper<LongWritable, Text, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        // Each input value is one line of text; split it into words on spaces.
        String line = value.toString();
        String[] words = line.split(" ");
        for (String word : words) {
            // Emit (word, 1) for every word occurrence.
            context.write(new Text(word), new IntWritable(1));
        }
    }
}
```
- WordReducer code:
```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordReducer extends Reducer<Text, IntWritable, Text, LongWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Reducer<Text, IntWritable, Text, LongWritable>.Context context)
            throws IOException, InterruptedException {
        // Sum the counts emitted by the mappers for this word.
        long count = 0;
        for (IntWritable v : values) {
            count += v.get();
        }
        context.write(key, new LongWritable(count));
    }
}
```
- Test (driver) code:
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Test {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // Mapper and reducer classes.
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(WordReducer.class);

        // The map output types differ from the final (reduce) output types,
        // so both pairs must be declared.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // Local input file and output directory (the output directory must not exist yet).
        FileInputFormat.setInputPaths(job, "c:/bigdata/hadoop/test/test.txt");
        FileOutputFormat.setOutputPath(job, new Path("c:/bigdata/hadoop/test/out/"));

        job.waitForCompletion(true);
    }
}
```
- The job can also pull its input from HDFS while still running locally (see the sketch below). Running this way does not require YARN (stop the YARN service yourself to verify).
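The snippet that originally accompanied this step was not preserved; the fragment below is only a sketch of the idea. It would go in the driver's main() in place of the local paths above, and it assumes the NameNode is reachable at hdfs://master:9000 (host and port are assumptions) and reuses the /wcinput/ and /wcoutput3/ paths that appear later in these notes.

```java
// Point the job at HDFS instead of the local file system (assumed NameNode address).
conf.set("fs.defaultFS", "hdfs://master:9000");
FileInputFormat.setInputPaths(job, "/wcinput/");
FileOutputFormat.setOutputPath(job, new Path("/wcoutput3/"));
```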
- To execute on the remote server (submit to the YARN cluster), set the following on the Configuration:
```java
conf.set("mapreduce.job.jar", "target/wc.jar");
conf.set("mapreduce.framework.name", "yarn");
conf.set("yarn.resourcemanager.hostname", "master");
conf.set("mapreduce.app-submission.cross-platform", "true");
```
Input and output then point at paths in HDFS:

```java
FileInputFormat.setInputPaths(job, "/wcinput/");
FileOutputFormat.setOutputPath(job, new Path("/wcoutput3/"));
```
- Alternatively, copy Hadoop's four configuration files into the src root directory; then no manual configuration is needed, because they are picked up from the classpath by default.
- Or put the configuration files somewhere else and add them with conf.addResource(XXX.class.getClassLoader().getResourceAsStream(...)); using absolute paths is not recommended. A sketch of this approach follows.
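A minimal sketch of the addResource approach, assuming the four cluster configuration files are packaged under a "hadoop-conf" folder on the classpath (the folder name is an assumption, not from the original):

```java
Configuration conf = new Configuration();
// Load the cluster configuration from the classpath instead of hard-coded absolute paths.
ClassLoader cl = Test.class.getClassLoader();
conf.addResource(cl.getResourceAsStream("hadoop-conf/core-site.xml"));
conf.addResource(cl.getResourceAsStream("hadoop-conf/hdfs-site.xml"));
conf.addResource(cl.getResourceAsStream("hadoop-conf/mapred-site.xml"));
conf.addResource(cl.getResourceAsStream("hadoop-conf/yarn-site.xml"));
```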
- Create a maven-hadoop project; pom.xml:
```xml
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>mashibing.com</groupId>
    <artifactId>maven</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>wc</name>
    <description>hello mp</description>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <hadoop.version>2.7.3</hadoop.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
    </dependencies>
</project>
```
- Configure log4j.properties and put it in the src/main/resources directory:
```properties
log4j.rootCategory=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=[QC] %p [%t] %C.%M(%L) | %m%n
```