public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  // conf is the job's configuration object; it reads the settings from the
  // core-default/core-site, hdfs-default/hdfs-site and mapred-default/mapred-site files
  String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
  // args[] carries the input/output path arguments passed to main when the job
  // is launched with the "hadoop jar" command
  if (otherArgs.length != 2) {
    System.err.println("Usage: wordcount <in> <out>");
    System.exit(2); // System.exit(0) means a normal exit; a non-zero argument means an abnormal exit
  }

  Job job = new Job(conf, "word count");
  // The calls below set the job's runtime parameters; internally each of them
  // delegates to the corresponding method of the JobConf object
  job.setJarByClass(WordCount.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  // set the job's input and output paths
  FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
  FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}
As you can see, once the job has been configured, the code calls System.exit(job.waitForCompletion(true) ? 0 : 1). In job.waitForCompletion(true), the argument true means that progress information is printed while the job runs. waitForCompletion is implemented as follows:
public boolean waitForCompletion(boolean verbose
                                 ) throws IOException, InterruptedException,
                                          ClassNotFoundException {
  if (state == JobState.DEFINE) {
    submit();
  }
  if (verbose) {
    jobClient.monitorAndPrintJob(conf, info);
  } else {
    info.waitForCompletion();
  }
  return isSuccessful();
}
At submission time the job's state is JobState.DEFINE, so submit() is called. Its implementation is shown below:
public void submit() throws IOException, InterruptedException,
                            ClassNotFoundException {
  ensureState(JobState.DEFINE);
  setUseNewAPI(); // use the new MapReduce API

  // Connect to the JobTracker and submit the job
  connect();
  info = jobClient.submitJobInternal(conf);
  super.setJobID(info.getID());
  state = JobState.RUNNING;
}
The connect() method connects to the JobTracker; its implementation is:
private void connect() throws IOException, InterruptedException {
  ugi.doAs(new PrivilegedExceptionAction<Object>() {
    public Object run() throws IOException {
      jobClient = new JobClient((JobConf) getConfiguration());
      return null;
    }
  });
}
public JobClient(JobConf conf) throws IOException {
  ......
  init(conf);
}

public void init(JobConf conf) throws IOException {
  ......
  this.jobSubmitClient = createRPCProxy(JobTracker.getAddress(conf), conf);
}

private static JobSubmissionProtocol createRPCProxy(InetSocketAddress addr,
    Configuration conf) throws IOException {
  return (JobSubmissionProtocol) RPC.getProxy(JobSubmissionProtocol.class,
      JobSubmissionProtocol.versionID, addr,
      UserGroupInformation.getCurrentUser(), conf,
      NetUtils.getSocketFactory(conf, JobSubmissionProtocol.class));
}
In init(), the call to createRPCProxy() obtains an RPC proxy that implements JobSubmissionProtocol, i.e. a proxy for the JobTracker, so jobSubmitClient is an implementation of the JobSubmissionProtocol interface. JobSubmissionProtocol is the bridge through which the JobClient communicates with the JobTracker.
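To make the role of that proxy concrete, here is a simplified sketch of the part of JobSubmissionProtocol that the submission path in this post actually uses. It is not the full interface: the method list is restricted to the calls that appear in the code quoted below, the signatures are abridged, and the sketch type name is made up.

import java.io.IOException;

import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.JobProfile;
import org.apache.hadoop.mapred.JobStatus;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.authorize.AccessControlList;

// Sketch only: a cut-down, renamed view of JobSubmissionProtocol, limited to the
// methods used by submitJobInternal below; the real interface declares more
// methods and exceptions (package names assume the Hadoop 1.x layout).
interface JobSubmissionProtocolSketch {
  // ask the JobTracker for a fresh job ID
  JobID getNewJobId() throws IOException;

  // hand the staged job (job.xml, job.jar, splits) over to the JobTracker
  JobStatus submitJob(JobID jobId, String jobSubmitDir, Credentials ts)
      throws IOException;

  // fetch the profile used to build the NetworkedJob returned to the caller
  JobProfile getJobProfile(JobID jobId) throws IOException;

  // look up the admins of the queue the job is submitted to
  AccessControlList getQueueAdmins(String queueName) throws IOException;
}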
The connect() method creates the jobClient; in submit(), the JobClient then formally submits the job to the JobTracker via submitJobInternal(conf). submitJobInternal(conf) is implemented as follows:
public RunningJob submitJobInternal(final JobConf job
                                    ) throws FileNotFoundException,
                                             ClassNotFoundException,
                                             InterruptedException,
                                             IOException {
  /*
   * configure the command line options correctly on the submitting dfs
   */
  return ugi.doAs(new PrivilegedExceptionAction<RunningJob>() {
    public RunningJob run() throws FileNotFoundException,
                                   ClassNotFoundException,
                                   InterruptedException,
                                   IOException {
      JobConf jobCopy = job;
      Path jobStagingArea = JobSubmissionFiles.getStagingDir(JobClient.this,
                                                             jobCopy);
      JobID jobId = jobSubmitClient.getNewJobId();
      Path submitJobDir = new Path(jobStagingArea, jobId.toString());
      jobCopy.set("mapreduce.job.dir", submitJobDir.toString());
      JobStatus status = null;
      try {
        populateTokenCache(jobCopy, jobCopy.getCredentials());

        copyAndConfigureFiles(jobCopy, submitJobDir);

        // get delegation token for the dir
        TokenCache.obtainTokensForNamenodes(jobCopy.getCredentials(),
                                            new Path [] {submitJobDir},
                                            jobCopy);

        Path submitJobFile = JobSubmissionFiles.getJobConfPath(submitJobDir);
        int reduces = jobCopy.getNumReduceTasks();
        InetAddress ip = InetAddress.getLocalHost();
        if (ip != null) {
          job.setJobSubmitHostAddress(ip.getHostAddress());
          job.setJobSubmitHostName(ip.getHostName());
        }
        JobContext context = new JobContext(jobCopy, jobId);

        // Check the output specification
        if (reduces == 0 ? jobCopy.getUseNewMapper() :
            jobCopy.getUseNewReducer()) {
          org.apache.hadoop.mapreduce.OutputFormat<?,?> output =
            ReflectionUtils.newInstance(context.getOutputFormatClass(),
                jobCopy);
          output.checkOutputSpecs(context);
        } else {
          jobCopy.getOutputFormat().checkOutputSpecs(fs, jobCopy);
        }

        jobCopy = (JobConf)context.getConfiguration();

        // Create the splits for the job
        FileSystem fs = submitJobDir.getFileSystem(jobCopy);
        LOG.debug("Creating splits at " + fs.makeQualified(submitJobDir));
        int maps = writeSplits(context, submitJobDir);
        jobCopy.setNumMapTasks(maps);

        // write "queue admins of the queue to which job is being submitted"
        // to job file.
        String queue = jobCopy.getQueueName();
        AccessControlList acl = jobSubmitClient.getQueueAdmins(queue);
        jobCopy.set(QueueManager.toFullPropertyName(queue,
            QueueACL.ADMINISTER_JOBS.getAclName()), acl.getACLString());

        // Write job file to JobTracker's fs
        FSDataOutputStream out =
          FileSystem.create(fs, submitJobFile,
              new FsPermission(JobSubmissionFiles.JOB_FILE_PERMISSION));

        // removing jobtoken referrals before copying the jobconf to HDFS
        // as the tasks don't need this setting, actually they may break
        // because of it if present as the referral will point to a
        // different job.
        TokenCache.cleanUpTokenReferral(jobCopy);

        try {
          jobCopy.writeXml(out);
        } finally {
          out.close();
        }
        //
        // The actual submission of the job starts here
        //
        printTokens(jobId, jobCopy.getCredentials());
        // use jobSubmitClient.submitJob to submit the job to the JobTracker
        status = jobSubmitClient.submitJob(
            jobId, submitJobDir.toString(), jobCopy.getCredentials());
        JobProfile prof = jobSubmitClient.getJobProfile(jobId);
        if (status != null && prof != null) {
          return new NetworkedJob(status, prof, jobSubmitClient);
        } else {
          throw new IOException("Could not launch job");
        }
      } finally {
        if (status == null) {
          LOG.info("Cleaning up the staging area " + submitJobDir);
          if (fs != null && submitJobDir != null)
            fs.delete(submitJobDir, true);
        }
      }
    }
  });
}
The following figure describes the job submission and initialization process:
The numbers 1 and 2 in the figure indicate the order in which the functions are called.
1) About DistributedCache:
Both the upload and the download of MapReduce job files are handled by the DistributedCache utility.
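As an aside (not part of the WordCount code above), a client can also register extra files with the DistributedCache programmatically before submitting the job. The sketch below uses the old org.apache.hadoop.filecache.DistributedCache API with made-up HDFS paths, and roughly mirrors what the -files, -archives and -libjars options listed under point 2) do for you.

import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheSetupSketch {
  // All paths below are placeholders used for illustration only.
  public static void configure(JobConf conf) throws Exception {
    // roughly what "-files" does: ship an ordinary file (e.g. a dictionary) to every task node
    DistributedCache.addCacheFile(new URI("/user/demo/dict.txt"), conf);
    // roughly what "-archives" does: ship an archive that is unpacked on the task nodes
    DistributedCache.addCacheArchive(new URI("/user/demo/resources.zip"), conf);
    // roughly what "-libjars" does: put a third-party jar on the task classpath
    DistributedCache.addFileToClassPath(new Path("/user/demo/third-party.jar"), conf);
  }
}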
2) About copying the files a job needs to a designated directory on the file system the JobTracker uses (normally a directory on HDFS):
Generally speaking, a typical Java MapReduce job may involve the following resources, all of which need to be copied to a directory on HDFS:
a. The program jar: the jar package of the user's Java MapReduce application (job.jar).
b. The job configuration file: the configuration describing the MapReduce application (job.xml).
c. Third-party jar dependencies: jars the application depends on, specified with the "-libjars" option when the job is submitted.
d. Archive dependencies: when the application uses many files, they can be packed into an archive and specified with the "-archives" option when the job is submitted.
e. Ordinary file dependencies: plain files the application may need, such as a text-format dictionary, specified with the "-files" option when the job is submitted.
In practice, the files that get copied are usually just job.jar, job.xml, job.split and job.splitmetainfo; collectively they are called the job files.
After the job files are uploaded to HDFS, many nodes may download them from HDFS at the same time, creating a file-access hot spot and hence a performance bottleneck. To avoid this, the JobClient raises the replication factor of these files when uploading them (controlled by the mapred.submit.replication parameter, default 10), so that the read load is spread across more nodes.
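If ten replicas is more than a small cluster needs, the value can be lowered in the job configuration before submission; a minimal sketch, assuming only the mapred.submit.replication parameter name quoted above (the class name is made up):

import org.apache.hadoop.mapred.JobConf;

public class SubmitReplicationSketch {
  public static void tune(JobConf conf) {
    // replication factor applied to the staged job files (job.jar, job.xml, splits);
    // the default is 10, a small cluster may prefer something closer to its node count
    conf.setInt("mapred.submit.replication", 3);
  }
}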
Now take a look at one setting in core-site.xml:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/usr/local/hadoop/tmp</value>
</property>
hadoop.tmp.dir is a base setting that the Hadoop file system (HDFS) relies on; many other paths are derived from it, and data such as the NameNode's metadata backups is also kept under this directory. If it is not configured, it defaults to a path under /tmp. Note that for the staging directory discussed below, the path is interpreted inside HDFS rather than as /tmp on the local Linux machine; and because /tmp is a system temporary directory that is usually cleared on reboot, you should configure a persistent data directory instead.
When a WordCount job is run, the files related to the currently running job are copied by the JobClient to a directory on HDFS at submission time, and are deleted from HDFS again after the job finishes successfully.
Two properties let you locate the various files of the currently running job:
Property: mapreduce.jobtracker.staging.root.dir
Value: ${hadoop.tmp.dir}/mapred/staging
Description: the HDFS location where the copied job files are placed

Property: mapreduce.job.dir
Value: ${mapreduce.jobtracker.staging.root.dir}/${user}/.staging/${jobId}
Description: the location of the files belonging to one particular job of user ${user}
Since the configuration file sets hadoop.tmp.dir=/usr/local/hadoop/tmp, it is easy to find the WordCount job's files on HDFS.
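As a rough illustration of how these locations could be inspected from code, the sketch below lists the staging root with the standard FileSystem API. The fallback path construction and the class name are assumptions made for this example; the property names come from the table above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListStagingSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // fall back to the value derived from hadoop.tmp.dir if the staging root is not set explicitly
    String stagingRoot = conf.get("mapreduce.jobtracker.staging.root.dir",
        conf.get("hadoop.tmp.dir") + "/mapred/staging");
    FileSystem fs = FileSystem.get(conf);
    // prints the per-user staging directories; the job files (job.xml, job.jar,
    // job.split, job.splitmetainfo) live under <user>/.staging/<jobId>
    for (FileStatus status : fs.listStatus(new Path(stagingRoot))) {
      System.out.println(status.getPath());
    }
  }
}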