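In Hadoop, a single task may be run by several nodes at the same time, so once each node finishes, the task can end up with more than one result. Hadoop uses OutputCommitter to avoid this: it makes sure that output is committed correctly when a job or task completes. Its main responsibilities are: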
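1. Set up the job at initialization. For example, create a temporary output directory for the job when the job is initialized.
2. Clean up after the job completes. For example, delete the temporary output directory once the job is done.
3. Set up the task's temporary output: create a side-effect file under the job's temporary directory.
4. Check whether the task needs a commit, so that the commit step can be skipped for a task that does not need one.
5. Commit the task's output.
6. Abort the task commit, discarding the task's output.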
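The six steps above can be sketched with plain `java.nio.file` calls. This is only a toy model under stated assumptions: `ToyCommitter`, the `_temporary` directory name, the `attempt_N` ids, and the string-based method signatures are all hypothetical and are not Hadoop's API. It simply shows the idea of writing to a per-attempt temporary directory and promoting exactly one attempt's files to the final location.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical toy committer: mirrors the six steps in the list, NOT Hadoop's API.
class ToyCommitter {
    private final Path outputDir; // final output location
    private final Path tempDir;   // job-level temporary directory (name is an assumption)

    ToyCommitter(Path outputDir) {
        this.outputDir = outputDir;
        this.tempDir = outputDir.resolve("_temporary");
    }

    // Step 1: create the job's temporary output directory.
    void setupJob() throws IOException {
        Files.createDirectories(tempDir);
    }

    // Step 3: each attempt gets its own directory under the job's temporary directory.
    Path setupTask(String attemptId) throws IOException {
        return Files.createDirectories(tempDir.resolve(attemptId));
    }

    // Step 4: a commit is needed only if the attempt actually produced files.
    boolean needsTaskCommit(String attemptId) throws IOException {
        Path dir = tempDir.resolve(attemptId);
        if (!Files.isDirectory(dir)) return false;
        try (Stream<Path> files = Files.list(dir)) {
            return files.findAny().isPresent();
        }
    }

    // Step 5: promote the attempt's files to the final output directory.
    void commitTask(String attemptId) throws IOException {
        Path dir = tempDir.resolve(attemptId);
        List<Path> files;
        try (Stream<Path> s = Files.list(dir)) {
            files = s.collect(Collectors.toList());
        }
        for (Path f : files) {
            Files.move(f, outputDir.resolve(f.getFileName()), StandardCopyOption.REPLACE_EXISTING);
        }
        deleteTree(dir);
    }

    // Step 6: throw away the attempt's temporary output.
    void abortTask(String attemptId) throws IOException {
        deleteTree(tempDir.resolve(attemptId));
    }

    // Step 2: remove the temporary directory once the job is done.
    void commitJob() throws IOException {
        deleteTree(tempDir);
    }

    // Deletes a directory tree, children before parents; does nothing if it is missing.
    private static void deleteTree(Path root) throws IOException {
        if (!Files.exists(root)) return;
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) Files.delete(p);
    }

    // Two attempts of the same task write the same file; only attempt_0 is committed.
    static List<String> demo() {
        try {
            Path out = Files.createTempDirectory("toy-output");
            ToyCommitter c = new ToyCommitter(out);
            c.setupJob();
            Files.write(c.setupTask("attempt_0").resolve("part-0"), "result".getBytes());
            Files.write(c.setupTask("attempt_1").resolve("part-0"), "result".getBytes());
            if (c.needsTaskCommit("attempt_0")) c.commitTask("attempt_0");
            c.abortTask("attempt_1");
            c.commitJob();
            try (Stream<Path> s = Files.list(out)) {
                return s.map(p -> p.getFileName().toString()).sorted().collect(Collectors.toList());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [part-0]
    }
}
```

Although two attempts each produced a `part-0`, the final directory ends up with a single copy and no temporary directory, which is the guarantee the list describes.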
Let's look at the source code of OutputCommitter:
public abstract class OutputCommitter {
  /**
   * For the framework to setup the job output during initialization. This is
   * called from the application master process for the entire job. This will be
   * called multiple times, once per job attempt.
   * Note: sets up the job's output at initialization time. It is called by the
   * master of the whole job, once per job attempt.
   * @param jobContext Context of the job whose output is being written.
   * @throws IOException if temporary output could not be created
   */
  public abstract void setupJob(JobContext jobContext) throws IOException;

  /**
   * For cleaning up the job's output after job completion. This is called
   * from the application master process for the entire job. This may be called
   * multiple times.
   * Note: cleans up the job's output after the job completes. It is called by
   * the master of the whole job and may be called multiple times. This method
   * is no longer used; it has been replaced by commitJob and abortJob.
   * @param jobContext Context of the job whose output is being written.
   * @throws IOException
   * @deprecated Use {@link #commitJob(JobContext)} and
   *             {@link #abortJob(JobContext, JobStatus.State)} instead.
   */
  @Deprecated
  public void cleanupJob(JobContext jobContext) throws IOException { }

  /**
   * For committing job's output after successful job completion. Note that this
   * is invoked for jobs with final runstate as SUCCESSFUL. This is called
   * from the application master process for the entire job. This is guaranteed
   * to only be called once. If it throws an exception the entire job will
   * fail.
   * Note: commits the job's output once the job has completed successfully,
   * i.e. its final run state is SUCCESSFUL. It is called only by the master of
   * the whole job, and only once.
   * @param jobContext Context of the job whose output is being written.
   * @throws IOException
   */
  public void commitJob(JobContext jobContext) throws IOException {
    cleanupJob(jobContext);
  }

  /**
   * For aborting an unsuccessful job's output. Note that this is invoked for
   * jobs with final runstate as {@link JobStatus.State#FAILED} or
   * {@link JobStatus.State#KILLED}. This is called from the application
   * master process for the entire job. This may be called multiple times.
   * Note: aborts the output of an unsuccessful job, i.e. one whose final run
   * state is FAILED or KILLED. The master may call it multiple times.
   * @param jobContext Context of the job whose output is being written.
   * @param state final runstate of the job
   * @throws IOException
   */
  public void abortJob(JobContext jobContext, JobStatus.State state)
      throws IOException {
    cleanupJob(jobContext);
  }

  /**
   * Sets up output for the task. This is called from each individual task's
   * process that will output to HDFS, and it is called just for that task. This
   * may be called multiple times for the same task, but for different task
   * attempts.
   * Note: sets up the task's output. Each task that writes to HDFS calls it
   * from its own process; it may be called several times for the same task,
   * once per task attempt.
   * @param taskContext Context of the task whose output is being written.
   * @throws IOException
   */
  public abstract void setupTask(TaskAttemptContext taskContext)
      throws IOException;

  /**
   * Check whether task needs a commit. This is called from each individual
   * task's process that will output to HDFS, and it is called just for that
   * task.
   * Note: checks whether the task needs to be committed.
   * @param taskContext
   * @return true/false
   * @throws IOException
   */
  public abstract boolean needsTaskCommit(TaskAttemptContext taskContext)
      throws IOException;

  /**
   * To promote the task's temporary output to final output location.
   * If {@link #needsTaskCommit(TaskAttemptContext)} returns true and this
   * task is the task that the AM determines finished first, this method
   * is called to commit an individual task's output. This is to mark
   * that tasks output as complete, as {@link #commitJob(JobContext)} will
   * also be called later on if the entire job finished successfully. This
   * is called from a task's process. This may be called multiple times for the
   * same task, but different task attempts. It should be very rare for this to
   * be called multiple times and requires odd networking failures to make this
   * happen. In the future the Hadoop framework may eliminate this race.
   *
   * @param taskContext Context of the task whose output is being written.
   * @throws IOException if commit is not successful.
   */
  public abstract void commitTask(TaskAttemptContext taskContext)
      throws IOException;

  /**
   * Discard the task output. This is called from a task's process to clean
   * up a single task's output that can not yet been committed. This may be
   * called multiple times for the same task, but for different task attempts.
   * Note: discards the task's output.
   * @param taskContext
   * @throws IOException
   */
  public abstract void abortTask(TaskAttemptContext taskContext)
      throws IOException;

  /**
   * Is task output recovery supported for restarting jobs?
   *
   * If task output recovery is supported, job restart can be done more
   * efficiently.
   *
   * @return <code>true</code> if task output recovery is supported,
   *         <code>false</code> otherwise
   * @see #recoverTask(TaskAttemptContext)
   */
  public boolean isRecoverySupported() {
    return false;
  }

  /**
   * Recover the task output.
   *
   * The retry-count for the job will be passed via the
   * {@link MRJobConfig#APPLICATION_ATTEMPT_ID} key in
   * {@link TaskAttemptContext#getConfiguration()} for the
   * <code>OutputCommitter</code>. This is called from the application master
   * process, but it is called individually for each task.
   *
   * If an exception is thrown the task will be attempted again.
   *
   * This may be called multiple times for the same task. But from different
   * application attempts.
   *
   * @param taskContext Context of the task whose output is being recovered
   * @throws IOException
   */
  public void recoverTask(TaskAttemptContext taskContext)
      throws IOException
  {}
}
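To make the call order described in the Javadoc easier to follow, here is a small in-memory sketch. Everything in it is an assumption made for illustration: `CallOrderSketch`, the `Hooks` interface, and the `runTask` driver are hypothetical, attempts are plain strings rather than `TaskAttemptContext` objects, and the real application master's scheduling is far more involved. It only models the rule that the attempt that finishes first and needs a commit is committed, while any other attempt of the same task is aborted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the hooks shown above; not Hadoop's API.
class CallOrderSketch {
    interface Hooks {
        void setupJob();
        void setupTask(String attempt);
        boolean needsTaskCommit(String attempt);
        void commitTask(String attempt);
        void abortTask(String attempt);
        void commitJob();
    }

    // Records every call so that the order can be inspected afterwards.
    static class RecordingHooks implements Hooks {
        final List<String> calls = new ArrayList<>();
        public void setupJob() { calls.add("setupJob"); }
        public void setupTask(String a) { calls.add("setupTask(" + a + ")"); }
        public boolean needsTaskCommit(String a) {
            calls.add("needsTaskCommit(" + a + ")");
            return true; // assume every attempt produced output
        }
        public void commitTask(String a) { calls.add("commitTask(" + a + ")"); }
        public void abortTask(String a) { calls.add("abortTask(" + a + ")"); }
        public void commitJob() { calls.add("commitJob"); }
    }

    // Assumed driver logic: attempts are given in the order they finish.
    // The first attempt that needs a commit is committed; the rest are aborted.
    static void runTask(Hooks hooks, List<String> attemptsInFinishOrder) {
        for (String a : attemptsInFinishOrder) {
            hooks.setupTask(a);
        }
        boolean committed = false;
        for (String a : attemptsInFinishOrder) {
            if (!committed && hooks.needsTaskCommit(a)) {
                hooks.commitTask(a);
                committed = true;
            } else {
                hooks.abortTask(a);
            }
        }
    }

    // One job, one task, two attempts of that task.
    static List<String> demo() {
        RecordingHooks hooks = new RecordingHooks();
        hooks.setupJob();
        runTask(hooks, Arrays.asList("attempt_0", "attempt_1"));
        hooks.commitJob();
        return hooks.calls;
    }

    public static void main(String[] args) {
        for (String call : demo()) {
            System.out.println(call);
        }
    }
}
```

Running `main` lists the hooks in the order the sketch invokes them: job setup first, then task setup for both attempts, a single `commitTask` for `attempt_0`, an `abortTask` for `attempt_1`, and finally `commitJob`.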