• An Analysis of OutputCommitter in OutputFormat


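    In Hadoop, a single Task may be run on several nodes at the same time (for example, through speculative execution), so once those attempts finish, one Task may have produced several results. OutputCommitter exists to prevent this: its job is to ensure that output is committed correctly when a job or task completes. Its main responsibilities are: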

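      1. Set up the job during initialization; for example, create the job's temporary output directory when the Job is initialized.

      2. Clean up after the job completes; for example, delete the temporary output directory once the Job has finished.

      3. Set up the task's temporary output, i.e. create a side-effect file under the Job's temporary directory.

      4. Check whether a task needs a commit, so that the commit procedure can be skipped for a task that has nothing to commit.

      5. Commit the task's output.

      6. Abort the task, discarding its output.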
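    The six steps above can be sketched without Hadoop at all. The following is a minimal, hypothetical stand-in that uses the local filesystem; the class name `LocalCommitterSketch`, the `_temporary` directory name, and the attempt ids are assumptions made for illustration, and the real OutputCommitter methods take JobContext/TaskAttemptContext arguments instead of strings.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical local-filesystem stand-in for the committer protocol; not Hadoop code.
class LocalCommitterSketch {
    private final Path outputDir;
    private final Path jobTempDir;

    LocalCommitterSketch(Path outputDir) {
        this.outputDir = outputDir;
        this.jobTempDir = outputDir.resolve("_temporary"); // assumed directory name
    }

    // Step 1: create the job's temporary output directory.
    void setupJob() throws IOException { Files.createDirectories(jobTempDir); }

    // Step 3: give each task attempt a private directory under the job's temp dir.
    Path setupTask(String attemptId) throws IOException {
        return Files.createDirectories(jobTempDir.resolve(attemptId));
    }

    // Step 4: a commit is only needed if the attempt actually wrote something.
    boolean needsTaskCommit(String attemptId) throws IOException {
        return !children(jobTempDir.resolve(attemptId)).isEmpty();
    }

    // Step 5: promote the attempt's files to the final output location.
    void commitTask(String attemptId) throws IOException {
        Path attemptDir = jobTempDir.resolve(attemptId);
        for (Path file : children(attemptDir)) {
            Files.move(file, outputDir.resolve(file.getFileName()));
        }
        Files.delete(attemptDir);
    }

    // Step 6: discard everything the attempt wrote.
    void abortTask(String attemptId) throws IOException { deleteTree(jobTempDir.resolve(attemptId)); }

    // Step 2: remove the job's temporary directory once the job has finished.
    void commitJob() throws IOException { deleteTree(jobTempDir); }

    private static List<Path> children(Path dir) throws IOException {
        if (!Files.isDirectory(dir)) return new ArrayList<>();
        try (Stream<Path> s = Files.list(dir)) {
            return s.sorted().collect(Collectors.toList());
        }
    }

    private static void deleteTree(Path root) throws IOException {
        if (!Files.exists(root)) return;
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order so that children are deleted before their parents.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) Files.delete(p);
    }

    // Runs two attempts of the same task; only attempt_0 is committed.
    static List<String> runDemo() {
        try {
            Path out = Files.createTempDirectory("committer-sketch");
            LocalCommitterSketch committer = new LocalCommitterSketch(out);
            committer.setupJob();
            Path a0 = committer.setupTask("attempt_0");
            Path a1 = committer.setupTask("attempt_1");
            byte[] record = "key\tvalue\n".getBytes(StandardCharsets.UTF_8);
            Files.write(a0.resolve("part-00000"), record);
            Files.write(a1.resolve("part-00000"), record);
            if (committer.needsTaskCommit("attempt_0")) committer.commitTask("attempt_0");
            committer.abortTask("attempt_1"); // the duplicate attempt is thrown away
            committer.commitJob();
            List<String> names = new ArrayList<>();
            for (Path p : children(out)) names.add(p.getFileName().toString());
            deleteTree(out);
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runDemo()); // prints [part-00000]
    }
}
```

    Although two attempts each wrote a `part-00000` file, only one copy reaches the final directory, because each attempt writes into its own temporary directory and only the committed attempt is promoted.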

    Now let's look at the source code of OutputCommitter:

      package org.apache.hadoop.mapreduce;

      import java.io.IOException;

      public abstract class OutputCommitter {
        /**
         * For the framework to setup the job output during initialization.  This is
         * called from the application master process for the entire job. This will be
         * called multiple times, once per job attempt.
         * @param jobContext Context of the job whose output is being written.
         * @throws IOException if temporary output could not be created
         */
        public abstract void setupJob(JobContext jobContext) throws IOException;

        /**
         * For cleaning up the job's output after job completion.  This is called
         * from the application master process for the entire job. This may be called
         * multiple times.
         * @param jobContext Context of the job whose output is being written.
         * @throws IOException
         * @deprecated Use {@link #commitJob(JobContext)} and
         *                 {@link #abortJob(JobContext, JobStatus.State)} instead.
         */
        @Deprecated
        public void cleanupJob(JobContext jobContext) throws IOException { }

        /**
         * For committing job's output after successful job completion. Note that this
         * is invoked for jobs with final runstate as SUCCESSFUL.  This is called
         * from the application master process for the entire job. This is guaranteed
         * to only be called once.  If it throws an exception the entire job will
         * fail.
         * @param jobContext Context of the job whose output is being written.
         * @throws IOException
         */
        public void commitJob(JobContext jobContext) throws IOException {
          // Default implementation: delegate to the deprecated cleanupJob().
          cleanupJob(jobContext);
        }

        /**
         * For aborting an unsuccessful job's output. Note that this is invoked for
         * jobs with final runstate as {@link JobStatus.State#FAILED} or
         * {@link JobStatus.State#KILLED}.  This is called from the application
         * master process for the entire job. This may be called multiple times.
         * @param jobContext Context of the job whose output is being written.
         * @param state final runstate of the job
         * @throws IOException
         */
        public void abortJob(JobContext jobContext, JobStatus.State state)
            throws IOException {
          // Default implementation: delegate to the deprecated cleanupJob().
          cleanupJob(jobContext);
        }

        /**
         * Sets up output for the task.  This is called from each individual task's
         * process that will output to HDFS, and it is called just for that task. This
         * may be called multiple times for the same task, but for different task
         * attempts.
         * @param taskContext Context of the task whose output is being written.
         * @throws IOException
         */
        public abstract void setupTask(TaskAttemptContext taskContext)
            throws IOException;

        /**
         * Check whether task needs a commit.  This is called from each individual
         * task's process that will output to HDFS, and it is called just for that
         * task.
         * @param taskContext
         * @return true/false
         * @throws IOException
         */
        public abstract boolean needsTaskCommit(TaskAttemptContext taskContext)
            throws IOException;

        /**
         * To promote the task's temporary output to final output location.
         * If {@link #needsTaskCommit(TaskAttemptContext)} returns true and this
         * task is the task that the AM determines finished first, this method
         * is called to commit an individual task's output.  This is to mark
         * that task's output as complete, as {@link #commitJob(JobContext)} will
         * also be called later on if the entire job finished successfully. This
         * is called from a task's process. This may be called multiple times for the
         * same task, but different task attempts.  It should be very rare for this to
         * be called multiple times and requires odd networking failures to make this
         * happen. In the future the Hadoop framework may eliminate this race.
         *
         * @param taskContext Context of the task whose output is being written.
         * @throws IOException if commit is not successful.
         */
        public abstract void commitTask(TaskAttemptContext taskContext)
            throws IOException;

        /**
         * Discard the task output. This is called from a task's process to clean
         * up a single task's output that has not yet been committed. This may be
         * called multiple times for the same task, but for different task attempts.
         * @param taskContext
         * @throws IOException
         */
        public abstract void abortTask(TaskAttemptContext taskContext)
            throws IOException;

        /**
         * Is task output recovery supported for restarting jobs?
         *
         * If task output recovery is supported, job restart can be done more
         * efficiently.
         *
         * @return <code>true</code> if task output recovery is supported,
         *         <code>false</code> otherwise
         * @see #recoverTask(TaskAttemptContext)
         */
        public boolean isRecoverySupported() {
          return false;
        }

        /**
         * Recover the task output.
         *
         * The retry-count for the job will be passed via the
         * {@link MRJobConfig#APPLICATION_ATTEMPT_ID} key in
         * {@link TaskAttemptContext#getConfiguration()} for the
         * <code>OutputCommitter</code>.  This is called from the application master
         * process, but it is called individually for each task.
         *
         * If an exception is thrown the task will be attempted again.
         *
         * This may be called multiple times for the same task.  But from different
         * application attempts.
         *
         * @param taskContext Context of the task whose output is being recovered
         * @throws IOException
         */
        public void recoverTask(TaskAttemptContext taskContext)
            throws IOException {
        }
      }
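    The commitTask Javadoc says that only the attempt which the AM determines finished first gets to commit. A minimal sketch of that "first attempt wins" idea follows, assuming a hypothetical `CommitArbiterSketch` class that stands in for the application master's decision. The names `mayCommit` and `runDemo` are invented for illustration and are not part of the OutputCommitter API; how Hadoop actually arbitrates between attempts is not shown in the source above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical arbiter standing in for the application master's decision; not Hadoop code.
class CommitArbiterSketch {
    // Id of the attempt that has been granted the right to commit, or null if none yet.
    private final AtomicReference<String> winner = new AtomicReference<>();

    // Returns true for the first attempt that asks, and for that same attempt if it asks again.
    boolean mayCommit(String attemptId) {
        return winner.compareAndSet(null, attemptId) || attemptId.equals(winner.get());
    }

    // Runs three concurrent attempts of one task and returns the ids that were allowed to commit.
    static List<String> runDemo() {
        CommitArbiterSketch arbiter = new CommitArbiterSketch();
        List<String> committed = Collections.synchronizedList(new ArrayList<>());
        List<Thread> attempts = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            String attemptId = "attempt_" + i;
            Thread t = new Thread(() -> {
                if (arbiter.mayCommit(attemptId)) {
                    committed.add(attemptId); // stands in for commitTask(...)
                }
                // Attempts that lose would call abortTask(...) instead.
            });
            attempts.add(t);
            t.start();
        }
        for (Thread t : attempts) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return new ArrayList<>(committed);
    }

    public static void main(String[] args) {
        System.out.println("committed attempts: " + runDemo().size()); // prints committed attempts: 1
    }
}
```

    Which attempt wins depends on thread scheduling, but exactly one of the three is ever allowed to commit, which is the property the committer protocol relies on.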
  • Original article: https://www.cnblogs.com/rolly-yan/p/3704129.html