(Repost) MapReduce Design Patterns (Chapter 6, Part 1) (11)

Chapter 6. Metapatterns

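Metapatterns do not solve a specific problem; they deal with how patterns relate to one another, and can be thought of as "patterns about patterns." The first one discussed is job chaining, which pieces several patterns together to solve complex, multistage problems. The second is job merging, an optimization that performs several analytics in the same MapReduce job, killing several birds with one stone.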

Job chaining

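Understanding job chaining, and planning how to operate it, is very important. Many people find that a problem cannot be solved with a single MapReduce job: a series of jobs has to run, and some of them consume the output of others. Once you are comfortable solving problems with a series of MapReduce jobs, you face a new set of challenges.

Job chaining is one of the harder processes to handle because it is not a built-in feature of the MapReduce framework. Systems such as Hadoop are designed to make a single MapReduce job easy to run, but handling a multistage job takes a lot of manual work. Among other things, you have to deal with a job that fails at some stage and clean up the intermediate output. This section discusses several approaches to job chaining. Some will suit your needs better than others, and each has its pros and cons.

Several frameworks and tools have emerged to fill this niche. If your workflows are numerous and complex, you should consider using one of them. The approaches described here are lightweight and have to be implemented job by job. Oozie, an open source Apache project, can build workflows and coordinate job execution. Building job chains is only one of its features, and it is very useful for operationalizing Hadoop MapReduce jobs.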

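A common pitfall with MapReduce is using it when the data is small enough that distributed processing is unnecessary. If you think chaining two jobs is the right choice, consider how much output the first job produces. If there is a lot of output data, by all means use a second MapReduce job. Often, however, the output of the job is small and can be processed efficiently on a single node. There are two ways to do this: load the data through the file system API in the driver after the job completes, or wrap the whole thing in some kind of script.

Notice: A major problem with MapReduce chains is the size of the temporary files. Sometimes they are tiny, which adds significant overhead because far too many map tasks are started to load them. In a non-chained job, the number of reducers usually depends more on the amount of data the reducers receive than on the amount of output. When chaining, the size of the output files matters more, even if the reducers take somewhat longer. Aim for output files about the size of one block on the distributed file system. Experiment with the number of reducers and see where the performance bottleneck lies.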
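As a rough illustration of the block-size guideline above, the sketch below estimates how many reducers would produce output files of about one block each. The class name, method name, and sample figures are assumptions made for illustration; the real block size depends on how your cluster is configured, and the expected output size has to come from your own measurements.

```java
// Hypothetical helper: choose a reducer count so each output file is roughly one block.
final class ReducerCountEstimator {
    private ReducerCountEstimator() {}

    /** Returns ceil(expectedOutputBytes / blockSizeBytes), with a minimum of 1. */
    static int estimateReducers(long expectedOutputBytes, long blockSizeBytes) {
        if (blockSizeBytes <= 0) {
            throw new IllegalArgumentException("blockSizeBytes must be positive");
        }
        if (expectedOutputBytes <= 0) {
            return 1; // still need one reducer to produce any output at all
        }
        // Integer ceiling division
        long reducers = (expectedOutputBytes + blockSizeBytes - 1) / blockSizeBytes;
        return (int) Math.min(reducers, Integer.MAX_VALUE);
    }
}
```

The result could then be passed to Job.setNumReduceTasks in the driver and adjusted after measuring actual run times.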

Another option is to use CombineFileInputFormat to load the intermediate output. It packs small files into larger input splits before they are handed to the mapper.

With the Driver

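Probably the simplest way to chain jobs is to have a master driver that simply fires off the drivers of the individual jobs. There is nothing special about a MapReduce driver in Hadoop; it is fairly generic Java and does not derive from any special class.

Call the job drivers in sequence so the jobs run in the intended order. You must make sure that the output path of the first job is the input path of the second, which you can do by sharing a temporary-directory variable between them.

In production, the temporary directories should be cleaned up and should not linger once the job completes. Without that discipline, your cluster will quickly run out of space. Also be mindful of how much temporary data you create, because it all has to be stored in the file system.

This approach extends easily to chains that are longer than two jobs. Keep track of all the temporary directories and, where appropriate, clean up data that the jobs no longer need.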

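You can also run jobs in parallel by using Job.submit() instead of Job.waitForCompletion(). The submit method returns immediately and the job runs in the background, which lets you run several jobs at once. Job.isComplete() is a nonblocking check that you can poll repeatedly to see whether the jobs have finished.

You also need to check whether each job succeeded. Knowing that a job has completed is not enough; you have to check whether it succeeded. If a job that others depend on fails, stop the whole chain instead of letting it continue.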
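The control flow described above (run the stages in order, stop at the first failure, always clean up temporary data) can be sketched in plain Java. This is only a stand-in: the SequentialChain class and its methods are hypothetical, and each stage represents one job driver. In a real Hadoop driver a stage would typically wrap a call such as job.waitForCompletion(true).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

// Hypothetical stand-in for a master driver; each stage plays the role of one job driver.
final class SequentialChain {
    private final List<Callable<Boolean>> stages = new ArrayList<>();
    private final List<Runnable> cleanups = new ArrayList<>();

    SequentialChain addStage(Callable<Boolean> stage) {
        stages.add(stage);
        return this;
    }

    SequentialChain addCleanup(Runnable cleanup) {
        cleanups.add(cleanup);
        return this;
    }

    /** Runs the stages in order; returns 0 if all succeed, 1 at the first failure. */
    int run() {
        int code = 0;
        try {
            for (Callable<Boolean> stage : stages) {
                boolean ok;
                try {
                    ok = stage.call();
                } catch (Exception e) {
                    ok = false; // an exception counts as a failed stage
                }
                if (!ok) {
                    code = 1;
                    break; // a dependency failed: do not start later stages
                }
            }
        } finally {
            for (Runnable cleanup : cleanups) {
                cleanup.run(); // temporary data is removed on success and on failure
            }
        }
        return code;
    }
}
```

Putting the cleanup in a finally block keeps the removal of temporary directories independent of which stage failed.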

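From a software engineering standpoint, this process clearly becomes hard to manage and maintain once the job chain gets complex. This is why tools such as JobControl and Oozie exist.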

Job Chaining Examples

Basic job chaining

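The goal of this example is to output a list of users along with two pieces of information: their reputation and the number of posts they have made. This could be done in a single MapReduce job, but we also want to split the users into two groups according to the average number of posts. We need one job to do the counting and another to split the users into two bins based on the average. Four patterns are used here: numerical summarization, counting, binning, and a replicated join.

The average number of posts per user is calculated with the framework's counters. In the second job, the user data set is placed in the DistributedCache so that the output can be enriched with each user's reputation. This enrichment feeds the next example, which calculates the average reputation of the users in each of the two bins (below and above the average).

Problem: Given a data set of StackOverflow posts, split the users into two groups, those above and those below the average number of posts. Enrich each user with the reputation taken from a separate data set, then output the result.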
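Before looking at the MapReduce code, it may help to see the same logic on a small in-memory data set. The sketch below is not part of the original example and uses no Hadoop API; the class, the method names, and the "belowavg"/"aboveavg" labels are assumptions chosen for illustration. It counts posts per user, derives the average from the two totals, and then bins and enriches each user.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// In-memory illustration only; all names are hypothetical and no Hadoop API is used.
final class PostBinningSketch {
    private PostBinningSketch() {}

    /** "Job one": count posts per user ID (records without an owner are skipped). */
    static Map<String, Long> countPosts(List<String> ownerUserIds) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String id : ownerUserIds) {
            if (id != null) {
                counts.merge(id, 1L, Long::sum);
            }
        }
        return counts;
    }

    /** Between the jobs: average = counted records / number of distinct users. */
    static double averagePosts(Map<String, Long> counts) {
        long records = 0;
        for (long c : counts.values()) {
            records += c;
        }
        return (double) records / counts.size();
    }

    /** "Job two": assign each user to a bin and attach the reputation from a lookup map. */
    static Map<String, String> bin(Map<String, Long> counts, double average,
            Map<String, String> userIdToReputation) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            String bin = e.getValue() < average ? "belowavg" : "aboveavg";
            out.put(e.getKey(), bin + "\t" + e.getValue() + "\t"
                    + userIdToReputation.get(e.getKey()));
        }
        return out;
    }
}
```

A user that is missing from the lookup map ends up with the string "null" as its reputation, which is worth keeping in mind when the enriched output is parsed later.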

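Job one mapper. Before looking at the driver, it helps to understand the mapper and reducer of both jobs. The mapper takes the user ID from the OwnerUserId attribute of each record and outputs it as the key, with a value of 1. It also increments a record counter by one. This counter is used later in the driver to calculate the average number of posts per user. AVERAGE_CALC_GROUP is a public static string defined at the driver level.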

public static class UserIdCountMapper extends
        Mapper<Object, Text, Text, LongWritable> {

    public static final String RECORDS_COUNTER_NAME = "Records";
    private static final LongWritable ONE = new LongWritable(1);
    private Text outkey = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Parse the record into a map of attribute names to values
        Map<String, String> parsed = MRDPUtils.transformXmlToMap(value
                .toString());
        String userId = parsed.get("OwnerUserId");
        if (userId != null) {
            // Emit (user ID, 1) and count the record for the average
            outkey.set(userId);
            context.write(outkey, ONE);
            context.getCounter(AVERAGE_CALC_GROUP, RECORDS_COUNTER_NAME)
                    .increment(1);
        }
    }
}

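Job one reducer. The reducer is also fairly simple. It iterates over the input values, sums them, and outputs the sum together with the input key. A separate counter is incremented once per reduce group so that the average can be calculated.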

public static class UserIdSumReducer extends
        Reducer<Text, LongWritable, Text, LongWritable> {

    public static final String USERS_COUNTER_NAME = "Users";
    private LongWritable outvalue = new LongWritable();

    public void reduce(Text key, Iterable<LongWritable> values,
            Context context) throws IOException, InterruptedException {
        // Increment user counter, as each reduce group represents one user
        context.getCounter(AVERAGE_CALC_GROUP, USERS_COUNTER_NAME)
                .increment(1);

        // Use a long so the sum matches LongWritable and cannot be truncated
        long sum = 0;
        for (LongWritable value : values) {
            sum += value.get();
        }
        outvalue.set(sum);
        context.write(key, outvalue);
    }
}

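Job two mapper. This mapper is more involved than the one in the previous job. It does several things to produce the desired output. The setup phase accomplishes three things. The average number of posts per user is read from the Context object, where it was placed during job configuration. MultipleOutputs is initialized so that output can be written to different bins. Finally, the user data set is parsed from the DistributedCache to build a map from user ID to reputation, which is used for the enrichment.

Compared with setup, the map method is fairly easy. It parses the input value to get the user ID and the number of posts, simply by splitting the value on tabs and taking the first two fields. It then sets the output key to the user ID and the output value to the number of posts and the user's reputation, separated by a tab. The number of posts is compared with the average to assign the user to a bin.

The optional fourth parameter of MultipleOutputs.write names the output file. A constant specifies the directory for the user, depending on whether the number of posts is below or above the average. The extra string "/part" is appended to the directory name and becomes the start of the file name; the framework then appends -m-nnnnn, where nnnnn is the task ID number. With this naming scheme, a directory is created for each of the two bins, and each directory contains a set of part files. This makes input and output management easier for the next example, which runs jobs in parallel.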
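To make the naming scheme concrete, here is a small sketch that builds the resulting file name by hand. It does not call Hadoop; the NamedOutputPaths class is hypothetical, and the zero-padded five-digit task number is an assumption based on Hadoop's usual part-file naming (for example part-m-00000).

```java
import java.util.Locale;

// Hypothetical helper that mimics the naming described above; it does not call Hadoop.
final class NamedOutputPaths {
    private NamedOutputPaths() {}

    /** Builds "<bin>/part-m-<five-digit task number>", relative to the job output directory. */
    static String mapOutputFile(String binName, int taskNumber) {
        // This is the value passed as the fourth argument of the write call
        String baseOutputPath = binName + "/part";
        // Assumed framework suffix: "-m-" for a map task plus the padded task number
        return String.format(Locale.ROOT, "%s-m-%05d", baseOutputPath, taskNumber);
    }
}
```

For a bin named "belowavg" and task 0, this yields belowavg/part-m-00000 under the job's output directory.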

Finally, the cleanup phase closes MultipleOutputs.

public static class UserIdBinningMapper extends
        Mapper<Object, Text, Text, Text> {

    public static final String AVERAGE_POSTS_PER_USER = "avg.posts.per.user";

    public static void setAveragePostsPerUser(Job job, double avg) {
        job.getConfiguration().set(AVERAGE_POSTS_PER_USER,
                Double.toString(avg));
    }

    public static double getAveragePostsPerUser(Configuration conf) {
        return Double.parseDouble(conf.get(AVERAGE_POSTS_PER_USER));
    }

    private double average = 0.0;
    private MultipleOutputs<Text, Text> mos = null;
    private Text outkey = new Text(), outvalue = new Text();
    private HashMap<String, String> userIdToReputation = new HashMap<String, String>();

    protected void setup(Context context) throws IOException,
            InterruptedException {
        average = getAveragePostsPerUser(context.getConfiguration());
        mos = new MultipleOutputs<Text, Text>(context);
        Path[] files = DistributedCache.getLocalCacheFiles(context
                .getConfiguration());

        // Read all files in the DistributedCache
        for (Path p : files) {
            BufferedReader rdr = new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(new FileInputStream(new File(
                            p.toString())))));
            String line;
            // For each record in the user file
            while ((line = rdr.readLine()) != null) {
                // Get the user ID and reputation
                Map<String, String> parsed = MRDPUtils
                        .transformXmlToMap(line);
                // Map the user ID to the reputation
                userIdToReputation.put(parsed.get("Id"),
                        parsed.get("Reputation"));
            }
            rdr.close();
        }
    }

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Input is the first job's output: user ID, tab, number of posts
        String[] tokens = value.toString().split("\t");
        String userId = tokens[0];
        int posts = Integer.parseInt(tokens[1]);

        outkey.set(userId);
        outvalue.set((long) posts + "\t" + userIdToReputation.get(userId));

        // Write to the bin that matches the user's post count
        if ((double) posts < average) {
            mos.write(MULTIPLE_OUTPUTS_BELOW_NAME, outkey, outvalue,
                    MULTIPLE_OUTPUTS_BELOW_NAME + "/part");
        } else {
            mos.write(MULTIPLE_OUTPUTS_ABOVE_NAME, outkey, outvalue,
                    MULTIPLE_OUTPUTS_ABOVE_NAME + "/part");
        }
    }

    protected void cleanup(Context context) throws IOException,
            InterruptedException {
        mos.close();
    }
}

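Driver code. Now for the most involved part, the driver. It is discussed in two parts: the first job and the second job. For the first job, the driver parses the command-line arguments and creates the appropriate input and output paths. It also creates an intermediate directory, which the driver deletes at the end of the job chain.

Notice: A string is appended to the name of the output directory to form the intermediate output directory. This is fine in most cases, but a naming convention for intermediate directories that avoids conflicts would be better. If the output directory already exists when the job is submitted, the job will not start.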
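One possible naming convention, shown only as a sketch, is to add a random suffix so that two runs writing to the same output directory never share an intermediate directory. The IntermediateDirNames class, the "_int_" infix, and the use of a UUID are all assumptions for illustration, not a Hadoop convention.

```java
import java.util.UUID;

// Hypothetical naming helper; the "_int_" infix and the UUID suffix are assumptions.
final class IntermediateDirNames {
    private IntermediateDirNames() {}

    /** Returns a name such as "out_int_3f2c..." that is unique for each call. */
    static String uniqueIntermediateDir(String outputDir) {
        return outputDir + "_int_" + UUID.randomUUID();
    }
}
```

The driver still has to remember the generated name so it can delete the directory when the chain finishes.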

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path postInput = new Path(args[0]);
    Path userInput = new Path(args[1]);
    Path outputDirIntermediate = new Path(args[2] + "_int");
    Path outputDir = new Path(args[2]);

    // Setup first job to counter user posts
    Job countingJob = new Job(conf, "JobChaining-Counting");
    countingJob.setJarByClass(JobChainingDriver.class);

    // Set our mapper and reducer, we can use the API's long sum reducer for
    // a combiner!
    countingJob.setMapperClass(UserIdCountMapper.class);
    countingJob.setCombinerClass(LongSumReducer.class);
    countingJob.setReducerClass(UserIdSumReducer.class);

    countingJob.setOutputKeyClass(Text.class);
    countingJob.setOutputValueClass(LongWritable.class);

    countingJob.setInputFormatClass(TextInputFormat.class);
    TextInputFormat.addInputPath(countingJob, postInput);

    countingJob.setOutputFormatClass(TextOutputFormat.class);
    TextOutputFormat.setOutputPath(countingJob, outputDirIntermediate);

    // Execute job and grab exit code
    int code = countingJob.waitForCompletion(true) ? 0 : 1;
    ...

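Before the second job runs, the driver checks that the first job succeeded. This looks simple enough, but with more complex job chains these checks become tedious. Before the second job is configured, the counter values from the first job are retrieved to compute the average number of posts per user, which is added to the job configuration. The mapper is then set and the reduce phase is disabled. The other key parts to note are the configuration of MultipleOutputs and the DistributedCache. The job is then executed.

Finally, and importantly, the intermediate output directory is deleted whether the chain succeeded or failed. This is an important and often overlooked step. Leaving intermediate output in place fills up the cluster quickly and forces you to delete the directories by hand. If you don't need it, delete it.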

    if (code == 0) {
        // Calculate the average posts per user by getting counter values
        double numRecords = (double) countingJob
                .getCounters()
                .findCounter(AVERAGE_CALC_GROUP,
                        UserIdCountMapper.RECORDS_COUNTER_NAME).getValue();
        double numUsers = (double) countingJob
                .getCounters()
                .findCounter(AVERAGE_CALC_GROUP,
                        UserIdSumReducer.USERS_COUNTER_NAME).getValue();
        double averagePostsPerUser = numRecords / numUsers;

        // Setup binning job
        Job binningJob = new Job(new Configuration(), "JobChaining-Binning");
        binningJob.setJarByClass(JobChainingDriver.class);

        // Set mapper and the average posts per user
        binningJob.setMapperClass(UserIdBinningMapper.class);
        UserIdBinningMapper.setAveragePostsPerUser(binningJob,
                averagePostsPerUser);

        // Map-only job: disable the reduce phase
        binningJob.setNumReduceTasks(0);

        binningJob.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(binningJob, outputDirIntermediate);

        // Add two named outputs for below/above average
        MultipleOutputs.addNamedOutput(binningJob,
                MULTIPLE_OUTPUTS_BELOW_NAME, TextOutputFormat.class,
                Text.class, Text.class);
        MultipleOutputs.addNamedOutput(binningJob,
                MULTIPLE_OUTPUTS_ABOVE_NAME, TextOutputFormat.class,
                Text.class, Text.class);
        MultipleOutputs.setCountersEnabled(binningJob, true);

        TextOutputFormat.setOutputPath(binningJob, outputDir);

        // Add the user files to the DistributedCache
        FileStatus[] userFiles = FileSystem.get(conf).listStatus(userInput);
        for (FileStatus status : userFiles) {
            DistributedCache.addCacheFile(status.getPath().toUri(),
                    binningJob.getConfiguration());
        }

        // Execute job and grab exit code
        code = binningJob.waitForCompletion(true) ? 0 : 1;
    }

    // Clean up the intermediate output
    FileSystem.get(conf).delete(outputDirIntermediate, true);

    System.exit(code);
}

Parallel job chaining

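The driver for parallel job chaining is similar to the one in the previous example. The only big change is that the jobs are submitted in parallel and then monitored until they complete. The two jobs in this example are independent of each other (although both consume the output of the previous example). This has the added benefit of making better use of cluster resources, since both jobs run at the same time.

Problem: Given the binned user data produced by the previous example, run jobs over both bins in parallel to calculate the average reputation of each bin.

Mapper code. The mapper splits the input value into a string array. The third element of this array (index 2) is the user's reputation. The reputation is output with a single fixed key. This key is shared across all map tasks so that all the reputations are grouped together for the average calculation. NullWritable would work for this grouping, but we want the key to carry a meaningful label.

Notice: This can be expensive for very large data sets, because a single reducer has to receive all the intermediate key/value pairs over the network. The advantage over reading the data set serially on one node is that the input splits are read in parallel and the reducer fetches the map output with a configurable number of threads.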

public static class AverageReputationMapper extends
        Mapper<LongWritable, Text, Text, DoubleWritable> {

    private static final Text GROUP_ALL_KEY = new Text(
            "Average Reputation:");
    private DoubleWritable outvalue = new DoubleWritable();

    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the tab-separated line into tokens
        String[] tokens = value.toString().split("\t");

        // Get the reputation from the third column
        double reputation = Double.parseDouble(tokens[2]);

        // Set the output value and write to context
        outvalue.set(reputation);
        context.write(GROUP_ALL_KEY, outvalue);
    }
}

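Reducer code. The reducer simply iterates over the reputation values, summing them and counting the users. It then divides the sum by the count and outputs the average together with the input key.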

public static class AverageReputationReducer extends
        Reducer<Text, DoubleWritable, Text, DoubleWritable> {

    private DoubleWritable outvalue = new DoubleWritable();

    protected void reduce(Text key, Iterable<DoubleWritable> values,
            Context context) throws IOException, InterruptedException {
        // Sum the reputations and count the users in a single pass
        double sum = 0.0;
        double count = 0;
        for (DoubleWritable dw : values) {
            sum += dw.get();
            ++count;
        }

        outvalue.set(sum / count);
        context.write(key, outvalue);
    }
}
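To check the mapper and reducer logic without a cluster, the same two steps can be run on a few lines in memory. This sketch is not from the original text: the class and method names are hypothetical, and it assumes input lines of the form user ID, post count, and reputation separated by tabs, as produced by the binning job.

```java
import java.util.List;

// Plain-Java illustration; the class and method names are hypothetical.
final class AverageReputationSketch {
    private AverageReputationSketch() {}

    /** Mapper step: the reputation is the third tab-separated column. */
    static double parseReputation(String line) {
        String[] tokens = line.split("\t");
        return Double.parseDouble(tokens[2]);
    }

    /** Reducer step: sum and count the values, then divide. */
    static double average(List<String> lines) {
        double sum = 0.0;
        long count = 0;
        for (String line : lines) {
            sum += parseReputation(line);
            ++count;
        }
        return sum / count;
    }
}
```

A line whose reputation column is not numeric makes Double.parseDouble throw a NumberFormatException, so real input may need filtering first.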

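Driver code. The driver parses the command-line arguments to get the input and output directories for the two jobs. It then calls a helper method, shown after the main method, that configures and submits each job. The two Job objects are returned and monitored until they complete. As long as either job is still running, the driver sleeps for another five seconds. Once both have completed, the driver checks each for success or failure and prints an appropriate log message. The exit code is based on whether both jobs succeeded.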

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path belowAvgInputDir = new Path(args[0]);
    Path aboveAvgInputDir = new Path(args[1]);
    Path belowAvgOutputDir = new Path(args[2]);
    Path aboveAvgOutputDir = new Path(args[3]);

    Job belowAvgJob = submitJob(conf, belowAvgInputDir, belowAvgOutputDir);
    Job aboveAvgJob = submitJob(conf, aboveAvgInputDir, aboveAvgOutputDir);

    // While both jobs are not finished, sleep
    while (!belowAvgJob.isComplete() || !aboveAvgJob.isComplete()) {
        Thread.sleep(5000);
    }

    if (belowAvgJob.isSuccessful()) {
        System.out.println("Below average job completed successfully!");
    } else {
        System.out.println("Below average job failed!");
    }

    if (aboveAvgJob.isSuccessful()) {
        System.out.println("Above average job completed successfully!");
    } else {
        System.out.println("Above average job failed!");
    }

    // Exit with 0 only if both jobs succeeded
    System.exit(belowAvgJob.isSuccessful() && aboveAvgJob.isSuccessful() ? 0 : 1);
}

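The helper method configures each job. It looks very standard, except that it uses Job.submit rather than Job.waitForCompletion. The job is submitted and the call returns immediately, so the code that follows can continue. As we have seen, the returned job is monitored in the main method until it completes.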

private static Job submitJob(Configuration conf, Path inputDir,
        Path outputDir) throws Exception {
    Job job = new Job(conf, "ParallelJobs");
    job.setJarByClass(ParallelJobs.class);

    job.setMapperClass(AverageReputationMapper.class);
    job.setReducerClass(AverageReputationReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);

    job.setInputFormatClass(TextInputFormat.class);
    TextInputFormat.addInputPath(job, inputDir);

    job.setOutputFormatClass(TextOutputFormat.class);
    TextOutputFormat.setOutputPath(job, outputDir);

    // Submit job and immediately return, rather than waiting for completion
    job.submit();
    return job;
}
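The submit-then-poll pattern used by this driver can also be tried without Hadoop. The sketch below is only an analogy: it uses java.util.concurrent, with Future.isDone() playing the role of Job.isComplete() and the task's return value playing the role of Job.isSuccessful(). The SubmitAndPoll class and its method are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Stand-in for the submit-and-poll pattern; no Hadoop classes are involved.
final class SubmitAndPoll {
    private SubmitAndPoll() {}

    /** Starts both tasks at once, polls until both finish, and reports whether both succeeded. */
    static boolean runBoth(Callable<Boolean> first, Callable<Boolean> second,
            long pollMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<Boolean> a = pool.submit(first);  // returns immediately
            Future<Boolean> b = pool.submit(second); // returns immediately

            // While either task is still running, sleep and check again
            while (!a.isDone() || !b.isDone()) {
                Thread.sleep(pollMillis);
            }
            return a.get() && b.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            return false; // a task threw: treat it as a failed job
        } finally {
            pool.shutdownNow();
        }
    }
}
```

As in the driver, completion and success are checked separately: the loop only waits for both tasks to finish, and the result is examined afterward.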

With Shell Scripting

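This approach is similar to the previous one, in which a master driver launches the individual job drivers, except that it uses a scripting language. In a shell script, each job in the chain can be launched separately with its own command-line invocation.

There are a few major benefits and a couple of minor downsides. One benefit is that the job flow can be changed without recompiling any code, because the master driver is a script rather than Java. This matters for jobs that are prone to failure, where you need to be able to rerun or repair failed jobs by hand easily. You can also reuse jobs that have already been put into production to run from the command line but are not yet driven by a script. Another benefit is that a shell script can interact with services, systems, and tools that are not written in Java. For example, the postprocessing of output discussed later in this chapter is very natural to do with sed or awk and much less so in Java.

Notice: Wrapping a MapReduce job in a script, whether it is a Java MapReduce job, a Pig job, or something else, has several benefits: postprocessing, data flows, data preparation, additional logging, and so on.

In general, scripting is best for quickly chaining new jobs together or chaining them with jobs that already exist. For more robust applications, it may make more sense to build a driver-based chaining mechanism that interfaces better with Hadoop.

Bash example

In this example, we use the Bash shell to tie together the basic job chaining and parallel job chaining examples. The script has two parts: setting the variables needed to run the jobs, and then running them.

Bash script. The inputs and outputs are stored in variables, which are used to build several executable commands. There are two commands to run the two jobs, two commands to cat the output to the screen, and one command to clean up the output.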

#!/bin/bash

JAR_FILE="mrdp.jar"
JOB_CHAIN_CLASS="mrdp.ch6.JobChainingDriver"
PARALLEL_JOB_CLASS="mrdp.ch6.ParallelJobs"
HADOOP="$( which hadoop )"

POST_INPUT="posts"
USER_INPUT="users"
JOBCHAIN_OUTDIR="jobchainout"

BELOW_AVG_INPUT="${JOBCHAIN_OUTDIR}/belowavg"
ABOVE_AVG_INPUT="${JOBCHAIN_OUTDIR}/aboveavg"

BELOW_AVG_REP_OUTPUT="belowavgrep"
ABOVE_AVG_REP_OUTPUT="aboveavgrep"

JOB_1_CMD="${HADOOP} jar ${JAR_FILE} ${JOB_CHAIN_CLASS} ${POST_INPUT} ${USER_INPUT} ${JOBCHAIN_OUTDIR}"
JOB_2_CMD="${HADOOP} jar ${JAR_FILE} ${PARALLEL_JOB_CLASS} ${BELOW_AVG_INPUT} ${ABOVE_AVG_INPUT} ${BELOW_AVG_REP_OUTPUT} ${ABOVE_AVG_REP_OUTPUT}"

CAT_BELOW_OUTPUT_CMD="${HADOOP} fs -cat ${BELOW_AVG_REP_OUTPUT}/part-*"
CAT_ABOVE_OUTPUT_CMD="${HADOOP} fs -cat ${ABOVE_AVG_REP_OUTPUT}/part-*"

RMR_CMD="${HADOOP} fs -rmr ${JOBCHAIN_OUTDIR} ${BELOW_AVG_REP_OUTPUT} ${ABOVE_AVG_REP_OUTPUT}"

LOG_FILE="avgrep_`date +%s`.txt"

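The next part of the script echoes each command before running it. It runs the first job and then checks the return code to see whether it failed. If it did, the output is deleted and the script exits. Otherwise, the second job runs. If the second job completes successfully, the output of each job is written to the log file and all the output is deleted. The extra output is not needed: each final output file contains a single line, so storing it in the log file is better than keeping it in HDFS.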

{
    echo ${JOB_1_CMD}
    ${JOB_1_CMD}

    if [ $? -ne 0 ]
    then
        echo "First job failed!"
        echo ${RMR_CMD}
        ${RMR_CMD}
        # Exit nonzero so the failure is not masked by the cleanup's status
        exit 1
    fi

    echo ${JOB_2_CMD}
    ${JOB_2_CMD}

    if [ $? -ne 0 ]
    then
        echo "Second job failed!"
        echo ${RMR_CMD}
        ${RMR_CMD}
        # Exit nonzero so the failure is not masked by the cleanup's status
        exit 1
    fi

    echo ${CAT_BELOW_OUTPUT_CMD}
    ${CAT_BELOW_OUTPUT_CMD}

    echo ${CAT_ABOVE_OUTPUT_CMD}
    ${CAT_ABOVE_OUTPUT_CMD}

    echo ${RMR_CMD}
    ${RMR_CMD}

    exit 0

} &> ${LOG_FILE}

Sample run. The output of a run is shown below, with some of the MapReduce messages omitted.

/home/mrdp/hadoop/bin/hadoop jar mrdp.jar mrdp.ch6.JobChainingDriver posts users jobchainout
12/06/10 15:57:43 INFO input.FileInputFormat: Total input paths to process : 5
12/06/10 15:57:43 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/06/10 15:57:43 WARN snappy.LoadSnappy: Snappy native library not loaded
12/06/10 15:57:44 INFO mapred.JobClient: Running job: job_201206031928_0065
...
12/06/10 15:59:14 INFO mapred.JobClient: Job complete: job_201206031928_0065
...
12/06/10 15:59:15 INFO mapred.JobClient: Running job: job_201206031928_0066
...
12/06/10 16:02:02 INFO mapred.JobClient: Job complete: job_201206031928_0066
/home/mrdp/hadoop/bin/hadoop jar mrdp.jar mrdp.ch6.ParallelJobs jobchainout/belowavg jobchainout/aboveavg belowavgrep aboveavgrep
12/06/10 16:02:08 INFO input.FileInputFormat: Total input paths to process : 1
12/06/10 16:02:08 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/06/10 16:02:08 WARN snappy.LoadSnappy: Snappy native library not loaded
12/06/10 16:02:12 INFO input.FileInputFormat: Total input paths to process : 1
Below average job completed successfully!
Above average job completed successfully!
/home/mrdp/hadoop/bin/hadoop fs -cat belowavgrep/part-*
Average Reputation: 275.36385831014724
/home/mrdp/hadoop/bin/hadoop fs -cat aboveavgrep/part-*
Average Reputation: 2375.301960784314
/home/mrdp/hadoop/bin/hadoop fs -rmr jobchainout belowavgrep aboveavgrep
Deleted hdfs://localhost:9000/user/mrdp/jobchainout
Deleted hdfs://localhost:9000/user/mrdp/belowavgrep
Deleted hdfs://localhost:9000/user/mrdp/aboveavgrep

With JobControl

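The JobControl and ControlledJob classes form a system for chaining MapReduce jobs. It has some nice features, such as tracking the state of the chain and automatically starting jobs once their dependencies are satisfied. Using JobControl is the proper way to chain jobs, but it can be too heavyweight for simple applications.

To use JobControl, start by wrapping your jobs in ControlledJob. This is fairly simple: create the job as usual, then create a ControlledJob that takes the Job or Configuration as a parameter, along with a list of its dependencies. Then add the jobs one by one to the JobControl object.

You still need to keep track of temporary data and clean it up at the end or on failure.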
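The dependency handling described above can be illustrated with a small single-threaded model. This is not the Hadoop JobControl implementation: the DependencyRunner class is hypothetical, jobs are represented by simple boolean suppliers, and the state names are only loosely modeled on the idea that a job whose dependency failed is never started.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Simplified, single-threaded model of dependency-driven job start-up (hypothetical names).
final class DependencyRunner {
    enum State { WAITING, SUCCESS, FAILED, DEPENDENT_FAILED }

    private final Map<String, BooleanSupplier> work = new LinkedHashMap<>();
    private final Map<String, List<String>> deps = new LinkedHashMap<>();

    void addJob(String name, BooleanSupplier body, String... dependsOn) {
        work.put(name, body);
        deps.put(name, Arrays.asList(dependsOn));
    }

    /** Repeatedly starts every job whose dependencies have all succeeded. */
    Map<String, State> run() {
        Map<String, State> states = new LinkedHashMap<>();
        for (String name : work.keySet()) {
            states.put(name, State.WAITING);
        }
        boolean progressed = true;
        while (progressed) {
            progressed = false;
            for (String name : work.keySet()) {
                if (states.get(name) != State.WAITING) {
                    continue;
                }
                boolean ready = true;
                boolean blocked = false;
                for (String dep : deps.get(name)) {
                    State s = states.get(dep);
                    if (s == State.FAILED || s == State.DEPENDENT_FAILED) {
                        blocked = true;
                    } else if (s != State.SUCCESS) {
                        ready = false;
                    }
                }
                if (blocked) {
                    states.put(name, State.DEPENDENT_FAILED); // never started
                    progressed = true;
                } else if (ready) {
                    boolean ok = work.get(name).getAsBoolean();
                    states.put(name, ok ? State.SUCCESS : State.FAILED);
                    progressed = true;
                }
            }
        }
        return states;
    }
}
```

The order in which jobs are added does not matter here; a job starts only after everything it depends on has succeeded.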

Job control example

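This example uses JobControl in the driver to run the previous two examples, basic job chaining and parallel job chaining, together. We are already familiar with the mapper and reducer code, so it is not repeated here. The driver code that configures the jobs is the main thing to show. It uses basic job chaining to launch the first job, then uses JobControl to run the remaining job of the chain and the two parallel jobs. The initial job is not added to JobControl, because control has to be interrupted for the in-between step of using the first job's counters to configure the second job.

All jobs must be completely configured before the entire job chain is executed, which can be limiting.

Main method. Let's look at the main method. The command-line arguments are parsed to create all the paths needed by the four jobs. Care is taken in naming the variables so that the data flow stays clear. The first job is then configured through a helper method and executed. After it completes, three ControlledJob objects are created through helper methods, which determine the mapper class, reducer class, and so on, that each job uses.

binningControlledJob has no dependencies, other than the check that the previous job ran successfully. The next two jobs both depend on binningControlledJob. They will not run until the binning job has completed successfully, and if it fails they will not run at all.

All three ControlledJobs are added to the JobControl object, which is then run. The call to JobControl.run blocks until the group of jobs completes. The driver then checks whether any job failed and sets the exit code accordingly. The intermediate output directories are cleaned up before exiting.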

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path postInput = new Path(args[0]);
    Path userInput = new Path(args[1]);
    Path countingOutput = new Path(args[3] + "_count");
    Path binningOutputRoot = new Path(args[3] + "_bins");
    Path binningOutputBelow = new Path(binningOutputRoot + "/"
            + JobChainingDriver.MULTIPLE_OUTPUTS_BELOW_NAME);
    Path binningOutputAbove = new Path(binningOutputRoot + "/"
            + JobChainingDriver.MULTIPLE_OUTPUTS_ABOVE_NAME);
    Path belowAverageRepOutput = new Path(args[2]);
    Path aboveAverageRepOutput = new Path(args[3]);

    Job countingJob = getCountingJob(conf, postInput, countingOutput);

    int code = 1;
    if (countingJob.waitForCompletion(true)) {
        ControlledJob binningControlledJob = new ControlledJob(
                getBinningJobConf(countingJob, conf, countingOutput,
                        userInput, binningOutputRoot));

        ControlledJob belowAvgControlledJob = new ControlledJob(
                getAverageJobConf(conf, binningOutputBelow,
                        belowAverageRepOutput));
        belowAvgControlledJob.addDependingJob(binningControlledJob);

        ControlledJob aboveAvgControlledJob = new ControlledJob(
                getAverageJobConf(conf, binningOutputAbove,
                        aboveAverageRepOutput));
        aboveAvgControlledJob.addDependingJob(binningControlledJob);

        JobControl jc = new JobControl("AverageReputation");
        jc.addJob(binningControlledJob);
        jc.addJob(belowAvgControlledJob);
        jc.addJob(aboveAvgControlledJob);

        // Note: depending on the Hadoop version, run() may not return until
        // stop() is called. If it does not return, an alternative is to start
        // jc in its own thread, poll jc.allFinished(), then call jc.stop().
        jc.run();
        code = jc.getFailedJobList().size() == 0 ? 0 : 1;
    }

    // Clean up the intermediate output directories
    FileSystem fs = FileSystem.get(conf);
    fs.delete(countingOutput, true);
    fs.delete(binningOutputRoot, true);

    System.exit(code);
}

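Helper methods. The helper methods below create the actual Job or Configuration objects. A ControlledJob can be created from either class. There are three separate methods; the last one is used twice to create the two otherwise identical parallel jobs. The inputs and outputs differ for every job.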

public static Job getCountingJob(Configuration conf, Path postInput,
        Path outputDirIntermediate) throws IOException {
    // Setup first job to counter user posts
    Job countingJob = new Job(conf, "JobChaining-Counting");
    countingJob.setJarByClass(JobChainingDriver.class);

    // Set our mapper and reducer, we can use the API's long sum reducer for
    // a combiner!
    countingJob.setMapperClass(UserIdCountMapper.class);
    countingJob.setCombinerClass(LongSumReducer.class);
    countingJob.setReducerClass(UserIdSumReducer.class);

    countingJob.setOutputKeyClass(Text.class);
    countingJob.setOutputValueClass(LongWritable.class);

    countingJob.setInputFormatClass(TextInputFormat.class);
    TextInputFormat.addInputPath(countingJob, postInput);

    countingJob.setOutputFormatClass(TextOutputFormat.class);
    TextOutputFormat.setOutputPath(countingJob, outputDirIntermediate);

    return countingJob;
}

public static Configuration getBinningJobConf(Job countingJob,
        Configuration conf, Path jobchainOutdir, Path userInput,
        Path binningOutput) throws IOException {
    // Calculate the average posts per user by getting counter values
    double numRecords = (double) countingJob
            .getCounters()
            .findCounter(JobChainingDriver.AVERAGE_CALC_GROUP,
                    UserIdCountMapper.RECORDS_COUNTER_NAME).getValue();
    double numUsers = (double) countingJob
            .getCounters()
            .findCounter(JobChainingDriver.AVERAGE_CALC_GROUP,
                    UserIdSumReducer.USERS_COUNTER_NAME).getValue();
    double averagePostsPerUser = numRecords / numUsers;

    // Setup binning job
    Job binningJob = new Job(conf, "JobChaining-Binning");
    binningJob.setJarByClass(JobChainingDriver.class);

    // Set mapper and the average posts per user
    binningJob.setMapperClass(UserIdBinningMapper.class);
    UserIdBinningMapper.setAveragePostsPerUser(binningJob,
            averagePostsPerUser);

    // Map-only job: disable the reduce phase
    binningJob.setNumReduceTasks(0);

    binningJob.setInputFormatClass(TextInputFormat.class);
    TextInputFormat.addInputPath(binningJob, jobchainOutdir);

    // Add two named outputs for below/above average
    MultipleOutputs.addNamedOutput(binningJob,
            JobChainingDriver.MULTIPLE_OUTPUTS_BELOW_NAME,
            TextOutputFormat.class, Text.class, Text.class);
    MultipleOutputs.addNamedOutput(binningJob,
            JobChainingDriver.MULTIPLE_OUTPUTS_ABOVE_NAME,
            TextOutputFormat.class, Text.class, Text.class);
    MultipleOutputs.setCountersEnabled(binningJob, true);

    // Set the output root, as in the basic job chaining driver. The listing
    // as extracted contained calls on undefined names here (outputDir,
    // MULTIPLE_OUTPUTS_ABOVE_5000) and never used binningOutput.
    TextOutputFormat.setOutputPath(binningJob, binningOutput);

    // Add the user files to the DistributedCache
    FileStatus[] userFiles = FileSystem.get(conf).listStatus(userInput);
    for (FileStatus status : userFiles) {
        DistributedCache.addCacheFile(status.getPath().toUri(),
                binningJob.getConfiguration());
    }

    // Return the configuration so it can be wrapped in a ControlledJob
    return binningJob.getConfiguration();
}

public static Configuration getAverageJobConf(Configuration conf,
        Path averageOutputDir, Path outputDir) throws IOException {
    Job averageJob = new Job(conf, "ParallelJobs");
    averageJob.setJarByClass(ParallelJobs.class);

    averageJob.setMapperClass(AverageReputationMapper.class);
    averageJob.setReducerClass(AverageReputationReducer.class);

    averageJob.setOutputKeyClass(Text.class);
    averageJob.setOutputValueClass(DoubleWritable.class);

    // averageOutputDir is the binning output that serves as this job's input
    averageJob.setInputFormatClass(TextInputFormat.class);
    TextInputFormat.addInputPath(averageJob, averageOutputDir);

    averageJob.setOutputFormatClass(TextOutputFormat.class);
    TextOutputFormat.setOutputPath(averageJob, outputDir);

    // Return the configuration so it can be wrapped in a ControlledJob
    return averageJob.getConfiguration();
}

Excerpted from: http://blog.csdn.net/cuirong1986/article/details/8492804
Original article: https://www.cnblogs.com/anny-1980/p/3663377.html
