Introduction
This document provides information for users to migrate their Apache Hadoop MapReduce applications from Apache Hadoop 1.x to Apache Hadoop 2.x.
In Apache Hadoop 2.x, we have spun off resource management capabilities into Apache Hadoop YARN, a general-purpose framework for managing distributed applications, while Apache Hadoop MapReduce (aka MRv2) remains a pure distributed computation framework.
In general, the previous MapReduce runtime (aka MRv1) has been reused and no major surgery has been conducted on it. Therefore, MRv2 is able to ensure satisfactory compatibility with MRv1 applications. However, due to some improvements and code refactorings, a few APIs have been rendered backward-incompatible.
The remainder of this page will discuss the scope and the level of backward compatibility that we support in Apache Hadoop MapReduce 2.x (MRv2).
Binary Compatibility
First, we ensure binary compatibility for applications that use the old mapred APIs. This means that applications built against the MRv1 mapred APIs can run directly on YARN without recompilation, merely by pointing them to an Apache Hadoop 2.x cluster via configuration.
Source Compatibility
We cannot ensure complete binary compatibility with applications that use the mapreduce APIs, as these APIs have evolved a lot since MRv1. However, we ensure source compatibility for the mapreduce APIs that break binary compatibility. In other words, users should recompile their applications that use mapreduce APIs against the MRv2 jars. One notable break in binary compatibility involves Counter and CounterGroup.
Not Supported
MRAdmin has been removed in MRv2 because the mradmin commands no longer exist. They have been replaced by the commands in rmadmin. We support neither binary compatibility nor source compatibility for applications that use this class directly.
Tradeoffs between MRv1 Users and Early MRv2 Adopters
Unfortunately, maintaining binary compatibility for MRv1 applications may lead to binary incompatibility issues for early MRv2 adopters, in particular Hadoop 0.23 users. For the mapred APIs, we have chosen to be compatible with MRv1 applications, which have a larger user base. For the mapreduce APIs, where a change does not significantly break Hadoop 0.23 applications, we have still changed them to be compatible with MRv1 applications. Below is the list of MapReduce APIs which are incompatible with Hadoop 0.23.
Problematic Function | Incompatibility Issue |
org.apache.hadoop.util.ProgramDriver#drive | Return type changes from void to int |
org.apache.hadoop.mapred.jobcontrol.Job#getMapredJobID | Return type changes from String to JobID |
org.apache.hadoop.mapred.TaskReport#getTaskId | Return type changes from String to TaskID |
org.apache.hadoop.mapred.ClusterStatus#UNINITIALIZED_MEMORY_VALUE | Data type changes from long to int |
org.apache.hadoop.mapreduce.filecache.DistributedCache#getArchiveTimestamps | Return type changes from long[] to String[] |
org.apache.hadoop.mapreduce.filecache.DistributedCache#getFileTimestamps | Return type changes from long[] to String[] |
org.apache.hadoop.mapreduce.Job#failTask | Return type changes from void to boolean |
org.apache.hadoop.mapreduce.Job#killTask | Return type changes from void to boolean |
org.apache.hadoop.mapreduce.Job#getTaskCompletionEvents | Return type changes from o.a.h.mapred.TaskCompletionEvent[] to o.a.h.mapreduce.TaskCompletionEvent[] |
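Most entries in the table above are return type changes. A short sketch may help show why such a change breaks already compiled callers even when the source still compiles: the JVM links a call site by the method's full descriptor, which includes the return type. The class name ReturnTypeDescriptors and the helper descriptor are illustrative only, and the methods are modelled without parameters for simplicity, so these are not the real Hadoop signatures.

```java
import java.lang.invoke.MethodType;

public class ReturnTypeDescriptors {

    // Builds the JVM descriptor of a hypothetical no-argument method
    // that returns the given type. Real Hadoop methods may take arguments;
    // they are omitted here to keep the focus on the return type.
    static String descriptor(Class<?> returnType) {
        return MethodType.methodType(returnType).toMethodDescriptorString();
    }

    public static void main(String[] args) {
        // A caller compiled against the old return type records the old
        // descriptor in its bytecode. If the library now only offers the
        // new descriptor, linking fails at run time (NoSuchMethodError)
        // until the caller is recompiled.
        System.out.println("void   -> boolean  : "
                + descriptor(void.class) + " vs " + descriptor(boolean.class));
        System.out.println("void   -> int      : "
                + descriptor(void.class) + " vs " + descriptor(int.class));
        System.out.println("long[] -> String[] : "
                + descriptor(long[].class) + " vs " + descriptor(String[].class));
    }
}
```

Because the descriptors differ, recompiling against the jars actually deployed on the cluster is what resolves this kind of mismatch.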
Malicious
For users who are going to try hadoop-examples-1.x.x.jar on YARN, please note that hadoop -jar hadoop-examples-1.x.x.jar will still use hadoop-mapreduce-examples-2.x.x.jar, which is installed together with the other MRv2 jars. By default, Hadoop framework jars appear before the users' jars in the classpath, so the classes from the 2.x.x jar will still be picked. Users should remove hadoop-mapreduce-examples-2.x.x.jar from the classpath of all the nodes in a cluster. Otherwise, users need to set HADOOP_USER_CLASSPATH_FIRST=true and HADOOP_CLASSPATH=...:hadoop-examples-1.x.x.jar to run their target examples jar, and add the following configuration in mapred-site.xml to make the processes in YARN containers pick this jar as well.
<property>
  <name>mapreduce.job.user.classpath.first</name>
  <value>true</value>
</property>
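The effect of classpath ordering described in this section can be illustrated with a toy model. This is only a sketch of the general "first match wins" rule for class lookup, not Hadoop's actual class loading code. The class ClasspathOrder and its methods are hypothetical, and the model simply assumes that both examples jars contain a class named org.apache.hadoop.examples.WordCount.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class ClasspathOrder {

    // Assumed for illustration: a class that both examples jars provide.
    static final String WORD_COUNT = "org.apache.hadoop.examples.WordCount";

    // Returns the first jar, in classpath order, that contains the class,
    // or null if no jar provides it.
    static String resolve(Map<String, Set<String>> orderedClasspath, String className) {
        for (Map.Entry<String, Set<String>> jar : orderedClasspath.entrySet()) {
            if (jar.getValue().contains(className)) {
                return jar.getKey();
            }
        }
        return null;
    }

    // Builds a two-entry classpath. With userFirst == false the framework
    // jar comes first (the default described above); with userFirst == true
    // the user's jar is placed ahead of it.
    static Map<String, Set<String>> classpath(boolean userFirst) {
        Map<String, Set<String>> cp = new LinkedHashMap<>();
        if (userFirst) {
            cp.put("hadoop-examples-1.x.x.jar", Collections.singleton(WORD_COUNT));
        }
        cp.put("hadoop-mapreduce-examples-2.x.x.jar", Collections.singleton(WORD_COUNT));
        if (!userFirst) {
            cp.put("hadoop-examples-1.x.x.jar", Collections.singleton(WORD_COUNT));
        }
        return cp;
    }

    public static void main(String[] args) {
        System.out.println("framework first: " + resolve(classpath(false), WORD_COUNT));
        System.out.println("user first:      " + resolve(classpath(true), WORD_COUNT));
    }
}
```

In this model, putting the user's jar first plays the role that HADOOP_USER_CLASSPATH_FIRST=true and mapreduce.job.user.classpath.first play on the client and in the YARN containers respectively.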