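The following describes classes commonly used in Hadoop (taken from the Hadoop API documentation). They are listed in the order I encountered them while learning; the order implies no other comparison: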
1. Configuration
public class Configuration extends Object implements Iterable<Map.Entry<String,String>>, Writable
Provides access to configuration parameters.
Resources
Configurations are specified by resources. A resource contains a set of name/value pairs as XML data. Each resource is named by either a String or a Path. If named by a String, then the classpath is examined for a file with that name. If named by a Path, then the local filesystem is examined directly, without referring to the classpath.
Unless explicitly turned off, Hadoop by default specifies two resources, loaded in-order from the classpath:
- core-default.xml: Read-only defaults for Hadoop.
- core-site.xml: Site-specific configuration for a given Hadoop installation.
Applications may add additional resources, which are loaded subsequent to these resources in the order they are added.
Final Parameters
Configuration parameters may be declared final. Once a resource declares a value final, no subsequently-loaded resource can alter that value. For example, one might define a final parameter with:
<property>
  <name>dfs.client.buffer.dir</name>
  <value>/tmp/hadoop/dfs/client</value>
  <final>true</final>
</property>
Administrators typically define parameters as final in core-site.xml for values that user applications may not alter.
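The loading order and the effect of a final parameter can be sketched in plain Java. This is a simplified model for illustration only, not Hadoop's Configuration implementation: the class FinalParamSketch, its Prop type, and the path /home/user/buf are hypothetical names introduced here, and XML parsing is left out entirely.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical model of layered resources; NOT org.apache.hadoop.conf.Configuration.
class FinalParamSketch {
    private final Map<String, String> values = new LinkedHashMap<>();
    private final Set<String> finals = new HashSet<>();

    /** One name/value pair as a resource would declare it. */
    static final class Prop {
        final String name;
        final String value;
        final boolean isFinal;

        Prop(String name, String value, boolean isFinal) {
            this.name = name;
            this.value = value;
            this.isFinal = isFinal;
        }
    }

    /** Loads one resource. Later resources override earlier ones unless the name was declared final. */
    void addResource(Prop... props) {
        for (Prop p : props) {
            if (finals.contains(p.name)) {
                continue; // an earlier resource locked this value
            }
            values.put(p.name, p.value);
            if (p.isFinal) {
                finals.add(p.name);
            }
        }
    }

    String get(String name) {
        return values.get(name);
    }

    public static void main(String[] args) {
        FinalParamSketch conf = new FinalParamSketch();
        // First resource (think core-site.xml): declares one final parameter.
        conf.addResource(
                new Prop("dfs.client.buffer.dir", "/tmp/hadoop/dfs/client", true),
                new Prop("io.file.buffer.size", "4096", false));
        // Second resource (added by an application): tries to change both values.
        conf.addResource(
                new Prop("dfs.client.buffer.dir", "/home/user/buf", false),
                new Prop("io.file.buffer.size", "65536", false));
        System.out.println(conf.get("dfs.client.buffer.dir")); // the final value survives
        System.out.println(conf.get("io.file.buffer.size"));   // the later value wins
    }
}
```

In this model the only state needed beyond the values themselves is the set of names that have been locked, which is why a later resource can still add new names but cannot touch a final one.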
2. Serialization
public interface Serialization<T>
- All Known Implementing Classes:
- JavaSerialization, WritableSerialization
Encapsulates a Serializer/Deserializer pair.
3. Tool
public interface Tool extends Configurable
A tool interface that supports handling of generic command-line options.
Tool is the standard for any Map-Reduce tool/application. The tool/application should delegate the handling of standard command-line options to ToolRunner.run(Tool, String[]) and only handle its custom arguments.
4. ToolRunner
public class ToolRunner extends Object
A utility to help run Tools.
ToolRunner can be used to run classes implementing the Tool interface. It works in conjunction with GenericOptionsParser to parse the generic Hadoop command line arguments and modifies the Configuration of the Tool. The application-specific options are passed along without being modified.
5.1 Job (newer than JobConf)
public class Job extends JobContext
The job submitter's view of the Job. It allows the user to configure the job, submit it, control its execution, and query the state. The set methods only work until the job is submitted; afterwards they will throw an IllegalStateException.
5.2 JobConf (has more methods than Job)
A map/reduce job configuration.
JobConf is the primary interface for a user to describe a map-reduce job to the Hadoop framework for execution. The framework tries to faithfully execute the job as described by JobConf, however:
- Some configuration parameters might have been marked as final by administrators and hence cannot be altered.
- While some job parameters are straightforward to set (e.g. setNumReduceTasks(int)), other parameters interact subtly with the rest of the framework and/or the job configuration and are relatively more complex for the user to control finely (e.g. setNumMapTasks(int)).
JobConf typically specifies the Mapper, combiner (if any), Partitioner, Reducer, InputFormat and OutputFormat implementations to be used.
Optionally, JobConf is used to specify other advanced facets of the job such as the Comparators to be used, files to be put in the DistributedCache, whether or not intermediate and/or job outputs are to be compressed (and how), and debuggability via user-provided scripts (setMapDebugScript(String)/setReduceDebugScript(String)) for post-processing task logs and the task's stdout, stderr and syslog.
6. GenericOptionsParser
public class GenericOptionsParser extends Object
GenericOptionsParser is a utility to parse command line arguments generic to the Hadoop framework. GenericOptionsParser recognizes several standard command line arguments, enabling applications to easily specify a namenode, a jobtracker, additional configuration resources etc.
Generic Options
The supported generic options are:
-conf <configuration file>                     specify a configuration file
-D <property=value>                            use value for given property
-fs <local|namenode:port>                      specify a namenode
-jt <local|jobtracker:port>                    specify a job tracker
-files <comma separated list of files>         specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>        specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>   specify comma separated archives to be unarchived on the compute machines.
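The split between generic options and application-specific arguments can be illustrated with a small plain-Java model. It is a rough sketch under loud assumptions: GenericOptionsSketch is a hypothetical class, it understands only the "-D <property=value>" form from the list above, and it does not reproduce how the real GenericOptionsParser treats option ordering, the attached "-Dname=value" form, or the other flags.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model; NOT org.apache.hadoop.util.GenericOptionsParser.
class GenericOptionsSketch {
    /** Properties collected from "-D name=value" pairs. */
    final Map<String, String> conf = new LinkedHashMap<>();
    /** Everything else, handed to the application unchanged. */
    final List<String> remaining = new ArrayList<>();

    static GenericOptionsSketch parse(String[] args) {
        GenericOptionsSketch result = new GenericOptionsSketch();
        int i = 0;
        while (i < args.length) {
            if (args[i].equals("-D") && i + 1 < args.length) {
                String pair = args[i + 1];
                int eq = pair.indexOf('=');
                if (eq > 0) {
                    result.conf.put(pair.substring(0, eq), pair.substring(eq + 1));
                }
                // A pair without a usable "name=" prefix is silently dropped in this sketch.
                i += 2;
            } else {
                result.remaining.add(args[i]);
                i++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        GenericOptionsSketch parsed = parse(new String[] {
                "-D", "mapred.reduce.tasks=2", "input", "output"});
        System.out.println(parsed.conf);      // properties that would go into the Configuration
        System.out.println(parsed.remaining); // arguments left for the tool itself
    }
}
```

This mirrors the division of labour described for Tool and ToolRunner: the generic part updates the configuration, and the tool only ever sees the remaining arguments.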
7. Mapper
public class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT> extends Object
Maps input key/value pairs to a set of intermediate key/value pairs.
Maps are the individual tasks which transform input records into intermediate records. The transformed intermediate records need not be of the same type as the input records. A given input pair may map to zero or many output pairs.
The Hadoop Map-Reduce framework spawns one map task for each InputSplit generated by the InputFormat for the job. Mapper implementations can access the Configuration for the job via JobContext.getConfiguration().
The framework first calls setup(org.apache.hadoop.mapreduce.Mapper.Context), followed by map(Object, Object, Context) for each key/value pair in the InputSplit. Finally cleanup(Context) is called.
All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to a Reducer to determine the final output. Users can control the sorting and grouping by specifying two key RawComparator classes.
The Mapper outputs are partitioned per Reducer. Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner.
Users can optionally specify a combiner, via Job.setCombinerClass(Class), to perform local aggregation of the intermediate outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer.
Applications can specify if and how the intermediate outputs are to be compressed and which CompressionCodecs are to be used via the Configuration.
If the job has zero reduces then the output of the Mapper is directly written to the OutputFormat without sorting by keys.
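To make the map, partition and group steps concrete, here is a word-count style model in plain Java. It is only a sketch: MapPhaseSketch and its methods are hypothetical names, everything runs in one process, and nothing here uses the Hadoop Mapper or Partitioner classes. The partition function follows the common hash-modulo scheme (mask the sign bit, then take the remainder by the number of reduces).

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process model; NOT org.apache.hadoop.mapreduce.Mapper.
class MapPhaseSketch {
    /** map(): one input line may produce zero or many (word, 1) pairs. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    /** Picks the reduce that receives a key; masking keeps the result non-negative. */
    static int partition(String key, int numReduces) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduces;
    }

    /** Runs map() over all lines, then groups values by key inside each partition. */
    static List<TreeMap<String, List<Integer>>> run(List<String> lines, int numReduces) {
        List<TreeMap<String, List<Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < numReduces; i++) {
            partitions.add(new TreeMap<>()); // TreeMap keeps keys sorted within a partition
        }
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                partitions.get(partition(kv.getKey(), numReduces))
                        .computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                        .add(kv.getValue());
            }
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("a b a");
        lines.add("");   // an input record that maps to zero output pairs
        lines.add("c");
        System.out.println(run(lines, 2));
    }
}
```

Each TreeMap in the result stands for the sorted, grouped input that one Reducer would see; a combiner would correspond to summing each value list before it leaves the map side.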
8. Reducer
public class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> extends Object
Reduces a set of intermediate values which share a key to a smaller set of values.
Reducer implementations can access the Configuration for the job via the JobContext.getConfiguration() method.
Reducer has 3 primary phases:
- Shuffle
  The Reducer copies the sorted output from each Mapper using HTTP across the network.
- Sort
  The framework merge sorts Reducer inputs by keys (since different Mappers may have output the same key).
  The shuffle and sort phases occur simultaneously, i.e. while outputs are being fetched they are merged.
  SecondarySort
  To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a grouping comparator. The keys will be sorted using the entire key, but will be grouped using the grouping comparator to decide which keys and values are sent in the same call to reduce. The grouping comparator is specified via Job.setGroupingComparatorClass(Class). The sort order is controlled by Job.setSortComparatorClass(Class).
  For example, say that you want to find duplicate web pages and tag them all with the url of the "best" known example. You would set up the job like:
  - Map Input Key: url
  - Map Input Value: document
  - Map Output Key: document checksum, url pagerank
  - Map Output Value: url
  - Partitioner: by checksum
  - OutputKeyComparator: by checksum and then decreasing pagerank
  - OutputValueGroupingComparator: by checksum
- Reduce
  In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of values)> in the sorted inputs.
  The output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object).
The output of the Reducer is not re-sorted.
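The duplicate-web-page example from the SecondarySort description can be modelled in plain Java to show how the two comparators cooperate. This is a sketch, not Hadoop code: SecondarySortSketch, Rec, SORT, GROUP and reduceCalls are hypothetical names, and the whole shuffle is replaced by an in-memory sort. The sort comparator orders by checksum and then by decreasing pagerank; the grouping comparator looks at the checksum only, so every reduce call sees its urls with the best-ranked one first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical in-memory model; NOT the Hadoop shuffle or its comparator classes.
class SecondarySortSketch {
    /** Map output: composite key (checksum, pageRank) with the url as value. */
    static final class Rec {
        final String checksum;
        final double pageRank;
        final String url;

        Rec(String checksum, double pageRank, String url) {
            this.checksum = checksum;
            this.pageRank = pageRank;
            this.url = url;
        }
    }

    /** Sort comparator: the entire key, by checksum and then decreasing pagerank. */
    static final Comparator<Rec> SORT = (a, b) -> {
        int c = a.checksum.compareTo(b.checksum);
        return c != 0 ? c : Double.compare(b.pageRank, a.pageRank);
    };

    /** Grouping comparator: only the checksum decides which records share a reduce call. */
    static final Comparator<Rec> GROUP = (a, b) -> a.checksum.compareTo(b.checksum);

    /** Returns the value list (urls) that each reduce call would iterate over. */
    static List<List<String>> reduceCalls(List<Rec> mapOutput) {
        List<Rec> sorted = new ArrayList<>(mapOutput);
        sorted.sort(SORT);
        List<List<String>> calls = new ArrayList<>();
        Rec previous = null;
        for (Rec r : sorted) {
            if (previous == null || GROUP.compare(previous, r) != 0) {
                calls.add(new ArrayList<>()); // key group changed: start a new reduce call
            }
            calls.get(calls.size() - 1).add(r.url);
            previous = r;
        }
        return calls;
    }

    public static void main(String[] args) {
        List<Rec> mapOutput = new ArrayList<>();
        mapOutput.add(new Rec("c1", 0.2, "u1"));
        mapOutput.add(new Rec("c2", 0.5, "u2"));
        mapOutput.add(new Rec("c1", 0.9, "u3"));
        System.out.println(reduceCalls(mapOutput)); // prints [[u3, u1], [u2]]
    }
}
```

Because the first url in each group has the highest pagerank, a reduce function in this setup could tag every later url in the same group with that first one.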
9. RawComparator
public interface RawComparator<T> extends Comparator<T>
- All Known Implementing Classes:
- BooleanWritable.Comparator, BytesWritable.Comparator, ByteWritable.Comparator, DeserializerComparator, DoubleWritable.Comparator, FloatWritable.Comparator, IntWritable.Comparator, JavaSerializationComparator, KeyFieldBasedComparator, KeyFieldBasedComparator, LongWritable.Comparator, LongWritable.DecreasingComparator, MD5Hash.Comparator, NullWritable.Comparator, RecordComparator, SecondarySort.FirstGroupingComparator, SecondarySort.IntPair.Comparator, Text.Comparator, UTF8.Comparator, WritableComparator
A Comparator that operates directly on byte representations of objects.
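What "operates directly on byte representations" means can be shown with a small plain-Java model. The sketch assumes each key is serialized as a 4-byte big-endian int; RawIntComparatorSketch and its helper methods are hypothetical, and only the (bytes, start, length) parameter layout of compare is modelled on the RawComparator interface. No key object is ever reconstructed during the comparison.

```java
import java.nio.ByteBuffer;

// Hypothetical model; NOT an implementation of org.apache.hadoop.io.RawComparator.
class RawIntComparatorSketch {
    /** Serializes ints back to back, 4 bytes each, big-endian (ByteBuffer's default order). */
    static byte[] serialize(int... values) {
        ByteBuffer buf = ByteBuffer.allocate(4 * values.length);
        for (int v : values) {
            buf.putInt(v);
        }
        return buf.array();
    }

    /** Decodes the int that starts at offset s without allocating any object. */
    static int readInt(byte[] b, int s) {
        return ((b[s] & 0xff) << 24)
                | ((b[s + 1] & 0xff) << 16)
                | ((b[s + 2] & 0xff) << 8)
                | (b[s + 3] & 0xff);
    }

    /**
     * Compares two serialized keys in place. The lengths l1 and l2 are unused here
     * because this sketch assumes every key occupies exactly 4 bytes.
     */
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return Integer.compare(readInt(b1, s1), readInt(b2, s2));
    }

    public static void main(String[] args) {
        byte[] bytes = serialize(7, -3, 7);
        System.out.println(compare(bytes, 0, 4, bytes, 4, 4) > 0);  // 7 sorts after -3
        System.out.println(compare(bytes, 0, 4, bytes, 8, 4) == 0); // 7 equals 7
    }
}
```

Comparing at the byte level matters in the sort phase, where many keys are compared and deserializing each one into an object first would add avoidable work.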