Let’s look at how MapReduce reads input data and writes output data, focusing on the
file formats it uses. To enable easy distributed processing, MapReduce makes certain
assumptions about the data it’s processing. It also provides flexibility in dealing with a
variety of data formats.
Input data usually resides in large files, typically tens or hundreds of gigabytes or
even more. One of the fundamental principles behind MapReduce’s processing power is
the splitting of the input data into chunks. You can process these chunks in parallel
using multiple machines. In Hadoop terminology these chunks are called input splits.
The size of each split should be small enough to allow granular parallelization.
(If all the input data is in one split, then there is no parallelization.) On the other
hand, each split shouldn’t be so small that the overhead of starting and stopping the
processing of a split becomes a large fraction of execution time.
The principle of dividing input data (which often can be one single massive file) into
splits for parallel processing explains some of the design decisions behind Hadoop’s
generic FileSystem as well as HDFS in particular. For example, Hadoop’s FileSystem
provides the class FSDataInputStream for file reading rather than using Java’s
java.io.DataInputStream. FSDataInputStream extends DataInputStream with random
read access, a feature that MapReduce requires because a machine may be assigned
to process a split that sits right in the middle of an input file. Without random access,
it would be extremely inefficient to have to read the file from the beginning until
you reach the location of the split. You can also see how HDFS is designed for storing
data that MapReduce will split and process in parallel. HDFS stores files in blocks
spread over multiple machines. Roughly speaking, each file block is a split. As different
machines will likely have different blocks, parallelization is automatic if each split/
block is processed by the machine on which it resides. Furthermore, as HDFS replicates
blocks across multiple nodes for reliability, MapReduce can choose any of the nodes that
have a copy of a split/block.
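To make the random read access concrete, here’s a minimal sketch (the path and offset are made up for illustration) that opens a file through Hadoop’s FileSystem and seeks directly to a split’s starting byte:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SeekDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/chuck/input/big-log.txt");  // hypothetical input file
        long splitStart = 64L * 1024 * 1024;  // pretend this split begins at the 64 MB mark

        FSDataInputStream in = fs.open(file);
        in.seek(splitStart);                  // jump straight to the split; no need to scan from byte 0
        byte[] buffer = new byte[4096];
        int bytesRead = in.read(buffer);      // read the first few KB of the split
        System.out.println("Read " + bytesRead + " bytes starting at offset " + splitStart);
        in.close();
    }
}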
Input splits and record boundaries
Note that input splits are a logical division of your records, whereas HDFS blocks are
a physical division of the input data. It’s most efficient when the two coincide, but in
practice they’re never perfectly aligned: records may cross block boundaries.
Hadoop guarantees the processing of all records. A machine processing a particular
split may fetch a fragment of a record from a block other than its “main” block, and
that block may reside remotely. The communication cost of fetching a record fragment is
inconsequential because it happens relatively rarely.
You’ll recall that MapReduce works on key/value pairs. So far we’ve seen that Hadoop
by default considers each line in the input file to be a record, with the byte offset of
the line as the key and the content of the line as the value. You may not have
recorded all your data that way, though. Hadoop supports a few other data formats and allows
you to define your own.
3.3.1 InputFormat
The way an input file is split up and read by Hadoop is defined by one of the
implementations of the InputFormat interface. TextInputFormat is the default
InputFormat implementation, and it’s the data format we’ve been implicitly using up to
now. It’s often useful for input data that has no definite key value, when you want to
get the content one line at a time. The key returned by TextInputFormat is the byte
offset of each line, and we have yet to see any program that uses that key for its data
processing.
POPULAR INPUTFORMAT CLASSES
Table 3.4 lists other popular implementations of InputFormat along with a description
of the key/value pair each one passes to the mapper.
Table 3.4 Main InputFormat classes. TextInputFormat is the default unless an alternative is
specified. The object types for key and value are also described.

TextInputFormat
    Each line in the text files is a record. Key is the byte offset of the line, and value
    is the content of the line.
    key: LongWritable
    value: Text

KeyValueTextInputFormat
    Each line in the text files is a record. The first separator character divides each
    line. Everything before the separator is the key, and everything after is the value.
    The separator is set by the key.value.separator.in.input.line property, and the
    default is the tab (\t) character.
    key: Text
    value: Text

SequenceFileInputFormat<K,V>
    An InputFormat for reading in sequence files. Key and value are user defined. A
    sequence file is a Hadoop-specific compressed binary file format. It’s optimized for
    passing data between the output of one MapReduce job and the input of another
    MapReduce job.
    key: K (user defined)
    value: V (user defined)

NLineInputFormat
    Same as TextInputFormat, but each split is guaranteed to have exactly N lines. N is
    set by the mapred.line.input.format.linespermap property, which defaults to one.
    key: LongWritable
    value: Text
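As a hedged sketch of how the NLineInputFormat entry in table 3.4 might be used, you could hand each mapper exactly ten lines by setting both the InputFormat and its lines-per-map property on a JobConf (the driver class here is hypothetical):

JobConf conf = new JobConf(MyJob.class);                     // hypothetical driver class
conf.setInputFormat(NLineInputFormat.class);                 // org.apache.hadoop.mapred.lib.NLineInputFormat
conf.setInt("mapred.line.input.format.linespermap", 10);     // each split, and thus each mapper, gets 10 lines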
KeyValueTextInputFormat is used for more structured input files, where a predefined
character, usually a tab (\t), separates the key and value of each line (record).
For example, you may have a tab-separated data file of timestamps and URLs:
17:16:18 http://hadoop.apache.org/core/docs/r0.19.0/api/index.html
17:16:19 http://hadoop.apache.org/core/docs/r0.19.0/mapred_tutorial.html
17:16:20 http://wiki.apache.org/hadoop/GettingStartedWithHadoop
17:16:20 http://www.maxim.com/hotties/2008/finalist_gallery.aspx
17:16:25 http://wiki.apache.org/hadoop/
...
You can set your JobConf object to use the KeyValueTextInputFormat class to read
this file.
conf.setInputFormat(KeyValueTextInputFormat.class);
Given the preceding example file, the first record your mapper reads will have a key
of “17:16:18” and a value of “http://hadoop.apache.org/core/docs/r0.19.0/api/
index.html”. The second record to your mapper will have a key of “17:16:19” and
a value of “http://hadoop.apache.org/core/docs/r0.19.0/mapred_tutorial.html.”
And so on.
Recall that our previous mappers had used LongWritable and Text as the
key and value types, respectively. LongWritable is a reasonable type for the key
under TextInputFormat because the key is a numerical offset. When using
KeyValueTextInputFormat, both the key and the value will be of type Text, and
you’ll have to change your Mapper implementation and map() method to reflect the
new key type.
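For instance, a pass-through mapper for the timestamp/URL data might look like the following sketch (the class name and the trivial logic are only for illustration); note that both the input key and the input value are now Text:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class TimestampUrlMapper extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {

    public void map(Text key, Text value,
                    OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException {
        // key is the timestamp (e.g., "17:16:18"); value is the URL on the rest of the line
        output.collect(key, value);
    }
}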
The input data to your MapReduce job does not necessarily have to be some
external data. In fact it’s often the case that the input to one MapReduce job is the
output of some other MapReduce job. As we’ll see, you can customize your output
format too. The default output format writes the output in the same format that
KeyValueTextInputFormat can read back in (i.e., each line is a record with key and
value separated by a tab character). Hadoop also provides a much more efficient binary
compressed file format called the sequence file. The sequence file is optimized for Hadoop
processing and should be the preferred format when chaining multiple MapReduce
jobs. The InputFormat class to read sequence files is SequenceFileInputFormat.
The object types for key and value in a sequence file are definable by the user. The
output and the input types have to match, and your Mapper implementation and map()
method have to take in the right input type.
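As a sketch of how such chaining might be configured (the driver class and paths are hypothetical), the first job writes a sequence file that the second job reads back in with matching key/value types:

// Job 1: write intermediate results as a sequence file of <Text, LongWritable> pairs
JobConf job1 = new JobConf(ChainDriver.class);
job1.setOutputFormat(SequenceFileOutputFormat.class);
job1.setOutputKeyClass(Text.class);
job1.setOutputValueClass(LongWritable.class);
FileOutputFormat.setOutputPath(job1, new Path("intermediate"));

// Job 2: read the same sequence file back in; the mapper must accept <Text, LongWritable>
JobConf job2 = new JobConf(ChainDriver.class);
job2.setInputFormat(SequenceFileInputFormat.class);
FileInputFormat.setInputPaths(job2, new Path("intermediate"));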
CREATING A CUSTOM INPUTFORMAT—INPUTSPLIT AND RECORDREADER
Sometimes you may want to read input data in a way different from the standard
InputFormat classes. In that case you’ll have to write your own custom InputFormat
class. Let’s look at what it involves. InputFormat is an interface consisting of only two
methods.
public interface InputFormat<K, V> {
    InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
    RecordReader<K, V> getRecordReader(InputSplit split,
                                       JobConf job,
                                       Reporter reporter) throws IOException;
}
The two methods sum up the functions that InputFormat has to perform:
■ Identify all the files used as input data and divide them into input splits. Each
map task is assigned one split.
■ Provide an object (RecordReader) to iterate through records in a given split,
and to parse each record into key and value of predefined types.
Who wants to worry about how files are divided into splits? In creating your
own InputFormat class you should subclass the FileInputFormat class, which
takes care of file splitting. In fact, all the InputFormat classes in table 3.4 subclass
FileInputFormat. FileInputFormat implements the getSplits() method but
leaves getRecordReader() abstract for the subclass to fill out. FileInputFormat’s
getSplits() implementation tries to divide the input data into roughly the number
of splits specified in numSplits, subject to the constraints that each split must have
more than mapred.min.split.size bytes but also be smaller than the
block size of the filesystem. In practice, a split usually ends up being the size of a block,
which defaults to 64 MB in HDFS.
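If you want fewer, larger splits (and therefore fewer map tasks), one knob is the minimum split size just mentioned; the following one-liner on your job’s JobConf is a sketch, and the 128 MB figure is an arbitrary example:

// Ask getSplits() for splits of at least 128 MB; actual sizes still depend on numSplits and the block size
conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);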
FileInputFormat has a number of protected methods a subclass can override
to change its behavior, one of which is the isSplitable(FileSystem fs, Path
filename) method. It checks whether you can split a given file. The default
implementation always returns true, so all files larger than a block will be split.
Sometimes you may want a file to be its own split, and you’ll override isSplitable()
to return false in those situations. For example, some file compression schemes don’t
support splits. (You can’t start reading from the middle of those files.) Some data
processing operations, such as file conversion, need to treat each file as an atomic
record, so those files shouldn’t be split either.
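A minimal sketch of such an override (the class name is made up) simply subclasses an existing format and declares every file unsplittable:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        // Every file becomes a single split, e.g., for unsplittable compression or per-file conversion
        return false;
    }
}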
In using FileInputFormat you focus on customizing RecordReader, which is
responsible for parsing an input split into records and then parsing each record into a
key/value pair. Let’s look at the signature of this interface.
public interface RecordReader<K, V> {
    boolean next(K key, V value) throws IOException;
    K createKey();
    V createValue();
    long getPos() throws IOException;
    public void close() throws IOException;
    float getProgress() throws IOException;
}
Instead of writing our own RecordReader, we’ll again leverage existing classes provided
by Hadoop. For example, LineRecordReader implements RecordReader
<LongWritable, Text>. It’s used in TextInputFormat and reads one line at a time,
with the byte offset as key and the line content as value. KeyValueLineRecordReader
is used in KeyValueTextInputFormat. For the most part, your custom RecordReader will be
a wrapper around an existing implementation, and most of the action will be in the
next() method.
One use case for writing your own custom InputFormat class is to read records
in a specific type rather than the generic Text type. For example, we had previously
used KeyValueTextInputFormat to read a tab-separated data file of timestamps
and URLs. The class ends up treating both the timestamp and the URL as Text type.
For our illustration, let’s create a TimeUrlTextInputFormat that works exactly the
same but treats the URL as a URLWritable type. As mentioned earlier, we create
our InputFormat class by extending FileInputFormat and implementing the factory
method to return our RecordReader.
public class TimeUrlTextInputFormat extends
        FileInputFormat<Text, URLWritable> {

    public RecordReader<Text, URLWritable> getRecordReader(
            InputSplit input, JobConf job, Reporter reporter)
            throws IOException {
        return new TimeUrlLineRecordReader(job, (FileSplit)input);
    }
}
Our URLWritable class is quite straightforward:
public class URLWritable implements Writable {
    protected URL url;

    public URLWritable() { }

    public URLWritable(URL url) {
        this.url = url;
    }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(url.toString());    // serialize the URL as a UTF string
    }

    public void readFields(DataInput in) throws IOException {
        url = new URL(in.readUTF());     // deserialize back into a URL object
    }

    public void set(String s) throws MalformedURLException {
        url = new URL(s);
    }
}
Our TimeUrlLineRecordReader will implement the six methods in the RecordReader
interface, in addition to the class constructor. It’s mostly a wrapper around
KeyValueLineRecordReader, but it converts the record value from Text to type URLWritable.
class TimeUrlLineRecordReader implements RecordReader<Text, URLWritable> {
    private KeyValueLineRecordReader lineReader;
    private Text lineKey, lineValue;

    public TimeUrlLineRecordReader(JobConf job, FileSplit split)
            throws IOException {
        lineReader = new KeyValueLineRecordReader(job, split);
        lineKey = lineReader.createKey();
        lineValue = lineReader.createValue();
    }

    public boolean next(Text key, URLWritable value) throws IOException {
        if (!lineReader.next(lineKey, lineValue)) {
            return false;
        }
        key.set(lineKey);
        value.set(lineValue.toString());
        return true;
    }

    public Text createKey() {
        return new Text("");
    }

    public URLWritable createValue() {
        return new URLWritable();
    }

    public long getPos() throws IOException {
        return lineReader.getPos();
    }

    public float getProgress() throws IOException {
        return lineReader.getProgress();
    }

    public void close() throws IOException {
        lineReader.close();
    }
}
Our TimeUrlLineRecordReader class creates a KeyValueLineRecordReader object
and passes the getPos(), getProgress(), and close() method calls directly to it.
The next() method converts the lineValue Text object into the URLWritable type.
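To plug the new format into a job, the driver configuration looks much like the earlier KeyValueTextInputFormat example; the driver class below is hypothetical:

JobConf conf = new JobConf(TimeUrlDriver.class);     // hypothetical driver class
conf.setInputFormat(TimeUrlTextInputFormat.class);   // mapper now receives Text keys and URLWritable values
// If the mapper emits the same types it receives, declare them as the map output types
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(URLWritable.class);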
3.3.2 OutputFormat
MapReduce outputs data into files using the OutputFormat class, which is analogous
to the InputFormat class. The output has no splits, as each reducer writes its output
only to its own file. The output files reside in a common directory and are typically
named part-nnnnn, where nnnnn is the partition ID of the reducer. RecordWriter
objects format the output and RecordReaders parse the format of the input.
Hadoop provides several standard implementations of OutputFormat, as shown
in table 3.5. Not surprisingly, almost all the ones we deal with inherit from the
FileOutputFormat abstract class, just as the InputFormat classes inherit from FileInputFormat.
You specify the OutputFormat by calling setOutputFormat() on the JobConf object
that holds the configuration of your MapReduce job.
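For example, although TextOutputFormat is already the default, a minimal sketch of declaring it explicitly (the key/value classes shown are just one possibility) looks like this:

conf.setOutputFormat(TextOutputFormat.class);   // explicit, though this is the default
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);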
NOTE You may wonder why there’s a separation between OutputFormat
(InputFormat) and FileOutputFormat (FileInputFormat) when it
seems all OutputFormat (InputFormat) classes extend FileOutputFormat
(FileInputFormat). Are there OutputFormat (InputFormat) classes that
don’t work with files? Well, the NullOutputFormat implements OutputFormat
in a trivial way and doesn’t need to subclass FileOutputFormat. More importantly,
there are OutputFormat (InputFormat) classes that work with databases
rather than files, and these classes are in a separate branch in the class hierarchy
from FileOutputFormat (FileInputFormat). These classes have specialized
applications, and the interested reader can dig further in the online Java documentation
for DBInputFormat and DBOutputFormat.
Table 3.5 Main OutputFormat classes. TextOutputFormat is the default.

TextOutputFormat<K,V>
    Writes each record as a line of text. Keys and values are written as strings and
    separated by a tab (\t) character, which can be changed in the
    mapred.textoutputformat.separator property.

SequenceFileOutputFormat<K,V>
    Writes the key/value pairs in Hadoop’s proprietary sequence file format. Works in
    conjunction with SequenceFileInputFormat.

NullOutputFormat<K,V>
    Outputs nothing.
The default OutputFormat is TextOutputFormat, which writes each record as a line
of text. Each record’s key and value are converted to strings through toString() , and
a tab (\t) character separates them. The separator character can be changed in the
mapred.textoutputformat.separator property.
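As a quick sketch, switching the separator from a tab to a comma is a one-line configuration change on your job’s JobConf:

// Emit "key,value" lines instead of the default tab-separated pairs
conf.set("mapred.textoutputformat.separator", ",");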
TextOutputFormat outputs data in a format readable by KeyValueTextInputFormat.
It can also output in a format readable by TextInputFormat if you make the
key type a NullWritable. In that case the key in the key/value pair is not written out,
and neither is the separator character. If you want to suppress the output completely,
then you should use the NullOutputFormat. Suppressing the Hadoop output is useful
if your reducer writes its output in its own way and doesn’t need Hadoop to write any
additional files.
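Hedged sketches of both options, applied to a job’s JobConf, might look like this:

// Option 1: value-only lines that TextInputFormat can read back (the reducer emits NullWritable keys)
conf.setOutputKeyClass(NullWritable.class);
conf.setOutputValueClass(Text.class);

// Option 2: suppress Hadoop's output files entirely
conf.setOutputFormat(NullOutputFormat.class);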
Finally, SequenceFileOutputFormat writes the output in a sequence file format
that can be read back in using SequenceFileInputFormat. It’s useful for writing
intermediate data results when chaining MapReduce jobs.
3.4 Summary
Hadoop is a software framework that demands a different perspective on data processing.
It has its own filesystem, HDFS, that stores data in a way optimized for data-intensive
processing. You need specialized Hadoop tools to work with HDFS, but fortunately
most of those tools follow familiar Unix or Java syntax.