3.3 Reading and writing


    Let’s see how MapReduce reads input data and writes output data, focusing on the
    file formats it uses. To enable easy distributed processing, MapReduce makes certain
    assumptions about the data it’s processing. It also provides flexibility in dealing with a
    variety of data formats.
    Input data usually resides in large files, typically tens or hundreds of gigabytes or
    even more. One of the fundamental principles of MapReduce’s processing power is
    the splitting of the input data into chunks. You can process these chunks in parallel
    using multiple machines. In Hadoop terminology these chunks are called input splits.

    The size of each split should be small enough to allow granular parallelization.
    (If all the input data is in one split, there is no parallelization.) On the other
    hand, each split shouldn’t be so small that the overhead of starting and stopping the
    processing of a split becomes a large fraction of the execution time.
    The principle of dividing input data (often a single massive file) into splits for
    parallel processing explains some of the design decisions behind Hadoop’s
    generic FileSystem as well as HDFS in particular. For example, Hadoop’s FileSystem
    provides the class FSDataInputStream for file reading rather than using Java’s
    java.io.DataInputStream. FSDataInputStream extends DataInputStream with random
    read access, a feature that MapReduce requires because a machine may be assigned
    to process a split that sits right in the middle of an input file. Without random access,
    it would be extremely inefficient to read the file from the beginning until
    you reach the location of the split. You can also see how HDFS is designed for storing
    data that MapReduce will split and process in parallel. HDFS stores files in blocks
    spread over multiple machines. Roughly speaking, each file block is a split. As different
    machines will likely hold different blocks, parallelization is automatic if each split/
    block is processed by the machine on which it resides. Furthermore, as HDFS replicates
    blocks on multiple nodes for reliability, MapReduce can choose any of the nodes that
    have a copy of a split/block.

    Input splits and record boundaries
    Note that input splits are a logical division of your records, whereas HDFS blocks are
    a physical division of the input data. It’s extremely efficient when the two coincide,
    but in practice they’re never perfectly aligned. Records may cross block boundaries.
    Hadoop guarantees the processing of all records. A machine processing a particular
    split may fetch a fragment of a record from a block other than its “main” block, and
    that block may reside remotely. The communication cost of fetching a record fragment
    is inconsequential because it happens relatively rarely.

    You’ll recall that MapReduce works on key/value pairs. So far we’ve seen that Hadoop
    by default considers each line in the input file to be a record and the key/value pair
    is the byte offset (key) and content of the line (value), respectively. You may not have
    recorded all your data that way. Hadoop supports a few other data formats and allows
    you to define your own.

    3.3.1 InputFormat
    The way an input file is split up and read by Hadoop is defined by one of the implementations
    of the InputFormat interface. TextInputFormat is the default InputFormat
    implementation, and it’s the data format we’ve been implicitly using up to
    now. It’s often useful for input data that has no definite key value, when you want to
    get the content one line at a time. The key returned by TextInputFormat is the byte
    offset of each line, and we have yet to see any program that uses that key for its data
    processing.
    POPULAR INPUTFORMAT CLASSES
    Table 3.4 lists other popular implementations of InputFormat along with a description
    of the key/value pair each one passes to the mapper.

    Table 3.4 Main InputFormat classes. TextInputFormat is the default unless an alternative is
    specified. The object types for key and value are also described.

    TextInputFormat
        Each line in the text files is a record. Key is the byte offset of the line, and
        value is the content of the line.
        key: LongWritable
        value: Text

    KeyValueTextInputFormat
        Each line in the text files is a record. The first separator character divides
        each line. Everything before the separator is the key, and everything after is
        the value. The separator is set by the key.value.separator.in.input.line
        property, and the default is the tab (\t) character.
        key: Text
        value: Text

    SequenceFileInputFormat<K,V>
        An InputFormat for reading in sequence files. Key and value are user defined.
        Sequence file is a Hadoop-specific compressed binary file format. It’s optimized
        for passing data between the output of one MapReduce job and the input of some
        other MapReduce job.
        key: K (user defined)
        value: V (user defined)

    NLineInputFormat
        Same as TextInputFormat, but each split is guaranteed to have exactly N lines.
        The mapred.line.input.format.linespermap property, which defaults to one, sets N.
        key: LongWritable
        value: Text
    KeyValueTextInputFormat is used for more structured input files, where a predefined
    character, usually a tab (\t), separates the key and value of each line (record).
    For example, you may have a tab-separated data file of timestamps and URLs:
    17:16:18 http://hadoop.apache.org/core/docs/r0.19.0/api/index.html
    17:16:19 http://hadoop.apache.org/core/docs/r0.19.0/mapred_tutorial.html
    17:16:20 http://wiki.apache.org/hadoop/GettingStartedWithHadoop
    17:16:20 http://www.maxim.com/hotties/2008/finalist_gallery.aspx
    17:16:25 http://wiki.apache.org/hadoop/
    ...

    You can set your JobConf object to use the KeyValueTextInputFormat class to read
    this file.
    conf.setInputFormat(KeyValueTextInputFormat.class);
    Given the preceding example file, the first record your mapper reads will have a key
    of “17:16:18” and a value of “http://hadoop.apache.org/core/docs/r0.19.0/api/index.html”.
    The second record will have a key of “17:16:19” and a value of
    “http://hadoop.apache.org/core/docs/r0.19.0/mapred_tutorial.html”. And so on.
    Recall that our previous mappers used LongWritable and Text as the
    key and value types, respectively. LongWritable is a reasonable type for the key
    under TextInputFormat because the key is a numerical offset. When using
    KeyValueTextInputFormat, both the key and the value will be of type Text, and
    you’ll have to change your Mapper implementation and map() method to reflect the
    new key type.
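    As a rough sketch (the class name and pass-through logic here are our own, not a
    listing from this chapter), a mapper matched to KeyValueTextInputFormat declares
    Text for both input types, using the same org.apache.hadoop.mapred API as the rest
    of our examples:

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Hypothetical mapper: the key is the timestamp, the value is the URL, both as Text.
    public class TimestampUrlMapper extends MapReduceBase
        implements Mapper<Text, Text, Text, Text> {

        public void map(Text key, Text value,
                        OutputCollector<Text, Text> output,
                        Reporter reporter) throws IOException {
            // Pass the timestamp/URL pair through unchanged.
            output.collect(key, value);
        }
    }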
    The input data to your MapReduce job does not necessarily have to be external data.
    In fact, it’s often the case that the input to one MapReduce job is the
    output of some other MapReduce job. As we’ll see, you can customize your output
    format too. The default output format writes the output in the same format that
    KeyValueTextInputFormat can read back in (i.e., each line is a record with key and
    value separated by a tab character). Hadoop also provides a much more efficient binary
    compressed file format called a sequence file. This sequence file is optimized for Hadoop
    processing and should be the preferred format when chaining multiple MapReduce
    jobs. The InputFormat class to read sequence files is SequenceFileInputFormat.
    The object types for key and value in a sequence file are definable by the user. The
    output and input types have to match, and your Mapper implementation and map()
    method have to take in the right input type.
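    As a hedged sketch (MyFirstJob, MySecondJob, and the “intermediate” path are
    placeholders of our own), the wiring for two chained jobs could look roughly like this:
    the first job writes Text/LongWritable pairs to a sequence file, and the second job
    reads them back with the same types.

    // First job: write its output as a sequence file of <Text, LongWritable> pairs.
    JobConf job1 = new JobConf(MyFirstJob.class);
    job1.setOutputFormat(SequenceFileOutputFormat.class);
    job1.setOutputKeyClass(Text.class);
    job1.setOutputValueClass(LongWritable.class);
    FileOutputFormat.setOutputPath(job1, new Path("intermediate"));

    // Second job: read the same sequence file back; its mapper must accept
    // Text keys and LongWritable values to match the first job's output.
    JobConf job2 = new JobConf(MySecondJob.class);
    job2.setInputFormat(SequenceFileInputFormat.class);
    FileInputFormat.setInputPaths(job2, new Path("intermediate"));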
    CREATING A CUSTOM INPUTFORMAT—INPUTSPLIT AND RECORDREADER
    Sometimes you may want to read input data in a way different from the standard
    InputFormat classes. In that case you’ll have to write your own custom InputFormat
    class. Let’s look at what it involves. InputFormat is an interface consisting of only two
    methods.
    public interface InputFormat<K, V> {
        InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;

        RecordReader<K, V> getRecordReader(InputSplit split,
                                           JobConf job,
                                           Reporter reporter) throws IOException;
    }
    The two methods sum up the functions that InputFormat has to perform:
    ■ Identify all the files used as input data and divide them into input splits. Each
      map task is assigned one split.
    ■ Provide an object (RecordReader) to iterate through records in a given split,
      and to parse each record into key and value of predefined types.
    Who wants to worry about how files are divided into splits? In creating your
    own InputFormat class you should subclass the FileInputFormat class, which
    takes care of file splitting. In fact, all the InputFormat classes in table 3.4 subclass
    FileInputFormat. FileInputFormat implements the getSplits() method but
    leaves getRecordReader() abstract for the subclass to fill out. FileInputFormat’s
    getSplits() implementation tries to divide the input data into roughly the number
    of splits specified in numSplits, subject to the constraint that each split must be
    larger than mapred.min.split.size bytes but smaller than the block size of the
    filesystem. In practice, a split usually ends up being the size of a block, which
    defaults to 64 MB in HDFS.
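    For instance (a minimal sketch; the 128 MB figure and the driver class are arbitrary),
    you can nudge these split-size parameters from your driver:

    JobConf conf = new JobConf(MyJob.class);             // MyJob is a placeholder driver class
    conf.setNumMapTasks(10);                             // the numSplits hint passed to getSplits()
    conf.setLong("mapred.min.split.size", 134217728);    // don't allow splits smaller than 128 MB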
    FileInputFormat has a number of protected methods a subclass can override
    to change its behavior, one of which is the isSplitable(FileSystem fs, Path
    filename) method. It checks whether you can split a given file. The default
    implementation always returns true, so all files larger than a block will be split.
    Sometimes you may want a file to be its own split, and you’ll override isSplitable()
    to return false in those situations. For example, some file compression schemes don’t
    support splits. (You can’t start reading from the middle of those files.) Some data
    processing operations, such as file conversion, need to treat each file as an atomic
    record, and such a file should not be split either.
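    A minimal sketch of such an override (the class name is our own; it reuses
    TextInputFormat’s record reading while refusing to split any file):

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    // Hypothetical InputFormat whose files are never split: each input file
    // becomes exactly one split, processed whole by a single map task.
    public class WholeFileTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(FileSystem fs, Path filename) {
            return false;
        }
    }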
    In using FileInputFormat you focus on customizing RecordReader, which is
    responsible for parsing an input split into records and then parsing each record into a
    key/value pair. Let’s look at the signature of this interface.

    public interface RecordReader<K, V> {
        boolean next(K key, V value) throws IOException;
        K createKey();
        V createValue();
        long getPos() throws IOException;
        void close() throws IOException;
        float getProgress() throws IOException;
    }
    Instead of writing our own RecordReader, we’ll again leverage existing classes provided
    by Hadoop. For example, LineRecordReader implements RecordReader
    <LongWritable,Text>. It’s used in TextInputFormat and reads one line at a time,
    with byte offset as key and line content as value. KeyValueLineRecordReader is used
    in KeyValueTextInputFormat. For the most part, your custom RecordReader will be
    a wrapper around an existing implementation, and most of the action will be in the
    next() method.
    One use case for writing your own custom InputFormat class is to read records
    in a specific type rather than the generic Text type. For example, we had previously
    used KeyValueTextInputFormat to read a tab-separated data file of timestamps
    and URLs. The class ends up treating both the timestamp and the URL as the Text
    type. For our illustration, let’s create a TimeUrlTextInputFormat that works exactly
    the same but treats the URL as a URLWritable type. As mentioned earlier, we create
    our InputFormat class by extending FileInputFormat and implementing the factory
    method to return our RecordReader.
    public class TimeUrlTextInputFormat extends
            FileInputFormat<Text, URLWritable> {

        public RecordReader<Text, URLWritable> getRecordReader(
                InputSplit input, JobConf job, Reporter reporter)
                throws IOException {
            return new TimeUrlLineRecordReader(job, (FileSplit) input);
        }
    }
    Our URLWritable class is quite straightforward:

    public class URLWritable implements Writable {
        protected URL url;

        public URLWritable() { }

        public URLWritable(URL url) {
            this.url = url;
        }

        public void write(DataOutput out) throws IOException {
            out.writeUTF(url.toString());
        }

        public void readFields(DataInput in) throws IOException {
            url = new URL(in.readUTF());
        }

        public void set(String s) throws MalformedURLException {
            url = new URL(s);
        }
    }
    Our TimeUrlLineRecordReader will implement the six methods in the RecordReader
    interface, in addition to the class constructor. It’s mostly a wrapper around
    KeyValueLineRecordReader, but it converts the record value from Text to type URLWritable.
    class TimeUrlLineRecordReader implements RecordReader<Text, URLWritable> {
        private KeyValueLineRecordReader lineReader;
        private Text lineKey, lineValue;

        public TimeUrlLineRecordReader(JobConf job, FileSplit split)
                throws IOException {
            lineReader = new KeyValueLineRecordReader(job, split);
            lineKey = lineReader.createKey();
            lineValue = lineReader.createValue();
        }

        public boolean next(Text key, URLWritable value) throws IOException {
            if (!lineReader.next(lineKey, lineValue)) {
                return false;
            }
            key.set(lineKey);
            value.set(lineValue.toString());
            return true;
        }

        public Text createKey() {
            return new Text("");
        }

        public URLWritable createValue() {
            return new URLWritable();
        }

        public long getPos() throws IOException {
            return lineReader.getPos();
        }

        public float getProgress() throws IOException {
            return lineReader.getProgress();
        }

        public void close() throws IOException {
            lineReader.close();
        }
    }
    Our TimeUrlLineRecordReader class creates a KeyValueLineRecordReader object
    and passes the getPos(), getProgress(), and close() method calls directly to it.
    The next() method converts the lineValue Text object into the URLWritable type.
    3.3.2 OutputFormat
    MapReduce outputs data into files using the OutputFormat class, which is analogous
    to the InputFormat class. The output has no splits, as each reducer writes its output
    only to its own file. The output files reside in a common directory and are typically
    named part-nnnnn, where nnnnn is the partition ID of the reducer. RecordWriter
    objects format the output, just as RecordReaders parse the format of the input.
    Hadoop provides several standard implementations of OutputFormat, as shown
    in table 3.5. Not surprisingly, almost all the ones we deal with inherit from the
    FileOutputFormat abstract class, just as the InputFormat classes inherit from
    FileInputFormat. You specify the OutputFormat by calling setOutputFormat()
    on the JobConf object that holds the configuration of your MapReduce job.
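    For example, to have a job write sequence files instead of the default text output,
    you might call (mirroring the earlier setInputFormat() call; the particular choice of
    SequenceFileOutputFormat here is only for illustration):

    conf.setOutputFormat(SequenceFileOutputFormat.class);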

    NOTE You may wonder why there’s a separation between OutputFormat
    (InputFormat) and FileOutputFormat (FileInputFormat) when it
    seems all OutputFormat (InputFormat) classes extend FileOutputFormat
    (FileInputFormat). Are there OutputFormat (InputFormat) classes that
    don’t work with files? Well, the NullOutputFormat implements OutputFormat
    in a trivial way and doesn’t need to subclass FileOutputFormat. More importantly,
    there are OutputFormat (InputFormat) classes that work with databases
    rather than files, and these classes are in a separate branch in the class hierarchy
    from FileOutputFormat (FileInputFormat). These classes have specialized
    applications, and the interested reader can dig further in the online Java documentation
    for DBInputFormat and DBOutputFormat.
    Table 3.5 Main OutputFormat classes. TextOutputFormat is the default.

    TextOutputFormat<K,V>
        Writes each record as a line of text. Keys and values are written as strings
        and separated by a tab (\t) character, which can be changed in the
        mapred.textoutputformat.separator property.

    SequenceFileOutputFormat<K,V>
        Writes the key/value pairs in Hadoop’s proprietary sequence file format. Works
        in conjunction with SequenceFileInputFormat.

    NullOutputFormat<K,V>
        Outputs nothing.
    The default OutputFormat is TextOutputFormat, which writes each record as a line
    of text. Each record’s key and value are converted to strings through toString(), and
    a tab (\t) character separates them. The separator character can be changed in the
    mapred.textoutputformat.separator property.
    TextOutputFormat outputs data in a format readable by KeyValueTextInputFormat.
    It can also output in a format readable by TextInputFormat if you make the
    key type a NullWritable. In that case the key in the key/value pair is not written out,
    and neither is the separator character. If you want to suppress the output completely,
    then you should use the NullOutputFormat. Suppressing the Hadoop output is useful
    if your reducer writes its output in its own way and doesn’t need Hadoop to write any
    additional files.
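    As a minimal sketch (the reducer class and its pass-through logic are our own, not a
    listing from this chapter), emitting NullWritable keys so that only the values appear
    in the output might look like this:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Hypothetical reducer: with a NullWritable key, TextOutputFormat writes neither
    // the key nor the tab separator, so each output line is just a value and can
    // later be read back with plain TextInputFormat.
    public class ValueOnlyReducer extends MapReduceBase
        implements Reducer<Text, Text, NullWritable, Text> {

        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<NullWritable, Text> output,
                           Reporter reporter) throws IOException {
            while (values.hasNext()) {
                output.collect(NullWritable.get(), values.next());
            }
        }
    }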
    Finally, SequenceFileOutputFormat writes the output in a sequence file format
    that can be read back in using SequenceFileInputFormat. It’s useful for writing
    intermediate data results when chaining MapReduce jobs.
    3.4 Summary
    Hadoop is a software framework that demands a different perspective on data processing.
    It has its own filesystem, HDFS, that stores data in a way optimized for data-intensive
    processing. You need specialized Hadoop tools to work with HDFS, but fortunately
    most of those tools follow familiar Unix or Java syntax.
