In the Hadoop source code, InputFormat is an abstract class: public abstract class InputFormat<K, V>
https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/InputFormat.java
For background, see this article:
https://cloud.tencent.com/developer/article/1043622
It declares two abstract methods:
public abstract List<InputSplit> getSplits(JobContext context) throws IOException, InterruptedException;
and
public abstract RecordReader<K,V> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException;
The getSplits method is responsible for dividing the input files logically, producing a List<InputSplit>. The source code for InputSplit is at
https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/InputSplit.java
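The article linked below notes that an InputSplit is a logical concept: the underlying file is not physically divided. A split carries only metadata, such as the start position of the data, the length of the data, and the nodes on which the data resides.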
https://cloud.tencent.com/developer/article/1481777
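
To make the "logical split" idea concrete, here is a minimal sketch in plain Java. It does not use any Hadoop classes: the names LogicalSplitSketch, Split, and computeSplits are hypothetical and exist only for illustration, and the fixed-size rule is an assumption that is simpler than what a real InputFormat implementation does. It only shows that splitting can mean producing (start, length) metadata without touching the file's bytes.

```java
import java.util.ArrayList;
import java.util.List;

class LogicalSplitSketch {

    // Hypothetical stand-in for an InputSplit: metadata only, no file data.
    static final class Split {
        final long start;   // byte offset where this split begins
        final long length;  // number of bytes this split covers

        Split(long start, long length) {
            this.start = start;
            this.length = length;
        }

        @Override
        public String toString() {
            return "Split[start=" + start + ", length=" + length + "]";
        }
    }

    // Describe a file of fileLength bytes as consecutive ranges of at most
    // splitSize bytes. Nothing is read or copied; only offsets are computed.
    static List<Split> computeSplits(long fileLength, long splitSize) {
        if (fileLength < 0 || splitSize <= 0) {
            throw new IllegalArgumentException("fileLength must be >= 0 and splitSize > 0");
        }
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            long length = Math.min(splitSize, fileLength - start);
            splits.add(new Split(start, length));
        }
        return splits;
    }

    public static void main(String[] args) {
        // A 250-byte file with a 100-byte split size yields three splits.
        for (Split s : computeSplits(250L, 100L)) {
            System.out.println(s);
        }
    }
}
```

In this sketch the last split is simply shorter than the others (50 bytes for the 250-byte example). A real split would also record which nodes hold the data, which is omitted here to keep the example small.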