• How Hadoop determines the number of maps (input splits)


    While studying Hadoop I had long wanted to step through the Hadoop source code in a debugger, but I never found a workable way to do it. Today, while debugging a matrix multiplication job, I finally found one, so I am writing it down here.

    1) It started with wanting to set the number of maps for a Job (even though the final map count is determined by the input splits). Before hadoop 1.2.1 this was done with:

    job.setNumMapTasks()

    However, hadoop 1.2.1 no longer has this method; only the method for setting the number of reduces remains. Searching further, I found that others suggested another approach: set it through the Configuration object, like this:

    conf.set("mapred.map.tasks",5);//设置5个map

    After configuring it this way there was still no effect. The (old-API) code that controls the number of splits is:

    goalSize = totalSize / (numSplits == 0 ? 1 : numSplits)
    // totalSize is the total size of the input files and numSplits is the map count requested
    // by the user, so goalSize is the split size the user is implicitly asking for.
    minSize = Math.max(job.getLong("mapred.min.split.size", 1), minSplitSize)
    // In hadoop 1.2.1, mapred-default.xml sets mapred.min.split.size=0, so
    // job.getLong("mapred.min.split.size", 1) returns 0, while minSplitSize is a field of
    // FileInputFormat whose value is 1; hence minSize = 1. Its purpose is to enforce the
    // configured lower bound on the split size.
    splitSize = Math.max(minSize, Math.min(goalSize, blockSize))
    // The actual split size is the smaller of goalSize (derived from the requested map count)
    // and the block size blockSize (so a split never spans more than one block, which helps
    // data-local computation), but never smaller than minSize.

    In fact, this formula is how splits were generated before hadoop 1.2.1 (old API); the new API no longer uses the requested map count at all, which is why setting it has no practical effect.
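    To make the old-API arithmetic concrete, here is a minimal sketch that simply replays the formula with made-up numbers (a 160 MB input, a 64 MB block, 5 requested maps); it is an illustration, not the framework code:

    // Hypothetical numbers, only to illustrate the old-API formula above.
    long totalSize    = 160L * 1024 * 1024;   // total input size: 160 MB
    long blockSize    = 64L * 1024 * 1024;    // HDFS block size: 64 MB
    int  numSplits    = 5;                    // the map count requested via mapred.map.tasks
    long minSplitSize = 1L;                   // FileInputFormat's field; mapred.min.split.size is 0

    long goalSize  = totalSize / (numSplits == 0 ? 1 : numSplits);          // 32 MB
    long minSize   = Math.max(0L, minSplitSize);                            // 1
    long splitSize = Math.max(minSize, Math.min(goalSize, blockSize));      // 32 MB
    System.out.println(splitSize);            // 33554432 -> about 5 splits, as requested

    In other words, under the old API the requested map count really did flow into the split size; under the new API it does not.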

    2) In the new API (hadoop 1.2.1), the split computation looks like this (as seen in the decompiled FileInputFormat class):

     1  public List<InputSplit> getSplits(JobContext job) throws IOException {
     2         long minSize = Math.max(this.getFormatMinSplitSize(), getMinSplitSize(job));
     3         long maxSize = getMaxSplitSize(job);
     4         ArrayList splits = new ArrayList();
     5         List files = this.listStatus(job);
     6         Iterator i$ = files.iterator();
     7 
     8         while(true) {
     9             while(i$.hasNext()) {
    10                 FileStatus file = (FileStatus)i$.next();
    11                 Path path = file.getPath();
    12                 FileSystem fs = path.getFileSystem(job.getConfiguration());
    13                 long length = file.getLen();
    14                 BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0L, length);
    15                 if(length != 0L && this.isSplitable(job, path)) {
    16                     long blockSize = file.getBlockSize();
    17                     long splitSize = this.computeSplitSize(blockSize, minSize, maxSize);
    18 
    19                     long bytesRemaining;
    20                     for(bytesRemaining = length; (double)bytesRemaining / (double)splitSize > 1.1D; bytesRemaining -= splitSize) {
    21                         int blkIndex = this.getBlockIndex(blkLocations, length - bytesRemaining);
    22                         splits.add(new FileSplit(path, length - bytesRemaining, splitSize, blkLocations[blkIndex].getHosts()));
    23                     }
    24 
    25                     if(bytesRemaining != 0L) {
    26                         splits.add(new FileSplit(path, length - bytesRemaining, bytesRemaining, blkLocations[blkLocations.length - 1].getHosts()));
    27                     }
    28                 } else if(length != 0L) {
    29                     splits.add(new FileSplit(path, 0L, length, blkLocations[0].getHosts()));
    30                 } else {
    31                     splits.add(new FileSplit(path, 0L, length, new String[0]));
    32                 }
    33             }
    34 
    35             job.getConfiguration().setLong("mapreduce.input.num.files", (long)files.size());
    36             LOG.debug("Total # of splits: " + splits.size());
    37             return splits;
    38         }
    39     }

    Line 17 computes the split size with computeSplitSize(blockSize, minSize, maxSize).

    a. minSize is computed as:

     long minSize = Math.max(this.getFormatMinSplitSize(), getMinSplitSize(job))

    where getFormatMinSplitSize() is:

    protected long getFormatMinSplitSize() {
            return 1L;
        }

    and getMinSplitSize(job) is:

     public static long getMinSplitSize(JobContext job) {
            return job.getConfiguration().getLong("mapred.min.split.size", 1L);
        }

    In hadoop 1.2.1, mapred-default.xml sets "mapred.min.split.size" to 0, and getLong() only falls back to its default of 1 when the key is missing entirely.

    So getMinSplitSize(job) returns 0 (or 1 if the key were absent), while getFormatMinSplitSize() returns 1; therefore minSize = Math.max(1, 0) = 1.
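    The same resolution, written out as a minimal sketch (the 0 is the mapred-default.xml value; nothing here is job-specific):

    long fromConfig = 0L;                    // "mapred.min.split.size" as shipped in mapred-default.xml
    long minSize = Math.max(1L, fromConfig); // getFormatMinSplitSize() == 1, so minSize == 1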

    b. Next, look at how maxSize is computed:

       long maxSize = getMaxSplitSize(job);

    where getMaxSplitSize() is:

    public static long getMaxSplitSize(JobContext context) {
            return context.getConfiguration().getLong("mapred.max.split.size", 9223372036854775807L);
        }

       没有设置"mapred.max.split.size"的话,就使用方法的默认值 9223372036854775807,而"mapred.max.split.size"并没有默认值,所以maxSize= 9223372036854775807;

    c. With minSize = 1 and maxSize = 9223372036854775807 in hand, the split size is computed as:

    long splitSize = this.computeSplitSize(blockSize, minSize, maxSize);

    protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    Clearly, the split size is simply the smaller of maxSize and blockSize (since minSize = 1), so we can control the number of maps by setting "mapred.max.split.size" to any value smaller than the physical block size. With a Configuration object this looks like:

    conf.set("mapred.max.split.size",2000000)//单位是字节,物理块是16M

    3) The matrix multiplication code with the adjustable number of maps is shown below:

    /**
     * Created with IntelliJ IDEA.
     * User: hadoop
     * Date: 16-3-14
     * Time: 3:13 PM
     * To change this template use File | Settings | File Templates.
     */
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import java.io.IOException;
    import java.net.URI;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.util.ReflectionUtils;

    public class MutiDoubleInputMatrixProduct {

        public static void initDoubleArrayWritable(int length,DoubleWritable[] doubleArrayWritable){
            for (int i=0;i<length;++i){
                doubleArrayWritable[i]=new DoubleWritable(0.0);
            }
        }

        public static  class MyMapper extends Mapper<IntWritable,DoubleArrayWritable,IntWritable,DoubleArrayWritable>{
            public DoubleArrayWritable map_value=new DoubleArrayWritable();
            public  double[][] leftMatrix=null;/******************************************/
            //public Object obValue=null;
            public DoubleWritable[] arraySum=null;
            public DoubleWritable[] tempColumnArrayDoubleWritable=null;
            public DoubleWritable[] tempRowArrayDoubleWritable=null;
            public double sum=0;
            public double uValue;
            public int leftMatrixRowNum;
            public int leftMatrixColumnNum;
            public void setup(Context context) throws IOException {
                Configuration conf=context.getConfiguration();
                leftMatrixRowNum=conf.getInt("leftMatrixRowNum",10);
                leftMatrixColumnNum=conf.getInt("leftMatrixColumnNum",10);
                leftMatrix=new double[leftMatrixRowNum][leftMatrixColumnNum];
                uValue=(double)(context.getConfiguration().getFloat("u",1.0f));
                tempRowArrayDoubleWritable=new DoubleWritable[leftMatrixColumnNum];
                initDoubleArrayWritable(leftMatrixColumnNum,tempRowArrayDoubleWritable);
                tempColumnArrayDoubleWritable=new DoubleWritable[leftMatrixRowNum];
                initDoubleArrayWritable(leftMatrixRowNum,tempColumnArrayDoubleWritable);
                System.out.println("map setup() start!");
                // Load the left matrix from the SequenceFile that was placed in the distributed cache.
                //URI[] cacheFiles=DistributedCache.getCacheFiles(context.getConfiguration());
                Path[] cacheFiles=DistributedCache.getLocalCacheFiles(conf);
                String localCacheFile="file://"+cacheFiles[0].toString();
                //URI[] cacheFiles=DistributedCache.getCacheFiles(conf);
                //DistributedCache.
                System.out.println("local path is:"+cacheFiles[0].toString());
                // URI[] cacheFiles=DistributedCache.getCacheFiles(context.getConfiguration());
                FileSystem fs =FileSystem.get(URI.create(localCacheFile), conf);
                SequenceFile.Reader reader=null;
                reader=new SequenceFile.Reader(fs,new Path(localCacheFile),conf);
                IntWritable key= (IntWritable)ReflectionUtils.newInstance(reader.getKeyClass(),conf);
                DoubleArrayWritable value= (DoubleArrayWritable)ReflectionUtils.newInstance(reader.getValueClass(),conf);
                //int valueLength=0;
                int rowIndex=0;
                int index;
                while (reader.next(key,value)){
                    index=-1;
                    for (Writable val:value.get()){ // ArrayWritable.get() returns the backing Writable[] array
                        tempRowArrayDoubleWritable[++index].set(((DoubleWritable)val).get());
                    }
                    //obValue=value.toArray();
                    rowIndex=key.get();
                    leftMatrix[rowIndex]=new double[leftMatrixColumnNum];
                    //this.leftMatrix=new double[valueLength][Integer.parseInt(context.getConfiguration().get("leftMatrixColumnNum"))];
                    for (int i=0;i<leftMatrixColumnNum;++i){
                        //leftMatrix[rowIndex][i]=Double.parseDouble(Array.get(obValue, i).toString());
                        //leftMatrix[rowIndex][i]=Array.getDouble(obValue, i);
                        leftMatrix[rowIndex][i]= tempRowArrayDoubleWritable[i].get();
                    }

                }
                arraySum=new DoubleWritable[leftMatrix.length];
                initDoubleArrayWritable(leftMatrix.length,arraySum);
            }
            public void map(IntWritable key,DoubleArrayWritable value,Context context) throws IOException, InterruptedException {
                //obValue=value.toArray();
                InputSplit inputSplit=context.getInputSplit();
                String fileName=((FileSplit)inputSplit).getPath().getName();
                // Records from files whose names start with "FB" are passed through unchanged;
                // any other record is treated as a column of the right matrix and multiplied by the cached left matrix.
                if (fileName.startsWith("FB")) {
                    context.write(key,value);
                }
                else{
                    int ii=-1;
                    for(Writable val:value.get()){
                        tempColumnArrayDoubleWritable[++ii].set(((DoubleWritable)val).get());
                    }
                    //arraySum=new DoubleWritable[this.leftMatrix.length];
                    for (int i=0;i<this.leftMatrix.length;++i){
                        sum=0;
                        for (int j=0;j<this.leftMatrix[0].length;++j){
                            //sum+= this.leftMatrix[i][j]*Double.parseDouble(Array.get(obValue,j).toString())*(double)(context.getConfiguration().getFloat("u",1f));
                            //sum+= this.leftMatrix[i][j]*Array.getDouble(obValue,j)*uValue;
                            sum+= this.leftMatrix[i][j]*tempColumnArrayDoubleWritable[j].get()*uValue;
                        }
                        arraySum[i].set(sum);
                        //arraySum[i].set(sum);
                    }
                    map_value.set(arraySum);
                    context.write(key,map_value);
                }
            }
        }
        public static class MyReducer extends Reducer<IntWritable,DoubleArrayWritable,IntWritable,DoubleArrayWritable>{
            public DoubleWritable[] sum=null;
            // public Object obValue=null;
            public DoubleArrayWritable valueArrayWritable=new DoubleArrayWritable();
            public DoubleWritable[] tempColumnArrayDoubleWritable=null;
            private int leftMatrixRowNum;

            public void setup(Context context){
                //leftMatrixColumnNum=context.getConfiguration().getInt("leftMatrixColumnNum",100);
                leftMatrixRowNum=context.getConfiguration().getInt("leftMatrixRowNum",100);
                sum=new DoubleWritable[leftMatrixRowNum];
                initDoubleArrayWritable(leftMatrixRowNum,sum);
                //tempRowArrayDoubleWritable=new DoubleWritable[leftMatrixColumnNum];
                tempColumnArrayDoubleWritable=new DoubleWritable[leftMatrixRowNum];
                initDoubleArrayWritable(leftMatrixRowNum,tempColumnArrayDoubleWritable);
            }
            // If the matrix product is already computed in the map phase, a reduce phase seems unnecessary; even without a custom
            // reducer class the MR framework still runs a default (identity) reducer that does nothing extra.
            // However, without a reduce phase each map writes its own output, so there would be as many result files as maps.
            // The reducer here exists only so the result matrix ends up in a single file.
            public void reduce(IntWritable key,Iterable<DoubleArrayWritable>value,Context context) throws IOException, InterruptedException {
                //int valueLength=0;
                for(DoubleArrayWritable doubleValue:value){
                    int index=-1;
                    for (Writable val:doubleValue.get()){
                        tempColumnArrayDoubleWritable[++index].set(((DoubleWritable)val).get());
                    }
                    //valueLength=Array.getLength(obValue);
                    /*
                    for (int i=0;i<leftMatrixRowNum;++i){
                        //sum[i]=new DoubleWritable(Double.parseDouble(Array.get(obValue,i).toString())+sum[i].get());
                        //sum[i]=new DoubleWritable(Array.getDouble(obValue,i)+sum[i].get());
                        sum[i].set(tempColumnArrayDoubleWritable[i].get()+sum[i].get());
                    }
                    */
                }
                //valueArrayWritable.set(sum);
                valueArrayWritable.set(tempColumnArrayDoubleWritable);
                context.write(key,valueArrayWritable);
                /*
                for (int i=0;i<sum.length;++i){
                    sum[i].set(0.0);
                }
                */

            }
        }

        public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
            String uri=args[3];
            String outUri=args[4];
            String cachePath=args[2];
            HDFSOperator.deleteDir(outUri);
            Configuration conf=new Configuration();
            DistributedCache.addCacheFile(URI.create(cachePath),conf); // put the left matrix into the distributed cache
            /**************************************************/
            //FileSystem fs=FileSystem.get(URI.create(uri),conf);
            //fs.delete(new Path(outUri),true);
            /*********************************************************/
            conf.setInt("leftMatrixColumnNum",Integer.parseInt(args[0]));
            conf.setInt("leftMatrixRowNum",Integer.parseInt(args[1]));
            conf.setFloat("u",1.0f);
            //conf.set("mapred.map.tasks",args[5]);
            //int mxSplitSize=Integer.valueOf(args[5])
            conf.set("mapred.max.split.size",args[5]); // hadoop 1.2.1 has no setNumMapTasks(), so the map count can only be influenced by controlling the split size like this
            conf.set("mapred.jar","MutiDoubleInputMatrixProduct.jar");
            Job job=new Job(conf,"MatrixProdcut");
            job.setJarByClass(MutiDoubleInputMatrixProduct.class);
            job.setInputFormatClass(SequenceFileInputFormat.class);
            job.setOutputFormatClass(SequenceFileOutputFormat.class);
            job.setMapperClass(MyMapper.class);
            job.setReducerClass(MyReducer.class);
            job.setMapOutputKeyClass(IntWritable.class);
            job.setMapOutputValueClass(DoubleArrayWritable.class);
            job.setOutputKeyClass(IntWritable.class);
            job.setOutputValueClass(DoubleArrayWritable.class);
            FileInputFormat.setInputPaths(job, new Path(uri));
            FileOutputFormat.setOutputPath(job,new Path(outUri));
            System.exit(job.waitForCompletion(true)?0:1);
        }


    }
    class DoubleArrayWritable extends ArrayWritable {
        public DoubleArrayWritable(){
            super(DoubleWritable.class);
        }
    /*
        public String toString(){
            StringBuilder sb=new StringBuilder();
            for (Writable val:get()){
                DoubleWritable doubleWritable=(DoubleWritable)val;
                sb.append(doubleWritable.get());
                sb.append(",");
            }
            sb.deleteCharAt(sb.length()-1);
            return sb.toString();
        }
    */
    }

    class HDFSOperator{
        public static boolean deleteDir(String dir)throws IOException{
            Configuration conf=new Configuration();
            FileSystem fs =FileSystem.get(conf);
            boolean result=fs.delete(new Path(dir),true);
            System.out.println("sOutput delete");
            fs.close();
            return result;
        }
    }
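    The job is submitted with the usual hadoop jar command. The argument order below mirrors main() above (left-matrix column count, left-matrix row count, cache file holding the left matrix, input path, output path, max split size in bytes); every path and number is a placeholder, not taken from the original run:

    hadoop jar MutiDoubleInputMatrixProduct.jar MutiDoubleInputMatrixProduct \
        1000 1000 /user/hadoop/cache/leftMatrix.seq /user/hadoop/input /user/hadoop/output 2000000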

    4) Next, how to set breakpoints and debug the Hadoop source code, using the split-computation code as the example.

    a. First, locate the FileInputFormat class, which lives in hadoop-core-1.2.1.jar; add this jar to the project as a dependency, as shown below:

    Although these are compiled class files (bytecode), breakpoints work in them just as they do in Java source. Here we set two breakpoints, one in getSplits() and one in computeSplitSize(), and then run the MapReduce program locally from IDEA in Debug mode. The result is shown below:

    The breakpoints are hit, and the relevant variable values can be inspected.
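    For the breakpoints to fire, the job has to run inside the IDE's own JVM rather than be submitted to a cluster. A minimal sketch of the extra configuration for such a local debug run (Hadoop 1.x property names; this is an assumption about the setup, not something from the original post):

    // Run map and reduce in-process with the LocalJobRunner instead of submitting to a cluster.
    conf.set("mapred.job.tracker", "local");
    // Read and write the local filesystem rather than HDFS.
    conf.set("fs.default.name", "file:///");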

  • Original post: https://www.cnblogs.com/lz3018/p/5373821.html