• Hadoop SequenceFile


    Hadoop's SequenceFile can be used to address the small-file problem (a "small file" loosely means any file smaller than an HDFS block). SequenceFile is a binary file format provided by the Hadoop API: it serializes <key, value> pairs directly into a file. Small files are typically merged by using each file's name as the key and its contents as the value, serialized together into one large file.
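    As a sketch of that merging idea (the class name, input directory, and output path below are made up for illustration), the small files under one HDFS directory can be packed into a single SequenceFile, file name as key and raw bytes as value:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.SequenceFile.Writer;
    import org.apache.hadoop.io.Text;

    public class SmallFilePacker {
    	public static void main(String[] args) throws IOException {
    		Configuration conf = new Configuration();
    		FileSystem fs = FileSystem.get(conf);
    		Path srcDir = new Path("/test/small-files"); // hypothetical input directory
    		Path seqPath = new Path("/test/packed.seq"); // hypothetical output file
    		Writer writer = null;
    		try {
    			writer = SequenceFile.createWriter(conf, Writer.file(seqPath),
    					Writer.keyClass(Text.class), Writer.valueClass(BytesWritable.class));
    			for (FileStatus status : fs.listStatus(srcDir)) {
    				// slurp the whole small file; fine precisely because it is small
    				byte[] content = new byte[(int) status.getLen()];
    				FSDataInputStream in = fs.open(status.getPath());
    				try {
    					IOUtils.readFully(in, content, 0, content.length);
    				} finally {
    					IOUtils.closeStream(in);
    				}
    				// file name -> key, file bytes -> value
    				writer.append(new Text(status.getPath().getName()), new BytesWritable(content));
    			}
    		} finally {
    			IOUtils.closeStream(writer);
    		}
    	}
    }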


    Hadoop Archive is another file-archive format that efficiently packs small files into HDFS blocks; see the Hadoop Archive documentation for details.
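    (For the archive route, the command line is hadoop archive -archiveName files.har -p /test/small-files /test/har; the paths here are hypothetical.)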


    However, a SequenceFile cannot be appended to; it is suited to writing a large batch of small files in one pass.

    SequenceFile compression is controlled by CompressionType; see the source:

    /**
     * The compression type used to compress key/value pairs in the
     * {@link SequenceFile}.
     * @see SequenceFile.Writer
     */
    public static enum CompressionType {
        /** Do not compress records. */
        NONE,    // no compression
        /** Compress values only, each separately. */
        RECORD,  // compress only the values, one record at a time
        /** Compress sequences of records together in blocks. */
        BLOCK    // compress many records' key/value pairs together in blocks
    }
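
    For illustration, the compression type is chosen when the writer is created, optionally together with a codec. A minimal sketch, assuming the imports of the demo below plus org.apache.hadoop.io.compress.DefaultCodec, and a hypothetical output path:

    	Writer w = SequenceFile.createWriter(conf,
    			Writer.file(new Path("/test/record-compressed.seq")),
    			Writer.keyClass(IntWritable.class),
    			Writer.valueClass(Text.class),
    			// RECORD: each value is compressed separately with the given codec
    			Writer.compression(CompressionType.RECORD, new DefaultCodec()));
    	w.close();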

    A SequenceFile read/write example:
    import java.io.IOException;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.SequenceFile.CompressionType;
    import org.apache.hadoop.io.SequenceFile.Reader;
    import org.apache.hadoop.io.SequenceFile.Writer;
    import org.apache.hadoop.io.Text;
    
    /**
     * @version 1.0
     * @author Fish
     */
    public class SequenceFileWriteDemo {
    	private static final String[] DATA = { "fish1", "fish2", "fish3", "fish4" };
    
    	public static void main(String[] args) throws IOException {
    		/**
    		 * Write the SequenceFile
    		 */
    		String uri = "/test/fish/seq.txt";
    		Configuration conf = new Configuration();
    		Path path = new Path(uri);
    		IntWritable key = new IntWritable();
    		Text value = new Text();
    		Writer writer = null;
    		try {
    			/**
    			 * CompressionType.NONE   - no compression<br>
    			 * CompressionType.RECORD - compress only the values<br>
    			 * CompressionType.BLOCK  - compress many records' key/value pairs in blocks
    			 */
    			writer = SequenceFile.createWriter(conf, Writer.file(path), Writer.keyClass(key.getClass()),
    					Writer.valueClass(value.getClass()), Writer.compression(CompressionType.BLOCK));
    
    			for (int i = 0; i < 4; i++) {
    				value.set(DATA[i]);
    				key.set(i);
    				System.out.printf("[%s]\t%s\t%s%n", writer.getLength(), key, value);
    				writer.append(key, value);
    
    			}
    		} finally {
    			IOUtils.closeStream(writer);
    		}
    
    		/**
    		 * Read the SequenceFile back
    		 */
    		SequenceFile.Reader reader = new SequenceFile.Reader(conf, Reader.file(path));
    		IntWritable key1 = new IntWritable();
    		Text value1 = new Text();
    		while (reader.next(key1, value1)) {
    			System.out.println(key1 + "----" + value1);
    		}
    		IOUtils.closeStream(reader); // close the reader stream
    		
    		/**
    		 * For sorting
    		 */
    //		SequenceFile.Sorter sorter = new SequenceFile.Sorter(fs, comparator, IntWritable.class, Text.class, conf);
    	}
    }
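
    The resulting file can also be dumped from the command line with hadoop fs -text /test/fish/seq.txt, which recognizes the SequenceFile format and prints the key/value pairs as text.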

    Running the program above multiple times never appends data: each run recreates the file, which always ends up with exactly four records. The reason is visible in the SequenceFile.Writer constructor:
    out = fs.create(p, true, bufferSize, replication, blockSize, progress);

    The second parameter is true, meaning an existing file of the same name is overwritten each time; if it were false, an exception would be thrown.

    This design is probably tied to HDFS's write-once, read-many model: appending to existing files is discouraged, so the constructor hard-codes true.
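
    This is easy to confirm against FileSystem directly (a sketch reusing conf and path from the demo, and assuming an extra import of org.apache.hadoop.fs.FileSystem):

    		FileSystem fs = FileSystem.get(conf);
    		// overwrite = true: silently replaces the existing file, as the Writer does
    		fs.create(path, true).close();
    		// overwrite = false: would throw an IOException here, since the file exists
    		// fs.create(path, false);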


    A SequenceFile's on-disk data is organized as a Header followed by records, with sync markers interleaved:



    I. Header


    The source that writes the file header:

        /** Write and flush the file header. */
        private void writeFileHeader() 
          throws IOException {
      out.write(VERSION);                        // version bytes
      Text.writeString(out, keyClass.getName()); // key class name
      Text.writeString(out, valClass.getName()); // value class name

      out.writeBoolean(this.isCompressed());      // whether compressed at all
      out.writeBoolean(this.isBlockCompressed()); // whether CompressionType.BLOCK compression
      
      if (this.isCompressed()) {
        Text.writeString(out, (codec.getClass()).getName()); // codec class name
      }
      this.metadata.write(out);                  // write the metadata
          out.write(sync);                       // write the sync bytes
          out.flush();                           // flush header
        }
    The version bytes:
      private static byte[] VERSION = new byte[] {
        (byte)'S', (byte)'E', (byte)'Q', VERSION_WITH_METADATA
      };
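
    These four bytes are easy to verify by reading the start of any SequenceFile. A sketch, reusing conf from the demo and assuming extra imports of org.apache.hadoop.fs.FileSystem and org.apache.hadoop.fs.FSDataInputStream:

    		// the first three bytes spell "SEQ"; the fourth is the format version
    		FSDataInputStream in = FileSystem.get(conf).open(new Path("/test/fish/seq.txt"));
    		byte[] magic = new byte[4];
    		in.readFully(magic);
    		System.out.printf("%c%c%c version=%d%n", magic[0], magic[1], magic[2], magic[3]);
    		in.close();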

    How the 16-byte sync marker is generated:
        byte[] sync;                          // 16 random bytes
        {
          try {                                       
            MessageDigest digester = MessageDigest.getInstance("MD5");
            long time = Time.now();
            digester.update((new UID()+"@"+time).getBytes());
            sync = digester.digest();
          } catch (Exception e) {
            throw new RuntimeException(e);
          }
        }
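
    The sync marker is what lets a reader start at an arbitrary byte offset (for example, at a MapReduce input-split boundary) and realign on the next record. A sketch against the demo's file:

    		// jump into the middle of the file, then realign on the next sync point
    		// (on the tiny demo file the seek simply lands at end-of-file, since sync
    		// points are roughly 2 KB apart; the pattern matters for large files)
    		SequenceFile.Reader r = new SequenceFile.Reader(conf, Reader.file(path));
    		r.sync(50); // seek to the first sync marker after byte 50
    		IntWritable k = new IntWritable();
    		Text v = new Text();
    		while (r.next(k, v)) { // reading resumes on a clean record boundary
    			System.out.println(k + "----" + v);
    		}
    		r.close();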
    II. Record

    Writer has three implementation classes, one for each CompressionType: NONE, RECORD, and BLOCK. Each is described below (compare with the layout described above):

    1. NONE (the plain Writer)

    An uncompressed record is stored directly as: the record length, the key length, the key bytes, and the value bytes.
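
    A sketch that makes this layout visible by watching the writer's position grow per record (hypothetical output path; getLength() is the same call the demo prints):

    		// per-record growth = 4 (record length) + 4 (key length)
    		// + key bytes + value bytes, plus an occasional sync block
    		Writer w = SequenceFile.createWriter(conf,
    				Writer.file(new Path("/test/none.seq")),
    				Writer.keyClass(IntWritable.class),
    				Writer.valueClass(Text.class),
    				Writer.compression(CompressionType.NONE));
    		long prev = w.getLength();
    		for (int i = 0; i < 4; i++) {
    			w.append(new IntWritable(i), new Text("fish" + i));
    			System.out.println("record " + i + " took " + (w.getLength() - prev) + " bytes");
    			prev = w.getLength();
    		}
    		w.close();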

    2. BlockCompressWriter

    /** Append a key/value pair. */
        @Override
        @SuppressWarnings("unchecked")
        public synchronized void append(Object key, Object val)
          throws IOException {
          if (key.getClass() != keyClass)
            throw new IOException("wrong key class: "+key+" is not "+keyClass);
          if (val.getClass() != valClass)
            throw new IOException("wrong value class: "+val+" is not "+valClass);
    
          // Save key/value into respective buffers 
          int oldKeyLength = keyBuffer.getLength();
          keySerializer.serialize(key);
          int keyLength = keyBuffer.getLength() - oldKeyLength;
          if (keyLength < 0)
            throw new IOException("negative length keys not allowed: " + key);
          WritableUtils.writeVInt(keyLenBuffer, keyLength); // each call appends another key length to the buffer
    
          int oldValLength = valBuffer.getLength();
          uncompressedValSerializer.serialize(val);
          int valLength = valBuffer.getLength() - oldValLength;
          WritableUtils.writeVInt(valLenBuffer, valLength); // each call appends another value length to the buffer
          // Added another key/value pair
          ++noBufferedRecords;
          
          // Compress and flush?
          int currentBlockSize = keyBuffer.getLength() + valBuffer.getLength();
          if (currentBlockSize >= compressionBlockSize) {
            // compressionBlockSize = conf.getInt("io.seqfile.compress.blocksize", 1000000)
            // once the buffered data passes this threshold, flush a block via sync()
            sync();
          }
        }

    Once the buffered size exceeds compressionBlockSize, sync() is called. Its source follows; it writes out each of the fields of a compressed block (record count, key lengths, keys, value lengths, values):

    /** Compress and flush contents to dfs */
        @Override
        public synchronized void sync() throws IOException {
          if (noBufferedRecords > 0) {
            super.sync();
            
            // No. of records
            WritableUtils.writeVInt(out, noBufferedRecords);
            
            // Write 'keys' and lengths
            writeBuffer(keyLenBuffer);
            writeBuffer(keyBuffer);
            
            // Write 'values' and lengths
            writeBuffer(valLenBuffer);
            writeBuffer(valBuffer);
            
            // Flush the file-stream
            out.flush();
            
            // Reset internal states
            keyLenBuffer.reset();
            keyBuffer.reset();
            valLenBuffer.reset();
            valBuffer.reset();
            noBufferedRecords = 0;
          }
          
        }
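
    The flush threshold is configurable. A sketch of raising it before the writer is created (the configuration key is the one read in the source above):

    		// buffer roughly 4 MB of key/value data per compressed block
    		// instead of the ~1 MB default
    		conf.setInt("io.seqfile.compress.blocksize", 4 * 1024 * 1024);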


    3. RecordCompressWriter

    /** Append a key/value pair. */
        @Override
        @SuppressWarnings("unchecked")
        public synchronized void append(Object key, Object val)
          throws IOException {
          if (key.getClass() != keyClass)
            throw new IOException("wrong key class: "+key.getClass().getName()
                                  +" is not "+keyClass);
          if (val.getClass() != valClass)
            throw new IOException("wrong value class: "+val.getClass().getName()
                                  +" is not "+valClass);
    
          buffer.reset();
    
          // Append the 'key'
          keySerializer.serialize(key);
          int keyLength = buffer.getLength();
          if (keyLength < 0)
            throw new IOException("negative length keys not allowed: " + key);
    
          // Compress 'value' and append it
          deflateFilter.resetState();
          compressedValSerializer.serialize(val);
          deflateOut.flush();
          deflateFilter.finish();
    
          // Write the record out
          checkAndWriteSync();                                // sync
      out.writeInt(buffer.getLength());                   // total record length
      out.writeInt(keyLength);                            // key portion length
      out.write(buffer.getData(), 0, buffer.getLength()); // record data
        }
    Writing the sync marker:
    synchronized void checkAndWriteSync() throws IOException {
          if (sync != null &&
              out.getPos() >= lastSyncPos+SYNC_INTERVAL) { // time to emit sync
            sync();
          }
        }

    The definition of SYNC_INTERVAL:
      private static final int SYNC_ESCAPE = -1;      // "length" of sync entries
      private static final int SYNC_HASH_SIZE = 16;   // number of bytes in hash 
      private static final int SYNC_SIZE = 4+SYNC_HASH_SIZE; // escape + hash
    
      /** The number of bytes between sync points.*/
      public static final int SYNC_INTERVAL = 100*SYNC_SIZE; 
    That is, SYNC_SIZE = 4 + 16 = 20 bytes, so a sync marker is emitted after roughly every 100 × 20 = 2000 bytes of output.



    Summary:

    Record: the generic key/value data format of a SequenceFile. Both key and value are variable-length binary data.

    The record length field is the total number of bytes of the key plus the value.

    Sync: used for scanning and recovering data, so that a Reader can realign itself mid-stream instead of losing its place.

    Header: stores the file identifier SEQ, the key and value class names, compression-related information, metadata, and so on.

    In other words, the header carries everything needed to interpret the file: the file identifier, the sync marker, the data format description (including compression), file metadata (such as time, owner, and permissions), and verification information.
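
    Most of these header fields can be inspected through SequenceFile.Reader accessors. A sketch, reusing conf and path from the demo:

    		SequenceFile.Reader r = new SequenceFile.Reader(conf, Reader.file(path));
    		System.out.println("key class:        " + r.getKeyClassName());
    		System.out.println("value class:      " + r.getValueClassName());
    		System.out.println("compressed:       " + r.isCompressed());
    		System.out.println("block compressed: " + r.isBlockCompressed());
    		if (r.isCompressed()) {
    			System.out.println("codec: " + r.getCompressionCodec().getClass().getName());
    		}
    		System.out.println("metadata: " + r.getMetadata());
    		r.close();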

  • Original article: https://www.cnblogs.com/mfmdaoyou/p/6807964.html