• Compressing and Decompressing Files in HDFS


      File compression has two major benefits: (1) it reduces the disk space needed to store files, and (2) it speeds up data transfer over the network and to and from disk. Both benefits matter a great deal when working with large data sets.

      Below is an example that uses the gzip codec to compress a file: /user/hadoop/aa.txt is compressed to /user/hadoop/text.gz.

     1 package com.hdfs;
     2 
     3 import java.io.IOException;
     4 import java.io.InputStream;
     5 import java.io.OutputStream;
     6 import java.net.URI;
     7 
     8 import org.apache.hadoop.conf.Configuration;
     9 import org.apache.hadoop.fs.FSDataInputStream;
    10 import org.apache.hadoop.fs.FSDataOutputStream;
    11 import org.apache.hadoop.fs.FileSystem;
    12 import org.apache.hadoop.fs.Path;
    13 import org.apache.hadoop.io.IOUtils;
    14 import org.apache.hadoop.io.compress.CompressionCodec;
    15 import org.apache.hadoop.io.compress.CompressionCodecFactory;
    16 import org.apache.hadoop.io.compress.CompressionInputStream;
    17 import org.apache.hadoop.io.compress.CompressionOutputStream;
    18 import org.apache.hadoop.util.ReflectionUtils;
    19 
    20 public class CodecTest {
    21     // Compress a file with the codec class named by codecClassName
    22     public static void compress(String codecClassName) throws Exception{
    23         Class<?> codecClass = Class.forName(codecClassName);
    24         Configuration conf = new Configuration();
    25         FileSystem fs = FileSystem.get(conf);
    26         CompressionCodec codec = (CompressionCodec)ReflectionUtils.newInstance(codecClass, conf);
    27         // Output path for the compressed file (hard-coded)
    28         FSDataOutputStream outputStream = fs.create(new Path("/user/hadoop/text.gz"));
    29         // Input path of the file to be compressed
    30         FSDataInputStream in = fs.open(new Path("/user/hadoop/aa.txt"));
    31         // Wrap the raw output stream in a compression output stream
    32         CompressionOutputStream out = codec.createOutputStream(outputStream);  
    33         IOUtils.copyBytes(in, out, conf); 
    34         IOUtils.closeStream(in);
    35         IOUtils.closeStream(out);
    36     }
    37     
    38     // Decompress /user/hadoop/text.gz with GzipCodec (the fileName argument is unused)
    39     public static void uncompress(String fileName) throws Exception{
    40         Class<?> codecClass = Class.forName("org.apache.hadoop.io.compress.GzipCodec");
    41         Configuration conf = new Configuration();
    42         FileSystem fs = FileSystem.get(conf);
    43         CompressionCodec codec = (CompressionCodec)ReflectionUtils.newInstance(codecClass, conf);
    44         FSDataInputStream inputStream = fs.open(new Path("/user/hadoop/text.gz"));
    45         // Decompress the data and write it to the console
    46         InputStream in = codec.createInputStream(inputStream);  
    47         IOUtils.copyBytes(in, System.out, conf);
    48         IOUtils.closeStream(in);
    49     }
    50     
    51     // Infer the codec from the file extension and use it to decompress the file
    52     public static void uncompress1(String uri) throws IOException{
    53         Configuration conf = new Configuration();
    54         FileSystem fs = FileSystem.get(URI.create(uri), conf);
    55         
    56         Path inputPath = new Path(uri);
    57         CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    58         CompressionCodec codec = factory.getCodec(inputPath);
    59         if(codec == null){
    60             System.out.println("no codec found for " + uri);
    61             System.exit(1);
    62         }
    63         String outputUri = CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());
    64         InputStream in = null;
    65         OutputStream out = null;
    66         try {
    67             in = codec.createInputStream(fs.open(inputPath));
    68             out = fs.create(new Path(outputUri));
    69             IOUtils.copyBytes(in, out, conf);
    70         } finally{
    71             IOUtils.closeStream(out);
    72             IOUtils.closeStream(in);
    73         }
    74     }
    75     
    76     public static void main(String[] args) throws Exception {
    77         //compress("org.apache.hadoop.io.compress.GzipCodec");
    78         //uncompress("text");
    79         uncompress1("hdfs://master:9000/user/hadoop/text.gz");
    80     }
    81 
    82 }

      First run line 77 to compress the file. Then run line 78 to decompress it; since the result goes to standard output, you will see the content of /user/hadoop/aa.txt printed on the console. Running line 79 instead decompresses the file to /user/hadoop/text: the codec is chosen based on the .gz extension of /user/hadoop/text.gz, and the output path is simply the input path with that extension removed.
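
      As a quick illustration of that extension-based lookup, here is a minimal, hypothetical sketch (the class name and the /tmp paths are made up for the demo) showing how CompressionCodecFactory maps a file name to a codec and strips the suffix:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class CodecLookupDemo {
        public static void main(String[] args) {
            CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
            // Only the extension matters for the lookup; the paths need not exist.
            System.out.println(factory.getCodec(new Path("/tmp/a.gz")));   // a GzipCodec instance
            System.out.println(factory.getCodec(new Path("/tmp/a.bz2")));  // a BZip2Codec instance
            System.out.println(factory.getCodec(new Path("/tmp/a.txt")));  // null: no codec for .txt
            // removeSuffix yields the output path that uncompress1 above writes to
            System.out.println(CompressionCodecFactory.removeSuffix("/user/hadoop/text.gz", ".gz"));
        }
    }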

      After compressing the file, run the command ./hadoop fs -ls /user/hadoop/ to list the files:

    1 [hadoop@master bin]$ ./hadoop fs -ls /user/hadoop/
    2 Found 7 items
    3 -rw-r--r--   3 hadoop supergroup   76805248 2013-06-17 23:55 /user/hadoop/aa.mp4
    4 -rw-r--r--   3 hadoop supergroup        520 2013-06-17 22:29 /user/hadoop/aa.txt
    5 drwxr-xr-x   - hadoop supergroup          0 2013-06-16 17:19 /user/hadoop/input
    6 drwxr-xr-x   - hadoop supergroup          0 2013-06-16 19:32 /user/hadoop/output
    7 drwxr-xr-x   - hadoop supergroup          0 2013-06-18 17:08 /user/hadoop/test
    8 drwxr-xr-x   - hadoop supergroup          0 2013-06-18 19:45 /user/hadoop/test1
    9 -rw-r--r--   3 hadoop supergroup         46 2013-06-19 20:09 /user/hadoop/text.gz

    Line 4 is the file before compression, 520 bytes; line 9 is the compressed file, only 46 bytes. This shows the two benefits of compression described above: far less storage used, and correspondingly less data to move.
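
      If you prefer to verify the sizes programmatically rather than via hadoop fs -ls, a small sketch along these lines would work (the class name is hypothetical; the paths are the same as above):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SizeCheck {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Compare the original file with its compressed counterpart.
            long original   = fs.getFileStatus(new Path("/user/hadoop/aa.txt")).getLen();
            long compressed = fs.getFileStatus(new Path("/user/hadoop/text.gz")).getLen();
            System.out.println("aa.txt  : " + original   + " bytes");
            System.out.println("text.gz : " + compressed + " bytes");
        }
    }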

  • Original post: https://www.cnblogs.com/liuling/p/2013-6-19-01.html