Hadoop 2.2.0 + Hive: Notes on Using LZO Compression


    Environment:

    CentOS 6.4, 64-bit

    Hadoop 2.2.0

    Sun JDK 1.7.0_45

    Hive 0.12.0

    Preparation:

    yum -y install  lzo-devel  zlib-devel  gcc autoconf automake libtool

    Let's get started!

    (1) Install LZO

    wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.06.tar.gz
    tar -zxvf lzo-2.06.tar.gz
    ./configure --enable-shared --prefix=/usr/local/hadoop/lzo/
    make && make test && make install

    After installation, copy everything under /usr/local/hadoop/lzo/lib/ to /usr/lib/ and /usr/lib64/:
    sudo cp /usr/local/hadoop/lzo/lib/* /usr/lib/
    sudo cp /usr/local/hadoop/lzo/lib/* /usr/lib64/
    Configure the environment variable (vim /etc/bashrc): export PATH=/usr/local/hadoop/lzo/:$PATH
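
    Before moving on, it is worth confirming that the shared library is actually visible to the dynamic linker. A minimal check, assuming the copies into /usr/lib/ and /usr/lib64/ above succeeded:

    # refresh the dynamic linker cache so the freshly copied liblzo2 is picked up
    sudo ldconfig
    # the library should now appear in the cache
    ldconfig -p | grep lzo2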

    (2) Install LZOP
    wget http://www.lzop.org/download/lzop-1.03.tar.gz
    tar -zxvf lzop-1.03.tar.gz

    export C_INCLUDE_PATH=/usr/local/hadoop/lzo/include/

    Note: if this variable is not set, configure fails with:
    configure: error: LZO header files not found. Please check your installation or set the environment variable `CPPFLAGS'.
    Next:

    ./configure --enable-shared --prefix=/usr/local/hadoop/lzop
    make && make install

    (3) Symlink lzop into /usr/bin/
    ln -s /usr/local/hadoop/lzop/bin/lzop /usr/bin/lzop

    (4) Test lzop
    lzop /home/hadoop/data/access_20131219.log

    If running lzop reports:

    lzop: error while loading shared libraries: liblzo2.so.2: cannot open shared object file: No such file or directory

    the fix is to add the environment variable: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64

    A compressed file with the .lzo suffix will be generated: /home/hadoop/data/access_20131219.log.lzo. Seeing it confirms the preceding steps were all done correctly.
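
    lzop can also verify and decompress the file it just produced; a quick round-trip check on the same path:

    # -t tests the integrity of the compressed file without extracting it
    lzop -t /home/hadoop/data/access_20131219.log.lzo
    # -d decompresses; lzop keeps the .lzo input by default
    # (add -f to overwrite if the original .log is still present)
    lzop -d /home/hadoop/data/access_20131219.log.lzo
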
    (5) Install Hadoop-LZO

    One more prerequisite: Maven plus SVN or Git must already be set up (I use SVN). I won't cover that here; if you cannot get those working, there is little point in continuing.

    I use https://github.com/twitter/hadoop-lzo here.

    Check out the code from https://github.com/twitter/hadoop-lzo/trunk with SVN, then change one section of pom.xml.

    From:

    <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <hadoop.current.version>2.1.0-beta</hadoop.current.version>
    <hadoop.old.version>1.0.4</hadoop.old.version>
    </properties>

    To:

    <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <hadoop.current.version>2.2.0</hadoop.current.version>
    <hadoop.old.version>1.0.4</hadoop.old.version>
    </properties>

    Then run, in order:

    mvn clean package -Dmaven.test.skip=true
    tar -cBf - -C target/native/Linux-amd64-64/lib . | tar -xBvf - -C /home/hadoop/hadoop-2.2.0/lib/native/
    cp target/hadoop-lzo-0.4.20-SNAPSHOT.jar /home/hadoop/hadoop-2.2.0/share/hadoop/common/

    Next, sync /home/hadoop/hadoop-2.2.0/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar and /home/hadoop/hadoop-2.2.0/lib/native/ to all the other Hadoop nodes, as sketched below. Note: make sure the user that runs Hadoop has execute permission on the libraries under /home/hadoop/hadoop-2.2.0/lib/native/.
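
    A small loop like the following can handle the distribution; slave1 and slave2 are placeholder hostnames for your actual nodes:

    # sanity check: the hadoop-lzo native library built in step (5)
    # should already be present locally
    ls /home/hadoop/hadoop-2.2.0/lib/native/ | grep gplcompression

    # push the jar and the native libraries to each node
    for node in slave1 slave2; do
        scp /home/hadoop/hadoop-2.2.0/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar \
            ${node}:/home/hadoop/hadoop-2.2.0/share/hadoop/common/
        scp -r /home/hadoop/hadoop-2.2.0/lib/native/ \
            ${node}:/home/hadoop/hadoop-2.2.0/lib/
    done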

    (6) Configure Hadoop

    Append the following to $HADOOP_HOME/etc/hadoop/hadoop-env.sh:

    export LD_LIBRARY_PATH=/usr/local/hadoop/lzo/lib

    Append the following to $HADOOP_HOME/etc/hadoop/core-site.xml:

    <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,
    org.apache.hadoop.io.compress.DefaultCodec,
    com.hadoop.compression.lzo.LzoCodec,
    com.hadoop.compression.lzo.LzopCodec,
    org.apache.hadoop.io.compress.BZip2Codec
    </value>
    </property>
    <property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>

    Append the following to $HADOOP_HOME/etc/hadoop/mapred-site.xml:

    <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
    </property>
    <property>
    <name>mapred.map.output.compression.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>
    <property>
    <name>mapred.child.env</name>
    <value>LD_LIBRARY_PATH=/usr/local/hadoop/lzo/lib</value>
    </property>
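
    After these files are in place on every node, restart the cluster. A quick smoke test for the codec configuration is to read an .lzo file back through the fs shell; hadoop fs -text picks the codec from io.compression.codecs based on the file extension (the paths here are just examples):

    # upload a compressed log and read it back;
    # -text should transparently decompress it via LzopCodec
    hadoop fs -mkdir -p /tmp/lzotest
    hadoop fs -put /home/hadoop/data/access_20131219.log.lzo /tmp/lzotest/
    hadoop fs -text /tmp/lzotest/access_20131219.log.lzo | head -3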

    (7) Try out LZO in Hive

    A: First, create the LZO-backed nginx log table logs_app_nginx

    CREATE TABLE logs_app_nginx (
    ip STRING,
    user STRING,
    time STRING,
    request STRING,
    status STRING,
    size STRING,
    rt STRING,
    referer STRING,
    agent STRING,
    forwarded STRING
    )
    PARTITIONED BY (
    date STRING,
    host STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ' '
    STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
    OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";

    B: Load the data

    LOAD DATA LOCAL INPATH '/home/hadoop/data/access_20131230_25.log.lzo' INTO TABLE logs_app_nginx PARTITION(date='20131229', host='25');

    The file /home/hadoop/data/access_20131219.log is formatted as follows:

    221.207.93.109  -       [23/Dec/2013:23:22:38 +0800]    "GET /ClientGetResourceDetail.action?id=318880&token=Ocm HTTP/1.1"   200     199     0.008   "xxx.com"        "Android4.1.2/LENOVO/Lenovo A706/ch_lenovo/80"   "-"

    Simply running lzop /home/hadoop/data/access_20131219.log produces the LZO-compressed file /home/hadoop/data/access_20131219.log.lzo.
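
    To confirm the load landed where expected, two quick checks; the warehouse path assumes the default /user/hive/warehouse location:

    # the new partition should be registered in the metastore
    hive -e "SHOW PARTITIONS logs_app_nginx;"
    # and the .lzo file should sit under the partition directory
    hadoop fs -ls /user/hive/warehouse/logs_app_nginx/date=20131229/host=25/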

    C: Index the LZO file

    $HADOOP_HOME/bin/hadoop jar /home/hadoop/hadoop-2.2.0/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.DistributedLzoIndexer /user/hive/warehouse/logs_app_nginx
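
    The index makes the LZO files splittable, so a large file can be processed by more than one mapper. DistributedLzoIndexer does the indexing as a MapReduce job, which pays off when there are many large files; for a single small file, hadoop-lzo also ships a local, in-process indexer, com.hadoop.compression.lzo.LzoIndexer, invoked the same way:

    # writes a .lzo.index file next to each .lzo file it finds
    $HADOOP_HOME/bin/hadoop jar /home/hadoop/hadoop-2.2.0/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar \
        com.hadoop.compression.lzo.LzoIndexer /user/hive/warehouse/logs_app_nginx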

    D: Run a MapReduce job through Hive

    set hive.exec.reducers.max=10;
    set mapred.reduce.tasks=10;
    select ip,rt from logs_app_nginx limit 10;

    If you see output like the following in the Hive console, everything is working!

    hive> set hive.exec.reducers.max=10;
    hive> set mapred.reduce.tasks=10;
    hive> select ip,rt from logs_app_nginx limit 10;
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    Starting Job = job_1388065803340_0009, Tracking URL = http://lrts216:8088/proxy/application_1388065803340_0009/
    Kill Command = /home/hadoop/hadoop-2.2.0/bin/hadoop job -kill job_1388065803340_0009
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
    2013-12-27 09:13:39,163 Stage-1 map = 0%, reduce = 0%
    2013-12-27 09:13:45,343 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.22 sec
    2013-12-27 09:13:46,369 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.22 sec
    MapReduce Total cumulative CPU time: 1 seconds 220 msec
    Ended Job = job_1388065803340_0009
    MapReduce Jobs Launched:
    Job 0: Map: 1 Cumulative CPU: 1.22 sec HDFS Read: 63570 HDFS Write: 315 SUCCESS
    Total MapReduce CPU Time Spent: 1 seconds 220 msec
    OK
    221.207.93.109 "XXX.com"
    Time taken: 17.498 seconds, Fetched: 10 row(s)
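
    Finally, Hive can write LZO output as well as read it. A minimal sketch, assuming the codec classes configured above; the /tmp output path is just an example:

    hive> set hive.exec.compress.output=true;
    hive> set mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
    hive> insert overwrite directory '/tmp/nginx_lzo_out' select ip, rt from logs_app_nginx;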
