• Pig Installation and Usage



     

    1.  References

    Reference documentation:

    http://pig.apache.org/docs/r0.17.0/start.html#build

     

     

    2.  Installation Environment

    2.1.  Environment

    A pseudo-distributed cluster on CentOS 7.4 with Hadoop 2.7.5

     

    Hostname:           centoshadoop.smartmap.com
    NameNode:           192.168.1.80
    SecondaryNameNode:  192.168.1.80
    DataNodes:          192.168.1.80

 

    Hadoop installation directory: /opt/hadoop/hadoop-2.7.5

     

    3.  Installation

     

     

    3.1.  Download Pig

     

    http://pig.apache.org/releases.html#Download

     


     


     

    3.2.  Extract Pig

    Extract the downloaded pig-0.17.0.tar.gz to /opt/hadoop/pig-0.17.0
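The extraction step can be sketched as follows. To keep the demo self-contained it builds a stand-in archive under /tmp with a hypothetical layout; for the real install, extract the tarball downloaded from the Apache mirror into /opt/hadoop instead:

```shell
# Build a stand-in archive mimicking the layout of pig-0.17.0.tar.gz
# (hypothetical contents, for demonstration only)
mkdir -p /tmp/demo/pig-0.17.0/bin
touch /tmp/demo/pig-0.17.0/bin/pig
tar -C /tmp/demo -czf /tmp/pig-0.17.0.tar.gz pig-0.17.0

# The actual step: extract the tarball into the target directory
# (for the real install the target is /opt/hadoop)
mkdir -p /tmp/opt/hadoop
tar -zxf /tmp/pig-0.17.0.tar.gz -C /tmp/opt/hadoop
ls /tmp/opt/hadoop/pig-0.17.0/bin
```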

     

    4.  Configuration

     

    4.1.  Edit the profile file

    vi /etc/profile

    export PIG_HOME=/opt/hadoop/pig-0.17.0

    export PATH=$PATH:$PIG_HOME/bin

    After editing, run source /etc/profile so the changes take effect in the current shell.

     

    4.2.  Upgrade the JDK to 1.8

    Switch the JDK to version 1.8, and update every JAVA_HOME-related variable accordingly

     

    4.3.  Edit the Pig configuration file

    vi /opt/hadoop/pig-0.17.0/conf/pig.properties

    exectype=mapreduce

    This makes Pig submit jobs to the Hadoop cluster rather than run them locally.
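For quick script testing without a running cluster, Pig also supports local mode; the alternative setting below (same pig.properties file) runs jobs against the local filesystem:

```
# Run Pig against the local filesystem instead of the cluster
exectype=local
```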

     

    4.4.  Edit mapred-site.xml to enable the JobHistory server

    vi /opt/hadoop/hadoop-2.7.5/etc/hadoop/mapred-site.xml

    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>192.168.1.80:10020</value>
    </property>
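Optionally, the JobHistory web UI address can be pinned in the same file. The property below is standard Hadoop configuration (19888 is the default web UI port); it is an optional addition, not part of the original post:

```
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>192.168.1.80:19888</value>
</property>
```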

     

    5.  Start Hadoop

     

    5.1.  Start HDFS and YARN

    cd /opt/hadoop/hadoop-2.7.5/sbin

    start-all.sh

    Note: start-all.sh is deprecated in Hadoop 2.x; running start-dfs.sh followed by start-yarn.sh is the preferred equivalent.

     

    5.2.  Start the JobHistory server

     

    cd /opt/hadoop/hadoop-2.7.5/sbin

     

    mr-jobhistory-daemon.sh start historyserver

     

    6.  Using Pig

     

    6.1.  Upload the sample file to HDFS

     

    hadoop fs -mkdir -p /input/ncdc/micro-tab

    hadoop fs -copyFromLocal sample.txt /input/ncdc/micro-tab/sample.txt
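The commands above assume a local file sample.txt. The later LOAD statement expects tab-separated (year, temperature, quality) records, so a minimal sample might look like this (illustrative values, not from the original post):

```
1950	0	1
1950	22	1
1949	111	1
1949	78	1
```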

     

    6.2.  Start the Pig interactive shell (Grunt)

     

    cd /opt/hadoop/pig-0.17.0/bin

     

    pig

     


    6.3.  Run a job

     

    grunt> records = load '/input/ncdc/micro-tab/sample.txt' as (year:chararray, temperature:int, quality:int);

     

    grunt> dump records;

     


     

    6.4.  Quit

    grunt> quit

     


     

    6.5.  Show the schema

     

    cd /opt/hadoop/pig-0.17.0/bin

     

    pig

     

     

    grunt> records = LOAD '/input/ncdc/micro-tab/sample.txt' as (year:chararray, temperature:int, quality:int);

     

    grunt> DUMP records;

     

    grunt> DESCRIBE records;

    records: {year: chararray,temperature: int,quality: int}

    grunt>

     

    6.6.  Filter the data

     

    grunt> filter_records = FILTER records BY temperature != 9999 AND quality IN (0, 1, 4, 5, 9);

    grunt> DUMP filter_records;

     


     

    6.7.  Group the records

     

    grunt> grouped_records = GROUP filter_records BY year;

    grunt> DUMP grouped_records;

    grunt> DESCRIBE grouped_records;

    grouped_records: {group: chararray,filter_records: {(year: chararray,temperature: int,quality: int)}}

    grunt>

     


     

     

    6.8.  Compute the maximum temperature

     

    grunt> max_temp = FOREACH grouped_records GENERATE group, MAX(filter_records.temperature);

    grunt> DUMP max_temp;
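As a cross-check, the whole pipeline (filter out bad readings, group by year, take the maximum temperature) can be replicated locally with awk on a small stand-in sample; the file path and data values below are illustrative:

```shell
# Stand-in sample: tab-separated year, temperature, quality
printf '1950\t0\t1\n1950\t22\t1\n1949\t111\t1\n1949\t78\t1\n1949\t9999\t1\n' > /tmp/sample.txt

# Filter out missing readings (9999) and bad quality codes, then
# keep the maximum temperature seen for each year
awk -F'\t' '$2 != 9999 && ($3==0 || $3==1 || $3==4 || $3==5 || $3==9) {
    if (!($1 in max) || $2 > max[$1]) max[$1] = $2
} END { for (y in max) print y, max[y] }' /tmp/sample.txt | sort > /tmp/max_temp.txt
cat /tmp/max_temp.txt
```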

     


     

    6.9.  View the execution process

     

    grunt> ILLUSTRATE max_temp;

     


     

  • Original article: https://www.cnblogs.com/gispathfinder/p/9065345.html