• Pig getting-started examples


    The test data lives in: /home/hadoop/luogankun/workspace/sync_data/pig
    The data in person.txt is comma-separated

    1,zhangsan,112
    2,lisi,113
    3,wangwu,114
    4,zhaoliu,115

    The data in score.txt is tab-separated

    1       20
    2       30
    3       40
    5       50

    In its default MapReduce mode, Pig operates on files in HDFS, so the files must first be uploaded to HDFS

    cd /home/hadoop/luogankun/workspace/sync_data/pig
    hadoop fs -put person.txt input/pig/person.txt
    hadoop fs -put score.txt input/pig/score.txt
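
    To confirm the upload, you can list the target directory (a quick check):

    hadoop fs -ls input/pig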

    Load the files (from HDFS)

    a = load 'input/pig/person.txt' using PigStorage(',') as (id:int, name:chararray, age:int);
    b = load 'input/pig/score.txt' using PigStorage('\t') as (id:int, score:int);
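
    PigStorage defaults to the tab delimiter, so the using clause on the second load is optional; an equivalent sketch:

    b = load 'input/pig/score.txt' as (id:int, score:int);   -- PigStorage('\t') is the default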

    View the schema

    describe a
    a: {id: int,name: chararray,age: int}
    
    describe b
    b: {id: int,score: int}

    View the data

    dump a
    (1,zhangsan,112)
    (2,lisi,113)
    (3,wangwu,114)
    (4,zhaoliu,115)
    
    dump b
    (1,20)
    (2,30)
    (3,40)
    (5,50)

    dump triggers a MapReduce job to compute the relation.
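
    To inspect a relation without running the full job, Pig also provides explain (prints the logical, physical, and MapReduce plans) and illustrate (runs the pipeline on a small sample of the data):

    explain a;
    illustrate a;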

    Filtering rows

    Select the people in person whose id is less than 4

    aa = filter a by id < 4;
    
    dump aa;
    (1,zhangsan,112)
    (2,lisi,113)
    (3,wangwu,114)

    Equality comparison in Pig uses ==, e.g.: aa = filter a by id == 4;
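
    Conditions can be combined with and/or/not, and chararray fields support regular expressions via matches (the pattern must match the whole string); a small sketch with an illustrative alias:

    ab = filter a by id >= 2 and name matches 'w.*';
    dump ab;
    (3,wangwu,114)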

    Joining relations

    c = join a by id left, b by id;
    
    describe c
    c: {a::id: int,a::name: chararray,a::age: int,b::id: int,b::score: int}
    -- two colons separate the relation name from the field name; a single colon separates a field from its type
    
    dump c
    (1,zhangsan,112,1,20)
    (2,lisi,113,2,30)
    (3,wangwu,114,3,40)
    (4,zhaoliu,115,,)

    Because this is a left join, all four rows of a are kept (the unmatched id 5 from b is dropped), and the fourth row has no score.
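
    Besides left, Pig also supports right and full outer joins; as a sketch, a full outer join would additionally keep the unmatched id 5 from b (row order may vary):

    c2 = join a by id full outer, b by id;
    dump c2;
    (1,zhangsan,112,1,20)
    (2,lisi,113,2,30)
    (3,wangwu,114,3,40)
    (4,zhaoliu,115,,)
    (,,,5,50)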

    Iterating over the data with foreach

    d = foreach c generate a::id as id, a::name as name, b::score as score, a::age as age;
    
    describe d;
    d: {id: int,name: chararray,score: int,age: int}
    
    dump d
    (1,zhangsan,20,112)
    (2,lisi,30,113)
    (3,wangwu,40,114)
    (4,zhaoliu,,115)

    Note: when using foreach, a space is required on at least one side of the equals sign; with no space on either side, Pig reports an error.
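
    generate can also evaluate expressions; for example, the null score in the last row could be replaced with 0 using Pig's bincond operator (a sketch with an illustrative alias):

    e = foreach d generate id, name, (score is null ? 0 : score) as score, age;
    dump e;
    (1,zhangsan,20,112)
    (2,lisi,30,113)
    (3,wangwu,40,114)
    (4,zhaoliu,0,115)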

    Storing results to HDFS

    store d into 'output/pig/person_score' using PigStorage(',');   -- the exported files on HDFS are comma-delimited
    hadoop fs -ls output/pig/person_score
    hadoop fs -cat output/pig/person_score/part-r-00000
    1,zhangsan,20,112
    2,lisi,30,113
    3,wangwu,40,114
    4,zhaoliu,,115
    
    store fails if the output directory already exists, so remove it first:

    hadoop fs -rmr output/pig/person_score
    store d into 'output/pig/person_score';     -- with no using clause, the exported files are tab-delimited
    hadoop fs -ls output/pig/person_score
    hadoop fs -cat output/pig/person_score/part-r-00000
    1	zhangsan	20	112
    2	lisi	30	113
    3	wangwu	40	114
    4	zhaoliu		115

    Running a Pig script file

    Put all of the Pig statements above into a single .pig script and run it:
    /home/hadoop/luogankun/workspace/shell/pig/person_score.pig

    a = load 'input/pig/person.txt' using PigStorage(',') as (id:int, name:chararray, age:int);
    b = load 'input/pig/score.txt' using PigStorage('\t') as (id:int, score:int);
    c = join a by id left, b by id;
    d = foreach c generate a::id as id, a::name as name, b::score as score, a::age as age;
    store d into 'output/pig/person_score2' using PigStorage(',');   

    Run the person_score.pig script:

    cd /home/hadoop/luogankun/workspace/shell/pig

    pig person_score.pig
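
    For quick tests without a cluster, the same script can also be run in local mode against the local filesystem (assuming the input paths are changed to local files):

    pig -x local person_score.pig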

    Passing parameters to a Pig script

    Pig script location: /home/hadoop/luogankun/workspace/shell/pig/mulit_params_demo01.pig

    log = LOAD '$input' AS (user:chararray, time:long, query:chararray);
    lmt = LIMIT log $size;
    DUMP lmt;
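
    The preprocessor also supports %default, which gives a parameter a fallback value so the script still runs when that parameter is not passed; a sketch:

    %default size 10;
    lmt = LIMIT log $size;   -- uses 10 unless -param size=... overrides it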

    Upload the data to HDFS

    cd /home/hadoop/luogankun/workspace/shell/pig
    hadoop fs -put excite-small.log input/pig/excite-small.log

    Method 1: pass each parameter individually

    pig -param input=input/pig/excite-small.log -param size=4 mulit_params_demo01.pig

    Method 2: save the parameters in a text file

    /home/hadoop/luogankun/workspace/shell/pig/mulit_params.txt

    input=input/pig/excite-small.log
    size=5

    pig -param_file mulit_params.txt mulit_params_demo01.pig
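
    To check the script after parameter substitution without running it, -dryrun writes a .substituted copy next to the script:

    pig -dryrun -param_file mulit_params.txt mulit_params_demo01.pig
    cat mulit_params_demo01.pig.substituted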
  • Original article: https://www.cnblogs.com/luogankun/p/3897135.html