1. The Pig data model
Bag: a table
Tuple: a row (a record)
Field: an attribute (a column)
Pig does not require the Tuples in a Bag to have the same number of Fields, or Fields of the same types.
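As a rough illustration of that last point, here is a Python sketch (plain Python, not Pig code) picturing a Bag as a collection of tuples whose arity and field types may vary:

```python
# Illustrative Python analogue of Pig's data model (not Pig code):
# a Bag holds Tuples; each Tuple is an ordered list of Fields.
# Tuples in one Bag need not share arity or field types.
bag = [
    (20130001, 80, 90),           # three int fields
    ("20130002", 85),             # two fields, mixed types
    (20130003, 60, 70, "extra"),  # four fields
]

arities = [len(t) for t in bag]
print(arities)  # each tuple has its own field count
```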
2. Common Pig Latin statements
1) LOAD: specifies how to load the input data
2) FOREACH: scans the data row by row and applies some processing
3) FILTER: filters rows
4) DUMP: prints the result to the screen
5) STORE: saves the result to a file
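Together these five statements form a typical load → transform → output pipeline. A hypothetical Python sketch of the same flow (illustrative names, plain Python rather than Pig Latin):

```python
# Hypothetical Python analogue of a Pig pipeline
# (LOAD -> FILTER -> FOREACH ... GENERATE -> DUMP/STORE).
raw = ["20130001|80|90", "20130005|65|98"]

# LOAD ... USING PigStorage('|'): split each line into fields
rows = [line.split("|") for line in raw]

# FILTER ... BY: keep only the rows we want
kept = [r for r in rows if r[0] != "20130005"]

# FOREACH ... GENERATE: project/derive new fields per row
out = [(r[0], int(r[1]) + int(r[2])) for r in kept]

# DUMP prints to the screen; STORE would write to a file instead
for t in out:
    print(t)
```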
3. A simple example
Suppose we have a score sheet containing a student number, a Chinese score, and a Math score, with fields separated by |, like this:
20130001|80|90
20130002|85|96
20130003|60|70
20130004|74|86
20130005|65|98
1) Upload the file from the local filesystem to Hadoop
[coder@h1 hadoop-0.20.2]$ bin/hadoop dfs -put /home/coder/score.txt in
Check that the upload succeeded:
[coder@h1 hadoop-0.20.2]$ bin/hadoop dfs -ls /user/coder/in
Found 1 items
-rw-r--r-- 2 coder supergroup 75 2013-04-20 14:33 /user/coder/in/score.txt
2) Load the raw data with LOAD
grunt> scores = LOAD 'hdfs://h1:9000/user/coder/in/score.txt' USING PigStorage('|') AS (num:int,Chinese:int,Math:int);
Input file: 'hdfs://h1:9000/user/coder/in/score.txt'
Name of the Bag (relation): scores
Fields are separated by | when each Tuple is read from the input file
Each Tuple read has three Fields: the student number (num), the Chinese score (Chinese), and the Math score (Math), all of type int.
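Conceptually, this LOAD splits each line on the delimiter and casts each field to its declared type. A minimal Python sketch (the helper name is made up for illustration):

```python
# Conceptual sketch of LOAD ... USING PigStorage('|') AS (num:int, ...):
# split each line on '|' and cast the fields per the declared schema.
def load_scores(lines, delimiter="|"):
    for line in lines:
        num, chinese, math = line.strip().split(delimiter)
        yield (int(num), int(chinese), int(math))

scores = list(load_scores(["20130001|80|90", "20130002|85|96"]))
print(scores)
```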
3) Inspect the schema of the relation
grunt> DESCRIBE scores;
scores: {num: int,Chinese: int,Math: int}
4) Suppose we need to filter out the record with student number 20130005
grunt> filter_scores = FILTER scores BY num != 20130005;
View the filtered records:
grunt> dump filter_scores;
(20130001,80,90)
(20130002,85,96)
(20130003,60,70)
(20130004,74,86)
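In plain Python terms (an illustrative sketch, not Pig), the FILTER is just a predicate applied to every tuple:

```python
# Sketch of FILTER scores BY num != 20130005: keep the tuples whose
# first field (the student number) differs from 20130005.
scores = [
    (20130001, 80, 90), (20130002, 85, 96), (20130003, 60, 70),
    (20130004, 74, 86), (20130005, 65, 98),
]
filter_scores = [t for t in scores if t[0] != 20130005]
print(filter_scores)
```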
5) Compute each student's total score
grunt> totalScore = FOREACH scores GENERATE num,Chinese+Math;
View the result:
grunt> dump totalScore;
(20130001,170)
(20130002,181)
(20130003,130)
(20130004,160)
(20130005,163)
(Note that totalScore is derived from scores, not filter_scores, so 20130005 still appears.)
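The FOREACH ... GENERATE step maps each tuple to a new, projected tuple. A Python sketch of the same transformation:

```python
# Sketch of FOREACH scores GENERATE num, Chinese + Math:
# project the student number plus a derived total for each tuple.
scores = [
    (20130001, 80, 90), (20130002, 85, 96), (20130003, 60, 70),
    (20130004, 74, 86), (20130005, 65, 98),
]
total_score = [(num, chinese + math) for (num, chinese, math) in scores]
print(total_score)
```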
6) Write each student's total to a file
grunt> store totalScore into 'hdfs://h1:9000/user/coder/out/result' using PigStorage('|');
View the result:
[coder@h1 ~]$ hadoop dfs -ls /user/coder/out/result
Found 2 items
drwxr-xr-x - coder supergroup 0 2013-04-20 15:54 /user/coder/out/result/_logs
-rw-r--r-- 2 coder supergroup 65 2013-04-20 15:54 /user/coder/out/result/part-m-00000
[coder@h1 ~]$ hadoop dfs -cat /user/coder/out/result/*
20130001|170
20130002|181
20130003|130
20130004|160
20130005|163
cat: Source must be a file.
(The "cat: Source must be a file." message comes from the _logs subdirectory matching the wildcard; the data itself is in part-m-00000.)
Another small example:
Suppose we have a batch of files in the following format:
zhangsan#123456#zhangsan@qq.com
lisi#434dfdds#lisi@126.com
wangwu#ffere233#wangwu@163.com
zhouliu#fgrtr43#zhouliu@139.com
Each record has three fields: account, password, and email, separated by #. We want to extract the email addresses from these files.
1) Upload the file to Hadoop
[coder@h1 hadoop-0.20.2]$ bin/hadoop dfs -put data.txt in
[coder@h1 hadoop-0.20.2]$ bin/hadoop dfs -ls /user/coder/in
Found 1 items
-rw-r--r-- 2 coder supergroup 122 2013-04-24 20:34 /user/coder/in/data.txt
2) Load the raw data file
grunt> T_A = LOAD '/user/coder/in/data.txt' using PigStorage('#') as (username:chararray,password:chararray,email:chararray);
3) Project the email field
grunt> T_B = FOREACH T_A GENERATE email;
4) Write the result to a file
grunt> STORE T_B INTO '/user/coder/out/email';
5) View the result
[coder@h1 hadoop-0.20.2]$ bin/hadoop dfs -cat /user/coder/out/email/*
zhangsan@qq.com
lisi@126.com
wangwu@163.com
zhouliu@139.com
cat: Source must be a file.
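The same extraction expressed as a Python sketch (split each record on '#' and keep the third field):

```python
# Sketch of the Pig job above: split each record on '#' and
# project only the email field (index 2).
records = [
    "zhangsan#123456#zhangsan@qq.com",
    "lisi#434dfdds#lisi@126.com",
    "wangwu#ffere233#wangwu@163.com",
    "zhouliu#fgrtr43#zhouliu@139.com",
]
emails = [rec.split("#")[2] for rec in records]
print(emails)
```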