The LOAD Operator
You can use Pig Latin's LOAD operator to load data into Apache Pig from the file system (HDFS or local).
Syntax
A LOAD statement consists of two parts separated by the "=" operator. On the left side we give the name of the relation in which we want to store the data; on the right side we define how the data is loaded. The syntax of the LOAD operator is given below.
Relation_name = LOAD 'Input file path' USING function as schema;
Where:
- relation_name - the relation in which we want to store the data. Leave a space between the relation name and the "=" that follows it, or Pig reports an error.
- Input file path - the HDFS directory in which the file is stored (in MapReduce mode).
- function - one of the load functions provided by Apache Pig (BinStorage, JsonLoader, PigStorage, TextLoader).
- schema - the schema of the data, defined as follows:
(column1 : data type, column2 : data type, column3 : data type);
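For example, a minimal sketch (the file /data/students.txt and its three comma-separated columns are hypothetical):
grunt> student = LOAD '/data/students.txt' USING PigStorage(',') as (id:int, name:chararray, city:chararray);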
The HDFS file to be loaded:
[root@host ~]# hdfs dfs -cat hdfs://localhost:9000/sqoop/sqoop1/2018060407/part-m-00003
600,null,2017-11-15 14:50:05.0,hunan changsha,0,91
650,null,2017-11-01 17:24:34.0,null,1,29
600,null,2017-11-15 14:50:05.0,hunan changsha,0,91
650,null,2017-11-01 17:24:34.0,null,1,29
600,null,2017-11-15 14:50:05.0,hunan changsha,0,91
650,null,2017-11-01 17:24:34.0,null,1,29
600,null,2017-11-15 14:50:05.0,hunan changsha,0,91
650,null,2017-11-01 17:24:34.0,null,1,29
Pig executes Pig Latin statements as follows:
1. Pig first checks the syntax and semantics of all statements.
2. When it encounters a DUMP or STORE command, Pig executes all of the preceding statements in order.
So some statements do not run on their own; they must be triggered by another command, and once triggered the whole chain is executed in one pass.
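A minimal sketch of this lazy evaluation (the file path is hypothetical); nothing runs until the DUMP:
grunt> a = LOAD '/data/students.txt' USING PigStorage(',') as (id:int);  -- parsed and checked only
grunt> f = filter a by id > 100;  -- still only added to the logical plan
grunt> dump f;  -- now the whole pipeline runs as one job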
Loading Data
grunt> customer =LOAD 'hdfs://localhost:9000/sqoop/sqoop1/2018060407/part-m-00003' USING PigStorage(',') as (roleid:int, name:chararray, dateid:datetime, addr:chararray, sex:int, level:int);
grunt> b =foreach customer generate roleid; -- FOREACH transforms data based on the columns of the relation
grunt> dump b
..................................
2018-06-15 14:54:22,671 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2018-06-15 14:54:22,672 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(600)
(650)
(600)
(650)
(600)
(650)
(600)
(650)
grunt> b =foreach customer generate roleid,sex;
grunt> dump b
...................
2018-06-15 15:14:20,495 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(600,0)
(650,1)
(600,0)
(650,1)
(600,0)
(650,1)
(600,0)
(650,1)
grunt> dump customer;
...........................
2018-06-15 14:59:53,355 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2018-06-15 14:59:53,355 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
Storing Data
grunt> store customer into 'hdfs://localhost:9000/pig' USING PigStorage(',');
......................................
2018-06-15 15:23:24,887 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2018-06-15 15:23:24,888 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.7.4 0.17.0 root 2018-06-15 15:23:18 2018-06-15 15:23:24 UNKNOWN
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
job_local1251771877_0005 1 0 n/a n/a n/a n/a 0 0 0 0 customer MAP_ONLY hdfs://localhost:9000/pig,
Input(s): Successfully read 8 records (28781246 bytes) from: "hdfs://localhost:9000/sqoop/sqoop1/2018060407/part-m-00003"
Output(s): Successfully stored 8 records (28779838 bytes) in: "hdfs://localhost:9000/pig"
Counters:
Total records written : 8
Total bytes written : 28779838
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG: job_local1251771877_0005
2018-06-15 15:23:24,892 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
Inspect the result on HDFS:
[root@host ~]# hdfs dfs -ls -R /pig
-rw-r--r-- 1 root supergroup 0 2018-06-15 15:23 /pig/_SUCCESS
-rw-r--r-- 1 root supergroup 432 2018-06-15 15:23 /pig/part-m-00000
(The listing below presumes a second store was run the same way with the target path /pig/20180615.)
[root@host ~]# hdfs dfs -ls -R /pig/20180615/
-rw-r--r-- 1 root supergroup 0 2018-06-15 15:27 /pig/20180615/_SUCCESS
-rw-r--r-- 1 root supergroup 432 2018-06-15 15:27 /pig/20180615/part-m-00000
[root@host ~]# hdfs dfs -cat /pig/20180615/part-m-00000
600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91
650,null,2017-11-01T17:24:34.000+08:00,null,1,29
600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91
650,null,2017-11-01T17:24:34.000+08:00,null,1,29
600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91
650,null,2017-11-01T17:24:34.000+08:00,null,1,29
600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91
650,null,2017-11-01T17:24:34.000+08:00,null,1,29
A LOAD statement simply loads the data into the specified relation in Apache Pig. To verify the execution of a LOAD statement, you have to use the diagnostic operators. Pig Latin provides four different types of diagnostic operators:
- the DUMP operator
- the DESCRIBE operator
- the EXPLAIN operator
- the ILLUSTRATE operator
The DUMP Operator
The DUMP operator runs Pig Latin statements and displays the results on the screen; it is typically used for debugging. It has already been demonstrated above.
The DESCRIBE operator is used to view the schema of a relation.
grunt> describe b
b: {roleid: int,sex: int}
grunt> describe customer
customer: {roleid: int,name: chararray,dateid: datetime,addr: chararray,sex: int,level: int}
The EXPLAIN operator displays the logical, physical, and MapReduce execution plans of a relation.
grunt> explain b
2018-06-15 16:08:01,235 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2018-06-15 16:08:01,236 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NestedLimitOptimizer, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
2018-06-15 16:08:01,237 [main] INFO org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for customer: $1, $2, $3, $5
#-----------------------------------------------
# New Logical Plan:
#-----------------------------------------------
b: (Name: LOStore Schema: roleid#453:int,sex#457:int)ColumnPrune:OutputUids=[453, 457]ColumnPrune:InputUids=[453, 457]
|
|---b: (Name: LOForEach Schema: roleid#453:int,sex#457:int)
| |
| (Name: LOGenerate[false,false] Schema: roleid#453:int,sex#457:int)
| | |
| | (Name: Cast Type: int Uid: 453)
| | |
| | |---roleid:(Name: Project Type: bytearray Uid: 453 Input: 0 Column: (*))
| | |
| | (Name: Cast Type: int Uid: 457)
| | |
| | |---sex:(Name: Project Type: bytearray Uid: 457 Input: 1 Column: (*))
| |
| |---(Name: LOInnerLoad[0] Schema: roleid#453:bytearray)
| |
| |---(Name: LOInnerLoad[1] Schema: sex#457:bytearray)
|
|---customer: (Name: LOLoad Schema: roleid#453:bytearray,sex#457:bytearray)ColumnPrune:OutputUids=[453, 457]ColumnPrune:InputUids=[453, 457]ColumnPrune:RequiredColumns=[0, 4]RequiredFields:[0, 4]
#-----------------------------------------------
# Physical Plan:
#-----------------------------------------------
b: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-98
|
|---b: New For Each(false,false)[bag] - scope-97
| |
| Cast[int] - scope-92
| |
| |---Project[bytearray][0] - scope-91
| |
| Cast[int] - scope-95
| |
| |---Project[bytearray][1] - scope-94
|
|---customer: Load(hdfs://localhost:9000/sqoop/sqoop1/2018060407/part-m-00003:PigStorage(',')) - scope-90
2018-06-15 16:08:01,239 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2018-06-15 16:08:01,240 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2018-06-15 16:08:01,240 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-99
Map Plan
b: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-98
|
|---b: New For Each(false,false)[bag] - scope-97
| |
| Cast[int] - scope-92
| |
| |---Project[bytearray][0] - scope-91
| |
| Cast[int] - scope-95
| |
| |---Project[bytearray][1] - scope-94
|
|---customer: Load(hdfs://localhost:9000/sqoop/sqoop1/2018060407/part-m-00003:PigStorage(',')) - scope-90
--------
Global sort: false
----------------
Use the ILLUSTRATE operator to step through the execution of a sequence of Pig Latin statements.
grunt> illustrate b
....................................
2018-06-15 16:14:59,291 [main] INFO org.apache.pig.impl.util.SpillableMemoryManager - Selected heap (PS Old Gen) of size 699400192 to monitor. collectionUsageThreshold = 489580128, usageThreshold = 489580128
2018-06-15 16:14:59,292 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2018-06-15 16:14:59,292 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map - Aliases being processed per job phase (AliasName[line,offset]): M: customer[1,10],customer[-1,-1],b[4,3] C: R:
-----------------------------------------------------------------------------------------------------------------------------------------
| customer | roleid:int | name:chararray | dateid:datetime | addr:chararray | sex:int | level:int |
-----------------------------------------------------------------------------------------------------------------------------------------
| | 600 | null | 2017-11-15T14:50:05.000+08:00 | hunan changsha | 0 | 91 |
-----------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------
| b | roleid:int | sex:int |
----------------------------------------
| | 600 | 0 |
----------------------------------------
In Pig Latin, relation names, field names, and function names are case-sensitive; parameter names and keywords are not.
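A quick sketch of the rule (the file path is hypothetical): the keyword LOAD may be written in any case, but a relation name must always match exactly.
grunt> A = load '/data/students.txt' using PigStorage(',') as (id:int);  -- keywords in any case
grunt> B = FOREACH A GENERATE id;  -- 'A' must match exactly; writing 'a' here would fail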
The GROUP operator groups the data in one or more relations; it collects the records that share the same key.
grunt> dump b
....
(600,0)
(650,1)
(600,0)
(650,1)
(600,0)
(650,1)
(600,0)
(650,1)
grunt> group_id =group b by sex;
grunt> describe group_id
group_id: {group: int,b: {(roleid: int,sex: int)}}
grunt> dump group_id
...................................................
2018-06-15 16:27:20,311 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(0,{(600,0),(600,0),(600,0),(600,0)})
(1,{(650,1),(650,1),(650,1),(650,1)})
You can group by multiple columns:
grunt> group_idsex =group b by (sex,roleid);
grunt> describe group_idsex
group_idsex: {group: (sex: int,roleid: int),b: {(roleid: int,sex: int)}}
grunt> dump group_idsex
((0,600),{(600,0),(600,0),(600,0),(600,0)})
((1,650),{(650,1),(650,1),(650,1),(650,1)})
You can also collect all of the records into a single group with ALL:
grunt> group_all =group b ALL;
grunt> describe group_all;
group_all: {group: chararray,b: {(roleid: int,sex: int)}}
grunt> dump group_all;
...........
(all,{(650,1),(600,0),(650,1),(600,0),(650,1),(600,0),(650,1),(600,0)})
The COGROUP operator works the same way as the GROUP operator. The only difference between the two is that GROUP is normally used with a single relation, while COGROUP is used in statements involving two or more relations.
grunt> distinctcustid =distinct b;
grunt> describe distinctcustid
distinctcustid: {roleid: int,sex: int}
grunt> dump distinctcustid
...........
(600,0)
(650,1)
grunt> cogroup1 =cogroup b by sex,distinctcustid by sex;
grunt> describe cogroup1;
cogroup1: {group: int,b: {(roleid: int,sex: int)},distinctcustid: {(roleid: int,sex: int)}}
grunt> dump cogroup1
(0,{(600,0),(600,0),(600,0),(600,0)},{(600,0)})
(1,{(650,1),(650,1),(650,1),(650,1)},{(650,1)})
grunt> cogroup2 =cogroup customer by sex,distinctcustid by sex;
grunt> describe cogroup2
cogroup2: {group: int,customer: {(roleid: int,name: chararray,dateid: datetime,addr: chararray,sex: int,level: int)},distinctcustid: {(roleid: int,sex: int)}}
grunt> dump cogroup2
............................
(0,{(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91),(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91),(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91),(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)},{(600,0)})
(1,{(650,null,2017-11-01T17:24:34.000+08:00,null,1,29),(650,null,2017-11-01T17:24:34.000+08:00,null,1,29),(650,null,2017-11-01T17:24:34.000+08:00,null,1,29),(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)},{(650,1)})
The JOIN operator combines records from two or more relations. To perform a join, we declare one field (or a group of fields) from each relation as a key. When these keys match, the two tuples are matched; otherwise the records are discarded. Joins can be of the following types:
- Self-join
- Inner join
- Outer join - left join, right join, and full join
Self-join
A self-join joins a table with itself as if it were two relations, temporarily renaming at least one of them. In Apache Pig, to perform a self-join we typically load the same data multiple times under different aliases (names), as in the sketch below.
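A minimal self-join sketch (the file /data/emp.txt with columns id and mgr is hypothetical):
grunt> emp1 = LOAD '/data/emp.txt' USING PigStorage(',') as (id:int, mgr:int);
grunt> emp2 = LOAD '/data/emp.txt' USING PigStorage(',') as (id:int, mgr:int);
grunt> selfjoin = join emp1 by mgr, emp2 by id;  -- pairs each row with its manager's row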
Inner Join
The inner join is the most frequently used join; it is also known as an equijoin. An inner join returns rows when there is a match in both tables. It creates a new relation by combining the column values of two relations (say A and B) based on the join predicate. The query compares each row of A with each row of B to find all pairs of rows that satisfy the join predicate. When the join predicate is satisfied, the column values of each matched pair of rows from A and B are combined into a result row.
grunt> join1 =join distinctcustid by roleid,b by roleid;
grunt> describe join1;
join1: {distinctcustid::roleid: int,distinctcustid::sex: int,b::roleid: int,b::sex: int}
grunt> dump join1
......
(600,0,600,0)
(600,0,600,0)
(600,0,600,0)
(600,0,600,0)
(650,1,650,1)
(650,1,650,1)
(650,1,650,1)
(650,1,650,1)
Left Outer Join
The left outer join operation returns all rows from the left table, even when there are no matches in the right relation.
grunt> joinleft =join distinctcustid by roleid left,b by roleid;
grunt> describe joinleft
joinleft: {distinctcustid::roleid: int,distinctcustid::sex: int,b::roleid: int,b::sex: int}
Right Outer Join
The right outer join operation returns all rows from the right table, even when there are no matches in the left table.
grunt> joinright =join distinctcustid by roleid right,b by roleid;
grunt> describe joinright
joinright: {distinctcustid::roleid: int,distinctcustid::sex: int,b::roleid: int,b::sex: int}
Full Outer Join
The full outer join operation returns rows when there is a match in either relation. Note that distinctcustid and b here contain exactly the same keys, so all three outer joins return the same rows as the inner join; null padding appears only when a key has no match on the other side.
grunt> joinfull =join distinctcustid by roleid full,b by roleid;
grunt> describe joinfull
joinfull: {distinctcustid::roleid: int,distinctcustid::sex: int,b::roleid: int,b::sex: int}
Joining on Multiple Keys
We can perform a JOIN using multiple keys; the keys must be listed in the same order on both sides.
grunt> joinbykeys =join distinctcustid by (roleid,sex),b by (roleid,sex);
grunt> describe joinbykeys
joinbykeys: {distinctcustid::roleid: int,distinctcustid::sex: int,b::roleid: int,b::sex: int}
The CROSS operator computes the cross product (Cartesian product) of two or more relations.
grunt> crosstest =cross distinctcustid,b;
grunt> describe crosstest
crosstest: {distinctcustid::roleid: int,distinctcustid::sex: int,b::roleid: int,b::sex: int}
Pig Latin's UNION operator merges the contents of two relations. To perform a UNION on two relations, their columns and domains must be identical.
grunt> customer1 =LOAD 'hdfs://localhost:9000/sqoop/sqoop1/2018060407/part-m-00002' USING PigStorage(',') as (roleid:int, name:chararray, dateid:datetime, addr:chararray, sex:int, level:int);
grunt> customer =LOAD 'hdfs://localhost:9000/sqoop/sqoop1/2018060407/part-m-00003' USING PigStorage(',') as (roleid:int, name:chararray, dateid:datetime, addr:chararray, sex:int, level:int);
grunt> union1 =union customer1,customer;
grunt> describe union1
union1: {roleid: int,name: chararray,dateid: datetime,addr: chararray,sex: int,level: int}
grunt> dump union1
..............................
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
The SPLIT operator partitions a relation into two or more relations.
grunt> split union1 into customer1 if(sex==1),customer0 if(sex==0);
grunt> dump customer0;
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
grunt> dump customer1
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
grunt> cust1 =distinct customer0;
grunt> dump cust1
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
The FILTER operator selects the tuples you want from a relation based on a condition.
grunt> uniondis =distinct union1;
grunt> dump uniondis
.....
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
grunt> filter_level =filter uniondis by (level<50);
grunt> dump filter_level
.................
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
grunt> filter_level2 =filter uniondis by (level>=50);
grunt> dump filter_level2
........................
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
The DISTINCT operator removes redundant (duplicate) tuples from a relation; examples appear above.
The FOREACH operator generates the specified data transformations based on the column data.
grunt> foreach1 =foreach uniondis generate(name,sex,level);
grunt> dump foreach1
...........
((null,0,4))
((null,0,91))
((null,1,29))
Note that generate (name,sex,level) with parentheses wraps the fields into a single tuple column, as seen above; without the parentheses each field becomes its own column:
grunt> foreach1 =foreach uniondis generate name,sex,level;
grunt> dump foreach1
.............
(null,0,4)
(null,0,91)
(null,1,29)
The ORDER BY operator sorts and displays the contents of a relation by one or more fields.
grunt> orderby1 =order uniondis by level desc;
grunt> dump orderby1
.....
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
The LIMIT operator returns a limited number of tuples from a relation.
grunt> limit1 =limit uniondis 2;
grunt> dump limit1
......
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
--------------------------------------------------------------------------------------------
Fields can also be referenced by position ($0 is the first field) and renamed with AS:
grunt> B = foreach customer generate $0 as id,$4 as sex,level;
grunt> dump B
.........
(600,0,91)
(650,1,29)
(600,0,91)
(650,1,29)
(600,0,91)
(650,1,29)
(600,0,91)
(650,1,29)
----------------------------------------------------------------------------------------------------------
Pig allows you to transform data in many ways. As a starting point, become familiar with these operators:
- Use the FILTER operator to work with tuples or rows of data. Use the FOREACH operator to work with columns of data.
- Use the GROUP operator to group data in a single relation. Use the COGROUP, inner JOIN, and outer JOIN operators to group or join data in two or more relations.
- Use the UNION operator to merge the contents of two or more relations. Use the SPLIT operator to partition the contents of a relation into multiple relations.
Pig Latin provides operators that can help you debug your Pig Latin statements:
- Use the DUMP operator to display results to your terminal screen.
- Use the DESCRIBE operator to review the schema of a relation.
- Use the EXPLAIN operator to view the logical, physical, or map reduce execution plans to compute a relation.
- Use the ILLUSTRATE operator to view the step-by-step execution of a series of statements.
Pig provides shortcuts for the frequently used debugging operators (DUMP, DESCRIBE, EXPLAIN, ILLUSTRATE). These shortcuts can be used in the Grunt shell or within Pig scripts. The following shortcuts are supported:
- \d alias - shortcut for the DUMP operator. If alias is omitted, the last defined alias is used.
- \de alias - shortcut for the DESCRIBE operator. If alias is omitted, the last defined alias is used.
- \e alias - shortcut for the EXPLAIN operator. If alias is omitted, the last defined alias is used.
- \i alias - shortcut for the ILLUSTRATE operator. If alias is omitted, the last defined alias is used.
- \q - quit the Grunt shell.
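For example, a short Grunt session using the shortcuts (reusing the customer relation from above): \de describes the last defined alias (here b), \d b dumps relation b, and \q quits.
grunt> b = foreach customer generate roleid, sex;
grunt> \de
grunt> \d b
grunt> \q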