Pig Knowledge Blind Spots

1. Word Count example

inputfile = load 'file' as (line);

Contents:

{line: bytearray}

(Look at the stars, )
(Look how they shine for you, )
(And everything you do, )
(Yeah they were all yellow,)

wordsLine = foreach inputfile generate flatten(TOKENIZE(line)) as eachWord;

Contents:

{eachWord: chararray}

(Look)
(at)
(the)
(stars)
(Look)
(how)
(they)
(shine)
(for)
(you)
(And)
(everything)
(you)
(do)
(Yeah)
(they)
(were)
(all)
(yellow)

groupedWords = group wordsLine by eachWord;

Contents:

{group: chararray,wordsLine: {(eachWord: chararray)}}

(at,{(at)})
(do,{(do)})
(And,{(And)})
(all,{(all)})
(for,{(for)})
(how,{(how)})
(the,{(the)})
(you,{(you),(you)})
(Look,{(Look),(Look)})
(Yeah,{(Yeah)})
(they,{(they),(they)})
(were,{(were)})
(shine,{(shine)})
(stars,{(stars)})
(yellow,{(yellow)})
(everything,{(everything)})

wordcount = foreach groupedWords generate group, COUNT(wordsLine) as cnt;

{group: chararray, cnt:long}

(at,1)
(do,1)
(And,1)
(all,1)
(for,1)
(how,1)
(the,1)
(you,2)
(Look,2)
(Yeah,1)
(they,2)
(were,1)
(shine,1)
(stars,1)
(yellow,1)
(everything,1)
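The four Pig statements above can be mirrored step by step in plain Python to check the intermediate results. This is only a sketch: the `tokenize` helper is a hypothetical stand-in that approximates TOKENIZE by splitting on whitespace, commas, double quotes, parentheses, and asterisks, and the input lines are hard-coded instead of loaded from a file.

```python
import re
from collections import defaultdict

# Stand-in for: inputfile = load 'file' as (line);
lines = [
    "Look at the stars, ",
    "Look how they shine for you, ",
    "And everything you do, ",
    "Yeah they were all yellow,",
]


def tokenize(line):
    """Approximation of TOKENIZE: split on whitespace, commas, quotes, parentheses, asterisks."""
    return [tok for tok in re.split(r'[\s,"()*]+', line) if tok]


# foreach ... generate flatten(TOKENIZE(line)): one output row per word
words_line = [word for line in lines for word in tokenize(line)]

# group wordsLine by eachWord: each key maps to a bag of matching rows
grouped_words = defaultdict(list)
for word in words_line:
    grouped_words[word].append(word)

# foreach groupedWords generate group, COUNT(wordsLine) as cnt
wordcount = {word: len(bag) for word, bag in grouped_words.items()}
```

Each Python variable corresponds to the Pig alias of the same name, so the dictionary `wordcount` holds the same (word, cnt) pairs as the final relation.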

2. Pig return codes

0: success; 1: retriable failure; 2: failure; 3: partial failure; ...... the remaining codes indicate various exceptions.

3. Complex data types (rarely used)

map: key-value pairs with chararray keys

tuple: fixed length, ordered

bag: unordered

4. NULL values

null propagates through every operator:

x + null = null

null == 1 ? 1 : 0 evaluates to null
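The null rules above can be modelled in a few lines of Python, with `None` standing in for Pig's null. The helper names `add`, `eq`, and `bincond` are hypothetical and exist only for this illustration. They are not Pig functions.

```python
def add(x, y):
    """Arithmetic with Pig-style nulls: a null operand makes the result null."""
    if x is None or y is None:
        return None
    return x + y


def eq(x, y):
    """Comparison with Pig-style nulls: comparing against null gives null, not False."""
    if x is None or y is None:
        return None
    return x == y


def bincond(cond, if_true, if_false):
    """Model of `cond ? if_true : if_false`: a null condition gives a null result."""
    if cond is None:
        return None
    return if_true if cond else if_false
```

With these helpers, `bincond(eq(None, 1), 1, 0)` returns `None` rather than `0`, which is the case that usually surprises people.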

5. Loading and storing

Load functions:

PigStorage(','): HDFS path

HBaseStorage(): HBase table

TextLoader: HDFS path; each line becomes one tuple

Store functions:

PigStorage(','): HDFS path

HBaseStorage(): HBase table

TextLoader is load-only and has no store counterpart.

6. Case sensitivity

Keywords are case-insensitive: load == LOAD

Relation and field names are case-sensitive: tablea != tableA

UDF names are case-sensitive: count != COUNT

7. Parallel

Operators that trigger a reduce phase: group, order, distinct, join, cogroup, cross

Any of them can be followed by a parallel clause to set the number of reducers.

8. Registering UDFs

Use the register command, or the properties -Dudf.import.list or -Dpig.additional.jars.

9. Java static functions

These are actually invoked through reflection.

10. flatten

Applied to a bag, it turns one row into multiple rows.
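To make the "one row becomes many" behaviour of flatten concrete, here is a small Python sketch. Rows are modelled as `(key, bag)` pairs where the bag is a list of tuples. The function name `flatten_bag` and the sample data are made up for this example.

```python
def flatten_bag(rows):
    """Emit one output row per tuple in the bag, repeating the non-bag field."""
    out = []
    for key, bag in rows:
        for tup in bag:
            out.append((key,) + tup)
    return out


rows = [("a", [(1,), (2,)]), ("b", [(3,)]), ("c", [])]
flat = flatten_bag(rows)
```

Note that the row for `"c"` disappears: flattening an empty bag produces no output rows at all, which is also how Pig behaves.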

11. Replicated join

A map-side join: the small input is held in memory on every map task, so no reduce phase is needed.
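The idea behind a map-side join can be sketched in Python as a hash join: index the small input in memory, then stream the large input past it. This is a simplified single-process model, not Pig's implementation. The function name `replicated_join` and the sample relations are assumptions for illustration.

```python
from collections import defaultdict


def replicated_join(large, small, large_key=0, small_key=0):
    """Join without a shuffle: index the small input in memory, then stream the large one."""
    lookup = defaultdict(list)
    for row in small:
        lookup[row[small_key]].append(row)

    joined = []
    for row in large:
        # Rows of the large input with no match in the lookup are dropped (inner join)
        for match in lookup.get(row[large_key], []):
            joined.append(row + match)
    return joined


orders = [(1, "book"), (2, "pen"), (1, "lamp"), (4, "desk")]
users = [(1, "ann"), (2, "bob"), (3, "cy")]
result = replicated_join(orders, users)
```

In Pig the strategy is requested with `using 'replicated'`, for example `C = join A by id, B by id using 'replicated';`, with the small relation listed last.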

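12. Skew join

Samples the input first to determine the key distribution, then uses a custom partitioner to balance the load across reducers.

13. Merge join

For inputs that are already sorted on the join key; more efficient than the default join.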

14. cogroup

Groups several inputs at once, e.g. C = cogroup A by id, B by id;

It is equivalent to the first half of a join.
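The "first half of a join" remark can be illustrated in Python: cogroup collects, per key, one bag from each input, and an inner join is what you get by taking the cross product of those bags. The functions `cogroup` and `join_from_cogroup` and the sample data are hypothetical names for this sketch, and null keys are ignored for simplicity.

```python
def cogroup(a, b, key=0):
    """For every key seen in either input, collect one bag of rows per input."""
    keys = {row[key] for row in a} | {row[key] for row in b}
    return {
        k: ([r for r in a if r[key] == k], [r for r in b if r[key] == k])
        for k in sorted(keys)
    }


def join_from_cogroup(grouped):
    """Second half of the join: cross the two bags of each key."""
    return [
        row_a + row_b
        for bag_a, bag_b in grouped.values()
        for row_a in bag_a
        for row_b in bag_b
    ]


A = [(1, "x"), (2, "y")]
B = [(1, "p"), (1, "q"), (3, "r")]
C = cogroup(A, B)
joined = join_from_cogroup(C)
```

Keys 2 and 3 appear in `C` with one empty bag each, but contribute nothing to `joined`, because crossing with an empty bag yields no rows.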

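15. stream

Runs external scripts such as Perl or Python.

16. Running MapReduce directly

Use the mapreduce command.

17. Directed acyclic graph (DAG)

18. Partitioner

A custom partitioner can be supplied by registering your own jar.

19. Macros

Defined with the define keyword.

20. Nested Pig scripts

Included with the import keyword.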

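21. Execution plan: explain

Used for debugging.

22. illustrate

Used for debugging.

23. Pig statistics

Written to the log or printed to the terminal.

24. PigUnit

Integrates with JUnit.

25. Interacting with Python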

26. Eval functions

Eval functions extend org.apache.pig.EvalFunc<V> and implement exec(Tuple input).

They can also be written in Python.
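As a rough idea of what a Python eval function looks like, here is a minimal sketch. It assumes the script is registered through Pig's Jython support, where the `outputSchema` decorator is made available by Pig. The fallback definition below is a local stand-in so the file can also be run outside Pig. The function `to_upper` and the file name in the usage note are hypothetical.

```python
try:
    outputSchema  # expected to be provided by Pig when registered with Jython
except NameError:
    def outputSchema(schema):
        """Local stand-in so the module can be imported and tested outside Pig."""
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap


@outputSchema("word:chararray")
def to_upper(word):
    """Upper-case a chararray, passing null through unchanged."""
    if word is None:
        return None
    return word.upper()
```

Assuming the file is saved as `udfs.py`, it would be registered with something like `register 'udfs.py' using jython as myfuncs;` and called as `myfuncs.to_upper(eachWord)`.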

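27. Filter functions

Extend org.apache.pig.FilterFunc.

28. Load functions

29. Store functions

30. Piggybank

Provides commonly used aggregate, math, and string-processing functions.

In particular, there are regular-expression matching functions: REGEX_EXTRACT and REGEX_EXTRACT_ALL.

Statistical functions: correlation coefficient COR, covariance COV, and so on.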
Original article: https://www.cnblogs.com/leeeee/p/7276131.html