Pig:
1. 内连接:
join_table = join table1 by col1, join table2 by col2
2. 外连接:
左外连接:join_table = join table1 by col1 left outer, join table2 by col2 其中outer 可省略
右外连接:join_table = join table1 by col1 right outer, join table2 by col2 其中outer 可省略
全外连接:join_table = join table1 by col1 full outer, join table2 by col2
join_table = join big_table by col1 , join small_table by col2 using 'replicated'
这样small_table 会加载到内存中复制到各个节点,再与big_table 连接。
replicated join 只支持内连接和左外连接
5. 倾斜数据连接 skew join
join_table = join table1 by col1, join table2 by col2 using 'skewed'
先对table2的键值进行抽样,判断键值的分布情况,然后给各个Reducer分配键值范围,可能同一个键值被分配到好几个reducer中
6. 排好顺序的数据连接 merge join
join_table = join table1 by col1, join table2 by col2 using 'merge'
Hive:
1. 内连接,可连接多个, 注意小表在前,大表在后
select * from table1 a join table2 b on a.col1=b.col2
2. 外连接
左:select * from table1 a left outer join table2 b on a.col1=b.col2
全:select * from table1 a full outer join table2 b on a.col1=b.col2
3. 左半开连接
select a.col1 , a.col2 , a.col3 from table1 a left semi join table2 b on a.col1=b.col2
实现与内连接相同功能,但是比内连接高效。
4.笛卡尔积
select * from table1 join table2
5.map-side join
设置了 hive.auto.convert.join=true, hive.mapjoin.smalltable.filesize=2500, hive.optimize.bucketmapJoin=true 等参数会自动开启