1. Differences and connections between partition by and group by
1) group by is a grouping clause; partition by belongs to analytic (window) functions (aggregates such as sum() can be used with either).
2) In execution order, partition by is applied after the keywords below: the window is evaluated over the result set that remains after the query has run, while group by follows the usual precedence of SQL keywords (from > where > group by > having > order by).
3) Compared with group by, partition by keeps every row of the data and only groups and sorts within the chosen columns, whereas group by retains only the grouping columns and the aggregate results.
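The distinction in 3) is not Spark-specific; it holds in standard SQL. Below is a minimal runnable sketch using Python's sqlite3 (an illustrative stand-in; the examples that follow use Spark/Scala): GROUP BY collapses each group to one row, while ROW_NUMBER() OVER (PARTITION BY ...) keeps every row and adds a per-group rank.

```python
import sqlite3

# In-memory table with the same shape as the Spark example below.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (user_id TEXT, start_time TEXT, end_time TEXT);
INSERT INTO t VALUES
  ('ABC', '2019-02-10', '411626'),
  ('ABC', '2019-02-10', '411627'),
  ('BCD', '2020-04-01', '411626'),
  ('BCD', '2020-04-01', '411627');
""")

# GROUP BY: one row per (user_id, start_time); non-grouped columns are lost.
grouped = conn.execute(
    "SELECT user_id, start_time, COUNT(end_time) FROM t "
    "GROUP BY user_id, start_time ORDER BY user_id").fetchall()
print(grouped)  # 2 rows, one per group

# PARTITION BY: every original row survives, with a per-group rank attached.
ranked = conn.execute(
    "SELECT user_id, start_time, end_time, "
    "ROW_NUMBER() OVER (PARTITION BY user_id, start_time ORDER BY end_time) "
    "FROM t ORDER BY user_id, end_time").fetchall()
print(ranked)  # 4 rows: all input rows kept
```

The same input produces 2 rows under GROUP BY but 4 rows under PARTITION BY, which is exactly point 3) above.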
2. Using group by in Spark SQL
import org.apache.spark.sql.functions._
import spark.implicits._  // for toDF; assumes an active SparkSession named spark

val df = Seq(
  ("ABC", "2019-02-10", "411626"),
  ("ABC", "2019-02-10", "411627"),
  ("BCD", "2020-04-01", "411626"),
  ("BCD", "2020-04-01", "411627"),
  ("BCD", "2020-04-02", "411626"),
  ("BCD", "2020-04-02", "411627"),
  ("DEF", "2019-01-09", "411626")
).toDF("user_id", "start_time", "end_time")

df.groupBy(col("user_id"), col("start_time"))
  .agg(count(col("end_time")), sum(col("end_time")))
  .show()

+-------+----------+---------------+-------------+
|user_id|start_time|count(end_time)|sum(end_time)|
+-------+----------+---------------+-------------+
|    BCD|2020-04-02|              2|     823253.0|
|    ABC|2019-02-10|              2|     823253.0|
|    BCD|2020-04-01|              2|     823253.0|
|    DEF|2019-01-09|              1|     411626.0|
+-------+----------+---------------+-------------+
3. Using partition by in Spark SQL
import org.apache.spark.sql.expressions.Window

df.withColumn("rank", row_number().over(
    Window.partitionBy(col("user_id"), col("start_time")).orderBy(col("end_time"))))
  .show()

+-------+----------+--------+----+
|user_id|start_time|end_time|rank|
+-------+----------+--------+----+
|    BCD|2020-04-02|  411626|   1|
|    BCD|2020-04-02|  411627|   2|
|    ABC|2019-02-10|  411626|   1|
|    ABC|2019-02-10|  411627|   2|
|    BCD|2020-04-01|  411626|   1|
|    BCD|2020-04-01|  411627|   2|
|    DEF|2019-01-09|  411626|   1|
+-------+----------+--------+----+
With partition by, all data columns are returned.
4. Returning all data columns with group by
df.groupBy(col("user_id"), col("start_time"))
  .agg(count(col("end_time")),
       sum(col("end_time")),
       collect_set(col("end_time"))(0).as("end_time"))
  .show()

+-------+----------+---------------+-------------+--------+
|user_id|start_time|count(end_time)|sum(end_time)|end_time|
+-------+----------+---------------+-------------+--------+
|    BCD|2020-04-02|              2|     823253.0|  411627|
|    ABC|2019-02-10|              2|     823253.0|  411627|
|    BCD|2020-04-01|              2|     823253.0|  411627|
|    DEF|2019-01-09|              1|     411626.0|  411626|
+-------+----------+---------------+-------------+--------+
Using collect_set (which de-duplicates) together with an element index lets a non-grouped column be carried through the aggregation; note that only one value per group survives, so this is not a full substitute for a window function (collect_list would keep duplicates, but still collapses the group to one row).
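To truly attach an aggregate to every original row, a window aggregate is simpler than the collect_set trick. A hedged sketch, again in Python/sqlite3 as a stand-in for the Spark API: COUNT(...) OVER (PARTITION BY ...) computes the per-group count while keeping all rows and all columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (user_id TEXT, start_time TEXT, end_time TEXT);
INSERT INTO t VALUES
  ('BCD', '2020-04-02', '411626'),
  ('BCD', '2020-04-02', '411627'),
  ('DEF', '2019-01-09', '411626');
""")

# Window aggregate: the per-group count is attached to every row, so no
# column needs to be smuggled out of the group via collect_set / MIN / MAX.
rows = conn.execute(
    "SELECT user_id, start_time, end_time, "
    "COUNT(end_time) OVER (PARTITION BY user_id, start_time) AS cnt "
    "FROM t ORDER BY user_id, end_time").fetchall()
print(rows)  # 3 rows: all input rows kept, each with its group's count
```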