• sparkSQL中partition by和group by区别及使用


    1. partition by和group by区别和联系

    1)group by是分组函数,partition by是分析函数(然后像sum()等是聚合函数)

    2)在执行顺序上partition by应用在以上关键字之后,实际上就是在执行完select之后,在所得结果集之上进行partition,group by 使用常用sql关键字的优先级(from > where > group by > having > order by)

    3)partition by相比较于group by,能够在保留全部数据的基础上,只对其中某些字段做分组排序,而group by则只保留参与分组的字段和聚合函数的结果

    2. spark sql 使用group by

        val df = Seq(
          ("ABC", "2019-02-10", "411626"),
          ("ABC", "2019-02-10", "411627"),
          ("BCD", "2020-04-01", "411626"),
          ("BCD", "2020-04-01", "411627"),
          ("BCD", "2020-04-02", "411626"),
          ("BCD", "2020-04-02", "411627"),
          ("DEF", "2019-01-09", "411626"))
          .toDF("user_id", "start_time", "end_time")
    
       df.groupBy(col("user_id"), col("start_time"))
          .agg(count(col("end_time")), sum(col("end_time")))
          .show()
    
    +-------+----------+---------------+-------------+
    |user_id|start_time|count(end_time)|sum(end_time)|
    +-------+----------+---------------+-------------+
    |    BCD|2020-04-02|              2|     823253.0|
    |    ABC|2019-02-10|              2|     823253.0|
    |    BCD|2020-04-01|              2|     823253.0|
    |    DEF|2019-01-09|              1|     411626.0|
    +-------+----------+---------------+-------------+

    2. spark sql 使用partition by

        df.withColumn("rank",row_number().over(Window.partitionBy(col("user_id"), col("start_time")).orderBy(col("end_time"))))
        .show()
    
    +-------+----------+--------+----+
    |user_id|start_time|end_time|rank|
    +-------+----------+--------+----+
    |    BCD|2020-04-02|  411626|   1|
    |    BCD|2020-04-02|  411627|   2|
    |    ABC|2019-02-10|  411626|   1|
    |    ABC|2019-02-10|  411627|   2|
    |    BCD|2020-04-01|  411626|   1|
    |    BCD|2020-04-01|  411627|   2|
    |    DEF|2019-01-09|  411626|   1|
    +-------+----------+--------+----+

    partition by 返所有数据列

    3. group by实现返所有数据列

       df.groupBy(col("user_id"), col("start_time"))
          .agg(count(col("end_time")), sum(col("end_time")), collect_set(col("end_time"))(0).as("end_time"))
          .show()
    
    +-------+----------+---------------+-------------+--------+
    |user_id|start_time|count(end_time)|sum(end_time)|end_time|
    +-------+----------+---------------+-------------+--------+
    |    BCD|2020-04-02|              2|     823253.0|  411627|
    |    ABC|2019-02-10|              2|     823253.0|  411627|
    |    BCD|2020-04-01|              2|     823253.0|  411627|
    |    DEF|2019-01-09|              1|     411626.0|  411626|
    +-------+----------+---------------+-------------+--------+

    使用 collect_set(去重)可以实现返回所有列

     

  • 相关阅读:
    Wamp 访问本地站点慢 的解决办法
    PHP中如何实现 “在页面中一边执行一边输出” 的效果
    PHP很有用的一个函数ignore_user_abort ()
    PHP中CURL方法curl_setopt()函数的一些参数
    RGB转为Lab空间
    void v.s. void *
    __attribute__((regparm(3))) from GNU C
    a number of 和the number of用法
    Linux 版本查詢
    在frameset,iframe內調用Javascript的方法
  • 原文地址:https://www.cnblogs.com/yyy-blog/p/12711701.html
Copyright © 2020-2023  润新知