spark aggregate - 润新知

spark aggregate

该函数官方的api，说的不是很明白：
aggregate(zeroValue, seqOp, combOp)
Aggregate the elements of each partition, and then the results for all the partitions, using a given combine functions and a neutral “zero value.”
The functions op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2.
The first function (seqOp) can return a different result type, U, than the type of this RDD. Thus, we need one operation for merging a T into an U and one operation for merging two U
>>> seqOp=(lambdax,y:(x[0]+y,x[1]+1))
>>> combOp=(lambdax,y:(x[0]+y[0],x[1]+y[1]))
>>> sc.parallelize([1,2,3,4]).aggregate((0,0),seqOp,combOp)
(10, 4)
>>> sc.parallelize([]).aggregate((0,0),seqOp,combOp)
(0, 0)
下面列出，代码的执行流程：
假设[1,2,3,4]被分成两个分区，为分区1（[1,2]），分区2（[3,4]）
首先用seqOp对分区1进行操作：
x=(0,0) y=1 -----> (1,1) #对分区进行第一次seqOp操作时，x为zero value
x=(1,1) y=2 -----> (3,2) #对分区进行的第二次及以后的seqOp操作，x为前一次seqOp的执行结果
同样对分区2进行操作：
x=(0,0)   y=3 ----->   (3,1)
       x=(3,1)   y=4 ----->   (7,2)
然后用combOp对两个分区seqOp作用后的结果进行操作：
分区1：
x=(0,0) y=(3,2) ------> (3,2) #对第一个分区进行combOp操作时，x为zero value
x=(3,2) y=(7,2)   ------> (10,4) #对第二个及以后分区进行combOp操作时，x为前一分区combOp处理后的结果
可以看出，例子实际上即 (rdd.sum(),rdd.count())

来自为知笔记(Wiz)
相关阅读:
视图、触发器、事物、存储过程、函数、流程控制
 pymysql
单表查询与多表查询
 多线程学习（第三天）线程间通信
 多线程学习（第二天）Java内存模型
 多线程学习（第一天）java语言的线程
 springboot集成es7(基于high level client)
elasticSearch(六)--全文搜索
 elasticSearch(五)--排序
 elasticSearch(四)--结构化查询
原文地址：https://www.cnblogs.com/porco/p/4642434.html

Copyright © 2020-2023 润新知