While debugging, I noticed an interesting behavior: a task executes its partition record by record, in batches, and each record is pushed through every operation we applied to the RDD. Partitions execute in parallel, while the records within one partition are of course processed sequentially (no parallelism there).
For example, suppose I have a job
rdd.map(...).map(...).map(...).reduce(...)
whose data source has 5 rows split across 2 partitions:
-
task-0 (partition 0):
row 1: map(...).map(...).map(...); row 2: map(...).map(...).map(...); reduce(...); then the rest of partition 0...
-
task-1 (partition 1):
row 4: map(...).map(...).map(...); row 5: map(...).map(...).map(...); reduce(...); then the rest of partition 1...
-
task-2 (DAG scheduler):
reduce over the results of task-0 and task-1
task-0 and task-1 belong to the same stage and can run in parallel;
task-2 belongs to a different stage, but it could also run concurrently with task-0 and task-1:
as task-0 and task-1 run, they could shuffle data to task-2 on the fly, so task-2 could start reducing before they finish.
I don't know whether the TaskScheduler actually implements this level of parallelism.
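The per-record pipelining described above can be mimicked with plain Scala iterators, which is close to how a chain of narrow transformations is fused inside a single task (a simplified sketch; no Spark involved, all names are made up):

```scala
// A partition is just an iterator over its records; chained maps on an
// iterator are lazy and fused, so each record flows through all the maps
// before the next record is touched -- like map1/map2/map3 in one task.
val log = scala.collection.mutable.ArrayBuffer[String]()

val partition0 = Iterator("row1", "row2")
val pipelined = partition0
  .map { r => log += s"map1:$r"; r }
  .map { r => log += s"map2:$r"; r }

// Nothing has run yet (iterators are lazy); pulling records drives the pipeline.
val out = pipelined.toList

// log order: map1:row1, map2:row1, map1:row2, map2:row2 --
// record-at-a-time, not stage-at-a-time.
println(log.mkString(", "))
```

This matches the interleaved map1/map2/map3 lines in the log output further down.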
Test data: sale.svc
John,iphone Cover,9.99
John,Headphones,5.49
Jack,iphone Cover,9.99
Jill,Samsung Galaxy Cover,8.95
Bob,iPad Cover,5.49
Test code:
import org.apache.spark.{SparkConf, SparkContext}

object SaleStat {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .setAppName("SaleStat")
      .setMaster("local[4]")
      .set("spark.executor.heartbeatInterval", 100.toString)
    val sc = new SparkContext(conf)
    val saleData = sc.textFile("PATH_TO_TEST_DATA/sale.svc")
      .map { line =>
        val words = line.split(",")
        println("map1: " + Thread.currentThread().getName + "," + line)
        words
      }
      .map { words =>
        val (user, product, price) = (words(0), words(1), words(2))
        println("map2: " + (Thread.currentThread().getName, user, product, price))
        (user, product, price)
      }
      .map { x =>
        val price: Double = x._3.toDouble
        println("map3: " + Thread.currentThread().getName + "," + x)
        (x._1, price)
      }
      .reduce { (x, y) =>
        // Note: this is a plain reduce, not reduceByKey; the string below is just a log tag.
        val total: Double = x._2 + y._2
        println("reduceByKey: " + (Thread.currentThread().getName, x, y, total))
        (x._1, total)
      }
    println("saleData: " + saleData) // reduce is an action, so the RDD computation was already triggered above
    sc.stop()
  }
}
Output:
map1: Executor task launch worker-0,John,iphone Cover,9.99
map1: Executor task launch worker-1,Jill,Samsung Galaxy Cover,8.95
map2: (Executor task launch worker-0,John,iphone Cover,9.99)
map2: (Executor task launch worker-1,Jill,Samsung Galaxy Cover,8.95)
map3: Executor task launch worker-1,(Jill,Samsung Galaxy Cover,8.95)
map3: Executor task launch worker-0,(John,iphone Cover,9.99)
map1: Executor task launch worker-1,Bob,iPad Cover,5.49
map1: Executor task launch worker-0,John,Headphones,5.49
map2: (Executor task launch worker-1,Bob,iPad Cover,5.49)
map2: (Executor task launch worker-0,John,Headphones,5.49)
map3: Executor task launch worker-1,(Bob,iPad Cover,5.49)
map3: Executor task launch worker-0,(John,Headphones,5.49)
reduceByKey: (Executor task launch worker-1,(Jill,8.95),(Bob,5.49),14.44)
reduceByKey: (Executor task launch worker-0,(John,9.99),(John,5.49),15.48)
map1: Executor task launch worker-0,Jack,iphone Cover,9.99
map2: (Executor task launch worker-0,Jack,iphone Cover,9.99)
map3: Executor task launch worker-0,(Jack,iphone Cover,9.99)
reduceByKey: (Executor task launch worker-0,(John,15.48),(Jack,9.99),25.47)
reduceByKey: (dag-scheduler-event-loop,(Jill,14.44),(John,25.47),39.91)
saleData: (Jill,39.91)
worker-0 is task-0
worker-1 is task-1
dag-scheduler-event-loop is task-2 (the final cross-partition reduce runs on the driver)
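The trace also shows how reduce aggregates in two levels: each task first reduces its own partition, then the driver combines the per-partition results. A plain-Scala simulation of the numbers in the log (no Spark; partition contents taken from the test data above):

```scala
// Partition contents after the three maps, per the trace above.
val partition0 = List(("John", 9.99), ("John", 5.49), ("Jack", 9.99))
val partition1 = List(("Jill", 8.95), ("Bob", 5.49))

// Same combine function as the job's reduce: sums the prices, keeps the left key.
def combine(x: (String, Double), y: (String, Double)): (String, Double) =
  (x._1, x._2 + y._2)

// Level 1: each task reduces its own partition (task-0 and task-1).
val r0 = partition0.reduce(combine) // ("John", ~25.47)
val r1 = partition1.reduce(combine) // ("Jill", ~14.44)

// Level 2: the driver-side combine (the dag-scheduler-event-loop line).
val result = combine(r1, r0) // ("Jill", ~39.91)
println(result)
```

Note that the surviving key is simply whichever operand came first, which is why the grand total of everyone's purchases ends up labeled "Jill"; a genuine per-user total would need a keyed RDD and reduceByKey instead of reduce.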