• [Notes] RDD Execution - 1


    While debugging, I noticed an interesting phenomenon: a task processes its partition's data record by record, and each record is pushed through all of the operations we chained onto the RDD before the next record is read. Partitions can be processed in parallel, but the operations applied to a single record naturally run sequentially (they cannot be parallelized).

    For example, suppose I have this job:

    rdd.map(...).map(...).map(...).reduce(...)
    

    The data source has 5 rows of data, split across 2 partitions.

    1. task-0 (partition 0):

       row 1: map(...).map(...).map(...)
       row 2: map(...).map(...).map(...)
       reduce(...)
       continue with the rest of partition 0's data (row 3 in this example)...

    2. task-1 (partition 1):

       row 4: map(...).map(...).map(...)
       row 5: map(...).map(...).map(...)
       reduce(...)
       continue with the rest of partition 1's data...

    3. task-2 (DAG scheduler):

       reduce the partial results of task-0 and task-1
      

    task-0 and task-1 belong to the same stage and can run in parallel.

    "task-2" is not really a task in a separate stage: reduce is an action, so no shuffle is involved here. Each task sends its partial result back to the driver, where the dag-scheduler-event-loop thread folds it into the running total as results arrive.

    That final merge can therefore overlap with task-0 and task-1: as soon as one task finishes, its result can be reduced while the other task is still running. I don't know to what extent the TaskScheduler implements this level of parallelism in general.
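
    To check how many partitions the file actually produces and which rows land where, mapPartitionsWithIndex can tag each row with its partition id. A minimal sketch (the object name is a placeholder; the path follows the test setup below):

    import org.apache.spark.{SparkConf, SparkContext}

    object PartitionCheck {
    	def main(args: Array[String]): Unit = {
    		val sc = new SparkContext(new SparkConf(true).setAppName("PartitionCheck").setMaster("local[4]"))
    		val rdd = sc.textFile("PATH_TO_TEST_DATA/sale.svc")
    		println("partitions: " + rdd.getNumPartitions)	// expect 2 for this small file
    		rdd.mapPartitionsWithIndex { (idx, it) =>
    			it.map(line => s"partition $idx: $line")	// tag each row with its partition id
    		}.collect().foreach(println)
    		sc.stop()
    	}
    }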

    Test data: sale.svc

    John,iphone Cover,9.99
    John,Headphones,5.49
    Jack,iphone Cover,9.99
    Jill,Samsung Galaxy Cover,8.95
    Bob,iPad Cover,5.49

    Test code:

    import org.apache.spark.{SparkConf, SparkContext}

    object SaleStat {
    	def main(args: Array[String]): Unit = {
    		val conf = new SparkConf(true)
    			.setAppName("SaleStat")
    			.setMaster("local[4]")
    			.set("spark.executor.heartbeatInterval", 100.toString)
    		val sc = new SparkContext(conf)
    		val saleData = sc.textFile("PATH_TO_TEST_DATA/sale.svc")
    			.map { line =>
    				val words = line.split(",")
    				println("map1: " + Thread.currentThread().getName + "," + line)
    				words
    			}
    			.map { words =>
    				val (user, product, price) = (words(0), words(1), words(2))
    				println("map2: " + (Thread.currentThread().getName, user, product, price))
    				(user, product, price)
    			}
    			.map { x =>
    				val price: Double = x._3.toDouble
    				println("map3: " + Thread.currentThread().getName + "," + x)
    				(x._1, price)
    			}
    			.reduce { (x, y) =>
    				val total: Double = x._2 + y._2
    				println("reduceByKey: " + (Thread.currentThread().getName, x, y, total))
    				(x._1, total)
    			}

    		println("saleData: " + saleData)	// reduce is an action, so the computation has already run by this point
    		sc.stop()
    	}
    }
    

    Output:

    map1: Executor task launch worker-0,John,iphone Cover,9.99
    map1: Executor task launch worker-1,Jill,Samsung Galaxy Cover,8.95
    map2: (Executor task launch worker-0,John,iphone Cover,9.99)
    map2: (Executor task launch worker-1,Jill,Samsung Galaxy Cover,8.95)
    map3: Executor task launch worker-1,(Jill,Samsung Galaxy Cover,8.95)
    map3: Executor task launch worker-0,(John,iphone Cover,9.99)
    map1: Executor task launch worker-1,Bob,iPad Cover,5.49
    map1: Executor task launch worker-0,John,Headphones,5.49
    map2: (Executor task launch worker-1,Bob,iPad Cover,5.49)
    map2: (Executor task launch worker-0,John,Headphones,5.49)
    map3: Executor task launch worker-1,(Bob,iPad Cover,5.49)
    map3: Executor task launch worker-0,(John,Headphones,5.49)
    reduceByKey: (Executor task launch worker-1,(Jill,8.95),(Bob,5.49),14.44)
    reduceByKey: (Executor task launch worker-0,(John,9.99),(John,5.49),15.48)
    map1: Executor task launch worker-0,Jack,iphone Cover,9.99
    map2: (Executor task launch worker-0,Jack,iphone Cover,9.99)
    map3: Executor task launch worker-0,(Jack,iphone Cover,9.99)
    reduceByKey: (Executor task launch worker-0,(John,15.48),(Jack,9.99),25.47)
    reduceByKey: (dag-scheduler-event-loop,(Jill,14.44),(John,25.47),39.91)
    saleData: (Jill,39.91)
    
    worker-0 corresponds to task-0
    worker-1 corresponds to task-1
    dag-scheduler-event-loop corresponds to "task-2", the merge on the driver
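
    One detail in the final line: reduce merges whole (user, total) tuples, so the surviving key (Jill here) is essentially arbitrary, while 39.91 is the grand total of all five prices. If per-user totals were the goal (as the "reduceByKey" println labels suggest), reduceByKey would be the right tool. A minimal sketch (the object name is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    object SaleStatPerUser {
    	def main(args: Array[String]): Unit = {
    		val sc = new SparkContext(new SparkConf(true).setAppName("SaleStatPerUser").setMaster("local[4]"))
    		sc.textFile("PATH_TO_TEST_DATA/sale.svc")
    			.map(_.split(","))
    			.map(words => (words(0), words(2).toDouble))	// (user, price)
    			.reduceByKey(_ + _)	// keyed aggregation; unlike reduce, this does introduce a shuffle stage
    			.collect()
    			.foreach(println)	// expect (John,15.48), (Jack,9.99), (Jill,8.95), (Bob,5.49)
    		sc.stop()
    	}
    }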
    

    This is probably part of the appeal of the DAG. RDD execution is lazy, and the DAG scheduler can consolidate the chained transformations of an RDD so that many operations run back-to-back in a single thread for each record, avoiding intermediate disk and network I/O, as the toy sketch below shows.
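
    The record-at-a-time interleaving in the log is ordinary lazy iterator composition. A toy sketch in plain Scala (not Spark internals) that reproduces the map1/map2 interleaving pattern above:

    // Chained maps over an iterator compose lazily: each map wraps its
    // parent's iterator, so one element flows through the whole chain
    // before the next element is read -- no intermediate collections.
    val source = Iterator("a", "b", "c")
    val pipelined = source
    	.map { x => println(s"map1: $x"); x + "1" }
    	.map { x => println(s"map2: $x"); x + "2" }
    pipelined.foreach(x => println(s"sink: $x"))
    // prints: map1: a, map2: a1, sink: a12, map1: b, map2: b1, ...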

• Original post: https://www.cnblogs.com/ivanny/p/spark_rdd_run_DAGScheduler.html