• [Repost] Spark "Task not serializable" Error Explained


    Spark "Task not serializable" Error Explained

    I ran into this problem while learning Spark Streaming. First, look at the following code:

    import org.apache.log4j.{Level, Logger}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.{SparkConf, SparkContext}
    /**
      * Created by Administrator on 2017/11/6.
      */
    object ForEachTest {
      val checkpointDirectory = "hdfs://hadoop1:9000/streamingchekpoint4"

      def functionToCreateContext(): StreamingContext = {
        // program entry point
        val conf = new SparkConf().setMaster("local[2]").setAppName(s"${this.getClass.getSimpleName}")
        val sc = new SparkContext(conf)
        sc.setLogLevel("ERROR")
        val ssc = new StreamingContext(sc, Seconds(1))
        // data input
        val dStream = ssc.socketTextStream("192.168.32.10", 9999)
        // data processing: running word count per key
        val resultDStream = dStream.flatMap(_.split(","))
          .map((_, 1))
          .updateStateByKey((values: Seq[Int], valuesState: Option[Int]) => {
            val currentCount = values.sum
            val lastCount = valuesState.getOrElse(0)
            Some(currentCount + lastCount)
          })
        // data output
        resultDStream.foreachRDD(rdd => {
          // runs on the driver
          val jdbcCoon = MysqlPool.getJdbcCoon()
          val statement = jdbcCoon.createStatement()
          rdd.foreachPartition(partition => {
            // runs on an executor; the closure captures `statement` from the driver
            partition.foreach(record => {
              // runs on an executor
              val word = record._1
              val count = record._2
              val sql = s"insert into aura.1706wordcount values(now(),'${word}',${count})"
              statement.execute(sql)
            })
            MysqlPool.releaseConn(jdbcCoon)
          })
        })
        // set the checkpoint
        ssc.checkpoint(checkpointDirectory)
        ssc
      }

      def main(args: Array[String]): Unit = {
        val ssc = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)
        // start the program
        ssc.start()
        ssc.awaitTermination()
      }
    }
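    The demo references MysqlPool, whose code the post leaves out. For readers who want to reproduce the error, a minimal hypothetical stand-in (no real pooling; the host, database, and credentials are assumptions) could look like:

    import java.sql.{Connection, DriverManager}

    object MysqlPool {
      // Assumed JDBC URL and credentials; adjust to your environment.
      private val url = "jdbc:mysql://hadoop1:3306/aura"
      def getJdbcCoon(): Connection = DriverManager.getConnection(url, "root", "root")
      def releaseConn(conn: Connection): Unit = conn.close()
    }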

    This code is a demo of Spark Streaming writing to MySQL with the foreachRDD operator. The real connection-pool code is omitted because it is not the point here (the stand-in above is only a sketch); it will be given in a separate post dedicated to Spark Streaming. The code looks fine, yet running it fails with:

    org.apache.spark.SparkException: Task not serializable
    Caused by: java.io.NotSerializableException: java.lang.Object

    The task could not be serialized. But what exactly is being serialized here? Checking the official documentation, the introduction to foreachRDD includes this example:

    dstream.foreachRDD { rdd =>
      val connection = createNewConnection()  // executed at the driver
      rdd.foreach { record =>
        connection.send(record) // executed at the worker
      }
    }

    The comments show that the body of foreachRDD executes on the driver, while the function passed to foreach executes on the workers. When we submit an application, the submit action itself happens on the driver side; the machine we submit from is the driver. So which code runs on the driver?

        val conf = new SparkConf()
        conf.setAppName(s"${this.getClass.getSimpleName}").setMaster("local[2]")
        val sc = new SparkContext(conf)
        val ssc: StreamingContext = new StreamingContext(sc, Seconds(1))

    The initialization code above, along with calls such as textFile and foreachRDD, executes on the driver,

    while operators such as map, flatMap, reduceByKey, foreach, and foreachPartition execute on the workers.
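    A quick way to see the split (a minimal sketch, not from the original post): the first println runs once on the driver, while the println inside the closure runs on the executors, so in a real cluster its output shows up in the executor logs rather than the driver console.

    val rdd = sc.parallelize(1 to 4)
    println("on the driver")                          // driver side
    rdd.foreach(x => println(s"on an executor: $x"))  // worker side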

    Anything that travels from the driver to a worker must be serialized first. In the code above, statement is created on the driver inside foreachRDD but used on an executor inside foreachPartition; a JDBC connection and its Statement are not serializable, so the job fails with Task not serializable. How do we solve this?

    1. Make the offending object serializable (a sketch follows this list).

    2. If the object cannot be serialized, don't create it on the driver side of foreachRDD; create it inside foreachPartition so that it lives entirely on the executor.
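    For option 1, a common trick is to keep the wrapper class serializable while marking the connection field @transient lazy: the field is skipped during serialization and rebuilt on whichever executor first touches it. A minimal sketch; WordSink is a hypothetical helper, not something from the original post:

    import java.sql.{Connection, DriverManager}

    class WordSink(url: String) extends Serializable {
      // Skipped when the closure is serialized; recreated lazily on the executor.
      @transient lazy val conn: Connection = DriverManager.getConnection(url)
    }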

    Option 2 is the simpler fix here, so the code above should be rewritten as:

        resultDStream.foreachRDD(rdd => {
          // runs on the driver
          rdd.foreachPartition(partition => {
            // runs on an executor; the connection is created here, so nothing
            // unserializable has to cross the driver/worker boundary
            val jdbcCoon = MysqlPool.getJdbcCoon()
            val statement = jdbcCoon.createStatement()
            partition.foreach(record => {
              // runs on an executor
              val word = record._1
              val count = record._2
              val sql = s"insert into aura.1706wordcount values(now(),'${word}',${count})"
              statement.execute(sql)
            })
            MysqlPool.releaseConn(jdbcCoon)
          })
        })
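    One further refinement, not in the original post but worth noting: build the SQL with a PreparedStatement and batch the writes, so each partition issues a single round trip instead of one per record, and the word value is escaped properly. A sketch under the same assumed MysqlPool API:

    rdd.foreachPartition(partition => {
      val jdbcCoon = MysqlPool.getJdbcCoon()
      val ps = jdbcCoon.prepareStatement("insert into aura.1706wordcount values(now(), ?, ?)")
      partition.foreach { case (word, count) =>
        ps.setString(1, word)   // parameters are escaped by the JDBC driver
        ps.setInt(2, count)
        ps.addBatch()
      }
      ps.executeBatch()         // one batch per partition
      ps.close()
      MysqlPool.releaseConn(jdbcCoon)
    })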
• Original post: https://www.cnblogs.com/sanmuqingliang/p/10647216.html