• Spark Streaming: ways to create a DStream


    1. Creating a DStream from an RDD queue

    During testing, you can create a DStream with ssc.queueStream(queueOfRDDs): every RDD pushed into this queue is processed as a batch of the DStream.

    Example:

      import org.apache.spark.SparkConf
      import org.apache.spark.rdd.RDD
      import org.apache.spark.streaming.dstream.{DStream, InputDStream}
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      import scala.collection.mutable

      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreaming")
        // StreamingContext takes two arguments: the SparkConf and the micro-batch interval, here Seconds(3)
        val ssc = new StreamingContext(sparkConf, Seconds(3))
        // Declare the RDD queue
        val rddQueue = new mutable.Queue[RDD[Int]]()
        // oneAtATime = false: consume every queued RDD in each batch; the default (true) consumes at most one RDD per batch
        val inputStream: InputDStream[Int] = ssc.queueStream(rddQueue, oneAtATime = false)
        val mapStream: DStream[(Int, Int)] = inputStream.map((_, 1))
        val reduceStream: DStream[(Int, Int)] = mapStream.reduceByKey(_ + _)
        reduceStream.print()

        // Start the collection
        ssc.start()
        for (i <- 1 to 5) {
          // Push an RDD into the queue
          rddQueue += ssc.sparkContext.makeRDD(seq = 1 to 5, numSlices = 10)
          Thread.sleep(2000)
        }
        // Wait for the streaming context to terminate
        ssc.awaitTermination()
      }
    

    Sample output. With a 3-second batch interval and an RDD pushed every 2 seconds, some batches pick up one queued RDD and some pick up two, which is why the counts alternate between 1 and 2:

    -------------------------------------------
    Time: 1650099129000 ms
    -------------------------------------------
    (4,2)
    (1,2)
    (5,2)
    (2,2)
    (3,2)
    
    -------------------------------------------
    Time: 1650099132000 ms
    -------------------------------------------
    (4,1)
    (1,1)
    (5,1)
    (2,1)
    (3,1)
    
    -------------------------------------------
    Time: 1650099135000 ms
    -------------------------------------------
    (4,2)
    (1,2)
    (5,2)
    (2,2)
    (3,2)
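
    For comparison, here is a minimal sketch (reusing rddQueue and ssc from the example above) of what changes when oneAtATime is left at its default value of true: each batch then pulls at most one RDD from the queue, so queued RDDs are drained one per 3-second batch instead of all at once.

      // oneAtATime defaults to true: at most one queued RDD is consumed per batch interval
      val oneAtATimeStream: InputDStream[Int] = ssc.queueStream(rddQueue)
      oneAtATimeStream.map((_, 1)).reduceByKey(_ + _).print()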

    2. Custom data source

    To implement a custom data source, extend Receiver and override the onStart and onStop methods.

    Implementation:

      import java.io.{BufferedReader, InputStreamReader}
      import java.net.{ConnectException, Socket}
      import java.nio.charset.StandardCharsets

      import org.apache.spark.SparkConf
      import org.apache.spark.storage.StorageLevel
      import org.apache.spark.streaming.dstream.ReceiverInputDStream
      import org.apache.spark.streaming.receiver.Receiver
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreaming")
        // StreamingContext takes two arguments: the SparkConf and the micro-batch interval, here Seconds(3)
        val ssc = new StreamingContext(sparkConf, Seconds(3))

        val lines: ReceiverInputDStream[String] = ssc.receiverStream(new MyReceiver("hadoop103", 9999))
        lines
          .flatMap(_.split(" "))
          .map((_, 1))
          .reduceByKey(_ + _)
          .print()
        // Start the receiver
        ssc.start()

        // Wait for the streaming context to terminate
        ssc.awaitTermination()
      }

      /**
       * Custom data receiver:
       * extend Receiver, choose a storage level, and pass the connection parameters.
       */
      private class MyReceiver(host: String, port: Int) extends Receiver[String](StorageLevel.MEMORY_ONLY) {
        private var socket: Socket = _

        override def onStart(): Unit = {
          new Thread("Socket Receiver") {
            setDaemon(true)

            override def run(): Unit = {
              receive()
            }
          }.start()
        }

        private def receive(): Unit = {
          try {
            // Read data from the socket, line by line
            socket = new Socket(host, port)
            val bf: BufferedReader = new BufferedReader(new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
            var line: String = bf.readLine()
            while (!isStopped() && line != null) {
              // store() is provided by Receiver; it hands the record to Spark for buffering
              store(line)
              line = bf.readLine()
            }
            // The remote end closed the connection; ask Spark to restart the receiver
            restart(s"Connection to $host:$port closed, restarting")
          } catch {
            case e: ConnectException =>
              restart(s"Error connecting to $host:$port", e)
          }
        }

        override def onStop(): Unit = {
          synchronized {
            if (socket != null) {
              socket.close()
              socket = null
            }
          }
        }
      }
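
    To test this receiver locally, a simple approach (assuming netcat is installed on the machine named in the code above) is to start a plain TCP server with "nc -lk 9999" on hadoop103 and type space-separated words into it; the word counts are printed every 3 seconds.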
    

    3. Kafka data source

    ReceiverAPI: a dedicated Executor receives the data and then sends it to other Executors for computation. The drawback is that the receiving Executor and the computing Executors run at different speeds; in particular, when the receiver is faster than the computation, the computing nodes can run out of memory.

    DirectAPI: the computing Executors consume data from Kafka themselves, so the consumption rate is controlled by the consumer. The example below uses the DirectAPI.

    Implementation:

      import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.dstream.InputDStream
      import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreaming")
        // StreamingContext takes two arguments: the SparkConf and the micro-batch interval, here Seconds(3)
        val ssc: StreamingContext = new StreamingContext(sparkConf, Seconds(3))

        // Kafka consumer configuration
        val kafkaPara: Map[String, Object] = Map[String, Object](
          ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "hadoop103:9092,hadoop104:9092,hadoop105:9092", // Kafka brokers
          ConsumerConfig.GROUP_ID_CONFIG -> "hui", // consumer group id
          "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
          "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer")
        // Read data from Kafka with the DirectAPI
        val kfkDataDS: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream[String, String](
          ssc, // streaming context
          LocationStrategies.PreferConsistent, // location strategy: spread partitions evenly across executors
          ConsumerStrategies.Subscribe[String, String](
            Set("tbg"), // Kafka topic(s) to subscribe to
            kafkaPara // Kafka consumer configuration
          ))
        kfkDataDS
          .flatMap(_.value().split(" "))
          .map((_, 1))
          .reduceByKey(_ + _)
          .print()

        /**
         * Handy commands for testing:
         * bin/kafka-topics.sh --bootstrap-server hadoop103:9092 --list
         * bin/kafka-console-producer.sh --bootstrap-server hadoop103:9092 --topic tbg
         */
        // Start the streaming job
        ssc.start()

        // Wait for the streaming context to terminate
        ssc.awaitTermination()
      }
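
    For this example to compile you also need the Kafka integration module on the classpath. A minimal sbt sketch is shown below; the version numbers are assumptions and should match your Spark and Scala versions.

      // build.sbt (versions are placeholders; align them with your cluster)
      libraryDependencies ++= Seq(
        "org.apache.spark" %% "spark-streaming" % "3.0.0" % "provided",
        "org.apache.spark" %% "spark-streaming-kafka-0-10" % "3.0.0"
      )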
  • Original article: https://www.cnblogs.com/wdh01/p/16153434.html