• 15. Spark Streaming Source Code Walkthrough: A Thorough Look at No Receivers



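    The previous articles in this series walked through the source code of Receiver-based Spark Streaming applications. Increasingly, however, Spark Streaming applications are built with the No Receivers (Direct Approach) model, which has two advantages:
    1. Greater control
    2. Semantic consistency
    The No Receivers approach also matches the way we naturally think about reading and processing data. Spark is a computation framework that sits on top of some data source; without Receivers, we operate on that data source directly, which is the more natural way to work. To operate on a data source we need a wrapper around it, and that wrapper must be an RDD. Taking direct access to Kafka as an example, here is the sample code from the Spark source tree that reads data directly from Kafka: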
    import kafka.serializer.StringDecoder

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming._
    import org.apache.spark.streaming.kafka._

    object DirectKafkaWordCount {
      def main(args: Array[String]) {
        if (args.length < 2) {
          System.err.println(s"""
            |Usage: DirectKafkaWordCount <brokers> <topics>
            |  <brokers> is a list of one or more Kafka brokers
            |  <topics> is a list of one or more kafka topics to consume from
            |
            """.stripMargin)
          System.exit(1)
        }

        // StreamingExamples is a helper object from the Spark examples package
        StreamingExamples.setStreamingLogLevels()

        val Array(brokers, topics) = args

        // Create context with 2 second batch interval
        val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
        val ssc = new StreamingContext(sparkConf, Seconds(2))

        // Create direct kafka stream with brokers and topics
        val topicsSet = topics.split(",").toSet
        val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
        val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, topicsSet)

        // Get the lines, split them into words, count the words and print
        val lines = messages.map(_._2)
        val words = lines.flatMap(_.split(" "))
        val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
        wordCounts.print()

        // Start the computation
        ssc.start()
        ssc.awaitTermination()
      }
    }

    Spark Streaming wraps the data source in an RDD, namely KafkaRDD:

    /**
     * A batch-oriented interface for consuming from Kafka.
     * Starting and ending offsets are specified in advance,
     * so that you can control exactly-once semantics.
     * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
     * configuration parameters</a>. Requires "metadata.broker.list" or "bootstrap.servers" to be set
     * with Kafka broker(s) specified in host1:port1,host2:port2 form.
     * @param offsetRanges offset ranges that define the Kafka data belonging to this RDD
     * @param messageHandler function for translating each message into the desired type
     */
    private[kafka]
    class KafkaRDD[
      K: ClassTag,
      V: ClassTag,
      U <: Decoder[_]: ClassTag,
      T <: Decoder[_]: ClassTag,
      R: ClassTag] private[spark] (
        sc: SparkContext,
        kafkaParams: Map[String, String],
        val offsetRanges: Array[OffsetRange], // offset ranges of the data in this RDD
        leaders: Map[TopicAndPartition, (String, Int)],
        messageHandler: MessageAndMetadata[K, V] => R
      ) extends RDD[R](sc, Nil) with Logging with HasOffsetRanges

    As you can see, KafkaRDD mixes in HasOffsetRanges, which is a trait:

    trait HasOffsetRanges {
      def offsetRanges: Array[OffsetRange]
    }

    OffsetRange identifies the topic, partition, starting offset (inclusive), and ending offset (exclusive) of the data in the RDD:

    final class OffsetRange private(
        val topic: String,
        val partition: Int,
        val fromOffset: Long,
        val untilOffset: Long) extends Serializable
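
    To make the half-open offset convention concrete, here is a minimal sketch in plain Scala with no Spark dependency. SimpleOffsetRange and OffsetRangeDemo are hypothetical stand-ins invented for illustration; they are not the real Spark classes, and the topic name and offsets are made up.

```scala
// Hypothetical stand-in for Spark's OffsetRange (illustration only).
final case class SimpleOffsetRange(
    topic: String,
    partition: Int,
    fromOffset: Long,  // inclusive
    untilOffset: Long  // exclusive
) {
  require(fromOffset <= untilOffset, "fromOffset must not exceed untilOffset")

  // Number of messages covered by the half-open range [fromOffset, untilOffset)
  def count: Long = untilOffset - fromOffset
}

object OffsetRangeDemo {
  // Made-up ranges: partition 0 has 50 new messages, partition 1 has none.
  val ranges: Seq[SimpleOffsetRange] = Seq(
    SimpleOffsetRange("logs", 0, 100L, 150L),
    SimpleOffsetRange("logs", 1, 40L, 40L)
  )

  // Total number of messages described by a set of ranges
  def totalMessages(rs: Seq[SimpleOffsetRange]): Long = rs.map(_.count).sum
}
```

    Because the ranges are fixed before any data is read, the size of a batch is known from the offsets alone, without touching Kafka.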

    Back in KafkaRDD, look at its getPartitions method:

    override def getPartitions: Array[Partition] = {
      offsetRanges.zipWithIndex.map { case (o, i) =>
        val (host, port) = leaders(TopicAndPartition(o.topic, o.partition))
        new KafkaRDDPartition(i, o.topic, o.partition, o.fromOffset, o.untilOffset, host, port)
      }.toArray
    }

    It returns an array of KafkaRDDPartition objects, one per offset range:

    private[kafka]
    class KafkaRDDPartition(
      val index: Int,
      val topic: String,
      val partition: Int,
      val fromOffset: Long,
      val untilOffset: Long,
      val host: String,
      val port: Int
    ) extends Partition {
      /** Number of messages this partition refers to */
      def count(): Long = untilOffset - fromOffset
    }
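
    The one-to-one mapping from offset ranges to RDD partitions can be sketched without Spark. In the following, SketchRange, SketchPartition and toPartitions are hypothetical names, and a plain (topic, partition) tuple stands in for Kafka's TopicAndPartition; this only mirrors the shape of getPartitions shown above.

```scala
object PartitionMappingSketch {
  // Hypothetical stand-ins for OffsetRange and KafkaRDDPartition.
  final case class SketchRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long)

  final case class SketchPartition(
      index: Int,
      topic: String,
      partition: Int,
      fromOffset: Long,
      untilOffset: Long,
      host: String,
      port: Int) {
    def count: Long = untilOffset - fromOffset
  }

  // Each offset range becomes exactly one partition, tagged with the
  // host and port of the broker that leads that Kafka partition.
  def toPartitions(
      ranges: Array[SketchRange],
      leaders: Map[(String, Int), (String, Int)]): Array[SketchPartition] =
    ranges.zipWithIndex.map { case (r, i) =>
      val (host, port) = leaders((r.topic, r.partition))
      SketchPartition(i, r.topic, r.partition, r.fromOffset, r.untilOffset, host, port)
    }
}
```

    One consequence of this design is that the number of RDD partitions equals the number of offset ranges, so the parallelism of a batch follows the number of Kafka partitions being read.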

    KafkaRDDPartition describes precisely where the data lives. The data of each KafkaRDDPartition is handed to KafkaRDD's compute method:

    override def compute(thePart: Partition, context: TaskContext): Iterator[R] = {
      val part = thePart.asInstanceOf[KafkaRDDPartition]
      assert(part.fromOffset <= part.untilOffset, errBeginAfterEnd(part))
      if (part.fromOffset == part.untilOffset) {
        log.info(s"Beginning offset ${part.fromOffset} is the same as ending offset " +
          s"skipping ${part.topic} ${part.partition}")
        Iterator.empty
      } else {
        new KafkaRDDIterator(part, context)
      }
    }

    For a non-empty range, KafkaRDD's compute method returns a KafkaRDDIterator:

    private class KafkaRDDIterator(
        part: KafkaRDDPartition,
        context: TaskContext) extends NextIterator[R] {
      context.addTaskCompletionListener{ context => closeIfNeeded() }
      log.info(s"Computing topic ${part.topic}, partition ${part.partition} " +
        s"offsets ${part.fromOffset} -> ${part.untilOffset}")
      val kc = new KafkaCluster(kafkaParams)
      val keyDecoder = classTag[U].runtimeClass.getConstructor(classOf[VerifiableProperties])
        .newInstance(kc.config.props)
        .asInstanceOf[Decoder[K]]
      val valueDecoder = classTag[T].runtimeClass.getConstructor(classOf[VerifiableProperties])
        .newInstance(kc.config.props)
        .asInstanceOf[Decoder[V]]
      val consumer = connectLeader
      var requestOffset = part.fromOffset
      var iter: Iterator[MessageAndOffset] = null
      //..................
    }

    KafkaRDDIterator creates a KafkaCluster object, which it uses to talk to the Kafka cluster and fetch the data.
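
    The elided body of the iterator is not reproduced here. As a rough sketch of the general idea only, the following plain-Scala code reads a bounded offset range from an in-memory log in small batches, tracking a requestOffset and stopping at untilOffset. Everything in it (BoundedFetchSketch, Message, fetch, read, the Vector used as a log, the batch size) is a hypothetical stand-in and not the actual Spark or Kafka API.

```scala
object BoundedFetchSketch {
  // A message together with its offset; the vector index plays the role of the offset.
  final case class Message(offset: Long, value: String)

  // Stand-in for a fetch request to the partition leader:
  // returns up to maxBatch messages starting at the given offset.
  def fetch(log: Vector[String], offset: Long, maxBatch: Int): Seq[Message] =
    log.zipWithIndex
      .slice(offset.toInt, offset.toInt + maxBatch)
      .map { case (v, i) => Message(i.toLong, v) }

  // Iterate over the half-open range [fromOffset, untilOffset).
  def read(
      log: Vector[String],
      fromOffset: Long,
      untilOffset: Long,
      maxBatch: Int = 2): Iterator[String] = {
    require(fromOffset <= untilOffset, "beginning offset is after the ending offset")
    if (fromOffset == untilOffset) {
      Iterator.empty // nothing to read, mirroring the empty-range branch of compute
    } else {
      new Iterator[String] {
        private var requestOffset = fromOffset
        private var batch: Iterator[Message] = Iterator.empty

        def hasNext: Boolean = {
          // Fetch the next batch only when the current one is exhausted
          if (!batch.hasNext && requestOffset < untilOffset) {
            batch = fetch(log, requestOffset, maxBatch).iterator
          }
          batch.hasNext && requestOffset < untilOffset
        }

        def next(): String = {
          if (!hasNext) throw new NoSuchElementException("offset range exhausted")
          val m = batch.next()
          requestOffset = m.offset + 1 // the next message we expect
          m.value
        }
      }
    }
  }
}
```

    The point of the sketch is that the end of the range is decided before reading starts, so re-running the same partition reads the same messages again.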

    Returning to the example at the beginning, we used KafkaUtils.createDirectStream to create an InputDStream:

    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet)

    Here is the source of createDirectStream:

    def createDirectStream[
      K: ClassTag,
      V: ClassTag,
      KD <: Decoder[K]: ClassTag,
      VD <: Decoder[V]: ClassTag] (
        ssc: StreamingContext,
        kafkaParams: Map[String, String],
        topics: Set[String]
    ): InputDStream[(K, V)] = {
      val messageHandler = (mmd: MessageAndMetadata[K, V]) => (mmd.key, mmd.message)
      // Create the KafkaCluster object
      val kc = new KafkaCluster(kafkaParams)
      // Use kc to look up the starting offsets
      val fromOffsets = getFromOffsets(kc, kafkaParams, topics)
      new DirectKafkaInputDStream[K, V, KD, VD, (K, V)](
        ssc, kafkaParams, fromOffsets, messageHandler)
    }

    It first uses KafkaCluster to obtain the starting offsets from the Kafka cluster, then creates and returns a DirectKafkaInputDStream.
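
    The body of getFromOffsets is not shown in this article. The sketch below illustrates one plausible policy, under the assumption that the starting point is chosen from the "auto.offset.reset" entry in kafkaParams: the earliest available offsets when it is "smallest", otherwise the latest. FromOffsetsSketch and chooseStartingOffsets are hypothetical names, and the two offset maps are passed in rather than fetched from a cluster.

```scala
object FromOffsetsSketch {
  // (topic, partition) stands in for Kafka's TopicAndPartition.
  type TopicPartition = (String, Int)

  // Assumed policy: start from the earliest offsets only when the
  // configuration explicitly asks for "smallest"; otherwise start at the latest.
  def chooseStartingOffsets(
      kafkaParams: Map[String, String],
      earliest: Map[TopicPartition, Long],
      latest: Map[TopicPartition, Long]): Map[TopicPartition, Long] =
    kafkaParams.get("auto.offset.reset").map(_.toLowerCase) match {
      case Some("smallest") => earliest
      case _                => latest
    }
}
```

    Under this assumption, a stream created without that setting would only see messages that arrive after it starts.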

    The source of DirectKafkaInputDStream's compute method:

    override def compute(validTime: Time): Option[KafkaRDD[K, V, U, T, R]] = {
      // Determine the latest ending offsets for this batch
      val untilOffsets = clamp(latestLeaderOffsets(maxRetries))
      // Create the KafkaRDD from the starting and ending offsets
      val rdd = KafkaRDD[K, V, U, T, R](
        context.sparkContext, kafkaParams, currentOffsets, untilOffsets, messageHandler)

      // Report the record number and metadata of this batch interval to InputInfoTracker.
      val offsetRanges = currentOffsets.map { case (tp, fo) =>
        val uo = untilOffsets(tp)
        OffsetRange(tp.topic, tp.partition, fo, uo.offset)
      }
      val description = offsetRanges.filter { offsetRange =>
        // Don't display empty ranges.
        offsetRange.fromOffset != offsetRange.untilOffset
      }.map { offsetRange =>
        s"topic: ${offsetRange.topic} partition: ${offsetRange.partition} " +
          s"offsets: ${offsetRange.fromOffset} to ${offsetRange.untilOffset}"
      }.mkString(" ")
      // Copy offsetRanges to immutable.List to prevent from being modified by the user
      val metadata = Map(
        "offsets" -> offsetRanges.toList,
        StreamInputInfo.METADATA_KEY_DESCRIPTION -> description)
      val inputInfo = StreamInputInfo(id, rdd.count, metadata)
      ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)

      currentOffsets = untilOffsets.map(kv => kv._1 -> kv._2.offset)
      Some(rdd)
    }

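    As you can see, DirectKafkaInputDStream's compute method first obtains the data offsets from the Kafka cluster and then uses those offsets to create the RDD. This differs from the way RDDs are created in the Receiver-based approach.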
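
    The per-batch bookkeeping in compute can be sketched in plain Scala: clamp the latest offsets, build one range per partition from the current offsets, then advance the current offsets for the next batch. DirectStreamBatchSketch, BatchRange, nextBatch and the maxPerPartition cap are hypothetical names introduced for illustration, and the clamp shown here is a simplified assumption about how a per-partition limit might be applied.

```scala
object DirectStreamBatchSketch {
  // (topic, partition) stands in for Kafka's TopicAndPartition.
  type TopicPartition = (String, Int)

  final case class BatchRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long)

  // Limit how far each partition may advance in one batch (hypothetical cap).
  def clamp(
      current: Map[TopicPartition, Long],
      latest: Map[TopicPartition, Long],
      maxPerPartition: Option[Long]): Map[TopicPartition, Long] =
    maxPerPartition match {
      case Some(max) => latest.map { case (tp, lo) => tp -> math.min(current(tp) + max, lo) }
      case None      => latest
    }

  // One batch: returns the ranges to read now and the offsets to start from next time.
  def nextBatch(
      current: Map[TopicPartition, Long],
      latest: Map[TopicPartition, Long],
      maxPerPartition: Option[Long]): (Seq[BatchRange], Map[TopicPartition, Long]) = {
    val until = clamp(current, latest, maxPerPartition)
    val ranges = current.toSeq.map { case (tp, from) =>
      BatchRange(tp._1, tp._2, from, until(tp))
    }
    (ranges, until) // this batch's ending offsets become the next batch's starting offsets
  }
}
```

    Feeding the returned offsets back into nextBatch on the following interval gives contiguous, non-overlapping ranges, which is what lets each batch be described purely by its offsets.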




     



  • Original article: https://www.cnblogs.com/zhouyf/p/5554774.html