0) Overview
This post is a walkthrough of the Spark Streaming 2.2.0 programming guide (http://spark.apache.org/docs/2.2.0/streaming-programming-guide.html). It implements some of the operators in code and finally shows how to integrate Spark Streaming with Spark SQL.
1) DStream
The most basic abstraction in Spark Streaming is the DStream (discretized stream), which represents a continuous stream of data.
A DStream is a sequence of RDDs, one per batch, and the processing of each batch is carried out by Spark Core.
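As a minimal sketch of this idea (the object name, the queue contents and the 1-second batch interval are made up for illustration), the queueStream source below turns a queue of RDDs into a DStream: every RDD pushed into the queue becomes one batch, and each transformation is applied batch by batch by Spark Core.

import scala.collection.mutable

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object QueueStreamDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("QueueStreamDemo").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Each RDD pushed into this queue becomes the RDD of one 1-second batch
    val rddQueue = new mutable.Queue[RDD[Int]]()
    val inputStream = ssc.queueStream(rddQueue)

    // Transformations on the DStream are applied batch by batch,
    // i.e. to the underlying RDDs, and executed by Spark Core
    inputStream.map(_ * 2).foreachRDD(rdd => println(rdd.collect().mkString(", ")))

    ssc.start()
    for (_ <- 1 to 3) {
      rddQueue.synchronized {
        rddQueue += ssc.sparkContext.makeRDD(1 to 5)
      }
      Thread.sleep(1000)
    }
    ssc.stop()
  }
}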
2) Input DStreams and Receivers
Input DStreams: DStreams representing the stream of input data received from a source.
Note: except for file streams, every input DStream needs a receiver to receive the data and hold it in memory until Spark processes it. When running a Spark Streaming program locally, therefore, do not use "local" or "local[1]" as the master, because only one thread would be started; for receiver-based input streams the number of threads n must be greater than the number of receivers. If the data source is sockets, Kafka, Flume, etc., do not use "local" or "local[1]".
Receivers are long-running components in Spark Streaming: each receiver runs as a long-running task on an executor and serves exactly one input DStream. The receiver accepts data from the source and hands it to Spark Core for processing.
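A minimal sketch of a receiver-based input stream, assuming a socket source on localhost:9999 (host, port, object name and storage level are placeholders): the receiver occupies one of the two local threads as a long-running task, and the remaining thread processes the batches.

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ReceiverDemo {
  def main(args: Array[String]): Unit = {
    // "local[2]": one thread runs the socket receiver as a long-running task,
    // the other processes the received batches; "local" or "local[1]" would
    // leave no thread for processing.
    val conf = new SparkConf().setAppName("ReceiverDemo").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // One receiver per input DStream; received data is stored according to
    // the given storage level until Spark processes it
    val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}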
3) Transformations on DStreams (operators)
- map, flatMap, filter, repartition, union, count, reduce, countByValue, reduceByKey, join, cogroup, transform
- updateStateByKey (stateful; requires a checkpoint directory)
- Window operations such as window, countByWindow and reduceByKeyAndWindow (a sketch combining updateStateByKey with a window operation follows this list)
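Below is a minimal sketch combining a stateful and a windowed word count, again assuming a socket source on localhost:9999 and a local checkpoint directory (all placeholders): updateStateByKey keeps a running total per word across batches, while reduceByKeyAndWindow counts over a sliding 30-second window.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulWindowWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StatefulWindowWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // updateStateByKey needs a checkpoint directory to store the state
    ssc.checkpoint("/tmp/spark-streaming-checkpoint")

    val pairs = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))

    // updateStateByKey: running count per word across all batches so far
    val runningCounts = pairs.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
      Some(newValues.sum + state.getOrElse(0))
    }
    runningCounts.print()

    // reduceByKeyAndWindow: count over the last 30 seconds, sliding every 10 seconds
    val windowedCounts =
      pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
    windowedCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}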
4) Output Operations on DStreams
Output operations write the results of a DStream to external systems such as databases, file systems or simply the console; a few common ones are sketched below.
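A short sketch of common output operations, reusing the socket word count from above (source, output path and object name are placeholders): print writes the first ten elements of each batch to the driver console, saveAsTextFiles writes one directory of files per batch interval, and foreachRDD is the most general operation, giving full access to each batch's RDD.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object OutputOpsDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("OutputOpsDemo").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    val wordCounts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // print: show the first 10 elements of every batch on the driver
    wordCounts.print()
    // saveAsTextFiles: write one directory of text files per batch interval
    wordCounts.saveAsTextFiles("/tmp/streaming-output/wordcounts")
    // foreachRDD: the most general output operation, full access to each batch's RDD
    wordCounts.foreachRDD(rdd => println(s"records in this batch: ${rdd.count()}"))

    ssc.start()
    ssc.awaitTermination()
  }
}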
5) DataFrame and SQL Operations (integrating Spark SQL)
How to use Spark SQL on the data of a DStream.
The code below is adapted from the official Spark example at https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/SqlNetworkWordCount.scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

/**
 * @Author: SmallWild
 * @Date: 2019/10/26 17:29
 * @Desc: word count over a socket stream using Spark SQL
 */
object SqlNetworkWordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("SqlNetworkWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    ssc.sparkContext.setLogLevel("WARN")
    // hostname and port of the socket source; adjust to your environment
    val lines = ssc.socketTextStream("localhost", 1884)
    val words = lines.flatMap(_.split(" "))

    // Convert RDDs of the words DStream to DataFrame and run SQL query
    words.foreachRDD { (rdd: RDD[String], time: Time) =>
      // Get the singleton instance of SparkSession
      val spark = SparkSessionSingleton.getInstance(rdd.sparkContext.getConf)
      import spark.implicits._

      // Convert RDD[String] to RDD[case class] to DataFrame
      val wordsDataFrame = rdd.map(w => Record(w)).toDF()

      // Creates a temporary view using the DataFrame
      wordsDataFrame.createOrReplaceTempView("words")

      // Do word count on table using SQL and print it
      val wordCountsDataFrame =
        spark.sql("select word, count(*) as total from words group by word")
      println(s"========= $time =========")
      wordCountsDataFrame.show()
    }

    ssc.start()
    ssc.awaitTermination()
  }

  /** Case class for converting RDD to DataFrame */
  case class Record(word: String)

  /** Lazily instantiated singleton instance of SparkSession */
  object SparkSessionSingleton {

    @transient private var instance: SparkSession = _

    def getInstance(sparkConf: SparkConf): SparkSession = {
      if (instance == null) {
        instance = SparkSession
          .builder
          .config(sparkConf)
          .getOrCreate()
      }
      instance
    }
  }
}
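To try the example locally, start a socket server first, for example with netcat (nc -lk 1884), and type a few words into it; a word-count table is printed for every 5-second batch. The SparkSession is created lazily as a singleton so that the same instance is reused across all batches.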
6) Using foreachRDD
Purpose: foreachRDD is a powerful primitive that allows data to be sent out to external systems.
The correct way to use it: inside foreachRDD, use foreachPartition to work on one partition at a time, so that a connection is created (or borrowed from a pool) per partition rather than per record, as the two snippets below show.
// Without a connection pool
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // create one connection per partition, not per record
    val connection = createNewConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}
// With a connection pool
dstream.foreachRDD { rdd =>
rdd.foreachPartition { partitionOfRecords =>
// ConnectionPool is a static, lazily initialized pool of connections
val connection = ConnectionPool.getConnection()
partitionOfRecords.foreach(record => connection.send(record))
ConnectionPool.returnConnection(connection) // return to the pool for future reuse
}
}
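The reason for this pattern: a connection object created on the driver generally cannot be serialized and shipped to the executors, and creating a new connection for every record is far too expensive. Creating (or borrowing) one connection per partition amortizes that cost, and a static, lazily initialized connection pool additionally lets connections be reused across batches.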
7) Summary
Spark Streaming is based on micro-batches and cannot achieve millisecond-level latency; if millisecond-level stream processing is required, a dedicated stream-processing framework such as Storm is still needed.