1 Overview
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards.
Internally, it works as follows. Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.
input data stream -> Spark Streaming -> batches of input data -> Spark engine -> batches of processed data
Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.
2 A quick example
# start the data server (netcat listening on port 9999)
$ nc -lk 9999
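For reference, this is roughly the network word count that the official quick example builds against the Spark 1.x Scala API; the application name and the 1-second batch interval are arbitrary choices:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the socket receiver, at least one for processing
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))   // 1-second batch interval

    // Connect to the netcat server started above with `nc -lk 9999`
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)

    wordCounts.print()      // print the first ten elements of each batch

    ssc.start()             // start the computation
    ssc.awaitTermination()  // wait for it to terminate
  }
}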
3 Basic concepts
3.1 Linking
3.2 Initializing StreamingContext
Step 1: Define the input sources by creating input DStreams.
Step 2: Define the streaming computations by applying transformation and output operations to DStreams.
Step 3: Start receiving data and processing it using streamingContext.start().
Step 4: Wait for the processing to be stopped (manually or due to any error) using streamingContext.awaitTermination().
Step 5: The processing can be manually stopped using streamingContext.stop().
Points to remember:
- Once a context has been started, no new streaming computations can be set up or added to it.
- Once a context has been stopped, it cannot be restarted.
- Only one StreamingContext can be active in a JVM at the same time.
- stop() on StreamingContext also stops the SparkContext. To stop only the StreamingContext, set the optional parameter of stop() called stopSparkContext to false.
- A SparkContext can be re-used to create multiple StreamingContexts, as long as the previous StreamingContext is stopped (without stopping the SparkContext) before the next StreamingContext is created.
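Putting the steps and caveats above together, a minimal lifecycle sketch (assuming an existing SparkContext named sc; the batch intervals are arbitrary):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Create a StreamingContext from an existing SparkContext (assumed to be `sc`)
val ssc = new StreamingContext(sc, Seconds(5))

// ... define input DStreams, transformations and output operations here ...

ssc.start()                           // start receiving and processing data
// ... later, once the streaming job should end ...
ssc.stop(stopSparkContext = false)    // stop streaming but keep the SparkContext alive

// the surviving SparkContext can back a new StreamingContext afterwards
val ssc2 = new StreamingContext(sc, Seconds(10))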
3.3 Discretized Streams (DStreams)
Discretized Stream or DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from a source, or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs.
3.4 Input DStreams and Receivers
Every input DStream (except file stream, discussed later in this section) is associated with a Receiver (Scala doc, Java doc) object which receives the data from a source and stores it in Spark’s memory for processing.
Points to remember:
- When running a Spark Streaming program locally, do not use “local” or “local[1]” as the master URL. Either of these means that only one thread will be used for running tasks locally. If you are using an input DStream based on a receiver (e.g. sockets, Kafka, Flume, etc.), then that single thread will be used to run the receiver, leaving no thread for processing the received data. Hence, when running locally, always use “local[n]” as the master URL, where n > number of receivers to run (see Spark Properties for information on how to set the master).
- Extending the logic to running on a cluster, the number of cores allocated to the Spark Streaming application must be more than the number of receivers. Otherwise the system will receive data, but not be able to process it.
Basic Sources:
ssc.fileStream()
ssc.queueStream()
ssc.socketTextStream()
ssc.actorStream()
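Hedged sketches of the basic sources above (textFileStream is the text-file convenience variant of fileStream; the directory path and queue contents are placeholders):

import scala.collection.mutable
import org.apache.spark.rdd.RDD

// assuming `ssc` is an initialized StreamingContext

// file source: watches a directory and reads newly created text files (no receiver needed)
val fileLines = ssc.textFileStream("hdfs://namenode:8020/logs/")   // placeholder path

// queue source: handy for testing; each queued RDD is treated as one batch of data
val rddQueue = new mutable.Queue[RDD[Int]]()
val queueInput = ssc.queueStream(rddQueue)

// socket source: text lines from a TCP connection (receiver-based)
val socketLines = ssc.socketTextStream("localhost", 9999)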
Advanced Sources:
Kafka
Flume
Kinesis
Twitter
Custom Sources:
Input DStreams can also be created out of custom data sources. All you have to do is implement a user-defined receiver (see next section to understand what that is) that can receive data from the custom sources and push it into Spark. See the Custom Receiver Guide for details.
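As a rough illustration only (the class name and the fabricated data below are made up; the real contract is in the Custom Receiver Guide), a user-defined receiver extends Receiver and implements onStart()/onStop():

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// a toy receiver that pushes one line of text into Spark every 500 ms until stopped
class DummySource extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    new Thread("Dummy Source") {
      override def run(): Unit = {
        while (!isStopped()) {
          store("hello from a custom receiver")   // push data into Spark's memory
          Thread.sleep(500)
        }
      }
    }.start()
  }

  def onStop(): Unit = {
    // nothing to clean up; isStopped() makes the receiving thread exit
  }
}

// attach it as an input DStream (assuming an initialized StreamingContext `ssc`)
val customStream = ssc.receiverStream(new DummySource())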
3.5 Transformations on DStreams
Transformations worth discussing in more detail:
UpdateStateByKey Operation
Transform Operation
Window Operation
Any window operation needs to specify two parameters:
window length - The duration of the window (e.g. 3 batch intervals).
sliding interval - The interval at which the window operation is performed (e.g. 2 batch intervals).
Both parameters must be multiples of the batch interval of the source DStream.
For example, to extend the earlier word-count example by generating word counts over the last 30 seconds of data, every 10 seconds:
// Reduce the last 30 seconds of data, every 10 seconds
val windowedWordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
Join Operations: DStreams can be joined with other DStreams using join, leftOuterJoin, rightOuterJoin, and fullOuterJoin.
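A hedged sketch of the first two of these, assuming `pairs` is the (word, 1) DStream from the earlier word-count sketch and that checkpointing has been enabled (updateStateByKey requires it); blacklistRDD is a made-up, precomputed RDD of keys to drop:

// updateStateByKey: maintain a running count per word across batches
def updateCount(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] =
  Some(newValues.sum + runningCount.getOrElse(0))

val runningCounts = pairs.updateStateByKey[Int](updateCount _)

// transform: apply arbitrary RDD-to-RDD operations to each batch, e.g. dropping
// keys that appear in a precomputed blacklist
val cleaned = pairs.transform(rdd => rdd.subtractByKey(blacklistRDD))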
3.6 Output Operations on DStreams
Output operations allow a DStream’s data to be pushed out to external systems like a database or a file system. Since the output operations actually allow the transformed data to be consumed by external systems, they trigger the actual execution of all the DStream transformations (similar to actions for RDDs).
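A common pattern here, sketched below with a hypothetical ConnectionPool helper and a generic dstream, is to use foreachRDD and create connections once per partition rather than once per record:

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a hypothetical, lazily initialized static pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)   // return to the pool for reuse
  }
}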
3.7 DataFrame and SQL Operations
3.8 MLlib Operations
3.9 Caching / Persistence
3.10 Checkpointing
The default interval is a multiple of the batch interval that is at least 10 seconds. It can be set by using dstream.checkpoint(checkpointInterval). Typically, a checkpoint interval of 5 - 10 sliding intervals of a DStream is a good setting to try.
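A minimal sketch of enabling checkpointing (the directory is a placeholder; `ssc` and the stateful `runningCounts` DStream are assumed from the earlier sketches, and the 50-second interval follows the 5 - 10 sliding intervals rule of thumb for a 10-second sliding interval):

// metadata and RDD checkpoints go to a fault-tolerant filesystem (placeholder path)
ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints")

// for a stateful DStream, override the default checkpoint interval explicitly
runningCounts.checkpoint(Seconds(50))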
3.11 Deploying Applications
3.12 Monitoring Applications
4 Performance Tuning
4.1 Reducing the Batch Processing Times
4.2 Setting the Right Batch Interval
4.3 Memory Tuning