• spark-streaming first insight


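    I. Overview

    Spark Streaming is a scalable, high-throughput, fault-tolerant stream processing module built on top of the Spark core API.

    1) It supports many data sources, such as Kafka, Flume, sockets, and files: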

    • Basic sources: Sources directly available in the StreamingContext API. Examples: file systems, and socket connections.
    • Advanced sources: Sources like Kafka, Flume, Kinesis, etc. are available through extra utility classes. These require linking against extra dependencies.

    2) Processed data can be written to many destinations, such as Kafka, HDFS, and local files.
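
A minimal PySpark sketch of one basic source and one sink, assuming a text server is listening on a socket. The host, port, output prefix, application name, and the helper names `word_counts` and `main` are illustrative assumptions, not part of the original notes.

```python
def word_counts(lines):
    """Build (word, count) pairs per batch from a DStream of text lines."""
    return (lines.flatMap(lambda line: line.split())
                 .map(lambda word: (word, 1))
                 .reduceByKey(lambda a, b: a + b))


def main(host="localhost", port=9999, out_prefix="/tmp/wordcounts"):
    # Imported here so the helper above can be read without Spark installed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    # At least 2 local threads: one for the receiver, one for processing.
    sc = SparkContext("local[2]", "SocketWordCount")
    ssc = StreamingContext(sc, 5)              # 5-second batch interval (assumed)
    lines = ssc.socketTextStream(host, port)   # basic source: socket connection
    counts = word_counts(lines)
    counts.pprint()                            # print a few records of each batch
    counts.saveAsTextFiles(out_prefix)         # one output per batch, named by batch time
    ssc.start()
    ssc.awaitTermination()
```

Calling `main()` starts the job; writing to Kafka or another external system would instead go through an output operation such as `foreachRDD`.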

    DStream:

    Spark Streaming provides a high-level abstraction for continuously arriving data, the discretized stream (DStream):

    It represents a continuous stream of data.

    A DStream is represented by a continuous series of RDDs. Each RDD in a DStream contains data from a certain interval.

    Any operation applied on a DStream translates to operations on the underlying RDDs.
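
The relationship between a DStream and its RDDs can be sketched with a plain-Python model. This is not Spark code: `MiniDStream` and its methods are hypothetical, and each inner list merely stands in for the RDD of one batch interval.

```python
from typing import Callable, List


class MiniDStream:
    """Toy stand-in for a DStream: one list per batch interval (hypothetical)."""

    def __init__(self, batches: List[list]):
        self.batches = batches  # each inner list plays the role of one RDD

    def map(self, fn: Callable) -> "MiniDStream":
        # A stream-level map is the same map applied to every batch.
        return MiniDStream([[fn(x) for x in batch] for batch in self.batches])

    def flat_map(self, fn: Callable) -> "MiniDStream":
        # Likewise, flat_map flattens the results within each batch separately.
        return MiniDStream(
            [[y for x in batch for y in fn(x)] for batch in self.batches]
        )


lines = MiniDStream([["a b", "c"], [], ["d e f"]])  # three batch intervals
words = lines.flat_map(str.split)
lengths = words.map(len)
```

Batch boundaries are preserved: an empty interval stays an empty batch, and no operation mixes data from different intervals unless a window is applied.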

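    What is an RDD?

    RDD stands for Resilient Distributed Dataset and is the most important concept in Spark.

    An RDD is a read-only, partitioned, fault-tolerant collection of data.

    What does "resilient" mean?

    An RDD's data can move freely between memory and disk.

    An RDD can be transformed into other RDDs and can be derived from other RDDs.

    An RDD can store data of any type.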

    II. Basic concepts

    1) Add the dependency

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.11</artifactId>
        <version>2.3.1</version>
    </dependency>

    Other related dependencies can be looked up at:

    https://search.maven.org/search?q=g:org.apache.spark%20AND%20v:2.2.0

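    2) How is a directory monitored when files are the DStream source?

    1) The files must all have the same data format.

    2) A file becomes part of the stream based on its modification time, not its creation time.

    3) Once a file has been processed, later changes to it within the current window are ignored; the file is not reread.

    4) Calling FileSystem.setTimes() to change a file's timestamp makes it eligible for processing in a later window, even if its contents have not changed.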
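
The modification-time rule can be illustrated with a simplified plain-Python model. This is not Spark's actual file-monitoring logic: `files_in_window` is a hypothetical helper, the timestamps are made up, and `os.utime` plays the role that `FileSystem.setTimes()` plays on a Hadoop file system.

```python
import os
import tempfile


def files_in_window(directory, window_start, window_end):
    """Return names of files whose modification time is in [window_start, window_end)."""
    picked = []
    for name in sorted(os.listdir(directory)):
        mtime = os.path.getmtime(os.path.join(directory, name))
        if window_start <= mtime < window_end:
            picked.append(name)
    return picked


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "events.log")
    with open(path, "w") as f:
        f.write("hello\n")
    os.utime(path, (100.0, 100.0))             # pretend it was modified at t=100
    first = files_in_window(d, 90.0, 120.0)    # selected in this window
    second = files_in_window(d, 120.0, 150.0)  # not selected again in the next one
    os.utime(path, (130.0, 130.0))             # bump the timestamp, content unchanged
    third = files_in_window(d, 120.0, 150.0)   # eligible again
```

Here `first` and `third` contain the file while `second` is empty, which mirrors points 2) to 4): selection depends only on the modification timestamp, never on whether the contents changed.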

    III. Transform operations

    Window operations

    Spark Streaming also provides windowed computations, which allow you to apply transformations over a sliding window of data.

    Every time the window slides over a source DStream, the source RDDs that fall within the window are combined and operated upon to produce the RDDs of the windowed DStream.

    In other words, the RDDs that fall within one time window are merged into a single RDD for processing.

    Any window operation needs to specify two parameters:

    window length: The duration of the window.

    sliding interval: The interval at which the window operation is performed.

    Both parameters must be multiples of the batch interval of the source DStream.
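
How the two parameters interact can be sketched in plain Python. This is a model, not Spark code: `windowed` is a hypothetical helper, and both parameters are expressed as a number of batch intervals rather than as durations.

```python
def windowed(batches, window_length, slide_interval):
    """Combine batches into windows; both arguments are counted in batch intervals."""
    out = []
    # A window is emitted every slide_interval batches and covers
    # the most recent window_length batches available at that point.
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        merged = [x for batch in batches[start:end] for x in batch]
        out.append(merged)
    return out


batches = [[1], [2], [3], [4], [5], [6]]  # one list per batch interval
windows = windowed(batches, window_length=3, slide_interval=2)
# windows == [[1, 2], [2, 3, 4], [4, 5, 6]]
```

With a window length of 3 and a sliding interval of 2, consecutive windows overlap by one batch. In PySpark the corresponding call is `dstream.window(windowDuration, slideDuration)`, with both durations given in seconds.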

    IV. Output operations

    Using foreachRDD

    dstream.foreachRDD is a powerful primitive that allows data to be sent out to external systems. However, it is important to understand how to use this primitive correctly and efficiently. 
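
One commonly recommended pattern is to open a connection once per partition on the workers, rather than on the driver or once per record. The sketch below assumes a hypothetical `connect` factory that returns an object with `send` and `close` methods; `send_partition` and `attach_sink` are illustrative names as well.

```python
def send_partition(records, connect):
    """Send all records of one partition over a single connection."""
    conn = connect()  # created on the worker, once per partition
    try:
        for record in records:
            conn.send(record)
    finally:
        conn.close()  # always release the connection, even on failure


def attach_sink(dstream, connect):
    # The function passed to foreachRDD runs on the driver for every batch;
    # foreachPartition then runs send_partition on the workers.
    dstream.foreachRDD(
        lambda rdd: rdd.foreachPartition(
            lambda part: send_partition(part, connect)
        )
    )
```

Creating the connection inside `send_partition` avoids having to serialize a connection object from the driver to the workers, and amortizes the connection cost over all records of a partition. A connection pool would reduce the overhead further.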

    Checkpointing concepts

    Performance Tuning

    Fault-tolerance Semantics

  • Original article: https://www.cnblogs.com/gm-201705/p/9533271.html