• 【Spark】Integrating Spark Streaming with Flume



    Prerequisites

    1. Make sure Flume is installed. For installation details, see 【Hadoop离线基础总结】日志采集框架Flume.
    2. In Flume's lib directory, replace the outdated scala-library-2.10.5.jar that ships with Flume with scala-library-2.11.8.jar.
    3. Download the required jar from https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-flume_2.11/2.2.0/spark-streaming-flume_2.11-2.2.0.jar
    and put that jar into Flume's lib directory as well (see the sketch after this list).
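
    The jar handling in steps 2 and 3 might look like the following sketch (the source paths are placeholders for wherever you downloaded the jars; the Flume directory is the one used throughout this article):

    cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/lib
    # remove the outdated Scala 2.10 library that ships with Flume
    rm scala-library-2.10.5.jar
    # copy in the Scala 2.11 library and the spark-streaming-flume sink jar (source paths are placeholders)
    cp /path/to/scala-library-2.11.8.jar .
    cp /path/to/spark-streaming-flume_2.11-2.2.0.jar .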


    Spark Streaming polls data from Flume

    In poll mode, Flume buffers events in a custom SparkSink and Spark Streaming actively pulls them from that sink, so the Flume agent should be running before the Spark program starts.

    Steps

    1. Write the Flume configuration file

    Run the following commands on the virtual machine where Flume is installed:

    mkdir -p /export/servers/flume/flume-poll		# the directory to be monitored
    
    cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/conf
    vim flume-poll.conf
    
    # Name the agent's components
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1
    
    # Configure the source (spooling directory)
    a1.sources.r1.channels = c1
    a1.sources.r1.type = spooldir
    a1.sources.r1.spoolDir = /export/servers/flume/flume-poll
    a1.sources.r1.fileHeader = true
    
    # Configure the channel (memory channel)
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 20000
    a1.channels.c1.transactionCapacity = 5000
    
    # Configure the sink (the SparkSink that Spark Streaming will poll from)
    a1.sinks.k1.channel = c1
    a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
    a1.sinks.k1.hostname = node03
    a1.sinks.k1.port = 8888
    a1.sinks.k1.batchSize = 2000
    

    2. Start Flume

    cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/
    
    bin/flume-ng agent -c conf -f conf/flume-poll.conf -n a1 -Dflume.root.logger=DEBUG,CONSOLE
    

    3. Write the Spark Streaming code

    1. Create a Maven project and add the dependencies
    <properties>
        <scala.version>2.11.8</scala.version>
        <spark.version>2.2.0</spark.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-flume_2.11</artifactId>
            <version>2.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>2.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.5</version>
        </dependency>
    
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.11</artifactId>
            <version>2.2.0</version>
        </dependency>
    
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.38</version>
        </dependency>
    
    </dependencies>
    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <testSourceDirectory>src/test/scala</testSourceDirectory>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                    <!--    <verbal>true</verbal>-->
                </configuration>
            </plugin>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.0</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                        <configuration>
                            <args>
                                <arg>-dependencyfile</arg>
                                <arg>${project.build.directory}/.scala_dependencies</arg>
                            </args>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.1.1</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass></mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
    
    2. Write the code
    import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
    import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.{SparkConf, SparkContext}
    
    object SparkFlumePoll {
    
      // State update function for updateStateByKey: add this batch's counts to the running total
      def updateFunc(newValues: Seq[Int],runningCount: Option[Int]): Option[Int] = {
        Option(newValues.sum + runningCount.getOrElse(0))
      }
    
      def main(args: Array[String]): Unit = {
        // Create the SparkConf
        val sparkConf: SparkConf = new SparkConf().set("spark.driver.host", "localhost").setAppName("SparkFlume-Poll").setMaster("local[6]")
        // Create the SparkContext
        val sparkContext = new SparkContext(sparkConf)
        // Set the log level
        sparkContext.setLogLevel("WARN")
        // Create the StreamingContext with a 5-second batch interval
        val streamingContext = new StreamingContext(sparkContext, Seconds(5))
        // Checkpoint directory, required by updateStateByKey
        streamingContext.checkpoint("./poll-Flume")
    
        // Use FlumeUtils.createPollingStream to pull data from Flume's SparkSink
        /*
        createPollingStream parameters:
          ssc: StreamingContext,
          hostname: String,
          port: Int
         */
        val stream: ReceiverInputDStream[SparkFlumeEvent] = FlumeUtils.createPollingStream(streamingContext, "node03", 8888)
        // Every record pulled from Flume arrives wrapped in a SparkFlumeEvent
    
        // Unwrap the SparkFlumeEvent payloads into a DStream of Strings
        val line: DStream[String] = stream.map(x => {
          // x is a SparkFlumeEvent; event.getBody returns the raw payload as a byte buffer
          val array: Array[Byte] = x.event.getBody.array()
          // Convert the byte array to a String
          val str = new String(array)
          str
        })
    
        // Word count, accumulating counts across batches with updateStateByKey
        val value: DStream[(String, Int)] = line.flatMap(_.split(" ")).map((_, 1)).updateStateByKey(updateFunc)
    
        // Print the results
        value.print()
    
        streamingContext.start()
        streamingContext.awaitTermination()
      }
    }
    

    4. Copy text files into the monitored directory

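    For example, a file containing the words seen in the console output below can be created and dropped into the monitored directory like this (a sketch; any plain text file works, and the target path is the spoolDir configured above):

    echo -e "hello world\nsqoop hive\nabb test\nhello hive" > /tmp/wordcount.txt
    cp /tmp/wordcount.txt /export/servers/flume/flume-poll/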

    Console output (note that the counts accumulate across batches, because updateStateByKey keeps the running totals in state backed by the checkpoint directory)
    
    -------------------------------------------
    Time: 1586877095000 ms
    -------------------------------------------
    
    -------------------------------------------
    Time: 1586877100000 ms
    -------------------------------------------
    
    20/04/14 23:11:44 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
    20/04/14 23:11:44 WARN BlockManager: Block input-0-1586877094060 replicated to only 0 peer(s) instead of 1 peers
    -------------------------------------------
    Time: 1586877105000 ms
    -------------------------------------------
    (world,1)
    (hive,2)
    (hello,2)
    (sqoop,1)
    (test,1)
    (abb,1)
    
    -------------------------------------------
    Time: 1586877110000 ms
    -------------------------------------------
    (world,1)
    (hive,2)
    (hello,2)
    (sqoop,1)
    (test,1)
    (abb,1)
    
    -------------------------------------------
    Time: 1586877115000 ms
    -------------------------------------------
    (world,1)
    (hive,2)
    (hello,2)
    (sqoop,1)
    (test,1)
    (abb,1)
    
    20/04/14 23:11:57 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
    20/04/14 23:11:57 WARN BlockManager: Block input-0-1586877094061 replicated to only 0 peer(s) instead of 1 peers
    -------------------------------------------
    Time: 1586877120000 ms
    -------------------------------------------
    (world,2)
    (hive,4)
    (hello,4)
    (sqoop,2)
    (test,2)
    (abb,2)
    
    -------------------------------------------
    Time: 1586877125000 ms
    -------------------------------------------
    (world,2)
    (hive,4)
    (hello,4)
    (sqoop,2)
    (test,2)
    (abb,2)
    

    Flume pushes data to Spark Streaming

    In push mode, Flume's avro sink actively sends events to the Spark Streaming receiver, so the Spark program must be running and listening on the configured host and port before the Flume agent is started.

    Steps

    1. Write the Flume configuration file

    mkdir -p /export/servers/flume/flume-push/
    
    cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/conf
    vim flume-push.conf
    
    #push mode
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1
    #source
    a1.sources.r1.channels = c1
    a1.sources.r1.type = spooldir
    a1.sources.r1.spoolDir = /export/servers/flume/flume-push
    a1.sources.r1.fileHeader = true
    #channel
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 20000
    a1.channels.c1.transactionCapacity = 5000
    #sinks
    a1.sinks.k1.channel = c1
    a1.sinks.k1.type = avro
    # Note: this IP must be the address of the machine where the Spark program runs, i.e. the local development machine
    a1.sinks.k1.hostname = 192.168.0.105
    a1.sinks.k1.port = 8888
    a1.sinks.k1.batchSize = 2000
    

    2. Start Flume

    cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/
    
    bin/flume-ng agent -c conf -f conf/flume-push.conf -n a1 -Dflume.root.logger=DEBUG,CONSOLE
    

    3. Write the code

    package cn.itcast.sparkstreaming.demo4
    
    import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
    import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.{SparkConf, SparkContext}
    
    object SparkFlumePush {
      def main(args: Array[String]): Unit = {
        // Create the SparkConf
        val sparkConf: SparkConf = new SparkConf().setAppName("SparkFlume-Push").setMaster("local[6]").set("spark.driver.host", "localhost")
        // Create the SparkContext
        val sparkContext = new SparkContext(sparkConf)
        sparkContext.setLogLevel("WARN")
        // Create the StreamingContext with a 5-second batch interval
        val streamingContext = new StreamingContext(sparkContext, Seconds(5))

        // Start a receiver listening on this host/port; Flume's avro sink pushes events to it
        val stream: ReceiverInputDStream[SparkFlumeEvent] = FlumeUtils.createStream(streamingContext, "192.168.0.105", 8888)
    
        // Unwrap each SparkFlumeEvent body into a String
        val value: DStream[String] = stream.map(x => {
          val array: Array[Byte] = x.event.getBody.array()
    
          val str = new String(array)
          str
        })
    
        value.print()
    
        streamingContext.start()
        streamingContext.awaitTermination()
      }
    
    }
    

    4. Copy text files into the monitored directory

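    As before, drop a plain text file into the spool directory, for example (a sketch reusing the file created in the poll example):

    cp /tmp/wordcount.txt /export/servers/flume/flume-push/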

    Console output (the raw lines are printed as-is; no word counting is done in this example)
    
    -------------------------------------------
    Time: 1586882385000 ms
    -------------------------------------------
    
    20/04/15 00:39:45 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
    20/04/15 00:39:45 WARN BlockManager: Block input-0-1586882384800 replicated to only 0 peer(s) instead of 1 peers
    -------------------------------------------
    Time: 1586882390000 ms
    -------------------------------------------
    hello world
    sqoop hive
    abb test
    hello hive
    
    -------------------------------------------
    Time: 1586882395000 ms
    -------------------------------------------
    