• Day 6 of 2020 Winter Vacation Study Progress


    Today I mainly worked on Spark Lab 6, an introductory exercise in Spark Streaming programming.

    Flume is a very popular log-collection system and can serve as an advanced data source for Spark Streaming. Set the Flume source to the netcat type and keep sending messages of various kinds to it from a terminal. Flume gathers the messages into its sink, whose type is set to avro here; the sink then pushes the messages to Spark Streaming, and a Spark Streaming application that we write ourselves processes them.
    (1) Configure the Flume data source
    1. cd /usr/local/flume
    2. cd conf
    3. vim flume-to-spark.conf
     
    Write the following content into flume-to-spark.conf:
    #flume-to-spark.conf: A single-node Flume configuration

    # Name the components on this agent
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1

    # Describe/configure the source
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 33333

    # Describe the sink
    a1.sinks.k1.type = avro
    a1.sinks.k1.hostname = localhost
    a1.sinks.k1.port = 44444

    # Use a channel which buffers events in memory
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000000
    a1.channels.c1.transactionCapacity = 1000000

    # Bind the source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1
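
    With this file, events flow through the agent as follows: the netcat source r1 listens on localhost:33333, buffered events pass through the memory channel c1, and the avro sink k1 pushes them to localhost:44444, where our Spark Streaming receiver will later be listening.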
    

      

    (2) Prepare Spark
      1. cd /usr/local/spark
      2. ./bin/spark-shell
     
    After it starts successfully, run the following import statement in spark-shell:
    scala> import org.apache.spark.streaming.flume._
     
    You will see that it immediately reports an error, because the relevant jar package cannot be found. So we now need to download spark-streaming-flume_2.11-2.1.0.jar, where 2.11 is the corresponding Scala version and 2.1.0 is the Spark version. In the Linux system, open a Firefox browser and go to http://mvnrepository.com/artifact/org.apache.spark/spark-streaming-flume_2.11/2.1.0, which provides the spark-streaming-flume_2.11-2.1.0.jar file for download.
     
    Then:
    1. cd /usr/local/spark/jars
    2. mkdir flume
    3. cd ~
    4. cd 下载    # the "Downloads" directory on a Chinese-locale desktop
    5. cp ./spark-streaming-flume_2.11-2.1.0.jar /usr/local/spark/jars/flume
    6. cd /usr/local/flume/lib
    7. ls
    8. cp ./* /usr/local/spark/jars/flume
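
    As a quick sanity check (not a step from the original lab), you can list the directory to confirm that the connector jar and Flume's own libraries are all in place:
    1. ls /usr/local/spark/jars/flume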
     
    (3) Write a Spark program that uses the Flume data source
    1. cd /usr/local/spark/mycode
    2. mkdir flume
    3. cd flume
    4. mkdir -p src/main/scala
    5. cd src/main/scala
    6. vim FlumeEventCount.scala
    package org.apache.spark.examples.streaming

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming._
    import org.apache.spark.streaming.flume._
    import org.apache.spark.util.IntParam

    object FlumeEventCount {
      def main(args: Array[String]) {
        if (args.length < 2) {
          System.err.println("Usage: FlumeEventCount <host> <port>")
          System.exit(1)
        }
        StreamingExamples.setStreamingLogLevels()
        val Array(host, IntParam(port)) = args
        val batchInterval = Milliseconds(2000)
        // Create the context and set the batch size
        val sparkConf = new SparkConf().setAppName("FlumeEventCount").setMaster("local[2]")
        val ssc = new StreamingContext(sparkConf, batchInterval)
        // Create a flume stream
        val stream = FlumeUtils.createStream(ssc, host, port, StorageLevel.MEMORY_ONLY_SER_2)
        // Print out the count of events received from this server in each batch
        stream.count().map(cnt => "Received " + cnt + " flume events.").print()
        ssc.start()
        ssc.awaitTermination()
      }
    }
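
    If you want to see the message contents as well, not just a count, a small variation is possible (a sketch under the same setup; decoding each event body as UTF-8 is my assumption about how the telnet input arrives):

    // Sketch: print each Flume event body as a UTF-8 string.
    // SparkFlumeEvent.event exposes the underlying AvroFlumeEvent,
    // whose getBody returns a java.nio.ByteBuffer.
    stream.map { e =>
      val buf = e.event.getBody
      val bytes = new Array[Byte](buf.remaining())
      buf.get(bytes)
      new String(bytes, java.nio.charset.StandardCharsets.UTF_8)
    }.print()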
    

      

    Next, use the vim editor to create a StreamingExamples.scala file and enter the following code, which controls the log output format:
    package org.apache.spark.examples.streaming

    import org.apache.log4j.{Level, Logger}
    import org.apache.spark.internal.Logging

    object StreamingExamples extends Logging {
      /** Set reasonable logging levels for streaming if the user has not configured log4j. */
      def setStreamingLogLevels() {
        val log4jInitialized = Logger.getRootLogger.getAllAppenders.hasMoreElements
        if (!log4jInitialized) {
          // We first log something to initialize Spark's default logging, then we override the
          // logging level.
          logInfo("Setting log level to [WARN] for streaming example." +
            " To override add a custom log4j.properties to the classpath.")
          Logger.getRootLogger.setLevel(Level.WARN)
        }
      }
    }
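
    As the code shows, FlumeEventCount calls StreamingExamples.setStreamingLogLevels() before creating the StreamingContext, so Spark's noisy INFO output is reduced to WARN and the periodic "Received N flume events." lines stay easy to spot.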
    

      

    Then create a simple.sbt file under /usr/local/spark/mycode/flume with the following content:
    name := "Simple Project"
    version := "1.0"
    scalaVersion := "2.11.8"
    libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0"
    libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "2.1.0"
    libraryDependencies += "org.apache.spark" % "spark-streaming-flume_2.11" % "2.1.0"
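
    A note on the two dependency styles above: %% appends the project's Scala binary version to the artifact name automatically, so the spark-core line resolves to spark-core_2.11, while the other two lines spell out the _2.11 suffix explicitly with a plain %. The two styles are interchangeable here.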
    

  Package the project:

     1. cd /usr/local/spark/mycode/flume/
     2. /usr/local/sbt/sbt package    # sbt must be installed before running this step
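
    If packaging succeeds, sbt writes the jar to /usr/local/spark/mycode/flume/target/scala-2.11/simple-project_2.11-1.0.jar, which is exactly the path passed to spark-submit in the next step.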
     
    (4) Test the program
    Close all previously opened terminals. First, open a new terminal (terminal 1) and start the Spark Streaming application with the following commands:
     1. cd /usr/local/spark
     2. ./bin/spark-submit --driver-class-path /usr/local/spark/jars/*:/usr/local/spark/jars/flume/* --class "org.apache.spark.examples.streaming.FlumeEventCount" /usr/local/spark/mycode/flume/target/scala-2.11/simple-project_2.11-1.0.jar localhost 44444
     
    With the command above, we supply localhost and 44444 as the values of the host and port arguments, so the program listens on port 44444 of localhost. Milliseconds(2000) sets the batch interval to 2 seconds, so every 2 seconds the program fetches from that port the messages the Flume sink has sent to it, processes them, counts them, and prints a line such as "Received 0 flume events.". After running the command, the screen shows the program's progress, refreshing every 2 seconds; among the large amount of output you will see important information like the following:
    -------------------------------------------
    Time: 1488029430000 ms
    -------------------------------------------
    Received 0 flume events.
    Because Flume has not been started yet and has not sent any messages to FlumeEventCount, the number of Flume events is 0. Do not close terminal 1; leave it listening. Now open a second terminal and start the Flume agent in it with the following commands:
     1. cd /usr/local/flume
     2. bin/flume-ng agent --conf ./conf --conf-file ./conf/flume-to-spark.conf --name a1 -Dflume.root.logger=INFO,console
    After the agent starts, it keeps listening on port 33333 of localhost, so we can then send messages to the Flume source with the command "telnet localhost 33333". Do not close terminal 2 either; leave it listening.
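
    One detail worth noting (my understanding of how the avro sink behaves, not something the lab text states): the avro sink is a client that connects out to the Spark Streaming receiver listening on port 44444, which is why the Spark application is started before the Flume agent; if the agent comes up first, the sink will log connection errors and keep retrying until the receiver appears.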
     
    Now open a third terminal and run the following command:
    1. telnet localhost 33333
     
    After running this command, type some characters and press Enter a few times in this window. Flume picks up all of these messages and gathers them into the sink, and the sink forwards them to Spark's FlumeEventCount program for processing. You can then see statistics similar to the following in the earlier terminal window running FlumeEventCount:
    -------------------------------------------
    Time: 1488029430000 ms
    -------------------------------------------
    Received 0 flume events.
    # other screen output omitted here
    -------------------------------------------
    Time: 1488029432000 ms
    -------------------------------------------
    Received 8 flume events.
    # other screen output omitted here
    -------------------------------------------
    Time: 1488029434000 ms
    -------------------------------------------
    Received 21 flume events.
     
     
     
    The above are the steps of this lab. In the end the experiment ran into a problem: after starting Flume, I should have been able to see the prompt "Received 21 flume events.", but when I did the experiment and started Flume, the count was still 0. I do not know the reason, and the problem remains unsolved.
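
    A few checks that may help narrow this down (troubleshooting guesses on my part, not steps from the lab): watch the Flume console output in terminal 2 for errors about connecting to localhost:44444, make sure the telnet messages are sent only after the agent is up, and verify that both ports are in the expected state, for example:

    # Guesses at useful checks; assumes netstat is installed.
    # Spark's avro receiver should be LISTENing on 44444 while spark-submit runs:
    netstat -ant | grep 44444
    # Flume's netcat source should be LISTENing on 33333 once the agent is up:
    netstat -ant | grep 33333

    If port 44444 shows no listener, the Spark Streaming application probably failed to start its receiver, and the agent's avro sink has nowhere to deliver events.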