• Flume + Kafka + Spark Streaming + HBase + Visualization (Part 3)


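    Spark Streaming
      1) Overview: scalable, highly available, fault-tolerant;
         a one-stop solution
      2) How it works
        Coarse-grained: Spark Streaming receives a live data stream, splits it into batches, and hands each batch to the Spark engine for processing. Each batch becomes an RDD, so this is micro-batch processing.
        Fine-grained: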
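      3) Core concepts
      StreamingContext:
        After start(), no new streaming computations can be set up or added to the context
        After stop(), the context cannot be restarted
        Only one StreamingContext can be active in a JVM at a time
        stop(false) stops only the StreamingContext and leaves the SparkContext running; stop() stops both by default
        val ssc = new StreamingContext(conf, Seconds(2))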
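      DStream:
        Created from an input source or by transforming another DStream
        A continuous series of RDDs, one per batch interval
        DStream operations are carried out as operations on the underlying RDDs
        Input DStreams & Receivers:
          Every input DStream except a file-system source is associated with a Receiver (ReceiverInputDStream)
          For receiver-based DStreams, the number of threads must be greater than the number of Receivers, for example local[2] for a single receiver
        Transformations and output operations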
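
To make the micro-batch idea from 2) concrete, here is a toy model in plain Scala with no Spark involved. It only illustrates how records that arrive within the same batch interval end up in one batch. The names `Event`, `MicroBatchSketch` and `toBatches` are made up for this sketch, and the timestamp stands in for arrival time. Unlike Spark, the sketch does not emit empty batches.

```scala
// Toy model of micro-batching (illustrative names, not Spark API).
final case class Event(timestampMs: Long, payload: String)

object MicroBatchSketch {

  /** Group events into batches of `batchIntervalMs`, keyed by batch start time. */
  def toBatches(events: Seq[Event], batchIntervalMs: Long): Seq[(Long, Seq[String])] =
    events
      .groupBy(e => e.timestampMs / batchIntervalMs) // index of the batch interval
      .toSeq
      .sortBy(_._1)
      .map { case (idx, es) =>
        (idx * batchIntervalMs, es.sortBy(_.timestampMs).map(_.payload))
      }

  def main(args: Array[String]): Unit = {
    val events = Seq(Event(100, "a"), Event(1900, "b"), Event(2100, "c"), Event(4500, "d"))
    // With a 2-second interval: a and b share a batch, c and d get one each.
    toBatches(events, batchIntervalMs = 2000).foreach { case (start, payloads) =>
      println(s"batch starting at $start ms: ${payloads.mkString(", ")}")
    }
  }
}
```

In Spark Streaming each such batch would be one RDD of the DStream, and the interval corresponds to the `Seconds(2)` passed to `StreamingContext`.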
     
      4) Stateful operator: updateStateByKey
      
    package com.spark.train
    
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    
    /*
    Stateful WordCount with Spark Streaming,
    using updateStateByKey
     */
    
    object statefulWordCount {
    
      def main(args: Array[String]): Unit = {
    
        val conf  = new SparkConf().setMaster("local[2]").setAppName("updateStateByKey for wordCount")
        val ssc = new StreamingContext(conf, Seconds(4))
    
        // updateStateByKey requires a checkpoint directory
        ssc.checkpoint("/opt/datas/spark_data/updateStateByKey1")
    
        val lines = ssc.socketTextStream("localhost", 9999)
        val words = lines.flatMap(_.split(" ")).map((_, 1))
        val wordCount = words.updateStateByKey[Int](updateFunc _)
    
        wordCount.print()
    
    
        ssc.start()
        ssc.awaitTermination()
    
      }
    
      /** Update function for updateStateByKey.
        * currentValues: counts for this key in the current batch
        * preValues: running count from previous batches (None the first time a key is seen)
        *
        * Returning Some(n) stores n as the new state for the key.
        */
      def updateFunc(currentValues: Seq[Int], preValues: Option[Int]): Option[Int] = {
    
        val newCount = currentValues.sum
        val preCount = preValues.getOrElse(0)
        Some(newCount + preCount)
      }
    }
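
To see how the state accumulates from batch to batch, the following plain-Scala sketch replays the same kind of update function over a few in-memory batches. It is only a simulation of what updateStateByKey does, not Spark code. `UpdateStateSketch` and `applyBatch` are hypothetical names introduced for illustration.

```scala
object UpdateStateSketch {

  // Same shape as the function passed to updateStateByKey.
  def updateFunc(currentValues: Seq[Int], previous: Option[Int]): Option[Int] =
    Some(currentValues.sum + previous.getOrElse(0))

  /** Apply one batch of (word, 1) pairs to the running state. */
  def applyBatch(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] = {
    val grouped: Map[String, Seq[Int]] =
      batch.groupBy(_._1).map { case (word, pairs) => word -> pairs.map(_._2) }
    // The update function runs for every key seen so far,
    // even when the current batch has no new values for it.
    val keys = state.keySet ++ grouped.keySet
    keys.flatMap { k =>
      updateFunc(grouped.getOrElse(k, Seq.empty), state.get(k)).map(k -> _)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val batches = Seq(Seq("a", "b", "a"), Seq("b"), Seq.empty[String]).map(_.map((_, 1)))
    batches.foldLeft(Map.empty[String, Int]) { (state, batch) =>
      val next = applyBatch(state, batch)
      println(next.toSeq.sorted.mkString(", "))
      next
    }
    // prints:
    // (a,2), (b,1)
    // (a,2), (b,2)
    // (a,2), (b,2)
  }
}
```

The third batch is empty, yet the counts are kept. In the real operator, returning None from the update function is what removes a key from the state.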
    

      

      5) Blacklist filtering
      
    package com.spark.train
    
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    
    
    /**
      * Blacklist filtering
      */
    
    object filterBlackListSparkStreaming {
    
      def main(args: Array[String]): Unit = {
    
        val conf = new SparkConf().setMaster("local[2]").setAppName("filter black list")
        val ssc = new StreamingContext(conf, Seconds(5))
    
        /**
          * Convert the blacklist to an RDD of (name, true) pairs
          *
          */
        val blackList = List("ls","zs")
        val rddBlackList = ssc.sparkContext.parallelize(blackList).map(x => (x, true))
    
        val logs = ssc.socketTextStream("localhost", 9999)
          .map(x => (x.split(",")(1), x))
          .transform(rdd => {
            rdd.leftOuterJoin(rddBlackList)
              .filter(x => !x._2._2.getOrElse(false))
              .map(x => x._2._1)
          })
    
        logs.print()
    
        ssc.start()
        ssc.awaitTermination()
      }
    }
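
The join-then-filter logic can be hard to follow through the nested tuples, so here is the same idea on ordinary Scala collections. This is a sketch for reading along, not a replacement for the RDD version. The input format `"20180808,zs"` (date, name) is an assumption, as are the names `BlacklistFilterSketch` and `filterLogs`.

```scala
object BlacklistFilterSketch {

  /** Keep the lines whose second comma-separated field is not blacklisted. */
  def filterLogs(lines: Seq[String], blackList: Set[String]): Seq[String] = {
    val flags: Map[String, Boolean] = blackList.map(_ -> true).toMap
    lines
      .map(line => (line.split(",")(1), line))                      // (name, line)
      .map { case (name, line) => (name, (line, flags.get(name))) } // left outer join
      .filter { case (_, (_, flag)) => !flag.getOrElse(false) }     // drop blacklisted names
      .map { case (_, (line, _)) => line }                          // back to the raw line
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq("20180808,zs", "20180808,ls", "20180808,ww")
    println(filterLogs(lines, Set("ls", "zs"))) // List(20180808,ww)
  }
}
```

After the left outer join, names that are not on the blacklist carry `None` on the right side, which is why `getOrElse(false)` is needed before filtering.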
    

      

      There are two ways to integrate Flume with Spark Streaming; the code is as follows:
      
    package com.spark.test
    
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.flume.FlumeUtils
    
    object flumeSparkStreaming {
      
      def main(args: Array[String]): Unit = {
    
        val conf = new SparkConf().setMaster("local[2]").setAppName("flumeDstream")
        val ssc = new StreamingContext(conf, Seconds(5))
    
        /** Approach 1 (push-based): createStream
          * Define the Flume DStream.
          * Start Spark Streaming first, then Flume, and finally use telnet to send data for the word count.
          */
        val flumeDstream = FlumeUtils.createStream(ssc, "bigdata", 3333)
        val wordCnt = flumeDstream.map(x => new String(x.event.getBody.array()).trim())
            .flatMap(_.split(" "))
            .map((_,1)).reduceByKey(_ + _)
    
        wordCnt.print()
    
        
        /** Approach 2 (pull-based): createPollingStream
          * The Flume sink must be changed to org.apache.spark.streaming.flume.sink.SparkSink;
          * the rest of the configuration stays the same. Start Flume first: it buffers the data,
          * and Spark Streaming then pulls it.
          */
    //    val XXX = FlumeUtils.createPollingStream(ssc, "bigdata", 3333)
    //      .map(x => new String(x.event.getBody.array()).trim)
    //      .flatMap(_.split(" "))
    //      .map((_, 1))
    //      .reduceByKey(_ + _)
    
        ssc.start()
        ssc.awaitTermination()
      }
    }
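
Both approaches turn the event body into a string with `new String(x.event.getBody.array())`. The body is a `java.nio.ByteBuffer`, and `array()` returns the whole backing array, using the platform default charset for decoding. Whether that ever differs from the actual body depends on how the buffer was created, so treat the following as a defensive sketch only. It assumes UTF-8 payloads, and `EventBodySketch` and `decodeBody` are hypothetical helper names.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

object EventBodySketch {

  /** Decode only the readable region of the buffer, leaving its position untouched. */
  def decodeBody(body: ByteBuffer): String =
    StandardCharsets.UTF_8.decode(body.duplicate()).toString.trim

  def main(args: Array[String]): Unit = {
    val backing = "xxhello world\n".getBytes(StandardCharsets.UTF_8)
    // A buffer that exposes only "hello world\n" out of a larger backing array.
    val body = ByteBuffer.wrap(backing, 2, backing.length - 2)

    println(decodeBody(body))                                      // hello world
    println(new String(body.array(), StandardCharsets.UTF_8).trim) // xxhello world
  }
}
```

In the streaming job this would be used as `flumeDstream.map(x => EventBodySketch.decodeBody(x.event.getBody))`.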

  • Original article: https://www.cnblogs.com/mlxx9527/p/9668779.html