Window queries
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkStreaming_StateFul {

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)

    val conf = new SparkConf().setMaster("local[2]")
      .setAppName(this.getClass.getSimpleName)
      .set("spark.executor.memory", "2g")
      .set("spark.cores.max", "8")
      .setJars(Array("E:\\ScalaSpace\\Spark_Streaming\\out\\artifacts\\Spark_Streaming.jar"))
    val context = new SparkContext(conf)

    // keep the historical count for each key: take the saved state if it exists
    // (default 0 for a key seen for the first time) and add the current batch's counts
    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val currentCount = values.foldLeft(0)(_ + _)
      val previousCount = state.getOrElse(0)
      Some(currentCount + previousCount)
    }

    // step1: create the streaming context with a 5-second batch interval
    val ssc = new StreamingContext(context, Seconds(5))
    ssc.checkpoint(".")

    // step2: create a network input stream on the given ip:port and count the
    // words in the input stream of space-delimited text
    val lines = ssc.socketTextStream("218.193.154.79", 12345)

    val data = lines.flatMap(_.split(" "))
    // every 15s, compute the counts over the preceding 10s window;
    // both values must be multiples of the 5s batch interval
    val wordDstream = data.map(x => (x, 1))
      .reduceByKeyAndWindow(_ + _, _ - _, Seconds(10), Seconds(15))

    // use updateStateByKey to fold the windowed counts into the saved state
    val stateDstream = wordDstream.updateStateByKey[Int](updateFunc)

    stateDstream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
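To make the behaviour of the update function concrete, here is a minimal, self-contained sketch that applies the same function to one key across two batches; the batch values (and the object name UpdateFuncDemo) are hypothetical and only for illustration:

object UpdateFuncDemo {
  def main(args: Array[String]): Unit = {
    // same update function as above: sum the counts seen in the current batch
    // and add the previously saved count (0 if the key is new)
    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val currentCount = values.foldLeft(0)(_ + _)
      val previousCount = state.getOrElse(0)
      Some(currentCount + previousCount)
    }

    println(updateFunc(Seq(1, 1), None))       // batch 1: key seen twice, no state -> Some(2)
    println(updateFunc(Seq(1, 1, 1), Some(2))) // batch 2: key seen three times, state 2 -> Some(5)
  }
}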

The output looks like the following; the counts are accumulated over all of the data received so far:
    -------------------------------------------
    Time: 1459156160000 ms
    -------------------------------------------
    (B,1)
    (F,1)
    (D,4)
    (G,1)
    (A,1)
    (C,5)

We can now rank the hottest keywords; the code for this is shown below.

So why is a transform operation needed here? Let's look at how transform is described in the Spark source:
/**
 * Return a new DStream in which each RDD is generated by applying a function
 * on each RDD of 'this' DStream.
 */
def transform[U: ClassTag](transformFunc: RDD[T] => RDD[U]): DStream[U] = ssc.withScope {
  // because the DStream is reachable from the outer object here, and because
  // DStreams can't be serialized with closures, we can't proactively check
  // it for serializability and so we pass the optional false to SparkContext.clean
  val cleanedF = context.sparkContext.clean(transformFunc, false)
  transform((r: RDD[T], t: Time) => cleanedF(r))
}
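In other words, transform exposes each micro-batch as an ordinary RDD, so RDD-only operations can be applied per batch. As a minimal, hypothetical sketch (the DStream name wordCounts and the top-3 cut-off are assumptions, not from the original code):

// keep only the 3 highest counts of each batch, using plain RDD operations
// (sortBy and take are RDD methods, not DStream methods, hence the transform)
val top3PerBatch = wordCounts.transform { rdd =>
  rdd.sparkContext.parallelize(rdd.sortBy(_._2, ascending = false).take(3))
}

The sort used in this post comes from sortByKey, whose description in the Spark source reads: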


/**
 * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
 * `collect` or `save` on the resulting RDD will return or output an ordered list of records
 * (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
 * order of the keys).
 */
// TODO: this currently doesn't work on P other than Tuple2!
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
    : RDD[(K, V)] = self.withScope
{
  val part = new RangePartitioner(numPartitions, self, ascending)
  new ShuffledRDD[K, V, V](self, part)
    .setKeyOrdering(if (ascending) ordering else ordering.reverse)
}

From the comments above we can see that sortByKey sorts the data of all partitions within a single RDD, not across all RDDs. Because Spark Streaming operates on a sequence of RDDs, we have to use the transform operation to apply the sort to every RDD in the DStream.

stateDstream.map {
  case (char, count) => (count, char)
}.transform(_.sortByKey(false))
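If the sorted DStream above is bound to a value, the hottest words of each interval can be printed with foreachRDD. A minimal sketch, assuming the name hotWords and a top-10 limit (both are illustrative, not from the original post):

val hotWords = stateDstream.map {
  case (char, count) => (count, char)
}.transform(_.sortByKey(false))

// print the 10 hottest words of every batch interval
hotWords.foreachRDD { rdd =>
  rdd.take(10).foreach { case (count, char) => println(s"$char : $count") }
}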











