• Spark Streaming sliding window operations


    I. Spark Streaming window function concepts:

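    1. reduceByKeyAndWindow(_+_, Seconds(3), Seconds(2))

        Here the window length is Seconds(3) and the slide interval is Seconds(2): the window slides every 2 s, and each time it aggregates all the data from the preceding 3 s.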
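To make the relationship between window length and slide interval concrete, here is a minimal plain-Java sketch with no Spark dependency. It assumes each array element is the total of one 1 s batch, and it only emits complete windows. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration: recompute every window from its batches. */
final class NaiveWindowDemo {

    /**
     * @param batches per-batch totals, one entry per batch interval
     * @param window  window length, in batches
     * @param slide   slide interval, in batches
     * @return the total of each complete window, in order
     */
    static List<Integer> windowSums(int[] batches, int window, int slide) {
        List<Integer> sums = new ArrayList<>();
        // 'end' is the exclusive index of the last batch in the window.
        for (int end = window; end <= batches.length; end += slide) {
            int sum = 0;
            for (int i = end - window; i < end; i++) {
                sum += batches[i];
            }
            sums.add(sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        // Five 1 s batches: time1..time5.
        int[] batches = {1, 2, 3, 4, 5};
        // Window of 3 batches, sliding by 2: {time1..time3}, then {time3..time5}.
        System.out.println(windowSums(batches, 3, 2)); // prints [6, 12]
    }
}
```

Every window here re-reads all of its batches, including the one shared with the previous window.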
     

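    2. The overload reduceByKeyAndWindow(_+_, _-_, Seconds(3), Seconds(2))

         The design idea: when the slide interval, Seconds(2), is shorter than the window length, Seconds(3), two consecutive windows overlap. The overlapping part
         does not have to be fetched or computed again; the new result is obtained by updating the previous one, which saves both space and computation and improves efficiency considerably.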
        
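         The two consecutive windows overlap in the data of time slice time3, so window1 and window2 would be computed as
         win1 = time1 + time2 + time3
         win2 = time3 + time4 + time5
         
         which can be rewritten as
         win1 = time1 + time2 + time3
         win2 = win1 + time4 + time5 - time1 - time2
         
         In other words, _+_ aggregates the newly arrived time slices (the RDDs in time4 and time5), while _-_ removes the
         expired time slices (time1 and time2) of the previous window.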
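The incremental update can be sketched in plain Java as follows. This is an illustration under stated assumptions, not Spark's implementation: batches are plain integers, the slide interval is no longer than the window, and the two operators stand in for `_+_` and `_-_`. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntBinaryOperator;

/** Hypothetical illustration of a window updated with reduce and inverse reduce. */
final class IncrementalWindowDemo {

    static List<Integer> windowValues(int[] batches, int window, int slide,
                                      IntBinaryOperator reduce,
                                      IntBinaryOperator invReduce) {
        if (slide > window) {
            throw new IllegalArgumentException("sketch assumes slide <= window");
        }
        List<Integer> values = new ArrayList<>();
        if (batches.length < window) {
            return values;
        }
        // First window: there is no previous result, so reduce all of its batches.
        int current = batches[0];
        for (int i = 1; i < window; i++) {
            current = reduce.applyAsInt(current, batches[i]);
        }
        values.add(current);
        // Later windows: start from the previous window's value.
        for (int end = window + slide; end <= batches.length; end += slide) {
            int start = end - window;
            // Reduce the slices that entered the window: [end - slide, end).
            for (int i = end - slide; i < end; i++) {
                current = reduce.applyAsInt(current, batches[i]);
            }
            // Inverse-reduce the slices that left the window: [start - slide, start).
            for (int i = start - slide; i < start; i++) {
                current = invReduce.applyAsInt(current, batches[i]);
            }
            values.add(current);
        }
        return values;
    }

    public static void main(String[] args) {
        int[] batches = {1, 2, 3, 4, 5}; // time1..time5
        // win1 = 1 + 2 + 3 = 6; win2 = win1 + 4 + 5 - 1 - 2 = 12
        System.out.println(
                windowValues(batches, 3, 2, (a, b) -> a + b, (a, b) -> a - b)); // prints [6, 12]
    }
}
```

The result matches a full recomputation only when the inverse function really undoes the reduce function, as subtraction undoes addition.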
     

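     II. Use cases:

      If a business requirement has to operate across batches, for example the Spark Streaming batch interval is set to 5 s while the computation needs 5 min of data, a sliding window function can be used to implement it.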
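For the 5 s batch and 5 min window mentioned above, the arithmetic can be checked with a small helper. This is a hypothetical sketch, not part of Spark; it only encodes the documented requirement that the window and slide durations be multiples of the batch interval.

```java
import java.time.Duration;

/** Hypothetical helper, not part of Spark: how many batches does a span cover? */
final class WindowSizing {

    static long batchesIn(Duration span, Duration batchInterval) {
        long spanMs = span.toMillis();
        long batchMs = batchInterval.toMillis();
        // A window or slide must cover a whole number of batches.
        if (batchMs <= 0 || spanMs % batchMs != 0) {
            throw new IllegalArgumentException(
                    span + " is not a whole multiple of the batch interval " + batchInterval);
        }
        return spanMs / batchMs;
    }

    public static void main(String[] args) {
        Duration batch = Duration.ofSeconds(5);
        Duration window = Duration.ofMinutes(5);
        System.out.println(batchesIn(window, batch)); // prints 60
    }
}
```

So each 5 min window spans 60 batches. With a 5 s slide, an incremental update touches only the one batch that enters and the one that leaves.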

    https://www.jianshu.com/p/2f0d2cb1faf4

    III. Code:

    /**
     * Return a new DStream by applying incremental `reduceByKey` over a sliding window.
     * The reduced value of over a new window is calculated using the old window's reduced value :
     *  1. reduce the new values that entered the window (e.g., adding new counts)
     *
     *  2. "inverse reduce" the old values that left the window (e.g., subtracting old counts)
     *
     * This is more efficient than reduceByKeyAndWindow without "inverse reduce" function.
     * However, it is applicable to only "invertible reduce functions".
     * Hash partitioning is used to generate the RDDs with Spark's default number of partitions.
     * @param reduceFunc associative reduce function
     * @param invReduceFunc inverse reduce function
     * @param windowDuration width of the window; must be a multiple of this DStream's
     *                       batching interval
     * @param slideDuration  sliding interval of the window (i.e., the interval after which
     *                       the new DStream will generate RDDs); must be a multiple of this
     *                       DStream's batching interval
     * @param filterFunc     Optional function to filter expired key-value pairs;
     *                       only pairs that satisfy the function are retained
     */
    def reduceByKeyAndWindow(
        reduceFunc: (V, V) => V,
        invReduceFunc: (V, V) => V,
        windowDuration: Duration,
        slideDuration: Duration = self.slideDuration,
        numPartitions: Int = ssc.sc.defaultParallelism,
        filterFunc: ((K, V)) => Boolean = null
      ): DStream[(K, V)] = ssc.withScope {
      reduceByKeyAndWindow(
        reduceFunc, invReduceFunc, windowDuration,
        slideDuration, defaultPartitioner(numPartitions), filterFunc
      )
    }
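The role of filterFunc is easiest to see on per-key state. The sketch below is plain Java and does not use Spark. It assumes that a key whose value has been fully subtracted away would otherwise remain in the windowed result with a value of 0. The class, the method `slide`, and the use of `BiPredicate` in place of a `((K, V)) => Boolean` function are hypothetical stand-ins.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiPredicate;

/** Hypothetical stand-in for per-key incremental window state; not Spark code. */
final class FilterFuncDemo {

    static Map<String, Integer> slide(
            Map<String, Integer> previous,
            Map<String, Integer> entering,
            Map<String, Integer> leaving,
            BiPredicate<String, Integer> filter) {
        Map<String, Integer> next = new HashMap<>(previous);
        // Reduce: add the counts that entered the window.
        entering.forEach((key, count) -> next.merge(key, count, Integer::sum));
        // Inverse reduce: subtract the counts that left the window.
        leaving.forEach((key, count) ->
                next.computeIfPresent(key, (k, old) -> old - count));
        // Keep only the pairs that satisfy the filter, if one is given.
        if (filter != null) {
            next.entrySet().removeIf(e -> !filter.test(e.getKey(), e.getValue()));
        }
        return next;
    }

    public static void main(String[] args) {
        Map<String, Integer> previous = Map.of("a", 2, "b", 1);
        Map<String, Integer> entering = Map.of("a", 1);
        Map<String, Integer> leaving = Map.of("a", 1, "b", 1);

        // Without a filter, "b" stays in the result with a count of 0.
        System.out.println(slide(previous, entering, leaving, null));
        // With a filter, only keys with a positive count are retained.
        System.out.println(slide(previous, entering, leaving, (k, v) -> v > 0));
    }
}
```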
    Input data: <Id, <value, 1>>
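    import org.apache.spark.api.java.function.Function2;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import scala.Tuple2;

    // Note: the overload with an inverse reduce function requires checkpointing to be enabled.
    JavaPairDStream<String, Tuple2<Integer, Integer>> resultDStream =
            monitorId2SpeedDStream.reduceByKeyAndWindow(
                new Function2<Tuple2<Integer, Integer>, Tuple2<Integer, Integer>, Tuple2<Integer, Integer>>() {
                    private static final long serialVersionUID = 1L;

                    // Reduce: add the values that entered the window, component by component.
                    @Override
                    public Tuple2<Integer, Integer> call(Tuple2<Integer, Integer> v1, Tuple2<Integer, Integer> v2) throws Exception {
                        return new Tuple2<Integer, Integer>(v1._1 + v2._1, v1._2 + v2._2);
                    }
                },
                new Function2<Tuple2<Integer, Integer>, Tuple2<Integer, Integer>, Tuple2<Integer, Integer>>() {
                    private static final long serialVersionUID = 1L;

                    // Inverse reduce: subtract the values that left the window, component by component.
                    @Override
                    public Tuple2<Integer, Integer> call(Tuple2<Integer, Integer> v1, Tuple2<Integer, Integer> v2) throws Exception {
                        return new Tuple2<Integer, Integer>(v1._1 - v2._1, v1._2 - v2._2);
                    }
                },
                Durations.minutes(5),  // window length
                Durations.seconds(5)); // slide interval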
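The pair arithmetic used by the two functions can be tried without Spark. In the sketch below, `SumCount` is a hypothetical stand-in for `Tuple2<Integer, Integer>`. Reading the first component as a sum of speeds and the second as a record count, so that their ratio is an average, is an assumption drawn from the name `monitorId2SpeedDStream` and the `<value, 1>` input format.

```java
/** Hypothetical stand-in for the (sum, count) pairs; not Spark code. */
final class SpeedWindowDemo {

    static final class SumCount {
        final int sum;
        final int count;

        SumCount(int sum, int count) {
            this.sum = sum;
            this.count = count;
        }

        // Reduce: add both components.
        SumCount plus(SumCount other) {
            return new SumCount(sum + other.sum, count + other.count);
        }

        // Inverse reduce: subtract both components.
        SumCount minus(SumCount other) {
            return new SumCount(sum - other.sum, count - other.count);
        }

        double average() {
            return count == 0 ? 0.0 : (double) sum / count;
        }
    }

    public static void main(String[] args) {
        // Three records <value, 1> currently in the window.
        SumCount window = new SumCount(60, 1)
                .plus(new SumCount(80, 1))
                .plus(new SumCount(100, 1));
        System.out.println(window.average()); // prints 80.0

        // The window slides: one record enters, the oldest record leaves.
        SumCount next = window.plus(new SumCount(40, 1)).minus(new SumCount(60, 1));
        System.out.println(next.sum + " / " + next.count); // prints 220 / 3
    }
}
```

Both components have to be subtracted from the previous window's pair. If the count were not updated correctly, any average derived from the pair would be wrong.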

    Reposted from: https://www.cnblogs.com/zDanica/p/5471592.html

    https://blog.csdn.net/Thomson617/article/details/87780167

  • Original article: https://www.cnblogs.com/guoyu1/p/12508971.html