• Spark Notes 04


    • Deep dive: Shared Variables
    • Deep dive: RDD Persistence
    • Deep dive: RDD Key Value Pairs API
    • Extra topic: Implicit Conversion

    Shared Variables

    In general, variables in Spark are local: each executor gets its own copy, and changes made locally are not reflected back on the driver. Beyond these, Spark provides two mechanisms for implementing shared, global variables.

    Broadcast Variables

    Broadcast variables are created from a variable v by calling SparkContext.broadcast(v).

    scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
    broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)
    
    scala> broadcastVar.value
    res0: Array[Int] = Array(1, 2, 3)
    

    Once the broadcast variable has been created, the original variable v should no longer be used, and above all should not be modified, to avoid inconsistent results.
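
    A typical use is shipping a read-only lookup table to every executor once and referencing it inside transformations via .value. A minimal sketch (the lookup table here is a made-up example):

    val lookup = sc.broadcast(Map(1 -> "a", 2 -> "b", 3 -> "c"))
    
    // Tasks read the broadcast copy instead of capturing the Map in every closure
    sc.parallelize(Seq(1, 2, 3)).map(x => lookup.value.getOrElse(x, "?")).collect()
    // res: Array[String] = Array(a, b, c)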

    Accumulators

    A numeric accumulator can be created by calling SparkContext.longAccumulator() or SparkContext.doubleAccumulator() to accumulate values of type Long or Double, respectively.

    scala> val accum = sc.longAccumulator("My Accumulator")
    accum: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 0, name: Some(My Accumulator), value: 0)
    
    scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum.add(x))
    ...
    10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s
    
    scala> accum.value
    res2: Long = 10
    

    Programmers can also create their own types by subclassing AccumulatorV2.
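
    As a sketch of what subclassing looks like, here is a hypothetical accumulator that collects the distinct strings seen across all tasks (the class name and usage are illustrative, not a built-in API):

    import org.apache.spark.util.AccumulatorV2
    import scala.collection.mutable
    
    class DistinctStringAccumulator extends AccumulatorV2[String, Set[String]] {
      private val items = mutable.Set.empty[String]
    
      override def isZero: Boolean = items.isEmpty
    
      override def copy(): DistinctStringAccumulator = {
        val acc = new DistinctStringAccumulator
        acc.items ++= items
        acc
      }
    
      override def reset(): Unit = items.clear()
    
      override def add(v: String): Unit = items += v
    
      // Combine the per-task copies back on the driver
      override def merge(other: AccumulatorV2[String, Set[String]]): Unit =
        items ++= other.value
    
      override def value: Set[String] = items.toSet
    }
    
    // Register the accumulator before using it in tasks:
    // val acc = new DistinctStringAccumulator
    // sc.register(acc, "distinct strings")
    // sc.parallelize(Seq("a", "b", "a")).foreach(acc.add)
    // acc.value  // Set(a, b)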

    RDD Persistence

    RDDs are stored mainly in:

    • Memory
    • Disk

    Memory makes computation fast but is bounded by its capacity; disk stores large volumes of data more cheaply but is limited by read/write speed. Choosing between them is a trade-off.

    This trade-off gives rise to the storage levels shown in the figure below:

    [Figure: spark_persistence — Spark storage levels]

    In the figure above, besides the Memory and Disk options there is a third dimension, Serialization: serialized objects occupy less storage space, but cost extra computation to deserialize.
    The final storage levels are essentially combinations of these three dimensions.

    At the code level, two methods implement persistence (a usage sketch follows the list):

    • persist(): lets you choose a StorageLevel
    • cache(): uses the default storage level, StorageLevel.MEMORY_ONLY
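
    A minimal sketch of both methods ("data.txt" is a placeholder input path):

    import org.apache.spark.storage.StorageLevel
    
    val lengths = sc.textFile("data.txt").map(_.length)
    
    lengths.persist(StorageLevel.MEMORY_AND_DISK) // pick an explicit storage level
    // lengths.cache()                            // same as persist(StorageLevel.MEMORY_ONLY)
    
    lengths.count()     // the first action computes and caches the RDD
    lengths.collect()   // later actions reuse the cached partitions
    
    lengths.unpersist() // release the storage when no longer needed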

    Which Storage Level to Choose?

    • Based on the data volume, from small to large, work through the figure above in order.
    • If you need fast fault recovery (e.g. if using Spark to serve requests from a web application), use the replicated storage levels: MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.

    RDD Key Value Pairs API

    When a key occurs in multiple key-value pairs, Spark's partitioning tries to place all pairs with the same key on the same node. The benefit is that every operation on that key can then be completed on a single node.

    [Figure: spark_pairs]
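
    The colocation can also be forced explicitly by partitioning a pair RDD by key; a minimal sketch:

    import org.apache.spark.HashPartitioner
    
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    
    // Hash-partition by key: all pairs with the same key land in the same partition
    val partitioned = pairs.partitionBy(new HashPartitioner(4))
    partitioned.glom().collect() // inspect partitions; the two ("a", _) pairs sit together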

    Some commonly used APIs are listed below; they all belong to class PairRDDFunctions<K,V>:

    1. Transformations and actions on a single pair RDD

    collectAsMap()
    Return the key-value pairs in this RDD to the master as a Map.
    
    mapValues(scala.Function1<V,U> f)
    Pass each value in the key-value pair RDD through a map function without changing the keys; this also retains the original RDD's partitioning.
    
    flatMapValues(scala.Function1<V,scala.collection.TraversableOnce<U>> f)
    Pass each value in the key-value pair RDD through a flatMap function without changing the keys; this also retains the original RDD's partitioning.
    
    reduceByKey(scala.Function2<V,V,V> func)
    Merge the values for each key using an associative reduce function.
    
    groupByKey()
    Group the values for each key in the RDD into a single sequence.
    
    countByKey()
    Count the number of elements for each key, and return the result to the master as a Map.
    
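    A quick tour of these on a small pair RDD (results shown as comments; exact ordering across partitions may vary):

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    
    pairs.mapValues(_ * 10).collect()          // Array((a,10), (b,20), (a,30))
    pairs.flatMapValues(v => 1 to v).collect() // (a,1), (b,1), (b,2), (a,1), (a,2), (a,3)
    pairs.reduceByKey(_ + _).collect()         // Array((a,4), (b,2))
    pairs.groupByKey().collect()               // Array((a, [1, 3]), (b, [2]))
    pairs.countByKey()                         // Map(a -> 2, b -> 1)
    pairs.collectAsMap()                       // Map with one value kept per duplicate key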

    2. Joins between two pair RDDs

    join(RDD<scala.Tuple2<K,W>> other)
    Return an RDD containing all pairs of elements with matching keys in this and other.
    
    leftOuterJoin(RDD<scala.Tuple2<K,W>> other)
    Perform a left outer join of this and other.
    
    rightOuterJoin(RDD<scala.Tuple2<K,W>> other)
    Perform a right outer join of this and other.
    
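    A minimal sketch of the three joins:

    val left  = sc.parallelize(Seq(("a", 1), ("b", 2)))
    val right = sc.parallelize(Seq(("a", "x"), ("c", "y")))
    
    left.join(right).collect()           // Array((a,(1,x)))
    left.leftOuterJoin(right).collect()  // Array((a,(1,Some(x))), (b,(2,None)))
    left.rightOuterJoin(right).collect() // Array((a,(Some(1),x)), (c,(None,y)))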

    3. Saving a pair RDD to a Hadoop-supported file system

    public void saveAsHadoopFile(String path,
                        Class<?> keyClass,
                        Class<?> valueClass,
                        Class<? extends org.apache.hadoop.mapred.OutputFormat<?,?>> outputFormatClass,
                        Class<? extends org.apache.hadoop.io.compress.CompressionCodec> codec)
    Output the RDD to any Hadoop-supported file system, using a Hadoop OutputFormat class supporting the key and value types K and V in this RDD.
    
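    One possible invocation, assuming Writable key/value types and gzip compression ("/tmp/pairs-out" is a placeholder output path):

    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.io.compress.GzipCodec
    import org.apache.hadoop.mapred.TextOutputFormat
    
    sc.parallelize(Seq(("a", 1), ("b", 2)))
      .map { case (k, v) => (new Text(k), new IntWritable(v)) }
      .saveAsHadoopFile("/tmp/pairs-out",
        classOf[Text], classOf[IntWritable],
        classOf[TextOutputFormat[Text, IntWritable]],
        classOf[GzipCodec])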

    Implicit Conversion

    Implicit conversion: the compiler automatically converts a value of type S to a value of type T where the latter is expected.

    Example 1: when calling a Java method that expects a java.lang.Integer, you are free to pass it a scala.Int instead, by defining an implicit conversion.

    import scala.language.implicitConversions
    
    implicit def int2Integer(x: Int): java.lang.Integer =
      java.lang.Integer.valueOf(x)
    
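    With the conversion in scope, an Int can be passed where java.lang.Integer is expected (the ArrayList is just an illustration; Scala's Predef actually ships an equivalent conversion already):

    val list = new java.util.ArrayList[java.lang.Integer]()
    list.add(42) // the Int 42 is converted to java.lang.Integer automatically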

    Example 2: implementing 1.plus(1) via an implicit conversion.

    // 1. A wrapper that adds a plus method to Int
    case class IntExtensions(value: Int) {
      def plus(operand: Int): Int = value + operand
    }
    
    // 2. The implicit conversion from Int to the wrapper
    import scala.language.implicitConversions
    
    implicit def intToIntExtensions(value: Int): IntExtensions =
      IntExtensions(value)
    
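    With the conversion in scope, the compiler rewrites 1.plus(1) as intToIntExtensions(1).plus(1):

    scala> 1.plus(1)
    res0: Int = 2

    In modern Scala the same pattern is usually written as a single implicit class, which fuses the wrapper and the conversion into one definition.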

    https://docs.scala-lang.org/tour/implicit-conversions.html
