• Spark API: map, mapPartitions, mapValues, flatMap, and flatMapValues Explained


    Original article: https://blog.csdn.net/helloxiaozhe/article/details/80492933

    1. Create an RDD and use the help function to view the function's definition and examples:

    >>> a = sc.parallelize([(1,2),(3,4),(5,6)])
     
    >>> a
    ParallelCollectionRDD[21] at parallelize at PythonRDD.scala:475
    >>> help(a.map)
    
    
    
    Help on RDD in module pyspark.rdd object:
     
    class RDD(__builtin__.object)
     |  A Resilient Distributed Dataset (RDD), the basic abstraction in Spark.
     |  Represents an immutable, partitioned collection of elements that can be
     |  operated on in parallel.
     |
     |  Methods defined here:
     |
     |  __add__(self, other)
     |      Return the union of this RDD and another one.
     |
     |      >>> rdd = sc.parallelize([1, 1, 2, 3])
     |      >>> (rdd + rdd).collect()
     |      [1, 1, 2, 3, 1, 1, 2, 3]

    What is an RDD?
    An RDD is Spark's core abstract data type; all data in Spark is represented as RDDs. From a programming perspective, an RDD can be viewed simply as an array. Unlike an ordinary array, however, the data in an RDD is stored in partitions, so different partitions can live on different machines and be processed in parallel. A Spark application therefore amounts to converting the data to be processed into RDDs and then applying a series of transformations and actions to them to obtain the result.
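
    To make the partitioning visible, the sketch below (the partition count of 3 is an assumed value for illustration) uses glom(), which collects each partition's elements into a list:

    >>> rdd = sc.parallelize([1, 2, 3, 4, 5, 6], 3)  # request 3 partitions
    >>> rdd.getNumPartitions()
    3
    >>> rdd.glom().collect()  # one list per partition
    [[1, 2], [3, 4], [5, 6]]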

    2. map(function)

    map applies the given function to every element of the RDD to produce a new RDD. Each element of the original RDD maps to exactly one element in the new RDD.

    map(self, f, preservesPartitioning=False) method of pyspark.rdd.RDD instance
        Return a new RDD by applying a function to each element of this RDD.
        >>> rdd = sc.parallelize(["b", "a", "c"])
        >>> sorted(rdd.map(lambda x: (x, 1)).collect())
        [('a', 1), ('b', 1), ('c', 1)]
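
    Applied to the RDD a created in step 1, a one-to-one transformation such as summing each pair might look like this (a minimal sketch):

    >>> a.map(lambda kv: kv[0] + kv[1]).collect()  # one output element per input pair
    [3, 7, 11]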


    3. mapPartitions(function)

    mapPartitions is a variant of map. Whereas map's input function is applied to each element of the RDD, mapPartitions's input function is applied to each partition, i.e., it processes the contents of an entire partition as a whole.

    mapPartitions(self, f, preservesPartitioning=False) method of pyspark.rdd.RDD instance
        Return a new RDD by applying a function to each partition of this RDD.
     
        >>> rdd = sc.parallelize([1, 2, 3, 4], 2)
        >>> def f(iterator): yield sum(iterator)
        >>> rdd.mapPartitions(f).collect()
        [3, 7]
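
    The input function receives each partition as an iterator and must itself return (or yield) an iterable. As a sketch, the hypothetical helper below counts the elements in each partition:

    >>> rdd = sc.parallelize([1, 2, 3, 4, 5], 2)
    >>> def count_in_partition(iterator):
    ...     yield len(list(iterator))  # the whole partition arrives as one iterator
    ...
    >>> rdd.mapPartitions(count_in_partition).collect()
    [2, 3]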


    4. mapValues(function)

    mapValues applies the input function to each value while the keys of the original RDD stay unchanged; each key is paired with its new value to form an element of the new RDD. Consequently, this function only applies to RDDs whose elements are key-value pairs.

    mapValues(self, f) method of pyspark.rdd.RDD instance
        Pass each value in the key-value pair RDD through a map function
        without changing the keys; this also retains the original RDD's
        partitioning.
        >>> x = sc.parallelize([("a", ["apple", "banana", "lemon"]), ("b", ["grapes"])])
        >>> def f(x): return len(x)
        >>> x.mapValues(f).collect()
        [('a', 3), ('b', 1)]
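
    On the pair RDD a from step 1, multiplying every value while leaving the keys alone might look like this (a minimal sketch):

    >>> a.mapValues(lambda v: v * 10).collect()  # keys 1, 3, 5 are untouched
    [(1, 20), (3, 40), (5, 60)]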


    5. flatMap(function)

    Similar to map, except that each element of the original RDD produces exactly one element under map, whereas under flatMap each element can produce multiple elements, which are then flattened into the new RDD.

    Help on method flatMap in module pyspark.rdd:
     
    flatMap(self, f, preservesPartitioning=False) method of pyspark.rdd.RDD instance
        Return a new RDD by first applying a function to all elements of this
        RDD, and then flattening the results.
     
        >>> rdd = sc.parallelize([2, 3, 4])
        >>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())
        [1, 1, 1, 2, 2, 3]
        >>> sorted(rdd.flatMap(lambda x: [(x, x), (x, x)]).collect())
        [(2, 2), (2, 2), (3, 3), (3, 3), (4, 4), (4, 4)]
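
    A common way to see the difference is splitting lines of text (a small sketch with made-up input): map keeps one nested list per element, while flatMap flattens the results:

    >>> rdd = sc.parallelize(["hello world", "hi"])
    >>> rdd.map(lambda line: line.split(" ")).collect()
    [['hello', 'world'], ['hi']]
    >>> rdd.flatMap(lambda line: line.split(" ")).collect()
    ['hello', 'world', 'hi']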


    6. flatMapValues(function)

    flatMapValues is similar to mapValues, except that it applies a flatMap to the values of a key-value RDD: the input function maps each element's value to a sequence of values, and each of those values is then paired with the original key to form a series of new key-value pairs.

    flatMapValues(self, f) method of pyspark.rdd.RDD instance
        Pass each value in the key-value pair RDD through a flatMap function
        without changing the keys; this also retains the original RDD's
        partitioning.
        >>> x = sc.parallelize([("a", ["x", "y", "z"]), ("b", ["p", "r"])])
        >>> def f(x): return x
        >>> x.flatMapValues(f).collect()
        [('a', 'x'), ('a', 'y'), ('a', 'z'), ('b', 'p'), ('b', 'r')]
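
    The value does not have to be a list already; any function returning an iterable works. As a sketch, expanding each numeric value into a range produces one pair per generated value:

    >>> kv = sc.parallelize([("a", 2), ("b", 1)])
    >>> kv.flatMapValues(lambda v: range(v)).collect()
    [('a', 0), ('a', 1), ('b', 0)]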


    7. reduce(function)

    reduce passes the elements of the RDD to the input function two at a time, producing a new value; that new value and the next element of the RDD are then passed to the input function again, and so on until only a single value remains.

    reduce(self, f) method of pyspark.rdd.RDD instance
        Reduces the elements of this RDD using the specified commutative and
        associative binary operator. Currently reduces partitions locally.
     
        >>> from operator import add
        >>> sc.parallelize([1, 2, 3, 4, 5]).reduce(add)
        15
        >>> sc.parallelize((2 for _ in range(10))).map(lambda x: 1).cache().reduce(add)
        10
        >>> sc.parallelize([]).reduce(add)
        Traceback (most recent call last):
            ...
        ValueError: Can not reduce() empty RDD


    For example:

    >>> from operator import add
    >>> b = sc.parallelize([1, 2, 3, 4, 5, 6])
    >>> b.collect()
    [1, 2, 3, 4, 5, 6]
    >>> b.reduce(add)   # built-in operator function
    21
    >>> b.reduce(lambda a,b:a+b)    # user-defined anonymous (lambda) function
    21
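
    Any commutative and associative binary function will do; for instance, keeping the larger of each pair of values yields the maximum (a minimal sketch):

    >>> b.reduce(lambda x, y: x if x > y else y)    # max via pairwise comparison
    6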


    8. reduceByKey(function)

    As the name suggests, reduceByKey reduces the values of all elements in a key-value RDD that share the same key: the values of the elements with a given key are reduced to a single value, which is then paired with the original key to form a new key-value pair.

    Help on method reduceByKey in module pyspark.rdd:
     
    reduceByKey(self, func, numPartitions=None, partitionFunc=<function portable_hash>) method of pyspark.rdd.RDD instance
        Merge the values for each key using an associative and commutative reduce function.
     
        This will also perform the merging locally on each mapper before
        sending results to a reducer, similarly to a "combiner" in MapReduce.
     
        Output will be partitioned with C{numPartitions} partitions, or
        the default parallelism level if C{numPartitions} is not specified.
        Default partitioner is hash-partition.
     
        >>> from operator import add
        >>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
        >>> sorted(rdd.reduceByKey(add).collect())
        [('a', 2), ('b', 1)]
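
    Combining flatMap, map, and reduceByKey gives the classic word count. A minimal sketch with made-up input (add is the operator-module function imported in the example above):

    >>> from operator import add
    >>> lines = sc.parallelize(["spark map reduce", "spark flatMap"])
    >>> words = lines.flatMap(lambda line: line.split(" "))
    >>> sorted(words.map(lambda w: (w, 1)).reduceByKey(add).collect())
    [('flatMap', 1), ('map', 1), ('reduce', 1), ('spark', 2)]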
  • Source: https://www.cnblogs.com/quyangzhangsiyuan/p/12266679.html