• Basic RDD commands


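    1 Creating an RDD

    # sc is the SparkContext that the pyspark shell creates for you
    intRDD=sc.parallelize([3,1,2,5,6])
    intRDD.collect()
    [3, 1, 2, 5, 6]

    # stringRDD is used in the examples below
    stringRDD=sc.parallelize(['Apple','Orange','Banana','Grape','Apple'])
    stringRDD.collect()
    ['Apple', 'Orange', 'Banana', 'Grape', 'Apple']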

    2 Single-RDD transformations

    (1) map

    def addone(x):
        return (x+1)
    intRDD.map(addone).collect()
    [4, 2, 3, 6, 7]

    intRDD.map(lambda x: x+1).collect()
    [4, 2, 3, 6, 7]

    stringRDD.map(lambda x:'fruit:'+x).collect()
    ['fruit:Apple', 'fruit:Orange', 'fruit:Banana', 'fruit:Grape', 'fruit:Apple']

    (2) filter

    intRDD.filter(lambda x: x<3).collect()
    [1, 2]
    intRDD.filter(lambda x:1<x and x<5).collect()
    [3, 2]
    stringRDD.filter(lambda x: "ra" in x).collect()
    ['Orange', 'Grape']

    (3) distinct

    intRDD.distinct().collect()
    [1, 5, 2, 6, 3]
    stringRDD.distinct().collect()
    ['Orange', 'Apple', 'Banana', 'Grape']
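For readers without a Spark shell at hand, the three transformations above can be mirrored on ordinary Python lists. This is only a local sketch of what map, filter and distinct compute, not how Spark executes them; the variable names are illustrative, and the list version of distinct keeps first-seen order, which Spark does not promise.

```python
# Plain-Python counterparts of map, filter and distinct (a local model only).
int_data = [3, 1, 2, 5, 6]
string_data = ['Apple', 'Orange', 'Banana', 'Grape', 'Apple']

# map: apply a function to every element
plus_one = [x + 1 for x in int_data]
labelled = ['fruit:' + s for s in string_data]

# filter: keep only the elements that satisfy a predicate
small = [x for x in int_data if x < 3]
between = [x for x in int_data if 1 < x < 5]
with_ra = [s for s in string_data if 'ra' in s]

# distinct: drop duplicates (dict.fromkeys keeps first-seen order)
unique_fruit = list(dict.fromkeys(string_data))

print(plus_one)      # [4, 2, 3, 6, 7]
print(small)         # [1, 2]
print(between)       # [3, 2]
print(with_ra)       # ['Orange', 'Grape']
print(unique_fruit)  # ['Apple', 'Orange', 'Banana', 'Grape']
```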

    (4) randomSplit

    sRDD=intRDD.randomSplit([0.4,0.6])
    sRDD[0].collect()
    [1, 2]
    sRDD[1].collect()
    [3, 5, 6]
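The weights given to randomSplit are probabilities rather than exact sizes, so the two parts will usually differ from run to run. The sketch below models that idea in plain Python, assuming each element is assigned independently to one part; `random_split` is a hypothetical helper written for this note and is not Spark's implementation.

```python
import random
from itertools import accumulate

def random_split(items, weights, seed=None):
    """Hypothetical helper: put each item into exactly one bucket, with
    probability proportional to that bucket's weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds of the normalised weights, e.g. [0.4, 1.0]
    bounds = list(accumulate(w / total for w in weights))
    buckets = [[] for _ in weights]
    for item in items:
        u = rng.random()
        for i, upper in enumerate(bounds):
            # The last bucket also catches any floating-point rounding gap
            if u < upper or i == len(bounds) - 1:
                buckets[i].append(item)
                break
    return buckets

parts = random_split([3, 1, 2, 5, 6], [0.4, 0.6], seed=42)
print(parts)  # the sizes depend on the seed; no element is lost or duplicated
```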

    (5) groupBy

    gRDD=intRDD.groupBy(lambda x:'even' if (x%2==0) else 'odd').collect()
    # each group is a ResultIterable, so convert it before printing;
    # the order of the groups is not guaranteed, so print the key as well
    print(gRDD[0][0], sorted(gRDD[0][1]))
    print(gRDD[1][0], sorted(gRDD[1][1]))

    even [2, 6]
    odd [1, 3, 5]
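What groupBy computes can be pictured as a dictionary from key to the list of elements that produced that key. A minimal plain-Python model, assuming the data fits in one list; `group_by` is a hypothetical helper, not a PySpark function:

```python
from collections import defaultdict

def group_by(items, key_func):
    """Hypothetical helper: collect the items under the key that key_func returns."""
    groups = defaultdict(list)
    for item in items:
        groups[key_func(item)].append(item)
    return dict(groups)

grouped = group_by([3, 1, 2, 5, 6], lambda x: 'even' if x % 2 == 0 else 'odd')
print(grouped)  # {'odd': [3, 1, 5], 'even': [2, 6]}
```

In Spark each value is a ResultIterable rather than a list, which is why it has to be wrapped in list() or sorted() before it prints usefully.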

    3 Multi-RDD transformations

    intRDD1=sc.parallelize([3,1,2,5,5])
    intRDD2=sc.parallelize([5,6])
    intRDD3=sc.parallelize([2,7])

    Union: union

    intRDD1.union(intRDD2).union(intRDD3).collect()

    [3, 1, 2, 5, 5, 5, 6, 2, 7]

    Intersection: intersection

    intRDD1.intersection(intRDD2).collect()

    [5]

    Difference: subtract

    intRDD1.subtract(intRDD2).collect()

    [1, 2, 3]

    Cartesian product: cartesian

    intRDD1.cartesian(intRDD2).collect()

    [(3, 5), (3, 6), (1, 5), (1, 6), (2, 5), (2, 6), (5, 5), (5, 5), (5, 6), (5, 6)]
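The four multi-RDD operations treat duplicates differently: union keeps them, intersection returns each common value once, and subtract removes every occurrence of a value found in the other RDD. The plain-Python sketch below models those rules on lists; the helper names are made up for this note, and the element order it produces is only one possibility, since Spark does not guarantee order for these results.

```python
from itertools import product

def union(a, b):
    # Concatenate; duplicates are kept
    return a + b

def intersection(a, b):
    # Each value present in both inputs appears once
    other, seen, out = set(b), set(), []
    for x in a:
        if x in other and x not in seen:
            seen.add(x)
            out.append(x)
    return out

def subtract(a, b):
    # Drop every element of a whose value occurs in b
    other = set(b)
    return [x for x in a if x not in other]

def cartesian(a, b):
    # Every pairing of an element of a with an element of b
    return list(product(a, b))

data1, data2, data3 = [3, 1, 2, 5, 5], [5, 6], [2, 7]
print(union(union(data1, data2), data3))  # [3, 1, 2, 5, 5, 5, 6, 2, 7]
print(intersection(data1, data2))         # [5]
print(subtract(data1, data2))             # [3, 1, 2]
print(len(cartesian(data1, data2)))       # 10
```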

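    Actions

    first() returns the first element
    take(2) returns the first two elements
    takeOrdered(3) returns the three smallest elements in ascending order
    takeOrdered(3,key=lambda x:-x) returns the three largest elements in descending order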
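The original lists these actions without output. As a rough local illustration, the same results can be computed on a plain list holding the contents of intRDD; this is a sketch of the expected values, not a Spark call.

```python
import heapq

data = [3, 1, 2, 5, 6]  # same contents as intRDD

first = data[0]                                        # like first()
first_two = data[:2]                                   # like take(2)
smallest3 = heapq.nsmallest(3, data)                   # like takeOrdered(3)
largest3 = heapq.nsmallest(3, data, key=lambda x: -x)  # like takeOrdered(3, key=lambda x: -x)

print(first)      # 3
print(first_two)  # [3, 1]
print(smallest3)  # [1, 2, 3]
print(largest3)   # [6, 5, 3]
```

Negating the key reverses the ordering, which is why the same "smallest first" call yields the largest values.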

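    Statistics

    stats()   summary of count, mean, stdev, max and min
    min()     smallest element
    max()     largest element
    stdev()   population standard deviation
    count()   number of elements
    sum()     sum of the elements
    mean()    arithmetic mean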
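To make the expected numbers concrete, here are the same statistics computed with the Python standard library for the contents of intRDD. One detail to be aware of: stdev() on an RDD is the population standard deviation (divide by n), which corresponds to `statistics.pstdev`, while the sample version is exposed separately as sampleStdev(). The dictionary layout below is only for display.

```python
import statistics

data = [3, 1, 2, 5, 6]  # same contents as intRDD

summary = {
    'count': len(data),
    'sum': sum(data),
    'mean': statistics.mean(data),
    'min': min(data),
    'max': max(data),
    'stdev': statistics.pstdev(data),        # population: divide by n
    'sampleStdev': statistics.stdev(data),   # sample: divide by n - 1
}

for name, value in summary.items():
    print(name, round(value, 4))
```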

    RDD key-value transformation

    kvRDD1=sc.parallelize([(3,4),(3,6),(5,6),(1,2)])
    kvRDD2=sc.parallelize([(3,8)])

    kvRDD1.collect()
    [(3, 4), (3, 6), (5, 6), (1, 2)]
    kvRDD2.collect()
    [(3, 8)]

    join

    kvRDD1.join(kvRDD2).collect()
    [(3, (4, 8)), (3, (6, 8))]

    leftOuterJoin

    kvRDD1.leftOuterJoin(kvRDD2).collect()

    [(1, (2, None)), (3, (4, 8)), (3, (6, 8)), (5, (6, None))]

    rightOuterJoin

    kvRDD1.rightOuterJoin(kvRDD2).collect()

    [(3, (4, 8)), (3, (6, 8))]

    subtractByKey

    kvRDD1.subtractByKey(kvRDD2).collect()

    [(1, 2), (5, 6)]
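The four key-value transformations differ only in which keys survive and what fills the missing side. The sketch below models them on plain lists of pairs, assuming small in-memory data; the function names are hypothetical and the nested loops are for clarity, not efficiency. Result order may differ from what Spark returns.

```python
def join(left, right):
    # Keep only keys present on both sides; pair up every matching value
    return [(k, (v, w)) for k, v in left for k2, w in right if k == k2]

def left_outer_join(left, right):
    # Keep every left pair; use None when the key is missing on the right
    out = []
    for k, v in left:
        matches = [w for k2, w in right if k2 == k]
        out.extend((k, (v, w)) for w in (matches or [None]))
    return out

def right_outer_join(left, right):
    # Keep every right pair; use None when the key is missing on the left
    out = []
    for k, w in right:
        matches = [v for k2, v in left if k2 == k]
        out.extend((k, (v, w)) for v in (matches or [None]))
    return out

def subtract_by_key(left, right):
    # Drop left pairs whose key appears on the right
    right_keys = {k for k, _ in right}
    return [(k, v) for k, v in left if k not in right_keys]

kv1 = [(3, 4), (3, 6), (5, 6), (1, 2)]
kv2 = [(3, 8)]
print(join(kv1, kv2))              # [(3, (4, 8)), (3, (6, 8))]
print(left_outer_join(kv1, kv2))   # [(3, (4, 8)), (3, (6, 8)), (5, (6, None)), (1, (2, None))]
print(right_outer_join(kv1, kv2))  # [(3, (4, 8)), (3, (6, 8))]
print(subtract_by_key(kv1, kv2))   # [(5, 6), (1, 2)]
```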

    RDD key-value Action

    key-value first

    kvFirst=kvRDD1.first()
    print(kvFirst[0])
    print(kvFirst[1])

    3
    4

    count elements per key: countByKey

    kvRDD1.countByKey()

    defaultdict(int, {1: 1, 3: 2, 5: 1})

    create a key-value map: collectAsMap

    KV=kvRDD1.collectAsMap()
    KV

    # a dict holds one value per key, so only one value (here 6) survives for the duplicate key 3
    {1: 2, 3: 6, 5: 6}

    print(type(KV))
    print(KV[3])

    <class 'dict'>
    6

    look up the values for a key: lookup

    kvRDD1.lookup(3)

    [4, 6]
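The difference between collectAsMap and lookup matters when a key repeats: the map keeps one value per key, while lookup returns all of them. A small plain-Python model of the three key-value actions, using standard containers in place of the RDD; this mirrors the results shown above and is not Spark code.

```python
from collections import Counter

kv = [(3, 4), (3, 6), (5, 6), (1, 2)]  # same contents as kvRDD1

# countByKey: how many pairs share each key
counts = Counter(k for k, _ in kv)

# collectAsMap: building a dict keeps the last value seen for a repeated key
as_map = dict(kv)

# lookup(3): every value stored under key 3
values_for_3 = [v for k, v in kv if k == 3]

print(dict(counts))   # {3: 2, 5: 1, 1: 1}
print(as_map)         # {3: 6, 5: 6, 1: 2}
print(values_for_3)   # [4, 6]
```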
  • Original article: https://www.cnblogs.com/xzjf/p/9593387.html