# The idea is simple: first flatten the RDD with flatMap, then use map to turn each word into a (k, 1) pair, then merge the values with groupByKey — which effectively deduplicates the keys — and finally extract the distinct words with keys().
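To make the four steps concrete, here is a pure-Python sketch of what each transformation does, using a small hypothetical word list rather than the actual RDD API:

```python
# Hypothetical input: two lines already split into words
words_per_line = [["hello", "hello"], ["world", "hello"]]

# flatMap: flatten the list of lists into one list of words
flat = [w for line in words_per_line for w in line]

# map: pair every word with 1, producing (k, 1) tuples
pairs = [(w, 1) for w in flat]

# groupByKey: collect all the 1s under each key
grouped = {}
for k, v in pairs:
    grouped.setdefault(k, []).append(v)

# keys: the grouped keys are exactly the distinct words
distinct_words = list(grouped)
print(distinct_words)
```

Because groupByKey merges every pair that shares a key into one entry, taking the keys afterward yields each word exactly once.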
Test data: delcp.txt
hello
hello
world
world
h
h
h
g
g
g
hello
world
world
h
h
h
g
g
g
from pyspark import SparkContext

sc = SparkContext('local', 'delcp')
rdd = sc.textFile("file:///usr/local/spark/mycode/TestPackage/delcp.txt")
# Split each line into words, pair each word with 1,
# group by key (merging the duplicates), and keep only the keys
delp = rdd.flatMap(lambda line: line.split(" ")) \
          .map(lambda a: (a, 1)) \
          .groupByKey() \
          .keys()
delp.foreach(print)