1. Importing and configuring Spark
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf=conf)
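A variant sketch, not from the original notes: "local[4]" runs Spark with 4 worker threads and "local[*]" uses all available cores; stop() releases the context when the application finishes.

# Assumption: local mode with 4 threads instead of the single-threaded "local".
conf = SparkConf().setMaster("local[4]").setAppName("My App")
sc = SparkContext(conf=conf)
# ... run jobs ...
sc.stop()  # shut the context down when finished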
2. Creating an RDD
# First way: parallelize an existing list
A = [1, 2, 3, 4, 5]
lines = sc.parallelize(A)
# Another way: pass the list inline
lines = sc.parallelize([1, 2, 3, 4, 5])
# Third way: load from a text file
lines = sc.textFile("Demo.txt")
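count(), first(), and take() are standard RDD actions; a quick sketch to peek at a list-based RDD (a fresh variable is used here because the last line above rebinds lines to the text file):

nums = sc.parallelize([1, 2, 3, 4, 5])
print(nums.count())  # 5
print(nums.first())  # 1
print(nums.take(3))  # [1, 2, 3]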
3. The practice text file (Demo.txt)
There were a sensitivity and a beauty to her that have nothing to do with looks.
She was one to be listened to, whose words were so easy to take to heart.
It is said that the true nature of being is veiled.
The labor of words, the expression of art
I used to find notes left in the collection basket,
beautiful notes about my homilies and about the writer's thoughts on the daily scriptural readings.
It was a long time before I met the author of the notes.
One Sunday morning, I was told that someone was waiting for me in the office.
We chatted for a while that Sunday morning and agreed to meet for lunch later that week.
As it turned out we went to lunch several times, and she always wore a hat during the meal.
We spoke of authors we both had read, and it was easy to tell that books are a great love of hers.
I have thought about her often over the years and how she struggled in a society that places an incredible premium on looks
Would her life have been different had she been pretty? Chances are it would have.
How long does it take most of us to reach that level of human growth, if we ever get there? We get so consumed and diminished,
The truth of her life was a desire to see beyond the surface for a glimpse of what it is that matters.
She found beauty and grace and they befriended her, and showed her what is real
wnagnan is good
huxue is beautiful
we are good
4. Creating a pair RDD in Python, keyed by the first word of each line, using the map() function
pairs = lines.map(lambda x: (x.split(" ")[0], x))
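For reference, a pair RDD also exposes keys() and values(); this small sketch just peeks at the keys built above:

print(pairs.keys().take(5))   # ['There', 'She', 'It', 'The', 'I']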
5. Printing with print()
pairs.foreach(print)
('There', 'There were a sensitivity and a beauty to her that have nothing to do with looks. ')
('She', 'She was one to be listened to, whose words were so easy to take to heart.')
('It', 'It is said that the true nature of being is veiled. ')
('The', 'The labor of words, the expression of art')
('I', 'I used to find notes left in the collection basket,')
('beautiful', "beautiful notes about my homilies and about the writer's thoughts on the daily scriptural readings. ")
('It', 'It was a long time before I met the author of the notes.')
('One', 'One Sunday morning, I was told that someone was waiting for me in the office. ')
('We', 'We chatted for a while that Sunday morning and agreed to meet for lunch later that week.')
('As', 'As it turned out we went to lunch several times, and she always wore a hat during the meal.')
('We', 'We spoke of authors we both had read, and it was easy to tell that books are a great love of hers.')
('I', 'I have thought about her often over the years and how she struggled in a society that places an incredible premium on looks')
('Would', 'Would her life have been different had she been pretty? Chances are it would have.')
('How', 'How long does it take most of us to reach that level of human growth, if we ever get there? We get so consumed and diminished,')
('The', 'The truth of her life was a desire to see beyond the surface for a glimpse of what it is that matters. ')
('She', 'She found beauty and grace and they befriended her, and showed her what is real')
('wnagnan', 'wnagnan is good')
('huxue', 'huxue is beautiful')
('we', 'we are good')
Transformations -------------------------->
6. Filtering on the second element in Python
result = pairs.filter(lambda keyValue: len(keyValue[1]) < 20)
result.foreach(print)
('wnagnan', 'wnagnan is good')
('huxue', 'huxue is beautiful')
('we', 'we are good')
7. Word count in Python
words = lines.flatMap(lambda x: x.split(" "))
result1 = words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
('There', 1)
('were', 2)
('a', 9)
('sensitivity', 1)
('and', 10)
('beauty', 2)
('to', 11)
('her', 5)
('that', 9)
('have', 3)
('nothing', 1)
......
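As a follow-up sketch, the counts can be ordered by frequency with sortBy, a standard RDD transformation (the output above is truncated, so the tail of this list is illustrative):

print(result1.sortBy(lambda kv: kv[1], ascending=False).take(5))
# e.g. [('to', 11), ('and', 10), ('a', 9), ('that', 9), ('her', 5)]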
8.1 Computing a per-key average in Python with reduceByKey() and mapValues()
mapValues(func): the keys of the original RDD stay unchanged; each transformed value is paired with its original key to form the elements of the new RDD.
reduceByKey(func): a pair RDD transformation that merges the values belonging to the same key.
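A tiny self-contained illustration of both operations (the pet data is made up):

pets = sc.parallelize([("cat", 1), ("dog", 2), ("cat", 3)])
print(pets.reduceByKey(lambda x, y: x + y).collect())  # [('cat', 4), ('dog', 2)] (order may vary)
print(pets.mapValues(lambda v: v * 10).collect())      # [('cat', 10), ('dog', 20), ('cat', 30)]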
sumCount = result1.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
('There', (1, 1))
('were', (2, 1))
('a', (9, 1))
('sensitivity', (1, 1))
('and', (10, 1))
('beauty', (2, 1))
('to', (11, 1))
('her', (5, 1))
('that', (9, 1))
('have', (3, 1))
('nothing', (1, 1))
......
8.2 Computing a per-key average in Python with combineByKey()
For background on combineByKey, see https://www.cnblogs.com/rigid/p/5563205.html and https://www.zhihu.com/question/33798481
sumCount = result1.combineByKey((lambda x: (x, 1)),                         # createCombiner: first value seen for a key in a partition
                                (lambda x, y: (x[0] + y, x[1] + 1)),        # mergeValue: fold another value into the (sum, count) pair
                                (lambda x, y: (x[0] + y[0], x[1] + y[1])))  # mergeCombiners: merge (sum, count) pairs across partitions
('There', (1, 1))
('were', (2, 1))
('a', (9, 1))
('sensitivity', (1, 1))
('and', (10, 1))
('beauty', (2, 1))
('to', (11, 1))
('her', (5, 1))
('that', (9, 1))
('have', (3, 1))
('nothing', (1, 1))
......
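The three arguments are createCombiner (run the first time a key appears in a partition), mergeValue (folds another value into that partition's accumulator), and mergeCombiners (merges accumulators across partitions). A made-up miniature with a repeated key makes the (sum, count) pairs easier to trace:

mini = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])
combined = mini.combineByKey(lambda v: (v, 1),
                             lambda acc, v: (acc[0] + v, acc[1] + 1),
                             lambda a, b: (a[0] + b[0], a[1] + b[1]))
print(combined.collect())   # [('a', (3, 2)), ('b', (3, 1))] (order may vary)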
8.3 Computing the averages
# Method 1: collect the averages into a dict on the driver
avg = sumCount.map(lambda keyxy: (keyxy[0], keyxy[1][0] / keyxy[1][1])).collectAsMap()
print(avg["There"])
1.0
# Method 2: keep the averages as an RDD
avg = sumCount.map(lambda keyxy: (keyxy[0], keyxy[1][0] / keyxy[1][1]))
print(avg.first())
print(avg.getNumPartitions())
('There', 1.0)
1
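Since only the values change, the division can also be written with mapValues, which leaves the keys untouched (an alternative sketch, not in the original notes):

avg2 = sumCount.mapValues(lambda v: v[0] / v[1])
print(avg2.first())   # e.g. ('There', 1.0)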
9. Customizing the parallelism of reduceByKey() in Python
data = [("a", 3), ("b", 4), ("a", 1)]
sc.parallelize(data).reduceByKey(lambda x, y: x + y)      # default parallelism
sc.parallelize(data).reduceByKey(lambda x, y: x + y, 10)  # custom parallelism
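The resulting partition count can be verified with getNumPartitions() (a quick check, reusing the same data list):

rdd10 = sc.parallelize(data).reduceByKey(lambda x, y: x + y, 10)
print(rdd10.getNumPartitions())   # 10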
10. Custom-sorting integers in string order in Python
rdd = sc.parallelize(data)
sort_data = rdd.sortByKey(ascending=True, numPartitions=None, keyfunc=lambda x: str(x))
sort_data.foreach(print)
('a', 3)
('a', 1)
('b', 4)
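With string keys the keyfunc above changes nothing; the point of this section shows up with integer keys, where string order differs from numeric order. A made-up sketch:

intKeys = sc.parallelize([(10, "x"), (2, "y"), (1, "z")])
print(intKeys.sortByKey(keyfunc=lambda k: str(k)).collect())
# [(1, 'z'), (10, 'x'), (2, 'y')] because "10" < "2" as strings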
11. Loading and saving data
# Load a text file
input = sc.textFile("path/to/file")
# Save as a text file
result.saveAsTextFile(outputFile)
# Read CSV line by line with textFile
import csv
from io import StringIO

def loadRecord(line):
    """Parse one line of CSV."""
    input = StringIO(line)
    reader = csv.DictReader(input, fieldnames=["name", "favouriteAnimal"])
    return next(reader)

input = sc.textFile(inputFile).map(loadRecord)

# Read complete CSV files
def loadRecords(filenameContents):
    """Read all the records in the given file."""
    input = StringIO(filenameContents[1])
    reader = csv.DictReader(input, fieldnames=["name", "favouriteAnimal"])
    return reader

fullFileData = sc.wholeTextFiles(inputFile).flatMap(loadRecords)

# Save as CSV
def writeRecords(records):
    """Write out some CSV records."""
    output = StringIO()
    writer = csv.DictWriter(output, fieldnames=["name", "favouriteAnimal"])
    for record in records:
        writer.writerow(record)
    return [output.getvalue()]

pandaLovers.mapPartitions(writeRecords).saveAsTextFile(outputFile)
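loadRecord can be sanity-checked locally without Spark (the sample line is made up):

print(loadRecord("Holden,panda"))
# {'name': 'Holden', 'favouriteAnimal': 'panda'} (exact dict type varies by Python version)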
12. Accumulators: counting blank lines in Python
file = sc.textFile("Demo.txt")
# Create an Accumulator[int] initialized to 0
blankLines = sc.accumulator(0)

def extractCallSigns(line):
    global blankLines  # access the global variable
    if line == "":
        blankLines += 1
    return line.split(" ")

callSigns = file.flatMap(extractCallSigns)
callSigns.saveAsTextFile("spark_output/callSigns")
print("Blank Lines:%d " % blankLines.value)
Blank Lines:3
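A caveat from Spark's accumulator semantics: updates made inside a transformation such as flatMap can be applied more than once if a task is retried, while updates performed in an action are counted exactly once. A hedged sketch of the action-side variant (blankLines2 is a name introduced here):

blankLines2 = sc.accumulator(0)
# foreach is an action, so each line contributes to the count exactly once
file.foreach(lambda line: blankLines2.add(1) if line == "" else None)
print("Blank Lines:%d" % blankLines2.value)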