In Spark you can work with data through three interfaces: the RDD API, the DataFrame API, and Spark SQL. The three are related: an RDD carries no schema, a DataFrame must have a Schema defined when it is created, and Spark SQL is derived from DataFrames; Spark SQL is also the natural choice for people whose background is limited to SQL. Let's walk through them.
1) Setting up the RDD
global Path
if sc.master[0:5]=="local":
    Path="file:/home/jorlinlee/pythonwork/PythonProject/"
else:
    Path="hdfs://master:9000/user/jorlinlee/"
RawUserRDD=sc.textFile(Path+"data/u.user")
userRDD=RawUserRDD.map(lambda line:line.split("|"))
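As a quick sanity check you can peek at the first parsed record; each record is a list of the five pipe-delimited fields (userid, age, gender, occupation, zipcode):
userRDD.take(1)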
2) Creating a DataFrame
Create the sqlContext (in Spark 2.0+ this is a SparkSession):
from pyspark.sql import SparkSession
sqlContext=SparkSession.builder.getOrCreate()
Define the Schema:
from pyspark.sql import Row
user_Rows=userRDD.map(lambda p:
    Row(
        userid=int(p[0]),
        age=int(p[1]),
        gender=p[2],
        occupation=p[3],
        zipcode=p[4]
    )
)
Create the DataFrame
user_df=sqlContext.createDataFrame(user_Rows)
Create an alias for the DataFrame
df=user_df.alias("df")
3) Using Spark SQL
Register a temporary table:
user_df.registerTempTable("user_table")
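In Spark 2.0 and later, registerTempTable is kept only as a deprecated alias; the equivalent call is user_df.createOrReplaceTempView("user_table").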
Count the number of rows with Spark SQL
sqlContext.sql("SELECT count(*) counts FROM user_table").show()
Or enter the Spark SQL statement across multiple lines
sqlContext.sql("""
SELECT count(*) counts
FROM user_table
""").show()
View all the data
sqlContext.sql("SELECT * FROM user_table").show()
Show only a specified number of rows
sqlContext.sql("SELECT * FROM user_table").show(5)
sqlContext.sql("SELECT * FROM user_table LIMIT 5").show()
4) Showing selected columns with SELECT
4-1) Using the RDD
userRDDnew=userRDD.map(lambda x:(x[0],x[3],x[2],x[1]))
userRDDnew.take(5)
[('1', 'technician', 'M', '24'),
('2', 'other', 'F', '53'),
('3', 'writer', 'M', '23'),
('4', 'technician', 'M', '24'),
('5', 'other', 'F', '33')]
4-2) Using the DataFrame
user_df.select("userid","occupation","gender","age").show(5)
Or
df[df['userid'],df['occupation'],df['gender'],df['age']].show(5)
+------+----------+------+---+
|userid|occupation|gender|age|
+------+----------+------+---+
| 1|technician| M| 24|
| 2| other| F| 53|
| 3| writer| M| 23|
| 4|technician| M| 24|
| 5| other| F| 33|
+------+----------+------+---+
only showing top 5 rows
4-3) Using Spark SQL
sqlContext.sql("SELECT userid,occupation,gender,age FROM user_table").show(5)
+------+----------+------+---+
|userid|occupation|gender|age|
+------+----------+------+---+
| 1|technician| M| 24|
| 2| other| F| 53|
| 3| writer| M| 23|
| 4|technician| M| 24|
| 5| other| F| 33|
+------+----------+------+---+
only showing top 5 rows
5) Adding a computed column
5-1) Using the RDD
userRDDnew=userRDD.map(lambda x:(x[0],x[3],x[2],x[1],2016-int(x[1])))
userRDDnew.take(5)
[('1', 'technician', 'M', '24', 1992),
('2', 'other', 'F', '53', 1963),
('3', 'writer', 'M', '23', 1993),
('4', 'technician', 'M', '24', 1992),
('5', 'other', 'F', '33', 1983)]
5-2) Using the DataFrame
df.select("userid","occupation","gender","age",2016-df.age).show(5)
+------+----------+------+---+------------+
|userid|occupation|gender|age|(2016 - age)|
+------+----------+------+---+------------+
| 1|technician| M| 24| 1992|
| 2| other| F| 53| 1963|
| 3| writer| M| 23| 1993|
| 4|technician| M| 24| 1992|
| 5| other| F| 33| 1983|
+------+----------+------+---+------------+
only showing top 5 rows
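If you want the computed column to carry a readable name instead of (2016 - age), you can alias it, for example:
df.select("userid","occupation","gender","age",(2016-df.age).alias("birthyear")).show(5)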
5-3) Using Spark SQL to add a computed column
sqlContext.sql("""
SELECT userid,occupation,gender,age,2016-age birthyear
FROM user_table
""").show(5)
+------+----------+------+---+---------+
|userid|occupation|gender|age|birthyear|
+------+----------+------+---+---------+
| 1|technician| M| 24| 1992|
| 2| other| F| 53| 1963|
| 3| writer| M| 23| 1993|
| 4|technician| M| 24| 1992|
| 5| other| F| 33| 1983|
+------+----------+------+---+---------+
only showing top 5 rows
6) Filtering data
6-1) Using the RDD
userRDD.filter(lambda r:r[3]=='technician' and r[2]=='M' and r[1]=='24').take(6)
6-2) Using the DataFrame
user_df.filter("occupation='technician' ").filter("gender='M' ").filter("age=24").show()
Or
df.filter((df['occupation']=='technician') & (df['gender']=='M') & (df['age']==24)).show()
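Note that with this Column syntax each comparison must be wrapped in parentheses, because Python's & operator binds more tightly than ==.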
6-3) Using Spark SQL to filter data
sqlContext.sql("""
SELECT *
FROM user_table
where occupation='technician' and gender='M' and age=24
""").show()
7) Sorting by a single column
7-1) Using the RDD
Ascending
userRDD.takeOrdered(5,key=lambda x:int(x[1]))
Descending
userRDD.takeOrdered(5,key=lambda x:-1*int(x[1]))
7-2) Using the DataFrame
Ascending
user_df.select("userid","occupation","gender","age").orderBy("age").show(5)
Or
df.select("userid","occupation","gender","age").orderBy(df.age).show(5)
Descending
df.select("userid","occupation","gender","age").orderBy("age",ascending=0).show(5)
Or
df.select("userid","occupation","gender","age").orderBy(df.age.desc()).show(5)
7-3) Using Spark SQL
Ascending
sqlContext.sql("SELECT userid,occupation,gender,age FROM user_table ORDER BY age").show(5)
Descending
sqlContext.sql("SELECT userid,occupation,gender,age FROM user_table ORDER BY age DESC").show(5)
8) Sorting by multiple columns
8-1) Using the RDD
userRDD.takeOrdered(5,key=lambda x:(-int(x[1]),x[2]))
8-2) Using the DataFrame (in the ascending list, 0 means descending and 1 means ascending)
df.orderBy(["age","gender"],ascending=[0,1]).show(5)
Or
df.orderBy(df.age.desc(),df.gender).show(5)
8-3) Using Spark SQL
sqlContext.sql("SELECT userid,age,gender,occupation,zipcode FROM user_table ORDER BY age DESC,gender").show(5)
9) Showing distinct values
9-1) Using the RDD
userRDD.map(lambda x:x[2]).distinct().collect()
userRDD.map(lambda x:(x[1],x[2])).distinct().collect()
9-2) Using the DataFrame
user_df.select("age","gender").distinct().show()
9-3) Using Spark SQL
sqlContext.sql("SELECT distinct gender FROM user_table").show()
10) Grouping and counting data
10-1) Using the RDD
userRDD.map(lambda x:(x[2],1)).reduceByKey(lambda x,y : x+y).collect()
userRDD.map(lambda x:((x[2],x[3]),1)).reduceByKey(lambda x,y : x+y).collect()
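reduceByKey operates on (key, value) pairs, so when counting by two fields at once both of them are packed into a single tuple key, as in the second example above.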
10-2) Using the DataFrame
user_df.select("gender").groupby("gender").count().show()
user_df.select("gender","occupation").groupby("gender","occupation").count().orderBy("gender","occupation").show()
+------+-------------+-----+
|gender| occupation|count|
+------+-------------+-----+
| F|administrator| 36|
| F| artist| 13|
| F| educator| 26|
| F| engineer| 2|
| F|entertainment| 2|
| F| executive| 3|
| F| healthcare| 11|
| F| homemaker| 6|
| F| lawyer| 2|
| F| librarian| 29|
| F| marketing| 10|
| F| none| 4|
| F| other| 36|
| F| programmer| 6|
| F| retired| 1|
| F| salesman| 3|
| F| scientist| 3|
| F| student| 60|
| F| technician| 1|
| F| writer| 19|
+------+-------------+-----+
only showing top 20 rows
For a more intuitive display, use a crosstab:
user_df.stat.crosstab("occupation","gender").show()
+-----------------+---+---+
|occupation_gender| F| M|
+-----------------+---+---+
| scientist| 3| 28|
| student| 60|136|
| writer| 19| 26|
| salesman| 3| 9|
| retired| 1| 13|
| administrator| 36| 43|
| programmer| 6| 60|
| doctor| 0| 7|
| homemaker| 6| 1|
| executive| 3| 29|
| engineer| 2| 65|
| entertainment| 2| 16|
| marketing| 10| 16|
| technician| 1| 26|
| artist| 13| 15|
| librarian| 29| 22|
| lawyer| 2| 10|
| educator| 26| 69|
| healthcare| 11| 5|
| none| 4| 5|
| other| 36| 69|
+-----------------+---+---+
10-3) Using Spark SQL
sqlContext.sql("SELECT gender,count(*) counts FROM user_table GROUP BY gender").show()
sqlContext.sql("SELECT gender,occupation,count(*) counts FROM user_table GROUP BY gender,occupation").show()
11) Joining data with JOIN
This step requires downloading a ZipCode dataset and preparing it first, which yields a zipcode_df DataFrame and a zipcode_table temporary table; a sketch of that preparation is shown below, after which the join itself can be run.
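A minimal sketch of that preparation, assuming the downloaded file has been saved as data/zipcode.csv with a header row and the columns zipcode, city and state in the first three positions (the file name, column order and quoting are assumptions to adjust to the dataset you actually use):
rawZipRDD=sc.textFile(Path+"data/zipcode.csv")
header=rawZipRDD.first()
# drop the header row, strip any quotes, and split the comma-separated fields
zipRDD=rawZipRDD.filter(lambda line:line!=header).map(lambda line:line.replace("\"","").split(","))
# keep zipcode as a string so it matches user_df.zipcode in the joins below
zipcode_Rows=zipRDD.map(lambda p:
    Row(
        zipcode=p[0],
        city=p[1],
        state=p[2]
    )
)
zipcode_df=sqlContext.createDataFrame(zipcode_Rows)
zipcode_df.registerTempTable("zipcode_table")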
11-1) Using Spark SQL
sqlContext.sql("SELECT u.*,z.city,z.state FROM user_table u LEFT JOIN zipcode_table z ON u.zipcode=z.zipcode WHERE z.state='NY' ").show()
sqlContext.sql("SELECT z.state,count(*) counts FROM user_table u LEFT JOIN zipcode_table z ON u.zipcode=z.zipcode GROUP BY z.state").show()
11-2) Using the DataFrame
joined_df=user_df.join(zipcode_df,user_df.zipcode==zipcode_df.zipcode,"left_outer")
joined_df.filter("state='NY' ").show()
GroupByState_df=joined_df.groupBy("state").count()
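As before, call GroupByState_df.show() to display the per-state counts.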