• Python + Spark 2.0 + Hadoop Study Notes: Working with RDDs, DataFrames, and Spark SQL


    In Spark, data can be manipulated through three interfaces: the RDD API, the DataFrame API, and Spark SQL. The three are related: an RDD carries no schema, a DataFrame must have a Schema defined when it is created, and Spark SQL is built on top of DataFrames. Spark SQL is the most approachable of the three for people who only have a SQL background. Let's walk through each of them.

    1) Creating the RDD

    global Path
    if sc.master[0:5]=="local":
        # Running locally: read from the local file system
        Path="file:/home/jorlinlee/pythonwork/PythonProject/"
    else:
        # Running on the cluster: read from HDFS
        Path="hdfs://master:9000/user/jorlinlee/"

    RawUserRDD=sc.textFile(Path+"data/u.user")

    # Each line of u.user is "|"-delimited: userid|age|gender|occupation|zipcode
    userRDD=RawUserRDD.map(lambda line:line.split("|"))
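
    The snippets above assume a live SparkContext named sc. If these notes are run as a standalone script instead of inside the pyspark shell, a minimal setup sketch would look like the following (the app name is a placeholder):

    # Minimal setup sketch, only needed outside the pyspark shell
    from pyspark import SparkConf, SparkContext

    sparkConf=SparkConf().setAppName("SparkSQLNotes")
    sc=SparkContext(conf=sparkConf)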

    2) Creating a DataFrame

    Create the sqlContext (in Spark 2.0 this is a SparkSession, which also serves as the SQLContext)

    from pyspark.sql import SparkSession

    sqlContext=SparkSession.builder.getOrCreate()

    Define the Schema

    from pyspark.sql import Row

    user_Rows=userRDD.map(lambda p:
        Row(
            userid=int(p[0]),
            age=int(p[1]),
            gender=p[2],
            occupation=p[3],
            zipcode=p[4]
        )
    )

    Create the DataFrame

    user_df=sqlContext.createDataFrame(user_Rows)

    Create an alias for the DataFrame

    df=user_df.alias("df")
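
    As a quick sanity check (not in the original notes), printSchema shows the field names and types that createDataFrame inferred from the Row objects:

    # Optional: inspect the inferred schema
    user_df.printSchema()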

    3) Using Spark SQL

    Register a temporary table

    user_df.registerTempTable("user_table")
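
    A side note: registerTempTable still works in Spark 2.0 but is deprecated there; the Spark 2.x equivalent is:

    # Spark 2.x replacement for the deprecated registerTempTable
    user_df.createOrReplaceTempView("user_table")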

    Use Spark SQL to count the rows

    sqlContext.sql("SELECT count(*) counts FROM user_table").show()

    Or enter the Spark SQL statement over multiple lines

    sqlContext.sql("""
    SELECT count(*) counts
    FROM user_table
    """).show()

    View all the data

    sqlContext.sql("SELECT * FROM user_table").show()

    Show only a specified number of rows

    sqlContext.sql("SELECT * FROM user_table").show(5)

    sqlContext.sql("SELECT * FROM user_table LIMIT 5").show()

    4) SELECT: displaying selected fields

    4-1) Using an RDD

    userRDDnew=userRDD.map(lambda x:(x[0],x[3],x[2],x[1]))

    userRDDnew.take(5)

    [('1', 'technician', 'M', '24'),
    ('2', 'other', 'F', '53'),
    ('3', 'writer', 'M', '23'),
    ('4', 'technician', 'M', '24'),
    ('5', 'other', 'F', '33')]

    4-2) Using a DataFrame

    user_df.select("userid","occupation","gender","age").show(5)

    Or

    df[df['userid'],df['occupation'],df['gender'],df['age']].show(5)

    +------+----------+------+---+
    |userid|occupation|gender|age|
    +------+----------+------+---+
    |     1|technician|     M| 24|
    |     2|     other|     F| 53|
    |     3|    writer|     M| 23|
    |     4|technician|     M| 24|
    |     5|     other|     F| 33|
    +------+----------+------+---+
    only showing top 5 rows

    4-3) Using Spark SQL

    sqlContext.sql("SELECT userid,occupation,gender,age FROM user_table").show(5)

    +------+----------+------+---+
    |userid|occupation|gender|age|
    +------+----------+------+---+
    |     1|technician|     M| 24|
    |     2|     other|     F| 53|
    |     3|    writer|     M| 23|
    |     4|technician|     M| 24|
    |     5|     other|     F| 33|
    +------+----------+------+---+
    only showing top 5 rows

    5) Adding a computed field

    5-1) Using an RDD

    userRDDnew=userRDD.map(lambda x:(x[0],x[3],x[2],x[1],2016-int(x[1])))

    userRDDnew.take(5)

    [('1', 'technician', 'M', '24', 1992),
    ('2', 'other', 'F', '53', 1963),
    ('3', 'writer', 'M', '23', 1993),
    ('4', 'technician', 'M', '24', 1992),
    ('5', 'other', 'F', '33', 1983)]

    5-2) Using a DataFrame

    df.select("userid","occupation","gender","age",2016-df.age).show(5)

    +------+----------+------+---+------------+
    |userid|occupation|gender|age|(2016 - age)|
    +------+----------+------+---+------------+
    |     1|technician|     M| 24|        1992|
    |     2|     other|     F| 53|        1963|
    |     3|    writer|     M| 23|        1993|
    |     4|technician|     M| 24|        1992|
    |     5|     other|     F| 33|        1983|
    +------+----------+------+---+------------+
    only showing top 5 rows
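
    To give the computed column a readable name on the DataFrame side, the expression can be aliased; the column name birthyear below is just an illustrative choice that mirrors the Spark SQL example that follows:

    # Alias the computed expression so the new column is named birthyear
    df.select("userid","occupation","gender","age",(2016-df.age).alias("birthyear")).show(5)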

    5-3) Using Spark SQL to add a computed field

    sqlContext.sql("""
    SELECT userid,occupation,gender,age,2016-age birthyear
    FROM user_table
    """).show(5)

    +------+----------+------+---+---------+
    |userid|occupation|gender|age|birthyear|
    +------+----------+------+---+---------+
    |     1|technician|     M| 24|     1992|
    |     2|     other|     F| 53|     1963|
    |     3|    writer|     M| 23|     1993|
    |     4|technician|     M| 24|     1992|
    |     5|     other|     F| 33|     1983|
    +------+----------+------+---+---------+
    only showing top 5 rows

    6) Filtering data

    6-1) Using an RDD

    userRDD.filter(lambda r:r[3]=='technician' and r[2]=='M' and r[1]=='24').take(6)

    6-2) Using a DataFrame

    user_df.filter("occupation='technician' ").filter("gender='M' ").filter("age=24").show()

    Or

    df.filter((df['occupation']=='technician') & (df['gender']=='M') & (df['age']==24)).show()

    6-3) Using Spark SQL to filter data

    sqlContext.sql("""
    SELECT *
    FROM user_table
    where occupation='technician' and gender='M' and age=24
    """).show()

    7) Sorting by a single field

    7-1) Using an RDD

    Ascending

    userRDD.takeOrdered(5,key=lambda x:int(x[1]))

    Descending

    userRDD.takeOrdered(5,key=lambda x:-1*int(x[1]))
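
    takeOrdered only returns the first n elements to the driver; when the whole RDD needs to be sorted, sortBy is an alternative (a side note, not in the original notes):

    # Sort the entire RDD by age in descending order, then take the first five
    userRDD.sortBy(lambda x:int(x[1]),ascending=False).take(5)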

    7-2) Using a DataFrame

    Ascending

    user_df.select("userid","occupation","gender","age").orderBy("age").show(5)

    Or

    df.select("userid","occupation","gender","age").orderBy(df.age).show(5)

    Descending

    df.select("userid","occupation","gender","age").orderBy("age",ascending=0).show(5)

    Or

    df.select("userid","occupation","gender","age").orderBy(df.age.desc()).show(5)

    7-3) Using Spark SQL

    Ascending

    sqlContext.sql("SELECT userid,occupation,gender,age FROM user_table ORDER BY age").show(5)

    Descending

    sqlContext.sql("SELECT userid,occupation,gender,age FROM user_table ORDER BY age DESC").show(5)

    8) Sorting by multiple fields

    8-1) Using an RDD

    userRDD.takeOrdered(5,key=lambda x:(-int(x[1]),x[2]))

    8-2) Using a DataFrame (in the ascending list, 0 means descending and 1 means ascending)

    df.orderBy(["age","gender"],ascending=[0,1]).show(5)

    Or

    df.orderBy(df.age.desc(),df.gender).show(5)

    8-3) Using Spark SQL

    sqlContext.sql("SELECT userid,age,gender,occupation,zipcode FROM user_table ORDER BY age DESC,gender").show(5)

    9) Showing distinct data

    9-1) Using an RDD

    userRDD.map(lambda x:x[2]).distinct().collect()

    userRDD.map(lambda x:(x[1],x[2])).distinct().collect()

    9-2) Using a DataFrame

    user_df.select("age","gender").distinct().show()

    9-3) Using Spark SQL

    sqlContext.sql("SELECT distinct gender FROM user_table").show()

    10) Grouping and counting data

    10-1) Using an RDD

    userRDD.map(lambda x:(x[2],1)).reduceByKey(lambda x,y : x+y).collect()

    # Key on the (gender, occupation) pair so reduceByKey counts each combination
    userRDD.map(lambda x:((x[2],x[3]),1)).reduceByKey(lambda x,y : x+y).collect()

    10-2) Using a DataFrame

    user_df.select("gender").groupby("gender").count().show()

    user_df.select("gender","occupation").groupby("gender","occupation").count().orderBy("gender","occupation").show()

    +------+-------------+-----+
    |gender|   occupation|count|
    +------+-------------+-----+
    |     F|administrator|   36|
    |     F|       artist|   13|
    |     F|     educator|   26|
    |     F|     engineer|    2|
    |     F|entertainment|    2|
    |     F|    executive|    3|
    |     F|   healthcare|   11|
    |     F|    homemaker|    6|
    |     F|       lawyer|    2|
    |     F|    librarian|   29|
    |     F|    marketing|   10|
    |     F|         none|    4|
    |     F|        other|   36|
    |     F|   programmer|    6|
    |     F|      retired|    1|
    |     F|     salesman|    3|
    |     F|    scientist|    3|
    |     F|      student|   60|
    |     F|   technician|    1|
    |     F|       writer|   19|
    +------+-------------+-----+
    only showing top 20 rows

    A more intuitive display: a crosstab of occupation against gender

    user_df.stat.crosstab("occupation","gender").show()

    +-----------------+---+---+
    |occupation_gender|  F|  M|
    +-----------------+---+---+
    |        scientist|  3| 28|
    |          student| 60|136|
    |           writer| 19| 26|
    |         salesman|  3|  9|
    |          retired|  1| 13|
    |    administrator| 36| 43|
    |       programmer|  6| 60|
    |           doctor|  0|  7|
    |        homemaker|  6|  1|
    |        executive|  3| 29|
    |         engineer|  2| 65|
    |    entertainment|  2| 16|
    |        marketing| 10| 16|
    |       technician|  1| 26|
    |           artist| 13| 15|
    |        librarian| 29| 22|
    |           lawyer|  2| 10|
    |         educator| 26| 69|
    |       healthcare| 11|  5|
    |             none|  4|  5|
    |            other| 36| 69|
    +-----------------+---+---+
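
    groupBy is not limited to count(); other aggregates are available through the pyspark.sql.functions module. A brief sketch (not in the original notes) computing the average age per gender:

    from pyspark.sql import functions as F

    # Average age and user count per gender
    user_df.groupBy("gender").agg(F.avg("age").alias("avg_age"),F.count("*").alias("counts")).show()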

    10-3) Using Spark SQL

    sqlContext.sql("SELECT gender,count(*) counts FROM user_table GROUP BY gender").show()

    sqlContext.sql("SELECT gender,occupation,count(*) counts FROM user_table GROUP BY gender,occupation").show()

    11) Joining data

    A ZipCode data set needs to be downloaded and prepared first, producing a zipcode_df DataFrame and a registered zipcode_table temporary table; the joins below use both (a preparation sketch follows).
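
    The preparation step is not shown in these notes; below is a rough sketch of what it might look like, assuming the ZipCode data set is a CSV file saved as data/free-zipcode-database.csv whose first three columns are zipcode, city and state. The file name and column positions are assumptions and may need adjusting to the actual download.

    # Hypothetical preparation of zipcode_df and zipcode_table; adjust the file
    # name, quoting and column positions to match the actual data set.
    rawZipRDD=sc.textFile(Path+"data/free-zipcode-database.csv")
    header=rawZipRDD.first()
    zipRDD=rawZipRDD.filter(lambda line:line!=header).map(lambda line:line.replace("\"","").split(","))
    zipcode_Rows=zipRDD.map(lambda p:Row(zipcode=p[0],city=p[1],state=p[2]))
    zipcode_df=sqlContext.createDataFrame(zipcode_Rows)
    zipcode_df.registerTempTable("zipcode_table")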

    11-1) Using Spark SQL

    sqlContext.sql("SELECT u.*,z.city,z.state FROM user_table u LEFT JOIN zipcode_table z ON u.zipcode=z.zipcode WHERE z.state='NY' ").show()

    sqlContext.sql("SELECT z.state,count(*)  counts FROM user_table u LEFT JOIN zipcode_table z ON u.zipcode=z.zipcode GROUP BY z.state").show()

    11-2) Using a DataFrame

    joined_df=user_df.join(zipcode_df,user_df.zipcode==zipcode_df.zipcode,"left_outer") 

    joined_df.filter("state='NY' ").show()

    GroupByState_df=joined_df.groupBy("state").count()
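
    To display the per-state counts computed above:

    GroupByState_df.show()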
