• PySpark and Jupyter Notebook


    Environment preparation and configuration:

    export JAVA_HOME=/root/jdk1.8.0_144

    export JAVA_BIN=/root/jdk1.8.0_144/bin

    export JRE_HOME=/root/jdk1.8.0_144/jre

    export CLASSPATH=/root/jdk1.8.0_144/jre/lib:/root/jdk1.8.0_144/lib:/root/jdk1.8.0_144/jre/lib/charsets.jar

    export SCALA_HOME=/root/scala-2.12.0

    export SPARK_HOME=/root/spark-2.4.4

    export PYSPARK_PYTHON=/root/anaconda3/bin/python3

    export PYSPARK_DRIVER_PYTHON=jupyter

    export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

    export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin:/root/anaconda3/bin:$SCALA_HOME/bin:$SPARK_HOME/bin

    With these variables set (e.g. in ~/.bashrc), running pyspark launches a Jupyter Notebook backed by a PySpark session.

    Changing the log level

    In /root/spark-2.4.4/conf, rename the template file and set the root logger to ERROR:

    mv log4j.properties.template log4j.properties

    log4j.rootCategory=ERROR, console

    (base) [root@pyspark bin]# run-example SparkPi 10
    19/10/22 22:08:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Pi is roughly 3.143691143691144
    (base) [root@pyspark bin]# run-example SparkPi 20
    19/10/22 22:12:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Pi is roughly 3.1413995706997855
    (base) [root@pyspark bin]# run-example SparkPi 110
    19/10/22 22:12:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Pi is roughly 3.141459921950902
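
    With the environment in place, a quick sanity check can be run inside the Jupyter notebook that pyspark opens (a minimal sketch; the PySpark shell pre-creates the spark and sc variables):

    print(spark.version)                       # e.g. 2.4.4, matching SPARK_HOME
    print(sc.parallelize(range(1000)).sum())   # 499500 if Spark is executing jobs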

    1. Reading the data
    Load the data with PySpark's read.csv interface:
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("ML").getOrCreate()
    train = spark.read.csv('/root/Friday/train.csv',header=True,inferSchema=True)
    test = spark.read.csv('/root/Friday/test.csv',header=True,inferSchema=True)
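    Since inferSchema=True costs an extra pass over the file, an alternative (a sketch for train.csv; the types mirror the printSchema() output below) is to supply the schema explicitly:
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType
    schema = StructType([
        StructField('User_ID', IntegerType()),
        StructField('Product_ID', StringType()),
        StructField('Gender', StringType()),
        StructField('Age', StringType()),
        StructField('Occupation', IntegerType()),
        StructField('City_Category', StringType()),
        StructField('Stay_In_Current_City_Years', StringType()),
        StructField('Marital_Status', IntegerType()),
        StructField('Product_Category_1', IntegerType()),
        StructField('Product_Category_2', IntegerType()),
        StructField('Product_Category_3', IntegerType()),
        StructField('Purchase', IntegerType()),
    ])
    train = spark.read.csv('/root/Friday/train.csv', header=True, schema=schema)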
    2. Analyzing the data schema
    train.printSchema()
    root
    |-- User_ID: integer (nullable = true)
    |-- Product_ID: string (nullable = true)
    |-- Gender: string (nullable = true)
    |-- Age: string (nullable = true)
    |-- Occupation: integer (nullable = true)
    |-- City_Category: string (nullable = true)
    |-- Stay_In_Current_City_Years: string (nullable = true)
    |-- Marital_Status: integer (nullable = true)
    |-- Product_Category_1: integer (nullable = true)
    |-- Product_Category_2: integer (nullable = true)
    |-- Product_Category_3: integer (nullable = true)
    |-- Purchase: integer (nullable = true)
    3. Previewing the dataset
    train.head(5)
    [Row(User_ID=1000001, Product_ID='P00069042', Gender='F', Age='0-17', Occupation=10, City_Category='A', Stay_In_Current_City_Years='2', Marital_Status=0, Product_Category_1=3, Product_Category_2=None, Product_Category_3=None, Purchase=8370),
    Row(User_ID=1000001, Product_ID='P00248942', Gender='F', Age='0-17', Occupation=10, City_Category='A', Stay_In_Current_City_Years='2', Marital_Status=0, Product_Category_1=1, Product_Category_2=6, Product_Category_3=14, Purchase=15200),
    Row(User_ID=1000001, Product_ID='P00087842', Gender='F', Age='0-17', Occupation=10, City_Category='A', Stay_In_Current_City_Years='2', Marital_Status=0, Product_Category_1=12, Product_Category_2=None, Product_Category_3=None, Purchase=1422),
    Row(User_ID=1000001, Product_ID='P00085442', Gender='F', Age='0-17', Occupation=10, City_Category='A', Stay_In_Current_City_Years='2', Marital_Status=0, Product_Category_1=12, Product_Category_2=14, Product_Category_3=None, Purchase=1057),
    Row(User_ID=1000002, Product_ID='P00285442', Gender='M', Age='55+', Occupation=16, City_Category='C', Stay_In_Current_City_Years='4+', Marital_Status=0, Product_Category_1=8, Product_Category_2=None, Product_Category_3=None, Purchase=7969)]
    4. Checking the number of rows
    train.count()
    550068
    5. Imputing missing values
    With the na.drop() method we can count the rows in train and test that contain no null values. By default, drop() removes rows containing any null value; setting the parameter to 'all' instead drops a row only when every column in it is null. This is similar to the dropna() method in pandas.
    train.na.drop('any').count(),test.na.drop('any').count()
    (166821, 71037)
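    As a cross-check (a sketch), 'all' should leave the counts unchanged unless some row is entirely null:
    train.na.drop('all').count(), test.na.drop('all').count()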
    Fill the null values in train and test with -1 (fillna(-1) only affects the numeric columns; string columns are left unchanged):
    train =train.fillna(-1)
    test = test.fillna(-1)
    train.describe().show()
    +-------+------------------+----------+------+------+-----------------+-------------+--------------------------+-------------------+------------------+------------------+------------------+-----------------+
    |summary|           User_ID|Product_ID|Gender|   Age|       Occupation|City_Category|Stay_In_Current_City_Years|     Marital_Status|Product_Category_1|Product_Category_2|Product_Category_3|         Purchase|
    +-------+------------------+----------+------+------+-----------------+-------------+--------------------------+-------------------+------------------+------------------+------------------+-----------------+
    |  count|            550068|    550068|550068|550068|           550068|       550068|                    550068|             550068|            550068|            550068|            550068|           550068|
    |   mean|1003028.8424013031|      null|  null|  null|8.076706879876669|         null|         1.468494139793958|0.40965298835780306| 5.404270017525106| 6.419769919355425| 3.145214773446192|9263.968712959126|
    | stddev|1727.5915855313747|      null|  null|  null|6.522660487341741|         null|        0.9890866807573103| 0.4917701263173315| 3.936211369201365| 6.565109781181316| 6.681038828257845|5023.065393820575|
    |    min|           1000001| P00000142|     F|  0-17|                0|            A|                         0|                  0|                 1|                -1|                -1|               12|
    |    max|           1006040|  P0099942|     M|   55+|               20|            C|                        4+|                  1|                20|                18|                18|            23961|
    +-------+------------------+----------+------+------+-----------------+-------------+--------------------------+-------------------+------------------+------------------+------------------+-----------------+

    Selecting some columns for analysis:
    train.select('User_ID','Age').show()
    +-------+-----+
    |User_ID|  Age|
    +-------+-----+
    |1000001| 0-17|
    |1000001| 0-17|
    |1000001| 0-17|
    |1000001| 0-17|
    |1000002|  55+|
    |1000003|26-35|
    |1000004|46-50|
    |1000004|46-50|
    |1000004|46-50|
    |1000005|26-35|
    |1000005|26-35|
    |1000005|26-35|
    |1000005|26-35|
    |1000005|26-35|
    |1000006|51-55|
    |1000006|51-55|
    |1000006|51-55|
    |1000006|51-55|
    |1000007|36-45|
    |1000008|26-35|
    +-------+-----+
    only showing top 20 rows
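    Going a step further (a small sketch), we can also check the distribution of a single column:
    train.groupBy('Age').count().orderBy('Age').show()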

    Analyzing the categorical features
    To build a model, we need to look at the distribution of the categorical features in train and test. Here I show this only for Product_ID, but we can do the same for any categorical feature. Let's look at the number of distinct Product_ID categories in train and test, which can be obtained by applying the distinct() and count() methods.

    train.select('Product_ID').distinct().count(),test.select('Product_ID').distinct().count()
    (3631, 3491)
    After counting the distinct values in train and test, we can see that train has more categories than test. Let's use the subtract() method to find the Product_ID categories that appear in test but not in train. We can do the same for all the categorical features.
    diff_cat_in_train_test = test.select('Product_ID').subtract(train.select('Product_ID'))
    diff_cat_in_train_test.distinct().count()
    46
    diff_cat_in_train_test.distinct().show(5)
    +----------+
    |Product_ID|
    +----------+
    | P00322642|
    | P00300142|
    | P00077642|
    | P00249942|
    | P00294942|
    +----------+
    only showing top 5 rows

    As you can see above, 46 distinct categories appear in test but not in train. For these we either need to collect more data about them or skip the rows of test containing those (unseen) categories; one way to do the skipping is sketched below.
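    A sketch of that filtering, using a left-semi join (it keeps only the test rows whose Product_ID also occurs in train):
    train_ids = train.select('Product_ID').distinct()
    test_seen = test.join(train_ids, on='Product_ID', how='leftsemi')  # drops the 46 unseen categories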
    Converting categorical variables into labels

    We also need to convert the categorical column into labels by applying a StringIndexer transformation on Product_ID, which encodes the Product_ID column into a column of label indices.

    from pyspark.ml.feature import StringIndexer
    plan_indexer = StringIndexer(inputCol='Product_ID',outputCol='product_id_trans')
    labeller = plan_indexer.fit(train)

    Train1 = labeller.transform(train)
    Test1 = labeller.transform(test)
    Train1.show(5)
    +-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+----------------+
    |User_ID|Product_ID|Gender| Age|Occupation|City_Category|Stay_In_Current_City_Years|Marital_Status|Product_Category_1|Product_Category_2|Product_Category_3|Purchase|product_id_trans|
    +-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+----------------+
    |1000001| P00069042|     F|0-17|        10|            A|                         2|             0|                 3|                -1|                -1|    8370|           766.0|
    |1000001| P00248942|     F|0-17|        10|            A|                         2|             0|                 1|                 6|                14|   15200|           183.0|
    |1000001| P00087842|     F|0-17|        10|            A|                         2|             0|                12|                -1|                -1|    1422|          1496.0|
    |1000001| P00085442|     F|0-17|        10|            A|                         2|             0|                12|                14|                -1|    1057|           481.0|
    |1000002| P00285442|     M| 55+|        16|            C|                        4+|             0|                 8|                -1|                -1|    7969|           860.0|
    +-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+----------------+
    only showing top 5 rows

    Train1.select('product_id_trans').show(5)
    +----------------+
    |product_id_trans|
    +----------------+
    |           766.0|
    |           183.0|
    |          1496.0|
    |           481.0|
    |           860.0|
    +----------------+
    only showing top 5 rows
    This shows that we have successfully added a transformed column, product_id_trans, to our earlier train DataFrame, giving the Train1 DataFrame.
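    One caveat: StringIndexer defaults to handleInvalid='error', and test contains 46 Product_ID values never seen in train, so an action on Test1 can fail on those rows. A sketch of the tolerant variant:
    plan_indexer = StringIndexer(inputCol='Product_ID', outputCol='product_id_trans',
                                 handleInvalid='keep')  # unseen IDs are mapped to one extra index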
    Selecting features to build the machine learning model
    First, we need to import RFormula from pyspark.ml.feature; then we specify the dependent and independent columns in the formula; we also have to give names to the features column and the label column.

    from pyspark.ml.feature import RFormula
    formula = RFormula(formula="Purchase ~ Age + Occupation + City_Category + Stay_In_Current_City_Years + Product_Category_1 + Product_Category_2 + Gender",
                       featuresCol="features", labelCol="label")

    After creating the formula, we need to fit it on Train1 and use the fitted formula to transform both Train1 and Test1. Let's see how to do this; after fitting and transforming Train1:
    t1 = formula.fit(Train1)
    train1 = t1.transform(Train1)
    test1 = t1.transform(Test1)
    train1.show(5)
    +-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+----------------+--------------------+-------+
    |User_ID|Product_ID|Gender| Age|Occupation|City_Category|Stay_In_Current_City_Years|Marital_Status|Product_Category_1|Product_Category_2|Product_Category_3|Purchase|product_id_trans|            features|  label|
    +-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+----------------+--------------------+-------+
    |1000001| P00069042|     F|0-17|        10|            A|                         2|             0|                 3|                -1|                -1|    8370|           766.0|(16,[6,10,13,14],...| 8370.0|
    |1000001| P00248942|     F|0-17|        10|            A|                         2|             0|                 1|                 6|                14|   15200|           183.0|(16,[6,10,13,14],...|15200.0|
    |1000001| P00087842|     F|0-17|        10|            A|                         2|             0|                12|                -1|                -1|    1422|          1496.0|(16,[6,10,13,14],...| 1422.0|
    |1000001| P00085442|     F|0-17|        10|            A|                         2|             0|                12|                14|                -1|    1057|           481.0|(16,[6,10,13,14],...| 1057.0|
    |1000002| P00285442|     M| 55+|        16|            C|                        4+|             0|                 8|                -1|                -1|    7969|           860.0|(16,[5,6,8,12,13,...| 7969.0|
    +-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+----------------+--------------------+-------+
    only showing top 5 rows

    After applying the formula, we can see that train1 and test1 have two extra columns, called features and label, holding the columns we specified in the formula (featuresCol='features' and labelCol='label'). Intuitively, all the categorical variables in the features column of train1 and test1 have been converted into numeric values, while the numeric variables keep the values they had before ML was applied. We can also inspect the features and label columns of train1 and test1, as the quick check below shows.
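    A quick look at just the two new columns:
    train1.select('features', 'label').show(5, truncate=False)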

  • Original article: https://www.cnblogs.com/songyuejie/p/11717027.html