• PySpark ML


    Spark MLlib

    Connecting to Spark

    In general, when you point the master at a specific IP address you also need to specify the port number.
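    For example, a minimal sketch (the standalone-cluster host and port below are placeholder values, not taken from these notes):

    from pyspark.sql import SparkSession

    # Standalone cluster master URL has the form spark://<host>:<port> (7077 is the default port)
    spark = SparkSession.builder \
                        .master('spark://10.0.0.1:7077') \
                        .getOrCreate()

    # Local mode needs no host or port, e.g. .master('local[*]') to use all available cores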

    Creating a SparkSession

    First you need to create a SparkSession.

    In this exercise, you'll spin up a local Spark cluster using all available cores. The cluster will be accessible via a SparkSession object.
    The SparkSession class has a builder attribute, which is an instance of the Builder class. The Builder class exposes three important methods that let you:
    specify the location of the master node;
    name the application (optional); and retrieve an existing SparkSession or, if there is none, create a new one.
    The SparkSession class has a version attribute which gives the version of Spark.
    Find out more about SparkSession in the Spark documentation.

    # Import the PySpark module
    from pyspark.sql import SparkSession
    
    # Create SparkSession object
    spark = SparkSession.builder \
                        .master('local[*]') \
                        .appName('test') \
                        .getOrCreate()
    
    # What version of Spark?
    # (Might be different to what you saw in the presentation!)
    print(spark.version)
    
    # Terminate the cluster
    spark.stop()
    

    loading data

    Everything is accessed via dot notation.

    # Read data from CSV file
    flights = spark.read.csv('flights.csv',
                             sep=',',
                             header=True,
                             inferSchema=True,
                             nullValue='NA')
    
    # Get number of records
    print("The data contain %d records." % flights.count())
    
    # View the first five records
    flights.show(5)
    
    # Check column data types
    print(flights.dtypes)
    
    <script.py> output:
        The data contain 50000 records.
        +---+---+---+-------+------+---+----+------+--------+-----+
        |mon|dom|dow|carrier|flight|org|mile|depart|duration|delay|
        +---+---+---+-------+------+---+----+------+--------+-----+
        | 11| 20|  6|     US|    19|JFK|2153|  9.48|     351| null|
        |  0| 22|  2|     UA|  1107|ORD| 316| 16.33|      82|   30|
        |  2| 20|  4|     UA|   226|SFO| 337|  6.17|      82|   -8|
        |  9| 13|  1|     AA|   419|ORD|1236| 10.33|     195|   -5|
        |  4|  2|  5|     AA|   325|ORD| 258|  8.92|      65| null|
        +---+---+---+-------+------+---+----+------+--------+-----+
        only showing top 5 rows
        
        [('mon', 'int'), ('dom', 'int'), ('dow', 'int'), ('carrier', 'string'), ('flight', 'int'), ('org', 'string'), ('mile', 'int'), ('depart', 'double'), ('duration', 'int'), ('delay', 'int')]
    
    
    **Specifying column data types when reading data**
    
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType
    
    # Specify column names and types
    schema = StructType([
        StructField("id", IntegerType()),
        StructField("text", StringType()),
        StructField("label", IntegerType())
    ])
    
    # Load data from a delimited file
    sms = spark.read.csv("sms.csv", sep=';', header=False, schema=schema)
    
    # Print schema of DataFrame
    sms.printSchema()
    

    Data Preparation

    Preparing the data

    Dropping rows and columns

    # Remove the 'flight' column
    flights_drop_column = flights.drop('flight')
    
    # Number of records with missing 'delay' values
    flights_drop_column.filter('delay IS NULL').count()
    
    # Remove records with missing 'delay' values
    flights_valid_delay = flights_drop_column.filter('delay IS NOT NULL')
    
    # Remove records with missing values in any column and get the number of remaining rows
    flights_none_missing = flights_valid_delay.dropna()
    print(flights_none_missing.count())
    

    Column manipulation

    Creating new columns

    # Import the required function
    from pyspark.sql.functions import round
    
    # Convert 'mile' to 'km' and drop 'mile' column
    flights_km = flights.withColumn('km', round(flights.mile * 1.60934, 0)) \
                        .drop('mile')
    
    # Create 'label' column indicating whether flight delayed (1) or not (0)
    flights_km = flights_km.withColumn('label', (flights_km.delay >= 15).cast('integer'))
    
    # Check first five records
    flights_km.show(5)
    

    Encoding categorical variables

    from pyspark.ml.feature import StringIndexer
    
    # Create an indexer
    indexer = StringIndexer(inputCol='carrier', outputCol='carrier_idx')
    
    # Indexer identifies categories in the data
    indexer_model = indexer.fit(flights)
    
    # Indexer creates a new column with numeric index values
    flights_indexed = indexer_model.transform(flights)
    
    # Repeat the process for the other categorical feature
    flights_indexed = StringIndexer(inputCol='org', outputCol='org_idx').fit(flights_indexed).transform(flights_indexed)
    

    The StringIndexer transformer encodes a column of categorical features (or labels) as numeric indices starting from 0. Indexing categorical features this way lets algorithms that cannot accept categorical input use them, and can improve the efficiency of algorithms such as decision trees.
    Indices are assigned by label frequency: more frequent labels are encoded first, so the most frequent label receives index 0.
    If the input column is numeric, it is first cast to string and then indexed.
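    A small sketch to illustrate the frequency-based ordering (made-up data, reusing the spark session from above; not part of the course exercise):

    from pyspark.ml.feature import StringIndexer

    # 'ORD' is the most frequent value, so it should receive index 0.0
    toy = spark.createDataFrame(
        [('ORD',), ('ORD',), ('ORD',), ('SFO',), ('SFO',), ('JFK',)],
        ['org']
    )
    indexed = StringIndexer(inputCol='org', outputCol='org_idx').fit(toy).transform(toy)
    indexed.select('org', 'org_idx').distinct().sort('org_idx').show()
    # Expected mapping: ORD -> 0.0, SFO -> 1.0, JFK -> 2.0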

    Assembling columns

    Combining several columns into one

    # Import the necessary class
    from pyspark.ml.feature import VectorAssembler
    
    # Create an assembler object
    assembler = VectorAssembler(inputCols=[
        'mon', 'dom', 'dow', 'carrier_idx', 'org_idx', 'km', 'depart', 'duration'
    ], outputCol='features')
    
    # Consolidate predictor columns
    flights_assembled = assembler.transform(flights)
    
    # Check the resulting column
    flights_assembled.select('features', 'delay').show(5, truncate=False)
    
    <script.py> output:
        +-----------------------------------------+-----+
        |features                                 |delay|
        +-----------------------------------------+-----+
        |[0.0,22.0,2.0,0.0,0.0,509.0,16.33,82.0]  |30   |
        |[2.0,20.0,4.0,0.0,1.0,542.0,6.17,82.0]   |-8   |
        |[9.0,13.0,1.0,1.0,0.0,1989.0,10.33,195.0]|-5   |
        |[5.0,2.0,1.0,0.0,1.0,885.0,7.98,102.0]   |2    |
        |[7.0,2.0,6.0,1.0,0.0,1180.0,10.83,135.0] |54   |
        +-----------------------------------------+-----+
        only showing top 5 rows
    

    VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector in order to train ML models such as logistic regression and decision trees. VectorAssembler accepts the following input column types: all numeric types, boolean type, and vector type. In each row, the values of the input columns are concatenated into a vector in the specified order.
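    A tiny sketch of how mixed column types are concatenated, in order, into one vector (made-up data, reusing the spark session from above; not from the course):

    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.linalg import Vectors

    toy = spark.createDataFrame(
        [(1, True, Vectors.dense([0.5, 1.5]))],
        ['num', 'flag', 'vec']
    )
    assembled = VectorAssembler(inputCols=['num', 'flag', 'vec'], outputCol='features').transform(toy)
    assembled.select('features').show(truncate=False)
    # The single row becomes one vector: [1.0,1.0,0.5,1.5]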

    Decision Tree

    randomSplit

    Equivalent to train_test_split: randomly splits the dataset into training and testing sets.

    # Split into training and testing sets in a 80:20 ratio
    flights_train, flights_test = flights.randomSplit([0.8, 0.2], seed=17)
    
    # Check that training set has around 80% of records
    training_ratio = flights_train.count() / flights.count()
    print(training_ratio)
    

    Decision tree model

    # Import the Decision Tree Classifier class
    from pyspark.ml.classification import DecisionTreeClassifier
    
    # Create a classifier object and fit to the training data
    tree = DecisionTreeClassifier()
    tree_model = tree.fit(flights_train)
    
    # Create predictions for the testing data and take a look at the predictions
    prediction = tree_model.transform(flights_test)
    prediction.select('label', 'prediction', 'probability').show(5, False)
    
    <script.py> output:
        +-----+----------+----------------------------------------+
        |label|prediction|probability                             |
        +-----+----------+----------------------------------------+
        |1    |1.0       |[0.2911010558069382,0.7088989441930619] |
        |1    |1.0       |[0.3875,0.6125]                         |
        |1    |1.0       |[0.3875,0.6125]                         |
        |0    |0.0       |[0.6337448559670782,0.3662551440329218] |
        |0    |0.0       |[0.9368421052631579,0.06315789473684211]|
        +-----+----------+----------------------------------------+
        only showing top 5 rows
    

    Evaluate the Decision Tree

    # Create a confusion matrix
    prediction.groupBy('label', 'prediction').count().show()
    
    # Calculate the elements of the confusion matrix
    TN = prediction.filter('prediction = 0 AND label = prediction').count()
    TP = prediction.filter('prediction = 1 AND label = prediction').count()
    FN = prediction.filter('prediction = 0 AND label != prediction').count()
    FP = prediction.filter('prediction = 1 AND label != prediction').count()
    
    # Accuracy measures the proportion of correct predictions
    accuracy = (TN + TP) / (TN + TP + FN + FP)
    print(accuracy)
    
    <script.py> output:
        +-----+----------+-----+
        |label|prediction|count|
        +-----+----------+-----+
        |    1|       0.0|  154|
        |    0|       0.0|  289|
        |    1|       1.0|  328|
        |    0|       1.0|  190|
        +-----+----------+-----+
        
        0.6420395421436004
    

    Logistic Regression

    Logistic regression

    # Import the logistic regression class
    from pyspark.ml.classification import LogisticRegression
    
    # Create a classifier object and train on training data
    logistic = LogisticRegression().fit(flights_train)
    
    # Create predictions for the testing data and show confusion matrix
    prediction = logistic.transform(flights_test)
    prediction.groupBy('label', 'prediction').count().show()
    

    Evaluate the Logistic Regression model
    Accuracy is generally not a very reliable metric because it can be biased by the most common target class.
    There are two other useful metrics: precision and recall.
    Precision is the proportion of positive predictions which are correct. For all flights which are predicted to be delayed, what proportion is actually delayed?
    Recall is the proportion of positive outcomes which are correctly predicted. For all delayed flights, what proportion is correctly predicted by the model?
    Precision and recall are generally formulated in terms of the positive target class, but it is also possible to calculate weighted versions of these metrics which look at both target classes (a short sketch follows the output below).
    The components of the confusion matrix are available as TN, TP, FN and FP, as well as the prediction object.

    from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator
    
    # Calculate precision and recall
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    print('precision = {:.2f}\nrecall    = {:.2f}'.format(precision, recall))
    
    # Find weighted precision
    multi_evaluator = MulticlassClassificationEvaluator()
    weighted_precision = multi_evaluator.evaluate(prediction, {multi_evaluator.metricName: "weightedPrecision"})
    
    # Find AUC
    binary_evaluator = BinaryClassificationEvaluator()
    auc = binary_evaluator.evaluate(prediction, {binary_evaluator.metricName: "areaUnderROC"})
    
    <script.py> output:
        precision = 0.58
        recall    = 0.59
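
    The same evaluator can also return weighted recall or overall accuracy by switching the metric name (a sketch reusing the multi_evaluator and prediction objects from above; metric names come from the MulticlassClassificationEvaluator API):

    # Weighted recall and overall accuracy from the same evaluator
    weighted_recall = multi_evaluator.evaluate(prediction, {multi_evaluator.metricName: "weightedRecall"})
    overall_accuracy = multi_evaluator.evaluate(prediction, {multi_evaluator.metricName: "accuracy"})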
    

    Punctuation, numbers and tokens

    # Import the necessary functions
    from pyspark.sql.functions import regexp_replace  # regular expressions
    from pyspark.ml.feature import Tokenizer
    
    # Remove punctuation (REGEX provided) and numbers
    wrangled = sms.withColumn('text', regexp_replace(sms.text, '[_():;,.!?\-]', ' '))
    wrangled = wrangled.withColumn('text', regexp_replace(wrangled.text, '[0-9]', ' '))
    
    # Merge multiple spaces
    wrangled = wrangled.withColumn('text', regexp_replace(wrangled.text, ' +', ' '))
    
    # Split the text into words
    wrangled = Tokenizer(inputCol='text', outputCol='words').transform(wrangled)
    
    wrangled.show(4, truncate=False)
    
    
    <script.py> output:
        +---+----------------------------------+-----+------------------------------------------+
        |id |text                              |label|words                                     |
        +---+----------------------------------+-----+------------------------------------------+
        |1  |Sorry I'll call later in meeting  |0    |[sorry, i'll, call, later, in, meeting]   |
        |2  |Dont worry I guess he's busy      |0    |[dont, worry, i, guess, he's, busy]       |
        |3  |Call FREEPHONE now                |1    |[call, freephone, now]                    |
        |4  |Win a cash prize or a prize worth |1    |[win, a, cash, prize, or, a, prize, worth]|
        +---+----------------------------------+-----+------------------------------------------+
        only showing top 4 rows
    
    

    Hashing encoding

    This works much like CountVectorizer in gensim: terms are mapped to feature indices by hashing.

    from pyspark.ml.feature import StopWordsRemover, HashingTF, IDF
    
    # Remove stop words.
    wrangled = StopWordsRemover(inputCol='words', outputCol='terms') \
          .transform(sms)
    
    # Apply the hashing trick
    wrangled = HashingTF(inputCol='terms', outputCol='hash', numFeatures=1024) \
          .transform(wrangled)
    
    # Convert hashed symbols to TF-IDF
    tf_idf = IDF(inputCol='hash', outputCol='features') \
          .fit(wrangled).transform(wrangled)
          
    tf_idf.select('terms', 'features').show(4, truncate=False)
    

    Logistic regression example

    # Split the data into training and testing sets
    sms_train, sms_test = sms.randomSplit([0.8, 0.2], seed=13)
    
    # Fit a Logistic Regression model to the training data
    logistic = LogisticRegression(regParam=0.2).fit(sms_train)
    
    # Make predictions on the testing data
    prediction = logistic.transform(sms_test)
    
    # Create a confusion matrix, comparing predictions to known labels
    prediction.groupBy('label', 'prediction').count().show()
    
    Selected columns from first few rows of the sms DataFrame:
    
    +-----+--------------------+
    |label|            features|
    +-----+--------------------+
    |    0|(1024,[138,344,37...|
    |    0|(1024,[53,233,329...|
    |    1|(1024,[138,396],[...|
    |    1|(1024,[31,69,387,...|
    |    0|(1024,[116,262,33...|
    +-----+--------------------+
    only showing top 5 rows
    

    One-hot encoding

    # Import the one hot encoder class
    from pyspark.ml.feature import OneHotEncoderEstimator
    
    # Create an instance of the one hot encoder
    onehot = OneHotEncoderEstimator(inputCols=['org_idx'], outputCols=['org_dummy'])
    
    # Apply the one hot encoder to the flights data
    onehot = onehot.fit(flights)
    flights_onehot = onehot.transform(flights)
    
    # Check the results
    flights_onehot.select('org', 'org_idx', 'org_dummy').distinct().sort('org_idx').show()
    
    Subset from the flights DataFrame:
    
    +---+-------+
    |org|org_idx|
    +---+-------+
    |JFK|2.0    |
    |ORD|0.0    |
    |SFO|1.0    |
    |ORD|0.0    |
    |ORD|0.0    |
    +---+-------+
    only showing top 5 rows
    
    <script.py> output:
        +---+-------+-------------+
        |org|org_idx|    org_dummy|
        +---+-------+-------------+
        |ORD|    0.0|(7,[0],[1.0])|
        |SFO|    1.0|(7,[1],[1.0])|
        |JFK|    2.0|(7,[2],[1.0])|
        |LGA|    3.0|(7,[3],[1.0])|
        |SJC|    4.0|(7,[4],[1.0])|
        |SMF|    5.0|(7,[5],[1.0])|
        |TUS|    6.0|(7,[6],[1.0])|
        |OGG|    7.0|    (7,[],[])|
        +---+-------+-------------+
    

    Regression

    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.evaluation import RegressionEvaluator
    
    # Create a regression object and train on training data
    regression = LinearRegression(labelCol='duration').fit(flights_train)
    
    # Create predictions for the testing data and take a look at the predictions
    predictions = regression.transform(flights_test)
    predictions.select('duration', 'prediction').show(5, False)
    
    # Calculate the RMSE
    RegressionEvaluator(labelCol='duration').evaluate(predictions)
    
    Subset from the flights DataFrame:
    
    +------+--------+--------+
    |km    |features|duration|
    +------+--------+--------+
    |3465.0|[3465.0]|351     |
    |509.0 |[509.0] |82      |
    |542.0 |[542.0] |82      |
    |1989.0|[1989.0]|195     |
    |415.0 |[415.0] |65      |
    +------+--------+--------+
    only showing top 5 rows
    
    <script.py> output:
        +--------+------------------+
        |duration|prediction        |
        +--------+------------------+
        |105     |118.71205377865795|
        |204     |174.69339409767792|
        |160     |152.16523695718402|
        |297     |337.8153345965721 |
        |105     |113.5132482846978 |
        +--------+------------------+
        only showing top 5 rows
    
    
    # Intercept (average minutes on ground)
    inter = regression.intercept
    print(inter)
    
    # Coefficients
    coefs = regression.coefficients
    print(coefs)
    
    # Average minutes per km
    minutes_per_km = regression.coefficients[0]
    print(minutes_per_km)
    
    # Average speed in km per hour
    avg_speed = 60 / minutes_per_km
    print(avg_speed)
    
    <script.py> output:
        44.36345473899361
        [0.07566671399881963]
        0.07566671399881963
        792.9510458315392
    
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.evaluation import RegressionEvaluator
    
    # Create a regression object and train on training data
    regression = LinearRegression(labelCol='duration').fit(flights_train)
    
    # Create predictions for the testing data
    predictions = regression.transform(flights_test)
    
    # Calculate the RMSE on testing data
    RegressionEvaluator(labelCol='duration').evaluate(predictions)
    

    Printing the regression coefficients

    # Average speed in km per hour
    avg_speed_hour = 60 / regression.coefficients[0]
    print(avg_speed_hour)
    
    # Average minutes on ground at OGG
    inter = regression.intercept
    print(inter)
    
    # Average minutes on ground at JFK
    avg_ground_jfk = inter + regression.coefficients[3]
    print(avg_ground_jfk)
    
    # Average minutes on ground at LGA
    avg_ground_lga = inter + regression.coefficients[4]
    print(avg_ground_lga)
    
    
    <script.py> output:
        807.3336599681242
        15.856628374450773
        68.53550999587868
        62.56747182033072
    

    Bucketing & Engineering

    Bucketizer

    Discretizes a continuous variable into buckets.

    from pyspark.ml.feature import Bucketizer, OneHotEncoderEstimator
    
    # Create buckets at 3 hour intervals through the day
    buckets = Bucketizer(splits=[0, 3, 6, 9, 12, 15, 18, 21, 24], inputCol='depart', outputCol='depart_bucket')
    
    # Bucket the departure times
    bucketed = buckets.transform(flights)
    bucketed.select('depart', 'depart_bucket').show(5)
    
    # Create a one-hot encoder
    onehot = OneHotEncoderEstimator(inputCols=['depart_bucket'], outputCols=['depart_dummy'])
    
    # One-hot encode the bucketed departure times
    flights_onehot = onehot.fit(bucketed).transform(bucketed)
    flights_onehot.select('depart', 'depart_bucket', 'depart_dummy').show(5)
    
    <script.py> output:
        +------+-------------+
        |depart|depart_bucket|
        +------+-------------+
        |  9.48|          3.0|
        | 16.33|          5.0|
        |  6.17|          2.0|
        | 10.33|          3.0|
        |  8.92|          2.0|
        +------+-------------+
        only showing top 5 rows
        
        +------+-------------+-------------+
        |depart|depart_bucket| depart_dummy|
        +------+-------------+-------------+
        |  9.48|          3.0|(7,[3],[1.0])|
        | 16.33|          5.0|(7,[5],[1.0])|
        |  6.17|          2.0|(7,[2],[1.0])|
        | 10.33|          3.0|(7,[3],[1.0])|
        |  8.92|          2.0|(7,[2],[1.0])|
        +------+-------------+-------------+
        only showing top 5 rows
    

    Feature engineering with buckets

    Computing the coefficients of the full regression model

    # Find the RMSE on testing data
    from pyspark.ml.evaluation import RegressionEvaluator
    RegressionEvaluator(labelCol='duration').evaluate(predictions)
    
    # Average minutes on ground at OGG for flights departing between 21:00 and 24:00
    avg_eve_ogg = regression.intercept
    print(avg_eve_ogg)
    
    # Average minutes on ground at OGG for flights departing between 00:00 and 03:00
    avg_night_ogg = regression.intercept + regression.coefficients[8]
    print(avg_night_ogg)
    
    # Average minutes on ground at JFK for flights departing between 00:00 and 03:00
    avg_night_jfk = regression.intercept + regression.coefficients[8] + regression.coefficients[3]
    print(avg_night_jfk)
    

    Regularization

    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.evaluation import RegressionEvaluator
    
    # Fit linear regression model to training data
    regression = LinearRegression(labelCol='duration').fit(flights_train)
    
    # Make predictions on testing data
    predictions = regression.transform(flights_test)
    
    # Calculate the RMSE on testing data
    rmse = RegressionEvaluator(labelCol='duration').evaluate(predictions)
    print("The test RMSE is", rmse)
    
    # Look at the model coefficients
    coeffs = regression.coefficients
    print(coeffs)
    
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.evaluation import RegressionEvaluator
    
    # Fit Lasso model (α = 1) to training data
    regression = LinearRegression(labelCol='duration', regParam=1, elasticNetParam=1).fit(flights_train)
    
    # Calculate the RMSE on testing data
    rmse = RegressionEvaluator(labelCol='duration').evaluate(regression.transform(flights_test))
    print("The test RMSE is", rmse)
    
    # Look at the model coefficients
    coeffs = regression.coefficients
    print(coeffs)
    
    # Number of zero coefficients
    zero_coeff = sum([beta == 0 for beta in regression.coefficients])
    print("Number of coefficients equal to 0:", zero_coeff)
    

    Pipelines

    # Convert categorical strings to index values
    indexer = StringIndexer(inputCol='org', outputCol='org_idx')
    
    # One-hot encode index values
    onehot = OneHotEncoderEstimator(
        inputCols=['org_idx', 'dow'],
        outputCols=['org_dummy', 'dow_dummy']
    )
    
    # Assemble predictors into a single column
    assembler = VectorAssembler(inputCols=['km', 'org_dummy', 'dow_dummy'], outputCol='features')
    
    # A linear regression object
    regression = LinearRegression(labelCol='duration')
    
    The first few rows of the flights DataFrame:
    
    +---+---+---+-------+------+---+------+--------+-----+------+
    |mon|dom|dow|carrier|flight|org|depart|duration|delay|km    |
    +---+---+---+-------+------+---+------+--------+-----+------+
    |11 |20 |6  |US     |19    |JFK|9.48  |351     |null |3465.0|
    |0  |22 |2  |UA     |1107  |ORD|16.33 |82      |30   |509.0 |
    |2  |20 |4  |UA     |226   |SFO|6.17  |82      |-8   |542.0 |
    |9  |13 |1  |AA     |419   |ORD|10.33 |195     |-5   |1989.0|
    |4  |2  |5  |AA     |325   |ORD|8.92  |65      |null |415.0 |
    +---+---+---+-------+------+---+------+--------+-----+------+
    only showing top 5 rows
    
    
    The Pipeline API works just like pipelines in plain Python (e.g. scikit-learn).
    
    # Import class for creating a pipeline
    from pyspark.ml import Pipeline
    
    # Construct a pipeline
    pipeline = Pipeline(stages=[indexer, onehot, assembler, regression])
    
    # Train the pipeline on the training data
    pipeline = pipeline.fit(flights_train)
    
    # Make predictions on the testing data
    predictions = pipeline.transform(flights_test)
    
    from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF
    
    # Break text into tokens at non-word characters
    tokenizer = Tokenizer(inputCol='text', outputCol='words')
    
    # Remove stop words
    remover = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol='terms')
    
    # Apply the hashing trick and transform to TF-IDF
    hasher = HashingTF(inputCol=remover.getOutputCol(), outputCol="hash")
    idf = IDF(inputCol=hasher.getOutputCol(), outputCol="features")
    
    # Create a logistic regression object and add everything to a pipeline
    logistic = LogisticRegression()
    pipeline = Pipeline(stages=[tokenizer, remover, hasher, idf, logistic])
    
    Selected columns from first few rows of the sms DataFrame:
    
    +---+---------------------------------+-----+
    |id |text                             |label|
    +---+---------------------------------+-----+
    |1  |Sorry I'll call later in meeting |0    |
    |2  |Dont worry I guess he's busy     |0    |
    |3  |Call FREEPHONE now               |1    |
    |4  |Win a cash prize or a prize worth|1    |
    +---+---------------------------------+-----+
    only showing top 4 rows
    

    Cross-Validation

    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
    
    # Create an empty parameter grid
    params = ParamGridBuilder().build()
    
    # Create objects for building and evaluating a regression model
    regression = LinearRegression(labelCol='duration')
    evaluator = RegressionEvaluator(labelCol='duration')
    
    # Create a cross validator
    cv = CrossValidator(estimator=regression, estimatorParamMaps=params, evaluator=evaluator, numFolds=5)
    
    # Train and test model on multiple folds of the training data
    cv = cv.fit(flights_train)
    
    # NOTE: Since cross-validation builds multiple models, the fit() method can take a little while to complete.
    
    
    # Create an indexer for the org field
    indexer = StringIndexer(inputCol='org', outputCol='org_idx')
    
    # Create a one-hot encoder for the indexed org field
    onehot = OneHotEncoderEstimator(inputCols=['org_idx'], outputCols=['org_dummy'])
    
    # Assemble the km and one-hot encoded fields
    assembler = VectorAssembler(inputCols=['km', 'org_dummy'], outputCol='features')
    
    # Create a pipeline and cross-validator.
    pipeline = Pipeline(stages=[indexer, onehot, assembler, regression])
    cv = CrossValidator(estimator=pipeline,
                        estimatorParamMaps=params,
                        evaluator=evaluator)
    

    Grid search

    # Create parameter grid
    params = ParamGridBuilder()
    
    # Add grids for two parameters
    params = params.addGrid(regression.regParam, [0.01, 0.1, 1.0, 10.0]) \
                   .addGrid(regression.elasticNetParam, [0.0, 0.5, 1.0])
    
    # Build the parameter grid
    params = params.build()
    print('Number of models to be tested: ', len(params))
    
    # Create cross-validator
    cv = CrossValidator(estimator=pipeline, estimatorParamMaps=params, evaluator=evaluator, numFolds=5)
    
    
    # Get the best model from cross validation
    best_model = cv.bestModel
    
    # Look at the stages in the best model
    print(best_model.stages)
    
    # Get the parameters for the LinearRegression object in the best model
    best_model.stages[3].extractParamMap()
    
    # Generate predictions on testing data using the best model then calculate RMSE
    predictions = best_model.transform(flights_test)
    evaluator.evaluate(predictions)
    
    <script.py> output:
        [StringIndexer_14299b2d5472, OneHotEncoderEstimator_9a650c117f1d, VectorAssembler_933acae88a6e, LinearRegression_9f5a93965597]
    
    

    Ensemble

    Ensemble models

    # Import the classes required
    from pyspark.ml.classification import DecisionTreeClassifier, GBTClassifier
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    
    # Create model objects and train on training data
    tree = DecisionTreeClassifier().fit(flights_train)
    gbt = GBTClassifier().fit(flights_train)
    
    # Compare AUC on testing data
    evaluator = BinaryClassificationEvaluator()
    evaluator.evaluate(tree.transform(flights_test))
    evaluator.evaluate(gbt.transform(flights_test))
    
    # Find the number of trees and the relative importance of features
    print(gbt.getNumTrees)
    print(gbt.featureImportances)
    
    Subset of data from the flights DataFrame:
    
    +---+------+--------+-----------------+-----+
    |mon|depart|duration|features         |label|
    +---+------+--------+-----------------+-----+
    |0  |16.33 |82      |[0.0,16.33,82.0] |1    |
    |2  |6.17  |82      |[2.0,6.17,82.0]  |0    |
    |9  |10.33 |195     |[9.0,10.33,195.0]|0    |
    |5  |7.98  |102     |[5.0,7.98,102.0] |0    |
    |7  |10.83 |135     |[7.0,10.83,135.0]|1    |
    +---+------+--------+-----------------+-----+
    only showing top 5 rows
    
    <script.py> output:
        20
        (3,[0,1,2],[0.30892329736156504,0.3043955359595801,0.3866811666788549])
    
    
    # Import and create a random forest classifier
    from pyspark.ml.classification import RandomForestClassifier
    forest = RandomForestClassifier()
    
    # Create a parameter grid
    params = ParamGridBuilder() \
                .addGrid(forest.featureSubsetStrategy, ['all', 'onethird', 'sqrt', 'log2']) \
                .addGrid(forest.maxDepth, [2, 5, 10]) \
                .build()
    
    # Create a binary classification evaluator
    evaluator = BinaryClassificationEvaluator()
    
    # Create a cross-validator
    cv = CrossValidator(estimator=forest, estimatorParamMaps=params, evaluator=evaluator, numFolds=5)
    
    # Fit the cross-validator to the training data (trains one model per fold per parameter combination)
    cv = cv.fit(flights_train)
    
    # Average AUC for each parameter combination in grid
    avg_auc = cv.avgMetrics
    
    # Average AUC for the best model
    best_model_auc = max(cv.avgMetrics)
    
    # What's the optimal parameter value?
    opt_max_depth = cv.bestModel.explainParam('maxDepth')
    opt_feat_substrat = cv.bestModel.explainParam('featureSubsetStrategy')
    
    # AUC for best model on testing data
    best_auc = evaluator.evaluate(cv.transform(flights_test))
    
    