• SparkMLlib分类算法之支持向量机


    SparkMLlib分类算法之支持向量机

    (一),概念

      支持向量机(support vector machine)是一种分类算法,通过寻求结构化风险最小来提高学习机泛化能力,实现经验风险和置信范围的最小化,从而达到在统计样本量较少的情况下,亦能获得良好统计规律的目的。通俗来讲,它是一种二类分类模型,其基本模型定义为特征空间上的间隔最大的线性分类器,即支持向量机的学习策略便是间隔最大化,最终可转化为一个凸二次规划问题的求解。参考网址:http://www.cnblogs.com/end/p/3848740.html

    (二),SparkMLlib中SVM回归应用

    1,数据集:参考这篇SparkMLlib学习分类算法之逻辑回归算法

    2,处理数据及获取训练集和测试集

    val orig_file=sc.textFile("train_nohead.tsv")
        //println(orig_file.first())
        val data_file=orig_file.map(_.split("	")).map{
          r =>
            val trimmed =r.map(_.replace(""",""))
            val lable=trimmed(r.length-1).toDouble
            val feature=trimmed.slice(4,r.length-1).map(d => if(d=="?")0.0
            else d.toDouble)
            LabeledPoint(lable,Vectors.dense(feature))
        }
       /*特征标准化优化*/
        val vectors=data_file.map(x =>x.features)
        val rows=new RowMatrix(vectors)
        println(rows.computeColumnSummaryStatistics().variance)//每列的方差
        val scaler=new StandardScaler(withMean=true,withStd=true).fit(vectors)//标准化
        val scaled_data=data_file.map(point => LabeledPoint(point.label,scaler.transform(point.features)))
            .randomSplit(Array(0.7,0.3),11L)
        val data_train=scaled_data(0)
        val data_test=scaled_data(1)

    2,建立支持向量机模型及模型评估

     /*训练 SVM 模型**/
        val model_Svm=SVMWithSGD.train(data_train,numIteration)
    val correct_svm=data_test.map{
          point => if(model_Svm.predict(point.features)==point.label)
            1 else 0
        }.sum()/data_test.count()//精确度:0.6060885608856088
    val metrics=Seq(model_Svm).map{
          model =>
            val socreAndLabels=data_test.map {
              point => (model.predict(point.features), point.label)
            }
            val metrics=new BinaryClassificationMetrics(socreAndLabels)
            (model.getClass.getSimpleName,metrics.areaUnderPR(),metrics.areaUnderROC())
        }
    val allMetrics = metrics 
        allMetrics.foreach{ case (m, pr, roc) =>
          println(f"$m, Area under PR: ${pr * 100.0}%2.4f%%, Area under ROC: ${roc * 100.0}%2.4f%%")
        }
    /*
    SVMModel, Area under PR: 72.5527%, Area under ROC: 60.4180%*/

    3,模型参数调优

       逻辑回归(SGD)和 SVM 模型有相同的参数,原因是它们都使用随机梯度下降( SGD )作为基础优化技术。不同点在于二者采用的损失函数不同

    3.1 定义调参函数及模型评估函数

    /*调参函数*/
        def trainWithParams(input: RDD[LabeledPoint], regParam: Double,
                            numIterations: Int, updater: Updater, stepSize: Double) = {
          val svm = new SVMWithSGD
          svm.optimizer.setNumIterations(numIterations).
            setUpdater(updater).setRegParam(regParam).setStepSize(stepSize)
          svm.run(input)
        }
        /*评估函数*/
        def createMetrics(label: String, data: RDD[LabeledPoint], model:
        ClassificationModel) = {
          val scoreAndLabels = data.map { point =>
            (model.predict(point.features), point.label)
          }
          val metrics = new BinaryClassificationMetrics(scoreAndLabels)
          (label, metrics.areaUnderROC)
        }

    3.2  改变迭代次数(发现一旦完成特定次数的迭代,再增大迭代次数对结果的影响较小)

    val iterResults = Seq(1, 5, 10, 50).map { param =>
          val model = trainWithParams(data_train, 0.0, param, new
              SimpleUpdater, 1.0)
          createMetrics(s"$param iterations", data_test, model)
        }
        iterResults.foreach { case (param, auc) => println(f"$param, AUC = ${auc * 100}%2.2f%%") }
    /*
    1 iterations, AUC = 59.02%
    5 iterations, AUC = 60.04%
    10 iterations, AUC = 60.42%
    50 iterations, AUC = 60.42%
    */

    3.3 ,改变步长(以看出步长增长过大对性能有负面影响)

        在 SGD 中,在训练每个样本并更新模型的权重向量时,步长用来控制算法在最陡的梯度方向上应该前进多远。较大的步长收敛较快,但是步长太大可能导致收敛到局部最优解。

    val stepResults = Seq(0.001, 0.01, 0.1, 1.0, 10.0).map { param =>
          val model = trainWithParams(data_train, 0.0, numIteration, new
              SimpleUpdater, param)
          createMetrics(s"$param step size", data_test, model)
        }
        stepResults.foreach { case (param, auc) => println(f"$param, AUC = ${auc * 100}%2.2f%%") }
    /*
    0.001 step size, AUC = 59.02%
    0.01 step size, AUC = 59.02%
    0.1 step size, AUC = 59.01%
    1.0 step size, AUC = 60.42%
    10.0 step size, AUC = 56.09%
    
    */

    3.4 正则化 

    val regResults = Seq(0.001, 0.01, 0.1, 1.0, 10.0).map { param =>
          val model = trainWithParams(data_train, param, numIteration,
            new SquaredL2Updater, 1.0)
          createMetrics(s"$param L2 regularization parameter",
            data_test, model)
        }
        regResults.foreach { case (param, auc) => println(f"$param, AUC = ${auc * 100}%2.2f%%") }
    /*
    0.001 L2 regularization parameter, AUC = 60.42%
    0.01 L2 regularization parameter, AUC = 60.42%
    0.1 L2 regularization parameter, AUC = 60.37%
    1.0 L2 regularization parameter, AUC = 60.56%
    10.0 L2 regularization parameter, AUC = 41.54%
    */    
        可以看出,低等级的正则化对模型的性能影响不大。然而,增大正则化可以看到欠拟合会导致较低模型性能。
    (三),总结
        1,提高精确度感觉蛮难的,前提还是要先分析数据,对不同特征加以处理吧。。。。。
        2,以后多学习。。。。
  • 相关阅读:
    PAT顶级 1015 Letter-moving Game (35分)
    PAT顶级 1008 Airline Routes (35分)(有向图的强连通分量)
    PAT顶级 1025 Keep at Most 100 Characters (35分)
    PAT顶级 1027 Larry and Inversions (35分)(树状数组)
    PAT 顶级 1026 String of Colorful Beads (35分)(尺取法)
    PAT顶级 1009 Triple Inversions (35分)(树状数组)
    Codeforces 1283F DIY Garland
    Codeforces Round #438 A. Bark to Unlock
    Codeforces Round #437 E. Buy Low Sell High
    Codeforces Round #437 C. Ordering Pizza
  • 原文地址:https://www.cnblogs.com/ksWorld/p/6882591.html
Copyright © 2020-2023  润新知