• Spark MLlib Programming API Primer Series: Feature Selection with the R Model Formula (RFormula)


    Without further ado, straight to the practical material!

      Commonly used feature selectors include VectorSlicer (vector slicing), RFormula (R model formula), and ChiSqSelector (chi-squared feature selection).

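      RFormula converts the columns of a dataset into feature values according to an R-language model formula (Model Formulae). Its output is a feature vector together with a label of type Double. For an introduction to R model formulae, see: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html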
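      To make the idea concrete, here is a minimal plain-Scala sketch (no Spark involved) of what a formula such as clicked ~ country + hour asks for: the string column is expanded into 0/1 dummy variables with one reference category dropped, the numeric column is cast to Double, and the left-hand side becomes the label. The names Click and FormulaSketch are hypothetical. Sorting the categories alphabetically and dropping the last one is an assumption made for illustration only. Spark decides the category order internally, so the vectors it produces may be laid out differently.

```scala
// Hypothetical row type mirroring the columns used in this article.
final case class Click(id: Int, country: String, hour: Int, clicked: Double)

object FormulaSketch {
  /** Encodes rows for the formula "clicked ~ country + hour".
    * Returns one (features, label) pair per input row. */
  def encode(rows: Seq[Click]): Seq[(Vector[Double], Double)] = {
    // Assumption: categories are sorted alphabetically and the last one
    // is the dropped reference level (encoded as all zeros).
    val levels = rows.map(_.country).distinct.sorted
    val kept = levels.dropRight(1)
    rows.map { r =>
      // One 0/1 dummy variable per kept category.
      val dummies = kept.map(c => if (c == r.country) 1.0 else 0.0)
      // Numeric column is appended as a Double; the label is the clicked value.
      ((dummies :+ r.hour.toDouble).toVector, r.clicked)
    }
  }
}
```

      With the three rows used later in this article, the kept categories under this assumption are CA and NZ, so the US row is encoded as (0.0, 0.0, 18.0) with label 1.0.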

    Writing the code

      RFormula.scala

    package zhouls.bigdata.DataFeatureSelection
    
    
    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext
    import org.apache.spark.ml.feature.RFormula // the RFormula feature transformer from the ml package
    
     
    /**
     * By  zhouls
     */
    object RFormula extends App {
      
        val conf = new SparkConf().setMaster("local").setAppName("RFormula")
        val sc = new SparkContext(conf)
        
        val sqlContext = new org.apache.spark.sql.SQLContext(sc)
        import sqlContext.implicits._
        
        // Build the dataset
        val dataset = sqlContext.createDataFrame(Seq(
          (7, "US", 18, 1.0),
          (8, "CA", 12, 0.0),
          (9, "NZ", 15, 0.0)
        )).toDF("id", "country", "hour", "clicked") // load into a DataFrame with named columns
        dataset.select("id", "country", "hour", "clicked").show()
        
        // To predict clicked from country and hour,
        // construct an RFormula with the formula expression clicked ~ country + hour
        val formula = new RFormula().setFormula("clicked ~ country + hour").setFeaturesCol("features").setLabelCol("label")
        // Generate the feature vector and the label
        val output = formula.fit(dataset).transform(dataset)
        output.select("id", "country", "hour", "clicked", "features", "label").show()

        // Release the local Spark resources once the results have been printed
        sc.stop()
     
    }
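
      The listing above uses SQLContext. On Spark 2.0 and later, SparkSession is the usual entry point, so the same steps can be sketched as follows. This assumes Spark 2.x or newer with spark-mllib on the classpath. The object name RFormulaSessionExample and the helper method encode are hypothetical names chosen for this sketch, picked partly so that the object does not share its name with the imported RFormula class.

```scala
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.sql.{DataFrame, SparkSession}

object RFormulaSessionExample {

  /** Builds the sample dataset and returns it with features and label columns added. */
  def encode(spark: SparkSession): DataFrame = {
    // Same three rows as in the listing above
    val dataset = spark.createDataFrame(Seq(
      (7, "US", 18, 1.0),
      (8, "CA", 12, 0.0),
      (9, "NZ", 15, 0.0)
    )).toDF("id", "country", "hour", "clicked")

    // Predict clicked from country and hour
    val formula = new RFormula()
      .setFormula("clicked ~ country + hour")
      .setFeaturesCol("features")
      .setLabelCol("label")

    // fit() learns the encoding of the string column, transform() applies it
    formula.fit(dataset).transform(dataset)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local")
      .appName("RFormulaSessionExample")
      .getOrCreate()

    encode(spark).select("id", "country", "hour", "clicked", "features", "label").show()

    spark.stop()
  }
}
```

      Using an explicit main method rather than extends App follows the recommendation in the Spark documentation that applications define main().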

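      Running the program prints two tables: the first shows the original columns (id, country, hour, clicked), and the second shows the same rows with the generated features and label columns appended.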

  • Original article: https://www.cnblogs.com/zlslch/p/7396185.html