• Text Classification with Spark ML


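    I have recently been studying Spark's classification algorithms because we need to classify log text. I could not find a suitable demo on the official site or on any of the major sites, and after more than a month of work I finally have something that works.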

        import org.apache.spark.{SparkConf, SparkContext}
        import org.apache.spark.mllib.feature.HashingTF
        import org.apache.spark.mllib.regression.LabeledPoint
        import org.apache.spark.mllib.tree.DecisionTree

        val sparkConf = new SparkConf().setAppName("DecisionTree1").setMaster("local[2]")
        val sc = new SparkContext(sparkConf)

        // Each input line is a numeric label followed by tab-separated tokens.
        val data1 = sc.textFile("/XXX/sample_libsvm_data.txt")

        // HashingTF turns a sequence of tokens into a term-frequency vector.
        val hashingTF = new HashingTF()

        val data = data1.map { line =>
          val parts = line.split('\t')
          LabeledPoint(parts(0).toDouble, hashingTF.transform(parts.tail))
        }

        // Hold out 10% of the rows; testData is not used further in this demo.
        val splits = data.randomSplit(Array(0.9, 0.1))
        val (trainingData, testData) = (splits(0), splits(1))

        // Train a DecisionTree model.
        // Labels must lie in 0 until numClasses; the sample labels are 2, 3 and 4.
        // An empty categoricalFeaturesInfo means all features are treated as continuous.
        val numClasses = 5
        val categoricalFeaturesInfo = Map[Int, Int]()
        val impurity = "gini"
        val maxDepth = 5
        val maxBins = 32

        println("--------------------train--------------------")

        val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
          impurity, maxDepth, maxBins)

        println("--------------------Test--------------------")

        // Predict the class of a single hand-built document,
        // hashed with the same HashingTF used for training.
        val testStr = Array("l", "o", "k")

        val prediction = model.predict(hashingTF.transform(testStr))

        println("-----------------------------------------")
        println(prediction)
        println("-----------------------------------------")

        sc.stop()

    Sample data (a label followed by tab-separated tokens):

    2    f    g    k    m
    3    o    p    s    d 
    4    i    l    o    v
    4    i    l    o    w
    4    i    l    o    f
    4    i    l    o    k
    4    i    l    o    n
    4    i    l    o    a
    2    f    g    i    m
    2    f    g    o    m
    2    f    g    u    m
    2    f    g    w    m
    3    o    k    s    d
    3    o    m    s    d
    3    o    s    s    d
    3    o    i    s    d
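
The job above splits each line on a tab character. If a copy of the data ends up space-separated or carries trailing whitespace (an assumption about how the file might be stored, not something the post states), a more tolerant parser may help. The sketch below is plain Scala with no Spark dependency; `ParseSampleLine` and `parseLine` are hypothetical names introduced only for illustration.

```scala
// Hypothetical helper, not part of the original post: split one sample row
// into a numeric label and its tokens. Splitting on any run of whitespace
// accepts both tab-separated and space-separated rows.
object ParseSampleLine {
  def parseLine(line: String): (Double, Seq[String]) = {
    val parts = line.trim.split("\\s+")
    (parts.head.toDouble, parts.tail.toSeq)
  }

  def main(args: Array[String]): Unit = {
    val (label, tokens) = parseLine("4\ti\tl\to\tv")
    println(s"label=$label tokens=${tokens.mkString(",")}")
  }
}
```

Inside the Spark job, the same function could replace the body of the `map` closure, with the tokens passed on to `hashingTF.transform`.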


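    The classification algorithms only accept Double values, so the core problem is how to turn strings into vectors of Doubles. Spark 1.3.0 provides HashingTF for this conversion, which makes the program quite simple.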
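
To make the string-to-vector step more concrete, here is a minimal plain-Scala sketch of the hashing trick that this kind of transformer relies on. It is not Spark's implementation: `HashingSketch`, `indexOf` and `termFrequencies` are hypothetical names, the hash is simply `String.hashCode`, and the bucket count of 16 is chosen only to keep the output small (Spark's HashingTF uses a far larger feature space by default).

```scala
// A sketch of the hashing trick under the assumptions stated above.
object HashingSketch {
  // Illustrative bucket count, not Spark's default.
  val numFeatures = 16

  // Map a term to a bucket in [0, numFeatures) using a non-negative modulus.
  def indexOf(term: String): Int = {
    val raw = term.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Count how many tokens land in each bucket; the result is a sparse
  // term-frequency vector stored as bucket index -> count.
  def termFrequencies(tokens: Seq[String]): Map[Int, Double] =
    tokens.groupBy(indexOf).map { case (idx, ts) => idx -> ts.size.toDouble }

  def main(args: Array[String]): Unit = {
    val tf = termFrequencies(Seq("i", "l", "o", "l"))
    tf.toSeq.sortBy(_._1).foreach { case (idx, count) =>
      println(s"bucket $idx -> $count")
    }
  }
}
```

Because the mapping is a hash, two different tokens can share a bucket. The same transformer, with the same feature count, has to be used at training and prediction time, which is why the demo reuses one `hashingTF` instance for both.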

  • Original article: https://www.cnblogs.com/qq27271609/p/4685297.html