Text Classification with Spark ML

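I have been studying Spark's classification algorithms lately, because our work is classifying log text. I could not find a matching demo on the official site or on any of the major sites, and after more than a month of digging I finally have something that works.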

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.feature.HashingTF
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.DecisionTree

    val sparkConf = new SparkConf().setAppName("DecisionTree1").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)

    // Each input line is: label <TAB> token <TAB> token ...
    val lines = sc.textFile("/XXX/sample_libsvm_data.txt")

    // HashingTF hashes each token to a feature index and counts occurrences.
    val hashingTF = new HashingTF()

    val data = lines.map { line =>
      val parts = line.split('\t')
      // The first field is the class label; the remaining fields are the tokens.
      LabeledPoint(parts(0).toDouble, hashingTF.transform(parts.tail.toSeq))
    }

    // Hold out roughly 10% of the rows as test data.
    val splits = data.randomSplit(Array(0.9, 0.1))
    val (trainingData, testData) = (splits(0), splits(1))

    // Train a DecisionTree model.
    // Labels must lie in 0 until numClasses; the sample labels are 2, 3 and 4, hence 5.
    // Empty categoricalFeaturesInfo indicates all features are continuous.
    val numClasses = 5
    val categoricalFeaturesInfo = Map[Int, Int]()
    val impurity = "gini"
    val maxDepth = 5
    val maxBins = 32

    println("--------------------Train--------------------")

    val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
      impurity, maxDepth, maxBins)

    println("--------------------Test--------------------")

    // Predict the class of a single hand-built token list.
    val testStr = Seq("l", "o", "k")
    val prediction = model.predict(hashingTF.transform(testStr))

    println("-----------------------------------------")
    println(prediction)
    println("-----------------------------------------")

    sc.stop()
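
The listing sets aside testData but only predicts a single hand-built sample. A minimal sketch of how a test error could be computed from (label, prediction) pairs is shown here in plain Scala so that it runs without a cluster. The names `TestErrorSketch` and `errorRate` and the pair values are made up for illustration; with the model above, the pairs could presumably be obtained with something like `testData.map(p => (p.label, model.predict(p.features))).collect()`.

```scala
object TestErrorSketch {
  // Fraction of (label, prediction) pairs that disagree.
  // Returns None when there is nothing to score, which can happen
  // when a 10% split of a very small data set comes out empty.
  def errorRate(pairs: Seq[(Double, Double)]): Option[Double] =
    if (pairs.isEmpty) None
    else Some(pairs.count { case (label, pred) => label != pred }.toDouble / pairs.size)

  def main(args: Array[String]): Unit = {
    // Hypothetical pairs, not real model output.
    val pairs = Seq((4.0, 4.0), (2.0, 2.0), (3.0, 4.0), (3.0, 3.0))
    println(errorRate(pairs)) // one mismatch out of four: Some(0.25)
  }
}
```

With only 16 sample rows, a 0.9/0.1 split leaves one or two test rows at best, so the resulting number says little; a larger data set is needed for a meaningful estimate.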

Sample data (the first column is the class label, the remaining columns are the tokens):

2    f    g    k    m
3    o    p    s    d
4    i    l    o    v
4    i    l    o    w
4    i    l    o    f
4    i    l    o    k
4    i    l    o    n
4    i    l    o    a
2    f    g    i    m
2    f    g    o    m
2    f    g    u    m
2    f    g    w    m
3    o    k    s    d
3    o    m    s    d
3    o    s    s    d
3    o    i    s    d
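
The code splits each line on a tab character, while the sample as displayed on this page may have had its separators turned into spaces. A small sketch of a more tolerant parser follows, assuming that any run of whitespace separates fields; `ParseSample` and `parseLine` are illustrative names, not part of the original program.

```scala
object ParseSample {
  // Split one sample line into (label, tokens).
  // Trims the line first and splits on any run of whitespace,
  // so tabs, repeated spaces and trailing blanks are all handled.
  def parseLine(line: String): (Double, List[String]) = {
    val parts = line.trim.split("\\s+")
    (parts.head.toDouble, parts.tail.toList)
  }

  def main(args: Array[String]): Unit = {
    println(parseLine("3    o    p    s    d ")) // (3.0,List(o, p, s, d))
  }
}
```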

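The classification algorithms only accept Double values, so the real problem is how to turn strings into vectors of Doubles. Spark 1.3.0 provides HashingTF for exactly this conversion, and once you use it the program turns out to be very simple.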
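
To make the conversion less of a black box, here is a plain-Scala sketch of the hashing trick that HashingTF is built on: each token is hashed to an index in a fixed-size feature space and the occurrences per index are counted. This illustrates the idea only and is not Spark's actual implementation; `HashingSketch`, `nonNegativeMod` and `termFrequencies` are names chosen for this example, and the use of `String.hashCode` is an assumption.

```scala
import scala.collection.mutable

object HashingSketch {
  // Map a hash code into the range [0, mod), including for negative hash codes.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Build a sparse term-frequency vector as a map: feature index -> count.
  def termFrequencies(tokens: Seq[String], numFeatures: Int): Map[Int, Double] = {
    val counts = mutable.Map.empty[Int, Double].withDefaultValue(0.0)
    tokens.foreach { token =>
      val index = nonNegativeMod(token.hashCode, numFeatures)
      counts(index) += 1.0
    }
    counts.toMap
  }

  def main(args: Array[String]): Unit = {
    // A one-character string hashes to its character code, e.g. "i" -> 105.
    val tf = termFrequencies(Seq("i", "l", "o", "k"), 1 << 20)
    println(tf.toSeq.sortBy(_._1))
  }
}
```

Two different tokens can land on the same index. A large feature space makes such collisions rare, at the cost of a very wide (but sparse) vector.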

Original article: https://www.cnblogs.com/qq27271609/p/4685297.html