I've recently been digging into Spark's classification algorithms because we need to classify log text. I couldn't find a matching demo on the official site or any of the major tech sites, so after more than a month of experimenting I finally have something that works.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

val sparkConf = new SparkConf().setAppName("DecisionTree1").setMaster("local[2]")
val sc = new SparkContext(sparkConf)

// Load the raw text data; each line is "label token token token ...".
val data1 = sc.textFile("/XXX/sample_libsvm_data.txt")

// HashingTF maps a sequence of string tokens to a fixed-length Double vector.
val hashingTF = new HashingTF()
val data = data1.map { line =>
  val parts = line.split(' ')
  LabeledPoint(parts(0).toDouble, hashingTF.transform(parts.tail))
}

// Split the data into training (90%) and test (10%) sets.
val splits = data.randomSplit(Array(0.9, 0.1))
val (trainingData, testData) = (splits(0), splits(1))

// Train a DecisionTree model.
// An empty categoricalFeaturesInfo map means all features are treated as continuous.
val numClasses = 5
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "gini"
val maxDepth = 5
val maxBins = 32

println("--------------------train--------------------")
val model = DecisionTree.trainClassifier(trainingData, numClasses,
  categoricalFeaturesInfo, impurity, maxDepth, maxBins)

println("--------------------Test--------------------")
// Predict the label of a new token sequence, hashed the same way as the training data.
val testStr = Array("l", "o", "k")
val prediction = model.predict(hashingTF.transform(testStr))
println("-----------------------------------------")
println(prediction)
println("-----------------------------------------")
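One gap worth noting: the comment above promises a test error on the held-out instances, but the code only prints a single ad-hoc prediction. Below is a minimal sketch of that evaluation over testData, following the usual MLlib pattern (the labelAndPreds name is just an illustrative choice):

// Predict on each held-out point and pair the prediction with the true label.
val labelAndPreds = testData.map { point =>
  (point.label, model.predict(point.features))
}
// Test error = fraction of test points whose predicted label is wrong.
val testErr = labelAndPreds.filter { case (label, pred) => label != pred }.count.toDouble / testData.count()
println(s"Test Error = $testErr")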
Sample data:
2 f g k m
3 o p s d
4 i l o v
4 i l o w
4 i l o f
4 i l o k
4 i l o n
4 i l o a
2 f g i m
2 f g o m
2 f g u m
2 f g w m
3 o k s d
3 o m s d
3 o s s d
3 o i s d
The classification algorithms only accept Double-typed features, so the real core of the problem is how to turn strings into a Double vector. Spark 1.3.0 provides HashingTF for exactly this conversion, and once you use it the program turns out to be very simple.
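To make the conversion concrete, here is a minimal sketch of HashingTF on its own (the tokens and the numFeatures value are arbitrary choices for illustration; by default HashingTF uses 2^20 buckets):

import org.apache.spark.mllib.feature.HashingTF

// Hash each token into one of numFeatures buckets and count occurrences,
// producing a sparse Double vector regardless of the original string values.
val tf = new HashingTF(numFeatures = 1000)
val vec = tf.transform(Seq("f", "g", "k", "m"))
println(vec)  // e.g. (1000,[...],[1.0,1.0,1.0,1.0]); the indices depend on the hash

Note that hashing is one-way and collisions are possible: two different tokens can land in the same bucket, which is the usual trade-off for getting a fixed-length vector without building a vocabulary.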