• A simple spark mllib lda example


    The daily hot-words feature of our public opinion monitoring system uses LDA topic clustering.

    The original version was a Python project: Jieba for word segmentation and Gensim for LDA.

    The project works well,

    but it has the following problems:


    1 The product is built on Elasticsearch for its big-data store, and ES uses Lucene analysis internally. Python's Jieba segmentation and Lucene's analysis do not produce the same tokens (or would need extra work to keep consistent). The early requirement was only to display daily hot words, so the mismatch did not matter. The new requirement is for LDA to integrate seamlessly with the data; integrating Jieba into ES and re-tokenizing the entire data set already in ES is unrealistic in both workload and technical difficulty, so the only practical option is to change the tokenizer used for LDA. (In practice, the choice of tokenizer makes almost no difference for LDA topic and hot-word extraction.)

    2 The Python project is limited to single-node computation and does not scale well.

    Tokenizing with Lucene and then running LDA, i.e. switching to the Java technology stack, is the simplest solution; a rough sketch of the tokenization step follows below.
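
    As a minimal sketch only (assuming Lucene's smartcn analyzer stands in for whatever analyzer the ES index actually uses; the tokenize helper below is illustrative, not the project's code), tokenizing text with a Lucene analyzer from Scala looks roughly like this:

    import org.apache.lucene.analysis.Analyzer
    import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
    import scala.collection.mutable.ArrayBuffer

    // Run one document through a Lucene analyzer and collect its terms.
    def tokenize(analyzer: Analyzer, text: String): Seq[String] = {
      val stream = analyzer.tokenStream("content", text)
      val termAttr = stream.addAttribute(classOf[CharTermAttribute])
      val terms = ArrayBuffer[String]()
      stream.reset()
      while (stream.incrementToken()) {
        terms += termAttr.toString
      }
      stream.end()
      stream.close()
      terms
    }

    // Requires the lucene-analyzers-smartcn dependency (an assumption here).
    val analyzer = new SmartChineseAnalyzer()
    println(tokenize(analyzer, "舆情系统每日热词").mkString(" "))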

    There are plenty of other LDA implementations for Java as well; this is only meant as a feasibility check.

    This post mainly investigates spark-mllib, which provides an official LDA implementation:

    http://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda


    ls examples/src/main/scala/org/apache/spark/examples/mllib/LDAExample.scala

    root@wx-social-consume2:/usr/spark-2.4.0# ./bin/run-example mllib.LDAExample
    2019-03-21 10:28:07 WARN Utils:66 - Your hostname, wx-social-consume2 resolves to a loopback address: 127.0.0.1; using 10.10.3.47 instead (on interface em1)
    2019-03-21 10:28:07 WARN Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
    2019-03-21 10:28:07 WARN NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Error: Missing argument <input>...
    LDAExample: an example LDA app for plain text data.
    Usage: LDAExample [options] <input>...

    --k <value> number of topics. default: 20
    --maxIterations <value> number of iterations of learning. default: 10
    --docConcentration <value>
    amount of topic smoothing to use (> 1.0) (-1=auto). default: -1.0
    --topicConcentration <value>
    amount of term (word) smoothing to use (> 1.0) (-1=auto). default: -1.0
    --vocabSize <value> number of distinct word types to use, chosen by frequency. (-1=all) default: 10000
    --stopwordFile <value> filepath for a list of stopwords. Note: This must fit on a single machine. default:
    --algorithm <value> inference algorithm to use. em and online are supported. default: em
    --checkpointDir <value> Directory for checkpointing intermediate results. Checkpointing helps with recovery and eliminates temporary shuffle files on disk. default: None
    --checkpointInterval <value>
    Iterations between each checkpoint. Only used if checkpointDir is set. default: 10
    <input>... input paths (directories) to plain text corpora. Each text file line should hold 1 document.
    2019-03-21 10:28:10 INFO ShutdownHookManager:54 - Shutdown hook called
    2019-03-21 10:28:10 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-0022e162-fa2a-4e0e-8e89-c78ea7962dd1


    The official example:
    Refer to the LDA Scala docs and DistributedLDAModel Scala docs for details on the API.

    import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
    import org.apache.spark.mllib.linalg.Vectors

    // Load and parse the data
    val data = sc.textFile("data/mllib/sample_lda_data.txt")
    val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
    // Index documents with unique IDs
    val corpus = parsedData.zipWithIndex.map(_.swap).cache()

    // Cluster the documents into three topics using LDA
    val ldaModel = new LDA().setK(3).run(corpus)

    // Output topics. Each is a distribution over words (matching word count vectors)
    println(s"Learned topics (as distributions over vocab of ${ldaModel.vocabSize} words):")
    val topics = ldaModel.topicsMatrix
    for (topic <- Range(0, 3)) {
      print(s"Topic $topic :")
      for (word <- Range(0, ldaModel.vocabSize)) {
        print(s"${topics(word, topic)}")
      }
      println()
    }

    // Save and load model.
    ldaModel.save(sc, "target/org/apache/spark/LatentDirichletAllocationExample/LDAModel")
    val sameModel = DistributedLDAModel.load(sc,
      "target/org/apache/spark/LatentDirichletAllocationExample/LDAModel")


    The input file is "data/mllib/sample_lda_data.txt":

    root@wx-social-consume2:/usr/spark-2.4.0# ./bin/run-example mllib.LDAExample data/mllib/sample_lda_data.txt
    2019-03-21 10:29:55 WARN Utils:66 - Your hostname, wx-social-consume2 resolves to a loopback address: 127.0.0.1; using 10.10.3.47 instead (on interface em1)
    2019-03-21 10:29:55 WARN Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
    2019-03-21 10:29:56 WARN NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    2019-03-21 10:29:58 INFO SparkContext:54 - Running Spark version 2.4.0
    2019-03-21 10:29:58 INFO SparkContext:54 - Submitted application: LDAExample with {
    input: List(data/mllib/sample_lda_data.txt),
    k: 20,
    maxIterations: 10,
    docConcentration: -1.0,
    topicConcentration: -1.0,
    vocabSize: 10000,
    stopwordFile: ,
    algorithm: em,
    checkpointDir: None,
    checkpointInterval: 10
    }
    2019-03-21 10:29:58 INFO SecurityManager:54 - Changing view acls to: root
    2019-03-21 10:29:58 INFO SecurityManager:54 - Changing modify acls to: root
    2019-03-21 10:29:58 INFO SecurityManager:54 - Changing view acls groups to:
    2019-03-21 10:29:58 INFO SecurityManager:54 - Changing modify acls groups to:
    2019-03-21 10:29:58 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
    2019-03-21 10:29:58 INFO Utils:54 - Successfully started service 'sparkDriver' on port 63504.
    2019-03-21 10:29:58 INFO SparkEnv:54 - Registering MapOutputTracker
    2019-03-21 10:29:58 INFO SparkEnv:54 - Registering BlockManagerMaster
    2019-03-21 10:29:58 INFO BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
    2019-03-21 10:29:58 INFO BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up
    2019-03-21 10:29:58 INFO DiskBlockManager:54 - Created local directory at /tmp/blockmgr-ac172871-a0e5-41af-9e76-a09ea01c3428
    2019-03-21 10:29:58 INFO MemoryStore:54 - MemoryStore started with capacity 366.3 MB
    2019-03-21 10:29:58 INFO SparkEnv:54 - Registering OutputCommitCoordinator
    2019-03-21 10:29:59 INFO log:192 - Logging initialized @4022ms
    2019-03-21 10:29:59 INFO Server:351 - jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
    2019-03-21 10:29:59 INFO Server:419 - Started @4109ms
    2019-03-21 10:29:59 WARN Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
    2019-03-21 10:29:59 INFO AbstractConnector:278 - Started ServerConnector@1386313f{HTTP/1.1,[http/1.1]}{0.0.0.0:4041}
    2019-03-21 10:29:59 INFO Utils:54 - Successfully started service 'SparkUI' on port 4041.
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1921994e{/jobs,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1a8df0b3{/jobs/json,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7c112f5f{/jobs/job,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4fd05028{/jobs/job/json,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3a2d3909{/stages,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4fb392c4{/stages/json,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@194d329e{/stages/stage,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@541179e7{/stages/stage/json,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@24386839{/stages/pool,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7b32b129{/stages/pool/json,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@439e3cb4{/storage,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1c9fbb61{/storage/json,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7b81616b{/storage/rdd,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@15d42ccb{/storage/rdd/json,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@279dd959{/environment,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@46383a78{/environment/json,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@36c281ed{/executors,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@244418a{/executors/json,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4b5a078a{/executors/threadDump,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4c361f63{/executors/threadDump/json,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6ed922e1{/static,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@35f3a22c{/,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1a0c5e9{/api,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@32e652b6{/jobs/job/kill,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4ba02375{/stages/stage/kill,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://10.10.3.47:4041
    2019-03-21 10:29:59 INFO SparkContext:54 - Added JAR file:///usr/spark-2.4.0/examples/jars/scopt_2.11-3.7.0.jar at spark://10.10.3.47:63504/jars/scopt_2.11-3.7.0.jar with timestamp 1553164199210
    2019-03-21 10:29:59 INFO SparkContext:54 - Added JAR file:///usr/spark-2.4.0/examples/jars/spark-examples_2.11-2.4.0.jar at spark://10.10.3.47:63504/jars/spark-examples_2.11-2.4.0.jar with timestamp 1553164199211
    2019-03-21 10:29:59 INFO Executor:54 - Starting executor ID driver on host localhost
    2019-03-21 10:29:59 INFO Utils:54 - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 29540.
    2019-03-21 10:29:59 INFO NettyBlockTransferService:54 - Server created on 10.10.3.47:29540
    2019-03-21 10:29:59 INFO BlockManager:54 - Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
    2019-03-21 10:29:59 INFO BlockManagerMaster:54 - Registering BlockManager BlockManagerId(driver, 10.10.3.47, 29540, None)
    2019-03-21 10:29:59 INFO BlockManagerMasterEndpoint:54 - Registering block manager 10.10.3.47:29540 with 366.3 MB RAM, BlockManagerId(driver, 10.10.3.47, 29540, None)
    2019-03-21 10:29:59 INFO BlockManagerMaster:54 - Registered BlockManager BlockManagerId(driver, 10.10.3.47, 29540, None)
    2019-03-21 10:29:59 INFO BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, 10.10.3.47, 29540, None)
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6587305a{/metrics/json,null,AVAILABLE,@Spark}

    Corpus summary:
    Training set size: 12 documents
    Vocabulary size: 10 terms
    Training set size: 62 tokens
    Preprocessing time: 5.008004889 sec

    Finished training LDA model. Summary:
    Training time: 3.565145577 sec
    Training data average log likelihood: -20.14862821928427

    20 topics:
    TOPIC 0
    0 0.2776616245970018
    1 0.2467437437143153
    2 0.162799944092254
    3 0.14357469330901143
    4 0.06928101473026803
    9 0.04699050355830519
    5 0.030203661449368247
    6 0.007640498460509008
    7 0.007552251606432597
    8 0.007552064482534436

    TOPIC 1
    0 0.36469051153508564
    1 0.18646127739213272
    2 0.15522432783704704
    3 0.13106181033451642
    4 0.06584596207691033
    9 0.04395627892970295
    5 0.030062711796482375
    8 0.007597985215275549
    7 0.00759765588135554
    6 0.007501479001491501

    TOPIC 2
    0 0.31931887563375516
    1 0.1962082806692423
    2 0.17231677857837657
    3 0.14159208272582544
    4 0.07269228847838642
    9 0.044804492369582984
    5 0.030074171215304125
    7 0.007707667829625457
    8 0.007707603060281631
    6 0.007577759439619957

    TOPIC 3
    0 0.33172427138380095
    1 0.2099908538476618
    2 0.1529315801366396
    3 0.13055578109706914
    4 0.07654434907166915
    9 0.045380994421043694
    5 0.03023101958436096
    8 0.007550809491464853
    7 0.007550747971239613
    6 0.00753959299505035

    TOPIC 4
    0 0.32678177168416583
    1 0.2207731082949569
    2 0.1515085115377772
    3 0.13531105489561282
    4 0.06780498015453475
    9 0.04539833136958728
    5 0.02983504958532268
    6 0.007557895637294794
    7 0.007514688233324302
    8 0.007514608607423405

    TOPIC 5
    0 0.3239146979457089
    1 0.21573894993478243
    2 0.14525808340440868
    3 0.14073654896536467
    4 0.07677900991703532
    9 0.04501240647651081
    5 0.03001971166224193
    6 0.007555937737508214
    7 0.00749238505224968
    8 0.007492268904189478

    TOPIC 6
    0 0.31540967245137175
    1 0.2455372478674326
    2 0.15387618434880607
    3 0.11776433445702658
    4 0.06958508618823488
    9 0.0454694806523595
    5 0.029862323423880312
    6 0.007550844343998941
    7 0.007472470469363186
    8 0.007472355797526257

    TOPIC 7
    0 0.3288373150630124
    1 0.1984831879713159
    2 0.14673351329227885
    3 0.13721294614635954
    4 0.09099655027113343
    9 0.04408525595608986
    5 0.031035301494144973
    8 0.007542328434232405
    7 0.007542260756614362
    6 0.007531340614818258

    TOPIC 8
    0 0.3193586979466173
    2 0.19372032290375615
    1 0.17385196290359722
    3 0.15067297324203144
    4 0.06431584983802775
    9 0.04459867057873216
    5 0.030109171006041484
    7 0.007889712095309039
    8 0.007889665041634011
    6 0.007592974444253261

    TOPIC 9
    0 0.30194242995872983
    1 0.19005321586219182
    2 0.18325896599060398
    3 0.13814239333008232
    4 0.08613769346253837
    9 0.04615261626883708
    5 0.031097388101704492
    8 0.007811116080942582
    7 0.00781095323650845
    6 0.007593227707861054

    TOPIC 10
    0 0.3018333962608689
    1 0.2101327107384649
    2 0.17976426989184977
    3 0.13028433691427393
    4 0.0785628713958075
    9 0.04589483819874114
    5 0.030473519782385595
    8 0.007732641246914153
    7 0.007732610576531091
    6 0.007588804994162942

    TOPIC 11
    0 0.2731844012758353
    1 0.2437828502953355
    2 0.15602289141674863
    3 0.14124037120394928
    4 0.08621787369766198
    9 0.04595175938704448
    5 0.03092529389836489
    6 0.007624331986973666
    8 0.007525230917852694
    7 0.007524995920233587

    TOPIC 12
    0 0.2754166406308875
    1 0.23790442921451685
    2 0.16407059546150252
    3 0.15224175554500483
    4 0.06911748308147463
    9 0.04812626048120002
    5 0.030286734901316135
    6 0.007658288924620451
    8 0.007588920737731278
    7 0.0075888910217458685

    TOPIC 13
    0 0.30106045484425825
    1 0.24631760902734479
    3 0.14410465573241527
    2 0.12524822006978883
    4 0.08639636802254201
    9 0.04423754911967344
    5 0.030456963732366213
    6 0.007563767806759953
    7 0.00730722647659661
    8 0.0073071851682546506

    TOPIC 14
    0 0.31562044416659885
    1 0.2526998226062497
    2 0.14535277752245734
    3 0.12218888456011873
    4 0.0662622160602047
    9 0.04569769393216447
    5 0.02981990651641202
    6 0.007553483942629782
    7 0.007402409879010876
    8 0.00740236081415358

    TOPIC 15
    0 0.32058662644832936
    1 0.20394629563372832
    2 0.14716518634378675
    3 0.1428368219219527
    4 0.08608013435759047
    9 0.046286592753669746
    5 0.030458243185553725
    6 0.007557253951254248
    8 0.007541483951171858
    7 0.007541361452962736

    TOPIC 16
    0 0.34720661993137836
    1 0.19335921917713203
    2 0.1574642973682748
    3 0.1315805755079897
    4 0.07262519571107433
    9 0.044460628102199806
    5 0.03055492723901837
    7 0.0076115120495338995
    8 0.007611307835359196
    6 0.0075257170780395

    TOPIC 17
    0 0.276113881630822
    1 0.23014600960245993
    2 0.1791490549120877
    3 0.13887566068288787
    4 0.0755460207226412
    9 0.04662339718458672
    5 0.03051544813778674
    7 0.007695876502113754
    8 0.007695757393934347
    6 0.00763889323067976

    TOPIC 18
    0 0.3143607517778961
    1 0.23372981830486098
    2 0.15462561577264133
    3 0.12748419732486796
    4 0.07250495549430869
    9 0.04461512010586714
    5 0.030101883279393397
    6 0.007561984484167818
    8 0.007507862285458234
    7 0.007507811170538331

    TOPIC 19
    0 0.27641365880548574
    1 0.25811740868556055
    2 0.15554256853273
    3 0.1299929833331776
    4 0.08206677473861888
    9 0.04536961458219285
    5 0.029948048909439938
    6 0.007602171935254187
    7 0.007473418458492894
    8 0.007473352019047361


    The official example runs as expected. Next, run it with k=3:

    ./bin/run-example mllib.LDAExample data/mllib/sample_lda_data.txt --k=3

    Corpus summary:
    Training set size: 12 documents
    Vocabulary size: 10 terms
    Training set size: 62 tokens
    Preprocessing time: 3.952237148 sec

    Finished training LDA model. Summary:
    Training time: 2.096613705 sec
    Training data average log likelihood: -19.94356723200465

    3 topics:
    TOPIC 0
    0 0.3859910534194794
    1 0.27784612698328287
    3 0.11309885992982183
    2 0.07754302744583665
    4 0.05944898611629052
    9 0.03956337141492645
    5 0.026961271416303733
    6 0.00729179016688132
    7 0.006153234509962162
    8 0.006102278597215016

    TOPIC 1
    0 0.27760966762034767
    2 0.20133458026264603
    1 0.17674769647471528
    3 0.17119003866520263
    4 0.05856966790108555
    9 0.054142861315096644
    5 0.036444242827270906
    6 0.008131736037023057
    7 0.008064284551869805
    8 0.007765224344742546

    TOPIC 2
    0 0.26781477963691924
    1 0.20419276965972555
    2 0.1988283635339889
    3 0.12491741400170082
    4 0.10934994319103938
    9 0.0426866734196486
    5 0.027519743427499636
    8 0.008867804538854353
    7 0.008517400571616986
    6 0.007305108019006569

    Now look at the contents of sample_lda_data.txt:
    root@wx-social-consume2:/usr/spark-2.4.0# cat data/mllib/sample_lda_data.txt
    1 2 6 0 2 3 1 1 0 0 3
    1 3 0 1 3 0 0 2 0 0 1
    1 4 1 0 0 4 9 0 1 2 0
    2 1 0 3 0 0 5 0 2 3 9
    3 1 1 9 3 0 2 0 0 1 3
    4 2 0 3 4 5 1 1 1 4 0
    2 1 0 3 0 0 5 0 2 2 9
    1 1 1 9 2 1 2 0 0 1 3
    4 4 0 3 4 2 1 3 0 0 0
    2 8 2 0 3 0 2 0 2 7 2
    1 1 1 9 0 2 2 0 0 3 3
    4 1 0 0 4 5 1 3 0 1 0

    The matrix should have 11 rows (each document line has 11 values), so why does each topic list only 10 terms? Is my understanding wrong?
    TOPIC 2
    0 0.26781477963691924
    1 0.20419276965972555
    2 0.1988283635339889
    3 0.12491741400170082
    4 0.10934994319103938
    9 0.0426866734196486
    5 0.027519743427499636
    8 0.008867804538854353
    7 0.008517400571616986
    6 0.007305108019006569

    Looking at the code:
    val topicIndices = ldaModel.describeTopics(maxTermsPerTopic = 10)
    it only takes the top 10 terms per topic, so the understanding should be fine.
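
    For reference, here is a minimal sketch of how describeTopics can reproduce the per-topic listings above (it assumes a trained ldaModel is in scope, e.g. from the spark-shell example earlier; the printing loop is illustrative, not taken from LDAExample.scala):

    // describeTopics returns, for each topic, the term indices and their
    // weights sorted by descending weight; maxTermsPerTopic caps the list.
    val topicIndices = ldaModel.describeTopics(maxTermsPerTopic = 10)
    topicIndices.zipWithIndex.foreach { case ((termIndices, termWeights), topicId) =>
      println(s"TOPIC $topicId")
      termIndices.zip(termWeights).foreach { case (term, weight) =>
        println(s"$term\t$weight")
      }
      println()
    }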


    So far this is still offline, single-node computation.
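
    To address problem 2 (scaling beyond a single node), the same steps can be packaged as a regular Spark application and submitted to a cluster. The sketch below is only an illustration under assumptions: the runLda helper and the RDD of pre-tokenized documents are made up for the example; in the real pipeline the tokens would come from the Lucene analyzer sketched earlier.

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.clustering.{LDA, LDAModel}
    import org.apache.spark.mllib.linalg.Vectors

    def runLda(sc: SparkContext, docs: RDD[Seq[String]], k: Int): LDAModel = {
      // Build a term -> index vocabulary from the tokenized documents.
      // (Collecting to the driver is fine for a feasibility test; for a large
      // vocabulary, spark.ml's CountVectorizer would be a better fit.)
      val vocab = docs.flatMap(ts => ts).distinct().collect().zipWithIndex.toMap
      val vocabSize = vocab.size

      // Turn each document into a sparse term-count vector and index it.
      val corpus = docs.map { terms =>
        val counts = terms.groupBy(identity).map { case (t, ts) => (vocab(t), ts.size.toDouble) }
        Vectors.sparse(vocabSize, counts.toSeq)
      }.zipWithIndex.map(_.swap).cache()

      new LDA().setK(k).setMaxIterations(10).run(corpus)
    }

    Submitted with spark-submit against a cluster master instead of local mode, the same code would then run distributed without changes.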
