• A simple spark mllib lda example


    The daily hot-words feature of our public opinion monitoring system uses LDA topic clustering.

    The original version was a Python project: Jieba for word segmentation and Gensim for LDA.

    The project works well,

    but it has the following problems:


    1 The product is built on Elasticsearch for its big-data store, and ES uses Lucene analysis internally. Python's Jieba segmentation and Lucene's analysis do not produce the same tokens (or would need extra work to keep consistent). The early requirement was only to display daily hot words, so the mismatch did not matter. The new requirement is for LDA to integrate seamlessly with the data; integrating Jieba into ES and re-tokenizing the entire data set already in ES is unrealistic in both workload and technical difficulty, so the only practical option is to change the tokenizer used for LDA. (In practice, the choice of tokenizer makes almost no difference for LDA topic and hot-word extraction.)

    2 The Python project is limited to single-node computation and does not scale well.

    Tokenizing with Lucene and then running LDA, i.e. switching to the Java technology stack, is the simplest solution; a rough sketch of the tokenization step follows below.
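
    As a minimal sketch only (assuming Lucene's smartcn analyzer stands in for whatever analyzer the ES index actually uses; the tokenize helper below is illustrative, not the project's code), tokenizing text with a Lucene analyzer from Scala looks roughly like this:

    import org.apache.lucene.analysis.Analyzer
    import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
    import scala.collection.mutable.ArrayBuffer

    // Run one document through a Lucene analyzer and collect its terms.
    def tokenize(analyzer: Analyzer, text: String): Seq[String] = {
      val stream = analyzer.tokenStream("content", text)
      val termAttr = stream.addAttribute(classOf[CharTermAttribute])
      val terms = ArrayBuffer[String]()
      stream.reset()
      while (stream.incrementToken()) {
        terms += termAttr.toString
      }
      stream.end()
      stream.close()
      terms
    }

    // Requires the lucene-analyzers-smartcn dependency (an assumption here).
    val analyzer = new SmartChineseAnalyzer()
    println(tokenize(analyzer, "舆情系统每日热词").mkString(" "))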

    There are plenty of other LDA implementations for Java as well; this is only meant as a feasibility check.

    This post mainly investigates spark-mllib, which provides an official LDA implementation:

    http://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda


    ls examples/src/main/scala/org/apache/spark/examples/mllib/LDAExample.scala

    root@wx-social-consume2:/usr/spark-2.4.0# ./bin/run-example mllib.LDAExample
    2019-03-21 10:28:07 WARN Utils:66 - Your hostname, wx-social-consume2 resolves to a loopback address: 127.0.0.1; using 10.10.3.47 instead (on interface em1)
    2019-03-21 10:28:07 WARN Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
    2019-03-21 10:28:07 WARN NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Error: Missing argument <input>...
    LDAExample: an example LDA app for plain text data.
    Usage: LDAExample [options] <input>...

    --k <value> number of topics. default: 20
    --maxIterations <value> number of iterations of learning. default: 10
    --docConcentration <value>
    amount of topic smoothing to use (> 1.0) (-1=auto). default: -1.0
    --topicConcentration <value>
    amount of term (word) smoothing to use (> 1.0) (-1=auto). default: -1.0
    --vocabSize <value> number of distinct word types to use, chosen by frequency. (-1=all) default: 10000
    --stopwordFile <value> filepath for a list of stopwords. Note: This must fit on a single machine. default:
    --algorithm <value> inference algorithm to use. em and online are supported. default: em
    --checkpointDir <value> Directory for checkpointing intermediate results. Checkpointing helps with recovery and eliminates temporary shuffle files on disk. default: None
    --checkpointInterval <value>
    Iterations between each checkpoint. Only used if checkpointDir is set. default: 10
    <input>... input paths (directories) to plain text corpora. Each text file line should hold 1 document.
    2019-03-21 10:28:10 INFO ShutdownHookManager:54 - Shutdown hook called
    2019-03-21 10:28:10 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-0022e162-fa2a-4e0e-8e89-c78ea7962dd1


    The official example:
    Refer to the LDA Scala docs and DistributedLDAModel Scala docs for details on the API.

    import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
    import org.apache.spark.mllib.linalg.Vectors

    // Load and parse the data
    val data = sc.textFile("data/mllib/sample_lda_data.txt")
    val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
    // Index documents with unique IDs
    val corpus = parsedData.zipWithIndex.map(_.swap).cache()

    // Cluster the documents into three topics using LDA
    val ldaModel = new LDA().setK(3).run(corpus)

    // Output topics. Each is a distribution over words (matching word count vectors)
    println(s"Learned topics (as distributions over vocab of ${ldaModel.vocabSize} words):")
    val topics = ldaModel.topicsMatrix
    for (topic <- Range(0, 3)) {
      print(s"Topic $topic :")
      for (word <- Range(0, ldaModel.vocabSize)) {
        print(s"${topics(word, topic)}")
      }
      println()
    }

    // Save and load model.
    ldaModel.save(sc, "target/org/apache/spark/LatentDirichletAllocationExample/LDAModel")
    val sameModel = DistributedLDAModel.load(sc,
      "target/org/apache/spark/LatentDirichletAllocationExample/LDAModel")


    The input file is "data/mllib/sample_lda_data.txt":

    root@wx-social-consume2:/usr/spark-2.4.0# ./bin/run-example mllib.LDAExample data/mllib/sample_lda_data.txt
    2019-03-21 10:29:55 WARN Utils:66 - Your hostname, wx-social-consume2 resolves to a loopback address: 127.0.0.1; using 10.10.3.47 instead (on interface em1)
    2019-03-21 10:29:55 WARN Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
    2019-03-21 10:29:56 WARN NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    2019-03-21 10:29:58 INFO SparkContext:54 - Running Spark version 2.4.0
    2019-03-21 10:29:58 INFO SparkContext:54 - Submitted application: LDAExample with {
    input: List(data/mllib/sample_lda_data.txt),
    k: 20,
    maxIterations: 10,
    docConcentration: -1.0,
    topicConcentration: -1.0,
    vocabSize: 10000,
    stopwordFile: ,
    algorithm: em,
    checkpointDir: None,
    checkpointInterval: 10
    }
    2019-03-21 10:29:58 INFO SecurityManager:54 - Changing view acls to: root
    2019-03-21 10:29:58 INFO SecurityManager:54 - Changing modify acls to: root
    2019-03-21 10:29:58 INFO SecurityManager:54 - Changing view acls groups to:
    2019-03-21 10:29:58 INFO SecurityManager:54 - Changing modify acls groups to:
    2019-03-21 10:29:58 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
    2019-03-21 10:29:58 INFO Utils:54 - Successfully started service 'sparkDriver' on port 63504.
    2019-03-21 10:29:58 INFO SparkEnv:54 - Registering MapOutputTracker
    2019-03-21 10:29:58 INFO SparkEnv:54 - Registering BlockManagerMaster
    2019-03-21 10:29:58 INFO BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
    2019-03-21 10:29:58 INFO BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up
    2019-03-21 10:29:58 INFO DiskBlockManager:54 - Created local directory at /tmp/blockmgr-ac172871-a0e5-41af-9e76-a09ea01c3428
    2019-03-21 10:29:58 INFO MemoryStore:54 - MemoryStore started with capacity 366.3 MB
    2019-03-21 10:29:58 INFO SparkEnv:54 - Registering OutputCommitCoordinator
    2019-03-21 10:29:59 INFO log:192 - Logging initialized @4022ms
    2019-03-21 10:29:59 INFO Server:351 - jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
    2019-03-21 10:29:59 INFO Server:419 - Started @4109ms
    2019-03-21 10:29:59 WARN Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
    2019-03-21 10:29:59 INFO AbstractConnector:278 - Started ServerConnector@1386313f{HTTP/1.1,[http/1.1]}{0.0.0.0:4041}
    2019-03-21 10:29:59 INFO Utils:54 - Successfully started service 'SparkUI' on port 4041.
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1921994e{/jobs,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1a8df0b3{/jobs/json,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7c112f5f{/jobs/job,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4fd05028{/jobs/job/json,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3a2d3909{/stages,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4fb392c4{/stages/json,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@194d329e{/stages/stage,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@541179e7{/stages/stage/json,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@24386839{/stages/pool,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7b32b129{/stages/pool/json,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@439e3cb4{/storage,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1c9fbb61{/storage/json,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7b81616b{/storage/rdd,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@15d42ccb{/storage/rdd/json,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@279dd959{/environment,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@46383a78{/environment/json,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@36c281ed{/executors,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@244418a{/executors/json,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4b5a078a{/executors/threadDump,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4c361f63{/executors/threadDump/json,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6ed922e1{/static,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@35f3a22c{/,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1a0c5e9{/api,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@32e652b6{/jobs/job/kill,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4ba02375{/stages/stage/kill,null,AVAILABLE,@Spark}
    2019-03-21 10:29:59 INFO SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://10.10.3.47:4041
    2019-03-21 10:29:59 INFO SparkContext:54 - Added JAR file:///usr/spark-2.4.0/examples/jars/scopt_2.11-3.7.0.jar at spark://10.10.3.47:63504/jars/scopt_2.11-3.7.0.jar with timestamp 1553164199210
    2019-03-21 10:29:59 INFO SparkContext:54 - Added JAR file:///usr/spark-2.4.0/examples/jars/spark-examples_2.11-2.4.0.jar at spark://10.10.3.47:63504/jars/spark-examples_2.11-2.4.0.jar with timestamp 1553164199211
    2019-03-21 10:29:59 INFO Executor:54 - Starting executor ID driver on host localhost
    2019-03-21 10:29:59 INFO Utils:54 - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 29540.
    2019-03-21 10:29:59 INFO NettyBlockTransferService:54 - Server created on 10.10.3.47:29540
    2019-03-21 10:29:59 INFO BlockManager:54 - Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
    2019-03-21 10:29:59 INFO BlockManagerMaster:54 - Registering BlockManager BlockManagerId(driver, 10.10.3.47, 29540, None)
    2019-03-21 10:29:59 INFO BlockManagerMasterEndpoint:54 - Registering block manager 10.10.3.47:29540 with 366.3 MB RAM, BlockManagerId(driver, 10.10.3.47, 29540, None)
    2019-03-21 10:29:59 INFO BlockManagerMaster:54 - Registered BlockManager BlockManagerId(driver, 10.10.3.47, 29540, None)
    2019-03-21 10:29:59 INFO BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, 10.10.3.47, 29540, None)
    2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6587305a{/metrics/json,null,AVAILABLE,@Spark}

    Corpus summary:
    Training set size: 12 documents
    Vocabulary size: 10 terms
    Training set size: 62 tokens
    Preprocessing time: 5.008004889 sec

    Finished training LDA model. Summary:
    Training time: 3.565145577 sec
    Training data average log likelihood: -20.14862821928427

    20 topics:
    TOPIC 0
    0 0.2776616245970018
    1 0.2467437437143153
    2 0.162799944092254
    3 0.14357469330901143
    4 0.06928101473026803
    9 0.04699050355830519
    5 0.030203661449368247
    6 0.007640498460509008
    7 0.007552251606432597
    8 0.007552064482534436

    TOPIC 1
    0 0.36469051153508564
    1 0.18646127739213272
    2 0.15522432783704704
    3 0.13106181033451642
    4 0.06584596207691033
    9 0.04395627892970295
    5 0.030062711796482375
    8 0.007597985215275549
    7 0.00759765588135554
    6 0.007501479001491501

    TOPIC 2
    0 0.31931887563375516
    1 0.1962082806692423
    2 0.17231677857837657
    3 0.14159208272582544
    4 0.07269228847838642
    9 0.044804492369582984
    5 0.030074171215304125
    7 0.007707667829625457
    8 0.007707603060281631
    6 0.007577759439619957

    TOPIC 3
    0 0.33172427138380095
    1 0.2099908538476618
    2 0.1529315801366396
    3 0.13055578109706914
    4 0.07654434907166915
    9 0.045380994421043694
    5 0.03023101958436096
    8 0.007550809491464853
    7 0.007550747971239613
    6 0.00753959299505035

    TOPIC 4
    0 0.32678177168416583
    1 0.2207731082949569
    2 0.1515085115377772
    3 0.13531105489561282
    4 0.06780498015453475
    9 0.04539833136958728
    5 0.02983504958532268
    6 0.007557895637294794
    7 0.007514688233324302
    8 0.007514608607423405

    TOPIC 5
    0 0.3239146979457089
    1 0.21573894993478243
    2 0.14525808340440868
    3 0.14073654896536467
    4 0.07677900991703532
    9 0.04501240647651081
    5 0.03001971166224193
    6 0.007555937737508214
    7 0.00749238505224968
    8 0.007492268904189478

    TOPIC 6
    0 0.31540967245137175
    1 0.2455372478674326
    2 0.15387618434880607
    3 0.11776433445702658
    4 0.06958508618823488
    9 0.0454694806523595
    5 0.029862323423880312
    6 0.007550844343998941
    7 0.007472470469363186
    8 0.007472355797526257

    TOPIC 7
    0 0.3288373150630124
    1 0.1984831879713159
    2 0.14673351329227885
    3 0.13721294614635954
    4 0.09099655027113343
    9 0.04408525595608986
    5 0.031035301494144973
    8 0.007542328434232405
    7 0.007542260756614362
    6 0.007531340614818258

    TOPIC 8
    0 0.3193586979466173
    2 0.19372032290375615
    1 0.17385196290359722
    3 0.15067297324203144
    4 0.06431584983802775
    9 0.04459867057873216
    5 0.030109171006041484
    7 0.007889712095309039
    8 0.007889665041634011
    6 0.007592974444253261

    TOPIC 9
    0 0.30194242995872983
    1 0.19005321586219182
    2 0.18325896599060398
    3 0.13814239333008232
    4 0.08613769346253837
    9 0.04615261626883708
    5 0.031097388101704492
    8 0.007811116080942582
    7 0.00781095323650845
    6 0.007593227707861054

    TOPIC 10
    0 0.3018333962608689
    1 0.2101327107384649
    2 0.17976426989184977
    3 0.13028433691427393
    4 0.0785628713958075
    9 0.04589483819874114
    5 0.030473519782385595
    8 0.007732641246914153
    7 0.007732610576531091
    6 0.007588804994162942

    TOPIC 11
    0 0.2731844012758353
    1 0.2437828502953355
    2 0.15602289141674863
    3 0.14124037120394928
    4 0.08621787369766198
    9 0.04595175938704448
    5 0.03092529389836489
    6 0.007624331986973666
    8 0.007525230917852694
    7 0.007524995920233587

    TOPIC 12
    0 0.2754166406308875
    1 0.23790442921451685
    2 0.16407059546150252
    3 0.15224175554500483
    4 0.06911748308147463
    9 0.04812626048120002
    5 0.030286734901316135
    6 0.007658288924620451
    8 0.007588920737731278
    7 0.0075888910217458685

    TOPIC 13
    0 0.30106045484425825
    1 0.24631760902734479
    3 0.14410465573241527
    2 0.12524822006978883
    4 0.08639636802254201
    9 0.04423754911967344
    5 0.030456963732366213
    6 0.007563767806759953
    7 0.00730722647659661
    8 0.0073071851682546506

    TOPIC 14
    0 0.31562044416659885
    1 0.2526998226062497
    2 0.14535277752245734
    3 0.12218888456011873
    4 0.0662622160602047
    9 0.04569769393216447
    5 0.02981990651641202
    6 0.007553483942629782
    7 0.007402409879010876
    8 0.00740236081415358

    TOPIC 15
    0 0.32058662644832936
    1 0.20394629563372832
    2 0.14716518634378675
    3 0.1428368219219527
    4 0.08608013435759047
    9 0.046286592753669746
    5 0.030458243185553725
    6 0.007557253951254248
    8 0.007541483951171858
    7 0.007541361452962736

    TOPIC 16
    0 0.34720661993137836
    1 0.19335921917713203
    2 0.1574642973682748
    3 0.1315805755079897
    4 0.07262519571107433
    9 0.044460628102199806
    5 0.03055492723901837
    7 0.0076115120495338995
    8 0.007611307835359196
    6 0.0075257170780395

    TOPIC 17
    0 0.276113881630822
    1 0.23014600960245993
    2 0.1791490549120877
    3 0.13887566068288787
    4 0.0755460207226412
    9 0.04662339718458672
    5 0.03051544813778674
    7 0.007695876502113754
    8 0.007695757393934347
    6 0.00763889323067976

    TOPIC 18
    0 0.3143607517778961
    1 0.23372981830486098
    2 0.15462561577264133
    3 0.12748419732486796
    4 0.07250495549430869
    9 0.04461512010586714
    5 0.030101883279393397
    6 0.007561984484167818
    8 0.007507862285458234
    7 0.007507811170538331

    TOPIC 19
    0 0.27641365880548574
    1 0.25811740868556055
    2 0.15554256853273
    3 0.1299929833331776
    4 0.08206677473861888
    9 0.04536961458219285
    5 0.029948048909439938
    6 0.007602171935254187
    7 0.007473418458492894
    8 0.007473352019047361


    The official example runs as expected. Next, run it with k=3:

    ./bin/run-example mllib.LDAExample data/mllib/sample_lda_data.txt --k=3

    Corpus summary:
    Training set size: 12 documents
    Vocabulary size: 10 terms
    Training set size: 62 tokens
    Preprocessing time: 3.952237148 sec

    Finished training LDA model. Summary:
    Training time: 2.096613705 sec
    Training data average log likelihood: -19.94356723200465

    3 topics:
    TOPIC 0
    0 0.3859910534194794
    1 0.27784612698328287
    3 0.11309885992982183
    2 0.07754302744583665
    4 0.05944898611629052
    9 0.03956337141492645
    5 0.026961271416303733
    6 0.00729179016688132
    7 0.006153234509962162
    8 0.006102278597215016

    TOPIC 1
    0 0.27760966762034767
    2 0.20133458026264603
    1 0.17674769647471528
    3 0.17119003866520263
    4 0.05856966790108555
    9 0.054142861315096644
    5 0.036444242827270906
    6 0.008131736037023057
    7 0.008064284551869805
    8 0.007765224344742546

    TOPIC 2
    0 0.26781477963691924
    1 0.20419276965972555
    2 0.1988283635339889
    3 0.12491741400170082
    4 0.10934994319103938
    9 0.0426866734196486
    5 0.027519743427499636
    8 0.008867804538854353
    7 0.008517400571616986
    6 0.007305108019006569

    Now look at the contents of sample_lda_data.txt:
    root@wx-social-consume2:/usr/spark-2.4.0# cat data/mllib/sample_lda_data.txt
    1 2 6 0 2 3 1 1 0 0 3
    1 3 0 1 3 0 0 2 0 0 1
    1 4 1 0 0 4 9 0 1 2 0
    2 1 0 3 0 0 5 0 2 3 9
    3 1 1 9 3 0 2 0 0 1 3
    4 2 0 3 4 5 1 1 1 4 0
    2 1 0 3 0 0 5 0 2 2 9
    1 1 1 9 2 1 2 0 0 1 3
    4 4 0 3 4 2 1 3 0 0 0
    2 8 2 0 3 0 2 0 2 7 2
    1 1 1 9 0 2 2 0 0 3 3
    4 1 0 0 4 5 1 3 0 1 0

    The matrix should have 11 rows (each document line has 11 values), so why does each topic list only 10 terms? Is my understanding wrong?
    TOPIC 2
    0 0.26781477963691924
    1 0.20419276965972555
    2 0.1988283635339889
    3 0.12491741400170082
    4 0.10934994319103938
    9 0.0426866734196486
    5 0.027519743427499636
    8 0.008867804538854353
    7 0.008517400571616986
    6 0.007305108019006569

    Looking at the code:
    val topicIndices = ldaModel.describeTopics(maxTermsPerTopic = 10)
    it only takes the top 10 terms per topic, so the understanding should be fine.
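
    For reference, here is a minimal sketch of how describeTopics can reproduce the per-topic listings above (it assumes a trained ldaModel is in scope, e.g. from the spark-shell example earlier; the printing loop is illustrative, not taken from LDAExample.scala):

    // describeTopics returns, for each topic, the term indices and their
    // weights sorted by descending weight; maxTermsPerTopic caps the list.
    val topicIndices = ldaModel.describeTopics(maxTermsPerTopic = 10)
    topicIndices.zipWithIndex.foreach { case ((termIndices, termWeights), topicId) =>
      println(s"TOPIC $topicId")
      termIndices.zip(termWeights).foreach { case (term, weight) =>
        println(s"$term\t$weight")
      }
      println()
    }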


    So far this is still offline, single-node computation.
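
    To address problem 2 (scaling beyond a single node), the same steps can be packaged as a regular Spark application and submitted to a cluster. The sketch below is only an illustration under assumptions: the runLda helper and the RDD of pre-tokenized documents are made up for the example; in the real pipeline the tokens would come from the Lucene analyzer sketched earlier.

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.clustering.{LDA, LDAModel}
    import org.apache.spark.mllib.linalg.Vectors

    def runLda(sc: SparkContext, docs: RDD[Seq[String]], k: Int): LDAModel = {
      // Build a term -> index vocabulary from the tokenized documents.
      // (Collecting to the driver is fine for a feasibility test; for a large
      // vocabulary, spark.ml's CountVectorizer would be a better fit.)
      val vocab = docs.flatMap(ts => ts).distinct().collect().zipWithIndex.toMap
      val vocabSize = vocab.size

      // Turn each document into a sparse term-count vector and index it.
      val corpus = docs.map { terms =>
        val counts = terms.groupBy(identity).map { case (t, ts) => (vocab(t), ts.size.toDouble) }
        Vectors.sparse(vocabSize, counts.toSeq)
      }.zipWithIndex.map(_.swap).cache()

      new LDA().setK(k).setMaxIterations(10).run(corpus)
    }

    Submitted with spark-submit against a cluster master instead of local mode, the same code would then run distributed without changes.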
