Data Analysis (Part 2)

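1. Grouping research project outcomes

　　Design idea: use k-means to group research project outcomes by the keywords extracted from each project's introduction. The code below calls HanLP's repeated bisection clustering, which is a variant of k-means.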

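import jieba
from pyhanlp import JClass

ClusterAnalyzer = JClass('com.hankcs.hanlp.mining.cluster.ClusterAnalyzer')

# get_stop_words, getData, getcode and insert are this project's own helpers (not shown here).
stop_hanzi = get_stop_words()  # load the stop-word list
datas = getData()              # load the project records
analyzer = ClusterAnalyzer()   # create the clustering object
for item in datas:
    content = item['intro']
    label = item['type']
    # segment the introduction with jieba and drop stop words
    content = [x for x in jieba.cut(content) if x not in stop_hanzi]
    analyzer.addDocument(label, ",".join(content))  # register the document under its label
print(analyzer.repeatedBisection(3))    # repeated bisection clustering into a fixed number of clusters
print(analyzer.repeatedBisection(1.0))  # float argument: let the algorithm choose the number of clusters k

# Post-process the clusters and insert them into the database
grouptype = analyzer.repeatedBisection(1.0)
finalgrouptype = []
for item in grouptype:
    group = []
    for jtem in item:
        group.append(getcode(jtem))  # map each label to its project code
    finalgrouptype.append(group)
print(finalgrouptype)
num = 0
for item in finalgrouptype:
    num += 1  # group numbers start at 1
    for jtem in item:
        insert(num, jtem)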

The listing above is the key part of my code.
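　　To make the post-processing step easier to follow, here is a minimal, self-contained sketch of turning clusters into (group number, code) rows. It is not the project's code: `clusters_to_rows`, the sample clusters, and the label-to-code mapping are all hypothetical stand-ins for the real `getcode` and `insert` helpers, which are not shown in this post.

```python
from typing import Callable, Iterable, List, Tuple


def clusters_to_rows(
    clusters: Iterable[Iterable[str]],
    to_code: Callable[[str], str],
) -> List[Tuple[int, str]]:
    """Flatten clusters into (group_number, code) rows, numbering groups from 1."""
    rows = []
    for group_number, cluster in enumerate(clusters, start=1):
        # Clusters are unordered sets, so sort the labels for a stable row order.
        for label in sorted(str(label) for label in cluster):
            rows.append((group_number, to_code(label)))
    return rows


# Made-up data: two clusters of document labels and a made-up label-to-code mapping.
sample_clusters = [{"project A", "project B"}, {"project C"}]
sample_codes = {"project A": "P001", "project B": "P002", "project C": "P003"}

rows = clusters_to_rows(sample_clusters, sample_codes.__getitem__)
print(rows)
```

　　Each row could then be passed to a database insert call. Building the rows first keeps the clustering result and the database write separate, which makes the grouping easy to inspect before anything is stored.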

　　Here is the official test example:

from pyhanlp import *

ClusterAnalyzer = JClass('com.hankcs.hanlp.mining.cluster.ClusterAnalyzer')

if __name__ == '__main__':
    analyzer = ClusterAnalyzer()
    # Each document is a person's name plus the music genres they listened to.
    analyzer.addDocument("赵一", "流行, 流行, 流行, 流行, 流行, 流行, 流行, 流行, 流行, 流行, 蓝调, 蓝调, 蓝调, 蓝调, 蓝调, 蓝调, 摇滚, 摇滚, 摇滚, 摇滚")
    analyzer.addDocument("钱二", "爵士, 爵士, 爵士, 爵士, 爵士, 爵士, 爵士, 爵士, 舞曲, 舞曲, 舞曲, 舞曲, 舞曲, 舞曲, 舞曲, 舞曲, 舞曲")
    analyzer.addDocument("张三", "古典, 古典, 古典, 古典, 民谣, 民谣, 民谣, 民谣")
    analyzer.addDocument("李四", "爵士, 爵士, 爵士, 爵士, 爵士, 爵士, 爵士, 爵士, 爵士, 金属, 金属, 舞曲, 舞曲, 舞曲, 舞曲, 舞曲, 舞曲")
    analyzer.addDocument("王五", "流行, 流行, 流行, 流行, 摇滚, 摇滚, 摇滚, 嘻哈, 嘻哈, 嘻哈")
    analyzer.addDocument("马六", "古典, 古典, 古典, 古典, 古典, 古典, 古典, 古典, 摇滚")

    print(analyzer.repeatedBisection(3))    # repeated bisection clustering with a fixed number of clusters
    print(analyzer.repeatedBisection(1.0))  # float argument: the number of clusters k is chosen automatically


　　Visualization: ECharts chart example (demo).

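2. Strength of the relationships between research project outcomes

　　Design idea: first check whether the keywords of two outcomes are related, then whether they come from the same organization, then whether they share any contributors, and from these checks determine the strength of the relationship.

　　The project code for this part is not included here.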
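　　For illustration only, the three checks could be combined as in the sketch below. This is not the project's implementation: the `Outcome` class, the `relation_strength` function, and the 0 to 3 scoring scale are assumptions, and "related keywords" is simplified here to "at least one keyword in common".

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Outcome:
    """Hypothetical record for one research outcome."""
    name: str
    keywords: Set[str] = field(default_factory=set)
    organization: str = ""
    contributors: Set[str] = field(default_factory=set)


def relation_strength(a: Outcome, b: Outcome) -> int:
    """Return 0 (unrelated) to 3 (strongest), under the assumed scoring."""
    # Step 1: without a shared keyword the outcomes count as unrelated.
    if not (a.keywords & b.keywords):
        return 0
    strength = 1
    # Step 2: same (non-empty) organization adds one level.
    if a.organization and a.organization == b.organization:
        strength += 1
    # Step 3: at least one shared contributor adds one level.
    if a.contributors & b.contributors:
        strength += 1
    return strength


# Made-up sample data.
x = Outcome("X", {"clustering", "nlp"}, "Org A", {"Li"})
y = Outcome("Y", {"nlp"}, "Org A", {"Li", "Wang"})
z = Outcome("Z", {"nlp"}, "Org B", {"Zhao"})
w = Outcome("W", {"chemistry"}, "Org A", {"Li"})

print(relation_strength(x, y), relation_strength(x, z), relation_strength(x, w))
```

　　Treating the keyword check as a gate follows the order given in the design idea; a different weighting of the three checks would be just as easy to plug in.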

　　Visualization: ECharts chart example (demo).

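3. Industry strengths and weaknesses in the Beijing-Tianjin-Hebei region

　　Design idea: rank the industries by advantage, report the top three and the bottom three, and display the current state of every industry.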
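　　A minimal sketch of the ranking step, assuming each industry already has a single numeric advantage score. The function name `top_and_bottom`, the industry names, and the scores are made up for illustration; how the real score is computed is not described in this post.

```python
from typing import Dict, List, Tuple

Ranked = List[Tuple[str, float]]


def top_and_bottom(scores: Dict[str, float], n: int = 3) -> Tuple[Ranked, Ranked]:
    """Return the n highest-scoring and n lowest-scoring industries."""
    # Sort from strongest to weakest.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    # With fewer than 2 * n industries the two slices overlap.
    return ranked[:n], ranked[-n:]


# Made-up scores for seven placeholder industries.
sample_scores = {
    "Industry A": 9.1,
    "Industry B": 7.4,
    "Industry C": 6.8,
    "Industry D": 5.0,
    "Industry E": 3.2,
    "Industry F": 2.7,
    "Industry G": 1.5,
}

top3, bottom3 = top_and_bottom(sample_scores)
print("strongest:", top3)
print("weakest:", bottom3)
```

　　The full `sample_scores` mapping is what would feed a chart of all industries, while `top3` and `bottom3` are the two highlighted subsets.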

　　Visualization: ECharts chart example (demo).

Original post: https://www.cnblogs.com/goubb/p/12542211.html