NLP (11): Extracting Text Summaries

Original link: http://www.one2know.cn/nlp11/

Functions in the gensim.summarization library
gensim.summarization.summarize(text, ratio=0.2, word_count=None, split=False)
Parameters:
text : str
Given text.
ratio : float, optional
Number between 0 and 1 that determines the proportion of the number of
sentences of the original text to be chosen for the summary.
word_count : int or None, optional
Determines how many words the output will contain.
If both parameters are provided, the ratio will be ignored.
split : bool, optional
If True, a list of sentences will be returned. Otherwise a single
joined string will be returned.
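To make the interplay of `ratio`, `word_count` and `split` concrete, here is a minimal sketch. It is not gensim's code: `select_summary` is a hypothetical helper that takes sentences with scores already attached and only mimics the documented parameter semantics. The way it stops at the word limit and the newline used for joining are simplifying assumptions.

```python
def select_summary(scored_sentences, ratio=0.2, word_count=None, split=False):
    """Illustrative only: mimics the documented parameters, not gensim's code.

    scored_sentences is a list of (sentence, score) pairs in document order.
    """
    n = len(scored_sentences)
    # Rank sentence indices by score, highest first.
    ranked = sorted(range(n), key=lambda i: scored_sentences[i][1], reverse=True)
    if word_count is None:
        # ratio decides what fraction of the sentences is kept.
        chosen = ranked[:int(n * ratio)]
    else:
        # word_count takes precedence, so ratio is ignored here.
        chosen, total = [], 0
        for i in ranked:
            words = len(scored_sentences[i][0].split())
            if total + words > word_count:
                break  # assumption: stop before exceeding the limit
            chosen.append(i)
            total += words
    # Present the chosen sentences in their original order.
    picked = [scored_sentences[i][0] for i in sorted(chosen)]
    return picked if split else "\n".join(picked)


demo = [("a b c", 0.1), ("d e", 0.9), ("f g h i", 0.5), ("j", 0.3), ("k l", 0.2)]
print(select_summary(demo, ratio=0.4, split=True))
print(select_summary(demo, word_count=7))
```

With `split=True` the result is a list of sentences; otherwise it is one string.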
Code

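from gensim.summarization import summarize # TextRank-based summarization; this module was removed in gensim 4.0, so gensim < 4.0 is required
from bs4 import BeautifulSoup # BeautifulSoup, used to parse HTML documents
import requests # library for downloading HTTP resources
urls = { # dictionary mapping paper title to URL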
    'Deconstructing Voice-over-IP':
    'http://scigen.csail.mit.edu/scicache/269/scimakelatex.25977.A.+G.+Hassan.html',
    'Exploration of the Location-Identity Split':
    'http://scigen.csail.mit.edu/scicache/270/scimakelatex.26087.Ali+Veli.Veli+Ali.Vel+Al.html',
}
# Actual abstracts of the two papers:
# 1.The implications of ambimorphic archetypes have been far-reaching and pervasive. After years of natural research into consistent hashing, we argue the simulation of public-private key pairs, which embodies the confirmed principles of theory. Such a hypothesis might seem perverse but is derived from known results. Our focus in this paper is not on whether the well-known knowledge-based algorithm for the emulation of checksums by Herbert Simon runs in Θ( n ) time, but rather on exploring a semantic tool for harnessing telephony (Swale).
# 2.Superblocks must work. Given the current status of homogeneous configurations, security experts particularly desire the simulation of 802.11b. we consider how the Internet can be applied to the refinement of Scheme.
for title, url in urls.items():
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    data = soup.get_text() # text of the page with HTML tags stripped
    # Note: str.find returns -1 if a marker is missing, which would silently give a wrong slice
    pos1 = data.find('1 Introduction') + len('1 Introduction')
    pos2 = data.find('Related Work')
    text = data[pos1:pos2].strip() # the introduction, i.e. the text between pos1 and pos2
    print('PAPER URL: {}'.format(url))
    print('TITLE: {}'.format(title))
    print('GENERATED SUMMARY: {}'.format(summarize(text)))
    print()
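The introduction-extraction step in the loop above relies on `str.find`, which returns -1 when a marker is absent. The slice then quietly starts or ends in the wrong place instead of failing. A minimal sketch of a stricter version follows. `extract_between` is a hypothetical helper name, not part of the original code, and returning `None` for a missing marker is an assumed convention.

```python
def extract_between(data, start_marker, end_marker):
    """Return the text between two markers, or None if either is missing."""
    start = data.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    # Search for the end marker only after the start marker.
    end = data.find(end_marker, start)
    if end == -1:
        return None
    return data[start:end].strip()


page = "Abstract ... 1 Introduction  Superblocks must work.  Related Work ..."
print(extract_between(page, "1 Introduction", "Related Work"))
print(extract_between(page, "1 Introduction", "Conclusion"))
```

In the loop, a `None` result could be used to skip a page whose layout does not match, rather than summarizing an arbitrary slice.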

Output:

PAPER URL: http://scigen.csail.mit.edu/scicache/269/scimakelatex.25977.A.+G.+Hassan.html
TITLE: Deconstructing Voice-over-IP
GENERATED SUMMARY: ......

PAPER URL: http://scigen.csail.mit.edu/scicache/270/scimakelatex.26087.Ali+Veli.Veli+Ali.Vel+Al.html
TITLE: Exploration of the Location-Identity Split
GENERATED SUMMARY: ......
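For readers who want to see what a TextRank-style ranking does without installing an old gensim release, here is a small pure-Python sketch. It is not gensim's implementation, which is documented as a variation of TextRank. The function names, the naive sentence splitter, the word-overlap similarity and the fixed iteration count are all assumptions made for illustration.

```python
import math
import re


def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]


def similarity(a, b):
    # Word-overlap similarity in the spirit of TextRank, using distinct words.
    wa = set(re.findall(r'\w+', a.lower()))
    wb = set(re.findall(r'\w+', b.lower()))
    if len(wa) < 2 or len(wb) < 2:
        return 0.0  # skip very short sentences; also avoids a zero denominator
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def textrank_scores(sentences, damping=0.85, iterations=50):
    n = len(sentences)
    # Weighted graph: one node per sentence, edge weight = similarity.
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # Weighted PageRank update, repeated a fixed number of times.
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    return scores


def summarize_sketch(text, ratio=0.2):
    sentences = split_sentences(text)
    scores = textrank_scores(sentences)
    keep = max(1, int(len(sentences) * ratio))  # assumption: keep at least one
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:keep]
    # Return the selected sentences in document order.
    return [sentences[i] for i in sorted(top)]


sample = "Alpha beta gamma. Beta gamma delta. Zeta eta theta."
print(summarize_sketch(sample, ratio=0.67))
```

Sentences that share vocabulary with many others receive higher scores, while a sentence with no overlap stays at the baseline value `1 - damping`.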

Original address: https://www.cnblogs.com/peng8098/p/nlp_11.html
