Testing BART for automatic summarization

BART performs well on text generation tasks.
This post tests how the BART model does on automatic summarization.

(1) First install transformers (the script in step (3) also imports jieba for word segmentation)

!pip install transformers jieba --upgrade

(2) The file b3_article.txt holds one long passage of text
Here a Chinese news article is used directly as the input. News source: https://www.tmtpost.com/nictation/4765693.html

钛媒体9月23日消息，今日澳门面向内地居民旅游签注全面开放，携程数据显示澳门各类旅游产品搜索量从22日起开始暴增，最高涨幅500%。预计国庆期间，澳门或将迎来旅游小高峰。

根据携程提供的数据，上海、北京、成都、杭州、厦门成为最热出发地。

从携程数据中可以看出，澳门高星酒店订单预订量环比增加50%。携程预售的澳门酒店成为不少游客首选。游客对酒店要求更高。除了游玩当地景点，品尝特色小吃外。酒店美食套餐、酒店玩乐项目成游客“新宠”。预计游客在酒店待的时间较以往平均增长1.5个小时。酒店休闲，酒店度假，成为主流趋势。

同时携程发布“澳门寻宝计划”，注入优势资源，加入全新玩法，其中澳门威尼斯人、澳门巴黎人、澳门四季酒店、澳门瑞吉酒店、澳门喜来登大酒店是用户预订热门酒店。广州、珠海、澳门、上海、深圳、佛山、中山、北京、香港、东莞等城市的游客是澳门酒店预订的主力军。
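The post does not show how b3_article.txt was created. The following is a minimal sketch of one way to do it, assuming the article is stored one paragraph per line in UTF-8. The helper names `save_article` and `load_article` are hypothetical and used only for illustration; the sample sentences are taken from the article above.

```python
import tempfile
from pathlib import Path


def save_article(paragraphs, path="b3_article.txt"):
    """Write one paragraph per line, UTF-8 encoded, and return the path."""
    target = Path(path)
    target.write_text("\n".join(paragraphs) + "\n", encoding="utf-8")
    return target


def load_article(path="b3_article.txt"):
    """Read the file back and drop line breaks, as the script in step (3) does."""
    return Path(path).read_text(encoding="utf-8").replace("\n", "")


# Usage: write two sample sentences to a temporary directory and read them back.
with tempfile.TemporaryDirectory() as tmp:
    sample = ["预计国庆期间，澳门或将迎来旅游小高峰。", "酒店休闲，酒店度假，成为主流趋势。"]
    article_path = save_article(sample, Path(tmp) / "b3_article.txt")
    restored = load_article(article_path)

print(restored)
```

Passing the encoding explicitly avoids depending on the platform default, which matters for Chinese text on systems whose default encoding is not UTF-8.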

(3) Download and load the model automatically, then generate the summary

import jieba  # Chinese word segmentation, used below via jieba.cut
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

from IPython.display import display, Markdown

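# Run on the GPU when one is available, otherwise on the CPU
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
# Move the model to the same device as the input tensors
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn').to(torch_device)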

# Read the article and drop line breaks
text = open('b3_article.txt', encoding='utf-8').read().replace('\n', '')
text = ' '.join(jieba.cut(text))
print(text)

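# Encode the article as input ids, capped at 1024 tokens.
# Also passing truncation=True would silence the truncation warning seen in the output.
article_input_ids = tokenizer.batch_encode_plus([text], return_tensors='pt', max_length=1024)['input_ids'].to(
    torch_device)
# Beam search with 4 beams; the summary is limited to between 56 and 142 tokens
summary_ids = model.generate(article_input_ids, num_beams=4, length_penalty=2.0, max_length=142, min_length=56,
                             no_repeat_ngram_size=3)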
summary_txt = tokenizer.decode(summary_ids.squeeze(), skip_special_tokens=True)
display(Markdown('> **Summary:** ' + summary_txt))
print(summary_txt)

(4) Output:

/Users/huihui/anaconda3/bin/python /Users/huihui/git/workspace/my_tools/b3.py
Downloading: 100%|██████████| 899k/899k [00:03<00:00, 263kB/s]
Downloading: 100%|██████████| 456k/456k [00:02<00:00, 222kB/s]
Downloading: 100%|██████████| 1.34k/1.34k [00:00<00:00, 298kB/s]
Downloading: 100%|██████████| 1.63G/1.63G [02:46<00:00, 9.78MB/s]

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/_l/wz1852tj5qg496pgg8qj_tlc0000gn/T/jieba.cache
Loading model cost 0.863 seconds.
Prefix dict has been built successfully.
钛 媒体 9 月 23 日 消息 ， 今日 澳门 面向 内地 居民 旅游 签注 全面 开放 ， 携程 数据 显示 澳门 各类 旅游 产品 搜索 量 从 22 日起 开始 暴增 ， 最高 涨幅 500% 。 预计 国庆 期间 ， 澳门 或 将 迎来 旅游 小 高峰 。 根据 携程 提供 的 数据 ， 上海 、 北京 、 成都 、 杭州 、 厦门 成为 最 热 出发地 。 从 携程 数据 中 可以 看出 ， 澳门 高星 酒店 订单 预订 量 环比 增加 50% 。 携程 预售 的 澳门 酒店 成为 不少 游客 首选 。 游客 对 酒店 要求 更高 。 除了 游玩 当地 景点 ， 品尝 特色小吃 外 。 酒店 美食 套餐 、 酒店 玩乐 项目 成 游客 “ 新宠 ” 。 预计 游客 在 酒店 待 的 时间 较 以往 平均 增长 1.5 个 小时 。 酒店 休闲 ， 酒店 度假 ， 成为 主流 趋势 。 同时 携程 发布 “ 澳门 寻宝 计划 ” ， 注入 优势 资源 ， 加入 全新 玩法 ， 其中 澳门 威尼斯人 、 澳门 巴黎 人 、 澳门 四季 酒店 、 澳门 瑞吉 酒店 、 澳门 喜来登 大酒店 是 用户 预订 热门 酒店 。 广州 、 珠海 、 澳门 、 上海 、 深圳 、 佛山 、 中山 、 北京 、 香港 、 东莞 等 城市 的 游客 是 澳门 酒店 预订 的 主力军 。

Truncation was not explicitely activated but `max_length` is provided a specific value, please use `truncation=True` to explicitely truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.

面向   日       旅   签注   酒店  预订  量   环 比   增加 50%   1.5 个  “ ”   “ 澳门’  ” “” ‘’’ “ 特 ’小吃’, “1.5”, ”  “2.5,” 

Process finished with exit code 0
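One rough way to flag degenerate output like the last line of the log above is to measure how much of the text is punctuation. The sketch below is only an illustration: `punctuation_ratio` is a hypothetical helper, not part of transformers, and the two strings are short excerpts copied from the generated summary and from the source article.

```python
def punctuation_ratio(text):
    """Share of non-whitespace characters that are neither letters nor digits."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(1 for ch in visible if not ch.isalnum()) / len(visible)


# Excerpt of the generated summary from the log above
generated = "酒店  预订  量   环 比   增加 50%   1.5 个  “ ”   “ 澳门’  ” “”"
# Excerpt of the source article
source_excerpt = "钛媒体9月23日消息，今日澳门面向内地居民旅游签注全面开放，"

print("generated:", round(punctuation_ratio(generated), 2))
print("source:   ", round(punctuation_ratio(source_excerpt), 2))
```

A generated summary whose ratio is far above that of its source is a hint that the model is emitting stray quotes and symbols rather than sentences. This is a heuristic, not a substitute for reading the output.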

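(5) Conclusion
As the output shows, BART (here the facebook/bart-large-cnn checkpoint) cannot be used directly for Chinese summarization: the generated "summary" is an incoherent string of word fragments and quotation marks.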
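A likely contributing factor, offered here as an assumption rather than something this test verified: facebook/bart-large-cnn was fine-tuned on English news, and BART's tokenizer uses a byte-level BPE vocabulary. Every Chinese character occupies three bytes in UTF-8, so the model sees Chinese as byte sequences it rarely encountered in training. The sketch below only illustrates the byte counts; `utf8_byte_counts` is a hypothetical helper and does not call the tokenizer.

```python
def utf8_byte_counts(text):
    """Pair each character with the number of bytes it takes in UTF-8."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


print(utf8_byte_counts("澳门"))   # [('澳', 3), ('门', 3)]
print(utf8_byte_counts("Macau"))  # [('M', 1), ('a', 1), ('c', 1), ('a', 1), ('u', 1)]
```

Inserting spaces between words with jieba does not change this, since the characters themselves are still encoded the same way.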

Original post: https://www.cnblogs.com/xuehuiping/p/13719531.html
