Data Cleaning (Part 2)

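The goal of cleaning the files is to prepare them for word-frequency counting and keyword extraction.

For that purpose, the most suitable layout is a single file that stores the text of all the papers.

This avoids having to open every file individually during the analysis,

and it also makes the run faster.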

The first step is to open the source files one at a time:

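import re

def open_file(file_path):
    # Read one paper and return its text as a single line.
    with open(file_path, encoding='utf-8') as f:
        lines = [x.strip() for x in f.readlines()]
    txt = " ".join(lines)
    # Assumption: the original pattern '(-s)' was meant to be '-\s' (the
    # backslash appears to have been lost). It removes a hyphen followed by
    # whitespace, rejoining words that were hyphenated across line breaks.
    return re.sub(r'-\s', '', txt)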
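
Once each paper can be read into a single line, combining them into the one file described above might look like the following minimal sketch. The names `read_paper` and `merge_papers`, the `*.txt` pattern, and the one-paper-per-line output layout are assumptions made for illustration; they are not taken from the original post.

```python
import re
from pathlib import Path


def read_paper(file_path):
    """Return one paper's text as a single line (hypothetical helper)."""
    with open(file_path, encoding="utf-8") as f:
        lines = [line.strip() for line in f]
    # Drop empty lines so they do not turn into runs of spaces.
    text = " ".join(line for line in lines if line)
    # Assumption: rejoin words that were hyphenated across line breaks.
    return re.sub(r"-\s", "", text)


def merge_papers(input_dir, output_path):
    """Write every .txt file in input_dir to output_path, one paper per line.

    Returns the number of papers written.
    """
    out = Path(output_path).resolve()
    count = 0
    with open(out, "w", encoding="utf-8") as dst:
        for path in sorted(Path(input_dir).glob("*.txt")):
            if path.resolve() == out:
                continue  # skip the merged file if it sits in the same folder
            dst.write(read_paper(path) + "\n")
            count += 1
    return count
```

With this layout, later steps only need to open the merged file once and can treat each line as one paper.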


Original article: https://www.cnblogs.com/huangmouren233/p/14892084.html