python中使用docx库操作word文档记录（1）- 读取文本和表格

python中使用docx库操作word文档记录（1）- 读取文本和表格
python中使用docx库操作word文档记录（1）- 读取文本和表格

本文记录docx库读取word文本和表格的方法

一、使用docx模块

Python可以利用python-docx模块处理word文档，处理方式是面向对象的。也就是说python-docx模块会把word文档，文档中的段落、文本、字体等都看做对象，对对象进行处理就是对word文档的内容处理。

安装方法为：pip install python-docx

二、相关概念

先了解python-docx模块的几个概念。

1，Document对象，表示一个word文档。
2，Paragraph对象，表示word文档中的一个段落
3，Paragraph对象的text属性，表示段落中的文本内容。

三、读取文本
```
from docx import Document #导入库

path = 'ys.docx' #文件路径
wordfile = Document(path) #读入文件

paragraphs = wordfile.paragraphs 

#输出每一段的内容
for paragraph in wordfile.paragraphs: 
    print(paragraph.text +"
 end")

 
#输出段落编号及段落内容
for i in range(len(wordfile.paragraphs)):
 print("第"+str(i)+"段的内容是："+wordfile.paragraphs[i].text)
```
如果不需要获取文本中的空行，则可以增加下面的判断条件：
```
if paragraphs[i].text.strip()!="": # 去空行
或者
if paragraph.text.count("
") == len(paragraph.text): # 去空行
```
四、读取表格
```
from docx import Document #导入库

path = '1.docx' #文件路径
document = Document(path) #读入文件
tables = document.tables #获取文件中的表格集

print(len(tables)) #获取文件中的表格数量

for table in tables:#遍历每一个表格
    for row in table.rows:#从表格第一行开始循环读取表格数据
    	for cell in row.cells:#遍历每一个单元格
    		print(cell.text) #获取单元格的内容
'''  
后面两行也可以用下面的方式
        for j in range(len(row.cells)):
           print(row.cells[j].text)
'''
```
这里要说明一下，不是word里面所有的表格都能正确读取。

如下图：带一个锚，表格外围有虚线的这类表格提取不到，不知道有没有好的方法。

这类的表格一般是pdf或者其他系统生成的表格，并非直接在word中插入的表格。
相关阅读:
[LeetCode] Wiggle Sort
[LeetCode] Perfect Squares
[LeetCode] Minimum Window Substring
[LeetCode] Valid Sudoku
[LeetCode] Sudoku Solver
[LeetCode] First Bad Version
[LeetCode] Find the Celebrity
[LeetCode] Paint Fence
[LeetCode] H-Index II
[LeetCode] H-Index
原文地址：https://www.cnblogs.com/eternalpal/p/14082231.html

python中使用docx库操作word文档记录（1）- 读取文本和表格

一、使用docx模块

二、相关概念

三、读取文本

四、读取表格