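1. Installing PyPDF2 and pdfplumber: library overview
PyPDF2 is well suited to reading, writing, splitting, and merging PDF files.
pdfplumber is better suited to reading the content of PDF files and extracting tables from them.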
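Before running the examples, it can help to confirm that the libraries are importable. The sketch below is one way to do that from Python. The helper name `missing_packages` is made up for illustration, and it only checks whether a module can be found, not which version is installed.

```python
import importlib.util


def missing_packages(names):
    """Return the module names from `names` that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# openpyxl is included because the Excel example in section 3 relies on it
missing = missing_packages(["PyPDF2", "pdfplumber", "openpyxl"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All libraries are available.")
```

Anything reported as missing can usually be installed with pip, for example `pip install pdfplumber`.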
2. Extracting text with pdfplumber
import pdfplumber

with pdfplumber.open("python.pdf") as f:
    page = f.pages[0]  # choose which page to open (index 0 is the first page)
    print(page.extract_text())  # extract the text on the page
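The snippet above reads a single page. A minimal sketch of extending it to a whole document follows, assuming that `extract_text()` may return `None` or an empty string for pages without a text layer (scanned pages, for instance). The helper names `join_page_text` and `extract_all_text` are illustrative and not part of pdfplumber.

```python
def join_page_text(pages, separator="\n\n"):
    """Concatenate the text of every page, skipping pages that yield no text.

    `pages` is any iterable of objects with an extract_text() method,
    such as the `pages` list of an open pdfplumber document.
    """
    chunks = []
    for page in pages:
        text = page.extract_text()
        if text:  # None or "" means nothing was extracted from this page
            chunks.append(text)
    return separator.join(chunks)


def extract_all_text(pdf_path):
    """Open `pdf_path` with pdfplumber and return the text of all pages."""
    import pdfplumber  # imported here so join_page_text stays usable without it

    with pdfplumber.open(pdf_path) as pdf:
        return join_page_text(pdf.pages)
```

With the file from the example above, `print(extract_all_text("python.pdf"))` would print the text of every page, separated by blank lines.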
3. Extracting a table with pdfplumber and writing it to Excel
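# extract_table(): use when the page contains a single table
# extract_tables(): use when the page contains several tables
import pdfplumber
from openpyxl import Workbook

with pdfplumber.open("python.pdf") as f:
    page = f.pages[0]
    table = page.extract_table()  # a list of rows, or None if no table is found

workbook = Workbook()
sheet = workbook.active  # the active sheet belongs to the instance, not to the Workbook class
for row in table or []:
    sheet.append(row)
workbook.save(filename="new_pdf.xlsx")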
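When a page holds several tables, `extract_tables()` returns a list of tables, each of which is a list of rows. One way to handle that case is sketched below, assuming each table should go to its own worksheet and that empty cells come back as `None`. The helper names `clean_table` and `save_tables_to_excel` and the sheet titles `table_1`, `table_2`, and so on are illustrative choices, not part of either library.

```python
def clean_table(table):
    """Replace None cells with "" and drop rows that are entirely empty."""
    cleaned = []
    for row in table:
        cells = ["" if cell is None else cell for cell in row]
        if any(cell != "" for cell in cells):
            cleaned.append(cells)
    return cleaned


def save_tables_to_excel(pdf_path, xlsx_path, page_index=0):
    """Write every table found on one page to its own worksheet.

    Returns the number of tables written.
    """
    # Imported lazily so that clean_table can be used without these packages
    import pdfplumber
    from openpyxl import Workbook

    with pdfplumber.open(pdf_path) as pdf:
        tables = pdf.pages[page_index].extract_tables()

    workbook = Workbook()
    for number, table in enumerate(tables, start=1):
        # Reuse the default sheet for the first table, then add new sheets
        sheet = workbook.active if number == 1 else workbook.create_sheet()
        sheet.title = f"table_{number}"
        for row in clean_table(table):
            sheet.append(row)
    workbook.save(filename=xlsx_path)
    return len(tables)
```

A call such as `save_tables_to_excel("python.pdf", "new_pdf.xlsx")` would then cover both the single-table and the multi-table case. If the page has no tables, the workbook is saved with its default empty sheet.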
4. Merging PDFs, and reordering and rotating pages
4.1 Merging PDFs
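import os
from PyPDF2 import PdfFileReader, PdfFileWriter  # legacy names; PyPDF2 3.0+ uses PdfReader/PdfWriter

# Raw string, so that "\6" and "\7" are not read as escape sequences
pdf_dir = r"G:\6Tipdm\7python 办公自动化\concat_pdf"

pdf_writer = PdfFileWriter()
for i in range(1, len(os.listdir(pdf_dir)) + 1):
    print(i * 50 + 1, (i + 1) * 50)
    # Input files are named by page range, e.g. 51-100.pdf for i = 1
    name = "{}-{}.pdf".format(i * 50 + 1, (i + 1) * 50)
    pdf_reader = PdfFileReader(os.path.join(pdf_dir, name))
    for page in range(pdf_reader.getNumPages()):
        pdf_writer.addPage(pdf_reader.getPage(page))
with open(os.path.join(pdf_dir, "merge.pdf"), "wb") as out:
    pdf_writer.write(out)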
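The loop above builds each file name from the index `i`, so it depends on the folder containing exactly the expected files. A hedged alternative is to read the names that are actually present and sort them by their starting page number. The sketch below assumes names of the form `start-end.pdf` (as in `51-100.pdf`) and PyPDF2 2.0 or later, where `PdfReader`, `PdfWriter`, `reader.pages`, and `add_page` replace the older camel-case names. `sorted_chunk_names` and `merge_chunks` are illustrative helpers.

```python
import os
import re

# Matches names such as "51-100.pdf"
CHUNK_NAME = re.compile(r"^(\d+)-(\d+)\.pdf$", re.IGNORECASE)


def sorted_chunk_names(names):
    """Keep names shaped like '51-100.pdf' and sort them by starting page number."""
    matched = []
    for name in names:
        match = CHUNK_NAME.match(name)
        if match:
            matched.append((int(match.group(1)), name))
    return [name for _, name in sorted(matched)]


def merge_chunks(pdf_dir, output_path):
    """Merge all 'start-end.pdf' files in `pdf_dir` into `output_path`."""
    from PyPDF2 import PdfReader, PdfWriter  # assumes PyPDF2 >= 2.0

    writer = PdfWriter()
    for name in sorted_chunk_names(os.listdir(pdf_dir)):
        reader = PdfReader(os.path.join(pdf_dir, name))
        for page in reader.pages:
            writer.add_page(page)
    with open(output_path, "wb") as out:
        writer.write(out)
```

Because files that do not match the pattern are ignored, a previously written `merge.pdf` in the same folder is not merged into itself on a second run.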
4.2 Splitting a PDF
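import os
from PyPDF2 import PdfFileReader, PdfFileWriter  # legacy names; PyPDF2 3.0+ uses PdfReader/PdfWriter

pdf_dir = r"G:\6Tipdm\7python 办公自动化\concat_pdf"  # raw string keeps the backslashes literal
pdf_reader = PdfFileReader(os.path.join(pdf_dir, "时间序列.pdf"))
for page in range(pdf_reader.getNumPages()):
    pdf_writer = PdfFileWriter()  # a fresh writer per page, so each output file holds one page
    pdf_writer.addPage(pdf_reader.getPage(page))
    with open(os.path.join(pdf_dir, f"{page}.pdf"), "wb") as out:
        pdf_writer.write(out)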
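Writing one page per file is the simplest way to split. The counterpart of the merge in 4.1 would be to cut a document into fixed-size chunks. A minimal sketch follows, assuming chunks of 50 pages named `start-end.pdf` and PyPDF2 2.0 or later. `page_ranges` and `split_into_chunks` are hypothetical helper names.

```python
import os


def page_ranges(total_pages, chunk_size):
    """Return zero-based, end-exclusive (start, stop) pairs covering all pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


def split_into_chunks(pdf_path, output_dir, chunk_size=50):
    """Write `pdf_path` to `output_dir` as files of at most `chunk_size` pages."""
    from PyPDF2 import PdfReader, PdfWriter  # assumes PyPDF2 >= 2.0

    reader = PdfReader(pdf_path)
    for start, stop in page_ranges(len(reader.pages), chunk_size):
        writer = PdfWriter()
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        # File names use one-based page numbers, e.g. 1-50.pdf
        out_path = os.path.join(output_dir, f"{start + 1}-{stop}.pdf")
        with open(out_path, "wb") as out:
            writer.write(out)
```

The last chunk is simply shorter when the page count is not a multiple of `chunk_size`, so a 120-page file would produce `1-50.pdf`, `51-100.pdf`, and `101-120.pdf`.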