• Python web scraping: reading PDFs


    The code below reads a PDF with Python, either from a local file or from a URL.

    pdfminer download: https://pypi.python.org/packages/source/p/pdfminer/pdfminer-20140328.tar.gz

    #!/usr/bin/python
    # -*- encoding:utf-8 -*-
    # Note: this script targets Python 2 (urllib2, cStringIO, print statement).

    from urllib2 import urlopen
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfpage import PDFPage
    from cStringIO import StringIO

    def convert_pdf_to_txt(fp):
        """Extract the text of every page from an open PDF file object."""
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()  # buffer that collects the extracted text
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        maxpages = 0      # 0 means no page limit
        caching = True
        pagenos = set()   # an empty set means all pages
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                      caching=caching, check_extractable=True):
            interpreter.process_page(page)

        fp.close()
        device.close()
        textstr = retstr.getvalue()
        retstr.close()
        return textstr

    # Read a PDF from the network
    url = 'http://pythonscraping.com/pages/warandpeace/chapter1.pdf'
    fp = StringIO(urlopen(url).read())

    # Read a local PDF instead
    # path = 'chapter1.pdf'
    # fp = file(path, 'rb')

    text = convert_pdf_to_txt(fp)
    print text
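
The listing above relies on names that exist only in Python 2 (`urllib2`, `cStringIO`, `file()`). As a rough sketch of the same two input paths on Python 3, the snippet below opens a PDF either from a URL or from a local path using only the standard library. The helper names `open_pdf_source` and `looks_like_pdf` are hypothetical and are not part of pdfminer; the `%PDF-` check is only a quick sanity test on the file header, not a validation of the whole file.

```python
import io
from urllib.request import urlopen


def open_pdf_source(source):
    """Return a seekable binary file object for a URL or a local path.

    Hypothetical helper: it mirrors the two ways of building `fp`
    in the original listing, ported to Python 3.
    """
    if source.startswith(("http://", "https://")):
        # Download the whole PDF into memory so it can be read like a file.
        with urlopen(source) as response:
            return io.BytesIO(response.read())
    # Otherwise treat the argument as a local path and open it in binary mode.
    return open(source, "rb")


def looks_like_pdf(fp):
    """Check the leading magic bytes, then rewind so the stream can be parsed."""
    header = fp.read(5)
    fp.seek(0)
    return header == b"%PDF-"


if __name__ == "__main__":
    # Local-file path from the original listing; swap in the URL to fetch remotely.
    fp = open_pdf_source("chapter1.pdf")
    print(looks_like_pdf(fp))
    fp.close()
```

The returned object could stand in for `fp` in a Python 3 port of `convert_pdf_to_txt`, assuming a Python 3 compatible build of pdfminer is installed.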
  • Original article: https://www.cnblogs.com/miranda-tang/p/5569439.html