• 用PDFMiner从PDF中提取文本文字


    1、下载并安装PDFMiner

      从https://pypi.python.org/pypi/pdfminer/下载PDFMineer

    wget https://pypi.python.org/packages/57/4f/e1df0437858188d2d36466a7bb89aa024d252bd0b7e3ba90cbc567c6c0b8/pdfminer-20140328.tar.gz#md5=dfe3eb1b7b7017ab514aad6751a7c2ea

      加压并安装

    tar -zxvf pdfminer-20140328.tar.gz
    cd pdfminer-20140328/
    make cmap  #防止中文乱码,否则处理中文会出现一大堆(CID:xxx)
    sudo python setup.py install

    2、提取文本文字

    from cStringIO import StringIO
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfpage import PDFPage
    import sys
    import string
    
    def convert_pdf_2_text(path):
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        device = TextConverter(rsrcmgr, retstr, codec='utf-8', laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        with open(path, 'rb') as fp:
            for page in PDFPage.get_pages(fp, set()):
                interpreter.process_page(page)
            text = retstr.getvalue()
        device.close()
        retstr.close()
        return text
    
    text = convert_pdf_2_text(sys.argv[1])
    open('real?.txt','wb').write(text)

    3、测试结果

    【1】http://www.unixuser.org/~euske/python/pdfminer/#source

    【2】https://www.zhihu.com/question/31586273

  • 相关阅读:
    【POJ2176】Folding
    【NOIP2018】赛道修建
    优雅的文本编辑器——Sublime Text 3的搭建与使用
    【NOIP2010】乌龟棋
    【POJ3585】Accumulation Degree
    【POJ3322】Bloxorz I
    python之路_常用模块介绍
    python之路_正则表达式及re模块
    python之路_内置函数及匿名函数
    python之路_递归函数及实例讲解
  • 原文地址:https://www.cnblogs.com/vincent-vg/p/6827031.html
Copyright © 2020-2023  润新知