• Reading Word documents with PHP on Linux


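    A few days ago I helped a friend solve a technical problem: on Linux, read the content of a Word document, match the text with regular expressions, and assemble the results into SQL for insertion into a database.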

    After going through foreign-language documentation and Google, I settled on the following steps:

    #wget http://www.winfield.demon.nl/linux/antiword-0.37.tar.gz
    #tar zxvf antiword-0.37.tar.gz
    #cd antiword-0.37
    #make
    #make install 

    antiword
    cp /root/bin/*antiword /usr/local/bin/
    mkdir /usr/share/antiword
    cp -R /root/.antiword/* /usr/share/antiword/
    chmod 755 /usr/local/bin/*antiword
    chmod 755 /usr/share/antiword/*
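Before wiring antiword into a web script, it can help to confirm that the copy steps above actually took effect. The following is a minimal sketch, not part of the original write-up: the function name check_antiword is hypothetical, and the default directory simply mirrors the /usr/share/antiword path used in the commands above.

```python
import os
import shutil


def check_antiword(binary="antiword", resource_dir="/usr/share/antiword"):
    """Report whether the binary is on PATH and the resource directory has files."""
    # shutil.which returns None when the binary cannot be found on PATH
    binary_path = shutil.which(binary)
    # The mapping files copied from ~/.antiword should live in resource_dir
    has_resources = os.path.isdir(resource_dir) and bool(os.listdir(resource_dir))
    return {"binary_path": binary_path, "has_resources": has_resources}


if __name__ == "__main__":
    status = check_antiword()
    print("antiword binary:", status["binary_path"] or "not found on PATH")
    print("mapping files present:", status["has_resources"])
```

If the binary is reported as missing, the web server user probably cannot see it either, so it is worth checking this before debugging the PHP side.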

    After installation, if antiword needs to be usable from web pages, run make global_install as root. It can then be called from PHP:

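        <?php
        header("Content-type: text/html; charset=utf-8");

        $filename = 'test.doc';
        // Escape the file name so it is safe to pass to the shell
        #$content = shell_exec('antiword ' . escapeshellarg($filename));
        $content = shell_exec('antiword -mUTF-8 ' . escapeshellarg($filename));

        // Escape the text so characters such as < and & are displayed literally
        echo '<pre>';
        echo htmlspecialchars((string) $content, ENT_QUOTES, 'UTF-8');
        echo '</pre>';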
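The same antiword call can also be made from Python, which may be convenient if the later regex and database steps are written there. This is only a sketch under assumptions: the helper names build_antiword_command and read_doc_text are hypothetical, and the mapping value "UTF-8" simply mirrors the -mUTF-8 flag used in the PHP snippet above. Passing the arguments as a list avoids the shell entirely, so file names with spaces or quotes need no escaping.

```python
import os
import subprocess


def build_antiword_command(doc_path, mapping="UTF-8"):
    """Build the argument list; no shell is involved, so no quoting is needed."""
    # "-m UTF-8" is the separated form of the "-mUTF-8" flag used in the PHP example
    return ["antiword", "-m", mapping, doc_path]


def read_doc_text(doc_path, mapping="UTF-8"):
    """Return the text that antiword extracts from a .doc file."""
    if not os.path.isfile(doc_path):
        raise FileNotFoundError(doc_path)
    result = subprocess.run(
        build_antiword_command(doc_path, mapping),
        capture_output=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    # Replace undecodable bytes instead of failing on an unexpected encoding
    return result.stdout.decode("utf-8", errors="replace")
```

Calling read_doc_text("test.doc") would then return the document text as a string, provided antiword is installed and on PATH.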
    
    For .docx files, the text can be read in Python with python-docx:

    # Usage: python <script_name> <docFilePath>
    # Install the library first: pip install python-docx
    import os
    import sys

    from docx import Document

    # Get the name of the current script
    script_name = os.path.basename(sys.argv[0])
    usage = "\nUsage: python " + script_name + " <docFilePath>"

    if len(sys.argv) != 2:
        print("Warning:\n docx file path is missing" + usage)
        sys.exit(1)
    docx_path = sys.argv[1]
    if not os.path.exists(docx_path):
        print("Warning:\n docx file does not exist" + usage)
        sys.exit(1)

    # Open the document
    document = Document(docx_path)
    # Read the text of each paragraph
    paragraphs = [paragraph.text for paragraph in document.paragraphs]
    # Print the result for inspection; the text can also be processed in other ways
    for text in paragraphs:
        print(text)
    # Read the table contents and print them, one row per line with tab-separated cells
    for table in document.tables:
        for row in table.rows:
            print("\t".join(cell.text for cell in row.cells))
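The regex-matching and database step mentioned at the start is not shown in the original, so the following is only an illustrative sketch. Everything specific in it is an assumption: the "Name: ... Phone: ..." layout, the contacts table, and the helper names are hypothetical, and sqlite3 stands in for whatever database is actually used. It also uses placeholder parameters rather than concatenating SQL strings, which sidesteps quoting problems in the extracted text.

```python
import re
import sqlite3

# Hypothetical layout: records look like "Name: Alice  Phone: 12345"
RECORD_RE = re.compile(r"Name:\s*(?P<name>\S+)\s+Phone:\s*(?P<phone>\d+)")


def extract_records(text):
    """Return a (name, phone) tuple for every match of the pattern in text."""
    return [(m.group("name"), m.group("phone")) for m in RECORD_RE.finditer(text)]


def store_records(conn, records):
    """Insert records using placeholders rather than concatenated SQL strings."""
    conn.execute("CREATE TABLE IF NOT EXISTS contacts (name TEXT, phone TEXT)")
    conn.executemany("INSERT INTO contacts (name, phone) VALUES (?, ?)", records)
    conn.commit()


if __name__ == "__main__":
    sample = "Name: Alice  Phone: 12345\nsome other line\nName: Bob Phone: 67890\n"
    conn = sqlite3.connect(":memory:")
    store_records(conn, extract_records(sample))
    print(conn.execute("SELECT name, phone FROM contacts").fetchall())
```

In practice the text passed to extract_records would come from antiword or from the python-docx paragraphs, and the pattern would be adapted to the real document layout.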
    
  • Original article: https://www.cnblogs.com/lixiuran/p/7101615.html