Parsing Word files with Tika

Apache POI - HWPF and XWPF - Java API to Handle Microsoft Word Files

Parsing .doc files

Requires the poi-scratchpad jar, version 3.7.

POI-HWPF - A Quick Guide

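Basic text extraction

Use org.apache.poi.hwpf.extractor.WordExtractor. Its constructor accepts either an InputStream or an HWPFDocument.

getText() returns the text of the whole document.

getParagraphText() returns the text of each paragraph in turn, as a String array.

getTextFromPieces() reads the text straight from the document's text pieces. It is very fast, but tends to return things that are not text from the page.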
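The extractor calls described above could be used roughly as follows. This is a minimal sketch, assuming poi-scratchpad is on the classpath and that the path to a .doc file is passed as the first command-line argument. The class name WordExtractorSketch and the helper cleanParagraphs are hypothetical names introduced only for illustration.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.hwpf.extractor.WordExtractor;

public class WordExtractorSketch {

    // Hypothetical helper: trims each paragraph and drops the empty ones,
    // because the strings from getParagraphText() can still carry line endings.
    public static List<String> cleanParagraphs(String[] paragraphs) {
        List<String> cleaned = new ArrayList<String>();
        for (String p : paragraphs) {
            String t = p.trim();
            if (!t.isEmpty()) {
                cleaned.add(t);
            }
        }
        return cleaned;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: WordExtractorSketch <file.doc>");
            return;
        }
        // Assumption: args[0] is the path to a binary .doc file.
        InputStream in = new FileInputStream(args[0]);
        try {
            WordExtractor extractor = new WordExtractor(in);
            // Whole document as one string.
            System.out.println(extractor.getText());
            // One entry per paragraph.
            for (String p : cleanParagraphs(extractor.getParagraphText())) {
                System.out.println("Paragraph: " + p);
            }
        } finally {
            in.close();
        }
    }
}
```

Trimming is a design choice of this sketch, not something WordExtractor requires; drop the helper if the raw paragraph strings are needed.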

Extracting specific text attributes

To get specific bits of text, first create an org.apache.poi.hwpf.HWPFDocument. Fetch the range with getRange(), then get paragraphs from that. You can then get text and other properties.

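Step 1: create an HWPFDocument

Step 2: get the Range

getRange(): Returns the range which covers the whole of the document, but excludes any headers and footers.

int numParagraphs(): Used to get the number of paragraphs in a range.

int numSections(): Used to get the number of sections in a range (a "section" here is a Word section, as created by inserting a section break).

Step 3: get the paragraphs

getParagraph(int index): returns the paragraph at the given index within the range.

text(): returns the text of the paragraph.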

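import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.CharacterRun;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;

public class ParseWordDocTest {

    public static void main(String[] args) throws Exception {
        // HWPFDocument reads only the binary .doc format; a .docx file must be
        // opened with XWPFDocument instead.
        InputStream istream = new FileInputStream(
                "e:\\Users\\ywf\\Desktop\\文本校对\\1.doc");
        try {
            HWPFDocument doc = new HWPFDocument(istream);
            // The range covers the whole document, excluding headers and footers.
            Range range = doc.getRange();
            for (int i = 0; i < range.numParagraphs(); i++) {
                Paragraph poiPara = range.getParagraph(i);
                // Visit every character run (a span of uniformly formatted text).
                for (int j = 0; j < poiPara.numCharacterRuns(); j++) {
                    CharacterRun run = poiPara.getCharacterRun(j);
                    System.out.println("Color " + run.getColor()); // color index
                    System.out.println("Font size " + run.getFontSize()); // in half-points
                    System.out.println("Font Name " + run.getFontName());
                    // bold, italic, underline code
                    System.out.println(run.isBold() + " " + run.isItalic() + " "
                            + run.getUnderlineCode());
                    System.out.println("Text is " + run.text());
                }
            }
        } finally {
            istream.close();
        }
    }
}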

Parsing .docx files

Requires the poi-ooxml jar, version 3.7.

http://poi.apache.org/document/quick-guide-xwpf.html
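Basic text extraction for .docx can follow the same pattern as for .doc, using org.apache.poi.xwpf.extractor.XWPFWordExtractor. The following is a minimal sketch, assuming poi-ooxml and its dependencies are on the classpath. To avoid depending on an input file it first builds a one-paragraph document in memory. The class name XwpfExtractSketch and both helper methods are hypothetical and exist only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFRun;

public class XwpfExtractSketch {

    // Builds a one-paragraph .docx in memory and returns its bytes.
    public static byte[] buildSample(String text) throws Exception {
        XWPFDocument doc = new XWPFDocument();
        XWPFRun run = doc.createParagraph().createRun();
        run.setText(text);
        run.setBold(true);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        doc.write(out);
        return out.toByteArray();
    }

    // Opens .docx bytes and returns the plain text of the whole document.
    public static String extractText(byte[] docx) throws Exception {
        XWPFDocument doc = new XWPFDocument(new ByteArrayInputStream(docx));
        XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
        return extractor.getText();
    }

    public static void main(String[] args) throws Exception {
        String text = extractText(buildSample("Hello from XWPF"));
        System.out.println(text);
    }
}
```

For a real file, pass a FileInputStream to the XWPFDocument constructor instead of the in-memory bytes.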

package test;

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.List;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

public class ParseWordDocxTest {

    /**
     * Prints the formatting and text of every run in a .docx file.
     *
     * @param args unused
     * @throws Exception if the file cannot be opened or parsed
     */
    public static void main(String[] args) throws Exception {
        InputStream istream = new FileInputStream(
                "e:\\Users\\ywf\\Desktop\\文本校对\\1.docx");
        try {
            XWPFDocument docx = new XWPFDocument(istream);
            List<XWPFParagraph> paragraphs = docx.getParagraphs();
            for (XWPFParagraph para : paragraphs) {
                List<XWPFRun> runs = para.getRuns();
                for (XWPFRun r : runs) {
                    // Color and font family are null when the run does not set them.
                    System.out.println("Font color: " + r.getColor());
                    System.out.println("Font name: " + r.getFontFamily());
                    System.out.println("Font size: " + r.getFontSize());
                    // getText(0) reads the first text element of the run.
                    System.out.println("Text: " + r.getText(0));
                    System.out.println("Bold? " + r.isBold());
                    System.out.println("Italic? " + r.isItalic());
                }
            }
        } finally {
            istream.close();
        }
    }

}
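HWPFDocument reads only the binary .doc format and XWPFDocument reads only .docx, so handing a file to the wrong class fails. The following is a minimal sketch, using only the JDK, of how the two formats could be told apart by their leading bytes: a .doc file is an OLE2 compound document and a .docx file is a ZIP archive. The class and method names are hypothetical, and the check identifies only the container format, not whether the content is really a Word document.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class WordFormatSniffer {

    // First four bytes of an OLE2 compound document (the container of .doc).
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0};
    // First four bytes of a ZIP archive (the container of .docx).
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    // Returns "doc", "docx" or "unknown" based on the first four bytes.
    // Note: this consumes those bytes from the stream.
    public static String sniff(InputStream in) throws IOException {
        int[] head = new int[4];
        for (int i = 0; i < head.length; i++) {
            head[i] = in.read(); // -1 at end of stream never matches a magic byte
        }
        if (matches(head, OLE2_MAGIC)) {
            return "doc";
        }
        if (matches(head, ZIP_MAGIC)) {
            return "docx";
        }
        return "unknown";
    }

    private static boolean matches(int[] head, int[] magic) {
        for (int i = 0; i < magic.length; i++) {
            if (head[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] ole2 = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0};
        byte[] zip = {0x50, 0x4B, 0x03, 0x04};
        System.out.println(sniff(new ByteArrayInputStream(ole2))); // prints "doc"
        System.out.println(sniff(new ByteArrayInputStream(zip)));  // prints "docx"
    }
}
```

Because sniff consumes the bytes it inspects, a caller would need to reopen the file (or use a stream that supports mark/reset) before passing it to HWPFDocument or XWPFDocument.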

Original article: https://www.cnblogs.com/yuwenfeng/p/3624937.html