import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.poi.hwpf.extractor.WordExtractor;

// Reads a Word .doc file and returns its text with field codes stripped.
public String readDoc(File file) {
    StringBuilder buffer = new StringBuilder();
    // try-with-resources closes the stream even if extraction fails
    try (InputStream input = new FileInputStream(file)) {
        WordExtractor extractor = new WordExtractor(input);
        for (String paragraph : extractor.getParagraphText()) {
            // The separator is garbled in the source ("\ \ "); "\r\n" is assumed here
            buffer.append(WordExtractor.stripFields(paragraph)).append("\r\n");
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return buffer.toString();
}
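Method for stripping field codes: extractor.stripFields(paragraph); (stripFields is static, so WordExtractor.stripFields(paragraph) is equivalent)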
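To show what stripping fields means, here is a minimal plain-Java sketch. It is not POI's implementation. It assumes the usual Word field layout, where a field starts with the control character 0x13, an optional 0x14 separates the field code from the displayed result, and 0x15 ends the field. The class and method names (FieldStripSketch, stripFieldCodes) are hypothetical, and nested fields are not handled.

```java
class FieldStripSketch {
    // Word field control characters (assumed layout: BEGIN code SEPARATOR result END)
    private static final char FIELD_BEGIN = '\u0013';
    private static final char FIELD_SEPARATOR = '\u0014';
    private static final char FIELD_END = '\u0015';

    // Drops the field code and the three marker characters, keeps the displayed result.
    // Hypothetical helper for illustration only; nested fields are not supported.
    static String stripFieldCodes(String text) {
        StringBuilder out = new StringBuilder();
        boolean inCode = false; // true while between FIELD_BEGIN and SEPARATOR/END
        for (char c : text.toCharArray()) {
            if (c == FIELD_BEGIN) {
                inCode = true;
            } else if (c == FIELD_SEPARATOR || c == FIELD_END) {
                inCode = false;
            } else if (!inCode) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "See \u0013 HYPERLINK \"http://example.com\" \u0014link\u0015 here";
        System.out.println(stripFieldCodes(raw));
    }
}
```

With a real .doc file, the POI call above does this work, so a helper like this is only useful for understanding what gets removed.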
Article on extracting document content (Excel, PDF, Word, ...):
http://blog.sina.com.cn/s/blog_67b9ad8d01010bwa.html
Article where the problem came up: