Language detection libraries

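5.3.3 Detecting the language of a document
If you do not know the language of a document or query (which is usually the case), you can use
language detection software to identify it with a reasonable degree of accuracy. If you work in
Java, you can use one of several available language detection libraries, for example:
- Apache Tika (http://tika.apache.org/);
- Language detection (http://code.google.com/p/language-detection/).
The Language detection library claims to support 53 languages with 99% precision, which is broad coverage.
Keep in mind that the longer the text, the more accurate the detection. Because query text is
usually short, you should expect a certain error rate when identifying the language of a query.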
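
To see why short queries are harder to classify than long documents, here is a minimal sketch of a toy detector that counts stop-word hits per language. It is not the algorithm used by Apache Tika or the Language detection library. The class name `ToyLanguageDetector`, the three tiny word lists, and the sample sentences are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy stop-word detector for illustration only; real libraries use far richer models. */
final class ToyLanguageDetector {
    // Hypothetical, deliberately tiny stop-word profiles.
    private static final Map<String, Set<String>> PROFILES = new LinkedHashMap<>();
    static {
        PROFILES.put("en", new HashSet<>(Arrays.asList("the", "and", "is", "of", "to", "in")));
        PROFILES.put("de", new HashSet<>(Arrays.asList("der", "die", "und", "ist", "das", "nicht")));
        PROFILES.put("fr", new HashSet<>(Arrays.asList("le", "la", "et", "est", "les", "des")));
    }

    /** Returns the language with the most stop-word hits, or "unknown" if nothing matches. */
    static String detect(String text) {
        // Split on anything that is not a letter, so punctuation and spaces separate tokens.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> profile : PROFILES.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (profile.getValue().contains(token)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = profile.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Long sentences contain several stop words, so there is evidence to work with.
        System.out.println(detect("The weather is nice and the sky is clear in the morning"));
        System.out.println(detect("Das Wetter ist schön und die Sonne scheint nicht"));
        // A typical two-word query contains no stop words at all.
        System.out.println(detect("tika download"));
    }
}
```

The two full sentences give the detector several matching words each, while the two-word query gives it none. Production libraries use stronger statistical models, but they face the same shortage of evidence on short input.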

Apache Tika - a content analysis toolkit

The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more. You can find the latest release on the download page. Please see the Getting Started page for more information on how to start using Tika.

The Parser and Detector pages describe the main interfaces of Tika and how they work.

If you're interested in contributing to Tika, please see the Contributing page or send an email to the Tika development list.

Tika is a project of the Apache Software Foundation, and was formerly a subproject of Apache Lucene.

The Parser interface

The org.apache.tika.parser.Parser interface is the key concept of Apache Tika. It hides the complexity of different file formats and parsing libraries while providing a simple and powerful mechanism for client applications to extract structured text content and metadata from all sorts of documents. All this is achieved with a single method:
```
void parse(
    InputStream stream, ContentHandler handler, Metadata metadata,
    ParseContext context) throws IOException, SAXException, TikaException;
```
The parse method takes the document to be parsed and related metadata as input and outputs the results as XHTML SAX events and extra metadata. The parse context argument is used to specify context information (like the current locale) that is not related to any individual document.
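
The consumer side of this contract can be sketched with the JDK alone: a SAX handler that receives XHTML events and collects the text. Tika is not used here. The JDK SAX parser, fed a hard-coded XHTML string, stands in for a Tika parser, and the names `XhtmlTextCollector` and `collect` are hypothetical. Because `DefaultHandler` implements `ContentHandler`, a handler of this kind is the sort of object one would pass as the second argument of `parse`.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Collects text and counts paragraphs from a stream of XHTML SAX events. */
final class XhtmlTextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int paragraphs;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The default JDK factory is not namespace-aware, so the tag name arrives in qName.
        if ("p".equals(qName)) {
            paragraphs++;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Separate paragraphs with a space so their text does not run together.
        if ("p".equals(qName)) {
            text.append(' ');
        }
    }

    String text() {
        return text.toString().trim();
    }

    int paragraphs() {
        return paragraphs;
    }

    /** Stand-in for a real document parser: pushes the given XHTML through the JDK SAX parser. */
    static XhtmlTextCollector collect(String xhtml) throws Exception {
        XhtmlTextCollector handler = new XhtmlTextCollector();
        InputStream in = new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8));
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        XhtmlTextCollector h = collect("<html><body><p>Hello</p><p>Tika</p></body></html>");
        System.out.println(h.paragraphs() + " paragraphs: " + h.text());
    }
}
```

Delivering results as events means the client decides what to keep (plain text, links, structure) without the parser having to build a full document tree in memory.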
Original article: https://www.cnblogs.com/rsapaper/p/9848968.html
