5.3.3 Detecting the document's language
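If you don't know the language of your documents or queries (which is the case most of the time), you can use language detection software, which can identify that language to a reasonable degree. If you use Java, you can pick one of several available language detection libraries, for example:
Apache Tika (http://tika.apache.org/);
Language detection (http://code.google.com/p/language-detection/).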
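The Language detection library claims to support 53 languages with 99% accuracy, which is quite a lot.
Keep in mind that the longer the text, the more accurate language detection becomes. Query text, however, is usually very short, so you should expect a certain number of errors when identifying the language of queries.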
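The effect of text length on detection can be seen even in a toy detector. The sketch below is not the algorithm used by Apache Tika or the Language detection library; it is a deliberately naive illustration that counts hits against tiny, hand-picked stop-word lists. The class name ToyLanguageGuesser, the method guess, and the word lists are all hypothetical, and only the Java standard library (Java 9 or later) is assumed.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical toy detector: NOT how Tika or language-detection work internally.
public class ToyLanguageGuesser {

    // Tiny illustrative stop-word lists; real detectors use far richer models.
    private static final Map<String, Set<String>> STOP_WORDS = new LinkedHashMap<>();
    static {
        STOP_WORDS.put("en", Set.of("the", "and", "is", "of", "to", "in"));
        STOP_WORDS.put("de", Set.of("der", "die", "und", "ist", "das", "nicht"));
        STOP_WORDS.put("fr", Set.of("le", "la", "et", "est", "les", "des"));
    }

    // Returns the language code with the most stop-word hits, or "unknown" if there are none.
    public static String guess(String text) {
        // Split on anything that is not a letter, after lower-casing.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : STOP_WORDS.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) {
                    hits++;
                }
            }
            // Strictly greater: on a tie, the language listed first wins.
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // A full sentence gives the detector plenty of evidence.
        System.out.println(guess("the cat is in the garden and it is happy")); // en
        // A two-word English query is misread because "die" is a German stop word.
        System.out.println(guess("die hard")); // de
        // A short query with no stop words gives no evidence at all.
        System.out.println(guess("elasticsearch server")); // unknown
    }
}
```

The two short inputs show the failure modes to expect with queries: too little evidence, or a single misleading token that decides the result.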
Apache Tika - a content analysis toolkit
The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more. You can find the latest release on the download page. Please see the Getting Started page for more information on how to start using Tika.
The Parser and Detector pages describe the main interfaces of Tika and how they work.
If you're interested in contributing to Tika, please see the Contributing page or send an email to the Tika development list.
Tika is a project of the Apache Software Foundation, and was formerly a subproject of Apache Lucene.
The Parser interface
The org.apache.tika.parser.Parser interface is the key concept of Apache Tika. It hides the complexity of different file formats and parsing libraries while providing a simple and powerful mechanism for client applications to extract structured text content and metadata from all sorts of documents. All this is achieved with a single method:
void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException;
The parse method takes the document to be parsed and related metadata as input, and outputs the results as XHTML SAX events and extra metadata. The parse context argument is used to specify context information (like the current locale) that is not related to any individual document.
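Because the results arrive as SAX events, a client application receives them through an org.xml.sax.ContentHandler. The sketch below does not call Tika at all. As an assumption made purely for illustration, it feeds a handwritten XHTML string through the JDK's own SAX parser to show what a handler that collects text content might look like. The class name TextCollector and the method extractText are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the character data carried by SAX events.
public class TextCollector extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();

    // Called by the SAX source for every run of character data.
    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    public String getText() {
        return text.toString();
    }

    // Stand-in for a real parser: replays an XHTML string as SAX events (JDK only).
    public static String extractText(String xhtml) throws Exception {
        TextCollector handler = new TextCollector();
        SAXParserFactory.newInstance()
                .newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getText();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p>Hello, Tika</p></body></html>";
        System.out.println(extractText(xhtml)); // Hello, Tika
    }
}
```

A handler of this shape could in principle be passed as the handler argument of parse, since DefaultHandler implements ContentHandler; in that case the events would come from the parsed document rather than from a literal string.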