Oracle Text: The Oracle Text Architecture
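I. The main logical steps Oracle Text follows when indexing documents are:
(1) The datastore scans every row of the table being indexed and reads the data in the column. Usually the column data is the document itself, but some datastores treat the column data as a pointer to the document. For example, URL_DATASTORE interprets the column data as a URL.
(2) The filter takes the document data and converts it to a text representation. This is needed when binary documents (such as Word or Acrobat files) are stored. The filter output need not be plain text; it can be a text format such as XML or HTML.
(3) The sectioner takes the filter output and converts it to plain text. Different text formats, including XML and HTML, have different sectioners. Conversion to plain text involves detecting important section tags, removing invisible information, and reformatting the text.
(4) The lexer takes the plain text from the sectioner and splits it into discrete tokens. There are lexers for whitespace-delimited languages as well as specialized lexers for Asian languages, where segmentation is more complex.
(5) The indexing engine takes all the tokens from the lexer, the section offsets from the sectioner, and a list of low-information words called stopwords, and builds an inverted index. The inverted index stores each token together with the documents that contain it.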
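As a rough illustration of how these five stages map onto index configuration, the sketch below names one preference per stage in the PARAMETERS clause of a CONTEXT index. The table html_docs, its column body, and the index name html_docs_idx are hypothetical; the CTXSYS.* names are the system-defined preferences that a standard Oracle Text installation is assumed to provide.

```sql
-- Hypothetical table holding one HTML document per row.
CREATE TABLE html_docs (
  id   NUMBER PRIMARY KEY,
  body CLOB
);

-- One preference per pipeline stage: datastore, filter, sectioner,
-- lexer, and the stoplist consumed by the indexing engine.
CREATE INDEX html_docs_idx ON html_docs(body)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('
    DATASTORE     CTXSYS.DEFAULT_DATASTORE
    FILTER        CTXSYS.NULL_FILTER
    SECTION GROUP CTXSYS.HTML_SECTION_GROUP
    LEXER         CTXSYS.DEFAULT_LEXER
    STOPLIST      CTXSYS.DEFAULT_STOPLIST
  ');

-- The tokens of the inverted index land in the index's $I table.
SELECT token_text FROM dr$html_docs_idx$i;
```

Any stage left out of the parameter string falls back to the system default for that class.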
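The many options of each index are organized into functional groups called "classes".
Each class covers one aspect of the configuration and can be thought of as a question about the document database, for example: datastore, filter, lexer, wordlist, storage, and so on.
Each class has a number of predefined behaviors, called objects. Each object is one possible answer to the class's question, and most objects have attributes. Attributes customize an object, which makes index configuration flexible enough to suit different applications.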
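(1) Storage class
The storage class specifies the tablespace and creation parameters for the database tables and indexes that make up an Oracle Text index. It has a single basic object, BASIC_STORAGE, whose attributes include I_Index_Clause, I_Table_Clause, K_Table_Clause, N_Table_Clause, P_Table_Clause, and R_Table_Clause.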
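A minimal sketch of a storage preference follows. The preference name my_storage and the tablespace text_ts are assumptions made for illustration, and the calling user is assumed to have execute rights on CTX_DDL (for example through the CTXAPP role).

```sql
BEGIN
  CTX_DDL.CREATE_PREFERENCE('my_storage', 'BASIC_STORAGE');
  -- Place the token table ($I) and its b-tree index in a dedicated tablespace.
  CTX_DDL.SET_ATTRIBUTE('my_storage', 'I_TABLE_CLAUSE', 'TABLESPACE text_ts');
  CTX_DDL.SET_ATTRIBUTE('my_storage', 'I_INDEX_CLAUSE', 'TABLESPACE text_ts COMPRESS 2');
  -- Place the keymap table ($K) there as well.
  CTX_DDL.SET_ATTRIBUTE('my_storage', 'K_TABLE_CLAUSE', 'TABLESPACE text_ts');
END;
/
```

The preference takes effect once it is named in the index parameter string, as in PARAMETERS ('STORAGE my_storage').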
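(2) Datastore class
The datastore describes where the text is stored and related information. By default the text is stored directly in the column, and each row of the table represents one complete document. Other datastore locations include separate files and Web pages identified by their URLs. The seven basic objects are: Default_Datastore, Detail_Datastore, Direct_Datastore, File_Datastore, Multi_Column_Datastore, URL_Datastore, and User_Datastore.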
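As one example of a non-default datastore, the sketch below uses MULTI_COLUMN_DATASTORE to index several columns as a single document. The table articles and its columns title, author, and body are hypothetical names chosen for illustration.

```sql
BEGIN
  CTX_DDL.CREATE_PREFERENCE('my_multi', 'MULTI_COLUMN_DATASTORE');
  -- The listed columns are concatenated into one document per row.
  CTX_DDL.SET_ATTRIBUTE('my_multi', 'COLUMNS', 'title, author, body');
END;
/

-- The index is created on one column, but covers all listed columns.
CREATE INDEX articles_idx ON articles(body)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('DATASTORE my_multi');
```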
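(3) Section Group class
A section group is an object that specifies a set of document sections. Sections must be defined before the index can be used to query within them with the WITHIN operator. Sections are defined as part of a section group. There are seven basic objects: AUTO_SECTION_GROUP, BASIC_SECTION_GROUP, HTML_SECTION_GROUP, NEWS_SECTION_GROUP, NULL_SECTION_GROUP, XML_SECTION_GROUP, and PATH_SECTION_GROUP.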
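The following sketch defines a section and then queries inside it with WITHIN. The table pages, its column html, and the names my_html_group and heading are hypothetical.

```sql
BEGIN
  CTX_DDL.CREATE_SECTION_GROUP('my_html_group', 'HTML_SECTION_GROUP');
  -- Text between <H1> and </H1> becomes the zone section "heading".
  CTX_DDL.ADD_ZONE_SECTION('my_html_group', 'heading', 'H1');
END;
/

CREATE INDEX pages_idx ON pages(html)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('FILTER CTXSYS.NULL_FILTER SECTION GROUP my_html_group');

-- Match only documents where the word occurs inside an H1 heading.
SELECT id FROM pages
 WHERE CONTAINS(html, 'oracle WITHIN heading') > 0;
```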
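(4) Wordlist class
The wordlist identifies the language used for the stemming and fuzzy-matching query options of the index. It has a single basic object, BASIC_WORDLIST, whose attributes are: Fuzzy_Match, Fuzzy_Numresults, Fuzzy_Score, Stemmer, Substring_Index, Wildcard_Maxterms, Prefix_Index, Prefix_Max_Length, and Prefix_Min_Length.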
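A wordlist preference might be set up along these lines. The preference name my_wordlist and the attribute values are illustrative assumptions, not recommended settings.

```sql
BEGIN
  CTX_DDL.CREATE_PREFERENCE('my_wordlist', 'BASIC_WORDLIST');
  -- Language used for stemming ($) and fuzzy (?) query expansion.
  CTX_DDL.SET_ATTRIBUTE('my_wordlist', 'STEMMER', 'ENGLISH');
  CTX_DDL.SET_ATTRIBUTE('my_wordlist', 'FUZZY_MATCH', 'ENGLISH');
  -- Index token prefixes of 3 to 4 characters to speed up right-truncated queries.
  CTX_DDL.SET_ATTRIBUTE('my_wordlist', 'PREFIX_INDEX', 'TRUE');
  CTX_DDL.SET_ATTRIBUTE('my_wordlist', 'PREFIX_MIN_LENGTH', '3');
  CTX_DDL.SET_ATTRIBUTE('my_wordlist', 'PREFIX_MAX_LENGTH', '4');
END;
/
```

Prefix and substring indexing speed up wildcard queries at the cost of a larger index.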
(5) Index Set class
An index set is a collection of one or more Oracle indexes (not Oracle Text indexes) used to create an Oracle Text index of type CTXCAT. It has a single basic object, BASIC_INDEX_SET.
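To make the relationship between an index set and a CTXCAT index concrete, here is a short sketch. The auction table, its columns, and the names auction_iset and auction_titlex are hypothetical.

```sql
CREATE TABLE auction (
  item_id NUMBER PRIMARY KEY,
  title   VARCHAR2(100),
  price   NUMBER
);

BEGIN
  CTX_DDL.CREATE_INDEX_SET('auction_iset');
  -- Sub-index on price, so that structured predicates on price can be served.
  CTX_DDL.ADD_INDEX('auction_iset', 'price');
END;
/

CREATE INDEX auction_titlex ON auction(title)
  INDEXTYPE IS CTXSYS.CTXCAT
  PARAMETERS ('INDEX SET auction_iset');

-- Mixed query: text condition on title, structured condition on price.
SELECT title, price FROM auction
 WHERE CATSEARCH(title, 'camera', 'price < 200 ORDER BY price') > 0;
```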
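(6) Lexer class
The lexer class identifies the language of the text and determines how tokens are recognized in it. The default lexer targets English and other Western European languages: it delimits tokens by spaces, standard punctuation, and non-alphanumeric characters, and it is case-insensitive. There are eight basic objects: BASIC_LEXER, CHINESE_LEXER, CHINESE_VGRAM_LEXER, JAPANESE_LEXER, JAPANESE_VGRAM_LEXER, KOREAN_LEXER, KOREAN_MORPH_LEXER, and MULTI_LEXER.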
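A lexer preference that changes this default behavior could look like the sketch below. The preference name my_lexer is hypothetical, and the two attributes are just one possible choice.

```sql
BEGIN
  CTX_DDL.CREATE_PREFERENCE('my_lexer', 'BASIC_LEXER');
  -- Keep the case of tokens, so that "cat" and "Cat" are indexed as distinct tokens.
  CTX_DDL.SET_ATTRIBUTE('my_lexer', 'MIXED_CASE', 'YES');
  -- Treat underscore and hyphen as part of a token instead of as delimiters.
  CTX_DDL.SET_ATTRIBUTE('my_lexer', 'PRINTJOINS', '_-');
END;
/
```

For Chinese text, the same pattern applies with CHINESE_VGRAM_LEXER or CHINESE_LEXER as the object name.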
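(7) Filter class
The filter determines how text is filtered for indexing. Filters make it possible to index word-processor documents, formatted documents, plain text, and HTML documents. There are five basic objects: CHARSET_FILTER, INSO_FILTER, NULL_FILTER, PROCEDURE_FILTER, and USER_FILTER.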
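(8) Stoplist class
The stoplist class specifies a set of words that are not indexed (called stopwords). There are two basic objects: BASIC_STOPLIST (all stopwords belong to one language) and MULTI_STOPLIST (a multilingual stoplist containing stopwords in several languages).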
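A custom stoplist can be sketched as follows. The names my_stoplist, memos, body, and memos_idx are hypothetical, and the stopwords are arbitrary examples.

```sql
BEGIN
  CTX_DDL.CREATE_STOPLIST('my_stoplist', 'BASIC_STOPLIST');
  -- Words added here are skipped when the index is built.
  CTX_DDL.ADD_STOPWORD('my_stoplist', 'the');
  CTX_DDL.ADD_STOPWORD('my_stoplist', 'because');
END;
/

CREATE INDEX memos_idx ON memos(body)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('STOPLIST my_stoplist');
```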
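II. The complete procedure for building a full-text index with Oracle Text can be summarized as follows:
(1) Create the table and load the text (the table includes the text column to be searched)
(2) Configure the index
(3) Create the index
(4) Issue queries
(5) Maintain the index: synchronization and optimization (covered later)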
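The five steps can be sketched end to end as below. This is a minimal example under stated assumptions: the table docs, the preference docs_lexer, and the index docs_idx are hypothetical names, and the user is assumed to be able to execute CTX_DDL.

```sql
-- (1) Create the table and load some text.
CREATE TABLE docs (
  id   NUMBER PRIMARY KEY,
  text VARCHAR2(200)
);
INSERT INTO docs VALUES (1, 'Oracle Text builds an inverted index.');
INSERT INTO docs VALUES (2, 'The lexer splits text into tokens.');
COMMIT;

-- (2) Configure the index: here only a lexer preference.
BEGIN
  CTX_DDL.CREATE_PREFERENCE('docs_lexer', 'BASIC_LEXER');
END;
/

-- (3) Create the index.
CREATE INDEX docs_idx ON docs(text)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('LEXER docs_lexer');

-- (4) Query; the label 1 ties SCORE to the CONTAINS call.
SELECT SCORE(1) AS relevance, id
  FROM docs
 WHERE CONTAINS(text, 'inverted', 1) > 0;

-- (5) Maintain: synchronize after DML, then optimize to defragment.
INSERT INTO docs VALUES (3, 'New rows are not searchable until the index is synchronized.');
COMMIT;
BEGIN
  CTX_DDL.SYNC_INDEX('docs_idx', '2M');
  CTX_DDL.OPTIMIZE_INDEX('docs_idx', 'FULL');
END;
/
```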
III. Index types
A CONTEXT index is the basic type of Oracle Text index. This is an index on a text column. A CONTEXT index is useful when your source text consists of many large, coherent documents. Query this index with the CONTAINS operator in the WHERE clause of a SELECT statement. This index requires manual synchronization after DML. See Syntax for CONTEXT Index Type.
The CTXCAT type of index is a combined index on a text column and one or more other columns. CTXCAT is typically used to index small documents or text fragments, such as item names, prices and descriptions found in catalogs. Query this index with the CATSEARCH operator in the WHERE clause of a SELECT statement. This type of index is optimized for mixed queries. This index is transactional, automatically updating itself with DML to the base table. See Syntax for CTXCAT Index Type.
A CTXRULE index is used to build a document classification application. The CTXRULE index is an index created on a table of queries or a column containing a set of queries, where the queries serve as rules to define the classification criteria. Query this index with the MATCHES operator in the WHERE clause of a SELECT statement. See Syntax for CTXRULE Index Type.
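Because CTXRULE inverts the usual roles (the queries are indexed and a document is passed in), a short sketch may help. The table rules and its contents are made up for illustration.

```sql
-- Each row is a classification rule expressed as a query.
CREATE TABLE rules (
  rule_id  NUMBER PRIMARY KEY,
  category VARCHAR2(30),
  query    VARCHAR2(2000)
);
INSERT INTO rules VALUES (1, 'databases', 'oracle OR sql');
INSERT INTO rules VALUES (2, 'sports',    'football OR tennis');
COMMIT;

CREATE INDEX rules_idx ON rules(query)
  INDEXTYPE IS CTXSYS.CTXRULE;

-- Classify a document: return the categories whose rule matches the text.
SELECT category
  FROM rules
 WHERE MATCHES(query, 'Oracle released a new SQL engine') > 0;
```

In this sketch the first rule is the one intended to match the sample sentence.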
CTXXPATH: Create this index when you need to speed up existsNode() queries on an XMLType column. See Syntax for CTXXPATH Index Type.
IV. Lexer types
Type | Description
AUTO_LEXER | Lexer for indexing columns that contain documents of different languages.
BASIC_LEXER | Lexer for extracting tokens from text in languages, such as English and most western European languages, that use white space delimited words.
MULTI_LEXER | Lexer for indexing tables containing documents of different languages such as English, German, and Japanese.
CHINESE_VGRAM_LEXER | Lexer for extracting tokens from Chinese text.
CHINESE_LEXER | Lexer for extracting tokens from Chinese text. This lexer offers benefits over the CHINESE_VGRAM_LEXER: generates a smaller index; better query response time; generates real world tokens resulting in better query precision; supports stop words.
JAPANESE_VGRAM_LEXER | Lexer for extracting tokens from Japanese text.
JAPANESE_LEXER | Lexer for extracting tokens from Japanese text. This lexer offers the following advantages over the JAPANESE_VGRAM_LEXER: generates smaller index; better query response time; generates real world tokens resulting in better precision.
KOREAN_MORPH_LEXER | Lexer for extracting tokens from Korean text.
USER_LEXER | Lexer you create to index a particular language.
WORLD_LEXER | Lexer for indexing tables containing documents of different languages; autodetects languages in a document.
WORLD_LEXER deserves a closer look:
Use the WORLD_LEXER to index text columns that contain documents of different languages. For example, use this lexer to index a text column that stores English, Japanese, and German documents.
WORLD_LEXER differs from MULTI_LEXER in that WORLD_LEXER automatically detects the language(s) of a document. Unlike MULTI_LEXER, WORLD_LEXER does not require you to have a language column in your base table nor to specify the language column when you create the index. Moreover, it is not necessary to use sub-lexers, as with MULTI_LEXER. (See MULTI_LEXER.)
WORLD_LEXER supports all database character sets, and for languages whose character sets are Unicode-based, it supports the Unicode 5.0 standard. For a list of languages that WORLD_LEXER can work with, see "World Lexer Features".
WORLD_LEXER Attribute
The WORLD_LEXER has the following attribute:
Attribute | Attribute Value
mixed_case | Enable mixed-case (upper- and lower-case) searches of text (for example, cat and Cat). Allowable values are YES and NO (default).
WORLD_LEXER Example
Here is an example of creating an index using WORLD_LEXER.
exec ctx_ddl.create_preference('MYLEXER', 'world_lexer');
create index doc_idx on doc(data)
indextype is CONTEXT
parameters ('lexer MYLEXER
stoplist CTXSYS.EMPTY_STOPLIST');
V. Syntax for creating a full-text index
CREATE INDEX [schema.]index ON [schema.]table(txt_column)
INDEXTYPE IS ctxsys.context [ONLINE]
[FILTER BY filter_column[, filter_column]...]
[ORDER BY oby_column[desc|asc][, oby_column[desc|asc]]...]
[LOCAL [(PARTITION [partition] [PARAMETERS('paramstring')]
[, PARTITION [partition] [PARAMETERS('paramstring')]])]
[PARAMETERS(paramstring)] [PARALLEL n] [UNUSABLE]];
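Here, PARALLEL n requests parallel index creation with degree n.
The PARAMETERS(paramstring) options are as follows (index synchronization can be configured here):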
Optionally specify indexing parameters in paramstring. You can specify preferences owned by another user using the user.preference notation.
The syntax for paramstring is as follows:
paramstring =
'[DATASTORE datastore_pref]
[FILTER filter_pref]
[CHARSET COLUMN charset_column_name]
[FORMAT COLUMN format_column_name]
[LEXER lexer_pref]
[LANGUAGE COLUMN language_column_name]
[WORDLIST wordlist_pref]
[STORAGE storage_pref]
[STOPLIST stoplist]
[SECTION GROUP section_group]
[MEMORY memsize]
[POPULATE | NOPOPULATE]
[SYNC (MANUAL | EVERY "interval-string" | ON COMMIT)]
[TRANSACTIONAL]'
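As an illustration of the SYNC options in the parameter string, the sketch below creates an index that synchronizes itself at commit time. The table notes, its column body, and the index name notes_idx are hypothetical; the memory size is an arbitrary example value that must stay within the system's MAX_INDEX_MEMORY setting.

```sql
-- Synchronize automatically whenever a transaction on the base table commits.
CREATE INDEX notes_idx ON notes(body)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('LEXER CTXSYS.DEFAULT_LEXER
               STOPLIST CTXSYS.EMPTY_STOPLIST
               MEMORY 32M
               SYNC (ON COMMIT)');

-- Alternative: synchronize once an hour through a scheduler job.
-- This variant replaces the SYNC clause above and is assumed to need
-- the CREATE JOB privilege:
--   SYNC (EVERY "SYSDATE+1/24")
```

With SYNC (MANUAL), which is the default, the index changes only when CTX_DDL.SYNC_INDEX is called explicitly.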