INDEX
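For reasons of performance and result quality, full-text search generally beats wildcards and regular expressions when searching text. It builds an index on the specified columns so that matching rows can be found quickly, and it ranks the result set by relevance. Enabling query expansion also returns related rows that may not contain the keyword itself; Boolean mode lets us specify words the results must not contain, the relative weight of each keyword, and so on.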
WARNING
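Not every storage engine supports full-text search. MyISAM supports full-text indexes; InnoDB does not in versions before MySQL 5.6.
PS. From MySQL 5.6 onward InnoDB supports full-text indexes as well, with syntax similar to MyISAM's.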
Understanding Full-Text Searching
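Wildcards and regular expressions are powerful, but they have several serious drawbacks:
- Performance: wildcards and regular expressions are always matched against every row in the table, so both become very slow as the number of rows grows.
- Explicit control: with wildcards and regular expressions it is hard to control precisely what is and is not matched.
- Intelligent results: for example, wildcards and regular expressions return a row in the same way whether the field contains one match or many, and a row that contains no match but does contain related (similar) words is not returned at all.
Full-text indexing addresses all of these drawbacks. When full-text search is used, MySQL does not need to match each row individually. Instead it builds an index of the words in the specified columns, so it can determine quickly and efficiently which words match and which do not.
You can think of it as MySQL generating a table of contents for the specified columns. The table of contents records that, say, rows 1, 4, 8, and 9 contain the word xxx, so a search for xxx can jump straight to rows 1, 4, 8, and 9. In effect it trades space for time.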
Using Full-Text Searching
    CREATE TABLE productnotes
    (
      note_id    int      NOT NULL AUTO_INCREMENT,
      prod_id    char(10) NOT NULL,
      note_date  datetime NOT NULL,
      note_text  text     NULL,
      PRIMARY KEY(note_id),
      FULLTEXT(note_text)
    ) ENGINE=MyISAM;
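PS. A full-text index can also cover several columns!
Once a full-text index is defined, MySQL maintains it automatically: the index changes accordingly whenever rows are added, updated, or deleted.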
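As a minimal sketch of a multi-column full-text index: the table article_notes and its columns below are hypothetical names invented for illustration, not part of the original schema. The point is only that one FULLTEXT definition may list several columns, and that Match() must then name the same column list.

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE article_notes
(
  note_id     INT          NOT NULL AUTO_INCREMENT,
  note_title  VARCHAR(200) NOT NULL,
  note_text   TEXT         NULL,
  PRIMARY KEY (note_id),
  FULLTEXT (note_title, note_text)   -- one index covering both columns
) ENGINE=MyISAM;

-- Match() lists the same columns as the FULLTEXT definition above.
SELECT note_title
FROM article_notes
WHERE Match(note_title, note_text) Against('rabbit');
```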
Don't Use FULLTEXT When Importing Data
Updating indexes takes time (not a lot of time, but time nonetheless). If you are importing data into a new table, you should not enable FULLTEXT indexing at that time. Rather, first import all of the data, and then modify the table to define FULLTEXT. This makes for a much faster data import (and the total time needed to index all data will be less than the sum of the time needed to index each row individually).
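The import-first approach described above might look like the following sketch. The table name productnotes_import and the sample row are assumptions made up for illustration; the only statement that matters is the final ALTER TABLE, which adds the index after the data is in place.

```sql
-- Hypothetical staging table: same columns as productnotes, no FULLTEXT yet.
CREATE TABLE productnotes_import
(
  note_id    INT      NOT NULL AUTO_INCREMENT,
  prod_id    CHAR(10) NOT NULL,
  note_date  DATETIME NOT NULL,
  note_text  TEXT     NULL,
  PRIMARY KEY (note_id)
) ENGINE=MyISAM;

-- Bulk import goes here. This single placeholder row stands in for it.
INSERT INTO productnotes_import (prod_id, note_date, note_text)
VALUES ('DEMO01', '2005-08-17 00:00:00', 'Placeholder note text.');

-- Define the full-text index only after all rows are loaded.
ALTER TABLE productnotes_import ADD FULLTEXT (note_text);
```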
Performing Full-Text Searches
    mysql> SELECT note_text
        -> FROM productnotes
        -> WHERE Match(note_text) Against('rabbit');
    +----------------------------------------------------------------------------------------------------------------------+
    | note_text                                                                                                            |
    +----------------------------------------------------------------------------------------------------------------------+
    | Customer complaint: rabbit has been able to detect trap, food apparently less effective now.                         |
    | Quantity varies, sold by the sack load. All guaranteed to be bright and orange, and suitable for use as rabbit bait. |
    +----------------------------------------------------------------------------------------------------------------------+
    2 rows in set (0.00 sec)
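Match() names the column to search (it must be the same as the column defined as FULLTEXT) and Against('xxx') gives the search expression. Remember these two functions! Also note that the search is case-insensitive!
In fact, a wildcard can easily achieve a similar effect, although both the performance and the results differ: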
    mysql> SELECT note_text
        -> FROM productnotes
        -> WHERE note_text LIKE '%rabbit%';
    +----------------------------------------------------------------------------------------------------------------------+
    | note_text                                                                                                            |
    +----------------------------------------------------------------------------------------------------------------------+
    | Quantity varies, sold by the sack load. All guaranteed to be bright and orange, and suitable for use as rabbit bait. |
    | Customer complaint: rabbit has been able to detect trap, food apparently less effective now.                         |
    +----------------------------------------------------------------------------------------------------------------------+
    2 rows in set (0.00 sec)
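Surprisingly, the same rows come back, but not in the same order.
That is because full-text search has an important feature: the ranking of results. Rows with a higher rank are returned first. To see the rank computed for each row: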
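    SELECT note_text,
           Match(note_text) Against('rabbit') AS `rank`
    FROM productnotes;
The higher the rank, the earlier the row is returned. The rank depends on factors such as whether the keyword appears at all and whether it appears early or late in the text. (RANK is a reserved word from MySQL 8.0.2 onward, hence the backticks around the alias.)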
Using Query Expansion
When query expansion is used, MySQL makes two passes through the data and indexes to perform your search:
- First, a basic full-text search is performed to find all rows that match the search criteria.
- Next, MySQL examines those matched rows and selects all useful words (we'll explain how MySQL figures out what is useful and what is not shortly).
- Then, MySQL performs the full-text search again, this time using not just the original criteria, but also all of the useful words.
Using query expansion you can therefore find results that might be relevant, even if they don't contain the exact words for which you were looking.
    mysql> SELECT note_text
        -> FROM productnotes
        -> WHERE Match(note_text) Against('rabbit' WITH QUERY EXPANSION);
    +----------------------------------------------------------------------------------------------------------------------------------------------------------+
    | note_text                                                                                                                                                |
    +----------------------------------------------------------------------------------------------------------------------------------------------------------+
    | Quantity varies, sold by the sack load. All guaranteed to be bright and orange, and suitable for use as rabbit bait.                                     |
    | Customer complaint: rabbit has been able to detect trap, food apparently less effective now.                                                             |
    | Customer complaint: Circular hole in safe floor can apparently be easily cut with handsaw.                                                               |
    | Customer complaint: Sticks not individually wrapped, too easy to mistakenly detonate all at once. Recommend individual wrapping.                         |
    | Customer complaint: Not heavy enough to generate flying stars around head of victim. If being purchased for dropping, recommend ANV02 or ANV03 instead.  |
    | Multiple customer returns, anvils failing to drop fast enough or falling backwards on purchaser. Recommend that customer considers using heavier anvils. |
    +----------------------------------------------------------------------------------------------------------------------------------------------------------+
    6 rows in set (0.00 sec)
Boolean Text Searches
    SELECT note_text
    FROM productnotes
    WHERE Match(note_text) Against('+rabbit +bait' IN BOOLEAN MODE);
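For more operators see Chapter 18 of the original book. Various symbols specify the priority of each keyword, which words must be included, which words must be excluded, and so on.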
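A few more Boolean-mode expressions, sketched against the same productnotes table. The search words are only examples; which rows come back depends on the data actually loaded, so no output is shown.

```sql
-- Rows that contain "heavy" but no word beginning with "rope".
SELECT note_text
FROM productnotes
WHERE Match(note_text) Against('heavy -rope*' IN BOOLEAN MODE);

-- No operators: rows that contain at least one of the two words.
SELECT note_text
FROM productnotes
WHERE Match(note_text) Against('rabbit bait' IN BOOLEAN MODE);

-- Double quotes: rows that contain the exact phrase "rabbit bait".
SELECT note_text
FROM productnotes
WHERE Match(note_text) Against('"rabbit bait"' IN BOOLEAN MODE);

-- Either word matches; ">" raises and "<" lowers its contribution to the rank.
SELECT note_text
FROM productnotes
WHERE Match(note_text) Against('>rabbit <carrot' IN BOOLEAN MODE);
```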
Full-Text Search Usage Notes
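Important notes:
- MySQL does not index short words. By default a short word is one of three or fewer characters.
- MySQL has a built-in list of stopwords, which are ignored in the same way as short words. The list can be overridden if needed; see the MySQL documentation for how.
- Many words occur so frequently that searching for them would be useless (too many results would be returned). MySQL therefore honors a 50% rule: a word that appears in 50% or more of the rows is treated as a stopword and effectively ignored. (The 50% rule does not apply IN BOOLEAN MODE.)
- If the table has fewer than three rows (that is, two or fewer), full-text search returns no results, because every word that occurs at all then appears in at least 50% of the rows.
- Single quotes within words are ignored; for example, don't is indexed as dont.
- Languages without word delimiters (including Japanese and Chinese) will not return full-text results correctly.
So how can Chinese word segmentation and full-text indexing be achieved? To be updated.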
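To check the short-word and stopword settings mentioned in the list above on a MyISAM setup, the relevant server variables can be inspected as sketched below. This assumes MyISAM full-text indexes; InnoDB uses separate variables such as innodb_ft_min_token_size.

```sql
-- Minimum indexed word length for MyISAM (default 4, so words of
-- three or fewer characters are skipped).
SHOW VARIABLES LIKE 'ft_min_word_len';

-- Source of the MyISAM stopword list; "(built-in)" means the default list.
SHOW VARIABLES LIKE 'ft_stopword_file';
```

Changing either value generally requires a server restart and a rebuild of the existing full-text indexes before it takes effect.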