PostgreSQL Full-Text Search with Chinese Word Segmentation (1): NlpBamboo

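PostgreSQL 8.3 introduced two data types, tsvector and tsquery, to support full-text search, so the only missing piece for Chinese full-text search is a Chinese word segmentation component. A quick Google search turned up the NlpBamboo project, which is easy to use.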

Installing and configuring NlpBamboo

1. Install the build tool cmake
```
apt-get install cmake
```
2. Install CRF++, a library that Bamboo depends on. After downloading the CRF++ source:
```
cd CRF++
./configure
make
make install
```
Building CRF++ requires g++. Without it, configure prints a message like "checking if g++ supports namespaces (required) … no"; running apt-get install g++ fixes this.

3. Build and install Bamboo
```
cd nlpbamboo
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=release
make all
make install
```
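4. Install the PostgreSQL word segmentation extensions

Download the segmentation data file index.tar.bz2 from the Bamboo project home page and extract it to /opt/bamboo/index

Set up the Chinese stop word (noise word) file. Some strings should not be indexed, for example common punctuation marks, "的", or "a" in English.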
```
touch /usr/share/postgresql/8.4/tsearch_data/chinese_utf8.stop
```
The command above creates an empty Chinese stop word file. You can also fill it in with a text editor, one stop word per line.
```
cd /opt/bamboo/exts/postgres/pg_tokenize
make
make install
cd /opt/bamboo/exts/postgres/chinese_parser
make
make install
```
If make fails with an error saying pgxs.mk cannot be found, install the server development package: apt-get install postgresql-server-dev-8.4

Load the tokenizer functions and the parser module into your database:
```
psql mydbname -U username
mydbname=#\i /usr/share/postgresql/8.4/contrib/pg_tokenize.sql
mydbname=#\i /usr/share/postgresql/8.4/contrib/chinese_parser.sql
```
Test the segmentation:
```
select to_tsvector('chinesecfg', '欢迎光临我的博客chengwei.org');
---------------------------------------------------------------
'chengwei':6 'org':8 '光临':2 '博客':5 '我':3 '欢迎':1 '的':4
(1 row)
```
At this point the database supports Chinese full-text search, but querying it with SQL statements from a project takes some extra work.

Using PostgreSQL's full-text search features

PostgreSQL uses the tsvector data type to store indexed content. To convert a piece of text to tsvector, just call the to_tsvector function:
```
select to_tsvector('english', 'Better late than never');
-------------------------------
 'better':1 'late':2 'never':4
(1 row)
```
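Note that the result above has no entry for 'than', and 'never' sits at position 4 rather than 3: the english configuration drops stop words but still counts their positions. As a rough check, assuming a default install in which the built-in simple configuration has no stop word list, the same sentence keeps all four words:

```sql
-- 'simple' lowercases tokens but, by default, removes no stop words
select to_tsvector('simple', 'Better late than never');
-- expected: 'better':1 'late':2 'never':4 'than':3
```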
To check whether a word appears in the sentence 'Better late than never':
```
select to_tsvector('english', 'Better late than never') @@ 'better' as in;
 in
----
 t
(1 row)

select to_tsvector('english', 'Better late than never') @@ 'right' as in;
 in
----
 f
(1 row)
```
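The full-text match operator @@ returns true or false. The right-hand side of @@ is actually a tsquery value, whose terms can be combined with operators such as & and |: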
```
select to_tsvector('english', 'Better late than never') @@ to_tsquery('right | better') as in;
 in
----
 t
(1 row)
```
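As a further sketch of combining terms, the queries below use & (and) and ! (not) against the same sentence. Passing the configuration name to to_tsquery explicitly, and the column alias matched, are choices made for this illustration; the example above relies on the default_text_search_config setting instead.

```sql
-- both words must be present
select to_tsvector('english', 'Better late than never')
       @@ to_tsquery('english', 'late & never') as matched;
-- expected: t

-- 'late' must be present and 'right' must be absent
select to_tsvector('english', 'Better late than never')
       @@ to_tsquery('english', 'late & !right') as matched;
-- expected: t

-- 'right' is missing, so the AND fails
select to_tsvector('english', 'Better late than never')
       @@ to_tsquery('english', 'right & better') as matched;
-- expected: f
```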
Chinese full-text indexing in practice

To build a full-text index on the title column of the archive table, assume the table is defined as:
```
create table archive(
   id serial primary key,
   title text);
```
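Create a separate table to hold the tsvector values (they could also be stored directly in the archive table). ON DELETE CASCADE specifies that when a row is deleted from the parent table, the child rows that reference it through the foreign key are deleted as well. The default is NO ACTION, which rejects the delete while referencing child rows still exist.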
```
create table fti_archive(
    id integer primary key,
    fti_title tsvector,
    foreign key (id) references archive(id) ON DELETE CASCADE);
```
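For comparison, the alternative of keeping the tsvector in the archive table itself could look roughly like this. It is only a sketch: the column name fti_title is reused for illustration, the index name archive_fti_title_inx is made up, and the chinesecfg configuration is assumed to be installed as in the previous section.

```sql
-- add the tsvector column directly to archive
alter table archive add column fti_title tsvector;

-- fill it from the existing titles
update archive set fti_title = to_tsvector('chinesecfg', title);

-- index the column so that @@ queries can use GIN
create index archive_fti_title_inx on archive using gin(fti_title);
```

With this layout no join is needed at query time, at the cost of widening the archive table. The rest of this post uses the separate fti_archive table.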
The archive table already holds a large amount of data, so generate the full-text index for the title column from the existing rows:
```
insert into fti_archive(id, fti_title)
select id, to_tsvector('chinesecfg',title) from archive;
```
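Clearly, whenever a row is inserted into archive or its title column is updated, the index entry for that row must be updated too. A trigger can take care of this: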
```
create or replace function update_fti_title()
returns trigger as $$
begin
    if TG_OP = 'INSERT' then
        insert into fti_archive(id, fti_title) values(NEW.id, to_tsvector('chinesecfg',NEW.title));
    else
        update fti_archive set fti_title=to_tsvector('chinesecfg',NEW.title) where id=NEW.id;
    end if;
    return new;
end
$$ LANGUAGE plpgsql;

create trigger update_fti_trigger after insert or update
on archive for each row execute procedure update_fti_title();
```
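A quick way to check the trigger, sketched under the assumption that the archive table, the fti_archive table, and the trigger above exist as written. The sample titles are arbitrary, and currval only works in the same session as the insert.

```sql
-- insert a row; the trigger should create the matching fti_archive row
insert into archive(title) values ('今天天气不错');

-- look up the index row for the id that was just generated
select id, fti_title from fti_archive
where id = currval(pg_get_serial_sequence('archive', 'id'));

-- update the title; the same fti_archive row should now hold the new vector
update archive set title = '欢迎光临我的博客'
where id = currval(pg_get_serial_sequence('archive', 'id'));

select id, fti_title from fti_archive
where id = currval(pg_get_serial_sequence('archive', 'id'));
```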
Create an index on the fti_title column of the full-text index table:
```
create index fti_archive_fti_title_inx on fti_archive
using gin(fti_title);
```
A single join is then enough to run a full-text query:
```
select archive.* from archive inner join fti_archive
on archive.id=fti_archive.id
where fti_title @@ plainto_tsquery('chinesecfg','今天天气不错')
```
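plainto_tsquery parses plain text and joins the resulting terms with &, so every term has to match. With the english configuration used earlier, for example, the stop word 'than' is dropped and the remaining words are ANDed. To order matches by relevance, ts_rank can be added to the join query. The alias q, the column label rank, and the limit of 10 are arbitrary choices for this sketch.

```sql
-- what plainto_tsquery produces for the earlier English sentence
select plainto_tsquery('english', 'Better late than never');
-- expected: 'better' & 'late' & 'never'

-- the join query again, with the best matches first
select archive.*, ts_rank(fti_archive.fti_title, q) as rank
from archive
inner join fti_archive on archive.id = fti_archive.id
cross join plainto_tsquery('chinesecfg', '今天天气不错') as q
where fti_archive.fti_title @@ q
order by rank desc
limit 10;
```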
Reposted from: http://chengwei.org/archives/postgresql-chinese-full-text-index-with-bamboo.html
Original post: https://www.cnblogs.com/shuaixf/p/2173260.html