Solr 4.8 + MySQL data import + mmseg4j Chinese full-text indexing: configuration notes

When reposting, please cite the source: http://www.cnblogs.com/chlde/p/3768733.html

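1. For how to deploy Solr, see the previous article.

2. Once Solr is set up as described there, the solr_home folder contains a collection1 folder, which is one Solr instance. Let's look at what collection1 contains.

collection1 has two subfolders, conf and data. data contains tlog and index (it does not matter if they are missing; they are created later, when Solr builds the index). tlog holds the transaction logs and index holds the index files. conf contains a lang folder and a number of files. The lang folder holds the word-list files, but Solr ships without a Chinese dictionary, so we will add one to this folder later. conf also contains several XML files; to configure Solr we only need to edit solrconfig.xml and schema.xml. The following sections explain how to configure these two files.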

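3. Start with solrconfig.xml, the core Solr configuration file. It contains the jar references, the pointer to the database import configuration, and the request handler (endpoint) configuration.

The jar references are configured as follows: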
```
<lib dir="../contrib/extraction/lib" regex=".*\.jar" />
<lib dir="../dist/" regex="solr-cell-\d.*\.jar" />

<lib dir="../contrib/clustering/lib/" regex=".*\.jar" />
<lib dir="../dist/" regex="solr-clustering-\d.*\.jar" />

<lib dir="../contrib/langid/lib/" regex=".*\.jar" />
<lib dir="../dist/" regex="solr-langid-\d.*\.jar" />

<lib dir="../contrib/velocity/lib" regex=".*\.jar" />
<lib dir="../dist/" regex="solr-velocity-\d.*\.jar" />

<!-- Data import handler jars, needed for reading from the database -->
<lib dir="../contrib/dataimporthandler/lib" regex=".*\.jar" />
<lib dir="../dist/" regex="solr-dataimporthandler-\d.*\.jar" />
```
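The last two lines load the data import handler, which includes the jars needed to read data from the database. The directories for these jars are all under the solr_home\contrib folder.

Configuring the dataimporthandler: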
```
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
```
This requires you to create a new XML file in the conf folder, named data-config.xml, with the following content:
```
<dataConfig>
  <!-- Database connection: replace yourDBname, user and password with your own values -->
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/yourDBname"
              user="root"
              password="root"/>
  <document>
    <!-- Each entity runs one SQL query; each field maps a column to a schema field -->
    <entity name="question1" query="select Guid,title,QuesBody,QuesParse,QuesType from question1 where Guid is not null">
      <field column="Guid" name="id"/>
      <field column="title" name="question1_title"/>
      <field column="QuesBody" name="question1_body"/>
      <field column="QuesParse" name="question1_parse"/>
      <field column="QuesType" name="question1_type"/>
    </entity>
    <entity name="question2" query="select Guid,title,QuesBody,QuesParse,QuesType from question2 where Guid is not null">
      <field column="Guid" name="id"/>
      <field column="title" name="question2_title"/>
      <field column="QuesBody" name="question2_body"/>
      <field column="QuesParse" name="question2_parse"/>
      <field column="QuesType" name="question2_type"/>
    </entity>
  </document>
</dataConfig>
```
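As shown above, the file has two top-level tags, dataSource and document. dataSource, as its name suggests, holds the database connection settings. document contains entity elements; each entity is one action that reads data from the database.

query is the SQL used to read the data, and the field elements list how database columns map to fields in the schema. This is covered in more detail in the schema.xml section below.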
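To make the column-to-field mapping concrete, here is a minimal Python sketch of the renaming that the field elements of the question1 entity describe. It is only a conceptual model, not how DataImportHandler is implemented; `FIELD_MAP`, `row_to_doc` and the sample row are hypothetical names and data introduced for illustration.

```python
# Conceptual model only: mirrors the <field column="..." name="..."/> entries
# of the question1 entity. Not DataImportHandler's actual implementation.
FIELD_MAP = {
    "Guid": "id",
    "title": "question1_title",
    "QuesBody": "question1_body",
    "QuesParse": "question1_parse",
    "QuesType": "question1_type",
}


def row_to_doc(row, field_map):
    """Rename the columns of one SQL result row to Solr field names."""
    return {
        solr_name: row[column]
        for column, solr_name in field_map.items()
        if column in row
    }


# Hypothetical row, shaped like a result of the entity's query.
row = {
    "Guid": "a1",
    "title": "Sample title",
    "QuesBody": "Sample body",
    "QuesParse": "Sample analysis",
    "QuesType": "choice",
}
doc = row_to_doc(row, FIELD_MAP)
print(doc)
```

The Guid column ends up in the id field, which is the field used as the primary key later in these notes.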

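Back in solrconfig.xml, each requestHandler defines an endpoint that responds to HTTP requests. For example, with the /dataimport handler configured above, once the servlet container has started you can visit http://localhost:8080/solr/collection1/dataimport to check the status of the data import. To run a command, request http://localhost:8080/solr/collection1/dataimport?command=full-import (this rebuilds the whole index; the existing index is deleted). For other commands, see http://www.cnblogs.com/llz5023/archive/2012/11/15/2772154.html. Adding, deleting, updating and querying documents in Solr work the same way, through requests of this form. requestHandler also supports more advanced configuration; interested readers can consult apache-solr-ref-guide-4.8.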
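The two URLs above differ only in the command parameter, which can be sketched in Python as follows. The base URL is the one used in this post; `dataimport_url` is a hypothetical helper, and the sketch only builds the URLs without sending any request.

```python
from urllib.parse import urlencode

# Endpoint from this post; adjust host, port and core name to your setup.
BASE_URL = "http://localhost:8080/solr/collection1/dataimport"


def dataimport_url(command=None, base_url=BASE_URL):
    """Return the status URL, or the URL that triggers the given command."""
    if command is None:
        return base_url
    return base_url + "?" + urlencode({"command": command})


status_url = dataimport_url()
full_import_url = dataimport_url("full-import")
print(status_url)
print(full_import_url)
```

With Solr running, either URL can be opened in a browser or fetched with any HTTP client.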

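4. Configuring schema.xml. schema.xml defines the types of the indexed data and some indexing-related behavior, such as the tokenization method.

Solr needs an id defined for each indexed document to serve as its primary key, and the import query must include a column mapped to that id field, otherwise an error is raised. In data-config.xml, for example, Guid is mapped to id, so Guid serves as the primary key.

A field is the basic unit of a Solr index. The value of its type attribute refers to a fieldType: type assigns a fieldType to each field, and the fieldType determines how that field is indexed.

For example, we will use mmseg4j to index Chinese text: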
```
<fieldType name="text_chn_complex" class="solr.TextField">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="lang/chn.txt"/>
  </analyzer>
</fieldType>
<fieldType name="text_chn_maxword" class="solr.TextField">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="max-word" dicPath="lang/chn.txt"/>
  </analyzer>
</fieldType>
<fieldType name="text_chn_simple" class="solr.TextField">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="simple" dicPath="lang/chn.txt"/>
  </analyzer>
</fieldType>
```
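Above we define three fieldTypes, representing three ways of indexing Chinese text. All of them are of class solr.TextField and all use the mmseg4j tokenizer; only the mode differs. dicPath is the location of the dictionary.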
```
<field name="question1_type" type="text_chn_maxword" indexed="true" stored="true"/>
```
This defines a field named question1_type that is indexed using text_chn_maxword.
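The link from a field to its fieldType is just a name lookup. The following Python sketch, offered as an illustration rather than anything Solr itself runs, parses a trimmed-down schema fragment and resolves which mmseg4j mode a field ends up with. `SCHEMA_FRAGMENT` and `tokenizer_mode_for` are hypothetical names; a real schema.xml contains many more elements.

```python
import xml.etree.ElementTree as ET

# Trimmed-down fragment built from the definitions in this post.
SCHEMA_FRAGMENT = """
<schema>
  <fieldType name="text_chn_maxword" class="solr.TextField">
    <analyzer>
      <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="max-word" dicPath="lang/chn.txt"/>
    </analyzer>
  </fieldType>
  <field name="question1_type" type="text_chn_maxword" indexed="true" stored="true"/>
</schema>
"""


def tokenizer_mode_for(field_name, schema_xml):
    """Follow field -> type -> fieldType -> tokenizer and return its mode."""
    root = ET.fromstring(schema_xml)
    field_types = {ft.get("name"): ft for ft in root.iter("fieldType")}
    for field in root.iter("field"):
        if field.get("name") == field_name:
            tokenizer = field_types[field.get("type")].find("analyzer/tokenizer")
            return tokenizer.get("mode")
    return None  # no field with that name in the fragment


mode = tokenizer_mode_for("question1_type", SCHEMA_FRAGMENT)
print(mode)
```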

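One thing to note: an unqualified keyword in a Solr query is matched against a single field, not against several fields at once. To match one keyword across multiple fields without naming each field in the query, you can use copyField.

For example: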
```
<field name="question2_title" type="text_chn_maxword" indexed="true" stored="true"/>
<field name="question2_body" type="text_chn_maxword" indexed="true" stored="true"/>

<field name="question2_text" type="text_chn_maxword" indexed="true" stored="true" multiValued="true"/>
<copyField source="question2_title" dest="question2_text"/>
<copyField source="question2_body" dest="question2_text"/>
```
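Here question2_title and question2_body are both indexed into question2_text, so a search on question2_text returns the document whenever the keyword matches either question2_title or question2_body. Note multiValued="true" on question2_text; it is required, because more than one source field is copied into it.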
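The effect of the two copyField rules can be modeled in a few lines of Python. This is a simplified sketch of the behavior described above, not Solr's implementation; `COPY_FIELDS`, `apply_copy_fields` and the sample document are hypothetical.

```python
# Mirrors the two <copyField source="..." dest="..."/> rules above.
COPY_FIELDS = [
    ("question2_title", "question2_text"),
    ("question2_body", "question2_text"),
]


def apply_copy_fields(doc, copy_fields):
    """Return a copy of doc with each source value appended to its dest list."""
    result = dict(doc)
    for source, dest in copy_fields:
        if source in doc:
            # dest collects one value per source, hence multiValued="true"
            result.setdefault(dest, []).append(doc[source])
    return result


# Hypothetical document after import.
doc = {
    "id": "a1",
    "question2_title": "What is Solr?",
    "question2_body": "Solr is a search server.",
}
indexed = apply_copy_fields(doc, COPY_FIELDS)
print(indexed["question2_text"])
```

Because question2_text ends up holding two values, a single-valued field could not store them. Assuming the standard /select handler is enabled, a query such as q=question2_text:keyword would then search the title and the body together.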

5. Problems encountered

Chinese dictionary download:

http://download.labs.sogou.com/dl/sogoulabdown/SogouW/SogouW.zip

mmseg4j must be version 2.0 or later; versions below 2.0 have bugs under Solr 4.8.

https://code.google.com/p/mmseg4j/

Java engineer: chlde2500@gmail.com