• php+中文分词scws+sphinx+mysql打造千万级数据全文搜索


    Sphinx是由俄罗斯人Andrew Aksyonoff开发的一个全文检索引擎。意图为其他应用提供高速、低空间占用、高结果 相关度的全文搜索功能。Sphinx可以非常容易的与SQL数据库和脚本语言集成。当前系统内置MySQL和PostgreSQL 数据库数据源的支持,也支持从标准输入读取特定格式 的XML数据。
    Sphinx创建索引的速度为:创建100万条记录的索引只需3~4分钟,创建1000万条记录的索引可以在50分钟内完成,而只包含最新10万条记录的增量索引,重建一次只需几十秒。
    Sphinx的特性如下:
    a) 高速的建立索引(在当代CPU上,峰值性能可达到10 MB/秒);
    b) 高性能的搜索(在2 – 4GB 的文本数据上,平均每次检索响应时间小于0.1秒);
    c) 可处理海量数据(目前已知可以处理超过100 GB的文本数据, 在单一CPU的系统上可处理100 M 文档);
    d) 提供了优秀的相关度算法,基于短语相似度和统计(BM25)的复合Ranking方法;
    e) 支持分布式搜索;
    f) 支持短语搜索
    g) 提供文档摘要生成
    h) 可作为MySQL的存储引擎提供搜索服务;
    i) 支持布尔、短语、词语相似度等多种检索模式;
    j) 文档支持多个全文检索字段(最大不超过32个);
    k) 文档支持多个额外的属性信息(例如:分组信息,时间戳等);
    l) 支持断词;
    虽然mysql的MYISAM提供全文索引,但是性能却不敢让人恭维


    开始搭建

    系统环境:centos6.5+php5.6+apache+mysql

    1、安装依赖包

    1 yum -y install make gcc g++ gcc-c++ libtool autoconf automake imake php-devel mysql-devel libxml2-devel expat-devel

    2、安装Sphinx

    1 yum install expat expat-devel 
    2 wget -c http://sphinxsearch.com/files/sphinx-2.0.7-release.tar.gz 
    3 tar zxvf sphinx-2.0.7-release.tar.gz 
    4 cd sphinx-2.0.7-release 
    5 ./configure --prefix=/usr/local/sphinx --with-mysql --with-libexpat --enable-id64 
    6 make && make install

    3、安装libsphinxclient,php扩展用到

    1 cd api/libsphinxclient 
    2 ./configure --prefix=/usr/local/sphinx/libsphinxclient 
    3 make && make install

    4、安装Sphinx的PHP扩展:我的是5.6需装sphinx-1.3.3.tgz,如果是php5.4以下可sphinx-1.3.0.tgz

     
    wget -c http://pecl.php.net/get/sphinx-1.3.3.tgz 
    tar zxvf sphinx-1.3.3.tgz 
    cd sphinx-1.3.3 
    phpize 
    ./configure --with-sphinx=/usr/local/sphinx/libsphinxclient/ --with-php-config=/usr/bin/php-config 
    make && make install 
    成功后会提示: 
    Installing shared extensions: /usr/lib64/php/modules/ 
    
    
    echo "[Sphinx]" >> /etc/php.ini 
    echo "extension = sphinx.so" >> /etc/php.ini 
    #重启apache 
    service httpd restart

    5、创建测试数据
     

    CREATE TABLE IF NOT EXISTS `items` ( 
    `id` int(11) NOT NULL AUTO_INCREMENT, 
    `title` varchar(255) NOT NULL, 
    `content` text NOT NULL, 
    `created` datetime NOT NULL, 
    PRIMARY KEY (`id`) 
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='全文检索测试的数据表' AUTO_INCREMENT=11 ; 
    
    INSERT INTO `items` (`id`, `title`, `content`, `created`) VALUES 
    (1, 'linux mysql集群安装', 'MySQL Cluster 是MySQL 适合于分布式计算环境的高实用、可拓展、高性能、高冗余版本', '2016-09-07 00:00:00'), 
    (2, 'mysql主从复制', 'mysql主从备份(复制)的基本原理 mysql支持单向、异步复制,复制过程中一个服务器充当主服务器,而一个或多个其它服务器充当从服务器', '2016-09-06 00:00:00'), 
    (3, 'hello', 'can you search me?', '2016-09-05 00:00:00'), 
    (4, 'mysql', 'mysql is the best database?', '2016-09-03 00:00:00'), 
    (5, 'mysql索引', '关于MySQL索引的好处,如果正确合理设计并且使用索引的MySQL是一辆兰博基尼的话,那么没有设计和使用索引的MySQL就是一个人力三轮车', '2016-09-01 00:00:00'), 
    (6, '集群', '关于MySQL索引的好处,如果正确合理设计并且使用索引的MySQL是一辆兰博基尼的话,那么没有设计和使用索引的MySQL就是一个人力三轮车', '0000-00-00 00:00:00'), 
    (9, '复制原理', 'redis也有复制', '0000-00-00 00:00:00'), 
    (10, 'redis集群', '集群技术是构建高性能网站架构的重要手段,试想在网站承受高并发访问压力的同时,还需要从海量数据中查询出满足条件的数据,并快速响应,我们必然想到的是将数据进行切片,把数据根据某种规则放入多个不同的服务器节点,来降低单节点服务器的压力', '0000-00-00 00:00:00'); 
    
    CREATE TABLE IF NOT EXISTS `sph_counter` ( 
    `counter_id` int(11) NOT NULL, 
    `max_doc_id` int(11) NOT NULL, 
    PRIMARY KEY (`counter_id`) 
    ) ENGINE=MyISAM DEFAULT CHARSET=utf8 COMMENT='增量索引标示的计数表';

    以下采用"Main + Delta" ("主索引"+"增量索引")的索引策略,使用Sphinx自带的一元分词。
    6、Sphinx配置:注意修改数据源配置信息
     

    vi /usr/local/sphinx/etc/sphinx.conf 
    source items { 
    type = mysql 
    sql_host = localhost 
    sql_user = root 
    sql_pass = 123456 
    sql_db = sphinx_items 
    
    sql_query_pre = SET NAMES utf8 
    sql_query_pre = SET SESSION query_cache_type = OFF 
    sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(id) FROM items 
    
    sql_query_range = SELECT MIN(id), MAX(id) FROM items  
    WHERE id<=(SELECT max_doc_id FROM sph_counter WHERE counter_id=1) 
    sql_range_step = 1000 
    sql_ranged_throttle = 1000 
    
    sql_query = SELECT id, title, content, created, 0 as deleted FROM items  
    WHERE id<=(SELECT max_doc_id FROM sph_counter WHERE counter_id=1)  
    AND id >= $start AND id <= $end 
    
    sql_attr_timestamp = created 
    sql_attr_bool = deleted 
    } 
    
    source items_delta : items { 
    sql_query_pre = SET NAMES utf8 
    sql_query_range = SELECT MIN(id), MAX(id) FROM items  
    WHERE id > (SELECT max_doc_id FROM sph_counter WHERE counter_id=1) 
    sql_query = SELECT id, title, content, created, 0 as deleted FROM items  
    WHERE id>( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 )  
    AND id >= $start AND id <= $end 
    sql_query_post_index = set @max_doc_id :=(SELECT max_doc_id FROM sph_counter WHERE counter_id=1) 
    sql_query_post_index = REPLACE INTO sph_counter SELECT 2, IF($maxid, $maxid, @max_doc_id) 
    } 
    #主索引 
    index items { 
    source = items 
    path = /usr/local/sphinx/var/data/items 
    docinfo = extern 
    morphology = none 
    min_word_len = 1 
    min_prefix_len = 0 
    html_strip = 1 
    html_remove_elements = style, script 
    ngram_len = 1 
    ngram_chars = U+3000..U+2FA1F 
    charset_type = utf-8 
    charset_table = 0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F 
    preopen = 1 
    min_infix_len = 1 
    } 
    
    #增量索引 
    index items_delta : items { 
    source = items_delta 
    path = /usr/local/sphinx/var/data/items-delta 
    } 
    
    #分布式索引 
    index master { 
    type = distributed 
    local = items 
    local = items_delta 
    } 
    
    indexer { 
    mem_limit = 256M 
    } 
    
    searchd { 
    listen = 9312 
    listen = 9306:mysql41 #Used for SphinxQL 
    log = /usr/local/sphinx/var/log/searchd.log 
    query_log = /usr/local/sphinx/var/log/query.log 
    compat_sphinxql_magics = 0 
    attr_flush_period = 600 
    mva_updates_pool = 16M 
    read_timeout = 5 
    max_children = 0 
    dist_threads = 2 
    pid_file = /usr/local/sphinx/var/log/searchd.pid 
    max_matches = 1000 
    seamless_rotate = 1 
    preopen_indexes = 1 
    unlink_old = 1 
    workers = threads # for RT to work 
    binlog_path = /usr/local/sphinx/var/data 
    
    } 

    保存退出

    7、Sphinx创建索引

    #第一次需重建索引: 
    [root@localhost bin]# ./indexer -c /usr/local/sphinx/etc/sphinx.conf --all 
    Sphinx 2.0.7-id64-release (r3759) 
    Copyright (c) 2001-2012, Andrew Aksyonoff 
    Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com) 
    
    using config file '/usr/local/sphinx/etc/sphinx.conf'... 
    indexing index 'items'... 
    collected 8 docs, 0.0 MB 
    sorted 0.0 Mhits, 100.0% done 
    total 8 docs, 1121 bytes 
    total 1.017 sec, 1101 bytes/sec, 7.86 docs/sec 
    indexing index 'items_delta'... 
    collected 0 docs, 0.0 MB 
    total 0 docs, 0 bytes 
    total 1.007 sec, 0 bytes/sec, 0.00 docs/sec 
    skipping non-plain index 'master'... 
    total 4 reads, 0.000 sec, 0.7 kb/call avg, 0.0 msec/call avg 
    total 14 writes, 0.001 sec, 0.5 kb/call avg, 0.1 msec/call avg 
    #启动sphinx 
    [root@localhost bin]# ./searchd -c /usr/local/sphinx/etc/sphinx.conf 
    Sphinx 2.0.7-id64-release (r3759) 
    Copyright (c) 2001-2012, Andrew Aksyonoff 
    Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com) 
    
    using config file '/usr/local/sphinx/etc/sphinx.conf'... 
    listening on all interfaces, port=9312 
    listening on all interfaces, port=9306 
    precaching index 'items' 
    precaching index 'items_delta' 
    rotating index 'items_delta': success 
    precached 2 indexes in 0.012 sec 


    #查看进程

    [root@localhost bin]# ps -ef | grep searchd 
    root 30431 1 0 23:59 ? 00:00:00 ./searchd -c /usr/local/sphinx/etc/sphinx.conf 
    root 30432 30431 0 23:59 ? 00:00:00 ./searchd -c /usr/local/sphinx/etc/sphinx.conf 
    root 30437 1490 0 23:59 pts/0 00:00:00 grep searchd 
    #停止Searchd: 
    ./searchd -c /usr/local/sphinx/etc/sphinx.conf --stop 
    #查看Searchd状态: 
    ./searchd -c /usr/local/sphinx/etc/sphinx.conf --status

    索引更新及使用说明
    "增量索引"每N分钟更新一次.通常在每天晚上低负载的时进行一次索引合并,同时重新建立"增量索引"。当然"主索引"数据不多的话,也可以直接重新建立"主索引"。
    API搜索的时,同时使用"主索引"和"增量索引",这样可以获得准实时的搜索数据.本文的Sphinx配置将"主索引"和"增量索引"放到分布式索引master中,因此只需查询分布式索引"master"即可获得全部匹配数据(包括最新数据)。

    索引的更新与合并的操作可以放到cron job完成:
     

    crontab -e 
    */1 * * * * /usr/local/sphinx/shell/delta_index_update.sh 
    0 3 * * * /usr/local/sphinx/shell/merge_daily_index.sh 
    crontab -l
    
    cron job所用的shell脚本例子:
    
    delta_index_update.sh:
    [php] view plain copy
    #!/bin/bash 
    /usr/local/sphinx/bin/indexer -c /usr/local/sphinx/etc/sphinx.conf --rotate items_delta > /dev/null 2>&1 
    merge_daily_index.sh:
    [php] view plain copy
    #!/bin/bash 
    indexer=`which indexer` 
    mysql=`which mysql` 
    
    QUERY="use sphinx_items;select max_doc_id from sph_counter where counter_id = 2 limit 1;" 
    index_counter=$($mysql -h192.168.1.198 -uroot -p123456 -sN -e "$QUERY") 
    
    #merge "main + delta" indexes 
    $indexer -c /usr/local/sphinx/etc/sphinx.conf --rotate --merge items items_delta --merge-dst-range deleted 0 0 >> /usr/local/sphinx/var/index_merge.log 2>&1 
    
    if [ "$?" -eq 0 ]; then 
    ##update sphinx counter 
    if [ ! -z $index_counter ]; then 
    $mysql -h192.168.1.198 -uroot -p123456 -Dsphinx_items -e "REPLACE INTO sph_counter VALUES (1, '$index_counter')" 
    fi 
    ##rebuild delta index to avoid confusion with main index 
    $indexer -c /usr/local/sphinx/etc/sphinx.conf --rotate items_delta >> /usr/local/sphinx/var/rebuild_deltaindex.log 2>&1 
    fi

    8、php中文分词scws安装:注意扩展的版本和php的版本

    wget -c http://www.xunsearch.com/scws/down/scws-1.2.3.tar.bz2 
    tar jxvf scws-1.2.3.tar.bz2 
    cd scws-1.2.3 
    ./configure --prefix=/usr/local/scws 
    make && make install

    9、scws的PHP扩展安装:

    cd ./phpext 
    phpize 
    ./configure 
    make && make install 
    echo "[scws]" >> /etc/php.ini 
    echo "extension = scws.so" >> /etc/php.ini 
    echo "scws.default.charset = utf-8" >> /etc/php.ini 
    echo "scws.default.fpath = /usr/local/scws/etc/" >> /etc/php.ini

    10、词库安装:

    wget http://www.xunsearch.com/scws/down/scws-dict-chs-utf8.tar.bz2 
    tar jxvf scws-dict-chs-utf8.tar.bz2 -C /usr/local/scws/etc/ 
    chown www:www /usr/local/scws/etc/dict.utf8.xdb

    11、php使用Sphinx+scws测试例子
    在Sphinx源码API中,有好几种语言的API调用.其中有一个是sphinxapi.php。
    不过以下的测试使用的是Sphinx的PHP扩展.具体安装见本文开头的Sphinx安装部分。
    测试用的搜索类Search.php:注意修改getDBConnection()信息为自己的

    <?php 
    class Search { 
    /** 
    * @var SphinxClient 
    **/ 
    protected $client; 
    /** 
    * @var string 
    **/ 
    protected $keywords; 
    /** 
    * @var resource 
    **/ 
    private static $dbconnection = null; 
    
    /** 
    * Constructor 
    **/ 
    public function __construct($options = array()) { 
    $defaults = array( 
    'query_mode' => SPH_MATCH_EXTENDED2, 
    'sort_mode' => SPH_SORT_EXTENDED, 
    'ranking_mode' => SPH_RANK_PROXIMITY_BM25, 
    'field_weights' => array(), 
    'max_matches' => 1000, 
    'snippet_enabled' => true, 
    'snippet_index' => 'items', 
    'snippet_fields' => array(), 
    ); 
    $this->options = array_merge($defaults, $options); 
    $this->client = new SphinxClient(); 
    //$this->client->setServer("192.168.1.198", 9312); 
    $this->client->setMatchMode($this->options['query_mode']); 
    if ($this->options['field_weights'] !== array()) { 
    $this->client->setFieldWeights($this->options['field_weights']); 
    } 
    /* 
    if ( in_array($this->options['query_mode'], [SPH_MATCH_EXTENDED2,SPH_MATCH_EXTENDED]) ) { 
    $this->client->setRankingMode($this->options['ranking_mode']); 
    } 
    */ 
    } 
    
    /** 
    * Query 
    * 
    * @param string $keywords 
    * @param integer $offset 
    * @param integer $limit 
    * @param string $index 
    * @return array 
    **/ 
    public function query($keywords, $offset = 0, $limit = 10, $index = '*') { 
    $this->keywords = $keywords; 
    $max_matches = $limit > $this->options['max_matches'] ? $limit : $this->options['max_matches']; 
    $this->client->setLimits($offset, $limit, $max_matches); 
    $query_results = $this->client->query($keywords, $index); 
    
    if ($query_results === false) { 
    $this->log('error:' . $this->client->getLastError()); 
    } 
    
    $res = []; 
    if ( empty($query_results['matches']) ) { 
    return $res; 
    } 
    $res['total'] = $query_results['total']; 
    $res['total_found'] = $query_results['total_found']; 
    $res['time'] = $query_results['time']; 
    $doc_ids = array_keys($query_results['matches']); 
    unset($query_results); 
    $res['data'] = $this->fetch_data($doc_ids); 
    if ($this->options['snippet_enabled']) { 
    $this->buildExcerptRows($res['data']); 
    } 
    
    return $res; 
    } 
    
    /** 
    * custom sorting 
    * 
    * @param string $sortBy 
    * @param int $mode 
    * @return bool 
    **/ 
    public function setSortBy($sortBy = '', $mode = 0) { 
    if ($sortBy) { 
    $mode = $mode ?: $this->options['sort_mode']; 
    $this->client->setSortMode($mode, $sortBy); 
    } else { 
    $this->client->setSortMode(SPH_SORT_RELEVANCE); 
    } 
    } 
    
    /** 
    * fetch data based on matched doc_ids 
    * 
    * @param array $doc_ids 
    * @return array 
    **/ 
    protected function fetch_data($doc_ids) { 
    $ids = implode(',', $doc_ids); 
    $queries = self::getDBConnection()->query("SELECT * FROM items WHERE id in ($ids)", PDO::FETCH_ASSOC); 
    return iterator_to_array($queries); 
    } 
    
    /** 
    * build excerpts for data 
    * 
    * @param array $rows 
    * @return array 
    **/ 
    protected function buildExcerptRows(&$rows) { 
    $options = array( 
    'before_match' => '<b style="color:red">', 
    'after_match' => '</b>', 
    'chunk_separator' => '...', 
    'limit' => 256, 
    'around' => 3, 
    'exact_phrase' => false, 
    'single_passage' => true, 
    'limit_words' => 5, 
    ); 
    $scount = count($this->options['snippet_fields']); 
    foreach ($rows as &$row) { 
    foreach ($row as $fk => $item) { 
    if (!is_string($item) || ($scount && !in_array($fk, $this->options['snippet_fields'])) ) continue; 
    $item = preg_replace('/[
    	
    ]+/', '', strip_tags($item)); 
    $res = $this->client->buildExcerpts(array($item), $this->options['snippet_index'], $this->keywords, $options); 
    $row[$fk] = $res === false ? $item : $res[0]; 
    } 
    } 
    return $rows; 
    } 
    
    /** 
    * database connection 
    * 
    * @return resource 
    **/ 
    private static function getDBConnection() { 
    $dsn = 'mysql:host=192.168.1.198;dbname=sphinx_items'; 
    $user = 'root'; 
    $pass = '123456'; 
    if (!self::$dbconnection) { 
    try { 
    self::$dbconnection = new PDO($dsn, $user, $pass); 
    } catch (PDOException $e) { 
    die('Connection failed: ' . $e->getMessage()); 
    } 
    } 
    return self::$dbconnection; 
    } 
    
    /** 
    * Chinese words segmentation 
    * 
    **/ 
    public function wordSplit($keywords) { 
    $fpath = ini_get('scws.default.fpath'); 
    $so = scws_new(); 
    $so->set_charset('utf-8'); 
    $so->add_dict($fpath . '/dict.utf8.xdb'); 
    //$so->add_dict($fpath .'/custom_dict.txt', SCWS_XDICT_TXT); 
    $so->set_rule($fpath . '/rules.utf8.ini'); 
    $so->set_ignore(true); 
    $so->set_multi(false); 
    $so->set_duality(false); 
    $so->send_text($keywords); 
    $words = []; 
    $results = $so->get_result(); 
    foreach ($results as $res) { 
    $words[] = '(' . $res['word'] . ')'; 
    } 
    $words[] = '(' . $keywords . ')'; 
    return join('|', $words); 
    } 
    
    /** 
    * get current sphinx client 
    * 
    * @return resource 
    **/ 
    public function getClient() { 
    return $this->client; 
    } 
    /** 
    * log error 
    **/ 
    public function log($msg) { 
    // log errors here 
    //echo $msg; 
    } 
    /** 
    * magic methods 
    **/ 
    public function __call($method, $args) { 
    $rc = new ReflectionClass('SphinxClient'); 
    if ( !$rc->hasMethod($method) ) { 
    throw new Exception('invalid method :' . $method); 
    } 
    return call_user_func_array(array($this->client, $method), $args); 
    } 
    }

    测试文件test.php:

    <?php 
    require('Search.php'); 
    $s = new Search([ 
    'snippet_fields' => ['title', 'content'], 
    'field_weights' => ['title' => 20, 'content' => 10], 
    ]); 
    $s->setSortMode(SPH_SORT_EXTENDED, 'created desc,@weight desc'); 
    //$s->setSortBy('created desc,@weight desc'); 
    $words = $s->wordSplit("mysql集群");//先分词 结果:(mysql)|(mysql集群) 
    //print_r($words);exit; 
    $res = $s->query($words, 0, 10, 'master'); 
    echo '<pre/>';print_r($res); 

    测试结果:

    12、SphinxQL测试
    要使用SphinxQL需要在Searchd的配置里面增加相应的监听端口(参考上文配置)。

    [root@localhost bin]# mysql -h127.0.0.1 -P9306 -uroot -p 
    Enter password: 
    Welcome to the MySQL monitor. Commands end with ; or g. 
    Your MySQL connection id is 1 
    Server version: 2.0.7-id64-release (r3759) 
    
    Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved. 
    
    Oracle is a registered trademark of Oracle Corporation and/or its 
    affiliates. Other names may be trademarks of their respective 
    owners. 
    
    Type 'help;' or 'h' for help. Type 'c' to clear the current input statement. 
    
    mysql> show global variables; 
    +----------------------+---------+ 
    | Variable_name | Value | 
    +----------------------+---------+ 
    | autocommit | 1 | 
    | collation_connection | libc_ci | 
    | query_log_format | plain | 
    | log_level | info | 
    +----------------------+---------+ 
    4 rows in set (0.00 sec) 
    
    mysql> desc items; 
    +---------+-----------+ 
    | Field | Type | 
    +---------+-----------+ 
    | id | bigint | 
    | title | field | 
    | content | field | 
    | created | timestamp | 
    | deleted | bool | 
    +---------+-----------+ 
    5 rows in set (0.00 sec) 
    
    mysql> select * from master where match ('mysql集群') limit 10; 
    +------+---------+---------+ 
    | id | created | deleted | 
    +------+---------+---------+ 
    | 1 | 2016 | 0 | 
    | 6 | 0 | 0 | 
    +------+---------+---------+ 
    2 rows in set (0.00 sec) 
    
    mysql> show meta; 
    +---------------+-------+ 
    | Variable_name | Value | 
    +---------------+-------+ 
    | total | 2 | 
    | total_found | 2 | 
    | time | 0.006 | 
    | keyword[0] | mysql | 
    | docs[0] | 5 | 
    | hits[0] | 15 | 
    | keyword[1] | 集 | 
    | docs[1] | 3 | 
    | hits[1] | 4 | 
    | keyword[2] | 群 | 
    | docs[2] | 3 | 
    | hits[2] | 4 | 
    +---------------+-------+ 
    12 rows in set (0.00 sec) 
    
    mysql>
  • 相关阅读:
    2021.6.2 Python网络编程
    2021.6.1 数据库
    2021.5.29 PHP大作业
    2021.5.28 bootstrap和vue
    2021.5.27 三个和尚
    2021.5.26 Python操作Mysql数据库
    2021.5.25 PHP作业
    2021.5.24 Python解析XML
    如何用 GPU 训练模型?
    44 内核中的中断处理(上)
  • 原文地址:https://www.cnblogs.com/saonian/p/8204932.html
Copyright © 2020-2023  润新知