• Search engine: Zend_Lucene


    Zend Lucene

     

    1. General

    Zend_Search_Lucene is a general-purpose text search engine written entirely in PHP 5. It stores its index on the filesystem and does not require a database server.

    2. How to install Zend Lucene

    Download site: http://www.zend.com/community/downloads

    Zend Framework version: Zend Framework 1.9 minimal

    Download the Zend Framework 1.9 minimal package from the download site.

    Delete everything in the Zend folder except the following files and directories:

    Exception.php

    Loader/

    Loader.php

    Search/
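
    For require_once 'Zend/Search/Lucene.php' to resolve, the directory that contains the trimmed Zend folder has to be on PHP's include path. The following is a minimal sketch, assuming the Zend folder was copied into a hypothetical "library" directory next to the script; the directory name is an illustration only and does not come from the original setup.

    ```php
    <?php
    // Assumption: the trimmed "Zend" folder lives in ./library (hypothetical layout).
    $libraryDir = dirname(__FILE__) . DIRECTORY_SEPARATOR . 'library';

    // Prepend the library directory so that "Zend/Search/Lucene.php" can be found.
    set_include_path($libraryDir . PATH_SEPARATOR . get_include_path());

    // Load the search component only if the file is really there.
    if (is_file($libraryDir . DIRECTORY_SEPARATOR . 'Zend/Search/Lucene.php'))
    {
        require_once 'Zend/Search/Lucene.php';
    }
    ```

    Prepending keeps this copy of the Zend folder ahead of any other copy that may already be on the include path.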

     

    3. How to create an index

    An example of creating an index is shown below:

    <?php
    // File name: createindex.php
    require_once 'Zend/Search/Lucene.php';

    // Sample product data; "lan" holds the language code of each record.
    $productsData = array(
        array("PID" => 1, "url" => "http://www.cybozu.jp", "productName" => "garoon", "Description" => "garoon Description", "lan" => "en"),
        array("PID" => 2, "url" => "http://www.cybozu.jp", "productName" => "share360", "Description" => "share360 Description", "lan" => "en"),
        array("PID" => 3, "url" => "http://www.cybozu.jp a", "productName" => "日本語の製品名前", "Description" => "日本語の製品", "lan" => "jp"),
        array("PID" => 4, "url" => "http://www.cybozu.jp a", "productName" => "中文产品名", "Description" => "中文产品描述", "lan" => "zh")
    );

    // The second argument (true) creates a new index in the "index" directory.
    $index = new Zend_Search_Lucene('index', true);

    foreach ($productsData as $productData)
    {
        // Build a fresh document for every product so that fields do not accumulate.
        $doc = new Zend_Search_Lucene_Document();
        $doc->addField(Zend_Search_Lucene_Field::keyword('PID', $productData['PID'], 'UTF-8'));
        $doc->addField(Zend_Search_Lucene_Field::text('url', $productData['url'], 'UTF-8'));
        $doc->addField(Zend_Search_Lucene_Field::text('productName', $productData['productName'], 'UTF-8'));
        $doc->addField(Zend_Search_Lucene_Field::text('Description', $productData['Description'], 'UTF-8'));
        // "lan" is stored with the document but not indexed.
        $doc->addField(Zend_Search_Lucene_Field::unIndexed('lan', $productData['lan'], 'UTF-8'));
        $index->addDocument($doc);
    }

    // Commit and optimize once, after all documents have been added.
    $index->commit();
    $index->optimize();

    echo 'index has been created!';

    In the KB project the index data comes from a database. With the method above, all of the text stored in the database can be indexed.
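
    The database case can be sketched as follows. This is only an illustration: the PDO connection, the "products" table and its column names are assumptions chosen to match the field names of the sample data, and indexProducts() is a hypothetical helper, not part of Zend Framework.

    ```php
    <?php
    require_once 'Zend/Search/Lucene.php';

    /**
     * Add every row of an assumed "products" table to the index.
     * Returns the number of documents added.
     */
    function indexProducts($pdo, $index)
    {
        // Assumed schema: products(PID, url, productName, Description, lan)
        $rows = $pdo->query('SELECT PID, url, productName, Description, lan FROM products');

        $count = 0;
        foreach ($rows as $row)
        {
            $doc = new Zend_Search_Lucene_Document();
            $doc->addField(Zend_Search_Lucene_Field::keyword('PID', $row['PID'], 'UTF-8'));
            $doc->addField(Zend_Search_Lucene_Field::text('url', $row['url'], 'UTF-8'));
            $doc->addField(Zend_Search_Lucene_Field::text('productName', $row['productName'], 'UTF-8'));
            $doc->addField(Zend_Search_Lucene_Field::text('Description', $row['Description'], 'UTF-8'));
            $doc->addField(Zend_Search_Lucene_Field::unIndexed('lan', $row['lan'], 'UTF-8'));
            $index->addDocument($doc);
            $count++;
        }

        // Commit once after the whole result set has been added.
        $index->commit();
        return $count;
    }
    ```

    It could be called as indexProducts($pdo, new Zend_Search_Lucene('index', true)), where $pdo is an open PDO connection to the project's database.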

     

    4. Searching the index

    After the index has been created, it can be searched as follows:

    <?php
    // File name: search.php
    require_once 'Zend/Search/Lucene.php';

    // Open the existing index.
    $index = new Zend_Search_Lucene('index');
    $keywords = 'garoon';

    echo "Index contains {$index->count()} documents.\n";

    // Parse the query string as UTF-8, then run it against the index.
    $query = Zend_Search_Lucene_Search_QueryParser::parse($keywords, 'utf-8');
    $hits = $index->find($query);

    foreach ($hits as $hit)
    {
        echo 'PID: ' . $hit->PID . '<br>';
        echo 'Score: ' . $hit->score . '<br>';
        echo 'url: ' . $hit->url . '<br>';
        echo 'productName: ' . $hit->productName . '<br>';
        echo 'lan: ' . $hit->lan . '<br>';
    }

    To search text in multiple languages, read the stored lan value of each hit and display the results according to it.
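
    One way to display results per language is to group the hits by their lan value first. The sketch below uses a hypothetical helper, groupHitsByLanguage(), and plain stand-in objects in place of real search hits; it only assumes that every hit exposes a lan property, as the hits returned for the index built above do.

    ```php
    <?php
    /**
     * Group search hits by their stored "lan" value.
     * Hypothetical helper; it only reads the lan property of each hit.
     */
    function groupHitsByLanguage($hits)
    {
        $groups = array();
        foreach ($hits as $hit)
        {
            $groups[$hit->lan][] = $hit;
        }
        return $groups;
    }

    // Stand-in data for demonstration only; real code would pass $index->find($query).
    $sampleHits = array(
        (object) array('PID' => 1, 'productName' => 'garoon', 'lan' => 'en'),
        (object) array('PID' => 3, 'productName' => '日本語の製品名前', 'lan' => 'jp'),
        (object) array('PID' => 2, 'productName' => 'share360', 'lan' => 'en'),
    );

    $byLanguage = groupHitsByLanguage($sampleHits);
    foreach ($byLanguage as $lan => $group)
    {
        echo $lan . ': ' . count($group) . " hit(s)\n";
    }
    ```

    Because lan is stored as an unIndexed field, it cannot be used in the query itself; grouping or filtering has to happen on the returned hits, as sketched here.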

     

    5. Deleting and updating the index

    Zend_Search_Lucene cannot modify a document in place. To update a document, first find it in the index by its keyword and delete it, then add a new document in its place. The example below deletes the product with PID 1 and adds it again with an updated description.

    <?php
    require_once 'Zend/Search/Lucene.php';

    $index = new Zend_Search_Lucene('index');

    // New product data that replaces the old document
    $productNewData = array("PID" => 1, "url" => "http://www.cybozu.jp", "productName" => "garoon", "Description" => "update garoon Description", "lan" => "en");

    // Find the old document by its PID keyword
    $keywords = "PID:1";
    $hits = $index->find($keywords);

    // Delete every document with PID 1
    foreach ($hits as $hit)
    {
        echo 'PID: ' . $hit->PID . ' has been deleted <br>';
        $index->delete($hit->id);
    }
    $index->commit();

    // Add the new product data to the index
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::keyword('PID', $productNewData['PID'], 'UTF-8'));
    $doc->addField(Zend_Search_Lucene_Field::text('url', $productNewData['url'], 'UTF-8'));
    $doc->addField(Zend_Search_Lucene_Field::text('productName', $productNewData['productName'], 'UTF-8'));
    $doc->addField(Zend_Search_Lucene_Field::text('Description', $productNewData['Description'], 'UTF-8'));
    $doc->addField(Zend_Search_Lucene_Field::unIndexed('lan', $productNewData['lan'], 'UTF-8'));
    $index->addDocument($doc);

    $index->commit();
    $index->optimize();
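
    The query string "PID:1" is passed through the current analyzer, and an analyzer that keeps only letters may drop the digit, in which case nothing is found. A more defensive variant, sketched below, builds the term query through the API so that the keyword value is matched exactly. deleteByKeyword() is a hypothetical helper and not part of Zend Framework.

    ```php
    <?php
    require_once 'Zend/Search/Lucene.php';
    require_once 'Zend/Search/Lucene/Index/Term.php';
    require_once 'Zend/Search/Lucene/Search/Query/Term.php';

    /**
     * Delete every document whose keyword field equals $value.
     * Returns the number of deleted documents.
     */
    function deleteByKeyword($index, $fieldName, $value)
    {
        // Build the term directly, so the value is not tokenized by the analyzer.
        $term  = new Zend_Search_Lucene_Index_Term((string) $value, $fieldName);
        $query = new Zend_Search_Lucene_Search_Query_Term($term);

        $deleted = 0;
        foreach ($index->find($query) as $hit)
        {
            $index->delete($hit->id);
            $deleted++;
        }

        $index->commit();
        return $deleted;
    }
    ```

    It could be called as deleteByKeyword($index, 'PID', 1) before the updated document is added.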

     

    6. How to search Japanese or Chinese text with Lucene

    By default, Zend Lucene's analyzer handles only English text. This project has to search English, Japanese and Chinese text, so the default analyzer must be replaced.

    The following class extends Lucene's common analyzer base class:

    <?php
    // File name: chinese.php
    require_once 'Zend/Search/Lucene/Analysis/Analyzer.php';
    require_once 'Zend/Search/Lucene/Analysis/Analyzer/Common.php';

    class CN_Lucene_Analyzer extends Zend_Search_Lucene_Analysis_Analyzer_Common
    {
        private $_position;
        private $_cnStopWords = array();

        public function setCnStopWords($cnStopWords)
        {
            $this->_cnStopWords = $cnStopWords;
        }

        /**
        * Reset token stream
        */
        public function reset()
        {
            $this->_position = 0;
            // Punctuation stripped before tokenizing. Entries that were lost to
            // encoding damage in the source listing are omitted; spaces are kept
            // because they separate words.
            $search = array(",", "/", "\\", ".", ";", ":", "\"", "!", "~", "`", "^", "(", ")", "?", "-", "'", "<", ">", "$", "&", "%", "#", "@", "+", "=", "{", "}", "[", "]", "“", "”", "‘", "’", "—", "…");

            $this->_input = str_replace($search, '', $this->_input);
            // Stop words become spaces so that they split the surrounding text.
            $this->_input = str_replace($this->_cnStopWords, ' ', $this->_input);
        }

        /**
        * Tokenization stream API
        * Get next token
        * Returns null at the end of stream
        *
        * @return Zend_Search_Lucene_Analysis_Token|null
        */
        public function nextToken()
        {
            if ($this->_input === null)
            {
                return null;
            }
            $len = strlen($this->_input);
            while ($this->_position < $len)
            {
                // Skip spaces at the beginning
                while ($this->_position < $len && $this->_input[$this->_position] == ' ')
                {
                    $this->_position++;
                }
                // Stop if only trailing spaces were left
                if ($this->_position >= $len)
                {
                    break;
                }
                $termStartPosition = $this->_position;
                $temp_char = $this->_input[$this->_position];
                $isCnWord = false;
                if (ord($temp_char) > 127)
                {
                    // Non-ASCII run: take two characters at a time, assuming each
                    // character occupies 3 bytes in UTF-8 (true for most CJK text)
                    $i = 0;
                    while ($this->_position < $len && ord($this->_input[$this->_position]) > 127)
                    {
                        $this->_position = $this->_position + 3;
                        $i++;
                        if ($i == 2)
                        {
                            $isCnWord = true;
                            break;
                        }
                    }
                    // A single remaining character does not form a token
                    if ($i == 1) continue;
                }
                else
                {
                    // ASCII run: consume consecutive letters and digits
                    while ($this->_position < $len && ctype_alnum($this->_input[$this->_position]))
                    {
                        $this->_position++;
                    }
                }
                if ($this->_position == $termStartPosition)
                {
                    // Unhandled character: skip it
                    $this->_position++;
                    continue;
                }

                $tmp_str = substr($this->_input, $termStartPosition, $this->_position - $termStartPosition);
                $token = new Zend_Search_Lucene_Analysis_Token($tmp_str, $termStartPosition, $this->_position);
                $token = $this->normalize($token);

                if ($isCnWord)
                {
                    // Step back one character so that consecutive tokens overlap
                    $this->_position = $this->_position - 3;
                }

                if ($token !== null)
                {
                    return $token;
                }
            }

            return null;
        }
    }

     

    With chinese.php in place, Japanese and Chinese text can be searched in the KB. The following code must also run before both creating the index and searching it:

     

    require_once 'chinese.php';

    Zend_Search_Lucene_Analysis_Analyzer::setDefault(new CN_Lucene_Analyzer());
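
    The reason this works for Chinese and Japanese is that the analyzer emits every pair of adjacent non-ASCII characters as a token, so a query of two or more characters can match wherever those characters appear next to each other. The stand-alone sketch below illustrates that splitting only; cjkBigrams() is a hypothetical function written for this explanation and is not the analyzer itself.

    ```php
    <?php
    /**
     * Split a run of CJK characters into overlapping two-character tokens.
     * Illustrative stand-alone function, independent of Zend Framework.
     */
    function cjkBigrams($text)
    {
        // Split the UTF-8 string into single characters.
        $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);

        $tokens = array();
        for ($i = 0; $i + 1 < count($chars); $i++)
        {
            // Each token is the current character plus the next one.
            $tokens[] = $chars[$i] . $chars[$i + 1];
        }
        return $tokens;
    }

    // Prints the tokens 中文, 文产, 产品, 品名
    print_r(cjkBigrams('中文产品名'));
    ```

    As in the analyzer, a single isolated character produces no token, so one-character queries in Chinese or Japanese will not match.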

     

    7. Does Zend Lucene require downtime?

      No. With Zend Lucene no downtime is needed. When a new article is added, it can be added to the index at the same time. When an article is edited, the old document is deleted from the index and the new one is added.

     

     

     

  • Original article: https://www.cnblogs.com/likwo/p/1591319.html