Lucene-Analyzer


    A Lucene text analyzer splits a piece of text into tokens. Search engines retrieve documents by token, so the quality of the analyzer directly determines both the precision and the speed of search.

    1. A simple demo

        private static final String[] examples = {
                "The quick brown 1234 fox jumped over the lazy dog!",
                "XY&Z 15.6 Corporation - xyz@example.com",
                "北京市北京大学" };
        private static final Analyzer[] ANALYZERS = new Analyzer[] {
                new WhitespaceAnalyzer(),     // splits on whitespace
                new SimpleAnalyzer(),         // splits on non-letters
                new StopAnalyzer(),           // splits on non-letters and removes stop words
                new StandardAnalyzer(),       // Unicode text segmentation
                new CJKAnalyzer(),            // CJK (Chinese/Japanese/Korean) bigram segmentation
                new SmartChineseAnalyzer() }; // Simplified Chinese word segmentation

        @Test
        public void testAnalyzer() throws IOException {
            for (int i = 0; i < ANALYZERS.length; i++) {
                String simpleName = ANALYZERS[i].getClass().getSimpleName();
                for (int j = 0; j < examples.length; j++) {
                    // TokenStream is the intermediate data format of the analysis chain; it reads
                    // text from a Reader. Tokenizer and TokenFilter both extend TokenStream.
                    TokenStream contents = ANALYZERS[i].tokenStream("contents", new StringReader(examples[j]));
                    // Add attributes to expose the details of each token.
                    // OffsetAttribute holds the start/end character offsets of the token in the original text.
                    OffsetAttribute offsetAttribute = contents.addAttribute(OffsetAttribute.class);
                    // TypeAttribute holds the lexical type of the token; the default is "word".
                    TypeAttribute typeAttribute = contents.addAttribute(TypeAttribute.class);
                    contents.reset();
                    System.out.println(" " + simpleName + " analyzing : " + examples[j]);
                    while (contents.incrementToken()) {
                        String s1 = offsetAttribute.toString();
                        int i1 = offsetAttribute.startOffset(); // start offset
                        int i2 = offsetAttribute.endOffset();   // end offset
                        System.out.println(" " + s1 + "[" + i1 + "," + i2 + ":" + typeAttribute.type() + "]" + " ");
                    }
                    // After incrementToken() has finished iterating, call end() so the stream can
                    // perform its end-of-stream work, then close() to release the resources used
                    // during analysis.
                    contents.end();
                    contents.close();
                    System.out.println();
                }
            }
        }

    2. Understanding the TokenStream attributes

         After calling tokenStream(), you can add multiple Attributes to inspect the details of each token; for example, CharTermAttribute holds the token's text and TypeAttribute holds its lexical type.

    CharTermAttribute             the text of the token itself
    PositionIncrementAttribute    the position of the current token relative to the previous one, i.e. how many
                                  words apart they are (for "text for attribute", the increment from text to
                                  attribute is 2); if there is no stop word between two tokens, the value is the
                                  default 1 (see the sketch after this list)
    OffsetAttribute               the start and end character offsets of the token in the original text
    TypeAttribute                 the lexical type of the token; the default is "word", other values include
                                  <ALPHANUM> <APOSTROPHE> <ACRONYM> <COMPANY> <EMAIL> <HOST> <NUM> <CJ> <ACRONYM_DEP>
    FlagsAttribute                similar to TypeAttribute; if you need to attach extra information to a token
                                  and pass it along the analysis chain, you can carry it in the flags
    PayloadAttribute              stores a payload (per-position metadata) at each index position, which is
                                  useful for scoring when running payload-based queries
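
        As a minimal sketch (assuming the same Lucene 6.x API used elsewhere in this post), the
        following test verifies the PositionIncrementAttribute behavior described above: StopAnalyzer
        removes the stop word "for", so the increment of "attribute" relative to "text" is 2.

        @Test
        public void testPositionIncrement() throws IOException {
            Analyzer analyzer = new StopAnalyzer(); // LowerCaseTokenizer + English stop words
            TokenStream stream = analyzer.tokenStream("text", new StringReader("text for attribute"));
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            PositionIncrementAttribute posIncr = stream.addAttribute(PositionIncrementAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                // expected output: "text increment:1" then "attribute increment:2"
                System.out.println(term + " increment:" + posIncr.getPositionIncrement());
            }
            stream.end();
            stream.close();
        }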

        @Test
        public void testAttribute() throws IOException {
            Analyzer analyzer = new StandardAnalyzer();
            String input = "This is a test text for attribute! Just add-some word.";
            TokenStream tokenStream = analyzer.tokenStream("text", new StringReader(input));
            
            CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);        
            PositionIncrementAttribute positionIncrementAttribute = tokenStream.addAttribute(PositionIncrementAttribute.class);
            OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
            TypeAttribute typeAttribute = tokenStream.addAttribute(TypeAttribute.class);
            PayloadAttribute payloadAttribute = tokenStream.addAttribute(PayloadAttribute.class);
            // Note: the tokenizer clears all attributes for every token, so this payload is reset
            // to null once iteration starts; it is set here only to exercise the attribute.
            payloadAttribute.setPayload(new BytesRef("Just"));
            
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                System.out.print(
                        "[" + charTermAttribute 
                         + " increment:" + positionIncrementAttribute.getPositionIncrement()
                         + " start:" + offsetAttribute.startOffset() 
                         + " end:" + offsetAttribute.endOffset() 
                         + " type:"+ typeAttribute.type() 
                         + " payload:" + payloadAttribute.getPayload() + "]
    ");
            }
    
            tokenStream.end();
            tokenStream.close();
        }

    3. Lucene's Tokenizer and TokenFilter

        An analyzer is composed of one tokenizer and any number of filters. The tokenizer reads characters from a Reader and turns them into a TokenStream; a TokenFilter then post-processes the tokens produced by the Tokenizer or by the previous TokenFilter in the chain, which is how you extend or modify an existing analyzer.
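
        A minimal sketch of such a chain, using only classes that appear elsewhere in this post
        (the ChainedAnalyzer name is illustrative): each filter wraps the TokenStream produced by
        the previous stage, so tokens flow StandardTokenizer -> LowerCaseFilter -> StopFilter.

    class ChainedAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new StandardTokenizer();
            // lowercase first, then remove English stop words
            TokenStream chain = new LowerCaseFilter(source);
            chain = new StopFilter(chain, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
            return new TokenStreamComponents(source, chain);
        }
    }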

        A custom TokenFilter must implement the abstract incrementToken() method:

    public class TestTokenFilter {
        @Test
        public void test() throws IOException {
            String text = "Hi, Dr Wang, Mr Liu asks if you stay with Mrs Liu yesterday!";
            Analyzer analyzer = new WhitespaceAnalyzer();
            
            CourtesyTitleFilter filter = new CourtesyTitleFilter(analyzer.tokenStream("text", text));
            CharTermAttribute charTermAttribute = filter.addAttribute(CharTermAttribute.class);
            filter.reset();
            while (filter.incrementToken()) {
                System.out.print(charTermAttribute + " ");
            }
            filter.end();
            filter.close();
        }
    }
    
    /**
     * A custom filter that expands courtesy titles.
     */
    class CourtesyTitleFilter extends TokenFilter {
        Map<String, String> courtesyTitleMap = new HashMap<>();
        private CharTermAttribute termAttribute;
        protected CourtesyTitleFilter(TokenStream input) {
            super(input);
            termAttribute = addAttribute(CharTermAttribute.class);
            courtesyTitleMap.put("Dr", "doctor");
            courtesyTitleMap.put("Mr", "mister");
            courtesyTitleMap.put("Mrs", "miss");
        }
    
        @Override
        public final boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false;
            }
            String term = termAttribute.toString();
            if (courtesyTitleMap.containsKey(term)) {
                termAttribute.setEmpty().append(courtesyTitleMap.get(term));
            }
            }
            return true;
        }
    }

     The output is as follows:
       Hi, doctor Wang, mister Liu asks if you stay with miss Liu yesterday!
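
     To reuse CourtesyTitleFilter for indexing and querying, it can be wired into an Analyzer of
     its own. A minimal sketch (the CourtesyTitleAnalyzer name is illustrative, not from the
     original post):

    class CourtesyTitleAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            // same whitespace tokenization as the test above, with the custom filter on top
            Tokenizer source = new WhitespaceTokenizer();
            return new TokenStreamComponents(source, new CourtesyTitleFilter(source));
        }
    }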

    4. A custom Analyzer that extends the stop-word set

    class StopAnalyzerExtend extends Analyzer {
        private CharArraySet stopWordSet; // stop-word dictionary
    
        public CharArraySet getStopWordSet() {
            return this.stopWordSet;
        }
    
        public void setStopWordSet(CharArraySet stopWordSet) {
            this.stopWordSet = stopWordSet;
        }
    
        public StopAnalyzerExtend() {
            super();
            setStopWordSet(StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        }
    
        /**
         * @param stops additional stop words to extend the default set with
         */
        public StopAnalyzerExtend(List<String> stops) {
            this();
            /** Assigning the default set to stopWordSet directly and then adding to it would throw
             * an exception, because StopAnalyzer declares
             * ENGLISH_STOP_WORDS_SET = CharArraySet.unmodifiableSet(stopSet);
             * i.e. ENGLISH_STOP_WORDS_SET is an unmodifiable set.
             */
            //stopWordSet = getStopWordSet();
            stopWordSet = CharArraySet.copy(getStopWordSet());
            stopWordSet.addAll(StopFilter.makeStopSet(stops));
        }
    
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new LowerCaseTokenizer();
            return new TokenStreamComponents(source, new StopFilter(source, stopWordSet));
        }
    
        public static void main(String[] args) throws IOException {
            ArrayList<String> strings = new ArrayList<String>() {{
                add("小鬼子");
                add("美国佬");
            }};
            
            Analyzer analyzer = new StopAnalyzerExtend(strings);
            String content = "小鬼子 and 美国佬 are playing together!";
            TokenStream tokenStream = analyzer.tokenStream("myfield", content);
            tokenStream.reset();
            
            CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
            while (tokenStream.incrementToken()) {
                // the custom stop words have been filtered out
                // output: playing, together (each on its own line)
                System.out.println(charTermAttribute.toString());
            }
            tokenStream.end();
            tokenStream.close();
        }
    }

    5. A custom Analyzer that filters tokens by length

    class LongFilterAnalyzer extends Analyzer {
        private int len;
    
        public int getLen() {
            return this.len;
        }
    
        public void setLen(int len) {
            this.len = len;
        }
    
        public LongFilterAnalyzer() {
            super();
        }
    
        public LongFilterAnalyzer(int len) {
            super();
            setLen(len);
        }
    
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            final Tokenizer source = new WhitespaceTokenizer();
            // drop tokens shorter than len or longer than 20 characters
            TokenStream tokenStream = new LengthFilter(source, len, 20);
            return new TokenStreamComponents(source, tokenStream);
        }
    
        public static void main(String[] args) {
            // drop tokens shorter than 2 characters (LengthFilter's min/max bounds are inclusive)
            Analyzer analyzer = new LongFilterAnalyzer(2);
            String words = "I am a java coder! Testingtestingtesting!";
            TokenStream stream = analyzer.tokenStream("myfield", words);
            try {
                stream.reset();
                CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
                while (stream.incrementToken()) {
                    System.out.println(termAtt.toString());
                }
                }
                stream.end();
                stream.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
    Tokens shorter than two characters (or longer than twenty) are filtered out.

    6. PerFieldAnalyzerWrapper: using a different Analyzer per field. PerFieldAnalyzerWrapper can be used like any other Analyzer, for both indexing and query analysis.

        @Test
        public void testPerFieldAnalyzerWrapper() throws IOException, ParseException {
            Map<String, Analyzer> fields = new HashMap<>();
            fields.put("partnum", new KeywordAnalyzer());
            // all other fields default to SimpleAnalyzer; the specified field partnum uses KeywordAnalyzer
            PerFieldAnalyzerWrapper perFieldAnalyzerWrapper = new PerFieldAnalyzerWrapper(new SimpleAnalyzer(), fields);
            
            Directory directory = new RAMDirectory();
            IndexWriterConfig indexWriterConfig = new IndexWriterConfig(perFieldAnalyzerWrapper);
            IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
            Document document = new Document();
            FieldType fieldType = new FieldType();
            fieldType.setStored(true);
            fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
            document.add(new Field("partnum", "Q36", fieldType));
            document.add(new Field("description", "Illidium Space Modulator", fieldType));
            indexWriter.addDocument(document);
            indexWriter.close();
            
            IndexSearcher indexSearcher = new IndexSearcher(DirectoryReader.open(directory));
            // a direct TermQuery finds the document
            TopDocs search = indexSearcher.search(new TermQuery(new Term("partnum", "Q36")), 10);
            Assert.assertEquals(1, search.totalHits);
            // with QueryParser you must pass the PerFieldAnalyzerWrapper; otherwise, as shown below, nothing is found
            Query description = new QueryParser("description", new SimpleAnalyzer()).parse("partnum:Q36 AND SPACE");
            search = indexSearcher.search(description, 10);
            Assert.assertEquals(0, search.totalHits);
            System.out.println("SimpleAnalyzer :" + description.toString());// +partnum:q
                                                                            // +description:space,原因是SimpleAnalyzer会剥离非字母字符并将字母小写化
            // with PerFieldAnalyzerWrapper the document can be found;
            // "partnum:Q36 AND SPACE" requires Q36 in partnum and SPACE in description
            description = new QueryParser("description", perFieldAnalyzerWrapper).parse("partnum:Q36 AND SPACE");
            search = indexSearcher.search(description, 10);
            Assert.assertEquals(1, search.totalHits);
            System.out.println("(SimpleAnalyzer,KeywordAnalyzer) :" + description.toString());// +partnum:Q36 +description:space
        }

    Reference: http://www.codepub.cn/2016/05/23/Lucene-6-0-in-action-4-The-text-analyzer/
