• Numeric fields in Lucene (NumericField)


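    Lucene does not handle numeric field values perfectly, and it often produces unexpected "problems".

    The following looks at three aspects: indexing, sorting, and range search (rangeQuery).

    First, as preparation, we build an index.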

    RAMDirectory dir = new RAMDirectory();
    
    	public void index() {
    		Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
    		try {
    			IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(
    					Version.LUCENE_36, analyzer));
    			Random random = new Random();
    			Fieldable f0 = new Field("f0", "c", Store.YES, Index.NOT_ANALYZED);
    			Fieldable f1 = new Field("f1", "", Store.YES, Index.NOT_ANALYZED);
    			Fieldable f2 = new Field("f2", "", Store.YES, Index.NOT_ANALYZED);
    			Fieldable f3 = new NumericField("f3", Store.YES, true);
    			Fieldable f4 = new NumericField("f4", Store.YES, true);
    			for (int i = 0; i < 20; i++) {
    				int value = random.nextInt(100);
    				((Field) f1).setValue(value + "");
    				((Field) f2).setValue(value + random.nextFloat() + "");
    				((NumericField) f3).setIntValue(value);
    				((NumericField) f4).setFloatValue(value + random.nextFloat());
    				Document doc = new Document();
    				doc.add(f0);
    				doc.add(f1);
    				doc.add(f2);
    				doc.add(f3);
    				doc.add(f4);
    				writer.addDocument(doc);
    			}
    			writer.close();
    		} catch (CorruptIndexException e) {
    			e.printStackTrace();
    		} catch (LockObtainFailedException e) {
    			e.printStackTrace();
    		} catch (IOException e) {
    			e.printStackTrace();
    		}
    	}
    

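    There are five fields in total:

    f0: Field type, always holding the string "c", so that a single TermQuery matches every document;

    f1: Field type, holding the String value of an int;

    f2: Field type, holding the String value of a float;

    f3: NumericField type, holding an int;

    f4: NumericField type, holding a float;

    There are 20 documents in total.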

    Sorting

    According to the Lucene API (SortField), the available sort types are:

    Field Summary
    static int BYTE 
              Sort using term values as encoded Bytes.
    static int CUSTOM 
              Sort using a custom Comparator.
    static int DOC 
              Sort by document number (index order).
    static int DOUBLE 
              Sort using term values as encoded Doubles.
    static SortField FIELD_DOC 
              Represents sorting by document number (index order).
    static SortField FIELD_SCORE 
              Represents sorting by document score (relevance).
    static int FLOAT 
              Sort using term values as encoded Floats.
    static int INT 
              Sort using term values as encoded Integers.
    static int LONG 
              Sort using term values as encoded Longs.
    static int SCORE 
              Sort by document score (relevance).
    static int SHORT 
              Sort using term values as encoded Shorts.
    static int STRING 
              Sort using term values as Strings.
    static int STRING_VAL 
              Sort using term values as Strings, but comparing by value (using String.compareTo) for all comparisons.

    Here we only look at STRING, INT, and FLOAT.

    public void sort() {
    		IndexReader reader;
    		try {
    			reader = IndexReader.open(dir);
    			IndexSearcher searcher = new IndexSearcher(reader);
    			TermQuery query = new TermQuery(new Term("f0", "c"));
    			// SortField field = new SortField("f1", SortField.STRING);// problematic
    			// SortField field = new SortField("f1", SortField.INT);// OK
    			// SortField field = new SortField("f1", SortField.FLOAT);// OK
    
    			// SortField field = new SortField("f2", SortField.STRING);// problematic
    			// SortField field = new SortField("f2", SortField.INT);// problematic
    			// SortField field = new SortField("f2", SortField.FLOAT);// OK
    
    			// SortField field = new SortField("f3", SortField.STRING);// problematic
    			// SortField field = new SortField("f3", SortField.INT);// OK
    			// SortField field = new SortField("f3", SortField.FLOAT);// OK
    
    			// SortField field = new SortField("f4", SortField.STRING);// OK
    			// SortField field = new SortField("f4", SortField.INT);// OK
    			SortField field = new SortField("f4", SortField.FLOAT);// OK
    			Sort sort = new Sort(field);
    			TopFieldDocs docs = searcher.search(query, 20, sort);
    			ScoreDoc[] sds = docs.scoreDocs;
    			for (ScoreDoc sd : sds) {
    				Document doc = reader.document(sd.doc);
    				System.out.println(doc.get("f1") + "\t" + doc.get("f2") + "\t"
    						+ doc.get("f3") + "\t" + doc.get("f4"));
    			}
    		} catch (CorruptIndexException e) {
    			e.printStackTrace();
    		} catch (IOException e) {
    			e.printStackTrace();
    		}
    	}
    

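    The tests above show the following:

    If you index with the Field class, sorting still works as long as you specify the "right" data type when sorting. SortField.STRING certainly does not work, because the values are then compared as strings rather than as numbers. And if the index holds the String value of a float, sorting with SortField.INT also fails, with this exception: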

    java.lang.NumberFormatException: Invalid shift value in prefixCoded string (is encoded value really an INT?)

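    The exception suggests that, when sorting, Lucene first converts each String term to the requested numeric type. If the requested type is wrong (for example, converting 1.2 to an int), the conversion throws.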
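
Both failure modes can be reproduced without Lucene. The following is a minimal plain-Java sketch, not Lucene's actual sorting code: the class name `SortOrderDemo` and the sample values are made up for illustration. It shows that sorting numeric strings as strings gives character-by-character order, and that parsing the String value of a float as an int throws a NumberFormatException.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical demo class; it only imitates the two situations described above.
final class SortOrderDemo {

    // Sorts with String.compareTo, i.e. character by character.
    static List<String> sortAsStrings(List<String> values) {
        List<String> copy = new ArrayList<String>(values);
        Collections.sort(copy);
        return copy;
    }

    // Sorts by the int value of each string.
    // Throws NumberFormatException as soon as a value such as "1.2" is compared.
    static List<String> sortAsInts(List<String> values) {
        List<String> copy = new ArrayList<String>(values);
        Collections.sort(copy, new Comparator<String>() {
            @Override
            public int compare(String a, String b) {
                return Integer.compare(Integer.parseInt(a), Integer.parseInt(b));
            }
        });
        return copy;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("9", "10", "100", "25");
        System.out.println(sortAsStrings(values)); // [10, 100, 25, 9]
        System.out.println(sortAsInts(values));    // [9, 10, 25, 100]
        try {
            sortAsInts(Arrays.asList("1.2", "3"));
        } catch (NumberFormatException e) {
            System.out.println("not an int: " + e.getMessage());
        }
    }
}
```

The string order puts "100" before "25" because '1' sorts before '2'. This is the same effect as sorting f1 or f2 with SortField.STRING.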

    If you index with NumericField, sort with the same type that you indexed. Thinking about the other combinations is more trouble than it is worth.

    Range search

    public void rangeSearch() {
    		IndexReader reader;
    		try {
    			reader = IndexReader.open(dir);
    			IndexSearcher searcher = new IndexSearcher(reader);
    			Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
    			// Query query = new TermRangeQuery("f1", "30", "60", true,
    			// true);// problematic
    			// Query query = NumericRangeQuery.newIntRange("f3", 30, 60,
    			// true, true);// OK
    			// Query query = new TermRangeQuery("f2", "30", "60", true,
    			// true);// problematic
    			Query query = NumericRangeQuery.newFloatRange("f4", 30f, 60f, true,
    					true);// OK
    			TopDocs docs = searcher.search(query, 20);
    			ScoreDoc[] sds = docs.scoreDocs;
    			for (ScoreDoc sd : sds) {
    				Document doc = reader.document(sd.doc);
    				System.out.println(doc.get("f1") + "\t" + doc.get("f2") + "\t"
    						+ doc.get("f3") + "\t" + doc.get("f4"));
    			}
    		} catch (CorruptIndexException e) {
    			e.printStackTrace();
    		} catch (IOException e) {
    			e.printStackTrace();
    		}
    	}
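
The commented-out TermRangeQuery variants on f1 and f2 misbehave because TermRangeQuery compares terms as strings. The plain-Java sketch below imitates that comparison with String.compareTo. `StringRangeDemo` and its sample values are hypothetical and only illustrate the ordering; this is not how Lucene executes the query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class: string range versus numeric range over the same values.
final class StringRangeDemo {

    // Keeps every value v with lower <= v <= upper under String.compareTo.
    static List<String> stringRange(List<String> values, String lower, String upper) {
        List<String> hits = new ArrayList<String>();
        for (String v : values) {
            if (v.compareTo(lower) >= 0 && v.compareTo(upper) <= 0) {
                hits.add(v);
            }
        }
        return hits;
    }

    // Keeps every value whose int value lies in [lower, upper].
    static List<String> intRange(List<String> values, int lower, int upper) {
        List<String> hits = new ArrayList<String>();
        for (String v : values) {
            int n = Integer.parseInt(v);
            if (n >= lower && n <= upper) {
                hits.add(v);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("4", "35", "6", "100", "60", "7");
        System.out.println(stringRange(values, "30", "60")); // [4, 35, 6, 60]
        System.out.println(intRange(values, 30, 60));        // [35, 60]
    }
}
```

The string range wrongly accepts "4" and "6", since both sort between "30" and "60" character by character. A NumericRangeQuery on f3 or f4 avoids this because it compares the numeric values.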
    

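    For searching we usually use QueryParser, but QueryParser's range queries do not support numeric fields. Lucene does not record which fields are numeric, so QueryParser gives them no special treatment when parsing and builds an ordinary TermRangeQuery.

    In that case we can create a subclass of QueryParser, for example: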

    public class NumericQueryParser extends QueryParser {
    
    	public NumericQueryParser(Version matchVersion, String field, Analyzer a) {
    		super(matchVersion, field, a);
    	}
    
    	@Override
    	protected org.apache.lucene.search.Query getRangeQuery(String field,
    			String part1, String part2, boolean inclusive)
    			throws ParseException {
    		TermRangeQuery query = (TermRangeQuery) super.getRangeQuery(field,
    				part1, part2, inclusive);
    		if ("f3".equals(field)) {
    			return NumericRangeQuery.newIntRange(field,
    					Integer.parseInt(query.getLowerTerm()),
    					Integer.parseInt(query.getUpperTerm()),
    					query.includesLower(), query.includesUpper());
    		} else {
    			return query;
    		}
    	}
    
    }
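
The subclass above hard-codes the field name "f3". Since Lucene does not record which fields are numeric, the application has to keep that knowledge itself. One possible way to organise it is a small lookup table from field name to numeric type, which a getRangeQuery override could consult before choosing between NumericRangeQuery.newIntRange, NumericRangeQuery.newFloatRange, and the default TermRangeQuery. The sketch below covers only the lookup and the parsing of range bounds, in plain Java. `NumericFieldTypes`, `Kind`, `register`, and `parseBound` are hypothetical names, not Lucene API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: remembers which fields were indexed as numeric.
final class NumericFieldTypes {

    enum Kind { INT, FLOAT, TEXT }

    private final Map<String, Kind> kinds = new HashMap<String, Kind>();

    // Records the type of a field; returns this so calls can be chained.
    NumericFieldTypes register(String field, Kind kind) {
        kinds.put(field, kind);
        return this;
    }

    // Unregistered fields are treated as plain text.
    Kind kindOf(String field) {
        Kind kind = kinds.get(field);
        return kind == null ? Kind.TEXT : kind;
    }

    // Converts a range bound to Integer or Float for numeric fields,
    // and returns the text unchanged for all other fields.
    Object parseBound(String field, String text) {
        switch (kindOf(field)) {
            case INT:
                return Integer.valueOf(text.trim());
            case FLOAT:
                return Float.valueOf(text.trim());
            default:
                return text;
        }
    }

    public static void main(String[] args) {
        NumericFieldTypes types = new NumericFieldTypes()
                .register("f3", Kind.INT)
                .register("f4", Kind.FLOAT);
        System.out.println(types.parseBound("f3", "30")); // 30
        System.out.println(types.parseBound("f4", "60")); // 60.0
        System.out.println(types.kindOf("f1"));           // TEXT
    }
}
```

With a table like this, adding another numeric field means one more `register` call rather than another `if` branch in the parser subclass.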
    

      

    Using it for a range search:

    public void rangeSearch() {
    		IndexReader reader;
    		try {
    			reader = IndexReader.open(dir);
    			IndexSearcher searcher = new IndexSearcher(reader);
    			Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
    			// QueryParser parser = new QueryParser(Version.LUCENE_36, "f0",
    			// analyzer);// problematic
    			NumericQueryParser parser = new NumericQueryParser(
    					Version.LUCENE_36, "f0", analyzer);
    			Query query = parser.parse("f3:[30 TO 60]");
    			TopDocs docs = searcher.search(query, 20);
    			ScoreDoc[] sds = docs.scoreDocs;
    			for (ScoreDoc sd : sds) {
    				Document doc = reader.document(sd.doc);
    				System.out.println(doc.get("f1") + "\t" + doc.get("f2") + "\t"
    						+ doc.get("f3") + "\t" + doc.get("f4"));
    			}
    		} catch (CorruptIndexException e) {
    			e.printStackTrace();
    		} catch (IOException e) {
    			e.printStackTrace();
    		} catch (ParseException e) {
    			e.printStackTrace();
    		}
    	}
    

      

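    Notes to self:

    1. Do not overthink some problems at the surface level. Take the sorting above: if an int was indexed, sorting as int is certainly fine, so there is no need to also try string or the other numeric types. It is not worth much!

    2. If you do want to understand these issues thoroughly, go to the root of them and start from the source code!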

  • Original article: https://www.cnblogs.com/huangfox/p/2631240.html