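Lucene does not handle numeric fields very gracefully, and they often lead to unexpected "problems".
Below we look at three aspects: indexing, sorting, and range search (rangeQuery).
First, let's do the preparation work and build an index.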

    RAMDirectory dir = new RAMDirectory();

    public void index() {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
        try {
            IndexWriter writer = new IndexWriter(dir,
                    new IndexWriterConfig(Version.LUCENE_36, analyzer));
            Random random = new Random();
            // f0 holds the same constant in every document, so a TermQuery on it matches all of them.
            Fieldable f0 = new Field("f0", "c", Store.YES, Index.NOT_ANALYZED);
            // f1 and f2 store numbers as plain strings.
            Fieldable f1 = new Field("f1", "", Store.YES, Index.NOT_ANALYZED);
            Fieldable f2 = new Field("f2", "", Store.YES, Index.NOT_ANALYZED);
            // f3 and f4 store numbers as NumericField values.
            Fieldable f3 = new NumericField("f3", Store.YES, true);
            Fieldable f4 = new NumericField("f4", Store.YES, true);
            for (int i = 0; i < 20; i++) {
                int value = random.nextInt(100);
                ((Field) f1).setValue(value + "");
                ((Field) f2).setValue(value + random.nextFloat() + "");
                ((NumericField) f3).setIntValue(value);
                ((NumericField) f4).setFloatValue(value + random.nextFloat());
                Document doc = new Document();
                doc.add(f0);
                doc.add(f1);
                doc.add(f2);
                doc.add(f3);
                doc.add(f4);
                writer.addDocument(doc);
            }
            writer.close();
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (LockObtainFailedException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

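There are five fields in total:
f0: Field, always the constant "c", used only so that one TermQuery matches every document;
f1: Field, filled with the string value of an int;
f2: Field, filled with the string value of a float;
f3: NumericField, filled with an int;
f4: NumericField, filled with a float.
The index contains 20 documents.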
Sorting
According to the Lucene API documentation for SortField, the available sort types are:
| Modifier and type | Field | Description |
|---|---|---|
| static int | BYTE | Sort using term values as encoded Bytes. |
| static int | CUSTOM | Sort using a custom Comparator. |
| static int | DOC | Sort by document number (index order). |
| static int | DOUBLE | Sort using term values as encoded Doubles. |
| static SortField | FIELD_DOC | Represents sorting by document number (index order). |
| static SortField | FIELD_SCORE | Represents sorting by document score (relevance). |
| static int | FLOAT | Sort using term values as encoded Floats. |
| static int | INT | Sort using term values as encoded Integers. |
| static int | LONG | Sort using term values as encoded Longs. |
| static int | SCORE | Sort by document score (relevance). |
| static int | SHORT | Sort using term values as encoded Shorts. |
| static int | STRING | Sort using term values as Strings. |
| static int | STRING_VAL | Sort using term values as Strings, but comparing by value (using String.compareTo) for all comparisons. |
Here we only look at STRING, INT, and FLOAT.

    public void sort() {
        IndexReader reader;
        try {
            reader = IndexReader.open(dir);
            IndexSearcher searcher = new IndexSearcher(reader);
            // Matches every document, because f0 is always "c".
            TermQuery query = new TermQuery(new Term("f0", "c"));
            // Uncomment one SortField at a time to reproduce each case.
            // SortField field = new SortField("f1", SortField.STRING); // problem
            // SortField field = new SortField("f1", SortField.INT);    // OK
            // SortField field = new SortField("f1", SortField.FLOAT);  // OK
            // SortField field = new SortField("f2", SortField.STRING); // problem
            // SortField field = new SortField("f2", SortField.INT);    // problem
            // SortField field = new SortField("f2", SortField.FLOAT);  // OK
            // SortField field = new SortField("f3", SortField.STRING); // problem
            // SortField field = new SortField("f3", SortField.INT);    // OK
            // SortField field = new SortField("f3", SortField.FLOAT);  // OK
            // SortField field = new SortField("f3", SortField.STRING); // OK
            // SortField field = new SortField("f3", SortField.INT);    // OK
            SortField field = new SortField("f3", SortField.FLOAT);     // OK
            Sort sort = new Sort(field);
            TopFieldDocs docs = searcher.search(query, 20, sort);
            ScoreDoc[] sds = docs.scoreDocs;
            for (ScoreDoc sd : sds) {
                Document doc = reader.document(sd.doc);
                System.out.println(doc.get("f1") + "\t" + doc.get("f2") + "\t"
                        + doc.get("f3") + "\t" + doc.get("f4"));
            }
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

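The tests above show the following.
If the values were indexed with the Field class, sorting works as long as you specify the "correct" data type. SortField.STRING does not produce a numeric order. If the index holds the string value of a float, sorting with SortField.INT also fails, with this exception:
java.lang.NumberFormatException: Invalid shift value in prefixCoded string (is encoded value really an INT?)
The exception suggests that, when sorting, Lucene first converts each string term to the requested numeric type; if the requested type is wrong (for example, converting 1.2 to an int), the conversion throws.
If the values were indexed with NumericField, sort with the same type that was indexed. Reasoning about the other combinations is not worth the trouble.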
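The two failure modes for string-encoded numbers can be reproduced without Lucene. The following is a minimal sketch in plain Java, not Lucene's actual comparator code: the class and method names are hypothetical, and the string comparison is only assumed to be analogous to how SortField.STRING orders plain ASCII terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical demo class; it does not use any Lucene API.
class StringVsNumericSort {

    // Orders values character by character, as a string sort would.
    static List<String> sortAsStrings(List<String> values) {
        List<String> copy = new ArrayList<>(values);
        Collections.sort(copy);
        return copy;
    }

    // Parses each value as an int first, then orders by the numbers.
    static List<String> sortAsInts(List<String> values) {
        List<String> copy = new ArrayList<>(values);
        copy.sort(Comparator.comparingInt(Integer::parseInt));
        return copy;
    }

    // Reports whether a string can be read as an int at all.
    static boolean parsesAsInt(String s) {
        try {
            Integer.parseInt(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("9", "10", "25", "3");
        System.out.println(sortAsStrings(values)); // [10, 25, 3, 9]
        System.out.println(sortAsInts(values));    // [3, 9, 10, 25]
        System.out.println(parsesAsInt("1.2"));    // false
    }
}
```

The string order puts "10" before "3" because '1' sorts before '3', which matches the kind of unexpected order seen with f1 and SortField.STRING. The failed parse of "1.2" mirrors the f2 with SortField.INT case.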
Range search

    public void rangeSearch() {
        IndexReader reader;
        try {
            reader = IndexReader.open(dir);
            IndexSearcher searcher = new IndexSearcher(reader);
            // Uncomment one query at a time to reproduce each case.
            // Query query = new TermRangeQuery("f1", "30", "60", true, true);           // problem
            // Query query = NumericRangeQuery.newIntRange("f3", 30, 60, true, true);    // OK
            // Query query = new TermRangeQuery("f2", "30", "60", true, true);           // problem
            Query query = NumericRangeQuery.newFloatRange("f4", 30f, 60f, true, true);   // OK
            TopDocs docs = searcher.search(query, 20);
            ScoreDoc[] sds = docs.scoreDocs;
            for (ScoreDoc sd : sds) {
                Document doc = reader.document(sd.doc);
                System.out.println(doc.get("f1") + "\t" + doc.get("f2") + "\t"
                        + doc.get("f3") + "\t" + doc.get("f4"));
            }
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

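Why a term range over string-encoded numbers misbehaves can be shown with plain string comparison. This is a sketch under the assumption that a term range without a collator compares terms the way String.compareTo does; the class and method names are hypothetical and no Lucene API is involved.

```java
// Hypothetical demo class; it does not use any Lucene API.
class LexicographicRange {

    // Inclusive range check on the text form, character by character.
    static boolean inStringRange(String value, String lower, String upper) {
        return value.compareTo(lower) >= 0 && value.compareTo(upper) <= 0;
    }

    // Inclusive range check on the numeric value.
    static boolean inIntRange(int value, int lower, int upper) {
        return value >= lower && value <= upper;
    }

    public static void main(String[] args) {
        String[] values = {"4", "6", "45", "7"};
        for (String v : values) {
            boolean asString = inStringRange(v, "30", "60");
            boolean asInt = inIntRange(Integer.parseInt(v), 30, 60);
            System.out.println(v + "\tstring range: " + asString + "\tint range: " + asInt);
        }
    }
}
```

With the bounds "30" and "60", the strings "4" and "6" fall inside the text range even though the numbers 4 and 6 are outside 30..60, while "45" and "7" happen to agree in both checks.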
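When searching, we usually go through QueryParser, but its range queries do not support numeric fields: Lucene does not record which fields are numeric, so QueryParser applies no special handling and builds an ordinary TermRangeQuery.
In that case we can create a subclass of QueryParser, for example: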

    public class NumericQueryParser extends QueryParser {

        // Public so that code outside this package can create the parser.
        public NumericQueryParser(Version matchVersion, String field, Analyzer a) {
            super(matchVersion, field, a);
        }

        @Override
        protected org.apache.lucene.search.Query getRangeQuery(String field,
                String part1, String part2, boolean inclusive)
                throws ParseException {
            // f3 was indexed as an int NumericField, so build a numeric range for it.
            if ("f3".equals(field)) {
                return NumericRangeQuery.newIntRange(field,
                        Integer.parseInt(part1), Integer.parseInt(part2),
                        inclusive, inclusive);
            }
            // Every other field keeps the default term range behaviour.
            return super.getRangeQuery(field, part1, part2, inclusive);
        }
    }

Using it for a range search:

    public void rangeSearch() {
        IndexReader reader;
        try {
            reader = IndexReader.open(dir);
            IndexSearcher searcher = new IndexSearcher(reader);
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
            // QueryParser parser = new QueryParser(Version.LUCENE_36, "f0", analyzer); // problem
            NumericQueryParser parser = new NumericQueryParser(
                    Version.LUCENE_36, "f0", analyzer);
            Query query = parser.parse("f3:[30 TO 60]");
            TopDocs docs = searcher.search(query, 20);
            ScoreDoc[] sds = docs.scoreDocs;
            for (ScoreDoc sd : sds) {
                Document doc = reader.document(sd.doc);
                System.out.println(doc.get("f1") + "\t" + doc.get("f2") + "\t"
                        + doc.get("f3") + "\t" + doc.get("f4"));
            }
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (ParseException e) {
            e.printStackTrace();
        }
    }

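Because Lucene itself does not remember which fields are numeric, the application has to keep that knowledge somewhere. One possible way to avoid hard-coding a single field name is a small field-to-type registry that a parser subclass could consult. The sketch below is plain Java with hypothetical names (NumericFieldRegistry, describeRange) and returns a textual description instead of a real query, so it runs without Lucene. In a real parser the INT and FLOAT branches would presumably call NumericRangeQuery.newIntRange and NumericRangeQuery.newFloatRange.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; it is not part of Lucene.
class NumericFieldRegistry {

    enum NumericType { INT, FLOAT }

    private final Map<String, NumericType> types = new HashMap<>();

    // Records that a field was indexed with the given numeric type.
    void register(String field, NumericType type) {
        types.put(field, type);
    }

    // Describes the range query a parser subclass would build for this field.
    String describeRange(String field, String lower, String upper) {
        NumericType type = types.get(field);
        if (type == null) {
            // Unregistered fields fall back to a plain term range.
            return "TermRange(" + field + ", " + lower + ", " + upper + ")";
        }
        switch (type) {
            case INT:
                return "IntRange(" + field + ", " + Integer.parseInt(lower)
                        + ", " + Integer.parseInt(upper) + ")";
            case FLOAT:
                return "FloatRange(" + field + ", " + Float.parseFloat(lower)
                        + ", " + Float.parseFloat(upper) + ")";
            default:
                throw new IllegalStateException("Unhandled type: " + type);
        }
    }

    public static void main(String[] args) {
        NumericFieldRegistry registry = new NumericFieldRegistry();
        registry.register("f3", NumericType.INT);
        registry.register("f4", NumericType.FLOAT);
        System.out.println(registry.describeRange("f3", "30", "60")); // IntRange(f3, 30, 60)
        System.out.println(registry.describeRange("f4", "30", "60")); // FloatRange(f4, 30.0, 60.0)
        System.out.println(registry.describeRange("f1", "30", "60")); // TermRange(f1, 30, 60)
    }
}
```

The registry must be kept in sync with the indexing code by hand, since nothing in the index enforces it.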
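Notes to self:
1. Don't overthink a problem at the surface level. Take the sorting above: if an int was indexed, sorting as INT certainly works, so there is no need to also try STRING or the other numeric types. It adds little!
2. If you do want to get to the bottom of these issues, start from the fundamentals: read the source code!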