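To supplement the understanding of TrieField, here are three articles that are quite systematic and comprehensive. See the links below; no further explanation is given here.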
http://blog.csdn.net/fancyerii/article/details/7256379
http://hadoopcn.iteye.com/blog/1550402
http://rdc.taobao.com/team/jm/archives/1699
NumericRangeQuery extends MultiTermQuery
A Query that matches numeric values within a specified range. To use
this, you must first index the numeric values using NumericField
(expert: NumericTokenStream). If your terms are instead textual, you
should use TermRangeQuery. NumericRangeFilter is the filter equivalent
of this query.
You create a new NumericRangeQuery with the static factory
methods, e.g.:
    Query q = NumericRangeQuery.newFloatRange("weight", 0.03f, 0.1f, true, true);
matches all documents whose float valued "weight" field ranges
from 0.03 to 0.10, inclusive.
The performance of NumericRangeQuery is much better than the
corresponding TermRangeQuery because the number of terms that
must be searched is usually far fewer, thanks to trie indexing,
described below.
You can optionally specify a precisionStep when creating this query.
This is necessary if you've changed this configuration from its
default (4) during indexing. Lower values consume more disk space but
speed up searching. Suitable values are between 1 and 8. A good
starting point to test is 4, which is the default value for all
Numeric* classes. See below for details.
This query defaults to
MultiTermQuery.CONSTANT_SCORE_AUTO_REWRITE_DEFAULT for 32 bit
(int/float) ranges with precisionStep ≤8 and 64 bit (long/double)
ranges with precisionStep ≤6. Otherwise it uses
MultiTermQuery.CONSTANT_SCORE_FILTER_REWRITE as the number of
terms is likely to be high. With precision steps of ≤4, this query
can be run with one of the BooleanQuery rewrite methods without
changing BooleanQuery's default max clause count.
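The default selection described above can be summarized in a few lines. This is only a sketch: the class, enum, and method names are hypothetical and merely mirror the thresholds stated in the text; it does not call Lucene.

```java
// Hypothetical sketch of the default rewrite choice; not Lucene's API.
class RewriteDefaultSketch {
    // Stand-ins for the two MultiTermQuery rewrite modes named in the text.
    enum Rewrite { CONSTANT_SCORE_AUTO, CONSTANT_SCORE_FILTER }

    /**
     * valueBits is 32 for int/float ranges and 64 for long/double ranges.
     * Returns the auto rewrite for small precision steps, else the filter rewrite.
     */
    static Rewrite defaultRewrite(int valueBits, int precisionStep) {
        if (valueBits != 32 && valueBits != 64) {
            throw new IllegalArgumentException("valueBits must be 32 or 64");
        }
        if (precisionStep < 1) {
            throw new IllegalArgumentException("precisionStep must be >= 1");
        }
        // Thresholds taken from the text: <= 8 for 32 bit, <= 6 for 64 bit.
        int limit = (valueBits == 32) ? 8 : 6;
        return precisionStep <= limit
                ? Rewrite.CONSTANT_SCORE_AUTO
                : Rewrite.CONSTANT_SCORE_FILTER;
    }
}
```

For example, a long range with precisionStep 8 would fall back to the filter rewrite in this sketch, while an int range with the same step would not.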
How it works
See the publication about panFMP, where this algorithm was described
(referred to as TrieRangeQuery):
Schindler, U, Diepenbroek, M, 2008.
Generic XML-based Framework for Metadata Portals.
Computers & Geosciences 34 (12), 1947-1955. doi:10.1016/j.cageo.2008.02.023
A quote from this paper: Because Apache Lucene is a
full-text search engine and not a conventional database, it cannot
handle numerical ranges (e.g., field value is inside user defined
bounds, even dates are numerical values). We have developed an
extension to Apache Lucene that stores the numerical values in a
special string-encoded format with variable precision (all
numerical values like doubles, longs, floats, and ints are
converted to lexicographic sortable string representations and
stored with different precisions (for a more detailed description
of how the values are stored, see NumericUtils)). A range is then
divided recursively
into multiple intervals for searching: The center of the range is
searched only with the lowest possible precision in the
trie, while the boundaries are matched more exactly. This
reduces the number of terms dramatically.
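To make the idea of sortable values stored at several precisions concrete, here is a minimal sketch. The class and method names are hypothetical, and the "shift:hex" term format is an illustration only, not Lucene's actual prefix-coded term encoding. It assumes the usual bit trick of flipping the non-sign bits of negative doubles so that the resulting longs sort like the doubles.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; term strings here are NOT Lucene's on-disk format.
class TrieTermsSketch {
    /** Maps a double to a long whose signed order matches the numeric order. */
    static long doubleToSortableLong(double value) {
        long bits = Double.doubleToLongBits(value);
        // Negative doubles sort in reverse as raw bits, so flip all non-sign bits.
        if (bits < 0) {
            bits ^= 0x7fffffffffffffffL;
        }
        return bits;
    }

    /** One term per precision level: the value with its lowest `shift` bits dropped. */
    static List<String> termsFor(long value, int precisionStep) {
        if (precisionStep < 1) {
            throw new IllegalArgumentException("precisionStep must be >= 1");
        }
        List<String> terms = new ArrayList<>();
        // Flip the sign bit so unsigned order of the bits equals signed order.
        long unsigned = value ^ 0x8000000000000000L;
        for (int shift = 0; shift < 64; shift += precisionStep) {
            terms.add(shift + ":" + Long.toHexString(unsigned >>> shift));
        }
        return terms;
    }
}
```

Two values that differ only in their low bits share every term above the first few levels, which is what lets the center of a range be matched with a handful of low-precision terms.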
For the variant that stores long values in 8 different
precisions (each reduced by 8 bits) that uses a lowest precision of
1 byte, the index contains only a maximum of 256 distinct values in
the lowest precision. Overall, a range could consist of a
theoretical maximum of 7*255*2 + 255 = 3825 distinct
terms (when there is a term for every distinct value of an
8-byte-number in the index and the range covers almost all of them;
a maximum of 255 distinct values is used because it would always be
possible to reduce the full 256 values to one term with degraded
precision). In practice, we have seen up to 300 terms in most cases
(index with 500,000 metadata records and a uniform value
distribution).
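The recursive splitting of a range can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's implementation: it handles only unsigned 32 bit values held in a long, limits precisionStep to 1..16, and the names RangeSplitSketch, SubRange, and split are hypothetical. The boundaries are matched at full precision, and the center of the range is pushed up to coarser levels.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of trie range splitting for values in [0, 2^32).
class RangeSplitSketch {
    /** All prefixes from lo to hi (inclusive) at one precision level. */
    static final class SubRange {
        final int shift;
        final long lo;
        final long hi;

        SubRange(int shift, long lo, long hi) {
            this.shift = shift;
            this.lo = lo;
            this.hi = hi;
        }

        /** Upper bound on the number of terms visited for this sub-range. */
        long prefixCount() { return hi - lo + 1; }

        /** Number of original values these prefixes stand for. */
        long valuesCovered() { return prefixCount() << shift; }
    }

    static List<SubRange> split(long min, long max, int precisionStep) {
        if (precisionStep < 1 || precisionStep > 16) {
            throw new IllegalArgumentException("precisionStep must be in 1..16");
        }
        if (min < 0 || max > 0xFFFFFFFFL) {
            throw new IllegalArgumentException("bounds must fit in 32 unsigned bits");
        }
        List<SubRange> out = new ArrayList<>();
        if (min > max) {
            return out; // empty range
        }
        final int valueBits = 32;
        for (int shift = 0; ; shift += precisionStep) {
            long diff = 1L << (shift + precisionStep);
            long mask = ((1L << precisionStep) - 1L) << shift;
            // A ragged edge exists if the bound is not aligned at this level.
            boolean hasLower = (min & mask) != 0L;
            boolean hasUpper = (max & mask) != mask;
            long nextMin = (hasLower ? min + diff : min) & ~mask;
            long nextMax = (hasUpper ? max - diff : max) & ~mask;
            if (shift + precisionStep >= valueBits || nextMin > nextMax) {
                // Coarsest usable level: emit what is left as one sub-range.
                out.add(new SubRange(shift, min >>> shift, max >>> shift));
                break;
            }
            if (hasLower) {
                out.add(new SubRange(shift, min >>> shift, (min | mask) >>> shift));
            }
            if (hasUpper) {
                out.add(new SubRange(shift, (max & ~mask) >>> shift, max >>> shift));
            }
            min = nextMin;
            max = nextMax;
        }
        return out;
    }
}
```

For the range 3 to 1000 with a precision step of 4, this sketch produces sub-ranges totalling 53 prefixes, compared with 998 individual values at full precision.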
Precision Step
You can choose any precisionStep when encoding values. Lower step
values mean more precisions and so more terms in the index (and the
index gets larger). On the other hand, the maximum number of terms to
match is reduced, which optimizes query speed. The formula to calculate
the maximum term count is:
n = [ (bitsPerValue/precisionStep - 1) * (2^precisionStep - 1) * 2 ] + (2^precisionStep - 1)
(this formula is only correct when bitsPerValue/precisionStep is an
integer; in other cases, the value must be rounded up and the last
summand must contain the modulo of the division as precision step).
For longs stored using a precision step of 4, n = 15*15*2 + 15 = 465,
and for a precision step of 2, n = 31*3*2 + 3 = 189. But the faster
search speed is reduced by more seeking in the term enum of the index.
Because of this, the ideal precisionStep value can only be found out
by testing.
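The maximum term count can be computed with a small helper, which makes it easy to compare candidate step values before testing them. This is a sketch: the class and method names are hypothetical, the step is limited to 1..30 to avoid overflow, and the handling of a non-integer bitsPerValue/precisionStep follows one reading of the rounding rule stated in the text.

```java
// Hypothetical helper for the theoretical maximum number of terms per range.
class TrieTermCountSketch {
    static long maxTerms(int bitsPerValue, int precisionStep) {
        if (bitsPerValue != 32 && bitsPerValue != 64) {
            throw new IllegalArgumentException("bitsPerValue must be 32 or 64");
        }
        if (precisionStep < 1 || precisionStep > 30) {
            throw new IllegalArgumentException("precisionStep must be in 1..30");
        }
        // Number of precision levels, rounded up when the division is not exact.
        int levels = (bitsPerValue + precisionStep - 1) / precisionStep;
        // The coarsest level uses the remainder as its step when there is one.
        int remainder = bitsPerValue % precisionStep;
        int lastStep = (remainder == 0) ? precisionStep : remainder;
        long perLevel = (1L << precisionStep) - 1; // 2^precisionStep - 1
        long last = (1L << lastStep) - 1;          // last summand
        // Two ragged edges on every level except the coarsest one.
        return (levels - 1) * perLevel * 2 + last;
    }
}
```

With this helper, maxTerms(64, 4) evaluates to 465, maxTerms(64, 2) to 189, and maxTerms(64, 8) to 3825, matching the figures quoted in the text.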
Important: You can index with a lower precision step value
and test search speed using a multiple of the original step
value.
Good values for precisionStep depend on usage and data type:
- The default for all data types is 4, which is used when no
  precisionStep is given.
- Ideal value in most cases for 64 bit data types (long, double) is
  6 or 8.
- Ideal value in most cases for 32 bit data types (int, float) is 4.
- For low cardinality fields larger precision steps are good. If the
  cardinality is < 100, it is fair to use Integer.MAX_VALUE (see below).
- Steps ≥64 for long/double and ≥32 for int/float produce one token
  per value in the index, and querying is as slow as a conventional
  TermRangeQuery. But this can be used to produce fields that are
  solely used for sorting (in this case simply use Integer.MAX_VALUE
  as precisionStep). Using NumericFields for sorting is ideal, because
  building the field cache is much faster than with text-only numbers.
  These fields have one term per value and therefore also work with
  term enumeration for building distinct lists (e.g. facets /
  preselected values to search for). Sorting is also possible with
  range query optimized fields using one of the above precisionSteps.
Comparisons of the different types of RangeQueries on an index
with about 500,000 docs showed that TermRangeQuery in boolean rewrite
mode (with raised BooleanQuery clause count) took about 30-40 secs
to complete, TermRangeQuery in constant score filter rewrite mode
took 5 secs and executing this class took <100ms to complete (on an
Opteron64 machine, Java 1.5, 8 bit precision step).
This query type was developed for a geographic portal, where the
performance for e.g. bounding boxes or exact date/time stamps is
important.
Since: 2.9