关于TrieField的全面认识、理解、运用

简介: 假期重新把之前在新浪博客里面的文字梳理了下,搬到这里。

关于trieField的理解补充下3篇文档,相当的系统、全面!看相关文档连接,不解释。

http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/search/NumericRangeQuery.html

http://blog.csdn.net/fancyerii/article/details/7256379

http://hadoopcn.iteye.com/blog/1550402

http://rdc.taobao.com/team/jm/archives/1699

 

public final class NumericRangeQueryNumber>

extends MultiTermQuery

A Query that matches numeric values within a specified range. To use this, you must first index the numeric values using NumericField (expert: NumericTokenStream). If your terms are instead textual, you should use TermRangeQuery. NumericRangeFilter is the filter equivalent of this query.

You create a new NumericRangeQuery with the static factory methods, eg:

Query q = NumericRangeQuery.newFloatRange("weight", 0.03f, 0.10f, true, true);

matches all documents whose float valued "weight" field ranges from 0.03 to 0.10, inclusive.

The performance of NumericRangeQuery is much better than the corresponding TermRangeQuery because the number of terms that must be searched is usually far fewer, thanks to trie indexing, described below.

You can optionally specify a precisionStep when creating this query. This is necessary if you've changed this configuration from its default (4) during indexing. Lower values consume more disk space but speed up searching. Suitable values are between 1 and 8. A good starting point to test is 4, which is the default value for all Numeric* classes. See below for details.

This query defaults to MultiTermQuery.CONSTANT_SCORE_AUTO_REWRITE_DEFAULT for 32 bit (int/float) ranges with precisionStep ≤8 and 64 bit (long/double) ranges with precisionStep ≤6. Otherwise it uses MultiTermQuery.CONSTANT_SCORE_FILTER_REWRITE as the number of terms is likely to be high. With precision steps of ≤4, this query can be run with one of the BooleanQuery rewrite methods without changing BooleanQuery's default max clause count.

How it works

See the publication about panFMP, where this algorithm was described (referred to as TrieRangeQuery):

Schindler, U, Diepenbroek, M, 2008. Generic XML-based Framework for Metadata Portals.Computers & Geosciences 34 (12), 1947-1955. doi:10.1016/j.cageo.2008.02.023

A quote from this paper: Because Apache Lucene is a full-text search engine and not a conventional database, it cannot handle numerical ranges (e.g., field value is inside user defined bounds, even dates are numerical values). We have developed an extension to Apache Lucene that stores the numerical values in a special string-encoded format with variable precision (all numerical values like doubles, longs, floats, and ints are converted to lexicographic sortable string representations and stored with different precisions (for a more detailed description of how the values are stored, see NumericUtils). A range is then divided recursively into multiple intervals for searching: The center of the range is searched only with the lowest possible precision in thetrie, while the boundaries are matched more exactly. This reduces the number of terms dramatically.

For the variant that stores long values in 8 different precisions (each reduced by 8 bits) that uses a lowest precision of 1 byte, the index contains only a maximum of 256 distinct values in the lowest precision. Overall, a range could consist of a theoretical maximum of 7*255*2 + 255 = 3825distinct terms (when there is a term for every distinct value of an 8-byte-number in the index and the range covers almost all of them; a maximum of 255 distinct values is used because it would always be possible to reduce the full 256 values to one term with degraded precision). In practice, we have seen up to 300 terms in most cases (index with 500,000 metadata records and a uniform value distribution).

Precision Step

You can choose any precisionStep when encoding values. Lower step values mean more precisions and so more terms in index (and index gets larger). On the other hand, the maximum number of terms to match reduces, which optimized query speed. The formula to calculate the maximum term count is:

n = [ (bitsPerValue/precisionStep - 1) * (2^precisionStep - 1 ) * 2 ] + (2^precisionStep - 1 )

(this formula is only correct, when bitsPerValue/precisionStep is an integer; in other cases, the value must be rounded up and the last summand must contain the modulo of the division as precision step). For longs stored using a precision step of 4, n = 15*15*2 + 15 = 465, and for a precision step of 2, n = 31*3*2 + 3 = 189. But the faster search speed is reduced by more seeking in the term enum of the index. Because of this, the ideal precisionStep value can only be found out by testing. Important: You can index with a lower precision step value and test search speed using a multiple of the original step value.

Good values for precisionStep are depending on usage and data type:

·      The default for all data types is 4, which is used, when no precisionStep is given.

·      Ideal value in most cases for 64 bit data types (long, double) is 6 or 8.

·      Ideal value in most cases for 32 bit data types (int, float) is 4.

·      For low cardinality fields larger precision steps are good. If the cardinality is < 100, it is fair to use Integer.MAX_VALUE (see below).

·      Steps ≥64 for long/double and ≥32 for int/float produces one token per value in the index and querying is as slow as a conventional TermRangeQuery. But it can be used to produce fields, that are solely used for sorting (in this case simply use Integer.MAX_VALUE as precisionStep). Using NumericFields for sorting is ideal, because building the field cache is much faster than with text-only numbers. These fields have one term per value and therefore also work with term enumeration for building distinct lists (e.g. facets / preselected values to search for). Sorting is also possible with range query optimized fields using one of the above precisionSteps.

Comparisons of the different types of RangeQueries on an index with about 500,000 docs showed that TermRangeQuery in boolean rewrite mode (with raised BooleanQuery clause count) took about 30-40 secs to complete, TermRangeQuery in constant score filter rewrite mode took 5 secs and executing this class took <100ms to complete (on an Opteron64 machine, Java 1.5, 8 bit precision step). This query type was developed for a geographic portal, where the performance for e.g. bounding boxes or exact date/time stamps is important.

Since:

2.9

See Also:

Serialized Form

目录
相关文章
|
Java Maven
Maven三种仓库详解
仓库分类 1、本地仓库 本地仓库就是开发者本地已经下载下来的或者自己打包所有jar包的依赖仓库,本地仓库路径配置在maven对应的conf/settings.xml配置文件。
1822 0
|
人工智能 自然语言处理
Co-STAR 模型
Co-STAR 模型
1073 0
|
消息中间件 存储 负载均衡
中间件优解——RabbitMQ和Kafka的高可用集群原理
大家对当前比较常用的RabbitMQ和Kafka是否有一些了解呢,了解的多一些也不是坏事,面试或者跟人聊技术的时候也会让你更有话语权嘛。 今天就跟大家聊一聊RabbitMQ和Kafka在处理高可用集群时的原理,看看它们与RocketMQ有什么不同。小伙伴们可以重新温习一下常见的消息中间件有哪些?你们是怎么进行技术选型的?这篇文章,了解一下他们之间的区别。
|
机器学习/深度学习 存储 人工智能
深度学习模型部署综述(ONNX/NCNN/OpenVINO/TensorRT)(上)
今天自动驾驶之心很荣幸邀请到逻辑牛分享深度学习部署的入门介绍,带大家盘一盘ONNX、NCNN、OpenVINO等框架的使用场景、框架特点及代码示例。
深度学习模型部署综述(ONNX/NCNN/OpenVINO/TensorRT)(上)
|
监控 JavaScript 前端开发
【JAVA】基于微服务架构的智慧工地云平台源码带APP(springcloud+VUE+mysql+mybatis plus+redis)
智慧工地源码 后端:java + spring boot + mybatis plus + mysql + kafka+ redis + xxl-job + MQTT。 前端:vue + flutter。 智慧工地云平台,人机料法环是智慧工地中常用的概念,分别代表了人员、机械、材料、方法和环境。在施工过程中,通过有效地整合人员、机械设备、材料和工法等要素,并结合科学合理的管理手段,以实现高效、安全、环保和可持续的建筑施工。
|
Java 程序员 数据库连接
JdbcTemplate常用语句代码示例
JdbcTemplate常用语句代码示例
200 0
|
存储 SQL JSON
一种关于低代码平台(LCDP)建设实践与设计思路
作者在负责菜鸟商业中心CRM系统开发过程中发现有一个痛点:业务线很多,每个业务线对同一个页面都有个性化布局和不同的字段需求,而他所在的团队就3个人,那么在资源有限的情况下该如何支撑呢?本文就降本的情况下,和大家分享下作者是如何低成本构建产品能力去支撑多条业务线、多租户的。
一种关于低代码平台(LCDP)建设实践与设计思路
|
TensorFlow API 算法框架/工具
解决AttributeError: module ‘tensorflow‘ has no attribute ‘div‘
解决AttributeError: module ‘tensorflow‘ has no attribute ‘div‘
286 0
解决AttributeError: module ‘tensorflow‘ has no attribute ‘div‘
|
网络安全 数据安全/隐私保护
|
消息中间件 数据安全/隐私保护
RabbitMQ安装简单过程
找到一本ACTION IN RABBITMQ,仔细看。现在先安装起来。。 参考主要的URL,包括安装,用户管理,权限管理。我用的都是最新版本。 http://my.oschina.net/indestiny/blog/192313 http://my.
890 0