关于TrieField的全面认识、理解、运用

简介: 假期重新把之前在新浪博客里面的文字梳理了下,搬到这里。

关于trieField的理解补充下3篇文档,相当的系统、全面!看相关文档连接,不解释。

http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/search/NumericRangeQuery.html

http://blog.csdn.net/fancyerii/article/details/7256379

http://hadoopcn.iteye.com/blog/1550402

http://rdc.taobao.com/team/jm/archives/1699

 

public final class NumericRangeQueryNumber>

extends MultiTermQuery

A Query that matches numeric values within a specified range. To use this, you must first index the numeric values using NumericField (expert: NumericTokenStream). If your terms are instead textual, you should use TermRangeQuery. NumericRangeFilter is the filter equivalent of this query.

You create a new NumericRangeQuery with the static factory methods, eg:

Query q = NumericRangeQuery.newFloatRange("weight", 0.03f, 0.10f, true, true);

matches all documents whose float valued "weight" field ranges from 0.03 to 0.10, inclusive.

The performance of NumericRangeQuery is much better than the corresponding TermRangeQuery because the number of terms that must be searched is usually far fewer, thanks to trie indexing, described below.

You can optionally specify a precisionStep when creating this query. This is necessary if you've changed this configuration from its default (4) during indexing. Lower values consume more disk space but speed up searching. Suitable values are between 1 and 8. A good starting point to test is 4, which is the default value for all Numeric* classes. See below for details.

This query defaults to MultiTermQuery.CONSTANT_SCORE_AUTO_REWRITE_DEFAULT for 32 bit (int/float) ranges with precisionStep ≤8 and 64 bit (long/double) ranges with precisionStep ≤6. Otherwise it uses MultiTermQuery.CONSTANT_SCORE_FILTER_REWRITE as the number of terms is likely to be high. With precision steps of ≤4, this query can be run with one of the BooleanQuery rewrite methods without changing BooleanQuery's default max clause count.

How it works

See the publication about panFMP, where this algorithm was described (referred to as TrieRangeQuery):

Schindler, U, Diepenbroek, M, 2008. Generic XML-based Framework for Metadata Portals.Computers & Geosciences 34 (12), 1947-1955. doi:10.1016/j.cageo.2008.02.023

A quote from this paper: Because Apache Lucene is a full-text search engine and not a conventional database, it cannot handle numerical ranges (e.g., field value is inside user defined bounds, even dates are numerical values). We have developed an extension to Apache Lucene that stores the numerical values in a special string-encoded format with variable precision (all numerical values like doubles, longs, floats, and ints are converted to lexicographic sortable string representations and stored with different precisions (for a more detailed description of how the values are stored, see NumericUtils). A range is then divided recursively into multiple intervals for searching: The center of the range is searched only with the lowest possible precision in thetrie, while the boundaries are matched more exactly. This reduces the number of terms dramatically.

For the variant that stores long values in 8 different precisions (each reduced by 8 bits) that uses a lowest precision of 1 byte, the index contains only a maximum of 256 distinct values in the lowest precision. Overall, a range could consist of a theoretical maximum of 7*255*2 + 255 = 3825distinct terms (when there is a term for every distinct value of an 8-byte-number in the index and the range covers almost all of them; a maximum of 255 distinct values is used because it would always be possible to reduce the full 256 values to one term with degraded precision). In practice, we have seen up to 300 terms in most cases (index with 500,000 metadata records and a uniform value distribution).

Precision Step

You can choose any precisionStep when encoding values. Lower step values mean more precisions and so more terms in index (and index gets larger). On the other hand, the maximum number of terms to match reduces, which optimized query speed. The formula to calculate the maximum term count is:

n = [ (bitsPerValue/precisionStep - 1) * (2^precisionStep - 1 ) * 2 ] + (2^precisionStep - 1 )

(this formula is only correct, when bitsPerValue/precisionStep is an integer; in other cases, the value must be rounded up and the last summand must contain the modulo of the division as precision step). For longs stored using a precision step of 4, n = 15*15*2 + 15 = 465, and for a precision step of 2, n = 31*3*2 + 3 = 189. But the faster search speed is reduced by more seeking in the term enum of the index. Because of this, the ideal precisionStep value can only be found out by testing. Important: You can index with a lower precision step value and test search speed using a multiple of the original step value.

Good values for precisionStep are depending on usage and data type:

·      The default for all data types is 4, which is used, when no precisionStep is given.

·      Ideal value in most cases for 64 bit data types (long, double) is 6 or 8.

·      Ideal value in most cases for 32 bit data types (int, float) is 4.

·      For low cardinality fields larger precision steps are good. If the cardinality is < 100, it is fair to use Integer.MAX_VALUE (see below).

·      Steps ≥64 for long/double and ≥32 for int/float produces one token per value in the index and querying is as slow as a conventional TermRangeQuery. But it can be used to produce fields, that are solely used for sorting (in this case simply use Integer.MAX_VALUE as precisionStep). Using NumericFields for sorting is ideal, because building the field cache is much faster than with text-only numbers. These fields have one term per value and therefore also work with term enumeration for building distinct lists (e.g. facets / preselected values to search for). Sorting is also possible with range query optimized fields using one of the above precisionSteps.

Comparisons of the different types of RangeQueries on an index with about 500,000 docs showed that TermRangeQuery in boolean rewrite mode (with raised BooleanQuery clause count) took about 30-40 secs to complete, TermRangeQuery in constant score filter rewrite mode took 5 secs and executing this class took <100ms to complete (on an Opteron64 machine, Java 1.5, 8 bit precision step). This query type was developed for a geographic portal, where the performance for e.g. bounding boxes or exact date/time stamps is important.

Since:

2.9

See Also:

Serialized Form

目录
相关文章
|
3天前
|
关系型数据库 Serverless 分布式数据库
高峰无忧,探索PolarDB PG版Serverless的弹性魅力
在数字经济时代,数据库成为企业命脉,面对爆炸式增长的数据,企业面临管理挑战。云原生和Serverless技术革新数据库领域,PolarDB PG Serverless作为阿里云的云原生数据库解决方案,融合Serverless与PostgreSQL,实现自动弹性扩展,按需计费,降低运维成本。它通过计算与存储分离技术,提供高可用性、灾备策略和简化运维。PolarDB PG Serverless智能应变业务峰值,实时监控与调整资源,确保性能稳定。通过免费体验,用户可观察其弹性性能和价格力,感受技术优势。
|
12天前
|
Kubernetes 安全 Devops
【云效流水线 Flow 测评】驾驭云海:五大场景下的云效Flow实战部署评测
云效是一款企业级持续集成和持续交付工具,提供免费、高可用的服务,集成阿里云多种服务,支持蓝绿、分批、金丝雀等发布策略。其亮点包括快速定位问题、节省维护成本、丰富的企业级特性及与团队协作的契合。基础版和高级版分别针对小型企业和大规模团队,提供不同功能和服务。此外,云效对比Jenkins在集成阿里云服务和易用性上有优势。通过实战演示了云效在ECS和K8s上的快速部署流程,以及代码质量检测和AI智能排查功能,展示了其在DevOps流程中的高效和便捷,适合不同规模的企业使用。本文撰写用时5小时,请各位看官帮忙多多支持,如有建议也请一并给出,您的建议能帮助我下一篇更加出色。
136108 16
|
13天前
|
存储 缓存 监控
你的Redis真的变慢了吗?性能优化如何做
本文先讲述了Redis变慢的判别方法,后面讲述了如何提升性能。
102161 2
|
13天前
|
机器学习/深度学习 并行计算 算法
Transformer 一起动手编码学原理
学习Transformer,快来跟着作者动手写一个。
94233 5
|
12天前
|
存储 SQL Apache
阿里云数据库内核 Apache Doris 基于 Workload Group 的负载隔离能力解读
阿里云数据库内核 Apache Doris 基于 Workload Group 的负载隔离能力解读
阿里云数据库内核 Apache Doris 基于 Workload Group 的负载隔离能力解读
|
18天前
|
人工智能 弹性计算 算法
一文解读:阿里云AI基础设施的演进与挑战
对于如何更好地释放云上性能助力AIGC应用创新?“阿里云弹性计算为云上客户提供了ECS GPU DeepGPU增强工具包,帮助用户在云上高效地构建AI训练和AI推理基础设施,从而提高算力利用效率。”李鹏介绍到。目前,阿里云ECS DeepGPU已经帮助众多客户实现性能的大幅提升。其中,LLM微调训练场景下性能最高可提升80%,Stable Difussion推理场景下性能最高可提升60%。
|
13天前
|
存储 弹性计算 Cloud Native
1 名工程师轻松管理 20 个工作流,创业企业用 Serverless 让数据处理流程提效
为应对挑战,语势科技采用云工作流CloudFlow和函数计算FC,实现数据处理流程的高效管理与弹性伸缩,提升整体研发效能。
64688 2
|
19天前
|
消息中间件 安全 API
Apache RocketMQ ACL 2.0 全新升级
RocketMQ ACL 2.0 不管是在模型设计、可扩展性方面,还是安全性和性能方面都进行了全新的升级。旨在能够为用户提供精细化的访问控制,同时,简化权限的配置流程。欢迎大家尝试体验新版本,并应用在生产环境中。
187462 6
|
15天前
|
存储 关系型数据库 数据库
|
23天前
|
物联网 PyTorch 测试技术
手把手教你捏一个自己的Agent
Modelscope AgentFabric是一个基于ModelScope-Agent的交互式智能体应用,用于方便地创建针对各种现实应用量身定制智能体,目前已经在生产级别落地。