Lucene DocValues: essentially a way to look up a field's value by docID (see the figure)

Introduction:

Why DocValues?

The standard way that Solr builds the index is with an inverted index. This style builds a list of terms found in all the documents in the index and next to each term is a list of documents that the term appears in (as well as how many times the term appears in that document). This makes search very fast - since users search by terms, having a ready list of term-to-document values makes the query process faster.
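As a rough illustration of the term-to-document mapping described above, here is a minimal sketch in plain Python. It is not Lucene or Solr code; the function name `build_inverted_index` and the sample documents are made up for this example, and tokenization is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: number of times the term occurs in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        # Simplifying assumption: whitespace tokenization, lowercased terms.
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)

# Hypothetical sample documents; the list position plays the role of the docID.
docs = ["red apple", "green apple apple", "red car"]
index = build_inverted_index(docs)

print(index["apple"])  # {0: 1, 1: 2}
print(index["red"])    # {0: 1, 2: 1}
```

Looking up a term is a single dictionary access, which is why term-based search is fast with this layout.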

For other features that we now commonly associate with search, such as sorting, faceting, and highlighting, this approach is not very efficient. The faceting engine, for example, must look up each term that appears in each document that will make up the result set and pull the document IDs in order to build the facet list. In Solr, this is maintained in memory, and can be slow to load (depending on the number of documents, terms, etc.).

In Lucene 4.0, a new approach was introduced. DocValue fields are now column-oriented fields with a document-to-value mapping built at index time. This approach promises to relieve some of the memory requirements of the fieldCache and make lookups for faceting, sorting, and grouping much faster.
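To make the document-to-value idea concrete, the following sketch counts facets and sorts hits using a column indexed by docID. It is plain Python, not the Lucene API; `color_column`, `facet_counts`, and the docIDs in `hits` are hypothetical names and data chosen for the example.

```python
from collections import Counter

# Hypothetical column for a "color" field: list position = docID, entry = value.
color_column = ["red", "green", "red", "blue"]

def facet_counts(column, matching_doc_ids):
    """Count field values across the matching documents, one lookup per docID."""
    return Counter(column[doc_id] for doc_id in matching_doc_ids)

# Assume a query matched documents 0, 2 and 3.
hits = [0, 2, 3]
counts = facet_counts(color_column, hits)
sorted_hits = sorted(hits, key=lambda doc_id: color_column[doc_id])

print(counts["red"], counts["blue"])  # 2 1
print(sorted_hits)                    # [3, 0, 2]
```

With this layout, faceting and sorting need only one direct lookup per matching document and never have to walk the term dictionary.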

 

From day one, Apache Lucene has provided a solid inverted index data structure and the ability to store text and binary chunks in stored fields. In a typical use case, the inverted index is used to retrieve and score documents matching one or more terms. Once the matching documents have been scored, stored fields are loaded for the top N documents for display purposes. So far so good! However, the retrieval process is essentially limited to the information available in the inverted index, such as term and document frequency, boosts, and normalization factors. So what if you need custom information to score or filter documents? Stored fields are designed for bulk reads, meaning they perform best when you load all of their data, whereas during document retrieval we need more fine-grained data.

Lucene provides a RAM-resident FieldCache that is built from the inverted index the first time the FieldCache for a specific field is requested, or during index reopen. Internally we call this process un-inverting the field, since the inverted index is a value-to-document mapping and FieldCache is a document-to-value data structure. For simplicity, think of an array indexed by Lucene's internal document ID. When the FieldCache is loaded, Lucene iterates over all terms in a field, parses each term's value, and fills the array's slots based on the document IDs associated with the term. Figure 1 illustrates the process.

Figure 1. Un-inverting a field to FieldCache

FieldCache serves its purpose very well, since accessing a value is basically a constant-time array lookup. There are special cases where other data structures are used in FieldCache, but those are out of scope for this post.
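The un-inverting step can be sketched in a few lines of Python. This only illustrates the idea and is not Lucene's FieldCache implementation; the `uninvert` function, its parameters, and the sample postings are assumptions made for this example.

```python
def uninvert(postings, max_doc, parse=int, missing=0):
    """Turn a term -> docIDs mapping into an array indexed by docID."""
    cache = [missing] * max_doc      # one slot per document
    for term, doc_ids in postings.items():
        value = parse(term)          # parse the term's text once per term
        for doc_id in doc_ids:
            cache[doc_id] = value    # fill the slot of every doc holding the term
    return cache

# Hypothetical postings for a numeric field: term text -> docIDs containing it.
postings = {"1005": [0, 2], "1006": [1]}
field_cache = uninvert(postings, max_doc=3)

print(field_cache)     # [1005, 1006, 1005]
print(field_cache[1])  # 1006
```

Building the array touches every term and every posting of the field, which is why the first load can be slow. After that, each read is a plain array access.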

 

Excerpted from: http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/

 

Low Level Details

Lucene has four underlying types that a docvalues field can have. Currently Solr uses three of these:

  1. NUMERIC: a single-valued per-document numeric type. This is like having a large long[] array for the whole index, though the data is compressed based upon the values that are actually used.
    • For example, consider 3 documents with these values:
             doc[0] = 1005
             doc[1] = 1006
             doc[2] = 1005
      In this example the field would use around 1 bit per document, since that is all that is needed.
  2. SORTED: a single-valued per-document string type. This is like having a large String[] array for the whole index, but with an additional level of indirection. Each unique value is assigned a term number that represents its ordinal value. So each document really stores a compressed integer, and separately there is a "dictionary" mapping these term numbers back to term values.
    • For example, consider 3 documents with these values:
             doc[0] = "aardvark"
             doc[1] = "beaver"
             doc[2] = "aardvark"
      Value "aardvark" will be assigned ordinal 0, and "beaver" 1, creating these two data structures:
             doc[0] = 0
             doc[1] = 1
             doc[2] = 0
      
             term[0] = "aardvark"
             term[1] = "beaver"
  3. SORTED_SET: a multi-valued per-document string type. It's similar to SORTED, except each document has a "set" of values (in increasing sorted order). So it intentionally discards duplicate values (frequency) within a document and loses the original order of values within the document.
    • For example, consider 3 documents with these values:
             doc[0] = "cat", "aardvark", "beaver", "aardvark"
             doc[1] =
             doc[2] = "cat"
      Value "aardvark" will be assigned ordinal 0, "beaver" 1, and "cat" 2, creating these two data structures:
             doc[0] = [0, 1, 2]
             doc[1] = []
             doc[2] = [2]
      
             term[0] = "aardvark"
             term[1] = "beaver"
             term[2] = "cat"
  4. BINARY: a single-valued per-document byte[] array. This can be used for encoding custom per-document data structures.
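The NUMERIC, SORTED, and SORTED_SET examples above can be reproduced with a small Python sketch. It is not Lucene's codec: the helper names are hypothetical, and the NUMERIC bit count assumes a simple delta-from-minimum encoding, which is only one way to arrive at the "around 1 bit per document" figure.

```python
def numeric_bits_per_doc(values):
    """Bits needed per document if each value is stored as (value - min).
    Assumption: simple delta-from-minimum packing, for illustration only."""
    return (max(values) - min(values)).bit_length()

def encode_sorted(values):
    """SORTED: one ordinal per document plus a sorted term dictionary."""
    terms = sorted(set(values))
    ord_of = {term: i for i, term in enumerate(terms)}
    return [ord_of[v] for v in values], terms

def encode_sorted_set(docs):
    """SORTED_SET: a sorted, de-duplicated list of ordinals per document."""
    terms = sorted({v for doc in docs for v in doc})
    ord_of = {term: i for i, term in enumerate(terms)}
    return [sorted({ord_of[v] for v in doc}) for doc in docs], terms

print(numeric_bits_per_doc([1005, 1006, 1005]))  # 1

ords, terms = encode_sorted(["aardvark", "beaver", "aardvark"])
print(ords, terms)  # [0, 1, 0] ['aardvark', 'beaver']

set_ords, set_terms = encode_sorted_set(
    [["cat", "aardvark", "beaver", "aardvark"], [], ["cat"]]
)
print(set_ords)   # [[0, 1, 2], [], [2]]
print(set_terms)  # ['aardvark', 'beaver', 'cat']
```

Because ordinals follow the sorted order of the terms, two documents can be compared by ordinal alone without looking at the term dictionary.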

This article is reposted from 张昺华-sky's blog on cnblogs (博客园). Original link: http://www.cnblogs.com/bonelee/p/6669714.html. To repost it, please contact the original author.
