Lucene DocValues: essentially a way to look up a field's value by docID (see the figure)

Introduction:

Why DocValues?

The standard way that Solr builds the index is with an inverted index. This style builds a list of terms found in all the documents in the index and next to each term is a list of documents that the term appears in (as well as how many times the term appears in that document). This makes search very fast - since users search by terms, having a ready list of term-to-document values makes the query process faster.
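As a rough illustration of the term-to-document structure described above, here is a minimal sketch in plain Python. It is not Lucene or Solr code: the function name `build_inverted_index`, the whitespace tokenization, and the sample documents are all assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: number of times the term appears in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        # Naive tokenization (assumption): lowercase and split on whitespace.
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)

# Hypothetical sample documents; the list position acts as the docID.
docs = [
    "solr builds the index",
    "the index lists the terms",
    "terms point to documents",
]
index = build_inverted_index(docs)

# A query by term is a single dictionary lookup.
postings_for_the = index["the"]
```

Looking up `index["the"]` returns both the matching docIDs and the per-document term counts, which is why term queries are cheap with this layout.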

For other features that we now commonly associate with search, such as sorting, faceting, and highlighting, this approach is not very efficient. The faceting engine, for example, must look up each term that appears in each document that will make up the result set and pull the document IDs in order to build the facet list. In Solr, this is maintained in memory, and can be slow to load (depending on the number of documents, terms, etc.).

In Lucene 4.0, a new approach was introduced. DocValue fields are now column-oriented fields with a document-to-value mapping built at index time. This approach promises to relieve some of the memory requirements of the fieldCache and make lookups for faceting, sorting, and grouping much faster.
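To make the document-to-value idea concrete, the sketch below counts facet values from a column indexed by docID. It is plain Python rather than the Lucene API; `category_column`, `facet_counts`, and the sample values are hypothetical names chosen for illustration.

```python
from collections import Counter

# Hypothetical column for a "category" field:
# the list position is the docID, the entry is that document's value.
category_column = ["book", "music", "book", "film", "book"]

def facet_counts(column, matching_doc_ids):
    """Count field values over the result set with one lookup per docID."""
    return Counter(column[doc_id] for doc_id in matching_doc_ids)

# Suppose a query matched documents 0, 2 and 3.
counts = facet_counts(category_column, [0, 2, 3])
```

With this layout, faceting walks only the matching docIDs and reads one value for each, instead of walking every term's postings to find out which documents contain it.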

 

From day one, Apache Lucene provided a solid inverted index data structure and the ability to store text and binary chunks in stored fields. In a typical use case the inverted index is used to retrieve and score documents matching one or more terms. Once the matching documents have been scored, stored fields are loaded for the top N documents for display purposes. So far so good! However, the retrieval process is essentially limited to the information available in the inverted index, such as term and document frequency, boosts, and normalization factors. So what if you need custom information to score or filter documents? Stored fields are designed for bulk reads, meaning they perform best if you load all their data, whereas during document retrieval we need more fine-grained data.

Lucene provides a RAM-resident FieldCache built from the inverted index. It is built the first time the FieldCache for a specific field is requested, or during index reopen. Internally we call this process un-inverting the field, since the inverted index is a value-to-document mapping and FieldCache is a document-to-value data structure. For simplicity, think of an array indexed by Lucene's internal document ID. When the FieldCache is loaded, Lucene iterates over all terms in a field, parses the terms' values, and fills the array's slots based on the document IDs associated with each term. Figure 1 illustrates the process.

Figure 1. Un-inverting a field to FieldCache

FieldCache serves its purpose very well, since accessing a value is basically a constant-time array lookup. However, there are special cases where other data structures are used in FieldCache, but those are out of scope for this post.
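The un-inverting step described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the FieldCache implementation: the function `uninvert`, its parameters, and the sample postings are assumptions.

```python
def uninvert(postings, max_doc, parse=int, missing=0):
    """Build a docID-indexed array from a term -> [docIDs] mapping."""
    values = [missing] * max_doc
    for term, doc_ids in postings.items():
        value = parse(term)          # parse the term text once per term
        for doc_id in doc_ids:
            values[doc_id] = value   # fill the slot of each doc containing the term
    return values

# Hypothetical postings for a single-valued numeric field.
price_postings = {"1005": [0, 2], "1006": [1]}

# Document 3 has no value for the field, so it keeps the `missing` default.
cache = uninvert(price_postings, max_doc=4)

# After loading, reading a document's value is a plain array lookup.
value_for_doc_2 = cache[2]
```

The cost is paid up front: every term of the field is visited once to fill the array, which is why the first request for a field can be slow to load.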

 

Excerpted from: http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/

 

Low Level Details

Lucene has four underlying types that a docvalues field can have. Currently Solr uses three of these:

  1. NUMERIC: a single-valued per-document numeric type. This is like having a large long[] array for the whole index, though the data is compressed based upon the values that are actually used.
    • For example, consider 3 documents with these values:
             doc[0] = 1005
             doc[1] = 1006
             doc[2] = 1005
      In this example the field would use around 1 bit per document, since that is all that is needed.
  2. SORTED: a single-valued per-document string type. This is like having a large String[] array for the whole index, but with an additional level of indirection. Each unique value is assigned a term number that represents its ordinal value. So each document really stores a compressed integer, and separately there is a "dictionary" mapping these term numbers back to term values.
    • For example, consider 3 documents with these values:
             doc[0] = "aardvark"
             doc[1] = "beaver"
             doc[2] = "aardvark"
      Value "aardvark" will be assigned ordinal 0, and "beaver" 1, creating these two data structures:
             doc[0] = 0
             doc[1] = 1
             doc[2] = 0
      
             term[0] = "aardvark"
             term[1] = "beaver"
  3. SORTED_SET: a multi-valued per-document string type. It's similar to SORTED, except each document has a "set" of values (in increasing sorted order). So it intentionally discards duplicate values (frequency) within a document and loses order within the document.
    • For example, consider 3 documents with these values:
             doc[0] = "cat", "aardvark", "beaver", "aardvark"
             doc[1] =
             doc[2] = "cat"
      Value "aardvark" will be assigned ordinal 0, "beaver" 1, and "cat" 2, creating these two data structures:
             doc[0] = [0, 1, 2]
             doc[1] = []
             doc[2] = [2]
      
             term[0] = "aardvark"
             term[1] = "beaver"
             term[2] = "cat"
  4. BINARY: a single-valued per-document byte[] array. This can be used for encoding custom per-document data structures.
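The NUMERIC, SORTED, and SORTED_SET examples above can be reproduced with a short plain-Python sketch. It does not use Lucene and does not mirror its on-disk format. In particular, storing each number as an offset from the minimum is an assumption used here only to show why the example needs about 1 bit per document; all function names are hypothetical.

```python
def bits_per_value(values):
    """Bits per document if each value is stored as an offset from the minimum
    (an illustrative assumption, not Lucene's actual codec)."""
    spread = max(values) - min(values)
    return max(1, spread.bit_length())

def encode_sorted(doc_values):
    """SORTED: one ordinal per document plus a sorted term dictionary."""
    terms = sorted(set(doc_values))
    ord_of = {term: i for i, term in enumerate(terms)}
    return [ord_of[v] for v in doc_values], terms

def encode_sorted_set(doc_values):
    """SORTED_SET: a sorted, de-duplicated list of ordinals per document."""
    terms = sorted({v for values in doc_values for v in values})
    ord_of = {term: i for i, term in enumerate(terms)}
    ords = [sorted({ord_of[v] for v in values}) for values in doc_values]
    return ords, terms

# NUMERIC example from the list above: 1005, 1006, 1005.
numeric_bits = bits_per_value([1005, 1006, 1005])

# SORTED example: each document stores an ordinal into the dictionary.
sorted_ords, sorted_terms = encode_sorted(["aardvark", "beaver", "aardvark"])

# SORTED_SET example: duplicates and in-document order are discarded.
set_ords, set_terms = encode_sorted_set(
    [["cat", "aardvark", "beaver", "aardvark"], [], ["cat"]]
)

# Ordinals map back to term values through the dictionary.
first_doc_terms = [set_terms[o] for o in set_ords[0]]
```

Because ordinals follow the sorted order of the terms, comparing two documents for sorting only needs an integer comparison; the dictionary is consulted only when the actual term text is required.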



















This article is reposted from 张昺华-sky's cnblogs blog. Original link: http://www.cnblogs.com/bonelee/p/6669714.html. To repost it, please contact the original author.
