Lucene .dvd and .dvm files are the DocValues files: column-oriented storage for field values

public final class Lucene54DocValuesFormat
extends DocValuesFormat
Lucene 5.4 DocValues format.

Encodes the five per-document value types (Numeric, Binary, Sorted, SortedSet, SortedNumeric) with these strategies:

NUMERIC:

  • Delta-compressed: per-document integers written as deltas from the minimum value, compressed with bitpacking. For more information, see DirectWriter.
  • Table-compressed: when the number of unique values is very small (< 256), and when there are unused "gaps" in the range of values used (such as SmallFloat), a lookup table is written instead. Each per-document entry is instead the ordinal to this table, and those ordinals are compressed with bitpacking (DirectWriter).
  • GCD-compressed: when all numbers share a common divisor, such as dates, the greatest common divisor (GCD) is computed, and quotients are stored using Delta-compressed Numerics.
  • Monotonic-compressed: when all numbers are monotonically increasing offsets, they are written as blocks of bitpacked integers, encoding the deviation from the expected delta.
  • Const-compressed: when there is only one possible non-missing value, only the missing bitset is encoded.
  • Sparse-compressed: only documents with a value are stored, and lookups are performed using binary search.
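
To make the delta and GCD strategies concrete, here is a minimal sketch of the arithmetic they describe. It is not Lucene's implementation: the class `NumericEncodingSketch` and its methods are hypothetical names, the bitpacking step itself (DirectWriter) is left out, and the sketch assumes a non-empty input whose range does not overflow a `long`.

```java
import java.util.Arrays;

// Illustrative sketch only; hypothetical names, not Lucene's actual writer code.
final class NumericEncodingSketch {

    /** Bits needed to store any value in [0, maxValue] (at least 1). */
    static int bitsRequired(long maxValue) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
    }

    /** Euclid's algorithm for non-negative inputs; gcd(0, x) == x. */
    static long gcd(long a, long b) {
        while (b != 0) {
            long t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    /** Returns {min, gcd of (value - min), bits per stored quotient}. */
    static long[] plan(long[] values) {
        long min = Arrays.stream(values).min().getAsLong();
        long max = Arrays.stream(values).max().getAsLong();
        long g = 0;
        for (long v : values) {
            g = gcd(g, v - min);
        }
        if (g == 0) {
            g = 1; // every value equals min
        }
        return new long[] {min, g, bitsRequired((max - min) / g)};
    }

    /** Stores each value as the quotient (value - min) / g. */
    static long[] encode(long[] values, long min, long g) {
        long[] quotients = new long[values.length];
        for (int i = 0; i < values.length; i++) {
            quotients[i] = (values[i] - min) / g;
        }
        return quotients;
    }

    /** Inverse of encode for a single quotient. */
    static long decode(long quotient, long min, long g) {
        return min + quotient * g;
    }
}
```

For three day-aligned millisecond timestamps at days 3, 5 and 10, the quotients are 0, 2 and 7, which fit in 3 bits each, whereas the plain deltas from the minimum would need 30 bits each.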

BINARY:

  • Fixed-width Binary: one large concatenated byte[] is written, along with the fixed length. Each document's value can be addressed directly with multiplication (docID * length).
  • Variable-width Binary: one large concatenated byte[] is written, along with end addresses for each document. The addresses are written as Monotonic-compressed numerics.
  • Prefix-compressed Binary: values are written in chunks of 16, with the first value written completely and other values sharing prefixes. Chunk addresses are written as Monotonic-compressed numerics. A reverse lookup index is written from a portion of every 1024th term.
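
The variable-width layout can be sketched as one concatenated byte array plus an end address per document. The class `VariableBinarySketch` below is a hypothetical in-memory illustration: it keeps the addresses in a plain `long[]` rather than the Monotonic-compressed form, and it assumes every document has a value.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Illustrative sketch only; hypothetical names, not Lucene's on-disk format.
final class VariableBinarySketch {
    final byte[] data;         // all values concatenated in docID order
    final long[] endAddresses; // endAddresses[doc] = offset just past doc's value

    VariableBinarySketch(byte[][] perDocValues) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        endAddresses = new long[perDocValues.length];
        for (int doc = 0; doc < perDocValues.length; doc++) {
            out.write(perDocValues[doc], 0, perDocValues[doc].length);
            endAddresses[doc] = out.size();
        }
        data = out.toByteArray();
    }

    /** A document's value starts where the previous document's value ended. */
    byte[] get(int docID) {
        int start = docID == 0 ? 0 : (int) endAddresses[docID - 1];
        int end = (int) endAddresses[docID];
        return Arrays.copyOfRange(data, start, end);
    }
}
```

Because the end addresses never decrease, they are the kind of sequence the Monotonic-compressed strategy is described as handling. In the fixed-width case no address array is needed at all, since the start offset is simply docID * length.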

SORTED:

  • Sorted: a mapping of ordinals to deduplicated terms is written as Binary, along with the per-document ordinals written using one of the numeric strategies above.
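
A rough sketch of the ordinal mapping follows. The class `SortedOrdinalsSketch` is a hypothetical name. It assumes every document has exactly one value and uses Java `String` ordering for brevity, whereas Lucene orders terms by their bytes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

// Illustrative sketch only; hypothetical names, not Lucene's actual writer code.
final class SortedOrdinalsSketch {
    final List<String> terms; // ord -> term, sorted and deduplicated
    final int[] ords;         // docID -> ord

    SortedOrdinalsSketch(String[] perDocValues) {
        // TreeSet sorts the values and drops duplicates.
        terms = new ArrayList<>(new TreeSet<>(Arrays.asList(perDocValues)));
        ords = new int[perDocValues.length];
        for (int doc = 0; doc < perDocValues.length; doc++) {
            // Every value is present in terms, so the search result is its ordinal.
            ords[doc] = Collections.binarySearch(terms, perDocValues[doc]);
        }
    }

    String get(int docID) {
        return terms.get(ords[docID]);
    }
}
```

Each distinct term is stored once, and the per-document column holds only small integers, which is what lets the numeric strategies above apply to it.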

SORTED_SET:

  • Single: if all documents have 0 or 1 value, then data are written like SORTED.
  • SortedSet table: when there are few unique sets of values (< 256) then each set is assigned an id, a lookup table is written and the mapping from document to set id is written using the numeric strategies above.
  • SortedSet: a mapping of ordinals to deduplicated terms is written as Binary, an ordinal list and per-document index into this list are written using the numeric strategies above.
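
The general SortedSet layout (a term dictionary, one concatenated ordinal list, and a per-document index into that list) might look like the sketch below. `SortedSetSketch` and its fields are hypothetical names, plain arrays stand in for the compressed numerics, and `String` ordering stands in for Lucene's byte ordering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

// Illustrative sketch only; hypothetical names, not Lucene's actual writer code.
final class SortedSetSketch {
    final List<String> terms; // ord -> term, sorted and deduplicated
    final int[] ordList;      // ords of all documents, concatenated
    final int[] offsets;      // doc's ords are ordList[offsets[doc] .. offsets[doc + 1])

    SortedSetSketch(String[][] perDocValues) {
        TreeSet<String> unique = new TreeSet<>();
        for (String[] values : perDocValues) {
            unique.addAll(Arrays.asList(values));
        }
        terms = new ArrayList<>(unique);

        List<Integer> ords = new ArrayList<>();
        offsets = new int[perDocValues.length + 1];
        for (int doc = 0; doc < perDocValues.length; doc++) {
            // TreeSet sorts and deduplicates the document's own values.
            for (String value : new TreeSet<>(Arrays.asList(perDocValues[doc]))) {
                ords.add(Collections.binarySearch(terms, value));
            }
            offsets[doc + 1] = ords.size();
        }
        ordList = ords.stream().mapToInt(Integer::intValue).toArray();
    }

    int[] ords(int docID) {
        return Arrays.copyOfRange(ordList, offsets[docID], offsets[docID + 1]);
    }
}
```

A document with no values simply has two equal consecutive offsets, so it costs nothing in the ordinal list.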

SORTED_NUMERIC:

  • Single: if all documents have 0 or 1 value, then data are written like NUMERIC.
  • SortedSet table: when there are few unique sets of values (< 256) then each set is assigned an id, a lookup table is written and the mapping from document to set id is written using the numeric strategies above.
  • SortedNumeric: a value list and per-document index into this list are written using the numeric strategies above.
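
The table variant can be illustrated as follows: each distinct set of values gets an id, and every document stores only that id. `SetTableSketch` is a hypothetical name, and the `fitsInTable` check only mirrors the "< 256" condition quoted above; how Lucene actually decides between strategies is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; hypothetical names, not Lucene's actual writer code.
final class SetTableSketch {
    final List<long[]> table = new ArrayList<>(); // set id -> sorted values
    final int[] setIds;                           // docID -> set id

    SetTableSketch(long[][] perDocValues) {
        Map<List<Long>, Integer> ids = new HashMap<>();
        setIds = new int[perDocValues.length];
        for (int doc = 0; doc < perDocValues.length; doc++) {
            long[] sorted = perDocValues[doc].clone();
            Arrays.sort(sorted);
            // Use a List as the map key, since arrays do not compare by content.
            List<Long> key = new ArrayList<>();
            for (long v : sorted) {
                key.add(v);
            }
            Integer id = ids.get(key);
            if (id == null) {
                id = table.size();
                ids.put(key, id);
                table.add(sorted);
            }
            setIds[doc] = id;
        }
    }

    /** Mirrors the "few unique sets (< 256)" condition from the description. */
    boolean fitsInTable() {
        return table.size() < 256;
    }

    long[] get(int docID) {
        return table.get(setIds[docID]);
    }
}
```

With fewer than 256 distinct sets, every set id fits in a single byte, so the per-document column stays very small even when the sets themselves are large.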

Files:

  1. .dvd: DocValues data
  2. .dvm: DocValues metadata

Source: http://lucene.apache.org/core/6_4_2/core/org/apache/lucene/codecs/lucene54/Lucene54DocValuesFormat.html

 

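As the directory listing below shows, the DocValues files take up very little space: the .dvd and .dvm files are each reported as 1 MB (du -sm rounds up to whole megabytes, so each is at most 1 MB), compared with 299 MB for the stored fields data file (.fdt).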

du -sm elasticsearch/nodes/0/indices/hec_test2/0/index/*
299     elasticsearch/nodes/0/indices/hec_test2/0/index/_e.fdt
1       elasticsearch/nodes/0/indices/hec_test2/0/index/_e.fdx
1       elasticsearch/nodes/0/indices/hec_test2/0/index/_e.fnm
148     elasticsearch/nodes/0/indices/hec_test2/0/index/_e_Lucene50_0.doc
130     elasticsearch/nodes/0/indices/hec_test2/0/index/_e_Lucene50_0.tim
5       elasticsearch/nodes/0/indices/hec_test2/0/index/_e_Lucene50_0.tip
1       elasticsearch/nodes/0/indices/hec_test2/0/index/_e_Lucene54_0.dvd
1       elasticsearch/nodes/0/indices/hec_test2/0/index/_e_Lucene54_0.dvm
1       elasticsearch/nodes/0/indices/hec_test2/0/index/_e.si
1       elasticsearch/nodes/0/indices/hec_test2/0/index/segments_7
0       elasticsearch/nodes/0/indices/hec_test2/0/index/write.lock

This article is reposted from 张昺华-sky's blog on cnblogs (博客园). Original link: http://www.cnblogs.com/bonelee/p/6669414.html. To repost it, please contact the original author.



