ES doc_values介绍——本质是field value的列存储,做聚合分析用,ES默认开启,会占用存储空间(列存储压缩技巧,除公共除数或者同时减去最小数,字符串压缩的话,直接去重后用数字ID压缩)

简介:

doc_values

Doc values are the on-disk data structure, built at document index time, which makes this data access pattern possible. They store the same values as the _source but in a column-oriented fashion that is way more efficient for sorting and aggregations.(本质!!!) Doc values are supported on almost all field types, with the notable exception of analyzed string fields.

All fields which support doc values have them enabled by default. If you are sure that you don’t need to sort or aggregate on a field, or access the field value from a script, you can disable doc values in order to save disk space:

PUT my_index
{
  "mappings": { "my_type": { "properties": { "status_code": {  "type": "keyword" }, "session_id": {  "type": "keyword", "doc_values": false } } } } }
 

The status_code field has doc_values enabled by default.

The session_id has doc_values disabled, but can still be queried.

 

摘自:https://www.elastic.co/guide/en/elasticsearch/reference/current/doc-values.html

 

 

Column-store compression

At a high level, doc values are essentially a serialized column-store. As we discussed in the last section, column-stores excel at certain operations because the data is naturally laid out in a fashion that is amenable to those queries.

But they also excel at compressing data, particularly numbers. This is important for both saving space on disk and for faster access. Modern CPU’s are many orders of magnitude faster than disk drives (although the gap is narrowing quickly with upcoming NVMe drives). That means it is often advantageous to minimize the amount of data that must be read from disk, even if it requires extra CPU cycles to decompress.

To see how it can help compression, take this set of doc values for a numeric field:

Doc      Terms
-----------------------------------------------------------------
Doc_1 | 100
Doc_2 | 1000
Doc_3 | 1500
Doc_4 | 1200
Doc_5 | 300
Doc_6 | 1900
Doc_7 | 4200
-----------------------------------------------------------------

The column-stride layout means we have a contiguous block of numbers:[100,1000,1500,1200,300,1900,4200]

xxx

Doc values use several tricks like this. In order, the following compression schemes are checked:

  1. If all values are identical (or missing), set a flag and record the value
  2. If there are fewer than 256 values, a simple table encoding is used
  3. If there are > 256 values, check to see if there is a common divisor
  4. If there is no common divisor, encode everything as an offset from the smallest value

You’ll note that these compression schemes are not "traditional" general purpose compression like DEFLATE or LZ4.Because the structure of column-stores are rigid and well-defined, we can achieve higher compression by using specialized schemes rather than the more general compression algorithms like LZ4.

Note

You may be thinking "Well that’s great for numbers, but what about strings?" Strings are encoded similarly, with the help of an ordinal table. The strings are de-duplicated and sorted into a table, assigned an ID, and then those ID’s are used as numeric doc values. Which means strings enjoy many of the same compression benefits that numerics do.

The ordinal table itself has some compression tricks, such as using fixed, variable or prefix-encoded strings.

   

摘自:https://www.elastic.co/guide/en/elasticsearch/guide/current/_deep_dive_on_doc_values.html












本文转自张昺华-sky博客园博客,原文链接:http://www.cnblogs.com/bonelee/p/6401466.html,如需转载请自行联系原作者


相关文章
|
4月前
|
存储 Go
go 切片长度与容量的区别
go 切片长度与容量的区别
|
6月前
|
SQL 安全 数据挖掘
Elasticsearch如何聚合查询多个统计值,如何嵌套聚合?并相互引用,统计索引中某一个字段的空值率?语法是怎么样的?
Elasticsearch聚合查询用于复杂数据分析,包括统计空值率。示例展示了如何计算字段`my_field`非空非零文档的百分比。查询分为三步:总文档数计数、符合条件文档数计数及计算百分比。聚合概念涵盖度量、桶和管道聚合。脚本在聚合中用于动态计算。常见聚合类型如`sum`、`avg`、`date_histogram`等。组合使用可实现多值统计、嵌套聚合和空值率计算。[阅读更多](https://zhangfeidezhu.com/?p=515)
314 0
Elasticsearch如何聚合查询多个统计值,如何嵌套聚合?并相互引用,统计索引中某一个字段的空值率?语法是怎么样的?
|
5月前
|
存储 NoSQL Redis
Redis07命令-String类型字符串,不管是哪种格式,底层都是字节数组形式存储的,最大空间不超过512m,SET添加,MSET批量添加,INCRBY age 2可以,MSET,INCRSETEX
Redis07命令-String类型字符串,不管是哪种格式,底层都是字节数组形式存储的,最大空间不超过512m,SET添加,MSET批量添加,INCRBY age 2可以,MSET,INCRSETEX
|
算法 PHP 数据安全/隐私保护
为什么PHP的MD5可以将任意长度的数据映射为固定长度的哈希值?底层原理是什么?
为什么PHP的MD5可以将任意长度的数据映射为固定长度的哈希值?底层原理是什么?
273 0
|
机器学习/深度学习 算法 Java
四种方式统计二进制表示中 1 的个数
四种方式统计二进制表示中 1 的个数
L1-4 字符串压缩 (10 分)
编写一个程序,输入一个字符串,然后采用如下的规则对该字符串当中的每一个字符进行压缩: (1) 如果该字符是空格,则保留该字符; (2) 如果该字符是第一次出现或第三次出现或第六次出现,则保留该字符; (3) 否则,删除该字符。 例如,若用户输入“occurrence”,经过压缩后,字符c的第二次出现被删除,第一和第三次出现仍保留;字符r和e的第二次出现均被删除,因此最后的结果为:“ocurenc”。
106 0
|
Oracle 关系型数据库
[20171203]平均长度和虚拟列.txt
[20171203]平均长度和虚拟列.txt --//昨天看链接https://blog.dbi-services.com/doag-2017-avg_row_len-with-virtual-columns/ --//重复测试看看.
947 0