How to reduce index size on disk? A few small techniques for shrinking an Elasticsearch index

Overview:

Ways to slim down Elasticsearch index files, summarized:

Raw data:
(1) Learn from Splunk: store the raw data as one big string.
(2) The raw files can then be compressed further.
Inverted index:
(1) Drop inverted-index information that is not needed: for example term-position postings, and keep only one of _source and field stores.
(2) Merge inverted-index files (segments) to get rid of redundant small files.
(3) Once the raw data is stored as a big string, the doc_values that power ES aggregations can be dropped.
(4) Other aspects: the postings list is a skip list, which is essentially a space-for-time trade; a sorted array could be used for storage instead.
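As a sketch of the "drop unnecessary inverted-index information" and "drop doc_values" points above: in an ES 1.x/2.x-style mapping these are per-field and per-type switches (the index name, type name, and field names here are hypothetical):

```json
{
  "mappings": {
    "logs": {
      "_source": { "enabled": false },
      "properties": {
        "raw":     { "type": "string", "index": "not_analyzed", "doc_values": false },
        "message": { "type": "string", "index_options": "docs" }
      }
    }
  }
}
```

Note the trade-offs: disabling _source means you can no longer view or reindex the original documents, and `"index_options": "docs"` drops term frequencies and positions, so phrase queries stop working on that field.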

 

 

Strange that I haven't received any suggestions on my query. Anyway, the following are some steps I performed to reduce index size. Hope it will help someone. Please feel free to add more in case I missed something.

1) Delete unnecessary fields (or do not index unwanted fields; I am handling this at the Logstash level)
2) Delete the @message field (if the message field is not in use, you can delete it)
3) Disable the _all field (be careful with this setting)
_all is a special catch-all field which concatenates the values of all other fields into one big string, using a space as the delimiter. It requires extra CPU cycles and uses more disk space. If not needed, it can be completely disabled.
Benefit of keeping _all enabled: it allows you to search for values in documents without knowing which field contains the value, at the cost of CPU.
Downside of disabling it: the Kibana search bar will no longer act as a full-text search bar, so users have to write queries like name:"vikas" or name:vika* (provided name is an analyzed field). Also, the _all field loses the distinction between field types (string, integer, or IP) because it stores all values as strings.
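For ES versions that still have _all (it was removed in 6.0), disabling it is a one-line mapping change, for example via a _default_ mapping applied to all types in the index:

```json
{
  "mappings": {
    "_default_": {
      "_all": { "enabled": false }
    }
  }
}
```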
4) Analyzed vs. not_analyzed fields: be very careful when deciding whether a field is analyzed or not analyzed, because to perform a partial search (name:vik*) we need an analyzed field, but it consumes more disk space. A recommended approach is to make all string fields not_analyzed in the first go, and then switch an individual field to analyzed if needed.
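In the ES 1.x string syntax this looks like the sketch below (the field name is illustrative; in ES 5.x and later the equivalent of a not_analyzed string is the keyword type):

```json
{
  "mappings": {
    "_default_": {
      "properties": {
        "name": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}
```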
5) doc_values: doc values are an on-disk data structure, built at document index time, which makes this data access pattern possible. They offload the fielddata heap burden by writing the fielddata to disk at index time, thereby allowing Elasticsearch to load the values outside of your Java heap as they are needed. In the latest versions of ES this feature is already enabled by default. In our case we are on ES 1.7.1 and have to enable it explicitly, which consumes extra disk space but does not degrade performance at all. The overall benefits of doc values significantly outweigh the cost.
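On ES 1.x, doc values are enabled per field, and only for not_analyzed strings and numeric/date fields; a hedged mapping sketch (field name illustrative):

```json
{
  "mappings": {
    "_default_": {
      "properties": {
        "name": { "type": "string", "index": "not_analyzed", "doc_values": true }
      }
    }
  }
}
```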

Thanks
VG

 

Source: https://discuss.elastic.co/t/how-to-reduce-index-size-on-disk/49415

The following is from: https://github.com/jordansissel/experiments/tree/master/elasticsearch/disk

logstash+elasticsearch storage experiments


These results are from an experiment done in 2012 and are irrelevant today.


Problem: Many users observe a 5x inflation of storage data from "raw logs" vs logstash data stored in elasticsearch.

Hypothesis: There are likely small optimizations we can make on the elasticsearch side to occupy less physical disk space.

Constraints: Data loss is not acceptable (can't just stop storing the logs)

Options:

  • Compression (LZF and Snappy)
  • Disable the '_all' field
  • For parsed logs, there are lots of duplicate and superfluous fields we can remove.

Discussion

The compression features really need no discussion.
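For reference, a sketch of how store compression was switched on in the 0.19-era Elasticsearch used for this experiment (the setting name below is recalled from that era and should be treated as an assumption; modern Elasticsearch compresses stored fields by default and exposes `index.codec` instead):

```yaml
# Circa ES 0.19: compress stored fields (the default codec was LZF)
index.store.compress.stored: true
```

Selecting Snappy over LZF additionally required the Snappy library on the classpath plus a compression-type setting, omitted here.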

The purpose of the '_all' field is documented in the link above. In logstash, users have reported success in disabling this feature without losing functionality.

In this scenario, I am parsing apache logs. Logstash reads lines from a file and sets the '@message' field to the contents of that line. After grok parses it and produces a nice structure, making fields like 'bytes', 'response', and 'clientip' available in the event, we no longer need the original log line, so it is quite safe to delete the @message (original log line) in this case. Doing this saves us much duplicate data in the event itself.
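A minimal logstash filter sketch of the parse-then-drop step described above (written with the current `remove_field` syntax; the 2012-era mutate filter used a slightly different option name, and modern logstash calls the field `message` rather than `@message`):

```text
filter {
  grok {
    # parse the raw apache line into structured fields
    # (bytes, response, clientip, ...)
    match => { "@message" => "%{COMBINEDAPACHELOG}" }
  }
  mutate {
    # the structured fields now carry all the information,
    # so drop the original line to avoid storing it twice
    remove_field => [ "@message" ]
  }
}
```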

Test scenarios

  • 0: test defaults
  • 1: disable _all
  • 2: store compress + disable _all
  • 3: store compress w/ snappy + disable _all
  • 4: compress + remove duplicate things (@message and @source)
  • 5: compress + remove all superfluous things (simulate 'apache logs in json')
  • 6: compress + remove all superfluous things + use 'grok singles'

Test data

One million apache logs from semicomplete.com:

% du -hs /data/jls/million.apache.logs 
218M    /data/jls/million.apache.logs
% wc -l /data/jls/million.apache.logs
1000000 /data/jls/million.apache.logs

Environment

This should be unrelated to the experiment, but it is included for posterity in case the run-time of these tests is of interest to you.

  • CPU: Xeon E31230 (4-core)
  • Memory: 16GB
  • Disk: Unknown spinning variety, 1TB

Results

| run    | space usage                            | elasticsearch/original ratio | run time (wall clock) |
|--------|----------------------------------------|------------------------------|-----------------------|
| ORIGIN | 218M /data/jls/million.apache.logs     | N/A                          | N/A                   |
| 0      | 1358M /data/jls/millionlogstest/0.yml  | 6.23x                        | 6m47.343s             |
| 1      | 1183M /data/jls/millionlogstest/1.yml  | 5.47x                        | 6m13.339s             |
| 2      | 539M /data/jls/millionlogstest/2.yml   | 2.47x                        | 6m17.103s             |
| 3      | 537M /data/jls/millionlogstest/3.yml   | 2.47x                        | 6m15.382s             |
| 4      | 395M /data/jls/millionlogstest/4.yml   | 1.81x                        | 6m39.278s             |
| 5      | 346M /data/jls/millionlogstest/5.yml   | 1.58x                        | 6m35.877s             |
| 6      | 344M /data/jls/millionlogstest/6.yml   | 1.57x                        | 6m27.440s             |

Conclusion

This test confirms what many logstash users have already reported: it is easy to end up with a 5-6x inflation in storage over raw logs, caused by common logstash filter usage such as grok.

Summary of test results:

  • Enabling store compression uses about 55% less storage.
  • Removing the @message and @source fields saves about 26% of storage.
  • Disabling the '_all' field saves about 13% of storage.
  • Using grok with 'singles => true' had no meaningful impact.
  • Compression ratios with LZF were the same as with Snappy.
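The percentages above can be recomputed from the results table; a quick sanity check in Python (sizes in MB taken from the table; small rounding differences against the prose figures are expected):

```python
# Index sizes in MB per test run, copied from the results table above.
sizes = {0: 1358, 1: 1183, 2: 539, 3: 537, 4: 395, 5: 346, 6: 344}

def saving(before: int, after: int) -> int:
    """Percent storage reduction going from run `before` to run `after`."""
    return round(100 * (sizes[before] - sizes[after]) / sizes[before])

print(saving(1, 2))  # store compression (run 1 -> 2): 54, i.e. ~55% in the summary
print(saving(2, 4))  # dropping @message/@source (run 2 -> 4): 27, ~26% in the summary
print(saving(0, 1))  # disabling '_all' (run 0 -> 1): 13
```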

Final storage size was about 25% of the common case (1358MB vs 344MB!)

Recommendations

  • Always enable compression in elasticsearch.
  • If you don't need the '_all' field, disable it.
  • The 'remove fields' steps performed here will be unnecessary if you log directly in a structured format. For example, if you follow the 'apache log in json' logstash cookbook recipe, grok, date, and mutate filters here will not be necessary, meaning the only tuning you'll have to do is in disabling '_all' and enabling compression in elasticsearch.

Future Work

It's likely we can take this example of "ship apache 'combined format' access logs into logstash" a bit further and with some tuning improve storage a bit more.

For now, I am happy to have reduced the inflation from 6.2x to 1.58x :)

This article is reposted from the cnblogs blog of 张昺华-sky; original link: http://www.cnblogs.com/bonelee/p/6401317.html. Please contact the original author before reprinting.
