Elasticsearch Hands-On Primer (Part 2) - Alibaba Cloud Developer Community

2022-05-13

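Core data types#

For how to use each type and the range of values it accepts, see the official Elasticsearch documentation. The mapping examples in this article nest the field definitions under a type name such as _doc, which is the syntax of Elasticsearch 6.x and earlier; from 7.0 on, properties sits directly under mappings.

Numeric types#

long, integer, short, byte, double, float, half_float, scaled_float

Example: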

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "number_of_bytes": {
          "type": "integer"
        },
        "time_in_seconds": {
          "type": "float"
        },
        "price": {
          "type": "scaled_float",
          "scaling_factor": 100
        }
      }
    }
  }
}
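As a rough illustration of what scaling_factor does in the mapping above: a scaled_float is stored as a long, obtained by multiplying the value by the factor and rounding. The helper names below are hypothetical, and Python's round only approximates the rounding Elasticsearch applies, so treat this as a sketch rather than the actual implementation.

```python
SCALING_FACTOR = 100  # same factor as the "price" field in the mapping above


def to_stored_long(value: float, scaling_factor: int = SCALING_FACTOR) -> int:
    """Scale a float and round it to the nearest integer for storage."""
    return round(value * scaling_factor)


def from_stored_long(stored: int, scaling_factor: int = SCALING_FACTOR) -> float:
    """Recover the (possibly rounded) float from the stored integer."""
    return stored / scaling_factor


if __name__ == "__main__":
    for price in (19.99, 19.994):
        stored = to_stored_long(price)
        print(price, "->", stored, "->", from_stored_long(stored))
```

With a factor of 100, anything beyond two decimal places is lost, which is usually acceptable for a price and lets the field be stored more compactly than a float.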

Date type#

date

Example:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "date": {
          "type": "date" 
        }
      }
    }
  }
}
PUT my_index/_doc/1
{ "date": "2015-01-01" }

Boolean type#

Elasticsearch also interprets the strings "true" and "false" as boolean values.

boolean

Example:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "is_published": {
          "type": "boolean"
        }
      }
    }
  }
}

Binary type#

binary

Example:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "name": {
          "type": "text"
        },
        "blob": {
          "type": "binary"
        }
      }
    }
  }
}
PUT my_index/_doc/1
{
  "name": "Some binary blob",
  "blob": "U29tZSBiaW5hcnkgYmxvYg==" 
}
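A binary field takes its value as a Base64-encoded string. The short sketch below, which uses only Python's standard library, shows how the blob value in the example above relates to the raw bytes; the variable names are just for illustration.

```python
import base64

raw = b"Some binary blob"

# Encode the raw bytes to a Base64 string before placing them in the JSON body.
encoded = base64.b64encode(raw).decode("ascii")

# Decode the stored string back into the original bytes.
decoded = base64.b64decode(encoded)

if __name__ == "__main__":
    print(encoded)
    print(decoded)
```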

Range types#

integer_range, float_range, long_range, double_range, date_range

Example:

PUT range_index
{
  "settings": {
    "number_of_shards": 2
  },
  "mappings": {
    "_doc": {
      "properties": {
        "expected_attendees": {
          "type": "integer_range"
        },
        "time_frame": {
          "type": "date_range", 
          "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
        }
      }
    }
  }
}
PUT range_index/_doc/1?refresh
{
  "expected_attendees" : { 
    "gte" : 10,
    "lte" : 20
  },
  "time_frame" : { 
    "gte" : "2015-10-31 12:00:00", 
    "lte" : "2015-11-01"
  }
}
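The format of time_frame above lists several alternatives separated by ||, and each one is tried in turn until a value parses. A minimal Python sketch of that idea follows. The function name parse_date is hypothetical, the strptime patterns are my own equivalents of the patterns in the mapping, and treating every value as UTC is an assumption.

```python
from datetime import datetime, timezone

# strptime equivalents of "yyyy-MM-dd HH:mm:ss" and "yyyy-MM-dd"
PATTERNS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d")


def parse_date(value: str) -> datetime:
    """Try each pattern in order, then fall back to epoch milliseconds."""
    for pattern in PATTERNS:
        try:
            parsed = datetime.strptime(value, pattern)
            return parsed.replace(tzinfo=timezone.utc)  # assumption: UTC
        except ValueError:
            continue
    # Last alternative: epoch_millis (raises ValueError if not an integer).
    return datetime.fromtimestamp(int(value) / 1000, tz=timezone.utc)


if __name__ == "__main__":
    for raw in ("2015-10-31 12:00:00", "2015-11-01", "1446292800000"):
        print(raw, "->", parse_date(raw).isoformat())
```

This is why the two bounds of time_frame in the document above can use different notations and still be accepted by the same field.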

Complex data types#

Object type and nested object type

Example:

PUT my_index/_doc/1
{ 
  "region": "US",
  "manager": { 
    "age":     30,
    "name": { 
      "first": "John",
      "last":  "Smith"
    }
  }
}

Internally, Elasticsearch flattens this document into the following form:

{
  "region":             "US",
  "manager.age":        30,
  "manager.name.first": "John",
  "manager.name.last":  "Smith"
}
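The flattening shown above can be imitated with a small recursive function. This is only a sketch of the idea (the name flatten is hypothetical), not how Elasticsearch implements it.

```python
def flatten(obj: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into a single dict keyed by dotted paths."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into inner objects, extending the dotted path.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


if __name__ == "__main__":
    doc = {
        "region": "US",
        "manager": {"age": 30, "name": {"first": "John", "last": "Smith"}},
    }
    print(flatten(doc))
```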

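Geo-type#

Elasticsearch supports geographic points. A point can be given as an object with lat and lon keys, or as an array in [lon, lat] order.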

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "location": {
          "type": "geo_point"
        }
      }
    }
  }
}
PUT my_index/_doc/1
{
  "text": "Geo-point as an object",
  "location": { 
    "lat": 41.12,
    "lon": -71.34
  }
}
PUT my_index/_doc/4
{
  "text": "Geo-point as an array",
  "location": [ -71.34, 41.12 ] 
}
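The two notations above describe the same point, but the array form lists longitude first, which is easy to get wrong. A small sketch, with a hypothetical helper name, that normalizes both forms to a (lat, lon) tuple:

```python
from typing import Tuple


def normalize_geo_point(location) -> Tuple[float, float]:
    """Return (lat, lon) for either the object form or the array form."""
    if isinstance(location, dict):
        # Object form: {"lat": ..., "lon": ...}
        return (float(location["lat"]), float(location["lon"]))
    # Array form: [lon, lat], longitude first
    lon, lat = location
    return (float(lat), float(lon))


if __name__ == "__main__":
    print(normalize_geo_point({"lat": 41.12, "lon": -71.34}))
    print(normalize_geo_point([-71.34, 41.12]))
```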

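Arrays and multi-fields#

For more details, see the official documentation.

Viewing the mapping of a type in an index#

GET /index/_mapping/type

Customizing the fields of a type#

You can add new fields to an existing type, but you cannot change the mapping of a field that already exists; attempting to do so returns an error. Note that an index with several mapping types, as in the example below, is only accepted by Elasticsearch 5.x and earlier.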

PUT twitter
{
  "mappings": {
    "user": {
      "properties": {
        "name": {
          "type": "text",        # analyzed, so it is full-text searchable
          "analyzer": "english"  # use the english analyzer for this field
        },
        "user_name": { "type": "keyword" },
        "email": { "type": "keyword" }
      }
    },
    "tweet": {
      "properties": {
        "content": { "type": "text" },
        "user_name": { "type": "keyword" },
        # date fields are never analyzed, so tweeted_at needs no extra setting;
        # the old "index": "not_analyzed" option applied to pre-5.0 string fields
        "tweeted_at": { "type": "date" }
      }
    }
  }
}

How mappings store complex data types under the hood#

Object type

{
    "address":{
        "province":"shandong",
        "city":"dezhou"
    },
    "name":"zhangsan",
    "age":"12"
}

This is converted to:

{
    "name" : [zhangsan],
    "age" : [12],
    "address.province" : [shandong],
    "address.city" : [dezhou]
}

Array of objects

{
    "address":[
        {"age":"12","name":"zhangsan"},
        {"age":"12","name":"zhangsan"},
        {"age":"12","name":"zhangsan"}
    ]
}

This is converted to:

{
    "address.age" : [12,12,12],
    "address.name" : [zhangsan,zhangsan,zhangsan]
}
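To make the array case concrete, here is a sketch that collects every value found under the same dotted path into one list. The function name is hypothetical and the code only mimics the shape of the result. Note that in this flattened form the link between the age and the name that belonged to the same object is no longer kept.

```python
from collections import defaultdict


def flatten_multi(obj: dict, prefix: str = "", out=None) -> dict:
    """Flatten objects and arrays of objects into path -> list of values."""
    if out is None:
        out = defaultdict(list)
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        # Treat a single value as a one-element array.
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                flatten_multi(item, path, out)
            else:
                out[path].append(item)
    return dict(out)


if __name__ == "__main__":
    doc = {"address": [{"age": "12", "name": "zhangsan"} for _ in range(3)]}
    print(flatten_multi(doc))
```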

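Exact match and full-text search#

Exact match (exact value)#

With an exact-value search, the input must be identical to the stored value to count as a hit. A phrase query comes close to this for text fields, although match_phrase still analyzes its input; a strictly exact lookup uses a term query on a keyword field.

"query": { "match_phrase": { "address": "mill lane" } }, # phrase search: a document is returned only if address contains the phrase "mill lane"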

Full-text search (full text)#

Full-text search normalizes text in several ways, for example:

Abbreviations: cn == china
Word forms: liked == like == likes
Case: Tom == tom
Synonyms: like == love

Example:

GET /_search
{
    "query": {
        "match" : {
            "message" : "this is a test"
        }
    }
}
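The normalization steps listed earlier in this section (abbreviations, word forms, case, synonyms) can be pictured as a small pipeline applied to each token. The sketch below is a toy: the lookup tables and the suffix rule are invented for the four examples in the list and bear no relation to the analyzers Elasticsearch actually ships.

```python
ABBREVIATIONS = {"cn": "china"}   # toy table, illustration only
SYNONYMS = {"love": "like"}       # toy table, illustration only


def stem(token: str) -> str:
    """Very naive stemming: liked -> like, likes -> like."""
    if token.endswith(("ed", "es")):
        return token[:-1]
    return token


def normalize(token: str) -> str:
    """Lowercase, expand abbreviations, stem, then map synonyms."""
    token = token.lower()
    token = ABBREVIATIONS.get(token, token)
    token = stem(token)
    return SYNONYMS.get(token, token)


def analyze(text: str) -> list:
    """Split on whitespace and normalize every token."""
    return [normalize(token) for token in text.split()]


if __name__ == "__main__":
    print(analyze("Tom likes CN"))
```

Because the same normalization is applied to both the indexed text and the query, inputs that look different on the surface can still match.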

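Inverted index and forward index#

Inverted index#

The inverted index covers every analyzed (tokenized) field of every document.

Suppose we have these two sentences:

doc1 : hello world you and me
doc2 : hi world how are you

The inverted index built from them looks like this:

term	doc1	doc2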
hello	*	-
world	*	*
you	*	*
and	*	-
me	*	-
hi	-	*
how	-	*
are	-	*

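If we now search for hello world you, the query is tokenized and each term is looked up in the index above. Both doc1 and doc2 are returned, but doc1 matches more of the terms (three versus two), so doc1 gets the higher score.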
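The lookup described above can be sketched in a few lines of Python. Whitespace tokenization and counting matched terms as the score are simplifying assumptions; real relevance scoring is more involved than a plain count.

```python
from collections import defaultdict

DOCS = {
    "doc1": "hello world you and me",
    "doc2": "hi world how are you",
}


def build_inverted_index(docs: dict) -> dict:
    """Map each term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index


def search(index: dict, query: str) -> dict:
    """Count, per document, how many query terms it contains."""
    scores = defaultdict(int)
    for term in query.split():
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    return dict(scores)


if __name__ == "__main__":
    index = build_inverted_index(DOCS)
    print(search(index, "hello world you"))
```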

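Forward index (doc value)#

Doc values in effect cover the document fields that are not analyzed.

Elasticsearch uses the inverted index for searching, but sorting, aggregations, and filtering rely on the forward index, which is what doc values are. Both are built at index time: the doc values are written alongside the inverted index. Doc values are stored on disk and, when enough memory is available, are also kept in memory.

A forward index looks roughly like this: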

document	name	age
doc1	Zhang San	12
doc2	Li Si	34

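The forward index is also written to files on disk, and the OS cache then caches it to speed up access to doc values. When the OS cache does not have enough memory to hold the whole forward index, the doc values that do not fit stay on disk.

On performance: the official Elasticsearch guidance is to rely heavily on the OS cache for caching and speed rather than caching data in JVM memory, which adds GC overhead and can even lead to OOM errors. The recommendation is therefore to give the JVM less memory and leave more to the OS cache; on a machine with 64 GB, for example, giving the JVM 16 GB is enough.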
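To see why sorting and aggregations want a forward index, consider the small table from this section held as a document-to-values mapping. The sketch below uses the same two documents; the function names are hypothetical and an in-memory dict merely stands in for the on-disk structure.

```python
# Forward index: document -> field values (the table shown earlier).
FORWARD_INDEX = {
    "doc1": {"name": "Zhang San", "age": 12},
    "doc2": {"name": "Li Si", "age": 34},
}


def column(field: str) -> dict:
    """Read one field for every document, as a doc_id -> value mapping."""
    return {doc_id: fields[field] for doc_id, fields in FORWARD_INDEX.items()}


def sort_by(field: str, reverse: bool = False) -> list:
    """Return document ids ordered by the given field."""
    col = column(field)
    return sorted(col, key=col.get, reverse=reverse)


def average(field: str) -> float:
    """A simple aggregation over one field."""
    col = column(field)
    return sum(col.values()) / len(col)


if __name__ == "__main__":
    print(sort_by("age", reverse=True))
    print(average("age"))
```

Both operations start from a document and need its field value, which is the direction an inverted index (term to documents) cannot answer efficiently.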

`doc value` storage compression -- `column` compression#

To reduce the space that doc values take up, they are compressed column by column. Suppose we have three docs:

doc 1: 550
doc 2: 550
doc 3: 500

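Identical values are merged: doc1 and doc2 both hold 550, so 550 only needs to be stored once.

If all values are identical, only that single value is kept.
If there are fewer than 256 distinct values, table encoding is used.
If there are more than 256 distinct values, check whether they share a common divisor; if they do, divide each value by the greatest common divisor and store the divisor alongside.

For example: doc1: 24  doc2: 36
 divide by the greatest common divisor, 12
    doc1: 2   doc2: 3  and store the greatest common divisor 12

If there is no common divisor, offset encoding is used instead: each value is stored as an offset from the smallest value.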
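The three encodings described above can each be sketched in a few lines. The function names are hypothetical, the input for the offset case is made up for illustration, and the real on-disk format is considerably more involved; the point is only to show what each scheme stores. For 24 and 36 the greatest common divisor works out to 12.

```python
from functools import reduce
from math import gcd


def table_encode(values: list) -> tuple:
    """Store each distinct value once, plus a small ordinal per document."""
    table = sorted(set(values))
    ordinals = [table.index(v) for v in values]
    return table, ordinals


def gcd_encode(values: list) -> tuple:
    """Divide every value by the greatest common divisor and keep the divisor."""
    divisor = reduce(gcd, values)
    return divisor, [v // divisor for v in values]


def offset_encode(values: list) -> tuple:
    """Store the smallest value and each value's distance from it."""
    base = min(values)
    return base, [v - base for v in values]


if __name__ == "__main__":
    print(table_encode([550, 550, 500]))
    print(gcd_encode([24, 36]))
    print(offset_encode([101, 103, 110]))  # made-up values with no common divisor
```

In every case the stored numbers are smaller than the originals, so they fit in fewer bits per document.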

Disabling `doc value`#

If we do not need operations such as aggregations, we can disable doc values when creating the mapping in order to save space.

PUT /index
{
    "mappings":{
        "my_type":{
            "properties":{
                "my_field":{
                    "type":"keyword",   # text fields do not support doc_values, so a keyword field is used here
                    "doc_values":false  # disable doc values
                }
            }
        }
    }
}
