Elasticsearch Study Notes and Using Elasticsearch with Scrapy - Alibaba Cloud Developer Community

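Elasticsearch concepts

Cluster: one or more nodes organized together.

Node: a single server in the cluster, identified by a name. In older releases the default name was that of a randomly chosen comic-book character.

Shard: an index can be split into multiple pieces. This allows horizontal partitioning and capacity scaling, and because several shards can serve requests in parallel, it improves performance and throughput.

Replica: one or more copies of a shard. If a node fails, the remaining nodes can take over.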
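To make the sharding idea concrete, here is a minimal Python sketch of how a document ID could be mapped to one of a fixed number of primary shards. Elasticsearch documentation describes routing with a formula of the form hash(routing) % number_of_primary_shards; the `zlib.crc32` hash and the `pick_shard` helper below are illustrative assumptions, not the hash Elasticsearch uses internally.

```python
import zlib


def pick_shard(doc_id: str, number_of_shards: int = 5) -> int:
    """Map a document ID to a shard number in [0, number_of_shards)."""
    # crc32 is only a stand-in for the real routing hash.
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards


# Group some hypothetical document IDs by the shard they would land on.
placement = {}
for doc_id in ["1", "2", "3", "4", "5", "6"]:
    placement.setdefault(pick_shard(doc_id), []).append(doc_id)
```

Because the shard is derived from the ID and the shard count, the same ID always lands on the same shard, which is also why the number of primary shards is fixed when the index is created.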

index, type, documents, and fields in Elasticsearch

These correspond one-to-one to database, table, row, and column in MySQL.

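Inverted index

The inverted index arises from the practical need to look up records by the value of an attribute. Each entry in such an index holds an attribute value together with the addresses of all records that have that value. Because the position of a record is determined from the attribute value, rather than the attribute value from the record, it is called an inverted index. A file that carries an inverted index is called an inverted index file, or inverted file for short.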
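The definition above can be illustrated with a small Python sketch that builds an inverted index over a few in-memory records. The record data and the `build_inverted_index` name are hypothetical; a real search engine stores far more per entry (positions, frequencies, and so on).

```python
from collections import defaultdict

# Hypothetical records: record ID -> attribute values (here, words in a title).
records = {
    1: ["python", "web"],
    2: ["python", "scrapy"],
    3: ["django", "web"],
}


def build_inverted_index(records):
    """Return a mapping from each value to the sorted IDs of records containing it."""
    index = defaultdict(set)
    for record_id, values in records.items():
        for value in values:
            index[value].add(record_id)
    return {value: sorted(ids) for value, ids in index.items()}


inverted = build_inverted_index(records)
print(inverted["python"])  # [1, 2]
print(inverted["web"])     # [1, 3]
```

Looking up a value is now a single dictionary access instead of a scan over every record.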

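CRUD operations on documents and indices

Setting the number of shards and replicas when creating an index

# Create an index with 5 shards and 1 replica per shard
PUT lagou
{
  "settings": {
    "index": {
      "number_of_shards": 5,
      "number_of_replicas": 1
    }
  }
}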

Viewing settings, changing settings, and viewing index information

# View settings
GET lagou/_settings
GET _all/_settings
GET _settings
GET .kibana,lagou/_settings

# Change a setting
PUT lagou/_settings
{
  "number_of_replicas": 2
}

# Index information
GET _all
GET lagou

Adding documents

# Add a document and let Elasticsearch generate the ID
POST lagou/job
{
  "title": "python分布式爬虫开发",
  "salary_min": 15000,
  "company": {
    "name": "百度",
    "company_addr": "北京软件园"
  },
  "publish_date": "2017-4-16",
  "comments": 15
}

# Add a document with an explicit ID
POST lagou/job/1
{
  "title": "python web开发",
  "salary_min": 18000,
  "company": {
    "name": "美团",
    "company_addr": "北京软件园"
  },
  "publish_date": "2017-4-16",
  "comments": 20
}

Retrieving documents

# Retrieve a document, optionally limiting which _source fields are returned
GET lagou/job/1
GET lagou/job/1?_source
GET lagou/job/1?_source=title
GET lagou/job/1?_source=title,comments

Updating a document: full overwrite and partial update

# Update by overwriting the whole document
PUT lagou/job/1
{
  "title": "python web开发",
  "salary_min": 18000,
  "company": {
    "city": "上海",
    "name": "美团",
    "company_addr": "北京软件园"
  },
  "publish_date": "2017-4-16",
  "comments": 201
}

# Partial update: only the listed fields change
POST lagou/job/1/_update
{
  "doc": {
    "comments": 20
  }
}
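The difference between the two update styles can be sketched in Python with plain dictionaries. This is only a model of the behavior, assuming that a partial update merges the supplied `doc` into the stored document (nested objects recursively); the `overwrite` and `partial_update` helpers are hypothetical names.

```python
import copy


def overwrite(stored, new_doc):
    """PUT-style update: the new document replaces the stored one entirely."""
    return copy.deepcopy(new_doc)


def partial_update(stored, partial):
    """_update-style merge: only supplied fields change; nested objects merge recursively."""
    merged = copy.deepcopy(stored)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = partial_update(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


stored = {"title": "python web开发", "salary_min": 18000, "comments": 201}
replaced = overwrite(stored, {"comments": 20})
patched = partial_update(stored, {"comments": 20})
```

Here `replaced` keeps only the `comments` field, while `patched` keeps `title` and `salary_min` and changes `comments` alone, which is why the overwrite request above has to resend every field.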

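Deleting a document, a type (not possible), and an index

# Delete a document
DELETE lagou/job/1
# Delete a type (not supported; this request fails)
DELETE lagou/job
# Delete an index
DELETE lagou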

Batch operations with mget and bulk

mget examples

# Specify index, type, and ID for every document
GET _mget
{
  "docs": [
    {
      "_index": "testdb",
      "_type": "job1",
      "_id": 1
    },
    {
      "_index": "testdb",
      "_type": "job1",
      "_id": 2
    }
  ]
}

# Put the shared index in the URL
GET testdb/_mget
{
  "docs": [
    {
      "_type": "job1",
      "_id": 1
    },
    {
      "_type": "job1",
      "_id": 2
    }
  ]
}

# Put the shared index and type in the URL
GET testdb/job1/_mget
{
  "docs": [
    {
      "_id": 1
    },
    {
      "_id": 2
    }
  ]
}

# Shorthand when only the IDs differ
GET testdb/job1/_mget
{
  "ids": [1, 2]
}

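bulk operations

A bulk request can combine multiple operations, such as index, delete, update, and create, in a single call. It can also help with importing data from one index into another.

action_and_meta_data\n
optional_source\n
action_and_meta_data\n
optional_source\n
action_and_meta_data\n
optional_source\n
....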
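The line-oriented format above can be produced from Python data with a few lines of code. The sketch below is an illustration only: `build_bulk_body` and its tuple layout are hypothetical, and it just serializes each action line and optional source line as JSON, one per line, ending with a newline.

```python
import json


def build_bulk_body(operations):
    """Serialize (action, metadata, source) tuples into a newline-delimited bulk body.

    `source` is None for actions that carry no document, such as delete.
    """
    lines = []
    for action, metadata, source in operations:
        lines.append(json.dumps({action: metadata}, ensure_ascii=False))
        if source is not None:
            lines.append(json.dumps(source, ensure_ascii=False))
    # Every line, including the last one, is terminated by a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body([
    ("index", {"_index": "test", "_type": "type1", "_id": "1"}, {"field1": "value1"}),
    ("delete", {"_index": "test", "_type": "type1", "_id": "2"}, None),
])
```

Each JSON object must stay on a single line, so pretty-printed JSON cannot be used in a bulk body.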

A simple example:

POST _bulk
{"index":{"_index":"lagou", "_type":"job", "_id":"1"}}
{"title":"python web开发", "salary_min":18000, "company":{"city":"上海","name":"美团","company_addr":"北京软件园"}, "publish_date":"2017-4-16", "comments":20}
{"index":{"_index":"lagou", "_type":"job2", "_id":"2"}}
{"title":"python web flask开发", "salary_min":20000, "company":{"city":"北京","name":"阿里","company_addr":"北京软件园2"}, "publish_date":"2017-4-18", "comments":20}

Other examples:

{"index":{"_index":"test","_type":"type1","_id":"1"}}
{"field1":"value1"}
{"delete":{"_index":"test","_type":"type1","_id":"2"}}
{"create":{"_index":"test","_type":"type1","_id":"3"}}
{"field1":"value3"}
{"update":{"_id":"1","_type":"type1","_index":"index1"}}
{"doc":{"field2":"value2"}}

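Mappings

When creating an index, you can define field types and related attributes in advance.

Elasticsearch guesses the field mapping you want from the basic types in the JSON source and turns the input into searchable index terms. A mapping is where we define the field data types ourselves and tell Elasticsearch how to index the data and whether a field is searchable.

Purpose: makes the resulting index more precise and complete.

Kinds: static mapping and dynamic mapping.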
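As a rough illustration of the guessing that dynamic mapping performs, the Python sketch below maps JSON values to field types. It is a simplification under stated assumptions: it follows the commonly documented defaults (integers become `long`, floating-point numbers `float`, strings `text`) and ignores date detection and multi-fields. `guess_field_type` is a hypothetical helper, not part of any Elasticsearch client.

```python
def guess_field_type(value):
    """Guess a mapping type from a JSON value, loosely following dynamic-mapping defaults."""
    if isinstance(value, bool):   # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, str):
        return "text"             # date detection is ignored in this sketch
    return None                   # null and other values add no field here


doc = {"title": "python web开发", "salary_min": 18000, "company": {"name": "美团"}}
guessed = {field: guess_field_type(value) for field, value in doc.items()}
```

A static mapping replaces these guesses with explicit choices, for example `integer` instead of `long` or `keyword` instead of `text`.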

Built-in types

String types: text, keyword

Numeric types: long, integer, short, byte, double, float

Date type: date

Boolean type: boolean

Binary type: binary

Complex types: object, nested

Geo types: geo-point, geo-shape

Specialized types: ip, completion

Mapping example

PUT lagou
{
  "mappings": {
    "job": {
      "properties": {
        "title": {
          "type": "text"
        },
        "salary_min": {
          "type": "integer"
        },
        "city": {
          "type": "keyword"
        },
        "company": {
          "properties": {
            "name": {
              "type": "text"
            },
            "company_addr": {
              "type": "text"
            },
            "employee_count": {
              "type": "integer"
            }
          }
        },
        "publish_date": {
          "type": "date",
          "format": "yyyy-MM-dd"
        },
        "comments": {
          "type": "integer"
        }
      }
    }
  }
}

Insert example:

# publish_date is zero-padded to match the yyyy-MM-dd format in the mapping
PUT lagou/job/2
{
  "title": "python分布式爬虫",
  "salary_min": 15000,
  "city": "北京",
  "company": {
    "name": "百度",
    "company_addr": "北京",
    "employee_count": 50
  },
  "publish_date": "2017-04-18",
  "comments": 15
}

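Queries

Basic queries: query using the query types built into Elasticsearch.

Compound queries: combine several queries into one.

Filters: while querying, narrow the results with filter conditions that do not affect scoring.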

Basic queries

First create the mapping:

PUT lagou
{
  "mappings": {
    "job": {
      "properties": {
        "title": {
          "store": true,
          "type": "text",
          "analyzer": "ik_max_word"
        },
        "company": {
          "store": true,
          "type": "keyword"
        },
        "desc": {
          "type": "text"
        },
        "add_time": {
          "type": "date",
          "format": "yyyy-MM-dd"
        },
        "comments": {
          "type": "integer"
        }
      }
    }
  }
}

match query (analyzed query):

GET lagou/job/_search
{
  "query": {
    "match": {
      "title": "python爬虫"
    }
  }
}

# The query text "python爬虫" is analyzed into "python" and "爬虫"; documents containing either token are returned

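term query (exact-term query):

GET lagou/job/_search
{
  "query": {
    "term": {
      "title": "python爬虫"
    }
  }
}

# The query text is not analyzed: "python爬虫" is looked up as a single exact term, so only documents whose title contains exactly that term in the index are returned, much like a keyword lookup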
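The contrast between match and term can be modeled in a few lines of Python. This is a toy sketch, not how Elasticsearch is implemented: the `VOCAB` dictionary and the `analyze` tokenizer are hypothetical stand-ins for a real analyzer such as ik_max_word, and scoring is ignored.

```python
VOCAB = ["python", "分布式", "爬虫", "开发", "web"]  # hypothetical dictionary


def analyze(text):
    """Toy dictionary tokenizer standing in for a real analyzer."""
    text = text.lower()
    tokens, i = [], 0
    while i < len(text):
        for word in VOCAB:
            if text.startswith(word, i):
                tokens.append(word)
                i += len(word)
                break
        else:
            i += 1  # skip characters the vocabulary does not cover
    return tokens


titles = {1: "python分布式爬虫开发", 2: "python web开发"}

# Inverted index over the analyzed titles: token -> set of document IDs.
index = {}
for doc_id, title in titles.items():
    for token in analyze(title):
        index.setdefault(token, set()).add(doc_id)


def match_query(text):
    """Analyze the query text, then return documents containing any resulting token."""
    hits = set()
    for token in analyze(text):
        hits |= index.get(token, set())
    return hits


def term_query(term):
    """Look the term up as-is, without analysis."""
    return index.get(term, set())
```

With this data, `match_query("python爬虫")` finds both documents because "python" occurs in each, while `term_query("python爬虫")` finds nothing because no single indexed token equals that string.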

terms query (exact match on several terms):

GET lagou/job/_search
{
  "query": {
    "terms": {
      "title": ["python", "系统", "django"]
    }
  }
}

# Looks up each term in the list; a document is returned if it contains any one of them

Controlling how many results are returned:

GET lagou/job/_search
{
  "query": {
    "match": {
      "title": "python"
    }
  },
  "from": 0,
  "size": 3
}

# from: offset of the first result to return
# size: number of results to return
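The from and size parameters are what paging is built on. A minimal Python sketch, assuming 1-based page numbers and a hypothetical `page_query` helper that only builds the request body:

```python
def page_query(keyword, page, size=10):
    """Build a match query body for a 1-based page number."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    return {
        "query": {"match": {"title": keyword}},
        "from": (page - 1) * size,  # number of hits to skip
        "size": size,               # number of hits to return
    }


body = page_query("python", page=3, size=10)
```

For page 3 with 10 results per page, the body skips the first 20 hits and returns the next 10.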

Returning all documents:

GET lagou/_search
{
  "query": {
    "match_all": {}
  }
}

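Phrase query:

GET lagou/job/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "python师",
        "slop": 3
      }
    }
  }
}

# The query text is analyzed into "python" and "师"; the phrase matches when the two terms appear in the field with at most 3 positions between them
# slop: the maximum number of positions allowed between the terms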
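The effect of slop can be sketched in Python over a list of tokens. This is a deliberate simplification that handles only two terms appearing in order; real phrase matching also covers longer phrases and reordered terms. The `phrase_match` helper and the token list are hypothetical.

```python
def phrase_match(tokens, first, second, slop=0):
    """Return True if `second` follows `first` with at most `slop` tokens in between.

    Simplification: only in-order, two-term phrases are handled.
    """
    first_positions = [i for i, t in enumerate(tokens) if t == first]
    second_positions = [i for i, t in enumerate(tokens) if t == second]
    return any(
        0 <= q - p - 1 <= slop
        for p in first_positions
        for q in second_positions
    )


tokens = ["python", "高级", "开发", "工程", "师"]  # hypothetical analyzer output
```

In this token list there are three tokens between "python" and "师", so the phrase matches with a slop of 3 but not with a slop of 2.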

Querying multiple fields:

GET lagou/job/_search
{
  "query": {
    "multi_match": {
      "query": "python",
      "fields": ["title^3", "desc"]
    }
  }
}

# Finds documents whose title or desc field contains the keyword python
# A caret followed by a number sets the weight: here title is weighted 3 times as heavily as desc

Returning only specific fields:

GET lagou/job/_search
{
  "stored_fields": ["title", "company"],
  "query": {
    "match": {
      "title": "python"
    }
  }
}

# Only the title and company fields are returned
# Note: a field can be returned this way only if store is set to true in the mapping; the default is false

Sorting the results:

# Sort by comments in ascending order
GET lagou/job/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [{
    "comments": {
      "order": "asc"
    }
  }]
}

Range query:

GET lagou/job/_search
{
  "query": {
    "range": {
      "comments": {
        "gte": 10,
        "lte": 50,
        "boost": 2.0
      }
    }
  }
}

# gte: greater than or equal to; gt: greater than
# lte: less than or equal to; lt: less than

Wildcard (fuzzy) query:

GET lagou/_search
{
  "query": {
    "wildcard": {
      "title": {
        "value": "pyth*n",
        "boost": 2.0
      }
    }
  }
}
# boost sets the weight of this query
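To show what a pattern such as `pyth*n` accepts, here is a small Python sketch that translates a wildcard pattern into a regular expression, assuming `*` stands for any run of characters (including none) and `?` for exactly one character. The helper names are hypothetical, and the match is applied to a single indexed term, not to the whole field text.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a wildcard pattern (`*` and `?`) into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # zero or more characters
        elif ch == "?":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def wildcard_match(pattern, term):
    """Return True if the whole term matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(term) is not None
```

For example, `wildcard_match("pyth*n", "python")` is true, while `wildcard_match("pyth*n", "pythons")` is false because the term must end in "n".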

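bool compound queries

A bool query supports the following four clause types:

"bool": {
    "filter": [],   # filters on fields; does not take part in scoring
    "must": [],     # every query in the array must match
    "should": [],   # at least one query in the array must match
    "must_not": []  # none of the queries in the array may match
}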
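The semantics of the four clause types can be modeled in Python with plain predicates. This is a sketch under stated assumptions: scoring is ignored, so `must` and `filter_` behave the same, and `should` is required only when there is no `must` or `filter_` clause, which mirrors the documented default for `minimum_should_match`. The `bool_match` helper and the in-memory documents are hypothetical.

```python
def bool_match(doc, must=(), should=(), must_not=(), filter_=()):
    """Evaluate bool-query clauses against one document.

    Each clause is a predicate taking the document and returning True or False.
    """
    if not all(clause(doc) for clause in list(must) + list(filter_)):
        return False
    if any(clause(doc) for clause in must_not):
        return False
    # With no must/filter clauses, at least one should clause has to match.
    if should and not (must or filter_):
        return any(clause(doc) for clause in should)
    return True


docs = {
    1: {"salary": 10, "title": "python"},
    2: {"salary": 20, "title": "scrapy"},
    3: {"salary": 30, "title": "django"},
    4: {"salary": 30, "title": "elasticsearch"},
}

# (salary = 20 OR title = python) AND salary != 30
hits = [
    doc_id for doc_id, doc in docs.items()
    if bool_match(
        doc,
        should=[lambda d: d["salary"] == 20, lambda d: d["title"] == "python"],
        must_not=[lambda d: d["salary"] == 30],
    )
]
```

With these four documents, `hits` contains documents 1 and 2: the first matches on title, the second on salary, and neither has a salary of 30.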

Simple bool query examples

Each example here is paired with an SQL statement for comparison, to aid understanding.

Insert the data:

POST lagou/testjob/_bulk
{"index":{"_id":1}}
{"salary":10, "title":"Python"}
{"index":{"_id":2}}
{"salary":20, "title":"Scrapy"}
{"index":{"_id":3}}
{"salary":30, "title":"Django"}
{"index":{"_id":4}}
{"salary":30, "title":"Elasticsearch"}

A simple filter query:

select * from testjob where salary=20

GET lagou/testjob/_search
{
  "query": {
    "bool": {
      "must": {
        "match_all": {}
      },
      "filter": {
        "match": {
          "salary": 20
        }
      }
    }
  }
}

Several values can be specified at once by using terms:

GET lagou/testjob/_search
{
  "query": {
    "bool": {
      "must": {
        "match_all": {}
      },
      "filter": {
        "terms": {
          "salary": [10, 20]
        }
      }
    }
  }
}

A combined filter query:

select * from testjob where (salary=20 or title=python) and (salary != 30)

GET lagou/testjob/_search
{
  "query": {
    "bool": {
      "should": [
        {"term": {"salary": 20}},
        {"term": {"title": "python"}}
      ],
      "must_not": {
        "term": {
          "salary": 30
        }
      }
    }
  }
}

A nested bool query:

select * from testjob where title="Python" or (title="Elasticsearch" and salary=30)

# term values are lowercase because the default analyzer lowercases text at index time
GET lagou/testjob/_search
{
  "query": {
    "bool": {
      "should": [
        {"term": {"title": "python"}},
        {
          "bool": {
            "must": [
              {"term": {"title": "elasticsearch"}},
              {"term": {"salary": 30}}
            ]
          }
        }
      ]
    }
  }
}

Filtering on null and non-null values

First insert some test data:

POST lagou/testjob2/_bulk
{"index":{"_id":1}}
{"tags":["search"]}
{"index":{"_id":2}}
{"tags":["search","python"]}
{"index":{"_id":3}}
{"other_field":["some data"]}
{"index":{"_id":4}}
{"tags":null}
{"index":{"_id":5}}
{"tags":["search",null]}

Handling null values:

select tags from testjob2 where tags is not NULL

GET lagou/testjob2/_search
{
  "query": {
    "bool": {
      "filter": {
        "exists": {
          "field": "tags"
        }
      }
    }
  }
}
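Which of the five test documents pass the exists filter can be worked out with a short Python sketch. It assumes the usual rule that a field exists when it holds at least one non-null value; `field_exists` is a hypothetical helper, and the documents are copied from the test data above.

```python
def field_exists(doc, field):
    """Return True if the field holds at least one non-null value."""
    value = doc.get(field)
    if value is None:
        return False
    if isinstance(value, list):
        return any(item is not None for item in value)
    return True


docs = {
    1: {"tags": ["search"]},
    2: {"tags": ["search", "python"]},
    3: {"other_field": ["some data"]},
    4: {"tags": None},
    5: {"tags": ["search", None]},
}
with_tags = [doc_id for doc_id, doc in docs.items() if field_exists(doc, "tags")]
```

Under that rule documents 1, 2, and 5 match: document 3 has no tags field, document 4 has only null, and document 5 still counts because one of its values is non-null.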

Inspecting analyzer output

The ik_max_word analyzer produces as many tokens as possible.

# Show the tokens the analyzer produces
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "python网络"
}

The ik_smart analyzer produces as few tokens as possible.

# Show the tokens the analyzer produces
GET _analyze
{
  "analyzer": "ik_smart",
  "text": "python网络"
}
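The difference between the two modes can be imitated with a toy Python tokenizer. This is only an analogy: the real IK analyzers use their own dictionary and disambiguation rules, and the `VOCAB` set, `max_word`, and `smart` below are hypothetical.

```python
VOCAB = {"中华", "人民", "共和国", "中华人民", "中华人民共和国"}  # hypothetical dictionary


def max_word(text):
    """Fine-grained mode: emit every dictionary word found at any position."""
    return [
        text[i:j]
        for i in range(len(text))
        for j in range(i + 1, len(text) + 1)
        if text[i:j] in VOCAB
    ]


def smart(text):
    """Coarse-grained mode: greedily take the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            i += 1  # no dictionary word starts here; skip one character
    return tokens
```

For the text "中华人民共和国", the fine-grained function emits five overlapping tokens while the coarse-grained one emits a single token. More tokens make a document easier to find, and fewer tokens make matches more precise.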

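Storing Scrapy data in Elasticsearch

Install elasticsearch-dsl

pip install elasticsearch-dsl

A pipeline template from instructor bobby (parts of the code are omitted here)

Create a models.py file. Its job is to define the mapping; running it creates the index.

from datetime import datetime
# DocType is the document base class in the elasticsearch-dsl release these notes use
from elasticsearch_dsl import DocType, Date, Nested, Boolean, analyzer, InnerObjectWrapper, Completion, Keyword, Text
from elasticsearch_dsl.connections import connections

# Register the default connection to a local Elasticsearch node
connections.create_connection(hosts=["localhost"])


class ArticleType(DocType):
    # Analyze the title with the IK analyzer at index time
    title = Text(analyzer="ik_max_word")

    class Meta:
        index = "jobbole"
        doc_type = "article"


if __name__ == '__main__':
    # Create the index and its mapping in Elasticsearch
    ArticleType.init()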

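Implement the following class in the pipeline, and the setup is complete.

from models import ArticleType  # adjust the import path to your project layout


class ElasticsearchPipeline(object):
    def process_item(self, item, spider):
        # Copy the scraped fields onto the document, then save it to Elasticsearch
        article = ArticleType()
        article.title = item['title']

        article.save()
        return item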
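Saving one document per item means one request per item. As a variation that ties back to the bulk section, the sketch below buffers items and hands them over in batches. It is a hypothetical design, not part of the original template: `BufferedBulkPipeline` and its `send_bulk` callable are invented for illustration, and in a real project `send_bulk` would wrap a bulk call from an Elasticsearch client and the pipeline would be constructed through Scrapy's own hooks.

```python
class BufferedBulkPipeline:
    """Collect items and hand them to `send_bulk` in batches.

    `send_bulk` is a hypothetical callable that receives a list of bulk actions.
    """

    def __init__(self, send_bulk, batch_size=100):
        self.send_bulk = send_bulk
        self.batch_size = batch_size
        self.buffer = []

    def process_item(self, item, spider):
        # One index action per item, targeting the index defined in models.py.
        self.buffer.append({"_index": "jobbole", "_source": {"title": item["title"]}})
        if len(self.buffer) >= self.batch_size:
            self.flush()
        return item

    def close_spider(self, spider):
        # Scrapy calls this when the spider finishes; send whatever is left.
        self.flush()

    def flush(self):
        if self.buffer:
            self.send_bulk(self.buffer)
            self.buffer = []


# Usage with a stand-in sink that just records each batch.
sent = []
pipeline = BufferedBulkPipeline(send_bulk=sent.append, batch_size=2)
for title in ["a", "b", "c"]:
    pipeline.process_item({"title": title}, spider=None)
pipeline.close_spider(spider=None)
```

In the usage above, three items with a batch size of 2 produce two batches: one when the buffer fills, and one when the spider closes.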
