Background
Sometimes you need to fuzzy-match against several Chinese-text fields in ES at once. You can merge those fields into a single, logical field and run the fuzzy match against that.
Related information
Two pieces of configuration are needed:
1. copy_to (merges several fields into one). Official docs: https://www.elastic.co/guide/en/elasticsearch/reference/5.5/copy-to.html
2. The ngram tokenizer (very effective for searching pure Chinese or mixed Chinese/English text; it blindly slices the text into overlapping runs of a few adjacent characters). Official docs: https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-ngram-tokenizer.html
A sketch that combines the two appears after the ngram configuration below.
The following mapping copies the first_name and last_name fields into a single full_name field:

```
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "first_name": {
          "type": "text",
          "copy_to": "full_name"
        },
        "last_name": {
          "type": "text",
          "copy_to": "full_name"
        },
        "full_name": {
          "type": "text"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "first_name": "John",
  "last_name": "Smith"
}

GET my_index/_search
{
  "query": {
    "match": {
      "full_name": {
        "query": "John Smith",
        "operator": "and"
      }
    }
  }
}
```

Note ⚠️: could the user's input match across the seam where two fields were joined? Say field A holds "abc" and field B holds "def": after merging A and B, would a fuzzy ngram match hit the value "cd"? With copy_to you don't need to worry about this, because the values are copied into full_name as separate values rather than concatenated. If, however, you build the merged field yourself (for example via binlog sync with plain string concatenation), you will get exactly those cross-boundary hits.
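You can see the difference directly with the _analyze API, which accepts an array of texts and analyzes them the way a multi-value field is analyzed (which is effectively what copy_to produces). A minimal sketch using the default ngram tokenizer:

```
GET _analyze
{
  "tokenizer": "ngram",
  "text": ["abc", "def"]
}
# grams stay within each value (a, ab, b, bc, c, d, de, e, ef, f); "cd" is never produced

GET _analyze
{
  "tokenizer": "ngram",
  "text": "abcdef"
}
# the concatenated string does produce the cross-boundary gram "cd"
```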
First, try the default ngram tokenizer without any extra configuration. It records start and end offsets for every gram, which makes matches easier to pinpoint:
```
GET _analyze
{
  "tokenizer": "ngram",
  "text": "我的测试"
}

# Result:
{
  "tokens": [
    { "token": "我",   "start_offset": 0, "end_offset": 1, "type": "word", "position": 0 },
    { "token": "我的", "start_offset": 0, "end_offset": 2, "type": "word", "position": 1 },
    { "token": "的",   "start_offset": 1, "end_offset": 2, "type": "word", "position": 2 },
    { "token": "的测", "start_offset": 1, "end_offset": 3, "type": "word", "position": 3 },
    { "token": "测",   "start_offset": 2, "end_offset": 3, "type": "word", "position": 4 },
    { "token": "测试", "start_offset": 2, "end_offset": 4, "type": "word", "position": 5 },
    { "token": "试",   "start_offset": 3, "end_offset": 4, "type": "word", "position": 6 }
  ]
}
```
Configuring ngram

min_gram and max_gram bound the gram length and therefore control how much work the tokenizer does. They act like a window sliding back and forth across the field's characters: the configuration below uses a fixed-length window of 3, since min_gram and max_gram are both 3. The longer the window, the more specific the matches; the shorter the window, the looser (and lower-quality) the matches.
```
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "2 Quick Foxes."
}
```
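Per the official docs, this request produces only the three-character grams:

```
[ Qui, uic, ick, Fox, oxe, xes ]
```

The standalone digit 2 yields no gram because it is shorter than min_gram, and the space and period are excluded by token_chars.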
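Putting the two settings together, the mapping below is a minimal sketch of the overall idea: the source fields copy_to a full_name field that is analyzed with a custom ngram analyzer. The index name, analyzer name, and the 1-2 gram lengths here are illustrative choices, not taken from the original:

```
# illustrative index/analyzer names; gram lengths chosen for Chinese text
PUT my_fuzzy_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ngram_analyzer": {
          "tokenizer": "ngram_tokenizer"
        }
      },
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 2
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "first_name": { "type": "text", "copy_to": "full_name" },
        "last_name":  { "type": "text", "copy_to": "full_name" },
        "full_name":  { "type": "text", "analyzer": "ngram_analyzer" }
      }
    }
  }
}
```

Short grams (1-2) suit Chinese, where a single character already carries meaning; for mostly-English text a longer window is usually the better trade-off.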
A follow-up Java example (pseudocode):
```
// Assumes an org.elasticsearch.client.RestHighLevelClient named `client`;
// BaseException is the project's own exception type.
import java.io.IOException;

import org.elasticsearch.ElasticsearchException;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.Operator;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

// Pagination offset
int from = (pageNum - 1) * pageSize;

SearchSourceBuilder builder = new SearchSourceBuilder();
BoolQueryBuilder rootQuery = QueryBuilders.boolQuery();
BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery();

// Fuzzy-match against the merged full_name field; operator AND requires
// every analyzed token of the input to be present.
boolQueryBuilder.must(
        QueryBuilders.matchQuery("full_name", "text to fuzzy-match").operator(Operator.AND));

// Filter context: no relevance scoring is needed here.
rootQuery.filter(boolQueryBuilder);

builder.query(rootQuery);
builder.from(from);
builder.size(pageSize);

SearchRequest searchRequest = new SearchRequest("indexName");
searchRequest.source(builder);

try {
    SearchResponse response = client.search(searchRequest);
    return response;
} catch (IOException e) {
    throw new BaseException("ES query connection error: " + e.getMessage());
} catch (ElasticsearchException e) {
    throw new BaseException("ES query error: " + e.getMessage());
}
```
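For reference, with pageNum = 1 and pageSize = 10 this builder serializes to roughly the following request body (the match query sits in filter context, so it is not scored):

```
{
  "from": 0,
  "size": 10,
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "must": [
              {
                "match": {
                  "full_name": {
                    "query": "text to fuzzy-match",
                    "operator": "and"
                  }
                }
              }
            ]
          }
        }
      ]
    }
  }
}
```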