1 What we build with Elasticsearch
Chinese-character completion
Spelling correction
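2 Product search and auto-completion
Term suggester: tokenizes the input text and offers term suggestions for each token.
Phrase suggester: builds on the term suggester and also takes the relationships between multiple terms into account.
Completion suggester: aimed mainly at the "auto completion" use case.
Context suggester: a completion suggester that filters or boosts results by context.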
GET product_completion_index/_search
{
  "from": 0,
  "size": 100,
  "suggest": {
    "czbk-suggest": {
      "prefix": "小米",
      "completion": {
        "field": "searchkey",
        "size": 20,
        "skip_duplicates": true
      }
    }
  }
}
2.1 Chinese-character completion OpenAPI
2.1.1 Define the auto-completion interface
package com.oldlu.service;

import com.oldlu.commons.pojo.CommonEntity;
import org.elasticsearch.action.DocWriteResponse;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.rest.RestStatus;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.suggest.completion.CompletionSuggestion;

import java.util.List;
import java.util.Map;

/**
 * @Class: ElasticsearchDocumentService
 * @Package com.oldlu.service
 * @Description: Document operations interface
 * @Company: http://www.oldlu.com/
 */
public interface ElasticsearchDocumentService {
    // Auto-completion (completion suggester)
    public List<String> cSuggest(CommonEntity commonEntity) throws Exception;
}
2.1.2 Implement auto-completion
/*
 * @Description: Auto-completion: suggest likely words or phrases from the user's input
 * @Method: cSuggest
 * @Param: [commonEntity]
 * @Update:
 * @since: 1.0.0
 * @Return: java.util.List<java.lang.String>
 */
@Override
public List<String> cSuggest(CommonEntity commonEntity) throws Exception {
    // Result list
    List<String> suggestList = new ArrayList<>();
    // Build the search request
    SearchRequest searchRequest = new SearchRequest(commonEntity.getIndexName());
    // Sort by score, descending
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    searchSourceBuilder.sort(new ScoreSortBuilder().order(SortOrder.DESC));
    // Build the completion suggestion on the field to search
    CompletionSuggestionBuilder completionSuggestionBuilder =
            new CompletionSuggestionBuilder(commonEntity.getSuggestFileld());
    // Prefix typed by the user
    completionSuggestionBuilder.prefix(commonEntity.getSuggestValue());
    // Drop duplicate suggestions
    completionSuggestionBuilder.skipDuplicates(true);
    // Number of suggestions to return
    completionSuggestionBuilder.size(commonEntity.getSuggestCount());
    // "czbk-suggest" is the suggestion name; all results are returned under it,
    // so it can be hard-coded
    searchSourceBuilder.suggest(
            new SuggestBuilder().addSuggestion("czbk-suggest", completionSuggestionBuilder));
    searchRequest.source(searchSourceBuilder);
    // Execute the search
    SearchResponse suggestResponse = client.search(searchRequest, RequestOptions.DEFAULT);
    // Read the completion suggestion from the response
    CompletionSuggestion completionSuggestion =
            suggestResponse.getSuggest().getSuggestion("czbk-suggest");
    List<CompletionSuggestion.Entry.Option> optionsList =
            completionSuggestion.getEntries().get(0).getOptions();
    // Copy the option texts into the result list
    if (!CollectionUtils.isEmpty(optionsList)) {
        optionsList.forEach(item -> suggestList.add(item.getText().toString()));
    }
    return suggestList;
}
2.1.3 Define the auto-completion controller
/*
 * @Description: Auto-completion
 * @Method: cSuggest
 * @Param: [commonEntity]
 * @Update:
 * @since: 1.0.0
 * @Return: com.oldlu.commons.result.ResponseData
 */
@GetMapping(value = "/csuggest")
public ResponseData cSuggest(@RequestBody CommonEntity commonEntity) {
    // Response wrapper
    ResponseData rData = new ResponseData();
    // Reject requests that lack the index name, field, or keyword
    if (StringUtils.isEmpty(commonEntity.getIndexName())
            || StringUtils.isEmpty(commonEntity.getSuggestFileld())
            || StringUtils.isEmpty(commonEntity.getSuggestValue())) {
        rData.setResultEnum(ResultEnum.PARAM_ISNULL);
        return rData;
    }
    // Suggestions returned by the service
    List<String> result = null;
    try {
        // Call the completion suggester through the high-level client
        result = elasticsearchDocumentService.cSuggest(commonEntity);
        // Wrap the data, status, and count into the response
        rData.setResultEnum(result, ResultEnum.SUCCESS, result.size());
        // Log success
        logger.info(TipsEnum.CSUGGEST_GET_DOC_SUCCESS.getMessage());
    } catch (Exception e) {
        // Log failure
        logger.error(TipsEnum.CSUGGEST_GET_DOC_FAIL.getMessage(), e);
        // Build the error response
        rData.setResultEnum(ResultEnum.ERROR);
    }
    return rData;
}
2.1.4 Verify the auto-completion call
http://localhost:8888/v1/docs/csuggest
Parameters
{
  "indexName": "product_completion_index",
  "suggestFileld": "searchkey",
  "suggestValue": "小米",
  "suggestCount": 13
}
indexName: index name
suggestFileld: field to search for completions
suggestValue: keyword typed by the user
suggestCount: number of suggestions to return (JD.com returns 13)
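The controller above is mapped with @GetMapping yet reads its parameters from the request body, so a caller has to send a GET that carries JSON. The following is a minimal sketch of one way to build such a request with the JDK 11+ java.net.http API. The class name CSuggestRequestSketch and its helper methods are hypothetical, and the hand-rolled JSON assumes the values contain no quotes or backslashes.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original project.
class CSuggestRequestSketch {

    // Build the JSON body by hand; assumes values need no JSON escaping.
    static String body(String indexName, String field, String value, int count) {
        return "{\"indexName\":\"" + indexName + "\","
                + "\"suggestFileld\":\"" + field + "\","
                + "\"suggestValue\":\"" + value + "\","
                + "\"suggestCount\":" + count + "}";
    }

    // A GET request that carries a JSON body, matching @GetMapping + @RequestBody.
    static HttpRequest build(String url, String json) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/json")
                .method("GET", HttpRequest.BodyPublishers.ofString(json, StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) {
        // "\u5c0f\u7c73" is the two-character brand name Xiaomi used in the examples.
        String json = body("product_completion_index", "searchkey", "\u5c0f\u7c73", 13);
        HttpRequest request = build("http://localhost:8888/v1/docs/csuggest", json);
        System.out.println(request.method() + " " + request.uri());
    }
}
```

The request can then be sent with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()). Some proxies and clients drop GET bodies, so a POST mapping may be the safer choice outside a local test.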
Response
{
  "code": "200",
  "desc": "操作成功!",
  "data": [
    "小米10",
    "小米10Pro",
    "小米8",
    "小米9",
    "小米充电宝",
    "小米手机",
    "小米摄像头",
    "小米电视",
    "小米电饭煲",
    "小米笔记本",
    "小米耳环",
    "小米路由器"
  ],
  "count": 12
}
Tip: duplicates are removed from the suggestions automatically (skip_duplicates is enabled).
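To make the behavior concrete, here is a rough in-memory sketch of prefix completion with de-duplication and a result limit. It is only an illustration: Elasticsearch's completion suggester uses its own in-memory index structures and scoring, and the class PrefixSuggestSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration of prefix completion; not how Elasticsearch implements it.
class PrefixSuggestSketch {
    // Sorted corpus; a set also drops exact duplicates (compare skip_duplicates).
    private final TreeSet<String> corpus = new TreeSet<>();

    void add(String phrase) {
        corpus.add(phrase);
    }

    // Return at most `size` distinct phrases that start with `prefix`.
    List<String> suggest(String prefix, int size) {
        List<String> out = new ArrayList<>();
        // In sorted order every phrase with this prefix sits in one contiguous
        // run that begins at the first entry >= prefix.
        for (String candidate : corpus.tailSet(prefix, true)) {
            if (out.size() >= size || !candidate.startsWith(prefix)) {
                break;
            }
            out.add(candidate);
        }
        return out;
    }

    public static void main(String[] args) {
        String xiaomi = "\u5c0f\u7c73"; // the brand name Xiaomi
        PrefixSuggestSketch s = new PrefixSuggestSketch();
        s.add(xiaomi + "10");
        s.add(xiaomi + "8");
        s.add(xiaomi + "10"); // duplicate, stored once
        s.add(xiaomi + "\u624b\u673a"); // "Xiaomi phone"
        s.add("\u534e\u4e3a"); // Huawei, does not match the prefix
        System.out.println(s.suggest(xiaomi, 13));
    }
}
```

Results come back in plain string order here, which is why "小米10" precedes "小米8", the same ordering visible in the response above.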
2.2 Pinyin completion OpenAPI
Look up "小米" (Xiaomi) by typing its pinyin.
http://localhost:8888/v1/docs/csuggest
Full pinyin
{
  "indexName": "product_completion_index",
  "suggestFileld": "searchkey",
  "suggestValue": "xiaomi",
  "suggestCount": 13
}
Full pinyin (space-separated)
{
  "indexName": "product_completion_index",
  "suggestFileld": "searchkey",
  "suggestValue": "xiao mi",
  "suggestCount": 13
}
Initials
{
  "indexName": "product_completion_index",
  "suggestFileld": "searchkey",
  "suggestValue": "xm",
  "suggestCount": 13
}
2.2.1 Download the pinyin plugin
wget https://github.com/medcl/elasticsearch-analysis-pinyin/releases/download/v7.4.0/elasticsearch-analysis-pinyin-7.4.0.zip
or download it from the release page:
https://github.com/medcl/elasticsearch-analysis-pinyin/releases/tag/v7.4.0
When creating the index we can define a custom analyzer and then reference it from the field mapping:
{
  "indexName": "product_completion_index",
  "map": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 2,
      "analysis": {
        "analyzer": {
          "ik_pinyin_analyzer": {
            "type": "custom",
            "tokenizer": "ik_smart",
            "filter": "pinyin_filter"
          }
        },
        "filter": {
          "pinyin_filter": {
            "type": "pinyin",
            "keep_first_letter": true,
            "keep_separate_first_letter": false,
            "keep_full_pinyin": true,
            "keep_original": true,
            "limit_first_letter_length": 16,
            "lowercase": true,
            "remove_duplicated_term": true
          }
        }
      }
    },
    "mapping": {
      "properties": {
        "name": {
          "type": "text"
        },
        "searchkey": {
          "type": "completion",
          "analyzer": "ik_pinyin_analyzer"
        }
      }
    }
  }
}
Next, call the add-document API to index the data.
Pinyin completion is then ready to use.
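As a rough mental model of what the pinyin filter settings contribute, the sketch below derives the kinds of terms that keep_original, keep_full_pinyin, and keep_first_letter suggest for one token. It is not the plugin's code, and the plugin's actual token order and handling of digits or Latin letters may differ. The class PinyinKeySketch and its two-entry lookup table are hypothetical; the real plugin ships a full dictionary.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; the real plugin uses a complete pinyin dictionary.
class PinyinKeySketch {
    // U+5C0F -> "xiao", U+7C73 -> "mi": the two characters of the brand name Xiaomi.
    private static final Map<Character, String> PINYIN =
            Map.of('\u5c0f', "xiao", '\u7c73', "mi");

    // Produce the original text, each syllable, and the joined initials.
    static List<String> terms(String text) {
        Set<String> out = new LinkedHashSet<>(); // a set mirrors remove_duplicated_term
        out.add(text);                           // keep_original
        StringBuilder initials = new StringBuilder();
        for (char c : text.toCharArray()) {
            String syllable = PINYIN.get(c);
            if (syllable == null) {
                // Assumption of this sketch: unmapped characters pass through lowercased.
                initials.append(Character.toLowerCase(c));
            } else {
                out.add(syllable);                   // keep_full_pinyin
                initials.append(syllable.charAt(0)); // keep_first_letter
            }
        }
        out.add(initials.toString());
        return new ArrayList<>(out);
    }

    public static void main(String[] args) {
        System.out.println(terms("\u5c0f\u7c73"));
    }
}
```

With terms like these indexed for the completion field, the inputs "xiao mi" and "xm" can reach the same entry as the original characters.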
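3 What is language processing (spelling correction)
Scenario
For example, the misspelled input "adidaas官方旗舰店" (adidaas official flagship store) should be corrected to "adidas官方旗舰店".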
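The idea behind this kind of correction can be sketched with edit distance: choose the dictionary term that needs the fewest single-character edits to match the input. This is a simplified stand-in and not the phrase suggester's algorithm, which also scores candidates with an n-gram language model. The class SpellFixSketch, its methods, and the sample dictionary are hypothetical.

```java
import java.util.List;

// Hypothetical illustration of edit-distance correction; not Elasticsearch's algorithm.
class SpellFixSketch {

    // Levenshtein distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    // Return the closest dictionary word within maxEdits, else the input unchanged.
    static String correct(String term, List<String> dictionary, int maxEdits) {
        String best = term;
        int bestDistance = maxEdits + 1;
        for (String word : dictionary) {
            int d = editDistance(term, word);
            if (d < bestDistance) {
                best = word;
                bestDistance = d;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> brands = List.of("adidas", "nike", "puma");
        System.out.println(correct("adidaas", brands, 2)); // prints "adidas"
    }
}
```

Capping the number of edits keeps unrelated inputs from being "corrected" to an arbitrary dictionary word.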
3.1 Language processing OpenAPI
GET product_completion_index/_search
{
  "suggest": {
    "czbk-suggestion": {
      "text": "adidaas官方旗舰店",
      "phrase": {
        "field": "name",
        "size": 13
      }
    }
  }
}
Response
3.1.1 Define the spelling-correction interface
// Spelling correction
public String pSuggest(CommonEntity commonEntity) throws Exception;
3.1.2 Implement spelling correction
/*
 * @Description: Spelling correction
 * @Method: pSuggest
 * @Param: [commonEntity]
 * @Update:
 * @since: 1.0.0
 * @Return: java.lang.String
 */
@Override
public String pSuggest(CommonEntity commonEntity) throws Exception {
    // Corrected text; empty if no suggestion is found
    String pSuggestString = "";
    // Build the search request
    SearchRequest searchRequest = new SearchRequest(commonEntity.getIndexName());
    // Query source builder
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    // Sort by score, descending
    searchSourceBuilder.sort(new ScoreSortBuilder().order(SortOrder.DESC));
    // Build the phrase suggester on the field to match against
    PhraseSuggestionBuilder pSuggestionBuilder =
            new PhraseSuggestionBuilder(commonEntity.getSuggestFileld());
    // Text to correct
    pSuggestionBuilder.text(commonEntity.getSuggestValue());
    // Only the best candidate is needed
    pSuggestionBuilder.size(1);
    searchSourceBuilder.suggest(
            new SuggestBuilder().addSuggestion("czbk-suggest", pSuggestionBuilder));
    searchRequest.source(searchSourceBuilder);
    // Execute the search
    SearchResponse suggestResponse = client.search(searchRequest, RequestOptions.DEFAULT);
    // Read the phrase suggestion from the response
    PhraseSuggestion phraseSuggestion =
            suggestResponse.getSuggest().getSuggestion("czbk-suggest");
    // Candidate corrections
    List<PhraseSuggestion.Entry.Option> optionsList =
            phraseSuggestion.getEntries().get(0).getOptions();
    // Take the first candidate and strip the spaces between its tokens
    if (!CollectionUtils.isEmpty(optionsList) && optionsList.get(0).getText() != null) {
        pSuggestString = optionsList.get(0).getText().string().replaceAll(" ", "");
    }
    return pSuggestString;
}
3.1.3 Define the spelling-correction controller
/*
 * @Description: Spelling correction
 * @Method: pSuggest
 * @Param: [commonEntity]
 * @Update:
 * @since: 1.0.0
 * @Return: com.oldlu.commons.result.ResponseData
 */
@GetMapping(value = "/psuggest")
public ResponseData pSuggest(@RequestBody CommonEntity commonEntity) {
    // Response wrapper
    ResponseData rData = new ResponseData();
    // Reject requests that lack the index name, field, or text
    if (StringUtils.isEmpty(commonEntity.getIndexName())
            || StringUtils.isEmpty(commonEntity.getSuggestFileld())
            || StringUtils.isEmpty(commonEntity.getSuggestValue())) {
        rData.setResultEnum(ResultEnum.PARAM_ISNULL);
        return rData;
    }
    // Corrected text returned by the service
    String result = null;
    try {
        // Call the phrase suggester through the high-level client
        result = elasticsearchDocumentService.pSuggest(commonEntity);
        // Wrap the data and status into the response (no count for a single string)
        rData.setResultEnum(result, ResultEnum.SUCCESS, null);
        // Log success
        logger.info(TipsEnum.PSUGGEST_GET_DOC_SUCCESS.getMessage());
    } catch (Exception e) {
        // Log failure
        logger.error(TipsEnum.PSUGGEST_GET_DOC_FAIL.getMessage(), e);
        // Build the error response
        rData.setResultEnum(ResultEnum.ERROR);
    }
    return rData;
}
3.1.4 Verify the language-processing call
http://localhost:8888/v1/docs/psuggest
Parameters
{
  "indexName": "product_completion_index",
  "suggestFileld": "name",
  "suggestValue": "adidaas官方旗舰店"
}
indexName: index name
suggestFileld: field to match the text against
suggestValue: text to be corrected
Response
{
  "code": "200",
  "desc": "操作成功!",
  "data": "adidas官方旗舰店"
}
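4 Summary
- You need a dedicated search-term dictionary (corpus). Keep it separate from the business index so the corpus is easy to maintain and upgrade.
- Query the corpus using the tokenized input and any other search conditions, and return a handful of records (JD.com returns 13, Taobao/Tmall 10, Baidu 4).
- To improve accuracy, prefix search is the usual choice.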