Solr DisjunctionMax Notes

Introduction: Over the holiday I tidied up the posts I had written on my Sina blog and moved them here.

Disjunction Max: a disjunction (union) scored by its maximum



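At its core, this is a joint search across multiple fields, each with its own weight; when a document matches, the score of the highest-scoring field is taken as the result score. This gives a completely different result from simply summing the boosted scores of several fields. It is quite complicated to use: you need to look at the debugQuery output and experiment repeatedly!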



http://wiki.apache.org/solr/DisMax

http://searchhub.org/dev/2010/05/23/whats-a-dismax/


What’s a “DisMax” ? Posted by hossman

The term “dismax” gets tossed around on the Solr lists frequently, which can be fairly confusing to new users. It originated as a shorthand name for the DisMaxRequestHandler (which I named after the DisjunctionMaxQueryParser, which I named after the DisjunctionMaxQuery class that it uses heavily). In recent years, the DisMaxRequestHandler and the StandardRequestHandler were both refactored into a single SearchHandler class, and now the term “dismax” usually refers to the DisMaxQParser.



Note: “dismax” now refers to the DisMaxQParser; the DisMaxRequestHandler and the StandardRequestHandler were refactored into the SearchHandler.



Clear as Mudd, right?

Regardless of whether you use the DisMaxRequestHandler via the qt=dismax parameter, or use the SearchHandler with the DisMaxQParser via defType=dismax, the end result is that your q parameter gets parsed by the DisjunctionMaxQueryParser.



Note: with qt=dismax the DisMaxRequestHandler is used; with defType=dismax the SearchHandler uses the DisMaxQParser. Either way, the q parameter is parsed by the DisjunctionMaxQueryParser.



The original goals of dismax (whichever meaning you might infer) have never changed:

… supports a simplified version of the Lucene QueryParser syntax. Quotes can be used to group phrases, and +/- can be used to denote mandatory and optional clauses … but all other Lucene query parser special characters are escaped to simplify the user experience. The handler takes responsibility for building a good query from the user’s input using BooleanQueries containing DisjunctionMaxQueries across fields and boosts you specify. It also allows you to provide additional boosting queries, boosting functions, and filtering queries to artificially affect the outcome of all searches. These options can all be specified as default parameters for the handler in your solrconfig.xml or overridden in the Solr query URL.

In short: You worry about what fields and boosts you want to use when you configure it, your users just give you words w/o worrying too much about syntax.



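Note: the dismax handler is mainly responsible for wrapping DisjunctionMaxQueries in BooleanQueries. It also lets you add boosting queries, boosting functions, and filter queries by hand to influence the final search results. All of these parameters can be configured in solrconfig.xml as global defaults, or added as URL parameters and applied dynamically to a single query or to a class of queries.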



The magic of dismax (in my opinion) comes from the query structure it produces. What it essentially boils down to is matrix multiplication: a one column matrix of each “chunk” of your user’s input, multiplied by a one row matrix of the qf fields to produce a big matrix of every field:chunk permutation. The matrix is then turned into a BooleanQuery consisting of DisjunctionMaxQueries for each row in the matrix. DisjunctionMaxQuery is used because its score is determined by the maximum score of its subclauses (instead of the sum, as in a BooleanQuery), so no one word from the user input dominates the final score. The best way to explain this is with an example, so let’s consider the following input…

defType = dismax

    mm = 50%

    qf = features^2 name^3

     q = +"apache solr" search server

First off, we consider the “markup” characters of the parser that appear in this q string:

·      white space – divides the input string into chunks (tokenization)

·      quotes – make a single phrase chunk (grouping)

·      + – makes a chunk mandatory (clause combination)

So we have 3 “chunks” of user input:

·      “apache solr” (must match)

·      “search” (should match)

·      “server” (should match)
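As a rough illustration of the chunking described above, here is a minimal Python sketch. It is not Solr's actual parser: the regular expression and the function name split_chunks are assumptions made for this example, and treating a leading "-" as a prohibited clause follows the usual Lucene convention rather than anything stated in this post.

```python
import re

# Simplified, hypothetical chunker (not Solr's real parser): an optional
# +/- prefix followed by either a quoted phrase or a run of non-whitespace.
CHUNK_RE = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')

def split_chunks(q):
    """Return (text, occur) pairs; occur is 'must', 'must_not' or 'should'."""
    chunks = []
    for prefix, phrase, word in CHUNK_RE.findall(q):
        text = phrase if phrase else word
        # Assumption: '-' marks a prohibited clause, as in Lucene syntax.
        occur = {"+": "must", "-": "must_not"}.get(prefix, "should")
        chunks.append((text, occur))
    return chunks

chunks = split_chunks('+"apache solr" search server')
for text, occur in chunks:
    print(occur, repr(text))
```

For the example q string this yields one mandatory phrase chunk and two optional single-word chunks, matching the list above.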

If we “multiply” that with our qf list (features, name) we get a matrix like this…

features:"apache solr"    name:"apache solr"    (must match)

features:"search"         name:"search"         (should match)

features:"server"         name:"server"         (should match)

If we then factor in the mm param to determine the “minimum number of ‘ShouldMatch’ clauses that (ahem) must match” (50% of 2 == 1) we get the following query structure (in pseudo-code)…

q = BooleanQuery(

 minNumberShouldMatch => 1,

 booleanClauses => ClauseList(

   MustMatch(DisjunctionMaxQuery(

     PhraseQuery("features","apache solr")^2,

     PhraseQuery("name","apache solr")^3)

   ),

   ShouldMatch(DisjunctionMaxQuery(

     TermQuery("features","search")^2,

     TermQuery("name","search")^3)

   ),

   ShouldMatch(DisjunctionMaxQuery(

     TermQuery("features","server")^2,

     TermQuery("name","server")^3))

));
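The pseudo-code above can be mimicked with plain Python data structures. This is only a sketch under stated assumptions: parse_qf and build_query are hypothetical helper names, the query is modelled as nested tuples rather than real Lucene objects, and a percentage mm is assumed to round down.

```python
def parse_qf(qf):
    """Parse 'features^2 name^3' into [('features', 2.0), ('name', 3.0)]."""
    fields = []
    for item in qf.split():
        name, _, boost = item.partition("^")
        fields.append((name, float(boost) if boost else 1.0))
    return fields

def build_query(chunks, qf, mm_percent):
    """Multiply the chunk column by the qf row: one dismax group per chunk."""
    fields = parse_qf(qf)
    clauses = []
    for text, occur in chunks:
        # Each chunk is tried against every qf field with that field's boost.
        dismax = [(field, text, boost) for field, boost in fields]
        clauses.append((occur, dismax))
    should = sum(1 for occur, _ in clauses if occur == "should")
    # Assumption: a percentage mm is rounded down to a whole clause count.
    min_should_match = should * mm_percent // 100
    return {"min_should_match": min_should_match, "clauses": clauses}

chunks = [("apache solr", "must"), ("search", "should"), ("server", "should")]
query = build_query(chunks, "features^2 name^3", 50)
print("minNumberShouldMatch =>", query["min_should_match"])
for occur, dismax in query["clauses"]:
    print(occur, dismax)
```

Each entry in clauses corresponds to one row of the matrix, and the computed minimum is 1 because 50% of the 2 optional clauses is 1.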

 

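Note: the boolean query is the most basic, atomic query; the other advanced queries are combinations or wrappers of it, and dismax is no exception. Judging from how the dismax query parser decomposes its input, and from its definition, dismax also breaks down into a boolean query, and its field boosts work the same way as ordinary per-field boosts. The difference is that dismax takes the maximum field score as the final score, whereas ordinary independent multi-field boosting sums the scores.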
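To make the max-versus-sum difference concrete, here is a small Python sketch. The per-field scores are made-up numbers, and dismax_score and summed_score are hypothetical helper names. The tie argument models the tie breaker, which adds a fraction of the non-maximum scores to the best one; treat the exact formula as an assumption to check against the Lucene documentation for your version.

```python
def dismax_score(field_scores, tie=0.0):
    """Best sub-score plus tie * (sum of the other sub-scores)."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)

def summed_score(field_scores):
    """What a plain BooleanQuery over the same sub-queries would give."""
    return sum(field_scores)

# Hypothetical, already-boosted scores of one chunk in two fields.
scores = [1.0, 4.0]

print(dismax_score(scores))           # only the best field counts
print(dismax_score(scores, tie=0.5))  # the other field contributes half
print(summed_score(scores))           # both fields count in full
```

With tie=0.0 a document gains nothing from matching the same word in several fields; with tie=1.0 the score is the same as the plain sum.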


With me so far, right?

Where people tend to get tripped up is in thinking about how Solr’s per-field analysis configuration (in schema.xml) impacts all of this. Our example above was pretty straightforward, but let’s consider for a moment what might happen if:

·      The name field uses the WordDelimiterFilter at query time but features does not.

·      The features field is configured so that “the” is a stopword, but name is not.

Now let’s look at what we get when our input parameters are structurally similar to what we had before, but just different enough for WordDelimiterFilter and StopFilter to come into play…

defType = dismax

    mm = 50%

    qf = features^2 name^3

     q = +"apache solr" the search-server

Our resulting query is going to be something like…

q = BooleanQuery(

 minNumberShouldMatch => 1,

 booleanClauses => ClauseList(

   MustMatch(DisjunctionMaxQuery(

     PhraseQuery("features","apache solr")^2,

     PhraseQuery("name","apache solr")^3)

   ),

   ShouldMatch(DisjunctionMaxQuery(

     TermQuery("name","the")^3)

   ),

   ShouldMatch(DisjunctionMaxQuery(

     TermQuery("features","search-server")^2,

     PhraseQuery("name","search server")^3))

 ));

The use of WordDelimiterFilter hasn’t changed things very much: features is treating “search-server” as a single Term, while in the name field we are searching for the phrase “search server”; hopefully this shouldn’t surprise anyone given the use of WordDelimiterFilter for the name field (presumably that’s why it’s being used). This DisjunctionMaxQuery still “makes sense”, but other fields with odd analysis that produce fewer/more Tokens than a “typical” field for the same chunk might produce queries that aren’t as easy to understand. In particular consider what has happened in our example with the word “the”: Because “the” is a stop word in the features field, no Query object is produced for that field/chunk combination. But a Query is produced for the name field, which means the total number of “ShouldMatch” clauses in our top level query is still 2 so our minNumberShouldMatch is still 1 (50% of 2 == 1).

This type of situation tends to confuse a lot of people: since “the” is a stop word in one field, they don’t expect it to matter in the final query — but as long as at least one qf field produces a Token for it (name in our example) it will be included in the final query, and will contribute to the count of “ShouldMatch” clauses.
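The effect of per-field analysis on the clause count can be simulated with a toy Python sketch. The two analyzer functions below are crude stand-ins invented for this example (a stop-word filter for features, a hyphen splitter for name); they are not Solr's StopFilter or WordDelimiterFilter, and clauses_for is a hypothetical helper.

```python
import re

# Toy analyzers (assumptions for illustration, not Solr's real filters).
def analyze_features(text):
    """Whitespace tokens, with 'the' removed as a stop word."""
    return [t for t in text.lower().split() if t != "the"]

def analyze_name(text):
    """Split on whitespace and hyphens; no stop words."""
    return [t for t in re.split(r"[\s\-]+", text.lower()) if t]

ANALYZERS = {"features": analyze_features, "name": analyze_name}

def clauses_for(chunk):
    """Per-field sub-queries for one chunk; fields with no tokens are skipped."""
    subs = []
    for field, analyze in ANALYZERS.items():
        tokens = analyze(chunk)
        if not tokens:
            continue  # e.g. 'the' in features: no Query object at all
        kind = "TermQuery" if len(tokens) == 1 else "PhraseQuery"
        subs.append((kind, field, " ".join(tokens)))
    return subs

should_chunks = ["the", "search-server"]
# A chunk still yields a ShouldMatch clause if at least one field kept a token.
should_clauses = [subs for subs in map(clauses_for, should_chunks) if subs]
min_should_match = len(should_clauses) * 50 // 100

for subs in should_clauses:
    print(subs)
print("ShouldMatch clauses:", len(should_clauses), "mm:", min_should_match)
```

The chunk "the" survives only in the name field, yet it still counts as one of the two optional clauses, so the minimum stays at 1.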

So, what’s the take away from all of this?

DisMax is a complicated creature. When using it, you need to consider all of its options carefully, and look at the debugQuery=true output while experimenting with different query strings and different analysis configurations to make really sure you understand how queries from your users will be parsed.

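Note: dismax builds very complex queries. When using it you need to consider all of its options carefully, and turn on debugQuery=true while trying different query strings and different analyzers.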
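One way to run such experiments is to generate a debug URL for each query string and compare the parsed queries in the responses. The sketch below only builds the URLs and does not contact a server. BASE_URL is a placeholder (host, port and path are assumptions; substitute your own Solr endpoint), and debug_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: replace with the select URL of your own Solr core.
BASE_URL = "http://localhost:8983/solr/select"

def debug_url(q, qf="features^2 name^3", mm="50%"):
    """Build a dismax request URL with debugQuery=true for one query string."""
    params = {
        "defType": "dismax",
        "q": q,
        "qf": qf,
        "mm": mm,
        "debugQuery": "true",
    }
    return BASE_URL + "?" + urlencode(params)

for q in ['+"apache solr" search server', '+"apache solr" the search-server']:
    print(debug_url(q))
```

urlencode takes care of escaping characters such as +, ^, % and the double quotes, which are easy to get wrong when typing these URLs by hand.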

For qf (Query Fields), pf (Phrase Fields), mm (Minimum ‘Should’ Match), and tie (Tie Breaker), see: the Solr Wiki DisMaxQParserPlugin.

Solr: Forcing items with all query terms to the top of a Solr search » Robot Librarian

http://robotlibrarian.billdueber.com/solr-forcing-items-with-all-query-terms-to-the-top-of-a-solr-search/


Lucid Imagination » Solr Powered ISFDB – Part #10: Tweaking Relevancy

http://searchhub.org/dev/2011/06/20/solr-powered-isfdb-part-10/

 

Lucid Imagination » Solr Powered ISFDB – Part #11: Using DisMax

http://searchhub.org/dev/2011/08/08/solr-powered-isfdb-part-11/



http://tm.durusau.net/?p=21573

 

Using Solr’s Dismax Tie Parameter « Another Word For It (tie breaker)

http://java.dzone.com/articles/using-solrs-dismax-tie

 

