Elasticsearch search ranking and document relevance scoring: TF-IDF. Term frequency, inverse document frequency, and the field-length norm are calculated and stored at index time.

Introduction:

Theory Behind Relevance Scoring

Lucene (and thus Elasticsearch) uses the Boolean model to find matching documents, and a formula called the practical scoring function to calculate relevance. This formula borrows concepts from term frequency/inverse document frequency and the vector space model but adds more-modern features like a coordination factor, field length normalization, and term or query clause boosting.
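
For reference, these pieces combine into Lucene's practical scoring function roughly as follows (a simplified sketch of the classic TF/IDF similarity, shown for orientation rather than as an exact specification):

score(q, d) = queryNorm(q)
            · coord(q, d)
            · ∑ ( tf(t in d) · idf(t)² · t.getBoost() · norm(t, d) )   for every term t in q

The tf and idf factors in the sum are the ones explained in the rest of this section.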

Note

Don’t be alarmed! These concepts are not as complicated as the names make them appear. While this section mentions algorithms, formulae, and mathematical models, it is intended for consumption by mere humans. Understanding the algorithms themselves is not as important as understanding the factors that influence the outcome.

Boolean Model

The Boolean model simply applies the AND, OR, and NOT conditions expressed in the query to find all the documents that match. A query for

full AND text AND search AND (elasticsearch OR lucene)

will include only documents that contain all of the terms full, text, and search, and either elasticsearch or lucene.

This process is simple and fast. It is used to exclude any documents that cannot possibly match the query.
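
As a rough sketch, the same Boolean condition could be written as a bool query (assuming an index named my_index with an analyzed text field, as in the mapping example later in this section):

GET /my_index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "text": "full"   }},
        { "match": { "text": "text"   }},
        { "match": { "text": "search" }}
      ],
      "should": [
        { "match": { "text": "elasticsearch" }},
        { "match": { "text": "lucene"        }}
      ],
      "minimum_should_match": 1
    }
  }
}

Documents that fail a must clause, or match neither should clause, are excluded before any scoring takes place.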

Term Frequency/Inverse Document Frequency (TF/IDF)

Once we have a list of matching documents, they need to be ranked by relevance. Not all documents will contain all the terms, and some terms are more important than others. The relevance score of the whole document depends (in part) on the weight of each query term that appears in that document.

The weight of a term is determined by three factors, which we already introduced in What Is Relevance?. The formulae are included for interest’s sake, but you are not required to remember them.

Term frequency

How often does the term appear in this document? The more often, the higher the weight. A field containing five mentions of the same term is more likely to be relevant than a field containing just one mention. The term frequency is calculated as follows:

tf(t in d) = √frequency 

The term frequency (tf) for term t in document d is the square root of the number of times the term appears in the document.
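
For example, a field that mentions a term five times ends up with only about twice the weight of a field that mentions it once:

tf = √5 ≈ 2.24    versus    tf = √1 = 1.00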

If you don’t care about how often a term appears in a field, and all you care about is that the term is present, then you can disable term frequencies in the field mapping:

PUT /my_index
{
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type":          "string",
          "index_options": "docs"
        }
      }
    }
  }
}

Setting index_options to docs will disable term frequencies and term positions. A field with this mapping will not count how many times a term appears, and will not be usable for phrase or proximity queries. Exact-value not_analyzed string fields use this setting by default.
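
For comparison, here is a minimal sketch of such an exact-value field (the my_index_2 index and the status field are made up for illustration); a not_analyzed string is indexed as a single exact term and, as noted above, defaults to index_options of docs:

PUT /my_index_2
{
  "mappings": {
    "doc": {
      "properties": {
        "status": {
          "type":  "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}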

Inverse document frequency

How often does the term appear in all documents in the collection? The more often, the lower the weight. Common terms like and or the contribute little to relevance, as they appear in most documents, while uncommon terms like elastic or hippopotamus help us zoom in on the most interesting documents. The inverse document frequency is calculated as follows:

idf(t) = 1 + log ( numDocs / (docFreq + 1)) 

The inverse document frequency (idf) of term t is the logarithm of the number of documents in the index, divided by the number of documents that contain the term.
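
For example, in an index of 100 documents, a term that appears in only 3 of them gets (using the natural logarithm, as Lucene's classic similarity does):

idf = 1 + log( 100 / (3 + 1) ) ≈ 4.22

while a term that appears in 60 documents gets only about 1 + log( 100 / 61 ) ≈ 1.49.

To see how these factors combine for real hits, you can ask Elasticsearch to explain the score of each matching document (a minimal sketch against the my_index example above; explain is part of the standard search API):

GET /my_index/_search
{
  "explain": true,
  "query": {
    "match": { "text": "search" }
  }
}

Each hit then carries an _explanation section that breaks its score down into the term frequency, inverse document frequency, and field-norm contributions.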

This post is reproduced from the 张昺华-sky blog on cnblogs. Original link: http://www.cnblogs.com/bonelee/p/6474089.html. Please contact the original author before reposting.


