[Repost] A reference on how Lucene 4 changes the VSM scoring model

Summary: Over the holiday I went back through the posts I had written on my Sina blog, tidied them up, and moved them here.

New Lucene 4 Functions Improve Enterprise Search Indexing

Original article: http://java.dzone.com/articles/new-index-statistics-lucene-40

In the past, Lucene recorded only the bare minimum of aggregate index statistics necessary to support its hard-wired classic vector space scoring model.

Fortunately, this situation is wildly improved in trunk (to be 4.0), where we have a selection of modern scoring models, including Okapi BM25, Language models, Divergence from Randomness models and Information-based models. To support these, we now save a number of commonly used index statistics per index segment, and make them available at search time.

To understand the new statistics, let's pretend we've indexed the following two example documents, each with only one field "title":

·      document 1: The Lion, the Witch, and the Wardrobe

·      document 2: The Da Vinci Code

Assume we tokenize on whitespace, commas are removed, all terms are downcased and we don't discard stop-words. Here are the statistics Lucene tracks:

   TermsEnum.docFreq()

How many documents contain at least one occurrence of the term in the field; 3.x indices also save this (TermEnum.docFreq()). For term "lion" docFreq is 1, and for term "the" it's 2.

 

   Terms.getSumDocFreq()

Number of postings, i.e. sum of TermsEnum.docFreq() across all terms in the field. For our example documents this is 9.

 

   TermsEnum.totalTermFreq()

Number of occurrences of this term in the field, across all documents. For term "the" it's 4, for term "vinci" it's 1.

 

   Terms.getSumTotalTermFreq()

Number of term occurrences in the field, across all documents; this is the sum of TermsEnum.totalTermFreq() across all unique terms in the field. For our example documents this is 11.

 

   Terms.getDocCount()

How many documents have at least one term for this field. In our example documents, this is 2, but if for example one of the documents was missing the title field, it would be 1.

 

   Terms.getUniqueTermCount()

How many unique terms were seen in this field. For our example documents this is 8. Note that this statistic is of limited utility for scoring, because it's only available per-segment and you cannot (efficiently!) compute this across all segments in the index (unless there is only one segment).

 

   Fields.getUniqueTermCount()

Number of unique terms across all fields; this is the sum of Terms.getUniqueTermCount() across all fields. In our example documents this is 8. Note that this is also only available per-segment.

 

   Fields.getUniqueFieldCount()

Number of unique fields. For our example documents this is 1; if we also had a body field and an abstract field, it would be 3. Note that this is also only available per-segment.
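
The numbers quoted above can be checked by hand with a few lines of plain Java. This is only a sketch of the definitions, not Lucene code: the class TitleFieldStats and its members are hypothetical names invented for this illustration, and the tokenizer simply follows the assumptions stated earlier (split on whitespace, drop commas, lowercase, keep stop words).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper that mirrors the per-field statistics; it does not use Lucene. */
final class TitleFieldStats {
    // term -> number of documents containing the term (docFreq)
    final Map<String, Integer> docFreq = new HashMap<>();
    // term -> number of occurrences across all documents (totalTermFreq)
    final Map<String, Long> totalTermFreq = new HashMap<>();
    long sumDocFreq;        // number of postings
    long sumTotalTermFreq;  // number of term occurrences
    int docCount;           // documents with at least one term in the field

    /** Splits on whitespace, removes commas, lowercases, keeps stop words. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            String term = raw.replace(",", "").toLowerCase(Locale.ROOT);
            if (!term.isEmpty()) {
                tokens.add(term);
            }
        }
        return tokens;
    }

    void addDocument(String title) {
        List<String> tokens = tokenize(title);
        if (tokens.isEmpty()) {
            return; // a document without the field does not count toward docCount
        }
        docCount++;
        Set<String> seenInDoc = new HashSet<>();
        for (String term : tokens) {
            totalTermFreq.merge(term, 1L, Long::sum);
            sumTotalTermFreq++;
            if (seenInDoc.add(term)) { // first occurrence of the term in this document
                docFreq.merge(term, 1, Integer::sum);
                sumDocFreq++;
            }
        }
    }

    int uniqueTermCount() {
        return docFreq.size();
    }

    public static void main(String[] args) {
        TitleFieldStats stats = new TitleFieldStats();
        stats.addDocument("The Lion, the Witch, and the Wardrobe");
        stats.addDocument("The Da Vinci Code");
        System.out.println("docFreq(the)       = " + stats.docFreq.get("the"));
        System.out.println("totalTermFreq(the) = " + stats.totalTermFreq.get("the"));
        System.out.println("sumDocFreq         = " + stats.sumDocFreq);
        System.out.println("sumTotalTermFreq   = " + stats.sumTotalTermFreq);
        System.out.println("docCount           = " + stats.docCount);
        System.out.println("uniqueTermCount    = " + stats.uniqueTermCount());
    }
}
```

Because the example has a single field, the unique term count for the field and the unique term count across all fields come out the same.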



3.x indices only store TermsEnum.docFreq(), so if you want to experiment with the new scoring models in Lucene 4.0, you should either re-index or upgrade your index using IndexUpgrader. Note that the new scoring models all use the same single-byte norms format, so you can freely switch between them without re-indexing.

In addition to what's stored in the index, these statistics are also available per-field, per-document while indexing, in the FieldInvertState passed to the Similarity.computeNorm method, for both 3.x and 4.0:


   length

How many tokens are in this field of the document. For document 1 it's 7; for document 2 it's 4.

 

   uniqueTermCount

For this field in this document, how many unique terms are there? For document 1, it's 5; for document 2 it's 4.

 

   maxTermFrequency

The count for the most frequent term in this document. For document 1 it's 3 ("the" occurs 3 times); for document 2 it's 1.
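
The three per-document values can be reproduced in the same way. The class below is a hypothetical stand-in written for this illustration; it borrows the field names length, uniqueTermCount and maxTermFrequency from the text but is not Lucene's FieldInvertState, and it assumes the same whitespace, comma-stripping, lowercasing tokenization as the example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical per-document, per-field statistics; not Lucene's FieldInvertState. */
final class DocFieldStats {
    final int length;            // number of tokens in the field
    final int uniqueTermCount;   // number of distinct terms in the field
    final int maxTermFrequency;  // count of the most frequent term

    DocFieldStats(String text) {
        Map<String, Integer> freqs = new HashMap<>();
        int tokens = 0;
        int max = 0;
        for (String raw : text.split("\\s+")) {
            String term = raw.replace(",", "").toLowerCase(Locale.ROOT);
            if (term.isEmpty()) {
                continue;
            }
            tokens++;
            int freq = freqs.merge(term, 1, Integer::sum); // updated count for this term
            max = Math.max(max, freq);
        }
        this.length = tokens;
        this.uniqueTermCount = freqs.size();
        this.maxTermFrequency = max;
    }

    public static void main(String[] args) {
        DocFieldStats doc1 = new DocFieldStats("The Lion, the Witch, and the Wardrobe");
        DocFieldStats doc2 = new DocFieldStats("The Da Vinci Code");
        System.out.println(doc1.length + " " + doc1.uniqueTermCount + " " + doc1.maxTermFrequency);
        System.out.println(doc2.length + " " + doc2.uniqueTermCount + " " + doc2.maxTermFrequency);
    }
}
```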



In 3.x, if you want to consume these indexing-time statistics, you'll have to save them away yourself (e.g., somehow encoding them into the single-byte norm value). However, since 4.0 uses doc values for norms, you have more freedom to encode these statistics however you'd like. Your custom similarity can then pull from these.

From these available statistics you're now free to derive other commonly used statistics:

·      Average document length is Terms.getSumTotalTermFreq() divided by Terms.getDocCount().

·      Average within-document term frequency is FieldInvertState.length divided by FieldInvertState.uniqueTermCount.

·      Average document length across the collection is Terms.getSumTotalTermFreq() divided by maxDoc (or Terms.getDocCount(), if not all documents have the field).

·      Average number of unique terms per document is Terms.getSumDocFreq() divided by maxDoc (or Terms.getDocCount(), if not all documents have the field).
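
As a worked example of the ratios listed above, the sketch below plugs in the values from the two example documents (sumTotalTermFreq 11, sumDocFreq 9, docCount 2, and length 7 with 5 unique terms for document 1). DerivedStats and its method names are hypothetical; the values are passed in as plain numbers rather than read from a Lucene index.

```java
/** Hypothetical helper for the derived statistics; inputs are plain numbers, not Lucene objects. */
final class DerivedStats {
    /** sumTotalTermFreq divided by the number of documents that have the field. */
    static double averageFieldLength(long sumTotalTermFreq, int docCount) {
        return (double) sumTotalTermFreq / docCount;
    }

    /** Field length divided by the number of unique terms, for one document. */
    static double averageTermFrequency(int length, int uniqueTermCount) {
        return (double) length / uniqueTermCount;
    }

    /** sumDocFreq divided by the number of documents that have the field. */
    static double averageUniqueTerms(long sumDocFreq, int docCount) {
        return (double) sumDocFreq / docCount;
    }

    public static void main(String[] args) {
        // Values taken from the two example documents.
        System.out.println(averageFieldLength(11, 2));   // 11 / 2
        System.out.println(averageTermFrequency(7, 5));  // document 1: 7 / 5
        System.out.println(averageUniqueTerms(9, 2));    // 9 / 2
    }
}
```

For the example this gives an average field length of 5.5 tokens, an average within-document term frequency of 1.4 for document 1, and 4.5 unique terms per document on average.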

Remember that the statistics do not reflect deleted documents, until those documents are merged away; in general this also means that segment merging will alter scores! Similarly, if the field omits term frequencies, then the statistics will not be correct (though they will still be consistent with one another: we will pretend each term occurred once per document).
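
To make the last point concrete, here is a small plain-Java sketch of what "each term occurred once per document" does to the totals for the two example documents. OmittedFreqSketch is a hypothetical name, the code does not touch Lucene, and it only models the behaviour as described in the paragraph above; the values a particular Lucene release actually reports for such a field are not shown here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical illustration of omitted term frequencies; not Lucene code. */
final class OmittedFreqSketch {
    /**
     * Sums term occurrences over all documents. When omitTermFreqs is true,
     * a term is counted at most once per document.
     */
    static long sumTotalTermFreq(List<String> docs, boolean omitTermFreqs) {
        long sum = 0;
        for (String doc : docs) {
            Set<String> seenInDoc = new HashSet<>();
            for (String raw : doc.split("\\s+")) {
                String term = raw.replace(",", "").toLowerCase(Locale.ROOT);
                if (term.isEmpty()) {
                    continue;
                }
                boolean firstInDoc = seenInDoc.add(term);
                if (!omitTermFreqs || firstInDoc) {
                    sum++;
                }
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "The Lion, the Witch, and the Wardrobe",
                "The Da Vinci Code");
        System.out.println("with frequencies:    " + sumTotalTermFreq(docs, false));
        System.out.println("frequencies omitted: " + sumTotalTermFreq(docs, true));
    }
}
```

With frequencies omitted the total drops from 11 to 9, which is the same as the sum of document frequencies, so the derived average length shrinks from 5.5 to 4.5 even though the documents did not change.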

Published at DZone with permission of Michael McCandless, author and DZone MVB.

