Solr-Lucene Use Case Collection

2022-05-02

Summary: Over the holiday I reorganized the posts from my old Sina blog and moved them here. This article collects use cases of Solr and Lucene.

The formatting here is distorted; the original can be downloaded from Weipan (微盘): 认识lucene&solr-part3.pdf

1. CFS format

version+num+offset_list+content_list

Header index information + content information, e.g. 123|5|0-12-25-30-46-60|con0|con1|con2|con3|con4|con5

Used for managing small files. See CompoundFileDirectory.
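
The layout above can be sketched in a few lines of Java. This is only an illustration under assumptions: the class name `CompoundPack`, the fixed-width int header, and the choice to store n+1 boundary offsets for n entries are hypothetical and are not Lucene's actual compound file format.

```java
import java.nio.ByteBuffer;
import java.util.List;

// Hypothetical sketch of the layout: version + num + offset_list + content_list.
// This is NOT Lucene's real .cfs encoding.
class CompoundPack {
    static final int VERSION = 1; // assumed version number

    // Packs all entries into one byte array: header first, then the raw contents.
    static byte[] pack(List<byte[]> entries) {
        int n = entries.size();
        int headerSize = 4 + 4 + 4 * (n + 1); // version, num, n+1 boundary offsets
        int total = headerSize;
        for (byte[] e : entries) total += e.length;
        ByteBuffer buf = ByteBuffer.allocate(total);
        buf.putInt(VERSION).putInt(n);
        int offset = 0;
        buf.putInt(offset);
        for (byte[] e : entries) {
            offset += e.length;
            buf.putInt(offset);
        }
        for (byte[] e : entries) buf.put(e);
        return buf.array();
    }

    // Reads entry i by looking up its start and end offsets in the header.
    static byte[] read(byte[] packed, int i) {
        ByteBuffer buf = ByteBuffer.wrap(packed);
        int n = buf.getInt(4);
        if (i < 0 || i >= n) throw new IndexOutOfBoundsException("entry " + i);
        int headerSize = 8 + 4 * (n + 1);
        int start = buf.getInt(8 + 4 * i);
        int end = buf.getInt(8 + 4 * (i + 1));
        byte[] out = new byte[end - start];
        System.arraycopy(packed, headerSize + start, out, 0, out.length);
        return out;
    }
}
```

Because the header holds every boundary offset, a single entry can be read without scanning the entries before it, which is the point of bundling many small files into one.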

2. CRC check

Size(int)+len(int)+bytes+crc(long), where size = 4 + bytes.len + 8

Used for transferring data in chunks.
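
A minimal sketch of this frame layout, assuming big-endian ints and `java.util.zip.CRC32` as the checksum (the source does not say which CRC variant is used); the class name `CrcFrame` is hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Hypothetical frame codec: size(int) + len(int) + bytes + crc(long),
// where size = 4 + bytes.length + 8 (everything after the size field).
class CrcFrame {
    static byte[] encode(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        int size = 4 + payload.length + 8;
        ByteBuffer buf = ByteBuffer.allocate(4 + size);
        buf.putInt(size).putInt(payload.length).put(payload).putLong(crc.getValue());
        return buf.array();
    }

    // Validates the lengths and the checksum, then returns the payload.
    static byte[] decode(byte[] frame) {
        if (frame.length < 16) throw new IllegalArgumentException("frame too short");
        ByteBuffer buf = ByteBuffer.wrap(frame);
        int size = buf.getInt();
        int len = buf.getInt();
        if (len < 0 || size != 4 + len + 8 || frame.length != 4 + size) {
            throw new IllegalArgumentException("malformed frame");
        }
        byte[] payload = new byte[len];
        buf.get(payload);
        long expected = buf.getLong();
        CRC32 crc = new CRC32();
        crc.update(payload, 0, len);
        if (crc.getValue() != expected) throw new IllegalStateException("CRC mismatch");
        return payload;
    }
}
```

The receiver can reject a chunk that was truncated or corrupted in transit and ask for only that chunk again.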

3. Memory allocation

Used to make efficient use of memory.

Linear method

E.g. initial size = 2k; whenever an item is put, check if (needSize > size) then size = size * 2

E.g. initial size = 2k; whenever an item is put, check if (needSize > size) then size = size + (size >> 1)
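
The two growth rules can be written as small helpers. This is a sketch with hypothetical names (`GrowthPolicy`, `growDouble`, `growHalf`) and it ignores int overflow. One detail worth noting: in Java, `size + size >> 1` parses as `(size + size) >> 1`, which equals `size`, so the 1.5x rule needs explicit parentheses.

```java
// Hypothetical helpers for the two growth rules; overflow is not handled.
class GrowthPolicy {
    // Doubles the capacity until it can hold needSize.
    static int growDouble(int size, int needSize) {
        if (size < 1) throw new IllegalArgumentException("size must be >= 1");
        while (needSize > size) size = size * 2;
        return size;
    }

    // Grows the capacity by 50% until it can hold needSize.
    static int growHalf(int size, int needSize) {
        if (size < 2) throw new IllegalArgumentException("size must be >= 2");
        while (needSize > size) size = size + (size >> 1); // parentheses are required
        return size;
    }
}
```

Doubling needs fewer reallocations; growing by half wastes less memory after the last resize.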

Tiered lookup method, e.g. see ByteBlockPool:

/**
 * An array holding the offset into the {@link ByteBlockPool#LEVEL_SIZE_ARRAY}
 * to quickly navigate to the next slice level.
 */
public final static int[] NEXT_LEVEL_ARRAY = {1, 2, 3, 4, 5, 6, 7, 8, 9, 9};

/**
 * An array holding the level sizes for byte slices.
 */
public final static int[] LEVEL_SIZE_ARRAY = {5, 14, 20, 30, 40, 40, 80, 80, 120, 200};
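
To see how the two arrays work together, the sketch below walks the levels and lists the slice sizes a single growing byte stream would receive. The arrays are copied from the text above; the class `SliceLevels` and the method `sliceSizes` are hypothetical and do not reproduce ByteBlockPool's actual allocation code.

```java
// Hypothetical walk over the level tables quoted above.
class SliceLevels {
    static final int[] NEXT_LEVEL_ARRAY = {1, 2, 3, 4, 5, 6, 7, 8, 9, 9};
    static final int[] LEVEL_SIZE_ARRAY = {5, 14, 20, 30, 40, 40, 80, 80, 120, 200};

    // Returns the sizes of the first `count` slices handed to one stream.
    static int[] sliceSizes(int count) {
        int[] sizes = new int[count];
        int level = 0; // every stream starts at level 0
        for (int i = 0; i < count; i++) {
            sizes[i] = LEVEL_SIZE_ARRAY[level];
            level = NEXT_LEVEL_ARRAY[level]; // the last level points to itself
        }
        return sizes;
    }
}
```

Short streams only ever cost a 5-byte slice, while long streams quickly reach the 200-byte level and stay there.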

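4. trieTree

A trie and an FST are used for prefix lookup, offset mapping, term-frequency counting, IP counting, and so on. See the FST util tools.

E.g. each trie node stores a frequency count; after scanning the corpus, traverse the trie and output the frequency at each terminal node, writing the result to disk. An FST maps a term directly to its storage offset.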
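
The term-frequency use case can be sketched with a plain in-memory trie. The class `FreqTrie` is a hypothetical illustration and is unrelated to Lucene's FST implementation; it returns the counts as a map instead of writing them to disk.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical trie where each terminal node stores a frequency count.
class FreqTrie {
    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        int count; // > 0 marks a terminal node
    }

    private final Node root = new Node();

    // Adds one occurrence of the term.
    void add(String term) {
        Node n = root;
        for (char c : term.toCharArray()) {
            n = n.children.computeIfAbsent(c, k -> new Node());
        }
        n.count++;
    }

    // Returns how many times the exact term was added.
    int count(String term) {
        Node n = root;
        for (char c : term.toCharArray()) {
            n = n.children.get(c);
            if (n == null) return 0;
        }
        return n.count;
    }

    // Traverses the trie and collects the frequency of every terminal node.
    Map<String, Integer> frequencies() {
        Map<String, Integer> out = new TreeMap<>();
        collect(root, new StringBuilder(), out);
        return out;
    }

    private static void collect(Node n, StringBuilder prefix, Map<String, Integer> out) {
        if (n.count > 0) out.put(prefix.toString(), n.count);
        for (Map.Entry<Character, Node> e : n.children.entrySet()) {
            char c = e.getKey();
            prefix.append(c);
            collect(e.getValue(), prefix, out);
            prefix.setLength(prefix.length() - 1);
        }
    }
}
```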

5. Array pool

See IntBlockPool, used for storing and accessing int values in bulk.

6. Heap

Used for top-N sorting; see PriorityQueue and its subclasses. A variant is paged top-N with a specified threshold.
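
A minimal top-N sketch using `java.util.PriorityQueue` rather than Lucene's own PriorityQueue class; the names `TopN` and `largest` are hypothetical. A min-heap of size n keeps the smallest retained value at the root, so each new value needs only one comparison to decide whether it belongs in the result.

```java
import java.util.PriorityQueue;

// Hypothetical top-N helper built on the JDK heap.
class TopN {
    // Returns the n largest values in descending order.
    static int[] largest(int[] values, int n) {
        if (n <= 0) return new int[0];
        PriorityQueue<Integer> heap = new PriorityQueue<>(n); // min-heap
        for (int v : values) {
            if (heap.size() < n) {
                heap.add(v);
            } else if (v > heap.peek()) {
                heap.poll(); // evict the smallest retained value
                heap.add(v);
            }
        }
        int[] out = new int[heap.size()];
        for (int i = out.length - 1; i >= 0; i--) out[i] = heap.poll();
        return out;
    }
}
```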

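7. TimSort

Used to sort values that are already partially ordered, where it is extremely fast. Precondition: the input is partially ordered. In that case it actually does less work than quicksort. See TimSorter.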

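8. Reference counting

Used for sharing resources across threads and releasing them. E.g. when reading an index, each time the reader is referenced an atomic counter is incremented, and it is decremented after use; on the final close, when the count reaches 0, the resources are released. See how numOpens and numCloses are used in SolrIndexSearcher.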
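
The counting scheme can be sketched with an `AtomicInteger`. The class `RefCounted` and its methods are hypothetical and are not the SolrIndexSearcher code; the `released` flag stands in for closing files or other real cleanup.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical reference-counted resource.
class RefCounted {
    private final AtomicInteger refs = new AtomicInteger(1); // the creator holds one reference
    private volatile boolean released;

    // Takes a reference; fails if the resource has already been released.
    boolean tryIncRef() {
        int c;
        while ((c = refs.get()) > 0) {
            if (refs.compareAndSet(c, c + 1)) return true;
        }
        return false;
    }

    // Drops a reference; the last one releases the resource.
    void decRef() {
        int c = refs.decrementAndGet();
        if (c == 0) {
            released = true; // stand-in for the real cleanup
        } else if (c < 0) {
            throw new IllegalStateException("too many decRef calls");
        }
    }

    boolean isReleased() {
        return released;
    }
}
```

The compare-and-set loop in `tryIncRef` prevents a thread from reviving a resource whose count has already dropped to 0.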

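9. Thread-local variable access

See CloseableThreadLocal, an improved ThreadLocal. Use it when parameters need to be passed at thread level. E.g. in a multi-threaded query client, each thread writes its routing information into a thread-local, and the client code later reads the value back from the thread-local. This applies when passing the routing information some other way is inconvenient or costly.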
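
The routing example can be sketched with the JDK's `ThreadLocal` (not Lucene's CloseableThreadLocal); the class `RouteContext` and the route strings are hypothetical.

```java
// Hypothetical per-thread routing context.
class RouteContext {
    private static final ThreadLocal<String> ROUTE = new ThreadLocal<>();

    // Called by the query thread before it issues a request.
    static void set(String route) {
        ROUTE.set(route);
    }

    // Called deeper in the client, where the route is not passed as a parameter.
    static String get() {
        return ROUTE.get();
    }

    // Should be called when the request ends, especially on pooled threads.
    static void clear() {
        ROUTE.remove();
    }
}
```

Each thread sees only the value it set itself, so concurrent queries cannot overwrite each other's route. Clearing the value at the end of a request matters on pooled threads, where a stale route would otherwise leak into the next task.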

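10. Skip list

Used for fast lookup and lazy loading. Build multi-level, ordered, block-based plain-text index information (e.g. offsets or subscripts), compare against that index information to locate the data block, then compare sequentially inside the block to get the value. Alternatively, when the whole data set cannot be loaded into memory at once, build a skip table over the on-disk data in advance and load blocks only when they are needed.

See SegmentCoreReaders at the lower level and, at the upper level, e.g. StandardDirectoryReader.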
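
The locate-the-block-then-scan idea can be sketched with a single index level. This is a simplification under assumptions: the class `BlockIndex` is hypothetical, the data is an in-memory sorted array of distinct ints standing in for on-disk blocks, and a real skip list would stack several such levels.

```java
import java.util.Arrays;

// Hypothetical one-level block index over sorted, distinct values.
class BlockIndex {
    private final int[] sorted;       // stand-in for the on-disk data
    private final int blockSize;
    private final int[] firstOfBlock; // index level: first value of each block

    BlockIndex(int[] sortedValues, int blockSize) {
        if (blockSize <= 0) throw new IllegalArgumentException("blockSize must be positive");
        this.sorted = sortedValues.clone();
        this.blockSize = blockSize;
        int blocks = (sorted.length + blockSize - 1) / blockSize;
        this.firstOfBlock = new int[blocks];
        for (int b = 0; b < blocks; b++) firstOfBlock[b] = sorted[b * blockSize];
    }

    // Returns the position of target, or -1 if it is absent.
    int find(int target) {
        int b = Arrays.binarySearch(firstOfBlock, target);
        if (b < 0) b = -b - 2; // last block whose first value is <= target
        if (b < 0) return -1;  // target is smaller than every value
        int end = Math.min(sorted.length, (b + 1) * blockSize);
        for (int i = b * blockSize; i < end; i++) { // sequential scan inside one block
            if (sorted[i] == target) return i;
            if (sorted[i] > target) break;
        }
        return -1;
    }
}
```

Only the small `firstOfBlock` array has to stay in memory; with on-disk data, the block chosen by the binary search is the only one that needs to be loaded.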

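11. Compress

Compression is applied to numeric types and to string types. For numeric types: vbyte/vint/vlong,

zig-zag/diff/lz4. For strings, e.g. city to code, property to code, existence as a boolean, URL prefix encoding.

TrieInt, TrieLong, and TrieDate split a numeric value into byte segments for indexing, lookup, and storage.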
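
Two of the numeric encodings, vint and zig-zag, can be sketched as follows. The class `VarInts` is hypothetical; the byte layout (7 data bits per byte, high bit set while more bytes follow) is the common variable-length scheme and is written here from scratch, not taken from Lucene's source.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical variable-length int and zig-zag helpers.
class VarInts {
    // Maps small negative numbers to small positive ones: 0,-1,1,-2 -> 0,1,2,3.
    static int zigZagEncode(int v) {
        return (v << 1) ^ (v >> 31);
    }

    static int zigZagDecode(int z) {
        return (z >>> 1) ^ -(z & 1);
    }

    // Writes 7 bits per byte, low bits first; the high bit means "more bytes follow".
    static byte[] writeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(5);
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Reads one vint from the start of the array.
    static int readVInt(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) break;
            shift += 7;
        }
        return value;
    }
}
```

A plain vint spends 5 bytes on any negative int, so applying zig-zag first keeps small negative deltas at 1 byte.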

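12. Distributed query

See SearchHandler.handleRequestBody, which processes requests concurrently in stages, with timeout control and shared cached connections. LBHttpSolrServer, a load-balancing SolrServer, is also a good implementation.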

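13. Plugin mechanism

See SolrResourceLoader, which makes services and plugins dynamically pluggable. It fully exposes Solr's extension points to developers, performance tuners, and builders of customized services. Because objects are constructed through a generated ClassLoader and reflection, initialized objects may turn out "inconsistent": the cause is that your own ClassLoader is not the same as the one Solr started with!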

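14. Bypassing Solr to use the Lucene API directly

Get the index view through firstSearcher with SolrIndexSearcher.getReader(), then use the Lucene API for business-specific customization. E.g. given a batch of terms, return the documents that contain at least one of them. Configure this in solrconfig.xml.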

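15. Custom updates

See DirectUpdateHandler2 to implement your own update logic, and from there your own in-memory index + on-disk index structure and your own real-time structure.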

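16. Index reads and writes across multiple disks and drives

On the same physical machine, a single process can host multiple SolrCores, and each SolrCore can be configured with a different index data directory.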

${solr.ulog.dir:}

${solr.data.dir:}

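17. LRUCache & FastLRUCache

General-purpose caches that can be reused directly in other scenarios.

See FastLRUCache and LRUCache: the former suits read-heavy workloads with few writes, the latter suits write-heavy workloads with few reads.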
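
For reuse outside Solr, the LRU eviction policy itself can be sketched with the JDK's `LinkedHashMap` in access order. The class `SimpleLruCache` is hypothetical and not thread-safe; it illustrates the eviction rule only and is not a copy of either Solr class.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical single-threaded LRU cache.
class SimpleLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxSize;

    SimpleLruCache(int maxSize) {
        super(16, 0.75f, true); // true = iterate in access order, oldest first
        this.maxSize = maxSize;
    }

    // Called by LinkedHashMap after each put; returning true evicts the eldest entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxSize;
    }
}
```

For concurrent use the map would need external synchronization, for example by wrapping it with `Collections.synchronizedMap`.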

18. Codec B+

https://issues.jboss.org/browse/ISPN-1349

http://grokbase.com/p/gg/elasticsearch/133112b662/issue-indexing-50mil-docs-via-bulk-api

http://www.research.ibm.com/haifa/Workshops/ir2005/papers/DougCutting-Haifa05.pdf

19. Low-frequency term optimization

schema.xml

solrconfig.xml

Low-frequency terms are written in a special way to save a single I/O when retrieving documents: the docid is written directly after the term.

20. unique id codec

http://opensourceconnections.com/blog/2013/06/05/build-your-own-lucene-codec/

https://issues.apache.org/jira/browse/LUCENE-4498

http://lucene.472066.n3.nabble.com/Fetching-uniqueKey-and-other-int-quickly-from-documentCache-td4119445.html

21. SPI

SPI appears in the 4.* series and is used to load codecs.

22. Solr's own serialization

See JavaBinCodec, used to serialize objects within Solr.

23. Directory

Extend Directory to support a distributed virtual file system.

StandardDirectoryFactory RAMDirectoryFactory MMapDirectoryFactory

HdfsDirectoryFactory

24. DataImport

from mysql batch task to hdfs block to build

from mysql batch task to local xml to build

from hdfs to build/ from odps to build

realtime add to master (with backup select), to slave commit log, then sync-consume the added record objects into the index

25. Analyzer

Paoding / AliWs / IK / WhitespaceTokenizerFactory / NGramFilterFactory / EdgeNGramFilterFactory

PatternTokenizerFactory / StandardTokenizerFactory / Payload term:value / JsonPreAnalyzedParser / CustomJson

ChineseFilterFactory / CJKBigramFilterFactory

26. Pinyin

Search Chinese characters by pinyin, including by pinyin abbreviations.

<fieldType name="cn_pinyin" class="solr.TextField" positionIncrementGap="100" omitNorms="false" omitPositions="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" side="front" minGramSize="1" maxGramSize="20"/>
    <filter class="solr.PimICUTransformFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

27. Chinese characters

Convert Traditional Chinese characters to Simplified Chinese.

<fieldType>
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" side="front" minGramSize="1" maxGramSize="20"/>
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified" direction="forward"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified" direction="forward"/>
  </analyzer>
</fieldType>

28. Fission (field splitting)

{"discount_price":"3000","original_price":"35.0","pic":"http://img.taobaocdn.net/bao/uploaded/i1/T1R1CzXeRiXXcckdZZ_032046.jpg","post_fee":"0"}

29. Not exist

Index stage: map null to -1, or null to "" (empty string)

Query stage: ![0 TO MAX] or ![* TO *]

30. Deep page

Deep paging with score

Deep paging without score

31. Mixed results & skipping duplicates

Use a collector chain with two collectors defined

Over-fetch, then remove duplicates inside the collector

32. Random

Assign random scores to get randomly ordered results

Random top N

33. like

For matching on one term

More like this

termVectors="true"

defType=edismax&mlt=true&mlt.fl=name&mlt.mintf=1&mlt.mindf=1

For matching on more than one term

34. Synonyms

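E.g. 一百天 and 100天 ("one hundred days" written with Chinese numerals and with digits); watch out for the ranking issue.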

35. Minimum match

defType=dismax

minimum mm=2

q.op=AND

36. Group by

Group by field, by query, or by function query

37. Fl alias name

ALIAS_NAME:FIELD_NAME

price:price_usd

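38. SQL query

SQL syntax to Solr syntax

Lexical analysis - syntactic analysis - semantic analysis - syntax tree

Performance optimization: expression transformation + extraction and sharing of common parts

Reference link: http://blog.sina.com.cn/s/blog_4d58e3c0010185n2.html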

39. Performance

More puts than gets: solr.LRUCache

More gets than puts: solr.FastLRUCache

facet.method=enum: uses the filter cache for each term

StandardDirectoryFactory: tries to choose the best implementation by itself

SimpleFSDirectoryFactory: local file system; doesn't scale well with a high number of threads

NIOFSDirectoryFactory: scales well with many threads; doesn't work well on Microsoft Windows

MMapDirectoryFactory: default for Solr on 64-bit Linux systems (3.1 to 4.0); desirable when NRT search is not needed

NRTCachingDirectoryFactory: for NRT search; stores some parts of the index in memory

RAMDirectoryFactory: not designed to hold large amounts of data; replication won't work
