4.7 Deep Paging: Source Code Analysis and Usage Notes
Parameter carrier: CursorMark
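Seen from the client, the cursor drives a simple loop: the first request sends cursorMark=*, every response carries a nextCursorMark, and paging ends when the returned mark equals the one that was sent. The sketch below imitates that loop in memory. It is not SolrJ code: Page, fetchPage and the mark encoding (the last id returned) are hypothetical stand-ins for a real Solr request and its opaque cursor string.

```java
import java.util.ArrayList;
import java.util.List;

class CursorLoopSketch {
    /** Hypothetical response: one page of ids plus the mark to send next. */
    static final class Page {
        final List<Integer> ids;
        final String nextCursorMark;

        Page(List<Integer> ids, String nextCursorMark) {
            this.ids = ids;
            this.nextCursorMark = nextCursorMark;
        }
    }

    /**
     * Hypothetical server side. The ids are assumed to be sorted ascending.
     * The mark "*" means "start from the beginning"; any other mark is the
     * last id that was returned (an assumption of this sketch only).
     */
    static Page fetchPage(List<Integer> sortedIds, String cursorMark, int rows) {
        List<Integer> out = new ArrayList<>();
        for (int id : sortedIds) {
            if (out.size() == rows) {
                break;
            }
            if (cursorMark.equals("*") || id > Integer.parseInt(cursorMark)) {
                out.add(id);
            }
        }
        // An empty page echoes the mark back, which tells the client to stop.
        String next = out.isEmpty() ? cursorMark : String.valueOf(out.get(out.size() - 1));
        return new Page(out, next);
    }

    /** Client loop: keep requesting until the mark stops changing. */
    static List<Integer> fetchAll(List<Integer> sortedIds, int rows) {
        List<Integer> all = new ArrayList<>();
        String mark = "*";
        while (true) {
            Page page = fetchPage(sortedIds, mark, rows);
            all.addAll(page.ids);
            if (page.nextCursorMark.equals(mark)) {
                break;
            }
            mark = page.nextCursorMark;
        }
        return all;
    }
}
```

Unlike start/rows paging, each request here only carries "where the last page ended", so the server never has to rebuild and discard the earlier pages.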
Core execution flow:
SolrIndexSearcher.search() --->
SolrIndexSearcher.getDocListC() --->
SolrIndexSearcher.getDocListNC(), which depends on the cursor parameter obtained from cmd.getCursorMark()
Inside getDocListNC:
final TopDocsCollector topCollector = buildTopDocsCollector(len, cmd);
Collector collector = topCollector;
buildTopDocsCollector in turn calls
return TopFieldCollector.create(weightedSort, len, searchAfter,
        fillFields, needScores, needScores, true);
TopFieldCollector.create then calls
return new PagingFieldCollector(queue, after, numHits, fillFields, trackDocScores, trackMaxScore);
About TopFieldCollector.PagingFieldCollector
public PagingFieldCollector(
        FieldValueHitQueue queue, FieldDoc after, int numHits, boolean fillFields,
        boolean trackDocScores, boolean trackMaxScore) {
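It supports sorting on multiple fields as well as custom Sort implementations.
The core of collection: first check whether the hit ties with the cursor on the sort values, then compare the internal doc id to decide whether to collect it. Because the sort may span several fields, the tie check is a loop over the comparators. The 3.* series has no such loop, because 3.* only supports deep paging under the default score ordering. Note also that the internal doc id passed in is global; the per-segment offset is computed in setNextReader.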
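The skip rule just described can be written down without Lucene. The sketch below is a simplified model, not the collector itself: sort values are plain long arrays, alreadyCollected is a hypothetical helper name, and reverseMul follows the convention of the collect() listing in this section (+1 ascending, -1 descending). The global cursor doc id is converted into the current segment's id space by subtracting docBase, which is the offset adjustment attributed to setNextReader above.

```java
class SearchAfterSketch {
    /**
     * Returns true when a hit belongs to an earlier page and must be skipped.
     *
     * values         sort values of the candidate hit, one per sort field
     * segmentDoc     doc id of the hit, local to the current segment
     * docBase        offset of the current segment in the global id space
     * afterValues    sort values of the last hit of the previous page
     * afterGlobalDoc global doc id of that last hit
     * reverseMul     +1 for an ascending field, -1 for a descending one
     */
    static boolean alreadyCollected(long[] values, int segmentDoc, int docBase,
                                    long[] afterValues, int afterGlobalDoc,
                                    int[] reverseMul) {
        for (int i = 0; i < values.length; i++) {
            int cmp = reverseMul[i] * Long.compare(afterValues[i], values[i]);
            if (cmp > 0) {
                return true;   // hit sorts before the cursor: already returned
            }
            if (cmp < 0) {
                return false;  // hit sorts after the cursor: not yet returned
            }
            // equal on this field, look at the next one
        }
        // Tie on every sort field: fall back to the doc id, compared in the
        // segment-local id space.
        int afterDoc = afterGlobalDoc - docBase;
        return segmentDoc <= afterDoc;
    }
}
```

With a single score-like field and no loop, the same check collapses to the 3.* form: skip when the score is better than the cursor's score, or equal with a doc id that is not larger.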
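In addition, deep paging during a full-index switchover can return inconsistent data if two consecutive deep-paging requests happen to straddle the old and the new full index.
Deep paging passes in comparators, while non-deep paging chooses between in-order and out-of-order collect implementations. See TopFieldCollector.create.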
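Note: postings lists are traversed in doc id order. This is the precondition for passing in the internal Lucene doc id, which is then used to break ties between docs with equal scores. It raises another problem: the Lucene doc id is local and can change. If a doc has just been updated, its doc id moves toward the end while its score may stay the same, so on the next deep-paging request it may only show up at the very end instead of becoming more visible after the modification. A more latent problem is real-time shard search. After a restart the commit log is replayed and doc ids are assigned again in increasing order, so if consecutive deep-paging requests are served by different SolrCores, duplicate data may be returned. A hidden field that guarantees uniqueness can be used to control this sort instability across shard searches. For example: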
from solrCore1 ------1---1-----------------
then the second doc is updated, so its Lucene internal doc id increases (possibly because solrCore2 was restarted, or because solrCore2 applied the update first)
from solrCore2 ------1--------------------1----
so the next request, served by solrCore2, fetches the second doc again

public void collect(int doc) throws IOException {
  //System.out.println("  collect doc=" + doc);

  totalHits++;

  float score = Float.NaN;
  if (trackMaxScore) {
    score = scorer.score();
    if (score > maxScore) {
      maxScore = score;
    }
  }

  if (queueFull) {
    // Fastmatch: return if this hit is no better than
    // the worst hit currently in the queue:
    for (int i = 0;; i++) {
      final int c = reverseMul[i] * comparators[i].compareBottom(doc);
      if (c < 0) {
        // Definitely not competitive.
        return;
      } else if (c > 0) {
        // Definitely competitive.
        break;
      } else if (i == comparators.length - 1) {
        // This is the equals case.
        if (doc + docBase > bottom.doc) {
          // Definitely not competitive
          return;
        }
        break;
      }
    }
  }

  // Check if this hit was already collected on a
  // previous page:
  boolean sameValues = true;
  for (int compIDX = 0; compIDX < comparators.length; compIDX++) {
    final FieldComparator comp = comparators[compIDX];

    final int cmp = reverseMul[compIDX] * comp.compareTop(doc);
    if (cmp > 0) {
      // Already collected on a previous page
      //System.out.println("    skip: before");
      return;
    } else if (cmp < 0) {
      // Not yet collected
      sameValues = false;
      //System.out.println("    keep: after; reverseMul=" + reverseMul[compIDX]);
      break;
    }
  }

  // Tie-break by docID:
  if (sameValues && doc <= afterDoc) {
    // Already collected on a previous page
    //System.out.println("    skip: tie-break");
    return;
  }

  if (queueFull) {
    // This hit is competitive - replace bottom element in queue & adjustTop
    for (int i = 0; i < comparators.length; i++) {
      comparators[i].copy(bottom.slot, doc);
    }

    // Compute score only if it is competitive.
    if (trackDocScores && !trackMaxScore) {
      score = scorer.score();
    }
    updateBottom(doc, score);

    for (int i = 0; i < comparators.length; i++) {
      comparators[i].setBottom(bottom.slot);
    }
  } else {
    collectedHits++;

    // Startup transient: queue hasn't gathered numHits yet
    final int slot = collectedHits - 1;
    //System.out.println("    slot=" + slot);
    // Copy hit into queue
    for (int i = 0; i < comparators.length; i++) {
      comparators[i].copy(slot, doc);
    }

    // Compute score only if it is competitive.
    if (trackDocScores && !trackMaxScore) {
      score = scorer.score();
    }
    bottom = pq.add(new Entry(slot, docBase + doc, score));
    queueFull = collectedHits == numHits;
    if (queueFull) {
      for (int i = 0; i < comparators.length; i++) {
        comparators[i].setBottom(bottom.slot);
      }
    }
  }
}
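The solrCore1/solrCore2 scenario above, and the unique hidden field proposed as a remedy, can be reproduced in a toy model. The sketch below is an illustration under stated assumptions, not Solr code: Doc, page, BY_DOC_ID and BY_KEY are hypothetical names, each list stands for one core, and the "cursor" is simply the last doc of the previous page as seen by the core that served it.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class TieBreakSketch {
    /** Hypothetical document: key and score agree across cores, internalId does not. */
    static final class Doc {
        final String key;
        final int score;
        final int internalId;

        Doc(String key, int score, int internalId) {
            this.key = key;
            this.score = score;
            this.internalId = internalId;
        }
    }

    // Score descending, ties broken by the core-local internal doc id.
    static final Comparator<Doc> BY_DOC_ID =
            Comparator.comparingInt((Doc d) -> -d.score).thenComparingInt(d -> d.internalId);

    // Score descending, ties broken by the unique key shared by all cores.
    static final Comparator<Doc> BY_KEY =
            Comparator.comparingInt((Doc d) -> -d.score).thenComparing(d -> d.key);

    /** Keys of up to `rows` docs that sort strictly after `after` (null means first page). */
    static List<String> page(List<Doc> core, Comparator<Doc> order, Doc after, int rows) {
        List<Doc> sorted = new ArrayList<>(core);
        sorted.sort(order);
        List<String> keys = new ArrayList<>();
        for (Doc d : sorted) {
            if (keys.size() == rows) {
                break;
            }
            if (after == null || order.compare(d, after) > 0) {
                keys.add(d.key);
            }
        }
        return keys;
    }
}
```

Take core1 holding A, B, C with internal ids 0, 1, 2 and core2 holding the same docs after B was re-indexed, so B now has id 3, all with the same score. Page one from core1 is A, B under either ordering. With BY_DOC_ID the cursor is "id 1", and core2 answers C, B, repeating B. With BY_KEY the cursor is "key B", and core2 answers only C.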
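Core collect of deep paging in 3.*
4.7 returns a serialized CursorMark, whereas 3.* returns the doc and score directly. Objects in Solr 4.7 all support serialization and deserialization through JavaBinCodec.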
@Override
public void collect(int doc) throws IOException {
  float score = scorer.score();

  // This collector cannot handle these scores:
  assert score != Float.NEGATIVE_INFINITY;
  assert !Float.isNaN(score);

  totalHits++;

  if (score > after.score || (score == after.score && doc <= afterDoc)) {
    // hit was collected on a previous page
    return;
  }

  if (score <= pqTop.score) {
    // Since docs are returned in-order (i.e., increasing doc Id), a document
    // with equal score to pqTop.score cannot compete since HitQueue favors
    // documents with lower doc Ids. Therefore reject those docs too.
    return;
  }
  collectedHits++;
  pqTop.doc = doc + docBase;
  pqTop.score = score;
  pqTop = pq.updateTop();
}

/**
 * The class is designed to optimally serialize/deserialize any supported types in Solr response.
 * As we know there are only a limited type of items this class can do it with very minimal
 * amount of payload and code. There are 15 known types and if there is an object in the object
 * tree which does not fall into these types, It must be converted to one of these. Implement an
 * ObjectResolver and pass it over. It is expected that this class is used on both end of the
 * pipes. The class has one read method and one write method for each of the datatypes.
 *
 * Note -- Never re-use an instance of this class for more than one marshal or unmarshall
 * operation. Always create a new instance.
 */
public class JavaBinCodec {
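To make the "serialized CursorMark" idea concrete, the sketch below packs the sort values of the last hit into an opaque string and reads them back. It is only an illustration of the round trip: CursorTokenSketch, encode and decode are hypothetical names, the cursor is assumed to hold a score and a unique key, and DataOutputStream plus java.util.Base64 stand in for JavaBinCodec, so the bytes produced here are not the JavaBinCodec wire format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Base64;

class CursorTokenSketch {
    /** Packs the sort values of the last hit into a URL-safe opaque token. */
    static String encode(float score, String uniqueKey) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeFloat(score);
            out.writeUTF(uniqueKey);
        } catch (IOException e) {
            // Writing to an in-memory stream is not expected to fail.
            throw new UncheckedIOException(e);
        }
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes.toByteArray());
    }

    /** Reads the token back as { Float score, String uniqueKey }. */
    static Object[] decode(String token) {
        byte[] raw = Base64.getUrlDecoder().decode(token);
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(raw))) {
            float score = in.readFloat();
            String uniqueKey = in.readUTF();
            return new Object[] { score, uniqueKey };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the token carries sort values rather than an offset, the server can resume from it with a search-after style check, and the client only has to pass the string back unchanged.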