Lucene Code Reading Guide and Test Examples

Introduction:

Lucene 原理与代码分析 (Lucene: Principles and Code Analysis), complete edition -- highly recommended

Annotated Lucene, an introduction and source-code walkthrough: http://javenstudio.org/blog/annotated-lucene -- centered on the core IndexWriter

Download: Annotated+Lucene+.pdf: http://ishare.iask.sina.com.cn/f/24103589.html

Suggested reading order:

1. Understand the basic principles and concepts of information retrieval

2. Learn Lucene's basic concepts

3. Get familiar with Lucene's index file formats -- this is the key part

4. Get familiar with Lucene's indexing flow. The code here has a deep class hierarchy and introduces some unnecessary design patterns, which makes it relatively hard to read. The basic idea: a controller + model pair wraps the indexing chain and enables multi-threaded concurrent processing with no data shared between threads (see the sketch after this list).

5. Get familiar with Lucene's search flow

6. Learn Lucene's query syntax parser and get familiar with tokenization/analysis
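
The "controller + model" idea in step 4 can be pictured with the schematic below. This is a sketch, not Lucene's actual code (internally, Lucene 2.9's DocumentsWriter plays the controller role and keeps per-thread indexing state); IndexingController, IndexingChain, and Doc are invented names. The point is that the controller only dispatches documents, while each worker thread owns a private chain instance, so no indexing state is shared between threads.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Schematic only: these classes are invented for illustration and are not
// Lucene's real classes.
class Doc {
    final String text;
    Doc(String text) { this.text = text; }
}

class IndexingChain {
    void process(Doc d) { /* tokenize -> invert -> write a private segment */ }
}

public class IndexingController {
    private final BlockingQueue<Doc> queue = new ArrayBlockingQueue<Doc>(128);

    public IndexingController(int nThreads) {
        for (int i = 0; i < nThreads; i++) {
            // Each worker owns its own chain: no shared mutable indexing state.
            final IndexingChain chain = new IndexingChain();
            new Thread(new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            chain.process(queue.take());
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            }).start();
        }
    }

    // The controller only dispatches; the per-thread models do the work.
    public void addDocument(Doc d) throws InterruptedException {
        queue.put(d);
    }
}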

 

The recommended materials dissect Lucene's source code in depth and are very valuable. Reading the documents alone is not concrete enough; after skimming them, it is best to work through the source code side by side. The documents give you the overall concepts, but their explanations of source-level details can easily leave you "seeing the twigs but not the forest" and struggling to follow. Take the roadmap the document authors lay out and read the actual source along with it -- it becomes much easier.

Tests

Tests are extremely helpful for understanding how Lucene works and how its code executes; they are an important aid when reading the source.

IndexerExample.java

/*
 * Compile: javac -classpath .:../lucene-core-2.9.1.jar:../ChineseSegmenter/chineseSegmenter.jar  IndexerExample.java
 * Exec   : java  -classpath .:../lucene-core-2.9.1.jar:../ChineseSegmenter/chineseSegmenter.jar  IndexerExample
 *
 */

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.cn.ChineseAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;


public class IndexerExample {
    
    private static void EnExample() throws Exception {

        // Store the index on disk
        Directory directory = FSDirectory.getDirectory("/tmp/testindex");
        // Use standard analyzer
        Analyzer analyzer = new StandardAnalyzer();
        // Create IndexWriter object
        IndexWriter iwriter = new IndexWriter(directory, analyzer, true);
        iwriter.setMaxFieldLength(25000);
        // make a new, empty document
        Document doc = new Document();
        File f = new File("/tmp/test.txt");
        
        // Add the path of the file as a field named "path".  Use a field that is
        // indexed (i.e. searchable), but don't tokenize the field into words.
        doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));
        
        String text = "This is the text to be indexed.";
        doc.add(new Field("fieldname", text, Field.Store.YES,      Field.Index.TOKENIZED));
        doc.add(new Field("name", text, Field.Store.YES,      Field.Index.TOKENIZED));
        
        // Add the last modified date of the file a field named "modified".  Use
        // a field that is indexed (i.e. searchable), but don't tokenize the field
        // into words.
        doc.add(new Field("modified",
                    DateTools.timeToString(f.lastModified(), DateTools.Resolution.MINUTE),
                    Field.Store.YES, Field.Index.UN_TOKENIZED));
        // Add the contents of the file to a field named "contents".  Specify a Reader,
        // so that the text of the file is tokenized and indexed, but not stored.
        // Note that FileReader expects the file to be in the system's default encoding.
        // If that's not the case searching for special characters will fail.
        doc.add(new Field("contents", new FileReader(f)));
        
        iwriter.addDocument(doc);
        iwriter.optimize();
        iwriter.close();

    }
 
    private static void CnExample() throws Exception {

        // Store the index on disk
        Directory directory = FSDirectory.getDirectory("/tmp/testindex");
        // Use chinese analyzer
        Analyzer analyzer = new ChineseAnalyzer();
        PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(new WhitespaceAnalyzer());
        wrapper.addAnalyzer("name", analyzer);
        
        // Create IndexWriter object
        IndexWriter iwriter = new IndexWriter(directory, wrapper, true);
        iwriter.setMaxFieldLength(25000);
        // make a new, empty document
        Document doc = new Document();
        File f = new File("/tmp/test.txt");
        
        // Add the path of the file as a field named "path".  Use a field that is
        // indexed (i.e. searchable), but don't tokenize the field into words.
        doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));
        
        String text = "This is the text to be indexed.";
        doc.add(new Field("fieldname", text, Field.Store.YES, Field.Index.TOKENIZED));
        
        String name = "2013春装新款女气质修身风衣大翻领双层大摆长款外套 系腰带";
        doc.add(new Field("name", name, Field.Store.YES, Field.Index.TOKENIZED));
        
        // Add the last modified date of the file a field named "modified".  Use
        // a field that is indexed (i.e. searchable), but don't tokenize the field
        // into words.
        doc.add(new Field("modified",
                    DateTools.timeToString(f.lastModified(), DateTools.Resolution.MINUTE),
                    Field.Store.YES, Field.Index.UN_TOKENIZED));
        // Add the contents of the file to a field named "contents".  Specify a Reader,
        // so that the text of the file is tokenized and indexed, but not stored.
        // Note that FileReader expects the file to be in the system's default encoding.
        // If that's not the case searching for special characters will fail.
        doc.add(new Field("contents", new FileReader(f)));
        
        iwriter.addDocument(doc);
        iwriter.optimize();
        iwriter.close();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Start test: ");

        if( args.length > 0){
            CnExample();
        }
        else{
            EnExample();
        }

        System.out.println("Index dir: /tmp/testindex");
    }
}
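To verify what IndexerExample actually wrote, a small dump utility like the sketch below can list every indexed term with its document frequency, using the same lucene-core-2.9.1 API (IndexReader, TermEnum); DumpIndex is a made-up name for illustration.

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

public class DumpIndex {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.getDirectory("/tmp/testindex"));
        System.out.println("numDocs = " + reader.numDocs());
        // Walk the term dictionary: every field:text pair that was indexed.
        TermEnum terms = reader.terms();
        while (terms.next()) {
            System.out.println(terms.term().field() + ":" + terms.term().text()
                    + "  (docFreq=" + terms.docFreq() + ")");
        }
        terms.close();
        reader.close();
    }
}

Running it after EnExample should show terms for the path, fieldname, name, modified, and contents fields, which makes the difference between TOKENIZED and UN_TOKENIZED fields easy to see.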

SearcherExample.java

/*
 * Compile: javac -classpath .:../lucene-core-2.9.1.jar:../ChineseSegmenter/chineseSegmenter.jar  SearcherExample.java
 * Exec   : java  -classpath .:../lucene-core-2.9.1.jar:../ChineseSegmenter/chineseSegmenter.jar  SearcherExample
 * 
 */

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.StringReader;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.cn.ChineseAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Hits;
import org.apache.lucene.queryParser.QueryParser;


public class SearcherExample { 

    public static void main(String[] args) throws Exception { 
        if (args.length < 2) { 
            throw new Exception("Usage: java " + Searcher.class.getName() 
                    + "<index dir> <query> [cn]"); 
        } 
        File indexDir = new File(args[0]);
        String q = args[1]; 
        boolean bCn = args.length > 2;

        if (!indexDir.exists() || !indexDir.isDirectory()) { 
            throw new Exception(indexDir + 
                    " does not exist or is not a directory."); 
        } 
        search(indexDir, q, bCn); 
    } 

    public static void search(File indexDir, String q, boolean bCn) 
        throws Exception { 
        Directory fsDir = FSDirectory.getDirectory(indexDir); 
        IndexSearcher is = new IndexSearcher(fsDir);

        Analyzer analyzer = new StandardAnalyzer();
        if( bCn ){
            analyzer = new ChineseAnalyzer();
        }

        QueryParser parser = new QueryParser( "name",  analyzer);
        Query query = parser.parse(q); 
        
        System.out.println("Query: " + query.toString());
        long start = new Date().getTime(); 
        Hits hits = is.search(query);
        long end = new Date().getTime(); 

        System.err.println("Found " + hits.length() + 
                " document(s) (in " + (end - start) + 
                " milliseconds) that matched query '" + 
                q + "'"); 

        for (int i = 0; i < hits.length(); i++) { 
            Document doc = hits.doc(i); 
            System.out.println( "HIT " + i + " :" + doc.get("name")); 
        } 
    } 
} 
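Step 6 of the reading plan covers the query syntax parser. A quick way to build intuition is to feed QueryParser a few query strings and print the Query objects it produces, as in the sketch below (QueryParserDemo is a made-up name; the API is the same lucene-core-2.9.1 QueryParser used above).

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class QueryParserDemo {
    public static void main(String[] args) throws Exception {
        // "name" is the default field when the query does not name one.
        QueryParser parser = new QueryParser("name", new StandardAnalyzer());
        String[] samples = {
            "coat",                 // single term on the default field
            "fieldname:indexed",    // explicit field
            "\"to be indexed\"",    // phrase query
            "coat AND belt",        // boolean query
        };
        for (String s : samples) {
            Query q = parser.parse(s);
            System.out.println(s + "  ->  " + q.toString()
                    + "  [" + q.getClass().getSimpleName() + "]");
        }
    }
}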

For Chinese word segmentation you can use the analyzer that ships with Lucene (the results are mediocre), or wrap your own segmenter; the core of the job is wrapping the segmentation logic in a Tokenizer.

package org.apache.lucene.analysis.cn;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;

import org.apache.commons.lang.StringUtils;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.Tokenizer;

public class SnippetTermTokenizer extends Tokenizer {
        private StringBuffer buffer = new StringBuffer();
        private BufferedReader inputBuffer;
        private JNISelecter selecter;     // core Chinese segmentation engine (a JNI wrapper)
        private List<Token> tokenList = null;
        private List<String> phraseTokenList = null;
        private Iterator<Token> tokenIter = null;

        public SnippetTermTokenizer(Reader reader, JNISelecter s) {
                inputBuffer = new BufferedReader(reader, 2048);
                selecter = s;
        }

        public Token next() throws IOException {
                if (tokenIter != null) {
                        if (tokenIter.hasNext()) {
                                return tokenIter.next();
                        } else {
                                // all input has been consumed
                                return null;
                        }
                }
                // first call: read the whole input, then segment it
                readContent();
                if (segment() && tokenIter.hasNext()) {
                        // segmentation succeeded and produced at least one token
                        return tokenIter.next();
                }
                return null;
        }

        public void close() throws IOException {
                inputBuffer.close();
        }
       
        // segmentation-related methods (readContent, segment, ...) omitted
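To put such a tokenizer to work, wrap it in an Analyzer and register that with PerFieldAnalyzerWrapper exactly as CnExample does above. A minimal sketch, assuming JNISelecter can be constructed with no arguments (the real class is not shown in this post):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;

public class SnippetTermAnalyzer extends Analyzer {
    // Assumption: a no-arg constructor; the real JNISelecter setup may differ.
    private final JNISelecter selecter = new JNISelecter();

    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Every field handled by this Analyzer goes through the custom tokenizer.
        return new SnippetTermTokenizer(reader, selecter);
    }
}

Registering it for the "name" field then looks like: wrapper.addAnalyzer("name", new SnippetTermAnalyzer());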

 

This article was reposted from zhenjing's cnblogs blog. Original post: http://www.cnblogs.com/zhenjing/archive/2013/03/18/lucene_source_code.html. Please contact the original author if you wish to republish it.

