Fundamentals:
Introduction to Information Retrieval_irbookprint
managing_gigabytes
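Search Engine Principles and Technology (搜索引擎原理与技术)
Modern Information Retrieval (现代信息检索)
Into the Search Engine (走进搜索引擎)
Well-known research groups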
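Huazhong University of Science and Technology, Intelligent and Distributed Computing Laboratory http://idc.hust.edu.cn/
Institute of Computing Technology, Chinese Academy of Sciences, Information Retrieval Group http://ir.ict.ac.cn/blog/
Harbin Institute of Technology, Research Center for Social Computing and Information Retrieval http://ir.hit.edu.cn/
Peking University, Network Laboratory http://www.cwirf.org/ http://sewm.pku.edu.cn/project/SIPE.html
Tsinghua University, State Key Laboratory of Intelligent Technology and Systems http://166.111.138.86/cms/
South China University of Technology http://dmir.gdut.edu.cn/members.html
Zhejiang University http://jpkc.zju.edu.cn/k/244/
Stanford University http://nlp.stanford.edu/software/parser-faq.shtml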
DataMine http://www.cs.waikato.ac.nz/ml/weka/
Major official blogs in China
soso blog http://blog.csdn.net/soso_blog
Sogou Labs http://www.sogou.com/labs/
taobao http://blog.search.taobao.com/
baidu http://www.baidu-tech.com/
Top venues and papers representing the state of the art
SIGIR http://www.sigir.org/
TREC http://trec.nist.gov/
WWW http://www.w3.org/Conferences/Overview-WWW.html
Challenges in Building Large-Scale Information Retrieval Systems WSDM09-keynote.pdf
Information Retrieval Current and Future Research_03tc.pdf
Inverted Files for Text Search Engines.pdf
Performance of compressed inverted list caching in search engines
Inverted Index Compression and Query Processing with Optimized Document Ordering
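Index relevance http://nlp.stanford.edu/IR-book/html/htmledition/index-1.html
SSD: Huazhong University of Science and Technology, Intelligent and Distributed Computing Laboratory http://idc.hust.edu.cn/
GPU: Huazhong University of Science and Technology, Intelligent and Distributed Computing Laboratory http://idc.hust.edu.cn/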
http://koala.poly.edu/ShuaiDing.html
http://cis.poly.edu/suel/
http://www.azintablog.com/2010/10/16/gpu-large-scale-data-mining/
http://membres-liglab.imag.fr/termier/ParallelDMWorkshop/index.html
http://moss.csc.ncsu.edu/~mueller/ftp/pub/mueller/papers/epham09.pdf
http://people.gucas.ac.cn/~yingliu?language=en
http://hi.baidu.com/sebarzi/blog/item/9d7c7fe98e156031b80e2deb.html
Important open-source projects
Lucene/Solr http://lucene.apache.org/
Solr Application Development Tutorial Presentation.pdf
Livro Solr 1.4 Enterprise Search Server.pdf
Open Tools for Machine Learning http://hi.baidu.com/michzel/blog/item/ffce9e2018c186184d088d11.html
I. Information Retrieval
1. Lemur/Indri
The Lemur Toolkit for Language Modeling and Information Retrieval
http://www.lemurproject.org/
Indri: Lemur's latest search engine
2. Lucene/Nutch
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.
http://lucene.apache.org/ http://www.nutch.org/
3. WGet
GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely used Internet protocols. It is a non-interactive command-line tool, so it can easily be called from scripts, cron jobs, terminals without X-Windows support, etc.
http://www.gnu.org/software/wget/wget.html
II. Natural Language Processing
1. EGYPT: A Statistical Machine Translation Toolkit
http://www.clsp.jhu.edu/ws99/projects/mt/
2. GIZA++ (Statistical Machine Translation)
http://www.fjoch.com/GIZA++.html
3. PHARAOH (Statistical Machine Translation)
http://www.isi.edu/licensed-sw/pharaoh/
a beam search decoder for phrase-based statistical machine translation models
4. OpenNLP
http://opennlp.sourceforge.net/
5. MINIPAR by Dekang Lin (Univ. of Alberta, Canada)
MINIPAR is a broad-coverage parser for the English language. An evaluation with the SUSANNE corpus shows that MINIPAR achieves about 88% precision and 80% recall with respect to dependency relationships. MINIPAR is also very efficient: on a Pentium II 300 with 128MB memory, it parses about 300 words per second.
The binary can be downloaded for free after filling out a form.
http://www.cs.ualberta.ca/~lindek/minipar.htm
6. WordNet
http://wordnet.princeton.edu/
7. HowNet
http://www.keenage.com/
8. Statistical Language Modeling Toolkit
http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html
9. SRI Language Modeling Toolkit
www.speech.sri.com/projects/srilm/
10. ReWrite Decoder
http://www.isi.edu/licensed-sw/rewrite-decoder/
11. GATE (General Architecture for Text Engineering)
http://gate.ac.uk/
12. NLTK (Natural Language Toolkit)
http://nltk.sourceforge.net/index.php/Main_Page
III. Machine Learning
1. YASMET: Yet Another Small MaxEnt Toolkit (Statistical Machine Learning)
http://www.fjoch.com/YASMET.html
2. LibSVM
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
3. SVM Light
http://svmlight.joachims.org/
4. CLUTO
http://www-users.cs.umn.edu/~karypis/cluto/
5. CRF++
http://chasen.org/~taku/software/CRF++/
6. SVM Struct
http://www.cs.cornell.edu/People/tj/svm_light/svm_struct.html
7. MALLET
MAchine Learning for LanguagE Toolkit http://mallet.cs.umass.edu/index.php
IV. Misc:
1. WinMerge: compares text content to find the differences between two versions of a program
winmerge.sourceforge.net/
2. OpenPerlIDE: an open-source Perl editor with built-in compilation and line-by-line debugging
open-perl-ide.sourceforge.net/
3. Berkeley DB
http://www.sleepycat.com/
----------------------------------------
Classic articles
Why Google Cannot Beat Baidu in China Search Engine Market.pdf
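Algorithmic Differences Between Baidu and Google (百度与谷歌在算法上的区别.docx)
Baidu vs Google: The Difference Between Good and Great (百度 vs Google:优秀与伟大之别)
Haigou Is Not a Dog: Exploring Alipay's Near-Real-Time Search Queries (海狗不是狗探秘支付宝准实时搜索查询)
Common Psychological Traits of Internet Users http://www.chinaz.com/manage/2011/1221/227402.shtml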