http://lucene.apache.org/nutch/
How to Setup Nutch and Hadoop
http://wiki.apache.org/nutch/NutchHadoopTutorial
-
下载
$ cd /usr/local/src/ $ wget http://apache.etoak.com/lucene/nutch/nutch-1.0.tar.gz $ tar zxvf nutch-1.0.tar.gz $ sudo cp -r nutch-1.0 .. $ cd .. $ sudo ln -s nutch-1.0 apache-nutch
-
创建文件myurl
$ cd apache-nutch $ mkdir urls $ vim urls/myurl http://netkiller.8800.org/
-
配置文件 crawl-urlfilter.txt
编辑conf/crawl-urlfilter.txt文件,修改MY.DOMAIN.NAME部分,把它替换为你想要抓取的域名
$ cp conf/crawl-urlfilter.txt conf/crawl-urlfilter.txt.old $ vim conf/crawl-urlfilter.txt # accept hosts in MY.DOMAIN.NAME +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/ 修改为: # accept hosts in MY.DOMAIN.NAME +^http://([a-z0-9]*\.)*netkiller.8800.org/
-
http.agent.name
$ vim conf/nutch-site.xml <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. --> <configuration> <property> <name>http.agent.name</name> <value>Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1) Gecko/20090624 Firefox/3.5</value> <description>HTTP 'User-Agent' request header. MUST NOT be empty - please set this to a single word uniquely related to your organization. NOTE: You should also check other related properties: http.robots.agents http.agent.description http.agent.url http.agent.email http.agent.version and set their values appropriately. </description> </property> <property> <name>http.agent.description</name> <value></value> <description>Further description of our bot- this text is used in the User-Agent header. It appears in parenthesis after the agent name. </description> </property> <property> <name>http.agent.url</name> <value>http://netkiller.8800.org/robot.html</value> <description>A URL to advertise in the User-Agent header. This will appear in parenthesis after the agent name. Custom dictates that this should be a URL of a page explaining the purpose and behavior of this crawler. </description> </property> <property> <name>http.agent.email</name> <value>openunix@163.com</value> <description>An email address to advertise in the HTTP 'From' request header and User-Agent header. A good practice is to mangle this address (e.g. 'info at example dot com') to avoid spamming. </description> </property> </configuration>
-
运行以下命令行开始工作
$ bin/nutch crawl urls -dir crawl -depth 3 -threads 5bin/nutch crawl <your_url> -dir <your_dir> -depth 2 -threads 4 >&logs/logs1.log urls 存放需要爬行的url文件的目录,即目录/nutch/urls。 -dir dirnames 设置保存所抓取网页的目录. -depth depth 表明抓取网页的层次深度 -delay delay 表明访问不同主机的延时,单位为“秒” -threads threads 表明需要启动的线程数 -topN 50 topN 一个网站保存的最大页面数。 $ nohup bin/nutch crawl /usr/local/apache-nutch/urls -dir /usr/local/apache-nutch/crawl -depth 5 -threads 50 -topN 50 > /tmp/nutch.log &
-
depoly
$ cd /usr/local/apache-tomcat/conf/Catalina/localhost $ vim nutch.xml <Context docBase="/usr/local/apache-nutch/nutch-1.0.war" debug="0" crossContext="true" > </Context>
searcher.dir
$ vim /usr/local/apache-tomcat/webapps/nutch/WEB-INF/classes/nutch-site.xml <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. --> <configuration> <property> <name>searcher.dir</name> <value>/usr/local/apache-nutch/crawl</value> </property> </configuration>
test
http://172.16.0.1:8080/nutch/
原文出处:Netkiller 系列 手札
本文作者:陈景峯
转载请与作者联系,同时请务必标明文章原始出处和作者信息及本声明。