Resources
WebMagic's architecture is modeled on Scrapy.
Project homepage: http://webmagic.io/
GitHub repository: https://github.com/code4craft/webmagic
Documentation (Chinese): http://webmagic.io/docs/zh/
Environment setup
Create a new Maven project in IntelliJ IDEA.
1. Dependency configuration
WebMagicSpider/pom.xml
<dependencies>
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-core</artifactId>
        <version>0.7.3</version>
    </dependency>
    <!-- webmagic-extension is declared once; the original listed it twice -->
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-extension</artifactId>
        <version>0.7.3</version>
        <exclusions>
            <exclusion>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-log4j12</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
</dependencies>
2. Logging configuration
WebMagicSpider/src/main/resources/log4j.properties
log4j.rootLogger=WARN, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
Building the project
1. Writing the crawler
WebMagicSpider/src/main/java/BaiduPageProcessor.java
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.ConsolePipeline;
import us.codecraft.webmagic.pipeline.JsonFilePipeline;
import us.codecraft.webmagic.processor.PageProcessor;

public class BaiduPageProcessor implements PageProcessor {

    // Crawl settings: retry once, wait 1000 ms between requests, decode pages as UTF-8
    private final Site site = Site.me()
            .setRetryTimes(1)
            .setSleepTime(1000)
            .setCharset("utf-8");

    // Extract the text of the <title> element and store it under the key "title"
    @Override
    public void process(Page page) {
        page.putField("title", page.getHtml().css("title", "text").toString());
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new BaiduPageProcessor())
                .addUrl("http://www.baidu.com/")
                // Print the extracted fields to the console
                .addPipeline(new ConsolePipeline())
                // Also save the extracted fields as JSON under this directory
                .addPipeline(new JsonFilePipeline("/Users/qmp/myproject/WebMagicSpider"))
                .thread(1)
                .run();
    }
}
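The program above hands a start URL, a PageProcessor, and two Pipelines to a Spider. As a rough mental model of that flow (URLs are queued, downloaded, processed into fields, then passed to each pipeline in turn), here is a minimal sketch using only the Java standard library. It is not WebMagic's implementation: `MiniSpiderSketch`, `Processor`, `crawl`, and `titleOf` are hypothetical names made up for illustration, and the downloader is a fake that returns a fixed HTML string instead of making a network request.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical stand-in for illustration only; none of these names are WebMagic classes.
public class MiniSpiderSketch {

    // Turns a downloaded page into named fields (the role a PageProcessor plays).
    public interface Processor {
        Map<String, String> process(String url, String body);
    }

    // Takes URLs from a queue, downloads each one, extracts fields,
    // and passes the fields to every pipeline in registration order.
    // Returns the number of pages handled.
    public static int crawl(List<String> startUrls,
                            Function<String, String> downloader,
                            Processor processor,
                            List<Consumer<Map<String, String>>> pipelines) {
        Queue<String> scheduler = new ArrayDeque<>(startUrls);
        int pages = 0;
        while (!scheduler.isEmpty()) {
            String url = scheduler.poll();
            String body = downloader.apply(url);
            Map<String, String> fields = processor.process(url, body);
            for (Consumer<Map<String, String>> pipeline : pipelines) {
                pipeline.accept(fields);
            }
            pages++;
        }
        return pages;
    }

    // Naive <title> extraction by string search; a real crawler would use an HTML parser.
    public static String titleOf(String html) {
        int start = html.indexOf("<title>");
        int end = html.indexOf("</title>");
        if (start < 0 || end < start) {
            return "";
        }
        return html.substring(start + "<title>".length(), end).trim();
    }

    public static void main(String[] args) {
        // Fake downloader: always returns the same page, no network access.
        Function<String, String> fakeDownloader =
                url -> "<html><head><title>Example page</title></head></html>";

        Processor processor = (url, body) -> {
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("title", titleOf(body));
            return fields;
        };

        List<Map<String, String>> collected = new ArrayList<>();
        List<Consumer<Map<String, String>>> pipelines = new ArrayList<>();
        pipelines.add(fields -> System.out.println("title: " + fields.get("title")));
        pipelines.add(collected::add); // second pipeline keeps results in memory

        int pages = crawl(Collections.singletonList("http://example.invalid/"),
                fakeDownloader, processor, pipelines);
        System.out.println("pages processed: " + pages);
    }
}
```

The point of the sketch is the separation of roles: swapping the console pipeline for a file-writing one, or the fake downloader for a real HTTP client, would not require touching the extraction logic.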
2. Running the program
Console output
get page: http://www.baidu.com/
title: 百度一下,你就知道
File output
{"title":"百度一下,你就知道"}
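To find the JSON file on disk, it helps to know how its name is built. As far as I recall from the 0.7.x source, JsonFilePipeline writes to a subdirectory named after the task's UUID (by default the site's domain) and names the file after the MD5 hex digest of the page URL. Treat that layout as an assumption and check it against your WebMagic version. The sketch below uses only the standard library to compute the expected path; `JsonOutputPathSketch`, `md5Hex`, and `expectedFile` are hypothetical helper names, not part of WebMagic.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration; not part of WebMagic.
public class JsonOutputPathSketch {

    // Returns the lowercase hexadecimal MD5 digest of the UTF-8 bytes of the text.
    public static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    // Assumed layout: <baseDir>/<host>/<md5 of url>.json
    public static String expectedFile(String baseDir, String host, String url) {
        return baseDir + "/" + host + "/" + md5Hex(url) + ".json";
    }

    public static void main(String[] args) {
        // Values taken from the crawler program in this post.
        System.out.println(expectedFile(
                "/Users/qmp/myproject/WebMagicSpider",
                "www.baidu.com",
                "http://www.baidu.com/"));
    }
}
```

If no file appears at the printed path, list the output directory to see the layout your version actually uses.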