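I'm building the official website and have no content for it yet, so I decided to pull some from the blog posts I've written on OSChina. That calls for a crawler.
It took a few minutes to put one together. The steps were as follows:
Step 1, write a method to crawl the category listing: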
    public static void processCategory(String categoryId) {
        Watcher watcher = new WatcherImpl();
        Spider spider = new SpiderImpl();
        watcher.addProcessor(new OsChinaCategoryProcessor());
        // Only <li class="Blog"> nodes (one per post in the listing) are handed to the processor
        QuickNameFilter<HtmlNode> nodeFilter = new QuickNameFilter<HtmlNode>();
        nodeFilter.setNodeName("li");
        nodeFilter.setIncludeAttribute("class", "Blog");
        watcher.setNodeFilter(nodeFilter);
        spider.addWatcher(watcher);
        spider.processUrl("http://my.oschina.net/tinyframework/blog?catalog=" + categoryId);
    }
Step 2, write a method to crawl a single article:
    public static void processTopic(String pageId) {
        Watcher watcher = new WatcherImpl();
        Spider spider = new SpiderImpl();
        watcher.addProcessor(new OsChinaTopicProcessor());
        // Only the <div class="BlogContent"> node (the article body) is handed to the processor
        QuickNameFilter<HtmlNode> nodeFilter = new QuickNameFilter<HtmlNode>();
        nodeFilter.setNodeName("div");
        nodeFilter.setIncludeAttribute("class", "BlogContent");
        watcher.setNodeFilter(nodeFilter);
        spider.addWatcher(watcher);
        spider.processUrl("http://my.oschina.net/tinyframework/blog/" + pageId);
    }
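Both crawl methods follow the same pattern: a filter picks nodes by tag name and attribute value, and every matching node is handed to a processor. The sketch below illustrates that idea with plain JDK classes only. SketchNode, SketchProcessor and SketchWatcher are hypothetical stand-ins invented for this example; they are not TinySpider's classes and say nothing about how TinySpider is implemented internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; NOT TinySpider's API.
final class SketchNode {
    final String name;
    final Map<String, String> attributes;
    final List<SketchNode> children = new ArrayList<>();

    SketchNode(String name, Map<String, String> attributes) {
        this.name = name;
        this.attributes = attributes;
    }

    SketchNode add(SketchNode child) {
        children.add(child);
        return this;
    }
}

interface SketchProcessor {
    void process(String url, SketchNode node);
}

final class SketchWatcher {
    private final String nodeName;
    private final String attrName;
    private final String attrValue;
    private final List<SketchProcessor> processors = new ArrayList<>();

    SketchWatcher(String nodeName, String attrName, String attrValue) {
        this.nodeName = nodeName;
        this.attrName = attrName;
        this.attrValue = attrValue;
    }

    void addProcessor(SketchProcessor processor) {
        processors.add(processor);
    }

    // Walks the tree depth-first, hands every matching node to every
    // processor, and returns the number of matching nodes.
    int visit(String url, SketchNode node) {
        int matches = 0;
        if (node.name.equals(nodeName) && attrValue.equals(node.attributes.get(attrName))) {
            matches++;
            for (SketchProcessor p : processors) {
                p.process(url, node);
            }
        }
        for (SketchNode child : node.children) {
            matches += visit(url, child);
        }
        return matches;
    }
}

final class WatcherSketchDemo {
    public static void main(String[] args) {
        SketchNode root = new SketchNode("ul", Map.of())
                .add(new SketchNode("li", Map.of("class", "Blog")))
                .add(new SketchNode("li", Map.of("class", "Other")))
                .add(new SketchNode("li", Map.of("class", "Blog")));
        SketchWatcher watcher = new SketchWatcher("li", "class", "Blog");
        watcher.addProcessor((url, node) -> System.out.println("matched <" + node.name + ">"));
        System.out.println(watcher.visit("http://example.invalid/blog", root)); // prints 2
    }
}
```

The demo builds a three-item list by hand; only the two items whose class is "Blog" reach the processor.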
Step 3, write the category processor:
    public class OsChinaCategoryProcessor implements Processor {
        public void process(String url, HtmlNode node) {
            // The post link is the <a> inside the entry's <h2>
            HtmlNode a = node.getSubNodeRecursively("h2").getSubNode("a");
            String href = a.getAttribute("href");
            // The topic id is whatever follows the last '/' in the link
            String topicId = href.substring(href.lastIndexOf('/') + 1);
            System.out.printf("<a href=\"%s.page\">%s</a>\n", topicId, a.getPureText());
            try {
                Thread.sleep(200); // pause briefly in case OSChina starts refusing requests
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            OSchinaSpider.processTopic(topicId);
        }
    }
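The id extraction in the processor is just "everything after the last slash". Here is a minimal standalone sketch of that step and of the table-of-contents line it prints. The class name TopicLinkSketch and the sample link are made up for illustration; the exact form of the href on the real page is an assumption.

```java
final class TopicLinkSketch {
    // Returns everything after the last '/'. If the href contains no '/',
    // lastIndexOf gives -1 and the whole string is returned.
    static String topicId(String href) {
        return href.substring(href.lastIndexOf('/') + 1);
    }

    // Builds one table-of-contents line in the same format as the processor's printf.
    static String tocLine(String href, String title) {
        return String.format("<a href=\"%s.page\">%s</a>", topicId(href), title);
    }

    public static void main(String[] args) {
        // Sample link is hypothetical.
        System.out.println(tocLine("http://my.oschina.net/tinyframework/blog/194610", "TinySpider"));
        // prints <a href="194610.page">TinySpider</a>
    }
}
```

Note that a link with a trailing slash would yield an empty id, and a query string would end up inside the id, so this approach relies on the links having the plain .../blog/<id> shape.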
Step 4, write the article processor:
    public class OsChinaTopicProcessor implements Processor {
        String outputPath = "E:\\oschina\\";

        public void process(String url, HtmlNode node) {
            // Name the file after the last URL segment, plus a ".page" extension
            String fileName = outputPath + url.substring(url.lastIndexOf('/') + 1) + ".page";
            try {
                IOUtils.writeToOutputStream(new FileOutputStream(fileName), node.toString(), "UTF-8");
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
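The processor above opens a FileOutputStream inline, and the post does not show whether IOUtils.writeToOutputStream closes it. As a hedged alternative, the same "last URL segment plus .page" write can be sketched with the JDK's java.nio.file.Files, which opens and closes the file itself. PageWriterSketch and writePage are hypothetical names, and the demo writes to a temporary directory rather than E:\oschina.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class PageWriterSketch {
    // Writes html as UTF-8 to <outputDir>/<last URL segment>.page and returns the path.
    static Path writePage(Path outputDir, String url, String html) throws IOException {
        String fileName = url.substring(url.lastIndexOf('/') + 1) + ".page";
        Files.createDirectories(outputDir); // no-op if the directory already exists
        Path target = outputDir.resolve(fileName);
        // Files.write creates or truncates the file and closes it when done.
        return Files.write(target, html.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("oschina");
        Path written = writePage(dir,
                "http://my.oschina.net/tinyframework/blog/194610",
                "<div class=\"BlogContent\">...</div>");
        System.out.println(written.getFileName()); // prints 194610.page
    }
}
```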
Step 5, write the main method:
    public static void main(String[] args) {
        processCategory("377413");
    }
Step 6, run it. The console prints the table of contents:
    <a href="214018.page">300粉丝集结号吹响了,可以开源重量级的流程引擎或UI引擎 </a>
    <a href="268983.page">Tiny实例:TINY框架官网制作过程详解 </a>
    <a href="267764.page">从应用示例来认识Tiny框架 </a>
    <a href="266707.page">TinyRMI---RMI的封装、扩展及踩到的坑的解决 </a>
    <a href="233111.page">悠然乱弹:五一部署了sonar有hudson,发布了1.1.0正式版到Maven中央仓库 </a>
    <a href="228712.page">悠然乱弹:我的架构观 </a>
    <a href="226850.page">TinyDBF-用200行的DBF解析器来展示良好架构设计 </a>
    <a href="225959.page">新增TinyMessage,并实现邮件接收处理 </a>
    <a href="223310.page">如何让程序员更容易的开发Web界面?重构SmartAdmin展示TinyUI框架 </a>
    <a href="221930.page">Velocity宏定义的坑与解决办法 </a>
    <a href="220619.page">不一样的味道--Html及Xml解析、格式化、遍历 </a>
    <a href="214309.page">TinyINI开源了~~ </a>
    <a href="213622.page">Tiny分布式计算框架开源了 </a>
    <a href="212682.page">悠然乱弹:切身体会来说明人性化设计的重要性 </a>
    <a href="212639.page">Tiny Formater </a>
    <a href="206718.page">TINY框架FAQ汇集 </a>
    <a href="205733.page">Tiny框架启动过程日志 </a>
    <a href="205279.page">Tiny之Web工程构建 </a>
    <a href="204994.page">开源框架Tiny之内容组成 </a>
    <a href="203075.page">Tiny后续版本需求整理 </a>
    <a href="202825.page">TinyUI组件开发示例 </a>
    <a href="201307.page">TinyDbRouter开源喽~~~ </a>
    <a href="201071.page">Tiny中文分词 </a>
    <a href="200604.page">在Linux下搭建Tiny开发环境 </a>
    <a href="200408.page">一个Maven工程中,不同的模块需要不同的JDK进行编译的解决方案 </a>
    <a href="199515.page">业务流程引擎 </a>
    <a href="196486.page">Tiny并行计算框架之复杂示例 </a>
    <a href="196373.page">Tiny并行计算框架之实现机理 </a>
    <a href="196070.page">Tiny并行计算框架之使用介绍 </a>
    <a href="194610.page">TinySpider开源喽~~~ </a>
    <a href="194578.page">TinyHtmlParser开源喽~~~ </a>
    <a href="194574.page">TinyXmlParser开源喽~~~ </a>
    <a href="194551.page">TinyDBRouter </a>
    <a href="194413.page">开源前要做好哪些准备工作? </a>
    <a href="192778.page">分布式锁的简单实现 </a>
    <a href="189259.page">TinyIOC </a>
    <a href="188780.page">TinyDBCluster Vs routing4db </a>
    <a href="186637.page">文档生成框架 </a>
    <a href="186583.page">数据库分区分片框架 </a>
    <a href="185134.page">分区分表支持 </a>
    <a href="178153.page">Resetting the root password for MySQL </a>
    <a href="177224.page">Tiny框架之内容组成 </a>
    <a href="176153.page">JSP放入Jar包支持 </a>
    <a href="172180.page">流程式编程 </a>
    <a href="170799.page">强悍的上下文Context </a>
    <a href="170763.page">类Spring IoC容器 </a>
    <a href="170741.page">虚拟文件系统VFS </a>
    <a href="170401.page">BigPipe为什么可以节省时间? </a>
    <a href="170326.page">XmlParser和HtmlParser </a>
    <a href="170154.page">线程组 </a>
    <a href="170117.page">流程自动化布局 </a>
    <a href="169813.page">涉密数据的处理 </a>
    <a href="169553.page">Word文档生成 </a>
    <a href="169509.page">如何快速开发网站? </a>
    <a href="169399.page">如何让Web.xml变得简洁? </a>
    <a href="169339.page">Hello,World 百态 </a>
    <a href="169278.page">关于中文处理方面的研究 </a>
    <a href="169260.page">构建网络爬虫?so easy </a>
    <a href="169206.page">UI开发的终极解决方案 </a>
    <a href="168896.page">基于业务单元的开发与部署模式 </a>
    <a href="168477.page">一种基于主客体模型的权限管理框架 </a>
    <a href="167430.page">MDA数据校验规则定义 </a>
    <a href="166930.page">Tiny之7*24集群服务方案 </a>
    <a href="166893.page">Tiny设计原则 </a>
    <a href="166846.page">构建Tiny生态圈 </a>
    <a href="166845.page">Tiny框架设计理念 </a>
    <a href="166843.page">基于实体模型开发主题管理简析 </a>
    <a href="166842.page">MDA模型定义及扩展 </a>
    <a href="165566.page">JS、CSS合并带来的效率提升 </a>
    <a href="165402.page">主题切换及其管理 </a>
And the output directory now holds the article files:
    Directory of E:\oschina

    [.]          [..]         165402.page  165566.page  166842.page
    166843.page  166845.page  166846.page  166893.page  166930.page
    167430.page  168477.page  168896.page  169206.page  169260.page
    169278.page  169339.page  169399.page  169509.page  169553.page
    169813.page  170117.page  170154.page  170326.page  170401.page
    170741.page  170763.page  170799.page  172180.page  176153.page
    177224.page  178153.page  185134.page  186583.page  186637.page
    188780.page  189259.page  192778.page  194413.page  194551.page
    194574.page  194578.page  194610.page  196070.page  196373.page
    196486.page  199515.page  200408.page  200604.page  201071.page
    201307.page  202825.page  204994.page  205279.page  205733.page
    206718.page  212639.page  212682.page  213622.page  214018.page
    214309.page  220619.page  221930.page  223310.page  225959.page
    226850.page  228712.page  233111.page  266707.page  267764.page
    268983.page
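The console output lists 70 links, while the directory listing shows 69 .page files; 203075.page does not appear among them. A quick cross-check of links against files catches this kind of gap before the pages go onto the site. The sketch below is one way to do it; TocCheckSketch and missingPages are hypothetical names, and the regex assumes the links have exactly the href="<id>.page" shape printed by the category processor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TocCheckSketch {
    // Captures the file name inside href="....page"
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+\\.page)\"");

    // Returns, in order of appearance, every linked file name that is not in fileNames.
    static List<String> missingPages(String toc, Set<String> fileNames) {
        List<String> missing = new ArrayList<>();
        Matcher m = HREF.matcher(toc);
        while (m.find()) {
            String name = m.group(1);
            if (!fileNames.contains(name)) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Two sample links; only one has a matching file.
        String toc = "<a href=\"203075.page\">A</a>\n<a href=\"202825.page\">B</a>\n";
        System.out.println(missingPages(toc, Set.of("202825.page"))); // prints [203075.page]
    }
}
```

In practice the set of file names could come from listing the output directory, and the table of contents from the captured console output.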
Step 7, put the files into tinysite.
Sweet, that's a wrap.