Ran into a problem today: the URL regex filtering kept misbehaving, so out of frustration I opened up the source code again.
Crawl.java contains this section:
for (i = 0; i < depth; i++) {             // generate new segment
  Path[] segs = generator.generate(crawlDb, segments, -1, topN, System
      .currentTimeMillis());
  if (segs == null) {
    LOG.info("Stopping at depth=" + i + " - no more URLs to fetch.");
    break;
  }
  fetcher.fetch(segs[0], threads);  // fetch it
  if (!Fetcher.isParsing(job)) {
    parseSegment.parse(segs[0]);    // parse it, if needed
  }
  crawlDbTool.update(crawlDb, segs, true, true); // update crawldb
}
As you can see, the list of URLs to fetch in the next round is determined by this call:
Path[] segs = generator.generate(crawlDb, segments, -1, topN, System
.currentTimeMillis());
Tracing into the org.apache.nutch.crawl.Generator class:
public Path[] generate(Path dbDir, Path segments, int numLists, long topN, long curTime)
    throws IOException {
  JobConf job = new NutchJob(getConf());
  boolean filter = job.getBoolean(GENERATOR_FILTER, true);
  boolean normalise = job.getBoolean(GENERATOR_NORMALISE, true);
  return generate(dbDir, segments, numLists, topN, curTime, filter, normalise, false, 1);
}
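A side note: whether filtering and normalisation happen at all during generation is decided by these two booleans, read from the job configuration. If I read the constants right, GENERATOR_FILTER and GENERATOR_NORMALISE correspond to the properties below (the property names are my assumption from the constant values, so verify them against conf/nutch-default.xml); they can be overridden in nutch-site.xml:

<!-- assumed property names behind GENERATOR_FILTER / GENERATOR_NORMALISE -->
<property>
  <name>generate.filter</name>
  <value>true</value>
</property>
<property>
  <name>generate.normalise</name>
  <value>true</value>
</property>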
Stepping further into
generate(dbDir, segments, numLists, topN, curTime, filter, normalise, false, 1);
we find this section:
job.setMapperClass(Selector.class);
job.setPartitionerClass(Selector.class);
job.setReducerClass(Selector.class);
So it uses a Hadoop-style map/reduce job, and the class doing the work is Selector.class.
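For reference, the Selector declaration (roughly, from the same Generator.java) explains why one class can fill all three roles: it implements Mapper, Partitioner and Reducer at the same time:

public static class Selector implements
    Mapper<Text, CrawlDatum, FloatWritable, SelectorEntry>,
    Partitioner<FloatWritable, Writable>,
    Reducer<FloatWritable, SelectorEntry, FloatWritable, SelectorEntry> {
  ...
}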
Tracing into Selector (it is an inner class of Generator, not a separate Selector.java), its map function looks like this:
/** Select & invert subset due for fetch. */
public void map(Text key, CrawlDatum value,
    OutputCollector<FloatWritable, SelectorEntry> output, Reporter reporter)
    throws IOException {
  Text url = key;
  if (filter) {
    // If filtering is on don't generate URLs that don't pass
    // URLFilters
    try {
      if (filters.filter(url.toString()) == null) return;
    } catch (URLFilterException e) {
      if (LOG.isWarnEnabled()) {
        LOG.warn("Couldn't filter url: " + url + " (" + e.getMessage() + ")");
      }
    }
  }
  CrawlDatum crawlDatum = value;

  // check fetch schedule
  if (!schedule.shouldFetch(url, crawlDatum, curTime)) {
    LOG.debug("-shouldFetch rejected '" + url + "', fetchTime="
        + crawlDatum.getFetchTime() + ", curTime=" + curTime);
    return;
  }

  LongWritable oldGenTime = (LongWritable) crawlDatum.getMetaData().get(
      Nutch.WRITABLE_GENERATE_TIME_KEY);
  if (oldGenTime != null) { // awaiting fetch & update
    if (oldGenTime.get() + genDelay > curTime) // still wait for update
      return;
  }

  float sort = 1.0f;
  try {
    sort = scfilters.generatorSortValue(key, crawlDatum, sort);
  } catch (ScoringFilterException sfe) {
    if (LOG.isWarnEnabled()) {
      LOG.warn("Couldn't filter generatorSortValue for " + key + ": " + sfe);
    }
  }

  if (restrictStatus != null
      && !restrictStatus.equalsIgnoreCase(CrawlDatum.getStatusName(crawlDatum.getStatus()))) return;

  // consider only entries with a score superior to the threshold
  if (scoreThreshold != Float.NaN && sort < scoreThreshold) return;

  // consider only entries with a retry (or fetch) interval lower than threshold
  if (intervalThreshold != -1 && crawlDatum.getFetchInterval() > intervalThreshold) return;

  // sort by decreasing score, using DecreasingFloatComparator
  sortValue.set(sort);
  // record generation time
  crawlDatum.getMetaData().put(Nutch.WRITABLE_GENERATE_TIME_KEY, genTime);
  entry.datum = crawlDatum;
  entry.url = key;
  output.collect(sortValue, entry); // invert for sort by score
}
The key line is:
filters.filter(url.toString())
Let's look at that filter function (it lives in org.apache.nutch.net.URLFilters):
public String filter(String urlString) throws URLFilterException {
  for (int i = 0; i < this.filters.length; i++) {
    if (urlString == null)
      return null;
    urlString = this.filters[i].filter(urlString);
  }
  return urlString;
}
In other words: every URL is passed through each of the configured URL filters in turn, and as soon as any one of them rejects it (returns null), the URL fails the whole chain.
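To make the short-circuit behaviour concrete, here is a minimal standalone sketch of the same chain-of-filters pattern. The SimpleUrlFilter interface, the two stand-in filters and the example URL are all made up for illustration; this is not Nutch's actual plugin machinery:

import java.util.Arrays;
import java.util.List;

// Minimal illustration of the chaining in URLFilters.filter():
// each filter returns null to reject, or a (possibly rewritten) URL to accept.
public class FilterChainDemo {
  interface SimpleUrlFilter {   // hypothetical stand-in for a URL filter plugin
    String filter(String url);
  }

  public static void main(String[] args) {
    SimpleUrlFilter httpOnly  = url -> url.startsWith("http://") ? url : null;  // stand-in for urlfilter-domain
    SimpleUrlFilter youkuOnly = url -> url.contains("youku.com") ? url : null;  // stand-in for urlfilter-regex
    List<SimpleUrlFilter> chain = Arrays.asList(httpOnly, youkuOnly);

    String url = "http://www.youku.com/show_page/id_z123.html";  // example URL
    for (SimpleUrlFilter f : chain) {
      if (url == null) break;   // an earlier filter already rejected it
      url = f.filter(url);
    }
    System.out.println(url == null ? "rejected" : "accepted: " + url);
  }
}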
And these URL filters are, needless to say, plugins again.
Time to open a second front and look at the filter code itself. My conf/nutch-site.xml defines the plugin set like this:
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(domain|regex)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|htmlparsefilter-youku</value>
</property>
There are two URL filter plugins in there: urlfilter-domain and urlfilter-regex.
Let's focus on urlfilter-regex!
The code for this plugin is RegexURLFilter.java, under the
$...nutch-1.7/src/plugin/urlfilter-regex/src/java/org/apache/nutch/urlfilter/regex
directory.
So our task is to study this class's filter function. The class itself does not implement filter, so we look at its parent class:
import org.apache.nutch.urlfilter.api.RegexURLFilterBase;
Where does that class live?
src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexURLFilterBase.java
The most important function is filter:
public String filter(String url) {
  for (RegexRule rule : rules) {
    if (rule.match(url)) {
      return rule.accept() ? url : null;
    }
  }
  return null;
}
Conclusion:
Given a URL, it is matched against the regular expressions in conf/regex-urlfilter.txt one by one, in order. The first rule that matches decides the outcome: if that rule starts with + the URL passes, if it starts with - it fails; and if no rule matches at all, the URL is rejected too (the final return null).
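For example, with a rule file along these lines (the first rule is the stock one from regex-urlfilter.txt, the youku patterns are purely illustrative):

# skip file:, ftp: and mailto: urls
-^(file|ftp|mailto):
# accept youku.com pages (illustrative)
+^http://([a-z0-9]*\.)*youku\.com/
# reject everything else
-.

a URL like http://www.youku.com/show_page/id_z123.html skips the first pattern, matches the second and is accepted on the spot, while any other http URL falls through to the final -. rule and is rejected.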
That wraps up the analysis of the regex filtering rules!
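One last tip for testing the rules without launching a full crawl: Nutch ships a small checker class, org.apache.nutch.net.URLFilterChecker, that reads URLs from stdin and prints whether the configured filters accept them. Something like the following should work, although I am quoting the -allCombined flag from memory, so check the class's usage message first:

echo "http://www.youku.com/" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined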