boilerpipe(Boilerplate Removal and Fulltext Extraction from HTML pages) 源码分析

简介:

开源Java模块boilerpipe(1.1.0), http://code.google.com/p/boilerpipe/

使用例子,
URL url = new URL("http://www.example.com/some-location/index.html ");
// NOTE: Use ArticleExtractor unless DefaultExtractor gives better results for you
String text = ArticleExtractor .INSTANCE.getText(url);
那就从ActicleExtractor开始分析, 这个类用了singleton的design pattern, 使用INSTANCE取得唯一的实例, 实际处理如下步骤

HTML Parser 
The HTML Parser is based upon CyberNeko 1.9.13. It is called internally from within the Extractors.
The parser takes an HTML document and transforms it into a TextDocument , consisting of one or more TextBlocks . It knows about specific HTML elements (SCRIPT, OPTION etc.) that are ignored automatically.
Each TextBlock stores a portion of text from the HTML document. Initially (after parsing) almost every TextBlock represents a text section from the HTML document, except for a few inline elements that do not separate per defintion (for example '<A>'anchor tags).
The TextBlock objects also store shallow text statistics for the block's content such as the number of words and the number of words in anchor text.

Extractors 
Extractors consist of one or more pipelined Filters . They are used to get the content of a webpage. Several different Extractors exist, ranging from a generic DefaultExtractor to extractors specific for news article extraction (ArticleExtractor).
ArticleExtractor.process() 就包含了这个pipeline filter, 这个design做的非常具有可扩展性, 把整个处理过程分成若干小的步骤分别实现, 在用的时候象搭积木一样搭成一个处理流. 当想扩展或改变处理过程时, 非常简单, 只需加上或替换其中的一块就可以了.
这样也非常方便于多语言扩展, 比如这儿用的english包里的相应的处理函数,
import de.l3s.boilerpipe.filters.english.IgnoreBlocksAfterContentFilter;
import de.l3s.boilerpipe.filters.english.KeepLargestFulltextBlockFilter;
如果要扩展到其他语言, 如韩文, 只需在filters包里面加上个korean包, 分别实现这些filter处理函数, 然后只需要修改import, 就可以实现对韩语的support.

 TerminatingBlocksFinder.INSTANCE.process(doc)
                | new DocumentTitleMatchClassifier(doc.getTitle()).process(doc)
                | NumWordsRulesClassifier.INSTANCE.process(doc)
                | IgnoreBlocksAfterContentFilter.DEFAULT_INSTANCE.process(doc)
                | BlockProximityFusion.MAX_DISTANCE_1.process(doc)
                | BoilerplateBlockFilter.INSTANCE.process(doc)
                | BlockProximityFusion.MAX_DISTANCE_1_CONTENT_ONLY.process(doc)
                | KeepLargestFulltextBlockFilter.INSTANCE.process(doc)
                | ExpandTitleToContentFilter.INSTANCE.process(doc);
下面具体看一下处理流的每个环节.

TerminatingBlocksFinder
Finds blocks which are potentially indicating the end of an article text and marks them with {@link DefaultLabels#INDICATES_END_OF_TEXT}. This can be used in conjunction with a downstream {@link IgnoreBlocksAfterContentFilter}.(意思是IgnoreBlocksAfterContentFilter必须作为它的 downstream)

原理很简单, 就是判断这个block, 在tb.getNumWords() < 20的情况下是否满足下面的条件,
text.startsWith("Comments")
                        || N_COMMENTS.matcher(text).find() //N_COMMENTS = Pattern.compile("(?msi)^[0-9]+ (Comments|users responded in)")
                        || text.contains("What you think...")
                        || text.contains("add your comment")
                        || text.contains("Add your comment")
                        || text.contains("Add Your Comment")
                        || text.contains("Add Comment")
                        || text.contains("Reader views")
                        || text.contains("Have your say")
                        || text.contains("Have Your Say")
                        || text.contains("Reader Comments")
                        || text.equals("Thanks for your comments - this feedback is now closed")
                        || text.startsWith("© Reuters")
                        || text.startsWith("Please rate this")
如果满足就认为这个block为artical的结尾, 并加上标记tb.addLabel(DefaultLabels.INDICATES_END_OF_TEXT);

DocumentTitleMatchClassifier
这个很简单, 就是根据'<title>'的内容去页面中去标注title的位置, 做法就是根据'<title>'的内容产生一个potentialTitles列表, 然后去匹配block, 匹配上就标注成DefaultLabels.TITLE

NumWordsRulesClassifier
Classifies {@link TextBlock}s as content/not-content through rules that have been determined using the C4.8 machine learning algorithm, as described in the paper "Boilerplate Detection using Shallow Text Features" (WSDM 2010), particularly using number of words per block and link density per block.
这个模块实现了个分类器, 用于区分content/not-content , 分类器的构建参见上面这篇文章的4.3节.
分类器使用Decision Trees算法, 用标注过的google news作为训练集, 接着对训练完的Decision Trees经行剪枝, Applying reduced-error pruning we were able to simplify the decision tree to only use 6 dimensions (2 features each for current, previous and next block) without a significant loss in accuracy.
最后用伪码描述出Decision Trees的decision过程, 这就是使用Decision Trees的最大好处, 它的decision rules是可以理解的, 所以可以用各种语言描述出来.
这个模块实现的是Algorithm 2 Classifier based on Number of Words

curr_linkDensity <= 0.333333
| prev_linkDensity <= 0.555556
| | curr_numWords <= 16
| | | next_numWords <= 15
| | | | prev_numWords <= 4: BOILERPLATE
| | | | prev_numWords > 4: CONTENT
| | | next_numWords > 15: CONTENT
| | curr_numWords > 16: CONTENT
| prev_linkDensity > 0.555556
| | curr_numWords <= 40
| | | next_numWords <= 17: BOILERPLATE
| | | next_numWords > 17: CONTENT
| | curr_numWords > 40: CONTENT
curr_linkDensity > 0.333333: BOILERPLATE

有了Classifies, 接下来的事情就是对于所有block进行分类并标注.

IgnoreBlocksAfterContentFilter
Marks all blocks as "non-content" that occur after blocks that have been marked {@link DefaultLabels#INDICATES_END_OF_TEXT}. These marks are ignored unless a minimum number of words in content blocks occur before this mark (default: 60). This can be used in conjunction with an upstream {@link TerminatingBlocksFinder}.

这个模块是TerminatingBlocksFinder模块的downstream, 就是说必须在它后面做, 简单的很, 找到DefaultLabels#INDICATES_END_OF_TEXT, 后面的内容全标为BOILERPLATE.
除了前面正文length不到minimum number of words(default: 60), 还需要继续抓点文字凑数.

BlockProximityFusion
Fuses adjacent blocks if their distance (in blocks) does not exceed a certain limit. This probably makes sense only in cases where an upstream filter already has removed some blocks.
这个模块用来合并block的, 合并的依据主要是根据两个block的offset的差值不大于2, 也就是说中间最多只能隔一个block.
当要求contentOnly时, 会check两个block都标注为content时才会fusion.
int diffBlocks = block.getOffsetBlocksStart() - prevBlock.getOffsetBlocksEnd() - 1;
if (diffBlocks <= maxBlocksDistance)

那么block的offset怎么来的了, 查一下block构造的时候的代码
BoilerpipeHTMLContentHandler .flushBlock()
TextBlock tb = new TextBlock(textBuffer.toString().trim(), currentContainedTextElements, numWords, numLinkedWords, numWordsInWrappedLines, numWrappedLines, offsetBlocks);
offsetBlocks++;

TextBlock构造函数
this.offsetBlocksStart = offsetBlocks;
this.offsetBlocksEnd = offsetBlocks;
可以看出初始情况下, block的offset就是递增的, 并且再没有做过fusion的情况下, offsetBlocksStart和offsetBlocksEnd是相等的.
所以象注释讲的那样, 只有当upstream filter remove了部分blocks以后, 这个模块的合并依据才是有意义的, 不然在没有任何删除的情况下, 所有block都满足fusion条件.

看完这段代码, 我很奇怪, Paper中fusion是根据text density的, 而这儿只是根据block的offset, 有所减弱.
There, adjacent text fragments of similar text density (interpreted as /similar class") are iteratively fused until the blocks' densities (and therefore the text classes) are distinctive
enough.
而且我更加不理解的是, 在ArticleExtractor关于这个模块的用法如下,
                  BlockProximityFusion.MAX_DISTANCE_1.process(doc)
                | BoilerplateBlockFilter.INSTANCE.process(doc)
                | BlockProximityFusion.MAX_DISTANCE_1_CONTENT_ONLY.process(doc)
调用了BlockProximityFusion两次, 分别在BoilerplateBlockFilter(含义在下节)的down,upstream, 对于BlockProximityFusion.MAX_DISTANCE_1_CONTENT_ONLY.process(doc)的调用我还是能理解 的, 再删除完非content的block后, 对剩下的block做一下fusion, 比如原来两个block中间隔了个广告. 不过这儿根据offset, 而不根据text density, 个人觉得功能有所减弱.
可是对于BlockProximityFusion.MAX_DISTANCE_1.process(doc)的调用, 可能是我没看懂, 实在无法理解, 为什么要加这步, 唯一的解释是想将一些没有标注为content的block fusion到content里面去. 奇怪的是这儿fusion是无条件的(在没有删除block的情况下,判断offset无效), 只需要当前的block是content是就和Prev进行fusion. 而且为什么只判断当前block, Prevblock是content是否也应该fusion.个人觉得这边逻辑完全不合理......

BoilerplateBlockFilter
Removes {@link TextBlock}s which have explicitly been marked as "not content"
没啥好说的, 就是遍历每个block, 把没有标注为"content"的都删掉.

KeepLargestFulltextBlockFilter
Keeps the largest {@link TextBlock} only (by the number of words). In case of more than one block with the same number of words, the first block is chosen. All discarded blocks are marked "not content" and flagged as {@link DefaultLabels#MIGHT_BE_CONTENT}
很好理解, 找出最大的文本block作为正文, 其他的标注为DefaultLabels#MIGHT_BE_CONTENT

ExpandTitleToContentFilter
Marks all {@link TextBlock}s "content" which are between the headline and the part that has already been marked content, if they are marked {@link DefaultLabels#MIGHT_BE_CONTENT}. This filter is quite specific to the news domain.
逻辑是找出标注为DefaultLabels.TITLE的block, 和content开始的那个block, 把这两个block之间的标注为MIGHT_BE_CONTENT的都改标注为Content.

TextDocument.getContent() 
最后需要做的一步, 是把抽取的内容输出成文本. 遍历每一个标注为content的block, 把内容append并输出.

DefaultExtractor 
下面再看看除了ArticleExtractor (针对news)以外, 很常用的DefaultExtractor
SimpleBlockFusionProcessor.INSTANCE.process(doc)
                | BlockProximityFusion.MAX_DISTANCE_1.process(doc)
                | DensityRulesClassifier.INSTANCE.process(doc);
相对比较简单, 就三步, 第二步很奇怪, 前面没有任何upstream会标注content, 那么这步就什么都不会做

SimpleBlockFusionProcessor
Merges two subsequent blocks if their text densities are equal.
遍历每一个block, 两个block的text densities相同就merge

DensityRulesClassifier
Classifies {@link TextBlock}s as content/not-content through rules that have been determined using the C4.8 machine learning algorithm, as described in the paper "Boilerplate Detection using Shallow Text Features", particularly using text densities and link densities.
参照NumWordsRulesClassifier , 这儿实现了Paper里面的Algorithm 1 Densitometric Classifier
curr_linkDensity <= 0.333333
| prev_linkDensity <= 0.555556
| | curr_textDensity <= 9
| | | next_textDensity <= 10
| | | | prev_textDensity <= 4: BOILERPLATE
| | | | prev_textDensity > 4: CONTENT
| | | next_textDensity > 10: CONTENT
| | curr_textDensity > 9
| | | next_textDensity = 0: BOILERPLATE
| | | next_textDensity > 0: CONTENT
| prev_linkDensity > 0.555556
| | next_textDensity <= 11: BOILERPLATE
| | next_textDensity > 11: CONTENT
curr_linkDensity > 0.333333: BOILERPLATE

如果有兴趣, 你可以学习其他extractor, 或自己design合适自己的extractor.


本文章摘自博客园,原文发布日期:2011-07-05

目录
相关文章
|
16天前
|
定位技术
时尚的联系我们表单HTML模板(源码)
一款时尚的联系我们表单Html模板,带地图和所在位置,输入基本信息和信息发送,看起来很漂亮的联系我们页面。
29 1
时尚的联系我们表单HTML模板(源码)
|
4天前
|
前端开发 JavaScript
用HTML CSS JS打造企业级官网 —— 源码直接可用
必看!用HTML+CSS+JS打造企业级官网-源码直接可用,文章代码仅用于学习,禁止用于商业
32 1
|
10天前
|
移动开发 HTML5
HTML5熊猫弹跳手机小游戏源码
一款html5手机端小游戏源码,熊猫跳跃小游戏源码下载。熊猫脚底有弹簧,长按变化力度跳跃,计分游戏,html5手机熊猫也疯狂小游戏源代码。
30 5
|
12天前
|
JavaScript
JS鼠标框选并删除HTML源码
这是一个js鼠标框选效果,可实现鼠标右击出现框选效果的功能。右击鼠标可拖拽框选元素,向下拖拽可实现删除效果,简单实用,欢迎下载
25 4
|
13天前
|
移动开发 HTML5
HTML5实现手机端红包下落抢红包特效源码
HTML5实现手机端红包下落抢红包特效源码是一款手机移动端的抢红包小游戏源码下载。红包像下雪一样,点击抓我呀,可以抢红包,需要此款代码的朋友们可以前来下载使用。本段代码兼容目前最新的各类主流浏览器,是一款非常优秀的特效源码。
30 4
|
11天前
|
移动开发 HTML5
html5+three.js公路开车小游戏源码
html5公路开车小游戏是一款html5基于three.js制作的汽车开车小游戏源代码,在公路上开车网页小游戏源代码。
34 0
html5+three.js公路开车小游戏源码
|
19天前
|
前端开发 JavaScript
Canvas三维变化背景动画HTML源码
Canvas三维变化背景动画HTML源码
23 5
|
19天前
轻量级消息弹框HTML源码
轻量级消息弹框插件特效源码是一段不错的消息弹框代码,支持设置通知类型、动画、字体、位置等等属性,可定制程度较高,进行适当扩展等,欢迎对此段代码感兴趣的朋友参考使用。
27 0
轻量级消息弹框HTML源码
|
2月前
|
移动开发 HTML5
响应式精品资源导航源码html5
一款响应式精品网站推荐导航源码,可以自己修改代码替换图标图片和指向网址。背景图支持自动替换,背景图可以在images中修改,本地双击html即可查看效果
40 2
响应式精品资源导航源码html5
|
2月前
|
搜索推荐
2024微信个人名片在线生成HTML源码
微信个人名片卡片在线生成,这是一款微信个人名片生成网站源码,无第三方接口,本地直接生成可长期使用。 主要用于生成用户个性化的名片页面,包括头像、姓名、联系方式、个人介绍等信息。 在本地浏览器打开即可,源码是html的,也可上传到服务器上。
49 0
2024微信个人名片在线生成HTML源码