arex
node article extractor
Last updated 2 years ago by successage.
License: BSD-2-Clause
$ cnpm install arex 

arex

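A Node.js library that automatically extracts an article's body text, title, and publication date, and automatically generates an article summary.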

#Install

npm install arex

#Usage examples:

var arex = require('arex');
//example 1: fetch the given URL, extract the article body, and generate a summary
arex.get_article('http://finance.sina.com.cn/consume/puguangtai/2016-03-15/doc-ifxqhmve9227502.shtml',120,(err,result)=>{
		//120: the summary length is 120; pass false if no summary is needed.
		//result: {"title":"...","content":"....", "summary":"...", "pubdate":"..."}
		if (err) return console.error(err);
		console.log(result['content']);
});

//example 2: given HTML content, extract the article body and generate a summary
var result = arex.get_article_sync('<html.........</html>',120);//result: {"title":"...","content":"....", "summary":"...", "pubdate":"..."}

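//example 3: given HTML content, generate a summary
//summarize(content, exptd_len=120, shingle=false, min=150, max=350, filter=[], title)
//shingle: when true, weights are computed over groups of sentences whose combined length matches the summary length; when false, weights are computed per natural sentence.
//filter: a list of filter rules; paragraphs matching a rule are excluded from the summary.
var summary = arex.summarize('<html>.......</html>', 120, true);
var summary = arex.summarize('<html>.......</html>', 0.04, true, 100, 300);//summary length ratio 4%, minimum 100, maximum 300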
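
The shingle option described in the comments above can be pictured with a minimal sketch. This is not arex's implementation: `buildShingles` is a hypothetical helper, and the assumption is that a shingle is a run of consecutive sentences extended until its combined length reaches the expected summary length.

```javascript
// Hypothetical helper, not part of arex: groups consecutive sentences into
// candidate "shingles" so that each candidate is about one summary long.
function buildShingles(sentences, exptdLen) {
  var shingles = [];
  for (var i = 0; i < sentences.length; i++) {
    var text = '';
    // Extend the window until it reaches the expected summary length
    // or the sentences run out.
    for (var j = i; j < sentences.length && text.length < exptdLen; j++) {
      text += sentences[j];
    }
    shingles.push({ start: i, text: text });
  }
  return shingles;
}

// Each shingle could then be weighted as a unit instead of single sentences.
console.log(buildShingles(['aaaa.', 'bbbb.', 'cccc.'], 8));
```

Sentences are joined without a separator here, which suits Chinese text; for space-delimited languages a joining space would be needed.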

#Test

##Get the source

git clone https://github.com/ahkimkoo/arex.git

##Test extraction on a web page

cd arex
npm install
node test/test.js http://finance.sina.com.cn/consume/puguangtai/2016-03-15/doc-ifxqhmve9227502.shtml 120

120 is the expected summary length.

##About the algorithm

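  • Body extraction: extracts the article body based on the density distribution of line blocks; each line block consists of several natural paragraphs.
  • Title extraction: takes candidates from h1 tags near the body and from the title tag, then picks the one most likely to be the title.
  • Publication date extraction: uses regular expressions to extract a date near the body (may be inaccurate).
  • Automatic summarization: a sentence rank algorithm modeled on the PageRank algorithm; an expected summary length can be specified. Optimization: a neural network model judges whether a sentence is suitable for the summary.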
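
The line-block density idea in the first item can be sketched as follows. This is a simplified illustration and not arex's actual code: the function name, the block size of 3 lines, and the threshold of 80 characters are all assumptions, and HTML entities are left undecoded.

```javascript
// Minimal sketch of line-block density extraction (illustrative only).
function extractByBlockDensity(html, blockSize, threshold) {
  blockSize = blockSize || 3;   // assumed: number of lines per block
  threshold = threshold || 80;  // assumed: minimum characters per dense block
  var lines = html
    .replace(/<(script|style)[\s\S]*?<\/\1>/gi, '') // drop scripts and styles
    .replace(/<[^>]+>/g, '')                         // drop remaining tags
    .split('\n')
    .map(function (line) { return line.trim(); });

  // density[i] = characters of text in the block starting at line i
  var density = lines.map(function (_, i) {
    return lines.slice(i, i + blockSize).join('').length;
  });

  // Keep the run of consecutive dense block starts that holds the most text.
  var best = { start: 0, end: 0, size: 0 };
  var start = -1;
  var size = 0;
  for (var i = 0; i <= density.length; i++) {
    if (i < density.length && density[i] >= threshold) {
      if (start < 0) { start = i; size = 0; }
      size += lines[i].length;
    } else if (start >= 0) {
      if (size > best.size) best = { start: start, end: i, size: size };
      start = -1;
    }
  }
  return lines.slice(best.start, best.end).filter(Boolean).join('\n');
}
```

Short lines that sit within one block of the body (for example a navigation link directly above the first paragraph) can leak into the result, so a real extractor would need extra boundary handling.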

arex

A Node.js article extractor with automatic summarization.

#Install

npm install arex

#Usage:

var arex = require('arex');
//example 1
arex.get_article('http://finance.sina.com.cn/consume/puguangtai/2016-03-15/doc-ifxqhmve9227502.shtml',120,(err,result)=>{
		//120: summary length limit; set it to false if you do not need a summary.
		//result: {"title":"...","content":"....", "summary":"...", "pubdate":"..."}
		console.log(result['content']);
});

//example 2
var result = arex.get_article_sync('<html.........</html>',120);//result: {"title":"...","content":"....", "summary":"...", "pubdate":"..."}

//example 3
//summarize(content, exptd_len=120, shingle=false, min=150, max=350, filter=[], title)
var summary = arex.summarize('<html>.......</html>', 120, true);
var summary = arex.summarize('<html>.......</html>', 0.04, true, 100, 300);//summary ratio 4%, min length 100, max length 300
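
The second `summarize` call above passes a ratio (0.04) together with a minimum and maximum length. The sketch below shows one plausible way such arguments could resolve to a target length. It is an assumption for illustration, not arex's code: `resolveSummaryLength` is a hypothetical helper, and treating values below 1 as ratios and leaving absolute lengths unclamped are both guesses.

```javascript
// Hypothetical helper, not part of arex.
function resolveSummaryLength(contentLength, exptdLen, min, max) {
  if (min === undefined) min = 150; // default taken from the signature comment
  if (max === undefined) max = 350; // default taken from the signature comment
  // Assumption: a value between 0 and 1 is a ratio of the content length,
  // clamped to the [min, max] range.
  if (exptdLen > 0 && exptdLen < 1) {
    var target = Math.round(contentLength * exptdLen);
    return Math.min(max, Math.max(min, target));
  }
  // Assumption: any other value is already an absolute length.
  return exptdLen;
}

console.log(resolveSummaryLength(5000, 0.04, 100, 300));
console.log(resolveSummaryLength(1000, 0.04, 100, 300));
```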

#Test

##Get the source

git clone https://github.com/ahkimkoo/arex.git

##Test a link

cd arex
npm install
node test/test.js http://finance.sina.com.cn/consume/puguangtai/2016-03-15/doc-ifxqhmve9227502.shtml 120

##About the algorithm

  • article extractor: based on the density of article blocks; a block consists of a number of natural lines.
  • title extractor: takes the h1 tag or the title tag and chooses the best one.
  • pubdate extractor: regex extraction of a date near the beginning of the article.
  • summarizer: based on sentence rank, similar to PageRank. Optimization: a neural network model determines whether a sentence is suitable for the summary.
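
A PageRank-style sentence ranker, as named in the last item, can be sketched roughly as below. This is a generic TextRank-like illustration, not arex's implementation: the function names, the word-overlap similarity, the damping factor of 0.85, and the ASCII-only tokenizer are all assumptions (Chinese text would need a word segmenter), and the neural network filter is not shown.

```javascript
// Illustrative sketch only; names and parameters are assumptions.
function tokenize(sentence) {
  return sentence.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Assumed similarity: shared distinct words, normalised by sentence lengths.
function similarity(a, b) {
  if (a.length === 0 || b.length === 0) return 0;
  var setB = new Set(b);
  var shared = new Set(a.filter(function (w) { return setB.has(w); })).size;
  return shared / (Math.log(a.length + 1) + Math.log(b.length + 1));
}

function rankSentences(sentences, damping, iterations) {
  damping = damping || 0.85;
  iterations = iterations || 30;
  var n = sentences.length;
  var tokens = sentences.map(tokenize);
  var weights = [];
  var outSum = [];
  for (var i = 0; i < n; i++) {
    weights[i] = [];
    outSum[i] = 0;
    for (var j = 0; j < n; j++) {
      weights[i][j] = i === j ? 0 : similarity(tokens[i], tokens[j]);
      outSum[i] += weights[i][j];
    }
  }
  var scores = new Array(n).fill(1 / n);
  for (var it = 0; it < iterations; it++) {
    // Each sentence receives score from the sentences similar to it.
    scores = scores.map(function (_, target) {
      var sum = 0;
      for (var src = 0; src < n; src++) {
        if (outSum[src] > 0) {
          sum += (weights[src][target] / outSum[src]) * scores[src];
        }
      }
      return (1 - damping) / n + damping * sum;
    });
  }
  return scores;
}
```

The highest-scoring sentences would then be picked, in document order, until the expected summary length is reached.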

Current Tags

  • 0.4.5: latest (2 years ago)

27 Versions

  • 0.4.5: 2 years ago
  • 0.4.4: 2 years ago
  • 0.4.2: 2 years ago
  • 0.4.0: 2 years ago
  • 0.3.9: 2 years ago
  • 0.3.8: 2 years ago
  • 0.3.6: 2 years ago
  • 0.3.5: 2 years ago
  • 0.3.2: 3 years ago
  • 0.3.0: 3 years ago
  • 0.2.8: 3 years ago
  • 0.2.6: 3 years ago
  • 0.2.4: 3 years ago
  • 0.2.2: 3 years ago
  • 0.2.1: 3 years ago
  • 0.2.0: 4 years ago
  • 0.1.9: 4 years ago
  • 0.1.7: 4 years ago
  • 0.1.6: 4 years ago
  • 0.1.5: 4 years ago
  • 0.1.0: 4 years ago
  • 0.0.7: 4 years ago
  • 0.0.6: 4 years ago
  • 0.0.4: 4 years ago
  • 0.0.3: 4 years ago
  • 0.0.2: 4 years ago
  • 0.0.1: 4 years ago