Extracting Information from Text With NLTK (Alibaba Cloud Developer Community)

2017-05-02

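Introduction:

Most real-world data is either "unstructured", such as ordinary txt documents, or "semi-structured", such as HTML, and techniques are needed to extract useful information from it. If all data were "structured", such as XML or a relational database, no special extraction would be needed: you could use the metadata to retrieve whatever information you want.

So let's discuss how to implement information extraction from text with NLTK.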

First, the raw text of the document is split into sentences using a sentence segmenter, and each sentence is further subdivided into words using a tokenizer. Next, each sentence is tagged with part-of-speech tags, which will prove very helpful in the next step, named entity recognition. In this step, we search for mentions of potentially interesting entities in each sentence. Finally, we use relation recognition to search for likely relations between different entities in the text.

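The information extraction process described here therefore has four steps: segmentation and tokenization, part-of-speech tagging, named entity recognition, and relation recognition. Tokenization and part-of-speech tagging were covered earlier, so let's look in detail at how named entity recognition is implemented.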
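The first two steps of the pipeline can be sketched without any dependencies. This is only a rough illustration: the helper names `split_sentences` and `tokenize` are made up for this example, and the regular expressions are naive stand-ins for NLTK's own `nltk.sent_tokenize` and `nltk.word_tokenize`, which handle abbreviations and other edge cases that this sketch does not.

```python
import re

def split_sentences(text):
    """Naive sentence segmenter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Naive tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "The little yellow dog barked at the cat. Rapunzel let down her long golden hair."
sentences = split_sentences(text)          # step 1: sentence segmentation
tokens = [tokenize(s) for s in sentences]  # step 2: tokenization

for sent_tokens in tokens:
    print(sent_tokens)
```

In a real pipeline each token list would then be passed to a part-of-speech tagger (for example `nltk.pos_tag`) before chunking.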

Chunking

The basic technique we will use for entity recognition is chunking, which segments and labels multitoken sequences.

In other words, chunking can be understood as grouping several tokens into a phrase.

Noun Phrase Chunking

Let's start with chunking noun phrases, i.e. NP-chunking, as an example.

One of the most useful sources of information for NP-chunking is part-of-speech tags.

>>> import nltk
>>> sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
...             ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
>>> # Tag pattern: optional determiner (0 or 1), any number of adjectives, then one noun
>>> grammar = "NP: {<DT>?<JJ>*<NN>}"
>>> cp = nltk.RegexpParser(grammar)
>>> result = cp.parse(sentence)
>>> print(result)  # NP chunks: "the little yellow dog" and "the cat"
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))
The approach above uses regular expressions over part-of-speech tags (tag patterns) to find the NP chunks.
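To see why a tag pattern behaves like a regular expression, the same idea can be sketched with Python's `re` module alone. This is a simplified illustration, not how `nltk.RegexpParser` is implemented: the function name `np_chunks` is hypothetical, and the pattern `<DT>?<JJ>*<NN>` is hard-coded rather than parsed from a grammar string.

```python
import re

def np_chunks(tagged):
    """Return chunks matching <DT>?<JJ>*<NN> as lists of (word, tag) pairs."""
    # Encode the tag sequence as one string, e.g. "<DT><JJ><JJ><NN><VBD>..."
    tag_string = "".join("<%s>" % tag for _, tag in tagged)
    pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")
    chunks = []
    for m in pattern.finditer(tag_string):
        # Map character offsets back to token indices by counting "<" characters.
        start = tag_string.count("<", 0, m.start())
        end = start + m.group().count("<")
        chunks.append(tagged[start:end])
    return chunks

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
phrases = [" ".join(word for word, _ in chunk) for chunk in np_chunks(sentence)]
print(phrases)
```

Each tag is wrapped in angle brackets so that a pattern such as `<NN>` cannot accidentally match inside a longer tag such as `NNS`.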

Here is another example. A grammar can contain several tag patterns, and the patterns can become more complex.

>>> grammar = r"""
...   NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and noun
...       {<NNP>+}                # chunk sequences of proper nouns
... """
>>> cp = nltk.RegexpParser(grammar)
>>> sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
...             ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
>>> print(cp.parse(sentence))  # NP chunks: "Rapunzel" and "her long golden hair"
(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))
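Once a sentence has been chunked, the chunks usually need to be pulled out of the tree as plain strings. The following is a minimal sketch, assuming NLTK 3 (where the node label is read with `Tree.label()`); the variable name `noun_phrases` is chosen only for this illustration. It needs no corpus or model downloads because the sentence is already tagged.

```python
import nltk

grammar = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>}   # determiner/possessive, adjectives, noun
      {<NNP>+}                # sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)
sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
            ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
tree = cp.parse(sentence)

# Keep only the NP subtrees and join the words of each one into a phrase.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)
```

The `filter` argument restricts the traversal to subtrees labelled `NP`, so the sentence-level `S` node is skipped.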

The next example shows how to find matching part-of-speech sequences in a corpus.

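>>> cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')  # find "verb to verb" sequences
>>> brown = nltk.corpus.brown  # requires the Brown corpus: nltk.download('brown')
>>> for sent in brown.tagged_sents():
...     tree = cp.parse(sent)
...     for subtree in tree.subtrees():
...         # In NLTK 3, Tree.label() replaces the older .node attribute
...         if subtree.label() == 'CHUNK':
...             print(subtree)
...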
(CHUNK combined/VBN to/TO achieve/VB)
(CHUNK continue/VB to/TO place/VB)
(CHUNK serve/VB to/TO protect/VB)
(CHUNK wanted/VBD to/TO wait/VB)
(CHUNK allowed/VBN to/TO place/VB)
(CHUNK expected/VBN to/TO become/VB)
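The logic of this search can also be tried without downloading the Brown corpus. The sketch below is an approximation under stated assumptions: `verb_to_verb` is a hypothetical helper, the two tagged sentences in `sample` are made up for illustration (they are not taken from the corpus), and a sliding window is used in place of the chunk parser.

```python
import re

def verb_to_verb(tagged_sent):
    """Yield (verb, 'to', verb) word triples whose tags match <V.*> <TO> <V.*>."""
    windows = zip(tagged_sent, tagged_sent[1:], tagged_sent[2:])
    for (w1, t1), (w2, t2), (w3, t3) in windows:
        if re.fullmatch(r"V.*", t1) and t2 == "TO" and re.fullmatch(r"V.*", t3):
            yield (w1, w2, w3)

# Hand-made sample: only the first sentence contains "verb to verb".
sample = [
    [("She", "PPS"), ("wanted", "VBD"), ("to", "TO"), ("wait", "VB"), (".", ".")],
    [("They", "PPSS"), ("went", "VBD"), ("to", "IN"), ("town", "NN"), (".", ".")],
]
matches = [triple for sent in sample for triple in verb_to_verb(sent)]
print(matches)
```

One difference from the chunker is worth noting: the sliding window reports overlapping matches, whereas chunks produced by `RegexpParser` do not overlap.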

This article is excerpted from cnblogs (博客园); original publication date: 2011-07-04.
