Getting Started with NLP: Using spaCy

2023-08-10

Summary: This is the fourth article in an introductory NLP series. It gives a brief tour of how to use spaCy.

Source code: 自然语言处理练习 (NLP practice: code written while studying natural language processing) (gitee.com)

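4. Using the spaCy toolkit

4.1 Installing spaCy

spaCy claims to cover everything NLTK can do while running faster and integrating better with deep learning workflows. Most importantly for Chinese-language work, it ships Chinese language models.

For reasons best left unsaid, installing by the method given on the official site often fails here, so I recommend installing the prebuilt packages from conda directly.

Run

conda install spacy
conda install -c conda-forge spacy-model-en_core_web_sm

and the installation should succeed.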
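To confirm that both packages landed in the active environment, a quick check from Python can help. The following is a minimal sketch that uses only the standard library. The helper names `is_installed` and `check_spacy_setup` are illustrative and not part of spaCy. The check relies on the fact that spaCy model packages such as `en_core_web_sm` are installed as ordinary importable Python packages.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


def check_spacy_setup(model_name="en_core_web_sm"):
    """Report whether spaCy and the given model package are importable."""
    return {"spacy": is_installed("spacy"), model_name: is_installed(model_name)}


if __name__ == "__main__":
    for name, ok in check_spacy_setup().items():
        print(f"{name}: {'installed' if ok else 'missing'}")
```

If either entry is reported as missing, the conda commands were probably run in a different environment from the one the interpreter belongs to.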

If it still fails, you can look online for an offline spaCy installation package. This article is a useful reference:

安装spaCy（最简单的教程）_spacy安装_御用厨师的博客-CSDN博客

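4.2 Loading a model

Install whichever model you need, then load it by name. I use the English model here as a demonstration.

Example:

import spacy

# Load the model
nlp = spacy.load("en_core_web_sm")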
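If the model package is missing, `spacy.load` raises an `OSError`. One way to keep a script usable in that case is sketched below. `load_pipeline` is a hypothetical helper, and falling back to `spacy.blank("en")` is an assumption about what is acceptable: a blank pipeline can tokenize, but it has no tagger, parser, or entity recognizer, so it will not produce part-of-speech tags, sentence boundaries, or entities.

```python
def load_pipeline(model_name="en_core_web_sm"):
    """Load a spaCy pipeline, falling back to a blank English one.

    Returns None if spaCy itself is not installed.
    """
    try:
        import spacy
    except ImportError:
        return None
    try:
        return spacy.load(model_name)
    except OSError:
        # Model package not installed: fall back to a tokenizer-only pipeline.
        return spacy.blank("en")


nlp = load_pipeline()
if nlp is not None:
    # An empty list here means the blank fallback was used.
    print(nlp.pipe_names)
```

Printing `nlp.pipe_names` is a cheap way to see which components are actually available before relying on them.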

4.3 Tokenization

spaCy can tokenize text as well.

Example:

# Load the text
doc = nlp('Weather is good, very windy and sunny. We have no classes in the afternoon')
# Tokenize
for token in doc:
    print(token)

4.4 Sentence segmentation

spaCy also splits text into sentences.

Example:

# Split into sentences
for sent in doc.sents:
    print(sent)

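4.5 Part-of-speech tagging

Like NLTK, spaCy can tag each token with its part of speech.

Example:

# Part-of-speech tags
for token in doc:
    print('{}-{}'.format(token, token.pos_))

For a reference table of the tags, see

SpaCy词性对照表 - 知乎 (zhihu.com)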
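For quick reference without leaving the script: the coarse-grained tags returned by `token.pos_` follow the Universal Dependencies tag set, plus `SPACE`, which spaCy uses for whitespace tokens. The table below is a small sketch of such a lookup. `UPOS_DESCRIPTIONS` and `describe_pos` are illustrative names, and the descriptions are short paraphrases. spaCy also provides `spacy.explain(tag)`, which serves a similar purpose.

```python
# Coarse-grained tags as returned by token.pos_ (Universal POS tags plus SPACE).
UPOS_DESCRIPTIONS = {
    "ADJ": "adjective",
    "ADP": "adposition",
    "ADV": "adverb",
    "AUX": "auxiliary",
    "CCONJ": "coordinating conjunction",
    "DET": "determiner",
    "INTJ": "interjection",
    "NOUN": "noun",
    "NUM": "numeral",
    "PART": "particle",
    "PRON": "pronoun",
    "PROPN": "proper noun",
    "PUNCT": "punctuation",
    "SCONJ": "subordinating conjunction",
    "SYM": "symbol",
    "VERB": "verb",
    "X": "other",
    "SPACE": "whitespace",
}


def describe_pos(tag):
    """Return a short description for a coarse POS tag."""
    return UPOS_DESCRIPTIONS.get(tag, "unknown tag")


for tag in ("PROPN", "AUX", "ADJ"):
    print(tag, "-", describe_pos(tag))
```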

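4.6 Named entity recognition

spaCy also provides named entity recognition.

Example:

# Named entity recognition
doc_2 = nlp("I went to Paris where I met my old friend Jack from uni")
for ent in doc_2.ents:
    print('{}-{}'.format(ent, ent.label_))

The result can also be rendered visually:

# Visualize the entities and save them to an HTML file
from pathlib import Path
from spacy import displacy

doc = nlp("I went to Paris where I met my old friend Jack from uni")
# With style='ent' the result is HTML markup; jupyter=False returns it as a string even inside a notebook
html = displacy.render(doc, style='ent', jupyter=False)
output_path = Path("./sentence.html")
output_path.write_text(html, encoding="utf-8")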

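4.7 Finding the names of all characters in a book

Using Pride and Prejudice as the corpus, here is a hands-on example that finds the names of the characters.

Example:

# Find the names of the characters in the book
import os
from collections import Counter

def read_file(file_name):
    # Assumes the text file is UTF-8 encoded
    with open(file_name, 'r', encoding='utf-8') as f:
        return f.read()

text = read_file(os.path.join('./', 'data/Pride and Prejudice.txt'))
processed_text = nlp(text)
sentences = [s for s in processed_text.sents]
print(len(sentences))
print(sentences[:5])

def find_person(doc):
    # Count every entity labelled PERSON and return the ten most frequent
    c = Counter()
    for ent in doc.ents:
        if ent.label_ == 'PERSON':
            c[ent.lemma_] += 1
    return c.most_common(10)

print(find_person(processed_text))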
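The counting step in `find_person` does not depend on spaCy and can be tried in isolation. The sketch below replaces `doc.ents` with a hand-written list of (text, label) pairs. These pairs are made-up stand-ins for illustration, not output from the model, and `count_people` is an illustrative name.

```python
from collections import Counter

# Made-up stand-ins for (ent.lemma_, ent.label_) pairs; not real model output.
sample_entities = [
    ("Elizabeth", "PERSON"),
    ("Darcy", "PERSON"),
    ("Longbourn", "GPE"),
    ("Elizabeth", "PERSON"),
    ("Jane", "PERSON"),
    ("Darcy", "PERSON"),
    ("Elizabeth", "PERSON"),
]


def count_people(entities, top_n=10):
    """Count entities labelled PERSON and return the top_n most frequent."""
    counts = Counter(text for text, label in entities if label == "PERSON")
    return counts.most_common(top_n)


top_people = count_people(sample_entities)
print(top_people)  # [('Elizabeth', 3), ('Darcy', 2), ('Jane', 1)]
```

`Counter.most_common` sorts by count in descending order, so the non-PERSON entry is filtered out and the most frequently mentioned name comes first.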

4.8 Terrorist attack analysis

Using terrorist-attack incident records downloaded from the official website of a global counter-terrorism organization, we count how many times specific groups carried out attacks in specific locations.

Example:

# Terrorist attack analysis
import os
from collections import Counter, defaultdict

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def read_file_to_list(file_name):
    # One article per line
    with open(file_name, 'r') as f:
        return f.readlines()

terrorist_articles = read_file_to_list(os.path.join('./', 'data/rand-terrorism-dataset.txt'))
print(terrorist_articles[:5])
terrorist_articles_nlp = [nlp(art.lower()) for art in terrorist_articles]

common_terrorist_groups = [
    'taliban',
    'al-qaeda',
    'hamas',
    'fatah',
    'plo',
    'bilad al-rafidayn']
common_locations = [
    'iraq',
    'baghdad',
    'kirkuk',
    'mosul',
    'afghanistan',
    'kabul',
    'basra',
    'palestine',
    'gaza',
    'israel',
    'istanbul',
    'beirut',
    'pakistan']

# group -> Counter of locations mentioned in the same article
location_entity_dict = defaultdict(Counter)
for article in terrorist_articles_nlp:
    article_terrorist_groups = [ent.lemma_ for ent in article.ents if ent.label_ in ('PERSON', 'ORG')]
    article_locations = [ent.lemma_ for ent in article.ents if ent.label_ == 'GPE']
    terrorist_common = [ent for ent in article_terrorist_groups if ent in common_terrorist_groups]
    location_common = [ent for ent in article_locations if ent in common_locations]
    for found_entity in terrorist_common:
        for found_location in location_common:
            location_entity_dict[found_entity][found_location] += 1
print(location_entity_dict)

# Columns are groups, rows are locations; missing pairs become 0 before the cast to int
location_entity_df = pd.DataFrame.from_dict(dict(location_entity_dict))
location_entity_df = location_entity_df.fillna(value=0).astype(int)
print(location_entity_df)

plt.figure(figsize=(12, 10))
hmap = sns.heatmap(location_entity_df, annot=True, fmt='d', cmap='YlGnBu', cbar=False)
plt.title("Global Incidents by Terrorist group")
plt.xticks(rotation=30)
plt.show()
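The nested loop in this example is a co-occurrence count: every group mentioned in an article is paired with every location mentioned in the same article. The sketch below isolates that step in plain Python so it can be checked by hand. The per-article lists are made-up placeholders rather than records from the dataset, and `count_cooccurrences` and `to_rows` are illustrative names.

```python
from collections import Counter, defaultdict

# Made-up placeholders: (groups found, locations found) for three articles.
articles = [
    (["taliban"], ["afghanistan", "kabul"]),
    (["hamas"], ["gaza"]),
    (["taliban"], ["kabul"]),
]


def count_cooccurrences(articles):
    """Map group -> Counter of locations mentioned in the same article."""
    table = defaultdict(Counter)
    for groups, locations in articles:
        for group in groups:
            for location in locations:
                table[group][location] += 1
    return table


def to_rows(table):
    """Flatten to (location, {group: count}) rows, filling gaps with 0."""
    groups = sorted(table)
    locations = sorted({loc for counts in table.values() for loc in counts})
    return [(loc, {g: table[g][loc] for g in groups}) for loc in locations]


table = count_cooccurrences(articles)
for location, counts in to_rows(table):
    print(location, counts)
```

The rows have the same layout as the DataFrame in the example: locations as the index, groups as columns, and zeros for pairs that never appear together.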
