✅ About the author: an undergraduate majoring in Artificial Intelligence who loves computers and programming, blogging to document the learning journey.
🍎 Homepage: 小嗷犬的博客
🍊 Personal motto: "To establish a heart for Heaven and Earth, to secure life for the people, to carry on the lost teachings of past sages, to bring peace to all generations to come."
🥭 This post: Chinese word segmentation in Python with the jieba library
1. Installing the jieba Library
jieba is an important third-party Python library for Chinese word segmentation. Install it with pip:
pip install jieba
# or
pip3 install jieba
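You can verify the installation by importing the library; recent releases of jieba expose a __version__ attribute:

import jieba
print(jieba.__version__)  # e.g. 0.42.1 in recent releases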
2. Common Functions
The commonly used functions of the jieba library are listed below:

| Function | Description |
| --- | --- |
| jieba.cut(s) | Precise mode; returns an iterable generator |
| jieba.cut(s, cut_all=True) | Full mode; outputs every possible word in the text s |
| jieba.cut_for_search(s) | Search-engine mode; segmentation suited to building a search index |
| jieba.lcut(s) | Precise mode; returns a list (recommended) |
| jieba.lcut(s, cut_all=True) | Full mode; returns a list (recommended) |
| jieba.lcut_for_search(s) | Search-engine mode; returns a list (recommended) |
| jieba.add_word(w) | Adds a new word w to the segmentation dictionary |
Example:
import jieba
print(jieba.lcut('Python是一种十分便捷的编程语言'))
print(jieba.lcut('Python是一种十分便捷的编程语言', cut_all=True))
print(jieba.lcut_for_search('Python是一种十分便捷的编程语言'))
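The table also lists jieba.add_word(w), which the example above does not exercise. A minimal sketch of its effect; the word '小嗷犬' is just an assumed out-of-vocabulary term, not in jieba's default dictionary:

import jieba

# An out-of-vocabulary name is usually split apart in precise mode
print(jieba.lcut('小嗷犬喜欢编程'))   # likely ['小', '嗷', '犬', '喜欢', '编程']

jieba.add_word('小嗷犬')             # register the new word in the dictionary
print(jieba.lcut('小嗷犬喜欢编程'))   # now kept as one token: '小嗷犬'

Note that jieba.cut returns a generator that can only be iterated once, while jieba.lcut returns a plain list, which is why the table recommends the latter.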
3. Applying jieba: Word Frequency Statistics
3.1 English Word Frequency: The Old Man and the Sea
def getText():
    # jieba is not needed for this English example; plain string methods suffice
    txt = open("Documents/《The Old Man And the Sea》.txt", "r", encoding='utf-8').read()
    txt = txt.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, " ")  # replace punctuation with spaces
    return txt

words = getText().split()
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
# Output:
# the 2751
# and 1458
# he 1221
# of 788
# to 555
# a 538
# it 528
# his 513
# in 503
# i 472
Looking at the output, most of the high-frequency words are articles, pronouns, conjunctions, and other function words that say nothing about the article's meaning. As a next step, we can build a set of stop words, excludes, and filter its members out of the results.
excludes = {"the","and","of","you","a","i","my","in","he","to","it","his","was",
"that","is","but","him","as","on","not","with","had","said","now","for",
"thought","they","have","then","were","from","could","there","out","be",
"when","at","them","all","will","would","no","do","are","or","down","so",
"up","what","if","back","one","can","must","this","too","more","again",
"see","great","two"}
def getText():
    txt = open("Documents/《The Old Man And the Sea》.txt", "r", encoding='utf-8').read()
    txt = txt.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~“”':  # curly quotes “” are stripped here too
        txt = txt.replace(ch, " ")  # replace punctuation with spaces
    return txt

words = getText().split()
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1
for word in excludes:
    del counts[word]  # drop the stop words
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
# Output:
# old 300
# man 298
# fish 281
# line 139
# water 107
# boy 105
# hand 91
# sea 67
# head 65
# come 60
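The count-then-sort pattern above is common enough that the standard library covers it directly: collections.Counter counts an iterable, and most_common() returns the sorted pairs. A minimal equivalent sketch, reusing the getText() and excludes definitions from above:

from collections import Counter

counts = Counter(getText().split())
for word in excludes:
    del counts[word]  # Counter supports del just like a dict
for word, count in counts.most_common(10):
    print("{0:<10}{1:>5}".format(word, count))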
3.2 Character Appearance Counts in 《水浒传》 (Water Margin)
import jieba

txt = open("Documents/《水浒传》.txt", "r", encoding='utf-8').read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:  # skip single-character tokens
        continue
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
# Output:
# 宋江 2538
# 两个 1733
# 一个 1399
# 李逵 1117
# 武松 1053
# 只见 917
# 如何 911
# 那里 858
# 哥哥 750
# 说道 729
# 林冲 720
# 军马 719
# 头领 707
# 吴用 654
# 众人 652
Looking at the output, the results include words that are not character names. As in the English example, we need to exclude these irrelevant words.
import jieba

excludes = {'两个','一个','只见','如何','那里','哥哥','说道','军马',
            '头领','众人','这里','兄弟','出来','小人','梁山泊','这个',
            '今日','妇人','先锋','好汉','便是','人马','问道','起来',
            '甚么','因此','却是','我们','正是','三个','如此','且说',
            '不知','不是','只是','次日','不曾','呼延','不得','一面',
            '看时','不敢','如今','来到','当下','原来','将军','山寨',
            '喝道','兄长','只得','军士','里面','大喜','天子','一齐',
            '知府','性命','商议','小弟','那个','公人','将来','前面',
            '东京','喽罗','那厮','城中','弟兄','下山','不见','怎地',
            '上山','随即','不要'}

txt = open("Documents/《水浒传》.txt", "r", encoding='utf-8').read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:  # skip single-character tokens
        continue
    elif word == "宋江道":  # "宋江道" ("Song Jiang said") also refers to 宋江
        rword = "宋江"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in excludes:
    del counts[word]  # drop words that are not character names
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
# Output:
# 宋江 3010
# 李逵 1117
# 武松 1053
# 林冲 720
# 吴用 654
# 卢俊义 546
# 鲁智深 356
# 戴宗 312
# 柴进 301
# 公孙胜 272
# 花荣 270
# 秦明 258
# 燕青 252
# 朱仝 245
# 晁盖 238
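The ranking is now dominated by actual character names. One possible further refinement, not from the original post: the "宋江道" special case generalizes, since the segmenter often glues 道 ("said") onto a speaker's name. A sketch under that assumption, with names as a hypothetical set of known character names:

import jieba

txt = open("Documents/《水浒传》.txt", "r", encoding='utf-8').read()

# Hypothetical: a set of known character names; extend as needed
names = {"宋江", "李逵", "武松", "林冲", "吴用", "卢俊义", "鲁智深"}

counts = {}
for word in jieba.lcut(txt):
    if len(word) == 1:
        continue
    # Fold "X道" ("X said") into the name X whenever X is known
    if word.endswith("道") and word[:-1] in names:
        word = word[:-1]
    counts[word] = counts.get(word, 0) + 1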