Removing Stop Words with NLTK - Q&A - Alibaba Cloud Developer Community

NLTK ships with stop word lists for many languages. To get the English stop words, you can use the following code:

# Run nltk.download('stopwords') once if the corpus is not installed yet
from nltk.corpus import stopwords
stopwords.words('english')

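Now let's modify our code and clean the tokens before plotting. First, we make a copy of the token list. Then we iterate over the tokens in the original list and remove every stop word from the copy: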

clean_tokens = tokens[:]  # copy, so the list being iterated is never modified
sr = stopwords.words('english')  # load the stop word list once
for token in tokens:
    if token in sr:
        clean_tokens.remove(token)
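
The loop above calls list.remove once for every stop word it finds, and each call rescans the list. As an alternative sketch (not the original author's code), the same filtering can be done in a single pass with a set lookup. The function name remove_stop_words and the small sample lists are hypothetical and only for illustration; in practice you would pass stopwords.words('english') as the second argument. Comparing in lowercase is an added assumption: NLTK's English list is lowercase, so a token such as "The" would otherwise slip through.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not stop words, keeping their order."""
    # Build the set once so each membership test is a fast hash lookup.
    stop_set = {word.lower() for word in stop_words}
    # Compare in lowercase (an assumption of this sketch) so "The" matches "the".
    return [token for token in tokens if token.lower() not in stop_set]


# Hypothetical sample data; a stand-in for stopwords.words('english').
sample_stop_words = ["the", "is", "a", "of"]
sample_tokens = ["The", "history", "of", "PHP", "is", "a", "long", "story"]

clean_sample = remove_stop_words(sample_tokens, sample_stop_words)
print(clean_sample)
```

Because the function returns a new list, the original tokens list is left untouched and no explicit copy is needed.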

See the Python list methods documentation to learn more about working with lists.

The final code should look like this:

from bs4 import BeautifulSoup
import urllib.request
import nltk
from nltk.corpus import stopwords
# Fetch the page and extract its visible text
response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html, "html5lib")
text = soup.get_text(strip=True)
# Split on whitespace to get the tokens
tokens = text.split()
# Work on a copy so the list being iterated is never modified
clean_tokens = tokens[:]
sr = stopwords.words('english')  # load the stop word list once
for token in tokens:
    if token in sr:
        clean_tokens.remove(token)
# Count and print how often each remaining token appears
freq = nltk.FreqDist(clean_tokens)
for key, val in freq.items():
    print(str(key) + ':' + str(val))

If you check the plot now, it is easier to read than the previous one, because the stop words no longer get in the way.

freq.plot(20,cumulative=False)
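
If a plot is not convenient (for example, on a machine without a display), the most frequent tokens can also be printed as text. The sketch below uses collections.Counter from the standard library as a stand-in; nltk.FreqDist is a subclass of Counter, so freq.most_common(20) should behave the same way on the freq object from the script above. The token list here is made up for illustration.

```python
from collections import Counter

# Made-up tokens standing in for clean_tokens from the script above.
sample_clean_tokens = ["PHP", "release", "PHP", "download", "release", "PHP"]

sample_freq = Counter(sample_clean_tokens)

# most_common(n) returns (token, count) pairs sorted by descending count.
top_tokens = sample_freq.most_common(2)
for token, count in top_tokens:
    print(f"{token}:{count}")
```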
