Filtering Spam Email with the Naive Bayes Algorithm

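(1) Collect data: text files are provided.

(2) Prepare data: parse the text files into token vectors.

(3) Analyze data: inspect the tokens to make sure the parsing is correct.

(4) Train the algorithm: use the trainNB0() function.

(5) Test the algorithm: use classifyNB(), and build a new test function that computes the error rate over the documents.

(6) Use the algorithm: build a complete program that classifies a set of documents and prints the misclassified ones.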

Google Colab (optional)

In [1]:

from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive

Prepare data: tokenizing text

In [3]:

mySent = 'This book is the best book on Python or M.L. I have ever laid eyes upon.'

In [12]:

import re
regEx = re.compile(r'\W+')  # raw string, so \W is not treated as a string escape
listOfTokens = regEx.split(mySent)
listOfTokens
Out[12]:

['This',
'book',
'is',
'the',
'best',
'book',
'on',
'Python',
'or',
'M',
'L',
'I',
'have',
'ever',
'laid',
'eyes',
'upon',
'']
In [13]:
[tok for tok in listOfTokens if len(tok) > 0]

Out[13]:

['This',
'book',
'is',
'the',
'best',
'book',
'on',
'Python',
'or',
'M',
'L',
'I',
'have',
'ever',
'laid',
'eyes',
'upon']

In [14]:

[tok.lower() for tok in listOfTokens if len(tok) > 0]

Out[14]:

['this',
'book',
'is',
'the',
'best',
'book',
'on',
'python',
'or',
'm',
'l',
'i',
'have',
'ever',
'laid',
'eyes',
'upon']

In [22]:

emailText = open('/content/drive/MyDrive/MachineLearning/机器学习/朴素贝叶斯/使用朴素贝叶斯过滤垃圾邮件/email/ham/6.txt', 'r', encoding='latin-1').read()
emailText

Out[22]:

'Hello,\n\nSince you are an owner of at least one Google Groups group that uses the customized welcome message, pages or files, we are writing to inform you that we will no longer be supporting these features starting February 2011. We made this decision so that we can focus on improving the core functionalities of Google Groups -- mailing lists and forum discussions.  Instead of these features, we encourage you to use products that are designed specifically for file storage and page creation, such as Google Docs and Google Sites.\n\nFor example, you can easily create your pages on Google Sites and share the site (http://www.google.com/support/sites/bin/answer.py?hl=en&answer=174623) with the members of your group. You can also store your files on the site by attaching files to pages (http://www.google.com/support/sites/bin/answer.py?hl=en&answer=90563) on the site. If you\x92re just looking for a place to upload your files so that your group members can download them, we suggest you try Google Docs. You can upload files (http://docs.google.com/support/bin/answer.py?hl=en&answer=50092) and share access with either a group (http://docs.google.com/support/bin/answer.py?hl=en&answer=66343) or an individual (http://docs.google.com/support/bin/answer.py?hl=en&answer=86152), assigning either edit or download only access to the files.\n\nyou have received this mandatory email service announcement to update you about important changes to Google Groups.'

In [18]:

listOfTokens = regEx.split(emailText)

In [20]:

listOfTokens[:10]

Out[20]:

['Hello', 'Since', 'you', 'are', 'an', 'owner', 'of', 'at', 'least', 'one']

Test the algorithm: cross-validation with naive Bayes

In [46]:

def createVocabList(dataSet):
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)
    return list(vocabSet)

In [47]:

def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec
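To make the vector encoding concrete, here is a minimal sketch on a toy corpus. The names `create_vocab_list` and `set_of_words_to_vec` are hypothetical stand-ins for the two functions above, the documents are made up, and the vocabulary is sorted only so that the printed result is deterministic. This variant also skips unknown words silently and uses a dict for lookups, which is an assumption of the sketch rather than the notebook's behavior.

```python
def create_vocab_list(data_set):
    """Collect every distinct token that appears in any document."""
    vocab = set()
    for document in data_set:
        vocab |= set(document)
    return sorted(vocab)  # sorted only to make the example deterministic


def set_of_words_to_vec(vocab_list, input_set):
    """Return a 0/1 vector marking which vocabulary words occur in input_set."""
    index = {word: i for i, word in enumerate(vocab_list)}
    vec = [0] * len(vocab_list)
    for word in input_set:
        if word in index:  # words outside the vocabulary are ignored
            vec[index[word]] = 1
    return vec


docs = [['free', 'offer', 'now'], ['meeting', 'notes', 'now']]
vocab = create_vocab_list(docs)
print(vocab)  # ['free', 'meeting', 'notes', 'now', 'offer']
print(set_of_words_to_vec(vocab, ['free', 'free', 'now', 'unknown']))  # [1, 0, 0, 1, 0]
```

Note that the repeated word 'free' still yields a 1, not a 2: a set-of-words vector records presence only, not counts.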

In [48]:

def textParse(bigString):
    # Split on runs of non-word characters; keep lowercase tokens longer than 2 characters
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]
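The `len(tok) > 2` filter matters for emails like ham/6.txt, whose URLs break into short fragments when split on non-word characters. The sketch below applies the same idea to one URL copied from that email; `text_parse_sketch` and its `min_len` parameter are hypothetical names used only for this illustration.

```python
import re


def text_parse_sketch(big_string, min_len=3):
    """Split on non-word characters and keep lowercase tokens of at least min_len characters."""
    tokens = re.split(r'\W+', big_string)
    return [tok.lower() for tok in tokens if len(tok) >= min_len]


url = 'http://www.google.com/support/sites/bin/answer.py?hl=en&answer=174623'

raw = [tok for tok in re.split(r'\W+', url) if tok]  # every non-empty token
kept = text_parse_sketch(url)                        # short fragments removed

print(raw)
print(kept)
print(sorted(set(raw) - set(kept)))  # ['en', 'hl', 'py']
```

The dropped tokens are query-string and file-extension fragments that carry little signal for classification.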

In [49]:

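import random
# Import only the NumPy names used below, so the wildcard import no longer shadows random and sum
from numpy import log, ones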

In [50]:

def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    # Prior probability of class 1 (spam)
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # Start counts at 1 and denominators at 2 so that no word gets probability 0
    p0Num = ones(numWords)
    p1Num = ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # Store log probabilities to avoid underflow when they are combined later
    p1Vect = log(p1Num / p1Denom)
    p0Vect = log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive
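Written out, the quantities that trainNB0() computes look roughly as follows. The notation is an assumption introduced here, not part of the notebook: $x_{ij}$ is entry $j$ of training vector $i$, $y_i$ is its label, $V$ is the vocabulary size, and $N$ is the number of training documents.

```latex
\hat{P}(w_j \mid c) = \frac{1 + \sum_{i:\, y_i = c} x_{ij}}{2 + \sum_{i:\, y_i = c} \sum_{k=1}^{V} x_{ik}},
\qquad
\hat{P}(c = 1) = \frac{1}{N} \sum_{i=1}^{N} y_i
```

Here p1Vect and p0Vect hold $\log \hat{P}(w_j \mid 1)$ and $\log \hat{P}(w_j \mid 0)$ for every $j$, and pAbusive is $\hat{P}(c = 1)$. The initial 1 in the numerator and 2 in the denominator correspond to ones(numWords) and the 2.0 starting values in the code.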

In [51]:

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0
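classifyNB() adds log probabilities instead of multiplying raw probabilities. The following sketch, using made-up probability values, suggests why: a product of many small factors underflows to zero in floating point, while the sum of their logarithms stays usable for comparison.

```python
import math

# 100 hypothetical per-word probabilities, each fairly small
probs = [1e-5] * 100

# Multiplying them directly underflows to exactly 0.0
product = 1.0
for p in probs:
    product *= p

# Summing logarithms keeps a finite, comparable score
log_sum = sum(math.log(p) for p in probs)

print(product)  # 0.0
print(log_sum)  # about -1151.29
```

With two classes that both underflow to 0.0 the comparison `p1 > p0` would be meaningless, whereas the log scores can still be ranked.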

In [61]:

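def spamTest():
    # Base directory taken from the ham/6.txt path used earlier;
    # the sibling spam/ folder and the 1.txt to 25.txt naming are assumptions.
    emailDir = '/content/drive/MyDrive/MachineLearning/机器学习/朴素贝叶斯/使用朴素贝叶斯过滤垃圾邮件/email'
    docList = []
    classList = []
    fullText = []
    for i in range(1, 26):
        # Spam message i, labelled 1
        wordList = textParse(open('%s/spam/%d.txt' % (emailDir, i), 'r', encoding='latin-1').read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        # Ham message i, labelled 0
        wordList = textParse(open('%s/ham/%d.txt' % (emailDir, i), 'r', encoding='latin-1').read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)
    sums = 0
    # Repeat the random hold-out split 10 times and average the error rate
    for trial in range(10):
        trainingSet = list(range(50))
        testSet = []
        # Move 10 randomly chosen documents from the training set to the test set
        for _ in range(10):
            randIndex = int(random.uniform(0, len(trainingSet)))
            testSet.append(trainingSet[randIndex])
            del(trainingSet[randIndex])
        trainMat = []
        trainClasses = []
        for docIndex in trainingSet:
            trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))
            trainClasses.append(classList[docIndex])
        p0V, p1V, pSpam = trainNB0(trainMat, trainClasses)
        errorCount = 0
        for docIndex in testSet:
            wordVector = setOfWords2Vec(vocabList, docList[docIndex])
            res = classifyNB(wordVector, p0V, p1V, pSpam)
            if res != classList[docIndex]:
                errorCount += 1
        print("the error rate is: ", float(errorCount)/len(testSet))
        sums += float(errorCount)/len(testSet)
    print("the average error rate is: ", sums/10.0)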

In [62]:

spamTest()
the error rate is:  0.0
the error rate is:  0.2
the error rate is:  0.2
the error rate is:  0.0
the error rate is:  0.1
the error rate is:  0.0
the error rate is:  0.0
the error rate is:  0.0
the error rate is:  0.0
the error rate is:  0.0
the average error rate is:  0.05

In [63]:

spamTest()
the error rate is:  0.2
the error rate is:  0.0
the error rate is:  0.0
the error rate is:  0.1
the error rate is:  0.0
the error rate is:  0.0
the error rate is:  0.1
the error rate is:  0.0
the error rate is:  0.0
the error rate is:  0.0
the average error rate is:  0.04

In [64]:

spamTest()
the error rate is:  0.1
the error rate is:  0.1
the error rate is:  0.0
the error rate is:  0.0
the error rate is:  0.1
the error rate is:  0.0
the error rate is:  0.1
the error rate is:  0.1
the error rate is:  0.0
the error rate is:  0.1
the average error rate is:  0.06
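The three runs above report different average error rates (0.05, 0.04, 0.06) because every call to spamTest() draws new random test sets. One possible way to make a split repeatable is to draw it from a seeded generator, as in the sketch below. The function `holdout_split` and its `seed` parameter are hypothetical and do not appear in the notebook; `random.sample` replaces the delete-one-at-a-time loop used in spamTest().

```python
import random


def holdout_split(n_docs, n_test, seed=None):
    """Split indices 0..n_docs-1 into train and test lists, reproducibly if seed is set."""
    rng = random.Random(seed)                 # private generator, does not touch global state
    test = rng.sample(range(n_docs), n_test)  # n_test distinct indices
    held_out = set(test)
    train = [i for i in range(n_docs) if i not in held_out]
    return train, test


train, test = holdout_split(50, 10, seed=0)
print(len(train), len(test))  # 40 10

# The same seed gives the same split again
assert holdout_split(50, 10, seed=0) == (train, test)
```

Fixing the seed makes individual runs comparable; averaging over several different seeds still gives the kind of estimate that spamTest() prints.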
