Filtering spam email with the naive Bayes algorithm
(1) Collect the data: text files are provided
(2) Prepare the data: parse the text files into token vectors
(3) Analyze the data: inspect the tokens to make sure parsing is correct
(4) Train the algorithm: use the trainNB0() function
(5) Test the algorithm: use classifyNB(), and build a new test function that computes the document error rate
(6) Use the algorithm: build a complete program that classifies a set of documents and prints the misclassified ones
Google Colab notebook (optional)
In [1]:
from google.colab import drive
drive.mount("/content/drive")
Mounted at /content/drive
Prepare the data: tokenizing text
In [3]:
mySent = 'This book is the best book on Python or M.L. I have ever laid eyes upon.'
In [12]:
import re
regEx = re.compile(r'\W+')  # raw string avoids the invalid-escape warning
listOfTokens = regEx.split(mySent)
listOfTokens
Out[12]:
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M', 'L', 'I', 'have', 'ever', 'laid', 'eyes', 'upon', '']
In [13]:
[tok for tok in listOfTokens if len(tok) > 0]
Out[13]:
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M', 'L', 'I', 'have', 'ever', 'laid', 'eyes', 'upon']
In [14]:
[tok.lower() for tok in listOfTokens if len(tok) > 0]
Out[14]:
['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'm', 'l', 'i', 'have', 'ever', 'laid', 'eyes', 'upon']
In [22]:
emailText = open('/content/drive/MyDrive/MachineLearning/机器学习/朴素贝叶斯/使用朴素贝叶斯过滤垃圾邮件/email/ham/6.txt', 'r', encoding='latin-1').read()
emailText
Out[22]:
'Hello,\n\nSince you are an owner of at least one Google Groups group that uses the customized welcome message, pages or files, we are writing to inform you that we will no longer be supporting these features starting February 2011. We made this decision so that we can focus on improving the core functionalities of Google Groups -- mailing lists and forum discussions. Instead of these features, we encourage you to use products that are designed specifically for file storage and page creation, such as Google Docs and Google Sites.\n\nFor example, you can easily create your pages on Google Sites and share the site (http://www.google.com/support/sites/bin/answer.py?hl=en&answer=174623) with the members of your group. You can also store your files on the site by attaching files to pages (http://www.google.com/support/sites/bin/answer.py?hl=en&answer=90563) on the site. If you\x92re just looking for a place to upload your files so that your group members can download them, we suggest you try Google Docs. You can upload files (http://docs.google.com/support/bin/answer.py?hl=en&answer=50092) and share access with either a group (http://docs.google.com/support/bin/answer.py?hl=en&answer=66343) or an individual (http://docs.google.com/support/bin/answer.py?hl=en&answer=86152), assigning either edit or download only access to the files.\n\nyou have received this mandatory email service announcement to update you about important changes to Google Groups.'
In [18]:
listOfTokens = regEx.split(emailText)
In [20]:
listOfTokens[:10]
Out[20]:
['Hello', 'Since', 'you', 'are', 'an', 'owner', 'of', 'at', 'least', 'one']
Test the algorithm: cross-validation with naive Bayes
In [46]:
def createVocabList(dataSet):
    # union of all tokens across all documents
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)
    return list(vocabSet)
In [47]:
def setOfWords2Vec(vocabList, inputSet):
    # set-of-words model: mark 1 for each vocabulary word present in the document
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec
In [48]:
def textParse(bigString):
    # split on non-word characters, drop tokens of length <= 2, lowercase the rest
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]
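As a quick sanity check, the same parsing rule can be sketched standalone (re-stated here with an illustrative snake_case name so the cell runs on its own) and applied to the earlier sample sentence:

```python
import re

def text_parse(big_string):
    # split on runs of non-word characters, keep lowercase tokens longer than 2 chars
    tokens = re.split(r'\W+', big_string)
    return [tok.lower() for tok in tokens if len(tok) > 2]

sample = 'This book is the best book on Python or M.L. I have ever laid eyes upon.'
print(text_parse(sample))
# short tokens such as 'is', 'on', 'M', 'L', 'I' are filtered out
```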
In [49]:
import random
from numpy import *
In [50]:
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # Laplace smoothing: initialize counts to 1 and denominators to 2
    p0Num = ones(numWords)
    p1Num = ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # take logs so later products of many small probabilities don't underflow
    p1Vect = log(p1Num / p1Denom)
    p0Vect = log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive
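The `ones(...)` / `2.0` initialization in trainNB0 is Laplace (add-one) smoothing: a word never seen in a class still gets nonzero probability, so one unseen word cannot zero out the whole product. A minimal sketch on a made-up two-document corpus (the data and names here are illustrative, not from the email set):

```python
import numpy as np

# toy corpus: 2 documents over a 3-word vocabulary; doc 0 is class 1, doc 1 is class 0
train_matrix = np.array([[1, 0, 1],
                         [0, 1, 0]])
train_category = np.array([1, 0])

# class-1 word counts with add-one smoothing, mirroring trainNB0
p1_num = np.ones(3) + train_matrix[train_category == 1].sum(axis=0)   # [2, 1, 2]
p1_denom = 2.0 + train_matrix[train_category == 1].sum()              # 4.0
p1_vect = np.log(p1_num / p1_denom)

# word 1 never appears in class 1, yet still gets probability 0.25, not 0
print(np.exp(p1_vect))
```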
In [51]:
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # compare log-posteriors: sum of log P(w|c) plus log P(c)
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0
In [61]:
def spamTest():
    docList = []
    classList = []
    fullText = []
    # load 25 spam (class 1) and 25 ham (class 0) emails
    for i in range(1, 26):
        wordList = textParse(open('/content/drive/MyDrive/MachineLearning/机器学习/朴素贝叶斯/使用朴素贝叶斯过滤垃圾邮件/email/spam/%d.txt' % i, 'r', encoding='latin-1').read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('/content/drive/MyDrive/MachineLearning/机器学习/朴素贝叶斯/使用朴素贝叶斯过滤垃圾邮件/email/ham/%d.txt' % i, 'r', encoding='latin-1').read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)
    sums = 0
    for _ in range(10):  # repeat the hold-out test 10 times (was `i`, which shadowed the loop above)
        trainingSet = list(range(50))
        testSet = []
        # randomly hold out 10 of the 50 documents for testing
        for _ in range(10):
            randIndex = int(random.uniform(0, len(trainingSet)))
            testSet.append(trainingSet[randIndex])
            del(trainingSet[randIndex])
        trainMat = []
        trainClasses = []
        for docIndex in trainingSet:
            trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))
            trainClasses.append(classList[docIndex])
        p0V, p1V, pSpam = trainNB0(trainMat, trainClasses)
        errorCount = 0
        for docIndex in testSet:
            wordVector = setOfWords2Vec(vocabList, docList[docIndex])
            res = classifyNB(wordVector, p0V, p1V, pSpam)
            if res != classList[docIndex]:
                errorCount += 1
        print("the error rate is: ", float(errorCount) / len(testSet))
        sums += float(errorCount) / len(testSet)
    print("the average error rate is: ", sums / 10.0)
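The hold-out selection inside spamTest can be isolated: draw 10 random indices from a pool of 50, deleting each pick so the test set never overlaps the training set. A standalone sketch of just that step:

```python
import random

training_set = list(range(50))
test_set = []
for _ in range(10):
    # pick a uniform random position among the indices still in the pool
    rand_index = int(random.uniform(0, len(training_set)))
    test_set.append(training_set[rand_index])
    del training_set[rand_index]

print(len(training_set), len(test_set))  # 40 10
```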
In [62]:
spamTest()
the error rate is:  0.0
the error rate is:  0.2
the error rate is:  0.2
the error rate is:  0.0
the error rate is:  0.1
the error rate is:  0.0
the error rate is:  0.0
the error rate is:  0.0
the error rate is:  0.0
the error rate is:  0.0
the average error rate is:  0.05
In [63]:
spamTest()
the error rate is:  0.2
the error rate is:  0.0
the error rate is:  0.0
the error rate is:  0.1
the error rate is:  0.0
the error rate is:  0.0
the error rate is:  0.1
the error rate is:  0.0
the error rate is:  0.0
the error rate is:  0.0
the average error rate is:  0.04
In [64]:
spamTest()
the error rate is:  0.1
the error rate is:  0.1
the error rate is:  0.0
the error rate is:  0.0
the error rate is:  0.1
the error rate is:  0.0
the error rate is:  0.1
the error rate is:  0.1
the error rate is:  0.0
the error rate is:  0.1
the average error rate is:  0.06