Filtering Spam with the Naive Bayes Algorithm

Introduction: this notebook uses the naive Bayes algorithm to filter spam email. The overall workflow is:

(1) Collect the data: text files are provided
(2) Prepare the data: parse the text files into token vectors
(3) Analyze the data: inspect the tokens to make sure the parsing is correct
(4) Train the algorithm: use the trainNB0() function
(5) Test the algorithm: use classifyNB(), and build a new test function that computes the document error rate
(6) Use the algorithm: build a complete program that classifies a set of documents and prints out the misclassified ones
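
The decision rule behind steps (4) and (5) is Bayes' theorem: for a word vector w and class c_i (1 = spam, 0 = ham),

p(c_i | w) = p(w | c_i) p(c_i) / p(w)

Because p(w) is the same for both classes, it is enough to compare p(w | c_i) p(c_i). The "naive" part is the assumption that words are conditionally independent given the class, so p(w | c_i) factors into a product of per-word probabilities p(w_j | c_i); the code below works with sums of logarithms instead of that product to avoid floating-point underflow.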

Google Colab notebook (optional)

In [1]:

from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


Prepare the data: tokenizing the text

In [3]:

mySent = 'This book is the best book on Python or M.L. I have ever laid eyes upon.'

In [12]:

import re
regEx = re.compile(r'\W+')  # split on any run of non-word characters
listOfTokens = regEx.split(mySent)
listOfTokens
Out[12]:

['This',
 'book',
 'is',
 'the',
 'best',
 'book',
 'on',
 'Python',
 'or',
 'M',
 'L',
 'I',
 'have',
 'ever',
 'laid',
 'eyes',
 'upon',
 '']
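
The trailing empty string is produced because the final period also matches \W+, so re.split emits an empty token at the end of the sentence; the next cell filters out these zero-length tokens.
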
In [13]:
[tok for tok in listOfTokens if len(tok) > 0]

Out[13]:

['This',
 'book',
 'is',
 'the',
 'best',
 'book',
 'on',
 'Python',
 'or',
 'M',
 'L',
 'I',
 'have',
 'ever',
 'laid',
 'eyes',
 'upon']

In [14]:

[tok.lower() for tok in listOfTokens if len(tok) > 0]

Out[14]:

['this',
 'book',
 'is',
 'the',
 'best',
 'book',
 'on',
 'python',
 'or',
 'm',
 'l',
 'i',
 'have',
 'ever',
 'laid',
 'eyes',
 'upon']

In [22]:

emailText = open('/content/drive/MyDrive/MachineLearning/机器学习/朴素贝叶斯/使用朴素贝叶斯过滤垃圾邮件/email/ham/6.txt', 'r', encoding='latin-1').read()
emailText


Out[22]:

'Hello,\n\nSince you are an owner of at least one Google Groups group that uses the customized welcome message, pages or files, we are writing to inform you that we will no longer be supporting these features starting February 2011. We made this decision so that we can focus on improving the core functionalities of Google Groups -- mailing lists and forum discussions.  Instead of these features, we encourage you to use products that are designed specifically for file storage and page creation, such as Google Docs and Google Sites.\n\nFor example, you can easily create your pages on Google Sites and share the site (http://www.google.com/support/sites/bin/answer.py?hl=en&answer=174623) with the members of your group. You can also store your files on the site by attaching files to pages (http://www.google.com/support/sites/bin/answer.py?hl=en&answer=90563) on the site. If you\x92re just looking for a place to upload your files so that your group members can download them, we suggest you try Google Docs. You can upload files (http://docs.google.com/support/bin/answer.py?hl=en&answer=50092) and share access with either a group (http://docs.google.com/support/bin/answer.py?hl=en&answer=66343) or an individual (http://docs.google.com/support/bin/answer.py?hl=en&answer=86152), assigning either edit or download only access to the files.\n\nyou have received this mandatory email service announcement to update you about important changes to Google Groups.'
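
The \x92 byte in this message is a Windows-1252 style curly apostrophe. Reading the files with encoding='latin-1' maps every possible byte value to a character, so the read never fails, whereas decoding them as UTF-8 would raise a UnicodeDecodeError on bytes like this one.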

In [18]:

listOfTokens = regEx.split(emailText)

In [20]:

listOfTokens[:10]

Out[20]:

['Hello', 'Since', 'you', 'are', 'an', 'owner', 'of', 'at', 'least', 'one']

Test the algorithm: cross-validation with naive Bayes

In [46]:

def createVocabList(dataSet):
  # build the vocabulary: every unique word seen across all documents
  vocabSet = set([])
  for document in dataSet:
    vocabSet = vocabSet | set(document)  # union with the words of this document
  return list(vocabSet)
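
As a quick sanity check, here is what createVocabList does on a small hand-made dataset (toyDocs is a hypothetical input; because a set has no defined order, the ordering of the returned list varies between runs):

toyDocs = [['spam', 'buy', 'now'], ['hello', 'buy', 'milk']]
createVocabList(toyDocs)
# -> the five unique words, e.g. ['hello', 'now', 'milk', 'buy', 'spam'], in some arbitrary order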

In [47]:

def setOfWords2Vec(vocabList, inputSet):
  # set-of-words model: mark 1 for every vocabulary word that appears in the document
  returnVec = [0] * len(vocabList)
  for word in inputSet:
    if word in vocabList:
      returnVec[vocabList.index(word)] = 1
    else:
      print("the word: %s is not in my Vocabulary!" % word)
  return returnVec
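
setOfWords2Vec only records whether a word occurs at all (a set-of-words model). A common variant, sketched here for reference and not used in the rest of this notebook, counts how many times each word occurs:

def bagOfWords2Vec(vocabList, inputSet):
  # bag-of-words model: count occurrences instead of just recording presence
  returnVec = [0] * len(vocabList)
  for word in inputSet:
    if word in vocabList:
      returnVec[vocabList.index(word)] += 1
  return returnVec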

In [48]:

def textParse(bigString):
  # split on non-word characters, lowercase, and drop tokens of one or two characters
  listOfTokens = re.split(r'\W+', bigString)
  return [tok.lower() for tok in listOfTokens if len(tok) > 2]
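
Applied to the sentence from earlier, textParse combines splitting, lowercasing, and the length filter in one step; one- and two-character tokens such as 'is', 'on', 'M', 'L' and 'I' are dropped:

textParse(mySent)
# -> ['this', 'book', 'the', 'best', 'book', 'python', 'have', 'ever', 'laid', 'eyes', 'upon']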

In [49]:

import random
from numpy import *

In [50]:

def trainNB0(trainMatrix, trainCategory):
  numTrainDocs = len(trainMatrix)
  numWords = len(trainMatrix[0])
  pAbusive = sum(trainCategory) / float(numTrainDocs)  # prior probability of class 1 (spam in this dataset)
  # Laplace smoothing: start the counts at 1 and the denominators at 2,
  # so a word unseen in one class never produces a zero probability
  p0Num = ones(numWords)
  p1Num = ones(numWords)
  p0Denom = 2.0
  p1Denom = 2.0
  for i in range(numTrainDocs):
    if trainCategory[i] == 1:
      p1Num += trainMatrix[i]
      p1Denom += sum(trainMatrix[i])
    else:
      p0Num += trainMatrix[i]
      p0Denom += sum(trainMatrix[i])
  # take logarithms so classifyNB can add log-probabilities instead of
  # multiplying many small numbers, which would underflow to zero
  p1Vect = log(p1Num / p1Denom)
  p0Vect = log(p0Num / p0Denom)
  return p0Vect, p1Vect, pAbusive

In [51]:

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
  # element-wise product keeps only the log-probabilities of words present in the
  # document; adding the log prior gives log p(w | c) + log p(c) for each class
  p1 = sum(vec2Classify * p1Vec) + log(pClass1)
  p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
  if p1 > p0:
    return 1
  else:
    return 0
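
A minimal end-to-end sketch on made-up data shows how the pieces fit together (the documents, labels, and query below are purely hypothetical):

demoDocs = [['buy', 'cheap', 'pills'], ['meeting', 'tomorrow', 'lunch'],
            ['cheap', 'offer', 'now'], ['lunch', 'with', 'mom']]
demoLabels = [1, 0, 1, 0]                                    # 1 = spam, 0 = ham
demoVocab = createVocabList(demoDocs)
demoTrainMat = [setOfWords2Vec(demoVocab, doc) for doc in demoDocs]
p0V, p1V, pSpam = trainNB0(demoTrainMat, demoLabels)
classifyNB(setOfWords2Vec(demoVocab, ['cheap', 'pills']), p0V, p1V, pSpam)  # expected: 1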

In [61]:

def spamTest():
  docList = []
  classList = []
  fullText = []
  # load 25 spam and 25 ham messages (class 1 = spam, 0 = ham)
  for i in range(1, 26):
    wordList = textParse(open('/content/drive/MyDrive/MachineLearning/机器学习/朴素贝叶斯/使用朴素贝叶斯过滤垃圾邮件/email/spam/%d.txt'%i, 'r', encoding='latin-1').read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(1)
    wordList = textParse(open('/content/drive/MyDrive/MachineLearning/机器学习/朴素贝叶斯/使用朴素贝叶斯过滤垃圾邮件/email/ham/%d.txt'%i, 'r', encoding='latin-1').read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(0)
  vocabList = createVocabList(docList)
  sums = 0
  for i in range(10):                # repeat the hold-out split 10 times
    trainingSet = list(range(50))
    testSet = []
    for _ in range(10):              # randomly reserve 10 of the 50 messages for testing
      randIndex = int(random.uniform(0, len(trainingSet)))
      testSet.append(trainingSet[randIndex])
      del(trainingSet[randIndex])
    trainMat = []
    trainClasses = []
    for docIndex in trainingSet:     # train on the remaining 40 messages
      trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))
      trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(trainMat, trainClasses)
    errorCount = 0
    for docIndex in testSet:         # classify the held-out messages and count mistakes
      wordVector = setOfWords2Vec(vocabList, docList[docIndex])
      res = classifyNB(wordVector, p0V, p1V, pSpam)
      if res != classList[docIndex]:
        errorCount += 1
    print("the error rate is: ", float(errorCount)/len(testSet))
    sums += float(errorCount)/len(testSet)
  print("the average error rate is: ", sums/10.0)

In [62]:

spamTest()
the error rate is:  0.0
the error rate is:  0.2
the error rate is:  0.2
the error rate is:  0.0
the error rate is:  0.1
the error rate is:  0.0
the error rate is:  0.0
the error rate is:  0.0
the error rate is:  0.0
the error rate is:  0.0
the average error rate is:  0.05

In [63]:

spamTest()
the error rate is:  0.2
the error rate is:  0.0
the error rate is:  0.0
the error rate is:  0.1
the error rate is:  0.0
the error rate is:  0.0
the error rate is:  0.1
the error rate is:  0.0
the error rate is:  0.0
the error rate is:  0.0
the average error rate is:  0.04

In [64]:

spamTest()
the error rate is:  0.1
the error rate is:  0.1
the error rate is:  0.0
the error rate is:  0.0
the error rate is:  0.1
the error rate is:  0.0
the error rate is:  0.1
the error rate is:  0.1
the error rate is:  0.0
the error rate is:  0.1
the average error rate is:  0.06