Filtering Spam Email with the Naive Bayes Algorithm

Introduction: filtering spam email with the naive Bayes algorithm. The overall workflow is:

(1) Collect the data: text files are provided.

(2) Prepare the data: parse the text files into token vectors.

(3) Analyze the data: inspect the tokens to make sure the parsing is correct.

(4) Train the algorithm: use the trainNB0() function.

(5) Test the algorithm: use classifyNB(), and build a new test function that computes the document error rate (the decision rule these two functions implement is written out right after this list).

(6) Use the algorithm: build a complete program that classifies a set of documents and prints out the misclassified ones.
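The rule behind these steps is Bayes' theorem combined with the "naive" assumption that words are conditionally independent given the class. For a word vector w = (w_1, ..., w_n) and class c_i (spam or ham), the classifier compares the posteriors:

$$
P(c_i \mid w) = \frac{P(w \mid c_i)\,P(c_i)}{P(w)},
\qquad
P(w \mid c_i) \approx \prod_{k=1}^{n} P(w_k \mid c_i)
$$

Since P(w) is the same for both classes, it is enough to compare \(\log P(c_i) + \sum_k \log P(w_k \mid c_i)\), which is exactly what classifyNB() does later in this notebook.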

Google Colab notebook (optional)

In [1]:

from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


Prepare the data: splitting text into tokens

In [3]:

mySent = 'This book is the best book on Python or M.L. I have ever laid eyes upon.'

In [12]:

import re
regEx = re.compile(r'\W+')  # split on runs of non-word (non-alphanumeric) characters
listOfTokens = regEx.split(mySent)
listOfTokens
Out[12]:

['This',
 'book',
 'is',
 'the',
 'best',
 'book',
 'on',
 'Python',
 'or',
 'M',
 'L',
 'I',
 'have',
 'ever',
 'laid',
 'eyes',
 'upon',
 '']
In [13]:

[tok for tok in listOfTokens if len(tok) > 0]

Out[13]:

['This',
 'book',
 'is',
 'the',
 'best',
 'book',
 'on',
 'Python',
 'or',
 'M',
 'L',
 'I',
 'have',
 'ever',
 'laid',
 'eyes',
 'upon']

In [14]:

[tok.lower() for tok in listOfTokens if len(tok) > 0]

Out[14]:

['this',
 'book',
 'is',
 'the',
 'best',
 'book',
 'on',
 'python',
 'or',
 'm',
 'l',
 'i',
 'have',
 'ever',
 'laid',
 'eyes',
 'upon']

In [22]:

emailText = open('/content/drive/MyDrive/MachineLearning/机器学习/朴素贝叶斯/使用朴素贝叶斯过滤垃圾邮件/email/ham/6.txt', 'r', encoding='latin-1').read()
emailText


Out[22]:

'Hello,\n\nSince you are an owner of at least one Google Groups group that uses the customized welcome message, pages or files, we are writing to inform you that we will no longer be supporting these features starting February 2011. We made this decision so that we can focus on improving the core functionalities of Google Groups -- mailing lists and forum discussions.  Instead of these features, we encourage you to use products that are designed specifically for file storage and page creation, such as Google Docs and Google Sites.\n\nFor example, you can easily create your pages on Google Sites and share the site (http://www.google.com/support/sites/bin/answer.py?hl=en&answer=174623) with the members of your group. You can also store your files on the site by attaching files to pages (http://www.google.com/support/sites/bin/answer.py?hl=en&answer=90563) on the site. If you\x92re just looking for a place to upload your files so that your group members can download them, we suggest you try Google Docs. You can upload files (http://docs.google.com/support/bin/answer.py?hl=en&answer=50092) and share access with either a group (http://docs.google.com/support/bin/answer.py?hl=en&answer=66343) or an individual (http://docs.google.com/support/bin/answer.py?hl=en&answer=86152), assigning either edit or download only access to the files.\n\nyou have received this mandatory email service announcement to update you about important changes to Google Groups.'

In [18]:

listOfTokens = regEx.split(emailText)

In [20]:

listOfTokens[:10]

Out[20]:

['Hello', 'Since', 'you', 'are', 'an', 'owner', 'of', 'at', 'least', 'one']

Test the algorithm: cross-validation with naive Bayes

In [46]:

def createVocabList(dataSet):
  # Build the vocabulary as the union of all words that appear in any document
  vocabSet = set([])
  for document in dataSet:
    vocabSet = vocabSet | set(document)  # set union
  return list(vocabSet)

In [47]:

def setOfWords2Vec(vocabList, inputSet):
  # Set-of-words model: the output vector holds 1 for every vocabulary word
  # that appears in the document and 0 otherwise
  returnVec = [0] * len(vocabList)
  for word in inputSet:
    if word in vocabList:
      returnVec[vocabList.index(word)] = 1
    else:
      print("the word: %s is not in my Vocabulary!" % word)
  return returnVec
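As a quick sanity check, these two helpers can be exercised on a tiny made-up corpus (sampleDocs below is hypothetical, not part of the notebook's data); note that the vocabulary order is not deterministic because it comes from a set:

sampleDocs = [['spam', 'offer', 'now'], ['project', 'meeting', 'now']]
vocab = createVocabList(sampleDocs)             # union of all words, order not guaranteed
print(vocab)
print(setOfWords2Vec(vocab, ['offer', 'now']))  # 1 at the positions of 'offer' and 'now', 0 elsewhere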

In [48]:

def textParse(bigString):
  # Split on runs of non-word characters, keep only tokens longer than
  # two characters, and lower-case them
  listOfTokens = re.split(r'\W+', bigString)
  return [tok.lower() for tok in listOfTokens if len(tok) > 2]
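A quick check of textParse() on the mySent string defined earlier, shown only to illustrate the len(tok) > 2 filter:

textParse(mySent)
# expected: lower-cased tokens longer than two characters, e.g.
# ['this', 'book', 'the', 'best', 'book', 'python', 'have', 'ever', 'laid', 'eyes', 'upon']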

In [49]:

import random
from numpy import *  # provides ones(), log(), array() and the vectorized sum() used below

In [50]:

def trainNB0(trainMatrix, trainCategory):
  numTrainDocs = len(trainMatrix)
  numWords = len(trainMatrix[0])
  pAbusive = sum(trainCategory) / float(numTrainDocs)  # prior probability of class 1 (spam)
  # Initialize counts to 1 and denominators to 2 so that unseen words
  # never produce a zero probability (a simplified add-one smoothing)
  p0Num = ones(numWords)
  p1Num = ones(numWords)
  p0Denom = 2.0
  p1Denom = 2.0
  for i in range(numTrainDocs):
    if trainCategory[i] == 1:
      p1Num += trainMatrix[i]
      p1Denom += sum(trainMatrix[i])
    else:
      p0Num += trainMatrix[i]
      p0Denom += sum(trainMatrix[i])
  # Work in log space to avoid underflow when many small probabilities are multiplied
  p1Vect = log(p1Num / p1Denom)
  p0Vect = log(p0Num / p0Denom)
  return p0Vect, p1Vect, pAbusive
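Written out, for class 1 (spam) trainNB0() estimates each word probability with the simplified add-one smoothing above (the denominator adds 2 rather than the vocabulary size) and returns it in log form:

$$
\hat{P}(w_k \mid c_1) = \frac{1 + \sum_{i:\,y_i = 1} x_{ik}}{2 + \sum_{i:\,y_i = 1} \sum_{j} x_{ij}}
$$

where \(x_{ik}\) is the k-th entry of the i-th training vector; class 0 is handled symmetrically. The logs let classifyNB() sum terms instead of multiplying many small probabilities.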

In [51]:

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
  # Compare log posteriors: sum of log P(w_k | class) plus the log class prior;
  # P(w) is the same for both classes, so it can be dropped
  p1 = sum(vec2Classify * p1Vec) + log(pClass1)
  p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
  if p1 > p0:
    return 1
  else:
    return 0
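Before running on the email corpus, the whole pipeline can be exercised on a tiny made-up example (toyDocs, toyLabels and the test document are hypothetical, not from the notebook's data):

toyDocs = [['buy', 'cheap', 'pills', 'now'],
           ['meeting', 'tomorrow', 'about', 'the', 'project'],
           ['cheap', 'offer', 'buy', 'now'],
           ['project', 'report', 'attached']]
toyLabels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

toyVocab = createVocabList(toyDocs)
toyTrainMat = [setOfWords2Vec(toyVocab, doc) for doc in toyDocs]
p0V, p1V, pSpam = trainNB0(array(toyTrainMat), array(toyLabels))

testVec = array(setOfWords2Vec(toyVocab, ['cheap', 'pills']))
print(classifyNB(testVec, p0V, p1V, pSpam))  # expected to print 1 (spam)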

In [61]:

def spamTest():
  docList = []
  classList = []
  fullText = []
  # Load 25 spam and 25 ham emails and parse each into a token list
  for i in range(1, 26):
    wordList = textParse(open('/content/drive/MyDrive/MachineLearning/机器学习/朴素贝叶斯/使用朴素贝叶斯过滤垃圾邮件/email/spam/%d.txt'%i, 'r', encoding='latin-1').read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(1)
    wordList = textParse(open('/content/drive/MyDrive/MachineLearning/机器学习/朴素贝叶斯/使用朴素贝叶斯过滤垃圾邮件/email/ham/%d.txt'%i, 'r', encoding='latin-1').read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(0)
  vocabList = createVocabList(docList)
  sums = 0
  # Repeat the random hold-out validation 10 times and average the error rates
  for trial in range(10):
    trainingSet = list(range(50))
    testSet = []
    # Randomly move 10 of the 50 emails into the test set
    for j in range(10):
      randIndex = int(random.uniform(0, len(trainingSet)))
      testSet.append(trainingSet[randIndex])
      del(trainingSet[randIndex])
    trainMat = []
    trainClasses = []
    for docIndex in trainingSet:
      trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))
      trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(trainMat, trainClasses)
    errorCount = 0
    for docIndex in testSet:
      wordVector = setOfWords2Vec(vocabList, docList[docIndex])
      res = classifyNB(wordVector, p0V, p1V, pSpam)
      if res != classList[docIndex]:
        errorCount += 1
    print("the error rate is: ", float(errorCount)/len(testSet))
    sums += float(errorCount)/len(testSet)
  print("the average error rate is: ", sums/10.0)

In [62]:

spamTest()
the error rate is:  0.0
the error rate is:  0.2
the error rate is:  0.2
the error rate is:  0.0
the error rate is:  0.1
the error rate is:  0.0
the error rate is:  0.0
the error rate is:  0.0
the error rate is:  0.0
the error rate is:  0.0
the average error rate is:  0.05

In [63]:

spamTest()
the error rate is:  0.2
the error rate is:  0.0
the error rate is:  0.0
the error rate is:  0.1
the error rate is:  0.0
the error rate is:  0.0
the error rate is:  0.1
the error rate is:  0.0
the error rate is:  0.0
the error rate is:  0.0
the average error rate is:  0.04

In [64]:

spamTest()
the error rate is:  0.1
the error rate is:  0.1
the error rate is:  0.0
the error rate is:  0.0
the error rate is:  0.1
the error rate is:  0.0
the error rate is:  0.1
the error rate is:  0.1
the error rate is:  0.0
the error rate is:  0.1
the average error rate is:  0.06
本算法展示了在认知无线电网络中,通过游戏理论优化动态频谱访问,提高频谱利用率和物理层安全性。程序运行效果包括负载因子、传输功率、信噪比对用户效用和保密率的影响分析。软件版本:Matlab 2022a。完整代码包含详细中文注释和操作视频。