Filtering Spam with the Naive Bayes Algorithm

Introduction: filtering spam email with the naive Bayes algorithm.

The overall approach:

(1) Collect data: text files are provided.

(2) Prepare data: parse the text files into token vectors.

(3) Analyze data: inspect the tokens to make sure parsing is correct.

(4) Train the algorithm: use the trainNB0() function.

(5) Test the algorithm: use classifyNB(), and build a new test function that computes the error rate over a set of documents.

(6) Use the algorithm: build a complete program that classifies a group of documents and prints out the misclassified ones.

Google Colab notebook (optional)

In [1]:

from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


Prepare the data: splitting text into tokens

In [3]:

mySent = 'This book is the best book on Python or M.L. I have ever laid eyes upon.'

In [12]:

import re
regEx = re.compile(r'\W+')   # split on runs of non-word characters (raw string avoids an invalid-escape warning)
listOfTokens = regEx.split(mySent)
listOfTokens
Out[12]:

['This',
 'book',
 'is',
 'the',
 'best',
 'book',
 'on',
 'Python',
 'or',
 'M',
 'L',
 'I',
 'have',
 'ever',
 'laid',
 'eyes',
 'upon',
 '']
In [13]:
[tok for tok in listOfTokens if len(tok) > 0]

Out[13]:

['This',
 'book',
 'is',
 'the',
 'best',
 'book',
 'on',
 'Python',
 'or',
 'M',
 'L',
 'I',
 'have',
 'ever',
 'laid',
 'eyes',
 'upon']

In [14]:

[tok.lower() for tok in listOfTokens if len(tok) > 0]

Out[14]:

['this',
 'book',
 'is',
 'the',
 'best',
 'book',
 'on',
 'python',
 'or',
 'm',
 'l',
 'i',
 'have',
 'ever',
 'laid',
 'eyes',
 'upon']

In [22]:

emailText = open('/content/drive/MyDrive/MachineLearning/机器学习/朴素贝叶斯/使用朴素贝叶斯过滤垃圾邮件/email/ham/6.txt', 'r', encoding='latin-1').read()
emailText


Out[22]:

'Hello,\n\nSince you are an owner of at least one Google Groups group that uses the customized welcome message, pages or files, we are writing to inform you that we will no longer be supporting these features starting February 2011. We made this decision so that we can focus on improving the core functionalities of Google Groups -- mailing lists and forum discussions.  Instead of these features, we encourage you to use products that are designed specifically for file storage and page creation, such as Google Docs and Google Sites.\n\nFor example, you can easily create your pages on Google Sites and share the site (http://www.google.com/support/sites/bin/answer.py?hl=en&answer=174623) with the members of your group. You can also store your files on the site by attaching files to pages (http://www.google.com/support/sites/bin/answer.py?hl=en&answer=90563) on the site. If you\x92re just looking for a place to upload your files so that your group members can download them, we suggest you try Google Docs. You can upload files (http://docs.google.com/support/bin/answer.py?hl=en&answer=50092) and share access with either a group (http://docs.google.com/support/bin/answer.py?hl=en&answer=66343) or an individual (http://docs.google.com/support/bin/answer.py?hl=en&answer=86152), assigning either edit or download only access to the files.\n\nyou have received this mandatory email service announcement to update you about important changes to Google Groups.'

In [18]:

listOfTokens = regEx.split(emailText)

In [20]:

listOfTokens[:10]

Out[20]:

['Hello', 'Since', 'you', 'are', 'an', 'owner', 'of', 'at', 'least', 'one']

Test the algorithm: hold-out cross-validation with naive Bayes

In [46]:

def createVocabList(dataSet):
  vocabSet = set([])                     # start with an empty set
  for document in dataSet:
    vocabSet = vocabSet | set(document)  # union with the words of each document
  return list(vocabSet)                  # vocabulary: a list of all unique words

In [47]:

def setOfWords2Vec(vocabList, inputSet):
  returnVec = [0] * len(vocabList)           # one slot per vocabulary word
  for word in inputSet:
    if word in vocabList:
      returnVec[vocabList.index(word)] = 1   # mark presence (set-of-words model)
    else:
      print("the word: %s is not in my Vocabulary!" % word)
  return returnVec
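
setOfWords2Vec() only records whether a word occurs (a set-of-words model). For reference, a bag-of-words variant that counts occurrences instead could look like the sketch below; this is an illustrative helper and is not used by spamTest() later in this notebook.

def bagOfWords2VecMN(vocabList, inputSet):
  returnVec = [0] * len(vocabList)
  for word in inputSet:
    if word in vocabList:
      returnVec[vocabList.index(word)] += 1   # count occurrences instead of just marking presence
  return returnVec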

In [48]:

def textParse(bigString):
  listOfTokens = re.split(r'\W+', bigString)                     # split on runs of non-word characters
  return [tok.lower() for tok in listOfTokens if len(tok) > 2]   # lowercase, drop tokens of length <= 2
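
As a quick sanity check, applying textParse() to the sample sentence from earlier should drop the short tokens (such as 'is', 'on', 'M', 'L') and lowercase the rest:

textParse(mySent)
# expected: ['this', 'book', 'the', 'best', 'book', 'python', 'have', 'ever', 'laid', 'eyes', 'upon']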

In [49]:

import random
from numpy import *   # provides ones() and log() used by trainNB0() and classifyNB()

In [50]:

def trainNB0(trainMatrix, trainCategory):
  numTrainDocs = len(trainMatrix)
  numWords = len(trainMatrix[0])
  pAbusive = sum(trainCategory) / float(numTrainDocs)  # prior probability of class 1 (spam, in this application)
  p0Num = ones(numWords)    # initialize word counts to 1 (Laplace smoothing)
  p1Num = ones(numWords)
  p0Denom = 2.0             # ... and the denominators to 2
  p1Denom = 2.0
  for i in range(numTrainDocs):
    if trainCategory[i] == 1:
      p1Num += trainMatrix[i]
      p1Denom += sum(trainMatrix[i])
    else:
      p0Num += trainMatrix[i]
      p0Denom += sum(trainMatrix[i])
  p1Vect = log(p1Num / p1Denom)   # log-probabilities avoid floating-point underflow later
  p0Vect = log(p0Num / p0Denom)
  return p0Vect, p1Vect, pAbusive
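
trainNB0() initializes the counts with ones and the denominators with 2.0 — Laplace (add-one) smoothing, so a word unseen in one class does not force a zero probability — and it returns log-probabilities, since multiplying many small probabilities would underflow. In symbols, with $x_{ij}$ the $j$-th entry of the $i$-th training vector and $y_i$ its label, the returned quantities correspond to:

$$
\log \hat{p}(w_j \mid c) = \log \frac{1 + \sum_{i:\, y_i = c} x_{ij}}{2 + \sum_{i:\, y_i = c} \sum_k x_{ik}}, \qquad
\hat{p}(c{=}1) = \frac{1}{N}\sum_i y_i .
$$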

In [51]:

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
  p1 = sum(vec2Classify * p1Vec) + log(pClass1)        # log posterior (up to a shared constant) for class 1
  p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)  # log posterior (up to a shared constant) for class 0
  if p1 > p0:
    return 1
  else:
    return 0
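
classifyNB() compares the two class scores in log space; the shared evidence term $p(\mathbf{w})$ cancels, so only the log prior plus the sum of log word likelihoods over the words present in the vector $\mathbf{v}$ (vec2Classify) matters:

$$
\text{score}(c) = \log \hat{p}(c) + \sum_j v_j \,\log \hat{p}(w_j \mid c), \qquad
\text{predict spam } (c{=}1) \iff \text{score}(1) > \text{score}(0).
$$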

In [61]:

def spamTest():
  docList = []
  classList = []
  fullText = []
  for i in range(1, 26):
    # load 25 spam and 25 ham emails
    wordList = textParse(open('/content/drive/MyDrive/MachineLearning/机器学习/朴素贝叶斯/使用朴素贝叶斯过滤垃圾邮件/email/spam/%d.txt'%i, 'r', encoding='latin-1').read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(1)    # 1 = spam
    wordList = textParse(open('/content/drive/MyDrive/MachineLearning/机器学习/朴素贝叶斯/使用朴素贝叶斯过滤垃圾邮件/email/ham/%d.txt'%i, 'r', encoding='latin-1').read())
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(0)    # 0 = ham
  vocabList = createVocabList(docList)
  sums = 0
  for i in range(10):      # repeat the hold-out experiment 10 times
    trainingSet = list(range(50))
    testSet = []
    for j in range(10):    # randomly move 10 documents into the test set
      randIndex = int(random.uniform(0, len(trainingSet)))
      testSet.append(trainingSet[randIndex])
      del(trainingSet[randIndex])
    trainMat = []
    trainClasses = []
    for docIndex in trainingSet:
      trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))
      trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(trainMat, trainClasses)
    errorCount = 0
    for docIndex in testSet:
      wordVector = setOfWords2Vec(vocabList, docList[docIndex])
      res = classifyNB(wordVector, p0V, p1V, pSpam)
      if res != classList[docIndex]:
        errorCount += 1
    print("the error rate is: ", float(errorCount)/len(testSet))
    sums += float(errorCount)/len(testSet)
  print("the average error rate is: ", sums/10.0)

In [62]:

spamTest()
the error rate is:  0.0
the error rate is:  0.2
the error rate is:  0.2
the error rate is:  0.0
the error rate is:  0.1
the error rate is:  0.0
the error rate is:  0.0
the error rate is:  0.0
the error rate is:  0.0
the error rate is:  0.0
the average error rate is:  0.05

In [63]:

spamTest()
the error rate is:  0.2
the error rate is:  0.0
the error rate is:  0.0
the error rate is:  0.1
the error rate is:  0.0
the error rate is:  0.0
the error rate is:  0.1
the error rate is:  0.0
the error rate is:  0.0
the error rate is:  0.0
the average error rate is:  0.04

In [64]:

spamTest()
the error rate is:  0.1
the error rate is:  0.1
the error rate is:  0.0
the error rate is:  0.0
the error rate is:  0.1
the error rate is:  0.0
the error rate is:  0.1
the error rate is:  0.1
the error rate is:  0.0
the error rate is:  0.1
the average error rate is:  0.06