03 贝叶斯算法 - 案例二 - 新闻数据分类

简介:

01 贝叶斯算法 - 朴素贝叶斯
02 贝叶斯算法 - 案例一 - 鸢尾花数据分类

常规操作

import numpy as np
from time import time
import matplotlib.pyplot as plt
import matplotlib as mpl

from sklearn.datasets import fetch_20newsgroups#引入新闻数据包
from sklearn.feature_extraction.text import TfidfVectorizer#做tfidf编码
from sklearn.feature_selection import SelectKBest, chi2#卡方检验——特征筛选
from sklearn.linear_model import RidgeClassifier
from sklearn.svm import LinearSVC,SVC
from sklearn.naive_bayes import MultinomialNB, BernoulliNB #引入多项式和伯努利的贝叶斯
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

## 设置属性防止中文乱码
mpl.rcParams['font.sans-serif'] = [u'SimHei']
mpl.rcParams['axes.unicode_minus'] = False

基准模型方法

def benchmark(clf,name):
    print (u'分类器:', clf)
    
    ##  设置最优参数,并使用5折交叉验证获取最优参数值
    alpha_can = np.logspace(-2, 1, 10)
    model = GridSearchCV(clf, param_grid={'alpha': alpha_can}, cv=5)
    m = alpha_can.size
    
    ## 如果模型有一个参数是alpha,进行设置
    if hasattr(clf, 'alpha'):
        model.set_params(param_grid={'alpha': alpha_can})
        m = alpha_can.size
    ## 如果模型有一个k近邻的参数,进行设置
    if hasattr(clf, 'n_neighbors'):
        neighbors_can = np.arange(1, 15)
        model.set_params(param_grid={'n_neighbors': neighbors_can})
        m = neighbors_can.size
    ## LinearSVC最优参数配置
    if hasattr(clf, 'C'):
        C_can = np.logspace(1, 3, 3)
        model.set_params(param_grid={'C':C_can})
        m = C_can.size
    ## SVM最优参数设置
    if hasattr(clf, 'C') & hasattr(clf, 'gamma'):
        C_can = np.logspace(1, 3, 3)
        gamma_can = np.logspace(-3, 0, 3)
        model.set_params(param_grid={'C':C_can, 'gamma':gamma_can})
        m = C_can.size * gamma_can.size
    ## 设置深度相关参数,决策树
    if hasattr(clf, 'max_depth'):
        max_depth_can = np.arange(4, 10)
        model.set_params(param_grid={'max_depth': max_depth_can})
        m = max_depth_can.size
    
    ## 模型训练
    t_start = time()
    model.fit(x_train, y_train)
    t_end = time()
    t_train = (t_end - t_start) / (5*m)
    print (u'5折交叉验证的训练时间为:%.3f秒/(5*%d)=%.3f秒' % ((t_end - t_start), m, t_train))
    print (u'最优超参数为:', model.best_params_)
    
    ## 模型预测
    t_start = time()
    y_hat = model.predict(x_test)
    t_end = time()
    t_test = t_end - t_start
    print (u'测试时间:%.3f秒' % t_test)
    
    ## 模型效果评估
    train_acc = metrics.accuracy_score(y_train, model.predict(x_train))
    test_acc = metrics.accuracy_score(y_test, y_hat)
    print (u'训练集准确率:%.2f%%' % (100 * train_acc))
    print (u'测试集准确率:%.2f%%' % (100 * test_acc))
    
    ## 返回结果(训练时间耗时,预测数据耗时,训练数据错误率,测试数据错误率, 名称)
    return t_train, t_test, 1-train_acc, 1-test_acc, name

数据加载

print (u'加载数据...')
t_start = time()
## 不要头部信息
remove = ('headers', 'footers', 'quotes')
## 只要这四类数据
categories = 'alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space'

## 分别加载训练数据和测试数据
data_train = fetch_20newsgroups(data_home='./datas/',subset='train', categories=categories, shuffle=True, random_state=0, remove=remove)
data_test = fetch_20newsgroups(data_home='./datas/',subset='test', categories=categories, shuffle=True, random_state=0, remove=remove)

## 完成
print (u"完成数据加载过程.耗时:%.3fs" % (time() - t_start))

加载数据...
完成数据加载过程.耗时:4.729s


len(data_train['data'])

2034


print(data_train.target_names)

['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']


获取加载数据的相关信息

def size_mb(docs):
    return sum(len(s.encode('utf-8')) for s in docs) / 1e6

categories = data_train.target_names
data_train_size_mb = size_mb(data_train.data)
data_test_size_mb = size_mb(data_test.data)

print (u'数据类型:', type(data_train.data))
print("%d文本数量 - %0.3fMB (训练数据集)" % (len(data_train.data), data_train_size_mb))
print("%d文本数量 - %0.3fMB (测试数据集)" % (len(data_test.data), data_test_size_mb))
print (u'训练集和测试集使用的%d个类别的名称:' % len(categories))
print(categories)

数据类型:
2034文本数量 - 2.428MB (训练数据集)
1353文本数量 - 1.800MB (测试数据集)
训练集和测试集使用的4个类别的名称:
['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']


数据重命名

x_train = data_train.data
y_train = data_train.target
x_test = data_test.data
y_test = data_test.target

输出前5个样本

print (u' -- 前5个文本 -- ')
for i in range(5):
    print (u'文本%d(属于类别 - %s):' % (i+1, categories[y_train[i]]))
    print (x_train[i])
    print ('\n\n')

-- 前5个文本 --
文本1(属于类别 - alt.atheism):
If one is a vegan (a vegetarian taht eats no animal products at at i.e eggs,
milk, cheese, etc., after about 3 years of a vegan diet, you need to start
taking B12 supplements because b12 is found only in animals.) Acutally our
bodies make B12, I think, but our bodies use up our own B12 after 2 or 3

  1. Lacto-oveo vegetarians, like myself, still get B12 through milk products

and eggs, so we don't need supplements.
And If anyone knows more, PLEASE post it. I'm nearly contridicting myself
with the mish-mash of knowledge I've gleaned.

文本2(属于类别 - comp.graphics):
Hi,

     I have a friend who is working on 2-d and 3-d object recognition. He is looking

for references describing algorithms on the following subject areas:

Thresholding
Edge Segmentation
Marr-Hildreth
Sobel Operator
Chain Codes
Thinning - Skeletonising

If anybody is willing to post an algorithm that they have implemented which demonstrates
any of the above topics, it would be much appreciated.

Please post all replies to my e-mail address. If requested I will post a summary to the
newsgroup in a couple of weeks.

Thanks in advance for all replies

文本3(属于类别 - comp.graphics):
Hello netters

Sorry, I don't know if this is the right way of doing this kind of thing,
probably should be a CFV, but since I don't have tha ability to create a
news group myself, I just want to start the discussion.

I enjoy reading c.g very much, but I often find it difficult to sort out what
I'm interested in. Everything from screen-drivers, graphics cards, graphics
programming and graphics programs are discused here. What I'd like is a
comp.graphics.programmer news group.
What do you other think.

文本4(属于类别 - comp.graphics):

Yes, I did punch in the wrong numbers (working too many late nites). I
intended on stating 640x400 is 256,000 bytes. It's not in the bios, just my
VESA TSR.

文本5(属于类别 - talk.religion.misc):

Well, I am not Andy, but if you had familiarized yourself with some of
the current theories/hypotheses about abiogenesis before posting :-), you
would be aware of the fact that none of them claims that proteins were
assembled randomly from amino acids. It is current thinking that RNA-
based replicators came before proteinaceous enzymes, and that proteins
were assembled by some kind of primitive translation machinery.

Now respond to 2. :-)
--Cornelius.


文档转换为向量

转换
要求输入的数据类型必须是list,list中的每条数据,单词是以空格分割开的

vectorizer = TfidfVectorizer(input='content', stop_words='english', max_df=0.5, sublinear_tf=True)
x_train = vectorizer.fit_transform(data_train.data)  # x_train是稀疏的,scipy.sparse.csr.csr_matrix
x_test = vectorizer.transform(data_test.data)
print (u'训练集样本个数:%d,特征个数:%d' % x_train.shape)
print (u'停止词:\n')
print(vectorizer.get_stop_words())
## 获取最终的特征属性名称
feature_names = np.asarray(vectorizer.get_feature_names())

训练集样本个数:2034,特征个数:26576
停止词:
frozenset({'back', 'much', 'your', 'a', 'above', 'almost', 'those', 'indeed', 'between', 'last', 'mill', 'some', 'therein', 'they', 'who', 'could', 'too', 'nor', 'even', 'against', 'am', 'herein', 'except', 'further', 'hereupon', 'may', 'move', 'myself', 'on', 'sixty', 'empty', 'bottom', 'cry', 'eg', 'amoungst', 'find', 'take', 'otherwise', 'before', 'six', 'de', 'any', 'ltd', 'per', 'we', 'both', 'and', 'my', 'amount', 'formerly', 'over', 'seems', 'therefore', 'couldnt', 'own', 'whereby', 'beyond', 'four', 'ie', 'many', 'together', 'whole', 'whom', 'however', 'beforehand', 'sometimes', 'only', 'elsewhere', 'cant', 'rather', 'twenty', 'when', 'until', 'more', 'to', 'full', 'former', 'still', 'two', 're', 'whenever', 'yourself', 'five', 'meanwhile', 'found', 'whether', 'himself', 'three', 'none', 'how', 'onto', 'should', 'are', 'afterwards', 'another', 'since', 'them', 'enough', 'thus', 'ourselves', 'after', 'becoming', 'themselves', 'wherein', 'noone', 'me', 'as', 'fire', 'become', 'moreover', 'fill', 'for', 'no', 'he', 'etc', 'nobody', 'others', 'thereafter', 'though', 'behind', 'anywhere', 'by', 'once', 'along', 'so', 'very', 'whereupon', 'whose', 'the', 'everything', 'perhaps', 'made', 'alone', 'side', 'fifty', 'then', 'up', 'where', 'down', 'during', 'next', 'thick', 'whereas', 'itself', 'these', 'it', 'already', 'every', 'is', 'because', 'been', 'in', 'someone', 'call', 'latterly', 'un', 'can', 'besides', 'into', 'hundred', 'same', 'whatever', 'serious', 'seem', 'his', 'upon', 'forty', 'without', 'eight', 'due', 'our', 'thin', 'becomes', 'such', 'amongst', 'several', 'yet', 'hasnt', 'here', 'hence', 'most', 'an', 'across', 'least', 'interest', 'from', 'wherever', 'thereupon', 'neither', 'twelve', 'each', 'might', 'thru', 'was', 'nowhere', 'their', 'were', 'whither', 'either', 'less', 'within', 'but', 'eleven', 'yourselves', 'than', 'anything', 'of', 'sometime', 'again', 'top', 'always', 'other', 'towards', 'while', 'around', 'would', 'herself', 'if', 'nothing', 'detail', 'also', 'i', 'out', 'her', 'hers', 'ours', 'front', 'one', 'you', 'now', 'us', 'became', 'namely', 'thereby', 'about', 'done', 'somehow', 'throughout', 'thence', 'with', 'toward', 'why', 'its', 'anyway', 'somewhere', 'show', 'mostly', 'latter', 'him', 'get', 'or', 'being', 'third', 'has', 'all', 'first', 'under', 'co', 'had', 'nevertheless', 'go', 'off', 'else', 'part', 'put', 'what', 'whereafter', 'at', 'yours', 'be', 'must', 'keep', 'will', 'among', 'seemed', 'give', 'everyone', 'something', 'nine', 'everywhere', 'anyhow', 'hereby', 'via', 'do', 'have', 'few', 'seeming', 'beside', 'inc', 'hereafter', 'not', 'sincere', 'system', 'that', 'bill', 'con', 'describe', 'please', 'through', 'see', 'ten', 'this', 'fifteen', 'although', 'never', 'whence', 'whoever', 'mine', 'well', 'often', 'anyone', 'which', 'she', 'name', 'cannot', 'below', 'there', 'ever'})


特征选择

ch2 = SelectKBest(chi2, k=1000)
x_train = ch2.fit_transform(x_train, y_train)
x_test = ch2.transform(x_test)
feature_names = [feature_names[i] for i in ch2.get_support(indices=True)]

使用不同的分类器对数据进行比较

print (u'分类器的比较:\n')
clfs = [
    [RidgeClassifier(), 'Ridge'],
    [KNeighborsClassifier(), 'KNN'],
    [MultinomialNB(), 'MultinomialNB'],
    [BernoulliNB(), 'BernoulliNB'],
    [RandomForestClassifier(n_estimators=200), 'RandomForest'],
    [SVC(), 'SVM'],
    [LinearSVC(loss='squared_hinge', penalty='l1', dual=False, tol=1e-4), 'LinearSVC-l1'],
    [LinearSVC(loss='squared_hinge', penalty='l2', dual=False, tol=1e-4), 'LinearSVC-l2']
]

## 将训练数据保存到一个列表中

result = []
for clf,name in clfs:
    # 计算算法结果
    a = benchmark(clf,name)
    # 追加到一个列表中,方便进行展示操作
    result.append(a)
    print ('\n')
## 将列表转换为数组
result = np.array(result)

分类器的比较:

分类器: RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True,

    max_iter=None, normalize=False, random_state=None, solver='auto',
    tol=0.001)

5折交叉验证的训练时间为:14.476秒/(5*10)=0.290秒
最优超参数为: {'alpha': 0.46415888336127775}
测试时间:0.002秒
训练集准确率:92.63%
测试集准确率:75.83%

分类器: KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',

       metric_params=None, n_jobs=1, n_neighbors=5, p=2,
       weights='uniform')

5折交叉验证的训练时间为:9.060秒/(5*14)=0.129秒
最优超参数为: {'n_neighbors': 1}
测试时间:0.105秒
训练集准确率:96.51%
测试集准确率:50.55%

分类器: MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
5折交叉验证的训练时间为:0.328秒/(5*10)=0.007秒
最优超参数为: {'alpha': 0.01}
测试时间:0.001秒
训练集准确率:91.40%
测试集准确率:76.72%

分类器: BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
5折交叉验证的训练时间为:0.431秒/(5*10)=0.009秒
最优超参数为: {'alpha': 0.01}
测试时间:0.001秒
训练集准确率:88.64%
测试集准确率:74.28%

分类器: RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',

        max_depth=None, max_features='auto', max_leaf_nodes=None,
        min_impurity_decrease=0.0, min_impurity_split=None,
        min_samples_leaf=1, min_samples_split=2,
        min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=1,
        oob_score=False, random_state=None, verbose=0,
        warm_start=False)

5折交叉验证的训练时间为:25.608秒/(5*6)=0.854秒
最优超参数为: {'max_depth': 9}
测试时间:0.088秒
训练集准确率:74.93%
测试集准确率:66.89%

分类器: SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
5折交叉验证的训练时间为:38.628秒/(5*9)=0.858秒
最优超参数为: {'C': 100.0, 'gamma': 0.03162277660168379}
测试时间:0.239秒
训练集准确率:93.36%
测试集准确率:73.54%

分类器: LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True,

 intercept_scaling=1, loss='squared_hinge', max_iter=1000,
 multi_class='ovr', penalty='l1', random_state=None, tol=0.0001,
 verbose=0)

5折交叉验证的训练时间为:11.595秒/(5*3)=0.773秒
最优超参数为: {'C': 10.0}
测试时间:0.002秒
训练集准确率:96.31%
测试集准确率:72.21%

分类器: LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True,

 intercept_scaling=1, loss='squared_hinge', max_iter=1000,
 multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
 verbose=0)

5折交叉验证的训练时间为:1.346秒/(5*3)=0.090秒
最优超参数为: {'C': 10.0}
测试时间:0.001秒
训练集准确率:95.48%
测试集准确率:74.58%


获取需要画图的数据

result = [[x[i] for x in result] for i in range(5)]
training_time, test_time, training_err, test_err, clf_names = result

training_time = np.array(training_time).astype(np.float)
test_time = np.array(test_time).astype(np.float)
training_err = np.array(training_err).astype(np.float)
test_err = np.array(test_err).astype(np.float)

画图

x = np.arange(len(training_time))
plt.figure(figsize=(10, 7), facecolor='w')
ax = plt.axes()
b0 = ax.bar(x+0.1, training_err, width=0.2, color='#77E0A0')
b1 = ax.bar(x+0.3, test_err, width=0.2, color='#8800FF')
ax2 = ax.twinx()
b2 = ax2.bar(x+0.5, training_time, width=0.2, color='#FFA0A0')
b3 = ax2.bar(x+0.7, test_time, width=0.2, color='#FF8080')
plt.xticks(x+0.5, clf_names)
plt.legend([b0[0], b1[0], b2[0], b3[0]], (u'训练集错误率', u'测试集错误率', u'训练时间', u'测试时间'), loc='upper left', shadow=True)
plt.title(u'新闻组文本数据分类及不同分类器效果比较', fontsize=18)
plt.xlabel(u'分类器名称')
plt.grid(True)
plt.tight_layout(2)
plt.show()

04 贝叶斯算法 - 贝叶斯网络

相关文章
|
6月前
|
数据采集 机器学习/深度学习 算法
|
4月前
|
存储 分布式计算 算法
大数据-106 Spark Graph X 计算学习 案例:1图的基本计算、2连通图算法、3寻找相同的用户
大数据-106 Spark Graph X 计算学习 案例:1图的基本计算、2连通图算法、3寻找相同的用户
91 0
|
26天前
|
移动开发 算法 计算机视觉
基于分块贝叶斯非局部均值优化(OBNLM)的图像去噪算法matlab仿真
本项目基于分块贝叶斯非局部均值优化(OBNLM)算法实现图像去噪,使用MATLAB2022A进行仿真。通过调整块大小和窗口大小等参数,研究其对去噪效果的影响。OBNLM结合了经典NLM算法与贝叶斯统计理论,利用块匹配和概率模型优化相似块的加权融合,提高去噪效率和保真度。实验展示了不同参数设置下的去噪结果,验证了算法的有效性。
|
2月前
|
机器学习/深度学习 算法 数据安全/隐私保护
基于贝叶斯优化CNN-GRU网络的数据分类识别算法matlab仿真
本项目展示了使用MATLAB2022a实现的贝叶斯优化、CNN和GRU算法优化效果。优化前后对比显著,完整代码附带中文注释及操作视频。贝叶斯优化适用于黑盒函数,CNN用于时间序列特征提取,GRU改进了RNN的长序列处理能力。
|
4月前
|
机器学习/深度学习 算法 数据安全/隐私保护
基于贝叶斯优化CNN-LSTM网络的数据分类识别算法matlab仿真
本项目展示了基于贝叶斯优化(BO)的CNN-LSTM网络在数据分类中的应用。通过MATLAB 2022a实现,优化前后效果对比明显。核心代码附带中文注释和操作视频,涵盖BO、CNN、LSTM理论,特别是BO优化CNN-LSTM网络的batchsize和学习率,显著提升模型性能。
|
4月前
|
存储 算法 搜索推荐
这些算法在实际应用中有哪些具体案例呢
【10月更文挑战第19天】这些算法在实际应用中有哪些具体案例呢
87 1
|
4月前
|
算法 数据可视化 新制造
Threejs路径规划_基于A*算法案例完整版
这篇文章详细介绍了如何在Three.js中完整实现基于A*算法的路径规划案例,包括网格构建、路径寻找算法的实现以及路径可视化展示等方面的内容。
115 0
Threejs路径规划_基于A*算法案例完整版
|
4月前
|
机器学习/深度学习 算法 数据安全/隐私保护
基于贝叶斯优化卷积神经网络(Bayes-CNN)的多因子数据分类识别算法matlab仿真
本项目展示了贝叶斯优化在CNN中的应用,包括优化过程、训练与识别效果对比,以及标准CNN的识别结果。使用Matlab2022a开发,提供完整代码及视频教程。贝叶斯优化通过构建代理模型指导超参数优化,显著提升模型性能,适用于复杂数据分类任务。
|
4月前
|
存储 算法 机器人
Threejs路径规划_基于A*算法案例V2
这篇文章详细介绍了如何在Three.js中使用A*算法进行高效的路径规划,并通过三维物理电路的实例演示了路径计算和优化的过程。
106 0
|
6月前
|
机器学习/深度学习 人工智能 算法
【人工智能】传统语音识别算法概述,应用场景,项目实践及案例分析,附带代码示例
传统语音识别算法是将语音信号转化为文本形式的技术,它主要基于模式识别理论和数学统计学方法。以下是传统语音识别算法的基本概述
174 2