Here I use a dataset that ships with sklearn: the 20 newsgroups text classification data, which contains 20 classes and 18,846 samples, 11,314 of which are training samples. Its class names are:
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
Of these, four categories are selected for the text classification task: "alt.atheism", "talk.religion.misc", "comp.graphics", "sci.space".
1. Text Feature Extraction
1.1 Loading the Dataset
from sklearn.datasets import fetch_20newsgroups

categories = [
    "alt.atheism",
    "talk.religion.misc",
    "comp.graphics",
    "sci.space",
]
remove = ("headers", "footers", "quotes")

data_train = fetch_20newsgroups(
    data_home='./dataset/', subset='train', categories=categories,
    remove=remove, random_state=42
)
data_test = fetch_20newsgroups(
    data_home='./dataset/', subset='test', categories=categories,
    remove=remove, random_state=42
)

data_train.target_names, data_test.target_names
# Output:
# (['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc'],
#  ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc'])

y_train, y_test = data_train.target, data_test.target
PS: the remove argument strips each file's headers, signature blocks, and quoted blocks, which makes the task more realistic. Without this, the classifiers overfit heavily on a lot of incidental content:
- Almost every group is distinguished by whether headers such as NNTP-Posting-Host: and Distribution: appear more or less often.
- Another significant feature is whether the sender is affiliated with a university, as indicated by their headers or signature.
- The word "article" is a significant feature, based on how often people quote previous posts like this: "In article [article ID], [name] <[e-mail address]> wrote:".
- Other features match the names and e-mail addresses of particular people who were posting at the time.
With such an abundance of clues that distinguish the newsgroups, the classifiers barely need to identify topics from the text at all, and they all perform at the same high level. That is why the ('headers', 'footers', 'quotes') information is removed here (a quick comparison is sketched below).
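To make this concrete, here is a minimal sketch (my addition, not part of the original walkthrough; the exact numbers will vary) that trains the same kind of TF-IDF + LinearSVC pipeline once with the metadata kept and once with it removed. The stripped version typically scores noticeably lower, i.e. more honestly:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

categories = ["alt.atheism", "talk.religion.misc", "comp.graphics", "sci.space"]

for remove in [(), ("headers", "footers", "quotes")]:
    train = fetch_20newsgroups(subset="train", categories=categories, remove=remove)
    test = fetch_20newsgroups(subset="test", categories=categories, remove=remove)

    vec = TfidfVectorizer(stop_words="english")
    X_tr = vec.fit_transform(train.data)
    X_te = vec.transform(test.data)

    clf = LinearSVC(dual=False, tol=1e-3).fit(X_tr, train.target)
    acc = accuracy_score(test.target, clf.predict(X_te))
    print("remove=%s -> test accuracy: %0.3f" % (remove, acc))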
1.2 Hashing Vectorization
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(stop_words="english", alternate_sign=False,
                               n_features=2 ** 10)
X_train = vectorizer.fit_transform(data_train.data)
X_test = vectorizer.transform(data_test.data)

print("X_train.shape:{}, X_test.shape:{}".format(X_train.shape, X_test.shape))
# Output:
# X_train.shape:(2034, 1024), X_test.shape:(1353, 1024)
1.3 Chi-Squared Feature Selection
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

select_kbest = SelectKBest(chi2, k=200)
X_train = select_kbest.fit_transform(X_train, y_train)
X_test = select_kbest.transform(X_test)

print("X_train.shape:{}, X_test.shape:{}".format(X_train.shape, X_test.shape))
# Output:
# X_train.shape:(2034, 200), X_test.shape:(1353, 200)
For more detail on hashing vectorization and chi-squared feature selection, see my two earlier notes:
1. Summary of sklearn feature extraction methods (dict, text, and image features)
2. Summary of sklearn dimensionality reduction methods (variance threshold, chi-squared, F-test, mutual information, embedded methods)
2. Training Machine Learning Models
2.1 SVM Model
from time import time
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

t0 = time()
clf = LinearSVC(dual=False, tol=1e-3)
clf.fit(X_train, y_train)
train_time = time() - t0
print("train time: %0.3fs" % train_time)

t0 = time()
pred = clf.predict(X_test)
test_time = time() - t0
print("test time: %0.3fs" % test_time)

test_score = accuracy_score(y_test, pred)
train_score = accuracy_score(y_train, clf.predict(X_train))
print("test accuracy: %0.3f" % test_score, "train accuracy: %0.3f" % train_score)
Output:
train time: 0.020s
test time: 0.001s
test accuracy: 0.662 train accuracy: 0.775
2.2 Random Forest Model
from time import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

t0 = time()
clf = RandomForestClassifier(verbose=1, random_state=42)
clf.fit(X_train, y_train)
train_time = time() - t0
print("train time: %0.3fs" % train_time)

t0 = time()
pred = clf.predict(X_test)
test_time = time() - t0
print("test time: %0.3fs" % test_time)

test_score = accuracy_score(y_test, pred)
train_score = accuracy_score(y_train, clf.predict(X_train))
print("test accuracy: %0.3f" % test_score, "train accuracy: %0.3f" % train_score)
Output:
train time: 0.466s
test time: 0.032s
test accuracy: 0.617 train accuracy: 0.964
3. Building and Testing a Neural Network
In fact, text classification essentially boils down to encoding the text as features, and that step has already been done by the hashing vectorizer and the chi-squared filter: every document is now encoded as a 200-dimensional feature vector. With these features we can build a neural network and train it.
So below I put together a simple multilayer perceptron: just three fully connected layers, with no special design (honestly, I wouldn't know how to design anything fancier anyway...).
3.1 Defining the Network
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, embedding=200, n_class=4):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(embedding, 256),
            nn.Dropout(p=0.6),
            nn.ReLU(),
            nn.Linear(256, 512),
            nn.Dropout(p=0.6),
            nn.ReLU(),
            nn.Linear(512, n_class),
        )

    def forward(self, x):
        return self.model(x)
3.2 Training the Network
- Data format conversion
Before training, the data format needs to be converted. The matrix produced by the chi-squared filter is a scipy sparse matrix, so it has to be converted to a dense numpy array with .toarray(), e.g. X_train = X_train.toarray().
As for which exact format to convert to afterwards, you can simply run the code and follow the error messages at the point of failure, or just use the conversion below, which should be essentially correct.
import torch

X_train = torch.tensor(X_train.toarray()).float()
X_test = torch.tensor(X_test.toarray()).float()
y_train = torch.tensor(y_train).long()
y_test = torch.tensor(y_test).long()
- Training loop
Once the data formats are converted, the network can be trained. Reference code:
import torch
from torch import optim
from time import time
from sklearn.metrics import accuracy_score

# Hyperparameters
epochsize = 500
learning_rate = 1e-3
best_acc = 0

model = MLP()
criteon = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Training
t0 = time()
for epoch in range(epochsize):

    model.train()
    # Compute the loss
    pred = model(X_train)
    loss = criteon(pred, y_train)

    # Backpropagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Track the best test accuracy during training
    # print(epoch, 'loss:', loss.item())
    model.eval()
    with torch.no_grad():
        category = model(X_test)
        pred = category.argmax(dim=1)
        score = accuracy_score(y_test, pred)
        if best_acc < score:
            best_acc = score

train_time = time() - t0
print("_" * 80)
print("best acc:{}".format(best_acc))
print("train time: %0.3fs" % train_time)
Output:
________________________________________________________________________________
best acc:0.6688839615668883
train time: 2.777s
In the end, the neural network's result appears to be about the same as the SVM's: whether it is the ensemble method, the support vector machine, or the neural network, the final accuracy is around 66%.
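To see where those ~66% come from, a per-class breakdown helps. The sketch below is my addition, not one of the original experiments: it refits the linear SVM on the 200-dimensional chi-squared features from section 1.3 and prints a classification report; it assumes X_train/X_test/y_train/y_test are still in their scipy/numpy form, i.e. before the torch tensor conversion of section 3.2.

from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, confusion_matrix

# Assumes the 200-dimensional chi-squared features from section 1.3
# (scipy sparse matrices / numpy label arrays, not the torch tensors).
clf = LinearSVC(dual=False, tol=1e-3)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

print(classification_report(y_test, pred, target_names=data_train.target_names))
print(confusion_matrix(y_test, pred))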
4. Weight Vector Encoding
For text classification, if you want to know which word features matter most for assigning a document to a class, you can use a weighted vector encoding together with a classifier that learns per-feature weights (such as a linear SVM).
4.1 Getting the Indices of the N Largest Values in a List
This subsection introduces a small trick, namely how to use the heapq library for this; see reference 3.
- For a list without duplicate values
import heapq

n = 3
lis = [2, 4, 5, 1, 7]
re1 = map(lis.index, heapq.nlargest(n, lis))  # indices of the n largest values
                                              # (nsmallest finds the smallest, nlargest the largest)
re2 = heapq.nlargest(n, lis)                  # the n largest elements themselves
print(list(re1))  # re1 is a map object rather than a list, so wrap it in list() before printing
print(re2)
- For a list with duplicate values (a cleaner one-liner is shown after this example)
import heapq

lis = [2, 4, 4, 1, 0]
n = 3
max_number = heapq.nlargest(n, lis)
max_index = []
for t in max_number:
    index = lis.index(t)
    max_index.append(index)
    lis[index] = 0  # overwrite the found value so that duplicates map to different indices
print(max_number)
print(max_index)
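As a side note (my addition), heapq.nlargest also accepts a key function, which yields the indices directly, handles duplicates, and leaves the list untouched:

import heapq

lis = [2, 4, 4, 1, 0]
n = 3

# Rank the index range by the corresponding list value; duplicate values
# keep distinct indices and `lis` is not modified.
max_index = heapq.nlargest(n, range(len(lis)), key=lis.__getitem__)
print(max_index)                     # e.g. [1, 2, 0]
print([lis[i] for i in max_index])   # e.g. [4, 4, 2]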
4.2 Viewing the N Most Informative Feature Names
Implementation with np.argsort (note: the clf and vectorizer used below are the TfidfVectorizer and LinearSVC trained in section 4.3, since the hashed features have no recoverable feature names):
import numpy as np

# Show the 10 most informative feature words per class
def show_top10(classifier, vectorizer, categories):
    # Get the mapping from feature index to token:
    # vectorizer.get_feature_names_out() returns it as an array,
    # vectorizer.vocabulary_ as a dict
    feature_names = vectorizer.get_feature_names_out()
    for i, category in enumerate(categories):
        # classifier.coef_.shape: (4, 26576), i.e. one weight per word per class
        # np.argsort returns the indices that would sort the array,
        # so sort first and then take the last 10 (largest weights)
        top10 = np.argsort(classifier.coef_[i])[-10:]
        print("%s:: %s" % (category, " ".join(feature_names[top10])))

show_top10(clf, vectorizer, data_train.target_names)
Output:
alt.atheism:: nanci islamic deletion motto islam atheist bobby atheists religion atheism
comp.graphics:: card images 42 looking hi computer 3d file image graphics
sci.space:: flight mars solar moon shuttle spacecraft launch nasa orbit space
talk.religion.misc:: commandment koresh blood jesus children rosicrucian christ fbi christians christian
Code adapted from reference 4.
Implementation with heapq
import numpy as np
import heapq

# Show the 10 most informative feature words per class
def show_top10(classifier, vectorizer, categories, top_k=10):
    # Get the mapping from feature index to token:
    # vectorizer.get_feature_names_out() returns it as an array,
    # vectorizer.vocabulary_ as a dict
    feature_names = vectorizer.get_feature_names_out()
    for i, category in enumerate(categories):
        # Use heapq to get the indices of the top_k largest weights
        top = map(list(classifier.coef_[i]).index, heapq.nlargest(top_k, classifier.coef_[i]))
        top = list(top)
        print("%s:: %s" % (category, " ".join(feature_names[top])))

show_top10(clf, vectorizer, data_train.target_names)
Output:
alt.atheism:: atheism religion atheists bobby atheist islam motto deletion islamic nanci
comp.graphics:: graphics image file 3d computer hi looking 42 images card
sci.space:: space orbit nasa launch spacecraft shuttle moon solar mars flight
talk.religion.misc:: christian christians fbi christ rosicrucian children jesus blood koresh commandment
As you can see, the two methods produce the same set of words; the second one additionally lists them in descending order of weight. From the output we can roughly conclude that the larger a word's weight, the more strongly it is associated with the corresponding class.
4.3 Text Classification with the Weighted Encoding
Reference code:
from time import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# TF-IDF weighted encoding
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words="english")
X_train = vectorizer.fit_transform(data_train.data)
X_test = vectorizer.transform(data_test.data)
# X_train.shape, X_test.shape: ((2034, 26576), (1353, 26576))
y_train, y_test = data_train.target, data_test.target

# Build and train the classifier
t0 = time()
clf = LinearSVC(penalty='l2', dual=False, tol=1e-3)
clf.fit(X_train, y_train)
train_time = time() - t0
print("train time: %0.3fs" % train_time)

t0 = time()
pred = clf.predict(X_test)
test_time = time() - t0
print("test time: %0.3fs" % test_time)

# Evaluate
test_score = accuracy_score(y_test, pred)
train_score = accuracy_score(y_train, clf.predict(X_train))
print("test accuracy: %0.3f" % test_score, "train accuracy: %0.3f" % train_score)
Output:
train time: 0.247s
test time: 0.002s
test accuracy: 0.780 train accuracy: 0.978
Analysis: the TF-IDF weighted encoding clearly performs better than the hashing encoding. Hashing is essentially a dimensionality reduction technique: it shortens training time but also discards some information through collisions, so the hashed features classify a bit worse while training a bit faster.
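A natural follow-up experiment (my own suggestion, not from the original post): with more hash buckets there are fewer collisions, so the gap to TF-IDF should shrink at the cost of a wider feature matrix. A rough sketch, reusing data_train/data_test and the y_train/y_test labels from the block above:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

for n_features in [2 ** 10, 2 ** 14, 2 ** 18]:
    vec = HashingVectorizer(stop_words="english", alternate_sign=False,
                            n_features=n_features)
    # HashingVectorizer is stateless, so transform() is enough on both sets
    X_tr = vec.transform(data_train.data)
    X_te = vec.transform(data_test.data)

    clf = LinearSVC(dual=False, tol=1e-3).fit(X_tr, y_train)
    print(n_features, "->", accuracy_score(y_test, clf.predict(X_te)))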
5. Classification of Text Documents Using Sparse Features
Here is the official example code, kept for my own study; see reference 5 for details. The accompanying commentary is translated below.
This is an example showing how scikit-learn can be used to classify documents by topic using a bag-of-words approach. The example uses a scipy.sparse matrix to store the features and demonstrates various classifiers that can efficiently handle sparse matrices.
The dataset used in this example is the 20 newsgroups dataset. It will be automatically downloaded and then cached.
5.1 Parameter Setup
# Author: Peter Prettenhofer <peter.prettenhofer@gmail.com>
#         Olivier Grisel <olivier.grisel@ensta.org>
#         Mathieu Blondel <mathieu@mblondel.org>
#         Lars Buitinck
# License: BSD 3 clause

import logging
import numpy as np
from optparse import OptionParser
import sys
from time import time

import matplotlib.pyplot as plt

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.naive_bayes import BernoulliNB, ComplementNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestCentroid
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.extmath import density
from sklearn import metrics


# Display progress logs on stdout
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

op = OptionParser()
op.add_option(
    "--report",
    action="store_true",
    dest="print_report",
    help="Print a detailed classification report.",
)
op.add_option(
    "--chi2_select",
    action="store",
    type="int",
    dest="select_chi2",
    help="Select some number of features using a chi-squared test",
)
op.add_option(
    "--confusion_matrix",
    action="store_true",
    dest="print_cm",
    help="Print the confusion matrix.",
)
op.add_option(
    "--top10",
    action="store_true",
    dest="print_top10",
    help="Print ten most discriminative terms per class for every classifier.",
)
op.add_option(
    "--all_categories",
    action="store_true",
    dest="all_categories",
    help="Whether to use all categories or not.",
)
op.add_option("--use_hashing", action="store_true", help="Use a hashing vectorizer.")
op.add_option(
    "--n_features",
    action="store",
    type=int,
    default=2 ** 16,
    help="n_features when using the hashing vectorizer.",
)
op.add_option(
    "--filtered",
    action="store_true",
    help=(
        "Remove newsgroup information that is easily overfit: "
        "headers, signatures, and quoting."
    ),
)


def is_interactive():
    return not hasattr(sys.modules["__main__"], "__file__")


# work-around for Jupyter notebook and IPython console
argv = [] if is_interactive() else sys.argv[1:]
(opts, args) = op.parse_args(argv)
if len(args) > 0:
    op.error("this script takes no arguments.")
    sys.exit(1)

print(__doc__)
op.print_help()
print()
Out:
Usage: plot_document_classification_20newsgroups.py [options]

Options:
  -h, --help            show this help message and exit
  --report              Print a detailed classification report.
  --chi2_select=SELECT_CHI2
                        Select some number of features using a chi-squared test
  --confusion_matrix    Print the confusion matrix.
  --top10               Print ten most discriminative terms per class for every classifier.
  --all_categories      Whether to use all categories or not.
  --use_hashing         Use a hashing vectorizer.
  --n_features=N_FEATURES
                        n_features when using the hashing vectorizer.
  --filtered            Remove newsgroup information that is easily overfit: headers, signatures, and quoting.
5.2 Loading Data from the Training Set
Let's load data from the newsgroups dataset, which comprises around 18,000 newsgroup posts on 20 topics, split into two subsets: one for training (or development) and the other for testing (or for performance evaluation).
if opts.all_categories:
    categories = None
else:
    categories = [
        "alt.atheism",
        "talk.religion.misc",
        "comp.graphics",
        "sci.space",
    ]

if opts.filtered:
    remove = ("headers", "footers", "quotes")
else:
    remove = ()

print("Loading 20 newsgroups dataset for categories:")
print(categories if categories else "all")

data_train = fetch_20newsgroups(
    subset="train", categories=categories, shuffle=True, random_state=42, remove=remove
)

data_test = fetch_20newsgroups(
    subset="test", categories=categories, shuffle=True, random_state=42, remove=remove
)
print("data loaded")

# order of labels in `target_names` can be different from `categories`
target_names = data_train.target_names


def size_mb(docs):
    return sum(len(s.encode("utf-8")) for s in docs) / 1e6


data_train_size_mb = size_mb(data_train.data)
data_test_size_mb = size_mb(data_test.data)

print(
    "%d documents - %0.3fMB (training set)" % (len(data_train.data), data_train_size_mb)
)
print("%d documents - %0.3fMB (test set)" % (len(data_test.data), data_test_size_mb))
print("%d categories" % len(target_names))
print()

# split a training set and a test set
y_train, y_test = data_train.target, data_test.target

print("Extracting features from the training data using a sparse vectorizer")
t0 = time()
if opts.use_hashing:
    vectorizer = HashingVectorizer(
        stop_words="english", alternate_sign=False, n_features=opts.n_features
    )
    X_train = vectorizer.transform(data_train.data)
else:
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words="english")
    X_train = vectorizer.fit_transform(data_train.data)
duration = time() - t0
print("done in %fs at %0.3fMB/s" % (duration, data_train_size_mb / duration))
print("n_samples: %d, n_features: %d" % X_train.shape)
print()

print("Extracting features from the test data using the same vectorizer")
t0 = time()
X_test = vectorizer.transform(data_test.data)
duration = time() - t0
print("done in %fs at %0.3fMB/s" % (duration, data_test_size_mb / duration))
print("n_samples: %d, n_features: %d" % X_test.shape)
print()

# mapping from integer feature name to original token string
if opts.use_hashing:
    feature_names = None
else:
    feature_names = vectorizer.get_feature_names_out()

if opts.select_chi2:
    print("Extracting %d best features by a chi-squared test" % opts.select_chi2)
    t0 = time()
    ch2 = SelectKBest(chi2, k=opts.select_chi2)
    X_train = ch2.fit_transform(X_train, y_train)
    X_test = ch2.transform(X_test)
    if feature_names is not None:
        # keep selected feature names
        feature_names = feature_names[ch2.get_support()]
    print("done in %fs" % (time() - t0))
    print()


def trim(s):
    """Trim string to fit on terminal (assuming 80-column display)"""
    return s if len(s) <= 80 else s[:77] + "..."
Out:
Loading 20 newsgroups dataset for categories:
['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
data loaded
2034 documents - 3.980MB (training set)
1353 documents - 2.867MB (test set)
4 categories

Extracting features from the training data using a sparse vectorizer
done in 0.383082s at 10.388MB/s
n_samples: 2034, n_features: 33809

Extracting features from the test data using the same vectorizer
done in 0.236998s at 12.099MB/s
n_samples: 1353, n_features: 33809
5.3 Building the Classifiers
Train and test the dataset with 15 different classification models and get the performance results for each model.
def benchmark(clf):
    print("_" * 80)
    print("Training: ")
    print(clf)
    t0 = time()
    clf.fit(X_train, y_train)
    train_time = time() - t0
    print("train time: %0.3fs" % train_time)

    t0 = time()
    pred = clf.predict(X_test)
    test_time = time() - t0
    print("test time: %0.3fs" % test_time)

    score = metrics.accuracy_score(y_test, pred)
    print("accuracy: %0.3f" % score)

    if hasattr(clf, "coef_"):
        print("dimensionality: %d" % clf.coef_.shape[1])
        print("density: %f" % density(clf.coef_))

        if opts.print_top10 and feature_names is not None:
            print("top 10 keywords per class:")
            for i, label in enumerate(target_names):
                top10 = np.argsort(clf.coef_[i])[-10:]
                print(trim("%s: %s" % (label, " ".join(feature_names[top10]))))
        print()

    if opts.print_report:
        print("classification report:")
        print(metrics.classification_report(y_test, pred, target_names=target_names))

    if opts.print_cm:
        print("confusion matrix:")
        print(metrics.confusion_matrix(y_test, pred))

    print()
    clf_descr = str(clf).split("(")[0]
    return clf_descr, score, train_time, test_time


results = []
for clf, name in (
    (RidgeClassifier(tol=1e-2, solver="sag"), "Ridge Classifier"),
    (Perceptron(max_iter=50), "Perceptron"),
    (PassiveAggressiveClassifier(max_iter=50), "Passive-Aggressive"),
    (KNeighborsClassifier(n_neighbors=10), "kNN"),
    (RandomForestClassifier(), "Random forest"),
):
    print("=" * 80)
    print(name)
    results.append(benchmark(clf))

for penalty in ["l2", "l1"]:
    print("=" * 80)
    print("%s penalty" % penalty.upper())
    # Train Liblinear model
    results.append(benchmark(LinearSVC(penalty=penalty, dual=False, tol=1e-3)))

    # Train SGD model
    results.append(benchmark(SGDClassifier(alpha=0.0001, max_iter=50, penalty=penalty)))

# Train SGD with Elastic Net penalty
print("=" * 80)
print("Elastic-Net penalty")
results.append(
    benchmark(SGDClassifier(alpha=0.0001, max_iter=50, penalty="elasticnet"))
)

# Train NearestCentroid without threshold
print("=" * 80)
print("NearestCentroid (aka Rocchio classifier)")
results.append(benchmark(NearestCentroid()))

# Train sparse Naive Bayes classifiers
print("=" * 80)
print("Naive Bayes")
results.append(benchmark(MultinomialNB(alpha=0.01)))
results.append(benchmark(BernoulliNB(alpha=0.01)))
results.append(benchmark(ComplementNB(alpha=0.1)))

print("=" * 80)
print("LinearSVC with L1-based feature selection")
# The smaller C, the stronger the regularization.
# The more regularization, the more sparsity.
results.append(
    benchmark(
        Pipeline(
            [
                (
                    "feature_selection",
                    SelectFromModel(LinearSVC(penalty="l1", dual=False, tol=1e-3)),
                ),
                ("classification", LinearSVC(penalty="l2")),
            ]
        )
    )
)
Out:
================================================================================
Ridge Classifier
________________________________________________________________________________
Training:
RidgeClassifier(solver='sag', tol=0.01)
/home/circleci/project/sklearn/linear_model/_ridge.py:729: UserWarning: "sag" solver requires many iterations to fit an intercept with sparse inputs. Either set the solver to "auto" or "sparse_cg", or set a low "tol" and a high "max_iter" (especially if inputs are not standardized).
  warnings.warn(
train time: 0.167s
test time: 0.001s
accuracy: 0.898
dimensionality: 33809
density: 1.000000

================================================================================
Perceptron
________________________________________________________________________________
Training:
Perceptron(max_iter=50)
train time: 0.015s
test time: 0.001s
accuracy: 0.888
dimensionality: 33809
density: 0.255302

================================================================================
Passive-Aggressive
________________________________________________________________________________
Training:
PassiveAggressiveClassifier(max_iter=50)
train time: 0.027s
test time: 0.001s
accuracy: 0.902
dimensionality: 33809
density: 0.711867

================================================================================
kNN
________________________________________________________________________________
Training:
KNeighborsClassifier(n_neighbors=10)
train time: 0.001s
test time: 0.148s
accuracy: 0.858

================================================================================
Random forest
________________________________________________________________________________
Training:
RandomForestClassifier()
train time: 1.258s
test time: 0.079s
accuracy: 0.826

================================================================================
L2 penalty
________________________________________________________________________________
Training:
LinearSVC(dual=False, tol=0.001)
train time: 0.072s
test time: 0.001s
accuracy: 0.900
dimensionality: 33809
density: 1.000000

________________________________________________________________________________
Training:
SGDClassifier(max_iter=50)
train time: 0.024s
test time: 0.001s
accuracy: 0.903
dimensionality: 33809
density: 0.579424

================================================================================
L1 penalty
________________________________________________________________________________
Training:
LinearSVC(dual=False, penalty='l1', tol=0.001)
train time: 0.176s
test time: 0.001s
accuracy: 0.873
dimensionality: 33809
density: 0.005553

________________________________________________________________________________
Training:
SGDClassifier(max_iter=50, penalty='l1')
train time: 0.092s
test time: 0.002s
accuracy: 0.880
dimensionality: 33809
density: 0.022509

================================================================================
Elastic-Net penalty
________________________________________________________________________________
Training:
SGDClassifier(max_iter=50, penalty='elasticnet')
train time: 0.134s
test time: 0.001s
accuracy: 0.901
dimensionality: 33809
density: 0.184685

================================================================================
NearestCentroid (aka Rocchio classifier)
________________________________________________________________________________
Training:
NearestCentroid()
train time: 0.004s
test time: 0.002s
accuracy: 0.855

================================================================================
Naive Bayes
________________________________________________________________________________
Training:
MultinomialNB(alpha=0.01)
train time: 0.003s
test time: 0.001s
accuracy: 0.899
/home/circleci/project/sklearn/utils/deprecation.py:103: FutureWarning: Attribute `coef_` was deprecated in version 0.24 and will be removed in 1.1 (renaming of 0.26).
  warnings.warn(msg, category=FutureWarning)
dimensionality: 33809
density: 1.000000

________________________________________________________________________________
Training:
BernoulliNB(alpha=0.01)
train time: 0.005s
test time: 0.004s
accuracy: 0.884
/home/circleci/project/sklearn/utils/deprecation.py:103: FutureWarning: Attribute `coef_` was deprecated in version 0.24 and will be removed in 1.1 (renaming of 0.26).
  warnings.warn(msg, category=FutureWarning)
dimensionality: 33809
density: 1.000000

________________________________________________________________________________
Training:
ComplementNB(alpha=0.1)
train time: 0.003s
test time: 0.001s
accuracy: 0.911
/home/circleci/project/sklearn/utils/deprecation.py:103: FutureWarning: Attribute `coef_` was deprecated in version 0.24 and will be removed in 1.1 (renaming of 0.26).
  warnings.warn(msg, category=FutureWarning)
dimensionality: 33809
density: 1.000000

================================================================================
LinearSVC with L1-based feature selection
________________________________________________________________________________
Training:
Pipeline(steps=[('feature_selection',
                 SelectFromModel(estimator=LinearSVC(dual=False, penalty='l1',
                                                     tol=0.001))),
                ('classification', LinearSVC())])
train time: 0.192s
test time: 0.002s
accuracy: 0.879
5.4 Visualization
Bar plots show the accuracy, training time (normalized), and test time (normalized) of each classifier.
indices = np.arange(len(results))

results = [[x[i] for x in results] for i in range(4)]

clf_names, score, training_time, test_time = results
training_time = np.array(training_time) / np.max(training_time)
test_time = np.array(test_time) / np.max(test_time)

plt.figure(figsize=(12, 8))
plt.title("Score")
plt.barh(indices, score, 0.2, label="score", color="navy")
plt.barh(indices + 0.3, training_time, 0.2, label="training time", color="c")
plt.barh(indices + 0.6, test_time, 0.2, label="test time", color="darkorange")
plt.yticks(())
plt.legend(loc="best")
plt.subplots_adjust(left=0.25)
plt.subplots_adjust(top=0.95)
plt.subplots_adjust(bottom=0.05)

for i, c in zip(indices, clf_names):
    plt.text(-0.3, i, c)

plt.show()
Out: (the resulting figure: a horizontal bar chart per classifier showing score, normalized training time, and normalized test time)
References:
1. Summary of sklearn feature extraction methods (dict, text, and image features)
2. Summary of sklearn dimensionality reduction methods (variance threshold, chi-squared, F-test, mutual information, embedded methods)
3. Getting the indices of the N largest values in a Python list
4. The 20 newsgroups text dataset
5. Classification of text documents using sparse features