【34】Text Document Classification in Practice (Feature Extraction with Hashing/TF-IDF Encoding + Chi-Square Filtering + Neural Network Classification)


The dataset used here is one that ships with sklearn: the 20 newsgroups data, a news classification dataset with 20 classes and 18,846 samples, 11,314 of which are training samples (a quick check of these counts follows the category list below). The categories are:

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']
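
These counts and category names can be verified with a minimal sketch (it downloads the dataset on first run):

from sklearn.datasets import fetch_20newsgroups
news_all = fetch_20newsgroups(subset="all")      # all 18846 samples
news_train = fetch_20newsgroups(subset="train")  # the 11314 training samples
print(len(news_all.data), len(news_train.data))  # 18846 11314
print(news_all.target_names)                     # the 20 category names listed above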


Four of these categories are selected for the classification task: "alt.atheism", "talk.religion.misc", "comp.graphics", "sci.space".


1. Text Vector Extraction


1.1 Loading the Dataset

from sklearn.datasets import fetch_20newsgroups
categories = [ "alt.atheism", "talk.religion.misc", "comp.graphics", "sci.space",]
remove = ("headers", "footers", "quotes")
data_train = fetch_20newsgroups(
    data_home='./dataset/', subset='train', categories=categories, remove=remove, random_state=42
)
data_test = fetch_20newsgroups(
    data_home='./dataset/', subset='test', categories=categories, remove=remove, random_state=42
)
data_train.target_names, data_test.target_names
# Output:
# (['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc'],
# ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc'])
y_train, y_test = data_train.target, data_test.target


Note: remove strips each document's headers, signature blocks, and quote blocks, which makes the data more realistic. Without this, a classifier overfits on a lot of incidental content:


  • Almost every group is distinguished by whether headers such as NNTP-Posting-Host: and Distribution: appear more or less often.
  • Another significant feature involves whether the sender is affiliated with a university, as indicated by their headers or signature.
  • The word "article" is a significant feature, based on how often people quote previous posts like this: "In article [article ID], [name] <[e-mail address]> wrote:"
  • Other features match the names and e-mail addresses of particular people who were posting at the time.

With such an abundance of clues that distinguish newsgroups, the classifiers barely have to identify topics from the text at all, and they all perform at the same high level. That is why the ('headers', 'footers', 'quotes') information is removed here.
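
If you want to see exactly what gets stripped, you can load the same subset with and without remove and compare a document. A minimal sketch (the index 0 is arbitrary; categories is the list defined above):

from sklearn.datasets import fetch_20newsgroups
raw = fetch_20newsgroups(subset="train", categories=categories)
clean = fetch_20newsgroups(subset="train", categories=categories, remove=("headers", "footers", "quotes"))
print(raw.data[0][:300])    # starts with header lines such as "From:" and "Subject:"
print(clean.data[0][:300])  # body text only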


1.2 Hashing Encoding

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = HashingVectorizer(stop_words="english", alternate_sign=False, n_features=2 ** 10)
X_train = vectorizer.fit_transform(data_train.data)
X_test = vectorizer.transform(data_test.data)
print("X_train.shape:{}, X_test.shape:{}".format(X_train.shape, X_test.shape))
# Output:
# X_train.shape:(2034, 1024), X_test.shape:(1353, 1024)
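
Two points worth noting about HashingVectorizer: it is stateless (fit_transform does nothing more than transform, since no vocabulary is learned), and alternate_sign=False keeps all feature values non-negative, which the chi-square test in the next step requires. With only 2 ** 10 buckets, different words inevitably collide into the same column; a toy sketch (the documents and n_features=8 are made up purely for illustration):

from sklearn.feature_extraction.text import HashingVectorizer
toy_docs = ["space shuttle launch", "graphics card image"]
hv = HashingVectorizer(stop_words="english", alternate_sign=False, n_features=8)
print(hv.transform(toy_docs).toarray())  # non-negative values hashed into 8 columns (l2-normalized by default)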


1.3 Chi-Square Filtering

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
select_kbest = SelectKBest(chi2, k=200)
X_train = select_kbest.fit_transform(X_train, y_train)
X_test = select_kbest.transform(X_test)
print("X_train.shape:{}, X_test.shape:{}".format(X_train.shape, X_test.shape))
# Output:
# X_train.shape:(2034, 200), X_test.shape:(1353, 200)
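
After fitting, SelectKBest exposes scores_, pvalues_ and get_support(), which show which of the 1024 hash buckets were kept and how strong their chi-square statistics are. A minimal sketch, assuming select_kbest has been fit as above:

import numpy as np
kept = np.where(select_kbest.get_support())[0]  # indices of the 200 buckets that survived
print(kept[:10])
print(select_kbest.scores_[kept][:10])          # their chi-square scores (higher = more class-dependent)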


For more detail on hashing encoding and chi-square filtering, see my two earlier notes:


1. Summary of sklearn feature extraction methods (dict, text, and image features)


2. Summary of sklearn dimensionality reduction methods (variance filtering, chi-square, F-test, mutual information, embedded methods)


2. Training Machine-Learning Models


2.1 SVM (LinearSVC) Test

from time import time
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
t0 = time()
clf = LinearSVC(dual=False, tol=1e-3)
clf.fit(X_train, y_train)
train_time = time() - t0
print("train time: %0.3fs" % train_time)
t0 = time()    
pred = clf.predict(X_test)
test_time = time() - t0
print("test time:  %0.3fs" % test_time)
test_score = accuracy_score(y_test, pred)
train_score = accuracy_score(y_train, clf.predict(X_train))
print("test accuracy:   %0.3f" % test_score, 
      "train accuracy:   %0.3f" % train_score)


Output:


train time: 0.020s
test time:  0.001s
test accuracy:   0.662 train accuracy:   0.775


2.2 Random Forest Test

from time import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
t0 = time()
clf = RandomForestClassifier(verbose=1, random_state=42)
clf.fit(X_train, y_train)
train_time = time() - t0
print("train time: %0.3fs" % train_time)
t0 = time()    
pred = clf.predict(X_test)
test_time = time() - t0
print("test time:  %0.3fs" % test_time)
test_score = accuracy_score(y_test, pred)
train_score = accuracy_score(y_train, clf.predict(X_train))
print("test accuracy:   %0.3f" % test_score, 
      "train accuracy:   %0.3f" % train_score)


Output:


train time: 0.466s
test time:  0.032s
test accuracy:   0.617 train accuracy:   0.964


3. Building and Testing a Neural Network


At its core, text classification only requires the text to be encoded as features, and that step has already been done by the hashing encoding and chi-square filtering above. Each document is now a 200-dimensional feature vector, so we can use these features to build and train a neural network directly.


So below I put together a simple multilayer perceptron for training: just three fully connected layers, nothing specially designed (to be honest, I wouldn't know how to design anything fancier...).


3.1 Defining the Network

import torch.nn as nn
class MLP(nn.Module):
    # A plain 3-layer MLP: 200-dim input -> 256 -> 512 -> 4 classes
    def __init__(self, embedding=200, n_class=4):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(embedding, 256),
            nn.Dropout(p=0.6),
            nn.ReLU(),
            nn.Linear(256, 512),
            nn.Dropout(p=0.6),
            nn.ReLU(),
            nn.Linear(512, n_class),
        )
    def forward(self, x):
        return self.model(x)


3.2 Training the Network

  • Data format conversion

Before training, the data format needs to be converted. The matrix produced by the chi-square filter is a scipy sparse matrix, so it has to be densified into a NumPy array with .toarray(), e.g. X_train = X_train.toarray().


As for which exact format to convert to after that, you can simply run the code and follow the error messages, or just use the conversions below, which should be correct as written.


import torch
X_train = torch.tensor(X_train.toarray()).float()
X_test = torch.tensor(X_test.toarray()).float()
y_train = torch.tensor(y_train).long()
y_test = torch.tensor(y_test).long()


  • Training the network

Once the format conversion is done, the network can be trained. Reference code:


import torch
from torch import optim
from time import time
from sklearn.metrics import accuracy_score
# Hyperparameters
epochsize = 500
learning_rate = 1e-3
best_acc = 0
model = MLP()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
# Training loop (full batch: the whole training set each epoch)
t0 = time()
for epoch in range(epochsize):
    model.train()
    # Forward pass and loss
    pred = model(X_train)
    loss = criterion(pred, y_train)
    # Backpropagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Evaluate on the test set each epoch and keep the best accuracy
    # print(epoch, 'loss:', loss.item())
    model.eval()
    with torch.no_grad():
        category = model(X_test)
        pred = category.argmax(dim=1)
        score = accuracy_score(y_test, pred)
    if best_acc < score:
        best_acc = score
train_time = time() - t0
print("_" * 80)
print("best acc:{}".format(best_acc))
print("train time: %0.3fs" % train_time)


Output:


________________________________________________________________________________
best acc:0.6688839615668883
train time: 2.777s


In the end, the neural network's result is roughly on par with the SVM: whether ensemble method, support vector machine, or neural network, the final test accuracy lands in roughly the same 62%-67% range.


4. Weighted (TF-IDF) Encoding


For text classification, if you want to know which word features matter most for assigning a document to a given category, you can use weighted (TF-IDF) encoding together with a classifier that assigns per-feature weights (for example an SVM).


4.1 Getting the Indices of the N Largest Values in a List

This subsection introduces a small trick, namely how to use the heapq library; see reference 3 for the original write-up.


  • For a list with no duplicate values
import heapq
n = 3
lis = [2, 4, 5, 1, 7]
re1 = map(lis.index, heapq.nlargest(n, lis))  # indices of the n largest values (nsmallest for the smallest, nlargest for the largest)
re2 = heapq.nlargest(n, lis)                  # the n largest values themselves
print(list(re1))  # re1 is a map object, not a list, so wrap it in list() before printing
print(re2)


  • For a list with duplicate values
import heapq
lis = [2, 4, 4, 1, 0]
n = 3
max_number = heapq.nlargest(n, lis)
max_index = []
for t in max_number:
    index = lis.index(t)
    max_index.append(index)
    lis[index] = float("-inf")  # overwrite the found entry so the next .index() call finds the next occurrence
print(max_number)
print(max_index)


4.2 Viewing the N Most Informative Feature Names

Implementation with np.argsort (note: clf and vectorizer here are the TF-IDF vectorizer and LinearSVC fitted in section 4.3 below, not the hashed features)

import numpy as np
# Show the 10 most informative feature words for each class
def show_top10(classifier, vectorizer, categories):
    # Get the feature-index-to-word mapping
    # vectorizer.get_feature_names_out(): returns the words as an array
    # vectorizer.vocabulary_: returns them as a dict {word: index}
    feature_names = vectorizer.get_feature_names_out()
    for i, category in enumerate(categories):
        # clf.coef_.shape: (4, 26576) -- the weight of each word for each class
        # np.argsort returns the indices that would sort the array; take the last 10 (largest weights)
        top10 = np.argsort(classifier.coef_[i])[-10:]
        print("%s::  %s" % (category, " ".join(feature_names[top10])))
show_top10(clf, vectorizer, data_train.target_names)


Output:


alt.atheism::  nanci islamic deletion motto islam atheist bobby atheists religion atheism
comp.graphics::  card images 42 looking hi computer 3d file image graphics
sci.space::  flight mars solar moon shuttle spacecraft launch nasa orbit space
talk.religion.misc::  commandment koresh blood jesus children rosicrucian christ fbi christians christian


Code source: see reference 4.


Implementation with heapq

import numpy as np
import heapq
# Show the top_k most informative feature words for each class
def show_top10(classifier, vectorizer, categories, top_k=10):
    # Get the feature-index-to-word mapping
    # vectorizer.get_feature_names_out(): returns the words as an array
    # vectorizer.vocabulary_: returns them as a dict {word: index}
    feature_names = vectorizer.get_feature_names_out()
    for i, category in enumerate(categories):
        # Use heapq to get the indices of the top_k largest weights
        top = map(list(classifier.coef_[i]).index, heapq.nlargest(top_k, classifier.coef_[i]))
        top = list(top)
        print("%s::  %s" % (category, " ".join(feature_names[top])))
show_top10(clf, vectorizer, data_train.target_names)


Output:


alt.atheism::  atheism religion atheists bobby atheist islam motto deletion islamic nanci
comp.graphics::  graphics image file 3d computer hi looking 42 images card
sci.space::  space orbit nasa launch spacecraft shuttle moon solar mars flight
talk.religion.misc::  christian christians fbi christ rosicrucian children jesus blood koresh commandment


As you can see, the two methods produce the same set of words; the only difference is that the second method also lists them by weight in descending order. From the output it is roughly clear that the larger a word's weight, the more strongly that word is associated with the class.
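
A toy comparison of the two orderings (the weights are made up): np.argsort(w)[-3:] returns the top-3 indices in ascending weight order, whereas the heapq-based lookup returns them from largest to smallest.

import heapq
import numpy as np
w = np.array([0.2, 0.9, 0.5, 0.7])
print(np.argsort(w)[-3:])                                # [2 3 1] -- ascending by weight
print([list(w).index(v) for v in heapq.nlargest(3, w)])  # [1, 3, 2] -- descending by weight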


4.3 Text Classification with Weighted (TF-IDF) Encoding

Reference code:


from time import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
# Weighted (TF-IDF) encoding
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words="english")
X_train = vectorizer.fit_transform(data_train.data)
X_test = vectorizer.transform(data_test.data)
# X_train.shape, X_test.shape: ((2034, 26576), (1353, 26576))
y_train, y_test = data_train.target, data_test.target
# Build and train the classifier
t0 = time()
clf = LinearSVC(penalty='l2', dual=False, tol=1e-3)
clf.fit(X_train, y_train)
train_time = time() - t0
print("train time: %0.3fs" % train_time)
t0 = time()
pred = clf.predict(X_test)
test_time = time() - t0
print("test time: %0.3fs" % test_time)
# Evaluate
test_score = accuracy_score(y_test, pred)
train_score = accuracy_score(y_train, clf.predict(X_train))
print("test accuracy:   %0.3f" % test_score, 
      "train accuracy:   %0.3f" % train_score)


Output:

train time: 0.247s
test time: 0.002s
test accuracy:   0.780 train accuracy:   0.978


Analysis: weighted (TF-IDF) encoding clearly performs better than hashing encoding. The hashing trick is essentially a dimensionality-reduction method: it shortens training time, but it also loses some information, because different words can collide into the same hash bucket. So hashed features classify a bit worse but train a bit faster.
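
If memory allows, a simple way to reduce that information loss is to enlarge the hash space so fewer words collide. A sketch (n_features=2 ** 18 is just an illustrative choice; the resulting accuracy is left for you to measure):

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
big_hash = HashingVectorizer(stop_words="english", alternate_sign=False, n_features=2 ** 18)
Xh_train = big_hash.transform(data_train.data)
Xh_test = big_hash.transform(data_test.data)
svc = LinearSVC(dual=False, tol=1e-3)
svc.fit(Xh_train, data_train.target)
print("test accuracy: %0.3f" % accuracy_score(data_test.target, svc.predict(Xh_test)))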


5. Classification of Text Documents Using Sparse Features


Here I include the official scikit-learn example for my own study; see reference 5 for the full write-up. The accompanying explanations are summarized below.


This is an example showing how scikit-learn can be used to classify documents by topic using a bag-of-words approach. The example uses scipy.sparse matrices to store the features and demonstrates various classifiers that can efficiently handle sparse matrices.


The dataset used in this example is the 20 newsgroups dataset, which is automatically downloaded and then cached.


5.1 Parameter Setup

# Author: Peter Prettenhofer <peter.prettenhofer@gmail.com>
#         Olivier Grisel <olivier.grisel@ensta.org>
#         Mathieu Blondel <mathieu@mblondel.org>
#         Lars Buitinck
# License: BSD 3 clause
import logging
import numpy as np
from optparse import OptionParser
import sys
from time import time
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.naive_bayes import BernoulliNB, ComplementNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestCentroid
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.extmath import density
from sklearn import metrics
# Display progress logs on stdout
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
op = OptionParser()
op.add_option(
    "--report",
    action="store_true",
    dest="print_report",
    help="Print a detailed classification report.",
)
op.add_option(
    "--chi2_select",
    action="store",
    type="int",
    dest="select_chi2",
    help="Select some number of features using a chi-squared test",
)
op.add_option(
    "--confusion_matrix",
    action="store_true",
    dest="print_cm",
    help="Print the confusion matrix.",
)
op.add_option(
    "--top10",
    action="store_true",
    dest="print_top10",
    help="Print ten most discriminative terms per class for every classifier.",
)
op.add_option(
    "--all_categories",
    action="store_true",
    dest="all_categories",
    help="Whether to use all categories or not.",
)
op.add_option("--use_hashing", action="store_true", help="Use a hashing vectorizer.")
op.add_option(
    "--n_features",
    action="store",
    type=int,
    default=2 ** 16,
    help="n_features when using the hashing vectorizer.",
)
op.add_option(
    "--filtered",
    action="store_true",
    help=(
        "Remove newsgroup information that is easily overfit: "
        "headers, signatures, and quoting."
    ),
)
def is_interactive():
    return not hasattr(sys.modules["__main__"], "__file__")
# work-around for Jupyter notebook and IPython console
argv = [] if is_interactive() else sys.argv[1:]
(opts, args) = op.parse_args(argv)
if len(args) > 0:
    op.error("this script takes no arguments.")
    sys.exit(1)
print(__doc__)
op.print_help()
print()


Out:


Usage: plot_document_classification_20newsgroups.py [options]
Options:
  -h, --help            show this help message and exit
  --report              Print a detailed classification report.
  --chi2_select=SELECT_CHI2
                        Select some number of features using a chi-squared
                        test
  --confusion_matrix    Print the confusion matrix.
  --top10               Print ten most discriminative terms per class for
                        every classifier.
  --all_categories      Whether to use all categories or not.
  --use_hashing         Use a hashing vectorizer.
  --n_features=N_FEATURES
                        n_features when using the hashing vectorizer.
  --filtered            Remove newsgroup information that is easily overfit:
                        headers, signatures, and quoting.


5.2 Loading Data from the Training Set

Let's load data from the newsgroups dataset, which comprises around 18,000 newsgroup posts on 20 topics split into two subsets: one for training (or development) and the other for testing (or performance evaluation).


if opts.all_categories:
    categories = None
else:
    categories = [
        "alt.atheism",
        "talk.religion.misc",
        "comp.graphics",
        "sci.space",
    ]
if opts.filtered:
    remove = ("headers", "footers", "quotes")
else:
    remove = ()
print("Loading 20 newsgroups dataset for categories:")
print(categories if categories else "all")
data_train = fetch_20newsgroups(
    subset="train", categories=categories, shuffle=True, random_state=42, remove=remove
)
data_test = fetch_20newsgroups(
    subset="test", categories=categories, shuffle=True, random_state=42, remove=remove
)
print("data loaded")
# order of labels in `target_names` can be different from `categories`
target_names = data_train.target_names
def size_mb(docs):
    return sum(len(s.encode("utf-8")) for s in docs) / 1e6
data_train_size_mb = size_mb(data_train.data)
data_test_size_mb = size_mb(data_test.data)
print(
    "%d documents - %0.3fMB (training set)" % (len(data_train.data), data_train_size_mb)
)
print("%d documents - %0.3fMB (test set)" % (len(data_test.data), data_test_size_mb))
print("%d categories" % len(target_names))
print()
# split a training set and a test set
y_train, y_test = data_train.target, data_test.target
print("Extracting features from the training data using a sparse vectorizer")
t0 = time()
if opts.use_hashing:
    vectorizer = HashingVectorizer(
        stop_words="english", alternate_sign=False, n_features=opts.n_features
    )
    X_train = vectorizer.transform(data_train.data)
else:
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words="english")
    X_train = vectorizer.fit_transform(data_train.data)
duration = time() - t0
print("done in %fs at %0.3fMB/s" % (duration, data_train_size_mb / duration))
print("n_samples: %d, n_features: %d" % X_train.shape)
print()
print("Extracting features from the test data using the same vectorizer")
t0 = time()
X_test = vectorizer.transform(data_test.data)
duration = time() - t0
print("done in %fs at %0.3fMB/s" % (duration, data_test_size_mb / duration))
print("n_samples: %d, n_features: %d" % X_test.shape)
print()
# mapping from integer feature name to original token string
if opts.use_hashing:
    feature_names = None
else:
    feature_names = vectorizer.get_feature_names_out()
if opts.select_chi2:
    print("Extracting %d best features by a chi-squared test" % opts.select_chi2)
    t0 = time()
    ch2 = SelectKBest(chi2, k=opts.select_chi2)
    X_train = ch2.fit_transform(X_train, y_train)
    X_test = ch2.transform(X_test)
    if feature_names is not None:
        # keep selected feature names
        feature_names = feature_names[ch2.get_support()]
    print("done in %fs" % (time() - t0))
    print()
def trim(s):
    """Trim string to fit on terminal (assuming 80-column display)"""
    return s if len(s) <= 80 else s[:77] + "..."


Out:


Loading 20 newsgroups dataset for categories:
['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
data loaded
2034 documents - 3.980MB (training set)
1353 documents - 2.867MB (test set)
4 categories
Extracting features from the training data using a sparse vectorizer
done in 0.383082s at 10.388MB/s
n_samples: 2034, n_features: 33809
Extracting features from the test data using the same vectorizer
done in 0.236998s at 12.099MB/s
n_samples: 1353, n_features: 33809


5.3 Benchmarking Classifiers

Train and test the dataset with 15 different classification models and obtain performance results for each model.


def benchmark(clf):
    print("_" * 80)
    print("Training: ")
    print(clf)
    t0 = time()
    clf.fit(X_train, y_train)
    train_time = time() - t0
    print("train time: %0.3fs" % train_time)
    t0 = time()
    pred = clf.predict(X_test)
    test_time = time() - t0
    print("test time:  %0.3fs" % test_time)
    score = metrics.accuracy_score(y_test, pred)
    print("accuracy:   %0.3f" % score)
    if hasattr(clf, "coef_"):
        print("dimensionality: %d" % clf.coef_.shape[1])
        print("density: %f" % density(clf.coef_))
        if opts.print_top10 and feature_names is not None:
            print("top 10 keywords per class:")
            for i, label in enumerate(target_names):
                top10 = np.argsort(clf.coef_[i])[-10:]
                print(trim("%s: %s" % (label, " ".join(feature_names[top10]))))
        print()
    if opts.print_report:
        print("classification report:")
        print(metrics.classification_report(y_test, pred, target_names=target_names))
    if opts.print_cm:
        print("confusion matrix:")
        print(metrics.confusion_matrix(y_test, pred))
    print()
    clf_descr = str(clf).split("(")[0]
    return clf_descr, score, train_time, test_time
results = []
for clf, name in (
    (RidgeClassifier(tol=1e-2, solver="sag"), "Ridge Classifier"),
    (Perceptron(max_iter=50), "Perceptron"),
    (PassiveAggressiveClassifier(max_iter=50), "Passive-Aggressive"),
    (KNeighborsClassifier(n_neighbors=10), "kNN"),
    (RandomForestClassifier(), "Random forest"),
):
    print("=" * 80)
    print(name)
    results.append(benchmark(clf))
for penalty in ["l2", "l1"]:
    print("=" * 80)
    print("%s penalty" % penalty.upper())
    # Train Liblinear model
    results.append(benchmark(LinearSVC(penalty=penalty, dual=False, tol=1e-3)))
    # Train SGD model
    results.append(benchmark(SGDClassifier(alpha=0.0001, max_iter=50, penalty=penalty)))
# Train SGD with Elastic Net penalty
print("=" * 80)
print("Elastic-Net penalty")
results.append(
    benchmark(SGDClassifier(alpha=0.0001, max_iter=50, penalty="elasticnet"))
)
# Train NearestCentroid without threshold
print("=" * 80)
print("NearestCentroid (aka Rocchio classifier)")
results.append(benchmark(NearestCentroid()))
# Train sparse Naive Bayes classifiers
print("=" * 80)
print("Naive Bayes")
results.append(benchmark(MultinomialNB(alpha=0.01)))
results.append(benchmark(BernoulliNB(alpha=0.01)))
results.append(benchmark(ComplementNB(alpha=0.1)))
print("=" * 80)
print("LinearSVC with L1-based feature selection")
# The smaller C, the stronger the regularization.
# The more regularization, the more sparsity.
results.append(
    benchmark(
        Pipeline(
            [
                (
                    "feature_selection",
                    SelectFromModel(LinearSVC(penalty="l1", dual=False, tol=1e-3)),
                ),
                ("classification", LinearSVC(penalty="l2")),
            ]
        )
    )
)


Out:


================================================================================
Ridge Classifier
________________________________________________________________________________
Training:
RidgeClassifier(solver='sag', tol=0.01)
/home/circleci/project/sklearn/linear_model/_ridge.py:729: UserWarning: "sag" solver requires many iterations to fit an intercept with sparse inputs. Either set the solver to "auto" or "sparse_cg", or set a low "tol" and a high "max_iter" (especially if inputs are not standardized).
  warnings.warn(
train time: 0.167s
test time:  0.001s
accuracy:   0.898
dimensionality: 33809
density: 1.000000
================================================================================
Perceptron
________________________________________________________________________________
Training:
Perceptron(max_iter=50)
train time: 0.015s
test time:  0.001s
accuracy:   0.888
dimensionality: 33809
density: 0.255302
================================================================================
Passive-Aggressive
________________________________________________________________________________
Training:
PassiveAggressiveClassifier(max_iter=50)
train time: 0.027s
test time:  0.001s
accuracy:   0.902
dimensionality: 33809
density: 0.711867
================================================================================
kNN
________________________________________________________________________________
Training:
KNeighborsClassifier(n_neighbors=10)
train time: 0.001s
test time:  0.148s
accuracy:   0.858
================================================================================
Random forest
________________________________________________________________________________
Training:
RandomForestClassifier()
train time: 1.258s
test time:  0.079s
accuracy:   0.826
================================================================================
L2 penalty
________________________________________________________________________________
Training:
LinearSVC(dual=False, tol=0.001)
train time: 0.072s
test time:  0.001s
accuracy:   0.900
dimensionality: 33809
density: 1.000000
________________________________________________________________________________
Training:
SGDClassifier(max_iter=50)
train time: 0.024s
test time:  0.001s
accuracy:   0.903
dimensionality: 33809
density: 0.579424
================================================================================
L1 penalty
________________________________________________________________________________
Training:
LinearSVC(dual=False, penalty='l1', tol=0.001)
train time: 0.176s
test time:  0.001s
accuracy:   0.873
dimensionality: 33809
density: 0.005553
________________________________________________________________________________
Training:
SGDClassifier(max_iter=50, penalty='l1')
train time: 0.092s
test time:  0.002s
accuracy:   0.880
dimensionality: 33809
density: 0.022509
================================================================================
Elastic-Net penalty
________________________________________________________________________________
Training:
SGDClassifier(max_iter=50, penalty='elasticnet')
train time: 0.134s
test time:  0.001s
accuracy:   0.901
dimensionality: 33809
density: 0.184685
================================================================================
NearestCentroid (aka Rocchio classifier)
________________________________________________________________________________
Training:
NearestCentroid()
train time: 0.004s
test time:  0.002s
accuracy:   0.855
================================================================================
Naive Bayes
________________________________________________________________________________
Training:
MultinomialNB(alpha=0.01)
train time: 0.003s
test time:  0.001s
accuracy:   0.899
/home/circleci/project/sklearn/utils/deprecation.py:103: FutureWarning: Attribute `coef_` was deprecated in version 0.24 and will be removed in 1.1 (renaming of 0.26).
  warnings.warn(msg, category=FutureWarning)
dimensionality: 33809
density: 1.000000
________________________________________________________________________________
Training:
BernoulliNB(alpha=0.01)
train time: 0.005s
test time:  0.004s
accuracy:   0.884
/home/circleci/project/sklearn/utils/deprecation.py:103: FutureWarning: Attribute `coef_` was deprecated in version 0.24 and will be removed in 1.1 (renaming of 0.26).
  warnings.warn(msg, category=FutureWarning)
dimensionality: 33809
density: 1.000000
________________________________________________________________________________
Training:
ComplementNB(alpha=0.1)
train time: 0.003s
test time:  0.001s
accuracy:   0.911
/home/circleci/project/sklearn/utils/deprecation.py:103: FutureWarning: Attribute `coef_` was deprecated in version 0.24 and will be removed in 1.1 (renaming of 0.26).
  warnings.warn(msg, category=FutureWarning)
dimensionality: 33809
density: 1.000000
================================================================================
LinearSVC with L1-based feature selection
________________________________________________________________________________
Training:
Pipeline(steps=[('feature_selection',
                 SelectFromModel(estimator=LinearSVC(dual=False, penalty='l1',
                                                     tol=0.001))),
                ('classification', LinearSVC())])
train time: 0.192s
test time:  0.002s
accuracy:   0.879


5.4 Visualization

The bar plot shows the accuracy, training time (normalized), and test time (normalized) of each classifier.


indices = np.arange(len(results))
results = [[x[i] for x in results] for i in range(4)]
clf_names, score, training_time, test_time = results
training_time = np.array(training_time) / np.max(training_time)
test_time = np.array(test_time) / np.max(test_time)
plt.figure(figsize=(12, 8))
plt.title("Score")
plt.barh(indices, score, 0.2, label="score", color="navy")
plt.barh(indices + 0.3, training_time, 0.2, label="training time", color="c")
plt.barh(indices + 0.6, test_time, 0.2, label="test time", color="darkorange")
plt.yticks(())
plt.legend(loc="best")
plt.subplots_adjust(left=0.25)
plt.subplots_adjust(top=0.95)
plt.subplots_adjust(bottom=0.05)
for i, c in zip(indices, clf_names):
    plt.text(-0.3, i, c)
plt.show()


Out:

[Figure: bar chart of accuracy, normalized training time, and normalized test time for each classifier]


References:


1. Summary of sklearn feature extraction methods (dict, text, and image features)


2. Summary of sklearn dimensionality reduction methods (variance filtering, chi-square, F-test, mutual information, embedded methods)


3. Getting the indices of the N largest values in a Python list


4. The 20 newsgroups text dataset


5. Classification of text documents using sparse features

