Here I use a dataset that ships with sklearn: the 20 newsgroups text classification data, which contains 20 classes and 18,846 samples, 11,314 of which are training samples. Its class names are:
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
Of these, four categories are selected for the text classification task: "alt.atheism", "talk.religion.misc", "comp.graphics", "sci.space".
1. Text Feature Extraction
1.1 Loading the Dataset
from sklearn.datasets import fetch_20newsgroups

categories = [
    "alt.atheism",
    "talk.religion.misc",
    "comp.graphics",
    "sci.space",
]
remove = ("headers", "footers", "quotes")

data_train = fetch_20newsgroups(
    data_home='./dataset/', subset='train', categories=categories,
    remove=remove, random_state=42
)
data_test = fetch_20newsgroups(
    data_home='./dataset/', subset='test', categories=categories,
    remove=remove, random_state=42
)

data_train.target_names, data_test.target_names
# Output:
# (['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc'],
#  ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc'])

y_train, y_test = data_train.target, data_test.target
PS: the remove argument strips each file's headers, signature blocks, and quoted blocks, which makes the task more realistic. Without this, the classifiers overfit heavily on a lot of incidental content:
- Almost every group is distinguished by whether headers such as NNTP-Posting-Host: and Distribution: appear more or less often.
- Another significant feature is whether the sender is affiliated with a university, as indicated by their headers or signature.
- The word "article" is a significant feature, based on how often people quote previous posts like this: "In article [article ID], [name] <[e-mail address]> wrote:".
- Other features match the names and e-mail addresses of particular people who were posting at the time.
With such an abundance of clues that distinguish the newsgroups, the classifiers barely need to identify topics from the text at all, and they all perform at the same high level. That is why the ('headers', 'footers', 'quotes') information is removed here (a quick comparison is sketched below).
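To make this concrete, here is a minimal sketch (my addition, not part of the original walkthrough; the exact numbers will vary) that trains the same kind of TF-IDF + LinearSVC pipeline once with the metadata kept and once with it removed. The stripped version typically scores noticeably lower, i.e. more honestly:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

categories = ["alt.atheism", "talk.religion.misc", "comp.graphics", "sci.space"]

for remove in [(), ("headers", "footers", "quotes")]:
    train = fetch_20newsgroups(subset="train", categories=categories, remove=remove)
    test = fetch_20newsgroups(subset="test", categories=categories, remove=remove)

    vec = TfidfVectorizer(stop_words="english")
    X_tr = vec.fit_transform(train.data)
    X_te = vec.transform(test.data)

    clf = LinearSVC(dual=False, tol=1e-3).fit(X_tr, train.target)
    acc = accuracy_score(test.target, clf.predict(X_te))
    print("remove=%s -> test accuracy: %0.3f" % (remove, acc))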
1.2 Hashing Vectorization
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(stop_words="english", alternate_sign=False,
                               n_features=2 ** 10)
X_train = vectorizer.fit_transform(data_train.data)
X_test = vectorizer.transform(data_test.data)

print("X_train.shape:{}, X_test.shape:{}".format(X_train.shape, X_test.shape))
# Output:
# X_train.shape:(2034, 1024), X_test.shape:(1353, 1024)
1.3 Chi-Squared Feature Selection
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

select_kbest = SelectKBest(chi2, k=200)
X_train = select_kbest.fit_transform(X_train, y_train)
X_test = select_kbest.transform(X_test)

print("X_train.shape:{}, X_test.shape:{}".format(X_train.shape, X_test.shape))
# Output:
# X_train.shape:(2034, 200), X_test.shape:(1353, 200)
For more detail on hashing vectorization and chi-squared feature selection, see my two earlier notes:
1. Summary of sklearn feature extraction methods (dict, text, and image features)
2. Summary of sklearn dimensionality reduction methods (variance threshold, chi-squared, F-test, mutual information, embedded methods)
2. Training Machine Learning Models
2.1 SVM Model
from time import time
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

t0 = time()
clf = LinearSVC(dual=False, tol=1e-3)
clf.fit(X_train, y_train)
train_time = time() - t0
print("train time: %0.3fs" % train_time)

t0 = time()
pred = clf.predict(X_test)
test_time = time() - t0
print("test time: %0.3fs" % test_time)

test_score = accuracy_score(y_test, pred)
train_score = accuracy_score(y_train, clf.predict(X_train))
print("test accuracy: %0.3f" % test_score, "train accuracy: %0.3f" % train_score)
Output:
train time: 0.020s
test time: 0.001s
test accuracy: 0.662 train accuracy: 0.775
2.2 Random Forest Model
from time import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

t0 = time()
clf = RandomForestClassifier(verbose=1, random_state=42)
clf.fit(X_train, y_train)
train_time = time() - t0
print("train time: %0.3fs" % train_time)

t0 = time()
pred = clf.predict(X_test)
test_time = time() - t0
print("test time: %0.3fs" % test_time)

test_score = accuracy_score(y_test, pred)
train_score = accuracy_score(y_train, clf.predict(X_train))
print("test accuracy: %0.3f" % test_score, "train accuracy: %0.3f" % train_score)
Output:
train time: 0.466s
test time: 0.032s
test accuracy: 0.617 train accuracy: 0.964
3. Building and Testing a Neural Network
In fact, text classification essentially boils down to encoding the text as features, and that step has already been done by the hashing vectorizer and the chi-squared filter: every document is now encoded as a 200-dimensional feature vector. With these features we can build a neural network and train it.
So below I put together a simple multilayer perceptron: just three fully connected layers, with no special design (honestly, I wouldn't know how to design anything fancier anyway...).
3.1 Defining the Network
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, embedding=200, n_class=4):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(embedding, 256),
            nn.Dropout(p=0.6),
            nn.ReLU(),
            nn.Linear(256, 512),
            nn.Dropout(p=0.6),
            nn.ReLU(),
            nn.Linear(512, n_class),
        )

    def forward(self, x):
        return self.model(x)
3.2 Training the Network
- Data format conversion
Before training, the data format needs to be converted. The matrix produced by the chi-squared filter is a scipy sparse matrix, so it has to be converted to a dense numpy array with .toarray(), e.g. X_train = X_train.toarray().
As for which exact format to convert to afterwards, you can simply run the code and follow the error messages at the point of failure, or just use the conversion below, which should be essentially correct.
import torch

X_train = torch.tensor(X_train.toarray()).float()
X_test = torch.tensor(X_test.toarray()).float()
y_train = torch.tensor(y_train).long()
y_test = torch.tensor(y_test).long()
- Training loop
Once the data formats are converted, the network can be trained. Reference code:
import torch
from torch import optim
from time import time
from sklearn.metrics import accuracy_score

# Hyperparameters
epochsize = 500
learning_rate = 1e-3
best_acc = 0

model = MLP()
criteon = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Training
t0 = time()
for epoch in range(epochsize):

    model.train()
    # Compute the loss
    pred = model(X_train)
    loss = criteon(pred, y_train)

    # Backpropagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Track the best test accuracy during training
    # print(epoch, 'loss:', loss.item())
    model.eval()
    with torch.no_grad():
        category = model(X_test)
        pred = category.argmax(dim=1)
        score = accuracy_score(y_test, pred)
        if best_acc < score:
            best_acc = score

train_time = time() - t0
print("_" * 80)
print("best acc:{}".format(best_acc))
print("train time: %0.3fs" % train_time)
Output:
________________________________________________________________________________
best acc:0.6688839615668883
train time: 2.777s
In the end, the neural network's result appears to be about the same as the SVM's: whether it is the ensemble method, the support vector machine, or the neural network, the final accuracy is around 66%.
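To see where those ~66% come from, a per-class breakdown helps. The sketch below is my addition, not one of the original experiments: it refits the linear SVM on the 200-dimensional chi-squared features from section 1.3 and prints a classification report; it assumes X_train/X_test/y_train/y_test are still in their scipy/numpy form, i.e. before the torch tensor conversion of section 3.2.

from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, confusion_matrix

# Assumes the 200-dimensional chi-squared features from section 1.3
# (scipy sparse matrices / numpy label arrays, not the torch tensors).
clf = LinearSVC(dual=False, tol=1e-3)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

print(classification_report(y_test, pred, target_names=data_train.target_names))
print(confusion_matrix(y_test, pred))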
4. Weight Vector Encoding
For text classification, if you want to know which word features matter most for assigning a document to a class, you can use a weighted vector encoding together with a classifier that learns per-feature weights (such as a linear SVM).
4.1 Getting the Indices of the N Largest Values in a List
This subsection introduces a small trick, namely how to use the heapq library for this; see reference 3.
- For a list without duplicate values
import heapq

n = 3
lis = [2, 4, 5, 1, 7]
re1 = map(lis.index, heapq.nlargest(n, lis))  # indices of the n largest values
                                              # (nsmallest finds the smallest, nlargest the largest)
re2 = heapq.nlargest(n, lis)                  # the n largest elements themselves
print(list(re1))  # re1 is a map object rather than a list, so wrap it in list() before printing
print(re2)
- For a list with duplicate values (a cleaner one-liner is shown after this example)
import heapq

lis = [2, 4, 4, 1, 0]
n = 3
max_number = heapq.nlargest(n, lis)
max_index = []
for t in max_number:
    index = lis.index(t)
    max_index.append(index)
    lis[index] = 0  # overwrite the found value so that duplicates map to different indices
print(max_number)
print(max_index)
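As a side note (my addition), heapq.nlargest also accepts a key function, which yields the indices directly, handles duplicates, and leaves the list untouched:

import heapq

lis = [2, 4, 4, 1, 0]
n = 3

# Rank the index range by the corresponding list value; duplicate values
# keep distinct indices and `lis` is not modified.
max_index = heapq.nlargest(n, range(len(lis)), key=lis.__getitem__)
print(max_index)                     # e.g. [1, 2, 0]
print([lis[i] for i in max_index])   # e.g. [4, 4, 2]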
4.2 Viewing the N Most Informative Feature Names
Implementation with np.argsort (note: the clf and vectorizer used below are the TfidfVectorizer and LinearSVC trained in section 4.3, since the hashed features have no recoverable feature names):
import numpy as np

# Show the 10 most informative feature words per class
def show_top10(classifier, vectorizer, categories):
    # Get the mapping from feature index to token:
    # vectorizer.get_feature_names_out() returns it as an array,
    # vectorizer.vocabulary_ as a dict
    feature_names = vectorizer.get_feature_names_out()
    for i, category in enumerate(categories):
        # classifier.coef_.shape: (4, 26576), i.e. one weight per word per class
        # np.argsort returns the indices that would sort the array,
        # so sort first and then take the last 10 (largest weights)
        top10 = np.argsort(classifier.coef_[i])[-10:]
        print("%s:: %s" % (category, " ".join(feature_names[top10])))

show_top10(clf, vectorizer, data_train.target_names)
Output:
alt.atheism:: nanci islamic deletion motto islam atheist bobby atheists religion atheism
comp.graphics:: card images 42 looking hi computer 3d file image graphics
sci.space:: flight mars solar moon shuttle spacecraft launch nasa orbit space
talk.religion.misc:: commandment koresh blood jesus children rosicrucian christ fbi christians christian
Code adapted from reference 4.
Implementation with heapq
import numpy as np
import heapq

# Show the 10 most informative feature words per class
def show_top10(classifier, vectorizer, categories, top_k=10):
    # Get the mapping from feature index to token:
    # vectorizer.get_feature_names_out() returns it as an array,
    # vectorizer.vocabulary_ as a dict
    feature_names = vectorizer.get_feature_names_out()
    for i, category in enumerate(categories):
        # Use heapq to get the indices of the top_k largest weights
        top = map(list(classifier.coef_[i]).index, heapq.nlargest(top_k, classifier.coef_[i]))
        top = list(top)
        print("%s:: %s" % (category, " ".join(feature_names[top])))

show_top10(clf, vectorizer, data_train.target_names)
Output:
alt.atheism:: atheism religion atheists bobby atheist islam motto deletion islamic nanci
comp.graphics:: graphics image file 3d computer hi looking 42 images card
sci.space:: space orbit nasa launch spacecraft shuttle moon solar mars flight
talk.religion.misc:: christian christians fbi christ rosicrucian children jesus blood koresh commandment
As you can see, the two methods produce the same set of words; the second one additionally lists them in descending order of weight. From the output we can roughly conclude that the larger a word's weight, the more strongly it is associated with the corresponding class.
4.3 Text Classification with the Weighted Encoding
Reference code:
from time import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# TF-IDF weighted encoding
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words="english")
X_train = vectorizer.fit_transform(data_train.data)
X_test = vectorizer.transform(data_test.data)
# X_train.shape, X_test.shape: ((2034, 26576), (1353, 26576))
y_train, y_test = data_train.target, data_test.target

# Build and train the classifier
t0 = time()
clf = LinearSVC(penalty='l2', dual=False, tol=1e-3)
clf.fit(X_train, y_train)
train_time = time() - t0
print("train time: %0.3fs" % train_time)

t0 = time()
pred = clf.predict(X_test)
test_time = time() - t0
print("test time: %0.3fs" % test_time)

# Evaluate
test_score = accuracy_score(y_test, pred)
train_score = accuracy_score(y_train, clf.predict(X_train))
print("test accuracy: %0.3f" % test_score, "train accuracy: %0.3f" % train_score)
Output:
train time: 0.247s
test time: 0.002s
test accuracy: 0.780 train accuracy: 0.978
Analysis: the TF-IDF weighted encoding clearly performs better than the hashing encoding. Hashing is essentially a dimensionality reduction technique: it shortens training time but also discards some information through collisions, so the hashed features classify a bit worse while training a bit faster.
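A natural follow-up experiment (my own suggestion, not from the original post): with more hash buckets there are fewer collisions, so the gap to TF-IDF should shrink at the cost of a wider feature matrix. A rough sketch, reusing data_train/data_test and the y_train/y_test labels from the block above:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

for n_features in [2 ** 10, 2 ** 14, 2 ** 18]:
    vec = HashingVectorizer(stop_words="english", alternate_sign=False,
                            n_features=n_features)
    # HashingVectorizer is stateless, so transform() is enough on both sets
    X_tr = vec.transform(data_train.data)
    X_te = vec.transform(data_test.data)

    clf = LinearSVC(dual=False, tol=1e-3).fit(X_tr, y_train)
    print(n_features, "->", accuracy_score(y_test, clf.predict(X_te)))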
5. Classification of Text Documents Using Sparse Features
Here is the official example code, kept for my own study; see reference 5 for details. The accompanying commentary is translated below.
This is an example showing how scikit-learn can be used to classify documents by topic using a bag-of-words approach. The example uses a scipy.sparse matrix to store the features and demonstrates various classifiers that can efficiently handle sparse matrices.
The dataset used in this example is the 20 newsgroups dataset. It will be automatically downloaded and then cached.
5.1 Parameter Setup
# Author: Peter Prettenhofer <peter.prettenhofer@gmail.com>
#         Olivier Grisel <olivier.grisel@ensta.org>
#         Mathieu Blondel <mathieu@mblondel.org>
#         Lars Buitinck
# License: BSD 3 clause

import logging
import numpy as np
from optparse import OptionParser
import sys
from time import time

import matplotlib.pyplot as plt

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.naive_bayes import BernoulliNB, ComplementNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestCentroid
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.extmath import density
from sklearn import metrics


# Display progress logs on stdout
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

op = OptionParser()
op.add_option(
    "--report",
    action="store_true",
    dest="print_report",
    help="Print a detailed classification report.",
)
op.add_option(
    "--chi2_select",
    action="store",
    type="int",
    dest="select_chi2",
    help="Select some number of features using a chi-squared test",
)
op.add_option(
    "--confusion_matrix",
    action="store_true",
    dest="print_cm",
    help="Print the confusion matrix.",
)
op.add_option(
    "--top10",
    action="store_true",
    dest="print_top10",
    help="Print ten most discriminative terms per class for every classifier.",
)
op.add_option(
    "--all_categories",
    action="store_true",
    dest="all_categories",
    help="Whether to use all categories or not.",
)
op.add_option("--use_hashing", action="store_true", help="Use a hashing vectorizer.")
op.add_option(
    "--n_features",
    action="store",
    type=int,
    default=2 ** 16,
    help="n_features when using the hashing vectorizer.",
)
op.add_option(
    "--filtered",
    action="store_true",
    help=(
        "Remove newsgroup information that is easily overfit: "
        "headers, signatures, and quoting."
    ),
)


def is_interactive():
    return not hasattr(sys.modules["__main__"], "__file__")


# work-around for Jupyter notebook and IPython console
argv = [] if is_interactive() else sys.argv[1:]
(opts, args) = op.parse_args(argv)
if len(args) > 0:
    op.error("this script takes no arguments.")
    sys.exit(1)

print(__doc__)
op.print_help()
print()
Out:
Usage: plot_document_classification_20newsgroups.py [options]

Options:
  -h, --help            show this help message and exit
  --report              Print a detailed classification report.
  --chi2_select=SELECT_CHI2
                        Select some number of features using a chi-squared test
  --confusion_matrix    Print the confusion matrix.
  --top10               Print ten most discriminative terms per class for every classifier.
  --all_categories      Whether to use all categories or not.
  --use_hashing         Use a hashing vectorizer.
  --n_features=N_FEATURES
                        n_features when using the hashing vectorizer.
  --filtered            Remove newsgroup information that is easily overfit: headers, signatures, and quoting.
5.2 Loading Data from the Training Set
Let's load data from the newsgroups dataset, which comprises around 18,000 newsgroup posts on 20 topics, split into two subsets: one for training (or development) and the other for testing (or for performance evaluation).
if opts.all_categories:
    categories = None
else:
    categories = [
        "alt.atheism",
        "talk.religion.misc",
        "comp.graphics",
        "sci.space",
    ]

if opts.filtered:
    remove = ("headers", "footers", "quotes")
else:
    remove = ()

print("Loading 20 newsgroups dataset for categories:")
print(categories if categories else "all")

data_train = fetch_20newsgroups(
    subset="train", categories=categories, shuffle=True, random_state=42, remove=remove
)

data_test = fetch_20newsgroups(
    subset="test", categories=categories, shuffle=True, random_state=42, remove=remove
)
print("data loaded")

# order of labels in `target_names` can be different from `categories`
target_names = data_train.target_names


def size_mb(docs):
    return sum(len(s.encode("utf-8")) for s in docs) / 1e6


data_train_size_mb = size_mb(data_train.data)
data_test_size_mb = size_mb(data_test.data)

print(
    "%d documents - %0.3fMB (training set)" % (len(data_train.data), data_train_size_mb)
)
print("%d documents - %0.3fMB (test set)" % (len(data_test.data), data_test_size_mb))
print("%d categories" % len(target_names))
print()

# split a training set and a test set
y_train, y_test = data_train.target, data_test.target

print("Extracting features from the training data using a sparse vectorizer")
t0 = time()
if opts.use_hashing:
    vectorizer = HashingVectorizer(
        stop_words="english", alternate_sign=False, n_features=opts.n_features
    )
    X_train = vectorizer.transform(data_train.data)
else:
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words="english")
    X_train = vectorizer.fit_transform(data_train.data)
duration = time() - t0
print("done in %fs at %0.3fMB/s" % (duration, data_train_size_mb / duration))
print("n_samples: %d, n_features: %d" % X_train.shape)
print()

print("Extracting features from the test data using the same vectorizer")
t0 = time()
X_test = vectorizer.transform(data_test.data)
duration = time() - t0
print("done in %fs at %0.3fMB/s" % (duration, data_test_size_mb / duration))
print("n_samples: %d, n_features: %d" % X_test.shape)
print()

# mapping from integer feature name to original token string
if opts.use_hashing:
    feature_names = None
else:
    feature_names = vectorizer.get_feature_names_out()

if opts.select_chi2:
    print("Extracting %d best features by a chi-squared test" % opts.select_chi2)
    t0 = time()
    ch2 = SelectKBest(chi2, k=opts.select_chi2)
    X_train = ch2.fit_transform(X_train, y_train)
    X_test = ch2.transform(X_test)
    if feature_names is not None:
        # keep selected feature names
        feature_names = feature_names[ch2.get_support()]
    print("done in %fs" % (time() - t0))
    print()


def trim(s):
    """Trim string to fit on terminal (assuming 80-column display)"""
    return s if len(s) <= 80 else s[:77] + "..."
Out:
Loading 20 newsgroups dataset for categories:
['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
data loaded
2034 documents - 3.980MB (training set)
1353 documents - 2.867MB (test set)
4 categories

Extracting features from the training data using a sparse vectorizer
done in 0.383082s at 10.388MB/s
n_samples: 2034, n_features: 33809

Extracting features from the test data using the same vectorizer
done in 0.236998s at 12.099MB/s
n_samples: 1353, n_features: 33809
5.3 Building the Classifiers
Train and test the dataset with 15 different classification models and get the performance results for each model.
def benchmark(clf):
    print("_" * 80)
    print("Training: ")
    print(clf)
    t0 = time()
    clf.fit(X_train, y_train)
    train_time = time() - t0
    print("train time: %0.3fs" % train_time)

    t0 = time()
    pred = clf.predict(X_test)
    test_time = time() - t0
    print("test time: %0.3fs" % test_time)

    score = metrics.accuracy_score(y_test, pred)
    print("accuracy: %0.3f" % score)

    if hasattr(clf, "coef_"):
        print("dimensionality: %d" % clf.coef_.shape[1])
        print("density: %f" % density(clf.coef_))

        if opts.print_top10 and feature_names is not None:
            print("top 10 keywords per class:")
            for i, label in enumerate(target_names):
                top10 = np.argsort(clf.coef_[i])[-10:]
                print(trim("%s: %s" % (label, " ".join(feature_names[top10]))))
        print()

    if opts.print_report:
        print("classification report:")
        print(metrics.classification_report(y_test, pred, target_names=target_names))

    if opts.print_cm:
        print("confusion matrix:")
        print(metrics.confusion_matrix(y_test, pred))

    print()
    clf_descr = str(clf).split("(")[0]
    return clf_descr, score, train_time, test_time


results = []
for clf, name in (
    (RidgeClassifier(tol=1e-2, solver="sag"), "Ridge Classifier"),
    (Perceptron(max_iter=50), "Perceptron"),
    (PassiveAggressiveClassifier(max_iter=50), "Passive-Aggressive"),
    (KNeighborsClassifier(n_neighbors=10), "kNN"),
    (RandomForestClassifier(), "Random forest"),
):
    print("=" * 80)
    print(name)
    results.append(benchmark(clf))

for penalty in ["l2", "l1"]:
    print("=" * 80)
    print("%s penalty" % penalty.upper())
    # Train Liblinear model
    results.append(benchmark(LinearSVC(penalty=penalty, dual=False, tol=1e-3)))

    # Train SGD model
    results.append(benchmark(SGDClassifier(alpha=0.0001, max_iter=50, penalty=penalty)))

# Train SGD with Elastic Net penalty
print("=" * 80)
print("Elastic-Net penalty")
results.append(
    benchmark(SGDClassifier(alpha=0.0001, max_iter=50, penalty="elasticnet"))
)

# Train NearestCentroid without threshold
print("=" * 80)
print("NearestCentroid (aka Rocchio classifier)")
results.append(benchmark(NearestCentroid()))

# Train sparse Naive Bayes classifiers
print("=" * 80)
print("Naive Bayes")
results.append(benchmark(MultinomialNB(alpha=0.01)))
results.append(benchmark(BernoulliNB(alpha=0.01)))
results.append(benchmark(ComplementNB(alpha=0.1)))

print("=" * 80)
print("LinearSVC with L1-based feature selection")
# The smaller C, the stronger the regularization.
# The more regularization, the more sparsity.
results.append(
    benchmark(
        Pipeline(
            [
                (
                    "feature_selection",
                    SelectFromModel(LinearSVC(penalty="l1", dual=False, tol=1e-3)),
                ),
                ("classification", LinearSVC(penalty="l2")),
            ]
        )
    )
)
Out:
================================================================================
Ridge Classifier
________________________________________________________________________________
Training:
RidgeClassifier(solver='sag', tol=0.01)
/home/circleci/project/sklearn/linear_model/_ridge.py:729: UserWarning: "sag" solver requires many iterations to fit an intercept with sparse inputs. Either set the solver to "auto" or "sparse_cg", or set a low "tol" and a high "max_iter" (especially if inputs are not standardized).
  warnings.warn(
train time: 0.167s
test time: 0.001s
accuracy: 0.898
dimensionality: 33809
density: 1.000000

================================================================================
Perceptron
________________________________________________________________________________
Training:
Perceptron(max_iter=50)
train time: 0.015s
test time: 0.001s
accuracy: 0.888
dimensionality: 33809
density: 0.255302

================================================================================
Passive-Aggressive
________________________________________________________________________________
Training:
PassiveAggressiveClassifier(max_iter=50)
train time: 0.027s
test time: 0.001s
accuracy: 0.902
dimensionality: 33809
density: 0.711867

================================================================================
kNN
________________________________________________________________________________
Training:
KNeighborsClassifier(n_neighbors=10)
train time: 0.001s
test time: 0.148s
accuracy: 0.858

================================================================================
Random forest
________________________________________________________________________________
Training:
RandomForestClassifier()
train time: 1.258s
test time: 0.079s
accuracy: 0.826

================================================================================
L2 penalty
________________________________________________________________________________
Training:
LinearSVC(dual=False, tol=0.001)
train time: 0.072s
test time: 0.001s
accuracy: 0.900
dimensionality: 33809
density: 1.000000

________________________________________________________________________________
Training:
SGDClassifier(max_iter=50)
train time: 0.024s
test time: 0.001s
accuracy: 0.903
dimensionality: 33809
density: 0.579424

================================================================================
L1 penalty
________________________________________________________________________________
Training:
LinearSVC(dual=False, penalty='l1', tol=0.001)
train time: 0.176s
test time: 0.001s
accuracy: 0.873
dimensionality: 33809
density: 0.005553

________________________________________________________________________________
Training:
SGDClassifier(max_iter=50, penalty='l1')
train time: 0.092s
test time: 0.002s
accuracy: 0.880
dimensionality: 33809
density: 0.022509

================================================================================
Elastic-Net penalty
________________________________________________________________________________
Training:
SGDClassifier(max_iter=50, penalty='elasticnet')
train time: 0.134s
test time: 0.001s
accuracy: 0.901
dimensionality: 33809
density: 0.184685

================================================================================
NearestCentroid (aka Rocchio classifier)
________________________________________________________________________________
Training:
NearestCentroid()
train time: 0.004s
test time: 0.002s
accuracy: 0.855

================================================================================
Naive Bayes
________________________________________________________________________________
Training:
MultinomialNB(alpha=0.01)
train time: 0.003s
test time: 0.001s
accuracy: 0.899
/home/circleci/project/sklearn/utils/deprecation.py:103: FutureWarning: Attribute `coef_` was deprecated in version 0.24 and will be removed in 1.1 (renaming of 0.26).
  warnings.warn(msg, category=FutureWarning)
dimensionality: 33809
density: 1.000000

________________________________________________________________________________
Training:
BernoulliNB(alpha=0.01)
train time: 0.005s
test time: 0.004s
accuracy: 0.884
/home/circleci/project/sklearn/utils/deprecation.py:103: FutureWarning: Attribute `coef_` was deprecated in version 0.24 and will be removed in 1.1 (renaming of 0.26).
  warnings.warn(msg, category=FutureWarning)
dimensionality: 33809
density: 1.000000

________________________________________________________________________________
Training:
ComplementNB(alpha=0.1)
train time: 0.003s
test time: 0.001s
accuracy: 0.911
/home/circleci/project/sklearn/utils/deprecation.py:103: FutureWarning: Attribute `coef_` was deprecated in version 0.24 and will be removed in 1.1 (renaming of 0.26).
  warnings.warn(msg, category=FutureWarning)
dimensionality: 33809
density: 1.000000

================================================================================
LinearSVC with L1-based feature selection
________________________________________________________________________________
Training:
Pipeline(steps=[('feature_selection',
                 SelectFromModel(estimator=LinearSVC(dual=False, penalty='l1',
                                                     tol=0.001))),
                ('classification', LinearSVC())])
train time: 0.192s
test time: 0.002s
accuracy: 0.879
5.4 Visualization
Bar plots show the accuracy, training time (normalized), and test time (normalized) of each classifier.
indices = np.arange(len(results))

results = [[x[i] for x in results] for i in range(4)]

clf_names, score, training_time, test_time = results
training_time = np.array(training_time) / np.max(training_time)
test_time = np.array(test_time) / np.max(test_time)

plt.figure(figsize=(12, 8))
plt.title("Score")
plt.barh(indices, score, 0.2, label="score", color="navy")
plt.barh(indices + 0.3, training_time, 0.2, label="training time", color="c")
plt.barh(indices + 0.6, test_time, 0.2, label="test time", color="darkorange")
plt.yticks(())
plt.legend(loc="best")
plt.subplots_adjust(left=0.25)
plt.subplots_adjust(top=0.95)
plt.subplots_adjust(bottom=0.05)

for i, c in zip(indices, clf_names):
    plt.text(-0.3, i, c)

plt.show()
Out: (the resulting figure: a horizontal bar chart per classifier showing score, normalized training time, and normalized test time)
References:
1. Summary of sklearn feature extraction methods (dict, text, and image features)
2. Summary of sklearn dimensionality reduction methods (variance threshold, chi-squared, F-test, mutual information, embedded methods)
3. Getting the indices of the N largest values in a Python list
4. The 20 newsgroups text dataset
5. Classification of text documents using sparse features