Introduction
I recently entered the "达观杯" (Daguan Cup) text intelligent processing challenge and spent most of last week on it: reading papers and other material, and digging around on GitHub. I picked up a lot of new things in a short time, so while it's still fresh I want to organize my notes here.
Following up on my previous post (an introduction to the 20 newsgroups dataset with a text classification example), let's continue exploring text classification methods. Text classification is one of the most classic tasks in NLP, and to date industry and academia have accumulated a large number of methods, which fall into two broad categories:
- Text classification with traditional machine learning
- Text classification with deep learning
Traditional machine-learning approaches typically extract tfidf or bag-of-words features and feed them to a model such as logistic regression for training; many other models work as well, for example naive Bayes and SVMs. Deep-learning approaches mainly use CNNs, RNNs/LSTMs, Attention, and so on.
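As a minimal sketch of the traditional pipeline just described (tfidf features feeding logistic regression), here is a toy example; the four sentences and their labels are invented purely for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny invented corpus: label 1 = sports, label 0 = tech
texts = ["the team won the game",
         "great match and a fine goal",
         "new gpu chip released",
         "the compiler update is fast"]
labels = [1, 1, 0, 0]

# tfidf features feeding a logistic-regression classifier
clf = Pipeline([('tfidf', TfidfVectorizer()),
                ('lr', LogisticRegression())])
clf.fit(texts, labels)

# Words like "goal" and "match" only occur in sports examples,
# so this query should be classified as sports (label 1)
print(clf.predict(["a fine goal won the match"]))
```

The same two-step structure (vectorizer, then estimator) is what the full experiment below repeats for each classifier.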
Text classification with traditional machine learning and deep learning
- Text classification with traditional machine-learning methods
The basic idea: extract tfidf features, then feed them to various classification models for training.
```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC, LinearSVC, LinearSVR
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier,
                              AdaBoostClassifier)
from sklearn.tree import DecisionTreeClassifier

# Use the following 8 categories
selected_categories = [
    'comp.graphics', 'rec.motorcycles', 'rec.sport.baseball',
    'misc.forsale', 'sci.electronics', 'sci.med',
    'talk.politics.guns', 'talk.religion.misc']

# Load the dataset; note the held-out split must use subset='test'
newsgroups_train = fetch_20newsgroups(subset='train',
                                      categories=selected_categories,
                                      remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test',
                                     categories=selected_categories,
                                     remove=('headers', 'footers', 'quotes'))
train_texts = newsgroups_train['data']
train_labels = newsgroups_train['target']
test_texts = newsgroups_test['data']
test_labels = newsgroups_test['target']
print(len(train_texts), len(test_texts))

# Fit the same tfidf pipeline with each estimator and report accuracy.
# LinearSVR is a regressor, not a classifier; it is kept here only to
# show that exact-match "accuracy" on continuous predictions is ~0.
estimators = [
    ('MultinomialNB', MultinomialNB()),
    ('SGDClassifier', SGDClassifier()),
    ('LogisticRegression', LogisticRegression()),
    ('SVC', SVC()),
    ('LinearSVC', LinearSVC()),
    ('LinearSVR', LinearSVR()),
    ('MLPClassifier', MLPClassifier()),
    ('KNeighborsClassifier', KNeighborsClassifier()),
    ('RandomForestClassifier', RandomForestClassifier(n_estimators=8)),
    ('GradientBoostingClassifier', GradientBoostingClassifier()),
    ('AdaBoostClassifier', AdaBoostClassifier()),
    ('DecisionTreeClassifier', DecisionTreeClassifier()),
]
for name, estimator in estimators:
    text_clf = Pipeline([('tfidf', TfidfVectorizer(max_features=10000)),
                         ('clf', estimator)])
    text_clf.fit(train_texts, train_labels)
    predicted = text_clf.predict(test_texts)
    print("%s accuracy: %s" % (name, np.mean(predicted == test_labels)))
```
The output is:
MultinomialNB accuracy: 0.8960196779964222
SGDClassifier accuracy: 0.9724955277280859
LogisticRegression accuracy: 0.9304561717352415
SVC accuracy: 0.13372093023255813
LinearSVC accuracy: 0.9749552772808586
LinearSVR accuracy: 0.00022361359570661896
MLPClassifier accuracy: 0.9758497316636852
KNeighborsClassifier accuracy: 0.45840787119856885
RandomForestClassifier accuracy: 0.9680232558139535
GradientBoostingClassifier accuracy: 0.9186046511627907
AdaBoostClassifier accuracy: 0.5916815742397138
DecisionTreeClassifier accuracy: 0.9758497316636852
These results show that different classifiers behave very differently on this dataset, so when doing text classification it pays to try several methods; you may be pleasantly surprised. Two of the outliers have simple explanations: LinearSVR is a regression model, so exact-match "accuracy" on its continuous predictions is essentially meaningless, and SVC with default parameters tends to degenerate toward predicting a single class on tfidf features (13% is roughly the share of one class among eight) until C and gamma are tuned. One more caveat: the scores above were produced with the evaluation set loaded via subset='train', i.e., measured on the training data, so a proper held-out evaluation with subset='test' will typically give lower numbers. Beyond model choice, TfidfVectorizer, LogisticRegression, and the other components expose many parameters that strongly affect the results; for example, the ngram_range parameter of TfidfVectorizer directly changes which features are extracted, so this is an area that rewards experimentation.
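To make the ngram_range effect concrete, here is a quick illustration with two invented sentences, comparing the default unigram vocabulary against a unigram-plus-bigram vocabulary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["the quick brown fox", "the lazy dog sleeps"]

# Unigrams only (the default)
uni = TfidfVectorizer(ngram_range=(1, 1)).fit(texts)
# Unigrams plus bigrams: the feature space grows accordingly
bi = TfidfVectorizer(ngram_range=(1, 2)).fit(texts)

print(len(uni.vocabulary_))  # 7 distinct words
print(len(bi.vocabulary_))   # 7 words + 6 bigrams = 13
```

Bigrams capture short phrases ("brown fox") that unigrams lose, at the cost of a much larger, sparser feature space, which is why this parameter interacts strongly with max_features.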
For more, see: https://github.com/yanqiangmiffy/20newsgroups-text-classification