Bag of Tricks for Efficient Text Classification
Paper: Bag of Tricks for Efficient Text Classification
Authors: Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov
Year: 2016
1. Complete Code
Just call the fasttext library directly; it gets the job done in no time!
import fasttext

# data.train.txt is a text file where each line contains a training sentence and its label.
# By default, labels are assumed to be words prefixed with __label__.
model = fasttext.train_supervised('data.train.txt')

# Return the top-3 most probable labels for each sentence; with two input sentences,
# 6 (label, probability) results are returned in total.
model.predict(["Which baking dish is best to bake a banana bread ?",
               "Why not put knives in the dishwasher?"], k=3)
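A slightly fuller sketch of the same workflow; the hyperparameter values, the validation file name, and the saved model name below are illustrative assumptions, not values from the paper:

import fasttext

# Train with a few commonly tuned options (illustrative values).
model = fasttext.train_supervised(
    input='data.train.txt',
    lr=0.5,           # learning rate
    epoch=25,         # passes over the training data
    wordNgrams=2,     # add word bigram features
    loss='softmax'    # plain softmax loss
)

# Evaluate on a held-out file in the same __label__ format (hypothetical file name).
print(model.test('data.valid.txt'))   # (sample count, precision@1, recall@1)

# Persist the trained classifier.
model.save_model('model_cls.bin')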
In the API, labels are identified by a prefix, which defaults to __label__.
2. Paper Walkthrough
2.1 Model Architecture
A simple and efficient baseline for sentence classification is to represent sentences as bag of words (BoW) and train a linear classifier, e.g., a logistic regression or an SVM (Joachims, 1998; Fan et al., 2008).
However, linear classifiers do not share parameters among features and classes.
This possibly limits their generalization in the context of large output space where some classes have very few examples. Common solutions to this problem are to factorize the linear classifier into low rank matrices (Schutze, 1992; Mikolov et al., 2013) or to use multilayer neural networks (Collobert and Weston, 2008; Zhang et al., 2015).
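As a concrete illustration of that baseline (not from the paper; scikit-learn and the toy data below are assumptions for the example):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hypothetical dataset, just to show the shape of the BoW + linear-classifier baseline.
texts  = ["the movie was great", "terrible plot and acting", "wonderful film", "boring and long"]
labels = ["pos", "neg", "pos", "neg"]

# Bag-of-words features followed by a logistic-regression classifier.
clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["great acting"]))   # e.g. ['pos']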
Since the linear classifier does not share parameters among features and classes, and the full softmax over every class becomes expensive when the number of classes is large, we do not have to compute the softmax value for every class: hierarchical softmax or negative sampling can be used to speed up training (see the sketch below).
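In the fasttext library this corresponds to the loss argument; a minimal sketch, assuming the same training file as above:

import fasttext

# 'hs' = hierarchical softmax, 'ns' = negative sampling; both avoid the full softmax,
# which matters when the number of labels is large.
model_hs = fasttext.train_supervised('data.train.txt', loss='hs')
model_ns = fasttext.train_supervised('data.train.txt', loss='ns')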
At the same time, the internal structure of each word can also be taken into account: following the paper Enriching Word Vectors with Subword Information, words can be mapped through that paper's subword (character n-gram) method.
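A minimal sketch of that character n-gram idea; the helper function name and the n-gram range are assumptions for illustration:

def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word, with boundary symbols < and >."""
    w = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(w) - n + 1):
            grams.append(w[i:i + n])
    return grams

print(char_ngrams("where")[:5])   # ['<wh', 'whe', 'her', 'ere', 're>']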
The model architecture is as follows: the word (and n-gram) features are looked up in an embedding matrix, averaged into a single text representation, and fed to a linear classifier with a softmax output.
The difference from Enriching Word Vectors with Subword Information is that there the output is a word vector, whereas here it is a class (label). That's it!
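A minimal PyTorch-style sketch of this architecture; the class name, the sizes, and the use of nn.EmbeddingBag are assumptions made for illustration:

import torch
import torch.nn as nn

class FastTextClassifier(nn.Module):
    """Averaged (sub)word embeddings followed by a linear layer over the classes."""
    def __init__(self, vocab_size, embed_dim, num_classes):
        super().__init__()
        # EmbeddingBag with mode='mean' averages the embeddings of all features in a text.
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, mode='mean')
        self.fc = nn.Linear(embed_dim, num_classes)   # linear classifier over the classes

    def forward(self, token_ids, offsets):
        hidden = self.embedding(token_ids, offsets)   # one averaged vector per text
        return self.fc(hidden)                        # class scores (softmax is applied in the loss)

# Illustrative sizes, not taken from the paper.
model = FastTextClassifier(vocab_size=50_000, embed_dim=10, num_classes=4)
tokens  = torch.tensor([3, 17, 42, 7, 7, 99])   # feature ids of two concatenated texts
offsets = torch.tensor([0, 3])                  # text 1 = tokens[0:3], text 2 = tokens[3:]
logits = model(tokens, offsets)                 # shape (2, 4)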
3. Implementation
The implementation details were already covered in Enriching Word Vectors with Subword Information; here the only change is that the output layer is a plain softmax over the class labels.
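For completeness, a small NumPy sketch of that softmax output layer and its cross-entropy loss; all names and sizes here are illustrative assumptions:

import numpy as np

def softmax(scores):
    # Numerically stable softmax over the class dimension.
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
h = rng.normal(size=10)          # averaged text representation (hidden vector)
B = rng.normal(size=(4, 10))     # output weight matrix, one row per class
probs = softmax(B @ h)           # probability of each of the 4 classes
loss = -np.log(probs[2])         # cross-entropy loss if the true class is 2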
4. Overall Summary
The implementation is far less demanding than that of Enriching Word Vectors with Subword Information.