Bag of Tricks for Efficient Text Classification
Paper: Bag of Tricks for Efficient Text Classification
Authors: Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov
Year: 2016
1. Complete Code
Just call the fasttext library directly; it gets the job done in no time!
import fasttext

# data.train.txt is a text file where each line contains a training sentence and its label.
# By default, labels are assumed to be words prefixed with __label__.
model = fasttext.train_supervised('data.train.txt')

# Return the top-3 most probable labels for each sentence; with two input sentences,
# 6 (label, probability) results are returned in total.
model.predict(["Which baking dish is best to bake a banana bread ?",
               "Why not put knives in the dishwasher?"], k=3)
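A slightly fuller sketch of the same workflow; the hyperparameter values, the validation file name, and the saved model name below are illustrative assumptions, not values from the paper:

import fasttext

# Train with a few commonly tuned options (illustrative values).
model = fasttext.train_supervised(
    input='data.train.txt',
    lr=0.5,           # learning rate
    epoch=25,         # passes over the training data
    wordNgrams=2,     # add word bigram features
    loss='softmax'    # plain softmax loss
)

# Evaluate on a held-out file in the same __label__ format (hypothetical file name).
print(model.test('data.valid.txt'))   # (sample count, precision@1, recall@1)

# Persist the trained classifier.
model.save_model('model_cls.bin')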
In the API, labels are identified by a prefix, which defaults to __label__.
2. Paper Walkthrough
2.1 Model Architecture
A simple and efficient baseline for sentence classification is to represent sentences as bag of words (BoW) and train a linear classifier, e.g., a logistic regression or an SVM (Joachims, 1998; Fan et al., 2008).
However, linear classifiers do not share parameters among features and classes.
This possibly limits their generalization in the context of large output space where some classes have very few examples. Common solutions to this problem are to factorize the linear classifier into low rank matrices (Schutze, 1992; Mikolov et al., 2013) or to use multilayer neural networks (Collobert and Weston, 2008; Zhang et al., 2015).
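As a concrete illustration of that baseline (not from the paper; scikit-learn and the toy data below are assumptions for the example):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hypothetical dataset, just to show the shape of the BoW + linear-classifier baseline.
texts  = ["the movie was great", "terrible plot and acting", "wonderful film", "boring and long"]
labels = ["pos", "neg", "pos", "neg"]

# Bag-of-words features followed by a logistic-regression classifier.
clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["great acting"]))   # e.g. ['pos']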
Since the linear classifier does not share parameters among features and classes, and the full softmax over every class becomes expensive when the number of classes is large, we do not have to compute the softmax value for every class: hierarchical softmax or negative sampling can be used to speed up training (see the sketch below).
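In the fasttext library this corresponds to the loss argument; a minimal sketch, assuming the same training file as above:

import fasttext

# 'hs' = hierarchical softmax, 'ns' = negative sampling; both avoid the full softmax,
# which matters when the number of labels is large.
model_hs = fasttext.train_supervised('data.train.txt', loss='hs')
model_ns = fasttext.train_supervised('data.train.txt', loss='ns')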
At the same time, the internal structure of each word can also be taken into account: following the paper Enriching Word Vectors with Subword Information, words can be mapped through that paper's subword (character n-gram) method.
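A minimal sketch of that character n-gram idea; the helper function name and the n-gram range are assumptions for illustration:

def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word, with boundary symbols < and >."""
    w = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(w) - n + 1):
            grams.append(w[i:i + n])
    return grams

print(char_ngrams("where")[:5])   # ['<wh', 'whe', 'her', 'ere', 're>']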
The model architecture is as follows: the word (and n-gram) features are looked up in an embedding matrix, averaged into a single text representation, and fed to a linear classifier with a softmax output.
The difference from Enriching Word Vectors with Subword Information is that there the output is a word vector, whereas here it is a class (label). That's it!
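A minimal PyTorch-style sketch of this architecture; the class name, the sizes, and the use of nn.EmbeddingBag are assumptions made for illustration:

import torch
import torch.nn as nn

class FastTextClassifier(nn.Module):
    """Averaged (sub)word embeddings followed by a linear layer over the classes."""
    def __init__(self, vocab_size, embed_dim, num_classes):
        super().__init__()
        # EmbeddingBag with mode='mean' averages the embeddings of all features in a text.
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, mode='mean')
        self.fc = nn.Linear(embed_dim, num_classes)   # linear classifier over the classes

    def forward(self, token_ids, offsets):
        hidden = self.embedding(token_ids, offsets)   # one averaged vector per text
        return self.fc(hidden)                        # class scores (softmax is applied in the loss)

# Illustrative sizes, not taken from the paper.
model = FastTextClassifier(vocab_size=50_000, embed_dim=10, num_classes=4)
tokens  = torch.tensor([3, 17, 42, 7, 7, 99])   # feature ids of two concatenated texts
offsets = torch.tensor([0, 3])                  # text 1 = tokens[0:3], text 2 = tokens[3:]
logits = model(tokens, offsets)                 # shape (2, 4)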
3. Implementation
The implementation details were already covered in Enriching Word Vectors with Subword Information; here the only change is that the output layer is a plain softmax over the class labels.
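For completeness, a small NumPy sketch of that softmax output layer and its cross-entropy loss; all names and sizes here are illustrative assumptions:

import numpy as np

def softmax(scores):
    # Numerically stable softmax over the class dimension.
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
h = rng.normal(size=10)          # averaged text representation (hidden vector)
B = rng.normal(size=(4, 10))     # output weight matrix, one row per class
probs = softmax(B @ h)           # probability of each of the 4 classes
loss = -np.log(probs[2])         # cross-entropy loss if the true class is 2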
4. Overall Summary
The implementation is far less demanding than that of Enriching Word Vectors with Subword Information.