[Text Classification] Bag of Tricks for Efficient Text Classification

2023-02-24

· Reading summary:

This post mainly introduces the fastText model.

· Reference:

[1] Bag of Tricks for Efficient Text Classification

[0] Abstract

The paper proposes the fastText model. Its accuracy is close to that of deep learning baselines, but it is much faster.

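[1] Introduction

Deep learning models achieve very good performance in practice, but they tend to be relatively slow at both training and test time, which limits their use on very large datasets.

Linear classifiers are often considered strong baselines for text classification. Used properly, they often reach state-of-the-art performance, and they scale to very large corpora.

The fastText model proposed in the paper shows that a linear model with a rank constraint and a fast loss approximation can train on a billion words within ten minutes while still achieving high accuracy.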
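For reference, the objective minimized in the paper [1] can be written as follows. This is restated from the paper rather than from this post: $x_n$ is the normalized bag of features of the $n$-th document, $y_n$ its label, $A$ and $B$ are weight matrices, $f$ is the softmax function, and $N$ is the number of documents.

```latex
-\frac{1}{N} \sum_{n=1}^{N} y_n \log\bigl( f(B A x_n) \bigr)
```

Here $A$ acts as the embedding lookup table and $B$ as the classifier weights. Routing the linear classifier through the low-dimensional product $BA$ is presumably what "rank constraint" refers to, and the "fast loss approximation" corresponds to the hierarchical softmax the paper uses when the number of classes is large.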

[2] Model architecture

The architecture is easier to explain by walking through code.

A PyTorch version of fastText looks like this:

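import torch
import torch.nn as nn
import torch.nn.functional as F


class Model(nn.Module):
    def __init__(self, config):
        super().__init__()
        # Unigram embedding: start from pretrained vectors if given, and keep fine-tuning them.
        if config.embedding_pretrained is not None:
            self.embedding = nn.Embedding.from_pretrained(config.embedding_pretrained, freeze=False)
        else:
            # The last vocabulary index is used as the padding index.
            self.embedding = nn.Embedding(config.n_vocab, config.embed, padding_idx=config.n_vocab - 1)
        # Bigram and trigram embeddings are looked up by hashed bucket id and trained from scratch.
        self.embedding_ngram2 = nn.Embedding(config.n_gram_vocab, config.embed)
        self.embedding_ngram3 = nn.Embedding(config.n_gram_vocab, config.embed)
        self.dropout = nn.Dropout(config.dropout)
        # Note: the hidden layer with ReLU is an addition; the model in the paper is purely linear.
        self.fc1 = nn.Linear(config.embed * 3, config.hidden_size)
        self.fc2 = nn.Linear(config.hidden_size, config.num_classes)

    def forward(self, x):
        # x is a tuple: x[0] holds word ids, x[2] bigram ids, x[3] trigram ids
        # (x[1] is not used here). Each is assumed to have shape (batch, seq_len).
        out_word = self.embedding(x[0])
        out_bigram = self.embedding_ngram2(x[2])
        out_trigram = self.embedding_ngram3(x[3])
        # Concatenate along the feature dimension: (batch, seq_len, 3 * embed).
        out = torch.cat((out_word, out_bigram, out_trigram), -1)
        # Average over the sequence dimension: (batch, 3 * embed).
        out = out.mean(dim=1)
        out = self.dropout(out)
        out = self.fc1(out)
        out = F.relu(out)
        out = self.fc2(out)  # unnormalized class scores (logits)
        return out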

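As the code shows, the unigram embedding can be initialized from pretrained word vectors, whereas the bigram and trigram embeddings have to be learned by the model itself.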
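The core of the model's `forward` method is "concatenate the word, bigram and trigram vectors at each position, then average over positions". The following framework-free sketch reproduces that step for a single sentence with plain Python lists. The function name `concat_and_average` and all vector values are made up for illustration; it mirrors `torch.cat(..., -1)` followed by `.mean(dim=1)` but is not the post's implementation.

```python
def concat_and_average(word_vecs, bigram_vecs, trigram_vecs):
    """Concatenate the three embeddings per position, then average over positions."""
    # Per-position concatenation: each row holds word + bigram + trigram features.
    rows = [w + b + t for w, b, t in zip(word_vecs, bigram_vecs, trigram_vecs)]
    seq_len = len(rows)
    # Column-wise mean over the sequence dimension.
    return [sum(col) / seq_len for col in zip(*rows)]


# Toy 2-dimensional embeddings for a 3-token sentence (values are made up).
word_vecs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
bigram_vecs = [[0.0, 0.0], [3.0, 3.0], [0.0, 3.0]]
trigram_vecs = [[0.0, 0.0], [0.0, 0.0], [6.0, 3.0]]

sentence_vec = concat_and_average(word_vecs, bigram_vecs, trigram_vecs)
print(sentence_vec)  # [1.0, 1.0, 1.0, 2.0, 2.0, 1.0]
```

Whatever the sentence length, the result has a fixed size of three times the embedding dimension, which is why the first linear layer takes `config.embed * 3` inputs.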

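However, as the corpus grows, the bigram and trigram features make the memory requirement grow as well, which seriously slows down model building. We address this with the following measures:

1. Store the bigrams and trigrams using hashing

2. Switch from character-level to word-level granularity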
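To make the second measure concrete, here is a minimal sketch of the two tokenization granularities. The helper names `char_tokenize`, `word_tokenize` and `ngrams` are hypothetical, and the word-level tokenizer assumes the text has already been segmented with single spaces; the post does not show its own tokenizer.

```python
def char_tokenize(text):
    """Character-level granularity: every character is a token."""
    return list(text)


def word_tokenize(text):
    """Word-level granularity: assumes words are separated by single spaces."""
    return text.split(" ")


def ngrams(tokens, n):
    """Contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


chars = char_tokenize("文本分类")   # four single-character tokens
words = word_tokenize("文本 分类")  # two word tokens

char_bigrams = ngrams(chars, 2)
word_bigrams = ngrams(words, 2)
print(len(chars), len(char_bigrams))
print(len(words), len(word_bigrams))
```

At word level the same sentence yields a shorter token sequence and therefore fewer n-gram positions to hash and embed.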

When building the dataset, we use hashing to map each bigram and trigram to a single index value, as follows:

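    def biGramHash(sequence, t, buckets):
        # Hash the token id preceding position t into one of `buckets` slots (0 if there is none).
        t1 = sequence[t - 1] if t - 1 >= 0 else 0
        return (t1 * 14918087) % buckets

    def triGramHash(sequence, t, buckets):
        # Hash the two token ids preceding position t; missing ids are treated as 0.
        t1 = sequence[t - 1] if t - 1 >= 0 else 0
        t2 = sequence[t - 2] if t - 2 >= 0 else 0
        return (t2 * 14918087 * 18408749 + t1 * 14918087) % buckets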
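As a usage sketch, the two hash functions can be applied at every position of a token-id sequence to produce the bigram and trigram index lists that the model consumes. The wrapper `ngram_features` and the snake_case function names are hypothetical, and `buckets=10` is chosen only to keep the numbers readable. In practice `buckets` would presumably equal `config.n_gram_vocab`, since the resulting ids index the n-gram embedding tables.

```python
def bigram_hash(sequence, t, buckets):
    """Bucket id for position t, computed from the previous token id."""
    t1 = sequence[t - 1] if t - 1 >= 0 else 0
    return (t1 * 14918087) % buckets


def trigram_hash(sequence, t, buckets):
    """Bucket id for position t, computed from the two previous token ids."""
    t1 = sequence[t - 1] if t - 1 >= 0 else 0
    t2 = sequence[t - 2] if t - 2 >= 0 else 0
    return (t2 * 14918087 * 18408749 + t1 * 14918087) % buckets


def ngram_features(token_ids, buckets):
    """Return one bigram id and one trigram id per position of token_ids."""
    positions = range(len(token_ids))
    bigram_ids = [bigram_hash(token_ids, t, buckets) for t in positions]
    trigram_ids = [trigram_hash(token_ids, t, buckets) for t in positions]
    return bigram_ids, trigram_ids


token_ids = [1, 2, 3]  # toy word ids
bigram_ids, trigram_ids = ngram_features(token_ids, buckets=10)
print(bigram_ids)   # [0, 7, 4]
print(trigram_ids)  # [0, 7, 7]
```

Every id falls in `range(buckets)`, so the n-gram embedding tables keep a fixed number of rows however many distinct n-grams the corpus contains. The price is that different n-grams may collide in the same bucket and share one embedding.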
