Kaggle Jigsaw文本分类比赛方案总结-阿里云开发者社区

Kaggle Jigsaw文本分类比赛方案总结

2022-05-21 455

版权

本文内容由阿里云实名注册用户自发贡献，版权归原作者所有，阿里云开发者社区不拥有其著作权，亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容，填写侵权投诉表单进行举报，一经查实，本社区将立刻删除涉嫌侵权内容。

简介： Kaggle Jigsaw文本分类比赛方案总结

以下资源来自国内外选手分享的资源与方案，非常感谢他们的无私分享

比赛简介

一年一度的jigsaw有毒评论比赛开赛了，这次比赛与前两次举办的比赛不同，以往比赛都是英文训练集和测试集，但是这次的比赛确是训练集是前两次比赛的训练集的一个组合，验证集则是三种语言分别是es（西班牙语）、it（意大利语）、tr（土耳其语），测试集语言则是六种语言分别是es（西班牙语）、it（意大利语）、tr（土耳其语），ru（俄语）、pt（葡萄牙语）、fr（法语）。

--kaggle的Jigsaw多语言评论识别全球top15比赛心得分享

题目分析

这个比赛是一个文本分类的比赛，这个比赛目标是在给定文本中判断是否为恶意评论即01分类。训练数据还给了其他多列特征，包括一些敏感词特征还有一些其他指标评价的得分特征。测试集没有这些额外的特征只有文本数据。

通过比赛的评价指标可以看出来，这个比赛不仅仅是简单的01分类的比赛。这个比赛不仅关注分类正确，还关注于在预测结果中不是恶意评论中包含敏感词和是恶意评论中不包含敏感词两部分数据的得分。所以我们需要关注一下这两类的数据。可以考虑给这两类的数据赋予更高的权重，更方便模型能够准确的对这些数据预测正确。

文本统计特征如下：

词云展示

第三名方案解析

代码仓库：https://github.com/sakami0000/kaggle_jigsaw
方案帖子:https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/discussion/97471#latest-582610

模型1 LstmGruNet

模型如其名，作者主要基于LSTM以及GRU两种序列循环神经网络搭建了文本分类模型

class LstmGruNet(nn.Module):
    def __init__(self, embedding_matrices, num_aux_targets, embedding_size=256, lstm_units=128,
                 gru_units=128):
        super(LstmGruNet, self).__init__()
        self.embedding = ProjSumEmbedding(embedding_matrices, embedding_size)
        self.embedding_dropout = SpatialDropout(0.2)
        self.lstm = nn.LSTM(embedding_size, lstm_units, bidirectional=True, batch_first=True)
        self.gru = nn.GRU(lstm_units * 2, gru_units, bidirectional=True, batch_first=True)
        dense_hidden_units = gru_units * 4
        self.linear1 = nn.Linear(dense_hidden_units, dense_hidden_units)
        self.linear2 = nn.Linear(dense_hidden_units, dense_hidden_units)
        self.linear_out = nn.Linear(dense_hidden_units, 1)
        self.linear_aux_out = nn.Linear(dense_hidden_units, num_aux_targets)
    def forward(self, x):
        h_embedding = self.embedding(x)
        h_embedding = self.embedding_dropout(h_embedding)
        h1, _ = self.lstm(h_embedding)
        h2, _ = self.gru(h1)
        # global average pooling
        avg_pool = torch.mean(h2, 1)
        # global max pooling
        max_pool, _ = torch.max(h2, 1)
        h_conc = torch.cat((max_pool, avg_pool), 1)
        h_conc_linear1 = F.relu(self.linear1(h_conc))
        h_conc_linear2 = F.relu(self.linear2(h_conc))
        hidden = h_conc + h_conc_linear1 + h_conc_linear2
        result = self.linear_out(hidden)
        aux_result = self.linear_aux_out(hidden)
        out = torch.cat([result, aux_result], 1)
        return out

模型2 LstmCapsuleAttenModel

该模型有递归神经网络、胶囊网络以及注意力神经网络搭建。

class LstmCapsuleAttenModel(nn.Module):
    def __init__(self, embedding_matrix, maxlen=200, lstm_hidden_size=128, gru_hidden_size=128,
                 embedding_dropout=0.2, dropout1=0.2, dropout2=0.1, out_size=16,
                 num_capsule=5, dim_capsule=5, caps_out=1, caps_dropout=0.3):
        super(LstmCapsuleAttenModel, self).__init__()
        self.embedding = nn.Embedding(*embedding_matrix.shape)
        self.embedding.weight = nn.Parameter(torch.tensor(embedding_matrix, dtype=torch.float32))
        self.embedding.weight.requires_grad = False
        self.embedding_dropout = nn.Dropout2d(embedding_dropout)
        self.lstm = nn.LSTM(embedding_matrix.shape[1], lstm_hidden_size, bidirectional=True, batch_first=True)
        self.gru = nn.GRU(lstm_hidden_size * 2, gru_hidden_size, bidirectional=True, batch_first=True)
        self.lstm_attention = Attention(lstm_hidden_size * 2, maxlen=maxlen)
        self.gru_attention = Attention(gru_hidden_size * 2, maxlen=maxlen)
        self.capsule = Capsule(input_dim_capsule=gru_hidden_size * 2,
                               num_capsule=num_capsule,
                               dim_capsule=dim_capsule)
        self.dropout_caps = nn.Dropout(caps_dropout)
        self.lin_caps = nn.Linear(num_capsule * dim_capsule, caps_out)
        self.norm = nn.LayerNorm(lstm_hidden_size * 2 + gru_hidden_size * 6 + caps_out)
        self.dropout1 = nn.Dropout(dropout1)
        self.linear = nn.Linear(lstm_hidden_size * 2 + gru_hidden_size * 6 + caps_out, out_size)
        self.dropout2 = nn.Dropout(dropout2)
        self.out = nn.Linear(out_size, 1)
    def apply_spatial_dropout(self, h_embedding):
        h_embedding = h_embedding.transpose(1, 2).unsqueeze(2)
        h_embedding = self.embedding_dropout(h_embedding).squeeze(2).transpose(1, 2)
        return h_embedding
    def forward(self, x):
        h_embedding = self.embedding(x)
        h_embedding = self.apply_spatial_dropout(h_embedding)
        h_lstm, _ = self.lstm(h_embedding)
        h_gru, _ = self.gru(h_lstm)
        h_lstm_atten = self.lstm_attention(h_lstm)
        h_gru_atten = self.gru_attention(h_gru)
        content3 = self.capsule(h_gru)
        batch_size = content3.size(0)
        content3 = content3.view(batch_size, -1)
        content3 = self.dropout_caps(content3)
        content3 = torch.relu(self.lin_caps(content3))
        avg_pool = torch.mean(h_gru, 1)
        max_pool, _ = torch.max(h_gru, 1)
        conc = torch.cat((h_lstm_atten, h_gru_atten, content3, avg_pool, max_pool), 1)
        conc = self.norm(conc)
        conc = self.dropout1(conc)
        conc = torch.relu(conc)
        conc = self.linear(conc)
        conc = self.dropout2(conc)
        out = self.out(conc)
        return out

模型3 LstmConvModel

该模型有LSTM和Convolutional Neural Network搭建

class LstmConvModel(nn.Module):
    def __init__(self, embedding_matrix, lstm_hidden_size=128, gru_hidden_size=128, n_channels=64,
                 embedding_dropout=0.2, out_size=20, out_dropout=0.1):
        super(LstmConvModel, self).__init__()
        self.embedding = nn.Embedding(*embedding_matrix.shape)
        self.embedding.weight = nn.Parameter(torch.tensor(embedding_matrix, dtype=torch.float32))
        self.embedding.weight.requires_grad = False
        self.embedding_dropout = nn.Dropout2d(0.2)
        self.lstm = nn.LSTM(embedding_matrix.shape[1], lstm_hidden_size, bidirectional=True, batch_first=True)
        self.gru = nn.GRU(lstm_hidden_size * 2, gru_hidden_size, bidirectional=True, batch_first=True)
        self.conv = nn.Conv1d(gru_hidden_size * 2, n_channels, 3, padding=2)
        nn.init.xavier_uniform_(self.conv.weight)
        self.linear = nn.Linear(n_channels * 2, out_size)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(out_dropout)
        self.out = nn.Linear(out_size, 1)
    def apply_spatial_dropout(self, h_embedding):
        h_embedding = h_embedding.transpose(1, 2).unsqueeze(2)
        h_embedding = self.embedding_dropout(h_embedding).squeeze(2).transpose(1, 2)
        return h_embedding
    def forward(self, x):
        h_embedding = self.embedding(x)
        h_embedding = self.apply_spatial_dropout(h_embedding)
        h_lstm, _ = self.lstm(h_embedding)
        h_gru, _ = self.gru(h_lstm)
        h_gru = h_gru.transpose(2, 1)
        conv = self.conv(h_gru)
        conv_avg_pool = torch.mean(conv, 2)
        conv_max_pool, _ = torch.max(conv, 2)
        conc = torch.cat((conv_avg_pool, conv_max_pool), 1)
        conc = self.relu(self.linear(conc))
        conc = self.dropout(conc)
        out = self.out(conc)
        return out

模型4 Bert&GPT2

from pytorch_pretrained_bert import GPT2Model
import torch
from torch import nn
class GPT2ClassificationHeadModel(GPT2Model):
    def __init__(self, config, clf_dropout=0.4, n_class=8):
        super(GPT2ClassificationHeadModel, self).__init__(config)
        self.transformer = GPT2Model(config)
        self.dropout = nn.Dropout(clf_dropout)
        self.linear = nn.Linear(config.n_embd * 3, n_class)
        nn.init.normal_(self.linear.weight, std=0.02)
        nn.init.normal_(self.linear.bias, 0)
        self.apply(self.init_weights)
    def forward(self, input_ids, position_ids=None, token_type_ids=None, lm_labels=None, past=None):
        hidden_states, presents = self.transformer(input_ids, position_ids, token_type_ids, past)
        avg_pool = torch.mean(hidden_states, 1)
        max_pool, _ = torch.max(hidden_states, 1)
        h_conc = torch.cat((avg_pool, max_pool, hidden_states[:, -1, :]), 1)
        logits = self.linear(self.dropout(h_conc))
        return logits

代码获取：

链接：https://pan.baidu.com/s/1JdAe2sWRyuNShVhFF0ZvGg

提取码：lm80

复制这段内容后打开百度网盘手机App，操作更方便哦

Kaggle Jigsaw文本分类比赛方案总结

比赛简介

题目分析

第三名方案解析

模型1 LstmGruNet

模型2 LstmCapsuleAttenModel

模型3 LstmConvModel

模型4 Bert&GPT2

相关知识点

1 胶囊网络

2 Spatial Dropout

更多方案解析

热门文章

最新文章

相关课程

相关电子书

相关实验场景

探索云世界

热门

云计算

大数据

云原生

人工智能

数据库

开发与运维

活动广场

任务中心

开发者评测

高校计划

乘风者计划

训练营

直播

下载

镜像站

技术资料

Kaggle Jigsaw文本分类比赛方案总结

比赛简介

题目分析

第三名方案解析

模型1 LstmGruNet

模型2 LstmCapsuleAttenModel

模型3 LstmConvModel

模型4 Bert&GPT2

相关知识点

1 胶囊网络

2 Spatial Dropout

更多方案解析

热门文章

最新文章

相关课程

相关电子书

相关实验场景