[Text Summarization (3)] Seq2seq with Attention in PyTorch - Alibaba Cloud Developer Community

Foreword

Code reference:

https://github.com/jasoncao11/nlp-notebook/tree/master/4-2.Seq2seq_Att

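Many thanks to the author of that repository; it contains almost all of the code needed for text summarization.

Only a small part needed changes, probably because of library version differences.

The code in this post runs end to end. If you hit a problem, leave a comment and we can work through it together.

If I have misunderstood anything, corrections are very welcome.

References: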

https://www.bilibili.com/video/BV1op4y1U7ag?t=1013

https://github.com/bentrevett/pytorch-seq2seq

https://github.com/DD-DuDa/nlp_course

https://zhuanlan.zhihu.com/p/383866592

This post follows on from

[Text Summarization (2)] Seq2Seq in PyTorch

https://blog.csdn.net/WTYuong/article/details/129683262

Attention

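Note: the attention used in RNN-based seq2seq models is not common in practice any more.

It is worth studying the attention in the Transformer closely; it is used far more often and is also simpler.

In the previous post, the encoder compressed the whole input into a single vector, context, which comes from the output of the Encoder's last layer.

The Decoder decodes the target sentence from this vector alone.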

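Problems

First problem: can the context vector really hold all the information in the input sentence? Suppose you have to translate a sentence of 100 words while context has only 200 dimensions. It would have to encode every word plus word order, semantics and so on, which is close to impossible.

Second problem: even if context did hold everything, could the Decoder really produce the whole output by looking at this single vector? At every step the decoder has to extract the information for the relevant position from the same vector.

A simple analogy: imagine you are the decoder. You listen to a one-minute English speech, taking notes as you go, and once it ends you have to translate the full minute into Chinese using only your fragmentary notes.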

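Solution

So what can we do?

Simple: listen to one sentence, pause, translate it, then continue :)

This gives a feel for one idea behind attention: alignment.

At each translation step, the model needs to attend to the corresponding input positions.

Example: suppose the model has to translate "Change your life today". At its first step the Decoder needs to know that the first Encoder input was "change", and it translates while looking at that "change".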

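How attention works

The Encoder needs no real change; what mainly changes is the Decoder's input.

The Decoder input changes from [context vector + embedding]

to [context vector + attention_output + embedding].

The Decoder's linear layer changes accordingly.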

attention_output

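Without attention, the Decoder's first input would be [the Encoder's final hidden state + embedding]. Call this hidden-state vector $h_0$.

We now compute a score between each Encoder hidden state $s_1, s_2, s_3, s_4, s_5$ and $h_0$: $\mathrm{score}(h_0, s_k),\ k = 1, \dots, 5$.

Softmax then turns the scores into a probability distribution with values in [0, 1], giving the weights $a_k,\ k = 1, \dots, 5$.

Compute the attention output, i.e. the sum of the encoder states weighted by the attention weights: $c = \sum_{k} a_k s_k$.

Later decoding steps proceed the same way, with the current decoder hidden state taking the place of $h_0$.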
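
The three steps above (score, softmax, weighted sum) can be sketched in plain Python. This is only an illustration: the dot-product score, the function names and the toy vectors are assumptions made for the sketch, whereas the model later in this post learns its score with a small network.

```python
import math

def softmax(scores):
    # subtract the maximum for numerical stability
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_output(h, states):
    # score(h, s_k): a plain dot product (an assumption for this sketch)
    scores = [sum(hi * si for hi, si in zip(h, s)) for s in states]
    # a_k: attention weights that sum to 1
    weights = softmax(scores)
    # c = sum_k a_k * s_k, computed one dimension at a time
    dim = len(states[0])
    context = [sum(w * s[d] for w, s in zip(weights, states)) for d in range(dim)]
    return weights, context

h0 = [1.0, 0.0]                                   # toy decoder state
states = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]    # toy encoder states s_1..s_3
weights, c = attention_output(h0, states)
print(weights, c)
```

The encoder state most similar to h0 receives the largest weight, so it dominates the weighted sum c.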

Code structure

Code that resembles the plain seq2seq version is not covered in detail.

Model definition: model.py

Model definition code

# -*- coding: utf-8 -*-
import random
import torch.nn as nn
import torch 
import torch.nn.functional as F

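The Encoder

The previous model used a two-layer GRU; this one uses a bidirectional RNN instead.

A bidirectional RNN has two RNNs in each layer. A forward RNN reads the embedded sentence from left to right (green in the original figure), and a backward RNN reads it from right to left (teal).

In code, all that is required is to set bidirectional = True and then pass the embedded sentence to the RNN as before.

The Encoder class builds an encoder whose internal RNN is torch's built-in GRU. Its parameters are:

input_dim: size of the input vocabulary

emb_dim: dimension of the embeddings

enc_hid_dim: size of the encoder hidden state

dec_hid_dim: size of the decoder hidden state, which the encoder's final state is projected to

dropout: dropout probability

forward arguments:

src: the source text, already mapped from tokens to indices through the vocabulary

forward returns the Encoder's overall output (the final hidden state) together with the output at every time step. The per-step outputs are used later to compute attention.

Optionally, to keep pad tokens in the sequence from affecting the later attention computation, apply pack_padded_sequence from nn.utils.rnn, which drops the pad tokens after position doc_len.

doc_len: the true length of each sequence; the RNN then only computes states up to that length and skips the pad tokens

pack_padded_sequence takes the embedded token sequence and the true lengths, so the RNN does not process the pad tokens beyond doc_len. Because the Encoder below uses batch_first=True, the same flag is passed here. The lengths must be sorted in descending order unless enforce_sorted=False is given.

packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, doc_len, batch_first=True)

After the RNN has run, every sequence with doc_len < max_len is padded back so that the result forms a single tensor that is convenient for GPU computation. This is done with pad_packed_sequence, whose input is the packed RNN output packed_outputs. The later attention computation is meant to skip the padded positions (note that the Attention module listed in this post applies no mask itself).

outputs, _ = nn.utils.rnn.pad_packed_sequence(packed_outputs, batch_first=True)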
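
A minimal sketch of the pack and unpack round trip, assuming a toy batch of two sequences. PAD_IDX = 0, the tensor sizes and the variable names are assumptions for the illustration, and enforce_sorted=False is used so that the batch does not need to be sorted by length.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
PAD_IDX = 0  # assumed padding index

# two sequences with true lengths 4 and 2, padded to length 4
src = torch.tensor([[5, 6, 7, 8],
                    [3, 4, PAD_IDX, PAD_IDX]])
doc_len = torch.tensor([4, 2])  # lengths must live on the CPU

embedding = nn.Embedding(10, 3, padding_idx=PAD_IDX)
rnn = nn.GRU(3, 5, bidirectional=True, batch_first=True)

embedded = embedding(src)                        # [batch, src len, emb dim]
packed_embedded = pack_padded_sequence(
    embedded, doc_len, batch_first=True, enforce_sorted=False)
packed_outputs, hidden = rnn(packed_embedded)    # pad positions are never fed to the GRU
outputs, out_len = pad_packed_sequence(packed_outputs, batch_first=True)

print(outputs.shape)   # [batch, src len, hid dim * 2]
print(outputs[1, 2:])  # padded time steps of the short sequence are zero-filled
```

The padded time steps come back as zeros, and hidden holds the state at each sequence's last real token instead of the state after the pad tokens.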

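In the actual implementation the shape handling is tedious: dimensions are frequently added, removed or swapped so that the matrix operations line up. The code is annotated with shapes, and it is worth stepping through it yourself to get a feel for how they change.

The encoder takes the source text as input and outputs hidden_state; the sizes have to be configured.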

class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        super().__init__()       
        self.embedding = nn.Embedding(input_dim, emb_dim)       
        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional=True, batch_first=True)     
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)     
        self.dropout = nn.Dropout(dropout)
    def forward(self, src):     
        #src = [batch size, src len]
        embedded = self.dropout(self.embedding(src))
        #embedded = [batch size, src len, emb dim]
        outputs, hidden = self.rnn(embedded)
        #outputs = [batch size, src len, hid dim * num directions]
        #hidden = [n layers * num directions, batch size, hid dim]
        #hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
        #outputs are always from the last layer
        #hidden [-2, :, : ] is the last of the forwards RNN 
        #hidden [-1, :, : ] is the last of the backwards RNN
        #initial decoder hidden is final hidden state of the forwards and backwards 
        #  encoder RNNs fed through a linear layer
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)))
        #outputs = [batch size, src len, enc hid dim * 2]
        #hidden = [batch size, dec hid dim]
        return outputs, hidden

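hidden = torch.tanh(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))) Because the GRU is bidirectional, the final hidden state has a forward part and a backward part.

In this example there is only one GRU layer, so hidden actually has shape [2, batch size, hid dim].

To bring it to the decoder's hidden size, the two vectors are concatenated, passed through a linear layer, and then through tanh.

The result is h0, the Decoder's initial hidden state.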
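
The shape change in this step can be checked in isolation. The sizes below are arbitrary assumptions, and a random tensor stands in for the GRU's final hidden state.

```python
import torch
import torch.nn as nn

batch_size, enc_hid_dim, dec_hid_dim = 3, 4, 6   # assumed toy sizes

# stand-in for the final hidden state of a 1-layer bidirectional GRU:
# index -2 is the forward direction, index -1 the backward direction
hidden = torch.randn(2, batch_size, enc_hid_dim)
fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)

merged = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
# merged = [batch size, enc hid dim * 2]
h0 = torch.tanh(fc(merged))
# h0 = [batch size, dec hid dim]
print(merged.shape, h0.shape)
```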

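The Attention module

1. self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim) This is W1 in the original figure. h corresponds to dec_hid_dim (the decoder hidden size), and sk is the k-th encoder state, i.e. the forward and backward vectors concatenated, of size enc_hid_dim * 2.

After this layer the output has shape [batch size, src len, dec hid dim].

2. self.v = nn.Linear(dec_hid_dim, 1, bias = False)

As in the original figure, this turns the vector at each source position into a single score:

The output shape becomes [batch size, src len].

3. hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)

Suppose this is h0. It has to be concatenated with each encoder state sk, k = 1…5, so h0 is repeated src_len times.

4. energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim = 2)))

The repeated hidden state is concatenated with the encoder outputs along the feature dimension and passed through the linear layer and tanh. Together with self.v this gives $\mathrm{score}(h, s_k) = v^\top \tanh(W_1 [h; s_k])$.

5. F.softmax(attention, dim=1)

The scores are normalized over the src len dimension, so each row of weights sums to 1.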

class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super().__init__()       
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
        self.v = nn.Linear(dec_hid_dim, 1, bias = False)
    def forward(self, hidden, encoder_outputs):        
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [batch size, src len, enc hid dim * 2]
        src_len = encoder_outputs.shape[1]     
        #repeat decoder hidden state src_len times
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)                
        #hidden = [batch size, src len, dec hid dim]      
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim = 2))) 
        #energy = [batch size, src len, dec hid dim]
        attention = self.v(energy).squeeze(2)        
        #attention= [batch size, src len]        
        return F.softmax(attention, dim=1)
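
The Attention class above applies softmax over every source position, pad tokens included. One way to exclude them, sketched here as an assumption rather than as part of the referenced code, is to set the scores at pad positions to negative infinity before the softmax. The function name masked_softmax and PAD_IDX are hypothetical, and the sketch assumes every sequence has at least one real token.

```python
import torch
import torch.nn.functional as F

PAD_IDX = 0  # assumed padding index

def masked_softmax(scores, src, pad_idx=PAD_IDX):
    # scores = [batch size, src len], raw attention scores
    # src    = [batch size, src len], token indices of the source
    mask = src != pad_idx                         # True at real tokens
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=1)               # pad positions get weight 0

src = torch.tensor([[5, 6, 7],
                    [3, PAD_IDX, PAD_IDX]])
scores = torch.zeros(2, 3)                        # equal scores everywhere
a = masked_softmax(scores, src)
print(a)
```

With equal scores, the first row spreads its weight evenly over three tokens, while the second row puts all of its weight on its single real token.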

Decoder

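1. self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim) emb_dim is the embedding of the output token being fed in; enc_hid_dim * 2 corresponds to attention_output, with the factor 2 coming from the two directions.

2. self.fc_out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim) The linear layer applied at each decoder step takes attention_output, the decoder's RNN output, and the embedding of the output token.

3. weighted = torch.bmm(a, encoder_outputs) This is C(t). From then on, the input at each decoder step is [hidden_state + C(t) + embedding].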

class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
        super().__init__()
        self.output_dim = output_dim
        self.attention = attention       
        self.embedding = nn.Embedding(output_dim, emb_dim)        
        self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim, batch_first=True)        
        self.fc_out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)      
        self.dropout = nn.Dropout(dropout)
    def forward(self, inputs, hidden, encoder_outputs):             
        #inputs = [batch size]
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [batch size, src len, enc hid dim * 2]        
        inputs = inputs.unsqueeze(1)
        #inputs = [batch size, 1]        
        embedded = self.dropout(self.embedding(inputs))
        #embedded = [batch size, 1, emb dim]        
        a = self.attention(hidden, encoder_outputs)                
        #a = [batch size, src len]     
        a = a.unsqueeze(1)        
        #a = [batch size, 1, src len]
        weighted = torch.bmm(a, encoder_outputs)      
        #weighted = [batch size, 1, enc hid dim * 2]     
        rnn_input = torch.cat((embedded, weighted), dim = 2)    
        #rnn_input = [batch size, 1, (enc hid dim * 2) + emb dim]           
        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))       
        #output = [batch size, seq len, dec hid dim * n directions]
        #hidden = [n layers * n directions, batch size, dec hid dim]    
        #seq len, n layers and n directions will always be 1 in this decoder, therefore:
        #output = [batch size, 1, dec hid dim]
        #hidden = [1, batch size, dec hid dim]       
        embedded = embedded.squeeze(1)
        output = output.squeeze(1)
        weighted = weighted.squeeze(1)        
        prediction = self.fc_out(torch.cat((output, weighted, embedded), dim = 1))        
        #prediction = [batch size, output dim]        
        return prediction, hidden.squeeze(0)
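
The training loop below calls model(src, trg), but the wrapper that ties the Encoder and Decoder together is not listed in this post. A minimal sketch of such a wrapper follows. The class name Seq2Seq, the teacher_forcing_ratio argument and the two toy stand-in modules are assumptions for illustration; the stand-ins only mimic the interfaces of the Encoder and Decoder above so that the sketch runs on its own.

```python
import random
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder, self.decoder, self.device = encoder, decoder, device

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        # src = [batch size, src len], trg = [batch size, trg len]
        batch_size, trg_len = trg.shape
        outputs = torch.zeros(batch_size, trg_len, self.decoder.output_dim,
                              device=self.device)
        encoder_outputs, hidden = self.encoder(src)
        inputs = trg[:, 0]                        # first input is the start token
        for t in range(1, trg_len):
            prediction, hidden = self.decoder(inputs, hidden, encoder_outputs)
            outputs[:, t] = prediction
            # feed either the ground-truth token or the model's own guess
            teacher_force = random.random() < teacher_forcing_ratio
            inputs = trg[:, t] if teacher_force else prediction.argmax(1)
        return outputs                            # position 0 stays all zeros

class ToyEncoder(nn.Module):                      # hypothetical stand-in
    def __init__(self, vocab, hid):
        super().__init__()
        self.emb = nn.Embedding(vocab, hid)
    def forward(self, src):
        out = self.emb(src)                       # [batch, src len, hid]
        return out, out.mean(dim=1)               # outputs, hidden

class ToyDecoder(nn.Module):                      # hypothetical stand-in
    def __init__(self, output_dim, hid):
        super().__init__()
        self.output_dim = output_dim
        self.emb = nn.Embedding(output_dim, hid)
        self.out = nn.Linear(hid, output_dim)
    def forward(self, inputs, hidden, encoder_outputs):
        hidden = torch.tanh(self.emb(inputs) + hidden)
        return self.out(hidden), hidden

model = Seq2Seq(ToyEncoder(11, 8), ToyDecoder(7, 8), torch.device("cpu"))
src = torch.randint(0, 11, (4, 6))
trg = torch.randint(0, 7, (4, 5))
output = model(src, trg)
print(output.shape)
```

Because position 0 of the output is never filled, the training loop slices it off with output[:,1:,:] before computing the loss.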

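Training and validation

# Assumes model, criterion, optimizer, train_iter, val_iter, device,
# N_EPOCHS and CLIP are already defined elsewhere in the training script.
import numpy as np
import torch
from tqdm import tqdm

loss_vals = []
loss_vals_eval = []
for epoch in range(N_EPOCHS):
    model.train()
    epoch_loss= []
    pbar = tqdm(train_iter)  # wrap the iterator in a progress bar
    # print(type(pbar))
    pbar.set_description("[Train Epoch {}]".format(epoch))  # set the progress bar description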
    for i,batch in enumerate(pbar):
        # print(batch)
        trg = batch.trg
        src = batch.src
        # print(type(trg),type(src))
        trg, src = trg.to(device), src.to(device)
        model.zero_grad()
        output = model(src, trg)
        #trg = [batch size, trg len]
        #output = [batch size, trg len, output dim]        
        output_dim = output.shape[-1]       
        output = output[:,1:,:].reshape(-1, output_dim)
        trg = trg[:,1:].reshape(-1)               
        #trg = [(trg len - 1) * batch size]
        #output = [(trg len - 1) * batch size, output dim]     
        loss = criterion(output, trg)    
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP)
        epoch_loss.append(loss.item())
        optimizer.step()
        pbar.set_postfix(loss=loss.item())
    loss_vals.append(np.mean(epoch_loss))
    model.eval()
    epoch_loss_eval= []
    pbar = tqdm(val_iter)
    pbar.set_description("[Eval Epoch {}]".format(epoch)) 
    # validation needs no gradients, so disable autograd to save memory
    with torch.no_grad():
        for i,batch in enumerate(pbar):
            # print(batch)
            trg = batch.trg
            src = batch.src
            trg, src = trg.to(device), src.to(device)
            output = model(src, trg)
            #trg = [batch size, trg len]
            #output = [batch size, trg len, output dim]        
            output_dim = output.shape[-1]       
            output = output[:,1:,:].reshape(-1, output_dim)
            trg = trg[:,1:].reshape(-1)               
            #trg = [(trg len - 1) * batch size]
            #output = [(trg len - 1) * batch size, output dim]     
            loss = criterion(output, trg)    
            epoch_loss_eval.append(loss.item())
            pbar.set_postfix(loss=loss.item())
    loss_vals_eval.append(np.mean(epoch_loss_eval))
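
The reshape before the loss is easier to follow on toy tensors. In this sketch the criterion is assumed to be nn.CrossEntropyLoss with ignore_index set to the padding index; the post does not show how criterion is built, so PAD_IDX and that choice are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
PAD_IDX = 0  # assumed padding index
batch_size, trg_len, output_dim = 2, 4, 6

output = torch.randn(batch_size, trg_len, output_dim)  # stand-in for model(src, trg)
trg = torch.tensor([[1, 3, 4, 2],
                    [1, 5, 2, PAD_IDX]])               # column 0 is the start token

criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

# drop position 0 (start token), then flatten batch and time into one axis
flat_output = output[:, 1:, :].reshape(-1, output_dim)
flat_trg = trg[:, 1:].reshape(-1)
# flat_output = [(trg len - 1) * batch size, output dim]
# flat_trg    = [(trg len - 1) * batch size]
loss = criterion(flat_output, flat_trg)

# the pad target (last element here) contributes nothing to the loss
manual = F.cross_entropy(flat_output[:5], flat_trg[:5])
print(loss.item(), manual.item())
```

Both printed values agree, which shows that the padded target position is left out of the average.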
