Summary: [NLP] Datawhale AI Summer Camp, Day 8-10 check-in: the Transformer, the foundation of large models



The Transformer is a sequence transduction model built entirely on attention: multi-headed self-attention takes the place of the recurrent layers usually found in encoder-decoder architectures.

Transformer, a sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.





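Related work: attention had already been applied to RNNs, and there was prior work on self-attention and memory networks. Self-attention is attention computed between different positions of a single sentence, and it yields a representation of that sequence.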



The encoder maps the input sequence of symbols (words) $(x_1, x_2, \ldots, x_n)$ to a sequence of vector representations $(z_1, z_2, \ldots, z_n)$.

Given $z$, the decoder generates the output sequence $(y_1, y_2, \ldots, y_m)$ one element at a time: generating $y_t$ depends on $y_1, \ldots, y_{t-1}$.


At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.
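
The auto-regressive loop can be sketched in a few lines of plain Python. This is only an illustration: `greedy_decode`, `toy_next_token`, and the `bos`/`eos` ids are hypothetical names, and the toy function stands in for a trained decoder.

```python
from typing import Callable, List

def greedy_decode(next_token: Callable[[List[int]], int],
                  bos: int, eos: int, max_len: int) -> List[int]:
    """Generate y_1..y_m one symbol at a time, feeding back previous outputs."""
    ys = [bos]                      # the decoder input starts with a start symbol
    for _ in range(max_len):
        y_t = next_token(ys)        # y_t depends only on what was generated so far
        ys.append(y_t)
        if y_t == eos:              # stop once the end symbol is produced
            break
    return ys[1:]                   # drop the start symbol

# Toy stand-in for a trained model (assumption): emit last symbol + 1, capped at 4.
def toy_next_token(prefix: List[int]) -> int:
    return min(prefix[-1] + 1, 4)

print(greedy_decode(toy_next_token, bos=0, eos=4, max_len=10))  # [1, 2, 3, 4]
```

A real decoder would also condition on the encoder output $z$; it is left out here to keep the feedback loop visible.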



The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder.


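As the figure shows, the Transformer consists of an encoder and a decoder. For a Chinese-to-English translation task, the encoder input is the Chinese sentence. At prediction time the decoder has no external input of its own: its input is its own output from earlier time steps, and "shifted right" means that this output sequence is shifted one position to the right.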
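
A minimal sketch of what "shifted right" means during training, assuming hypothetical token ids and hypothetical `BOS`/`EOS` markers: the decoder input and the labels are the same target sequence, offset by one position.

```python
BOS, EOS = 1, 2                      # hypothetical special-token ids
target = [BOS, 11, 12, 13, EOS]      # hypothetical tokenized target sentence

decoder_input = target[:-1]          # outputs shifted right: starts with BOS
labels = target[1:]                  # what the decoder should predict at each step

# At step t the decoder sees decoder_input[:t + 1] and must predict labels[t].
for t, label in enumerate(labels):
    print(decoder_input[:t + 1], "->", label)
```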

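✏️ The input text first goes through Input Embedding: the text is split into tokens (words), and each token is then represented as a vector. An encoder layer can be viewed as Multi-Head Self-Attention followed by an MLP (multi-layer perceptron), with a residual connection around each of the two. The encoder's output is fed to the decoder, which has one extra stage compared with the encoder: Masked Multi-Head Attention.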



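import copy
import math
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        self.lut = nn.Embedding(vocab, d_model)  # lookup table: vocab x d_model
        self.d_model = d_model
    def forward(self, x):
        embedds = self.lut(x)
        return embedds * math.sqrt(self.d_model)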


The encoder is a stack of N = 6 identical layers, each with two sub-layers: a multi-head self-attention mechanism and a simple position-wise fully connected feed-forward network (in practice an MLP). A residual connection wraps each of the two sub-layers and is followed by layer normalization, so each sub-layer outputs LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function the sub-layer itself implements. To make the residual additions possible, every sub-layer and the embedding layers produce outputs of dimension $d_{\text{model}} = 512$ (the dimension is never reduced).

Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection around each of the two sub-layers, followed by layer normalization. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension $d_{\text{model}} = 512$.


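LayerNorm vs. BatchNorm: BatchNorm normalizes each feature across the samples of a batch, whereas LayerNorm normalizes all features within each sample. Either way the normalized values have mean 0 and variance 1.

🔗 The difference between batchNormalization and layerNormalization

The two rectangles below show the 2D-input case: blue is BatchNorm, yellow is LayerNorm, and the two methods "slice" the data differently. Transformer inputs are usually 3D, batch * sequence (Nx) * feature ($d_{\text{model}}$), which is the cube shown above.

BatchNorm has to store global (running) estimates of the mean and variance. For long sequences, mini-batch means and variances fluctuate a lot. LayerNorm computes the mean and variance of each sample only and needs no global statistics, so it copes better with the fluctuations caused by sample length.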
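
The two "slicing" directions can be illustrated on a 2D input with plain Python lists. This is a simplified sketch: it leaves out the learned scale and shift, and the function names are illustrative rather than any library's API.

```python
import math

def normalize(values, eps=1e-5):
    """Shift and scale a list of numbers to mean 0 and (about) variance 1."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def layer_norm(batch):
    """Normalize each sample (row) over its own features."""
    return [normalize(row) for row in batch]

def batch_norm(batch):
    """Normalize each feature (column) over the samples in the batch."""
    columns = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

batch = [[1.0, 2.0, 3.0],
         [10.0, 20.0, 60.0]]   # 2 samples x 3 features
ln = layer_norm(batch)         # statistics come from one row at a time
bn = batch_norm(batch)         # statistics come from one column at a time
```

With LayerNorm each row ends up with mean 0 regardless of what else is in the batch; with BatchNorm the result for one sample changes whenever the other samples change.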

# LayerNorm is used below but never defined in this post; as an assumption,
# PyTorch's built-in nn.LayerNorm stands in for it here.
LayerNorm = nn.LayerNorm

# Helper that makes N independent deep copies of a module.
def clones(module, N):
    "Produce N identical layers."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

class Encoder(nn.Module):
    "The encoder is composed of a stack of N=6 identical layers."
    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        # The caller passes in one encoder layer; clone it N times and stack
        # the copies to form the complete Encoder.
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)
    def forward(self, x, mask):
        "Pass the input (and mask) through each layer in turn."
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

class SublayerConnection(nn.Module):
    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)
    def forward(self, x, sublayer):
        # Scheme from the original paper: LayerNorm(x + Sublayer(x))
        # sublayer_out = sublayer(x)
        # x_norm = self.norm(x + self.dropout(sublayer_out))
        # Slightly adjusted version used here: normalize the sub-layer output
        # first, then add the residual.
        sublayer_out = sublayer(x)
        sublayer_out = self.dropout(sublayer_out)
        x_norm = x + self.norm(sublayer_out)
        return x_norm

class EncoderLayer(nn.Module):
    "EncoderLayer is made up of two sublayers: self-attn and feed forward"
    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size   # embedding dimension of the model, 512 by default
    def forward(self, x, mask):
        # attention sub layer
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        # feed forward sub layer
        z = self.sublayer[1](x, self.feed_forward)
        return z



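The decoder is also a stack of N = 6 identical layers. On top of the two sub-layers of an encoder layer, each decoder layer inserts a third sub-layer that performs multi-head attention over the output of the encoder stack. As in the encoder, every sub-layer has a residual connection followed by layer normalization. The decoder's self-attention sub-layer is additionally masked so that a position cannot attend to later positions. This mask, together with the output embeddings being offset by one position, ensures that the prediction for position i depends only on the known outputs at positions less than i.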

Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.

The Transformer architecture once more:


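Role of the masks: masks serve two purposes in the Transformer. One is to hide the invalid padding positions; the other is to hide information from the "future". The encoder mask mainly serves the first purpose, while the decoder mask serves both at once.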
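
As a rough illustration of how the two purposes combine in the decoder, the sketch below builds both masks with plain Python lists. The token ids and the choice of 0 as the padding id are assumptions made for the example.

```python
def padding_mask(tokens, pad_id=0):
    """True where a position holds a real token, False where it is padding."""
    return [tok != pad_id for tok in tokens]

def subsequent_mask(size):
    """Lower-triangular mask: position i may attend to positions j <= i."""
    return [[j <= i for j in range(size)] for i in range(size)]

def decoder_mask(tokens, pad_id=0):
    """Key j is visible to query i only if it is real AND not in the future."""
    pad = padding_mask(tokens, pad_id)
    causal = subsequent_mask(len(tokens))
    n = len(tokens)
    return [[pad[j] and causal[i][j] for j in range(n)] for i in range(n)]

tgt = [5, 7, 9, 0]   # hypothetical token ids; the last position is padding
for row in decoder_mask(tgt):
    print([int(v) for v in row])
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 0]
```

An encoder would use only `padding_mask`, since every real source position may attend to every other one.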



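Role of the linear layer: it applies a linear transformation to the output of the previous step to produce an output of a specified dimension, i.e. it converts the dimension. The resulting dimension equals the number of output classes; for a translation task, that is the size of the vocabulary.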

def subsequent_mask(size):
    "Mask out subsequent positions."
    attn_shape = (1, size, size)
    # Upper triangle above the diagonal is 1: these are the "future" positions.
    subsequent_mask = np.triu(np.ones(attn_shape), k=1).astype('uint8')
    # Convert the numpy array to a torch tensor. Comparing with 0 inverts the
    # triangle, so allowed positions (on and below the diagonal) become True.
    return torch.from_numpy(subsequent_mask) == 0
class Decoder(nn.Module):
    "Generic N layer decoder with masking."
    def __init__(self, layer, N):
        super(Decoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)
    def forward(self, x, memory, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)
class DecoderLayer(nn.Module):
    "Decoder is made of self-attn, src-attn, and feed forward (defined below)"
    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        self.size = size
        self.self_attn = self_attn
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 3)
    def forward(self, x, memory, src_mask, tgt_mask):
        "Follow Figure 1 (right) for connections."
        m = memory
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))
        return self.sublayer[2](x, self.feed_forward)



An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

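The output has the same dimension as the values. The weights come from the similarity between the query and each key (the compatibility function), and different attention mechanisms use different ways of computing that similarity.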

Scaled Dot-Product Attention

This part introduces the attention mechanism proposed in the paper, Scaled Dot-Product Attention.

Each query is dotted with every key, and each dot product is divided by $\sqrt{d_k}$, where $d_k$ is the dimension of the queries and keys. A larger dot product means higher similarity and therefore a larger weight. Finally, a softmax turns the scaled scores into the weights on the values.

We call our particular attention “Scaled Dot-Product Attention”. The input consists of queries and keys of dimension $d_k$, and values of dimension $d_v$. We compute the dot products of the query with all keys, divide each by $\sqrt{d_k}$, and apply a softmax function to obtain the weights on the values.


In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q. The keys and values are also packed together into matrices K and V . We compute the matrix of outputs as:

$$\operatorname{Attention}(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V$$

The two most common attention functions are additive attention and dot-product (multiplicative) attention. Dot-product attention is the same as the paper's algorithm except for the scaling factor $1/\sqrt{d_k}$ (the "Scaled" step). Additive attention computes the compatibility function with a feed-forward network that has a single hidden layer. The two have similar theoretical complexity, but dot-product attention is much faster and more space-efficient in practice because it maps onto highly optimized matrix multiplication code.

The two most commonly used attention functions are additive attention, and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of $1/\sqrt{d_k}$. Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.

Why divide by $\sqrt{d_k}$ instead of using plain dot-product attention without the "Scaled" step?

For small $d_k$ the two mechanisms perform similarly, but for larger $d_k$ additive attention outperforms unscaled dot-product attention. The paper suspects that for large $d_k$ the dot products grow large in magnitude (the softmax output for the most confident key is pushed toward 1 and the others toward 0), which drives the softmax into regions with extremely small gradients. Scaling the dot products by $1/\sqrt{d_k}$ counteracts this effect.

While for small values of $d_k$ the two mechanisms perform similarly, additive attention outperforms dot product attention without scaling for larger values of $d_k$. We suspect that for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by $1/\sqrt{d_k}$.
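
A small numeric experiment makes the saturation argument concrete. The vectors below are hand-picked assumptions (entries of +1 or -1 with $d_k = 64$), chosen so the dot products are easy to read; the sketch compares the softmax gradient with and without scaling.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
q = [1.0] * d_k                                    # query
k1 = [1.0] * d_k                                   # key aligned with the query
k2 = [1.0] * (d_k // 2) + [-1.0] * (d_k // 2)      # key orthogonal to the query

dots = [sum(a * b for a, b in zip(q, k)) for k in (k1, k2)]   # [64.0, 0.0]
p_raw = softmax(dots)
p_scaled = softmax([s / math.sqrt(d_k) for s in dots])        # scores [8.0, 0.0]

# For a two-way softmax, the derivative of p[0] w.r.t. its score is p[0] * p[1].
grad_raw = p_raw[0] * p_raw[1]
grad_scaled = p_scaled[0] * p_scaled[1]
print(grad_raw, grad_scaled)
```

The unscaled gradient is vanishingly small, while the scaled one is many orders of magnitude larger, which matches the motivation quoted above.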

def attention(query, key, value, mask=None, dropout=None):
    "Compute 'Scaled Dot Product Attention'"
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) \
             / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = F.softmax(scores, dim = -1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn

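✏️ Mask: when computing the weights, ignore the products of Q and K for positions after time t. Concretely, the corresponding scores are replaced with a very large negative number (-1e9 in the code above), so that their weights are approximately 0 after the Softmax layer.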
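
The effect of that replacement can be checked on a single row of scores. The sketch below uses plain Python, and the scores and mask are made-up values for illustration.

```python
import math

def masked_softmax(scores, mask, fill=-1e9):
    """Softmax after replacing masked-out scores with a large negative value."""
    filled = [s if keep else fill for s, keep in zip(scores, mask)]
    m = max(filled)
    exps = [math.exp(s - m) for s in filled]   # masked entries underflow to 0.0
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 3.0, 0.5]        # one query scored against 4 keys
mask = [True, True, False, False]    # the query may only see positions 0 and 1
weights = masked_softmax(scores, mask)
print(weights)
```

The masked positions receive weight 0 even though one of them had the largest raw score, and the remaining weights still sum to 1.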

Multi-Head Attention

As the figure above shows, the queries, keys, and values are linearly projected (the Linear step) h times to $d_k$, $d_k$, and $d_v$ dimensions, the attention function is then run on each projection in parallel (the Scaled Dot-Product Attention step), the results are concatenated (the Concat step), and finally projected once more (the Linear step).

Instead of performing a single attention function with $d_{\text{model}}$-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values h times with different, learned linear projections to $d_k$, $d_k$ and $d_v$ dimensions, respectively. On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding $d_v$-dimensional output values. These are concatenated and once again projected, resulting in the final values.


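Multi-head attention lets the model attend jointly to information from different representation subspaces at different positions. With a single attention head, averaging suppresses this.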

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.




$$\begin{aligned} \operatorname{MultiHead}(Q, K, V) &= \operatorname{Concat}\left(\operatorname{head}_{1}, \ldots, \operatorname{head}_{h}\right) W^{O} \\ \text{where } \operatorname{head}_{i} &= \operatorname{Attention}\left(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}\right) \end{aligned}$$

Where the projections are parameter matrices $W_{i}^{Q} \in \mathbb{R}^{d_{\text{model}} \times d_{k}}$, $W_{i}^{K} \in \mathbb{R}^{d_{\text{model}} \times d_{k}}$, $W_{i}^{V} \in \mathbb{R}^{d_{\text{model}} \times d_{v}}$ and $W^{O} \in \mathbb{R}^{h d_{v} \times d_{\text{model}}}$.

In this work we employ h = 8 parallel attention layers, or heads. For each of these we use $d_{k}=d_{v}=d_{\text{model}} / h=64$. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.
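
The claim about similar cost can be sanity-checked by counting projection weights with the dimensions quoted above. This is back-of-the-envelope arithmetic that ignores bias terms, not a measurement.

```python
d_model, h = 512, 8
d_k = d_v = d_model // h            # 64 per head

# Per head: W_i^Q and W_i^K are d_model x d_k, W_i^V is d_model x d_v.
per_head = 2 * d_model * d_k + d_model * d_v
multi_head_params = h * per_head + (h * d_v) * d_model    # plus W^O

# Single head at full dimensionality: W^Q, W^K, W^V, W^O are all d_model x d_model.
single_head_params = 4 * d_model * d_model

print(d_k, multi_head_params, single_head_params)  # 64 1048576 1048576
```

Both counts come to $4 \cdot d_{\text{model}}^2$, which is why an implementation can hold all heads in four `d_model x d_model` linear layers and split the result into heads afterwards.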

class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        "Take in model size and number of heads."
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.d_k = d_model // h
        self.h = h
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)
    def forward(self, query, key, value, mask=None):
        "Implements Figure 2"
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)
        # 1) Do all the linear projections in batch from d_model => h x d_k 
        query, key, value = \
            [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
             for l, x in zip(self.linears, (query, key, value))]
        # 2) Apply attention on all the projected vectors in batch.
        x, self.attn = attention(query, key, value, mask=mask,
                                 dropout=self.dropout)
        # 3) "Concat" using a view and apply a final linear. 
        x = x.transpose(1, 2).contiguous() \
             .view(nbatches, -1, self.h * self.d_k)
        return self.linears[-1](x)



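  1. Encoder-decoder attention: the queries come from the previous decoder layer, and the keys and values come from the encoder output. Every decoder position can attend to all positions of the input sequence, as in the classic encoder-decoder attention of Seq2Seq models.
  2. Encoder self-attention: all keys, values, and queries come from the same place, namely the output of the previous encoder layer. Each position in the current encoder layer can attend to all positions of the previous layer.
  3. Decoder self-attention: each decoder position can attend to the current position and all positions before it. Leftward information flow in the decoder has to be blocked to keep the auto-regressive property. This is implemented inside scaled dot-product attention by masking out (setting to negative infinity) every softmax input that corresponds to an illegal connection.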

The Transformer uses multi-head attention in three different ways:

  • In “encoder-decoder attention” layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models.
  • The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.
  • Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections.

Position-wise Feed-Forward Networks

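Besides the attention sub-layers, every layer of the encoder and decoder contains a fully connected feed-forward network that is applied to each position separately and identically. It consists of two linear transformations with a ReLU activation in between.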

In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.

$$\operatorname{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

The linear transformations are the same across positions, but they use different parameters from layer to layer. Another way to describe this is as two convolutions with kernel size 1.

While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1.

The dimensionality of input and output is $d_{\text{model}} = 512$, and the inner-layer has dimensionality $d_{ff} = 2048$.



$xW_1 + b_1$: Linear;

$\max(0, xW_1 + b_1)$: ReLU, whose mathematical definition is $f(x)=\max(0,x)$;


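The Feed Forward Layer is simply two fully connected layers applied one after the other. The key point is that the attention module's output at each time step integrates information from all time steps, whereas the Feed Forward Layer only further combines each time step's own features and is independent of the other time steps.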
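
The independence between time steps can be seen in a toy version of the FFN written with plain lists. The weights below are arbitrary small values chosen for illustration (a 2-dimensional model with a 3-dimensional inner layer), not trained parameters.

```python
def ffn(x, w1, b1, w2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single position vector x."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*w1), b1)]          # Linear, then ReLU
    return [sum(hi * w for hi, w in zip(hidden, col)) + b
            for col, b in zip(zip(*w2), b2)]            # second Linear

w1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -1.0]]    # d_model x d_ff = 2 x 3
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # d_ff x d_model = 3 x 2
b2 = [0.5, -0.5]

seq = [[1.0, 2.0], [3.0, -1.0]]              # two positions
out = [ffn(x, w1, b1, w2, b2) for x in seq]
print(out)  # [[1.5, 2.5], [6.0, 2.0]]
```

Because each position is processed on its own, reversing the sequence simply reverses the outputs; no position influences another.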

class PositionwiseFeedForward(nn.Module):
    "Implements FFN equation."
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)
    def forward(self, x):
        return self.w_2(self.dropout(F.relu(self.w_1(x))))

Similarities and differences between the Transformer and RNNs



Embeddings and Softmax

As in other sequence transduction models, learned embeddings convert the input tokens (words) and output tokens to vectors of dimension $d_{\text{model}}$, and the usual learned linear transformation followed by a softmax converts the decoder output into next-token probabilities. The embeddings and the linear transformation are learned with the rest of the model, not pre-trained. The two embedding layers and the pre-softmax linear transformation share the same weight matrix, and in the embedding layers those weights are multiplied by $\sqrt{d_{\text{model}}}$.

Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension $d_{\text{model}}$. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation. In the embedding layers, we multiply those weights by $\sqrt{d_{\text{model}}}$.

class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model
    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)

Positional Encoding

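✏️ Why is Positional Encoding needed? Unlike an RNN, whose recurrence processes tokens one after another, attention receives all time steps at once and processes them in parallel, so it carries no order information. If the input order changes, each token still gets the same output (the outputs are merely reordered), even though word order does affect meaning. Order information therefore has to be injected explicitly. An RNN handles this by feeding the output of the previous time step into the next one, which is why it can model order.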
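
This order-blindness can be demonstrated with a stripped-down self-attention in plain Python. As a simplifying assumption, the sketch drops the learned projections and uses Q = K = V; the toy vectors are made up.

```python
import math

def self_attention(xs):
    """Unprojected self-attention: Q = K = V = xs, one output per position."""
    d_k = len(xs[0])
    outputs = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in xs]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[j] for w, v in zip(weights, xs))
                        for j in range(d_k)])
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # three toy "word vectors"
perm = [2, 0, 1]                                  # a different word order
shuffled = [tokens[i] for i in perm]

out = self_attention(tokens)
out_shuffled = self_attention(shuffled)
# out_shuffled[n] matches out[perm[n]] up to rounding: the same vectors, reordered.
```

Without positional information the model cannot tell the two orderings apart, which is the gap the positional encoding fills.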

Because the model contains no recurrence and no convolution, it needs information about the relative or absolute position of the tokens in order to use the order of the sequence. Positional encodings are therefore added to the input embeddings at the bottoms of the encoder and decoder stacks. They have the same dimension $d_{\text{model}}$ as the embeddings, so the two can simply be summed. Many choices of positional encoding exist, both learned and fixed.

Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add “positional encodings” to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension $d_{\text{model}}$ as the embeddings, so that the two can be summed.


In this work, we use sine and cosine functions of different frequencies:

$$PE_{(pos,2i)} = \sin\left(pos / 10000^{2i/d_{\text{model}}}\right)$$

$$PE_{(pos,2i+1)} = \cos\left(pos / 10000^{2i/d_{\text{model}}}\right)$$

Here $pos$ is the position and $i$ is the dimension, so each dimension of the positional encoding corresponds to a sinusoid, and the wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$. This function was chosen on the hypothesis that it lets the model easily learn to attend by relative position, because for any fixed offset $k$, $PE_{pos+k}$ can be written as a linear function of $PE_{pos}$.

where $pos$ is the position and $i$ is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.
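
The "linear function" property follows from the angle-addition identities: each (sin, cos) pair at position $pos + k$ is a rotation of the pair at $pos$, and the rotation depends only on $k$ and $i$. The sketch below checks this numerically; `pe_pair` and `shift` are hypothetical helper names.

```python
import math

def pe_pair(pos, i, d_model=512):
    """Return (PE(pos, 2i), PE(pos, 2i+1)) for the sinusoidal encoding."""
    angle = pos / (10000 ** (2 * i / d_model))
    return math.sin(angle), math.cos(angle)

def shift(pair, k, i, d_model=512):
    """Map the pair at pos to the pair at pos + k with a 2x2 rotation.

    The rotation angle depends on the offset k and the index i, never on pos.
    """
    s, c = pair
    theta = k / (10000 ** (2 * i / d_model))
    return (s * math.cos(theta) + c * math.sin(theta),    # sin(a + theta)
            c * math.cos(theta) - s * math.sin(theta))    # cos(a + theta)

pos, k, i = 17, 4, 5
print(shift(pe_pair(pos, i), k, i))
print(pe_pair(pos + k, i))   # agrees with the line above up to rounding
```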

Dropout is also applied to the sum of the embeddings and the positional encodings in both the encoder and decoder stacks; the base model uses a rate of $P_{drop}=0.1$.

In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of $P_{drop}=0.1$.




class PositionalEncoding(nn.Module):
    "Implement the PE function."
    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) *
                             -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)
    def forward(self, x):
        # self.pe is a registered buffer, not a parameter, so no gradient flows into it.
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)






Attention makes fewer assumptions about the structure of the data, so it needs more data to reach results comparable to RNNs and CNNs.


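Advantages of the Transformer:

Few hyperparameters need tuning: N (the number of encoder and decoder layers), $d_{\text{model}}$ (the output dimension of the embedding layer), and h (the number of projections, i.e. heads).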
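
For reference, the base-model values quoted throughout this post fit in one small configuration object. The class and field names below are hypothetical; only the numbers come from the text above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TransformerConfig:
    """Base-model hyperparameters quoted in this post (names are illustrative)."""
    n_layers: int = 6      # N: encoder and decoder layers
    d_model: int = 512     # width of embeddings and sub-layer outputs
    n_heads: int = 8       # h: attention heads
    d_ff: int = 2048       # inner width of the feed-forward network
    p_drop: float = 0.1    # dropout rate

    @property
    def d_k(self) -> int:
        # Per-head width follows from d_model and h, so it is not tuned separately.
        assert self.d_model % self.n_heads == 0
        return self.d_model // self.n_heads

base = TransformerConfig()
print(base.d_k)  # 64
```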





