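1. Describe your understanding🎄
2. What is self-attention?🎄
3. What are Q, K, and V?🎄
- V: the vector that carries the input's features. Q and K: the feature vectors used to compute the attention weights.
- Q, K, and V in the attention mechanism: compute the similarity between the current Query and every Key, pass the similarity scores through a Softmax layer to obtain a set of weights, then multiply each weight by its corresponding Value and sum the products to get the attention output.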
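The three steps described above (similarity, Softmax, weighted sum) can be sketched in plain Python for a single query. This is a minimal illustration, not a reference implementation: the function name `attend` and the toy vectors are made up for this example, and the similarity is assumed to be a dot product scaled by the square root of the key dimension, matching the PyTorch code in question 7.

```python
import math

def attend(query, keys, values):
    """Attention output for one query over a list of key/value vectors."""
    d_k = len(query)
    # Step 1: similarity between the query and every key (scaled dot product).
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
        for key in keys
    ]
    # Step 2: softmax turns the scores into weights that sum to 1.
    max_score = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Step 3: weighted sum of the value vectors.
    dim_v = len(values[0])
    output = [
        sum(w * v[i] for w, v in zip(weights, values))
        for i in range(dim_v)
    ]
    return weights, output

# Toy data (hypothetical): keys 0 and 2 match the query, key 1 does not.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [10.0, 0.0]]
weights, output = attend(query, keys, values)
```

Keys 0 and 2 point in the same direction as the query, so they receive equal and larger weights than key 1, and the output leans toward their values.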
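4. What is multi-head attention?🎄
Multi-head attention consists of several self-attention modules (heads). Each head applies its own weights to self-attention and therefore produces a different result. The results of all heads are concatenated and passed through a fully connected layer to produce the final output, which helps the model capture richer features.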
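The description above (separate weights per head, concatenation, then a fully connected layer) might look like the following sketch. The sizes, the `heads` list of per-head projections, and the `out_proj` name are assumptions chosen for illustration. Practical implementations usually fuse the per-head projections into one large linear layer, as the code in question 7 does.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq_len, d_model, num_heads = 2, 5, 8, 2
head_dim = d_model // num_heads

x = torch.rand(batch, seq_len, d_model)

# Each head owns its Q/K/V projections, i.e. its own learned weights.
heads = [
    {name: nn.Linear(d_model, head_dim) for name in ("q", "k", "v")}
    for _ in range(num_heads)
]
out_proj = nn.Linear(d_model, d_model)  # final fully connected layer

head_outputs = []
for head in heads:
    q, k, v = head["q"](x), head["k"](x), head["v"](x)
    scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)
    weights = F.softmax(scores, dim=-1)   # (batch, seq_len, seq_len)
    head_outputs.append(weights @ v)      # (batch, seq_len, head_dim)

# Concatenate the heads along the feature axis, then mix them.
concat = torch.cat(head_outputs, dim=-1)  # (batch, seq_len, d_model)
result = out_proj(concat)                 # (batch, seq_len, d_model)
```

The loop is written out only to make the per-head structure visible. The heads do not depend on each other, so nothing forces them to run one after another.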
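5. What is positional encoding, and what problem does it solve?🎄
6. How should the Transformer's parallel computation be understood?🎄
The core of it lies in multi-head attention: the self-attention computations over the multiple groups of Q, K, and V can all run at the same time. Because the work is done simultaneously, the hardware requirements are relatively high.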
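A small check can make the parallelism concrete: computing the heads one at a time in a loop gives the same numbers as a single batched matrix multiplication over all heads. This is only a sketch under assumed tensor sizes, and the helper name `attention` is hypothetical.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, num_heads, seq_len, head_dim = 4, 2, 6, 3
q = torch.rand(batch, num_heads, seq_len, head_dim)
k = torch.rand(batch, num_heads, seq_len, head_dim)
v = torch.rand(batch, num_heads, seq_len, head_dim)

def attention(q, k, v):
    # Scaled dot-product attention over the last two dimensions.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return F.softmax(scores, dim=-1) @ v

# Sequential: process one head at a time, then stack the results.
looped = torch.stack(
    [attention(q[:, h], k[:, h], v[:, h]) for h in range(num_heads)],
    dim=1,
)

# Parallel: one batched call covers every sample, head, and position.
batched = attention(q, k, v)

same = torch.allclose(looped, batched, atol=1e-5)
```

Since `torch.matmul` treats the leading dimensions as batch dimensions, the batched call hands the whole workload to the device at once, which is where the demand on hardware comes from.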
7. Self-attention in PyTorch🎄
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class selfAttention(nn.Module):
    def __init__(self, num_attention_heads, input_size, hidden_size):
        super(selfAttention, self).__init__()
        if hidden_size % num_attention_heads != 0:
            raise ValueError(
                "the hidden size %d is not a multiple of the number of attention heads "
                "%d" % (hidden_size, num_attention_heads)
            )
        self.num_attention_heads = num_attention_heads
        self.attention_head_size = hidden_size // num_attention_heads
        self.all_head_size = hidden_size
        # Linear projections that map the input to keys, queries, and values.
        self.key_layer = nn.Linear(input_size, hidden_size)
        self.query_layer = nn.Linear(input_size, hidden_size)
        self.value_layer = nn.Linear(input_size, hidden_size)

    def trans_to_multiple_heads(self, x):
        # (batch, seq_len, hidden) -> (batch, heads, seq_len, head_size)
        new_size = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        x = x.view(new_size)
        return x.permute(0, 2, 1, 3)

    def forward(self, x):
        key = self.key_layer(x)
        query = self.query_layer(x)
        value = self.value_layer(x)
        key_heads = self.trans_to_multiple_heads(key)
        query_heads = self.trans_to_multiple_heads(query)
        value_heads = self.trans_to_multiple_heads(value)
        # Similarity of every query with every key, scaled by sqrt(head_size).
        attention_scores = torch.matmul(query_heads, key_heads.permute(0, 1, 3, 2))
        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
        # Softmax over the key axis yields the attention weights.
        attention_probs = F.softmax(attention_scores, dim=-1)
        # Weighted sum of the values, one result per head.
        context = torch.matmul(attention_probs, value_heads)
        # Merge heads: (batch, heads, seq_len, head_size) -> (batch, seq_len, hidden)
        context = context.permute(0, 2, 1, 3).contiguous()
        new_size = context.size()[:-2] + (self.all_head_size,)
        context = context.view(*new_size)
        return context


features = torch.rand((32, 20, 10))  # (batch, seq_len, input_size)
attention = selfAttention(2, 10, 20)
result = attention(features)
print(result.shape)  # torch.Size([32, 20, 20])