GPT-3, BERT, and XLNet are among the latest techniques in natural language processing (NLP), and they all share a special architectural component called the transformer. This is because the transformer mechanism is remarkably powerful. A complete transformer is generally built from the following components:
- scaled dot-product attention
- self-attention
- cross-attention
- multi-head attention
- positional encoding
Let's start with scaled dot-product attention, since we will also need it to build multi-head attention.
Scaled Dot-Product Attention
Mathematically, scaled dot-product attention is expressed as:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V
$$
Q, K, and V are batches of matrices obtained from linear projections of the input, each with shape (batch_size, seq_length, num_features).
Multiplying the queries (Q) by the transposed keys (K) gives a (batch_size, seq_length, seq_length) tensor, which roughly tells us how important each element of the sequence is and determines which elements we "attend" to. The attention array is normalized with a softmax so that all of the weights sum to 1. Finally, the attention is applied to the values (V) via matrix multiplication.
The code for scaled dot-product attention is very simple: just a couple of matrix multiplications plus a softmax. For simplicity, we omit the optional masking step.
```python
import torch.nn.functional as f
from torch import Tensor


def scaled_dot_product_attention(query: Tensor, key: Tensor, value: Tensor) -> Tensor:
    # (batch, seq_len, seq_len) attention scores from Q @ K^T
    temp = query.bmm(key.transpose(1, 2))
    # Scale by sqrt(d_k) to keep the softmax in a well-behaved range
    scale = query.size(-1) ** 0.5
    softmax = f.softmax(temp / scale, dim=-1)
    # Weighted sum of the values
    return softmax.bmm(value)
```
Note that the MatMul operations correspond to torch.bmm in PyTorch. This is because Q, K, and V (the query, key, and value arrays) are all batches of matrices, each with shape (batch_size, sequence_length, num_features), and batch matrix multiplication is performed only over the last two dimensions.
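As a quick sanity check (a minimal sketch with made-up shapes, not from the original post), we can call the function on random tensors and confirm the output shape:

```python
import torch

# Hypothetical sizes: batch_size=2, seq_length=5, num_features=8
query = torch.rand(2, 5, 8)
key = torch.rand(2, 5, 8)
value = torch.rand(2, 5, 8)

out = scaled_dot_product_attention(query, key, value)
print(out.shape)  # torch.Size([2, 5, 8])
```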
Once you understand scaled dot-product attention, self-attention and cross-attention are easy to understand: the only difference is where Q, K, and V come from.
- In self-attention, Q, K, and V all come from the same input: the high-dimensional representation of the current sequence produced by the previous layer.
- In cross-attention, Q comes from the current (target) sequence, while K and V come from the same input: the output of the encoder's final layer (see the sketch after this list).
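As an illustration (a minimal sketch with made-up shapes; `x` and `memory` are hypothetical tensors standing in for the current sequence and the encoder output), both cases are just different calls to the same attention function:

```python
import torch

x = torch.rand(2, 5, 8)       # current sequence: (batch, tgt_len, features)
memory = torch.rand(2, 7, 8)  # encoder output:   (batch, src_len, features)

# Self-attention: Q, K, and V all come from the same tensor
self_out = scaled_dot_product_attention(x, x, x)             # (2, 5, 8)

# Cross-attention: Q comes from the current sequence, K and V from the memory
cross_out = scaled_dot_product_attention(x, memory, memory)  # (2, 5, 8)
```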
Multi-Head Attention
As the figure above shows, multi-head attention is made up of several identical attention heads, where each attention head contains three linear layers followed by scaled dot-product attention. The code is as follows:
```python
import torch
from torch import nn


class HeadAttention(nn.Module):
    def __init__(self, dim_in: int, dim_k: int, dim_v: int):
        super().__init__()
        self.q = nn.Linear(dim_in, dim_k)
        self.k = nn.Linear(dim_in, dim_k)
        self.v = nn.Linear(dim_in, dim_v)

    def forward(self, query: Tensor, key: Tensor, value: Tensor) -> Tensor:
        return scaled_dot_product_attention(self.q(query), self.k(key), self.v(value))
```
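As a quick usage sketch (illustrative sizes only, not from the original post), a single head projects the input down to dim_k / dim_v features before applying attention:

```python
head = HeadAttention(dim_in=512, dim_k=64, dim_v=64)
x = torch.rand(2, 5, 512)

out = head(x, x, x)  # self-attention through one head
print(out.shape)     # torch.Size([2, 5, 64])
```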
Now it is easy to build multi-head attention: just combine num_heads different attention heads with a single Linear layer for the output.
```python
class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads: int, dim_in: int, dim_k: int, dim_v: int):
        super().__init__()
        self.heads = nn.ModuleList(
            [HeadAttention(dim_in, dim_k, dim_v) for _ in range(num_heads)]
        )
        self.linear = nn.Linear(num_heads * dim_v, dim_in)

    def forward(self, query: Tensor, key: Tensor, value: Tensor) -> Tensor:
        return self.linear(
            torch.cat([h(query, key, value) for h in self.heads], dim=-1)
        )
```
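A minimal usage sketch (with assumed sizes, e.g. 8 heads of 64 features each for a 512-dimensional model) shows that the output has the same shape as the input:

```python
mha = MultiHeadAttention(num_heads=8, dim_in=512, dim_k=64, dim_v=64)
x = torch.rand(2, 5, 512)

out = mha(x, x, x)  # self-attention: query = key = value = x
print(out.shape)    # torch.Size([2, 5, 512])
```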
Positional Encoding
Before building the full transformer, we need one more component: positional encoding. Notice that MultiHeadAttention has no operations along the sequence dimension; everything happens in the feature dimension, so it is independent of sequence length and order. We have to provide positional information to the model so that it knows about the relative positions of data points in the input sequence.
The transformer paper encodes positional information using trigonometric functions:

$$
PE_{(pos,\ 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad
PE_{(pos,\ 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$
Why sinusoidal encodings? Because sine and cosine are periodic and their values are bounded to the range [-1, 1]. And although learned positional embeddings have been shown to perform just as well, the authors still chose the sinusoidal version, in part because it may allow the model to extrapolate to sequence lengths longer than those seen during training.
We can implement it in just a few lines of code:
```python
def position_encoding(
    seq_len: int, dim_model: int, device: torch.device = torch.device("cpu"),
) -> Tensor:
    pos = torch.arange(seq_len, dtype=torch.float, device=device).reshape(1, -1, 1)
    dim = torch.arange(dim_model, dtype=torch.float, device=device).reshape(1, 1, -1)
    phase = pos / (1e4 ** (dim / dim_model))
    # Even feature dimensions get sine, odd dimensions get cosine
    return torch.where(dim.long() % 2 == 0, torch.sin(phase), torch.cos(phase))
```
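Since the encoding depends only on the sequence length and the model dimension (not on the data), it can be precomputed once and added to any batch of embeddings. A minimal sketch with made-up sizes:

```python
pe = position_encoding(seq_len=16, dim_model=512)
print(pe.shape)  # torch.Size([1, 16, 512])

x = torch.rand(2, 16, 512)  # hypothetical token embeddings
x = x + pe                  # broadcasts across the batch dimension
```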
Transformer
Finally, we are ready to build the Transformer! Let's take another look at the complete network diagram:
Note that the transformer uses an encoder-decoder architecture. The encoder (left) processes the input sequence and returns a feature vector (or memory vector). The decoder processes the target sequence and incorporates the information from the encoder memory. The output of the decoder is our model's prediction!
We can code the encoder and decoder modules independently of each other and then combine them at the end. First, let's build the encoder:
```python
def feed_forward(dim_input: int = 512, dim_feedforward: int = 2048) -> nn.Module:
    return nn.Sequential(
        nn.Linear(dim_input, dim_feedforward),
        nn.ReLU(),
        nn.Linear(dim_feedforward, dim_input),
    )


class Residual(nn.Module):
    def __init__(self, sublayer: nn.Module, dimension: int, dropout: float = 0.1):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(dimension)
        self.dropout = nn.Dropout(dropout)

    def forward(self, *tensors: Tensor) -> Tensor:
        # Assume that the "value" tensor is given last, so we can compute the
        # residual.  This matches the signature of 'MultiHeadAttention'.
        return self.norm(tensors[-1] + self.dropout(self.sublayer(*tensors)))
```
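Note the design choice here: Residual adds the skip connection to the last tensor it receives, which matches MultiHeadAttention's (query, key, value) signature. A minimal sketch (with assumed sizes, not from the original post) of wrapping an attention block:

```python
attn = Residual(
    MultiHeadAttention(num_heads=8, dim_in=512, dim_k=64, dim_v=64),
    dimension=512,
)
x = torch.rand(2, 16, 512)

# The residual is taken from the last argument (the "value" tensor)
print(attn(x, x, x).shape)  # torch.Size([2, 16, 512])
```

With these helpers in place, the encoder layer and the full encoder are straightforward: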
```python
class TransformerEncoderLayer(nn.Module):
    def __init__(
        self,
        dim_model: int = 512,
        num_heads: int = 6,
        dim_feedforward: int = 2048,
        dropout: float = 0.1,
    ):
        super().__init__()
        dim_k = dim_v = dim_model // num_heads
        self.attention = Residual(
            MultiHeadAttention(num_heads, dim_model, dim_k, dim_v),
            dimension=dim_model,
            dropout=dropout,
        )
        self.feed_forward = Residual(
            feed_forward(dim_model, dim_feedforward),
            dimension=dim_model,
            dropout=dropout,
        )

    def forward(self, src: Tensor) -> Tensor:
        src = self.attention(src, src, src)
        return self.feed_forward(src)


class TransformerEncoder(nn.Module):
    def __init__(
        self,
        num_layers: int = 6,
        dim_model: int = 512,
        num_heads: int = 8,
        dim_feedforward: int = 2048,
        dropout: float = 0.1,
    ):
        super().__init__()
        self.layers = nn.ModuleList([
            TransformerEncoderLayer(dim_model, num_heads, dim_feedforward, dropout)
            for _ in range(num_layers)
        ])

    def forward(self, src: Tensor) -> Tensor:
        seq_len, dimension = src.size(1), src.size(2)
        # Add (rather than replace) the positional encoding to the input
        src += position_encoding(seq_len, dimension, device=src.device)
        for layer in self.layers:
            src = layer(src)
        return src
```
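As a quick check (a minimal sketch with assumed sizes, not part of the original post), the encoder maps a (batch, seq_len, dim_model) input to a memory tensor of the same shape:

```python
encoder = TransformerEncoder(num_layers=2, dim_model=512, num_heads=8)
src = torch.rand(4, 16, 512)

memory = encoder(src)
print(memory.shape)  # torch.Size([4, 16, 512])
```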
The decoder module is very similar, with just a few small differences:
- The decoder takes two arguments (target and memory) instead of one;
- Each layer has two multi-head attention modules instead of one;
- The second multi-head attention uses the encoder memory for two of its inputs;
- The decoder therefore contains both self-attention and cross-attention.
```python
class TransformerDecoderLayer(nn.Module):
    def __init__(
        self,
        dim_model: int = 512,
        num_heads: int = 6,
        dim_feedforward: int = 2048,
        dropout: float = 0.1,
    ):
        super().__init__()
        dim_k = dim_v = dim_model // num_heads
        self.attention_1 = Residual(
            MultiHeadAttention(num_heads, dim_model, dim_k, dim_v),
            dimension=dim_model,
            dropout=dropout,
        )
        self.attention_2 = Residual(
            MultiHeadAttention(num_heads, dim_model, dim_k, dim_v),
            dimension=dim_model,
            dropout=dropout,
        )
        self.feed_forward = Residual(
            feed_forward(dim_model, dim_feedforward),
            dimension=dim_model,
            dropout=dropout,
        )

    def forward(self, tgt: Tensor, memory: Tensor) -> Tensor:
        # Self-attention over the target sequence
        tgt = self.attention_1(tgt, tgt, tgt)
        # Cross-attention: the encoder memory is used for two of the inputs; the
        # target is passed last so that 'Residual' adds the skip connection to it
        tgt = self.attention_2(memory, memory, tgt)
        return self.feed_forward(tgt)


class TransformerDecoder(nn.Module):
    def __init__(
        self,
        num_layers: int = 6,
        dim_model: int = 512,
        num_heads: int = 8,
        dim_feedforward: int = 2048,
        dropout: float = 0.1,
    ):
        super().__init__()
        self.layers = nn.ModuleList([
            TransformerDecoderLayer(dim_model, num_heads, dim_feedforward, dropout)
            for _ in range(num_layers)
        ])
        self.linear = nn.Linear(dim_model, dim_model)

    def forward(self, tgt: Tensor, memory: Tensor) -> Tensor:
        seq_len, dimension = tgt.size(1), tgt.size(2)
        # Add (rather than replace) the positional encoding to the target
        tgt += position_encoding(seq_len, dimension, device=tgt.device)
        for layer in self.layers:
            tgt = layer(tgt, memory)
        return torch.softmax(self.linear(tgt), dim=-1)
```
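As with the encoder, we can sanity-check the decoder in isolation (a minimal sketch with assumed sizes; the memory tensor here is just a random stand-in for the encoder output):

```python
decoder = TransformerDecoder(num_layers=2, dim_model=512, num_heads=8)
tgt = torch.rand(4, 16, 512)
memory = torch.rand(4, 16, 512)  # stand-in for the encoder output

out = decoder(tgt, memory)
print(out.shape)  # torch.Size([4, 16, 512])
```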
Finally, we need to wrap everything up in a single Transformer class. This requires nothing more than putting an encoder and a decoder together and passing the data through them in the correct order.
```python
class Transformer(nn.Module):
    def __init__(
        self,
        num_encoder_layers: int = 6,
        num_decoder_layers: int = 6,
        dim_model: int = 512,
        num_heads: int = 6,
        dim_feedforward: int = 2048,
        dropout: float = 0.1,
        activation: nn.Module = nn.ReLU(),
    ):
        super().__init__()
        self.encoder = TransformerEncoder(
            num_layers=num_encoder_layers,
            dim_model=dim_model,
            num_heads=num_heads,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
        )
        self.decoder = TransformerDecoder(
            num_layers=num_decoder_layers,
            dim_model=dim_model,
            num_heads=num_heads,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
        )

    def forward(self, src: Tensor, tgt: Tensor) -> Tensor:
        return self.decoder(tgt, self.encoder(src))
```
Let's create a simple test as a sanity check for our implementation. We can construct random tensors for src and tgt, check that the model runs without errors, and confirm that the output tensor has the correct shape.
```python
src = torch.rand(64, 16, 512)
tgt = torch.rand(64, 16, 512)
out = Transformer()(src, tgt)
print(out.shape)
# torch.Size([64, 16, 512])
```
Conclusions
Hopefully this article helps you understand how transformers are built and how they work. You may not have come across these models in computer vision before, but DETR and ViT have already achieved breakthrough results, and we can expect to see many more models like them in the coming years.