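Many newcomers to AI may not know how impressive-sounding voice assistants, machine translation systems or chatbots are actually built, or how a deep learning model is assembled from scratch and made to run.
Today I'll walk you through building a Transformer model from scratch and training it on your own data. The tutorial is very basic, so the resulting model is a toy, meant for complete beginners who want to see how the pieces fit together.
The whole training workflow breaks down into the following steps, which we cover one by one in the sections below: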
- Process the data
- Create the model
- Create the loss function
- Create the optimizer
- Train
- Predict
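Setting up the environment
We need three things:
- PyTorch, for training deep learning models.
- The Hugging Face tokenizer, for splitting sentences into tokens.
- LightSeq, whose fast model, loss function and optimizer we use to build the Transformer.
So just run the following installation commands: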
    pip3 install torch transformers
    git clone https://github.com/bytedance/lightseq.git
    cd lightseq
    pip3 install -e .
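Before going further, it can help to confirm that the three packages are actually visible to Python. The following is a minimal sketch using only the standard library; the helper name `find_missing_packages` is made up for this illustration and is not part of any of the libraries above.

```python
import importlib.util

# Packages this tutorial imports; "lightseq" is the one installed from source.
REQUIRED_PACKAGES = ("torch", "transformers", "lightseq")


def find_missing_packages(packages=REQUIRED_PACKAGES):
    """Return the names in `packages` that Python cannot locate for import."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

This only checks that the packages can be found; it does not verify that a CUDA-capable GPU is available, which the rest of the tutorial also relies on.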
Then import what we need:
    import torch
    from transformers import BertTokenizer
    from lightseq.training import LSTransformer, LSCrossEntropyLayer, LSAdam
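Processing the data
Deep learning models work with numbers, so whatever you say or write must first be turned into a sequence of integer ids, each giving a token's index in the vocabulary.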
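To make the idea concrete, here is a toy sketch in plain Python that maps whitespace-separated words to ids and pads the batch to a common length. It is only an illustration: the vocabulary, the `[PAD]` entry and the helper names are invented here, and a real tokenizer splits text into subword pieces rather than whole words.

```python
PAD_ID = 0  # id reserved for padding (illustrative choice)


def build_vocab(sentences):
    """Assign each distinct lowercase word an id, starting after PAD_ID."""
    vocab = {"[PAD]": PAD_ID}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode_batch(sentences, vocab):
    """Map sentences to id lists and right-pad them to a common length."""
    rows = [[vocab[word] for word in s.lower().split()] for s in sentences]
    max_len = max(len(row) for row in rows)
    return [row + [PAD_ID] * (max_len - len(row)) for row in rows]


sentences = ["you are so pretty", "thanks very much and you are pretty too"]
vocab = build_vocab(sentences)
print(encode_batch(sentences, vocab))
# → [[1, 2, 3, 4, 0, 0, 0, 0], [5, 6, 7, 8, 1, 2, 4, 9]]
```

The shorter sentence is filled with the padding id so that the batch forms a rectangle, which is what lets it be stored as a single tensor.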
Here we use the Hugging Face tokenizer, which conveniently converts input sentences directly into sequences of integer ids.
    def create_data():
        # Create the Hugging Face tokenizer
        tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
        vocab_size = tokenizer.vocab_size
        sep_id = tokenizer.encode(
            tokenizer.special_tokens_map["sep_token"], add_special_tokens=False
        )[0]

        # Map the source text to integer ids
        src_text = [
            "What is the fastest library in the world?",
            "You are so pretty!",
            "What do you love me for?",
            "The sparrow outside the window hovering on the telephone pole.",
        ]
        src_tokens = tokenizer.batch_encode_plus(
            src_text, padding=True, return_tensors="pt"
        )
        src_tokens = src_tokens["input_ids"].to(torch.device("cuda:0"))
        batch_size, src_seq_len = src_tokens.size(0), src_tokens.size(1)

        # Map the target text to integer ids
        trg_text = [
            "I guess it must be LightSeq, because ByteDance is the fastest.",
            "Thanks very much and you are pretty too.",
            "Love your beauty, smart, virtuous and kind.",
            "You said all this is very summery.",
        ]
        trg_tokens = tokenizer.batch_encode_plus(
            trg_text, padding=True, return_tensors="pt"
        )
        trg_tokens = trg_tokens["input_ids"].to(torch.device("cuda:0"))
        trg_seq_len = trg_tokens.size(1)

        # Shift the target text left by one token to get the expected decoder output
        target = trg_tokens.clone()[:, 1:]
        trg_tokens = trg_tokens[:, :-1]

        return (
            tokenizer,
            src_text,
            src_tokens,
            trg_text,
            trg_tokens,
            target,
            sep_id,
            vocab_size,
            batch_size,
            src_seq_len,
            trg_seq_len,
        )
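The comments in the code explain each step: all you need to do is provide the input text and the output text. The expected decoder output is simply the target text shifted left by one token, so that after each token is fed in, the model predicts which token comes next.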
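The left shift is easier to see on a single short sequence. The sketch below uses readable strings instead of ids; `[CLS]` and `[SEP]` are the start and end markers that the BERT tokenizer adds, and the helper name `make_decoder_pairs` is invented for this illustration.

```python
def make_decoder_pairs(token_ids):
    """Split one target sequence into (decoder input, expected output)."""
    decoder_input = token_ids[:-1]  # everything except the last token
    expected_output = token_ids[1:]  # everything except the first token
    return decoder_input, expected_output


tokens = ["[CLS]", "Thanks", "very", "much", "[SEP]"]
decoder_input, expected_output = make_decoder_pairs(tokens)
for seen, predicted in zip(decoder_input, expected_output):
    print(f"{seen} -> {predicted}")
```

Each printed pair reads "after seeing this token, predict that one", which is exactly the relationship between `trg_tokens` and `target` in `create_data`.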
Creating the model
Here we train a Transformer-base model. Creating it with LightSeq is simple: build a config, then construct the Transformer from it.
    def create_model(vocab_size):
        transformer_config = LSTransformer.get_config(
            model="transformer-base",
            max_batch_tokens=2048,
            max_seq_len=512,
            vocab_size=vocab_size,
            padding_idx=0,
            num_encoder_layer=6,
            num_decoder_layer=6,
            fp16=True,
            local_rank=0,
        )
        model = LSTransformer(transformer_config)
        model.to(dtype=torch.half, device=torch.device("cuda:0"))
        return model
Creating the loss function
Here we use the cross-entropy loss. Creating it with LightSeq is just as simple: again, all it takes is a config.
    def create_criterion():
        ce_config = LSCrossEntropyLayer.get_config(
            max_batch_tokens=2048,
            padding_idx=0,
            epsilon=0.0,
            fp16=True,
            local_rank=0,
        )
        loss_fn = LSCrossEntropyLayer(ce_config)
        loss_fn.to(dtype=torch.half, device=torch.device("cuda:0"))
        return loss_fn
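For readers who have not met cross-entropy before, the sketch below computes it by hand in plain Python. It is a textbook version, not LightSeq's fused kernel: it assumes no label smoothing (which is what `epsilon=0.0` appears to select), it skips positions whose target equals `padding_idx`, and it sums over tokens, whereas the library may normalize differently.

```python
import math


def cross_entropy_sum(logits, targets, padding_idx=0):
    """Sum of -log softmax(logits)[target] over non-padding positions."""
    total = 0.0
    for scores, target in zip(logits, targets):
        if target == padding_idx:
            continue  # padded positions contribute nothing to the loss
        # Numerically stable log of the softmax normalizer (log-sum-exp)
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[target]
    return total


# Two positions over a 3-word vocabulary; the second target is padding.
logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
targets = [1, 0]
print(round(cross_entropy_sum(logits, targets), 4))
# → 1.0986
```

With all scores equal, the model assigns probability 1/3 to the correct word, so the loss is log(3). As training pushes the correct word's score up, this value falls toward zero.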
Creating the optimizer
Creating the optimizer with LightSeq works exactly as it does in plain PyTorch and takes a single line of code.
    opt = LSAdam(model.parameters(), lr=1e-5)
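As background, here is what a single Adam update looks like for one scalar parameter, written in plain Python. This is the textbook algorithm with its commonly used default hyperparameters, shown as an assumption-laden sketch; it says nothing about how `LSAdam` is implemented internally, and the function name `adam_step` is invented here.

```python
import math


def adam_step(param, grad, state, lr=1e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one textbook Adam update to a single scalar parameter."""
    state["t"] += 1
    # Exponential moving averages of the gradient and the squared gradient
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    # Bias correction compensates for the zero initialization of m and v
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)


# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = 0.0
state = {"t": 0, "m": 0.0, "v": 0.0}
for _ in range(2000):
    w = adam_step(w, 2 * (w - 3), state, lr=0.01)
print(round(w, 2))
```

Because the step is divided by the running gradient magnitude, each update moves the parameter by roughly `lr` regardless of how large the raw gradient is, which is one reason Adam is a common default choice.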
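Training
Training also looks the same as usual. Here we train for 2000 epochs. Because training needs to know what the target text is, the model takes both the source and the target text as input.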
    print("========================TRAIN========================")
    model.train()
    for epoch in range(2000):
        output = model(src_tokens, trg_tokens)
        loss, _ = loss_fn(output, target)
        if epoch % 200 == 0:
            print("epoch {:03d}: {:.3f}".format(epoch, loss.item()))
        loss.backward()
        opt.step()
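Prediction
Once the model is trained, we use it for prediction. At this point the target text is unknown: you can only feed in the source text plus a start-of-sentence marker on the target side, and every subsequent target token has to be predicted by the model.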
    print("========================TEST========================")
    model.eval()
    # Get the encoder output and its padding mask
    encoder_out, encoder_padding_mask = model.encoder(src_tokens)
    # Use the first target token as the decoder's initial input and predict the rest
    predict_tokens = trg_tokens[:, :1]
    cache = {}
    for _ in range(trg_seq_len - 1):
        # Use the cache to speed up decoding
        output = model.decoder(
            predict_tokens[:, -1:], encoder_out, encoder_padding_mask, cache
        )
        # Predict the next token
        output = torch.reshape(torch.argmax(output, dim=-1), (batch_size, -1))
        # Append the predicted token to the tokens predicted so far
        predict_tokens = torch.cat([predict_tokens, output], dim=-1)

    # Overwrite everything after the end marker with the end marker
    mask = torch.cumsum(torch.eq(predict_tokens, sep_id).int(), dim=1)
    predict_tokens = predict_tokens.masked_fill(mask > 0, sep_id)
    # Convert the predicted ids back to text
    predict_text = tokenizer.batch_decode(predict_tokens, skip_special_tokens=True)
    print(">>>>> source text")
    print("\n".join(src_text))
    print(">>>>> target text")
    print("\n".join(trg_text))
    print(">>>>> predict text")
    print("\n".join(predict_text))
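The two ideas in this step, greedy decoding and cleaning up after the end marker, can be tried without a GPU. In the sketch below a lookup table stands in for the trained decoder; the table, the ids and the function names are all made up for illustration.

```python
END_ID = 9  # hypothetical end-of-sentence id

# Toy stand-in for a trained decoder: maps the previous token to the next one.
NEXT_TOKEN = {1: 4, 4: 7, 7: END_ID, END_ID: 5, 5: 6}


def greedy_decode(start_id, max_steps):
    """Generate tokens one at a time, always feeding back the last prediction."""
    tokens = [start_id]
    for _ in range(max_steps):
        tokens.append(NEXT_TOKEN[tokens[-1]])
    return tokens


def mask_after_end(tokens, end_id=END_ID):
    """Overwrite every token from the first end marker onward with end_id."""
    seen_end = False
    masked = []
    for token in tokens:
        seen_end = seen_end or token == end_id
        masked.append(end_id if seen_end else token)
    return masked


raw = greedy_decode(start_id=1, max_steps=5)
print(raw)                  # → [1, 4, 7, 9, 5, 6]
print(mask_after_end(raw))  # → [1, 4, 7, 9, 9, 9]
```

The loop keeps generating for a fixed number of steps even after the end marker appears, so the tokens that follow it are meaningless. The masking step plays the same role as the `cumsum` and `masked_fill` lines above: once an end marker has been seen, everything after it is replaced, and those markers are then dropped when the ids are decoded back to text.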
Complete code
The complete script follows the command below. Save it as run.py, then run:
    python3 run.py
    import torch
    from transformers import BertTokenizer
    from lightseq.training import LSTransformer, LSCrossEntropyLayer, LSAdam


    def create_data():
        # Create the Hugging Face tokenizer
        tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
        vocab_size = tokenizer.vocab_size
        sep_id = tokenizer.encode(
            tokenizer.special_tokens_map["sep_token"], add_special_tokens=False
        )[0]

        # Map the source text to integer ids
        src_text = [
            "What is the fastest library in the world?",
            "You are so pretty!",
            "What do you love me for?",
            "The sparrow outside the window hovering on the telephone pole.",
        ]
        src_tokens = tokenizer.batch_encode_plus(
            src_text, padding=True, return_tensors="pt"
        )
        src_tokens = src_tokens["input_ids"].to(torch.device("cuda:0"))
        batch_size, src_seq_len = src_tokens.size(0), src_tokens.size(1)

        # Map the target text to integer ids
        trg_text = [
            "I guess it must be LightSeq, because ByteDance is the fastest.",
            "Thanks very much and you are pretty too.",
            "Love your beauty, smart, virtuous and kind.",
            "You said all this is very summery.",
        ]
        trg_tokens = tokenizer.batch_encode_plus(
            trg_text, padding=True, return_tensors="pt"
        )
        trg_tokens = trg_tokens["input_ids"].to(torch.device("cuda:0"))
        trg_seq_len = trg_tokens.size(1)

        # Shift the target text left by one token to get the expected decoder output
        target = trg_tokens.clone()[:, 1:]
        trg_tokens = trg_tokens[:, :-1]

        return (
            tokenizer,
            src_text,
            src_tokens,
            trg_text,
            trg_tokens,
            target,
            sep_id,
            vocab_size,
            batch_size,
            src_seq_len,
            trg_seq_len,
        )


    def create_model(vocab_size):
        transformer_config = LSTransformer.get_config(
            model="transformer-base",
            max_batch_tokens=2048,
            max_seq_len=512,
            vocab_size=vocab_size,
            padding_idx=0,
            num_encoder_layer=6,
            num_decoder_layer=6,
            fp16=True,
            local_rank=0,
        )
        model = LSTransformer(transformer_config)
        model.to(dtype=torch.half, device=torch.device("cuda:0"))
        return model


    def create_criterion():
        ce_config = LSCrossEntropyLayer.get_config(
            max_batch_tokens=2048,
            padding_idx=0,
            epsilon=0.0,
            fp16=True,
            local_rank=0,
        )
        loss_fn = LSCrossEntropyLayer(ce_config)
        loss_fn.to(dtype=torch.half, device=torch.device("cuda:0"))
        return loss_fn


    if __name__ == "__main__":
        (
            tokenizer,
            src_text,
            src_tokens,
            trg_text,
            trg_tokens,
            target,
            sep_id,
            vocab_size,
            batch_size,
            src_seq_len,
            trg_seq_len,
        ) = create_data()
        model = create_model(vocab_size)
        loss_fn = create_criterion()
        opt = LSAdam(model.parameters(), lr=1e-5)

        print("========================TRAIN========================")
        model.train()
        for epoch in range(2000):
            output = model(src_tokens, trg_tokens)
            loss, _ = loss_fn(output, target)
            if epoch % 200 == 0:
                print("epoch {:03d}: {:.3f}".format(epoch, loss.item()))
            loss.backward()
            opt.step()

        print("========================TEST========================")
        model.eval()
        # Get the encoder output and its padding mask
        encoder_out, encoder_padding_mask = model.encoder(src_tokens)
        # Use the first target token as the decoder's initial input and predict the rest
        predict_tokens = trg_tokens[:, :1]
        cache = {}
        for _ in range(trg_seq_len - 1):
            # Use the cache to speed up decoding
            output = model.decoder(
                predict_tokens[:, -1:], encoder_out, encoder_padding_mask, cache
            )
            # Predict the next token
            output = torch.reshape(torch.argmax(output, dim=-1), (batch_size, -1))
            # Append the predicted token to the tokens predicted so far
            predict_tokens = torch.cat([predict_tokens, output], dim=-1)

        # Overwrite everything after the end marker with the end marker
        mask = torch.cumsum(torch.eq(predict_tokens, sep_id).int(), dim=1)
        predict_tokens = predict_tokens.masked_fill(mask > 0, sep_id)
        # Convert the predicted ids back to text
        predict_text = tokenizer.batch_decode(predict_tokens, skip_special_tokens=True)
        print(">>>>> source text")
        print("\n".join(src_text))
        print(">>>>> target text")
        print("\n".join(trg_text))
        print(">>>>> predict text")
        print("\n".join(predict_text))
If everything runs smoothly, you will see the following output:
    ========================TRAIN========================
    TransformerEmbeddingLayer #0 bind weights and grads.
    TransformerEncoderLayer #0 bind weights and grads.
    TransformerEncoderLayer #1 bind weights and grads.
    TransformerEncoderLayer #2 bind weights and grads.
    TransformerEncoderLayer #3 bind weights and grads.
    TransformerEncoderLayer #4 bind weights and grads.
    TransformerEncoderLayer #5 bind weights and grads.
    TransformerEmbeddingLayer #1 bind weights and grads.
    TransformerDecoderLayer #0 bind weights and grads.
    Decoder layer #0 allocate encdec_kv memory
    TransformerDecoderLayer #1 bind weights and grads.
    TransformerDecoderLayer #2 bind weights and grads.
    TransformerDecoderLayer #3 bind weights and grads.
    TransformerDecoderLayer #4 bind weights and grads.
    TransformerDecoderLayer #5 bind weights and grads.
    epoch 000: 725.560
    epoch 200: 96.252
    epoch 400: 15.151
    epoch 600: 5.770
    epoch 800: 3.212
    epoch 1000: 1.748
    epoch 1200: 0.930
    epoch 1400: 0.457
    epoch 1600: 0.366
    epoch 1800: 0.299
    ========================TEST========================
    >>>>> source text
    What is the fastest library in the world?
    You are so pretty!
    What do you love me for?
    The sparrow outside the window hovering on the telephone pole.
    >>>>> target text
    I guess it must be LightSeq, because ByteDance is the fastest.
    Thanks very much and you are pretty too.
    Love your beauty, smart, virtuous and kind.
    You said all this is very summery.
    >>>>> predict text
    I guess it must be LightSeq, because ByteDance is the fastest.
    Thanks very much and you are pretty too.
    Love your beauty, smart, virtuous and kind.
    You said all this is very summery.
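As you can see, the final predicted text matches the reference target text exactly.
Of course, this example is very simple, with only four input/output sentence pairs. With a large dialogue dataset you could train a far more capable chatbot, and then who needs to worry about not having a girlfriend?
If you find LightSeq useful, don't forget to give it a star; it is the biggest support you can give us.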