Squeezing every last bit out of your compute resources is a required skill for every developer.
Ever since large models became the hot trend, GPUs have been in short supply. Many companies do not necessarily have enough of them in reserve, let alone individual developers. Is there a way to use the available compute more efficiently when training models? In a recent blog post, Sebastian Raschka introduced "gradient accumulation", a method that makes it possible to train with a larger effective batch size when GPU memory is limited, working around the hardware constraint. Before this, Sebastian Raschka had also shared an article on using multi-GPU training strategies to accelerate the fine-tuning of large language models, covering mechanisms such as model or tensor sharding, which distribute the model weights and the computation across different devices to get around GPU memory limits.

Finetuning BLOOM for Classification

Suppose we are interested in adopting a recently pretrained large language model for a downstream task such as text classification. We might then choose BLOOM, an open-source alternative to GPT-3, and in particular the BLOOM version with "only" 560 million parameters. It should fit into the RAM of a conventional GPU without any trouble (the free tier of Google Colab provides a GPU with 15 GB of RAM).

Once you start, however, you are likely to run into a problem: memory grows rapidly during training or fine-tuning. The only way to train this model is with a batch size of 1 (batch size=1).

The code for fine-tuning BLOOM on the target classification task with batch size=1 is shown below. You can also download the full code from the GitHub project page: https://github.com/rasbt/gradient-accumulation-blog/blob/main/src/1_batchsize-1.py

You can copy and paste this code directly into Google Colab, but you will also have to drop the accompanying local_dataset_utilities.py file, from which some dataset utilities are imported, into the same folder.
# pip install torch lightning matplotlib pandas torchmetrics watermark transformers datasets -U

import os
import os.path as op
import time

from datasets import load_dataset
from lightning import Fabric
import torch
from torch.utils.data import DataLoader
import torchmetrics
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from watermark import watermark

from local_dataset_utilities import download_dataset, load_dataset_into_to_dataframe, partition_dataset
from local_dataset_utilities import IMDBDataset


def tokenize_text(batch):
    return tokenizer(batch["text"], truncation=True, padding=True, max_length=1024)


def train(num_epochs, model, optimizer, train_loader, val_loader, fabric):

    for epoch in range(num_epochs):
        train_acc = torchmetrics.Accuracy(
            task="multiclass", num_classes=2).to(fabric.device)

        for batch_idx, batch in enumerate(train_loader):
            model.train()

            ### FORWARD AND BACK PROP
            outputs = model(
                batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["label"]
            )

            fabric.backward(outputs["loss"])

            ### UPDATE MODEL PARAMETERS
            optimizer.step()
            optimizer.zero_grad()

            ### LOGGING
            if not batch_idx % 300:
                print(f"Epoch: {epoch+1:04d}/{num_epochs:04d} "
                      f"| Batch {batch_idx:04d}/{len(train_loader):04d} "
                      f"| Loss: {outputs['loss']:.4f}")

            model.eval()
            with torch.no_grad():
                predicted_labels = torch.argmax(outputs["logits"], 1)
                train_acc.update(predicted_labels, batch["label"])

        ### MORE LOGGING
        model.eval()
        with torch.no_grad():
            val_acc = torchmetrics.Accuracy(task="multiclass", num_classes=2).to(fabric.device)
            for batch in val_loader:
                outputs = model(
                    batch["input_ids"],
                    attention_mask=batch["attention_mask"],
                    labels=batch["label"]
                )
                predicted_labels = torch.argmax(outputs["logits"], 1)
                val_acc.update(predicted_labels, batch["label"])

            print(f"Epoch: {epoch+1:04d}/{num_epochs:04d} "
                  f"| Train acc.: {train_acc.compute()*100:.2f}% "
                  f"| Val acc.: {val_acc.compute()*100:.2f}%")
            train_acc.reset(), val_acc.reset()


if __name__ == "__main__":

    print(watermark(packages="torch,lightning,transformers", python=True))
    print("Torch CUDA available?", torch.cuda.is_available())
    device = "cuda" if torch.cuda.is_available() else "cpu"

    torch.manual_seed(123)
    # torch.use_deterministic_algorithms(True)

    ##########################
    ### 1 Loading the Dataset
    ##########################
    download_dataset()
    df = load_dataset_into_to_dataframe()
    if not (op.exists("train.csv") and op.exists("val.csv") and op.exists("test.csv")):
        partition_dataset(df)

    imdb_dataset = load_dataset(
        "csv",
        data_files={
            "train": "train.csv",
            "validation": "val.csv",
            "test": "test.csv",
        },
    )

    #########################################
    ### 2 Tokenization and Numericalization
    #########################################

    tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m", max_length=1024)
    print("Tokenizer input max length:", tokenizer.model_max_length, flush=True)
    print("Tokenizer vocabulary size:", tokenizer.vocab_size, flush=True)

    print("Tokenizing ...", flush=True)
    imdb_tokenized = imdb_dataset.map(tokenize_text, batched=True, batch_size=None)
    del imdb_dataset
    imdb_tokenized.set_format("torch", columns=["input_ids", "attention_mask", "label"])
    os.environ["TOKENIZERS_PARALLELISM"] = "false"

    #########################################
    ### 3 Set Up DataLoaders
    #########################################

    train_dataset = IMDBDataset(imdb_tokenized, partition_key="train")
    val_dataset = IMDBDataset(imdb_tokenized, partition_key="validation")
    test_dataset = IMDBDataset(imdb_tokenized, partition_key="test")

    train_loader = DataLoader(
        dataset=train_dataset,
        batch_size=1,
        shuffle=True,
        num_workers=4,
        drop_last=True,
    )

    val_loader = DataLoader(
        dataset=val_dataset,
        batch_size=1,
        num_workers=4,
        drop_last=True,
    )

    test_loader = DataLoader(
        dataset=test_dataset,
        batch_size=1,
        num_workers=2,
        drop_last=True,
    )

    #########################################
    ### 4 Initializing the Model
    #########################################

    fabric = Fabric(accelerator="cuda", devices=1, precision="16-mixed")
    fabric.launch()

    model = AutoModelForSequenceClassification.from_pretrained(
        "bigscience/bloom-560m", num_labels=2)

    optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

    model, optimizer = fabric.setup(model, optimizer)
    train_loader, val_loader, test_loader = fabric.setup_dataloaders(
        train_loader, val_loader, test_loader)

    #########################################
    ### 5 Finetuning
    #########################################

    start = time.time()
    train(
        num_epochs=1,
        model=model,
        optimizer=optimizer,
        train_loader=train_loader,
        val_loader=val_loader,
        fabric=fabric,
    )

    end = time.time()
    elapsed = end - start
    print(f"Time elapsed {elapsed/60:.2f} min")

    with torch.no_grad():
        model.eval()
        test_acc = torchmetrics.Accuracy(task="multiclass", num_classes=2).to(fabric.device)
        for batch in test_loader:
            outputs = model(
                batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["label"]
            )
            predicted_labels = torch.argmax(outputs["logits"], 1)
            test_acc.update(predicted_labels, batch["label"])
        print(f"Test accuracy {test_acc.compute()*100:.2f}%")
The author uses Lightning Fabric because it gives developers the flexibility to change the number of GPUs and the multi-GPU training strategy when running this code on different hardware. It also makes it possible to enable mixed-precision training simply by adjusting the precision flag (a short sketch of how these Fabric arguments might be varied follows the list below). In this case, mixed-precision training can speed up training roughly threefold and reduce memory requirements by about 25%. The main code shown above is executed inside the main guard (the if __name__ == "__main__" context), which is the recommended pattern when running multi-GPU training with PyTorch, even though only a single GPU is used here. The following three code sections inside if __name__ == "__main__" then take care of data loading:
# 1 Loading the Dataset
# 2 Tokenization and Numericalization
# 3 Set Up DataLoaders
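As promised above, here is a minimal sketch, not taken from the article, of how the Fabric constructor arguments already used in the script might be varied, for example to train on several GPUs with a different sharding strategy or numeric precision. The specific values are only illustrative.

from lightning import Fabric

# Single GPU with 16-bit mixed precision (what the script above uses).
fabric = Fabric(accelerator="cuda", devices=1, precision="16-mixed")

# Hypothetical variations: more devices, a different multi-GPU strategy,
# or bfloat16 mixed precision -- only the constructor arguments change.
# fabric = Fabric(accelerator="cuda", devices=4, strategy="ddp", precision="16-mixed")
# fabric = Fabric(accelerator="cuda", devices=4, strategy="fsdp", precision="bf16-mixed")

fabric.launch()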
Section 4 initializes the model (Initializing the Model), and Section 5, Finetuning, then calls the train function, which is where things start to get interesting. The train(...) function implements a standard PyTorch training loop; its annotated core is the inner batch loop of train() in the code above. The problem with a batch size of 1 (batch size=1) is that the gradient updates become very noisy and erratic, as you can see from the fluctuating training loss and the poor test-set performance when training the model:
...
torch       : 2.0.0
lightning   : 2.0.0
transformers: 4.27.2

Torch CUDA available? True
...
Epoch: 0001/0001 | Batch 23700/35000 | Loss: 0.0969
Epoch: 0001/0001 | Batch 24000/35000 | Loss: 1.9902
Epoch: 0001/0001 | Batch 24300/35000 | Loss: 0.0395
Epoch: 0001/0001 | Batch 24600/35000 | Loss: 0.2546
Epoch: 0001/0001 | Batch 24900/35000 | Loss: 0.1128
Epoch: 0001/0001 | Batch 25200/35000 | Loss: 0.2661
Epoch: 0001/0001 | Batch 25500/35000 | Loss: 0.0044
Epoch: 0001/0001 | Batch 25800/35000 | Loss: 0.0067
Epoch: 0001/0001 | Batch 26100/35000 | Loss: 0.0468
Epoch: 0001/0001 | Batch 26400/35000 | Loss: 1.7139
Epoch: 0001/0001 | Batch 26700/35000 | Loss: 0.9570
Epoch: 0001/0001 | Batch 27000/35000 | Loss: 0.1857
Epoch: 0001/0001 | Batch 27300/35000 | Loss: 0.0090
Epoch: 0001/0001 | Batch 27600/35000 | Loss: 0.9790
Epoch: 0001/0001 | Batch 27900/35000 | Loss: 0.0503
Epoch: 0001/0001 | Batch 28200/35000 | Loss: 0.2625
Epoch: 0001/0001 | Batch 28500/35000 | Loss: 0.1010
Epoch: 0001/0001 | Batch 28800/35000 | Loss: 0.0035
Epoch: 0001/0001 | Batch 29100/35000 | Loss: 0.0009
Epoch: 0001/0001 | Batch 29400/35000 | Loss: 0.0234
Epoch: 0001/0001 | Batch 29700/35000 | Loss: 0.8394
Epoch: 0001/0001 | Batch 30000/35000 | Loss: 0.9497
Epoch: 0001/0001 | Batch 30300/35000 | Loss: 0.1437
Epoch: 0001/0001 | Batch 30600/35000 | Loss: 0.1317
Epoch: 0001/0001 | Batch 30900/35000 | Loss: 0.0112
Epoch: 0001/0001 | Batch 31200/35000 | Loss: 0.0073
Epoch: 0001/0001 | Batch 31500/35000 | Loss: 0.7393
Epoch: 0001/0001 | Batch 31800/35000 | Loss: 0.0512
Epoch: 0001/0001 | Batch 32100/35000 | Loss: 0.1337
Epoch: 0001/0001 | Batch 32400/35000 | Loss: 1.1875
Epoch: 0001/0001 | Batch 32700/35000 | Loss: 0.2727
Epoch: 0001/0001 | Batch 33000/35000 | Loss: 0.1545
Epoch: 0001/0001 | Batch 33300/35000 | Loss: 0.0022
Epoch: 0001/0001 | Batch 33600/35000 | Loss: 0.2681
Epoch: 0001/0001 | Batch 33900/35000 | Loss: 0.2467
Epoch: 0001/0001 | Batch 34200/35000 | Loss: 0.0620
Epoch: 0001/0001 | Batch 34500/35000 | Loss: 2.5039
Epoch: 0001/0001 | Batch 34800/35000 | Loss: 0.0131
Epoch: 0001/0001 | Train acc.: 75.11% | Val acc.: 78.62%
Time elapsed 69.97 min
Test accuracy 78.53%
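Gradient accumulation, the technique this article is about, addresses exactly this problem: instead of updating the weights after every single sample, the loss of each batch-size-1 step is scaled down and the gradients are summed over several steps before one optimizer step is taken, mimicking a larger batch size without the extra memory. The fragment below is only a minimal sketch of that pattern, reusing the model, train_loader, optimizer, and fabric objects from the script above and assuming a hypothetical accumulation_steps setting; it is not the article's own implementation.

accumulation_steps = 16  # hypothetical: effective batch size of 16 with batch_size=1

for batch_idx, batch in enumerate(train_loader):
    model.train()

    outputs = model(
        batch["input_ids"],
        attention_mask=batch["attention_mask"],
        labels=batch["label"]
    )

    # Scale the loss so the accumulated gradient matches the average over
    # the virtual (larger) batch, then accumulate without stepping.
    fabric.backward(outputs["loss"] / accumulation_steps)

    # Only update the parameters every `accumulation_steps` iterations.
    if (batch_idx + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()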