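Steps
- Prepare the data and model
- Use the profiler to record execution events
- Run the profiler
- Use TensorBoard to view results and analyze model performance
- Improve performance with the help of the profiler
- Analyze performance with other advanced features
- Additional practice: profiling PyTorch on AMD GPUs
1. Prepare the data and model
First, import all necessary libraries: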
import torch
import torch.nn
import torch.optim
import torch.profiler
import torch.utils.data
import torchvision.datasets
import torchvision.models
import torchvision.transforms as T
Then prepare the input data. This tutorial uses the CIFAR10 dataset. Transform it to the desired format and use DataLoader to load each batch.
transform = T.Compose(
    [T.Resize(224),
     T.ToTensor(),
     T.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
train_set = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)
Next, create the ResNet model, loss function, and optimizer objects. To run on GPU, move the model and loss to the GPU device.
device = torch.device("cuda:0")
model = torchvision.models.resnet18(weights='IMAGENET1K_V1').cuda(device)
criterion = torch.nn.CrossEntropyLoss().cuda(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
model.train()
Define the training step for each batch of input data.
def train(data):
    # Move one batch to the GPU, then run forward, backward, and the optimizer update.
    inputs, labels = data[0].to(device=device), data[1].to(device=device)
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
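2. Use the profiler to record execution events
The profiler is enabled through a context manager and accepts several parameters. Some of the most useful are:
- schedule - a callable that takes the step (int) as its single parameter and returns the profiler action to perform at each step.
In this example, with wait=1, warmup=1, active=3, repeat=1, the profiler skips the first step/iteration, starts warming up on the second, and records the following three iterations, after which the trace becomes available and on_trace_ready (if set) is called. In total, the cycle repeats once. Each cycle is called a "span" in the TensorBoard plugin.
During the wait steps, the profiler is disabled. During the warmup steps, the profiler starts tracing but the results are discarded. This reduces profiling overhead: the overhead at the beginning of profiling is high and can easily skew the profiling result. During the active steps, the profiler works and records events.
- on_trace_ready - a callable that is called at the end of each cycle. In this example we use torch.profiler.tensorboard_trace_handler to generate result files for TensorBoard. After profiling, the result files are saved in the ./log/resnet18 directory. Specify this directory as the logdir parameter to analyze the profile in TensorBoard.
- record_shapes - whether to record the shapes of the operator inputs.
- profile_memory - track tensor memory allocation/deallocation. Note that for old versions of PyTorch (before 1.10), if you experience long profiling times, disable it or upgrade to a newer version.
- with_stack - record source information (file and line number) for the ops. If TensorBoard is launched in VS Code (reference), clicking a stack frame navigates to the specific line of code.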
with torch.profiler.profile(
        schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1),
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/resnet18'),
        record_shapes=True,
        profile_memory=True,
        with_stack=True
) as prof:
    for step, batch_data in enumerate(train_loader):
        prof.step()  # Need to call this at each step to notify profiler of steps' boundary.
        if step >= 1 + 1 + 3:
            break
        train(batch_data)
Alternatively, the following non-context manager start/stop is supported as well.
prof = torch.profiler.profile(
        schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1),
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/resnet18'),
        record_shapes=True,
        with_stack=True)
prof.start()
for step, batch_data in enumerate(train_loader):
    prof.step()
    if step >= 1 + 1 + 3:
        break
    train(batch_data)
prof.stop()
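To make the wait/warmup/active cycle concrete, here is a minimal pure-Python sketch of the step-to-action mapping. It is not the library's code: the helper name schedule_action is hypothetical, and the action names are assumed to mirror those of torch.profiler.ProfilerAction.

```python
def schedule_action(step, wait=1, warmup=1, active=3, repeat=1):
    """Return the name of the profiler action for a zero-based step index.

    Hypothetical helper that imitates the behaviour described for
    torch.profiler.schedule; it does not call into PyTorch.
    """
    cycle_len = wait + warmup + active
    # After `repeat` full cycles the profiler stays idle (repeat=0 means cycle forever).
    if repeat > 0 and step >= repeat * cycle_len:
        return "NONE"
    pos = step % cycle_len
    if pos < wait:
        return "NONE"              # profiler disabled
    if pos < wait + warmup:
        return "WARMUP"            # tracing, results discarded
    if pos < cycle_len - 1:
        return "RECORD"            # tracing, results kept
    return "RECORD_AND_SAVE"       # last active step: the trace is handed over


actions = [schedule_action(step) for step in range(7)]
for step, action in enumerate(actions):
    print(step, action)
```

With the defaults above, step 0 is skipped, step 1 warms up, steps 2 to 4 are recorded, and every later step is idle, which is why the training loops stop after 1 + 1 + 3 steps.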
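3. Run the profiler
Run the above code. The profiling result will be saved under the ./log/resnet18 directory.
4. Use TensorBoard to view results and analyze model performance
Note
TensorBoard plugin support has been deprecated, so some of these functions may not work as previously. Please take a look at the replacement, HTA.
Install the PyTorch Profiler TensorBoard plugin.
pip install torch_tb_profiler
Launch TensorBoard.
tensorboard --logdir=./log
Open the TensorBoard profile URL in Google Chrome or Microsoft Edge (Safari is not supported).
http://localhost:6006/#pytorch_profiler
You should see the Profiler plugin page as shown below.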
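- Overview
The overview shows a high-level summary of model performance.
The "GPU Summary" panel shows the GPU configuration, GPU usage, and Tensor Cores usage. In this example, GPU utilization is low. Details of these metrics are here.
The "Step Time Breakdown" shows the distribution of time spent in each step across different execution categories. In this example, you can see that the DataLoader overhead is significant.
The "Performance Recommendation" at the bottom uses the profiling data to automatically highlight likely bottlenecks and gives you actionable optimization suggestions.
You can change the view page in the "Views" dropdown list on the left.
- Operator view
The operator view displays the performance of every PyTorch operator executed on the host or the device.
The "Self" duration does not include the time of its child operators. The "Total" duration includes the time of its child operators.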
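The relation between "Self" and "Total" duration can be sketched in a few lines. This is only an illustration: the Op class and the timings are made up, and the plugin derives these numbers from the recorded trace rather than from a structure like this.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    name: str
    total_us: float                      # wall time including child operators
    children: list = field(default_factory=list)

    def self_us(self):
        # Self time = total time minus the time spent in direct children.
        return self.total_us - sum(child.total_us for child in self.children)


# Hypothetical timings, for illustration only.
empty = Op("aten::empty", 4.0)
fill = Op("aten::fill_", 6.0)
ones = Op("aten::ones", 15.0, [empty, fill])

print(ones.name, "total:", ones.total_us, "self:", ones.self_us())
```

An operator with a large "Total" but a small "Self" duration spends most of its time in the operators it calls, so the children are the better place to look.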
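- View call stack
Click "View Callstack" of an operator to show the operators with the same name but different call stacks. Then click "View Callstack" in this sub-table to show the call stack frames.
If TensorBoard is launched inside VS Code (launch guide), clicking a call stack frame navigates to the specific line of code.
- Kernel view
The GPU kernel view shows all the time that kernels spent on the GPU.
Tensor Cores Used: whether this kernel uses Tensor Cores.
Mean Blocks per SM: Blocks per SM = blocks of this kernel / SM count of this GPU. If this number is less than 1, the GPU multiprocessors are not fully utilized. "Mean Blocks per SM" is the weighted average over all runs of this kernel name, using each run's duration as the weight.
Mean Est. Achieved Occupancy: Est. Achieved Occupancy is defined in this column's tooltip. For most cases, such as memory-bandwidth-bound kernels, higher is better. "Mean Est. Achieved Occupancy" is the weighted average over all runs of this kernel name, using each run's duration as the weight.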
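The duration-weighted average used for "Mean Blocks per SM" can be written out as a small calculation. The function name, the launch data, and the SM count below are assumptions chosen for illustration, not values taken from the plugin.

```python
def mean_blocks_per_sm(runs, sm_count):
    """Duration-weighted average of blocks-per-SM over all runs of one kernel.

    `runs` is a list of (num_blocks, duration_us) pairs.
    """
    total_duration = sum(duration for _, duration in runs)
    weighted_sum = sum((blocks / sm_count) * duration for blocks, duration in runs)
    return weighted_sum / total_duration


# Hypothetical launches of one kernel on a GPU assumed to have 80 SMs.
runs = [(40, 10.0), (160, 30.0)]
value = mean_blocks_per_sm(runs, sm_count=80)
print(value)
```

Here the short launch covers only half of the SMs (0.5 blocks per SM), but the longer launch dominates the weighting, so the mean ends up above 1.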
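- Trace view
The trace view shows a timeline of the profiled operators and GPU kernels. You can select an item to see its details.
You can move the graph and zoom in/out with the right-side toolbar. The keyboard can also be used to zoom and move around inside the timeline. The 'w' and 's' keys zoom in and out centered on the mouse, and the 'a' and 'd' keys move the timeline left and right. You can press these keys multiple times until you see a readable representation.
If a backward operator's "Incoming Flow" field has the value "forward correspond to backward", you can click the text to get its launching forward operator.
In this example, we can see that the event prefixed with enumerate(DataLoader) costs a lot of time. During most of this period the GPU is idle, because this function loads and transforms data on the host side, and the GPU resources are wasted in the meantime.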
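5. Improve performance with the help of the profiler
At the bottom of the "Overview" page, the suggestion in "Performance Recommendation" hints that the bottleneck is DataLoader. The PyTorch DataLoader uses a single process by default. You can enable multi-process data loading by setting the num_workers parameter. More details are here.
In this example, we follow the "Performance Recommendation" and set num_workers as below, pass a different name to tensorboard_trace_handler, and run it again.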
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True, num_workers=4)
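Then select the most recently profiled run in the "Runs" dropdown list on the left.
From the above view, we can see that the step time is reduced to about 76ms compared with the previous run, and the reduction in DataLoader time contributes most of it.
From the above view, we can see that the runtime of enumerate(DataLoader) is reduced and the GPU utilization is increased.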
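6. Analyze performance with other advanced features
- Memory view
To profile memory, profile_memory must be set to True in the arguments of torch.profiler.profile.
You can try it by using the existing example on Azure:
pip install azure-storage-blob
tensorboard --logdir=https://torchtbprofiler.blob.core.windows.net/torchtbprofiler/demo/memory_demo_1_10
The profiler records all memory allocation/release events and the allocator's internal state during profiling. The memory view consists of the following three components.
From top to bottom, the components are the memory curve graph, the memory events table, and the memory statistics table.
The memory type can be selected in the "Device" selection box. For example, "GPU0" means the following table only shows each operator's memory usage on GPU 0, not including CPU or other GPUs.
The memory curve shows the trend of memory consumption. The "Allocated" curve shows the total memory actually in use, for example by tensors. In PyTorch, the CUDA allocator and some other allocators employ caching mechanisms. The "Reserved" curve shows the total memory reserved by the allocator. You can left-click and drag on the graph to select events in the desired range:
After selection, the three components are updated for the restricted time range, so that you can gain more information about it. By repeating this process, you can zoom into very fine-grained detail. Right-clicking on the graph resets it to the initial state.
In the memory events table, the allocation and release events are paired into one entry. The "operator" column shows the immediate ATen operator that causes the allocation. Note that in PyTorch, ATen operators commonly use aten::empty to allocate memory. For example, aten::ones is implemented as aten::empty followed by an aten::fill_. Displaying only the operator name aten::empty is of little help, so in this special case it is shown as aten::ones (aten::empty). The "Allocation Time", "Release Time", and "Duration" columns may be missing data if the event occurs outside of the time range.
In the memory statistics table, the "Size Increase" column sums up all allocation sizes and subtracts all memory release sizes, that is, the net increase of memory usage after this operator. The "Self Size Increase" column is similar to "Size Increase", but it does not count allocations from child operators. With regard to ATen operators' implementation details, some operators might call other operators, so memory allocations can happen at any level of the call stack. That is, "Self Size Increase" only counts the memory usage increase at the current level of the call stack. Finally, the "Allocation Size" column sums up all allocations without considering memory releases.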
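The three statistics columns can be illustrated with a toy event list. This is a sketch under assumptions: the event format (a call stack plus a signed byte count) and the function name memory_stats are hypothetical, and the plugin's real bookkeeping may differ in detail.

```python
from collections import defaultdict

# Hypothetical allocator events: (operator call stack, outermost first; bytes).
# Positive byte counts are allocations, negative ones are releases.
events = [
    (("aten::ones", "aten::empty"), 4096),   # aten::empty allocates inside aten::ones
    (("aten::ones",), 512),                  # temporary allocated by aten::ones itself
    (("aten::ones",), -512),                 # ...and released again
    (("aten::add",), 1024),
]


def memory_stats(events):
    stats = defaultdict(
        lambda: {"size_increase": 0, "self_size_increase": 0, "allocation_size": 0}
    )
    for stack, nbytes in events:
        for depth, op in enumerate(stack):
            row = stats[op]
            row["size_increase"] += nbytes            # allocations minus releases
            if nbytes > 0:
                row["allocation_size"] += nbytes      # allocations only
            if depth == len(stack) - 1:
                row["self_size_increase"] += nbytes   # this stack level only
    return dict(stats)


table = memory_stats(events)
for op, row in table.items():
    print(op, row)
```

In this toy data aten::ones has a net "Size Increase" of 4096 bytes but a "Self Size Increase" of 0, because the surviving allocation was made by its child aten::empty.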
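- Distributed view
The plugin now supports profiling distributed DDP with NCCL/GLOO as the backend.
You can try it by using the existing example on Azure:
pip install azure-storage-blob
tensorboard --logdir=https://torchtbprofiler.blob.core.windows.net/torchtbprofiler/demo/distributed_bert
The "Computation/Communication Overview" shows the computation/communication ratio and their degree of overlap. From this view, you can spot load-balancing issues among workers. For example, if the computation + overlapping time of one worker is much larger than that of the others, there may be a load-balancing problem, or this worker may be a straggler.
The "Synchronizing/Communication Overview" shows the efficiency of communication. "Data Transfer Time" is the time spent actually exchanging data. "Synchronizing Time" is the time spent waiting for and synchronizing with other workers.
If one worker's "Synchronizing Time" is much shorter than that of the other workers, this worker may be a straggler that has more computation workload than the others.
The "Communication Operations Stats" summarizes the detailed statistics of all communication ops in each worker.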
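7. Additional practice: profiling PyTorch on AMD GPUs
The AMD ROCm platform is an open-source software stack designed for GPU computation, consisting of drivers, development tools, and APIs. We can run the above-mentioned steps on AMD GPUs. In this section, we will use Docker to install the ROCm base development image before installing PyTorch.
For the purpose of example, let's create a directory called profiler_tutorial, and save the code in Step 1 as test_cifar10.py in this directory.
mkdir ~/profiler_tutorial
cd ~/profiler_tutorial
vi test_cifar10.py
At the time of this writing, the stable (2.1.1) Linux version of PyTorch on the ROCm platform is ROCm 5.6.
- Get a base Docker image with the correct user-space ROCm version installed from Docker Hub.
It is rocm/dev-ubuntu-20.04:5.6.
- Start the ROCm base Docker container:
docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size 8G -v ~/profiler_tutorial:/profiler_tutorial rocm/dev-ubuntu-20.04:5.6
- Inside the container, install any dependencies needed for installing the wheels package.
sudo apt update
sudo apt install libjpeg-dev python3-dev -y
pip3 install wheel setuptools
sudo apt install python-is-python3
- Install the wheels:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.6
- Install torch_tb_profiler, and then run the Python file test_cifar10.py:
pip install torch_tb_profiler
cd /profiler_tutorial
python test_cifar10.py
Now we have all the data needed to view in TensorBoard:
tensorboard --logdir=./log
Choose different views as described in Step 4. For example, below is the Operator view:
At the time this section was written, the Trace view does not work and displays nothing. You can work around this by typing chrome://tracing in your Chrome browser.
- Copy the trace.json file under the ~/profiler_tutorial/log/resnet18 directory to your Windows machine.
You may need to use scp to copy the file if it is located in a remote location.
- Click the Load button to load the trace JSON file from the chrome://tracing page in the browser.
As mentioned previously, you can move the graph and zoom in and out. You can also use the keyboard to zoom and move around inside the timeline. The w and s keys zoom in and out centered on the mouse, and the a and d keys move the timeline left and right. You can press these keys multiple times until you see a readable representation.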
Learn more
Take a look at the following documents to continue your learning, and feel free to open an issue here.
Hyperparameter tuning with Ray Tune
Original:
pytorch.org/tutorials/beginner/hyperparameter_tuning_tutorial.html
Translator: 飞龙
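Hyperparameter tuning can make the difference between an average model and a highly accurate one. Often simple things like choosing a different learning rate or changing a network layer size can have a dramatic impact on your model performance.
Fortunately, there are tools that help with finding the best combination of parameters. Ray Tune is an industry-standard tool for distributed hyperparameter tuning. Ray Tune includes the latest hyperparameter search algorithms, integrates with TensorBoard and other analysis libraries, and natively supports distributed training through Ray's distributed machine learning engine.
In this tutorial, we will show you how to integrate Ray Tune into your PyTorch training workflow. We will extend this tutorial from the PyTorch documentation for training a CIFAR10 image classifier.
As you will see, we only need to add some slight modifications. In particular, we need to
- wrap data loading and training in functions,
- make some network parameters configurable,
- add checkpointing (optional),
- and define the search space for the model tuning
To run this tutorial, please make sure the following packages are installed:
- ray[tune]: distributed hyperparameter tuning library
- torchvision: for the data transformers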
Setup / Imports
Let's start with the imports:
from functools import partial
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import random_split
import torchvision
import torchvision.transforms as transforms
from ray import tune
from ray.air import Checkpoint, session
from ray.tune.schedulers import ASHAScheduler
Most of the imports are needed for building the PyTorch model. Only the last three imports are for Ray Tune.
Data loaders
We wrap the data loaders in their own function and pass a global data directory. This way we can share a data directory between different trials.
def load_data(data_dir="./data"):
    transform = transforms.Compose(
        [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
    )
    trainset = torchvision.datasets.CIFAR10(
        root=data_dir, train=True, download=True, transform=transform
    )
    testset = torchvision.datasets.CIFAR10(
        root=data_dir, train=False, download=True, transform=transform
    )
    return trainset, testset
Configurable neural network
We can only tune those parameters that are configurable. In this example, we can specify the layer sizes of the fully connected layers:
class Net(nn.Module):
    def __init__(self, l1=120, l2=84):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, l1)
        self.fc2 = nn.Linear(l1, l2)
        self.fc3 = nn.Linear(l2, 10)
    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)  # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
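The train function
Now it gets interesting, because we introduce some changes to the example from the PyTorch documentation.
We wrap the training script in a function train_cifar(config, data_dir=None). The config parameter receives the hyperparameters we would like to train with. The data_dir specifies the directory where we load and store the data, so that multiple runs can share the same data source. If a checkpoint is provided, we also load the model and optimizer state at the start of the run. Further down in this tutorial you will find information on how to save the checkpoint and what it is used for.
net = Net(config["l1"], config["l2"])
checkpoint = session.get_checkpoint()
if checkpoint:
    checkpoint_state = checkpoint.to_dict()
    start_epoch = checkpoint_state["epoch"]
    net.load_state_dict(checkpoint_state["net_state_dict"])
    optimizer.load_state_dict(checkpoint_state["optimizer_state_dict"])
else:
    start_epoch = 0
The learning rate of the optimizer is made configurable, too:
optimizer = optim.SGD(net.parameters(), lr=config["lr"], momentum=0.9)
We also split the training data into a training and a validation subset. We thus train on 80% of the data and calculate the validation loss on the remaining 20%. The batch sizes with which we iterate through the training and test sets are configurable as well.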
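Adding (multi) GPU support with DataParallel
Image classification benefits largely from GPUs. Luckily, we can continue to use PyTorch's abstractions in Ray Tune. Thus, we can wrap our model in nn.DataParallel to support data-parallel training on multiple GPUs:
device = "cpu"
if torch.cuda.is_available():
    device = "cuda:0"
    if torch.cuda.device_count() > 1:
        net = nn.DataParallel(net)
net.to(device)
By using a device variable we make sure that training also works when no GPUs are available. PyTorch requires us to send our data to the GPU memory explicitly, like this:
for i, data in enumerate(trainloader, 0):
    inputs, labels = data
    inputs, labels = inputs.to(device), labels.to(device)
The code now supports training on CPUs, on a single GPU, and on multiple GPUs. Notably, Ray also supports fractional GPUs, so we can share GPUs among trials, as long as the model still fits in the GPU memory. We'll come back to that later.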
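Communicating with Ray Tune
The most interesting part is the communication with Ray Tune:
checkpoint_data = {
    "epoch": epoch,
    "net_state_dict": net.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
}
checkpoint = Checkpoint.from_dict(checkpoint_data)
session.report(
    {"loss": val_loss / val_steps, "accuracy": correct / total},
    checkpoint=checkpoint,
)
Here we first save a checkpoint and then report some metrics back to Ray Tune. Specifically, we send the validation loss and accuracy back to Ray Tune. Ray Tune can then use these metrics to decide which hyperparameter configuration leads to the best results. These metrics can also be used to stop badly performing trials early, to avoid wasting resources on them.
Saving the checkpoint is optional, but it is necessary if we want to use advanced schedulers such as Population Based Training. Also, by saving the checkpoint we can later load the trained models and validate them on a test set. Lastly, saving checkpoints is useful for fault tolerance, and it allows us to interrupt training and continue training later.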
Full training function
The full code example looks like this:
def train_cifar(config, data_dir=None):
    net = Net(config["l1"], config["l2"])
    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda:0"
        if torch.cuda.device_count() > 1:
            net = nn.DataParallel(net)
    net.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=config["lr"], momentum=0.9)
    checkpoint = session.get_checkpoint()
    if checkpoint:
        checkpoint_state = checkpoint.to_dict()
        start_epoch = checkpoint_state["epoch"]
        net.load_state_dict(checkpoint_state["net_state_dict"])
        optimizer.load_state_dict(checkpoint_state["optimizer_state_dict"])
    else:
        start_epoch = 0
    trainset, testset = load_data(data_dir)
    test_abs = int(len(trainset) * 0.8)
    train_subset, val_subset = random_split(
        trainset, [test_abs, len(trainset) - test_abs]
    )
    trainloader = torch.utils.data.DataLoader(
        train_subset, batch_size=int(config["batch_size"]), shuffle=True, num_workers=8
    )
    valloader = torch.utils.data.DataLoader(
        val_subset, batch_size=int(config["batch_size"]), shuffle=True, num_workers=8
    )
    for epoch in range(start_epoch, 10):  # loop over the dataset multiple times
        running_loss = 0.0
        epoch_steps = 0
        for i, data in enumerate(trainloader, 0):
            # get the inputs; data is a list of [inputs, labels]
            inputs, labels = data
            inputs, labels = inputs.to(device), labels.to(device)
            # zero the parameter gradients
            optimizer.zero_grad()
            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            # print statistics
            running_loss += loss.item()
            epoch_steps += 1
            if i % 2000 == 1999:  # print every 2000 mini-batches
                print(
                    "[%d, %5d] loss: %.3f"
                    % (epoch + 1, i + 1, running_loss / epoch_steps)
                )
                running_loss = 0.0
        # Validation loss
        val_loss = 0.0
        val_steps = 0
        total = 0
        correct = 0
        for i, data in enumerate(valloader, 0):
            with torch.no_grad():
                inputs, labels = data
                inputs, labels = inputs.to(device), labels.to(device)
                outputs = net(inputs)
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()
                loss = criterion(outputs, labels)
                val_loss += loss.cpu().numpy()
                val_steps += 1
        checkpoint_data = {
            "epoch": epoch,
            "net_state_dict": net.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
        }
        checkpoint = Checkpoint.from_dict(checkpoint_data)
        session.report(
            {"loss": val_loss / val_steps, "accuracy": correct / total},
            checkpoint=checkpoint,
        )
    print("Finished Training")
As you can see, most of the code is adapted directly from the original example.
Test set accuracy
Commonly the performance of a machine learning model is tested on a hold-out test set with data that has not been used for training the model. We also wrap this in a function:
def test_accuracy(net, device="cpu"):
    trainset, testset = load_data()
    testloader = torch.utils.data.DataLoader(
        testset, batch_size=4, shuffle=False, num_workers=2
    )
    correct = 0
    total = 0
    with torch.no_grad():
        for data in testloader:
            images, labels = data
            images, labels = images.to(device), labels.to(device)
            outputs = net(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    return correct / total
The function also expects a device parameter, so we can do the test set validation on a GPU.
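Configuring the search space
Lastly, we need to define Ray Tune's search space. Here is an example:
config = {
    "l1": tune.choice([2 ** i for i in range(9)]),
    "l2": tune.choice([2 ** i for i in range(9)]),
    "lr": tune.loguniform(1e-4, 1e-1),
    "batch_size": tune.choice([2, 4, 8, 16])
}
tune.choice() accepts a list of values that are uniformly sampled from. In this example, the l1 and l2 parameters are powers of 2 between 1 and 256, so either 1, 2, 4, 8, 16, 32, 64, 128, or 256 (range(9) starts at 2 ** 0). The lr (learning rate) is sampled log-uniformly between 0.0001 and 0.1, which is what tune.loguniform does. Lastly, the batch size is a choice between 2, 4, 8, and 16.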
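As a rough illustration of what sampling from such a search space means, the sketch below draws configurations using only the standard library. It is a stand-in rather than Ray Tune's implementation: sample_config is a hypothetical helper, and log-uniform sampling is assumed to mean "uniform in log-space, then exponentiate".

```python
import math
import random


def sample_config(rng):
    """Draw one configuration from a search space shaped like the config above."""
    powers_of_two = [2 ** i for i in range(9)]   # 1, 2, 4, ..., 256
    return {
        "l1": rng.choice(powers_of_two),
        "l2": rng.choice(powers_of_two),
        # Log-uniform: sample uniformly in log-space, then exponentiate.
        "lr": math.exp(rng.uniform(math.log(1e-4), math.log(1e-1))),
        "batch_size": rng.choice([2, 4, 8, 16]),
    }


rng = random.Random(0)            # fixed seed so the draws are repeatable
samples = [sample_config(rng) for _ in range(5)]
for cfg in samples:
    print(cfg)
```

Sampling the learning rate in log-space gives each decade (1e-4 to 1e-3, 1e-3 to 1e-2, 1e-2 to 1e-1) roughly the same share of draws, whereas plain uniform sampling would put about 90% of them above 0.01.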
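At each trial, Ray Tune will now randomly sample a combination of parameters from these search spaces. It will then train a number of models in parallel and find the best performing one among these. We also use the ASHAScheduler, which terminates badly performing trials early.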
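To give a feel for how this early termination proceeds, here is a simplified sketch of the milestones ("rungs") at which trials are compared, using the values grace_period=1, reduction_factor=2, and max_t=10 that this tutorial's main function passes to the scheduler. Both helper functions are hypothetical, and the survivor count assumes synchronous halving; the real ASHA scheduler decides asynchronously, so actual counts can differ.

```python
def asha_rungs(grace_period, reduction_factor, max_t):
    """Milestones (in training iterations) at which trials are compared."""
    rungs = []
    t = grace_period
    while t <= max_t:
        rungs.append(t)
        t *= reduction_factor
    return rungs


def survivors_per_rung(num_trials, reduction_factor, num_rungs):
    """Rough number of trials continuing past each rung (synchronous halving)."""
    counts = []
    remaining = num_trials
    for _ in range(num_rungs):
        remaining = max(1, remaining // reduction_factor)
        counts.append(remaining)
    return counts


rungs = asha_rungs(grace_period=1, reduction_factor=2, max_t=10)
survivors = survivors_per_rung(10, 2, len(rungs))
print(rungs)
print(survivors)
```

The rungs 1, 2, 4, and 8 correspond to the "Iter 1.000", "Iter 2.000", "Iter 4.000", and "Iter 8.000" brackets that appear in the status output of the run.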
We wrap the train_cifar function with functools.partial to set the constant data_dir parameter. We can also tell Ray Tune what resources should be available for each trial:
gpus_per_trial = 2
# ...
result = tune.run(
    partial(train_cifar, data_dir=data_dir),
    resources_per_trial={"cpu": 8, "gpu": gpus_per_trial},
    config=config,
    num_samples=num_samples,
    scheduler=scheduler,
    checkpoint_at_end=True)
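The role of functools.partial here can be seen in isolation with a stand-in function. In this sketch train_stub and the path /tmp/data are hypothetical; the point is only that the wrapped callable needs nothing but config from its caller.

```python
from functools import partial


def train_stub(config, data_dir=None):
    # Stand-in for a training function; it only reports what it received.
    return f"lr={config['lr']} data_dir={data_dir}"


# Bind the constant argument once; the path is a made-up example.
trainable = partial(train_stub, data_dir="/tmp/data")

# A tuner would call the trainable with a sampled config and nothing else.
result = trainable({"lr": 0.01})
print(result)
```

Binding data_dir up front keeps the search space free of values that never change between trials.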
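You can specify the number of CPUs, which are then available, for example, to increase the num_workers of the PyTorch DataLoader instances. The selected number of GPUs is made visible to PyTorch in each trial. Trials do not have access to GPUs that haven't been requested for them, so you don't have to worry about two trials using the same set of resources.
Here we can also specify fractional GPUs, so something like gpus_per_trial=0.5 is completely valid. The trials will then share GPUs among each other. You just have to make sure that the models still fit in the GPU memory.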
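As a back-of-the-envelope check on how per-trial resource requests limit parallelism, the sketch below computes an upper bound on concurrently running trials. It is a simplified model with a hypothetical helper name; Ray's actual scheduling considers more than this division.

```python
import math


def max_concurrent_trials(total_cpus, total_gpus, cpus_per_trial, gpus_per_trial):
    """Upper bound on trials that fit at once given per-trial requests."""
    limits = []
    if cpus_per_trial > 0:
        limits.append(math.floor(total_cpus / cpus_per_trial))
    if gpus_per_trial > 0:
        limits.append(math.floor(total_gpus / gpus_per_trial))
    # None means that neither resource constrains the number of trials.
    return min(limits) if limits else None


# A machine with 16 CPUs and 1 GPU, two CPUs per trial, no GPU requested.
cpu_only = max_concurrent_trials(16, 1, cpus_per_trial=2, gpus_per_trial=0)
# The same machine with half a GPU requested per trial.
half_gpu = max_concurrent_trials(16, 1, cpus_per_trial=2, gpus_per_trial=0.5)
print(cpu_only, half_gpu)
```

With two CPUs per trial on a 16-CPU machine the bound is eight trials, which matches the eight RUNNING trials in the status output of the run; requesting half a GPU per trial on a single-GPU machine would lower the bound to two.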
After training the models, we find the best performing one and load the trained network from the checkpoint file. We then obtain the test set accuracy and report everything by printing.
The full main function looks like this:
def main(num_samples=10, max_num_epochs=10, gpus_per_trial=2):
    data_dir = os.path.abspath("./data")
    load_data(data_dir)
    config = {
        "l1": tune.choice([2**i for i in range(9)]),
        "l2": tune.choice([2**i for i in range(9)]),
        "lr": tune.loguniform(1e-4, 1e-1),
        "batch_size": tune.choice([2, 4, 8, 16]),
    }
    scheduler = ASHAScheduler(
        metric="loss",
        mode="min",
        max_t=max_num_epochs,
        grace_period=1,
        reduction_factor=2,
    )
    result = tune.run(
        partial(train_cifar, data_dir=data_dir),
        resources_per_trial={"cpu": 2, "gpu": gpus_per_trial},
        config=config,
        num_samples=num_samples,
        scheduler=scheduler,
    )
    best_trial = result.get_best_trial("loss", "min", "last")
    print(f"Best trial config: {best_trial.config}")
    print(f"Best trial final validation loss: {best_trial.last_result['loss']}")
    print(f"Best trial final validation accuracy: {best_trial.last_result['accuracy']}")
    best_trained_model = Net(best_trial.config["l1"], best_trial.config["l2"])
    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda:0"
        if gpus_per_trial > 1:
            best_trained_model = nn.DataParallel(best_trained_model)
    best_trained_model.to(device)
    best_checkpoint = best_trial.checkpoint.to_air_checkpoint()
    best_checkpoint_data = best_checkpoint.to_dict()
    best_trained_model.load_state_dict(best_checkpoint_data["net_state_dict"])
    test_acc = test_accuracy(best_trained_model, device)
    print("Best trial test set accuracy: {}".format(test_acc))
if __name__ == "__main__":
    # You can change the number of GPUs per trial here:
    main(num_samples=10, max_num_epochs=10, gpus_per_trial=0)
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to /var/lib/jenkins/workspace/beginner_source/data/cifar-10-python.tar.gz 0% 0/170498071 [00:00<?, ?it/s] 0% 491520/170498071 [00:00<00:34, 4901426.98it/s] 4% 7307264/170498071 [00:00<00:03, 42047898.29it/s] 10% 17629184/170498071 [00:00<00:02, 69798204.67it/s] 16% 27820032/170498071 [00:00<00:01, 82407622.17it/s] 22% 38338560/170498071 [00:00<00:01, 90604441.34it/s] 29% 48726016/170498071 [00:00<00:01, 95049915.99it/s] 35% 59342848/170498071 [00:00<00:01, 98624828.60it/s] 41% 69828608/170498071 [00:00<00:01, 100103452.88it/s] 47% 80707584/170498071 [00:00<00:00, 102701251.79it/s] 54% 91226112/170498071 [00:01<00:00, 103410219.64it/s] 60% 101842944/170498071 [00:01<00:00, 104217418.28it/s] 66% 112394240/170498071 [00:01<00:00, 104577303.94it/s] 72% 122912768/170498071 [00:01<00:00, 104690232.44it/s] 78% 133464064/170498071 [00:01<00:00, 104835011.32it/s] 84% 144015360/170498071 [00:01<00:00, 104975230.73it/s] 91% 154566656/170498071 [00:01<00:00, 105068640.23it/s] 97% 165085184/170498071 [00:01<00:00, 104644047.95it/s] 100% 170498071/170498071 [00:01<00:00, 96529746.41it/s] Extracting /var/lib/jenkins/workspace/beginner_source/data/cifar-10-python.tar.gz to /var/lib/jenkins/workspace/beginner_source/data Files already downloaded and verified 2024-02-03 05:16:34,052 WARNING services.py:1816 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 2147479552 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM. 2024-02-03 05:16:34,193 INFO worker.py:1625 -- Started a local Ray instance. 2024-02-03 05:16:35,349 INFO tune.py:218 -- Initializing Ray automatically. 
For cluster usage or custom Ray initialization, call `ray.init(...)` before `tune.run(...)`. (pid=2669) /opt/conda/envs/py_3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead. (pid=2669) _torch_pytree._register_pytree_node( == Status == Current time: 2024-02-03 05:16:40 (running for 00:00:05.27) Using AsyncHyperBand: num_stopped=0 Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: None Logical resource usage: 2.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60) Result logdir: /var/lib/jenkins/ray_results/train_cifar_2024-02-03_05-16-35 Number of trials: 10/10 (9 PENDING, 1 RUNNING) +-------------------------+----------+-----------------+--------------+------+------+-------------+ | Trial name | status | loc | batch_size | l1 | l2 | lr | |-------------------------+----------+-----------------+--------------+------+------+-------------| | train_cifar_668d1_00000 | RUNNING | 172.17.0.2:2669 | 2 | 16 | 1 | 0.00213327 | | train_cifar_668d1_00001 | PENDING | | 4 | 1 | 2 | 0.013416 | | train_cifar_668d1_00002 | PENDING | | 2 | 256 | 64 | 0.0113784 | | train_cifar_668d1_00003 | PENDING | | 8 | 64 | 256 | 0.0274071 | | train_cifar_668d1_00004 | PENDING | | 4 | 16 | 2 | 0.056666 | | train_cifar_668d1_00005 | PENDING | | 4 | 8 | 64 | 0.000353097 | | train_cifar_668d1_00006 | PENDING | | 8 | 16 | 4 | 0.000147684 | | train_cifar_668d1_00007 | PENDING | | 8 | 256 | 256 | 0.00477469 | | train_cifar_668d1_00008 | PENDING | | 8 | 128 | 256 | 0.0306227 | | train_cifar_668d1_00009 | PENDING | | 2 | 2 | 16 | 0.0286986 | +-------------------------+----------+-----------------+--------------+------+------+-------------+ (func pid=2669) Files already downloaded and verified (func pid=2669) Files already downloaded and verified (pid=2758) _torch_pytree._register_pytree_node( (pid=2758) 
_torch_pytree._register_pytree_node( (pid=2765) /opt/conda/envs/py_3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead. (pid=2765) _torch_pytree._register_pytree_node( == Status == Current time: 2024-02-03 05:16:46 (running for 00:00:11.10) Using AsyncHyperBand: num_stopped=0 Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: None Logical resource usage: 14.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60) Result logdir: /var/lib/jenkins/ray_results/train_cifar_2024-02-03_05-16-35 Number of trials: 10/10 (3 PENDING, 7 RUNNING) +-------------------------+----------+-----------------+--------------+------+------+-------------+ | Trial name | status | loc | batch_size | l1 | l2 | lr | |-------------------------+----------+-----------------+--------------+------+------+-------------| | train_cifar_668d1_00000 | RUNNING | 172.17.0.2:2669 | 2 | 16 | 1 | 0.00213327 | | train_cifar_668d1_00001 | RUNNING | 172.17.0.2:2756 | 4 | 1 | 2 | 0.013416 | | train_cifar_668d1_00002 | RUNNING | 172.17.0.2:2758 | 2 | 256 | 64 | 0.0113784 | | train_cifar_668d1_00003 | RUNNING | 172.17.0.2:2760 | 8 | 64 | 256 | 0.0274071 | | train_cifar_668d1_00004 | RUNNING | 172.17.0.2:2762 | 4 | 16 | 2 | 0.056666 | | train_cifar_668d1_00005 | RUNNING | 172.17.0.2:2764 | 4 | 8 | 64 | 0.000353097 | | train_cifar_668d1_00006 | RUNNING | 172.17.0.2:2765 | 8 | 16 | 4 | 0.000147684 | | train_cifar_668d1_00007 | PENDING | | 8 | 256 | 256 | 0.00477469 | | train_cifar_668d1_00008 | PENDING | | 8 | 128 | 256 | 0.0306227 | | train_cifar_668d1_00009 | PENDING | | 2 | 2 | 16 | 0.0286986 | +-------------------------+----------+-----------------+--------------+------+------+-------------+ (func pid=2756) Files already downloaded and verified (func pid=2765) Files already downloaded and verified (pid=3549) 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead. [repeated 5x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.) (pid=3549) _torch_pytree._register_pytree_node( [repeated 5x across cluster] (func pid=2669) [1, 2000] loss: 2.332 (func pid=2758) Files already downloaded and verified [repeated 10x across cluster] == Status == Current time: 2024-02-03 05:16:53 (running for 00:00:18.39) Using AsyncHyperBand: num_stopped=0 Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: None Logical resource usage: 16.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60) Result logdir: /var/lib/jenkins/ray_results/train_cifar_2024-02-03_05-16-35 Number of trials: 10/10 (2 PENDING, 8 RUNNING) +-------------------------+----------+-----------------+--------------+------+------+-------------+ | Trial name | status | loc | batch_size | l1 | l2 | lr | |-------------------------+----------+-----------------+--------------+------+------+-------------| | train_cifar_668d1_00000 | RUNNING | 172.17.0.2:2669 | 2 | 16 | 1 | 0.00213327 | | train_cifar_668d1_00001 | RUNNING | 172.17.0.2:2756 | 4 | 1 | 2 | 0.013416 | | train_cifar_668d1_00002 | RUNNING | 172.17.0.2:2758 | 2 | 256 | 64 | 0.0113784 | | train_cifar_668d1_00003 | RUNNING | 172.17.0.2:2760 | 8 | 64 | 256 | 0.0274071 | | train_cifar_668d1_00004 | RUNNING | 172.17.0.2:2762 | 4 | 16 | 2 | 0.056666 | | train_cifar_668d1_00005 | RUNNING | 172.17.0.2:2764 | 4 | 8 | 64 | 0.000353097 | | train_cifar_668d1_00006 | RUNNING | 172.17.0.2:2765 | 8 | 16 | 4 | 0.000147684 | | train_cifar_668d1_00007 | RUNNING | 172.17.0.2:3549 | 8 | 256 | 256 | 0.00477469 | | train_cifar_668d1_00008 | PENDING | | 8 | 128 
| 256 | 0.0306227 | | train_cifar_668d1_00009 | PENDING | | 2 | 2 | 16 | 0.0286986 | +-------------------------+----------+-----------------+--------------+------+------+-------------+ == Status == Current time: 2024-02-03 05:16:58 (running for 00:00:23.40) Using AsyncHyperBand: num_stopped=0 Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: None Logical resource usage: 16.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60) Result logdir: /var/lib/jenkins/ray_results/train_cifar_2024-02-03_05-16-35 Number of trials: 10/10 (2 PENDING, 8 RUNNING) +-------------------------+----------+-----------------+--------------+------+------+-------------+ | Trial name | status | loc | batch_size | l1 | l2 | lr | |-------------------------+----------+-----------------+--------------+------+------+-------------| | train_cifar_668d1_00000 | RUNNING | 172.17.0.2:2669 | 2 | 16 | 1 | 0.00213327 | | train_cifar_668d1_00001 | RUNNING | 172.17.0.2:2756 | 4 | 1 | 2 | 0.013416 | | train_cifar_668d1_00002 | RUNNING | 172.17.0.2:2758 | 2 | 256 | 64 | 0.0113784 | | train_cifar_668d1_00003 | RUNNING | 172.17.0.2:2760 | 8 | 64 | 256 | 0.0274071 | | train_cifar_668d1_00004 | RUNNING | 172.17.0.2:2762 | 4 | 16 | 2 | 0.056666 | | train_cifar_668d1_00005 | RUNNING | 172.17.0.2:2764 | 4 | 8 | 64 | 0.000353097 | | train_cifar_668d1_00006 | RUNNING | 172.17.0.2:2765 | 8 | 16 | 4 | 0.000147684 | | train_cifar_668d1_00007 | RUNNING | 172.17.0.2:3549 | 8 | 256 | 256 | 0.00477469 | | train_cifar_668d1_00008 | PENDING | | 8 | 128 | 256 | 0.0306227 | | train_cifar_668d1_00009 | PENDING | | 2 | 2 | 16 | 0.0286986 | +-------------------------+----------+-----------------+--------------+------+------+-------------+ (func pid=2756) [1, 2000] loss: 2.311 (func pid=3549) Files already downloaded and verified [repeated 2x across cluster] (func pid=2764) [1, 2000] loss: 2.303 == Status == Current time: 2024-02-03 05:17:03 (running for 00:00:28.41) Using AsyncHyperBand: num_stopped=0 
Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: None Logical resource usage: 16.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60) Result logdir: /var/lib/jenkins/ray_results/train_cifar_2024-02-03_05-16-35 Number of trials: 10/10 (2 PENDING, 8 RUNNING) +-------------------------+----------+-----------------+--------------+------+------+-------------+ | Trial name | status | loc | batch_size | l1 | l2 | lr | |-------------------------+----------+-----------------+--------------+------+------+-------------| | train_cifar_668d1_00000 | RUNNING | 172.17.0.2:2669 | 2 | 16 | 1 | 0.00213327 | | train_cifar_668d1_00001 | RUNNING | 172.17.0.2:2756 | 4 | 1 | 2 | 0.013416 | | train_cifar_668d1_00002 | RUNNING | 172.17.0.2:2758 | 2 | 256 | 64 | 0.0113784 | | train_cifar_668d1_00003 | RUNNING | 172.17.0.2:2760 | 8 | 64 | 256 | 0.0274071 | | train_cifar_668d1_00004 | RUNNING | 172.17.0.2:2762 | 4 | 16 | 2 | 0.056666 | | train_cifar_668d1_00005 | RUNNING | 172.17.0.2:2764 | 4 | 8 | 64 | 0.000353097 | | train_cifar_668d1_00006 | RUNNING | 172.17.0.2:2765 | 8 | 16 | 4 | 0.000147684 | | train_cifar_668d1_00007 | RUNNING | 172.17.0.2:3549 | 8 | 256 | 256 | 0.00477469 | | train_cifar_668d1_00008 | PENDING | | 8 | 128 | 256 | 0.0306227 | | train_cifar_668d1_00009 | PENDING | | 2 | 2 | 16 | 0.0286986 | +-------------------------+----------+-----------------+--------------+------+------+-------------+ == Status == Current time: 2024-02-03 05:17:08 (running for 00:00:33.42) Using AsyncHyperBand: num_stopped=0 Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: None Logical resource usage: 16.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60) Result logdir: /var/lib/jenkins/ray_results/train_cifar_2024-02-03_05-16-35 Number of trials: 10/10 (2 PENDING, 8 RUNNING) +-------------------------+----------+-----------------+--------------+------+------+-------------+ | Trial name | status | loc | batch_size | l1 | l2 | lr | 
|-------------------------+----------+-----------------+--------------+------+------+-------------| | train_cifar_668d1_00000 | RUNNING | 172.17.0.2:2669 | 2 | 16 | 1 | 0.00213327 | | train_cifar_668d1_00001 | RUNNING | 172.17.0.2:2756 | 4 | 1 | 2 | 0.013416 | | train_cifar_668d1_00002 | RUNNING | 172.17.0.2:2758 | 2 | 256 | 64 | 0.0113784 | | train_cifar_668d1_00003 | RUNNING | 172.17.0.2:2760 | 8 | 64 | 256 | 0.0274071 | | train_cifar_668d1_00004 | RUNNING | 172.17.0.2:2762 | 4 | 16 | 2 | 0.056666 | | train_cifar_668d1_00005 | RUNNING | 172.17.0.2:2764 | 4 | 8 | 64 | 0.000353097 | | train_cifar_668d1_00006 | RUNNING | 172.17.0.2:2765 | 8 | 16 | 4 | 0.000147684 | | train_cifar_668d1_00007 | RUNNING | 172.17.0.2:3549 | 8 | 256 | 256 | 0.00477469 | | train_cifar_668d1_00008 | PENDING | | 8 | 128 | 256 | 0.0306227 | | train_cifar_668d1_00009 | PENDING | | 2 | 2 | 16 | 0.0286986 | +-------------------------+----------+-----------------+--------------+------+------+-------------+ (func pid=3549) [1, 2000] loss: 1.855 [repeated 6x across cluster] == Status == Current time: 2024-02-03 05:17:13 (running for 00:00:38.43) Using AsyncHyperBand: num_stopped=0 Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: None Logical resource usage: 16.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60) Result logdir: /var/lib/jenkins/ray_results/train_cifar_2024-02-03_05-16-35 Number of trials: 10/10 (2 PENDING, 8 RUNNING) +-------------------------+----------+-----------------+--------------+------+------+-------------+ | Trial name | status | loc | batch_size | l1 | l2 | lr | |-------------------------+----------+-----------------+--------------+------+------+-------------| | train_cifar_668d1_00000 | RUNNING | 172.17.0.2:2669 | 2 | 16 | 1 | 0.00213327 | | train_cifar_668d1_00001 | RUNNING | 172.17.0.2:2756 | 4 | 1 | 2 | 0.013416 | | train_cifar_668d1_00002 | RUNNING | 172.17.0.2:2758 | 2 | 256 | 64 | 0.0113784 | | train_cifar_668d1_00003 | RUNNING | 
172.17.0.2:2760 | 8 | 64 | 256 | 0.0274071 | | train_cifar_668d1_00004 | RUNNING | 172.17.0.2:2762 | 4 | 16 | 2 | 0.056666 | | train_cifar_668d1_00005 | RUNNING | 172.17.0.2:2764 | 4 | 8 | 64 | 0.000353097 | | train_cifar_668d1_00006 | RUNNING | 172.17.0.2:2765 | 8 | 16 | 4 | 0.000147684 | | train_cifar_668d1_00007 | RUNNING | 172.17.0.2:3549 | 8 | 256 | 256 | 0.00477469 | | train_cifar_668d1_00008 | PENDING | | 8 | 128 | 256 | 0.0306227 | | train_cifar_668d1_00009 | PENDING | | 2 | 2 | 16 | 0.0286986 | +-------------------------+----------+-----------------+--------------+------+------+-------------+ (func pid=2760) [1, 4000] loss: 1.031 [repeated 7x across cluster] == Status == Current time: 2024-02-03 05:17:18 (running for 00:00:43.44) Using AsyncHyperBand: num_stopped=0 Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: None Logical resource usage: 16.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60) Result logdir: /var/lib/jenkins/ray_results/train_cifar_2024-02-03_05-16-35 Number of trials: 10/10 (2 PENDING, 8 RUNNING) +-------------------------+----------+-----------------+--------------+------+------+-------------+ | Trial name | status | loc | batch_size | l1 | l2 | lr | |-------------------------+----------+-----------------+--------------+------+------+-------------| | train_cifar_668d1_00000 | RUNNING | 172.17.0.2:2669 | 2 | 16 | 1 | 0.00213327 | | train_cifar_668d1_00001 | RUNNING | 172.17.0.2:2756 | 4 | 1 | 2 | 0.013416 | | train_cifar_668d1_00002 | RUNNING | 172.17.0.2:2758 | 2 | 256 | 64 | 0.0113784 | | train_cifar_668d1_00003 | RUNNING | 172.17.0.2:2760 | 8 | 64 | 256 | 0.0274071 | | train_cifar_668d1_00004 | RUNNING | 172.17.0.2:2762 | 4 | 16 | 2 | 0.056666 | | train_cifar_668d1_00005 | RUNNING | 172.17.0.2:2764 | 4 | 8 | 64 | 0.000353097 | | train_cifar_668d1_00006 | RUNNING | 172.17.0.2:2765 | 8 | 16 | 4 | 0.000147684 | | train_cifar_668d1_00007 | RUNNING | 172.17.0.2:3549 | 8 | 256 | 256 | 0.00477469 | | 
train_cifar_668d1_00008 | PENDING | | 8 | 128 | 256 | 0.0306227 | | train_cifar_668d1_00009 | PENDING | | 2 | 2 | 16 | 0.0286986 | +-------------------------+----------+-----------------+--------------+------+------+-------------+ == Status == Current time: 2024-02-03 05:17:23 (running for 00:00:48.45) Using AsyncHyperBand: num_stopped=0 Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: None Logical resource usage: 16.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60) Result logdir: /var/lib/jenkins/ray_results/train_cifar_2024-02-03_05-16-35 Number of trials: 10/10 (2 PENDING, 8 RUNNING) +-------------------------+----------+-----------------+--------------+------+------+-------------+ | Trial name | status | loc | batch_size | l1 | l2 | lr | |-------------------------+----------+-----------------+--------------+------+------+-------------| | train_cifar_668d1_00000 | RUNNING | 172.17.0.2:2669 | 2 | 16 | 1 | 0.00213327 | | train_cifar_668d1_00001 | RUNNING | 172.17.0.2:2756 | 4 | 1 | 2 | 0.013416 | | train_cifar_668d1_00002 | RUNNING | 172.17.0.2:2758 | 2 | 256 | 64 | 0.0113784 | | train_cifar_668d1_00003 | RUNNING | 172.17.0.2:2760 | 8 | 64 | 256 | 0.0274071 | | train_cifar_668d1_00004 | RUNNING | 172.17.0.2:2762 | 4 | 16 | 2 | 0.056666 | | train_cifar_668d1_00005 | RUNNING | 172.17.0.2:2764 | 4 | 8 | 64 | 0.000353097 | | train_cifar_668d1_00006 | RUNNING | 172.17.0.2:2765 | 8 | 16 | 4 | 0.000147684 | | train_cifar_668d1_00007 | RUNNING | 172.17.0.2:3549 | 8 | 256 | 256 | 0.00477469 | | train_cifar_668d1_00008 | PENDING | | 8 | 128 | 256 | 0.0306227 | | train_cifar_668d1_00009 | PENDING | | 2 | 2 | 16 | 0.0286986 | +-------------------------+----------+-----------------+--------------+------+------+-------------+ (func pid=2756) [1, 6000] loss: 0.770 (func pid=2764) [1, 6000] loss: 0.681 == Status == Current time: 2024-02-03 05:17:28 (running for 00:00:53.46) Using AsyncHyperBand: num_stopped=0 Bracket: Iter 8.000: None | Iter 4.000: 
None | Iter 2.000: None | Iter 1.000: None Logical resource usage: 16.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60) Result logdir: /var/lib/jenkins/ray_results/train_cifar_2024-02-03_05-16-35 Number of trials: 10/10 (2 PENDING, 8 RUNNING) +-------------------------+----------+-----------------+--------------+------+------+-------------+ | Trial name | status | loc | batch_size | l1 | l2 | lr | |-------------------------+----------+-----------------+--------------+------+------+-------------| | train_cifar_668d1_00000 | RUNNING | 172.17.0.2:2669 | 2 | 16 | 1 | 0.00213327 | | train_cifar_668d1_00001 | RUNNING | 172.17.0.2:2756 | 4 | 1 | 2 | 0.013416 | | train_cifar_668d1_00002 | RUNNING | 172.17.0.2:2758 | 2 | 256 | 64 | 0.0113784 | | train_cifar_668d1_00003 | RUNNING | 172.17.0.2:2760 | 8 | 64 | 256 | 0.0274071 | | train_cifar_668d1_00004 | RUNNING | 172.17.0.2:2762 | 4 | 16 | 2 | 0.056666 | | train_cifar_668d1_00005 | RUNNING | 172.17.0.2:2764 | 4 | 8 | 64 | 0.000353097 | | train_cifar_668d1_00006 | RUNNING | 172.17.0.2:2765 | 8 | 16 | 4 | 0.000147684 | | train_cifar_668d1_00007 | RUNNING | 172.17.0.2:3549 | 8 | 256 | 256 | 0.00477469 | | train_cifar_668d1_00008 | PENDING | | 8 | 128 | 256 | 0.0306227 | | train_cifar_668d1_00009 | PENDING | | 2 | 2 | 16 | 0.0286986 | +-------------------------+----------+-----------------+--------------+------+------+-------------+ Result for train_cifar_668d1_00006: accuracy: 0.1208 date: 2024-02-03_05-17-29 done: false hostname: 8642c088913e iterations_since_restore: 1 loss: 2.293956341743469 node_ip: 172.17.0.2 pid: 2765 should_checkpoint: true time_since_restore: 43.53398323059082 time_this_iter_s: 43.53398323059082 time_total_s: 43.53398323059082 timestamp: 1706937449 training_iteration: 1 trial_id: 668d1_00006 Result for train_cifar_668d1_00003: accuracy: 0.2079 date: 2024-02-03_05-17-31 done: false hostname: 8642c088913e iterations_since_restore: 1 loss: 2.028138545417786 node_ip: 172.17.0.2 pid: 2760 should_checkpoint: 
true time_since_restore: 45.46037745475769 time_this_iter_s: 45.46037745475769 time_total_s: 45.46037745475769 timestamp: 1706937451 training_iteration: 1 trial_id: 668d1_00003 (func pid=2669) [1, 10000] loss: 0.461 [repeated 5x across cluster] == Status == Current time: 2024-02-03 05:17:37 (running for 00:01:01.57) Using AsyncHyperBand: num_stopped=0 Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: -2.1610474435806273 Logical resource usage: 16.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60) Result logdir: /var/lib/jenkins/ray_results/train_cifar_2024-02-03_05-16-35 Number of trials: 10/10 (2 PENDING, 8 RUNNING) +-------------------------+----------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------+ | Trial name | status | loc | batch_size | l1 | l2 | lr | iter | total time (s) | loss | accuracy | |-------------------------+----------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------| | train_cifar_668d1_00000 | RUNNING | 172.17.0.2:2669 | 2 | 16 | 1 | 0.00213327 | | | | | | train_cifar_668d1_00001 | RUNNING | 172.17.0.2:2756 | 4 | 1 | 2 | 0.013416 | | | | | | train_cifar_668d1_00002 | RUNNING | 172.17.0.2:2758 | 2 | 256 | 64 | 0.0113784 | | | | | | train_cifar_668d1_00003 | RUNNING | 172.17.0.2:2760 | 8 | 64 | 256 | 0.0274071 | 1 | 45.4604 | 2.02814 | 0.2079 | | train_cifar_668d1_00004 | RUNNING | 172.17.0.2:2762 | 4 | 16 | 2 | 0.056666 | | | | | | train_cifar_668d1_00005 | RUNNING | 172.17.0.2:2764 | 4 | 8 | 64 | 0.000353097 | | | | | | train_cifar_668d1_00006 | RUNNING | 172.17.0.2:2765 | 8 | 16 | 4 | 0.000147684 | 1 | 43.534 | 2.29396 | 0.1208 | | train_cifar_668d1_00007 | RUNNING | 172.17.0.2:3549 | 8 | 256 | 256 | 0.00477469 | | | | | | train_cifar_668d1_00008 | PENDING | | 8 | 128 | 256 | 0.0306227 | | | | | | train_cifar_668d1_00009 | PENDING | | 2 | 2 | 16 | 0.0286986 | | | | | 
+-------------------------+----------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------+ Result for train_cifar_668d1_00007: accuracy: 0.4793 date: 2024-02-03_05-17-40 done: false hostname: 8642c088913e iterations_since_restore: 1 loss: 1.4310961763858796 node_ip: 172.17.0.2 pid: 3549 should_checkpoint: true time_since_restore: 46.97845983505249 time_this_iter_s: 46.97845983505249 time_total_s: 46.97845983505249 timestamp: 1706937460 training_iteration: 1 trial_id: 668d1_00007 (func pid=2758) [1, 8000] loss: 0.575 [repeated 4x across cluster] == Status == Current time: 2024-02-03 05:17:45 (running for 00:01:10.40) Using AsyncHyperBand: num_stopped=0 Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: -2.028138545417786 Logical resource usage: 16.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60) Result logdir: /var/lib/jenkins/ray_results/train_cifar_2024-02-03_05-16-35 Number of trials: 10/10 (2 PENDING, 8 RUNNING) +-------------------------+----------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------+ | Trial name | status | loc | batch_size | l1 | l2 | lr | iter | total time (s) | loss | accuracy | |-------------------------+----------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------| | train_cifar_668d1_00000 | RUNNING | 172.17.0.2:2669 | 2 | 16 | 1 | 0.00213327 | | | | | | train_cifar_668d1_00001 | RUNNING | 172.17.0.2:2756 | 4 | 1 | 2 | 0.013416 | | | | | | train_cifar_668d1_00002 | RUNNING | 172.17.0.2:2758 | 2 | 256 | 64 | 0.0113784 | | | | | | train_cifar_668d1_00003 | RUNNING | 172.17.0.2:2760 | 8 | 64 | 256 | 0.0274071 | 1 | 45.4604 | 2.02814 | 0.2079 | | train_cifar_668d1_00004 | RUNNING | 172.17.0.2:2762 | 4 | 16 | 2 | 0.056666 | | | | | | train_cifar_668d1_00005 | RUNNING | 172.17.0.2:2764 | 4 | 8 | 64 | 0.000353097 | | | | | | 
train_cifar_668d1_00006 | RUNNING | 172.17.0.2:2765 | 8 | 16 | 4 | 0.000147684 | 1 | 43.534 | 2.29396 | 0.1208 | | train_cifar_668d1_00007 | RUNNING | 172.17.0.2:3549 | 8 | 256 | 256 | 0.00477469 | 1 | 46.9785 | 1.4311 | 0.4793 | | train_cifar_668d1_00008 | PENDING | | 8 | 128 | 256 | 0.0306227 | | | | | | train_cifar_668d1_00009 | PENDING | | 2 | 2 | 16 | 0.0286986 | | | | | +-------------------------+----------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------+ (func pid=2762) [1, 10000] loss: 0.468 [repeated 6x across cluster] == Status == Current time: 2024-02-03 05:17:50 (running for 00:01:15.41) Using AsyncHyperBand: num_stopped=0 Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: -2.028138545417786 Logical resource usage: 16.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60) Result logdir: /var/lib/jenkins/ray_results/train_cifar_2024-02-03_05-16-35 Number of trials: 10/10 (2 PENDING, 8 RUNNING) +-------------------------+----------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------+ | Trial name | status | loc | batch_size | l1 | l2 | lr | iter | total time (s) | loss | accuracy | |-------------------------+----------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------| | train_cifar_668d1_00000 | RUNNING | 172.17.0.2:2669 | 2 | 16 | 1 | 0.00213327 | | | | | | train_cifar_668d1_00001 | RUNNING | 172.17.0.2:2756 | 4 | 1 | 2 | 0.013416 | | | | | | train_cifar_668d1_00002 | RUNNING | 172.17.0.2:2758 | 2 | 256 | 64 | 0.0113784 | | | | | | train_cifar_668d1_00003 | RUNNING | 172.17.0.2:2760 | 8 | 64 | 256 | 0.0274071 | 1 | 45.4604 | 2.02814 | 0.2079 | | train_cifar_668d1_00004 | RUNNING | 172.17.0.2:2762 | 4 | 16 | 2 | 0.056666 | | | | | | train_cifar_668d1_00005 | RUNNING | 172.17.0.2:2764 | 4 | 8 | 64 | 0.000353097 | | | | | | train_cifar_668d1_00006 | 
RUNNING | 172.17.0.2:2765 | 8 | 16 | 4 | 0.000147684 | 1 | 43.534 | 2.29396 | 0.1208 | | train_cifar_668d1_00007 | RUNNING | 172.17.0.2:3549 | 8 | 256 | 256 | 0.00477469 | 1 | 46.9785 | 1.4311 | 0.4793 | | train_cifar_668d1_00008 | PENDING | | 8 | 128 | 256 | 0.0306227 | | | | | | train_cifar_668d1_00009 | PENDING | | 2 | 2 | 16 | 0.0286986 | | | | | +-------------------------+----------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------+ == Status == Current time: 2024-02-03 05:17:55 (running for 00:01:20.42) Using AsyncHyperBand: num_stopped=0 Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: -2.028138545417786 Logical resource usage: 16.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60) Result logdir: /var/lib/jenkins/ray_results/train_cifar_2024-02-03_05-16-35 Number of trials: 10/10 (2 PENDING, 8 RUNNING) +-------------------------+----------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------+ | Trial name | status | loc | batch_size | l1 | l2 | lr | iter | total time (s) | loss | accuracy | |-------------------------+----------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------| | train_cifar_668d1_00000 | RUNNING | 172.17.0.2:2669 | 2 | 16 | 1 | 0.00213327 | | | | | | train_cifar_668d1_00001 | RUNNING | 172.17.0.2:2756 | 4 | 1 | 2 | 0.013416 | | | | | | train_cifar_668d1_00002 | RUNNING | 172.17.0.2:2758 | 2 | 256 | 64 | 0.0113784 | | | | | | train_cifar_668d1_00003 | RUNNING | 172.17.0.2:2760 | 8 | 64 | 256 | 0.0274071 | 1 | 45.4604 | 2.02814 | 0.2079 | | train_cifar_668d1_00004 | RUNNING | 172.17.0.2:2762 | 4 | 16 | 2 | 0.056666 | | | | | | train_cifar_668d1_00005 | RUNNING | 172.17.0.2:2764 | 4 | 8 | 64 | 0.000353097 | | | | | | train_cifar_668d1_00006 | RUNNING | 172.17.0.2:2765 | 8 | 16 | 4 | 0.000147684 | 1 | 43.534 | 2.29396 | 0.1208 | | 
train_cifar_668d1_00007 | RUNNING | 172.17.0.2:3549 | 8 | 256 | 256 | 0.00477469 | 1 | 46.9785 | 1.4311 | 0.4793 | | train_cifar_668d1_00008 | PENDING | | 8 | 128 | 256 | 0.0306227 | | | | | | train_cifar_668d1_00009 | PENDING | | 2 | 2 | 16 | 0.0286986 | | | | | +-------------------------+----------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------+ (func pid=3549) [2, 2000] loss: 1.406 (func pid=2669) [1, 14000] loss: 0.329 Result for train_cifar_668d1_00001: accuracy: 0.1009 date: 2024-02-03_05-17-58 done: true hostname: 8642c088913e iterations_since_restore: 1 loss: 2.3118444224357604 node_ip: 172.17.0.2 pid: 2756 should_checkpoint: true time_since_restore: 72.15020895004272 time_this_iter_s: 72.15020895004272 time_total_s: 72.15020895004272 timestamp: 1706937478 training_iteration: 1 trial_id: 668d1_00001 Trial train_cifar_668d1_00001 completed. Result for train_cifar_668d1_00005: accuracy: 0.3539 date: 2024-02-03_05-17-58 done: false hostname: 8642c088913e iterations_since_restore: 1 loss: 1.7180780637741089 node_ip: 172.17.0.2 pid: 2764 should_checkpoint: true time_since_restore: 72.5149827003479 time_this_iter_s: 72.5149827003479 time_total_s: 72.5149827003479 timestamp: 1706937478 training_iteration: 1 trial_id: 668d1_00005 (func pid=2756) Files already downloaded and verified Result for train_cifar_668d1_00004: accuracy: 0.1042 date: 2024-02-03_05-17-59 done: true hostname: 8642c088913e iterations_since_restore: 1 loss: 2.317199463367462 node_ip: 172.17.0.2 pid: 2762 should_checkpoint: true time_since_restore: 73.49483036994934 time_this_iter_s: 73.49483036994934 time_total_s: 73.49483036994934 timestamp: 1706937479 training_iteration: 1 trial_id: 668d1_00004 Trial train_cifar_668d1_00004 completed. 
(func pid=2756) Files already downloaded and verified == Status == Current time: 2024-02-03 05:18:04 (running for 00:01:29.51) Using AsyncHyperBand: num_stopped=2 Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: -2.1610474435806273 Logical resource usage: 16.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60) Result logdir: /var/lib/jenkins/ray_results/train_cifar_2024-02-03_05-16-35 Number of trials: 10/10 (8 RUNNING, 2 TERMINATED) +-------------------------+------------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------+ | Trial name | status | loc | batch_size | l1 | l2 | lr | iter | total time (s) | loss | accuracy | |-------------------------+------------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------| | train_cifar_668d1_00000 | RUNNING | 172.17.0.2:2669 | 2 | 16 | 1 | 0.00213327 | | | | | | train_cifar_668d1_00002 | RUNNING | 172.17.0.2:2758 | 2 | 256 | 64 | 0.0113784 | | | | | | train_cifar_668d1_00003 | RUNNING | 172.17.0.2:2760 | 8 | 64 | 256 | 0.0274071 | 1 | 45.4604 | 2.02814 | 0.2079 | | train_cifar_668d1_00005 | RUNNING | 172.17.0.2:2764 | 4 | 8 | 64 | 0.000353097 | 1 | 72.515 | 1.71808 | 0.3539 | | train_cifar_668d1_00006 | RUNNING | 172.17.0.2:2765 | 8 | 16 | 4 | 0.000147684 | 1 | 43.534 | 2.29396 | 0.1208 | | train_cifar_668d1_00007 | RUNNING | 172.17.0.2:3549 | 8 | 256 | 256 | 0.00477469 | 1 | 46.9785 | 1.4311 | 0.4793 | | train_cifar_668d1_00008 | RUNNING | 172.17.0.2:2756 | 8 | 128 | 256 | 0.0306227 | | | | | | train_cifar_668d1_00009 | RUNNING | 172.17.0.2:2762 | 2 | 2 | 16 | 0.0286986 | | | | | | train_cifar_668d1_00001 | TERMINATED | 172.17.0.2:2756 | 4 | 1 | 2 | 0.013416 | 1 | 72.1502 | 2.31184 | 0.1009 | | train_cifar_668d1_00004 | TERMINATED | 172.17.0.2:2762 | 4 | 16 | 2 | 0.056666 | 1 | 73.4948 | 2.3172 | 0.1042 | 
+-------------------------+------------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------+ (func pid=2669) [1, 16000] loss: 0.288 [repeated 4x across cluster] (func pid=2762) Files already downloaded and verified [repeated 2x across cluster] == Status == Current time: 2024-02-03 05:18:09 (running for 00:01:34.53) Using AsyncHyperBand: num_stopped=2 Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: -2.1610474435806273 Logical resource usage: 16.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60) Result logdir: /var/lib/jenkins/ray_results/train_cifar_2024-02-03_05-16-35 Number of trials: 10/10 (8 RUNNING, 2 TERMINATED) +-------------------------+------------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------+ | Trial name | status | loc | batch_size | l1 | l2 | lr | iter | total time (s) | loss | accuracy | |-------------------------+------------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------| | train_cifar_668d1_00000 | RUNNING | 172.17.0.2:2669 | 2 | 16 | 1 | 0.00213327 | | | | | | train_cifar_668d1_00002 | RUNNING | 172.17.0.2:2758 | 2 | 256 | 64 | 0.0113784 | | | | | | train_cifar_668d1_00003 | RUNNING | 172.17.0.2:2760 | 8 | 64 | 256 | 0.0274071 | 1 | 45.4604 | 2.02814 | 0.2079 | | train_cifar_668d1_00005 | RUNNING | 172.17.0.2:2764 | 4 | 8 | 64 | 0.000353097 | 1 | 72.515 | 1.71808 | 0.3539 | | train_cifar_668d1_00006 | RUNNING | 172.17.0.2:2765 | 8 | 16 | 4 | 0.000147684 | 1 | 43.534 | 2.29396 | 0.1208 | | train_cifar_668d1_00007 | RUNNING | 172.17.0.2:3549 | 8 | 256 | 256 | 0.00477469 | 1 | 46.9785 | 1.4311 | 0.4793 | | train_cifar_668d1_00008 | RUNNING | 172.17.0.2:2756 | 8 | 128 | 256 | 0.0306227 | | | | | | train_cifar_668d1_00009 | RUNNING | 172.17.0.2:2762 | 2 | 2 | 16 | 0.0286986 | | | | | | train_cifar_668d1_00001 | TERMINATED | 
172.17.0.2:2756 | 4 | 1 | 2 | 0.013416 | 1 | 72.1502 | 2.31184 | 0.1009 | | train_cifar_668d1_00004 | TERMINATED | 172.17.0.2:2762 | 4 | 16 | 2 | 0.056666 | 1 | 73.4948 | 2.3172 | 0.1042 | +-------------------------+------------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------+ Result for train_cifar_668d1_00006: accuracy: 0.1824 date: 2024-02-03_05-18-11 done: false hostname: 8642c088913e iterations_since_restore: 2 loss: 2.1994362588882446 node_ip: 172.17.0.2 pid: 2765 should_checkpoint: true time_since_restore: 84.67864155769348 time_this_iter_s: 41.14465832710266 time_total_s: 84.67864155769348 timestamp: 1706937491 training_iteration: 2 trial_id: 668d1_00006 (func pid=2756) [1, 2000] loss: 2.138 [repeated 5x across cluster] == Status == Current time: 2024-02-03 05:18:16 (running for 00:01:40.64) Using AsyncHyperBand: num_stopped=2 Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: -2.1994362588882446 | Iter 1.000: -2.1610474435806273 Logical resource usage: 16.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60) Result logdir: /var/lib/jenkins/ray_results/train_cifar_2024-02-03_05-16-35 Number of trials: 10/10 (8 RUNNING, 2 TERMINATED) +-------------------------+------------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------+ | Trial name | status | loc | batch_size | l1 | l2 | lr | iter | total time (s) | loss | accuracy | |-------------------------+------------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------| | train_cifar_668d1_00000 | RUNNING | 172.17.0.2:2669 | 2 | 16 | 1 | 0.00213327 | | | | | | train_cifar_668d1_00002 | RUNNING | 172.17.0.2:2758 | 2 | 256 | 64 | 0.0113784 | | | | | | train_cifar_668d1_00003 | RUNNING | 172.17.0.2:2760 | 8 | 64 | 256 | 0.0274071 | 1 | 45.4604 | 2.02814 | 0.2079 | | train_cifar_668d1_00005 | RUNNING | 172.17.0.2:2764 | 4 | 8 | 64 | 
0.000353097 | 1 | 72.515 | 1.71808 | 0.3539 | | train_cifar_668d1_00006 | RUNNING | 172.17.0.2:2765 | 8 | 16 | 4 | 0.000147684 | 2 | 84.6786 | 2.19944 | 0.1824 | | train_cifar_668d1_00007 | RUNNING | 172.17.0.2:3549 | 8 | 256 | 256 | 0.00477469 | 1 | 46.9785 | 1.4311 | 0.4793 | | train_cifar_668d1_00008 | RUNNING | 172.17.0.2:2756 | 8 | 128 | 256 | 0.0306227 | | | | | | train_cifar_668d1_00009 | RUNNING | 172.17.0.2:2762 | 2 | 2 | 16 | 0.0286986 | | | | | | train_cifar_668d1_00001 | TERMINATED | 172.17.0.2:2756 | 4 | 1 | 2 | 0.013416 | 1 | 72.1502 | 2.31184 | 0.1009 | | train_cifar_668d1_00004 | TERMINATED | 172.17.0.2:2762 | 4 | 16 | 2 | 0.056666 | 1 | 73.4948 | 2.3172 | 0.1042 | +-------------------------+------------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------+ Result for train_cifar_668d1_00003: accuracy: 0.2459 date: 2024-02-03_05-18-16 done: false hostname: 8642c088913e iterations_since_restore: 2 loss: 1.9869435796737671 node_ip: 172.17.0.2 pid: 2760 should_checkpoint: true time_since_restore: 90.14830899238586 time_this_iter_s: 44.687931537628174 time_total_s: 90.14830899238586 timestamp: 1706937496 training_iteration: 2 trial_id: 668d1_00003 == Status == Current time: 2024-02-03 05:18:21 (running for 00:01:46.25) Using AsyncHyperBand: num_stopped=2 Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: -2.0931899192810057 | Iter 1.000: -2.1610474435806273 Logical resource usage: 16.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60) Result logdir: /var/lib/jenkins/ray_results/train_cifar_2024-02-03_05-16-35 Number of trials: 10/10 (8 RUNNING, 2 TERMINATED) +-------------------------+------------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------+ | Trial name | status | loc | batch_size | l1 | l2 | lr | iter | total time (s) | loss | accuracy | 
|-------------------------+------------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------| | train_cifar_668d1_00000 | RUNNING | 172.17.0.2:2669 | 2 | 16 | 1 | 0.00213327 | | | | | | train_cifar_668d1_00002 | RUNNING | 172.17.0.2:2758 | 2 | 256 | 64 | 0.0113784 | | | | | | train_cifar_668d1_00003 | RUNNING | 172.17.0.2:2760 | 8 | 64 | 256 | 0.0274071 | 2 | 90.1483 | 1.98694 | 0.2459 | | train_cifar_668d1_00005 | RUNNING | 172.17.0.2:2764 | 4 | 8 | 64 | 0.000353097 | 1 | 72.515 | 1.71808 | 0.3539 | | train_cifar_668d1_00006 | RUNNING | 172.17.0.2:2765 | 8 | 16 | 4 | 0.000147684 | 2 | 84.6786 | 2.19944 | 0.1824 | | train_cifar_668d1_00007 | RUNNING | 172.17.0.2:3549 | 8 | 256 | 256 | 0.00477469 | 1 | 46.9785 | 1.4311 | 0.4793 | | train_cifar_668d1_00008 | RUNNING | 172.17.0.2:2756 | 8 | 128 | 256 | 0.0306227 | | | | | | train_cifar_668d1_00009 | RUNNING | 172.17.0.2:2762 | 2 | 2 | 16 | 0.0286986 | | | | | | train_cifar_668d1_00001 | TERMINATED | 172.17.0.2:2756 | 4 | 1 | 2 | 0.013416 | 1 | 72.1502 | 2.31184 | 0.1009 | | train_cifar_668d1_00004 | TERMINATED | 172.17.0.2:2762 | 4 | 16 | 2 | 0.056666 | 1 | 73.4948 | 2.3172 | 0.1042 | +-------------------------+------------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------+ (func pid=2764) [2, 4000] loss: 0.814 [repeated 2x across cluster] Result for train_cifar_668d1_00007: accuracy: 0.5056 date: 2024-02-03_05-18-25 done: false hostname: 8642c088913e iterations_since_restore: 2 loss: 1.4163358207702637 node_ip: 172.17.0.2 pid: 3549 should_checkpoint: true time_since_restore: 91.53078818321228 time_this_iter_s: 44.55232834815979 time_total_s: 91.53078818321228 timestamp: 1706937505 training_iteration: 2 trial_id: 668d1_00007 (func pid=2758) [1, 14000] loss: 0.310 [repeated 3x across cluster] == Status == Current time: 2024-02-03 05:18:30 (running for 00:01:54.95) Using AsyncHyperBand: num_stopped=2 
Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: -1.9869435796737671 | Iter 1.000: -2.1610474435806273 Logical resource usage: 16.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60) Result logdir: /var/lib/jenkins/ray_results/train_cifar_2024-02-03_05-16-35 Number of trials: 10/10 (8 RUNNING, 2 TERMINATED) +-------------------------+------------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------+ | Trial name | status | loc | batch_size | l1 | l2 | lr | iter | total time (s) | loss | accuracy | |-------------------------+------------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------| | train_cifar_668d1_00000 | RUNNING | 172.17.0.2:2669 | 2 | 16 | 1 | 0.00213327 | | | | | | train_cifar_668d1_00002 | RUNNING | 172.17.0.2:2758 | 2 | 256 | 64 | 0.0113784 | | | | | | train_cifar_668d1_00003 | RUNNING | 172.17.0.2:2760 | 8 | 64 | 256 | 0.0274071 | 2 | 90.1483 | 1.98694 | 0.2459 | | train_cifar_668d1_00005 | RUNNING | 172.17.0.2:2764 | 4 | 8 | 64 | 0.000353097 | 1 | 72.515 | 1.71808 | 0.3539 | | train_cifar_668d1_00006 | RUNNING | 172.17.0.2:2765 | 8 | 16 | 4 | 0.000147684 | 2 | 84.6786 | 2.19944 | 0.1824 | | train_cifar_668d1_00007 | RUNNING | 172.17.0.2:3549 | 8 | 256 | 256 | 0.00477469 | 2 | 91.5308 | 1.41634 | 0.5056 | | train_cifar_668d1_00008 | RUNNING | 172.17.0.2:2756 | 8 | 128 | 256 | 0.0306227 | | | | | | train_cifar_668d1_00009 | RUNNING | 172.17.0.2:2762 | 2 | 2 | 16 | 0.0286986 | | | | | | train_cifar_668d1_00001 | TERMINATED | 172.17.0.2:2756 | 4 | 1 | 2 | 0.013416 | 1 | 72.1502 | 2.31184 | 0.1009 | | train_cifar_668d1_00004 | TERMINATED | 172.17.0.2:2762 | 4 | 16 | 2 | 0.056666 | 1 | 73.4948 | 2.3172 | 0.1042 | +-------------------------+------------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------+ == Status == Current time: 2024-02-03 05:18:35 (running for 
00:01:59.96) Using AsyncHyperBand: num_stopped=2 Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: -1.9869435796737671 | Iter 1.000: -2.1610474435806273 Logical resource usage: 16.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60) Result logdir: /var/lib/jenkins/ray_results/train_cifar_2024-02-03_05-16-35 Number of trials: 10/10 (8 RUNNING, 2 TERMINATED) +-------------------------+------------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------+ | Trial name | status | loc | batch_size | l1 | l2 | lr | iter | total time (s) | loss | accuracy | |-------------------------+------------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------| | train_cifar_668d1_00000 | RUNNING | 172.17.0.2:2669 | 2 | 16 | 1 | 0.00213327 | | | | | | train_cifar_668d1_00002 | RUNNING | 172.17.0.2:2758 | 2 | 256 | 64 | 0.0113784 | | | | | | train_cifar_668d1_00003 | RUNNING | 172.17.0.2:2760 | 8 | 64 | 256 | 0.0274071 | 2 | 90.1483 | 1.98694 | 0.2459 | | train_cifar_668d1_00005 | RUNNING | 172.17.0.2:2764 | 4 | 8 | 64 | 0.000353097 | 1 | 72.515 | 1.71808 | 0.3539 | | train_cifar_668d1_00006 | RUNNING | 172.17.0.2:2765 | 8 | 16 | 4 | 0.000147684 | 2 | 84.6786 | 2.19944 | 0.1824 | | train_cifar_668d1_00007 | RUNNING | 172.17.0.2:3549 | 8 | 256 | 256 | 0.00477469 | 2 | 91.5308 | 1.41634 | 0.5056 | | train_cifar_668d1_00008 | RUNNING | 172.17.0.2:2756 | 8 | 128 | 256 | 0.0306227 | | | | | | train_cifar_668d1_00009 | RUNNING | 172.17.0.2:2762 | 2 | 2 | 16 | 0.0286986 | | | | | | train_cifar_668d1_00001 | TERMINATED | 172.17.0.2:2756 | 4 | 1 | 2 | 0.013416 | 1 | 72.1502 | 2.31184 | 0.1009 | | train_cifar_668d1_00004 | TERMINATED | 172.17.0.2:2762 | 4 | 16 | 2 | 0.056666 | 1 | 73.4948 | 2.3172 | 0.1042 | +-------------------------+------------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------+ (func pid=2762) 
[1, 6000] loss: 0.779 [repeated 4x across cluster] == Status == Current time: 2024-02-03 05:18:40 (running for 00:02:04.97) Using AsyncHyperBand: num_stopped=2 Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: -1.9869435796737671 | Iter 1.000: -2.1610474435806273 Logical resource usage: 16.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60) Result logdir: /var/lib/jenkins/ray_results/train_cifar_2024-02-03_05-16-35 Number of trials: 10/10 (8 RUNNING, 2 TERMINATED) +-------------------------+------------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------+ | Trial name | status | loc | batch_size | l1 | l2 | lr | iter | total time (s) | loss | accuracy | |-------------------------+------------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------| | train_cifar_668d1_00000 | RUNNING | 172.17.0.2:2669 | 2 | 16 | 1 | 0.00213327 | | | | | | train_cifar_668d1_00002 | RUNNING | 172.17.0.2:2758 | 2 | 256 | 64 | 0.0113784 | | | | | | train_cifar_668d1_00003 | RUNNING | 172.17.0.2:2760 | 8 | 64 | 256 | 0.0274071 | 2 | 90.1483 | 1.98694 | 0.2459 | | train_cifar_668d1_00005 | RUNNING | 172.17.0.2:2764 | 4 | 8 | 64 | 0.000353097 | 1 | 72.515 | 1.71808 | 0.3539 | | train_cifar_668d1_00006 | RUNNING | 172.17.0.2:2765 | 8 | 16 | 4 | 0.000147684 | 2 | 84.6786 | 2.19944 | 0.1824 | | train_cifar_668d1_00007 | RUNNING | 172.17.0.2:3549 | 8 | 256 | 256 | 0.00477469 | 2 | 91.5308 | 1.41634 | 0.5056 | | train_cifar_668d1_00008 | RUNNING | 172.17.0.2:2756 | 8 | 128 | 256 | 0.0306227 | | | | | | train_cifar_668d1_00009 | RUNNING | 172.17.0.2:2762 | 2 | 2 | 16 | 0.0286986 | | | | | | train_cifar_668d1_00001 | TERMINATED | 172.17.0.2:2756 | 4 | 1 | 2 | 0.013416 | 1 | 72.1502 | 2.31184 | 0.1009 | | train_cifar_668d1_00004 | TERMINATED | 172.17.0.2:2762 | 4 | 16 | 2 | 0.056666 | 1 | 73.4948 | 2.3172 | 0.1042 | 
+-------------------------+------------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------+ (func pid=3549) [3, 2000] loss: 1.268 [repeated 3x across cluster] == Status == Current time: 2024-02-03 05:18:45 (running for 00:02:09.98) Using AsyncHyperBand: num_stopped=2 Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: -1.9869435796737671 | Iter 1.000: -2.1610474435806273 Logical resource usage: 16.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60) Result logdir: /var/lib/jenkins/ray_results/train_cifar_2024-02-03_05-16-35 Number of trials: 10/10 (8 RUNNING, 2 TERMINATED) +-------------------------+------------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------+ | Trial name | status | loc | batch_size | l1 | l2 | lr | iter | total time (s) | loss | accuracy | |-------------------------+------------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------| | train_cifar_668d1_00000 | RUNNING | 172.17.0.2:2669 | 2 | 16 | 1 | 0.00213327 | | | | | | train_cifar_668d1_00002 | RUNNING | 172.17.0.2:2758 | 2 | 256 | 64 | 0.0113784 | | | | | | train_cifar_668d1_00003 | RUNNING | 172.17.0.2:2760 | 8 | 64 | 256 | 0.0274071 | 2 | 90.1483 | 1.98694 | 0.2459 | | train_cifar_668d1_00005 | RUNNING | 172.17.0.2:2764 | 4 | 8 | 64 | 0.000353097 | 1 | 72.515 | 1.71808 | 0.3539 | | train_cifar_668d1_00006 | RUNNING | 172.17.0.2:2765 | 8 | 16 | 4 | 0.000147684 | 2 | 84.6786 | 2.19944 | 0.1824 | | train_cifar_668d1_00007 | RUNNING | 172.17.0.2:3549 | 8 | 256 | 256 | 0.00477469 | 2 | 91.5308 | 1.41634 | 0.5056 | | train_cifar_668d1_00008 | RUNNING | 172.17.0.2:2756 | 8 | 128 | 256 | 0.0306227 | | | | | | train_cifar_668d1_00009 | RUNNING | 172.17.0.2:2762 | 2 | 2 | 16 | 0.0286986 | | | | | | train_cifar_668d1_00001 | TERMINATED | 172.17.0.2:2756 | 4 | 1 | 2 | 0.013416 | 1 | 72.1502 | 2.31184 | 0.1009 | | 
train_cifar_668d1_00004 | TERMINATED | 172.17.0.2:2762 | 4 | 16 | 2 | 0.056666 | 1 | 73.4948 | 2.3172 | 0.1042 | +-------------------------+------------+-----------------+--------------+------+------+-------------+--------+------------------+---------+------------+ Result for train_cifar_668d1_00008: accuracy: 0.2278 date: 2024-02-03_05-18-45 done: false hostname: 8642c088913e iterations_since_restore: 1 loss: 2.150844239425659 node_ip: 172.17.0.2 pid: 2756 should_checkpoint: true time_since_restore: 47.30649995803833 time_this_iter_s: 47.30649995803833 time_total_s: 47.30649995803833 timestamp: 1706937525 training_iteration: 1 trial_id: 668d1_00008
If you run the code, an example output could look like this:
Number of trials: 10/10 (10 TERMINATED)
+-----+--------------+------+------+-------------+--------+---------+------------+
| ... |   batch_size |   l1 |   l2 |          lr |   iter |    loss |   accuracy |
|-----+--------------+------+------+-------------+--------+---------+------------|
| ... |            2 |    1 |  256 | 0.000668163 |      1 | 2.31479 |     0.0977 |
| ... |            4 |   64 |    8 |   0.0331514 |      1 | 2.31605 |     0.0983 |
| ... |            4 |    2 |    1 | 0.000150295 |      1 | 2.30755 |     0.1023 |
| ... |           16 |   32 |   32 |   0.0128248 |     10 | 1.66912 |     0.4391 |
| ... |            4 |    8 |  128 |  0.00464561 |      2 |  1.7316 |     0.3463 |
| ... |            8 |  256 |    8 |  0.00031556 |      1 | 2.19409 |     0.1736 |
| ... |            4 |   16 |  256 |  0.00574329 |      2 | 1.85679 |     0.3368 |
| ... |            8 |    2 |    2 |  0.00325652 |      1 | 2.30272 |     0.0984 |
| ... |            2 |    2 |    2 | 0.000342987 |      2 | 1.76044 |      0.292 |
| ... |            4 |   64 |   32 |    0.003734 |      8 | 1.53101 |     0.4761 |
+-----+--------------+------+------+-------------+--------+---------+------------+

Best trial config: {'l1': 64, 'l2': 32, 'lr': 0.0037339984519545164, 'batch_size': 4}
Best trial final validation loss: 1.5310075663924216
Best trial final validation accuracy: 0.4761
Best trial test set accuracy: 0.4737
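The "Best trial" lines at the end of this output appear to correspond to the table row with the lowest final validation loss. The following is a minimal sketch of that selection in plain Python, with the rows copied from the table above. It assumes that "best" means lowest last-reported loss and that the epoch budget per trial was 10 (`MAX_EPOCHS` is a hypothetical name; the budget is inferred from the largest `iter` value, not stated here). It does not call Ray Tune itself.

```python
# Rows copied from the example summary table (the elided "..." column is omitted).
rows = [
    {"batch_size": 2,  "l1": 1,   "l2": 256, "lr": 0.000668163, "iter": 1,  "loss": 2.31479, "accuracy": 0.0977},
    {"batch_size": 4,  "l1": 64,  "l2": 8,   "lr": 0.0331514,   "iter": 1,  "loss": 2.31605, "accuracy": 0.0983},
    {"batch_size": 4,  "l1": 2,   "l2": 1,   "lr": 0.000150295, "iter": 1,  "loss": 2.30755, "accuracy": 0.1023},
    {"batch_size": 16, "l1": 32,  "l2": 32,  "lr": 0.0128248,   "iter": 10, "loss": 1.66912, "accuracy": 0.4391},
    {"batch_size": 4,  "l1": 8,   "l2": 128, "lr": 0.00464561,  "iter": 2,  "loss": 1.7316,  "accuracy": 0.3463},
    {"batch_size": 8,  "l1": 256, "l2": 8,   "lr": 0.00031556,  "iter": 1,  "loss": 2.19409, "accuracy": 0.1736},
    {"batch_size": 4,  "l1": 16,  "l2": 256, "lr": 0.00574329,  "iter": 2,  "loss": 1.85679, "accuracy": 0.3368},
    {"batch_size": 8,  "l1": 2,   "l2": 2,   "lr": 0.00325652,  "iter": 1,  "loss": 2.30272, "accuracy": 0.0984},
    {"batch_size": 2,  "l1": 2,   "l2": 2,   "lr": 0.000342987, "iter": 2,  "loss": 1.76044, "accuracy": 0.292},
    {"batch_size": 4,  "l1": 64,  "l2": 32,  "lr": 0.003734,    "iter": 8,  "loss": 1.53101, "accuracy": 0.4761},
]

MAX_EPOCHS = 10  # assumption: inferred from the largest "iter" value in the table

# Assumed selection rule: the trial with the lowest last-reported validation loss.
best = min(rows, key=lambda r: r["loss"])
best_config = {k: best[k] for k in ("l1", "l2", "lr", "batch_size")}

# Trials that reported fewer iterations than the assumed budget ended early.
stopped_early = [r for r in rows if r["iter"] < MAX_EPOCHS]

print("Best trial config:", best_config)
print("Best trial final validation accuracy:", best["accuracy"])
print("Trials ended before the full budget:", len(stopped_early), "of", len(rows))
```

Under these assumptions the selected row matches the reported best configuration (`l1=64`, `l2=32`, `batch_size=4`), up to the rounding of `lr` in the table.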
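Most trials were stopped early to avoid wasting resources. The best-performing trial reached a validation accuracy of about 47% (0.4761 in the output above), which the test-set accuracy (0.4737) confirms.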
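To make the early stopping concrete, here is a simplified sketch of the decision an asynchronous successive-halving scheduler makes at its first rung. It is an illustrative model, not Ray Tune's implementation: `rung_decision` is a hypothetical helper, and the sketch assumes a reduction factor of 2 (consistent with the rungs at iterations 1, 2, 4 and 8 shown in the "Bracket" lines of the log), so that the cutoff is the median of the rewards already recorded at that rung. The losses are the first-iteration results from the log above, in the order they were reported.

```python
from statistics import median


def rung_decision(recorded, loss):
    """Decide whether a trial continues past a rung (simplified model).

    `recorded` holds the rewards (negative losses) of trials that already
    reached this rung. The new trial is compared against the median of those
    earlier rewards, then its own reward is recorded for later trials.
    """
    reward = -loss  # the scheduler minimizes loss, i.e. maximizes -loss
    cutoff = median(recorded) if recorded else None
    decision = "stop" if cutoff is not None and reward < cutoff else "continue"
    recorded.append(reward)
    return decision


# First-iteration validation losses from the log, in reporting order.
first_rung_losses = [
    ("668d1_00006", 2.293956),
    ("668d1_00003", 2.028139),
    ("668d1_00007", 1.431096),
    ("668d1_00001", 2.311844),
    ("668d1_00005", 1.718078),
    ("668d1_00004", 2.317199),
]

recorded = []
decisions = {}
for trial_id, loss in first_rung_losses:
    decisions[trial_id] = rung_decision(recorded, loss)
    print(trial_id, decisions[trial_id])
```

In this replay, trials `668d1_00001` and `668d1_00004` are the ones stopped after their first iteration, which agrees with the `done: true` entries in the log.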
That's it! You can now tune the hyperparameters of your PyTorch models.
Total running time of the script: (9 minutes 49.698 seconds)