wandb使用教程(持续更新ing...)

简介: wandb使用教程(持续更新ing...)

1. 系统整体设置


https://wandb.ai/settings中可以设置什么时候以邮件或slack(我没下,所以我全用邮件了)提醒:

image.png

其中run finished在使用jupyter notebook环境时不会起效,以防每个cell运行结束后都提醒一次。如果用jupyter notebook的话,需要用wandb.alert()来触发提醒。


此外还可以关联GitHub以迅速提交benchmark runs(这个我还没试过,所以还没有关联):

image.png


储存空间可以进行管理,免费的有100个G,点击“manage storage”可以直接删不需要的文件:

image.png


2. 学术


2.1 免费学术小组

这个我还没试过,意思应该是学校或科研机构之类的可以免费建小组。如果有需要可以使用:https://wandb.ai/site/academic


2.2 在论文中引用wandb的方式

来自https://wandb.ai/site/academic


@misc{wandb,
title = {Experiment Tracking with Weights and Biases},
year = {2020},
note = {Software available from wandb.com},
url={https://www.wandb.com/},
author = {Biewald, Lukas},
}


2.3 wandb白皮书

https://www.dropbox.com/s/0ipub9ewwkml8jf/Experiment%20Tracking%20with%20Weights%20%26%20Biases.pdf?dl=1


3. wandb文档


Weights & Biases - Documentation


wandb支持多平台、多框架。

image.png


3.1 wandb quickstart

Quickstart - Documentation

Jupyter Notebook代码实践:Intro_to_Weights_&_Biases.ipynb - Colaboratory


3.1.1 set up

命令行:


pip install wandb
wandb login


(如果代码中有用到wandb,其实会自动要求登录,但是如果用nohup挂起脚本的话就无法实现,所以还是建议提前登录好)

API key的位置在:https://wandb.ai/authorize

(值得注意的是,这个配置应该是全局的,虽然我在一个虚拟环境下登录了我的账号,但是在别的虚拟环境下重新安装wandb还是可以直接通用。我估计是因为缓存在了本地)

用wandb login --relogin强制重新登录。


用jupyter notebook则为:


!pip install wandb
wandb.login()



3.1.2 开启新项目,跟踪指标、超参,添加报警信息

感觉就是init新项目,然后指标用log(在下面例子里可以看到,可以分组(如分为train/val)),超参用config,需要报警的内容就添加alert,结果添加到summary,最后finish就行,比较好用。


init文档:Launch Experiments with wandb.init - Documentation

对dashboard整体布局(run page)的介绍:Run Page - Documentation

数据可视化/跟踪指标:Data Visualization - Documentation

跟踪超参:Configure Experiments with wandb.config - Documentation

警告部分文档:Send Alerts with wandb.alert - Documentation


在测试代码时如临时不想与wandb同步,需设置环境变量,使wandb模式变成离线: WANDB_MODE=offline(具体做法是在运行Python代码的命令行中,在python前面加上这句命令)


文档中给出的核心代码:


import wandb
wandb.init(project="my-awesome-project")  #其他入参:name(见下例),config(见下)
wandb.log({'accuracy': train_acc, 'loss': train_loss})
wandb.config.dropout = 0.2
wandb.alert(
    title="Low accuracy", 
    text=f"Accuracy {acc} is below the acceptable threshold {thresh}"
)


在jupyter notebook上,用一个假示例来模拟:


import random
# Launch 5 simulated experiments
total_runs = 5
for run in range(total_runs):
  # 🐝 1️⃣ Start a new run to track this script
  wandb.init(
      # Set the project where this run will be logged
      project="wandbexample1", 
      # We pass a run name (otherwise it’ll be randomly assigned, like sunshine-lollypop-10)
      name=f"experiment_{run}", 
      # Track hyperparameters and run metadata
      config={
      "learning_rate": 0.02,
      "architecture": "CNN",
      "dataset": "CIFAR-100",
      "epochs": 10,
      })
  # This simple block simulates a training loop logging metrics
  epochs = 10
  offset = random.random() / 5
  for epoch in range(2, epochs):
      acc = 1 - 2 ** -epoch - random.random() / epoch - offset
      loss = 2 ** -epoch + random.random() / epoch + offset
      # 🐝 2️⃣ Log metrics from your script to W&B
      wandb.log({"acc": acc, "loss": loss})
  # Mark the run as finished
  wandb.finish()


在输出中会给出wandb项目链接,在浏览器中打开即可。

记录的内容:

image.png

输出的内容:

image.png

可以看出运行数据保存在了本地的wandb文件夹,额外的信息都没有上传到wandb上。


3.1.3 上一节的PyTorch例子

jupyter notebook上:MNIST分类器


#@title
import wandb
import math
import random
import torch, torchvision
import torch.nn as nn
import torchvision.transforms as T
from tqdm.notebook import tqdm
device = "cuda:0" if torch.cuda.is_available() else "cpu"
def get_dataloader(is_train, batch_size, slice=5):
    "Get a training dataloader"
    full_dataset = torchvision.datasets.MNIST(root=".", train=is_train, transform=T.ToTensor(), download=True)
    sub_dataset = torch.utils.data.Subset(full_dataset, indices=range(0, len(full_dataset), slice))
    loader = torch.utils.data.DataLoader(dataset=sub_dataset, 
                                         batch_size=batch_size, 
                                         shuffle=True if is_train else False, 
                                         pin_memory=True, num_workers=2)
    return loader
def get_model(dropout):
    "A simple model"
    model = nn.Sequential(nn.Flatten(),
                         nn.Linear(28*28, 256),
                         nn.BatchNorm1d(256),
                         nn.ReLU(),
                         nn.Dropout(dropout),
                         nn.Linear(256,10)).to(device)
    return model
def validate_model(model, valid_dl, loss_func, log_images=False, batch_idx=0):
    "Compute performance of the model on the validation dataset and log a wandb.Table"
    model.eval()
    val_loss = 0.
    with torch.inference_mode():
        correct = 0
        for i, (images, labels) in tqdm(enumerate(valid_dl), leave=False):
            images, labels = images.to(device), labels.to(device)
            # Forward pass ➡
            outputs = model(images)
            val_loss += loss_func(outputs, labels)*labels.size(0)
            # Compute accuracy and accumulate
            _, predicted = torch.max(outputs.data, 1)
            correct += (predicted == labels).sum().item()
            # Log one batch of images to the dashboard, always same batch_idx.
            if i==batch_idx and log_images:
                log_image_table(images, predicted, labels, outputs.softmax(dim=1))
    return val_loss / len(valid_dl.dataset), correct / len(valid_dl.dataset)
def log_image_table(images, predicted, labels, probs):
    "Log a wandb.Table with (img, pred, target, scores)"
    # 🐝 Create a wandb Table to log images, labels and predictions to
    table = wandb.Table(columns=["image", "pred", "target"]+[f"score_{i}" for i in range(10)])
    for img, pred, targ, prob in zip(images.to("cpu"), predicted.to("cpu"), labels.to("cpu"), probs.to("cpu")):
        table.add_data(wandb.Image(img[0].numpy()*255), pred, targ, *prob.numpy())
    wandb.log({"predictions_table":table}, commit=False)


训练:


# Launch 5 experiments, trying different dropout rates
for i in range(5):
    # 🐝 initialise a wandb run
    wandb.init(
        project="wandbexample1",
        name="pytorch_example"+str(i),
        config={
            "epochs": 10,
            "batch_size": 128,
            "lr": 1e-3,
            "dropout": random.uniform(0.01, 0.80),
            })
    # Copy your config 
    config = wandb.config
    # Get the data
    train_dl = get_dataloader(is_train=True, batch_size=config.batch_size)
    valid_dl = get_dataloader(is_train=False, batch_size=2*config.batch_size)
    n_steps_per_epoch = math.ceil(len(train_dl.dataset) / config.batch_size)
    # A simple MLP model
    model = get_model(config.dropout)
    # Make the loss and optimizer
    loss_func = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=config.lr)
   # Training
    example_ct = 0
    step_ct = 0
    for epoch in tqdm(range(config.epochs)):
        model.train()
        for step, (images, labels) in enumerate(tqdm(train_dl, leave=False)):
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            train_loss = loss_func(outputs, labels)
            optimizer.zero_grad()
            train_loss.backward()
            optimizer.step()
            example_ct += len(images)
            metrics = {"train/train_loss": train_loss, 
                       "train/epoch": (step + 1 + (n_steps_per_epoch * epoch)) / n_steps_per_epoch, 
                       "train/example_ct": example_ct}
            if step + 1 < n_steps_per_epoch:
                # 🐝 Log train metrics to wandb 
                wandb.log(metrics)
            step_ct += 1
        val_loss, accuracy = validate_model(model, valid_dl, loss_func, log_images=(epoch==(config.epochs-1)))
        # 🐝 Log train and validation metrics to wandb
        val_metrics = {"val/val_loss": val_loss, 
                       "val/val_accuracy": accuracy}
        wandb.log({**metrics, **val_metrics})
        print(f"Train Loss: {train_loss:.3f}, Valid Loss: {val_loss:3f}, Accuracy: {accuracy:.2f}")
    # If you had a test set, this is how you could log it as a Summary metric
    wandb.summary['test_accuracy'] = 0.8
    # 🐝 Close your wandb run 
    wandb.finish()


wandb网页首页:

image.png

点进一个run内:

image.png


输出:

image.png

image.png


3.1.4 jupyter notebook wandb警告示例

# Start a wandb run
wandb.init(project="wandbexample1")
# Simulating a model training loop
acc_threshold = 0.3
for training_step in range(1000):
    # Generate a random number for accuracy
    accuracy = round(random.random() + random.random(), 3)
    print(f'Accuracy is: {accuracy}, {acc_threshold}')
    # 🐝 Log accuracy to wandb
    wandb.log({"Accuracy": accuracy})
    # 🔔 If the accuracy is below the threshold, fire a W&B Alert and stop the run
    if accuracy <= acc_threshold:
        # 🐝 Send the wandb Alert
        wandb.alert(
            title='Low Accuracy',
            text=f'Accuracy {accuracy} at step {training_step} is below the acceptable theshold, {acc_threshold}',
        )
        print('Alert triggered')
        break
# Mark the run as finished (useful in Jupyter notebooks)
wandb.finish()


image.png

acc变成0.155,小于阈值0.3,所以报了警告。因为我没有slack所以是给发到邮件上:

image.png

项目的dashboard:

image.png


3.2 调参

Hyperparameter Tuning - Documentation

Organizing_Hyperparameter_Sweeps_in_PyTorch_with_W&B.ipynb


3.3 合作报告

Collaborative Reports - Documentation


3.4 跟踪pipeline的数据和模型版本

Data + Model Versioning - Documentation


3.5 数据可视化/跟踪指标日志

Data Visualization - Documentation


3.6 自动化深度学习平台的配置文件

Environment Variables - Documentation


3.7 本地化的解决方案

这个我暂时用不到,但是也列在这里以便备用:

Private Hosting - Documentation


3.8 示例

Examples - Documentation

示例的dashboard:wandb_example Workspace – Weights & Biases


3.9 Interagations

Integrations - Documentation


3.10 输入输出储存到wandb上的数据

Import & Export Data - Documentation


4. 常见问题


4.1 wandb.errors.UsageError: Error communicating with wandb process

参考https://docs.wandb.ai/library/init#init-start-error:在wandb.init()中添加入参settings=wandb.Settings(start_method="fork")即可解决问题。


4.2 本地储存空间不够→删除本地缓存

在 wandb.log() 处报错。


报错信息:


Thread SenderThread:
OSError: [Errno 28] No space left on device
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "env_path/lib/python3.8/site-packages/wandb/sdk/internal/internal_util.py", line 51, in run
    self._run()
  File "env_path/lib/python3.8/site-packages/wandb/sdk/internal/internal_util.py", line 102, in _run
    self._process(record)
  File "env_path/lib/python3.8/site-packages/wandb/sdk/internal/internal.py", line 310, in _process
    self._sm.send(record)
  File "env_path/lib/python3.8/site-packages/wandb/sdk/internal/sender.py", line 304, in send
    send_handler(record)
  File "env_path/lib/python3.8/site-packages/wandb/sdk/internal/sender.py", line 931, in send_summary
    self._update_summary()
  File "env_path/lib/python3.8/site-packages/wandb/sdk/internal/sender.py", line 944, in _update_summary
    f.write(json_summary)
OSError: [Errno 28] No space left on device
wandb: ERROR Internal wandb error: file data was not synced
Traceback (most recent call last):
  File "c49.py", line 293, in <module>
    wandb.log({'valid_law_acc':law_accuracy,'valid_charge_acc':charge_accuracy,
  File "env_path/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 289, in wrapper
    return func(self, *args, **kwargs)
  File "env_path/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 255, in wrapper
    return func(self, *args, **kwargs)
  File "env_path/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 1591, in log
    self._log(data=data, step=step, commit=commit)
  File "env_path/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 1375, in _log
    self._partial_history_callback(data, step, commit)
  File "env_path/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 1259, in _partial_history_callback
    self._backend.interface.publish_partial_history(
  File "env_path/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 553, in publish_partial_history
    self._publish_partial_history(partial_history)
  File "env_path/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 67, in _publish_partial_history
    self._publish(rec)
  File "env_path/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
    self._sock_client.send_record_publish(record)
  File "env_path/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 150, in send_record_publish
    self.send_server_request(server_req)
  File "env_path/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 84, in send_server_request
    self._send_message(msg)
  File "env_path/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 81, in _send_message
    self._sendall_with_error_handle(header + data)
  File "env_path/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 61, in _sendall_with_error_handle
    sent = self._sock.send(data[total_sent:])
BrokenPipeError: [Errno 32] Broken pipe
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "env_path/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 81, in _send_message
    self._sendall_with_error_handle(header + data)
  File "env_path/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 61, in _sendall_with_error_handle
    sent = self._sock.send(data[total_sent:])
BrokenPipeError: [Errno 32] Broken pipe
相关文章
|
8月前
|
缓存
Autojs4.1.0实战教程---彩蛋视频功能合集
Autojs4.1.0实战教程---彩蛋视频功能合集
86 0
|
8月前
AutoJs Pro 7.0.4-1 实战教程 ---火热持续更新中
AutoJs Pro 7.0.4-1 实战教程 ---火热持续更新中
291 0
|
机器学习/深度学习 缓存 Shell
VSCode上的Git使用手记(持续更新ing...)
本笔记是我想要学习如何将本地文件发布到GitHub上时开始看廖雪峰的Git教程,然后打开了VSCode,发现VSCode上面集成的Git辅助使用功能真的很好用…… 基本上到了不用看教程都可以猜懂的地步。 为了整理、规范使用技巧,在经过了一番学习和试验之后,觉得以这样一篇使用手记的形式发布使用技巧相关的博文,以记录和沉淀经验,并帮助更多Git和VSCode初学者少踩坑。 本文参考的教程、文档等内容见本文末尾。
VSCode上的Git使用手记(持续更新ing...)
|
算法
《算法竞赛进阶指南》(持续更新ing)
《算法竞赛进阶指南》(持续更新ing)
123 0
|
编译器 C语言 C++
【C++系列P1】带上这篇基础小宝典,进发C++!(持续更新ing~)
【C++系列P1】带上这篇基础小宝典,进发C++!(持续更新ing~)
|
存储 算法 搜索推荐
C++系列笔记(十一)
C++系列笔记(十一)
C++系列笔记(十一)
|
缓存 Linux PyTorch
Anaconda教程(持续更新ing...)
Anaconda是一个Python和R语言的工具,专注于数据科学领域。由于我只使用Python,因此本文只会介绍Python相关的内容。 本文介绍使用Anaconda相关使用教程。
Anaconda教程(持续更新ing...)
|
缓存 Java Shell
Linux常用命令行集锦(持续更新ing...)
本文介绍Linux中常用的命令行。
Linux常用命令行集锦(持续更新ing...)
|
TensorFlow 网络安全 开发工具
VSCode编程小技巧集锦(持续更新ing...)
VSCode编程小技巧集锦(持续更新ing...)
VSCode编程小技巧集锦(持续更新ing...)

热门文章

最新文章

相关实验场景

更多
下一篇
开通oss服务