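1. Dataset and Data in the PyG package

For a detailed explanation of this part, see another post of mine: PyTorch Geometric (PyG) 入门教程 (an introductory PyG tutorial).

2. Introduction to the ogb package

I have not written a dedicated ogb tutorial, so my overview and understanding of the package go here. I may consolidate them into a separate tutorial later if needed.

Many functions in the ogb package are undocumented, so the only option is to read the source... That is still fairly hard for me, so for now I simply use these functions without trying to understand them fully.

Official site: Get Started | Open Graph Benchmark
Example code: baseline experiments/example scripts
OGB contains a collection of realistic, large-scale datasets that are commonly used as benchmarks for graph machine learning tasks. The ogb package provides a data loader and an evaluator for each dataset.

Official Data Loaders example code: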

from ogb.graphproppred import PygGraphPropPredDataset
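# PyG 2.x moved DataLoader to torch_geometric.loader;
# on PyG < 2.0, use: from torch_geometric.data import DataLoader
from torch_geometric.loader import DataLoader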
# Download and process data at './dataset/ogbg_molhiv/'
dataset = PygGraphPropPredDataset(name = "ogbg-molhiv", root = 'dataset/')
split_idx = dataset.get_idx_split() 
train_loader = DataLoader(dataset[split_idx["train"]], batch_size=32, shuffle=True)
valid_loader = DataLoader(dataset[split_idx["valid"]], batch_size=32, shuffle=False)
test_loader = DataLoader(dataset[split_idx["test"]], batch_size=32, shuffle=False)

The dataset here behaves like a PyG dataset: you can index it to get a Data object, slice it, wrap it in a DataLoader, and so on.

DataLoader's shuffle: set it to True for training and False for evaluation.

split_idx is a dictionary like this:

{'train': tensor([ 0, 1, 2, ..., 169145, 169148, 169251]),
 'valid': tensor([ 349, 357, 366, ..., 169185, 169261, 169296]),
 'test': tensor([ 346, 398, 451, ..., 169340, 169341, 169342])}

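Every ogb dataset is mapped from real-world entities; the mapping details are in the mapping directory under the dataset's root directory.

Official Evaluator example code: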

from ogb.graphproppred import Evaluator
evaluator = Evaluator(name = "ogbg-molhiv")
input_dict = {"y_true": y_true, "y_pred": y_pred}
result_dict = evaluator.eval(input_dict) # E.g., {"rocauc": 0.7321}

The expected formats of input_dict and result_dict can be printed with evaluator.expected_input_format and evaluator.expected_output_format.

3. Node prediction task

The Colab for this part appears to be based on the GCN code in ogb/gnn.py at master · snap-stanford/ogb.

3.0 Imports

import torch
import torch.nn.functional as F
# The PyG built-in GCNConv
from torch_geometric.nn import GCNConv
import torch_geometric.transforms as T
from ogb.nodeproppred import PygNodePropPredDataset, Evaluator
import copy

3.1 Loading the ogbn-arxiv dataset

Official documentation for the ogb node classification datasets

Official documentation for the ogbn-arxiv dataset

dataset_name = 'ogbn-arxiv'
# Load the dataset and transform it to sparse tensor
dataset = PygNodePropPredDataset(name=dataset_name,
                                 transform=T.ToSparseTensor())
print(dataset.task_type)
print(dataset.num_classes)
print(dataset.num_tasks)
print(dataset.eval_metric)

multiclass classification

acc

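I do not yet understand the sparse-tensor conversion in this code. In short, the T.ToSparseTensor() transform converts edge_index into a torch_sparse.SparseTensor, exposed as the adj_t attribute (the torch_sparse GitHub project is rusty1s/pytorch_sparse: PyTorch Extension Library of Optimized Autograd Sparse Matrix Operations; I have not looked into the details yet). Without the transform, the attribute is simply edge_index.

ogbn-arxiv is a relatively small dataset for multi-class node classification. It comes from the Microsoft Academic Graph (MAG) [3] and represents the citation network among arXiv papers: nodes are papers and links are citations. Each node has a 128-dimensional feature vector, the average of the word embeddings of its title and abstract. The word embeddings are obtained with a skip-gram model [4].

The graph is directed [5], with 169,343 nodes and 1,166,243 edges; the task is multi-class classification over 40 paper subject areas.

The dataset is split by publication date: papers published up to and including 2017 form the training set, papers from 2018 the validation set, and papers from 2019 onward the test set (90,941 training nodes, 29,799 validation nodes and 48,603 test nodes).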
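The time-based split described above can be sketched in a few lines. This is only an illustration with made-up publication years; the real split indices come from dataset.get_idx_split(), and the array name node_year simply mirrors the attribute shown in the printed Data object.

```python
import numpy as np

# Hypothetical publication years for 8 papers (not taken from ogbn-arxiv).
node_year = np.array([2015, 2017, 2018, 2019, 2016, 2020, 2018, 2017])

# Train: up to 2017, valid: 2018, test: 2019 and later.
split_idx = {
    "train": np.where(node_year <= 2017)[0],
    "valid": np.where(node_year == 2018)[0],
    "test": np.where(node_year >= 2019)[0],
}

for name, idx in split_idx.items():
    print(name, idx.tolist())
```

Every node lands in exactly one split, which is also why the three split sizes add up to the number of nodes.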

The evaluation metric provided by ogb is accuracy.

There is a single graph; printing its Data object gives: Data(adj_t=[169343, 169343, nnz=1166243], node_year=[169343, 1], x=[169343, 128], y=[169343, 1])

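3.2 Preprocessing the data

data = dataset[0]  # the single graph in the dataset
# Make the adjacency matrix symmetric
data.adj_t = data.adj_t.to_symmetric()

(I could not find documentation for to_symmetric(), only the source: pytorch_sparse/tensor.py at master · rusty1s/pytorch_sparse. I could not follow it, so I moved on.)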
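As a rough mental model, symmetrizing an adjacency matrix means adding the reverse of every edge. The sketch below does this on a small dense NumPy matrix; it is an analogy under that assumption, not the torch_sparse implementation, which works on sparse storage.

```python
import numpy as np

# Dense adjacency of a tiny directed graph: edges 0->1, 1->2, 2->0, 0->2.
adj = np.zeros((3, 3), dtype=int)
for src, dst in [(0, 1), (1, 2), (2, 0), (0, 2)]:
    adj[src, dst] = 1

# Symmetrize: keep an edge (i, j) if either direction exists in the input.
adj_sym = np.maximum(adj, adj.T)

print(adj_sym)
print("input symmetric: ", bool((adj == adj.T).all()))
print("output symmetric:", bool((adj_sym == adj_sym.T).all()))
```

For a citation graph this turns "paper A cites paper B" into an undirected link, so messages flow in both directions during graph convolution.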

Move data to CUDA (if a GPU is available) and get the data split:

device = 'cuda' if torch.cuda.is_available() else 'cpu'
# If you use GPU, the device should be cuda
print('Device: {}'.format(device))
data = data.to(device)
split_idx = dataset.get_idx_split()  # index tensors for the train, valid and test parts
train_idx = split_idx['train'].to(device)

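3.3 Building the neural network model

Here it is used as a classification model.

If the final classification layer (the LogSoftmax) is removed, the model outputs the last layer's node embeddings directly, which makes it a node embedding model with embedding dimension output_dim. (This is used in the graph classification part below.)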
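To make the role of that final layer concrete, here is a small check on random tensors (not on the real model): applying log-softmax and then nll_loss gives the same value as cross_entropy on the raw scores, and log-softmax never changes which class has the highest score.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 4)              # raw scores for 5 nodes and 4 classes
labels = torch.tensor([0, 3, 1, 1, 2])

log_probs = F.log_softmax(logits, dim=-1)   # what a final LogSoftmax layer outputs
loss_a = F.nll_loss(log_probs, labels)      # loss computed from log-probabilities
loss_b = F.cross_entropy(logits, labels)    # the same loss computed from raw scores

print(torch.allclose(loss_a, loss_b))
print(torch.equal(log_probs.argmax(dim=-1), logits.argmax(dim=-1)))
```

This is why the classifier pairs a LogSoftmax output with F.nll_loss, while the embedding variant can skip the layer entirely.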

Network architecture diagram (figure not reproduced here).

class GCN(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_layers,
                 dropout, return_embeds=False):
        """
        dropout: the dropout probability
        return_embeds: if True, skip the classification layer and return node embeddings
        """
        super(GCN, self).__init__()
        self.convs = torch.nn.ModuleList()
        for i in range(num_layers - 1):
            self.convs.append(GCNConv(input_dim, hidden_dim))
            input_dim = hidden_dim
        self.convs.append(GCNConv(hidden_dim, output_dim))
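        self.bns = torch.nn.ModuleList(
            [torch.nn.BatchNorm1d(hidden_dim) for _ in range(num_layers - 1)])
        # dim=-1 normalizes over the class dimension (relying on an implicit dim is deprecated)
        self.softmax = torch.nn.LogSoftmax(dim=-1)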
        self.dropout = dropout
        self.return_embeds = return_embeds
    def reset_parameters(self):
        for conv in self.convs:
            conv.reset_parameters()
        for bn in self.bns:
            bn.reset_parameters()
    def forward(self, x, adj_t):
        out = None
        # First num_layers - 1 layers
        for layer in range(len(self.convs) - 1):
            x = self.convs[layer](x, adj_t)
            # GCNConv signature: forward(x: torch.Tensor,
            #     edge_index: Union[torch.Tensor, torch_sparse.tensor.SparseTensor],
            #     edge_weight: Optional[torch.Tensor] = None)
            x = self.bns[layer](x)
            x = F.relu(x)
            x = F.dropout(x, self.dropout, self.training)
        # Last layer
        out=self.convs[-1](x,adj_t)
        if not self.return_embeds:
            out=self.softmax(out)
        return out

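The reset_parameters() method resets the parameters of the model's layers. Those layers already call reset_parameters() automatically when they are constructed, so I asked the original author of ogb/gnn.py at master · snap-stanford/ogb on GitHub why it is written out again. He said that his code may train the model several times, and each run should start from a different initialization, so this function exists to reset the parameters of all submodules. (See: Why network module in example/. define reset_parameters manually? · Discussion #227 · snap-stanford/ogb)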
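The effect is easy to see on a single layer. The sketch below uses a plain torch.nn.Linear rather than the GCN above, purely to show that calling reset_parameters() draws a fresh random initialization.

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(4, 2)
before = layer.weight.detach().clone()

layer.reset_parameters()                # draws a new random initialization
after = layer.weight.detach().clone()

print(before.shape == after.shape)      # same shape
print(torch.equal(before, after))       # different values
```

When a script trains the same model object several times, calling this at the start of each run keeps the runs independent of each other.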

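...Which raises the question: Colab 2 runs this model only once, so why write the method at all? My personal guess is that the instructors simply carried it over when borrowing the code.

For the third argument of F.dropout(), self.training, see my post: PyTorch的F.dropout为什么要加self.training？ (why F.dropout in PyTorch needs self.training).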
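In brief, F.dropout is a plain function and does not know whether the model is in training or evaluation mode, so the flag has to be passed in. A small check on a tensor of ones:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.ones(1000)

train_out = F.dropout(x, p=0.5, training=True)    # zeroes entries, rescales the rest by 1 / (1 - p)
eval_out = F.dropout(x, p=0.5, training=False)    # identity

print(sorted(set(train_out.tolist())))
print(torch.equal(eval_out, x))
```

Passing self.training ties the behavior to model.train() and model.eval(), which is what torch.nn.Dropout does automatically.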

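3.4 Building the train() function

Note that during training the whole graph is fed through the model, but only the loss on the training nodes is used to compute gradients.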
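The following sketch illustrates the second half of that statement. It uses a random tensor as a stand-in for the model output (an assumption made so that per-node gradients can be inspected) and shows that only the rows selected by train_idx receive a gradient from the loss.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Stand-in for the model output on a 6-node graph with 3 classes.
out = torch.randn(6, 3, requires_grad=True)
y = torch.tensor([[0], [1], [2], [0], [1], [2]])   # labels with shape [num_nodes, 1]
train_idx = torch.tensor([0, 2, 5])

log_probs = F.log_softmax(out, dim=-1)
loss = F.nll_loss(log_probs[train_idx], y[train_idx, 0])
loss.backward()

# Only the output rows of training nodes carry a non-zero gradient.
row_has_grad = out.grad.abs().sum(dim=1) > 0
print(row_has_grad.tolist())
```

In the real GCN, non-training nodes still matter: their features reach the training nodes through message passing, which is why the full graph is passed to the model.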

def train(model, data, train_idx, optimizer, loss_fn):
    model.train()
    loss = 0
    optimizer.zero_grad()
    out=model(data.x,data.adj_t)
    train_output=out[train_idx]
    train_label=data.y[train_idx,0]
    # data.y is a 2-D tensor of shape [num_nodes, 1], but the loss expects a 1-D label vector;
    # squeeze, view or reshape would work just as well as indexing column 0
    loss=loss_fn(train_output,train_label)
    loss.backward()
    optimizer.step()
    return loss.item()

3.5 Building the test() function

Returns the evaluation metric on the training, validation and test sets.

@torch.no_grad()
def test(model, data, split_idx, evaluator):
    model.eval()
    out=model(data.x,data.adj_t)
    y_pred = out.argmax(dim=-1, keepdim=True)
    # The evaluation metric for ogbn-arxiv is accuracy;
    # print(evaluator.expected_output_format) gives: {'acc': acc}
    train_acc = evaluator.eval({
        'y_true': data.y[split_idx['train']],
        'y_pred': y_pred[split_idx['train']],
    })['acc']
    valid_acc = evaluator.eval({
        'y_true': data.y[split_idx['valid']],
        'y_pred': y_pred[split_idx['valid']],
    })['acc']
    test_acc = evaluator.eval({
        'y_true': data.y[split_idx['test']],
        'y_pred': y_pred[split_idx['test']],
    })['acc']
    return train_acc, valid_acc, test_acc
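For a single-task dataset, the accuracy that test() reads from the evaluator should amount to the fraction of matching labels. The function below is a hand-written stand-in under that assumption, not the ogb Evaluator itself.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of matching labels; both inputs have shape [num_nodes, 1]."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float((y_true == y_pred).mean())

y_true = np.array([[0], [1], [2], [1]])
y_pred = np.array([[0], [1], [1], [1]])
print(accuracy(y_true, y_pred))  # 0.75
```

Note that both arrays keep the [num_nodes, 1] shape, which is why test() calls argmax with keepdim=True.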

3.6 Setting the hyperparameters

args = {
    'device': device,
    'num_layers': 3,
    'hidden_dim': 256,
    'dropout': 0.5,
    'lr': 0.01,
    'epochs': 100,
}

3.7 Initializing the model and the evaluator

model = GCN(data.num_features, args['hidden_dim'],
            dataset.num_classes, args['num_layers'],
            args['dropout']).to(device)
evaluator = Evaluator(name='ogbn-arxiv')

3.8 Training

Run args["epochs"] epochs and keep the model that performs best on the validation set.

# reset the parameters to initial random value
model.reset_parameters()
optimizer = torch.optim.Adam(model.parameters(), lr=args['lr'])
loss_fn = F.nll_loss
best_model = None
best_valid_acc = 0
for epoch in range(1, 1 + args["epochs"]):
  loss = train(model, data, train_idx, optimizer, loss_fn)
  result = test(model, data, split_idx, evaluator)
  train_acc, valid_acc, test_acc = result
  if valid_acc > best_valid_acc:
      best_valid_acc = valid_acc
      best_model = copy.deepcopy(model)
  print(f'Epoch: {epoch:02d}, '
        f'Loss: {loss:.4f}, '
        f'Train: {100 * train_acc:.2f}%, '
        f'Valid: {100 * valid_acc:.2f}% '
        f'Test: {100 * test_acc:.2f}%')
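The copy.deepcopy call in the loop matters because the model keeps changing in place after the best epoch. The toy sketch below uses a dictionary as a stand-in for the model and made-up validation scores to show the difference between keeping a reference and keeping a snapshot.

```python
import copy

model = {"weight": [0.0]}              # toy stand-in for a model's state
valid_scores = [0.60, 0.72, 0.68]      # hypothetical validation scores per epoch

best_alias, best_copy, best_score = None, None, 0.0
for epoch, score in enumerate(valid_scores, start=1):
    model["weight"][0] = float(epoch)  # "training" mutates the model in place
    if score > best_score:
        best_score = score
        best_alias = model                   # just another name for the live model
        best_copy = copy.deepcopy(model)     # frozen snapshot of this epoch

print(best_alias["weight"])  # [3.0], follows the live model
print(best_copy["weight"])   # [2.0], state at the best epoch
```

Without the deep copy, best_model would always end up equal to the final-epoch model.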

3.9 Reporting the best model's results

best_result = test(best_model, data, split_idx, evaluator)
train_acc, valid_acc, test_acc = best_result
print(f'Best model: '
      f'Train: {100 * train_acc:.2f}%, '
      f'Valid: {100 * valid_acc:.2f}% '
      f'Test: {100 * test_acc:.2f}%')

Best model: Train: 74.44%, Valid: 71.94% Test: 71.15%

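4. Graph classification task

The code in this part appears to be partly based on:

ogb/main_pyg.py at master · snap-stanford/ogb. It does not seem to borrow much from it, though, so I did not read that official ogb code closely.

4.0 Imports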

from ogb.graphproppred import PygGraphPropPredDataset, Evaluator
from ogb.graphproppred.mol_encoder import AtomEncoder
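# PyG 2.x moved DataLoader to torch_geometric.loader;
# on PyG < 2.0, use: from torch_geometric.data import DataLoader
from torch_geometric.loader import DataLoader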
from torch_geometric.nn import global_add_pool, global_mean_pool
from tqdm.notebook import tqdm
import copy

4.1 Loading the ogbg-molhiv dataset

Official documentation for the ogb graph classification datasets

Official documentation for the ogbg-molhiv dataset

dataset = PygGraphPropPredDataset(name='ogbg-molhiv')
split_idx = dataset.get_idx_split()
print(dataset.task_type)
print(dataset.num_classes)
print(dataset.num_tasks)
print(dataset.eval_metric)

binary classification

rocauc

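ogbg-molhiv is a relatively small molecular property prediction dataset for binary graph classification. It has 41,127 undirected graphs, with on average 25.5 nodes and 13.75 edges per graph (27.5 if each undirected edge is counted once per direction, as it is stored in edge_index). The evaluation metric is ROC-AUC.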
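ROC-AUC can be read as the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. The function below is a minimal pure-Python sketch of that definition for intuition; it is not the implementation used by the ogb Evaluator.

```python
def roc_auc(y_true, y_score):
    """Share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, y_score))  # 0.75
```

Because only the ranking of the scores matters, raw model outputs (logits) can be evaluated directly, without applying a sigmoid first.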

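The dataset is adapted from MoleculeNet [6], and every molecule has been preprocessed with RDKit [7]. Each graph represents a molecule: nodes are atoms and edges are chemical bonds. Each node has a 9-dimensional feature vector containing its atomic number, chirality, formal charge, whether the atom is in a ring, and so on.

The official site provides example code for encoding the raw features: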

from ogb.graphproppred.mol_encoder import AtomEncoder, BondEncoder
atom_encoder = AtomEncoder(emb_dim = 100)
bond_encoder = BondEncoder(emb_dim = 100)
atom_emb = atom_encoder(x) # x is input atom feature
edge_emb = bond_encoder(edge_attr) # edge_attr is input edge feature
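Conceptually, an encoder of this kind keeps one embedding table per categorical feature column and sums the looked-up vectors. The class below is a hypothetical stand-in written for illustration (ToyAtomEncoder and the vocabulary sizes are invented here); it is not the ogb AtomEncoder.

```python
import torch

class ToyAtomEncoder(torch.nn.Module):
    """Hypothetical stand-in: one embedding table per feature column, summed."""

    def __init__(self, feature_dims, emb_dim):
        super().__init__()
        self.embeddings = torch.nn.ModuleList(
            [torch.nn.Embedding(dim, emb_dim) for dim in feature_dims]
        )

    def forward(self, x):
        # x: integer tensor of shape [num_nodes, num_features]
        out = 0
        for i, emb in enumerate(self.embeddings):
            out = out + emb(x[:, i])
        return out

feature_dims = [5, 3, 4]                  # made-up vocabulary sizes, not OGB's
encoder = ToyAtomEncoder(feature_dims, emb_dim=8)
x = torch.tensor([[0, 1, 2], [4, 2, 3]])  # two atoms, three categorical features
print(encoder(x).shape)                   # torch.Size([2, 8])
```

The output has one emb_dim-sized vector per node, which is what the GCN layers then take as input features.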

As an example, printing the first graph in the dataset gives:

Data(edge_attr=[40, 3], edge_index=[2, 40], x=[19, 9], y=[1, 1])

4.2 Splitting the dataset and wrapping the splits in DataLoaders

train_loader = DataLoader(dataset[split_idx["train"]], batch_size=32, shuffle=True, num_workers=0)
valid_loader = DataLoader(dataset[split_idx["valid"]], batch_size=32, shuffle=False, num_workers=0)
test_loader = DataLoader(dataset[split_idx["test"]], batch_size=32, shuffle=False, num_workers=0)

4.3 Setting the hyperparameters

device = 'cuda' if torch.cuda.is_available() else 'cpu'
args = {
    'device': device,
    'num_layers': 5,
    'hidden_dim': 256,
    'dropout': 0.5,
    'lr': 0.001,
    'epochs': 30,
}

4.4 Building the neural network model

class GCN_Graph(torch.nn.Module):
    def __init__(self, hidden_dim, output_dim, num_layers, dropout):
        super(GCN_Graph, self).__init__()
        # Load encoders for Atoms in molecule graphs
        self.node_encoder = AtomEncoder(hidden_dim)
        # Node embedding model
        self.gnn_node = GCN(hidden_dim, hidden_dim,
            hidden_dim, num_layers, dropout, return_embeds=True)
        self.pool=global_mean_pool
        self.linear = torch.nn.Linear(hidden_dim, output_dim)
    def reset_parameters(self):
      self.gnn_node.reset_parameters()
      self.linear.reset_parameters()
    def forward(self, batched_data):
        x, edge_index, batch = batched_data.x, batched_data.edge_index, batched_data.batch
        embed = self.node_encoder(x)
        out=self.gnn_node(embed,edge_index)
        out=self.pool(out,batch)
        out=self.linear(out)
        return out
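The self.pool step in GCN_Graph turns node embeddings into one vector per graph by averaging the nodes that share a graph index. Here is a NumPy sketch of that idea, assuming every graph in the batch has at least one node; it is an illustration, not PyG's global_mean_pool.

```python
import numpy as np

def mean_pool(node_emb, batch):
    """Average node embeddings per graph; batch[i] is the graph index of node i."""
    num_graphs = int(batch.max()) + 1
    sums = np.zeros((num_graphs, node_emb.shape[1]))
    np.add.at(sums, batch, node_emb)                   # scatter-add rows into their graph
    counts = np.bincount(batch, minlength=num_graphs)  # number of nodes per graph
    return sums / counts[:, None]

node_emb = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]])
batch = np.array([0, 0, 1])    # nodes 0 and 1 belong to graph 0, node 2 to graph 1
print(mean_pool(node_emb, batch))
```

The result has shape [num_graphs, hidden_dim], so the final linear layer produces one prediction per graph.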

4.5 Building the train() function

def train(model, device, data_loader, optimizer, loss_fn):
    """
    optimizer: the optimizer to use (from torch.optim)
    loss_fn: the loss function to use
    """
    model.train()
    loss = 0
    for step, batch in enumerate(tqdm(data_loader, desc="Iteration")):
        batch = batch.to(device)
        if batch.x.shape[0] == 1 or batch.batch[-1] == 0:
            pass
        else:
            # Ignore NaN targets (unlabeled) when computing the training loss.
            is_labeled = batch.y == batch.y
            optimizer.zero_grad()
            op = model(batch)
            train_op = op[is_labeled]
            train_labels = batch.y[is_labeled]
            # loss = loss_fn(train_op, train_labels) fails with:
            # RuntimeError: result type Float can't be cast to the desired output type Long
            # because train_op is torch.float32 while train_labels is torch.int64,
            # so the labels are cast to float first.
            loss = loss_fn(train_op, train_labels.float())
            loss.backward()
            optimizer.step()
    # Loss of the last processed batch
    return loss.item()
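The dtype issue noted in the comments of train() can be checked in isolation. The sketch below uses made-up logits and labels, casts the integer labels to float, and compares the result with the binary cross-entropy formula written out by hand.

```python
import torch

loss_fn = torch.nn.BCEWithLogitsLoss()
logits = torch.tensor([0.0, 2.0, -1.0])   # raw model outputs (float32)
labels = torch.tensor([1, 0, 1])          # integer labels (int64)

targets = labels.float()                  # the loss expects float targets
loss = loss_fn(logits, targets)

# The same quantity written out: mean of -[y*log(p) + (1-y)*log(1-p)], with p = sigmoid(logit)
probs = torch.sigmoid(logits)
manual = -(targets * torch.log(probs) + (1 - targets) * torch.log(1 - probs)).mean()

print(labels.dtype, targets.dtype)
print(torch.allclose(loss, manual))
```

BCEWithLogitsLoss applies the sigmoid internally, which is why the model's final linear layer outputs raw scores.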

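The code checks whether the first dimension of x has length 1 or the last entry of batch is 0 (does that mean the batch contains only one graph?), and it also ignores unlabeled data... I did not really understand what this is for, so I ran the following two snippets: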

for batch in train_loader:
    if batch.x.shape[0] == 1 or batch.batch[-1] == 0:
        print(batch)
        break

for batch in train_loader:
    if batch.x.shape[0] == 1 or batch.batch[-1] == 0:
        pass
    else:
        is_labeled = batch.y == batch.y
        if False in is_labeled:
            print(batch)
            break

Neither prints anything, so neither case occurs in this dataset.

...So I still do not know when this code would actually take effect.
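One plausible reason for the guard (a reading of the code, not something the source confirms) is batch normalization: BatchNorm1d cannot compute batch statistics from a single row in training mode. The sketch below also shows why batch.batch[-1] == 0 corresponds to a batch with one graph, assuming the batch vector lists graph indices in increasing order.

```python
import torch

# batch.batch maps every node to the index of its graph within the mini-batch.
# If the last entry is 0, all nodes belong to graph 0, so the batch holds one graph.
batch_vector = torch.tensor([0, 0, 0])
single_graph = bool(batch_vector[-1] == 0)
print("single graph:", single_graph)

bn = torch.nn.BatchNorm1d(4)
one_node = torch.randn(1, 4)   # features of a batch that contains a single node

bn.train()
try:
    bn(one_node)
    failed_in_train = False
except ValueError as err:
    failed_in_train = True
    print("train mode:", err)

bn.eval()                      # eval mode uses running statistics instead
print("eval mode:", bn(one_node).shape)
```

With batch_size=32 and molecules of around 25 atoms each, such batches are very unlikely here, which would be consistent with the two checks above printing nothing.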

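However, is_labeled here also serves to flatten the 2-D tensor into a 1-D vector (similar in effect to taking index 0 along the second dimension as in 3.4, or to functions such as squeeze, view and reshape).

The mechanism is as follows (see the section on boolean or mask indexing in the official NumPy indexing documentation): is_labeled is a boolean matrix with the same shape as batch.y (obviously), and indexing with a boolean matrix of the same shape returns a 1-D array.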
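Both effects of is_labeled can be reproduced in NumPy with a made-up label array: y == y is False exactly where y is NaN, and indexing with the resulting mask returns a flat 1-D array of the kept entries.

```python
import numpy as np

y = np.array([[1.0], [np.nan], [0.0]])   # shape [3, 1], one unlabeled entry
is_labeled = y == y                       # NaN is not equal to itself

print(is_labeled.ravel().tolist())        # [True, False, True]
selected = y[is_labeled]                  # boolean mask indexing flattens the result
print(selected.shape)                     # (2,)
print(selected.tolist())                  # [1.0, 0.0]
```

Since the labels in train() are int64, as noted in the comments there, they cannot hold NaN at all, so is_labeled is all True for ogbg-molhiv and only the flattening effect is visible.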

4.6 Building the eval() function

def eval(model, device, loader, evaluator):
    model.eval()
    y_true = []
    y_pred = []
    for step, batch in enumerate(tqdm(loader, desc="Iteration")):
        batch = batch.to(device)
        if batch.x.shape[0] == 1:
            pass
        else:
            with torch.no_grad():
                pred = model(batch)
            y_true.append(batch.y.view(pred.shape).detach().cpu())
            y_pred.append(pred.detach().cpu())
    y_true = torch.cat(y_true, dim = 0).numpy()
    y_pred = torch.cat(y_pred, dim = 0).numpy()
    input_dict = {"y_true": y_true, "y_pred": y_pred}
    return evaluator.eval(input_dict)

4.7 Initializing the model and the evaluator

model = GCN_Graph(args['hidden_dim'],
            dataset.num_tasks, args['num_layers'],
            args['dropout']).to(device)
evaluator = Evaluator(name='ogbg-molhiv')

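4.8 Training

Run args["epochs"] epochs and keep the model that performs best on the validation set.

(Note that although the metric variables here are named acc, the metric is actually ROC-AUC...)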

model.reset_parameters()
optimizer = torch.optim.Adam(model.parameters(), lr=args['lr'])
loss_fn = torch.nn.BCEWithLogitsLoss()
best_model = None
best_valid_acc = 0
for epoch in range(1, 1 + args["epochs"]):
  print('Training...')
  loss = train(model, device, train_loader, optimizer, loss_fn)
  print('Evaluating...')
  train_result = eval(model, device, train_loader, evaluator)
  val_result = eval(model, device, valid_loader, evaluator)
  test_result = eval(model, device, test_loader, evaluator)
  train_acc, valid_acc, test_acc = train_result[dataset.eval_metric], val_result[dataset.eval_metric], test_result[dataset.eval_metric]
  if valid_acc > best_valid_acc:
      best_valid_acc = valid_acc
      best_model = copy.deepcopy(model)
  print(f'Epoch: {epoch:02d}, '
        f'Loss: {loss:.4f}, '
        f'Train: {100 * train_acc:.2f}%, '
        f'Valid: {100 * valid_acc:.2f}% '
        f'Test: {100 * test_acc:.2f}%')

4.9 Reporting the best model's results

train_acc = eval(best_model, device, train_loader, evaluator)[dataset.eval_metric]
valid_acc = eval(best_model, device, valid_loader, evaluator)[dataset.eval_metric]
test_acc = eval(best_model, device, test_loader, evaluator)[dataset.eval_metric]
print(f'Best model: '
      f'Train: {100 * train_acc:.2f}%, '
      f'Valid: {100 * valid_acc:.2f}% '
      f'Test: {100 * test_acc:.2f}%')

Progress bars omitted.

Printed output:

Best model: Train: 87.07%, Valid: 81.10% Test: 76.75%

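5. References

While writing the code I consulted this notebook: cs224w-winter-2021/CS224W_Colab_2.ipynb at main · XckCodeDD/cs224w-winter-2021. As usual, though, I did not want to go back over someone else's code again, so I cannot say whether that solution has any problems.

See also: PyTorch Geometric (PyG) 入门教程 (an introductory PyG tutorial)
One interesting point (well, not that interesting, since I have not figured it out): the model is a GCN, and I still do not know whether this algorithm is transductive or inductive.

In the node classification task, training uses the whole graph, optimization uses only the nodes indexed by the training set, and at test time the trained model is again run on all the data.

In the graph classification task, however, the dataset is split in a plainly inductive way: train on the training set, validate on the validation set, test on the test set... After thinking about it, my view is that in graph classification the GCN simply acts as an embedding method, so its parameters can presumably be applied directly to new, unseen graphs... So it should be inductive, otherwise how would this work at all.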
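One way to see why the parameters transfer to unseen graphs is that a GCN layer's weight matrix depends only on the feature dimensions, not on the number of nodes. The NumPy sketch below applies one simplified propagation step (no bias, no nonlinearity) with the same weights to two graphs of different sizes; the graphs and features are made up.

```python
import numpy as np

def gcn_layer(adj, x, weight):
    """One simplified GCN step: D^{-1/2} (A + I) D^{-1/2} X W."""
    a_hat = adj + np.eye(adj.shape[0])                # add self-loops
    deg_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))   # D^{-1/2} as a vector
    norm_adj = deg_inv_sqrt[:, None] * a_hat * deg_inv_sqrt[None, :]
    return norm_adj @ x @ weight

rng = np.random.default_rng(0)
weight = rng.normal(size=(4, 2))      # shape depends only on feature dimensions

adj_small = np.array([[0, 1], [1, 0]], dtype=float)                    # 2-node graph
adj_large = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)   # 3-node path

out_small = gcn_layer(adj_small, rng.normal(size=(2, 4)), weight)
out_large = gcn_layer(adj_large, rng.normal(size=(3, 4)), weight)
print(out_small.shape, out_large.shape)   # (2, 2) (3, 2)
```

In the node classification setting the same weights are likewise shared across nodes; what makes that setup look transductive is that the test nodes are already part of the graph seen during training.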

I feel like I have grasped something, yet there is also a vague, hard-to-articulate confusion, so I probably have not fully understood it.

[3] Kuansan Wang, Zhihong Shen, Chiyuan Huang, Chieh-Han Wu, Yuxiao Dong, and Anshul Kanakia. Microsoft academic graph: When experts are not enough. Quantitative Science Studies, 1(1):396–413, 2020.

[4] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NeurIPS), pp. 3111–3119, 2013.

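[5] Note that is_directed() cannot be called directly on this Data object, because the is_undirected() method it relies on requires the edge_index attribute.

So I confirmed that the graph is directed by calling is_directed() on the Data object of a dataset loaded without the transform argument.

The PyG documentation also says that the ToSparseTensor transform is best applied late, because many methods may depend on the edge_index attribute. In its words: In case of composing multiple transforms, it is best to convert the data object to a SparseTensor as late as possible, since there exist some transforms that are only able to operate on data.edge_index for now.

(Of course, the dataset documentation already states that it is a directed graph...)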
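What such a directedness check amounts to can be sketched on a plain edge list: a graph is directed if at least one edge has no reverse counterpart. The function and the toy citation edges below are illustrative assumptions, not PyG's implementation.

```python
def is_directed(edges):
    """True if some edge (u, v) lacks its reverse (v, u)."""
    edge_set = set(edges)
    return any((v, u) not in edge_set for u, v in edge_set)

citations = [(0, 1), (2, 1), (2, 0)]                      # paper u cites paper v
both_ways = citations + [(v, u) for u, v in citations]    # add every reverse edge

print(is_directed(citations))   # True
print(is_directed(both_ways))   # False
```

The second call mirrors the effect of symmetrizing the adjacency matrix in section 3.2.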

[6] Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S Pappu, Karl Leswing, and Vijay Pande. Moleculenet: a benchmark for molecular machine learning. Chemical Science, 9(2):513–530, 2018.

[7] Greg Landrum et al. RDKit: Open-source cheminformatics, 2006.

CS224W (Machine Learning with Graphs) Winter 2021 course study notes 8: Colab 2
