1. Visualization
The visualization function visualize() provided in a tutorial:
%matplotlib inline
import torch
import networkx as nx
import matplotlib.pyplot as plt

# Visualization function for an NX graph or a PyTorch tensor
def visualize(h, color, epoch=None, loss=None):
    plt.figure(figsize=(7, 7))
    plt.xticks([])
    plt.yticks([])

    if torch.is_tensor(h):
        # Visualize intermediate results produced by the neural network
        h = h.detach().cpu().numpy()
        plt.scatter(h[:, 0], h[:, 1], s=140, c=color, cmap="Set2")
        if epoch is not None and loss is not None:
            plt.xlabel(f'Epoch: {epoch}, Loss: {loss.item():.4f}', fontsize=16)
    else:
        # Visualize the graph itself (h is a networkx graph here; the
        # original drew a global G instead of the parameter)
        nx.draw_networkx(h, pos=nx.spring_layout(h, seed=42),
                         with_labels=False, node_color=color, cmap="Set2")
    plt.show()
2. Datasets
torch_geometric.datasets — pytorch_geometric 1.7.0 documentation
Dataset documentation
PyG ships with a series of common public benchmark datasets.
e.g., all Planetoid datasets (Cora, Citeseer, Pubmed), all graph classification datasets from http://graphkernels.cs.tu-dortmund.de and their cleaned versions, the QM7 and QM9 dataset, and a handful of 3D mesh/point cloud datasets like FAUST, ModelNet10/40 and ShapeNet.
Source: https://pytorch-geometric.readthedocs.io/en/latest/notes/introduction.html
Once a dataset is initialized, the raw files are downloaded automatically and processed into the Data format described above.
2.1 Example: ENZYMES
TUDataset official documentation — a series of graph-kernel benchmark datasets
Parameters:
- use_node_attr: defaults to False. If set to True, the dataset will include additional continuous node attributes (if present).
ENZYMES contains 600 graphs in 6 classes.
Example code:
from torch_geometric.datasets import TUDataset

dataset = TUDataset(root='/tmp/ENZYMES', name='ENZYMES')
>>> ENZYMES(600)
len(dataset)
>>> 600
dataset.num_classes
>>> 6
dataset.num_node_features
>>> 3
Retrieving the Data object of a single graph:
data = dataset[0]
>>> Data(edge_index=[2, 168], x=[37, 3], y=[1])
data.is_undirected()
>>> True
2.2 Example: Cora
Planetoid official documentation — contains the three citation network datasets "Cora", "CiteSeer", and "PubMed", from the paper Revisiting Semi-Supervised Learning with Graph Embeddings. Nodes represent documents and edges represent citation links. The dataset splits are all given as binary masks.
Because the Planetoid datasets are downloaded from a GitHub repository, the download can sometimes fail in mainland China. For workarounds, see my earlier post: 3 workarounds for when Planetoid cannot download Cora and other datasets directly
The Cora dataset contains a single undirected graph.
Example code:
from torch_geometric.datasets import Planetoid

dataset = Planetoid(root='/tmp/Cora', name='Cora')
>>> Cora()
len(dataset)
>>> 1
dataset.num_classes
>>> 7
dataset.num_node_features
>>> 1433
data = dataset[0]
>>> Data(edge_index=[2, 10556], test_mask=[2708], train_mask=[2708], val_mask=[2708], x=[2708, 1433], y=[2708])
data.is_undirected()
>>> True
data.train_mask.sum().item()
>>> 140
data.val_mask.sum().item()
>>> 500
data.test_mask.sum().item()
>>> 1000
2.3 Example: KarateClub
torch_geometric.datasets official documentation
(The official documentation states 156 edges, i.e. the number of undirected edges × 2.)
Class signature: class KarateClub(transform=None)
Wikipedia: Zachary's karate club
Original papers:
- An Information Flow Model for Conflict and Fission in Small Groups: 34 nodes, 78 undirected, unweighted edges
- Kipf, T., & Welling, M. (2017). Semi-Supervised Classification with Graph Convolutional Networks. ArXiv, abs/1609.02907: the nodes are grouped into 4 classes via modularity-based clustering, with one labeled node per class (so there are 4 ground-truth labeled nodes in total)
The dataset contains only one graph. It describes the social network of a karate club: the 34 members are the nodes, and an edge is added between two members if they also kept in contact outside the club.
Each node has a 34-dimensional feature vector. There are 4 node classes, each representing the community a member belongs to.
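In PyG's KarateClub the 34-dimensional node features are one-hot identity vectors (node i's feature is 1 at position i and 0 elsewhere). A plain-Python sketch of that layout, with toy lists standing in for the actual tensor:

```python
# One-hot identity features: each node is featurized by its own index.
# (Plain-Python sketch; the real dataset stores this as a tensor.)
num_nodes = 34
x = [[1.0 if i == j else 0.0 for j in range(num_nodes)]
     for i in range(num_nodes)]

feature_dim = len(x[0])   # 34, matching dataset.num_features
row_sum = sum(x[5])       # each node has exactly one nonzero entry
```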
Example code for loading the data:
from torch_geometric.datasets import KarateClub

dataset = KarateClub()
print(f'Dataset: {dataset}:')
print('======================')
print(f'Number of graphs: {len(dataset)}')
print(f'Number of features: {dataset.num_features}')
print(f'Number of classes: {dataset.num_classes}')
Output:
Dataset: KarateClub():
======================
Number of graphs: 1
Number of features: 34
Number of classes: 4
2.4 Common Dataset properties and methods
- A graph in a Dataset can be retrieved by indexing, e.g. data = dataset[0]
The return value is an instance of the torch_geometric.data.data.Data class
- len(dataset) returns the number of graphs; a Dataset is essentially a collection of Data objects, so slicing can also be used to pull out individual Data
- num_classes: the number of classes in the dataset
- num_features or num_node_features: the feature dimension of each node
- shuffle(return_perm: bool = False): randomly permutes the dataset; equivalent to indexing with torch.randperm(len(dataset))
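As a sketch of what shuffle() does, here is the equivalent random permutation written in plain Python (the list of strings is a hypothetical stand-in for a Dataset of Data objects):

```python
import random

# Hypothetical stand-in for a PyG Dataset: a real one holds Data objects.
dataset = ["graph_0", "graph_1", "graph_2", "graph_3"]

# dataset.shuffle() is conceptually torch.randperm(len(dataset))
# applied as an index:
perm = list(range(len(dataset)))
random.Random(42).shuffle(perm)
shuffled = [dataset[i] for i in perm]

# Shuffling permutes the graphs but never adds or drops any.
```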
3. Data
Documentation for the Data class: torch_geometric.data.Data
Data is the Python object PyG uses to represent a graph.
3.1 Data attributes
A Data object can hold many attributes; the default ones are:
x (Tensor) - node feature matrix, with shape [num_nodes, num_node_features]
edge_index (LongTensor) - graph connectivity in COO format (coordinate format), with shape [2, num_edges] and dtype torch.long
For example, it can be thought of as a tuple of two lists: the first holds the source node of each edge, the second the target node (for an undirected graph, both directions must be written out)
edge_attr - edge feature matrix, with shape [num_edges, num_edge_features]
y (Tensor) - target matrix for the graph or its nodes (shape can be [num_nodes, *] or [1, *])
pos - node position matrix (e.g. spatial coordinates for point clouds or meshes), with shape [num_nodes, num_dimensions]
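The COO layout of edge_index can be illustrated without PyG at all; the sketch below uses plain lists for the 3-node path graph 0–1–2 (the same toy graph as in the creation examples below):

```python
# COO ("coordinate") connectivity for the path graph 0-1-2, written
# with plain lists instead of a LongTensor. Row 0 holds source nodes,
# row 1 holds target nodes; an undirected edge is stored twice,
# once per direction.
edge_index = [
    [0, 1, 1, 2],   # sources
    [1, 0, 2, 1],   # targets
]

num_edges = len(edge_index[0])   # 4 directed = 2 undirected edges
edges = set(zip(edge_index[0], edge_index[1]))

# Every edge appears in both directions, so the graph is undirected.
is_undirected = all((t, s) in edges for (s, t) in edges)
```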
Indexing a Data object with an attribute name returns the corresponding matrix. Example code:
print(data['x'])
Attribute names and values can also be iterated over, dict-style. Example code:
for key, item in data:
    pass
Checking whether an attribute is present in a Data object, example code:
'edge_attr' in data
3.2 Creating a Data object
3.2.1 Creating a Data object: example 1
import torch
from torch_geometric.data import Data

edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]], dtype=torch.long)
x = torch.tensor([[-1], [0], [1]], dtype=torch.float)

data = Data(x=x, edge_index=edge_index)
Data(edge_index=[2, 4], x=[3, 1])
3.2.2 Creating a Data object: example 2
If you want to build the edge_index attribute from a list of index tuples, transpose it and call the contiguous method:
import torch
from torch_geometric.data import Data

edge_index = torch.tensor([[0, 1],
                           [1, 0],
                           [1, 2],
                           [2, 1]], dtype=torch.long)
x = torch.tensor([[-1], [0], [1]], dtype=torch.float)

data = Data(x=x, edge_index=edge_index.t().contiguous())
Data(edge_index=[2, 4], x=[3, 1])
3.3 Printing a Data object
Taking the KarateClub graph as an example, printing it gives: Data(edge_index=[2, 156], train_mask=[34], x=[34, 34], y=[34])
train_mask is the training-set mask, a vector of booleans marking the nodes whose true community is known (4 nodes in this example).
x is a matrix of 34 observations with 34-dimensional feature vectors; y is a vector of 34 one-dimensional targets.
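The masking used during training (expressions like out[data.train_mask]) can be sketched in plain Python with toy labels (hypothetical values, not the real KarateClub data):

```python
# train_mask semantics on a toy 6-node graph: True marks a node whose
# label is used for training. (Plain-Python sketch of boolean masking.)
y          = [0, 1, 2, 3, 0, 1]                      # community of each node
train_mask = [True, False, True, False, True, False]

# y[train_mask] in torch corresponds to selecting only masked entries:
train_labels = [label for label, m in zip(y, train_mask) if m]
num_train    = sum(train_mask)                       # mask.sum() in torch
```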
3.4 Important Data properties and methods
- keys: returns a list of the graph's attribute names
- num_nodes: the number of nodes in the graph
- num_edges: the number of edges in the graph (for an undirected graph this counts both directions, i.e. twice the number of unique edges)
- num_node_features or num_features: the node feature dimension
- contains_isolated_nodes(): whether the graph contains isolated nodes
- contains_self_loops(): whether the graph contains self-loops
- is_undirected(): whether the graph is undirected
- is_directed(): whether the graph is directed
- to(device, *keys, **kwargs): moves the attributes (or only the named ones) to the given device
Example code (again using the KarateClub data):
print(f'Number of nodes: {data.num_nodes}')
print(f'Number of edges: {int(data.num_edges / 2)}')
print(f'Average node degree: {data.num_edges / data.num_nodes:.2f}')  # 2E/N
print(f'Number of training nodes: {data.train_mask.sum()}')
print(f'Training node label rate: {int(data.train_mask.sum()) / data.num_nodes:.2f}')
print(f'Contains isolated nodes: {data.contains_isolated_nodes()}')
print(f'Contains self-loops: {data.contains_self_loops()}')
print(f'Is undirected: {data.is_undirected()}')
Output:
Number of nodes: 34
Number of edges: 78
Average node degree: 4.59
Number of training nodes: 4
Training node label rate: 0.12
Contains isolated nodes: False
Contains self-loops: False
Is undirected: True
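The "Average node degree" line above is just 2E/N: num_edges counts both directions (156), so dividing by the 34 nodes gives the mean degree. A quick check in plain Python:

```python
# Average degree = 2E/N, using num_edges, which already counts
# both directions of every undirected edge.
num_directed_edges = 156   # data.num_edges for KarateClub
num_nodes = 34

avg_degree = num_directed_edges / num_nodes
rounded = round(avg_degree, 2)   # matches the printed value
```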
- Conversion to a NetworkX Graph or DiGraph: torch_geometric.utils.to_networkx(data, to_undirected=False)
data is a Data object.
If to_undirected is set to True, a Graph is returned; otherwise a DiGraph.
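Conceptually, to_undirected=True collapses the two directed copies of each edge into a single undirected edge. A plain-Python sketch of that collapse (not PyG's actual implementation):

```python
# Collapse directed edge pairs into undirected edges by treating each
# edge as an unordered set of endpoints.
directed = [(0, 1), (1, 0), (1, 2), (2, 1)]

undirected = {frozenset(e) for e in directed}
num_undirected = len(undirected)   # 2 undirected edges from 4 directed
```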
Example code:
from torch_geometric.utils import to_networkx

G = to_networkx(data, to_undirected=True)
Visualizing it with visualize(G, color=data.y):
4. Mini-batch
4.1 DataLoader
DataLoader official documentation
CLASS DataLoader(dataset, batch_size=1, shuffle=False, follow_batch=[], exclude_keys=[], **kwargs)
Example code:
from torch_geometric.datasets import TUDataset
from torch_geometric.data import DataLoader

dataset = TUDataset(root='/tmp/ENZYMES', name='ENZYMES', use_node_attr=True)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for batch in loader:
    batch
    >>> Batch(batch=[1082], edge_index=[2, 4066], x=[1082, 21], y=[32])
    batch.num_graphs
    >>> 32
4.2 Batch
Batch official documentation
CLASS Batch(batch=None, ptr=None, **kwargs)
batch is a column vector that maps each node to the index of its graph within the Batch.
Example code using batch to compute the average node features per graph:
from torch_scatter import scatter_mean
from torch_geometric.datasets import TUDataset
from torch_geometric.data import DataLoader

dataset = TUDataset(root='/tmp/ENZYMES', name='ENZYMES', use_node_attr=True)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for data in loader:
    data
    >>> Batch(batch=[1082], edge_index=[2, 4066], x=[1082, 21], y=[32])
    data.num_graphs
    >>> 32
    x = scatter_mean(data.x, data.batch, dim=0)
    x.size()
    >>> torch.Size([32, 21])
For the scatter_mean operation, see the pytorch_scatter official documentation.
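What scatter_mean(data.x, data.batch, dim=0) computes can be sketched in plain Python: average the feature rows of all nodes that share the same batch index (toy numbers below, not real ENZYMES features):

```python
# Plain-Python re-implementation of the scatter_mean call above:
# group node features by their batch index, then average per group.
x     = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 nodes, 2 features each
batch = [0, 0, 1]          # nodes 0,1 -> graph 0; node 2 -> graph 1

num_graphs = max(batch) + 1
sums   = [[0.0, 0.0] for _ in range(num_graphs)]
counts = [0] * num_graphs
for feats, g in zip(x, batch):
    counts[g] += 1
    for j, v in enumerate(feats):
        sums[g][j] += v

# One mean feature vector per graph, shape [num_graphs, num_features].
graph_means = [[v / counts[g] for v in sums[g]] for g in range(num_graphs)]
```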
5. Applying a GNN
5.1 Example 1
Model: GCN
Documentation: torch_geometric.nn.GCNConv(in_channels, out_channels)
forward(x: torch.Tensor, edge_index: Union[torch.Tensor, torch_sparse.tensor.SparseTensor], edge_weight: Optional[torch.Tensor] = None)
Original paper: Kipf, T., & Welling, M. (2017). Semi-Supervised Classification with Graph Convolutional Networks. ArXiv, abs/1609.02907.
Example code — building the model:
import torch
from torch.nn import Linear
from torch_geometric.nn import GCNConv

class GCN(torch.nn.Module):
    def __init__(self):
        super(GCN, self).__init__()
        torch.manual_seed(12345)
        self.conv1 = GCNConv(dataset.num_features, 4)
        self.conv2 = GCNConv(4, 4)
        self.conv3 = GCNConv(4, 2)
        self.classifier = Linear(2, dataset.num_classes)

    def forward(self, x, edge_index):
        h = self.conv1(x, edge_index)
        h = h.tanh()
        h = self.conv2(h, edge_index)
        h = h.tanh()
        h = self.conv3(h, edge_index)
        h = h.tanh()  # Final GNN embedding space.
        # Apply a final (linear) classifier.
        out = self.classifier(h)
        return out, h

model = GCN()
print(model)
This model aggregates information from each node's 3-hop neighborhood.
As a representation learner the model outputs h, reducing the dimension 34 → 4 → 4 → 2.
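Why 3 layers mean 3 hops: each GCNConv mixes a node's features with those of its direct neighbors, so stacking k layers lets information travel up to k hops. A plain-Python BFS sketch on a toy path graph (not KarateClub) shows the resulting receptive field:

```python
from collections import deque

# Toy path graph 0-1-2-3-4 as an adjacency list.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

def k_hop_neighborhood(start, k):
    """Nodes reachable from `start` within k hops (including start)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue          # don't expand beyond k hops
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)

# What 3 stacked conv layers can "see" from node 0.
receptive_field = k_hop_neighborhood(0, 3)
```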
Semi-supervised training:
import time

model = GCN()
criterion = torch.nn.CrossEntropyLoss()  # Define loss criterion.
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)  # Define optimizer.

def train(data):
    optimizer.zero_grad()  # Clear gradients.
    out, h = model(data.x, data.edge_index)  # Perform a single forward pass.
    loss = criterion(out[data.train_mask], data.y[data.train_mask])  # Compute the loss solely based on the training nodes.
    loss.backward()  # Derive gradients.
    optimizer.step()  # Update parameters based on gradients.
    return loss, h

for epoch in range(401):
    loss, h = train(data)
    # Visualize the node embeddings every 10 epochs
    if epoch % 10 == 0:
        visualize(h, color=data.y, epoch=epoch, loss=loss)
        time.sleep(0.3)
Each visualization call produces a figure. The final one shows that the network built from 3 GCNConv layers separates the four node classes fairly well.
5.2 Example 2
Loading the Cora dataset:
from torch_geometric.datasets import Planetoid

dataset = Planetoid(root='/tmp/Cora', name='Cora')
>>> Cora()
Applying a 2-layer GCN:
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = GCNConv(dataset.num_node_features, 16)
        self.conv2 = GCNConv(16, dataset.num_classes)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, training=self.training)
        x = self.conv2(x, edge_index)
        return F.log_softmax(x, dim=1)
Training:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Net().to(device)
data = dataset[0].to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

model.train()
for epoch in range(200):
    optimizer.zero_grad()
    out = model(data)
    loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()
Evaluating the model on the test set:
model.eval()
_, pred = model(data).max(dim=1)
correct = int(pred[data.test_mask].eq(data.y[data.test_mask]).sum().item())
acc = correct / int(data.test_mask.sum())
print('Accuracy: {:.4f}'.format(acc))
>>> Accuracy: 0.8150
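The accuracy computation above boils down to counting matches on the masked test nodes; a plain-Python sketch with toy predictions (hypothetical values, not real Cora outputs):

```python
# Accuracy = matches on test nodes / number of test nodes.
pred      = [0, 2, 1, 1, 3]                     # argmax class per node
y         = [0, 2, 2, 1, 0]                     # true labels
test_mask = [True, True, True, False, False]    # which nodes are test nodes

correct = sum(1 for p, t, m in zip(pred, y, test_mask) if m and p == t)
total   = sum(test_mask)
acc     = correct / total
```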