5. Results
In this section, we summarize the results of our experiments and discuss our key findings. We collect top-1 errors for all the sampled relational graphs on different tasks and architectures, and also record the graph measures (average path length L and clustering coefficient C) for each sampled graph. We present these results as heat maps of graph measures vs. predictive performance (Figure 4(a)(c)(f)).
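For concreteness, the two graph measures can be computed with standard graph libraries. Below is a minimal sketch using NetworkX; the Watts-Strogatz graph is only a stand-in for one sampled 64-node relational graph, not the paper's actual sampling procedure.

```python
import networkx as nx

# Illustrative stand-in for one sampled 64-node relational graph;
# the paper's graphs come from its own graph generator, not this one.
g = nx.connected_watts_strogatz_graph(n=64, k=8, p=0.3)

# Average path length L: mean shortest-path distance over all node pairs.
L = nx.average_shortest_path_length(g)
# Clustering coefficient C: average of the per-node clustering coefficients.
C = nx.average_clustering(g)
print(f"L = {L:.2f}, C = {C:.2f}")
```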
5.1. A Sweet Spot for Top Neural Networks
Overall, the heat maps of graph measures vs. predictive performance (Figure 4(f)) show that there exist graph structures that can outperform the complete-graph baselines (the pixel at the bottom right). The best-performing relational graph outperforms the complete-graph baseline by 1.4% top-1 error on CIFAR-10, and by 0.5% to 1.2% for models on ImageNet. Notably, we discover that top-performing graphs tend to cluster into a sweet spot in the space defined by C and L (red rectangles in Figure 4(f)). We follow these steps to identify a sweet spot: (1) we downsample and aggregate the 3942 graphs in Figure 4(a) into a coarse resolution of 52 bins, where each bin records the performance of the graphs that fall into it; (2) we identify the bin with the best average performance (red cross in Figure 4(f)); (3) we conduct a one-tailed t-test of each bin against the best-performing bin and record the bins that are not significantly worse (p-value threshold of 0.05). The minimum-area rectangle that covers these bins is visualized as the sweet spot. For the 5-layer MLP on CIFAR-10, the sweet spot is C ∈ [0.10, 0.50], L ∈ [1.82, 2.75].
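The three-step procedure above can be summarized in a short sketch. This is a minimal illustration rather than the actual implementation: the data layout (a table with columns C, L, and top1_error), the binning grid, the unequal-variance t-test, and the function name find_sweet_spot are all illustrative assumptions.

```python
import numpy as np
import pandas as pd
from scipy import stats

def find_sweet_spot(df, bins=8, alpha=0.05):
    """Sketch of the sweet-spot procedure.

    df: one row per relational graph, with columns 'C', 'L', 'top1_error'
        (column names and the square binning grid are assumptions).
    """
    # (1) Aggregate the graphs into coarse 2-D bins over (C, L).
    df = df.copy()
    df["c_bin"] = pd.cut(df["C"], bins)
    df["l_bin"] = pd.cut(df["L"], bins)
    groups = {key: g["top1_error"].values
              for key, g in df.groupby(["c_bin", "l_bin"], observed=True)}

    # (2) Bin with the best (lowest) mean top-1 error.
    best_key = min(groups, key=lambda k: groups[k].mean())
    best = groups[best_key]

    # (3) One-tailed t-test of each bin against the best bin; keep bins
    # that are not significantly worse (p >= alpha).
    keep = []
    for key, vals in groups.items():
        t, p_two_sided = stats.ttest_ind(vals, best, equal_var=False)
        p_one_tailed = p_two_sided / 2 if t > 0 else 1 - p_two_sided / 2
        if p_one_tailed >= alpha:
            keep.append(key)

    # Sweet spot: minimum-area axis-aligned rectangle covering the kept bins.
    c_lo = min(k[0].left for k in keep); c_hi = max(k[0].right for k in keep)
    l_lo = min(k[1].left for k in keep); l_hi = max(k[1].right for k in keep)
    return (c_lo, c_hi), (l_lo, l_hi)
```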
5.2. Neural Network Performance as a Smooth Function over Graph Measures
In Figure 4(f), we observe that a neural network's predictive performance is approximately a smooth function of the clustering coefficient and average path length of its relational graph. Keeping one graph measure fixed in a small range (C ∈ [0.4, 0.6], L ∈ [2, 2.5]), we visualize network performance against the other measure (shown in Figure 4(b)(d)). We use second-degree polynomial regression to visualize the overall trend. We observe that both the clustering coefficient and the average path length are indicative of neural network performance, demonstrating a smooth U-shape correlation.
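As a small illustration of the fitting step, the sketch below fits a second-degree polynomial to performance against one graph measure while the other is held in a fixed range; the function name and data layout are assumptions, not the paper's plotting code.

```python
import numpy as np

def quadratic_trend(x, y, n_points=100):
    """Fit y ~ a*x^2 + b*x + c by least squares and return a smooth curve
    that visualizes the overall trend (a sketch, not the exact analysis code)."""
    coeffs = np.polyfit(x, y, deg=2)               # least-squares quadratic fit
    xs = np.linspace(x.min(), x.max(), n_points)
    return xs, np.polyval(coeffs, xs)

# Illustrative usage: fix L in [2, 2.5] and regress top-1 error on C.
# mask = (L_all >= 2.0) & (L_all <= 2.5)
# xs, ys = quadratic_trend(C_all[mask], err_all[mask])
```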
5.3. Consistency across Architectures
Figure 5: Quickly identifying a sweet spot. Left: The correlation between sweet spots identified using fewer samples of relational graphs and using all 3942 graphs. Right: The correlation between sweet spots identified at the intermediate training epochs and the final epoch (100 epochs).
Given that the relational graph defines a shared design space across various neural architectures, we observe that relational graphs with certain graph measures may consistently perform well regardless of how they are instantiated.
Qualitative consistency. We visually observe in Figure 4(f) that the sweet spots are roughly consistent across different architectures. Specifically, if we take the intersection of the sweet spots across architectures, we have C ∈ [0.43, 0.50], L ∈ [1.82, 2.28], which is the consistent sweet spot across architectures. Moreover, the U-shape trends between graph measures and the corresponding neural network performance, shown in Figure 4(b)(d), are also visually consistent.
Quantitative consistency. To further quantify this consistency across tasks and architectures, we select the 52 bins in the heat map in Figure 4(f), where the bin value indicates the average performance of relational graphs whose graph measures fall into the bin range. We plot the correlation of the 52 bin values across different pairs of tasks, shown in Figure 4(e). We observe that the performance of relational graphs with certain graph measures correlates across different tasks and architectures. For example, even though a ResNet-34 has much higher complexity than a 5-layer MLP, and ImageNet is a much more challenging dataset than CIFAR-10, a fixed set of relational graphs performs similarly in both settings, indicated by a Pearson correlation of 0.658 (p-value < 10⁻⁸).
5.4. Quickly Identifying a Sweet Spot
Training thousands of relational graphs until convergence might be computationally prohibitive. Therefore, we quantitatively show that a sweet spot can be identified with much less computational cost, e.g., by sampling fewer graphs and training for fewer epochs.
How many graphs are needed? Using the 5-layer MLP on CIFAR-10 as an example, we consider the heat map over 52 bins in Figure 4(f), which is computed using 3942 graph samples. We investigate whether a similar heat map can be produced with far fewer graph samples. Specifically, we sub-sample the graphs in each bin while making sure each bin has at least one graph. We then compute the correlation between the 52 bin values computed using all 3942 graphs and those computed using the sub-sampled graphs, as shown in Figure 5 (left). We can see that bin values computed using only 52 samples have a high 0.90 Pearson correlation with the bin values computed using all 3942 graph samples. This finding suggests that, in practice, far fewer graphs are needed to conduct a similar analysis.
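A minimal sketch of the sub-sampling step is given below, assuming the graphs have already been grouped into bins of top-1 errors; the data layout, the random allocation of the remaining budget, and the function name are illustrative assumptions.

```python
import numpy as np
from scipy.stats import pearsonr

def subsampled_bin_values(bins, n_total, seed=0):
    """Sub-sample graphs so every bin keeps at least one graph, then recompute
    the per-bin mean top-1 error. `bins` is a list of 1-D arrays of errors,
    one array per non-empty bin (an assumed data layout)."""
    rng = np.random.default_rng(seed)
    per_bin = np.ones(len(bins), dtype=int)            # at least one graph per bin
    budget = n_total - len(bins)
    if budget > 0:                                     # spread the rest at random
        per_bin += rng.multinomial(budget, np.ones(len(bins)) / len(bins))
    values = []
    for errors, k in zip(bins, per_bin):
        k = min(k, len(errors))                        # cannot exceed the bin size
        values.append(rng.choice(errors, size=k, replace=False).mean())
    return np.array(values)

# Illustrative usage: correlate bin values from 52 sub-sampled graphs with
# the bin values computed from all 3942 graphs.
# r, p = pearsonr(subsampled_bin_values(bins, 52), full_bin_values)
```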
5.5. Network Science and Neuroscience Connections
Network science. The average path length that we measure characterizes how well information is exchanged across the network (Latora & Marchiori, 2001), which aligns with our definition of a relational graph as consisting of rounds of message exchange. Therefore, the U-shape correlation in Figure 4(b)(d) might indicate a trade-off between message exchange efficiency (Sengupta et al., 2013) and the capability of learning distributed representations (Hinton, 1984). Neuroscience. The best-performing relational graph that we discover surprisingly resembles biological neural networks, as shown in Table 2 and Figure 6. The similarities are two-fold: (1) the graph measures (L and C) of top artificial neural networks are highly similar to those of biological neural networks; (2) with the relational graph representation, we can translate biological neural networks into 5-layer MLPs, and we find that these networks also outperform the baseline complete graphs. While our findings are preliminary, our approach opens up new possibilities for interdisciplinary research in network science, neuroscience, and deep learning.
6. Related Work
Neural network connectivity. The design of neural network connectivity patterns has focused on computational graphs at different granularities: the macro structures, i.e., connectivity across layers (LeCun et al., 1998; Krizhevsky et al., 2012; Simonyan & Zisserman, 2015; Szegedy et al., 2015; He et al., 2016; Huang et al., 2017; Tan & Le, 2019), and the micro structures, i.e., connectivity within a layer (LeCun et al., 1998; Xie et al., 2017; Zhang et al., 2018; Howard et al., 2017; Dao et al., 2019; Alizadeh et al., 2019). Our current exploration focuses on the latter, but the same methodology can be extended to the macro space. Deep Expander Networks (Prabhu et al., 2018) adopt expander graphs to generate bipartite structures. RandWire (Xie et al., 2019) generates macro structures using existing graph generators. However, the statistical relationships between graph structure measures and network predictive performance were not explored in those works. Another related work is Cross-channel Communication Networks (Yang et al., 2019), which aims to encourage neuron communication through message passing, but considers only a complete graph structure.
Neural architecture search. Efforts on learning connectivity patterns at the micro (Ahmed & Torresani, 2018; Wortsman et al., 2019; Yang et al., 2018) or macro (Zoph & Le, 2017; Zoph et al., 2018) level mostly focus on improving learning/search algorithms (Liu et al., 2018; Pham et al., 2018; Real et al., 2019; Liu et al., 2019). NAS-Bench-101 (Ying et al., 2019) defines a graph search space by enumerating DAGs with constrained sizes (≤ 7 nodes, cf. 64-node graphs in our work). Our work points to a new path: instead of exhaustively searching over all possible connectivity patterns, certain graph generators and graph measures could define a smooth space where the search cost could be significantly reduced.
7. Discussions
Hierarchical graph structure of neural networks. As the first step in this direction, our work focuses on graph structures at the layer level. Neural networks are intrinsically hierarchical graphs (from connectivity of neurons to that of layers, blocks, and networks) which constitute a more complex design space than what is considered in this paper. Extensive exploration in that space will be computationally prohibitive, but we expect our methodology and findings to generalize.
Efficient implementation. Our current implementation uses standard CUDA kernels and thus relies on weight masking, which leads to worse wall-clock time performance compared with baseline complete graphs. However, the practical adoption of our discoveries is not far-fetched. Complementary to our work, there are ongoing efforts such as block-sparse kernels (Gray et al., 2017) and fast sparse ConvNets (Elsen et al., 2019) which could close the gap between theoretical FLOPS and real-world gains. Our work might also inform the design of new hardware architectures, e.g., biologically-inspired ones with spike patterns (Pei et al., 2019).
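To make the weight-masking point concrete, here is a minimal PyTorch-style sketch of a dense layer whose weights are masked according to a relational graph; the block assignment of channels to nodes and the class name MaskedLinear are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLinear(nn.Linear):
    """Dense linear layer whose weight matrix is element-wise masked by a fixed
    binary pattern derived from a relational graph (a sketch of the
    weight-masking idea; block assignment of channels to nodes is assumed)."""

    def __init__(self, in_features, out_features, node_mask, bias=True):
        super().__init__(in_features, out_features, bias=bias)
        # node_mask: (n_nodes, n_nodes) 0/1 tensor, adjacency with self-loops.
        # Each node owns a contiguous block of input and output channels.
        blk_out = out_features // node_mask.shape[0]
        blk_in = in_features // node_mask.shape[1]
        mask = node_mask.repeat_interleave(blk_out, dim=0)
        mask = mask.repeat_interleave(blk_in, dim=1)
        self.register_buffer("mask", mask.float())

    def forward(self, x):
        # Zero out weights whose (node, node) pair has no edge in the graph.
        return F.linear(x, self.weight * self.mask, self.bias)

# Illustrative usage: a 64-node graph, 512 hidden units per layer.
# adj = torch.ones(64, 64)            # complete-graph baseline
# layer = MaskedLinear(512, 512, adj)
```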
Prior vs. Learning. We currently utilize the relational graph representation as a structural prior, i.e., we hard-wire the graph structure on neural networks throughout training. It has been shown that deep ReLU neural networks can automatically learn sparse representations (Glorot et al., 2011). A further question arises: without imposing graph priors, does any graph structure emerge from training a (fully-connected) neural network?
Figure 7: Prior vs. Learning. Results for 5-layer MLPs on CIFAR-10. We highlight the best-performing graph when used as a structural prior. Additionally, we train a fully-connected MLP and visualize the learned weights as a relational graph (different points are graphs under different thresholds). The learned graph structure moves towards the "sweet spot" after training but does not close the gap.
As a preliminary exploration, we "reverse-engineer" a trained neural network and study the relational graph structure that emerges. Specifically, we train a fully-connected 5-layer MLP on CIFAR-10 (the same setup as in previous experiments). We then infer the underlying relational graph structure of the network via the following steps: (1) to get nodes in a relational graph, we stack the weights from all the hidden layers and group them into 64 nodes, following the procedure described in Section 2.2; (2) to get undirected edges, the weights are summed with their transposes; (3) we compute the Frobenius norm of the weights as the edge value; (4) we obtain a sparse graph structure by binarizing edge values with a certain threshold.
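These four steps can be sketched as follows; the per-layer aggregation (summing absolute values across hidden layers), the normalization of edge values, the default threshold, and the function name are illustrative assumptions rather than the exact procedure.

```python
import numpy as np

def extract_relational_graph(weights, n_nodes=64, threshold=0.5):
    """Sketch of the reverse-engineering steps described above.

    weights: list of square (d, d) hidden-layer weight matrices from the trained
             fully-connected MLP, with d assumed divisible by n_nodes.
    """
    d = weights[0].shape[0]
    blk = d // n_nodes
    # (1)+(2) Stack the hidden layers and symmetrize (sum with transposes)
    # to obtain an undirected structure.
    w = sum(np.abs(W) + np.abs(W.T) for W in weights)
    # (3) Frobenius norm of each (blk x blk) block as the edge value.
    edge = np.zeros((n_nodes, n_nodes))
    for i in range(n_nodes):
        for j in range(n_nodes):
            block = w[i * blk:(i + 1) * blk, j * blk:(j + 1) * blk]
            edge[i, j] = np.linalg.norm(block, ord="fro")
    edge /= edge.max()
    # (4) Binarize the edge values with a threshold to get a sparse graph.
    adj = (edge >= threshold).astype(int)
    np.fill_diagonal(adj, 0)
    return adj
```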
We show the extracted graphs under different thresholds in Figure 7. As expected, the extracted graphs at initialization follow the patterns of E-R graphs (Figure 3(left)), since weight matrices are randomly i.i.d. initialized. Interestingly, after training to convergence, the extracted graphs are no longer E-R random graphs and move towards the sweet spot region we found in Section 5. Note that there is still a gap between these learned graphs and the best-performing graph imposed as a structural prior, which might explain why a fully-connected MLP has inferior performance.
In our experiments, we also find that there are a few special cases where learning the graph structure can be superior (i.e., when the task is simple and the network capacity is abundant). We provide more discussions in the Appendix. Overall, these results further demonstrate that studying the graph structure of a neural network is crucial for understanding its predictive performance.
Unified view of Graph Neural Networks (GNNs) and general neural architectures. The way we define neural networks as a message exchange function over graphs is partly inspired by GNNs (Kipf & Welling, 2017; Hamilton et al., 2017; Veličković et al., 2018). Under the relational graph representation, we point out that GNNs are a special class of general neural architectures where: (1) graph structure is regarded as the input instead of part of the neural architecture; consequently, (2) message functions are shared across all the edges to respect the invariance properties of the input graph. Concretely, recall how we define general neural networks as relational graphs:
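The definition being recalled is the round-based message exchange from Section 2, restated here as a sketch of the notation (with x_v^(r) the feature of node v at round r, N(v) its neighborhood, f_v the per-node message function, and AGG the aggregation function; the exact notation may differ slightly from the earlier section):

```latex
% One round of message exchange over the relational graph:
x_v^{(r+1)} \;=\; \mathrm{AGG}^{(r)}\!\left(\left\{\, f_v^{(r)}\!\left(x_u^{(r)}\right),\ \forall\, u \in N(v) \,\right\}\right)
% GNNs (the paradigm referred to as Equation 6 below) correspond to the special
% case where the message function is shared across all edges, f_v^{(r)} = f^{(r)},
% and the graph itself is an input rather than part of the architecture.
```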
Therefore, our work offers a unified view of GNNs and general neural architecture design, which we hope can bridge the two communities and inspire new innovations. On one hand, successful techniques in general neural architectures can be naturally introduced to the design of GNNs, such as separable convolution (Howard et al., 2017), group normalization (Wu & He, 2018) and Squeeze-and-Excitation block (Hu et al., 2018); on the other hand, novel GNN architectures (You et al., 2019b; Chen et al., 2019) beyond the commonly used paradigm (i.e., Equation 6) may inspire more advanced neural architecture designs.
8. Conclusion
In sum, we propose a new perspective of using relational graph representations for analyzing and understanding neural networks. Our work suggests a new transition from studying conventional computation architectures to studying the graph structure of neural networks. We show that well-established graph techniques and methodologies offered by other science disciplines (network science, neuroscience, etc.) could contribute to understanding and designing deep neural networks. We believe this could be a fruitful avenue of future research that tackles more complex situations.
Acknowledgments
This work is done during Jiaxuan You's internship at Facebook AI Research. Jure Leskovec is a Chan Zuckerberg Biohub investigator. The authors thank Alexander Kirillov, Ross Girshick, Jonathan Gomes Selman, and Pan Li for their helpful discussions.