Paper summary: Graph Convolutional Networks
Many types of data, such as transactions, social relationships, and traffic routes, are best represented by graphs. A whole area of deep learning is therefore dedicated to learning prediction tasks on graphs. These models can be used to find fraudulent customers at a bank from a transaction network or to predict possible matches on Tinder. The purpose of this summary is to give an understanding of the graph convolutional architecture as well as the key contributions of the paper that introduced it. The paper presented a breakthrough in graph networks by introducing a new, more efficient architecture that improved accuracy on several citation network benchmarks compared to previous state-of-the-art models.
Graph convolutional networks (GCNs) remain one of the most widely used architectures to this day, and the paper that introduced them is among the most cited, if not the most cited, within graph neural networks. The architecture is a convolutional network for graphs derived from results on approximations of spectral graph convolutions. Its usefulness was demonstrated with experimental results and an improved time bound compared to previous research. The paper also presented a new approach for semi-supervised learning on graphs. An example task on a graph is node classification. In many cases, labels may only be available for a fraction of the nodes, and the new architecture uses the graph structure to include information from the unlabelled nodes in the training of the model.
Background
The context of the paper is a graph G = (V, E) with N nodes whose structure is described by its adjacency matrix A. One example of such a network is a citation network, where each node represents a paper and an undirected edge corresponds to one of the papers citing the other. Each node has C features that describe it; examples of node features are the authors, the subjects they publish in, and the paper's number of references. Some of the nodes also have a label indicating which subject the paper belongs to. The task considered by the GCN paper is to classify each paper (node) into a subject: can we, from the node features and the graph structure, decide the topic of a paper?
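For concreteness, a toy version of this setup could look as follows; the sketch below uses hypothetical values and NumPy, and is not taken from the paper.

import numpy as np

# Toy citation network with N = 4 papers and C = 3 features per paper (hypothetical values)
A = np.array([[0, 1, 1, 0],    # paper 0 is connected to papers 1 and 2
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)   # undirected graph -> symmetric adjacency matrix
X = np.random.rand(4, 3)       # node feature matrix: one row of C features per paper
y = np.array([0, -1, 1, -1])   # subject labels; -1 marks the unlabelled nodes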
Previous research
Neural network architectures for applications on graphs were introduced by, among others, Gori et al. [2]. The original graph neural networks based their architecture on the idea that nodes in a graph represent objects and edges their relationships. A common approach in previous graph neural network models was to train the model on a combined objective that first considers the node features, represented by the matrix X, and then regularizes the model with information about each node's neighbors in the graph. An example of this is to use a loss function of the form

$\mathcal{L} = \mathcal{L}_0 + \lambda \mathcal{L}_{reg}, \qquad \mathcal{L}_{reg} = \sum_{i,j} A_{ij} \, \lVert f(X_i) - f(X_j) \rVert^2 \qquad (1)$
where the first term, $\mathcal{L}_0$, is a supervised loss and the second term, $\mathcal{L}_{reg}$, is a regularization term that minimizes the prediction difference between a node and its neighbors. Belkin et al. [3] is one of the works that proposed a model with an objective similar to Equation (1). The motivation for the regularization term was to smooth predictions over the graph between connected nodes. The approach has the underlying assumption that nodes connected by an edge often have similar labels. By penalizing the prediction distance between first-degree neighbors, the model is encouraged to give similar predictions for neighboring nodes.
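As an illustration of the regularization term in Equation (1), the sketch below computes it directly from the adjacency matrix; it is a minimal NumPy example with illustrative names, assuming dense arrays and an already-computed prediction matrix.

import numpy as np

def graph_regularizer(A, preds):
    """L_reg = sum_ij A_ij * ||f(X_i) - f(X_j)||^2 for dense node predictions preds."""
    diff = preds[:, None, :] - preds[None, :, :]   # pairwise prediction differences, (N, N, C)
    sq_dist = (diff ** 2).sum(axis=-1)             # squared distances, (N, N)
    return (A * sq_dist).sum()

def graph_regularizer_laplacian(A, preds):
    """Equivalent quadratic form using the graph Laplacian Delta = D - A."""
    laplacian = np.diag(A.sum(axis=1)) - A
    return 2.0 * np.trace(preds.T @ laplacian @ preds)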
The main contribution of graph convolutional networks was a model that takes as input not only the node features but also the graph structure, allowing it to learn underlying concepts from the graph structure other than similarity between connected nodes [1].
Graph convolutional networks
The graph convolutional network is based on the same principles as convolutional networks, with a local computation for each node in the network. A graph convolution layer has a weight matrix W, and the same weights are used in the computation for every node. The computation at a particular node v is essentially a mean of the node's and its first-degree neighbors' features weighted by W. Let each node v have a feature representation h; then the next representation for node v is computed as

$h_v^{(l+1)} = \sigma\left( W^{(l)} \sum_{u \in \mathcal{N}(v)} \frac{h_u^{(l)}}{|\mathcal{N}(v)|} + B^{(l)} h_v^{(l)} \right) \qquad (2)$

where $\mathcal{N}(v)$ are the neighbors of v and $|\mathcal{N}(v)|$ is the degree of v. The term B is the weight matrix applied to node v's own representation. The activation function σ is arbitrary; in the experiments of the paper, ReLU was used [1].
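A minimal sketch of the node-wise update in Equation (2), assuming dense NumPy arrays, an adjacency list, and a ReLU activation; the function and variable names are illustrative, not from the paper.

import numpy as np

def gcn_layer_nodewise(H, adj_list, W, B):
    """One layer of Equation (2), computed node by node with shared weights W and self-weight B."""
    n, f_out = H.shape[0], W.shape[1]
    H_next = np.zeros((n, f_out))
    for v, neighbors in enumerate(adj_list):
        if neighbors:
            # Mean of the neighbors' representations, transformed by the shared weights W
            neigh_term = H[neighbors].mean(axis=0) @ W
        else:
            neigh_term = np.zeros(f_out)
        # Self term with its own weight matrix B, followed by the ReLU activation
        H_next[v] = np.maximum(0.0, neigh_term + H[v] @ B)
    return H_next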
The main contribution of the paper is the graph convolutional architecture that, five years later, remains one of the dominant architectures among graph networks. Its conception was important because it demonstrated a successful approach to taking an entire graph, both nodes and edges, as input to a network. A few properties of the architecture are particularly significant. Firstly, a graph convolutional network makes no assumption about the graph structure other than that it is undirected. This also allows the authors to define an architecture with a time complexity that is linear in the number of edges and features, O(|E|FC), where C is the number of input features and F is the number of output features.
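This linear cost comes from implementing the propagation as a sparse-dense matrix product. The sketch below illustrates that idea with SciPy sparse matrices and hypothetical sizes; it is not the authors' code.

import numpy as np
import scipy.sparse as sp

def propagate(A_hat, X, W):
    """One propagation step A_hat @ (X @ W). The sparse product touches each edge once,
    which is where the O(|E|FC) complexity discussed above comes from."""
    return A_hat @ (X @ W)

# Illustrative sizes: 1000 nodes, ~1% edge density, C = 16 input and F = 8 output features
A_hat = sp.random(1000, 1000, density=0.01, format="csr")
X = np.random.rand(1000, 16)
W = np.random.rand(16, 8)
H = propagate(A_hat, X, W)   # shape (1000, 8)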
Further, the design of the architecture has a message-passing property. In the first layer of a graph convolutional network, the resulting features of each node contain information from its first-degree neighbors. In the second layer, the same computation therefore propagates information from the second-degree neighbors of a particular node. This can be seen in Equation (2) by noting that, in the second layer, each hidden representation $h_u^{(1)}$ already contains information from u's first-degree neighbors.
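A quick way to see this growth of the receptive field is to propagate a one-hot signal through a small path graph twice; this toy check is illustrative and not from the paper.

import numpy as np

# Path graph 0 - 1 - 2 - 3: node 2 is two hops away from node 0
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_hat = A + np.eye(4)                       # add self-loops so each node keeps its own signal

signal = np.array([1.0, 0.0, 0.0, 0.0])     # information placed only at node 0
after_one_layer = A_hat @ signal            # nonzero at nodes 0 and 1 (one hop)
after_two_layers = A_hat @ after_one_layer  # nonzero at nodes 0, 1, and 2 (two hops)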
Model architecture
The graph convolutional network architecture was shown, at the time, to outperform previous state-of-the-art models on well-known benchmarks. In their experiments, Kipf and Welling [1] used a simple model consisting of two graph convolutional layers. In matrix notation, the model is

$Z = f(X, A) = \mathrm{softmax}\left( \hat{A} \, \mathrm{ReLU}\left( \hat{A} X W^{(0)} \right) W^{(1)} \right) \qquad (3)$

where $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$, with $\tilde{A} = A + I_N$, is the matrix form of the normalizing constant of $h_u$ in (2). The output Z contains a prediction probability for each class and each node. The model is trained with a cross-entropy loss over all nodes that have labels; the outputs for nodes without labels are ignored, as is seen in Figure 1. The reason the paper was a step forward in semi-supervised learning on graphs is that the adjacency matrix is used inside the model. Through back-propagation, the loss signal from the labeled nodes therefore also reaches the unlabelled nodes.
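The sketch below is a minimal NumPy version of the two-layer model in Equation (3) and its masked cross-entropy loss; all names are illustrative, and it assumes dense matrices rather than the sparse implementation used in the paper.

import numpy as np

def normalize_adjacency(A):
    """A_hat = D_tilde^{-1/2} (A + I) D_tilde^{-1/2}: normalized adjacency with self-loops."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def two_layer_gcn(X, A, W0, W1):
    """Forward pass of Equation (3): Z = softmax(A_hat ReLU(A_hat X W0) W1)."""
    A_hat = normalize_adjacency(A)
    H = np.maximum(0.0, A_hat @ X @ W0)   # first graph convolution + ReLU
    return softmax(A_hat @ H @ W1)        # second graph convolution + softmax

def masked_cross_entropy(Z, labels, labeled_mask):
    """Cross-entropy evaluated only on labeled nodes; outputs of unlabeled nodes are ignored."""
    log_probs = np.log(Z[labeled_mask, labels[labeled_mask]] + 1e-12)
    return -log_probs.mean()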
Experiments
The model was benchmarked on the datasets presented in Table 1. These datasets are mainly citation networks, with the task of classifying each node into one of several classes. In the citation networks, each edge represents a citation link. The label rate in Table 1 refers to the fraction of nodes that are labeled.
The results presented in Table 2 show that on these benchmarks the GCN architecture [1] outperformed earlier state-of-the-art models in accuracy and, in some cases, training time. Further, the results show that apart from citation networks the architecture is also effective on knowledge graphs such as the NELL dataset. The accuracy reported in Table 2 is averaged over 100 runs of the model. On many of the benchmarks, the margin by which GCN is better is substantial. The authors propose that one reason is that the model does not assume that edges only encode similarity between nodes, as the ManiReg model does [3]. This hypothesis is in line with the assumption that giving the model more relevant information about the problem improves performance.
Conclusion and future research
The most important takeaway from Kipf and Welling's work [1] is how the GCN architecture allows information to pass through the graph layer by layer and how the underlying graph structure is learned from the adjacency matrix. The authors, however, leave several questions unanswered. Many graphs today are far larger than what is feasible to hold in memory, and the computation Kipf and Welling [1] introduce requires the full graph to fit in memory. Modern approaches have therefore had to consider ways of making the computation sparser or of splitting the graph into batches and performing the computation on the pieces.
After GCNs, several variations and extensions have been proposed in the literature. Some of these include graph attention networks (GATs), which incorporate attention mechanisms to weigh the importance of different nodes in the graph [10]. Additionally, there have been efforts to improve the efficiency and scalability of GCNs, such as through the use of hierarchical graph convolutions [11] and multi-scale graph convolutions [12]. Overall, the field of graph neural networks continues to evolve and advance, with new developments and innovations being proposed regularly. In conclusion, with a simple model of only two layers, the GCN architecture outperformed previous approaches on several benchmarks and became an important link in the chain of graph neural network advancements.
References
[1] Thomas N. Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. Feb. 22, 2017. doi: 10.48550/arXiv.1609.02907. arXiv: 1609.02907[cs, stat]. url: http://arxiv.org/abs/1609.02907.
[2] M. Gori, G. Monfardini, and F. Scarselli. “A new model for learning in graph domains”. In: Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005. International Joint Conference on Neural Networks 2005. Vol. 2. Montreal, Que., Canada: IEEE, 2005, pp. 729–734. isbn: 9780780390485. doi: 10.1109/IJCNN.2005.1555942. url: http://ieeexplore.ieee.org/document/1555942/.
[3] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. “Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples”. In: Journal of Machine Learning Research 7.85 (2006), pp. 2399–2434. issn: 1533-7928. url: http://jmlr.org/papers/v7/belkin06a.html.
[4] Jason Weston et al. “Deep Learning via Semi-Supervised Embedding”. In: Neural Networks Tricks of the Trade, Reloaded. 2013.
[5] Xiaojin Zhu, Zoubin Ghahramani, and John Lafferty. “Semi-supervised learning using Gaussian fields and harmonic functions”. In: Proceedings of the Twentieth International Conference on International Conference on Machine Learning. ICML’03. Washington, DC, USA: AAAI Press, Aug. 21, 2003, pp. 912–919. isbn: 9781577351894.
[6] Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. Revisiting Semi-Supervised Learning with Graph Embeddings. May 26, 2016. doi: 10.48550/arXiv.1603.08861. arXiv: 1603.08861[cs]. url: http://arxiv.org/abs/1603.08861.
[7] David K. Hammond, Pierre Vandergheynst, and Rémi Gribonval. Wavelets on Graphs via Spectral Graph Theory. Dec. 18, 2009. doi: 10.48550/arXiv.0912.3848. arXiv: 0912.3848[cs,math]. url: http://arxiv.org/abs/0912.3848.
[8] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. “DeepWalk: Online Learning of Social Representations”. In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. Aug. 24, 2014, pp. 701–710. doi: 10.1145/2623330.2623732. arXiv: 1403.6652[cs]. url: http://arxiv.org/abs/1403.6652.
[9] Lise Getoor. “Link-based Classification”. In: Advanced Methods for Knowledge Discovery from Complex Data. Ed. by Sanghamitra Bandyopadhyay et al. Advanced Information and Knowledge Processing. London: Springer, 2005, pp. 189–207. isbn: 9781846282843. doi: 10.1007/1-84628-284-5_7. url: https://doi.org/10.1007/1-84628-284-5_7.
[10] Petar Veličković et al. Graph Attention Networks. Feb. 4, 2018. doi: 10.48550/arXiv.1710.10903. arXiv: 1710.10903[cs,stat]. url: http://arxiv.org/abs/1710.10903.
[11] Fenyu Hu et al. Hierarchical Graph Convolutional Networks for Semi-supervised Node Classification. June 10, 2019. doi: 10.48550/arXiv.1902.06667. arXiv: 1902.06667[cs,stat]. url: http://arxiv.org/abs/1902.06667.
[12] Sami Abu-El-Haija et al. N-GCN: Multi-scale Graph Convolution for Semi-supervised Node Classification. Feb. 24, 2018. doi: 10.48550/arXiv.1802.08888. arXiv: 1802.08888[cs,stat]. url: http://arxiv.org/abs/1802.08888.