0

Graph data consists of node features, labels, and an adjacency matrix where any non-zero element locating at $(u, v)$ of the adjacency matrix denotes a directional edge from node $u$ to $v$. Among various tasks using the graph data, this paper focuses on **node prediction**.

Node prediction

Given a set of node with features $X$, a set of corresponding labels $Y$, and an adjacency matrix $A$, the goal of model $\mathcal{F}$ is to predict a label of an unseen node. The model is parametrized by $\Theta^\star$ which is obatined by following optimization.$\Theta^\star = \argmin_{\Theta} \sum_{v \in V_t}{\mathscr{L}\left(\widehat{Y}_v, Y_v \right)}$Where $V_t$ is a set of training nodes, $\widehat{Y}$ is a prediction, and $Y$ is the true label. The full model $\mathcal{F}$ can be expressed as follows

$\widehat{Y} = \mathcal{F}\left(X, A, \Theta^\star\right)$

In conventional differentially private training algorithms, namely DP-SGD [1], it is possible to train the model without revealing private information from features and labels. The algorithm is designed with an assumption that only three components (features, labels, and model weights) are participating in the entire learning process. However, GNN with node prediction utilizes *connectivity* of nodes with using labels. Often the connectivity was combined with structure of the neural network. An example is Graph Convolutional Network (GCN) introduced by Kipf [2]. The two-layer GCN is defined as follows.

$\widehat{Y} = \mathscr{S}\left( \tilde{A }\mathscr{R}\left(\tilde{A}XW_0\right)W_1\right)$

Where $\mathscr{S}$ and $\mathscr{R}$ are softmax and ReLu activation functions respectively. $\tilde{A}$ is $A+I_n$, which basically is an adjacency matrix with self loop. The equation only considers two-hop neighbors.

Applying DPSGD[1] directly on above equation will fail to conceal connectivity of nodes, as it will only focus on preserving $X$ and $Y$ during the training. Which preserves the exact feature of each node, but the algorithm does not obfuscate $\tilde{A}X$, hence allowing an adversary to see which nodes are connected to which nodes. Thus the goal of the paper is to construe a DP algorithm for node-prediction task that satisfies both training and inference privacy of GNN.

**Inference privacy**: In transductive setting, training nodes and test nodes are separated from the whole graph before training, but they are still possibly interconnected. Upon making an inference, model can accidently include a node in training dataset which happened to be within neighbor of a test node. This can reveal private information of the training node.

The author presented two privacy definitions: edge level adjacent and node level adjacent.

DefinitionEdge-level adjacent graphs

Two graphs $\mathcal{G}$ and $\mathcal{G}'$ are edge-level adjacent if one can be obtained by removing a single edge. Therefore, $\mathcal{G}$ and $\mathcal{G}'$ differ by at most one edge.

DefinitionNode-level adjacent graphs

Two graphs $\mathcal{G}$ and $\mathcal{G}'$ are node-level adjacent if one can be obtained by removing a single node. Therefore, $\mathcal{G}$ and $\mathcal{G}'$ differ by at most one node.

From above definitions, the author define $(\epsilon, \delta)$-edge-level and node-level differentially private algorithms. Intuitivly, it is obfuscating feature embeddings by adding controlled amount of noise to deter detecting edge-level or node-level adjacency. The edge-level dp protects an edge The node-level adjacency is harder to achieve, as removing a node also removes a set of edges connected to the node. Thus, the node-level DP is achieved by removing multiple edges.

The proposed GAP has three components: Encoder, Aggregation and Classification module. The encoder reduces dimension of node features. Aggregation is responsible for perturbing the multi-hop embeddings, and classification is a set of neural networks predicting the label of the node.

The main purpose of the Encoder module is to reduce dimensions of node features. This allows lowering the privacy cost for perturbation.

Encoder is trained by attaching a linear softmax layer after it.

$\widehat{Y} = \mathscr{S}\left( ENC\left(X, \Theta_{enc}\right)\cdot W\right)$

where the encoder is parametrized by $\Theta_{enc}$. $X$ is original node features, $W$ is a weight for linear layer, and $\mathscr{S}$ is a softmax activation function.

The authors proposed private multi-hop aggregation (PMA) algorithm for achieving edge-level and node-level differential privacy. The algorithm is as follows

Parameters | Usage | where |
---|---|---|

$K$ | number of multi-hops | Input |

$\check{X}^{\left(0\right)}$ | Normalized low-level node features | Input |

$\sigma^2$ | Noise variance | Input |

$A$ | Adjacency matrix | Input |

$\check{X}^{\left(1\right)} \cdots \check{X}^{\left(K\right)}$ | Private aggregated node feature matrices | Output |

$\textbf{for } k \in \{1, \cdots K\} \textbf{ do}$

$X^{\left(k\right)} \leftarrow A^T \check{X}^{\left(k-1\right)}$

$\tilde{X}^{\left(k\right)} \leftarrow X^{\left(k\right)} + \mathcal{N}\left(\sigma^2, \mathbb{I}\right)$

$\textbf{for } v \in V \textbf{ do}$$\check{X}^{\left(k\right)} \leftarrow \tilde{X}^{\left(k\right)} / \lVert \tilde{X}^{\left(k\right)} \rVert$

$\textbf{end}$

$\textbf{end}$

$\textbf{return } \check{X}^{\left(1\right)} \cdots \check{X}^{\left(K\right)}$

The algorithm is very straightforward. First, it gets aggregated embedding by multiplying adjacency matrix and node features. Then it adds the noise to every row of $X^{\left(k\right)}$ independently. Finally, normalize each row such that L2 norm of each row is 1.

Each aggregate from each hop is a kind of a node embedding, meaning that it includes connectivity information as well as features of the node. Thus, to effectively hide any presence of edge or node, the author perturbed the aggregate by each row separately.

Each private feature aggregates $\check{X}^{\left(k\right)}$ are passed into a neural network. If there are 5 feature aggregates, then there are 5 individual multi-layer perceptrons handing the aggregates respectively. Finally, everything is combined into a single multi-layer perceptron called the head. The head provides a label prediction.

$H_{head} = \textrm{Combine}\left(H^{\left(0\right)} \cdots H^{\left(K\right)}\right)$

For node-level privacy, encoder and classification moduld has to be trained using DPSGD. Also, node degree has to be bounded to $D$. That is, for each subgraph, the aggregation module need to pick at most $D$ edges for each node.

I am omitting some figures and tables.

This section is more about any insight I had about the result.

The paper only simulates transductive setting. Yet the privacy guarantee holds for both transductive and inductive settings. Hence the implementation only deals with the transductive settings. The performance of GAP-NDP relys on privacy cost. It suffers utility loss than GAP-EDP mainly because

1. Encoder/Classifier has to be trained under DP-SGD

2. Degree has to be bound

The encoder module of the model helps analyzing the node features, hence increasing accuracy. The Facebook data and reddit data has 500 and 600 features respectively. WIth bigger $\varepsilon$, accuracies from models with encoders are increased steeper than models with no encoders. However, when $\varepsilon$ is small, performance of both encoder-aided model and its counterparts are very small, indicating the dimensionality reduction was seldom useful against severe utility loss from lower $\varepsilon$. Obviously, this approach is effective when the graph is simpler yet has various node features.

The number of hops provided tradeoffs. With more hops, the model is able to analyze further neighbors. Intuitively this seems to help the accuracy. However, the accuracy was decreased after several hops. The authors explicated that the loss of utility with larget hops is due to the noise added to each aggregate. Personally I deem it is a bit questionable. Need to further check implementataions.

The maximum degree on GAP-NDP will number of neighbors to consider as well as the amount of noise to be injected. For a dense dataset such as Reddit, the bigger maximum degree is helpful. However, on the other hand, Facebook and Amazon dataset has smaller degree, hence are not able to aid from settings for higher degree. Utility will only decrease for them as more noise will be infused.

For a fair comparison, urgency is on identifying $\varepsilon$. Need to inversely set the $\varepsilon$ to match the privacy cost of DPSGD.

[1] Martin Abadi, Andy Chu, Ian Goodfellow, H BrendanMcMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. **Deep learning with differential privacy**. *In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security*, pages 308–318, 2016.

[2] Thomas N. Kipf and Max Welling. **Semi-Supervised Classification with Graph Convolutional Networks**. *ICLR*, 2017.