GAP: Differentially private Graph Neural Networks with Aggregation Perturbation

Pensacola·2023년 1월 5일
0

1. Problem & Motivation

1.1 Challenges on differentially private GNN

Graph data consists of node features, labels, and an adjacency matrix where any non-zero element locating at (u,v)(u, v) of the adjacency matrix denotes a directional edge from node uu to vv. Among various tasks using the graph data, this paper focuses on node prediction.

Node prediction
Given a set of node with features XX, a set of corresponding labels YY, and an adjacency matrix AA, the goal of model F\mathcal{F} is to predict a label of an unseen node. The model is parametrized by Θ\Theta^\star which is obatined by following optimization.

Θ=arg minΘvVtL(Y^v,Yv)\Theta^\star = \argmin_{\Theta} \sum_{v \in V_t}{\mathscr{L}\left(\widehat{Y}_v, Y_v \right)}

Where VtV_t is a set of training nodes, Y^\widehat{Y} is a prediction, and YY is the true label. The full model F\mathcal{F} can be expressed as follows

Y^=F(X,A,Θ)\widehat{Y} = \mathcal{F}\left(X, A, \Theta^\star\right)

In conventional differentially private training algorithms, namely DP-SGD [1], it is possible to train the model without revealing private information from features and labels. The algorithm is designed with an assumption that only three components (features, labels, and model weights) are participating in the entire learning process. However, GNN with node prediction utilizes connectivity of nodes with using labels. Often the connectivity was combined with structure of the neural network. An example is Graph Convolutional Network (GCN) introduced by Kipf [2]. The two-layer GCN is defined as follows.

Y^=S(A~R(A~XW0)W1)\widehat{Y} = \mathscr{S}\left( \tilde{A }\mathscr{R}\left(\tilde{A}XW_0\right)W_1\right)

Where S\mathscr{S} and R\mathscr{R} are softmax and ReLu activation functions respectively. A~\tilde{A} is A+InA+I_n, which basically is an adjacency matrix with self loop. The equation only considers two-hop neighbors.

Applying DPSGD[1] directly on above equation will fail to conceal connectivity of nodes, as it will only focus on preserving XX and YY during the training. Which preserves the exact feature of each node, but the algorithm does not obfuscate A~X\tilde{A}X, hence allowing an adversary to see which nodes are connected to which nodes. Thus the goal of the paper is to construe a DP algorithm for node-prediction task that satisfies both training and inference privacy of GNN.

  • Inference privacy: In transductive setting, training nodes and test nodes are separated from the whole graph before training, but they are still possibly interconnected. Upon making an inference, model can accidently include a node in training dataset which happened to be within neighbor of a test node. This can reveal private information of the training node.

1.2. Privacy definition and requirements

The author presented two privacy definitions: edge level adjacent and node level adjacent.

Definition Edge-level adjacent graphs
Two graphs G\mathcal{G} and G\mathcal{G}' are edge-level adjacent if one can be obtained by removing a single edge. Therefore, G\mathcal{G} and G\mathcal{G}' differ by at most one edge.

Definition Node-level adjacent graphs
Two graphs G\mathcal{G} and G\mathcal{G}' are node-level adjacent if one can be obtained by removing a single node. Therefore, G\mathcal{G} and G\mathcal{G}' differ by at most one node.

From above definitions, the author define (ϵ,δ)(\epsilon, \delta)-edge-level and node-level differentially private algorithms. Intuitivly, it is obfuscating feature embeddings by adding controlled amount of noise to deter detecting edge-level or node-level adjacency. The edge-level dp protects an edge The node-level adjacency is harder to achieve, as removing a node also removes a set of edges connected to the node. Thus, the node-level DP is achieved by removing multiple edges.

2. Algorithm

The proposed GAP has three components: Encoder, Aggregation and Classification module. The encoder reduces dimension of node features. Aggregation is responsible for perturbing the multi-hop embeddings, and classification is a set of neural networks predicting the label of the node.

2.1. Encoder

The main purpose of the Encoder module is to reduce dimensions of node features. This allows lowering the privacy cost for perturbation.
Encoder is trained by attaching a linear softmax layer after it.

Y^=S(ENC(X,Θenc)W)\widehat{Y} = \mathscr{S}\left( ENC\left(X, \Theta_{enc}\right)\cdot W\right)

where the encoder is parametrized by Θenc\Theta_{enc}. XX is original node features, WW is a weight for linear layer, and S\mathscr{S} is a softmax activation function.

2.2. Aggregation Module

The authors proposed private multi-hop aggregation (PMA) algorithm for achieving edge-level and node-level differential privacy. The algorithm is as follows

Algorithm 1: Private Multi-hop Aggregation

ParametersUsagewhere
KKnumber of multi-hopsInput
Xˇ(0)\check{X}^{\left(0\right)}Normalized low-level node featuresInput
σ2\sigma^2Noise varianceInput
AAAdjacency matrixInput
Xˇ(1)Xˇ(K)\check{X}^{\left(1\right)} \cdots \check{X}^{\left(K\right)}Private aggregated node feature matricesOutput

for k{1,K} do\textbf{for } k \in \{1, \cdots K\} \textbf{ do}

X(k)ATXˇ(k1)X^{\left(k\right)} \leftarrow A^T \check{X}^{\left(k-1\right)}

X~(k)X(k)+N(σ2,I)\tilde{X}^{\left(k\right)} \leftarrow X^{\left(k\right)} + \mathcal{N}\left(\sigma^2, \mathbb{I}\right)

for vV do\textbf{for } v \in V \textbf{ do}

Xˇ(k)X~(k)/X~(k)\check{X}^{\left(k\right)} \leftarrow \tilde{X}^{\left(k\right)} / \lVert \tilde{X}^{\left(k\right)} \rVert

end\textbf{end}

end\textbf{end}

return Xˇ(1)Xˇ(K)\textbf{return } \check{X}^{\left(1\right)} \cdots \check{X}^{\left(K\right)}

The algorithm is very straightforward. First, it gets aggregated embedding by multiplying adjacency matrix and node features. Then it adds the noise to every row of X(k)X^{\left(k\right)} independently. Finally, normalize each row such that L2 norm of each row is 1.

Each aggregate from each hop is a kind of a node embedding, meaning that it includes connectivity information as well as features of the node. Thus, to effectively hide any presence of edge or node, the author perturbed the aggregate by each row separately.

2.3. Classification Module

Each private feature aggregates Xˇ(k)\check{X}^{\left(k\right)} are passed into a neural network. If there are 5 feature aggregates, then there are 5 individual multi-layer perceptrons handing the aggregates respectively. Finally, everything is combined into a single multi-layer perceptron called the head. The head provides a label prediction.

Hhead=Combine(H(0)H(K))H_{head} = \textrm{Combine}\left(H^{\left(0\right)} \cdots H^{\left(K\right)}\right)

Remarks

For node-level privacy, encoder and classification moduld has to be trained using DPSGD. Also, node degree has to be bounded to DD. That is, for each subgraph, the aggregation module need to pick at most DD edges for each node.

3. Experiments

I am omitting some figures and tables.
This section is more about any insight I had about the result.

The paper only simulates transductive setting. Yet the privacy guarantee holds for both transductive and inductive settings. Hence the implementation only deals with the transductive settings. The performance of GAP-NDP relys on privacy cost. It suffers utility loss than GAP-EDP mainly because
1. Encoder/Classifier has to be trained under DP-SGD
2. Degree has to be bound

3.1. Encoder

The encoder module of the model helps analyzing the node features, hence increasing accuracy. The Facebook data and reddit data has 500 and 600 features respectively. WIth bigger ε\varepsilon, accuracies from models with encoders are increased steeper than models with no encoders. However, when ε\varepsilon is small, performance of both encoder-aided model and its counterparts are very small, indicating the dimensionality reduction was seldom useful against severe utility loss from lower ε\varepsilon. Obviously, this approach is effective when the graph is simpler yet has various node features.

3.2 Number of Hops

The number of hops provided tradeoffs. With more hops, the model is able to analyze further neighbors. Intuitively this seems to help the accuracy. However, the accuracy was decreased after several hops. The authors explicated that the loss of utility with larget hops is due to the noise added to each aggregate. Personally I deem it is a bit questionable. Need to further check implementataions.

3.3. Maximum degree

The maximum degree on GAP-NDP will number of neighbors to consider as well as the amount of noise to be injected. For a dense dataset such as Reddit, the bigger maximum degree is helpful. However, on the other hand, Facebook and Amazon dataset has smaller degree, hence are not able to aid from settings for higher degree. Utility will only decrease for them as more noise will be infused.

Outro

For a fair comparison, urgency is on identifying ε\varepsilon. Need to inversely set the ε\varepsilon to match the privacy cost of DPSGD.

References

[1] Martin Abadi, Andy Chu, Ian Goodfellow, H BrendanMcMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pages 308–318, 2016.

[2] Thomas N. Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. ICLR, 2017.

profile
primitive attempt

0개의 댓글