[Stanford CS224W] Lecture2

Jaehee Kim·2024년 3월 24일

Graph

목록 보기

2/8

Lecture 2.1 - Traditional Feature-based Methods: Node

Traditional ML Pipeline

Desgin features for nodes/links/graphs
Obtain features for all training data
Structural feature에 집중해서 알아본다. Network의 주변에 무엇이 있는지, 전체적인 graph의 특징은 무엇인지..
Train an ML model
Random forest
SVM
Neural network, etc
Apply the model
Given a new node/link/graph, obtain its features and make a prediction

This Lecture: Feature Design

Using effective features over graphs is the key to achieving good test performance.
Traditional ML pipeline uses hand-designed features
In this lecture, we overview the traditional features for:

Node-level prediction
Link-level prediction
Graph-level prediction

For simplicity, we focus on undirected graphs.

Machine Leraning in Graphs

Goal : Make predictions for a set of objects
Design choices
Features : d-dimensional vectors
Objects : Nodes, edges, sets of nodes, entire graphs
Objective function : What task are we aiming to solve?
해당 함수를 어떻게 학습시킬것인가????

Node-level Tasks

Node classification → ML needs features
Goal : Characterize the structure and position of a node in the network: node degree, node centrality, clustering coefficient, graphlets

Node Degree

The degre k_v of node v is the number of edges (neighboring nodes) the node has.
Treats all neighboring nodes equally

Node Centrality

Node degree counts the neighboring nodes without capturing their importance.
Node centrality c_v takes the node importance in a graph into account
Engienvector centrality, Betweenness centrality, Closeness centrality, etc..

Eigenvector centrality

Betweenness centrality

Closeness centrality

Clustering Coefficient

어떻게 주변 노드들과 연결되어 있는가?

Graphlets

Observation : Clustering coefficient counts the #(triangles) in the ego-network
We can generalize the above by counting #(pre-specified subgraphs, i.e., graphlets)
Rooted connected non-isomorphic subgraphs
Graphlet Degree Vector (GDV) : Graphlet-base features for nodes, A count vector of graphlets rooted at a given node!
Degree count #(edges) that a node touches
Clustering coefficient counts #(triangles) that a node touches.
GDV counts #(graphlets) that a node touches
Considering graphlets on 2 to 5 nodes we get:
Vector of 73 coordinates is a signature of a node that describes the topology of node's neighborhood
Captures its interconnectivities out to a distance of 4 hops
Graphlet degree vector provides a measure of a node's local network topology
Comparing vectors of two nodes provides a more detailed measure of local topological similarity than node degrees or clustering coefficient.

Lecture 2.2 - Traditional Feature-based Methods: Link

Link Prediction as a Task

Two formulations of the link prediction task
1. Links missing at random:
Remove a random set of links and them aim to predict them
2. Links over time:

Link Prediction via Proximity

Link-Level Features

Distance-based feature

Shortest-path distance between two nodes

Local Neighborhood Overlap

Captures # neighboring nodes shared between two nodes v1 and v2

Global Neighborhood Overlap

Limitaion of Local neighborhood features: Metric is always zero if the two nodes do not have any neighbors in common
However, the two nodes may still potentially be connected in the future
Global neighborhood overlap metrics resolve the limitation by considering the entire graph.
Katz index: count the number of paths of all lengths between given pair of nodes
Powers of adjacency matrix
How to compute # paths between two nodes?

Lecture 2.3 - Traditional Feature-based Methods: Graph

Graph-Level Features

Goal : We want features that characterize the structure of an entire graph

Background: Kernel Methods

Idea: Design Kernels instead of feature vectors.
Graph Kernels: Measure similarity betwen two graphs:
Graphlet Kernel, Wisfeiler-Lehman Kernel

Graph Kernel: Key Idea

Both Graphlet Kernel and Weisfeiler-Lehman (WL) Kernel use Bag-of-representation of graph, where is more sophisticated than node degrees!

Graphlet Features

Key Idea: Count the number of different graphlets in a graph
✓ Definition of graphlets here is slightly different from node-level features.
✓ The two differences are:
- Nodes in graphlets here do not need to be connected
- The graphlets here are not rooted.
- Example:
Given two graphs, G and G', graphlet kernel is computed as
Problem: if G and G' have different sizes, that will greatly skew the value.
Solution: normalize each feature vector
Limitations: Counting graphlets is expensive!
❍ Counting size-k graphlets for a graph with size n by enumeration takes n^k
❍ This is unavoidable in the worst-case since subgraph isomorphism test is NP-hard
❍ If a graph's node degree is bounded by d, an O(nd^{k-1}) algorithm exists to count all the graphlets of size k

Weisfeiler-Lehman Kernel

Goal : design an efficient graph feature descriptor Φ(G)
Idea: use neighborhood structure to iteratively enrich node vocabulary
❍ Generalized version of Bag of node degrees since node degrees are one-hop neighborhood information

Example of color refinement given two graphs:

→

→

→
After color refinement, WL kernel counts number of nodes with a given color.
The WL kernel value is computed by the inner product of the color count vectors:
WL kernel is computationally efficient
✓ The time complexity for color refinement at each step is linear in #(edges), since it involves aggregating neighboring colors.
When computing a kernel value, only colors appeared in the two graphs need to be tracked.
✓ Thus, #(colors) is at most the total number of nodes.
Counting colors takes linear-time w.r.t. #(nodes).
In total, time complexity is linear in #(edges).