UGM (Undirected Graphical Model)

홍성우 · January 11, 2021

Random Field

Defn
A random field is a random function over an arbitrary domain, i.e. it is a function f(x) that takes on a random value at each point x \in R^n. By modern definitions, a random field is a generalization of a stochastic process where the underlying parameter can take multidimensional vectors or points on some manifold.

Markov Random Field (Markov Network)

Defn
A set of random variables having a Markov property described by an undirected graph, i.e. a random field that satisfies the Markov properties.

  • Equivalent to an undirected graphical model.
  • Directed graphical models are called Bayesian networks.

Markov Properties

  • Global Property
    Consider disjoint subsets A, B, and C of the Markov network. A and C are conditionally independent given B iff all paths connecting A and C pass through B.
  • Local Property (also called the Markov blanket property)
    A node is conditionally independent of the rest of the network given its neighbors.
  • Pairwise Property
    Any two non-adjacent nodes in the Markov network are conditionally independent given the rest of the network (the chain example below illustrates all three properties).
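For example, in the chain network x_1 - x_2 - x_3 - x_4 (a small illustration of my own, not from the original post): the global property gives \{x_1\}\perp\{x_4\}\mid\{x_2,x_3\}, since every path from x_1 to x_4 passes through \{x_2,x_3\}; the local property gives x_1\perp\{x_3,x_4\}\mid x_2, since x_2 is the only neighbor of x_1; and the pairwise property gives x_1\perp x_3\mid\{x_2,x_4\}, since x_1 and x_3 are not adjacent.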

Gibbs Random Field

Defn
When the joint probability density of the random variables in a Markov random field is strictly positive, it is also referred to as a Gibbs random field.

Intuition of Why
The Hammersley-Clifford theorem states that a strictly positive probability density can be represented by a Gibbs measure for an appropriate energy function.

Markov Network

For a set of variables X=\{x_1,...,x_n\}, a Markov network is defined as a product of potentials on subsets of the variables X_c \subset X:

p(x_1,...,x_n)=\frac{1}{Z}\prod_{c=1}^C \phi_c(X_c)

Where,

  • \phi_c is called a potential and satisfies \phi_c \geq 0 (it does not have to be a probability)
  • X_c is a maximal clique
  • Z is a normalizing factor (the partition function) that makes the product sum to 1 over all configurations
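As a concrete illustration, here is a minimal sketch (my own, with hypothetical potential tables and helper names) that builds this factorization for three binary variables and computes Z by brute-force enumeration:

```python
import itertools
import numpy as np

# Hypothetical pairwise potentials on the cliques {x1,x2} and {x2,x3};
# each table is indexed by the binary states of its two variables.
phi_12 = np.array([[4.0, 1.0],
                   [1.0, 4.0]])   # favors x1 == x2
phi_23 = np.array([[4.0, 1.0],
                   [1.0, 4.0]])   # favors x2 == x3

def unnormalized(x1, x2, x3):
    """Product of clique potentials for one configuration."""
    return phi_12[x1, x2] * phi_23[x2, x3]

# Partition function Z: sum of the potential product over all configurations.
states = [0, 1]
Z = sum(unnormalized(*x) for x in itertools.product(states, repeat=3))

# Normalized joint p(x1, x2, x3) = (1/Z) * prod_c phi_c(X_c).
joint = {x: unnormalized(*x) / Z for x in itertools.product(states, repeat=3)}

print("Z =", Z)
print("p(0,0,0) =", joint[(0, 0, 0)])
print("sum of joint =", sum(joint.values()))  # ~1.0
```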

Difference between Markov Network and Bayesian Network

  • Markov networks allow symmetrical dependencies between variables (undirected graphical model).
  • Markov networks allow cycles, and so cyclical dependencies can be represented.

Conditional Independence:

Global Markov Property: X_A\perp X_B \mid X_C iff C separates A from B (every path connecting A and B passes through C). Such a distribution is also referred to as a Gibbs distribution.

Markov Blanket (local property): The set of nodes that renders a node t conditionally independent of all the other nodes in the graph.

t\perp V - mb(t) - \{t\} \mid mb(t)

where V: all nodes in the graph,
mb(t): the Markov blanket of t.

Clique

Defn
A clique is a subset of vertices of an undirected graph such that every two distinct vertices in the clique are adjacent, i.e. there exists an edge between every two vertices in the subset.

Example 1)

p(x_1,x_2,x_3,x_4)=\phi(x_1,x_2)\phi(x_2,x_3)\phi(x_3,x_4)\phi(x_4,x_1)/Z_a

Example 2)

p(x_1,x_2,x_3,x_4)=\phi(x_1,x_2,x_3,x_4)/Z_b

Example 3)

p(x_1,x_2,x_3,x_4,x_5,x_6)=\phi(x_1,x_2,x_4)\phi(x_2,x_3,x_4)\phi(x_3,x_5)\phi(x_3,x_6)/Z_c

Clique Potential
The potentials \phi cannot be interpreted as marginals or conditionals.
They can only be interpreted as a general "compatibility" or "relationship" over their variables, not as probability distributions.
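To see this numerically, the following sketch (hypothetical potential tables for the 4-cycle of Example 1; the setup is mine, not from the post) compares a normalized pairwise potential with the marginal it actually induces; in general the two differ:

```python
import itertools
import numpy as np

# Hypothetical potentials for the cycle x1-x2-x3-x4-x1 of Example 1 (binary states).
phi = np.array([[5.0, 1.0],
                [1.0, 5.0]])   # the same table reused on every edge, favoring agreement

def score(x1, x2, x3, x4):
    return phi[x1, x2] * phi[x2, x3] * phi[x3, x4] * phi[x4, x1]

configs = list(itertools.product([0, 1], repeat=4))
Z = sum(score(*c) for c in configs)

# Marginal p(x1, x2) obtained by summing the joint over x3 and x4.
p12 = np.zeros((2, 2))
for x1, x2, x3, x4 in configs:
    p12[x1, x2] += score(x1, x2, x3, x4) / Z

print("normalized potential:\n", phi / phi.sum())
print("marginal p(x1, x2):\n", p12)
# The tables differ: the potential encodes local compatibility, not a marginal.
```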

Maximal Clique

Defn
A maximal clique is a clique that cannot be extended by including one more adjacent vertex, meaning it is not a subset of a larger clique.

UGM representation

Example)

1. Maximal Clique Representation

P'(x_1,x_2,x_3,x_4)=\frac{1}{Z}\phi_c(x_{124})\phi_c(x_{234}),\qquad Z=\sum_{x_1,x_2,x_3,x_4}\phi_c(x_{124})\phi_c(x_{234})

2. Sub-Clique Representation: Pairwise MRFs

P''(x_1,x_2,x_3,x_4)=\frac{1}{Z}\prod_{ij}\phi_{ij}(x_{ij})=\frac{1}{Z}\phi_{12}(x_{12})\phi_{14}(x_{14})\phi_{24}(x_{24})\phi_{23}(x_{23})\phi_{34}(x_{34}),\qquad Z=\sum_{x_1,x_2,x_3,x_4}\prod_{ij}\phi_{ij}(x_{ij})
  • The two distributions P' and P'' have equivalent I-maps, i.e. I(P') = I(P''), since both factorizations come from the same graph.

Hammersley-Clifford Theorem

  • If arbitrary potentials are utilized in the following product formula for probabilities,

    P(x_1,...,x_n)=\frac{1}{Z}\prod_{c\in C}\phi_c(x_c),\qquad Z=\sum_{x_1,...,x_n}\prod_{c\in C}\phi_c(x_c)

    then the family of probability distributions obtained is exactly that set which respects the qualitative specification (conditional independence relations) described earlier.

  • Thm: Let P be a positive distribution over V, and H a Markov network graph over V. If H is an I-map for P, then P is a Gibbs distribution over H.

  • The theorem states that the process is a Markov random field iff its joint distribution is a Gibbs distribution over the graph.

Proof
Let \{X_t:t\in T\} be a finite collection of random variables with X_t taking values in a finite set S_t. Let the index set T be \{1,...,n\} and S_t=\{0,...,m_t\}. The joint distribution of the variables is

Q\{x\}=P\{X_t=x_t \text{ for } t\in T\}\qquad \text{where } x=(x_1,...,x_n)

with 0\leq x_t \leq m_t. The vector X=(X_1,...,X_n) takes values in \mathcal{X}=\prod_{t\in T} S_t, the set of all n-tuples x=(x_1,...,x_n) with x_t\in S_t for each t.
Suppose T is the set of nodes of a graph. Let N_t denote the set of nodes s (the neighbors of t) for which (t,s) is an edge of the graph.

Definition
The process is said to be a Markov random field if
(i) Q\{x\}>0 for every x in \mathcal{X}
(ii) for each t and x,

P\{X_t=x_t \mid X_s=x_s \text{ for } s\neq t\} = P\{X_t=x_t \mid X_s=x_s \text{ for } s\in N_t\}.

This simply means that a node is conditionally independent of all the other nodes in the graph given its neighbors.

The conditional probability P\{X_t=x_t \mid X_s=x_s \text{ for } s\in N_t\} depends only on x_s for s\in\{t\}\cup N_t.

A subset A of T is said to be complete if each pair of vertices in A defines an edge of the graph. \Gamma denotes the collection of all complete subsets. "Clique" is another term for a complete subset.

Definition
The probability distribution Q is called a Gibbs distribution for the graph if it can be written in the form

Q\{x\}=\prod_{A\in \Gamma}V_A(x),

where each V_A is a positive function that depends on x only through the coordinates \{x_t:t\in A\}.

The Hammersley-Clifford theorem asserts that the process \{X_t:t\in T\} is a Markov random field iff the corresponding Q is a Gibbs distribution.

Factor Graphs

  • A factor graph is a graphical model representation that unifies directed and undirected models.
  • It is a bipartite graph composed of round (circle) nodes and square nodes.
  • Round nodes represent the random variables of the graph.
  • Square nodes represent the factors of the graph, i.e. the relationships among the variables they connect (the edges or cliques of the original graph).

Example)
The original UDG is given by the following figure, where vertices 1-4 are random variables and edges are dependence relationships.

The graph above represents factorization by

P(x_1,x_2,x_3,x_4)=\frac{1}{Z}\phi_{124}(x_1,x_2,x_4)\phi_{234}(x_2,x_3,x_4)

where Z is a normalization constant.

The graph above represents factorization of the original UDG by

p(x_1,x_2,x_3,x_4)=\frac{1}{Z}\phi_{12}(x_1,x_2)\phi_{14}(x_1,x_4)\phi_{24}(x_2,x_4)\phi_{23}(x_2,x_3)\phi_{34}(x_3,x_4)

where Z is a normalization constant.
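A factor graph can be stored quite literally as a bipartite structure. The sketch below (hypothetical tables and names, assuming binary variables) lists variable nodes and factor nodes explicitly and evaluates the unnormalized product over the factors, matching the maximal-clique factorization above:

```python
import itertools
import numpy as np

# Variable (round) nodes on one side, factor (square) nodes on the other;
# edges connect each factor to the variables in its scope.
variables = ["x1", "x2", "x3", "x4"]
factors = {                      # factor name -> (scope, hypothetical potential table)
    "phi_124": (("x1", "x2", "x4"), np.arange(1, 9, dtype=float).reshape(2, 2, 2)),
    "phi_234": (("x2", "x3", "x4"), np.ones((2, 2, 2))),
}
edges = [(f, v) for f, (scope, _) in factors.items() for v in scope]

def joint_unnormalized(assignment):
    """Multiply every factor evaluated on its own scope."""
    prod = 1.0
    for scope, table in factors.values():
        idx = tuple(assignment[v] for v in scope)
        prod *= table[idx]
    return prod

Z = sum(joint_unnormalized(dict(zip(variables, cfg)))
        for cfg in itertools.product([0, 1], repeat=len(variables)))
print("edges:", edges)
print("Z =", Z)
```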

Example for DGM

The DGM above can be interpreted as follows.

p(x_1,x_2,x_3,x_4,x_5)=p(x_1)p(x_2)p(x_3|x_1,x_2)p(x_4|x_3)p(x_5|x_3)

A directed graphical model also factorizes into a product of local terms, and therefore the factor graph representation established above can be applied to directed graphical models as well. Furthermore, unlike a UDG, a DG's dependence relationships can be identified directly with probabilities, since there is a unique way a DG is interpreted.
If the graph above were treated as a UDG, the joint probability would be given by the following.

p(x_1,x_2,x_3,x_4,x_5)=\frac{1}{Z}\phi_1(x_1)\phi_2(x_2)\phi_3(x_3)\phi_4(x_4)\phi_5(x_5)\phi_{123}(x_1,x_2,x_3)\phi_{34}(x_3,x_4)\phi_{35}(x_3,x_5)

Practical Examples

Exponential Form

  • Gibbs Distribution
    p(x_1,...,x_n)=\frac{1}{Z}\prod_{c=1}^C \phi_c(X_c)
    where the \phi_c(X_c) are called potentials.

We are free to define the \phi functions differently, as long as they remain non-negative. In particular, we can represent each potential with an exponential term:

p(x_1,...,x_n)=\frac{1}{Z}\prod_{c=1}^C \exp(-\phi_c(X_c))

Free energy of the system (negative log of the probability, up to an additive constant):

H(x_1,...,x_n)=\sum_c \phi_c(X_c)

Powerful parametrization (log-linear model):

H(x_1,...,x_n;\theta)=\sum_c f_c(X_c)^T\theta_c

where f_c(X_c) is a feature function and \theta_c is the corresponding parameter (weight) vector.
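A minimal sketch of this log-linear parametrization (the feature functions f_c and weights \theta_c are made up for illustration), showing that the distribution is recovered as p \propto \exp(-H):

```python
import itertools
import numpy as np

# Hypothetical log-linear model over three binary variables with two cliques.
def f_12(x1, x2):
    # indicator-style features on clique {x1, x2}: [x1 == x2, x1 * x2]
    return np.array([float(x1 == x2), float(x1 * x2)])

def f_23(x2, x3):
    return np.array([float(x2 == x3), float(x2 * x3)])

theta_12 = np.array([-1.0, -0.5])   # weights for clique {x1, x2}
theta_23 = np.array([-1.0,  0.3])   # weights for clique {x2, x3}

def energy(x1, x2, x3):
    """H(x; theta) = sum_c f_c(X_c)^T theta_c."""
    return f_12(x1, x2) @ theta_12 + f_23(x2, x3) @ theta_23

configs = list(itertools.product([0, 1], repeat=3))
Z = sum(np.exp(-energy(*c)) for c in configs)
p = {c: np.exp(-energy(*c)) / Z for c in configs}

print({c: round(v, 3) for c, v in p.items()})
```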

Boltzmann Machines
A fully connected graph with pairwise potentials on binary-valued nodes is called a Boltzmann machine.
We want to find the most stable state of the system.

p(x_1,x_2,x_3,x_4;\theta,\alpha)=\frac{1}{Z(\theta,\alpha)} \exp\bigg[\sum_{ij}\theta_{ij}x_ix_j+\sum_i\alpha_ix_i\bigg]

Here one can think of the first term \sum_{ij}\theta_{ij}x_ix_j as the pairwise (binary) term and \sum_i\alpha_ix_i as the unary term, with \theta_{ij} and \alpha_i the associated parameters.

Therefore, the overall energy function has a quadratic form.

H(x;\theta,\mu)=(x-\mu)^T\theta(x-\mu)

Since the objective is to find the most stable system, finding the minimum-energy state is the objective of the Boltzmann machine. Since the energy function has a quadratic form, we can simply find the minimum of H(x;\theta,\mu) if the parameters are known.

\arg\max_{x_1,...,x_4}p(x_1,...,x_4;\theta,\alpha)=\arg\min_{x_1,...,x_4}H(x;\theta,\mu)
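A minimal sketch (hypothetical \theta and \alpha, brute force over the 2^4 binary configurations, taking the energy as the negative of the exponent) verifying that the most probable configuration is also the minimum-energy configuration:

```python
import itertools
import numpy as np

# Hypothetical parameters for a 4-node Boltzmann machine with x_i in {0, 1}.
theta = np.array([[ 0.0,  1.0, -2.0,  0.5],
                  [ 1.0,  0.0,  0.8, -1.0],
                  [-2.0,  0.8,  0.0,  1.5],
                  [ 0.5, -1.0,  1.5,  0.0]])   # symmetric pairwise weights, zero diagonal
alpha = np.array([0.2, -0.3, 0.1, 0.4])         # unary weights

def exponent(x):
    """sum_ij theta_ij x_i x_j + sum_i alpha_i x_i (each unordered pair counted once)."""
    x = np.array(x, dtype=float)
    return x @ theta @ x / 2.0 + alpha @ x      # halve because x @ theta @ x counts pairs twice

configs = list(itertools.product([0, 1], repeat=4))
unnorm = np.array([np.exp(exponent(c)) for c in configs])
p = unnorm / unnorm.sum()                       # normalized Boltzmann distribution

most_probable = configs[int(np.argmax(p))]
min_energy = min(configs, key=lambda c: -exponent(c))   # energy = -exponent

print("argmax_x p(x):", most_probable)
print("argmin_x H(x):", min_energy)             # the same configuration
```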

Ising Models

  • Nodes are arranged in a regular topology (e.g. a grid) and connected only to their geometric neighbors; a sampling sketch follows the formula below.

p(X)=\frac{1}{Z}\exp\bigg\{\sum_{i}\sum_{j\in N_i}\theta_{ij}X_iX_j+\sum_{i}\theta_{i0}X_i\bigg\}
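A minimal Gibbs-sampling sketch for such a model (a small periodic grid with spins in {-1, +1}; the coupling and field values are made up), where each node is updated using only its geometric neighbors:

```python
import numpy as np

rng = np.random.default_rng(0)

L = 8               # grid side length (hypothetical small lattice)
coupling = 0.5      # theta_ij, shared by all neighbor pairs
field = 0.1         # theta_i0, shared by all nodes
X = rng.choice([-1, 1], size=(L, L))

def neighbors_sum(X, i, j):
    """Sum of the four geometric neighbors (with wrap-around boundary)."""
    return (X[(i - 1) % L, j] + X[(i + 1) % L, j] +
            X[i, (j - 1) % L] + X[i, (j + 1) % L])

def gibbs_sweep(X):
    """One pass of single-site Gibbs updates."""
    for i in range(L):
        for j in range(L):
            local = coupling * neighbors_sum(X, i, j) + field
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * local))   # p(X_ij = +1 | neighbors)
            X[i, j] = 1 if rng.random() < p_plus else -1
    return X

for sweep in range(100):
    X = gibbs_sweep(X)

print("mean magnetization:", X.mean())
```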

Restricted Boltzmann Machines (RBM)

  • Observed (visible) nodes are the data

  • Unobserved (hidden) nodes carry "a notion" of a summary of the data

    p(x,h;\theta)=\exp\bigg(\sum_i\theta_i \phi_i(x_i) + \sum_j \theta_j \phi_j(h_j) + \sum_{i,j} \theta_{i,j}\phi_{i,j}(x_i,h_j) - A(\theta) \bigg)
  • The hidden factors are marginally dependent.

  • The hidden factors are conditionally independent given observations on the visible nodes:

    p(h_1,...,h_M|x)=\prod_m p(h_m|x)
  • Iterative (block) Gibbs sampling is used to generate pairs of (x, h); a minimal sketch follows below.
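Here is a minimal sketch of that block Gibbs step for a binary RBM (the logistic conditionals are the standard textbook form, but the sizes and weights below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

n_visible, n_hidden = 6, 3          # hypothetical sizes
W = rng.normal(scale=0.5, size=(n_visible, n_hidden))  # pairwise weights theta_ij
b = np.zeros(n_visible)             # visible biases
c = np.zeros(n_hidden)              # hidden biases

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_h_given_x(x):
    """All hidden units are conditionally independent given x, so sample them in one block."""
    return (rng.random(n_hidden) < sigmoid(c + x @ W)).astype(float)

def sample_x_given_h(h):
    """Visible units are likewise conditionally independent given h."""
    return (rng.random(n_visible) < sigmoid(b + W @ h)).astype(float)

# Iterative (block) Gibbs sampling: alternate between the two conditionals.
x = rng.integers(0, 2, size=n_visible).astype(float)
for step in range(1000):
    h = sample_h_given_x(x)
    x = sample_x_given_h(h)

print("final sample x:", x, "h:", h)
```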

Conditional Random Fields (CRF)

p_\theta(y|x)=\frac{1}{Z(\theta,x)}\exp\bigg\{\sum_c \theta_c f_c(x, y_c)\bigg\}
  • An example of a log-linear model; a small worked sketch follows below.
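As an illustration (a tiny hypothetical linear-chain CRF with binary labels; the feature functions and weights are made up), p(y|x) can be computed by brute force because Z(\theta, x) sums only over label configurations:

```python
import itertools
import numpy as np

# Hypothetical linear-chain CRF: labels y_t in {0, 1}, fixed observation x.
x = np.array([0.2, -1.0, 0.7])     # one observation per position
w_obs, w_trans = 1.5, 0.8          # theta: weights for the two feature types

def score(y, x):
    """sum_c theta_c f_c(x, y_c): observation features plus transition features."""
    obs = sum(w_obs * x[t] * (1 if y[t] == 1 else -1) for t in range(len(y)))
    trans = sum(w_trans * (1.0 if y[t] == y[t + 1] else 0.0) for t in range(len(y) - 1))
    return obs + trans

labelings = list(itertools.product([0, 1], repeat=len(x)))
Z = sum(np.exp(score(y, x)) for y in labelings)          # Z(theta, x): depends on x
p = {y: np.exp(score(y, x)) / Z for y in labelings}

best = max(p, key=p.get)
print("most probable labeling:", best, "with p =", round(p[best], 3))
```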
