Defn
A random field is a random function over an arbitrary domain, i.e., a function f(x) that takes on a random value at each point x \in \mathbb{R}^n. By modern definitions, a random field is a generalization of a stochastic process where the underlying parameter can take values that are multidimensional vectors or points on some manifold.
Markov Random Field (Markov Network)
Defn
A set of random variables having a Markov property described by an undirected graph, i.e., a random field that satisfies the Markov properties.
Equivalent to an Undirected Graphical Model.
Directed Graphical Models are called Bayesian Networks.
Markov Properties
Global Property
Consider disjoint subsets A, B, and C of the Markov network. A and C are conditionally independent given B if every path connecting A and C passes through B (i.e., B separates A and C).
Local Property
A node is conditionally independent of the rest of the network given its neighbors (this neighbor set is the node's Markov blanket).
Pairwise Property
Any two non-adjacent nodes in the Markov network are conditionally independent given the rest of the network.
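These separation statements can be checked mechanically on the graph. Below is a minimal sketch (the graph and node names are made up for illustration) that tests whether a set B separates A from C, using networkx:

```python
import networkx as nx

# Toy Markov network: two paths between A and C, through B and through D.
G = nx.Graph()
G.add_edges_from([("A", "B"), ("B", "C"), ("A", "D"), ("D", "C")])

def separates(G, A, B, C):
    """True if every path between the sets A and C passes through B,
    i.e. removing B disconnects A from C."""
    H = G.copy()
    H.remove_nodes_from(B)
    return not any(nx.has_path(H, a, c) for a in A for c in C)

print(separates(G, {"A"}, {"B"}, {"C"}))       # False: the path A-D-C bypasses B
print(separates(G, {"A"}, {"B", "D"}, {"C"}))  # True: {B, D} blocks every path
```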
Gibbs Random Field
Defn
When the joint probability density of the random variables in a Markov Random Field is strictly positive, it is also referred to as a Gibbs random field.
Intuition of Why
The Hammersley-Clifford theorem states that a strictly positive probability density can be represented by a Gibbs measure for an appropriate energy function.
Markov Network
For a set of variables X = \{x_1, \ldots, x_n\}, a Markov network is defined as a product of potentials on subsets of the variables X_c \subseteq X:
p(x_1, \ldots, x_n) = \frac{1}{Z} \prod_{c=1}^{C} \phi_c(X_c)
Where,
ϕ_c is called a potential, with ϕ_c ≥ 0 (it does not have to be a probability)
X_c is a maximal clique
Z is a normalizing constant (the partition function) that makes the distribution sum to 1
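As a concrete sketch, the factorization can be evaluated by brute force on a tiny binary chain x_1 - x_2 - x_3, with Z computed by summing the unnormalized product over all states (the potential values below are made up):

```python
from itertools import product

# Hypothetical pairwise potentials on the cliques {x1, x2} and {x2, x3};
# values only need to be non-negative, not normalized.
phi_12 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}
phi_23 = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 1.0}

def unnormalized(x1, x2, x3):
    return phi_12[(x1, x2)] * phi_23[(x2, x3)]

# Partition function Z: sum of the potential product over all joint states.
Z = sum(unnormalized(*x) for x in product([0, 1], repeat=3))

def p(x1, x2, x3):
    return unnormalized(x1, x2, x3) / Z

print(p(1, 1, 0))                                     # one joint probability
print(sum(p(*x) for x in product([0, 1], repeat=3)))  # sums to 1.0
```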
Difference between Markov Network and Bayesian Network
Markov networks allow symmetrical dependencies between variables (undirected graphical model).
Markov networks allow cycles, and so cyclical dependencies can be represented.
Conditional Independence:
Global Markov Property: X_A \perp X_B \mid X_C if C separates A from B (every path connecting A and B passes through C). By the Hammersley-Clifford theorem, positive distributions with this property are exactly the Gibbs distributions.
Markov Blanket (local property): The set of nodes that renders a node t conditionally independent of all the other nodes in the graph.
t \perp \mathcal{V} - \mathrm{mb}(t) - \{t\} \mid \mathrm{mb}(t)
where \mathcal{V} is the set of all nodes in the graph and \mathrm{mb}(t) is the Markov blanket of t.
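In an undirected model the Markov blanket of a node is simply its neighbor set, so it can be read off the graph directly (toy graph for illustration):

```python
import networkx as nx

G = nx.Graph([("a", "b"), ("b", "c"), ("b", "d"), ("d", "e")])

def markov_blanket(G, t):
    # In an MRF, mb(t) is exactly the set of neighbors of t.
    return set(G.neighbors(t))

print(markov_blanket(G, "b"))  # {'a', 'c', 'd'}
```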
Clique
Defn
A clique is a subset of vertices of an undirected graph such that every two distinct vertices in the clique are adjacent, i.e., there exists an edge between every pair of vertices in the subset.
Clique Potential
ϕ cannot be interpreted as marginals or conditionals.
It can only be interpreted as a general "compatibility" or "relationship" over its variables, not as a probability distribution.
Maximal Clique
Defn
A maximal clique is a clique that cannot be extended by including one more adjacent vertex, meaning it is not a subset of a larger clique.
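networkx can enumerate maximal cliques directly; a small sketch on a toy graph:

```python
import networkx as nx

# Toy graph: a triangle {1, 2, 3} plus a pendant edge 3-4.
G = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4)])

# nx.find_cliques yields the *maximal* cliques of an undirected graph.
print(sorted(sorted(c) for c in nx.find_cliques(G)))
# [[1, 2, 3], [3, 4]] -- {1, 2} is a clique, but not maximal
```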
If the potentials are defined on the maximal cliques of the graph, then the family of probability distributions obtained is exactly the set that respects the qualitative specification (the conditional independence relations) described earlier.
Thm: Let P be a positive distribution over V, and H a Markov network graph over V. If H is an I-map for P, then P is a Gibbs distribution over H.
Equivalently, the theorem states that the process is a Markov random field iff the corresponding joint distribution is a Gibbs distribution over the graph.
Proof
Let \{X_t : t \in T\} be a finite collection of random variables with X_t taking values in a finite set S_t. Let the index set T be \{1, \ldots, n\} and S_t = \{0, \ldots, m_t\}. The joint distribution of the variables is
Q\{x\} = P\{X_t = x_t \text{ for } t \in T\}, \quad \text{where } x = (x_1, \ldots, x_n),
with 0 \leq x_t \leq m_t. The vector X = (X_1, \ldots, X_n) takes values in \mathcal{X} = \prod_{t \in T} S_t, the set of all n-tuples x = (x_1, \ldots, x_n) with x_t \in S_t for each t.
Suppose T is the set of nodes of a graph. Let N_t denote the set of nodes s (the neighbors of t) for which (t, s) is an edge of the graph.
Definition
The process is said to be a Markov random field if (i) Q\{x\} > 0 for every x in \mathcal{X}, and (ii) for each t and x,
P\{X_t = x_t \mid X_s = x_s, \, s \neq t\} = P\{X_t = x_t \mid X_s = x_s, \, s \in N_t\}.
A subset A of T is said to be complete if each pair of vertices in A defines an edge of the graph. Γ denotes the collection of all complete subsets. "Clique" is another term for a complete subset.
Definition
The probability distribution Q is called a Gibbs distribution for the graph if it can be written in the form
Q\{x\} = \prod_{A \in \Gamma} V_A(x),
where each VA is a positive function that depends on x only through the coordinates {xt:t∈A}.
The Hammersley-Clifford Theorem asserts that the process {Xt:t∈T} is a Markov random field iff the corresponding Q is a Gibbs distribution.
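A quick numerical illustration (not a proof): construct a Gibbs distribution Q\{x\} = \prod V_A(x) on the chain 1 - 2 - 3 with made-up positive functions V_A, and verify that the non-adjacent pair (X_1, X_3) is conditionally independent given X_2, as the Markov property predicts:

```python
from itertools import product

# Complete subsets (cliques) of the chain 1-2-3 used here: {1, 2} and {2, 3}.
V_12 = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 1.0}
V_23 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}

Q = {x: V_12[x[:2]] * V_23[x[1:]] for x in product([0, 1], repeat=3)}
Z = sum(Q.values())

def p(pred):
    """Probability of the event defined by predicate pred."""
    return sum(q for x, q in Q.items() if pred(x)) / Z

# Check P(x1, x3 | x2) = P(x1 | x2) * P(x3 | x2) for every joint state.
for x1, x2, x3 in product([0, 1], repeat=3):
    px2 = p(lambda x: x[1] == x2)
    joint = p(lambda x: x == (x1, x2, x3)) / px2
    m1 = p(lambda x: x[:2] == (x1, x2)) / px2
    m3 = p(lambda x: x[1:] == (x2, x3)) / px2
    assert abs(joint - m1 * m3) < 1e-12
print("X1 is conditionally independent of X3 given X2, as expected")
```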
Factor Graphs
A factor graph is a graphical model representation that unifies directed and undirected models.
It is a bipartite graph composed of circle nodes and square nodes.
Circle (round) nodes represent the random variables of the graph.
Square nodes represent the factors of the graph, i.e., the relationships between the variables (the edges of the original graph).
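One way to encode this bipartite structure in code is a mapping from factor (square) nodes to the variable (round) nodes they touch; the names below are purely illustrative:

```python
# A factor graph as a bipartite structure: each factor lists the
# variables it connects to. Variables x1..x4 and factors f_a..f_c
# are made-up names for illustration.
factors = {
    "f_a": ("x1", "x2"),        # pairwise factor between x1 and x2
    "f_b": ("x2", "x3", "x4"),  # a factor may touch any number of variables
    "f_c": ("x4",),             # unary factor on x4
}

# Edges of the bipartite graph: one edge per (factor, variable) pair.
edges = [(f, v) for f, vs in factors.items() for v in vs]
print(edges)
```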
Example)
The original UDG is given by the following, where vertices 1–4 are random variables and the edges are dependence relationships.
The factor graph construction above also applies to directed graphical models. Furthermore, unlike a UDG, a DG's dependence relationships can be identified with probabilities, since there is a unique way a DG is interpreted: each factor is a conditional distribution.
If the graph above were a UDG, the joint probability would be given by the following.
We are free to define the ϕ functions differently, as long as each remains non-negative. Therefore, we can represent the distribution using an exponential term:
p(x_1, \ldots, x_n) = \frac{1}{Z} \prod_{c=1}^{C} \exp(-\phi_c(X_c))
Energy of the system (the negative log of the unnormalized probability):
H(x_1, \ldots, x_n) = \sum_c \phi_c(X_c)
Powerful parametrization (log-linear model):
H(x_1, \ldots, x_n; \theta) = \sum_c f_c(X_c)^T \theta_c
where f_c(X_c) is a feature function over clique c, and θ_c is the corresponding parameter (weight) vector.
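A sketch of evaluating this log-linear energy with made-up feature functions and weights:

```python
import numpy as np

# Hypothetical feature function: maps a pair of binary variables to a
# vector of agreement / disagreement indicators.
def f_pair(x_i, x_j):
    return np.array([float(x_i == x_j), float(x_i != x_j)])

theta_12 = np.array([1.5, -0.5])  # weights for clique {x1, x2}
theta_23 = np.array([0.8, 0.2])   # weights for clique {x2, x3}

def energy(x1, x2, x3):
    # H(x; theta) = sum over cliques of f_c(X_c)^T theta_c
    return f_pair(x1, x2) @ theta_12 + f_pair(x2, x3) @ theta_23

print(energy(0, 0, 1))  # 1.5 + 0.2 = 1.7
```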
Boltzmann Machines
A fully connected graph with pairwise potentials on binary-valued nodes is called a Boltzmann machine.
We want to find the most stable configuration of the system, i.e., the state of minimum energy:
H(x; \theta, \alpha) = \sum_{ij} \theta_{ij} x_i x_j + \sum_i \alpha_i x_i
where one can think of the first term \sum_{ij} \theta_{ij} x_i x_j as the pairwise (binary) term, \sum_i \alpha_i x_i as the unary term, and \theta_{ij}, \alpha_i as the associated parameters.
Therefore, the overall energy function has a quadratic form.
H(x; \theta, \mu) = (x - \mu)^T \theta (x - \mu)
Since the objective is to find the most stable system, the Boltzmann machine seeks the minimum-energy state. Because the energy function has a quadratic form, we can simply find the minima of H(x; \theta, \mu) when \theta and \mu are known.
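A brute-force sketch of finding the minimum-energy state of a tiny Boltzmann machine, with made-up parameters (for larger models one would use Gibbs sampling or iterated conditional modes instead of exhaustive search):

```python
from itertools import product
import numpy as np

# Hypothetical parameters for a 3-node Boltzmann machine.
theta = np.array([[ 0.0, -1.0,  0.5],
                  [-1.0,  0.0, -0.3],
                  [ 0.5, -0.3,  0.0]])  # symmetric pairwise weights
alpha = np.array([0.2, -0.4, 0.1])      # unary biases

def H(x):
    # H(x) = sum_ij theta_ij x_i x_j + sum_i alpha_i x_i
    x = np.asarray(x, dtype=float)
    return x @ theta @ x + alpha @ x

# Exhaustively search all 2^3 binary states for the minimum energy.
best = min(product([0, 1], repeat=3), key=H)
print(best, H(best))
```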