UGM Inference

홍성우 · January 28, 2021
  • The graphical model structure is given.
  • Every variable appears in the training examples (no unobserved variables, i.e., the data are fully observed).

Questions

  1. What does the likelihood objective accomplish?
  2. Is likelihood the right objective function?
  3. How do we optimize the objective function?
  4. What guarantees does the optimizer provide?
  5. What is the mapping from data to model? In what ways can we incorporate our domain knowledge? How does this impact learning?

Settings

  1. Tabular potentials: $\psi_C(x_C)=\theta_{C,x_C}$
  • Random variables are discrete.
  • The parameters are the table entries attached to the cliques connecting the nodes.
  • MLE by inspection (decomposable models).
  • Iterative Proportional Fitting (IPF).

ex) If a clique contains only one node, its parameters form a vector; with two nodes, a matrix; with three nodes, a 3-dimensional tensor, and so on (see the sketch after this list).

  2. Feature-based potentials: $\psi_C(x_C)=\exp(\theta \cdot f(x_C))$
  • Random variables are continuous.
  • Generalized Iterative Scaling (GIS).
  • Gradient-based methods.
  • $\theta$ are the parameters, and $f(x_C)$ is some feature function.
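As a rough illustration of what the two parameterizations look like in code, here is a minimal NumPy sketch; the clique sizes, number of states, and the feature function are invented for the example:

```python
import numpy as np

# Tabular parameterization: psi_C(x_C) = theta_{C, x_C}.
# With K discrete states per variable, the parameter block for a clique
# is simply an array with one axis per node in the clique.
K = 5
theta_single = np.random.rand(K)        # 1-node clique -> vector
theta_pair   = np.random.rand(K, K)     # 2-node clique -> matrix
theta_triple = np.random.rand(K, K, K)  # 3-node clique -> 3-dim tensor

# Feature-based parameterization: psi_C(x_C) = exp(theta . f(x_C)).
def f(x_c):
    # hypothetical feature function of the clique configuration
    return np.array([x_c.sum(), (x_c == x_c[0]).all()], dtype=float)

theta = np.array([0.3, -1.2])           # one weight per feature
x_c = np.array([1, 1, 4])
psi = np.exp(theta @ f(x_c))            # value of the clique potential
```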

Let $D=\{\vec{x}_1,\dots,\vec{x}_N\}$ be the observed data, with $\vec{x}_n \overset{\text{i.i.d.}}{\sim} p(\vec{x};\theta)$.
Now, MLE is given by:

$$\max_\theta \log p(D;\theta)=\max_\theta \sum_{n=1}^N \sum_c \log \phi_c(x^n_c;\theta) - N \log Z(\theta)$$

where each clique is denoted by $c$, and $Z(\theta)$ is the normalizing constant.

Then,

$$L(\theta)=\sum_{n=1}^N \sum_c \log \phi_c(x^n_c;\theta) - N \log\bigg(\sum_y \prod_{c'} \phi_{c'}(y_{c'};\theta)\bigg)$$

Taking the derivative with respect to $\theta_c$ and setting it to zero, we obtain

$$\sum_{n=1}^N \frac{d}{d\theta_c}\log \phi_c(x^n_c;\theta_c) - N\,E\bigg[\frac{d}{d\theta_c}\log \phi_c(y_c;\theta_c)\bigg]=0$$

Since the first term is an empirical mean and the second term is the corresponding mean under the model, the MLE condition is that the model mean equals the empirical mean (moment matching).
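To make the objective concrete, here is a minimal brute-force sketch for a tiny binary UGM; the cliques, the data format, and the exhaustive enumeration of $Z(\theta)$ are assumptions for illustration and are only feasible at toy scale:

```python
import itertools
import numpy as np

# Toy UGM: 3 binary variables, cliques {x1,x2} and {x2,x3}; theta holds log-potentials.
cliques = [(0, 1), (1, 2)]
theta = {c: np.random.randn(2, 2) for c in cliques}

def log_score(x):
    # sum_c log phi_c(x_c; theta)
    return sum(theta[c][x[c[0]], x[c[1]]] for c in cliques)

def log_Z():
    # brute-force partition function over all 2^3 configurations
    return np.log(sum(np.exp(log_score(x)) for x in itertools.product([0, 1], repeat=3)))

def log_likelihood(data):
    # sum_n sum_c log phi_c(x_c^n; theta) - N log Z(theta)
    return sum(log_score(x) for x in data) - len(data) * log_Z()

def model_marginal(c):
    # clique marginal under the current model; at the MLE this
    # matches the empirical clique marginal (moment matching)
    p, lZ = np.zeros((2, 2)), log_Z()
    for x in itertools.product([0, 1], repeat=3):
        p[x[c[0]], x[c[1]]] += np.exp(log_score(x) - lZ)
    return p
```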

Decomposable Graphs

By the Junction Tree Algorithm,

$$p(a,b,c,d)=\frac{\phi(a,b,c)\,\phi(b,c,d)}{Z}=\frac{p(a,b,c)\,p(b,c,d)}{p(b,c)}$$

Intuition
Let's assume we have a graph whose joint distribution factorizes as

$$p(x_1,\dots,x_6)=\frac 1 Z\, \phi(x_1,x_2)\,\phi(x_2,x_3,x_5)\,\phi(x_2,x_4,x_5)\,\phi(x_5,x_6)$$

Now, if we apply the Junction Tree Algorithm, the joint probability can be expressed in terms of clique and separator marginals:

$$p(x_1,\dots,x_6)=\frac{p(x_1,x_2)\,p(x_2,x_3,x_5)\,p(x_5,x_6)\,p(x_2,x_4,x_5)}{p(x_2)\,p(x_5)\,p(x_2,x_5)}$$

Rewriting the RHS of the equation above, we obtain

$$p(x_1,\dots,x_6)=p(x_1|x_2)\,p(x_2,x_3,x_5)\,p(x_4|x_2,x_5)\,p(x_6|x_5)$$

This is, expressed in unnormalized form,

$$p(x_1,\dots,x_6)=\frac{\phi(x_1,x_2)\,\phi(x_2,x_3,x_5)\,\phi(x_2,x_4,x_5)\,\phi(x_5,x_6)}{Z}$$

Now the log-likelihood objective becomes

$$L=\sum_n \Big[\log p(x_1^n|x_2^n)+\log p(x_2^n,x_3^n,x_5^n)+\log p(x_4^n|x_2^n,x_5^n)+\log p(x_6^n|x_5^n)\Big]$$
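Because the model is decomposable, each factor in the junction-tree form can be fit "by inspection" as an empirical frequency table. A minimal sketch under assumed binary variables and made-up data:

```python
import numpy as np

# Made-up binary data: N x 6 array of observations (x1, ..., x6).
data = np.random.randint(0, 2, size=(1000, 6))

def marginal(cols):
    # empirical marginal table over the given (0-based) variable indices
    table = np.zeros((2,) * len(cols))
    for row in data:
        table[tuple(row[list(cols)])] += 1
    return table / len(data)

# Clique and separator marginals of the junction tree.
p_12, p_235 = marginal([0, 1]), marginal([1, 2, 4])
p_245, p_56 = marginal([1, 3, 4]), marginal([4, 5])
p_2, p_5, p_25 = marginal([1]), marginal([4]), marginal([1, 4])

def joint(x):
    # p(x) = p(x1,x2) p(x2,x3,x5) p(x2,x4,x5) p(x5,x6) / (p(x2) p(x5) p(x2,x5))
    x1, x2, x3, x4, x5, x6 = x
    return (p_12[x1, x2] * p_235[x2, x3, x5] * p_245[x2, x4, x5] * p_56[x5, x6]
            / (p_2[x2] * p_5[x5] * p_25[x2, x5]))
```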

IPF update

  • Iterates a set of fixed-point equations
  • A fixed-point update sets the update equation to hold at the desired solution, assigns the resulting values as the new parameters, and repeats until convergence.

$$\phi_c^{(t+1)}(y_c)\leftarrow\phi_c^{(t)}(y_c)\,\frac{\epsilon(y_c)}{p^{(t)}(y_c)}$$

where $\epsilon(y_c)$ is the empirical marginal distribution of $y_c$, and $p^{(t)}(y_c)$ comes from inference in the graph with the currently updated parameters, i.e., $p^{(t)}(y_c)=\sum_{y':\,y'_c=y_c} p(y';\theta^{(t)})$.
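A minimal sketch of the IPF loop for a toy binary model with cliques $\{x_1,x_2\}$ and $\{x_2,x_3\}$, using brute-force enumeration for the model marginals; the data, clique choices, and number of sweeps are assumptions for illustration:

```python
import itertools
import numpy as np

# Toy setup: 3 binary variables, two pairwise cliques, made-up data.
cliques = [(0, 1), (1, 2)]
phi = {c: np.ones((2, 2)) for c in cliques}            # phi^(0): uniform potentials
data = np.random.randint(0, 2, size=(500, 3))

def empirical_marginal(c):
    # epsilon(y_c): empirical marginal of the clique configuration
    eps = np.zeros((2, 2))
    for row in data:
        eps[row[c[0]], row[c[1]]] += 1
    return eps / len(data)

def model_marginal(c):
    # p^(t)(y_c): marginalize the current model by brute-force enumeration
    p = np.zeros((2, 2))
    for y in itertools.product([0, 1], repeat=3):
        score = np.prod([phi[cl][y[cl[0]], y[cl[1]]] for cl in cliques])
        p[y[c[0]], y[c[1]]] += score
    return p / p.sum()

# IPF sweeps (assumes every clique configuration appears in the data,
# so the ratio stays well defined).
for _ in range(50):
    for c in cliques:
        phi[c] = phi[c] * empirical_marginal(c) / model_marginal(c)
```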

Feature-based Clique Potentials

  • For large cliques these general potentials are exponentially costly for inference and have exponential numbers of parameters that we must learn from limited data.
  • Could make cliques smaller, but this changes the dependencies.
  • Keep the same graphical model, but use a less general parameterization of the clique potentials.
  • This leads to feature-based models.

Features

  • Consider a clique $x_c$ of random variables in a UGM, e.g., three consecutive characters $c_1 c_2 c_3$ in a string of English text.

  • How would we build a model of $p(c_1,c_2,c_3)$?
    If we use a single clique potential over $c_1 c_2 c_3$, the full joint clique potential would be huge: $26^3-1$ parameters. However, we often know that some particular joint settings of the variables in a clique are quite likely or quite unlikely: "ing", "ion", ...

  • A "feature" is a function which is vacuous over all joint settings except a few particular ones on which it is high or low.
    For example, we might have fing(c1,c2,c3)f_{ing}(c_1,c_2,c_3) which is 1 if the string is 'ing' and 0 otherwise, and similar features for 'ion', etc..

  • Typically we can let features represent any of our prior knowledge about the clique, and we can use the features to build the potential function for each clique.
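For concreteness, a few such indicator features might look like the following sketch; the feature set itself is made up for the example:

```python
# Indicator features over a clique of three consecutive characters.
def f_ing(c1, c2, c3):
    return 1.0 if (c1, c2, c3) == ('i', 'n', 'g') else 0.0

def f_ion(c1, c2, c3):
    return 1.0 if (c1, c2, c3) == ('i', 'o', 'n') else 0.0

def f_qu(c1, c2, c3):
    # a feature may also look at only a subset of the clique variables
    return 1.0 if (c1, c2) == ('q', 'u') else 0.0

features = [f_ing, f_ion, f_qu]
```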

Features as Micropotentials

  • By exponentiating them, each feature function can be made into a "micropotential". We can multiply these micropotentials together to get a clique potential.
  • Example: a clique potential $\psi_c(c_1,c_2,c_3)$ could be expressed as:
    $$\psi_c(c_1,c_2,c_3)=e^{\theta_{ing}f_{ing}}\times\cdots=\exp\bigg\{\sum_{k=1}^K \theta_k f_k(c_1,c_2,c_3)\bigg\}$$
  • This is still a potential over $26^3$ possible settings, but it uses only $K$ parameters if there are $K$ features.
    Note that by having one indicator function per combination of $x_c$, we recover the standard tabular potential.

  • Each feature has a weight $\theta_k$ which represents the numerical strength of the feature and whether it increases or decreases the probability of the clique.

  • The marginal over the clique is a generalized exponential-family distribution, in fact a GLM:

$$p(c_1,c_2,c_3)\propto \exp\bigg\{\sum_{k=1}^K \theta_k f_k(x_c)\bigg\}$$
  • In fact, with this setup we can see that

    $$p(x)=\frac 1 {Z(\theta)} \prod_c \exp\big(\theta_c^T f_c(x_c)\big) = \exp\bigg(\sum_c \theta_c^T f_c(x_c) - \log Z(\theta)\bigg)$$

    which is, as a result, an exponential family.

  • Therefore, the joint probability model becomes an exponential family whenever each clique potential is defined in this way.

  • In general, the features may be overlapping, unconstrained indicators or any function of any subset of the clique variables:

$$\psi_c(x_c)=\exp\bigg\{\sum_{i\in I_c}\theta_i f_i(x_{c_i})\bigg\}$$

Expressing the clique potentials with feature functions, we can write

$$p(x)=\frac 1 {Z(\theta)}\prod_c \psi_c(x_c)=\frac 1 {Z(\theta)}\exp\bigg\{\sum_c \sum_{i \in I_c} \theta_i f_i(x_{c_i})\bigg\}$$
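Putting this together, here is a minimal sketch of a feature-based clique potential and its normalized distribution for a single character-triple clique; the features and weights are made up for the example:

```python
import math
import itertools
import string

# Toy indicator features over a character triple (same idea as above).
def f_ing(c1, c2, c3): return 1.0 if (c1, c2, c3) == ('i', 'n', 'g') else 0.0
def f_ion(c1, c2, c3): return 1.0 if (c1, c2, c3) == ('i', 'o', 'n') else 0.0

features = [f_ing, f_ion]
theta = [2.0, 1.5]                     # one (made-up) weight per feature

def psi(c1, c2, c3):
    # psi_c(x_c) = exp( sum_k theta_k f_k(x_c) )
    return math.exp(sum(t * f(c1, c2, c3) for t, f in zip(theta, features)))

# For a single-clique model, p(c1,c2,c3) = psi(c1,c2,c3) / Z(theta).
Z = sum(psi(*triple) for triple in itertools.product(string.ascii_lowercase, repeat=3))
p_ing = psi('i', 'n', 'g') / Z         # 'ing' is boosted relative to arbitrary triples
```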
  • Can I use IPF?
    Not exactly, but we can modify it to handle feature-based potentials.

Generalized Iterative Scaling (GIS)

Key Idea:
- Define a function which lower-bounds the log-likelihood.
- Observe that the bound is tight at the current parameters.
- Increase the lower bound by fixed-point iteration in order to increase the log-likelihood.

**Note:** This idea is akin to a standard derivation of the Expectation-Maximization (EM) algorithm.

IPF vs GIS

  • IPF is a general algorithm for finding the MLE of UGMs:
    $$\phi^{(t+1)}_c(y_c) \leftarrow \phi^{(t)}_c(y_c)\,\frac{\epsilon(y_c)}{p^{(t)}(y_c)}$$
  • GIS is iterative scaling on general UGMs with feature-based potentials.
  • IPF is a special case of GIS in which the clique potentials are built on features defined as indicator functions of clique configurations.

The two methods follow essentially the same logic:
scale the current value by the ratio of the empirical marginal to the current model marginal (equivalently, add the log of this ratio in log-parameter space).
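For reference, a hedged sketch of one GIS step, under the usual assumption that the features are non-negative and their per-configuration sum is bounded by a constant $C$; the expectation arguments are placeholders you would compute from the data and from inference in the current model:

```python
import numpy as np

def gis_update(theta, emp_expect, model_expect, C):
    """One GIS fixed-point step for feature weights theta.

    theta        -- current feature weights, shape (K,)
    emp_expect   -- empirical expectation of each feature f_k under the data
    model_expect -- expectation of each feature under the current model
                    (obtained by inference, e.g. brute-force enumeration at toy scale)
    C            -- constant bounding sum_k f_k(x) over all configurations x
    """
    # theta_k <- theta_k + (1/C) * log( E_data[f_k] / E_model[f_k] )
    return theta + np.log(emp_expect / model_expect) / C
```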
