Exact Inference

홍성우 · January 13, 2021
  • Inference
    Used to answer queries about the model $P_M$, e.g., $P_M(X \mid Y)$.

  • Learning
    Used to estimate a plausible model $M$ from data $D$.

Query

1. Likelihood

  • Evidence $e$ is an assignment of values to a set $E$ of variables in the domain.
  • $E=\{X_{k+1},\dots,X_n\}$
  • Simplest query:
    $$P(e)=\sum_{x_1}\cdots\sum_{x_k}P(x_1,\dots,x_k,e)$$
  • This is often referred to as computing the likelihood of $e$.
  • Notice that $E$ and $\{X_1,\dots,X_k\}$ are disjoint sets.

2. Conditional Probability

$$P(X \mid e)=\frac{P(X,e)}{P(e)}=\frac{P(X,e)}{\sum_x P(X=x,e)}$$

$P(X \mid e)$ is the a posteriori belief in $X$, given evidence $e$.

  • Marginalization: the process of summing out the "unobserved" or "don't care" variables $Z$.
    $$P(Y \mid e)=\sum_z P(Y,Z=z \mid e)$$

3. Most Probable Assignment

  • The most probable joint assignment (MPA) for some variables of interest.
  • What is the most probable assignment
    $$\max_{x_1,\dots,x_4}P(x_1,\dots,x_4 \mid e),$$
    where $e$ is the evidence and $x_1,\dots,x_4$ are the random variables of interest?
    $$\mathrm{MPA}(Y \mid e)=\arg\max_{y \in \mathcal{Y}}\ P(y \mid e)=\arg\max_{y \in \mathcal{Y}} \sum_z P(y,z \mid e)$$
  • This is the maximum a posteriori configuration of $Y$.
  • Used in classification / explanation tasks.

Example)

Suppose we have the joint distribution $P(X_1,X_2,X_3,X_4,E)$.

Now we want

$$\max_{x_1,\dots,x_4}P(x_1,\dots,x_4 \mid e).$$

How can we compute this posterior distribution?
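For a model this small, all three query types (likelihood, conditional probability, MPA) can be answered by brute force from an explicit joint table. Below is a minimal numpy sketch; the binary variables and the randomly generated joint are assumptions made only for illustration.

```python
import numpy as np

# Hypothetical joint table P(X1, X2, X3, X4, E) over binary variables,
# indexed as P[x1, x2, x3, x4, e]; the values are random, for illustration only.
rng = np.random.default_rng(0)
P = rng.random((2, 2, 2, 2, 2))
P /= P.sum()                               # normalize into a valid joint distribution

e = 1                                      # observed evidence E = e

# 1. Likelihood: P(e) = sum over x1..x4 of P(x1,...,x4, e)
P_e = P[..., e].sum()

# 2. Conditional probability: P(X1 | e) = P(X1, e) / P(e)
P_x1_e = P[..., e].sum(axis=(1, 2, 3))     # marginalize out x2, x3, x4
P_x1_given_e = P_x1_e / P_e

# 3. Most probable assignment: argmax over (x1,...,x4) of P(x1,...,x4 | e)
mpa = np.unravel_index(np.argmax(P[..., e]), P[..., e].shape)

print(P_e, P_x1_given_e, mpa)
```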

Applications of A Posteriori Belief

  • Prediction: what is the probability of an outcome given the starting condition?
    $A \rightarrow B \rightarrow C$
    $P(C \mid A)$?
  • Diagnosis: what is the probability of a disease/fault given symptoms?
    $A \rightarrow B \rightarrow C$
    $P(A \mid C)$?

Landscape of inference algorithms

Exact inference algorithms (Classical)

  • Elimination algorithms
  • Message-passing algorithms (sum-product, belief propagation)
  • Junction tree algorithms

Approximate inference techniques (Modern)

  • Stochastic simulation / sampling methods
  • Markov chain Monte Carlo methods
  • Variational algorithms

Variable Elimination

$$A \rightarrow B \rightarrow C \rightarrow D \rightarrow E$$
  • Query: $P(e)$
    $$P(e)=\sum_d \sum_c \sum_b \sum_a P(a,b,c,d,e) = \sum_d \sum_c \sum_b \sum_a P(a)P(b \mid a)P(c \mid b)P(d \mid c)P(e \mid d)$$
  • Eliminate nodes one by one, all the way to the end:
    $$P(e)=\sum_d P(e \mid d)p(d)$$
    The computational complexity of this chain elimination is $O(nk^2)$, where $n$ is the number of elimination steps and $k$ is the number of states of each variable (see the sketch below).
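Here is a minimal numpy sketch of the chain elimination, assuming $k=3$ states per variable and randomly generated CPTs (the tables themselves are illustrative assumptions). Each step is a vector-matrix product that sums out one variable at $O(k^2)$ cost.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3                                       # number of states per variable (assumed)

def random_cpt(k):
    """Random conditional table T[i, j] = P(child = j | parent = i)."""
    T = rng.random((k, k))
    return T / T.sum(axis=1, keepdims=True)

p_a = rng.random(k); p_a /= p_a.sum()       # P(A)
p_b_a, p_c_b, p_d_c, p_e_d = (random_cpt(k) for _ in range(4))

# Eliminate A, then B, then C, then D; each step costs O(k^2).
p_b = p_a @ p_b_a                           # P(b) = sum_a P(a) P(b|a)
p_c = p_b @ p_c_b                           # P(c) = sum_b P(b) P(c|b)
p_d = p_c @ p_d_c                           # P(d) = sum_c P(c) P(d|c)
p_e = p_d @ p_e_d                           # P(e) = sum_d P(d) P(e|d)

# Sanity check against summing the full joint at once.
joint = np.einsum('a,ab,bc,cd,de->e', p_a, p_b_a, p_c_b, p_d_c, p_e_d)
assert np.allclose(p_e, joint)
print(p_e)
```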

Example) Markov Chain
Let the transition matrix $M$ be a $3 \times 3$ matrix of a Markov chain with three states.

  • Let $X_t$, the nodes in the Markov chain, denote the state at time $t$.
  • What is the probability distribution of a random walk under the transition matrix $M$?
    $$p(x_5=i \mid x_1=1)=\sum_{x_4,x_3,x_2} p(x_5 \mid x_4)p(x_4 \mid x_3)p(x_3 \mid x_2)p(x_2 \mid x_1=1) = [M^4 v]_i$$
  • What is the probability distribution of the random walk when $t \rightarrow \infty$?
    If we denote the stationary distribution as $\pi$,
    $$\pi M = \pi$$
    The problem formulation above looks similar to the eigenvalue decomposition form $Mv=\lambda v$, with $\lambda=1$:
    $$(\pi M)^T = M^T\pi^T = 1 \cdot \pi^T$$
    In other words, the transposed transition matrix $M^T$ has $\pi^T$ as an eigenvector with eigenvalue $1$.

We can find the stationary distribution of a given transition matrix through an eigenvalue decomposition of $M^T$: the stationary distribution is the eigenvector with eigenvalue $\lambda=1$, normalized to sum to one. A numpy check follows below.
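A quick numpy check of both computations, using a made-up 3-state transition matrix (an assumption for illustration) and the row-vector convention $\pi_{t+1}=\pi_t M$ that matches $\pi M=\pi$ above (so the four-step distribution is computed as $vM^4$).

```python
import numpy as np

# Hypothetical 3-state transition matrix; rows sum to 1, M[i, j] = P(x_{t+1}=j | x_t=i)
M = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])

# Distribution after four transitions, starting deterministically from state 0 (x_1 = 1 in the text)
v = np.array([1.0, 0.0, 0.0])
p_x5 = v @ np.linalg.matrix_power(M, 4)

# Stationary distribution: eigenvector of M^T with eigenvalue 1
eigvals, eigvecs = np.linalg.eig(M.T)
idx = np.argmin(np.abs(eigvals - 1.0))     # pick the eigenvalue closest to 1
pi = np.real(eigvecs[:, idx])
pi /= pi.sum()                             # normalize into a probability distribution

print(p_x5)
print(pi, pi @ M)                          # pi @ M should equal pi
```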

Message Passing

  • Does not work in a cyclic graph:
    message propagation can get stuck in a cycle.
  • Variable elimination via message passing allows us to marginalize out unnecessary variables with better complexity.

Example: Variable Elimination

  • Given the graph above, what is the posterior probability $P(A \mid h)$?

  • Initial factors:

    $$P(a,b,c,d,e,f,g,h)=P(a)P(b)P(c \mid b)P(d \mid a)P(e \mid c,d)P(f \mid a)P(g \mid e)P(h \mid e,f)$$
    The joint probability factorizes in this manner because the graph is a directed model.
  • Choose elimination order: $H, G, F, E, D, C, B$

  • Step 1:
    - Conditioning (fix the evidence node $h$ to its observed value $\tilde h$):

    $$m_h(e,f)=p(h=\tilde h \mid e,f)=\sum_h p(h \mid e,f)\,\delta(h=\tilde h)$$

    where $\delta(h=\tilde h)$ is a delta (indicator) function.
    So far we have removed $P(h=\tilde h \mid e,f)$ and introduced a new term $m_h(e,f)$. Now $e$ and $f$ become dependent.
    By introducing a term for message passing, we are essentially converting a directed graphical model into an undirected graphical model.

    The delta function takes the value 1 at the observed value of $h$ and 0 otherwise.

Now,

$$P(a)P(b)P(c \mid b)P(d \mid a)P(e \mid c,d)P(f \mid a)P(g \mid e)P(h \mid e,f)\\ \rightarrow P(a)P(b)P(c \mid b)P(d \mid a)P(e \mid c,d)P(f \mid a)P(g \mid e)\,m_h(e,f)$$
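As a tiny illustration of the conditioning step, summing a CPT against the delta function is the same as simply slicing the table at the observed value. The binary variables and random CPT below are assumptions for this snippet only.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 2
p_h_ef = rng.random((K, K, K))
p_h_ef /= p_h_ef.sum(axis=-1, keepdims=True)   # P(h | e, f), with h on the last axis

h_obs = 1
delta = np.eye(K)[h_obs]                       # delta(h = h~): 1 at the observed value, 0 otherwise

m_h = np.einsum('efh,h->ef', p_h_ef, delta)    # m_h(e,f) = sum_h P(h|e,f) delta(h = h~)
assert np.allclose(m_h, p_h_ef[:, :, h_obs])   # identical to slicing at the observed value
```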

  • Step 2:
    - Eliminate G
    Since there is no other term that depends on $G$, we can eliminate $G$ without having to introduce any new message factor:
    $$m_g(e)=\sum_g P(g \mid e) = 1$$

  • Step 3:
    - Eliminate F
    When we eliminate $F$, we need to connect $(E, A)$, since the new factor makes them dependent:
    $$m_f(a,e)=\sum_f p(f \mid a)\,m_h(e,f)$$
$$P(a)P(b)P(c \mid b)P(d \mid a)P(e \mid c,d)P(f \mid a)P(g \mid e)P(h \mid e,f)\\ \rightarrow P(a)P(b)P(c \mid b)P(d \mid a)P(e \mid c,d)P(f \mid a)P(g \mid e)\,m_h(e,f)\\ \rightarrow P(a)P(b)P(c \mid b)P(d \mid a)P(e \mid c,d)\,m_f(a,e)$$

  • Step 4:
    - Eliminate E
    $$m_e(a,c,d)=\sum_e p(e \mid c,d)\,m_f(a,e)$$
$$P(a)P(b)P(c \mid b)P(d \mid a)P(e \mid c,d)P(f \mid a)P(g \mid e)P(h \mid e,f)\\ \rightarrow P(a)P(b)P(c \mid b)P(d \mid a)P(e \mid c,d)P(f \mid a)P(g \mid e)\,m_h(e,f)\\ \rightarrow P(a)P(b)P(c \mid b)P(d \mid a)P(e \mid c,d)\,m_f(a,e)\\ \rightarrow P(a)P(b)P(c \mid b)P(d \mid a)\,m_e(a,c,d)$$
  • Step 5:
    - Eliminate D
    $$m_d(a,c)=\sum_d P(d \mid a)\,m_e(a,c,d)$$
$$P(a)P(b)P(c \mid b)P(d \mid a)P(e \mid c,d)P(f \mid a)P(g \mid e)P(h \mid e,f)\\ \rightarrow P(a)P(b)P(c \mid b)P(d \mid a)P(e \mid c,d)P(f \mid a)P(g \mid e)\,m_h(e,f)\\ \rightarrow P(a)P(b)P(c \mid b)P(d \mid a)P(e \mid c,d)\,m_f(a,e)\\ \rightarrow P(a)P(b)P(c \mid b)P(d \mid a)\,m_e(a,c,d)\\ \rightarrow P(a)P(b)P(c \mid b)\,m_d(a,c)$$
  • Step 6:
    - Eliminate C
    $$m_c(a,b)=\sum_c P(c \mid b)\,m_d(a,c)$$

  • Step 7:
    - Eliminate B

    $$m_b(a)=\sum_b p(b)\,m_c(a,b)$$
$$P(a)P(b)P(c \mid b)P(d \mid a)P(e \mid c,d)P(f \mid a)P(g \mid e)P(h \mid e,f)\\ \rightarrow P(a)P(b)P(c \mid b)P(d \mid a)P(e \mid c,d)P(f \mid a)P(g \mid e)\,m_h(e,f)\\ \rightarrow P(a)P(b)P(c \mid b)P(d \mid a)P(e \mid c,d)\,m_f(a,e)\\ \rightarrow P(a)P(b)P(c \mid b)P(d \mid a)\,m_e(a,c,d)\\ \rightarrow P(a)P(b)P(c \mid b)\,m_d(a,c)\\ \rightarrow P(a)P(b)\,m_c(a,b)\\ \rightarrow P(a)\,m_b(a)$$
  • Final Step:
    - Wrap-up
    We have effectively passed messages from $H$ to $A$ via graph elimination. Since the expression we started eliminating from was the joint distribution of all variables in the graph, we can finally compute the conditional distribution $P(a \mid \tilde h)$:
$$p(a,\tilde h)=p(a)\,m_b(a)$$
$$p(\tilde h) = \sum_a p(a)\,m_b(a)$$
$$\rightarrow P(a \mid \tilde h)= \frac{p(a)\,m_b(a)}{\sum_a p(a)\,m_b(a)}$$
  • Complexity Analysis
    If we were to sum out all 6 hidden variables at once, the complexity would be $O(K^6)$, since we are marginalizing over 6 variables jointly. With message passing, however, each elimination step only involves the few variables appearing in its local factors (its elimination clique), so each of the 6 steps costs only a low-order polynomial in $K$ (at most $O(K^4)$ here, for the step producing $m_e(a,c,d)$, and $O(K^2)$ for most of the others), which is far cheaper than $O(K^6)$. A numpy sketch of the full elimination follows below.
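To make the whole elimination concrete, here is a minimal numpy sketch of the same ordering $H,G,F,E,D,C,B$. The $K=2$ state count and the randomly generated CPTs are assumptions for illustration; each einsum line corresponds to one message above.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 2                                   # number of states per variable (assumed)

def cpt(*parent_dims):
    """Random CPT with the child variable on the LAST axis, normalized over it."""
    T = rng.random(parent_dims + (K,))
    return T / T.sum(axis=-1, keepdims=True)

# Factors of the directed model (axis order follows the letters in the comments)
p_a = cpt()              # P(a)
p_b = cpt()              # P(b)
p_c_b = cpt(K)           # P(c|b)     axes (b, c)
p_d_a = cpt(K)           # P(d|a)     axes (a, d)
p_e_cd = cpt(K, K)       # P(e|c,d)   axes (c, d, e)
p_f_a = cpt(K)           # P(f|a)     axes (a, f)
p_g_e = cpt(K)           # P(g|e)     axes (e, g)
p_h_ef = cpt(K, K)       # P(h|e,f)   axes (e, f, h)

h_obs = 1                # observed value of H

# Eliminate H (conditioning), G, F, E, D, C, B in turn
m_h = p_h_ef[:, :, h_obs]                          # m_h(e,f) = P(h~|e,f)
m_g = p_g_e.sum(axis=1)                            # m_g(e) = 1 for every e
m_f = np.einsum('af,ef->ae', p_f_a, m_h)           # m_f(a,e) = sum_f P(f|a) m_h(e,f)
m_e = np.einsum('cde,e,ae->acd', p_e_cd, m_g, m_f) # m_e(a,c,d) = sum_e P(e|c,d) m_g(e) m_f(a,e)
m_d = np.einsum('ad,acd->ac', p_d_a, m_e)          # m_d(a,c) = sum_d P(d|a) m_e(a,c,d)
m_c = np.einsum('bc,ac->ab', p_c_b, m_d)           # m_c(a,b) = sum_c P(c|b) m_d(a,c)
m_b = np.einsum('b,ab->a', p_b, m_c)               # m_b(a) = sum_b P(b) m_c(a,b)

p_a_h = p_a * m_b                                  # p(a, h~)
p_a_given_h = p_a_h / p_a_h.sum()                  # P(a | h~)

# Sanity check against brute-force enumeration of the full joint
joint = np.einsum('a,b,bc,ad,cde,af,eg,efh->abcdefgh',
                  p_a, p_b, p_c_b, p_d_a, p_e_cd, p_f_a, p_g_e, p_h_ef)
brute = joint[..., h_obs].sum(axis=(1, 2, 3, 4, 5, 6))
assert np.allclose(p_a_given_h, brute / brute.sum())
print(p_a_given_h)
```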

Overview of Graph elimination

  • Begin with the undirected GM or the moralized Bayesian network (i.e., convert the directed graph into an undirected graph by connecting ("marrying") the parents of each node and dropping edge directions).

  • Graph $G(V, E)$ and elimination ordering $I$

  • Eliminate the next node in the ordering $I$
    - Remove the node from the graph

    • Connect the remaining neighbors of the node
  • The reconstituted graph $G'(V, E')$
    - Retain the edges that were created during the elimination procedure

    • The graph-theoretic property: the factors produced during variable elimination are captured by recording the elimination cliques. (Notice that the intermediate terms of the form $m_x(Y, Z)$ correspond to the cliques created by elimination.) A small sketch of this procedure is given below.
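A small plain-Python sketch of graph elimination on the moral graph of the running example; the edge list is an assumption derived from the factorization $P(a)P(b)P(c \mid b)P(d \mid a)P(e \mid c,d)P(f \mid a)P(g \mid e)P(h \mid e,f)$, with the parents of $e$ and of $h$ married. Each printed clique matches one of the messages $m_x$ above.

```python
# Moral graph: drop directions and connect ("marry") co-parents.
edges = {('b','c'), ('a','d'), ('c','e'), ('d','e'), ('c','d'),   # c-d: parents of e
         ('a','f'), ('e','g'), ('e','h'), ('f','h'), ('e','f')}   # e-f: parents of h

adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

order = ['h', 'g', 'f', 'e', 'd', 'c', 'b']       # elimination ordering I
for node in order:
    nbrs = adj.pop(node)                          # remaining neighbors of the eliminated node
    print('eliminate', node, '-> clique', sorted({node} | nbrs))
    for u in nbrs:                                # connect the remaining neighbors
        adj[u].discard(node)
        adj[u] |= nbrs - {u}                      # fill-in edges, retained in G'
```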
