Recap

홍성우 · January 26, 2021

Exponential Family

$$p(x;\theta)=\exp\big(\theta^T T(x) - A(\theta)\big)h(x)$$
  • $A(\theta)$ is called the log partition function.
  • $T(x)$ is called the sufficient statistic.
  • $A(\theta)$ is convex.
  • The first derivative of $A(\theta)$ with respect to $\theta$ is the expectation of $T(x)$, and its second derivative with respect to $\theta$ is the variance of $T(x)$.
  • Since $A(\theta)$ is (1) a function of the natural parameter $\theta$ and (2) convex, we can solve the following equation and obtain an expression for the natural parameter $\theta$ in terms of the moment parameter $\mu$ (a quick numerical check follows this list):
    $$\frac{d}{d\theta} A(\theta)=\mu$$
    Solving for $\theta$, we obtain
    $$\theta=\psi(\mu)$$
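
As a quick numerical check, here is a minimal sketch assuming a Bernoulli distribution in natural form, where $T(x)=x$ and $A(\theta)=\log(1+e^\theta)$; the finite-difference derivatives of $A$ should match the mean and variance of $x$.

```python
import numpy as np

# Bernoulli in natural form: p(x; theta) = exp(theta*x - A(theta)), x in {0, 1},
# with log partition function A(theta) = log(1 + exp(theta)).
def A(theta):
    return np.log1p(np.exp(theta))

theta, eps = 0.7, 1e-5

# First derivative of A (finite differences) should equal E[T(x)] = sigmoid(theta).
dA = (A(theta + eps) - A(theta - eps)) / (2 * eps)
mu = 1.0 / (1.0 + np.exp(-theta))      # moment parameter
print(dA, mu)                          # both ~ 0.668

# Second derivative of A should equal Var[T(x)] = mu * (1 - mu).
d2A = (A(theta + eps) - 2 * A(theta) + A(theta - eps)) / eps**2
print(d2A, mu * (1 - mu))              # both ~ 0.222
```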

In short, for a random variable $X$ drawn from a natural exponential family distribution, the moment generating function is given by the following:

	- Note that a natural exponential family distribution has sufficient statistic $T(x)=x$.

Recall that the definition of the Moment Generating Function is as follows:

$$M_{X}(t)=E(\exp(tX))=1+tm_1+ \frac{t^2m_2}{2!} + \dots$$

Applying the moment generating function to $T(x)$, and using the fact that $\int \exp\big((t+\theta)^T T(x) - A(t+\theta)\big)h(x)\,dx = 1$,

$$M_{T(x)}(t)=E\big(\exp(T(x)t)\big)=\int \exp\big(tT(x)\big)\exp\big(\theta^T T(x)-A(\theta)\big)h(x)\,dx=\exp\big(A(t+\theta)-A(\theta)\big).$$

Furthermore,

$$E(T(x))=E(X)=A'(\theta)$$
$$Var(T(x))=Var(X)=A''(\theta)$$
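
For example, take the Poisson distribution with rate $\lambda$: the natural parameter is $\theta=\log\lambda$, the sufficient statistic is $T(x)=x$, and $A(\theta)=e^\theta$. Then $A'(\theta)=A''(\theta)=e^\theta=\lambda$, which matches the familiar fact that the Poisson mean and variance are both $\lambda$.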

  • MLE (Maximum Likelihood Estimation) for the exponential family is the same as moment matching (a numerical sketch follows this list).
    (a) Log likelihood of a generic exponential family: $const + \theta^T\big(\sum T(x_i)\big) - nA(\theta)$
    (b) Taking the gradient w.r.t. $\theta$: $\sum T(x_i) - n \nabla_\theta A(\theta)$
    (c) Setting it equal to zero and solving for $\nabla_\theta A(\theta)$: $\nabla_\theta A(\theta) = \frac 1 n \sum T(x_i) \rightarrow \mu = \frac 1 n \sum T(x_i)$
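
To make this concrete, here is a minimal sketch assuming Bernoulli data, where $T(x)=x$ and $\nabla_\theta A(\theta)$ is the sigmoid: the MLE simply matches $\nabla_\theta A(\theta)$ to the sample mean, and inverting the sigmoid gives the natural parameter.

```python
import numpy as np

# Moment matching for a Bernoulli sample: the MLE sets
# grad A(theta) = (1/n) * sum T(x_i), i.e. sigmoid(theta_hat) = sample mean.
rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=10_000)

mu_hat = x.mean()                             # (1/n) * sum T(x_i)
theta_hat = np.log(mu_hat / (1.0 - mu_hat))   # invert the sigmoid (logit)

# Check: grad A(theta_hat) = sigmoid(theta_hat) recovers the sample mean.
print(mu_hat, 1.0 / (1.0 + np.exp(-theta_hat)))
```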

Bayesian POV

After writing down the likelihood of the data given the natural parameter, we want to pick a prior over the natural parameter and work out the posterior over the natural parameter.

$$P(x|\theta) \propto \exp\big(\theta^T T(x) - A(\theta)\big)$$
$$P(\theta) \propto \exp\big(\lambda^T T(\theta) - A(\lambda)\big)$$
$$P(\theta|x,\lambda) \propto \exp\big(\theta^TT(x) + \lambda^T T(\theta) - A(\theta) - A(\lambda)\big)$$
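
As one concrete instance (assuming a Bernoulli likelihood with its conjugate Beta prior), the posterior stays in the same family and the update just adds the observed sufficient statistics to the prior's pseudo-counts:

```python
import numpy as np

# Beta-Bernoulli conjugate update: the prior pseudo-counts play the role of the
# hyperparameters lambda, and the observed sufficient statistic sum T(x_i) = sum x_i
# is added to them to form the posterior.
rng = np.random.default_rng(1)
x = rng.binomial(1, 0.3, size=50)

a0, b0 = 2.0, 2.0                    # Beta prior hyperparameters
a_post = a0 + x.sum()                # add observed successes
b_post = b0 + len(x) - x.sum()       # add observed failures

print(a_post, b_post, a_post / (a_post + b_post))   # posterior mean of the bias
```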

Generalized linear models (GLIM)

Components

  • Linear predictor: a linear function of the regressors,
    $$\lambda_i=\alpha + \beta_1 X_{i1}+\dots+\beta_kX_{ik}$$
  • Link function: transforms the expectation of the response variable, $\mu_i=E(Y_i)$, to the linear predictor. In other words, the link function linearizes the expectation of the response variable:
    $$g(\mu_i)=\lambda_i=\alpha + \beta_1 X_{i1}+\dots+\beta_kX_{ik}$$
  • Because the link function is invertible, we can write (see the sketch after this list)
    $$\mu_i=g^{-1}(\lambda_i)=g^{-1}(\alpha + \beta_1 X_{i1}+\dots+\beta_kX_{ik})$$
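
A small sketch of the pieces above, assuming the logit link (i.e. logistic regression as a GLIM): $g$ maps the mean response to the linear predictor, and $g^{-1}$ maps it back.

```python
import numpy as np

def g(mu):                  # link function: logit
    return np.log(mu / (1.0 - mu))

def g_inv(lam):             # inverse link: sigmoid
    return 1.0 / (1.0 + np.exp(-lam))

alpha, beta = -0.5, np.array([1.2, -0.7])
x_i = np.array([0.3, 1.0])

lam_i = alpha + beta @ x_i      # linear predictor lambda_i
mu_i = g_inv(lam_i)             # expected response E(Y_i)
print(lam_i, mu_i, g(mu_i))     # g(mu_i) recovers lambda_i
```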

Assumptions

$$Y \sim \text{Exponential Family}$$
$$\lambda = \psi(\mu=f(\epsilon=\theta^Tx))$$

where $Y$ are the responses, $x$ are the fixed inputs, $\theta$ are the parameters we need to learn, and $f$ and $\psi$ give us added flexibility if we want it.

Graphical Representation of GLIMs

We can describe the process as a graph over the following quantities:

  • $W$ are the parameters.
  • $\lambda = W^TX.$
  • $\mu$ is called the standard parameter, and is the mean of $Y|X.$
  • $\theta$ is called the natural parameter, and governs the shape of the density of $Y|X.$ It can be the case that $\mu=\theta$ (e.g., normal), but this need not be true (the Poisson distribution, for example).
  • We aim to model a transformation of the mean $\mu$ by finding $g(\mu)$ that satisfies $\theta=g(\mu)=\lambda(X)=W^TX$ (a small worked instance follows this list).
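
As a worked instance (assuming Poisson regression with the canonical log link), the chain $X \rightarrow \lambda \rightarrow \mu \rightarrow \theta$ looks like this: $\lambda = W^TX$, $\mu = e^{\lambda}$, and $\theta = \log\mu = \lambda$, so the natural parameter coincides with the linear predictor even though $\mu \neq \theta$.

```python
import numpy as np

# Poisson regression with the canonical (log) link:
# linear predictor -> mean -> natural parameter.
W = np.array([0.4, -0.2])
X = np.array([1.0, 2.0])

lam = W @ X              # linear predictor lambda = W^T X
mu = np.exp(lam)         # mean of Y|X (standard parameter)
theta = np.log(mu)       # natural parameter of the Poisson; equals lam here
print(lam, mu, theta)
```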

MLE for Undirected Graphical Models

1. Data

D={x(n)}n=1ND=\{x^{(n)}\}^N_{n=1}

2. Model

In a typical setup for Undirected Graphical Models, the joint distribution is a product of clique potentials with a global normalizer:

$$p(x|\theta)=\frac 1 {Z(\theta)} \prod_{c\in C}\psi_c(x_c)$$

3. Objective

$$\ell(\theta;D)=\sum^N_{n=1}\log p(x^{(n)}|\theta)$$

4. Learning

$$\theta^*=\arg\max_\theta\ \ell(\theta;D)$$

5. Inference

  • Marginal Inference: inference over the variables of interest.
    $$p(x_c)=\sum_{x':x'_c=x_c} p(x'|\theta)$$
  • Partition Function: the normalizing constant of the Gibbs distribution obtained from the factorization.
    $$Z(\theta)=\sum_x \prod_{c \in C}\psi_c(x_c)$$
  • MAP Inference: compute the variable assignment with the highest probability (a brute-force sketch of these quantities follows this list).
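
To make these quantities concrete, here is a toy brute-force sketch assuming a 3-node binary chain MRF with a single shared pairwise potential; exact enumeration is only feasible because the model is tiny.

```python
import itertools
import numpy as np

# Toy 3-node binary chain MRF with edges (0,1) and (1,2) sharing one potential.
psi = np.array([[2.0, 1.0],
                [1.0, 2.0]])

def unnormalized(x):
    # Product of clique (edge) potentials for assignment x.
    return psi[x[0], x[1]] * psi[x[1], x[2]]

states = list(itertools.product([0, 1], repeat=3))

# Partition function: sum of the unnormalized measure over all assignments.
Z = sum(unnormalized(x) for x in states)

# Marginal inference: p(x_0 = 1) by summing over consistent assignments.
p_x0 = sum(unnormalized(x) for x in states if x[0] == 1) / Z

# MAP inference: assignment with the highest (unnormalized) probability.
x_map = max(states, key=unnormalized)
print(Z, p_x0, x_map)
```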
