Probability

Rainy Night for Sapientia · January 13, 2024

Mathematics for AI


Basic properties

1. Sum rule (Marginal probability)

$P(X = x_i) = \sum_{j=1}^{m} P(X = x_i, Y = y_j)$
$p(x) = \int p(x, y)\,dy$ (continuous case)
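
As a quick illustration, here is a minimal numpy sketch of the sum rule; the joint table is a made-up example, not data from the post.

```python
import numpy as np

# Hypothetical joint distribution P(X, Y): rows index x_i, columns index y_j.
joint = np.array([[0.10, 0.20],
                  [0.25, 0.05],
                  [0.15, 0.25]])  # all entries sum to 1

# Sum rule: marginalise Y out by summing over j (and vice versa for X).
p_x = joint.sum(axis=1)  # P(X = x_i) -> [0.3, 0.3, 0.4]
p_y = joint.sum(axis=0)  # P(Y = y_j) -> [0.5, 0.5]
print(p_x, p_y)
```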

2. Product rule

$P(X = x_i, Y = y_j) = P(X = x_i \mid Y = y_j)\,P(Y = y_j)$
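
A small check of the product rule on the same made-up joint table (redefined here so the snippet runs on its own): dividing the joint by a marginal gives the conditional, and multiplying back recovers the joint.

```python
import numpy as np

joint = np.array([[0.10, 0.20],
                  [0.25, 0.05],
                  [0.15, 0.25]])  # made-up P(X, Y) from the sum-rule sketch

p_y = joint.sum(axis=0)                # P(Y = y_j)
cond = joint / p_y                     # P(X = x_i | Y = y_j), column by column
assert np.allclose(cond * p_y, joint)  # product rule: P(X, Y) = P(X | Y) P(Y)
```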

3. Product rule (when $X$ and $Y$ are independent)

$P(X = x_i, Y = y_j) = P(X = x_i)\,P(Y = y_j)$

4. Bayes' theorem

$P(X = x \mid Y = y) = \dfrac{P(Y = y \mid X = x)\,P(X = x)}{P(Y = y)}$
$P(X = x \mid Y = y) = \dfrac{P(Y = y \mid X = x)\,P(X = x)}{\sum_{i=1}^{m} P(Y = y \mid X = x_i)\,P(X = x_i)}$
$\text{Posterior} = \dfrac{\text{Likelihood} \times \text{Prior}}{\text{Normalisation factor}}$
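
A worked example with made-up numbers (a rare condition and an imperfect test), showing the roles of prior, likelihood, and normalisation factor:

```python
import numpy as np

# Made-up numbers: prior P(D=1) = 0.01, sensitivity P(T=1|D=1) = 0.95,
# false-positive rate P(T=1|D=0) = 0.05.
prior = np.array([0.99, 0.01])        # P(D=0), P(D=1)
likelihood = np.array([0.05, 0.95])   # P(T=1 | D=0), P(T=1 | D=1)

evidence = likelihood @ prior              # normalisation factor P(T=1)
posterior = likelihood * prior / evidence  # P(D | T=1)
print(posterior)  # ~[0.839, 0.161]: a positive test still leaves P(D=1) low
```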

5. Expectation

$E[f(x)] = \sum_x f(x)\,P(x)$
$E[f(x)] = \int f(x)\,p(x)\,dx$
$E[f(x)] \approx \frac{1}{N}\sum_{n=1}^{N} f(x_n)$
  • This last (Monte Carlo) approximation becomes exact as $N \to \infty$; see the sketch below.
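
A minimal sketch of the Monte Carlo estimate, using $f(x) = x^2$ with $x \sim \mathcal{N}(0, 1)$, whose exact expectation is $1$:

```python
import numpy as np

rng = np.random.default_rng(0)

# E[x^2] for x ~ N(0, 1) equals Var[x] = 1; the sample mean converges to it.
for n in (100, 10_000, 1_000_000):
    samples = rng.standard_normal(n)
    print(n, (samples ** 2).mean())  # approaches 1.0 as n grows
```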

6. Conditional Expectation

  • It is important to check which random variable the expectation is taken over.
$E_x[x \mid y] = \sum_x x\,P(x \mid y)$
$E_x[f(x) \mid y] = \sum_x f(x)\,P(x \mid y)$
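
Using the same made-up joint table, and assuming (hypothetically) the support $x_i \in \{0, 1, 2\}$, the conditional expectation is just a weighted sum over $P(x \mid y)$:

```python
import numpy as np

joint = np.array([[0.10, 0.20],
                  [0.25, 0.05],
                  [0.15, 0.25]])     # made-up P(X, Y)
x_vals = np.array([0.0, 1.0, 2.0])  # hypothetical support of X

p_y = joint.sum(axis=0)
cond = joint / p_y             # P(x | y), one column per value of y
e_x_given_y = x_vals @ cond    # E_x[x | y] for each y
print(e_x_given_y)
```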

7. Variance

$\sigma^2[f(x)] = \text{Var}[f(x)] = E[\{f(x) - E[f(x)]\}^2]$
$E[\{f(x) - E[f(x)]\}^2] = E[f(x)^2] - E[f(x)]^2$
$\sigma^2[X] = E[X^2] - E[X]^2$

8. Covariance

$\sigma[X, Y] = \text{Cov}[X, Y] = E_{X,Y}[(X - E[X])(Y - E[Y])]$
$= E_{X,Y}[XY] - E[X]\,E[Y]$
  • If $X, Y$ are independent, $\text{Cov}[X, Y] = 0$.

  • $\text{Cov}[X, X] = \text{Var}[X]$
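
Both identities can be checked empirically; a quick numpy sketch with an arbitrary construction $Y = X + \text{noise}$:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(1_000_000)
y = x + 0.5 * rng.standard_normal(1_000_000)  # arbitrary correlated pair

cov_def = ((x - x.mean()) * (y - y.mean())).mean()  # E[(X-E[X])(Y-E[Y])]
cov_alt = (x * y).mean() - x.mean() * y.mean()      # E[XY] - E[X]E[Y]
print(cov_def, cov_alt)         # both ~1.0, since Cov[X, X+noise] = Var[X]
print(np.cov(x, y, bias=True))  # diagonal shows Cov[X, X] = Var[X]
```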

9. Calculation of Expectation

  • When $a, b$ are constants:
    $E[aX + b] = aE[X] + b$
  • Linearity holds even without independence:
    $E_{X, Y}[X + Y] = E[X] + E[Y]$
  • If $X, Y$ are independent:
    $E_{X, Y}[XY] = E[X]\,E[Y]$
  • Rearranging the variance formula above:
    $E[X^2] = \sigma^2[X] + E[X]^2$

10. Calculation of (Co)variances

$\text{Var}[aX + b] = \sigma^2[aX + b] = a^2\,\text{Var}[X]$
$\text{Var}[X + Y] = \text{Var}[X] + \text{Var}[Y] + 2\,\text{Cov}[X, Y]$
$\text{Var}[XY] \approx E[Y]^2\,\text{Var}[X] + E[X]^2\,\text{Var}[Y] + 2\,E[X]E[Y]\,\text{Cov}[X, Y]$ (first-order Taylor approximation)
  • If $X, Y$ are independent (checked numerically below):
$\text{Var}[X + Y] = \text{Var}[X] + \text{Var}[Y]$
$\text{Var}[XY] = E[X]^2\,\text{Var}[Y] + E[Y]^2\,\text{Var}[X] + \text{Var}[X]\,\text{Var}[Y] = E[X^2]E[Y^2] - E[X]^2E[Y]^2$
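
A simulation check of the independent-case identities (the distributions are arbitrary choices: $X \sim \mathcal{N}(2, 1)$, $Y \sim \mathcal{N}(-1, 4)$):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(2.0, 1.0, 1_000_000)   # E[X] = 2,  Var[X] = 1
y = rng.normal(-1.0, 2.0, 1_000_000)  # E[Y] = -1, Var[Y] = 4 (scale = 2)

print(np.var(x + y), np.var(x) + np.var(y))  # both ~5
lhs = np.var(x * y)
rhs = np.mean(x)**2 * np.var(y) + np.mean(y)**2 * np.var(x) + np.var(x) * np.var(y)
print(lhs, rhs)  # both ~21 = 2^2 * 4 + (-1)^2 * 1 + 1 * 4
```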

11. Correlation coefficient

$\rho_{X, Y} = \dfrac{\text{Cov}[X, Y]}{\sqrt{\text{Var}[X]}\sqrt{\text{Var}[Y]}} = \dfrac{\sigma[X, Y]}{\sigma[X]\,\sigma[Y]}$

12. Law of iterated expectation

$E_Y[E_X[X \mid Y]] = E[X]$
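
Iterated expectation, checked on the made-up joint table from above (redefined so the snippet runs standalone): averaging $E_X[X \mid Y]$ over $P(Y)$ recovers $E[X]$.

```python
import numpy as np

joint = np.array([[0.10, 0.20],
                  [0.25, 0.05],
                  [0.15, 0.25]])     # made-up P(X, Y)
x_vals = np.array([0.0, 1.0, 2.0])  # hypothetical support of X

p_y = joint.sum(axis=0)
e_x_given_y = x_vals @ (joint / p_y)       # E_X[X | Y = y] for each y
e_x = x_vals @ joint.sum(axis=1)           # E[X] from the marginal P(X)
assert np.isclose(e_x_given_y @ p_y, e_x)  # E_Y[E_X[X | Y]] = E[X]
```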

13. Independence

$P(X, Y) = P(X)\,P(Y)$
  • Independence $\rightarrow$ uncorrelated ($\text{Cov}[X, Y] = 0$)
  • Uncorrelated $\nrightarrow$ independent
    • The variables can still depend on each other non-linearly; see the example below.
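
A standard counterexample, sketched in numpy: with $X$ symmetric around $0$ and $Y = X^2$, the covariance $E[X^3] - E[X]E[X^2]$ vanishes even though $Y$ is a deterministic function of $X$.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(1_000_000)  # symmetric around 0
y = x ** 2                          # fully determined by x

print(np.cov(x, y)[0, 1])  # ~0: uncorrelated, yet clearly not independent
```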

Distribution properties

1. Univariate Gaussian

  • $x$ is a data point (scalar)
  • $\mu$ is the mean (scalar)
  • $\sigma^2$ is the variance (scalar)
$\mathcal{N}(x \mid \mu, \sigma^2) = \dfrac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\dfrac{1}{2\sigma^2}(x - \mu)^2\right\}$
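
Evaluating the density directly and comparing against scipy.stats.norm (the values $\mu = 1$, $\sigma^2 = 4$, $x = 2.5$ are arbitrary):

```python
import numpy as np
from scipy.stats import norm

mu, var, x = 1.0, 4.0, 2.5  # arbitrary test values
pdf = np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
assert np.isclose(pdf, norm.pdf(x, loc=mu, scale=np.sqrt(var)))
```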

2. Multivariate Gaussian

  • $\mathbf{x}_{d \times 1}$ is a random vector
  • $\boldsymbol{\mu}_{d \times 1}$ is the mean vector, e.g. of the training set $X_{d \times N} = \{\mathbf{x}_1, \mathbf{x}_2, \dotsc, \mathbf{x}_N\}$
  • $\Sigma_{d \times d}$ is the covariance matrix
$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \dfrac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\exp\left\{-\dfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right\}$
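
The multivariate density written out and checked against scipy ($d = 2$, with an arbitrary mean vector, covariance matrix, and query point):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])      # arbitrary mean vector
sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])  # arbitrary covariance matrix
x = np.array([0.5, 0.5])       # arbitrary query point

d = len(mu)
diff = x - mu
quad = diff @ np.linalg.inv(sigma) @ diff  # (x - mu)^T Sigma^{-1} (x - mu)
pdf = np.exp(-0.5 * quad) / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma)))
assert np.isclose(pdf, multivariate_normal.pdf(x, mean=mu, cov=sigma))
```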

3. Bernoulli

  • RV $x \in \{0, 1\}$
  • $\mu$ is the probability that $x = 1$
$\text{Bern}(x \mid \mu) = \mu^x (1 - \mu)^{1 - x}$
  • $E[x] = \mu$
  • $\sigma^2[x] = \mu(1 - \mu)$
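
A one-line simulation check of the Bernoulli moments ($\mu = 0.3$ is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
mu = 0.3
x = rng.binomial(1, mu, 1_000_000)  # Bernoulli = Binomial with N = 1
print(x.mean(), x.var())            # ~0.3 and ~0.21 = mu * (1 - mu)
```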

4. Multinoulli

  • Bernoulli extended to more than two categories
  • RV $y \in \{1, \dotsc, C\}$ for $C$ classes
  • A one-hot vector $\mathbf{y} = [0, \dotsc, y_c = 1, \dotsc, 0]$ is used for convenience.
  • $p_c$ is the probability $p(y = c \mid \mathbf{p})$
  • $\mathbf{p}$ is the vector of class probabilities, with $\sum_{c=1}^C p_c = 1$
$\text{Multinoulli}(\mathbf{y} \mid \mathbf{p}) = \prod_{c=1}^C p_c^{y_c}$

5. Binomial

  • RV $x \in \{0, 1\}$
  • $N$: the number of observations (trials)
  • $m$: RV, the number of times $x = 1$ occurs in $N$ trials
  • $\mu$: the probability that $x = 1$
$\text{Bin}(m \mid N, \mu) = \binom{N}{m} \mu^m (1 - \mu)^{N - m}$
  • $E[m] = N\mu$
  • $\sigma^2[m] = N\mu(1 - \mu)$
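
The binomial pmf by the formula above, checked against scipy ($N = 10$, $\mu = 0.3$, $m = 4$ are arbitrary test values):

```python
import numpy as np
from math import comb
from scipy.stats import binom

N, mu, m = 10, 0.3, 4  # arbitrary test values
pmf = comb(N, m) * mu**m * (1 - mu)**(N - m)  # the formula above
assert np.isclose(pmf, binom.pmf(m, N, mu))   # scipy agrees
print(N * mu, N * mu * (1 - mu))              # E[m] = 3.0, Var[m] = 2.1
```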

6. Multinomial

  • Binomial extended to more than two categories
  • RV $y \in \{1, \dotsc, C\}$ for $C$ classes
  • A one-hot vector $\mathbf{y} = [0, \dotsc, y_c = 1, \dotsc, 0]$ is used for convenience.
  • $p_c$ is the probability $p(y = c \mid \mathbf{p})$
  • $\mathbf{p}$ is the vector of class probabilities, with $\sum_{c=1}^C p_c = 1$
  • $m_c$: RV, the number of times class $c$ occurs in $N$ trials
$\text{Multinomial}(m_1, \dotsc, m_C \mid \mathbf{p}, N) = \binom{N}{m_1 \dotsc m_C} \prod_{c=1}^C p_c^{m_c}$
$\binom{N}{m_1 \dotsc m_C} = \dfrac{N!}{m_1!\,m_2! \dotsm m_C!}$
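
The multinomial pmf by the formula above, checked against scipy ($N = 6$ trials over $C = 3$ classes; the probabilities and counts are arbitrary):

```python
import numpy as np
from math import factorial
from scipy.stats import multinomial

p = np.array([0.5, 0.3, 0.2])  # arbitrary class probabilities, sum to 1
m = np.array([3, 2, 1])        # arbitrary counts; N = sum(m) = 6
N = m.sum()

coeff = factorial(N) / np.prod([factorial(k) for k in m])  # N! / (m_1! ... m_C!)
pmf = coeff * np.prod(p ** m)
assert np.isclose(pmf, multinomial.pmf(m, n=N, p=p))
```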