Probability

Rainy Night for Sapientia · January 13, 2024

Mathematics for AI


Basic properties

1. Sum rule (Marginal probability)

$P(X = x_i) = \sum_{j=1}^{m} P(X = x_i, Y = y_j)$
$p(x) = \int p(x, y)\,dy$ (continuous case)
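
As a quick illustration, here is a minimal numpy sketch of the sum rule; the joint table is a made-up example, not data from the post.

```python
import numpy as np

# Hypothetical joint distribution P(X, Y): rows index x_i, columns index y_j.
joint = np.array([[0.10, 0.20],
                  [0.25, 0.05],
                  [0.15, 0.25]])  # all entries sum to 1

# Sum rule: marginalise Y out by summing over j (and vice versa for X).
p_x = joint.sum(axis=1)  # P(X = x_i) -> [0.3, 0.3, 0.4]
p_y = joint.sum(axis=0)  # P(Y = y_j) -> [0.5, 0.5]
print(p_x, p_y)
```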

2. Product rule

$P(X = x_i, Y = y_j) = P(X = x_i \mid Y = y_j)\,P(Y = y_j)$
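
A small check of the product rule on the same made-up joint table (redefined here so the snippet runs on its own): dividing the joint by a marginal gives the conditional, and multiplying back recovers the joint.

```python
import numpy as np

joint = np.array([[0.10, 0.20],
                  [0.25, 0.05],
                  [0.15, 0.25]])  # made-up P(X, Y) from the sum-rule sketch

p_y = joint.sum(axis=0)                # P(Y = y_j)
cond = joint / p_y                     # P(X = x_i | Y = y_j), column by column
assert np.allclose(cond * p_y, joint)  # product rule: P(X, Y) = P(X | Y) P(Y)
```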

3. Product rule (when $X$ and $Y$ are independent)

$P(X = x_i, Y = y_j) = P(X = x_i)\,P(Y = y_j)$

4. Bayes' theorem

$P(X = x \mid Y = y) = \dfrac{P(Y = y \mid X = x)\,P(X = x)}{P(Y = y)}$
$P(X = x \mid Y = y) = \dfrac{P(Y = y \mid X = x)\,P(X = x)}{\sum_{i=1}^{m} P(Y = y \mid X = x_i)\,P(X = x_i)}$
$\text{Posterior} = \dfrac{\text{Likelihood} \times \text{Prior}}{\text{Normalisation factor}}$
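
A worked example with made-up numbers (a rare condition and an imperfect test), showing the roles of prior, likelihood, and normalisation factor:

```python
import numpy as np

# Made-up numbers: prior P(D=1) = 0.01, sensitivity P(T=1|D=1) = 0.95,
# false-positive rate P(T=1|D=0) = 0.05.
prior = np.array([0.99, 0.01])        # P(D=0), P(D=1)
likelihood = np.array([0.05, 0.95])   # P(T=1 | D=0), P(T=1 | D=1)

evidence = likelihood @ prior              # normalisation factor P(T=1)
posterior = likelihood * prior / evidence  # P(D | T=1)
print(posterior)  # ~[0.839, 0.161]: a positive test still leaves P(D=1) low
```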

5. Expectation

$E[f(x)] = \sum_x f(x)\,P(x)$
$E[f(x)] = \int f(x)\,p(x)\,dx$
$E[f(x)] \approx \frac{1}{N}\sum_{n=1}^{N} f(x_n)$
  • This last (Monte Carlo) approximation becomes exact as $N \to \infty$; see the sketch below.
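
A minimal sketch of the Monte Carlo estimate, using $f(x) = x^2$ with $x \sim \mathcal{N}(0, 1)$, whose exact expectation is $1$:

```python
import numpy as np

rng = np.random.default_rng(0)

# E[x^2] for x ~ N(0, 1) equals Var[x] = 1; the sample mean converges to it.
for n in (100, 10_000, 1_000_000):
    samples = rng.standard_normal(n)
    print(n, (samples ** 2).mean())  # approaches 1.0 as n grows
```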

6. Conditional Expectation

  • It is important to check which random variable the expectation is taken over.
$E_x[x \mid y] = \sum_x x\,P(x \mid y)$
$E_x[f(x) \mid y] = \sum_x f(x)\,P(x \mid y)$
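
Using the same made-up joint table, and assuming (hypothetically) the support $x_i \in \{0, 1, 2\}$, the conditional expectation is just a weighted sum over $P(x \mid y)$:

```python
import numpy as np

joint = np.array([[0.10, 0.20],
                  [0.25, 0.05],
                  [0.15, 0.25]])     # made-up P(X, Y)
x_vals = np.array([0.0, 1.0, 2.0])  # hypothetical support of X

p_y = joint.sum(axis=0)
cond = joint / p_y             # P(x | y), one column per value of y
e_x_given_y = x_vals @ cond    # E_x[x | y] for each y
print(e_x_given_y)
```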

7. Variance

$\sigma^2[f(x)] = \text{Var}[f(x)] = E[\{f(x) - E[f(x)]\}^2]$
$E[\{f(x) - E[f(x)]\}^2] = E[f(x)^2] - E[f(x)]^2$
$\sigma^2[X] = E[X^2] - E[X]^2$

8. Covariance

$\sigma[X, Y] = \text{Cov}[X, Y] = E_{X,Y}[(X - E[X])(Y - E[Y])]$
$= E_{X,Y}[XY] - E[X]\,E[Y]$
  • If $X, Y$ are independent, $\text{Cov}[X, Y] = 0$.

  • $\text{Cov}[X, X] = \text{Var}[X]$
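
Both identities can be checked empirically; a quick numpy sketch with an arbitrary construction $Y = X + \text{noise}$:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(1_000_000)
y = x + 0.5 * rng.standard_normal(1_000_000)  # arbitrary correlated pair

cov_def = ((x - x.mean()) * (y - y.mean())).mean()  # E[(X-E[X])(Y-E[Y])]
cov_alt = (x * y).mean() - x.mean() * y.mean()      # E[XY] - E[X]E[Y]
print(cov_def, cov_alt)         # both ~1.0, since Cov[X, X+noise] = Var[X]
print(np.cov(x, y, bias=True))  # diagonal shows Cov[X, X] = Var[X]
```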

9. Calculation of Expectation

  • When $a, b$ are constants:
    $E[aX + b] = aE[X] + b$
  • Linearity holds even without independence:
    $E_{X, Y}[X + Y] = E[X] + E[Y]$
  • If $X, Y$ are independent:
    $E_{X, Y}[XY] = E[X]\,E[Y]$
  • Rearranging the variance formula above:
    $E[X^2] = \sigma^2[X] + E[X]^2$

10. Calculation of (Co)variances

$\text{Var}[aX + b] = \sigma^2[aX + b] = a^2\,\text{Var}[X]$
$\text{Var}[X + Y] = \text{Var}[X] + \text{Var}[Y] + 2\,\text{Cov}[X, Y]$
$\text{Var}[XY] \approx E[Y]^2\,\text{Var}[X] + E[X]^2\,\text{Var}[Y] + 2\,E[X]E[Y]\,\text{Cov}[X, Y]$ (first-order Taylor approximation)
  • If $X, Y$ are independent (checked numerically below):
$\text{Var}[X + Y] = \text{Var}[X] + \text{Var}[Y]$
$\text{Var}[XY] = E[X]^2\,\text{Var}[Y] + E[Y]^2\,\text{Var}[X] + \text{Var}[X]\,\text{Var}[Y] = E[X^2]E[Y^2] - E[X]^2E[Y]^2$
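
A simulation check of the independent-case identities (the distributions are arbitrary choices: $X \sim \mathcal{N}(2, 1)$, $Y \sim \mathcal{N}(-1, 4)$):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(2.0, 1.0, 1_000_000)   # E[X] = 2,  Var[X] = 1
y = rng.normal(-1.0, 2.0, 1_000_000)  # E[Y] = -1, Var[Y] = 4 (scale = 2)

print(np.var(x + y), np.var(x) + np.var(y))  # both ~5
lhs = np.var(x * y)
rhs = np.mean(x)**2 * np.var(y) + np.mean(y)**2 * np.var(x) + np.var(x) * np.var(y)
print(lhs, rhs)  # both ~21 = 2^2 * 4 + (-1)^2 * 1 + 1 * 4
```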

11. Correlation coefficient

$\rho_{X, Y} = \dfrac{\text{Cov}[X, Y]}{\sqrt{\text{Var}[X]}\sqrt{\text{Var}[Y]}} = \dfrac{\sigma[X, Y]}{\sigma[X]\,\sigma[Y]}$

12. Law of iterated expectation

$E_Y[E_X[X \mid Y]] = E[X]$
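
Iterated expectation, checked on the made-up joint table from above (redefined so the snippet runs standalone): averaging $E_X[X \mid Y]$ over $P(Y)$ recovers $E[X]$.

```python
import numpy as np

joint = np.array([[0.10, 0.20],
                  [0.25, 0.05],
                  [0.15, 0.25]])     # made-up P(X, Y)
x_vals = np.array([0.0, 1.0, 2.0])  # hypothetical support of X

p_y = joint.sum(axis=0)
e_x_given_y = x_vals @ (joint / p_y)       # E_X[X | Y = y] for each y
e_x = x_vals @ joint.sum(axis=1)           # E[X] from the marginal P(X)
assert np.isclose(e_x_given_y @ p_y, e_x)  # E_Y[E_X[X | Y]] = E[X]
```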

13. Independence

$P(X, Y) = P(X)\,P(Y)$
  • Independence $\rightarrow$ uncorrelated ($\text{Cov}[X, Y] = 0$)
  • Uncorrelated $\nrightarrow$ independent
    • The variables can still depend on each other non-linearly; see the example below.
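
A standard counterexample, sketched in numpy: with $X$ symmetric around $0$ and $Y = X^2$, the covariance $E[X^3] - E[X]E[X^2]$ vanishes even though $Y$ is a deterministic function of $X$.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(1_000_000)  # symmetric around 0
y = x ** 2                          # fully determined by x

print(np.cov(x, y)[0, 1])  # ~0: uncorrelated, yet clearly not independent
```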

Distribution properties

1. Univariate Gaussian

  • $x$ is a data point (scalar)
  • $\mu$ is the mean (scalar)
  • $\sigma^2$ is the variance (scalar)
$\mathcal{N}(x \mid \mu, \sigma^2) = \dfrac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\dfrac{1}{2\sigma^2}(x - \mu)^2\right\}$
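
Evaluating the density directly and comparing against scipy.stats.norm (the values $\mu = 1$, $\sigma^2 = 4$, $x = 2.5$ are arbitrary):

```python
import numpy as np
from scipy.stats import norm

mu, var, x = 1.0, 4.0, 2.5  # arbitrary test values
pdf = np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
assert np.isclose(pdf, norm.pdf(x, loc=mu, scale=np.sqrt(var)))
```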

2. Multivariate Gaussian

  • $\mathbf{x}_{d \times 1}$ is a random vector
  • $\boldsymbol{\mu}_{d \times 1}$ is the mean vector, e.g. of the training set $X_{d \times N} = \{\mathbf{x}_1, \mathbf{x}_2, \dotsc, \mathbf{x}_N\}$
  • $\Sigma_{d \times d}$ is the covariance matrix
$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \dfrac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\exp\left\{-\dfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right\}$
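
The multivariate density written out and checked against scipy ($d = 2$, with an arbitrary mean vector, covariance matrix, and query point):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])      # arbitrary mean vector
sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])  # arbitrary covariance matrix
x = np.array([0.5, 0.5])       # arbitrary query point

d = len(mu)
diff = x - mu
quad = diff @ np.linalg.inv(sigma) @ diff  # (x - mu)^T Sigma^{-1} (x - mu)
pdf = np.exp(-0.5 * quad) / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma)))
assert np.isclose(pdf, multivariate_normal.pdf(x, mean=mu, cov=sigma))
```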

3. Bernoulli

  • RV $x \in \{0, 1\}$
  • $\mu$ is the probability that $x = 1$
$\text{Bern}(x \mid \mu) = \mu^x (1 - \mu)^{1 - x}$
  • $E[x] = \mu$
  • $\sigma^2[x] = \mu(1 - \mu)$
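
A one-line simulation check of the Bernoulli moments ($\mu = 0.3$ is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
mu = 0.3
x = rng.binomial(1, mu, 1_000_000)  # Bernoulli = Binomial with N = 1
print(x.mean(), x.var())            # ~0.3 and ~0.21 = mu * (1 - mu)
```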

4. Multinoulli

  • Bernoulli extended to more than two categories
  • RV $y \in \{1, \dotsc, C\}$ for $C$ classes
  • A one-hot vector $\mathbf{y} = [0, \dotsc, y_c = 1, \dotsc, 0]$ is used for convenience.
  • $p_c$ is the probability $p(y = c \mid \mathbf{p})$
  • $\mathbf{p}$ is the vector of class probabilities, with $\sum_{c=1}^C p_c = 1$
$\text{Multinoulli}(\mathbf{y} \mid \mathbf{p}) = \prod_{c=1}^C p_c^{y_c}$

5. Binomial

  • RV $x \in \{0, 1\}$
  • $N$: the number of observations (trials)
  • $m$: RV, the number of times $x = 1$ occurs in $N$ trials
  • $\mu$: the probability that $x = 1$
$\text{Bin}(m \mid N, \mu) = \binom{N}{m} \mu^m (1 - \mu)^{N - m}$
  • $E[m] = N\mu$
  • $\sigma^2[m] = N\mu(1 - \mu)$
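
The binomial pmf by the formula above, checked against scipy ($N = 10$, $\mu = 0.3$, $m = 4$ are arbitrary test values):

```python
import numpy as np
from math import comb
from scipy.stats import binom

N, mu, m = 10, 0.3, 4  # arbitrary test values
pmf = comb(N, m) * mu**m * (1 - mu)**(N - m)  # the formula above
assert np.isclose(pmf, binom.pmf(m, N, mu))   # scipy agrees
print(N * mu, N * mu * (1 - mu))              # E[m] = 3.0, Var[m] = 2.1
```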

6. Multinomial

  • Binomial extended to more than two categories
  • RV $y \in \{1, \dotsc, C\}$ for $C$ classes
  • A one-hot vector $\mathbf{y} = [0, \dotsc, y_c = 1, \dotsc, 0]$ is used for convenience.
  • $p_c$ is the probability $p(y = c \mid \mathbf{p})$
  • $\mathbf{p}$ is the vector of class probabilities, with $\sum_{c=1}^C p_c = 1$
  • $m_c$: RV, the number of times class $c$ occurs in $N$ trials
$\text{Multinomial}(m_1, \dotsc, m_C \mid \mathbf{p}, N) = \binom{N}{m_1 \dotsc m_C} \prod_{c=1}^C p_c^{m_c}$
$\binom{N}{m_1 \dotsc m_C} = \dfrac{N!}{m_1!\,m_2! \dotsm m_C!}$
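
The multinomial pmf by the formula above, checked against scipy ($N = 6$ trials over $C = 3$ classes; the probabilities and counts are arbitrary):

```python
import numpy as np
from math import factorial
from scipy.stats import multinomial

p = np.array([0.5, 0.3, 0.2])  # arbitrary class probabilities, sum to 1
m = np.array([3, 2, 1])        # arbitrary counts; N = sum(m) = 6
N = m.sum()

coeff = factorial(N) / np.prod([factorial(k) for k in m])  # N! / (m_1! ... m_C!)
pmf = coeff * np.prod(p ** m)
assert np.isclose(pmf, multinomial.pmf(m, n=N, p=p))
```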