Lecture video link: https://youtu.be/YQA9lLdLig8
(Source: https://youtu.be/YQA9lLdLig8?t=1m58s)
Sources: $s \in \mathbb{R}^n$.
$s_j^{(i)}$: speaker $j$ at time $i$.
$x^{(i)} = A s^{(i)}$ ($x^{(i)}$: what the microphones record; $A$: the mixing matrix).
E.g., with 5 speakers and 5 microphones, $A$ will be a $5 \times 5$ matrix.
(What happens if # of speakers ≠ # of microphones will be covered later.)
Goal: Find $W = A^{-1}$ (the unmixing matrix).
$s^{(i)} = W x^{(i)}$.
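A minimal numpy sketch of this setup, with a made-up mixing matrix $A$ (the signals and matrix here are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two non-Gaussian sources over m time steps (e.g., two speakers).
m = 1000
S = rng.uniform(-1.0, 1.0, size=(m, 2))   # s^{(i)}: what the speakers say

A = np.array([[1.0, 0.5],                 # hypothetical mixing matrix
              [0.3, 1.0]])
X = S @ A.T                               # x^{(i)} = A s^{(i)}: microphone recordings

# If A were known, the unmixing matrix W = A^{-1} would recover the sources:
W = np.linalg.inv(A)
assert np.allclose(X @ W.T, S)            # s^{(i)} = W x^{(i)}
```

ICA's job is to estimate $W$ from the recordings $x^{(i)}$ alone, without ever seeing $A$.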
Cocktail party problem
(Source: https://youtu.be/YQA9lLdLig8?t=3m4s)
Q. Why is ICA even possible? Given two overlapping voices, how is it even possible to separate them out?
A. 2nd image — initial state. 3rd image — state after the 2nd image passes through the mixing matrix $A$.
What to do? Find an unmixing matrix $W$ that maps this data back to the square.
Q. Why is this example possible?
A. Data in the 2nd image are distributed uniformly between -1 and 1. However, human voices are not distributed uniformly.
If the data were Gaussian, this would not be possible, due to rotational ambiguity.
Formally stated, ICA is possible only if your data is non-Gaussian. So long as your data is non-Gaussian, it is possible to recover the independent sources.
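A short check of the rotational ambiguity (the standard argument, not verbatim from the lecture): with Gaussian sources, the data cannot distinguish $A$ from $A$ times any rotation.

```latex
% If s ~ N(0, I) and x = As, then x ~ N(0, AA^T).
% For any orthogonal R (R R^T = I), the alternative mixing matrix AR gives
% the same distribution, so A is identifiable only up to rotation:
\[
  s \sim \mathcal{N}(0, I),\; x = As
  \;\Rightarrow\; x \sim \mathcal{N}(0, AA^{T}),
  \qquad
  (AR)(AR)^{T} = A R R^{T} A^{T} = A A^{T}.
\]
```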
($s$: a random variable with density $p_s$; $A$: some constant matrix; $x = As$, $W = A^{-1}$)
Relation: $p_x(x) = p_s(Wx)$?
→ Incorrect for continuous probability densities. (This actually works with pmfs (probability mass functions) for discrete probability distributions.)
E.g., $s \sim \mathrm{Uniform}[0, 1]$ and $x = 2s$, so $W = 1/2$: then $p_x$ is uniform on $[0, 2]$ with density $1/2$, not $p_s(Wx) = 1$.
Note that $p_x(x) = p_s(Wx)\,|W|$.
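A worked verification of the corrected relation on the uniform example above:

```latex
% s ~ Uniform[0,1], x = 2s, W = 1/2.
% Wrong guess:  p_x(x) = p_s(x/2) = 1 on [0,2], which integrates to 2 (not a density).
% Correct:      p_x(x) = p_s(Wx)|W| = 1/2 on [0,2], which integrates to 1.
\[
  p_x(x) = p_s(Wx)\,\lvert W\rvert
         = \mathbf{1}\{0 \le x/2 \le 1\}\cdot\tfrac{1}{2}
  \quad\Rightarrow\quad
  \int_0^2 p_x(x)\,dx = 1 .
\]
```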
Our choice of cdf: the sigmoid function $g(s) = \frac{1}{1 + e^{-s}}$, so each source has density $p_s(s_j) = g'(s_j)$.
($p(s) = \prod_{j=1}^{n} p_s(s_j)$, since the $n$ speakers are independently speaking.)
MLE: $\ell(W) = \sum_{i=1}^{m} \left( \sum_{j=1}^{n} \log g'(w_j^T x^{(i)}) + \log |W| \right)$.
SGD: $W := W + \alpha \left( \begin{bmatrix} 1 - 2g(w_1^T x^{(i)}) \\ \vdots \\ 1 - 2g(w_n^T x^{(i)}) \end{bmatrix} x^{(i)T} + (W^T)^{-1} \right)$,
where $W = \begin{bmatrix} \text{---}\; w_1^T \;\text{---} \\ \vdots \\ \text{---}\; w_n^T \;\text{---} \end{bmatrix}$ and $\alpha$ is the learning rate.
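A direct sketch of this update in numpy (the function name and defaults are my own; in practice one anneals $\alpha$ and often whitens the data first):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ica_sgd(X, alpha=0.01, n_epochs=10, seed=0):
    """Stochastic-gradient ICA. X: (m, n) array, m time steps, n microphones.
    Returns the unmixing matrix W; recovered sources are X @ W.T."""
    m, n = X.shape
    rng = np.random.default_rng(seed)
    W = np.eye(n)                                 # initialize W
    for _ in range(n_epochs):
        for i in rng.permutation(m):              # one example at a time
            x = X[i]                              # x^{(i)}, shape (n,)
            g = sigmoid(W @ x)                    # g(w_j^T x^{(i)}) for each row j
            # Lecture's update: W += alpha * ((1 - 2g) x^T + (W^T)^{-1})
            W += alpha * (np.outer(1.0 - 2.0 * g, x) + np.linalg.inv(W.T))
    return W
```

The recovered sources come out only up to permutation and scaling of the true speakers, which is inherent to the problem.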
Q. What is the closest non-linear extension of this (ICA)?
A. We don’t have a great answer to that right now frankly.
There is interesting research on hierarchical versions of sparse coding, a different algorithm that turns out to be very closely related to ICA; you can show that the two are optimizing very similar objectives.
This topic has received less research attention than it really deserves.
(Source: https://youtu.be/YQA9lLdLig8?t=33m15s)
ICA example
(ICA is routinely used to clean up EEG data today.)
What’s an EEG (electroencephalogram)?
→ Place many electrodes on your scalp to measure weak electrical signals on the surface of the scalp.
Your brain handles many tasks at the same time — blinking eyes, regulating heartbeat, breathing, etc.
(Source: https://youtu.be/YQA9lLdLig8?t=35m11s)
(Source: https://youtu.be/YQA9lLdLig8?t=36m44s)
To use an EEG to categorize thoughts at a very coarse level.
(Source: https://youtu.be/YQA9lLdLig8?t=37m52s)
(Source: https://youtu.be/YQA9lLdLig8?t=38m05s)
(Source: https://youtu.be/YQA9lLdLig8?t=39m15s)
What ICA tells us is that the world is made up of edges or patches like the above, and just by adding up these independent components (the way voices add up) you obtain a typical image of the world. There is an interesting neuroscience theory that has some parallels with sparse coding and ICA.
Q. Should the number of microphones equal the number of speakers?
A. If microphones outnumber the people, there is no problem: the extra sources would simply be silent. The opposite case is a cutting-edge research problem. E.g., suppose the two people are a man and a woman and you have only one microphone. The algorithm might separate the two voices, since one pitch is higher and the other lower. However, separating out two male voices or two female voices is still very hard.
Q. Do you ever see a problem with ?
A. I’m sure you can. It’s not usually done in this version of the algorithm, but I would not be surprised if there are some other versions where you do.
Andrew Ng’s comment: K-means clustering, the EM algorithm for mixtures of Gaussians, the factor analysis model, PCA, and ICA. All of these are algorithms that take as input an unlabeled training set, just $x^{(i)}$'s and no labels. We’ve covered three major topics — supervised learning, learning theory, and unsupervised learning.
Assume you need to train a model (computer) to fly a helicopter.
It turns out that it’s very difficult to know what the one right answer is for how to move the control sticks of a helicopter. → It’s hard to use supervised learning for this.
Your job: specify a reward function that just tells the helicopter whether it’s flying well or not.
Credit assignment problem — the problem of determining the actions that lead to a certain outcome.
(https://ai.stackexchange.com/questions/12908/what-is-the-credit-assignment-problem)
MDP (Markov Decision Process)
RL algorithms will solve problems using this formalism.
An MDP is a 5-tuple: $(S, A, \{P_{sa}\}, \gamma, R)$.
$S$ — set of states.
E.g., the set of all possible board positions in chess; the set of all possible positions, orientations, and velocities of a helicopter.
$A$ — set of actions.
E.g., all the moves you can make in chess; all the ways you can move the control sticks of a helicopter.
$\{P_{sa}\}$ — state transition probabilities.
If you take a certain action $a$ in a certain state $s$, what is the chance of ending up in a particular state $s'$?
$\gamma$ — discount factor, $0 \le \gamma < 1$.
$R$ — reward function, $R: S \mapsto \mathbb{R}$ (or sometimes $R: S \times A \mapsto \mathbb{R}$).
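A minimal sketch of this 5-tuple as Python data structures (the names and types here are my own choices, not from the lecture):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = int     # e.g., an index for each cell of the maze below
Action = str    # e.g., "N", "S", "E", "W"

@dataclass
class MDP:
    states: List[State]
    actions: List[Action]
    # P[(s, a)] maps each successor state s' to Pr(s' | s, a)
    P: Dict[Tuple[State, Action], Dict[State, float]]
    gamma: float                       # discount factor, 0 <= gamma < 1
    R: Callable[[State], float]        # reward function R: S -> R
```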
E.g., a simplified MDP in which you have a robot navigating the simple maze below.
(Source: https://youtu.be/YQA9lLdLig8?t=1h3m12s)
→ 11 states, 4 actions: {N, S, E, W}.
E.g., if the robot is in state $(3,1)$ and is commanded N, it moves north with probability 0.8 and veers west or east with probability 0.1 each: $P_{(3,1),N}((3,2)) = 0.8$, $P_{(3,1),N}((2,1)) = 0.1$, $P_{(3,1),N}((4,1)) = 0.1$.
When the robot is commanded to head straight, its wheels may slip, causing it to veer off at a slight angle.
It is actually important to model the noisy dynamics of a robot's wheels slipping slightly or its orientation being slightly off.
To train the model fast?
→ Put a very small negative reward, e.g., $R(s) = -0.02$, for all other states
(i.e., when $s$ is neither the $+1$ nor the $-1$ goal state).
→ Charge it a little bit for using up electricity while wandering around.
Starting from state $s_0$:
Choose action $a_0$.
Get to $s_1 \sim P_{s_0 a_0}$.
Choose action $a_1$.
Get to $s_2 \sim P_{s_1 a_1}$.
…
Total payoff (sum of discounted rewards): $R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots$
$\gamma$ is usually chosen to be just slightly less than $1$, e.g., $\gamma = 0.99$.
The discount factor has the effect of giving smaller weight to rewards in the distant future, which encourages the robot to collect positive rewards sooner.
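A one-function illustration of the total payoff (the helper and the example trajectory are hypothetical):

```python
def total_payoff(rewards, gamma=0.99):
    """Sum of discounted rewards: R(s0) + gamma*R(s1) + gamma^2*R(s2) + ..."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# E.g., wandering for 3 steps at -0.02 each, then reaching the +1 state:
print(total_payoff([-0.02, -0.02, -0.02, 1.0]))  # ≈ 0.9109
```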
2 pragmatic reasons for using $\gamma < 1$:
Time value of money (a reward today is worth a bit more than the same reward tomorrow).
RL algorithms tend to converge much faster when rewards are discounted.
Goal of RL: Choose actions $a_0, a_1, a_2, \ldots$ over time to maximize the expected total payoff $\mathbb{E}\left[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots\right]$.
Policy (controller) $\pi: S \mapsto A$.
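To make the type $\pi: S \mapsto A$ concrete, here is a tiny deterministic policy as a lookup table (the states, actions, and the specific mapping are hypothetical):

```python
# A deterministic policy for an 11-state maze: state index -> action.
policy = {0: "E", 1: "E", 2: "N", 3: "N", 4: "W", 5: "W",
          6: "N", 7: "E", 8: "E", 9: "N", 10: "E"}

def act(state):
    """pi(s): the action the controller takes in state s."""
    return policy[state]
```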
Q. How do you define a state in a game like chess, where the two players alternate moves?
A. In a chess game, one state transition consists of two stages: you make a move and then your opponent makes a move.
Q. Are the state transition probabilities known in advance, or learned?
A. It’s quite common to use data to learn state transition probabilities as well, but we’ll cover this topic later.