[ML&DL] 1. Statistical Learning

KBC · September 6, 2024

Statistical Learning?

Predict a target variable using observed (given) data.

$$\textcolor{blue}{Sales} = f(TV, Radio, Newspaper)$$
  • $\textcolor{blue}{Sales}$ : Target Variable (Response Variable)
  • $TV, Radio, Newspaper$ : so-called Predictors (Features, Inputs)
  • Let's just say $(TV, Radio, Newspaper) = (X_1, X_2, X_3)$
  • Then the input vector can be written as
    $$X = \left(\begin{matrix}X_1\\X_2\\X_3\end{matrix}\right)$$
  • Now we can write our statistical model as
    $$Y = f(X) + \epsilon$$

    where $\epsilon$ captures measurement errors and other discrepancies.
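As a toy illustration (not from the lecture), here is a minimal Python sketch simulating data from this model, assuming a hypothetical true $f$ and Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(0)

def f_true(x):
    # Hypothetical true regression function (an assumption for illustration)
    return 2.0 + 3.0 * np.sin(x)

n = 200
X = rng.uniform(0, 10, size=n)      # observed inputs
eps = rng.normal(0, 0.5, size=n)    # epsilon: measurement error and other discrepancies
Y = f_true(X) + eps                 # responses generated as Y = f(X) + epsilon
```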

What is f(X) good for?

  • With a good $f$, we can make predictions of $Y$ at new points $X = x$.

  • We can understand which components of $X = (X_1, X_2, \cdots, X_p)$ are important in explaining $Y$.

  • There could be extra features like Seniority, Years of Education, Income, etc.

  • Depending on the complexity of $f$, we may be able to understand how each component $X_j$ of $X$ affects $Y$.

  • Is there an ideal $f(X)$?

    What is a good value for $f(X)$ at any selected value of $X$, say $\textcolor{red}{X = 4}$?
    There can be many $Y$ values at $X = 4$.

  • A good value is the expected value (the average):

    $$f(4) = E(Y|X = 4)$$
  • This ideal $f(x) = E(Y|X = x)$ is called the regression function (illustrated by the sketch below).
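A quick sketch of this idea, reusing the toy simulation above: if we could draw many $Y$ values at exactly $X = 4$, their average would recover $f(4)$.

```python
import numpy as np

rng = np.random.default_rng(0)
f_true = lambda x: 2.0 + 3.0 * np.sin(x)   # hypothetical true f (illustration only)

# Draw many Y values at exactly X = 4 and average them.
y_at_4 = f_true(4.0) + rng.normal(0, 0.5, size=100_000)
print(y_at_4.mean())   # ~ f(4) = 2 + 3*sin(4) ≈ -0.27
```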

The regression function f(x)

  • It is also defined for a vector $\underline X$:
    $$f(x) = f(x_1, x_2, x_3) = E(Y|X_1 = x_1, X_2 = x_2, X_3 = x_3)$$
  • It is the ideal or optimal predictor of $Y$ with regard to mean-squared prediction error:
    • $f(x) = E(Y|X = x)$ is the function that minimizes $E[(Y - g(X))^2|X = x]$ over all functions $g$ at all points $X = x$.
  • $\epsilon = Y - f(x)$ is the irreducible error.
    • For any estimate $\hat f(x)$ of $f(x)$, we have
      $$E[(Y - \hat f(x))^2|X = x] = \underbrace{[f(x) - \hat f(x)]^2}_{Reducible} + \underbrace{Var(\epsilon)}_{Irreducible}$$
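To see where this split comes from, expand the squared error, treating $\hat f(x)$ as fixed and assuming $E(\epsilon) = 0$ with $\epsilon$ independent of $X$ (so the cross term vanishes):

$$\begin{aligned} E[(Y - \hat f(x))^2|X = x] &= E[(f(x) + \epsilon - \hat f(x))^2|X = x]\\ &= [f(x) - \hat f(x)]^2 + 2[f(x) - \hat f(x)]\,E(\epsilon) + E(\epsilon^2)\\ &= [f(x) - \hat f(x)]^2 + Var(\epsilon) \end{aligned}$$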

How to estimate f

  • Typically we have few if any data points with $X = 4$ exactly.
  • So we cannot compute $E(Y|X = x)$!
  • Relax the definition and let
    $$\hat f(x) = Ave(Y|X \in \mathcal{N}(x))$$

    where $\mathcal{N}(x)$ is some neighborhood of $x$.

  • Nearest-neighbor averaging can be pretty good for small $p$ (here $p$ is the number of predictors); see the sketch after this list.

    Nearest-neighbor methods can be lousy when $p$ is large.

    Reason: the curse of dimensionality. Nearest neighbors tend to be far away in high dimensions.

    • We need to average a reasonable fraction of the $N$ values of $y_i$ to bring the variance down, e.g., 10%.
    • A 10% neighborhood in high dimensions need no longer be local, so we lose the spirit of estimating $E(Y|X = x)$ by local averaging.
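A minimal sketch of nearest-neighbor averaging on the toy simulated data from earlier (the choice of $k = 15$ is arbitrary, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
f_true = lambda x: 2.0 + 3.0 * np.sin(x)   # hypothetical true f
X = rng.uniform(0, 10, size=200)
Y = f_true(X) + rng.normal(0, 0.5, size=200)

def knn_average(x0, X, Y, k=15):
    """Estimate f(x0) = E(Y|X=x0) by averaging Y over the k nearest neighbors of x0."""
    idx = np.argsort(np.abs(X - x0))[:k]   # indices of the k closest X values
    return Y[idx].mean()

print(knn_average(4.0, X, Y))   # should be close to f_true(4.0) ≈ -0.27
```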

Parametric and structured models

  • The linear model is an important example of a parametric model:
    $$f_L(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p$$
  • We estimate the parameters by fitting the model to training data.
  • Although it is almost never correct, a linear model often serves as a good and interpretable approximation to the unknown true function $f(X)$.
  • A simple linear model $\hat f_L(X) = \hat \beta_0 + \hat \beta_1 X$ gives a reasonable fit here.
  • A quadratic model $\hat f_Q(X) = \hat \beta_0 + \hat \beta_1 X + \hat \beta_2 X^2$ fits slightly better.

    • Simulated example: red points are simulated values for income from the model
      $$\textcolor{red}{income} = f(education, seniority) + \epsilon$$
      where $f$, the true function, is the blue surface.
    • A linear regression model fit to the simulated data:
      $$\hat f_L(education, seniority) = \hat \beta_0 + \hat \beta_1 \times education + \hat \beta_2 \times seniority$$
    • A more flexible regression model $\hat f_S(education, seniority)$ fit to the simulated data.
    • Here we use a technique called a thin-plate spline to fit a flexible surface.
    • We can control the roughness of the fit.

      An even more flexible spline regression model $\hat f_S(education, seniority)$ becomes overfitted (a small fitting sketch follows below).
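A sketch of fitting the linear and quadratic models by ordinary least squares on toy one-dimensional data (numpy's polyfit is just one convenient way to do this):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=200)
Y = 2.0 + 3.0 * np.sin(X) + rng.normal(0, 0.5, size=200)   # toy data, made up here

# Linear model:    f_L(X) = b0 + b1*X       (polyfit returns highest degree first)
b1, b0 = np.polyfit(X, Y, deg=1)

# Quadratic model: f_Q(X) = b0 + b1*X + b2*X^2
b2q, b1q, b0q = np.polyfit(X, Y, deg=2)

print(f"linear:    {b0:.2f} + {b1:.2f}*X")
print(f"quadratic: {b0q:.2f} + {b1q:.2f}*X + {b2q:.2f}*X^2")
```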

Some trade-offs (Interpretability vs. Flexibility)

  • Prediction accuracy versus interpretability
    • Linear models are easy to interpret, but thin-plate splines are not.
  • Good fit versus over-fit or under-fit
    • How do we know when the fit is just right?
  • Parsimony (a simply interpretable model) versus black-box
    • We often prefer a simpler model involving fewer variables over a black-box predictor involving them all.

Assessing Model Accuracy

  • Suppose we fit a model $\hat f(x)$ to some training data
    $Tr = \{x_i, y_i\}^N_1$, and we wish to see how well it performs.
    • The average squared prediction error over $Tr$ is the training error:
      $$MSE_{Tr} = Ave_{i \in Tr}[y_i - \hat f(x_i)]^2$$
  • This may be biased toward more overfit models.
    • Instead we should, if possible, compute the test error using fresh test data $Te = \{x_i, y_i\}^M_1$:
      $$MSE_{Te} = Ave_{i \in Te}[y_i - \hat f(x_i)]^2$$
  • In the accompanying figure, the $\textcolor{green}{green\;line}$ is an overfitted regression model, the $\textcolor{blue}{blue\;line}$ is the ground truth, and the $\textcolor{orange}{orange\;line}$ is a simple linear model (the sketch below computes both errors).
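A minimal sketch computing $MSE_{Tr}$ and $MSE_{Te}$ for polynomial fits of increasing flexibility on toy simulated data (the degrees chosen are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n):
    X = rng.uniform(0, 10, size=n)
    return X, 2.0 + 3.0 * np.sin(X) + rng.normal(0, 0.5, size=n)

X_tr, y_tr = simulate(100)   # training set Tr
X_te, y_te = simulate(100)   # fresh test set Te

for deg in (1, 3, 15):       # increasing flexibility
    coefs = np.polyfit(X_tr, y_tr, deg)
    mse_tr = np.mean((y_tr - np.polyval(coefs, X_tr)) ** 2)
    mse_te = np.mean((y_te - np.polyval(coefs, X_te)) ** 2)
    print(f"degree {deg:2d}: MSE_Tr = {mse_tr:.3f}, MSE_Te = {mse_te:.3f}")
```

Training error keeps falling as the degree grows, while test error eventually rises: exactly the bias toward overfit models described above.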

Bias-Variance Trade-off

  • Suppose we have fit a model $\hat f(x)$ to some training data $Tr$, and let $(x_0, y_0)$ be a test observation drawn from the population.

  • If the true model is $Y = f(X) + \epsilon$ (with $f(x) = E(Y|X = x)$), then

    $$E(y_0 - \hat f(x_0))^2 = Var(\hat f(x_0)) + [Bias(\hat f(x_0))]^2 + Var(\epsilon)$$
    • $Var(\hat f(x_0))$ : variability of the fit across training samples ($\textcolor{blue}{reducible\;error}$)
    • $[Bias(\hat f(x_0))]^2$ : error from the model's limited flexibility ($\textcolor{blue}{reducible\;error}$)
    • $Var(\epsilon)$ : $\textcolor{red}{irreducible\;error}$
  • The expectation averages over the variability of $y_0$ as well as the variability in $Tr$.

    Note that
    $$Bias(\hat f(x_0)) = E[\hat f(x_0)] - f(x_0)$$

  • Typically as the flexibility of $\hat f$ increases, its variance increases, and its bias decreases.

  • So choosing the flexibility based on average test error amounts to a bias-variance trade-off

  • Example
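Here is a simulation sketch of these components: refit a model on many training sets drawn from the same population and measure the variance and squared bias of $\hat f(x_0)$ (all settings are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
f_true = lambda x: 2.0 + 3.0 * np.sin(x)    # hypothetical true f
x0 = 4.0                                    # a fixed test point

def fit_and_predict(deg):
    """Draw a fresh training set, fit a degree-`deg` polynomial, predict at x0."""
    X = rng.uniform(0, 10, size=50)
    Y = f_true(X) + rng.normal(0, 0.5, size=50)
    return np.polyval(np.polyfit(X, Y, deg), x0)

for deg in (1, 5):                            # low vs. higher flexibility
    preds = np.array([fit_and_predict(deg) for _ in range(500)])
    var = preds.var()                         # Var(f_hat(x0))
    bias2 = (preds.mean() - f_true(x0)) ** 2  # [Bias(f_hat(x0))]^2
    print(f"degree {deg}: Var = {var:.4f}, Bias^2 = {bias2:.4f}")
```

The flexible fit should show lower bias but higher variance, matching the trade-off above.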

Classification Problems

  • Here the response variable $Y$ is qualitative, e.g., email is one of $C = (spam, ham)$, where $ham$ means good email.

    Our goals are to:

    • Build a classifier $C(X)$ that assigns a class label from $C$ to a future unlabeled observation $X$.
    • Assess the uncertainty in each classification.
    • Understand the roles of the different predictors among
      $X = (X_1, X_2, \cdots, X_p)$.

  • Is there an ideal $C(X)$? Let
    $$p_k(x) = Pr(Y = k|X = x),\; k = 1, 2, \cdots, K.$$
  • These are the conditional class probabilities at $x$.

    Then the Bayes optimal classifier at $x$ is

    $$C(x) = j \;\text{ if }\; p_j(x) = \max\{p_1(x), p_2(x), \cdots, p_K(x)\}$$

    where $K$ is the number of labels (a minimal sketch follows this list).

  • Nearest-neighbor averaging can be used as before.
  • It also breaks down as the dimension grows. However, the impact on $\hat C(X)$ is less than on $\hat p_k(x),\; k = 1, \cdots, K$.
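A minimal sketch of the Bayes classifier when the true conditional class probabilities $p_k(x)$ are known; the logistic form of $p_k$ here is made up for illustration (in practice the $p_k(x)$ must be estimated):

```python
import numpy as np

LABELS = ["spam", "ham"]   # the classes in C

def p_k(x):
    """Hypothetical true conditional class probabilities at x (K = 2)."""
    p_spam = 1.0 / (1.0 + np.exp(-(x - 5.0)))   # made-up logistic form
    return np.array([p_spam, 1.0 - p_spam])     # [P(spam|x), P(ham|x)]

def bayes_classifier(x):
    """C(x) = j where p_j(x) is the largest conditional probability at x."""
    return LABELS[int(np.argmax(p_k(x)))]

print(bayes_classifier(7.0))   # spam (P(spam|7) ≈ 0.88)
print(bayes_classifier(2.0))   # ham  (P(spam|2) ≈ 0.05)
```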

Classification: some details

  • Typically we measure the performance of $\hat C(x)$ using the misclassification error rate (computed in the sketch below):
    $$Err_{Te} = Ave_{i \in Te} I[y_i \neq \hat C(x_i)]$$
  • The Bayes classifier (using the true $p_k(x)$) has the smallest error (in the population).
  • Support-vector machines build structured models for $C(x)$, etc.
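A sketch computing the test misclassification error rate for a simple 1-nearest-neighbor classifier on toy data (the data-generating setup below is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate(n):
    """Toy 1-D data: class 1 tends to have larger x than class 0."""
    y = rng.integers(0, 2, size=n)
    x = rng.normal(loc=2.0 * y, scale=1.0)
    return x, y

x_tr, y_tr = simulate(200)   # training data
x_te, y_te = simulate(200)   # fresh test data

# 1-nearest-neighbor prediction: label of the closest training point
nearest = np.abs(x_te[:, None] - x_tr[None, :]).argmin(axis=1)
y_hat = y_tr[nearest]

err_te = np.mean(y_hat != y_te)   # Ave over Te of I[y_i != C_hat(x_i)]
print(f"test misclassification error rate: {err_te:.3f}")
```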

All contents are based on the GIST Machine Learning & Deep Learning course (Instructor: Prof. Sundong Kim).
