[ML&DL] 1. Statistical Learning

KBC · September 6, 2024

Statistical Learning?

Predict a target variable using observed (given) data.

$$\textcolor{blue}{Sales} = f(TV, Radio, Newspaper)$$
  • $\textcolor{blue}{Sales}$ : Target Variable (Response Variable)
  • $TV, Radio, Newspaper$ : so-called Predictors (Features, Inputs)
  • Let's just say $(TV, Radio, Newspaper) = (X_1, X_2, X_3)$
  • Then the input vector can be written as
    $$X = \left(\begin{matrix}X_1\\X_2\\X_3\end{matrix}\right)$$
  • Now we can write our statistical model as
    $$Y = f(X) + \epsilon$$

    where $\epsilon$ captures measurement errors and other discrepancies.
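As a toy illustration (not from the lecture), here is a minimal Python sketch simulating data from this model, assuming a hypothetical true $f$ and Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(0)

def f_true(x):
    # Hypothetical true regression function (an assumption for illustration)
    return 2.0 + 3.0 * np.sin(x)

n = 200
X = rng.uniform(0, 10, size=n)      # observed inputs
eps = rng.normal(0, 0.5, size=n)    # epsilon: measurement error and other discrepancies
Y = f_true(X) + eps                 # responses generated as Y = f(X) + epsilon
```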

What is f(X) good for?

  • With a good $f$, we can make predictions of $Y$ at new points $X = x$.

  • We can understand which components of $X = (X_1, X_2, \cdots, X_p)$ are important in explaining $Y$.

  • There could be extra features like Seniority, Years of Education, Income, etc.

  • Depending on the complexity of $f$, we may be able to understand how each component $X_j$ of $X$ affects $Y$.

  • Is there an ideal $f(X)$?

    What is a good value for $f(X)$ at any selected value of $X$, say $\textcolor{red}{X = 4}$?
    There can be many $Y$ values at $X = 4$.

  • A good value is the expected value (the average):

    $$f(4) = E(Y|X = 4)$$
  • This ideal $f(x) = E(Y|X = x)$ is called the regression function (illustrated by the sketch below).
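A quick sketch of this idea, reusing the toy simulation above: if we could draw many $Y$ values at exactly $X = 4$, their average would recover $f(4)$.

```python
import numpy as np

rng = np.random.default_rng(0)
f_true = lambda x: 2.0 + 3.0 * np.sin(x)   # hypothetical true f (illustration only)

# Draw many Y values at exactly X = 4 and average them.
y_at_4 = f_true(4.0) + rng.normal(0, 0.5, size=100_000)
print(y_at_4.mean())   # ~ f(4) = 2 + 3*sin(4) ≈ -0.27
```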

The regression function f(x)

  • It is also defined for a vector $\underline X$:
    $$f(x) = f(x_1, x_2, x_3) = E(Y|X_1 = x_1, X_2 = x_2, X_3 = x_3)$$
  • It is the ideal or optimal predictor of $Y$ with regard to mean-squared prediction error:
    • $f(x) = E(Y|X = x)$ is the function that minimizes $E[(Y - g(X))^2|X = x]$ over all functions $g$ at all points $X = x$.
  • $\epsilon = Y - f(x)$ is the irreducible error.
    • For any estimate $\hat f(x)$ of $f(x)$, we have
      $$E[(Y - \hat f(x))^2|X = x] = \underbrace{[f(x) - \hat f(x)]^2}_{Reducible} + \underbrace{Var(\epsilon)}_{Irreducible}$$
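To see where this split comes from, expand the squared error, treating $\hat f(x)$ as fixed and assuming $E(\epsilon) = 0$ with $\epsilon$ independent of $X$ (so the cross term vanishes):

$$\begin{aligned} E[(Y - \hat f(x))^2|X = x] &= E[(f(x) + \epsilon - \hat f(x))^2|X = x]\\ &= [f(x) - \hat f(x)]^2 + 2[f(x) - \hat f(x)]\,E(\epsilon) + E(\epsilon^2)\\ &= [f(x) - \hat f(x)]^2 + Var(\epsilon) \end{aligned}$$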

How to estimate f

  • Typically we have few if any data points with $X = 4$ exactly.
  • So we cannot compute $E(Y|X = x)$!
  • Relax the definition and let
    $$\hat f(x) = Ave(Y|X \in \mathcal{N}(x))$$

    where $\mathcal{N}(x)$ is some neighborhood of $x$.

  • Nearest-neighbor averaging can be pretty good for small $p$ (here $p$ is the number of predictors); see the sketch after this list.

    Nearest-neighbor methods can be lousy when $p$ is large.

    Reason: the curse of dimensionality. Nearest neighbors tend to be far away in high dimensions.

    • We need to average a reasonable fraction of the $N$ values of $y_i$ to bring the variance down, e.g., 10%.
    • A 10% neighborhood in high dimensions need no longer be local, so we lose the spirit of estimating $E(Y|X = x)$ by local averaging.
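A minimal sketch of nearest-neighbor averaging on the toy simulated data from earlier (the choice of $k = 15$ is arbitrary, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
f_true = lambda x: 2.0 + 3.0 * np.sin(x)   # hypothetical true f
X = rng.uniform(0, 10, size=200)
Y = f_true(X) + rng.normal(0, 0.5, size=200)

def knn_average(x0, X, Y, k=15):
    """Estimate f(x0) = E(Y|X=x0) by averaging Y over the k nearest neighbors of x0."""
    idx = np.argsort(np.abs(X - x0))[:k]   # indices of the k closest X values
    return Y[idx].mean()

print(knn_average(4.0, X, Y))   # should be close to f_true(4.0) ≈ -0.27
```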

Parametric and structured models

  • The linear model is an important example of a parametric model:
    $$f_L(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p$$
  • We estimate the parameters by fitting the model to training data.
  • Although it is almost never correct, a linear model often serves as a good and interpretable approximation to the unknown true function $f(X)$.
  • A simple linear model $\hat f_L(X) = \hat \beta_0 + \hat \beta_1 X$ gives a reasonable fit here.
  • A quadratic model $\hat f_Q(X) = \hat \beta_0 + \hat \beta_1 X + \hat \beta_2 X^2$ fits slightly better.

    • Simulated example: red points are simulated values for income from the model
      $$\textcolor{red}{income} = f(education, seniority) + \epsilon$$
      where $f$, the true function, is the blue surface.
    • A linear regression model fit to the simulated data:
      $$\hat f_L(education, seniority) = \hat \beta_0 + \hat \beta_1 \times education + \hat \beta_2 \times seniority$$
    • A more flexible regression model $\hat f_S(education, seniority)$ fit to the simulated data.
    • Here we use a technique called a thin-plate spline to fit a flexible surface.
    • We can control the roughness of the fit.

      An even more flexible spline regression model $\hat f_S(education, seniority)$ becomes overfitted (a small fitting sketch follows below).
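A sketch of fitting the linear and quadratic models by ordinary least squares on toy one-dimensional data (numpy's polyfit is just one convenient way to do this):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=200)
Y = 2.0 + 3.0 * np.sin(X) + rng.normal(0, 0.5, size=200)   # toy data, made up here

# Linear model:    f_L(X) = b0 + b1*X       (polyfit returns highest degree first)
b1, b0 = np.polyfit(X, Y, deg=1)

# Quadratic model: f_Q(X) = b0 + b1*X + b2*X^2
b2q, b1q, b0q = np.polyfit(X, Y, deg=2)

print(f"linear:    {b0:.2f} + {b1:.2f}*X")
print(f"quadratic: {b0q:.2f} + {b1q:.2f}*X + {b2q:.2f}*X^2")
```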

Some trade-offs (Interpretability vs. Flexibility)

  • Prediction accuracy versus interpretability
    • Linear models are easy to interpret, but thin-plate splines are not.
  • Good fit versus over-fit or under-fit
    • How do we know when the fit is just right?
  • Parsimony (a simply interpretable model) versus black-box
    • We often prefer a simpler model involving fewer variables over a black-box predictor involving them all.

Assessing Model Accuracy

  • Suppose we fit a model $\hat f(x)$ to some training data
    $Tr = \{x_i, y_i\}^N_1$, and we wish to see how well it performs.
    • The average squared prediction error over $Tr$ is the training error:
      $$MSE_{Tr} = Ave_{i \in Tr}[y_i - \hat f(x_i)]^2$$
  • This may be biased toward more overfit models.
    • Instead we should, if possible, compute the test error using fresh test data $Te = \{x_i, y_i\}^M_1$:
      $$MSE_{Te} = Ave_{i \in Te}[y_i - \hat f(x_i)]^2$$
  • In the accompanying figure, the $\textcolor{green}{green\;line}$ is an overfitted regression model, the $\textcolor{blue}{blue\;line}$ is the ground truth, and the $\textcolor{orange}{orange\;line}$ is a simple linear model (the sketch below computes both errors).
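A minimal sketch computing $MSE_{Tr}$ and $MSE_{Te}$ for polynomial fits of increasing flexibility on toy simulated data (the degrees chosen are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n):
    X = rng.uniform(0, 10, size=n)
    return X, 2.0 + 3.0 * np.sin(X) + rng.normal(0, 0.5, size=n)

X_tr, y_tr = simulate(100)   # training set Tr
X_te, y_te = simulate(100)   # fresh test set Te

for deg in (1, 3, 15):       # increasing flexibility
    coefs = np.polyfit(X_tr, y_tr, deg)
    mse_tr = np.mean((y_tr - np.polyval(coefs, X_tr)) ** 2)
    mse_te = np.mean((y_te - np.polyval(coefs, X_te)) ** 2)
    print(f"degree {deg:2d}: MSE_Tr = {mse_tr:.3f}, MSE_Te = {mse_te:.3f}")
```

Training error keeps falling as the degree grows, while test error eventually rises: exactly the bias toward overfit models described above.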

Bias-Variance Trade-off

  • Suppose we have fit a model $\hat f(x)$ to some training data $Tr$, and let $(x_0, y_0)$ be a test observation drawn from the population.

  • If the true model is $Y = f(X) + \epsilon$ (with $f(x) = E(Y|X = x)$), then

    $$E(y_0 - \hat f(x_0))^2 = Var(\hat f(x_0)) + [Bias(\hat f(x_0))]^2 + Var(\epsilon)$$
    • $Var(\hat f(x_0))$ : variability of the fit across training samples ($\textcolor{blue}{reducible\;error}$)
    • $[Bias(\hat f(x_0))]^2$ : error from the model's limited flexibility ($\textcolor{blue}{reducible\;error}$)
    • $Var(\epsilon)$ : $\textcolor{red}{irreducible\;error}$
  • The expectation averages over the variability of $y_0$ as well as the variability in $Tr$.

    Note that
    $$Bias(\hat f(x_0)) = E[\hat f(x_0)] - f(x_0)$$

  • Typically as the flexibility of $\hat f$ increases, its variance increases, and its bias decreases.

  • So choosing the flexibility based on average test error amounts to a bias-variance trade-off

  • Example
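Here is a simulation sketch of these components: refit a model on many training sets drawn from the same population and measure the variance and squared bias of $\hat f(x_0)$ (all settings are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
f_true = lambda x: 2.0 + 3.0 * np.sin(x)    # hypothetical true f
x0 = 4.0                                    # a fixed test point

def fit_and_predict(deg):
    """Draw a fresh training set, fit a degree-`deg` polynomial, predict at x0."""
    X = rng.uniform(0, 10, size=50)
    Y = f_true(X) + rng.normal(0, 0.5, size=50)
    return np.polyval(np.polyfit(X, Y, deg), x0)

for deg in (1, 5):                            # low vs. higher flexibility
    preds = np.array([fit_and_predict(deg) for _ in range(500)])
    var = preds.var()                         # Var(f_hat(x0))
    bias2 = (preds.mean() - f_true(x0)) ** 2  # [Bias(f_hat(x0))]^2
    print(f"degree {deg}: Var = {var:.4f}, Bias^2 = {bias2:.4f}")
```

The flexible fit should show lower bias but higher variance, matching the trade-off above.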

Classification Problems

  • Here the response variable $Y$ is qualitative, e.g., email is one of $C = (spam, ham)$, where $ham$ means good email.

    Our goals are to:

    • Build a classifier $C(X)$ that assigns a class label from $C$ to a future unlabeled observation $X$.
    • Assess the uncertainty in each classification.
    • Understand the roles of the different predictors among
      $X = (X_1, X_2, \cdots, X_p)$.

  • Is there an ideal $C(X)$? Let
    $$p_k(x) = Pr(Y = k|X = x),\; k = 1, 2, \cdots, K.$$
  • These are the conditional class probabilities at $x$.

    Then the Bayes optimal classifier at $x$ is

    $$C(x) = j \;\text{ if }\; p_j(x) = \max\{p_1(x), p_2(x), \cdots, p_K(x)\}$$

    where $K$ is the number of labels (a minimal sketch follows this list).

  • Nearest-neighbor averaging can be used as before.
  • It also breaks down as the dimension grows. However, the impact on $\hat C(X)$ is less than on $\hat p_k(x),\; k = 1, \cdots, K$.
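A minimal sketch of the Bayes classifier when the true conditional class probabilities $p_k(x)$ are known; the logistic form of $p_k$ here is made up for illustration (in practice the $p_k(x)$ must be estimated):

```python
import numpy as np

LABELS = ["spam", "ham"]   # the classes in C

def p_k(x):
    """Hypothetical true conditional class probabilities at x (K = 2)."""
    p_spam = 1.0 / (1.0 + np.exp(-(x - 5.0)))   # made-up logistic form
    return np.array([p_spam, 1.0 - p_spam])     # [P(spam|x), P(ham|x)]

def bayes_classifier(x):
    """C(x) = j where p_j(x) is the largest conditional probability at x."""
    return LABELS[int(np.argmax(p_k(x)))]

print(bayes_classifier(7.0))   # spam (P(spam|7) ≈ 0.88)
print(bayes_classifier(2.0))   # ham  (P(spam|2) ≈ 0.05)
```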

Classification: some details

  • Typically we measure the performance of $\hat C(x)$ using the misclassification error rate (computed in the sketch below):
    $$Err_{Te} = Ave_{i \in Te} I[y_i \neq \hat C(x_i)]$$
  • The Bayes classifier (using the true $p_k(x)$) has the smallest error (in the population).
  • Support-vector machines build structured models for $C(x)$, etc.
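A sketch computing the test misclassification error rate for a simple 1-nearest-neighbor classifier on toy data (the data-generating setup below is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate(n):
    """Toy 1-D data: class 1 tends to have larger x than class 0."""
    y = rng.integers(0, 2, size=n)
    x = rng.normal(loc=2.0 * y, scale=1.0)
    return x, y

x_tr, y_tr = simulate(200)   # training data
x_te, y_te = simulate(200)   # fresh test data

# 1-nearest-neighbor prediction: label of the closest training point
nearest = np.abs(x_te[:, None] - x_tr[None, :]).argmin(axis=1)
y_hat = y_tr[nearest]

err_te = np.mean(y_hat != y_te)   # Ave over Te of I[y_i != C_hat(x_i)]
print(f"test misclassification error rate: {err_te:.3f}")
```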

All contents are based on the GIST Machine Learning & Deep Learning course (Instructor: Prof. Sundong Kim).
