[ML&DL] 5. Moving Beyond Linearity

KBC · October 22, 2024

Moving Beyond Linearity

  • Often the linearity assumption is good enough
    • When it's not...
      • Polynomials
      • Step functions
      • Splines
      • Local regression
      • Generalized additive models
  • Offer a lot of flexibility, without losing the ease and interpretability of linear models
    y_i=\beta_0+\beta_1x_i+\beta_2x_i^2+\cdots+\beta_dx_i^d+\epsilon_i
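
As a concrete example, a degree-4 polynomial fit of this form could be run in R roughly as follows. This is a minimal sketch, assuming the Wage data from the ISLR package, which matches the wage/age variables used throughout these notes:

```r
library(ISLR)   # assumption: provides the Wage data used in this lecture's examples

# poly() builds an orthogonal polynomial basis of degree 4 in age,
# so the fit is still just multiple linear regression
fit <- lm(wage ~ poly(age, 4), data = Wage)

# fitted values and pointwise standard errors on a grid of age values
age.grid <- seq(min(Wage$age), max(Wage$age))
pred <- predict(fit, newdata = list(age = age.grid), se = TRUE)

# approximate band: fhat(x0) +/- 2 * se[fhat(x0)], as discussed under Details below
bands <- cbind(pred$fit - 2 * pred$se.fit, pred$fit + 2 * pred$se.fit)
```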

Details

  • Create new variables X_1 = X, X_2 = X^2, etc., and then treat as multiple linear regression
  • Not really interested in the coefficients; more interested in the fitted function values at any value x_0
    \hat f(x_0)=\hat\beta_0+\hat\beta_1x_0+\hat\beta_2x_0^2+\hat\beta_3x_0^3+\hat\beta_4x_0^4
  • Since \hat f(x_0) is a linear function of the \hat\beta_l, can get a simple expression for pointwise variances \text{Var}[\hat f(x_0)] at any value x_0
  • In the figure we have computed the fit and pointwise standard errors on a grid of values for x_0
  • We show \hat f(x_0) \pm 2 \times \text{se}[\hat f(x_0)]
  • We either fix the degree d at some reasonably low value, or use cross-validation to choose d
  • Logistic Regression follows naturally
    \text{Pr}(y_i > 250 \mid x_i) = \frac{\exp(\beta_0+\beta_1x_i+\cdots+\beta_dx_i^d)}{1 + \exp(\beta_0+\beta_1x_i+\cdots+\beta_dx_i^d)}
    • To get confidence intervals, compute upper and lower bounds on the logit scale, and then invert to get on probability scale
    • Polynomials have notorious tail behavior: very bad for extrapolation
  • Step Functions
    C_1(X)=I(X<35),\; C_2(X)=I(35\leq X<50),\; C_3(X)=I(X\geq 65)
    • Useful way of creating interactions that are easy to interpret
    • Interaction effect of Year and Age:
      I(Year<2005)\times Age,\; I(Year\geq 2005)\times Age
      would allow for different linear functions in each age category
    • Choice of cutpoints or knots can be problematic
    • For creating nonlinearities, smoother alternatives such as splines are available
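
A rough R sketch of the step-function and interaction ideas above (again assuming the Wage data from the ISLR package; the cut points 35, 50, 65 mirror the example above):

```r
library(ISLR)   # assumption: Wage data with wage, age, year

# cut() creates a factor with one level per interval, i.e. the dummy
# variables C_k(X); lm() then fits a constant within each bin
fit.step <- lm(wage ~ cut(age, breaks = c(-Inf, 35, 50, 65, Inf)), data = Wage)

# step functions also make interactions easy to read:
# a separate linear function of age before and after 2005
fit.int <- lm(wage ~ I(year < 2005) * age, data = Wage)
```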

Piecewise Polynomials

  • Instead of a single polynomial in XX over its whole domain, we can rather use different polynomials in regions defined by knots
    y_i = \begin{cases} \beta_{01} + \beta_{11} x_i + \beta_{21} x_i^2 + \beta_{31} x_i^3 + \epsilon_i & \text{if } x_i < c, \\ \beta_{02} + \beta_{12} x_i + \beta_{22} x_i^2 + \beta_{32} x_i^3 + \epsilon_i & \text{if } x_i \geq c. \end{cases}
  • Better to add constraints to the polynomials, e.g., continuity
  • Splines have the maximum amount of continuity
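
A bare-bones sketch of the unconstrained piecewise cubic above, on simulated data (all names here are illustrative):

```r
set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)
knot <- 5   # the cut point c

# two independent cubic fits, one per region; nothing forces them to
# agree (or even to be continuous) at x = knot, hence the constraints above
fit.left  <- lm(y ~ poly(x, 3, raw = TRUE), subset = x <  knot)
fit.right <- lm(y ~ poly(x, 3, raw = TRUE), subset = x >= knot)
```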

Linear Splines

  • A linear spline with knots at \xi_k, k = 1, \dots, K is a piecewise linear polynomial continuous at each knot
  • We can represent this model as
    y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \cdots + \beta_{K+1} b_{K+1}(x_i) + \epsilon_i,
    where the b_k are basis functions:
    b_1(x_i) = x_i
    b_{k+1}(x_i) = (x_i - \xi_k)_+, \quad k = 1, \dots, K
    Here the (\cdot)_+ means positive part:
    (x_i - \xi_k)_+ = \begin{cases} x_i - \xi_k & \text{if } x_i > \xi_k, \\ 0 & \text{otherwise}. \end{cases}
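
This truncated-power basis can be written out directly. A minimal sketch, again assuming the ISLR Wage data, with illustrative knots at ages 35, 50, 65:

```r
library(splines)
library(ISLR)   # assumption: Wage data, as in the earlier sketches

# positive part (x - xi)_+
pos <- function(x) pmax(x, 0)

# hand-built linear-spline basis: b_1(x) = x, b_{k+1}(x) = (x - xi_k)_+
fit.hand <- lm(wage ~ age + pos(age - 35) + pos(age - 50) + pos(age - 65), data = Wage)

# the same model space via bs() with degree = 1
fit.bs <- lm(wage ~ bs(age, knots = c(35, 50, 65), degree = 1), data = Wage)
```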

Cubic Splines

  • A cubic spline with knots at \xi_k, k = 1, \dots, K is a piecewise cubic polynomial with continuous derivatives up to order 2 at each knot
  • Again we can represent this model with truncated power basis functions
    y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \cdots + \beta_{K+3} b_{K+3}(x_i) + \epsilon_i,
    b_1(x_i) = x_i
    b_2(x_i) = x_i^2
    b_3(x_i) = x_i^3
    b_{k+3}(x_i) = (x_i - \xi_k)^3_+, \quad k = 1, \dots, K
    where
    (x_i - \xi_k)^3_+ = \begin{cases} (x_i - \xi_k)^3 & \text{if } x_i > \xi_k, \\ 0 & \text{otherwise}. \end{cases}
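
In R, the bs() function from the splines package builds this cubic-spline basis directly. A sketch, with the same assumed Wage data and illustrative knots:

```r
library(splines)
library(ISLR)

# cubic spline in age with knots at 25, 40, 60 (bs() uses degree 3 by default)
fit.cs <- lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)

# K = 3 knots -> K + 3 = 6 basis columns (plus the intercept gives K + 4 df)
dim(bs(Wage$age, knots = c(25, 40, 60)))
```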

Natural Cubic Splines

  • A natural cubic spline extrapolates linearly beyond the boundary knots
  • This adds 4 = 2 \times 2 extra constraints, and allows us to put more internal knots for the same degrees of freedom as a regular cubic spline
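
In R, ns() from the splines package fits natural cubic splines. A minimal sketch under the same Wage-data assumption:

```r
library(splines)
library(ISLR)   # assumption: Wage data

# natural cubic spline in age; df = 4 lets ns() pick the knots itself
fit.ncs <- lm(wage ~ ns(age, df = 4), data = Wage)

# beyond the boundary knots the fitted curve is linear, so the pointwise
# standard errors grow more slowly in the tails than for a plain polynomial
pred <- predict(fit.ncs, newdata = list(age = 18:80), se = TRUE)
```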

Knot Placement

  • One strategy is to decide K, the number of knots, and then place them at appropriate quantiles of the observed X
  • A cubic spline with K knots has K+4 parameters or degrees of freedom
  • A natural spline with K knots has K degrees of freedom
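
Quantile-based placement is what the df arguments of bs() and ns() do under the hood; a quick illustrative check (same assumed Wage data):

```r
library(splines)
library(ISLR)

# a cubic bs() basis with df = 6 implies 6 - 3 = 3 interior knots,
# placed at the 25th, 50th and 75th percentiles of age
attr(bs(Wage$age, df = 6), "knots")

# for ns(), df = 4 also implies 3 interior knots (plus the 2 boundary knots)
attr(ns(Wage$age, df = 4), "knots")
```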

Smoothing Splines

  • Consider this criterion for fitting a smooth function g(x)g(x) to some data:
    \text{minimize}_{g \in S} \sum_{i=1}^{n} \left( y_i - g(x_i) \right)^2 + \lambda \int \left( g''(t) \right)^2 dt
  • The first term is RSS, and tries to make g(x) match the data at each x_i
  • The second term is a roughness penalty and controls how wiggly g(x) is
  • It is modulated by the tuning parameter \lambda \geq 0
    • The smaller \lambda, the more wiggly the function
    • Eventually interpolating y_i when \lambda = 0
    • As \lambda \rightarrow \infty, the function g(x) becomes linear

Some issues

  • Smoothing splines avoid the knot-selection issue, leaving a single \lambda to be chosen
  • The vector of n fitted values can be written as \hat g_\lambda = S_\lambda y, where S_\lambda is an n \times n matrix
  • The effective degrees of freedom are given by
    df_{\lambda} = \sum_{i=1}^{n} \{S_{\lambda}\}_{ii}.

Choosing lambda

  • The leave-one-out (LOO) cross-validation error is given by
    RSS_{cv}(\lambda) = \sum_{i=1}^{n} \left( y_i - \hat{g}_{\lambda}^{(-i)}(x_i) \right)^2 = \sum_{i=1}^{n} \left[ \frac{y_i - \hat{g}_{\lambda}(x_i)}{1 - \{S_{\lambda}\}_{ii}} \right]^2.
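
In R, smooth.spline() fits this criterion directly, and cv = TRUE uses this leave-one-out formula to pick \lambda. A minimal sketch on the assumed Wage data:

```r
library(ISLR)   # assumption: Wage data

# specify the roughness by effective degrees of freedom ...
fit.df <- smooth.spline(Wage$age, Wage$wage, df = 16)

# ... or let leave-one-out cross-validation choose lambda
fit.cv <- smooth.spline(Wage$age, Wage$wage, cv = TRUE)
fit.cv$df       # effective degrees of freedom, i.e. the trace of S_lambda
fit.cv$lambda   # the chosen smoothing parameter
```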

Local Regression

  • With a sliding weight function, we fit separate linear fits over the range of X by weighted least squares
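
A sketch of local regression in R with loess(), where span controls the fraction of observations used in each local neighbourhood (same assumed Wage data):

```r
library(ISLR)   # assumption: Wage data

# span = 0.5: each local fit uses roughly half of the observations,
# weighted so that nearby points count more; degree = 1 gives local linear fits
fit.lo <- loess(wage ~ age, span = 0.5, degree = 1, data = Wage)
pred   <- predict(fit.lo, newdata = data.frame(age = 18:80))
```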

Generalized Additive Models

  • Allows for flexible nonlinearities in several variables, but retains the additive structure of linear models

    y_i = \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + \cdots + f_p(x_{ip}) + \epsilon_i.

  • Can fit a GAM simply using, e.g., natural splines:

lm(wage ~ ns(year, df = 5) + ns(age, df = 5) + education)
  • Coefficients not that interesting; fitted functions are
  • Can mix terms - some linear, some nonlinear - and use anova() to compare models
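
The gam package fits the same kind of model with smoothing-spline terms, and anova() makes the model comparison convenient. A sketch, assuming the Wage data and the gam package (the s() terms are smoothing splines with the given df):

```r
library(gam)
library(ISLR)   # assumption: Wage data

# GAM with smoothing splines in year and age, and a step function in education
gam.m3 <- gam(wage ~ s(year, 4) + s(age, 5) + education, data = Wage)
plot(gam.m3, se = TRUE)   # the fitted functions are the interesting output

# mix linear and nonlinear terms and compare the nested models with anova()
gam.m1 <- gam(wage ~ s(age, 5) + education, data = Wage)          # no year
gam.m2 <- gam(wage ~ year + s(age, 5) + education, data = Wage)   # linear in year
anova(gam.m1, gam.m2, gam.m3, test = "F")
```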

GAMs for Classification

\log \left( \frac{p(X)}{1 - p(X)} \right) = \beta_0 + f_1(X_1) + f_2(X_2) + \cdots + f_p(X_p).
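
A logistic GAM of this form can be fitted by adding family = binomial; a sketch with the same assumed Wage data and gam package as above:

```r
library(gam)
library(ISLR)

# GAM for Pr(wage > 250 | year, age, education) on the logit scale
gam.lr <- gam(I(wage > 250) ~ year + s(age, df = 5) + education,
              family = binomial, data = Wage)
plot(gam.lr, se = TRUE)   # each fitted f_j is shown on the logit scale
```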

All contents are based on the GIST Machine Learning & Deep Learning lecture (Instructor: Prof. Sun-dong Kim).
