Lecture 08. Data Splits, Models & Cross-Validation

cryptnomy · Nov 23, 2022

CS229: Machine Learning


Lecture video link: https://youtu.be/rjbkWSTjHzM

Outline

  • Bias / Variance ~ easy to understand, but hard to master
  • Regularization ~ how to reduce variance in learning algorithms
  • Train/dev/test splits
  • Model selection vs. Cross validation

Bias & Variance

Bias

An error from erroneous assumptions in the learning algorithm.

Technically, $\text{bias}=|\text{average model prediction}-\text{ground truth}|$.

Characteristics

  • Failure to capture proper data trends
  • Potential towards underfitting
  • More generalized/overly simplified
  • High error rate

Variance

An error from sensitivity to small fluctuations in the training set.

The changes in the model when using different portions of the training dataset.

Characteristics

  • Noise in the dataset
  • Potential towards overfitting
  • Complex models
  • Trying to put all data points as close as possible

(Ref links:

https://en.wikipedia.org/wiki/Bias–variance_tradeoff,

https://www.bmc.com/blogs/bias-variance-machine-learning/)

Price vs. size of house (ex.)

$\theta_0+\theta_1x$

Underfit

High bias

$\theta_0+\theta_1x+\theta_2x^2$

Just right

$\theta_0+\theta_1x+\cdots+\theta_5x^5$

Overfit

High variance
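
To make this concrete, here is a minimal sketch (my own, on synthetic data, not from the lecture) that fits degree-1, 2, and 5 polynomials to a toy price-vs-size dataset and prints the training error for each:

```python
import numpy as np

# Synthetic "price vs. size" data: a roughly linear trend plus noise.
rng = np.random.default_rng(0)
size = np.sort(rng.uniform(0.5, 3.5, 20))        # house size (1000s of sq. ft.)
price = 50 + 100 * size + rng.normal(0, 30, 20)  # price (arbitrary units)

for degree in (1, 2, 5):
    # np.polyfit fits theta_0 + theta_1 x + ... + theta_d x^d by least squares.
    coeffs = np.polyfit(size, price, deg=degree)
    train_mse = np.mean((np.polyval(coeffs, size) - price) ** 2)
    print(f"degree {degree}: training MSE = {train_mse:.1f}")
```

The training error keeps shrinking as the degree grows, but the degree-5 fit chases the noise between the data points: low bias, high variance.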

One of the ways to prevent overfitting is regularization.

Regularization

Comment: it’ll sound deceptively simple but is one of the techniques that I use most often in many models.

ex. For linear regression,

$\min_\theta\;\frac{1}{2}\sum_{i=1}^m\left\Vert y^{(i)}-\theta^Tx^{(i)}\right\Vert^2+\frac{\lambda}{2}\Vert\theta\Vert^2.$

Large $\lambda$ → underfitting.

$\lambda=0$ → relatively overfitting.
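
For reference, the objective above has the closed-form minimizer $\theta=(X^TX+\lambda I)^{-1}X^Ty$. A minimal sketch (my own; the function name and the one-example-per-row convention for `X` are assumptions, not from the lecture):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize (1/2)||y - X @ theta||^2 + (lam/2)||theta||^2 in closed form."""
    n = X.shape[1]
    # Normal equations with an L2 penalty: (X^T X + lam * I) theta = X^T y.
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
```

With a very large `lam`, $\theta$ is shrunk toward zero (underfitting); with `lam = 0` this reduces to ordinary least squares.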

$\argmax_\theta\;\sum_{i=1}^m \log p\left(y^{(i)}|x^{(i)};\theta\right)-\lambda\Vert\theta\Vert^2$

(cf. in the SVM, the $\min\Vert w\Vert^2$ term plays the same regularizing role).

Q. Have you ever regularized multiple elements of parameters?

A. Not really. Choosing a separate $\lambda_i$ for each $\theta_i$ is about as difficult as just choosing all the parameters in the first place. When we cover cross-validation and model selection, we’ll learn how to choose $\lambda$, but the technique won’t scale to choosing a large number of separate $\lambda_i$’s.

Logistic regression with regularization will usually outperform naive Bayes from a classification-accuracy standpoint. Without regularization, logistic regression will badly overfit the given data. (e.g. $m=100$ e-mails to classify as spam or non-spam; $n=10000$ words.)

Q. Why don’t SVMs suffer too badly? Is it because there are a small number of support vectors, or is it because of minimizing the penalty on $w$?

A. I would say the formal argument relies more on the latter, but actually both. The class of all functions that separate the data with a large margin is a relatively simple class of functions, by which I mean it has low VC dimension, and so any function within that class is unlikely to overfit.

Q. Are the terms underfitting and high bias, or the terms overfitting and high variance interchangeable?

A. Not really. You can imagine a decision boundary with high bias and high variance: a very complicated function which still doesn’t fit your data well for some reason.

$S=\{(x^{(i)},y^{(i)})\}_{i=1}^m$: training set

Assume a prior distribution $p$ over $\theta$ exists. This allows us to treat $\theta$ as a random variable, as in Bayesian statistics.

(ref: https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation)

Recall that

$p(\theta|S)=\frac{p(S|\theta)\,p(\theta)}{p(S)}$

by Bayes' rule. The MAP estimate then becomes

$\begin{aligned}\argmax_\theta p(\theta|S) &= \argmax_\theta p(S|\theta)\,p(\theta) \\ &= \argmax_\theta \left(\prod_{i=1}^m p\left(y^{(i)}|x^{(i)};\theta\right)\right)p(\theta),\end{aligned}$

where we assume $\theta\sim\mathcal{N}(0,\tau^2I)$, i.e.

$p(\theta)=\frac{1}{(2\pi)^{n/2}\,|\tau^2I|^{1/2}}\exp\left(-\frac{1}{2}\theta^T(\tau^2I)^{-1}\theta\right),$ with $n$ the dimension of $\theta$.
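
Taking logs makes the link to the regularized objective above explicit (a standard step filled in here; $C$ collects terms that do not depend on $\theta$):

$\argmax_\theta \log p(\theta|S)=\argmax_\theta\;\sum_{i=1}^m\log p\left(y^{(i)}|x^{(i)};\theta\right)-\frac{1}{2\tau^2}\Vert\theta\Vert^2+C,$

so the Gaussian prior contributes exactly the $\lambda\Vert\theta\Vert^2$ penalty with $\lambda=\frac{1}{2\tau^2}$: a broader prior (larger $\tau$) means weaker regularization.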

cf. Statistics: Frequentist vs. Bayesian?

For frequentist,

$\argmax_\theta p(S;\theta)\quad\text{(MLE)}$

(cf. $\theta$ is a parameter here.)

For Bayesian,

$\argmax_\theta p(\theta|S)\quad\text{(MAP)}$

where MAP stands for maximum a posteriori. (cf. $\theta$ is treated as a random variable here.)

Q. Can differences between these two (MLE and MAP) be seen as “regularized” vs. “non-regularized”?

A. Yes. MLE corresponds to the non-regularized procedure, and the latter (Bayesian MAP) procedure corresponds to adding regularization.

(Further description of differences between frequentist and Bayesian:

https://www.redjournal.org/article/S0360-3016(21)03256-9/fulltext#seccesectitle0004)

Error vs. Model complexity

Model complexity e.g. degree of polynomial

→ The higher the degree of the polynomial, the lower the training error; the dev/generalization error, however, first decreases and then increases again (a U-shaped curve).

e.g.

$\lambda=10^{100}$ → underfit

$\lambda=0$ → overfit

(Source: cs229-notes.pdf — Google it.)

Train/dev/test sets

$S\rightarrow S_{\text{train}},\,S_{\text{dev}},\,S_{\text{test}}$

  • Train each model $i$ (e.g. each choice of polynomial degree) on $S_{\text{train}}$. Get some hypothesis $h_i$.
  • Measure the error on $S_{\text{dev}}$. Pick the model with the lowest error on $S_{\text{dev}}$. (Why not on $S_{\text{train}}$? To avoid overfitting.)
  • Optional: Evaluate the chosen model on $S_{\text{test}}$ and report that error.

Comment: reporting the dev error isn’t really a valid unbiased estimate of generalization error, because the dev set was used to pick the model.

Split ratio?

→ train : test = 7 : 3; train : dev : test = 6 : 2 : 2 (great)
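
A minimal sketch of such a split (my own; it assumes the examples are i.i.d. and stored as NumPy arrays, so a plain random shuffle is valid; the function name and the 60/20/20 default are mine):

```python
import numpy as np

def train_dev_test_split(X, y, ratios=(0.6, 0.2, 0.2), seed=0):
    """Randomly shuffle the indices, then cut into train/dev/test pieces."""
    m = X.shape[0]
    idx = np.random.default_rng(seed).permutation(m)
    n_train = int(ratios[0] * m)
    n_dev = int(ratios[1] * m)
    tr, dv, te = np.split(idx, [n_train, n_train + n_dev])
    return (X[tr], y[tr]), (X[dv], y[dv]), (X[te], y[te])
```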

(Simple) Hold-out cross-validation

“development set” = “cross-validation set”

For small datasets…

ex. $m=100$:

70 examples for $S_{\text{train}}$, 30 for $S_{\text{dev}}$.

$k$-fold CV (cross-validation)

$k=10$ is typical.

Divide the data into $k$ roughly equal pieces. For $i=1,\cdots,k$:

Train (fit parameters) on the other $k-1$ pieces.

Test on the remaining piece.

Average.

Optional: refit the model on all 100% of data.
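
A minimal sketch of this loop (my own; `fit` and `error` are hypothetical callables you would supply: `fit(X, y)` returns a trained model, and `error(model, X, y)` returns a scalar error):

```python
import numpy as np

def k_fold_cv(X, y, fit, error, k=10, seed=0):
    m = X.shape[0]
    # Shuffle the indices once, then cut them into k (roughly) equal pieces.
    folds = np.array_split(np.random.default_rng(seed).permutation(m), k)
    errs = []
    for i in range(k):
        dev_idx = folds[i]                                     # held-out piece
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])  # other k-1 pieces
        model = fit(X[train_idx], y[train_idx])
        errs.append(error(model, X[dev_idx], y[dev_idx]))
    return float(np.mean(errs))
```

Running this once per candidate model (or per candidate $\lambda$) and keeping the one with the lowest average error is the model-selection use of CV.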

Leave-one-out CV

$k=m$.

You need to fit your model $m$ times → the huge downside (very, very expensive).

You never do this unless $m$ is very small (e.g. $m\le100$, where you could consider it).

Q. How do you sample the data in these sets?

A. In this lecture, we assume all your data comes from the same distribution, so we just randomly shuffle.

More references: mlyearning.org (Machine Learning Yearning) or CS230, for the case where the train and test sets come from different distributions.

Feature selection

Start with $\mathcal{F}=\emptyset$.

Repeat {

  1. Try adding each feature $i$ to $\mathcal{F}$ and see which single-feature addition most improves the dev-set performance.
  2. Add that feature to $\mathcal{F}$.

}
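
A minimal sketch of this forward-search loop (my own; `dev_error` is a hypothetical callable that trains on $S_{\text{train}}$ using only the features in `F` and returns the dev-set error):

```python
def forward_selection(all_features, dev_error, max_features=None):
    F = set()                          # start with the empty feature set
    best_err = dev_error(F)
    while max_features is None or len(F) < max_features:
        candidates = [f for f in all_features if f not in F]
        if not candidates:
            break
        # Try adding each remaining feature; keep the best single addition.
        errs = {f: dev_error(F | {f}) for f in candidates}
        best_f = min(errs, key=errs.get)
        if errs[best_f] >= best_err:   # no improvement on the dev set -> stop
            break
        F.add(best_f)
        best_err = errs[best_f]
    return F
```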
