지도학습 | Supervised learning 이란? (SVM, ANN, Ensemble)
- 본 포스팅은 인공지능의 지도학습 개념과 그 종류 (선형모델, SVM, ANN) 에 대한 내용을 포함하고 있습니다.
- Keyword : Supervised learning, overfitting, underfitting, linear model, SVM, ANN, ensemble
- 👉
Click
지도학습 Supervised learning
-
Given a set of labeled examples (x1,y1),...,(xN,yN), learn a mapping function g : X ➔ Y, s.t. given an unseen sample x', associated output y' is predicted.
-
Supervised learning relies on the sizes of dataset; what if we have no sufficient data?
- Data augmentation, learning from insufficient labels (weak supervision)
-
What if the data properties are different between datasets?
- Domain adaptation, transfer learning
- X=Rd is an input space
- X=Rd : a d-dimensional Euclidean space
- Input vector x∈X:x=(x1,...,xd)
- Y is an output space (binary decision)
- We want to approximate a target function f
- f : X ➔ Y (unknown ideal function)
- Data (x1,y1),...,(xN,yN); dataset where yN=f(XN)
- Correct label is ready for a training set
- Hypothesis g:X➔Y (ML model to approximate f) : g∈H
- Learning model : feature selection, model selection, optimization
Model generalization
- Learning is an ill-posed problem; data is limited to find a unique solution
- Generalization (goal) : a model needs to perform well on unseen data
- Generalization error Egen; the goal is to minimize this error, but it is impractical to compute in the real world
- Use training/validation/test set errors for the proxy
Errors
- Pointwise error is measured on an each input sample : e(h(x),y)
- From a pointwise error to overall errors: E[h(xi)−yi)2]
- If an input sample is chosen from training, validation, and testing datasets, the errors are called a training error (Etrain), a validation error (Eval), and a testing error (Etest).
- Training error Etrain measured on a training set, which may or may not represent Egen; used for fitting a model
- Testing error Etest (not used in training), which can be used for a proxy of Egen.
- Goal : Etest ≈ Egen ≈ 0
Overfitting and Underfitting
- Underfitting problem because of using too simpler model than actual data distribution (high bias)
- Overfitting problem because of more complex model than actual data distribution (high variance)
- Avoid overfitting
- Problem : In today's ML problems, a complex model tends to be used to handle high-dimensional data (and relatively insufficient number of data); prone to an overfitting problem
- Curse of dimension : Will you increase the dimension of the data to improve the performance as well as maintain the density of the examples per bin? If so, you need to increase the data exponentially.
- Remedy : Data augmentation, regularization, ensemble
Bias and Variance
- Bias : error because the model can not represent the concept
- Variance : error because a model overreacts to small changes (noise) in the training data
➔ Total loss = Bias + Variance (+ noise)
➔ Bias-variance trade-off : the two objective have trade-off between approximation and generalization w.r.t model complexity
Cross-validation (CV)
- CV allows a better model to avoid overfitting (but more complexity)
Linear Regression
- Hypothesis set H, model parameter θ
hθ(x)=θ0+θ1x1+...+θdxd=θTx
- Good for a first try : simplicity, generalization
- L2 cost function (Goal : minimizing MSE)
J(θ0,θ1)=2m1i=1∑m(hθ(x(i))−y(i))2
➔ minimizeθ0,θ1J(θ0,θ1)
Optimization
- Data matrix X ∈ RNx(d+1), target vector y ∈ RN, weight vector θ ∈ Rd+1
- In-sample error : ∣∣y−Xθ∣∣2
- Normal equation
- (Least square) E is continuous, differentiable, and convex
- Need to compute an inverse matrix and slow if the number of samples is very large
θ∗=argminθE(θ)
=argminθ[N1(θTXTXθ−2θTXTy+yTy)]
∴θ∗=(XTX)−1XTy=X+y
➔ Problem : huge computational complexity, non-invertible matrix ➔ needs iterative algorithm (gradient descent)
- To avoid overfitting
- If we have too many features, the hypothesis may fit the training set very well. However, it may fail to generalize to new samples.
- More features ➔ more parameters ➔ need more data; (in practive) less data ➔ overfitting
➔ Reduce the number of features, regularization
Gradient descent algorithm
θnew=θold−α∂θ∂J(θ)
- J is the objective function that we want to optimize. α : the step size to control the rate to move down the error surface (hyper parameter).
- If α is too small, gradient descent can be slow.
- If α is too large, gradient descent can overshoot the minimum.
- Gradient descent works well even when n large
- All examples (batch) are examined at each iteration : Use stochastic gradient descent or mini batch.
- Advances : AdaGrad, RMSProp, Adam
- Limitation : local optimum ➔ cannot guarantee global minimum but attempt to find a good local minimum
- To avoid local minimum
- Momentum : designed to speed up learning in high curvature and small/noise gradients ➔ exponentially weighted moving average of past gradients (low pass filtering)
- SGD + momentum : use a velocity as a weighted moving average of previous gradients
- Nesterov momentum : difference from standard momentum where gradient g is evaluated (lookahead gradient step)
- AdaGrad : adapts an indibidual learning rate of each direction
- RMSProp : attempts to fix the drawbacks of AdaGrad, in which the learning rate becomes infinitesimally small and the algorithm is no longer able learning when the accumulated gradient is large.
- Adam : RMSProp + momentum
- Learning rate scheduling
➔ Need to gradually decrease learning rate over time
Linear Classification
- Uses a hyperplane as a decision boundary to classify samples based on a linear combination of its explanatory variables
- The linear formula g ∈ H can be written as:
h(x)=sign((i=1∑dwixi)+w0)
➔ x0=1,w0 : a bias term, sign(x)=1ifx>0;0ifx<0;
- Sigmoid function
- Used to map a score value into a probability value.
- Squash the output of the linear function:
σ(−wTx)=1+e−wTx1
- Advantage : simplicity and interpretability
Loss
Loss0−1(x,y,w)=1[(w⋅ϕ(x))0≥y]
Losshinge(x,y,w)=max{1−(w⋅ϕ(x))y,0}
- Cross-entropy loss
- Considers two probability mass functions (pmf) {p, 1-p} and {q, 1-q} with a binary outcomes:
- Cross-entropy loss measures the performance of a classification model whose output is a probability value between 0 and 1.
CE(S,Y)=−∀i∑Yilog(Si))
Support vector machine (SVM)
- Choose the linear separator (hyperplane) with the largest margin on either side
- Maximum margin hyperplane with support vectors
- Robust to outliers
Support Vector
- an instance with the minimum margin, which will be the most sensible data points to affect the performance
Margin
Optimization
- Optimal weight w and bias b
- Classifiers points correctly as well as achieves the largest possible margin
- Hard margin SVM assumes linear separability
- Soft margin SVM extends to non-separable cases
➔ Nonlinear transformation and kernel trick
- Constraints : linearly separable, hard-margin linear SVM
h(x)=wTx+b≥1 for y=1
h(x)=wTx+b≦−1 for y=−1
y(wTx+b)≥1 for all samples
- Objective function : linearly separable, hard-margin linear SVM
- Distance from a support vector to the hyper plane:
∣∣w∣∣wTx+b=∣∣w∣∣±1⟶∣∣w∣∣2
Kernel trick (not linearly separable)
K(x,y)=(x⋅y+1)p
- Gaussian radial basis function (RBF):
K(x,y)=e−∣∣x−y∣∣2/2σ2
- Hyperbolic tangent (multilayer perceptron kernel):
K(x,y)=tanh(kx⋅y−δ)
Artificial neural network (ANN)
-
Needs elaborated training schemes to improve performance
-
Activation functions
- Sigmoid neurons give a real-valued output that is a smooth and bounded function of their total input
- Non-linearity due to the activation functions
-
Deep neural network can represent more complex (non-linear) boundaries with increasing neurons
-
Multilayer perceptron (MLP)
-
ANN for non-linear problem
- There exists cases when the accuracy is low even if the number of layers is high
➔ The result of one ANN is the result of sigmoid function
➔ The numerous multiplication of this result converges to near zero ➔ Gradient vanishing problem
Back propagation
- Back propagation barely changes lower-layer parameters (vanishing gradient)
- Breakthrough
- Pre-training + fine tuning
- CNN for reducing redundant parameters
- Rectified linear unit (constant gradient propagation)
- Dropout
- Accuracy = (TP+TN)/ALL
- Precision = TP/(TP+FP)
- Recall = TP/(TP+FN)
- F1 = PxR/(P+R)
- TPR = R = TP/(TP+FN)
- TNR = TN/(TN+FP)
- False positive error : predict = positive, actual = negative
- False negative error : predict = negative, actual = positive
- ROC Curve : performance comparisons between different classifiers in different true positive rates (TPR) and true negative rates (TNR).
Error measure
- The error measure should be specified by the user ➔ Not always given but needs to be carefully considered
Ensemble learning
- Predict class label for unseen data by aggregating a set of predictions : different classifiers (experts) learned from the training data
- Make a decision with a voting
- Bagging and boosting : improving decision tree
- By bagging : random forest (inherently boosting)
- By boosting : gradient boosting machine as generalized Adaboost
- Advantages
- Improve predictive performance, Other types of classifiers can be directly included
- Easy to implement, No too much parameter tuning
- Disadvantages
- Not a compact representation
Bagging
- Bootstrapping + aggregating (for more robust performance; lower variance)
- Train several models in parallel
- Bagging works because it reduces variance by voting/averaging (robust to overfitting)
- Learning algorithm is unstable; if small changes to the training set cause large changes in the learned classifier.
- Usually, the more classifiers the better
Boosting
- Cascading of week classifiers ➔ training multiple models in sequence, adaboost
- Adaboost : trained on weighted form of the training set, weight depends on the performance of the previous classifier, combined to give the final classifier
- Simple and easy, flexible, Versatile, non-parametric
- No prior knowledge needed about week learner
References
- 본 포스팅은
LG Aimers
프로그램에 참가하여 학습한 내용을 기반으로 작성되었습니다. (전체내용 X)
➔ LG Aimers
바로가기
[1] LG Aimers AI Essential Course Module 2.지도학습(분류/회귀), 이화여자대학교 강제원 교수