Hyperplane
: a flat subspace of dimension N−1 that divides an N-dimensional space into two parts
creates separation between the classes
when a new point arrives, use the hyperplane to assign it a class
Example: a dataset with one feature, one binary target label
create a separating hyperplane
maximize the margin between the classes; this is the maximal margin classifier
Use cross validation to determine the optimal size of the margin
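A minimal sketch of tuning the margin by cross validation, assuming scikit-learn and a toy make_blobs dataset (both the data and the C grid are illustrative choices, not from the notes); C controls how soft the margin is, so searching over C is effectively searching over margin width:

```python
# Sketch: choose the soft-margin parameter C (which controls margin width)
# by cross validation. Dataset and grid values are illustrative assumptions.
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.5, random_state=0)

# Smaller C -> wider margin (more violations allowed); larger C -> narrower margin.
grid = GridSearchCV(SVC(kernel="linear"),
                    param_grid={"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```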
Theory and Intuition: Kernels
When a linear hyperplane performs poorly, move from the Support Vector Classifier to Support Vector Machines
Kernels are used to project the features into a higher-dimensional space
Kernel Projection 1D
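A sketch of the 1D projection idea; the explicit feature map x → (x, x²) and the data values below are illustrative assumptions (an actual kernel does this implicitly):

```python
# Data that is not linearly separable on a 1D line becomes separable
# after mapping x -> (x, x^2). The data values are made up.
import numpy as np

x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([ 1,    1,    0,   0,   0,   1,   1 ])   # class 0 in the middle

phi = np.column_stack([x, x ** 2])   # explicit 2D feature map

# In the (x, x^2) plane, the horizontal line x^2 = 2.5 separates the classes.
pred = (phi[:, 1] > 2.5).astype(int)
print(np.all(pred == y))   # True
```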
Mathematics
Linear classifier: finding the optimal solution
• Maximizes the distance between the hyperplane and the “difficult points” close to the decision boundary
• One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions
equation of an n-dimensional hyperplane: $w_1 x_1 + w_2 x_2 + w_3 x_3 + \dots + w_n x_n = a$, i.e. $w^T x = a$
Support Vectors: Intuition
: the training data points closest to the boundary, which are the most crucial for designing the classifier
To choose a good line, optimize some objective function
Primarily we want the fewest misclassifications of test points; points closer to the boundary are more likely to be misclassified
Support Vector Machine (SVM)
: maximum margin classifier; L1 and L2 are the lines defined by the support vectors
margin: the separation between the lines
The decision boundary is the line that passes through the middle of L1 and L2.
another intuition: a fat separator between the classes leaves fewer choices, i.e. decreased capacity of the model
Linear Classifiers
$f(x, w, b) = \mathrm{sign}(w^T x - b)$
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Maximum margin linear classifier (LSVM)
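A tiny NumPy sketch of the linear decision rule above; the weight vector w and offset b are arbitrary illustrative values, not learned ones:

```python
# Minimal sketch of the linear decision rule f(x, w, b) = sign(w^T x - b).
# w and b here are arbitrary illustrative values, not fitted parameters.
import numpy as np

w = np.array([2.0, -1.0])
b = 0.5

def f(x, w, b):
    return np.sign(w @ x - b)

print(f(np.array([1.0, 1.0]), w, b))   #  1.0  (2 - 1 - 0.5 = 0.5 > 0)
print(f(np.array([0.0, 1.0]), w, b))   # -1.0  (0 - 1 - 0.5 = -1.5 < 0)
```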
Large-margin Decision Boundary
Recall: the distance from the point $(m, n)$ to the line $Ax + By + C = 0$ is $d = \dfrac{|Am + Bn + C|}{\sqrt{A^2 + B^2}}$
The perpendicular distance of the line $L$ ($w^T x = a$) from any point $u = [u_1, u_2, u_3, \dots, u_n]^T$: $d(u, L) = \dfrac{|w^T u - a|}{\|w\|}$
The perpendicular distance of the line from the origin: $d(0, L) = \dfrac{|a|}{\|w\|}$
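A quick numerical check of the distance formulas above; the hyperplane ($w$, $a$) and the point $u$ are illustrative values:

```python
# Check of the perpendicular distance formulas; w, a, u are illustrative.
import numpy as np

w = np.array([3.0, 4.0])   # hyperplane w^T x = a
a = 5.0
u = np.array([2.0, 1.0])   # an arbitrary point

d_u = abs(w @ u - a) / np.linalg.norm(w)   # |w^T u - a| / ||w||
d_0 = abs(a) / np.linalg.norm(w)           # distance of the hyperplane from origin

print(d_u)   # |6 + 4 - 5| / 5 = 1.0
print(d_0)   # |5| / 5 = 1.0
```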
to maximize the margin $2/\|w\|$, minimize $\|w\|^2 = w^T w$
class label $y_i$ ($C_1$: class 1, $C_2$: class 2): $y_i = \begin{cases} +1, & \text{if } x_i \in C_1 \\ -1, & \text{if } x_i \in C_2 \end{cases}$
constraints on our optimization problem: $w^T x_i + b \le -1 \;\; \forall x_i \in C_2$, and $w^T x_i + b \ge 1 \;\; \forall x_i \in C_1$
class 2 lies below L1, class 1 lies above L2
simply: $y_i(w^T x_i + b) \ge 1, \;\; \forall i \in \{1, 2, \dots, m\}$
(m: # of training samples)
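A sketch that checks this constraint on a fitted model, assuming scikit-learn, a toy make_blobs dataset, and a very large C so the fit is close to a hard margin (all illustrative choices):

```python
# On (assumed) linearly separable data with large C, every training point should
# satisfy y_i (w^T x_i + b) >= 1, with equality (approx.) for the support vectors.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y01 = make_blobs(n_samples=100, centers=2, cluster_std=0.8, random_state=0)
y = np.where(y01 == 1, 1, -1)          # relabel the classes as +1 / -1

clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

margins = y * (X @ w + b)              # y_i (w^T x_i + b) for every training point
print(margins.min())                   # >= 1 (up to numerical tolerance)
print(margins[clf.support_].round(3))  # ~1 for the support vectors
```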
C: regularization parameter (user defined); the strength of the regularization is inversely proportional to C; must be strictly positive; the penalty is a squared L2 penalty; default = 1.0
Small C: allows large $\epsilon_i$'s, so more $x_i$'s can slip through the margin (more points are allowed inside the margin)
Large C: forces small $\epsilon_i$'s
if C is small, the penalty on the $\epsilon_i$'s is small, so a wider range of solutions is allowed and more misses (misclassifications) are tolerated
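A sketch of the effect of C, assuming scikit-learn and a toy overlapping make_blobs dataset (illustrative choices); the number of support vectors is used as a rough proxy for how many points sit inside or violate the margin:

```python
# Smaller C tolerates larger slack, so more points end up inside the margin
# (typically more support vectors). Dataset and C values are illustrative.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # n_support_: number of support vectors per class; small C -> usually more.
    print(C, clf.n_support_.sum())
```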
Kernel Trick
very large feature space
avoid computations in the enlarged feature space; we only need to perform a computation for each distinct pair of training points
a kernel can be viewed as a measure of similarity that implicitly works in the enlarged feature space rather than the original one
use the inner (dot) product as the similarity between two vectors (see the sketch below)
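A sketch of the trick for a degree-2 polynomial kernel in 1D; the explicit map phi is shown only to verify that the kernel reproduces its inner product without ever computing it:

```python
# The explicit map phi(x) = (1, sqrt(2)*x, x^2) satisfies phi(x).phi(z) = (1 + x*z)^2,
# so the inner product in the enlarged space can be computed in the original space.
import numpy as np

def phi(x):
    return np.array([1.0, np.sqrt(2) * x, x ** 2])

x, z = 1.5, -2.0
explicit = phi(x) @ phi(z)        # inner product in the enlarged feature space
kernel   = (1 + x * z) ** 2       # same value, computed in the original space
print(explicit, kernel)           # both equal 4.0
```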
Linear Support Vector Classifier rewritten
(Linear) Kernel function
$f(x) = \beta_0 + \sum_{i \in S} \alpha_i K(x, x_i)$, where $S$ is the set of support vectors
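A sketch that reconstructs this expansion from a fitted scikit-learn SVC: dual_coef_ holds the signed coefficients $\alpha_i y_i$, support_vectors_ is the set $S$, and intercept_ is $\beta_0$; the linear kernel and toy dataset are illustrative assumptions:

```python
# Rebuild f(x) = beta_0 + sum_{i in S} alpha_i K(x, x_i) from a fitted SVC.
# Linear kernel used here, so K(x, x_i) = x . x_i.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=150, centers=2, cluster_std=1.5, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

K = X @ clf.support_vectors_.T                    # kernel values K(x, x_i)
f_manual = K @ clf.dual_coef_[0] + clf.intercept_[0]

print(np.allclose(f_manual, clf.decision_function(X)))   # True
```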
Polynomial Kernel
Radial Basis Kernel
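For reference, common parameterizations of these two kernels (textbook forms; libraries may add extra hyperparameters such as a scale on the inner product):

$$K(x, x') = \left(1 + x^T x'\right)^d \;\; \text{(polynomial of degree } d\text{)}, \qquad K(x, x') = \exp\!\left(-\gamma \|x - x'\|^2\right) \;\; \text{(radial basis)}$$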
As gamma γ increases, the influence range of a data point shortens, while a lower gamma extends it.
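A sketch of the gamma effect, assuming scikit-learn and a toy make_circles dataset (both illustrative); training accuracy and support-vector counts are printed as rough indicators of how local each point's influence becomes:

```python
# Large gamma -> each point's influence is very local (wiggly boundary, overfitting risk);
# small gamma -> smoother boundary. Dataset and gamma values are illustrative.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.4, noise=0.15, random_state=0)

for gamma in (0.1, 1.0, 100.0):
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)
    # Training accuracy tends to rise with gamma, but very large gamma usually overfits.
    print(gamma, round(clf.score(X, y), 3), clf.n_support_.sum())
```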