Lecture video link: https://youtu.be/lDwow4aOrtg
Outline
- Naive Bayes
- Laplace smoothing
- Event models
- Comments on applying ML
Recap
$$x = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 1 \\ \vdots \end{bmatrix} \begin{matrix} \text{a} \\ \text{aardvark} \\ \vdots \\ \text{buy} \\ \vdots \end{matrix}$$
$x_j = 1\{\text{word } j \text{ appears in the e-mail}\}$
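As a concrete illustration (my own sketch; the toy vocabulary, helper name, and crude tokenization below are made up for the example), the indicator features could be built like this:

```python
# Building the 0/1 Bernoulli feature vector x for one e-mail.
import numpy as np

vocab = ["a", "aardvark", "buy", "drugs", "now"]   # toy dictionary; the real one has ~10,000 words
word_to_index = {w: j for j, w in enumerate(vocab)}

def email_to_bernoulli_features(email: str) -> np.ndarray:
    """x_j = 1{word j appears in the e-mail}, x in {0,1}^n."""
    x = np.zeros(len(vocab))
    for word in email.lower().split():
        token = word.strip("!.,?")                 # crude tokenization for the sketch
        if token in word_to_index:
            x[word_to_index[token]] = 1
    return x

print(email_to_bernoulli_features("Drugs! Buy drugs now!"))  # [0. 0. 1. 1. 1.]
```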
Generative model:
Model $p(x \mid y)$ and $p(y)$; under the Naive Bayes assumption,
$$p(x \mid y) = \prod_{j=1}^{n} p(x_j \mid y)$$
Parameters:
$$\phi_y = p(y=1), \quad \phi_{j|y=0} = p(x_j = 1 \mid y = 0), \quad \phi_{j|y=1} = p(x_j = 1 \mid y = 1)$$
MLE:
$$\phi_y = \frac{1}{m}\sum_{i=1}^{m} 1\{y^{(i)} = 1\}, \qquad \phi_{j|y=1} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1,\ y^{(i)} = 1\}}{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}.$$
At prediction time:
$$p(y=1 \mid x) = \frac{p(x \mid y=1)\, p(y=1)}{p(x \mid y=1)\, p(y=1) + p(x \mid y=0)\, p(y=0)}.$$
Comment: statistically, it is a bad idea to estimate the probability of an event as 0 just because you have not seen it happen yet. → Laplace smoothing helps address this problem.
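Putting the recap together, here is a minimal NumPy sketch (my own illustration, not the lecture's code; function names are arbitrary) of fitting these parameters with Laplace smoothing and predicting via Bayes' rule:

```python
import numpy as np

def fit_naive_bayes(X, y):
    """X: (m, n) binary feature matrix, y: (m,) array of labels in {0, 1}."""
    phi_y = y.mean()                                    # p(y = 1)
    # Laplace smoothing for binary features: +1 in the numerator, +2 in the denominator
    phi_j_y1 = (X[y == 1].sum(axis=0) + 1) / ((y == 1).sum() + 2)
    phi_j_y0 = (X[y == 0].sum(axis=0) + 1) / ((y == 0).sum() + 2)
    return phi_y, phi_j_y0, phi_j_y1

def predict_proba(x, phi_y, phi_j_y0, phi_j_y1):
    """p(y = 1 | x) via Bayes' rule; computed in log space to avoid underflow."""
    log_p1 = np.log(phi_y)     + np.sum(x * np.log(phi_j_y1) + (1 - x) * np.log(1 - phi_j_y1))
    log_p0 = np.log(1 - phi_y) + np.sum(x * np.log(phi_j_y0) + (1 - x) * np.log(1 - phi_j_y0))
    return 1.0 / (1.0 + np.exp(log_p0 - log_p1))
```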
ex.
Spam content: “Drugs! Buy drugs now!”
Multivariate Bernoulli event model
$$x = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 1 \\ \vdots \\ 1 \\ \vdots \\ 1 \\ \vdots \end{bmatrix} \begin{matrix} \text{a (1)} \\ \text{aardvark (2)} \\ \vdots \\ \text{buy (800)} \\ \vdots \\ \text{drugs (1600)} \\ \vdots \\ \text{now (6200)} \\ \vdots \end{matrix} = [0, 0, \cdots, 1, \cdots, 1, \cdots, 1, \cdots]^T \in \{0,1\}^{10000}.$$
This representation loses some information: it does not capture the fact that “drugs” appears twice in the e-mail.
Multinomial event model
$$x = [1600, 800, 1600, 6200]^T \in \mathbb{R}^{n_i}, \quad x_j \in \{1, \cdots, 10000\}, \quad n_i: \text{length of e-mail } i$$
Parameters:
$$\phi_y = p(y=1), \qquad \phi_{k|y=0} = p(x_j = k \mid y = 0),$$
where the RHS of the second equation is the chance that the word in position j is word k of the vocabulary, given y=0.
The LHS drops the subscript j because the model assumes this distribution is the same for every position j in the e-mail.
$$\phi_{k|y=1} = p(x_j = k \mid y = 1).$$
MLE:
$$\phi_{k|y=0} = \frac{\sum_{i=1}^{m} \left( 1\{y^{(i)} = 0\} \sum_{j=1}^{n_i} 1\{x_j^{(i)} = k\} \right)}{\sum_{i=1}^{m} 1\{y^{(i)} = 0\} \cdot n_i} = p(x_j = k \mid y = 0)$$
Comment: the English meaning of the above formula?
→ Look at all the words in all of your non-spam emails (y=0), and what fraction of those words is the word “drugs” (k)?
This is your estimate of the chance that the word “drugs” appears at any given position in a non-spam e-mail.
To implement Laplace smoothing for this formula, you would add 1 to the numerator and 10,000 (the vocabulary size) to the denominator.
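A hedged sketch of these multinomial estimates with Laplace smoothing (my own illustration; the function name and the zero-based word indexing are my choices):

```python
import numpy as np

V = 10000  # vocabulary size from the lecture

def fit_multinomial_nb(docs, y):
    """docs: list of e-mails, each a list of word indices in {0, ..., V-1}
    (zero-based here, unlike the 1..10000 indexing above); y: labels in {0, 1}."""
    y = np.asarray(y)
    phi_y = y.mean()                               # p(y = 1)
    counts = np.ones((2, V))                       # Laplace smoothing: start every word count at 1
    totals = np.full(2, float(V))                  # ... and every denominator at |V| = 10000
    for doc, label in zip(docs, y):
        for k in doc:
            counts[label, k] += 1                  # word k seen once more in class `label`
        totals[label] += len(doc)                  # total words seen in e-mails of this class
    phi_k = counts / totals[:, None]               # phi_k[c, k] = p(x_j = k | y = c)
    return phi_y, phi_k
```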
For most problems, you will find that logistic regression delivers higher accuracy than Naive Bayes. But the advantages of Naive Bayes are:
- Computationally very efficient
- Relatively quick to implement
- Does not require an iterative gradient descent thing
- The code is relatively short
Advice
When you get started on a machine learning project, start by implementing something quick and dirty rather than the most complicated possible learning algorithm.
→ You can then better understand how it’s performing.
→ Do error analysis.
→ Use that to drive your development.
Building a very complicated algorithm at the outset ∼ not recommended
Implement something quickly ∼ recommended
What if some spammers deliberately misspell words, e.g., …
mϕrtgag3 for “mortgage”?
For GDA and Naive Bayes …
despite their relatively lower accuracy, they are very quick to train and non-iterative.
Support Vector Machines
→ help us to find potentially very very “non-linear” decision boundaries.
One of the reasons SVMs are still used today is that they are relatively turnkey algorithms, meaning they do not have many parameters to fiddle with.
SVMs are not as effective as neural networks for many problems but …
one great property of SVMs is that they are turnkey.
Optimal margin classifier (separable case) (in this lecture)
linearly separable
Kernels (next lecture)
ex.
$$x \longmapsto \phi(x): \mathbb{R}^2 \longrightarrow \mathbb{R}^n$$
where n can be very large, even infinite.
(Q. How do we define infinite-dimensional image of x?)
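As a concrete example of such a feature map (a standard one, added here for illustration): map $x \in \mathbb{R}^2$ to its monomials up to degree 2,
$$\phi(x) = \begin{bmatrix} x_1 \\ x_2 \\ x_1^2 \\ x_1 x_2 \\ x_2^2 \end{bmatrix} \in \mathbb{R}^5,$$
so that a linear decision boundary in $\phi(x)$-space corresponds to a quadratic, hence non-linear, boundary in the original $\mathbb{R}^2$.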
Inseparable case (next lecture)
Functional margin
$$h_\theta(x) = g(\theta^T x)$$
Predict “1” if $\theta^T x \ge 0$, otherwise “0.”
If $y^{(i)} = 1$, hope that $\theta^T x^{(i)} \gg 0$.
If $y^{(i)} = 0$, hope that $\theta^T x^{(i)} \ll 0$.
Geometric margin
Ref: lecture-notes-cs229, Ch. 06 SVM figures…
Previously…
$$h_\theta(x) = g(\theta^T x)$$
where $x \in \mathbb{R}^{n+1}$, $x_0 = 1$. Rewrite the above as
$$h_{w,b}(x) = g(w^T x + b)$$
where $w \in \mathbb{R}^n$, $b \in \mathbb{R}$.
Functional margin of the hyperplane defined by $(w, b)$ w.r.t. $(x^{(i)}, y^{(i)})$:
$$\hat{\gamma}^{(i)} = y^{(i)}(w^T x^{(i)} + b).$$
If $y^{(i)} = 1$, want $w^T x^{(i)} + b \gg 0$.
If $y^{(i)} = -1$, want $w^T x^{(i)} + b \ll 0$.
Want $\hat{\gamma}^{(i)} \gg 0$.
If $\hat{\gamma}^{(i)} > 0$, then $h(x^{(i)}) = y^{(i)}$.
Functional margin w.r.t. training set
$$\hat{\gamma} = \min_{i} \hat{\gamma}^{(i)}, \quad i = 1, \cdots, m.$$
Note that rescaling $(w, b) \longrightarrow \left( \frac{w}{\|w\|}, \frac{b}{\|w\|} \right)$ does not change the decision boundary, but it does rescale the functional margin; the geometric margin below removes this scaling freedom.
Geometric margin
The Euclidean distance from the point $(x^{(i)}, y^{(i)})$ to the decision boundary $w^T x + b = 0$.
Geometric margin of the hyperplane defined by $(w, b)$ w.r.t. $(x^{(i)}, y^{(i)})$:
$$\gamma^{(i)} = \frac{y^{(i)}(w^T x^{(i)} + b)}{\|w\|}.$$
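A quick way to see that this is the (signed) distance (a standard derivation, not spelled out above): for a positive example, the point $x^{(i)} - \gamma^{(i)} \frac{w}{\|w\|}$ lies on the decision boundary, so
$$w^T \left( x^{(i)} - \gamma^{(i)} \frac{w}{\|w\|} \right) + b = 0 \quad \Longrightarrow \quad \gamma^{(i)} = \left( \frac{w}{\|w\|} \right)^T x^{(i)} + \frac{b}{\|w\|};$$
multiplying by $y^{(i)} \in \{-1, +1\}$ covers negative examples and makes the margin positive whenever the example is classified correctly.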
Relation between functional margin and geometric margin:
$$\text{geometric margin} = \frac{\text{functional margin}}{\|w\|}.$$
Geometric margin with training set:
$$\gamma = \min_{i} \gamma^{(i)}.$$
cf. $\hat{\gamma}$: functional margin, $\gamma$: geometric margin.
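A small NumPy sketch (illustrative only; the data and function name are made up) that computes both margins for a training set:

```python
import numpy as np

def margins(X, y, w, b):
    """X: (m, n) features, y: (m,) labels in {-1, +1}, (w, b): hyperplane."""
    functional = y * (X @ w + b)                 # gamma_hat^{(i)} for each i
    geometric = functional / np.linalg.norm(w)   # gamma^{(i)} = gamma_hat^{(i)} / ||w||
    return functional.min(), geometric.min()     # margins w.r.t. the whole training set

# Toy separable example:
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0
print(margins(X, y, w, b))   # (2.0, ~1.414): rescaling (w, b) changes only the functional margin
```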
Optimal margin classifier:
Choose w,b to maximize γ.
Mathematically,
$$\max_{\gamma, w, b} \ \gamma \quad \text{s.t.} \quad \frac{y^{(i)}(w^T x^{(i)} + b)}{\|w\|} \ge \gamma, \quad i = 1, \cdots, m.$$
Or, since $(w, b)$ can be rescaled so that the functional margin equals 1, and maximizing $\gamma = 1/\|w\|$ is the same as minimizing $\frac{1}{2}\|w\|^2$, you can reformulate this into the equivalent problem:
$$\min_{w, b} \ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y^{(i)}(w^T x^{(i)} + b) \ge 1, \quad i = 1, \cdots, m.$$
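As a sketch of how this QP could be solved in practice (my own illustration using cvxpy; the toy data is made up and assumed linearly separable):

```python
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])  # (m, n) features
y = np.array([1.0, 1.0, -1.0, -1.0])                               # labels in {-1, +1}

n = X.shape[1]
w = cp.Variable(n)
b = cp.Variable()

# min (1/2)||w||^2  s.t.  y^{(i)} (w^T x^{(i)} + b) >= 1 for all i
constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
problem.solve()

print(w.value, b.value)   # geometric margin of the resulting hyperplane is 1 / ||w||
```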