Lecture video link: https://youtu.be/iZTeva0WSTQ
Topics for today
- Perceptron
- Exponential Family
- Generalized Linear Models
- Softmax Regression (Multiclass Classification)
Perceptron learning algorithm
$$\theta_j \leftarrow \theta_j + \alpha\left(y^{(i)} - h_\theta(x^{(i)})\right)x_j^{(i)}$$
where $h_\theta(x) = g(\theta^T x)$ and $g(z) = \begin{cases} 1 & z \ge 0 \\ 0 & z < 0. \end{cases}$
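As a minimal sketch of this update (my own illustration, not from the lecture; the toy data `X`, `y` and the learning rate `alpha` are hypothetical):

```python
import numpy as np

def perceptron_step(theta, x_i, y_i, alpha=0.1):
    """One perceptron update: theta_j += alpha * (y_i - h_theta(x_i)) * x_ij."""
    h = 1.0 if theta @ x_i >= 0 else 0.0   # g(theta^T x) with the threshold function
    return theta + alpha * (y_i - h) * x_i

# Hypothetical toy data: an intercept feature plus two inputs, labels in {0, 1}
X = np.array([[1.0, 2.0, 1.0], [1.0, -1.0, -2.0]])
y = np.array([1.0, 0.0])
theta = np.zeros(3)
for x_i, y_i in zip(X, y):
    theta = perceptron_step(theta, x_i, y_i)
```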
Exponential family
A family of distributions whose pdf can be written as
$$p(y;\eta) = b(y)\,\exp\!\left(\eta^T T(y) - a(\eta)\right)$$
where
y: Data
η: Natural parameter
T(y): Sufficient statistic
b(y): Base measure
a(η): Log-partition function
Bernoulli
ϕ= probability of event
$$p(y;\phi) = \phi^y (1-\phi)^{1-y} = \exp\!\left(\log\!\left(\phi^y (1-\phi)^{1-y}\right)\right) = 1 \cdot \exp\!\left(\log\!\left(\frac{\phi}{1-\phi}\right) y + \log(1-\phi)\right)$$
where
$$\begin{aligned}
b(y) &= 1 \\
T(y) &= y \\
\eta &= \log\frac{\phi}{1-\phi} \;\Rightarrow\; \phi = \frac{1}{1+e^{-\eta}} \\
a(\eta) &= -\log(1-\phi) = -\log\!\left(1 - \frac{1}{1+e^{-\eta}}\right) = \log(1+e^{\eta}).
\end{aligned}$$
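As a quick numerical sanity check (my own sketch, not part of the lecture), the exponential-family form above reproduces the usual Bernoulli pmf:

```python
import numpy as np

phi = 0.3
eta = np.log(phi / (1 - phi))          # natural parameter
a = np.log(1 + np.exp(eta))            # log-partition a(eta)

for y in (0, 1):
    direct = phi**y * (1 - phi)**(1 - y)     # phi^y (1 - phi)^(1 - y)
    exp_family = 1 * np.exp(eta * y - a)     # b(y) exp(eta*T(y) - a(eta)), with b(y)=1, T(y)=y
    assert np.isclose(direct, exp_family)
```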
Gaussian (with fixed variance)
Assume $\sigma^2 = 1$.
$$p(y;\mu) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{(y-\mu)^2}{2}\right) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{y^2}{2}\right) \exp\!\left(\mu y - \frac{1}{2}\mu^2\right)$$
where
$$\begin{aligned}
b(y) &= \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{y^2}{2}\right) \\
T(y) &= y \\
\eta &= \mu \\
a(\eta) &= \frac{\mu^2}{2} = \frac{\eta^2}{2}.
\end{aligned}$$
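A similar sanity check for the Gaussian case with $\sigma^2 = 1$ (again my own sketch, with arbitrary values for `mu` and `y`):

```python
import numpy as np

mu, y = 1.5, 0.7
eta = mu                                          # natural parameter
b = np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)        # base measure b(y)
a = eta**2 / 2                                    # log-partition a(eta)

direct = np.exp(-(y - mu)**2 / 2) / np.sqrt(2 * np.pi)
exp_family = b * np.exp(eta * y - a)              # b(y) exp(eta*T(y) - a(eta)), with T(y)=y
assert np.isclose(direct, exp_family)
```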
Properties
- MLE w.r.t. $\eta$ is a concave problem (the log-likelihood is concave in $\eta$); equivalently, the NLL is convex
- $E[y;\eta] = \dfrac{\partial a(\eta)}{\partial \eta}$
- $\mathrm{Var}[y;\eta] = \dfrac{\partial^2 a(\eta)}{\partial \eta^2}$
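Working out these two properties for the Bernoulli case above, where $a(\eta) = \log(1+e^{\eta})$:
$$\frac{\partial a(\eta)}{\partial \eta} = \frac{e^{\eta}}{1+e^{\eta}} = \frac{1}{1+e^{-\eta}} = \phi, \qquad \frac{\partial^2 a(\eta)}{\partial \eta^2} = \phi(1-\phi),$$
which match the familiar Bernoulli mean $\phi$ and variance $\phi(1-\phi)$.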
cf. Probability distributions by data type
Real - Gaussian
Binary - Bernoulli
Count - Poisson
ℝ⁺ (positive reals) - Gamma, Exponential
Distributions (over probabilities) - Beta, Dirichlet … used in Bayesian statistics
GLM (Generalized linear models)
Assumptions / Design choices
- $y \mid x; \theta \sim \text{ExponentialFamily}(\eta)$
- $\eta = \theta^T x$, with $\theta \in \mathbb{R}^n$ and $x \in \mathbb{R}^n$
- At test time, output $E[y \mid x; \theta]$, i.e. $h_\theta(x) = E[y \mid x; \theta]$
Train time: $\max_\theta \log p\!\left(y^{(i)}; \theta^T x^{(i)}\right)$
Test time: output $E[y;\eta] = E[y \mid x; \theta] = h_\theta(x)$
GLM Training
Learning update rule
$$\theta_j \leftarrow \theta_j + \alpha\left(y^{(i)} - h_\theta(x^{(i)})\right)x_j^{(i)}$$
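A minimal sketch of this training loop (my own illustration; the toy data, the function names, and the hyperparameters are hypothetical). It uses the Bernoulli GLM, i.e. logistic regression, as the canonical response:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))   # canonical response g(eta) for the Bernoulli GLM

def glm_sgd(X, y, response=sigmoid, alpha=0.1, epochs=100):
    """Stochastic gradient ascent with the GLM update: theta += alpha * (y_i - h_theta(x_i)) * x_i."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            h = response(theta @ x_i)        # h_theta(x) = E[y|x;theta] = g(theta^T x)
            theta += alpha * (y_i - h) * x_i
    return theta

# Hypothetical toy data: intercept feature plus one input, binary labels
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = glm_sgd(X, y)
```

Swapping `response` for `np.exp` (the canonical response of the Poisson GLM) gives Poisson regression with the exact same update rule, which is the point of the GLM abstraction.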
Terminology
η: natural parameter
$\mu = E[y;\eta] = g(\eta)$: canonical response function
$\eta = g^{-1}(\mu)$: canonical link function
$g(\eta) = \dfrac{\partial a(\eta)}{\partial \eta}$
3 parameterizations
| Model param. | Natural param. | Canonical param. |
|---|---|---|
| $\theta$ (learned) | $\eta$ | $\phi$ (Bernoulli), $(\mu, \sigma^2)$ (Gaussian), $\lambda$ (Poisson) |

The design choice $\eta = \theta^T x$ maps the model parameter to the natural parameter; $g$ maps the natural parameter to the canonical parameter, and $g^{-1}$ maps it back.
Logistic regression
$$h_\theta(x) = E[y \mid x; \theta] = \phi = \frac{1}{1+e^{-\eta}} = \frac{1}{1+e^{-\theta^T x}}.$$
Softmax regression (cross entropy)
K: number of classes
$x^{(i)} \in \mathbb{R}^n$
y: one-hot vector of length K
Learn → Predict → Compare
How do we minimize the distance between the predicted distribution $\hat{p}(y)$ and the target distribution $p(y)$?
→ Minimize the cross entropy between the two distributions.
$$\mathrm{CrossEnt}(p, \hat{p}) = -\sum_{y \in \text{classes}} p(y)\log \hat{p}(y) = -\log \hat{p}(y_{\text{target}}) = -\log \frac{e^{\theta_{\text{target}}^T x}}{\sum_{c \in \text{classes}} e^{\theta_c^T x}}.$$
Treat the cross entropy above as the loss and minimize it with gradient descent.
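A sketch of this loss and one gradient step (my own illustration; `Theta` is a hypothetical $n \times K$ parameter matrix, and the label is passed as a class index rather than a one-hot vector):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()               # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy_step(Theta, x_i, y_i, alpha=0.1):
    """One gradient-descent step on -log p_hat(y_target) for softmax regression."""
    p_hat = softmax(Theta.T @ x_i)          # predicted distribution over the K classes
    p = np.zeros_like(p_hat)
    p[y_i] = 1.0                            # one-hot target distribution
    grad = np.outer(x_i, p_hat - p)         # gradient of the cross-entropy loss w.r.t. Theta
    return Theta - alpha * grad

# Hypothetical toy example: n = 3 features, K = 4 classes, target class 2
Theta = np.zeros((3, 4))
x_i, y_i = np.array([1.0, 0.5, -0.2]), 2
Theta = cross_entropy_step(Theta, x_i, y_i)
```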