Softmax and Cross-entropy
- Softmax: $p(y \mid x) = \dfrac{\exp(W_{y\cdot} \cdot x)}{\sum_{c=1}^{C} \exp(W_{c\cdot} \cdot x)}$
- $W_{y\cdot}$ is the $y$-th row of a weight matrix $W$
- Cross-entropy: $H(p, q) = -\sum_{c=1}^{C} p(c) \log q(c)$
- True probability: p
- Modeled probability: q
- Cross-entropy loss function over a dataset $\{x_i, y_i\}_{i=1}^{N}$ (sketched in NumPy right after this list):
$J(\theta) = -\dfrac{1}{N} \sum_{i=1}^{N} \log\left(\dfrac{\exp(W_{y_i\cdot} \cdot x_i)}{\sum_{c=1}^{C} \exp(W_{c\cdot} \cdot x_i)}\right)$ ... where $\theta = [W_{\cdot 1} \dots W_{\cdot d}] \in \mathbb{R}^{C \times d}$
- For classification we want to minimize $J(\theta)$, i.e. maximize the probability of the correct class $y$, so we update $\theta := \theta - \alpha \nabla_\theta J(\theta)$
- cf) Binary cross-entropy loss: $J(\theta) = -\dfrac{1}{N} \sum_{i=1}^{N} \left(y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i)\right)$
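A minimal NumPy sketch of the softmax cross-entropy loss above. The matrix shapes and the toy values of W, X, and y are made up for illustration; they are not from the lecture.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, with max-subtraction for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy_loss(W, X, y):
    """J(theta) = -(1/N) * sum_i log p(y_i | x_i) for a linear softmax classifier.

    W: (C, d) weight matrix, X: (N, d) inputs, y: (N,) integer class labels.
    """
    probs = softmax(X @ W.T)                  # (N, C) predicted class probabilities
    correct = probs[np.arange(len(y)), y]     # p(y_i | x_i) for each example
    return -np.mean(np.log(correct))

# Toy setup (made up): C = 3 classes, d = 4 features, N = 5 examples.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
X = rng.normal(size=(5, 4))
y = np.array([0, 2, 1, 1, 0])
print(cross_entropy_loss(W, X, y))
```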
Neural Networks
- Softmax alone is not very powerful. It only gives linear decision boundaries.
- Neural networks can learn much more complex functions and nonlinear decision boundaries.
Artificial Neuron and Neural Network
(Figure: a single artificial neuron on the left, a full neural network on the right; images omitted.)
- A neuron can be a binary logistic regression unit
- $f$ = a nonlinear activation function (e.g. sigmoid)
- A neural network = running several logistic regressions at the same time, and feeding the outputs of one layer into the next layer of neurons
- In matrix notation,
$z = Wx + b$ ... ($x$ is the input vector of a layer)
$a = f(z)$ ... ($a$ is the output vector of the layer, fed into the next layer; $f$ is applied elementwise)
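A minimal sketch of one such layer in NumPy; the dimensions and the choice of sigmoid are just illustrative.

```python
import numpy as np

def sigmoid(z):
    """Elementwise sigmoid nonlinearity."""
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(x, W, b, f=sigmoid):
    """One layer: z = Wx + b, then a = f(z) applied elementwise."""
    z = W @ x + b
    return f(z)

# Toy dimensions (made up): 4 inputs -> 3 neurons in this layer.
rng = np.random.default_rng(1)
x = rng.normal(size=4)
W = rng.normal(size=(3, 4))
b = np.zeros(3)
a = layer_forward(x, W, b)   # output of this layer, fed into the next one
print(a.shape)               # (3,)
```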
Window Classification
- In general, classifying single words is rarely done because the meaning of a word depends on its context
- "To sanction" can mean "to permit" or "to punish"
- "Paris" can mean "Paris, France" or "Paris Hilton"
- Window Classification: classify a word in its context window of neighboring words
- To classify a center word, take the concatenation of the word vectors surrounding it in a window.
- Example: classify "Paris" in its context with window length 2, i.e. $x_{window} = [\,x_{museums}\; x_{in}\; x_{Paris}\; x_{are}\; x_{amazing}\,]$
- $x_{window} \in \mathbb{R}^{5d}$ is now the input vector to a neural net
- Neural Network Feed-forward Computation
- Let's assume an NER location classification task (classify whether the center word is a Location or not)
- s = score("museums in Paris are amazing")
- $s = U^T f(Wx + b)$ (see the NumPy sketch after this list)
- $x \in \mathbb{R}^{5d \times 1},\; W \in \mathbb{R}^{n \times 5d},\; U \in \mathbb{R}^{n \times 1}$
- The middle layer learns non-linear interactions between the input word vectors
- The max-margin loss
- s= True window's score
- sc= Corrupt window's score
- $J = \max(0,\, 1 - s + s_c)$ ... minimizing $J$ makes $s$ larger and $s_c$ smaller
- This loss is not differentiable everywhere, but it is continuous, so we can still use SGD by computing $\nabla_\theta J(\theta)$
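A minimal NumPy sketch of the window score $s = U^T f(Wx + b)$ and the max-margin loss above. The word vectors, the tanh nonlinearity, and the way the corrupt window is built (swapping the center word for a non-location) are my own illustrative choices, not taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 2, 8                                   # toy word-vector size and hidden size
vocab = ["museums", "in", "Paris", "are", "amazing", "not"]
vectors = {w: rng.normal(size=d) for w in vocab}   # made-up word vectors

W = rng.normal(size=(n, 5 * d))               # hidden-layer weights
b = np.zeros(n)
U = rng.normal(size=(n, 1))                   # scoring vector

def window_vector(words):
    """Concatenate the word vectors of a 5-word window: x_window in R^{5d}."""
    return np.concatenate([vectors[w] for w in words])

def score(words):
    """s = U^T f(Wx + b), here with f = tanh."""
    h = np.tanh(W @ window_vector(words) + b)
    return float(U.T @ h)

true_window    = ["museums", "in", "Paris", "are", "amazing"]   # center word is a Location
corrupt_window = ["museums", "in", "not", "are", "amazing"]     # center word is not a Location

s, s_c = score(true_window), score(corrupt_window)
J = max(0.0, 1.0 - s + s_c)                   # max-margin loss for this pair
print(s, s_c, J)
```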
Matrix Calculus
Gradients
- Given a function $f$ with 1 output and $n$ inputs
- $f(x) = f(x_1, x_2, \dots, x_n)$
- Gradient: $\nabla_x f = \dfrac{\partial f}{\partial x} = \left[\dfrac{\partial f}{\partial x_1}, \dfrac{\partial f}{\partial x_2}, \dots, \dfrac{\partial f}{\partial x_n}\right]$
- Given a function $f$ with $m$ outputs and $n$ inputs
- $f(x) = [f_1(x_1, \dots, x_n), \dots, f_m(x_1, \dots, x_n)]$
- Jacobian: $\dfrac{\partial f}{\partial x} = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \dots & \dfrac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_m}{\partial x_1} & \dots & \dfrac{\partial f_m}{\partial x_n} \end{bmatrix}$, i.e. $\left(\dfrac{\partial f}{\partial x}\right)_{ij} = \dfrac{\partial f_i}{\partial x_j}$
- Example: Elementwise activation function
- $h = f(z)$, where $h, z \in \mathbb{R}^n$ and $h_i = f(z_i)$
- Jacobian: $\dfrac{\partial h}{\partial z} = \begin{bmatrix} f'(z_1) & & 0 \\ & \ddots & \\ 0 & & f'(z_n) \end{bmatrix} = \operatorname{diag}(f'(z))$ (checked numerically in the sketch below)
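A small numerical check of the Jacobian above: approximate the Jacobian of an elementwise sigmoid with finite differences and compare it to $\operatorname{diag}(f'(z))$. The sigmoid and the tolerance are my choices for illustration.

```python
import numpy as np

def f(z):
    """Elementwise sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

def f_prime(z):
    """Derivative of the sigmoid, also elementwise."""
    s = f(z)
    return s * (1.0 - s)

def numerical_jacobian(func, z, eps=1e-6):
    """Central-difference Jacobian: J[i, j] = d func_i / d z_j."""
    m, n = len(func(z)), len(z)
    J = np.zeros((m, n))
    for j in range(n):
        dz = np.zeros(n)
        dz[j] = eps
        J[:, j] = (func(z + dz) - func(z - dz)) / (2 * eps)
    return J

z = np.array([0.5, -1.0, 2.0])
J_numeric  = numerical_jacobian(f, z)
J_analytic = np.diag(f_prime(z))       # dh/dz = diag(f'(z)) for h = f(z) elementwise
print(np.allclose(J_numeric, J_analytic, atol=1e-6))   # True: off-diagonal entries are ~0
```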
Apply to NER location neural net
If there is something wrong in my writing or understanding, please comment and make corrections!
[references]
1. https://youtu.be/8CWyBNX6eDo
2. https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/slides/cs224n-2019-lecture03-neuralnets.pdf
3. https://en.wikipedia.org/wiki/Partial_derivative