
Backpropagation => apply the chain rule recursively, from the output back to each input





we can do it like this

simple nodes can be grouped into one more complex gate, as long as you can derive its local gradient.
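
A minimal sketch of that grouping idea. The sigmoid function is an assumed example (these notes do not name a specific gate). The gradient is computed once through four simple gates and once through a single grouped gate, and the two should agree:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x = 1.0

# Forward pass through four simple gates: negate, exp, add 1, reciprocal.
a = -x
b = math.exp(a)
c = 1.0 + b
f = 1.0 / c

# Backward pass: multiply local gradients gate by gate (chain rule).
df_dc = -1.0 / c**2          # d(1/c)/dc
df_db = df_dc * 1.0          # d(1 + b)/db
df_da = df_db * math.exp(a)  # d(exp(a))/da
df_dx_steps = df_da * -1.0   # d(-x)/dx

# Grouped: one sigmoid gate whose local gradient is s * (1 - s).
s = sigmoid(x)
df_dx_grouped = s * (1.0 - s)
```

Both routes give the same number; the grouped gate just needs fewer intermediate values.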
max gate

the input that was the max receives the full upstream gradient, while the other input gets a gradient of 0,
because only the max value was passed on in the forward pass!
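
A small sketch of the max gate as a gradient router. The function names and the tie-breaking rule are assumptions made for illustration:

```python
def max_gate_forward(x, y):
    # Only the larger input reaches the output.
    return max(x, y)

def max_gate_backward(x, y, upstream):
    # Route the whole upstream gradient to the input that won in the
    # forward pass; the other input had no effect on the output.
    # Ties go to x here, which is an arbitrary choice.
    if x >= y:
        return upstream, 0.0
    return 0.0, upstream

x, y = 2.0, -1.0
out = max_gate_forward(x, y)                    # 2.0
dx, dy = max_gate_backward(x, y, upstream=3.0)  # (3.0, 0.0)
```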


for an elementwise operation, the Jacobian matrix is just a diagonal matrix (each output depends only on its own input)
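
A sketch that checks this numerically, assuming the elementwise operation is max(0, x). The choice of operation and the helper `numerical_jacobian` are assumptions, not something these notes specify:

```python
def relu(v):
    # Elementwise max(0, x): output i depends only on input i.
    return [max(0.0, vi) for vi in v]

def numerical_jacobian(f, v, eps=1e-6):
    # Forward-difference estimate of J[i][j] = d f(v)[i] / d v[j].
    base = f(v)
    n = len(v)
    J = [[0.0] * n for _ in range(len(base))]
    for j in range(n):
        bumped = list(v)
        bumped[j] += eps
        out = f(bumped)
        for i in range(len(base)):
            J[i][j] = (out[i] - base[i]) / eps
    return J

x = [1.5, -2.0, 0.5]
J = numerical_jacobian(relu, x)
diagonal = [J[i][i] for i in range(3)]
off_diagonal = [J[i][j] for i in range(3) for j in range(3) if i != j]
```

Every off-diagonal entry comes out as zero, so in practice the full matrix never needs to be built: the backward pass reduces to an elementwise multiply of the upstream gradient with the diagonal.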


the gradient measures how much each number affects the final output


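non-linearity is important! without it, stacked linear layers collapse into a single linear layer
multiple layers
h holds the score of each template in W1
W2 weights all of those scores to produce the final output
h is taken right after the non-linearity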
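
A minimal forward-pass sketch of the two-layer idea, assuming the non-linearity is max(0, x) and using made-up weights; `matvec` and `two_layer_scores` are hypothetical helper names:

```python
def matvec(W, v):
    # Plain matrix-vector product; each row of W acts as one template.
    return [sum(w * vi for w, vi in zip(row, v)) for row in W]

def two_layer_scores(W1, W2, x):
    # h: score of each W1 template, taken right after the non-linearity.
    h = [max(0.0, a) for a in matvec(W1, x)]
    # W2 weights all of the template scores into the final scores.
    s = matvec(W2, h)
    return h, s

W1 = [[1.0, -1.0], [0.5, 2.0], [-1.0, -1.0]]  # 3 templates, 2 inputs (made-up values)
W2 = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]      # 2 outputs, 3 templates (made-up values)
x = [2.0, 1.0]

h, s = two_layer_scores(W1, W2, x)
# h == [1.0, 3.0, 0.0], s == [1.0, 3.0]
```

The third template scores -3 before the non-linearity and is clamped to 0, so it contributes nothing to the final scores regardless of its weight in W2.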

