Cross-Entropy gradient : (Hard distillation)
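Each case in the transfer set contributes a cross-entropy gradient, $\partial C / \partial z_i$, with respect to each logit, $z_i$, of the distilled model.
If the cumbersome model has logits $v_i$ which produce soft target probabilities $p_i$,
and the transfer training is done at a temperature of $T$, this gradient is given by:

$$\frac{\partial C}{\partial z_i} = \frac{1}{T}\left(q_i - p_i\right) = \frac{1}{T}\left(\frac{e^{z_i/T}}{\sum_j e^{z_j/T}} - \frac{e^{v_i/T}}{\sum_j e^{v_j/T}}\right)$$

where $q_i$ is the distilled model's softmax output at temperature $T$.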
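
As a sanity check, the gradient from the referenced paper, $\frac{1}{T}(q_i - p_i)$, can be compared against a finite-difference estimate. The sketch below is only illustrative: it assumes the loss is $C = -\sum_i p_i \log q_i$ with both softmaxes taken at the same temperature, and the function names (`softmax`, `distillation_loss`, `distillation_grad`, `numerical_grad`) and the sample logits are made up for this example rather than taken from the paper or the blog.

```python
import numpy as np


def softmax(logits, T=1.0):
    """Softmax of logits / T, shifted by the max for numerical stability."""
    scaled = np.asarray(logits, dtype=float) / T
    scaled = scaled - scaled.max()
    exp = np.exp(scaled)
    return exp / exp.sum()


def distillation_loss(z, v, T):
    """Cross-entropy between soft targets p (from v) and student outputs q (from z)."""
    p = softmax(v, T)
    q = softmax(z, T)
    return -np.sum(p * np.log(q))


def distillation_grad(z, v, T):
    """Analytic gradient dC/dz_i = (q_i - p_i) / T."""
    return (softmax(z, T) - softmax(v, T)) / T


def numerical_grad(z, v, T, eps=1e-6):
    """Central finite-difference estimate of dC/dz, one logit at a time."""
    z = np.asarray(z, dtype=float)
    grad = np.zeros_like(z)
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = eps
        loss_plus = distillation_loss(z + step, v, T)
        loss_minus = distillation_loss(z - step, v, T)
        grad[i] = (loss_plus - loss_minus) / (2 * eps)
    return grad


# Hypothetical logits, chosen only for illustration.
z = np.array([1.0, 2.0, 0.5])   # distilled (student) model logits
v = np.array([2.0, 1.0, 0.1])   # cumbersome (teacher) model logits
T = 4.0

analytic = distillation_grad(z, v, T)
numeric = numerical_grad(z, v, T)
print(np.allclose(analytic, numeric, atol=1e-6))
```

If the two gradients agree, the final line prints `True`. Note that the components of the analytic gradient sum to zero, since both $q$ and $p$ are probability distributions, and the gradient vanishes when the student logits equal the teacher logits.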
[REF]
paper : https://arxiv.org/pdf/1503.02531.pdf
blog : https://jmlb.github.io/ml/2017/12/26/Calculate_Gradient_Softmax/