The Illustrated Word2vec
https://jalammar.github.io/illustrated-word2vec/
Word2vec (Distributed Representations of Words and Phrases and their Compositionality)
https://arxiv.org/abs/1310.4546
Sequence to Sequence Learning with Neural Networks
https://arxiv.org/abs/1409.3215
Effective Approaches to Attention-based Neural Machine Translation
https://arxiv.org/abs/1508.04025
Sparse is Enough in Scaling Transformers
https://openreview.net/pdf?id=-b5OSCydOMe
The Illustrated Transformer
https://jalammar.github.io/illustrated-transformer/
Attention Is All You Need
https://arxiv.org/abs/1706.03762
The Annotated Transformer
https://nlp.seas.harvard.edu/2018/04/03/attention.html
Group Normalization
https://openaccess.thecvf.com/content_ECCV_2018/papers/Yuxin_Wu_Group_Normalization_ECCV_2018_paper.pdf
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
https://arxiv.org/abs/1810.04805
XLNet: Generalized Autoregressive Pretraining for Language Understanding
https://arxiv.org/abs/1906.08237
GPT-1: Improving Language Understanding by Generative Pre-Training
https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf
GPT-2: Language Models are Unsupervised Multitask Learners
https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf
GPT-3: Language Models are Few-Shot Learners
https://arxiv.org/abs/2005.14165