This is a reimplementation of the Trajectory Transformer, introduced in the paper "Offline Reinforcement Learning as One Big Sequence Modeling Problem".
The original implementation has a few problems with inference speed, namely quadratic attention during inference and sequential rollouts.
The former slows down planning a lot, while the latter prevents rollouts from running in parallel and from fully utilizing the GPU.
(See also: https://ai.googleblog.com/2020/10/rethinking-attention-with-performers.html)
Still, even after all the changes, it is not that fast compared to traditional methods such as PPO or SAC/DDPG. However, the gains are huge: what used to take hours now takes about a dozen minutes (for example, 25 rollouts of 1k steps each). Training time remains the same, though.
1. Attention caching
During beam search we predict only one token at a time. A naive implementation therefore does a lot of unnecessary work, recomputing attention over the full past context at every step. This is not necessary, because those results were already computed when the previous token was predicted. All we need is to cache them!
Attention caching is actually common in NLP, but many RL practitioners may not be familiar with NLP, so the code can also be educational.
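To make the idea concrete, here is a minimal NumPy sketch of one common form of attention caching (storing keys and values) for a single causal attention head. It is not the code from this repository; the names `full_causal_attention` and `CachedAttention` are made up for illustration, and a real model would keep one such cache per layer and per head.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def full_causal_attention(x, Wq, Wk, Wv):
    """Naive version: recompute attention for the whole sequence x of shape (T, d)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Causal mask: position t must not attend to positions after t.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

class CachedAttention:
    """Incremental version: keys and values of past tokens are stored and reused."""

    def __init__(self, Wq, Wk, Wv):
        self.Wq, self.Wk, self.Wv = Wq, Wk, Wv
        self.keys, self.values = [], []

    def step(self, x_t):
        """Process one new token embedding x_t of shape (d,)."""
        q = x_t @ self.Wq
        # Only the new token's key and value are computed; the rest come from the cache.
        self.keys.append(x_t @ self.Wk)
        self.values.append(x_t @ self.Wv)
        K, V = np.stack(self.keys), np.stack(self.values)
        weights = softmax(K @ q / np.sqrt(q.shape[-1]))
        return weights @ V
```

Feeding the tokens one by one through `CachedAttention.step` gives the same outputs as the rows of `full_causal_attention`, but each step costs time linear in the context length instead of quadratic.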
2. Vectorized rollouts
Vectorized environments allow beam search planning to be batched and actions to be selected in parallel, which is much faster when you need to evaluate the agent on a number of episodes (or seeds) during training.
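As a rough illustration of the batching pattern, the sketch below steps a batch of toy environments together and picks all actions with one call. Everything in it is hypothetical: `ToyVectorEnv` is a made-up 1-D environment, and `plan_batch` is a trivial stand-in for batched beam search, not the planner used here.

```python
import numpy as np

class ToyVectorEnv:
    """Hypothetical batch of independent 1-D environments; the goal is to reach 0."""

    def __init__(self, num_envs, seed=0):
        rng = np.random.default_rng(seed)
        self.pos = rng.uniform(-1.0, 1.0, size=num_envs)
        self.done = np.zeros(num_envs, dtype=bool)

    def step(self, actions):
        # Finished environments are frozen and receive zero reward.
        active = ~self.done
        self.pos = np.where(active, self.pos + actions, self.pos)
        rewards = np.where(active, -np.abs(self.pos), 0.0)
        self.done |= np.abs(self.pos) < 0.05
        return self.pos.copy(), rewards, self.done.copy()

def plan_batch(obs):
    """Stand-in for batched planning: one action per environment in a single call."""
    return np.clip(-obs, -0.1, 0.1)

def rollout(env, max_steps):
    """Run all environments in lockstep and return the per-environment returns."""
    obs = env.pos.copy()
    returns = np.zeros_like(obs)
    for _ in range(max_steps):
        actions = plan_batch(obs)  # one batched call instead of a loop over environments
        obs, rewards, done = env.step(actions)
        returns += rewards
        if done.all():
            break
    return returns

env = ToyVectorEnv(num_envs=25, seed=0)
returns = rollout(env, max_steps=50)
```

With a real model, `plan_batch` would be where a single forward pass on the GPU serves all 25 episodes at once, which is the source of the speedup.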