Contrastive Decoding works because many failure modes of language models (such as topic drift) are more prevalent under smaller LMs than under larger LMs.
Contrastive decoding therefore generates outputs that emphasize the best of the expert LM and suppress its amateur tendencies.
We find that better performance is achieved when the scale difference between the expert and the amateur is larger.
Compared to four decoding baselines (nucleus sampling, top-k sampling, typical decoding, and SimCTG), our contrastive decoding method significantly improves the coherence of generated text and improves or maintains fluency, according to both human evaluation and automatic metrics.
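As an illustration of the expert–amateur contrast described above, the sketch below scores next-token candidates by the difference between the expert's and the amateur's log-probabilities, restricted to tokens the expert itself considers plausible. The NumPy formulation and the `alpha` plausibility threshold are assumptions made for illustration, not details stated in this section.

```python
import numpy as np

def contrastive_decoding_scores(expert_logits, amateur_logits, alpha=0.1):
    """Score next-token candidates by contrasting an expert LM with an amateur LM.

    expert_logits, amateur_logits: 1-D arrays of next-token logits over the same vocabulary.
    alpha: assumed plausibility threshold (hypothetical hyperparameter for this sketch).
    Returns an array of contrastive scores; implausible tokens are masked to -inf.
    """
    # Convert raw logits to log-probabilities via log-softmax.
    expert_logp = expert_logits - np.logaddexp.reduce(expert_logits)
    amateur_logp = amateur_logits - np.logaddexp.reduce(amateur_logits)

    # Plausibility constraint: keep only tokens the expert assigns at least
    # alpha times the probability of its single most likely token.
    cutoff = np.log(alpha) + expert_logp.max()
    plausible = expert_logp >= cutoff

    # Contrastive score: prefer tokens the expert likes much more than the
    # amateur does, down-weighting the amateur's failure modes.
    return np.where(plausible, expert_logp - amateur_logp, -np.inf)

# Greedy selection of the next token under this contrastive objective:
# next_token = int(np.argmax(contrastive_decoding_scores(e_logits, a_logits)))
```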





