[Paper review] GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

[Paper review] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

[Paper review] ZeroQ: A Novel Zero Shot Quantization Framework

[Paper review] GenQ: Quantization in Low Data Regimes with Generative Synthetic Data

[Paper review] Learned Token Pruning for Transformers

[Paper review] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

[Paper review] KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference

[Paper review] Quantization in Layer’s Input is Matter

[Paper review] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

[Paper review] LoRA: Low-Rank Adaptation of Large Language Models

[Paper review] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

[Paper review] STAIR: Improving Safety Alignment with Introspective Reasoning

[Paper review] HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding

[Paper review] Going Down Memory Lane: Scaling Tokens for Video Stream Understanding with Dynamic KV-Cache Memory