Series

LLM-KV-Cache-Q

1. [24.arXiv] KVQuant: Towards 10M Context Length LLM Inference with KV Cache Quantization

parent: SqueezeLLM Settings LLaMA, Llama-2, Llama-3, Mistral Wikitext-2, C4 1M on a single A100-80GB GPU, 10M on 8-GPU Motivation small batc

September 19, 2024

2. No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization

One-line summary: Instead of discarding the unimportant tokens that KV cache eviction methods would normally evict, store them in low precision to retain minimal information, and store important tokens in high precision: a mixed-preci

February 1, 2025
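
As a rough illustration of the idea summarized for entry 2 (keep important tokens in full precision, and keep the rest in low precision rather than evicting them), here is a minimal Python sketch. It is not the paper's implementation: the importance scores, the `keep` budget, the per-token uniform quantizer, and every function name below are assumptions made purely for illustration.

```python
from typing import List, Tuple, Union

Quantized = Tuple[List[int], float, float]  # (integer codes, scale, offset)


def quantize(vec: List[float], bits: int) -> Quantized:
    """Uniform asymmetric quantization of one token's vector (illustrative)."""
    lo, hi = min(vec), max(vec)
    levels = (1 << bits) - 1
    # Guard against a constant vector, where hi == lo.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((x - lo) / scale) for x in vec]
    return codes, scale, lo


def dequantize(codes: List[int], scale: float, lo: float) -> List[float]:
    """Map integer codes back to approximate float values."""
    return [c * scale + lo for c in codes]


def mixed_precision_cache(
    kv: List[List[float]],
    importance: List[float],
    keep: int,
    bits: int = 2,
) -> List[Tuple[str, Union[List[float], Quantized]]]:
    """Keep the `keep` most important tokens in full precision;
    quantize every other token to `bits` bits instead of evicting it."""
    ranked = sorted(range(len(kv)), key=lambda i: importance[i], reverse=True)
    important = set(ranked[:keep])
    cache = []
    for i, vec in enumerate(kv):
        if i in important:
            cache.append(("full", list(vec)))
        else:
            cache.append(("quant", quantize(vec, bits)))
    return cache


def read_token(entry) -> List[float]:
    """Return a float vector for a cache entry, dequantizing if needed."""
    kind, payload = entry
    return payload if kind == "full" else dequantize(*payload)


# Hypothetical toy data: three tokens with made-up importance scores.
kv = [[0.0, 1.0, 2.0, 3.0], [1.0, 1.0, 1.0, 1.0], [-1.0, 0.5, 0.2, 1.0]]
importance = [0.9, 0.1, 0.5]
cache = mixed_precision_cache(kv, importance, keep=1, bits=2)
```

In this sketch an eviction-only policy would drop tokens 1 and 2 entirely, whereas here they remain readable through `read_token` with an error of at most half a quantization step per element.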

3. GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM

Observations 1) KIVI does well 👍🏻 on simple tasks and poorly 👎🏻 on complex tasks. Simple tasks 👍🏻: existing methods (KIVI, KVQuant, FlexGen) work well at low precision on simple tasks. (multiple-ch

February 2, 2025