ML System

1.[EuroSys'25] CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion

2.[MLSys'25] FlexInfer: Flexible LLM Inference with CPU Computations

3.[SIGCOMM'25] MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism

4.[SOSP'25] PrefillOnly: An Inference Engine for Prefill-only Workloads in Large Language Model Applications

5.[SOSP'23] Efficient Memory Management for Large Language Model Serving with PagedAttention

6.[ASPLOS'25] vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention

7.[SOSP'25] JENGA: Effective Memory Management for Serving LLM with Heterogeneity

8.Utility-Driven Speculative Decoding for Mixture-of-Experts

9.[SOSP'25] IC-Cache: Efficient Large Language Model Serving via In-context Caching