arXiv

Not All Prefills Are Equal: PPD Disaggregation for Multi-turn LLM Serving

Prefill-Decode (PD) disaggregation has become the standard architecture for modern LLM inference engines, which alleviates the …

Zongze Li, Jingyu Liu, Zach Xu, Yineng Zhang, Tahseen Rabbani, Ce Zhang

Scaling Beyond Masked Diffusion Language Models

Diffusion language models are a promising alternative to autoregressive models due to their potential for faster generation. Among …

Subham Sekhar Sahoo, Jean-Marie Lemercier, Zhihan Yang, Justin Deschenaux, Jingyu Liu, John Thickstun, Ante Jukic

HAMburger: Accelerating LLM Inference via Token Smashing

The growing demand for efficient Large Language Model (LLM) inference requires a holistic optimization on algorithms, systems, and …

Jingyu Liu, Ce Zhang

Speculative Prefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation

Improving time-to-first-token (TTFT) is an essentially important objective in modern large language model (LLM) inference engines. …

Jingyu Liu, Beidi Chen, Ce Zhang
