John, the SGLang core maintainer, presents SGLang, an open-source framework that optimizes large language model inference through innovations such as DeepSeek model optimizations, reinforcement learning integration, HiCache, and deterministic inference, achieving significant throughput gains and cost reductions. The talk also covers scalable speculative-decoding training, distributed inference on AMD GPUs, and enterprise-grade model deployment orchestration, with plans for further optimization and unified model management.
In this talk, John, the SGLang core maintainer, presents an overview of SGLang, an open-source framework designed to optimize the serving performance of large language models (LLMs). He begins by outlining the agenda: a review of the 2025 highlights, an outlook for the remainder of the year, and a Q&A session. John encourages the audience to connect via GitHub, X, and LinkedIn for further engagement with the open-source community.
The 2025 highlights cover several key advancements in SGLang: DeepSeek model optimizations, large-scale deployment capabilities, reinforcement learning integration, speculative-decoding training, HiCache, deterministic inference, and support for new models like D0. Notably, the framework now supports model deployment orchestration and distributed inference on AMD GPUs, broadening its applicability and efficiency across different hardware platforms.
John details the DeepSeek optimizations, which have significantly improved the performance of the DeepSeek V3 model through techniques such as FlashAttention integration, dynamic MLA-to-MHA switching, DeepGEMM integration, kernel fusion, and data-parallel (DP) attention support. These enhancements have roughly doubled throughput. Additionally, large-scale deployment improvements enable the system to handle up to 52,000 input tokens per second per node, at roughly one fifth the cost of the official DeepSeek API, a result validated by multiple teams.
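The dynamic MLA-to-MHA switching mentioned above can be sketched as a simple dispatch heuristic. This is an illustrative sketch only: the function name, threshold, and the token-count heuristic are assumptions for exposition, not SGLang's actual API.

```python
# Hypothetical sketch of dynamic MLA-to-MHA switching; names and the
# threshold heuristic are illustrative, not SGLang's real implementation.

def choose_attention_mode(batch_size: int, prefill_len: int,
                          threshold_tokens: int = 2048) -> str:
    """Pick an attention kernel for a DeepSeek-style MLA model.

    MLA keeps the KV cache compressed (latent) and shines on long,
    memory-bound workloads; expanding to plain MHA can be faster for
    short, compute-bound prefills where the extra FLOPs are cheap.
    """
    total_tokens = batch_size * prefill_len
    if total_tokens <= threshold_tokens:
        return "mha"   # expand latent KV -> standard multi-head attention
    return "mla"       # stay in compressed multi-head latent attention

print(choose_attention_mode(batch_size=8, prefill_len=128))   # → mha
print(choose_attention_mode(batch_size=4, prefill_len=4096))  # → mla
```

The design point is that the switch happens per batch at runtime, so the same deployment serves both short-prompt and long-context traffic without a restart.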
The talk also highlights the integration of reinforcement learning frameworks like slime, built on Megatron-LM and SGLang, which have been instrumental in training large models such as GLM 4.5 and 4.6. Another significant development is the SpecForge training framework for speculative-decoding draft models, supporting advanced architectures and scalable, memory-efficient distributed training. SpecForge has been widely adopted by the community for training EAGLE-style draft models, including a GPT-OSS EAGLE draft.
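For context, the draft models that SpecForge trains plug into the standard draft-and-verify loop of speculative decoding. The sketch below shows that loop with toy integer-token "models" as stand-ins; the function names and the greedy acceptance rule are illustrative, not SpecForge's API.

```python
# Toy sketch of draft-and-verify speculative decoding, the mechanism that
# trained draft models accelerate. The "models" are hypothetical stand-ins:
# each maps a context tuple to its next token.

def speculative_step(context, draft_model, target_model, k=4):
    """Draft k tokens cheaply, then verify them with the target model.

    Greedy acceptance: a drafted token is kept only while the target model
    agrees; on the first disagreement we take the target's token and stop.
    """
    # 1) Draft phase: autoregressively propose k tokens with the cheap model.
    drafted, ctx = [], list(context)
    for _ in range(k):
        tok = draft_model(tuple(ctx))
        drafted.append(tok)
        ctx.append(tok)

    # 2) Verify phase: the target model checks each drafted position
    #    (in a real system, all k positions are scored in one forward pass).
    accepted, ctx = [], list(context)
    for tok in drafted:
        target_tok = target_model(tuple(ctx))
        if target_tok == tok:
            accepted.append(tok)         # draft agreed: keep the token
            ctx.append(tok)
        else:
            accepted.append(target_tok)  # mismatch: take target's token, stop
            break
    return accepted

# Toy models over integer tokens: the target doubles the last token; the
# draft agrees until the last token reaches 8, then diverges.
target = lambda ctx: ctx[-1] * 2
draft = lambda ctx: ctx[-1] * 2 if ctx[-1] < 8 else 0

print(speculative_step((1,), draft, target, k=4))  # → [2, 4, 8, 16]
```

A better-trained draft model agrees with the target more often, so more of the k drafted tokens survive verification and each expensive target pass yields several tokens instead of one — which is exactly why draft-model training frameworks matter.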
Finally, John discusses advancements in HiCache, deterministic inference, and model deployment orchestration. HiCache's hierarchical tree structure and GPU-assisted I/O have significantly improved throughput and reduced time to first token, with adoption by major companies such as Ant Group and Alibaba Cloud. Deterministic inference now achieves up to a 4.8x speedup using CUDA graphs while preserving reproducibility. The OME Kubernetes operator, developed by Oracle, simplifies enterprise-grade model management. Looking ahead, the SGLang team plans further refactoring, performance optimizations, and development of the SGLang Model Gateway to unify model routing and deployment, with ongoing community collaboration encouraged.
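The cache tree structure discussed above rests on prefix matching over previously served token sequences. The sketch below shows a minimal token-level trie for that idea; class and method names are illustrative, and real cache nodes hold KV-cache handles (with eviction and, in a hierarchical design, CPU/disk tiers) rather than just tokens.

```python
# Minimal sketch of tree-based prefix caching: requests sharing a prefix
# (e.g. a common system prompt) reuse its KV cache instead of re-prefilling.
# Illustrative only; not SGLang's actual data structures.

class TrieNode:
    def __init__(self):
        self.children = {}  # token -> TrieNode (real nodes carry KV handles)

class PrefixCache:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens):
        """Record a served token sequence in the tree."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())

    def match_prefix(self, tokens):
        """Length of the longest cached prefix; those positions reuse
        cached KV entries, so only the suffix needs a fresh prefill."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched

cache = PrefixCache()
cache.insert([1, 2, 3, 4])               # e.g. a shared system prompt
print(cache.match_prefix([1, 2, 3, 9]))  # → 3 (three tokens of KV reused)
print(cache.match_prefix([5, 6]))        # → 0 (no shared prefix)
```

The longer the shared prefix across requests, the more prefill work is skipped, which is what drives the time-to-first-token improvements cited in the talk.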