Easy, Fast, and Cheap LLM Serving for Everyone

vLLM is an open-source, cross-platform large language model serving engine designed for fast, accurate, and efficient inference across diverse hardware, backed by a large community and tight integration with PyTorch. It features advanced memory management, scalable distributed deployment through the Kubernetes-native llm-d project, and ongoing optimizations in collaboration with AMD to deliver high performance and flexibility for LLM workloads.

The speaker introduces vLLM, an open-source large language model (LLM) inference and serving engine designed to be fast, easy to use, and cross-platform. vLLM supports a wide range of models from day zero, including Llama, Qwen, and DeepSeek, and runs efficiently on diverse hardware such as AMD GPUs, CPUs, and other accelerators. The project has a large and active community, with over 60,000 GitHub stars, 1,700 contributors, and more than 50 full-time maintainers. vLLM offers both a Python interface for offline batch inference and a drop-in server compatible with open protocols such as the OpenAI API, making it versatile for use cases ranging from production serving to reinforcement learning frameworks.
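To make the protocol compatibility concrete, a drop-in OpenAI-style server accepts the standard chat-completions request shape. The helper below is a hypothetical sketch for illustration only (the model name and default parameters are assumptions, not part of vLLM):

```python
import json

def chat_completion_request(model, messages, temperature=0.7, max_tokens=128):
    """Build a JSON payload for an OpenAI-compatible /v1/chat/completions
    endpoint. Any server speaking this open protocol can accept it."""
    return json.dumps({
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    })

body = chat_completion_request(
    "meta-llama/Llama-3.1-8B-Instruct",  # illustrative model name
    [{"role": "user", "content": "Hello!"}],
)
print(body)
```

Because the payload is the same one existing OpenAI clients emit, pointing such a client at a compatible server requires only changing the base URL.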

vLLM prioritizes accuracy and performance in its model support. The team works closely with model providers to reproduce models accurately before release, using multiple evaluation methods to verify numerical correctness. Performance is continuously monitored through a benchmark dashboard developed with the PyTorch team, which runs tests on every commit to catch regressions. vLLM also maintains tight integration with PyTorch, ensuring compatibility with the latest features and hardware generations, so users can move workloads seamlessly across different GPUs and benefit from ongoing improvements.
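At its core, a per-commit performance gate of the kind described reduces to a threshold check against a recorded baseline. The tolerance value below is illustrative, not vLLM's actual policy:

```python
def is_regression(baseline_tput, current_tput, tolerance=0.05):
    """Flag a commit whose measured throughput (e.g. tokens/s) falls more
    than `tolerance` (a fraction) below the recorded baseline."""
    return current_tput < baseline_tput * (1.0 - tolerance)

# A drop from 1000 to 940 tokens/s exceeds a 5% tolerance and fails the gate.
print(is_regression(1000.0, 940.0))   # True
# A drop to 960 tokens/s stays within tolerance.
print(is_regression(1000.0, 960.0))   # False
```

In practice such gates also average repeated runs to smooth out measurement noise before comparing against the baseline.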

On the technical side, vLLM has innovated in memory management and KV-cache handling to support emerging hybrid model architectures that combine sliding-window attention, state-space models, and other techniques to reduce computational complexity. The hybrid memory allocator optimizes memory usage for these complex models, while the KV connector API enables efficient transfer and storage of key-value caches across devices and storage systems. These advances keep vLLM at the forefront of inference technology, letting it serve increasingly sophisticated models with better resource management.
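The paged idea behind such KV-cache allocators can be sketched in a few lines. This toy allocator is not vLLM's implementation; it only illustrates storing a sequence's cache in fixed-size, non-contiguous blocks that are reclaimed and reused when a sequence finishes:

```python
class PagedKVAllocator:
    """Toy paged KV-cache allocator: each sequence's cache grows one token at
    a time into fixed-size blocks drawn from a shared free pool."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # ids of unused physical blocks
        self.block_table = {}                # seq_id -> list of block ids
        self.num_tokens = {}                 # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        n = self.num_tokens.get(seq_id, 0)
        if n % self.block_size == 0:         # last block is full (or none yet)
            if not self.free:
                raise MemoryError("KV cache exhausted")
            self.block_table.setdefault(seq_id, []).append(self.free.pop())
        self.num_tokens[seq_id] = n + 1

    def free_sequence(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free.extend(self.block_table.pop(seq_id, []))
        self.num_tokens.pop(seq_id, None)

alloc = PagedKVAllocator(num_blocks=4, block_size=2)
for _ in range(3):
    alloc.append_token("seq-1")   # 3 tokens span 2 blocks
print(len(alloc.block_table["seq-1"]), len(alloc.free))  # 2 2
```

Because blocks need not be contiguous, memory fragments far less than with per-sequence contiguous buffers, which is what makes hybrid layouts (different attention types sharing one pool) tractable.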

Beyond single-instance serving, vLLM is expanding into distributed, scalable deployment with llm-d, a Kubernetes-native solution for managing vLLM workloads at scale. llm-d addresses the challenges peculiar to LLM inference, such as high computational cost, highly variable request lengths, and session affinity, which differ from typical web-service workloads. It offers intelligent inference scheduling, prefix-aware routing, load balancing tailored to GPU utilization, and support for expert parallelism to run very large models across multiple nodes, making it easier for operators to deploy and scale LLM services efficiently in production.
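Prefix-aware routing can be illustrated with a minimal sketch: hash an initial slice of the prompt so that requests sharing a prefix (e.g. the same system prompt) land on the replica whose KV cache is likely already warm. This function is a simplified stand-in, not llm-d's actual scheduler, and the 64-character prefix length is an arbitrary assumption:

```python
import hashlib

def route(prompt, replicas, prefix_len=64):
    """Toy prefix-aware router: requests whose first `prefix_len` characters
    match are deterministically sent to the same replica, so shared-prefix
    KV-cache entries can be reused there."""
    digest = hashlib.sha256(prompt[:prefix_len].encode()).digest()
    return replicas[int.from_bytes(digest[:8], "big") % len(replicas)]

replicas = ["replica-0", "replica-1", "replica-2"]
system = "You are a helpful assistant." * 4   # long shared system prompt
# Both requests share the hashed prefix, so they pick the same replica.
print(route(system + " Q1", replicas) == route(system + " Q2", replicas))  # True
```

A production scheduler would combine this signal with live load and queue depth rather than hashing alone, since pure prefix hashing can hot-spot one replica.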

Finally, the speaker highlights the ongoing collaboration with AMD to optimize vLLM for AMD hardware. This includes participation in the InferenceMAX benchmark to provide transparent performance comparisons and integration of AMD's ROCm kernel library, which has been the default for vLLM on AMD GPUs for over a year. Future work includes adding more optimized kernels, supporting new hardware features such as 4-bit MXFP4 precision, and improving networking through the MORI project to boost distributed inference performance. The overall goal remains an open-source ecosystem that delivers the best accuracy, performance, and hardware flexibility for LLM serving.
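To give a feel for block-scaled 4-bit formats such as MXFP4, the sketch below quantizes a block of values against a single shared scale. It deliberately simplifies: real MXFP4 encodes elements on an FP4 value grid with a power-of-two shared scale, whereas this toy version uses symmetric 4-bit integers (-8..7); only the shared-scale idea carries over.

```python
def quantize_block_int4(values):
    """Quantize a block of floats to 4-bit integers sharing one scale.
    Simplified stand-in for microscaling formats like MXFP4."""
    scale = max(abs(v) for v in values) / 7.0 or 1.0  # avoid zero scale
    q = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, q

def dequantize_block(scale, q):
    """Recover approximate floats from the shared scale and 4-bit codes."""
    return [scale * x for x in q]

scale, q = quantize_block_int4([0.0, 3.5, -7.0, 7.0])
print(scale, q)  # 1.0 [0, 4, -7, 7]
```

The payoff is memory: each element costs 4 bits plus a small amortized share of the per-block scale, roughly quartering weight storage relative to 16-bit formats at some cost in precision.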