Filip Makraduli discusses the challenges of small model inference in production and presents Superlinked’s open-source inference engine, which optimizes GPU usage through hot-swapping and supports diverse model architectures for efficient AI workflows. He emphasizes the importance of combining robust model support with comprehensive infrastructure solutions to manage inference complexity, enabling seamless deployment, scaling, and integration in real-world applications.
Filip Makraduli’s talk addresses a significant gap in the AI market around small model inference, particularly for AI search and document processing. Initially confident in his understanding of model performance, Filip realized he had overlooked the practical challenges of running inference in production. To bridge this gap, he joined Superlinked, collaborating with infrastructure engineers to build and open-source the Superlinked inference engine. The tool supports the small models used in agent workflows, enabling efficient inference, and has been tested with partners such as Chroma and Weaviate.
Filip emphasizes the importance of small model inference in managing context rot—a phenomenon where the quality of AI outputs degrades as the context size increases. Small models can preprocess data to reduce context size, improving agent performance in workflows. This approach is supported by community efforts, including knowledge graph construction and context filtering models from organizations like Chroma. Filip also highlights practical applications, such as taxonomy classification in e-commerce, where small models serve as tools for effective data retrieval and processing.
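The context-filtering idea can be sketched in a few lines. This is a minimal illustration, not Superlinked's code: the `relevance_score` function below is a stand-in (plain token overlap) for whatever small reranker or classifier model actually scores each chunk, and `filter_context` keeps only the top chunks so the agent's context stays small.

```python
def relevance_score(query: str, chunk: str) -> float:
    # Placeholder for a small reranker/classifier model; token overlap
    # stands in for the model's relevance score here.
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

def filter_context(query: str, chunks: list[str], budget: int = 2) -> list[str]:
    # Rank candidate chunks and keep only `budget` of them, shrinking
    # the context an agent sees and mitigating context rot.
    ranked = sorted(chunks, key=lambda ch: relevance_score(query, ch), reverse=True)
    return ranked[:budget]
```

In a real pipeline the scorer would be a small model served by the inference engine, and `budget` would be tuned to the downstream model's usable context window.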
He clarifies common misconceptions about inference, noting that simply adding more GPUs is not a viable solution for small models. Since these models occupy only a few gigabytes of memory, dedicating a GPU per model leads to underutilization and wasted resources. Instead, Superlinked’s engine supports hot-swapping of models on a single GPU with a least recently used eviction policy, optimizing GPU usage and reducing costs. Additionally, inference is not just about server deployment but also involves complex production challenges like routing, auto-scaling, and monitoring, areas where existing open-source solutions fall short.
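The hot-swapping scheme described above can be sketched as a least-recently-used cache keyed by model name. This is an illustration of the idea, not Superlinked's actual implementation; the `load_fn` and `unload_fn` callbacks are hypothetical stand-ins for moving weights onto and off the GPU.

```python
from collections import OrderedDict

class GPUModelCache:
    """LRU hot-swapping of models on a single GPU (sketch).

    Evicts the least recently used model when loading a new one
    would exceed the GPU's memory budget.
    """

    def __init__(self, capacity_gb: float, load_fn, unload_fn):
        self.capacity_gb = capacity_gb
        self.load_fn = load_fn      # hypothetical: loads weights onto the GPU
        self.unload_fn = unload_fn  # hypothetical: frees the GPU memory
        self._models = OrderedDict()  # name -> (model, size_gb)

    def get(self, name: str, size_gb: float):
        if name in self._models:
            self._models.move_to_end(name)  # mark as most recently used
            return self._models[name][0]
        # Evict least-recently-used models until the new one fits.
        while self._models and self._used_gb() + size_gb > self.capacity_gb:
            evicted_name, (model, _) = self._models.popitem(last=False)
            self.unload_fn(evicted_name, model)
        model = self.load_fn(name)
        self._models[name] = (model, size_gb)
        return model

    def _used_gb(self) -> float:
        return sum(size for _, size in self._models.values())
```

Because small models occupy only a few gigabytes each, one GPU can serve many of them this way, keeping utilization high instead of pinning one underused GPU per model.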
Filip introduces the “yin and yang” of model inference, combining comprehensive model support with robust infrastructure. On the model side, supporting a wide variety of open-source models is crucial, as these models are rapidly improving in accuracy and efficiency. However, supporting diverse models is challenging due to differences in architectures, attention mechanisms, positional embeddings, and output formats. Superlinked addresses this by re-implementing the forward pass of models to handle these variations efficiently, including variable-length flash attention and support for complex models like ColBERT.
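One of the batching differences behind variable-length attention can be shown with a small sketch. Instead of padding every sequence in a batch to the longest one, varlen-style kernels take the sequences concatenated into one flat buffer plus cumulative boundary offsets (often called `cu_seqlens`). The NumPy code below only illustrates that packing format, not Superlinked's re-implemented forward pass:

```python
import numpy as np

def pack_varlen(sequences):
    # Concatenate variable-length sequences and record cumulative
    # boundary offsets, the `cu_seqlens`-style layout consumed by
    # variable-length flash attention kernels.
    lengths = [len(s) for s in sequences]
    packed = np.concatenate(sequences)
    cu_seqlens = np.cumsum([0] + lengths)  # boundaries into `packed`
    return packed, cu_seqlens

def unpack(packed, cu_seqlens):
    # Recover the individual sequences from the flat buffer.
    return [packed[a:b] for a, b in zip(cu_seqlens[:-1], cu_seqlens[1:])]
```

The win is that no compute is spent on padding tokens, which matters when a batch mixes short queries with long documents.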
On the infrastructure side, Superlinked provides an end-to-end solution that integrates model inference with hardware provisioning, routing, queuing, and auto-scaling using tools like Prometheus and KEDA. This holistic approach allows users to deploy and switch between hundreds of models seamlessly, with infrastructure managed through simple configurations and Terraform. Filip concludes by inviting the audience to explore the open-source Superlinked inference engine (SIE) and reveals that the background image in his presentation visualizes sinusoidal positional encodings used in transformers, tying back to foundational concepts in model inference.
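The sinusoidal positional encodings Filip alludes to are the classic transformer formulation: even dimensions use sine, odd dimensions use cosine, with wavelengths geometrically spaced from 2π to 10000·2π. A minimal NumPy version (assuming an even `d_model`) looks like this:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Classic transformer sinusoidal positional encodings.

    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    Assumes d_model is even.
    """
    positions = np.arange(seq_len)[:, None]   # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]  # (1, d_model // 2)
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe
```

Plotting this matrix as an image produces the striped pattern visible in the presentation background: each column is a sinusoid of a different wavelength across positions.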