The talk presents an AI-driven framework called AGI Kit that autonomously generates and optimizes computational kernels to accelerate AI workloads, using large language models to iterate on performance improvements tailored to specific hardware. The approach addresses the limitations of traditional manual optimization, demonstrates significant performance gains, and envisions a future where AI-generated kernels continuously self-optimize in real time for maximum efficiency.
The talk focuses on leveraging AI to generate optimized kernels that accelerate AI workloads, addressing the common pain of slow training and development cycles. Kernels, the small programs that do the heavy computational lifting, often run inefficiently because of poor coordination and hardware underutilization. The speaker uses a kitchen analogy: kernels are chefs working with ingredients (numbers) and recipes (operations) to produce meals (results), and the goal is a “magical kitchen” where all components work seamlessly together, maximizing hardware performance.
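To ground the analogy, the sketch below shows what one such “chef” looks like in practice: a minimal elementwise GPU kernel written in Triton, a Python DSL for kernels. The example is purely illustrative and does not come from the talk.

```python
# Illustrative only: a minimal elementwise GPU kernel in Triton (a Python DSL),
# shown to make "kernel" concrete. Not from the talk.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance ("chef") handles one block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements          # guard against out-of-bounds access
    x = tl.load(x_ptr + offsets, mask=mask)   # ingredients (numbers)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)  # the finished meal

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)       # one program per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```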
Traditionally, kernel engineers optimize by hand, tweaking assembly code, managing memory, and counting cache lines, but this process is slow and demands rare expertise at a time when development cycles call for quick, reliable results. To overcome these limitations, the team proposes an AI-driven system that can self-reflect and optimize kernel performance for the hardware in use. The system uses large language models (LLMs) to generate numerous kernel candidates, evaluate their performance, and iteratively refine them, akin to a “Tinder” for kernels where good candidates are “swiped right” and bad ones “swiped left.”
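The loop below is a minimal sketch of that generate-evaluate-select cycle. The `Candidate` shape and all helper functions are hypothetical placeholders for illustration, not AGI Kit’s actual API.

```python
# Minimal sketch of the generate/evaluate/select loop ("Tinder for kernels").
# Helper names and the Candidate shape are hypothetical, not AGI Kit's API.
import random
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    source: str     # generated kernel source
    correct: bool   # passed correctness checks
    speedup: float  # measured speedup vs. the baseline

def ask_llm(baseline: str, feedback: str, n: int) -> List[str]:
    """Placeholder for an LLM call that proposes n rewritten kernels."""
    return [baseline for _ in range(n)]

def evaluate(source: str) -> Candidate:
    """Placeholder for compile + correctness + timing (see the harness sketch below)."""
    return Candidate(source, correct=True, speedup=random.uniform(0.5, 1.5))

def optimize(baseline: str, rounds: int = 5, population: int = 16) -> Candidate:
    best = Candidate(baseline, correct=True, speedup=1.0)
    feedback = ""
    for _ in range(rounds):
        batch = [evaluate(s) for s in ask_llm(baseline, feedback, population)]
        # "Swipe right": keep candidates that pass checks and beat the best.
        keepers = [c for c in batch if c.correct and c.speedup > best.speedup]
        if keepers:
            best = max(keepers, key=lambda c: c.speedup)
        # Compiler errors and timings steer the next generation.
        feedback = f"best speedup so far: {best.speedup:.2f}x"
    return best
```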
The team has developed an agentic framework called AGI Kit that automates kernel generation and optimization. The framework analyzes code, researches optimization patterns, generates kernel candidates, and evaluates them through compilation, correctness checks, and performance measurements; feedback from these evaluations is used to mutate and improve subsequent generations of kernels. The approach has been tested on benchmarks such as Level 1 of Stanford’s KernelBench and on real-world libraries such as ROCm’s BLAS and solver libraries (rocBLAS and rocSOLVER), showing high compilation success rates and performance improvements ranging from 4% to over 20%.
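The evaluation stage described here (compile, verify, time) might look roughly like the following sketch. The file layout, compiler flags, and harness protocol are assumptions for illustration; the talk does not detail AGI Kit’s real harness. `hipcc` is ROCm’s HIP compiler driver.

```python
# Sketch of the evaluate stage: compile, check correctness, measure speed.
# Paths, flags, and the harness protocol are illustrative assumptions.
import os
import subprocess
import tempfile
import time

def evaluate_candidate(kernel_src: str, harness_src: str) -> dict:
    result = {"compiles": False, "correct": False, "runtime_ms": float("inf")}
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "candidate.hip")
        exe = os.path.join(tmp, "candidate")
        with open(src, "w") as f:
            f.write(harness_src + kernel_src)  # harness supplies main() + checks
        # Stage 1: does it compile?
        build = subprocess.run(["hipcc", "-O3", src, "-o", exe],
                               capture_output=True, text=True)
        if build.returncode != 0:
            result["error"] = build.stderr  # fed back to the LLM as feedback
            return result
        result["compiles"] = True
        # Stage 2: correctness -- the harness exits nonzero on a mismatch.
        run = subprocess.run([exe], capture_output=True, text=True, timeout=60)
        if run.returncode != 0:
            return result
        result["correct"] = True
        # Stage 3: performance -- wall-clock here; a real harness would use
        # device timers and many repetitions.
        start = time.perf_counter()
        subprocess.run([exe], capture_output=True, timeout=60)
        result["runtime_ms"] = (time.perf_counter() - start) * 1000
    return result
```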
One challenge highlighted is the scarcity of open-source kernel data, which forces the team to generate large datasets internally to train and fine-tune models effectively. The team collaborates with other groups to build these datasets, enabling language models that can optimize kernels autonomously. Early results with a 7-billion-parameter model show up to 40% performance improvements on certain tasks, although some areas, such as HIP-to-HIP kernel optimization, still need more data and refinement.
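A single training record in such an internally generated dataset might look like the sketch below. The schema and field names are assumptions; the talk only states that large kernel datasets were built to fine-tune the model.

```python
# Hypothetical schema for one fine-tuning record: a verified slow->fast
# kernel pair with its measured speedup. Field names are assumptions.
import json

record = {
    "hardware": "MI300X",                    # target GPU (illustrative)
    "baseline_kernel": "...HIP source...",   # slow reference implementation
    "optimized_kernel": "...HIP source...",  # verified faster rewrite
    "speedup": 1.4,                          # measured on the target hardware
    "transforms": ["tiling", "vectorized loads"],  # patterns applied
}

with open("kernel_pairs.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```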
In conclusion, the vision is an evolving ecosystem in which AI-generated kernels adapt dynamically to the hardware they run on, continuously optimizing themselves in real time rather than being fixed at compile time, so that AI workloads run faster and make full use of the hardware (a minimal sketch of this run-time selection idea appears below). The speaker acknowledges the collaborative effort behind the work and expresses optimism about a future where AI not only accelerates AI but also optimizes itself, joking that such kernels might one day even “join us for coffee” as a lighthearted metaphor for seamless integration.
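As a coda, the sketch below illustrates run-time (rather than compile-time) kernel selection in its simplest form: time each candidate variant on the live hardware and cache the winner. It is a simplified illustration of the vision, not AGI Kit’s implementation.

```python
# Simplified run-time selection: benchmark each variant on the actual
# hardware the first time an operation is needed, then cache the winner.
import time
from typing import Callable, Dict, List

_best: Dict[str, Callable] = {}  # cache: operation name -> fastest variant

def dispatch(op: str, variants: List[Callable], *args):
    if op not in _best:
        timings = []
        for fn in variants:
            start = time.perf_counter()
            fn(*args)  # timed run on the live hardware
            timings.append((time.perf_counter() - start, fn))
        _best[op] = min(timings, key=lambda t: t[0])[1]
    return _best[op](*args)
```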