Fine-tuning & Reinforcement Learning for LLMs

In this presentation, Daniel highlights how reinforcement learning, efficient fine-tuning techniques like LoRA, and advanced quantization methods can improve large language models, making AI training scalable and accessible on AMD GPUs. He emphasizes collaboration with major AI platforms and invites the community to engage with their open-source tools and resources to advance AI research and applications.

In this presentation at AMD Dev Day, Daniel introduces the significance of reinforcement learning (RL) in improving large language models (LLMs) and its applications across various industries. He and his brother Michael have developed tools that allow users to train and run LLMs locally, including on AMD GPUs. Their work is widely adopted, with over 100 million monthly downloads on Hugging Face and numerous contributions to bug fixes and model improvements. Daniel emphasizes their collaboration with major AI labs and platforms like Hugging Face and OpenAI to make AI more accessible and efficient.

Daniel explains reinforcement learning as a method where models learn by receiving rewards or penalties based on their actions, which can be applied across many fields such as trading, gaming, math, weather prediction, and AI code generation. He highlights how RL has led to significant performance improvements in models, citing the example of the 2048 game, where RL helps the model learn to maximize rewards by achieving higher scores and avoiding errors. The ultimate goal of RL, he notes, is to automate AI research and potentially reach artificial general intelligence (AGI) by enabling models to learn and adapt across diverse tasks.
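The reward-and-penalty idea behind RL on a game like 2048 can be sketched in a few lines. This is a hypothetical reward scheme for illustration, not the exact formulation from the talk:

```python
# Minimal sketch of an RL reward signal for a 2048-style game:
# reward the score gained by a move, penalize invalid moves.
# (Illustrative values; not the speaker's exact reward design.)

def reward(score_before: int, score_after: int, move_valid: bool) -> float:
    """Return the RL reward for a single move."""
    if not move_valid:
        return -1.0  # penalty: the model attempted an illegal move
    return float(score_after - score_before)  # reward: points gained

# A merge that raises the score from 100 to 132 yields a reward of 32.0,
# so an agent trained on this signal learns to prefer tile-merging moves.
```

An RL training loop then samples moves from the model, scores them with such a function, and updates the model to make high-reward moves more likely.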

The talk also covers technical advancements in RL, including efficient fine-tuning methods like LoRA (Low-Rank Adaptation), which allow training only a small subset of model parameters rather than the entire model. Daniel discusses the importance of selecting the right parameters for LoRA to achieve high accuracy with fewer resources. He also highlights their work on speeding up inference during RL training, which is crucial since RL requires alternating between inference and training phases. Their framework supports 16-bit and 4-bit training and enables ultra-long context training by efficiently managing GPU and CPU memory.
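The core of LoRA described above can be illustrated with a small NumPy sketch, assuming the standard low-rank formulation (a frozen weight plus a trainable low-rank update); dimensions and scaling are illustrative, not taken from the talk:

```python
import numpy as np

# LoRA sketch: instead of updating a full d x d weight W, train only
# two small matrices A (r x d) and B (d x r) with rank r << d.
d, r, alpha = 1024, 8, 16  # illustrative sizes, not from the talk

rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))       # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01  # trainable low-rank factor
B = np.zeros((d, r))                  # trainable, zero-initialized

def lora_forward(x: np.ndarray) -> np.ndarray:
    # Effective weight is W + (alpha / r) * B @ A; since B = 0 at
    # initialization, the adapted model starts identical to the base.
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

# Trainable parameters drop from d*d to 2*r*d:
full_params, lora_params = d * d, 2 * r * d
```

With these sizes the trainable parameter count falls from 1,048,576 to 16,384, which is why LoRA reaches high accuracy with far fewer resources than full fine-tuning.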

Another key topic is model quantization, where Daniel presents their dynamic 1.5-bit quantization technique, which selectively reduces the precision of certain model layers to save memory without sacrificing performance. This approach allows models to run efficiently on hardware with varying capabilities, from high-end GPUs like AMD's MI300 to more modest setups. He showcases benchmarks demonstrating that quantized models maintain strong accuracy on tasks like coding, making RL and LLM fine-tuning more accessible and scalable.

Finally, Daniel announces their support for AMD GPUs, particularly the MI300X with 192GB VRAM, which enables training large models locally with reinforcement learning. He invites attendees to a workshop where they will demonstrate RL on AMD hardware using examples like automatic kernel creation and the 2048 game. He concludes by encouraging the community to explore their tools, contribute on GitHub, and take advantage of AMD’s free GPU credits, emphasizing the collaborative and open nature of their work to advance RL and LLM capabilities.