Run GPT-OSS-120B Locally with Lemonade on AMD ROCm™ Software

The video demonstrates running the GPT-OSS-120B model locally on a Strix Halo PC using AMD's ROCm software and Lemonade Server, enabling efficient code generation and customization, exemplified by creating and enhancing a Python Asteroids game. It highlights the straightforward setup, fast local inference, and the flexibility developers gain by running large language models on personal hardware.

In this video, the presenter demonstrates how to run the GPT-OSS-120B model locally on a Strix Halo PC using AMD's ROCm software. The model fits well on the Strix Halo, allowing for efficient and enjoyable coding experiences. The process begins by launching a terminal and running the Lemonade Server, specifying ROCm as the backend to leverage the GPU. Once the server is up and running, the GPT model is loaded from Lemonade's Continue Hub, which the presenter had previously set up.
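Once Lemonade Server is running, clients such as the Continue extension talk to it over an OpenAI-compatible chat-completions API. The sketch below builds such a request payload; the port number (8000) and the model identifier string `gpt-oss-120b` are illustrative assumptions, not taken from the video.

```python
import json

# Lemonade Server exposes an OpenAI-compatible chat-completions endpoint.
# The base URL below (port 8000) and the model identifier are assumptions
# for illustration; check your local server's settings for the real values.
BASE_URL = "http://localhost:8000/api/v1/chat/completions"

def build_chat_request(prompt, model="gpt-oss-120b"):
    """Build the JSON payload for a single-turn chat-completion request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

payload = build_chat_request("Write a Python Asteroids game using Pygame.")
print(json.dumps(payload, indent=2))
```

POSTing this payload to `BASE_URL` with any HTTP client (or pointing an OpenAI SDK at the local base URL) is all it takes to reuse existing OpenAI-style tooling against the local model.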

The presenter then prompts the model to generate Python code for the classic game Asteroids. While the code is being generated, they explain that Lemonade Server was pre-installed using a Windows installer, which simplifies the setup process to just a few minutes. The GPT model, which is about 60 gigabytes in size, was downloaded in advance using the Lemonade Model Manager. The presenter also notes that Lemonade supports Linux installations, with command-line instructions available on their website.

Using the ROCm 7 beta as the backend for llama.cpp, the model is loaded into memory to ensure prompt response times during code generation. All of this runs locally on the Strix Halo PC, and token generation is fast enough to make working with such a large model practical. After the model completes the Python code for the Asteroids game, the presenter saves the generated code to a Python file and opens a new terminal to activate a conda environment where Pygame is pre-installed.
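The save step can be scripted rather than done by hand. Below is a hypothetical helper (not shown in the video) that pulls the first fenced code block out of a model reply and writes it to a file such as `asteroids.py`:

```python
import re

FENCE = "`" * 3  # a markdown code fence, built up to avoid nesting issues here

def save_generated_code(reply_text, path="asteroids.py"):
    """Extract the first fenced code block from a model reply and save it.

    Falls back to saving the whole reply if no fence is found.
    """
    pattern = FENCE + r"(?:python)?\n(.*?)" + FENCE
    match = re.search(pattern, reply_text, re.DOTALL)
    code = match.group(1) if match else reply_text
    with open(path, "w") as f:
        f.write(code)
    return code

# A toy reply standing in for the model's actual output:
reply = FENCE + "python\nimport pygame  # game code from the model\n" + FENCE
print(save_generated_code(reply, "asteroids.py"))
```

With the file on disk, running the game is just `python asteroids.py` inside the Pygame-equipped conda environment.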

The presenter then runs the game, acknowledging that their skills at playing Asteroids need improvement but expressing excitement about the process. To enhance the game, they copy the class defining the asteroids and ask the model to add color to them, showcasing the flexibility and customization possible with large mixture-of-experts language models. This iterative process of tweaking code and running it locally highlights the power of having such models accessible on personal hardware.
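The kind of tweak the presenter asks for might look like the following. This is a hypothetical sketch, not the video's actual generated code: an `Asteroid` class that picks a random RGB color per instance, which Pygame-style drawing calls can then use.

```python
import random

# Hypothetical palette of earthy RGB tuples for the asteroids; the class
# and values are illustrative, not the code generated in the video.
PALETTE = [(200, 120, 60), (150, 150, 160), (110, 80, 50)]

class Asteroid:
    def __init__(self, x, y, radius):
        self.x, self.y, self.radius = x, y, radius
        # Assign each asteroid its own color once, at creation time;
        # a pygame.draw.circle call could take self.color directly.
        self.color = random.choice(PALETTE)

asteroids = [Asteroid(random.randint(0, 800), random.randint(0, 600), 20)
             for _ in range(5)]
print([a.color for a in asteroids])
```

Because the change is isolated to the class, it is exactly the sort of copy-the-class-and-ask-the-model edit shown in the video.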

In conclusion, the video emphasizes the ease of setting up and running large language models like GPT-OSS-120B locally using Lemonade and ROCm. The presenter encourages viewers to explore Lemonade for themselves by visiting their website. The demonstration illustrates how developers can create and customize applications tailored to their needs, all while benefiting from fast local inference and the ability to iterate quickly within their development environment.