The video demonstrates Ollama Turbo's ability to run large open-source AI models at high speed, achieving around 1,200 tokens per second on an M1 Max laptop by offloading computation to the cloud, while highlighting its new user-friendly GUI and its privacy guarantees. It also emphasizes how easily Turbo can be integrated into applications through simple API configuration, positioning it as a fast, private, and scalable AI service for developers and users alike.
The video showcases the performance of Ollama Turbo, a new cloud-based service for running open-source AI models at high speed. The presenter demonstrates roughly 1,200 tokens per second on an M1 Max laptop using the Ollama client connected to OpenAI's gpt-oss model. While the client runs locally, the heavy computation is offloaded to Ollama's cloud service, Turbo, which is dramatically faster than running the model solely on local hardware.
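For a concrete sense of how that throughput can be observed, here is a minimal sketch using the ollama Python package: the final chunk of a streamed chat response carries eval_count and eval_duration fields, from which tokens per second can be computed. The model tag and prompt are illustrative, and the client is assumed to already be signed in with Turbo enabled.

```python
# Minimal sketch: stream a chat response through the Ollama Python client
# and compute generation throughput from the final chunk's timing stats.
# Assumes the `ollama` package and a client signed in with Turbo enabled;
# the model tag is illustrative.
import ollama

stream = ollama.chat(
    model="gpt-oss:120b",
    messages=[{"role": "user", "content": "Explain quicksort in one paragraph."}],
    stream=True,
)

final = None
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
    final = chunk  # the last chunk (done=True) includes timing metadata

# eval_count = number of generated tokens; eval_duration = generation time
# in nanoseconds, so tokens/second = eval_count / eval_duration * 1e9.
if final is not None and final["done"]:
    tps = final["eval_count"] / final["eval_duration"] * 1e9
    print(f"\n\n~{tps:.0f} tokens/second")
```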
Ollama recently introduced a new graphical user interface (GUI) that simplifies interaction with these models. The UI lets users select between different models, toggle between local and Turbo modes, and even search the internet for enhanced responses. Although the UI is still early and lacks some features, such as real-time performance metrics, it provides a user-friendly way to access powerful AI models. The presenter notes minor usability issues, such as unclear button states, but overall praises the interface as a promising default client for private AI work.
Turbo offers several key benefits beyond speed. It supports larger models, such as the 120-billion-parameter gpt-oss, which are typically too large to run on most personal hardware. Privacy remains a priority: a no-data-retention policy ensures user data stays private. Offloading computation to the cloud can also save battery life, though this depends on having a reliable internet connection. The service is currently in preview, and some previously available models have been replaced, but the core experience remains strong.
For developers, integrating Turbo into applications is straightforward. After obtaining an authorization key from the Ollama subscription management page, users can configure the JavaScript or Python API clients to target the hosted models. The presenter describes modifying their own setup to use Turbo, noting how few changes are needed to switch from local to cloud-based inference, as the sketch below illustrates. This ease of integration makes Turbo an attractive option for those wanting fast, private AI capabilities without complex infrastructure.
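The following is a sketch of what that integration looks like in Python, assuming the ollama package. The host URL and Authorization header follow Ollama's hosted-endpoint convention, but the exact header scheme and model tag should be checked against the Turbo documentation, and OLLAMA_API_KEY is a placeholder for the key from the subscription management page.

```python
# Minimal sketch of switching from local to cloud-based inference with the
# Ollama Python client. OLLAMA_API_KEY is a placeholder environment variable
# holding the key from the subscription management page; host and header
# details may differ from the current Turbo docs.
import os
from ollama import Client

# Local inference: the default client targets the Ollama server on localhost.
local = Client()

# Turbo inference: the same API surface, pointed at the hosted endpoint with
# an Authorization header carrying the key.
turbo = Client(
    host="https://ollama.com",
    headers={"Authorization": os.environ["OLLAMA_API_KEY"]},
)

response = turbo.chat(
    model="gpt-oss:120b",  # hosted model tag; illustrative
    messages=[{"role": "user", "content": "Hello from Turbo!"}],
)
print(response["message"]["content"])
```

The only differences from a purely local setup are the host and the header; the chat call itself is unchanged, which is what makes the switch so minimal.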
Finally, the video touches on the challenges and future potential of Ollama Turbo. While currently available to a limited number of users, the platform will need to scale efficiently as more people gain access. The team has worked to relieve bandwidth and storage bottlenecks to keep performance smooth. The presenter expresses excitement about the evolving model lineup, looks forward to seeing how Turbo grows, and invites viewers to share their thoughts on this new AI service.