Ollama Now Supports Thinking Natively

Ollama’s 0.9.0 update introduces Native Thinking Support: the API can now return a model’s internal reasoning and its final answer as separate fields, with optional streaming for real-time insight into the model’s thought process. By clearly distinguishing reasoning from output, the feature makes reasoning models easier to interpret and build on, and it opens the door to finer-grained performance metrics and analysis.

In the recent 0.9.0 release, Ollama has introduced a groundbreaking feature called Native Thinking Support. While Ollama has supported reasoning models for months, those models previously returned their reasoning mixed into the response text; this update lets Ollama itself separate the thinking process from the final output. The internal reasoning (thinking) and the final response (content) now come back as distinct pieces, making it far easier for developers to interpret and utilize the model’s reasoning process.

The key change is that the API now returns separate thinking and content fields, each populated with the corresponding part of the response. The separation works whether or not streaming is enabled: streamed responses deliver reasoning and final output as distinct chunks, so developers can watch the model’s thought process as it unfolds. Importantly, the feature is opt-in, requiring explicit activation via the API’s think: true parameter or through CLI commands, ensuring existing applications remain unaffected unless intentionally updated.
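For example, here is a minimal sketch of a non-streaming request against a local Ollama instance on its default port (the model name deepseek-r1 and the prompt are illustrative; any reasoning-capable model you have pulled should behave the same):

```python
import requests

# Ask a reasoning model a question with thinking explicitly enabled.
# "deepseek-r1" is an example model name; substitute one you have pulled.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "deepseek-r1",
        "messages": [{"role": "user", "content": "How many r's are in strawberry?"}],
        "think": True,   # opt in to separated reasoning
        "stream": False,
    },
)
message = resp.json()["message"]

# With thinking enabled, reasoning and answer arrive in separate fields.
print("thinking:", message.get("thinking"))
print("content: ", message.get("content"))
```

Omitting think (or setting it to false) leaves the previous behavior in place, which is what makes the feature safe to adopt incrementally.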

The presenter demonstrates how to enable and test the feature using API tools like curl, Postman, or Insomnia. With thinking disabled, the response mixes reasoning tokens into the final answer, which can be confusing to read. Once thinking is enabled, reasoning and content come back clearly separated and are much easier to interpret. Streaming mode goes further, emitting tokens in real time with distinct fields for thinking and content, so users can monitor the model’s internal process as the response is generated.
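The streaming case can be sketched the same way: Ollama streams one JSON object per line, and each chunk’s message carries either a thinking fragment or a content fragment (again, the model name is illustrative):

```python
import json
import requests

with requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "deepseek-r1",
        "messages": [{"role": "user", "content": "Why is the sky blue?"}],
        "think": True,
        "stream": True,
    },
    stream=True,
) as resp:
    in_thinking = False
    for line in resp.iter_lines():
        if not line:
            continue
        msg = json.loads(line).get("message", {})
        # Each streamed chunk populates either "thinking" or "content",
        # so the two phases can be rendered differently as they arrive.
        if msg.get("thinking"):
            if not in_thinking:
                print("--- thinking ---")
                in_thinking = True
            print(msg["thinking"], end="", flush=True)
        elif msg.get("content"):
            if in_thinking:
                print("\n--- answer ---")
                in_thinking = False
            print(msg["content"], end="", flush=True)
```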

A notable aspect discussed is the potential for improved metrics and timing analysis. The API currently reports overall evaluation duration and token counts, but no per-phase timing for thinking versus content. The presenter suggests that developers can compute these metrics themselves by timestamping the streamed chunks, as sketched below, which would enable more precise performance insight and optimization, further enhancing the utility of this feature.
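A rough sketch of that idea: time the phases client-side and treat the first content chunk as the boundary between thinking and answering. The helper below (time_phases is a hypothetical name, not part of Ollama) measures wall-clock time, so the thinking figure also absorbs prompt evaluation and network overhead; it approximates, rather than replaces, server-side metrics:

```python
import json
import time
import requests

def time_phases(model: str, prompt: str) -> dict:
    """Stream a chat request and time the thinking vs. content phases."""
    start = time.monotonic()
    first_content = None
    thinking_chunks = content_chunks = 0
    with requests.post(
        "http://localhost:11434/api/chat",
        json={"model": model,
              "messages": [{"role": "user", "content": prompt}],
              "think": True, "stream": True},
        stream=True,
    ) as resp:
        for line in resp.iter_lines():
            if not line:
                continue
            msg = json.loads(line).get("message", {})
            if msg.get("thinking"):
                thinking_chunks += 1
            elif msg.get("content"):
                # First content chunk marks the end of the thinking phase.
                if first_content is None:
                    first_content = time.monotonic()
                content_chunks += 1
    end = time.monotonic()
    boundary = first_content or end
    return {
        "thinking_seconds": boundary - start,
        "content_seconds": end - boundary,
        "thinking_chunks": thinking_chunks,
        "content_chunks": content_chunks,
    }

print(time_phases("deepseek-r1", "Why is the sky blue?"))
```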

Overall, this update significantly simplifies working with reasoning models, eliminating the need to manually parse internal markers (such as <think> tags) out of the output. It streamlines development and makes it easier to build applications that leverage a model’s internal thought process. The presenter expresses excitement about how this thoughtful API design will improve the developer experience and open new possibilities for AI-powered tools and applications.