This chapter highlights the importance of asynchronous programming and streaming in LangChain for building responsive, scalable conversational AI. It demonstrates how to implement a custom asynchronous callback handler for real-time token streaming and intermediate tool-call handling, and how to integrate that handler with agent executors so that tokens are processed incrementally and intermediate steps are delivered transparently, step by step, to the user.
The chapter opens by explaining why asynchronous code and streaming matter, especially for conversational chat interfaces. Most language model (LM) calls go over APIs and can introduce significant latency; synchronous code blocks the application during these waits, hurting both scalability and user experience. Streaming complements this by sending and displaying tokens incrementally as the LM generates them, mirroring the sequential way LMs produce text. Beyond responsiveness, streaming also makes it possible to surface intermediate steps, such as tool usage in agentic interfaces, improving transparency and user understanding.
The chapter demonstrates how to implement streaming with LangChain's language models, showing how tokens can be received asynchronously and printed or processed as they arrive. It explains the structure of token chunks and how they can be merged to reconstruct the full output, and notes that flushing the output buffer on each print is needed for smooth real-time updates in the console. This streaming approach is then extended to agents, which are more complex because they involve multiple tool calls and intermediate reasoning steps. The agent executor setup is reviewed, emphasizing the need to handle streaming tokens and tool call information properly.
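A minimal sketch of this basic streaming pattern, assuming the langchain-openai package and an OPENAI_API_KEY in the environment (the model name is illustrative):

```python
import asyncio

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", streaming=True)

async def stream_answer(question: str) -> None:
    full = None  # accumulated AIMessageChunk
    async for chunk in llm.astream(question):
        # Each chunk is an AIMessageChunk; chunks support "+", so they
        # can be merged back into the complete message.
        full = chunk if full is None else full + chunk
        # flush=True pushes each token to the console immediately,
        # producing the smooth real-time effect described above.
        print(chunk.content, end="", flush=True)
    print()  # final newline once the stream ends

asyncio.run(stream_answer("Explain streaming in one sentence."))
```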
A key part of the implementation is the creation of a custom asynchronous callback handler that manages tokens streamed from the LM or agent. This handler uses an asynchronous queue to collect tokens and supports distinguishing between intermediate tool calls and the final answer. The async iterator method in the handler waits for tokens to appear in the queue without blocking the event loop, allowing other tasks to run concurrently. The callback methods handle new tokens and the end of LM calls, ensuring that streaming stops appropriately and that intermediate steps can be signaled to the user interface.
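A hedged sketch of such a handler, assuming langchain-core's AsyncCallbackHandler base class; the sentinel strings and method bodies are illustrative, not the chapter's exact implementation:

```python
import asyncio
from typing import Any, AsyncIterator

from langchain_core.callbacks import AsyncCallbackHandler

class QueueCallbackHandler(AsyncCallbackHandler):
    """Collects streamed tokens in an asyncio.Queue for external consumers."""

    def __init__(self) -> None:
        self.queue: asyncio.Queue = asyncio.Queue()

    async def on_llm_new_token(self, token: str, **kwargs: Any) -> None:
        # Called once per streamed token; enqueue it for the consumer.
        if token:
            await self.queue.put(token)

    async def on_llm_end(self, response: Any, **kwargs: Any) -> None:
        # An agent may call the LM several times (once per reasoning step),
        # so mark the end of a step instead of ending the whole stream.
        await self.queue.put("<<STEP_END>>")

    async def aiter(self) -> AsyncIterator[str]:
        # `await queue.get()` suspends this coroutine without blocking the
        # event loop, so the agent and other tasks keep running concurrently.
        while True:
            token = await self.queue.get()
            if token == "<<DONE>>":  # pushed by the caller when the run ends
                return
            yield token
```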
The chapter then shows how to integrate this streaming callback handler with the agent executor, enabling real-time streaming of tokens and tool call data during agent execution. It demonstrates how tokens can be merged and processed outside the executor, which is essential for real-world applications such as APIs where tokens need to be sent to clients incrementally. By separating token processing from the executor’s internal logic, developers gain flexibility to handle streaming output in various ways, such as updating a frontend or sending data over a network.
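To sketch the wiring, assume `agent_executor` was constructed earlier in the chapter, the underlying chat model has streaming enabled, and `QueueCallbackHandler` is the hypothetical handler above:

```python
import asyncio

async def stream_agent(question: str) -> None:
    handler = QueueCallbackHandler()

    async def run_agent() -> None:
        # Callbacks passed via `config` propagate down to the LM calls.
        await agent_executor.ainvoke(
            {"input": question},
            config={"callbacks": [handler]},
        )
        await handler.queue.put("<<DONE>>")  # unblock the consumer loop

    # Run the agent as a background task so tokens can be consumed here,
    # outside the executor's internal logic (e.g., forwarded to a client).
    task = asyncio.create_task(run_agent())
    async for token in handler.aiter():
        if token == "<<STEP_END>>":
            print("\n[step complete]")
        else:
            print(token, end="", flush=True)
    await task  # re-raise any exception from inside the agent run

asyncio.run(stream_agent("What is LangChain?"))
```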
Finally, the chapter illustrates how to consume the streamed tokens asynchronously, process them to identify tool calls and arguments, and display or forward them in a controlled manner. This approach allows for a responsive user experience where users see ongoing progress and intermediate reasoning steps rather than waiting for a complete response. The combination of async programming, streaming, and careful callback handling in LangChain provides a powerful framework for building scalable, interactive AI applications that can handle complex agent workflows efficiently.
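One way to identify tool calls in the stream, sketched here as an illustrative variation rather than the chapter's exact code: change `on_llm_new_token` to enqueue the richer `chunk` keyword argument (a ChatGenerationChunk) instead of the bare token string, so partial tool-call data is visible downstream.

```python
async def consume(handler: QueueCallbackHandler) -> None:
    """Separate tool-call data from final-answer tokens as they arrive."""
    tool_name, tool_args = "", ""
    async for item in handler.aiter():
        if item == "<<STEP_END>>":
            if tool_name:  # a tool call finished streaming in this step
                print(f"\n[tool call] {tool_name}({tool_args})")
                tool_name, tool_args = "", ""
            continue
        # With tool-calling models, partial tool calls arrive in
        # `tool_call_chunks` on the AIMessageChunk: the tool name first,
        # then argument fragments that must be concatenated.
        chunks = item.message.tool_call_chunks
        if chunks:
            tool_name = tool_name or (chunks[0]["name"] or "")
            tool_args += chunks[0]["args"] or ""
        elif item.message.content:
            # Plain content tokens belong to the final answer.
            print(item.message.content, end="", flush=True)
```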