GPT-5 vs. Sonnet-4: Round 2

The video compares GPT-5 and Sonnet-4 at implementing Google’s Gemini multispeaker speech generation API. GPT-5 uses the updated imports, applies voice styling, and adapts to the new documentation, while Sonnet-4 relies on outdated methods and omits styling entirely. GPT-5’s script runs without errors but does not produce the expected audio output because of an audio-settings issue; even so, it outperforms Sonnet-4 overall and handles a complex, recently released API better.

The video presents a detailed comparison between two AI models, GPT-5 and Sonnet-4, focusing on their ability to implement Google’s new Gemini multispeaker speech generation API. The test requires the models to search for information and learn how to use the API to create a simple script that generates speech from a dialogue between two speakers with distinct voice styles—one upbeat and one mature. The models are evaluated on several criteria, including correct import statements, proper API usage, voice selection, styling of speech, and overall functionality of the generated script.

In terms of imports and API implementation, GPT-5 clearly outperforms Sonnet-4. GPT-5 correctly uses the updated Google import statements and calls the appropriate methods for multispeaker voice configuration, while Sonnet-4 relies on outdated imports and incorrect API calls that would prevent the script from running successfully. This demonstrates GPT-5’s superior ability to perform web searches and adapt to new documentation, which is crucial for working with recently released APIs.
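To make "multispeaker voice configuration" concrete, here is a minimal sketch of the request body such a script has to produce. The summary does not show either model's code, so everything here is an assumption: the field names (`speechConfig`, `multiSpeakerVoiceConfig`, `speakerVoiceConfigs`, `prebuiltVoiceConfig`) are recalled from the public Gemini REST documentation and should be checked against the current docs, the speaker names `Ann` and `Bob` are hypothetical, and `build_tts_request` is an illustrative helper, not part of any SDK. In the Python SDK the same structure is, as far as I can tell, expressed through `from google import genai` and the `google.genai.types` classes, whereas `google.generativeai` is the older package.

```python
def build_tts_request(prompt, speaker_voices):
    """Build a request body for multi-speaker speech generation.

    speaker_voices maps each speaker name used in the prompt to a
    prebuilt voice name. Field names are assumed from the public docs.
    """
    speaker_configs = [
        {
            "speaker": speaker,
            "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": voice}},
        }
        for speaker, voice in speaker_voices.items()
    ]
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {
            # Ask for audio instead of the default text response.
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "multiSpeakerVoiceConfig": {
                    "speakerVoiceConfigs": speaker_configs
                }
            },
        },
    }


# "Puck" is the upbeat voice named in the video; "Kore" is only a
# placeholder for the second speaker and is not claimed to be the
# mature voice the task asked for.
body = build_tts_request("Ann: Hi!\nBob: Hello.", {"Ann": "Puck", "Bob": "Kore"})
```

The speaker names in the configuration must match the names used in the dialogue text, which is why the helper takes them from a single mapping.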

When it comes to voice selection, GPT-5 partially succeeds by choosing one correct voice (an upbeat voice named Puck) and one that is close to the requested mature style, though not perfectly matching the specification. Sonnet-4, on the other hand, fails to select the correct voice names entirely, opting instead for generic labels that do not align with the API’s requirements. Both models correctly identify the speech generation model to use, which is a positive point for each.

Styling the speech within the dialogue is another area where GPT-5 excels. It incorporates style instructions directly into the prompt, aligning with the Gemini API’s documentation on how to influence speech style. Sonnet-4 neglects this aspect completely, producing a plain dialogue without any styling cues. This further highlights GPT-5’s better understanding and application of the API’s capabilities.
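Putting style instructions "directly into the prompt" can be sketched as follows. This is an illustration under stated assumptions, not GPT-5's actual output: the instruction wording, the speaker names, and the helper `build_styled_prompt` are all hypothetical, and the only idea taken from the summary is that natural-language style directions precede the dialogue.

```python
def build_styled_prompt(lines, styles):
    """Prefix a dialogue with natural-language style directions.

    lines:  list of (speaker, text) tuples, in speaking order.
    styles: dict mapping speaker name to a style description.
    """
    directions = ", and ".join(
        f"{speaker} sound {style}" for speaker, style in styles.items()
    )
    header = f"Make {directions}:"
    dialogue = "\n".join(f"{speaker}: {text}" for speaker, text in lines)
    return f"{header}\n{dialogue}"


prompt = build_styled_prompt(
    [("Ann", "Hi there!"), ("Bob", "Good evening.")],
    {"Ann": "upbeat", "Bob": "mature and calm"},
)
print(prompt)
```

A script that sends only the bare `Ann: ... / Bob: ...` lines, as Sonnet-4's reportedly did, gives the model no styling cue beyond the voice choice itself.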

Finally, while GPT-5’s script runs without errors, it does not produce the expected audio output due to issues with audio settings, indicating room for improvement. Sonnet-4’s script, however, is unlikely to run at all due to fundamental errors in imports and API calls. The video concludes that GPT-5 is generally superior in handling new and complex API documentation, especially when web search and learning are required, making it the better choice for tasks involving cutting-edge technologies like Google’s Gemini multispeaker speech generation. The presenter also invites viewers to explore more AI-powered applications and consulting services available through their Patreon page.
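The summary does not say which audio settings were wrong. One common way a script can run cleanly yet yield unusable audio is writing the returned raw PCM bytes to a file with mismatched parameters. The sketch below assumes the API returns 16-bit mono PCM at 24 kHz, which is what I recall from the Gemini speech documentation and should be verified; `write_wav` is an illustrative helper using only Python's standard `wave` module.

```python
import wave


def write_wav(path, pcm, channels=1, rate=24000, sample_width=2):
    """Wrap raw PCM bytes in a WAV container.

    The defaults (mono, 24 kHz, 2 bytes per sample) are assumptions
    about the API's output format; wrong values give distorted,
    sped-up, or silent playback even though no error is raised.
    """
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(rate)
        wf.writeframes(pcm)


# One second of silence stands in for the audio bytes from a response.
silence = b"\x00\x00" * 24000
write_wav("out.wav", silence)
```

Because the WAV header is written from these parameters rather than detected from the data, a mismatch never raises an exception, which fits the "runs without errors but no expected audio" behavior described above.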