This week in AI saw major advances, including Alibaba’s Wan-Alpha model for video generation with transparency, real-time text-to-speech with KaniTTS, and powerful open-source tools like Ovi for video and HunyuanImage 3.0 for image generation, alongside progress in avatar creation and robot motion learning. Large language models also moved forward with Z.ai’s GLM 4.6 and Anthropic’s Claude Sonnet 4.5, while Google DeepMind’s Dreamer 4 introduced a novel way to train AI agents through imagined gameplay, collectively pushing the boundaries of video, speech, image, and language technologies.
This week in AI has been packed with developments across video generation, text-to-speech, image generation, and large language models. Alibaba released Wan-Alpha, a video model that generates videos with an alpha (transparency) channel, allowing for easy layering over other backgrounds. It handles difficult transparency effects like glowing flames, translucent objects, and even fine hair segmentation with impressive accuracy. The model and instructions are available on Hugging Face and GitHub for local use. Additionally, a new real-time text-to-speech model called KaniTTS was introduced; it can generate 15 seconds of audio in about one second on consumer-grade GPUs, supports multiple languages and speakers, and is available under an Apache 2.0 license.
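For those who want to try Wan-Alpha locally, the typical first step is pulling the weights from Hugging Face before following the GitHub setup instructions. Here is a minimal sketch using the huggingface_hub library; the repo id is a placeholder, so check the official release page for the real identifier:

```python
# Minimal sketch: fetch model weights from Hugging Face for local use.
# The repo id below is a placeholder; consult the official Wan-Alpha page
# on Hugging Face and its GitHub README for the real identifier and setup.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="org/wan-alpha")  # placeholder repo id
print("Model weights downloaded to:", local_dir)
```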
In the realm of avatars and image generation, CAP4D offers real-time 3D avatars that can be controlled and rotated, built from multiple photos of a person. Nano Banana received an update that lets users control the aspect ratio of image outputs, adding creative flexibility. Tencent released HunyuanImage 3.0, a powerful open-source image generator with strong world knowledge and the ability to render complex text and infographics. While it requires massive hardware to run locally, it is accessible via a free online interface. Nvidia’s LongLive model enables real-time interactive video generation with prompt-based editing, producing videos up to four minutes long, though with some quality limitations.
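As a side note on what aspect-ratio control means in practice: image generators typically work at a roughly fixed pixel budget and derive the width and height from the requested ratio. The helper below is an illustrative sketch of that arithmetic, not any vendor’s actual API:

```python
import math

def dimensions_for_aspect(ratio: str, total_pixels: int = 1024 * 1024) -> tuple[int, int]:
    # Parse "W:H" into integers, e.g. "16:9" -> (16, 9).
    w_ratio, h_ratio = (int(x) for x in ratio.split(":"))
    # Solve (w_ratio * s) * (h_ratio * s) = total_pixels for the scale s.
    scale = math.sqrt(total_pixels / (w_ratio * h_ratio))
    # Snap to multiples of 64, a common requirement for diffusion models.
    width = round(w_ratio * scale / 64) * 64
    height = round(h_ratio * scale / 64) * 64
    return width, height

print(dimensions_for_aspect("16:9"))  # (1344, 768), roughly one megapixel
print(dimensions_for_aspect("1:1"))   # (1024, 1024)
```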
Open-source video generation also saw a major release with Ovi by Character.AI, which generates videos with natively synchronized audio and supports both text-to-video and image-to-video generation. The model can handle multiple speakers, different languages, and even singing, with a minimum GPU requirement of 32 GB of VRAM. Meanwhile, OmniRetarget demonstrated impressive robot motion learning by mapping human motion-capture data onto robots, enabling complex acrobatic and task-oriented movements to be performed autonomously. The project’s dataset has been released, with code expected soon, opening the door to advanced humanoid robot training.
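To give a rough sense of what motion retargeting involves, the sketch below shows the naive scale-and-clamp baseline: joint angles transfer within the robot’s limits, and end-effector targets are rescaled by limb proportions. OmniRetarget itself solves a much harder constrained problem that also preserves contacts, so treat this purely as an illustration with assumed numbers:

```python
import numpy as np

# Assumed, illustrative numbers; a real robot's values come from its spec.
HUMAN_ARM_LENGTH = 0.60            # meters
ROBOT_ARM_LENGTH = 0.45            # meters
ROBOT_JOINT_LIMITS = (-2.0, 2.0)   # radians

def retarget_frame(human_joint_angles: np.ndarray,
                   human_hand_pos: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Naively map one motion-capture frame onto the robot."""
    # Joint angles transfer directly but must respect the robot's limits.
    robot_angles = np.clip(human_joint_angles, *ROBOT_JOINT_LIMITS)
    # Rescale the end-effector target by limb proportions so the robot
    # reaches the proportionally equivalent point in its own workspace.
    robot_hand_pos = human_hand_pos * (ROBOT_ARM_LENGTH / HUMAN_ARM_LENGTH)
    return robot_angles, robot_hand_pos

angles, hand = retarget_frame(np.array([0.3, -2.5, 1.1]), np.array([0.4, 0.1, 0.9]))
print(angles, hand)  # angles clipped to the limits, target scaled to 75%
```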
On the large language model front, Z.ai released GLM 4.6, which boasts a 200K-token context window, improved coding performance, advanced reasoning, and agentic capabilities. It supports real-time applications like color palette generation and CRM dashboards, with web search integration for up-to-date information. DeepSeek introduced V3.2 Experimental, which focuses on efficiency through a sparse attention mechanism, making the model cheaper to run while maintaining strong performance. Anthropic launched Claude Sonnet 4.5, claiming it as the best coding model, though tests showed mixed results against competitors like GPT-5 and Grok 4, especially on scientific and agentic benchmarks.
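To see why sparse attention cuts cost, consider the toy example below: a causal sliding-window mask under which each token attends to only a fixed number of recent tokens, turning the dense O(n²) pattern into O(n·window). DeepSeek’s actual mechanism is more sophisticated than this; the snippet only demonstrates the general principle:

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask that is True where attention is allowed (causal + local)."""
    idx = np.arange(seq_len)
    rel = idx[:, None] - idx[None, :]     # how far each key lies behind each query
    return (rel >= 0) & (rel < window)    # causal, and within the local window

mask = sliding_window_mask(seq_len=8, window=3)
print(mask.astype(int))
# Dense causal attention scores O(n^2) pairs; this mask keeps only
# O(n * window), which is where the cost savings come from.
```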
Finally, Google DeepMind’s Dreamer 4 showcased a groundbreaking approach to training AI agents: it learns a world model of Minecraft from video observations and then lets the agent practice inside that imagined world, learning complex tasks like mining diamonds without ever interacting with the real game. This method holds promise for training real-world robots more efficiently by rehearsing tasks internally before deployment. Only a technical paper has been released so far, but it represents a significant step toward advanced autonomous robot training. Overall, this week’s AI advancements highlight rapid progress in video, speech, image, and language technologies, with many open-source releases and tools becoming more accessible to developers and creators.
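As a closing illustration, here is a conceptual sketch of the imagination-training loop behind Dreamer-style methods: encode offline observations into latent states, roll the policy forward inside the learned world model, and update the agent from imagined trajectories alone. All class and method names below are placeholders, not Dreamer 4’s actual API:

```python
def train_in_imagination(world_model, agent, replay_buffer,
                         horizon: int = 15, steps: int = 1000):
    """Train a policy purely inside a learned world model (all names are placeholders)."""
    for _ in range(steps):
        # Infer a starting latent state from offline video observations.
        state = world_model.encode(replay_buffer.sample())
        trajectory = []
        for _ in range(horizon):
            action = agent.act(state)
            # The world model predicts the next latent state and reward,
            # so no interaction with the real environment is needed.
            state, reward = world_model.step(state, action)
            trajectory.append((state, action, reward))
        # Update the policy from imagined experience alone.
        agent.update(trajectory)
```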