This week in AI saw major advances, including VibeVoice, a real-time open-source TTS model with high speaker similarity, and cutting-edge video generators such as SteadyDancer and Tencent's Hunyuan Video 1.5, alongside powerful image models from Alibaba, ByteDance, and Meta. Google's Gemini 3 Deep Think and DeepSeek 3.2 set new standards on complex reasoning tasks, while innovations like EngineAI's agile humanoid robot and Alibaba's Live Avatar highlight rapid progress in real-time multimodal AI.
This week in AI has been exceptionally busy, with groundbreaking releases across video, image, and speech generation. Among the highlights is VibeVoice, an open-source real-time text-to-speech (TTS) model that can clone voices from just a few seconds of reference audio. It supports a range of accents and languages, runs efficiently on consumer-grade GPUs or even CPUs, and generates speech with very low latency and high speaker similarity, making it one of the strongest real-time TTS tools available right now. The code is already released on GitHub.
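To make the workflow concrete, here is a minimal sketch of few-shot voice cloning with a model like this. The `vibevoice` module, `VibeVoice` class, checkpoint name, and `synthesize` signature are hypothetical stand-ins rather than the project's documented API; check the GitHub repo for the real entry points.

```python
# Hypothetical usage sketch for real-time voice-cloning TTS.
# Module, class, checkpoint id, and method signatures are illustrative
# only; consult the project's GitHub README for the actual API.
import soundfile as sf  # pip install soundfile

from vibevoice import VibeVoice  # hypothetical package name

tts = VibeVoice.from_pretrained("vibevoice-realtime")  # hypothetical id
tts.to("cuda")  # the model reportedly also runs on consumer GPUs or CPUs

audio = tts.synthesize(
    text="Hello! This is a cloned voice speaking in real time.",
    reference_audio="speaker_sample.wav",  # a few seconds of reference audio
    language="en",
)
sf.write("output.wav", audio, samplerate=24_000)  # sample rate assumed
```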
In video generation, several state-of-the-art models have emerged. SteadyDancer stands out as the best tool for animating characters from reference videos, outperforming previous leaders like Wan Animate by producing smoother, more coherent, and more consistent animations, even for irregularly proportioned or fictional characters. PixVerse 5.5 and Runway Gen 4.5 offer video generation with sound and improved physics, though their dialogue quality and video coherence still lag behind the top models. Kuaishou's Kling O1 and Kling 2.6 provide flexible multimodal video editing and generation, with Kling 2.6 introducing native sound that syncs well with the visuals. Tencent's Hunyuan Video 1.5 now offers a distilled model that speeds up video generation by 75% with no noticeable quality loss.
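As a back-of-envelope illustration of where a 75% speed-up typically comes from, the sketch below compares render times for a full diffusion model versus a step-distilled one. The step counts and per-step cost are assumptions chosen for illustration, not Tencent's published numbers.

```python
# Back-of-envelope: what a ~75% video-generation speed-up means in practice.
# Step counts and per-step cost are assumptions, not Tencent benchmarks.
base_steps = 50         # sampling steps assumed for the full model
distilled_steps = 12    # step-distilled models need far fewer steps (assumed)
seconds_per_step = 6.0  # hypothetical per-step cost on a single GPU

base_time = base_steps * seconds_per_step            # 300 s
distilled_time = distilled_steps * seconds_per_step  # 72 s
saving = 1 - distilled_time / base_time              # 0.76

print(f"full model:      {base_time / 60:.1f} min")
print(f"distilled model: {distilled_time / 60:.1f} min")
print(f"time saved:      {saving:.0%}")  # ~76%, in line with the quoted 75%
```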
On the image front, competition is heating up. Alibaba's Z-Image remains a top open-source image generator, but Meituan's LongCat Image, a lightweight 6-billion-parameter model, shows promise for poster and photo generation and ships with an accompanying image editor, though initial tests reveal some quality issues. Alibaba also quietly released Ovis Image, a 7-billion-parameter text-to-image model that excels at rendering text within images. Meanwhile, ByteDance's Seedream 4.5 continues to impress with highly realistic and artistic image generation and editing, though it remains proprietary. Meta introduced Tuna, a unified model that can generate and edit text, images, and videos, serving as a versatile multimodal AI, though its video quality is still basic.
In advanced AI model news, Google launched Gemini 3 Deep Think, a version of Gemini 3 optimized for complex multi-step reasoning in math, science, and coding. It achieves gold-medal-level performance in international competitions but requires a high-tier subscription due to its heavy compute demands. French startup Mistral released Mistral 3, a family of open-source models ranging from 3 to 14 billion parameters that fit on consumer GPUs, though the largest model lags behind the top competitors. DeepSeek returned with version 3.2, an open-source model rivaling closed-source giants like Gemini 3 Pro and GPT-5: it achieves gold-medal results in the toughest math and coding contests and offers a cost-effective API for enterprises.
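For reference, DeepSeek's API is OpenAI-compatible, so calling it from Python is essentially a drop-in change of base URL. The sketch below assumes the `deepseek-reasoner` identifier still routes to the latest reasoning model; verify the current name for version 3.2 in DeepSeek's documentation.

```python
# Minimal sketch: calling DeepSeek through its OpenAI-compatible API.
# "deepseek-reasoner" is assumed to route to the newest reasoning model;
# confirm the current identifier for 3.2 in DeepSeek's docs.
from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="https://api.deepseek.com",
    api_key="YOUR_DEEPSEEK_API_KEY",  # placeholder
)

response = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[
        {"role": "user", "content": "Prove that the square root of 2 is irrational."},
    ],
)
print(response.choices[0].message.content)
```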
Finally, exciting developments include EngineAI's T800 humanoid robot, which demonstrates rapid, agile combat moves far beyond previous robots, and Alibaba's Live Avatar, a real-time video generator capable of producing arbitrarily long videos with synchronized audio and nuanced emotions, though it currently requires multiple high-end GPUs. Other notable tools include Poster Copilot, an AI agent for professional poster design with multi-round refinement and aspect-ratio conversion, and Lotus 2, a state-of-the-art depth and normal estimator for images that captures fine details better than its competitors. Overall, this week showcases remarkable progress across AI modalities, pushing the boundaries of what's possible in real-time generation, multimodal understanding, and practical applications.