This week’s major AI releases included ByteDance’s Bernini video editor, Google’s Magenta Realtime 2 music generator, Nvidia’s Cosmos 3 robotics model, and Google’s open-source Gemini 4 multimodal model, all aimed at creative, real-time, and multimodal applications. New hardware, humanoid robots, a quantum chip, and new text-to-speech and image generation models rounded out the week.
The week’s releases spanned video, 3D reconstruction, music, image generation, hardware, and robotics. ByteDance introduced Bernini, an open-source, flexible video editor akin to Gemini Omni that edits videos using text, image, or video references. Users can add or remove objects, change perspectives and backgrounds, and apply artistic styles, and the model accepts multiple reference inputs to keep complex generations consistent. Nvidia unveiled Deja View, a compact 3D reconstruction model that rebuilds scenes from multiple images with high performance at low computational cost. Separately, Pager, a Google-Meta collaboration, reconstructs 360° panoramic geometry and excels at depth and surface normal estimation.
Among creative AI tools, Google released Magenta Realtime 2, a low-latency, open-source music generator that responds live to MIDI, audio, and text inputs, making AI music generation interactive and playable without a GPU. OpenAI enhanced ChatGPT with a “dreaming” memory system that synthesizes context from past conversations to give more personalized and temporally accurate responses. Meanwhile, BYU launched Nava, a video generation model with native audio synchronization, and Alibaba introduced Stream Career, a real-time video generator that can stream and edit avatars with precise motion control.
Several new image generation models also made waves. Reeve 2 and Audiogram 4 stand out for their layout control: users can define bounding boxes and layers for precise composition and editing. Reeve 2 is a paid, closed-source model that excels at poster and infographic generation, while Audiogram 4 is open-source but heavily censored, which limits its use for certain content. Separately, Google’s new open-source Gemini 4 12B model uses a unified, encoder-free multimodal architecture that processes text, images, and audio directly, enabling efficient offline use on consumer hardware with strong reasoning and agentic capabilities.
On the hardware and robotics front, Nvidia announced Cosmos 3, an open-source foundation model for physical AI applications such as robotics and autonomous vehicles that can understand and predict real-world interactions from multimodal inputs. Nvidia also revealed RTX Spark, a new chip for laptops and small desktops that runs large AI models locally with up to 128 GB of unified memory. In humanoid robotics, Deep Robotics showcased the DR2, a robot designed for harsh environments, while UBTech teased a highly realistic full-body humanoid companion robot. In quantum computing, Microsoft announced Majorana 2, a quantum chip developed with AI assistance, with the company promising a scalable quantum computer by 2029.
Finally, other notable releases include MiniMax’s upcoming open-source M3 model, which has a very large context window and multimodal capabilities; Nvidia’s Nemotron 3 Ultra, a 550-billion-parameter open-source model optimized for agentic workflows; and new text-to-speech models such as ByteDance’s Waveet TTS and Higgs Audio V3, which offer high-quality, controllable voice synthesis. Stability AI teased Stable Layers, a tool that uses reinforcement learning to convert images into editable transparent layers. Microsoft also introduced new AI models for thinking and image generation. Overall, the week shows how quickly AI is advancing across video, audio, robotics, and foundation models, with many of these tools already available for public use.