Realtime AI videos, Grok 4, open-source robots, add sound to video, 3D videos, 4K upscaler

This week in AI saw major advances, including Omni V for real-time customizable video generation, ThinkSound for adding realistic audio to silent videos, and new robots such as the AGIBOT X2N humanoid and the open-source Reachy Mini. Alongside these, the Grok 4 multimodal model, Meta's StreamDiT real-time video synthesis, and 4KAgent's image upscaling highlight rapid progress across AI video, audio, robotics, and image processing.

This week in AI has been packed with groundbreaking developments across several domains. One standout is Omni V, an AI video generator that creates customized videos in real time by altering subjects, scenes, or actions based on input images and prompts. It supports micro-edits, combining multiple reference images, and even controlling camera trajectories, although the output video still shows some distortions. Meanwhile, ThinkSound is an AI model that adds highly synchronized, realistic sound effects to silent videos, outperforming competitors such as MMAudio in both quality and synchronization. It is available as a free Hugging Face Space and as a GitHub repo for local use.
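For readers who want to try ThinkSound without installing anything locally, a hosted Hugging Face Space can usually be driven programmatically with the gradio_client library. The sketch below is illustrative only: the Space id, endpoint name, and parameter order are assumptions, so check the Space's API tab for the real signature.

```python
# Minimal sketch: calling a ThinkSound-style Hugging Face Space from Python.
# The Space id, endpoint name, and parameters below are assumptions for
# illustration -- consult the actual Space's API tab for the real interface.
from gradio_client import Client, handle_file

client = Client("FunAudioLLM/ThinkSound")  # assumed Space id

result = client.predict(
    handle_file("silent_clip.mp4"),      # local silent video to upload
    "footsteps on gravel, light rain",   # optional text hint for the desired sound
    api_name="/predict",                 # assumed endpoint name
)
print("Result saved to:", result)
```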

Another exciting innovation is 4DSloMo, which turns fast-moving scenes captured from multiple asynchronous camera angles into smooth, high-frame-rate 3D slow-motion videos, letting viewers explore a scene from different perspectives without specialized high-speed cameras. The code and data have not been released yet, but an open-source release has been promised. Additionally, JarvisArt is a free AI agent that automates photo retouching by controlling Adobe Lightroom, letting users enhance images through natural-language prompts and targeted edits and saving significant manual effort.

In robotics, the AGIBOT X2N humanoid robot showcases dual-mode locomotion, switching seamlessly between walking and rolling to navigate diverse terrain while carrying heavy loads. Hugging Face also introduced Reachy Mini, an affordable open-source desktop robot programmable in Python that can recognize faces, move, and interact by drawing on Hugging Face's vast model repository. It is well suited to learning AI-driven robotics, though hardware constraints limit it to running smaller models.
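Part of Reachy Mini's appeal is that the perception side can be borrowed straight from the Hugging Face Hub while a few lines of Python drive the hardware. The sketch below uses a real, generic DETR person detector from the transformers library as a stand-in for the robot's face recognition; the reachy_mini module and its camera/head calls are hypothetical placeholders, since the actual SDK defines its own API.

```python
# Minimal sketch of the kind of loop Reachy Mini invites: pull a small vision
# model from the Hugging Face Hub and react to what the camera sees.
# The robot-side calls are hypothetical stand-ins, not the real SDK.
from transformers import pipeline
from PIL import Image

# Generic person detector used here for illustration (not the bundled face recognizer).
detector = pipeline("object-detection", model="facebook/detr-resnet-50")

def find_person(frame: Image.Image):
    """Return the highest-confidence 'person' detection, or None."""
    hits = [d for d in detector(frame) if d["label"] == "person" and d["score"] > 0.8]
    return max(hits, key=lambda d: d["score"]) if hits else None

# Hypothetical robot-side code, shown only for shape:
# robot = reachy_mini.connect()
# frame = robot.camera.capture()          # grab a frame as a PIL image
# person = find_person(frame)
# if person:
#     robot.head.look_at(person["box"])   # turn toward the detected person
```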

On the AI model front, xAI released Grok 4, a state-of-the-art multimodal model that excels at advanced reasoning, math, and obscure-knowledge benchmarks. Grok 4 supports a 256K-token context window and offers variants such as Grok 4 Heavy, which uses multiple AI agents to tackle complex tasks. It outperforms leading models such as Claude 4 Opus and Gemini 2.5 Pro across a range of tests, including the challenging ARC-AGI benchmark. Access to Grok 4 is paid, with pricing tiers based on usage.
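Since access is API-based and paid, here is a minimal sketch of how one might call Grok 4, assuming xAI's OpenAI-compatible endpoint; the base URL and model id are assumptions and should be verified against xAI's current documentation.

```python
# Minimal sketch of calling Grok 4 over xAI's API, which is advertised as
# OpenAI-compatible. Base URL and model id are assumptions -- verify them
# against xAI's docs before relying on this.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["XAI_API_KEY"],   # paid xAI key
    base_url="https://api.x.ai/v1",      # assumed OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="grok-4",                      # assumed model id
    messages=[
        {"role": "system", "content": "You are a careful mathematical reasoner."},
        {"role": "user", "content": "Prove that the sum of two odd integers is even."},
    ],
)
print(response.choices[0].message.content)
```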

Finally, Meta introduced StreamDiT, a real-time video generation model capable of producing minute-long videos at 16 frames per second on high-end GPUs, with the ability to modify scenes on the fly during generation. Although its quality currently trails some proprietary models and the code is unreleased, it marks a significant step toward real-time video synthesis. Complementing this, 4KAgent offers state-of-the-art deblurring and upscaling for images, enhancing detail across many kinds of visuals, including satellite and microscope imagery. Its code is also pending release, but together these tools highlight the rapid advances and expanding capabilities in AI video, audio, robotics, and image processing.