Self-evolving AI, robot fights, new GPT voice, new local image model, Gemma upgrade: AI NEWS

This week’s AI news highlights RecGen’s 3D object reconstruction from a few images, Vivago AI’s high-resolution Hydream 01 Image model, and Google’s faster Gemma 4 language model. In robotics, the highlights are Allen AI’s MolmoAct 2 and Genesis AI’s Gene 26.5 model with its dexterous robot hand. Other developments cover real-time voice models from OpenAI, more efficient video and image generation, AI tools for scientific research, and humanoid robot demonstrations.

This week in AI news brought advances across many domains. One highlight is RecGen, a model that reconstructs 3D objects from just a few RGB-D images, even when the objects are partially occluded. RecGen was trained on a large synthetic dataset. It holds up in cluttered real-world scenes and outperforms competing methods in pose estimation and shape generation. Meanwhile, Vivago AI’s new open-source image model, Hydream 01 Image, generates 2K images and is especially strong at text rendering and complex infographics. It does this without relying on a traditional VAE encoder, and it ranks among the strongest open-source image generators available.

In video generation, Uni Vid X can both estimate and generate intrinsic video properties such as albedo, lighting, and surface normals. This enables edits like relighting a scene or replacing its background. Google also added multi-token prediction to its Gemma 4 language model. It speeds up generation by predicting several tokens at once without reducing output quality. The gain comes from easing a memory bottleneck. Ordinary decoding must read the model’s weights from memory for every single token, so producing several tokens per pass gets more work out of each read. That helps the model run faster on consumer hardware. Separately, the Program Bench benchmark showed that current AI models still struggle to reverse engineer entire software programs from their executables. This is a reminder that real-world software development is much harder than isolated coding tasks.
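The roundup does not say how Gemma 4 implements multi-token prediction. To illustrate why emitting several tokens per step helps, here is a minimal Python sketch of one common pattern, draft-and-verify decoding, built on a hypothetical toy model. `target_next` stands in for the full model and `draft_k` stands in for extra prediction heads; both are invented for this example.

Each round drafts k tokens and keeps them up to the first wrong guess. It then replaces that guess with the full model’s token, so the output always matches ordinary greedy decoding. The pass counter assumes that all drafted positions are verified in a single batched forward pass.

```python
from typing import List, Tuple

VOCAB = 50  # size of the toy vocabulary (hypothetical)


def target_next(context: List[int]) -> int:
    """Hypothetical stand-in for the full model: greedy next token for a context."""
    return (sum(context) * 7 + len(context)) % VOCAB


def draft_k(context: List[int], k: int) -> List[int]:
    """Hypothetical extra prediction heads: guess the next k tokens.

    A real model would produce these guesses from one forward pass; this toy
    just loops. It follows the target rule but deliberately gets the token at
    every sequence index of the form 4n + 3 wrong, to mimic imperfect heads.
    """
    ctx = list(context)
    guesses = []
    for _ in range(k):
        tok = target_next(ctx)
        if (len(ctx) + 1) % 4 == 0:
            tok = (tok + 1) % VOCAB  # simulated draft error
        guesses.append(tok)
        ctx.append(tok)
    return guesses


def generate_greedy(prompt: List[int], n_new: int) -> Tuple[List[int], int]:
    """Standard decoding: one forward pass per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq[len(prompt):], n_new


def generate_mtp(prompt: List[int], n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Draft k tokens per round, then keep them up to the first wrong guess."""
    seq = list(prompt)
    passes = 0
    while len(seq) - len(prompt) < n_new:
        guesses = draft_k(seq, k)
        passes += 1  # assumption: all k positions are checked in one batched pass
        for guess in guesses:
            expected = target_next(seq)
            seq.append(expected)  # always keep the full model's token
            if guess != expected or len(seq) - len(prompt) == n_new:
                break  # first wrong guess (or enough tokens): discard the rest
    return seq[len(prompt):], passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    base, base_passes = generate_greedy(prompt, 20)
    fast, fast_passes = generate_mtp(prompt, 20, k=4)
    print(base == fast, base_passes, fast_passes)  # True 20 6
```

Here the drafts are right three times out of four, so the sketch needs 6 passes instead of 20 for the same 20 tokens. Real speedups depend on how often the extra heads guess correctly and on how memory-bound the hardware is.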

Robotics saw notable progress as well. Allen AI’s MolmoAct 2 is an open-source foundation model for real-world bimanual (two-arm) manipulation. It is significantly faster and more capable than its predecessor. Genesis AI introduced Gene 26.5, a foundation model paired with a dexterous robotic hand. The system handles complex tasks such as cooking, lab experiments, and even playing the piano, which brings robots closer to human-level manipulation. Boston Dynamics showed its Atlas robot performing moves that exceed the range of motion of human joints. Meanwhile, Unitree’s G1 and EngineAI’s humanoids staged fight sequences, hinting at future robot combat tournaments.

In scientific research and 3D asset creation, Lab OS is an AI co-scientist that connects to physical labs through XR smart glasses. This lets the AI observe and guide real-world experiments in real time. Fizz Forge introduced a system for generating 3D assets that are visually accurate and also physically grounded, with realistic joints and interaction logic. That makes the assets more useful for robotics and simulation. Microsoft unveiled Map to World, which generates explorable 3D worlds from simple segmentation maps. Users can define environments ranging from seasonal villages to futuristic cities, and the code is expected to be released soon.

Other notable developments include OpenAI’s new real-time voice models for conversation, translation, and transcription in multiple languages. Zyphra released ZAYA1-8B, a compact reasoning model trained on AMD hardware that rivals much larger models. Nvidia teased D-Rex, a system for creating photorealistic, relightable digital human avatars. The Japanese AI lab Sakana AI and Nvidia collaborated on a sparse computation method that speeds up large language models by more than 30% in both inference and training. Finally, Swift I2V demonstrated efficient high-resolution video generation from a single image. Alibaba introduced CDM, a diffusion-model acceleration technique that produces high-quality images in just four sampling steps, which significantly reduces generation time.