Insane 3D model generator, emotional TTS, AI eraser, 3D upscaler, Qwen3 beats all, 4D videos

This week in AI saw remarkable advancements, including Microsoft's DAViD for detailed 3D human modeling, Alibaba's open-source 3D world generator Yume and top-performing Qwen3 language models, and tools like ObjectClear for seamless object removal and Higgs Audio V2 for emotional voice cloning. Other breakthroughs included the efficient Hierarchical Reasoning Model (HRM), the affordable Unitree R1 humanoid robot, and new 4D video generation and 3D model upscaling technologies, highlighting rapid progress across AI domains.

This week in AI has been packed with groundbreaking advancements across multiple domains. Microsoft introduced DAViD, a powerful AI model that predicts detailed 3D information from images and videos of humans, including depth, surface normals, and precise segmentation, even capturing fine details like hair strands and wrinkles. The model excels at video segmentation and depth estimation, outperforming traditional tools like Photoshop's selection features, and its code and datasets are open-source for local use. Complementing this, a new video object tracker called SeC demonstrates an exceptional ability to track and segment specific objects in complex, high-action videos from simple text prompts, outperforming previous models such as SAM 2 in accuracy and consistency.

Another impressive tool is ObjectClear, an AI that erases objects from images and videos along with their shadows and reflections, filling in backgrounds seamlessly. It turns what would traditionally be a time-consuming Photoshop job into a task of seconds, with a simple interface that lets users select objects by brush stroke or click. In robotics, Unitree unveiled the R1, an affordable and highly agile humanoid robot capable of acrobatics, running, and interaction through integrated multimodal AI, signaling a future where humanoid robots may become commonplace in everyday life.

In the realm of 3D and video generation, Alibaba's open-source project Yume creates interactive 3D worlds from images, letting users navigate scenes dynamically with keyboard inputs. It builds on Alibaba's leading video generation technology and is fully open-source, unlike similar projects that have yet to release code. Additionally, Alibaba's Qwen3 has emerged as the top-performing non-thinking large language model, excelling on reasoning, science, math, and coding benchmarks, and it is freely available with a highly cost-efficient API. Its coding-focused counterpart, Qwen3-Coder, also outperforms many proprietary models, generating complex interactive web applications and simulations from simple prompts.
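Since Qwen3 is typically served through OpenAI-compatible endpoints, trying its API is mostly a matter of assembling a standard chat-completions request. A minimal sketch in Python, where the endpoint URL, API key, and model id are placeholder assumptions to be replaced with the values from your provider's documentation:

```python
import json

# Placeholder values (assumptions) -- substitute your provider's
# actual endpoint, key, and Qwen3 model identifier.
API_URL = "https://example.com/v1/chat/completions"
API_KEY = "YOUR_KEY"
MODEL = "qwen3"

def build_request(prompt: str) -> dict:
    """Assemble a standard OpenAI-style chat-completions payload."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }

# The payload would be POSTed to API_URL with an Authorization header;
# here we only serialize it to show its shape.
payload = json.dumps(build_request("Summarize this week's AI news in one sentence."))
print(payload[:30])
```

Because the request format follows the widely adopted chat-completions convention, the same payload works with most hosted Qwen3 providers by changing only the URL and model id.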

A major breakthrough in AI architecture was presented in the Hierarchical Reasoning Model (HRM), which mimics how the human brain processes information by handling reasoning at multiple layers and speeds. Despite having only 27 million parameters, far smaller than leading trillion-parameter models, HRM outperforms them on complex puzzles such as Sudoku and maze navigation while requiring less compute and training data. The model could represent a significant step toward more efficient, biologically inspired AI reasoning, and its code is open-source for experimentation.
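The core idea, two recurrent modules running at different timescales, can be illustrated with a toy loop: a slow high-level "planner" state updates once per cycle, while a fast low-level "worker" state updates several times per cycle conditioned on the current plan. The arithmetic update rules below are arbitrary stand-ins for HRM's learned recurrent networks, used only to show the control flow:

```python
T = 4          # fast (low-level) steps per slow step
N_CYCLES = 3   # number of slow (high-level) updates

def slow_update(plan: int, worker: int) -> int:
    # High-level module: refine the plan from the worker's final state.
    return plan + worker

def fast_update(worker: int, plan: int, x: int) -> int:
    # Low-level module: detailed computation conditioned on the plan.
    return (worker + plan + x) % 97

def hierarchical_reason(x: int) -> int:
    plan, worker = 0, 0
    for _ in range(N_CYCLES):
        for _ in range(T):                 # fast inner loop
            worker = fast_update(worker, plan, x)
        plan = slow_update(plan, worker)   # slow outer update
    return plan

print(hierarchical_reason(5))  # prints 64
```

The nesting is the point: the outer state evolves only after the inner state has run for several steps, giving deliberate, multi-speed computation without a deep feedforward stack.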

Finally, text-to-speech and 4D video generation saw notable advances. Higgs Audio V2 offers open-source, multi-speaker, emotional voice cloning with quality rivaling commercial solutions. Meanwhile, Diffuman4D generates high-quality 4D videos from just a few sparse-view videos, enabling detailed 3D video reconstructions that outperform competitors requiring far more input data. Additionally, DesignLab automatically refines slide presentations, Ultra3D produces highly detailed 3D models from images, and Elevate3D provides the first AI-powered 3D model upscaler, refining both textures and geometry. Together, these innovations showcase the rapid, diverse progress in AI this week.