This week’s AI highlights include VoxHammer’s precise 3D model editing, CoMPaSS and USO’s advances in image generation and style transfer, and Microsoft’s VibeVoice, which offers expressive, multilingual text-to-speech. ByteDance also introduced Waver 1.0 for cinematic video generation and OmniHuman-1.5 for natural lip-sync videos, while breakthroughs such as GPT-5’s gaming prowess, MiniCPM-V 4.5’s vision understanding, and OpenAI’s gpt-realtime voice model demonstrate rapid progress across AI domains.
This week in AI has been packed with groundbreaking developments. One standout is VoxHammer, a 3D model editor that lets users make precise edits to specific parts of a 3D object using text prompts or reference images while preserving the rest of the model. The tool segments models into meaningful parts and applies changes only where specified, making 3D editing more intuitive and precise. The code and models are already available on GitHub, though running them requires a powerful NVIDIA GPU with substantial VRAM.
In image generation, CoMPaSS improves spatial-arrangement accuracy by fine-tuning existing models such as Flux and Stable Diffusion so they better understand object placement in images. ByteDance introduced USO, a free and open-source character and style transfer tool that outperforms competitors at generating diverse character poses and styles, including anime and hybrid styles. USO also handles deepfake-style lip-sync animation well, and it offers both an online demo and local installation, making it accessible for creative projects.
Microsoft unveiled VibeVoice, a highly capable text-to-speech generator that handles multiple speakers, long-form audio of up to 90 minutes, and expressive emotion inferred automatically from the transcript. It supports spontaneous singing and mid-speech language switching with accents, making it well suited to podcasts, audiobooks, and educational content. The tool is available online in several model sizes optimized for different hardware, and it can even weave in background music for richer audio.
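Since VibeVoice infers speakers and emotion from a labeled transcript, the input for a multi-speaker episode boils down to a plain text script. The sketch below shows one plausible way to assemble such a script in Python; the `Speaker N:` line format and the `build_transcript` helper are assumptions for illustration, not the official VibeVoice API.

```python
# Hypothetical sketch: assembling a multi-speaker transcript for a
# VibeVoice-style TTS model. The "Speaker N:" labeling convention and
# this helper are assumptions, not the tool's documented interface.

def build_transcript(turns):
    """Flatten (speaker_id, text) pairs into labeled transcript lines."""
    lines = []
    for speaker, text in turns:
        lines.append(f"Speaker {speaker}: {text.strip()}")
    return "\n".join(lines)

podcast = build_transcript([
    (1, "Welcome back to the show."),
    (2, "Thanks! Today we cover this week's AI news."),
    (1, "Let's start with 3D editing."),
])
print(podcast)
```

The resulting script would then be handed to the model, which assigns a distinct voice to each labeled speaker.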
In video generation, ByteDance released Waver 1.0, a text-to-video and image-to-video model that produces coherent, cinematic clips with realistic physics and camera movement. It ranks highly on independent leaderboards and is free to use via Discord, though open-source availability is uncertain. ByteDance’s OmniHuman-1.5, a new competitor to Google’s Veo 3, creates highly natural lip-sync videos from an image and audio, with optional text prompts for scene control, and supports multiple characters and expressive animation.
Other notable advances include GPT-5 setting a new record in Pokémon Crystal with exceptional planning and efficiency, MiniCPM-V 4.5 outperforming larger proprietary multimodal models at image and video understanding, and Hunyuan Video Foley generating superior sound effects synchronized to video content. Alibaba’s Wan2.2-S2V is a powerful video-from-image-and-audio generator with realistic lip-sync and expressive movement. Finally, OpenAI’s gpt-realtime voice model delivers low-latency, natural, expressive speech for real-time applications such as customer support, rounding out a week of rapid, wide-ranging AI progress.
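For developers, the low-latency voice model is driven over a WebSocket connection using JSON events. The sketch below builds the two client events a minimal session might send; the event names follow OpenAI’s published Realtime API, but exact field names and the voice name are assumptions that may vary between API versions, so treat this as a sketch rather than a definitive client.

```python
import json

# Hedged sketch: JSON events a client might send over the OpenAI
# Realtime API WebSocket to configure a gpt-realtime voice session.
# Field names and the "alloy" voice are assumptions for illustration.

session_update = {
    "type": "session.update",
    "session": {
        "voice": "alloy",
        "instructions": "You are a concise, friendly support agent.",
        "modalities": ["audio", "text"],
    },
}

# After streaming user audio, the client asks the model to respond.
response_create = {"type": "response.create"}

# In a real client, each event is sent as a JSON text frame:
frame = json.dumps(session_update)
print(frame)
```

A production client would open the WebSocket, send `session_update` once, then stream microphone audio and emit `response_create` to trigger each spoken reply.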