Infinite 3D worlds, long AI videos, realtime images, game agents, character swap, RIP Udio - AI NEWS

This week in AI saw major releases including Meituan’s LongCat-Video for extended video generation, Emu 3.5’s multimodal image and language capabilities, and WorldGrow’s infinite 3D world creation, alongside efficient models like Kimi Linear and new robotics and game-playing AI agents. The AI music scene also shifted with Udio’s restrictive licensing deal, MiniMax’s Music 2.0, and Stability AI’s Foley Control, reflecting rapid progress across video, image, audio, and robotic AI technologies.

This week in AI was packed with releases across video, image, 3D, audio, and robotics. Meituan, a Chinese food delivery company, released LongCat-Video, an open-source video generation model capable of producing videos up to five minutes long with impressive physics and anatomical accuracy. It supports text-to-video, image-to-video, and video continuation, so a clip can be extended without noticeable transitions. The model is relatively small at 13.6 billion parameters and generates 720p video at 30 frames per second. The code and model weights are available on Hugging Face for local use.
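To put the quoted output figures in perspective, the short calculation below works out how many frames a five-minute clip contains at 30 frames per second, and how large those frames would be uncompressed. It is a back-of-the-envelope sketch only: the 1280x720 resolution and 8-bit RGB format are assumptions about what "720p" means here, not details taken from the model's documentation.

```python
# Rough scale of a five-minute, 30 fps, 720p clip.
# Assumptions (not from the model's documentation): 720p = 1280x720, 8-bit RGB.
FPS = 30
DURATION_SECONDS = 5 * 60
WIDTH, HEIGHT = 1280, 720
BYTES_PER_PIXEL = 3  # one byte each for R, G, B

total_frames = FPS * DURATION_SECONDS
pixels_per_frame = WIDTH * HEIGHT
raw_bytes = total_frames * pixels_per_frame * BYTES_PER_PIXEL

print(f"frames: {total_frames}")
print(f"uncompressed size: {raw_bytes / 1e9:.1f} GB")
```

The point is simply that a five-minute clip is thousands of frames that all have to stay consistent with one another, which is why long-form generation is harder than producing a few seconds of video.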

Another notable release is Emu 3.5, an open-source multimodal AI model that combines language understanding with image generation and editing. It can provide step-by-step instructions with accompanying images, solve visual puzzles, and perform complex image edits such as clothes swapping and perspective changes. Emu 3.5 reportedly outperforms other leading image models in benchmarks and is available for download with detailed setup instructions. Separately, WorldGrow introduces a new approach to generating infinite 3D worlds from Lego-like building blocks, keeping the structure and geometry coherent as the world expands. The authors plan to release the code and pre-trained models publicly.

On the efficiency side, Moonshot AI unveiled Kimi Linear, a hybrid linear attention transformer designed to handle extremely long context sequences efficiently. It reduces key-value (KV) cache memory usage by up to 75% and decodes up to six times faster than traditional full-attention models, while supporting context lengths of up to one million tokens. This matters for processing large documents or codebases, and the model is open-sourced for community use. Nvidia also released ChronoEdit, an image editor that treats an edit as a short video so that changes are applied gradually, and Google introduced Pomelli, a free AI tool that automates marketing creative design by extracting branding elements from a website and generating campaign materials from them.
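The 75% memory figure quoted for Moonshot AI's model is easier to picture with a little arithmetic. In a standard transformer, every layer stores keys and values for every token, so the cache grows linearly with context length. The sketch below assumes, purely for illustration, that a hybrid design keeps that growing cache in only one out of every four layers and that the remaining linear-attention layers hold a fixed-size state small enough to ignore. The layer count, head count, head size, and the one-in-four ratio are all hypothetical values, not the model's published configuration.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Size of a standard KV cache: keys and values for every token in every layer."""
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_value

# Hypothetical configuration, chosen only to make the arithmetic concrete.
SEQ_LEN = 1_000_000          # one million tokens of context
N_LAYERS = 32
N_KV_HEADS = 8
HEAD_DIM = 128
FULL_ATTENTION_LAYERS = N_LAYERS // 4  # assumption: 1 in 4 layers keeps a full KV cache

full = kv_cache_bytes(SEQ_LEN, N_LAYERS, N_KV_HEADS, HEAD_DIM)
# Linear-attention layers keep a constant-size state that does not grow with
# sequence length, so it is ignored in this rough estimate.
hybrid = kv_cache_bytes(SEQ_LEN, FULL_ATTENTION_LAYERS, N_KV_HEADS, HEAD_DIM)

reduction = 1 - hybrid / full
print(f"full attention: {full / 1e9:.1f} GB")
print(f"hybrid:         {hybrid / 1e9:.1f} GB")
print(f"reduction:      {reduction:.0%}")
```

Under these assumptions the cache shrinks by three quarters, which matches the scale of the quoted saving. The real figure depends on the actual architecture.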

On the robotics front, 1X Technologies announced the NEO humanoid robot for home use, though current demonstrations rely heavily on human teleoperation rather than true autonomy. In contrast, Unitree’s G1 robot showed impressive whole-body coordination by pulling a 1,400 kg car. Leju Robotics introduced the modular Kuavo 5 robot, adaptable for industrial and home use and integrated with Huawei’s Pangu AI model for perception and control. Meanwhile, ByteDance launched Game-TARS, an AI agent built to play video games autonomously in real time by processing visual input and controlling the keyboard and mouse, with potential applications in real-world robotics.

Finally, the AI music generation landscape saw significant shifts. Udio, a leading AI music generator, entered a restrictive licensing partnership with Universal Music Group and halted downloads, prompting user backlash. In contrast, MiniMax released Music 2.0, a text-to-music generator with realistic vocals and instruments that lets users create songs with or without lyrics. Stability AI introduced Foley Control, a tool that generates synchronized audio for silent videos, which is especially useful for AI-generated video. These developments, alongside advances in image editing, video character swapping, and 3D scene reconstruction, show how quickly and broadly AI moved this week.