MikeBot3000: Can We Build an AI Mike from Open Source Tools? - Computerphile

The video demonstrates how open-source AI tools can be used to create a convincing AI-generated video of Mike, highlighting the technical process involving image generation, speech synthesis, and lip-syncing. It emphasizes the rapid advancement and accessibility of deepfake technology, raising ethical concerns about its potential misuse and the need for caution.

The video features Mike Pound and Lewis exploring the possibility of creating an AI-generated version of Mike using only open-source tools running on a desktop computer. Mike humorously expresses his desire to automate video production so he can take naps, and the duo sets out to see how convincingly they can generate a short AI video of Mike explaining a topic. They emphasize that this experiment is both a fun challenge and a commentary on how rapidly AI technology is advancing, raising concerns about the potential misuse of deepfake and synthetic media.

Lewis explains the technical pipeline behind the AI Mike, dubbed MikeBot3000. Starting from just 20 images of Mike and a small amount of his audio, he fine-tunes various open-source models to generate video content. A large language model (LLM) scripts each scene; those scripts become text prompts for Flux, an open-source diffusion-based image generator. The resulting stills are turned into short clips by diffusion video models, and lip-syncing matches Mike's mouth movements to the generated speech, producing a convincing talking head.
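As a rough illustration of the image-generation step, a minimal sketch using the Hugging Face diffusers library might look like the following. The model checkpoint, prompt, and sampling parameters are illustrative assumptions, not the exact setup from the video (which may well have used a different front end):

```python
# Minimal sketch of text-to-image generation with Flux via the
# Hugging Face diffusers library. Model ID, prompt, and parameters
# are illustrative assumptions, not the video's exact configuration.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",   # open-weights Flux checkpoint (assumed)
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

# In the pipeline described in the video, this prompt would come from
# an LLM-written scene script rather than being typed by hand.
prompt = "a man sitting at a desk in an office, explaining a diagram on paper"

image = pipe(
    prompt,
    num_inference_steps=28,   # denoising steps for the diffusion process
    guidance_scale=3.5,       # how strongly the image follows the prompt
).images[0]
image.save("scene_01.png")
```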

The team discusses the tools in detail: Flux for image generation, which uses diffusion to produce high-quality images from text prompts, and a LoRA (low-rank adaptation) fine-tuned to teach the model a specific concept, in this case Mike's appearance. They also describe how video diffusion models generate short clips by extending the diffusion process into a third dimension, frames over time. For speech, they use a diffusion-based text-to-speech model trained on an hour of Mike's voice, which, combined with lip-syncing, produces speech that sounds like Mike, though it misses some of his distinctive vocal nuances.
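The "third dimension" point can be made concrete with latent tensor shapes: where an image diffusion model denoises a single 2-D latent per sample, a video diffusion model denoises a whole stack of frame latents jointly, so the extra axis is time. The shapes below are illustrative, not those of any specific model:

```python
import torch

# Image diffusion: one latent per sample, denoised over (height, width).
# Shape: (batch, channels, height, width) -- sizes here are illustrative.
image_latent = torch.randn(1, 16, 128, 128)

# Video diffusion adds a frame axis and denoises all frames together,
# which is what keeps motion coherent from one frame to the next.
# Shape: (batch, channels, frames, height, width).
video_latent = torch.randn(1, 16, 49, 60, 104)

# Each denoising step predicts noise for the whole 5-D tensor at once,
# so the model can share information across time as well as across space.
print(image_latent.ndim, video_latent.ndim)  # 4 5
```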

Throughout the video, they demonstrate how adjusting the influence of the custom Mike LoRA (low-rank adaptation) affects the generated images, showing how small changes to a single scale parameter yield more or less recognizable representations of Mike. They also compare the open-source approach with proprietary tools like Hedra, which can produce more realistic videos from a single image and an audio clip, underscoring how accessible and powerful high-quality deepfake technology has become, and the ethical concerns that raises about potential misuse.
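Tuning the LoRA's influence typically comes down to a single scale factor applied when the adapter is merged into the base model. A hedged sketch using diffusers follows; the checkpoint path, adapter name, and scale values are hypothetical, and only the library calls themselves are standard API:

```python
# Sketch of loading a custom LoRA into a Flux pipeline and sweeping its
# strength. The LoRA path, adapter name, and scale values are hypothetical.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Load LoRA weights fine-tuned on a small set of photos of the target person.
pipe.load_lora_weights("path/to/mike_lora", adapter_name="mike")

prompt = "a man in an office explaining a diagram on paper"

# Sweep the adapter weight: too low and the face drifts generic,
# too high and artifacts or overfitting to the training photos appear.
for scale in (0.4, 0.8, 1.2):
    pipe.set_adapters(["mike"], adapter_weights=[scale])
    image = pipe(prompt, num_inference_steps=28).images[0]
    image.save(f"mike_scale_{scale}.png")
```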

In conclusion, the creators showcase several AI-generated videos of Mike explaining various topics, illustrating the impressive capabilities of current open-source AI tools. They warn that such technology, once limited to experts, is now accessible to anyone with minimal technical skill, making it possible to produce highly convincing fake videos of anyone saying anything. The discussion ends with a reflection on the societal implications, urging caution and consideration of how these tools could be used maliciously, as the line between real and synthetic media continues to blur.