The video demonstrates running the massive DeepSeek V4 Pro and Flash models at aggressive 2.8-bit quantization, using distributed computing across two Macs to fit them in memory. The heavily compressed models still generated runnable code such as Snake and Tetris, though they stumbled on more complex tasks. The video also explores collaborative AI model hosting as a way past single-machine hardware limits, highlighting both the promise and the challenges of extreme quantization and local deployment of very large models.
In this video, the presenter explores running the DeepSeek V4 Pro and Flash models at aggressive 2.8-bit quantization. The Flash model is small enough to fit within the memory constraints of a 128 GB machine, but DeepSeek V4 Pro, with its massive 1.6 trillion parameters, is too large for any single machine even at that compression, so the presenter uses distributed computing across two Macs, a Mac Studio and a MacBook Pro, to share the workload. The goal is to test whether these heavily compressed models can still produce coherent outputs, particularly runnable code for simple games such as Snake and Tetris.
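To make the scale concrete, here is a rough back-of-the-envelope calculation in Python. It only counts weight memory (KV cache and runtime overhead are ignored); the 1.6-trillion-parameter figure and the ~2.8 bits per weight come from the video:

```python
# Rough weight-memory estimate for a quantized model.
# Ignores KV cache, activations, and runtime overhead, so real usage will differ.

def quantized_weight_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB for a model with `params` parameters."""
    return params * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB

# DeepSeek V4 Pro as quoted in the video: ~1.6 trillion parameters.
print(f"{quantized_weight_gb(1.6e12, 2.8):.0f} GB at 2.8-bit")  # ~560 GB
print(f"{quantized_weight_gb(1.6e12, 16):.0f} GB at fp16")      # ~3200 GB
```

Even compressed to roughly 560 GB, the weights alone exceed what a single Mac can hold, which is in the same ballpark as the 456 GB + 70 GB split reported below.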
The Flash model at 2.8-bit quantization successfully generated a Snake game and a Tetris game that ran in the browser, using about 95 GB of memory and generating around 24 to 25 tokens per second. However, on a more complex task, a 3D version of Tetris, the generated code failed to run and produced only a black screen. This highlighted the limitations of extreme quantization for advanced coding tasks, though basic coding and creative writing still showed promise.
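For a sense of what those speeds mean in practice, here is a quick wall-clock estimate (the ~2,000-token output length for a simple browser game is an assumption, not a figure from the video):

```python
# Wall-clock estimate for generating a response at a given decoding speed.
def generation_minutes(output_tokens: int, tokens_per_second: float) -> float:
    return output_tokens / tokens_per_second / 60

# Assumption: a simple browser Snake game is on the order of 2,000 output tokens.
print(f"{generation_minutes(2000, 24.5):.1f} min")  # ~1.4 min at the quoted 24-25 tok/s
```

So the Flash setup turns around a small game in a minute or two, comfortably usable for interactive experimentation.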
For the DeepSeek V4 Pro model, running at 2.8-bit quantization on the distributed setup consumed around 456 GB of memory on the Mac Studio and 70 GB on the MacBook Pro, generating about 12 tokens per second. While the code output looked syntactically correct, logical bugs prevented the Snake game from running properly. Despite this, the model showed strong reasoning, working coherently through tricky prompts like the trolley problem, which suggests that much of its core capability survives even under heavy compression.
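One common way such a distributed setup divides work is pipeline parallelism: each machine hosts a contiguous slice of the model's layers, sized to the memory it contributes. The video does not specify the framework or split used, so the sketch below is illustrative only (the 61-layer count is a placeholder):

```python
# Hypothetical layer allocation for pipeline parallelism across pooled machines,
# proportional to the memory each node contributes.
def split_layers(total_layers: int, mem_gb_per_node: list[float]) -> list[int]:
    total_mem = sum(mem_gb_per_node)
    shares = [round(total_layers * m / total_mem) for m in mem_gb_per_node]
    shares[-1] = total_layers - sum(shares[:-1])  # absorb rounding error in the last node
    return shares

# e.g. an illustrative 61-layer model split across the Mac Studio (456 GB)
# and the MacBook Pro (70 GB)
print(split_layers(61, [456.0, 70.0]))  # -> [53, 8]
```

Under a split like this, every token's activations make a network hop between the machines on top of the larger model's compute, one reason the distributed Pro setup decodes more slowly than the single-machine Flash run.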
The presenter also discussed the potential for expanding this distributed computing approach into a larger network or “lobby” where multiple users could pool their computing resources to run massive AI models collaboratively. This concept raises privacy concerns since prompts would be shared across multiple machines, but possible solutions like encryption or different operational modes could mitigate these risks. The presenter invited viewers to express interest in such a system, which could help overcome hardware limitations as AI models continue to grow in size.
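As one concrete illustration of the encryption idea, prompt traffic between pooled machines could at least be protected in transit. The video does not name a scheme, so this minimal sketch uses the third-party cryptography package as an assumed choice; note that whichever node actually runs the layers still has to decrypt the prompt, which is presumably why other operational modes are also floated:

```python
# Minimal sketch: encrypting prompts in transit between pooled machines.
# Assumes the third-party `cryptography` package; node discovery, scheduling,
# and the actual inference step are out of scope.
from cryptography.fernet import Fernet

key = Fernet.generate_key()  # in practice, negotiated per session between peers
channel = Fernet(key)

ciphertext = channel.encrypt(b"Write a Snake game in JavaScript.")
# ...ciphertext travels over the network to the peer node...
prompt = channel.decrypt(ciphertext)
print(prompt.decode())
```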
Overall, the video showcased the challenges and possibilities of running state-of-the-art large language models locally with extreme quantization and distributed computing. While the 2.8-bit quantization approach is not yet ideal for complex coding tasks, it still produces coherent and creative outputs, making it a promising experimental step. The presenter encourages viewers to try the shared quantized models and looks forward to future improvements in hardware, quantization techniques, and collaborative computing to better handle these massive AI systems.