The video demonstrates how large language models (LLMs) can be prompted to “draw” images by filling in grid coordinates through numerous API calls, revealing varying abilities among models to generate recognizable shapes and maps. It highlights that larger, more advanced models produce clearer visualizations, while also discussing the high computational cost of this method and providing resources for others to experiment with the approach.
The video explores how large language models (LLMs) “see” the world by asking them to draw various objects on a grid. The creator wrote a script that makes an individual API call for each square on the grid, asking the model whether that particular coordinate should be filled or left empty so that the grid forms a shape such as a smiley face or a circle. Several models were tested, including GPT-5 Nano, Mini, and Medium, with GPT-5 Mini surprisingly performing best at producing recognizable shapes. Because the process issues hundreds of simultaneous API calls, which can be costly, the script shows a warning prompt before running.
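A minimal sketch of the per-cell querying idea, in Python (this is not the creator’s actual script; the prompt wording, model identifier, and worker count are assumptions):

```python
import concurrent.futures
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

GRID_W, GRID_H = 20, 20  # grid size used in the video's experiments
SHAPE = "a smiley face"

def query_cell(x: int, y: int) -> bool:
    """Ask the model whether a single grid cell should be filled."""
    prompt = (
        f"You are drawing {SHAPE} on a {GRID_W}x{GRID_H} grid. "
        f"Should the cell at column {x}, row {y} be filled? "
        "Answer with exactly one word: FILL or EMPTY."
    )
    resp = client.chat.completions.create(
        model="gpt-5-mini",  # assumed model identifier
        messages=[{"role": "user", "content": prompt}],
    )
    return "FILL" in resp.choices[0].message.content.upper()

# One API call per cell, issued concurrently (hundreds of requests per run).
with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
    futures = {(x, y): pool.submit(query_cell, x, y)
               for y in range(GRID_H) for x in range(GRID_W)}

# Render the answers as ASCII art, one row per grid line.
for y in range(GRID_H):
    print("".join("█" if futures[(x, y)].result() else "·" for x in range(GRID_W)))
```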
The video demonstrates the drawing results from various models, showing that some, like GPT-5 Mini, can produce fairly clear images, while others, such as Nano or certain open-source models, struggle to generate coherent shapes. The script supports switching between OpenAI and OpenRouter models, allowing users to experiment with different LLMs. The experiments use a 20x20 grid to balance cost and resolution, though it can be increased for more detailed images at a higher expense.
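Since OpenRouter exposes an OpenAI-compatible API, switching providers can be as simple as pointing the same client at a different base URL; a rough sketch (environment variable names and the model slug are assumptions):

```python
import os
from openai import OpenAI

def make_client(provider: str) -> OpenAI:
    """Return an API client for either OpenAI or OpenRouter."""
    if provider == "openrouter":
        # OpenRouter serves many open-source models behind an OpenAI-compatible endpoint.
        return OpenAI(
            base_url="https://openrouter.ai/api/v1",
            api_key=os.environ["OPENROUTER_API_KEY"],
        )
    return OpenAI()  # default: OpenAI, using OPENAI_API_KEY

# Example: reuse the same per-cell logic against an open-source model.
client = make_client("openrouter")
MODEL = "meta-llama/llama-3.1-8b-instruct"  # assumed OpenRouter model slug
```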
Inspired by a blog post titled “How Does a Blind Model See the Earth,” the video also attempts to draw a world map using the same grid-based approach. With the grid increased to 30 columns by 20 rows, the model is asked to depict the world map, and the results show recognizable landmasses like Spain, England, Australia, and parts of the Americas. This mimics the original blog post’s method, which queried models about land or water presence at given latitude and longitude coordinates, revealing how different models interpret geographic data.
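The blog post’s coordinate-based framing can be approximated by mapping each grid cell to a latitude/longitude pair and asking a land-or-water question; a sketch using the video’s 30x20 world-map grid (the prompt wording is an assumption):

```python
COLS, ROWS = 30, 20  # grid used for the world-map experiment

def cell_to_lat_lon(col: int, row: int) -> tuple[float, float]:
    """Map a grid cell to the latitude/longitude at its center."""
    lon = -180.0 + (col + 0.5) * (360.0 / COLS)
    lat = 90.0 - (row + 0.5) * (180.0 / ROWS)
    return lat, lon

def land_or_water_prompt(col: int, row: int) -> str:
    """Build a per-cell question in the style of the blog post."""
    lat, lon = cell_to_lat_lon(col, row)
    return (
        f"Is the point at latitude {lat:.1f}, longitude {lon:.1f} on land or water? "
        "Answer with exactly one word: LAND or WATER."
    )
```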
The video reviews the performance of various LLMs on this task, highlighting that larger and more advanced models tend to produce better and more accurate images. Models like GPT-4.1 and the larger multi-billion-parameter Qwen variants show impressive results, while others like LLaMA or smaller open-source models perform less effectively. The creator emphasizes the high computational and financial cost of running these experiments, noting that some runs can cost several dollars due to the large number of API calls and tokens processed.
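As a rough back-of-the-envelope illustration of why the cost adds up (every number below is a placeholder assumption, not a figure from the video):

```python
# One API call per grid cell on a 20x20 grid.
calls = 20 * 20

# Assumed prompt + completion tokens per call; reasoning-heavy models
# can emit far more hidden reasoning tokens, inflating this considerably.
tokens_per_call = 2_000

# Hypothetical blended price in dollars per 1M tokens.
price_per_million_tokens = 2.00

total_tokens = calls * tokens_per_call
cost = total_tokens / 1_000_000 * price_per_million_tokens
print(f"{calls} calls, ~{total_tokens:,} tokens, ~${cost:.2f} per run")
```

Larger grids, more expensive models, and repeated runs across many models scale this figure up quickly, which is consistent with the multi-dollar runs mentioned in the video.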
In conclusion, the video provides an intriguing look into how LLMs can be probed to “visualize” concepts by breaking down images into grid coordinates and querying the model for each point. It showcases the varying capabilities of different models in this unconventional task and connects the experiment to existing research on how blind models perceive the world. The script and further details are made available on the creator’s Patreon for those interested in exploring this approach themselves.