The Biggest Prompt I’ve Ever Sent to an AI… 1,000,000 TOKENS

The creator attempts to scan a massive private codebase for secrets using a local AI model with a million-token context window, overcoming technical hurdles to prepare and submit an 831,000-token prompt. While initial attempts with Nvidia's Nemotron model fail to deliver useful results, switching to Llama 4 Scout on a powerful GPU cluster successfully identifies sensitive information, demonstrating the importance of both model selection and hardware for large-context AI tasks.

The video’s creator demonstrates an ambitious experiment: submitting an enormous, nearly one-million-token prompt to a local AI model. The context is a real, private codebase—an NX monorepo with over 300,000 lines of code and 1,600 source files, including three NativeScript applications. The goal is to use a large-context local language model to scan the codebase for sensitive information, such as API keys or secrets, without sending any private data to cloud-based services like ChatGPT, Gemini, or Claude.

Initially, the creator explores Nvidia's Nemotron 3 Nano 30B model, which claims to support a one-million-token context window. However, upon loading the model in LM Studio, input is capped at 262,144 tokens by the model's configuration. By editing the model's config.json file and raising the max_position_embeddings value, the creator unlocks the full million-token context window. The process is resource-intensive, requiring a high-end MacBook Pro with 128 GB of RAM and, for larger tests, a Mac Studio with 512 GB of RAM.
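The config edit described above amounts to rewriting one field in the model's config.json. A minimal sketch of that step, demonstrated here against a stand-in file rather than a real LM Studio model directory (the path and starting value are assumptions based on the article):

```python
import json
import tempfile
from pathlib import Path

def unlock_context(config_path: Path, new_len: int = 1_048_576) -> int:
    """Raise max_position_embeddings in a model's config.json; return the old value.

    Note: this only changes the context length the runtime will accept; it
    does not improve the model's actual long-context quality.
    """
    config = json.loads(config_path.read_text())
    old = config.get("max_position_embeddings", 0)
    config["max_position_embeddings"] = new_len
    config_path.write_text(json.dumps(config, indent=2))
    return old

# Demo against a throwaway stand-in config; with LM Studio you would point
# this at the downloaded model's own config.json instead.
demo = Path(tempfile.mkdtemp()) / "config.json"
demo.write_text(json.dumps({"max_position_embeddings": 262_144}))
old = unlock_context(demo)
print(old, "->", json.loads(demo.read_text())["max_position_embeddings"])
```

After the edit, reloading the model in LM Studio exposes the larger window, at the memory costs the article goes on to describe.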

To prepare the massive prompt, the creator uses a tool called Reprompt, which helps select only relevant code files and exclude unnecessary assets, resulting in a prompt of about 831,000 tokens. Detailed instructions are added to the prompt, specifying how the AI should report any discovered secrets, redact sensitive values, and avoid guessing line numbers. The prompt is then submitted to the local model, but the process is extremely demanding: memory usage skyrockets, and prompt processing takes a long time, with the MacBook’s fans working overtime.
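The selection step Reprompt performs (keep source files, skip assets, stay under a token budget) can be approximated with a rough character-based estimate. This is a sketch, not Reprompt's actual logic; the extension list is an assumption for an NX/NativeScript repo, and the ~4-characters-per-token heuristic is a ballpark that real tools replace with a proper tokenizer:

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # rough heuristic for source code; use a real tokenizer for accuracy
SOURCE_EXTS = {".ts", ".js", ".html", ".scss", ".json"}  # assumed extensions for this repo

def estimate_prompt_tokens(repo_root: str) -> int:
    """Estimate how many tokens a prompt built from a repo's source files would use."""
    total_chars = 0
    for path in Path(repo_root).rglob("*"):
        # Keep only source files; skip dependencies, build output, and binary assets.
        if path.is_file() and path.suffix in SOURCE_EXTS and "node_modules" not in path.parts:
            total_chars += len(path.read_text(errors="ignore"))
    return total_chars // CHARS_PER_TOKEN
```

Running a check like this before submission explains the 831,000-token figure: it is the filtered source text, not the whole repository, that fills the context window.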

Despite the technical achievement of submitting such a large prompt, the results are disappointing. The Nemotron model, even in its 8-bit quantized version on powerful hardware, fails to accurately identify secrets in the codebase. Instead, it produces hallucinated summaries or misses the task entirely. The creator references Nvidia's RULER benchmark for long-context tasks and notes that other models, like Llama 4 Scout, claim even larger context windows (up to 10 million tokens) and may be better suited for this kind of "needle in a haystack" search.
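The "needle in a haystack" framing mentioned above has a simple construction behind it: bury one distinctive line in a large volume of repetitive filler and ask whether the model can retrieve it. A minimal sketch of that setup (the secret value is fake and purely illustrative):

```python
import random

def build_haystack(filler_lines: int, needle: str, seed: int = 0) -> str:
    """Bury a single 'needle' line at a random depth in repetitive filler,
    the basic construction behind long-context retrieval tests like RULER."""
    random.seed(seed)
    lines = ["const value = computeSomething();"] * filler_lines
    lines.insert(random.randrange(filler_lines), needle)
    return "\n".join(lines)

haystack = build_haystack(10_000, 'const API_KEY = "sk-test-1234";')  # fake secret
# A trivial string scan finds the needle instantly; the experiment tests whether
# an LLM can do the same when the haystack fills its million-token context.
print("sk-test-1234" in haystack)
```

The secret-scanning task in the video is essentially this test run against real code, which is why long-context benchmark scores are a reasonable predictor of which models can handle it.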

Ultimately, the creator gains access to a high-end Nvidia B300 GPU cluster, running the full Llama 4 Scout model with BF16 precision. This setup, using over 2 TB of VRAM across eight GPUs, processes the million-token prompt at impressive speeds and successfully identifies 21 instances of sensitive information in the codebase. The experiment highlights the importance of both model choice and hardware capabilities for large-context AI tasks, and the creator concludes by encouraging viewers to subscribe for more experiments and to address their own code security issues.