This Model Claims A 12M Token Context, But I Am Skeptical

A new model called SubQ claims a groundbreaking 12 million token context window and large efficiency gains, but limited technical detail, a lack of independent verification, and inconsistent benchmark results warrant skepticism. The sparse-attention approach is promising, but experts like Timothy Karen stress the need for rigorous testing and transparency before accepting such extraordinary claims.

About an hour ago, a company announced a new model called SubQ, claiming an unprecedented 12 million token context window and a 52x efficiency improvement over existing models with no loss in quality. If true, this could transform both cloud and local AI models by enabling much longer and more complex reasoning tasks, such as processing entire codebases or months of text in a single prompt. However, the presenter, Timothy Karen, founder of Anything LLM, is strongly skeptical, pointing to the lack of detailed technical information and independent verification: a breakthrough of this size would be significant, and it deserves careful scrutiny.
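To see why a 12 million token window is such an extraordinary claim, it helps to run the quadratic-attention arithmetic. The sketch below is a rough back-of-envelope estimate, not anything from SubQ's (unpublished) methodology; the 128K baseline and the head dimension are illustrative assumptions.

```python
# Back-of-envelope: dense attention compute grows quadratically with
# context length, so going from a 128K window to a 12M window multiplies
# per-head attention cost by (12M / 128K)^2, not by a mere ~94x.

def attention_flops(n_tokens: int, d_head: int = 128) -> float:
    """Rough FLOPs for one attention head: the QK^T matmul plus the
    scores @ V matmul, each ~2 * n^2 * d_head FLOPs."""
    return 2 * (2 * n_tokens * n_tokens * d_head)

baseline = attention_flops(128_000)     # a common long-context size today
claimed = attention_flops(12_000_000)   # SubQ's claimed window

print(f"{claimed / baseline:.0f}x more attention compute per head")
# prints "8789x more attention compute per head"
```

The head dimension cancels in the ratio; the point is simply that dense attention cannot be stretched to 12M tokens by brute force, which is why a sub-quadratic mechanism is the only plausible route to such a window.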

The core innovation behind SubQ is a sparse attention architecture designed to be sub-quadratic in complexity, which theoretically lets it handle extremely long contexts far more efficiently than traditional dense attention. Dense attention scores every token against every other token, so compute grows quadratically with context length; optimizations like flash attention make that same dense computation faster and more memory-efficient but leave the quadratic cost intact, while sliding-window variants cap the cost by attending only to a limited recent context. Sparse attention goes further by selectively attending to semantically relevant tokens anywhere in the context, potentially enabling much longer effective context windows without prohibitive computational cost. Even so, sparse attention is notoriously difficult to implement well, which adds to the skepticism.
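As a concrete illustration of the sparse idea, here is a minimal top-k attention sketch. This is a generic textbook-style mechanism, not SubQ's actual (undisclosed) architecture: one query scores the keys and then attends only to the k most relevant ones, wherever they sit in the context.

```python
import numpy as np

def topk_sparse_attention(q, K, V, k=64):
    """Single-query top-k sparse attention (illustrative only).

    q: (d,) query vector; K, V: (n, d) key/value matrices.
    Attends to the k highest-scoring keys anywhere in the context
    instead of all n of them.
    """
    scores = K @ q / np.sqrt(q.shape[-1])        # similarity to every key
    idx = np.argpartition(scores, -k)[-k:]       # indices of the k best keys
    w = np.exp(scores[idx] - scores[idx].max())  # softmax over the sparse set
    w /= w.sum()
    return w @ V[idx]                            # mix only k values, not n
```

Note that even this sketch still scores every key, so it remains O(n) per query; a genuinely sub-quadratic system has to pair top-k selection with some index or routing structure (hashing, clustering, learned selectors) that proposes candidates without touching the whole context, which is exactly the part that is hard to get right.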

Timothy reviews the benchmarks provided by the company, noting that the available results are primarily for a 1 million token preview model rather than the full 12 million token version. On benchmarks like SWE-bench Verified and MRCR v2, the SubQ model performs reasonably well but does not clearly outperform leading models like Opus 4.6 or GPT 5.5. There are also inconsistencies between the benchmark scores shown on the company's website and those in its promotional video, which further complicates the assessment. Without a technical report or detailed methodology, it is difficult to evaluate the validity of the claims or the model's true capabilities.

The presenter also highlights the practical challenges of testing and using such a model, noting that filling a 12 million token context is itself a massive task and that no comparable models currently exist to benchmark against at that scale. He has applied for early access to the API to conduct independent tests but remains cautious about the hype until more concrete evidence is available. Timothy points out that while a million-token context window is becoming more feasible, especially with recent advances like DeepS v4’s hybrid attention, the jump to 12 million tokens is a much harder problem and requires more proof.
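A quick size estimate puts the testing problem in perspective; the ~4 characters per token figure below is a common English-text heuristic, not a SubQ tokenizer statistic, and "novel" here means roughly 500,000 characters.

```python
# Rough scale of a fully packed 12M-token prompt, assuming ~4 chars/token
# (a common heuristic for English text; actual tokenizers vary).
tokens = 12_000_000
chars = tokens * 4            # ~48 million characters
mb = chars / 1_000_000
novels = chars // 500_000     # assuming ~500k characters per novel

print(f"~{mb:.0f} MB of raw text, roughly {novels} novels")
# prints "~48 MB of raw text, roughly 96 novels"
```

Simply assembling a realistic prompt of that size, let alone grading the model's recall over it, is a nontrivial engineering effort, which is part of why no comparable benchmark exists at that scale.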

In conclusion, while the SubQ model’s claims are exciting and could represent a major step forward in long-context AI models, the current lack of transparency, technical details, and independent validation justifies skepticism. Timothy remains cautiously optimistic but stresses the importance of rigorous testing before accepting such extraordinary claims. He plans to follow up with further analysis if and when he gains access to the model, encouraging viewers to stay tuned for updates on this potentially groundbreaking development.