DeepSeek just killed LLMs

The video highlights DeepSeek's breakthrough OCR technology, which compresses text rendered as images by up to 20× while retaining high accuracy, letting large language models process visual tokens instead of traditional text tokens for cheaper, richer context. It also surveys broader AI advances, challenges in AI safety, and expert commentary suggesting that unifying vision and language inputs could reshape AI efficiency and capability.

The video discusses DeepSeek's recent breakthrough with its new DeepSeek-OCR model, which significantly improves the efficiency of processing text by compressing visual context up to 20× while maintaining roughly 97% accuracy. This matters because it addresses key bottlenecks in large language models (LLMs): limited memory capacity, slow training, and the high computational cost of expanding context windows. By converting large amounts of text into images that a vision-language model can process efficiently, DeepSeek-OCR enables much shorter context windows and faster, cheaper training without sacrificing accuracy. The approach also captures richer information, such as bold or colored text and complex figures, that traditional text tokenization struggles to represent.
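The efficiency argument above comes down to simple arithmetic. As a back-of-the-envelope sketch (the token counts below are illustrative assumptions, not DeepSeek's published benchmark figures):

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Ratio of text tokens to vision tokens for the same page."""
    return text_tokens / vision_tokens

# Illustrative figures only: suppose a dense page would need ~2000
# text tokens, but optical compression represents it in ~100 vision
# tokens -- a 20x reduction in sequence length.
ratio = compression_ratio(2000, 100)
print(ratio)  # -> 20.0

# Shorter sequences also shrink attention cost, which grows
# quadratically with sequence length in a vanilla transformer,
# so a 20x shorter context is roughly a 400x cheaper attention pass.
attention_savings = (2000 / 100) ** 2
print(attention_savings)  # -> 400.0
```

The quadratic term is why compression compounds: the savings on context memory are linear, but the savings on attention computation scale with the square of the reduction.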

The video also highlights the broader AI landscape, mentioning Google’s recent quantum computing breakthrough, which demonstrated a quantum computer running an algorithm 13,000 times faster than classical supercomputers. Additionally, Google’s 27-billion-parameter Gemma model has made strides in cancer immunotherapy research by predicting new drug candidates that could make tumors more visible to the immune system. This exemplifies the emergent capabilities of large-scale models, where certain complex reasoning abilities only appear once models reach a sufficient size, encouraging massive investments in AI data centers worldwide.

However, the video also touches on challenges in AI safety and reliability. A recent paper attempting to define Artificial General Intelligence (AGI) was criticized for containing numerous nonexistent citations, likely the result of unverified AI-generated content. Furthermore, research from Anthropic revealed that injecting just 250 poisoned documents into training data can backdoor a model, causing it to produce gibberish when a trigger phrase appears. Notably, the number of poisoned documents needed did not grow with model size, highlighting ongoing security risks in AI development.

Returning to DeepSeek-OCR, the video features commentary from Andrej Karpathy, a prominent AI researcher and former director of AI at Tesla, who praises the model and raises an intriguing question: might pixels be better inputs to language models than text tokens? Karpathy criticizes tokenizers as inefficient and limiting, citing inconsistent encoding of visually similar characters and emojis, which can hinder transfer learning and introduce security risks. He suggests that feeding models images, even for pure text input, could be more effective and efficient, potentially eliminating tokenizers altogether.
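The fragility Karpathy points at can be seen without any particular tokenizer, just by looking at the bytes a byte-level model would receive. The examples below are standard Unicode behavior, not tied to any one model:

```python
# Two characters that render identically in many fonts: Latin "A"
# vs. Cyrillic "А" (U+0410). A pixel-based input would see the same
# glyph; a byte- or token-based input sees different sequences.
latin = "A"
cyrillic = "\u0410"

print(latin == cyrillic)         # -> False
print(latin.encode("utf-8"))     # -> b'A' (1 byte)
print(cyrillic.encode("utf-8"))  # -> b'\xd0\x90' (2 bytes)

# Emojis are similar: one visible symbol can span several code
# points, so token counts diverge from what the user "sees".
# A family emoji is man + ZWJ + woman + ZWJ + girl:
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
print(len(family))  # -> 5 code points for one on-screen glyph
```

Visually identical inputs mapping to different internal representations is exactly the kind of inconsistency that a rendered-pixel input would sidestep, since the model would see the glyph itself.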

Finally, the video emphasizes the practical applications and future potential of DeepSeek-OCR, including its ability to parse complex documents, charts, chemical formulas, and scientific data, which could accelerate research in STEM fields. The technology's capacity to generate massive amounts of training data daily, and the synergy it creates between vision and language modalities, open new avenues for AI efficiency and capability. The video concludes by noting how this approach aligns with broader trends in AI, including Elon Musk's view that photons (visual data) will dominate AI inputs and outputs in the long term, underscoring the transformative impact of integrating vision and language in AI systems.