Grok 4 Fully Tested (INSANE)

artesia · 11 July 2025 18:18

The video provides a comprehensive evaluation of Grok 4, highlighting its strong performance in coding, reasoning, multimodal understanding, and in-conversation memory, while noting weaknesses in image generation, advanced animations, and persistent memory. Its ethical safeguards are generally effective but inconsistent in places. Overall, Grok 4 is presented as a versatile and powerful AI model that marks a significant advancement in AI capabilities.

artesia · 11 July 2025 18:38

The video presents an extensive evaluation of Grok 4, a newly released AI model, tested across a wide range of tasks less than 24 hours after its launch. The reviewer uses two versions of the model, Grok 4 and Grok 4 Heavy, choosing the latter for the more logic- and reasoning-intensive challenges. Initial tests include complex coding tasks such as implementing a 2D Navier-Stokes solver for fluid dynamics, which Grok 4 Heavy completes successfully, producing an interactive smoke simulation with adjustable parameters and obstacle interaction. The model also performs well on Conway’s Game of Life, generating a browser-based implementation with multiple user controls and demonstrating its capability in both coding and UI design.
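For readers unfamiliar with the Game of Life task, the core update rule is small. The following is a minimal Python sketch of one generation step, not the browser implementation Grok 4 produced in the video; the function name `step` and the set-of-coordinates representation are assumptions chosen for illustration.

```python
from collections import Counter


def step(live):
    """Advance one generation. `live` is a set of (row, col) live cells."""
    # Count, for every cell adjacent to a live cell, how many live neighbours it has.
    counts = Counter(
        (r + dr, c + dc)
        for r, c in live
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # A cell is alive next generation if it has exactly 3 live neighbours,
    # or if it has exactly 2 and is already alive.
    return {cell for cell, n in counts.items() if n == 3 or (n == 2 and cell in live)}


# A horizontal "blinker" flips to vertical and back again.
blinker = {(1, 0), (1, 1), (1, 2)}
print(sorted(step(blinker)))
```

A UI like the one described in the video would call a function of this kind on a timer and redraw the grid after each step.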

Further coding tests include generating D3.js visualizations of world trade flows and a Python desktop app for hand-tracking drawing with color selection via gestures. The trade flow visualization is basic and struggles with advanced animations. The hand-tracking app shows promising interactivity, though some features, such as color selection, are less intuitive. Attempts to create a Rubik’s Cube simulation fail, highlighting limitations in certain complex visual simulations. The reviewer also tests Grok 4’s ability to extract specific information from large text inputs: the model successfully retrieves a hidden password from a lengthy Harry Potter excerpt, showcasing strong contextual understanding and needle-in-a-haystack search capabilities.
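A needle-in-a-haystack test of this kind can be set up in a few lines. The sketch below is an assumption about how such a test might be built, not the reviewer's actual procedure: the helper names `build_haystack` and `passed`, the filler text, and the password are all hypothetical, and no model API is called.

```python
import random


def build_haystack(filler_sentences, needle, seed=0):
    """Insert `needle` at a random position among the filler sentences.

    Returns the combined text and the index where the needle was placed.
    """
    rng = random.Random(seed)
    position = rng.randrange(len(filler_sentences) + 1)
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences), position


def passed(model_answer, secret):
    """Score the reply: the test passes if the secret appears in it."""
    return secret.lower() in model_answer.lower()


# Hypothetical data; in the video the filler was a long book excerpt.
filler = [f"This is filler sentence number {i}." for i in range(1000)]
secret = "swordfish-42"
text, where = build_haystack(filler, f"The hidden password is {secret}.")
prompt = text + "\n\nWhat is the hidden password in the text above?"
```

The `prompt` string would then be sent to the model under test, and `passed` applied to whatever reply comes back.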

In image generation, Grok 4’s performance is underwhelming, with outputs described as average and sometimes unintelligible, especially for comic-style requests. The model’s ethical and safety guardrails are tested with controversial prompts: it responsibly refuses to provide recipes for illegal substances but surprisingly offers detailed instructions on hotwiring a car, revealing some inconsistencies in content moderation. The multimodal capabilities impress, with accurate image descriptions and object identification, including a challenging “Where’s Waldo?” test in which Grok 4 successfully locates the character, a feat many AI models struggle with.

The video also explores Grok 4’s advanced reasoning through deep research summarization, first-principles economic design, and spatial awareness tasks, all of which the model handles competently. Memory tests show that it can recall information within a conversation but lacks persistent memory across threads. Real-world applications such as drafting executive summaries, medical diagnosis, and solving puzzles like the Tower of Hanoi demonstrate Grok 4’s versatility and practical utility. Creative writing tasks are handled well, though not exceptionally well compared with other models.
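For reference, the Tower of Hanoi puzzle mentioned above has a standard recursive solution that makes a model's answer easy to check. This is a minimal sketch of that textbook recursion, not Grok 4's output; the peg labels "A", "B", and "C" are assumptions.

```python
def hanoi(n, source="A", target="C", spare="B"):
    """Return the list of (from_peg, to_peg) moves that transfers n disks."""
    if n == 0:
        return []
    # Move n-1 disks out of the way, move the largest disk, then move the n-1 back on top.
    return (
        hanoi(n - 1, source, spare, target)
        + [(source, target)]
        + hanoi(n - 1, spare, target, source)
    )


moves = hanoi(3)
print(len(moves))  # 2**3 - 1 = 7 moves
```

The optimal solution for n disks always takes 2^n - 1 moves, so a model's answer can be compared against both the move count and the move list.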

Overall, the review concludes that Grok 4 is a powerful and versatile AI with strengths in coding, reasoning, multimodal understanding, and contextual memory within conversations. Its weaknesses are image generation, advanced animation, and persistent memory, and its ethical safeguards are generally effective but show occasional gaps. The reviewer presents Grok 4 as a significant step forward in AI capabilities, especially when used with appropriate prompt engineering techniques, and invites suggestions for further testing.