So... My AI App Has Been Lying to Users (And How I Fixed It)

The creator of the AI-powered calorie tracking app Amy improved its accuracy by adopting a structured evaluation system called Braintrust, which systematically tests and compares different AI configurations using realistic test cases and LLM-based judges. Through experiments that separated search from reasoning and swapped search providers, they raised accuracy from 68% to 75%, highlighting the importance of rigorous evaluation, realistic testing, and continuous user feedback in building reliable AI systems.

The video discusses the biggest challenge facing Amy, the creator's AI-powered calorie tracking app: accuracy. Since the app's value depends entirely on the AI returning correct calorie and nutrition information, inaccuracies lead to user dissatisfaction and subscription cancellations. The model used, Perplexity Sonar, combines web search and reasoning to estimate calories, but users reported significant errors, especially for international and niche food items. The creator emphasizes structured testing and evaluation, rather than ad-hoc fixes, as the path to better accuracy.

To address this, the creator uses an evaluation system called Braintrust, which organizes test cases, scoring functions, and experiments to systematically measure AI performance. The test cases include both synthetic data and real user-reported issues, with scorers that check for correct formatting and use large language models (LLMs) as judges to assess response quality. This setup allows side-by-side comparisons of different AI configurations to identify improvements or regressions, enabling more reliable and scalable AI development.
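To make the workflow concrete, here is a minimal sketch of such an eval harness. All names (`run_eval`, `format_scorer`, the toy "configurations") are hypothetical stand-ins for illustration, not Braintrust's actual API, and the ±15% tolerance check stands in for the LLM judge described in the video:

```python
import re

# Test cases: a mix of synthetic items and real user-reported misses.
TEST_CASES = [
    {"input": "1 medium banana", "expected_kcal": 105},
    {"input": "1 cup cooked jollof rice", "expected_kcal": 310},  # niche item
]

def format_scorer(response: str) -> bool:
    """Cheap deterministic check: does the response contain 'NNN kcal'?"""
    return re.search(r"\b\d+\s*kcal\b", response) is not None

def accuracy_scorer(response: str, expected_kcal: int, tol: float = 0.15) -> bool:
    """Stand-in for an LLM judge: accept estimates within ±15% of expected."""
    m = re.search(r"\b(\d+)\s*kcal\b", response)
    if not m:
        return False
    return abs(int(m.group(1)) - expected_kcal) / expected_kcal <= tol

def run_eval(model, cases):
    """Score one configuration over all cases: fraction passing both scorers."""
    passed = 0
    for case in cases:
        resp = model(case["input"])
        if format_scorer(resp) and accuracy_scorer(resp, case["expected_kcal"]):
            passed += 1
    return passed / len(cases)

# Two toy "configurations" to compare side by side.
config_a = lambda food: "about 100 kcal"  # always guesses 100
config_b = lambda food: {"1 medium banana": "105 kcal",
                         "1 cup cooked jollof rice": "300 kcal"}[food]

print(f"config A: {run_eval(config_a, TEST_CASES):.0%}")  # 50%
print(f"config B: {run_eval(config_b, TEST_CASES):.0%}")  # 100%
```

In a real harness the configurations would call the actual models, but the shape is the same: a fixed dataset, deterministic and judge-based scorers, and one number per configuration to compare.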

Several experiments were conducted to improve accuracy. First, the creator separated the search and reasoning tasks by using Perplexity Search for retrieval and Gemini 3 Flash for reasoning, but this performed worse than the original Sonar model. Next, a mini-agent approach allowed multiple searches per query, improving accuracy but increasing cost and latency. Finally, switching the search provider from Perplexity to Exa while keeping Gemini 3 Flash for reasoning yielded the best results, improving accuracy from 68% to 75%, reducing latency, and maintaining cost efficiency. Attempts to further optimize search queries or restrict search domains did not improve performance.
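The "separate search from reasoning" architecture can be sketched as a two-stage pipeline where the search provider is injected, which is what makes experiments like the Perplexity-to-Exa swap a one-line change. The function names and stub bodies below are illustrative assumptions, not the app's actual code; in production the stubs would call the real search APIs and a reasoning model:

```python
from typing import Callable, List

def perplexity_search(query: str) -> List[str]:
    """Stub for the Perplexity Search provider."""
    return [f"[perplexity] snippet about {query}"]

def exa_search(query: str) -> List[str]:
    """Stub for the Exa search provider."""
    return [f"[exa] snippet about {query}"]

def reason(food: str, snippets: List[str]) -> str:
    """Stub for the reasoning model (e.g. a Gemini Flash call)."""
    return f"Estimated calories for {food}, based on {len(snippets)} source(s)."

def estimate_calories(food: str, search: Callable[[str], List[str]]) -> str:
    """Retrieval and reasoning are decoupled; swap providers via the argument."""
    snippets = search(f"calories in {food}")
    return reason(food, snippets)

print(estimate_calories("jollof rice", search=perplexity_search))
print(estimate_calories("jollof rice", search=exa_search))
```

The mini-agent variant from the video would simply loop this pipeline, issuing follow-up searches until the reasoning step is confident, at the cost of extra latency and spend per query.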

The creator highlights common pitfalls in setting up AI evaluation systems, such as poorly designed judges that fail to detect hallucinations and overly simplistic test cases that do not challenge the AI. They stress the importance of curating realistic, difficult test cases and continuously updating the evaluation dataset with real user corrections to maintain and improve AI quality over time. This feedback loop ensures that the AI system evolves based on actual user needs and errors.
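The feedback loop described above amounts to turning every user correction into a new eval case, so the dataset keeps pace with real-world failure modes. A minimal sketch, assuming a JSONL dataset file and an illustrative schema (the field names are not from the video):

```python
import json
import os
import tempfile

def record_correction(dataset_path: str, food: str,
                      model_kcal: int, user_kcal: int) -> None:
    """Append a user-corrected example to a JSONL eval dataset."""
    case = {
        "input": food,
        "expected_kcal": user_kcal,       # the user's correction becomes ground truth
        "model_said_kcal": model_kcal,    # kept for error analysis
        "source": "user_correction",
    }
    with open(dataset_path, "a") as f:
        f.write(json.dumps(case) + "\n")

path = os.path.join(tempfile.gettempdir(), "amy_eval_cases.jsonl")
open(path, "w").close()  # start fresh for the demo
record_correction(path, "1 cup cooked jollof rice", model_kcal=180, user_kcal=310)

with open(path) as f:
    cases = [json.loads(line) for line in f]
print(cases[0]["expected_kcal"])  # the user's corrected value
```

Each new case is exactly the kind of hard, realistic example the creator says synthetic data tends to miss, since it was by definition a query the model already got wrong.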

In conclusion, the video underscores key lessons: separating search and reasoning provides finer control, the choice of search provider significantly impacts accuracy, simpler approaches can outperform more complex ones, and rigorous evaluation is essential for trustworthy AI development. The creator encourages viewers building AI systems to adopt similar evaluation frameworks and offers Braintrust as a practical tool for doing so. The video ends with an invitation for feedback and further discussion on AI evaluation techniques.