OpenAI’s recent research suggests that advanced language models like GPT-5 are nearing or surpassing human performance on many digital tasks, offering significant productivity gains by speeding up expert workflows, though full job automation remains unlikely given task complexity and error risks. Despite challenges such as low AI adoption rates, fairness issues, and safety concerns, the study highlights AI’s potential as a powerful tool for augmenting human work, alongside ongoing efforts to improve AI reliability and capability.
In the last 24 hours, OpenAI released research investigating whether current language models can automate jobs across key sectors that contribute significantly to US GDP. The study used tasks designed by industry professionals with an average of 14 years of experience, ensuring they were realistic and relevant. One surprising finding was that Anthropic’s Claude Opus 4.1 outperformed OpenAI’s own models and came close to matching industry experts in deliverable quality; OpenAI’s transparency in publishing these results was praised as honest scientific practice. Another interesting insight was that model performance varied significantly by file type, with models excelling particularly at PDFs, PowerPoint presentations, and Excel spreadsheets, and even outperforming humans on government-related tasks.
A third unexpected discovery was that more advanced models like GPT-5 can actually speed up human experts by producing outputs that meet quality standards, cutting the time needed for review. This speed-up depends on model strength, however: weaker models save no time, because reviewing their outputs costs more than it returns. Two important caveats apply: no data was reported on Claude Opus 4.1’s speed impact, and human reviewers might miss subtle model errors, which would undermine the apparent efficiency gains. The biggest finding, echoed by economists like Lawrence Summers, is that current frontier models approach or exceed human performance on many task-specific tests, fueling claims that these systems might already qualify as Artificial General Intelligence (AGI).
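The review-time argument above can be made concrete with a toy expected-time model. Note that this is an illustrative sketch, not a calculation from the study: the acceptance probabilities and review times below are hypothetical assumptions, and only the ~7-hour average task length comes from the research as described.

```python
# Toy model (not from the study): when does reviewing an AI draft
# beat doing the task from scratch?
def expected_time(p_accept, t_review, t_redo):
    """Expected expert hours when the model drafts first.

    p_accept: probability the draft passes review (proxy for model strength)
    t_review: hours spent reviewing a draft
    t_redo:   hours to redo the task after a failed review
    """
    # The expert always pays the review cost; with probability
    # (1 - p_accept) the draft fails and the task is redone by hand.
    return t_review + (1 - p_accept) * t_redo

T_SCRATCH = 7.0  # tasks in the study averaged ~7 expert hours

# Strong model (hypothetical): most drafts pass, so review dominates.
strong = expected_time(p_accept=0.7, t_review=1.0, t_redo=T_SCRATCH)
# Weak model (hypothetical): most drafts fail, so review is pure overhead.
weak = expected_time(p_accept=0.1, t_review=1.0, t_redo=T_SCRATCH)

print(f"strong model: {strong:.1f}h vs {T_SCRATCH:.1f}h from scratch")
print(f"weak model:   {weak:.1f}h vs {T_SCRATCH:.1f}h from scratch")
```

Under these assumed numbers the strong model cuts a 7-hour task to roughly 3 hours, while the weak model costs more time than working from scratch, which is the qualitative asymmetry the video describes.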
Despite these promising results, the research also highlighted how resistant human jobs remain to full automation by current-generation language models. The study focused only on predominantly digital tasks drawn from sectors that each contribute at least 5% of US GDP, excluding many non-digital or mixed tasks. Even within predominantly digital roles, not all tasks are automatable, so outright job elimination is unlikely in the near term. The tasks tested were realistic and lengthy, averaging seven hours of expert work, but they were one-shot and lacked the interactive, iterative nature of real-world jobs. Additionally, catastrophic errors, such as hallucinated data or harmful suggestions, occurred around 2.7% of the time, posing risks that could outweigh the efficiency gains if not carefully managed.
The video also discussed broader limitations and real-world implications. AI adoption rates remain low, with many companies abandoning pilot projects, and lagging indicators like GDP growth will take time to reflect AI’s impact. Radiology was used as an example of how, even when AI outperforms humans on specific tasks, a profession’s overall demand and salaries can rise due to legal, social, and task-complexity factors. Furthermore, AI tools often perform worse for minority groups and less common languages, and many tasks remain beyond the reach of automation. The speaker emphasized that while AI may not fully automate jobs soon, it can still act as a powerful multiplier, speeding up workflows and augmenting human productivity.
In conclusion, the video encourages viewers not to underestimate the potential of AI as a productivity enhancer, even if full automation is still some way off. It also highlighted ongoing efforts to improve AI safety through competitions like the Grey Swan Arena, where participants are rewarded for finding vulnerabilities in language models. The speaker noted the importance of understanding AI’s limitations and the nuanced challenges ahead, while remaining optimistic about the benefits of integrating AI tools into creative and professional workflows. Finally, the video touched on emerging features like ChatGPT Pulse for scheduled tasks, signaling continued innovation in AI capabilities.