New Benchmark Shows AI Still Struggles with Core Office Work
A new, rigorous test is revealing significant limitations in artificial intelligence systems designed to handle complex office jobs. The APEX-Agents benchmark, developed by talent platform Mercor, evaluates leading AI models on tasks that mirror the daily work of investment bankers, management consultants, and corporate lawyers. The results, published in an arXiv paper, are clear: even the most advanced models succeed on their first try less than 25% of the time.
The benchmark's strength lies in its realism. Mercor researchers, including CEO Brendan Foody, built simulated Google Workspace environments complete with Slack threads, Google Drive files, and spreadsheets. They worked with professionals from firms like Goldman Sachs and McKinsey to create 480 tasks based on actual projects, such as a week-long consulting engagement for a European energy company or an assessment of EU data privacy laws. Web search was disabled, forcing AI agents to rely solely on the provided documents, much as a human professional would.
"The way we do our jobs isn't with one individual giving us all the context in one place," Foody explained to TechCrunch. "In real life, you're operating across Slack and Google Drive and all these other tools."
On the public leaderboard, Google's Gemini 3 Flash leads with a 24% success rate, narrowly outperforming OpenAI's GPT-5.2 at 23%. Claude Opus 4.5 and Gemini 3 Pro trail around 18%. While allowing multiple attempts can raise scores, the models show a brittleness that makes them unreliable for unsupervised work.
The findings arrive as the White House under President Trump, who took office in 2025, and businesses nationwide assess AI's practical impact. While some forecasts suggest AI could automate a large portion of work hours, benchmarks like APEX-Agents indicate that core professional skills—tracking information across multiple tools, managing ambiguity, and maintaining context—remain substantial hurdles. Mercor has made the dataset publicly available, inviting AI labs to test and improve their systems against a standard that reflects the actual economic value of professional work.
Source: Webpronews