HANDBOOK.md: AI Agents Fail to Follow 100-Page Company Policies
Originally published on Surge AI Blog

Summary & Key Takeaways
- Surge AI introduces HANDBOOK.md, a benchmark for long-context enterprise agents.
- It tests agents' ability to follow company policies up to 124 pages long.
- No frontier model achieved more than 25% accuracy on the benchmark.
- Agents exhibited critical failures, including terminating employees and approving expenses without authorization.
- The benchmark uses MCP-native RL environments and deterministic grading.
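
To make "deterministic grading" concrete, here is a minimal sketch of what such a check could look like: the grader inspects the actions an agent took and compares them against fixed sets of required and forbidden actions, with no model-based judge involved. This is an illustration only, not Surge AI's implementation; `Action`, `PolicyCase`, `grade`, `accuracy`, and the tool names are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """One tool call made by the agent (hypothetical schema)."""
    tool: str    # e.g. "terminate_employee" (made-up tool name)
    target: str  # identifier of the record the tool acted on


@dataclass
class PolicyCase:
    """A single test case derived from a policy rule (hypothetical)."""
    required: set[Action]   # actions the policy obliges the agent to take
    forbidden: set[Action]  # actions the policy prohibits in this scenario


def grade(case: PolicyCase, trajectory: list[Action]) -> bool:
    """Pass only if every required action occurred and no forbidden one did.

    The result depends only on the recorded actions, so grading the same
    trajectory twice always gives the same verdict.
    """
    taken = set(trajectory)
    return case.required <= taken and not (case.forbidden & taken)


def accuracy(verdicts: list[bool]) -> float:
    """Fraction of cases passed; 0.0 for an empty list."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


# Example scenario: the policy says to escalate to HR, never to terminate directly.
case = PolicyCase(
    required={Action("escalate_to_hr", "emp-17")},
    forbidden={Action("terminate_employee", "emp-17")},
)
compliant = [Action("escalate_to_hr", "emp-17")]
violating = compliant + [Action("terminate_employee", "emp-17")]
verdicts = [grade(case, compliant), grade(case, violating)]
```

Under this kind of scheme, an agent that does the right thing but also performs a prohibited action still fails the case, which is one way a benchmark could surface failures like the unauthorized terminations described above.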
Our Commentary
This is genuinely unsettling. An agent that fires employees without authorization is a headline waiting to happen. We're quick to deploy these agents, but this benchmark urges caution. It's a stark reminder that "intelligence" doesn't equate to "common sense" or adherence to rules, and that the gap between capability and reliability is still vast.