Web Agents Just Met Their Match: Why Current AI Still Struggles With Real-World Browsing
Remember when AI beat humans at chess? Then Go? Every milestone felt like we were one step closer to general AI. But if you've ever tried to use an AI web agent for something genuinely useful—like booking a flight, comparing products across five different retailers, or planning a multi-city vacation—you've probably noticed something: they get lost.
The problem isn't the models themselves. It's that we've been measuring them wrong.
The Benchmark Gap Nobody Talked About
Until recently, web agent benchmarks have been... let's call them optimistic. Most tests involve single-site tasks that can be completed in minutes: "Log into this account." "Fill out this form." "Click this button." The frontier models? They're already crushing these. We're talking saturation—the benchmarks aren't telling us much anymore.
But real-world web browsing doesn't work like that. When you actually need an AI agent to do something valuable, the tasks are messy, multi-step, and genuinely challenging:
- Comparing products across competitors (searching Amazon, Walmart, Best Buy, and specialized retailers simultaneously)
- Planning complex trips (checking flights on multiple airlines, hotels, rental cars, attractions across different platforms)
- Aggregating information (synthesizing product reviews, pricing data, and availability across dozens of sources)
These tasks require something radically different: sustained context, cross-site reasoning, and the ability to maintain focus over potentially hours of browsing. They're the opposite of the short-horizon, single-site tasks we've been benchmarking.
Enter Odysseys.
Meet Odysseys: The Benchmark That Actually Reflects Reality
Researchers from Carnegie Mellon University introduced Odysseys—a benchmark of 200 long-horizon web tasks derived from actual real-world browsing sessions and tested on the live Internet. This isn't a lab environment with mock websites. These are real sites, real complexity, real failure modes.
The results? Sobering. The strongest frontier model achieved 44.5% perfect task success. That means 55.5% of realistic workflows, more than half, ended in failure or incomplete results.
But here's the kicker: even measuring "success" on long-horizon tasks is harder than it sounds.
Why Binary Pass/Fail Doesn't Cut It Anymore
Imagine this: an agent is asked to plan a three-day trip to Japan. It books the flights, finds a hotel, and identifies three attractions. But it misses one restaurant recommendation you explicitly requested. Did it succeed or fail?
With traditional pass/fail evaluation, you'd have to pick one. In reality, the agent partially solved the problem. Traditional benchmarks miss this nuance entirely.
Odysseys introduced rubric-based evaluation—breaking each task into granular checkpoints that can be verified independently. Instead of "Pass or Fail," each task is graded on a scale, with specific, measurable criteria for partial progress. This approach showed higher agreement with human judgment than the common LLM-as-judge methods that just throw a full trajectory at an AI and ask, "What do you think?"
This distinction matters. A lot.
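To make the idea concrete, here is a minimal sketch of rubric-style scoring applied to the Japan trip example above. The `Checkpoint` class, the function names, and the equal weighting of checkpoints are all assumptions made for illustration; the benchmark's actual rubric format and weighting may differ.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Checkpoint:
    """One independently verifiable criterion (hypothetical structure)."""
    description: str
    satisfied: bool


def rubric_score(checkpoints: Sequence[Checkpoint]) -> float:
    """Fraction of checkpoints satisfied, assuming equal weights."""
    if not checkpoints:
        raise ValueError("a rubric needs at least one checkpoint")
    return sum(1 for c in checkpoints if c.satisfied) / len(checkpoints)


def is_perfect(checkpoints: Sequence[Checkpoint]) -> bool:
    """Binary pass/fail view: every checkpoint must be satisfied."""
    return all(c.satisfied for c in checkpoints)


# The three-day Japan trip: everything done except the restaurant.
japan_trip = [
    Checkpoint("Flights booked", True),
    Checkpoint("Hotel found", True),
    Checkpoint("Three attractions identified", True),
    Checkpoint("Requested restaurant recommendation included", False),
]

print(rubric_score(japan_trip))  # partial credit
print(is_perfect(japan_trip))    # strict pass/fail
```

Under pass/fail this run counts as a failure. Under the rubric view it gets credit for three of four checkpoints, which is much closer to how a person would describe it.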
The Efficiency Problem Nobody Expected
Here's something that caught researchers' attention: measuring success rate alone misses half the story. Even when agents do succeed, they're wildly inefficient.
Odysseys introduced a Trajectory Efficiency metric—essentially, how much rubric score progress the agent makes per step. Think of it as "bang for your computational buck."
The result: even frontier agents only reached 1.15% trajectory efficiency.
Translation: agents are taking enormous detours, getting sidetracked, re-checking information they already verified, and generally burning through steps like they're paying per click. For practical deployments, this is a serious problem. If an agent takes 1,000 steps to accomplish what a human could do in 50, the economics don't work—especially when each step might involve loading a new page, waiting for JavaScript to render, or navigating complex site structures.
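As a rough illustration, the simplest reading of "rubric score progress per step" is the final rubric score divided by the number of steps taken. That formula is an assumption here, not necessarily the benchmark's exact definition, and the function name is made up for this sketch.

```python
def trajectory_efficiency(rubric_score: float, steps: int) -> float:
    """Rubric score earned per browser step (assumed definition)."""
    if steps <= 0:
        raise ValueError("steps must be positive")
    if not 0.0 <= rubric_score <= 1.0:
        raise ValueError("rubric_score must be in [0, 1]")
    return rubric_score / steps


# Same fully completed task, two very different paths to get there.
human_like = trajectory_efficiency(1.0, 50)    # 50 steps
wandering = trajectory_efficiency(1.0, 1000)   # 1,000 steps

print(f"{human_like:.2%} per step vs {wandering:.2%} per step")
```

Under this reading, the 50-step path is 20 times more efficient than the 1,000-step one, even though both would show up as a "success" in a pass/fail table.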
What the Data Actually Shows
The benchmark tested eight frontier and open-weight models. When researchers plotted perfect task completion against "step budget" (how many browser actions the agent is allowed before giving up), the pattern was clear:
All models showed a sigmoidal curve. Performance stays near zero for the first ~15 steps. Then it climbs steeply between steps 20-70. After ~80 steps, improvement tapers off as agents hit their practical ceiling.
Frontier API models did climb steeper and higher than open-weight alternatives. But critically, none of them approached full completion. There's massive headroom—or depending on your perspective, massive room for failure.
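If you want to draw this kind of curve for your own agent, one way to sketch it is to record the step at which each task first reached perfect completion (or `None` if it never did) and then count how many tasks finish within each budget. The record format and the numbers below are invented for illustration; they are not data from the benchmark.

```python
from typing import Optional, Sequence


def success_at_budget(steps_to_perfect: Sequence[Optional[int]], budget: int) -> float:
    """Fraction of tasks perfectly completed within `budget` steps."""
    if not steps_to_perfect:
        raise ValueError("need at least one task record")
    solved = sum(1 for s in steps_to_perfect if s is not None and s <= budget)
    return solved / len(steps_to_perfect)


# Made-up records for ten tasks: step of perfect completion, or None.
records = [25, 40, None, 65, None, 30, 90, None, 55, None]

# Success rate at a handful of step budgets.
curve = {b: success_at_budget(records, b) for b in (10, 20, 50, 80, 100)}
for budget, rate in curve.items():
    print(f"budget {budget:>3}: {rate:.0%}")
```

Even with toy numbers the shape is familiar: nothing finishes at tiny budgets, most of the gains arrive in the middle, and the tasks that never reach perfect completion cap the curve no matter how many steps you allow.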
Why This Matters for the Industry
If you're building products that rely on web agents—and increasingly, companies are—Odysseys is a reality check.
For founders and product builders: You can't rely on web agents for complex, multi-step workflows. Not yet. If your product depends on agents reliably executing realistic browsing tasks, you need to either simplify the task or add human oversight.
For AI researchers: This is the new frontier. The easy wins are gone. Building agents that handle long-horizon, multi-site reasoning is the next challenge. It requires better context management, improved planning, and smarter navigation strategies.
For infrastructure providers (like us at NameOcean): This benchmark raises interesting questions about how we design web-accessible services. If AI agents struggle to reason across domains and understand cross-site context, how should we structure APIs and hosting infrastructure to make agent integration easier? How do we design DNS, SSL, and service discovery to be "agent-friendly"?
The Real Takeaway
Web agents aren't ready to replace human judgment on complex tasks. But they're also not standing still. The Odysseys benchmark gives us a way to measure real progress—not just incremental improvements on easy problems, but genuine advances in handling the kind of work that actually matters.
The question isn't whether AI will eventually solve this. It's when. And for teams building on this technology today, that distinction matters a lot.
The benchmark is live, with task records, detailed rubrics, and video recordings of agent attempts. If you're working with web agents, it's worth exploring. It might just reveal why your current implementation is struggling.