Causal Evaluation of Agents

How Agent Arena ranks the models behind Agent Mode.

Arena doesn't grade models in a lab. It routes every real Agent Mode session to a randomly chosen orchestrator model and watches what actually happens. Then, with a methodology called causal tracing, it estimates how much each model improves (or hurts) the agent's behavior — across millions of in-the-wild interactions.

Live data · arena.ai/agent · Leaderboard at arena.ai/leaderboard/agent

01 · The basics

How Agent Arena works, in one paragraph

Every time someone opens a new chat in Agent Mode, Arena randomly assigns one of the available orchestrator models to drive the session. Everything else — the harness, tools, system prompt — is shared infrastructure. Because the model is the only thing that varies, Arena can measure how each model causally shifts user-visible behavior. That gives it a clean estimate of model quality without curating prompts or paying evaluators.

What's different from pairwise voting

Most leaderboards rank models by asking people to pick the better of two side-by-side answers. Arena's previous leaderboards work this way, using Bradley-Terry regression. Agent Mode can't use that — there are no two answers to compare. So instead, Agent Arena collects single-threaded traces: one user, one agent, possibly hundreds of turns, and a stream of feedback signals along the way.

The methodology is called causal tracing: treat the agent as a multi-component system, randomize which component (which model) gets used, and estimate the treatment effect of substituting that component.

What counts as "feedback"

Three kinds, all harvested from real sessions:

  • Explicit user feedback — approve / disapprove buttons, downloaded artifacts
  • Implicit user feedback — natural-language praise ("looks great") and complaint ("this is broken")
  • Environment feedback — shell exit codes, tool errors, tool-not-found responses

These get combined into five leaderboard signals, each one independently measurable, each one oriented so green always means good.

Design choice

The orchestrator model stays hidden after each session.

This isn't an oversight — it's deliberate. Arena doesn't reveal which model handled a session until evaluations are aggregated, so users focus on the task rather than the brand. It also keeps the leaderboard fair: nobody can selectively rate one model higher because they recognize it.

02 · By the numbers (7-day window)

What real Agent Mode usage actually looks like

These are the headline numbers from Arena's recent 7-day slice of live Agent Mode traffic. They're the raw material the leaderboard is built from.

[Stat counters: the animated values were not captured; labels and notes only.]

  • Agent Mode tasks (160,480 primary intents)
  • Structured tool calls (across the same window)
  • Lines of code written (via successful write_file calls)
  • High-tool-use sessions (after loop / runaway filtering)
  • Sessions using ≥1 tool (out of 128,244 sessions)
  • Sessions running bash (the most-used tool by far)
  • Sessions using web_search (the third most-used tool)
  • Mean tool calls / session (17% of sessions hit 26+)
  • Sessions ending ≥128k tokens (8% exceed 1M input tokens)
  • Workspace downloads (in the same 7-day window)
  • Total sessions observed (model-agnostic count)
  • LOC per coding session (roughly, on average)

03 · What people actually build

The shape of real agent work

Beyond which model wins, the trace stream reveals what people delegate to agents, how complex those tasks get, and what tools dominate.

Task distribution

160,480 tasks · 7d
  • Code writing: 17.5%
  • Research: 10.8%
  • Planning: 10.6%
  • Image/video: 10.2%
  • Document: 9.1%
  • Debug: 8.9%
  • Chitchat: 6.8%
  • Education: 5.7%
  • Creative: 5.3%
  • Other: 15.1%
Top 6 ≈ 67% of all agent tasks. Coding & debugging combined ≈ 26%.

Tool calls by volume

Calls per tool (thousands):

  • bash: 936k
  • write_file: 550k
  • web_search: 276k
  • read_file: 118k
  • fetch_page: 86k
  • list_files: 46k
  • ask_user: 39k
  • gen_image: 10k
Bash + write_file alone = 1.5M of 2.06M total tool calls.

Lines of code written

Top 9 languages · non-blank lines written

  • .py: 8.5M
  • .md: 7.8M
  • .html: 4.3M
  • .js: 3.0M
  • .tsx: 2.6M
  • .ts: 1.8M
  • .php: 1.5M
  • .css: 1.3M
  • .dart: 831k
Python and Markdown dominate. .md is 7.8M lines — agents write a lot of docs.

How people delegate

Most prompts aren't questions — they're handovers. 45% of openings hand off an entire deliverable, 28% ask for advice, 14% let the agent run autonomously, and only 1% direct step-by-step. Yet after the first reply, users pull control back 2.3× more often than they hand over more — the "Steerability" signal is doing real work in real sessions.

Delegation posture

How much users hand over in the opening message.

  • 45% Handed off a deliverable
  • 28% Asked for advice
  • 14% Let it run autonomously
  • 11% Gave a scoped task
  • 1% Directed step-by-step
8,738 openings analyzed. Most prompts hand over a whole job.

Reining in

After the first reply, how control shifts.

  • 50% Took back control
  • 22% Handed over more
After seeing the first reply, users pull control back 2.3× as often as they hand over more.

Bluster & bluffing

Two ways a capable-sounding agent still underdelivers.

BLUSTER

Sounds firm but rarely holds ground.

  • Sounds assertive: 26%
  • Declines change: 2.7%
  • Argues wrong: 1.4%

BLUFFING

Coverage on multi-part asks.

  • Every part: 58%
  • One incomplete: 34%
  • Silently dropped: 8%
Bluffing is the rarer but more consequential failure mode.
04 · The five signals

What the leaderboard actually measures

Each signal is independently measurable, built from real traces, and oriented so that one direction always means better: higher for four of them, lower for Tool Hallucination.

  • Confirmed Success (higher = better)
  • Praise vs Complaint (higher = better)
  • Steerability (higher = better)
  • Bash Recovery (higher = better)
  • Tool Hallucination (lower = better)

Confirmed Success

Built from the final explicit task approval / disapproval within a trace. Arena gives users approve and disapprove buttons on every turn; the final button of a task's trajectory determines the outcome.
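As a rough illustration of the "final button wins" rule, here is a minimal Python sketch. The `Turn` class, its `button` field, and the string values "approve" and "disapprove" are hypothetical names chosen for this example; the source does not describe Arena's actual trace schema, and how traces with no button press are treated is also an assumption here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Turn:
    # Hypothetical field: "approve", "disapprove", or None when no button was pressed.
    button: Optional[str] = None


def confirmed_success(turns: list[Turn]) -> Optional[bool]:
    """Return the outcome implied by the last button press in a trace.

    True for a final approve, False for a final disapprove, and None when
    the user never pressed a button (assumed here to mean "no signal").
    """
    for turn in reversed(turns):
        if turn.button is not None:
            return turn.button == "approve"
    return None


# An early disapprove is overridden by a later approve.
trace = [Turn(), Turn("disapprove"), Turn(), Turn("approve"), Turn()]
outcome = confirmed_success(trace)
```

Scanning from the end means intermediate presses never affect the result, which matches the description that only the final button of the trajectory determines the outcome.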

This is the cleanest signal because it's directly observable. A model that scores well here reliably produces artifacts the user approves. It's the closest thing to a ground-truth "did the agent complete the task" signal that doesn't require an evaluator.

Practical prompt: give the agent a verifiable goal ("build a deployable web app with a download button") rather than a vague one ("help me think about this"). Confirmed Success goes up sharply when completion is observable.

05 · The leaderboard

Aggregate & per-signal rankings

The full table, with the actual numbers from Arena's first leaderboard release (June 2026, 18 models). Green means "above average model"; red means "below"; ± is the 95% confidence interval.

Agent Arena · Net Improvement (τ̂)

18 models · 5 signals · 95% CI shown via bar magnitude
Model | Aggregate | Confirmed Success | Praise vs Complaint | Steerability | Bash Recovery | Tool Hallucination

Source: arena.ai/blog/agent-arena-methodology · The aggregate is the equal-weighted mean of all 5 signals. Tool Hallucination is oriented so a more negative number is better.
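One way to read the equal-weighted aggregate is sketched below in Python. The source states that the aggregate is the mean of the five signals and that Tool Hallucination is better when more negative; the sign flip before averaging is an assumption made for this sketch, as are the signal keys and the toy numbers, which are illustrative inputs and not leaderboard values.

```python
# Hypothetical signal keys; the values passed in are percentage-point effects.
LOWER_IS_BETTER = {"tool_hallucination"}


def aggregate(effects: dict[str, float]) -> float:
    """Equal-weighted mean of per-signal effects.

    Assumption: signals where lower is better are negated first, so that
    a positive contribution always means "better than average".
    """
    oriented = [
        -value if name in LOWER_IS_BETTER else value
        for name, value in effects.items()
    ]
    return sum(oriented) / len(oriented)


# Toy inputs for illustration only (not real leaderboard numbers).
toy_effects = {
    "confirmed_success": 10.0,
    "praise_vs_complaint": 8.0,
    "steerability": 6.0,
    "bash_recovery": 4.0,
    "tool_hallucination": -2.0,  # fewer hallucinated tools than average
}
toy_aggregate = aggregate(toy_effects)  # (10 + 8 + 6 + 4 + 2) / 5
```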

06 · The math, made visual

Causal tracing in three steps

This is the formal machinery behind every number on the leaderboard. You don't need to memorize it, but seeing it once helps the rest of the leaderboard make sense.

Step 1 · Randomize the component

Each session i independently samples an agent configuration T_i from distribution P. Today that means picking one orchestrator model uniformly at random; tomorrow it'll also include tool selection, system prompt, harness, and subagents. Components are sampled independently, so each can be studied in isolation.
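The sampling step can be sketched in a few lines of Python. This is a minimal illustration, assuming uniform choice within each component; the component catalog, the model names, and the function name are hypothetical, and only the orchestrator is randomized, mirroring today's setup.

```python
import random

# Hypothetical catalog. Today only the orchestrator varies; further
# components (tools, system prompt, ...) would be added as extra keys.
COMPONENTS = {
    "orchestrator": ["model_a", "model_b", "model_c"],
}


def sample_configuration(rng: random.Random, components=COMPONENTS) -> dict[str, str]:
    """Draw one option per component, independently and uniformly at random."""
    return {name: rng.choice(options) for name, options in components.items()}


rng = random.Random(0)  # seeded only so the example is reproducible
sessions = [sample_configuration(rng) for _ in range(9000)]

# Tally how often each orchestrator was assigned.
counts: dict[str, int] = {}
for config in sessions:
    counts[config["orchestrator"]] = counts.get(config["orchestrator"], 0) + 1
```

Because each component is drawn independently of the others and of the user, the assignment carries no information about the task, which is what lets the later steps read differences in outcomes as effects of the component.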

[Figure: each session i = 1..n is randomly assigned one of models A, B, or C; the example session draws model B.]

Step 2 · Estimate the treatment effect

The treatment effect of substituting model t for the average is the difference in expected outcomes:

\[ \hat{\tau}_{k \to t} = \mathbb{E}[Y_i \mid T_{i,k} = t] - \mathbb{E}[Y_i] \]

The left term is what would happen if every session used model t; the right term is what actually happens with a random mix. The difference is the net improvement.
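A small Python sketch of this difference, using plain sample means in place of the expectations. The record format (model, outcome) and the toy data are assumptions for illustration; outcomes are 1 for a positive signal (for example an approve) and 0 otherwise.

```python
from collections import defaultdict


def net_improvement(records):
    """Per-model mean outcome minus the overall mean outcome.

    records: iterable of (model, outcome) pairs with outcome in {0, 1}.
    Returns a dict mapping each model to its estimated net improvement.
    """
    records = list(records)
    overall = sum(outcome for _, outcome in records) / len(records)
    by_model = defaultdict(list)
    for model, outcome in records:
        by_model[model].append(outcome)
    return {
        model: sum(outcomes) / len(outcomes) - overall
        for model, outcomes in by_model.items()
    }


# Toy data: model A succeeds 1/4, B 3/4, C 2/4; the overall rate is 6/12.
toy_records = (
    [("A", 1), ("A", 0), ("A", 0), ("A", 0)]
    + [("B", 1), ("B", 1), ("B", 1), ("B", 0)]
    + [("C", 1), ("C", 0), ("C", 1), ("C", 0)]
)
tau = net_improvement(toy_records)  # A: -0.25, B: +0.25, C: 0.0
```

The result is in the same units as the signal: model B sits 25 percentage points above the mixed-traffic average in this toy data.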

[Figure: E[Y | model B] = +9% against E[Y | average] = 0%, giving τ̂ = +9%.]

Step 3 · Reweight for time-decay

To handle distribution shift — new models enter, old ones get retired, user behavior evolves — Arena applies time-decaying weights so recent traces count more than older ones. The estimator is the self-normalized importance-weighted average:

\[ \hat{\tau}_{k \to t} = \frac{\sum_{i:\, T_{i,k} = t} w_i Y_i}{\sum_{i:\, T_{i,k} = t} w_i} - \frac{\sum_{i} w_i Y_i}{\sum_{i} w_i} \]

The weights w_i = q(T_{i,k}) / p_{i,k}(T_{i,k}) come from the ratio of the baseline distribution q (uniform) to the actual sampling distribution p_{i,k}.
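The self-normalized estimator can be sketched as follows in Python. This is a minimal illustration under stated assumptions: each record already carries its weight, the weight is just the ratio q / p described above, and any time-decay factor is left out because the source does not give its form. Function names and the toy data are hypothetical.

```python
def importance_weight(q: float, p: float) -> float:
    """Ratio of baseline probability q to actual sampling probability p."""
    return q / p


def weighted_net_improvement(records, target):
    """Self-normalized weighted mean for `target` minus the weighted overall mean.

    records: list of (model, outcome, weight) triples.
    """
    num_t = sum(w * y for model, y, w in records if model == target)
    den_t = sum(w for model, _, w in records if model == target)
    if den_t == 0:
        raise ValueError(f"no weighted sessions for model {target!r}")
    num_all = sum(w * y for _, y, w in records)
    den_all = sum(w for _, _, w in records)
    return num_t / den_t - num_all / den_all


# Toy setup: model A was sampled 75% of the time and B 25%, while the
# baseline q is uniform over the two models (0.5 each).
w_a = importance_weight(0.5, 0.75)
w_b = importance_weight(0.5, 0.25)
toy = [("A", 1, w_a), ("A", 1, w_a), ("A", 0, w_a), ("B", 0, w_b)]

tau_a = weighted_net_improvement(toy, "A")  # 2/3 - 1/3
tau_b = weighted_net_improvement(toy, "B")  # 0 - 1/3
```

In the toy data the weights rebalance the overall mean to what a uniform mix would have produced: the average of A's rate (2/3) and B's rate (0). With all weights equal, the function reduces to the plain difference in means.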

[Figure: time-decaying weights w_i, increasing from older traces to newer traces.]

The intuition, end-to-end

In plain English: imagine 100,000 real Agent Mode users, each opening a new chat. Arena secretly flips a coin — heads, you get GPT 5.5 (High); tails, you get Gemini 3 Flash. The session runs. The user clicks approve, complains, corrects, downloads, or doesn't. Repeat 100,000 times, randomizing each time.

The math then asks: for sessions where the coin landed on GPT 5.5 (High), how much higher is the approve rate than average? That's the net improvement. It's causal because the randomization removes confounders; it's interpretable because the gap is in the same units as the signal (percentage points).
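To see why randomization matters, the following Python simulation compares a coin-flip assignment with a self-selected one. Everything in it is an assumption made for illustration: two hypothetical models X and Y, a hidden "hard task" confounder, and made-up approval probabilities in which X truly adds 20 points over Y. With a 50/50 mix, the true net improvement of X over the average is therefore 10 points.

```python
import random


def simulate(n: int, randomized: bool, seed: int = 0) -> float:
    """Return the estimated net improvement of model X over the average."""
    rng = random.Random(seed)
    outcomes = {"X": [], "Y": []}
    for _ in range(n):
        hard = rng.random() < 0.5  # hidden task difficulty
        if randomized:
            p_x = 0.5  # coin flip, independent of the task
        else:
            p_x = 0.9 if hard else 0.1  # users send hard tasks to X
        model = "X" if rng.random() < p_x else "Y"
        # Base approval rate depends on difficulty; X adds 0.2 on top.
        p_approve = (0.3 if hard else 0.7) + (0.2 if model == "X" else 0.0)
        outcomes[model].append(1 if rng.random() < p_approve else 0)
    everything = outcomes["X"] + outcomes["Y"]
    mean_x = sum(outcomes["X"]) / len(outcomes["X"])
    return mean_x - sum(everything) / len(everything)


tau_randomized = simulate(100_000, randomized=True)
tau_self_selected = simulate(100_000, randomized=False)
```

Under the coin flip the estimate lands near the true +0.10. Under self-selection X looks worse than average even though it is the better model, because it was handed the harder tasks; that is the kind of confounding the randomized design removes.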

Why this matters

Decoupling contributions.

Today the framework runs with K=1 (only the orchestrator model is randomized). But the same machinery will eventually isolate the causal contribution of every component — tools, system prompts, subagents — letting Arena build leaderboards for each. The hard part is already solved; the rest is just turning on more randomizers.

07 · Real usage examples

The shape of high-effort work

A sample of recent high-tool-use sessions from real Agent Mode users. Six of the nine ended in a workspace download (marked DOWNLOADED), the kind of download counted in section 02.

DOWNLOADED
Web app · data aggregation

Live sports-TV schedule site

Aggregates the day's sports broadcasts across several Italian TV and streaming guides, merging duplicate events, with a password-protected admin page to monitor and repair broken data feeds.

🔧 448 tool calls 💬 140 turns
Claude Opus 4.7 (Thinking)
DOWNLOADED
Full-stack · DevOps

Self-hosted movie watchlist

Took a written product spec + HTML mockup to a Dockerized, self-hosted web app that imports a year of films, filters by region/language, and exports curated watchlists.

🔧 522 tool calls 💬 60 turns
GPT 5.4 (High)
Robotics · control systems

Underwater-vehicle autopilot

Debugged and re-architected an autonomous underwater vehicle control system in ROS/Gazebo — fixing rudder and ballast physics, PID depth/pitch control, and selectable autopilot modes.

🔧 494 tool calls 💬 162 turns
Anonymous model
DOWNLOADED
CAD · creative tooling

Blender add-on for architects

A SketchUp-like architectural sketching workflow in Blender — predictive snapping, guide and tape-measure tools, premium UX. Worked from an existing project codex and schedule.

🔧 546 tool calls 💬 82 turns
GPT 5.4 (High)
AI infrastructure · RAG

Financial research RAG pipeline

A "financial brain" that ingests, cleans, chunks, and embeds finance articles and data feeds for downstream reasoning, with observability, evaluation, and a controlled pilot-execution kit.

🔧 676 tool calls 💬 84 turns
GPT 5.5
Edtech web platform

Live study-tracking platform

Researched leading study platforms, then extended an edtech web app with live study sessions — tracking, leaderboards, badges, and an admin dashboard for inactive students.

🔧 411 tool calls 💬 74 turns
GPT 5.5 (High)
DOWNLOADED
Media infrastructure

RTMP streaming server

A self-hostable RTMP server for streaming from OBS, with browser dashboard, HTTP-FLV playback, start/stop toggle, dark mode, and a settings panel — fixing LAN behavior and port conflicts.

🔧 417 tool calls 💬 130 turns
GPT 5.4 (High)
DOWNLOADED
Consumer web app

Kids' screen-time tracker

A React app for tracking a child's weekly behavior and screen time with admin-only approval workflows, dark mode, and emailed PDF reports with colorful per-week charts.

🔧 440 tool calls 💬 80 turns
Claude Opus 4.6
DOWNLOADED
Systems programming

Minecraft server in Go

A Go implementation of the Minecraft network protocol from a spec, fixing a long chain of compile errors and re-architecting the networking engine from a worker pool to goroutine-per-connection.

🔧 438 tool calls 💬 59 turns
Claude Opus 4.7

Across these 9 sessions: 4,392 total tool calls, 871 total turns. Real users, real work, downloaded artifacts.

08 · Cost vs performance

The Pareto frontier

List price per session vs. net improvement. The dotted line is the Pareto frontier — the cheapest ways to get the most out of an agent. Some models are surprisingly expensive in practice despite cheaper on-paper pricing, because they take more steps per turn or trigger more user follow-ups.

[Figure: scatter plot. x-axis: list-price cost per session (USD, log scale, $0.05 to $10+). y-axis: net improvement (τ̂). Dotted line: Pareto frontier.]

Net improvement by model:

  • GPT 5.5 (High): +10.7%
  • Opus 4.7 (Thinking): +9.5%
  • GPT 5.4 (High): +8.9%
  • Opus 4.6: +8.1%
  • GPT 5.5: +7.5%
  • Opus 4.7: +7.0%
  • Sonnet 4.6: +4.6%
  • GLM 5.1: +3.4%
  • Gemini 3.1 Pro: +1.4%
  • Gemini 3.5 Flash: +0.4%
  • Kimi K2.6: -0.6%
  • DeepSeek V4 Pro: -1.9%
  • Qwen 3.6 Plus: -3.4%
  • DeepSeek V4 Flash: -5.1%
  • Minimax M2.7: -8.5%
  • Gemini 3 Flash: -9.2%
  • Gemma 4 31B: -14.6%
  • Grok 4.3: -25.1%
Approximate positions; cost data per-session from a 7-day window. Models on the frontier are the cheapest at each quality level. Square markers in the source figure sit on the frontier.
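For readers who want to compute a frontier like this themselves, here is a minimal Python sketch. It treats a model as on the frontier when no other model is both at most as expensive and better. The function name, the point names, and the costs and quality values are hypothetical toy inputs; the per-model costs from the figure were not recoverable.

```python
def pareto_frontier(points):
    """Return the names of points that no cheaper-or-equal point beats on quality.

    points: list of (name, cost, quality) with lower cost and higher
    quality being better. Exact duplicates keep only the first entry.
    """
    frontier = []
    best_quality = float("-inf")
    # Walk from cheapest to most expensive; at equal cost, best quality first.
    for name, cost, quality in sorted(points, key=lambda p: (p[1], -p[2])):
        if quality > best_quality:
            frontier.append(name)
            best_quality = quality
    return frontier


# Toy inputs for illustration only: (name, cost per session in USD, net improvement in %).
toy_points = [
    ("cheap_weak", 0.05, -5.0),
    ("mid", 0.20, 1.0),
    ("pricey_mid", 1.00, 0.5),  # dominated by "mid": costs more, scores lower
    ("top", 5.00, 10.0),
]
frontier = pareto_frontier(toy_points)  # ["cheap_weak", "mid", "top"]
```

Sorting by cost first means a single pass suffices: a point joins the frontier only if it beats the best quality seen among everything cheaper.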
09 · What this means for you

Prompting for the leaderboard signals

The five signals Arena scores models on are the same five behaviors that matter in real use. If your prompt maximizes them, you're going to get better outputs — regardless of which model handles the session.

↑ Confirmed Success

Make completion verifiable.

Build a deployable web app
with a download button.
Output: app.zip in workspace.

A user can click approve because the deliverable exists, not because it looks like a deliverable.

↑ Praise vs Complaint

Scope to a clear audience.

Explainer for general
audience, plain language,
strong analogies, visuals.

A model that knows the audience is more likely to produce something you'll praise rather than something you'll silently fix.

↑ Steerability

Pre-declare constraints.

Self-contained HTML.
No external CSS or JS.
Inline SVG for diagrams.

Constraints survive corrections. If you say "no external CSS" and the agent uses one, a one-line correction is enough to land.

↑ Bash Recovery

List a "stop when" condition.

Stop when:
  - file exists in workspace
  - basic check passes
  - all diagrams render

A clear "done" gives the model a recovery target after each bash call, instead of an open-ended retry loop.

↓ Tool Hallucination

Stay in scope of provided tools.

Use only the tools
listed in your toolkit.
If a tool is missing,
say so — don't invent one.

Asking for "a tool that doesn't exist" invites hallucination. Asking for things the harness can do keeps the agent honest.

↑ All of them

Pre-plan in the prompt.

1. Plan the sections first
2. Generate visuals
3. Build the HTML
4. Self-verify by listing
   requirements and marking
   which are satisfied

Telling the model to plan-then-act-then-verify hits Steerability (you can correct the plan), Bash Recovery (verify steps are clear), and Confirmed Success (the artifact exists at the end).

In summary

What Agent Arena is, in one breath

Randomized orchestrator swap → real user feedback (explicit, implicit, environmental) → causal tracing to estimate the treatment effect of each model → 5 leaderboard signals, each oriented so green means good.

  • It's live, not lab. Millions of real sessions across coding, research, planning, docs, and image work — not curated test prompts.
  • The math is causal. Randomization + treatment-effect estimation = interpretable "how much better" instead of "which is preferred."
  • It's componentized. Today K=1 (just orchestrators). Soon: tools, system prompts, subagents, harness, each with its own leaderboard.
  • The signals are user-visible. Approve, complain, correct, recover, hallucinate — all measured from what actually happens.
  • It explains your session too. The model behind any given chat is hidden by design, but the leaderboard tells you which models Arena has graded as best at the behaviors you care about.