Arena doesn't grade models in a lab. It routes every real Agent Mode session to a randomly chosen orchestrator model and watches what actually happens. Then, with a methodology called causal tracing, it estimates how much each model improves (or hurts) the agent's behavior — across millions of in-the-wild interactions.
Live data · arena.ai/agent · Leaderboard at arena.ai/leaderboard/agent
Every time someone opens a new chat in Agent Mode, Arena randomly assigns one of the available orchestrator models to drive the session. Everything else — the harness, tools, system prompt — is shared infrastructure. Because the model is the only thing that varies, Arena can measure how each model causally shifts user-visible behavior. That gives it a clean estimate of model quality without curating prompts or paying evaluators.
Most leaderboards rank models by asking people to pick the better of two side-by-side answers. Arena's previous leaderboards work this way, using Bradley-Terry regression. Agent Mode can't use that approach, because a session has only one agent and so there is no second answer to compare against. Instead, Agent Arena collects single-threaded traces: one user, one agent, possibly hundreds of turns, and a stream of feedback signals along the way.
The methodology is called causal tracing: treat the agent as a multi-component system, randomize which component (which model) gets used, and estimate the treatment effect of substituting that component.
Three kinds of feedback, all harvested from real sessions: explicit, implicit, and environmental.
These get combined into five leaderboard signals, each one independently measurable, each one oriented so green always means good.
This isn't an oversight — it's deliberate. Arena doesn't reveal which model handled a session until evaluations are aggregated, so users focus on the task rather than the brand. It also keeps the leaderboard fair: nobody can selectively rate one model higher because they recognize it.
These are the headline numbers from Arena's recent 7-day slice of live Agent Mode traffic. They're the raw material the leaderboard is built from.
Beyond which model wins, the trace stream reveals what people delegate to agents, how complex those tasks get, and what tools dominate.
Most prompts aren't questions; they're handovers. 45% of openings hand off an entire deliverable, 28% ask for advice, 14% let the agent run autonomously, and only 1% direct step-by-step. Yet after the first reply, users pull control back 2.3× more often than they hand over more of it, so the "Steerability" signal is doing real work in real sessions.
How much users hand over in the opening message.
After the first reply, how control shifts.
Two ways a capable-sounding agent still underdelivers.
Sounds firm but rarely holds ground.
Coverage on multi-part asks.
Each signal is independently measurable, built from real traces, and oriented so higher (or lower, for tool hallucination) always means better.
This is the cleanest signal because it's directly observable. A model that scores well here reliably produces artifacts the user clicks "done" on. It's the closest thing to a ground-truth "did the agent complete the task" signal that doesn't require an evaluator.
Practical prompt: give the agent a verifiable goal ("build a deployable web app with a download button") rather than a vague one ("help me think about this"). Confirmed Success goes up sharply when completion is observable.
The full table, with the actual numbers from Arena's first leaderboard release (June 2026, 18 models). Green means "above average model"; red means "below"; ± is the 95% confidence interval.
| Model | Aggregate | Confirmed Success | Praise vs Complaint | Steerability | Bash Recovery | Tool Hallucination |
|---|---|---|---|---|---|---|
Source: arena.ai/blog/agent-arena-methodology · The aggregate is the equal-weighted mean of all 5 signals. Tool Hallucination is oriented so a more negative number is better.
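The equal-weighted aggregate can be sketched as below. This is only an illustration under stated assumptions: the source does not say how Tool Hallucination's sign is handled inside the aggregate, so the sketch assumes it is negated before averaging so that every term points the same way. The signal keys, the `aggregate` helper, and the example scores are all hypothetical.

```python
# Hypothetical keys for the five signals named in the table header.
SIGNALS = [
    "confirmed_success",
    "praise_vs_complaint",
    "steerability",
    "bash_recovery",
    "tool_hallucination",
]

# Assumption: lower-is-better signals are sign-flipped before averaging.
LOWER_IS_BETTER = {"tool_hallucination"}

def aggregate(scores: dict[str, float]) -> float:
    """Equal-weighted mean of the five signals, all oriented so higher is better."""
    oriented = [
        -scores[name] if name in LOWER_IS_BETTER else scores[name]
        for name in SIGNALS
    ]
    return sum(oriented) / len(oriented)

# Made-up scores for one model, purely for illustration.
example_scores = {
    "confirmed_success": 2.0,
    "praise_vs_complaint": 1.0,
    "steerability": 0.5,
    "bash_recovery": 1.5,
    "tool_hallucination": -1.0,  # negative means fewer hallucinated tool calls
}
print(aggregate(example_scores))
```

If Arena combines the signals differently (for example, by standardizing each one first), only the body of `aggregate` would need to change.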
This is the formal machinery behind every number on the leaderboard. You don't need to memorize it, but seeing it once helps the rest of the leaderboard make sense.
Each session i independently samples an agent configuration
T_i from distribution P. Today that means picking
one orchestrator model uniformly at random; tomorrow it'll also include
tool selection, system prompt, harness, and subagents. Components are
sampled independently, so each can be studied in isolation.
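The sampling step can be sketched in a few lines of Python. This is a minimal illustration rather than Arena's implementation: the component names, the model pool, and the `sample_config` helper are hypothetical, and each component is drawn uniformly and independently as described above.

```python
import random
from collections import Counter

# Hypothetical component pools. Today only the orchestrator varies (K=1);
# further components could be added as extra keys.
COMPONENT_POOLS = {
    "orchestrator": ["model-a", "model-b", "model-c"],
}

def sample_config(rng: random.Random, pools: dict[str, list[str]]) -> dict[str, str]:
    """Draw one option per component, independently and uniformly at random."""
    return {name: rng.choice(options) for name, options in pools.items()}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
configs = [sample_config(rng, COMPONENT_POOLS) for _ in range(30_000)]

# Each orchestrator should receive roughly one third of the sessions.
shares = Counter(config["orchestrator"] for config in configs)
for model, count in sorted(shares.items()):
    print(model, round(count / len(configs), 3))
```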
The treatment effect of substituting model t for the
average is the difference in expected outcomes:
$$\hat{\tau}_{k \to t} = \mathbb{E}[Y_i \mid T_{i,k} = t] - \mathbb{E}[Y_i]$$
The left term is what would happen if every session used model
t; the right term is what actually happens with a
random mix. The difference is the net improvement.
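With uniform random assignment and no reweighting, the definition above reduces to a difference of two sample means. The following sketch shows that plain version; the `net_improvement` helper, the record layout, and the toy data are assumptions made for illustration only.

```python
from statistics import mean

def net_improvement(sessions: list[tuple[str, float]], model: str) -> float:
    """Mean outcome of sessions that used `model`, minus the mean over all sessions.

    `sessions` is a list of (model_name, outcome) pairs, where the outcome is a
    numeric signal such as 1 for an approved deliverable and 0 otherwise.
    """
    treated = [outcome for name, outcome in sessions if name == model]
    if not treated:
        raise ValueError(f"no sessions were assigned to {model!r}")
    return mean(treated) - mean(outcome for _, outcome in sessions)

# Toy data: model-a succeeds in 2 of 3 sessions, model-b in 1 of 3.
toy_sessions = [
    ("model-a", 1), ("model-a", 1), ("model-a", 0),
    ("model-b", 0), ("model-b", 1), ("model-b", 0),
]
print(net_improvement(toy_sessions, "model-a"))
print(net_improvement(toy_sessions, "model-b"))
```

Because the overall mean is 0.5 in this toy data, the two effects are equal in size and opposite in sign.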
To handle distribution shift — new models enter, old ones get retired, user behavior evolves — Arena applies time-decaying weights so recent traces count more than older ones. The estimator is the self-normalized importance-weighted average:
$$\hat{\tau}_{k \to t} = \frac{\sum_{i:\, T_{i,k} = t} w_i Y_i}{\sum_{i:\, T_{i,k} = t} w_i} - \frac{\sum_i w_i Y_i}{\sum_i w_i}$$
The weights $w_i = q(T_{i,k}) / p_{i,k}(T_{i,k})$
come from the ratio of the baseline distribution $q$ (uniform) to the
actual sampling distribution $p_{i,k}$.
In plain English: imagine 100,000 real Agent Mode users, each opening a new chat. Arena secretly flips a coin — heads, you get GPT 5.5 (High); tails, you get Gemini 3 Flash. The session runs. The user clicks approve, complains, corrects, downloads, or doesn't. Repeat 100,000 times, randomizing each time.
The math then asks: for sessions where the coin landed on GPT 5.5 (High), how much higher is the approve rate than average? That's the net improvement. It's causal because the randomization removes confounders; it's interpretable because the gap is in the same units as the signal (percentage points).
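The coin-flip thought experiment can be simulated directly. The sketch below is illustrative only: the two placeholder model names and their true approve rates (30% and 20%) are invented for the simulation and say nothing about any real model.

```python
import random

# Hypothetical true approve rates, chosen only to drive the simulation.
TRUE_APPROVE_RATE = {"heads-model": 0.30, "tails-model": 0.20}

def simulate(n_sessions: int, seed: int = 0) -> list[tuple[str, int]]:
    """Flip a fair coin per session, then draw an approve/no-approve outcome."""
    rng = random.Random(seed)
    sessions = []
    for _ in range(n_sessions):
        model = "heads-model" if rng.random() < 0.5 else "tails-model"
        approved = 1 if rng.random() < TRUE_APPROVE_RATE[model] else 0
        sessions.append((model, approved))
    return sessions

def net_improvement_pp(sessions: list[tuple[str, int]], model: str) -> float:
    """Approve rate for `model` minus the overall approve rate, in percentage points."""
    treated = [y for name, y in sessions if name == model]
    overall = [y for _, y in sessions]
    return 100.0 * (sum(treated) / len(treated) - sum(overall) / len(overall))

sessions = simulate(100_000)
effect_heads = net_improvement_pp(sessions, "heads-model")
effect_tails = net_improvement_pp(sessions, "tails-model")
print(f"heads-model: {effect_heads:+.2f} pp")
print(f"tails-model: {effect_tails:+.2f} pp")
```

With these assumed rates the population average is 25%, so the two estimates should land near +5 and -5 percentage points, with a small amount of sampling noise.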
Today the framework runs with K=1 (only the orchestrator model is randomized). But the same machinery will eventually isolate the causal contribution of every component — tools, system prompts, subagents — letting Arena build leaderboards for each. The hard part is already solved; the rest is just turning on more randomizers.
A sample of recent high-tool-use sessions from real Agent Mode users. These are the sessions that produced the workspace downloads in section 02.
Aggregates the day's sports broadcasts across several Italian TV and streaming guides, merging duplicate events, with a password-protected admin page to monitor and repair broken data feeds.
Took a written product spec + HTML mockup to a Dockerized, self-hosted web app that imports a year of films, filters by region/language, and exports curated watchlists.
Debugged and re-architected an autonomous underwater vehicle control system in ROS/Gazebo — fixing rudder and ballast physics, PID depth/pitch control, and selectable autopilot modes.
A SketchUp-like architectural sketching workflow in Blender — predictive snapping, guide and tape-measure tools, premium UX. Worked from an existing project codex and schedule.
A "financial brain" that ingests, cleans, chunks, and embeds finance articles and data feeds for downstream reasoning, with observability, evaluation, and a controlled pilot-execution kit.
Researched leading study platforms, then extended an edtech web app with live study sessions — tracking, leaderboards, badges, and an admin dashboard for inactive students.
A self-hostable RTMP server for streaming from OBS, with browser dashboard, HTTP-FLV playback, start/stop toggle, dark mode, and a settings panel — fixing LAN behavior and port conflicts.
A React app for tracking a child's weekly behavior and screen time with admin-only approval workflows, dark mode, and emailed PDF reports with colorful per-week charts.
A Go implementation of the Minecraft network protocol from a spec, fixing a long chain of compile errors and re-architecting the networking engine from a worker pool to goroutine-per-connection.
Across these 9 sessions: 4,392 total tool calls, 861 total turns. Real users, real work, downloaded artifacts.
List price per session vs. net improvement. The dotted line is the Pareto frontier — the cheapest ways to get the most out of an agent. Some models are surprisingly expensive in practice despite cheaper on-paper pricing, because they take more steps per turn or trigger more user follow-ups.
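A Pareto frontier over (cost, improvement) points can be computed with a single sorted sweep. The sketch below assumes a model is on the frontier when no other model is both at least as cheap and at least as good. The model names and numbers are made up, and the `pareto_frontier` helper is hypothetical.

```python
def pareto_frontier(points: list[tuple[str, float, float]]) -> list[str]:
    """Return the names on the frontier, ordered from cheapest to most expensive.

    `points` holds (name, cost_per_session, net_improvement) triples.
    """
    # Sort by cost ascending; at equal cost, put the larger improvement first.
    ordered = sorted(points, key=lambda point: (point[1], -point[2]))
    frontier = []
    best_gain = float("-inf")
    for name, _cost, gain in ordered:
        # A point survives only if it beats every cheaper-or-equal point seen so far.
        if gain > best_gain:
            frontier.append(name)
            best_gain = gain
    return frontier

# Made-up data for illustration only.
example_points = [
    ("model-a", 0.10, 1.0),
    ("model-b", 0.20, 0.5),   # dominated by model-a: costs more, improves less
    ("model-c", 0.30, 2.0),
    ("model-d", 0.30, 1.5),   # dominated by model-c at the same cost
    ("model-e", 0.05, -0.5),  # cheapest option, so it stays on the frontier
]
print(pareto_frontier(example_points))
```

Exact duplicates (same cost and same improvement) keep only the first entry, which is fine for drawing a frontier line but worth noting if ties matter.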
The five signals Arena scores models on are the same five behaviors that matter in real use. If your prompt maximizes them, you're going to get better outputs — regardless of which model handles the session.
Make completion verifiable.
Build a deployable web app with a download button. Output: app.zip in workspace.
A user can click approve because the deliverable exists, not because it looks like a deliverable.
Scope to a clear audience.
Explainer for general audience, plain language, strong analogies, visuals.
A model that knows the audience is more likely to produce something you'll praise rather than something you'll silently fix.
Pre-declare constraints.
Self-contained HTML. No external CSS or JS. Inline SVG for diagrams.
Constraints survive corrections. If you say "no external CSS" and the agent uses one, a one-line correction is enough to land.
List a "stop when" condition.
Stop when:
- file exists in workspace
- basic check passes
- all diagrams render
A clear "done" gives the model a recovery target after each bash call, instead of an open-ended retry loop.
Stay in scope of provided tools.
Use only the tools listed in your toolkit. If a tool is missing, say so — don't invent one.
Asking for "a tool that doesn't exist" invites hallucination. Asking for things the harness can do keeps the agent honest.
Pre-plan in the prompt.
1. Plan the sections first
2. Generate visuals
3. Build the HTML
4. Self-verify by listing requirements and marking which are satisfied
Telling the model to plan-then-act-then-verify hits Steerability (you can correct the plan), Bash Recovery (verify steps are clear), and Confirmed Success (the artifact exists at the end).
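The six habits above can be folded into one reusable prompt template. The helper below is a hypothetical convenience, not an Arena tool: `build_prompt` and its parameter names are invented here, and the example wording is taken from the tips themselves.

```python
def build_prompt(goal: str, audience: str, constraints: list[str],
                 stop_when: list[str], plan: list[str]) -> str:
    """Assemble a prompt with a verifiable goal, audience, constraints, stop rules, and plan."""
    lines = [goal, "", f"Audience: {audience}", "", "Constraints:"]
    lines += [f"- {item}" for item in constraints]
    lines += ["", "Stop when:"]
    lines += [f"- {item}" for item in stop_when]
    lines += [
        "",
        "Use only the tools listed in your toolkit. "
        "If a tool is missing, say so; don't invent one.",
        "",
        "Plan:",
    ]
    lines += [f"{number}. {step}" for number, step in enumerate(plan, start=1)]
    return "\n".join(lines)

prompt = build_prompt(
    goal="Build a deployable web app with a download button. Output: app.zip in workspace.",
    audience="general audience, plain language, strong analogies, visuals",
    constraints=["Self-contained HTML", "No external CSS or JS", "Inline SVG for diagrams"],
    stop_when=["file exists in workspace", "basic check passes", "all diagrams render"],
    plan=[
        "Plan the sections first",
        "Generate visuals",
        "Build the HTML",
        "Self-verify by listing requirements and marking which are satisfied",
    ],
)
print(prompt)
```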
Randomized orchestrator swap → real user feedback (explicit, implicit, environmental) → causal tracing to estimate the treatment effect of each model → 5 leaderboard signals, each oriented so green means good.