Agent Arena — A Visual Explainer

01 · The basics

How Agent Arena works, in one paragraph

Every time someone opens a new chat in Agent Mode, Arena randomly assigns one of the available orchestrator models to drive the session. Everything else — the harness, tools, system prompt — is shared infrastructure. Because the model is the only thing that varies, Arena can measure how each model causally shifts user-visible behavior. That gives it a clean estimate of model quality without curating prompts or paying evaluators.

What's different from pairwise voting

Most leaderboards rank models by asking people to pick the better of two side-by-side answers. Arena's previous leaderboards work this way, using Bradley-Terry regression. Agent Mode can't use that — there are no two answers to compare. So instead, Agent Arena collects single-threaded traces: one user, one agent, possibly hundreds of turns, and a stream of feedback signals along the way.

The methodology is called causal tracing: treat the agent as a multi-component system, randomize which component (which model) gets used, and estimate the treatment effect of substituting that component.

What counts as "feedback"

Three kinds, all harvested from real sessions:

Explicit user feedback — approve / disapprove buttons, downloaded artifacts
Implicit user feedback — natural-language praise ("looks great") and complaint ("this is broken")
Environment feedback — shell exit codes, tool errors, tool-not-found responses

These get combined into five leaderboard signals, each one independently measurable, each one oriented so green always means good.
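To make the three feedback kinds concrete, here is a minimal sketch of how raw session events might be bucketed. The event names and the `group_by_kind` helper are hypothetical, chosen only to mirror the examples listed above; they are not Arena's actual schema.

```python
from enum import Enum


class FeedbackKind(Enum):
    EXPLICIT = "explicit user feedback"
    IMPLICIT = "implicit user feedback"
    ENVIRONMENT = "environment feedback"


# Hypothetical event names (assumption), one per example in the text.
EVENT_KIND = {
    "approve_click": FeedbackKind.EXPLICIT,
    "disapprove_click": FeedbackKind.EXPLICIT,
    "artifact_download": FeedbackKind.EXPLICIT,
    "praise_message": FeedbackKind.IMPLICIT,
    "complaint_message": FeedbackKind.IMPLICIT,
    "shell_exit_code": FeedbackKind.ENVIRONMENT,
    "tool_error": FeedbackKind.ENVIRONMENT,
    "tool_not_found": FeedbackKind.ENVIRONMENT,
}


def group_by_kind(events):
    """Bucket a session's event names by feedback kind, preserving order."""
    grouped = {kind: [] for kind in FeedbackKind}
    for name in events:
        grouped[EVENT_KIND[name]].append(name)
    return grouped
```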

Design choice

The orchestrator model stays hidden after each session.

This isn't an oversight — it's deliberate. Arena doesn't reveal which model handled a session until evaluations are aggregated, so users focus on the task rather than the brand. It also keeps the leaderboard fair: nobody can selectively rate one model higher because they recognize it.

03 · What people actually build

The shape of real agent work

Beyond which model wins, the trace stream reveals what people delegate to agents, how complex those tasks get, and what tools dominate.

Task distribution

Code writing · 17.5%
Research · 10.8%
Planning · 10.6%
Image/video · 10.2%
Document · 9.1%
Debug · 8.9%
Chitchat · 6.8%
Education · 5.7%
Creative · 5.3%
Other · 15.1%

Top 6 named categories ≈ 67% of all agent tasks. Coding & debugging combined ≈ 26%.

Tool calls by volume

Bash + write_file alone = 1.5M of 2.06M total tool calls.

Lines of code written

Python and Markdown dominate. .md is 7.8M lines — agents write a lot of docs.


How people delegate

Most prompts aren't questions; they're handovers. 45% of openings hand off an entire deliverable, 28% ask for advice, 14% let the agent run autonomously, 11% give a scoped task, and only 1% direct step-by-step. Yet after the first reply, users pull control back 2.3× as often as they hand over more (50% vs. 22%), so the "Steerability" signal is doing real work in real sessions.

Delegation posture

How much users hand over in the opening message.

45% Handed off a deliverable
28% Asked for advice
14% Let it run autonomously
11% Gave a scoped task
1% Directed step-by-step

8,738 openings analyzed. Most prompts hand over a whole job.

Reining in

After the first reply, how control shifts.

50% Took back control
22% Handed over more

After seeing the first reply, users pull control back 2.3× as often as they hand over more.

Bluster & bluffing

Two ways a capable-sounding agent still underdelivers.

BLUSTER

Sounds firm but rarely holds ground.

Sounds assertive · 26%
Declines change · 2.7%
Argues wrong · 1.4%

BLUFFING

Coverage on multi-part asks.

Every part · 58%
One incomplete · 34%
Silently dropped · 8%

Bluffing is the rarer but more consequential failure mode.

04 · The five signals

What the leaderboard actually measures

Each signal is independently measurable, built from real traces, and oriented so higher (or lower, for tool hallucination) always means better.

Confirmed Success ↑

Praise vs Complaint ↑

Steerability ↑

Bash Recovery ↑

Tool Hallucination ↓

higher = better

Confirmed Success

Built from the final explicit task approval or disapproval within a trace. Arena gives users approve and disapprove buttons on every turn; the last button clicked in a task's trajectory determines the outcome.
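A minimal sketch of that rule, assuming each task arrives as an ordered list of event names. The event strings and the choice to return `None` for tasks with no click are assumptions for illustration, not Arena's documented behavior.

```python
from typing import Optional


def confirmed_success(events: list[str]) -> Optional[bool]:
    """Outcome of a task from its last approve/disapprove click.

    Returns True or False from the final click, or None when the user
    never clicked either button (treating such tasks as unrated is an
    assumption of this sketch).
    """
    clicks = [e for e in events if e in ("approve", "disapprove")]
    if not clicks:
        return None
    return clicks[-1] == "approve"
```

Earlier clicks are ignored on purpose: a task that was disapproved twice and then approved counts as a success, because only the end of the trajectory is scored.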

This is the cleanest signal because it's directly observable. A model that scores well here reliably produces artifacts the user clicks "approve" on. It's the closest thing to a ground-truth "did the agent complete the task" signal that doesn't require an evaluator.

Practical prompt: give the agent a verifiable goal ("build a deployable web app with a download button") rather than a vague one ("help me think about this"). Confirmed Success goes up sharply when completion is observable.

higher = better

Praise vs Complaint

For each task, identify messages expressing explicit verbal praise ("looks great", "this is exactly what I needed") or explicit verbal complaint ("this is broken", "you misunderstood"). A task is marked a success if praise messages outnumber complaint messages.
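The counting rule can be sketched as follows. The sketch assumes each user message has already been labeled `"praise"`, `"complaint"`, or `"neutral"` by some upstream classifier, which is outside its scope; the label strings are hypothetical.

```python
from collections import Counter


def praise_vs_complaint(labels):
    """True when praise messages strictly outnumber complaint messages.

    Ties, including a task with neither praise nor complaint, are not
    successes under a literal reading of "outnumbers". How Arena treats
    those cases is not stated, so this is an assumption.
    """
    counts = Counter(labels)
    return counts["praise"] > counts["complaint"]
```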

This isolates natural-language user satisfaction from button clicks. A model can lose on Confirmed Success but win here (user said "almost there, one more thing") or vice versa (user clicked approve but typed "this is rough"). The signal helps Arena see the gap between formal approval and felt satisfaction.

Practical prompt: be honest in feedback. Empty "looks great!" on a bad output teaches the leaderboard the wrong thing. Saying "this is broken because X" is more useful than silent disapproval.

higher = better

Steerability

When a user issues an in-line correction ("no, do X instead", "you misread the file"), does the very next response actually land — accepted, extended, or redirected — instead of being rejected or going nowhere?
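One way to turn that question into a number is sketched below. It assumes each correction has already been annotated with how the next response fared; the label strings follow the wording above, and aggregating them as a simple rate is an assumption, since the text does not say how Arena combines them.

```python
LANDED = {"accepted", "extended", "redirected"}
MISSED = {"rejected", "no_progress"}  # "no_progress" stands in for "going nowhere"


def correction_landed(outcome: str) -> bool:
    """Map one annotated correction outcome to landed / not landed."""
    if outcome in LANDED:
        return True
    if outcome in MISSED:
        return False
    raise ValueError(f"unknown outcome label: {outcome!r}")


def steerability_rate(outcomes):
    """Share of corrections whose very next response landed, or None if there were none."""
    if not outcomes:
        return None
    return sum(correction_landed(o) for o in outcomes) / len(outcomes)
```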

Mistakes are inevitable in real work. What separates a useful agent from a frustrating one is how quickly it recovers when the user pushes back. Steerability captures the iterative quality of agent work — not whether the first attempt was right, but whether the agent can be pulled onto the right track without restarting.

Practical prompt: correct specifically. "Use indigo not amber for the accent color" beats "this looks off." Specific corrections give the agent something concrete to land on.

higher = better

Bash Recovery

When the agent issues a bash command that errors due to a model failure (not an environment issue), the recovery clock starts. We count follow-up bash calls until the next non-erroring command. If the agent gives up, we impose an additional penalty.
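The recovery clock can be sketched as a single pass over a session's bash calls. The sketch assumes environment-caused errors have already been filtered out, and the size of the give-up penalty is a placeholder; the text does not state the real value.

```python
def recovery_costs(errored, give_up_penalty=3):
    """Cost of each error episode in one session.

    `errored` holds one bool per bash call, in order; True means the call
    failed because of a model mistake. An episode starts at the first
    failing call, and its cost is the number of follow-up calls up to and
    including the next successful one. An episode still open at the end of
    the session (the agent gave up) gets `give_up_penalty` added, which is
    a hypothetical value.
    """
    costs = []
    follow_ups = None  # None means no recovery clock is running
    for failed in errored:
        if follow_ups is None:
            if failed:
                follow_ups = 0  # clock starts at the first error
        else:
            follow_ups += 1  # every call after that error is a follow-up
            if not failed:
                costs.append(follow_ups)
                follow_ups = None
    if follow_ups is not None:
        costs.append(follow_ups + give_up_penalty)
    return costs
```

Under this scoring, an agent that fixes its command on the next try costs 1, while one that repeats the broken command twice before succeeding costs 3.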

Bash is the most-used tool (936k calls in a week), so its error rate is the largest single source of agent friction. A model that recognizes its own typo and fixes it on the second try scores well; one that retries the exact same broken command three times scores poorly.

Practical prompt: if the agent is looping on a command, take back control ("stop — try X instead"). Letting it loop is a Steerability failure that also tanks Bash Recovery.

lower = better

Tool Hallucination

Penalizes invented tool names, malformed syntax that produces a junk name, and chain-of-thought tokens leaking into the tool field. The task is marked a failure if the agent calls a nonexistent tool.
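The failure rule reduces to a membership check against the harness's tool list. In the sketch below, the three tool names are ones this explainer mentions; the real toolkit is presumably larger, and the function name is hypothetical.

```python
# Tool names mentioned in this explainer; the full toolkit is an unknown.
AVAILABLE_TOOLS = frozenset({"bash", "read_file", "write_file"})


def has_tool_hallucination(called_tools, available=AVAILABLE_TOOLS):
    """True if any call in the task names a tool the harness does not provide.

    Invented names, junk names produced by malformed syntax, and leaked
    reasoning text in the tool field all fail the same membership test.
    """
    return any(name not in available for name in called_tools)
```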

The orientation flips here: lower is better. A model that hallucinates tools can't recover — once it sends a malformed tool call, the harness rejects it and the user sees a hard error. This is the one signal where a negative score (green) is genuinely good news.

Practical prompt: only ask for things the agent's tool list can do. Asking it to "use my filesystem" is fine when the agent already has bash and read_file. Asking for a tool that doesn't exist invites hallucination.

06 · The math, made visual

Causal tracing in three steps

This is the formal machinery behind every number on the leaderboard. You don't need to memorize it, but seeing it once helps the rest of the leaderboard make sense.

Step 1 · Randomize the component

Each session i independently samples an agent configuration T_i from distribution P. Today that means picking one orchestrator model uniformly at random; tomorrow it'll also include tool selection, system prompt, harness, and subagents. Components are sampled independently, so each can be studied in isolation.
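A minimal sketch of Step 1 in the current single-component setting: each new session draws its orchestrator uniformly at random, independently of every other session. The model names are placeholders, not Arena's roster.

```python
import random
from collections import Counter

# Placeholder pool (assumption); the real model list is not given here.
MODELS = ["model-a", "model-b", "model-c"]


def assign_orchestrator(rng: random.Random, models=MODELS) -> str:
    """Pick one orchestrator uniformly at random for a new session."""
    return rng.choice(models)


def assign_sessions(n_sessions: int, seed: int = 0) -> list[str]:
    """Draw an independent assignment T_i for each of n_sessions sessions."""
    rng = random.Random(seed)
    return [assign_orchestrator(rng) for _ in range(n_sessions)]


if __name__ == "__main__":
    counts = Counter(assign_sessions(30_000))
    for model, n in sorted(counts.items()):
        print(f"{model}: {n / 30_000:.3f}")  # each share should be near 1/3
```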

Step 2 · Estimate the treatment effect

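The treatment effect of substituting model t into component k, relative to the average, is the difference in expected outcomes:

τ_k→t = E[Y_i | T_i,k = t] − E[Y_i]

Here Y_i is session i's outcome on a given signal and T_i,k is the value of component k in that session's configuration; today the only randomized component is the orchestrator model.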

The left term is what would happen if every session used model t; the right term is what actually happens with a random mix. The difference is the net improvement.
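With uniform randomization and no reweighting, that difference can be estimated by plain sample means. The sketch below assumes binary outcomes (1 for success, 0 for failure) and toy model labels; it illustrates the formula and is not Arena's code.

```python
from statistics import mean


def treatment_effect(sessions, model):
    """Mean outcome for `model` minus the mean outcome over all sessions.

    `sessions` is a list of (assigned_model, outcome) pairs.
    """
    treated = [y for t, y in sessions if t == model]
    if not treated:
        raise ValueError(f"no sessions were assigned to {model!r}")
    return mean(treated) - mean(y for _, y in sessions)


if __name__ == "__main__":
    # Toy data: model A succeeds in 3 of 4 sessions, model B in 1 of 4.
    toy = [("A", 1), ("A", 1), ("A", 1), ("A", 0),
           ("B", 1), ("B", 0), ("B", 0), ("B", 0)]
    print(treatment_effect(toy, "A"))  # 0.25
    print(treatment_effect(toy, "B"))  # -0.25
```

The overall success rate in the toy data is 0.5, so model A sits 25 percentage points above average and model B 25 points below.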

Step 3 · Reweight for time-decay

To handle distribution shift — new models enter, old ones get retired, user behavior evolves — Arena applies time-decaying weights so recent traces count more than older ones. The estimator is the self-normalized importance-weighted average:

τ̂_k→t = Σ_{i: T_i,k=t} w_i Y_i  /  Σ_{i: T_i,k=t} w_i
           − Σ_i w_i Y_i  /  Σ_i w_i

The weights w_i = q(T_i,k) / p_i,k(T_i,k) are the ratio of the baseline distribution q (uniform over models) to p_i,k, the distribution actually used to sample component k for session i.
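The estimator can be sketched directly from the formula. This sketch implements only the importance ratio as written; it does not model a time-decay schedule, because the text does not specify one. Function names and the toy numbers are assumptions.

```python
def importance_weight(q: float, p: float) -> float:
    """w_i = q / p: baseline probability over actual sampling probability."""
    if p <= 0:
        raise ValueError("sampling probability must be positive")
    return q / p


def weighted_effect(sessions, model):
    """Self-normalized importance-weighted treatment effect.

    `sessions` is a list of (assigned_model, outcome, weight) triples.
    Returns the weighted mean outcome for `model` minus the weighted
    mean outcome over all sessions.
    """
    num_t = sum(w * y for t, y, w in sessions if t == model)
    den_t = sum(w for t, _, w in sessions if t == model)
    num_all = sum(w * y for _, y, w in sessions)
    den_all = sum(w for _, _, w in sessions)
    if den_t == 0 or den_all == 0:
        raise ValueError("not enough weighted sessions to estimate the effect")
    return num_t / den_t - num_all / den_all


if __name__ == "__main__":
    # Toy case: A was sampled 80% of the time, B 20%; the baseline is 50/50.
    w_a = importance_weight(0.5, 0.8)
    w_b = importance_weight(0.5, 0.2)
    toy = [("A", 1, w_a)] * 6 + [("A", 0, w_a)] * 2 + [("B", 1, w_b), ("B", 0, w_b)]
    print(round(weighted_effect(toy, "A"), 3))
```

In the toy case A succeeds in 6 of 8 sessions and B in 1 of 2. Reweighting to the uniform baseline puts the average at 0.625 rather than the raw 0.7, so A's effect comes out at about +0.125 instead of being diluted by its own over-representation.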

The intuition, end-to-end

In plain English: imagine 100,000 real Agent Mode users, each opening a new chat. Arena secretly flips a coin — heads, you get GPT 5.5 (High); tails, you get Gemini 3 Flash. The session runs. The user clicks approve, complains, corrects, downloads, or doesn't. Repeat 100,000 times, randomizing each time.

The math then asks: for sessions where the coin landed on GPT 5.5 (High), how much higher is the approve rate than average? That's the net improvement. It's causal because the randomization removes confounders; it's interpretable because the gap is in the same units as the signal (percentage points).
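The coin-flip story can be simulated in a few lines. The two approve rates below are invented purely for illustration and say nothing about any real model; the labels `heads_model` and `tails_model` are placeholders.

```python
import random

# Invented ground-truth approve rates (assumption, for illustration only).
TRUE_APPROVE_RATE = {"heads_model": 0.60, "tails_model": 0.50}


def simulate(n_sessions=100_000, seed=0):
    """Flip a fair coin per session, then draw an approve/no-approve outcome."""
    rng = random.Random(seed)
    models = list(TRUE_APPROVE_RATE)
    sessions = []
    for _ in range(n_sessions):
        model = rng.choice(models)
        approved = rng.random() < TRUE_APPROVE_RATE[model]
        sessions.append((model, int(approved)))
    return sessions


def effect_in_points(sessions, model):
    """Approve rate for `model` minus the overall rate, in percentage points."""
    treated = [y for t, y in sessions if t == model]
    overall = [y for _, y in sessions]
    return 100 * (sum(treated) / len(treated) - sum(overall) / len(overall))


if __name__ == "__main__":
    data = simulate()
    for name in TRUE_APPROVE_RATE:
        print(f"{name}: {effect_in_points(data, name):+.1f} points vs. average")
```

Because the coin is fair, the average approve rate is about 55%, so the estimated effects should land near +5 and −5 points, matching the gap that was built into the simulation.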


Why this matters

Decoupling contributions.

Today the framework runs with K=1 (only the orchestrator model is randomized). But the same machinery will eventually isolate the causal contribution of every component — tools, system prompts, subagents — letting Arena build leaderboards for each. The hard part is already solved; the rest is just turning on more randomizers.

07 · Real usage examples

The shape of high-effort work

A sample of recent high-tool-use sessions from real Agent Mode users. These are the sessions that produced the workspace downloads in section 02.

DOWNLOADED

Web app · data aggregation

Live sports-TV schedule site

Aggregates the day's sports broadcasts across several Italian TV and streaming guides, merging duplicate events, with a password-protected admin page to monitor and repair broken data feeds.

🔧 448 tool calls 💬 140 turns

Claude Opus 4.7 (Thinking)

DOWNLOADED

Full-stack · DevOps

Self-hosted movie watchlist

Took a written product spec + HTML mockup to a Dockerized, self-hosted web app that imports a year of films, filters by region/language, and exports curated watchlists.

🔧 522 tool calls 💬 60 turns

GPT 5.4 (High)

Robotics · control systems

Underwater-vehicle autopilot

Debugged and re-architected an autonomous underwater vehicle control system in ROS/Gazebo — fixing rudder and ballast physics, PID depth/pitch control, and selectable autopilot modes.

🔧 494 tool calls 💬 162 turns

Anonymous model

DOWNLOADED

CAD · creative tooling

Blender add-on for architects

A SketchUp-like architectural sketching workflow in Blender — predictive snapping, guide and tape-measure tools, premium UX. Worked from an existing project codex and schedule.

🔧 546 tool calls 💬 82 turns

GPT 5.4 (High)

AI infrastructure · RAG

Financial research RAG pipeline

A "financial brain" that ingests, cleans, chunks, and embeds finance articles and data feeds for downstream reasoning, with observability, evaluation, and a controlled pilot-execution kit.

🔧 676 tool calls 💬 84 turns

GPT 5.5

Edtech web platform

Live study-tracking platform

Researched leading study platforms, then extended an edtech web app with live study sessions — tracking, leaderboards, badges, and an admin dashboard for inactive students.

🔧 411 tool calls 💬 74 turns

GPT 5.5 (High)

DOWNLOADED

Media infrastructure

RTMP streaming server

A self-hostable RTMP server for streaming from OBS, with browser dashboard, HTTP-FLV playback, start/stop toggle, dark mode, and a settings panel — fixing LAN behavior and port conflicts.

🔧 417 tool calls 💬 130 turns

GPT 5.4 (High)

DOWNLOADED

Consumer web app

Kids' screen-time tracker

A React app for tracking a child's weekly behavior and screen time with admin-only approval workflows, dark mode, and emailed PDF reports with colorful per-week charts.

🔧 440 tool calls 💬 80 turns

Claude Opus 4.6

DOWNLOADED

Systems programming

Minecraft server in Go

A Go implementation of the Minecraft network protocol built from a spec, fixing a long chain of compile errors and re-architecting the networking engine from a worker pool to goroutine-per-connection.

🔧 438 tool calls 💬 59 turns

Claude Opus 4.7

Across these 9 sessions: 4,392 total tool calls, 871 total turns. Real users, real work, downloaded artifacts.

09 · What this means for you

Prompting for the leaderboard signals

The five signals Arena scores models on are the same five behaviors that matter in real use. If your prompt maximizes them, you're going to get better outputs — regardless of which model handles the session.

↑ Confirmed Success

Make completion verifiable.

Build a deployable web app
with a download button.
Output: app.zip in workspace.

A user can click approve because the deliverable exists, not because it looks like a deliverable.

↑ Praise vs Complaint

Scope to a clear audience.

Explainer for general
audience, plain language,
strong analogies, visuals.

A model that knows the audience is more likely to produce something you'll praise rather than something you'll silently fix.

↑ Steerability

Pre-declare constraints.

Self-contained HTML.
No external CSS or JS.
Inline SVG for diagrams.

Constraints survive corrections. If you say "no external CSS" and the agent uses one, a one-line correction is enough to land.

↑ Bash Recovery

List a "stop when" condition.

Stop when:
  - file exists in workspace
  - basic check passes
  - all diagrams render

A clear "done" gives the model a recovery target after each bash call, instead of an open-ended retry loop.

↓ Tool Hallucination

Stay in scope of provided tools.

Use only the tools
listed in your toolkit.
If a tool is missing,
say so — don't invent one.

Asking for "a tool that doesn't exist" invites hallucination. Asking for things the harness can do keeps the agent honest.

↑ All of them

Pre-plan in the prompt.

1. Plan the sections first
2. Generate visuals
3. Build the HTML
4. Self-verify by listing
   requirements and marking
   which are satisfied

Telling the model to plan-then-act-then-verify hits Steerability (you can correct the plan), Bash Recovery (verify steps are clear), and Confirmed Success (the artifact exists at the end).

In summary

What Agent Arena is, in one breath

Randomized orchestrator swap → real user feedback (explicit, implicit, environmental) → causal tracing to estimate the treatment effect of each model → 5 leaderboard signals, each oriented so green means good.

It's live, not lab. Millions of real sessions across coding, research, planning, docs, and image work — not curated test prompts.
The math is causal. Randomization + treatment-effect estimation = interpretable "how much better" instead of "which is preferred."
It's componentized. Today K=1 (just orchestrators). Soon: tools, system prompts, subagents, harness — each with their own leaderboard.
The signals are user-visible. Approve, complain, correct, recover, hallucinate — all measured from what actually happens.
It explains your session too. The model behind any given chat is hidden by design, but the leaderboard tells you which models Arena has graded as best at the behaviors you care about.
