Your benchmarks are lying to you, and your judge is to blame!

15 May 2026 · 9 minute read

Simon Maple

Simon Maple is Tessl’s Founding Developer Advocate, a Java Champion, and former DevRel leader at Snyk, ZeroTurnaround, and IBM.

Last week I published a benchmark comparing six models across eleven agent skills. The numbers in that post are averages across three judges, and I didn't explain why.

When I shared the data internally, Maria from our AI Research team pointed out something that we should take very seriously: an LLM judge is likely to favour outputs from its own model family.

So I ran the full benchmark again with a second judge, then a third, to see if the hypothesis held water. The scores shifted, the rankings moved, and one model swung 47 percentage points on a single skill depending on the judge who graded it. If you are publishing or trusting eval numbers from a single LLM judge, you are partly benchmarking judge preference rather than model capability.

TL;DR for this post

We’re all busy people, so here’s the TL;DR.

  • The same benchmark, graded by three different LLM judges, produced different scores and different rankings. One model swung 47 points on a single skill.
  • Sonnet is the most generous judge, GPT-5.5 the strictest. The gap between them averages 6.9 points across all models and skills.
  • LLM judges favour their own model family. Opus gave itself a 4.6 point boost over what the other two judges awarded it.
  • Rankings only stay stable at the very top: opus-4-7 held first place under all three judges, while everything below it moved position.
  • Eval criteria built around binary, verifiable things (file deleted or not, flag enabled or not) produce stable scores regardless of the judge. Skills that require qualitative judgment can easily swing 25 percentage points.
  • Fix: run multiple judges and average. If you know which model you’ll tend to use in development, favour that same model as your judge. Design rubrics with yes/no criteria wherever the task allows it.
Join us at AI Native DevCon (use C0DE30 for 30% discount)

The Setup

Six models, eleven skills, five scenarios per agent skill, one rubric. The only variable was the scoring model, i.e. the judge: Sonnet, GPT-5.5, and Opus-4-7 each graded every run independently. The figures in our main benchmark are averaged across all three. This post is about what happens before the averaging.

The raw results

The Tessl UI shows eval results as an average across all runs, but you can still see full details of the scenarios/tasks that were set and how they fared with and without the skills, on the Tessl registry here.

Criteria are pretty easy to create with Tessl. Once you’ve installed the Tessl command line tool and authenticated with a free account:

$ curl -fsSL https://get.tessl.io | sh

Simply ask your agent to create them using Tessl, or run the following at the command line:

$ tessl scenario generate <path/to/tile> --count=5

You can then download them to disk to validate them and make any changes you want:

$ tessl scenario download --last

You can now run them, choosing the model that runs the scenarios with the --agent flag and the model that judges the output with the --scorer-agent flag.

$ tessl eval run <path/to/tile> --agent=claude:claude-opus-4-6 --scorer-agent=codex:gpt-5.5
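
Since the whole point of this post is that one judge is not enough, you will probably want to run the same eval under several scorer agents. Here is a minimal sketch in Python, assuming the `tessl eval run` flags shown above; the scorer identifiers marked as placeholders are not an authoritative list, so substitute whatever models your installation offers:

```python
# Minimal sketch: rerun the same eval under several scorer agents.
# Assumes the `tessl eval run` flags shown above; identifiers marked
# as placeholders are illustrative, not an authoritative list.
import subprocess

TILE = "path/to/tile"
AGENT = "claude:claude-opus-4-6"  # the model under test
JUDGES = [
    "claude:claude-sonnet",    # placeholder identifier
    "codex:gpt-5.5",
    "claude:claude-opus-4-7",  # placeholder identifier
]

for judge in JUDGES:
    # One full eval run per judge; average the resulting scores afterwards.
    subprocess.run(
        ["tessl", "eval", "run", TILE,
         f"--agent={AGENT}",
         f"--scorer-agent={judge}"],
        check=True,
    )
```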

Judge Strictness: Sonnet Grades Easiest, GPT-5.5 Grades Hardest

Averaged across all six models and all eleven skills, here is what each judge returned:

| Judge | Avg without-skill | Avg with-skill | Avg lift |
|---|---|---|---|
| Sonnet | 76.1 | 90.3 | 14.2 |
| Opus-4-7 | 72.6 | 88.3 | 15.7 |
| GPT-5.5 | 70.7 | 83.4 | 12.7 |

Sonnet grades most generously. GPT-5.5 is 6.9 points stricter on average. If your pipeline scores agents with Sonnet as the default judge, your numbers are probably 5 to 7 points higher than a stricter grader would return. That gap is real and it is not uniformly distributed across models.

The Rankings Shift

Here are the per-judge leaderboards for with-skill performance, using the same models and rubrics with only the judge swapped:

| Rank | Sonnet | GPT-5.5 | Opus-4-7 |
|---|---|---|---|
| 1 | opus-4-7 (94.5) | opus-4-7 (89.2) | opus-4-7 (96.5) |
| 2 | gpt-5.4 (92.7) | gpt-5.5 (88.4) | gpt-5.5 (92.3) |
| 3 | gpt-5.3 (91.9) | composer (88.0) | composer (90.3) |
| 4 | composer (90.5) | gpt-5.4 (86.5) | gpt-5.4 (88.8) |
| 5 | gpt-5.5 (87.4) | gpt-5.3 (75.7) | gpt-5.3 (84.0) |
| 6 | gpt-5-codex (85.1) | gpt-5-codex (72.9) | gpt-5-codex (78.1) |

The leaderboard format shows position but hides distances. Here are the raw scores:

| Model | Sonnet | GPT-5.5 | Opus-4-7 | Avg | Swing |
|---|---|---|---|---|---|
| opus-4-7 | 94.5 | 89.2 | 96.5 | 93.4 | 7.3 |
| composer | 90.5 | 88.0 | 90.3 | 89.6 | 2.5 |
| gpt-5.5 | 87.4 | 88.4 | 92.3 | 89.4 | 4.9 |
| gpt-5.4 | 92.7 | 86.5 | 88.8 | 89.3 | 6.2 |
| gpt-5.3 | 91.9 | 75.7 | 84.0 | 83.9 | 16.2 |
| gpt-5-codex | 85.1 | 72.9 | 78.1 | 78.7 | 12.2 |

The swing column is the gap between the highest and lowest score any judge gave a model. composer swings 2.5 points, meaning all three judges broadly agree on it. gpt-5.3 swings 16.2 points, which is more than the gap between first and last place in the averaged rankings.
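
If you want to sanity-check the two derived columns, here is a small Python sketch that recomputes them from the per-judge with-skill scores in the table above:

```python
# Recompute the Avg and Swing columns from the per-judge with-skill
# scores in the table above (judge order: Sonnet, GPT-5.5, Opus-4-7).
scores = {
    "opus-4-7":    (94.5, 89.2, 96.5),
    "composer":    (90.5, 88.0, 90.3),
    "gpt-5.5":     (87.4, 88.4, 92.3),
    "gpt-5.4":     (92.7, 86.5, 88.8),
    "gpt-5.3":     (91.9, 75.7, 84.0),
    "gpt-5-codex": (85.1, 72.9, 78.1),
}

for model, per_judge in scores.items():
    avg = sum(per_judge) / len(per_judge)    # cross-judge average
    swing = max(per_judge) - min(per_judge)  # highest minus lowest grade
    print(f"{model:11}  avg={avg:.1f}  swing={swing:.1f}")
```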

opus-4-7 holds first place under every judge, and that is the one stable finding in the table. Everything else shifts. gpt-5.3 sits third under Sonnet and falls to fifth under both GPT-5.5 and Opus. gpt-5.5 sits fifth under Sonnet and climbs to second under the other two. The Sonnet-only leaderboard, which is what most default Tessl runs would have produced, gives a flattering picture of gpt-5.3 and an unflattering one of gpt-5.5.

Judge choice also affects how much credit each model gets for using skill context. Lift scores, meaning the gap between baseline and with-skill performance, vary considerably by judge:

| Model | Sonnet lift | GPT-5.5 lift | Opus lift | Avg lift |
|---|---|---|---|---|
| gpt-5.3 | 16.1 | 16.2 | 22.9 | 18.4 |
| composer | 16.9 | 13.8 | 15.4 | 15.4 |
| gpt-5.4 | 16.8 | 13.5 | 15.3 | 15.2 |
| gpt-5.5 | 10.2 | 15.1 | 16.2 | 13.8 |
| opus-4-7 | 14.0 | 9.7 | 14.1 | 12.6 |
| gpt-5-codex | 11.3 | 8.3 | 10.4 | 10.0 |

The Opus judge gives gpt-5.3 a lift of 22.9 points. Sonnet and GPT-5.5 give it roughly 16. The rubric was identical. The disagreement is purely about whether gpt-5.3's output counted as genuine compliance or a close approximation, and a single judge cannot tell you which reading is correct.
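
Since each judge's baseline score is just its with-skill score minus its lift, you can back out per-judge baselines for gpt-5.3 from the two tables above, and the judges disagree there too:

```python
# Derive gpt-5.3's per-judge baseline (with-skill score minus lift),
# using only numbers from the two tables above.
with_skill = {"Sonnet": 91.9, "GPT-5.5": 75.7, "Opus-4-7": 84.0}
lift       = {"Sonnet": 16.1, "GPT-5.5": 16.2, "Opus-4-7": 22.9}

baselines = {judge: with_skill[judge] - lift[judge] for judge in with_skill}
print(baselines)  # Sonnet: 75.8, GPT-5.5: 59.5, Opus-4-7: 61.1

# The three baselines average out to ~65.5, matching the averaged
# table later in this post.
print(sum(baselines.values()) / len(baselines))
```

That is a 16-point spread on the baseline alone: the judges do not even agree on where gpt-5.3 starts, let alone how much the skill helps it.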

Self-Judge Bias Is Measurable

The results split along model family lines, though not symmetrically.

| Model | Own judge score | Other judges avg | Boost |
|---|---|---|---|
| opus-4-7 (Opus judge) | 96.5 | 91.9 | +4.6 |
| gpt-5.5 (GPT-5.5 judge) | 88.4 | 89.9 | -1.5 |

The Opus case is unambiguous. Opus gives itself 96.5; Sonnet gives it 94.5; GPT-5.5 gives it 89.2. The 7.3 point gap between Opus-as-judge and GPT-5.5-as-judge for the same model on the same runs is entirely a grading artefact. The gpt-5.5 case does not follow the same pattern: GPT-5.5 actually scores its own model lower than the other two judges do, and Opus gives gpt-5.5 its highest score at 92.3. Self-favour exists but is not symmetric, and its size and direction vary by model and judge pairing.

The practical consequence for the Opus case specifically: if you are using Claude models to grade Claude outputs, expect a systematic upward bias of 4 to 5 points. It does not show up as a bias in your data; it just looks like good scores.
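
Expressed as a calculation, with the numbers taken from the tables above:

```python
# Self-judge boost: a model's score from its own family's judge, minus
# the mean of what the other two judges awarded the same runs.
def self_judge_boost(own_judge: float, other_judges: list[float]) -> float:
    return own_judge - sum(other_judges) / len(other_judges)

# opus-4-7 graded by Opus vs by Sonnet and GPT-5.5: +4.65
# (+4.6 in the table, which rounds the other-judges average first)
print(self_judge_boost(96.5, [94.5, 89.2]))

# gpt-5.5 graded by GPT-5.5 vs by Sonnet and Opus: -1.45 (-1.5 in the table)
print(self_judge_boost(88.4, [87.4, 92.3]))
```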

The Averaged Picture

Given all this variance, here is what you get when you average across all three judges:

| Model | Avg baseline | Avg with-skill | Avg lift |
|---|---|---|---|
| opus-4-7 | 80.8 | 93.4 | 12.6 |
| composer | 74.2 | 89.6 | 15.4 |
| gpt-5.5 | 75.5 | 89.4 | 13.8 |
| gpt-5.4 | 74.1 | 89.3 | 15.2 |
| gpt-5.3 | 65.5 | 83.9 | 18.4 |
| gpt-5-codex | 68.7 | 78.7 | 10.0 |

The lift column deserves a second look. gpt-5.3 shows the largest average lift at 18.4 points, which sounds like a strength, but its baseline is also the weakest of any non-codex model, nearly nine points behind the next lowest. It benefits most from skill context and starts furthest behind without it.

What the gpt-5.3 Drop Actually Tells Us

gpt-5.3 scored 91.9 under Sonnet, 75.7 under GPT-5.5, and 84.0 under Opus. Two judges independently came back with substantially lower scores, which means the Sonnet-only number was inflated, and that inflation was specific to gpt-5.3 in a way it was not specific to gpt-5.4 or composer.

The pattern in the per-skill data points to one cause: gpt-5.3 produces outputs that are in the right direction but not precisely correct. Sonnet, the most generous judge, gives partial credit. GPT-5.5, the strictest, does not. If you care about whether your agent follows a spec exactly rather than approximately, GPT-5.5's score is the more informative one. If you care about general capability, the average of three judges is probably right.

What to Do About It

Single-judge evals are benchmarking judge preference as much as model capability. Running multiple judges and averaging the results fixes this: three independent judges will smooth out individual preferences and give you a number that is harder to game and more stable across reruns.

Beyond averaging, design your rubric for binary criteria wherever the task allows it. "Is the file deleted?" is a better eval item than "how well did the agent explain the migration?" The first gives every judge the same answer. The second gives every judge a different one.
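
To make the contrast concrete, here is an illustrative sketch; the rubric shape and file paths are invented for this example and are not Tessl's actual schema:

```python
# Illustrative only: this rubric shape is invented for the example,
# not Tessl's actual schema.
import os

def file_deleted(path: str) -> bool:
    # Binary criterion: code, or any judge, returns the same yes/no answer.
    return not os.path.exists(path)

binary_item = {
    "question": "Is the migration's temp file deleted?",
    "check": lambda: file_deleted("scratch/migration.tmp"),  # hypothetical path
}

qualitative_item = {
    "question": "How well did the agent explain the migration?",
    # No deterministic check exists, so an LLM judge has to grade this,
    # and that is exactly where the 25-point swings come from.
}
```

A strict judge and a generous judge return the same answer for the first item; the second is where their preferences leak into your scores.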

The models that perform consistently across all three judges in this benchmark, gpt-5.4 and composer, share one characteristic: their outputs are correct rather than approximately correct. A strict grader and a generous one disagree less when the answer is unambiguously right.