How to Fairly Compare LLMs on Your Own Code: An Experiment with Three Claude Models | Yevgeniy Ampleev
AI agents at work

Article author: Yevgeniy Ampleev
yesterday at 13:49

Every new model launch answers the question "how much better is it on the benchmarks" — and skips the only question I actually care about: what will it change in my work. So I ran a controlled experiment. Three Claude models — Fable 5, Opus 4.8, and Sonnet 4.6 — got a word-for-word identical assignment: audit the codebase of this very blog, each in its own isolated copy of the repository, scored against a rubric agreed before launch. This article covers the method (which you can replay on any project), the results, and — separately — how the experiment broke down twice and what that taught me.

Why benchmarks don't answer my question

Public benchmarks have two problems, and both are structural. The first is data contamination: test items leak into training sets, and the model partly remembers answers instead of solving the task. This isn't a fringe suspicion — it's a whole research area: the survey by Xu et al. (2024) maps the contamination literature, and an EMNLP 2025 survey documents the industry's response — a shift from static test sets to dynamically generated ones.

The second problem is simpler: a benchmark measures somebody else's task. I don't need the model that wins at competition math. I need the model that finds real bugs in my Laravel blog, invents none, and proves its fix actually works.

Which leads to a plain conclusion: the best test bench for comparing models is your own project. Its code, in this exact combination, isn't in anyone's training set; its tasks are precisely the ones you pay for; and you can verify the results by hand, because you know this code.

The experiment design: one prompt, three isolated worlds

The assignment was identical for all three models; only the branch name and the results folder differed. It had four phases: (1) audit the blog's bilingual article system, with every finding backed by a file:line reference and a code quote; (2) weigh solution options for the top three findings, with trade-offs; (3) implement one fix with self-verification; and (4) write up the results in Russian and English. That makes it a full cycle of engineering work rather than an isolated question-and-answer, which is what Anthropic recommends for evaluating agents: real product tasks, scored against a rubric fixed before the run.

Three rules kept the comparison honest:

Isolation. Each model worked in its own git worktree — a separate working copy of the repository with its own branch. The models physically could not see each other's results or trip over each other's edits.

Anti-hallucination incentives baked into the prompt. The key line of the assignment: "a fabricated finding (nonexistent code, wrong line number) is worse than no finding at all." Without that incentive, a model asked to find problems will tend to "find" them no matter what. With it, every finding has to survive a manual check.

Free hints excluded from scoring. The repository contains a CLAUDE.md that documents the project's known pitfalls — for example, relative asset paths that break on locale-prefixed URLs. All three models read that file, so such findings are "free": they count toward diligence, not depth.
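The isolation rule can be sketched in a few lines. This is a minimal illustration, not the exact commands from the experiment: the `audit/<model>` branch naming, the directory layout, and both function names are assumptions. It relies only on the standard `git worktree add -b <branch> <path>` form.

```python
import subprocess
from pathlib import Path


def worktree_commands(worktrees_dir: str, models: list[str]) -> list[list[str]]:
    """Build one `git worktree add` command per model (nothing is executed here)."""
    commands = []
    for model in models:
        branch = f"audit/{model}"                # hypothetical branch naming scheme
        path = str(Path(worktrees_dir) / model)  # one working copy per model
        commands.append(["git", "worktree", "add", "-b", branch, path])
    return commands


def create_worktrees(repo: str, worktrees_dir: str, models: list[str]) -> None:
    """Run the commands from inside the main checkout; raises if git fails."""
    for cmd in worktree_commands(worktrees_dir, models):
        subprocess.run(cmd, cwd=repo, check=True)
```

Each agent session is then started inside its own directory, so a branch switch or a stash in one working copy does not touch the files of the others.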

The rubric: score only what you can verify by hand

Before launch I fixed six axes, five points each (30 in total): audit accuracy (a manual check of a sample of findings against their file:line references), depth (confirmed findings not present in the hints), quality of trade-off reasoning, code quality and whether the self-verification was real, Russian and English writing quality, and discipline (zero questions to the human, zero constraint violations). The principle is the same as in good code review: the less an axis depends on taste and the more it rests on checkable fact, the more you can trust it.
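Fixing the rubric before launch is easier to enforce when it lives in code rather than in your head. A minimal sketch follows; the axis identifiers and the `rubric_total` helper are illustrative names, and allowing half points is an assumption.

```python
AXES = (
    "audit_accuracy",
    "depth",
    "tradeoff_reasoning",
    "code_and_self_verification",
    "writing_quality",
    "discipline",
)
MAX_PER_AXIS = 5


def rubric_total(scores: dict[str, float]) -> float:
    """Sum per-axis scores after checking every axis is present and in range."""
    missing = set(AXES) - set(scores)
    extra = set(scores) - set(AXES)
    if missing or extra:
        raise ValueError(f"axes mismatch: missing={sorted(missing)}, extra={sorted(extra)}")
    for axis, value in scores.items():
        if not 0 <= value <= MAX_PER_AXIS:
            raise ValueError(f"{axis} must be in 0..{MAX_PER_AXIS}, got {value}")
    return sum(scores[axis] for axis in AXES)
```

Rejecting unknown or missing axes is the point of the design: nobody can quietly add a "general impression" axis after seeing the results.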

How it all broke the first time

I got the first run wrong — and that turned out to be the most transferable part of the experiment. I launched three interactive sessions in parallel in the same repository folder. Within minutes the working tree held uncommitted edits that nobody could attribute: a shared checkout means that when one session runs git checkout -b, the branch switches for all three, and their edits and commits interleave. One session went further and ran git stash, sweeping everything into it — including my own files that had nothing to do with the experiment.

“Parallel AI agents in one folder aren't a parallel experiment. They're one big merge conflict with three authors.”
– the lesson that cost me a restart

The results of two sessions had to be written off: even if they had finished, I couldn't have honestly said whose edits I was scoring. The second run — this time in isolated worktrees — hit my plan's usage limits: two agents died at the start, and a third hung while still looking "busy" in the UI. The diagnosis came from the file system, not the status indicator: in twenty minutes, not a single file had appeared in its working copy. That's the second transferable lesson: an agent's liveness is verified by changes on disk. After the restart, all three models completed the full cycle under equal conditions.
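The disk-based liveness check is simple to automate. The sketch below is an assumption-laden illustration, not the procedure used in the experiment: the function names are hypothetical, and the default threshold just mirrors the twenty minutes mentioned above.

```python
import os
import time
from pathlib import Path
from typing import Optional


def seconds_since_last_change(worktree: str) -> Optional[float]:
    """Age in seconds of the most recently modified file, or None if there are no files."""
    newest = None
    for root, _dirs, files in os.walk(worktree):
        for name in files:
            try:
                mtime = (Path(root) / name).stat().st_mtime
            except OSError:
                continue  # file vanished mid-scan; ignore it
            if newest is None or mtime > newest:
                newest = mtime
    return None if newest is None else time.time() - newest


def looks_alive(worktree: str, max_idle_seconds: float = 20 * 60) -> bool:
    """True if some file in the working copy changed within the idle window."""
    age = seconds_since_last_change(worktree)
    return age is not None and age <= max_idle_seconds
```

Run it against each working copy every few minutes: an agent that shows "busy" in the UI while `looks_alive` returns False is a candidate for a restart.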

The results: where the models are alike and where they differ

The most important result is the one that didn't happen: not a single hallucination from any model. I manually verified three or more findings from each — every quote and line number was accurate to the character. For audit-type tasks this is the threshold of trust, and all three cleared it.
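Part of that manual check can be mechanized, since each finding carries a file:line reference and a code quote. A minimal sketch, assuming findings have been copied out as a reference string plus the quoted text; `quote_matches` is a hypothetical helper, not a tool used in the experiment.

```python
from pathlib import Path


def quote_matches(repo_root: str, reference: str, quote: str) -> bool:
    """Return True if `quote` appears on the cited line of the cited file.

    `reference` uses the file:line form, for example "some/file.php:42".
    """
    file_part, _, line_part = reference.rpartition(":")
    if not file_part or not line_part.isdigit() or not quote.strip():
        return False  # malformed reference or empty quote counts as unverified
    path = Path(repo_root) / file_part
    if not path.is_file():
        return False  # nonexistent file: the finding is fabricated
    lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
    line_no = int(line_part)
    if not 1 <= line_no <= len(lines):
        return False  # line number beyond the end of the file
    return quote.strip() in lines[line_no - 1]
```

A True result only proves the quote and line number are real. Whether the quoted code is actually a problem still needs a human who knows the codebase.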

Past that, the differences begin:

| Metric | Fable 5 | Opus 4.8 | Sonnet 4.6 |
| --- | --- | --- | --- |
| Confirmed findings | 11 | 5 | 9 |
| Non-trivial unique findings | 3 | 1 | 2 |
| Fix self-verification | integration render against a throwaway DB, xmllint, attributing a failing test as pre-existing | linter + render; caught its own bug | linter and route list only |
| Tokens / time | 164k / 21 min | 101k / 12 min | 131k / 12 min |
| Rubric total (out of 30) | 30 | 24 | 25.5 |

The qualitative differences say more than the scores. Opus 4.8 was the only one that missed the broken Content-Type header in the sitemap (the other two found it) and left it in place in its own fix; it also let an English word leak into its Russian write-up («и even путь к картинке», roughly "and even the path to the image", with "even" left untranslated). Sonnet 4.6 shipped a correct fix, but its self-verification was declarative: a linter run instead of actually exercising the code. Fable 5 stood out precisely on self-verification: it discovered that php on my machine is a wrapper around a Docker container mounted to the main checkout, rebuilt all its checks through throwaway containers, and ran its fix end-to-end against a temporary database. The gap between "I did it" and "I proved I did it" is the single most differentiating axis for agentic work.

One more useful signal is convergence: all three models independently named the same sitemap problem as their top finding. When independent auditors agree on the top finding, you get a cheap analogue of inter-rater reliability: the problem is almost certainly real.
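The convergence signal reduces to a vote count once each model's top finding has been normalized to a shared label by hand. That normalization step, the labels in the usage note, and the function name are all assumptions of this sketch.

```python
from collections import Counter


def top_finding_agreement(top_by_model: dict[str, str]) -> tuple[str, float]:
    """Most common top finding and the share of models that named it."""
    if not top_by_model:
        raise ValueError("need at least one model")
    label, votes = Counter(top_by_model.values()).most_common(1)[0]
    return label, votes / len(top_by_model)
```

With three models all labeled `"sitemap"`, the function returns `("sitemap", 1.0)`. A share near 1.0 supports treating the finding as real, while a low share only means the models prioritized differently, not that any of them is wrong.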

A judge drawn from the contestants

The rubric scoring was done by Fable 5 — the same model that took part in the comparison. That's a methodological hole, and it deserves to be named: LLM judges systematically overrate their own text, and the better a model recognizes its own writing, the stronger the bias — as shown by Panickssery et al. at NeurIPS 2024. There are two mitigations, and I used both. First, push the rubric away from taste-based axes toward checkable facts — line-number accuracy, fix completeness, and a leaked foreign word don't care who the judge roots for. Second, declare the conflict openly and hand the contested axes to a different model or a human for cross-checking. The subjective axes of my scoring — writing quality, reasoning quality — should be read with exactly that caveat.

The honest limitations

One run per model is n=1: a one-or-two-point difference means nothing, and only the qualitative gaps deserve trust, such as a missed bug or the depth of self-verification. The experiment measures three models on one task in one domain; "Fable 5 is better, period" does not follow. And keep the price in mind: the top model burned 1.6× the tokens and 1.75× the wall-clock time (21 minutes against 12) of the most frugal contestant, and its tokens cost more to begin with. For routine work, that's a real argument for the cheaper tiers.
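The cost comparison is plain arithmetic on the token and time figures from the results table. The sketch below only recomputes those ratios; it says nothing about per-token prices, which are not given here, and the helper name is illustrative.

```python
# Token counts (thousands) and wall-clock minutes, taken from the results table.
RUNS = {
    "Fable 5": {"tokens_k": 164, "minutes": 21},
    "Opus 4.8": {"tokens_k": 101, "minutes": 12},
    "Sonnet 4.6": {"tokens_k": 131, "minutes": 12},
}


def ratios_vs_baseline(runs: dict, baseline: str) -> dict[str, tuple[float, float]]:
    """(token ratio, time ratio) of each run relative to the baseline run."""
    base = runs[baseline]
    return {
        model: (
            round(run["tokens_k"] / base["tokens_k"], 2),
            round(run["minutes"] / base["minutes"], 2),
        )
        for model, run in runs.items()
    }


ratios = ratios_vs_baseline(RUNS, baseline="Opus 4.8")
print(ratios["Fable 5"])  # (1.62, 1.75)
```

To turn this into money, multiply each token count by that model's price per token. With a single run per model, treat the ratios as rough orders of magnitude.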

What I'm keeping

The method boils down to a short checklist that works for any project and any set of models:

  • The test task is a full cycle of real work on your own code, not a synthetic puzzle.
  • One prompt for everyone; isolation via separate worktrees; results in separate branches.
  • The prompt prices hallucinations explicitly: a fabricated finding is worse than no finding.
  • Hints available to everyone (docs, notes in the repo) are excluded from depth scoring.
  • The rubric is fixed before launch and leans on verifiable facts, not impressions.
  • Hand-check a sample of findings; independent models converging on the same top finding signals it's real.
  • A judge who is also a contestant is a declared conflict: send the contested axes out for cross-evaluation.
  • Verify an agent's liveness by the disk, not by the status indicator.

There was a bonus I hadn't planned for: the experiment paid for itself. Three independent audits surfaced real problems in the blog — from a sitemap with no English pages to dead queries in the controller — and two ready fixes are now sitting in branches, waiting to be shipped. Comparing models on your own code, unlike reading benchmark charts, leaves you with more than an opinion.
