
How Our Eval Harness Cut AI Costs by 282x While Improving Quality

Most AI products ship without evals and hope for the best. We built a 20-lane evaluation system with 38 fixture items, multi-judge scoring, and hard quality gates. Here's how it changed every model and prompt decision we make.

Tim Crooker · Founder & Engineer
March 30, 2026
14 min read

There's a dirty secret in AI product development. Most teams shipping AI features have no systematic way to measure whether their AI is actually good.

They eyeball a few outputs. They ask the team if it "looks right." They ship a prompt change and wait for user complaints to tell them if it broke something. When they switch models, they run a handful of manual tests and call it a day.

We did that too, in the beginning. Then we built an eval harness, and it changed how we make every single decision about our AI pipeline.

Why Evals Matter More Than You Think

ListForge uses AI agents to identify products from photos, research comparable sales, recommend pricing, and generate marketplace listings. That's a lot of AI decisions per item. And every one of those decisions has real consequences: wrong identification means the wrong listing goes live, wrong pricing means lost money, wrong comps mean bad pricing justification.

The question isn't whether your AI makes mistakes. It does. The question is whether you know where, how often, and how to fix them.

Without evals, you're flying blind. With evals, you have a dashboard.

What We Built

Our eval system has over 20 evaluation lanes, each testing a specific capability in the pipeline:

  • Identification: Can the AI correctly identify a product from photos and description? Scored by an LLM judge across 6 dimensions.
  • Vision Analysis: How accurately does the AI extract brand, model, category, condition, and text from images? Scored across 12+ dimensions.
  • Comp Validation: When the AI finds comparable sales, are its matching verdicts correct? Scored deterministically with F1 and accuracy.
  • Pricing: Given a set of comparables, does the AI recommend a reasonable price? Scored across 7 dimensions.
  • Listing Generation: Is the generated listing title and description accurate, complete, and marketplace-ready?
  • Field Research: Does the AI correctly fill in item specifics like size, color, material, brand?
  • Category, Condition, Content, Shipping Specialists: Isolated evaluation of each specialist agent.

Every lane has its own scoring methodology tuned to what actually matters for that capability. Some use deterministic metrics like F1 scores. Others use LLM-as-judge with multiple judge personas: an identifier judge, a skeptic judge, a reseller judge, and a safety auditor. When four different judges with different perspectives all agree the output is good, it's probably good.
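The multi-judge idea can be sketched in a few lines. This is an illustrative aggregation scheme, not ListForge's actual implementation: in production each persona would be an LLM call with its own rubric, and the `agreement_floor` value is an assumption.

```python
from statistics import mean

def aggregate_judges(scores: dict[str, float], agreement_floor: float = 0.8) -> dict:
    """Combine per-persona judge scores (each 0-1). Report the mean,
    but only mark the output a unanimous pass when every judge
    independently clears the floor -- one skeptical dissent blocks it."""
    unanimous = all(s >= agreement_floor for s in scores.values())
    return {"mean": mean(scores.values()), "unanimous_pass": unanimous}

# Four personas with different perspectives scoring the same output.
result = aggregate_judges({
    "identifier": 0.92,
    "skeptic": 0.85,
    "reseller": 0.90,
    "safety_auditor": 0.88,
})
# All four cleared 0.8, so unanimous_pass is True.
```

Requiring unanimity rather than averaging alone is the point: a high mean can hide one judge flagging a real problem.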

Designing for Testability: The Architecture Decision That Made Everything Else Possible

Before I talk about fixtures, ground truth, and iteration speed, I need to talk about the design decision that made all of it work.

When we first built evals, we tested the entire research pipeline end-to-end. Photos go in, a fully researched item comes out, and the eval scores the final result. We still have a lane for that. It's useful for catching system-level regressions.

But it was a blunt instrument. When the end-to-end eval failed, we couldn't tell where. Was it the vision analysis that misread the brand? The identification agent that picked the wrong product? The comp validation that matched against irrelevant listings? The pricing model that ignored outliers? Everything was tangled together.

The ground truth couldn't be specific enough either. When you're scoring a full pipeline output, your ground truth has to account for every possible path the system could take to arrive at a correct answer. That's combinatorially complex. And you can't cache fixtures cleanly because each step depends on the live output of the previous step.

The real unlock was decomposing the pipeline into individually testable steps that build on each other.

Each step in the research pipeline became its own eval lane with its own fixtures, its own ground truth, and its own scoring methodology. Vision analysis gets tested in isolation with frozen images and specific expected extractions. Identification gets tested with frozen vision outputs so we know exactly what context it's working with. Comp validation gets tested with frozen candidate listings. Pricing gets tested with frozen comp sets.

This is critical for anyone building agentic workflows in production. If you're delivering real value and not just vibes, you need to be able to test with a scalpel, not a sledgehammer. Decompose your long-running agentic workflows into individually testable steps. Each step should have clearly defined inputs, outputs, and success criteria that can be evaluated independently.

The per-step approach gave us three things the end-to-end approach couldn't:

  1. Precise fault isolation. When a score drops, we know exactly which node caused it.
  2. Cached upstream state. Each lane runs against frozen inputs from the previous step, making results deterministic and reproducible. No more "the eval failed because eBay returned different search results this time."
  3. Specific ground truth. Instead of scoring a fuzzy end-to-end output, each lane has tight, domain-specific ground truth that catches subtle regressions.

We still run the full pipeline eval as a system-level smoke test. But the per-lane evals are where the real quality improvements happen. That's where you find the 2% regressions in vision analysis that would be invisible in an end-to-end score but compound into real user-facing problems.
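A per-lane runner reduces to a small harness once upstream state is frozen to disk. The file layout and names below are illustrative assumptions, not ListForge's actual schema; the point is that the node under test sees byte-identical context on every run.

```python
import json
from pathlib import Path

def run_lane(lane: str, fixture_dir: Path, step_fn) -> dict:
    """Run one eval lane against frozen inputs. step_fn is the single
    pipeline node under test; everything upstream of it is loaded from
    cached fixture files, so results are deterministic and reproducible."""
    frozen_input = json.loads((fixture_dir / f"{lane}_input.json").read_text())
    expected = json.loads((fixture_dir / f"{lane}_ground_truth.json").read_text())
    output = step_fn(frozen_input)
    return {"lane": lane, "output": output, "expected": expected}
```

Because `step_fn` is the only live code in the loop, a score drop in this lane can only mean one thing: that node regressed.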

The Fixture Dataset

Evals are only as good as the data you test against. We maintain 38 hand-curated fixture items spanning the full difficulty spectrum.

On the easy end: an Apple AirPods Pro case with clear branding, tons of comparable sales, unambiguous identification. On the hard end: a knockoff designer bag that looks authentic, a reproduction Griswold cast iron skillet that could fool most resellers, vintage electronics with minimal market data.

Each fixture includes product photos, human-authored ground truth with acceptable answer variants, frozen upstream outputs for isolated testing, and cached external API responses so results are deterministic and reproducible.

The ground truth is semantic, not string-matching. We don't require the AI to produce an exact string. We define what a correct identification must contain, what it must not claim, what name variants are acceptable, and what failure modes to watch for. A Canon AE-1 can be called "Canon AE-1 35mm SLR Film Camera" or "Canon AE-1 Program" and both are correct. But calling it a "Canon A-1" is a different camera entirely.
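A minimal sketch of what semantic (rather than string-matching) ground truth looks like, using the Canon AE-1 example. The schema keys here are assumptions for illustration:

```python
def check_identification(answer: str, truth: dict) -> bool:
    """Semantic match: the answer must contain at least one acceptable
    name variant and must not contain any known wrong-answer trap."""
    a = answer.lower()
    variant_hit = any(v.lower() in a for v in truth["acceptable_variants"])
    trap_hit = any(t.lower() in a for t in truth["must_not_claim"])
    return variant_hit and not trap_hit

canon_ae1 = {
    "acceptable_variants": ["Canon AE-1", "Canon AE-1 Program"],
    "must_not_claim": ["Canon A-1"],  # a different camera entirely
}

check_identification("Canon AE-1 35mm SLR Film Camera", canon_ae1)  # True
check_identification("Canon A-1", canon_ae1)                       # False
```

Real ground truth would also encode failure modes and required attributes, but the shape is the same: define the boundaries of correctness, not one exact string.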

The Decision That Saved 282x on Cost

Here's where evals pay for themselves.

We needed to choose a model for comparable sales validation. The job is straightforward: look at a candidate listing from eBay, compare it to our identified product, and decide if it's a genuine match. It requires visual comparison and structured output, not deep reasoning.

We ran an A/B eval. Same fixtures. Same prompts. Two models.

Gemini 3.1 Flash Lite: 62.7% verdict accuracy. Total cost: $0.0065. Average latency: 8.9 seconds.

GPT-5.1: 0% verdict accuracy. Total cost: $1.84. Average latency: 26.3 seconds.

The expensive model was 282 times more costly, three times slower, and produced output that failed to parse 7% of the time. The cheap model won on every metric.

Without evals, we would have defaulted to the "smarter" model and burned money on worse results. This one decision alone justified the entire eval system.
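The A/B comparison itself is just aggregation over per-item results. A sketch of how such a comparison table could be built, with illustrative field names and fake numbers:

```python
def compare_models(results: dict[str, list[dict]]) -> dict[str, dict]:
    """Collapse per-item eval results (same fixtures, same prompts,
    different models) into the accuracy/cost/latency table used to
    make the model decision."""
    table = {}
    for model, items in results.items():
        n = len(items)
        table[model] = {
            "accuracy": sum(i["correct"] for i in items) / n,
            "total_cost_usd": sum(i["cost_usd"] for i in items),
            "avg_latency_s": sum(i["latency_s"] for i in items) / n,
        }
    return table

table = compare_models({
    "flash": [
        {"correct": True,  "cost_usd": 0.001, "latency_s": 8.0},
        {"correct": False, "cost_usd": 0.001, "latency_s": 10.0},
    ],
})
```

The decision then reads straight off the table: no eyeballing, no vibes.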

Model Selection by Capability

This pattern repeated across the pipeline. Different capabilities have different requirements, and the most expensive model is almost never the right choice for every job.

Our current production config:

  • Identification: GPT-5.4-mini. This is the hardest task. It requires genuine reasoning about what a product is from limited visual and textual evidence. Worth paying for a stronger model.
  • Vision Analysis: Gemini 3.1 Flash Lite. Extracting brand text, reading labels, categorizing condition. Structured output accuracy matters more than deep reasoning. The flash model handles it.
  • Comp Validation: Gemini 3.1 Flash Lite. Binary match/no-match decisions on structured data. Speed and cost matter.
  • Pricing: Gemini 3.1 Flash Lite. Math and structured output from comparable sales data. Doesn't need a reasoning powerhouse.
  • Eval Judges: GPT-5.4-nano. The cheapest model we use. Judges only need to score pre-formatted output against ground truth. They don't need to be creative.

Every one of these choices was made by running evals, not by guessing.

Hard Quality Gates

Evals don't just measure quality. They enforce it.

We have hard gates with specific thresholds that must pass before any prompt or model change can be promoted to production:

  • Identification accuracy: minimum 90%, max regression 1.5%
  • Identification gate fail rate: maximum 10%, max regression 3%
  • Pass-at-k (at least one correct answer in k attempts): minimum 80%
  • Comp validation F1: minimum 0.80
  • Pricing deviation score: minimum 0.60
  • Strategy ordering rate (aggressive < balanced < premium): minimum 95%

If a change causes identification accuracy to drop from 93% to 91%, that's still above the 90% floor, but the 2% regression exceeds the 1.5% max regression threshold. The gate fails. The change doesn't ship.
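That two-condition check, an absolute floor plus a relative regression cap, is small enough to show in full. A sketch with the identification thresholds from the list above:

```python
def gate(metric: float, baseline: float, floor: float, max_regression: float) -> bool:
    """A change ships only if the metric stays above the absolute floor
    AND doesn't regress from the current baseline by more than
    max_regression. Either failure alone blocks promotion."""
    return metric >= floor and (baseline - metric) <= max_regression

# 93% -> 91%: above the 90% floor, but the 2-point drop exceeds the
# 1.5-point max regression, so the gate fails and the change doesn't ship.
gate(0.91, baseline=0.93, floor=0.90, max_regression=0.015)   # False

# 93% -> 92.5%: above the floor AND within the regression budget.
gate(0.925, baseline=0.93, floor=0.90, max_regression=0.015)  # True
```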

This protects against the most insidious problem in AI development: slow quality degradation that nobody notices until customers start complaining.

The Rapid Iteration Loop

The eval harness supports three modes that make iteration fast:

Fixture mode: Run a lane against a specific fixture with a specific model. See results in seconds. Use this when you're tweaking a prompt and want quick signal.

Capture mode: Snapshot the full state before a specific pipeline node. This freezes everything upstream so you can iterate on one node in isolation without re-running the whole pipeline every time.

Replay mode: Re-run a captured snapshot N times to get statistical confidence. AI is non-deterministic. Running once tells you almost nothing. Running 10 times tells you the distribution.

This loop means we can try a prompt change, run it against the full stable suite, see aggregate metrics and per-fixture breakdowns, compare against the baseline, and decide in minutes whether to keep it or revert.
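Replay mode is essentially distribution sampling. A minimal sketch, where `snapshot_fn` stands in for re-running a captured snapshot through the node under test:

```python
from statistics import mean, stdev

def replay(snapshot_fn, n: int = 10) -> dict:
    """Re-run a captured snapshot n times and summarize the score
    distribution. One run of a non-deterministic step tells you almost
    nothing; the spread across n runs is the real signal."""
    scores = [snapshot_fn() for _ in range(n)]
    return {
        "n": n,
        "mean": mean(scores),
        "stdev": stdev(scores) if n > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }
```

A tight stdev means the prompt change is stable; a wide one means the single run you happened to look at could have said anything.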

Bootstrapping Ground Truth Without Losing Your Mind

Having an eval harness is great. But then you have to actually fill it with ground truth, and that's where most teams give up.

Manually authoring semantic ground truth for 38 fixtures across 20+ lanes is brutal. Each fixture needs acceptable name variants, failure modes to watch for, pricing ranges, category semantics. Doing it by hand for every lane would take weeks of tedious expert review.

We used a technique that cut that time dramatically: multi-model consensus.

We ran frontier models from all three major labs against every fixture lane and had each independently generate what they believed the ground truth should be. Anywhere all three agreed, we automatically saved it as ground truth. Anywhere they disagreed, we flagged it for manual review.

This is a consensus filter. The frontier models already agree on the easy stuff. There's no reason for a human expert to spend time writing ground truth that any frontier model would have gotten right. Automating that consensus let us skip the tedious work and focus expert time only on the cases where the models diverged, which are exactly the cases that matter most.

I then reviewed everything manually to sign off on it. This gave me confidence that the automated ground truth was actually correct, while the manual overrides caught the edge cases that separate good evals from great ones.

The Agentic Eval CLI

Even with good evals and solid ground truth, iterating is slow if you have to manually change prompts, swap models, adjust reasoning levels, and re-run each lane by hand.

The real unlock was exposing the eval harness as a CLI optimized for agentic use. I built skills and tooling around the CLI so an agent could actually use it. Via flags, the agent can inject mutations on prompting, model selection, reasoning level, and other parameters across any lane and fixture combination.

This meant the agent could automatically test dozens of permutations across all fixtures and present me with a ranked set of optimizations at the end. I select from the results. The agent does the exploration. I do the judgment.

The results: increases in performance, decreases in cost, and decreases in latency, all discovered through automated exploration rather than manual trial and error. The agent pulls on different strings while I review the outcomes.

This was the actual unlock for iteration speed. But it only worked because the harness already had high-quality fixtures with real-world accuracy underneath it. You can't automate iteration on top of bad evals. The foundation has to be right first, and then the agentic layer accelerates everything on top of it.

A Note on Eval Applicability

I want to be honest about something. Our use case is particularly well-suited for evals.

Product identification is either right or wrong. A Canon AE-1 is a Canon AE-1 or it isn't. Pricing has clear comparable data to score against. Comp validation is a binary match/no-match decision. Most of our outputs are objectively scoreable, which makes building ground truth and automated scoring relatively straightforward.

Not every domain has this luxury. If you're building a creative writing assistant, a therapy chatbot, or a legal summarization tool, the line between good and bad output is much blurrier. It may not be immediately obvious whether the output is good or bad, and it may not be obvious how to determine that algorithmically. The multi-model consensus approach to ground truth building may not be tenable for every use case.

But here's what I'd push back on: even in those domains, you have to find a way to evaluate. If you can't define what good looks like, you can't improve systematically. The eval methodology might be different. You might need human raters instead of deterministic metrics. You might need domain-specific rubrics instead of F1 scores. You might need a panel of expert judges instead of automated consensus.

The specific implementation varies by domain, but the principle doesn't change. You have to understand how to evaluate your own domain's data. That's the only way to objectively score outputs and make data-driven decisions about model selection, prompt engineering, and quality gates.

Even if it seems difficult to generate ground truth for your eval lanes, that difficulty is the actual work. Avoiding it doesn't make it go away. It just means you're shipping blind.

Conviction Calibration

One of the more interesting eval outputs is the conviction calibration chart. When the AI says it has "high confidence" in an identification, how often is it actually right?

We bucket eval results by conviction level and plot predicted confidence against actual accuracy. If the chart shows a diagonal line, the AI is well-calibrated: high confidence items are high accuracy, low confidence items are low accuracy. If high confidence items are frequently wrong, the AI is overconfident and the auto-approval thresholds need adjustment.

This directly feeds into the production review system. The auto-approval threshold is conviction-gated. If evals show that high-conviction items are 95% accurate, we can safely auto-approve them. If they're only 70% accurate, every item needs manual review regardless of what the AI claims.
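The bucketing behind the calibration chart can be sketched directly. Conviction labels and sample data here are illustrative:

```python
def calibration_table(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Bucket eval results by the model's stated conviction level and
    compute actual accuracy per bucket. Well-calibrated means the
    high-conviction bucket really is high accuracy."""
    buckets: dict[str, list[bool]] = {}
    for conviction, correct in results:
        buckets.setdefault(conviction, []).append(correct)
    return {c: sum(v) / len(v) for c, v in buckets.items()}

acc = calibration_table([
    ("high", True), ("high", True), ("high", False),
    ("low", False), ("low", True),
])
# "high" lands at 2/3 accuracy here: if auto-approval assumed high
# conviction meant ~95% accuracy, this model is overconfident and the
# threshold needs adjustment.
```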

What We Learned

After 179 eval runs across the pipeline, here's what surprised us:

Decompose before you evaluate. Testing an end-to-end pipeline tells you something is broken. Testing individual steps tells you what and why. The architecture decision to make each node independently testable was more important than any individual eval technique.

Expensive models aren't always better. For structured output tasks, smaller specialized models frequently outperform larger reasoning models. The larger model overthinks the problem, produces verbose output that fails to parse, or gets creative when the task requires precision.

The hard part isn't building evals. It's building good fixtures. Creating a fixture with semantic ground truth, frozen dependencies, and cached tool responses takes real work. But it's a one-time investment that pays dividends on every eval run for the life of the project.

Deterministic replay is essential. External API responses change constantly. eBay listings appear and disappear. Prices fluctuate. Without cached tool responses, you can't compare two eval runs because the inputs changed. Caching makes evals reproducible.

Multi-judge scoring catches things single scores miss. A reseller judge focuses on whether the listing would sell. A safety auditor focuses on whether claims are defensible. A skeptic actively tries to find problems. Different perspectives surface different failure modes.

Regression gates catch more issues than absolute thresholds. A model might pass the 90% accuracy floor but cause a 3% regression from the current baseline. That regression matters, especially when compounded across multiple small changes over time.

Consensus-based ground truth generation is a massive time saver. Let the frontier models agree on the easy cases. Spend your expert time on the disagreements. That's where the real quality lives.

Agentic eval iteration compounds fast. Once the foundation is solid, letting an agent explore permutations across prompts, models, and parameters finds optimizations a human would never try manually. But only if the foundation is solid first.

The Bigger Point

If you're building AI features and you don't have evals, you don't know how good your AI is. You think you know. But you don't.

Evals take the guesswork out. They turn "I think this prompt is better" into "this prompt scores 94.2% on identification accuracy, up from 91.7%, with no regression on comp validation F1." They turn "should we use GPT-5 or Gemini Flash?" into a table with accuracy, cost, and latency columns.

The investment pays for itself on the first model decision you make with data instead of intuition.


ListForge is building the most rigorous AI-powered listing platform for resellers. Every AI decision in our pipeline is measured, gated, and continuously improved through systematic evaluation. Sign up for early access to see the results for yourself.