Designing evaluation harnesses for LLM workflows.

How we structure regression suites for AI workflows so quality is measurable, not vibes-based.

Practice notes · October 2025 · 9 min read

Most LLM workflows in production are evaluated by tribal memory: someone on the team uses them often enough to notice when they get worse. That works exactly as long as that person is paying attention, which is to say, not very long.

An evaluation harness is the substitute. It is a versioned set of inputs, expected behaviors, and automated graders that run on every change. With one, a model upgrade, a prompt edit, or a retrieval-index rebuild is a decision you make with data. Without one, each is a guess dressed up in a Slack message.

An AI workflow without an evaluation harness is a demo. With one, it becomes a system you can defend.


The four things every harness needs

We have built these for inbox triage, document extraction, drafting assistants and agentic workflows. The surface area is different; the structure underneath is the same.

1. A frozen, representative input set.

Inputs come from real work, anonymized where required, sampled to cover the categories the workflow actually sees in production. A three-hundred-case set is usually enough to detect material regressions; we have shipped systems with sets as small as eighty, and as large as several thousand for safety-critical tasks. The size matters less than the coverage.

Freeze the set, version it, and treat additions as a release event. Inputs added quietly to chase a regression are how harnesses lose credibility.
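One way to make "frozen" enforceable is to pin a content hash of the input set in a manifest and check it before every run. The sketch below is illustrative only: the case fields (`id`, `category`, `text`), the manifest layout, and the function names `fingerprint_cases` and `verify_frozen` are assumptions, not part of any particular harness.

```python
import hashlib
import json


def fingerprint_cases(cases):
    """Return a SHA-256 hex digest that is stable for a given set of cases.

    Cases are sorted by id and serialized with sorted keys, so the digest
    depends only on content, not on list order or dict key order.
    """
    ordered = sorted(cases, key=lambda case: case["id"])
    payload = json.dumps(ordered, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def verify_frozen(cases, manifest):
    """Raise if the cases no longer match the digest pinned in the manifest."""
    actual = fingerprint_cases(cases)
    if actual != manifest["sha256"]:
        raise ValueError(
            f"input set differs from version {manifest['version']}: "
            "cut a new version instead of editing in place"
        )


# Hypothetical two-case set and the manifest recorded at release time.
cases = [
    {"id": "case-001", "category": "invoice", "text": "Total due: 120.00"},
    {"id": "case-002", "category": "receipt", "text": "Paid in full"},
]
manifest = {"version": "v1", "sha256": fingerprint_cases(cases)}

verify_frozen(cases, manifest)  # passes: nothing has changed

# A quietly appended case is caught on the next run.
tampered = cases + [{"id": "case-003", "category": "invoice", "text": "new"}]
try:
    verify_frozen(tampered, manifest)
    tamper_detected = False
except ValueError:
    tamper_detected = True
```

With a check like this, adding a case forces a new manifest version, which is what makes the addition a visible release event rather than a silent edit.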

2. A grader, or a small ensemble of them.

Graders are the hard part. For structured outputs (extraction, classification, routing) the grader is deterministic: exact-match, field-by-field, with categorical scoring. For generative outputs (drafting, summarization, response) the grader is usually a higher-quality LLM with a rubric. For high-stakes outputs we use both, and require agreement.
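For the structured case, a deterministic grader can be very small. The following is a minimal sketch under stated assumptions: the field names, the three category labels, and the functions `grade_fields` and `graders_agree` are hypothetical, and the LLM rubric grader is represented only by a boolean verdict rather than implemented.

```python
def grade_fields(predicted, expected):
    """Exact-match each expected field; return a categorical label per field."""
    result = {}
    for field, gold in expected.items():
        if field not in predicted or predicted[field] is None:
            result[field] = "missing"
        elif predicted[field] == gold:
            result[field] = "correct"
        else:
            result[field] = "incorrect"
    return result


def graders_agree(deterministic_pass, rubric_pass):
    """High-stakes rule: accept only when both graders pass.

    rubric_pass stands in for the verdict of an LLM rubric grader, which is
    not implemented here. A split verdict is surfaced, not averaged away.
    """
    if deterministic_pass and rubric_pass:
        return "pass"
    if deterministic_pass != rubric_pass:
        return "disagreement"
    return "fail"


# Hypothetical gold record and model output for one document.
expected = {"invoice_number": "INV-1042", "total": "120.00", "currency": "EUR"}
predicted = {"invoice_number": "INV-1042", "total": "12.00"}

field_grades = grade_fields(predicted, expected)
all_correct = all(label == "correct" for label in field_grades.values())
verdict = graders_agree(all_correct, rubric_pass=True)
```

Keeping "missing" distinct from "incorrect" is a design choice: the two failure modes usually have different causes and different fixes, and collapsing them hides which one moved.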

We avoid single-number quality scores. A harness that reports precision, recall, refusal rate, refusal-justification quality, mean response length, and citation accuracy tells you what changed. One that reports a single score tells you only that something changed.

3. A regression policy.

What counts as a regression? Define it in advance. Common patterns: a hard floor, such as no metric falling below its previous run; a soft floor, below which a change requires sign-off; and a per-category threshold for high-risk slices. Without this, the conversation after every run is the same one: is this bad?
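A policy of this kind can be written down as data and evaluated mechanically. The sketch below is one possible encoding, not a prescribed one: it assumes higher-is-better metrics, expresses each floor as an allowed drop relative to the previous run, and treats a high-risk slice as just another metric key with a stricter limit. The metric names, the thresholds, and `evaluate_policy` are all hypothetical.

```python
def evaluate_policy(previous, current, policy):
    """Classify each metric as 'ok', 'needs_sign_off', or 'blocked'.

    previous / current map metric name -> score (higher is better).
    policy maps metric name -> {"hard_drop": float, "soft_drop": float}:
    a drop larger than hard_drop blocks the change; a drop larger than
    soft_drop (but within hard_drop) requires sign-off.
    """
    verdicts = {}
    for metric, limits in policy.items():
        drop = previous[metric] - current[metric]
        if drop > limits["hard_drop"]:
            verdicts[metric] = "blocked"
        elif drop > limits["soft_drop"]:
            verdicts[metric] = "needs_sign_off"
        else:
            verdicts[metric] = "ok"
    return verdicts


# Hypothetical policy: the high-risk slice tolerates no drop at all.
policy = {
    "precision": {"hard_drop": 0.03, "soft_drop": 0.01},
    "recall": {"hard_drop": 0.03, "soft_drop": 0.01},
    "recall/high_risk": {"hard_drop": 0.0, "soft_drop": 0.0},
}
previous = {"precision": 0.91, "recall": 0.94, "recall/high_risk": 0.97}
current = {"precision": 0.93, "recall": 0.92, "recall/high_risk": 0.96}

verdicts = evaluate_policy(previous, current, policy)
release_blocked = "blocked" in verdicts.values()
```

Because the policy is plain data, it can be versioned alongside the input set, and a change to a threshold shows up in review like any other change.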

4. A continuous run cadence.

Daily, against the current production prompt and the current production model endpoint. Plus on every change. Plus a long-form weekly run that includes the more expensive graders. The point is that the harness is part of the system, not a project the team meant to do.

The shape of an evaluation

For a representative document-extraction workflow, our harness includes:

~400 frozen documents covering each document type.
Per-field precision and recall against a human-labeled gold set.
Confidence calibration plot: do high-confidence predictions actually correlate with correctness?
End-to-end accuracy at the auto-route threshold (the number that determines what skips human review).
Cost per document at current pricing.
Latency p50 / p95.

Every change to the prompt, the model, the index, the chunking or the retrieval algorithm runs the full set. The dashboard for it looks more like a unit-test report than a research notebook. That is the point.
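Three of the metrics listed above are simple enough to sketch directly. The code below is illustrative and makes several assumptions: predictions and gold labels are dicts with `None` for an absent field, precision counts correct values over values emitted, recall counts correct values over values present in gold, and the 0.9 auto-route threshold and all sample data are invented for the example.

```python
import statistics


def field_precision_recall(predictions, gold, field):
    """Precision = correct / values emitted; recall = correct / values in gold."""
    emitted = correct = present = 0
    for pred, truth in zip(predictions, gold):
        pred_value = pred.get(field)
        gold_value = truth.get(field)
        if pred_value is not None:
            emitted += 1
        if gold_value is not None:
            present += 1
        if pred_value is not None and pred_value == gold_value:
            correct += 1
    precision = correct / emitted if emitted else 0.0
    recall = correct / present if present else 0.0
    return precision, recall


def auto_route_stats(confidences, is_correct, threshold):
    """Share of documents that skip review, and accuracy within that share."""
    routed = [ok for conf, ok in zip(confidences, is_correct) if conf >= threshold]
    coverage = len(routed) / len(confidences)
    accuracy = sum(routed) / len(routed) if routed else 0.0
    return coverage, accuracy


def latency_percentiles(latencies_ms):
    """Return (p50, p95) using inclusive percentile interpolation."""
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]


# Hypothetical results for four documents.
gold = [{"total": "10"}, {"total": "20"}, {"total": "30"}, {"total": "40"}]
predictions = [{"total": "10"}, {"total": "20"}, {"total": None}, {"total": None}]
precision, recall = field_precision_recall(predictions, gold, "total")

coverage, routed_accuracy = auto_route_stats(
    confidences=[0.99, 0.95, 0.80, 0.60],
    is_correct=[True, True, False, True],
    threshold=0.9,
)

# Hypothetical latencies: 101 evenly spaced samples from 100 ms to 200 ms.
p50, p95 = latency_percentiles([float(ms) for ms in range(100, 201)])
```

In this toy data the extractor never emits a wrong value but misses half the fields, so precision and recall diverge; that divergence is exactly what a single blended score would hide.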

What changes in regulated environments

For financial services, healthcare and public-sector deployments, the harness becomes evidence. We add:

A documented change log keyed to harness runs, retained for the model’s lifetime.
Fairness slices across cohorts the regulator will ask about.
Adversarial / red-team cases as their own grader, run on every release.
A challenger model run side by side with the production model on the same inputs.

None of this is overhead. It is the artifact that turns a model review from a quarter-long event into a routine sign-off.
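The fairness slices and the side-by-side challenger run can share one report. As a rough sketch, assuming each record holds a cohort label, a gold answer, and both models' outputs on the same input (the record layout, cohort names, and function names here are hypothetical):

```python
from collections import defaultdict


def accuracy_by_slice(records, model_key):
    """Accuracy of one model per cohort, over the same set of records."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        totals[record["cohort"]] += 1
        if record[model_key] == record["gold"]:
            hits[record["cohort"]] += 1
    return {cohort: hits[cohort] / totals[cohort] for cohort in totals}


def compare_models(records):
    """Per-cohort accuracy delta: challenger minus production."""
    production = accuracy_by_slice(records, "production")
    challenger = accuracy_by_slice(records, "challenger")
    return {cohort: challenger[cohort] - production[cohort] for cohort in production}


# Hypothetical records: both models answered the same four inputs.
records = [
    {"cohort": "A", "gold": "approve", "production": "approve", "challenger": "approve"},
    {"cohort": "A", "gold": "decline", "production": "decline", "challenger": "approve"},
    {"cohort": "B", "gold": "approve", "production": "decline", "challenger": "approve"},
    {"cohort": "B", "gold": "decline", "production": "decline", "challenger": "decline"},
]

deltas = compare_models(records)
```

Here the challenger improves cohort B and degrades cohort A by the same amount, so an aggregate accuracy figure would show no change at all. Reporting the delta per cohort is what surfaces the trade.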

The mistake we see most often

Teams build the harness after the first prompt has been in production for six months. By then, the implicit baseline is whatever the team is used to seeing, and the harness gets calibrated to it. Build the harness first, against the prototype, and let it be the forcing function that turns the prototype into a system.

An AI workflow without an evaluation harness is a demo. With one, it is a product.
