Designing evaluation harnesses for LLM workflows
How we structure regression suites for AI workflows so quality is measurable, not vibes-based.
Most LLM workflows in production are evaluated by tribal memory: someone on the team uses them often enough to notice when they get worse. That works exactly as long as that person is paying attention, which is to say, not very long.
An evaluation harness is the substitute. It is a versioned set of inputs, expected behaviors, and automated graders that run on every change. With one, a model upgrade, a prompt edit, and a retrieval-index rebuild are all decisions you make with data. Without one, they are guesses dressed up in a Slack message.
The four things every harness needs
We have built these for inbox triage, document extraction, drafting assistants and agentic workflows. The surface area is different; the structure underneath is the same.
1. A frozen, representative input set.
Inputs come from real work, anonymized where required, sampled to cover the categories the workflow actually sees in production. A three-hundred-case set is usually enough to detect material regressions; we have shipped systems with sets as small as eighty, and as large as several thousand for safety-critical tasks. The size matters less than the coverage.
Freeze the set, version it, and treat additions as a release event. Inputs added quietly to chase a regression are how harnesses lose credibility.
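One way to make the freeze enforceable is to fingerprint the set and compare it to a released manifest before each run. The following is a minimal sketch under assumptions of ours: the case fields (`id`, `category`, `input`), the manifest layout, and the function names are all hypothetical, not part of any particular tool.

```python
import hashlib
import json


def fingerprint_cases(cases):
    """Return a SHA-256 hex digest that changes if any case is added, removed, or edited."""
    # Sort by id and serialize with sorted keys so the digest does not depend on ordering.
    canonical = json.dumps(
        sorted(cases, key=lambda c: c["id"]),
        sort_keys=True,
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_frozen(cases, manifest):
    """Raise if the cases no longer match the digest recorded at release time."""
    actual = fingerprint_cases(cases)
    if actual != manifest["sha256"]:
        raise ValueError(
            f"input set differs from release {manifest['version']}: "
            f"expected {manifest['sha256'][:12]}, got {actual[:12]}"
        )


# Hypothetical cases; real ones would be loaded from versioned files.
cases = [
    {"id": "case-001", "category": "invoice", "input": "..."},
    {"id": "case-002", "category": "contract", "input": "..."},
]
manifest = {"version": "1.0.0", "sha256": fingerprint_cases(cases)}
check_frozen(cases, manifest)  # passes silently while the set is unchanged
```

With a check like this at the start of every run, a quietly added case fails loudly, and the only way to change the set is to cut a new manifest version.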
2. A grader, or a small ensemble of them.
Graders are the hard part. For structured outputs (extraction, classification, routing) the grader is deterministic: exact-match, field-by-field, with categorical scoring. For generative outputs (drafting, summarization, response) the grader is usually a higher-quality LLM with a rubric. For high-stakes outputs we use both, and require agreement.
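The two grader types can be sketched as follows. This is an illustration only: the output shape (`fields`, `text`) and the function names are hypothetical, the rubric grader is a stub standing in for a call to an LLM, and "require agreement" is interpreted here as "the case passes only when both graders pass, and disagreements are flagged".

```python
from typing import Callable


def grade_fields(predicted: dict, expected: dict) -> dict:
    """Deterministic grader: exact match for each expected field."""
    return {name: predicted.get(name) == value for name, value in expected.items()}


def grade_high_stakes(
    output: dict, expected: dict, rubric_grader: Callable[[str], bool]
) -> dict:
    """Run both graders; the case passes only if both pass."""
    field_results = grade_fields(output["fields"], expected)
    deterministic_pass = all(field_results.values())
    rubric_pass = rubric_grader(output["text"])
    return {
        "fields": field_results,
        "deterministic_pass": deterministic_pass,
        "rubric_pass": rubric_pass,
        "agree": deterministic_pass == rubric_pass,
        "passed": deterministic_pass and rubric_pass,
    }


def stub_rubric_grader(text: str) -> bool:
    """Placeholder for an LLM-with-rubric call; a real grader would query a model."""
    return "refund" in text.lower() and len(text.split()) <= 50


output = {
    "fields": {"intent": "refund_request", "priority": "high"},
    "text": "We have issued your refund and it should arrive within five days.",
}
expected = {"intent": "refund_request", "priority": "normal"}
result = grade_high_stakes(output, expected, stub_rubric_grader)
# The deterministic grader fails on "priority" while the stub rubric grader
# passes, so the graders disagree and the case does not pass.
```

Keeping the per-field results in the return value, rather than only the final verdict, is what lets a later report say which field regressed.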
We avoid single-number quality scores. A harness that reports precision, recall, refusal rate, refusal-justification quality, mean response length, and citation accuracy tells you what changed. One that reports a single score tells you only that something changed.
3. A regression policy.
What counts as a regression? Define it in advance. Common patterns: a hard floor, where any metric that falls below its value in the previous run blocks the change; a soft floor, where falling below it requires sign-off; and a per-category threshold for high-risk slices. Without this, the conversation after every run is the same one: is this bad?
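A policy like this can be written down as data and checked mechanically. The sketch below is one possible encoding, not a prescribed one: `RegressionPolicy`, `evaluate_run`, the metric names, the slice name, and the thresholds are all hypothetical, and every metric is assumed to be "higher is better".

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RegressionPolicy:
    hard_metrics: tuple = ()  # metrics that must not fall below the previous run
    soft_floor: dict = field(default_factory=dict)   # metric -> sign-off threshold
    slice_floor: dict = field(default_factory=dict)  # (slice, metric) -> minimum


def evaluate_run(current, previous, slices, policy):
    """Return (verdict, reasons); verdict is "block", "needs_sign_off" or "pass"."""
    blocking, warnings = [], []
    for name in policy.hard_metrics:
        if current[name] < previous[name]:
            blocking.append(
                f"{name} fell from {previous[name]:.3f} to {current[name]:.3f}"
            )
    for (slice_name, name), floor in policy.slice_floor.items():
        value = slices[slice_name][name]
        if value < floor:
            blocking.append(f"{slice_name}/{name} {value:.3f} is below {floor:.3f}")
    for name, floor in policy.soft_floor.items():
        if current[name] < floor:
            warnings.append(
                f"{name} {current[name]:.3f} is below the soft floor {floor:.3f}"
            )
    if blocking:
        return "block", blocking
    if warnings:
        return "needs_sign_off", warnings
    return "pass", []


policy = RegressionPolicy(
    hard_metrics=("precision",),
    soft_floor={"recall": 0.92},
    slice_floor={("wire_transfers", "recall"): 0.95},
)
previous = {"precision": 0.93, "recall": 0.91}
current = {"precision": 0.94, "recall": 0.90}
slices = {"wire_transfers": {"recall": 0.97}}
verdict, reasons = evaluate_run(current, previous, slices, policy)
# Precision held and the high-risk slice is above its floor, but recall is
# under the soft floor, so the verdict is "needs_sign_off".
```

Because the verdict comes with its reasons, the post-run conversation starts from a specific metric and threshold instead of from "is this bad?".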
4. A continuous run cadence.
Daily, against the current production prompt and the current production model endpoint. Plus on every change. Plus a long-form weekly run that includes the more expensive graders. The point is that the harness is part of the system, not a project the team meant to do.
The shape of an evaluation
For a representative document-extraction workflow, our harness includes:
- ~400 frozen documents covering each document type.
- Per-field precision and recall against a human-labeled gold set.
- Confidence calibration plot: do high-confidence predictions actually correlate with correctness?
- End-to-end accuracy at the auto-route threshold (the number that determines what skips human review).
- Cost per document at current pricing.
- Latency p50 / p95.
Every change to the prompt, the model, the index, the chunking or the retrieval algorithm runs the full set. The dashboard for it looks more like a unit-test report than a research notebook. That is the point.
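Three of the measurements in that list are simple enough to sketch directly. The code below is illustrative and rests on assumptions of ours: predictions and gold labels are dicts aligned by position, a missing key or `None` means "no value extracted", precision is correct values over predicted values, recall is correct values over labeled values, and all function and field names are hypothetical.

```python
from collections import defaultdict


def per_field_precision_recall(predictions, gold):
    """Per-field precision and recall against a gold set aligned by index."""
    correct, predicted, labeled = defaultdict(int), defaultdict(int), defaultdict(int)
    for pred, truth in zip(predictions, gold):
        for name in set(pred) | set(truth):
            p, t = pred.get(name), truth.get(name)
            if p is not None:
                predicted[name] += 1
            if t is not None:
                labeled[name] += 1
            if p is not None and p == t:
                correct[name] += 1
    return {
        name: {
            "precision": correct[name] / predicted[name] if predicted[name] else None,
            "recall": correct[name] / labeled[name] if labeled[name] else None,
        }
        for name in sorted(set(predicted) | set(labeled))
    }


def calibration_bins(confidences, correct_flags, n_bins=5):
    """Return (mean confidence, accuracy, count) for each non-empty confidence bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct_flags):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        bins[index].append((conf, ok))
    return [
        (
            sum(c for c, _ in bucket) / len(bucket),
            sum(1 for _, ok in bucket if ok) / len(bucket),
            len(bucket),
        )
        for bucket in bins
        if bucket
    ]


def auto_route_summary(confidences, correct_flags, threshold):
    """Share of documents that skip review, and accuracy among those documents."""
    routed = [ok for conf, ok in zip(confidences, correct_flags) if conf >= threshold]
    return {
        "auto_routed_share": len(routed) / len(confidences),
        "accuracy_when_auto_routed": sum(routed) / len(routed) if routed else None,
    }


predictions = [
    {"total": "120.00", "vendor": "Acme"},
    {"total": "75.50", "vendor": None},
    {"total": "19.99", "vendor": "Globex"},
]
gold = [
    {"total": "120.00", "vendor": "Acme"},
    {"total": "75.00", "vendor": "Initech"},
    {"total": "19.99", "vendor": "Globex"},
]
field_metrics = per_field_precision_recall(predictions, gold)

confidences = [0.95, 0.90, 0.55, 0.30]
correct_flags = [True, True, False, False]
bins = calibration_bins(confidences, correct_flags)
routing = auto_route_summary(confidences, correct_flags, threshold=0.9)
```

In a well-calibrated system the accuracy in each bin tracks the mean confidence of that bin; plotting the two against each other gives the calibration plot mentioned above.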
What changes in regulated environments
For financial services, healthcare and public-sector deployments, the harness becomes evidence. We add:
- A documented change log keyed to harness runs, retained for the model’s lifetime.
- Fairness slices across cohorts the regulator will ask about.
- Adversarial / red-team cases as their own grader, run on every release.
- A challenger model run side by side with the production model on the same inputs.
None of this is overhead. It is the artifact that turns a model review from a quarter-long event into a routine sign-off.
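As an illustration of the side-by-side run combined with slicing, here is a minimal sketch. Everything in it is assumed for the example: the cohort names, the case shape, the dictionary-backed stand-ins for the two models, and the exact-match grader.

```python
from collections import defaultdict


def accuracy_by_slice(results):
    """results: iterable of (slice_name, passed) pairs -> {slice_name: accuracy}."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [passed, seen]
    for slice_name, passed in results:
        totals[slice_name][0] += int(passed)
        totals[slice_name][1] += 1
    return {name: passed / seen for name, (passed, seen) in totals.items()}


def compare_models(cases, production, challenger, grade):
    """Run both models on the same cases; report per-slice accuracy and the delta."""
    prod_results, chal_results = [], []
    for case in cases:
        prod_out = production(case["input"])
        chal_out = challenger(case["input"])
        prod_results.append((case["slice"], grade(prod_out, case["expected"])))
        chal_results.append((case["slice"], grade(chal_out, case["expected"])))
    prod_acc = accuracy_by_slice(prod_results)
    chal_acc = accuracy_by_slice(chal_results)
    return {
        name: {
            "production": prod_acc[name],
            "challenger": chal_acc[name],
            "delta": chal_acc[name] - prod_acc[name],
        }
        for name in sorted(prod_acc)
    }


cases = [
    {"slice": "cohort_a", "input": "a1", "expected": "approve"},
    {"slice": "cohort_a", "input": "a2", "expected": "deny"},
    {"slice": "cohort_b", "input": "b1", "expected": "approve"},
    {"slice": "cohort_b", "input": "b2", "expected": "approve"},
]
# Stand-ins for real model endpoints: canned answers keyed by input.
production_answers = {"a1": "approve", "a2": "deny", "b1": "approve", "b2": "deny"}
challenger_answers = {"a1": "approve", "a2": "approve", "b1": "approve", "b2": "approve"}

report = compare_models(
    cases,
    production=production_answers.get,
    challenger=challenger_answers.get,
    grade=lambda output, expected: output == expected,
)
# Both stand-in models get 3 of 4 cases right overall, but the per-slice view
# shows the challenger trading accuracy on cohort_a for accuracy on cohort_b.
```

The toy data is chosen so that the aggregate accuracy is identical for both models, which is exactly the situation where only the sliced report shows the difference.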
The mistake we see most often
Teams build the harness after the first prompt has been in production for six months. By then, the implicit baseline is whatever the team is used to seeing, and the harness gets calibrated to it. Build the harness first, against the prototype, and let it be the forcing function that turns the prototype into a system.
An AI workflow without an evaluation harness is a demo. With one, it is a product.