Evaluation Harnesses: How to Test AI Systems That Cannot Be Unit Tested

You cannot unit test an AI system. The output of an LLM call is not a deterministic function of the input: with sampling enabled, running the same test twice does not guarantee the same result, and a correct answer can be worded in many different ways. Traditional software testing (assertion-based, exact-match, pass/fail) does not transfer directly.

This is not a reason to abandon testing. It is a reason to build the right testing infrastructure. That infrastructure is called an evaluation harness, and building one well is the most important engineering investment you can make in a production AI system.

What an Evaluation Harness Is

An evaluation harness is a system for measuring how well an AI system performs across a defined set of inputs. It differs from unit testing in three fundamental ways:

It uses rubrics, not exact matches. Instead of checking whether the output equals an expected value, it scores the output against criteria that define what good looks like.

It operates over distributions, not individual cases. A single test run tells you little. A harness gives you aggregate scores, variance measures, and trends over time.

It measures behaviour at multiple levels: final output quality, intermediate reasoning steps, tool call sequences, and latency. Watching only the final output misses most of what is useful.
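The three differences above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `run_harness`, `summarise`, `CaseResult`, and the stub `system` and `rubric` callables are all hypothetical names, and a real harness would also record intermediate steps, tool calls, and latency rather than only a final score.

```python
import statistics
from dataclasses import dataclass
from typing import Callable


@dataclass
class CaseResult:
    case_id: str
    scores: list[float]  # one rubric score per repeated run of the same case

    @property
    def mean(self) -> float:
        return statistics.mean(self.scores)

    @property
    def spread(self) -> float:
        # Population standard deviation; 0.0 when every run scored the same.
        return statistics.pstdev(self.scores)


def run_harness(
    cases: dict[str, str],
    system: Callable[[str], str],
    rubric: Callable[[str, str], float],
    runs_per_case: int = 3,
) -> list[CaseResult]:
    """Run every case several times and score each output with the rubric."""
    results = []
    for case_id, prompt in cases.items():
        scores = [rubric(prompt, system(prompt)) for _ in range(runs_per_case)]
        results.append(CaseResult(case_id, scores))
    return results


def summarise(results: list[CaseResult]) -> dict:
    """Aggregate over the whole dataset instead of asserting on single cases."""
    return {
        "mean_score": statistics.mean(r.mean for r in results),
        "mean_spread": statistics.mean(r.spread for r in results),
        "worst_case": min(results, key=lambda r: r.mean).case_id,
    }


# Usage with stand-in callables (assumptions; replace with a real model call
# and a real rubric).
demo_cases = {"greeting": "hello", "short": "hi"}
demo_system = lambda prompt: prompt.upper()
demo_rubric = lambda prompt, output: 1.0 if len(output) > 3 else 0.0
demo_summary = summarise(run_harness(demo_cases, demo_system, demo_rubric))
```

Repeating each case is what turns a single pass/fail into a distribution: the per-case spread shows which inputs the system handles inconsistently, which a single run cannot reveal.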

Building the Test Dataset

The test dataset is the foundation of the harness. A weak dataset produces misleading evaluation signals. A strong dataset covers:

The core case distribution. The bulk of inputs (roughly 80%) that represent normal operation. These should reflect the actual production input distribution, not idealised examples.

Edge cases. Inputs that are unusual but valid. The user who asks an unexpected question. The document with unusual formatting. The request that combines two tasks.

Known failure modes. Every time the system fails or behaves unexpectedly in production, add that input to the test dataset. Over time, this becomes a regression suite that prevents old failures from returning.

Adversarial inputs. Inputs designed to trigger specific failure patterns: empty inputs, very long inputs, inputs in unexpected languages, inputs that attempt prompt injection.
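One way to keep the four kinds of case visible is to tag each test case with its category and check coverage before trusting a run. The sketch below assumes a simple in-memory dataset; `EvalCase`, `Category`, and the `source` field are hypothetical names chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    CORE = "core"
    EDGE = "edge"
    KNOWN_FAILURE = "known_failure"
    ADVERSARIAL = "adversarial"


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str
    category: Category
    source: str = "handwritten"  # e.g. "production_log" for harvested cases


def coverage(cases: list[EvalCase]) -> dict[Category, int]:
    """Count cases per category, including categories with zero cases."""
    counts = Counter(case.category for case in cases)
    return {category: counts.get(category, 0) for category in Category}


def missing_categories(cases: list[EvalCase]) -> list[Category]:
    """Categories the dataset does not cover at all."""
    return [category for category, n in coverage(cases).items() if n == 0]


# Usage: a dataset with only core and adversarial cases is flagged as
# missing edge cases and known failure modes.
dataset = [
    EvalCase("c1", "What is my account balance?", Category.CORE),
    EvalCase("a1", "", Category.ADVERSARIAL),  # empty input
]
gaps = missing_categories(dataset)
```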

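As a rule of thumb, the minimum viable evaluation dataset for a production system is fifty cases. Below that, a single case flipping from pass to fail moves the aggregate score by more than two percentage points, and the variance in your scores will be too high to be meaningful.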
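To put a rough number on the variance, the sketch below computes a normal-approximation (Wald) confidence interval for a pass rate. It assumes each case is scored as a binary pass or fail and that cases are independent. Both are simplifications, and the Wald interval is known to be crude for small datasets or pass rates near 0 or 1. The function name `pass_rate_interval` is hypothetical.

```python
import math


def pass_rate_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float, float]:
    """Return (pass_rate, lower, upper) using the Wald approximation.

    z = 1.96 corresponds to an approximate 95% interval.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    standard_error = math.sqrt(p * (1 - p) / total)
    lower = max(0.0, p - z * standard_error)
    upper = min(1.0, p + z * standard_error)
    return p, lower, upper


# The same 86% pass rate measured on 50 cases and on 500 cases.
small = pass_rate_interval(43, 50)
large = pass_rate_interval(430, 500)
```

With 43 passes out of 50, the interval spans roughly ten percentage points either side of the measured rate, so a five-point change between two runs is within noise. At 500 cases the same pass rate is pinned down to about three points either side.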

Rubric Design

Rubric design is where evaluation harnesses succeed or fail. A rubric is a scoring function that takes an output and returns a quality score. The rubric must be:

Operationalised. "Good" and "bad" must be defined in terms that a scorer can apply consistently. "The response should be helpful" is not a rubric. "The response should directly address the user's question without adding unrequested information" is.

Decomposed. Complex tasks need multiple rubric dimensions. A customer support response might be scored on: accuracy of information, tone appropriateness, completeness, and escalation decision quality. Aggregate scores obscure which dimension is failing.

Calibrated. Before relying on a rubric, check that human scorers applying it independently produce similar scores on the same outputs. High inter-rater variability means the rubric is ambiguous.
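The calibration check can be made concrete with an agreement statistic. The sketch below uses Cohen's kappa, one common choice for two raters assigning categorical labels; the text above does not prescribe a particular statistic, so treat this as an assumption. It is computed per rubric dimension so that an ambiguous dimension is not hidden by the others. The dimension names come from the customer support example; the function names and score data are illustrative.

```python
from collections import Counter


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must score the same non-empty set of outputs")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = freq_a.keys() | freq_b.keys()
    # Probability that both raters pick the same label purely by chance.
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def calibration_report(scores_a: dict[str, list], scores_b: dict[str, list]) -> dict[str, float]:
    """Kappa per rubric dimension; low values point at ambiguous wording."""
    return {dim: cohens_kappa(scores_a[dim], scores_b[dim]) for dim in scores_a}


# Two human scorers rating the same four outputs (1 = pass, 0 = fail).
scorer_a = {"accuracy": [1, 1, 0, 0], "tone": [1, 0, 1, 1]}
scorer_b = {"accuracy": [1, 1, 0, 1], "tone": [1, 0, 1, 1]}
report = calibration_report(scorer_a, scorer_b)
```

In this example the scorers agree perfectly on tone but only moderately on accuracy, which suggests the accuracy criterion needs tighter wording before the rubric is relied on.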

Automated vs Human Evaluation

In production, evaluation needs to scale. Human evaluation does not scale. LLM-as-evaluator patterns work well for many rubric dimensions: coherence, instruction-following, relevance. They work less well for factual accuracy and domain-specific quality.

The practical approach: use LLM-as-evaluator for the rubric dimensions where it is reliable, use human evaluation for domain-specific quality and ground-truth accuracy checks. Build both into the harness and track where they agree and disagree.
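Tracking where the two kinds of evaluation agree can be as simple as scoring a shared sample both ways and computing agreement per dimension. The sketch below assumes each record holds a binary verdict from the LLM evaluator and from a human for the same output; the record field names and the 0.9 threshold are hypothetical and would need tuning for a real system.

```python
from collections import Counter


def agreement_by_dimension(records: list[dict]) -> dict[str, float]:
    """Fraction of outputs where the LLM evaluator and the human agree."""
    totals, agreements = Counter(), Counter()
    for record in records:
        dimension = record["dimension"]
        totals[dimension] += 1
        agreements[dimension] += int(record["judge_pass"] == record["human_pass"])
    return {dimension: agreements[dimension] / totals[dimension] for dimension in totals}


def dimensions_needing_humans(records: list[dict], min_agreement: float = 0.9) -> list[str]:
    """Dimensions where automated scores are not yet trustworthy on their own."""
    rates = agreement_by_dimension(records)
    return sorted(dim for dim, rate in rates.items() if rate < min_agreement)


# Usage: a small doubly-scored sample.
sample = [
    {"dimension": "coherence", "judge_pass": True, "human_pass": True},
    {"dimension": "coherence", "judge_pass": False, "human_pass": False},
    {"dimension": "factual_accuracy", "judge_pass": True, "human_pass": False},
    {"dimension": "factual_accuracy", "judge_pass": True, "human_pass": True},
]
keep_human_review = dimensions_needing_humans(sample)
```

Raw agreement is easy to read but does not correct for chance; on heavily skewed dimensions, where nearly everything passes, a chance-corrected statistic is a safer guide.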

When using an LLM as evaluator, use a different model from the one being evaluated. Same-model evaluation tends to introduce optimism bias, because a model is inclined to rate outputs that resemble its own more favourably.

Regression Testing

Every production AI system degrades over time if not actively maintained. Model versions change. Prompts are edited. Retrieval indices go stale. Regression testing catches these degradations before they reach users.

The regression test suite is the subset of your evaluation dataset that covers your known failure modes. Run it on every prompt change, every model version change, and on a scheduled basis even when nothing has changed. The scheduled runs catch silent degradation from upstream changes.

Track scores over time, not just point-in-time values. A score of 85% on a rubric means nothing without knowing whether it was 90% last month.
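Comparing against a stored baseline is what turns a score into a regression signal. The sketch below assumes per-dimension scores in the range 0 to 1 and flags any dimension that has dropped by more than a tolerance. The 0.02 tolerance is an arbitrary placeholder; in practice it should be set from the run-to-run noise you actually observe. `find_regressions` is a hypothetical name.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, tuple[float, float]]:
    """Return {dimension: (baseline_score, current_score)} for real drops.

    Raises KeyError if a baseline dimension is missing from the current run,
    so that a silently dropped metric cannot pass unnoticed.
    """
    regressions = {}
    for dimension, old_score in baseline.items():
        new_score = current[dimension]
        if old_score - new_score > tolerance:
            regressions[dimension] = (old_score, new_score)
    return regressions


# Usage: last month's scores against today's scheduled run.
last_month = {"accuracy": 0.90, "tone": 0.88}
today = {"accuracy": 0.85, "tone": 0.87}
flagged = find_regressions(last_month, today)
```

Here accuracy is flagged because it fell by five points, while the one-point dip in tone stays inside the tolerance. Storing every run, not only the latest baseline, also makes slow drifts visible that never exceed the tolerance in any single step.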

Observability Integration

Evaluation and observability are complementary. The evaluation harness tests offline, against a fixed dataset. Observability monitors online, against live traffic. Both are necessary.

Instrument every production AI call to emit: the full prompt, the model response, the tool calls made (in order), latency, and token usage. This lets you replay any production call in your evaluation harness when something goes wrong.

The best evaluation datasets are grown from production logs: real inputs, real failures, real edge cases. The harness and the monitoring system should be designed to feed each other.
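A minimal sketch of that feedback loop, assuming one JSON line per production call: the trace carries the fields listed above, and a failed call can be turned into a regression case for the harness. `CallTrace`, its field names, and the shape of the regression case are hypothetical; a real system would more likely emit these through its existing logging or tracing stack, and full prompts may need redaction before storage.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class CallTrace:
    prompt: str
    response: str
    tool_calls: list[str] = field(default_factory=list)  # in call order
    latency_ms: float = 0.0
    prompt_tokens: int = 0
    completion_tokens: int = 0
    timestamp: float = field(default_factory=time.time)


def to_log_line(trace: CallTrace) -> str:
    """Serialise one production call as a single JSON line."""
    return json.dumps(asdict(trace), sort_keys=True)


def from_log_line(line: str) -> CallTrace:
    """Rebuild the trace so the call can be replayed offline."""
    return CallTrace(**json.loads(line))


def to_regression_case(trace: CallTrace, case_id: str) -> dict:
    """Turn a failed production call into a known-failure test case."""
    return {
        "case_id": case_id,
        "prompt": trace.prompt,
        "category": "known_failure",
        "observed_response": trace.response,
        "observed_tool_calls": list(trace.tool_calls),
    }


# Usage: log a call, read it back, and add it to the regression suite.
trace = CallTrace(
    prompt="Cancel my order",
    response="I have refunded your order.",
    tool_calls=["lookup_order", "issue_refund"],
    latency_ms=812.5,
    prompt_tokens=120,
    completion_tokens=18,
)
restored = from_log_line(to_log_line(trace))
new_case = to_regression_case(restored, case_id="prod-0001")
```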