Notes on Evaluating LLM Output Quality

A common reason AI features regress silently is that nobody has a way to notice. Traditional software has tests; many AI features ship with none, because "testing" an LLM output feels fuzzier than asserting equals(2, 2). Testing LLM output doesn't have to be fuzzy.

Start with a small, real evaluation set

A useful evaluation set doesn't need to be large. Ten to twenty real examples, pulled from actual usage or written to cover known edge cases, will catch more regressions than an elaborate framework with no data behind it. The set should include:

A handful of the most common, "easy" cases — these should almost never fail.
A few known-hard cases that have broken before.
One or two adversarial inputs that try to push the model off its intended behavior.
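
One way to keep such a set honest is to store it as plain data with a label for each of the three categories above. The sketch below is only an illustration: the EvalCase shape, its field names, and the placeholder inputs are assumptions, not a prescribed format.

```typescript
// Hypothetical shape for one evaluation case; the field names are illustrative.
type CaseKind = "common" | "known-hard" | "adversarial";

interface EvalCase {
  id: string;
  kind: CaseKind;
  input: string;
  // Optional free-text note on why the case exists, e.g. the bug it once triggered.
  note?: string;
}

// Placeholder inputs: replace these with examples pulled from real usage.
const evalSet: EvalCase[] = [
  { id: "common-1", kind: "common", input: "<typical user request>" },
  { id: "common-2", kind: "common", input: "<another typical request>" },
  { id: "hard-1", kind: "known-hard", input: "<input that broke before>", note: "regressed once" },
  { id: "adv-1", kind: "adversarial", input: "<input that tries to override instructions>" },
];

// Count cases per kind so a gap in coverage is visible at a glance.
function countByKind(cases: EvalCase[]): Record<CaseKind, number> {
  const counts: Record<CaseKind, number> = { common: 0, "known-hard": 0, adversarial: 0 };
  for (const c of cases) counts[c.kind] += 1;
  return counts;
}

console.log(countByKind(evalSet));
```

Keeping the kind on each case makes it easy to spot when the set has drifted toward only easy examples.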

A minimal check, in code

You don't need a scoring model to get started. A simple structural check catches a surprising number of regressions:

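// Structural checks for a generated summary. No model call is required.
// extractSubject is assumed to be defined elsewhere and to return the main
// subject of the input as a string.
function checkSummary(input: string, output: string) {
  // Count sentences by splitting on terminal punctuation followed by whitespace.
  const sentenceCount = output.trim().split(/[.!?]+\s/).filter(Boolean).length;
  // Length constraint for the summary (280 characters here).
  const withinLength = output.length <= 280;
  // An empty subject would match any output, so treat it as a miss.
  const subject = extractSubject(input).trim().toLowerCase();
  const mentionsSubject =
    subject.length > 0 && output.toLowerCase().includes(subject);

  return { sentenceCount, withinLength, mentionsSubject };
}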

This won't tell you if the summary is good, but it will tell you if a prompt change silently broke the length constraint or dropped the subject entirely — which is often the failure that actually ships.
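
To run structural checks like these across a whole evaluation set, a small loop is usually enough. The following is a minimal sketch under stated assumptions: the Check, Sample, and Failure types and the runChecks function are hypothetical names introduced for illustration, and the two example checks stand in for whatever constraints a given feature has.

```typescript
// A named predicate over one (input, output) pair. Names are illustrative.
interface Check {
  name: string;
  passes: (input: string, output: string) => boolean;
}

interface Sample {
  id: string;
  input: string;
  output: string; // model output captured for this input
}

interface Failure {
  sampleId: string;
  check: string;
}

// Run every check against every sample and collect the failures.
function runChecks(samples: Sample[], checks: Check[]): Failure[] {
  const failures: Failure[] = [];
  for (const sample of samples) {
    for (const check of checks) {
      if (!check.passes(sample.input, sample.output)) {
        failures.push({ sampleId: sample.id, check: check.name });
      }
    }
  }
  return failures;
}

// Example checks: a length cap and a non-empty guard.
const exampleChecks: Check[] = [
  { name: "withinLength", passes: (_input, output) => output.length <= 280 },
  { name: "nonEmpty", passes: (_input, output) => output.trim().length > 0 },
];

const exampleFailures = runChecks(
  [
    { id: "ok", input: "source text", output: "A short summary." },
    { id: "too-long", input: "source text", output: "x".repeat(300) },
  ],
  exampleChecks,
);

// Logs one failure: sample "too-long" fails the "withinLength" check.
console.log(exampleFailures);
```

Reporting failures by sample id and check name, rather than a single pass rate, points directly at the case that changed.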

When you need a judge

For quality dimensions that resist structural checks, such as tone, helpfulness, or faithfulness to a source document, a second model call acting as a judge is a reasonable next step. It comes with its own failure modes: judge models have biases (for example, favoring longer answers or answers that resemble their own output), and a judge prompt needs the same versioning discipline as any other prompt. Treat the judge as a stronger heuristic, not ground truth.
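
A judge can be sketched without committing to any particular provider by passing the model call in as a function. Everything named here is an assumption made for illustration: ModelCall, JUDGE_PROMPT_VERSION, the prompt wording, and the 1 to 5 scale are hypothetical, not a recommended rubric.

```typescript
// Hypothetical: any function that sends a prompt to a model and returns its text reply.
type ModelCall = (prompt: string) => Promise<string>;

// Version the judge prompt so score changes can be traced to prompt changes.
const JUDGE_PROMPT_VERSION = "faithfulness-v1";

function buildJudgePrompt(source: string, summary: string): string {
  return [
    "Rate how faithful the summary is to the source on a scale of 1 to 5.",
    "Reply with the number only.",
    "Source:",
    source,
    "Summary:",
    summary,
  ].join("\n");
}

// Accept only a bare digit from 1 to 5; anything else is treated as unparseable.
function parseJudgeScore(raw: string): number | null {
  const match = raw.trim().match(/^[1-5]$/);
  return match ? Number(match[0]) : null;
}

interface JudgeResult {
  promptVersion: string;
  score: number | null; // null when the reply could not be parsed
  raw: string; // kept so surprising scores can be inspected later
}

async function judgeFaithfulness(
  callModel: ModelCall,
  source: string,
  summary: string,
): Promise<JudgeResult> {
  const raw = await callModel(buildJudgePrompt(source, summary));
  return {
    promptVersion: JUDGE_PROMPT_VERSION,
    score: parseJudgeScore(raw),
    raw,
  };
}
```

Because the model call is injected, the function can be exercised with a stub such as judgeFaithfulness(async () => "4", source, summary) before any real model is wired in. Recording the prompt version and the raw reply with each score keeps the judge auditable when its scores shift.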

The point isn't perfect coverage

An evaluation set doesn't need to catch every possible regression to be worth building. It needs to catch the regressions that would otherwise reach a user first. Even a handful of checks, run automatically before a prompt or model change ships, turns "I think this is better" into something closer to a fact.
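
As a rough sketch of what "run automatically before a change ships" could look like, the snippet below compares per-case results from the current version against a candidate and lists the cases that got worse. The RunResults type, the findRegressions function, and the sample ids are hypothetical, chosen only for illustration.

```typescript
// Pass/fail per case id for one run of the evaluation set.
type RunResults = Record<string, boolean>;

// Return ids of cases that passed in the baseline but do not pass in the candidate.
// A case missing from the candidate run counts as a regression.
function findRegressions(baseline: RunResults, candidate: RunResults): string[] {
  return Object.keys(baseline)
    .filter((id) => baseline[id] === true && candidate[id] !== true)
    .sort();
}

const baselineRun: RunResults = { "common-1": true, "hard-1": true, "adv-1": false };
const candidateRun: RunResults = { "common-1": true, "hard-1": false, "adv-1": true };

const regressions = findRegressions(baselineRun, candidateRun);
if (regressions.length > 0) {
  // Logs: Blocking change, regressed cases: hard-1
  console.error(`Blocking change, regressed cases: ${regressions.join(", ")}`);
}
```

In a CI job, a non-empty list would typically fail the build. Comparing case by case rather than by overall pass rate matters here: in the example, the candidate fixes one case and breaks another, so the pass rate is unchanged while a previously working case has regressed.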