Testing AI-Enabled Systems When There Is No Single “Right” Answer

. Where AI Changes the Testing Game (and Where It Doesn’t)

Lee maps classic QA ideas onto AI lifecycle stages:

  • Problem definition

    • Poorly defined problems = fuzzy test oracles.

    • Same root issue as unclear requirements in normal software.

  • Data

    • Bad, biased, incomplete data → models that “work” technically but fail real users.

    • You need data validation tests just as much as feature tests.

  • Model behavior

    • Overfitting: great on training data, bad on real data.

    • Underfitting: misses patterns entirely.

    • Explainability: Can you justify why the model is doing what it’s doing?

  • Deployment & production

    • Integrated app testing: APIs, UI, load, security, privacy still matter.

    • Model drift: performance degrades as real-world data changes.

    • Need monitoring and retraining triggers.

👉 As you watch, ask: At my company, where do these AI risks actually show up — data, model, integration, or monitoring?

2. The Core Problem: Non-Deterministic Output

For GenAI features (chatbots, summarizers, assistants, etc.), same prompt ≠ same output:

  • Variations in structure, style, tone, wording are normal.

  • You wouldn’t expect a human to give the exact same answer word-for-word either.

  • But you would expect consistent quality and relevance.

We can’t:

  • Demand exact string matches.

  • Have humans review everything (too slow, too expensive).

So we need ways to:

  • Define what “good output” means per context (e.g., e-commerce chat vs medical advice).

  • Measure consistency, diversity, and appropriateness across many inputs and outputs.

  • Build automated checks that approximate human judgment.

3. Techniques for Testing Non-Deterministic GenAI Output

Lee walks through several concrete techniques.

3.1 Consistency Testing with Perturbed Inputs

Goal: Check that small variations in the input don’t cause wild swings in the output.

  • Vary prompts slightly:

    • Synonyms, phrasing changes, grammar variations, small data tweaks.

  • Compare outputs using:

    • Similarity metrics (semantic similarity, embeddings, etc.).

    • Specialized libraries/utilities for text similarity.

You’re not asking “are these answers identical?” but “are they consistently good and similar enough for our use case?”

3.2 Diversity & Edge Cases

You also want to know:

  • Does the app handle a wide range of prompts (happy paths + edge cases)?

  • Does it remain:

    • Relevant,

    • Non-toxic,

    • Unbiased,

    • On-topic?

Tools like spaCy and nltk can help measure linguistic diversity and patterns, but you still need to look at relevance and safety.

3.3 Gold Standard / Reference-Based Testing

Traditional idea adapted to GenAI:

  • Create “gold” artifacts (ideal answers, summaries, images, data).

  • Compare GenAI output against these using:

    • Similarity scores,

    • Heuristics,

    • Image/text comparison tools.

Key difference vs traditional testing:

  • You don’t expect an exact match.

  • You look for “close enough” quality measured via a score.

3.4 Fuzzing: Unusual & Adversarial Inputs

Use fuzz testing ideas:

  • Feed random, malformed, or adversarial prompts.

  • Check that the app:

    • Doesn’t produce nonsense,

    • Doesn’t return ugly stack traces or internal errors,

    • Fails gracefully and safely.

This is very similar to robustness testing for traditional apps, just adapted to GenAI behavior.

4. Practical Practices Around These Techniques

Lee highlights some important guardrails:

  1. Automate where possible

    • The space of inputs/outputs is huge. Manual-only isn’t viable.

    • Automated evaluation + logging is essential.

  2. Log everything

    • Inputs, outputs, scores, reasons.

    • You’ll need this to debug, tune thresholds, and improve prompts/evaluators.

  3. Be mindful of costs

    • If your tests call external LLMs or models, they incur token $ costs.

    • Separate:

      • “Free”/traditional tests (run often)

      • “Model-in-the-loop” tests (run strategically)

  4. Use sampling and impact analysis

    • Run a small, high-impact subset of model-involving tests first.

    • Based on those results, decide if you need to expand coverage.

  5. Humans stay in the loop

    • Always include a human-reviewed sample:

      • To validate the automated eval methods.

      • To catch semantic or contextual issues automated checks might miss.

5. Using LLMs as Judges for GenAI Output

This is the centerpiece of the talk.

5.1 Why LLM-as-Judge?

Research suggests:

  • A well-prompted LLM like GPT-4 can agree with human experts ~85% of the time.

  • Human experts in the same study only agreed with each other ~81% of the time.

So, if we use LLMs correctly, they can:

  • Approximate human evaluation quality,

  • Be embedded into automated pipelines,

  • Provide scores + explanations, not just a pass/fail.

5.2 The Testing Harness Pattern

Think of this as a test harness around your GenAI feature:

  1. Run the app under test

    • Execute your test scenario normally:

      • e.g., send a prompt, get a generated answer/summary.

  2. Build a judge prompt

    • Combine:

      • The app’s output,

      • (Optionally) a known-good reference output,

      • Context (original prompt, full article, etc.),

      • Clear evaluation instructions and criteria,

      • A request for a score + explanation.

  3. Send to an LLM Judge

    • Pass that meta-prompt to a separate LLM (ideally not the same model as your app).

  4. Receive score + reasoning

    • The judge returns:

      • Criteria-level scores,

      • A written justification.

  5. Apply deterministic thresholds

    • Your test harness:

      • Reads the scores from the judge.

      • Applies pass/fail rules (e.g., “all criteria ≥ 7”).

      • Logs everything.

Result: You convert fuzzy GenAI responses into a deterministic test result, backed by explainable scoring.

5.3 Important Considerations

  • Use a different model

    • To avoid bias, don’t let the same model grade its own work if you can avoid it.

  • Scoring scale

    • Don’t ask for absurdly granular scoring (0–100).

    • LLMs are better with coarser scales or probability-based normalization.

  • Chain-of-thought prompting

    • Give step-by-step evaluation criteria:

      • e.g., “First check relevance, then faithfulness, then tone…”

    • This improves consistency and transparency.

  • Include examples of ideal output

    • Few-shot examples help the judge align with your definition of “good”.

6. Concrete Example: Summarizing Academic Articles

Lee walks through a specific use case:

App: Summarizes academic articles.
Goal: Test quality of its summaries.

Manual approach:

  • Feed articles → generated summaries.

  • SMEs manually review each.

  • Accurate but not scalable.

LLM-as-judge approach:

  1. Calibrate the judge (“train” the prompt)

    • Use human-reviewed examples:

      • Articles + known good/bad summaries.

    • Send them through the LLM judge with your initial prompt.

    • Compare the judge’s evaluations to SME expectations.

    • Tweak the prompt until:

      • It distinguishes good vs bad reliably.

      • Scores and explanations make sense.

    This is “human in the loop” in action: humans set the standard; the judge learns how to mirror it.

  2. Integrate into automated tests

    For each test:

    • Provide:

      • An article to the GenAI summarizer → get generated summary.

    • Build the judge prompt with:

      • Full article,

      • Known good summary (for context),

      • Generated summary,

      • Clear scoring criteria and instructions,

      • Required JSON output format (so your test can parse it).

    • Call the LLM judge (e.g., OpenAI).

    • Judge returns:

      • Scores for each criterion (e.g., relevance, completeness, fidelity),

      • Justification text.

    • Your test applies pass/fail criteria, for example:

      • Known-good summary must score ≥ 8 in all categories.

      • Generated summary must score ≥ 6 in all categories.

    • Log scores, reasons, and which criteria failed if any.

Tech stack they used in the experiment:

  • GenAI app simulation:

    • Local model via LLM Studio using Gemma-2 9B.

  • Prompt generation for the app:

    • Initially created via Grok, then refined by humans.

  • LLM judge:

    • OpenAI (using a paid account).

  • Harness:

    • Simple Python script simulating the test framework:

      • Calls the GenAI summarizer,

      • Builds the judge prompt,

      • Calls the judge,

      • Applies pass/fail logic.

The key takeaway isn’t the specific tools, but the pattern:
GenAI app → Judge LLM → Score + Reason → Deterministic test result.

7. Skills You Need to Test AI Systems

Lee finishes by framing what testers actually need in this space.

a) Non-negotiable foundation

  • Solid software testing skills (nothing about AI replaces this).

  • Curiosity — he calls it out specifically:

    • Asking “why did it do that?”

    • Probing weird edge cases.

    • Challenging metrics and thresholds.

b) General technical skills

  • You don’t need to be a full-fledged data scientist, but you should be comfortable with:

    • Basic scripting,

    • APIs,

    • Data formats,

    • CI/CD & automation tooling.

c) Role-specific AI skills

Depending on where you sit:

  1. Model builders (data science / ML)

    • Need deep:

      • Data engineering,

      • Data science,

      • Knowledge of different model architectures & training methods.

  2. GenAI implementers (building features on top of models)

    • Above, plus:

      • Understanding of generative models,

      • Prompt engineering,

      • Safety & guardrails.

  3. Most testers (where we encounter AI first)

    • Skills around:

      • Testing non-deterministic behavior,

      • Understanding and explaining AI outputs,

      • Techniques like LLM-as-judge, similarity-based checks, fuzzing, and drift monitoring.

You don’t have to master everything in the stack — but you do need enough understanding to design meaningful tests and interpret what the AI is doing.

Comments are closed.

{"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}