Testing AI-Enabled Systems When There Is No Single “Right” Answer

. Where AI Changes the Testing Game (and Where It Doesn’t)

Lee maps classic QA ideas onto AI lifecycle stages:

Problem definition
- Poorly defined problems = fuzzy test oracles.
- Same root issue as unclear requirements in normal software.
Data
- Bad, biased, incomplete data → models that “work” technically but fail real users.
- You need data validation tests just as much as feature tests.
Model behavior
- Overfitting: great on training data, bad on real data.
- Underfitting: misses patterns entirely.
- Explainability: Can you justify why the model is doing what it’s doing?
Deployment & production
- Integrated app testing: APIs, UI, load, security, privacy still matter.
- Model drift: performance degrades as real-world data changes.
- Need monitoring and retraining triggers.

👉 As you watch, ask: At my company, where do these AI risks actually show up — data, model, integration, or monitoring?

2. The Core Problem: Non-Deterministic Output

For GenAI features (chatbots, summarizers, assistants, etc.), same prompt ≠ same output:

Variations in structure, style, tone, wording are normal.
You wouldn’t expect a human to give the exact same answer word-for-word either.
But you would expect consistent quality and relevance.

We can’t:

Demand exact string matches.
Have humans review everything (too slow, too expensive).

So we need ways to:

Define what “good output” means per context (e.g., e-commerce chat vs medical advice).
Measure consistency, diversity, and appropriateness across many inputs and outputs.
Build automated checks that approximate human judgment.

3. Techniques for Testing Non-Deterministic GenAI Output

Lee walks through several concrete techniques.

3.1 Consistency Testing with Perturbed Inputs

Goal: Check that small variations in the input don’t cause wild swings in the output.

Vary prompts slightly:
- Synonyms, phrasing changes, grammar variations, small data tweaks.
Compare outputs using:
- Similarity metrics (semantic similarity, embeddings, etc.).
- Specialized libraries/utilities for text similarity.

You’re not asking “are these answers identical?” but “are they consistently good and similar enough for our use case?”

3.2 Diversity & Edge Cases

You also want to know:

Does the app handle a wide range of prompts (happy paths + edge cases)?
Does it remain:
- Relevant,
- Non-toxic,
- Unbiased,
- On-topic?

Tools like spaCy and nltk can help measure linguistic diversity and patterns, but you still need to look at relevance and safety.

3.3 Gold Standard / Reference-Based Testing

Traditional idea adapted to GenAI:

Create “gold” artifacts (ideal answers, summaries, images, data).
Compare GenAI output against these using:
- Similarity scores,
- Heuristics,
- Image/text comparison tools.

Key difference vs traditional testing:

You don’t expect an exact match.
You look for “close enough” quality measured via a score.

3.4 Fuzzing: Unusual & Adversarial Inputs

Use fuzz testing ideas:

Feed random, malformed, or adversarial prompts.
Check that the app:
- Doesn’t produce nonsense,
- Doesn’t return ugly stack traces or internal errors,
- Fails gracefully and safely.

This is very similar to robustness testing for traditional apps, just adapted to GenAI behavior.

4. Practical Practices Around These Techniques

Lee highlights some important guardrails:

Automate where possible
- The space of inputs/outputs is huge. Manual-only isn’t viable.
- Automated evaluation + logging is essential.
Log everything
- Inputs, outputs, scores, reasons.
- You’ll need this to debug, tune thresholds, and improve prompts/evaluators.
Be mindful of costs
- If your tests call external LLMs or models, they incur token $ costs.
- Separate:
  - “Free”/traditional tests (run often)
  - “Model-in-the-loop” tests (run strategically)
Use sampling and impact analysis
- Run a small, high-impact subset of model-involving tests first.
- Based on those results, decide if you need to expand coverage.
Humans stay in the loop
- Always include a human-reviewed sample:
  - To validate the automated eval methods.
  - To catch semantic or contextual issues automated checks might miss.

5. Using LLMs as Judges for GenAI Output

This is the centerpiece of the talk.

5.1 Why LLM-as-Judge?

Research suggests:

A well-prompted LLM like GPT-4 can agree with human experts ~85% of the time.
Human experts in the same study only agreed with each other ~81% of the time.

So, if we use LLMs correctly, they can:

Approximate human evaluation quality,
Be embedded into automated pipelines,
Provide scores + explanations, not just a pass/fail.

5.2 The Testing Harness Pattern

Think of this as a test harness around your GenAI feature:

Run the app under test
- Execute your test scenario normally:
  - e.g., send a prompt, get a generated answer/summary.
Build a judge prompt
- Combine:
  - The app’s output,
  - (Optionally) a known-good reference output,
  - Context (original prompt, full article, etc.),
  - Clear evaluation instructions and criteria,
  - A request for a score + explanation.
Send to an LLM Judge
- Pass that meta-prompt to a separate LLM (ideally not the same model as your app).
Receive score + reasoning
- The judge returns:
  - Criteria-level scores,
  - A written justification.
Apply deterministic thresholds
- Your test harness:
  - Reads the scores from the judge.
  - Applies pass/fail rules (e.g., “all criteria ≥ 7”).
  - Logs everything.

Result: You convert fuzzy GenAI responses into a deterministic test result, backed by explainable scoring.

5.3 Important Considerations

Use a different model
- To avoid bias, don’t let the same model grade its own work if you can avoid it.
Scoring scale
- Don’t ask for absurdly granular scoring (0–100).
- LLMs are better with coarser scales or probability-based normalization.
Chain-of-thought prompting
- Give step-by-step evaluation criteria:
  - e.g., “First check relevance, then faithfulness, then tone…”
- This improves consistency and transparency.
Include examples of ideal output
- Few-shot examples help the judge align with your definition of “good”.

6. Concrete Example: Summarizing Academic Articles

Lee walks through a specific use case:

App: Summarizes academic articles.
Goal: Test quality of its summaries.

Manual approach:

Feed articles → generated summaries.
SMEs manually review each.
Accurate but not scalable.

LLM-as-judge approach:

Calibrate the judge (“train” the prompt)
- Use human-reviewed examples:
  - Articles + known good/bad summaries.
- Send them through the LLM judge with your initial prompt.
- Compare the judge’s evaluations to SME expectations.
- Tweak the prompt until:
  - It distinguishes good vs bad reliably.
  - Scores and explanations make sense.
This is “human in the loop” in action: humans set the standard; the judge learns how to mirror it.
Integrate into automated tests
For each test:
- Provide:
  - An article to the GenAI summarizer → get generated summary.
- Build the judge prompt with:
  - Full article,
  - Known good summary (for context),
  - Generated summary,
  - Clear scoring criteria and instructions,
  - Required JSON output format (so your test can parse it).
- Call the LLM judge (e.g., OpenAI).
- Judge returns:
  - Scores for each criterion (e.g., relevance, completeness, fidelity),
  - Justification text.
- Your test applies pass/fail criteria, for example:
  - Known-good summary must score ≥ 8 in all categories.
  - Generated summary must score ≥ 6 in all categories.
- Log scores, reasons, and which criteria failed if any.

Tech stack they used in the experiment:

GenAI app simulation:
- Local model via LLM Studio using Gemma-2 9B.
Prompt generation for the app:
- Initially created via Grok, then refined by humans.
LLM judge:
- OpenAI (using a paid account).
Harness:
- Simple Python script simulating the test framework:
  - Calls the GenAI summarizer,
  - Builds the judge prompt,
  - Calls the judge,
  - Applies pass/fail logic.

The key takeaway isn’t the specific tools, but the pattern:
GenAI app → Judge LLM → Score + Reason → Deterministic test result.

7. Skills You Need to Test AI Systems

Lee finishes by framing what testers actually need in this space.

a) Non-negotiable foundation

Solid software testing skills (nothing about AI replaces this).
Curiosity — he calls it out specifically:
- Asking “why did it do that?”
- Probing weird edge cases.
- Challenging metrics and thresholds.

b) General technical skills

You don’t need to be a full-fledged data scientist, but you should be comfortable with:
- Basic scripting,
- APIs,
- Data formats,
- CI/CD & automation tooling.

c) Role-specific AI skills

Depending on where you sit:

Model builders (data science / ML)
- Need deep:
  - Data engineering,
  - Data science,
  - Knowledge of different model architectures & training methods.
GenAI implementers (building features on top of models)
- Above, plus:
  - Understanding of generative models,
  - Prompt engineering,
  - Safety & guardrails.
Most testers (where we encounter AI first)
- Skills around:
  - Testing non-deterministic behavior,
  - Understanding and explaining AI outputs,
  - Techniques like LLM-as-judge, similarity-based checks, fuzzing, and drift monitoring.

You don’t have to master everything in the stack — but you do need enough understanding to design meaningful tests and interpret what the AI is doing.

Done?

Mark lesson complete

Previous Lesson

To Course Page