. Where AI Changes the Testing Game (and Where It Doesn’t)
Lee maps classic QA ideas onto AI lifecycle stages:
Problem definition
Poorly defined problems = fuzzy test oracles.
Same root issue as unclear requirements in normal software.
Data
Bad, biased, incomplete data → models that “work” technically but fail real users.
You need data validation tests just as much as feature tests.
Model behavior
Overfitting: great on training data, bad on real data.
Underfitting: misses patterns entirely.
Explainability: Can you justify why the model is doing what it’s doing?
Deployment & production
Integrated app testing: APIs, UI, load, security, privacy still matter.
Model drift: performance degrades as real-world data changes.
Need monitoring and retraining triggers.
👉 As you watch, ask: At my company, where do these AI risks actually show up — data, model, integration, or monitoring?
2. The Core Problem: Non-Deterministic Output
For GenAI features (chatbots, summarizers, assistants, etc.), same prompt ≠ same output:
Variations in structure, style, tone, wording are normal.
You wouldn’t expect a human to give the exact same answer word-for-word either.
But you would expect consistent quality and relevance.
We can’t:
Demand exact string matches.
Have humans review everything (too slow, too expensive).
So we need ways to:
Define what “good output” means per context (e.g., e-commerce chat vs medical advice).
Measure consistency, diversity, and appropriateness across many inputs and outputs.
Build automated checks that approximate human judgment.
3. Techniques for Testing Non-Deterministic GenAI Output
Lee walks through several concrete techniques.
3.1 Consistency Testing with Perturbed Inputs
Goal: Check that small variations in the input don’t cause wild swings in the output.
Vary prompts slightly:
Synonyms, phrasing changes, grammar variations, small data tweaks.
Compare outputs using:
Similarity metrics (semantic similarity, embeddings, etc.).
Specialized libraries/utilities for text similarity.
You’re not asking “are these answers identical?” but “are they consistently good and similar enough for our use case?”
3.2 Diversity & Edge Cases
You also want to know:
Does the app handle a wide range of prompts (happy paths + edge cases)?
Does it remain:
Relevant,
Non-toxic,
Unbiased,
On-topic?
Tools like spaCy and nltk can help measure linguistic diversity and patterns, but you still need to look at relevance and safety.
3.3 Gold Standard / Reference-Based Testing
Traditional idea adapted to GenAI:
Create “gold” artifacts (ideal answers, summaries, images, data).
Compare GenAI output against these using:
Similarity scores,
Heuristics,
Image/text comparison tools.
Key difference vs traditional testing:
You don’t expect an exact match.
You look for “close enough” quality measured via a score.
3.4 Fuzzing: Unusual & Adversarial Inputs
Use fuzz testing ideas:
Feed random, malformed, or adversarial prompts.
Check that the app:
Doesn’t produce nonsense,
Doesn’t return ugly stack traces or internal errors,
Fails gracefully and safely.
This is very similar to robustness testing for traditional apps, just adapted to GenAI behavior.
4. Practical Practices Around These Techniques
Lee highlights some important guardrails:
Automate where possible
The space of inputs/outputs is huge. Manual-only isn’t viable.
Automated evaluation + logging is essential.
Log everything
Inputs, outputs, scores, reasons.
You’ll need this to debug, tune thresholds, and improve prompts/evaluators.
Be mindful of costs
If your tests call external LLMs or models, they incur token $ costs.
Separate:
“Free”/traditional tests (run often)
“Model-in-the-loop” tests (run strategically)
Use sampling and impact analysis
Run a small, high-impact subset of model-involving tests first.
Based on those results, decide if you need to expand coverage.
Humans stay in the loop
Always include a human-reviewed sample:
To validate the automated eval methods.
To catch semantic or contextual issues automated checks might miss.
5. Using LLMs as Judges for GenAI Output
This is the centerpiece of the talk.
5.1 Why LLM-as-Judge?
Research suggests:
A well-prompted LLM like GPT-4 can agree with human experts ~85% of the time.
Human experts in the same study only agreed with each other ~81% of the time.
So, if we use LLMs correctly, they can:
Approximate human evaluation quality,
Be embedded into automated pipelines,
Provide scores + explanations, not just a pass/fail.
5.2 The Testing Harness Pattern
Think of this as a test harness around your GenAI feature:
Run the app under test
Execute your test scenario normally:
e.g., send a prompt, get a generated answer/summary.
Build a judge prompt
Combine:
The app’s output,
(Optionally) a known-good reference output,
Context (original prompt, full article, etc.),
Clear evaluation instructions and criteria,
A request for a score + explanation.
Send to an LLM Judge
Pass that meta-prompt to a separate LLM (ideally not the same model as your app).
Receive score + reasoning
The judge returns:
Criteria-level scores,
A written justification.
Apply deterministic thresholds
Your test harness:
Reads the scores from the judge.
Applies pass/fail rules (e.g., “all criteria ≥ 7”).
Logs everything.
Result: You convert fuzzy GenAI responses into a deterministic test result, backed by explainable scoring.
5.3 Important Considerations
Use a different model
To avoid bias, don’t let the same model grade its own work if you can avoid it.
Scoring scale
Don’t ask for absurdly granular scoring (0–100).
LLMs are better with coarser scales or probability-based normalization.
Chain-of-thought prompting
Give step-by-step evaluation criteria:
e.g., “First check relevance, then faithfulness, then tone…”
This improves consistency and transparency.
Include examples of ideal output
Few-shot examples help the judge align with your definition of “good”.
6. Concrete Example: Summarizing Academic Articles
Lee walks through a specific use case:
App: Summarizes academic articles.
Goal: Test quality of its summaries.
Manual approach:
Feed articles → generated summaries.
SMEs manually review each.
Accurate but not scalable.
LLM-as-judge approach:
Calibrate the judge (“train” the prompt)
Use human-reviewed examples:
Articles + known good/bad summaries.
Send them through the LLM judge with your initial prompt.
Compare the judge’s evaluations to SME expectations.
Tweak the prompt until:
It distinguishes good vs bad reliably.
Scores and explanations make sense.
This is “human in the loop” in action: humans set the standard; the judge learns how to mirror it.
Integrate into automated tests
For each test:
Provide:
An article to the GenAI summarizer → get generated summary.
Build the judge prompt with:
Full article,
Known good summary (for context),
Generated summary,
Clear scoring criteria and instructions,
Required JSON output format (so your test can parse it).
Call the LLM judge (e.g., OpenAI).
Judge returns:
Scores for each criterion (e.g., relevance, completeness, fidelity),
Justification text.
Your test applies pass/fail criteria, for example:
Known-good summary must score ≥ 8 in all categories.
Generated summary must score ≥ 6 in all categories.
Log scores, reasons, and which criteria failed if any.
Tech stack they used in the experiment:
GenAI app simulation:
Local model via LLM Studio using Gemma-2 9B.
Prompt generation for the app:
Initially created via Grok, then refined by humans.
LLM judge:
OpenAI (using a paid account).
Harness:
Simple Python script simulating the test framework:
Calls the GenAI summarizer,
Builds the judge prompt,
Calls the judge,
Applies pass/fail logic.
The key takeaway isn’t the specific tools, but the pattern:
GenAI app → Judge LLM → Score + Reason → Deterministic test result.
7. Skills You Need to Test AI Systems
Lee finishes by framing what testers actually need in this space.
a) Non-negotiable foundation
Solid software testing skills (nothing about AI replaces this).
Curiosity — he calls it out specifically:
Asking “why did it do that?”
Probing weird edge cases.
Challenging metrics and thresholds.
b) General technical skills
You don’t need to be a full-fledged data scientist, but you should be comfortable with:
Basic scripting,
APIs,
Data formats,
CI/CD & automation tooling.
c) Role-specific AI skills
Depending on where you sit:
Model builders (data science / ML)
Need deep:
Data engineering,
Data science,
Knowledge of different model architectures & training methods.
GenAI implementers (building features on top of models)
Above, plus:
Understanding of generative models,
Prompt engineering,
Safety & guardrails.
Most testers (where we encounter AI first)
Skills around:
Testing non-deterministic behavior,
Understanding and explaining AI outputs,
Techniques like LLM-as-judge, similarity-based checks, fuzzing, and drift monitoring.
You don’t have to master everything in the stack — but you do need enough understanding to design meaningful tests and interpret what the AI is doing.

Comments are closed.