Stop Overtesting: How to Tame Combinatorial Explosion with Risk-Based, Model-Driven Testing

1. The Real Problem: Combinatorial Explosion in the Wild

Marcus opens with a concrete story:

  • Coupon / voucher site with:

    • Different coupon categories

    • Different positions (rails, middle, top, bottom, carousel, etc.)

    • Different link sources (text, image, code)

    • Different browsers and versions

All of this led to ~2,400 possible code paths.
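
To make the arithmetic concrete: a handful of independent dimensions multiply quickly. The per-dimension counts below are illustrative assumptions (the talk only quotes the ~2,400 total), chosen so the product lands in that ballpark:

```python
from itertools import product

# Illustrative dimension sizes -- assumptions, since the talk only gives the total.
categories   = [f"category_{i}" for i in range(10)]           # coupon categories
positions    = ["rail", "middle", "top", "bottom", "carousel"]
link_sources = ["text", "image", "code"]
browsers     = [f"browser_{i}" for i in range(16)]            # browser x version pairs

paths = list(product(categories, positions, link_sources, browsers))
print(len(paths))  # 10 * 5 * 3 * 16 = 2,400 possible code paths
```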

The bug:

“If a coupon is in the fourth position in the carousel, on a Tuesday, in Safari two versions old…” — attribution breaks, revenue is lost, and nobody knows why.

What to notice:

  • This is exactly how real bugs sneak in: weird interactions, not just simple “field X breaks” issues.

  • Testing every path is unrealistic. You need smarter selection.

Ask yourself:

“Where in my own system do we have silent combinatorial explosions like this (pricing, promotions, feature flags, device + OS + locale, etc.)?”

2. Risk vs Coverage: Stop Chasing Magic Percentages

James and Marcus make a key point:

  • Coverage is not a useful single number in complex, distributed, API-heavy systems.

  • You’ll never know “true coverage” across:

    • System logic

    • Data combinations

    • Devices, OS versions, and configurations

    • Internal + external services

Instead, they suggest thinking in 3 dimensions:

  1. System logic & data combinations

    • What portions of logic are tested?

    • Are you hitting the real decision points and business rules?

  2. User journeys / critical paths

    • Which flows actually drive revenue or your key business metric?

    • Are those paths tested more thoroughly and more often than the edge features only 0.5% of users ever touch?

  3. Device & environment mix

    • What browsers / devices / OS versions matter most for your customer base today?

    • How does that change when you expand from one geography (e.g., US) to global markets?

Marcus’ example:

  • In the US only: testing Apple + Samsung might cover ~70% of your users.

  • Expand into the EU and global markets:

    • Apple + Samsung drop to ~40%.

    • You need to care about Xiaomi, Vivo, Oppo, Huawei, etc.

    • Android variants add another layer of combination risk.
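
One way to make this concrete is to weight a candidate device matrix by user share from your analytics. The share figures below are invented for illustration (only the ~70% / ~40% ballpark comes from the talk):

```python
# Hypothetical user-share data -- substitute your own device analytics.
us_share     = {"Apple": 0.50, "Samsung": 0.20, "Google": 0.10, "Other": 0.20}
global_share = {"Apple": 0.22, "Samsung": 0.18, "Xiaomi": 0.14, "Oppo": 0.10,
                "Vivo": 0.09, "Huawei": 0.08, "Other": 0.19}

def risk_coverage(device_matrix, share_by_brand):
    """Fraction of users whose device brand appears in the test matrix."""
    return sum(share_by_brand.get(brand, 0.0) for brand in device_matrix)

matrix = {"Apple", "Samsung"}
print(f"US-only:        {risk_coverage(matrix, us_share):.0%}")      # ~70%
print(f"Global markets: {risk_coverage(matrix, global_share):.0%}")  # ~40%
```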

Key mindset shift:

“Stop aiming for an abstract ‘coverage number’ and start aiming for ‘risk coverage of what keeps the business alive.’”

3. Why Traceability is the Missing Piece

James sets up a list of questions that are nearly impossible to answer without good traceability:

  • Do you know which tests cover a given requirement?

  • If data characteristics change, do you know which code and tests are impacted?

  • If a test fails, can you see:

    • The underlying code changes?

    • The environments involved?

    • The requirements at risk?

Reality in most orgs:

  • Requirements live in Jira.

  • Code in GitHub/GitLab.

  • Tests in separate tools.

  • Environments in spreadsheets or someone’s head.

All of that makes impact analysis and prioritized testing guesswork.

Their proposed solution: a Traceability Lab powered by a graph database (Neo4j):

  • Treat everything as nodes with relationships:

    • Requirements ↔ Models ↔ Tests ↔ Code files ↔ Commits ↔ Environments ↔ People.

  • Use graph queries to:

    • Find commits linked to requirements.

    • Find tests that cover those commits.

    • Spot “risky” commits where devs touch files they’ve never edited before.

    • Build targeted test suites based on actual impact, not intuition.
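
As a rough sketch of what building such a graph might look like with the official Neo4j Python driver: the node labels and relationship types below (Requirement, Developer, Commit, File, Test, AUTHORED, LINKS_TO, TOUCHES, COVERS) are assumptions for illustration, not necessarily the schema used in the lab.

```python
from neo4j import GraphDatabase  # pip install neo4j

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Hypothetical schema: labels and relationship types are illustrative only.
LOAD_QUERY = """
MERGE (r:Requirement {key: $req_key})
MERGE (d:Developer   {name: $author})
MERGE (c:Commit      {sha: $sha})
  SET c.timestamp = $committed_at
MERGE (f:File        {path: $file_path})
MERGE (t:Test        {name: $test_name})
MERGE (d)-[:AUTHORED]->(c)   // who made the change
MERGE (c)-[:LINKS_TO]->(r)   // commit message references the Jira key
MERGE (c)-[:TOUCHES]->(f)    // commit changes this file
MERGE (t)-[:COVERS]->(f)     // test exercises this file
"""

with driver.session() as session:
    session.run(LOAD_QUERY,
                req_key="SHOP-42", author="jane",
                sha="a1b2c3d", committed_at=1700000000,
                file_path="src/checkout/cart.py",
                test_name="test_checkout_happy_path")
driver.close()
```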

This is very similar to how Netflix recommends shows—only here, the “recommendations” are which tests to run.

4. Model-Based Testing: Generating the Right Combinations

James then moves into model-based testing with a demo of an e-commerce app (Shopizer):

  • The app supports:

    • Browsing products

    • Registering users

    • Signing in

    • Adding to cart

    • Checking out

They build models of:

  • Functional flow: e.g., user registration, login, checkout.

  • Supported platforms and environments:

    • Different browsers

    • Different mobile devices

    • Different OS / environment variants

From these models, they:

  1. Auto-generate tests with different coverage profiles:

    • All devices, medium coverage

      • Result: 286 combinations / tests.

    • Each device at least once, functional edges covered

      • Result: 20 tests (a big reduction while retaining good coverage).

    • Android-focused profile

      • Test every Android permutation → 76 tests.

  2. Export test scenarios:

    • Include device mix & environment as part of the generated suite.

    • Run them in parallel on Sauce Labs (or similar cloud).
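
The talk uses a model-based testing tool to generate these suites from the models; as a rough, hand-rolled stand-in, the sketch below contrasts an exhaustive "every combination" profile with an "each device at least once" profile. The flow and device lists are illustrative, not the actual Shopizer models:

```python
from itertools import product

# Illustrative model: functional flows x devices (real models are much richer,
# including browser versions, OS variants, and data states).
flows   = ["register", "sign_in", "browse", "add_to_cart", "checkout"]
devices = ["chrome_windows", "safari_macos", "pixel_8_android", "iphone_15_ios"]

# Profile 1: exhaustive -- every flow on every device (grows multiplicatively).
exhaustive = list(product(flows, devices))

# Profile 2: cover every flow and use each device at least once, by pairing
# them round-robin instead of taking the full cross product.
# (Assumes there are at least as many flows as devices.)
minimal = [(flow, devices[i % len(devices)]) for i, flow in enumerate(flows)]

print(len(exhaustive), len(minimal))  # 20 vs 5 with these toy lists
```

The real coverage profiles do the equivalent over much larger models, which is how 286 combinations can collapse to 20 tests without abandoning device coverage.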

What to pay attention to:

  • The key isn’t “test everything”; it’s:

    • “Systematically cover what matters with the fewest necessary tests.”

  • Coverage profiles let you:

    • Run large suites nightly or pre-release.

    • Run minimal, high-value suites on every commit/PR.

5. Pinpoint Analysis: Turning Failures into Focused Exploration

When some generated tests fail, they don’t just shrug and file a ticket.

Instead, they:

  • Use “pinpoint analysis” to:

    • Identify which environments and which functional areas were involved.

    • Generate a focused mini-suite around that failure.

  • Example:

    • 286 tests → some failures → pinpoint → generate 17 highly targeted paths to explore that specific issue.
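
Conceptually, pinpointing boils down to looking at the attributes of each failing generated test (flow, device, environment), finding the factors the failures share, and generating a small suite that varies only around those factors. A simplified sketch, with made-up result records and field names:

```python
from collections import Counter
from itertools import product

# Hypothetical shape for the generated-test results -- field names are made up.
results = [
    {"flow": "checkout", "device": "safari_macos",   "browser_ver": "16",  "passed": False},
    {"flow": "checkout", "device": "safari_macos",   "browser_ver": "15",  "passed": False},
    {"flow": "browse",   "device": "chrome_windows", "browser_ver": "120", "passed": True},
    # ... the rest of the 286-test run
]

failures = [r for r in results if not r["passed"]]

# Which attribute values do the failures have in common?
common_factors = {
    key: Counter(r[key] for r in failures).most_common(1)[0]   # (value, count)
    for key in ("flow", "device", "browser_ver")
}
print(common_factors)  # e.g. checkout on safari_macos dominates the failures

# Focused mini-suite: vary only the suspect dimensions around the failing
# combination instead of rerunning all 286 tests.
suspect_flows    = ["checkout", "add_to_cart"]
suspect_versions = ["14", "15", "16"]
mini_suite = list(product(suspect_flows, ["safari_macos"], suspect_versions))
print(len(mini_suite))  # 6 targeted paths around the suspect area
```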

What this gives you:

  • Faster debugging: you’re not rerunning the entire suite to “get more data”.

  • Clearer understanding:

    • Was this a data issue?

    • Environment issue?

    • Real functional defect?

This is a very practical way to avoid “we saw a failure on some device somewhere last night” chaos.

6. Traceability Graph in Action: Concrete Examples

James then shows some specific Neo4j-driven queries:

  • Commits ↔ Requirements

    • “Show all commits that are linked to requirements.”

  • Commits ↔ Requirements ↔ Tests

    • “For new commits tied to a requirement, show the tests that cover them.”

  • Developer experience risk

    • “Find commits where the author hasn’t edited that file before.”

    • This surfaces higher-risk changes that may deserve extra scrutiny.
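
In Cypher, the second and third of those queries might look roughly like the sketch below, using the same illustrative labels and relationship types as the earlier loading example. Again, this is an assumed schema, not necessarily the one behind the talk's dashboards.

```python
from neo4j import GraphDatabase  # pip install neo4j

# For a requirement's commits, find the tests that cover the touched files.
IMPACTED_TESTS = """
MATCH (r:Requirement {key: $req_key})<-[:LINKS_TO]-(c:Commit)-[:TOUCHES]->(f:File)
MATCH (t:Test)-[:COVERS]->(f)
RETURN DISTINCT t.name AS test
"""

# "Risky" commits: the author has never touched that file in any earlier commit.
FIRST_TOUCH_COMMITS = """
MATCH (d:Developer)-[:AUTHORED]->(c:Commit)-[:TOUCHES]->(f:File)
WHERE NOT EXISTS {
  MATCH (d)-[:AUTHORED]->(prev:Commit)-[:TOUCHES]->(f)
  WHERE prev.timestamp < c.timestamp
}
RETURN d.name AS author, f.path AS file, c.sha AS commit
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    impacted = [rec["test"] for rec in session.run(IMPACTED_TESTS, req_key="SHOP-42")]
    risky    = session.run(FIRST_TOUCH_COMMITS).data()
driver.close()

print(impacted)  # candidate targeted suite for that requirement's changes
print(risky)     # first-touch commits that may deserve extra scrutiny
```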

They also show dashboards over the graph:

  • Requirement view (e.g., withdraw cash from ATM):

    • Assignee

    • Linked models

    • Linked test cases

    • Sentiment & quality scores from ScopeMaster (ambiguity checks, language quality, etc.).

    • Commits associated with that requirement.

  • From a commit:

    • See which requirements it touches.

    • See which tests directly or indirectly cover that code.

    • Generate a targeted test suite to validate new changes.

Takeaway:

This isn’t just traceability for compliance—it’s actionable traceability that drives which tests you run after which changes.

7. Big Picture Takeaway (Marcus’ Closing Point)

Marcus distills the heart of the talk:

  • There are “primary workflows” in your system:

    • The flows that actually move revenue or your core metric.

  • You will miss bugs in your career—even in those flows.

  • Where you spend your time matters more than your raw test count.

His rule of thumb:

“I’d rather have 40 good solid tests that run on every commit, 10–20 times a day, than 400 unwieldy tests that we only run once a day or once a week.”

In other words:

  • Focus on:

    • Main customer journeys,

    • The dominant devices/OSs for your users,

    • The changes that actually touch critical code paths.

  • Use:

    • Modeling → to design smart coverage

    • Device analytics → to choose which platforms matter most

    • Graph-based traceability → to decide which tests to run when

That’s how you stop overtesting the wrong things and under-testing what matters.
