
Test Doubles and Isolation

Pixel art movie production split-scene: busy green-screen rehearsals with stunt doubles (Stub, Mock, Fake) doing thousands of takes, alongside principal photography with the real stars (Real Database, Live API) for the final few shots. Film strip at bottom shows the ratio. Tagline: Rehearse with doubles. Ship with stars.

Poll: Did you get your GitHub Student Pack set up with CoPilot access?

A. Yes

B. Haven't tried yet

C. Tried, still waiting for it to go through

Poll Everywhere QR Code or Logo

Text espertus to 22333 if the URL isn't working for you.

https://pollev.com/espertus

Poll: When did you request GitHub student access?

A. I already had it

B. Last week

C. Over the weekend

D. Monday

E. Tuesday

F. Today

G. I haven't requested it


CS 3100: Program Design and Implementation II

Lecture 15: Test Doubles and Isolation

©2026 Jonathan Bell & Ellen Spertus, CC-BY-SA

Learning Objectives

After this lecture, you will be able to:

  1. Distinguish between unit, integration, and end-to-end tests
  2. Explain the challenge of testing code with external dependencies
  3. Identify properties of high-quality individual tests (hermetic, clear, non-brittle)
  4. Differentiate between stubs, fakes, and spies as types of test doubles
  5. Apply mocking frameworks to generate test doubles
  6. Evaluate the tradeoffs of using test doubles
  7. Apply AI coding assistants to generate test plans and test doubles

What Makes a Good Test Suite?

Fast, reliable, and finds bugs—and individual tests must be hermetic (self-contained), clear, and non-brittle.

🎯

Finds Bugs

Actually detects defects in the code

🔒

Is Reliable

Passes when code is correct, fails when it's not

⚡

Is Fast

Runs quickly so you actually run it

Suite goals

  • Find bugs
  • Run automatically
  • Be cheap to run

Individual tests

  • Hermetic
  • Clear + debuggable
  • Not brittle (avoid unspecified behavior)
  • Use public APIs only

⚡ What Makes Tests Fast (or Slow)?

I/O dominates—memory is microseconds, network is milliseconds.

Fast (microseconds)

  • Creating objects in memory
  • Calling methods
  • Arithmetic, string operations

Slow (milliseconds to seconds)

  • File system I/O
  • Network calls (APIs, databases)
  • Starting processes/containers

🔒 What Makes Tests Reliable (or Flaky)?

Control all state → deterministic.
Depend on external world → flaky.

A "flaky" test sometimes passes, sometimes fails — same code!

Reliable (deterministic)

  • Same inputs → same outputs
  • No shared mutable state
  • No timing dependencies

Flaky (non-deterministic)

  • Network timeouts
  • Race conditions
  • External service state
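A common source of flakiness is depending on the real clock. The fix is to inject time as a dependency so the test controls it. Here is a minimal sketch; the `Session` class and its fields are hypothetical, not from any course codebase:

```java
import java.util.function.LongSupplier;

// Hypothetical example: a session-expiry check made deterministic by
// injecting the clock instead of calling System.currentTimeMillis() directly.
class Session {
    private final long expiresAtMillis;
    private final LongSupplier clock; // the test supplies a frozen clock

    Session(long expiresAtMillis, LongSupplier clock) {
        this.expiresAtMillis = expiresAtMillis;
        this.clock = clock;
    }

    boolean isExpired() {
        return clock.getAsLong() >= expiresAtMillis;
    }
}

public class DeterministicClockDemo {
    public static void main(String[] args) {
        // Freeze "now" at t = 1000 ms: same inputs, same outputs, every run.
        Session fresh = new Session(2000, () -> 1000L);
        Session stale = new Session(500, () -> 1000L);
        System.out.println(fresh.isExpired()); // false
        System.out.println(stale.isExpired()); // true
    }
}
```

In production you would pass `System::currentTimeMillis` (or `java.time.Clock`); in tests, a lambda returning a fixed value.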

🎯 Measuring Bug-Finding: Coverage

Coverage measures what code ran, not whether it's correct.

Code coverage measures what code your tests execute

  • Statement coverage: % of lines executed

  • Branch coverage: % of decision paths taken (if/else, loops)
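The two metrics can disagree. In this sketch (a hypothetical `discount` method, not from the homework), a single test executes every line yet still misses a branch:

```java
// One test can achieve 100% statement coverage while leaving branch
// coverage incomplete: discount(true) runs every line below, but the
// implicit "else" path of the if is never taken.
public class CoverageDemo {
    static int discount(boolean member) {
        int percent = 0;
        if (member) {          // branch coverage needs BOTH outcomes
            percent = 10;
        }
        return percent;
    }

    public static void main(String[] args) {
        System.out.println(discount(true)); // covers every line...
        // ...but without a discount(false) test, branch coverage is 50%.
    }
}
```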

Code Coverage with Factorial 1

VS Code's Test Explorer lets you run tests with coverage, showing coverage of:

  • lines (green for visited)
  • branches (green for both ways, hashed for one way)
VS Code screenshot showing partial coverage of factorial with a single base-case test

Code Coverage with Factorial 2

VS Code screenshot showing 100% coverage of factorial with base-case and recursive tests

We have 100% coverage!

Poll: What Coverage Guarantees Bug-Free Code?

A. 100% line coverage

B. 100% branch coverage

C. 100% line and branch coverage

D. none of the above


Coverage Can Still Miss Bugs

100% branch coverage ≠ 100% tested behaviors.

Even 100% branch coverage does not mean you tested every behavior.

public int magic(int x, int y) {
    int z;
    if (x != 0) { z = x + 10; }
    else { z = 0; }

    if (y > 0) { return y / z; } // BUG: z == 0 when x == 0
    else { return x; }
}

@Test void t1() { assertEquals(2, magic(1, 22)); }  // covers x != 0, y > 0
@Test void t2() { assertEquals(0, magic(0, -10)); } // covers x == 0, y <= 0

// 100% branch coverage… but magic(0, 5) throws ArithmeticException

The Truth About Testing

Edsger W. Dijkstra
"Program testing can be used to show the presence of bugs, but never to show their absence!"
Edsger W. Dijkstra

Where Do Test Inputs Come From?

Spec-driven and code-driven strategies complement each other.

Two fundamental strategies for choosing test inputs:

Spec-Driven

(from requirements)

  • What should the code do?
  • Boundary values, equivalence classes
  • "2 cups → should convert to 32 tbsp"

Code-Driven

(from implementation)

  • Look at branches, paths
  • Choose inputs to exercise each path
  • Coverage guides what's missing

AI-assisted (new reality): ask for equivalence classes + boundaries… then you validate the oracle.

Example-Based Testing: What You Know

You choose specific inputs from requirements or by examining code.

You write specific test cases with specific inputs

@Test
void shouldConvert2CupsTo32Tablespoons() {
    ExactQuantity twoCups = new ExactQuantity(2, Unit.CUP);

    Quantity result = twoCups.convertTo(Unit.TABLESPOON, registry);

    assertThat(result.getAmount()).isEqualTo(32); // 2 cups × 16 tbsp/cup
}

Parameterized Tests: Same Logic, Many Inputs

Write the test logic once; run it with many inputs.

From the HW3 handout — test the same behavior with different data

static Stream<Arguments> toStringTestCases() {
    return Stream.of(
        Arguments.of(1, "Preheat oven", "1. Preheat oven"),
        Arguments.of(2, "Mix ingredients", "2. Mix ingredients"),
        Arguments.of(10, "Serve warm", "10. Serve warm"),
        Arguments.of(99, "Final step", "99. Final step"));
}

@ParameterizedTest(name = "step {0}: \"{1}\" -> \"{2}\"")
@MethodSource("toStringTestCases")
void shouldFormatCorrectly(int stepNumber, String text, String expected) {
    Instruction instruction = new Instruction(stepNumber, text, List.of());
    assertThat(instruction.toString()).isEqualTo(expected);
}

Fuzzing: Generate LOTS of Inputs

Generate millions of inputs to find crashes you'd never imagine.

Automatically generate inputs to find crashes and bugs

Graybox Fuzzing (used by security researchers):

  • Start with sample inputs, randomly mutate them
  • Monitor code coverage to guide mutation
  • If mutation covers new code → keep it, mutate further
  • Run millions of inputs per second

Google's OSS-Fuzz has found 10,000+ bugs in open-source projects
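The mutation loop above can be sketched in a few lines. This is a toy illustration only — real graybox fuzzers (AFL, libFuzzer, Jazzer) instrument the target and keep only coverage-increasing mutations; here we naively keep every mutation. The `parse` method and its planted bug are hypothetical:

```java
import java.util.Random;

public class TinyFuzzer {
    // Hypothetical target with a planted bug: divides by zero
    // whenever the input contains the character 'A'.
    static int parse(String s) {
        if (s.indexOf('A') >= 0) return 1 / (s.length() - s.length());
        return s.length();
    }

    public static void main(String[] args) {
        Random rng = new Random(42); // fixed seed: reproducible run
        String seed = "hello";       // start from a sample input
        for (int i = 0; i < 100_000; i++) {
            // Mutate: flip one character to a random printable ASCII byte
            char[] chars = seed.toCharArray();
            chars[rng.nextInt(chars.length)] = (char) (32 + rng.nextInt(95));
            String input = new String(chars);
            seed = input; // naive: keep every mutation (graybox keeps only coverage-increasing ones)
            try {
                parse(input);
            } catch (ArithmeticException e) {
                System.out.println("Crash after " + (i + 1) + " inputs: " + input);
                return;
            }
        }
        System.out.println("No crash found");
    }
}
```

The fuzzer never "knows" about the bug; it simply stumbles into an input no human would have thought to write down.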

Unit Tests

Four-panel meme of a blue bird. Panel 1: Bird angrily yells 'Get that thing out of my face!' at a cracker labeled 'Unit Tests'. Panel 2: Bird takes a bite of the cracker with a 'chomp'. Panel 3: Empty. Panel 4: Bird has sparkly eyes and a blissful expression with bubbles around it. Caption reads 'When a unit test helps you find a bug you wouldn't have found otherwise'.

Unit Tests: ⚡ Fast + 🔒 Reliable

Test one unit in isolation: fast, focused, deterministic.

Test a single "unit" — typically one class — in isolation

  • Fast: Run in milliseconds (no I/O, no network)

  • 🔒 Reliable: No external state = deterministic

  • Focused: When it fails, you know exactly where to look

The challenge: achieving isolation when code has dependencies

Unit Test Example: From HW3 Handout

If it fails, you know exactly where the bug is.

Testing Instruction.toString() — one class, no dependencies

@ParameterizedTest(name = "step {0}: \"{1}\" -> \"{2}\"")
@MethodSource("toStringTestCases")
@DisplayName("should format as stepNumber. text")
void shouldFormatCorrectly(int stepNumber, String text, String expected) {
    Instruction instruction = new Instruction(stepNumber, text, List.of());

    assertThat(instruction.toString()).isEqualTo(expected);
}

// Test cases: (1, "Preheat oven", "1. Preheat oven")
//             (2, "Mix ingredients", "2. Mix ingredients")
//             (10, "Serve warm", "10. Serve warm")

Unit Tests Miss Some Problems

Integration Tests: The Middle Ground

Test components together to catch seam bugs.

Test multiple components interacting

  • ⚡≈ Moderate speed: Some I/O, but local

  • 🔒≈ Mostly reliable: Controlled environment

  • 🎯✓ Catches seam bugs: Serialization, protocols, formats

Do components communicate correctly? Agree on data formats?

The Mars Climate Orbiter Disaster

Comic-style illustration of the Mars Climate Orbiter disaster. Top panel shows the spacecraft burning up in Mars' atmosphere while a green alien thinks 'They should have done integration testing.' Bottom left panel labeled 'Jet Propulsion Lab' shows horrified engineers with a whiteboard displaying 'Force: 4.45 N, kg·m/s²'. Bottom right panel labeled 'Lockheed Martin' shows equally distressed engineers with their whiteboard showing 'Force: 1 lbf, pound-force-seconds'.

Integration Test Example: From HW3 Handout

Round-trip tests catch serialization and format issues.

Testing Recipe + JsonRecipeRepository + Jackson + file system together

@Test
@DisplayName("round-trip preserves recipe with all fields")
void roundTripPreservesRecipeWithAllFields() {
    Recipe recipe = new Recipe("test-id", "Chocolate Cake",
        new ExactQuantity(8, Unit.WHOLE),
        List.of(new MeasuredIngredient("flour",
            new ExactQuantity(2, Unit.CUP), null, null)),
        List.of(new Instruction(1, "Mix ingredients", List.of())),
        List.of());

    repository.save(recipe); // Writes JSON to file system
    Optional<Recipe> loaded = repository.findById("test-id");

    assertTrue(loaded.isPresent());
    assertEquals(recipe, loaded.get());
}

End-to-End Tests: 🎯 Finds Real Bugs

Test the whole system as users experience it—slow but realistic.

Test the entire system as a user would experience it

  • ⚡✗ Slow: Seconds or minutes per test

  • 🔒✗ Flaky: Network glitches, timing issues

  • 🎯✓ Realistic: Tests what users actually experience

The opposite tradeoff: sacrifice speed and reliability for realism

E2E Example: Cook Your Books Import-to-Export

One test, many systems—and many potential failure points.

Cook Your Books E2E test workflow

One test, five systems: Image → OCR → Parser → Repository → Exporter

The Practical Mix

E2E tests for critical journeys; unit tests for edge cases.

Why not just write E2E tests for everything?

Imagine testing 15 edge cases for ImportService:

                 Unit Tests            E2E Tests
Example          ImportService logic   Full OCR-to-export workflow
Time to run      ~150ms total          ~30 seconds
Flakiness        None                  OCR API, file system
Debug time       Seconds               Minutes to hours

The Test Pyramid

Test pyramid showing unit tests at base, integration in middle, E2E at top

The Dependency Problem

How do we test code that depends on databases, networks, hardware?

Consider SubmissionService from Pawtograder:

public class SubmissionService {
    private final GitHubService github;
    private final AutograderRunner autograder;
    private final NotificationService notifier;
    private final Database database;

    public SubmissionService(GitHubService github,
                             AutograderRunner autograder,
                             NotificationService notifier,
                             Database database) {
        this.github = github;
        this.autograder = autograder;
        this.notifier = notifier;
        this.database = database;
    }
    // ...
}

The Method Under Test

Testing side effects requires controlling dependencies.

public GradeResult processSubmission(Submission submission) {
    CodeSnapshot code = github.fetchCode(submission.repoUrl());
    TestResult result = autograder.runTests(code);

    if (result.allPassed()) {
        database.saveGrade(submission.studentId(), result.score());
        notifier.send(submission.studentEmail(),
            "Your submission passed!", NotificationLevel.INFO);
    } else {
        notifier.send(submission.studentEmail(),
            "Some tests failed", NotificationLevel.WARNING);
    }
    return new GradeResult(submission.studentId(), result);
}

To test this, we'd need real GitHub, real containers, real email!

Test Doubles: Stand-Ins for Real Dependencies

Stubs return canned answers; fakes work simply; spies record calls.

Stubs

Return canned answers

Fakes

Simplified implementations

Spies

Record what happened

From simplest to most sophisticated

Stubs: Return Canned Answers

Ignore details you don't care about; return what you need.

class StubGitHubService implements GitHubService {
    private final CodeSnapshot fixedCode;

    public StubGitHubService(CodeSnapshot code) {
        this.fixedCode = code;
    }

    @Override
    public CodeSnapshot fetchCode(String repoUrl) {
        return fixedCode; // Always returns the same code
    }
}

Ignores the repo URL, always returns sample code — that's fine!

Spies: Record What Happened (Decorator Pattern)

Wrap, record, delegate—verify interactions after the fact.

class SpyDatabase implements Database {
    private final Database delegate; // Wraps a real implementation
    private boolean saveGradeCalled = false;
    private String savedStudentId = null;
    private int savedScore = -1;

    public SpyDatabase(Database realDatabase) {
        this.delegate = realDatabase; // Decorator pattern!
    }

    @Override
    public void saveGrade(String studentId, int score) {
        this.saveGradeCalled = true; // Record the call
        this.savedStudentId = studentId;
        this.savedScore = score;
        delegate.saveGrade(studentId, score); // Delegate to real impl
    }

    // Query methods for tests
    public boolean wasSaveGradeCalled() { return saveGradeCalled; }
    public String getSavedStudentId() { return savedStudentId; }
    public int getSavedScore() { return savedScore; }
}

A Complete Test with Hand-Rolled Doubles

Stubs + spies enable fast, focused tests.

@Test
public void savesGradeWhenAllTestsPass() {
    StubGitHubService stubGithub = new StubGitHubService(sampleCode());
    StubAutograderRunner stubAutograder = new StubAutograderRunner(
        new TestResult(true, 100)); // All tests pass, score 100
    SpyDatabase spyDatabase = new SpyDatabase(new FakeDatabase()); // spy wraps a simple in-memory fake
    StubNotificationService stubNotifier = new StubNotificationService();

    SubmissionService service = new SubmissionService(
        stubGithub, stubAutograder, stubNotifier, spyDatabase);

    service.processSubmission(new Submission("student123", "repo-url", "email"));

    assertTrue(spyDatabase.wasSaveGradeCalled());
    assertEquals("student123", spyDatabase.getSavedStudentId());
    assertEquals(100, spyDatabase.getSavedScore());
}

The Pain of Hand-Rolling Test Doubles

Four classes for one test? That doesn't scale.

We wrote four separate classes just to test one method!

  • StubGitHubService
  • StubAutograderRunner
  • SpyDatabase
  • StubNotificationService

And we haven't tested failures, timeouts, edge cases...

Do we need a new stub class for every test result?

Mockito is a Test Framework that Dynamically Creates Mocks

A mocking framework generates test doubles at runtime

@Test
public void savesGradeWhenAllTestsPass() {
    // Create test doubles — Mockito generates these at runtime
    GitHubService mockGithub = mock(GitHubService.class);
    AutograderRunner mockAutograder = mock(AutograderRunner.class);
    NotificationService mockNotifier = mock(NotificationService.class);
    Database mockDatabase = mock(Database.class);

    // Configure stub behavior
    when(mockGithub.fetchCode(anyString())).thenReturn(sampleCode());
    when(mockAutograder.runTests(any())).thenReturn(new TestResult(true, 100));

    SubmissionService service = new SubmissionService(
        mockGithub, mockAutograder, mockNotifier, mockDatabase);
    service.processSubmission(new Submission("student123", "repo-url", "email"));

    // Verify spy recordings
    verify(mockDatabase).saveGrade("student123", 100);
    verify(mockNotifier).send(eq("email"), contains("passed"), any());
}

Mockito: Create, Configure, Verify

mock(), when().thenReturn(), verify()—the three operations you'll use most.

mock(Class.class)            Create a test double
when(...).thenReturn(...)    Configure stub behavior
verify(mock).method(...)     Check spy recordings
when(...).thenThrow(...)     Simulate exceptions

Mockito uses reflection to generate implementations at runtime
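To demystify that last point, here is a sketch of the underlying idea using the JDK's built-in dynamic proxies, which can generate an implementation of an interface at runtime. (Mockito itself is far more sophisticated — it uses bytecode generation and can mock classes too. The `cannedMock` helper and the local `GitHubService` interface are illustrative only.)

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

public class ProxyDemo {
    public interface GitHubService {
        String fetchCode(String repoUrl);
    }

    // Generate an implementation of `type` at runtime whose every
    // method returns the same canned answer — a crude mock().
    @SuppressWarnings("unchecked")
    public static <T> T cannedMock(Class<T> type, Object cannedAnswer) {
        InvocationHandler handler = (proxy, method, args) -> cannedAnswer;
        return (T) Proxy.newProxyInstance(
            type.getClassLoader(), new Class<?>[] {type}, handler);
    }

    public static void main(String[] args) {
        GitHubService stub = cannedMock(GitHubService.class, "sample code");
        System.out.println(stub.fetchCode("any-url")); // prints "sample code"
    }
}
```

No `StubGitHubService` class was written anywhere, yet we have a working instance — that is what `mock(GitHubService.class)` buys you.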

AI for Test Doubles (Especially Mockito)

AI generates boilerplate; you evaluate whether mocks match reality.

AI assistants are great at generating boilerplate if you keep evaluation in the loop.

  1. Plan first: write the behaviors you need to simulate and verify

  2. Generate: ask AI to produce when(...)/verify(...) scaffolding

  3. Review: does the mock reflect real dependency behavior?

  4. Break it: introduce a bug — does the test fail?

Argument Matchers: Flexible Matching

Match any value, exact values, or custom predicates.

// Return sample code for ANY repo URL
when(mockGithub.fetchCode(anyString())).thenReturn(sampleCode());

// Verify saveGrade was called with any student ID
verify(mockDatabase).saveGrade(anyString(), anyInt());

// Verify with custom condition
verify(mockNotifier).send(
    anyString(),
    argThat(message -> message.contains("passed")),
    eq(NotificationLevel.INFO)
);

anyString(), any(), eq(), argThat()

Fakes: When You Need Real Behavior

When you need save-then-retrieve, use a working in-memory implementation.

class FakeUserRepository implements UserRepository {
    private final Map<String, User> users = new HashMap<>();

    @Override
    public void save(User user) {
        users.put(user.getId(), user);
    }

    @Override
    public User findById(String id) {
        return users.get(id);
    }

    @Override
    public List<User> findAll() {
        return new ArrayList<>(users.values());
    }
}

A working implementation — just simpler than the real database
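Here is a sketch of the fake in use for a save-then-retrieve test — something a canned-answer stub cannot support, because the return value must depend on what was saved earlier. The minimal `User` record is hypothetical, not from the homework code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FakeRepositoryDemo {
    // Minimal hypothetical User with the getId() the fake stores by
    public record User(String id, String name) {
        public String getId() { return id; }
    }

    public interface UserRepository {
        void save(User user);
        User findById(String id);
        List<User> findAll();
    }

    // Same shape as the slide's FakeUserRepository: a real, working
    // implementation backed by a HashMap instead of a database.
    public static class FakeUserRepository implements UserRepository {
        private final Map<String, User> users = new HashMap<>();
        public void save(User user) { users.put(user.getId(), user); }
        public User findById(String id) { return users.get(id); }
        public List<User> findAll() { return new ArrayList<>(users.values()); }
    }

    public static void main(String[] args) {
        UserRepository repo = new FakeUserRepository();
        repo.save(new User("u1", "Ada"));
        repo.save(new User("u1", "Ada Lovelace")); // overwrite, like a DB upsert
        System.out.println(repo.findById("u1").name()); // prints "Ada Lovelace"
        System.out.println(repo.findAll().size());      // prints 1
    }
}
```

Because the fake honors save/find semantics, tests against it exercise realistic behavior in microseconds, with no I/O.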

The Good: Why Test Doubles Work

Speed, determinism, isolation, and easy error simulation.

  • Speed: Tests run in milliseconds

  • Determinism: No flaky tests from real APIs

  • Isolation: Failures point to the code under test

  • Edge cases: Easy to simulate errors

// Simulating a GitHub API failure — one line!
when(mockGithub.fetchCode(anyString()))
    .thenThrow(new GitHubException("API rate limit exceeded"));

Dependency Injection Simplified

Tweet from patrick thomson @importantshock
I love to use [1] dependency [2] injection [3].

1 pass
2 values
3 functions

8:58 AM - 17 Jan 2019

The Dangerous: False Confidence

False confidence from testing with mocks

The Dangerous: Brittle Tests

Test behavior, not implementation details.

Brittle tests rely on unspecified behaviors:

// BRITTLE: Assumes Set iteration order (unspecified!)
assertThat(recipe.getTags().toString())
    .isEqualTo("[vegetarian, quick, healthy]");

// BRITTLE: Exact string match on unspecified message format
verify(mockNotifier).send(eq("email@example.com"),
    eq("Your submission passed!"), any());

// BETTER: Test the behavior, not the implementation
assertThat(recipe.getTags()).containsExactlyInAnyOrder(
    "vegetarian", "quick", "healthy");

When to Use Test Doubles

If mock setup is more complex than the code, reconsider.

Use test doubles when:

  • Dependency is slow or unreliable
  • Need to simulate error conditions
  • Dependency has side effects (emails, charges)
  • Want to verify interactions

Be cautious when:

  • Mock setup is more complex than code
  • Verifying implementation, not behavior
  • Mocking types you don't own

If mock setup is getting elaborate, consider an integration test instead

Mutation Testing

Educational illustration of mutation testing. A valid snowman labeled 'Original Program Code' stands in the center with arrows pointing to four mutated snowmen around it: Mutant 1 has two heads (Code Duplication), Mutant 2 has an upside-down face (Logic Error), Mutant 3 has a scarf covering its face (Missing Functionality), and Mutant 4 has its nose on its belly (Incorrect Variable Usage). A scientist labeled 'Test Suite' holds a beeping detector against Mutant 3. Title reads 'Mutation Testing: Evaluating Test Quality by Introducing Defects'.

Connection to Your Assignments

Mutation testing grades whether your tests catch bugs, not just pass.

Your test suites are graded by mutation testing

  • We introduce bugs (mutations) into your code
  • Your tests should catch those bugs

  • A test that passes with the bug present = weak test

It's not enough for tests to pass — they must detect bugs!
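To make this concrete, here is a sketch of one classic mutation (relational-operator replacement) and the difference between a weak test and one that kills the mutant. The `passes` method is hypothetical, not the actual grading code:

```java
// Suppose the grader mutates >= to > in this passing-grade check.
public class MutationDemo {
    static boolean passes(int score) {
        return score >= 60;       // original
        // mutant: score > 60  -- survives unless a test uses score == 60
    }

    public static void main(String[] args) {
        // Weak test: passes(80) is true for BOTH original and mutant,
        // so it cannot tell them apart — the mutant survives.
        System.out.println(passes(80)); // true

        // Strong test: the boundary input distinguishes them.
        // Original: true. Mutant: false. The mutant is killed.
        System.out.println(passes(60)); // true
    }
}
```

Note how the mutant-killing input is exactly the boundary value that spec-driven testing would have suggested — the two ideas reinforce each other.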

AI for Test Planning (Plan Mode)

Plan first, generate drafts, then evaluate.

AI assistants are particularly good at generating drafts — especially when you start from a plan.

  • Turning “behaviors to test” into test skeletons
  • Enumerating edge cases (equivalence classes, boundaries)
  • Generating Mockito boilerplate (when/verify)

  • Creating test data and fixtures

Key rule (from Lecture 13): only ask AI to produce what you can evaluate.

The Test-First Evaluation Trick

If you can't describe the oracle, you can't evaluate the AI's tests.

Before you ask AI for tests, answer this yourself:

“What would a good test check?”

  • What behavior should change for pass vs fail?
  • What’s the oracle (strong, not “doesn’t crash”)?

  • What inputs hit boundaries / equivalence classes?
  • What side effects must be verified?

If you can’t answer these, you’re in the “low familiarity” danger zone (Lecture 13): you can’t evaluate the output.

Summary

  • Test scope spectrum: Unit (fast/focused) → Integration → E2E (slow/complete)

  • Good tests: hermetic, clear, non-brittle, with strong oracles

  • Test doubles stand in for real dependencies

  • Stubs return canned answers; spies record interactions; fakes are simplified implementations

  • Mockito generates test doubles at runtime

  • Tradeoffs: Speed/isolation vs false confidence/brittleness

  • Mutation testing: Tests must detect bugs, not just pass (coverage helps find gaps)

  • AI for testing: plan first, generate drafts, then evaluate (don’t “vibe test”)

Next Steps

  • Reading: Lecture notes, Mockito documentation (linked on course site)

  • Next lecture: Designing for Testability

Bonus Slide

Reddit post from r/ProgrammerHumor titled 'Found this at work. Someone padded a repo with thousands of lines like this to pass a 75% code coverage check.' Shows Java code with a method called fakeCoverage() containing 'Integer i = 0;' followed by dozens of identical 'i++;' statements to artificially inflate line count and code coverage metrics.