Pixel art: five Swiss cheese slices as defense layers. Left side has holes aligned with a red danger beam passing through. Right side has offset holes blocking all light. Labels on slices: Tests, Reviews, Rollout, Override, Monitoring. Tagline: Every Layer Has Holes.

CS 3100: Program Design and Implementation II

Lecture 35: Safety and Reliability

©2026 Jonathan Bell, CC-BY-SA

Learning Objectives

After this lecture, you will be able to:

  1. Distinguish safety from reliability
  2. Apply the Swiss cheese model to analyze layered defenses
  3. Analyze blast radius and fail-safe design
  4. Recognize prior course concepts as safety mechanisms
  5. Evaluate safety trade-offs against cost, complexity, and performance, and explain why professional judgment is currently the primary safety mechanism in most software

What Happens When Your Software Doesn't Work — and Who Gets Hurt?

You've built software that's correct, testable, maintainable, performant.

New question: what happens when it fails?

| Concept | Where you learned it | Its safety role |
| --- | --- | --- |
| Race condition prevention | L31 | Prevents state corruption |
| Async error handling | L32 | Makes failures visible |
| Consistency models | L33 | Prevents stale-data operations |
| Profiling + GC | L34 | Safety-performance tradeoff |

These aren't just performance or reliability tools. They are safety mechanisms.

Reliable, Available, and Safe Are Three Different Things

Reliable

Does what it's supposed to do, consistently.

Measure: error rates, MTBF

SceneItAll: activates scenes correctly 99.99% of the time

Available

Accessible when users need it.

Measure: "nines", e.g. 99.9% ≈ 8.76 hrs downtime/yr

GitHub: multiple major outages Feb–Mar 2026 (auth DB overload, Actions failover failures)
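
As a worked check of the "nines" arithmetic, the sketch below converts an availability percentage into allowed downtime per year. The class and method names are illustrative (not from the course code), and it assumes a 365-day year of 8,760 hours.

```java
// Illustrative helper (hypothetical names): availability "nines" to downtime.
final class DowntimeBudget {
    // Assumption: a 365-day year, so 365 * 24 = 8,760 hours.
    static final double HOURS_PER_YEAR = 365 * 24;

    // availabilityPercent is e.g. 99.9 for "three nines".
    static double downtimeHoursPerYear(double availabilityPercent) {
        double unavailableFraction = (100.0 - availabilityPercent) / 100.0;
        return unavailableFraction * HOURS_PER_YEAR;
    }

    public static void main(String[] args) {
        for (double a : new double[] {99.0, 99.9, 99.99}) {
            System.out.printf("%.2f%% -> %.2f hours/year%n", a, downtimeHoursPerYear(a));
        }
    }
}
```

Each extra nine divides the downtime budget by ten: roughly 87.6 hours at 99%, 8.76 hours at 99.9%, and under an hour at 99.99%.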

Safe

Avoids harm, even when it fails.

Measure: incident severity — did anyone get hurt?

Key: a property of the failure mode, not the happy path

  • Reliable and unsafe: A medical device delivers correct doses 99.99% of the time — but its failure mode is lethal. Reliable? Extremely. Safe? Not if one failure kills a patient.
  • Safe but unreliable: Hub crashes frequently but preserves device state and fails to safe defaults.

Safety Emerges from Design — You Can't Add It Later

| Stage | What happens | SceneItAll example | Cost of fixing |
| --- | --- | --- | --- |
| Launch | Happy paths work; few users | 50 beta homes, manual firmware pushes | Moderate |
| Growth | Untested interactions surface | 10,000 homes; scene activations during firmware updates | Migration |
| Scale | Edge cases hit production | Firmware bug bricks 200 devices in one push | Migration + legal + replacements |
| Maturity | Regulatory requirements change | UL/CE requires hardware watchdog timer | All of the above + certification |

The cost of addressing safety grows at each stage, because each stage adds new kinds of cost on top of the previous ones. An atomic firmware write on day one is moderate effort. After a bricking incident, it's a migration plus legal costs, customer replacements, and reputational damage.

Software Affects Safety in Ways You Don't Expect

Direct safety

Door lock bug lets an unauthorized person enter. Software controls a physical actuator; failure causes immediate harm.

Medical devices, autonomous vehicles, industrial control systems.

Indirect safety

Usage analytics reveal when a home is occupied. Not a safety concern at 50 homes — a burglary risk at 100,000.

Recommendation algorithms optimize for engagement, select for outrage. Missing-stakeholder problem (L9).

AI connection: You use Claude Code to generate a door lock controller. If you paste it without careful review, you've removed a Swiss cheese layer — human code review. Replacing human judgment with automation is cheaper and faster, but if the automation has bugs that the human would have caught, you've traded a defense for a vulnerability. You are responsible for the output.

Classify This System: Safe, Reliable, or Both?

| System | Reliable? | Safe? | Why? |
| --- | --- | --- | --- |
| Recommendation algorithm delivers consistent suggestions 99.99% of the time, but optimizes for outrage, harming mental health at scale | | | |
| Password manager crashes randomly, losing sessions, but on crash, locks all stored passwords until restart | | | |
| CrowdStrike Falcon worked correctly 99.999% of the time, but its failure mode bricked 8.5M machines simultaneously | | | |
| SceneItAll hub crashes daily, but always preserves device state and keeps doors locked | | | |

Discuss with a neighbor. Then we'll share answers.

The Swiss Cheese Model: Harm Requires Aligned Holes

Swiss cheese model: four defense layers as cheese slices. Top: holes align and a hazard arrow passes through to cause harm. Bottom: holes don't align and the hazard is blocked. Labels: Contracts, Testing, Architecture, Monitoring.

Recall: You've been building Swiss cheese layers all semester:

| Layer | Where |
| --- | --- |
| Preconditions reject bad inputs | L4 (Contracts) |
| Tests catch bugs before deployment | L15 (Testing) |
| Hex architecture isolates domain from infrastructure | L16 (Testing II) |
| Resilience patterns handle network failures | L20 (Networks) |
| Idempotent consumers handle duplicates | L33 (Event Architecture) |

A single layer with holes is not dangerous on its own. The problem is when someone removes a layer entirely, or when holes grow larger without anyone noticing.

Therac-25: A Race Condition Massively Overdosed Six Patients, Several Fatally

1985-1987. Earlier model (Therac-20) had hardware interlocks — physical mechanisms preventing lethal doses regardless of software. Therac-25 replaced them with software. The software had race conditions (L31).

| Layer | Defense | Hole |
| --- | --- | --- |
| Hardware interlocks | Physical mechanism prevents lethal dose | Removed entirely in Therac-25 |
| Software safety checks | Software validates beam energy before firing | Race condition allowed high-energy beam in electron mode, masked in prior machines by hardware interlock |
| Operator training | Operators trained to recognize error codes | Operators learned to dismiss frequent, cryptic messages |
| Incident reporting | Operators report anomalies to manufacturer | Manufacturer dismissed reports: "software is thoroughly tested" |

All remaining holes aligned. Lethal radiation reached patients.
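
To make the "software safety checks" row concrete, here is a minimal sketch of the check-then-act problem and its L31 fix. It is a simplified illustration, not the Therac-25 code: the class, the energy limit, and the units are all hypothetical. The idea is that mode and energy must change together, and that the safety check and the decision to fire must happen under the same lock.

```java
// Simplified illustration only. This is NOT the Therac-25 code; every name
// and limit here is hypothetical.
final class BeamController {
    enum Mode { ELECTRON, XRAY }

    // Hypothetical limit: energies above this are only valid in XRAY mode.
    static final int MAX_ELECTRON_ENERGY = 25;

    private Mode mode = Mode.ELECTRON;
    private int energy = 0;

    // Mode and energy change together under one lock, so no other thread can
    // observe a half-applied edit (new mode paired with the old energy).
    synchronized boolean configure(Mode newMode, int newEnergy) {
        if (!isSafe(newMode, newEnergy)) {
            return false; // reject and keep the previous, known-good settings
        }
        mode = newMode;
        energy = newEnergy;
        return true;
    }

    // The check and the decision to fire happen under the same lock, so the
    // settings cannot change between "validate" and "fire".
    synchronized boolean fire() {
        if (energy == 0 || !isSafe(mode, energy)) {
            return false; // fail-safe: refuse to fire
        }
        // A real controller would drive the hardware here.
        return true;
    }

    private static boolean isSafe(Mode mode, int energy) {
        if (energy < 0) {
            return false;
        }
        return mode == Mode.XRAY || energy <= MAX_ELECTRON_ENERGY;
    }
}
```

Note that `fire()` re-validates even though `configure()` already rejects unsafe combinations. That redundancy is a second software slice, and it still does not replace the hardware interlock: both checks share the same code and the same bugs.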

Boeing 737 MAX: A Single Sensor, No Pilot Training

2018-2019. 346 killed. New engines changed the plane's aerodynamics. Instead of redesigning the airframe, Boeing added MCAS — software to push the nose down. MCAS relied on a single angle-of-attack sensor.

| Layer | Defense | Hole |
| --- | --- | --- |
| Airframe design | Aerodynamic stability without software | Replaced: MCAS compensates in software |
| Sensor redundancy | Dual sensors with disagree indicator | Optional upgrade; not on crashed aircraft |
| Pilot training | Pilots trained to recognize and override MCAS | Minimized: Boeing marketed "no retraining needed" |
| Pilot override | Pilots can disable automation and fly manually | Pilots didn't know MCAS existed; couldn't diagnose failure |

All holes aligned. MCAS pushed the nose down; pilots couldn't override; planes crashed.

The MCAS Feedback Loop: Where Are the Exits?

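Two exits existed: Exit 1, sensor redundancy with a disagree indicator, and Exit 2, pilot training to recognize and override MCAS. Boeing made Exit 1 an optional upgrade and declared Exit 2 unnecessary by marketing "no retraining needed." Neither crashed aircraft had either exit.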
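
A sensor-disagreement check of the kind a redundant design implies can be sketched in a few lines. This is an illustration of the general pattern only, not Boeing's logic: the class name, the decision labels, and the 5-degree threshold are hypothetical assumptions.

```java
// Illustrative sketch of a two-sensor cross-check; not Boeing's implementation.
final class AoaMonitor {
    // Hypothetical threshold chosen for the example.
    static final double MAX_DISAGREEMENT_DEGREES = 5.0;

    enum Decision { AUTOMATION_ALLOWED, AUTOMATION_DISABLED_ALERT_CREW }

    // Compares two independent angle-of-attack readings. If either reading is
    // missing (NaN) or they disagree too much, the automation stands down and
    // the crew is told, instead of trusting a single sensor.
    static Decision evaluate(double leftDegrees, double rightDegrees) {
        if (Double.isNaN(leftDegrees) || Double.isNaN(rightDegrees)) {
            return Decision.AUTOMATION_DISABLED_ALERT_CREW;
        }
        if (Math.abs(leftDegrees - rightDegrees) > MAX_DISAGREEMENT_DEGREES) {
            return Decision.AUTOMATION_DISABLED_ALERT_CREW;
        }
        return Decision.AUTOMATION_ALLOWED;
    }
}
```

With two sensors the system can detect a disagreement but cannot tell which sensor is wrong, so the safe response is to disable the automation and hand control to a human who knows the system exists.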

CrowdStrike Falcon: 8.5 Million Machines, No Rollback

July 19, 2024. Kernel driver update with null pointer read caused a boot loop on 8.5 million Windows machines simultaneously. Airlines, hospitals, 911 systems went down. $5B+ in losses.

Key failure: "Content updates" bypassed the staged rollout required for "sensor updates." The update went to all 8.5M machines at once. Machines couldn't boot to receive a rollback.

| Layer | Defense | Hole |
| --- | --- | --- |
| Content validation | Automated testing before distribution | Did not catch the null pointer read |
| Staged rollout | Push to 1% first, monitor, expand | Not used for "content updates" |
| Automatic rollback | Revert if failures spike | Machines couldn't boot to receive rollback |
| Fail-safe boot | If driver crashes, boot without it | Driver loads too early; a crash prevents recovery |
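
The "push to 1% first, monitor, expand" row can be sketched as a loop that refuses to expand when a stage's failure rate spikes. This is a minimal illustration under stated assumptions, not CrowdStrike's pipeline: the names, the stage percentages, and the health-check predicate are hypothetical.

```java
import java.util.List;
import java.util.function.Predicate;

// Illustrative sketch; not CrowdStrike's pipeline. Names are hypothetical.
final class StagedRollout {
    /**
     * Pushes an update in stages, e.g. stagePercents = {1, 10, 100}.
     * applyUpdate returns true if the device is healthy after updating.
     * Returns how many devices received the update before the rollout
     * finished or was halted.
     */
    static <D> int rollOut(List<D> devices, int[] stagePercents,
                           double maxFailureRate, Predicate<D> applyUpdate) {
        int pushed = 0;
        for (int percent : stagePercents) {
            // Cumulative target for this stage, rounded up.
            int target = Math.min(devices.size(), (devices.size() * percent + 99) / 100);
            int attempted = 0;
            int failures = 0;
            while (pushed < target) {
                if (!applyUpdate.test(devices.get(pushed))) {
                    failures++;
                }
                attempted++;
                pushed++;
            }
            // Monitor before expanding: a failure spike stops the rollout here.
            if (attempted > 0 && (double) failures / attempted > maxFailureRate) {
                return pushed;
            }
        }
        return pushed;
    }
}
```

With 1,000 devices, stages of 1%, 10%, and 100%, and an update that breaks every device, the rollout halts after the first 10 devices instead of reaching all 1,000. The blast radius is set by the size of the first stage.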

Three Disasters, One Pattern: Removing Layers Is Removing Safety

| Aspect | Therac-25 | Boeing 737 MAX | CrowdStrike Falcon |
| --- | --- | --- | --- |
| What was replaced? | Hardware interlocks | Airframe redesign + pilot training | Manual security review |
| Replaced with? | Software safety checks | MCAS software automation | Automated content update pipeline |
| Why? | Cheaper, lighter | Cheaper, faster certification | Speed: security threats need rapid response |
| Layer removed? | Hardware interlock layer | Sensor redundancy + training | Staged rollout for content updates |
| Critical flaw? | Race conditions | Single point of failure | No rollback path when kernel crashes |
| Could system recover? | Yes: operators could restart | No: planes crashed | No: boot loop, manual access required |

Three questions to ask when replacing hardware/human judgment with software:

  1. What failure modes does software introduce that the original didn't have?
  2. Is there redundancy? What happens when the single sensor/input fails?
  3. Can humans override the automation when it's wrong?

Blast Radius: How Much Breaks When Something Fails?

Blast radius = how much is affected when a component fails. L19: monolith = maximum blast radius. L7: low coupling limits it. Blast radius determines how many Swiss cheese layers you need.

The Citicorp Tower: When Every Layer Had a Hole

The Citicorp Center in Manhattan, showing its distinctive stilted base with columns at the center of each side rather than the corners.

Photo: Andrew Moore, CC BY 4.0

1978. A student's question about the unusual column placement prompts structural engineer LeMessurier to investigate quartering winds: winds hitting the corner at 45°, which the NYC building code didn't require analyzing. He discovers the building could collapse in a storm of a strength expected about once every 16 years. Hurricane season is approaching.

| Layer | Defense | Hole |
| --- | --- | --- |
| Engineer's design | Original spec: welded joints | Contractor switched to bolted: cheaper, but weaker in tension |
| Code compliance | NYC building code review | Code only required perpendicular wind analysis, not quartering winds |
| Change review | Joint substitution should trigger re-analysis | No one recalculated with bolted joints under quartering winds |
| Professional review | Peers and inspectors review the design | Nobody questioned the substitution; the student's professor dismissed the column concern entirely |

Blast radius: a skyscraper in midtown Manhattan. LeMessurier didn't fix it in secret — he brought in every stakeholder: the architect, Citicorp's CEO, an independent structural consultant, NYC's Building Commissioner, the Red Cross, police, the Mayor's Office of Emergency Management. He told the city the whole truth: "the failure of his own office to perceive and communicate the danger." City officials commended him. Nobody was hurt.

From Citicorp to Code: Who Regulates Software?

LeMessurier was a licensed professional engineer. Building codes. Inspections. Liability. Professional boards that can revoke your license.

Most software has none of that. Avionics (DO-178C) and medical devices (FDA) are regulated. Everything else — banking software, social media, autograders, smart home hubs — is governed by... your personal judgment.

"It's not the AI that's liable. It's either the people who made it or the people who use it or both."

— David Parnas, ICSE 2025 Keynote

Parnas argues we should regulate critical software the same way we regulate bridges — licensed engineers, accredited education, independent testing, required specifications. Not because it's "AI" or "not AI," but because the amount of regulation should depend on how important the answer is.

Until that happens, the last Swiss cheese layer is you. Your professional judgment. Your willingness to say "I got a problem, I made the problem, let's fix the problem."

Blast Radius Determines Your Defense Budget

| System | Blast radius | Layers needed |
| --- | --- | --- |
| SceneItAll brightness control | One room's lights are wrong | Error handling + UI feedback |
| SceneItAll door lock | Unauthorized entry | Strong consistency (L33) + redundant sensors + human override |
| SceneItAll firmware update | Device bricked, needs replacement | Rollback + staged rollout + integrity verification |
| Pawtograder gradebook | Every student's GPA in the course | Audit trails + human-in-the-loop + fail-safe defaults |
| Citicorp Tower | 10 blocks of Manhattan | Physical redundancy + independent verification + immediate remediation |
| Boeing 737 MAX MCAS | Everyone on the aircraft | Sensor redundancy + pilot training + override capability |

Same IoT hub, same codebase, same protocol, but the blast radius demands different engineering.

Firmware bricking is expensive. Gradebook corruption affects every student's GPA. More blast radius → more layers.

From "one room" to "everyone on the aircraft": the engineering investment scales with the blast radius.

Fail-Safe vs. Fail-Operational vs. Fail-Dangerous

| Failure | Fail-safe | Fail-dangerous |
| --- | --- | --- |
| Firmware mid-write | Roll back | Partial write (bricked) |
| Autograder crash | "Needs review" | Assign zero |
| Door lock disconnect | Stay locked | Reset unlocked |
| Scene: 1/15 fails | "14/15 updated" | "Success!" |

Boeing MCAS was fail-dangerous — it kept pushing the nose down.
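
The scene-activation row translates directly into code: the fail-safe version reports exactly what happened, while the fail-dangerous version swallows the failure and reports success. The sketch below shows the fail-safe shape. The class, the message format, and the use of a predicate to stand in for a device command are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Illustrative sketch; names and message format are hypothetical.
final class SceneActivation {
    // applySetting returns true if the device confirmed the new setting.
    static <D> String activate(List<D> devices, Predicate<D> applySetting) {
        List<D> failed = new ArrayList<>();
        for (D device : devices) {
            boolean ok;
            try {
                ok = applySetting.test(device);
            } catch (RuntimeException e) {
                ok = false; // an exception is a failure, never a silent success
            }
            if (!ok) {
                failed.add(device);
            }
        }
        int updated = devices.size() - failed.size();
        if (failed.isEmpty()) {
            return "Success: " + updated + "/" + devices.size() + " updated";
        }
        // Partial failure is reported as partial, with the devices named.
        return updated + "/" + devices.size() + " updated; failed: " + failed;
    }
}
```

The design choice is that "Success" is only ever returned when every device confirmed. Anything less names the devices that need attention, which gives the user a chance to act as the next defense layer.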

Design the Recovery: What Should This System Do When It Fails?

For each scenario, decide: fail-safe (stop in a harmless state) or fail-operational (keep providing the function, possibly degraded)? Then design the specific behavior.

| Scenario | Fail-safe would be... | Fail-operational would be... | Which is right? |
| --- | --- | --- | --- |
| Smart thermostat loses internet mid-winter | | | |
| Autograder container runs out of memory | | | |
| Door lock loses Zigbee connection | | | |
| Scene activation: one shade doesn't respond | | | |

Think about the blast radius of each. That determines which mode you need.

Scenario: Firmware Update Bricks a Device

SceneItAll pushes a firmware update to a smart light. Halfway through, the Zigbee connection drops.

| Layer | Defense | Hole? |
| --- | --- | --- |
| Integrity check | Verify firmware checksum before applying | Catches corrupt downloads |
| Atomic write | Write to staging partition, swap after verification | Prevents partial writes from bricking |
| Rollback | If new firmware fails to boot, revert to previous | Catches bad firmware that passes checksum |
| Staged rollout | Update 10% first, monitor, then expand | Limits blast radius to 10% |
| Dead letter queue | Failed updates queue for human review (L33) | Nothing silently lost |

Remove atomic write AND rollback? Bricked. Skip staged rollout and push to all 1,000 devices? CrowdStrike at home scale.
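
The first three layers in the table (integrity check, atomic write, rollback) can be sketched together. This is a minimal in-memory model, not real firmware code: the class and its "slots" are hypothetical, and CRC32 stands in for the checksum. CRC32 detects accidental corruption only; a real updater would also verify a cryptographic signature.

```java
import java.util.zip.CRC32;

// Illustrative in-memory model of staged firmware slots; names are hypothetical.
final class FirmwareSlots {
    private byte[] active;   // image the device boots from
    private byte[] previous; // kept so a bad update can be reverted

    FirmwareSlots(byte[] initialImage) {
        this.active = initialImage.clone();
    }

    static long checksum(byte[] image) {
        CRC32 crc = new CRC32();
        crc.update(image);
        return crc.getValue();
    }

    // The download lands in a staging copy and is verified there. The active
    // image is only replaced by a single swap after verification succeeds.
    boolean install(byte[] received, long expectedChecksum) {
        byte[] staging = received.clone();
        if (checksum(staging) != expectedChecksum) {
            return false; // corrupt or incomplete: active image untouched
        }
        previous = active;
        active = staging;
        return true;
    }

    // Called when the new firmware fails to boot or misbehaves.
    boolean rollback() {
        if (previous == null) {
            return false;
        }
        active = previous;
        previous = null;
        return true;
    }

    byte[] activeImage() {
        return active.clone();
    }
}
```

If the Zigbee connection drops halfway, the received bytes fail verification and `install` returns false with the old image still active. The device keeps running the old firmware instead of a half-written one.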

Scenario: Race Condition on a Door Lock

Two users send conflicting commands to the same smart lock simultaneously — one locks, one unlocks. Same L31 race condition, but with safety-critical consequences.

| Layer | Defense | Hole? |
| --- | --- | --- |
| Sequential consistency | Lock commands use strong consistency (L33) | Prevents stale lock state |
| Atomic operations | synchronized on the lock device: no interleaving | Prevents mixed state |
| Audit trail | Every lock/unlock logged with timestamp and user | Accountability after the fact |
| Physical override | Physical key always works regardless of software | Human can always recover |

Eventual consistency is fine for brightness — a roommate seeing 100% for 5 seconds is harmless. For a door lock, it's not. Blast radius drives the consistency model choice.
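
The "atomic operations" and "audit trail" rows can be combined in one small class. This is a sketch under assumptions: the names are hypothetical, and `synchronized` only serializes commands inside a single hub process. Coordinating several hubs or cloud replicas would need the strong-consistency mechanisms from L33.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch; names are hypothetical.
final class SmartLock {
    enum State { LOCKED, UNLOCKED }

    private State state = State.LOCKED; // safe starting point
    private final List<String> auditTrail = new ArrayList<>();

    // The state change and its audit record happen under one lock, so two
    // conflicting commands are applied one after the other, and the log
    // order always matches the order in which they took effect.
    synchronized State apply(String user, State requested) {
        State before = state;
        state = requested;
        auditTrail.add(user + ": " + before + " -> " + requested
                + " at " + System.currentTimeMillis());
        return state;
    }

    synchronized State state() {
        return state;
    }

    synchronized List<String> auditTrail() {
        return new ArrayList<>(auditTrail); // defensive copy
    }
}
```

The lock does not decide who should win when one user locks and another unlocks. It guarantees that the final state is the result of one whole command, and that the audit trail shows which one came last.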

Scenario: Pawtograder Autograder Crashes Mid-Run

The autograder crashes mid-run — out of memory, network timeout, or a bug in the grading script. What grade does the student see?

| Layer | Defense | Hole? |
| --- | --- | --- |
| Error classification | Distinguish "student tests failed" from "grader crashed" | If both produce exit code 1, they're conflated |
| Fail-safe default | Infrastructure failure = "internal error, needs review" | Only works if failure is classified correctly |
| Retry | Automatically retry infrastructure failures once | Helps transient failures; not deterministic crashes |
| Audit trail | Log every run with exit code, stderr, timing | Enables after-the-fact investigation |
| Student notification | "Your submission is being re-graded" vs "0/100" | "0/100" with no explanation is fail-dangerous |
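
The first two layers can be sketched as a single classification function. This is not Pawtograder's implementation: the class is hypothetical, and so is the exit-code convention, which assumes the grading script exits 0 whenever it ran to completion (failing student tests lower the score but do not change the exit code).

```java
// Illustrative sketch; not Pawtograder's implementation. The exit-code
// convention below is a hypothetical assumption.
final class GraderOutcome {
    enum Kind { GRADED, NEEDS_REVIEW }

    final Kind kind;
    final Integer score;  // null unless kind == GRADED
    final String message; // what the student sees

    private GraderOutcome(Kind kind, Integer score, String message) {
        this.kind = kind;
        this.score = score;
        this.message = message;
    }

    /**
     * Assumed convention: exit code 0 means the grading script ran to
     * completion and reported a score. Any other exit code, or a missing or
     * out-of-range score, is treated as an infrastructure failure.
     */
    static GraderOutcome classify(int exitCode, Integer reportedScore) {
        boolean completed = exitCode == 0
                && reportedScore != null
                && reportedScore >= 0
                && reportedScore <= 100;
        if (completed) {
            return new GraderOutcome(Kind.GRADED, reportedScore, reportedScore + "/100");
        }
        // Fail-safe default: never turn a grader crash into a zero.
        return new GraderOutcome(Kind.NEEDS_REVIEW, null,
                "Internal error, needs manual review (exit code " + exitCode + ")");
    }
}
```

For example, a container killed for running out of memory commonly exits with code 137. Under this sketch that becomes NEEDS_REVIEW with no score attached, so there is no zero for the gradebook to record by accident.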

"Internal error, needs manual review" is fail-safe. Silently assigning zero is fail-dangerous.

Pawtograder Gradebook: Audit Trails Catch Human Errors Too

A staff member accidentally updates the wrong gradebook column — 35 students' participation bonus scores overwritten with zeros. Initial report: "looks like a software bug — grades changed without any submission."

| Layer | What it caught |
| --- | --- |
| Audit table | Logged the exact staff member, timestamp, old values, new values |
| Student visibility | Student noticed grade drop within hours, not weeks |
| Flag mechanism | Student flagged concern → triggered investigation |
| Professor audit view | Confirmed: single bulk update by one staff member, not a software bug |
| Reversibility | Old values in audit table → all overwritten grades restored in minutes |

L24 callback: this was a slip — right intention, wrong column. The audit trail doesn't prevent the slip, but it makes it detectable, attributable, and reversible.
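
The reason the audit table makes the slip reversible is that every write stores the value it replaced. A minimal sketch of that idea follows. It is not Pawtograder's schema: the class, the batch identifiers, and the field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch; not Pawtograder's schema. All names are hypothetical.
final class AuditedGradebook {
    static final class AuditEntry {
        final String staff;
        final int batchId;
        final String student;
        final Integer oldValue;
        final Integer newValue;
        final long timestampMillis = System.currentTimeMillis();

        AuditEntry(String staff, int batchId, String student,
                   Integer oldValue, Integer newValue) {
            this.staff = staff;
            this.batchId = batchId;
            this.student = student;
            this.oldValue = oldValue;
            this.newValue = newValue;
        }
    }

    private final Map<String, Integer> scores = new HashMap<>();
    private final List<AuditEntry> audit = new ArrayList<>();

    // Every write records who changed what, and the value it replaced.
    void set(String staff, int batchId, String student, Integer value) {
        Integer old = scores.get(student);
        if (value == null) {
            scores.remove(student);
        } else {
            scores.put(student, value);
        }
        audit.add(new AuditEntry(staff, batchId, student, old, value));
    }

    Integer get(String student) {
        return scores.get(student);
    }

    int auditSize() {
        return audit.size();
    }

    // Undoes one bulk update by restoring its old values, newest first.
    // The revert is itself audited under revertBatchId.
    int revertBatch(String staff, int batchId, int revertBatchId) {
        List<AuditEntry> snapshot = new ArrayList<>(audit);
        int restored = 0;
        for (int i = snapshot.size() - 1; i >= 0; i--) {
            AuditEntry e = snapshot.get(i);
            if (e.batchId == batchId) {
                set(staff, revertBatchId, e.student, e.oldValue);
                restored++;
            }
        }
        return restored;
    }
}
```

One caveat in this sketch: the revert blindly restores old values, so a legitimate edit made after the bad batch would be overwritten too. A production version would need to detect that case and ask a human.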

The Meta-Slice: Knowing the Defense System Exists

Every Swiss cheese layer assumes someone knows it's there. A defense you don't know about is a defense you can't use, maintain, or trigger.

| Case | Who knew the layers? | Outcome |
| --- | --- | --- |
| Boeing 737 MAX | Pilots didn't know MCAS existed | Could not diagnose or override; 346 killed |
| Citicorp Tower | LeMessurier told every stakeholder | Each person knew their role; building saved |
| Pawtograder (alone) | Student sees 0/100, assumes "I failed" | Never triggers flag mechanism |
| Pawtograder (compares notes) | Student discovers classmates also got 0 | Reclassifies as systemic: flags it, triggers investigation |
| Pawtograder (grade finalization) | Multiple instructors review all grades before submission | Catches systematic errors even if no student flags them |

Practical rule: When you see a surprising result — a 0/100, a mysterious crash, a grade that doesn't match your work — compare notes before assuming you're the problem. You might be the person who activates the next defense layer.

Comprehension Check

Open Poll Everywhere and answer the three questions.

You Already Know How to Prevent These Failures

| Safety pattern | Where you learned it | What it prevents |
| --- | --- | --- |
| Contracts & validation | L4: Specifications | Bad inputs propagating through the system |
| Information hiding | L6: Changeability | Unintended dependencies that break silently |
| Low coupling | L7: Coupling & Cohesion | Failures cascading across module boundaries |
| Testing at every scope | L15-L16: Testing | Bugs reaching production undetected |
| Timeouts & circuit breakers | L20: Networks | One slow service taking down the whole system |
| synchronized & atomicity | L31: Concurrency | Race conditions corrupting shared state |
| .exceptionally() & .orTimeout() | L32: Async | Silent failures hiding unsafe states |
| Idempotency & staged rollout | L33: Events | Duplicates causing harm; blast radius of bad deploys |

These aren't exotic safety tools. They're the patterns you already use — applied where the blast radius includes human safety.
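
As one example of an L32 pattern doing safety work, the sketch below wraps an asynchronous lock-status call so that a hang or an error becomes a visible "unknown" instead of a stale or missing answer. `orTimeout` and `exceptionally` are standard `CompletableFuture` methods (Java 9+). The class name, the fallback text, and the idea that the future comes from a network call are assumptions for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Illustrative sketch; class name and fallback text are hypothetical.
final class LockStatusClient {
    // statusCall would normally come from a network request to the lock.
    static CompletableFuture<String> withSafeFallback(
            CompletableFuture<String> statusCall, long timeoutMillis) {
        return statusCall
                // A call that never answers fails with a TimeoutException
                // instead of hanging forever.
                .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS)
                // Any failure becomes an explicit, visible state. The UI never
                // shows "LOCKED" unless the lock actually said so.
                .exceptionally(error -> "UNKNOWN (could not reach lock)");
    }
}
```

The fallback is deliberately not "LOCKED". Reporting a comforting default when the real state is unknown is the fail-dangerous version of this code.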

Safety Debt Compounds: Same Code, Growing Blast Radius

The code didn't change. The blast radius did. Safety debt is not the code getting worse — it's the consequences getting larger while the same holes remain open.

Who Profits, Who Bears the Risk?

Boeing sold sensor redundancy as an optional upgrade. Airlines serving price-sensitive passengers flew with less redundancy.

Cost savings accrued to Boeing and airlines. Risk fell on passengers who didn't know.

The same pattern appears in every safety-performance tradeoff:

| Trade-off | Cost of safety | Cost of not having it |
| --- | --- | --- |
| Strong consistency (slower) | Performance overhead | Lock shows "locked" when unlocked |
| Error handling (more complex) | Code complexity | Silent failures hide unsafe states |
| Staged rollouts (slower) | Deployment speed | CrowdStrike-scale blast radius |
| Redundant sensors (more expensive) | Hardware cost | Single point of failure |

Not "can we afford safety?" but "can we afford the consequences of not having it?"

Looking Ahead

L36 (Thursday): Sustainability

We asked "who profits and who bears the risk?" for safety decisions. Thursday we generalize: sustainability — the meta-quality attribute that asks whether ALL your quality attributes hold up over time, and for whom.

GA1 due April 9 — think about error handling in your async chains. What happens when a network call fails? Does your app fail safely or fail dangerously?

Today: what happens when your software fails, and who gets hurt. Thursday: who benefits from your design decisions, who bears the cost, and over what time horizon.