
CS 3100: Program Design and Implementation II
Lecture 35: Safety and Reliability
©2026 Jonathan Bell, CC-BY-SA
Learning Objectives
After this lecture, you will be able to:
- Distinguish safety from reliability
- Apply the Swiss cheese model to analyze layered defenses
- Analyze blast radius and fail-safe design
- Recognize prior course concepts as safety mechanisms
- Evaluate safety trade-offs against cost, complexity, and performance, and explain why professional judgment is currently the primary safety mechanism in most software
What Happens When Your Software Doesn't Work — and Who Gets Hurt?
You've built software that's correct, testable, maintainable, performant.
New question: what happens when it fails?
| Concept | Where you learned it | Its safety role |
|---|---|---|
| Race condition prevention | L31 | Prevents state corruption |
| Async error handling | L32 | Makes failures visible |
| Consistency models | L33 | Prevents stale-data operations |
| Profiling + GC | L34 | Safety-performance tradeoff |
These aren't just performance or reliability tools. They are safety mechanisms.
Reliable, Available, and Safe Are Three Different Things
Reliable
Does what it's supposed to do, consistently.
Measure: error rates, MTBF
SceneItAll: activates scenes correctly 99.99% of the time
Available
Accessible when users need it.
Measure: "nines" (99.9% ≈ 8.76 hrs downtime/yr)
GitHub: multiple major outages Feb–Mar 2026 (auth DB overload, Actions failover failures)
Safe
Avoids harm, even when it fails.
Measure: incident severity — did anyone get hurt?
Key: a property of the failure mode, not the happy path
- Reliable and unsafe: A medical device delivers correct doses 99.99% of the time — but its failure mode is lethal. Reliable? Extremely. Safe? Not if one failure kills a patient.
- Safe but unreliable: Hub crashes frequently but preserves device state and fails to safe defaults.
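The "nines" measure above is plain arithmetic: allowed downtime is the unavailable fraction times the hours in a year. A minimal sketch, assuming a 365-day year; the class and method names are illustrative, not part of any course codebase.

```java
// Illustrative helper (an assumption, not from the lecture): converts an
// availability percentage into the downtime it permits per year.
final class Nines {
    private static final double HOURS_PER_YEAR = 365 * 24; // 8,760; ignores leap years

    /** Allowed downtime in hours per year for a given availability percentage. */
    static double downtimeHoursPerYear(double availabilityPercent) {
        if (availabilityPercent < 0 || availabilityPercent > 100) {
            throw new IllegalArgumentException("availability must be in [0, 100]");
        }
        return (1.0 - availabilityPercent / 100.0) * HOURS_PER_YEAR;
    }

    public static void main(String[] args) {
        for (double availability : new double[] {99.0, 99.9, 99.99}) {
            System.out.printf("%.2f%% -> %.2f hours/year%n",
                    availability, downtimeHoursPerYear(availability));
        }
    }
}
```

Note what the number does not tell you: it measures how often the system is down, not what happens to people while it is down. That second question is the safety question.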
Safety Emerges from Design — You Can't Add It Later
| Stage | What happens | SceneItAll example | Cost of fixing |
|---|---|---|---|
| Launch | Happy paths work; few users | 50 beta homes, manual firmware pushes | Moderate |
| Growth | Untested interactions surface | 10,000 homes; scene activations during firmware updates | Migration |
| Scale | Edge cases hit production | Firmware bug bricks 200 devices in one push | Migration + legal + replacements |
| Maturity | Regulatory requirements change | UL/CE requires hardware watchdog timer | All of the above + certification |
The cost of addressing safety rises steeply at each stage. An atomic firmware write on day one is moderate effort. After a bricking incident, it's a migration plus legal costs, customer replacements, and reputational damage.
Software Affects Safety in Ways You Don't Expect
Direct safety
Door lock bug lets an unauthorized person enter. Software controls a physical actuator; failure causes immediate harm.
Medical devices, autonomous vehicles, industrial control systems.
Indirect safety
Usage analytics reveal when a home is occupied. Not a safety concern at 50 homes — a burglary risk at 100,000.
Recommendation algorithms optimize for engagement, select for outrage. Missing-stakeholder problem (L9).
AI connection: You use Claude Code to generate a door lock controller. If you paste it without careful review, you've removed a Swiss cheese layer — human code review. Replacing human judgment with automation is cheaper and faster, but if the automation has bugs that the human would have caught, you've traded a defense for a vulnerability. You are responsible for the output.
Classify This System: Safe, Reliable, or Both?
| System | Reliable? | Safe? | Why? |
|---|---|---|---|
| Recommendation algorithm delivers consistent suggestions 99.99% of the time — but optimizes for outrage, harming mental health at scale | ✅ | ❓ | |
| Password manager crashes randomly, losing sessions — but on crash, locks all stored passwords until restart | ❓ | ❓ | |
| CrowdStrike Falcon worked correctly 99.999% of the time — but its failure mode bricked 8.5M machines simultaneously | ❓ | ❓ | |
| SceneItAll hub crashes daily, but always preserves device state and keeps doors locked | ❓ | ❓ | |
Discuss with a neighbor. Then we'll share answers.
The Swiss Cheese Model: Harm Requires Aligned Holes

Recall: You've been building Swiss cheese layers all semester:
| Layer | Where |
|---|---|
| Preconditions reject bad inputs | L4 (Contracts) |
| Tests catch bugs before deployment | L15 (Testing) |
| Hex architecture isolates domain from infrastructure | L16 (Testing II) |
| Resilience patterns handle network failures | L20 (Networks) |
| Idempotent consumers handle duplicates | L33 (Event Architecture) |
A single layer with holes is not dangerous on its own. The problem is when someone removes a layer entirely, or when holes grow larger without anyone noticing.
Therac-25: A Race Condition Behind Six Massive Overdoses, Several of Them Fatal
1985-1987. Earlier model (Therac-20) had hardware interlocks — physical mechanisms preventing lethal doses regardless of software. Therac-25 replaced them with software. The software had race conditions (L31).
| Layer | Defense | Hole |
|---|---|---|
| Hardware interlocks | Physical mechanism prevents lethal dose | Removed entirely in Therac-25 |
| Software safety checks | Software validates beam energy before firing | Race condition allowed high-energy beam in electron mode, masked in prior machines by hardware interlock |
| Operator training | Operators trained to recognize error codes | Operators learned to dismiss frequent, cryptic messages |
| Incident reporting | Operators report anomalies to manufacturer | Manufacturer dismissed reports: "software is thoroughly tested" |
All remaining holes aligned. Lethal radiation reached patients.
Boeing 737 MAX: A Single Sensor, No Pilot Training
2018-2019. 346 killed. New engines changed the plane's aerodynamics. Instead of redesigning the airframe, Boeing added MCAS — software to push the nose down. MCAS relied on a single angle-of-attack sensor.
| Layer | Defense | Hole |
|---|---|---|
| Airframe design | Aerodynamic stability without software | Replaced — MCAS compensates in software |
| Sensor redundancy | Dual sensors with disagree indicator | Optional upgrade — not on crashed aircraft |
| Pilot training | Pilots trained to recognize and override MCAS | Minimized — Boeing marketed "no retraining needed" |
| Pilot override | Pilots can disable automation and fly manually | Pilots didn't know MCAS existed; couldn't diagnose failure |
All holes aligned. MCAS pushed the nose down; pilots couldn't override; planes crashed.
The MCAS Feedback Loop: Where Are the Exits?
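Two exits existed: cross-checking a second sensor (Exit 1) and training pilots to recognize and override MCAS (Exit 2). Boeing made Exit 1 an optional upgrade and presented Exit 2 as unnecessary by marketing "no retraining needed." Both crashed aircraft had neither exit.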
CrowdStrike Falcon: 8.5 Million Machines, No Rollback
July 19, 2024. A content update for the Falcon kernel driver triggered an out-of-bounds memory read, causing a boot loop on 8.5 million Windows machines simultaneously. Airlines, hospitals, 911 systems went down. $5B+ in losses.
Key failure: "Content updates" bypassed the staged rollout required for "sensor updates." The update went to all 8.5M machines at once. Machines couldn't boot to receive a rollback.
| Layer | Defense | Hole |
|---|---|---|
| Content validation | Automated testing before distribution | Did not catch the out-of-bounds memory read |
| Staged rollout | Push to 1% first, monitor, expand | Not used for "content updates" |
| Automatic rollback | Revert if failures spike | Machines couldn't boot to receive rollback |
| Fail-safe boot | If driver crashes, boot without it | Driver loads too early — crash prevents recovery |
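The staged-rollout layer in the table can be sketched in a few lines. This is a hypothetical illustration, not CrowdStrike's pipeline: the class `StagedRollout`, the wave percentages, and the `updateSucceeded` health check are all assumptions made for the example.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: none of these names come from a real deployment system.
final class StagedRollout {
    /** What the rollout did: devices attempted, failures seen, whether it stopped early. */
    static final class Result {
        final int attempted;
        final int failed;
        final boolean halted;

        Result(int attempted, int failed, boolean halted) {
            this.attempted = attempted;
            this.failed = failed;
            this.halted = halted;
        }
    }

    /**
     * Pushes an update in waves. cumulativePercents such as {1, 10, 100} means
     * "1% of the fleet, then up to 10%, then everyone". After each wave, that
     * wave's failure rate is checked; exceeding maxFailureRate halts the rollout,
     * so the remaining devices are never touched.
     */
    static Result run(List<String> devices, int[] cumulativePercents,
                      Predicate<String> updateSucceeded, double maxFailureRate) {
        int attempted = 0;
        int failed = 0;
        for (int percent : cumulativePercents) {
            // Ceiling division so a small fleet still gets at least one canary device.
            int target = Math.min(devices.size(), (devices.size() * percent + 99) / 100);
            int waveSize = target - attempted;
            int waveFailures = 0;
            while (attempted < target) {
                if (!updateSucceeded.test(devices.get(attempted))) {
                    waveFailures++;
                }
                attempted++;
            }
            failed += waveFailures;
            if (waveSize > 0 && (double) waveFailures / waveSize > maxFailureRate) {
                return new Result(attempted, failed, true); // blast radius = this wave only
            }
        }
        return new Result(attempted, failed, false);
    }
}
```

With a fleet of 100 devices, waves of `{1, 10, 100}`, and an update that fails everywhere, the rollout stops after a single device. The sketch only helps if every kind of update goes through it; the hole in the table is that one category of update skipped this path.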
Three Disasters, One Pattern: Removing Layers Is Removing Safety
| Aspect | Therac-25 | Boeing 737 MAX | CrowdStrike Falcon |
|---|---|---|---|
| What was replaced? | Hardware interlocks | Airframe redesign + pilot training | Manual security review |
| Replaced with? | Software safety checks | MCAS software automation | Automated content update pipeline |
| Why? | Cheaper, lighter | Cheaper, faster certification | Speed — security threats need rapid response |
| Layer removed? | Hardware interlock layer | Sensor redundancy + training | Staged rollout for content updates |
| Critical flaw? | Race conditions | Single point of failure | No rollback path when kernel crashes |
| Could system recover? | Yes — operators could restart | No — planes crashed | No — boot loop, manual access required |
Three questions to ask when replacing hardware/human judgment with software:
- What failure modes does software introduce that the original didn't have?
- Is there redundancy? What happens when the single sensor/input fails?
- Can humans override the automation when it's wrong?
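Questions two and three can be made concrete with a small decision function. This is a deliberately simplified sketch and not how MCAS or any real control system is implemented; `SensorGuard`, its thresholds, and its action names are invented for illustration.

```java
// Hypothetical sketch of "redundancy + human override". All names and numbers
// are assumptions for the example, not real control-system logic.
final class SensorGuard {
    enum Action { NO_ACTION, AUTOMATED_CORRECTION, DISENGAGE_AND_ALERT }

    /**
     * Decides what the automation may do given two independent readings.
     *
     * @param readingA       first sensor reading
     * @param readingB       second, independent sensor reading
     * @param humanOverride  true if the operator has switched the automation off
     * @param maxDisagree    largest difference between sensors that is still trusted
     * @param actThreshold   average reading above which the automation intervenes
     */
    static Action decide(double readingA, double readingB, boolean humanOverride,
                         double maxDisagree, double actThreshold) {
        // Question 3: the human can always turn the automation off.
        if (humanOverride) {
            return Action.NO_ACTION;
        }
        // Question 2: a missing or disagreeing sensor means we do not know the
        // true state, so the automation stands down and tells the operator.
        if (Double.isNaN(readingA) || Double.isNaN(readingB)
                || Math.abs(readingA - readingB) > maxDisagree) {
            return Action.DISENGAGE_AND_ALERT;
        }
        double average = (readingA + readingB) / 2.0;
        return average > actThreshold ? Action.AUTOMATED_CORRECTION : Action.NO_ACTION;
    }
}
```

The design choice worth noticing is the order of the checks: override first, disagreement second, automated action last. A single-sensor design cannot even express the second check.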
Blast Radius: How Much Breaks When Something Fails?
Blast radius = how much is affected when a component fails. L19: monolith = maximum blast radius. L7: low coupling limits it. Blast radius determines how many Swiss cheese layers you need.
The Citicorp Tower: When Every Layer Had a Hole

Photo: Andrew Moore, CC BY 4.0
1978. A student's question about the unusual column placement prompts structural engineer LeMessurier to investigate quartering winds: winds hitting the building's corner at 45°, which the NYC building code didn't require analyzing. He discovers the building could collapse in a storm strong enough to occur about once every 16 years. Hurricane season is approaching.
| Layer | Defense | Hole |
|---|---|---|
| Engineer's design | Original spec: welded joints | Contractor switched to bolted — cheaper, but weaker in tension |
| Code compliance | NYC building code review | Code only required perpendicular wind analysis, not quartering winds |
| Change review | Joint substitution should trigger re-analysis | No one recalculated with bolted joints under quartering winds |
| Professional review | Peers and inspectors review the design | Nobody questioned the substitution; the student's professor dismissed the column concern entirely |
Blast radius: a skyscraper in midtown Manhattan. LeMessurier didn't fix it in secret — he brought in every stakeholder: the architect, Citicorp's CEO, an independent structural consultant, NYC's Building Commissioner, the Red Cross, police, the Mayor's Office of Emergency Management. He told the city the whole truth: "the failure of his own office to perceive and communicate the danger." City officials commended him. Nobody was hurt.
From Citicorp to Code: Who Regulates Software?
LeMessurier was a licensed professional engineer. Building codes. Inspections. Liability. Professional boards that can revoke your license.
Most software has none of that. Avionics (DO-178C) and medical devices (FDA) are regulated. Everything else — banking software, social media, autograders, smart home hubs — is governed by... your personal judgment.
"It's not the AI that's liable. It's either the people who made it or the people who use it or both."
— David Parnas, ICSE 2025 Keynote
Parnas argues we should regulate critical software the same way we regulate bridges — licensed engineers, accredited education, independent testing, required specifications. Not because it's "AI" or "not AI," but because the amount of regulation should depend on how important the answer is.
Until that happens, the last Swiss cheese layer is you. Your professional judgment. Your willingness to say "I got a problem, I made the problem, let's fix the problem."
Blast Radius Determines Your Defense Budget
| System | Blast radius | Layers needed |
|---|---|---|
| SceneItAll brightness control | One room's lights are wrong | Error handling + UI feedback |
| SceneItAll door lock | Unauthorized entry | Strong consistency (L33) + redundant sensors + human override |
| SceneItAll firmware update | Device bricked, needs replacement | Rollback + staged rollout + integrity verification |
| Pawtograder gradebook | Every student's GPA in the course | Audit trails + human-in-the-loop + fail-safe defaults |
| Citicorp Tower | 10 blocks of Manhattan | Physical redundancy + independent verification + immediate remediation |
| Boeing 737 MAX MCAS | Everyone on the aircraft | Sensor redundancy + pilot training + override capability |
Same IoT hub, same codebase, same protocol, but the blast radius demands different engineering.
Firmware bricking is expensive. Gradebook corruption affects every student's GPA. More blast radius → more layers.
From "one room" to "everyone on the aircraft": the engineering investment scales with the blast radius.
Fail-Safe vs. Fail-Operational vs. Fail-Dangerous
| Failure | Fail-safe | Fail-dangerous |
|---|---|---|
| Firmware mid-write | Roll back | Partial write (bricked) |
| Autograder crash | "Needs review" | Assign zero |
| Door lock disconnect | Stay locked | Reset unlocked |
| Scene: 1/15 fails | "14/15 updated" | "Success!" |
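Boeing MCAS was fail-dangerous: it kept pushing the nose down. The third mode, fail-operational, means the system keeps performing its function through the failure, possibly in a degraded form, instead of stopping in a safe state.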
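The last row of the table (one device out of fifteen fails) is the easiest to get wrong in code, because the happy-path message is already written. A minimal sketch, assuming each device reports a simple success flag; `SceneReport` and the device names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: reports partial success honestly instead of "Success!".
final class SceneReport {
    /**
     * Summarizes a scene activation. A device counts as updated only if its
     * result is explicitly TRUE; a missing (null) result is treated as a failure,
     * which is the fail-safe reading of "we don't know".
     */
    static String summarize(Map<String, Boolean> deviceResults) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Boolean> entry : deviceResults.entrySet()) {
            if (!Boolean.TRUE.equals(entry.getValue())) {
                failed.add(entry.getKey());
            }
        }
        int total = deviceResults.size();
        int updated = total - failed.size();
        String summary = updated + "/" + total + " updated";
        return failed.isEmpty() ? summary : summary + "; failed: " + String.join(", ", failed);
    }
}
```

The fail-dangerous version of this method is a single `return "Success!";` after the loop that sends the commands. The difference is not effort; it is whether anyone asked what the user should see when one device stays silent.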
Design the Recovery: What Should This System Do When It Fails?
For each scenario, decide: fail-safe or fail-operational? Then design the specific behavior.
| Scenario | Fail-safe would be... | Fail-operational would be... | Which is right? |
|---|---|---|---|
| Smart thermostat loses internet mid-winter | | | |
| Autograder container runs out of memory | | | |
| Door lock loses Zigbee connection | | | |
| Scene activation: one shade doesn't respond | | | |
Think about the blast radius of each. That determines which mode you need.
Scenario: Firmware Update Bricks a Device
SceneItAll pushes a firmware update to a smart light. Halfway through, the Zigbee connection drops.
| Layer | Defense | Hole? |
|---|---|---|
| Integrity check | Verify firmware checksum before applying | Catches corrupt downloads |
| Atomic write | Write to staging partition, swap after verification | Prevents partial writes from bricking |
| Rollback | If new firmware fails to boot, revert to previous | Catches bad firmware that passes checksum |
| Staged rollout | Update 10% first, monitor, then expand | Limits blast radius to 10% |
| Dead letter queue | Failed updates queue for human review (L33) | Nothing silently lost |
Remove atomic write AND rollback? Bricked. Skip staged rollout and push to all 1,000 devices? CrowdStrike at home scale.
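Three of the layers above (integrity check, atomic write, rollback) fit in one short class. This is an in-memory sketch under stated assumptions: real devices use flash partitions and a bootloader, and `FirmwareSlots`, its two slots, and the `bootsOk` health check are invented for illustration.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.function.Predicate;

// Hypothetical A/B-slot model of a device; not SceneItAll's real update code.
final class FirmwareSlots {
    private byte[] active;   // image the device currently boots
    private byte[] staging;  // scratch slot; never booted directly

    FirmwareSlots(byte[] initialImage) {
        this.active = initialImage.clone();
    }

    byte[] activeImage() {
        return active.clone();
    }

    /** Returns true if the new image is now active; false means the old image still runs. */
    boolean update(byte[] downloaded, byte[] expectedSha256, Predicate<byte[]> bootsOk) {
        staging = downloaded.clone();                 // 1. write to staging only
        if (!MessageDigest.isEqual(sha256(staging), expectedSha256)) {
            staging = null;                           // 2. corrupt or truncated: discard
            return false;
        }
        byte[] previous = active;
        active = staging;                             // 3. swap in one step after verification
        staging = null;
        if (!bootsOk.test(active)) {                  // 4. health check on the new image
            active = previous;                        // 5. rollback to the known-good image
            return false;
        }
        return true;
    }

    static byte[] sha256(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(data);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }
}
```

If the Zigbee connection drops halfway, `downloaded` is truncated, the checksum fails at step 2, and the active image is never touched. Delete steps 1 to 3 and write directly over `active`, and the same dropped connection bricks the device.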
Scenario: Race Condition on a Door Lock
Two users send conflicting commands to the same smart lock simultaneously — one locks, one unlocks. Same L31 race condition, but with safety-critical consequences.
| Layer | Defense | Hole? |
|---|---|---|
| Sequential consistency | Lock commands use strong consistency (L33) | Prevents stale lock state |
| Atomic operations | synchronized on the lock device — no interleaving | Prevents mixed state |
| Audit trail | Every lock/unlock logged with timestamp and user | Accountability after the fact |
| Physical override | Physical key always works regardless of software | Human can always recover |
Eventual consistency is fine for brightness — a roommate seeing 100% for 5 seconds is harmless. For a door lock, it's not. Blast radius drives the consistency model choice.
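The atomic-operations and audit-trail layers can share one lock. A minimal single-process sketch, assuming all commands for a device pass through one object; `SmartLock` is hypothetical, and a real hub would also need the cross-node consistency discussed in L33.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: one monitor guards both the state change and its log entry.
final class SmartLock {
    enum State { LOCKED, UNLOCKED }

    private State state = State.LOCKED;                  // safe default
    private final List<String> audit = new ArrayList<>();

    /**
     * Applies a command atomically. Because the state change and the audit entry
     * happen under the same monitor, the log order always matches the order in
     * which commands actually took effect.
     */
    synchronized void apply(String user, State requested) {
        State before = state;
        state = requested;
        audit.add(Instant.now() + " " + user + " " + before + " -> " + requested);
    }

    synchronized State state() {
        return state;
    }

    /** Returns a copy so callers cannot edit the history. */
    synchronized List<String> auditTrail() {
        return new ArrayList<>(audit);
    }
}
```

`synchronized` does not decide who should win when two users disagree; it guarantees that one command fully completes before the other starts, and that the audit trail can tell you afterwards which one was last.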
Scenario: Pawtograder Autograder Crashes Mid-Run
The autograder crashes mid-run — out of memory, network timeout, or a bug in the grading script. What grade does the student see?
| Layer | Defense | Hole? |
|---|---|---|
| Error classification | Distinguish "student tests failed" from "grader crashed" | If both produce exit code 1, they're conflated |
| Fail-safe default | Infrastructure failure = "internal error, needs review" | Only works if failure is classified correctly |
| Retry | Automatically retry infrastructure failures once | Helps transient failures; not deterministic crashes |
| Audit trail | Log every run with exit code, stderr, timing | Enables after-the-fact investigation |
| Student notification | "Your submission is being re-graded" vs "0/100" | "0/100" with no explanation is fail-dangerous |
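A sketch of the first three layers, assuming the grading run is exposed as a `Callable` that returns a score from 0 to 100 and signals infrastructure trouble by throwing. `SafeGrader` and that convention are assumptions for illustration, not Pawtograder's actual design.

```java
import java.util.OptionalInt;
import java.util.concurrent.Callable;

// Hypothetical sketch: a crash in the grader can never turn into a grade.
final class SafeGrader {
    static final class Outcome {
        final boolean graded;      // false means "internal error, needs manual review"
        final OptionalInt score;   // present only when graded is true
        final String message;      // what the student sees

        Outcome(boolean graded, OptionalInt score, String message) {
            this.graded = graded;
            this.score = score;
            this.message = message;
        }
    }

    static Outcome grade(Callable<Integer> gradingRun) {
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= 2; attempt++) {   // first try plus one retry
            try {
                int score = gradingRun.call();
                if (score < 0 || score > 100) {
                    throw new IllegalStateException("score out of range: " + score);
                }
                return new Outcome(true, OptionalInt.of(score), score + "/100");
            } catch (Exception e) {
                lastFailure = e;   // infrastructure failure: never a student's fault
            }
        }
        return new Outcome(false, OptionalInt.empty(),
                "Internal error, needs manual review (" + lastFailure + ")");
    }
}
```

The score is an `OptionalInt` on purpose: when the grader crashed there is no number to display, so no code path can accidentally render it as zero.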
"Internal error, needs manual review" is fail-safe. Silently assigning zero is fail-dangerous.
Pawtograder Gradebook: Audit Trails Catch Human Errors Too
A staff member accidentally updates the wrong gradebook column — 35 students' participation bonus scores overwritten with zeros. Initial report: "looks like a software bug — grades changed without any submission."
| Layer | What it caught |
|---|---|
| Audit table | Logged the exact staff member, timestamp, old values, new values |
| Student visibility | Student noticed grade drop within hours, not weeks |
| Flag mechanism | Student flagged concern → triggered investigation |
| Professor audit view | Confirmed: single bulk update by one staff member, not a software bug |
| Reversibility | Old values in audit table → all overwritten grades restored in minutes |
L24 callback: this was a slip — right intention, wrong column. The audit trail doesn't prevent the slip, but it makes it detectable, attributable, and reversible.
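What makes the slip reversible is that every write records the value it replaced. A minimal in-memory sketch of that idea; `AuditedGradebook`, the batch id, and the field names are hypothetical, and a real system would keep this in a database table and log the revert as well.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of an audit table that supports undoing one bulk update.
final class AuditedGradebook {
    static final class Change {
        final String staff;
        final long batchId;
        final String student;
        final Integer oldValue;   // null if the student had no score before
        final int newValue;

        Change(String staff, long batchId, String student, Integer oldValue, int newValue) {
            this.staff = staff;
            this.batchId = batchId;
            this.student = student;
            this.oldValue = oldValue;
            this.newValue = newValue;
        }
    }

    private final Map<String, Integer> scores = new HashMap<>();
    private final List<Change> audit = new ArrayList<>();

    /** Every write goes through here, so every write is attributable and reversible. */
    void set(String staff, long batchId, String student, int value) {
        Integer old = scores.put(student, value);
        audit.add(new Change(staff, batchId, student, old, value));
    }

    Integer get(String student) {
        return scores.get(student);
    }

    /**
     * Restores the values that one batch overwrote, newest change first.
     * Simplification: assumes nothing else edited those scores afterwards.
     */
    int revertBatch(long batchId) {
        int restored = 0;
        for (int i = audit.size() - 1; i >= 0; i--) {
            Change change = audit.get(i);
            if (change.batchId != batchId) {
                continue;
            }
            if (change.oldValue == null) {
                scores.remove(change.student);
            } else {
                scores.put(change.student, change.oldValue);
            }
            restored++;
        }
        return restored;
    }
}
```

Without the `oldValue` field the audit log could still say who made the change and when, but restoring the grades would mean reconstructing them by hand.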
The Meta-Slice: Knowing the Defense System Exists
Every Swiss cheese layer assumes someone knows it's there. A defense you don't know about is a defense you can't use, maintain, or trigger.
| Case | Who knew the layers? | Outcome |
|---|---|---|
| Boeing 737 MAX | Pilots didn't know MCAS existed | Could not diagnose or override — 346 killed |
| Citicorp Tower | LeMessurier told every stakeholder | Each person knew their role — building saved |
| Pawtograder (alone) | Student sees 0/100, assumes "I failed" | Never triggers flag mechanism |
| Pawtograder (compares notes) | Student discovers classmates also got 0 | Reclassifies as systemic — flags it, triggers investigation |
| Pawtograder (grade finalization) | Multiple instructors review all grades before submission | Catches systematic errors even if no student flags them |
Practical rule: When you see a surprising result — a 0/100, a mysterious crash, a grade that doesn't match your work — compare notes before assuming you're the problem. You might be the person who activates the next defense layer.
Comprehension Check
Open Poll Everywhere and answer the three questions.
You Already Know How to Prevent These Failures
| Safety pattern | Where you learned it | What it prevents |
|---|---|---|
| Contracts & validation | L4: Specifications | Bad inputs propagating through the system |
| Information hiding | L6: Changeability | Unintended dependencies that break silently |
| Low coupling | L7: Coupling & Cohesion | Failures cascading across module boundaries |
| Testing at every scope | L15-L16: Testing | Bugs reaching production undetected |
| Timeouts & circuit breakers | L20: Networks | One slow service taking down the whole system |
| synchronized & atomicity | L31: Concurrency | Race conditions corrupting shared state |
| .exceptionally() & .orTimeout() | L32: Async | Silent failures hiding unsafe states |
| Idempotency & staged rollout | L33: Events | Duplicates causing harm; blast radius of bad deploys |
These aren't exotic safety tools. They're the patterns you already use — applied where the blast radius includes human safety.
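As one example from the table, here is how `.orTimeout()` and `.exceptionally()` turn a silent failure into a visible one. A minimal sketch, assuming a remote call that returns the lock status as a string; `LockStatusClient` and the `"UNKNOWN"` value are illustrative.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a hung or failed status call becomes an explicit UNKNOWN.
final class LockStatusClient {
    /**
     * Wraps a remote status call so that it always completes with something the
     * UI can show. Note that orTimeout completes the given future itself with a
     * TimeoutException if it has not finished in time.
     */
    static CompletableFuture<String> statusOrUnknown(CompletableFuture<String> remoteCall,
                                                     long timeoutMillis) {
        return remoteCall
                .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS)  // a hang becomes a failure
                .exceptionally(error -> "UNKNOWN");               // a failure becomes visible
    }
}
```

Showing "UNKNOWN" is the fail-safe choice here. The fail-dangerous alternatives are to keep displaying the last cached "LOCKED", or to leave the future pending so the UI never updates at all.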
Safety Debt Compounds: Same Code, Growing Blast Radius
The code didn't change. The blast radius did. Safety debt is not the code getting worse — it's the consequences getting larger while the same holes remain open.
Who Profits, Who Bears the Risk?
Boeing sold sensor redundancy as an optional upgrade. Airlines serving price-sensitive passengers flew with less redundancy.
Cost savings accrued to Boeing and airlines. Risk fell on passengers who didn't know.
The same pattern appears in every safety-performance tradeoff:
| Trade-off | Cost of safety | Cost of not having it |
|---|---|---|
| Strong consistency (slower) | Performance overhead | Lock shows "locked" when unlocked |
| Error handling (more complex) | Code complexity | Silent failures hide unsafe states |
| Staged rollouts (slower) | Deployment speed | CrowdStrike-scale blast radius |
| Redundant sensors (more expensive) | Hardware cost | Single point of failure |
Not "can we afford safety?" but "can we afford the consequences of not having it?"
Looking Ahead
L36 (Thursday): Sustainability
We asked "who profits and who bears the risk?" for safety decisions. Thursday we generalize: sustainability — the meta-quality attribute that asks whether ALL your quality attributes hold up over time, and for whom.
GA1 due April 9 — think about error handling in your async chains. What happens when a network call fails? Does your app fail safely or fail dangerously?
Today: what happens when your software fails, and who gets hurt. Thursday: who benefits from your design decisions, who bears the cost, and over what time horizon.