Skip to main content

Safety and Reliability

You've spent this semester building software that works — software that is correct, testable, maintainable, and performant. Today we ask a different question: what happens when it doesn't work? And more specifically: who gets hurt?

The tools you've learned this semester — race condition prevention (L31), error handling (L32), consistency models (L33), profiling (L34) — are not just performance or reliability tools. They are safety mechanisms. The difference is not the mechanism; it is the consequence of getting it wrong.

Distinguish safety from reliability and explain why both are architectural drivers (12 minutes)

These terms are often used interchangeably, but they mean different things — and the differences matter:

Reliability is whether the system does what it is supposed to do, consistently. A reliable system works. You measure it in error rates, mean time between failures (MTBF), and successful operation counts. A SceneItAll hub that correctly activates scenes 99.99% of the time is highly reliable.

Availability is whether the system is accessible when users need it. A highly available system is there when you call. You measure it in "nines" — 99.9% (8.7 hours of downtime per year), 99.99% (52 minutes), 99.999% (5 minutes). GitHub experienced multiple major outages in February–March 2026 — a core authentication database overloaded when client apps drove a tenfold increase in read traffic, and GitHub Actions failed due to insufficient failover. That's an availability failure — significant downtime — but not a safety failure (nobody was physically harmed).

Safety is whether the system avoids causing unacceptable harm, even when it fails. A safe system fails without hurting people. You measure it not in uptime but in incident severity — did anyone get hurt? Did anyone lose data? Did anyone lose money?

The critical insight: a system can be reliable and unsafe. The Therac-25 radiation therapy machine delivered correct doses the vast majority of the time — it was reliable. But its failure mode was lethal. Conversely, a system can be safe but unreliable. If SceneItAll's hub crashes frequently but always preserves device state and fails to a safe default (lights on, doors locked), it is much less likely that someone will be harmed, but it is still possible (locked out of the house?) — even though the system is unreliable.

Recall

In L18 (Thinking Architecturally), we introduced quality attributes as the drivers that shape architecture. Safety and reliability are quality attributes, just like performance, scalability, and changeability. They don't get added as features — they emerge from (or fail to emerge from) architectural decisions made early and maintained throughout.

Safety Isn't a Feature — It's a Property That Emerges (or Doesn't)

You can't add safety the way you add a search bar. Safety is not a feature you build in Sprint 4 — it is a property that emerges from how every other feature is built. This should sound familiar, because you've heard this exact pattern all semester, applied to different quality attributes:

  • In L7 (Design for Change), we argued that changeability comes from coupling and cohesion decisions made during design, not from a refactoring sprint later. Low coupling doesn't happen by accident — it happens because someone chose the right module boundaries before the code was tangled.
  • In L16 (Designing for Testability), we showed that testability requires hexagonal architecture — separating domain logic from infrastructure — and that bolting tests onto code that wasn't designed for testability is painful and incomplete.
  • In L18 (Thinking Architecturally), we introduced "just enough architecture": decide the hard-to-reverse things up front, design the system so deferred decisions stay cheap. The cost of getting those early decisions wrong grows exponentially over time.
  • In L20 (Networks and Security), we stated it directly: "Security isn't a feature you bolt on at the end — it's an architectural concern that shapes design decisions throughout."
  • In L28 (Accessibility), we showed that accessibility designed in from the start is straightforward; accessibility retrofitted onto an inaccessible interface is expensive, incomplete, and often patronizing.

Safety follows the same rule — and the stakes are higher. When SceneItAll's firmware update uses an atomic write with rollback, that's not a "safety feature" — it's a firmware update that was designed safely. When it doesn't use an atomic write, the firmware update still works in the happy path. The safety gap only becomes visible when the Zigbee connection drops mid-write and the device is bricked.

This is why safety concerns change as a system grows:

StageWhat happensSceneItAll example
LaunchHappy paths work; few users, limited blast radius50 beta homes, firmware updates pushed manually
GrowthUsers interact in ways you didn't design for10,000 homes; users trigger scene activations during firmware updates — a race condition you never tested
ScaleEdge cases surface in production; safety debt compoundsA firmware bug bricks 200 devices in one push; staged rollout would have caught it at 10
MaturityRegulatory requirements change; what was "safe enough" no longer compliesUL/CE certification requires hardware watchdog timer; your software-only safety was grandfathered in

The cost of addressing safety at each stage grows exponentially. Designing in an atomic firmware write on day one is moderate engineering effort. Adding it after 10,000 devices are deployed requires a migration. Adding it after a bricking incident requires that migration plus legal costs, customer replacements, and reputational damage.

Software Affects Human Safety in Ways You Don't Expect

Direct safety is straightforward: SceneItAll controls a smart door lock. A bug in the lock firmware lets an unauthorized person enter. Software controls a physical actuator, and the failure causes immediate physical harm. Medical devices, autonomous vehicles, and industrial control systems fall in this category.

Indirect safety is harder to see. A recommendation algorithm affects mental health — not because it was designed to, but because optimizing for engagement selects for outrage. An automated hiring tool screens out qualified candidates — not because it's biased by design, but because its training data reflects historical bias. SceneItAll's usage analytics reveal when a home is occupied and when it isn't — not a safety concern at launch, but a burglary risk at scale. Moreover, at scale, SceneItAll might have potential users with real safety concerns about this data.

These are fundamentally missing-stakeholder problems. In L9 (Requirements), we discussed three dimensions of requirements risk. Indirect safety hazards are what happens when the understanding dimension fails — when we don't understand who our stakeholders are or how our system affects them.

The same patterns appear in AI systems. In L13, we discussed AI coding agents generating code with security vulnerabilities. An AI system that makes safety-relevant decisions (medical diagnosis, content moderation, autonomous driving) replaces human judgment with software — like Therac-25. It may lack redundancy — like Boeing's single sensor. And its blast radius scales with deployment — like CrowdStrike. If a developer uses AI to generate safety-critical code and cannot evaluate the output, they have removed a Swiss cheese layer (human code review) without adding a replacement. Don't just take our word for it, see David Parnas' ICSE 2025 Keynote.

Recall

In L24 (Usability), we introduced three types of human error — slips, lapses, and mistakes — and how poor usability increases their likelihood. The same error patterns apply at a larger scale: the slip that makes a user click the wrong button in a recipe app can make an operator misconfigure a safety-critical system, or a developer deploy untested code to production.

Apply the Swiss cheese model to analyze layered defenses (13 minutes)

The Swiss Cheese Model of Failure

Recall

You've been building Swiss cheese layers all semester without naming them. Preconditions (L4) reject bad inputs. Tests (L15) catch bugs before deployment. Hexagonal architecture (L16) isolates domain logic from infrastructure failures. Resilience patterns (L20) handle network failures. Idempotent consumers (L33) handle duplicate messages. Today we name this pattern and analyze what happens when layers are removed.

The Swiss cheese model (James Reason) is the most useful framework for thinking about safety in systems. The idea: every safety mechanism is a layer of defense — a slice of Swiss cheese. Each layer has holes (failure modes). Harm occurs only when holes in multiple layers align — when every defense fails simultaneously.

A single layer with holes is not dangerous on its own. Multiple layers with non-aligned holes provide robust protection. The problem is when someone removes a layer entirely, or when holes grow larger over time without anyone noticing.

Case Study: Therac-25

The Therac-25 was a radiation therapy machine that killed at least six patients between 1985 and 1987. The bug was a race condition — the same kind you studied in L31.

Earlier models (the Therac-20) had hardware interlocks — physical mechanisms that prevented the machine from delivering lethal radiation doses regardless of what the software did. The hardware was a Swiss cheese layer with very small holes. The Therac-25 replaced those interlocks with software safety checks. Cheaper, lighter, more flexible — but it removed an entire layer of Swiss cheese. The software had race conditions that the hardware interlocks would have caught. When operators reported errors, the manufacturer dismissed them: the software was "thoroughly tested."

Swiss cheese analysis:

LayerDefenseHole?
Hardware interlocksPhysical mechanism prevents lethal doseRemoved entirely in Therac-25
Software safety checksSoftware validates beam energy before firingRace condition allowed high-energy beam in electron mode
Operator trainingOperators trained to recognize error codesOperators learned to dismiss frequent, cryptic error messages
Incident reportingOperators report anomalies to manufacturerManufacturer dismissed reports — "software is thoroughly tested"

All remaining holes aligned. Lethal radiation reached patients.

Case Study: Boeing 737 MAX

Two crashes in 2018–2019 killed 346 people. The cause was a software system called MCAS (Maneuvering Characteristics Augmentation System) that overrode pilot control based on a single sensor input.

The 737 MAX's larger engines changed the plane's aerodynamics, creating a tendency to pitch up. Rather than redesign the airframe (expensive, would require recertification), Boeing added MCAS to automatically push the nose down. MCAS relied on a single angle-of-attack sensor — no redundancy. When that sensor failed, MCAS repeatedly forced the nose down. Pilots fought the automation until the planes crashed.

The crucial detail: Boeing offered a dual-sensor configuration with an "Angle of Attack Disagree" indicator as an optional upgrade. Airlines that paid extra got redundancy; airlines that didn't got a single point of failure. Both crashed aircraft (Lion Air and Ethiopian Airlines) had the basic single-sensor configuration.

Swiss cheese analysis:

LayerDefenseHole?
Airframe designAerodynamic stability without software interventionReplaced — new engines changed aerodynamics; MCAS compensates in software
Sensor redundancyDual angle-of-attack sensors with disagree indicatorOptional upgrade — not present on crashed aircraft
Pilot trainingPilots trained to recognize and override MCASMinimized — Boeing marketed "no retraining needed" as a selling point
Pilot overridePilots can disable automation and fly manuallyInadequate — pilots didn't know MCAS existed, couldn't diagnose the failure

All holes aligned. Software pushed the nose down, pilots couldn't override, planes crashed.

Case Study: CrowdStrike Falcon Update (July 2024)

On July 19, 2024, CrowdStrike pushed a content update to its Falcon security agent — a kernel-level driver running on approximately 8.5 million Windows machines worldwide (Microsoft, Helping our customers through the CrowdStrike outage, corporate blog, as of 2024-07-20). Technical root cause and mitigations are documented in CrowdStrike’s Channel File 291 root cause analysis (dated 2024-08-06). The update contained a bug that caused an out-of-bounds read in the kernel, triggering a Blue Screen of Death on boot. Because the driver loads early in the boot process, affected machines entered an unrecoverable boot loop. Manual intervention — physically accessing each machine, booting into Safe Mode, and deleting the offending file — was required for every single one.

Airlines grounded flights. Hospitals delayed surgeries. 911 dispatch systems went offline. Banks could not process transactions. Economic-loss estimates were in the billions of dollars: insurer Parametrix estimated 5.4billioninfinanciallossesforUSFortune500companies([Reuters,Fortune500firmstosee5.4 billion in financial losses for US Fortune 500 companies ([Reuters, *Fortune 500 firms to see 5.4 bln in CrowdStrike losses, says insurer Parametrix*](https://www.reuters.com/technology/fortune-500-firms-see-54-bln-crowdstrike-losses-says-insurer-parametrix-2024-07-24/), 2024-07-24); the same Reuters report quoted Parametrix’s CEO estimating ~$15 billion globally (as of 2024-07-24).

Swiss cheese analysis:

LayerDefenseHole?
Content validationAutomated testing before distributionTest did not catch the null pointer read
Staged rolloutPush to 1% first, monitor, then expandNot used for "content updates" — only for "sensor updates." The update went to all 8.5M machines simultaneously
Automatic rollbackRevert if failures spikeMachines could not boot to receive the rollback
Fail-safe bootIf driver crashes, boot without itCrowdStrike loads too early in boot — crash prevents any recovery
MonitoringDetect crash spike and halt rolloutWith simultaneous push, crashes happened faster than monitoring could react

This is the SceneItAll firmware update scenario from earlier in this lecture — but at planetary scale. The blast radius of skipping staged rollout was not "1000 devices" but "every Windows machine in every hospital, airline, and emergency dispatch center running CrowdStrike Falcon."

The Comparison

AspectTherac-25Boeing 737 MAXCrowdStrike Falcon
What was replaced?Hardware interlocksAirframe redesign + pilot trainingManual security review
Replaced with?Software safety checksMCAS software automationAutomated content update pipeline
Why?Cheaper, lighterCheaper, faster certificationSpeed — security threats require rapid response
Swiss cheese layer removed?Hardware interlock layerSensor redundancy + trainingStaged rollout for content updates
Critical flaw?Race conditionsSingle point of failureNo rollback path when kernel driver crashes
Could system recover?Yes — operators could restartNo — planes crashedNo — boot loop required manual access to each machine

The recurring lesson: When we replace hardware safety or human judgment with software (because it's cheaper), we must ask:

  1. What failure modes does software introduce that the original system didn't have?
  2. Is there redundancy? What happens when the single sensor/input/assumption fails?
  3. Can humans override the automation when it's wrong? Do they know how?

Safety as a Premium Feature

Boeing sold sensor redundancy as an optional upgrade. Airlines serving price-sensitive passengers in developing countries flew with less redundancy. The cost savings accrued to airlines and Boeing; the risk was borne disproportionately by passengers who had no idea their ticket price reflected a safety tradeoff.

This reveals a pattern we'll explore further in L36 (Sustainability): who profits from removing a safety mechanism, and who bears the risk? If the answer is "different people," you have an ethical problem — those making the cost/benefit calculation aren't the ones who suffer the consequences.

Analyze blast radius and fail-safe design in your own systems (15 minutes)

Blast Radius: How Much Breaks When Something Fails?

Blast radius is how much of the system — and the world — is affected when a component fails. It is the single most important factor in determining how many Swiss cheese layers you need.

Recall

In L19 (Architectural Qualities), we noted that a monolith's deployment risk is "all-or-nothing: a bug in one feature can take down everything." That's blast radius language — L19 implicitly introduced the concept. A monolith has the maximum blast radius: every feature shares a single deployment unit.

Low coupling (L7) limits blast radius: when modules are loosely coupled, a failure in one does not propagate to others. High coupling means a bug in the authentication module can crash the gradebook.

The Citicorp Tower in Manhattan (1978) illustrates this. A student's question about the building's unusual column placement prompted structural engineer William LeMessurier to investigate quartering winds — winds hitting the corner at 45°, which the NYC building code didn't require analyzing. He discovered that a contractor had substituted bolted joints for the specified welded joints (cheaper, but weaker in tension), and that nobody had recalculated for quartering winds after the substitution. The combination meant the building could collapse in a storm that occurs roughly every 16 years. Hurricane season was approaching.

LeMessurier didn't fix it quietly. He brought in every stakeholder: the architect (Hugh Stubbins), Citicorp's CEO (Walter Wriston), an independent structural consultant (Leslie Robertson, who had consulted on the World Trade Center), the contractor, NYC's Acting Building Commissioner and nine senior city officials, the American Red Cross, police, and the Mayor's Office of Emergency Management. He told the city the whole truth — "the failure of his own office to perceive and communicate the danger" — for over an hour. City officials commended him. Emergency welding was completed during a newspaper strike, so there was no public panic. The building was rebuilt to withstand a 700-year storm. Nobody was hurt.

Why did he act? Not because he was uniquely virtuous — but because the blast radius left no other responsible option. A collapsed skyscraper in midtown Manhattan would endanger thousands of people. When your system's blast radius is that large, you have a professional obligation to understand and mitigate every failure mode you discover — even when disclosure is personally costly. As project manager Arthur Nusbaum put it: "It started with a guy who stood up and said, 'I got a problem, I made the problem, let's fix the problem.' If you're gonna kill a guy like LeMessurier, why should anybody ever talk?"

The ACM Code of Ethics formalizes this: Principle 1.2 states 'Avoid harm,' and Principle 2.5 requires 'comprehensive and thorough evaluations of computer systems and their impacts, including analysis of possible risks.' Blast radius analysis IS that evaluation.

In his ICSE 2025 keynote, David Parnas — who invented the information hiding concepts you learned in L6 — put it simply: "It's not the AI that's liable. It's either the people who made it or the people who use it or both. The people who made it have a duty to have a specification for it and to make sure it meets that specification. People who use it have to make sure that they have read the specification and they're not using it outside of the specification." This is the same contracts framework from L4 — preconditions and postconditions — applied to professional accountability. It doesn't matter whether your system uses AI, event-driven architecture, or a single for loop. If its blast radius includes human safety, you are responsible for specifying what it does and verifying that it does it.

Blast radius determines how many Swiss cheese layers you need:

SystemBlast radius of failureLayers needed
SceneItAll brightness controlOne room's lights are wrongError handling + UI feedback
SceneItAll door lockUnauthorized person entersStrong consistency (L33) + redundant sensors + human override
SceneItAll firmware updateDevice bricked, needs replacementRollback mechanism + staged rollout + integrity verification
Pawtograder gradebookEvery student's GPA in the courseAudit trails + human-in-the-loop + fail-safe defaults
Citicorp TowerMidtown Manhattan skyscraperPhysical redundancy + independent verification + immediate remediation
Boeing 737 MAX MCASEveryone on the aircraftSensor redundancy + pilot training + override capability

Fail-Safe vs. Fail-Operational

SceneItAll's hub loses its connection to a smart light mid-scene-activation. What should the light do?

Fail-safe: The light stays at its current brightness. Nothing changes. The user notices the scene didn't fully apply, but no harm is done. When in doubt, do nothing harmful.

Fail-operational: The hub falls back to local control via Zigbee, bypassing the cloud. The system continues in a degraded mode — no remote access, no analytics — but the user can still control their lights.

Failure modeFail-safe behaviorFail-dangerous behavior
Firmware update fails mid-writeRoll back to previous firmwareContinue with partially written firmware (bricked device)
Autograder crashes mid-runReport "internal error, needs manual review"Silently assign zero
Door lock loses connectionLock stays in current state (locked or unlocked)Lock resets to unlocked default
Scene activation: 1 of 15 devices failsReport "14/15 updated — shade didn't respond"Report "Scene activated!" (silent failure)

Most software should be fail-safe. Fail-operational is harder to get right and is reserved for systems that cannot afford to stop — airplanes, pacemakers, nuclear reactor cooling.

Boeing's MCAS was neither. It didn't stop when the sensor failed (not fail-safe). It didn't degrade gracefully by alerting pilots and giving them manual control (not fail-operational). It kept pushing the nose down — fail-dangerous.

SceneItAll Safety Scenarios

Let's apply the Swiss cheese model and blast radius to SceneItAll — the system you've been building all semester:

Scenario 1: Firmware update bricks a device

A SceneItAll hub pushes a firmware update to a smart light. The update involves writing firmware in chunks (L31, cooperative interrupts). Halfway through, the Zigbee connection drops.

LayerDefenseHole?
Integrity checkVerify firmware checksum before applyingCatches corrupt downloads
Atomic writeWrite new firmware to a staging partition, swap only after verificationPrevents partial writes from bricking
RollbackIf new firmware fails to boot, revert to previous versionCatches bad firmware that passes checksum
Staged rolloutUpdate 10% of devices first, monitor for failures, then roll out to restLimits blast radius to 10%
Dead letter queueFailed updates queue for human review (L33)Nothing is silently lost

Remove any one layer and the failure mode gets worse. Remove the atomic write AND the rollback? The device is bricked. Blast radius: one device (manageable). But if you skip the staged rollout and push to all 1000 devices at once? Blast radius: every device in the deployment.

Scenario 2: Race condition on a door lock

Two users send conflicting commands to the same smart lock simultaneously — one locks, one unlocks. This is the same race condition from L31 (Alice and Bob activating different scenes), but with safety-critical consequences.

LayerDefenseHole?
Sequential consistencyLock commands use strong consistency (L33) — no eventualPrevents stale lock state
Atomic operationssynchronized on the lock device — no interleavingPrevents mixed state
Audit trailEvery lock/unlock is logged with timestamp and userAccountability after the fact
Physical overridePhysical key always works regardless of software stateHuman can always recover

For brightness controls, eventual consistency is fine — a roommate seeing 100% for 5 seconds is harmless. For a door lock, it's not. The blast radius difference (annoyance vs unauthorized entry) drives the consistency model choice.

Scenario 3: Silent failure in scene activation

A user activates "Evening" scene. The hub sends 15 async device commands (L32). One command fails silently — no .exceptionally() handler. The user sees "Scene activated!" but one shade is still open.

LayerDefenseHole?
Error handling.exceptionally() on every async chain (L32)Catches device failures
Timeout.orTimeout(5, SECONDS) — don't wait foreverCatches hung devices
Status verificationAfter activation, read back device states and compareCatches silent failures
User notificationShow "14/15 devices updated — shade in bedroom did not respond"User can investigate

Without error handling, the hole in Layer 1 means the failure is invisible. The user thinks the shade is closed. If the shade is a window on the ground floor, this is a security issue. Blast radius: one room's security.

Pawtograder: When the Autograder Crashes

Pawtograder's autograder runs student code in a containerized environment. If the autograder process crashes mid-run — out of memory, network timeout to the GitHub API, or a bug in the grading script itself — what grade does the student see?

LayerDefenseHole?
Error classificationDistinguish "student tests failed" from "grader infrastructure failed"If both produce exit code 1, they are conflated
Fail-safe defaultInfrastructure failure → "internal error, needs manual review" (not zero)Only works if the system classifies the failure correctly
RetryAutomatically retry infrastructure failures onceHelps with transient failures; doesn't help with deterministic crashes
Audit trailLog every grading run with exit code, stderr, timingEnables after-the-fact investigation
Student notificationTell the student what happened — "your submission is being re-graded" vs "0/100""0/100" with no explanation is fail-dangerous

The fail-safe default matters: "internal error, needs manual review" is fail-safe. "0" is fail-dangerous. The blast radius of getting this wrong: one student's grade in the best case, every student's grade if the bug is systematic.

Pawtograder: When a Human Accidentally Overwrites the Gradebook

Not every safety incident is a software bug. Pawtograder's gradebook stores every student's grades for the course. A staff member accidentally updates the wrong column — overwriting 200 students' homework scores with zeros. The staff member doesn't realize the mistake. A student notices their grade dropped and flags a concern.

This actually happened. The initial report looked like a software bug — "grades changed without any submission." But the audit trail told a different story.

LayerDefenseWhat it caught
Audit tableEvery grade update is logged with timestamp, user, old value, new valueShowed exactly which staff member made the change, when, and what the previous values were
Student visibilityStudents can see their own grades in real timeStudent noticed the discrepancy within hours, not weeks
Flag/concern mechanismStudents can flag grade concerns to instructorsThe student's flag triggered the investigation
Professor audit viewProfessors can view full audit history for any student or assignmentConfirmed the change was a single bulk update by one staff member, not a software bug
ReversibilityOld values stored in audit table enable rollbackAll 200 grades were restored from audit history
Recall

In L24 (Usability), we introduced three types of human error: slips (intended the right action, did the wrong one), lapses (forgot a step), and mistakes (wrong mental model). This was a slip — the staff member intended to update one column but selected the wrong one. The audit trail doesn't prevent the slip, but it makes the slip detectable, attributable, and reversible.

Without the audit trail, this incident would have been invisible until final grades were submitted — and then it would have looked like a software bug, triggering a costly investigation into code that was working correctly. The audit trail is a Swiss cheese layer that catches both machine errors (autograder crash) and human errors (accidental overwrite). Blast radius without audit trail: 200 students' grades, potentially discovered only at end of semester. Blast radius with audit trail: 200 students' grades, detected in hours, reversed in minutes.

Recognize prior course concepts as safety mechanisms (12 minutes)

You've learned these tools as performance, reliability, and concurrency mechanisms. Every one of them is also a safety mechanism. The difference is the consequence of getting it wrong:

What you learnedWhereIts reliability functionIts safety function
Preconditions/contractsL4Rejects invalid inputs at boundariesPrevents unsafe states from being reachable — a restrictive precondition is a Swiss cheese layer
Testing (unit, integration, E2E)L15Catches bugs before deploymentPrevents safety-critical defects from reaching production — tests are a Swiss cheese layer with their own holes (incomplete coverage, flaky tests)
synchronizedL31Prevents race conditionsPrevents safety-critical state corruption (door lock mixed state)
.exceptionally() / .orTimeout()L32Handles async errorsEnsures failures are visible, not silent (shade left open)
Sequential consistencyL33All nodes agreePrevents operations on stale data (lock shows "locked" when unlocked)
Circuit breakerL20Prevents cascade failuresStops a failing component from triggering harm in others
Idempotent operationsL33Makes retries safePrevents duplicate actions from causing harm (door locks twice = fine; alarm disarms twice = fine)
Audit trailsL12Enables debuggingEnables accountability, reversibility, incident investigation
Fail-safe defaultsTodayGraceful degradationSystem fails toward safety (lights on, doors locked) not toward harm
RedundancyL33, TodaySurvives component failureEliminates single points of failure (Boeing's single sensor)
Blast radius analysisTodayScopes failure impactDetermines how many defense layers you need

The key insight: You didn't learn "safety tools" and "reliability tools" separately. They are the same tools. The engineering is the same. The difference is the stakes — and the stakes are determined by the blast radius.

Connect safety to sustainability (8 minutes)

Safety is one dimension of the sustainability framework we'll explore in L36. The connections:

Safety debt compounds like technical debt

Every Swiss cheese hole you leave unfixed is a bet that no other holes will align with it. That bet gets worse over time.

SceneItAll ships without staged rollout for firmware updates. At 50 homes, this is fine — if an update bricks a device, support replaces it. At 10,000 homes, a bad update bricks 200 devices in one push. At 100,000 homes, it's a CrowdStrike-scale event. The code didn't change — the blast radius did. Safety debt is not the code getting worse; it is the consequences getting larger while the same holes remain open.

Who profits, and who bears the risk?

Boeing sold sensor redundancy as an optional upgrade. Budget airlines — often serving price-sensitive passengers in developing countries — flew with less redundancy. The cost savings accrued to Boeing and airlines; the risk fell on passengers who didn't know their ticket price reflected a safety tradeoff.

This pattern recurs. L36 will formalize it: who profits from a design decision, and who bears the risk? If the answer is "different people," the decision deserves extra scrutiny. The same logic applies to L28 (Accessibility): the people who decide not to invest in accessibility are rarely the people excluded by that decision.

Evaluate safety trade-offs and explain why professional judgment is currently the primary safety mechanism in most software (8 minutes)

Safety Is Always in Tension

Safety is never free. Every safety mechanism costs something, and ignoring those costs leads to safety mechanisms that get stripped out under pressure:

Safety mechanismWhat it costsExample
Strong consistencyPerformance — sequential consistency is slower than eventualSceneItAll door lock: sequential consistency adds latency to every lock/unlock command
Error handlingComplexity — .exceptionally() on every async chain, timeout logic, retry policiesL32's scene activation: 15 device commands each need error handling, timeouts, and status verification
Staged rolloutDeployment speed — rolling out to 1% first and monitoring adds hours or daysCrowdStrike skipped staged rollout for "content updates" to push security patches faster
Redundant sensorsMoney — dual sensors, disagree indicators, additional wiringBoeing sold sensor redundancy as an optional upgrade to save airlines money
Automatic memory managementPerformance — GC pauses, memory overheadL34: Java's GC trades performance for safety (no use-after-free bugs)
Human-in-the-loopThroughput — humans are slowPawtograder: "internal error, needs manual review" is safer than auto-assigning zero, but requires a TA to act

In every case, safety costs something: performance, complexity, money, or time. The question is not "can we afford safety?" but "can we afford the consequences of not having it?" The answer depends on the blast radius.

From Citicorp to Code: Who Regulates Software?

LeMessurier was a licensed professional engineer. Building codes required specific structural analyses. Inspectors reviewed the work. Professional boards could revoke his license. When he disclosed the Citicorp flaw, city officials had the authority and expertise to evaluate his analysis and coordinate the response.

Most software has none of that. Avionics software must comply with DO-178C. Medical device software goes through FDA review. These are the exceptions. Banking software, social media algorithms, smart home firmware, autograders, hiring tools — all unregulated. There is no professional licensing requirement for writing software that controls door locks, manages student grades, or recommends content to billions of users.

In his ICSE 2025 keynote, David Parnas — who invented the information hiding concepts you learned in L6 — argued that we should regulate critical software the same way we regulate bridges: licensed engineers, accredited education, required specifications, independent testing. Not because it's "AI" or "not AI," but because "the amount of regulation doesn't depend on what we call it or how it's built — it depends on how important the answer is."

Until that happens, the last Swiss cheese layer is you. Your professional judgment. Your willingness to ask "what's the blast radius?" before you ship, and to disclose when you find a hole. That's not a great safety mechanism — it depends on individual conscience and employer culture rather than institutional enforcement. It's why Parnas argues for regulation. But right now, it's what we have.

Want to go deeper?