Pixel art museum hallway with framed paintings of course concepts on both walls — Information Hiding, Coupling and Cohesion, Requirements, Architecture, Testability, Networks on the left; Concurrency, Events, Performance, Safety on the right — converging at the far end into one large glowing MapReduce pipeline exhibit. A student walks toward it. Tagline: Every Lecture Led Here.

Lecture overview:

Total time: ~50 minutes
Prerequisites: L5 (Pure functions), L6 (Information hiding), L7 (Coupling/cohesion), L9 (Requirements), L16 (Hex arch/testability), L20 (Networks), L31 (Concurrency), L33 (Event architecture), L34 (Performance), L35 (Safety/reliability), L36 (Sustainability)
Connects to: L38 (Future of Programming), GA2 (due Thu Apr 16)

Structure (~25 slides):

Arc 1: The Problem + Programming Model (~12 min) — Google's requirements, data/computation scale, file system workload, batching + locality, GFS, MapReduce model (pivot to SceneItAll) + execution
Arc 2: Design Decisions (~12 min) — single NameNode trade-off, fault tolerance via pure functions (merged), Swiss cheese
Arc 3: Blast Radius (~8 min) — blast radius analysis by failure type
Arc 4: Sustainability + What Came After (~12 min) — four dimensions, Jevons, who benefits, successors, open source, constraints → architecture synthesis, course arc table
Comprehension check + Looking Ahead (2 slides)

Running example: Google's actual use cases — web indexing, PageRank, log processing — as described in the original papers.

Transition: Let's start with the learning objectives...

CS 3100: Program Design and Implementation II

Lecture 37: Design Case Study — MapReduce

Learning Objectives

After this lecture, you will be able to:

Describe the MapReduce programming model and GFS at the level needed to analyze their design decisions
Analyze MapReduce and GFS architectural decisions using quality attributes from the course
Identify how MapReduce and GFS apply patterns learned throughout the semester
Evaluate sustainability and trade-off dimensions of MapReduce and GFS

Two Systems, One Case Study

In 2003-2004, Google published two papers that changed how the industry processes data:

Google File System (GFS, 2003)

Stores data across thousands of machines. A NameNode (GFS master) manages metadata (which machines hold which chunks). DataNodes (GFS chunkservers) store the actual data.

Solved the storage problem.

MapReduce (2004)

Distributes computation across thousands of machines. A coordinator assigns tasks and handles failures. Workers execute user-defined functions.

Solved the computation problem.

Today we use these systems as a lens to see every concept from the semester in one place — coupling, cohesion, information hiding, requirements, architecture, concurrency, consistency, performance, safety, and sustainability.

Set the stage before diving into requirements. Students need to know what the two systems are and what role each plays before they can appreciate the requirements that shaped them.

Terminology note: The GFS paper names the metadata service the master and the chunk-storing workers chunkservers—matching lecture-notes/l37-map-reduce. NameNode and DataNode are the usual HDFS names for those same roles: NameNode is the analog of the master, and each DataNode is a chunkserver. These slides use the HDFS labels for familiarity; map them to master/chunkserver when you read the GFS paper or the notes. For MapReduce we use coordinator/worker throughout this lecture.

The "lens" framing is the key pedagogical move: this is NOT a distributed systems lecture. Students will not implement MapReduce. The goal is to see how design concepts they already know appear at scale. The museum hallway from the cover image is the metaphor — every painting (lecture) converges into this one system.

Transition: Before we look at the systems, let's state what they need to do...

Start with Requirements: What Does This System Need to Do?

From L9 and L18: requirements drive architecture. Before we look at MapReduce and GFS, let's state what the system needs — using the quality scenario format.

	Source	Stimulus	Environment	Response	Measure
Throughput	Search infrastructure	Rebuild the web index from billions of crawled pages	Hundreds of TB of raw data, thousands of commodity machines	Produce inverted index, PageRank scores	Complete in hours, not weeks
Fault tolerance	Hardware	Machine crashes mid-computation	1000+ machines; failures are routine, not exceptional	Recover and continue without restarting the job	No data loss; no human intervention
Scalability	Web growth	The web is growing exponentially	Same codebase, same programming model	Add machines, not rewrite software	Linear throughput scaling with machines added

These three requirements — throughput, fault tolerance, scalability — shaped every architectural decision in MapReduce and GFS. As we walk through the system, ask: which requirement drove this decision?

This is the L9 → L18 payoff. Students learned quality scenarios in L18 and requirements analysis in L9. Now they see the same framework applied to a real system at Google scale.

Walk through each row briefly:

Throughput: Hundreds of TB is the scale we'll establish on the next slide. A single machine reading at ~500 MB/s would take days — Google needed results in hours.
Fault tolerance: With 1000+ machines, the probability of at least one failing during a multi-hour job is nearly 100%. Failure is not an exception — it's the normal operating environment. This is the constraint that forces pure functions and re-execution.
Scalability: The web keeps growing. If doubling the crawl size requires rewriting the system, that's unsustainable (L36). The architecture must scale by adding machines, not changing code.
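The "nearly 100%" claim in the fault tolerance row can be checked with a short calculation. The sketch below assumes machines fail independently and uses a hypothetical 0.5% chance that any one machine fails during the job; neither assumption comes from the papers, and the function name is illustrative.

```python
def p_any_failure(machines: int, p_single: float) -> float:
    """Probability that at least one of `machines` independent machines fails."""
    return 1.0 - (1.0 - p_single) ** machines

# Assumed, illustrative value: each machine has a 0.5% chance of failing during the job.
P_SINGLE = 0.005

for n in (1, 10, 100, 1000):
    chance = p_any_failure(n, P_SINGLE)
    print(f"{n:>5} machines: {chance:.1%} chance of at least one failure")
```

Under these assumptions a single machine almost never fails during a job, while a 1000-machine cluster sees at least one failure more than 99% of the time. That is why recovery has to be automatic.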

The question to plant: "As we go through each design decision, I'll ask: which of these three requirements drove it?" This gives students an analytical lens for the rest of the lecture.

L19 connection: "There's no universally best architecture — only architectures that are better or worse fits for particular constraints and priorities." These three requirements ARE the constraints that shaped MapReduce.

Transition: Let's see where these requirements come from — starting with the data...

Google's Data Problem Started with Storage

By 2003, Google had crawled billions of web pages. Each page: ~100KB of HTML, metadata, and outbound links.

	Per web page	Total crawl (billions of pages)
Raw data	~100 KB	Hundreds of TB
Words to index	~500 unique	Trillions of index entries
Links to follow	~50 outbound	Hundreds of billions of edges

Hundreds of TB don't fit on one machine. A large server in 2003 had ~2 TB of disk. You need hundreds of machines just to store the raw crawl — and the web keeps growing.

And Google doesn't just store it — they need to process all of it to build the products people use: web search requires an inverted index, relevant ranking requires PageRank, and the crawl itself must be kept current as the web changes.

Bridge from L36: Last lecture we closed with sustainability — "the integral of programming over time." Today we take every concept from the semester and apply it to one real system: Google's MapReduce and the Google File System. This is not a distributed systems lecture. Students will not implement MapReduce. The goal is to use it as a lens for the entire course.

Start with storage, not computation. Students often think "big data" means "slow computation." But the first problem is simpler: where do you put it? A single machine can't even hold the data, let alone process it.

Walk through the math. Billions of pages × ~100KB each = hundreds of TB. A 2003 server had maybe 2 TB of disk. Let students feel the scale building from per-page (familiar) to total crawl (staggering). The trillions-of-index-entries number should land hard.
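If it helps to put the arithmetic on the board, here is a minimal sketch. It assumes "billions of pages" means 3 billion (a made-up round number) and uses decimal units; the per-page size and per-server disk size are the approximations from the slide.

```python
KB, TB = 10**3, 10**12  # decimal units, good enough for an order-of-magnitude estimate

pages = 3 * 10**9          # assumption: "billions of pages" taken as 3 billion
bytes_per_page = 100 * KB  # from the slide: ~100 KB per page
disk_per_server = 2 * TB   # from the slide: ~2 TB on a large 2003 server

raw_crawl_bytes = pages * bytes_per_page
servers_needed = -(-raw_crawl_bytes // disk_per_server)  # ceiling division

print(f"raw crawl: {raw_crawl_bytes / TB:.0f} TB")
print(f"servers needed for one unreplicated copy: {servers_needed}")
```

Changing the page count by 2-3x in either direction still leaves the answer at dozens to hundreds of servers, which is the point of the slide.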

The numbers are approximate but realistic. The exact size of Google's 2003 crawl is not public. The point is the order of magnitude: the conclusion (doesn't fit on one machine) holds even if the numbers are 2-3x off. The GFS paper reports that its largest cluster at the time provided hundreds of terabytes of storage across thousands of disks on over a thousand machines.

If a student asks "why not just keep the latest crawl?": You can't — you need the full crawl to build the index, and the web keeps changing, so you're continuously re-crawling and re-processing. The steady-state data volume only grows as the web grows.

Transition: So storage is the first problem. But what do you actually want to DO with all this data?

Then It's a Computation Problem

Google needs to build an inverted index (which pages contain each word — this IS Google Search), compute PageRank (how pages link to each other — this is what makes results relevant), and process web server logs (billions of requests per day). That means reading and processing hundreds of TB regularly.

The math: One machine reads from disk at ~500 MB/s. Reading hundreds of TB takes days — for an index that needs to stay current as the web changes.

Hundreds of machines reading in parallel: hours. That's feasible — but now you have two new problems:

Coordination: How do you split the work, route results, and combine them?
Failure: If one machine crashes mid-computation (and at 100+ machines, something will crash), what happens to its work?

Google's answer: build a file system that stores data across thousands of machines (GFS), and a programming model that distributes computation automatically (MapReduce).

The two-slide setup: Slide 1 established that storage is the first problem — the data doesn't fit on one machine. This slide establishes that computation is the second problem — you can't process it fast enough on one machine either.

The "days" number is the hook. Hundreds of TB ÷ 500 MB/s = hundreds of thousands of seconds = days. Students can estimate this in their heads. The gap between "days" and "need a current index" makes the problem visceral.
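The same estimate as a few lines of code, in case students want to vary the inputs. This is a sketch that counts raw read time only and assumes the data splits perfectly evenly across machines; the 300 TB figure and the 100-machine cluster are illustrative choices, not numbers from the papers.

```python
MB, TB = 10**6, 10**12

def scan_days(total_bytes: int, bytes_per_second: int, machines: int = 1) -> float:
    """Days to read total_bytes when the data is split evenly across machines."""
    seconds = total_bytes / (bytes_per_second * machines)
    return seconds / 86_400

data = 300 * TB       # assumption: "hundreds of TB" taken as 300 TB
read_rate = 500 * MB  # from the slide: ~500 MB/s per machine

print(f"1 machine:    {scan_days(data, read_rate):.1f} days")
print(f"100 machines: {scan_days(data, read_rate, 100) * 24:.1f} hours")
```

Real jobs also spend time on processing, the shuffle, and writing output, so this is a lower bound rather than a prediction.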

Quality scenario (L18 format): If you want to formalize this on the board:

Source: Search infrastructure team
Stimulus: Rebuild inverted index from latest crawl
Environment: Hundreds of TB of crawled pages, commodity hardware (machines fail regularly)
Artifact: Data processing pipeline
Response: Produce inverted index mapping every word to every page containing it
Measure: Completes in hours, not days; tolerates machine failures without restarting

This is exactly the quality scenario template from L18 — and these constraints shape the architecture.

Name the three products: Inverted index = Google Search (look up which pages contain a word). PageRank = what makes results relevant (analyze the link graph). Web server logs = understand traffic patterns. All three require reading the full dataset.

Name the architectural style: This is batch processing — all input is processed in one large batch on a fixed schedule, rather than stream processing (events processed as they arrive). Batch processing trades latency for throughput and simplicity.

Transition: Google's answer has two halves. Let's start with the programming model, then come back to the file system underneath it...

MapReduce Hides Distributed Complexity Behind Two Functions

Google built MapReduce for web indexing — but the programming model is general. Any data processing task with the same requirements (massive data, parallel processing, fault tolerance) can use the same two functions. SceneItAll's analytics pipeline has exactly those requirements:

Map: Takes a key-value pair, emits zero or more intermediate pairs.

map(homeId, telemetryData) →
  emit("high-energy", {homeId, 45kWh})   // a high-usage home
  emit("normal", {homeId, 12kWh})        // a typical home

One home's telemetry in, classification out.

Reduce: Takes a key + all its values, combines into final result.

reduce("high-energy", [
  {home1, 45kWh}, {home2, 52kWh}, ...
]) →
  emit("high-energy",
    {count: 12400, avgKWh: 48.3})

All high-energy homes in, aggregate out.

Between map and reduce, the framework performs a shuffle — groups intermediate values by key and routes them to the appropriate reducer. The programmer writes only map and reduce; the framework handles everything else.
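The two functions and the shuffle can be sketched in a single process as below. This is an illustration of the model, not of Google's or Hadoop's API: the function names, the 40 kWh threshold, and reducing each home's telemetry to a single kWh number are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Iterator

HIGH_ENERGY_KWH = 40.0  # hypothetical threshold, not from the slides

def map_fn(home_id: str, kwh: float) -> Iterator[tuple[str, tuple[str, float]]]:
    """One home's telemetry in, one classified record out."""
    label = "high-energy" if kwh >= HIGH_ENERGY_KWH else "normal"
    yield label, (home_id, kwh)

def reduce_fn(label: str, records: list[tuple[str, float]]) -> tuple[str, dict]:
    """All records for one label in, one aggregate out."""
    total = sum(kwh for _, kwh in records)
    return label, {"count": len(records), "avgKWh": round(total / len(records), 1)}

def run_job(inputs: dict[str, float]) -> dict[str, dict]:
    # Map phase: each input is processed independently of the others.
    intermediate = [pair for home_id, kwh in inputs.items() for pair in map_fn(home_id, kwh)]
    # Shuffle phase: group intermediate values by key (the framework's job).
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: one call per key.
    return dict(reduce_fn(key, values) for key, values in groups.items())

telemetry = {"home1": 45.0, "home2": 52.0, "home3": 12.0}
result = run_job(telemetry)
print(result)
```

Only `map_fn` and `reduce_fn` are what a programmer would write. Everything inside `run_job` stands in for work the framework does across many machines.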

The pivot to SceneItAll is deliberate. We told Google's story to motivate WHY MapReduce exists. Now we use SceneItAll to show the model is GENERAL — the same two-function abstraction works for any data processing task with the same requirements (large data, parallel processing, fault tolerance). The map function processes one home; the reduce function aggregates across homes. The shuffle is the framework's job — the programmer never sees it.

Name the architectural style: This is a pipelined architecture (L19) — data flows through sequential stages: input → map → shuffle → reduce → output. Same pattern as Pawtograder's grading pipeline (submission → build → test → parse → grade → feedback), but at data-center scale.

Name the information hiding: The "everything else" that the framework handles — distribution, parallelism, fault tolerance, data routing — is information hiding (L6) at system scale. The programmer depends on the interface (write two functions), not the implementation (thousands of machines, network failures, replication). Thousands of Google engineers wrote MapReduce jobs without understanding any of the distributed infrastructure.

Transition: What does the execution look like under the hood?

Step 1: Split the Input, Run Map in Parallel

Input is split into chunks. Each chunk is assigned to a map worker running your map() function. Workers run in parallel on different machines — they don't communicate with each other.

Step 2: Shuffle Groups by Key, Reduce Aggregates

The shuffle collects all intermediate pairs with the same key and routes them to the same reduce worker. Then each reduce worker runs your reduce() function on all values for its assigned keys.
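How does the shuffle decide which reduce worker gets a key? The MapReduce paper describes a default partitioning function of the form hash(key) mod R, where R is the number of reduce tasks. A minimal sketch, with `partition`, the choice of CRC32 as the hash, and R = 4 all being assumptions for illustration:

```python
import zlib

def partition(key: str, num_reducers: int) -> int:
    """Pick the reduce worker for a key; the same key always maps to the same worker."""
    # crc32 is used because Python's built-in hash() for strings varies between processes.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

R = 4  # hypothetical number of reduce workers
pairs = [("high-energy", 45.0), ("normal", 12.0), ("high-energy", 52.0)]

buckets = {r: [] for r in range(R)}
for key, value in pairs:
    buckets[partition(key, R)].append((key, value))

for r, contents in buckets.items():
    print(f"reducer {r}: {contents}")
```

Because the function is deterministic, every map worker can compute the destination on its own, with no coordination between workers.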

Step 3: A Coordinator Orchestrates Everything

The coordinator does NOT process data. It assigns tasks, monitors workers, and restarts failed ones. Workers that crash mid-task have their chunk reassigned to another worker — transparent to the programmer.

Now the full picture. Students have seen the data flow (map → shuffle → reduce) across the previous two slides. This slide adds the coordinator, the orchestrator.

The coordinator is the single point of coordination, not computation. It keeps track of which chunks have been processed, which workers are alive, and which tasks need retrying. If a map worker crashes, the coordinator reassigns its chunks to another worker. The programmer's code doesn't change.

Hexagonal architecture callback (L16): The programmer writes two pure functions (map and reduce). The framework handles all infrastructure: splitting, distribution, shuffling, fault tolerance, storage. This is the same separation of domain logic from infrastructure that we've been teaching all semester — just at Google scale.

Scale: Google ran this on thousands of machines. The same architecture handles 10 workers or 10,000 workers.

Transition: Where does the data live?

Your Laptop's File System Was Designed for a Different Workload

Every file system you've used assumes: small files, random reads and writes, open-edit-save workflows, 4KB blocks. These are great for editing documents and compiling code. Google's workload is the opposite:

Your laptop's file system	Data processing workload
Small files (KB-MB)	Huge files (GB-TB) — a single crawl output can be terabytes
Random reads and writes	Append-only writes, sequential reads — scan start to finish
Open, edit, save, close	Write once, read many times, never modify
Permissions, directories, timestamps	Irrelevant at batch scale
4KB blocks	Wasteful — 50GB file = 13 million blocks of bookkeeping

L9 → L6: Different requirements demand different abstractions. A file system designed for editing 50KB documents is the wrong tool for scanning 50GB logs. Google built GFS with 64MB chunks instead of 4KB blocks — 16,000x fewer allocations. The simplification IS the performance gain.

Students have never thought about file systems as a design choice. They just save files and they work. The point is to make the invisible visible: the file system on your laptop embodies design decisions optimized for a specific workload. Those decisions are brilliant for that workload — and terrible for data processing.

The append-then-scan pattern shows up everywhere:

Apache Kafka: append-only message log, consumers read sequentially
Database write-ahead logs (WAL): append-only for crash recovery
Git object store: content-addressed blobs, append-only, garbage collected
Cloud storage (S3): write-once objects, read sequentially — S3 literally does not support editing a byte in the middle of a file

If a student asks "what's it called?": POSIX (Portable Operating System Interface) — a standard from the 1980s. macOS, Linux, and (mostly) Windows all follow it. You'll learn about it in CS 3650 (Systems).

The L6 connection: Your laptop's file system and GFS are both interfaces that hide how data is stored. Same information hiding principle — but the abstraction boundary is in a different place because the requirements are different.

Transition: So what does a file system designed for this workload look like? Let's start with why the blocks need to be bigger...

Performance at Scale: Bigger Chunks and Data Locality

GFS uses 64MB chunks instead of your laptop's 4KB blocks — that's 16,000x bigger. Why? Every chunk has fixed costs that don't scale with chunk size:

Fixed cost per chunk	With 4KB blocks (100 TB)	With 64MB chunks (100 TB)	Reduction
Metadata entry in NameNode RAM	26 billion entries	1.6 million entries	16,000x
Replication coordination (3 copies each)	78 billion replica records	4.8 million replica records	16,000x
I/O round trips per read (NameNode lookup + DataNode setup)	26 billion round trips	1.6 million round trips	16,000x

L34 (Batching): Same principle — amortize fixed costs across more work. Batching database queries avoids per-query overhead. Batching I/O into 64MB chunks avoids per-block metadata, replication, and round-trip overhead. When fixed costs dominate, make the units bigger.

L34 (Locality): MapReduce also moves computation to data — the coordinator assigns map tasks to workers on the same machine as the data chunk. Local disk reads (microseconds) vs. network reads (milliseconds). Batching and locality together minimize network transfer.

This is L34's batching pattern at infrastructure scale. In L34 we showed that batching database queries (N+1 → 1 query with JOIN) and batching network calls (N API calls → 1 batch endpoint) amortize fixed per-invocation costs. GFS applies the same principle to storage.

Walk through each row:

Metadata (scalability): The NameNode keeps every chunk's location in RAM. 26 billion entries would require enormous memory; 1.6 million is manageable. Bigger chunks mean the NameNode can track a bigger file system without running out of RAM. You need a directory of pages either way — make each entry cover more data.
Replication (redundancy): Each chunk is replicated to 3 DataNodes. Replication has a fixed coordination cost per chunk — the NameNode must track locations, detect failures, schedule re-replication. Fewer chunks = less coordination overhead. The redundancy itself (3 copies) costs the same per byte, but the coordination cost per byte drops dramatically.
I/O (throughput): Each read requires a round trip to the NameNode ("where is this chunk?") plus a TCP handshake with the DataNode. With 4KB blocks, these fixed costs dominate: most time is spent on overhead, not reading data. With 64MB chunks, one round trip yields a sustained sequential read where overhead is negligible. Bigger chunks mean fewer calls to the underlying layers per byte of data.

The insight is general: When fixed costs dominate variable costs, make the batch bigger. This applies to database queries, network calls, and file system blocks. The difference at GFS scale is that the fixed costs are high and the data is massive — so the optimal batch size is enormous.
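The table's counts can be reproduced with a few lines of arithmetic. This sketch assumes binary units (1 TB = 2^40 bytes), which is what yields the table's figures; `unit_count` is an illustrative helper name.

```python
KiB, MiB, TiB = 2**10, 2**20, 2**40

def unit_count(total_bytes: int, unit_bytes: int) -> int:
    """Number of fixed-size units needed to hold total_bytes (ceiling division)."""
    return -(-total_bytes // unit_bytes)

data = 100 * TiB
blocks_4k = unit_count(data, 4 * KiB)
chunks_64m = unit_count(data, 64 * MiB)
replicas = 3

print(f"4 KB blocks:  {blocks_4k:,} entries, {blocks_4k * replicas:,} replica records")
print(f"64 MB chunks: {chunks_64m:,} entries, {chunks_64m * replicas:,} replica records")
print(f"reduction:    {blocks_4k // chunks_64m:,}x")
```

The exact values are about 26.8 billion blocks versus about 1.6 million chunks, a ratio of 16,384, which the slides round to 16,000x.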

Locality (also L34): MapReduce exploits the data-center memory hierarchy. Local disk: microseconds. Same-rack network: ~1ms. Cross-datacenter: ~100ms. The coordinator knows (from the NameNode) which DataNodes hold each chunk and assigns map tasks to workers on the same machine — the data-center equivalent of cache-friendly access. Combined with 64MB chunks, map workers mostly read from local disk, keeping network bandwidth free for the shuffle phase.
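A locality-aware assignment can be sketched as a simple greedy loop: for each chunk, prefer an idle worker that already holds a replica, and fall back to a remote read only when none is free. The function name, machine names, and cluster state below are hypothetical, and a real scheduler is more sophisticated (the MapReduce paper also describes preferring a worker near a replica when no local one is available).

```python
def assign_map_tasks(chunk_locations: dict[str, list[str]], idle_workers: set[str]) -> dict[str, str]:
    """Assign each chunk to an idle worker, preferring one that already stores the chunk."""
    assignments = {}
    free = set(idle_workers)
    for chunk, holders in chunk_locations.items():
        if not free:
            break  # no idle workers left; remaining chunks wait for the next round
        local = [w for w in holders if w in free]
        # Prefer a local worker; otherwise fall back to any idle worker (remote read).
        worker = local[0] if local else sorted(free)[0]
        assignments[chunk] = worker
        free.remove(worker)
    return assignments

# Hypothetical cluster state: chunk -> machines holding a replica.
locations = {
    "chunk-a": ["m1", "m2", "m3"],
    "chunk-b": ["m2", "m3", "m4"],
    "chunk-c": ["m1", "m2", "m4"],
}
plan = assign_map_tasks(locations, idle_workers={"m1", "m2", "m5"})
print(plan)
```

In this run the first two chunks are read from local disk, and only the third crosses the network because its replica holders are already busy.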

Pooling (also L34): Worker thread pools are reused across tasks, avoiding per-task startup costs. Three L34 patterns — batching, locality, pooling — all visible in one system.

If a student asks "why not even bigger?": Good question. Larger chunks mean more wasted space for small files (internal fragmentation) and longer recovery times when a chunk is lost (must re-replicate more data). 64MB was Google's empirical sweet spot for their workload.

Transition: Now let's see the full GFS architecture...

GFS Stores Data Across Thousands of Machines with Built-In Redundancy

Single NameNode: Maintains metadata — which chunks belong to which file, where each chunk is stored. Does not store file data.
Many DataNodes: Store the actual data. Files divided into 64MB chunks, each replicated across 3 DataNodes.
Clients: Contact the NameNode for metadata, then communicate directly with DataNodes for data.

Together, MapReduce + GFS = programmers write simple sequential functions, and the framework distributes execution across thousands of machines, handles failures, and manages storage transparently.

A Single NameNode Buys Simplicity at the Cost of a Dangerous Single Point of Failure

GFS uses a single NameNode for all metadata operations. Every client contacts the NameNode to learn which DataNodes hold the data it needs.

Benefits:

Simplicity: No distributed consensus, no conflicting views
Functional cohesion (L7): All metadata logic in one place — one responsibility
Global optimization: NameNode makes optimal placement decisions

Costs:

High coupling (L7): Every client depends on the NameNode — stamp coupling via metadata types
Single point of failure (L35): NameNode down = entire file system unavailable
Blast radius (L35): NameNode failure affects every client and every DataNode

Mitigation — separation of concerns (L6, L19): The NameNode handles one concern (metadata: "where is the data?"), DataNodes handle another (I/O: "give me the data"). Clients ask the NameNode for metadata, then read directly from DataNodes. The NameNode stays out of the data path.
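The two-step read path can be sketched as follows. The classes, method names, and file path are hypothetical and heavily simplified (no replication logic, no network, no caching); the only point is that file bytes never pass through the NameNode.

```python
class NameNode:
    """Metadata only: maps (file, chunk index) to the DataNodes holding that chunk."""
    def __init__(self, chunk_map: dict[tuple[str, int], list[str]]):
        self.chunk_map = chunk_map
        self.bytes_served = 0  # stays 0: file data never flows through the NameNode

    def locate(self, path: str, chunk_index: int) -> list[str]:
        return self.chunk_map[(path, chunk_index)]

class DataNode:
    """Stores chunk contents and serves them directly to clients."""
    def __init__(self, chunks: dict[tuple[str, int], bytes]):
        self.chunks = chunks
        self.bytes_served = 0

    def read(self, path: str, chunk_index: int) -> bytes:
        data = self.chunks[(path, chunk_index)]
        self.bytes_served += len(data)
        return data

def client_read(name_node: NameNode, data_nodes: dict[str, DataNode],
                path: str, chunk_index: int) -> bytes:
    holders = name_node.locate(path, chunk_index)          # step 1: small metadata lookup
    return data_nodes[holders[0]].read(path, chunk_index)  # step 2: bulk read from a DataNode

chunk = ("/crawl/part-0", 0)  # hypothetical file name
nodes = {"dn1": DataNode({chunk: b"x" * 1024}), "dn2": DataNode({chunk: b"x" * 1024})}
nn = NameNode({chunk: ["dn1", "dn2"]})
payload = client_read(nn, nodes, *chunk)
print(len(payload), nn.bytes_served, nodes["dn1"].bytes_served)
```

The NameNode answers one small question per chunk, so its load grows with the number of lookups rather than with the volume of data read.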

L7 callback: In L7 we defined coupling as how much one module is affected by changes in another. The NameNode (the GFS master) has high coupling to every client: if the NameNode is slow, everything is slow. But it also has high cohesion, with all metadata logic in one place, which is functionally cohesive.

The mitigation is elegant: By keeping the NameNode out of the data path, GFS limits the NameNode's load to metadata lookups (small, fast) while DataNodes handle the heavy lifting (large data transfers). This is information hiding at the system level: the NameNode hides the mapping from files to chunks, and clients never need to know the physical layout of data.

Foreshadowing: This single-master decision eventually drove Google to replace GFS with Colossus, which distributes metadata. We'll cover that in Arc 4.

Transition: With thousands of machines, failures are guaranteed. How does MapReduce handle them?

Pure Functions Make Re-Execution Safe — and Fault Tolerance Trivial

The design decision: Map functions are pure (L5) — same input → same output, no side effects. Workers have zero coupling (L7) — no shared mutable state (L31). Each map task has functional cohesion — one responsibility: process this chunk.

The consequence: With thousands of workers, failures are nearly certain. But purity makes recovery trivial — re-execute the failed task on another machine. Same input, same output, guaranteed.

Failure	Detection	Recovery	Why it works
Map worker dies	Coordinator pings periodically	Reassign task to another worker	Pure function (L5) → re-execution produces same output
Reduce worker dies	Coordinator pings periodically	Reassign reduce task	Idempotent (L33) → re-execution is safe
DataNode dies	NameNode detects missing heartbeat	Read from surviving replica (2 of 3 remain)	Redundancy (L35) → no single point of failure for data

L20 (Networks) + L33 (Event Architecture): Retry with exponential backoff is a resilience pattern. MapReduce applies the retry half at the task level: if a task fails, the coordinator re-runs it on a different machine. This is safe because map functions are idempotent (L33): executing the same function on the same input produces the same output, whether executed once or a hundred times.
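Re-execution on another machine can be sketched in a few lines. The `Worker` class, the `run_with_reassignment` helper, and the word-count task are hypothetical stand-ins for illustration, and the real coordinator detects failures by pinging workers rather than by catching an exception.

```python
def word_count_map(chunk: str) -> dict[str, int]:
    """A pure map task: the output depends only on the input chunk."""
    counts: dict[str, int] = {}
    for word in chunk.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

class Worker:
    def __init__(self, name: str, healthy: bool = True):
        self.name, self.healthy = name, healthy

    def run(self, task, chunk):
        if not self.healthy:
            raise ConnectionError(f"{self.name} did not respond")
        return task(chunk)

def run_with_reassignment(task, chunk, workers):
    """Try the task on each worker in turn until one succeeds."""
    for worker in workers:
        try:
            return worker.run(task, chunk), worker.name
        except ConnectionError:
            continue  # safe to move on: the task has no side effects to undo
    raise RuntimeError("no healthy worker available")

workers = [Worker("w1", healthy=False), Worker("w2")]
output, ran_on = run_with_reassignment(word_count_map, "to be or not to be", workers)
print(ran_on, output)
```

The retry loop needs no cleanup step and no knowledge of what the task does. That simplicity comes entirely from the task being pure.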

Two concepts, one argument. The pure-function constraint (L5, L7, L31) is what enables fault tolerance. Students should see that the design decision that eliminates race conditions is the SAME decision that makes crash recovery safe. This is the overlapping-lenses theme of the lecture.

L7 vocabulary: Map workers have the best possible coupling: NONE. Zero shared state. The only coupling is through the shuffle — data coupling (primitive key-value pairs). The trade-off is real: you cannot compare two homes in one map function; you must emit both under the same key and compare in reduce. This deliberate constraint buys safe parallelism AND safe recovery.

GFS replication pays off too: When a DataNode dies, the input data is still available on 2 other DataNodes. The new worker reads from a surviving replica. Redundancy from L35.

The key insight: Fault tolerance is not an add-on. It emerges from the pure function constraint (L5) and the replication strategy (L35). Every prior design decision contributes.

Transition: Let's apply the Swiss cheese model to a specific failure scenario...

Swiss Cheese Analysis: A DataNode Fails Mid-Write

A MapReduce worker is writing results to a GFS file. One of three DataNodes crashes during the write.

Layer	Defense	What it catches (and its hole)
Chunk replication	Data written to 3 DataNodes; 2 survive	Catches single-server failure
Write protocol	Primary replica forwards to secondaries; ack only when all confirm	Primary detects secondary failure
NameNode re-replication	NameNode detects under-replicated chunk, schedules new copy	Restores redundancy, but not instant
Client retry	If write fails, client retries with new replicas	Catches transient network failures
Append semantics	GFS guarantees at-least-once append	Retried append may produce a duplicate record

The bottom layer reveals the trade-off: at-least-once delivery means duplicates. For the search index, a duplicate page entry is harmless — the reducer deduplicates. For ad click billing, a duplicate charge means overcharging an advertiser. Blast radius of inconsistency determines the consistency model (L33).
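The billing risk and the usual remedy can be shown with a toy log. This sketch simulates a retried append and then filters duplicates by a unique record id on the reader side, an approach the GFS paper suggests for applications that cannot tolerate duplicates. The record layout and function names are assumptions for illustration.

```python
def append_at_least_once(log: list[dict], record: dict, ack_lost: bool) -> None:
    """Simulate an append whose first acknowledgement may be lost, forcing a retry."""
    log.append(record)
    if ack_lost:
        log.append(record)  # the client retried, so the record now appears twice

def deduplicate(log: list[dict]) -> list[dict]:
    """Keep the first occurrence of each record id (reader-side idempotency)."""
    seen, unique = set(), []
    for record in log:
        if record["id"] not in seen:
            seen.add(record["id"])
            unique.append(record)
    return unique

log: list[dict] = []
append_at_least_once(log, {"id": "click-1", "cents": 25}, ack_lost=True)
append_at_least_once(log, {"id": "click-2", "cents": 40}, ack_lost=False)

naive_total = sum(r["cents"] for r in log)
deduped_total = sum(r["cents"] for r in deduplicate(log))
print(naive_total, deduped_total)
```

Summing the raw log bills the first click twice. Summing after deduplication does not, but only because every record carries an id the reader can check.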

L35 callback: This is exactly the Swiss cheese model from L35. Each layer has a defense and a hole. The system is safe as long as the holes don't align — and here they mostly don't.

Walk through progressively. Each row appears on click. Build the tension — the first four layers seem solid. Then the fifth layer reveals the trade-off: at-least-once semantics means consumers must handle duplicates. This connects directly to L33's idempotency discussion.

The ad billing contrast: The same system that tolerates duplicate search index entries would be dangerous for ad click billing. Students studied this trade-off in L33 — this is the concrete application at GFS scale. Don't linger here; one sentence is enough because they already know the concept.

Which requirement does this serve? Fault tolerance. The system sacrifices strong consistency to gain the ability to retry without coordination overhead.

Transition: Let's step back and see blast radius across different failure types...

Blast Radius Varies by Orders of Magnitude Depending on What Fails

Failure	Blast Radius	Mitigation	Course Concept
One map worker dies	One task delayed	Coordinator reassigns to another worker	L35: redundancy limits blast radius
One DataNode dies	Data temporarily under-replicated	3x replication; NameNode schedules re-replication	L35: Swiss cheese model
Network partition isolates a rack	Workers unreachable; tasks reassigned	Stale outputs discarded; tasks re-executed	L20: network is not reliable
MapReduce coordinator dies	Entire job fails	Job restarted from scratch (original); later versions: checkpointing	L35: single point of failure
NameNode dies	Entire file system unavailable	Shadow NameNodes for read-only; state checkpointed	L35: blast radius determines layers needed

Worker failure costs minutes. Coordinator/NameNode failure costs the entire job or entire file system. This blast radius difference drove Google to eventually distribute the metadata role.

L35 callback: "Blast radius determines how many Swiss cheese layers you need." Workers have small blast radius (one task), so one layer of defense (re-execution) suffices. The coordinator and the NameNode have enormous blast radius (entire job or entire file system), so they need more layers: checkpointing, shadow copies, and eventually, architectural redesign.

The last two rows (red on the slide) are the teaching moment. The centralization that makes the system simple (a single coordinator and a single NameNode = no consensus, no conflicting views) also creates the largest blast radius. This is the fundamental trade-off of centralization, and it's the same coupling analysis from L7.

Foreshadowing: This blast radius problem is why Google built Colossus (distributed metadata) to replace GFS. We'll cover that in Arc 4.

Transition: Let's step back and ask how sustainable these decisions turned out to be, using the four dimensions from L36...

MapReduce Scores Well on Technical and Economic Sustainability — But the Full Picture Is Mixed

Dimension	Assessment	Key Tension
Technical	High — simple model, pure functions, small user code, framework handles complexity	Batch-only model could not evolve to support interactive queries
Economic	High — commodity hardware instead of expensive fault-tolerant servers; framework absorbs failures so you don't pay for reliable machines	Internally: Google lock-in. Open source (Hadoop) mitigated this for everyone else
Environmental	Mixed — locality reduces network energy, but enabled processing at unprecedented scale	Total resource consumption exploded
Social	Mixed — democratized data processing within Google, but enabled mass-scale data collection	Web index contains data about everyone

Same four-dimensional analysis from L36. Same pattern: no decision optimizes all four.

MapReduce Made Data Processing Cheap — So Total Consumption Exploded

Before MapReduce (early 2000s): processing the web index was a bespoke engineering effort — custom code, manual failure handling, weeks per pipeline.

MapReduce made it cheap: write two functions, submit a job, get results.

What happened: Per-job engineering cost dropped dramatically. By 2004, the MapReduce paper reported upwards of one thousand jobs running on Google's clusters every day. The efficiency gain per job was overwhelmed by the increase in total jobs. MapReduce consumed a significant fraction of Google's total compute.

L36 (Jevons' Paradox): Same pattern as Pawtograder — per-submission grading cost dropped, but unlimited submissions dramatically increased total compute. Efficiency is not sustainability.

The People Making Design Decisions Are Not Always the Ones Bearing the Consequences

Decision	Who benefits	Who bears the cost
Simple programming model	Thousands of Google engineers	Infrastructure team maintaining massive clusters
Commodity hardware (cheap, unreliable)	Google's budget (lower capital cost)	Environment (more machines, more energy, more e-waste)
Process the entire web index	Google's search quality, ad targeting	Everyone whose data is in the index (often without consent)
Batch-only model (high throughput, high latency)	Large-scale analytics workloads	Teams needing real-time queries (forced to build separate systems)

Same distributional analysis from L35 and L36: the people making the design decision are not always the people bearing the consequences.

MapReduce's Strengths Became Its Limitations as Context Changed

MapReduce's limitations drove Google to build successors:

Limitation	Successor	What changed
Batch-only — cannot serve interactive queries	Dremel (interactive SQL) → BigQuery	Users needed answers in seconds, not hours
Two-function model — too restrictive for iterative algorithms	Systems supporting iterative computation (ML requires multiple passes over data)	Machine learning became a dominant workload
GFS single NameNode — metadata bottleneck at scale	Colossus — distributes metadata across multiple servers	File system grew beyond one NameNode's capacity

L36 (Sustainability): The same design decisions that enable initial success can become obstacles as context changes. MapReduce's simplicity was its greatest strength (massive adoption) and its greatest weakness (could not evolve for new workloads). Low coupling and information hiding (L6, L7) delayed this reckoning — many jobs migrated to successors without rewriting their map and reduce functions.

Open Source Made MapReduce Sustainable Beyond Google

Google published the MapReduce (2004) and GFS (2003) papers — but kept the code proprietary. Doug Cutting and Mike Cafarella read the papers and built Apache Hadoop: an open source implementation.

What happened	Sustainability dimension
Yahoo, Facebook, and hundreds of companies adopted Hadoop	Economic — shared infrastructure cost, no single-vendor lock-in
Entire ecosystem grew: Spark, Hive, HBase, Kafka	Technical — clean programming model enabled open source reimplementation
Google's proprietary system was replaced internally — but the ideas live on in open source	Social — any organization can process data at scale, not just Google
Same pattern: Google's Borg → open source Kubernetes (container orchestration for everyone)	Technical — clean abstractions enable reimplementation; Borg's ideas power every cloud provider
Hadoop and K8s clusters consume enormous energy worldwide	Environmental — Jevons' paradox again, now at industry scale

The design outlived the implementation. That's sustainability.

L23 callback: In L23 (Open Source) we discussed how "infrastructure code gravitates toward open source." MapReduce is the textbook case. Google kept the implementation but published the ideas. The open source community built Hadoop, and an entire ecosystem grew around it.

Why this matters: Google has since replaced MapReduce internally with systems like Flume and Cloud Dataflow. But the programming model (map, shuffle, reduce) lives on in Spark, Flink, and every cloud provider's data processing service. The information hiding and low coupling that made MapReduce's API simple (L6, L7) are exactly what made it reimplementable. A tightly coupled, implementation-specific API could not have been cloned from a paper.
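A minimal sketch of why a two-function interface is reimplementable: the same user code runs unchanged on two different "engines." All names here (`map_words`, `reduce_counts`, `run_sequential`, `run_threaded`) are hypothetical and written for this illustration; this is not Google's or Hadoop's API, and a thread pool on one machine only stands in for a real cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_words(line: str):
    """User code: emit (word, 1) for each word. Pure: no shared state."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_counts(word: str, counts):
    """User code: combine all values emitted for one key."""
    return sum(counts)

def shuffle(pairs):
    """Framework code: group emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def run_sequential(lines, map_fn, reduce_fn):
    """Engine 1: a plain loop on one thread."""
    pairs = [pair for line in lines for pair in map_fn(line)]
    return {key: reduce_fn(key, values) for key, values in shuffle(pairs).items()}

def run_threaded(lines, map_fn, reduce_fn, workers=4):
    """Engine 2: map calls spread over a thread pool (stand-in for many machines)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_fn, lines))
    pairs = [pair for chunk in mapped for pair in chunk]
    return {key: reduce_fn(key, values) for key, values in shuffle(pairs).items()}

lines = ["the quick brown fox", "the lazy dog", "The end"]
result_a = run_sequential(lines, map_words, reduce_counts)
result_b = run_threaded(lines, map_words, reduce_counts)
print(result_a == result_b)
```

The user functions never mention threads, machines, or failures, so swapping the engine requires no change to them. That is the property that let an engine be rebuilt from a published description.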

Borg → Kubernetes: A similar pattern played out with Google's cluster management system Borg. Google open-sourced Kubernetes in 2014, drawing on lessons from running Borg, and described Borg itself in a paper the following year (2015). Today K8s powers container orchestration at every major cloud provider. Again: proprietary ideas → open source implementation → industry standard. Clean abstractions (pods, services, deployments) enabled adoption outside Google, just as map/reduce did.

Pawtograder parallel: Pawtograder follows the same model — GPL, open source, any institution can adopt and extend it. The design (hexagonal architecture, GitHub Actions integration) is documented; the implementation is open. If we stop maintaining it, someone else can fork it.

Jevons' note: Hadoop and K8s democratized infrastructure. Total global compute has grown enormously as a result. Every cluster is a Jevons' paradox instance — per-job cost dropped, total jobs exploded.

Transition: We've seen the evolution and impact. Let's zoom out — why did every decision take this shape?

Architecture Emerges from Constraints

We've analyzed MapReduce's design decisions one by one — pure functions, single NameNode, replication, re-execution, locality, sustainability. From L18: "Architecture is the shape that emerges when you apply your constraints." Every decision traces back to four constraints:

Constraint	Implication	Decisions it shaped
Commodity hardware fails constantly	Re-execution must be safe	Pure functions, idempotent retry, 3x replication
Data is too big to move across the network	Move computation to data	Locality optimization, 64MB chunks, NameNode metadata
Network bandwidth is scarce and shared	Minimize data transfer	Batching (64MB), shuffle as the only cross-network phase
Jobs run for hours on thousands of machines	Detect and recover from failures automatically	Coordinator heartbeats, task reassignment, checkpointing

These constraints shaped two architectural styles: a pipelined architecture (L19) for the data flow (input → map → shuffle → reduce → output) and a coordinator/worker style for orchestration. Stateless workers + pipelined stages = linear scalability — add more machines, get proportionally more throughput, no code changes.

This slide is the synthesis moment. Students have now seen every individual design decision, the blast radius analysis, the sustainability trade-offs, and the evolution of the system. Now they see that every decision traces back to the same four constraints. The third column ("Decisions it shaped") ties the abstract constraints to the specific decisions students analyzed over the past 20+ minutes.

This is the L18 payoff. In L18 we taught "just enough architecture": decide what's hard to change, defer what's cheap to change. MapReduce's hard-to-change decisions are the programming model (two pure functions), the coordinator/worker coordination, and the pipelined data flow. Everything else (number of workers, chunk size, replication factor) is configurable.

Pipelined architecture (L19): Students saw this with Pawtograder's grading pipeline (submission → build → test → parse → grade → feedback). MapReduce is the same pattern: input → map → shuffle → reduce → output. Data flows through sequential stages; each stage transforms input to output.

Testability consequence: Because the pipeline's domain logic (map/reduce functions) is separated from infrastructure (distribution, fault tolerance), you can test map and reduce on your laptop with a small file. This is L16's testability principle — separate what you want to test from the infrastructure around it.
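As a sketch of what "test on your laptop" looks like, here is a log-processing map/reduce pair with plain unit tests and no cluster. The log format (`METHOD PATH STATUS`) and the function names are assumptions made up for this example, not taken from the papers.

```python
def map_log_line(line: str):
    """Map: emit (status_code, 1) for one log line.

    Assumes a hypothetical format 'METHOD PATH STATUS',
    e.g. 'GET /index.html 200'.
    """
    parts = line.split()
    if len(parts) != 3:
        return []  # skip malformed lines instead of failing the whole job
    return [(parts[2], 1)]

def reduce_status(status: str, counts):
    """Reduce: total number of requests seen for one status code."""
    return status, sum(counts)

# Ordinary unit tests: small in-memory inputs, no workers, no network.
def test_map_emits_status():
    assert map_log_line("GET /index.html 200") == [("200", 1)]

def test_map_skips_malformed():
    assert map_log_line("garbage") == []

def test_reduce_sums():
    assert reduce_status("404", [1, 1, 1]) == ("404", 3)

for test in (test_map_emits_status, test_map_skips_malformed, test_reduce_sums):
    test()
print("all tests passed")
```

Nothing in these tests depends on distribution or fault tolerance, which is the point: the domain logic can be verified in isolation and then handed to the framework.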

Transition: Now let's see every semester concept in this one system...

Every Semester Concept Is Visible in One System

Semester Concept	Where It Appears in MapReduce/GFS
Information hiding (L6)	Framework hides distributed complexity behind two functions
Coupling and cohesion (L7)	Single NameNode: functional cohesion, high coupling — explicit trade-off
Requirements (L9)	Designed for sequential reads, commodity hardware, non-expert programmers
Testability (L16)	Test map/reduce locally with small files — domain logic separated from infrastructure
Architecture (L19, L20)	Pipelined architecture; distributed because no single machine could hold the data
Concurrency (L31)	Pure map functions eliminate shared mutable state — no locks needed
Event-driven patterns (L33)	Idempotent re-execution, eventual consistency, at-least-once delivery
Performance (L34)	Locality optimization, batching, pooling workers
Safety and reliability (L35)	Swiss cheese layers: replication, re-execution, checkpointing; blast radius analysis
Sustainability (L36)	Jevons' paradox; who benefits vs. who bears cost; technical sustainability through simple APIs

The concepts are not separate tools — they are overlapping lenses that illuminate different aspects of the same design.

Comprehension Check

Open Poll Everywhere and answer the three questions.

Poll Q1: A programmer writes a MapReduce job, and it runs on 1,000 machines. The programmer's code never mentions machines, network calls, or failure handling. Which course concept BEST explains why?

A. Low coupling (L7) — the map and reduce functions are loosely coupled
B. Information hiding (L6) — the framework hides all distributed infrastructure behind a two-function interface [CORRECT]
C. Pure functions (L5) — side-effect-free code doesn't need to know about machines
D. Pipelined architecture (L19) — the stages are independent

Teaching point: All four answers describe real properties of MapReduce, but the BEST explanation is information hiding. The programmer depends on the interface (write map and reduce), not the implementation (thousands of machines, network shuffles, failure recovery). Thousands of Google engineers used MapReduce without understanding any of the distributed infrastructure. This is L6 at system scale.

Poll Q2: GFS uses a single NameNode to manage all file metadata. If the NameNode crashes, the entire file system is unavailable. Why did Google accept this single point of failure?

A. They didn't realize it was a risk — it was a design oversight
B. Replicating the NameNode would have violated information hiding
C. A single NameNode is simpler to build correctly, and the blast radius was an acceptable trade-off given how rarely it fails [CORRECT]
D. The NameNode can't crash because it doesn't store actual file data

Teaching point: This is a deliberate architectural trade-off, not an oversight. A single NameNode is much simpler to keep consistent — and simplicity itself is a safety mechanism (fewer things to go wrong). Google decided the blast radius (temporary metadata unavailability, no data loss) was acceptable compared to the complexity of replication. Every architecture involves trade-offs; the goal is to make them conscious.

Poll Q3: A MapReduce map worker crashes halfway through processing its chunk. The coordinator reassigns the chunk to a new worker. Why does this recovery work correctly?

A. The coordinator saved a checkpoint of the worker's progress
B. GFS replicated the worker's partial output to other machines
C. The map function is pure — rerunning it on the same input produces the same output, so no state was lost [CORRECT]
D. The shuffle stage buffers intermediate results and replays them

Teaching point: Pure functions make fault tolerance almost trivial. Because map has no side effects and no shared state, you can re-execute any failed task by giving the same input to a different worker. Nothing to checkpoint, no state to recover, no partial results to reconcile. L5's pure functions provide L35's fault tolerance: the design decision that eliminates race conditions also reduces crash recovery to re-running the task.
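The recovery logic in Q3 can be sketched as a toy simulation. The worker functions, the `WorkerCrashed` exception, and `run_with_reassignment` are all hypothetical names for this illustration; a real coordinator detects failures through heartbeats rather than exceptions.

```python
class WorkerCrashed(Exception):
    """Simulated machine failure (hypothetical, for illustration)."""

def map_square(chunk):
    """Pure map function: output depends only on the input chunk."""
    return [(n, n * n) for n in chunk]

def flaky_worker(map_fn, chunk):
    """Simulated worker that dies halfway through its chunk."""
    partial = map_fn(chunk[: len(chunk) // 2])  # this work is lost in the crash
    raise WorkerCrashed(f"lost {len(partial)} partial results")

def healthy_worker(map_fn, chunk):
    """Simulated worker that finishes the whole chunk."""
    return map_fn(chunk)

def run_with_reassignment(map_fn, chunk, workers):
    """Coordinator: on a crash, hand the same input to the next worker."""
    for worker in workers:
        try:
            return worker(map_fn, chunk)
        except WorkerCrashed:
            continue  # no state to recover; simply re-run from the original input
    raise RuntimeError("no worker completed the task")

chunk = [1, 2, 3, 4]
recovered = run_with_reassignment(map_square, chunk, [flaky_worker, healthy_worker])
print(recovered == map_square(chunk))
```

The crashed worker's partial output is simply discarded. Because `map_square` is pure, the rerun produces exactly what an uninterrupted run would have, so the coordinator needs no checkpoint of worker progress.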

Transition: Let's look ahead...

Looking Ahead

L38 (Wednesday): The Future of Programming — where does software engineering go from here? How do the tools and principles from this semester apply to what comes next?

L39 (Thursday): Review — exam preparation.

GA2: Feature Buffet due Thursday April 16. Process over product — a well-documented partial feature scores higher than a complete feature with no documentation.

Want to go deeper? The original papers are readable:

Dean and Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters" (2004)
Ghemawat, Gobioff, and Leung, "The Google File System" (2003)

Follow-up courses: CS 4730 (Distributed Systems), CS 6620 (Cloud Computing)
