
CS 3100: Program Design and Implementation II
Lecture 37: Design Case Study — MapReduce
©2026 Jonathan Bell, CC-BY-SA
Learning Objectives
After this lecture, you will be able to:
- Describe the MapReduce programming model and GFS at the level needed to analyze their design decisions
- Analyze MapReduce and GFS architectural decisions using quality attributes from the course
- Identify how MapReduce and GFS apply patterns learned throughout the semester
- Evaluate sustainability and trade-off dimensions of MapReduce and GFS
Two Systems, One Case Study
In 2003-2004, Google published two papers that changed how the industry processes data:
Google File System (GFS, 2003)
Stores data across thousands of machines. A NameNode (GFS master) manages metadata (which machines hold which chunks). DataNodes (GFS chunkservers) store the actual data.
Solved the storage problem.
MapReduce (2004)
Distributes computation across thousands of machines. A coordinator assigns tasks and handles failures. Workers execute user-defined functions.
Solved the computation problem.
Today we use these systems as a lens to see every concept from the semester in one place — coupling, cohesion, information hiding, requirements, architecture, concurrency, consistency, performance, safety, and sustainability.
Start with Requirements: What Does This System Need to Do?
From L9 and L18: requirements drive architecture. Before we look at MapReduce and GFS, let's state what the system needs — using the quality scenario format.
| Source | Stimulus | Environment | Response | Measure | |
|---|---|---|---|---|---|
| Throughput | Search infrastructure | Rebuild the web index from billions of crawled pages | Hundreds of TB of raw data, thousands of commodity machines | Produce inverted index, PageRank scores | Complete in hours, not weeks |
| Fault tolerance | Hardware | Machine crashes mid-computation | 1000+ machines; failures are routine, not exceptional | Recover and continue without restarting the job | No data loss; no human intervention |
| Scalability | Web growth | The web is growing exponentially | Same codebase, same programming model | Add machines, not rewrite software | Linear throughput scaling with machines added |
These three requirements — throughput, fault tolerance, scalability — shaped every architectural decision in MapReduce and GFS. As we walk through the system, ask: which requirement drove this decision?
Google's Data Problem Started with Storage
By 2003, Google had crawled billions of web pages. Each page: ~100KB of HTML, metadata, and outbound links.
| Per web page | Total crawl (billions of pages) | |
|---|---|---|
| Raw data | ~100 KB | Hundreds of TB |
| Words to index | ~500 unique | Trillions of index entries |
| Links to follow | ~50 outbound | Hundreds of billions of edges |
Hundreds of TB don't fit on one machine. A large server in 2003 had ~2 TB of disk. You need hundreds of machines just to store the raw crawl — and the web keeps growing.
And Google doesn't just store it — they need to process all of it to build the products people use: web search requires an inverted index, relevant ranking requires PageRank, and the crawl itself must be kept current as the web changes.
Then It's a Computation Problem
Google needs to build an inverted index (which pages contain each word — this IS Google Search), compute PageRank (how pages link to each other — this is what makes results relevant), and process web server logs (billions of requests per day). That means reading and processing hundreds of TB regularly.
The math: One machine reads from disk at ~500 MB/s. Reading hundreds of TB takes days — for an index that needs to stay current as the web changes.
Hundreds of machines reading in parallel: hours. That's feasible — but now you have two new problems:
- Coordination: How do you split the work, route results, and combine them?
- Failure: If one machine crashes mid-computation (and at 100+ machines, something will crash), what happens to its work?
Google's answer: build a file system that stores data across thousands of machines (GFS), and a programming model that distributes computation automatically (MapReduce).
MapReduce Hides Distributed Complexity Behind Two Functions
Google built MapReduce for web indexing — but the programming model is general. Any data processing task with the same requirements (massive data, parallel processing, fault tolerance) can use the same two functions. SceneItAll's analytics pipeline has exactly those requirements:
Map: Takes a key-value pair, emits zero or more intermediate pairs.
map(homeId, telemetryData) →
emit("high-energy", {homeId, 45kWh})
emit("normal", {homeId, 12kWh})
One home's telemetry in, classification out.
Reduce: Takes a key + all its values, combines into final result.
reduce("high-energy", [
{home1, 45kWh}, {home2, 52kWh}, ...
]) →
emit("high-energy",
{count: 12400, avgKWh: 48.3})
All high-energy homes in, aggregate out.
Between map and reduce, the framework performs a shuffle — groups intermediate values by key and routes them to the appropriate reducer. The programmer writes only map and reduce; the framework handles everything else.
Step 1: Split the Input, Run Map in Parallel
Input is split into chunks. Each chunk is assigned to a map worker running your map() function. Workers run in parallel on different machines — they don't communicate with each other.
Step 2: Shuffle Groups by Key, Reduce Aggregates
The shuffle collects all intermediate pairs with the same key and routes them to the same reduce worker. Then each reduce worker runs your reduce() function on all values for its assigned keys.
Step 3: A Coordinator Orchestrates Everything
The coordinator does NOT process data. It assigns tasks, monitors workers, and restarts failed ones. Workers that crash mid-task have their chunk reassigned to another worker — transparent to the programmer.
Your Laptop's File System Was Designed for a Different Workload
Every file system you've used assumes: small files, random reads and writes, open-edit-save workflows, 4KB blocks. These are great for editing documents and compiling code. Google's workload is the opposite:
| Your laptop's file system | Data processing workload |
|---|---|
| Small files (KB-MB) | Huge files (GB-TB) — a single crawl output can be terabytes |
| Random reads and writes | Append-only writes, sequential reads — scan start to finish |
| Open, edit, save, close | Write once, read many times, never modify |
| Permissions, directories, timestamps | Irrelevant at batch scale |
| 4KB blocks | Wasteful — 50GB file = 13 million blocks of bookkeeping |
L9 → L6: Different requirements demand different abstractions. A file system designed for editing 50KB documents is the wrong tool for scanning 50GB logs. Google built GFS with 64MB chunks instead of 4KB blocks — 16,000x fewer allocations. The simplification IS the performance gain.
Performance at Scale: Bigger Chunks and Data Locality
GFS uses 64MB chunks instead of your laptop's 4KB blocks — that's 16,000x bigger. Why? Every chunk has fixed costs that don't scale with chunk size:
| Fixed cost per chunk | With 4KB blocks (100 TB) | With 64MB chunks (100 TB) | Reduction |
|---|---|---|---|
| Metadata entry in NameNode RAM | 26 billion entries | 1.6 million entries | 16,000x |
| Replication coordination (3 copies each) | 78 billion replica records | 4.8 million replica records | 16,000x |
| I/O round trips per read (NameNode lookup + DataNode setup) | 26 billion round trips | 1.6 million round trips | 16,000x |
L34 (Batching): Same principle — amortize fixed costs across more work. Batching database queries avoids per-query overhead. Batching I/O into 64MB chunks avoids per-block metadata, replication, and round-trip overhead. When fixed costs dominate, make the units bigger.
L34 (Locality): MapReduce also moves computation to data — the coordinator assigns map tasks to workers on the same machine as the data chunk. Local disk reads (microseconds) vs. network reads (milliseconds). Batching and locality together minimize network transfer.
GFS Stores Data Across Thousands of Machines with Built-In Redundancy
- Single NameNode: Maintains metadata — which chunks belong to which file, where each chunk is stored. Does not store file data.
- Many DataNodes: Store the actual data. Files divided into 64MB chunks, each replicated across 3 DataNodes.
- Clients: Contact the NameNode for metadata, then communicate directly with DataNodes for data.
Together, MapReduce + GFS = programmers write simple sequential functions, and the framework distributes execution across thousands of machines, handles failures, and manages storage transparently.
A Single NameNode Trades Simplicity for a Dangerous Single Point of Failure
GFS uses a single NameNode for all metadata operations. Every client contacts the NameNode to learn which DataNodes hold the data it needs.
Benefits:
- Simplicity: No distributed consensus, no conflicting views
- Functional cohesion (L7): All metadata logic in one place — one responsibility
- Global optimization: NameNode makes optimal placement decisions
Costs:
- High coupling (L7): Every client depends on the NameNode — stamp coupling via metadata types
- Single point of failure (L35): NameNode down = entire file system unavailable
- Blast radius (L35): NameNode failure affects every client and every DataNode
Mitigation — separation of concerns (L6, L19): The NameNode handles one concern (metadata: "where is the data?"), DataNodes handle another (I/O: "give me the data"). Clients ask the NameNode for metadata, then read directly from DataNodes. The NameNode stays out of the data path.
Pure Functions Make Re-Execution Safe — and Fault Tolerance Trivial
The design decision: Map functions are pure (L5) — same input → same output, no side effects. Workers have zero coupling (L7) — no shared mutable state (L31). Each map task has functional cohesion — one responsibility: process this chunk.
The consequence: With thousands of workers, failures are nearly certain. But purity makes recovery trivial — re-execute the failed task on another machine. Same input, same output, guaranteed.
| Failure | Detection | Recovery | Why it works |
|---|---|---|---|
| Map worker dies | Coordinator pings periodically | Reassign task to another worker | Pure function (L5) → re-execution produces same output |
| Reduce worker dies | Coordinator pings periodically | Reassign reduce task | Idempotent (L33) → re-execution is safe |
| DataNode dies | NameNode detects missing heartbeat | Read from surviving replica (2 of 3 remain) | Redundancy (L35) → no single point of failure for data |
L20 (Networks) + L33 (Event Architecture): Retry with exponential backoff is a resilience pattern. MapReduce applies it at the task level — if a task fails, retry on a different machine. This is safe because map functions are idempotent (L33): executing the same function on the same input produces the same output, whether executed once or a hundred times.
Swiss Cheese Analysis: A Chunkserver Fails Mid-Write
A MapReduce worker is writing results to a GFS file. One of three DataNodes crashes during the write.
| Layer | Defense | Hole? |
|---|---|---|
| Chunk replication | Data written to 3 DataNodes; 2 survive | Catches single-server failure |
| Write protocol | Primary replica forwards to secondaries; ack only when all confirm | Primary detects secondary failure |
| NameNode re-replication | NameNode detects under-replicated chunk, schedules new copy | Restores redundancy, but not instant |
| Client retry | If write fails, client retries with new replicas | Catches transient network failures |
| Append semantics | GFS guarantees at-least-once append | Retried append may produce a duplicate record |
The bottom layer reveals the trade-off: at-least-once delivery means duplicates. For the search index, a duplicate page entry is harmless — the reducer deduplicates. For ad click billing, a duplicate charge means overcharging an advertiser. Blast radius of inconsistency determines the consistency model (L33).
Blast Radius Varies by Orders of Magnitude Depending on What Fails
| Failure | Blast Radius | Mitigation | Course Concept |
|---|---|---|---|
| One map worker dies | One task delayed | Coordinator reassigns to another worker | L35: redundancy limits blast radius |
| One DataNode dies | Data temporarily under-replicated | 3x replication; NameNode schedules re-replication | L35: Swiss cheese model |
| Network partition isolates a rack | Workers unreachable; tasks reassigned | Stale outputs discarded; tasks re-executed | L20: network is not reliable |
| MapReduce coordinator dies | Entire job fails | Job restarted from scratch (original); later versions: checkpointing | L35: single point of failure |
| NameNode dies | Entire file system unavailable | Shadow NameNodes for read-only; state checkpointed | L35: blast radius determines layers needed |
Worker failure costs minutes. Coordinator/NameNode failure costs the entire job or entire file system. This blast radius difference drove Google to eventually distribute the metadata role.
MapReduce Scores Well on Technical and Economic Sustainability — But the Full Picture Is Mixed
| Dimension | Assessment | Key Tension |
|---|---|---|
| Technical | High — simple model, pure functions, small user code, framework handles complexity | Batch-only model could not evolve to support interactive queries |
| Economic | High — commodity hardware instead of expensive fault-tolerant servers; framework absorbs failures so you don't pay for reliable machines | Internally: Google lock-in. Open source (Hadoop) mitigated this for everyone else |
| Environmental | Mixed — locality reduces network energy, but enabled processing at unprecedented scale | Total resource consumption exploded |
| Social | Mixed — democratized data processing within Google, but enabled mass-scale data collection | Web index contains data about everyone |
Same four-dimensional analysis from L36. Same pattern: no decision optimizes all four.
MapReduce Made Data Processing Cheap — So Total Consumption Exploded
Before MapReduce (early 2000s): processing the web index was a bespoke engineering effort — custom code, manual failure handling, weeks per pipeline.
MapReduce made it cheap: write two functions, submit a job, get results.
What happened: Per-job engineering cost dropped dramatically. Google ran thousands of MapReduce jobs daily by 2004. The efficiency gain per job was overwhelmed by the increase in total jobs. MapReduce consumed a significant fraction of Google's total compute.
L36 (Jevons' Paradox): Same pattern as Pawtograder — per-submission grading cost dropped, but unlimited submissions dramatically increased total compute. Efficiency is not sustainability.
The People Making Design Decisions Are Not Always the Ones Bearing the Consequences
| Decision | Who benefits | Who bears the cost |
|---|---|---|
| Simple programming model | Thousands of Google engineers | Infrastructure team maintaining massive clusters |
| Commodity hardware (cheap, unreliable) | Google's budget (lower capital cost) | Environment (more machines, more energy, more e-waste) |
| Process the entire web index | Google's search quality, ad targeting | Everyone whose data is in the index (often without consent) |
| Batch-only model (high throughput, high latency) | Large-scale analytics workloads | Teams needing real-time queries (forced to build separate systems) |
Same distributional analysis from L35 and L36: the people making the design decision are not always the people bearing the consequences.
MapReduce's Strengths Became Its Limitations as Context Changed
MapReduce's limitations drove Google to build successors:
| Limitation | Successor | What changed |
|---|---|---|
| Batch-only — cannot serve interactive queries | Dremel (interactive SQL) → BigQuery | Users needed answers in seconds, not hours |
| Two-function model — too restrictive for iterative algorithms | Systems supporting iterative computation (ML requires multiple passes over data) | Machine learning became a dominant workload |
| GFS single NameNode — metadata bottleneck at scale | Colossus — distributes metadata across multiple servers | File system grew beyond one NameNode's capacity |
L36 (Sustainability): The same design decisions that enable initial success can become obstacles as context changes. MapReduce's simplicity was its greatest strength (massive adoption) and its greatest weakness (could not evolve for new workloads). Low coupling and information hiding (L6, L7) delayed this reckoning — many jobs migrated to successors without rewriting their map and reduce functions.
Open Source Made MapReduce Sustainable Beyond Google
Google published the MapReduce (2004) and GFS (2003) papers — but kept the code proprietary. Doug Cutting and Mike Cafarella read the papers and built Apache Hadoop: an open source implementation.
| What happened | Sustainability dimension |
|---|---|
| Yahoo, Facebook, and hundreds of companies adopted Hadoop | Economic — shared infrastructure cost, no single-vendor lock-in |
| Entire ecosystem grew: Spark, Hive, HBase, Kafka | Technical — clean programming model enabled open source reimplementation |
| Google's proprietary system was replaced internally — but the ideas live on in open source | Social — any organization can process data at scale, not just Google |
| Same pattern: Google's Borg → open source Kubernetes (container orchestration for everyone) | Technical — clean abstractions enable reimplementation; Borg's ideas power every cloud provider |
| Hadoop and K8s clusters consume enormous energy worldwide | Environmental — Jevons' paradox again, now at industry scale |
The design outlived the implementation. That's sustainability.
Architecture Emerges from Constraints
We've analyzed MapReduce's design decisions one by one — pure functions, single NameNode, replication, re-execution, locality, sustainability. From L18: "Architecture is the shape that emerges when you apply your constraints." Every decision traces back to four constraints:
| Constraint | Implication | Decisions it shaped |
|---|---|---|
| Commodity hardware fails constantly | Re-execution must be safe | Pure functions, idempotent retry, 3x replication |
| Data is too big to move across the network | Move computation to data | Locality optimization, 64MB chunks, NameNode metadata |
| Network bandwidth is scarce and shared | Minimize data transfer | Batching (64MB), shuffle as the only cross-network phase |
| Jobs run for hours on thousands of machines | Detect and recover from failures automatically | Coordinator heartbeats, task reassignment, checkpointing |
These constraints shaped two architectural styles: a pipelined architecture (L19) for the data flow (input → map → shuffle → reduce → output) and a coordinator/worker style for orchestration. Stateless workers + pipelined stages = linear scalability — add more machines, get proportionally more throughput, no code changes.
Every Semester Concept Is Visible in One System
| Semester Concept | Where It Appears in MapReduce/GFS |
|---|---|
| Information hiding (L6) | Framework hides distributed complexity behind two functions |
| Coupling and cohesion (L7) | Single NameNode: functional cohesion, high coupling — explicit trade-off |
| Requirements (L9) | Designed for sequential reads, commodity hardware, non-expert programmers |
| Testability (L16) | Test map/reduce locally with small files — domain logic separated from infrastructure |
| Architecture (L19, L20) | Pipelined architecture; distributed because no single machine could hold the data |
| Concurrency (L31) | Pure map functions eliminate shared mutable state — no locks needed |
| Event-driven patterns (L33) | Idempotent re-execution, eventual consistency, at-least-once delivery |
| Performance (L34) | Locality optimization, batching, pooling workers |
| Safety and reliability (L35) | Swiss cheese layers: replication, re-execution, checkpointing; blast radius analysis |
| Sustainability (L36) | Jevons' paradox; who benefits vs. who bears cost; technical sustainability through simple APIs |
The concepts are not separate tools — they are overlapping lenses that illuminate different aspects of the same design.
Comprehension Check
Open Poll Everywhere and answer the three questions.
Looking Ahead
L38 (Wednesday): The Future of Programming — where does software engineering go from here? How do the tools and principles from this semester apply to what comes next?
L39 (Thursday): Review — exam preparation.
GA2: Feature Buffet due Thursday April 16. Process over product — a well-documented partial feature scores higher than a complete feature with no documentation.
Want to go deeper? The original papers are readable:
- Dean and Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters" (2004)
- Ghemawat, Gobioff, and Leung, "The Google File System" (2003)
Follow-up courses: CS 4730 (Distributed Systems), CS 6620 (Cloud Computing)