Skip to main content
Pixel art museum hallway with framed paintings of course concepts on both walls — Information Hiding, Coupling and Cohesion, Requirements, Architecture, Testability, Networks on the left; Concurrency, Events, Performance, Safety on the right — converging at the far end into one large glowing MapReduce pipeline exhibit. A student walks toward it. Tagline: Every Lecture Led Here.

CS 3100: Program Design and Implementation II

Lecture 37: Design Case Study — MapReduce

©2026 Jonathan Bell, CC-BY-SA

Learning Objectives

After this lecture, you will be able to:

  1. Describe the MapReduce programming model and GFS at the level needed to analyze their design decisions
  2. Analyze MapReduce and GFS architectural decisions using quality attributes from the course
  3. Identify how MapReduce and GFS apply patterns learned throughout the semester
  4. Evaluate sustainability and trade-off dimensions of MapReduce and GFS

Two Systems, One Case Study

In 2003-2004, Google published two papers that changed how the industry processes data:

Google File System (GFS, 2003)

Stores data across thousands of machines. A NameNode (GFS master) manages metadata (which machines hold which chunks). DataNodes (GFS chunkservers) store the actual data.

Solved the storage problem.

MapReduce (2004)

Distributes computation across thousands of machines. A coordinator assigns tasks and handles failures. Workers execute user-defined functions.

Solved the computation problem.

Today we use these systems as a lens to see every concept from the semester in one place — coupling, cohesion, information hiding, requirements, architecture, concurrency, consistency, performance, safety, and sustainability.

Start with Requirements: What Does This System Need to Do?

From L9 and L18: requirements drive architecture. Before we look at MapReduce and GFS, let's state what the system needs — using the quality scenario format.

SourceStimulusEnvironmentResponseMeasure
ThroughputSearch infrastructureRebuild the web index from billions of crawled pagesHundreds of TB of raw data, thousands of commodity machinesProduce inverted index, PageRank scoresComplete in hours, not weeks
Fault toleranceHardwareMachine crashes mid-computation1000+ machines; failures are routine, not exceptionalRecover and continue without restarting the jobNo data loss; no human intervention
ScalabilityWeb growthThe web is growing exponentiallySame codebase, same programming modelAdd machines, not rewrite softwareLinear throughput scaling with machines added

These three requirements — throughput, fault tolerance, scalability — shaped every architectural decision in MapReduce and GFS. As we walk through the system, ask: which requirement drove this decision?

Google's Data Problem Started with Storage

By 2003, Google had crawled billions of web pages. Each page: ~100KB of HTML, metadata, and outbound links.

Per web pageTotal crawl (billions of pages)
Raw data~100 KBHundreds of TB
Words to index~500 uniqueTrillions of index entries
Links to follow~50 outboundHundreds of billions of edges

Hundreds of TB don't fit on one machine. A large server in 2003 had ~2 TB of disk. You need hundreds of machines just to store the raw crawl — and the web keeps growing.

And Google doesn't just store it — they need to process all of it to build the products people use: web search requires an inverted index, relevant ranking requires PageRank, and the crawl itself must be kept current as the web changes.

Then It's a Computation Problem

Google needs to build an inverted index (which pages contain each word — this IS Google Search), compute PageRank (how pages link to each other — this is what makes results relevant), and process web server logs (billions of requests per day). That means reading and processing hundreds of TB regularly.

The math: One machine reads from disk at ~500 MB/s. Reading hundreds of TB takes days — for an index that needs to stay current as the web changes.

Hundreds of machines reading in parallel: hours. That's feasible — but now you have two new problems:

  1. Coordination: How do you split the work, route results, and combine them?
  2. Failure: If one machine crashes mid-computation (and at 100+ machines, something will crash), what happens to its work?

Google's answer: build a file system that stores data across thousands of machines (GFS), and a programming model that distributes computation automatically (MapReduce).

MapReduce Hides Distributed Complexity Behind Two Functions

Google built MapReduce for web indexing — but the programming model is general. Any data processing task with the same requirements (massive data, parallel processing, fault tolerance) can use the same two functions. SceneItAll's analytics pipeline has exactly those requirements:

Map: Takes a key-value pair, emits zero or more intermediate pairs.

map(homeId, telemetryData) →
emit("high-energy", {homeId, 45kWh})
emit("normal", {homeId, 12kWh})

One home's telemetry in, classification out.

Reduce: Takes a key + all its values, combines into final result.

reduce("high-energy", [
{home1, 45kWh}, {home2, 52kWh}, ...
]) →
emit("high-energy",
{count: 12400, avgKWh: 48.3})

All high-energy homes in, aggregate out.

Between map and reduce, the framework performs a shuffle — groups intermediate values by key and routes them to the appropriate reducer. The programmer writes only map and reduce; the framework handles everything else.

Step 1: Split the Input, Run Map in Parallel

Input is split into chunks. Each chunk is assigned to a map worker running your map() function. Workers run in parallel on different machines — they don't communicate with each other.

Step 2: Shuffle Groups by Key, Reduce Aggregates

The shuffle collects all intermediate pairs with the same key and routes them to the same reduce worker. Then each reduce worker runs your reduce() function on all values for its assigned keys.

Step 3: A Coordinator Orchestrates Everything

The coordinator does NOT process data. It assigns tasks, monitors workers, and restarts failed ones. Workers that crash mid-task have their chunk reassigned to another worker — transparent to the programmer.

Your Laptop's File System Was Designed for a Different Workload

Every file system you've used assumes: small files, random reads and writes, open-edit-save workflows, 4KB blocks. These are great for editing documents and compiling code. Google's workload is the opposite:

Your laptop's file systemData processing workload
Small files (KB-MB)Huge files (GB-TB) — a single crawl output can be terabytes
Random reads and writesAppend-only writes, sequential reads — scan start to finish
Open, edit, save, closeWrite once, read many times, never modify
Permissions, directories, timestampsIrrelevant at batch scale
4KB blocksWasteful — 50GB file = 13 million blocks of bookkeeping

L9 → L6: Different requirements demand different abstractions. A file system designed for editing 50KB documents is the wrong tool for scanning 50GB logs. Google built GFS with 64MB chunks instead of 4KB blocks — 16,000x fewer allocations. The simplification IS the performance gain.

Performance at Scale: Bigger Chunks and Data Locality

GFS uses 64MB chunks instead of your laptop's 4KB blocks — that's 16,000x bigger. Why? Every chunk has fixed costs that don't scale with chunk size:

Fixed cost per chunkWith 4KB blocks (100 TB)With 64MB chunks (100 TB)Reduction
Metadata entry in NameNode RAM26 billion entries1.6 million entries16,000x
Replication coordination (3 copies each)78 billion replica records4.8 million replica records16,000x
I/O round trips per read (NameNode lookup + DataNode setup)26 billion round trips1.6 million round trips16,000x

L34 (Batching): Same principle — amortize fixed costs across more work. Batching database queries avoids per-query overhead. Batching I/O into 64MB chunks avoids per-block metadata, replication, and round-trip overhead. When fixed costs dominate, make the units bigger.

L34 (Locality): MapReduce also moves computation to data — the coordinator assigns map tasks to workers on the same machine as the data chunk. Local disk reads (microseconds) vs. network reads (milliseconds). Batching and locality together minimize network transfer.

GFS Stores Data Across Thousands of Machines with Built-In Redundancy

  • Single NameNode: Maintains metadata — which chunks belong to which file, where each chunk is stored. Does not store file data.
  • Many DataNodes: Store the actual data. Files divided into 64MB chunks, each replicated across 3 DataNodes.
  • Clients: Contact the NameNode for metadata, then communicate directly with DataNodes for data.

Together, MapReduce + GFS = programmers write simple sequential functions, and the framework distributes execution across thousands of machines, handles failures, and manages storage transparently.

A Single NameNode Trades Simplicity for a Dangerous Single Point of Failure

GFS uses a single NameNode for all metadata operations. Every client contacts the NameNode to learn which DataNodes hold the data it needs.

Benefits:

  • Simplicity: No distributed consensus, no conflicting views
  • Functional cohesion (L7): All metadata logic in one place — one responsibility
  • Global optimization: NameNode makes optimal placement decisions

Costs:

  • High coupling (L7): Every client depends on the NameNode — stamp coupling via metadata types
  • Single point of failure (L35): NameNode down = entire file system unavailable
  • Blast radius (L35): NameNode failure affects every client and every DataNode

Mitigation — separation of concerns (L6, L19): The NameNode handles one concern (metadata: "where is the data?"), DataNodes handle another (I/O: "give me the data"). Clients ask the NameNode for metadata, then read directly from DataNodes. The NameNode stays out of the data path.

Pure Functions Make Re-Execution Safe — and Fault Tolerance Trivial

The design decision: Map functions are pure (L5) — same input → same output, no side effects. Workers have zero coupling (L7) — no shared mutable state (L31). Each map task has functional cohesion — one responsibility: process this chunk.

The consequence: With thousands of workers, failures are nearly certain. But purity makes recovery trivial — re-execute the failed task on another machine. Same input, same output, guaranteed.

FailureDetectionRecoveryWhy it works
Map worker diesCoordinator pings periodicallyReassign task to another workerPure function (L5) → re-execution produces same output
Reduce worker diesCoordinator pings periodicallyReassign reduce taskIdempotent (L33) → re-execution is safe
DataNode diesNameNode detects missing heartbeatRead from surviving replica (2 of 3 remain)Redundancy (L35) → no single point of failure for data

L20 (Networks) + L33 (Event Architecture): Retry with exponential backoff is a resilience pattern. MapReduce applies it at the task level — if a task fails, retry on a different machine. This is safe because map functions are idempotent (L33): executing the same function on the same input produces the same output, whether executed once or a hundred times.

Swiss Cheese Analysis: A Chunkserver Fails Mid-Write

A MapReduce worker is writing results to a GFS file. One of three DataNodes crashes during the write.

LayerDefenseHole?
Chunk replicationData written to 3 DataNodes; 2 surviveCatches single-server failure
Write protocolPrimary replica forwards to secondaries; ack only when all confirmPrimary detects secondary failure
NameNode re-replicationNameNode detects under-replicated chunk, schedules new copyRestores redundancy, but not instant
Client retryIf write fails, client retries with new replicasCatches transient network failures
Append semanticsGFS guarantees at-least-once appendRetried append may produce a duplicate record

The bottom layer reveals the trade-off: at-least-once delivery means duplicates. For the search index, a duplicate page entry is harmless — the reducer deduplicates. For ad click billing, a duplicate charge means overcharging an advertiser. Blast radius of inconsistency determines the consistency model (L33).

Blast Radius Varies by Orders of Magnitude Depending on What Fails

FailureBlast RadiusMitigationCourse Concept
One map worker diesOne task delayedCoordinator reassigns to another workerL35: redundancy limits blast radius
One DataNode diesData temporarily under-replicated3x replication; NameNode schedules re-replicationL35: Swiss cheese model
Network partition isolates a rackWorkers unreachable; tasks reassignedStale outputs discarded; tasks re-executedL20: network is not reliable
MapReduce coordinator diesEntire job failsJob restarted from scratch (original); later versions: checkpointingL35: single point of failure
NameNode diesEntire file system unavailableShadow NameNodes for read-only; state checkpointedL35: blast radius determines layers needed

Worker failure costs minutes. Coordinator/NameNode failure costs the entire job or entire file system. This blast radius difference drove Google to eventually distribute the metadata role.

MapReduce Scores Well on Technical and Economic Sustainability — But the Full Picture Is Mixed

DimensionAssessmentKey Tension
TechnicalHigh — simple model, pure functions, small user code, framework handles complexityBatch-only model could not evolve to support interactive queries
EconomicHigh — commodity hardware instead of expensive fault-tolerant servers; framework absorbs failures so you don't pay for reliable machinesInternally: Google lock-in. Open source (Hadoop) mitigated this for everyone else
EnvironmentalMixed — locality reduces network energy, but enabled processing at unprecedented scaleTotal resource consumption exploded
SocialMixed — democratized data processing within Google, but enabled mass-scale data collectionWeb index contains data about everyone

Same four-dimensional analysis from L36. Same pattern: no decision optimizes all four.

MapReduce Made Data Processing Cheap — So Total Consumption Exploded

Before MapReduce (early 2000s): processing the web index was a bespoke engineering effort — custom code, manual failure handling, weeks per pipeline.

MapReduce made it cheap: write two functions, submit a job, get results.

What happened: Per-job engineering cost dropped dramatically. Google ran thousands of MapReduce jobs daily by 2004. The efficiency gain per job was overwhelmed by the increase in total jobs. MapReduce consumed a significant fraction of Google's total compute.

L36 (Jevons' Paradox): Same pattern as Pawtograder — per-submission grading cost dropped, but unlimited submissions dramatically increased total compute. Efficiency is not sustainability.

The People Making Design Decisions Are Not Always the Ones Bearing the Consequences

DecisionWho benefitsWho bears the cost
Simple programming modelThousands of Google engineersInfrastructure team maintaining massive clusters
Commodity hardware (cheap, unreliable)Google's budget (lower capital cost)Environment (more machines, more energy, more e-waste)
Process the entire web indexGoogle's search quality, ad targetingEveryone whose data is in the index (often without consent)
Batch-only model (high throughput, high latency)Large-scale analytics workloadsTeams needing real-time queries (forced to build separate systems)

Same distributional analysis from L35 and L36: the people making the design decision are not always the people bearing the consequences.

MapReduce's Strengths Became Its Limitations as Context Changed

MapReduce's limitations drove Google to build successors:

LimitationSuccessorWhat changed
Batch-only — cannot serve interactive queriesDremel (interactive SQL) → BigQueryUsers needed answers in seconds, not hours
Two-function model — too restrictive for iterative algorithmsSystems supporting iterative computation (ML requires multiple passes over data)Machine learning became a dominant workload
GFS single NameNode — metadata bottleneck at scaleColossus — distributes metadata across multiple serversFile system grew beyond one NameNode's capacity

L36 (Sustainability): The same design decisions that enable initial success can become obstacles as context changes. MapReduce's simplicity was its greatest strength (massive adoption) and its greatest weakness (could not evolve for new workloads). Low coupling and information hiding (L6, L7) delayed this reckoning — many jobs migrated to successors without rewriting their map and reduce functions.

Open Source Made MapReduce Sustainable Beyond Google

Google published the MapReduce (2004) and GFS (2003) papers — but kept the code proprietary. Doug Cutting and Mike Cafarella read the papers and built Apache Hadoop: an open source implementation.

What happenedSustainability dimension
Yahoo, Facebook, and hundreds of companies adopted HadoopEconomic — shared infrastructure cost, no single-vendor lock-in
Entire ecosystem grew: Spark, Hive, HBase, KafkaTechnical — clean programming model enabled open source reimplementation
Google's proprietary system was replaced internally — but the ideas live on in open sourceSocial — any organization can process data at scale, not just Google
Same pattern: Google's Borg → open source Kubernetes (container orchestration for everyone)Technical — clean abstractions enable reimplementation; Borg's ideas power every cloud provider
Hadoop and K8s clusters consume enormous energy worldwideEnvironmental — Jevons' paradox again, now at industry scale

The design outlived the implementation. That's sustainability.

Architecture Emerges from Constraints

We've analyzed MapReduce's design decisions one by one — pure functions, single NameNode, replication, re-execution, locality, sustainability. From L18: "Architecture is the shape that emerges when you apply your constraints." Every decision traces back to four constraints:

ConstraintImplicationDecisions it shaped
Commodity hardware fails constantlyRe-execution must be safePure functions, idempotent retry, 3x replication
Data is too big to move across the networkMove computation to dataLocality optimization, 64MB chunks, NameNode metadata
Network bandwidth is scarce and sharedMinimize data transferBatching (64MB), shuffle as the only cross-network phase
Jobs run for hours on thousands of machinesDetect and recover from failures automaticallyCoordinator heartbeats, task reassignment, checkpointing

These constraints shaped two architectural styles: a pipelined architecture (L19) for the data flow (input → map → shuffle → reduce → output) and a coordinator/worker style for orchestration. Stateless workers + pipelined stages = linear scalability — add more machines, get proportionally more throughput, no code changes.

Every Semester Concept Is Visible in One System

Semester ConceptWhere It Appears in MapReduce/GFS
Information hiding (L6)Framework hides distributed complexity behind two functions
Coupling and cohesion (L7)Single NameNode: functional cohesion, high coupling — explicit trade-off
Requirements (L9)Designed for sequential reads, commodity hardware, non-expert programmers
Testability (L16)Test map/reduce locally with small files — domain logic separated from infrastructure
Architecture (L19, L20)Pipelined architecture; distributed because no single machine could hold the data
Concurrency (L31)Pure map functions eliminate shared mutable state — no locks needed
Event-driven patterns (L33)Idempotent re-execution, eventual consistency, at-least-once delivery
Performance (L34)Locality optimization, batching, pooling workers
Safety and reliability (L35)Swiss cheese layers: replication, re-execution, checkpointing; blast radius analysis
Sustainability (L36)Jevons' paradox; who benefits vs. who bears cost; technical sustainability through simple APIs

The concepts are not separate tools — they are overlapping lenses that illuminate different aspects of the same design.

Comprehension Check

Open Poll Everywhere and answer the three questions.

Looking Ahead

L38 (Wednesday): The Future of Programming — where does software engineering go from here? How do the tools and principles from this semester apply to what comes next?

L39 (Thursday): Review — exam preparation.

GA2: Feature Buffet due Thursday April 16. Process over product — a well-documented partial feature scores higher than a complete feature with no documentation.

Want to go deeper? The original papers are readable:

  • Dean and Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters" (2004)
  • Ghemawat, Gobioff, and Leung, "The Google File System" (2003)

Follow-up courses: CS 4730 (Distributed Systems), CS 6620 (Cloud Computing)