Skip to main content
Pixel art museum hallway with framed paintings of course concepts on both walls — Information Hiding, Coupling and Cohesion, Requirements, Architecture, Testability, Networks on the left; Concurrency, Events, Performance, Safety on the right — converging at the far end into one large glowing MapReduce pipeline exhibit. A student walks toward it. Tagline: Every Lecture Led Here.

CS 3100: Program Design and Implementation II

Lecture 38: MapReduce

©2026 Jonathan Bell & Ellen Spertus, CC-BY-SA

Learning Objectives

After this lecture, you will be able to:

  1. Give examples of paradigm shifts
  2. Give examples of programming paradigms
  3. Describe the MapReduce programming model and GFS at the level needed to analyze their design decisions
  4. Analyze MapReduce and GFS architectural decisions using quality attributes from the course
  5. Identify how MapReduce and GFS apply patterns learned throughout the semester
  6. Evaluate sustainability and trade-off dimensions of MapReduce and GFS
  7. Describe how hasty actions can cause lasting damage

Reminder: We Want Your Feedback

  • Complete TRACE if you haven't
    • AM: 13/36 (36%)
    • PM: 22/38 (57%)
  • Qualtrics survey will be part of assignment
  • Make yourself eligible for recommendations

TRACE is anonymous, and I won't see results until after grades are in.

You can make changes through April 26.

Comments from this Morning

I went through the practice test, and I do not understand why we have short answer questions, as opposed to the other campuses, which do not.

  • Most other sections do have short answer questions.
  • We believe that is the best way for students to demonstrate their knowledge.

We haven't had short answer response questions on a test for the whole year, and I feel unprepared with this sudden change.

  • That's why we created the practice finals.
  • Finals are often different for midterms.
  • Your assignments and labs should have prepared you.

I feel challenged by having to write code down physically.

  • I agree it's not the best way to write code, but needs must.
  • We won't grade on syntax.
  • This lets us see if students have learned from assignments.

I feel we should have consistency across sections.

  • It's fine that you feel that way, but it's not standard in college.
  • I believe my questions are easier than Prof. Bell's.

Please don't ask for exceptions from course policies.

Paradigm Shifts

A paradigm is a way of viewing the world.

A paradigm shift is a change of worldview.

Animation showing planets rotating around sun (left) and planets rotating around Earth (right)

"Apparent retrograde motion" by Cleonis, Wikimedia Commons, CC BY-SA 2.5

A Paradigm Shift in Information Retrieval (1990s)

Pre-Web Era (Cathedral)

  • Curated, centrally-controlled collections
  • Thousands of documents with controlled growth
  • Content signals trustworthy — no incentive to game them

Web Era (Bazaar)

  • Decentralized, chaotic — anyone can publish
  • Billions of documents with uncontrolled growth
  • Content signals easily gamed (keyword stuffing)

Traditional information retrieval algorithms relied on keywords and metadata to find the best matches in flat text collections.

That didn't work for web search. What did?

PageRank Algorithm

Colored graph of interconnected nodes with different numeric values

en:User:345Kai, User:Stannered, Public domain, via Wikimedia Commons

Moore's Law

Log-linear graph showing the number of transistors on microchips doubling every two years from 1971 to 2020

Max Roser, Hannah Ritchie, CC BY 4.0, via Wikimedia Commons

Wirth's Law

Software gets slower more rapidly than hardware gets faster.
Niklaus Wirth

Niklaus Wirth

TODO

Setting the Scene (Early 2000s)

In 2003, Google was crawling billions of web pages.

Per web pageTotal crawl (billions of pages)
Raw data~100 KBHundreds of TB
Words to index~500 uniqueTrillions of index entries
Links to follow~50 outboundHundreds of billions of edges

Servers had ~2 TB of storage and had ~1/1000th the power of today's computers.

How could they store and process all of that information?

Two Systems, One Case Study

In 2003-2004, Google published two papers that changed how the industry processes data:

Google File System (GFS, 2003)

Stores data across thousands of machines. A NameNode (GFS master) manages metadata (which machines hold which chunks). DataNodes (GFS chunkservers) store the actual data.

Solved the storage problem.

MapReduce (2004)

Distributes computation across thousands of machines. A coordinator assigns tasks and handles failures. Workers execute user-defined functions.

Solved the computation problem.

Today we use these systems as a lens to see every concept from the semester in one place — coupling, cohesion, information hiding, requirements, architecture, concurrency, consistency, performance, safety, and sustainability.

Start with Requirements: What Does This System Need to Do?

From L9 and L18: requirements drive architecture. Before we look at MapReduce and GFS, let's state what the system needs — using the quality scenario format.

SourceStimulusEnvironmentResponseMeasure
ThroughputSearch infrastructureRebuild the web index from billions of crawled pagesHundreds of TB of raw data, thousands of commodity machinesProduce inverted index, PageRank scoresComplete in hours, not weeks
Fault toleranceHardwareMachine crashes mid-computation1000+ machines; failures are routine, not exceptionalRecover and continue without restarting the jobNo data loss; no human intervention
ScalabilityWeb growthThe web is growing exponentiallySame codebase, same programming modelAdd machines, not rewrite softwareLinear throughput scaling with machines added

These three requirements — throughput, fault tolerance, scalability — shaped every architectural decision in MapReduce and GFS. As we walk through the system, ask: which requirement drove this decision?

Programming Paradigms

Object-Oriented Programming

Organize code around objects that bundle state and behavior
State changes over time via method calls
Model the world as interacting entities

Functional Programming

Organize code around functions that transform data
Avoid mutable state — same input always gives same output
Model the world as data flowing through pipelines

Which is best? It depends.

Have both in your toolkit.

The Map and Reduce Higher-Order Functions

Diagram showing bread and vegetables on the left, going through Map to become chopped, then reduced to become a sandwich

map applies a function to every element in a list

reduce combines a list into a single item

Counting Words on the Web with MapReduce

Image showing use of MapReduce to generate a count of words in documents
map(key:String, document:String):Void ->
for each w:word in document:
emit(w, 1)
reduce(word:String, counts:List[Int]):Int ->
return sum(counts)

Sandwich MapReduce with Shuffle

Diagram with bread and vegetables at left going through map (chop),
shuffle (group), and reduce (assemble into sandwich) states

Building an Index with MapReduce

Image showing use of MapReduce to build an index
map(url:String, document:String):Void ->
for each w:word in document:
emit(w, url)
reduce(word:String, urls:List[String]):String ->
return urls.join(prefix=word, separator=' ')

Poll: What does this calculate?

Graph showing 8 names and faces connected by edges

QR code linking to the PollEV survey for espertus

Text espertus to 22333 if the URL isn't working for you. https://pollev.com/espertus

// map over edges
map(p1:Person, p2:Person):Void ->
emit(p1, p2)
reduce(p: Person, persons:List[Person]):Int ->
return persons.length

A. How many connections each person has
B. How many nodes are in the graph
C. How many edges are in the graph
D. None of the above

MapReduce: How It Works

  1. Input is split into chunks, sent to different machines.
  2. Map workers process data, producing (key, value) pairs.
  3. Shufflers send data with the same key to reduce workers.
  4. Reduce workers write the output to GFS.

What Happens Behind the Scenes

The coordinator assigns tasks, monitors workers, and restarts failed ones — transparent to the programmer.

Your Laptop's File System Was Designed for a Different Workload

Most file systems are designed for open-edit-save workflows, of mostly small files.

Your laptop's file systemData processing workload
Small files (KB-MB)Huge files (GB-TB) — a single crawl output can be terabytes
Random reads and writesAppend-only writes, sequential reads — scan start to finish
Open, edit, save, closeWrite once, read many times, never modify
Permissions, directories, timestampsIrrelevant at batch scale
4KB blocksWasteful — 50GB file = 13 million blocks of bookkeeping

Performance at Scale: Bigger Chunks and Data Locality

Why does GFS use larger chunks?

Fixed cost per chunkWith 4KB blocks (100 TB)With 64MB chunks (100 TB)Reduction
Metadata entry in NameNode RAM26 billion entries1.6 million entries16,000x
Replication coordination (3 copies each)78 billion replica records4.8 million replica records16,000x
I/O round trips per read (NameNode lookup + DataNode setup)26 billion round trips1.6 million round trips16,000x

We've seen this before: Batching and Locality (L34)

GFS Stores Data Across Thousands of Machines with Built-In Redundancy

Together, MapReduce + GFS = programmers write simple sequential functions, and the framework distributes execution across thousands of machines, handles failures, and manages storage transparently.

Pure Functions Make Re-Execution Safe — and Fault Tolerance Trivial

FailureDetectionRecoveryWhy it works
Map worker diesCoordinator pings periodicallyReassign task to another workerPure function (L5) → re-execution produces same output
Reduce worker diesCoordinator pings periodicallyReassign reduce taskIdempotent (L33) → re-execution is safe
DataNode diesNameNode detects missing heartbeatRead from surviving replica (2 of 3 remain)Redundancy (L35) → no single point of failure for data

MapReduce Made Data Processing Cheap — So Total Consumption Exploded

Before MapReduce (early 2000s): processing the web index was a bespoke engineering effort — custom code, manual failure handling, weeks per pipeline.

MapReduce made it cheap: write two functions, submit a job, get results.

What happened: Per-job engineering cost dropped dramatically. Google ran thousands of MapReduce jobs daily by 2004. The efficiency gain per job was overwhelmed by the increase in total jobs. MapReduce consumed a significant fraction of Google's total compute.

L36 (Jevons' Paradox): Same pattern as Pawtograder — per-submission grading cost dropped, but unlimited submissions dramatically increased total compute. Efficiency is not sustainability.

MapReduce's Strengths Became Its Limitations as Context Changed

MapReduce's limitations drove Google to build successors:

LimitationSuccessorWhat changed
Batch-only — cannot serve interactive queriesDremel (interactive SQL) → BigQueryUsers needed answers in seconds, not hours
Two-function model — too restrictive for iterative algorithmsSystems supporting iterative computation (ML requires multiple passes over data)Machine learning became a dominant workload
GFS single NameNode — metadata bottleneck at scaleColossus — distributes metadata across multiple serversFile system grew beyond one NameNode's capacity

Open Source Made MapReduce Sustainable Beyond Google

Google published the MapReduce (2004) and GFS (2003) papers — but kept the code proprietary. Doug Cutting and Mike Cafarella read the papers and built Apache Hadoop: an open source implementation.

What happenedSustainability dimension
Yahoo, Facebook, and hundreds of companies adopted HadoopEconomic — shared infrastructure cost, no single-vendor lock-in
Entire ecosystem grew: Spark, Hive, HBase, KafkaTechnical — clean programming model enabled open source reimplementation
Google's proprietary system was replaced internally — but the ideas live on in open sourceSocial — any organization can process data at scale, not just Google
Same pattern: Google's Borg → open source Kubernetes (container orchestration for everyone)Technical — clean abstractions enable reimplementation; Borg's ideas power every cloud provider
Hadoop and K8s clusters consume enormous energy worldwideEnvironmental — Jevons' paradox again, now at industry scale

The Inventors of MapReduce

Wired Magazine stotry 'If Xerox PARC Invented the PC, Google Invited the Internet` by Cade Metz, 08.08.12.
Photo shows two smiling men with caption 'Jeff Dean and Sanjay Ghemawat, two of the most important software engineers
of the Internet age -- and two of the most underappreciated.'

"DEC was one of the first companies to build a successful web search engine — AltaVista, which came out of the Western Research Lab — and at least in the beginning, the entire thing ran on a single DEC machine. But Google eclipsed AltaVista in large part because it turned this model on its head. Rather than using big, beefy machines to run its search engine, it broke its software into pieces and spread them across an army of small, cheap machines. This is the fundamental idea behind GFS, MapReduce, and BigTable — and so many other Google technologies that would overturn the status quo."

Jeff Dean Facts

  • The rate at which Jeff Dean produces code jumped by a factor of 40 in late 2000 when he upgraded his keyboard to USB 2.0.
  • Jeff Dean once failed a Turing test when he correctly identified the 203rd Fibonacci number in less than a second.
  • Google search went down for a few hours in 2002, and Jeff Dean started handling queries by hand. Search Quality doubled.
  • Jeff Dean's infinite loops run in 5 seconds.


Jeff Dean (2020)

Screenshot of online article by Mark Bergen, Dina Bass, and Shelly Banjo dated December 3, 2020.
Google Scientist's Abrupt Exit Exposes Rift in Prominent AI Unit
* Timnit Gebru departed after management criticism of an email
* AI unit leader Dean accused of retaliation, chided by staffers
Photograph of Jeff Dean in front of Google logo

Reputations

Photo of old man in suit, Warren Buffett
It takes 20 years to build a reputation and five minutes to ruin it. If you think about that, you'll do things differently.
Warren Buffett

CC-BY 2.0 Marco Verch

Washington Post headline: 'Google hired Timnit Gebru to be an outspoken critic of unethical AI. Then she was fired for it.' Photo of Timnit Gebru speaking at TechCrunch Disrupt SF 2018.
First page of the 2021 FAccT paper 'On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?' by Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell.

Lessons

  • Keep an eye out for paradigm shifts. Don't solve today's problems with yesterday's tools.
  • Software architecture comes out of requirements.
  • "Luck is what happens when preparation meets opportunity." — Seneca
  • Act in accordance with your values.

Bonus Slide

Graphic with header: 'map, filter, and reduce explained with emoji'.
map([cow, potato, chicken, corn], cook) => [hamburger, fries, drumstick, popcorn]
filter([hamburger, fries, drumstick, popcorn], isVegetarian) => [fries, popcorn]
reduce([hamburger, fries, drumstick, popcorn], eat) => poop