Pixel art museum hallway with framed paintings of course concepts on both walls — Information Hiding, Coupling and Cohesion, Requirements, Architecture, Testability, Networks on the left; Concurrency, Events, Performance, Safety on the right — converging at the far end into one large glowing MapReduce pipeline exhibit. A student walks toward it. Tagline: Every Lecture Led Here.

CS 3100: Program Design and Implementation II

Lecture 38: MapReduce

Learning Objectives

After this lecture, you will be able to:

Give examples of paradigm shifts
Give examples of programming paradigms
Describe the MapReduce programming model and GFS at the level needed to analyze their design decisions
~~Analyze MapReduce and GFS architectural decisions using quality attributes from the course~~
Identify how MapReduce and GFS apply patterns learned throughout the semester
~~Evaluate sustainability and trade-off dimensions of MapReduce and GFS~~
Describe how hasty actions can cause lasting damage

Reminder: We Want Your Feedback

Complete TRACE if you haven't
- AM: 13/36 (36%)
- PM: 22/38 (58%)
Qualtrics survey will be part of assignment
Make yourself eligible for recommendations

TRACE is anonymous, and I won't see results until after grades are in.

You can make changes through April 26.

Comments from this Morning

I went through the practice test, and I do not understand why we have short answer questions, as opposed to the other campuses, which do not.

Most other sections do have short answer questions.
We believe that is the best way for students to demonstrate their knowledge.

We haven't had short answer response questions on a test for the whole year, and I feel unprepared with this sudden change.

That's why we created the practice finals.
Finals are often different from midterms.
Your assignments and labs should have prepared you.

I feel challenged by having to write code down physically.

I agree it's not the best way to write code, but needs must.
We won't grade on syntax.
This lets us see if students have learned from assignments.

I feel we should have consistency across sections.

It's fine that you feel that way, but it's not standard in college.
I believe my questions are easier than Prof. Bell's.

Please don't ask for exceptions from course policies.

Paradigm Shifts

A paradigm is a way of viewing the world.

A paradigm shift is a change of worldview.

Animation showing planets rotating around sun (left) and planets rotating around Earth (right)

"Apparent retrograde motion" by Cleonis, Wikimedia Commons, CC BY-SA 2.5

A Paradigm Shift in Information Retrieval (1990s)

Pre-Web Era (Cathedral)

Curated, centrally-controlled collections
Thousands of documents with controlled growth
Content signals trustworthy — no incentive to game them

Web Era (Bazaar)

Decentralized, chaotic — anyone can publish
Billions of documents with uncontrolled growth
Content signals easily gamed (keyword stuffing)

Traditional information retrieval algorithms relied on keywords and metadata to find the best matches in flat text collections.

That didn't work for web search. What did?

PageRank Algorithm

Colored graph of interconnected nodes with different numeric values

en:User:345Kai, User:Stannered, Public domain, via Wikimedia Commons

Moore's Law

Log-linear graph showing the number of transistors on microchips doubling every two years from 1971 to 2020

Max Roser, Hannah Ritchie, CC BY 4.0, via Wikimedia Commons

Wirth's Law

“Software gets slower more rapidly than hardware gets faster.”
— Niklaus Wirth

Niklaus Wirth

Setting the Scene (Early 2000s)

In 2003, Google was crawling billions of web pages.

	Per web page	Total crawl (billions of pages)
Raw data	~100 KB	Hundreds of TB
Words to index	~500 unique	Trillions of index entries
Links to follow	~50 outbound	Hundreds of billions of edges

Servers had ~2 TB of storage and had ~1/1000th the power of today's computers.

How could they store and process all of that information?

Two Systems, One Case Study

In 2003-2004, Google published two papers that changed how the industry processes data:

Google File System (GFS, 2003)

Stores data across thousands of machines. A NameNode (GFS master) manages metadata (which machines hold which chunks). DataNodes (GFS chunkservers) store the actual data.

Solved the storage problem.

MapReduce (2004)

Distributes computation across thousands of machines. A coordinator assigns tasks and handles failures. Workers execute user-defined functions.

Solved the computation problem.

Today we use these systems as a lens to see every concept from the semester in one place — coupling, cohesion, information hiding, requirements, architecture, concurrency, consistency, performance, safety, and sustainability.

Set the stage before diving into requirements. Students need to know what the two systems are and what role each plays before they can appreciate the requirements that shaped them.

Terminology note: The GFS paper names the metadata service the master and the chunk-storing workers chunkservers—matching lecture-notes/l37-map-reduce. NameNode and DataNode are the usual HDFS names for those same roles: NameNode is the analog of the master, and each DataNode is a chunkserver. These slides use the HDFS labels for familiarity; map them to master/chunkserver when you read the GFS paper or the notes. For MapReduce we use coordinator/worker throughout this lecture.

The "lens" framing is the key pedagogical move: this is NOT a distributed systems lecture. Students will not implement MapReduce. The goal is to see how design concepts they already know appear at scale. The museum hallway from the cover image is the metaphor — every painting (lecture) converges into this one system.

Transition: Before we look at the systems, let's state what they need to do...

Start with Requirements: What Does This System Need to Do?

From L9 and L18: requirements drive architecture. Before we look at MapReduce and GFS, let's state what the system needs — using the quality scenario format.

	Source	Stimulus	Environment	Response	Measure
Throughput	Search infrastructure	Rebuild the web index from billions of crawled pages	Hundreds of TB of raw data, thousands of commodity machines	Produce inverted index, PageRank scores	Complete in hours, not weeks
Fault tolerance	Hardware	Machine crashes mid-computation	1000+ machines; failures are routine, not exceptional	Recover and continue without restarting the job	No data loss; no human intervention
Scalability	Web growth	The web is growing exponentially	Same codebase, same programming model	Add machines, not rewrite software	Linear throughput scaling with machines added

These three requirements — throughput, fault tolerance, scalability — shaped every architectural decision in MapReduce and GFS. As we walk through the system, ask: which requirement drove this decision?

This is the L9 → L18 payoff. Students learned quality scenarios in L18 and requirements analysis in L9. Now they see the same framework applied to a real system at Google scale.

Walk through each row briefly:

Throughput: Hundreds of TB is the scale we'll establish on the next slide. A single machine reading at ~500 MB/s would take days — Google needed results in hours.
Fault tolerance: With 1000+ machines, the probability of at least one failing during a multi-hour job is nearly 100%. Failure is not an exception — it's the normal operating environment. This is the constraint that forces pure functions and re-execution.
Scalability: The web keeps growing. If doubling the crawl size requires rewriting the system, that's unsustainable (L36). The architecture must scale by adding machines, not changing code.
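The throughput and fault-tolerance rows can be checked with back-of-envelope arithmetic. The Python sketch below assumes 100 TB as the low end of "hundreds of TB" and a hypothetical per-machine failure probability of 0.5% over the life of one job; that failure rate is an illustrative assumption, not a figure from the lecture.

```python
TB = 10**12
MB = 10**6

data_bytes = 100 * TB        # low end of "hundreds of TB"
read_rate = 500 * MB         # bytes per second for one machine, from the note above

# One machine scanning the whole crawl sequentially.
seconds_single = data_bytes / read_rate
days_single = seconds_single / 86_400          # a little over two days

# Idealized bound only: perfect linear scaling, no shuffle, no stragglers.
machines = 1000
seconds_ideal = seconds_single / machines

# ASSUMPTION: each machine fails independently with probability p during the job.
p = 0.005
p_any_failure = 1 - (1 - p) ** machines        # chance that at least one machine fails

print(f"single machine: {days_single:.2f} days")
print(f"ideal on {machines} machines: {seconds_ideal:.0f} s")
print(f"P(at least one failure): {p_any_failure:.3f}")
```

Even with a small per-machine failure rate, the chance that some machine fails during the job is close to 1, which is why recovery has to be automatic.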

Programming Paradigms

Object-Oriented Programming

•Organize code around objects that bundle state and behavior

•State changes over time via method calls

•Model the world as interacting entities

Functional Programming

•Organize code around functions that transform data

•Avoid mutable state — same input always gives same output

•Model the world as data flowing through pipelines

Which is best? It depends.

Have both in your toolkit.
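A small contrast may help make the two paradigms concrete. This is a sketch in Python (a language chosen here for brevity, not prescribed by the slides); `Thermostat` and `raised` are hypothetical names.

```python
class Thermostat:
    """Object-oriented style: state and behavior are bundled together."""

    def __init__(self, target: float) -> None:
        self.target = target

    def raise_by(self, degrees: float) -> None:
        # Mutates the object: the result depends on the calls made so far.
        self.target += degrees


def raised(target: float, degrees: float) -> float:
    """Functional style: no mutable state, same input gives same output."""
    return target + degrees


thermostat = Thermostat(20.0)
thermostat.raise_by(1.5)
thermostat.raise_by(1.5)      # same call, different resulting state each time

first = raised(20.0, 1.5)
second = raised(20.0, 1.5)    # same call, same result every time
```

The second style is the one MapReduce asks of its users: a function that can be run again on the same input without surprises.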

The Map and Reduce Higher-Order Functions

Diagram showing bread and vegetables on the left, going through Map to become chopped, then reduced to become a sandwich

map applies a function to every element in a list

reduce combines a list into a single item
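As a small illustration, Python's built-in `map` and `functools.reduce` behave as described. The `chop` and `stack` helpers are hypothetical names that mirror the sandwich diagram.

```python
from functools import reduce


def chop(ingredient: str) -> str:
    """Hypothetical helper: transform a single ingredient."""
    return f"chopped {ingredient}"


def stack(sandwich: str, piece: str) -> str:
    """Hypothetical helper: combine the running result with one more piece."""
    return f"{sandwich} + {piece}"


ingredients = ["bread", "lettuce", "tomato"]

# map applies chop to every element, producing a sequence of the same length.
chopped = list(map(chop, ingredients))

# reduce folds the sequence into one value, left to right.
sandwich = reduce(stack, chopped)

print(chopped)
print(sandwich)
```

Each call to `chop` is independent of the others, so the map step could run on different machines; only the reduce step needs to see the pieces together.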

Counting Words on the Web with MapReduce

Image showing use of MapReduce to generate a count of words in documents

map(key:String, document:String):Void ->
    for each w:word in document:
        emit(w, 1)

reduce(word:String, counts:List[Int]):Int ->
    return sum(counts)

Source: https://docs.hazelcast.org/docs/3.2/manual/html/mapreduce-essentials.html
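The pseudocode above can be tried on a laptop with a minimal single-process sketch in Python. `run_mapreduce` is a hypothetical stand-in for the framework: it runs the mapper, groups values by key (the shuffle), and runs the reducer once per key. It is not how Google's implementation works internally.

```python
from collections import defaultdict
from typing import Iterator


def map_words(key: str, document: str) -> Iterator[tuple[str, int]]:
    """Emit (word, 1) for every word in the document."""
    for word in document.split():
        yield (word, 1)


def reduce_counts(word: str, counts: list[int]) -> int:
    """Sum all the 1s emitted for this word."""
    return sum(counts)


def run_mapreduce(inputs, mapper, reducer):
    """Hypothetical in-memory stand-in for the framework."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)   # "shuffle": group by key
    return {k: reducer(k, values) for k, values in groups.items()}


docs = [("doc1", "the cat sat"), ("doc2", "the dog sat")]
word_counts = run_mapreduce(docs, map_words, reduce_counts)
print(word_counts)
```

Note that `map_words` and `reduce_counts` never mention machines, files, or failures; all of that would live in the framework.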

Sandwich MapReduce with Shuffle

Diagram with bread and vegetables at left going through map (chop), shuffle (group), and reduce (assemble into sandwich) stages

https://web.stanford.edu/class/cs110/summer-2021/lecture-notes/lecture-17/

Building an Index with MapReduce

Image showing use of MapReduce to build an index

map(url:String, document:String):Void ->
    for each w:word in document:
        emit(w, url)

reduce(word:String, urls:List[String]):String ->
    return urls.join(prefix=word, separator=' ')

https://web.stanford.edu/class/cs110/summer-2021/lecture-notes/lecture-17/
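The index-building pseudocode can be sketched the same way in Python. `run_mapreduce` is a hypothetical single-process stand-in for the framework, and the URLs are made up. Sorting and de-duplicating the URLs in the reducer is an addition made here so the output is deterministic; the slide's pseudocode simply joins them.

```python
from collections import defaultdict
from typing import Iterator


def map_index(url: str, document: str) -> Iterator[tuple[str, str]]:
    """Emit (word, url) for every word on the page."""
    for word in document.split():
        yield (word, url)


def reduce_index(word: str, urls: list[str]) -> str:
    """Join the word and its URLs with spaces (sorted, de-duplicated here)."""
    return " ".join([word, *sorted(set(urls))])


def run_mapreduce(inputs, mapper, reducer):
    """Hypothetical in-memory stand-in for the framework."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return {k: reducer(k, values) for k, values in groups.items()}


pages = [("a.example", "cats and dogs"), ("b.example", "dogs and birds")]
index = run_mapreduce(pages, map_index, reduce_index)
print(index["dogs"])
```

Compared with word count, only the emitted value changed (a URL instead of 1) and the reducer changed (join instead of sum); the framework code is identical.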

Poll: What does this calculate?

Graph showing 8 names and faces connected by edges

QR code linking to the PollEV survey for espertus

Text espertus to 22333 if the URL isn't working for you. https://pollev.com/espertus

// map over edges
map(p1:Person, p2:Person):Void ->
    emit(p1, p2)

reduce(p: Person, persons:List[Person]):Int ->
    return persons.length

A. How many connections each person has
B. How many nodes are in the graph
C. How many edges are in the graph
D. None of the above

MapReduce: How It Works

Input is split into chunks, sent to different machines.
Map workers process data, producing (key, value) pairs.
Shufflers send all data with the same key to the same reduce worker.
Reduce workers write the output to GFS.

Start simple. The input is one big dataset — all of SceneItAll's device logs. The framework splits it into chunks (16-64MB each) and assigns each chunk to a map worker. Each worker runs the same map function on its chunk and emits intermediate key-value pairs.

Key point: The map workers are completely independent. They don't talk to each other. This is why it scales — add more workers, process more chunks in parallel.

The shuffle is the framework's job, not yours. This is the expensive part — moving data across the network so all values for a given key end up on the same machine.

Example: If the map phase emitted ("living-room", 1) from Worker 1 and ("living-room", 1) from Worker 3, the shuffle ensures both arrive at the same reduce worker. That reduce worker then sums them: ("living-room", 2).
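One way to picture the shuffle in that example is hash partitioning: every map worker applies the same function to a key to decide which reduce worker receives it. The Python sketch below assumes three reduce workers and uses `zlib.crc32` as a stable hash (Python's built-in `hash` for strings varies between processes). The worker names and the `kitchen` pair are illustrative.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 3  # assumed for illustration


def partition(key: str, num_reducers: int) -> int:
    """Pick a reduce worker; every map worker computes the same answer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


# Intermediate pairs as they might sit on two different map workers.
map_outputs = {
    "worker-1": [("living-room", 1), ("kitchen", 1)],
    "worker-3": [("living-room", 1)],
}

# Shuffle: route each pair to the inbox of the reducer chosen by partition().
reducer_inboxes = defaultdict(lambda: defaultdict(list))
for worker, pairs in map_outputs.items():
    for key, value in pairs:
        reducer_inboxes[partition(key, NUM_REDUCERS)][key].append(value)

# Reduce: each reducer sums the values for the keys it received.
totals = {
    key: sum(values)
    for inbox in reducer_inboxes.values()
    for key, values in inbox.items()
}
print(totals)
```

Because the partition depends only on the key, both `("living-room", 1)` pairs land in the same inbox no matter which map worker produced them.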

Transition: But who orchestrates all of this? Who assigns tasks, tracks failures, restarts crashed workers?

What Happens Behind the Scenes

The coordinator assigns tasks, monitors workers, and restarts failed ones — transparent to the programmer.

Now the full picture. Students have seen the data flow (map → shuffle → reduce) across the previous two slides. This slide adds the coordinator, the orchestrator.

The coordinator is the single point of coordination, not computation. It keeps track of which chunks have been processed, which workers are alive, and which tasks need retrying. If a map worker crashes, the coordinator reassigns its chunks to another worker. The programmer's code doesn't change.

Hexagonal architecture callback (L16): The programmer writes two pure functions (map and reduce). The framework handles all infrastructure: splitting, distribution, shuffling, fault tolerance, storage. This is the same separation of domain logic from infrastructure that we've been teaching all semester — just at Google scale.

Scale: Google ran this on thousands of machines. The same architecture handles 10 workers or 10,000 workers.

Transition: Where does the data live?

Your Laptop's File System Was Designed for a Different Workload

Most file systems are designed for open-edit-save workflows on mostly small files.

Your laptop's file system	Data processing workload
Small files (KB-MB)	Huge files (GB-TB) — a single crawl output can be terabytes
Random reads and writes	Append-only writes, sequential reads — scan start to finish
Open, edit, save, close	Write once, read many times, never modify
Permissions, directories, timestamps	Irrelevant at batch scale
4KB blocks	Wasteful — 50GB file = 13 million blocks of bookkeeping

Students have never thought about file systems as a design choice. They just save files and they work. The point is to make the invisible visible: the file system on your laptop embodies design decisions optimized for a specific workload. Those decisions are brilliant for that workload — and terrible for data processing.

The append-then-scan pattern shows up everywhere:

Apache Kafka: append-only message log, consumers read sequentially
Database write-ahead logs (WAL): append-only for crash recovery
Git object store: content-addressed blobs, append-only, garbage collected
Cloud storage (S3): write-once objects, read sequentially — S3 literally does not support editing a byte in the middle of a file

If a student asks "what's it called?": POSIX (Portable Operating System Interface) — a standard from the 1980s. macOS, Linux, and (mostly) Windows all follow it. You'll learn about it in CS 3650 (Systems).

The L6 connection: Your laptop's file system and GFS are both interfaces that hide how data is stored. Same information hiding principle — but the abstraction boundary is in a different place because the requirements are different.

Transition: So what does a file system designed for this workload look like? Let's start with why the blocks need to be bigger...

Performance at Scale: Bigger Chunks and Data Locality

Why does GFS use larger chunks?

Fixed cost per chunk	With 4KB blocks (100 TB)	With 64MB chunks (100 TB)	Reduction
Metadata entry in NameNode RAM	26 billion entries	1.6 million entries	16,000x
Replication coordination (3 copies each)	78 billion replica records	4.8 million replica records	16,000x
I/O round trips per read (NameNode lookup + DataNode setup)	26 billion round trips	1.6 million round trips	16,000x
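The counts in the table follow from dividing the data size by the chunk size. The Python sketch below uses binary units (1 TB = 2^40 bytes), which is an assumption that happens to reproduce the slide's rounded figures.

```python
KIB, MIB, GIB, TIB = 2**10, 2**20, 2**30, 2**40


def chunk_count(total_bytes: int, chunk_bytes: int) -> int:
    """Number of chunks needed to hold total_bytes (ceiling division)."""
    return -(-total_bytes // chunk_bytes)


total = 100 * TIB
small_blocks = chunk_count(total, 4 * KIB)     # about 26.8 billion
large_chunks = chunk_count(total, 64 * MIB)    # about 1.6 million
reduction = small_blocks // large_chunks       # 64 MiB / 4 KiB

# The single 50 GB file from the laptop file system comparison.
blocks_in_50gb = chunk_count(50 * GIB, 4 * KIB)

print(f"{small_blocks:,} blocks vs {large_chunks:,} chunks ({reduction:,}x)")
print(f"{blocks_in_50gb:,} blocks in a 50 GB file")
```

The reduction factor does not depend on the total data size; it is just the ratio of the two chunk sizes.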

We've seen this before: Batching and Locality (L34)

This is L34's batching pattern at infrastructure scale. In L34 we showed that batching database queries (N+1 → 1 query with JOIN) and batching network calls (N API calls → 1 batch endpoint) amortize fixed per-invocation costs. GFS applies the same principle to storage.

Walk through each row:

Metadata (scalability): The NameNode keeps every chunk's location in RAM. 26 billion entries would require enormous memory; 1.6 million is manageable. Bigger chunks mean the NameNode can track a bigger file system without running out of RAM. You need a directory of pages either way — make each entry cover more data.
Replication (redundancy): Each chunk is replicated to 3 DataNodes. Replication has a fixed coordination cost per chunk — the NameNode must track locations, detect failures, schedule re-replication. Fewer chunks = less coordination overhead. The redundancy itself (3 copies) costs the same per byte, but the coordination cost per byte drops dramatically.
I/O (throughput): Each read requires a round trip to the NameNode ("where is this chunk?") plus a TCP handshake with the DataNode. With 4KB blocks, these fixed costs dominate: most time is spent on overhead, not reading data. With 64MB chunks, one round trip yields a sustained sequential read where overhead is negligible. Bigger pages mean fewer calls to the underlying layers per byte of data.

The insight is general: When fixed costs dominate variable costs, make the batch bigger. This applies to database queries, network calls, and file system blocks. The difference at GFS scale is that the fixed costs are high and the data is massive — so the optimal batch size is enormous.

Locality (also L34): MapReduce exploits the data-center memory hierarchy. Local disk: microseconds. Same-rack network: ~1ms. Cross-datacenter: ~100ms. The coordinator knows (from the NameNode) which DataNodes hold each chunk and assigns map tasks to workers on the same machine — the data-center equivalent of cache-friendly access. Combined with 64MB chunks, map workers mostly read from local disk, keeping network bandwidth free for the shuffle phase.
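A greedy version of locality-aware assignment can be sketched in a few lines of Python. Everything here is hypothetical (the function name, the machine and chunk names, and the greedy policy); it only illustrates the idea of preferring a worker that already holds a replica.

```python
def assign_map_tasks(
    chunk_locations: dict[str, list[str]],
    idle_workers: list[str],
) -> dict[str, str]:
    """Assign each chunk to an idle worker, preferring one that stores it."""
    free = list(idle_workers)
    assignment: dict[str, str] = {}
    for chunk, holders in chunk_locations.items():
        if not free:
            break  # remaining chunks wait for a worker to become idle
        # Prefer a worker on a machine that already holds a replica.
        local = next((w for w in free if w in holders), None)
        chosen = local if local is not None else free[0]  # else: remote read
        assignment[chunk] = chosen
        free.remove(chosen)
    return assignment


# Replica locations as the NameNode might report them (3 copies per chunk).
chunk_locations = {
    "chunk-0": ["m1", "m2", "m3"],
    "chunk-1": ["m2", "m4", "m5"],
    "chunk-2": ["m6", "m7", "m8"],
}
assignment = assign_map_tasks(chunk_locations, idle_workers=["m2", "m4", "m9"])
print(assignment)
```

Here two of the three tasks read from local disk; only `chunk-2` has to cross the network because none of its replica holders is idle.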

Pooling (also L34): Worker thread pools are reused across tasks, avoiding per-task startup costs. Three L34 patterns — batching, locality, pooling — all visible in one system.

If a student asks "why not even bigger?": Good question. Larger chunks mean more wasted space for small files (internal fragmentation) and longer recovery times when a chunk is lost (must re-replicate more data). 64MB was Google's empirical sweet spot for their workload.

Transition: Now let's see the full GFS architecture...

GFS Stores Data Across Thousands of Machines with Built-In Redundancy

Together, MapReduce + GFS = programmers write simple sequential functions, and the framework distributes execution across thousands of machines, handles failures, and manages storage transparently.

Pure Functions Make Re-Execution Safe — and Fault Tolerance Trivial

Failure	Detection	Recovery	Why it works
Map worker dies	Coordinator pings periodically	Reassign task to another worker	Pure function (L5) → re-execution produces same output
Reduce worker dies	Coordinator pings periodically	Reassign reduce task	Idempotent (L33) → re-execution is safe
DataNode dies	NameNode detects missing heartbeat	Read from surviving replica (2 of 3 remain)	Redundancy (L35) → no single point of failure for data

Two concepts, one argument. The pure-function constraint (L5, L7, L31) is what enables fault tolerance. Students should see that the design decision that eliminates race conditions is the SAME decision that makes crash recovery safe. This is the overlapping-lenses theme of the lecture.

L7 vocabulary: Map workers have the best possible coupling: NONE. Zero shared state. The only coupling is through the shuffle — data coupling (primitive key-value pairs). The trade-off is real: you cannot compare two homes in one map function; you must emit both under the same key and compare in reduce. This deliberate constraint buys safe parallelism AND safe recovery.

GFS replication pays off too: When a DataNode dies, the input data is still available on 2 other DataNodes. The new worker reads from a surviving replica. Redundancy from L35.

The key insight: Fault tolerance is not an add-on. It emerges from the pure function constraint (L5) and the replication strategy (L35). Every prior design decision contributes.
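The argument can be demonstrated with a toy simulation in Python. The names (`WorkerCrashed`, `make_worker`, `coordinator_run`) are hypothetical and the crash is simulated; the point is only that re-running a pure map task on another worker gives the same output as a run that never failed.

```python
class WorkerCrashed(Exception):
    """Simulated machine failure."""


def map_task(chunk: str) -> list[tuple[str, int]]:
    """Pure function: output depends only on the input chunk."""
    return [(word, 1) for word in chunk.split()]


def make_worker(crashes_before_success: int):
    """Build a worker that crashes a given number of times, then succeeds."""
    remaining = crashes_before_success

    def run(task, chunk):
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            raise WorkerCrashed("simulated crash")
        return task(chunk)

    return run


def coordinator_run(task, chunk, workers):
    """Try workers in order, reassigning the task whenever one crashes."""
    for worker in workers:
        try:
            return worker(task, chunk)
        except WorkerCrashed:
            continue  # safe to retry: the task has no side effects
    raise RuntimeError("all workers failed")


chunk = "lamp on lamp off"
expected = map_task(chunk)
recovered = coordinator_run(map_task, chunk, [make_worker(1), make_worker(0)])
print(recovered == expected)
```

If `map_task` wrote to a shared counter or a database, the retry could double-count; the pure-function constraint is what makes the `continue` line safe.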

Transition: Let's apply the Swiss cheese model to a specific failure scenario...

MapReduce Made Data Processing Cheap — So Total Consumption Exploded

Before MapReduce (early 2000s): processing the web index was a bespoke engineering effort — custom code, manual failure handling, weeks per pipeline.

MapReduce made it cheap: write two functions, submit a job, get results.

What happened: Per-job engineering cost dropped dramatically. Google ran thousands of MapReduce jobs daily by 2004. The efficiency gain per job was overwhelmed by the increase in total jobs. MapReduce consumed a significant fraction of Google's total compute.

L36 (Jevons' Paradox): Same pattern as Pawtograder — per-submission grading cost dropped, but unlimited submissions dramatically increased total compute. Efficiency is not sustainability.

MapReduce's Strengths Became Its Limitations as Context Changed

MapReduce's limitations drove Google to build successors:

Limitation	Successor	What changed
Batch-only — cannot serve interactive queries	Dremel (interactive SQL) → BigQuery	Users needed answers in seconds, not hours
Two-function model — too restrictive for iterative algorithms	Systems supporting iterative computation (ML requires multiple passes over data)	Machine learning became a dominant workload
GFS single NameNode — metadata bottleneck at scale	Colossus — distributes metadata across multiple servers	File system grew beyond one NameNode's capacity

Open Source Made MapReduce Sustainable Beyond Google

Google published the MapReduce (2004) and GFS (2003) papers — but kept the code proprietary. Doug Cutting and Mike Cafarella read the papers and built Apache Hadoop: an open source implementation.

What happened	Sustainability dimension
Yahoo, Facebook, and hundreds of companies adopted Hadoop	Economic — shared infrastructure cost, no single-vendor lock-in
Entire ecosystem grew: Spark, Hive, HBase, Kafka	Technical — clean programming model enabled open source reimplementation
Google's proprietary system was replaced internally — but the ideas live on in open source	Social — any organization can process data at scale, not just Google
Same pattern: Google's Borg → open source Kubernetes (container orchestration for everyone)	Technical — clean abstractions enable reimplementation; Borg's ideas power every cloud provider
Hadoop and K8s clusters consume enormous energy worldwide	Environmental — Jevons' paradox again, now at industry scale

The Inventors of MapReduce

Wired Magazine story 'If Xerox PARC Invented the PC, Google Invented the Internet' by Cade Metz, 08.08.12.
Photo shows two smiling men with caption 'Jeff Dean and Sanjay Ghemawat, two of the most important software engineers of the Internet age -- and two of the most underappreciated.'

"DEC was one of the first companies to build a successful web search engine — AltaVista, which came out of the Western Research Lab — and at least in the beginning, the entire thing ran on a single DEC machine. But Google eclipsed AltaVista in large part because it turned this model on its head. Rather than using big, beefy machines to run its search engine, it broke its software into pieces and spread them across an army of small, cheap machines. This is the fundamental idea behind GFS, MapReduce, and BigTable — and so many other Google technologies that would overturn the status quo."

Jeff Dean Facts

The rate at which Jeff Dean produces code jumped by a factor of 40 in late 2000 when he upgraded his keyboard to USB 2.0.

Jeff Dean once failed a Turing test when he correctly identified the 203rd Fibonacci number in less than a second.

Google search went down for a few hours in 2002, and Jeff Dean started handling queries by hand. Search Quality doubled.

Jeff Dean's infinite loops run in 5 seconds.

http://www.quora.com/Jeff-Dean/What-are-all-the-Jeff-Dean-facts

Jeff Dean (2020)

https://bloomberg.com/news/articles/2020-12-04/google-scientist-s-abrupt-exit-exposes-rift-in-prominent-ai-unit

Reputations

Photo of old man in suit, Warren Buffett

“It takes 20 years to build a reputation and five minutes to ruin it. If you think about that, you'll do things differently.”
— Warren Buffett

CC-BY 2.0 Marco Verch

Washington Post headline: 'Google hired Timnit Gebru to be an outspoken critic of unethical AI. Then she was fired for it.' Photo of Timnit Gebru speaking at TechCrunch Disrupt SF 2018.

First page of the 2021 FAccT paper 'On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?' by Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell.

Lessons

Keep an eye out for paradigm shifts. Don't solve today's problems with yesterday's tools.

Software architecture comes out of requirements.

"Luck is what happens when preparation meets opportunity." — Seneca

Act in accordance with your values.

Bonus Slide