A pixel art split-scene: LEFT shows a developer handing a note directly to a colleague (instant, 0.0001 sec). RIGHT shows an epic chaotic postcard journey through submarine cable cuts, latency, packet loss, eavesdroppers, a Palo Alto Networks firewall arbitrarily blocking traffic (labeled 'NEU contracted us'), BGP roulette wheel, and toll plazas — with multiple retry postcards. Tagline: When Code Leaves the Building, Everything Changes.

CS 3100: Program Design and Implementation II

Lecture 20: Distributed Architecture — Networks, Microservices, and Security

©2026 Jonathan Bell, CC-BY-SA

Announcements

Midterm Survey via Qualtrics, +1 participation credit if you complete it

  • Due Friday 2/27 @ 11:59 PM
  • Anonymous link in your email
  • 150 students have completed it so far

Team Formation Survey Released!

  • Starting Week 10: teams of 4 for CookYourBooks GUI project
  • Tell us your preferences + availability
  • Due Friday 2/27 @ 11:59 PM
  • Complete the Survey →

HW4 Due Thursday Night

Learning Objectives

After this lecture, you will be able to:

  1. Explain why network communication fundamentally changes architectural tradeoffs compared to in-process method calls
  2. Identify and explain the Fallacies of Distributed Computing and how they affect system design
  3. Describe the client-server architecture and REST API conventions used for service communication
  4. Analyze the benefits and costs of microservices compared to monolithic architectures
  5. Apply security principles (authentication, authorization, trust boundaries, CIA triad) to distributed system analysis

Important framing: Junior engineers read API documentation and debug network issues far more often than they design new distributed architectures. Comprehension comes first — you'll understand distributed systems well enough to work within them confidently.

Recap: The Network Changes Everything

In L19, we ended with a teaser: method calls in a monolith are instant, reliable, and traceable. Over a network? Everything changes.

Monolith (Bottlenose)

submission.computeGrade();
// ✅ Executes in nanoseconds
// ✅ Always succeeds or throws
// ✅ Full stack trace on error
// ✅ Wrapped in a DB transaction

Distributed (Pawtograder)

feedbackApi.submit(submissionId, feedback);
// ⚠️ Might take ms... or seconds... or ∞
// ⚠️ Server might be down or overloaded
// ⚠️ Request succeeds, response lost
// ⚠️ Retry = accidentally grade twice?
// ⚠️ No cross-system transactions

Today we'll explore why networks are hard, what patterns handle these challenges, and how security changes when code leaves the building.

The Price of Distribution: Hard Physical Limits

Some network constraints aren't design choices or wrong assumptions — they're physics. No software engineering can fix them. This is what you're signing up for when you leave the monolith.

The Speed of Light: ~1 foot per nanosecond

Light in a vacuum travels roughly 1 foot per nanosecond; signals in fiber and copper propagate at roughly two-thirds of that speed. The table below uses the 1 ns/ft figure as an absolute floor.

Distance | Min. round-trip
Across a server rack (~10 ft) | ~20 ns
Across a data center (~1,000 ft) | ~2,000 ns
Boston → New York (~200 mi) | ~2,100,000 ns (~2.1 ms)
Boston → London (~3,300 mi) | ~35,000,000 ns (~35 ms)
Boston → Tokyo (~6,700 mi) | ~71,000,000 ns (~71 ms)

(Actual latency is always higher — routing, switching, queuing add more.)
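The floor values in the table are simple arithmetic: round-trip feet times ~1 ns/ft. A back-of-envelope sketch (the class and method names are ours, for illustration only):

```java
// Back-of-envelope check of the latency table. Names are illustrative.
public class LatencyFloor {
    // Theoretical minimum round trip in milliseconds, at ~1 ns per foot
    static double minRoundTripMs(double oneWayMiles) {
        double oneWayFeet = oneWayMiles * 5280;
        double roundTripNs = 2 * oneWayFeet; // ~1 ns/ft, doubled for the return leg
        return roundTripNs / 1_000_000;      // ns -> ms
    }

    public static void main(String[] args) {
        System.out.println(minRoundTripMs(200));  // Boston -> New York: ~2.1 ms
        System.out.println(minRoundTripMs(3300)); // Boston -> London: ~35 ms
    }
}
```

No amount of engineering lowers this floor; everything real (routing, switching, queuing) only adds to it.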

Compare: Memory access in a monolith

Operation | Latency
L1 CPU cache hit | ~0.5 ns
L2 CPU cache hit | ~3 ns
L3 CPU cache hit | ~10 ns
RAM access | ~100 ns

A Boston-London round-trip takes ~70,000,000× longer than an L1 cache hit.

A method call in a monolith? Essentially free by comparison.

Shared Fate vs. Independent Failure

In a monolith, all components share the same process — they cannot be partially disconnected by accident. Distributed systems introduce a new failure mode that simply doesn't exist in a monolith.

Monolith: Shared fate

If the process is running, all components can talk to each other. Always.

No one can trip over a cable and disconnect the grading logic from the database — they're in the same memory space.

A server crash takes everything down together, but there's no such thing as a partial disconnection between components.

Distributed system: Independent failure

Things that can disconnect two services that would never affect a monolith:

  • A cloud provider's switch fails in one availability zone
  • A university contracts with a network filter that decides your service is malware
  • A student's ISP throttles GitHub traffic during grading
  • A ship anchor cuts a submarine fiber cable in the Pacific

A "network partition" — two parts of a system that can't reach each other — is impossible in a monolith and inevitable in a distributed system. You must design for it.

Client-Server Architecture

Clients make requests, servers respond. The most ubiquitous pattern — every web app, mobile app, and service-to-service call.

// Client code (Grading Action) — what actually runs
HttpClient client = HttpClient.newHttpClient();
HttpRequest request = HttpRequest.newBuilder()
    .uri(URI.create(API_URL + "/submitFeedback"))
    .header("Authorization", "Bearer " + authToken)
    .header("Content-Type", "application/json")
    .POST(BodyPublishers.ofString(feedbackJson))
    .build();

HttpResponse<String> response = client.send(
    request, BodyHandlers.ofString());

Benefits — Centralized control/state, update server → all clients benefit, enforce security policies, multiple clients connect simultaneously

Constraints — Server = single point of failure, network latency on every op, must handle errors/timeouts/retries, client-initiated only

The Network Itself Is a Layered Architecture

Client-server doesn't require HTTP — it's just one option. The OSI model shows how network communication is organized into layers, each solving one problem. Sound familiar from L19?

The OSI 7-layer model with HTTP highlighted at Layer 7 (Application). Shows how a network request travels down through all layers on the client, across the internet, and back up on the server. Callout notes this is for context only, not on exam. Footer connects to Layered Architecture from L19.

How Services Communicate: HTTP and REST

HTTP is the foundation. REST (Representational State Transfer) is a set of conventions built on HTTP for structuring APIs. An HTTP request has: a method (verb), a URL (resource), and optionally a body (data).

// GET — retrieve a resource (read-only, no body)
HttpRequest get = HttpRequest.newBuilder()
    .uri(URI.create(BASE_URL + "/submissions?student_id=" + studentId))
    .GET().build();

// POST — create a resource or trigger an action
HttpRequest post = HttpRequest.newBuilder()
    .uri(URI.create(BASE_URL + "/functions/v1/createSubmission"))
    .header("Content-Type", "application/json")
    .POST(BodyPublishers.ofString("{\"repo\": \"cs3100/hw1-alice\"}")).build();

// PATCH — partial update (only fields you're changing)
HttpRequest patch = HttpRequest.newBuilder()
    .uri(URI.create(BASE_URL + "/submissions/123"))
    .method("PATCH", BodyPublishers.ofString("{\"score\": 87}")).build();

// DELETE — remove a resource
HttpRequest delete = HttpRequest.newBuilder()
    .uri(URI.create(BASE_URL + "/submissions/123"))
    .DELETE().build();

REST organizing principle: Organize around nouns (submissions, assignments, students) and manipulate them with standard verbs. Once you know the pattern, every RESTful API works the same way.

REST didn't come from a committee. Roy Fielding co-authored the HTTP/1.0 spec, then asked: why does this work so well? His 2000 PhD thesis reverse-engineered HTTP into a set of architectural constraints. REST is the name he gave that style.

REST's Six Architectural Constraints (Fielding, 2000)

  • Client-Server — separate UI concerns from data storage
  • Stateless — each request contains all info needed; no session state on the server
  • Cacheable — responses must declare whether they can be cached
  • Uniform Interface — standard verbs + resource URLs + self-describing messages
  • Layered System — clients don't know if they're talking to the real server or a proxy
  • Code on Demand (optional) — servers may extend clients by sending executable code (e.g., JavaScript)

Connection to L18: Drivers → Style

Each constraint maps to a quality attribute driver:

Constraint | Quality Attribute
Stateless | Scalability — any server handles any request
Cacheable | Performance — skip redundant fetches
Uniform Interface | Changeability — swap server implementations
Layered System | Security + Deployability — add proxies/CDNs transparently

The same process we used for Pawtograder — identify quality attributes, apply heuristics, let the style emerge — is how REST was discovered.

REST is not a standard, a spec, or a protocol. It's an architectural style — a set of constraints that, when applied, produce a system with desirable properties. Fielding studied what made HTTP work, then named the pattern.

REST Status Codes: The Language of Failure

One of REST's great gifts: standardized error codes. The status code tells you where to look when debugging.

Code | Meaning | Action
2xx | Success | Process response
200 OK | Request succeeded |
201 Created | Resource created |
4xx | Client Error | Fix your code
401 Unauthorized | Not authenticated | Refresh token
403 Forbidden | Not allowed | Check permissions
429 Too Many Requests | Rate limited | Back off
5xx | Server Error | Retry with backoff
503 Service Unavailable | Server overloaded | Wait and retry

HttpResponse<String> response = client.send(
    request, BodyHandlers.ofString());

switch (response.statusCode() / 100) {
    case 2 -> processSuccess(response.body());
    case 4 -> {
        if (response.statusCode() == 401) {
            refreshToken(); // Auth expired
        } else if (response.statusCode() == 429) {
            sleepUntil(response.headers()
                .firstValue("Retry-After"));
        } else {
            throw new ClientError(response);
        }
    }
    case 5 -> throw new RetryableException(response);
}

Key insight: 4xx = your code is wrong (don't retry). 5xx = server is struggling (retry with backoff). 401 = who are you? 403 = I know you, but no.

The Fallacies of Distributed Computing

Eight warning signs arranged in a grid, each crossing out a wrong assumption about distributed systems: The network is reliable, Latency is zero, Bandwidth is infinite, The network is secure, Topology doesn't change, There is one administrator, Transport cost is zero, The network is homogeneous. Clean flat illustration style with warning colors.

Peter Deutsch and colleagues at Sun Microsystems identified eight assumptions developers make about networks — all of which are false. These are the Fallacies of Distributed Computing.

Fallacy 1: "The Network Is Reliable"

Networks fail. Cables get unplugged, routers crash, cloud providers have outages. Code that assumes a network call will always succeed is fragile code.

The fragile version

// Assumes the network always works
HttpResponse<String> response = client.send(request, BodyHandlers.ofString());
processResponse(response);
// If request fails → student never sees grade

Better: timeout and retry

HttpResponse<String> response = null;
int attempts = 0;
while (response == null && attempts < 3) {
    try {
        // Timeout belongs on the request itself:
        // HttpRequest.newBuilder().timeout(Duration.ofSeconds(10))...
        response = client.send(request, BodyHandlers.ofString());
    } catch (IOException e) { // includes HttpTimeoutException
        // Exponential backoff: 1s, 2s, 4s
        Thread.sleep((long) Math.pow(2, attempts) * 1000);
        attempts++;
    }
}
if (response == null) {
    logError("Failed after 3 attempts");
    // Show "grading in progress" not a crash
}

Pawtograder: The Grading Action tries to submit feedback. The request times out. Retry logic with exponential backoff — wait 1s, then 2s, then 4s. Either succeeds, or the student sees "grading in progress."

Pattern: Timeout + Retry with Exponential Backoff

Addresses Fallacy 1 (unreliable). Never wait forever — set a deadline, then try again, backing off between attempts.

private static final Random random = new Random();

public HttpResponse<String> sendWithRetry(HttpRequest request) throws Exception {
    int maxAttempts = 3;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            // ALWAYS set a timeout on the request (HttpRequest.Builder.timeout) —
            // without one, a hung server blocks forever
            return client.send(request, BodyHandlers.ofString());
        } catch (IOException e) { // includes HttpTimeoutException
            if (attempt == maxAttempts) throw e;
            // Exponential backoff: 2s, 4s, 8s + jitter to prevent thundering herd
            long backoffMs = (long) Math.pow(2, attempt) * 1000;
            long jitter = random.nextLong(500); // Random 0-500ms
            Thread.sleep(backoffMs + jitter);
        }
    }
    throw new RuntimeException("unreachable");
}

Fixed 1s retry — the thundering herd

100 clients all fail at t=0 (API restart). All retry at t=1s. Server gets slammed again. All fail. All retry at t=2s… The server never gets a chance to recover.

Exponential backoff + jitter — spreading the load

Load arrives in waves the server can absorb.

Only retry 5xx and timeouts. A 400 Bad Request won't fix itself. A 401 Unauthorized won't either. Retrying a 409 Conflict might make things worse.
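That rule fits in a small predicate. A sketch (the class and method names are ours; 429 is treated as retryable per the status-code table earlier):

```java
// Sketch of the retry rule: retry server errors and rate limits, never other client errors.
public class RetryPolicy {
    static boolean isRetryable(int statusCode) {
        if (statusCode / 100 == 5) return true; // 5xx: server struggling, back off and retry
        if (statusCode == 429) return true;     // Rate limited: back off, then retry
        return false;                           // Other 4xx (400, 401, 409, ...): retrying won't fix the request
    }
}
```

Timeouts (no status code at all) are the other retryable case, handled in the catch block of `sendWithRetry` above.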

Pattern: Idempotency — Making Retries Safe

Over a network, "did it run?" is ambiguous — the request may have arrived but the response was lost. Design operations so retrying is safe.

// CLIENT: attach a stable unique key — same key = same operation, don't repeat it
public void submitFeedback(String submissionId, Feedback feedback) {
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create(API_URL + "/functions/v1/submitFeedback"))
        .header("Idempotency-Key", submissionId) // Stable, unique per grading run
        .POST(HttpRequest.BodyPublishers.ofString(gson.toJson(feedback)))
        .build();
    sendWithRetry(request); // Now safe to call multiple times!
}

// SERVER: check the key before doing any work
public Response submitFeedback(Request req) {
    String key = req.header("Idempotency-Key");

    Optional<Response> cached = db.findByIdempotencyKey(key);
    if (cached.isPresent()) {
        return cached.get(); // Already ran — return same result, don't re-grade
    }

    Feedback feedback = gson.fromJson(req.body(), Feedback.class);
    Response result = gradingService.store(feedback);
    db.storeIdempotencyResult(key, result); // Cache so future retries are safe
    return result;
}

HTTP verb idempotency: GET always idempotent (read-only). DELETE idempotent (deleting twice = 404, no harm). PUT idempotent (replace whole resource). POST NOT idempotent by default — that's why you need the Idempotency-Key header.

Fallacy 2: "Latency Is Zero" — Chatty vs Chunky APIs

Every network call takes time. Local method calls: nanoseconds. Network calls: milliseconds to seconds. Minimize round-trips.

Chatty API — 100 network round-trips

// BAD: One API call per test result
for (TestResult result : testResults) {
    api.submitSingleResult(submissionId, result);
    // 100ms latency × 100 tests = 10 SECONDS
}

Chunky API — 1 network round-trip

// GOOD: All results in one payload
FeedbackBatch batch = FeedbackBatch.builder()
    .submissionId(submissionId)
    .results(testResults) // All 100 results
    .build();
api.submitFeedback(batch);
// 100ms latency × 1 call = 100ms

Fallacy 3: Bandwidth is infinite — hash content (e.g., SHA-256) and skip downloads when nothing changed (grader tarball caching)
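The tarball-caching idea can be sketched as follows. This is an illustration under assumed names: `fetchIfChanged`, `downloadTarball`, and the advertised-hash protocol are stand-ins, not Pawtograder's actual API.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Sketch of content-hash caching: hash the cached grader tarball, compare against
// the hash the server advertises, and only download when they differ.
public class TarballCache {
    static String sha256Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
            return HexFormat.of().formatHex(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is available on every JVM
        }
    }

    // advertisedHash: hash the server publishes for the current tarball (hypothetical)
    static byte[] fetchIfChanged(String advertisedHash, byte[] cachedTarball) {
        if (cachedTarball != null && sha256Hex(cachedTarball).equals(advertisedHash)) {
            return cachedTarball; // Unchanged: reuse the cache, spend zero bandwidth
        }
        return downloadTarball(); // Changed or missing: pay the download cost once
    }

    static byte[] downloadTarball() { /* real network fetch elided */ return new byte[0]; }
}
```

The fallacy-3 payoff: a few dozen bytes of hash comparison replaces a multi-megabyte download on every grading run where the grader hasn't changed.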

Fallacy 5: Topology changes — Never hardcode URLs; use config. Servers move, DNS updates.

Fallacy 8: Network is homogeneous — Same code, different behavior on different networks.

Fallacy 4: "The Network Is Secure" (and Fallacies 6–7)

Fallacy 4: The network is secure

Data crossing networks can be intercepted, modified, or spoofed. Every network boundary is a potential attack surface.

Pawtograder: Without an authentication token, anyone could POST fake grades. Without HTTPS, a network observer could read or modify grades in transit.

(We'll dive deep on security later in this lecture.)

Fallacy 6: There is one administrator

Different parts of distributed systems are controlled by different organizations. You can't control what they do.

Real NEU example: Northeastern contracts with Palo Alto Networks to filter all campus traffic. When Palo Alto arbitrarily decided Pawtograder's dev environment was malware, learning was disrupted. NEU claimed no responsibility. This happens all the time.

Fallacy 7: Transport cost is zero

Network calls have real costs: computational (serialization, encryption), monetary (API pricing, bandwidth fees), and energy (radio transmission, data center processing).

Pawtograder: Batching 100 test results into one submitFeedback() call instead of 100 calls doesn't just save latency — it saves energy. 6,000 grading runs/semester × 100 extra API calls = measurable environmental impact.

Pattern: Circuit Breaker — Stop Hammering a Struggling Service

If a service is struggling, hammering it with retries makes it worse. The circuit breaker detects sustained failure and stops trying — giving the service time to recover.

// Three states: CLOSED (normal) → OPEN (failing fast) → HALF_OPEN (testing recovery)
public class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int failureCount = 0;
    private Instant openedAt;

    public Response call(Supplier<Response> request) {
        if (state == State.OPEN) {
            if (Duration.between(openedAt, Instant.now()).toSeconds() < 30) {
                // Fail immediately — fast failure > slow failure
                throw new CircuitOpenException("Service unavailable, try later");
            }
            state = State.HALF_OPEN; // After 30s, allow one probe request through
        }

        try {
            Response response = request.get();
            reset(); // Success: back to CLOSED
            return response;
        } catch (Exception e) {
            failureCount++;
            if (failureCount >= 5) {
                state = State.OPEN; // 5 consecutive failures → trip the circuit
                openedAt = Instant.now();
            }
            throw e;
        }
    }

    private void reset() { state = State.CLOSED; failureCount = 0; }
}

Pattern: Graceful Degradation — Reduced Functionality Beats Crashing

When a service is unavailable, offer reduced functionality rather than crashing. Stale data beats an error screen.

// When grading service is unavailable, give the student useful information
public SubmissionResponse handleSubmission(String submissionId) {
    // Step 1: Record that we received the code (this succeeds even if grading is down)
    db.markSubmissionReceived(submissionId);

    try {
        return gradingApi.triggerGrading(submissionId);
    } catch (CircuitOpenException | ServiceUnavailableException e) {
        // Grading is down — but the student's code IS safe
        return SubmissionResponse.builder()
            .submissionId(submissionId)
            .status("RECEIVED_PENDING_GRADING")
            .message("Your code has been received. Grading is temporarily unavailable, " +
                "but will run automatically once the system recovers. " +
                "You'll receive an email when your results are ready.")
            .build();
    }
}

Design your degraded state intentionally. Tell users what succeeded, what's delayed, and what to expect. A helpful message beats a stack trace. For every service call: if this fails, what should the user experience?

Microservices Architecture: Now With Context

We introduced microservices in L19. Now we understand the cost. Let's look at why teams still pay it.

A microservices architecture decomposes a system into small, independently deployable services, each owning a specific business capability and its own data.

Benefits you pay for

  • Independent scaling: Scale the grading service without scaling the API
  • Elastic scaling: Spin up 500 grading runners at deadline time, scale to zero at 3 AM
  • Isolated failures: Discord bot bug can't crash grading
  • Team autonomy: Grading Action team and API team evolve independently
  • Technology flexibility: Different runtimes for different constraints (GitHub Actions vs Deno vs PostgreSQL)

Costs you definitely pay

  • All eight fallacies apply — every call is a network call
  • Operational overhead: Many builds, many deploys, many log streams
  • Data consistency: No transactions across services — eventual consistency only (more on consistency models in L33)
  • Testing complexity: Integration tests must spin up multiple services
  • Energy overhead: Every inter-service call costs orders of magnitude more than a method call

The Distributed Monolith: All Costs, No Benefits

Left side shows three independent microservices, each with their own database, connected by clean HTTP arrows. Right side shows the Distributed Monolith anti-pattern: three services tangled together with dozens of crossing dependencies and a shared database, labeled 'We can't deploy A without also deploying B and C.' A balance scale at bottom shows all microservices costs but none of the benefits.

The distributed monolith anti-pattern: services that are deployed separately but so tightly coupled they must be changed and deployed together. You pay all eight fallacies' costs — but get none of the benefits (independent scaling, isolated failures, team autonomy).

Quality Attributes: Distribution Creates Challenges AND Opportunities

Several quality attributes become critical when components communicate over networks. Distribution makes these harder to achieve — but also enables solutions impossible in a monolith.

Performance

How fast is the system?

Challenge: Network adds latency to every call

Opportunity: Parallelize work across machines; cache at edge locations near users

Reliability

Does it stay up when things fail?

Challenge: More components = more failure points

Opportunity: Redundancy across machines/regions; no single point of failure

Scalability

Can it handle more load?

Challenge: Coordination overhead; distributed state complexity

Opportunity: Add machines on demand; scale individual bottlenecks independently

The goal isn't to avoid distribution — it's to distribute strategically where the benefits outweigh the costs. Let's make sure you know the vocabulary.

Scaling and Reliability: The Vocabulary You Need

Distribution isn't just a tax — it's also the key to solving problems monoliths can't. Here's the vocabulary you'll encounter in system design docs and interviews.

Availability — The "Nines" Game

Percentage of time the system is operational

Availability | Downtime/year
99% ("two nines") | 3.65 days
99.9% ("three nines") | 8.76 hours
99.99% ("four nines") | 52.6 minutes
99.999% ("five nines") | 5.26 minutes

Each additional nine is exponentially harder and more expensive. High availability is only achievable through redundancy — impossible in a monolith.
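The arithmetic behind the table, and why redundancy is the lever: downtime is (1 - availability) of a year, and if replicas fail independently, n copies at availability a give combined availability 1 - (1 - a)^n. A sketch (names are ours; the independence assumption is idealized, since real failures are often correlated):

```java
// The arithmetic behind the "nines" table, and why redundancy buys nines.
// Idealized model: assumes replicas fail independently.
public class Nines {
    static double downtimeHoursPerYear(double availability) {
        return (1 - availability) * 365 * 24;
    }

    // System is down only if ALL n independent replicas are down at once
    static double combined(double singleAvailability, int n) {
        return 1 - Math.pow(1 - singleAvailability, n);
    }

    public static void main(String[] args) {
        System.out.println(downtimeHoursPerYear(0.999)); // three nines: ~8.76 hours/year
        System.out.println(combined(0.99, 2));           // two 99% replicas together: 99.99%
    }
}
```

Two unremarkable 99% servers behind a load balancer reach four nines together, which is exactly the redundancy a single-process monolith cannot have.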

Vertical vs. Horizontal Scaling

Vertical ("Scale Up"): Bigger machine — more CPU, RAM, disk.

  • ✓ Simple, no code changes
  • ✗ Hardware limits, expensive, single point of failure

Horizontal ("Scale Out"): More machines + load balancer.

  • ✓ Near-infinite scaling, redundancy built-in, elastic
  • ✗ Requires stateless design, all 8 fallacies apply

Pawtograder: GitHub Actions scales horizontally — 500 grading jobs run in parallel on 500 separate runners.

Strategies we've already covered: Caching and batching (Fallacies 2-3), retry + circuit breaker (Fallacy 1), redundancy + failover (reliability). These aren't separate topics — they're all responses to distribution's challenges.

The Key to Horizontal Scaling: Statelessness

Horizontal scaling only works if any server can handle any request. This requires stateless services — each request contains all information needed to process it.

Stateless: Any server works

POST /submitFeedback
{
  "submission_id": "abc123",
  "results": [...],
  "auth_token": "..."  // Identity in request
}
// Load balancer routes to ANY available server

The Grading Action is completely stateless — each run is independent. GitHub can route jobs to any available runner.

Stateful: Locked to one server

POST /submitFeedback
{
  "results": [...]
}
// Server remembers session from earlier request
// Only THIS server can handle this request
// "Sticky sessions" → can't scale freely

If the server dies, the session state is lost. If load spikes, you can't add servers without breaking sessions.

Externalize state: Shared state lives in the database, not in individual servers.

Network Traffic Is Readable and Forgeable by Default

Without security, distributed systems are trivially exploitable. Here's what an attacker on the same network can do:

Eavesdrop

Read every HTTP request. See passwords, tokens, grades, personal data. A student on coffee shop WiFi submits homework — attacker reads their GitHub token.

Modify

Change requests in flight. Alter grades, redirect payments, inject code. Grading Action reports 85% → attacker changes to 100% before API receives it.

Impersonate

Send requests pretending to be someone else. Attacker sends: "I'm student123's grading action, here are perfect scores." How does the API know it's lying?

The network is a hostile environment. Security isn't paranoia — it's engineering for reality.

Every Security Decision Trades Off Confidentiality, Integrity, and Availability

The CIA Security Triad: an equilateral triangle with Confidentiality (padlock), Integrity (shield checkmark), and Availability (clock) at each vertex. Threat labels point inward: Data Breach threatens Confidentiality, Tampering threatens Integrity, Denial of Service threatens Availability. Center: CIA TRIAD.

Draw Lines Where Trust Ends — Validate Everything That Crosses

Trust boundary illustration: Left side shows chaotic 'untrusted zone' like an airport terminal with suspicious figures (fork attacks, Claude leaking scripts). Center shows security checkpoint verifying OIDC signatures. Right side shows calm 'trusted zone' with API processing verified data. Bottom right shows audit/accountability: even if a Trojan horse gets through, the student's verified identity means they can be held accountable.

The API must not trust the action to report: its own repository name, accurate test results, or the submission time. All derived from cryptographically verified sources or computed server-side.

Knowing WHO You Are Is Different from What You're ALLOWED to Do

Authentication proves identity. Authorization checks permissions. Both required — in that order.

public Response submitFeedback(Request req) {
    // STEP 1: AUTHENTICATION — "Who are you?"
    // Verify the token cryptographically (we'll see HOW shortly)
    Identity caller = authService.verifyToken(req.header("Authorization"));
    // Now we KNOW who this is — not just who they CLAIM to be

    // STEP 2: AUTHORIZATION — "What are you allowed to do?"
    if (!enrollmentService.isRepoEnrolled(caller.repository())) {
        throw new ForbiddenException("Repository not enrolled"); // 403
    }
    if (assignment.getDeadline().isBefore(Instant.now())) {
        throw new ForbiddenException("Deadline has passed"); // 403
    }

    // Both checks passed — process the request
    return gradingService.storeFeedback(req.body());
}

HTTP 401 "Unauthorized" = authentication failure (poorly named — should be "Unauthenticated"). HTTP 403 "Forbidden" = authorization failure (I know you, but you can't do this).

Proving Identity Over a Network Is Harder Than It Sounds

In a monolith, method calls within the same process have implicit trust. Over a network? Anyone can send bytes claiming to be anyone.

Monolith: Implicit Trust

// Within the same process
grader.submitFeedback(studentId, score);
// We KNOW this is our grader — it's our code

The call happens in memory. No one can intercept or forge it.

Distributed: Anyone Can Lie

POST /api/feedback HTTP/1.1
Authorization: "trust me, I'm the grader"
{"studentId": "alice", "score": 100}

How do you know this request is really from the Grading Action — and not a student faking it from their laptop?

We need a way to prove identity that: (1) Can't be forged by attackers, (2) Doesn't require sharing secrets over the network, (3) Can be verified without calling back to the identity provider.

Asymmetric Crypto: Prove Identity Without Sharing Secrets

Asymmetric cryptography flow: GitHub has private key (secret) and public key (published). GitHub signs claims with private key, creating a token. Pawtograder API fetches public key and verifies signature locally. Three benefits: can't be forged, no shared secrets, self-verifying.
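The primitive in that diagram can be demonstrated with the JDK's `java.security` API. A minimal sketch: the claim string is illustrative, and real OIDC tokens wrap this signing step in the JWT format, but the underlying math is the same.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;

// Minimal demo of asymmetric signing: sign with the private key,
// verify with only the public key. No secrets are ever shared.
public class SignDemo {
    static KeyPair newKeyPair() {
        try {
            KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
            gen.initialize(2048);
            return gen.generateKeyPair(); // Private key stays secret; public key is published
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    static byte[] sign(PrivateKey key, byte[] message) {
        try {
            Signature signer = Signature.getInstance("SHA256withRSA");
            signer.initSign(key);    // Only the private-key holder can produce this
            signer.update(message);
            return signer.sign();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    static boolean verify(PublicKey key, byte[] message, byte[] signature) {
        try {
            Signature verifier = Signature.getInstance("SHA256withRSA");
            verifier.initVerify(key); // Anyone with the public key can check
            verifier.update(message);
            return verifier.verify(signature); // Local math — no call back to the signer
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        KeyPair keys = newKeyPair();
        byte[] claim = "repository=cs3100/hw1-alice".getBytes(StandardCharsets.UTF_8);
        byte[] sig = sign(keys.getPrivate(), claim);

        System.out.println(verify(keys.getPublic(), claim, sig)); // true: genuine claim
        byte[] forged = "repository=attacker/fake".getBytes(StandardCharsets.UTF_8);
        System.out.println(verify(keys.getPublic(), forged, sig)); // false: tampered claim
    }
}
```

Tampering with the claim, or signing with a different private key, makes verification fail; that is the property the Pawtograder API relies on when it checks GitHub's OIDC tokens.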

HTTPS Solves Eavesdropping but Creates a New Trust Question

HTTPS certificate chain: Certificate Authority signs github.com's certificate. Your browser verifies using CA's public key from its trusted list. Trust flows from OS → CAs → domains → OIDC keys.

⚠️ The catch: Employers can add their own CA to devices they control. NEU-owned laptops have Northeastern's root CA installed (cannot remove) — they can intercept ALL your HTTPS traffic. If that CA is compromised, attackers can forge certificates for any domain. Use personal devices for anything sensitive.

Self-Verifying Tokens: The Full Verification Flow

The API verifies the Grading Action's identity without contacting GitHub. The signature is self-verifying — GitHub's public key is all that's needed.

// This class caches GitHub's public keys — fetched once, reused for all requests
public class OidcVerifier {
    private final JwkProvider jwkProvider = new UrlJwkProvider(
        "https://token.actions.githubusercontent.com/.well-known/jwks"); // Cached

    public VerifiedIdentity verify(String oidcToken) throws JWTVerificationException {
        // Step 1: Decode WITHOUT verifying (to get the key ID)
        DecodedJWT unverified = JWT.decode(oidcToken);

        // Step 2: Fetch the public key (from cache after first request)
        Jwk jwk = jwkProvider.get(unverified.getKeyId());
        RSAPublicKey publicKey = (RSAPublicKey) jwk.getPublicKey();

        // Step 3: Verify signature — LOCAL MATH, no network call
        DecodedJWT verified = JWT.require(Algorithm.RSA256(publicKey, null))
            .withIssuer("https://token.actions.githubusercontent.com")
            .build()
            .verify(oidcToken); // Throws if signature invalid

        // Step 4: Extract claims — guaranteed by GitHub's signature
        return new VerifiedIdentity(
            verified.getClaim("repository").asString(),   // "cs3100-sp26/hw1-alice"
            verified.getClaim("workflow_ref").asString(), // Detects modified workflows
            verified.getClaim("run_id").asString());
    }
}

Verification is local math (microseconds), not a network call (milliseconds). This is why OIDC scales — and why the API can handle thousands of grading requests without bottlenecking on GitHub.

The Alternative: Network-Verified Tokens (Why Not?)

What if tokens weren't self-verifying? Every verification would require a network call back to the identity provider.

Network-verified (the alternative)

public Identity verify(String token) {
    // EVERY verification = network call to GitHub
    HttpResponse<String> resp = client.send(
        HttpRequest.newBuilder()
            .uri(URI.create("https://api.github.com/verify"))
            .header("Authorization", "Bearer " + token)
            .build(),
        BodyHandlers.ofString());

    // GitHub checks its database: "Is this token valid?"
    // Returns the identity if valid, 401 if not
    return parseIdentity(resp.body());
}

⚠️ Every grading request → network call to GitHub
⚠️ GitHub down = all grading stops
⚠️ GitHub rate limits = grading throttled
⚠️ Latency: +50-200ms per verification

Self-verifying (OIDC/JWT)

public Identity verify(String token) {
    // Public key fetched ONCE, cached forever
    // Verification is LOCAL MATH (can cache issuer keys)
    DecodedJWT verified = JWT.require(
            Algorithm.RSA256(cachedPublicKey, null))
        .withIssuer("https://token.actions.githubusercontent.com")
        .build()
        .verify(token);

    return new Identity(
        verified.getClaim("repository").asString());
}

✅ No network call per request
✅ GitHub down? Verification still works
✅ No rate limit concerns
✅ Latency: ~0.1ms (pure CPU)

Pawtograder's security depends on a chain of trust. If any link breaks, the whole system is compromised.

What We Trust | What Could Go Wrong | Mitigation
Certificate Authorities (DigiCert, Let's Encrypt) | CA compromise → fake certificates for any domain | Certificate Transparency logs, browser vendor audits
GitHub's Infrastructure | GitHub's private key stolen → forged OIDC tokens | GitHub's security team, HSMs, incident response
GitHub's OIDC Claims | GitHub lies about which repo is running | Trust GitHub's incentives (reputation, contracts)
HTTPS on github.com/.well-known | DNS hijack + rogue certificate → fake public keys | DNSSEC, certificate pinning (advanced)
The Grading Action Code | Student forks action, modifies to report fake grades | Workflow validation, tarball only to authorized workflows
The Student's Repository | Student manipulates code to pass tests dishonestly | Code review, plagiarism detection, test design
NEU's Network (on NEU devices) | Palo Alto firewall blocks legitimate traffic arbitrarily | Use personal devices; complain loudly
NEU's CA (on NEU devices) | NEU's CA compromised → attacker forges any certificate | Use personal devices for sensitive data

Security isn't about eliminating trust — it's about understanding what you trust and why. Every trust decision is an attack surface.

Bringing It Together: Monolith → Distributed

Concept | Monolith (L19) | Distributed (L20)
Communication | Method calls (nanoseconds) | Network calls (milliseconds+)
Failure modes | Process crash → everything fails | Partial failures, network partitions
Consistency | DB transactions span all operations | Eventual consistency, no cross-service transactions (L33)
Debugging | Single stack trace, single log file | Distributed tracing, multiple log streams
Deployment | All-or-nothing | Independent service deploys
Scaling | Scale everything together | Scale services independently
Security | Internal method calls (implicit trust) | Every call across network (explicit auth)

The monolith-first principle still holds. Don't distribute until you need to. When you do need to — for isolation, scaling, team autonomy, or platform leverage — you now understand what you're signing up for.

Conway's Law: Even More True for Distributed Systems

In L19, we introduced Conway's Law:

"Organizations which design systems are constrained to produce designs which are copies of the communication structures of those organizations."


— Melvin Conway, 1967

For distributed systems, this cuts both ways:

If your teams are siloed → your services will be siloed

Pawtograder: separate teams for the Grading Action, API, and Discord bot → separate services with clean interfaces.

If teams must coordinate constantly → services will be tightly coupled

Distributed monolith! Teams that can't deploy independently → services that can't deploy independently.

We'll explore this further in L22 (Teams and Collaboration).

What's Next: Serverless

In L21, we'll explore Serverless Architecture — an architectural style that pushes many of today's concerns to the platform level.

What serverless does:

  • Infrastructure management → platform's job
  • Scaling, availability → platform's job
  • Some security concerns → platform's job
  • You focus on business logic

The tradeoffs:

  • You STILL deal with all eight fallacies
  • You gain elasticity and scale-to-zero
  • You lose control over runtime environment
  • New constraints: cold starts, execution time limits

Pawtograder uses serverless extensively — Supabase Edge Functions, GitHub Actions.

Serverless is the natural continuation of today's lesson: it's distributed computing with operational complexity offloaded to the cloud provider.