A pixel art split-scene: LEFT shows a developer handing a note directly to a colleague (instant, 0.0001 sec). RIGHT shows an epic chaotic postcard journey through submarine cable cuts, latency, packet loss, eavesdroppers, a Palo Alto Networks firewall arbitrarily blocking traffic (labeled 'NEU contracted us'), BGP roulette wheel, and toll plazas — with multiple retry postcards. Tagline: When Code Leaves the Building, Everything Changes.

CS 3100: Program Design and Implementation II

Lecture 20: Distributed Architecture — Networks, Microservices, and Security

©2026 Jonathan Bell, CC-BY-SA

Announcements

Midterm Survey via Qualtrics, +1 participation credit if you complete it

  • Due Friday 2/27 @ 11:59 PM
  • Anonymous link in your email
  • 150 students have completed it so far

Team Formation Survey Released!

  • Starting Week 10: teams of 4 for CookYourBooks GUI project
  • Tell us your preferences + availability
  • Due Friday 2/27 @ 11:59 PM
  • Complete the Survey →

HW4 Due Thursday Night

Learning Objectives

After this lecture, you will be able to:

  1. Explain why network communication fundamentally changes architectural tradeoffs compared to in-process method calls
  2. Identify and explain the Fallacies of Distributed Computing and how they affect system design
  3. Describe the client-server architecture and REST API conventions used for service communication
  4. Analyze the benefits and costs of microservices compared to monolithic architectures
  5. Apply security principles (authentication, authorization, trust boundaries, CIA triad) to distributed system analysis

Important framing: Junior engineers read API documentation and debug network issues far more often than they design new distributed architectures. Comprehension comes first — you'll understand distributed systems well enough to work within them confidently.

Recap: The Network Changes Everything

In L19, we ended with a teaser: method calls in a monolith are instant, reliable, and traceable. Over a network? Everything changes.

Monolith (Bottlenose)

submission.computeGrade();
// ✅ Executes in nanoseconds
// ✅ Always succeeds or throws
// ✅ Full stack trace on error
// ✅ Wrapped in a DB transaction

Distributed (Pawtograder)

feedbackApi.submit(submissionId, feedback);
// ⚠️ Might take ms... or seconds... or ∞
// ⚠️ Server might be down or overloaded
// ⚠️ Request succeeds, response lost
// ⚠️ Retry = accidentally grade twice?
// ⚠️ No cross-system transactions

Today we'll explore why networks are hard, what patterns handle these challenges, and how security changes when code leaves the building.

The Price of Distribution: Hard Physical Limits

Some network constraints aren't design choices or wrong assumptions — they're physics. No software engineering can fix them. This is what you're signing up for when you leave the monolith.

The Speed of Light: ~1 foot per nanosecond

Light in a vacuum travels roughly 1 foot per nanosecond; signals in fiber and copper propagate at roughly two-thirds of that speed. The table below uses the 1 ns/ft figure as an absolute floor.

Distance | Min. round-trip
Across a server rack (~10 ft) | ~20 ns
Across a data center (~1,000 ft) | ~2,000 ns
Boston → New York (~200 mi) | ~2,100,000 ns (~2.1 ms)
Boston → London (~3,300 mi) | ~35,000,000 ns (~35 ms)
Boston → Tokyo (~6,700 mi) | ~71,000,000 ns (~71 ms)

(Actual latency is always higher — routing, switching, queuing add more.)
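The floor values in the table are simple arithmetic: round-trip feet times ~1 ns/ft. A back-of-envelope sketch (the class and method names are ours, for illustration only):

```java
// Back-of-envelope check of the latency table. Names are illustrative.
public class LatencyFloor {
    // Theoretical minimum round trip in milliseconds, at ~1 ns per foot
    static double minRoundTripMs(double oneWayMiles) {
        double oneWayFeet = oneWayMiles * 5280;
        double roundTripNs = 2 * oneWayFeet; // ~1 ns/ft, doubled for the return leg
        return roundTripNs / 1_000_000;      // ns -> ms
    }

    public static void main(String[] args) {
        System.out.println(minRoundTripMs(200));  // Boston -> New York: ~2.1 ms
        System.out.println(minRoundTripMs(3300)); // Boston -> London: ~35 ms
    }
}
```

No amount of engineering lowers this floor; everything real (routing, switching, queuing) only adds to it.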

Compare: Memory access in a monolith

Operation | Latency
L1 CPU cache hit | ~0.5 ns
L2 CPU cache hit | ~3 ns
L3 CPU cache hit | ~10 ns
RAM access | ~100 ns

A Boston-London round-trip takes ~70,000,000× longer than an L1 cache hit.

A method call in a monolith? Essentially free by comparison.

Shared Fate vs. Independent Failure

In a monolith, all components share the same process — they cannot be partially disconnected by accident. Distributed systems introduce a new failure mode that simply doesn't exist in a monolith.

Monolith: Shared fate

If the process is running, all components can talk to each other. Always.

No one can trip over a cable and disconnect the grading logic from the database — they're in the same memory space.

A server crash takes everything down together, but there's no such thing as a partial disconnection between components.

Distributed system: Independent failure

Things that can disconnect two services that would never affect a monolith:

  • A cloud provider's switch fails in one availability zone
  • A university contracts with a network filter that decides your service is malware
  • A student's ISP throttles GitHub traffic during grading
  • A ship anchor cuts a submarine fiber cable in the Pacific

A "network partition" — two parts of a system that can't reach each other — is impossible in a monolith and inevitable in a distributed system. You must design for it.

Client-Server Architecture

Clients make requests, servers respond. The most ubiquitous pattern — every web app, mobile app, and service-to-service call.

// Client code (Grading Action) — what actually runs
HttpClient client = HttpClient.newHttpClient();
HttpRequest request = HttpRequest.newBuilder()
    .uri(URI.create(API_URL + "/submitFeedback"))
    .header("Authorization", "Bearer " + authToken)
    .header("Content-Type", "application/json")
    .POST(BodyPublishers.ofString(feedbackJson))
    .build();

HttpResponse<String> response = client.send(
    request, BodyHandlers.ofString());

Benefits — Centralized control/state, update server → all clients benefit, enforce security policies, multiple clients connect simultaneously

Constraints — Server = single point of failure, network latency on every op, must handle errors/timeouts/retries, client-initiated only

The Network Itself Is a Layered Architecture

Client-server doesn't require HTTP — it's just one option. The OSI model shows how network communication is organized into layers, each solving one problem. Sound familiar from L19?

The OSI 7-layer model with HTTP highlighted at Layer 7 (Application). Shows how a network request travels down through all layers on the client, across the internet, and back up on the server. Callout notes this is for context only, not on exam. Footer connects to Layered Architecture from L19.

How Services Communicate: HTTP and REST

HTTP is the foundation. REST (Representational State Transfer) is a set of conventions built on HTTP for structuring APIs. An HTTP request has: a method (verb), a URL (resource), and optionally a body (data).

// GET — retrieve a resource (read-only, no body)
HttpRequest get = HttpRequest.newBuilder()
    .uri(URI.create(BASE_URL + "/submissions?student_id=" + studentId))
    .GET().build();

// POST — create a resource or trigger an action
HttpRequest post = HttpRequest.newBuilder()
    .uri(URI.create(BASE_URL + "/functions/v1/createSubmission"))
    .header("Content-Type", "application/json")
    .POST(BodyPublishers.ofString("{\"repo\": \"cs3100/hw1-alice\"}")).build();

// PATCH — partial update (only fields you're changing)
HttpRequest patch = HttpRequest.newBuilder()
    .uri(URI.create(BASE_URL + "/submissions/123"))
    .method("PATCH", BodyPublishers.ofString("{\"score\": 87}")).build();

// DELETE — remove a resource
HttpRequest delete = HttpRequest.newBuilder()
    .uri(URI.create(BASE_URL + "/submissions/123"))
    .DELETE().build();

REST organizing principle: Organize around nouns (submissions, assignments, students) and manipulate them with standard verbs. Once you know the pattern, every RESTful API works the same way.

REST didn't come from a committee. Roy Fielding co-authored the HTTP/1.0 spec, then asked: why does this work so well? His 2000 PhD thesis reverse-engineered HTTP into a set of architectural constraints. REST is the name he gave that style.

REST's Six Architectural Constraints (Fielding, 2000)

  • Client-Server — separate UI concerns from data storage
  • Stateless — each request contains all info needed; no session state on the server
  • Cacheable — responses must declare whether they can be cached
  • Uniform Interface — standard verbs + resource URLs + self-describing messages
  • Layered System — clients don't know if they're talking to the real server or a proxy
  • Code on Demand (optional) — servers may extend clients by sending executable code (e.g., JavaScript)

Connection to L18: Drivers → Style

Each constraint maps to a quality attribute driver:

Constraint | Quality Attribute
Stateless | Scalability — any server handles any request
Cacheable | Performance — skip redundant fetches
Uniform Interface | Changeability — swap server implementations
Layered System | Security + Deployability — add proxies/CDNs transparently

The same process we used for Pawtograder — identify quality attributes, apply heuristics, let the style emerge — is how REST was discovered.

REST is not a standard, a spec, or a protocol. It's an architectural style — a set of constraints that, when applied, produce a system with desirable properties. Fielding studied what made HTTP work, then named the pattern.

REST Status Codes: The Language of Failure

One of REST's great gifts: standardized error codes. The status code tells you where to look when debugging.

Code | Meaning | Action
2xx | Success | Process response
200 OK | Request succeeded |
201 Created | Resource created |
4xx | Client Error | Fix your code
401 Unauthorized | Not authenticated | Refresh token
403 Forbidden | Not allowed | Check permissions
429 Too Many Requests | Rate limited | Back off
5xx | Server Error | Retry with backoff
503 Service Unavailable | Server overloaded | Wait and retry

HttpResponse<String> response = client.send(
    request, BodyHandlers.ofString());

switch (response.statusCode() / 100) {
    case 2 -> processSuccess(response.body());
    case 4 -> {
        if (response.statusCode() == 401) {
            refreshToken(); // Auth expired
        } else if (response.statusCode() == 429) {
            sleepUntil(response.headers()
                .firstValue("Retry-After"));
        } else {
            throw new ClientError(response);
        }
    }
    case 5 -> throw new RetryableException(response);
}

Key insight: 4xx = your code is wrong (don't retry). 5xx = server is struggling (retry with backoff). 401 = who are you? 403 = I know you, but no.

The Fallacies of Distributed Computing

Eight warning signs arranged in a grid, each crossing out a wrong assumption about distributed systems: The network is reliable, Latency is zero, Bandwidth is infinite, The network is secure, Topology doesn't change, There is one administrator, Transport cost is zero, The network is homogeneous. Clean flat illustration style with warning colors.

Peter Deutsch and colleagues at Sun Microsystems identified eight assumptions developers make about networks — all of which are false. These are the Fallacies of Distributed Computing.

Fallacy 1: "The Network Is Reliable"

Networks fail. Cables get unplugged, routers crash, cloud providers have outages. Code that assumes a network call will always succeed is fragile code.

The fragile version

// Assumes the network always works
HttpResponse<String> response = client.send(request, BodyHandlers.ofString());
processResponse(response);
// If request fails → student never sees grade

Better: timeout and retry

HttpResponse<String> response = null;
int attempts = 0;
while (response == null && attempts < 3) {
    try {
        // Timeout belongs on the request itself:
        // HttpRequest.newBuilder().timeout(Duration.ofSeconds(10))...
        response = client.send(request, BodyHandlers.ofString());
    } catch (IOException e) { // includes HttpTimeoutException
        // Exponential backoff: 1s, 2s, 4s
        Thread.sleep((long) Math.pow(2, attempts) * 1000);
        attempts++;
    }
}
if (response == null) {
    logError("Failed after 3 attempts");
    // Show "grading in progress" not a crash
}

Pawtograder: The Grading Action tries to submit feedback. The request times out. Retry logic with exponential backoff — wait 1s, then 2s, then 4s. Either succeeds, or the student sees "grading in progress."

Pattern: Timeout + Retry with Exponential Backoff

Addresses Fallacy 1 (unreliable). Never wait forever — set a deadline, then try again, backing off between attempts.

private static final Random random = new Random();

public HttpResponse<String> sendWithRetry(HttpRequest request) throws Exception {
    int maxAttempts = 3;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            // ALWAYS set a timeout on the request (HttpRequest.Builder.timeout) —
            // without one, a hung server blocks forever
            return client.send(request, BodyHandlers.ofString());
        } catch (IOException e) { // includes HttpTimeoutException
            if (attempt == maxAttempts) throw e;
            // Exponential backoff: 2s, 4s, 8s + jitter to prevent thundering herd
            long backoffMs = (long) Math.pow(2, attempt) * 1000;
            long jitter = random.nextLong(500); // Random 0-500ms
            Thread.sleep(backoffMs + jitter);
        }
    }
    throw new RuntimeException("unreachable");
}

Fixed 1s retry — the thundering herd

100 clients all fail at t=0 (API restart). All retry at t=1s. Server gets slammed again. All fail. All retry at t=2s… The server never gets a chance to recover.

Exponential backoff + jitter — spreading the load

Load arrives in waves the server can absorb.

Only retry 5xx and timeouts. A 400 Bad Request won't fix itself. A 401 Unauthorized won't either. Retrying a 409 Conflict might make things worse.
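That rule fits in a small predicate. A sketch (the class and method names are ours; 429 is treated as retryable per the status-code table earlier):

```java
// Sketch of the retry rule: retry server errors and rate limits, never other client errors.
public class RetryPolicy {
    static boolean isRetryable(int statusCode) {
        if (statusCode / 100 == 5) return true; // 5xx: server struggling, back off and retry
        if (statusCode == 429) return true;     // Rate limited: back off, then retry
        return false;                           // Other 4xx (400, 401, 409, ...): retrying won't fix the request
    }
}
```

Timeouts (no status code at all) are the other retryable case, handled in the catch block of `sendWithRetry` above.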

Pattern: Idempotency — Making Retries Safe

Over a network, "did it run?" is ambiguous — the request may have arrived but the response was lost. Design operations so retrying is safe.

// CLIENT: attach a stable unique key — same key = same operation, don't repeat it
public void submitFeedback(String submissionId, Feedback feedback) {
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create(API_URL + "/functions/v1/submitFeedback"))
        .header("Idempotency-Key", submissionId) // Stable, unique per grading run
        .POST(HttpRequest.BodyPublishers.ofString(gson.toJson(feedback)))
        .build();
    sendWithRetry(request); // Now safe to call multiple times!
}

// SERVER: check the key before doing any work
public Response submitFeedback(Request req) {
    String key = req.header("Idempotency-Key");

    Optional<Response> cached = db.findByIdempotencyKey(key);
    if (cached.isPresent()) {
        return cached.get(); // Already ran — return same result, don't re-grade
    }

    Feedback feedback = gson.fromJson(req.body(), Feedback.class);
    Response result = gradingService.store(feedback);
    db.storeIdempotencyResult(key, result); // Cache so future retries are safe
    return result;
}

HTTP verb idempotency: GET always idempotent (read-only). DELETE idempotent (deleting twice = 404, no harm). PUT idempotent (replace whole resource). POST NOT idempotent by default — that's why you need the Idempotency-Key header.

Fallacy 2: "Latency Is Zero" — Chatty vs Chunky APIs

Every network call takes time. Local method calls: nanoseconds. Network calls: milliseconds to seconds. Minimize round-trips.

Chatty API — 100 network round-trips

// BAD: One API call per test result
for (TestResult result : testResults) {
    api.submitSingleResult(submissionId, result);
    // 100ms latency × 100 tests = 10 SECONDS
}

Chunky API — 1 network round-trip

// GOOD: All results in one payload
FeedbackBatch batch = FeedbackBatch.builder()
    .submissionId(submissionId)
    .results(testResults) // All 100 results
    .build();
api.submitFeedback(batch);
// 100ms latency × 1 call = 100ms

Fallacy 3: Bandwidth is infinite — hash content (e.g., SHA-256) and skip downloads when nothing changed (grader tarball caching)
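The tarball-caching idea can be sketched as follows. This is an illustration under assumed names: `fetchIfChanged`, `downloadTarball`, and the advertised-hash protocol are stand-ins, not Pawtograder's actual API.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Sketch of content-hash caching: hash the cached grader tarball, compare against
// the hash the server advertises, and only download when they differ.
public class TarballCache {
    static String sha256Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
            return HexFormat.of().formatHex(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is available on every JVM
        }
    }

    // advertisedHash: hash the server publishes for the current tarball (hypothetical)
    static byte[] fetchIfChanged(String advertisedHash, byte[] cachedTarball) {
        if (cachedTarball != null && sha256Hex(cachedTarball).equals(advertisedHash)) {
            return cachedTarball; // Unchanged: reuse the cache, spend zero bandwidth
        }
        return downloadTarball(); // Changed or missing: pay the download cost once
    }

    static byte[] downloadTarball() { /* real network fetch elided */ return new byte[0]; }
}
```

The fallacy-3 payoff: a few dozen bytes of hash comparison replaces a multi-megabyte download on every grading run where the grader hasn't changed.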

Fallacy 5: Topology changes — Never hardcode URLs; use config. Servers move, DNS updates.

Fallacy 8: Network is homogeneous — Same code, different behavior on different networks.

Fallacy 4: "The Network Is Secure" (and Fallacies 6–7)

Fallacy 4: The network is secure

Data crossing networks can be intercepted, modified, or spoofed. Every network boundary is a potential attack surface.

Pawtograder: Without an authentication token, anyone could POST fake grades. Without HTTPS, a network observer could read or modify grades in transit.

(We'll dive deep on security later in this lecture.)

Fallacy 6: There is one administrator

Different parts of distributed systems are controlled by different organizations. You can't control what they do.

Real NEU example: Northeastern contracts with Palo Alto Networks to filter all campus traffic. When Palo Alto arbitrarily decided Pawtograder's dev environment was malware, learning was disrupted. NEU claimed no responsibility. This happens all the time.

Fallacy 7: Transport cost is zero

Network calls have real costs: computational (serialization, encryption), monetary (API pricing, bandwidth fees), and energy (radio transmission, data center processing).

Pawtograder: Batching 100 test results into one submitFeedback() call instead of 100 calls doesn't just save latency — it saves energy. 6,000 grading runs/semester × 100 extra API calls = measurable environmental impact.

Pattern: Circuit Breaker — Stop Hammering a Struggling Service

If a service is struggling, hammering it with retries makes it worse. The circuit breaker detects sustained failure and stops trying — giving the service time to recover.

// Three states: CLOSED (normal) → OPEN (failing fast) → HALF_OPEN (testing recovery)
public class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int failureCount = 0;
    private Instant openedAt;

    public Response call(Supplier<Response> request) {
        if (state == State.OPEN) {
            if (Duration.between(openedAt, Instant.now()).toSeconds() < 30) {
                // Fail immediately — fast failure > slow failure
                throw new CircuitOpenException("Service unavailable, try later");
            }
            state = State.HALF_OPEN; // After 30s, allow one probe request through
        }

        try {
            Response response = request.get();
            reset(); // Success: back to CLOSED
            return response;
        } catch (Exception e) {
            failureCount++;
            if (failureCount >= 5) {
                state = State.OPEN; // 5 consecutive failures → trip the circuit
                openedAt = Instant.now();
            }
            throw e;
        }
    }

    private void reset() { state = State.CLOSED; failureCount = 0; }
}

Pattern: Graceful Degradation — Reduced Functionality Beats Crashing

When a service is unavailable, offer reduced functionality rather than crashing. Stale data beats an error screen.

// When grading service is unavailable, give the student useful information
public SubmissionResponse handleSubmission(String submissionId) {
    // Step 1: Record that we received the code (this succeeds even if grading is down)
    db.markSubmissionReceived(submissionId);

    try {
        return gradingApi.triggerGrading(submissionId);
    } catch (CircuitOpenException | ServiceUnavailableException e) {
        // Grading is down — but the student's code IS safe
        return SubmissionResponse.builder()
            .submissionId(submissionId)
            .status("RECEIVED_PENDING_GRADING")
            .message("Your code has been received. Grading is temporarily unavailable, " +
                "but will run automatically once the system recovers. " +
                "You'll receive an email when your results are ready.")
            .build();
    }
}

Design your degraded state intentionally. Tell users what succeeded, what's delayed, and what to expect. A helpful message beats a stack trace. For every service call: if this fails, what should the user experience?

Microservices Architecture: Now With Context

We introduced microservices in L19. Now we understand the cost. Let's look at why teams still pay it.

A microservices architecture decomposes a system into small, independently deployable services, each owning a specific business capability and its own data.

Benefits you pay for

  • Independent scaling: Scale the grading service without scaling the API
  • Elastic scaling: Spin up 500 grading runners at deadline time, scale to zero at 3 AM
  • Isolated failures: Discord bot bug can't crash grading
  • Team autonomy: Grading Action team and API team evolve independently
  • Technology flexibility: Different runtimes for different constraints (GitHub Actions vs Deno vs PostgreSQL)

Costs you definitely pay

  • All eight fallacies apply — every call is a network call
  • Operational overhead: Many builds, many deploys, many log streams
  • Data consistency: No transactions across services — eventual consistency only (more on consistency models in L33)
  • Testing complexity: Integration tests must spin up multiple services
  • Energy overhead: Every inter-service call costs orders of magnitude more than a method call

The Distributed Monolith: All Costs, No Benefits

Left side shows three independent microservices, each with their own database, connected by clean HTTP arrows. Right side shows the Distributed Monolith anti-pattern: three services tangled together with dozens of crossing dependencies and a shared database, labeled 'We can't deploy A without also deploying B and C.' A balance scale at bottom shows all microservices costs but none of the benefits.

The distributed monolith anti-pattern: services that are deployed separately but so tightly coupled they must be changed and deployed together. You pay all eight fallacies' costs — but get none of the benefits (independent scaling, isolated failures, team autonomy).

Quality Attributes: Distribution Creates Challenges AND Opportunities

Several quality attributes become critical when components communicate over networks. Distribution makes these harder to achieve — but also enables solutions impossible in a monolith.

Performance

How fast is the system?

Challenge: Network adds latency to every call

Opportunity: Parallelize work across machines; cache at edge locations near users

Reliability

Does it stay up when things fail?

Challenge: More components = more failure points

Opportunity: Redundancy across machines/regions; no single point of failure

Scalability

Can it handle more load?

Challenge: Coordination overhead; distributed state complexity

Opportunity: Add machines on demand; scale individual bottlenecks independently

The goal isn't to avoid distribution — it's to distribute strategically where the benefits outweigh the costs. Let's make sure you know the vocabulary.

Scaling and Reliability: The Vocabulary You Need

Distribution isn't just a tax — it's also the key to solving problems monoliths can't. Here's the vocabulary you'll encounter in system design docs and interviews.

Availability — The "Nines" Game

Percentage of time the system is operational

Availability | Downtime/year
99% ("two nines") | 3.65 days
99.9% ("three nines") | 8.76 hours
99.99% ("four nines") | 52.6 minutes
99.999% ("five nines") | 5.26 minutes

Each additional nine is exponentially harder and more expensive. High availability is only achievable through redundancy — impossible in a monolith.
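The arithmetic behind the table, and why redundancy is the lever: downtime is (1 - availability) of a year, and if replicas fail independently, n copies at availability a give combined availability 1 - (1 - a)^n. A sketch (names are ours; the independence assumption is idealized, since real failures are often correlated):

```java
// The arithmetic behind the "nines" table, and why redundancy buys nines.
// Idealized model: assumes replicas fail independently.
public class Nines {
    static double downtimeHoursPerYear(double availability) {
        return (1 - availability) * 365 * 24;
    }

    // System is down only if ALL n independent replicas are down at once
    static double combined(double singleAvailability, int n) {
        return 1 - Math.pow(1 - singleAvailability, n);
    }

    public static void main(String[] args) {
        System.out.println(downtimeHoursPerYear(0.999)); // three nines: ~8.76 hours/year
        System.out.println(combined(0.99, 2));           // two 99% replicas together: 99.99%
    }
}
```

Two unremarkable 99% servers behind a load balancer reach four nines together, which is exactly the redundancy a single-process monolith cannot have.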

Vertical vs. Horizontal Scaling

Vertical ("Scale Up"): Bigger machine — more CPU, RAM, disk.

  • ✓ Simple, no code changes
  • ✗ Hardware limits, expensive, single point of failure

Horizontal ("Scale Out"): More machines + load balancer.

  • ✓ Near-infinite scaling, redundancy built-in, elastic
  • ✗ Requires stateless design, all 8 fallacies apply

Pawtograder: GitHub Actions scales horizontally — 500 grading jobs run in parallel on 500 separate runners.

Strategies we've already covered: Caching and batching (Fallacies 2-3), retry + circuit breaker (Fallacy 1), redundancy + failover (reliability). These aren't separate topics — they're all responses to distribution's challenges.

The Key to Horizontal Scaling: Statelessness

Horizontal scaling only works if any server can handle any request. This requires stateless services — each request contains all information needed to process it.

Stateless: Any server works

POST /submitFeedback
{
  "submission_id": "abc123",
  "results": [...],
  "auth_token": "..."  // Identity in request
}
// Load balancer routes to ANY available server

The Grading Action is completely stateless — each run is independent. GitHub can route jobs to any available runner.

Stateful: Locked to one server

POST /submitFeedback
{
  "results": [...]
}
// Server remembers session from earlier request
// Only THIS server can handle this request
// "Sticky sessions" → can't scale freely

If the server dies, the session state is lost. If load spikes, you can't add servers without breaking sessions.

Externalize state: Shared state lives in the database, not in individual servers.

Network Traffic Is Readable and Forgeable by Default

Without security, distributed systems are trivially exploitable. Here's what an attacker on the same network can do:

Eavesdrop

Read every HTTP request. See passwords, tokens, grades, personal data. A student on coffee shop WiFi submits homework — attacker reads their GitHub token.

Modify

Change requests in flight. Alter grades, redirect payments, inject code. Grading Action reports 85% → attacker changes to 100% before API receives it.

Impersonate

Send requests pretending to be someone else. Attacker sends: "I'm student123's grading action, here are perfect scores." How does the API know it's lying?

The network is a hostile environment. Security isn't paranoia — it's engineering for reality.

Every Security Decision Trades Off Confidentiality, Integrity, and Availability

The CIA Security Triad: an equilateral triangle with Confidentiality (padlock), Integrity (shield checkmark), and Availability (clock) at each vertex. Threat labels point inward: Data Breach threatens Confidentiality, Tampering threatens Integrity, Denial of Service threatens Availability. Center: CIA TRIAD.

Draw Lines Where Trust Ends — Validate Everything That Crosses

Trust boundary illustration: Left side shows chaotic 'untrusted zone' like an airport terminal with suspicious figures (fork attacks, Claude leaking scripts). Center shows security checkpoint verifying OIDC signatures. Right side shows calm 'trusted zone' with API processing verified data. Bottom right shows audit/accountability: even if a Trojan horse gets through, the student's verified identity means they can be held accountable.

The API must not trust the action to report: its own repository name, accurate test results, or the submission time. All derived from cryptographically verified sources or computed server-side.

Knowing WHO You Are Is Different from What You're ALLOWED to Do

Authentication proves identity. Authorization checks permissions. Both required — in that order.

public Response submitFeedback(Request req) {
    // STEP 1: AUTHENTICATION — "Who are you?"
    // Verify the token cryptographically (we'll see HOW shortly)
    Identity caller = authService.verifyToken(req.header("Authorization"));
    // Now we KNOW who this is — not just who they CLAIM to be

    // STEP 2: AUTHORIZATION — "What are you allowed to do?"
    if (!enrollmentService.isRepoEnrolled(caller.repository())) {
        throw new ForbiddenException("Repository not enrolled"); // 403
    }
    if (assignment.getDeadline().isBefore(Instant.now())) {
        throw new ForbiddenException("Deadline has passed"); // 403
    }

    // Both checks passed — process the request
    return gradingService.storeFeedback(req.body());
}

HTTP 401 "Unauthorized" = authentication failure (poorly named — should be "Unauthenticated"). HTTP 403 "Forbidden" = authorization failure (I know you, but you can't do this).

Proving Identity Over a Network Is Harder Than It Sounds

In a monolith, method calls within the same process have implicit trust. Over a network? Anyone can send bytes claiming to be anyone.

Monolith: Implicit Trust

// Within the same process
grader.submitFeedback(studentId, score);
// We KNOW this is our grader — it's our code

The call happens in memory. No one can intercept or forge it.

Distributed: Anyone Can Lie

POST /api/feedback HTTP/1.1
Authorization: "trust me, I'm the grader"
{"studentId": "alice", "score": 100}

How do you know this request is really from the Grading Action — and not a student faking it from their laptop?

We need a way to prove identity that: (1) Can't be forged by attackers, (2) Doesn't require sharing secrets over the network, (3) Can be verified without calling back to the identity provider.

Asymmetric Crypto: Prove Identity Without Sharing Secrets

Asymmetric cryptography flow: GitHub has private key (secret) and public key (published). GitHub signs claims with private key, creating a token. Pawtograder API fetches public key and verifies signature locally. Three benefits: can't be forged, no shared secrets, self-verifying.
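The primitive in that diagram can be demonstrated with the JDK's `java.security` API. A minimal sketch: the claim string is illustrative, and real OIDC tokens wrap this signing step in the JWT format, but the underlying math is the same.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;

// Minimal demo of asymmetric signing: sign with the private key,
// verify with only the public key. No secrets are ever shared.
public class SignDemo {
    static KeyPair newKeyPair() {
        try {
            KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
            gen.initialize(2048);
            return gen.generateKeyPair(); // Private key stays secret; public key is published
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    static byte[] sign(PrivateKey key, byte[] message) {
        try {
            Signature signer = Signature.getInstance("SHA256withRSA");
            signer.initSign(key);    // Only the private-key holder can produce this
            signer.update(message);
            return signer.sign();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    static boolean verify(PublicKey key, byte[] message, byte[] signature) {
        try {
            Signature verifier = Signature.getInstance("SHA256withRSA");
            verifier.initVerify(key); // Anyone with the public key can check
            verifier.update(message);
            return verifier.verify(signature); // Local math — no call back to the signer
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        KeyPair keys = newKeyPair();
        byte[] claim = "repository=cs3100/hw1-alice".getBytes(StandardCharsets.UTF_8);
        byte[] sig = sign(keys.getPrivate(), claim);

        System.out.println(verify(keys.getPublic(), claim, sig)); // true: genuine claim
        byte[] forged = "repository=attacker/fake".getBytes(StandardCharsets.UTF_8);
        System.out.println(verify(keys.getPublic(), forged, sig)); // false: tampered claim
    }
}
```

Tampering with the claim, or signing with a different private key, makes verification fail; that is the property the Pawtograder API relies on when it checks GitHub's OIDC tokens.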

HTTPS Solves Eavesdropping but Creates a New Trust Question

HTTPS certificate chain: Certificate Authority signs github.com's certificate. Your browser verifies using CA's public key from its trusted list. Trust flows from OS → CAs → domains → OIDC keys.

⚠️ The catch: Employers can add their own CA to devices they control. NEU-owned laptops have Northeastern's root CA installed (cannot remove) — they can intercept ALL your HTTPS traffic. If that CA is compromised, attackers can forge certificates for any domain. Use personal devices for anything sensitive.

Self-Verifying Tokens: The Full Verification Flow

The API verifies the Grading Action's identity without contacting GitHub. The signature is self-verifying — GitHub's public key is all that's needed.

// This class caches GitHub's public keys — fetched once, reused for all requests
public class OidcVerifier {
    private final JwkProvider jwkProvider = new UrlJwkProvider(
        "https://token.actions.githubusercontent.com/.well-known/jwks"); // Cached

    public VerifiedIdentity verify(String oidcToken) throws JWTVerificationException {
        // Step 1: Decode WITHOUT verifying (to get the key ID)
        DecodedJWT unverified = JWT.decode(oidcToken);

        // Step 2: Fetch the public key (from cache after first request)
        Jwk jwk = jwkProvider.get(unverified.getKeyId());
        RSAPublicKey publicKey = (RSAPublicKey) jwk.getPublicKey();

        // Step 3: Verify signature — LOCAL MATH, no network call
        DecodedJWT verified = JWT.require(Algorithm.RSA256(publicKey, null))
            .withIssuer("https://token.actions.githubusercontent.com")
            .build()
            .verify(oidcToken); // Throws if signature invalid

        // Step 4: Extract claims — guaranteed by GitHub's signature
        return new VerifiedIdentity(
            verified.getClaim("repository").asString(),   // "cs3100-sp26/hw1-alice"
            verified.getClaim("workflow_ref").asString(), // Detects modified workflows
            verified.getClaim("run_id").asString());
    }
}

Verification is local math (microseconds), not a network call (milliseconds). This is why OIDC scales — and why the API can handle thousands of grading requests without bottlenecking on GitHub.

The Alternative: Network-Verified Tokens (Why Not?)

What if tokens weren't self-verifying? Every verification would require a network call back to the identity provider.

Network-verified (the alternative)

public Identity verify(String token) {
    // EVERY verification = network call to GitHub
    HttpResponse<String> resp = client.send(
        HttpRequest.newBuilder()
            .uri(URI.create("https://api.github.com/verify"))
            .header("Authorization", "Bearer " + token)
            .build(),
        BodyHandlers.ofString());

    // GitHub checks its database: "Is this token valid?"
    // Returns the identity if valid, 401 if not
    return parseIdentity(resp.body());
}

⚠️ Every grading request → network call to GitHub
⚠️ GitHub down = all grading stops
⚠️ GitHub rate limits = grading throttled
⚠️ Latency: +50-200ms per verification

Self-verifying (OIDC/JWT)

public Identity verify(String token) {
    // Public key fetched ONCE, cached forever
    // Verification is LOCAL MATH (can cache issuer keys)
    DecodedJWT verified = JWT.require(
            Algorithm.RSA256(cachedPublicKey, null))
        .withIssuer("https://token.actions.githubusercontent.com")
        .build()
        .verify(token);

    return new Identity(
        verified.getClaim("repository").asString());
}

✅ No network call per request
✅ GitHub down? Verification still works
✅ No rate limit concerns
✅ Latency: ~0.1ms (pure CPU)

Pawtograder's security depends on a chain of trust. If any link breaks, the whole system is compromised.

What We Trust | What Could Go Wrong | Mitigation
Certificate Authorities (DigiCert, Let's Encrypt) | CA compromise → fake certificates for any domain | Certificate Transparency logs, browser vendor audits
GitHub's Infrastructure | GitHub's private key stolen → forged OIDC tokens | GitHub's security team, HSMs, incident response
GitHub's OIDC Claims | GitHub lies about which repo is running | Trust GitHub's incentives (reputation, contracts)
HTTPS on github.com/.well-known | DNS hijack + rogue certificate → fake public keys | DNSSEC, certificate pinning (advanced)
The Grading Action Code | Student forks action, modifies to report fake grades | Workflow validation, tarball only to authorized workflows
The Student's Repository | Student manipulates code to pass tests dishonestly | Code review, plagiarism detection, test design
NEU's Network (on NEU devices) | Palo Alto firewall blocks legitimate traffic arbitrarily | Use personal devices; complain loudly
NEU's CA (on NEU devices) | NEU's CA compromised → attacker forges any certificate | Use personal devices for sensitive data

Security isn't about eliminating trust — it's about understanding what you trust and why. Every trust decision is an attack surface.

Bringing It Together: Monolith → Distributed

Concept | Monolith (L19) | Distributed (L20)
Communication | Method calls (nanoseconds) | Network calls (milliseconds+)
Failure modes | Process crash → everything fails | Partial failures, network partitions
Consistency | DB transactions span all operations | Eventual consistency, no cross-service transactions (L33)
Debugging | Single stack trace, single log file | Distributed tracing, multiple log streams
Deployment | All-or-nothing | Independent service deploys
Scaling | Scale everything together | Scale services independently
Security | Internal method calls (implicit trust) | Every call across network (explicit auth)

The monolith-first principle still holds. Don't distribute until you need to. When you do need to — for isolation, scaling, team autonomy, or platform leverage — you now understand what you're signing up for.

Conway's Law: Even More True for Distributed Systems

In L19, we introduced Conway's Law:

"Organizations which design systems are constrained to produce designs which are copies of the communication structures of those organizations."


— Melvin Conway, 1967

For distributed systems, this cuts both ways:

If your teams are siloed → your services will be siloed

Pawtograder: separate teams for the Grading Action, API, and Discord bot → separate services with clean interfaces.

If teams must coordinate constantly → services will be tightly coupled

Distributed monolith! Teams that can't deploy independently → services that can't deploy independently.

We'll explore this further in L22 (Teams and Collaboration).

What's Next: Serverless

In L21, we'll explore Serverless Architecture — an architectural style that pushes many of today's concerns to the platform level.

What serverless does:

  • Infrastructure management → platform's job
  • Scaling, availability → platform's job
  • Some security concerns → platform's job
  • You focus on business logic

The tradeoffs:

  • You STILL deal with all eight fallacies
  • You gain elasticity and scale-to-zero
  • You lose control over runtime environment
  • New constraints: cold starts, execution time limits

Pawtograder uses serverless extensively — Supabase Edge Functions, GitHub Actions.

Serverless is the natural continuation of today's lesson: it's distributed computing with operational complexity offloaded to the cloud provider.