A pixel art illustration showing a developer transitioning from arranging small code blocks to surveying a system of interconnected components: Grading Action, Solution Repo, Pawtograder API. Tagline: Where Do the Boundaries Go?

CS 3100: Program Design and Implementation II

Lecture 18: Thinking Architecturally

Learning Objectives

After this lecture, you will be able to:

Define software architecture and distinguish it from design
Identify architectural drivers (functional requirements, quality attributes, constraints) that shape decisions
Apply heuristics to determine service/module boundaries and design good interfaces
Use the C4 model to communicate architecture at different levels of detail
Write Architecture Decision Records (ADRs) to capture the why behind decisions

Architecture vs. Design: A Continuum

Architecture and design exist on a continuum. They ask different questions at different scales.

Architecture — The Big Picture

What are the major components?
How do they communicate?
What are the quality requirements?
Which decisions are hard to change later?

Design — The Details

How is this class organized?
What data structures should we use?
How do these methods collaborate?
Which pattern fits this problem?

A useful heuristic: architectural decisions are the ones that are expensive to change.

Case Study: Pawtograder Autograder Requirements

Before we see how Pawtograder is architected, let's think about what it needs to do:

Discussion: What decisions would be "expensive to change"?

Where should tests run? Who can see the test code? How do instructors configure grading? What happens when 200 students submit at once?

PAUSE FOR DISCUSSION (2-3 min):

Prompt students to think before revealing the answers:

"If we wanted to switch from GitHub Actions to our own servers later, how hard would that be?"
"What if we want to add support for a new language like Rust?"
"What if we want instructors to configure grading differently?"

Common student responses to watch for:

Where tests run: Some may say "doesn't matter" — push back: switching from GitHub Actions to dedicated servers would require rewriting the entire deployment model
Test visibility: Security concern — students can't see instructor tests, but how do we enforce that?
Configuration: Will instructors use a config file? A web UI? This affects who can change what and how quickly

Key insight to plant: "Your instincts about what's 'easy' or 'hard' to change — those instincts ARE architectural thinking. We're going to see how Pawtograder actually made these decisions, and compare it to Bottlenose which made different choices."

→ Transition: Let's see what decisions Pawtograder actually made, and compare them to Bottlenose...

Architecture: Pawtograder vs. Bottlenose

Two systems, same problem — different architectural choices:

Decision	Pawtograder	Bottlenose
Where does grading run?	GitHub Actions (rented)	Custom job queue + Orca (owned)
Where does business logic live?	"Grading action" component — knows Java, Gradle, checkstyle etc. Parses, scores, normalizes	Platform — `Grader` subclasses per language
How do instructors configure?	`pawtograder.yml` configuration in each assignment repo	Web UI forms

Why these are architectural:

Execution: Own vs. rent infrastructure — months to reverse
Business logic: Determines where complexity lives - crucial for changeability
Config: Shapes who can change behavior and how fast

Make this concrete — you use this system!

"When you push code and see the autograder run, GitHub Actions is executing the grading action"
"The reason you can't see our test cases? That's an architectural decision — the private grader is downloaded at runtime"

Caveat about speculation:

We're comparing these systems to illustrate architectural thinking
We didn't interview the Bottlenose creators — we're inferring from the code and documentation
The point is to see types of architectural differences, not to claim we know exactly why they chose what they chose

Expand on each row:

Execution: GitHub Actions = zero ops burden but less control; custom scheduler = full control but you build and maintain it
Business logic: Thick grader enables richer feedback; centralizing in platform is simpler per-grader but platform grows complex
Config: Config file in repo = instructors iterate independently; web UI = platform must support every pattern

→ Transition: So what drives these decisions?

The Key Insight

Neither architecture is "wrong" — they likely reflect different requirements in different contexts.

	Bottlenose (then)	Pawtograder (now)
Grading needs	Simpler, per-language	Structured feedback, mutation testing
Instructor flexibility	Platform defines patterns	Instructors iterate independently
Infrastructure	Owned servers, full control	Rented (GitHub Actions), zero operational responsibility

This raises a question: What forces might push a system toward a particular architecture?

(We're speculating here — we didn't interview the original creators!)

Important caveat:

We're using these systems as examples to illustrate architectural thinking
We didn't interview the original Bottlenose creators — we're speculating based on the artifacts
The goal is to learn the types of forces that shape architecture, not to definitively explain these specific systems

The point to land:

Neither system is "wrong" — different contexts lead to different choices
The computing landscape changed: GitHub Actions didn't exist when Bottlenose was built
Requirements likely differed: richer feedback, more instructor control, multi-language support

Bridge to next section:

"So what ARE the kinds of forces that shape architecture?"
"Let's look at architectural drivers — the things that push us toward particular solutions"

→ Transition: Let's examine what drives these decisions...

What Drives Architectural Decisions?

Architecture doesn't happen in a vacuum. Decisions are shaped by architectural drivers:

Driver 1: Functional Requirements

What must the Pawtograder autograder do?

Accept student code submissions from GitHub repositories
Copy student files into the instructor's solution repo and build the project
Run instructor test suites and report results per graded unit
Run mutation analysis on student tests
Report grading results back to Pawtograder for students to see

A single monolithic script COULD do all of this. But should it? The functional requirements alone don't tell us how to structure it.

Driver 2: Quality Attributes (the "-ilities")

How well must the system perform? Quality attributes shape structure more than features do.

You already know these:

Changeability (L6-L7): New assignment = swap assignment config file. New language = add one Builder class in grader component. No need to change API or affect other courses.
Testability (L16): Run grading locally without GitHub, API, or database.

Same principles from class design — now at service scale!

Coming up:

Security (L20): Students must NEVER see test cases
Scalability (L21): 200 students submit at once
Deployability (L36): Ship changes to one course with zero risk of affecting others
Maintainability (L36): Small team, many courses

Quality attributes often conflict. Security (download grader at runtime) creates a network dependency that hurts reliability. Architecture = making these tradeoffs consciously.

Connect to prior knowledge:

"You already know how to design for change at the class level — information hiding, low coupling, high cohesion"
"You already know testability — observability, controllability, separating 'what to do' from 'how to connect'"
"Today we're applying those SAME principles at a bigger scale"

Pawtograder examples:

Changeability → pawtograder.yml is the ONLY thing instructors touch; adding Rust = one new Builder class
Testability → npx tsimp src/grading/main.ts -s /solution -u /submission runs locally, no network needed

Why course isolation matters — emphasize this!

Imagine 10 different courses all using the same grading platform
If an instructor tweaks their grading config and it breaks ANOTHER course's grading, that's a nightmare
Worse: if an instructor is mid-semester and can't update their test cases because it requires a platform-wide deployment
The architecture MUST let each course evolve independently — new assignments, new tests, new grading criteria — without coordinating with other instructors or risking their stability
This is changeability at organizational scale: "I can ship changes to MY course without a meeting"

Preview future lectures:

L20: How do we authenticate? How do we hide test cases? (Security)
L21: How does GitHub Actions handle 200 concurrent submissions? (Scalability)
L36: How do we deploy changes safely? How do we keep this maintainable over years? (Deployability, Sustainability)

→ Transition: Constraints also shape what's possible...

Driver 3: Constraints

Constraints are non-negotiable boundaries. They limit our design space:

Source	Pawtograder Constraint
Platform	"Must run inside GitHub Actions runners"
Security	"Student code is untrusted — might try to steal the secret grader code or overwrite grading results"
Authentication	"Student code is submitted via GitHub, must use GitHub's identity framework"
Compatibility	"Must support Pyret, Java and Python assignments"

Constraints aren't negotiable the way quality attributes are. They're the fixed boundaries within which we architect. Sometimes constraints ARE the architecture — the GitHub Actions sandbox essentially dictates the deployment model.

Drivers Tell Us What; Heuristics Help Us Find Where

Four boundary-finding heuristics shown as colored lenses: Rate of Change, Actor, Interface Segregation, Testability — used to examine system boundaries.

Bridge from drivers to heuristics:

We just identified WHAT we care about: functional requirements, quality attributes, constraints
But knowing we need "changeability" or "testability" doesn't tell us WHERE to draw the lines
These four heuristics are lenses for examining a system to find natural boundaries

The four lenses (preview):

Rate of Change: Things that change at different speeds should be separate
Actor: Things owned by different people should be separate
Interface Segregation: Clients that need different things should get different interfaces
Testability: Things that need independent testing should be separable

How they connect to drivers:

Changeability driver → Rate of Change heuristic helps us isolate fast-changing parts
Maintainability driver → Actor heuristic helps us give each team a clear slice
Testability driver → Testability heuristic helps us design components that can be tested alone

Coming up: We'll apply each heuristic to Pawtograder and see how the same boundaries emerge from multiple angles — that's a sign we've found a natural seam.

→ Transition: Let's start with rate of change...

Heuristic 1: Group by Rate of Change

Things that change at different rates should live in different components:

Component	How Often It Changes	Who Changes It
Config file: `pawtograder.yml`	Every assignment (weekly)	Instructor
Grading Action code	Every few months	Action maintainers
Pawtograder API	Rarely — endpoints are stable	Sysadmin team
`PawtograderConfig` interface	Very rarely — breaking change	Action maintainers (carefully!)

✓ The config file (pawtograder.yml) changes weekly → it SHOULD be separate from the action code that changes monthly. And both should be separate from the API that changes rarely.

Four things, four rates of change:

pawtograder.yml (the config) — changes per assignment (weekly)
Grading Action (the engine) — changes per semester
Pawtograder API (the platform) — changes very rarely
PawtograderConfig interface — the contract between action and API, changes almost NEVER

The interface is the most stable:

The PawtograderConfig interface defines what the API expects from the action
If we change this interface, BOTH the action AND the API must change together
That's why it changes "very rarely — breaking change" — it's the stable contract that lets the other pieces evolve independently

If they were all one thing:

Changing a test case configuration would mean re-deploying the entire platform
Updating the action to support Python would risk breaking the API
Rate of change tells us: keep them apart!

→ Transition: What does this config file actually look like?

What's in a `pawtograder.yml`?

YAML (Yet Another Markup Language) is a human-readable data format — like JSON but with less punctuation. Here's a simplified example:

grader: 'overlay'
build:
  preset: 'java-gradle'           # How to build: Java with Gradle
  student_tests:
    run_tests: true               # Run student's tests against instructor impl
    run_mutation: true            # Check if student tests catch buggy versions

gradedParts:
  - name: "Domain Model"
    gradedUnits:
      - name: "Cookbook"
        tests: ["Cookbook addRecipe", "Cookbook removeRecipe"]
        points: 4
      - name: "UserLibrary"
        tests: ["UserLibrary findRecipesByTitle"]
        points: 3

submissionFiles:
  files: ['src/main/**/*.java']   # Which student files to grade
  testFiles: ['src/test/**/*.java']

This file IS the interface between instructor and grading action — it changes weekly, the action code doesn't.

Heuristic 2: Each Actor Gets Their Own Slice

Four actors: Student (pushes code, sees grades), Instructor (configures pawtograder.yml), Sysadmin (maintains API), Intrepid Instructor (changes grading action without touching API). Each connects to their part of the system.

Walk through the graphic — four actors, four slices:

Student: Pushes code, sees grades — never touches config or action
Instructor: Writes pawtograder.yml — never modifies TypeScript
Sysadmin: Maintains API infrastructure — doesn't write grading configs
Intrepid Instructor: Can change how grading works by modifying the action — without touching the API!

Key insight: changes from one actor shouldn't ripple to another's code.

When instructor changes pawtograder.yml → action and API unchanged
When sysadmin updates API → instructor config unchanged
When intrepid instructor adds LLM feedback → API unchanged

Connection to L8 (SOLID): This is Single Responsibility Principle at architectural scale — each component has one reason to change, corresponding to one actor.

→ Transition: What if we forced all clients to use one fat interface?

Heuristic 3: Apply Interface Segregation

Don't force clients to depend on interfaces they don't use. What if we had one fat interface?

// BAD: One monolithic interface for everything
public interface AutograderSystem {
    // Instructor concerns
    PawtograderConfig parseConfig(String yaml);
    void validateGradedParts(PawtograderConfig config);

    // Action engine concerns
    void copySubmissionFiles(Path src, Path dest);
    BuildResult buildProject(BuildPreset preset);
    List<TestResult> runTests(PawtograderConfig config);
    List<MutantResult> runMutationAnalysis(List<String> locations);

    // API concerns
    SubmissionRegistration registerSubmission(String oidcToken);
    void submitFeedback(String submissionId, AutograderFeedback results);
}

An instructor configuring a YAML file shouldn't need to know about OIDC tokens. The API shouldn't need to know how tests are run. We'll design the better version after finishing the heuristics.

Heuristic 4: Optimize for Testability

Can you test a component without deploying the whole system?

Component	Testable in Isolation?	How?
Grading logic	✅ Yes	`npx tsimp src/grading/main.ts -s /path/to/solution -u /path/to/submission` — runs locally, no API needed
Config parsing	✅ Yes	Parse a `pawtograder.yml` file and validate — no network, no GitHub
API endpoints	✅ Yes	Edge functions tested independently with mock OIDC tokens
Full pipeline	⚠️ Integration	Requires GitHub Actions runner + API — regression test suite

The grading action can run locally without Pawtograder. That's not an accident — it's an architectural choice that enables fast development iteration.

Connection to L16 (Testability):

Same principle: separate "what to do" from "how to connect"
The action doesn't need a real API to test grading logic
The API doesn't need a real GitHub runner to test submission handling

The local testing command is key:

npx tsimp src/grading/main.ts -s /path/to/solution -u /path/to/submission

No GitHub Actions, no Pawtograder API, no network
Just: "here's a solution repo and a student repo — grade it"

→ Transition: Before we see how these come together, let's pause for a quick poll...

Quick Poll: Applying the Heuristics

📊 pollev.com/jbell

Take a moment to reflect on what you've learned and predict what comes next.

Open Poll Everywhere and run the survey (~3-4 min)

Questions:

Word Cloud: "In one word: What makes code HARD to change?"
Reflection: "Think about your most frustrating debugging experience. Which boundary problem was likely involved?"
- Couldn't test the broken part in isolation
- Couldn't tell which component was responsible
- A change in one place broke something far away
- Had to understand the whole system to fix one thing
Opinion: "What's your instinct when requirements are unclear: build flexibility or keep it simple?"
Prediction: "Based on the four heuristics, how many distinct components should Pawtograder have?"
- 2 / 3 / 4 / 5+

After the prediction question: "Let's see if your intuition matches what the heuristics reveal..."

→ Transition: Now let's see how these heuristics converge on a structure...

Emerging Architecture: Three Components

Applying our four heuristics, a natural structure emerges:

Comparing Boundaries: Two approaches for grading student code

Pawtograder: "Smart Action"

Action parses, scores, normalizes
Sends structured AutograderFeedback to API
API just stores it — language-agnostic

Bottlenose: "Platform-Driven"

Grader subclasses per language
Platform interprets raw output
Orca is a thin execution layer

	Pawtograder	Bottlenose
Data coupling	Narrow API (2 endpoints)	Shared Database
Adding a language	One new `Builder` class	New `Grader` + views + Docker image

Data Coupling Affects Testability

Pawtograder (narrow API)

Test the entire grading pipeline on your laptop:

npx tsimp src/grading/main.ts \
  -s /path/to/solution \
  -u /path/to/submission

No network, no database, no GitHub runner needed.

Bottlenose (shared database)

You could test a Grader subclass by feeding it canned TAP output, but you'd need to:

Stub ActiveRecord (Grader inherits from it)
Mock Backburner job dispatch
Simulate Orca's Docker execution responses

Possible, but the architecture works against you.

Coupling makes itself felt at test time: isolated testing isn't impossible in Bottlenose, but the architecture makes it unnatural—you're working against the grain instead of with it.

Connection to L16 (Testability):

Same principle: the more dependencies a component has, the harder it is to test
Pawtograder's narrow API boundary means: no DB dependency → easy local testing
Bottlenose's shared database means: you can test a Grader with canned output, but you're fighting ActiveRecord inheritance, job dispatch coupling, etc.

Connection to L17 (Service Locator vs DI):

Bottlenose's Service Locator approach means: grader code implicitly depends on the platform's registry
Testing a JavaGrader in isolation requires mocking the lookup mechanism, the ActiveRecord base class, etc.
Pawtograder's DI approach means: GradleBuilder depends only on what's in its constructor—easy to test by passing mocks
This is exactly L17's point: "DI makes dependencies explicit... Service Locator hides them"

Neither is wrong — different tradeoffs:

Bottlenose's shared DB simplified the architecture (components just read/write the same tables)
But it made testing harder and changes more expensive — a schema change ripples everywhere
Pawtograder's narrow API adds normalization complexity but gains testability and independent deployment

→ Transition: There's still a big decision inside the action...

A Design Decision: Where Does Language Logic Live?

We need to support multiple languages (Java, Python). Where should the language-specific build/test logic live?

Option A: Logic in the Action

The action knows how to build Java (Gradle) and Python. It normalizes results before sending to API.

Pros: API stays language-agnostic (simple!)
Cons: Action grows with every new language support
Example: preset: 'java-gradle' | 'python-script' strategy pattern

Option B: Logic in the API

Action sends raw logs/XML to API. API parses and normalizes.

Pros: Action is very thin ("dumb pipe")
Cons: API becomes coupled to every language toolchain; hard to test locally
Risk: Scaling issues if parsing is heavy

Pawtograder chooses Option A. The action normalizes everything into a standard AutograderFeedback format. The API remains simple and scalable.

Bottlenose's architecture is closer to Option B:

Orca (the Docker execution service) is a thin pipe that returns raw output
Bottlenose's Grader subclasses interpret everything — each language has its own parsing logic
This worked because Bottlenose already had deep knowledge of assignment structure for each language

Pawtograder chose Option A because:

The action runs in the build environment where parsing tools are available
Keeping the API simple enables it to serve many courses with minimal changes
Adding Rust support = adding a new Builder implementation, not changing the API

Reflection questions:

If we added Rust support, which components would need to change?
Can you test language-specific parsing without a running API?
What if we wanted to support a course platform other than Pawtograder (e.g., Canvas)?

→ Transition: How do we wire the internal steps?

This decision is expensive to change

Fork in the road: Logic in Action (normalize early, simple API) vs Logic in API (thin action, complex/coupled API). Architect says: Normalize early.

Why is this "expensive to change"?

Once you take the orange path (logic in API), coupling spreads like a web:

API now knows about JUnit XML format → tied to Java
Add Python? API learns pytest format → more parsing code
Add Rust? More formats, more coupling
Database schema stores language-specific fields
UI displays language-specific results
Every new language touches EVERY layer

The extraction question:

Six months later, you realize the API is bloated and hard to test
You think: "Let's extract the parsing into a separate service"
But now parsing code is intertwined with storage, with the UI, with the job queue...
You ask yourself: "Can I even extract this? Or should I just start from scratch?"

THAT question — "should I throw it away and start over?" — is the definition of an expensive decision.

The green path (normalize early) avoids this:

Action handles all language complexity
API just stores AutograderFeedback — same shape for every language
Adding Rust? One new Builder class. API unchanged. Database unchanged.

This is why we said earlier: architectural decisions are the ones that are expensive to change. This fork is architectural.

→ Transition: So we've found our boundaries and made the big structural decisions. But how do we design the interfaces BETWEEN these components? Turns out, the same principles from class design apply at this scale too...

Interface Design at the Service Level

Once we've identified components, we need to design their interfaces. The same principles from class design apply—and so do the same patterns from L17.

Dependency Injection at service scale:

// Good: GradingAction depends on abstractions
public class GradingAction {
    private final SubmissionAPI submissionApi;
    private final FeedbackAPI feedbackApi;

    public GradingAction(SubmissionAPI submissionApi, FeedbackAPI feedbackApi) {
        this.submissionApi = submissionApi;
        this.feedbackApi = feedbackApi;
    }
}

Use the Strategy pattern for extensibility:

// Builder is an interface—we can add new language support
public interface Builder {
    BuildResult build(Path projectDir, BuildConfig config);
    List<TestResult> parseTestResults(Path reportDir);
    Optional<LintResult> lint(Path projectDir, LinterConfig config);
    Optional<List<MutantResult>> mutationTest(Path projectDir, MutationConfig config);
}

// GradleBuilder for Java, PythonScriptBuilder for Python
// Adding Rust support = adding a new Builder implementation

Make contracts clear: What does register promise? What happens if the OIDC token is invalid? What format does AutograderFeedback use? These details belong in the interface's documentation.

The Contracts: Data Types That Cross Boundaries

The interface isn't just method signatures — it's the data types that flow across. These are the contracts that must remain stable:

// What the API returns when the action registers a submission
public record SubmissionRegistration(
    String submissionId,     // Unique ID for this grading run
    String graderUrl,        // One-time URL to download instructor's test code
    String graderSha         // SHA to verify the download wasn't tampered with
) {}

// What the action sends back after grading — THE KEY CONTRACT
public record AutograderFeedback(
    List<TestFeedback> tests,          // Results per graded part
    LintResult lint,                    // Style check results
    Optional<Double> score,             // Overall score
    List<FeedbackComment> annotations   // Line-level comments on student code
) {}

// A comment attached to a specific line of student code
public record FeedbackComment(
    String file, int line, String message, String severity
) {}

If AutograderFeedback were poorly designed — no visibility levels, no line-level comments — every future grading implementation would have to work around those limitations. This contract deserves careful design.

Why these records matter:

SubmissionRegistration: What the action receives — a one-time URL to download the private grader
AutograderFeedback: What the action sends back — ALL the grading results in one structured format
FeedbackComment: Enables line-level feedback on student code

This is the "thick action" paying off:

The action has already parsed JUnit XML, Pitest reports, linter output
Normalized everything into this clean, structured format
The API just stores it — doesn't need to know about JUnit or Pitest

The stability requirement:

If we change AutograderFeedback, EVERY grading action implementation must change
That's why this interface changes "very rarely — breaking change"
Same principle as L7's stable interfaces — but now at the API level

→ Transition: How do we communicate all of this architecture to others?

Architecture in Your Head Is Folklore

Left: One developer with architecture in their head. Right: Three confused team members each imagining different architectures. Center: broken telephone effect. Bottom: C4 Diagrams and ADRs as solutions. Callout: Architecture in one head is folklore.

The communication challenge with Pawtograder:

We just covered: three components, four heuristics, data coupling decisions, interface contracts, record types...
How would you explain all of that to a new team member?
How would you explain it to an instructor who just wants to configure grading?
How would you explain it to a sysadmin who needs to debug the API?

The problem:

If this architecture only exists in one person's head, it degrades like a game of telephone
Six months later, everyone has a DIFFERENT mental model
New contributors make changes that violate boundaries they didn't know existed
"Why doesn't the action just write directly to the database?" — because they never saw the ADR

Structured approaches help:

C4 Diagrams — Show the system at different zoom levels (different audiences need different detail)
ADRs — Capture WHY we made decisions, not just WHAT we built

→ Transition: Let's look at C4 first — four levels of zoom...

The C4 Model: Four Levels of Zoom

Level	Shows	Pawtograder Example
1. System Context	System + Users + External	Student ↔ GitHub ↔ Autograder ↔ Instructor
2. Container	Deployable Units	Grading Action (TypeScript), API (Edge Functions), Database (Supabase)
3. Component	Internals of one Container	Inside Action: `ConfigParser`, `Builder`, `TestRunner`, `FeedbackClient`
4. Code	Classes / Interfaces	`PawtograderConfig` interface, `Builder` class

Let's walk through all four levels for Pawtograder's autograder, so you can see how each level zooms in on a different scale of the same system.

C4 Level 1: System Context

Who uses the system, and what external systems does it talk to? This is the "napkin sketch" level.

At this level, the autograder is a single box. We don't care what's inside—only what it interacts with.

C4 Level 2: Container

Zoom into the "Autograder System" box. What are the major deployable units?

The narrow API boundary between the Grading Action and Pawtograder API is a deliberate choice—the action never touches the database directly.

C4 Level 3: Component (Grading Action)

This view shows how main.ts orchestrates the internal components. Mutation Analysis depends on Tests being run first, and multiple report parsers normalize tool-specific output into AutograderFeedback.

C4 Level 4: Code (Build Runner)

Zoom into the "Build Runner" component. What are the actual classes and interfaces?

Level 4 shows actual code structure. You'd rarely draw this for the whole system—it's useful for specific extension points or critical interfaces.

Choosing the Right Level

Use the right level of detail for your audience:

Audience	Useful Levels	Why
Students and TAs	Level 1	Need to understand what the system does, not how
Instructors configuring grading	Levels 1–2	Need to see where `pawtograder.yml` fits in the pipeline
Action maintainers	Levels 2–3	Need to understand component responsibilities and dependencies
Contributors adding a new Builder	Levels 3–4	Need to see the extension point and its interface

Architecture Decision Records (ADRs)

Diagrams show what. ADRs capture why. An ADR documents: Context, Decision, and Consequences.

ADR-001: Thick Grading Action with Narrow API vs. Thin Action with Shared Database

Context: The grading action needs to communicate results to the Pawtograder platform. We could have the action access the database directly (as Bottlenose's components share PostgreSQL), or we could put a narrow API in between.

Decision: The grading action will communicate exclusively through two API endpoints (createSubmission and submitFeedback). It will never access the database directly.

Consequences:

✅ Testability: The action can run locally without a database, network, or GitHub runner
✅ Changeability: Database schema can evolve freely without touching the action
✅ Decoupling: Action and API can be developed and deployed independently
⚠️ Complexity: The action must normalize all results into a standard AutograderFeedback format
⚠️ Duplication: Some validation happens in both the action and the API

ADR Example: Security Decision

ADR-002: Dynamic Grader Download vs. Embedded Tests

Context: We need to run instructor tests against student code. We could embed tests in the student repo (hidden folder?) or download them at runtime.

Decision: The action will download the private grader at runtime using an authenticated one-time URL.

Consequences:

✅ Security: Students strictly cannot see test code (it's never in their repo). Student code does run alongside instructor code, but the design ensures that a student who submits a trojan horse to exfiltrate the secret grader is trivially detected.
✅ Changeability: Instructors can update tests after assignment release without student pull.
⚠️ Reliability: Grading requires active network connection to Pawtograder API, adding more points of failure.
⚠️ Complexity: Needs a complex auth flow to get the download URL and ensure that student code is immutably recorded before receiving access to the grader code.

ADRs create institutional memory. Without this record, someone might ask "Just put the tests in the repo!" — the ADR explains why we don't.

Plan the Infrastructure, Let the System Emerge

Three-panel city planning metaphor: Left shows planner with elaborate blueprints but empty lot (BDUF). Right shows chaotic sprawl with no infrastructure (No Design). Center shows vibrant living city with planned infrastructure but organic growth (Just Enough Architecture). A café named 'QWAN Coffee' is visible in the center.

The city planning metaphor:

Left (Master Plan): Beautiful blueprints, but nothing built. Years of planning, no city.
Right (No Plan): Buildings thrown up wherever, roads to nowhere, infrastructure chaos.
Center (Living City): Key infrastructure was planned (roads, utilities, zoning), but neighborhoods grew organically within those constraints.

The key insight: "Plan the infrastructure. Let the neighborhoods emerge."

In Pawtograder: we planned the component boundaries (action, API, config) and the key interfaces (AutograderFeedback)
But we didn't plan every feature — line-level comments, mutation hints, dependency scoring all emerged from use

Christopher Alexander connection:

Alexander was a real architect who influenced software patterns deeply
His book The Timeless Way of Building (1979) argues: the most livable spaces emerge through gradual, adaptive growth, not master plans
The same is true for software — we discover what the architecture should be by building and using the system

Easter egg for the curious:

"QWAN Coffee" = "Quality Without A Name" — Alexander's term for the ineffable quality that makes spaces feel alive
He couldn't quite define it, but you know it when you see it
If students ask: "It's a reference to Christopher Alexander — look him up!"

→ Transition: Let's see how Pawtograder grew this way...

Piecemeal Growth: Features That Emerged

Pawtograder's first version could build Java, run JUnit tests, and report a score. Features were added as real needs emerged:

Feature	When it was added	What prompted it
Line-level feedback comments	After instructors asked for richer feedback	A `comment_script` hook lets any external script attach comments to specific lines of student code
Detailed hints for mutants not detected	After mutation testing revealed that students struggled to understand why certain mutants weren't caught	The `mutationTest()` method returns detailed feedback on each mutant (configured via `pawtograder.yml`), not just a score
Dependency-based scoring	After multi-part assignments revealed the need	If Part 2 depends on Part 1 and Part 1 fails, Part 2 shouldn't run—avoids cascading confusing failures

None of these were in the original design. But the architecture made each addition cheap because the right boundaries were in place from the start.

Just Enough Architecture

Decide what's hard to reverse; defer what's easy to change; design the system so deferred decisions stay cheap.

Decision	Why it was decided early	Cost of getting it wrong
Narrow API boundary (`createSubmission`, `submitFeedback`)	Decouples action from platform—each evolves independently	Every grading action and the entire API would need rewriting
Thick action / thin API	Action normalizes results; API just stores them	Changing who interprets results means rearchitecting both sides
Config-driven grading (`pawtograder.yml`)	Instructors iterate independently without touching action code	Every instructor's workflow would break; action would need a UI
Strategy pattern for builders (`Builder` interface)	New languages = new class, not new architecture	Adding a language would require forking the entire grading pipeline

Decision	Why it was safe to defer	Where it lives
Which linter rules to enforce	Contained in a config file within the solution repo	`checkstyle.xml` in grader tarball
How artifacts are stored and uploaded	Behind the `SupabaseAPI` client—action just calls `uploadArtifact()`	Can switch from Supabase Storage to S3 without touching grading logic
Specific report parsing formats	Each parser is a self-contained module	`surefire.ts`, `pitest.ts`, `jacoco.ts`—add or replace without rippling

The "decide now" items are all about separation:

How components talk to each other (narrow API vs. shared database)
Where intelligence lives (thick action vs. thin pipe)
Getting these wrong means rewriting the action, the API, and every instructor's config

The "defer" items are contained within a single component:

They can change without rippling across boundaries
Data coupling and thick/thin grader decisions are exactly the kind of choices that are expensive to reverse once courses depend on the system

Bottlenose context:

Bottlenose's architecture reflects a context where simpler grading was sufficient — centralizing grading logic in the platform with UI-driven config was a natural fit
Pawtograder was designed when instructors needed richer capabilities
Those richer requirements drove the decision to put a narrow API between components instead of a shared database

→ Transition: Summary...

Looking Forward: Where These Ideas Go Next

Concept from Today	Where It Goes
Quality Attributes (testability, changeability...)	L19: Deep dive into architectural qualities — hexagonal architecture applied to CookYourBooks, tradeoffs between -ilities
Component Boundaries & APIs	L20-21: What happens when boundaries cross networks? Distributed architecture, fallacies of distributed computing, serverless
Data Coupling Decisions	L20-21: Distributed data is even harder — consistency, latency, the CAP theorem
Architecture Communication (C4, ADRs)	L22: Conway's Law — how team structure affects (and is affected by) architecture

The vocabulary you learned today — drivers, boundaries, coupling, expensive decisions — will be your lens for the rest of the course.

The arc ahead:

L19 (Architectural Qualities): We'll apply hexagonal architecture to CookYourBooks, seeing how ports and adapters create testable, changeable systems. Deep dive into how architectural choices affect the -ilities.
L20 (Networks): When your components communicate over a network, everything gets harder. The Fallacies of Distributed Computing. Client-server architecture.
L21 (Serverless): Infrastructure building blocks — databases, blob storage, queues. Serverless as an architectural style.
L22 (Teams): Conway's Law — "Organizations design systems that mirror their communication structure." Architecture isn't just technical; it's social.

The throughline: Today's question was "what boundaries should we draw?" The next lectures ask "what happens at those boundaries?" — first within a process, then across networks, then across teams.

Final thought: Next time Pawtograder grades your submission, think about the architecture making it happen. The narrow API boundary means the grading action can change independently. The data contracts (AutograderFeedback, FeedbackComment) cross that boundary cleanly. Every choice was made for a reason — and now you have vocabulary to discuss why.

Lecture 18: Thinking Architecturally​

Learning Objectives​

Architecture vs. Design: A Continuum​

Case Study: Pawtograder Autograder Requirements​

Architecture: Pawtograder vs. Bottlenose​

The Key Insight​

What Drives Architectural Decisions?​

Driver 1: Functional Requirements​

Driver 2: Quality Attributes (the "-ilities")​

Driver 3: Constraints​

Drivers Tell Us What; Heuristics Help Us Find Where​

Heuristic 1: Group by Rate of Change​

What's in a pawtograder.yml?​

Heuristic 2: Each Actor Gets Their Own Slice​

Heuristic 3: Apply Interface Segregation​

Heuristic 4: Optimize for Testability​

Quick Poll: Applying the Heuristics​

Emerging Architecture: Three Components​

Comparing Boundaries: Two approaches for grading student code​

Data Coupling Affects Testability​

A Design Decision: Where Does Language Logic Live?​