A pixel art illustration showing a developer transitioning from arranging small code blocks to surveying a system of interconnected components: Grading Action, Solution Repo, Pawtograder API. Tagline: Where Do the Boundaries Go?

CS 3100: Program Design and Implementation II

Lecture 18: Thinking Architecturally

©2026 Jonathan Bell, CC-BY-SA

Learning Objectives

After this lecture, you will be able to:

  1. Define software architecture and distinguish it from design
  2. Identify architectural drivers (functional requirements, quality attributes, constraints) that shape decisions
  3. Apply heuristics to determine service/module boundaries and design good interfaces
  4. Use the C4 model to communicate architecture at different levels of detail
  5. Write Architecture Decision Records (ADRs) to capture the why behind decisions

Architecture vs. Design: A Continuum

Architecture and design exist on a continuum. They ask different questions at different scales.

Architecture — The Big Picture

  • What are the major components?
  • How do they communicate?
  • What are the quality requirements?
  • Which decisions are hard to change later?

Design — The Details

  • How is this class organized?
  • What data structures should we use?
  • How do these methods collaborate?
  • Which pattern fits this problem?

A useful heuristic: architectural decisions are the ones that are expensive to change.

Case Study: Pawtograder Autograder Requirements

Before we see how Pawtograder is architected, let's think about what it needs to do:

Discussion: What decisions would be "expensive to change"?

Where should tests run? Who can see the test code? How do instructors configure grading? What happens when 200 students submit at once?

Architecture: Pawtograder vs. Bottlenose

Two systems, same problem — different architectural choices:

Decision | Pawtograder | Bottlenose
Where does grading run? | GitHub Actions (rented) | Custom job queue + Orca (owned)
Where does business logic live? | "Grading action" component — knows Java, Gradle, checkstyle, etc.; parses, scores, normalizes | Platform — Grader subclasses per language
How do instructors configure? | pawtograder.yml configuration in each assignment repo | Web UI forms

Why these are architectural:

  • Execution: Own vs. rent infrastructure — months to reverse
  • Business logic: Determines where complexity lives - crucial for changeability
  • Config: Shapes who can change behavior and how fast

The Key Insight

Neither architecture is "wrong" — they likely reflect different requirements in different contexts.

Aspect | Bottlenose (then) | Pawtograder (now)
Grading needs | Simpler, per-language | Structured feedback, mutation testing
Instructor flexibility | Platform defines patterns | Instructors iterate independently
Infrastructure | Owned servers, full control | Rented (GitHub Actions), zero operational responsibility

This raises a question: What forces might push a system toward a particular architecture?

(We're speculating here — we didn't interview the original creators!)

What Drives Architectural Decisions?

Architecture doesn't happen in a vacuum. Decisions are shaped by architectural drivers:

Driver 1: Functional Requirements

What must the Pawtograder autograder do?

  • Accept student code submissions from GitHub repositories
  • Copy student files into the instructor's solution repo and build the project
  • Run instructor test suites and report results per graded unit
  • Run mutation analysis on student tests
  • Report grading results back to Pawtograder for students to see

A single monolithic script COULD do all of this. But should it? The functional requirements alone don't tell us how to structure it.

Driver 2: Quality Attributes (the "-ilities")

How well must the system perform? Quality attributes shape structure more than features do.

You already know these:

  • Changeability (L6-L7): New assignment = swap assignment config file. New language = add one Builder class in grader component. No need to change API or affect other courses.
  • Testability (L16): Run grading locally without GitHub, API, or database.

Same principles from class design — now at service scale!

Coming up:

  • Security (L20): Students must NEVER see test cases
  • Scalability (L21): 200 students submit at once
  • Deployability (L36): Ship changes to one course with zero risk of affecting others
  • Maintainability (L36): Small team, many courses

Quality attributes often conflict. Security (download grader at runtime) creates a network dependency that hurts reliability. Architecture = making these tradeoffs consciously.

Driver 3: Constraints

Constraints are non-negotiable boundaries. They limit our design space:

Source | Pawtograder Constraint
Platform | "Must run inside GitHub Actions runners"
Security | "Student code is untrusted — might try to steal the secret grader code or overwrite grading results"
Authentication | "Student code is submitted via GitHub, must use GitHub's identity framework"
Compatibility | "Must support Pyret, Java, and Python assignments"

Constraints aren't negotiable the way quality attributes are. They're the fixed boundaries within which we architect. Sometimes constraints ARE the architecture — the GitHub Actions sandbox essentially dictates the deployment model.

Drivers Tell Us What; Heuristics Help Us Find Where

Four boundary-finding heuristics shown as colored lenses: Rate of Change, Actor, Interface Segregation, Testability — used to examine system boundaries.

Heuristic 1: Group by Rate of Change

Things that change at different rates should live in different components:

Component | How Often It Changes | Who Changes It
Config file: pawtograder.yml | Every assignment (weekly) | Instructor
Grading Action code | Every few months | Action maintainers
Pawtograder API | Rarely — endpoints are stable | Sysadmin team
PawtograderConfig interface | Very rarely — breaking change | Action maintainers (carefully!)

✓ The config file (pawtograder.yml) changes weekly → it SHOULD be separate from the action code that changes monthly. And both should be separate from the API that changes rarely.

What's in a pawtograder.yml?

YAML ("YAML Ain't Markup Language") is a human-readable data format — like JSON but with less punctuation. Here's a simplified example:

grader: 'overlay'
build:
  preset: 'java-gradle'    # How to build: Java with Gradle
student_tests:
  run_tests: true          # Run student's tests against instructor impl
  run_mutation: true       # Check if student tests catch buggy versions

gradedParts:
  - name: "Domain Model"
    gradedUnits:
      - name: "Cookbook"
        tests: ["Cookbook addRecipe", "Cookbook removeRecipe"]
        points: 4
      - name: "UserLibrary"
        tests: ["UserLibrary findRecipesByTitle"]
        points: 3

submissionFiles:
  files: ['src/main/**/*.java']      # Which student files to grade
  testFiles: ['src/test/**/*.java']

This file IS the interface between instructor and grading action — it changes weekly, the action code doesn't.
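On the action side, this YAML maps naturally onto a typed configuration object. A minimal sketch in Java follows (the real config type is a TypeScript interface in the grading action; these record shapes mirror the YAML keys but are illustrative, not the actual definitions):

```java
import java.util.List;

// Hypothetical Java rendering of the pawtograder.yml structure above.
// Field names mirror the YAML keys, but these shapes are illustrative.
record GradedUnit(String name, List<String> tests, int points) {}
record GradedPart(String name, List<GradedUnit> gradedUnits) {}
record PawtograderConfig(String grader, List<GradedPart> gradedParts) {}

class ConfigDemo {
    // Total points across all graded units, e.g. for validating a config
    static int totalPoints(PawtograderConfig cfg) {
        return cfg.gradedParts().stream()
                .flatMap(part -> part.gradedUnits().stream())
                .mapToInt(GradedUnit::points)
                .sum();
    }
}
```

Because the config is plain data, validations like "do the points add up?" become trivial pure functions.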

Heuristic 2: Each Actor Gets Their Own Slice

Four actors: Student (pushes code, sees grades), Instructor (configures pawtograder.yml), Sysadmin (maintains API), Intrepid Instructor (changes grading action without touching API). Each connects to their part of the system.

Heuristic 3: Apply Interface Segregation

Don't force clients to depend on interfaces they don't use. What if we had one fat interface?

// BAD: One monolithic interface for everything
public interface AutograderSystem {
    // Instructor concerns
    PawtograderConfig parseConfig(String yaml);
    void validateGradedParts(PawtograderConfig config);

    // Action engine concerns
    void copySubmissionFiles(Path src, Path dest);
    BuildResult buildProject(BuildPreset preset);
    List<TestResult> runTests(PawtograderConfig config);
    List<MutantResult> runMutationAnalysis(List<String> locations);

    // API concerns
    SubmissionRegistration registerSubmission(String oidcToken);
    void submitFeedback(String submissionId, AutograderFeedback results);
}

An instructor configuring a YAML file shouldn't need to know about OIDC tokens. The API shouldn't need to know how tests are run. We'll design the better version after finishing the heuristics.
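As a preview, here is one possible segregation along actor boundaries. This is a sketch with illustrative interface and type names, not Pawtograder's actual API:

```java
import java.util.List;

// Minimal stand-in types so the sketch compiles (names are illustrative)
record Config(List<String> gradedParts) {}
record TestResult(String name, boolean passed) {}
record Feedback(List<TestResult> tests) {}

// Instructor-facing: configuration concerns only; no tokens, no test execution
interface ConfigService {
    Config parseConfig(String yaml);
}

// Action-internal: how grading actually runs; no YAML parsing, no network
interface GradingEngine {
    List<TestResult> runTests(Config config);
}

// API-facing: the only operations that cross the network boundary
interface FeedbackAPI {
    void submitFeedback(String submissionId, Feedback results);
}
```

Each client now depends only on the slice it actually uses.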

Heuristic 4: Optimize for Testability

Can you test a component without deploying the whole system?

Component | Testable in Isolation? | How?
Grading logic | ✅ Yes | npx tsimp src/grading/main.ts -s /path/to/solution -u /path/to/submission — runs locally, no API needed
Config parsing | ✅ Yes | Parse a pawtograder.yml file and validate — no network, no GitHub
API endpoints | ✅ Yes | Edge functions tested independently with mock OIDC tokens
Full pipeline | ⚠️ Integration | Requires GitHub Actions runner + API — regression test suite

The grading action can run locally without Pawtograder. That's not an accident — it's an architectural choice that enables fast development iteration.

Quick Poll: Applying the Heuristics

📊 pollev.com/jbell

Take a moment to reflect on what you've learned and predict what comes next.

Emerging Architecture: Three Components

Applying our four heuristics, a natural structure emerges:

Comparing Boundaries: Two approaches for grading student code

Pawtograder: "Smart Action"

  • Action parses, scores, normalizes
  • Sends structured AutograderFeedback to API
  • API just stores it — language-agnostic

Bottlenose: "Platform-Driven"

  • Grader subclasses per language
  • Platform interprets raw output
  • Orca is a thin execution layer

Aspect | Pawtograder | Bottlenose
Data coupling | Narrow API (2 endpoints) | Shared Database
Adding a language | One new Builder class | New Grader + views + Docker image

Data Coupling Affects Testability

Pawtograder (narrow API)

Test the entire grading pipeline on your laptop:

npx tsimp src/grading/main.ts \
  -s /path/to/solution \
  -u /path/to/submission

No network, no database, no GitHub runner needed.

Bottlenose (shared database)

You could test a Grader subclass by feeding it canned TAP output, but you'd need to:

  • Stub ActiveRecord (Grader inherits from it)
  • Mock Backburner job dispatch
  • Simulate Orca's Docker execution responses

Possible, but the architecture works against you.

Coupling makes itself felt at test time: isolated testing isn't impossible in Bottlenose, but the architecture makes it unnatural—you're working against the grain instead of with it.

A Design Decision: Where Does Language Logic Live?

We need to support multiple languages (Java, Python). Where should the language-specific build/test logic live?

Option A: Logic in the Action

The action knows how to build Java (Gradle) and Python. It normalizes results before sending to API.

  • Pros: API stays language-agnostic (simple!)
  • Cons: Action grows with every new language support
  • Example: the preset field ('java-gradle' or 'python-script') selects a build strategy (Strategy pattern)

Option B: Logic in the API

Action sends raw logs/XML to API. API parses and normalizes.

  • Pros: Action is very thin ("dumb pipe")
  • Cons: API becomes coupled to every language toolchain; hard to test locally
  • Risk: Scaling issues if parsing is heavy

Pawtograder chooses Option A. The action normalizes everything into a standard AutograderFeedback format. The API remains simple and scalable.

This decision is expensive to change

Fork in the road: Logic in Action (normalize early, simple API) vs Logic in API (thin action, complex/coupled API). Architect says: Normalize early.

Interface Design at the Service Level

Once we've identified components, we need to design their interfaces. The same principles from class design apply—and so do the same patterns from L17.

Dependency Injection at service scale:

// Good: GradingAction depends on abstractions
public class GradingAction {
    private final SubmissionAPI submissionApi;
    private final FeedbackAPI feedbackApi;

    public GradingAction(SubmissionAPI submissionApi, FeedbackAPI feedbackApi) {
        this.submissionApi = submissionApi;
        this.feedbackApi = feedbackApi;
    }
}
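Because GradingAction depends only on abstractions, a test can wire in in-memory fakes and exercise the whole flow with no network. A minimal sketch (the interface shapes here are simplified assumptions, not the real API):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for the service interfaces (shapes are assumptions)
interface SubmissionAPI { String register(String oidcToken); }
interface FeedbackAPI { void submitFeedback(String submissionId, String feedback); }

// The action depends only on abstractions, never on a concrete HTTP client
class GradingAction {
    private final SubmissionAPI submissionApi;
    private final FeedbackAPI feedbackApi;

    GradingAction(SubmissionAPI submissionApi, FeedbackAPI feedbackApi) {
        this.submissionApi = submissionApi;
        this.feedbackApi = feedbackApi;
    }

    void gradeAndReport(String oidcToken, String results) {
        String id = submissionApi.register(oidcToken);
        feedbackApi.submitFeedback(id, results);
    }
}

// In-memory fake: the whole flow runs with no network, database, or runner
class FakeFeedbackAPI implements FeedbackAPI {
    final List<String> submitted = new ArrayList<>();
    public void submitFeedback(String submissionId, String feedback) {
        submitted.add(submissionId + ":" + feedback);
    }
}
```

Swapping the fakes for real HTTP clients at deploy time requires no change to GradingAction itself.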

Use the Strategy pattern for extensibility:

// Builder is an interface—we can add new language support
public interface Builder {
    BuildResult build(Path projectDir, BuildConfig config);
    List<TestResult> parseTestResults(Path reportDir);
    Optional<LintResult> lint(Path projectDir, LinterConfig config);
    Optional<List<MutantResult>> mutationTest(Path projectDir, MutationConfig config);
}

// GradleBuilder for Java, PythonScriptBuilder for Python
// Adding Rust support = adding a new Builder implementation
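One way the preset string from pawtograder.yml could select a Builder is a simple registry. This is a sketch: the registry mechanism and the trimmed-down interface are illustrative, not the actual implementation:

```java
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical registry mapping a preset string to a Builder implementation;
// the Builder interface is trimmed down for this sketch
interface Builder {
    String build(String projectDir);
}

class GradleBuilder implements Builder {
    public String build(String projectDir) { return "gradle build in " + projectDir; }
}

class PythonScriptBuilder implements Builder {
    public String build(String projectDir) { return "python build in " + projectDir; }
}

class Builders {
    private static final Map<String, Supplier<Builder>> REGISTRY = Map.of(
            "java-gradle", GradleBuilder::new,
            "python-script", PythonScriptBuilder::new
    );

    // Adding Rust support: one new class plus one new registry entry
    static Builder forPreset(String preset) {
        Supplier<Builder> supplier = REGISTRY.get(preset);
        if (supplier == null) throw new IllegalArgumentException("Unknown preset: " + preset);
        return supplier.get();
    }
}
```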

Make contracts clear: What does register promise? What happens if the OIDC token is invalid? What format does AutograderFeedback use? These details belong in the interface's documentation.

The Contracts: Data Types That Cross Boundaries

The interface isn't just method signatures — it's the data types that flow across. These are the contracts that must remain stable:

// What the API returns when the action registers a submission
public record SubmissionRegistration(
    String submissionId,  // Unique ID for this grading run
    String graderUrl,     // One-time URL to download instructor's test code
    String graderSha      // SHA to verify the download wasn't tampered with
) {}

// What the action sends back after grading — THE KEY CONTRACT
public record AutograderFeedback(
    List<TestFeedback> tests,          // Results per graded part
    LintResult lint,                   // Style check results
    Optional<Double> score,            // Overall score
    List<FeedbackComment> annotations  // Line-level comments on student code
) {}

// A comment attached to a specific line of student code
public record FeedbackComment(
    String file, int line, String message, String severity
) {}

If AutograderFeedback were poorly designed — no visibility levels, no line-level comments — every future grading implementation would have to work around those limitations. This contract deserves careful design.
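To make the contract concrete, here is a sketch of constructing feedback for a build failure, reusing the record shapes above. TestFeedback and LintResult are simplified stand-ins, since their full definitions are not shown here:

```java
import java.util.List;
import java.util.Optional;

// Record shapes from the contract above; TestFeedback and LintResult are
// simplified stand-ins for the purposes of this sketch
record TestFeedback(String name, double score, double maxScore) {}
record LintResult(boolean passed) {}
record FeedbackComment(String file, int line, String message, String severity) {}
record AutograderFeedback(
        List<TestFeedback> tests,
        LintResult lint,
        Optional<Double> score,
        List<FeedbackComment> annotations
) {}

class FeedbackDemo {
    // A build failure still yields a well-formed feedback object:
    // no overall score, but a line-level comment explaining what went wrong
    static AutograderFeedback buildFailure(String file, int line, String message) {
        return new AutograderFeedback(
                List.of(),
                new LintResult(false),
                Optional.empty(),
                List.of(new FeedbackComment(file, line, message, "error")));
    }
}
```

Note how Optional<Double> lets "no score yet" be expressed honestly instead of smuggling in a sentinel like -1.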

Architecture in Your Head Is Folklore

Left: One developer with architecture in their head. Right: Three confused team members each imagining different architectures. Center: broken telephone effect. Bottom: C4 Diagrams and ADRs as solutions. Callout: Architecture in one head is folklore.

The C4 Model: Four Levels of Zoom

Level | Shows | Pawtograder Example
1. System Context | System + Users + External | Student ↔ GitHub ↔ Autograder ↔ Instructor
2. Container | Deployable Units | Grading Action (TypeScript), API (Edge Functions), Database (Supabase)
3. Component | Internals of one Container | Inside Action: ConfigParser, Builder, TestRunner, FeedbackClient
4. Code | Classes / Interfaces | PawtograderConfig interface, Builder class

Let's walk through all four levels for Pawtograder's autograder, so you can see how each level zooms in on a different scale of the same system.

C4 Level 1: System Context

Who uses the system, and what external systems does it talk to? This is the "napkin sketch" level.

At this level, the autograder is a single box. We don't care what's inside—only what it interacts with.

C4 Level 2: Container

Zoom into the "Autograder System" box. What are the major deployable units?

The narrow API boundary between the Grading Action and Pawtograder API is a deliberate choice—the action never touches the database directly.

C4 Level 3: Component (Grading Action)

This view shows how main.ts orchestrates the internal components. Mutation Analysis depends on Tests being run first, and multiple report parsers normalize tool-specific output into AutograderFeedback.

C4 Level 4: Code (Build Runner)

Zoom into the "Build Runner" component. What are the actual classes and interfaces?

Level 4 shows actual code structure. You'd rarely draw this for the whole system—it's useful for specific extension points or critical interfaces.

Choosing the Right Level

Use the right level of detail for your audience:

Audience | Useful Levels | Why
Students and TAs | Level 1 | Need to understand what the system does, not how
Instructors configuring grading | Levels 1–2 | Need to see where pawtograder.yml fits in the pipeline
Action maintainers | Levels 2–3 | Need to understand component responsibilities and dependencies
Contributors adding a new Builder | Levels 3–4 | Need to see the extension point and its interface

Architecture Decision Records (ADRs)

Diagrams show what. ADRs capture why. An ADR documents: Context, Decision, and Consequences.

ADR-001: Thick Grading Action with Narrow API vs. Thin Action with Shared Database

Context: The grading action needs to communicate results to the Pawtograder platform. We could have the action access the database directly (as Bottlenose's components share PostgreSQL), or we could put a narrow API in between.

Decision: The grading action will communicate exclusively through two API endpoints (createSubmission and submitFeedback). It will never access the database directly.

Consequences:

  • Testability: The action can run locally without a database, network, or GitHub runner
  • Changeability: Database schema can evolve freely without touching the action
  • Decoupling: Action and API can be developed and deployed independently
  • ⚠️ Complexity: The action must normalize all results into a standard AutograderFeedback format
  • ⚠️ Duplication: Some validation happens in both the action and the API

ADR Example: Security Decision

ADR-002: Dynamic Grader Download vs. Embedded Tests

Context: We need to run instructor tests against student code. We could embed tests in the student repo (hidden folder?) or download them at runtime.

Decision: The action will download the private grader at runtime using an authenticated one-time URL.

Consequences:

  • Security: Students cannot see the test code (it's never in their repo). Student code does run alongside instructor code, but the design ensures that a student who submits a Trojan horse to exfiltrate the secret grader is trivially detected.
  • Changeability: Instructors can update tests after an assignment is released, without students needing to pull any changes.
  • ⚠️ Reliability: Grading requires active network connection to Pawtograder API, adding more points of failure.
  • ⚠️ Complexity: Needs a complex auth flow to get the download URL and ensure that student code is immutably recorded before receiving access to the grader code.

ADRs create institutional memory. Without this record, someone might ask "Just put the tests in the repo!" — the ADR explains why we don't.

Plan the Infrastructure, Let the System Emerge

Three-panel city planning metaphor: Left shows planner with elaborate blueprints but empty lot (BDUF). Right shows chaotic sprawl with no infrastructure (No Design). Center shows vibrant living city with planned infrastructure but organic growth (Just Enough Architecture). A café named 'QWAN Coffee' is visible in the center.

Piecemeal Growth: Features That Emerged

Pawtograder's first version could build Java, run JUnit tests, and report a score. Features were added as real needs emerged:

Feature | When it was added | What prompted it
Line-level feedback comments | After instructors asked for richer feedback | A comment_script hook lets any external script attach comments to specific lines of student code
Detailed hints for mutants not detected | After mutation testing revealed that students struggled to understand why certain mutants weren't caught | The mutationTest() method returns detailed feedback on each mutant (configured via pawtograder.yml), not just a score
Dependency-based scoring | After multi-part assignments revealed the need | If Part 2 depends on Part 1 and Part 1 fails, Part 2 shouldn't run—avoids cascading confusing failures

None of these were in the original design. But the architecture made each addition cheap because the right boundaries were in place from the start.
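Dependency-based scoring, for example, can be sketched in a few lines. This is a hypothetical simplification of the behavior described above, not the actual implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical simplification of dependency-based scoring: if the part a
// unit depends on failed, mark it skipped rather than reporting a cascade
// of confusing failures
record PartResult(String name, boolean passed, boolean skipped) {}

class DependencyScorer {
    // dependsOn maps a part to the part that must pass first (absent = none);
    // rawResults holds each part's own pass/fail outcome
    static List<PartResult> grade(List<String> parts,
                                  Map<String, String> dependsOn,
                                  Map<String, Boolean> rawResults) {
        List<PartResult> out = new ArrayList<>();
        for (String part : parts) {
            String dep = dependsOn.get(part);
            boolean depFailed = dep != null && out.stream()
                    .anyMatch(r -> r.name().equals(dep) && !r.passed());
            if (depFailed) {
                out.add(new PartResult(part, false, true)); // skipped, not "failed"
            } else {
                out.add(new PartResult(part, rawResults.getOrDefault(part, false), false));
            }
        }
        return out;
    }
}
```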

Just Enough Architecture

Decide what's hard to reverse; defer what's easy to change; design the system so deferred decisions stay cheap.

Decision | Why it was decided early | Cost of getting it wrong
Narrow API boundary (createSubmission, submitFeedback) | Decouples action from platform—each evolves independently | Every grading action and the entire API would need rewriting
Thick action / thin API | Action normalizes results; API just stores them | Changing who interprets results means rearchitecting both sides
Config-driven grading (pawtograder.yml) | Instructors iterate independently without touching action code | Every instructor's workflow would break; action would need a UI
Strategy pattern for builders (Builder interface) | New languages = new class, not new architecture | Adding a language would require forking the entire grading pipeline

Decision | Why it was safe to defer | Where it lives
Which linter rules to enforce | Contained in a config file within the solution repo | checkstyle.xml in grader tarball
How artifacts are stored and uploaded | Behind the SupabaseAPI client—action just calls uploadArtifact() | Can switch from Supabase Storage to S3 without touching grading logic
Specific report parsing formats | Each parser is a self-contained module | surefire.ts, pitest.ts, jacoco.ts—add or replace without rippling

Looking Forward: Where These Ideas Go Next

Concept from Today | Where It Goes
Quality Attributes (testability, changeability...) | L19: Deep dive into architectural qualities — hexagonal architecture applied to CookYourBooks, tradeoffs between -ilities
Component Boundaries & APIs | L20-21: What happens when boundaries cross networks? Distributed architecture, fallacies of distributed computing, serverless
Data Coupling Decisions | L20-21: Distributed data is even harder — consistency, latency, the CAP theorem
Architecture Communication (C4, ADRs) | L22: Conway's Law — how team structure affects (and is affected by) architecture

The vocabulary you learned today — drivers, boundaries, coupling, expensive decisions — will be your lens for the rest of the course.