Skip to main content

Recitation 4 -- Data Cleaning & Transformation

Skills: 2

Reading: 4.2

Objectives

  • Data cleaning
  • Adding columns by computing from existing columns

I. Introduction (5 minutes)

Overview: Explain that real-world data is rarely clean and ready to use. Students will work with a "messy" dataset of student survey responses about favorite foods, learning to clean and transform it into a usable format for analysis.

II. Examining Messy Data (10 minutes)

A. The Raw Dataset

Present students with this intentionally messy table:

SURVEY-RAW = table: name, age, favorite-food, rating, state
row: "Alice", "19", "pizza", "8", "MA"
row: "Bob", "20", "PIZZA", "9", "ma"
row: "Carol", "", "tacos", "7", "California"
row: "David", "18", "Pizza", "ten", "Mass"
row: "Eve", "21", "sushi", "", "MA"
row: "Frank", "19", "ice cream", "6", "massachusetts"
row: "Grace", "twenty", "Pizza", "8", "CA"
row: "Henry", "19", "ICE CREAM", "5", "calif"
end

B. Identifying Data Quality Issues

Have students examine the data and identify problems:

  • Inconsistent capitalization: "pizza" vs "PIZZA" vs "Pizza"
  • Missing values: Empty strings in age, rating columns
  • Inconsistent formats: "twenty" vs "20", "ten" vs "10"
  • Inconsistent state abbreviations: "MA" vs "ma" vs "Mass" vs "massachusetts"

Discussion: Ask, "What makes this data difficult to work with? What questions would be hard to answer?"

III. Cleaning Text Data (15 minutes)

A. Standardizing Case

Show students how to normalize text using string functions:

food-normalized = build-column(SURVEY-RAW, "food-clean", lam(r): string-to-lower(r["favorite-food"]) end)

Check Understanding: Ask, "Why might we want all food names in lowercase?"

B. Standardizing State Names

Create a more complex transformation for states:

fun clean-state(state-input):
doc: "convert various state formats to standard abbreviations"
cleaned = string-to-lower(string-trim(state-input))
if (cleaned == "ma") or (cleaned == "mass") or (cleaned == "massachusetts"):
"MA"
else if (cleaned == "ca") or (cleaned == "california") or (cleaned == "calif"):
"CA"
end
end

state-normalized = build-column(food-normalized, "state-clean", lam(r): clean-state(r["state"]) end)

Discussion: Ask, "Why do we use string-trim and string-to-lower first?"

IV. Handling Missing and Invalid Data (15 minutes)

A. Dealing with Missing Ages

Show different strategies for missing numeric data:

# Strategy 1: Filter out missing ages
complete-ages = filter-with(state-normalized, lam(r): not(r["age"] == "") end)

# Strategy 2: Replace missing ages with a default value
DEFAULT-AGE = 20

age-filled = build-column(state-normalized, "age-clean", lam(r):
if r["age"] == "":
DEFAULT-AGE
else:
string-to-number(r["age"])
end
end)

Discussion: Ask, "When might you choose between filtering vs. filling missing data?"

B. Converting Text Numbers to Actual Numbers

Handle the mixed numeric/text rating problem:

fun clean-rating(rating-input):
doc: "converts rating to number, handling text entries"
if rating-input == "":
-1
else if rating-input == "ten":
10
else if rating-input == "nine":
9
else if rating-input == "eight":
8
else:
string-to-number(rating-input)
end
end

ratings-cleaned = build-column(age-filled, "rating-clean", lam(r): clean-rating(r["rating"]) end)

Check Understanding: Ask, "Why should we use -1 for missing ratings instead of 0?"

V. Computing New Columns (10 minutes)

Creating Derived Data

Add computed columns based on existing data:

# Add ratings category column
final-table = build-column(ratings-cleaned, "rating-category", lam(r):
rating = r["rating-clean"]
if rating == -1:
"no-rating"
else if rating >= 8:
"high"
else if rating >= 6:
"medium"
else:
"low"
end
end)

VI. Recap and Wrap-Up (5 minutes)

Key Points Discussion:

  • Incremental Cleaning: Building columns step by step rather than trying to fix everything at once
  • Trade-offs: Different strategies for missing data have different implications

Reflection Questions:

  • "How did adding computed columns make the data more useful?"
  • "What would happen if we tried to analyze the original messy data?"