Recitation 4 — Data Cleaning & Transformation
Skills: 2
Reading: 4.2
Examining Messy Data
Here is an intentionally messy table:
SURVEY-RAW = table: name, age, favorite-food, rating, state
row: "Alice", "19", "pizza", "8", "MA"
row: "Bob", "20", "PIZZA", "9", "ma"
row: "Carol", "", "tacos", "7", "California"
row: "David", "18", "Pizza", "ten", "Mass"
row: "Eve", "21", "sushi", "", "MA"
row: "Frank", "19", "ice cream", "6", "massachusetts"
row: "Grace", "twenty", "Pizza", "8", "CA"
row: "Henry", "19", "ICE CREAM", "5", "calif"
end
Identifying Data Quality Issues
First, let's look at the data and identify problems (Inconsistent capitalization, Missing values, Inconsistent formats, Inconsistent state abbreviations)
Discussion: "What questions would be hard to answer?"
Cleaning Text Data
Standardizing Case
Can normalize text using string functions -- can build-column
a new column using string-to-lower
of existing favorite-food
column.
Check Understanding: "Why might we want all food names in lowercase?"
Standardizing State Names
Create a more complex transformation for states -- design a function that first uses string-to-lower
but then checks using if
if the lowercased version is one of the ma
variants or one of the ca
variants and replaces each with a standard version.
Handling Missing and Invalid Data
Dealing with Missing Ages
What different strategies can we use for missing ages?
- Can remove the data. In this case, create a new table that only includes data where the age is present.
- Can put a default age. In this case, add a new column with a cleaned age column, where you fill in a default age (say, 20) in case it was missing.
Discussion: Ask, "When might you choose between filtering vs. filling missing data?"
Converting Text Numbers to Actual Numbers
If people wrote words instead of numbers, we can convert them. If they were
arbitrary numbers, this could be complex, but if we expect numbers to only be
1-10 (e.g., in the rating
column), then can write a large if ... else if... else...
with cases for "", "ten", "nine", ..., and then the last case we can
just convert from a number like "8"
to a number 8 with string-to-number
.
Design such a helper function, and use it to add a rating-clean
column.
Check Understanding: Ask, "Why should we use -1 for missing ratings instead of 0?"
Computing New Columns
Sometimes we might also want to create computed columns based on existing data.
For example, maybe we want a rating-category
column that distinguishes between
"no rating" (-1, based on previous cleaning), "high" (above 7), medium (above
5), and low.
Add such a column.
Wrap-Up
- Incremental Cleaning: Building columns step by step rather than trying to fix everything at once
- Trade-offs: Different strategies for missing data have different implications
Reflection Questions:
- "What would happen if we tried to analyze the original messy data?"