Skip to main content

Recitation 4 — Data Cleaning & Transformation

Skills: 2

Reading: 4.2

Examining Messy Data

Here is an intentionally messy table:

SURVEY-RAW = table: name, age, favorite-food, rating, state
row: "Alice", "19", "pizza", "8", "MA"
row: "Bob", "20", "PIZZA", "9", "ma"
row: "Carol", "", "tacos", "7", "California"
row: "David", "18", "Pizza", "ten", "Mass"
row: "Eve", "21", "sushi", "", "MA"
row: "Frank", "19", "ice cream", "6", "massachusetts"
row: "Grace", "twenty", "Pizza", "8", "CA"
row: "Henry", "19", "ICE CREAM", "5", "calif"
end

Identifying Data Quality Issues

First, let's look at the data and identify problems (Inconsistent capitalization, Missing values, Inconsistent formats, Inconsistent state abbreviations)

Discussion: "What questions would be hard to answer?"

Cleaning Text Data

Standardizing Case

Can normalize text using string functions -- can build-column a new column using string-to-lower of existing favorite-food column.

Check Understanding: "Why might we want all food names in lowercase?"

Standardizing State Names

Create a more complex transformation for states -- design a function that first uses string-to-lower but then checks using if if the lowercased version is one of the ma variants or one of the ca variants and replaces each with a standard version.

Handling Missing and Invalid Data

Dealing with Missing Ages

What different strategies can we use for missing ages?

  • Can remove the data. In this case, create a new table that only includes data where the age is present.
  • Can put a default age. In this case, add a new column with a cleaned age column, where you fill in a default age (say, 20) in case it was missing.

Discussion: Ask, "When might you choose between filtering vs. filling missing data?"

Converting Text Numbers to Actual Numbers

If people wrote words instead of numbers, we can convert them. If they were arbitrary numbers, this could be complex, but if we expect numbers to only be 1-10 (e.g., in the rating column), then can write a large if ... else if... else... with cases for "", "ten", "nine", ..., and then the last case we can just convert from a number like "8" to a number 8 with string-to-number.

Design such a helper function, and use it to add a rating-clean column.

Check Understanding: Ask, "Why should we use -1 for missing ratings instead of 0?"

Computing New Columns

Sometimes we might also want to create computed columns based on existing data. For example, maybe we want a rating-category column that distinguishes between "no rating" (-1, based on previous cleaning), "high" (above 7), medium (above 5), and low.

Add such a column.

Wrap-Up

  • Incremental Cleaning: Building columns step by step rather than trying to fix everything at once
  • Trade-offs: Different strategies for missing data have different implications

Reflection Questions:

  • "What would happen if we tried to analyze the original messy data?"