Skip to main content

Day 11 - Cleaning data

Skills: 2

Pre-reading: 4.2.1

Supplementary Videos

Cleaning Data

Reference: For all work with tables, refer to the Tables page in the menu at the top of the page!

Intro (20 mins)

  • Data sources, especially if they are the result of manual entry by people, often have minor mistakes in them.
  • Sometimes data is missing, othertimes it shows up in different ways.
  • Today we'll work with this synthetic example data -- which models a subset of the type of data that is available to political campaigns (at least, in the United States) about voters.
  • Let's start by loading it in as a table:
    include csv

    voter-data =
    load-table: VoterID,FirstName,LastName,DOB,Party,Address,City,State,Zip,Phone,Email,LastVoted
    source: csv-table-url("https://raw.githubusercontent.com/neu-pdi/cs2000-public-resources/refs/heads/main/static/support/10-voters.csv", default-options)
    end
  • If we look at the data, we should notice a few things are off about the data.
  • For example, the party affiliation, which is often used to target outreach, has one column empty (which is perhaps Independent?), but also, a couple rows have typos!
  • So if we run filter-with(voter-data, lam(r :: Row) -> Boolean: r["Party"] == "Republican" end) we see only two rows, despite the original data seeming to have 3. If we look really closely we can see that one of the rows is "Repbulican" (note the typo).
  • Let's create a new file in our cs2000-scratch repository, voters.csv, copy the contents of https://raw.githubusercontent.com/neu-pdi/cs2000-public-resources/refs/heads/main/static/support/10-voters.csv (open it in a new tab to copy) into it, and then fix this typo.
  • Now we can change the above to csv-table-file("voters.csv", default-options):
    include csv

    voter-data =
    load-table: VoterID,FirstName,LastName,DOB,Party,Address,City,State,Zip,Phone,Email,LastVoted
    source: csv-table-file("voters.csv", default-options)
    end
  • When we run the same filter-with expression now, we see the three rows we expect.
  • Why did we change the source data, rather than trying to fix this in code? There are always two possible approaches -- we can normalize (remove distinctions between things that are "the same") in the original source, or we can do it in code. If there are mistakes in the original data, like that typo, it may make sense to fix the data, rather than making our analysis more complicated.
  • However, in other cases, like the empty Party being treated as Independent, we may want to do this in our analysis, but it would likely be a mistake to change the data, because it is possible that no party and Independent as a party are two different things, and later analysis would want to treat them separately. Let's do that.
    fun blank-to-indep(s :: String) -> String:
    doc: "replaces an empty string with Independent"
    if s == "":
    "Independent"
    else:
    s
    end
    where:
    blank-to-indep("") is "Independent"
    blank-to-indep("blah") is "blah"
    end
    voters-with-indep = transform-column(voter-data, "Party", blank-to-indep)
  • Now we can see our new table, where there are no blank party affiliations, but we have not removed the original blank from voters.csv.
  • Even in cases of typos, it may make sense to automate corrections -- e.g., we can see several different forms of dates in our table. Manually correcting all of them is not wrong, but doing things manually always raises the possibly of introducing errors -- e.g., accidentally deleting a character that changes a date, and not realize it was done (and even without mistakes, it could be time consuming), whereas creating a normalize-date function would be more reliable.

Class Exercises (35 mins)

  1. There is another issue with Party -- identify & fix it!
  2. Both State and Zip have a single row that has bad data. Identify what it is. Can this be corrected through an automated process? Or can you correct it manually?
  3. Design a function normalize-phone that transforms all the phone numbers to the form NNNNNNNNNN (i.e., the same form as VoterId 4), where N stands for a digit (e.g., 5551234567). Use transform-column to create a table with these normalized phone numbers.
  4. Look at the DOB column -- can you normalize these (to, say, YYYY-MM-DD) in the same way as you did phone numbers? Why or why not? Through a combination of manually editing the source and transforming the column with a function, normalize the DOB column.
  5. Now look at the LastVoted column. Unlike with DOB, there are many rows that do not conform to YYYY-MM-DD. While manually editing the one that is of form "Oct 10 2022" may make sense, manually editing many rows is likely prone to error. Think about a strategy to transform those in other forms, likely using string-substring (multiple times, with comparisons). If you are able to carry out your plan, that's great, but start by making a plan (in comments).

Wrap-up (5 mins)

  • Most data needs to be filtered or cleaned in some way.
  • Sometimes that is done manually, but often it is done via code.