Skip to main content

Day 10 - Cleaning data

Skills: 2

Pre-reading: DRAFT:4.2.1

Intro (20 mins)

  • Data sources, especially if they are the result of manual entry by people, often have minor mistake in them.
  • Sometimes data is missing, othertimes it shows up in different ways.
  • Today we'll work with this synthetic example data -- which models a subset of the type of data that is available to political campaigns about voters.
  • Let's start by loading it in as a table:
    include csv

    voter-data =
    load-table: VoterID,FirstName,LastName,DOB,Party,Address,City,State,Zip,Phone,Email,LastVoted
    source: csv-table-url("https://pdi.run/f25-10-voters.csv")
    end
  • If we look at the data, we should notice a few things are off about the data.
  • For example, the party affiliation, which is often used to target outreach, has one column empty (which is perhaps Independent?), but also, a couple rows have typos!
  • So if we run filter-with(voter-data, lam(r): r["Party"] == "Republican" end) we see only two rows, despite the original data seeming to have 3. If we look really closely we can see that one of the rows is "Repbulican" (note the typo).
  • Let's create a new local file, voters.csv, copy the contents of https://pdi.run/f25-10-voters.csv into it, and fix this typo.
  • Now we can change the above to csv-table-file("voters.csv"):
    include csv

    voter-data =
    load-table: VoterID,FirstName,LastName,DOB,Party,Address,City,State,Zip,Phone,Email,LastVoted
    source: csv-table-file("voters.csv")
    end
  • When we run the same filter-with expression now, we see the three rows we expect.
  • Why did we change the source data, rather than trying to fix this in code? There are always two possible approaches -- we can normalize (remove distinctions between things that are "the same") in the original source, or we can do it in code. If there are mistakes in the original data, like that typo, it may make sense to fix the data, rather than making our analysis more complicated.
  • However, in other cases, like the empty Party being treated as Independent, we may want to do this in our analysis, but it would likely be a mistake to change the data, because it is possible that no party and Independent as a party are two different things, and later analysis would want to treat them separately. Let's do that.
    fun blank-to-indep(s :: String) -> String:
    doc: "replaces an empty string with Independent"
    if s == "":
    "Independent"
    else:
    s
    end
    where:
    blank-to-indep("") is "Independent"
    blank-to-indep("blah") is "blah"
    end
    voters-with-indep = transform-column(voter-data, "Party", blank-to-indep)
  • Now we can see our new table, where there are no blank party affiliations, but we have not removed the original blank from voters.csv.
  • Even in cases of typos, it may make sense to automate corrections -- e.g., we can see several different forms of dates in our table. Manually correcting all of them is not wrong, but could possibly introduce errors (and would be time consuming), whereas creating a normalize-date function would be more reliable.

Class Exercises (30 mins)

  • There is another issue with Party -- fix it!
  • Both State and Zip have a single row that has bad data. Identify what it is. Can this be corrected through an automated process? Or can you correct in manually?
  • Design a function normalize-phone that transforms all the phone numbers to the form NNNNNNNNNN (i.e., the same form as VoterId 4). Use transform-column to create a table with these normalized phone numbers.
  • Look at the DOB column -- can you normalize these (to, say, YYYY-MM-DD) in the same way as you did phone numbers? Why or why not? Through a combination of manual editing the source and transforming the column with a function, normalize the DOB column.
  • Now look at the LastVoted column. Unlike with DOB, there are many rows that do not conform to YYYY-MM-DD. While manually editing the one that is of form "Oct 10 2022" may may sense, manually editing many rows is likely prone to error. Think about a strategy to transform those in other forms, likely using string-substring (multiple times, with comparisons). If you are able to carry out your plan, that's great, but start by making a plan (in comments).

Wrap-up (5 mins)

  • Most data needs to be filtered or cleaned in some way.
  • Sometimes that is done manually, but often it is done via code.