Day 10 - Cleaning data
Skills: 2
Pre-reading: DRAFT:4.2.1
Intro (20 mins)
- Data sources, especially if they are the result of manual entry by people, often have minor mistake in them.
- Sometimes data is missing, othertimes it shows up in different ways.
- Today we'll work with this synthetic example data -- which models a subset of the type of data that is available to political campaigns about voters.
- Let's start by loading it in as a table:
include csv
voter-data =
load-table: VoterID,FirstName,LastName,DOB,Party,Address,City,State,Zip,Phone,Email,LastVoted
source: csv-table-url("https://pdi.run/f25-10-voters.csv")
end - If we look at the data, we should notice a few things are off about the data.
- For example, the party affiliation, which is often used to target outreach, has one column empty (which is perhaps Independent?), but also, a couple rows have typos!
- So if we run
filter-with(voter-data, lam(r): r["Party"] == "Republican" end)
we see only two rows, despite the original data seeming to have 3. If we look really closely we can see that one of the rows is"Repbulican"
(note the typo). - Let's create a new local file,
voters.csv
, copy the contents ofhttps://pdi.run/f25-10-voters.csv
into it, and fix this typo. - Now we can change the above to
csv-table-file("voters.csv")
:include csv
voter-data =
load-table: VoterID,FirstName,LastName,DOB,Party,Address,City,State,Zip,Phone,Email,LastVoted
source: csv-table-file("voters.csv")
end - When we run the same
filter-with
expression now, we see the three rows we expect. - Why did we change the source data, rather than trying to fix this in code? There are always two possible approaches -- we can normalize (remove distinctions between things that are "the same") in the original source, or we can do it in code. If there are mistakes in the original data, like that typo, it may make sense to fix the data, rather than making our analysis more complicated.
- However, in other cases, like the empty Party being treated as Independent, we
may want to do this in our analysis, but it would likely be a mistake to
change the data, because it is possible that no party and Independent as a
party are two different things, and later analysis would want to treat them
separately. Let's do that.
fun blank-to-indep(s :: String) -> String:
doc: "replaces an empty string with Independent"
if s == "":
"Independent"
else:
s
end
where:
blank-to-indep("") is "Independent"
blank-to-indep("blah") is "blah"
end
voters-with-indep = transform-column(voter-data, "Party", blank-to-indep) - Now we can see our new table, where there are no blank party affiliations, but
we have not removed the original blank from
voters.csv
. - Even in cases of typos, it may make sense to automate corrections -- e.g., we
can see several different forms of dates in our table. Manually correcting all
of them is not wrong, but could possibly introduce errors (and would be time
consuming), whereas creating a
normalize-date
function would be more reliable.
Class Exercises (30 mins)
- There is another issue with Party -- fix it!
- Both State and Zip have a single row that has bad data. Identify what it is. Can this be corrected through an automated process? Or can you correct in manually?
- Design a function
normalize-phone
that transforms all the phone numbers to the form NNNNNNNNNN (i.e., the same form as VoterId 4). Usetransform-column
to create a table with these normalized phone numbers. - Look at the DOB column -- can you normalize these (to, say, YYYY-MM-DD) in the same way as you did phone numbers? Why or why not? Through a combination of manual editing the source and transforming the column with a function, normalize the DOB column.
- Now look at the LastVoted column. Unlike with DOB, there are many rows that do
not conform to YYYY-MM-DD. While manually editing the one that is of form "Oct
10 2022" may may sense, manually editing many rows is likely prone to error.
Think about a strategy to transform those in other forms, likely using
string-substring
(multiple times, with comparisons). If you are able to carry out your plan, that's great, but start by making a plan (in comments).
Wrap-up (5 mins)
- Most data needs to be filtered or cleaned in some way.
- Sometimes that is done manually, but often it is done via code.