Design Patterns for Handling Data 1
Functions are objects
As we saw in Lecture 10, we can pass functions as arguments into other functions. This means that a variable can represent a function.
We also saw that the variable type for a function that takes an argument of type T
and returns type R
is Callable[[T], R]
.
Functions have attributes, just like other objects that can be represented using variables. Here we access and print its attribute __name__
:
from typing import Callable
def my_function() -> None:
print('um hi')
print(my_function.__name__) # my_function
If we have a variable that holds a function, we can call (execute) that function using parentheses like normal.
def my_function() -> None:
print('um hi')
# print(my_function.__name__)
def execute_multiple(func: Callable[[], None], times: int) -> None:
"""Executes the function func times number of times"""
for _ in range(times):
func()
execute_multiple(my_function, 5)
We can also declare a function variable in one line:
add: Callable[[int, int], int] = lambda x, y: x + y
print(add(4, 5)) # 9
Poll: Which of these variables is a function that safely divides int
s or returns None
if that's impossible?
divide: Callable[[int, int], Optional[float]] = lambda num, den: num / den
divide: Callable[[int, int], float] = lambda num, den: num / den if den != 0 else None
divide: Callable[[int, int], Optional[float]] = lambda num, den: num / den if den != 0 else None
divide: Callable[[int, int], Optional[float]] = lambda num, den: try: num / den except ZeroDivisionError: None if den != 0 else None
Sorting with Pandas
Storing functions as variables is useful when sorting and filtering data.
Here is an example using Python's built-in sorted()
function:
harry_styles_words: List[str] = 'tastes like strawberries on a summer evening'.split()
sorted_alphabetically: List[str] = sorted(harry_styles_words)
print(' '.join(sorted_alphabetically)) # a evening like on strawberries summer tastes
sorted_by_length: List[str] = sorted(harry_styles_words, key = lambda word: len(word))
print(' '.join(sorted_by_length)) # a on like tastes summer evening strawberries
The sorted()
function has an optional argument key
, which is a function that is applied to each element when determining the sorted order.
Poll: Which of these lists is sorted by the number of vowels?
sorted(harry_styles_words, key=lambda word: 'aeiou' in word))
sorted(harry_styles_words, key=lambda word: len([i for i in word if i in 'aeiou']))
sorted(harry_styles_words, key=lambda word: len([i for i in word if word in 'aeiou']))
sorted(harry_styles_words, key=lambda word, i: len([i for i in word if word in 'aeiou']))
The function for sorting elements of a Pandas dataframe also has a key
argument, but the function passed to that argument should accept the entire column as an argument, and return the entire transformed column.
df = pd.DataFrame(
{'Person': ['Elephant', 'Cat', 'Frog'],
'Age': [13, 10, 3]})
print(df.sort_values(by='Person'))
Person Age
1 Cat 10
0 Elephant 13
2 Frog 3
print(df.sort_values(by='Person', key=lambda col: [len(word) for word in col]))
Person Age
1 Cat 10
2 Frog 3
0 Elephant 13
Filtering with Pandas
Quick note: Recall that we can iterate through a dataframe like this:
for index, row in df.iterrows():
# use row as a dict where the column names are its keys
We can filter a Pandas dataframe to rows that match our criteria. Here, we select the rows where the age is more than 5:
df = pd.DataFrame(
{'Person': ['Elephant', 'Cat', 'Frog'],
'Age': [13, 10, 3]})
print(df[df['Age'] > 5])
Person Age
0 Elephant 13
1 Cat 10
For criteria that are more complicated than a simple expression like df['Age'] > 5
, we can instead use a function.
This function should take the entire dataset as the argument. It should return a list of booleans that is the same length as the dataset; indexes that are marked True
will be included in the filtered dataset.
df = pd.DataFrame(
{'Person': ['Elephant', 'Cat', 'Frog'],
'Age': [13, 10, 3]})
print(df[[True, False, True]])
Person Age
0 Elephant 13
2 Frog 3
print(df[lambda dataframe: ['a' in row['Person'] for _, row in dataframe.iterrows()]])
Person Age
0 Elephant 13
1 Cat 10
Poll: Let's say we have a dataset of books, and the columns are Title
(str
), Author
(str
), and Year
(int
). How can we get the authors of the books whose titles contain the letter 'e', sorted by year?
df[lambda dataframe: ['e' in row['Title'] for _, row in dataframe.iterrows()]].sort_values(by='Year')['Author']
df[['e' in row['Title'] for _, row in df.iterrows()]].sort_values(by='Year')['Author']
df[['e' in row['Title']]].sort_values(by='Year')['Author']
df[['e' in 'Title']].sort_values(by='Year')['Author']
Scaling and normalizing data
Data will look vastly different just based on the units used. For example, length measured in inches will yield different numbers than length measured in centimeters -- even if they are both describing the same things.
It turns out that this becomes a problem for machine learning models (and also for our own visualizations). Machine learning models can give more "importance" to bigger numbers, just as visualizations can trick people into thinking that some things are larger than others.
Source: https://www.heap.io/blog/how-to-lie-with-data-visualization
To address this issue, data scientists often adjust the scale of a variable so that it is in the same range as another variable.
The process of scaling variables so that all of them are in the range between 0 and 1 is called normalizing the data. Luckily, this doesn't need a fancy Python package (though there is one). It's a simple formula:
x_normalized = (x - x_min) / (x_max - x_min)
where x_min
is the minimum and x_max
is the maximum data point for the variable x
. We transform each value of x
into x_normalized
this way, and the set of x_normalized
values will be our new dataset comprised of numbers between 0 and 1.
Poll: Let's say that this is our dataset of values for the variable x
: [6, 8, 9, 8, 7]. What is the dataset after normalizing it?
- [0.25, 0.75, 1, 0.75, 0.5]
- [0, 0.67, 1, 0.67, 0.33]
- [0, 0.5, 0.75, 0.5, 0.25]
- [0, 4, 1, 3, 2]
Argmax
It is useful to know that Numpy has a function called argsort()
which takes a list of elements and returns a list of its indices, moved to the locations they would be in if the list was sorted.
harry_styles_words: List[str] = 'tastes like strawberries on a summer evening'.split()
sorted_alphabetically: List[str] = sorted(harry_styles_words)
print(sorted_alphabetically) # ['a', 'evening', 'like', 'on', 'strawberries', 'summer', 'tastes']
indices_of_sorted_words = np.argsort(harry_styles_words)
print(indices_of_sorted_words) # [4 6 1 3 2 5 0]
This is saying that, if the list of words was sorted alphabetically, the word at index 4 would be first, then the word at index 6, then the word at index 1, etc. until the last word, which would ironically be the word at index 0.