Day 30 - Filtering, sorting & transforming with Pandas
Skills: None
Pre-reading: 10.1.4, 10.1.5
Reference: For all work with tables, refer to the Tables page in the menu at the top of the page! It covers Pyret tables, but is good reference regardless.
Intro (15 mins)
Today we learn Pandas operations that correspond to what you learned in Pyret: filtering tables, sorting, and adding/transforming columns.
Filtering rows (like Pyret's filter-with
):
# Pyret way:
# fun is-high-value(r :: Row) -> Boolean:
# r["amount"] >= 5.0
# end
# high-orders = filter-with(orders, is-high-value)
# Pandas way -- we construct "masks" that look like boolean expressions, but
# refers to an entire column (i.e., Series).
high_orders = orders[orders['quantity'] >= 2]
takeout_orders = orders[orders['order_type'] == 'takeout']
Sorting tables (like Pyret's order-by
):
# Pyret way:
# order-by(orders, "quantity", true) # ascending
# order-by(orders, "quantity", false) # descending
# Pandas way:
orders.sort_values('quantity') # ascending
orders.sort_values('quantity', ascending=False) # descending
Adding new columns (like Pyret's build-column
):
# Pyret way:
# fun calc-total(r :: Row) -> Number:
# r["quantity"] * prices[r["dish"]]
# end
# orders-with-total = build-column(orders, "total", calc-total)
# Pandas way -- we _assign_ to a new column (here 'total') and construct it
# by applying a function (here a `lambda`) that is passed a row.
prices = {'Pasta': 12.0, 'Salad': 8.0, 'Burger': 10.0}
orders['total'] = orders.apply(lambda row: row['quantity'] * prices.get(row['dish'], 0), axis=1)
Transforming existing columns (like Pyret's transform-column
):
# Pyret way:
# fun discount-price(price :: Number) -> Number:
# price * 0.9
# end
# discounted = transform-column(orders, "total", discount-price)
# Pandas way -- here, we similarly assign to a column, but this time
# an existing one. But to create it, we again use `.apply` with a function,
# but here on a single column (Series), so the function is over the single
# values, rather than the entire row. We could transform an existing column using the
# entire row (something not possible with Pyret) by using
# `orders.apply(lambda row: ...)` as above.
orders['total'] = orders['total'].apply(lambda price: price * 0.9)
Class Exercises (40 mins)
Filtering:
- Filter your workout DataFrame to show only workouts longer than 45 minutes.
- From the photos dataset, filter to show only photos with subject "Forest".
- What happens if you try to filter by a column that doesn't exist? Try it and note the error.
Sorting:
- Sort your workout DataFrame by duration from longest to shortest.
- Sort the photos dataset by date (ascending), then print the first 5 rows.
Adding new columns (like build-column
):
- Add a column called
workout_category
that is "long" if duration ≥ 60, "medium" if 30-59, "short" otherwise. - For your workouts, add a column called
calories_burned
that estimates calories as duration * 8. - In the photos dataset, add a column called
is_landscape
that checks if the location contains "Park" or "Garden".
Transforming columns (like transform-column
):
- Convert all workout durations from minutes to hours (divide by 60).
- In the photos dataset, transform the subject column to be all uppercase.
Combining operations:
- Filter workouts to only "long" category, then sort by calories burned.
- What is the average duration of workouts in each category?
Wrap-up (5 mins)
- Pandas provides the same table operations you learned in Pyret: filtering with boolean conditions, sorting by columns, adding new columns with computations, and transforming existing columns.
- The syntax is different (boolean masks vs functions,
.sort_values()
vsorder-by
), but the concepts and capabilities are the same.