Homework 12 — Housing Data Analysis (Python)
Skills: None
Due
Thursday, December 4, 2025 at 6PM (Oakland) or 9PM (Boston)
Submission
Complete this homework in VS Code on your computer and submit it to GitHub via the Source Control tab. Go to Pawtograder to find the repository, which you can clone locally to do the assignment. Each commit automatically creates a submission, and you can view feedback on Pawtograder.
Introduction
In this assignment, you'll work with the same housing data from New York City as you did in Homework 4 and Homework 5. You'll practice common data analysis tasks using Pandas.
Problem 1
You have been provided with the same housing dataset as before, in the file project-housing.csv. It contains some missing values (which are represented by Pandas as NaN, which stands for Not a Number). Your first task is to clean the data by removing rows that contain any missing values.
- Load the CSV file project-housing.csv into a DataFrame named housing_data
- Remove all rows that contain any missing values (NaN). You can do this with the dropna() function on DataFrames.
Important: Make sure your code passes the provided test in test_assignment.py before moving on. This test verifies that your dataframe does not contain any NaN values. This checkpoint ensures you don't proceed to later problems with incomplete data.
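As a rough illustration of how dropna() behaves, here is a minimal sketch on a small made-up DataFrame. The column names and values are hypothetical and are not taken from project-housing.csv.

```python
import numpy as np
import pandas as pd

# Toy data standing in for a real CSV; column names are hypothetical.
toy = pd.DataFrame({
    "address": ["1 Main St", "2 Oak Ave", None, "4 Elm Rd"],
    "price": [500000.0, np.nan, 420000.0, 610000.0],
})

# With default arguments, dropna() returns a new DataFrame that omits
# every row containing at least one missing value.
cleaned = toy.dropna()

# isna().any().any() is True if any cell in the frame is still missing.
print(len(toy), len(cleaned))
print(cleaned.isna().any().any())
```

Note that dropna() does not modify the original DataFrame unless you assign the result back to a name, so the original toy frame above still has all four rows.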
Problem 2
Filter the data to contain only properties within a bounding box that roughly contains Manhattan (it is rectangular, so includes parts of the Bronx, Queens, and Brooklyn as well). The bounding box coordinates are:
- Southwest corner: Latitude 40.701735, Longitude -74.019756
- Northeast corner: Latitude 40.873392, Longitude -73.908128
Create a new DataFrame called bounded_housing_data that contains only the rows where:
- The latitude column is between 40.701735 and 40.873392 (inclusive)
- The longitude column is between -74.019756 and -73.908128 (inclusive)
Hint: Look at the between function offered by Pandas for Series objects.
Use bounded_housing_data for all remaining problems.
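The sketch below shows one way the between function can be combined across two columns to keep only rows inside a rectangle. It uses a made-up DataFrame with hypothetical columns x and y and arbitrary bounds, not the assignment's coordinates.

```python
import pandas as pd

# Toy data; the column names and bounds are hypothetical.
toy = pd.DataFrame({
    "x": [0.5, 1.0, 2.5, 3.0, 4.2],
    "y": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# between() returns a Boolean Series; both endpoints are included by default.
in_x = toy["x"].between(1.0, 3.0)
in_y = toy["y"].between(15.0, 35.0)

# Combine the two masks with & (element-wise "and") and index with the result.
# .copy() makes the result an independent DataFrame, which can help avoid
# warnings if new columns are added to it later.
inside = toy[in_x & in_y].copy()
print(inside)
```

Here only the rows whose x and y both fall inside their ranges are kept. A row with x equal to 3.0 passes the x test because the bounds are inclusive, but it is dropped if its y value is out of range.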
Problem 3
In this problem, you'll write functions to compute basic statistics and apply them to the housing data.
Part A
Write a function called my_mean that takes a list of numbers and returns the mean (average). The mean is calculated as the sum of all values divided by the number of values.
Part B
Write a function called my_std that takes a list of numbers and returns the standard deviation. The standard deviation is calculated as follows:
- Compute the mean of the values (you can use your my_mean function)
- For each value, compute the difference between that value and the mean, then square that difference
- Sum all of these squared differences
- Divide the sum by the number of values
- Take the square root of the result
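Because the steps above divide by the number of values (not by one less than that), they describe what is usually called the population standard deviation. One possible way to sanity-check your own function is to compare it against Python's statistics module, as in this sketch. The sample values are made up for illustration.

```python
import math
import statistics

# Made-up values: the mean is 5 and the squared differences sum to 32.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# pstdev divides by N, matching the steps listed above.
population_std = statistics.pstdev(values)

# stdev divides by N - 1, so it gives a slightly larger number.
sample_std = statistics.stdev(values)

print(population_std, sample_std)
print(math.isclose(population_std, 2.0))
```

Keep this difference in mind if you compare against Pandas: Series.std() divides by N - 1 by default (ddof=1), so it will not match a function written from the steps above unless you pass ddof=0.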
Part C
Extract the lotarea and bldgarea columns from bounded_housing_data and compute:
- mean_lotarea: the mean of the lotarea column
- std_lotarea: the standard deviation of the lotarea column
- mean_bldgarea: the mean of the bldgarea column
- std_bldgarea: the standard deviation of the bldgarea column
Use your my_mean and my_std functions to compute these values.
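Since my_mean and my_std take a list, a column has to be turned into a plain Python list before it is passed in. A minimal sketch of that conversion, using a made-up DataFrame with hypothetical column names:

```python
import pandas as pd

# Toy data; the column names are hypothetical.
toy = pd.DataFrame({"height": [1.5, 2.0, 2.5], "width": [3, 4, 5]})

# Selecting one column gives a Series; tolist() converts it to a Python list.
heights = toy["height"].tolist()

print(type(heights).__name__)
print(sum(heights) / len(heights))
```

Any function written for ordinary lists can then be called on the result.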
Problem 4
Use the apply function in Pandas to create a new column called bldgarea_percentage in bounded_housing_data. This column should represent the percentage of the lot area occupied by the building, calculated as:
bldgarea_percentage = (bldgarea / lotarea) * 100
Important: Some properties have a lotarea of 0, which would cause a division by zero error. Your function should handle this case by returning 0 when lotarea is 0.
Write a function called compute_building_percentage that takes a row (as a Series) and returns the building percentage. Then use apply with axis=1 to create the new column.
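To illustrate the row-wise apply pattern without using the assignment's columns, here is a sketch on a made-up DataFrame. The function cost_per_item and the columns total_cost and num_items are hypothetical names chosen for this example.

```python
import pandas as pd

# Toy data; the column names are hypothetical.
toy = pd.DataFrame({
    "total_cost": [120.0, 80.0, 45.0],
    "num_items": [4, 0, 3],
})

def cost_per_item(row):
    """Return total_cost / num_items for one row, or 0 if num_items is 0."""
    # Check the denominator first so we never divide by zero.
    if row["num_items"] == 0:
        return 0
    return row["total_cost"] / row["num_items"]

# axis=1 tells apply to call the function once per row, passing that row
# as a Series indexed by column name.
toy["cost_per_item"] = toy.apply(cost_per_item, axis=1)
print(toy)
```

Without axis=1, apply would pass each column to the function instead of each row, and looking up a column name inside the function would fail.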
Problem 5
Filter bounded_housing_data to create two separate DataFrames:
- city_owned: properties where ownertype equals "C"
- tax_exempt: properties where ownertype equals "X"
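Filtering rows by a column's value is commonly done with a Boolean mask. The sketch below shows the pattern on a made-up DataFrame. The columns name and category are hypothetical.

```python
import pandas as pd

# Toy data; the column names are hypothetical.
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "category": ["P", "C", "X", "C"],
})

# Comparing a Series with == produces a Boolean Series, one entry per row.
is_c = toy["category"] == "C"

# Indexing the DataFrame with that mask keeps only the rows marked True.
only_c = toy[is_c]
print(only_c)
```

The original DataFrame is left unchanged, so the same approach can be repeated with a different value to build a second filtered DataFrame.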
A helper function plot_latlon_columns has been provided in the starter code. This function takes two DataFrames (that should both have latitude and longitude columns) and two labels, then creates a scatter plot showing the locations of properties from both DataFrames in different colors.
Call plot_latlon_columns with city_owned (labeled "City-Owned") and tax_exempt (labeled "Tax-Exempt") to visualize where these properties are located.