Skip to main content

Recitation 12 -- Data Analysis with Pandas

Objectives

  • Create and manipulate DataFrames using Pandas
  • Filter and clean data using boolean masks
  • Compute new columns and perform aggregations
  • Visualize data patterns using Matplotlib

I. Introduction (5 minutes)

Introduce Pandas as a package for table-based data analysis in Python. We'll work with a movie streaming dataset where each record has a title, genre, runtime, rating, and release year.

Key Concepts:

  • DataFrame: A table with rows and columns
  • Series: A single column of data

II. Creating and Loading DataFrames (15 minutes)

A. Manually Creating a DataFrame

Example: Movie Streaming Data

import pandas as pd

data = {
'title': ['The Matrix', 'Inception', 'Zodiac', 'The Avengers', 'Parasite'],
'genre': ['Sci-Fi', 'Sci-Fi', 'Crime', 'Action', 'Thriller'],
'runtime': [136, 148, 154, 143, 132],
'rating': [8.7, 8.8, 8.9, 8.0, 8.6],
'year': [1999, 2010, 1994, 2012, 2019]
}
movies = pd.DataFrame(data)
print(movies)

B. Accessing Data

# access a column (returns a Series)
genres = movies['genre']
print(genres)

# access multiple columns
subset = movies[['title', 'rating']]
print(subset)

# access a specific row by index
first_movie = movies.iloc[0]
print(first_movie)

# access a specific cell
matrix_rating = movies['rating'][0]
print(f"The Matrix rating: {matrix_rating}")

C. Loading from CSV

movies = pd.read_csv('movies.csv')

III. Data Cleaning (15 minutes)

A. Handling Missing Data

Example with Missing Values:

movies_with_missing = movies.copy()
movies_with_missing.loc[2, 'rating'] = None # Zodiac missing rating
movies_with_missing.loc[4, 'genre'] = None # Parasite missing genre

print("Data with missing values:")
print(movies_with_missing)

# check for missing values
print("Missing values per column:")
print(movies_with_missing.isnull().sum())

Cleaning Strategies:

# Strategy 1: Fill missing values
movies_filled = movies_with_missing.copy()
movies_filled['rating'] = movies_filled['rating'].fillna(movies_filled['rating'].mean())
movies_filled['genre'] = movies_filled['genre'].fillna('Unknown')

print("After filling missing values:")
print(movies_filled)

# Strategy 2: Drop rows with missing values
movies_dropped = movies_with_missing.dropna()
print(f"Original rows: {len(movies_with_missing)}, After dropping: {len(movies_dropped)}")

Discussion: Ask, "When might you choose to fill missing values versus drop them?"

IV. Filtering with Boolean Masks (15 minutes)

A. Basic Filtering

Example: High-Rated Movies

# create a boolean mask for high-rated movies (>= 8.5)
high_rated_mask = movies['rating'] >= 8.5
print("High-rated mask:")
print(high_rated_mask)

# apply the mask to filter data
high_rated_movies = movies[high_rated_mask]
print("High-rated movies:")
print(high_rated_movies)

B. Multiple Conditions

Example: Recent Sci-Fi Movies

# combine conditions: sci-fi movies from 2000 or later
recent_scifi = movies[(movies['genre'] == 'Sci-Fi') & (movies['year'] >= 2000)]
print("Recent Sci-Fi movies:")
print(recent_scifi)

# movies that are either highly rated OR very long
popular_or_long = movies[(movies['rating'] >= 8.5) | (movies['runtime'] >= 150)]
print("Popular or long movies:")
print(popular_or_long)

V. Creating New Columns (15 minutes)

A. Simple Column Operations

Example: Adding Derived Data

# create a decade column
movies['decade'] = (movies['year'] // 10) * 10
print("Movies with decade:")
print(movies[['title', 'year', 'decade']])

# create a length category column
def categorize_length(runtime):
if runtime < 120:
return 'Short'
elif runtime < 150:
return 'Medium'
else:
return 'Long'

movies['length_category'] = movies['runtime'].apply(categorize_length)

B. Conditional Updates

Example: Rating Categories

# add rating tier based on score
movies['tier'] = 'Good' # default value
movies.loc[movies['rating'] >= 8.5, 'tier'] = 'Excellent'
movies.loc[movies['rating'] < 8.0, 'tier'] = 'Average'

VI. Aggregation and Grouping (10 minutes)

A. Basic Statistics

# statistics
print(f"Average rating: {movies['rating'].mean():.2f}")
print(f"Total runtime of all movies: {movies['runtime'].sum()} minutes")
print(f"Highest rated movie: {movies.loc[movies['rating'].idxmax(), 'title']}")

B. GroupBy Operations

avg_rating_by_genre = movies.groupby('genre')['rating'].mean()
print(avg_rating_by_genre)

movies_per_decade = movies.groupby('decade')['title'].count()
print(movies_per_decade)

# multiple aggregations
genre_stats = movies.groupby('genre').agg({
'rating': ['mean', 'count'],
'runtime': 'mean'
})
print(genre_stats)

VII. Basic Plotting (10 minutes)

A. Setting Up Matplotlib

import matplotlib.pyplot as plt

B. Bar Charts

Example: Movies by Genre

genre_counts = movies['genre'].value_counts()

plt.figure()
plt.bar(genre_counts.index, genre_counts.values, color='lightblue')
plt.xlabel('Genre')
plt.ylabel('Number of Movies')
plt.title('Movie Count by Genre')
plt.xticks(rotation=45) # this is just for better display
plt.tight_layout() # this is also for better display
plt.show()

C. Scatter Plots

Example: Rating vs Runtime

plt.figure()
plt.scatter(movies['runtime'], movies['rating'], c='red', alpha=0.7)
plt.xlabel('Runtime (minutes)')
plt.ylabel('Rating')
plt.title('Movie Rating vs Runtime')

# add movie titles as labels
for i, txt in enumerate(movies['title']):
plt.annotate(txt, (movies['runtime'].iloc[i], movies['rating'].iloc[i]),
xytext=(5, 5), textcoords='offset points', fontsize=8)

plt.tight_layout()
plt.show()

VIII. Data Reshaping (5 minutes)

A. Wide vs Tall Format Concept

Example: Theater vs Streaming Performance

# create movies across different platforms
wide_data = pd.DataFrame({
'title': ['The Matrix', 'Inception', 'Zodiac'],
'netflix_views': [1000000, 800000, 1200000],
'hulu_views': [500000, 600000, 400000],
'amazon_views': [300000, 400000, 350000]
})

# convert to tall format
tall_data = wide_data.melt(id_vars=['title'],
var_name='platform',
value_name='views')

IX. Recap and Wrap-Up (5 minutes)

Key Points Discussion:

  • DataFrames: Flexible table structure for data analysis
  • Filtering: Boolean masks enable data selection
  • Aggregation: GroupBy operations reveal patterns in data
  • Visualization: Charts help communicate insights
  • Data Reshaping: Wide vs tall formats serve different analysis needs

Reflection Questions:

  • idk lol i'll write these later