Recitation 12 — Data Analysis with Pandas

Data Analysis with Pandas

Pandas is a package for table-based data analysis in Python. We'll work with a movie streaming dataset where each record has a title, genre, runtime, rating, and release year.

Key Concepts:

DataFrame: A table with rows and columns
Series: A single column of data

Creating and Loading DataFrames

Manually Creating a DataFrame

import pandas as pd

data = {
    'title': ['The Matrix', 'Inception', 'Zodiac', 'The Avengers', 'Parasite'],
    'genre': ['Sci-Fi', 'Sci-Fi', 'Crime', 'Action', 'Thriller'],
    'runtime': [136, 148, 154, 143, 132],
    'rating': [8.7, 8.8, 8.9, 8.0, 8.6],
    'year': [1999, 2010, 1994, 2012, 2019]
}
movies = pd.DataFrame(data)
print(movies)

Exercise 1

Add a new movie to the DataFrame with the title "Arrival", genre "Sci-Fi", runtime 116, rating 7.9, and year 2016. Print the updated DataFrame.

Accessing Data

# access a column (returns a Series)
genres = movies['genre']
print(genres)

# access multiple columns
subset = movies[['title', 'rating']]
print(subset)

# access a specific row by index
first_movie = movies.iloc[0]
print(first_movie)

# access a specific cell
matrix_rating = movies['rating'][0]
print(f"The Matrix rating: {matrix_rating}")

Exercise 2

What is the genre of the third movie in the DataFrame?
What is the rating of "Parasite"?

Loading from CSV

movies = pd.read_csv('movies.csv')

Data Cleaning

Handling Missing Data

movies_with_missing = movies.copy()
movies_with_missing.loc[2, 'rating'] = None # Zodiac missing rating
movies_with_missing.loc[4, 'genre'] = None  # Parasite missing genre

print("Data with missing values:")
print(movies_with_missing)

# check for missing values
print("Missing values per column:")
print(movies_with_missing.isnull().sum())

Cleaning Strategies:

# Strategy 1: Fill missing values
movies_filled = movies_with_missing.copy()
movies_filled['rating'] = movies_filled['rating'].fillna(movies_filled['rating'].mean())
movies_filled['genre'] = movies_filled['genre'].fillna('Unknown')

print("After filling missing values:")
print(movies_filled)

# Strategy 2: Drop rows with missing values
movies_dropped = movies_with_missing.dropna()
print(f"Original rows: {len(movies_with_missing)}, After dropping: {len(movies_dropped)}")

Exercise 3

After filling missing values, what is the new rating for "Zodiac"?
After dropping missing values, how many movies remain?

When might you choose to fill missing values versus drop them?

Filtering with Boolean Masks

Basic Filtering

# create a boolean mask for high-rated movies (>= 8.5)
high_rated_mask = movies['rating'] >= 8.5
print("High-rated mask:")
print(high_rated_mask)

# apply the mask to filter data
high_rated_movies = movies[high_rated_mask]
print("High-rated movies:")
print(high_rated_movies)

Exercise 4

List the titles of all movies with a rating of at least 8.5.

Multiple Conditions

# combine conditions: sci-fi movies from 2000 or later
recent_scifi = movies[(movies['genre'] == 'Sci-Fi') & (movies['year'] >= 2000)]
print("Recent Sci-Fi movies:")
print(recent_scifi)

# movies that are either highly rated OR very long
popular_or_long = movies[(movies['rating'] >= 8.5) | (movies['runtime'] >= 150)]
print("Popular or long movies:")
print(popular_or_long)

Exercise 5

How many Sci-Fi movies from 2000 or later are in the DataFrame?
Which movies are either highly rated (>=8.5) or have a runtime of at least 150 minutes?

Creating New Columns

Simple Column Operations

# create a decade column
movies['decade'] = (movies['year'] // 10) * 10
print("Movies with decade:")
print(movies[['title', 'year', 'decade']])

# create a length category column
def categorize_length(runtime):
    if runtime < 120:
        return 'Short'
    elif runtime < 150:
        return 'Medium'
    else:
        return 'Long'

movies['length_category'] = movies['runtime'].apply(categorize_length)

Exercise 6

What is the length category of "The Avengers"?
How many movies are classified as "Long"?

Conditional Updates

# add rating tier based on score
movies['tier'] = 'Good' # default value
movies.loc[movies['rating'] >= 8.5, 'tier'] = 'Excellent'
movies.loc[movies['rating'] < 8.0, 'tier'] = 'Average'

Exercise 7

Which movies are classified as "Excellent"?
Which are "Average"?

Aggregation and Grouping

Basic Statistics

# statistics
print(f"Average rating: {movies['rating'].mean():.2f}")
print(f"Total runtime of all movies: {movies['runtime'].sum()} minutes")
print(f"Highest rated movie: {movies.loc[movies['rating'].idxmax(), 'title']}")

Exercise 8

What is the average rating of all movies?
What is the total runtime of all movies?
Which movie has the highest rating?

GroupBy Operations

avg_rating_by_genre = movies.groupby('genre')['rating'].mean()
print(avg_rating_by_genre)

movies_per_decade = movies.groupby('decade')['title'].count()
print(movies_per_decade)

# multiple aggregations
genre_stats = movies.groupby('genre').agg({
    'rating': ['mean', 'count'],
    'runtime': 'mean'
})
print(genre_stats)

Exercise 9

What is the average rating for each genre?
How many movies are there per decade?

Basic Plotting

Setting Up Matplotlib

import matplotlib.pyplot as plt

Bar Charts

genre_counts = movies['genre'].value_counts()

plt.figure()
plt.bar(genre_counts.index, genre_counts.values, color='lightblue')
plt.xlabel('Genre')
plt.ylabel('Number of Movies')
plt.title('Movie Count by Genre')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Exercise 10

Draw a bar chart showing the number of movies in each genre. What genre has the most movies?

Scatter Plots

plt.figure()
plt.scatter(movies['runtime'], movies['rating'], c='red', alpha=0.7)
plt.xlabel('Runtime (minutes)')
plt.ylabel('Rating')
plt.title('Movie Rating vs Runtime')

# add movie titles as labels
for i, txt in enumerate(movies['title']):
    plt.annotate(txt, (movies['runtime'].iloc[i], movies['rating'].iloc[i]), 
                xytext=(5, 5), textcoords='offset points', fontsize=8)

plt.tight_layout()
plt.show()

Exercise 11

What trend, if any, do you see between runtime and rating in the scatter plot?

Data Reshaping

Wide vs Tall Format Concept

# create movies across different platforms
wide_data = pd.DataFrame({
    'title': ['The Matrix', 'Inception', 'Zodiac'],
    'netflix_views': [1000000, 800000, 1200000],
    'hulu_views': [500000, 600000, 400000],
    'amazon_views': [300000, 400000, 350000]
})

# convert to tall format
tall_data = wide_data.melt(id_vars=['title'], 
                          var_name='platform', 
                          value_name='views')

Exercise 12

After reshaping to tall format, how many rows will the new DataFrame have?
What are the column names in the tall DataFrame?

Wrap-Up

What are the main advantages of using Pandas over regular Python lists?
When would you choose to reshape data from wide to tall format?

Data Analysis with Pandas​

Creating and Loading DataFrames​

Manually Creating a DataFrame​

Exercise 1​

Accessing Data​

Exercise 2​

Loading from CSV​

Data Cleaning​

Handling Missing Data​

Exercise 3​

Filtering with Boolean Masks​

Basic Filtering​

Exercise 4​

Multiple Conditions​

Exercise 5​

Creating New Columns​

Simple Column Operations​

Exercise 6​

Conditional Updates​

Exercise 7​

Aggregation and Grouping​

Basic Statistics​

Exercise 8​

GroupBy Operations​

Exercise 9​

Basic Plotting​

Setting Up Matplotlib​

Bar Charts​

Exercise 10​

Scatter Plots​

Exercise 11​

Data Reshaping​

Wide vs Tall Format Concept​

Exercise 12​

Wrap-Up​