Skip to main content

Recitation 12 — Data Analysis with Pandas

Data Analysis with Pandas

Pandas is a package for table-based data analysis in Python. We'll work with a movie streaming dataset where each record has a title, genre, runtime, rating, and release year.

Key Concepts:

  • DataFrame: A table with rows and columns
  • Series: A single column of data

Creating and Loading DataFrames

Manually Creating a DataFrame

import pandas as pd

data = {
'title': ['The Matrix', 'Inception', 'Zodiac', 'The Avengers', 'Parasite'],
'genre': ['Sci-Fi', 'Sci-Fi', 'Crime', 'Action', 'Thriller'],
'runtime': [136, 148, 154, 143, 132],
'rating': [8.7, 8.8, 8.9, 8.0, 8.6],
'year': [1999, 2010, 1994, 2012, 2019]
}
movies = pd.DataFrame(data)
print(movies)

Exercise 1

  • Add a new movie to the DataFrame with the title "Arrival", genre "Sci-Fi", runtime 116, rating 7.9, and year 2016. Print the updated DataFrame.

Accessing Data

# access a column (returns a Series)
genres = movies['genre']
print(genres)

# access multiple columns
subset = movies[['title', 'rating']]
print(subset)

# access a specific row by index
first_movie = movies.iloc[0]
print(first_movie)

# access a specific cell
matrix_rating = movies['rating'][0]
print(f"The Matrix rating: {matrix_rating}")

Exercise 2

  • What is the genre of the third movie in the DataFrame?
  • What is the rating of "Parasite"?

Loading from CSV

movies = pd.read_csv('movies.csv')

Data Cleaning

Handling Missing Data

movies_with_missing = movies.copy()
movies_with_missing.loc[2, 'rating'] = None # Zodiac missing rating
movies_with_missing.loc[4, 'genre'] = None # Parasite missing genre

print("Data with missing values:")
print(movies_with_missing)

# check for missing values
print("Missing values per column:")
print(movies_with_missing.isnull().sum())

Cleaning Strategies:

# Strategy 1: Fill missing values
movies_filled = movies_with_missing.copy()
movies_filled['rating'] = movies_filled['rating'].fillna(movies_filled['rating'].mean())
movies_filled['genre'] = movies_filled['genre'].fillna('Unknown')

print("After filling missing values:")
print(movies_filled)

# Strategy 2: Drop rows with missing values
movies_dropped = movies_with_missing.dropna()
print(f"Original rows: {len(movies_with_missing)}, After dropping: {len(movies_dropped)}")

Exercise 3

  • After filling missing values, what is the new rating for "Zodiac"?
  • After dropping missing values, how many movies remain?

When might you choose to fill missing values versus drop them?

Filtering with Boolean Masks

Basic Filtering

# create a boolean mask for high-rated movies (>= 8.5)
high_rated_mask = movies['rating'] >= 8.5
print("High-rated mask:")
print(high_rated_mask)

# apply the mask to filter data
high_rated_movies = movies[high_rated_mask]
print("High-rated movies:")
print(high_rated_movies)

Exercise 4

  • List the titles of all movies with a rating of at least 8.5.

Multiple Conditions

# combine conditions: sci-fi movies from 2000 or later
recent_scifi = movies[(movies['genre'] == 'Sci-Fi') & (movies['year'] >= 2000)]
print("Recent Sci-Fi movies:")
print(recent_scifi)

# movies that are either highly rated OR very long
popular_or_long = movies[(movies['rating'] >= 8.5) | (movies['runtime'] >= 150)]
print("Popular or long movies:")
print(popular_or_long)

Exercise 5

  • How many Sci-Fi movies from 2000 or later are in the DataFrame?
  • Which movies are either highly rated (>=8.5) or have a runtime of at least 150 minutes?

Creating New Columns

Simple Column Operations

# create a decade column
movies['decade'] = (movies['year'] // 10) * 10
print("Movies with decade:")
print(movies[['title', 'year', 'decade']])

# create a length category column
def categorize_length(runtime):
if runtime < 120:
return 'Short'
elif runtime < 150:
return 'Medium'
else:
return 'Long'

movies['length_category'] = movies['runtime'].apply(categorize_length)

Exercise 6

  • What is the length category of "The Avengers"?
  • How many movies are classified as "Long"?

Conditional Updates

# add rating tier based on score
movies['tier'] = 'Good' # default value
movies.loc[movies['rating'] >= 8.5, 'tier'] = 'Excellent'
movies.loc[movies['rating'] < 8.0, 'tier'] = 'Average'

Exercise 7

  • Which movies are classified as "Excellent"?
  • Which are "Average"?

Aggregation and Grouping

Basic Statistics

# statistics
print(f"Average rating: {movies['rating'].mean():.2f}")
print(f"Total runtime of all movies: {movies['runtime'].sum()} minutes")
print(f"Highest rated movie: {movies.loc[movies['rating'].idxmax(), 'title']}")

Exercise 8

  • What is the average rating of all movies?
  • What is the total runtime of all movies?
  • Which movie has the highest rating?

GroupBy Operations

avg_rating_by_genre = movies.groupby('genre')['rating'].mean()
print(avg_rating_by_genre)

movies_per_decade = movies.groupby('decade')['title'].count()
print(movies_per_decade)

# multiple aggregations
genre_stats = movies.groupby('genre').agg({
'rating': ['mean', 'count'],
'runtime': 'mean'
})
print(genre_stats)

Exercise 9

  • What is the average rating for each genre?
  • How many movies are there per decade?

Basic Plotting

Setting Up Matplotlib

import matplotlib.pyplot as plt

Bar Charts

genre_counts = movies['genre'].value_counts()

plt.figure()
plt.bar(genre_counts.index, genre_counts.values, color='lightblue')
plt.xlabel('Genre')
plt.ylabel('Number of Movies')
plt.title('Movie Count by Genre')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Exercise 10

  • Draw a bar chart showing the number of movies in each genre. What genre has the most movies?

Scatter Plots

plt.figure()
plt.scatter(movies['runtime'], movies['rating'], c='red', alpha=0.7)
plt.xlabel('Runtime (minutes)')
plt.ylabel('Rating')
plt.title('Movie Rating vs Runtime')

# add movie titles as labels
for i, txt in enumerate(movies['title']):
plt.annotate(txt, (movies['runtime'].iloc[i], movies['rating'].iloc[i]),
xytext=(5, 5), textcoords='offset points', fontsize=8)

plt.tight_layout()
plt.show()

Exercise 11

  • What trend, if any, do you see between runtime and rating in the scatter plot?

Data Reshaping

Wide vs Tall Format Concept

# create movies across different platforms
wide_data = pd.DataFrame({
'title': ['The Matrix', 'Inception', 'Zodiac'],
'netflix_views': [1000000, 800000, 1200000],
'hulu_views': [500000, 600000, 400000],
'amazon_views': [300000, 400000, 350000]
})

# convert to tall format
tall_data = wide_data.melt(id_vars=['title'],
var_name='platform',
value_name='views')

Exercise 12

  • After reshaping to tall format, how many rows will the new DataFrame have?
  • What are the column names in the tall DataFrame?

Wrap-Up

  • What are the main advantages of using Pandas over regular Python lists?
  • When would you choose to reshape data from wide to tall format?