Homework 12 -- Python (Tables, Visualization)
Skills Practiced:
Introduction
You're working as a data analyst for "StreamingStats", a movie and TV streaming platform. The company has provided you with datasets containing user viewing habits, content ratings, and subscription information. Your job is to clean, analyze, and visualize this data to help the business understand viewer preferences and platform performance.
Setup
Make sure to include these imports at the top of your file:
import pandas as pd
import matplotlib.pyplot as plt
You have been provided with a CSV file containing streaming platform viewing data with the following columns: user_id, content_title, genre, watch_time_minutes, rating, subscription_type. Load the CSV file into a DataFrame and complete the following problems.
Problem 1
Part A
- Use
.loc
to get the watch time for the row with index 2 - Use
.iloc
to get the same watch time value using position-based indexing Verify that both methods return the same value
Part B
Filter and access data based on these conditions:
- Find all data for user_id 105 using boolean indexing with
.loc
- Extract the genre column as a Series
Problem 2
Part A
Create boolean masks to filter the data to:
- Find all Sci-Fi content
- Find all viewing sessions longer than 40 minutes
- Find all content rated above 4.0
Part B
Design a function filter_by_rating
that takes a DataFrame and a minimum rating, and returns only the rows with ratings at or above that threshold.
Problem 3
You've received a messy dataset with the following issues:
messy_data = {
'show_name': ['breaking bad', 'FRIENDS', 'The Office', 'stranger things', 'Breaking Bad'],
'subscription_type': ['Premium', 'basic', 'PREMIUM', 'Basic', 'premium'],
'monthly_cost': ['12.99', '8.99', 'twelve', '8.99', '12.99'],
'user_age': [25, 'thirty-five', 28, 32, 29]
}
Clean the data by:
- Converting all show names to title case
- Standardizing subscription types to lowercase
- Fixing the "twelve" entry in monthly_cost and converting the column to numeric
- Replacing "thirty-five" with 35 and converting user_age to numeric
Problem 4
Part A
Using your cleaned viewing data, create new columns:
watch_hours
: Convert watch_time_minutes to hoursbinge_session
: Boolean indicating if watch_time > 120 minutes
Part B
Then, create conditional columns:
price_tier
: "Budget" for monthly_cost < 10, "Standard" for 10-15, "Premium" for >15age_group
: "Young" (18-30), "Middle" (31-50), "Mature" (51+)
Problem 5
The following data is given to you:
extended_data = pd.DataFrame({
'user_id': [101, 102, 103, 104, 105, 101, 102, 103, 104, 105],
'genre': ['Sci-Fi', 'Comedy', 'Drama', 'Sci-Fi', 'Comedy', 'Drama', 'Sci-Fi', 'Comedy', 'Drama', 'Sci-Fi'],
'watch_time': [45, 22, 50, 60, 30, 40, 35, 25, 55, 42],
'subscription': ['premium', 'basic', 'premium', 'basic', 'premium', 'premium', 'basic', 'basic', 'premium', 'basic']
})
From the given data, create these plots:
- A scatterplot showing the relationship between user_id (x-axis) and watch_time (y-axis). Color the points by subscription type using different colors for "basic" and "premium" subscribers.
- A frequency bar chart showing the frequency (count) of viewing sessions by genre. Use different colors for each genre bar for better readability.
- A histogram showing the distribution of watch_time values. Use 6 bins and add edge colors to make the bars more distinct.