Skip to main content

Homework 12 -- Python (Tables, Visualization)

Skills Practiced:


Introduction

You're working as a data analyst for "StreamingStats", a movie and TV streaming platform. The company has provided you with datasets containing user viewing habits, content ratings, and subscription information. Your job is to clean, analyze, and visualize this data to help the business understand viewer preferences and platform performance.

Setup

Make sure to include these imports at the top of your file:

import pandas as pd
import matplotlib.pyplot as plt

You have been provided with a CSV file containing streaming platform viewing data with the following columns: user_id, content_title, genre, watch_time_minutes, rating, subscription_type. Load the CSV file into a DataFrame and complete the following problems.

Problem 1

Part A

  • Use .loc to get the watch time for the row with index 2
  • Use .iloc to get the same watch time value using position-based indexing Verify that both methods return the same value

Part B

Filter and access data based on these conditions:

  • Find all data for user_id 105 using boolean indexing with .loc
  • Extract the genre column as a Series

Problem 2

Part A

Create boolean masks to filter the data to:

  • Find all Sci-Fi content
  • Find all viewing sessions longer than 40 minutes
  • Find all content rated above 4.0

Part B

Design a function filter_by_rating that takes a DataFrame and a minimum rating, and returns only the rows with ratings at or above that threshold.

Problem 3

You've received a messy dataset with the following issues:

messy_data = {
'show_name': ['breaking bad', 'FRIENDS', 'The Office', 'stranger things', 'Breaking Bad'],
'subscription_type': ['Premium', 'basic', 'PREMIUM', 'Basic', 'premium'],
'monthly_cost': ['12.99', '8.99', 'twelve', '8.99', '12.99'],
'user_age': [25, 'thirty-five', 28, 32, 29]
}

Clean the data by:

  1. Converting all show names to title case
  2. Standardizing subscription types to lowercase
  3. Fixing the "twelve" entry in monthly_cost and converting the column to numeric
  4. Replacing "thirty-five" with 35 and converting user_age to numeric

Problem 4

Part A

Using your cleaned viewing data, create new columns:

  • watch_hours: Convert watch_time_minutes to hours
  • binge_session: Boolean indicating if watch_time > 120 minutes

Part B

Then, create conditional columns:

  • price_tier: "Budget" for monthly_cost < 10, "Standard" for 10-15, "Premium" for >15
  • age_group: "Young" (18-30), "Middle" (31-50), "Mature" (51+)

Problem 5

The following data is given to you:

extended_data = pd.DataFrame({
'user_id': [101, 102, 103, 104, 105, 101, 102, 103, 104, 105],
'genre': ['Sci-Fi', 'Comedy', 'Drama', 'Sci-Fi', 'Comedy', 'Drama', 'Sci-Fi', 'Comedy', 'Drama', 'Sci-Fi'],
'watch_time': [45, 22, 50, 60, 30, 40, 35, 25, 55, 42],
'subscription': ['premium', 'basic', 'premium', 'basic', 'premium', 'premium', 'basic', 'basic', 'premium', 'basic']
})

From the given data, create these plots:

  1. A scatterplot showing the relationship between user_id (x-axis) and watch_time (y-axis). Color the points by subscription type using different colors for "basic" and "premium" subscribers.
  2. A frequency bar chart showing the frequency (count) of viewing sessions by genre. Use different colors for each genre bar for better readability.
  3. A histogram showing the distribution of watch_time values. Use 6 bins and add edge colors to make the bars more distinct.