🎯 Midterm Project: Netflix Movies and Shows
As your midterm project in this introductory data science course, you will be working with a dataset containing listings of movies and TV shows available on Netflix up until 2021. The goal of this project is to perform data cleaning tasks and create visualizations to gain insights into the Netflix content library.
Due Date: Tuesday, 14th of November, 2023

🎯 Netflix Movies and Shows
The Netflix dataset contains comprehensive information about movies and TV shows available on the Netflix streaming platform up until 2021. It provides a listing of the vast collection of content available for viewers worldwide. The dataset includes details such as the title, genre, release year, duration, country of origin, and cast/crew information for each movie or TV show.
🎯 TODOs...
1. Data Cleaning:
- Create a notebook with the name netflix-midterm-project.ipynb where you'll do all your work.
- Load and explore the data to understand what it represents.
- Remove any duplicate entries in the dataset.
- Handle missing values by either imputing or removing them.
- Standardize and clean up the text data, such as titles or genres, to ensure consistency.
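One possible shape for the cleaning steps above is sketched here on a tiny made-up table. The column names (title, country, listed_in) and the "Unknown" placeholder are assumptions for illustration only; check the real column names after loading the dataset.

```python
import pandas as pd

# Tiny stand-in for the Netflix data; column names and values are made up.
raw = pd.DataFrame(
    {
        "title": ["  The Crown", "The  Crown", "Roma", "Dark", "Dark"],
        "country": ["United Kingdom", "United Kingdom", None, "Germany", "Germany"],
        "listed_in": ["Dramas", "dramas", "Dramas ", "Thrillers", "Thrillers"],
    }
)

# Work on a copy so the original data stays untouched.
df = raw.copy()

# Standardize text: trim the ends and collapse repeated whitespace in titles,
# then trim and use consistent capitalization for genres.
df["title"] = df["title"].str.strip().str.replace(r"\s+", " ", regex=True)
df["listed_in"] = df["listed_in"].str.strip().str.title()

# Fill missing countries with a placeholder instead of dropping the rows.
df["country"] = df["country"].fillna("Unknown")

# Remove rows that are now exact duplicates and renumber the index.
df = df.drop_duplicates().reset_index(drop=True)

print(df)
```

In this sketch the text is standardized before duplicates are removed, so rows that differ only in spacing or capitalization collapse into one. Whether to fill or drop missing values is a judgment call that depends on the column.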
2. Exploratory Data Analysis (EDA):
- Perform descriptive statistics to understand the distribution of release years, genres, and durations.
- Explore the relationship between release years and the number of movies/TV shows available.
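As a rough starting point for the EDA steps, the sketch below summarizes release years and counts titles per year. The sample rows and the column names type and release_year are assumptions, not taken from the actual dataset.

```python
import pandas as pd

# Small made-up sample; "type" and "release_year" are assumed column names.
df = pd.DataFrame(
    {
        "type": ["Movie", "TV Show", "Movie", "Movie", "TV Show", "Movie"],
        "release_year": [2018, 2018, 2019, 2019, 2020, 2020],
    }
)

# Descriptive statistics for the release years (count, mean, min, max, quartiles).
year_stats = df["release_year"].describe()

# Number of titles per release year, with one column per content type.
titles_per_year = (
    df.groupby(["release_year", "type"]).size().unstack(fill_value=0)
)

print(year_stats)
print(titles_per_year)
```

A table like titles_per_year can be passed straight to a line or bar plot to show how the number of movies and TV shows changes over time.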
3. Data Visualization:
- Visualize the TOP 10 countries contributing to the Netflix content library using a bar plot or a world map.
- Create a word cloud of the most common words in movie titles or genres to identify popular themes or trends.
- Create visualizations to analyze the distribution of content across different genres.
- Design an interactive dashboard to explore the dataset, allowing users to filter by genre, release year, or country.
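The counting behind a top-10 bar plot and a word cloud might look like the sketch below. It assumes a country column that can hold several comma-separated countries and a title column; both names, the sample rows, and the helper functions are hypothetical.

```python
from collections import Counter

import pandas as pd

# Made-up sample rows; the column names are assumptions.
df = pd.DataFrame(
    {
        "title": ["Love Actually", "Love Story", "Dark", "Dark Waters"],
        "country": ["United Kingdom, United States", "United States", "Germany", None],
    }
)


def top_countries(frame, n=10):
    """Count titles per country, splitting comma-separated country lists."""
    countries = frame["country"].dropna().str.split(",").explode().str.strip()
    return countries.value_counts().head(n)


def word_frequencies(frame, n=10):
    """Return the n most common lowercase words in the titles."""
    words = frame["title"].str.lower().str.findall(r"[a-z']+").explode().dropna()
    return Counter(words).most_common(n)


def plot_top_countries(counts):
    """Draw a horizontal bar plot of the country counts."""
    # Imported here so the counting helpers can be used without matplotlib.
    import matplotlib.pyplot as plt

    ax = counts.sort_values().plot(kind="barh")
    ax.set_xlabel("Number of titles")
    ax.set_ylabel("Country")
    ax.set_title("Top countries by number of titles")
    plt.tight_layout()
    plt.show()


top = top_countries(df)
common_words = word_frequencies(df)
print(top)
print(common_words)
```

In a notebook cell, calling plot_top_countries(top) draws the chart. The word frequencies can serve as the input for a word-cloud library of your choice, or for a simple bar plot if you prefer.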
🎯 HINTs...
- Before starting, make sure to make a copy of the original dataset to preserve the integrity of the data.
- Utilize pandas functions and methods, such as drop_duplicates(), fillna(), and str.replace(), as discussed in the lessons, to handle cleaning tasks.
- Use Seaborn and/or matplotlib libraries for visualizations. Experiment with different types of plots and charts, such as bar plots, pie charts, and word clouds.
- Focus on visualizing aspects such as the distribution of genres, the trend in release years, or the duration of movies and TV shows.
- Consider interactive visualizations such as dashboards, to enable users to explore the dataset and interact with the data.
- Document your data cleaning process and provide clear explanations and interpretations for each visualization.
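For the interactive dashboard, the filtering logic can be kept in one plain function that any widget or dashboard tool calls. The sketch below is one way to do that; the function name filter_titles and the column names listed_in, release_year, and country are assumptions.

```python
import pandas as pd

# Made-up sample rows; the column names are assumptions.
df = pd.DataFrame(
    {
        "title": ["A", "B", "C"],
        "listed_in": ["Dramas, Comedies", "Documentaries", "Dramas"],
        "release_year": [2019, 2020, 2020],
        "country": ["India", "United States", None],
    }
)


def filter_titles(frame, genre=None, year=None, country=None):
    """Return the rows matching every filter that is not None."""
    mask = pd.Series(True, index=frame.index)
    if genre is not None:
        # Case-insensitive substring match; rows with missing values never match.
        mask &= frame["listed_in"].str.contains(genre, case=False, regex=False, na=False)
    if year is not None:
        mask &= frame["release_year"] == year
    if country is not None:
        mask &= frame["country"].str.contains(country, case=False, regex=False, na=False)
    return frame[mask]


print(filter_titles(df, genre="dramas", year=2020))
```

Keeping the filter separate from the user interface makes it easy to test on its own and to connect later to dropdowns or sliders from a widget library such as ipywidgets.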
🎯 Collaboration & Teamwork
- This is a Team Project where you'll work in groups of 2-3 students.
- Form your groups and communicate with your team before you accept the assignment in GitHub Classroom.
- Join the same team in GitHub Classroom. Work on your project together.
- Ideally, find a time when you can all join a video call and work together on the project.
- Everyone in the group should have a roughly equal contribution to the project.
- You'll need to do a bit of extra googling to complete this task.
🎯 Submission
- Commit and push your project to GitHub.
- Submit your project in Gradescope as a team.
- Upload your work to Woolf (each team member should upload the files to Woolf).