♻️ Data Cleaning
In data science, unclean data refers to a dataset that contains errors, inconsistencies, or inaccuracies, making it unsuitable for analysis without preprocessing. Such data may have missing values, duplicate entries, incorrect formatting, inconsistent naming conventions, outliers, or other issues that can impact the quality and reliability of the data. These problems can arise from various sources, such as:
- Data entry errors
- Measurement inaccuracies
- Technical issues during data collection or storage.
Cleaning the data involves identifying and addressing these issues to ensure that the dataset is accurate, complete, and reliable before further analysis or modeling takes place.
Data cleaning aims to improve the integrity, completeness, and consistency of the data. When cleaning a dataset, our goal is to produce a clean and reliable dataset that is ready for further analysis. By investing time and effort in data cleaning, we can improve the accuracy and credibility of our analysis results, leading to more robust and reliable insights. To understand this better, we'll be looking at the following.
Data cleaning with Google Sheets ♻️
In GS, data cleaning can involve tasks such as removing duplicate values, correcting misspellings, handling missing data by filling in or deleting the values, and formatting data appropriately. GS provides us with various built-in functions and tools, such as filters, conditional formatting, and formulas, that can help with data cleaning tasks.
When we carry out data cleaning, we improve the quality of our datasets and ensure that the data is ready for further analysis or visualization. To understand how to clean a dataset, we'll be looking at four topics in this lesson.
1. Handling missing data
Missing data is one of the most frequently occurring problems you can face as a data scientist. Watch the next video to get an idea of how important it is to understand this problem and its possible causes.
As a data scientist, there are many ways of dealing with missing values in a dataset. For this lesson, we'll be looking at 4 different techniques for handling missing data: dropping, filling with a constant, filling with statistics, and interpolation.
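The four techniques can also be sketched in code. The example below is a minimal illustration in plain Python rather than Google Sheets; the `values` column and the function names are hypothetical, and the mean is used as one possible statistic for filling.

```python
from statistics import mean

# Hypothetical column of readings; None marks a missing cell.
values = [10.0, None, 14.0, None, None, 20.0]


def drop_missing(col):
    """Dropping: keep only the cells that have a value."""
    return [v for v in col if v is not None]


def fill_constant(col, constant=0.0):
    """Filling with a constant: replace every gap with one fixed value."""
    return [constant if v is None else v for v in col]


def fill_mean(col):
    """Filling with statistics: replace every gap with the column mean."""
    avg = mean(drop_missing(col))
    return [avg if v is None else v for v in col]


def interpolate_linear(col):
    """Interpolation: estimate each gap from the known values around it.

    Gaps before the first or after the last known value stay as None.
    """
    result = list(col)
    known = [i for i, v in enumerate(col) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (col[right] - col[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = col[left] + step * (i - left)
    return result


print(drop_missing(values))
print(fill_constant(values))
print(fill_mean(values))
print(interpolate_linear(values))
```

Dropping shrinks the dataset, while the three filling approaches keep every row; which one is appropriate depends on how much data is missing and why.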
2. Removing Duplicates
Duplicate data are rows or records within a dataset with identical or nearly identical values across all or most of their attributes. This can occur for various reasons, such as data entry errors, system glitches, or merging data from different sources. As a data scientist, there are a number of ways to handle duplicate data in a small or large dataset.
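As a rough sketch of the idea outside the spreadsheet, the plain Python example below removes exact duplicates first and then near duplicates. The sample `rows` and the function names are hypothetical, and treating rows as "nearly identical" when they differ only in letter case or surrounding spaces is an assumption made for illustration.

```python
# Hypothetical records: (name, email)
rows = [
    ("Ada", "ada@example.com"),
    ("Kofi", "kofi@example.com"),
    ("Ada", "ada@example.com"),    # exact duplicate of the first row
    (" ada ", "ADA@example.com"),  # near duplicate: case and spacing differ
]


def remove_exact_duplicates(records):
    """Keep the first occurrence of each identical row, preserving order."""
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


def normalize(record):
    """Build a comparison key that ignores case and surrounding spaces."""
    return tuple(field.strip().lower() for field in record)


def remove_near_duplicates(records):
    """Keep the first row for each normalized key, preserving order."""
    seen = set()
    unique = []
    for record in records:
        key = normalize(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


print(remove_exact_duplicates(rows))
print(remove_near_duplicates(rows))
```

Comparing on a normalized key while keeping the original row means the cleaned dataset still shows the values as they were first entered.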
3. #VALUE! Error
The #VALUE! error in Google Sheets is caused by attempting to perform calculations or operations involving cells containing incompatible data types, such as trying to add text to a numerical calculation or using dates in an inappropriate context.
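The same kind of type mismatch can be illustrated outside the spreadsheet. The sketch below uses plain Python as an analogy only: Python raises a `TypeError` where a sheet would show an error value, and the `cells` list and the `to_number` helper are hypothetical. The fix in both settings is the same idea, which is converting text that looks like a number into a real number before calculating.

```python
def to_number(cell):
    """Convert a cell to a float, or return None if it is not numeric."""
    if isinstance(cell, (int, float)):
        return float(cell)
    try:
        # Strip surrounding spaces and thousands separators before parsing.
        return float(cell.strip().replace(",", ""))
    except (ValueError, AttributeError):
        return None


# Hypothetical column mixing numbers with numbers stored as text.
cells = [120, "35", " 1,200 ", "n/a"]

# Adding the raw cells fails because text and numbers are mixed.
try:
    sum(cells)
    mixed_sum_failed = False
except TypeError:
    mixed_sum_failed = True

# After conversion, only genuinely non-numeric cells are left out.
converted = [to_number(c) for c in cells]
total = sum(v for v in converted if v is not None)

print(mixed_sum_failed)
print(converted)
print(total)
```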
4. Conditional Formatting
Conditional formatting is used in Google Sheets, as well as other spreadsheet software, to visually highlight and emphasize data based on specific conditions or criteria.
Smart Cleanup in GS
With the GS Smart Cleanup feature, we can carry out many cleaning tasks more easily. Let's look at two important functionalities of this feature:
- Finding Problems: It looks at your dataset and tries to find out whether there could be any problems in it. For example, are there any duplicates in the dataset? Is there anything that might be spelled incorrectly? This gives you a chance to fix your dataset before you analyze it.
- Statistics: Smart Cleanup can look at a column and give some statistics based on that column.
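To get a feel for what a per-column summary can contain, here is a minimal sketch in plain Python. It does not reproduce how Smart Cleanup works internally; the `column_summary` function, the statistics it picks, and the sample `cases` column are all assumptions chosen for illustration.

```python
from collections import Counter
from statistics import mean, median


def column_summary(column):
    """Summarize one column: blanks, distinct values, and basic statistics."""
    non_blank = [c for c in column if c not in (None, "")]
    counts = Counter(non_blank)
    numeric = [c for c in non_blank if isinstance(c, (int, float))]
    summary = {
        "rows": len(column),
        "blank": len(column) - len(non_blank),
        "unique": len(counts),
        # (value, frequency) of the most frequent entry, if there is one
        "most_common": counts.most_common(1)[0] if counts else None,
    }
    if numeric:
        summary.update(
            min=min(numeric),
            max=max(numeric),
            mean=mean(numeric),
            median=median(numeric),
        )
    return summary


# Hypothetical column with blanks and a repeated value.
cases = [12, 7, None, 12, 30, ""]
print(column_summary(cases))
```

A repeated value or a high blank count in such a summary is a hint that the column needs attention before analysis.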
📺 Watch the video below on using Smart Cleanup and practice along.
👩🏾🎨 Practice: Clean the smell... 🎯
A smaller sample of the global COVID-19 dataset is provided here for this exercise.
- Create a copy of the dataset for your own use.
- Explore the dataset to get a sense of what it represents.
- By leveraging your data cleaning skills, attempt the following...
- Remove duplicate data, if any exists.
- Handle blank spaces.
- Convert the column from text to number.
- Implement other cleaning techniques of your choice.