Feature Encoding
In the last lesson, we saw how to transform and scale numerical features. In this lesson, we'll focus on the techniques needed to perform feature engineering on categorical features. This matters because most machine learning models can only work with numerical data; feature encoding is like translating different languages into a common language that a computer can understand.
What is feature encoding?
Since most machine learning algorithms work with numerical data, feature encoding is necessary to represent categorical information in a way that can be used for analysis or modeling.
For example, let's consider a `Gender` column with the values `Male` and `Female` in a dataset. We can encode this column into numerical values, like `0` for Male and `1` for Female. This way, the computer can understand and work with the data, allowing us to use it for various tasks, such as making predictions or finding patterns.
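To make this concrete, here is a minimal sketch of that mapping in pandas; the DataFrame and its values are hypothetical:

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

# Map each category to an integer
df["Gender_Encoded"] = df["Gender"].map({"Male": 0, "Female": 1})

print(df)
#    Gender  Gender_Encoded
# 0    Male               0
# 1  Female               1
# 2  Female               1
# 3    Male               0
```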
Encoding techniques
There are a number of different techniques for encoding categorical features, but some of the most common include...
- Label encoding
- One-hot encoding
1. Label encoding
Label encoding is a technique that converts categorical or non-numeric features into numbers by assigning each category a unique integer value. It's like giving numbers to different things so that we can refer to them by number instead of by long name.
For example, using a sample DataFrame, suppose we have a dataset of fruits whose categorical features include `Fruit_Type`, `Color`, and `Taste`. We can use label encoding to convert these features into numbers. Here's a sample code snippet:
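The following is a minimal sketch using scikit-learn's `LabelEncoder`; the DataFrame contents are hypothetical example values:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical fruit dataset
df = pd.DataFrame({
    "Fruit_Type": ["Apple", "Banana", "Orange", "Apple", "Banana"],
    "Color": ["Red", "Yellow", "Orange", "Green", "Yellow"],
    "Taste": ["Sweet", "Sweet", "Citrus", "Sour", "Sweet"],
})

# LabelEncoder assigns each category an integer code
# (alphabetical order here: Apple=0, Banana=1, Orange=2)
encoder = LabelEncoder()
df["Fruit_Type_Encoded"] = encoder.fit_transform(df["Fruit_Type"])

print(df[["Fruit_Type", "Fruit_Type_Encoded"]])
```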
In the code snippet, `Apple` is represented as `0`, `Banana` as `1`, and `Orange` as `2`. Now, we can use these encoded numbers for further analysis or modeling tasks. Here's the output of the sketch above:
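```
  Fruit_Type  Fruit_Type_Encoded
0      Apple                   0
1     Banana                   1
2     Orange                   2
3      Apple                   0
4     Banana                   1
```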
Check your understanding: Label encoding
You are working with a dataset that contains information about different houses for sale. Here's a simplified version of the dataset:
House ID: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Square Feet: [1200, 1500, 1800, 900, 2200, 1000, 1600, 1300, 1100, 1700]
Bedrooms: [2, 3, 4, 2, 4, 2, 3, 2, 2, 3]
Bathrooms: [1, 2, 2, 1, 3, 1, 2, 1, 1, 2]
Year Built: [1995, 2000, 1985, 2005, 2010, 1998, 2002, 1990, 2008, 2015]
Price ($): [150000, 200000, 230000, 120000, 280000, 140000, 210000, 180000, 160000, 220000]
Task:
Use label encoding to encode the `Bedrooms` feature into numerical values (e.g., 2 bedrooms as 0, 3 bedrooms as 1, etc.).
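Try it yourself first. One possible approach, sketched with pandas (the column is rebuilt from the values above), is:

```python
import pandas as pd

# Bedrooms values from the dataset above
df = pd.DataFrame({"Bedrooms": [2, 3, 4, 2, 4, 2, 3, 2, 2, 3]})

# Assign each unique bedroom count an integer code: 2 -> 0, 3 -> 1, 4 -> 2
df["Bedrooms_Encoded"] = df["Bedrooms"].map({2: 0, 3: 1, 4: 2})

print(df)
```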
2. One-hot encoding
One-hot encoding is another technique used to handle categorical or non-numeric features. It works by creating a new binary feature for each category. Each column represents a specific category and contains a value of `1` if the data point belongs to that category, and `0` if it does not.
Now, let's use the same example of the fruit dataset and perform one-hot encoding on the `Fruit_Type` column.
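Here is a minimal sketch using pandas' `get_dummies` on the same hypothetical DataFrame; the `dtype=int` argument keeps the new columns as `0`/`1` integers rather than booleans:

```python
import pandas as pd

# Hypothetical fruit dataset (same as in the label encoding example)
df = pd.DataFrame({
    "Fruit_Type": ["Apple", "Banana", "Orange", "Apple", "Banana"],
})

# Create one binary column per category, prefixed with "Fruit"
one_hot = pd.get_dummies(df["Fruit_Type"], prefix="Fruit", dtype=int)
df = pd.concat([df, one_hot], axis=1)

print(df)
```

Assuming this sketch, the printed result would look roughly like this:

```
  Fruit_Type  Fruit_Apple  Fruit_Banana  Fruit_Orange
0      Apple            1             0             0
1     Banana            0             1             0
2     Orange            0             0             1
3      Apple            1             0             0
4     Banana            0             1             0
```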
The output of the sample code is a DataFrame with three additional columns: `Fruit_Apple`, `Fruit_Banana`, and `Fruit_Orange`. Each column represents a fruit type; a value of `1` indicates that the row corresponds to that particular fruit, while a value of `0` indicates that it doesn't.
With one-hot encoding, we have converted the `Fruit_Type` categorical feature into binary columns, making it easier for machine learning algorithms to process and analyze the data.
Check your understanding: One-hot encoding
You are working with a dataset that contains information about different houses for sale. Here's a simplified version of the dataset:
House ID: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Square Feet: [1200, 1500, 1800, 900, 2200, 1000, 1600, 1300, 1100, 1700]
Bedrooms: [2, 3, 4, 2, 4, 2, 3, 2, 2, 3]
Bathrooms: [1, 2, 2, 1, 3, 1, 2, 1, 1, 2]
Year Built: [1995, 2000, 1985, 2005, 2010, 1998, 2002, 1990, 2008, 2015]
Price ($): [150000, 200000, 230000, 120000, 280000, 140000, 210000, 180000, 160000, 220000]
Task:
Apply one-hot encoding to the `Bathrooms` feature. Create new binary columns for each unique value in the `Bathrooms` feature.
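As before, try it yourself first. One possible sketch, again assuming pandas:

```python
import pandas as pd

# Bathrooms values from the dataset above
df = pd.DataFrame({"Bathrooms": [1, 2, 2, 1, 3, 1, 2, 1, 1, 2]})

# One binary column per unique bathroom count (1, 2, and 3)
one_hot = pd.get_dummies(df["Bathrooms"], prefix="Bathrooms", dtype=int)
df = pd.concat([df, one_hot], axis=1)

print(df)
```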
Selecting an encoding technique
Each encoding technique has its strengths and suits different scenarios. Label encoding is useful when there is an inherent order among the categories (for example, T-shirt sizes such as Small, Medium, and Large), while one-hot encoding is effective when the categories are unordered and have no numerical relationship (for example, colors such as Red, Green, and Blue). The choice of encoding technique depends on the nature of the data and the requirements of the analysis or modeling task.
➡️ In the next section, we'll be looking at Feature selection methods 🎯