Feature Encoding
In the last lesson, we saw how to transform and scale numerical features. In this lesson, we'll focus on the techniques needed to perform feature engineering on categorical features. This matters because most machine learning models can only work with numerical data; feature encoding is like translating different languages into a common language that a computer can understand.
What is feature encoding?
Since most machine learning algorithms work with numerical data, feature encoding is necessary to represent categorical information in a way that can be used for analysis or modeling.
For example, let's consider a `Gender` column with the values `Male` and `Female` in a dataset. We can encode this column into numerical values, like `0` for Male and `1` for Female. This way, the computer can understand and work with the data, allowing us to use it for various tasks, such as making predictions or finding patterns.
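To make this concrete, here is a minimal sketch of that mapping in pandas; the DataFrame and its values are hypothetical:

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

# Map each category to an integer
df["Gender_Encoded"] = df["Gender"].map({"Male": 0, "Female": 1})

print(df)
#    Gender  Gender_Encoded
# 0    Male               0
# 1  Female               1
# 2  Female               1
# 3    Male               0
```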
Encoding techniques
There are a number of different techniques for encoding categorical features, but some of the most common include...
- Label encoding
- One-hot encoding
1. Label encoding
Label encoding is a technique that converts categorical or non-numeric features into numbers by assigning each category a unique integer value. It's like giving numbers to different things so that we can refer to them by number instead of by long name.
For example, using a sample DataFrame, suppose we have a dataset of fruits whose categorical features include `Fruit_Type`, `Color`, and `Taste`. We can use label encoding to convert these features into numbers. Here's a sample code snippet:
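The following is a minimal sketch using scikit-learn's `LabelEncoder`; the DataFrame contents are hypothetical example values:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical fruit dataset
df = pd.DataFrame({
    "Fruit_Type": ["Apple", "Banana", "Orange", "Apple", "Banana"],
    "Color": ["Red", "Yellow", "Orange", "Green", "Yellow"],
    "Taste": ["Sweet", "Sweet", "Citrus", "Sour", "Sweet"],
})

# LabelEncoder assigns each category an integer code
# (alphabetical order here: Apple=0, Banana=1, Orange=2)
encoder = LabelEncoder()
df["Fruit_Type_Encoded"] = encoder.fit_transform(df["Fruit_Type"])

print(df[["Fruit_Type", "Fruit_Type_Encoded"]])
```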
In the code snippet, `Apple` is represented as `0`, `Banana` as `1`, and `Orange` as `2`. Now, we can use these encoded numbers for further analysis or modeling tasks. Here's the output of the sketch above:
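```
  Fruit_Type  Fruit_Type_Encoded
0      Apple                   0
1     Banana                   1
2     Orange                   2
3      Apple                   0
4     Banana                   1
```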
Check your understanding: Label encoding
You are working with a dataset that contains information about different houses for sale. Here's a simplified version of the dataset:
House ID: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Square Feet: [1200, 1500, 1800, 900, 2200, 1000, 1600, 1300, 1100, 1700]
Bedrooms: [2, 3, 4, 2, 4, 2, 3, 2, 2, 3]
Bathrooms: [1, 2, 2, 1, 3, 1, 2, 1, 1, 2]
Year Built: [1995, 2000, 1985, 2005, 2010, 1998, 2002, 1990, 2008, 2015]
Price ($): [150000, 200000, 230000, 120000, 280000, 140000, 210000, 180000, 160000, 220000]
Task:
Use label encoding to encode the `Bedrooms` feature into numerical values (e.g., 2 bedrooms as 0, 3 bedrooms as 1, etc.).
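Try it yourself first. One possible approach, sketched with pandas (the column is rebuilt from the values above), is:

```python
import pandas as pd

# Bedrooms values from the dataset above
df = pd.DataFrame({"Bedrooms": [2, 3, 4, 2, 4, 2, 3, 2, 2, 3]})

# Assign each unique bedroom count an integer code: 2 -> 0, 3 -> 1, 4 -> 2
df["Bedrooms_Encoded"] = df["Bedrooms"].map({2: 0, 3: 1, 4: 2})

print(df)
```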
2. One-hot encoding
One-hot encoding is another technique used to handle categorical or non-numeric features. It works by creating a new binary feature for each category. Each column represents a specific category and contains a value of `1` if the data point belongs to that category, and `0` if it does not.
Now, let's use the same example of the fruit dataset and perform one-hot encoding on the `Fruit_Type` column.
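Here is a minimal sketch using pandas' `get_dummies` on the same hypothetical DataFrame; the `dtype=int` argument keeps the new columns as `0`/`1` integers rather than booleans:

```python
import pandas as pd

# Hypothetical fruit dataset (same as in the label encoding example)
df = pd.DataFrame({
    "Fruit_Type": ["Apple", "Banana", "Orange", "Apple", "Banana"],
})

# Create one binary column per category, prefixed with "Fruit"
one_hot = pd.get_dummies(df["Fruit_Type"], prefix="Fruit", dtype=int)
df = pd.concat([df, one_hot], axis=1)

print(df)
```

Assuming this sketch, the printed result would look roughly like this:

```
  Fruit_Type  Fruit_Apple  Fruit_Banana  Fruit_Orange
0      Apple            1             0             0
1     Banana            0             1             0
2     Orange            0             0             1
3      Apple            1             0             0
4     Banana            0             1             0
```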
The output of the sample code is a DataFrame with three additional columns: `Fruit_Apple`, `Fruit_Banana`, and `Fruit_Orange`. Each column represents a fruit type; a value of `1` indicates that the row corresponds to that particular fruit, while a value of `0` indicates that it doesn't.
With one-hot encoding, we have converted the `Fruit_Type` categorical feature into binary columns, making it easier for machine learning algorithms to process and analyze the data.
Check your understanding: One-hot encoding
You are working with a dataset that contains information about different houses for sale. Here's a simplified version of the dataset:
House ID: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Square Feet: [1200, 1500, 1800, 900, 2200, 1000, 1600, 1300, 1100, 1700]
Bedrooms: [2, 3, 4, 2, 4, 2, 3, 2, 2, 3]
Bathrooms: [1, 2, 2, 1, 3, 1, 2, 1, 1, 2]
Year Built: [1995, 2000, 1985, 2005, 2010, 1998, 2002, 1990, 2008, 2015]
Price ($): [150000, 200000, 230000, 120000, 280000, 140000, 210000, 180000, 160000, 220000]
Task:
Apply one-hot encoding to the `Bathrooms` feature. Create new binary columns for each unique value in the `Bathrooms` feature.
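As before, try it yourself first. One possible sketch, again assuming pandas:

```python
import pandas as pd

# Bathrooms values from the dataset above
df = pd.DataFrame({"Bathrooms": [1, 2, 2, 1, 3, 1, 2, 1, 1, 2]})

# One binary column per unique bathroom count (1, 2, and 3)
one_hot = pd.get_dummies(df["Bathrooms"], prefix="Bathrooms", dtype=int)
df = pd.concat([df, one_hot], axis=1)

print(df)
```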
Selecting an encoding technique
Each encoding technique has its strengths and suits different scenarios. Label encoding is useful when there is an inherent order among the categories (for example, T-shirt sizes such as Small, Medium, and Large), while one-hot encoding is effective when the categories are unordered and have no numerical relationship (for example, colors such as Red, Green, and Blue). The choice of encoding technique depends on the nature of the data and the requirements of the analysis or modeling task.
➡️ In the next section, we'll be looking at Feature selection methods 🎯