Feature transformation
Think of feature transformation as translating a story from one language to another depending on the audience. It involves reshaping or molding the data to better highlight patterns and relationships, by applying mathematical or statistical operations to create new features or modify existing ones.
For example, imagine we have a dataset of temperatures in Celsius (°C), and we want to understand it in Fahrenheit (°F). Feature transformation can help us convert all the temperatures to Fahrenheit, making it easier for us to relate to and compare.
Feature transformation techniques
There are many different techniques for feature transformation; each has its purpose and can be applied depending on the characteristics of the data and the specific analysis goals. In this lesson, we'll look at two techniques used for transformation - scaling and binning.
1. Scaling
Feature scaling involves making sure that all our numeric features in the data have a similar range of values. This includes rescaling numerical features to a common range, such as 0 to 1 or -1 to 1. It ensures that all features have a similar influence on the analysis and prevents any one feature from dominating (or bullying 😁) the others.
Let's look at an example of scaling features in a dataset. Imagine we have a dataset containing both numeric and non-numeric features; we can scale the numeric features using the code snippet below.
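The original snippet is not reproduced here, so the code below is a minimal sketch of what it might look like, assuming a small hypothetical DataFrame with the numeric columns mentioned next (Age, Height, Weight, Income, Experience) plus a non-numeric Name column; the sample values are made up, and the final shift follows the description given after the code. Min-max scaling maps each value x to (x - min) / (max - min).

```python
import pandas as pd

# Hypothetical dataset with both numeric and non-numeric features.
# Column names follow the lesson; the values are made up for illustration.
df = pd.DataFrame({
    "Name": ["Ada", "Ben", "Chi", "Dan", "Eve"],
    "Age": [25, 32, 47, 51, 38],
    "Height": [165, 180, 172, 158, 175],            # cm
    "Weight": [68, 85, 77, 62, 90],                 # kg
    "Income": [40000, 52000, 61000, 48000, 75000],  # $/year
    "Experience": [2, 7, 20, 25, 12],               # years
})

numeric_cols = ["Age", "Height", "Weight", "Income", "Experience"]

# Min-max scaling, applied column by column: x' = (x - min) / (max - min).
scaled = (df[numeric_cols] - df[numeric_cols].min()) / (
    df[numeric_cols].max() - df[numeric_cols].min()
)

# Shift by the absolute minimum of the scaled data so all values are positive
# (a no-op when scaling to [0, 1], but it mirrors the step described below).
scaled = scaled + scaled.min().abs()

df[numeric_cols] = scaled
print(df)
```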
We used the Min-Max (or normalization) scaling technique to scale the numeric features (Age, Height, Weight, Income, and Experience), and then modified the scaled values to ensure they are positive by adding the absolute minimum value of the scaled data. Other scaling techniques include:
- Standardization (or Z-score)
- Max scaling
- Robust Scaling
- Power
- Box-Cox
- Quantile
- Rank
- Unit Vector Scaling
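For contrast, here is a brief sketch of the first of these alternatives, standardization (Z-score), which rescales each column to mean 0 and unit standard deviation; the two columns reuse values from the hypothetical DataFrame above.

```python
import pandas as pd

# Two of the hypothetical numeric columns from the earlier sketch.
df = pd.DataFrame({
    "Age": [25, 32, 47, 51, 38],
    "Income": [40000, 52000, 61000, 48000, 75000],
})

# Z-score standardization, per column: x' = (x - mean) / std.
standardized = (df - df.mean()) / df.std()
print(standardized.round(2))
```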
Check your understanding: Scaling
You are working with a dataset that contains information about different houses for sale. Here's a simplified version of the dataset:
House ID: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Square Feet: [1200, 1500, 1800, 900, 2200, 1000, 1600, 1300, 1100, 1700]
Bedrooms: [2, 3, 4, 2, 4, 2, 3, 2, 2, 3]
Bathrooms: [1, 2, 2, 1, 3, 1, 2, 1, 1, 2]
Year Built: [1995, 2000, 1985, 2005, 2010, 1998, 2002, 1990, 2008, 2015]
Price ($): [150000, 200000, 230000, 120000, 280000, 140000, 210000, 180000, 160000, 220000]
Task: Perform min-max scaling on the Square Feet feature to bring it into the range [0, 1].
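If you'd like to check your answer in code afterwards, one possible approach (just a sketch, using the Square Feet values listed above) is:

```python
import pandas as pd

# Square Feet values from the dataset above.
square_feet = pd.Series([1200, 1500, 1800, 900, 2200, 1000, 1600, 1300, 1100, 1700])

# Min-max scaling to [0, 1]: (x - min) / (max - min).
scaled_sqft = (square_feet - square_feet.min()) / (square_feet.max() - square_feet.min())
print(scaled_sqft.round(3))
```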
2. Binning
Binning is the process of dividing a continuous feature (e.g., age) into discrete intervals or bins. It's like creating age groups to make the data easier to understand and analyze. For example, let's say we have a dataset of customer ages and we want to bin the ages into 5 bins; then we might have
- 0-18 years old
- 19-30 years old
- 31-45 years old
- 46-60 years old
- 61+ years old
We can then assign each customer to a bin based on their age. This would allow us to use the binned age as a categorical feature, rather than a wide range of continuous values. Binning helps simplify the data and make it easier to interpret and analyze, allowing us to summarize the data in a meaningful way and uncover insights that may not be apparent when looking at individual values.
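As a rough sketch of how this could look with pandas (the customer ages below are made up for illustration):

```python
import pandas as pd

# Hypothetical customer ages.
ages = pd.Series([15, 22, 37, 45, 58, 63, 29, 71, 40, 19])

# Bin edges matching the groups above; by default each bin includes its upper edge,
# so (18, 30] corresponds to "19-30", and so on.
bins = [0, 18, 30, 45, 60, 120]
labels = ["0-18", "19-30", "31-45", "46-60", "61+"]

age_group = pd.cut(ages, bins=bins, labels=labels)
print(age_group.value_counts().sort_index())
```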
Let's use another example to demonstrate binning. Suppose we have a dataset containing information about students' test scores, and we want to bin their scores into different performance categories, such as Low, Medium, and High. Here's a sample DataFrame with six features: Student_ID, Math_Score, Science_Score, English_Score, History_Score, and Total_Score:
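The original snippet is not reproduced here, so the code below is a sketch of what it might have looked like; the scores are made-up sample values.

```python
import pandas as pd

# Hypothetical student scores for illustration.
df = pd.DataFrame({
    "Student_ID": [1, 2, 3, 4, 5, 6],
    "Math_Score": [55, 72, 88, 64, 79, 93],
    "Science_Score": [60, 75, 90, 70, 82, 95],
    "English_Score": [65, 70, 85, 68, 80, 91],
    "History_Score": [58, 74, 89, 66, 77, 92],
})
score_cols = ["Math_Score", "Science_Score", "English_Score", "History_Score"]
df["Total_Score"] = df[score_cols].sum(axis=1)

# Bin Math_Score into three performance categories.
df["Math_Category"] = pd.cut(
    df["Math_Score"],
    bins=[0, 70, 80, 100],
    labels=["Low", "Medium", "High"],
)
print(df[["Student_ID", "Math_Score", "Math_Category"]])
```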
In this example, we used the pandas cut() function to bin the Math_Score column into three categories: "Low" (scores below 70), "Medium" (scores between 70 and 80), and "High" (scores above 80). The new column "Math_Category" has been added to the DataFrame to store the respective category for each student's math score.
Different binning techniques can be used depending on the data and analysis goals. Here are some common binning techniques:
- Equal Width: Dividing the data into bins of equal width. For example, dividing age into bins like 0-10 years, 11-20 years, etc.
- Equal Frequency: Dividing the data into bins with an equal number of data points. This can help handle skewed data.
- Custom: Defining specific bins based on domain knowledge or specific requirements.
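The small sketch below contrasts equal-width and equal-frequency binning in pandas (pd.cut vs. pd.qcut) on some illustrative, skewed values.

```python
import pandas as pd

values = pd.Series([1, 2, 2, 3, 4, 5, 8, 13, 21, 40])  # illustrative, right-skewed data

# Equal width: 3 bins of equal width across the value range.
equal_width = pd.cut(values, bins=3)

# Equal frequency: 3 bins, each holding roughly the same number of points.
equal_freq = pd.qcut(values, q=3)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```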
Check your understanding: Binning
You are working with the same housing dataset from the previous exercise (House ID, Square Feet, Bedrooms, Bathrooms, Year Built, and Price).
Task: Apply binning on the Year Built feature to categorize houses into decades (e.g., 1980s, 1990s, etc.).
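One possible way to check your answer in code (again, just a sketch) is to use custom decade bins with pd.cut:

```python
import pandas as pd

# Year Built values from the dataset above.
year_built = pd.Series([1995, 2000, 1985, 2005, 2010, 1998, 2002, 1990, 2008, 2015])

# Custom bins at decade boundaries; right=False makes each bin include its start year,
# so 2000 falls into the 2000s rather than the 1990s.
bins = [1980, 1990, 2000, 2010, 2020]
labels = ["1980s", "1990s", "2000s", "2010s"]

decade = pd.cut(year_built, bins=bins, labels=labels, right=False)
print(decade)
```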
➡️ In the next section, we'll be looking at Feature encoding 🎯.