Popular tricks and tips for Feature Engineering – Part IV

Jacob Joseph, a 40 Under 40 award-winning Data Scientist, leads the Data Science team at CleverTap. With over 20 years in analytics and consulting, he excels in solving complex marketing challenges.

In the earlier part, we discussed tricks (iv) to (vii) for feature engineering. In this part, we dive deeper into tricks (viii) and (ix). The examples discussed in this article can be reproduced with the source code and datasets available here. Refer to Part 1 for an introduction to the tricks covered in detail below.

viii) Reducing Dimensionality

As an analyst, you savor having a lot of data. But with a lot of data comes the added complexity of analyzing it and making sense of it. Often, the variables within the data are correlated, and analysis or models built on such untreated data may produce poor insights or models that overfit.

It is common practice among analysts to employ dimension reduction techniques to create a smaller set of new variables that still explains the dataset well. For example, assume a dataset has 1,000 variables; after dimension reduction, just 50 newly created variables may be able to explain the original data quite well.

Dimension reduction techniques are heavily used in image processing, video processing, and generally wherever you deal with a very high number of variables. You may use hand-engineered feature extraction methods like SIFT, VLAD, HOG, GIST, and LBP, or methods that learn features discriminative in the given context, such as PCA, ICA, Sparse Coding, Autoencoders, and Restricted Boltzmann Machines.

In this article, we look at a popular technique called Principal Component Analysis (PCA). PCA is a technique used to emphasize variation and bring out strong patterns in a dataset.

Let’s explore PCA visually in 2 dimensions before proceeding toward a multidimensional dataset.

2D Example

Suppose we have the following dataset:

Dataset

  Physics   Maths
  70        69
  40        55.8
  80.5      74.2
  72.1      70
  55.1      63
  60        59.5
  85.5      75
  56        62

The dataset contains the average marks in Physics and Maths of 8 students.

A scatter plot of these marks suggests a positive relationship between the marks in Physics and Maths.

But what if we want to summarize the above data along just one dimension instead of two? We have 2 options:

  1. Take all the values of Physics and plot them on a line
  2. Take all the values of Maths and plot them on a line

[Figure: Physics marks and Maths marks each plotted along a single line]

It seems the variation in Physics marks is greater than in Maths marks. What if we choose the Physics marks to represent the dataset, since they vary more than the Maths marks and, in any case, the two move together? Intuitively, that doesn’t seem right: although we chose the variable with the maximum variation in the dataset, i.e., Physics marks, we would be sacrificing the information about Maths marks entirely.

Why not create new variables as linear combinations of the existing variables, and then keep the newly created variables that capture the maximum variation?

Transformed

  PC1          PC2
   0.574378     0.058681
  -2.364742     0.153077
   1.665568     0.088140
   0.788306     0.060262
  -0.825473     0.165476
  -0.954877    -0.459403
   2.004565    -0.078450
  -0.887725     0.012218

This is a transformed version of the marks dataset, obtained by computing the principal components of the data.

Principal components are essentially linear combinations of the original variables, i.e., both PC1 and PC2 are linear combinations of the Physics and Maths marks. For the 2 variables, we get 2 principal components.
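
To make this concrete, here is a minimal Python sketch of such a transformation using pandas and scikit-learn. It is only an illustration (the article’s own source code and datasets are linked above); the signs and exact decimals of the scores may differ slightly from the table depending on the scaling and sign conventions used.

    # Minimal sketch: principal components of the standardized marks data.
    import pandas as pd
    from sklearn.decomposition import PCA

    marks = pd.DataFrame({
        "Physics": [70, 40, 80.5, 72.1, 55.1, 60, 85.5, 56],
        "Maths":   [69, 55.8, 74.2, 70, 63, 59.5, 75, 62],
    })

    # Standardize each subject (using the n-1 denominator) so that neither
    # subject dominates simply because of its scale.
    z = (marks - marks.mean()) / marks.std(ddof=1)

    pca = PCA(n_components=2).fit(z)
    scores = pca.transform(z)                      # PC1 and PC2 for each student
    print(pd.DataFrame(scores, columns=["PC1", "PC2"]))

    # Each principal component is a linear combination of Physics and Maths;
    # the weights (loadings) are stored in pca.components_.
    print(pd.DataFrame(pca.components_, columns=marks.columns, index=["PC1", "PC2"]))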

[Figure: line graphs of PC1 and PC2 across the 8 students]

From the above line graphs, we see that PC1 shows the maximum variation, and since it combines both of the original variables, it is a better candidate to represent the dataset than Physics alone.

Let’s observe the variance explained by both components:

Importance of Components

                          PC1      PC2
  Standard Deviation      1.402    0.188
  Proportion of Variance  0.982    0.018
  Cumulative Proportion   0.982    1.000

It is further clear from the above table that PC1 alone accounts for 98% of the variance in the dataset and could be used to represent it.
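
The numbers in this summary are straightforward to reproduce. A self-contained sketch (again an illustration, not the linked source code):

    # Sketch: the "Importance of Components" summary for the 2D marks example.
    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA

    marks = pd.DataFrame({
        "Physics": [70, 40, 80.5, 72.1, 55.1, 60, 85.5, 56],
        "Maths":   [69, 55.8, 74.2, 70, 63, 59.5, 75, 62],
    })
    z = (marks - marks.mean()) / marks.std(ddof=1)
    pca = PCA().fit(z)

    summary = pd.DataFrame(
        [np.sqrt(pca.explained_variance_),           # standard deviation of each PC
         pca.explained_variance_ratio_,              # proportion of variance explained
         np.cumsum(pca.explained_variance_ratio_)],  # cumulative proportion
        index=["Standard Deviation", "Proportion of Variance", "Cumulative Proportion"],
        columns=["PC1", "PC2"],
    )
    print(summary.round(3))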

Multidimensional Example

Let’s extend the same idea to a multidimensional scenario using the wine dataset discussed in Part 2 under trick (i). To recollect, the wine dataset contains the results of a chemical analysis of wines grown in a specific area of Italy. Three types of wine are represented in the 178 samples, with 13 chemical measurements recorded for each sample. A sample of the dataset is below:

First 10 Rows

  Type  Alcohol  Malic  Ash   Alcalinity  Magnesium  Phenols  Flavanoids  Nonflavanoids  Proanthocyanins  Color  Hue   Dilution  Proline
  1     14.23    1.71   2.43  15.6        127        2.8      3.06        0.28           2.29             5.64   1.04  3.92      1065
  1     13.2     1.78   2.14  11.2        100        2.65     2.76        0.26           1.28             4.38   1.05  3.4       1050
  1     13.16    2.36   2.67  18.6        101        2.8      3.24        0.3            2.81             5.68   1.03  3.17      1185
  1     14.37    1.95   2.5   16.8        113        3.85     3.49        0.24           2.18             7.8    0.86  3.45      1480
  1     13.24    2.59   2.87  21          118        2.8      2.69        0.39           1.82             4.32   1.04  2.93      735
  1     14.2     1.76   2.45  15.2        112        3.27     3.39        0.34           1.97             6.75   1.05  2.85      1450
  1     14.39    1.87   2.45  14.6        96         2.5      2.52        0.3            1.98             5.25   1.02  3.58      1290
  1     14.06    2.15   2.61  17.6        121        2.6      2.51        0.31           1.25             5.05   1.06  3.58      1295
  1     14.83    1.64   2.17  14          97         2.8      2.98        0.29           1.98             5.2    1.08  2.85      1045
  1     13.86    1.35   2.27  16          98         2.98     3.15        0.22           1.85             7.22   1.01  3.55      1045


We have a total of 13 numerical variables on which we can conduct PCA (PCA can run on numerical variables only). Let’s check the importance of the components after running the PCA algorithm, and select principal components based on how much of the variation in the dataset they explain.
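
As a sketch of how such a summary can be produced, the same UCI wine data ships with scikit-learn (its column names differ slightly from the table above, e.g. alcalinity_of_ash rather than Alcalinity); the printed numbers should roughly match the table that follows.

    # Sketch: PCA on all 13 numerical variables of the wine dataset.
    import numpy as np
    import pandas as pd
    from sklearn.datasets import load_wine
    from sklearn.decomposition import PCA

    wine = load_wine(as_frame=True)
    X = wine.data                                    # 178 samples x 13 variables

    # Standardize, then fit PCA on the scaled variables.
    z = (X - X.mean()) / X.std(ddof=1)
    pca = PCA().fit(z)

    summary = pd.DataFrame(
        [np.sqrt(pca.explained_variance_),
         pca.explained_variance_ratio_,
         np.cumsum(pca.explained_variance_ratio_)],
        index=["Standard Deviation", "Proportion of Variance", "Cumulative Proportion"],
        columns=[f"PC{i + 1}" for i in range(X.shape[1])],
    )
    print(summary.round(3))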

Importance of Components

                          PC1     PC2     PC3     PC4     PC5     PC6     PC7
  Standard Deviation      2.169   1.580   1.203   0.959   0.924   0.801   0.742
  Proportion of Variance  0.362   0.192   0.111   0.071   0.066   0.049   0.042
  Cumulative Proportion   0.362   0.554   0.665   0.736   0.802   0.851   0.893

                          PC8     PC9     PC10    PC11    PC12    PC13
  Standard Deviation      0.590   0.537   0.501   0.475   0.411   0.322
  Proportion of Variance  0.027   0.022   0.019   0.017   0.013   0.008
  Cumulative Proportion   0.920   0.942   0.962   0.979   0.992   1.000

Based on the above table, it seems that more than 50% of the variance in the data is explained by the top 2 principal components, 80% by the top 5 principal components, and over 90% by the top 8.

Let’s further examine the relationships between the variables, and confirm whether PCA has captured the patterns among them, with the help of a biplot.

[Figure: biplot of the wine dataset on the first two principal components]

Biplots are the primary visualization tool for PCA. A biplot plots the transformed data as points, labeled by the row index of the dataset, and the original variables as vectors (arrows) on the same graph. It also helps us visualize the relationships between the variables themselves.

The direction of the vectors, their length, and the angle between them all have meaning. Let’s look at the angle between the vectors: the smaller the angle between two vectors, the more positively correlated the variables are. In the above plot, Alcalinity and Nonflavanoids have a high positive correlation, given the small angle between their vectors; the same can be said for Proanthocyanins and Phenols. Malic and Hue, or Alcalinity and Phenols, are negatively correlated, as their vectors point in opposite directions.
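
For readers who want to reproduce a plot of this kind, here is a rough, hand-rolled biplot sketch with matplotlib, again using scikit-learn’s copy of the wine data; the arrow scaling is purely cosmetic.

    # Sketch: a simple biplot of the first two principal components.
    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.datasets import load_wine
    from sklearn.decomposition import PCA

    wine = load_wine(as_frame=True)
    X = wine.data
    z = (X - X.mean()) / X.std(ddof=1)

    pca = PCA(n_components=2).fit(z)
    scores = pca.transform(z)                 # points: one per row of the dataset
    loadings = pca.components_.T              # arrows: one per original variable

    fig, ax = plt.subplots(figsize=(8, 8))
    ax.scatter(scores[:, 0], scores[:, 1], c=wine.target, s=15)   # color = wine type
    scale = np.abs(scores).max()              # stretch arrows to the range of the scores
    for name, (l1, l2) in zip(X.columns, loadings):
        ax.arrow(0, 0, l1 * scale, l2 * scale, color="red", head_width=0.08)
        ax.text(l1 * scale * 1.1, l2 * scale * 1.1, name, color="red", fontsize=8)
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
    plt.show()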

We can verify these claims by picking a few row indices from the plot and inspecting the raw values.

Let’s choose row indices 131 and 140:

         Alcalinity  Nonflavanoids
  131    18          0.21
  140    24          0.53

Both Alcalinity and Nonflavanoids move together (both are higher for row 140 than for row 131).

Let’s choose row indices 51 and 33:

        Proanthocyanins  Phenols
  51    2.91             2.72
  33    1.97             2.42

Both Proanthocyanins and Phenols move together (both are higher for row 51 than for row 33).
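
These pairwise claims can also be cross-checked directly against the correlation matrix of the original variables, for example (column names as in scikit-learn’s copy of the data):

    # Sketch: verify the biplot's reading of the angles with plain correlations.
    from sklearn.datasets import load_wine

    X = load_wine(as_frame=True).data
    corr = X.corr()

    print(corr.loc["alcalinity_of_ash", "nonflavanoid_phenols"])   # expected positive, per the biplot
    print(corr.loc["proanthocyanins", "total_phenols"])            # expected positive
    print(corr.loc["malic_acid", "hue"])                           # expected negative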

Interestingly, the data points seem to form clusters, indicated by the different colors corresponding to the type of wine. PCA can be useful not only for reducing the dimensionality of the dataset but also as a precursor to clustering. And since we would use only the subset of principal components that explains the majority of the variation in the dataset while building predictive models, PCA can also help reduce the menace of overfitting.
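
As a sketch of that last point, the number of retained components can be chosen by a variance threshold and the reduced features fed straight into a downstream model or clustering step. This is illustrative only; the parameter choices below are arbitrary.

    # Sketch: keep enough PCs to explain ~90% of the variance, then cluster on them.
    from sklearn.cluster import KMeans
    from sklearn.datasets import load_wine
    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    wine = load_wine(as_frame=True)
    pipe = make_pipeline(
        StandardScaler(),
        PCA(n_components=0.90),                   # retain components covering 90% of the variance
        KMeans(n_clusters=3, n_init=10, random_state=0),
    )
    labels = pipe.fit_predict(wine.data)

    print(pipe.named_steps["pca"].n_components_)   # how many PCs were actually kept
    print(labels[:10])                             # cluster assignments for the first 10 wines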

ix) Intuitive & Additional Features

Sometimes you may create additional features, either manually or programmatically, based on domain knowledge and common sense.

Examples:

  1. How many times have you come across a dataset that contains the birth date of a user? Are you using that information in its given form, or are you transforming it into a new variable like the age of the user?
  2. You will also have come across timestamps that record the date, hour, and minutes, down to seconds if not more. Would you take this information as it is? Wouldn’t it be more useful to create new variables like the month of the year, day of the week, and hour of the day (see the sketch after this list)?
  3. Many businesses are seasonal in nature, while some are not. Depending on the nature of the industry, a new variable capturing seasonality could be created.
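
A short pandas sketch of these transformations, applied to a small hypothetical user table (the column names and dates are made up purely for illustration):

    # Sketch: deriving intuitive features from a birth date and an event timestamp.
    import pandas as pd

    users = pd.DataFrame({
        "birth_date": pd.to_datetime(["1990-05-17", "1985-11-02"]),
        "event_time": pd.to_datetime(["2025-01-30 09:45:12", "2025-06-14 22:10:05"]),
    })

    today = pd.Timestamp("2025-01-30")
    users["age"]         = (today - users["birth_date"]).dt.days // 365   # approximate age in years
    users["month"]       = users["event_time"].dt.month                   # month of the year
    users["day_of_week"] = users["event_time"].dt.day_name()              # day of the week
    users["hour"]        = users["event_time"].dt.hour                    # hour of the day
    users["quarter"]     = users["event_time"].dt.quarter                 # a crude seasonality proxy
    print(users)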

Conclusion

The steps preceding the predictive modeling stage take up as much as 80% of an analyst’s time, of which Data Preparation takes the lion’s share. The importance of Feature Engineering in Data Preparation can’t be overstated: done the right way, it leads to better insights and to more efficient and robust models. The tricks and tips discussed in this article series attempt to arm the analyst with enough ammunition for that stage.

Posted on January 30, 2025