
Popular tricks and tips for Feature Engineering – Part II

Jacob Joseph, a 40 Under 40 award-winning Data Scientist, leads the Data Science team at CleverTap. With over 20 years in analytics and consulting, he excels in solving complex marketing challenges.

 

In the previous part, we looked at some of the popular tricks for Feature Engineering and got a broad overview of each trick. In this part, we will look at the first three tricks in detail. The examples discussed in this article can be reproduced with the source code and datasets available here.

i) Bringing numerical variables on the same scale

Standardization is a popular pre-processing step in Data Preparation. It brings all the variables onto the same scale so that a machine learning algorithm gives them equal importance rather than favoring variables simply because they are measured in larger units.

Let’s take an example with K-means Clustering, a popular data mining and unsupervised learning technique. We will work with publicly available wine data from the UCI Machine Learning Repository. The dataset contains the results of a chemical analysis of wines grown in a specific area of Italy. Three types of wine are represented in the 178 samples, with the results of 13 chemical analyses recorded for each sample. The sample data set is below:

First 10 Rows

Type  Alcohol  Malic  Ash   Alcalinity  Magnesium  Phenols  Flavanoids  Nonflavanoids  Proanthocyanins  Color  Hue   Dilution  Proline
1     14.23    1.71   2.43  15.6        127        2.80     3.06        0.28           2.29             5.64   1.04  3.92      1065
1     13.20    1.78   2.14  11.2        100        2.65     2.76        0.26           1.28             4.38   1.05  3.40      1050
1     13.16    2.36   2.67  18.6        101        2.80     3.24        0.30           2.81             5.68   1.03  3.17      1185
1     14.37    1.95   2.50  16.8        113        3.85     3.49        0.24           2.18             7.80   0.86  3.45      1480
1     13.24    2.59   2.87  21.0        118        2.80     2.69        0.39           1.82             4.32   1.04  2.93      735
1     14.20    1.76   2.45  15.2        112        3.27     3.39        0.34           1.97             6.75   1.05  2.85      1450
1     14.39    1.87   2.45  14.6        96         2.50     2.52        0.30           1.98             5.25   1.02  3.58      1290
1     14.06    2.15   2.61  17.6        121        2.60     2.51        0.31           1.25             5.05   1.06  3.58      1295
1     14.83    1.64   2.17  14.0        97         2.80     2.98        0.29           1.98             5.20   1.08  2.85      1045
1     13.86    1.35   2.27  16.0        98         2.98     3.15        0.22           1.85             7.22   1.01  3.55      1045

 

[Figure: Summary statistics of the wine dataset variables]

As can be observed from the summary, the variables aren’t on the same scale. In order to identify the 3 types of wine (see ‘Type’), we will cluster the data using K-means clustering, with and without bringing the variables on the same scale.
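For reference, the data and its summary can be reproduced in R with a minimal sketch along the following lines. The UCI download URL is the repository's standard location, and the column names follow the table above; the article's own source code may load and name the data differently.

# Load the wine dataset from the UCI Machine Learning Repository
url  <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"
wine <- read.csv(url, header = FALSE)
colnames(wine) <- c("Type", "Alcohol", "Malic", "Ash", "Alcalinity", "Magnesium",
                    "Phenols", "Flavanoids", "Nonflavanoids", "Proanthocyanins",
                    "Color", "Hue", "Dilution", "Proline")

head(wine, 10)   # the first 10 rows, as shown above
summary(wine)    # note the very different ranges, e.g. Proline vs. Hue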

A wide variety of indices have been proposed to find the optimal number of clusters when partitioning data. We will use NbClust, a popular R package that provides up to 30 indices for determining the ideal number of clusters over a given range. The cluster count proposed by the largest number of indices is taken as the ideal number of clusters. We shall iterate between 2 and 15 clusters and select the ideal number with the help of NbClust.

Without Standardization
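A minimal sketch of this step, assuming the wine data frame loaded above (the seed value is arbitrary):

# install.packages("NbClust")   # if not already installed
library(NbClust)

# Cluster only on the 13 chemical measurements; drop the known 'Type' labels
wine_features <- wine[, -1]

# Evaluate all indices for 2 to 15 clusters, using k-means as the method
set.seed(123)
nb_raw <- NbClust(data = wine_features, distance = "euclidean",
                  min.nc = 2, max.nc = 15,
                  method = "kmeans", index = "all")

# How often each cluster number was proposed across the indices
table(nb_raw$Best.nc[1, ])   # first row holds the proposed number of clusters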

[Figure: Number of clusters proposed by the NbClust indices for the non-standardized data]

Based on the above graph, 2 is the ideal number of clusters, since the majority of the indices proposed '2'. But we know from the data that there are 3 types of wine, so we can easily reject this result.

With Standardization

We will standardize the wine data using z-scores, ignoring the 'Type' column, as sketched below. The data summary post-standardization follows:
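A minimal sketch, continuing with the wine data frame from above:

# z-score standardization: (x - mean(x)) / sd(x) for each chemical measurement
wine_scaled <- scale(wine[, -1])   # ignore the 'Type' column

summary(wine_scaled)   # every column now has mean 0 and unit standard deviation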

[Figure: Summary statistics of the wine dataset after z-score standardization]

Let’s now run the NbClust algorithm to estimate the ideal number of clusters.
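Assuming the standardized matrix wine_scaled from the sketch above, the call is the same as before:

# Same settings as before, but on the standardized data
set.seed(123)
nb_scaled <- NbClust(data = wine_scaled, distance = "euclidean",
                     min.nc = 2, max.nc = 15,
                     method = "kmeans", index = "all")

table(nb_scaled$Best.nc[1, ])   # tally of proposed cluster numbers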

[Figure: Number of clusters proposed by the NbClust indices for the standardized data]

Based on the above graph, 3 is the ideal number of clusters, since the majority of the indices proposed '3'. This clustering looks promising given that there are 3 types of wine.

Evaluating Clusters formed

We will run the K-means algorithm with the cluster number suggested by NbClust and evaluate the resulting classification with a Confusion Matrix, a tabular representation of Actual (wine type from the data) vs. Predicted (cluster) values. The off-diagonal cells represent observations that are misclassified, and the diagonal cells represent observations that are correctly classified.
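A sketch of this evaluation, assuming the wine and wine_scaled objects from above. Because cluster labels are arbitrary, the share of correct classifications is computed here by matching each wine type to its dominant cluster.

# k-means with the 3 clusters suggested by NbClust on the standardized data
set.seed(123)
km <- kmeans(wine_scaled, centers = 3, nstart = 25)

# Confusion matrix: actual wine type (rows) vs. assigned cluster (columns)
conf_matrix <- table(Actual = wine$Type, Cluster = km$cluster)
conf_matrix

# Proportion correctly classified, matching each type to its dominant cluster
sum(apply(conf_matrix, 1, max)) / sum(conf_matrix)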

[Figure: Confusion matrix of actual wine type vs. predicted cluster]

The ideal scenario would be for all observations of a given wine type to fall into a single one of the 3 clusters. In the above matrix, there are 6 (3 + 3) misclassified observations, all belonging to Type 2 wine: 3 of them fall in Cluster 1 and the other 3 in Cluster 3 instead of Cluster 2.

With standardized data, the predicted cluster count matched the actual number of wine types, unlike with the non-standardized data. Additionally, the classification performance based on the clustering of the standardized data was extremely encouraging.

ii) Binning/Converting Numerical to Categorical Variable

Converting numerical to categorical variables is another useful feature engineering technique. It not only helps you interpret and visualize the numerical variable but also adds an additional feature, which can improve the performance of a predictive model by reducing noise or capturing non-linearity.

Let’s look at an example where we have data on the age of the users and whether they have interacted with an app during a particular time period. Below are the first few rows and a summary of the data:

Age / OS / Interact

Age  OS       Interact
18   iOS      1
23   iOS      1
20   Android  0
22   Android  0
21   Android  0
16   iOS      1
21   Android  1
79   iOS      0
16   iOS      0
22   Android  1
24   iOS      1

 

Data Summary

Age              OS            Interact
Min.   : 16.00   Android: 98   0:  65
1st Qu.: 21.00   iOS    : 67   1: 100
Median : 29.00
Mean   : 33.63
3rd Qu.: 42.00
Max.   : 79.00

We have 165 users aged between 16 and 79 years, of which 98 are on Android and 67 are on iOS. The 1 and 0 for the ‘Interact’ variable refer to users who have interacted with the app frequently and occasionally, respectively.

We need to build a model to predict whether a user interacts with an app based on the above information. We will use 2 approaches, one where we take Age as it is and the other where we create an additional variable by grouping or binning the age in buckets. Though we can use several methods like domain expertise, visualization, and predictive models to bin ‘Age,’ we will bin ‘Age’ based on a percentile approach.

Age Summary

Statistic      Value
1st Quartile   21
2nd Quartile   29
3rd Quartile   42
Mean           33.63
Min.           16
Max.           79

Based on the above table, 25% of the users are younger than 21, 50% are between 21 and 42, and the remaining 25% are older than 42. We will use these breakpoints to bin the users into age group buckets and create a new variable, 'Age Group', as sketched below.
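A minimal sketch of the percentile-based binning, assuming a data frame named app_users with the columns Age, OS, and Interact shown above (the data frame and column names are assumptions; the article's actual source code may differ):

# Breakpoints at the minimum, 1st quartile, 3rd quartile, and maximum of Age
breaks <- quantile(app_users$Age, probs = c(0, 0.25, 0.75, 1))   # 16, 21, 42, 79

# Bin Age into three buckets: < 21, >= 21 & < 42, >= 42
app_users$AgeGroup <- cut(app_users$Age, breaks = breaks,
                          labels = c("< 21", ">= 21 & < 42", ">= 42"),
                          right = FALSE, include.lowest = TRUE)

table(app_users$AgeGroup)   # bucket counts, as in the summary below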

Age Group Summary

Age Group      Count
< 21           41
≥ 21 & < 42    82
≥ 42           42

Now that we have binned the users' ages, let's build a model to predict whether a user will interact with the app. We will use Logistic Regression, since the dependent variable is binary.
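A minimal sketch of the two models, assuming the same app_users data frame with Interact coded as 0/1:

# Model A: Age and OS only
model_a <- glm(Interact ~ Age + OS, data = app_users, family = binomial)

# Model B: Age, the binned Age Group, and OS
model_b <- glm(Interact ~ Age + AgeGroup + OS, data = app_users, family = binomial)

summary(model_a)
summary(model_b)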

Model Summary

Logistic Regression Model Summary (Dependent variable: Interact)

Independent Variable      Model A      Model B
Age                       -0.049***    -0.062*
AgeGroup: ≥ 21 & < 42                  1.596*
AgeGroup: ≥ 42                         1.236
OS: iOS                   2.150***     2.534***
Constant                  1.483**      0.696
Observations              165          165
Residual Deviance         157.2        149.93
AIC                       163.2        159.93

* p < 0.05; ** p < 0.01; *** p < 0.001

Model A has taken only ‘Age’ and ‘OS’ as the independent variables, whereas Model B has taken ‘Age’, ‘Age Group’ and ‘OS’ as the independent variables.

Model Discrimination

There are various metrics to discriminate between Logistic Regression models, such as Residual Deviance, Log Likelihood, AIC, SC, and AUC. For the sake of simplicity, we will only look at Residual Deviance and AIC (Akaike Information Criterion); for both, lower values indicate a better model. Based on the AIC and Residual Deviance obtained, Model B appears to be the better of the two. Binning has proved useful here, since the variable formed by binning turned out to be statistically significant for the model (indicated by *, assuming a 5% cut-off for the p-value). The binned variable captures some of the non-linearity in the relationship between Age and Interact. Model B can be improved further with the help of dummy variables, which will be discussed in the next part.
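These metrics can be pulled directly from the fitted models; a sketch assuming model_a and model_b from above:

# Residual deviance and AIC for each model (lower is better)
deviance(model_a); deviance(model_b)
AIC(model_a, model_b)

# Since Model A is nested in Model B, a likelihood-ratio test is also possible
anova(model_a, model_b, test = "Chisq")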

iii) Reducing Levels in Categorical Variables

Rationalizing the levels or attributes of categorical variables can lead to better models and computational efficiency. Consider the example below, detailing the number of App Launches by City.

     City           App Launched   Percentage (%)   Cumulative (%)
1    NewYork_City   23878145       25               25
2    LosAngeles     20057642       21               46
3    SanDiego       17192264       18               64
4    SanFrancisco   14326887       15               79
5    Arlington      9551258        10               89
6    Houston        6208318        6.5              95.5
7    Philadelphia   1528201        1.6              97.1
8    Phoenix        1146151        1.2              98.3
9    Chandler       764101         0.8              99.1
10   Dallas         334294         0.35             99.45
11   Austin         171923         0.18             99.63
12   Jacksonville   105064         0.11             99.74
13   Riverside      66859          0.07             99.81
14   Pittsburgh     57308          0.06             99.87
15   Mesa           38205          0.04             99.91
16   Miami          28654          0.03             99.94
17   FortWorth      23878          0.02             99.96
18   Irvine         19103          0.02             99.98
19   Tampa          9551           0.01             99.99
20   Fresno         4776           0.01             100

It can be observed from the above table that about 95% of the App Launches are accounted for by just 6 of the 20 cities. We can therefore combine the remaining 14 cities into a single level named 'Others', which accounts for roughly 5% of the launches (see the sketch below).
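A minimal sketch of the frequency-based approach, assuming a data frame named launches with the columns City and AppLaunched (the names are assumptions):

# Sort by launches, compute each city's share and the cumulative share
launches <- launches[order(-launches$AppLaunched), ]
launches$Percentage <- 100 * launches$AppLaunched / sum(launches$AppLaunched)
launches$Cumulative <- cumsum(launches$Percentage)

# Keep the cities covering ~95% of launches; lump the rest into 'Others'
launches$CityReduced <- ifelse(launches$Cumulative <= 95.5,
                               as.character(launches$City), "Others")

aggregate(AppLaunched ~ CityReduced, data = launches, FUN = sum)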

Let’s look at the states to which these cities belong and find out whether State could be used to group the cities.

     City           State          App Launched
1    NewYork_City   NewYork        23878145
2    LosAngeles     California     20057642
3    SanDiego       California     17192264
4    SanFrancisco   California     14326887
5    Arlington      Texas          9551258
6    Houston        Texas          6208318
7    Philadelphia   Pennsylvania   1528201
8    Phoenix        Arizona        1146151
9    Chandler       Arizona        764101
10   Dallas         Texas          334294
11   Austin         Texas          171923
12   Jacksonville   Florida        105064
13   Riverside      California     66859
14   Pittsburgh     Pennsylvania   57308
15   Mesa           Arizona        38205
16   Miami          Florida        28654
17   FortWorth      Texas          23878
18   Irvine         California     19103
19   Tampa          Florida        9551
20   Fresno         California     4776

 

     State          App Launched
1    Arizona        1948457
2    California     51667531
3    Florida        143269
4    NewYork        23878145
5    Pennsylvania   1585509
6    Texas          16289671

As per the table, the information related to the cities could be summarized by the corresponding 6 States.
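A sketch of this second approach, using a city-to-state lookup built from the table above and the same assumed launches data frame:

# Map each city to its state (lookup derived from the table above)
city_to_state <- c(NewYork_City = "NewYork",      LosAngeles = "California",
                   SanDiego = "California",       SanFrancisco = "California",
                   Arlington = "Texas",           Houston = "Texas",
                   Philadelphia = "Pennsylvania", Phoenix = "Arizona",
                   Chandler = "Arizona",          Dallas = "Texas",
                   Austin = "Texas",              Jacksonville = "Florida",
                   Riverside = "California",      Pittsburgh = "Pennsylvania",
                   Mesa = "Arizona",              Miami = "Florida",
                   FortWorth = "Texas",           Irvine = "California",
                   Tampa = "Florida",             Fresno = "California")

launches$State <- city_to_state[as.character(launches$City)]

# App Launches summarized at the State level: 20 city levels reduce to 6
aggregate(AppLaunched ~ State, data = launches, FUN = sum)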

In the above example, we reduced the categorical levels in two ways: first by combining the levels with a low number of App Launches, using the frequency distribution, and second by creating a new variable, 'State', using domain logic. Reducing the levels makes the data computationally less expensive and easier to visualize, which helps in making better sense of the data. It can also reduce the danger of overfitting (an overfit model fits the in-sample data used to build it well but may fit out-of-sample or new data poorly).

In this part, we have looked at the first three tricks in detail. The ensuing part will discuss tricks four to six in detail.

Posted on January 30, 2025