Unlike evaluating the accuracy of models that predict a continuous or discrete dependent variable, such as Linear Regression models, evaluating the accuracy of a classification model can be more complex and time-consuming. Before measuring the accuracy of a classification model, an analyst would first measure its robustness with the help of metrics such as AIC-BIC, AUC-ROC, AUC-PR, the Kolmogorov-Smirnov chart, etc. The

# Author: Jacob Joseph

## A Neat Trick to Increase Robustness of Regression Models

The first predictive model that an analyst encounters is Linear Regression. A linear regression line has an equation of the form Y = a + bX, where X = explanatory variable, Y = dependent variable, a = intercept and b = coefficient. In order to find the intercept and coefficients of a linear regression line, the above equation is generally solved by

## How Do We Perceive Analytics or Data Science?

Is the purpose of analytics or data science to draw some insights from data or some cool visualization or is it just a recommendation based on some metric we deem important? The list could be endless. But, what is true analytics or data science? Let’s begin with the definition of Analytics and Data Science given by Wikipedia. Analytics: Analytics

## The Fallacy of Seeing Patterns

Human beings try to find patterns to explain the reason behind almost every phenomenon, but that doesn’t mean that there is a pattern to rely on. Superstitions are a classic example where spurious patterns were generalized to explain many a phenomenon. As Analysts, we are on the lookout for patterns and quite often, either knowingly

## I Wish I Had Autobots for Data Transformation

Being a sci-fi movie buff, I would always wonder if my variables could turn into Autobots, just like in the movie ‘Transformers’, and make my life building statistical models that much easier. Until that day, I will have to use the available tools to transform my variables. Data Analysis Before drawing valuable insights or building predictive

## How to Compare Apples and Oranges? : Part III

In Part 1 and Part 2 of the series, we looked at ways to compare numerical variables and categorical variables. Let’s now look at techniques to compare mixed types of variables, i.e. numerical and categorical variables together. Please read this article to visually analyze the relationship between mixed types of variables. We will work with

## How to Compare Apples and Oranges? : Part II

In the previous article, we looked at some of the ways to compare different numerical variables. In this article, we shall look at techniques to compare categorical variables with the help of an example. Assume you have been given a dataset totaling 10,000 rows containing user information on Operating System, Gender and whether the user

## How to Compare Apples and Oranges? : Part I

How often have you come across the idiom “Comparing apples and oranges”? It is a great analogy to articulate that two things can’t be compared due to the fundamental difference between them. As an analyst, you deal with such differences and make sense of them on a daily basis. Let’s take an example and understand some ways to

## Do You Need Big Data or Smart Data ? : Part II

In the previous article, we discussed how sampling could turn your Big Data into Smart Data and briefly laid out a few sampling techniques. Let’s now discuss the techniques in detail. Probability Sampling Probability Sampling is one in which every element of the population has a chance or a probability (greater than zero) of selection, and this

## Do you need Big Data or Smart Data ? : Part I

Big Data is the buzzword of our current times. A majority of firms either use or wish to use Big Data on their analytics platform and discover actionable insights from their data. The two key requirements to deliver such insights are 1) the presence of intelligent infrastructure to process the data and 2) the data

## Deriving Better Insights from Time Series Data with Cycle Plots

Visualizing time series data for the analysis of numerical information like revenue, app launches, uninstalls, etc. can help analysts quickly reveal an underlying trend. The graph below displays the visualization of time series data: The above graph captures the essence of a slight uptrend over the course of 12 weeks but leaves out further details

## How to Treat Missing Values in Your Data : Part II

In the previous article, we discussed some techniques to deal with missing data. We will now look at an example where we shall test all the techniques discussed earlier to infer or deal with such missing observations. With the information on Visits, Transactions, Operating System, and Gender, we need to build a model to predict Revenue.

## How to Detect Outliers Using Parametric and Non-Parametric Methods : Part II

In the previous article, we discussed what an outlier is and ways to detect such outliers with parametric and non-parametric methods by conducting a univariate and bivariate analysis. Let’s now look at Clustering, a non-parametric method and a popular data mining technique to detect such outliers when we are dealing with many variables or in

## How to Treat Missing Values in Your Data : Part I

One of the most excruciating pain points during the Data Exploration and Preparation stage of an Analytics project is missing values. How do you deal with missing values: ignore or treat them? The answer would depend on the percentage of those missing values in the dataset, the variables affected by missing values, whether those missing values are a

## How to Detect Outliers Using Parametric and Non-Parametric Methods : Part I

An Outlier is an observation or point that is distant from other observations/points. But how would you quantify the distance of an observation from other observations to qualify it as an outlier? Outliers are also referred to as observations whose probability of occurring is low. But, again, what constitutes low? There are parametric methods and

## How to Represent Data with Intelligent Use of the Coordinate System

The most widely used coordinate system for representing data is the Cartesian system, followed by the Polar system. Basically, the Cartesian coordinate system uses a grid of straight lines while the Polar coordinate system uses a grid of circles to represent data. Let’s now look at a few examples where with the appropriate use of the

## How to Bin or Convert Numerical Variables to Categorical Variables with Decision Trees

Why would you want to convert a numerical variable into a categorical one? Depending on the situation, it can lead to a better interpretation of the numerical variable, quick segmentation or just an additional feature for building your predictive model by creating bins for the numerical variable. Binning is a popular feature engineering technique. Suppose your hypothesis

## Exploring the Relationship between Variables Visually

As an analyst, you can explore the relationship between variables both quantitatively and visually. However, looking only at quantitative indicators like correlation could leave out much of the bigger picture. Numerical vs. Numerical: Anscombe's quartet is a classic example. The points in the datasets are such that the

## How to Use Cohort Data to Analyze User Behavior

In the world of data analysis, one tool is often left unused. While being a very powerful analytics tool, cohorts are often pushed aside due to their seemingly complex nature. With a lot to offer in the way of data analysis, let’s take a deeper (yet simplified) look into cohorts. Let’s start by explaining what