EDA - Data Imputation

Shravan C
2 min read · Mar 21, 2020

Hello, let's get straight into business. Data imputation is about filling in the missing values in your dataset. Pandas is used throughout, and no particular dataset is involved; instead, a small DataFrame is created for illustration purposes.

This is done in four parts:

  • Data visualization with clean data
  • What an outlier is and how to deal with it
  • Data imputation for missing data
  • Data imputation using K-means clustering

The first part of the Colab example creates a DataFrame and visualizes it with pandas describe() and a box plot. Here we see that there are no outliers, simply because the data contains none.
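As a rough sketch of this first step (the column names and toy data below are my own illustration, not the DataFrame used in the Colab):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# A small made-up DataFrame standing in for the one built in the Colab
np.random.seed(42)
df = pd.DataFrame({
    "height_cm": np.random.normal(170, 8, 100),
    "weight_kg": np.random.normal(70, 10, 100),
})

# Summary statistics: count, mean, std, min, quartiles, max
print(df.describe())

# Box plot to inspect the spread; with clean data no points fall
# outside the whiskers, i.e. no outliers
df.boxplot()
plt.show()
```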

In the second part, an outlier is introduced by adding a value that is very large compared to the rest of the data. From the box plot, it is evident that the data now contains an outlier. Usually, the outlier-prone features are segregated and treated differently (in terms of scaling: logarithmic scaling and robust scaling are applied), while regular scaling is used for the non-outlier features. Scaling is important because it allows a machine learning model to give equal importance to all features.
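A minimal sketch of that split, again on made-up columns: robust or log scaling for the feature carrying the outlier, standard scaling for the rest.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, RobustScaler

np.random.seed(42)
df = pd.DataFrame({
    "height_cm": np.random.normal(170, 8, 100),
    "weight_kg": np.random.normal(70, 10, 100),
})

# Introduce an outlier: one value far larger than the rest
df.loc[0, "weight_kg"] = 500

# Illustrative split into outlier-prone and regular feature columns
outlier_cols = ["weight_kg"]
regular_cols = ["height_cm"]

# Robust scaling (median / IQR) is far less affected by the extreme value;
# a log transform is another option when values are positive and skewed
df[outlier_cols] = RobustScaler().fit_transform(df[outlier_cols])
# df[outlier_cols] = np.log1p(df[outlier_cols])  # logarithmic alternative

# Standard scaling (mean 0, std 1) for the well-behaved feature
df[regular_cols] = StandardScaler().fit_transform(df[regular_cols])
print(df.describe())
```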

In the third and fourth parts, missing data is introduced. We first check how many values are missing and where, and describe() is used again to analyze the distribution. Two methods are then discussed for filling in the missing data: 1. a regular fill based on your own data analysis, and 2. cluster analysis to impute the data, which helps ensure no bias is introduced during the imputation process.
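Here is a hedged sketch of both ideas on toy data. The column names, the median fill, and the particular way the clusters are used to fill values are my own illustration of the approach, not the exact code from the Colab.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

np.random.seed(42)
df = pd.DataFrame({
    "height_cm": np.random.normal(170, 8, 100),
    "weight_kg": np.random.normal(70, 10, 100),
})

# Introduce missing values in one column
df.loc[df.sample(frac=0.1, random_state=0).index, "weight_kg"] = np.nan

# How many values are missing, and what the distribution looks like
print(df.isnull().sum())
print(df.describe())

# Method 1: plain fill based on your own analysis (here: the column median)
df_simple = df.fillna({"weight_kg": df["weight_kg"].median()})

# Method 2: cluster-based fill. Cluster the complete rows on the fully
# observed feature, then fill each missing value with the mean of its
# cluster, so imputed values follow the local structure of the data.
complete = df.dropna()
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(complete[["height_cm"]])

df_kmeans = df.copy()
missing = df_kmeans["weight_kg"].isnull()
clusters = kmeans.predict(df_kmeans.loc[missing, ["height_cm"]])
cluster_means = complete.groupby(kmeans.labels_)["weight_kg"].mean()
df_kmeans.loc[missing, "weight_kg"] = cluster_means[clusters].values
```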

If the dataset is very large, cluster analysis does take a long time, but it is what it is. Please do go through the Colab to understand it better. I have been experimenting with all the scaler methods available in sklearn and with all possible regression models, and will shortly publish an article on that. Thanks for reading!!! Enjoy Coding!!!


Shravan C

Software Engineer | Machine Learning Enthusiast | Super interested in Deep Learning with Tensorflow | GCP