EDA - Data Imputation

Shravan C
2 min read · Mar 21, 2020

Hello, let's get straight to business. Data imputation is about filling in the missing values of your dataset. pandas is used throughout, and instead of a particular dataset, a small DataFrame is created for the purpose of illustration.

This is done in four steps:

  • Data visualization with clean data
  • What an outlier is and how to deal with it
  • Data imputation for missing data
  • Data imputation using K-means clustering

The first part of the Colab example creates a DataFrame and then inspects it with pandas describe and a boxplot. The boxplot shows no outliers, as expected, because none were put into the data.
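As a rough sketch of this first step (the Colab's actual data is not reproduced here, so the column names and values below are made up), the snippet builds a small DataFrame, calls describe, and applies the 1.5 × IQR rule that a boxplot uses to decide which points to draw as outliers:

```python
import pandas as pd

# Hypothetical toy data; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "age": [22, 25, 27, 30, 31, 35, 38, 41],
    "salary": [30, 32, 35, 40, 42, 45, 50, 52],
})

# Summary statistics: count, mean, std, min, quartiles, max.
summary = df.describe()
print(summary)


def iqr_outliers(frame: pd.DataFrame) -> pd.DataFrame:
    """Boolean mask of values outside the 1.5 * IQR whiskers of a boxplot."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    return (frame < q1 - 1.5 * iqr) | (frame > q3 + 1.5 * iqr)


# Number of flagged values per column; zero everywhere for this clean data.
outlier_counts = iqr_outliers(df).sum()
print(outlier_counts)
```

To see the same thing visually, `df.boxplot()` draws one box per column (it needs matplotlib installed).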

In the second part, an outlier is introduced by adding a value that is very large compared to the rest of the data. The boxplot now makes it evident that the data contains an outlier. Usually, features with outliers are separated from the rest and scaled differently: logarithmic scaling or robust scaling is applied to them, while regular scaling is used for the features without outliers. Scaling is important because it puts features on comparable ranges, so that a machine learning model does not favor a feature simply because its values are larger.
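The idea can be sketched as follows. This is only an illustration under assumptions: the toy data is made up, the 1.5 × IQR rule is used to decide which columns count as "outlier features", and StandardScaler stands in for "regular scaling"; the Colab may do this differently.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical toy data; the last salary is the injected outlier.
df = pd.DataFrame({
    "age": [22, 25, 27, 30, 31, 35, 38, 41],
    "salary": [30, 32, 35, 40, 42, 45, 50, 5000],
})

# Flag columns that contain at least one value outside the boxplot whiskers.
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
has_outlier = ((df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)).any()
outlier_cols = has_outlier[has_outlier].index.tolist()
regular_cols = has_outlier[~has_outlier].index.tolist()

# Regular scaling (zero mean, unit variance) for features without outliers.
regular_scaled = pd.DataFrame(
    StandardScaler().fit_transform(df[regular_cols]),
    columns=regular_cols,
    index=df.index,
)

# Robust scaling centers on the median and divides by the IQR,
# so a single extreme value does not distort the scale.
robust_scaled = pd.DataFrame(
    RobustScaler().fit_transform(df[outlier_cols]),
    columns=outlier_cols,
    index=df.index,
).add_suffix("_robust")

# Logarithmic scaling compresses large values; log1p is log(1 + x).
log_scaled = np.log1p(df[outlier_cols]).add_suffix("_log")

scaled = pd.concat([regular_scaled, robust_scaled, log_scaled], axis=1)
print(scaled)
```

Both treatments of the outlier column are shown side by side here for comparison; in practice you would pick one of them.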

In the third and fourth parts, missing data is introduced. We first check where the values are missing, and describe is used again to analyze the distribution. Two methods of filling in the missing data are then discussed. The first is the regular method, which fills the gaps with a value chosen from your own data analysis. The second uses cluster analysis to impute the data, which helps reduce the bias that the imputation process can introduce.
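Here is a minimal sketch of both methods. The details are assumptions, not necessarily what the Colab does: the toy data is made up, the "regular" method is taken to be a median fill, and the cluster-based method groups rows with K-means on the fully observed columns and fills each gap with the mean of that column within the row's cluster.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy data with two clearly separated groups and two missing salaries.
df = pd.DataFrame({
    "age":    [22, 24, 25, 27, 50, 52, 55, 58],
    "salary": [30, np.nan, 34, 36, 90, 95, np.nan, 105],
})

# Check where the values are missing.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Method 1: fill every gap with one global statistic (here the column median).
simple = df.fillna(df.median())

# Method 2: cluster rows using only the columns that have no missing values,
# then fill each gap with the mean of that column within the row's cluster.
complete_cols = df.columns[df.notna().all()].tolist()
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(df[complete_cols])

# transform("mean") returns a frame of per-cluster means aligned with df;
# NaN values are skipped when the means are computed.
cluster_means = df.groupby(labels).transform("mean")
clustered = df.fillna(cluster_means)

print(simple)
print(clustered)
```

With this data the median fill puts the same value, 63, into both gaps, even though one row belongs to the low-salary group and the other to the high-salary group. The cluster-based fill uses roughly 33.3 and 96.7 instead, which is the kind of bias reduction the second method is after. The number of clusters is a choice you have to make for your own data.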

If the dataset is very large, cluster analysis does take a long time, but it is what it is. Please follow along in the Colab to understand this better. I have been experimenting with all the scalers available in sklearn and with a range of regression models, and will shortly publish an article on that. Thanks for reading!!! Enjoy Coding!!!

Written by Shravan C

Software Engineer | Machine Learning Enthusiast | Super interested in Deep Learning with Tensorflow | GCP
