
Data Imputation Methods

July 03, 2024

Meet the Author : Mr. Bharani Kumar

Bharani Kumar Depuru is a well-known IT personality from Hyderabad. He is the Founder and Director of AiSPRY and 360DigiTMG. Bharani Kumar is an IIT and ISB alumnus with more than 18 years of experience, and has held prominent positions at firms such as HSBC, ITC Infotech, Infosys, and Deloitte. He is an IT consultant specializing in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence and Business Management. Bharani Kumar is also the chief trainer at 360DigiTMG, with more than ten years of experience, and has been making the IT transition journey easy for his students. 360DigiTMG is at the forefront of delivering quality education, thereby bridging the gap between academia and industry.

Introduction to Imputation

What is imputation? How do we deal with missing values when we run across them in a dataset? Let's read this blog to find out more.

When we collect data from consumers or gather it from other sources, some values may be missing from the dataset for a variety of reasons. These missing values may appear as NA, as blanks, or as other placeholders (sometimes unusual characters), but they are not the real values that should be there. When we run our algorithms on such datasets, the algorithms may fail or predict the output poorly, so the resulting models can give unexpected results.
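Before anything can be imputed, the different placeholders have to be recognised as missing. The following is a minimal sketch in plain Python; the marker set `MISSING_MARKERS`, the helper `normalize`, and the sample values are all made up for illustration, and a real dataset may use other placeholders.

```python
# Hypothetical set of strings that should count as "missing".
MISSING_MARKERS = {"", "na", "n/a", "nan", "null", "none", "?", "-"}

def normalize(cell):
    """Map any recognised missing marker to None; leave real values alone."""
    if cell is None:
        return None
    if isinstance(cell, str) and cell.strip().lower() in MISSING_MARKERS:
        return None
    return cell

# Made-up raw column as it might arrive from a CSV export.
raw_salaries = ["52000", "NA", "", "61000", "?", "48000"]

clean = [normalize(c) for c in raw_salaries]
missing_count = sum(c is None for c in clean)

print(clean)          # placeholders are now a single, consistent None
print(missing_count)  # number of cells that need attention
```

Turning every placeholder into one consistent marker first means the later steps only have to check for `None`.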

One way to get rid of the missing data problem is to omit the rows in which data is absent. However, if we remove every row or observation that contains a blank cell, we risk losing crucial information held in the other cells of that row. Since models do not give the intended results on cells with missing data, we must instead substitute meaningful values for them. Imputation is the process of substituting or filling in the values that are missing.

In real-world data, we have three types of missing data:

Missing at random (MAR)
Missing completely at random (MCAR)
Missing not at random (MNAR)

We will discuss these three types using the example of a student who was due to sit an exam but was not able to attend for some reason.

Missing at Random (MAR)

A good student who was due to return to school to sit an exam was unable to take it because a member of his family passed away. In general, data is missing at random when the chance of a value being missing depends only on other information that was observed, not on the missing value itself.

Simple imputation methods, such as mean imputation, can be used to fill in this type of missing data: replace the missing value with the average of the existing values in the column. If outliers are present, we may replace it with the median instead.

Missing Completely at Random (MCAR)

The student met with an accident on the way to school. This was not expected at all and was completely random: the missing value is unrelated to anything else in the data, observed or unobserved.

Missing Not at Random (MNAR)

The student who was due to take the exam consciously chose not to. In this instance the absence was deliberate and not accidental: the reason the value is missing is tied to the value itself, for instance when a student skips the exam because he expects a poor score.
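The difference between the three types can be made concrete with a small simulation. The sketch below is only an illustration: the `attendance` column, the function `make_missing`, and all probabilities are invented, and the scores are random numbers rather than real exam data.

```python
import random

def make_missing(scores, attendance, mechanism, rng):
    """Return a copy of scores with some entries replaced by None."""
    result = []
    for score, att in zip(scores, attendance):
        if mechanism == "MCAR":
            p = 0.3                           # same chance for every student
        elif mechanism == "MAR":
            p = 0.6 if att < 0.5 else 0.1     # depends on an observed column
        elif mechanism == "MNAR":
            p = 0.6 if score < 50 else 0.05   # depends on the hidden score itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        result.append(None if rng.random() < p else score)
    return result

def observed_mean(values):
    """Mean of the values that are not missing."""
    seen = [v for v in values if v is not None]
    return sum(seen) / len(seen)

rng = random.Random(0)                        # fixed seed for repeatability
scores = [rng.uniform(0, 100) for _ in range(2000)]
attendance = [rng.random() for _ in range(2000)]
true_mean = sum(scores) / len(scores)

mcar = make_missing(scores, attendance, "MCAR", rng)
mar = make_missing(scores, attendance, "MAR", rng)
mnar = make_missing(scores, attendance, "MNAR", rng)

print(round(true_mean, 1), round(observed_mean(mcar), 1),
      round(observed_mean(mar), 1), round(observed_mean(mnar), 1))
```

In this setup the observed mean under MNAR is pushed upwards, because low scores are the ones most likely to disappear, which is why simple imputation is risky for that type of missing data.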

Missing data leads to imbalance in the data, skewed distributions, loss of information, and frequently, inaccurate findings.

Occasionally missing values should not be imputed at all. Consider survey results about staff pay: people with high wages may purposely withhold the figure or provide inaccurate information. In this case, we cannot assign a value to the empty cells, since doing so would give incorrect results.

Below are some of the most widely used methods for handling missing data:
- Deletion Methods
  
  This is the simplest strategy for dealing with missing values.
  - Casewise deletion / Listwise deletion / Complete case deletion (drop every row that contains a missing value)
  - Pairwise deletion or Available Case Analysis (for each analysis, drop only the rows missing a value that the analysis needs)
- Simple Imputation Methods:
  - Mean Imputation
  - Median Imputation
  - Mode Imputation (most frequent value)
  - Random Imputation
  - Hot Deck Imputation
  - Regression Imputation
  - KNN Imputation
- Model-Based Methods:
  - Maximum Likelihood (EM Algorithm)
  - Multiple Imputation

We will work with a dataset with missing fields to see how imputation fills in a sensible value for each missing entry.
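The two deletion methods differ only in which rows they keep. Here is a minimal sketch, assuming rows are stored as dictionaries with `None` for missing cells; the column names, the values, and the helpers `listwise` and `pairwise` are made up for illustration.

```python
# Made-up records; None marks a missing cell.
rows = [
    {"age": 25, "salary": 40000, "experience": 2},
    {"age": 32, "salary": None, "experience": 7},
    {"age": None, "salary": 55000, "experience": 10},
    {"age": 41, "salary": 62000, "experience": None},
    {"age": 29, "salary": 45000, "experience": 4},
]

def listwise(rows):
    """Keep only rows with no missing value in any column."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise(rows, columns):
    """Keep rows that are complete for the columns a given analysis needs."""
    return [r for r in rows if all(r[c] is not None for c in columns)]

complete = listwise(rows)
age_salary = pairwise(rows, ["age", "salary"])

print(len(complete))    # rows usable when every column is required
print(len(age_salary))  # rows usable for an age-versus-salary analysis
```

Pairwise deletion keeps more rows for each individual analysis, at the cost of different analyses being based on different subsets of the data.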
Mean and Median Imputation:

In this approach, we compute the mean or the median of the non-missing values in a column and use it to fill that column's missing cells, repeating the calculation separately for each column. This only works with numerical data.
- Advantages:
  - It is easy to calculate and apply
  - It works well on small datasets
- Disadvantages:
  - It does not account for correlations between columns
  - It works on only one column at a time
  - It does not give good accuracy
  - It cannot be applied to categorical data

[Figure: code listing and the output of running it]

[Figure: missing values in the Salaries column of the dataset]

[Figure: the dataset after mean imputation]

[Figure: median imputation code]

[Figure: the dataset after median imputation, with the missing values imputed with the median]
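As a rough illustration of both variants, here is a minimal sketch in plain Python. It is not the code from the original post; the `impute` helper and the salary figures are made up, with one deliberately large outlier.

```python
from statistics import mean, median

def impute(values, strategy):
    """Fill None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

# Made-up salary column with two gaps and one outlier (250000).
salaries = [30000, 32000, None, 35000, 31000, None, 250000]

mean_filled = impute(salaries, "mean")
median_filled = impute(salaries, "median")

print(mean_filled[2])    # fill value pulled upwards by the outlier
print(median_filled[2])  # fill value close to the typical salary
```

With these numbers the mean fill is 75600 while the median fill is 32000, which shows why the median is the safer choice when outliers are present.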

Most Frequent values (Mode)

This method finds the most frequent value in a column and uses it to fill the gaps in that column. We repeat the process for all the other columns. It is another statistical imputation method, and unlike the mean and median it also works on categorical features.
- Advantages:
  - It works on categorical data and is one of the easiest ways to impute categorical data
- Disadvantages:
  - It does not account for correlations between columns
  - It can introduce bias by over-representing the most frequent value
Dataset after Mode Imputation:

Missing values are imputed with the mode, the value that is repeated most often in the column.
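Mode imputation can be sketched in a few lines. The example below is an assumption-laden illustration: the `impute_mode` helper and the department column are invented, and ties between equally frequent values are resolved in favour of whichever value appears first.

```python
from collections import Counter

def impute_mode(values):
    """Fill None entries with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    most_common_value, _count = Counter(observed).most_common(1)[0]
    return [most_common_value if v is None else v for v in values]

# Made-up categorical column with two gaps.
departments = ["Sales", "HR", None, "Sales", "IT", None, "Sales"]

filled = impute_mode(departments)
print(filled)
```

Because every gap receives the same category, the most frequent value becomes even more dominant after imputation, which is the bias mentioned above.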
Random Imputation

In this method, a value is selected at random from the observed values in the dataset and used to fill in a missing value. If a single random draw is reused for every gap, we end up imputing the entire column with the same value. To avoid this, we must draw a new value for each missing cell.
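A minimal sketch of random imputation follows, with a fresh draw for every gap. The helper `impute_random`, the seed, and the ages are made up for illustration.

```python
import random

def impute_random(values, rng):
    """Fill each None with an independently drawn observed value."""
    observed = [v for v in values if v is not None]
    # A new draw per gap; reusing one draw would fill every
    # missing cell with the same number.
    return [rng.choice(observed) if v is None else v for v in values]

rng = random.Random(42)   # fixed seed so the run is repeatable
ages = [23, None, 35, 41, None, 29, None]

filled = impute_random(ages, rng)
print(filled)
```

Drawing from the observed values keeps every imputed number realistic for the column, although individual draws can still be far from the true value.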
Hot Deck Imputation

Hot Deck is another method of imputation. With this approach, we look for another row (a donor) whose other values match those of the row with the missing value, and copy the donor's value into the gap. If no row matches, we can leave one column out of the comparison and retry with the remaining ones. Alternatively, we can list all the candidate values that could fill the gap and use their average.
- Advantages:
  - The imputed value stays within a realistic range. For example, if values are missing in the age column and the age should be between 30 and 40, the imputed value will be a number within 30-40 and nothing else.
- Disadvantage:
  - As the imputed value is chosen at random among the donors, at times it might not fit correctly.
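The donor search, including the fallback of dropping a column when nothing matches, can be sketched as follows. This is one possible reading of the procedure, not a standard implementation; `hot_deck`, the column names, and the records are hypothetical.

```python
import random

def hot_deck(rows, target, match_cols, rng):
    """Fill missing `target` values from donor rows that match on match_cols."""
    filled = []
    for row in rows:
        if row[target] is not None:
            filled.append(dict(row))
            continue
        cols = list(match_cols)
        while True:
            donors = [r[target] for r in rows
                      if r[target] is not None
                      and all(r[c] == row[c] for c in cols)]
            if donors or not cols:
                break
            cols.pop()   # no match: drop one column and retry
        value = rng.choice(donors) if donors else None
        filled.append({**row, target: value})
    return filled

# Made-up records; the last row has no exact match on dept and grade.
rows = [
    {"dept": "Sales", "grade": "A", "age": 34},
    {"dept": "Sales", "grade": "A", "age": 38},
    {"dept": "Sales", "grade": "A", "age": None},
    {"dept": "IT", "grade": "B", "age": 27},
    {"dept": "IT", "grade": "C", "age": None},
]

result = hot_deck(rows, "age", ["dept", "grade"], random.Random(1))
print(result[2]["age"], result[4]["age"])
```

The third row can only receive 34 or 38, the ages of its matching donors, while the last row falls back to matching on `dept` alone. Replacing `rng.choice(donors)` with the average of `donors` would give the averaging variant.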
KNN Imputation:

KNN (k-nearest neighbours), a straightforward classification algorithm, provides an additional technique for imputation. The KNN algorithm uses 'feature similarity' to predict new values in a dataset, and the same kind of prediction can be used to impute missing data. For a row with a missing value, we find its K nearest neighbours among the rows where that value is present and impute the missing value from theirs, for example by averaging them.

- Advantages:
  - This imputation can prove more effective than the mean, median, mode, and other simple imputation methods.
- Disadvantage:
  - K-NN behaves poorly when there are outliers in the dataset, and it also occupies more memory while computing the nearest neighbours.
  - K-NN is a lazy learner: it builds no model in advance and does all its computation at prediction time.
[Figure: dataset imputed with KNN imputation]
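To show the idea without any library, here is a small sketch that averages the target value of the k closest rows. It assumes the feature columns are fully observed and on comparable scales, which real data often is not; `knn_impute` and the records are made up.

```python
import math

def knn_impute(rows, target, features, k=2):
    """Fill missing `target` with the mean of the k nearest donors' values."""
    donors = [r for r in rows if r[target] is not None]
    result = []
    for row in rows:
        if row[target] is not None:
            result.append(dict(row))
            continue
        point = [row[f] for f in features]
        # Sort donors by Euclidean distance on the feature columns.
        nearest = sorted(
            donors,
            key=lambda d: math.dist([d[f] for f in features], point),
        )[:k]
        estimate = sum(d[target] for d in nearest) / len(nearest)
        result.append({**row, target: estimate})
    return result

# Made-up records; the last salary is missing.
rows = [
    {"age": 25, "experience": 2, "salary": 40000},
    {"age": 27, "experience": 3, "salary": 42000},
    {"age": 45, "experience": 20, "salary": 90000},
    {"age": 47, "experience": 22, "salary": 95000},
    {"age": 26, "experience": 3, "salary": None},
]

result = knn_impute(rows, "salary", ["age", "experience"], k=2)
print(result[4]["salary"])
```

The missing salary is estimated from the two junior employees rather than from the whole column, so it comes out at 41000 instead of being dragged up by the senior salaries. In practice a library implementation such as scikit-learn's `KNNImputer` would usually be preferred over hand-written code.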
Regression Imputation:

Imputation techniques such as the mean, median, or mode look only at the average or the most frequent value of a single column and do not take into account any association between the variables. If two variables are correlated, we can fit a simple linear regression model on the rows where both are present and use it to predict the missing values of one variable from the other. This is called regression imputation.
- Advantages:
  - It takes the correlation between the variables into account, so the imputed values reflect the relationships in the data.
- Disadvantage:
  - As it has to model the relationship between the variables before imputing the values, it is a time-consuming task for datasets with many variables.
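With a single predictor, regression imputation reduces to fitting a least-squares line and reading the missing values off it. The sketch below assumes the predictor column is fully observed; `fit_line`, `regression_impute`, and the data are invented, and the observed points were chosen to lie exactly on a line so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def regression_impute(xs, ys):
    """Fill None entries of ys with predictions from a line fitted on xs."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    slope, intercept = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
    return [intercept + slope * x if y is None else y
            for x, y in zip(xs, ys)]

# Made-up data: years of experience and salary, with two salaries missing.
experience = [1, 2, 3, 4, 5, 6]
salary = [32000, 34000, None, 38000, 40000, None]

filled = regression_impute(experience, salary)
print(filled[2], filled[5])
```

Every imputed value lies exactly on the fitted line, so the filled column looks more tightly correlated with the predictor than real data would be.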
Note

From various online forums, and to some extent from my personal experience, I have learned that the regression imputation method can give values that are noisy or biased.
Conclusion

To sum up, there is no ideal technique or procedure for imputation. The approaches described above behave differently on different datasets, so the method that best suits your imputation requirements must be found by experimenting on datasets with missing values.