Bharani Kumar Depuru is a well-known IT personality from Hyderabad. He is the Founder and Director of AiSPRY and 360DigiTMG. Bharani Kumar is an IIT and ISB alumnus with more than 18 years of experience, and he has held prominent positions at leading IT firms such as HSBC, ITC Infotech, Infosys, and Deloitte. He is an IT consultant specializing in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence and Business Management. Bharani Kumar is also the chief trainer at 360DigiTMG, with more than ten years of training experience, and has been making the IT transition journey easy for his students.
What is imputation, and how do we deal with missing values when we run across them in our dataset? Let's read this blog to find out more.
When we collect data from consumers or gather it from other sources, some values may be missing from the dataset for a variety of reasons. These missing values may appear as NA, blanks, or other placeholders (sometimes unusual characters), but they are not the real values that should be there. When we run our algorithms on such datasets, the models may not perform as expected and may produce unexpected results.
One way to remove the missing-data problem is to drop the rows in which data is absent. However, if we remove every row or observation that includes a blank cell, we risk losing crucial information. Since a model applied to data with missing cells does not give the intended results, we must substitute meaningful values for them. Imputation is the process of substituting or filling in the values that are missing.
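As a minimal sketch of the first step, the snippet below turns placeholder markers into proper missing values and shows how many rows would be lost by simply dropping incomplete rows. The column names, the sample values, and the list of placeholders are assumptions made for illustration, not part of any real dataset.

```python
import pandas as pd

# Hypothetical raw data: missing salaries appear as "NA", a blank, or "?".
raw = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen", "Dev", "Esha"],
    "salary": ["52000", "NA", "", "?", "61000"],
})

placeholders = ["NA", "", "?"]  # assumed markers; extend this list for your own data

clean = raw.copy()
# Blank out every placeholder, then convert what is left to numbers.
is_placeholder = raw["salary"].isin(placeholders)
clean["salary"] = pd.to_numeric(raw["salary"].mask(is_placeholder))

missing_count = int(clean["salary"].isna().sum())
rows_after_drop = len(clean.dropna())

print(clean)
print("Missing salaries:", missing_count)
print("Rows left after dropping incomplete rows:", rows_after_drop)
```

In this toy example, dropping incomplete rows keeps only two of the five observations, which is the information loss that imputation tries to avoid.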
In real-world data, we have three types of missing data: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).
We will discuss these three types of missing data using the example of a student who was due to take an exam but was unable to attend for some reason.
A student who is doing well at school and was due to sit the examination was unable to take the test because a member of his family passed away.
Simple imputation methods, such as mean imputation, can be used to fill in this type of missing information: replace the missing value with the average of the existing values in the column. If outliers are present, we may use the median instead.
The student met with an accident on the way to school. This situation was not at all expected and was completely random.
The student who was due to take the exam consciously chose not to. In this instance, the student's absence from the exam was deliberate, not accidental.
Missing data leads to imbalance in the data, symmetry problems, loss of information, and, frequently, inaccurate findings.
We occasionally have to exclude certain missing data, such as survey responses about staff pay. People with high wages may purposely withhold the figures or provide inaccurate information. In this case, we cannot assign any value to the empty cells, since doing so would give incorrect results.
Below are a few widely used imputation methods:
This is the simplest strategy for imputation.
We will work with a dataset that has missing fields to see how imputation fills in a sensible value for each missing one.
In this approach, we compute the mean or the median of the non-missing values in each column and use that value to fill the column's missing cells. This only works with numerical data.
[Figure: the dataset with missing values in the Salaries column, and the same dataset after mean imputation.]
[Figure: the dataset after imputation with median values.]
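Mean and median imputation can be sketched in a few lines of pandas. The Salaries figures below are invented for illustration; the one unusually high salary is there to show why the median can be the safer choice when outliers are present.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries with two missing entries and one outlier (150000).
df = pd.DataFrame({
    "Salaries": [40000.0, 50000.0, np.nan, 60000.0, np.nan, 150000.0]
})

# pandas skips NaN when computing these statistics.
mean_value = df["Salaries"].mean()      # (40000 + 50000 + 60000 + 150000) / 4
median_value = df["Salaries"].median()  # midpoint of 50000 and 60000

# Keep the original column and add one imputed copy per strategy.
df["mean_imputed"] = df["Salaries"].fillna(mean_value)
df["median_imputed"] = df["Salaries"].fillna(median_value)

print("Mean:", mean_value)      # Mean: 75000.0
print("Median:", median_value)  # Median: 55000.0
print(df)
```

The mean (75000) is pulled upwards by the single large salary, while the median (55000) stays close to the typical values.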
This imputation method uses the most frequent value in a column to fill the gaps in that column, and we repeat the process for each of the other columns. It is another statistical imputation method, and unlike the mean or median it also works with categorical features.
[Figure: the dataset after mode imputation.]
Missing values are imputed with the mode, the value that is repeated most often in the column.
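A minimal sketch of mode imputation on categorical columns follows. The column names and values are hypothetical. When several values are tied for most frequent, this sketch simply takes the first one pandas returns, which is an arbitrary choice.

```python
import pandas as pd

# Hypothetical categorical data with missing entries (None).
df = pd.DataFrame({
    "Department": ["Sales", "IT", None, "Sales", "HR", None, "Sales"],
    "City": ["Pune", None, "Delhi", "Delhi", None, "Delhi", "Pune"],
})

for column in df.columns:
    # mode() ignores missing values and returns the most frequent value(s).
    most_frequent = df[column].mode().iloc[0]
    df[column] = df[column].fillna(most_frequent)

print(df)
```

Here the gaps in Department are filled with "Sales" and the gaps in City with "Delhi", since those are the most frequent values in their columns.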
In this imputation, a value is selected at random from the observed values in the dataset and used to fill in the missing values. If a single draw is reused for every gap, we end up imputing all the missing cells with the same value. To avoid this, we should draw a separate value for each missing cell.
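One way to sketch random imputation is shown below. The function name `random_impute` and the sample ages are made up for this example. The function draws a separate donor value for every missing cell, so the gaps are not all filled with the same number.

```python
import numpy as np

def random_impute(values, seed=0):
    """Fill NaN entries with values drawn at random from the observed entries."""
    values = np.asarray(values, dtype=float)
    missing = np.isnan(values)
    observed = values[~missing]
    if observed.size == 0:
        raise ValueError("cannot impute: no observed values to draw from")

    rng = np.random.default_rng(seed)  # fixed seed keeps the result repeatable
    filled = values.copy()
    # One independent draw per missing cell.
    filled[missing] = rng.choice(observed, size=int(missing.sum()), replace=True)
    return filled

ages = [23, np.nan, 31, 45, np.nan, 38, np.nan]
filled_ages = random_impute(ages, seed=42)
print(filled_ages)
```

Because the draws come from the observed values, the imputed column keeps roughly the same distribution as the original data, at the cost of adding some randomness to each run unless the seed is fixed.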
Hot deck is another method of imputation. With this approach, we find another row whose remaining values match those of the row with the missing value, and we copy that row's value into the missing cell. If no row matches, we can disregard one column and retry with the remaining columns. Alternatively, we can collect all the candidate values from the matching rows and fill the gap with their average.
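The hot deck idea can be sketched as follows. This is an illustration under stated assumptions, not a standard library routine: `hot_deck_impute` is a hypothetical helper, the data is invented, and the fallback of dropping the last matching column first is just one possible ordering.

```python
import numpy as np
import pandas as pd

def hot_deck_impute(df, target, match_cols):
    """Fill NaN in `target` from donor rows that agree on `match_cols`."""
    result = df.copy()
    donors = df[df[target].notna()]
    for idx in df.index[df[target].isna()]:
        cols = list(match_cols)
        while cols:
            # A donor must agree with the incomplete row on every column in `cols`.
            same = donors[cols].eq(df.loc[idx, cols]).all(axis=1)
            if same.any():
                # Several donors may match: use the average of their values.
                result.loc[idx, target] = donors.loc[same, target].mean()
                break
            cols.pop()  # no donor found: disregard the last column and retry
    return result

staff = pd.DataFrame({
    "dept": ["IT", "IT", "HR", "HR", "IT", "Sales"],
    "level": ["junior", "junior", "senior", "senior", "senior", "junior"],
    "salary": [40000.0, 44000.0, 70000.0, np.nan, np.nan, np.nan],
})

imputed = hot_deck_impute(staff, target="salary", match_cols=["dept", "level"])
print(imputed)
```

In this example the senior HR row copies 70000 from its exact match, the senior IT row falls back to matching on dept alone and receives the average of the two IT salaries (42000), and the Sales row stays missing because no donor matches at all.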
KNN, a straightforward classification algorithm, provides an additional technique for imputation. The KNN algorithm uses 'feature similarity' to predict new values in a dataset, and the same idea can be used to impute missing data. For a row with a missing value, we find its K nearest neighbours among the rows where that value is observed, and we impute the missing value from the neighbours' values.
[Figure: the dataset after KNN imputation.]
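To make the neighbour search concrete, here is a small hand-written sketch using NumPy. It assumes the feature columns are complete and only the target column has gaps; `knn_impute` and the sample numbers are hypothetical, and averaging the neighbours' values is one common choice rather than the only one.

```python
import numpy as np

def knn_impute(features, target, k=2):
    """Fill NaN in `target` with the mean target of the k nearest complete rows."""
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    known = ~np.isnan(target)
    filled = target.copy()
    for i in np.flatnonzero(~known):
        # Euclidean distance from row i to every row whose target is known.
        distances = np.linalg.norm(features[known] - features[i], axis=1)
        nearest = np.argsort(distances)[:k]
        filled[i] = target[known][nearest].mean()
    return filled

# Hypothetical data: years of experience (feature) and salary in thousands (target).
experience = [[1], [2], [3], [8], [9], [10]]
salary = [30, 34, np.nan, 80, np.nan, 90]

filled_salary = knn_impute(experience, salary, k=2)
print(filled_salary)  # [30. 34. 32. 80. 85. 90.]
```

With more than one feature, the columns should normally be put on a comparable scale first, otherwise the feature with the largest range dominates the distance. In practice, scikit-learn's `KNNImputer` offers a ready-made version of this idea that also copes with gaps in the feature columns.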
When using imputation techniques like the mean, median, or mode, we look only at the average value or the value seen most frequently in a column and do not take into account any association between the variables. If two variables are correlated, we can fit a simple linear regression model on the complete rows and use it to predict the missing values of one variable from the other. This is called regression imputation.
From various online forums, and to some extent from my personal experience, I have learned that regression imputation can give values with noise or bias.
To sum up, there is no ideal technique for imputation. The approaches above behave differently on different datasets, so the method that best suits your imputation requirements must be found by experimenting on datasets with missing values.
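Regression imputation with a single predictor can be sketched like this. The helper name `regression_impute` and the numbers are assumptions for illustration, and the data was chosen to lie exactly on a straight line so that the result is easy to check by hand.

```python
import numpy as np

def regression_impute(x, y):
    """Fill NaN in y using a straight line fitted on the rows where y is known."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    known = ~np.isnan(y)

    # Least-squares fit of y = slope * x + intercept on the complete rows.
    slope, intercept = np.polyfit(x[known], y[known], deg=1)

    filled = y.copy()
    filled[~known] = slope * x[~known] + intercept
    return filled

# Hypothetical data: salary (in thousands) rises by 2 per year of experience.
years = [1, 2, 3, 4, 5, 6]
salary = [32, 34, np.nan, 38, np.nan, 42]

filled_salary = regression_impute(years, salary)
print(np.round(filled_salary, 2))
```

Every imputed value lies exactly on the fitted line, so the filled column shows less spread than real data would and the correlation between the two variables is overstated. This is one source of the bias mentioned above.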