Bharani Kumar Depuru is a well-known IT personality from Hyderabad. He is the Founder and Director of AiSPRY and 360DigiTMG. An alumnus of IIT and ISB with more than 18 years of experience, he has held prominent positions at IT leaders such as HSBC, ITC Infotech, Infosys, and Deloitte. He is a sought-after IT consultant specialising in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence, and Business Management. Bharani Kumar is also the chief trainer at 360DigiTMG, with more than ten years of training experience, and has been making the IT transition journey easy for his students. 360DigiTMG is at the forefront of delivering quality education, bridging the gap between academia and industry.
The "new oil" is data. Even if you might think this is a cliché, it is nonetheless an obvious fact. Due to data, the globe has seen a quick speed of expansion, progress, and transformation. Elections have been won by governments. We have entered the era of artificial intelligence thanks to the disruption of the corporate ecosystem by organisations, academic institutions, and startups. This constant harping on the value of data. Everything appears to be incredibly simple and entertaining. Data does, however, provide a unique set of difficulties.
Historically, there has rarely been a deliberate effort to gather, organise, and retain data in an organised way. Before we can produce any real output or derive any meaningful insights, the data must first be cleaned and prepared; inaccurate or inconsistent data leads to false conclusions.
According to Maletic & Marcus (2000), "data cleansing is the process of removing errors and inconsistencies in data and resolving the object identity problem." Raw data coming from any source is typically dirty, so it must be cleaned before being mined for any purpose. Cleaning data takes considerable time and effort, and the task can be completed with open-source tools such as R and Python. Numerous factors can make data unclean and untidy.
For the purpose of this discussion, we will look at these challenges from the perspective of the nature of the data:
Let us start with structured data cleansing methods.
For our purposes, structured data is a dataset neatly organised in a two-dimensional table, such as the one below:
Such datasets are made available by conducting surveys, designing experiments, maintaining records of business transactions, and so on. However, such records cannot be used directly to derive insights or build prediction models, as they may have many intrinsic problems that need to be addressed first.
Outliers in a dataset are, simply put, unusually high or low values. Outliers distort the data's distribution and can artificially inflate or deflate the mean, and they may occur for a variety of genuine reasons. One of the most important steps in the data cleansing process is to locate the outliers and deal with them. Visualisation techniques are the most effective way to spot them; boxplots in particular are useful for finding outliers in data. If you want to learn more about boxplots, you may look here. A boxplot is shown in the visualisation below; the extreme ends of the boxplot contain the outliers, i.e., the excessive values. Both of the open-source languages, Python and R, provide packages and modules for analysing outliers, for example matplotlib and seaborn in Python, or the base boxplot function and ggplot2 in R.
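As an illustration, below is a minimal Python sketch of spotting outliers with a boxplot; the "Salaries" column and its values are invented purely for the example.

```python
# A minimal sketch of spotting outliers with a boxplot in Python.
# The column name "Salaries" and the values are illustrative assumptions.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"Salaries": [42, 45, 47, 50, 52, 55, 58, 60, 250, 300]})

df.boxplot(column="Salaries")      # points beyond the whiskers are potential outliers
plt.title("Boxplot of Salaries")
plt.show()
```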
Before we begin treating outliers, we must understand why they exist. Outliers can be handled in a variety of ways.
If we have outliers in the data as in (Figure 2), then we can apply the alpha-trim method to treat their effect. Depending on the context, you may remove 5% or 10% of the data from the upper and lower extremes of the dataset. In (Figure 2) we have taken 10% of the data from each extreme end of the dataset, leaving 80% of the data. We now have three datasets: 10% of the data from the upper extreme, 10% from the lower extreme, and the remaining 80% of the original dataset. We can then build separate machine learning models on these three datasets, treating the effect of the outliers while still achieving our objective.
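A rough Python sketch of this alpha-trim split is shown below, assuming the same illustrative "Salaries" column; quantiles are used to carve out the lower 10%, upper 10%, and middle 80%.

```python
# Illustrative sketch of the alpha-trim idea: split the data into the
# lower 10%, the upper 10%, and the trimmed middle 80%.
import pandas as pd

df = pd.DataFrame({"Salaries": [42, 45, 47, 50, 52, 55, 58, 60, 250, 300]})

low, high = df["Salaries"].quantile([0.10, 0.90])

lower_10 = df[df["Salaries"] <= low]                               # lower extreme slice
upper_10 = df[df["Salaries"] >= high]                              # upper extreme slice
middle_80 = df[(df["Salaries"] > low) & (df["Salaries"] < high)]   # trimmed core
```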
There can be several outliers in a dataset, and a boxplot may not readily reveal how many there are. It can also happen that a second outlier prevents an outlier test from detecting the first one; this is known as masking.
The graphic above shows two nearly identical datasets containing outlier values. In the visualisation on the right, the outlier in the first dataset has been identified and is shown in red, while the two outliers in the second dataset have not been identified due to masking and have instead been treated like regular data points. We can try Cook's distance to address this problem, though we might still need to explore alternative approaches.
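The post's own Cook's distance code is not reproduced here; the sketch below shows one common way to compute it in Python with statsmodels, on synthetic data with an injected outlier.

```python
# A sketch of flagging influential points with Cook's distance,
# assuming a simple linear regression fitted with statsmodels on synthetic data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2 * x + rng.normal(size=50)
y[-1] += 15                                   # inject an artificial outlier

X = sm.add_constant(x)
model = sm.OLS(y, X).fit()
cooks_d, _ = model.get_influence().cooks_distance

threshold = 4 / len(y)                        # a common rule-of-thumb cutoff
print(np.where(cooks_d > threshold)[0])       # indices of influential observations
```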
When non-outliers are classified as outliers, the phenomenon is termed swamping. In the process of identifying multiple outliers, there is a high chance that, even if there is just one outlier in the data, other data points also get classified as extreme values.
Missing data is a pervasive issue. For a variety of reasons, certain values in a dataset may be missing, rendering the dataset incomplete. Missing values are bad for machine learning models because most algorithms exclude such observations while building a prediction or classification model, and the more data we have, the better the models we can build. Simply removing or dropping observations is not the answer; no competent data scientist favours losing data. As a general rule, observations with missing values are eliminated only if less than 5% of the data is missing, and even then we strongly caution against such a tactic.
Treating missing data is crucial and can be accomplished in a variety of ways. The first approach is imputation, a technique for substituting a logical value for a missing one. Although it is an artificial procedure with the drawback of interfering with the natural distribution of the data, data scientists have reported positive results when imputation is carried out appropriately. There are several approaches to imputing data.
If we have a continuous variable, we can take its mean and impute the mean in place of the missing values.
If there are outliers in the dataset, then the mean is not a suitable logical value to impute in place of missing values. In such a scenario we use the median instead, as shown in the figure below.
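A minimal pandas sketch of the mean and median imputation described above; the column names and values here are illustrative assumptions.

```python
# A minimal sketch of mean and median imputation with pandas.
import numpy as np
import pandas as pd

df = pd.DataFrame({"Salary": [30000, 32000, np.nan, 41000, 250000],
                   "Age":    [25, np.nan, 31, 38, 42]})

df["Age"] = df["Age"].fillna(df["Age"].mean())             # no outliers: mean imputation
df["Salary"] = df["Salary"].fillna(df["Salary"].median())  # outlier present: median imputation
```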
This technique can be used with both continuous and categorical data types, though it is most beneficial for categorical variables. It works by copying values from other, comparable records, so sufficient data must be available for it to be applicable. Please refer to this page to learn how to use R's hot-deck approach. The claimants dataset (Table 1) contains 1340 observations and 4 attributes of interest (CLMSEX, CLMINSUR, SEATBELT, and CLMAGE) with missing data. Using the R and Python versions of the hot-deck approach, we can impute these four attributes. The Python and R code is listed below, respectively:
In Python, we can impute using either the forward-fill "ffill()" or backward-fill "bfill()" methods. In a forward fill, data is copied down from the previous observed row; in a backward fill, from the next observed row. Any of the 4 attributes can be imputed by supplying its name in the code below.
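Since the original snippet was shown as an image, here is a hedged sketch of the same idea with pandas; the claimants column names follow the text, while the file name "claimants.csv" is an assumption.

```python
# Forward/backward fill imputation for the claimants attributes described above.
# The file name "claimants.csv" is an assumption.
import pandas as pd

claimants = pd.read_csv("claimants.csv")

for col in ["CLMSEX", "CLMINSUR", "SEATBELT", "CLMAGE"]:
    claimants[col] = claimants[col].ffill()   # propagate the last observed value forward
    claimants[col] = claimants[col].bfill()   # fill any remaining leading gaps from the next observed value
```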
In R, we perform the imputation using the "hot.deck" package. The "hot.deck" function performs multiple imputations on the missing data in the dataset, using either the "best cell" method (the default) or the "probabilistic draw" method.
Multivariate Imputation by Chained Equations (MICE) is a very powerful method when we have missing values across multiple variables in a dataset. The MICE library for R was developed by Stef van Buuren; you may refer to the library documentation here. It can impute both categorical and continuous missing values, automatically identifies numeric, binary, ordered, and unordered variables, and offers a choice of imputation techniques, as shown in (Table 2).
Table 2: MICE imputation techniques (Source: Journal of Statistical Software)
Below is a simple implementation of MICE imputation in R. We have used the claimants dataset (refer to Table 1), a multivariate dataset with missing values in multiple variables. There are 6 features of interest, excluding the first column.
According to the table above, 1096 out of 1340 observations have no missing data, while 244 observations have missing values. The variables "Attorney" and "Loss" have all valid values. There are 12 missing values for "CLMSEX", 41 for "CLMINSUR", 48 for "SEATBELT", and 189 for "CLMAGE", giving 290 missing values in total.
The graphic above displays the distribution and proportion of missingness for each variable, with missing data shown in red. To fill in the missing data, we will now use MICE imputation. In order to select the imputed dataset that most closely resembles the data's natural distribution, we should produce several versions of it; here we will generate 5 imputed datasets with a maximum of 40 iterations.
Here, we can see that the variable "IMAGE" has undergone a total of 5 separate sets of imputations. The index value of the observations where the age variable's data was missing is shown in the first column above. The results of the imputation for the variable are summarised below.
From the summary above, the fifth set of imputed values makes the most sense to use, since its minimum age value is plausible; an age cannot be zero. Therefore, we may use the fifth batch of imputed values from the dataset. The strip plot shown below demonstrates that the imputed values for "CLMAGE" follow the natural distribution of the data.
Density plots can also be used to evaluate how well the imputed data fits.
Here, all five imputed sets of values for the variable "CLMAGE" are visualised, and one curve closely matches the natural distribution of the data. MICE imputation is therefore a very effective approach; however, due to the nature of the chained-equation calculations, its performance degrades on large datasets.
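This is not the R MICE code from the post, but a rough Python analogue using scikit-learn's IterativeImputer (itself inspired by MICE); the file name "claimants.csv" is an assumption.

```python
# A Python analogue of chained-equation imputation using scikit-learn.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables the estimator)
from sklearn.impute import IterativeImputer

claimants = pd.read_csv("claimants.csv")                    # assumed file name
cols = ["CLMSEX", "CLMINSUR", "SEATBELT", "CLMAGE"]

imputer = IterativeImputer(max_iter=40, random_state=0)     # 40 iterations, as in the text
claimants[cols] = imputer.fit_transform(claimants[cols])
```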
3. Standardization / Normalization
The problem of unit and scale is one of the most common issues in any multivariate dataset. When building prediction models, we must ensure that each variable influences the outcome equally; one independent variable should not have a larger influence on the dependent variable than the others merely because of its unit and scale. In the summary below (Table 3) of a multivariate dataset, the independent variables “HP”, “VOL”, “SP”, and “WT” are on different units and scales and could disproportionately influence the dependent variable “MPG”.
Table 3: Summary of a multivariate dataset
3.1. Standardization
Applying the "Z" transformation mathematically forces the mean of every variable to zero and its standard deviation to one, eliminating the effects of scale and unit. The formula is shown below.
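For reference, the Z-score formula is: z = (x - μ) / σ, where μ is the mean of the variable and σ is its standard deviation.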
After standardising the data, the bulk of the values of every variable in the dataset will typically fall within roughly -3 to +3. The effect of scaling is shown in the summary of the standardised data below.
The details of the arguments of the function can be found here.
In Python, we use the sklearn package to standardize the dataset. Below is the code snippet:
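Since the original snippet is not reproduced here, below is a minimal sketch using scikit-learn's StandardScaler; the column names follow Table 3, while the file name "Cars_Mileage.csv" is an assumption.

```python
# Standardization with scikit-learn's StandardScaler.
import pandas as pd
from sklearn.preprocessing import StandardScaler

cars = pd.read_csv("Cars_Mileage.csv")                   # assumed file with HP, MPG, VOL, SP, WT
features = ["HP", "VOL", "SP", "WT"]

scaler = StandardScaler()
cars[features] = scaler.fit_transform(cars[features])    # each column now has mean 0, std 1
```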
3.2. Normalization (Min-Max Scaling)
This method also resolves the issue of unit and scale in the data by applying the formula below:
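For reference, the min-max scaling formula is: x_scaled = (x - min(x)) / (max(x) - min(x)).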
Once we apply this formula, the minimum value of every variable becomes zero and the maximum becomes one. Although we can use a "normalize" function in R directly on the dataset to scale the data between zero and one, we can also write our own user-defined function and apply it to the dataset to achieve min-max scaling. Below is the R code snippet:
The below summary reflects the effect of this method:
In Python, min-max scaling can be implemented using the sklearn package. Below is the Python code snippet:
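Again, the original snippet is not reproduced here; the sketch below uses scikit-learn's MinMaxScaler under the same file-name assumption as above.

```python
# Min-max scaling with scikit-learn's MinMaxScaler.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

cars = pd.read_csv("Cars_Mileage.csv")                   # assumed file name
features = ["HP", "VOL", "SP", "WT"]

scaler = MinMaxScaler()
cars[features] = scaler.fit_transform(cars[features])    # every column now lies in [0, 1]
```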
4. Transformation
Data often needs to be made approximately linear (or normal) in order to build strong, highly accurate prediction models. We may need to apply a mathematical transformation to naturally occurring data so that it becomes linear or approximately follows a normal distribution. The "Normal Q-Q Plot" below shows the variable "work experience" for 773 people; the data does not follow a straight line and exhibits non-linear, most likely exponential, behaviour.
This can be corrected by applying a logarithmic transformation to the data so that it approaches a log-normal distribution and linearity. R makes this simple to do; here is the piece of code.
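The original R snippet is not reproduced here; as a rough Python equivalent, the sketch below applies a log transformation and draws a Normal Q-Q plot. The file name "education.csv" and the column name "workex" are assumptions.

```python
# Log transformation followed by a Normal Q-Q plot (Python analogue of the R step).
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

edu = pd.read_csv("education.csv")            # assumed file with a "workex" column
log_workex = np.log(edu["workex"])            # log transform to reduce skew (assumes positive values)

sm.qqplot(log_workex, line="s")               # points close to the line suggest approximate normality
plt.show()
```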
We can see from the visualisation below that the data is now approximately normally distributed.
The log transformation is not the only option; there are several others, including power, inverse, square root, and more. There is no general rule for which transformation to use; it is a matter of experimentation. Someone with a lot of experience might be able to judge from the context which transformation to try first, but even that is not a hard-and-fast rule.
5. Discretization
There are instances when we need to convert continuous data into discrete (factor) form. This is also called binning or grouping; essentially, we build equal-sized intervals. We may need to do so, for example, in a basic decision tree algorithm, where calculating entropy or information gain requires the data in discrete form. Continuous data can easily be discretized using R or Python. In R we use the “discretize” function from the arules package to convert numeric features to factors. We will use an education dataset for this purpose; its summary is below:
Below is the R code snippet for discretizing both features, along with the summary:
Here we can observe that both features have been converted to factors with 3 levels, which is the default behaviour. To further understand how to manipulate the classes and methods, please refer here.
We can use the cut function from the pandas package to bin the data. Here we will show how “GMAT” scores can be binned by defining the bins. The Python code snippet for discretization is below.
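As the original snippet is not reproduced here, below is a hedged sketch with pandas.cut; the file name, column name, bin edges, and labels are illustrative assumptions rather than the exact ones from the post.

```python
# Binning GMAT scores into discrete categories with pandas.cut.
import pandas as pd

edu = pd.read_csv("education.csv")            # assumed file with a "gmat" column

edu["gmat_bins"] = pd.cut(edu["gmat"],
                          bins=[0, 600, 650, 700, 800],                  # assumed bin edges
                          labels=["low", "average", "good", "excellent"])
print(edu["gmat_bins"].value_counts())
```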
The output of binning is as below:
No direct statistical analysis or model building can be done on categorical data that has two or more categories. To use such data, we must convert it into dummy-variable form. Suppose we have a variable called "type of car" with four levels: "Petrol", "Diesel", "CNG", and "LPG". We cannot feed this textual data directly into a prediction model; we must first transform it into dummy variables. "One-hot encoding" refers to this process of transforming categorical data into dummy variables. A one-hot encoded data format is shown below.
It is important to keep in mind that only n-1 encoded features are needed when encoding a categorical variable: if there are four levels, we keep only the first three encoded columns, since the fourth level is implied when the other three are all zero. The caret package in R can be used to construct dummy variables; its dummyVars function is a convenient way to create them.
Dummy variables can be created in Python using the get_dummies function from the pandas package. Below is the Python code snippet and output:
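A minimal sketch of one-hot encoding with pandas.get_dummies; the column name and values mirror the "type of car" example from the text.

```python
# One-hot encoding with pandas; drop_first=True keeps only n-1 dummy columns.
import pandas as pd

cars = pd.DataFrame({"type_of_car": ["Petrol", "Diesel", "CNG", "LPG"]})

dummies = pd.get_dummies(cars["type_of_car"], drop_first=True)
print(dummies)
```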
Duplication in a dataset can happen for various reasons. Handling duplication is the easiest job to perform: we simply remove the duplicated observation or variable. We can use the “type of car” dataset to drop duplicate observations. In R we can use the “distinct()” function from the “dplyr” package to remove duplicates. In the “type of car” dataset, the variable has 4 levels, but each level is duplicated 5 times; applying the distinct function to the dataset returns the unique values. Below is the R code snippet and output:
In Python, we can use the “drop_duplicates()” function to handle duplication. Below is the Python code snippet and output:
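The original snippet is not reproduced here; the sketch below rebuilds the duplicated "type of car" example described above and removes the duplicates with pandas.

```python
# Removing duplicate rows with pandas.drop_duplicates().
import pandas as pd

cars = pd.DataFrame({"type_of_car": ["Petrol", "Diesel", "CNG", "LPG"] * 5})  # each level repeated 5 times

unique_cars = cars.drop_duplicates()   # keeps the first occurrence of each row
print(unique_cars)
```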
If any variable has zero variance, for example the age of everyone in a business unit is 28, or all of them are male, then no further information or insight can be derived from it. It adds no value for extracting insights or developing a prediction model, so we simply drop it from the dataset.
Unstructured data in the form of text is widely available in print media, social media, academia, and elsewhere; we can also obtain text by scraping the web. Data cleaning is more difficult and complex here because the way textual data is stored varies and depends on each organisation's data management strategy. There are some extremely helpful methods for cleaning textual data that can be carried out in both R and Python, and text-mining methodology is well established. Some very advanced natural language processing tools can now clean data automatically; Python's spaCy package does a fantastic job.
There are situations, though, where we cannot rely only on these libraries to clean and prepare the data. In such cases, we must turn to standard data cleaning techniques that do not require any particular library.
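To illustrate the kind of rule-based clean-up this refers to, here is a generic sketch using only Python's standard library; it is not the Nobel Prize solution from the post, just an example of regex-driven text cleaning.

```python
# A generic rule-based text cleaning function using only the standard library.
import re

def clean_text(text: str) -> str:
    text = text.lower()                          # normalize case
    text = re.sub(r"<[^>]+>", " ", text)         # strip HTML tags
    text = re.sub(r"[^a-z0-9\s]", " ", text)     # drop punctuation and symbols
    text = re.sub(r"\s+", " ", text).strip()     # collapse whitespace
    return text

print(clean_text("  <b>Marie CURIE</b> won the Nobel Prize, in 1903!! "))
```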
For the purposes of this blog, we will refer to the "Nobel Prizes" problem from the Belgian University of Ghent. You can retrieve the "prizes.csv" dataset and the problem statement from this page. The unstructured data that has to be cleaned and given a proper structure is shown in the image below.
This dataset includes every recipient of the prestigious Nobel Prize for various scientific and cultural achievements over the years. The data is unorganised, so we need to clean it up and put it into the neatly structured format below.
The Python code and its implementation for cleaning the data and producing the required output are provided below. Here, we write a function that cleans the data and returns it in an organised form.
After running the code in a Python IDE, we obtain the data in a neat and organised form. The results listed below validate the data cleaning approach.
These are a few of the data cleansing techniques we can apply before mining the data, in order to organise, standardise, and clean it, derive insightful conclusions, and build strong, more accurate machine learning models.
The data preparation step is not restricted to the situations we covered in this blog post; numerous other issues will need to be addressed as they arise, case by case.