
Data Cleansing

  • July 07, 2023

Meet the Author: Mr. Bharani Kumar

Bharani Kumar Depuru is a well-known IT personality from Hyderabad. He is the Founder and Director of Innodatatics Pvt Ltd and 360DigiTMG. An alumnus of IIT and ISB with more than 18 years of experience, he has held prominent positions at IT majors such as HSBC, ITC Infotech, Infosys, and Deloitte. He is a sought-after IT consultant specialising in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence, and Business Management. Bharani Kumar is also the chief trainer at 360DigiTMG, with more than ten years of training experience, and has been making the IT transition journey easy for his students. 360DigiTMG is at the forefront of delivering quality education, bridging the gap between academia and industry.


The "new oil" is data. Even if you might think this is a cliché, it is nonetheless an obvious fact. Due to data, the globe has seen a quick speed of expansion, progress, and transformation. Elections have been won by governments. We have entered the era of artificial intelligence thanks to the disruption of the corporate ecosystem by organisations, academic institutions, and startups. This constant harping on the value of data. Everything appears to be incredibly simple and entertaining. Data does, however, provide a unique set of difficulties.

Historically, there has rarely been a deliberate effort to collect, organise, and retain data in a structured way. Before we can produce any real output or draw meaningful insights, the data must first be cleaned and prepared. Inaccurate or inconsistent data leads to false conclusions.

According to Maletic & Marcus (2000), "data cleansing is the process of removing errors and inconsistencies in data and resolving the object identity problem." Raw data from any source is typically dirty, so it must be cleaned before it is mined for any purpose. Cleaning data takes a lot of time and effort, and the task can be done with open-source tools such as R and Python. Numerous factors can make data unclean and untidy:


  • Outliers or Extreme Values
  • Missing values
  • Standardization/Normalization
  • Transformation
  • Discretization
  • Dummy Variable Creation
  • Typecasting
  • Duplication
  • Zero and near-zero variance features
  • Text/String issues

For ease of understanding, we will look at these challenges from the perspective of the nature of the data:

  • Structured Data
  • Unstructured Data

Let us start with structured data cleansing methods.

Structured Data: Cleansing Methods

For our purposes, structured data is a dataset neatly organised in a two-dimensional table, such as the one below:

Such datasets are made available through surveys, designed experiments, records of business transactions, and so on. However, such records cannot be used directly to derive insights or build prediction models, as they may have many intrinsic challenges that need to be addressed first.

1. Outliers or Extreme Values

Outliers in a dataset are, to put it simply, very high or low values. Outliers distort the data's normal distribution and can artificially inflate or deflate the mean. They may occur for a variety of genuine reasons. One of the most important steps in the data cleansing process is to locate the outliers and deal with them. Visualisation is one of the most effective ways to spot outliers, and boxplots are particularly useful for this; if you want to learn more about boxplots, you may look here. A boxplot is shown in the visualisation below, where we can observe outliers at the plot's extreme ends. These values are excessive. Outliers can be analysed with various packages and modules in the two open-source languages, Python and R, such as pandas, matplotlib, and seaborn in Python, and ggplot2 or the base boxplot function in R.

Boxplot with outliers at extreme ends
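As a minimal illustration in Python (the "Salaries" column below is hypothetical), outliers can be visualised with a boxplot and flagged with the familiar 1.5 × IQR rule:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical numeric column; replace with your own data
df = pd.DataFrame({"Salaries": [32, 35, 36, 38, 40, 41, 43, 45, 120, 150]})

# Visual check: points beyond the whiskers are potential outliers
df.boxplot(column="Salaries")
plt.show()

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df["Salaries"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["Salaries"] < q1 - 1.5 * iqr) | (df["Salaries"] > q3 + 1.5 * iqr)]
print(outliers)
```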

We must understand the causes of the outliers' existence before we begin the outlier treatment. Outliers can be handled in a variety of ways.

1.1. 3R Technique:

  • Rectify – If the outliers exist due to a typo or data entry errors, you may identify the individual observations and correct them manually. However, this is a tedious process.
  • Retain – If the outlier values exist due to some intrinsic change in work strategy or a natural phenomenon such as COVID-19, which has caused an irreversible change, then we can expect such values to repeat in the future and they can be used in our dataset. In such scenarios, we retain the outliers and use them separately.
  • Remove – If the outlier exists due to some tactical activity or a non-repeating random phenomenon, then such values should be removed from the dataset.

1.2. Alpha Trim or Winsorization:

If we have outliers in the data as in (Figure 2), we can apply the alpha-trim method to treat the outlier effect. Depending on the context, you may remove 5% or 10% of the data from the upper and lower extremes of the dataset.

Alpha Trim Technique to treat outliers

In (Figure 2) we can observe that we have taken 10% of the data from each extreme end of the dataset and are left with 80% of the data. This gives us three datasets: 10% from the upper extreme, 10% from the lower extreme, and the remaining 80% of the original dataset. We can then build separate machine learning models on these datasets, treating the effect of outliers while still achieving our objective.
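A rough sketch of both ideas in Python: scipy's winsorize caps a chosen percentage at each tail, while simple quantile filtering separates the extreme slices as described above (the values below are made up):

```python
import numpy as np
from scipy.stats.mstats import winsorize

data = np.array([5, 7, 8, 9, 10, 11, 12, 13, 14, 200])  # one extreme value

# Winsorization: clamp the lowest and highest 10% to the nearest retained value
capped = winsorize(data, limits=[0.10, 0.10])

# Alpha trimming: keep the middle 80% and set the 10% tails aside
lower, upper = np.quantile(data, [0.10, 0.90])
middle = data[(data >= lower) & (data <= upper)]
low_tail = data[data < lower]
high_tail = data[data > upper]
```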

1.3. Masking

There can be several outliers in a dataset, and the number of outliers may not be readily apparent from a boxplot. It can also happen that a second outlier prevents the outlier test from detecting the first one. This is known as masking.

Masking example

The graphic above shows two nearly identical datasets with outlier values. In the visualisation on the right, the outlier in the first dataset has been identified and is shown in red, whereas the second dataset contains two outliers that, due to masking, have not been flagged and are treated like regular data points. Cook's distance is one approach for addressing this problem, though we may still need to explore alternative approaches.
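As a rough sketch (with hypothetical arrays x and y and a simple linear regression), Cook's distance can be computed with statsmodels, and points with unusually large values deserve a closer look:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data with a couple of suspicious points at the end
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([2, 4, 6, 8, 10, 12, 14, 16, 60, 62], dtype=float)

model = sm.OLS(y, sm.add_constant(x)).fit()
cooks_d, p_values = model.get_influence().cooks_distance

# A common rule-of-thumb cut-off is 4/n
threshold = 4 / len(x)
print(np.where(cooks_d > threshold)[0])  # indices of influential points
```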

1.4. Swamping

When non-outliers are classified as outliers, the phenomenon is termed swamping. In the process of identifying multiple outliers, there is a high chance that, even if there is just one outlier in the data, other data points also get classified as extreme values.


2. Missing Values

Data missingness is a pervasive issue. It simply means that, for a variety of reasons, certain values in the dataset may be missing, rendering the dataset incomplete. Missing values are bad for machine learning models because most algorithms exclude such observations while building a prediction or classification model. Clearly, the more data we have, the better our ability to build machine learning models, so removing or dropping observations is not the answer; no competent data scientist favours data loss. As a general rule, observations with missing values are eliminated only if the missingness in the data is less than 5%, and even then we strongly caution against such a tactic.

Missing data treatment is crucial and may be accomplished in a variety of ways. The first approach is imputation, a technique for substituting a logical value for a missing one. Although it is an artificial procedure with the drawback of interfering with the natural distribution of the data, data scientists have reported positive, advantageous results when imputation is carried out appropriately. There are several approaches to imputing data.

2.1. Mean Imputation

If we have a continuous variable, we can take its mean and impute it in place of the missing values.

Mean imputation

2.2. Median Imputation

If there are outliers in the dataset, the mean is not a suitable value to impute in place of missing values. In such a scenario, we use the median value to impute instead.
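A minimal pandas sketch of both approaches, using a hypothetical numeric "age" column with missing entries:

```python
import numpy as np
import pandas as pd

# Hypothetical column with missing entries and one outlier (95)
df = pd.DataFrame({"age": [23, 31, np.nan, 45, 29, np.nan, 95]})

# Mean imputation: suitable when the variable has no strong outliers
df["age_mean"] = df["age"].fillna(df["age"].mean())

# Median imputation: more robust when outliers are present
df["age_median"] = df["age"].fillna(df["age"].median())
print(df)
```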

2.3. Hot Deck imputation

Both continuous and categorical data types can be used with this technique, though it is most beneficial for categorical variables. It involves copying values from other, comparable records, so sufficient data must be available in order to apply this approach. Please refer to this page to learn how to use R's hot deck approach. The claimants dataset (Table 1) contains 1390 observations and 4 attributes of interest (CLMSEX, CLMINSUR, SEATBELT, and CLMAGE) with missing data. Using the R and Python versions of the hot deck approach, we can impute these four attributes. The Python and R code is discussed below, respectively.

We may impute in Python using either the forward fill ("ffill()") or the backward fill ("bfill()") method. Data is copied from the previous comparable row in a forward fill and from the next comparable row in a backward fill. Any of the 4 attributes may be imputed by passing its name as the 'variable_name' in the code below.
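A minimal sketch of such a fill (the small claimants-style frame below is made up; replace "CLMSEX" with any of the four attributes):

```python
import numpy as np
import pandas as pd

claimants = pd.DataFrame({"CLMSEX": [1, np.nan, 0, 0, np.nan, 1]})

# Forward fill: copy the value from the previous non-missing row
claimants["CLMSEX_ffill"] = claimants["CLMSEX"].ffill()

# Backward fill: copy the value from the next non-missing row
claimants["CLMSEX_bfill"] = claimants["CLMSEX"].bfill()
print(claimants)
```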

In R, we perform the imputation using a package named "hot.deck". The "hot.deck" function applies different imputations on the missing data in the dataset. It uses either the "best cell" (the default) or the "probabilistic draw" approach.

Hot.deck

2.4. MICE Imputation

Multivariate Imputation by Chained Equations (MICE) is a very powerful method to impute when we have missing values across multiple variables in a dataset. The mice library for R was developed by Stef van Buuren; you may refer to the library documentation here. It can impute categorical and continuous missing values and automatically identifies numeric, binary, ordered, and unordered variables. It offers flexibility through multiple imputation techniques, as shown in (Table 2).

Table 2: MICE imputation techniques (Source: Journal of Statistical Software)

MICE imputation techniques

Below is a simple implementation of MICE imputation in R. We have used the claimants dataset (refer to Table 1) to impute; it is a multivariate dataset with missing values in multiple variables. There are 6 features of interest, excluding the first column.

R Code implementation:


According to the table above, 1096 out of 1340 observations have no missing data, while 244 observations contain missing values. The variables "Attorney" and "Loss" have all valid values. There are 12 missing values for "CLMSEX", 41 for "CLMINSUR", 48 for "SEATBELT", and 189 for "CLMAGE", giving 290 missing values in total.

The graphic above displays the distribution and proportion of missingness for each variable; missing data are shown in red. To fill in the missing data, we now apply MICE imputation. In order to select the imputed dataset that most closely resembles the natural distribution of the data, we should produce several versions of it; here, we generate 5 imputed datasets with a maximum of 40 iterations.

Dataset with the 290 missing values imputed

Here, we can see that the variable "CLMAGE" has undergone 5 separate sets of imputations. The first column above shows the index of the observations where the age variable was missing. The results of the imputation for this variable are summarised below.

propagation of missingness

From the summary above, the fifth set of imputed values makes the most sense to use: its minimum age is plausible, whereas age cannot start at zero. We may therefore use the fifth batch of imputed values from the dataset. The strip plot below shows how the imputed values for "CLMAGE" are distributed relative to the natural distribution of the data.

Density plots can also be used to evaluate the quality of fit for the imputed data.

imputed values for CLMAGE

Density plots for CLMAGE

Here, we can see all five imputed sets of values for the variable "CLMAGE" visualised; only one curve appears to closely match the natural distribution of the data. MICE imputation is thus a very effective approach. On the other hand, owing to the nature of the chained-equation calculations, its performance degrades on big datasets.
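The workflow above uses R's mice package. For Python users, scikit-learn's experimental IterativeImputer follows a similar chained-equations idea; the sketch below is only an illustration on made-up values and hypothetical columns, not a reproduction of the R results above:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical numeric subset of a claimants-style dataset
df = pd.DataFrame({
    "CLMAGE":   [23, np.nan, 45, 37, np.nan, 52],
    "LOSS":     [0.5, 3.2, np.nan, 1.1, 0.8, 4.0],
    "CLMINSUR": [1, 1, 0, np.nan, 1, 0],
})

# Each feature with missing values is modelled from the others, iteratively
imputer = IterativeImputer(max_iter=40, random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed.round(2))
```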

3. Standardization/Normalization

The problem of unit and scale is one of the most common issues in any multivariate dataset. In building prediction models, we must ensure that each variable influences the outcome equally; it should not be the case that one independent variable has a larger influence over the dependent variable than the rest simply because of its unit and scale. In the summary below (Table 3) of a multivariate dataset, the independent variables "HP", "VOL", "SP", and "WT" are all on different units and scales and can therefore unduly influence the dependent variable "MPG".

Table 3: Summary of a multivariate dataset

To fix this problem, two methods can be utilized:

3.1. Standardization

By mathematically forcing the mean to zero and the standard deviation to one for every variable, the "Z" transformation removes the effects of scale and unit from the data. The formula is z = (x − μ) / σ, where μ is the variable's mean and σ is its standard deviation.

By standardising the data, we force all of the variables in the dataset to fall, ideally, roughly within the range of -3.9 to +3.9. The effect of scaling is shown in the summary of the standardised data below.

Summary of the standardised data

The details of the arguments of the function can be found here.

arguments of the function

In Python, the sklearn package can be used to standardise the dataset.

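A minimal sketch, assuming a small cars-style frame with the "HP", "VOL", "SP", and "WT" columns from the summary above (the values are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Columns follow the summary above (HP, VOL, SP, WT); values are hypothetical
cars = pd.DataFrame({
    "HP":  [49, 55, 60, 110, 120],
    "VOL": [89, 92, 97, 50, 60],
    "SP":  [104, 105, 106, 140, 160],
    "WT":  [28, 30, 32, 15, 17],
})

scaler = StandardScaler()
cars_std = pd.DataFrame(scaler.fit_transform(cars), columns=cars.columns)
print(cars_std.describe().round(2))  # mean ~0 and std ~1 for every column
```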

3.2. Min-Max Scaler or Range Method

This method also resolves the issue of unit and scale in the data by applying the below formula:

x_scaled = (x − min(x)) / (max(x) − min(x))

Once we apply this formula, the minimum value of every variable becomes zero and the maximum becomes one. Although we can use the "normalize" function in R directly on the dataset to scale the data between zero and one, we can also write our own user-defined function and apply it to the dataset to achieve min-max scaling. Below is the R code snippet:

R code for min-max scaling

The below summary reflects the effect of this method:

Summary after min-max scaling

In Python, the min-max scaler can be implemented using the sklearn package.

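A minimal sketch along the same lines (values made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

cars = pd.DataFrame({
    "HP":  [49, 55, 60, 110, 120],
    "VOL": [89, 92, 97, 50, 60],
})

scaler = MinMaxScaler()  # rescales every column to the [0, 1] range
cars_mm = pd.DataFrame(scaler.fit_transform(cars), columns=cars.columns)
print(cars_mm.min().tolist(), cars_mm.max().tolist())  # each column now spans 0 to 1
```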

4. Transformation

Data must be linearized in order to create strong, highly accurate prediction models. We may need to apply a mathematical transformation to naturally occurring data in order to make it linear or to approximately match a normal distribution. The "Normal Q-Q Plot" below shows the variable "work experience" for 773 people. The data does not follow a linear pattern and exhibits non-linear, most likely exponential, behaviour.

normal Q-Q plot

This can be corrected by applying a logarithmic transformation to the data so that it approaches a log-normal distribution and linearity. R makes this simple; the piece of code is shown below.

We can see from the visualisation below that the data is now approximately normally distributed.

Visualisation of the Q-Q plot after log transformation

Log transformations are not the only option; there are several others, including power, inverse, and square root transformations. There is no general guideline for which transformation to use; it comes down to experimentation. Although it would still not be a thumb rule, someone with a lot of experience may be able to judge from context which transformation to apply.

Implementation of Q-Q plot
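For Python users, a rough equivalent can be sketched with scipy's probplot, here on a simulated, hypothetical work-experience variable:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
workex = rng.lognormal(mean=2.0, sigma=0.6, size=773)  # skewed, exponential-looking data

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
stats.probplot(workex, dist="norm", plot=axes[0])          # curved: not normal
stats.probplot(np.log(workex), dist="norm", plot=axes[1])  # roughly straight after log
axes[0].set_title("Raw work experience")
axes[1].set_title("Log-transformed")
plt.show()
```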

5. Discretization

There are instances when we need to convert continuous data into discrete (factor) form. This is also called binning or grouping; essentially, we build equal-sized intervals. We may need to do so because, for example, a basic decision tree algorithm calculates entropy or information gain on data in discrete form. Continuous data can easily be discretized using R or Python. In R we use the "discretize" function from the "arules" package to convert numeric features to factors. We will use an education dataset for this purpose; its summary is below:

Summary of the education dataset

Below is the R code snippet for discretizing both the features along with summary:

R code snippet

Here we can observe that both features have been converted to factors with 3 levels; this is the default behaviour. To further understand how to manipulate the classes and methods, please refer here.

We can use the cut function from the pandas package to bin the data. Here we show how "GMAT" scores can be binned by defining the bins. The Python code snippet for discretization is below.

Python code snippet

The output of binning is as below:

output of binning
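A minimal pandas sketch of the same idea (the scores, bin edges, and labels below are hypothetical):

```python
import pandas as pd

education = pd.DataFrame({"gmat": [580, 640, 690, 710, 750, 620, 560, 700]})

# Hypothetical bin edges; adjust to the scale of your scores
bins = [500, 600, 650, 700, 800]
labels = ["low", "medium", "high", "very high"]
education["gmat_bin"] = pd.cut(education["gmat"], bins=bins, labels=labels)
print(education["gmat_bin"].value_counts())
```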


6. Dummy Variable Creation

No direct statistical analysis or model building can be done on categorical data that has two or more categories; we must first convert it into dummy-variable form. Suppose we have a variable called "type of car" with four levels: "Petrol", "Diesel", "CNG", and "LPG". We cannot use the textual data directly if this attribute has to be used as an input variable in a prediction model, so we must transform it into dummy variables. Converting categorical data into dummy variables is referred to as "one-hot encoding". A one-hot encoded data format is shown below.

It is important to keep in mind that only n-1 encoded features are needed when encoding a categorical variable. If there are four levels, we keep only three, since the fourth level is implied when the other three are all zero. The caret package in R can be used to construct dummy variables; its dummyVars function is a convenient way to create them.

Dummy variable creation

Dummy variable creation can be done in Python using the get_dummies function from the pandas package. Below is the Python code snippet and output:

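A minimal sketch, assuming a hypothetical "fuel_type" column for the "type of car" example:

```python
import pandas as pd

cars = pd.DataFrame({"fuel_type": ["Petrol", "Diesel", "CNG", "LPG", "Petrol"]})

# drop_first=True keeps n-1 columns, since the dropped level is implied
dummies = pd.get_dummies(cars["fuel_type"], prefix="fuel", drop_first=True)
print(dummies.head())
```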

8. Duplication

Duplication in a dataset can happen for various reasons. Handling duplication is the easiest job to perform, because we simply remove the duplicated observations or variables. We can use the "type of car" dataset to drop duplicate observations. In R we can use the "distinct()" function from the "dplyr" package to remove duplicates. In the "type of car" dataset, the variable has 4 levels, but each level is duplicated 5 times. We use the distinct function on the dataset to get the unique values. Below is the R code snippet and output:

R code for removing duplicates

Output:

output of duplication

In Python, we can use the “drop_duplicates()” function to handle duplication. Below is the Python code snippet and output:

handle duplication

Output:

Output of handling duplication
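A minimal pandas sketch of the same operation on a hypothetical "type of car" column:

```python
import pandas as pd

# "Type of car" column with each level repeated several times
cars = pd.DataFrame({"fuel_type": ["Petrol", "Diesel", "CNG", "LPG"] * 5})

unique_cars = cars.drop_duplicates()
print(unique_cars)  # only the four unique rows remain
```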

9. Zero and Near-Zero Variance Features

If a variable has constant or near-constant values, such as the age of every person in a business unit being 28, or all of them being male, then no further information or insight can be derived from that variable. It adds no value for extracting insights or developing a prediction model, so we can safely drop it from the dataset.
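Programmatically, R users typically reach for the caret package's nearZeroVar function; in Python, a rough sketch with scikit-learn's VarianceThreshold (hypothetical columns) looks like this:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

df = pd.DataFrame({
    "age":    [28, 28, 28, 28, 28],   # zero variance: everyone is 28
    "salary": [40, 55, 61, 48, 70],
})

selector = VarianceThreshold(threshold=0.0)  # drops constant columns
kept = selector.fit_transform(df)
print(df.columns[selector.get_support()])    # only 'salary' survives
```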

10. Text/String Issues

Unstructured data in the form of text is widely available in print media, social media, academia, and so on. Web scraping allows us to obtain text-based resources as well. The challenge of data cleaning is made more difficult by the fact that the way textual data is stored varies and depends on the organisation's data management strategy. There are some extremely helpful methods for cleaning textual data that can be carried out in both R and Python. We are all familiar with text mining methodology, and data cleaning can now be carried out almost automatically thanks to some very advanced natural language processing tools; Python's spaCy package does a fantastic job.

There are situations, though, where we cannot rely only on these libraries to clean and prepare the data. Therefore, in order to overcome such difficulties without utilising any particular libraries, we must turn to standard data cleaning techniques.


For the purposes of this blog, we will refer to the "Nobel Prizes" exercise from Ghent University in Belgium. You may retrieve the "prizes.csv" dataset and the problem statement from this page. The unstructured data that has to be cleaned and given a proper structure is shown in the image below.

Dataset for text/string issues

This dataset includes every recipient of the prestigious Nobel Prize across various scientific and cultural fields over the years. The data is in an unorganised form, so we need to clean it up and put it into the neatly structured format shown below.


The Python code and implementation to clean the data and produce the required output are provided below. Here, we write a function to clean up the data and deliver it in an organised form.

Python IDE
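Purely as an illustrative sketch (the "laureate" column and its formatting problems below are assumptions, not the actual layout of prizes.csv), such a cleaning function might strip whitespace, collapse repeated spaces, drop stray punctuation, and normalise case:

```python
import re
import pandas as pd

def clean_text(value: str) -> str:
    """Trim whitespace, collapse repeated spaces, and drop stray punctuation."""
    value = value.strip()
    value = re.sub(r"\s+", " ", value)
    value = re.sub(r"[^\w\s,.-]", "", value)
    return value.title()

# Hypothetical raw column, not the real prizes.csv structure
raw = pd.DataFrame({"laureate": ["  marie   CURIE!! ", "albert einstein  ", " niels BOHR"]})
raw["laureate"] = raw["laureate"].apply(clean_text)
print(raw)
```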

After running the code in a Python IDE, we obtain the data in a neat and organised form. The results below validate the data cleaning approach.

These are a few data cleansing techniques that we can apply before mining the data to organise, standardise, and clean it, so that we can draw insightful conclusions and build strong, more accurate machine learning models.

The data preparation step is not restricted to the situations covered in this blog post, though; numerous other issues will need to be addressed as they arise, case by case.
