
CRISP - DM Data Cleansing / Data Preparation

July 15, 2024

Meet the Author : Mr. Bharani Kumar

Bharani Kumar Depuru is a well-known IT personality from Hyderabad. He is the Founder and Director of AiSPRY and 360DigiTMG. An IIT and ISB alumnus with more than 18 years of experience, he has held prominent positions at leading IT firms such as HSBC, ITC Infotech, Infosys, and Deloitte. He is an IT consultant specializing in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence and Business Management. Bharani Kumar is also the chief trainer at 360DigiTMG, with more than ten years of experience, and has been making the IT transition journey easy for his students. 360DigiTMG is at the forefront of delivering quality education, thereby bridging the gap between academia and industry.

Data Cleansing / Data Preparation

Other names for data cleaning include data preparation, data organisation, munging, and data wrangling.

Outlier or Extreme Values: Any value that deviates significantly from the rest of the data in terms of size or range.

Outliers are treated using the 3 R technique:

[Figure: the 3 R technique for treating outliers]

Winsorization Technique

Winsorization limits the influence of outliers by capping extreme values, which alters the sample distribution of the variable. In 90% winsorization, every value below the 5th percentile is set to the 5th percentile value, and every value above the 95th percentile is set to the 95th percentile value.
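The capping rule can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the helper names `percentile` and `winsorize` are hypothetical, and the percentile is assumed to use linear interpolation between closest ranks (other percentile definitions give slightly different cut-offs).

```python
def percentile(sorted_values, p):
    """Return the p-th percentile (0..100) using linear interpolation."""
    if not sorted_values:
        raise ValueError("need at least one value")
    rank = (len(sorted_values) - 1) * p / 100
    lower = int(rank)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = rank - lower
    step = sorted_values[upper] - sorted_values[lower]
    return sorted_values[lower] + step * fraction


def winsorize(values, lower_pct=5, upper_pct=95):
    """Cap values below/above the given percentiles (90% winsorization by default)."""
    ordered = sorted(values)
    low = percentile(ordered, lower_pct)
    high = percentile(ordered, upper_pct)
    # Values inside [low, high] are kept; values outside are pulled to the boundary.
    return [min(max(v, low), high) for v in values]


# Twenty ordinary readings plus one extreme value.
data = list(range(1, 21)) + [1000]
capped = winsorize(data)
```

Note that no record is dropped: `capped` has the same length as `data`, and only the extreme values move.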


Alpha Trimmed Technique

With the Alpha Trimmed Technique you choose an alpha value; for instance, if alpha = 5%, the lowest 5% and the highest 5% of the values are trimmed, that is, removed from the data.
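A minimal sketch of alpha trimming, assuming alpha is given as a whole-number percentage and that the number of values cut from each end is rounded down. The function name `alpha_trim` is hypothetical.

```python
def alpha_trim(values, alpha_pct=5):
    """Drop the lowest and highest alpha_pct percent of values.

    Returns the surviving values in sorted order.
    """
    if not 0 <= alpha_pct < 50:
        raise ValueError("alpha_pct must be in the range [0, 50)")
    ordered = sorted(values)
    k = len(ordered) * alpha_pct // 100  # number of values cut from each end
    return ordered[k:len(ordered) - k]


data = list(range(1, 21))          # 20 values: 1..20
trimmed = alpha_trim(data, 5)      # cuts one value from each end
```

Unlike winsorization, trimming shrinks the sample: here 20 values become 18.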


Missing Values

Missing values refer to data fields that may be empty or include NA, NaN, or Null.

3 Variants of Missing Values

Missingness At Random (MAR)
Missingness Not At Random (MNAR)
Missingness Completely At Random (MCAR)


Imputation

Imputation is a technique used to replace missing values with logical values. A wide variety of techniques is available, and choosing the one that fits the data is an art:

[Figure: imputation techniques]
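As an illustration, one simple family of imputation techniques fills each gap with a summary statistic of the observed values. The sketch below assumes missing entries are represented as `None` and that mean, median or mode imputation is appropriate for the column; the function name `impute` is hypothetical.

```python
from statistics import mean, median, mode


def impute(values, strategy="median"):
    """Replace None entries with the mean, median or mode of the observed values."""
    strategies = {"mean": mean, "median": median, "mode": mode}
    if strategy not in strategies:
        raise ValueError(f"unknown strategy: {strategy}")
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = strategies[strategy](observed)
    return [fill if v is None else v for v in values]


incomes = impute([1, None, 3, 100], strategy="median")   # numeric column
colours = impute(["a", "b", "a", None], strategy="mode")  # categorical column
```

The median is often preferred over the mean for skewed numeric columns, and the mode is the usual choice for categorical ones.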

Transformation

Transformation changes the shape or scale of the data's distribution so that it is better suited to analysis.

Types of transformation

Logarithmic
Exponential
Square Root
Reciprocal
Box-Cox
Johnson
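Several of these transformations are one-line formulas. The sketch below applies them to a small, strongly right-skewed sample; the sample values are made up for illustration, and the Box-Cox function assumes the usual one-parameter form, (x^lambda - 1) / lambda, with the logarithm as the lambda = 0 case.

```python
import math


def box_cox(x, lam):
    """One-parameter Box-Cox transform; defined for positive x only."""
    if x <= 0:
        raise ValueError("Box-Cox requires positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam


skewed = [1, 10, 100, 1000]  # hypothetical right-skewed sample

log_values = [math.log10(x) for x in skewed]       # logarithmic
sqrt_values = [math.sqrt(x) for x in skewed]       # square root
reciprocal_values = [1 / x for x in skewed]        # reciprocal
box_cox_values = [box_cox(x, 0.5) for x in skewed]  # Box-Cox with lambda = 0.5
```

In practice the Box-Cox lambda is estimated from the data rather than fixed by hand; 0.5 is used here only to keep the example short.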

Discretization / Binning / Grouping - Converting continuous data to discrete
Binarization - Converting continuous data into two categories (binary)
Rounding - Rounding off the decimals to the nearest integer, e.g. 5.6 becomes 6

Binning - Two types of Binning

Fixed Width Binning
Adaptive Binning
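The difference between the two can be sketched as follows, assuming that adaptive binning means quantile-based (equal-frequency) binning. The function names and the sample ages are hypothetical.

```python
def fixed_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:
        return [0] * len(values)
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - low) / width), n_bins - 1) for v in values]


def adaptive_bins(values, n_bins):
    """Assign each value to one of n_bins bins holding roughly equal counts.

    Rank-based, so tied values may be split across neighbouring bins.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * n_bins // len(values)
    return bins


ages = [1, 2, 3, 4, 5, 6, 7, 100]
by_width = fixed_width_bins(ages, 4)
by_count = adaptive_bins(ages, 4)
```

With a skewed column like this one, fixed-width binning puts almost every value in the first bin and leaves the middle bins empty, while adaptive binning spreads the values evenly across bins.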


Normalization

Normalization / Standardization - Making the data scale-free and unitless.


Methods of Normalization / Standardization include:

Standardized Scaling, also called Standardization
Min-Max Scaler, also called Normalization or Range Method
Robust Scaling

Standardization has two parts:

Mean Normalization or Mean Subtraction - Mean Normalization will make the mean of the data ‘Zero’
Variance Normalization - Variance Normalization will make the variance of the data ‘One’
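Both parts together give the familiar z-score, (x - mean) / standard deviation. A minimal sketch, assuming the population standard deviation is used; the function name `standardize` and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Subtract the mean, then divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mu) / sigma for v in values]


scores = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population standard deviation 2
z_scores = standardize(scores)
```

After the transform the values have mean zero and variance one, regardless of the original unit of measurement.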


Normalisation is also known as the Min-Max Scaler or Range method. The normalised data has a minimum value of 0 and a maximum value of 1; when the data contains negative numbers, it is sometimes scaled to the range -1 to +1 instead.

The drawback of the Min-Max Scaler is that outliers affect the scaled values, because the minimum and maximum define the scale.

Because it is based on the median and the interquartile range (IQR), robust scaling is far less affected by outliers.
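The contrast between the two scalers shows up clearly on a sample with one extreme value. This is a sketch under stated assumptions: the function names are hypothetical, and the quartiles are computed with the "inclusive" method of Python's `statistics.quantiles` (other quartile definitions give slightly different IQRs).

```python
from statistics import median, quantiles


def min_max_scale(values):
    """Rescale so the minimum maps to 0 and the maximum maps to 1."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("cannot scale a constant column")
    return [(v - low) / (high - low) for v in values]


def robust_scale(values):
    """Centre on the median and divide by the interquartile range."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("interquartile range is zero")
    centre = median(values)
    return [(v - centre) / iqr for v in values]


clean = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 100]

min_max_clean = min_max_scale(clean)
min_max_outlier = min_max_scale(with_outlier)  # ordinary values squeezed near 0
robust_outlier = robust_scale(with_outlier)    # ordinary values keep their spread
```

With the outlier present, min-max scaling compresses the four ordinary values into roughly the bottom 3% of the range, while robust scaling leaves them where they were and only the outlier gets a large score.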


Dummy Variable

Dummy variable creation converts categorical data into a numerical representation.

Techniques for Dummy Variable creation are:

[Figure: techniques for dummy variable creation]
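One widely used technique is one-hot encoding, where each category level becomes its own 0/1 column. The sketch below assumes that technique; the function name and the `is_` column prefix are hypothetical choices made for the example.

```python
def one_hot_encode(categories):
    """Turn a list of category labels into a list of 0/1 indicator rows."""
    levels = sorted(set(categories))
    return [
        {f"is_{level}": int(value == level) for level in levels}
        for value in categories
    ]


encoded = one_hot_encode(["red", "green", "red"])
```

Each row has exactly one indicator set to 1. For models that include an intercept, one of the indicator columns is commonly dropped to avoid redundancy.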

Type Casting

Type casting means converting data from one type to another, such as changing a character type to a factor type or an integer type to a floating-point type.
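A short sketch using Python's built-in conversions. The record and its field names are made up for illustration; the factor type mentioned above belongs to R and has no direct built-in equivalent in Python, so it is not shown.

```python
# Hypothetical raw record, as it might arrive from a CSV file or a form.
raw_record = {"age": "42", "salary": 55000, "zip_code": 500081}

typed_record = {
    "age": int(raw_record["age"]),            # text -> integer
    "salary": float(raw_record["salary"]),    # integer -> floating point
    "zip_code": str(raw_record["zip_code"]),  # number -> text (an identifier, not a quantity)
}
```

Casting the zip code to text is a deliberate choice: it is a label, so arithmetic on it would be meaningless.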


Handling Duplicates

Handling duplicates lets us consolidate records from many sources into a single source of truth.

For instance, a person may open a bank account, but his transactions might be recorded under John Travolta in some cases, John in others, and Travolta in others, even though all three names belong to the same individual. We therefore merge all of these names into one.
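The merge can be sketched in two steps: map every name variant to one canonical name, then drop records that have become exact duplicates. The alias table, the transaction ids and the amounts below are all hypothetical; in practice the alias mapping has to come from domain knowledge or a record-matching process.

```python
def canonical_name(name, aliases):
    """Look up the canonical form of a name; unknown names are kept as-is."""
    cleaned = name.strip()
    return aliases.get(cleaned.lower(), cleaned)


def drop_exact_duplicates(records):
    """Remove repeated records while preserving first-seen order."""
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


# Hypothetical alias table: every variant points to one canonical name.
aliases = {
    "john": "John Travolta",
    "travolta": "John Travolta",
    "john travolta": "John Travolta",
}

# Hypothetical transactions as (transaction id, name, amount); T2 appears twice.
transactions = [
    ("T1", "John Travolta", 100),
    ("T2", "John", 250),
    ("T3", "Travolta", 75),
    ("T2", "John", 250),
]

cleaned = drop_exact_duplicates(
    [(txn_id, canonical_name(name, aliases), amount)
     for txn_id, name, amount in transactions]
)
```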


String Manipulation

Working with textual data. Various ways of converting unstructured textual data into structured data are:

[Figure: string manipulation techniques]

Zero or Near Zero Variance

These are variables that take a single value for every record, or the same value for the vast majority of records. For instance, all of the zip codes are the same, or all entries in the gender column are female.

We exclude variables with zero or near-zero variance from our analysis, because they carry little or no information for telling records apart.
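A simple way to flag such columns is to check how dominant the most frequent value is. The sketch below is one possible rule, not a standard definition: the function name and the 95% threshold are assumptions chosen for illustration.

```python
from collections import Counter


def has_near_zero_variance(column, threshold=0.95):
    """Flag a column whose most frequent value covers at least `threshold` of the rows."""
    if not column:
        raise ValueError("column is empty")
    most_common_count = Counter(column).most_common(1)[0][1]
    return most_common_count / len(column) >= threshold


all_female = ["female"] * 100                    # zero variance
mostly_female = ["female"] * 96 + ["male"] * 4   # near-zero variance
balanced = ["female", "male"] * 50               # informative column

flags = [has_near_zero_variance(col) for col in (all_female, mostly_female, balanced)]
```

Columns flagged `True` are candidates for removal before modelling.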