CRISP - DM Data Cleansing / Data Preparation

  • July 15, 2023

Meet the Author : Mr. Bharani Kumar

Bharani Kumar Depuru is a well-known IT personality from Hyderabad. He is the Founder and Director of Innodatatics Pvt Ltd and 360DigiTMG. An IIT and ISB alumnus with more than 18 years of experience, he has held prominent positions at leading organisations such as HSBC, ITC Infotech, Infosys, and Deloitte. He is a prominent IT consultant specializing in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence and Business Management. Bharani Kumar is also the chief trainer at 360DigiTMG, with more than ten years of experience, and has been making the IT transition journey easy for his students. 360DigiTMG is at the forefront of delivering quality education, thereby bridging the gap between academia and industry.

Data Cleansing / Data Preparation

Other names for data cleaning include data preparation, data organisation, munging, and data wrangling.

Outlier or Extreme Values: Any value that deviates significantly from the rest of the data in terms of size or range.

Outliers are treated using the 3 R technique.


Winsorization Technique

Winsorization limits the influence of outliers by capping extreme values, which alters the sample distribution of the variable. In a 90% winsorization, every value below the 5th percentile is set to the 5th percentile, and every value above the 95th percentile is set to the 95th percentile.
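
The 90% case described above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the helper names `percentile` and `winsorize`, the linear-interpolation percentile rule, and the sample data are all assumptions made for the example.

```python
def percentile(sorted_values, p):
    """Linear-interpolation percentile (p in 0..100) of an already sorted list."""
    if not sorted_values:
        raise ValueError("need at least one value")
    rank = (len(sorted_values) - 1) * p / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = rank - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def winsorize(values, lower_pct=5, upper_pct=95):
    """Cap values below/above the given percentiles at those percentile values."""
    ordered = sorted(values)
    low = percentile(ordered, lower_pct)
    high = percentile(ordered, upper_pct)
    # Values are capped, not removed, so the sample size is unchanged.
    return [min(max(v, low), high) for v in values]


data = list(range(1, 21)) + [500]  # hypothetical sample with one extreme value
winsorized = winsorize(data)       # 500 is pulled down to the 95th percentile
```

Note that the number of observations stays the same; only the extreme values move.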


Alpha Trimmed Technique

The Alpha Trimmed Technique lets you choose an alpha value; for instance, if alpha = 5%, the lowest 5% and the highest 5% of the values are trimmed (removed).
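
A small sketch of trimming with alpha = 5% follows. The function name `alpha_trim`, the integer-percent parameter, and the sample data are assumptions for illustration; unlike winsorization, the trimmed values are dropped, so the sample gets smaller.

```python
def alpha_trim(values, alpha_pct=5):
    """Drop the lowest and highest alpha_pct percent of the values."""
    if not 0 <= alpha_pct < 50:
        raise ValueError("alpha_pct must be in [0, 50)")
    ordered = sorted(values)
    k = len(ordered) * alpha_pct // 100  # number of values cut from each end
    return ordered[k:len(ordered) - k]


data = [-100] + list(range(1, 19)) + [500]  # hypothetical sample of 20 values
trimmed = alpha_trim(data, alpha_pct=5)     # one value removed from each end
```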


Missing Values

Missing values refer to data fields that may be empty or include NA, NaN, or Null.
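
Before treating missing values they have to be detected. The sketch below counts them in a single column; the set of text spellings in `MISSING_TOKENS` and the sample records are assumptions, and real data sets may use other markers.

```python
import math

# Assumed text spellings of "missing"; adjust to the data set at hand.
MISSING_TOKENS = {"", "na", "nan", "null", "none"}


def is_missing(value):
    """Return True for None, float NaN, or a text marker such as 'NA' or 'Null'."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str) and value.strip().lower() in MISSING_TOKENS:
        return True
    return False


records = [23, None, "NA", float("nan"), 41, "Null", "", 37]  # hypothetical column
missing_count = sum(is_missing(v) for v in records)
```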

3 Variants of Missing Values

  • Missingness At Random (MAR)
  • Missingness Not At Random (MNAR)
  • Missingness Completely At Random (MCAR)


Imputation

Imputation is a technique used to replace missing values with logical values. A wide variety of techniques is available; choosing the one that fits the data is an art.
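
Three of the simplest options are mean, median, and mode imputation. The sketch below is one possible way to write them with the standard library; the function name `impute`, the use of `None` as the missing marker, and the sample ages are assumptions.

```python
from statistics import mean, median, mode


def impute(values, strategy="median"):
    """Replace None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]


ages = [25, None, 31, 40, None, 31]  # hypothetical column with two gaps
by_median = impute(ages, "median")
by_mean = impute(ages, "mean")
by_mode = impute(ages, "mode")
```

The median is usually the safer choice when the column contains outliers, while the mode also works for categorical columns.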


Transformation

Transformation changes the underlying nature of the data to make it better suited to analysis.

Types of transformation

  • Logarithmic
  • Exponential
  • Square Root
  • Reciprocal
  • Box-Cox
  • Johnson
  • Discretization / Binning / Grouping - Converting continuous data to discrete
  • Binarization - Converting continuous data into two categories (binary)
  • Rounding - Rounding off the decimals to the nearest integer, e.g. 5.6 becomes 6
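
Several of the transformations listed above can be tried in a few lines. This is only a sketch: the income figures, the 50,000 cut-off used for binarization, and the lambda of 0.5 for Box-Cox are assumptions chosen for illustration (in practice lambda is usually estimated from the data).

```python
import math


def box_cox(x, lam):
    """Box-Cox transform of a strictly positive value x for a chosen lambda."""
    if x <= 0:
        raise ValueError("Box-Cox needs strictly positive input")
    return math.log(x) if lam == 0 else (x ** lam - 1) / lam


incomes = [1_000, 10_000, 100_000, 1_000_000]  # hypothetical right-skewed values

log_scaled = [math.log10(x) for x in incomes]            # logarithmic
sqrt_scaled = [math.sqrt(x) for x in incomes]            # square root
reciprocal = [1 / x for x in incomes]                    # reciprocal
box_cox_half = [box_cox(x, 0.5) for x in incomes]        # Box-Cox, lambda assumed
binarized = [1 if x >= 50_000 else 0 for x in incomes]   # binarization, cut-off assumed
rounded = round(5.6)                                     # rounding to nearest integer
```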

Binning - Two types of Binning

  • Fixed Width Binning
  • Adaptive Binning
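
The difference between the two can be sketched as follows, assuming that adaptive binning means quantile-based bins holding roughly equal numbers of observations. The function names, the bin width of 10, and the sample ages are hypothetical.

```python
def fixed_width_bins(values, width):
    """Fixed width binning: bin index is floor(value / width)."""
    return [int(v // width) for v in values]


def quantile_bins(values, n_bins):
    """Adaptive (quantile-style) binning: rank the values, then split the
    ranks into n_bins groups of roughly equal size."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins


ages = [3, 7, 12, 18, 25, 33, 41, 58, 64, 90]  # hypothetical ages
fixed = fixed_width_bins(ages, 10)   # some bins crowded, some empty
adaptive = quantile_bins(ages, 5)    # two observations in every bin
```

Fixed width bins are easy to interpret but can end up empty or crowded on skewed data; the rank-based version balances the counts, although tied values may land in different bins in this simple sketch.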


Normalization

Normalization / Standardization - Making the data scale-free and unitless.

Methods of Normalization / Standardization include:

  • Standardized Scaling, also called Standardization
  • Min-Max Scaling, also called Normalization or the Range Method
  • Robust Scaling

Standardization has two parts:

  • Mean Normalization or Mean Subtraction - Mean Normalization will make the mean of the data ‘Zero’
  • Variance Normalization - Variance Normalization will make the variance of the data ‘One’
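
The two parts can be seen in a short sketch: subtracting the mean centres the data at zero, and dividing by the standard deviation brings the variance to one. The function name `standardize`, the use of the population standard deviation, and the sample heights are assumptions for this example.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score scaling: subtract the mean, then divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mu) / sigma for v in values]


heights_cm = [150, 160, 170, 180, 190]  # hypothetical measurements
z_scores = standardize(heights_cm)
```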

Normalisation is also known as the Min-Max Scaler or Range method. It rescales the data so that the minimum value becomes 0 and the maximum value becomes 1. When the data contains negative numbers, the target range is sometimes set to -1 to +1 instead.

The drawback of the Min-Max Scaler is that outliers affect the scaled values, because the minimum and maximum themselves define the scale.

Because it takes into account the "Median" and "IQR," robust scaling is not impacted by outliers.
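
The contrast between the two methods shows up clearly on data with one extreme value. The sketch below is an illustration under assumptions: the function names and salary figures are hypothetical, and the quartiles come from `statistics.quantiles` with `method="inclusive"` (Python 3.8 or later), which is only one of several quartile conventions.

```python
from statistics import median, quantiles


def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale so the smallest value maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot scale a constant column")
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]


def robust_scale(values):
    """Centre on the median and divide by the interquartile range (IQR)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; robust scaling is undefined")
    med = median(values)
    return [(v - med) / iqr for v in values]


salaries = [30, 35, 40, 45, 50, 55, 60, 65, 1000]  # hypothetical, one outlier
min_max = min_max_scale(salaries)   # the outlier squeezes the rest near 0
robust = robust_scale(salaries)     # typical values keep a usable spread
```

With min-max scaling every ordinary salary ends up below 0.05, whereas robust scaling keeps them spread between roughly -1 and +1 and leaves only the outlier far away.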


Dummy Variable

A dummy variable is created by converting categorical data into a numerical representation.

Several techniques are available for creating dummy variables.
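
One widely used technique is one-hot encoding, in which each category becomes its own 0/1 column. The sketch below is a minimal version; the function name `one_hot`, the `is_` column prefix, and the colour data are assumptions for illustration.

```python
def one_hot(values):
    """Turn a categorical column into one 0/1 indicator per category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


colours = ["red", "green", "blue", "green"]  # hypothetical categorical column
encoded = one_hot(colours)
```

For models that are sensitive to redundant columns, one of the indicator columns is often dropped, since it can be inferred from the others.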


Type Casting

Type casting means converting one data type to another, such as changing a character type to a factor type or an integer type to a floating-point type.
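
A few casts in Python are sketched below. The "factor" type belongs to R; here it is approximated, as an assumption, by mapping each distinct text level to an integer code. The row and city values are hypothetical.

```python
raw_row = {"age": "42", "height_cm": 180, "city": "Hyderabad"}  # hypothetical record

age = int(raw_row["age"])                # character -> integer
height_cm = float(raw_row["height_cm"])  # integer -> floating point

# Character -> factor-like codes: each distinct level gets an integer.
cities = ["Hyderabad", "Bangalore", "Hyderabad", "Chennai"]
levels = sorted(set(cities))
codes = [levels.index(city) for city in cities]
```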


Handling Duplicates

Handling duplicates enables us to consolidate records from many sources into a single source of truth.

For instance, a person may open one bank account, yet his transactions might be recorded under John Travolta in some places, John in others, and Travolta in still others, even though all three names belong to the same individual. We therefore merge all of these names into one.
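
The bank example can be sketched with a lookup table that maps each name variant to one canonical name before totals are computed. The `ALIASES` table, the function name, and the transaction amounts are all hypothetical; in practice such a mapping would have to be built from account identifiers or fuzzy matching, which this sketch does not attempt.

```python
# Hypothetical alias table: every known variant points to one canonical name.
ALIASES = {
    "john travolta": "John Travolta",
    "john": "John Travolta",
    "travolta": "John Travolta",
}


def canonical_name(name):
    """Normalise case and spacing, then look the variant up in the alias table."""
    key = " ".join(name.lower().split())
    return ALIASES.get(key, name)


transactions = [  # hypothetical (name, amount) records
    ("John Travolta", 120.0),
    ("John", 35.5),
    ("Travolta", 10.0),
    ("john  travolta", 4.5),
]

totals = {}
for name, amount in transactions:
    who = canonical_name(name)
    totals[who] = totals.get(who, 0.0) + amount
```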


String Manipulation

String manipulation means working with textual data, including the various ways of converting unstructured text into structured data.
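
As a small illustration, the sketch below lower-cases a sentence, splits it into word tokens, counts them, and pulls out numeric identifiers with a regular expression. The sample sentence and the `#123` identifier pattern are assumptions made for the example.

```python
import re
from collections import Counter

text = "Order #123 shipped to Hyderabad; order #456 DELAYED."  # hypothetical text

tokens = re.findall(r"[a-z]+", text.lower())                # word tokens
word_counts = Counter(tokens)                               # structured: word -> count
order_ids = [int(m) for m in re.findall(r"#(\d+)", text)]   # structured: list of ids
```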


Zero or Near Zero Variance

Variables that take a single value for every record (zero variance), or the same value for the great majority of records (near zero variance). For instance, all of the zip codes are the same, or all entries in the gender column are female.

We exclude variables with zero or near zero variance from our analysis.
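
One simple way to flag such columns is to check how large a share of the rows holds the most frequent value. The sketch below does this; the 95% threshold, the function name, and the table contents are assumptions, and other rules of thumb exist.

```python
from collections import Counter


def is_near_zero_variance(column, max_share=0.95):
    """Flag a column whose most frequent value covers at least max_share of rows."""
    if not column:
        raise ValueError("empty column")
    most_common_count = Counter(column).most_common(1)[0][1]
    return most_common_count / len(column) >= max_share


table = {  # hypothetical data set with 100 rows
    "gender": ["F"] * 100,                          # zero variance
    "zip_code": ["560102"] * 97 + ["500081"] * 3,   # near zero variance
    "age": list(range(100)),                        # informative
}

kept = [name for name, col in table.items() if not is_near_zero_variance(col)]
```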
