Bharani Kumar Depuru is a well-known IT personality from Hyderabad. He is the Founder and Director of AiSPRY and 360DigiTMG. An IIT and ISB alumnus with more than 18 years of experience, he has held prominent positions at leading IT firms such as HSBC, ITC Infotech, Infosys, and Deloitte. He is an IT consultant specializing in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence and Business Management. Bharani Kumar is also the chief trainer at 360DigiTMG, with more than ten years of experience, and has been making the IT transition journey easy for his students. 360DigiTMG is at the forefront of delivering quality education, thereby bridging the gap between academia and industry.
Other names for data cleaning include data preparation, data organisation, munging, and data wrangling.
Outlier or Extreme Values: Any value that deviates significantly from the rest of the data in terms of size or range.
Winsorization limits the influence of outliers by capping them, which alters the sample distribution of the variable. In the case of 90% winsorization, every value below the 5th percentile is set to the 5th-percentile value, and every value above the 95th percentile is set to the 95th-percentile value.
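As a rough illustration of 90% winsorization, the sketch below caps values at the 5th and 95th percentiles using only the Python standard library. The helper names (`percentile`, `winsorize`), the linear-interpolation percentile rule, and the sample data are assumptions made for this example, not something prescribed by the article.

```python
def percentile(sorted_vals, p):
    """Percentile p (0-100) of a pre-sorted list, using linear interpolation."""
    if not sorted_vals:
        raise ValueError("percentile() needs at least one value")
    k = (len(sorted_vals) - 1) * p / 100
    lo = int(k)                              # index just below the target position
    hi = min(lo + 1, len(sorted_vals) - 1)   # index just above it
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)


def winsorize(values, lower_pct=5, upper_pct=95):
    """Cap values below/above the given percentiles instead of dropping them."""
    s = sorted(values)
    low, high = percentile(s, lower_pct), percentile(s, upper_pct)
    return [min(max(v, low), high) for v in values]


# Hypothetical sample: 1..20 plus one extreme value.
data = list(range(1, 21)) + [500]
clipped = winsorize(data)
print(clipped)
```

Note that the number of observations is unchanged; only the extreme values are pulled in to the percentile limits.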
With the alpha-trimmed technique, you choose an alpha value; for instance, if alpha = 5%, the lowest 5% and the highest 5% of the values are trimmed, that is, removed from the data.
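A minimal sketch of alpha trimming, assuming alpha is given as a whole-number percentage and that the count to drop from each end is rounded down. The function name `alpha_trim` and the sample data are hypothetical.

```python
def alpha_trim(values, alpha_pct=5):
    """Drop the lowest and highest alpha_pct percent of the values.

    Returns the surviving values in sorted order.
    """
    if not 0 <= alpha_pct < 50:
        raise ValueError("alpha_pct must be in [0, 50)")
    s = sorted(values)
    k = len(s) * alpha_pct // 100   # number of values removed from each end
    return s[k:len(s) - k]


# Hypothetical sample: 20 values, one of them extreme.
data = list(range(1, 20)) + [500]
trimmed = alpha_trim(data, alpha_pct=5)
print(trimmed)
```

Unlike winsorization, trimming shrinks the sample: here one value is removed from each end.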
Missing values refer to data fields that may be empty or include NA, NaN, or Null.
Imputation is a technique used to replace missing values with logical values. A wide variety of techniques is available, and choosing the one that fits the data is an art.
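As one possible illustration, the sketch below fills missing entries with the median or the mean of the observed values. Mean and median imputation are common choices assumed here for the example; the article does not say which techniques it has in mind, and the names `is_missing` and `impute` are hypothetical.

```python
import math
from statistics import mean, median


def is_missing(v):
    """Treat None and float NaN as missing."""
    return v is None or (isinstance(v, float) and math.isnan(v))


def impute(values, strategy="median"):
    """Replace missing entries with the median (default) or mean of the rest."""
    observed = [v for v in values if not is_missing(v)]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed) if strategy == "median" else mean(observed)
    return [fill if is_missing(v) else v for v in values]


# Hypothetical column with two missing entries.
ages = [25, None, 31, float("nan"), 40, 28]
filled = impute(ages)
print(filled)
```

The median is often preferred when the column contains outliers, since the mean is pulled toward extreme values.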
Changing the underlying nature of the data for better analysis.
Normalization / Standardization - Making the data scale-free and unitless.
Normalisation is also known as the Min-Max Scaler or Range technique. The normalised data has a minimum value of 0 and a maximum value of 1; when dealing with negative numbers, the data is occasionally scaled to the range -1 to +1 instead.
The drawback of the Min-Max Scaler is that outliers affect the scaled values.
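The effect of an outlier on min-max scaling can be seen in a small sketch. The function name `min_max_scale` and the sample values are assumptions chosen for illustration.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot scale a constant column")
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical sample: four ordinary values and one outlier.
data = [10, 20, 30, 40, 1000]
scaled = min_max_scale(data)
print(scaled)
```

Because the outlier defines the maximum, the four ordinary values are squeezed into a narrow band near 0.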
Robust scaling is based on the median and the interquartile range (IQR), so it is much less affected by outliers.
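A minimal sketch of robust scaling, computed as (value - median) / IQR. The function name `robust_scale`, the sample values, and the choice of the "inclusive" quartile method from Python's `statistics.quantiles` are assumptions; other quartile conventions give slightly different results.

```python
from statistics import median, quantiles


def robust_scale(values):
    """Centre on the median and divide by the interquartile range."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; robust scaling is undefined")
    med = median(values)
    return [(v - med) / iqr for v in values]


# Hypothetical sample: four ordinary values and one outlier.
data = [10, 20, 30, 40, 1000]
scaled = robust_scale(data)
print(scaled)
```

The outlier still ends up far from the rest, but it no longer determines how the ordinary values are spaced.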
Dummy variable creation means converting categorical data into a numerical representation.
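One common way to build dummy variables is one-hot encoding, sketched below with plain Python. The function name `one_hot`, the `is_` column prefix, and the sample data are hypothetical.

```python
def one_hot(values):
    """Return the sorted categories and one 0/1 indicator row per input value."""
    categories = sorted(set(values))
    rows = [{f"is_{c}": int(v == c) for c in categories} for v in values]
    return categories, rows


# Hypothetical categorical column.
colors = ["red", "green", "red", "blue"]
categories, rows = one_hot(colors)
for row in rows:
    print(row)
```

Each row has exactly one indicator set to 1, marking the category of the original value.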
Typecasting is converting one data type to another, such as changing a character type to a factor type or an integer type to a floating-point type.
Handling duplicates enables us to consolidate the truth from many sources into a single source.
For instance, a person may open a bank account, but his transactions might appear under John Travolta in some records, John in others, and Travolta in yet others, even though all three names belong to the same individual. We therefore combine all of these names into one.
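A deliberately naive sketch of combining name variants: each variant is mapped to the longest name that contains all of its words. This rule, the function name `consolidate_names`, and the sample names are assumptions for illustration only; real record matching would also rely on identifiers such as account numbers, since a bare "John" could belong to many people.

```python
def consolidate_names(names):
    """Map each name to the variant with the most words that contains all of its words."""
    token_sets = {n: set(n.lower().split()) for n in names}
    mapping = {}
    for name, tokens in token_sets.items():
        # Candidates are names whose word set includes every word of this name.
        candidates = [c for c, ct in token_sets.items() if tokens <= ct]
        mapping[name] = max(candidates, key=lambda c: len(token_sets[c]))
    return mapping


# Hypothetical variants found across transaction records.
variants = ["John Travolta", "John", "Travolta"]
mapping = consolidate_names(variants)
print(mapping)
```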
Working with textual data. Various ways of converting unstructured textual data into structured data are:
Zero or near-zero variance variables are those that take a single value, or the same value for the majority of records. For instance, all of the zip codes are the same, or all entries in the gender column are classified as female.
We exclude variables from our analysis that have zero or almost zero feature variance.
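A rough way to flag such variables is to check how large a share of each column its most frequent value takes. The 95% cutoff, the function name `near_zero_variance_columns`, and the sample table below are assumptions for this sketch, not thresholds given in the article.

```python
from collections import Counter


def near_zero_variance_columns(table, max_share=0.95):
    """Return columns whose most frequent value covers at least max_share of rows."""
    flagged = []
    for col, values in table.items():
        top_count = Counter(values).most_common(1)[0][1]
        if top_count / len(values) >= max_share:
            flagged.append(col)
    return flagged


# Hypothetical table stored as column name -> list of values.
table = {
    "zip": ["560102"] * 20,           # a single value: zero variance
    "gender": ["F"] * 19 + ["M"],     # one dominant value: near-zero variance
    "age": list(range(20, 40)),       # varied values
}
to_drop = near_zero_variance_columns(table)
print(to_drop)
```

The flagged columns can then be excluded before modelling, since they carry little or no information for distinguishing records.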