
CRISP - DM Data Collection



CRISP-DM stands for CRoss Industry Standard Process for Data Mining.

CRISP - DM Business Understanding

Articulate the business problem by understanding the client/customer requirements

Formulate Business Objectives

Formulate Business Constraints

A few examples on Business Objective and Business Constraints
  • Business Problem : A significant proportion of customers who take loans are unable to repay
  • Business Objective : Minimize Loan Defaulters
  • Business Constraint : Maximize Profits
  • Business Problem : A significant proportion of customers complain that they did not make the credit card transaction
  • Business Objective : Minimize Fraud
  • Business Constraint : Maximize Convenience
  • Key point to remember : Ensure that objectives and constraints are SMART (Specific, Measurable, Achievable, Relevant, Time-bound)
  • Key Deliverable: Project Charter

CRISP - DM Data Collection

Understanding various data types is pivotal to proceed further with data collection.

Data Types

Continuous

Data that can be represented in decimal format and still makes sense, for example a height of 164.3 cm.

Discrete

Data that does not make sense when represented in decimal format, for example a count of 3.5 loan defaulters is meaningless.


Categorical Data Examples

  • Whether a person claims insurance or not
  • Whether a person will 'pay on time', 'pay with a delay', or 'default'


Count Data Examples

  • Number of people who claim insurance
  • Number of loan defaulters


Data Understanding

Qualitative

Qualitative data is non-numerical data.

Examples

  • This weight is heavy
  • That kitten is small

Quantitative

Quantitative data is numerical data.

Examples

  • Weight: 85 kg
  • Height: 164.3 cm


Continuous Data and Count Data fall under Quantitative Data.
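To make these distinctions concrete, here is a minimal sketch (using pandas and an invented toy dataset) showing continuous, count/discrete, and categorical variables side by side:

    import pandas as pd

    # Toy dataset (hypothetical values) illustrating the data types above
    df = pd.DataFrame({
        "height_cm": [164.3, 172.0, 158.7],                  # continuous: decimals make sense
        "insurance_claims": [0, 2, 1],                       # count / discrete: decimals make no sense
        "payment_status": ["on time", "delay", "default"],   # categorical
    })

    print(df.dtypes)  # float64, int64, object

    # Categorical columns can be stored explicitly as a pandas 'category' dtype
    df["payment_status"] = df["payment_status"].astype("category")
    print(df["payment_status"].cat.categories)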


Structured vs Unstructured

Structured data is that data which in raw state can be placed in a tabular format.

Unstructured data is that data which in its raw state cannot be placed in any tabular format.

 

A video can be split into image frames and the frames into pixels; each pixel intensity value can then be an entry in a column, which makes the data structured.

Videos, Images, Audio/Speech, Textual Data are examples of Unstructured data.

 



Mel Frequency Cepstral Coefficient (MFCC) allows audio files and speech data to be transformed into features.
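As an illustration, here is a minimal MFCC sketch using the librosa library (the file name and parameter choices are placeholders, not part of the original text):

    import librosa

    # Load an audio/speech clip (hypothetical file path)
    y, sr = librosa.load("speech_sample.wav", sr=16000)

    # Extract 13 MFCCs per frame; the result has shape (13, n_frames)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

    # Average each coefficient over time to obtain a fixed-length feature vector
    features = mfcc.mean(axis=1)
    print(features.shape)  # (13,)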

Text data can be made structured by transforming it into, for instance, a Bag of Words (BoW) representation.

Example: "But I, being poor, have only my dreams; I have spread my dreams under your feet; Tread softly because you tread on my dreams."

Poor: 1 | Dream: 3 | Spread: 1 | Feet: 1 | Tread: 2 | Soft: 1
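A minimal Bag of Words sketch using scikit-learn's CountVectorizer (note: CountVectorizer removes stop words but does not stem, so it reports 'dreams' and 'softly' rather than the stemmed 'Dream' and 'Soft' shown in the table above):

    from sklearn.feature_extraction.text import CountVectorizer

    text = ["But I, being poor, have only my dreams; "
            "I have spread my dreams under your feet; "
            "Tread softly because you tread on my dreams."]

    # Bag of Words: lowercase, tokenize, drop English stop words, count each term
    vectorizer = CountVectorizer(stop_words="english")
    bow = vectorizer.fit_transform(text)

    counts = dict(zip(vectorizer.get_feature_names_out(), bow.toarray()[0]))
    print(counts)
    # {'dreams': 3, 'feet': 1, 'poor': 1, 'softly': 1, 'spread': 1, 'tread': 2}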


Data Understanding


Big Data vs Non-Big Data

Big Data is data that is subject to the 5 Vs: Volume, Velocity, Variety, Veracity, and Value.

It is produced in high volume, at high velocity, from a wide variety of sources, with a degree of veracity (uncertainty), and it must deliver the right value.

Big Data is information that cannot be processed with currently available software or stored on currently accessible technology.

Non-Big Data is any data that can be processed with currently available software and stored on currently accessible hardware.

Cross-Sectional vs Time Series vs Longitudinal / Panel Data

  • Cross-sectional data is data for which the date, time, and sequence of arrangement are immaterial
  • Cross-sectional data usually contains more than one variable
Examples:
  • Population survey of demographics
  • Profit & Loss statements of various companies
  • Time Series data is data for which the date, time, and sequence of arrangement are important
  • Time Series data usually contains only one variable of interest to be forecasted
Examples:
  • Monitoring patient blood pressure every week
  • Global warming trend
  • Longitudinal Data is also called Panel Data
  • Longitudinal Data combines the properties of both Cross-Sectional Data and Time Series Data: there is more than one variable, and the observations are sorted by date and time
Examples:
  • Exam scores of all students in a class from sessional to final exams
  • Health scores of all employees recorded every month
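A small pandas sketch (with made-up numbers) contrasting the three shapes of data:

    import pandas as pd

    # Cross-sectional: one snapshot of several companies; row order is immaterial
    cross_sectional = pd.DataFrame({
        "company": ["A", "B", "C"],
        "profit": [1.2, 0.8, 2.5],
    })

    # Time series: one variable of interest, ordered by date
    time_series = pd.DataFrame({
        "week": pd.date_range("2023-01-01", periods=3, freq="W"),
        "blood_pressure": [120, 118, 122],
    })

    # Longitudinal / panel: several subjects, each tracked over time
    panel = pd.DataFrame({
        "employee": ["E1", "E1", "E2", "E2"],
        "month": [1, 2, 1, 2],
        "health_score": [72, 75, 68, 70],
    }).sort_values(["employee", "month"])

    print(cross_sectional, time_series, panel, sep="\n\n")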


Balanced vs Imbalanced Data

  • Consider categorical output variables such as whether a person claims insurance or not, or whether a person will 'pay on time', 'pay with a delay', or 'default'
  • Balanced data is data where the classes of the output variable are more or less in equal proportion
  • E.g. 47% of customers have defaulted and 53% have not defaulted in the loan default variable
  • When we have balanced data then we can simply apply random sampling techniques
  • Imbalanced data is that data where the classes of output variables are in unequal proportion
  • E.g. 23% of customers have defaulted and 77% have not defaulted in the loan default variable

Rule of thumb: If the percentage of the minority output class is less than 30%, the data is imbalanced.
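A quick way to apply this rule of thumb is to check the class proportions of the output variable, for example with pandas (the 77/23 split below is the hypothetical loan example from above):

    import pandas as pd

    # Hypothetical loan data: 77 non-defaulters and 23 defaulters
    loans = pd.DataFrame({"default": ["no"] * 77 + ["yes"] * 23})

    proportions = loans["default"].value_counts(normalize=True)
    print(proportions)
    # no     0.77
    # yes    0.23  -> minority class below 30%, so the data is imbalanced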

Refer to page 2 for information on sampling for unbalanced data.

When we have imbalanced data then we apply different sampling techniques such as:

  • Random Resampling - Undersampling and Oversampling
  • Bootstrap Resampling
  • K-Fold Cross Validation
  • Repeated K-Fold Cross Validation
  • Stratified K-Fold Cross-Validation
  • Leave-One-Out (N-Fold Cross-Validation) LOOCV
  • SMOTE (Synthetic Minority Oversampling Technique)
  • MSMOTE (Modified SMOTE)
  • Cluster-Based Sampling
  • Ensemble Techniques
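As a minimal sketch of the first technique in the list, random oversampling of the minority class can be done with scikit-learn's resample utility (the dataset below is invented; libraries such as imbalanced-learn provide SMOTE and several of the other techniques):

    import pandas as pd
    from sklearn.utils import resample

    # Hypothetical imbalanced data: 77 non-defaulters, 23 defaulters
    loans = pd.DataFrame({
        "income": range(100),
        "default": ["no"] * 77 + ["yes"] * 23,
    })

    majority = loans[loans["default"] == "no"]
    minority = loans[loans["default"] == "yes"]

    # Random oversampling: draw from the minority class with replacement
    # until it matches the majority class size
    minority_upsampled = resample(minority, replace=True,
                                  n_samples=len(majority), random_state=42)
    balanced = pd.concat([majority, minority_upsampled])

    print(balanced["default"].value_counts())  # no: 77, yes: 77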


Data Collection Sources

Primary Data: Data Collected at the Source

Secondary Data: Data Collected Beforehand


Primary Data

Examples of Primary Data

Surveys, Design of Experiments, IoT Sensors Data, Interviews, Focus Groups, etc.

Survey steps:

1. Understand the business reality and objective behind conducting the survey. E.g. Sales are low for a training company

2. Perform root cause analysis - Fishbone Analysis, 5-Why Analysis, etc. E.g. Product Pricing is uncompetitive

3. Formulate Decision Problem. E.g. Should product prices be changed?

4. Formulate Research Objective. E.g. Determine the price elasticity of demand and the impact on sales and profits of various levels of price changes

5. List out Constructs. E.g. Training Enrolment

6. Deduce Aspects based on construct. E.g. Time aspect, Strength aspect, Constraint aspect

7. Devise Survey Questionnaire based on the Aspects. E.g. I am most likely to enroll for the training program in: In the next one week, In the next one month, In the next one quarter, etc.

Design of Experiments examples:
  • Coupon marketing with a 10% discount vs a 20% discount, to see which of the two customers respond to better
  • Coupons targeting customers within a 10 km radius versus a 20 km radius
  • Combinations of discount and distance can also be experimented with

Secondary Data

Secondary data is drawn from organizational databases, open source databases, and syndicate (paid) databases.

Organizational databases:

  • Oracle DB
  • Microsoft DB
  • MySQL
  • NoSQL - MongoDB
  • Big Data, etc.

Open source and syndicate (paid) sources:

  • Industry reports
  • Government reports
  • Quasi-government reports, etc.


Metadata Description: Data about Data

  • Obtaining the metadata description is mandatory before we proceed further in the project
  • Understand the data volume details such as size, number of records, total databases, tables, etc.
  • Understand the data attributes/variables - description and values which these variables take
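A minimal sketch of this kind of metadata inspection with pandas (the file name is hypothetical):

    import pandas as pd

    # Hypothetical table pulled from the organizational database
    df = pd.read_csv("loans.csv")

    print(df.shape)                     # number of records and number of attributes
    df.info()                           # column names, data types, non-null counts, memory size
    print(df.describe(include="all"))   # value ranges / summaries for each variable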


Preliminaries for Data Analysis

Probability is the degree to which an event is likely to occur, measured as the ratio of favourable outcomes to the total number of possible outcomes.


Properties of Probability:
  • Ranges from 0 to 1
  • The probabilities of all possible values of an event sum to 1
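A tiny sketch verifying both properties for a fair six-sided die:

    from fractions import Fraction

    # Fair six-sided die: each face has probability 1/6
    probabilities = {face: Fraction(1, 6) for face in range(1, 7)}

    # Property 1: every probability lies between 0 and 1
    assert all(0 <= p <= 1 for p in probabilities.values())

    # Property 2: the probabilities of all outcomes sum to 1
    print(sum(probabilities.values()))  # 1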


Base Equation

Random variables fall into two major categories: output variables and input variables.

Mathematically, the relationship between them is expressed using the base equation:

Y = f(X)

Y is known as:
  • Dependent variable
  • Response
  • Regressand
  • Explained variable
  • Criterion
  • Measured variable
  • Experimental variable
  • Label
  • Class variable
X is known as:
  • Independent
  • Explanatory
  • Predictor
  • Covariates
  • Regressors
  • Factors
  • Carriers
  • Controlled variable
  • Manipulated variable
  • Exposure variable
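As a minimal illustration of the base equation Y = f(X), a linear model can be used to approximate f from data (the numbers below are made up):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    X = np.array([[1.0], [2.0], [3.0], [4.0]])   # independent / predictor variable
    y = np.array([2.1, 4.0, 6.2, 7.9])           # dependent / response variable

    # Approximate the unknown f in Y = f(X) with a straight line
    model = LinearRegression().fit(X, y)
    print(model.coef_, model.intercept_)         # estimated slope and intercept
    print(model.predict([[5.0]]))                # predicted Y for a new X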

Random Variable


Upper case is always used to indicate random variables.

Lower case letters are used to indicate the values that a random variable can take.

Example: for a single roll of a die, the random variable X can take the values x = 1, 2, 3, 4, 5, or 6.

Probability Distribution

A probability distribution tabulates or displays the probabilities of all potential outcomes of an event.

If a random variable is continuous, the underlying probability distribution is known as a Continuous Probability Distribution.

If a random variable is discrete, the underlying probability distribution is known as a Discrete Probability Distribution.
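For example, with scipy.stats (values chosen arbitrarily for illustration): a count such as the number of heads in 10 coin tosses follows a discrete (binomial) distribution, while a continuous measurement such as height follows a continuous (normal) distribution:

    from scipy import stats

    # Discrete random variable: number of heads in 10 fair coin tosses
    # -> probability mass function of a Discrete Probability Distribution
    print(stats.binom.pmf(k=3, n=10, p=0.5))

    # Continuous random variable: height in cm
    # -> probability density function of a Continuous Probability Distribution
    print(stats.norm.pdf(164.3, loc=165, scale=10))

    # For continuous variables, probabilities are areas under the density curve,
    # e.g. P(155 <= height <= 175)
    print(stats.norm.cdf(175, loc=165, scale=10) - stats.norm.cdf(155, loc=165, scale=10))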


Sampling Techniques

Non-probability sampling is based on convenience: the data points collected to represent the population are given varying priorities rather than an equal chance of selection.


Inferential Statistics

Inferential statistics is the process of analysing sample data to derive statements about the properties of a population.


Sampling Techniques

Unbiased Sampling, also known as Probability Sampling, is the default approach for inferential statistics. Each data point to be collected has an equal chance of being selected. Simple Random Sampling (used in the sampling funnel below) is a common example of Probability Sampling, while convenience-based approaches are examples of Non-Probability Sampling.

Sampling Funnel


Population: All Covid-19 cases on the planet

Sampling Frame:
  • The majority of Covid-19 cases are in the USA, India, and Brazil, hence these 3 countries can be selected as the Sampling Frame
  • There is no hard and fast rule for choosing a Sampling Frame; it is devised based on business logic

Simple Random Sampling:
  • Randomly sample 10% or 20% of the data from the sampling frame using the Simple Random Sampling (SRS) technique
  • SRS is the gold standard technique used for sampling
  • SRS is the only sampling technique which has no bias
  • Other sampling techniques, such as Stratified Sampling, can also be used to sample the data, but SRS is the best
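A minimal sketch of this funnel using pandas (the case counts are invented; DataFrame.sample implements Simple Random Sampling, and a stratified variant is shown for comparison):

    import pandas as pd

    # Hypothetical sampling frame: Covid-19 case records from the USA, India, and Brazil
    frame = pd.DataFrame({
        "case_id": range(1, 1001),
        "country": ["USA"] * 500 + ["India"] * 300 + ["Brazil"] * 200,
    })

    # Simple Random Sampling: draw 10% of the frame; every record is equally likely
    srs = frame.sample(frac=0.10, random_state=42)
    print(len(srs))  # 100 records

    # Stratified alternative: draw 10% within each country
    stratified = frame.groupby("country").sample(frac=0.10, random_state=42)
    print(stratified["country"].value_counts())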

 
