
CRISP - DM Data Collection

  • July 15, 2023

Meet the Author : Mr. Bharani Kumar

Bharani Kumar Depuru is a well-known IT personality from Hyderabad. He is the Founder and Director of AiSPRY and 360DigiTMG. Bharani Kumar is an IIT and ISB alumnus with more than 18 years of experience, and he has held prominent positions at leading companies such as HSBC, ITC Infotech, Infosys, and Deloitte. He is an IT consultant specializing in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence, and Business Management. Bharani Kumar is also the chief trainer at 360DigiTMG, with more than ten years of experience, and has been making the IT transition journey easy for his students. 360DigiTMG is at the forefront of delivering quality education, thereby bridging the gap between academia and industry.


Cross Industry Standard Process for Data Mining (CRISP-DM)

CRISP - DM Business Understanding

Articulate the business problem by understanding the client/customer requirements

Formulate Business Objectives

Formulate Business Constraints

A few examples of Business Objectives and Business Constraints
  • Business Problem : Significant proportion of customers who take loan are unable to repay
  • Business Objective : Minimize Loan Defaulters
  • Business Constraint : Maximize Profits
  • Business Problem : Significant proportion of customers are complaining that they did not do the credit card transaction
  • Business Objective : Minimize Fraud
  • Business Constraint : Maximize Convenience
  • Key point to remember: Ensure that objectives and constraints are SMART (Specific, Measurable, Achievable, Relevant, Time-bound)
  • Key Deliverable: Project Charter

CRISP - DM Data Collection

Understanding various data types is pivotal to proceed further with data collection.

Data Types

Continuous

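Data that still makes sense when represented in decimal format, e.g. a weight of 85.4 kg.

Discrete

Data that does not make sense when represented in decimal format, e.g. 2.5 loan defaulters. Categorical data and count data are both discrete.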

Categorical Data Examples

  • Whether a person claims insurance or not
  • Whether a person will 'pay on time', 'with a delay', or 'will default'

Count Data Examples

  • Number of people who claim insurance
  • Number of loan defaulters
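As a rough illustration of the continuous, count, and categorical distinction, the sketch below labels a column by the Python type of its values. The function name `classify_values` and the sample columns are hypothetical, and typing alone is a simplification: an integer customer ID, for example, would be labelled as a count even though it is really an identifier.

```python
def classify_values(values):
    """Label a column as 'categorical', 'count', or 'continuous' (heuristic only)."""
    # Text labels and True/False flags are treated as categorical.
    if all(isinstance(v, (str, bool)) for v in values):
        return "categorical"
    # Non-negative whole numbers are treated as counts (a decimal would not make sense).
    if all(isinstance(v, int) and not isinstance(v, bool) and v >= 0 for v in values):
        return "count"
    # Any other purely numeric column is treated as continuous.
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "continuous"
    return "mixed"


# Hypothetical sample columns, made up for illustration.
claims_insurance = ["yes", "no", "no", "yes"]
loan_defaulters_per_branch = [12, 0, 7, 3]
weight_kg = [85.0, 72.4, 64.3]

print(classify_values(claims_insurance))
print(classify_values(loan_defaulters_per_branch))
print(classify_values(weight_kg))
```

In practice the decision should also use the metadata description of each variable, not only its stored type.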

Data Understanding

Qualitative

Qualitative data is non-numerical data.

Examples

  • This weight is heavy
  • That kitten is small

Quantitative

Quantitative data include numbers.

Examples

  • Weight 85 kg
  • Height 164.3 cm


Continuous Data and Count Data fall under Quantitative Data.


Structured vs Unstructured

Structured data is data that, in its raw state, can be placed in a tabular format.

Unstructured data is data that, in its raw state, cannot be placed in any tabular format. Videos, images, audio/speech, and textual data are examples of unstructured data.

Unstructured data can be converted into structured data. For example, a video is split into images (frames) and each image into pixels; each pixel intensity value then becomes an entry in a column, and the data becomes structured.


Audio files and speech data can be transformed into numeric features using Mel Frequency Cepstral Coefficients (MFCCs).

To make text data structured, it can be transformed into a Bag of Words (BoW), for instance.

Example: But I, being poor, have only my dreams; I have spread my dreams under your feet; Tread softly because you tread on my dreams.

Poor  Dream  Spread  Feet  Tread  Soft
1     3      1       1     2      1
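A minimal Bag of Words sketch for the word counts above could look like the following. It assumes the counts are based on the well-known Yeats lines (the text is spelled out in the code), and it uses a deliberately crude, hypothetical `crude_stem` helper that only strips "ly" and "s"; a real project would normally use a proper tokenizer and stemmer.

```python
import re
from collections import Counter


def crude_stem(word):
    """Very rough stemmer (assumption for this sketch): strip 'ly' or a trailing 's'."""
    for suffix in ("ly", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary term occurs in the text after stemming."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(crude_stem(token) for token in tokens)
    return {term: counts[term] for term in vocabulary}


TEXT = (
    "But I, being poor, have only my dreams; "
    "I have spread my dreams under your feet; "
    "Tread softly because you tread on my dreams."
)
VOCABULARY = ["poor", "dream", "spread", "feet", "tread", "soft"]

bow = bag_of_words(TEXT, VOCABULARY)
print(bow)
```

Each vocabulary term becomes a column and each document becomes a row of counts, which is what turns free text into a tabular, structured form.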



Big Data vs Non-Big Data

Big Data is data that exhibits the 5 Vs (Volume, Velocity, Variety, Veracity, and Value): a high volume of data, generated at high velocity from a wide variety of sources, with a degree of veracity and the right value.

Big Data is data that cannot be stored on currently available hardware or processed with currently available software.

Non-Big Data is any data that can be stored on currently available hardware and processed with currently available software.

Cross-Sectional vs Time Series vs Longitudinal / Panel Data

  • Cross-sectional data is data where the date, time, and sequence in which we arrange the data are immaterial
  • Cross-sectional data usually contains more than one variable
Examples:
  • Population survey of demographics
  • Profit & Loss statements of various companies

  • Time Series data is data where the date, time, and sequence in which we arrange the data are important
  • Time Series data usually contains only one variable of interest to be forecasted
Examples:
  • Monitoring a patient's blood pressure every week
  • Global warming trend

  • Longitudinal Data is also called Panel Data
  • Longitudinal Data combines properties of Cross-Sectional Data and Time Series Data: it has more than one variable, and the observations are sorted by date and time
Examples:
  • Exam scores of all students in a class from sessional to final exams
  • Health scores of all employees recorded every month
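To make the relationship between the three kinds of data concrete, the sketch below builds a tiny panel of monthly employee health scores and slices it two ways. All names and numbers are made up for illustration.

```python
# Hypothetical panel data: health scores of employees recorded every month.
panel = [
    {"employee": "A", "month": "2023-01", "health_score": 71},
    {"employee": "B", "month": "2023-01", "health_score": 64},
    {"employee": "A", "month": "2023-02", "health_score": 73},
    {"employee": "B", "month": "2023-02", "health_score": 66},
    {"employee": "A", "month": "2023-03", "health_score": 70},
    {"employee": "B", "month": "2023-03", "health_score": 69},
]


def cross_section(rows, month):
    """All employees at one point in time; the order of rows carries no meaning."""
    return {r["employee"]: r["health_score"] for r in rows if r["month"] == month}


def time_series(rows, employee):
    """One employee over time; rows are kept in chronological order."""
    selected = [r for r in rows if r["employee"] == employee]
    selected.sort(key=lambda r: r["month"])  # ISO-style months sort chronologically
    return [(r["month"], r["health_score"]) for r in selected]


january = cross_section(panel, "2023-01")
employee_a = time_series(panel, "A")
print(january)
print(employee_a)
```

Fixing the month gives a cross-section, fixing the employee gives a time series, and keeping both dimensions gives the panel.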


Balanced vs Imbalanced Data

  • The output variable has classes, e.g. whether a person claims insurance or not, or whether a person will 'pay on time', 'with a delay', or 'will default'
  • Balanced data is data where the classes of the output variable occur in more or less equal proportion
  • E.g. 47% of the records are defaulted and 53% are not defaulted in the loan default variable
  • When we have balanced data, we can simply apply random sampling techniques
  • Imbalanced data is data where the classes of the output variable occur in unequal proportion
  • E.g. 23% of the records are defaulted and 77% are not defaulted in the loan default variable

Rule of thumb: If the minority output class makes up less than 30% of the data, the data is imbalanced.
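The rule of thumb can be checked with a few lines of code. This is a minimal sketch: the function names are hypothetical, and the two label lists simply reproduce the 47/53 and 23/77 splits used in the examples above.

```python
from collections import Counter


def minority_share(labels):
    """Fraction of records that belong to the least frequent class."""
    counts = Counter(labels)
    return min(counts.values()) / len(labels)


def is_imbalanced(labels, threshold=0.30):
    """Apply the rule of thumb: minority class below 30% means imbalanced."""
    return minority_share(labels) < threshold


balanced = ["default"] * 47 + ["no default"] * 53
imbalanced = ["default"] * 23 + ["no default"] * 77

print(minority_share(balanced), is_imbalanced(balanced))
print(minority_share(imbalanced), is_imbalanced(imbalanced))
```

The 30% threshold is only a rule of thumb, so it is exposed as a parameter rather than hard-coded.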


When we have imbalanced data then we apply different sampling techniques such as:

  • Random Resampling - Undersampling and Oversampling
  • Bootstrap Resampling
  • K-Fold Cross Validation
  • Repeated K-Fold Cross Validation
  • Stratified K-Fold Cross-Validation
  • Leave-One-Out (N-Fold Cross-Validation) LOOCV
  • SMOTE (Synthetic Minority Oversampling Technique)
  • MSMOTE (Modified SMOTE)
  • Cluster-Based Sampling
  • Ensemble Techniques
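The first item in the list, random undersampling and oversampling, can be sketched with the standard library alone. The helper names and the loan records below are hypothetical; libraries such as imbalanced-learn provide tested implementations of these and of SMOTE, which is not shown here.

```python
import random
from collections import defaultdict


def split_by_class(rows, label_key):
    """Group records by the value of the output variable."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)
    return groups


def random_undersample(rows, label_key, seed=0):
    """Shrink every class to the size of the smallest class (without replacement)."""
    rng = random.Random(seed)
    groups = split_by_class(rows, label_key)
    target = min(len(group) for group in groups.values())
    sampled = []
    for group in groups.values():
        sampled.extend(rng.sample(group, target))
    return sampled


def random_oversample(rows, label_key, seed=0):
    """Grow every class to the size of the largest class by duplicating records."""
    rng = random.Random(seed)
    groups = split_by_class(rows, label_key)
    target = max(len(group) for group in groups.values())
    sampled = []
    for group in groups.values():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        sampled.extend(group + extra)
    return sampled


# Hypothetical data: 23 defaulted loans and 77 non-defaulted loans.
loans = [{"id": i, "status": "default" if i < 23 else "no default"} for i in range(100)]

under = random_undersample(loans, "status")
over = random_oversample(loans, "status")
print(len(under), len(over))
```

Undersampling discards majority records, while oversampling repeats minority records, so either should normally be applied to the training split only.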


Data Collection Sources

Primary Data: Data Collected at the Source

Secondary Data: Data Collected Beforehand


Primary Data

Examples of Primary Data

Surveys, Design of Experiments, IoT Sensors Data, Interviews, Focus Groups, etc.

Survey steps:

1. Understand the business reality and objective behind conducting the survey. E.g. Sales are low for a training company

2. Perform root cause analysis - Fishbone Analysis, 5-Why Analysis, etc. E.g. Product Pricing is uncompetitive

3. Formulate Decision Problem. E.g. Should product prices be changed

4. Formulate Research Objective. E.g. Determine the price elasticity of demand and the impact on sales and profits of various levels of price changes

5. List out Constructs. E.g. Training Enrolment

6. Deduce Aspects based on construct. E.g. Time aspect, Strength aspect, Constraint aspect

7. Devise Survey Questionnaire based on the Aspects. E.g. I am most likely to enroll for the training program in: In the next one week, In the next one month, In the next one quarter, etc.

Design of Experiments examples:
  • Coupon marketing with a 10% discount versus a 20% discount, to see which discount customers respond to better
  • Coupons targeting customers within a 10 km radius versus a 20 km radius
  • Combinations of discount and distance to experiment with
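The combinations of discount and distance form a small full factorial design, which can be enumerated as below. The variable names are illustrative, and only the two levels per factor mentioned above are used.

```python
from itertools import product

# Factor levels taken from the coupon examples above.
discounts_pct = [10, 20]
radii_km = [10, 20]

# Full factorial design: every discount level paired with every radius level.
experiment_cells = [
    {"discount_pct": discount, "radius_km": radius}
    for discount, radius in product(discounts_pct, radii_km)
]

for cell in experiment_cells:
    print(cell)
```

Each cell would then be assigned its own group of customers so that the response to each combination can be compared.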

Secondary Data

Organizational data are stored in databases:

  • Oracle DB
  • Microsoft DB
  • MySQL
  • NoSQL - MongoDB
  • Big Data, etc.

Open Source databases and Syndicate (paid) databases:

  • Industry reports
  • Government reports
  • Quasi-government reports, etc.


Meta Data Description: Data about Data

  • Obtaining meta data description is mandatory before we proceed further in the project
  • Understand the data volume details such as size, number of records, total databases, tables, etc.
  • Understand the data attributes/variables - description and values which these variables take
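A first-pass metadata description can be generated directly from the records. The sketch below is one possible shape for such a summary; the function name, the output keys, and the sample loan records are all hypothetical.

```python
def describe_metadata(rows):
    """Summarise record count plus, per attribute, its types, distinct values, and missing count."""
    attributes = {}
    for row in rows:
        for name, value in row.items():
            info = attributes.setdefault(
                name, {"types": set(), "distinct_values": set(), "missing": 0}
            )
            if value is None:
                info["missing"] += 1
            else:
                info["types"].add(type(value).__name__)
                info["distinct_values"].add(value)
    return {"record_count": len(rows), "attributes": attributes}


# Hypothetical sample records, made up for illustration.
loans = [
    {"loan_id": 1, "amount": 2500.0, "status": "paid"},
    {"loan_id": 2, "amount": 4000.0, "status": "default"},
    {"loan_id": 3, "amount": None, "status": "paid"},
]

meta = describe_metadata(loans)
print(meta["record_count"])
print(meta["attributes"]["status"]["distinct_values"])
```

A summary like this complements, but does not replace, the business description of what each variable means.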


Preliminaries for Data Analysis

Probability is the degree to which an event is likely to occur, measured as the ratio of the number of favourable cases to the total number of possible cases.

Properties of Probability:
  • Ranges from 0 to 1
  • The probabilities of all possible outcomes of an event sum to 1
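Both properties can be verified on a simple case. The sketch below assumes a fair six-sided die, so that every outcome is equally likely, and uses exact fractions to avoid rounding; the `probability` helper is hypothetical.

```python
from fractions import Fraction

# Sample space of one roll of a fair die (assumption: all faces equally likely).
outcomes = [1, 2, 3, 4, 5, 6]


def probability(event, sample_space):
    """Favourable cases divided by total possible cases."""
    favourable = [o for o in sample_space if event(o)]
    return Fraction(len(favourable), len(sample_space))


p_even = probability(lambda o: o % 2 == 0, outcomes)
p_seven = probability(lambda o: o == 7, outcomes)  # impossible event
p_each = [probability(lambda o, face=face: o == face, outcomes) for face in outcomes]

print(p_even, p_seven, sum(p_each))
```

Every probability lies between 0 and 1, and the probabilities of the six faces add up to exactly 1.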


Base Equation

Random variables fall into two major categories: output variables and input variables.

Mathematically, the relation between them is expressed using the base equation:

Y = f(X)

Y is known as:
  • Dependent variable
  • Response
  • Regressand
  • Explained variable
  • Criterion
  • Measured variable
  • Experimental variable
  • Label
  • Class variable
X is known as:
  • Independent
  • Explanatory
  • Predictor
  • Covariates
  • Regressors
  • Factors
  • Carriers
  • Controlled variable
  • Manipulated variable
  • Exposure variable

Random Variable

By convention, upper-case letters denote random variables.

Lower-case letters denote the values that a random variable can take.

Example: for the roll of a single die, the random variable X can take the values x = 1, 2, 3, 4, 5, or 6.

Probability Distribution

A probability distribution is a tabulation or plot of the probabilities of all possible outcomes of an event.

If the random variable is continuous, the underlying distribution is known as a Continuous Probability Distribution.

If the random variable is discrete, the underlying distribution is known as a Discrete Probability Distribution.
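As an illustration of a discrete probability distribution, the sketch below tabulates the total shown by two dice. It assumes both dice are fair, so each of the 36 ordered pairs is equally likely; the example itself is chosen here for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

faces = range(1, 7)

# Count how many of the 36 equally likely pairs give each total.
totals = Counter(a + b for a, b in product(faces, repeat=2))

# X = total of two dice; distribution maps each value x to P(X = x).
distribution = {x: Fraction(n, 36) for x, n in sorted(totals.items())}

for x, p in distribution.items():
    print(x, p)
```

The table lists every value the random variable can take together with its probability, and the probabilities sum to 1.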

Inferential Statistics

Inferential statistics is the process of analysing sample data to derive statements about the properties of a population.

Sampling Techniques

Non-probability sampling, also known as biased sampling, is based on convenience: the data points to be collected are given unequal priority, so some members of the population are more likely to be selected than others.

A few examples of Non-Probability Sampling are convenience sampling, judgment sampling, quota sampling, and snowball sampling.

Probability sampling, also known as unbiased sampling, is the default approach for inferential statistics. Each data point to be collected has an equal opportunity of being selected.

A few examples of Probability Sampling are simple random sampling, systematic sampling, stratified sampling, and cluster sampling.

Sampling Funnel

Population
  • All Covid-19 cases on the planet
Sampling Frame
  • The majority of Covid-19 cases are in the USA, India, and Brazil, and hence these 3 countries can be selected as a Sampling Frame
  • There is no hard and fast rule for choosing a Sampling Frame; it is devised based on business logic
Simple Random Sampling
  • Randomly sample 10% or 20% of the data from the sampling frame using the Simple Random Sampling (SRS) technique
  • SRS is regarded as the gold standard technique for sampling
  • SRS gives every record in the sampling frame an equal chance of selection, so it introduces no selection bias by design
  • Other probability sampling techniques, such as stratified sampling, can also be used to sample the data, but SRS is the usual default
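The last step of the funnel can be sketched as follows. The sampling frame below is entirely hypothetical (1,000 made-up case records spread across the three countries), and the function names are illustrative; stratified sampling is included only for comparison with SRS.

```python
import random


def simple_random_sample(frame, fraction, seed=0):
    """Draw a fraction of the frame; every record has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(frame, round(len(frame) * fraction))


def stratified_sample(frame, fraction, strata_key, seed=0):
    """Draw the same fraction separately within each stratum (here, each country)."""
    rng = random.Random(seed)
    strata = {}
    for row in frame:
        strata.setdefault(row[strata_key], []).append(row)
    sample = []
    for rows in strata.values():
        sample.extend(rng.sample(rows, round(len(rows) * fraction)))
    return sample


# Hypothetical sampling frame: the counts per country are made up for illustration.
countries = ["USA"] * 500 + ["India"] * 300 + ["Brazil"] * 200
frame = [{"case_id": i, "country": c} for i, c in enumerate(countries)]

srs = simple_random_sample(frame, 0.10)
strat = stratified_sample(frame, 0.10, "country")
print(len(srs), len(strat))
```

With SRS the country mix of the sample varies from draw to draw, whereas the stratified sample reproduces the frame's proportions exactly; fixing the seed makes either draw reproducible.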

 
