CRISP-DM: Business Understanding
Articulate the business problem by understanding the client/customer requirements
Formulate Business Objectives
Formulate Business Constraints
A few examples of business objectives and business constraints:
- Business Problem : Significant proportion of customers who take loan are unable to repay
- Business Objective : Minimize Loan Defaulters
- Business Constraint : Maximize Profits
- Business Problem : Significant proportion of customers are complaining that they did not do the credit card transaction
- Business Objective : Minimize Fraud
- Business Constraint : Maximize Convenience
- Key points to remember: Ensure that objectives and constraints are SMART (Specific, Measurable, Achievable, Relevant, Time-bound)
- Key Deliverable: Project Charter
CRISP-DM: Data Collection
Understanding various data types is pivotal to proceed further with data collection.
Continuous data: any data which still makes sense when represented in decimal format.
Discrete data: data which does not make sense when represented in decimal format.
Categorical Data Examples
Whether a person claims insurance or not;
whether a person will 'pay on time', 'pay with a delay', or 'default'.
Count Data Examples
'Number of people who claim insurance',
'number of loan defaulters'.
Qualitative data is non-numerical data.
- This weight is heavy
- That kitten is small
Quantitative data includes numbers.
- Weight 85 kg
- Height 164.3 cm
Continuous Data and Count Data fall under Quantitative Data.
Structured vs Unstructured
Structured data is that data which in raw state can be placed in a tabular format.
Unstructured data is that data which in its raw state cannot be placed in any tabular format.
Unstructured data can be made structured: a video is split into images, the images into pixels, and each pixel intensity value becomes an entry in a column.
Videos, Images, Audio/Speech, Textual Data are examples of Unstructured data.
Mel Frequency Cepstral Coefficient (MFCC) allows audio files and speech data to be transformed into features.
To make text data structured, it can be transformed into a Bag of Words (BoW), for instance.
Example: But because I'm poor, all I have are my dreams. Please tread carefully because you're walking on them.
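The Bag of Words idea above can be sketched with the standard library alone: tokenize the example sentence and count word occurrences. This is a minimal illustration, not a production text pipeline (the tokenizer and the `bag_of_words` helper are assumptions for this sketch).

```python
from collections import Counter
import re

def bag_of_words(text: str) -> Counter:
    """Lowercase the text, extract word tokens, and count occurrences."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

bow = bag_of_words(
    "But because I'm poor, all I have are my dreams. "
    "Please tread carefully because you're walking on them."
)
print(bow["because"])  # the word 'because' appears twice -> 2
```

Each distinct word becomes a feature, and its count becomes the feature value, turning free text into a structured representation.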
Big Data vs Non-Big Data
Big Data is data that is subject to the 5 Vs: high Volume, produced at high Velocity from a wide Variety of sources, with a degree of Veracity and the right Value.
Big Data is information that cannot be processed with currently available software or stored on currently accessible technology.
Non-Big Data is any data that can be processed with currently available software and stored on currently accessible hardware.
Cross-Sectional vs Time Series vs Longitudinal / Panel Data
- Cross-sectional data is that data where the date, time, and sequence in which we arrange the data are immaterial
- Cross-sectional data usually contains more than one variable
- Population survey of demographics
- Profit & Loss statements of various companies
- Time Series data is that data where the date, time, and sequence in which we arrange the data are important
- Time Series data usually contains only one variable of interest to be forecasted
- Monitoring patient blood pressure every week
- Global warming trend
- Longitudinal Data is also called Panel Data
- Longitudinal Data combines properties of both Cross-Sectional Data and Time Series Data: there is more than one variable, and the observations are sorted by date and time
- Exam scores of all students in a class from sessional to final exams
- Health scores of all employees recorded every month
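The three data shapes can be illustrated with a small pandas sketch (the patient IDs and readings below are made-up values for illustration): panel data tracks several entities over several time points, and slicing it one way gives a cross-section, the other way a time series.

```python
import pandas as pd

# Longitudinal / panel data: multiple patients tracked over multiple
# months, with more than one measured variable per observation.
panel = pd.DataFrame({
    "patient":      ["P1", "P1", "P2", "P2"],
    "month":        ["Jan", "Feb", "Jan", "Feb"],
    "systolic_bp":  [130, 128, 142, 139],
    "health_score": [7.2, 7.5, 6.1, 6.4],
}).set_index(["patient", "month"])

# Cross-sectional slice: every patient at a single point in time
jan_snapshot = panel.xs("Jan", level="month")

# Time series slice: one patient across all time points
p1_series = panel.xs("P1", level="patient")
```

Fixing the time dimension yields cross-sectional data; fixing the entity yields a time series.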
Balanced vs Imbalanced Data
- Whether a person claims insurance or not,
- Whether a person will 'pay on time', 'pay with a delay', or 'default', etc.
- Balanced data is that data where the classes of output variables are more or less in equal proportion
- E.g. 47% of people have defaulted and 53% of data is not defaulted in the loan default variable
- When we have balanced data then we can simply apply random sampling techniques
- Imbalanced data is that data where the classes of output variables are in unequal proportion
- E.g. 23% of data is defaulted and 77% of data is not defaulted in the loan default variable
Rule of thumb: if the minority output class makes up less than 30% of the data, the data is imbalanced.
When we have imbalanced data then we apply different sampling techniques such as:
- Random Resampling - Undersampling and Oversampling
- Bootstrap Resampling
- K-Fold Cross Validation
- Repeated K-Fold Cross Validation
- Stratified K-Fold Cross-Validation
- Leave-One-Out Cross-Validation (LOOCV), also called N-Fold Cross-Validation
- SMOTE (Synthetic Minority Oversampling Technique)
- MSMOTE (Modified SMOTE)
- Cluster-Based Sampling
- Ensemble Techniques
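The simplest of the techniques above, random resampling, can be sketched with the standard library alone (the toy loan-default dataset below is an assumption, mirroring the 23% / 77% example): oversample the minority class with replacement, or undersample the majority class without replacement.

```python
import random

random.seed(42)

# Toy imbalanced dataset: 23% defaulters (minority), 77% non-defaulters
data = [("defaulted", i) for i in range(23)] + [("paid", i) for i in range(77)]

minority = [row for row in data if row[0] == "defaulted"]
majority = [row for row in data if row[0] == "paid"]

# Random oversampling: resample the minority class with replacement
# until it matches the majority class size.
oversampled = minority + random.choices(minority, k=len(majority) - len(minority))
balanced = majority + oversampled          # 77 paid + 77 defaulted

# Random undersampling: shrink the majority class to the minority size.
undersampled = random.sample(majority, k=len(minority)) + minority  # 23 + 23
```

Oversampling keeps all the information but duplicates minority records; undersampling discards majority records and can lose information, which is why synthetic approaches like SMOTE are often preferred.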
Data Collection Sources
Primary Data: Data Collected at the Source
Secondary Data: Data Collected Beforehand
Examples of Primary Data
Surveys, Design of Experiments, IoT Sensors Data, Interviews, Focus Groups, etc.
Steps for designing a survey:
1. Understand the business reality and objective behind conducting the survey. E.g. Sales are low for a training company
2. Perform root cause analysis - Fishbone Analysis, 5-Why Analysis, etc. E.g. Product Pricing is uncompetitive
3. Formulate Decision Problem. E.g. Should product prices be changed?
4. Formulate Research Objective. E.g. Determine the price elasticity of demand and the impact on sales and profits of various levels of price changes
5. List out Constructs. E.g. Training Enrolment
6. Deduce Aspects based on construct. E.g. Time aspect, Strength aspect, Constraint aspect
7. Devise Survey Questionnaire based on the Aspects. E.g. I am most likely to enroll for the training program in: In the next one week, In the next one month, In the next one quarter, etc.
Design of Experiments examples:
- Coupon marketing with a 10% discount vs 20% discount, to which of these customers are responding well
- Coupon targeting customers within 10 km radius versus 20 km radius
- Combinations of discount & distance to experiment
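Analysing such an experiment typically means testing whether the response rates really differ between the two coupon variants. A minimal sketch using SciPy's chi-square test of independence, with hypothetical response counts (the numbers below are made up for illustration):

```python
from scipy.stats import chi2_contingency

# Hypothetical counts from a coupon experiment:
# rows = discount level, columns = (responded, did not respond)
observed = [
    [120, 880],   # 10% discount: 12.0% response rate
    [165, 835],   # 20% discount: 16.5% response rate
]

chi2, p_value, dof, expected = chi2_contingency(observed)
if p_value < 0.05:
    print("Response rates differ significantly between discount levels")
```

A small p-value suggests the difference in response rates is unlikely to be due to chance alone, supporting a switch to the better-performing coupon.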
Organizational data is stored in databases:
- Oracle DB
- Microsoft DB
- NoSQL - MongoDB
- Big Data stores, etc.
Syndicate (paid) databases and published sources:
- Industry reports
- Government reports
- Quasi-government reports, etc.
Random variables fall into two major categories: output variables and input variables.
Mathematically, the relation between them is expressed using the base equation: Y = f(X)
Y is known as:
- Dependent variable
- Explained variable
- Measured variable
- Experimental variable
- Class variable
X is known as:
- Independent variable
- Predictor variable
- Controlled variable
- Manipulated variable
- Exposure variable
Upper case is always used to indicate random variables.
Lower case letters are used to indicate the values that a random variable can take.
Example: for a single die roll, x can take the values 1, 2, 3, 4, 5, or 6.
A Probability Distribution tabulates or displays the probabilities of all potential outcomes of an event.
If a random variable is continuous, the underlying probability distribution is known as a Continuous Probability Distribution.
The underlying probability distribution is known as Discrete Probability Distribution if a random variable is discrete.
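The die-roll example above gives a simple discrete probability distribution, which can be tabulated in a few lines (exact fractions are used here so the probabilities sum to exactly 1):

```python
from fractions import Fraction

# Discrete probability distribution of X = outcome of one fair die roll:
# each of the six outcomes is equally likely.
distribution = {x: Fraction(1, 6) for x in range(1, 7)}

# The probabilities over all possible outcomes must sum to 1
total = sum(distribution.values())
print(total)  # 1
```

Listing every possible value of x alongside its probability is exactly what a discrete probability distribution is.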
Non-Probability Sampling is based on convenience, with varying priorities for which data must be collected to represent the population.
A few examples of Non-Probability Sampling: Convenience Sampling, Judgmental Sampling, Quota Sampling, and Snowball Sampling.
Inferential statistics is the process of analysing sample data and deriving statements / properties about a population.
Probability Sampling, also known as Unbiased Sampling, is the default approach for inferential statistics: every data point to be collected has an equal chance of being selected.
A few examples of Probability Sampling:
Sampling Frame
- Example population: all Covid-19 cases on the planet
- The majority of Covid-19 cases are in the USA, India, and Brazil, hence these 3 countries can be selected as the Sampling Frame
- There is no hard and fast rule for a Sampling Frame; it is devised based on business logic
Simple Random Sampling
- Randomly Sample 10% or 20% of the data from the sampling frame using Simple Random Sampling (SRS) technique
- SRS is the gold standard technique used for sampling
- SRS is the only sampling technique which has no bias
- Other sampling techniques, such as Stratified Sampling, can also be used to sample the data, but SRS is the best
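Simple Random Sampling is a one-liner in practice. A minimal sketch using the standard library, with a hypothetical frame of case IDs (the frame size and 10% rate are assumptions for illustration):

```python
import random

random.seed(7)

# Hypothetical sampling frame: case IDs from the selected countries
frame = [f"case_{i:05d}" for i in range(10_000)]

# Simple Random Sampling: draw 10% of the frame without replacement;
# every case has an equal chance of being selected.
sample = random.sample(frame, k=len(frame) // 10)
print(len(sample))  # 1000
```

Because `random.sample` draws without replacement and uniformly, every subset of the given size is equally likely, which is the defining property of SRS.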