CRISP-DM: Data Collection
Bharani Kumar Depuru is a well-known IT personality from Hyderabad. He is the Founder and Director of AiSPRY and 360DigiTMG. An IIT and ISB alumnus with more than 18 years of experience, he has held prominent positions at IT leaders such as HSBC, ITC Infotech, Infosys, and Deloitte. He is a prominent IT consultant specializing in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence, and Business Management. Bharani Kumar is also the chief trainer at 360DigiTMG, with more than ten years of training experience, and has been making the IT transition journey easy for his students. 360DigiTMG is at the forefront of delivering quality education, thereby bridging the gap between academia and industry.
Table of Contents
Articulate the business problem by understanding the client/customer requirements
Formulate Business Objectives
Formulate Business Constraints
Understanding the various data types is pivotal before proceeding further with data collection.
Continuous Data: any data that still makes sense when represented in decimal format.
Discrete Data: data that does not make sense when represented in decimal format.
Examples of discrete (categorical) data: whether a person claims insurance or not; whether a person will 'pay on time', 'pay with a delay', or 'default'.
Examples of count data: the 'number of people who claim insurance', the 'number of loan defaulters'.
Qualitative data is non-numerical data. Quantitative data consists of numbers.
Continuous Data and Count Data fall under Quantitative Data.
Structured data is data that, in its raw state, can be placed in a tabular format.
Unstructured data is data that, in its raw state, cannot be placed in any tabular format.
A video, for example, can be split into images (frames) and the images into pixels; each pixel intensity value can then be an entry in a column, and the data becomes structured (see the sketch below).
Videos, images, audio/speech, and textual data are examples of unstructured data.
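As a minimal sketch of that idea (the array sizes and random pixel values here are made up for illustration), a grayscale video frame is just a 2-D array of pixel intensities, and flattening each frame into one row yields a tabular, structured representation:

```python
import numpy as np

# Hypothetical example: 3 tiny 4x4 grayscale "frames" from a video,
# with pixel intensities in the range 0-255.
frames = np.random.randint(0, 256, size=(3, 4, 4))

# Flatten each frame so that every pixel intensity becomes one column.
# The result is a table: one row per frame, one column per pixel.
structured = frames.reshape(frames.shape[0], -1)

print(structured.shape)  # (3, 16) -> 3 rows (frames), 16 pixel columns
```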
Mel Frequency Cepstral Coefficient (MFCC) allows audio files and speech data to be transformed into features.
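One common way to do this in Python is with the librosa package (a sketch only; librosa, the filename audio.wav, and the choice of 13 coefficients are assumptions for illustration, not something prescribed by the text):

```python
import librosa

# Load the audio file (librosa resamples to 22,050 Hz by default).
y, sr = librosa.load("audio.wav")

# Extract 13 Mel Frequency Cepstral Coefficients per frame.
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

print(mfccs.shape)  # (13, number_of_frames) -> a structured feature matrix
```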
To make text data structured, it can be transformed into a Bag of Words (BoW), for instance.
Example: But because I'm poor, all I have are my dreams. Please tread carefully because you're walking on them.
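A minimal sketch of building a Bag of Words for that sentence, assuming scikit-learn's CountVectorizer (the library choice is an assumption; the text does not prescribe one):

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = ["But because I'm poor, all I have are my dreams. "
            "Please tread carefully because you're walking on them."]

# Build the vocabulary and count how often each word occurs.
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(sentence)

print(vectorizer.get_feature_names_out())  # the vocabulary (columns)
print(bow.toarray())                       # word counts (one structured row)
```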
Big Data is data characterized by the 5 Vs: high Volume, produced quickly (Velocity) from a wide Variety of sources, with a degree of Veracity and the right Value.
Big Data is data that cannot be processed with currently available software or stored on currently accessible hardware.
Non-Big Data is any data that can be processed with currently available software and stored on currently accessible hardware.
Rule of thumb: if the percentage of the minority output class is less than 30%, the data are imbalanced.
Refer to page 2 for information on sampling for imbalanced data.
When we have imbalanced data, we apply different sampling techniques such as undersampling the majority class, oversampling the minority class, or generating synthetic minority samples (e.g. SMOTE), as sketched below.
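As one hedged illustration (assuming the third-party imbalanced-learn package; the dataset and the 95/5 class split below are made up for the example), SMOTE oversamples the minority class with synthetic points so both classes end up equally represented:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Hypothetical imbalanced dataset: roughly 5% minority class.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
print("Before:", Counter(y))  # e.g. roughly Counter({0: 950, 1: 50})

# Oversample the minority class with synthetic examples.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("After: ", Counter(y_res))  # both classes balanced
```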
Primary Data: data collected at the source.
Secondary Data: data collected beforehand.
Primary data collection methods include surveys, design of experiments, IoT sensor data, interviews, focus groups, etc.
1. Understand the business reality and the objective behind conducting the survey. E.g., sales are low for a training company.
2. Perform root cause analysis (Fishbone Analysis, 5-Why Analysis, etc.). E.g., product pricing is uncompetitive.
3. Formulate the decision problem. E.g., should product prices be changed?
4. Formulate the research objective. E.g., determine the price elasticity of demand and the impact of various levels of price changes on sales and profits.
5. List out the constructs. E.g., training enrolment.
6. Deduce aspects based on the construct. E.g., time aspect, strength aspect, constraint aspect.
7. Devise the survey questionnaire based on the aspects. E.g., "I am most likely to enrol for the training program in: the next one week, the next one month, the next one quarter," etc.
Organizational data are stored in databases
Open Source databases
Syndicate (paid) databases
Probability is the degree to which an event is likely to occur, given by the ratio of the number of favourable cases to the total number of possible cases.
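Written as a formula, with a standard textbook example (the die-roll example is an illustration added here, not taken from this page):

```latex
P(A) = \frac{\text{number of favourable cases}}{\text{total number of possible cases}},
\qquad
P(\text{even number on one fair die roll}) = \frac{3}{6} = 0.5
```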
Random variables fall into two major categories: input variables and output variables.
Mathematically, the relation between them is expressed using the base equation Y = f(X), where Y is the output variable and X represents the input variables.
Upper case is always used to indicate random variables.
Lower case letters are used to indicate the values that a random variable can take.
Example: for a single die roll, the random variable X can take the values x = 1, 2, 3, 4, 5, or 6.
A probability distribution tabulates or displays the probability of every possible outcome of an event.
If a random variable is continuous, the underlying probability distribution is known as a Continuous Probability Distribution.
If a random variable is discrete, the underlying probability distribution is known as a Discrete Probability Distribution.
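A minimal sketch of tabulating the discrete probability distribution of one fair die roll in Python (the simulation size of 10,000 rolls is arbitrary):

```python
import random
from collections import Counter

# Simulate 10,000 rolls of a fair die and tabulate the outcomes.
rolls = [random.randint(1, 6) for _ in range(10_000)]
counts = Counter(rolls)

# Empirical probability of each outcome; every value should be close to 1/6.
for outcome in sorted(counts):
    print(outcome, counts[outcome] / len(rolls))
```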
Biased sampling, also referred to as non-probability sampling, is based on convenience: the data points collected to represent the population have varying priorities of being selected.
Inferential statistics is the process of analysing sample data and deriving statements about the properties of a population.
Probability sampling, also known as unbiased sampling, is the default approach for inferential statistics: each data point to be collected has an equal chance of being selected (a sketch follows below).
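A minimal sketch of probability (unbiased) sampling in Python, where every record has the same chance of selection (the population of customer IDs and the sample size are made up for illustration):

```python
import random

random.seed(42)

# Hypothetical population of 1,000 customer IDs.
population = list(range(1, 1001))

# Simple random sample: every ID has an equal probability of being chosen.
sample = random.sample(population, k=50)

print(len(sample), sample[:10])
```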