
Pitfalls of Purely Data-Driven ML Approaches

  • July 17, 2023

Meet the Author : Mr. Bharani Kumar

Bharani Kumar Depuru is a well-known IT personality from Hyderabad. He is the Founder and Director of Innodatatics Pvt Ltd and 360DigiTMG. An alumnus of IIT and ISB with more than 17 years of experience, he has held prominent positions at IT majors such as HSBC, ITC Infotech, Infosys, and Deloitte. He is a prominent IT consultant specializing in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence, and Business Management. Bharani Kumar is also the chief trainer at 360DigiTMG, with more than ten years of training experience, and has been making the IT transition journey easy for his students. 360DigiTMG is at the forefront of delivering quality education, thereby bridging the gap between academia and industry.


Machine Learning (ML) systems are complex, and the more complex a system is, the more ways it can fail. Building reliable ML systems requires a clear understanding of what may go wrong. Let's examine concrete examples to illustrate potential hazards that can appear at various stages.


Understanding the business problem:

In real-world scenarios, issues rarely present themselves as ready-made data science problems. The first step in every project is to formulate the objective of the business problem. In other words, we have to translate a high-level aim into a well-defined data science problem.

Consider this request: "Please go to the store and get six milk packets, and if you find eggs, buy a dozen."

The husband came back home. Can you guess what transpired?

His wife began shouting at him immediately!

Instead of bringing twelve eggs and six milk packets as requested, the husband misinterpreted her words and brought twelve milk packets.

Solution: We must pin down the precise objective of the problem before we start.


Assuming huge data will solve our problem/faulty inputs

"We have so much data collected; now all we need is a data science team to make sense of it and put it to work." Irrelevant feature selection is a common side effect of this mindset. Many models need a large set of columns to drive the learning process, but in the attempt to gather enough learning data, it can be challenging to ensure we include only the relevant inputs.


The process of building a well-performing model requires keen exploration and analysis to ensure we pick the appropriate features. Domain understanding and involving subject-matter experts are the two most important pillars for selecting the right features. We can also use techniques such as random forest feature importances and principal component analysis (PCA), which help us select more effective features.
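As a minimal sketch of these two techniques, the snippet below ranks features with a random forest and separately reduces dimensionality with PCA. The data is synthetic and the parameter choices are arbitrary, not a prescription:

```python
# Feature selection sketch on synthetic data: random forest importances
# rank the original columns, while PCA projects them onto fewer components.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=3, random_state=0)

# A random forest exposes an importance score per feature.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
ranked = np.argsort(rf.feature_importances_)[::-1]
top_features = ranked[:3]  # keep the three most informative columns

# PCA instead replaces all features with a few uncorrelated components.
X_reduced = PCA(n_components=3).fit_transform(X)
```

Note the difference: the forest keeps a subset of the original, interpretable columns, while PCA produces new composite features.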


Train-test leakage

The data science team could unintentionally train on data containing criteria that directly aid in predicting the result. The algorithm will then display unrealistically good performance. For instance, a team could inadvertently include a variable that denotes a specific ailment's treatment in a model meant to forecast that ailment.

Consider that you are developing a method for diagnosing a certain ailment from radiological scans, and each patient is represented by many images. If all images are randomly divided into train and test sets, some of a patient's images will land in the training set and some in the test set. A model trained this way may show excellent performance on the test set simply because it recognises the patient rather than the disease.


For the model to predict outcomes honestly, the DS team must carefully construct their datasets using only information that will actually be available at prediction time, and split the data so that no information leaks between the train and test sets.


Missing data:

Data will be missing in some cases, i.e., the dataset has incomplete records. If we overlook this and proceed with training, it may induce a significant bias in the results. The missingness of one feature or another can have many causes worth understanding.

For instance, if the data is recorded through a survey, respondents may be less likely to answer certain questions.

Missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) are the standard variants of missingness.

Whatever the reason for the missing data, it may lead to a biased model.

Solution:

If you cannot design data collection to guarantee complete datasets, apply statistical techniques such as discarding records with missing values or using a proper imputation strategy to fill in the missing data.
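Both remedies can be sketched in a few lines on toy data (the column names and values below are made up for illustration):

```python
# Two common remedies for missing values: dropping incomplete rows,
# or imputing gaps with the column mean.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age":    [25, np.nan, 40, 31],
                   "income": [50_000, 62_000, np.nan, 48_000]})

# Option 1: discard records with missing values (loses data).
dropped = df.dropna()

# Option 2: fill each gap with that column's mean (keeps all rows).
imputed = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(df),
                       columns=df.columns)
```

Mean imputation is only appropriate when values are plausibly missing at random; under MNAR, both remedies can still leave bias behind.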


Performance is not the outcome:

Clinging to the metrics specified during problem formulation does not guarantee the desired result.

The hidden stratification problem states that an illness-detection algorithm with great overall performance may still routinely overlook a rare subtype: the variability in sub-group performance is masked by the overall number.

Even if we do well overall in a healthcare scenario, one of the criteria may be that no record should be incorrectly predicted: our algorithm should identify all cancer patients as having cancer. Optimizing the headline metric alone might still lead to poor model construction. Aggregate performance is frequently not the outcome that matters!

Solution:

A cost-based impact analysis will determine whether the model is sound. A model that forecasts a patient's cancer status should never be built on headline metrics alone, since the risk involved in such a scenario is far higher. For such problems, we need subject-matter specialists who can help evaluate them.
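The hidden-stratification point is easy to demonstrate numerically: a model can score 90% overall recall while missing every case of a rare subtype. A sketch with synthetic labels and subgroups:

```python
# Overall recall can hide zero recall on a rare subtype, so we also
# report recall per subgroup (labels and subgroups are synthetic).
import numpy as np
from sklearn.metrics import recall_score

y_true  = np.array([1] * 90 + [1] * 10)   # 90 common cases, 10 rare cases
y_pred  = np.array([1] * 90 + [0] * 10)   # model misses the rare subtype
subtype = np.array(["common"] * 90 + ["rare"] * 10)

overall = recall_score(y_true, y_pred)    # 0.90 overall (looks fine)
per_group = {g: recall_score(y_true[subtype == g], y_pred[subtype == g])
             for g in np.unique(subtype)} # but rare subtype: 0.0
```

Reporting the per-subgroup breakdown alongside the aggregate metric is what surfaces this failure.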


Optimization of the model's hyperparameters:

Most models have hyperparameters: settings that affect the model's configuration, for example the kernel function used in an SVM, the number of trees built in a random forest, or the neural network architecture. Tuning the hyperparameters significantly affects the performance of the model, and one size never fits all. That is, they need to be picked for each particular data set to get the best out of the model.

Solution:

Using AutoML techniques is the best way to optimize both the choice of model and its hyperparameters, in addition to other parts of the data mining pipeline.
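As a simple stand-in for a full AutoML pipeline, cross-validated grid search already automates the hyperparameter side of this. A minimal sketch for the SVM example above (the grid values and data are arbitrary):

```python
# Hyperparameter tuning sketch: cross-validated grid search over an
# SVM's kernel and regularization strength on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

grid = {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}
search = GridSearchCV(SVC(), grid, cv=3).fit(X, y)
best = search.best_params_  # kernel and C chosen for this data set
```

Full AutoML systems extend this idea to also search over model families and preprocessing steps.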

Key takeaways:

  • Successful ML projects require much more than a high-performing model on a held-out test set.
  • Consequently, instead of focusing only on incremental gains in model performance, we need to pay attention to preventing, detecting, and fixing pitfalls at all stages of model building.

