Overfitting and Underfitting
What is Overfitting?
Overfitting occurs when a model performs very well on training data but poorly on test data. Along with the genuine patterns, the model learns the noise in the training data, and that noise has a detrimental impact on its performance on test data. Overfitting typically arises with flexible nonlinear models, which can produce a nonlinear decision boundary. In an SVM, for example, the decision boundary can be a linear hyperplane or, with a nonlinear kernel, a curved surface.
The pattern is evidently nonlinear in this instance. Results from the model cannot be generalised to new data.
Non-linear models, such as decision trees, frequently overfit the decision boundary that they produce. This is also known as a high variance problem. Using target shooting as an analogy, high variance is comparable to shots that scatter widely instead of grouping together. Overfitting results in a large validation/test error and a relatively small training error.
Reasons for Overfitting:
- The data has not been preprocessed properly and still contains noise.
- The model has high variance (an overfitted model is said to have high variance).
- The model has too many parameters to learn.
- The model memorises the training examples instead of learning the underlying patterns.
To confront overfitting:
- Using K-fold cross-validation:
Cross-validation is a standard approach to detecting and guarding against overfitting. The entire dataset is split into k sets of roughly equal size. In the first iteration, the first set is held out as test data, the algorithm trains on the remaining k-1 sets, and the test error is calculated.
In the second iteration, the second set is chosen as the test set and the remaining k-1 sets are used as training data before the test error is calculated again.
The procedure repeats until each of the k sets has been used as the test set once, and the k test errors are then averaged.
The method for K=5 is illustrated in the image below.
We can also tune the number of folds, k, to get a more reliable estimate of how well the model generalises.
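The k-fold procedure described above can be sketched in plain Python. This is a minimal illustration, not a library implementation: the helper names (`k_fold_indices`, `cross_validate`, `fit_mean`, `predict_mean`) and the toy data are assumptions made for the example, and the "model" simply predicts the mean of the training targets.

```python
import random
import statistics

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle indices 0..n_samples-1 and split them into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th index starting at i, so fold sizes differ by at most one.
    return [indices[i::k] for i in range(k)]

def cross_validate(xs, ys, k, fit, predict):
    """Return the mean squared test error measured in each of the k iterations."""
    errors = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        # Train on the k-1 remaining folds, then score on the held-out fold.
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        squared = [(predict(model, xs[i]) - ys[i]) ** 2 for i in test_idx]
        errors.append(statistics.fmean(squared))
    return errors

# Hypothetical toy model: always predict the mean of the training targets.
def fit_mean(train_x, train_y):
    return statistics.fmean(train_y)

def predict_mean(model, x):
    return model

xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]
fold_errors = cross_validate(xs, ys, k=5, fit=fit_mean, predict=predict_mean)
print("per-fold test MSE:", fold_errors)
print("mean test MSE:", statistics.fmean(fold_errors))
```

Any pair of `fit` and `predict` functions with the same shape can be passed in, so the same loop can compare a simple and a complex model on identical folds.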
- Using sufficient training data:
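More training data makes it harder for the model to fit noise, but this will not always work, especially if the model is too complex for the problem. In that case we can try a less powerful model with fewer parameters. When collecting more data is impractical, data augmentation can sometimes help.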
- Quantity of features:
Overfitting may be avoided through feature engineering and feature selection.
We often add features to a model in an effort to increase its accuracy, but doing so may overcomplicate it and cause overfitting.
- Regularisation:
To keep the model as simple as feasible, regularisation keeps the parameter values as small as possible. A flexible model with well-tuned regularisation will often generalise better than the same model without it. By shrinking the parameters, regularisation prevents the model from over-learning the patterns in the training data. The strength of the penalty is a tuning parameter, and choosing it well is what gives the proper fit. Different machine learning algorithms have different regularisation hyperparameters: for instance, dropout in neural networks, pruning strategy, ccp_alpha and maximum tree depth in decision trees, and L1/L2 penalties in regression.
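To show how a penalty keeps parameter values small, here is a minimal sketch of L2 (ridge) regularisation for a single feature with no intercept. The function name `ridge_slope`, the toy data, and the penalty values are assumptions chosen for illustration; the closed form follows from minimising the sum of squared errors plus `lam * w**2`.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form slope for one-feature ridge regression without an intercept.

    Minimises sum((y - w*x)**2) + lam * w**2, which gives
    w = sum(x*y) / (sum(x*x) + lam).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]  # roughly y = 2x with a little noise

# A larger penalty shrinks the learned coefficient towards zero.
for lam in (0.0, 1.0, 10.0, 100.0):
    print(f"lambda={lam:>5}: w={ridge_slope(xs, ys, lam):.3f}")
```

With `lam = 0` this is ordinary least squares. As `lam` grows the slope shrinks, which lowers variance but, if pushed too far, moves the model towards underfitting.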
- Adopting ensemble techniques
Ensemble methods such as bagging (for example, random forests) and boosting can be used to address variance problems.
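The idea behind bagging can be sketched as follows: train several copies of a high-variance base model on bootstrap samples and average their predictions. This is plain bagging only, not a random forest (which also subsamples features). The 1-nearest-neighbour base learner, the helper names, and the synthetic data are assumptions made for the example.

```python
import random
import statistics

def fit_1nn(train):
    """A 1-nearest-neighbour regressor 'fits' by storing the (x, y) pairs."""
    return list(train)

def predict_1nn(model, x):
    # Return the target of the closest stored training point.
    nearest = min(model, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

def fit_bagged(train, n_models=25, seed=0):
    """Train n_models base learners, each on a bootstrap sample of the data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Sample with replacement, keeping the sample the same size as the data.
        bootstrap = [rng.choice(train) for _ in train]
        models.append(fit_1nn(bootstrap))
    return models

def predict_bagged(models, x):
    # Average the individual predictions to smooth out their variance.
    return statistics.fmean([predict_1nn(m, x) for m in models])

noise = random.Random(1)
# Synthetic data: the target is twice the feature, plus Gaussian noise.
train = [(x / 2, x + noise.gauss(0, 1.0)) for x in range(20)]
models = fit_bagged(train)
print("single 1-NN:", predict_1nn(fit_1nn(train), 4.2))
print("bagged 1-NN:", predict_bagged(models, 4.2))
```

A single 1-NN model reproduces the noise of whichever point happens to be nearest, while the bagged prediction averages over several nearby points.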
What is Underfitting?
Underfitting occurs when a model does not learn the patterns in the training data well enough, and so cannot generalise to unseen data. The model learns the relationship between the input and output variables inaccurately. This happens when the model is overly simplistic or needs more training time, more input features, and so on. Both training and validation/test errors are high.
The model's predictions are inaccurate from the outset, even on the training data. Compared to overfitting, underfitting is usually the less serious problem because it is easy to detect and fix. It can also arise when the algorithm is fitted on too small a data set, which leads to inaccurate predictions.
Reasons for Underfitting:
- The model has high bias.
- The training data is not sufficient to learn the patterns.
- The model is too simple.
- The data has not been cleaned properly, so the model cannot capture the relationship between variables.
- Noise in the data may also be one of the reasons for underfitting.
To confront underfitting:
- Adding more features to the data:
By including more inputs in our data, we can make the model more complex so that it better reflects the relationship between the variables. One way to try this out is to build polynomial models of degree 2, degree 3, and so on.
Underfitting can be fixed by adding inputs or capacity step by step. For instance, increasing the number of hidden neurons in a neural network or the number of trees in a random forest adds complexity to the model and improves training results.
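As an illustration of adding polynomial features, the sketch below fits a degree 1 and a degree 2 polynomial to data that follows a quadratic relationship. The helper names and the data are assumptions for the example, and the small hand-written solver is there only to keep the block self-contained. In practice a library routine such as `numpy.polyfit` would normally be used.

```python
def poly_features(x, degree):
    """Map a scalar x to [1, x, x**2, ..., x**degree]."""
    return [x ** d for d in range(degree + 1)]

def solve(a, b):
    """Solve the linear system a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        tail = sum(m[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (m[r][n] - tail) / m[r][r]
    return w

def fit_poly(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (X^T X) w = X^T y."""
    rows = [poly_features(x, degree) for x in xs]
    p = degree + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    return solve(xtx, xty)

def mse(w, xs, ys):
    """Mean squared error of the polynomial with coefficients w."""
    preds = [sum(wi * x ** i for i, wi in enumerate(w)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)

xs = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
ys = [x ** 2 for x in xs]  # a quadratic relationship

for degree in (1, 2):
    w = fit_poly(xs, ys, degree)
    print(f"degree {degree}: training MSE = {mse(w, xs, ys):.4f}")
```

The straight line underfits and leaves a large training error, while the extra squared feature lets the model capture the relationship. Raising the degree much further on noisy data would start to fit the noise instead.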
- Increasing the duration of training:
If we stop training too soon, the algorithm does not get the chance to learn the patterns completely. It is equally important to choose the right number of training steps, because training for too long may lead to overfitting. In neural networks, we can increase the number of epochs.
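One common way to pick the number of epochs is to monitor error on a validation set and keep the parameters from the best epoch. The sketch below does this for a one-feature linear model trained by gradient descent. The function name, learning rate, `patience` value, and data are all assumptions chosen for illustration.

```python
def train_with_early_stopping(train, val, lr=0.01, max_epochs=500, patience=10):
    """Fit y = w*x + b by full-batch gradient descent.

    Returns (best validation MSE, w, b, epoch) for the best epoch seen, and
    stops once validation MSE has not improved for `patience` epochs.
    """
    w, b = 0.0, 0.0
    best = (float("inf"), w, b, 0)
    for epoch in range(1, max_epochs + 1):
        # Gradients of the training MSE, both computed before either update.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in train) / len(train)
        grad_b = sum(2 * (w * x + b - y) for x, y in train) / len(train)
        w -= lr * grad_w
        b -= lr * grad_b
        val_mse = sum((w * x + b - y) ** 2 for x, y in val) / len(val)
        if val_mse < best[0]:
            best = (val_mse, w, b, epoch)
        elif epoch - best[3] >= patience:
            break  # validation error has stalled, so stop training
    return best

train = [(float(x), 3.0 * x + 1.0) for x in range(5)]
val = [(x + 0.5, 3.0 * (x + 0.5) + 1.0) for x in range(4)]

short_run = train_with_early_stopping(train, val, max_epochs=5)
long_run = train_with_early_stopping(train, val, max_epochs=500)
print("5 epochs   -> validation MSE:", short_run[0])
print("500 epochs -> validation MSE:", long_run[0])
```

The short run stops while the model is still underfitted, and the longer run reaches a much lower validation error. On noisy data, the `patience` check is what prevents training from continuing into overfitting.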
- Reducing regularisation:
By imposing a penalty on parameters with large coefficients, regularisation lowers the variance of a model. A variety of methods, including L1/L2 regularisation, can be used to reduce the influence of noise and outliers on a model. If the penalty is too strong, however, the features become too uniform and the model can no longer recognise the prevailing trend, which results in underfitting. Reducing the regularisation level reintroduces complexity and variance into the model, enabling effective model training.
What is the best fit in Machine Learning?
In principle, the best fit is a model that predicts with zero error; in practice, it is the model with the lowest error on unseen data. From the charts below we can infer that the model initially fails to capture the relationship between x and y, and that features are then added to improve the pattern learning. If we keep adding features to reduce underfitting, the model will eventually become more complex and start overfitting.
Similarly, as training time grows and more inputs are added, the error on both training data and test data will at first decrease. If this continues for too long, the model will become overfitted.
So choosing the right set of features, the right amount of training, and the right regularisation penalty will help us achieve the right fit, or the best fit.
For every model, we try to find the ideal balance between bias and variance. This ensures that the model captures the key patterns while disregarding the noise. This is known as the bias-variance tradeoff, and managing it helps keep the model's error as low as feasible.
An optimised model is sensitive to the patterns in our data while still being able to generalise to new data. It should have low bias and low variance to avoid both underfitting and overfitting. Achieving low bias and low variance at the same time is therefore our goal.
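The tradeoff can be made precise with the standard decomposition of expected squared error. The notation is an assumption introduced here, not taken from the article: targets are generated as $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, $\hat{f}$ is the trained model, and the expectation is taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

An underfitted model is dominated by the bias term and an overfitted model by the variance term. The noise term cannot be reduced by any choice of model.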