Data Mining Supervised Learning
Machine Learning Primer
Steps based on Training & Testing datasets
- Get the historical data needed for analysis, which is the output of data cleansing
- Split the data into training data & testing data
- Split the data based on random sampling if the data is balanced
- Split the data based on other sampling techniques if the data is imbalanced
(Refer to Step 2 of CRISP-DM for sampling techniques for imbalanced datasets)
- We may divide the data according to the 80/20 rule, whereby 80% of the data is used for training and the remaining 20% is used for testing.
- Build the model using the training data
- Test the model on testing data to get the predicted values
- Compare the predicted values with the actual values of the testing data to calculate the error or accuracy. This gives us the Testing Error or Testing Accuracy. Techniques for model evaluation are covered in the sections that follow.
- Also test the built model on the training data
- Compare the predicted values with the actual values of the training data to calculate the error or accuracy. This gives us the Training Error or Training Accuracy
- Training Error and Testing Error
- If training error and testing error are small and close to each other then the model is considered to be RIGHT FIT (how low the error values should be is a subjective evaluation. E.g., In healthcare even 1% error might be considered high, whereas in a garment manufacturing process even 8% error might be considered low)
- If training error is low and testing error is high then the model is considered to be OVERFITTING. Overfitting is associated with high VARIANCE
- If training error is high then testing error will usually be high as well. This scenario is called UNDERFITTING and is associated with high BIAS
- If training error is high and testing error is low then something is seriously wrong with the data or model you built. Redo the entire project
- Overfitting is a frequent issue that can be difficult to resolve. Different regularisation strategies (also known as generalisation approaches) are used by various machine learning algorithms to handle overfitting.
- Underfitting can often be addressed by adding more features (columns) and, in some cases, more datapoints (observations). Effective feature engineering and transformation also help with this problem.
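The steps above can be sketched in plain Python. This is a minimal illustration, not a prescribed implementation: the synthetic data, the simple straight-line model, the helper names (`train_test_split`, `fit_line`, `mean_squared_error`, `diagnose_fit`) and the `acceptable_error` threshold are all assumptions made for the example. In practice the acceptable error depends on the domain, as noted above.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Randomly split rows into (train, test); 80/20 by default."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(rows):
    """Least-squares fit of y = intercept + slope * x on (x, y) pairs."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean_squared_error(model, rows):
    """Average squared gap between predicted and actual values."""
    intercept, slope = model
    return sum((y - (intercept + slope * x)) ** 2 for x, y in rows) / len(rows)


def diagnose_fit(train_error, test_error, acceptable_error):
    """Map the two errors to the four cases described in the list above.

    acceptable_error is a hypothetical, domain-dependent threshold.
    """
    train_high = train_error > acceptable_error
    test_high = test_error > acceptable_error
    if train_high and not test_high:
        return "check data or model"
    if train_high:
        return "underfitting"
    if test_high:
        return "overfitting"
    return "right fit"


# Synthetic, balanced data (an assumption for the demo): y = 2x + 1 plus noise.
noise = random.Random(0)
data = [(float(x), 2.0 * x + 1.0 + noise.gauss(0.0, 0.5)) for x in range(100)]

train, test = train_test_split(data)          # 80% training, 20% testing
model = fit_line(train)                       # build the model on training data
train_error = mean_squared_error(model, train)
test_error = mean_squared_error(model, test)
verdict = diagnose_fit(train_error, test_error, acceptable_error=1.0)
print(train_error, test_error, verdict)
```

Because the model is fitted only on `train`, the error on `test` estimates how it behaves on unseen data, and comparing the two errors gives the fit diagnosis.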
A plain Training & Testing split can lead to information leakage: if the testing data is used repeatedly to tune the model, it no longer gives an unbiased estimate of performance. A newer school of thought counters this by splitting the data into:
- Training Data
- Validation Data (Development Data)
- Testing Data
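A three-way split of this kind might look like the sketch below. The 60/20/20 proportions and the function name `train_validation_test_split` are assumptions for illustration; the text above does not prescribe specific fractions.

```python
import random


def train_validation_test_split(rows, validation_fraction=0.2,
                                test_fraction=0.2, seed=42):
    """Randomly split rows into (train, validation, test).

    The default fractions are illustrative assumptions, not fixed rules.
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    n_validation = int(round(len(shuffled) * validation_fraction))
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    train = shuffled[n_test + n_validation:]
    return train, validation, test


rows = list(range(100))  # stand-in for 100 cleansed observations
train, validation, test = train_validation_test_split(rows)
print(len(train), len(validation), len(test))
```

The validation data is used for comparing and tuning models, while the testing data is kept aside and used only once for the final evaluation.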