

TPOT an Auto-ML Library

  • June 23, 2023

Meet the Author: Mr. Bharani Kumar

Bharani Kumar Depuru is a well-known IT personality from Hyderabad. He is the Founder and Director of Innodatatics Pvt Ltd and 360DigiTMG. An IIT and ISB alumnus with more than 17 years of experience, he has held prominent positions at IT leaders such as HSBC, ITC Infotech, Infosys, and Deloitte. He is a prominent IT consultant specializing in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence, and Business Management. Bharani Kumar is also the chief trainer at 360DigiTMG, with more than ten years of training experience, and has been making the IT transition journey easy for his students. 360DigiTMG is at the forefront of delivering quality education, bridging the gap between academia and industry.


Introduction

TPOT is a Python Automated Machine Learning tool that optimises machine learning pipelines using genetic programming.

TPOT automates the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data. Once TPOT has finished searching (or you grow tired of waiting), it gives you the Python code for the best pipeline it found, so you can tinker with the pipeline from there.

[Image: an example machine learning pipeline]

Installation

TPOT is built on top of several Python libraries, including:

  • NumPy
  • SciPy
  • scikit-learn
  • joblib
  • xgboost
  • DEAP
  • update_checker
  • tqdm
  • stopit
  • pandas

Most of the required Python packages come pre-installed with the Anaconda Python distribution.

We can also install TPOT using pip or conda-forge.


Using pip

NumPy, SciPy, scikit-learn, pandas, joblib, and PyTorch can be installed through Anaconda.
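A minimal sketch of the conda command for these packages (assuming the Anaconda distribution is already installed; package names are as published on conda's channels):

```shell
conda install numpy scipy scikit-learn pandas joblib pytorch
```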

DEAP, update_checker, tqdm, stopit, and xgboost can be installed with pip.
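A sketch of the pip command for these remaining dependencies (package names as published on PyPI):

```shell
pip install deap update_checker tqdm stopit xgboost
```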

Windows users: pip installation may not work in some Windows environments and can cause unexpected errors. If you have issues installing XGBoost, check the XGBoost installation documentation.

If you intend to use Dask for parallel training, make sure to install dask[delayed], dask[dataframe], and dask-ml. Note that dask-ml>=1.7 requires distributed>=2.4.0 and scikit-learn>=0.23.0.
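A sketch of the corresponding pip command (the extras syntax below assumes a recent Dask release; the quotes keep the shell from expanding the brackets):

```shell
pip install "dask[delayed]" "dask[dataframe]" dask-ml
```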

If you plan to use the TPOT-MDR configuration, make sure to install scikit-mdr and scikit-rebate.
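A sketch of the pip command (note that scikit-rebate is published on PyPI as skrebate):

```shell
pip install scikit-mdr skrebate
```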

To enable support for PyTorch-based neural networks (TPOT-NN), you will need to install PyTorch. TPOT-NN works with either the CPU or GPU version of PyTorch, but we strongly recommend using a GPU version if possible, as CPU PyTorch models tend to train very slowly.

We recommend following PyTorch's installation instructions customized for your operating system and Python distribution.

Finally, install TPOT itself with pip.
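The pip command is simply:

```shell
pip install tpot
```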

conda-forge

Install TPOT and its core dependencies from conda-forge.
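A sketch of the conda-forge command:

```shell
conda install -c conda-forge tpot
```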

Additional dependencies (such as xgboost, dask, and the TPOT-MDR packages) can also be installed from conda-forge.
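A sketch, assuming you want the Dask, TPOT-MDR, and xgboost extras as well:

```shell
conda install -c conda-forge tpot xgboost dask dask-ml scikit-mdr skrebate
```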

Using the TPOT-cuML configuration

This configuration requires an NVIDIA GPU with Pascal architecture or better (compute capability 6.0+) and the cuML library installed. With this configuration, all model training and prediction are GPU-accelerated. This configuration is especially useful for medium-sized and larger datasets on which CPU-based estimators are a common bottleneck, and it works for both the TPOTClassifier and TPOTRegressor.

To install TPOT with the TPOT-cuML configuration, first download the conda environment yml file provided for it in the TPOT repository, then create an environment from it.
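A sketch of creating the environment; the file name tpot-cuml.yml is an assumption based on the TPOT repository and may differ from what you download:

```shell
# Create and activate a conda environment from the downloaded yml file
conda env create -f tpot-cuml.yml -n tpot-cuml
conda activate tpot-cuml
```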
Applications

  • AutoML algorithms aren't intended to run for only a few minutes

    Of course, TPOT can find a reasonably good pipeline for your dataset within a few minutes of running time. However, to find the best pipeline for your dataset, TPOT needs to run for a sufficient amount of time. In some cases it won't find any suitable pipeline at all, in which case a RuntimeError ("A pipeline has not yet been optimized. Please call fit() first.") is raised. To let TPOT fully explore the pipeline space for your dataset, it is helpful to run several instances of TPOT in parallel for a long period of time.

  • AutoML algorithms can recommend different solutions for the same dataset

    If you're working with a reasonably complex dataset or run TPOT for a short amount of time, different TPOT runs may produce different pipeline recommendations. TPOT's optimization algorithm is stochastic in nature, which means that it uses randomness (in part) to search the space of possible pipelines. When two TPOT runs recommend different pipelines, this means either that the runs didn't converge due to lack of time, or that multiple pipelines perform more or less the same on your dataset.

    This is actually an advantage over fixed grid search techniques: TPOT is meant to be an assistant that gives you ideas on how to solve a particular machine learning problem by exploring pipeline configurations that you might never have considered, then leaves the fine-tuning to more constrained parameter tuning techniques such as grid search.

  • AutoML algorithms can take a long time to complete their search

    AutoML algorithms aren't as simple as fitting one model on the dataset; they consider multiple machine learning algorithms (random forests, linear models, SVMs, etc.) in a pipeline with multiple preprocessing steps (missing value imputation, scaling, PCA, feature selection, etc.), the hyperparameters for all of the models and preprocessing steps, as well as multiple ways to ensemble or stack the algorithms within the pipeline.

    As a result, running TPOT on larger datasets can take quite a while, and it's important to understand why. With the default TPOT settings (100 generations with a population size of 100), TPOT will evaluate 10,000 pipeline configurations before finishing. To put this number into perspective, think of a grid search over 10,000 hyperparameter combinations for a machine learning algorithm and how long that grid search would take. With 10-fold cross-validation, that's 10,000 model configurations to evaluate, which means roughly 100,000 models are fitted and evaluated on the training data during a single grid search. Even for simple models like decision trees, that process takes a lot of time.

TPOT with code

The TPOT interface is designed to be as similar as possible to scikit-learn.

TPOT can be imported just like any regular Python module.

Next, create an instance of TPOT.

Apart from the class name, a TPOTRegressor is used the same way as a TPOTClassifier. You can read more about the TPOTClassifier and TPOTRegressor classes in the API documentation.

For example, the search can be customized through the TPOTClassifier constructor arguments.

TPOT is now ready to optimize a pipeline. You can tell TPOT to optimize a pipeline based on a dataset with the fit() method.
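Continuing the sketch above (X_train and y_train stand for your training features and labels):

```python
# Search for a good pipeline on the training data; this can take a while
pipeline_optimizer.fit(X_train, y_train)
```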

The fit() function initializes the genetic programming algorithm to find the highest-scoring pipeline based on average k-fold cross-validation. The best pipeline is then trained on the entire set of provided samples, and the TPOT instance can be used as a fitted model.

Then evaluate the final pipeline on the testing set with the score() method.
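Continuing the sketch (X_test and y_test stand for the held-out data):

```python
# Evaluate the best pipeline found on the held-out test set
print(pipeline_optimizer.score(X_test, y_test))
```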

Finally, you can tell TPOT to export the corresponding Python code for the optimized pipeline to a text file with the export() method.
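Continuing the sketch; the output file name is illustrative:

```python
# Write the Python code for the best pipeline to a file
pipeline_optimizer.export('tpot_exported_pipeline.py')
```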

Putting it all together, a single script can use TPOT to optimize a pipeline, score it, and export the best pipeline to a file.
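A sketch of such a script, using scikit-learn's built-in digits dataset; the parameter values are illustrative, and the search itself can take several minutes:

```python
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Load a small built-in dataset and split it into train and test sets
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, train_size=0.75, test_size=0.25, random_state=42)

# Run a short genetic search over candidate pipelines
pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,
                                    random_state=42, verbosity=2)
pipeline_optimizer.fit(X_train, y_train)

# Score the best pipeline and export its Python code
print(pipeline_optimizer.score(X_test, y_test))
pipeline_optimizer.export('tpot_exported_pipeline.py')
```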

TPOT commands

To use TPOT from the command line, run the tpot command with a path to the data file.
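A sketch of a command-line invocation (the data path is illustrative; the flags mirror the constructor arguments, e.g. -g for generations and -p for population size):

```shell
tpot data/mnist.csv -is , -target class -o tpot_exported_pipeline.py -g 5 -p 20 -cv 5 -s 42 -v 2
```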

For a brief explanation of each argument, run tpot with the --help flag.

Classification

class tpot.TPOTClassifier(generations=100, population_size=100,
                          offspring_size=None, mutation_rate=0.9,
                          crossover_rate=0.1, scoring='accuracy', cv=5,
                          subsample=1.0, n_jobs=1,
                          max_time_mins=None, max_eval_time_mins=5,
                          random_state=None, config_dict=None,
                          template=None, warm_start=False, memory=None,
                          use_dask=False, periodic_checkpoint_folder=None,
                          early_stop=None, verbosity=0,
                          disable_update_check=False, log_file=None)

Automated machine learning for supervised classification tasks.

The TPOTClassifier does an intelligent search across Machine Learning pipelines, which may include supervised classification models, preprocessors, feature selection strategies, and other estimators or transformers that adhere to the scikit-learn API. All of the objects in the pipeline will have their hyperparameters searched by the TPOTClassifier.

However, the config_dict parameter allows for complete customization of the algorithms, transformers, and hyperparameters that the TPOTClassifier examines.

