

TPOT an Auto-ML Library

  • June 23, 2023

Meet the Author: Mr. Bharani Kumar

Bharani Kumar Depuru is a well-known IT personality from Hyderabad. He is the Founder and Director of Innodatatics Pvt Ltd and 360DigiTMG. An IIT and ISB alumnus with more than 17 years of experience, he has held prominent positions at IT leaders such as HSBC, ITC Infotech, Infosys, and Deloitte. He is a prominent IT consultant specializing in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence, and Business Management. Bharani Kumar is also the chief trainer at 360DigiTMG, with more than ten years of training experience, and has been making the IT transition journey easy for his students. 360DigiTMG is at the forefront of delivering quality education, bridging the gap between academia and industry.


Introduction

TPOT is a Python Automated Machine Learning tool that optimises machine learning pipelines using genetic programming.

TPOT automates the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data. Once TPOT has finished searching (or you grow tired of waiting), it gives you the Python code for the best pipeline it found, so you can tinker with the pipeline from there.

[Image: an example machine learning pipeline]

Installation

TPOT is built on top of several Python libraries, including:

  • NumPy
  • SciPy
  • scikit-learn
  • joblib
  • xgboost
  • DEAP
  • update_checker
  • tqdm
  • stopit
  • pandas

Most of the required Python packages come pre-installed with the Anaconda Python distribution.

We can also install TPOT using pip or conda-forge.


Using pip

NumPy, SciPy, scikit-learn, pandas, joblib, and PyTorch can be installed through Anaconda.
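A minimal sketch of the conda command for these packages (assuming the Anaconda distribution is already installed; package names are as published on conda's channels):

```shell
conda install numpy scipy scikit-learn pandas joblib pytorch
```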

DEAP, update_checker, tqdm, stopit, and xgboost can be installed with pip.
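A sketch of the pip command for these remaining dependencies (package names as published on PyPI):

```shell
pip install deap update_checker tqdm stopit xgboost
```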

Windows users: pip installation may not work in some Windows environments and can cause unexpected errors. If you have issues installing XGBoost, check the XGBoost installation documentation.

If you intend to use Dask for parallel training, make sure to install dask[delayed], dask[dataframe], and dask-ml. Note that dask-ml>=1.7 requires distributed>=2.4.0 and scikit-learn>=0.23.0.
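A sketch of the corresponding pip command (the extras syntax below assumes a recent Dask release; the quotes keep the shell from expanding the brackets):

```shell
pip install "dask[delayed]" "dask[dataframe]" dask-ml
```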

If you plan to use the TPOT-MDR configuration, make sure to install scikit-mdr and scikit-rebate.
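A sketch of the pip command (note that scikit-rebate is published on PyPI as skrebate):

```shell
pip install scikit-mdr skrebate
```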

To enable support for PyTorch-based neural networks (TPOT-NN), you will need to install PyTorch. TPOT-NN works with either the CPU or GPU version of PyTorch, but we strongly recommend using a GPU version if possible, as CPU PyTorch models tend to train very slowly.

We recommend following PyTorch's installation instructions customized for your operating system and Python distribution.

Finally, install TPOT itself with pip.
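The pip command is simply:

```shell
pip install tpot
```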

conda-forge

Install TPOT and its core dependencies from conda-forge.
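A sketch of the conda-forge command:

```shell
conda install -c conda-forge tpot
```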

Additional dependencies (such as xgboost, dask, and the TPOT-MDR packages) can also be installed from conda-forge.
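A sketch, assuming you want the Dask, TPOT-MDR, and xgboost extras as well:

```shell
conda install -c conda-forge tpot xgboost dask dask-ml scikit-mdr skrebate
```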

Using the TPOT-cuML configuration

This configuration requires an NVIDIA GPU with Pascal architecture or better (compute capability 6.0+) and the cuML library installed. With this configuration, all model training and prediction are GPU-accelerated. This configuration is especially useful for medium-sized and larger datasets on which CPU-based estimators are a common bottleneck, and it works for both the TPOTClassifier and TPOTRegressor.

To install TPOT with the TPOT-cuML configuration, first download the conda environment yml file provided for it in the TPOT repository, then create an environment from it.
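A sketch of creating the environment; the file name tpot-cuml.yml is an assumption based on the TPOT repository and may differ from what you download:

```shell
# Create and activate a conda environment from the downloaded yml file
conda env create -f tpot-cuml.yml -n tpot-cuml
conda activate tpot-cuml
```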
Applications

  • AutoML algorithms aren't intended to run for only a few minutes

    Of course, TPOT can find a reasonably good pipeline for your dataset within a few minutes of running time. However, to find the best pipeline for your dataset, TPOT needs to run for a sufficient amount of time. In some cases it won't find any suitable pipeline at all, in which case a RuntimeError ("A pipeline has not yet been optimized. Please call fit() first.") is raised. To let TPOT fully explore the pipeline space for your dataset, it is helpful to run several instances of TPOT in parallel for a long period of time.

  • AutoML algorithms can recommend different solutions for the same dataset

    If you're working with a reasonably complex dataset or run TPOT for a short amount of time, different TPOT runs may produce different pipeline recommendations. TPOT's optimization algorithm is stochastic in nature, which means that it uses randomness (in part) to search the space of possible pipelines. When two TPOT runs recommend different pipelines, this means either that the runs didn't converge due to lack of time, or that multiple pipelines perform more or less the same on your dataset.

    This is actually an advantage over fixed grid search techniques: TPOT is meant to be an assistant that gives you ideas on how to solve a particular machine learning problem by exploring pipeline configurations that you might never have considered, then leaves the fine-tuning to more constrained parameter tuning techniques such as grid search.

  • AutoML algorithms can take a long time to complete their search

    AutoML algorithms aren't as simple as fitting one model on the dataset; they consider multiple machine learning algorithms (random forests, linear models, SVMs, etc.) in a pipeline with multiple preprocessing steps (missing value imputation, scaling, PCA, feature selection, etc.), the hyperparameters for all of the models and preprocessing steps, as well as multiple ways to ensemble or stack the algorithms within the pipeline.

    As a result, running TPOT on larger datasets can take quite a while, and it's important to understand why. With the default TPOT settings (100 generations with a population size of 100), TPOT will evaluate 10,000 pipeline configurations before finishing. To put this number into perspective, think of a grid search over 10,000 hyperparameter combinations for a machine learning algorithm and how long that grid search would take. With 10-fold cross-validation, that's 10,000 model configurations to evaluate, which means roughly 100,000 models are fitted and evaluated on the training data during a single grid search. Even for simple models like decision trees, that process takes a lot of time.

TPOT with code

The TPOT interface is designed to be as similar as possible to scikit-learn.

TPOT can be imported just like any regular Python module.

Next, create an instance of TPOT.

Apart from the class name, a TPOTRegressor is used the same way as a TPOTClassifier. You can read more about the TPOTClassifier and TPOTRegressor classes in the API documentation.

For example, the search can be customized through the TPOTClassifier constructor arguments.

TPOT is now ready to optimize a pipeline. You can tell TPOT to optimize a pipeline based on a dataset with the fit() method.
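Continuing the sketch above (X_train and y_train stand for your training features and labels):

```python
# Search for a good pipeline on the training data; this can take a while
pipeline_optimizer.fit(X_train, y_train)
```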

The fit() function initializes the genetic programming algorithm to find the highest-scoring pipeline based on average k-fold cross-validation. The best pipeline is then trained on the entire set of provided samples, and the TPOT instance can be used as a fitted model.

Then evaluate the final pipeline on the testing set with the score() method.
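Continuing the sketch (X_test and y_test stand for the held-out data):

```python
# Evaluate the best pipeline found on the held-out test set
print(pipeline_optimizer.score(X_test, y_test))
```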

Finally, you can tell TPOT to export the corresponding Python code for the optimized pipeline to a text file with the export() method.
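Continuing the sketch; the output file name is illustrative:

```python
# Write the Python code for the best pipeline to a file
pipeline_optimizer.export('tpot_exported_pipeline.py')
```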

Putting it all together, a single script can use TPOT to optimize a pipeline, score it, and export the best pipeline to a file.
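A sketch of such a script, using scikit-learn's built-in digits dataset; the parameter values are illustrative, and the search itself can take several minutes:

```python
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Load a small built-in dataset and split it into train and test sets
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, train_size=0.75, test_size=0.25, random_state=42)

# Run a short genetic search over candidate pipelines
pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,
                                    random_state=42, verbosity=2)
pipeline_optimizer.fit(X_train, y_train)

# Score the best pipeline and export its Python code
print(pipeline_optimizer.score(X_test, y_test))
pipeline_optimizer.export('tpot_exported_pipeline.py')
```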

TPOT commands

To use TPOT from the command line, run the tpot command with a path to the data file.
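A sketch of a command-line invocation (the data path is illustrative; the flags mirror the constructor arguments, e.g. -g for generations and -p for population size):

```shell
tpot data/mnist.csv -is , -target class -o tpot_exported_pipeline.py -g 5 -p 20 -cv 5 -s 42 -v 2
```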

For a brief explanation of each argument, run tpot with the --help flag.

Classification

class tpot.TPOTClassifier(generations=100, population_size=100,
                          offspring_size=None, mutation_rate=0.9,
                          crossover_rate=0.1, scoring='accuracy', cv=5,
                          subsample=1.0, n_jobs=1,
                          max_time_mins=None, max_eval_time_mins=5,
                          random_state=None, config_dict=None,
                          template=None, warm_start=False, memory=None,
                          use_dask=False, periodic_checkpoint_folder=None,
                          early_stop=None, verbosity=0,
                          disable_update_check=False, log_file=None)

Automated machine learning for supervised classification tasks.

The TPOTClassifier does an intelligent search across Machine Learning pipelines, which may include supervised classification models, preprocessors, feature selection strategies, and other estimators or transformers that adhere to the scikit-learn API. All of the objects in the pipeline will have their hyperparameters searched by the TPOTClassifier.

However, the config_dict parameter allows for complete customization of the algorithms, transformers, and hyperparameters that the TPOTClassifier examines.

