H2O – A New Experience in AutoML
Everyone wants a piece of the latest trend known as machine learning. The difficulty for many people is that learning machine learning demands years of concentrated effort. AutoML can help in this situation: as the name implies, it automates large parts of the ML workflow, making the technique accessible to people without the technical expertise of a skilled data scientist. However, AutoML is not the be-all and end-all of machine learning; users still need to be actively involved in the early stages, such as data preparation, feature selection, and output processing. The debate about AutoML could go on forever, but let's concentrate on the subject of this blog, H2O.
H2O is currently one of the most popular AutoML frameworks in the data science world. The user can choose a time budget for the model training procedure, and within that time frame H2O automates the model training and tuning process. Additionally, H2O provides a feature that generates explanations, both for the group of models trained and for individual models; producing an explanation is straightforward, requiring only a single function call.
The steps to install H2O in a Python environment are as follows. Run the commands below in a kernel or the command terminal; these are the obligatory requirements for H2O to work. If you are interested in the other requirements, the H2O download page covers them in full.
- The first step is to install the obligatory dependencies:
- pip install requests tabulate
- There might be a scenario where the library is already installed on the system, so run the next command to remove that instance:
- pip uninstall h2o
- To install the library, the following command is executed:
- pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o
In a scenario where the user prefers Anaconda for installation of the packages, they can use the following command instead:
- conda install -c h2oai h2o
Here is a quick peek at how one can start working with H2O within Python itself. To run the library from an Anaconda installation, the same steps apply.
Getting Data into H2O:
The next step is to get data into the H2O cluster running on the local machine. It is very important that the data being uploaded is in a format the cluster can parse. Currently, H2O supports the following file formats:
- CSV (delimited, UTF-8 only)
- XLS (BIFF 8 only)
- XLSX (BIFF 8 only)
- Avro version 1.8.0 (without multifile parsing or column type modifications)
H2O can currently ingest data from a few sources, these being:
- Local file system
- Remote files
- S3 bucket
Additional sources, such as OpenStack Swift, can be accessed through generic HDFS APIs.
Numerous preprocessing features are included in H2O, such as automatic imputation, normalisation, and one-hot encoding for XGBoost models. Categorical data can be handled natively using group splitting, which is supported by tree-based models such as Random Forests in H2O. Text-based encoding is also supported through Word2Vec. The H2O AutoML protocol additionally includes feature selection and feature reduction.
H2O also supports a wide range of machine learning algorithms, covering both supervised and unsupervised learning. Before getting to those, one must understand which data types H2O supports, because different models expect different data types. H2O supports both numeric and categorical data, and it handles textual data through Word2Vec.
H2O allows for early stopping while training models. To stop a supervised learning model, the following parameter is used:
- max_runtime_secs (0 by default, meaning no limit)
If max_runtime_secs is exceeded before the model building is complete, the build will fail. This parameter can also be used when applying grid search, in which context it specifies the maximum time to search the grid.
In supervised learning, these models are supported in the H2O framework:-
- Cox Proportional Hazards
- Deep Learning (i.e. neural networks)
- Distributed Random Forest
- Generalised Linear Models
- Generalised Additive Models
- ANOVA GLM
- Gradient Boosting Machine
- Naïve Bayes
- Stacked Ensembles
- Support Vector Machines
- Distributed Uplift Random Forest
In unsupervised learning, it supports the following models:-
- Generalised Low Rank Models
- Isolation Forest
- Extended Isolation Forest
- K-means clustering
- Principal Component Analysis
It also supports other utilities, such as:
- Target Encoding
- Permutation Variable Importance
Training of Models:
H2O facilitates both supervised and unsupervised learning, as was previously mentioned. It supports both classification and regression approaches in supervised learning. The result of classification can be binary or multiclass and usually takes the form of categorical data; the result of regression is a numerical prediction. The procedure for telling H2O whether a model is classification or regression is as follows:
- If the model is a classification model, the output column needs to be encoded as categorical
- If the response is numeric but the problem should be treated as classification, the numeric column needs to be converted accordingly
This is done to ensure that the model is trained at maximum efficiency. For instance, input and output columns that are numeric by default can be converted to factors; making this conversion allows the classification technique to be used.
For regression, consider the Boston house price dataset: after checking that the target column is numeric, that column is set as the response column.
For unsupervised learning, K-means clustering can be implemented on the Iris dataset.
Cartesian grid search and random grid search are the two types of grid search available in H2O. In a Cartesian grid search, the user defines the candidate values and H2O trains one model for every possible combination. For instance, if there are 4 hyperparameters with 5, 10, 4, and 2 candidate values respectively, the grid will comprise 5 × 10 × 4 × 2 = 400 models.
In random search, the hyperparameter values are also defined by the user, but H2O samples uniformly from the set of all combinations. Users can additionally specify stopping criteria for a random grid search: a maximum number of models to train, a time limit for the search, or a performance-metric-based criterion that stops the search once improvement in a chosen metric falls below a threshold.
Random search performed on the same dataset
Checkpoint creation for Grid search models
Once the search is completed, the models stored in the H2O cluster can be accessed by their model IDs and sorted by any particular performance metric, such as RMSE.
Admissible ML models:
In H2O, there are additional tools that work towards the aim of admissible ML models: models that are efficient, fair (with reduced discrimination towards minority groups), and more interpretable. There are two methods within H2O for admissibility:
- Infogram: an information diagram, a newly added graphical feature-exploration method
- L-features: these help mitigate unfairness that might exist within H2O models, and can also help identify hidden problematic proxy features in a dataset
Saving, loading, downloading and uploading models
It is possible to save binary models in Python using the h2o.save_model() call. These binary models are not compatible across versions, so after a version update the models would have to be regenerated. To save models for production environments, the MOJO/POJO format is used: these models are saved as plain Java code and do not depend on an H2O cluster to run.
The MOJO import functionality allows H2O models to be used outside the cluster. The table below shows which models are currently supported within H2O:
|Model|MOJO download|MOJO upload|
|---|---|---|
|Naïve Bayes Classifier|No|No|
|Extended Isolation Forest|Yes|Yes|
Any model marked as supported in the table above can be downloaded as a MOJO and uploaded back into the cluster.
H2O allows the user to download the logs of any work that has been done within the cluster. If there are no jobs currently running, the following steps can be followed to locate the logs:
- In the terminal, change to the log directory: cd /tmp/h2o-username/h2ologs
- The HTTPD log contains the REST API calls, while the remaining logs are stored in files whose names start with h2o_ and end in .log
To view the logs when the cluster is active:
- Open the H2O web UI
- Go to Admin > Inspect Log, or navigate to http://localhost:54321/LogView.html
- The Download Logs button can be used to download any logs available
To download the logs from Python, the following command can be used:
h2o.download_all_logs(dirname='./CWD/', filename='autoh2o_log.zip')
With this, the discussion of H2O comes to an end. The potential of this AutoML framework is huge, and it deserves to be explored in greater depth.