
H2O – A New Experience in AutoML

  • June 23, 2023

Meet the Author : Mr. Bharani Kumar

Bharani Kumar Depuru is a well-known IT personality from Hyderabad. He is the Founder and Director of AiSPRY and 360DigiTMG. Bharani Kumar is an IIT and ISB alumnus with more than 17 years of experience, and he has held prominent positions at leading IT firms such as HSBC, ITC Infotech, Infosys, and Deloitte. He is a leading IT consultant specializing in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence and Business Management. Bharani Kumar is also the chief trainer at 360DigiTMG, with more than ten years of experience, and has been making the IT transition journey easy for his students. 360DigiTMG is at the forefront of delivering quality education, thereby bridging the gap between academia and industry.


Everyone wants a piece of machine learning, the latest trend in technology. The difficulty for many people is that mastering machine learning takes a great deal of work and years of focused study. This is where AutoML can help. As the name implies, AutoML automates much of the ML process, which lets people apply ML techniques without the level of technical expertise of a skilled data scientist. However, AutoML is not the be-all and end-all of machine learning: users still need to be actively involved in steps such as data preparation, feature selection, and output processing. The debate about AutoML may go on forever, but let's concentrate on the subject of this blog, H2O.

H2O is currently one of the most popular AutoML frameworks in the Data Science world. The user can choose a time limit for the model training procedure, and within that time frame H2O automates the model training and tuning processes. Additionally, H2O has a feature that lets users generate explanations both for the group of models being used and for individual models. Using this explanation feature is straightforward because it takes just a single function call.
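As a rough illustration of the time limit and the explanation call described above, the sketch below assumes H2O is already installed and running and that `train` is an H2OFrame. The helper names `predictor_columns`, `run_automl` and `explain_automl` are made up for this example, and the 60-second budget is an arbitrary choice.

```python
def predictor_columns(columns, response):
    """Return every column name except the response column."""
    return [name for name in columns if name != response]


def run_automl(train, response, time_budget_secs=60):
    """Let H2O AutoML train and tune models until the time budget runs out."""
    # Imported inside the function so the helpers can be defined
    # even on a machine where h2o is not installed yet.
    from h2o.automl import H2OAutoML

    aml = H2OAutoML(max_runtime_secs=time_budget_secs, seed=1)
    aml.train(x=predictor_columns(train.columns, response),
              y=response,
              training_frame=train)
    print(aml.leaderboard.head())  # models ranked by the default metric
    return aml


def explain_automl(aml, frame):
    """Produce explanations for the group of models and for the best one."""
    group_explanation = aml.explain(frame)          # all AutoML models
    leader_explanation = aml.leader.explain(frame)  # single leader model
    return group_explanation, leader_explanation
```

With a running cluster, `aml = run_automl(train, "label")` followed by `explain_automl(aml, test)` would train for roughly a minute and then render the explanation plots, which need matplotlib to be installed.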

Installation:

The steps to install H2O in a Python environment are as follows:-

Run the commands in a kernel or the command terminal:

  • The first step is to install the obligatory dependencies with pip. These are the requirements H2O needs in order to work; the other, optional requirements are listed in the H2O documentation.
  • The library might already be installed on the system, so the next step is to uninstall that existing instance
  • To install the library, the following command is executed:
  • pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o

In a scenario where the user prefers Anaconda for installing packages, the following command can be used instead:

  • conda install -c h2oai h2o=3.30.0.6
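Once either install command has finished, a quick check from Python can confirm that the package is visible to the interpreter. This is only a convenience sketch that uses the standard library; the helper name `installed_h2o_version` is made up for the example.

```python
from importlib import metadata


def installed_h2o_version():
    """Return the installed h2o package version, or None if it is missing."""
    try:
        return metadata.version("h2o")
    except metadata.PackageNotFoundError:
        return None


version = installed_h2o_version()
print("h2o version:", version if version else "not installed")
```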

Initializing H2O:

Here is a quick look at how to start working with H2O from within Python itself. The same steps apply when the library is run from Anaconda.
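A minimal sketch of starting a session is shown here. It assumes the h2o package and a Java runtime are available on the machine; the wrapper names `start_h2o` and `stop_h2o` and the 2 GB memory cap are illustrative choices, not requirements.

```python
def start_h2o(max_mem_size="2G", nthreads=-1):
    """Start a local H2O cluster, or connect to one that is already running."""
    import h2o  # needs the h2o package and a Java runtime on the machine

    # nthreads=-1 lets H2O use every available CPU core.
    h2o.init(nthreads=nthreads, max_mem_size=max_mem_size)
    h2o.cluster().show_status()  # print version, node count, and memory
    return h2o


def stop_h2o(h2o_module):
    """Shut the local cluster down once the work is finished."""
    h2o_module.cluster().shutdown()
```

Calling `h2o = start_h2o()` at the top of a notebook and `stop_h2o(h2o)` at the end keeps the cluster lifetime explicit.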


Getting Data into H2O:

The next step is to get data into the H2O cluster that is running on the local machine. It is very important that the data being uploaded is in a format the H2O cluster can parse. Currently H2O supports the following file formats:-

  • CSV (delimited, UTF-8 only)
  • ORC
  • SVMLight
  • XLS (BIFF 8 only)
  • XLSX (BIFF 8 only)
  • Avro version 1.8.0 (without multiple parsing or column type modifications)
  • Parquet

H2O can also read data from the following sources at the moment:-

  • Local file system
  • Remote files
  • S3 bucket
  • HDFS
  • JDBC
  • Hive

Additional sources can be accessed through generic HDFS-compatible APIs, such as OpenStack Swift.
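To make the import step concrete, the sketch below writes a tiny UTF-8 CSV file with the standard library and then hands it to H2O. The file name, the column names and the helper names are invented for the example. `h2o.import_file` reads a path that the cluster can see, while `h2o.upload_file` pushes a file from the Python client to the cluster.

```python
import csv
import os
import tempfile


def write_sample_csv(path):
    """Write a small delimited, UTF-8 encoded file that H2O can parse."""
    rows = [["height", "weight", "label"],
            [1.70, 65, "a"],
            [1.82, 80, "b"],
            [1.65, 58, "a"]]
    with open(path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
    return path


def load_into_h2o(path):
    """Import the file into a running H2O cluster and show the column types."""
    import h2o

    h2o.init()
    frame = h2o.import_file(path)  # also accepts remote URLs, S3 and HDFS paths
    print(frame.types)             # column name mapped to the parsed type
    return frame


sample_path = write_sample_csv(os.path.join(tempfile.gettempdir(), "h2o_sample.csv"))
```

With a cluster available, `load_into_h2o(sample_path)` returns an H2OFrame that can be passed to any of the training calls.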

Data Preprocessing:

H2O includes numerous preprocessing features, such as automatic imputation, normalisation, and one-hot encoding for XGBoost models. Tree-based models in H2O, such as Random Forests, can handle categorical data natively through group splitting. Text-based encoding is also supported through Word2Vec. The H2O AutoML protocol also includes feature selection and feature reduction.

Algorithms:

H2O also supports many Machine Learning algorithms, including both supervised and unsupervised learning models. Before getting to those, one must understand what types of data H2O supports, because different models work with different data types. H2O supports both numeric and categorical data types. It also supports textual data through Word2Vec.

H2O allows for early stopping while training models. To limit how long a supervised learning model may train, the following option is used:

  • max_runtime_secs (0 by default, which disables the limit)

In the scenario where max_runtime_secs is exceeded before the model building is complete, the build would fail. This option can also be used when applying grid search, in which context it specifies the maximum time to search the grid.
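The two uses of max_runtime_secs can be sketched as follows, assuming `train` is an H2OFrame on a running cluster. The choice of GBM, the hyperparameter values, the time limits and the helper names are all assumptions made for illustration.

```python
def timed_search_criteria(max_runtime_secs):
    """Search criteria that cap the whole grid search at max_runtime_secs."""
    return {"strategy": "RandomDiscrete",
            "max_runtime_secs": max_runtime_secs,
            "seed": 1234}


def train_with_time_limit(train, predictors, response, max_runtime_secs=30):
    """Train one GBM that is allowed at most max_runtime_secs of training."""
    from h2o.estimators import H2OGradientBoostingEstimator

    model = H2OGradientBoostingEstimator(ntrees=500,
                                         max_runtime_secs=max_runtime_secs,
                                         seed=1234)
    model.train(x=predictors, y=response, training_frame=train)
    return model


def grid_search_with_time_limit(train, predictors, response, max_runtime_secs=60):
    """Search a small GBM grid for at most max_runtime_secs in total."""
    from h2o.estimators import H2OGradientBoostingEstimator
    from h2o.grid.grid_search import H2OGridSearch

    hyper_params = {"max_depth": [3, 5, 7], "learn_rate": [0.05, 0.1]}
    grid = H2OGridSearch(model=H2OGradientBoostingEstimator,
                         hyper_params=hyper_params,
                         search_criteria=timed_search_criteria(max_runtime_secs))
    grid.train(x=predictors, y=response, training_frame=train)
    return grid
```

In the first function the limit applies to a single model, whereas in the second it is passed through the search criteria and applies to the grid as a whole.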

In supervised learning, these models are supported in the H2O framework:-

  • AutoML
  • Cox Proportional Hazards
  • Deep Learning (i.e. neural networks)
  • Distributed Random Forest
  • Generalised Linear Models
  • ModelSelection
  • Generalised Additive Models
  • ANOVA GLM
  • Gradient Boosting Machine (GBM)
  • Naïve Bayes
  • RuleFit
  • Stacked Ensembles
  • Support Vector Machines
  • Distributed Uplift Random Forest
  • XGBoost

In unsupervised learning, it supports the following models:-

  • Aggregator
  • Generalised Low Rank Models
  • Isolation Forest
  • Extended Isolation Forest
  • K-means clustering
  • Principal Component Analysis

It also supports other tools such as:-

  • Target Encoding
  • TF-IDF
  • Word2Vec
  • Permutation Variable Importance

Training of Models:

As mentioned previously, H2O facilitates both supervised and unsupervised learning. In supervised learning it supports both classification and regression. The result of classification is categorical and can be binary or multiclass. The result of regression is a numerical prediction. H2O decides whether a model is classification or regression from the type of the response column, as follows:-

  • If the model is a classification model, the output column needs to be encoded as categorical
  • In the scenario where the response is numeric but the problem is one of classification, the numeric column needs to be converted to categorical (a factor)

This ensures that the model is trained as the intended type and at maximum efficiency.

Classification training:


In the classification example, the input and output columns that were numeric by default are converted to factors. Making this conversion allows the classification technique to be used.
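A minimal sketch of such a classification run is given here. The dataset is synthetic and generated in Python so that the example needs no download; the column names, the labelling rule, the choice of GBM and the helper names are all assumptions made for illustration.

```python
import random


def make_car_data(n_rows=200, seed=42):
    """Build a small synthetic table: two numeric inputs and a 0/1 label."""
    rng = random.Random(seed)
    data = {"cylinders": [], "weight": [], "economical": []}
    for _ in range(n_rows):
        cylinders = rng.choice([4, 6, 8])
        weight = round(rng.gauss(1000 + 250 * cylinders, 150), 1)
        data["cylinders"].append(cylinders)
        data["weight"].append(weight)
        data["economical"].append(1 if weight < 2400 else 0)
    return data


def train_classifier(data):
    """Convert numeric columns to factors and train a binary classifier."""
    import h2o
    from h2o.estimators import H2OGradientBoostingEstimator

    h2o.init()
    cars = h2o.H2OFrame(data)
    # Both columns hold integers, so H2O parses them as numeric by default.
    cars["cylinders"] = cars["cylinders"].asfactor()    # input column
    cars["economical"] = cars["economical"].asfactor()  # output column
    train, valid = cars.split_frame(ratios=[0.8], seed=1234)

    model = H2OGradientBoostingEstimator(seed=1234)
    model.train(x=["cylinders", "weight"], y="economical",
                training_frame=train, validation_frame=valid)
    print(model.model_performance(valid).auc())
    return model
```

Because the response column is a factor, `train_classifier(make_car_data())` builds a classifier; without the `asfactor()` call on the response, the same code would fit a regression to the 0/1 values.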

Regression Model:


The regression example uses the Boston house price dataset. It checks whether the chosen column is numeric, and that column is then set as the response column.
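A sketch of such a regression run could look like the following. The dataset URL, the response column name `medv` and the choice of a generalised linear model are assumptions that are not stated in this post, so they should be checked before use.

```python
# Assumed location of a public copy of the Boston housing data.
BOSTON_URL = ("https://s3.amazonaws.com/h2o-public-test-data/"
              "smalldata/gbm_test/BostonHousing.csv")


def train_regressor(path=BOSTON_URL, response="medv"):
    """Check that the response is numeric, then fit a regression model."""
    import h2o
    from h2o.estimators import H2OGeneralizedLinearEstimator

    h2o.init()
    boston = h2o.import_file(path)

    # isnumeric() returns one True/False value per column of the frame.
    if not boston[response].isnumeric()[0]:
        raise ValueError(response + " must be numeric for regression")

    predictors = [name for name in boston.columns if name != response]
    train, test = boston.split_frame(ratios=[0.8], seed=1234)

    model = H2OGeneralizedLinearEstimator(family="gaussian")
    model.train(x=predictors, y=response, training_frame=train)
    print(model.model_performance(test).rmse())
    return model
```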

Unsupervised Training:

For unsupervised learning, K-means clustering is implemented on the Iris dataset.
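The clustering run can be sketched as below. The dataset URL and the choice of three clusters are assumptions; the numeric columns are picked by type so that the sketch does not depend on particular column names.

```python
# Assumed location of a public copy of the Iris data with a header row.
IRIS_URL = ("http://h2o-public-test-data.s3.amazonaws.com/"
            "smalldata/iris/iris_wheader.csv")


def cluster_iris(path=IRIS_URL, n_clusters=3):
    """Run K-means on the numeric Iris columns and return cluster labels."""
    import h2o
    from h2o.estimators import H2OKMeansEstimator

    h2o.init()
    iris = h2o.import_file(path)

    # Keep only the numeric measurement columns; the species label is dropped.
    predictors = [name for name, kind in iris.types.items()
                  if kind in ("int", "real")]

    model = H2OKMeansEstimator(k=n_clusters, standardize=True, seed=1234)
    model.train(x=predictors, training_frame=iris)  # no response column
    print(model.centers())                          # one centre per cluster
    return model.predict(iris)                      # cluster index per row
```

Unlike the supervised examples, `train` receives no `y` argument, since there is no response column to predict.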


Admissible ML models:

H2O includes additional tools that work towards the aim of admissible ML models. These models are efficient, fair (with reduced discrimination towards minority groups), and more interpretable. There are 2 methods within H2O for admissibility:-

  • Infogram:- This is an information diagram, a newly added graphical method for exploring features
  • L-feature:- This helps mitigate unfairness that might exist within H2O models. It can also help in identifying hidden problematic proxy features in a dataset.

Saving, loading, downloading and uploading models

Binary models can be saved from Python in H2O using the h2o.save_model() call. These binary models are not compatible across versions, so after a version update the models have to be rebuilt with the new version. To save models for production environments, the MOJO or POJO format is used. A POJO is saved as plain Java code and a MOJO as a compact serialized archive, and neither depends on an H2O cluster to run.

MOJO

MOJO export and import allow H2O models to be used outside the cluster that trained them. The table below shows which models can currently be exported and imported as MOJOs within H2O:

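| Model Name | Exportable | Importable |
| --- | --- | --- |
| AutoML | Yes | Yes |
| GAM | Yes | No |
| GBM | Yes | Yes |
| GLM | Yes | Yes |
| MAXR | No | No |
| XGBoost | Yes | Yes |
| DRF | Yes | Yes |
| Deep Learning | Yes | Yes |
| Stacked Ensemble | Yes | Yes |
| CoxPH | Yes | Yes |
| RuleFit | Yes | Yes |
| Naïve Bayes Classifier | No | No |
| SVM | No | No |
| Kmeans | Yes | No |
| Isolation Forest | Yes | Yes |
| Extended Isolation Forest | Yes | Yes |
| Aggregator | No | No |
| GLRM | Yes | No |
| PCA | Yes | No |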

A model can be downloaded as a MOJO, and any model type that the table marks as importable can be uploaded back into H2O.
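The save and MOJO steps can be sketched as follows, assuming `model` is a trained H2O model on a running cluster. The directory name and the helper names are invented for the example.

```python
import os


def save_binary_model(model, directory="./models"):
    """Save a binary model and load it back into the same H2O version."""
    import h2o

    saved_path = h2o.save_model(model=model, path=directory, force=True)
    return h2o.load_model(saved_path), saved_path


def round_trip_mojo(model, directory="./models"):
    """Download a model as a MOJO, then upload the MOJO back into H2O."""
    import h2o

    os.makedirs(directory, exist_ok=True)
    mojo_path = model.download_mojo(path=directory, get_genmodel_jar=False)
    # upload_mojo sends the local file to the cluster; this only works for
    # model types listed as importable in the table above.
    return h2o.upload_mojo(mojo_path), mojo_path
```

The binary route is tied to the H2O version that produced the file, so the MOJO route is the one to prefer when a model needs to outlive a version upgrade.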

Downloading logs

H2O allows the user to download the logs of any work that has been done within the cluster. If there are no existing jobs, the following steps can be followed to find the logs:-

  • In the terminal, change to the following directory, where username is the name of the user running H2O: cd /tmp/h2o-username/h2ologs
  • The HTTPD file contains the REST API logs, while the rest of the files are stored in this format: h2o__--.log

To view the logs when the cluster is active:-

  • Open H2O web UI
  • Admin > Inspect Log or go to: http://localhost:54321/LogView.html.
  • The Download Logs button can be used to download any available logs

In order to download the logs in python, the following command can be used:

h2o.download_all_logs(dirname='./CWD/', filename='autoh20_log.zip')

With this, the discussion of H2O comes to an end. The potential of this AutoML framework is huge, and it deserves to be explored to a greater extent.
