H2O – A New Experience in AutoML
Everyone wants a piece of the latest trend known as machine learning. The difficulty for many people is that learning machine learning demands years of concentrated effort. AutoML can help in this situation: as the name implies, it automates large parts of the ML workflow, making the technique accessible to people without the technical expertise of a skilled data scientist. However, AutoML is not the be-all and end-all of machine learning; users still need to be actively involved in the early stages, such as data preparation, feature selection, and output processing. The debate about AutoML could go on forever, but let's concentrate on the subject of this blog, H2O.
H2O is currently one of the most popular AutoML frameworks in the data science world. The user can choose a time budget for the model training procedure, and within that time frame H2O automates the model training and tuning process. Additionally, H2O provides a feature that generates explanations, both for the group of models trained and for individual models; producing an explanation is straightforward, requiring only a single function call.
The steps to install H2O in a Python environment are as follows. Run the commands below in a kernel or the command terminal; these are the obligatory requirements for H2O to work. If you are interested in the other requirements, the H2O download page covers them in full.
- The first step is to install the obligatory dependencies:
- pip install requests tabulate
- There might be a scenario where the library is already installed on the system, so run the next command to remove that instance:
- pip uninstall h2o
- To install the library, the following command is executed:
- pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o
In a scenario where the user prefers Anaconda for installation of the packages, they can use the following command instead:
- conda install -c h2oai h2o
Here is a quick peek at how one can start working with H2O within Python itself. To run the library from an Anaconda installation, the same steps apply.
Getting Data into H2O:
The next step is to get data into the H2O cluster running on the local machine. It is very important that the data being uploaded is in a format the cluster can parse. Currently, H2O supports the following file formats:
- CSV (delimited, UTF-8 only)
- XLS (BIFF 8 only)
- XLSX (BIFF 8 only)
- Avro version 1.8.0 (without multifile parsing or column type modifications)
H2O can currently ingest data from a few sources, these being:
- Local file system
- Remote files
- S3 bucket
Additional sources, such as OpenStack Swift, can be accessed through generic HDFS APIs.
Numerous preprocessing features are included in H2O, such as automatic imputation, normalisation, and one-hot encoding for XGBoost models. Categorical data can be handled natively using group splitting, which is supported by tree-based models such as Random Forests in H2O. Text-based encoding is also supported through Word2Vec. The H2O AutoML protocol additionally includes feature selection and feature reduction.
H2O also supports a wide range of machine learning algorithms, covering both supervised and unsupervised learning. Before getting to those, one must understand which data types H2O supports, because different models expect different data types. H2O supports both numeric and categorical data, and it handles textual data through Word2Vec.
H2O allows for early stopping while training models. To stop a supervised learning model, the following parameter is used:
- max_runtime_secs (0 by default, meaning no limit)
If max_runtime_secs is exceeded before the model building is complete, the build will fail. This parameter can also be used when applying grid search, in which context it specifies the maximum time to search the grid.
In supervised learning, these models are supported in the H2O framework:-
- Cox Proportional Hazards
- Deep Learning (i.e. neural networks)
- Distributed Random Forest
- Generalised Linear Models
- Generalised Additive Models
- ANOVA GLM
- Gradient Boosting Machine
- Naïve Bayes
- Stacked Ensembles
- Support Vector Machines
- Distributed Uplift Random Forest
In unsupervised learning, it supports the following models:-
- Generalised Low Rank Models
- Isolation Forest
- Extended Isolation Forest
- K-means clustering
- Principal Component Analysis
It also supports other utilities, such as:
- Target Encoding
- Permutation Variable Importance
Training of Models:
H2O facilitates both supervised and unsupervised learning, as was previously mentioned. It supports both classification and regression approaches in supervised learning. The result of classification can be binary or multiclass and usually takes the form of categorical data; the result of regression is a numerical prediction. The procedure for telling H2O whether a model is classification or regression is as follows:
- If the model is a classification model, the output column needs to be encoded as categorical
- If the response is numeric but the problem should be treated as classification, the numeric column needs to be converted accordingly
This is done to ensure that the model is trained at maximum efficiency. For instance, input and output columns that are numeric by default can be converted to factors; making this conversion allows the classification technique to be used.
For regression, consider the Boston house price dataset: after checking that the target column is numeric, that column is set as the response column.
For unsupervised learning, K-means clustering can be implemented on the Iris dataset.
Cartesian grid search and random grid search are the two types of grid search available in H2O. In a Cartesian grid search, the user defines the candidate values and H2O trains one model for every possible combination. For instance, if there are 4 hyperparameters with 5, 10, 4, and 2 candidate values respectively, the grid will comprise 5 × 10 × 4 × 2 = 400 models.
In random search, the hyperparameter values are also defined by the user, but H2O samples uniformly from the set of all combinations. Users can additionally specify stopping criteria for a random grid search: a maximum number of models to train, a time limit for the search, or a performance-metric-based criterion that stops the search once improvement in a chosen metric falls below a threshold.
Random search performed on the same dataset
Checkpoint creation for Grid search models
Once the search is completed, the models stored in the H2O cluster can be accessed by their model IDs and sorted by any particular performance metric, such as RMSE.
Admissible ML models:
In H2O, there are additional tools that work towards the aim of admissible ML models: models that are efficient, fair (with reduced discrimination towards minority groups), and more interpretable. There are two methods within H2O for admissibility:
- Infogram: an information diagram, a newly added graphical feature-exploration method
- L-features: these help mitigate unfairness that might exist within H2O models, and can also help identify hidden problematic proxy features in a dataset
Saving, loading, downloading and uploading models
It is possible to save binary models in Python using the h2o.save_model() call. These binary models are not compatible across versions, so after a version update the models would have to be regenerated. To save models for production environments, the MOJO/POJO format is used: these models are saved as plain Java code and do not depend on an H2O cluster to run.
The MOJO import functionality allows H2O models to be used outside the cluster. The table below shows which models are currently supported within H2O:
|Model|MOJO download|MOJO upload|
|---|---|---|
|Naïve Bayes Classifier|No|No|
|Extended Isolation Forest|Yes|Yes|
Any model marked as supported in the table above can be downloaded as a MOJO and uploaded back into the cluster.
H2O allows the user to download the logs of any work that has been done within the cluster. If there are no jobs currently running, the following steps can be followed to locate the logs:
- In the terminal, change to the log directory: cd /tmp/h2o-username/h2ologs
- The HTTPD log contains the REST API calls, while the remaining logs are stored in files whose names start with h2o_ and end in .log
To view the logs when the cluster is active:
- Open the H2O web UI
- Go to Admin > Inspect Log, or navigate to http://localhost:54321/LogView.html
- The Download Logs button can be used to download any logs available
To download the logs from Python, the following command can be used:
h2o.download_all_logs(dirname='./CWD/', filename='autoh2o_log.zip')
With this, the discussion of H2O comes to an end. The potential of this AutoML framework is huge, and it deserves to be explored in greater depth.