H2O – A New Experience in AutoML
Bharani Kumar Depuru is a well-known IT personality from Hyderabad. He is the Founder and Director of AiSPRY and 360DigiTMG. Bharani Kumar is an IIT and ISB alumnus with more than 17 years of experience, and he has held prominent positions at IT firms such as HSBC, ITC Infotech, Infosys, and Deloitte. He is an IT consultant specializing in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence and Business Management. Bharani Kumar is also the chief trainer at 360DigiTMG, with more than ten years of experience, and has been making the IT transition journey easy for his students. 360DigiTMG is at the forefront of delivering quality education, thereby bridging the gap between academia and industry.
Everyone wants a piece of the latest fad known as machine learning. But the difficulty for many individuals is that learning machine learning requires a lot more work and years of concentration. AutoML can help in this situation. As the name implies, AutoML automates the entire ML process. This makes it possible for individuals to apply the ML technique without needing the same level of technical expertise as a skilled data scientist. However, AutoML is not the end-all be-all of machine learning processes; users would still need to be actively involved in the early processes, such as data preparation, feature selection, and output processing. The debate about AutoML may go on forever, but let's concentrate on the most important part of this blog, H2O.
H2O is currently one of the most popular AutoML frameworks in the Data Science world. The user can set a time limit for the model training procedure, and within that time frame H2O automates the model training and tuning processes. Additionally, H2O can generate explanations both for the group of models being utilised and for individual models. Producing these explanations is straightforward because it takes a single function call.
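As a rough illustration of the time-boxed training and the one-call explanations described here, the sketch below wraps the workflow in a function. The function name `run_automl`, its arguments, and the 60-second budget are assumptions made for illustration; `train` and `test` are expected to be H2OFrame objects, and the explanation plots additionally need matplotlib installed.

```python
def run_automl(train, test, response, max_runtime_secs=60):
    """Train H2O AutoML within a time budget and explain the results."""
    # Deferred import: needs the h2o package, Java, and a running cluster
    from h2o.automl import H2OAutoML

    # Every column except the response is used as a predictor
    predictors = [col for col in train.columns if col != response]

    # max_runtime_secs is the user-chosen time range for training and tuning
    aml = H2OAutoML(max_runtime_secs=max_runtime_secs, seed=1)
    aml.train(x=predictors, y=response, training_frame=train)

    print(aml.leaderboard.head())  # models ranked by the default metric

    aml.explain(test)              # explanations for the group of models
    aml.leader.explain(test)       # explanations for the single best model
    return aml
```

Nothing runs until `run_automl` is called with frames that already live in an H2O cluster.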
The steps to install H2O in a Python environment are as follows:-
Run the following commands in a notebook kernel or the command terminal. They install the dependencies that H2O requires in order to work. If you are interested in learning about the other requirements, the following link covers them:
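The original commands did not survive on this page, so the lines below are a reconstruction under assumptions: the dependency names (`requests`, `tabulate`, `future`) are the ones commonly listed in the H2O installation guide and may differ between releases, while `pip install h2o` pulls the package from PyPI.

```shell
# Dependencies commonly listed for the H2O Python client (assumed; check your release)
pip install requests tabulate future

# Install the H2O Python package itself
pip install h2o
```

A Java runtime must also be available on the machine, since the H2O cluster itself runs on the JVM.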
If the user prefers Anaconda for installing packages, the following command can be used instead:
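A likely form of the Anaconda command is shown below; the `h2oai` channel name is taken from the H2O installation guide as best recalled and should be treated as an assumption.

```shell
# Install H2O from the h2oai conda channel (channel name assumed)
conda install -c h2oai h2o
```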
Here is a quick peek at how one can start working with H2O from within Python itself. The same steps apply when the library is run from Anaconda.
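A minimal sketch of starting H2O from Python could look as follows. The helper name `start_h2o` and the `"2G"` memory setting are illustrative assumptions; `h2o.init()` itself connects to a running local cluster or launches a new one.

```python
def start_h2o(max_mem_size="2G"):
    """Start (or connect to) a local H2O cluster and return its handle."""
    # Deferred import: needs the h2o package and a Java runtime
    import h2o

    # Launches a local cluster if none is already running
    h2o.init(max_mem_size=max_mem_size)

    cluster = h2o.cluster()
    cluster.show_status()  # prints version, node count, and memory details
    return cluster
```

When the work is finished, `start_h2o().shutdown()` stops the cluster again.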
The next step is to get data into the H2O cluster that is running on the local machine. It is important that the data being uploaded is in a format the cluster can read. Currently H2O supports the following file formats:
H2O also reads data from a few sources at the moment, these being:-
Additional sources can be accessed through generic HDFS APIs, such as OpenStack Swift.
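Loading a file into the cluster can be sketched as below. The helper `load_frame` and its `from_client` flag are hypothetical names; the two H2O calls differ in who reads the file: `h2o.import_file` lets the cluster pull the data from a path or URL it can reach, whereas `h2o.upload_file` pushes a file from the Python client to the cluster.

```python
def load_frame(path, from_client=False):
    """Load a data file into the H2O cluster and return an H2OFrame."""
    # Deferred import: needs the h2o package and a running cluster
    import h2o

    if from_client:
        # Push a file that only exists on the Python client's machine
        return h2o.upload_file(path)
    # Let the cluster read the file from a path or URL it can reach
    return h2o.import_file(path)
```

On a single local machine both routes give the same result, so the distinction mostly matters for remote or multi-node clusters.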
Numerous preprocessing features are included in H2O, such as automatic imputation, normalisation, and one-hot encoding for XGBoost models. Categorical data may be handled natively using group splitting, which is supported by tree-based models like Random Forests in H2O. With Word2Vec, text-based encoding is also supported. The H2O AutoML protocol also includes feature selection and feature reduction.
H2O also supports many Machine Learning algorithms, covering both supervised and unsupervised learning. Before getting to those, one must understand what types of data H2O supports, because different models expect different data types. H2O supports both numeric and categorical data, and it supports textual data through Word2Vec.
H2O allows for early stopping while training models. To limit the training time of a supervised learning model, the max_runtime_secs option is used:
If max_runtime_secs is exceeded before the model building is complete, the build fails. The same option can be used with grid search, where it specifies the maximum time allowed for searching the whole grid.
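To make the option concrete, here is a hedged sketch that passes `max_runtime_secs` together with H2O's metric-based stopping parameters to a gradient boosting model. The choice of GBM, the helper name `train_with_limits`, and all numeric values are assumptions for illustration.

```python
# Early-stopping settings; the specific values are illustrative assumptions
early_stopping = {
    "max_runtime_secs": 30,      # wall-clock budget for the model build
    "stopping_rounds": 3,        # scoring rounds without improvement before stopping
    "stopping_metric": "AUTO",   # let H2O choose the metric for the problem type
    "stopping_tolerance": 1e-3,  # minimum relative improvement that counts
}


def train_with_limits(train, predictors, response, **overrides):
    """Train a GBM with the early-stopping settings above."""
    # Deferred import: needs the h2o package and a running cluster
    from h2o.estimators import H2OGradientBoostingEstimator

    params = {**early_stopping, **overrides}  # caller may change any setting
    model = H2OGradientBoostingEstimator(seed=1, **params)
    model.train(x=predictors, y=response, training_frame=train)
    return model
```

Calling `train_with_limits(train, predictors, response, max_runtime_secs=120)` would keep the other settings and only lengthen the time budget.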
In supervised learning, these models are supported in the H2O framework:-
In unsupervised learning, it supports the following models:-
It also supports other models such as:-
H2O facilitates both supervised and unsupervised learning, as was previously mentioned. It supports both classification and regression approaches in supervised learning. The result of classification can be binary or multiclass and takes the form of categorical data. The result of regression is a numerical prediction. The procedure for telling H2O whether a model is a classification or a regression model is as follows:-
This is done to ensure that the model is trained as efficiently as possible.
In the above example, the input and output columns, which were numeric by default, are converted to factors. Making this conversion allows the classification technique to be used.
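The factor conversion described here can be sketched as follows. The helper name `train_classifier`, the use of a gradient boosting model, and the `categorical_predictors` argument are assumptions; the key step is `asfactor()`, which turns a numeric column into a categorical one so that H2O treats the problem as classification.

```python
def train_classifier(frame, predictors, response, categorical_predictors=()):
    """Convert columns to factors and train a classification model."""
    # Deferred import: needs the h2o package and a running cluster
    from h2o.estimators import H2OGradientBoostingEstimator

    # Numeric columns that are really categories become factors;
    # a factor response makes H2O build a classifier
    for col in list(categorical_predictors) + [response]:
        frame[col] = frame[col].asfactor()

    model = H2OGradientBoostingEstimator(seed=1)
    model.train(x=predictors, y=response, training_frame=frame)
    return model
```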
In regression, we use the Boston house price dataset. In line 4 we check if the column is numeric or not, and then that column is set as the response column.
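A comparable sketch for the regression case is given below. It assumes the Boston housing data has already been loaded as an H2OFrame and that the response column is named `medv`, as in the commonly distributed CSV; both the helper name and the column name are assumptions.

```python
def train_regressor(frame, response="medv"):
    """Check that the response is numeric, then train a regression model."""
    # Deferred import: needs the h2o package and a running cluster
    from h2o.estimators import H2OGradientBoostingEstimator

    # isnumeric() returns one boolean per column in the selection
    if not all(frame[response].isnumeric()):
        raise ValueError(f"{response!r} must be numeric for regression")

    predictors = [col for col in frame.columns if col != response]
    model = H2OGradientBoostingEstimator(seed=1)
    model.train(x=predictors, y=response, training_frame=frame)
    return model
```

Because the response stays numeric, H2O builds a regression model rather than a classifier.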
In unsupervised learning, the K-means clustering is implemented on the Iris dataset.
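For the unsupervised case, a K-means run on the Iris data might be sketched like this. The label column name `class`, the choice of `k=3`, and the 80/20 split are assumptions; no response column is passed to `train`, since clustering is unsupervised.

```python
def cluster_iris(iris, label_col="class", k=3):
    """Fit K-means on the Iris measurements and assign clusters."""
    # Deferred import: needs the h2o package and a running cluster
    from h2o.estimators import H2OKMeansEstimator

    # Only the measurement columns are used; the species label is left out
    predictors = [col for col in iris.columns if col != label_col]
    train, valid = iris.split_frame(ratios=[0.8], seed=1234)

    model = H2OKMeansEstimator(k=k, standardize=True, seed=1234)
    model.train(x=predictors, training_frame=train, validation_frame=valid)

    assignments = model.predict(valid)  # cluster index for each validation row
    return model, assignments
```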
Cartesian grid search and random grid search are the two types of grid searches available in H2O. In a Cartesian grid search, the user defines the values to try for each hyperparameter, and H2O trains one model for every possible combination of those values. For instance, if there are 4 hyperparameters with 5, 10, 4 and 2 candidate values respectively, the grid will comprise 5*10*4*2 = 400 models.
In random search, the hyperparameter values are still defined by the user, but H2O samples uniformly from the set of all possible combinations. Users can also specify a stopping criterion for a random search. This can be done either by setting a maximum number of models to build, or by specifying a time limit for the search. The search can also be stopped by a performance-metric-based criterion, where it ends once the chosen metric stops improving by more than a given tolerance.
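The 400-model arithmetic can be checked in plain Python before any model is trained. In the sketch below the hyperparameter names are valid GBM options, but the specific values, the helper name `run_cartesian_grid`, and the choice of GBM are assumptions made for illustration.

```python
from itertools import product

# Candidate values: 5 * 10 * 4 * 2 combinations
hyper_params = {
    "max_depth": [3, 5, 7, 9, 11],         # 5 values
    "ntrees": list(range(10, 101, 10)),    # 10 values
    "learn_rate": [0.01, 0.05, 0.1, 0.2],  # 4 values
    "sample_rate": [0.8, 1.0],             # 2 values
}

# A Cartesian grid is the full cross product of all value lists
combos = [dict(zip(hyper_params, values))
          for values in product(*hyper_params.values())]
print(len(combos))  # 400


def run_cartesian_grid(train, predictors, response, params=hyper_params):
    """Train one GBM per combination; Cartesian is H2O's default strategy."""
    # Deferred imports: need the h2o package and a running cluster
    from h2o.estimators import H2OGradientBoostingEstimator
    from h2o.grid.grid_search import H2OGridSearch

    grid = H2OGridSearch(model=H2OGradientBoostingEstimator, hyper_params=params)
    grid.train(x=predictors, y=response, training_frame=train, seed=1)
    return grid
```

Counting the combinations first is a cheap way to see whether a Cartesian search is affordable at all.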
Random search performed on the same dataset
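A random search differs from the Cartesian one mainly in the `search_criteria` passed to the grid. The sketch below shows one plausible configuration; the limits, the metric, the value lists, and the helper name `run_random_grid` are all illustrative assumptions.

```python
# Stopping criteria for a random search; the values are illustrative assumptions
search_criteria = {
    "strategy": "RandomDiscrete",  # sample combinations instead of trying all
    "max_models": 20,              # stop after this many models
    "max_runtime_secs": 300,       # or after this much time for the whole grid
    "stopping_metric": "AUC",      # or when this metric stops improving
    "stopping_tolerance": 0.001,
    "stopping_rounds": 5,
    "seed": 1,
}

hyper_params = {
    "max_depth": [3, 5, 7, 9, 11],
    "learn_rate": [0.01, 0.05, 0.1, 0.2],
    "sample_rate": [0.8, 1.0],
}


def run_random_grid(train, predictors, response):
    """Run a random grid search over hyper_params with the criteria above."""
    # Deferred imports: need the h2o package and a running cluster
    from h2o.estimators import H2OGradientBoostingEstimator
    from h2o.grid.grid_search import H2OGridSearch

    grid = H2OGridSearch(
        model=H2OGradientBoostingEstimator,
        hyper_params=hyper_params,
        search_criteria=search_criteria,
    )
    grid.train(x=predictors, y=response, training_frame=train, seed=1)
    return grid
```

Whichever limit is reached first ends the search, so the three criteria can be combined freely. The `AUC` metric assumes a binary classification response.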
Checkpoint creation for Grid search models
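One way to checkpoint a grid, sketched under assumptions, is to save it to disk with `h2o.save_grid` and restore it later with `h2o.load_grid`. The helper names and the default directory are hypothetical, and the exact checkpointing options vary between H2O releases, so the signatures should be checked against the installed version.

```python
def checkpoint_grid(grid, directory="./grid_checkpoints"):
    """Save a trained grid to disk and return the path it was written to."""
    # Deferred import: needs the h2o package and a running cluster
    import h2o

    return h2o.save_grid(directory, grid.grid_id)


def restore_grid(saved_path):
    """Load a previously saved grid back into the running cluster."""
    import h2o

    return h2o.load_grid(saved_path)
```

A restored grid can be inspected or trained further in a fresh cluster session.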
Once the search is completed, the models stored in the H2O cluster can be accessed by their model IDs and sorted by any particular performance metric, such as RMSE.
In H2O, there are additional tools that work towards the aim of admissible ML models. These models are efficient and fair (with reduced discrimination towards minority groups), as well as more interpretable. There are 2 methods within H2O for admissibility:-
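Sorting a finished grid can be sketched with `get_grid`, which returns the grid ordered by the chosen metric. The helper name `best_model_by` is hypothetical, and `rmse` is used because the text mentions it; for RMSE, lower is better, so the sort is ascending.

```python
def best_model_by(grid, metric="rmse", decreasing=False):
    """Return the top model of a grid after sorting by a metric."""
    # get_grid re-orders the grid's models by the requested metric
    sorted_grid = grid.get_grid(sort_by=metric, decreasing=decreasing)
    print(sorted_grid)            # table of model IDs with metric values
    return sorted_grid.models[0]  # first entry is the best under this ordering
```

For metrics where higher is better, such as AUC, pass `decreasing=True`.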
It is possible to save binary models with Python in H2O using the h2o.save_model() call. These binary models are not compatible across versions, so a saved model can only be loaded by the same H2O version that created it, and after a version update the models have to be rebuilt. To save models for production environments, the MOJO/POJO formats are used. A POJO is plain Java code and a MOJO is a compact model archive, and neither depends on a running H2O cluster.
The MOJO import functionality allows previously exported MOJO models to be brought back into H2O. The table below shows which models are currently supported within H2O:
The above code shows how to download a model and how to upload any of the models listed in the table.
H2O allows the user to download the logs of any work that has been done within the cluster. If there are no existing jobs, then the following steps can be followed to download the logs:-
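Saving and reloading a binary model can be sketched as below. The helper name `save_and_reload` and the directory are assumptions; `h2o.save_model` returns the full path of the saved file, which `h2o.load_model` accepts.

```python
def save_and_reload(model, directory="./saved_models"):
    """Save a binary model to disk and load it back into the cluster."""
    # Deferred import: needs the h2o package and a running cluster
    import h2o

    # force=True overwrites an existing file with the same model ID
    path = h2o.save_model(model=model, path=directory, force=True)

    # Loading only works with the same H2O version that saved the model
    return h2o.load_model(path)
```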
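Since the original code did not survive on this page, the following is a hedged sketch of the download and import steps. The helper names and the directory are hypothetical; `download_mojo` writes the model as a MOJO file and returns its path, and `h2o.import_mojo` loads such a file back as a generic model for scoring.

```python
def export_mojo(model, directory="./mojos"):
    """Download a trained model as a MOJO file and return the file path."""
    # get_genmodel_jar=True also saves the Java scoring library next to it
    return model.download_mojo(path=directory, get_genmodel_jar=True)


def import_mojo_model(mojo_path):
    """Load a MOJO file into the running cluster as a generic model."""
    # Deferred import: needs the h2o package and a running cluster
    import h2o

    return h2o.import_mojo(mojo_path)
```

The imported model can be used for prediction, but it cannot be trained further.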
To view the logs when the cluster is active:-
In order to download the logs in Python, the following command can be used:
h2o.download_all_logs(dirname='./CWD/', filename='autoh20_log.zip')
With this, the discussion of H2O comes to an end. The potential of this AutoML framework is huge, and it is something that should be explored to a greater extent.