Master the Art of Engineering the Data

July 17, 2024
Meet the Author : Mr. Bharani Kumar

Bharani Kumar Depuru is a well-known IT personality from Hyderabad. He is the Founder and Director of AiSPRY and 360DigiTMG. An IIT and ISB alumnus with more than 18 years of experience, he has held prominent positions at leading firms such as HSBC, ITC Infotech, Infosys, and Deloitte. He is an IT consultant specializing in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence, and Business Management. Bharani Kumar is also the chief trainer at 360DigiTMG, with more than ten years of experience, and has been making the IT transition journey easy for his students. 360DigiTMG is at the forefront of delivering quality education, thereby bridging the gap between academia and industry.

As shown in figure 1, machine learning (ML) models only make up 5% to 10% of the whole artificial intelligence (AI) solution.

Figure 1

While ML code sits at the centre of any AI application, the other components (data intake, storage, processing, and so on) are just as crucial to the application's ability to scale as required and perform as intended. These remaining parts account for 90% to 95% of the total effort required to make an AI application function as intended. By any measure this is a massive amount of work, but cloud-based implementations can ease the load on data scientists and AI specialists. The terms "AI application," "ML platform," and "ML system" are used interchangeably in this article.

Designing an ML system means overcoming several challenges:

1. Occam's razor, or the principle of parsimony, states that the simplest model is the best model. However, research has shown that a simple model trained on a large dataset tends to outperform models built on small datasets.
2. Working on large datasets means using multiple interconnected servers that split the data into chunks and process the chunks. This requires a distributed network of servers, and the cloud is usually the preferred choice.
3. ML systems should always be ready to operate at scale during the inference phase. Recall Pokemon Go, the AI-enabled video game from Niantic, which reached a scale of 0.5 billion users in under 2 months. This is why serving infrastructure matters so much, and it is evident in Figure 1, where it takes up a major part of the AI system as a whole.
4. Input/output bandwidth limitations can also be addressed using cloud infrastructure. Careful evaluation of the server infrastructure needed for the ML platform is crucial. Typically, the compute, storage, and network bandwidth needed during the training phase are much higher than those needed after deployment to production. However, these needs vary with business requirements, as in the Pokemon Go example mentioned in point 3.
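
The idea of splitting a large dataset into chunks and processing each chunk separately can be illustrated with a minimal, single-machine sketch. It is not a distributed implementation: the function names, the chunk size, and the fare values are all made up for illustration, and each call to `summarise_chunk` simply stands in for work that a separate server would do.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class PartialStats:
    """Summary of one chunk: enough to rebuild the overall mean later."""
    count: int
    total: float


def split_into_chunks(values: List[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield consecutive slices holding at most chunk_size items each."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


def summarise_chunk(chunk: List[float]) -> PartialStats:
    """The work one server would do on its own chunk."""
    return PartialStats(count=len(chunk), total=sum(chunk))


def combine(partials: Iterable[PartialStats]) -> float:
    """Merge the per-chunk summaries into the overall mean."""
    count = 0
    total = 0.0
    for partial in partials:
        count += partial.count
        total += partial.total
    if count == 0:
        raise ValueError("no data to combine")
    return total / count


# Made-up fare values, purely for illustration.
fares = [12.5, 8.0, 23.1, 5.4, 17.0, 9.9, 30.2]
partials = [summarise_chunk(chunk) for chunk in split_into_chunks(fares, chunk_size=3)]
overall_mean = combine(partials)
```

Because each chunk is reduced to a small summary (a count and a sum), only those summaries need to travel over the network, which is one reason this pattern suits bandwidth-limited setups.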

The operations role traditionally involves taking error-free code from developers and putting it into production without any problems; its main responsibilities are listed below. Later, the development and operations roles were combined, giving rise to the term DevOps.

Instantiating or creating the infrastructure (mainly RAM, CPU, storage, and network bandwidth) needed to ensure the smooth running of the code.
Configuring the server with the Operating System (OS), middleware software, regular patches, and updates from the OS provider.
Ensuring server high availability (continuous availability of servers) and low latency (quick responsive nature).
Optimizing server infrastructure so that the costs are reduced.

Eventually, cloud computing models came to automate many of these tasks.

This is the first in a series of articles on the stages of MLOps, beginning here with understanding how to manage datasets.

Gain business domain knowledge to get started on any project. E.g., if you are going to work on the Uber dataset then start gaining knowledge on taxi-hailing services.
Gain knowledge of business rules. This can be as simple as working out fare calculations from various government websites or from syndicated (purchased) data sources. Pricing during peak hours, on rainy days, and so on is part of business rules knowledge.
Define the input and output variables by researching previous, similar projects and by understanding the schema of the business problem being solved. Variable names and data types, along with example values, make a very good start for the schema. Figure 2 is one such example. Geocoded latitude and longitude for the pick-up and drop-off points can form another set of variables, which involves some feature engineering.
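
A schema of variable names and data types, together with a geocoding-based feature, might be sketched as follows. The column names and the sample coordinates are hypothetical and are not taken from Figure 2 or from any real dataset. The great-circle (haversine) distance is one common way to turn pick-up and drop-off coordinates into a single feature; the article does not prescribe it.

```python
import math
from dataclasses import dataclass, fields
from datetime import datetime

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


@dataclass
class TripRecord:
    # Hypothetical input variables.
    pickup_time: datetime
    pickup_lat: float
    pickup_lon: float
    dropoff_lat: float
    dropoff_lon: float
    passenger_count: int
    # Hypothetical output variable (what a model would predict).
    fare_amount: float


def describe_schema(record_type) -> list:
    """Return (variable name, data type name) pairs for a dataclass."""
    return [(f.name, getattr(f.type, "__name__", str(f.type))) for f in fields(record_type)]


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Example values for illustration only.
trip = TripRecord(datetime(2024, 7, 17, 8, 30), 38.8977, -77.0365, 38.8893, -77.0502, 2, 9.75)
trip_km = haversine_km(trip.pickup_lat, trip.pickup_lon, trip.dropoff_lat, trip.dropoff_lon)
schema = describe_schema(TripRecord)
```

Note that the haversine distance is a straight-line approximation; the actual driving distance along roads will generally be longer.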

Figure 2

1. Gain knowledge of how business solutions are implemented. Here, software engineering approaches to the solution can be evaluated. Software engineers can write code that incorporates the business rules and integrates with the Google Maps API. The API calculates the shortest distance, taking into account factors such as traffic and bad weather, and reports the ETA (estimated time of arrival). However, Google Maps charges per API request, and this becomes very costly given that the entire setup must be on-premise. Hence, cloud-based solutions with ML-driven route optimization are more viable for finding the shortest distance at the lowest cost.
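
The rule-based approach described above could look roughly like the sketch below. Every rate, multiplier, and peak-hour window here is invented for illustration and does not reflect any real tariff. In the setup the article describes, the distance would come from a mapping API; here it is passed in as a plain argument so the example stays self-contained.

```python
from datetime import datetime

# All values below are hypothetical, chosen only to make the example concrete.
BASE_FARE = 3.00
RATE_PER_KM = 1.50
PEAK_MULTIPLIER = 1.25
RAIN_MULTIPLIER = 1.10
PEAK_HOURS = set(range(7, 10)) | set(range(16, 19))  # 07:00-09:59 and 16:00-18:59


def estimate_fare(distance_km: float, pickup_time: datetime, raining: bool = False) -> float:
    """Apply simple business rules to turn a trip distance into a fare."""
    if distance_km < 0:
        raise ValueError("distance_km must be non-negative")
    fare = BASE_FARE + RATE_PER_KM * distance_km
    if pickup_time.hour in PEAK_HOURS:
        fare *= PEAK_MULTIPLIER  # peak-hour pricing rule
    if raining:
        fare *= RAIN_MULTIPLIER  # bad-weather pricing rule
    return round(fare, 2)


off_peak_fare = estimate_fare(10.0, datetime(2024, 7, 17, 12, 0))
peak_fare = estimate_fare(10.0, datetime(2024, 7, 17, 8, 0))
```

Hard-coded rules like these are easy to read but have to be maintained by hand as pricing changes, which is part of the motivation for letting a model learn the fare from historical trips instead.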

2. Gain an understanding of which datasets are available for implementing the business solution. An ML solution needs no expensive syndicated data for route planning or distance calculation; the model learns from the data and estimates the price per trip. As an example, opendata.dc.gov has data on trips and charges covering approximately 5 years, about 12 GB in size. A dataset of this size comes as a zip file containing pipe-delimited .csv and .txt files, in which each row holds the details of one taxi trip. Such data must be obtained, and a detailed understanding of its metadata is important.
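
Reading pipe-delimited files straight out of a zip archive can be sketched with Python's standard library alone. The file name and column names below are hypothetical, and the sketch assumes that each file starts with a header row and is UTF-8 encoded; the real dataset's metadata should be checked before relying on either assumption.

```python
import csv
import io
import zipfile
from typing import Dict, Iterator


def iter_trip_rows(archive, delimiter: str = "|") -> Iterator[Dict[str, str]]:
    """Yield one dict per data row from every .csv or .txt member of a zip.

    `archive` may be a file path or a binary file-like object.
    Assumes each member has a header row (an assumption, not a known fact).
    """
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if not name.lower().endswith((".csv", ".txt")):
                continue  # skip anything that is not a data file
            with zf.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                yield from csv.DictReader(text, delimiter=delimiter)


# Build a tiny in-memory archive with made-up rows to exercise the reader.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("trips_sample.txt", "trip_id|fare_amount|miles\n1|9.75|2.1\n2|14.20|3.8\n")
buffer.seek(0)

rows = list(iter_trip_rows(buffer))
```

Because the function is a generator, rows are streamed one at a time rather than loaded all at once, which matters when the archive is several gigabytes in size.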
