• CRoss Industry Standard Process for Machine Learning with Quality Assurance – CRISP–ML(Q)
  • A process model designed for Machine Learning and Analytics projects
  • A six-step process model that gives Data Science and Artificial Intelligence projects a structured approach

While all the steps are equally important, let us discuss each step in further detail.

Get started with the first stage of CRISP–ML(Q):

1a

Business Understanding

This stage is pivotal to the success of the project: garbage in, garbage out!

If the business problem is misunderstood, the organization drifts away from the problem at hand, which can lead to costly failures. Defining the business problem properly allows the as-is processes to be assessed from the right perspective, so that the Data Mining Problem is clearly defined. To set the project up for success, the Project Charter, the first document created on any project, is prepared and signed off by the sponsor. It serves as a mandatory input for crafting a detailed project management plan.

A KPMG survey reports that most organizations (~70%) experienced at least one project failure in the preceding year.

According to an IBM CTO, most projects (~87%) in the Data Science space never see the light of production.

The Four Key Steps of Business Understanding Phase of CRISP-ML(Q)

  • Define Business Problem

    Define the Scope of the ML Application

    • Business Problem
    • Business Objectives
    • Business Constraints
  • Assess and Analyze Scenarios

    Define Success Criteria

    • Business Success Criteria
    • ML Success Criteria
    • Economic Success Criteria
  • Define Data Mining Problem

    Feasibility

    • ML Technology Applicability
    • Legal Constraints
    • Requirements on the Application
  • Project Plan

    Project Charter is the key document

1b

Data Understanding

For data-driven decision making, one should understand Data Collection and the various sources which generate the data.

Data is generated and collected from a wide range of sources, including Primary Data Sources (surveys, design of experiments, simulations) and Secondary Data Sources (RDBMS, Industrial IoT sensors, etc.). Once the data is collected, the team should describe it and document a data dictionary, so that every team member is well informed about the variables and data used for further analysis.

IDC forecasts that connected IoT devices will generate ~79.4 zettabytes of data in 2025.

By 2025, the total volume of data is expected to reach 175 zettabytes. Other research suggests that the amount of data will double every year from 2022.

The Two Key Steps of Data Understanding Phase of CRISP-ML(Q)

  • Data Collection

    Data Collection

    • Data Version Control
  • Data Description

    Data Quality Verification

    • Data Description
    • Data Requirements
    • Data Verification
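
The Data Description step can be sketched in a few lines. The example below is a minimal illustration, not a prescribed tool: the function name `describe_columns` and the sensor records are hypothetical, and it only records an inferred type, a missing-value count, a distinct-value count and one example per column.

```python
from collections import Counter


def describe_columns(rows):
    """Build a simple data dictionary from a list of dict records."""
    columns = sorted({key for row in rows for key in row})
    dictionary = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        type_counts = Counter(type(v).__name__ for v in present)
        dictionary[col] = {
            # Most frequent Python type among the non-missing values.
            "inferred_type": type_counts.most_common(1)[0][0] if present else "unknown",
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "example": present[0] if present else None,
        }
    return dictionary


# Hypothetical sensor records, used only for illustration.
records = [
    {"machine_id": "M1", "temperature_c": 71.5, "status": "ok"},
    {"machine_id": "M2", "temperature_c": None, "status": "ok"},
    {"machine_id": "M3", "temperature_c": 68.0, "status": "fault"},
]

data_dictionary = describe_columns(records)
for name, info in data_dictionary.items():
    print(name, info)
```

A real data dictionary would usually also carry a business definition, unit and owner for each variable, which cannot be inferred from the data and has to be written by the team.
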
2

Data Preparation

The majority of the effort is spent on this step. The more effort invested in this stage, the easier the model building stage will be.

One part of Data Preparation goes by many names: Data Munging, Data Wrangling, Data Cleansing, etc. Data arrives from a wide variety of sources and formats, and bringing all of it into a common format keeps the next steps smooth. Exploring the data and computing Descriptive Statistics surfaces insights that help the business address low-hanging problems quickly, and helps list the quick wins, short-term wins and long-term wins.

Another aim of this step is to get the data into the format needed for building models. We also shortlist the critical few features from the trivial many using feature selection techniques, alongside carefully curated features produced by feature engineering.

60% to 80% of the data scientists’ time is spent on data cleansing.

15% to 25% of the data scientists’ time is spent on modelling.

5% to 15% of the time is spent on deploying the model into production.

The Five Key Steps of Data Preparation Phase of CRISP-ML(Q)

  • Data Integration

    Selecting Data

    • Feature Selection
    • Data Selection
    • Imbalanced Classes Verification
  • Data Wrangling

    Cleaning Data

    • Noise Reduction
    • Data Imputation
  • Attribute Generation & Selection

    Construct Data

    • Feature Engineering
    • Data Augmentation
  • Attribute Generation & Selection

    Standardize Data

    • File Format
    • Feature Scaling
  • Attribute Generation & Selection

    Exploratory Data Analysis / Descriptive Analytics
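
Three of the sub-steps listed above (Data Imputation, Feature Scaling and Imbalanced Classes Verification) can be illustrated with the standard library alone. This is a sketch under simple assumptions: mean imputation and z-score standardization are only one choice among many, and the function names and sample values are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def class_ratios(labels):
    """Return the share of each class, to spot imbalance before modelling."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Hypothetical feature column and target labels.
raw = [10.0, None, 14.0, 12.0]
imputed = impute_mean(raw)      # the missing entry becomes 12.0
scaled = standardize(imputed)
ratios = class_ratios(["ok", "ok", "ok", "fault"])
print(imputed, scaled, ratios)
```

In a real project the imputation value and the scaling statistics would normally be computed on the training split only and then reused on validation and production data, so that no information leaks from unseen data.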

3

Data Mining/Machine Learning

The proof of the pudding is in the eating. In this stage we identify which of the various machine learning models are applicable to the business problem at hand. That in turn helps organizations take proactive as well as strategic decisions.

Given the business problem and data availability, we must decide which Supervised or Unsupervised Learning techniques are apt. Choosing evaluation techniques that match the objective helps the business trust the solution in production. Regularization techniques and hyperparameter tuning are imperative for success. Ultimately, meeting the success criteria defines the success of the project.
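
To make regularization and hyperparameter tuning concrete, here is a deliberately tiny sketch: a one-feature ridge regression without intercept, whose penalty `lam` is chosen by a grid search on a held-out validation set. The model, the data and the grid are assumptions made for illustration; real projects would typically rely on a library and on cross-validation.

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge fit for y ~ w * x (one feature, no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # The penalty lam inflates the denominator and shrinks w towards zero.
    return sxy / (sxx + lam)


def mse(w, xs, ys):
    """Mean squared error of the fitted line on a data set."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def tune_lambda(train, valid, grid):
    """Return (lam, w, validation_error) for the best penalty in the grid."""
    best = None
    for lam in grid:
        w = fit_ridge_1d(train[0], train[1], lam)
        err = mse(w, valid[0], valid[1])
        if best is None or err < best[2]:
            best = (lam, w, err)
    return best


# Hypothetical noisy training data and a cleaner validation set.
train = ([1.0, 2.0, 3.0], [2.5, 3.5, 7.0])
valid = ([4.0, 5.0], [8.0, 10.0])

best_lam, best_w, best_err = tune_lambda(train, valid, [0.0, 0.1, 1.0, 10.0])
print(best_lam, best_w, best_err)
```

The point of the sketch is the separation of roles: the training set fits the weight, the validation set picks the hyperparameter, and neither should be reused to report the final performance.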

Unsupervised Learning

Supervised Learning

Semi-Supervised Learning

Forecasting and Time Series

Self-Supervised Learning

Reinforcement Learning

38.76% is the growth rate of the global machine learning market between 2020 and 2030. – Market Research Future

38% is the share of US jobs that will be automated by 2030. – PwC

The Six Key Steps of Machine Learning Phase of CRISP-ML(Q)

  • Selecting Model Techniques

    Research for Similar Problems

  • Model Building

    Define Quality Measures of the Model

    • Performance
    • Robustness
    • Scalability
    • Explainability
    • Model Complexity
    • Resource Demand
  • Model Evaluation and Tuning

    Model Selection

    • Using Unlabelled Data and Pre-trained Models
    • Ensemble Methods
  • Model Assessment

    Incorporate Domain Knowledge

  • Selecting Model Techniques

    Model Compression

  • Model Evaluation and Tuning

    Assure Reproducibility

    • Method Reproducibility
    • Result Reproducibility
    • Experiments Documentation
4

Evaluation

Identifying the metric for measuring the model's efficacy and performance is key to ensuring its usability in a production environment.

Experimentation is the key to identifying the model that fits the business, with the right parameters. The model with the least error is not always the right one for the business problem being solved. It is critical to select a model that is accurate and also aligned with the business objectives and business constraints.
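
The idea that the most accurate model is not automatically the right one can be sketched as a constrained selection. Everything in the example is hypothetical: the `Candidate` fields, the model names, their scores and the constraint values stand in for whatever success criteria and business constraints the project has defined.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    accuracy: float      # validation accuracy between 0 and 1
    latency_ms: float    # measured inference latency per request
    explainable: bool    # whether the business accepts it as explainable


def select_model(candidates, max_latency_ms, require_explainable, min_accuracy):
    """Pick the most accurate candidate that satisfies every constraint."""
    feasible = [
        c for c in candidates
        if c.latency_ms <= max_latency_ms
        and c.accuracy >= min_accuracy
        and (c.explainable or not require_explainable)
    ]
    if not feasible:
        return None  # no model meets the constraints; revisit them or the models
    return max(feasible, key=lambda c: c.accuracy)


# Hypothetical experiment results.
candidates = [
    Candidate("deep_net", 0.93, 120.0, False),
    Candidate("gradient_boosting", 0.90, 35.0, False),
    Candidate("logistic_regression", 0.86, 5.0, True),
]

chosen = select_model(candidates, max_latency_ms=50.0,
                      require_explainable=True, min_accuracy=0.80)
print(chosen.name if chosen else "no feasible model")
```

Here the least accurate candidate wins because it is the only one that is both fast enough and explainable, which is exactly the trade-off described above.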

Industry Standards for Continual Improvement:
1. First time model deployment

60% to 80% accuracy is good enough to get started on social sciences projects.

2. One year from time of model deployment

10% improvement in accuracy in the first year of model upgrades.

3. Every year after one year

The Four Steps of Evaluation Phase of CRISP-ML(Q)

  • Pitch results against standards

    Validate Performance

  • Evaluate assumptions & constraints

    Determine Robustness

  • ML pipeline and architecture

    Increase Explainability for ML Practitioner & End User

  • Update OPA

    Compare Results with Defined Success Criteria

5

Model Deployment

Deploying the solution in the most cost-effective and performance-efficient manner is yet another key factor.

Identifying the “Resource Requirement” for deployment in line with customer constraints is imperative for the smooth operations of the deployed solution.

The transition from Development to Test and eventually into Production should be seamless and backed by appropriate testing. Understanding the infrastructure requirements, including servers, business continuity planning, disaster recovery planning, etc., is key to handling unforeseen situations. Risk management, which is performed throughout the project, has to be implemented effectively. If a risk materializes, triggering the risk response plan will limit customer dissatisfaction.

Deciding between cloud and on-premise, and proposing a deployment strategy that keeps models scalable, reliable, secure and maintainable, are critical success factors.

94% of enterprises use cloud services

67% of enterprise infrastructure is now cloud-based

The Five Key Steps of Model Deployment Phase of CRISP-ML(Q)

  • Define Business Problem

    Define Inference Hardware

  • Assess and Analyze Scenarios

    Model Evaluation Under Production Condition

  • Define Data Mining Problem

    Assure User Acceptance & Usability

  • Project Plan

    Minimize the Risks of Unforeseen Errors

  • Project Plan

    Deployment Strategy

[Figure: Average Number of AI or ML Projects Deployed. Axis: Estimated Number of Projects Deployed (Mean)]

6

Monitoring and Maintenance

It is a journey, not a destination. Continual course correction of the deployed models, and sometimes retraining them, goes a long way in achieving customer delight.

Defining the maintenance strategy marks the closure of the project and the start of a new journey. This is a cyclical process. Various industry standards are in place with respect to when a model needs to be retrained. One must also account for factors leading to Model Drift and decay.

When to retrain the model?

The new data is ~20% of the training data

The accuracy in production changes by more than 5%

When there are substantial inter- or intra-organizational policy changes that are bound to affect the model assumptions
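
The three retraining triggers above can be expressed as a small check. This is a sketch with stated assumptions: the function name and parameters are hypothetical, "more than 5%" is read here as an absolute change of 5 percentage points in accuracy, and the policy change is a flag set by a person rather than something the code can detect.

```python
def should_retrain(train_rows, new_rows, baseline_accuracy, production_accuracy,
                   policy_changed=False, data_ratio=0.20, accuracy_tolerance=0.05):
    """Return the list of reasons to retrain; an empty list means no trigger fired."""
    reasons = []
    # Trigger 1: newly collected data reaches ~20% of the training data volume.
    if train_rows > 0 and new_rows / train_rows >= data_ratio:
        reasons.append("new data volume")
    # Trigger 2: production accuracy moved by more than 5 points (assumed absolute).
    if abs(production_accuracy - baseline_accuracy) > accuracy_tolerance:
        reasons.append("accuracy shift")
    # Trigger 3: a policy change that invalidates the model assumptions.
    if policy_changed:
        reasons.append("policy change")
    return reasons


# Hypothetical monitoring snapshots.
print(should_retrain(10_000, 2_500, 0.90, 0.89))
print(should_retrain(10_000, 500, 0.90, 0.82))
print(should_retrain(10_000, 500, 0.90, 0.89, policy_changed=True))
```

Returning the reasons rather than a bare boolean keeps the decision auditable, which helps when the retraining run has to be justified to the project sponsor.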

The Six Key Aspects of Monitoring & Maintenance Phase of CRISP-ML(Q)

  • Define Business Problem

    P.E.S.T.E.L effects on Business & Data

  • Assess and Analyze Scenarios

    Non-Stationary Data Distribution – Data Drift

  • Define Data Mining Problem

    Hardware Degradation

  • Project Plan

    Periodic System & Software Updates

  • Project Plan

    Model Performance Degradation – Model Drift

  • Project Plan

    Strategy for Retire, Replace, Update
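
Data drift, listed above as a non-stationary data distribution, needs a numeric measure before it can be monitored. One common choice, which this article does not itself prescribe, is the Population Stability Index (PSI); the sketch below assumes equal-width bins taken from the training data, and the sample values are hypothetical.

```python
import math


def psi(expected, actual, bins=5, eps=1e-6):
    """Population Stability Index between a training and a production sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0  # constant training feature: everything falls in one bin

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clip values outside the training range
            counts[idx] += 1
        # eps avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical feature values at training time and in production.
train_values = [float(v) for v in range(1, 11)]
live_values = [v + 3.0 for v in train_values]

print(psi(train_values, train_values))  # identical samples give 0.0
print(psi(train_values, live_values))   # a shifted sample gives a positive value
```

The threshold at which a PSI value should trigger action is a project decision and would belong in the maintenance strategy, alongside the retraining rules.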