Guide to ML Development Stages

Frame The Problem.

Framing the problem is the first and most crucial step in `MLOps`. The problem can be one of several types: prediction, classification, clustering, etc. Deciding on and framing the problem helps us choose which algorithms to experiment with from the pool of available options.

Data Gathering.

To create a reliable model, we first need data. There are multiple sources of data, and the choice among them depends on the need. We can get data already formatted as CSV files (awesome), call APIs, scrape the web, query a data warehouse, etc.
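As a rough illustration of the simplest case, CSV data, here is a minimal sketch using only Python's standard library. The column names and the inline `RAW_CSV` text are made up for the example; in practice the text would come from a file or an API response.

```python
import csv
import io

# Hypothetical CSV content (assumption for illustration only).
RAW_CSV = """age,income,purchased
25,40000,0
32,52000,1
47,,1
"""

def load_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))

rows = load_rows(RAW_CSV)
print(len(rows), "rows loaded")
print(rows[0])
```

Note that every value arrives as a string, and the third row has an empty `income`. That is exactly the kind of issue the preprocessing stage has to deal with.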

Data Preprocessing.

The gathered data will usually not be in a state that can be used directly; we cannot train a model on raw data. Outliers, noise, unstructured formats, missing values, irrelevant features, etc. make raw data unfit for use in its original state.

So we need to perform data preprocessing. Removing duplicates, replacing missing values, removing outliers, and scaling the data to a common scale are some common preprocessing steps.
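The four steps above can be sketched in plain Python as follows. This is only one possible set of choices, picked here as assumptions: median imputation, the 1.5 × IQR rule for outliers, and min-max scaling. The `raw` records are invented for the example.

```python
import statistics

# Hypothetical (id, value) records: one duplicate, one missing value, one outlier.
raw = [
    (1, 50.0), (2, 52.0), (2, 52.0), (3, None), (4, 49.0),
    (5, 51.0), (6, 48.0), (7, 53.0), (8, 500.0),
]

def preprocess(records):
    # 1. Remove exact duplicates while keeping the original order.
    unique = list(dict.fromkeys(records))

    # 2. Replace missing values with the median of the observed values.
    observed = [v for _, v in unique if v is not None]
    median = statistics.median(observed)
    filled = [(k, median if v is None else v) for k, v in unique]

    # 3. Drop outliers that fall outside 1.5 * IQR from the quartiles.
    q1, _, q3 = statistics.quantiles([v for _, v in filled], n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    kept = [(k, v) for k, v in filled if low <= v <= high]

    # 4. Min-max scale the remaining values to the range [0, 1].
    vmin = min(v for _, v in kept)
    vmax = max(v for _, v in kept)
    span = (vmax - vmin) or 1.0  # avoid division by zero for constant data
    return [(k, (v - vmin) / span) for k, v in kept]

clean = preprocess(raw)
for record_id, value in clean:
    print(record_id, round(value, 2))
```

The order matters: imputing before outlier removal means the filled-in value is also checked against the IQR bounds, and scaling last keeps the outlier from stretching the scale.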

EDA.

EDA stands for Exploratory Data Analysis. Its main purpose is to find the relationships among the independent features: we try to understand how one feature is related to another. Univariate analysis, balancing the data so that no single feature dominates the feature space, and multivariate analysis are some common steps.
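To make "how one feature is related to another" concrete, here is a small sketch that prints per-feature summaries (univariate) and pairwise Pearson correlations (a simple multivariate check). The feature names and values are invented for the example, and Pearson correlation is just one of several measures that could be used.

```python
import math
import statistics
from itertools import combinations

# Hypothetical feature columns (assumption for illustration only).
features = {
    "height_cm": [150, 160, 170, 180, 190],
    "weight_kg": [52, 58, 69, 77, 88],
    "shoe_size": [43, 38, 41, 39, 42],
}

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Univariate analysis: summarize each feature on its own.
for name, values in features.items():
    print(name, "mean:", statistics.mean(values), "stdev:", round(statistics.stdev(values), 2))

# Multivariate analysis: correlation for every pair of features.
correlations = {
    (a, b): pearson(features[a], features[b])
    for a, b in combinations(features, 2)
}
for pair, r in correlations.items():
    print(pair, round(r, 3))
```

A correlation close to 1 or -1 between two features suggests they carry largely the same information, which is useful input for the feature engineering stage.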

Feature Engineering.

Here we select, remove, or generate new features from the preprocessed data as needed. For the full set of techniques, see Feature Engineering.
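A minimal sketch of all three actions on a toy dataset: removing an identifier column and any constant column, keeping the rest, and generating one new feature from two existing ones. The column names, the `engineer` helper, and the choice of BMI as the derived feature are assumptions made for the example.

```python
# Hypothetical preprocessed rows; column names are illustrative only.
rows = [
    {"id": 101, "height_m": 1.60, "weight_kg": 58.0, "country": "IN"},
    {"id": 102, "height_m": 1.75, "weight_kg": 70.0, "country": "IN"},
    {"id": 103, "height_m": 1.82, "weight_kg": 90.0, "country": "IN"},
]

def constant_columns(rows):
    """Return the columns that hold the same value in every row (no signal)."""
    return {c for c in rows[0] if len({r[c] for r in rows}) == 1}

def engineer(rows, drop=("id",)):
    # Remove identifier columns and columns that never vary.
    remove = set(drop) | constant_columns(rows)
    result = []
    for r in rows:
        new_row = {c: v for c, v in r.items() if c not in remove}
        # Generate a new feature from two existing ones (body mass index).
        new_row["bmi"] = r["weight_kg"] / r["height_m"] ** 2
        result.append(new_row)
    return result

engineered = engineer(rows)
print(engineered[0])
```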

Model Training, Evaluation & Selection.

Now comes the part where experiments start: choosing the best algorithm for the problem framing and the data we have. It is always advisable to try several algorithms and then compare them using evaluation metrics.
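The "try several, then compare" loop can be sketched as below. To stay dependency-free, the two candidates are deliberately tiny stand-ins (a majority-class baseline and a 1-nearest-neighbour classifier), the data is made up, and accuracy on a held-out set is assumed as the evaluation metric. A real project would typically use a library such as scikit-learn and a metric suited to the problem.

```python
from collections import Counter

# Hypothetical 2-D points with binary labels, already split into train and test.
train = [
    ((1.0, 1.2), 0), ((0.8, 1.0), 0), ((1.1, 0.9), 0), ((1.2, 1.1), 0),
    ((3.0, 3.2), 1), ((3.1, 2.9), 1), ((2.9, 3.0), 1),
]
test = [((0.9, 1.1), 0), ((3.2, 3.1), 1), ((1.0, 0.8), 0), ((2.8, 3.1), 1)]

def fit_majority(data):
    """Baseline: always predict the most frequent training label."""
    label = Counter(y for _, y in data).most_common(1)[0][0]
    return lambda x: label

def fit_1nn(data):
    """1-nearest-neighbour: predict the label of the closest training point."""
    def predict(x):
        nearest = min(data, key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], x)))
        return nearest[1]
    return predict

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

# Train every candidate on the same data and score it on the same held-out set.
candidates = {"majority": fit_majority, "1-NN": fit_1nn}
scores = {name: accuracy(fit(train), test) for name, fit in candidates.items()}
best = max(scores, key=scores.get)

print(scores)
print("selected:", best)
```

Including a trivial baseline is a useful habit: if a more complex model cannot beat it, something is wrong with the features or the framing.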

Model Deployment.

Now we have a functional model ready to be used by real users to perform the task it was trained for. The model is then integrated into software or exposed as an API.
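As a sketch of what "exposed as an API" can boil down to, here is the request-handling core that a web framework would wrap: JSON in, JSON out. The `model` function is a hypothetical stand-in for a trained model, and the request shape (`{"features": [...]}`) is an assumption made for the example.

```python
import json

def model(features):
    # Hypothetical stand-in for a trained model: a fixed threshold rule.
    return 1 if sum(features) > 4.0 else 0

def handle_request(body):
    """Take a JSON request body and return (status_code, JSON response body)."""
    try:
        payload = json.loads(body)
        features = [float(v) for v in payload["features"]]
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, missing key, or non-numeric values.
        return 400, json.dumps({"error": "expected a JSON object with a numeric 'features' list"})
    return 200, json.dumps({"prediction": model(features)})

print(handle_request('{"features": [3.0, 3.1]}'))
print(handle_request("not json"))
```

Keeping input validation separate from the model call means bad requests are rejected before they reach the model.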

Testing & Optimizing.

This step always remains in a loop: the performance of the model in production is continuously monitored. If the model is based on online learning, it becomes even more important to regularly check for shifts in the model's performance on new incoming data.

If the model starts to show a shift in performance, the training phase is repeated to bring its performance back in line with our needs.
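One simple way to detect such a shift is to track accuracy over a rolling window of recent predictions and compare it with the accuracy measured at deployment time. The sketch below assumes that true labels eventually become available; the `PerformanceMonitor` class, the window size, and the tolerance are all hypothetical choices for illustration.

```python
from collections import deque

class PerformanceMonitor:
    """Track rolling accuracy and flag when it drops below the baseline."""

    def __init__(self, baseline_accuracy, window=50, tolerance=0.05):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        # Only the most recent `window` outcomes are kept.
        self.outcomes = deque(maxlen=window)

    def record(self, prediction, actual):
        self.outcomes.append(prediction == actual)

    def rolling_accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        # Wait until the window is full so a few early errors do not trigger it.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.rolling_accuracy() < self.baseline - self.tolerance

monitor = PerformanceMonitor(baseline_accuracy=0.90, window=10)

# Ten correct predictions: performance matches the baseline.
for _ in range(10):
    monitor.record(prediction=1, actual=1)
print(monitor.rolling_accuracy(), monitor.needs_retraining())

# Five wrong predictions in a row: rolling accuracy falls and retraining is flagged.
for _ in range(5):
    monitor.record(prediction=1, actual=0)
print(monitor.rolling_accuracy(), monitor.needs_retraining())
```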

Thanks!
