ML Zoomcamp 1.4 - CRISP-DM

CRISP-DM
ML Process
DataTalks.Club
Machine Learning Zoomcamp
Session #1.4

DataTalks.Club — mlzoomcamp.com — @Al_Grigor
Session #1.4: Plan
● CRISP-DM: a methodology for organizing ML projects
● From problem understanding to deployment
● Spam detection example

ML Projects
● Understand the problem
● Collect the data
● Train the model
● Use it

Spam detection: ✉ → Model

Picture: CRISP-DM
CRISP-DM

Spam detection: ✉ → Model → spam / not spam

Picture: CRISP-DM
Identify the business problem, understand how we can solve it

Picture: CRISP-DM
Do we actually need ML here?

Business understanding
● Our users complain about spam
● Analyze to what extent it’s a problem
● Will Machine Learning help?
● If not: propose an alternative solution

Business understanding
Define the goal:
● Reduce the number of spam messages, or
● Reduce the number of complaints about spam
The goal has to be measurable:
● Reduce the amount of spam by 50%
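One way to make a goal like "reduce the amount of spam by 50%" checkable is to write it down as a baseline plus a target reduction. The sketch below is only an illustration: the `Goal` class, its field names, and the baseline of 200 spam messages per 1,000 emails are assumptions, not figures from the session.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Goal:
    """A measurable goal: reduce `metric` by `target_reduction` from `baseline`."""
    metric: str
    baseline: float          # value before the project (hypothetical number)
    target_reduction: float  # 0.5 means "reduce by 50%"

    def target_value(self) -> float:
        # The metric value we must reach for the goal to count as met.
        return self.baseline * (1.0 - self.target_reduction)

    def is_reached(self, current: float) -> bool:
        return current <= self.target_value()


# Hypothetical baseline: 200 spam messages per 1,000 delivered emails.
goal = Goal(metric="spam per 1,000 emails", baseline=200.0, target_reduction=0.5)
print(goal.target_value())     # 100.0
print(goal.is_reached(120.0))  # False
print(goal.is_reached(90.0))   # True
```

Writing the goal this way forces a decision on what is being counted and what the starting point is, which is also what the data understanding step needs.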

Picture: CRISP-DM
Analyze available data sources, decide if we need to get more data

Data understanding
Identify the data sources
● We have a report spam button
● Is the data behind this button good enough?
● Is it reliable?
● Do we track it correctly?
● Is the dataset large enough?
● Do we need to get more data?
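A first pass at these questions can be a few counts over the raw reports. The sketch below assumes a hypothetical export of the "report spam" button as `(email_id, reported_as_spam)` pairs; the records, the `summarize` helper, and the `min_examples` threshold are all made up for illustration.

```python
from collections import Counter

# Hypothetical export of the "report spam" button: (email_id, reported_as_spam).
reports = [
    ("e1", True), ("e2", False), ("e3", False), ("e4", True),
    ("e5", False), ("e2", False),  # "e2" appears twice: a possible tracking issue
]


def summarize(records, min_examples=1000):
    """Return basic facts that help answer the questions above."""
    ids = [email_id for email_id, _ in records]
    duplicates = [i for i, n in Counter(ids).items() if n > 1]
    spam_count = sum(1 for _, is_spam in records if is_spam)
    return {
        "n_examples": len(records),
        "spam_share": spam_count / len(records) if records else 0.0,
        "duplicate_ids": duplicates,
        # min_examples is an arbitrary placeholder threshold, not a rule.
        "large_enough": len(records) >= min_examples,
    }


summary = summarize(reports)
print(summary)
```

Duplicate IDs hint at tracking problems, and a very small or very lopsided dataset suggests that more data is needed before modeling.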

Data understanding
Identify the data sources
● It may influence the goal
● We may go back to the previous step and adjust it

Picture: CRISP-DM
Transform the data so it can be fed into an ML algorithm

Data preparation
● Clean the data
● Build the pipelines
● Convert into tabular form

Data preparation
Picture: Emails + "Mark as spam" reports → Data processing pipeline → feature vectors
Example email:
Subject: You won 1 MILLION!
From: winner@moneys.com
Congratulations! You've won $1,000,000!
In order to access the money, deposit $100 to
XXXXXX
Yours sincerely,
Moneyball
Pipeline output, one feature vector per email:
[1, 1, 0, 0, 1, 0]
[0, 1, 0, 0, 0, 1]
[0, 0, 0, 1, 1, 0]
[1, 1, 0, 0, 1, 1]
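The step from an email to a row of numbers can be sketched as a function that answers a fixed set of yes/no questions. The six features below are hypothetical and chosen only for illustration; the slides do not say which features produce the vectors shown, so the output here is not meant to reproduce them.

```python
from typing import List


def email_to_features(subject: str, sender: str, body: str) -> List[int]:
    """Turn one email into a row of 0/1 features (hypothetical feature set)."""
    text = f"{subject} {body}".lower()
    return [
        int("won" in text),        # mentions winning
        int("$" in text),          # mentions money
        int("deposit" in text),    # asks for a deposit
        int(subject.isupper()),    # subject is written in all caps
        int("http" in text),       # contains a link
        int(len(body) > 1000),     # unusually long body
    ]


row = email_to_features(
    subject="You won 1 MILLION!",
    sender="winner@moneys.com",
    body="Congratulations! You've won $1,000,000! "
         "In order to access the money, deposit $100 to XXXXXX",
)
print(row)  # [1, 1, 1, 0, 0, 0]
```

Applying the same function to every email gives a table with one row per email and one column per feature; the "mark as spam" reports supply the label column.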

Picture: CRISP-DM
Training the models: the actual Machine Learning happens here

Modeling
Training a model:
● Try different models
● Select the best one

Modeling
Which model to choose?
● Logistic regression
● Decision tree
● Neural network
● Or many others
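"Try different models, select the best one" can be sketched as a loop that scores each candidate on the same validation set. To stay self-contained, the candidates below are simple hand-written rules standing in for trained models such as logistic regression or a decision tree; the feature layout and the data are hypothetical.

```python
# Each row: (features, label). Hypothetical feature layout:
# [mentions_winning, mentions_money, asks_for_deposit]; label 1 = spam.
validation_set = [
    ([1, 1, 1], 1),
    ([1, 0, 0], 0),
    ([0, 1, 0], 0),
    ([0, 1, 1], 1),
    ([0, 0, 0], 0),
    ([1, 1, 0], 1),
]

# Stand-ins for trained models: any callable mapping features -> 0 or 1.
candidates = {
    "any_signal": lambda x: int(sum(x) >= 1),
    "two_signals": lambda x: int(sum(x) >= 2),
    "all_signals": lambda x: int(sum(x) == 3),
}


def accuracy(model, rows):
    """Share of rows where the model's prediction matches the label."""
    correct = sum(1 for x, y in rows if model(x) == y)
    return correct / len(rows)


scores = {name: accuracy(model, validation_set) for name, model in candidates.items()}
best_name = max(scores, key=scores.get)
print(scores)
print(best_name)  # two_signals
```

With real models the loop looks the same: train each candidate on the training data, score it on the validation data, and keep the one with the best score.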

Modeling
Sometimes, we may go back to data preparation:
● Add new features
● Fix data issues

Picture: CRISP-DM
Measure how well the model solves the business problem

Evaluation
Is the model good enough?
● Have we reached the goal?
● Do our metrics improve?
Goal: Reduce the amount of spam by 50%
● Have we reduced it? By how much?
● (Evaluate on the test group)
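Checking the goal against the test group comes down to comparing a rate with a reference. The sketch below assumes the reference is a control group of users who do not get the model, which the slides do not specify; the counts are invented for illustration.

```python
def spam_rate(spam_delivered: int, emails_delivered: int) -> float:
    """Share of delivered emails that were spam."""
    return spam_delivered / emails_delivered


def relative_reduction(control_rate: float, test_rate: float) -> float:
    """Fraction by which the test group's rate is lower than the control's."""
    return (control_rate - test_rate) / control_rate


# Hypothetical counts: the control group has no model, the test group has it.
control = spam_rate(spam_delivered=200, emails_delivered=10_000)
test = spam_rate(spam_delivered=130, emails_delivered=10_000)

reduction = relative_reduction(control, test)
goal_reached = reduction >= 0.5  # goal: reduce the amount of spam by 50%
print(f"reduction: {reduction:.0%}, goal reached: {goal_reached}")
```

In this made-up case spam drops by about 35%, so the 50% goal is not reached, which leads into the retrospective questions that follow.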

Evaluation
Do a retrospective:
● Was the goal achievable?
● Did we solve/measure the right thing?
After that, we may decide to:
● Go back and adjust the goal
● Roll the model out to more users or to all users
● Stop working on the project

Evaluation + Deployment
Often happens together:
● Online evaluation: evaluation on live users
● This means deploying the model first, then evaluating it
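Online evaluation needs a way to decide which live users see the model. One common approach, sketched here as an assumption rather than as the course's method, is to hash each user ID into a bucket and enable the model for a fixed fraction of buckets. The `in_test_group` function and the user IDs are hypothetical.

```python
import hashlib


def in_test_group(user_id: str, rollout_fraction: float) -> bool:
    """Deterministically place a user in the test group.

    Hashing the ID gives a stable bucket in [0, 10000), so the same user
    always gets the same answer for a given rollout_fraction.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000
    return bucket < rollout_fraction * 10_000


users = [f"user-{i}" for i in range(1000)]  # hypothetical user IDs
test_group = [u for u in users if in_test_group(u, rollout_fraction=0.1)]
print(len(test_group))  # roughly 100; the exact count depends on the hashes
```

Because the bucket for a user never changes, raising `rollout_fraction` keeps everyone already in the test group and only adds new users, which matches rolling the model out to more users step by step.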

Picture: CRISP-DM
Deploy the model to production

Deployment
● Roll the model out to all users
● Set up proper monitoring
● Ensure quality and maintainability

ML projects require many iterations!
Iterate:
● Start simple
● Learn from feedback
● Improve

Summary
● Business understanding: define a measurable goal. Ask: do we need ML?
● Data understanding: do we have the data? Is it good?
● Data preparation: transform the data into a table, so we can feed it into an ML algorithm
● Modeling: to select the best model, use the validation set
● Evaluation: validate that the goal is reached
● Deployment: roll the model out to production for all users
● Iterate: start simple, learn from the feedback, improve

Next
The modeling step of CRISP-DM
