CRISP-DM
ML Process
DataTalks.Club
Machine Learning Zoomcamp
Session #1.4
DataTalks.Club — mlzoomcamp.com — @Al_Grigor
Session #1.4: Plan
● CRISP-DM — methodology for organizing ML projects
● From problem understanding to deployment
● Spam detection example
ML Projects
● Understand the problem
● Collect the data
● Train the model
● Use it
Spam detection: ✉ → Model
Picture: CRISP-DM
✉ → Model → spam / not spam
Picture: CRISP-DM
Identify the business problem, understand how we can solve it
Picture: CRISP-DM
Do we actually need ML here?
Business understanding
● Our users complain about spam
● Analyze to what extent it’s a problem
● Will Machine Learning help?
● If not: propose an alternative solution
Business understanding
Define the goal:
● Reduce the number of spam messages, or
● Reduce the number of complaints about spam
The goal has to be measurable
● Reduce the amount of spam by 50%
Picture: CRISP-DM
Analyze available data sources, decide if we need to get more data
Data understanding
Identify the data sources
● We have a report spam button
● Is the data behind this button good enough?
● Is it reliable?
● Do we track it correctly?
● Is the dataset large enough?
● Do we need to get more data?
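The questions above can be turned into a few quick checks on the data. This is a minimal sketch, not the course's method: it assumes the "report spam" data is available as a list of records with hypothetical fields `text` and `reported_spam`, and `min_rows` is an arbitrary threshold chosen for illustration.

```python
from collections import Counter


def summarize_reports(records, min_rows=1000):
    """Return basic sanity checks for a labeled email dataset.

    `records` is a list of dicts with hypothetical keys "text" and
    "reported_spam" (True, False, or None when the label is missing).
    """
    total = len(records)
    # Count only records that actually carry a label.
    labels = Counter(
        r["reported_spam"] for r in records if r.get("reported_spam") is not None
    )
    labeled = sum(labels.values())
    return {
        "rows": total,
        "large_enough": total >= min_rows,  # is the dataset large enough?
        "spam_share": labels[True] / labeled if labeled else 0.0,
        "missing_label": total - labeled,   # do we track reports correctly?
        "empty_text": sum(1 for r in records if not r.get("text")),
    }


sample = [
    {"text": "You won 1 MILLION!", "reported_spam": True},
    {"text": "", "reported_spam": False},
    {"text": "Meeting at 10", "reported_spam": None},
    {"text": "Lunch?", "reported_spam": False},
]
report = summarize_reports(sample, min_rows=3)
```

A very low spam share, many missing labels, or too few rows would each be a reason to go back and collect more data or adjust the goal.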
Data understanding
Identify the data sources
● It may influence the goal
● We may go back to the previous step and adjust it
Picture: CRISP-DM
Transform the data so it can be put into an ML algorithm
Data preparation
● Clean the data
● Build the pipelines
● Convert into tabular form
Data preparation
Emails + "Mark as spam" reports → Data processing pipeline
Example email (shown four times on the slide, each paired with a feature vector):
Subject: You won 1 MILLION!
From: winner@moneys.com
Congratulations! You've won $1,000,000!
In order to access the money, deposit $100 to
XXXXXX
Yours sincerely,
Moneyball
Feature vectors:
[1, 1, 0, 0, 1, 0]
[0, 1, 0, 0, 0, 1]
[0, 0, 0, 1, 1, 0]
[1, 1, 0, 0, 1, 1]
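One way to turn an email into a binary feature vector like the ones above is sketched below. The slide does not say which six features it uses, so the features here (keywords, a trusted-domain check, a length threshold) are hypothetical and will not reproduce the slide's vectors.

```python
def email_to_features(subject, sender, body, trusted_domains=frozenset()):
    """Convert one email into a list of six 0/1 features (all hypothetical)."""
    subject_l = subject.lower()
    body_l = body.lower()
    # Everything after the last "@" is treated as the sender's domain.
    domain = sender.rsplit("@", 1)[-1].lower()
    return [
        int("won" in subject_l),          # subject mentions winning
        int("million" in subject_l),      # subject mentions "million"
        int("$" in body),                 # body contains a dollar sign
        int("deposit" in body_l),         # body asks for a deposit
        int(domain in trusted_domains),   # sender domain is on a trusted list
        int(len(body) > 500),             # body is long
    ]


body = (
    "Congratulations! You've won $1,000,000!\n"
    "In order to access the money, deposit $100 to\n"
    "XXXXXX\n"
    "Yours sincerely,\n"
    "Moneyball"
)
vector = email_to_features("You won 1 MILLION!", "winner@moneys.com", body)
# vector == [1, 1, 1, 1, 0, 0]
```

Stacking one such vector per email gives the tabular form that the modeling step expects.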
Picture: CRISP-DM
Training the models: the actual Machine Learning happens here
Modeling
Training a model:
● Try different models
● Select the best one
Modeling
Which model to choose?
● Logistic regression
● Decision tree
● Neural network
● Or many others
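"Try different models, select the best one" can be sketched as scoring each candidate on a validation set and keeping the top scorer. The candidates below are toy stand-ins (simple rules rather than a trained logistic regression, decision tree, or neural network), the validation labels are invented for illustration, and accuracy is an assumed metric.

```python
def accuracy(predict, X, y):
    """Share of examples for which the prediction matches the label."""
    correct = sum(1 for features, label in zip(X, y) if predict(features) == label)
    return correct / len(y)


def select_best(candidates, X_val, y_val):
    """Score every candidate on the validation set and return the best name."""
    scores = {
        name: accuracy(predict, X_val, y_val) for name, predict in candidates.items()
    }
    best_name = max(scores, key=scores.get)
    return best_name, scores


# Toy stand-ins for trained models: each maps a feature vector to 0 or 1.
candidates = {
    "always_not_spam": lambda f: 0,
    "first_feature": lambda f: f[0],
    "at_least_four_flags": lambda f: int(sum(f) >= 4),
}

# Feature vectors from the data preparation slide; the labels are made up.
X_val = [[1, 1, 0, 0, 1, 0], [0, 1, 0, 0, 0, 1], [0, 0, 0, 1, 1, 0], [1, 1, 0, 0, 1, 1]]
y_val = [1, 0, 0, 1]

best_name, scores = select_best(candidates, X_val, y_val)
# best_name == "first_feature"
```

Keeping the candidates behind a common "predict" interface makes it cheap to add another model later and rerun the comparison.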
Modeling
Sometimes, we may go back to data preparation:
● Add new features
● Fix data issues
Picture: CRISP-DM
Measure how well the model solves the business problem
Evaluation
Is the model good enough?
● Have we reached the goal?
● Do our metrics improve?
Goal: Reduce the amount of spam by 50%
● Have we reduced it? By how much?
● (Evaluate on the test group)
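Checking the 50% goal comes down to a relative reduction between two spam rates. The sketch below assumes the test group is a set of users who get the model and that it is compared against a control group without it; that setup, the function names, and the counts are illustrative assumptions, not taken from the slides.

```python
def relative_reduction(baseline_rate, new_rate):
    """Fraction by which new_rate is lower than baseline_rate."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (baseline_rate - new_rate) / baseline_rate


def goal_reached(control_spam, control_total, test_spam, test_total, target=0.5):
    """Compare spam rates of the control and test groups against the target."""
    control_rate = control_spam / control_total
    test_rate = test_spam / test_total
    reduction = relative_reduction(control_rate, test_rate)
    return reduction, reduction >= target


# Hypothetical counts: 200 spam messages per 10,000 in the control group,
# 80 per 10,000 in the test group, which is a reduction of roughly 60%.
reduction, reached = goal_reached(200, 10_000, 80, 10_000)
```

If `reached` is false, the retrospective questions apply: was the target achievable, and was spam rate the right thing to measure?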
Evaluation
Do a retrospective:
● Was the goal achievable?
● Did we solve/measure the right thing?
After that, we may decide to:
● Go back and adjust the goal
● Roll the model out to more users or to all users
● Stop working on the project
Evaluation + Deployment
Often happens together:
● Online evaluation: evaluate the model on live users
● That means: deploy the model, then evaluate it
Picture: CRISP-DM
Deploy the model to production
Deployment
● Roll the model out to all users
● Set up proper monitoring
● Ensure quality and maintainability
ML projects require many iterations!
Iterate!
📝
Start simple
Learn from feedback
Improve
Summary
● Business understanding: define a measurable goal. Ask: do we need ML?
● Data understanding: do we have the data? Is it good?
● Data preparation: transform the data into a table so we can feed it to an ML algorithm
● Modeling: to select the best model, use the validation set
● Evaluation: validate that the goal is reached
● Deployment: roll out to production for all users
● Iterate: start simple, learn from the feedback, improve
Next
The modeling step of CRISP-DM


ML Zoomcamp 1.4 - CRISP-DM
