What are common mistakes
in Data Science projects?
(and how to avoid them?)
Artur Suchwałko, Ph.D., QuantUp
AI & Big Data 2018, March 10, 2018, Lviv, Ukraine
Real-world Data Science projects
• Kaggle competitions and real Data Science projects are two quite
different disciplines
• Once a data frame is prepared, the rest is easy
• What is done incorrectly and can be corrected?
• Analysis of a business problem
• Data
• Process
• Methods, models
• Hardware, software
• People
(Everything based on practical experience: 20 years, 100 projects, 3,000
hours of workshops.
For the majority of topics I could add quotes from talks.)
Analysis of a business problem

No. We don’t want to build a model of production and storage in our factory
Problem:
• We’d just like to optimize cutting a log (a trunk of a dead tree) into planks
• Let’s do it in the simplest way. Why should we waste time and money?
• The others can do it. Why do you make it complicated?!?
Solution:
• Build the production and storage model
• Otherwise you will optimize log cutting for a different sawmill, or for something completely different
Solving the wrong analytical problem
Problem:
• Stating the wrong problem and then solving it can decrease the predictive ability of a model
• So can removing so-called false predictors (leaks from the future)
• But we never want pure predictive power. Usually the business wants actionability and real value
Solution:
• Focus on what influences your business
Preparation of a development sample is not very important
Problem:
• Let’s take a sample and model!
• Preparation of the development sample decides whether the model will fit the reality we model or not
• The data, and thus the sample, is generated (or influenced) by a process that must be well known and understood
Solution:
• Think it over really carefully
We have Big Data. We need to implement Big Data solutions
Problem:
• If you can email your data or fit it on a pendrive, you don’t have Big Data!
• Many Data Science tasks on millions of records can be completed on (powerful) laptops
• Decisions are either data-driven or not. It’s not about the magnitude of the data but about the way decisions are made
Solution:
• Be (more than) sure that you need Big Data technologies for storage and processing
• Don’t use Big Data tools during the PoC / prototype stage
• Important: this is not valid for some problems
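A back-of-the-envelope check of the laptop claim could look like the sketch below. The helper name `estimate_memory_gb` and the example table (5 million rows, 50 numeric columns, 8 bytes per value) are assumptions made for illustration, not figures from the talk. Real data frames carry extra overhead for strings and indexes.

```python
def estimate_memory_gb(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> float:
    """Rough in-memory size of a dense numeric table, in gigabytes (10**9 bytes)."""
    return n_rows * n_cols * bytes_per_value / 1e9


# Hypothetical example: 5 million records with 50 float64 columns.
size_gb = estimate_memory_gb(5_000_000, 50)
print(f"{size_gb:.1f} GB")  # compare this with the RAM of the laptop
```

If the estimate is a small fraction of the available RAM, a prototype on a single machine is probably feasible.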
Use social media data
Problem:
• It’s a tremendous effort if you don’t use an off-the-shelf solution
• The business value is usually small
Solution:
• Be sure that the effort will be rewarded
Let’s build a model in one week
Problem:
• It’s possible (in theory)
• If you don’t analyze the process thoughtfully and don’t detect false predictors, the model will not work in production
• We will be really happy to see how well it performs on our development sample
Solution:
• Take enough time
• Be sure that the process is correct
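One simple way to look for false predictors is to compare the date on which each feature value becomes known with the date on which the prediction has to be made. The sketch below assumes that such dates are available. The function name `find_false_predictors` and the credit-scoring features are hypothetical, and this is only one of several possible checks.

```python
from datetime import date
from typing import Dict, List


def find_false_predictors(feature_known_on: Dict[str, date], prediction_date: date) -> List[str]:
    """Return features whose values only become known after the prediction date."""
    return sorted(
        name for name, known_on in feature_known_on.items() if known_on > prediction_date
    )


# Hypothetical features and the date on which each value is first known.
feature_known_on = {
    "income": date(2018, 1, 5),
    "age": date(2018, 1, 5),
    "days_past_due_after_grant": date(2018, 6, 30),  # observed after the decision
}
leaks = find_false_predictors(feature_known_on, prediction_date=date(2018, 1, 10))
print(leaks)  # features to exclude from the development sample
```

A feature flagged this way will look excellent on the development sample and be unavailable in production.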
There is too little time to complete the task / model
Problem:
• Data problems
• Getting stuck in preprocessing
• The implementation takes too long
• Too little experience
Solution:
• Prepare a full product as soon as possible, e.g.:
• a cut-down version of all the functionality, e.g. a scoring application with a simple / dummy model
• full code for building the model, but using simpler methods
• Improve it in the next iterations
• Use CRISP-DM / a checklist to support your memory
• Usually you can start implementation from the first product version
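A first product version with a dummy model might be sketched as follows. The class `DummyModel`, the function `score_records` and the training data are hypothetical names and values chosen for illustration. The point is only that the scoring application works end to end, so a better model can be plugged in later.

```python
from statistics import mean
from typing import List, Sequence


class DummyModel:
    """Placeholder model: always predicts the mean of the training target."""

    def fit(self, features: Sequence[Sequence[float]], target: Sequence[float]) -> "DummyModel":
        self.prediction_ = mean(target)
        return self

    def predict(self, features: Sequence[Sequence[float]]) -> List[float]:
        return [self.prediction_ for _ in features]


def score_records(model, records: Sequence[Sequence[float]]) -> List[float]:
    """The scoring application: any object with predict() can be plugged in later."""
    return model.predict(records)


# Hypothetical training data: two features per record and a numeric target.
train_x = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0]]
train_y = [10.0, 20.0, 30.0]
model = DummyModel().fit(train_x, train_y)
scores = score_records(model, [[4.0, 1.0], [5.0, 0.0]])
print(scores)
```

Because `score_records` depends only on `predict()`, the dummy model can be replaced in a later iteration without touching the application code.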
The way you prepare the result (a model, a data product) doesn’t matter
Problem:
• I want a model. It must work. I don’t care how you’ll build it. Just build it!
• The process is crucial
• If it is wrong, the analysis is not fully reproducible
• We take on technical debt
• and sooner or later we will be forced to pay it back
Solution:
• Build models in a fully reproducible way
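Two small habits that support reproducibility are drawing every random choice from one seeded generator and recording a fingerprint of the input data. The sketch below is a minimal illustration under those assumptions. `data_fingerprint` and `build_model` are hypothetical helpers, and the "model" is only a stand-in for a real build step.

```python
import hashlib
import json
import random


def data_fingerprint(rows) -> str:
    """SHA-256 of the training rows, so a run can state exactly which data it used."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def build_model(rows, seed: int) -> dict:
    """Stand-in for model building: all randomness comes from one seeded generator."""
    rng = random.Random(seed)
    sample = rng.sample(rows, k=len(rows) // 2)  # e.g. a random training subsample
    return {"seed": seed, "data_sha256": data_fingerprint(rows), "sample": sample}


rows = [[i, i * 2] for i in range(10)]  # hypothetical data
first = build_model(rows, seed=42)
second = build_model(rows, seed=42)
print(first == second)  # the same seed and data give the same result
```

Storing the seed and the data hash next to the model makes it possible to check later whether a rebuild used the same inputs.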
Implementation: I’m sure it’ll work out somehow
Problem:
• Implementations without planned tests usually fail
• What is really painful is that it takes time to realize that they have failed (the model runs and generates risk)
Solution:
• Plan both the implementation and the tests
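One kind of planned test compares the deployed scoring function with reference scores computed during development. The sketch below assumes such reference cases were stored before the rollout. `check_deployment`, `deployed_score` and the numbers are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple


def check_deployment(
    score: Callable[[Sequence[float]], float],
    reference_cases: List[Tuple[Sequence[float], float]],
    tolerance: float = 1e-9,
) -> List[int]:
    """Return indexes of reference cases where the deployed score differs from the expected one."""
    return [
        i
        for i, (record, expected) in enumerate(reference_cases)
        if not math.isclose(score(record), expected, abs_tol=tolerance)
    ]


def deployed_score(record: Sequence[float]) -> float:
    """Hypothetical production scoring function (a fixed linear score)."""
    return 0.5 * record[0] + 0.25 * record[1]


# Reference cases computed in development and stored before the rollout.
reference_cases = [([2.0, 4.0], 2.0), ([0.0, 8.0], 2.0), ([1.0, 1.0], 0.75)]
failures = check_deployment(deployed_score, reference_cases)
print(failures)  # an empty list means the deployed scores match the reference
```

Running such a check on a schedule, not only at rollout, shortens the time it takes to notice that an implementation has failed.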
AI. We desperately need AI!
Problem:
• We don’t need it
• Predictive modeling is not AI!
• Sometimes full control over a model is more important than predictive power
Solution:
• Think about what we’d like to achieve and how to do it
• Data-driven decision making is more important
A model just learns everything it is exposed to
Problem:
• You need to promise self-learning to sell a service or software
• But a model will not learn automatically unless it is fed suitable data
• In many situations you don’t have the data needed to design a feedback loop
Solution:
• Analyze the process that generates the data for the development sample
• Put aside a “not touched” sample
• The model will be trained on a sample and refined on an ongoing basis
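A “not touched” sample stays untouched only if the same records are set aside in every run. One way to get that, sketched below, is to assign records by a hash of their identifier. The function names, the 20% share and the record ids are assumptions made for illustration.

```python
import hashlib
from typing import Iterable, List, Tuple


def is_held_out(record_id: str, holdout_percent: int = 20) -> bool:
    """Assign a record to the untouched sample from a hash of its id (stable across runs)."""
    digest = hashlib.sha256(record_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < holdout_percent


def split(record_ids: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (development ids, untouched ids)."""
    development, untouched = [], []
    for record_id in record_ids:
        (untouched if is_held_out(record_id) else development).append(record_id)
    return development, untouched


ids = [f"customer-{i}" for i in range(1000)]  # hypothetical record ids
development, untouched = split(ids)
print(len(development), len(untouched))
```

Because the assignment depends only on the id, new records can be added later without moving existing ones between the two samples.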
Start modeling from using Deep Learning!
Problem:
• But everybody uses it…
• No!!!
• Many problems are too simple for DL
• In particular, the problems with data in a data frame
Solution:
• Random Forest, xgboost
If we have 3000 classes then let’s build a BIG classifier
Problem:
• For example, when we’d like to recommend bank products
• A random classifier for 3000 classes has an error of 2999/3000 ≈ 99.97% (not 50%)
• Usually the dataset is too small
Solution:
• It’s (usually) good to use a simpler method
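The chance-level error quoted above follows from 1 − 1/k for k classes. The sketch below assumes uniform random guessing over equally frequent classes, and the function name `chance_error` is hypothetical.

```python
def chance_error(n_classes: int) -> float:
    """Expected error of uniform random guessing over equally frequent classes."""
    return 1 - 1 / n_classes


for n_classes in (2, 3000):
    print(n_classes, f"{chance_error(n_classes):.2%}")
```

For 2 classes this gives 50.00% and for 3000 classes 99.97%, so the baseline a classifier has to beat depends heavily on the number of classes.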
You can do calculations using a laptop
Problem:
• Sometimes yes, you can
• But usually you cannot
• Usually it doesn’t make any sense: a person’s time is more expensive than a machine’s time
Solution:
• It is good to invest some money in hardware
• or use AWS from Amazon (or something similar)
Commercial software is excellent
Problem:
• Users often say that it is excellent until it is bought
• The problems appear later
Solution:
• Test it in conditions similar to those in which it will be used
• Think seriously about using open source
Free software is excellent (and it’s free!)
Problem:
• It’s free in terms of purchase cost
• It’s not simply excellent: the cost is the necessity of having qualified people on board and of developing software
• Inconvenient problems happen
Solution:
• Use it as it should be used
• i.e. write clear and clean code, use additional tools, e.g. VCS
• Make sure the team has the skills needed
All companies have Data Science teams. Let’s build one for us!
Problem:
• It’s possible to build a team. It will take a lot of time and a lot of money
• If the results are wasted, the people will leave
• They need to have fun working on projects
• If I need a plank, do I really need to buy a sawmill?
Solution:
• Be sure that we know how to use their results and that they will give value to the business
• A PoC can be outsourced. The first data science project can be outsourced
A student or a freshman is enough to deliver profits from deep analytics to the business
Problem:
• If someone can cut with a scalpel, do we call him a surgeon?
• Why is someone who can (technically) build a model from a data frame called a Data Scientist?
• Data Scientist is a profession: experience matters!
• People without experience usually don’t bring any business value to a company, even after spending a year working with data (!)
Solution:
• Hire experienced people, especially at the beginning of a DS journey
• Let them teach the freshmen
• But what if you don’t have experienced people?
• Invest time, effort, and money in your team. Let a more business-oriented analyst control the team
The team will learn everything from online courses
Problem:
• I’ll give each of you $20 (OK, even $50), now go and learn everything online
• It’s true, the team will learn some things
• But not the most important ones
• Good hands-on training cannot be substituted
Solution:
• Learning by doing (and applying)
• Control and stimulate learning
• Buy knowledge
Summary
• To avoid mistakes it is good to ask ourselves questions like these (and answer them):
• What business problem are we solving?
• What business value can we get from the results?
• What could be lost in translation from business into analytics?
• Do we have adequate and representative data?
• What process generates the data? What is it influenced by?
• What is the model-building process?
• What analytical tools should be used? Could we apply simpler approaches?
• How do we control all the risks?
• It is good to do it repeatedly
• It’s best to involve someone experienced
• It’s beneficial to educate the receivers of the results
Contact
• During the conference!
• After the conference: artur [at] quantup [dot] eu