Beat the benchmark.
Getting started with competitive data mining
By
Maheshakya Wijewardena
What is competitive data mining and why?
● Gap between those who have data and those who can analyze it.
Organizations need to make use of their massive amounts of data, but at lower cost.
Promote and expand research on applications and data models.
Challenges organized by SIGKDD, ICML, PAKDD, ECML, NIPS, etc. to promote these practices.
Find talent, attract skills...
● e.g. Facebook, Yahoo, Yelp, ...
2
What is competitive data mining and why?
“I keep saying the sexy job in the next ten years will be statisticians.”
- Hal Varian, Google chief economist, 2009
3
What is competitive data mining and why?
● Kaggle?
“is a platform for data prediction competitions that allows organizations to post
their data and have it scrutinized by the world’s best data scientists.”
- verbatim
4
An outline
● Types of challenges
● Understanding the challenge
● Setting things up
● Analyzing data
● Data preprocessing
● Training models
● Validating models
● ML/Statistics packages
● Conclusion
5
Types of competitions
Those well-known tasks you find in a data mining class...
● Most of them are classification
○ Binary or probability
○ Rarely multiclass
● Time series forecasting
○ Predict for some period ahead
○ Seasonal patterns
● Anomaly detection
The majority of competitions focus on the results, not the process.
But some give high priority to the process - scalability, technical feasibility,
complexity, etc. (often for recruitment and research)
6
Before you start...
Be aware of the structure of data mining competitions on Kaggle.
Always remember that the purpose of the predictive model is to predict on data
that we have not seen!
7
Understand what it is about
● Read the problem statement until you understand it completely.
● Keep an eye on the forum, always - know how other competitors think.
● Check dataset sizes! - Can you handle them?
● Competitive advantage - try to get some domain knowledge, though it is not strictly
necessary.
● How do they evaluate, and on what criteria?
○ Area under ROC curve
○ MSE
○ False positive/negative rate
○ Precision - recall
○ ...
8
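For concreteness, here is a minimal Python sketch (using scikit-learn, one of the packages listed later in the deck) of computing a few of the evaluation criteria above. The labels and probabilities are made-up placeholders; a real competition fixes exactly one of these metrics for you.

```python
# Hedged sketch: hypothetical labels/probabilities, common Kaggle-style metrics.
import numpy as np
from sklearn.metrics import roc_auc_score, mean_squared_error, precision_score, recall_score

y_true = np.array([0, 1, 1, 0, 1])            # ground-truth labels (placeholder)
y_prob = np.array([0.2, 0.8, 0.6, 0.4, 0.9])  # predicted probabilities (placeholder)
y_pred = (y_prob >= 0.5).astype(int)          # thresholded class predictions

print("AUC:      ", roc_auc_score(y_true, y_prob))
print("MSE:      ", mean_squared_error(y_true, y_prob))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
```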
Setting things up...
● Boil down the problem into sections
● Organize your team - divide work
● Look at the benchmark code - a good place to start, but it’s not enough!
● Look at sample submission files
And most importantly,
● Set up an environment in which you can iterate and test
new ideas rapidly
9
Analyzing Data
KNOW THY DATA !!!
10
Analyzing Data
● Get to know your data
○ Raw data - image, video, text - do I need to perform feature extraction too?
○ Numerical, categorical
● Visualize! - Histograms, pie charts, cluster diagrams…
○ Advanced - vector quantization, self-organizing maps (SOM)
● Missing values
● Class imbalance
11
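A short sketch of this kind of first look at the data with pandas and matplotlib, assuming a hypothetical train.csv with a target column (the file and column names are placeholders, not part of the original slides):

```python
# Quick exploratory pass: column types, missing values, class balance, histograms.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("train.csv")                      # hypothetical training file

print(df.dtypes)                                   # numerical vs. categorical columns
print(df.isnull().sum())                           # missing values per column
print(df["target"].value_counts(normalize=True))   # class imbalance (hypothetical column)

df.hist(figsize=(10, 8))                           # histograms of numeric features
plt.tight_layout()
plt.show()
```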
Feature engineering and Data Preprocessing
Typical preprocessing techniques:
● Handle missing values - keep, discard, impute
● Resample - up/downsampling
● Encoding
○ Label encoding
○ One-hot encoding / bit maps
● For text - TF-IDF, feature hashing, bag of words, ...
● Dimensionality reduction - PCA, SVD, ...
12
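A minimal scikit-learn sketch of a few of these steps; the tiny arrays and the choice of mean imputation are illustrative assumptions rather than anything prescribed by the slides:

```python
# Impute missing values, one-hot encode a categorical column, then reduce with PCA.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import PCA

X_num = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 5.0]])  # numeric features
X_cat = np.array([["red"], ["blue"], ["red"]])                # categorical feature

X_num = SimpleImputer(strategy="mean").fit_transform(X_num)         # impute missing values
X_cat = OneHotEncoder().fit_transform(X_cat).toarray()              # one-hot encode

X = np.hstack([X_num, X_cat])
X_reduced = PCA(n_components=2).fit_transform(X)                    # dimensionality reduction
print(X_reduced.shape)
```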
Feature engineering and Data Preprocessing
Feature engineering is a bit trickier…
● Identify which features are the most important/impactful.
○ Feature selection
○ Strong dependency on the learning algorithm
○ Recursive feature elimination
● Eliminate (trivially) irrelevant features - IDs, timestamps (sometimes)
● Derived features?
13
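As an illustration, a small scikit-learn sketch of recursive feature elimination on synthetic data; the estimator (logistic regression) and the number of features to keep are arbitrary assumptions:

```python
# Recursive feature elimination (RFE): repeatedly drop the weakest feature.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4)
selector.fit(X, y)

print(selector.support_)   # boolean mask of the selected features
print(selector.ranking_)   # ranking of all features (1 = selected)
```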
Important !
Make sure you have your own evaluation metric implemented.
When evaluating your models:
● A simple training/validation split is not enough.
○ K-fold cross-validation uses every fold for training, even though a sample is held out in each round.
● Always keep a separate hold-out set that you do not touch at all during the model
building process
○ Including preprocessing
14
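One possible way to set this up in scikit-learn: carve out the hold-out set first, then run K-fold cross-validation on the remaining training data only. The model, the synthetic data, and the use of AUC as the metric are assumptions for the sake of the example.

```python
# Hold-out split first, then 5-fold cross-validation on the training portion only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Hold-out set left untouched during model building (including preprocessing).
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0)
scores = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
print("5-fold AUC: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```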
Typical model building process
[Flow diagram: split the data into a training set and a hold-out set for validation →
preprocess the training set → implement and train the model → preprocess the hold-out set →
evaluate on the hold-out set. Good? Keep the model. Bad? Be brave and scrap the model!]
15
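A sketch of one iteration of this loop, assuming scikit-learn: a Pipeline bundles preprocessing with the model so the preprocessing is fitted only on the training set, and the hold-out set is only ever transformed and scored. The scaler and classifier here are placeholder choices.

```python
# One pass of the build-evaluate loop with preprocessing kept inside a Pipeline.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.2, random_state=0)

pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])
pipe.fit(X_train, y_train)                                   # preprocess + train on training set only

hold_auc = roc_auc_score(y_hold, pipe.predict_proba(X_hold)[:, 1])
print("Hold-out AUC: %.3f" % hold_auc)                       # good? keep it. bad? scrap the model.
```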
Training models
● Learning algorithm - select carefully based on the problem
● Hyperparameter tuning
○ Grid search
○ Randomized search
○ Manual?
● Be aware of overfitting!
● Ensemble methods:
○ Bagging
○ Boosting
○ Model ensembling - convex combinations
No matter what models you train, winning solutions will always be ensembles.
16
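A brief sketch, again with scikit-learn, of grid-search tuning plus a convex-combination ensemble of two models' predicted probabilities; the parameter grid and the 0.6/0.4 weights are arbitrary assumptions, not recommendations from the slides.

```python
# Grid search for hyperparameters, then blend two models' probabilities.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    {"n_estimators": [100, 300], "max_depth": [3, None]},
                    scoring="roc_auc", cv=5)
grid.fit(X_train, y_train)                                     # tuned random forest

gbm = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)  # boosted trees

# Convex combination of the two models' probabilities (weights sum to 1).
p_ensemble = 0.6 * grid.predict_proba(X_val)[:, 1] + 0.4 * gbm.predict_proba(X_val)[:, 1]
print("Ensemble AUC: %.3f" % roc_auc_score(y_val, p_ensemble))
```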
Model Validation
● Get the score of your model from your evaluator.
○ Bad? - Keep it aside and design a new model
○ Good? - go ahead and predict for the test set
● Even if an individual model performs poorly, it might fit gracefully into an
ensemble.
● Confusion matrix
● Try to visualize predicted vs. actual
○ Against each feature
○ Gives you insight into which characteristics of the features make the model better or worse
● Keep records.
17
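A small sketch of these checks on synthetic data: print the confusion matrix and plot predicted vs. actual against a single feature. The model and data here are placeholders.

```python
# Confusion matrix plus a predicted-vs-actual plot against one feature.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_val)

print(confusion_matrix(y_val, y_pred))   # rows: actual class, columns: predicted class

plt.scatter(X_val[:, 0], y_val, label="actual", alpha=0.5)
plt.scatter(X_val[:, 0], y_pred, label="predicted", marker="x", alpha=0.5)
plt.xlabel("feature 0")
plt.legend()
plt.show()
```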
Final steps...
Submissions:
● Try to submit something every day - know your position
● Keep updated
● Don’t make changes to your model that yield only slight improvements on the public
leaderboard - often a trap!
Don’t forget the forum!
● If you have something interesting, share it with others - but not everything ;)
● Good Kagglers always give something back.
18
About ML/Stat packages...
● Machine learning packages:
○ R
○ scikit-learn
○ pylearn
○ mlpack
○ Shogun
○ Spark/H2O - scalable, distributed processing, but limited functionality.
● Statistics
○ Again R
○ statsmodels
● Data manipulation
○ Again R
○ pandas, NumPy, SciPy
● Visualization
○ Again R
○ matplotlib
Sometimes,
● Deep learning - Theano
● NLP - NLTK
Emerging - Julia
19
Conclusion
● First, try out some “getting started” competitions - take advantage of them
● When analyzing data - be patient, be meticulous
● Visualize!
● (Some) domain knowledge would be useful
● Feature engineering is (often) the key
● Have the discipline to maintain a proper validation framework
● Be brave!
● Learn from others
● The “right” models
● Use ML/stat packages effectively
● Follow good coding, data manipulation, and software engineering best practices
● Avoid overfitting!
● Luck....
20
No Free Lunch
21
?
22
References
1. Kaggle, https://www.kaggle.com/
2. Krishna Sankar, Hitchhiker’s Guide to Kaggle, http://www.slideshare.net/ksankar/oscon-kaggle20
3. Beth Schultz, 10 Tips for Winning a Data Science Competition, http://www.allanalytics.com/author.asp?doc_id=268513
4. Owen Zhang, “Tips for Data Science Competitions”, http://www.slideshare.net/OwenZhang2/tips-for-data-science-competitions
5. Parsec Labs, https://www.parseclabs.com/knowthydata
23