Corporate Relocation Prediction
A 10-year, 2-million-company use case
Rahul Shetty, Ana Maldonado, Mauricio Rodriguez Lara
Who We Are
Relocation Prediction use case
Problem:
Businesses, schools, hospitals, etc. move locations over time
(growth, bankruptcy, new markets, etc.)
Can we predict if they will relocate?
- To where?
- When?
- Why?
=> For now, we focus only on relocation probability
Relocation Prediction use case
For businesses we have historical corporate data:
- Company size, credit rating, relocation, etc.
=> Can company characteristics predict relocation?
- Useful information for service providers, realtors, city councils,
investors and developers
=> Investigatory POC: 6-week study
- Limit the scope to determine whether relocation can be predicted and,
if so, which properties can serve as a signal
(Big) Data
We encountered some challenges:
- Monthly data from branches of 2 million companies, going back
10 years (~300 million rows)
- Dispersed data: where and how should it be gathered?
- Monthly data too granular: how to aggregate?
- Client did not have a suitable platform for data handling and
analysis...
Data & Modeling Considerations
- High dimensional time series data
- Preserve the temporal granularity to maximize information
- Neural Networks?
- LSTM or CNNs?
- NN design/exploration time > available time
- Simplify data and modeling due to time constraints
Preparing the Data
- Step 1: Collect the data on an appropriate platform:
- Set up Google Cloud Platform (GCP) in one week
- Step 2: Aggregate the data
- From monthly to yearly: predict relocation based on
yearly data
- Choose how to deal with categorical variables
- Subsequent steps: spawn virtual machine(s) on GCP for
modeling
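The monthly-to-yearly aggregation can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the record layout (company id, month, employee count, postcode), the choice of the mean as aggregate, and the rule "more than one postcode within a year means a relocation" are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical monthly records: (company_id, "YYYY-MM", employees, postcode)
monthly = [
    ("c1", "2015-01", 10, "1011"),
    ("c1", "2015-06", 14, "1011"),
    ("c1", "2016-02", 15, "1011"),
    ("c1", "2016-09", 15, "3511"),  # postcode changes during 2016
    ("c2", "2015-03", 50, "2511"),
    ("c2", "2016-03", 48, "2511"),
]

def aggregate_yearly(rows):
    """Collapse monthly rows into one record per (company, year)."""
    groups = defaultdict(list)
    for company, month, employees, postcode in rows:
        groups[(company, int(month[:4]))].append((employees, postcode))
    yearly = {}
    for key, items in groups.items():
        yearly[key] = {
            "mean_employees": mean(e for e, _ in items),
            # Simplification: only moves visible within one year are flagged.
            "has_relocated": len({p for _, p in items}) > 1,
        }
    return yearly

def attach_next_year_target(yearly):
    """Pair year-Y features with the 'has_relocated' flag of year Y+1."""
    dataset = []
    for (company, year), feats in sorted(yearly.items()):
        nxt = yearly.get((company, year + 1))
        if nxt is not None:  # the last available year has no target yet
            dataset.append((company, year, feats["mean_employees"], nxt["has_relocated"]))
    return dataset

dataset = attach_next_year_target(aggregate_yearly(monthly))
for row in dataset:
    print(row)
```

Each output row carries features from one year and the target from the following year, so the final year of every company drops out because it has no label.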
Summary Statistics
- Final dataset: 75 features from 1 year and ‘has_relocated’
target from following year
- 2 million entries per year
- ~5% relocation (imbalanced dataset)
- Goal: Build a model that predicts ‘has_relocated’ better than
the trivial baseline (always predicting "no relocation" is already
~95% accurate)
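The arithmetic behind the 95% figure can be checked with a few lines. This sketch uses a made-up sample of 100 companies with 5 relocations to mirror the ~5% rate; the helper names are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred):
    """Fraction of actual relocations that were predicted."""
    positives = sum(y_true)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return hits / positives

# 100 hypothetical companies, 5 of which relocate (1 = relocated).
y_true = [1] * 5 + [0] * 95
always_stay = [0] * 100  # trivial model: nobody ever relocates

baseline_accuracy = accuracy(y_true, always_stay)
baseline_recall = recall(y_true, always_stay)
print(baseline_accuracy, baseline_recall)
```

The trivial model reaches 0.95 accuracy while finding none of the relocations, which is why accuracy alone is a weak yardstick on this dataset.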
Modeling step 1: Exploring Models
- Apply binary classification algorithms: SVM, logistic regression,
decision trees (DT), random forests (RF)
- Choose models with best performance: AUC, kappa
- DTs and RFs did best
- Apply sampling techniques to improve models
- Tune model parameters
- Validate
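The two selection metrics named above, AUC and kappa, can be written out in plain Python to show what they measure. This is a small sketch on toy labels and scores, not output from the study; in practice a library implementation would normally be used.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def cohen_kappa(y_true, y_pred):
    """Agreement between labels and predictions, corrected for chance."""
    n = len(y_true)
    p_observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    p_true1 = sum(y_true) / n
    p_pred1 = sum(y_pred) / n
    p_chance = p_true1 * p_pred1 + (1 - p_true1) * (1 - p_pred1)
    if p_chance == 1:
        return 0.0
    return (p_observed - p_chance) / (1 - p_chance)

# Toy example: 3 non-movers, 2 movers, with made-up model scores.
y_true = [0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.3]
auc = roc_auc(y_true, scores)  # 4 of the 6 positive/negative pairs are ordered correctly

# A model that never predicts relocation agrees with the labels only by chance.
labels = [1] * 5 + [0] * 95
kappa_trivial = cohen_kappa(labels, [0] * 100)
print(round(auc, 3), kappa_trivial)
```

Unlike accuracy, both metrics give no credit to a model that simply predicts the majority class: its kappa is zero, and a constant score yields an AUC of 0.5.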
Modeling step 2: Results
[Figure: ROC curve (TPR vs. FPR), AUC: 0.66]
Best DT model produced by undersampling the data, 5-fold CV, and DT
parameters explored via grid search
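The combination of undersampling, 5-fold CV and grid search can be sketched with scikit-learn as below. Everything specific here is an assumption for illustration: the data is synthetic, the parameter grid is arbitrary, and the slides do not say whether undersampling was applied before or inside the CV folds (this sketch undersamples the training split once and evaluates on an untouched test split).

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data with a rare positive class.
X = rng.normal(size=(4000, 5))
logits = 1.5 * X[:, 0] - 1.0 * X[:, 1] - 3.6
y = (rng.random(4000) < 1 / (1 + np.exp(-logits))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

def undersample(X, y, rng):
    """Keep all positives and an equally sized random subset of negatives."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    keep_neg = rng.choice(neg, size=len(pos), replace=False)
    idx = np.concatenate([pos, keep_neg])
    rng.shuffle(idx)
    return X[idx], y[idx]

X_bal, y_bal = undersample(X_train, y_train, rng)

# Grid search over a small, arbitrary set of tree parameters, scored by AUC.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6], "min_samples_leaf": [5, 20]},
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_bal, y_bal)

# Evaluate on the untouched, still imbalanced test split.
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(search.best_params_, test_auc)
```

Scoring on the original class distribution keeps the reported AUC comparable to what the model would see in production, where relocations remain rare.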
Modeling Results: Features
The most important features influencing the ‘has_relocated’ target
were related to:
- Company financial assessments and health
- Company age
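One common way to obtain such a ranking from a decision tree is its impurity-based feature importance, sketched here with scikit-learn. The slides do not state which importance measure was used, the feature names are hypothetical, and the data is synthetic and constructed so that the first feature dominates.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical feature names; the target is built mostly from the first one.
feature_names = ["financial_health_score", "company_age", "unrelated_noise"]
X = rng.normal(size=(2000, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 1.0).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Impurity-based importances sum to 1 across features.
ranked = sorted(
    zip(feature_names, tree.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

Impurity-based importances can favor features with many distinct values, so a ranking like this is a starting point for investigation rather than proof of influence.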
Validation
How well can yearly models predict the next year’s relocation?
… in general, rather well
… except for 2016 (?)
[Figure: AUC of the yearly models]
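One plausible reading of this validation scheme is: fit a model on one year's data, then score it on the following year's data and record the AUC. The sketch below does that with scikit-learn on synthetic yearly datasets; the years, sample sizes and model settings are assumptions, and it is not meant to reproduce the 2016 anomaly.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)

def make_year(n=1500):
    """Synthetic (features, target) pair standing in for one year of data."""
    X = rng.normal(size=(n, 3))
    p = 1 / (1 + np.exp(-(1.2 * X[:, 0] - 2.5)))
    y = (rng.random(n) < p).astype(int)
    return X, y

data = {year: make_year() for year in range(2013, 2017)}

# Train on year Y, evaluate on year Y+1, and keep one AUC per evaluated year.
auc_by_year = {}
years = sorted(data)
for train_year, test_year in zip(years, years[1:]):
    X_tr, y_tr = data[train_year]
    X_te, y_te = data[test_year]
    model = DecisionTreeClassifier(max_depth=3, min_samples_leaf=20, random_state=0)
    model.fit(X_tr, y_tr)
    auc_by_year[test_year] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

for year, auc in auc_by_year.items():
    print(year, round(auc, 3))
```

A sharp drop in one year's AUC under this scheme would point to a shift between that year's data and the preceding year's, which is the kind of anomaly worth a separate investigation.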
Takeaways
- Company properties can be indicative of whether a company relocates
- Yearly aggregated data is sufficient for high-level indications of
relocation
- More granular modeling (e.g. with NN) may provide additional
information
- Possible to perform a successful POC on big data within 6 weeks
on GCP
Future work
With more time, we would:
- Do full time series modeling
- Explore NNs, hierarchical modeling, etc.
- Automate prediction, given company characteristics
- Investigate the anomalous year
- Make use of modeling results:
Any questions?
rshetty@qualogy.com
Thank you!

Rahul Shetty - Corporate relocation prediction - Codemotion Amsterdam 2019
