Rahul Shetty - Corporate relocation prediction - Codemotion Amsterdam 2019
- 4. Relocation Prediction use case
Problem:
businesses, schools, hospitals, etc. move locations over time
(growth, bankruptcy, new markets, etc.)
Can we predict if they will relocate?
- To where?
- When?
- Why?
=> For now, we focus only on
relocation probability
- 5. For businesses we have historical Corporate Data:
- Company size, credit rating, relocation, etc...
=> Can company characteristics predict relocation?
- Useful information for service providers, realtors, city councils,
investors and developers
=> Investigatory POC: a 6-week study
- Limit the scope to determining whether relocation can be predicted
and, if so, which properties can serve as a signal
- 6. (Big) Data
We encountered some challenges:
- Monthly data from branches of 2 million companies, going back
10 years… ~ 300 million rows
- Dispersed data: where/how should it be gathered?
- Monthly data too granular: how to aggregate?
- Client did not have a suitable platform for data handling and
analysis...
- 7. Data & Modeling Considerations
- High dimensional time series data
- Preserve the temporal granularity to maximize information
- Neural Networks?
- LSTM or CNNs?
- NN design/exploration time > available time
- Simplify data and modeling due to time constraints
- 8. Preparing the Data
- Step 1: Collect the data on an appropriate platform:
- Set up Google Cloud Platform (GCP) in one week
- Step 2: Aggregate the Data
- From monthly to yearly: predicted relocation based on
yearly data
- Choose how to deal with categorical variables
- Subsequent steps: spawn virtual machine(s) on GCP for
modeling
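The monthly-to-yearly aggregation in Step 2 can be sketched as below. This is a minimal pandas illustration with hypothetical column names (`company_id`, `employees`, `credit_rating`), not the actual client schema: numeric features are averaged per year, while a categorical feature takes its last observed value and is then one-hot encoded.

```python
import pandas as pd

# Hypothetical monthly snapshots; column names are illustrative assumptions.
monthly = pd.DataFrame({
    "company_id":    [1, 1, 1, 2, 2, 2],
    "year":          [2015, 2015, 2015, 2015, 2015, 2015],
    "employees":     [10, 12, 14, 100, 100, 90],
    "credit_rating": ["A", "A", "B", "C", "C", "C"],
})

# Numeric features: yearly mean; categorical: last observed value that year.
yearly = monthly.groupby(["company_id", "year"]).agg(
    employees=("employees", "mean"),
    credit_rating=("credit_rating", "last"),
).reset_index()

# One way to deal with the categorical variable: one-hot encoding.
yearly = pd.get_dummies(yearly, columns=["credit_rating"], prefix="rating")
```

At real scale (~300 million rows) the same aggregation would run on the cloud platform rather than in-memory pandas, but the logic is identical.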
- 9. Summary Statistics
- Final dataset: 75 features from one year plus a ‘has_relocated’
target from the following year
- 2 million entries per year
- ~5% relocation (imbalanced dataset)
- Goal: Build a model that predicts ‘has_relocated’ better than
the trivial baseline (always predicting ‘no relocation’ is already
~95% accurate)
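The baseline can be made concrete with a quick calculation: with roughly 5% positives, a classifier that always predicts "no relocation" already scores about 95% accuracy (and a Cohen's kappa of 0), which is why accuracy alone is a poor yardstick here.

```python
# With ~5% relocations, always predicting "no relocation" is ~95% accurate.
n_total = 2_000_000                 # entries per year
n_relocated = int(0.05 * n_total)   # ~5% relocations
baseline_accuracy = (n_total - n_relocated) / n_total
print(baseline_accuracy)  # → 0.95
```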
- 10. Modeling step 1: Exploring Models
- Apply binary classification algorithms: SVM, logistic regression,
decision trees (DT), random forests (RF)
- Choose models with the best performance (AUC, Cohen's kappa)
- DTs and RFs performed best
- Apply Sampling Techniques to improve models
- Tune model parameters
- Validate
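The exploration step could look roughly like the sketch below: the four algorithm families are compared by cross-validated AUC on a synthetic imbalanced dataset (75 features, ~5% positives) standing in for the real corporate data. Dataset sizes and model settings are illustrative assumptions, not the talk's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 75 features, ~5% positive ("has_relocated") class.
X, y = make_classification(n_samples=2000, n_features=75,
                           weights=[0.95], random_state=0)

models = {
    "svm": SVC(),  # roc_auc scorer uses its decision_function
    "logreg": LogisticRegression(max_iter=1000),
    "dt": DecisionTreeClassifier(random_state=0),
    "rf": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean cross-validated AUC per model family.
scores = {name: cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
          for name, m in models.items()}
```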
- 11. Modeling step 2: Results
- ROC curve shown (TPR vs. FPR), AUC: 0.66
- Best DT model produced by undersampling the data, 5-fold CV, and
DT parameters explored via grid search
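A hedged sketch of how that winning setup might be reproduced with scikit-learn, on synthetic data: undersample the majority class, then grid-search decision tree parameters with 5-fold CV scored by AUC. The parameter grid and dataset sizes are illustrative assumptions, not the talk's actual settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (~5% positives) as a stand-in for the real set.
X, y = make_classification(n_samples=4000, n_features=20,
                           weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Undersample the majority class to match the minority class size.
rng = np.random.default_rng(0)
pos = np.flatnonzero(y_tr == 1)
neg = rng.choice(np.flatnonzero(y_tr == 0), size=len(pos), replace=False)
idx = np.concatenate([pos, neg])
X_bal, y_bal = X_tr[idx], y_tr[idx]

# Grid search over DT parameters with 5-fold CV, scored by AUC.
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={"max_depth": [3, 5, 10],
                                "min_samples_leaf": [1, 10, 50]},
                    cv=5, scoring="roc_auc")
grid.fit(X_bal, y_bal)

# Evaluate on the untouched (still imbalanced) test split.
auc = roc_auc_score(y_te, grid.predict_proba(X_te)[:, 1])
```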
- 12. Modeling Results: Features
The most important features influencing the ‘has_relocated’ target
were related to:
- Company financial assessments and health
- Company age
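Feature importances of this kind typically come straight from the fitted tree. A small sketch with invented feature names (the real client schema is not public):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Illustrative feature names; assumptions, not the actual client schema.
names = ["credit_rating", "solvency", "company_age", "n_employees", "sector"]

X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           random_state=0)
tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)

# Rank features by the tree's impurity-based importance (sums to 1).
ranked = sorted(zip(names, tree.feature_importances_),
                key=lambda p: p[1], reverse=True)
```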
- 15. Validation
How well can yearly models predict the next year’s relocation?
… in general, rather well
… except for 2016 (?)
(chart: AUC per year)
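The year-over-year validation protocol (train on year t, score on year t+1 by AUC) can be sketched as below. The per-year datasets here are synthetic stand-ins, so the AUC values themselves are meaningless; only the protocol mirrors the slide.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

# One synthetic "yearly extract" per year, standing in for the real data.
years = {y: make_classification(n_samples=1000, n_features=10,
                                weights=[0.95], random_state=y)
         for y in range(2012, 2017)}

# Train on year t, evaluate AUC on year t+1.
aucs = {}
for year in range(2012, 2016):
    X_tr, y_tr = years[year]
    X_te, y_te = years[year + 1]
    model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)
    aucs[year + 1] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
```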
- 16. Takeaways
- Company properties can be indicative of whether they relocate
- Yearly aggregated data is sufficient for high-level indications of
relocation
- More granular modeling (e.g. with NNs) may provide additional
information
- Possible to perform a successful POC on big data within 6 weeks
on GCP
- 17. Future work
With more time, we would have:
- Pursued full time series modeling (NN, hierarchical modeling, etc.)
- Automated prediction, given company characteristics
- Investigated the anomalous year (2016)
- Made use of the modeling results