- 1. How Predictive Modelers Should Think about Big Data
Dean Abbott
Co-Founder and Chief Data Scientist, SmarterHQ
dabbott@smarterhq.com
Twitter: @deanabb
- 5. 5
The Usual Big Data Talk Track
© Abbott Analytics and SmarterHQ, Inc., All Rights Reserved
http://whatis.techtarget.com/definition/3Vs
- 6. 6
Big Data: Google Trends
[Figure: Google Trends chart; x-axis labels Jan 1, 2004, Aug 1, 2008, Mar 1, 2013]
- 10. 10
What is Big Data?
https://www.pinterest.com/pin/30962316158410859/
- 11. 11
How Much Data is Big?
More data than you can process efficiently
ISBN-13: 978-1118824825
- 15. 15
“70 percent of US millennials say they would appreciate a brand or retailer using AI technology to show more interesting products. And 72 percent believe that as the technology develops, brands using AI will be able to accurately predict what they want.”
- 16. 16
https://venturebeat.com/2016/09/26/how-a-i-is-helping-retailers/
“The future of retail technology lies in solutions that are powered by machine learning, which can provide fast and intelligent automation as well as dynamic scalability. Machine learning unleashes powerful self-adapting algorithms to uncover latent patterns of behavior that are difficult or impossible for decision-makers to discover on their own.”
- 17. 17
“Extreme Personalization”
“…modern commerce continues to evolve from ‘what’s new’ to the ‘next-new’ player on the block. To compete, every company — brick-and-mortar, e-commerce, and modern commerce — needs to perpetually innovate on every front.”
“Engagement, not reach: AI and machine learning is advancing engagement tools to scale cross-channel, personalized messaging in the moments that matter in the channel customers prefer.”
- 18. 18
Big Data Means Integrating Lots of Sources
Database
CRM
Flat Files
IoT
ETL
- 20. 20
“The vast majority of the challenges companies struggle with as they operationalize Big Data are related to people, not technology: issues like organizational alignment, business process and adoption, and change management.”
https://hbr.org/2016/02/just-using-big-data-isnt-enough-anymore
- 24. 24
Big Data Can Cause RAM Problems
Rows         Columns   GB
250,000      100       0.19
250,000      1,000     1.87
1,000,000    1,000     7.48
10,000,000   1,000     74.77
10,000,000   10,000    747.66
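The figures in the table are consistent with a dense matrix that stores every cell as an 8-byte value (rows × columns × 8 bytes). The sketch below reproduces that arithmetic under this assumption; the function name is illustrative, container overhead is ignored, and results may differ from the slide in the second decimal place depending on how a gigabyte is rounded.

```python
def dense_matrix_gib(rows: int, cols: int, bytes_per_value: int = 8) -> float:
    """Approximate RAM for a dense rows x cols numeric matrix, in GiB (2**30 bytes)."""
    return rows * cols * bytes_per_value / 2**30


# The five shapes from the table
SHAPES = [
    (250_000, 100),
    (250_000, 1_000),
    (1_000_000, 1_000),
    (10_000_000, 1_000),
    (10_000_000, 10_000),
]

for rows, cols in SHAPES:
    print(f"{rows:>12,} x {cols:>6,}: {dense_matrix_gib(rows, cols):8.2f} GiB")
```

Because the estimate is a simple product, multiplying either the row count or the column count by ten multiplies the memory needed by ten.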
- 26. 26
Big Data Can Overwhelm -> Width
• Adding features & interactions makes big data bigger (worse computationally!)
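To make the width problem concrete, the sketch below counts the columns that result from adding every two-way interaction to a set of inputs. It assumes only pairwise products are added (no higher-order terms), and the function name is illustrative.

```python
from math import comb


def width_with_pairwise_interactions(n_features: int) -> int:
    """Columns after adding every two-way interaction to n_features inputs."""
    return n_features + comb(n_features, 2)


for n in (19, 100, 1_000):
    print(f"{n:>5,} inputs -> {width_with_pairwise_interactions(n):>7,} columns")
```

The interaction count grows with the square of the number of inputs, so a table that is already wide becomes far wider once interactions are included.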
- 27. 27
Big Data Can Mislead
2X
8X
- 35. 36
http://www.kdnuggets.com/2015/08/big-data-question-hadoop-spark.html
- 39. 40
Parallelize Building Predictive Models Themselves
• The Target(s): Columns
– Suitable for same types of models for multiple target variables
Days to Next Purchase <= 1 day
Days to Next Purchase <= 3 days
Days to Next Purchase <= 7 days
Days to Next Purchase <= 15 days
Days to Next Purchase <= 30 days
Days to Next Purchase 30-60 days
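One way to picture parallelizing over targets is to submit one independent fitting task per target column. The sketch below is a toy illustration, not the approach used in the talk: the customer data, the single "recency" input, and the one-split model are all hypothetical, and a thread pool stands in for whatever process pool or cluster would carry real CPU-bound model fitting.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy data: (recency_in_days, days_to_next_purchase) per customer.
CUSTOMERS = [(2, 1), (5, 3), (9, 6), (14, 12), (20, 25), (40, 45), (3, 2), (30, 28)]
WINDOWS = [1, 3, 7, 15, 30]  # the "<= N days" targets listed on the slide


def fit_stump(window: int) -> tuple[int, int, float]:
    """Fit a one-split rule (predict True when recency <= cut) for one binary target."""
    xs = [recency for recency, _ in CUSTOMERS]
    ys = [days <= window for _, days in CUSTOMERS]
    best_cut, best_acc = xs[0], -1.0
    for cut in sorted(set(xs)):
        acc = sum((x <= cut) == y for x, y in zip(xs, ys)) / len(xs)
        if acc > best_acc:
            best_cut, best_acc = cut, acc
    return window, best_cut, best_acc


# One task per target column. The fits share inputs but not results, so they
# can run concurrently; threads only illustrate the structure here.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(fit_stump, WINDOWS))

for window, cut, acc in results:
    print(f"target <= {window:>2} days: split at recency <= {cut}, accuracy {acc:.2f}")
```

The same pattern applies to any model type, since each target only needs the shared input columns plus its own label column.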
- 40. 41
NY City Taxi Data
• 5,199,911 observations
• 19 variables
• 1.05 GB
Field Name: Description
VendorID: A code indicating the TPEP provider that provided the record. 1 = Creative Mobile Technologies, LLC; 2 = VeriFone Inc.
tpep_pickup_datetime: The date and time when the meter was engaged.
tpep_dropoff_datetime: The date and time when the meter was disengaged.
Passenger_count: The number of passengers in the vehicle. This is a driver-entered value.
Trip_distance: The elapsed trip distance in miles reported by the taximeter.
Pickup_longitude: Longitude where the meter was engaged.
Pickup_latitude: Latitude where the meter was engaged.
RateCodeID: The final rate code in effect at the end of the trip. 1 = Standard rate; 2 = JFK; 3 = Newark; 4 = Nassau or Westchester; 5 = Negotiated fare; 6 = Group ride
Store_and_fwd_flag: This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server. Y = store and forward trip; N = not a store and forward trip
Dropoff_longitude: Longitude where the meter was disengaged.
Dropoff_latitude: Latitude where the meter was disengaged.
Payment_type: A numeric code signifying how the passenger paid for the trip. 1 = Credit card; 2 = Cash; 3 = No charge; 4 = Dispute; 5 = Unknown; 6 = Voided trip
Fare_amount: The time-and-distance fare calculated by the meter.
Extra: Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges.
MTA_tax: $0.50 MTA tax that is automatically triggered based on the metered rate in use.
Improvement_surcharge: $0.30 improvement surcharge assessed trips at the flag drop.
Tip_amount: Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.
Tolls_amount: Total amount of all tolls paid in trip.
Total_amount: The total amount charged to passengers. Does not include cash tips.
Thanks to Joshua Adams for the Azure test results
https://www.linkedin.com/in/joshuaadams3/
- 41. 42
Processing Results in Azure
Cores     Algorithm      Rows    Features  Elapsed Time
Single    Random Forest  25000   19        0:02:58
Single    Random Forest  50000   19        0:07:08
Single    Random Forest  100000  19        1:11:48
Single    Random Forest  200000  19        1:43:05
Single    Random Forest  400000  19        5:25:05
Single    Random Forest  800000  19        19:25:50
Multiple  Random Forest  25000   19        0:01:32
Multiple  Random Forest  50000   19        0:03:47
Multiple  Random Forest  100000  19        0:34:12
Multiple  Random Forest  200000  19        0:57:16
Multiple  Random Forest  400000  19        1:48:23
Multiple  Random Forest  800000  19        3:48:03
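The elapsed times can be turned into speedup ratios (single-core time divided by multi-core time). The sketch below does that arithmetic on the values copied from the results table; the helper name is illustrative, and the number of cores behind "Multiple" is not stated in the slides, so the ratios say nothing about per-core efficiency.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'h:mm:ss' elapsed time to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# (rows, single-core elapsed, multi-core elapsed) copied from the results table
RUNS = [
    (25_000, "0:02:58", "0:01:32"),
    (50_000, "0:07:08", "0:03:47"),
    (100_000, "1:11:48", "0:34:12"),
    (200_000, "1:43:05", "0:57:16"),
    (400_000, "5:25:05", "1:48:23"),
    (800_000, "19:25:50", "3:48:03"),
]

speedups = {rows: to_seconds(single) / to_seconds(multi) for rows, single, multi in RUNS}

for rows, ratio in speedups.items():
    print(f"{rows:>7,} rows: {ratio:.2f}x faster on multiple cores")
```

On these numbers the ratio is roughly 2x at the smaller row counts and a little over 5x at 800,000 rows.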
- 45. 46
Algorithm searches we don’t always have “time” to do
• K-Means Clustering
– # clusters, optimal set of inputs
• k-NN
– k, find optimal set of inputs
• Neural Networks
– Architecture selection: # hidden layers and # neurons per hidden layer
– Learning parameters (learning rate, momentum for backprop)
• Logistic Regression
– Factorial design / interaction effects
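A short sketch of why these searches get expensive: every combination of settings is a separate training run. The grid below uses the neural network settings named above, but the candidate values are hypothetical and do not come from the talk.

```python
from itertools import product

# Hypothetical search ranges; none of these values come from the talk.
GRID = {
    "hidden_layers": [1, 2, 3],
    "neurons_per_layer": [5, 10, 20, 40],
    "learning_rate": [0.001, 0.01, 0.1],
    "momentum": [0.0, 0.5, 0.9],
}


def expand(grid: dict) -> list[dict]:
    """List every combination of settings as one configuration dict."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*grid.values())]


configs = expand(GRID)
print(len(configs), "independent training runs")  # 3 * 4 * 3 * 3 = 108
print(configs[0])
```

Because the runs do not depend on one another, each configuration can be handed to a separate worker, which is what makes this kind of search a natural fit for parallel or cloud execution.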
- 46. 47
Conclusions
• The Good: Big data + AI is here and decision-makers care
• The Bad: Big data is big, but not smart; requires company buy-in
• The Ugly: Big data stresses infrastructure
• One Solution: cloud computing and parallelization