How Predictive Modelers Should Think about Big Data
Dean Abbott
Co-Founder and Chief Data Scientist, SmarterHQ
dabbott@smarterhq.com
Twitter: @deanabb
© Abbott Analytics and SmarterHQ, Inc., All Rights Reserved
From Olap.com
The Usual Big Data Talk Track
http://whatis.techtarget.com/definition/3Vs
Big Data: Google Trends
Jan 1, 2004 Aug 1, 2008 Mar 1, 2013
What is Big Data?
https://www.pinterest.com/pin/30962316158410859/
How Much Data is Big?
More data than you can process efficiently
ISBN-13: 978-1118824825
Big Data Contains….
“70 percent of US millennials say they would appreciate a brand or retailer using AI technology to show more interesting products. And 72 percent believe that as the technology develops, brands using AI will be able to accurately predict what they want.”
https://venturebeat.com/2016/09/26/how-a-i-is-helping-retailers/
“The future of retail technology lies in solutions that are powered by machine learning, which can provide fast and intelligent automation as well as dynamic scalability. Machine learning unleashes powerful self-adapting algorithms to uncover latent patterns of behavior that are difficult or impossible for decision-makers to discover on their own.”
“Extreme Personalization”
“…modern commerce continues to evolve from ‘what’s new’ to the ‘next-new’ player on the block. To compete, every company — brick-and-mortar, e-commerce, and modern commerce — needs to perpetually innovate on every front.”
“Engagement, not reach: AI and machine learning is advancing engagement tools to scale cross-channel, personalized messaging in the moments that matter in the channel customers prefer.”
Big Data Means Integrating Lots of Sources
• Database
• CRM
• Flat Files
• IoT
• ETL
“The vast majority of the challenges companies struggle with as they operationalize Big Data are related to people, not technology: issues like organizational alignment, business process and adoption, and change management.”
https://hbr.org/2016/02/just-using-big-data-isnt-enough-anymore
Big Data is….
Big Data Contains….
Big Data Can Cause RAM Problems
Rows         Columns   GB
250,000      100       0.19
250,000      1,000     1.87
1,000,000    1,000     7.48
10,000,000   1,000     74.77
10,000,000   10,000    747.66
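The sizes in the table are close to what you get if every cell is held in memory as an 8-byte floating-point value. The slide does not state that assumption, so treat it as a guess. A minimal sketch of the arithmetic, with an illustrative function name:

```python
BYTES_PER_CELL = 8  # assumption: every value is stored as a 64-bit float


def estimate_gib(rows: int, columns: int, bytes_per_cell: int = BYTES_PER_CELL) -> float:
    """Approximate in-memory size of a dense rows x columns table, in GiB."""
    return rows * columns * bytes_per_cell / 2**30


if __name__ == "__main__":
    # Same row/column combinations as the table above.
    for rows, columns in [
        (250_000, 100),
        (250_000, 1_000),
        (1_000_000, 1_000),
        (10_000_000, 1_000),
        (10_000_000, 10_000),
    ]:
        print(f"{rows:>12,} x {columns:>6,}  ~{estimate_gib(rows, columns):8.2f} GiB")
```

Under this assumption the estimates land within about half a percent of the table's figures; the small gap probably reflects a slightly different conversion factor or per-table overhead.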
Big Data Can Overwhelm -> Width
• Adding features & interactions makes big data bigger (and computationally worse!)
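To see how quickly width grows, count the columns that result from adding interaction terms. The sketch below assumes an "interaction" means a product of distinct input features (no squared terms); the function name is illustrative.

```python
from math import comb


def widened_column_count(n_features: int, max_order: int = 2) -> int:
    """Original features plus every interaction of order 2..max_order.

    An order-k interaction is assumed to be a product of k distinct features,
    so there are C(n_features, k) of them.
    """
    return sum(comb(n_features, order) for order in range(1, max_order + 1))


if __name__ == "__main__":
    for n in (100, 1_000):
        print(f"{n:>5,} features -> {widened_column_count(n):>9,} columns with pairwise interactions")
```

With only pairwise interactions, 1,000 input columns become 500,500, so the width of the table, not its length, can be what exhausts memory first.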
Big Data Can Mislead
[Figure labels: 2X, 8X]
The Answer is…
Be Judicious
Leverage Scalable Environments
• Teradata
• Amazon AWS
• Azure
• Google
http://www.kdnuggets.com/2015/08/big-data-question-hadoop-spark.html
Parallelize Record Operations
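Record-level operations parallelize naturally because each row can be processed without looking at any other row. The sketch below splits hypothetical customer records into chunks and hands each chunk to a worker. The field names (`spend`, `visits`) and all function names are made up for illustration. A thread pool is used only to keep the example portable; for CPU-bound work a process pool or a cluster framework would be the more likely choice.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(records, size):
    """Yield consecutive slices of `records` with at most `size` items each."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def spend_per_visit(record):
    """Hypothetical row-level feature; None when there were no visits."""
    visits = record["visits"]
    return record["spend"] / visits if visits else None


def score_chunk(chunk):
    """Apply the row-level function to every record in one chunk."""
    return [spend_per_visit(record) for record in chunk]


def parallel_record_map(records, chunk_size=2, workers=4):
    """Process all records, one chunk per task, preserving the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in submission order, so chunks concatenate cleanly.
        chunk_results = list(pool.map(score_chunk, chunked(records, chunk_size)))
    return [value for chunk in chunk_results for value in chunk]
```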
Parallelize Column Operations
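Column-level operations, such as summary statistics or per-variable transformations, are independent across columns, so each column can be its own task. A minimal sketch, assuming the table is held as a dictionary of column name to list of numeric values (an illustrative layout, not one prescribed by the slides):

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, pstdev


def summarize(column):
    """Return (mean, population standard deviation) for one numeric column."""
    return mean(column), pstdev(column)


def parallel_column_summary(table, workers=4):
    """Summarize every column of `table` as an independent task."""
    names = list(table)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        stats = list(pool.map(summarize, (table[name] for name in names)))
    return dict(zip(names, stats))
```

The same pattern applies to any operation that reads one column at a time, for example binning or scaling each variable.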
Parallelize Building Predictive Models Themselves
• The Target: Column
Days to Next Purchase <= 7 days
• The Target(s): Columns
– Suitable for same types of models for multiple target variables
Days to Next Purchase <= 1 day
Days to Next Purchase <= 3 days
Days to Next Purchase <= 7 days
Days to Next Purchase <= 15 days
Days to Next Purchase <= 30 days
Days to Next Purchase 30-60 days
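Because every target above is derived from the same "days to next purchase" value and uses the same kind of model, the models can be built as independent tasks. The sketch below derives the binary target columns and fits a placeholder per target. The placeholder just returns the positive rate and stands in for real model training. Reading the last target as 30 < days <= 60 is an assumption, and all names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

# Target definitions from the slide. The boundary handling of the last
# entry (30 < days <= 60) is an assumption; the slide only says "30-60 days".
TARGETS = {
    "le_1": lambda d: d <= 1,
    "le_3": lambda d: d <= 3,
    "le_7": lambda d: d <= 7,
    "le_15": lambda d: d <= 15,
    "le_30": lambda d: d <= 30,
    "30_to_60": lambda d: 30 < d <= 60,
}


def build_target(name, days_to_next_purchase):
    """Turn raw day counts into a 0/1 column for one target definition."""
    rule = TARGETS[name]
    return [int(rule(d)) for d in days_to_next_purchase]


def fit_placeholder_model(name, days_to_next_purchase):
    """Stand-in for real training: returns (target name, positive rate)."""
    labels = build_target(name, days_to_next_purchase)
    return name, sum(labels) / len(labels)


def fit_all_targets(days_to_next_purchase, workers=6):
    """Fit one placeholder model per target, each as an independent task."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(fit_placeholder_model, name, days_to_next_purchase)
            for name in TARGETS
        ]
        return dict(future.result() for future in futures)
```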
NY City Taxi Data
• 5,199,911 observations
• 19 variables
• 1.05 GB

Field Name: Description
VendorID: A code indicating the TPEP provider that provided the record.
– 1 = Creative Mobile Technologies, LLC
– 2 = VeriFone Inc.
tpep_pickup_datetime: The date and time when the meter was engaged.
tpep_dropoff_datetime: The date and time when the meter was disengaged.
Passenger_count: The number of passengers in the vehicle. This is a driver-entered value.
Trip_distance: The elapsed trip distance in miles reported by the taximeter.
Pickup_longitude: Longitude where the meter was engaged.
Pickup_latitude: Latitude where the meter was engaged.
RateCodeID: The final rate code in effect at the end of the trip.
– 1 = Standard rate
– 2 = JFK
– 3 = Newark
– 4 = Nassau or Westchester
– 5 = Negotiated fare
– 6 = Group ride
Store_and_fwd_flag: This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server.
– Y = store and forward trip
– N = not a store and forward trip
Dropoff_longitude: Longitude where the meter was disengaged.
Dropoff_latitude: Latitude where the meter was disengaged.
Payment_type: A numeric code signifying how the passenger paid for the trip.
– 1 = Credit card
– 2 = Cash
– 3 = No charge
– 4 = Dispute
– 5 = Unknown
– 6 = Voided trip
Fare_amount: The time-and-distance fare calculated by the meter.
Extra: Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges.
MTA_tax: $0.50 MTA tax that is automatically triggered based on the metered rate in use.
Improvement_surcharge: $0.30 improvement surcharge assessed trips at the flag drop.
Tip_amount: Tip amount – This field is automatically populated for credit card tips. Cash tips are not included.
Tolls_amount: Total amount of all tolls paid in trip.
Total_amount: The total amount charged to passengers. Does not include cash tips.
Thanks to Joshua Adams for the Azure test results
https://www.linkedin.com/in/joshuaadams3/
Processing Results in Azure
Cores      Algorithm       Rows     Features   Elapsed Time (h:mm:ss)
Single     Random Forest   25000    19         0:02:58
Single     Random Forest   50000    19         0:07:08
Single     Random Forest   100000   19         1:11:48
Single     Random Forest   200000   19         1:43:05
Single     Random Forest   400000   19         5:25:05
Single     Random Forest   800000   19         19:25:50
Multiple   Random Forest   25000    19         0:01:32
Multiple   Random Forest   50000    19         0:03:47
Multiple   Random Forest   100000   19         0:34:12
Multiple   Random Forest   200000   19         0:57:16
Multiple   Random Forest   400000   19         1:48:23
Multiple   Random Forest   800000   19         3:48:03
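One way to read the Azure timings is as a speedup ratio: single-core elapsed time divided by multi-core elapsed time at each row count. The sketch below copies the elapsed times from the table; the function names are illustrative, and the slides do not say how many cores "Multiple" means.

```python
def to_seconds(elapsed: str) -> int:
    """Convert an h:mm:ss string to seconds."""
    hours, minutes, seconds = (int(part) for part in elapsed.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Elapsed times copied from the table: rows -> (single core, multiple cores).
TIMINGS = {
    25_000: ("0:02:58", "0:01:32"),
    50_000: ("0:07:08", "0:03:47"),
    100_000: ("1:11:48", "0:34:12"),
    200_000: ("1:43:05", "0:57:16"),
    400_000: ("5:25:05", "1:48:23"),
    800_000: ("19:25:50", "3:48:03"),
}


def speedups(timings=TIMINGS):
    """Return rows -> single-core time divided by multi-core time."""
    return {
        rows: to_seconds(single) / to_seconds(multi)
        for rows, (single, multi) in timings.items()
    }


if __name__ == "__main__":
    for rows, ratio in speedups().items():
        print(f"{rows:>8,} rows: {ratio:.2f}x")
```

On these numbers the ratio stays near 2x up to 200,000 rows and rises to about 3x at 400,000 rows and about 5x at 800,000 rows, which suggests the benefit of multiple cores grows with data size in this test.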
Algorithm searches we don’t always have “time” to do
• K-Means Clustering
– # clusters, optimal set of inputs
• k-NN
– k, find optimal set of inputs
• Neural Networks
– Architecture selection: # hidden layers and # neurons per hidden layer
– Learning parameters (learning rate, momentum for backprop)
• Logistic Regression
– Factorial design / interaction effects
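Searches like these are affordable in a parallel environment because every candidate setting can be scored independently. As a sketch, the code below searches over k for a toy one-dimensional k-NN classifier using leave-one-out accuracy. The classifier, the scoring choice, and all names are illustrative assumptions rather than anything specified in the slides, and a real search would use a proper library and a process pool or cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def knn_predict(train, query, k):
    """Majority label among the k training points nearest to `query` (1-D inputs)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def leave_one_out_accuracy(data, k):
    """Fraction of points classified correctly when each is held out in turn."""
    hits = 0
    for i, (x, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, x, k) == label
    return hits / len(data)


def search_k(data, candidates, workers=4):
    """Score every candidate k as an independent task; return (best_k, scores)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        accuracies = list(pool.map(lambda k: leave_one_out_accuracy(data, k), candidates))
    scores = dict(zip(candidates, accuracies))
    # Ties are broken in favor of the earliest candidate in the list.
    best_k = max(scores, key=scores.get)
    return best_k, scores
```

The same structure extends to the other searches on the slide: replace the candidate list with cluster counts, input subsets, network architectures, or interaction sets, and replace the scoring function accordingly.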
Conclusions
• The Good: Big data + AI is here and decision-makers care
• The Bad: Big data is big, but not smart; requires company buy-in
• The Ugly: Big data stresses infrastructure
• One Solution: cloud computing and parallelization
THANK YOU!
SmarterHQ.com | @deanabb | dabbott@SmarterHQ.com
