ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from Analytics to Predictions
e.g. Targeted Marketing
• Assume mass emails to 1M people, a reaction rate of 1%, and a $2 cost per email => cost of $2M and a reach of 10k people.
• Let's say that by looking at demographics (e.g. where people live, using decision tables) you can find 250K people with a reaction rate of 6% => cost of $500K and a reach of 15k people. (Worked out in the sketch below.)
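A back-of-the-envelope sketch of the two campaigns in Python (the numbers are the ones on this slide; the helper function is mine):

# Toy calculation of campaign cost and reach using the figures above.
def campaign(audience, reaction_rate, cost_per_email=2.0):
    cost = audience * cost_per_email
    reach = int(audience * reaction_rate)
    return cost, reach

mass_cost, mass_reach = campaign(1_000_000, 0.01)        # $2M, 10k people
targeted_cost, targeted_reach = campaign(250_000, 0.06)  # $500K, 15k people

print(f"mass:     ${mass_cost:,.0f} for {mass_reach:,} people, ${mass_cost / mass_reach:.0f}/person")
print(f"targeted: ${targeted_cost:,.0f} for {targeted_reach:,} people, ${targeted_cost / targeted_reach:.0f}/person")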
A day in your Life
 Think about a day in your life:
– What is the best road to take?
– Will there be any bad weather?
– How should I invest my money?
– How is my health?
 There are many decisions you could make better, if only you could access the data and process it.
http://www.flickr.com/photos/kcolwell/5512461652/ (CC licence)
Internet of Things
• Currently the physical world and the software world are detached
• The Internet of Things promises to bridge this gap
– It is about sensors and actuators everywhere: in your fridge, in your blanket, in your chair, in your carpet... yes, even in your socks
– Umbrellas that light up when rain is expected, smart medicine cups
What Can We Do with Big Data?
• Optimize (the world is inefficient)
– 30% of food is wasted from farm to plate
– GE "Save 1%" initiative (http://goo.gl/eYC0QE )
• In trains => $2B/year
• In US healthcare => $20B/year
• In contrast, Sri Lanka's total exports are about $9B/year
• Save lives
– Weather, disease identification, personalized treatment
• Advance technology
– Most high-tech research is done via simulations
Big Data Architecture
Big Data Processing Technologies Landscape
Hindsight: Batch Processing
• Programming model is MapReduce (see the word-count sketch below)
– Apache Hadoop
– Spark
• Lots of tools built on top
– Hive and Shark (SQL-style queries), Mahout (ML), Giraph (graph processing)
• Store, then process
• Slow (> 5 minutes to get results for a reasonable use case)
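The MapReduce model boils down to a map function that emits key/value pairs and a reduce function that aggregates all values seen for a key. A minimal single-process word-count sketch in plain Python (standing in for Hadoop or Spark; the function names are mine):

from collections import defaultdict

def map_phase(document):
    # Emit (word, 1) for every word, as a Hadoop mapper would.
    for word in document.lower().split():
        yield word, 1

def reduce_phase(key, values):
    # Aggregate all values emitted for one key.
    return key, sum(values)

def mapreduce(documents):
    groups = defaultdict(list)
    for doc in documents:                     # map + shuffle
        for key, value in map_phase(doc):
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())  # reduce

print(mapreduce(["big data is big", "data about data"]))
# {'big': 2, 'data': 3, 'is': 1, 'about': 1}

A real cluster runs the map and reduce phases on many machines in parallel, with the framework handling the shuffle, but the programming model is exactly this pair of functions.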
Usecase: Targeted Advertising
• Analytics implemented with MapReduce or queries
– Min, max, average, correlation, histograms
– Might join or group data in many ways
– Heatmaps, temporal trends
• Key Performance Indicators (KPIs) (see the sketch below)
– Average time for a ticket in customer service interactions
– Profit per square foot for retail
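A KPI like this is just an aggregation. A minimal sketch over made-up records (the field names and numbers are hypothetical):

# Hypothetical customer-service tickets: (ticket_id, hours_to_resolve)
tickets = [("T1", 4.0), ("T2", 12.5), ("T3", 2.0), ("T4", 7.5)]
avg_hours = sum(hours for _, hours in tickets) / len(tickets)
print(f"Average time per ticket: {avg_hours:.1f} hours")   # 6.5 hours

# Hypothetical retail stores: (store, profit, square_feet)
stores = [("Colombo", 120_000, 4_000), ("Kandy", 45_000, 1_500)]
for name, profit, sqft in stores:
    print(f"{name}: profit per square foot = {profit / sqft:.2f}")

In practice the same aggregations would be expressed as Hive/SQL queries or MapReduce jobs over the full data set.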
Real-time Analytics
• Idea is to process data as they are received, in a streaming fashion (without storing)
• Used when we need
– Very fast output (milliseconds)
– Lots of events (a few 100k to millions)
• Two main technologies
– Stream Processing (e.g. Apache Storm, http://storm-project.net/ )
– Complex Event Processing (CEP), e.g. WSO2 CEP
define partition "playerPartition" as PlayerDataStream.pid;
from PlayerDataStream#win.time(1m)
select pid, avg(speed) as avgSpeed
insert into AvgSpeedStream
using partition playerPartition;
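This Siddhi-style CEP query partitions the player sensor stream by player id (pid) and, for each player, emits the average speed over a sliding one-minute window into AvgSpeedStream; it is the kind of query used in the football use case below.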
Usecase: DEBS 2013, Football Game
Sketch Algorithms
• Data structures that can count millions of entries with a few KB
– Provide approximate answers
– E.g. Count-Min Sketch, Bloom filters (a minimal example follows below)
• Use cases
– Counting items
– Point estimates, range sums, heavy hitters, quantiles, number of distinct elements
– Graph summaries
– Linear algebraic problems such as approximating matrix products, least squares approximation and SVD
See https://sites.google.com/site/algoresearch/datastreamalgorithms
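As an illustration, a minimal Count-Min Sketch in plain Python (the width, depth, and hashing scheme are my own choices, not from the talk). Memory stays fixed no matter how many items stream by, and estimates can only over-count, never under-count:

import random

class CountMinSketch:
    def __init__(self, width=2000, depth=5, seed=42):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]
        rng = random.Random(seed)
        self.seeds = [rng.randrange(1 << 30) for _ in range(depth)]

    def _bucket(self, row, item):
        # One hash function per row, derived from a per-row seed.
        return hash((self.seeds[row], item)) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._bucket(row, item)] += count

    def estimate(self, item):
        # True count <= estimate; hash collisions only inflate it.
        return min(self.table[row][self._bucket(row, item)] for row in range(self.depth))

cms = CountMinSketch()
for word in ["colombo"] * 1000 + ["kandy"] * 10:
    cms.add(word)
print(cms.estimate("colombo"), cms.estimate("kandy"))   # ~1000, ~10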
Curious Case of Missing Data
• WW II: aircraft that returned, and data on where they were hit
• How would you decide where to add armour?
(Story: http://www.fastcodesign.com/1671172/how-a-story-from-world-war-ii-shapes-facebook-today, pic from http://www.phibetaiota.net/2011/09/defdog-the-importance-of-selection-bias-in-statistics/)
Challenges: Causality
• Correlation does not imply causality!! (see the "send a book home" example [1])
• To establish causality
– Repeat the experiment under identical conditions
– If you can't, run a randomized test (A/B test)
– With big (observational) data we often can do neither
• Option 1: We can act on a correlation if we can verify the guess, or if correctness is not critical (start an investigation, check for a disease, marketing)
• Option 2: We verify correlations using A/B testing or propensity analysis [2] (a minimal example follows)
[1] http://www.freakonomics.com/2008/12/10/the-blagojevich-upside/
[2] https://hbr.org/2014/03/when-to-act-on-a-correlation-and-when-not-to/
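For Option 2, a simple randomized A/B test reduces to comparing two proportions. A minimal sketch of a two-proportion z-test (the numbers are made up; in practice use a statistics library):

import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    # z-statistic for the difference between two conversion rates.
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Hypothetical campaign: variant A converts 60/1000, variant B 45/1000.
z = two_proportion_z(60, 1000, 45, 1000)
print(f"z = {z:.2f}")   # |z| > 1.96 would be significant at the 5% level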
Insight (Understanding Why?)
• Pattern mining: find frequent associations (e.g. market baskets) and frequent sequences (see the sketch after this list)
• Clustering
• Graph analysis
• Knowledge discovery
• Correlations between features, finding principal components
• Simulations, complex-system modeling, matching a statistical distribution
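As a toy illustration of pattern mining (market-basket style; a counting sketch, not a full Apriori implementation):

from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"tea", "milk"},
    {"bread", "butter", "tea"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs appearing in at least half of the baskets are "frequent".
min_support = len(baskets) / 2
frequent = {pair: count for pair, count in pair_counts.items() if count >= min_support}
print(frequent)   # {('bread', 'butter'): 3}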
Usecase: Big Data for Development in Sri Lanka
• Done using CDR (call detail record) data
• People density at 1pm vs. midnight (red => increased, blue => decreased)
• Urban planning
– People distribution
– Mobility
– Waste management
– E.g. see http://goo.gl/jPujmM
From: http://lirneasia.net/2014/08/what-does-big-data-say-about-sri-lanka/
Foresight (Predict)
• Build a model
– Weather, economic models
• Predict future values
– Electricity load, traffic, demand, sales
• Classification (see the sketch below)
– Spam detection, grouping users, sentiment analysis
• Find anomalies
– Fraud, predictive maintenance
• Recommendations
– Targeted advertising, product recommendations
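A minimal classification sketch for spam detection, assuming scikit-learn is available (the training data is made up):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny hypothetical training set: 1 = spam, 0 = not spam.
texts = ["win free money now", "cheap loans win big",
         "meeting at noon tomorrow", "see you at the seminar"]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(texts)   # bag-of-words features

model = MultinomialNB()
model.fit(features, labels)

new_mail = ["free money if you win"]
print(model.predict(vectorizer.transform(new_mail)))   # [1] => spam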
Usecase: Predictive Maintenance
• Idea is to fix the problem before things break, avoiding expensive downtime
– Airplanes, turbines, windmills
– Construction equipment
– Cars, golf carts
• How (see the sketch below)
– Build a model of normal operation and measure deviation from it
– Match against known error patterns
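A minimal sketch of the first approach: model "normal" sensor readings with a mean and standard deviation, then flag readings that deviate too far (the readings and threshold here are made up):

import statistics

# Hypothetical vibration readings collected while the machine was healthy.
normal_readings = [0.52, 0.49, 0.51, 0.50, 0.48, 0.53, 0.50, 0.49]
mu = statistics.mean(normal_readings)
sigma = statistics.stdev(normal_readings)

def is_anomaly(reading, threshold=3.0):
    # Flag readings more than `threshold` standard deviations from normal.
    return abs(reading - mu) > threshold * sigma

for value in [0.51, 0.55, 0.90]:
    print(value, "anomaly" if is_anomaly(value) else "ok")   # ok, ok, anomaly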
Challenges: Selecting the Best Algorithm for a Problem
• Types of data: categorical (C), numerical (N)
 N -> N: regression
 C -> C: decision trees
 N -> C: SVM
• Amount of data
• Required accuracy
• Required interpretability
• Kind of underlying function
See Skytree: Choosing The Right Machine Learning Methods, https://www.youtube.com/watch?v=qMUpc10VsmA
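The input/output rule of thumb from this slide can be written down directly (an over-simplification, just to make the mapping concrete):

# Map (input type, output type) to the algorithm family suggested on the slide.
RULE_OF_THUMB = {
    ("numerical", "numerical"): "regression",
    ("categorical", "categorical"): "decision trees",
    ("numerical", "categorical"): "SVM",
}

def suggest_algorithm(input_type, output_type):
    return RULE_OF_THUMB.get(
        (input_type, output_type),
        "no single rule; weigh data size, accuracy, interpretability")

print(suggest_algorithm("numerical", "categorical"))   # SVM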
Challenges: Feature Engineering
• In ML, feature engineering is the key [1]
• You need features to form a kernel; then you can solve the problem with less data (see the sketch below)
• Deep learning can learn the best features (and combinations) via semi-supervised or unsupervised learning [2]
1. Bekkerman's talk, https://www.youtube.com/watch?v=wjTJVhmu1JM
2. Deep Learning, http://cl.naist.jp/~kevinduh/a/deep2014/
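A minimal illustration of hand-crafted feature engineering (the raw record and the derived features are invented, just to show the idea of turning raw data into informative inputs for a model):

import math
from datetime import datetime

def make_features(transaction):
    # Turn a raw transaction record into features a model can learn from.
    ts = datetime.fromisoformat(transaction["timestamp"])
    return {
        "hour_of_day": ts.hour,                           # daily patterns
        "is_weekend": ts.weekday() >= 5,                  # weekday vs. weekend behaviour
        "amount_per_item": transaction["amount"] / transaction["items"],
        "log_amount": math.log1p(transaction["amount"]),  # tame skewed amounts
    }

raw = {"timestamp": "2014-12-12T23:15:00", "amount": 4500.0, "items": 3}
print(make_features(raw))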
Challenges: Taking Decisions (Context)
Challenges: Updating Models
● Incorporate more data
o We get more data over time
o We get feedback about the effectiveness of decisions (e.g. accuracy of fraud detection)
o Trends change
● Track and update the model
o Generate models in batch mode and update them periodically
o Streaming (online) ML, which is an active research topic (see the sketch below)
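A minimal sketch of the streaming (online) option: a linear model whose weights are nudged by one stochastic gradient descent step per incoming event, so it tracks changing trends without retraining from scratch (the learning rate and data are made up):

def sgd_update(weights, features, target, lr=0.05):
    # One online SGD step for a linear model with squared-error loss.
    prediction = sum(w * x for w, x in zip(weights, features))
    error = prediction - target
    return [w - lr * error * x for w, x in zip(weights, features)]

weights = [0.0, 0.0]   # [bias, slope]; feature vector is [1.0, x]
stream = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)]   # y is roughly 2x

for x, y in stream:    # update as each observation arrives
    weights = sgd_update(weights, [1.0, x], y)

print(weights)         # the slope drifts towards ~2 as more data streams in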
Challenges: Scaling ML Algorithms
• With more data we can build more accurate and detailed models [1]
• Scale => distributed systems
• Need to build new algorithms, adapt existing ones, or use other methods
– Sampling (see the sketch below)
– Scalable versions of algorithms (e.g. decision trees, neural networks)
[1] P. Domingos, "A Few Useful Things to Know about Machine Learning"
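A minimal sketch of the sampling option: reservoir sampling keeps a uniform random sample of fixed size k from a stream of unknown length, so a model can be trained on a manageable sample without ever holding the full data set:

import random

def reservoir_sample(stream, k, seed=7):
    # Keep a uniform random sample of k items from an arbitrarily long stream.
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)   # replace an existing slot with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), 5))   # 5 items, each equally likely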
Challenges: Lack of Labeled Data
• Most data is not labeled
• Idea of semi-supervised learning
• Provide data + examples + an ontology, and the algorithm finds new patterns
– Lots of data
– A few example sentences
• Often uses the Expectation Maximization (EM) algorithm (see the toy EM sketch below)
Watch Tom Mitchell's lecture: https://www.youtube.com/watch?v=psFnHkIjHA0
Ontology: People, Cities
Relationships: like, dislike, live in
Examples: Bob (People) lives in Colombo (City)
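To make the EM idea concrete, here is a minimal EM loop for a toy problem: fitting a mixture of two 1-D Gaussians to unlabeled data. This illustrates EM itself, not the NELL-style ontology bootstrapping on this slide; the data, initialization, and iteration count are my own choices:

import math
import random

def gauss(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Unlabeled data drawn from two hidden groups (around 0 and around 5).
rng = random.Random(0)
data = [rng.gauss(0, 1) for _ in range(200)] + [rng.gauss(5, 1) for _ in range(200)]

mu, sigma, weight = [min(data), max(data)], [1.0, 1.0], [0.5, 0.5]   # crude start

for _ in range(50):
    # E-step: soft-assign each point to the two components.
    resp = []
    for x in data:
        p = [weight[k] * gauss(x, mu[k], sigma[k]) for k in range(2)]
        total = sum(p)
        resp.append([pk / total for pk in p])
    # M-step: re-estimate parameters from the soft assignments.
    for k in range(2):
        nk = sum(r[k] for r in resp)
        mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
        sigma[k] = math.sqrt(sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk)
        weight[k] = nk / len(data)

print([round(m, 2) for m in mu])   # the two means converge near 0 and 5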