ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from Analytics to Predictions
e.g. Targeted Marketing
• Assume mass emails to 1M people, a reaction rate of 1%, and a $2 cost per email => cost of $2M and a reach of 10k people.
• Let's say that by looking at demographics (e.g. where people live, using decision tables) you can find 250K people with a reaction rate of 6% => cost of $500K and a reach of 15k people. (Worked out in the sketch below.)
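A back-of-the-envelope sketch of the two campaigns in Python (the numbers are the ones on this slide; the helper function is mine):

# Toy calculation of campaign cost and reach using the figures above.
def campaign(audience, reaction_rate, cost_per_email=2.0):
    cost = audience * cost_per_email
    reach = int(audience * reaction_rate)
    return cost, reach

mass_cost, mass_reach = campaign(1_000_000, 0.01)        # $2M, 10k people
targeted_cost, targeted_reach = campaign(250_000, 0.06)  # $500K, 15k people

print(f"mass:     ${mass_cost:,.0f} for {mass_reach:,} people, ${mass_cost / mass_reach:.0f}/person")
print(f"targeted: ${targeted_cost:,.0f} for {targeted_reach:,} people, ${targeted_cost / targeted_reach:.0f}/person")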
A day in your Life
 Think about a day in your life:
– What is the best road to take?
– Will there be any bad weather?
– How should I invest my money?
– How is my health?
 There are many decisions you could make better, if only you could access the data and process it.
http://www.flickr.com/photos/kcolwell/5512461652/ (CC licence)
Internet of Things
• Currently the physical world and the software world are detached
• The Internet of Things promises to bridge this gap
– It is about sensors and actuators everywhere: in your fridge, in your blanket, in your chair, in your carpet... yes, even in your socks
– Umbrellas that light up when rain is expected, smart medicine cups
What Can We Do with Big Data?
• Optimize (the world is inefficient)
– 30% of food is wasted from farm to plate
– GE "Save 1%" initiative (http://goo.gl/eYC0QE )
• In trains => $2B/year
• In US healthcare => $20B/year
• In contrast, Sri Lanka's total exports are about $9B/year
• Save lives
– Weather, disease identification, personalized treatment
• Advance technology
– Most high-tech research is done via simulations
Big Data Architecture
Big Data Processing Technologies Landscape
Hindsight: Batch Processing
• Programming model is MapReduce (see the word-count sketch below)
– Apache Hadoop
– Spark
• Lots of tools built on top
– Hive and Shark (SQL-style queries), Mahout (ML), Giraph (graph processing)
• Store, then process
• Slow (> 5 minutes to get results for a reasonable use case)
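The MapReduce model boils down to a map function that emits key/value pairs and a reduce function that aggregates all values seen for a key. A minimal single-process word-count sketch in plain Python (standing in for Hadoop or Spark; the function names are mine):

from collections import defaultdict

def map_phase(document):
    # Emit (word, 1) for every word, as a Hadoop mapper would.
    for word in document.lower().split():
        yield word, 1

def reduce_phase(key, values):
    # Aggregate all values emitted for one key.
    return key, sum(values)

def mapreduce(documents):
    groups = defaultdict(list)
    for doc in documents:                     # map + shuffle
        for key, value in map_phase(doc):
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())  # reduce

print(mapreduce(["big data is big", "data about data"]))
# {'big': 2, 'data': 3, 'is': 1, 'about': 1}

A real cluster runs the map and reduce phases on many machines in parallel, with the framework handling the shuffle, but the programming model is exactly this pair of functions.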
Usecase: Targeted Advertising
• Analytics implemented with MapReduce or queries
– Min, max, average, correlation, histograms
– Might join or group data in many ways
– Heatmaps, temporal trends
• Key Performance Indicators (KPIs) (see the sketch below)
– Average time for a ticket in customer service interactions
– Profit per square foot for retail
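A KPI like this is just an aggregation. A minimal sketch over made-up records (the field names and numbers are hypothetical):

# Hypothetical customer-service tickets: (ticket_id, hours_to_resolve)
tickets = [("T1", 4.0), ("T2", 12.5), ("T3", 2.0), ("T4", 7.5)]
avg_hours = sum(hours for _, hours in tickets) / len(tickets)
print(f"Average time per ticket: {avg_hours:.1f} hours")   # 6.5 hours

# Hypothetical retail stores: (store, profit, square_feet)
stores = [("Colombo", 120_000, 4_000), ("Kandy", 45_000, 1_500)]
for name, profit, sqft in stores:
    print(f"{name}: profit per square foot = {profit / sqft:.2f}")

In practice the same aggregations would be expressed as Hive/SQL queries or MapReduce jobs over the full data set.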
Real-time Analytics
• Idea is to process data as they are received, in a streaming fashion (without storing)
• Used when we need
– Very fast output (milliseconds)
– Lots of events (a few 100k to millions)
• Two main technologies
– Stream Processing (e.g. Apache Storm, http://storm-project.net/ )
– Complex Event Processing (CEP), e.g. WSO2 CEP
define partition "playerPartition" as PlayerDataStream.pid;
from PlayerDataStream#win.time(1m)
select pid, avg(speed) as avgSpeed
insert into AvgSpeedStream
using partition playerPartition;
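This Siddhi-style CEP query partitions the player sensor stream by player id (pid) and, for each player, emits the average speed over a sliding one-minute window into AvgSpeedStream; it is the kind of query used in the football use case below.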
Usecase: DEBS 2013, Football Game
Sketch Algorithms
• Data structures that can count millions of entries with a few KB
– Provide approximate answers
– E.g. Count-Min Sketch, Bloom filters (a minimal example follows below)
• Use cases
– Counting items
– Point estimates, range sums, heavy hitters, quantiles, number of distinct elements
– Graph summaries
– Linear algebraic problems such as approximating matrix products, least squares approximation and SVD
See https://sites.google.com/site/algoresearch/datastreamalgorithms
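As an illustration, a minimal Count-Min Sketch in plain Python (the width, depth, and hashing scheme are my own choices, not from the talk). Memory stays fixed no matter how many items stream by, and estimates can only over-count, never under-count:

import random

class CountMinSketch:
    def __init__(self, width=2000, depth=5, seed=42):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]
        rng = random.Random(seed)
        self.seeds = [rng.randrange(1 << 30) for _ in range(depth)]

    def _bucket(self, row, item):
        # One hash function per row, derived from a per-row seed.
        return hash((self.seeds[row], item)) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._bucket(row, item)] += count

    def estimate(self, item):
        # True count <= estimate; hash collisions only inflate it.
        return min(self.table[row][self._bucket(row, item)] for row in range(self.depth))

cms = CountMinSketch()
for word in ["colombo"] * 1000 + ["kandy"] * 10:
    cms.add(word)
print(cms.estimate("colombo"), cms.estimate("kandy"))   # ~1000, ~10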
Curious Case of Missing Data
• WW II: aircraft that returned, and data on where they were hit
• How would you decide where to add armour?
(Story: http://www.fastcodesign.com/1671172/how-a-story-from-world-war-ii-shapes-facebook-today, pic from http://www.phibetaiota.net/2011/09/defdog-the-importance-of-selection-bias-in-statistics/)
Challenges: Causality
• Correlation does not imply causality!! (see the "send a book home" example [1])
• To establish causality
– Repeat the experiment under identical conditions
– If you can't, run a randomized test (A/B test)
– With big (observational) data we often can do neither
• Option 1: We can act on a correlation if we can verify the guess, or if correctness is not critical (start an investigation, check for a disease, marketing)
• Option 2: We verify correlations using A/B testing or propensity analysis [2] (a minimal example follows)
[1] http://www.freakonomics.com/2008/12/10/the-blagojevich-upside/
[2] https://hbr.org/2014/03/when-to-act-on-a-correlation-and-when-not-to/
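For Option 2, a simple randomized A/B test reduces to comparing two proportions. A minimal sketch of a two-proportion z-test (the numbers are made up; in practice use a statistics library):

import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    # z-statistic for the difference between two conversion rates.
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Hypothetical campaign: variant A converts 60/1000, variant B 45/1000.
z = two_proportion_z(60, 1000, 45, 1000)
print(f"z = {z:.2f}")   # |z| > 1.96 would be significant at the 5% level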
Insight (Understanding Why?)
• Pattern mining: find frequent associations (e.g. market baskets) and frequent sequences (see the sketch after this list)
• Clustering
• Graph analysis
• Knowledge discovery
• Correlations between features, finding principal components
• Simulations, complex-system modeling, matching a statistical distribution
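As a toy illustration of pattern mining (market-basket style; a counting sketch, not a full Apriori implementation):

from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"tea", "milk"},
    {"bread", "butter", "tea"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs appearing in at least half of the baskets are "frequent".
min_support = len(baskets) / 2
frequent = {pair: count for pair, count in pair_counts.items() if count >= min_support}
print(frequent)   # {('bread', 'butter'): 3}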
Usecase: Big Data for Development in Sri Lanka
• Done using CDR (call detail record) data
• People density at 1pm vs. midnight (red => increased, blue => decreased)
• Urban planning
– People distribution
– Mobility
– Waste management
– E.g. see http://goo.gl/jPujmM
From: http://lirneasia.net/2014/08/what-does-big-data-say-about-sri-lanka/
Foresight (Predict)
• Build a model
– Weather, economic models
• Predict future values
– Electricity load, traffic, demand, sales
• Classification (see the sketch below)
– Spam detection, grouping users, sentiment analysis
• Find anomalies
– Fraud, predictive maintenance
• Recommendations
– Targeted advertising, product recommendations
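A minimal classification sketch for spam detection, assuming scikit-learn is available (the training data is made up):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny hypothetical training set: 1 = spam, 0 = not spam.
texts = ["win free money now", "cheap loans win big",
         "meeting at noon tomorrow", "see you at the seminar"]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(texts)   # bag-of-words features

model = MultinomialNB()
model.fit(features, labels)

new_mail = ["free money if you win"]
print(model.predict(vectorizer.transform(new_mail)))   # [1] => spam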
Usecase: Predictive Maintenance
• Idea is to fix the problem before things break, avoiding expensive downtime
– Airplanes, turbines, windmills
– Construction equipment
– Cars, golf carts
• How (see the sketch below)
– Build a model of normal operation and measure deviation from it
– Match against known error patterns
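A minimal sketch of the first approach: model "normal" sensor readings with a mean and standard deviation, then flag readings that deviate too far (the readings and threshold here are made up):

import statistics

# Hypothetical vibration readings collected while the machine was healthy.
normal_readings = [0.52, 0.49, 0.51, 0.50, 0.48, 0.53, 0.50, 0.49]
mu = statistics.mean(normal_readings)
sigma = statistics.stdev(normal_readings)

def is_anomaly(reading, threshold=3.0):
    # Flag readings more than `threshold` standard deviations from normal.
    return abs(reading - mu) > threshold * sigma

for value in [0.51, 0.55, 0.90]:
    print(value, "anomaly" if is_anomaly(value) else "ok")   # ok, ok, anomaly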
Challenges: Selecting the Best Algorithm for a Problem
• Types of data: categorical (C), numerical (N)
 N -> N: regression
 C -> C: decision trees
 N -> C: SVM
• Amount of data
• Required accuracy
• Required interpretability
• Kind of underlying function
See Skytree: Choosing The Right Machine Learning Methods, https://www.youtube.com/watch?v=qMUpc10VsmA
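The input/output rule of thumb from this slide can be written down directly (an over-simplification, just to make the mapping concrete):

# Map (input type, output type) to the algorithm family suggested on the slide.
RULE_OF_THUMB = {
    ("numerical", "numerical"): "regression",
    ("categorical", "categorical"): "decision trees",
    ("numerical", "categorical"): "SVM",
}

def suggest_algorithm(input_type, output_type):
    return RULE_OF_THUMB.get(
        (input_type, output_type),
        "no single rule; weigh data size, accuracy, interpretability")

print(suggest_algorithm("numerical", "categorical"))   # SVM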
Challenges: Feature Engineering
• In ML, feature engineering is the key [1]
• You need features to form a kernel; then you can solve the problem with less data (see the sketch below)
• Deep learning can learn the best features (and combinations) via semi-supervised or unsupervised learning [2]
1. Bekkerman's talk, https://www.youtube.com/watch?v=wjTJVhmu1JM
2. Deep Learning, http://cl.naist.jp/~kevinduh/a/deep2014/
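A minimal illustration of hand-crafted feature engineering (the raw record and the derived features are invented, just to show the idea of turning raw data into informative inputs for a model):

import math
from datetime import datetime

def make_features(transaction):
    # Turn a raw transaction record into features a model can learn from.
    ts = datetime.fromisoformat(transaction["timestamp"])
    return {
        "hour_of_day": ts.hour,                           # daily patterns
        "is_weekend": ts.weekday() >= 5,                  # weekday vs. weekend behaviour
        "amount_per_item": transaction["amount"] / transaction["items"],
        "log_amount": math.log1p(transaction["amount"]),  # tame skewed amounts
    }

raw = {"timestamp": "2014-12-12T23:15:00", "amount": 4500.0, "items": 3}
print(make_features(raw))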
Challenges: Taking Decisions (Context)
Challenges: Updating Models
● Incorporate more data
o We get more data over time
o We get feedback about the effectiveness of decisions (e.g. accuracy of fraud detection)
o Trends change
● Track and update the model
o Generate models in batch mode and update them periodically
o Streaming (online) ML, which is an active research topic (see the sketch below)
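A minimal sketch of the streaming (online) option: a linear model whose weights are nudged by one stochastic gradient descent step per incoming event, so it tracks changing trends without retraining from scratch (the learning rate and data are made up):

def sgd_update(weights, features, target, lr=0.05):
    # One online SGD step for a linear model with squared-error loss.
    prediction = sum(w * x for w, x in zip(weights, features))
    error = prediction - target
    return [w - lr * error * x for w, x in zip(weights, features)]

weights = [0.0, 0.0]   # [bias, slope]; feature vector is [1.0, x]
stream = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)]   # y is roughly 2x

for x, y in stream:    # update as each observation arrives
    weights = sgd_update(weights, [1.0, x], y)

print(weights)         # the slope drifts towards ~2 as more data streams in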
Challenges: Scaling ML Algorithms
• With more data we can build more accurate and detailed models [1]
• Scale => distributed systems
• Need to build new algorithms, adapt existing ones, or use other methods
– Sampling (see the sketch below)
– Scalable versions of algorithms (e.g. decision trees, neural networks)
[1] P. Domingos, "A Few Useful Things to Know about Machine Learning"
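A minimal sketch of the sampling option: reservoir sampling keeps a uniform random sample of fixed size k from a stream of unknown length, so a model can be trained on a manageable sample without ever holding the full data set:

import random

def reservoir_sample(stream, k, seed=7):
    # Keep a uniform random sample of k items from an arbitrarily long stream.
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)   # replace an existing slot with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), 5))   # 5 items, each equally likely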
Challenges: Lack of Labeled Data
• Most data is not labeled
• Idea of semi-supervised learning
• Provide data + examples + an ontology, and the algorithm finds new patterns
– Lots of data
– A few example sentences
• Often uses the Expectation Maximization (EM) algorithm (see the toy EM sketch below)
Watch Tom Mitchell's lecture: https://www.youtube.com/watch?v=psFnHkIjHA0
Ontology: People, Cities
Relationships: like, dislike, live in
Examples: Bob (People) lives in Colombo (City)
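To make the EM idea concrete, here is a minimal EM loop for a toy problem: fitting a mixture of two 1-D Gaussians to unlabeled data. This illustrates EM itself, not the NELL-style ontology bootstrapping on this slide; the data, initialization, and iteration count are my own choices:

import math
import random

def gauss(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Unlabeled data drawn from two hidden groups (around 0 and around 5).
rng = random.Random(0)
data = [rng.gauss(0, 1) for _ in range(200)] + [rng.gauss(5, 1) for _ in range(200)]

mu, sigma, weight = [min(data), max(data)], [1.0, 1.0], [0.5, 0.5]   # crude start

for _ in range(50):
    # E-step: soft-assign each point to the two components.
    resp = []
    for x in data:
        p = [weight[k] * gauss(x, mu[k], sigma[k]) for k in range(2)]
        total = sum(p)
        resp.append([pk / total for pk in p])
    # M-step: re-estimate parameters from the soft assignments.
    for k in range(2):
        nk = sum(r[k] for r in resp)
        mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
        sigma[k] = math.sqrt(sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk)
        weight[k] = nk / len(data)

print([round(m, 2) for m in mu])   # the two means converge near 0 and 5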