ML on Big Data: Real Time
Analysis on Time Series
Machine Learning
on Big Data
YASWANTH YADLAPALLI
Topics
 Business use case
 Training phase of the algorithm
 Tech stack
 Real time implementation
 Demonstration on a force sensor
Data Model
 We are currently working on these data models:
 Unstructured data
 Structured data
 Time series data
 For this talk we are going to concentrate on time series data
Problem Statement
 To build a reactive application which trains on a limited amount of data.
Business use case
 The main use case is in preventive maintenance systems.
 Calendar-based maintenance schedules and holding excessive inventory to reduce
downtime both lead to inefficiencies and increased costs.
 Recent machinery failures in oil rigs and car manufacturing plants have cost their
respective industries millions of dollars in downtime and repairs.
 Condition Based Monitoring systems are implemented with the goal of
eliminating unplanned downtime and reducing operations cost by maintaining the
proper equipment at the proper time.
 As they say, a stitch in time saves nine.
Sample Data
Our solution
Our solution
Our solution
Time series analytics
 Any analytics algorithm should be a mathematical model that supports:
 Data Compression: a compact representation of the data
 Signal Processing: extracting signals (sequences) even in the presence of noise
 Prediction: using the model to predict future values of the time series
Terminology
 Patterns
 A block of the graph where values fall within a range
 Patterns are grown from pairs of sequential points until the block conforms to the given thresholds
 Clusters
 Patterns of a similar type
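The pattern-growing rule above can be sketched in plain Python. This is a minimal illustration under assumed semantics: the function name and the single range threshold are our own, not the deck's actual implementation.

```python
def grow_patterns(series, value_range):
    """Grow patterns from pairs of sequential points: extend the current
    block while its values stay within the given range threshold, and
    start a new pattern once the block's spread exceeds it."""
    patterns = []
    start = 0
    for i in range(1, len(series)):
        block = series[start:i + 1]
        if max(block) - min(block) > value_range:
            # Close the pattern at the previous point; begin a new block.
            patterns.append(series[start:i])
            start = i
    patterns.append(series[start:])  # flush the final block
    return patterns
```

For example, `grow_patterns([1, 2, 1, 9, 10, 9, 2], 3)` splits the series into three blocks whose internal spread stays within the threshold.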
Terminology
 Sequences
 A recurring series of patterns belonging to
a set of clusters.
 Concepts
 Sequences which are tagged as relevant to
the user.
 Knowledge Base
 Inferences drawn from concepts.
 This is the compressed representation of
the time series.
Phases
 Training phase
 Objective is to build a Knowledge Base.
 Bulk historical data is given as input.
 Parameters of the algorithm are fine-tuned to match the use case.
 Concepts are identified and assigned an action.
 Validation phase
 Bulk
 Bulk data is given.
 Patterns are found and classified according to the Knowledge Base.
 Used to identify and tag scenarios over a known timeline.
Phases
 Decision phase
 Real Time
 For example, a Kafka source is provided.
 Received data is processed in batches.
 Patterns spanning multiple batches are stitched together.
 If a sequence is identified as a concept, the specified action is triggered.
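The concept-matching step of the decision phase can be sketched as a lookup against the Knowledge Base. The concept names, cluster labels and actions below are invented for illustration; the deck does not specify its matching rule.

```python
# Hypothetical knowledge base: concept -> (sequence of cluster labels, action).
KNOWLEDGE_BASE = {
    "seal_failure": (("rise", "spike", "fall"), "send_sms"),
    "normal_cycle": (("rise", "fall"), None),
}

def decide(sequence):
    """Match a stitched sequence of cluster labels against the knowledge
    base; return the matching concept and the action to trigger, if any."""
    for concept, (pattern, action) in KNOWLEDGE_BASE.items():
        if tuple(sequence) == pattern:
            return concept, action
    return None, None
```

When a sequence matches a tagged concept, the associated action (SMS or email in this system) would be fired.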
Architecture
Source
•Google Drive
•Kafka
•Local file
Ingestion
•Filters
•Transformation
•Manipulation
•Materialization
Computation
•Patterns
•Clusters
•Sequences
•Concepts
Actions
•SMS
•Email
Tech Architecture
[Diagram: UI Server, Backend Server, Frontend]
Benchmarks
[Chart: Duration (hours, 0–12) vs. Size (GB, 0–60), with driver memory (GB), total executor memory (GB) and slave cores (procs) as parameters]
Real Time Analysis
on Time Series
ROHITH YERAVOTHULA
Training phase output
 Knowledge Base properties:
 Data Compression: a compact representation of the data
 Signal Processing: extracting signals (sequences) even in the presence of noise
 Prediction: using the model to predict future values of the time series
Real time system
 Lightweight computation framework
 Ability to handle the 3 V's (Volume, Velocity and Variety) of Big Data
 Computation framework with a micro-batch processing architecture
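The micro-batch idea can be illustrated with a toy batcher in plain Python. This is a sketch only: real Spark Streaming groups records by a time interval, whereas this stand-in groups by count.

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Toy micro-batcher: collect a continuous stream into small batches
    and emit each batch as a unit for downstream processing."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        yield batch
```

Each emitted batch would then be processed as one job, which is why job-creation overhead and batch interval matter for latency.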
Tech Architecture
[Diagram: UI Server, Backend Server, Frontend]
Data Source
 A data source that can retain the data from the source and ingest it into the
computation framework, and that can:
 Take advantage of the distributed computation framework
 Store data in a fault-tolerant manner
Tech Architecture
[Diagram: UI Server, Backend Server, Frontend]
Using Spark and Kafka
[Diagram: Frontend, via Server]
Connecting with IoT
 Connect a mobile accelerometer to AWS IoT and stream data.
 Train the system to predict a user's behavior using accelerometer data.
Mobile IoT architecture
[Diagram: pipeline stage latencies of 600 ms, 5 sec, 3 sec, 2 sec and 4 sec; Frontend, via Server]
Bottlenecks
 Small File Issues: writing and reading a huge number of small files.
 Sharing data between batches.
Fix: Small Files Problem
 Implemented an in-memory queue to hold data for several batches, then
compile everything into a single file and write it to the storage system.
 Can also serve UI requests from the in-memory queue.
 This eliminates the extra read calls to the storage system to serve UI requests.
 It also allows the writes to be asynchronous in the first place.
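The fix above can be sketched as a small buffering class. The flush threshold, file naming and plain-text output are illustrative assumptions; the real system writes compacted Parquet files to HDFS.

```python
import os

class BatchQueue:
    """In-memory queue that buffers several batches and compacts them
    into a single file write, instead of one small file per batch."""

    def __init__(self, flush_after, out_dir):
        self.flush_after = flush_after  # records to buffer before writing
        self.out_dir = out_dir
        self.buffer = []
        self.files_written = 0

    def add_batch(self, records):
        self.buffer.extend(records)
        if len(self.buffer) >= self.flush_after:
            self.flush()

    def flush(self):
        # One compacted write for the accumulated batches.
        if not self.buffer:
            return
        path = os.path.join(self.out_dir, "part-%d.txt" % self.files_written)
        with open(path, "w") as f:
            f.write("\n".join(self.buffer))
        self.files_written += 1
        self.buffer = []

    def query(self):
        # UI requests for recent data are served from memory,
        # avoiding extra reads from the storage system.
        return list(self.buffer)
```

Because the queue answers `query()` from memory, recent data never has to be re-read from storage, and the eventual compacted write can happen asynchronously.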
Why Share data between batches
 In real-time data ingestion, data can be broken into different batches depending
upon the batch size we choose.
 We need to take care of signals overflowing across batches.
Sharing Data between batches
 updateStateByKey
 ssc.remember()
 Spark Accumulators
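Spark Streaming's updateStateByKey takes a function from a batch's new values and the previous state to the new state for each key. Its core idea can be sketched without Spark; the key names and list-concatenation state below are illustrative assumptions.

```python
def update_state(new_values, prev_state):
    """State-update function in the style of updateStateByKey: carry the
    tail of an unfinished signal from one batch into the next."""
    return (prev_state or []) + new_values

def run_batches(batches):
    """Apply the update function per key across micro-batches, the way
    Spark Streaming would across the batches of a keyed DStream."""
    state = {}
    for batch in batches:
        for key, values in batch.items():
            state[key] = update_state(values, state.get(key))
    return state
```

Carrying state this way lets a pattern that overflows one batch be stitched with its continuation in the next.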
Mobile IoT architecture
[Diagram: pipeline stage latencies of 600 ms, 5 sec, 3 sec, 2 sec and 4 sec; Frontend, via Server]
Mobile IoT architecture (updated)
[Diagram: updated pipeline with latencies of 600 ms, 5 sec and 1 sec, plus an async write; Frontend]
Demo
Quick demonstration with force sensor
"Keep calm and ask questions"
Q & A session
Editor's Notes
  1. In this section I am going to give a brief introduction to the architecture and business use case of our system. Our goal is to make sense of any given sensor data, be it a pressure sensor in a valve or a camera on a self-driving car, so that we may be able to take smart decisions or make predictions about the future.
  2. Unstructured data doesn't have relations between columns.
  3. Whatever the data source may be, we want to make a generalized solution which can handle any type of variation and enable the user to get a specialized system for his own use case.
  4. Assume there is an oil rig with 10 machines and 100 sensors each. Say we know that a component in a machine needs maintenance every 3 months, but in many real-life situations the component may break down prematurely, which may cost the company millions in downtime. Having a person monitor all the sensor outputs and determine whether any component needs maintenance is not a viable solution. Our system is built to handle this use case.
  5. Please have a look at this data from a pressure sensor in a valve. Say, as users, we know that the first anomaly is caused by misorientation of the spring and the second is caused when the seal of the valve is broken. Can you suggest any methods to isolate these 2 phenomena? Most of the traditional approaches wouldn't take into consideration a new type of pattern emerging. Next slide: our solution.
  6. Loss of similarity.
  7. Knowledge base is the set of inferences drawn from the given data.
  8. This is the pipeline all the phases of our application go through. First you provide a data source; currently you can upload a local file from your computer or select it from your Google Drive, and the formats supported are CSV and TSV. Then the data is ingested using a provided schema. You can type-cast variables, join columns from multiple files, etc. Using this ingested data as our time series, we can compute PCSC (patterns, clusters, sequences, concepts).
  9. A simple example: say, a y = sin(x) time series model. Prediction on time series data is one of the use cases for real-time time series analytics. Talk 1 explains how we trained the system and taught it to make decisions. We use the trained system for real-time analytics. We need some streaming or live computation framework.
  10. Spark Streaming is a micro-batch processing architecture. It collects stream data into small batches and processes them. Job creation and scheduling overhead is on the order of milliseconds. The batch interval can be as small as 1 second.
  11. We can't rely on the original data source, as it cannot provide recent data once it is lost.
  12. Apache Kafka is a distributed streaming platform. Publish and subscribe to streams of records. Store streams of records in a fault-tolerant way.
  13. Streams of time series data will be pumping data into Kafka. Spark will connect to Kafka brokers and consume data, process it and store it to a database. The UI server will pull data for visualization. Spark is the computation layer while Kafka acts as the data source for streaming data. We now have a streaming end-to-end application: a data source to stream from, a compute framework and a storage system. We can connect any real-time streaming source. One such demo: a simple AWS IoT setup with mobile sensors.
  14. Every batch is writing a lot of small files into the storage system (HDFS). We use Parquet as it is one of the best compressed data formats available. Writing small files in Spark's Parquet format is adding extra overhead. Reading several small files from storage to serve UI requests is also adding delay.
  15. Sharing data between batches.
  16. State maintenance.
  17. Make one line. The basic problem with live streaming is that data will be broken into batches. Our mathematical model can't rely upon a single batch; it needs to wait for the next batch to see if the data has overflowed.