ML on Big Data: Real Time
Analysis on Time Series
Machine Learning
on Big Data
YASWANTH YADLAPALLI
Topics
 Business use case
 Training phase of the algorithm
 Tech stack
 Real time implementation
 Demonstration on a force sensor
Data Model
 We are currently working on these data models:
 Unstructured data
 Structured data
 Time series data
 For this talk we are going to concentrate on time series data
Problem Statement
 To build a reactive application which trains on a limited amount of data.
Business use case
 The main use case is in preventive maintenance systems.
 Calendar-based maintenance schedules and holding excessive inventory to reduce
downtime both lead to inefficiencies and increased costs.
 Recent machinery failures in oil rigs and car manufacturing plants have cost their
respective industries millions of dollars in downtime and repairs.
 Condition Based Monitoring systems are implemented with the goal of
eliminating unplanned downtime and reducing operations cost by maintaining the
proper equipment at the proper time.
 As they say, a stitch in time saves nine.
Sample Data
Our solution
Our solution
Our solution
Time series analytics
 Any analytics algorithm should be a mathematical model that supports:
 Data Compression: a compact representation of the data
 Signal Processing: extracting signals (sequences) even in the presence of noise
 Prediction: using the model to predict future values of the time series
Terminology
 Patterns
 A block of the graph where values fall within a range
 Patterns are grown from pairs of sequential points until the block conforms to the given thresholds
 Clusters
 Patterns of a similar type
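The pattern-growing rule above can be sketched in plain Python. This is a minimal illustration under assumed semantics: the function name and the single range threshold are our own, not the deck's actual implementation.

```python
def grow_patterns(series, value_range):
    """Grow patterns from pairs of sequential points: extend the current
    block while its values stay within the given range threshold, and
    start a new pattern once the block's spread exceeds it."""
    patterns = []
    start = 0
    for i in range(1, len(series)):
        block = series[start:i + 1]
        if max(block) - min(block) > value_range:
            # Close the pattern at the previous point; begin a new block.
            patterns.append(series[start:i])
            start = i
    patterns.append(series[start:])  # flush the final block
    return patterns
```

For example, `grow_patterns([1, 2, 1, 9, 10, 9, 2], 3)` splits the series into three blocks whose internal spread stays within the threshold.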
Terminology
 Sequences
 A recurring series of patterns belonging to
a set of clusters.
 Concepts
 Sequences which are tagged as relevant to
the user.
 Knowledge Base
 Inferences drawn from concepts.
 This is the compressed representation of
the time series.
Phases
 Training phase
 Objective is to build a Knowledge Base.
 Bulk historical data is given as input.
 Parameters of the algorithm are fine-tuned to match the use case.
 Concepts are identified and assigned an action.
 Validation phase
 Bulk
 Bulk data is given.
 Patterns are found and classified according to the Knowledge Base.
 Used to identify and tag scenarios over a known timeline.
Phases
 Decision phase
 Real Time
 For example, a Kafka source is provided.
 Received data is processed in batches.
 Patterns spanning multiple batches are stitched together.
 If a sequence is identified as a concept, the specified action is triggered.
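The concept-matching step of the decision phase can be sketched as a lookup against the Knowledge Base. The concept names, cluster labels and actions below are invented for illustration; the deck does not specify its matching rule.

```python
# Hypothetical knowledge base: concept -> (sequence of cluster labels, action).
KNOWLEDGE_BASE = {
    "seal_failure": (("rise", "spike", "fall"), "send_sms"),
    "normal_cycle": (("rise", "fall"), None),
}

def decide(sequence):
    """Match a stitched sequence of cluster labels against the knowledge
    base; return the matching concept and the action to trigger, if any."""
    for concept, (pattern, action) in KNOWLEDGE_BASE.items():
        if tuple(sequence) == pattern:
            return concept, action
    return None, None
```

When a sequence matches a tagged concept, the associated action (SMS or email in this system) would be fired.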
Architecture
Source
•Google Drive
•Kafka
•Local file
Ingestion
•Filters
•Transformation
•Manipulation
•Materialization
Computation
•Patterns
•Clusters
•Sequences
•Concepts
Actions
•SMS
•Email
Tech Architecture
[Diagram: UI Server, Backend Server, Frontend]
Benchmarks
[Chart: Duration (hours, 0–12) vs. Size (GB, 0–60), with driver memory (GB), total executor memory (GB) and slave cores (procs) as parameters]
Real Time Analysis
on Time Series
ROHITH YERAVOTHULA
Training phase output
 Knowledge Base properties:
 Data Compression: a compact representation of the data
 Signal Processing: extracting signals (sequences) even in the presence of noise
 Prediction: using the model to predict future values of the time series
Real time system
 Lightweight computation framework
 Ability to handle the 3 V's (Volume, Velocity and Variety) of Big Data
 Computation framework with a micro-batch processing architecture
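The micro-batch idea can be illustrated with a toy batcher in plain Python. This is a sketch only: real Spark Streaming groups records by a time interval, whereas this stand-in groups by count.

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Toy micro-batcher: collect a continuous stream into small batches
    and emit each batch as a unit for downstream processing."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        yield batch
```

Each emitted batch would then be processed as one job, which is why job-creation overhead and batch interval matter for latency.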
Tech Architecture
[Diagram: UI Server, Backend Server, Frontend]
Data Source
 A data source that can retain the data from the source and ingest it into the
computation framework, and that can:
 Take advantage of the distributed computation framework
 Store data in a fault-tolerant manner
Tech Architecture
[Diagram: UI Server, Backend Server, Frontend]
Using Spark and Kafka
[Diagram: Frontend, via Server]
Connecting with IoT
 Connect a mobile accelerometer to AWS IoT and stream data.
 Train the system to predict a user's behavior using accelerometer data.
Mobile IoT architecture
[Diagram: pipeline stage latencies of 600 ms, 5 sec, 3 sec, 2 sec and 4 sec; Frontend, via Server]
Bottlenecks
 Small File Issues: writing and reading a huge number of small files.
 Sharing data between batches.
Fix: Small Files Problem
 Implemented an in-memory queue to hold data for several batches, then
compile everything into a single file and write it to the storage system.
 Can also serve UI requests from the in-memory queue.
 This eliminates the extra read calls to the storage system to serve UI requests.
 It also allows the writes to be asynchronous in the first place.
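The fix above can be sketched as a small buffering class. The flush threshold, file naming and plain-text output are illustrative assumptions; the real system writes compacted Parquet files to HDFS.

```python
import os

class BatchQueue:
    """In-memory queue that buffers several batches and compacts them
    into a single file write, instead of one small file per batch."""

    def __init__(self, flush_after, out_dir):
        self.flush_after = flush_after  # records to buffer before writing
        self.out_dir = out_dir
        self.buffer = []
        self.files_written = 0

    def add_batch(self, records):
        self.buffer.extend(records)
        if len(self.buffer) >= self.flush_after:
            self.flush()

    def flush(self):
        # One compacted write for the accumulated batches.
        if not self.buffer:
            return
        path = os.path.join(self.out_dir, "part-%d.txt" % self.files_written)
        with open(path, "w") as f:
            f.write("\n".join(self.buffer))
        self.files_written += 1
        self.buffer = []

    def query(self):
        # UI requests for recent data are served from memory,
        # avoiding extra reads from the storage system.
        return list(self.buffer)
```

Because the queue answers `query()` from memory, recent data never has to be re-read from storage, and the eventual compacted write can happen asynchronously.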
Why Share data between batches
 In real-time data ingestion, data can be broken into different batches depending
upon the batch size we choose.
 We need to take care of signals overflowing across batches.
Sharing Data between batches
 updateStateByKey
 ssc.remember()
 Spark Accumulators
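Spark Streaming's updateStateByKey takes a function from a batch's new values and the previous state to the new state for each key. Its core idea can be sketched without Spark; the key names and list-concatenation state below are illustrative assumptions.

```python
def update_state(new_values, prev_state):
    """State-update function in the style of updateStateByKey: carry the
    tail of an unfinished signal from one batch into the next."""
    return (prev_state or []) + new_values

def run_batches(batches):
    """Apply the update function per key across micro-batches, the way
    Spark Streaming would across the batches of a keyed DStream."""
    state = {}
    for batch in batches:
        for key, values in batch.items():
            state[key] = update_state(values, state.get(key))
    return state
```

Carrying state this way lets a pattern that overflows one batch be stitched with its continuation in the next.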
Mobile IoT architecture
[Diagram: pipeline stage latencies of 600 ms, 5 sec, 3 sec, 2 sec and 4 sec; Frontend, via Server]
Mobile IoT architecture (updated)
[Diagram: updated pipeline with latencies of 600 ms, 5 sec and 1 sec, plus an async write; Frontend]
Demo
Quick demonstration with force sensor
"Keep calm and ask questions"
Q & A session
Editor's Notes
  1. In this section I am going to give a brief introduction to the architecture and business use case of our system. Our goal is to make sense of any given sensor data, be it a pressure sensor in a valve or a camera on a self-driving car, so that we may be able to take smart decisions or make predictions about the future.
  2. Unstructured data doesn't have relations between columns.
  3. Whatever the data source may be, we want to make a generalized solution which can handle any type of variation and enable the user to get a specialized system for his own use case.
  4. Assume there is an oil rig with 10 machines and 100 sensors each. Say we know that a component in a machine needs maintenance every 3 months, but in many real-life situations the component may break down prematurely, which may cost the company millions in downtime. Having a person monitor all the sensor outputs and determine whether any component needs maintenance is not a viable solution. Our system is built to handle this use case.
  5. Please have a look at this data from a pressure sensor in a valve. Say, as users, we know that the first anomaly is caused by misorientation of the spring and the second is caused when the seal of the valve is broken. Can you suggest any methods to isolate these 2 phenomena? Most of the traditional approaches wouldn't take into consideration a new type of pattern emerging. Next slide: our solution.
  6. Loss of similarity.
  7. Knowledge base is the set of inferences drawn from the given data.
  8. This is the pipeline all the phases of our application go through. First you provide a data source; currently you can upload a local file from your computer or select it from your Google Drive, and the formats supported are CSV and TSV. Then the data is ingested using a provided schema. You can type-cast variables, join columns from multiple files, etc. Using this ingested data as our time series, we can compute PCSC (patterns, clusters, sequences, concepts).
  9. A simple example: say, a y = sin(x) time series model. Prediction on time series data is one of the use cases for real-time time series analytics. Talk 1 explains how we trained the system and taught it to make decisions. We use the trained system for real-time analytics. We need some streaming or live computation framework.
  10. Spark Streaming is a micro-batch processing architecture. It collects stream data into small batches and processes them. Job creation and scheduling overhead is on the order of milliseconds. The batch interval can be as small as 1 second.
  11. We can't rely on the original data source, as it cannot provide recent data once it is lost.
  12. Apache Kafka is a distributed streaming platform. Publish and subscribe to streams of records. Store streams of records in a fault-tolerant way.
  13. Streams of time series data will be pumping data into Kafka. Spark will connect to Kafka brokers and consume data, process it and store it to a database. The UI server will pull data for visualization. Spark is the computation layer while Kafka acts as the data source for streaming data. We now have a streaming end-to-end application: a data source to stream from, a compute framework and a storage system. We can connect any real-time streaming source. One such demo: a simple AWS IoT setup with mobile sensors.
  14. Every batch is writing a lot of small files into the storage system (HDFS). We use Parquet as it is one of the best compressed data formats available. Writing small files in Spark's Parquet format is adding extra overhead. Reading several small files from storage to serve UI requests is also adding delay.
  15. Sharing data between batches.
  16. State maintenance.
  17. Make one line. The basic problem with live streaming is that data will be broken into batches. Our mathematical model can't rely upon a single batch; it needs to wait for the next batch to see if the data has overflowed.