Pivotal Real Time Data Stream Analytics
- 1. Pivotal Confidential–Internal Use Only
Journey to an Agile Data-Driven Enterprise
Real Time Data Stream Processing using Pivotal Big Data Suite (BDS)
- 2.
Agenda
• Problem Statement
• Real Time Streaming Architecture
• Problem Solution
• Pivotal Big Data Suite
• Demo Screenshots
• Summary (Pivotal Differentiators)
- 3.
Problem Statement
Problem solution is loosely based on ACM DEBS 2015 Grand Challenge
http://www.debs2015.org/call-grand-challenge.html
- 4.
Problem Statement
Data Model
1. Taxi data is streamed for the New York region
2. Data contains details such as taxi number, pickup time, dropoff time, pickup and dropoff lat/long, fare, taxes
3. The area is divided into squares. Each square is 1 km x 1 km
Find out EVERY 10 SECONDS
a. Inconsistent data
b. Top 10 areas where taxis are plying the most (report the starting and ending area and the number of taxis that traveled in these areas; each area is a 1 km x 1 km square)
c. Total data processed, and time to process data in-memory for a window of 10 seconds
d. Free taxis available in different areas (only 50 taxis)
Analytical Queries
a. Which taxi driver is not reporting data correctly
b. Top 10 taxi drivers earning the most
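A minimal sketch of query (b) above, assuming a hypothetical grid origin and hypothetical field names (`pickup_lat`, `pickup_lon`, `dropoff_lat`, `dropoff_lon`); the slides do not give the actual grid definition. It maps each lat/long to a 1 km x 1 km square and counts trips per (start square, end square) pair.

```python
import math
from collections import Counter

# Assumptions for illustration only: the grid origin (north-west corner)
# and the field names below are hypothetical, not taken from the slides.
ORIGIN_LAT = 41.0
ORIGIN_LON = -74.5
CELL_KM = 1.0
KM_PER_DEG_LAT = 111.32  # approximate length of one degree of latitude


def to_cell(lat, lon):
    """Map a lat/long to the (column, row) of its 1 km x 1 km square."""
    # One degree of longitude shrinks with latitude; use the origin latitude
    # as a flat-earth approximation, which is adequate at city scale.
    km_per_deg_lon = KM_PER_DEG_LAT * math.cos(math.radians(ORIGIN_LAT))
    col = int((lon - ORIGIN_LON) * km_per_deg_lon // CELL_KM)
    row = int((ORIGIN_LAT - lat) * KM_PER_DEG_LAT // CELL_KM)
    return col, row


def top_routes(trips, n=10):
    """Return the n most frequent (start square, end square) pairs with counts."""
    routes = Counter(
        (
            to_cell(t["pickup_lat"], t["pickup_lon"]),
            to_cell(t["dropoff_lat"], t["dropoff_lon"]),
        )
        for t in trips
    )
    return routes.most_common(n)
```

Calling `top_routes(batch)` once per 10-second batch would give the ranked routes for that window.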
- 11.
Real Time Analytics Demo - Flow
Components:
• Net Pkts
• SpringXD (Fast Ingestion)
• Spark Streaming (In Memory analytics): Filter (incomplete data); Business logic (filter, transformation, distance calculation, sorting); 10s moving window
• Gemfire (In memory data store)
• Pivotal HD (long term storage)
• Terminal Output
• Webapp
Flow:
1. Data is streamed to a network port where SpringXD is listening.
2. The data stream is forwarded to Spark Streaming, which collects data for every 10-second window and applies business logic to each batch of data.
3. After processing, Spark uploads various aggregated metrics to Gemfire for fast retrieval.
4. SpringXD allows multicasting the data stream. One stream goes to Pivotal HD for analytics purposes.
5. A PHP webapp then shows the data and refreshes it every 10 seconds.
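The per-batch business logic named in the flow (filter, distance calculation, sorting) might look roughly like the plain-Python sketch below. This is not the demo's Spark code: the record field names and the choice of the haversine formula for distance are assumptions made for illustration.

```python
import math

# Hypothetical field names; the slides only list the kinds of data streamed.
REQUIRED = ("taxi", "pickup_lat", "pickup_lon", "dropoff_lat", "dropoff_lon", "fare")


def is_complete(record):
    """A record is complete when every required field is present and non-empty."""
    return all(record.get(k) not in (None, "") for k in REQUIRED)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/long points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def process_batch(records):
    """Filter incomplete records, add trip distance, sort longest trip first.

    Returns (processed records, number of incomplete records dropped).
    """
    complete = [r for r in records if is_complete(r)]
    enriched = [
        dict(r, distance_km=haversine_km(
            r["pickup_lat"], r["pickup_lon"], r["dropoff_lat"], r["dropoff_lon"]))
        for r in complete
    ]
    enriched.sort(key=lambda r: r["distance_km"], reverse=True)
    return enriched, len(records) - len(complete)
```

In the demo this logic would run inside Spark Streaming on each 10-second batch; here it is an ordinary function so the steps are easy to follow.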
- 12.
Real Time Analytics Demo – Tools used
• SpringXD → Data Ingestion
• Spark Streaming → In-memory stream computing
• Gemfire → In-memory data store
• HAWQ → Analytic SQL queries on Hadoop
• Google charts → PHP based web application
- 14.
Agile, Open Source Data Storage and Processing
• Analytics-optimized, ODP-based Hadoop distribution
• In-memory, distributed processing from Apache
• Scale-out analytics pipeline management with data ingestion and processing
“As we move to combine all the data generated by our activity and to leverage advanced analytics in real time, there’s no better way to do that than through the flexibility and choice provided by Pivotal’s Big Data Suite.”
– Sylvain LeBorne, EVP Data Platforms
- 15.
Advanced Analytics Power, Speed and Flexibility
• Leading analytic data warehouse
• Most advanced analytical SQL engine on Hadoop
“100X performance improvement analyzing trends among 500 million job postings.”
- 16.
Low Latency, Resilient Data Stores and Messaging
• Distributed, in-memory database for high-scale NoSQL applications
• In-memory, data structure server for fast read and write applications
• Robust messaging for high-scale applications
• Leverage Big Data Suite data services within Pivotal Cloud Foundry applications
• Deploy and manage Big Data Suite with Pivotal Cloud Foundry Foundation
“300% improvement in ticket-serving capacity led to 30% increase in e-ticket sales.”
- 17.
Pivotal Big Data Suite Differentiators
• Open Data Platform
– Pivotal and Hortonworks are the first two members
– Focused on building an ODP core on which Hadoop distributions will work
– Governed by an open governance model
– Flexibility to work on any Hadoop distribution using the ODP core
– Faster releases and third-party product certifications than any single vendor
• Suite of Analytical Products
– HAWQ
– Greenplum
– MADlib
– PivotalR
– GraphLab
- 18.
Spring XD Value
• Unified agile experience for
– Data Ingestion
– Real-time Analytics
– Workflow Orchestration
– Data Export
• Built on existing assets
– Spring Integration
– Spring Batch
– Spring Data (Redis, GemFire, Hadoop)
• XD = 'eXtreme Data'
– or 'x' as a variable (big, fast, diverse)
- 19.
Streams (Spring XD)
Sources: HTTP, Tail, File, Mail, Twitter, Gemfire, Syslog, TCP, UDP, JMS, RabbitMQ, MQTT, Trigger, Kafka, JDBC, Reactor TCP/UDP
Processors: Filter, Transformer, Object-to-JSON, JSON-to-Tuple, Splitter, Aggregator, HTTP Client, Groovy Scripts, Java Code, JPMML Evaluator
Sinks: File, HDFS, JDBC, Mongo, TCP, Log, Mail, RabbitMQ, Gemfire, Splunk, MQTT, Dynamic Router, Counters, Redis, Kafka
- 20.
What is Spark Streaming?
• Extends Spark for doing large-scale stream processing
• Scales to 100s of nodes and achieves second-scale latencies
• Efficient and fault-tolerant stateful stream processing
• Integrates with Spark’s batch and interactive processing
• Provides a simple batch-like API for implementing complex algorithms
- 21.
Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs
Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results
• Chop up the live stream into batches of X seconds
• Spark treats each batch of data as RDDs and processes them using RDD operations
• Finally, the processed results of the RDD operations are returned in batches
- 22.
Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs
• Batch sizes as low as ½ second, latency ~ 1 second
• Potential for combining batch processing and streaming processing in the same system
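The discretized-stream idea can be illustrated without Spark. The sketch below is a plain-Python analogy, not the Spark Streaming API: `discretize` and `run_stream` are hypothetical helper names, and events are assumed to arrive as `(timestamp_seconds, value)` pairs.

```python
from collections import defaultdict


def discretize(events, batch_seconds):
    """Chop (timestamp, value) events into consecutive batches of batch_seconds.

    Simplification: windows that received no events are skipped, so only
    non-empty batches are returned.
    """
    batches = defaultdict(list)
    for ts, value in events:
        batches[int(ts // batch_seconds)].append(value)
    return [batches[k] for k in sorted(batches)]


def run_stream(events, batch_seconds, batch_job):
    """Run the same deterministic batch job on every batch, in time order."""
    return [batch_job(batch) for batch in discretize(events, batch_seconds)]
```

For example, `run_stream(events, 10, sum)` applies one small batch job (`sum`) to each 10-second slice of the stream and returns the results batch by batch.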
- 23.
Pivotal HD Value
• Cost-based Query Optimizer
• ANSI SQL Compliant
• Linear, incremental scalability on commodity/appliance hardware
• Deep Analytic OLAP Queries
• Petabyte Data Storage & Management
• Low latency updates and transactions
• Active-active deployment across WAN
Diagram labels: OLAP, OLTP, SQL, HDFS
- 24.
Pivotal HAWQ Value
• ORCA – New Query Optimizer
• Open Data Format (Parquet)
• Additional Analytics
– PL/pgSQL, PL/R, PL/Python
• Security – Kerberos authentication support
• Updated Diagnostic Tools
• Automated High Availability
- 25.
GemFire – The Enterprise Data Fabric
– A distributed, memory-based data management platform.
– Gartner → In Memory Data Grid (IMDG)
– ACID Transactional behaviour on IMDG
– Provides continuous availability, high performance, and linear scalability for data intensive applications.
– Allows for configurable data consistency.
– Event driven data architecture
- 26.
GemFire – The Enterprise Data Fabric
Pivotal GemFire Data Fabric: Reliable Notification, High Scalability, WAN Distribution, Continuous Querying, Parallel Execution, Continuous Availability, Low Latency, Data Durability
Enterprise data consuming application
Conventional data storage systems: File, Database, Other data storage system
- 28.
Reports the total number of streams processed, the total number of streams that lack data, and how much time Spark took to process the data collected in a 10-second window. This data is retrieved from Gemfire.
Reports the top routes and the number of trips on those routes, with pickup and dropoff times. This data covers the last 10 seconds and refreshes every 10 seconds. This data is retrieved from Gemfire.
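The first report above (total processed, count lacking data, processing time) could be assembled per batch roughly as follows. This is a sketch under assumptions: the metric names are hypothetical, and a plain dict stands in for the Gemfire region that the webapp would read from.

```python
import time


def batch_metrics(records, required_fields, process):
    """Process one batch and return the metrics shown on the report screen."""
    start = time.perf_counter()
    # A record "lacks data" when any required field is missing or empty.
    complete = [
        r for r in records
        if all(r.get(f) not in (None, "") for f in required_fields)
    ]
    process(complete)
    elapsed = time.perf_counter() - start
    return {
        "total_records": len(records),
        "incomplete_records": len(records) - len(complete),
        "processing_seconds": elapsed,
    }


# Stand-in for a Gemfire region: the latest metrics overwrite the previous
# ones, so a reader polling every 10 seconds always sees the last window.
region = {}


def publish(metrics):
    region.update(metrics)
```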
- 31.
Visual representation of the top three routes. Number 1 is blue, number 2 is green, number 3 is pink. The straight pin is the origin square and the tilted pin is the ending square. This data is retrieved from Gemfire.
- 35.
This showcases the power of HAWQ. Recall that the streamed data is also written to HDFS, so we can use SQL queries on the data stored in HDFS.
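As an illustration of the kind of analytical SQL involved, the sketch below runs a "top earning drivers" query. It uses Python's built-in `sqlite3` as a stand-in for HAWQ so that it is self-contained; the `trips` table and its `driver` and `fare` columns are hypothetical, and treating summed fare as "earnings" is an assumption.

```python
import sqlite3


def top_earning_drivers(rows, n=10):
    """Return the n drivers with the highest total fare as (driver, total) pairs.

    rows is an iterable of (driver, fare) tuples loaded into an in-memory table.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trips (driver TEXT, fare REAL)")
    conn.executemany("INSERT INTO trips VALUES (?, ?)", rows)
    query = """
        SELECT driver, SUM(fare) AS total_fare
        FROM trips
        GROUP BY driver
        ORDER BY total_fare DESC
        LIMIT ?
    """
    result = conn.execute(query, (n,)).fetchall()
    conn.close()
    return result
```

The aggregate query itself is standard SQL, so a similar statement should carry over to an ANSI-compliant engine with the table pointing at the data in HDFS.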
- 37.
Pivotal Big Data Suite
✓ Open Data Platform
✓ Suite of Analytical Products
✓ Team of Data Scientists
✓ Open Source Commitment
✓ Enterprise Level Support
✓ Best-in-class in-memory data grid solution
✓ ONE platform for apps, data and mobile services