Apache Spark Empowering the Real-Time Data Driven Enterprise - StreamAnalytix Webinar
- 1. WEBINAR
Apache Spark Empowering the Real-Time
Data Driven Enterprise
October 13, 2017
Mike Gualtieri, VP & Principal Analyst, Forrester (Twitter: mgualtieri)
Anand Venugopal, Product Head & AVP, StreamAnalytix (Twitter: streamanalytix)
- 2. Our Agenda
• Business Value of Streaming Analytics
• Use Cases / Architecture
• Streaming Analytics Platform Criteria
• Spark as a Streaming Technology
• Introducing StreamAnalytix - Visual Spark Studio
• Success Stories and Demo
• Q & A
- 4. The Real-Time Enterprise with Apache Spark
Mike Gualtieri, VP & Principal Analyst
Twitter: @mgualtieri | LinkedIn: mgualtieri
- 6. © 2017 Forrester Research, Inc. Reproduction Prohibited
Data and analytics decision-makers are driven by business priorities
• Better leverage big data and analytics in business…: 52%
• Create a comprehensive strategy for addressing digital…: 53%
• Create a comprehensive digital marketing strategy: 53%
• Better comply with regulations and requirements: 54%
• Improve differentiation in the market: 58%
• Increase influence and brand reach in the market: 64%
• Address rising customer expectations: 64%
• Improve our ability to innovate: 65%
• Reduce costs: 66%
• Improve our products / services: 73%
• Improve the experience of our customers: 75%
Base: 3,005 global data and analytics decision-makers
Source: Global Business Technographics Data And Analytics Online Survey 2016
- 8. Real-time means business time
- 12. How can you warn other drivers that the road is
slippery to avoid a crash right now?
- 13. What are movers and shakers saying about
equities that we cover right now?
- 14. How can you prevent this dude from fleecing
you right now?
- 16. How can IoT data be used to predict machine
failure right now?
- 18. Only the analytical enterprise can compete and win in the age of the customer
Diagram labels: Ideate, Model, Detect, Adapt; Machine Learning; Streaming Analytics; Descriptive Analytics; Prescriptive Analytics; Real-time Analytics; Batch Analytics
- 20. Enterprises have plenty of data from both internal and external sources
Using your best estimate, what is the size of all data stored within your company?
• 10-49 Terabytes: 5%
• 50-99 Terabytes: 12%
• 100-500 Terabytes: 54%
• Greater than 500 Terabytes: 29%
What % of the data available is from internal business applications (ERP and business applications) versus external sources (social, IoT)?
• Internal business data: 49%
• External source data: 51%
Source: Forrester Research, September 2015
Base: 100 US Managers and above currently using Hadoop for processing and analyzing data.
- 30. Enterprises must act on a range of perishable insights to get value from data and analytics
Insight types: Real-time Insights, Operational Insights, Performance Insights, Strategic Insights
• Insight: Shopping for furniture. Action: Recommend cleaning supplies
• Insight: Profit lower than goal. Action: Optimize price
• Insight: Demand forecast strong. Action: Increase inventory
• Insight: Furniture demand high. Action: Expand product line
Axes: Time to Act, Perishability
• Sub-second to seconds; Seconds to hours; Days to weeks; Weeks to years
• Sub-second to seconds; Seconds to hours; Hours to weeks; Weeks to years
- 31. Most analytics operations are too slow
Chart axes: Business Value (Positive to Negative) vs. Time To Action
Stages: Data originated; Analytics performed; Insights gleaned; Decision made; Action taken
Delay outcomes: Outdated insights; Poor decision; Impotent or harmful actions
- 32. You must compress analytics time-to-insight to maximize the value of data
Chart axes: Business Value (Positive to Negative) vs. Time to Action
Chart label: The Real-time Enterprise
- 33. IoT applications must act on a range of perishable insights to get value from big data
Insight types: Real-time Insights, Strategic Insights, Operational Insights, Performance Insights
Axes: Time to Act, Perishability
• Sub-second to seconds; Seconds to hours; Days to weeks; Weeks to years
• Sub-second to seconds; Seconds to hours; Hours to weeks; Weeks to years
Analytics types: Streaming analytics, Batch analytics
- 35. The opportunity to become real-time is high, but
enterprises must redesign applications
- 36. Modern applications infuse analytics to respond in real-time and become smarter
Diagram labels: Streaming Data; Applications (Application Interface, App Logic, Context, Actions); Real-time Context; Programmed Logic; Learned Logic; Machine Learning (Learning); External Context (from other data sources or applications); External Actions (to other data sources or applications)
- 39. Streaming analytics lets applications sense, think, and act
in real-time
Source: Forrester Research
- 40. Streaming analytics is very different from plain vanilla
stream ingestion
Source: Forrester Research
- 41. Streaming analytics solutions must be scalable and have a rich set of stateful analytical operators
Architecture
• Workload scalability
• Workload latency
• Fault tolerance
• Operational management
Stream/event Handling
• Event sequencing
• Enrichment
Analytical Operators
• Transformation
• Correlation
• Time windows
• Complex event processing
Applications Development
• Development tools
• Data connectors
• Business solution accelerators
• Community innovation
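To make the "time windows" operator above concrete, here is a minimal plain-Python sketch of a tumbling (fixed, non-overlapping) window count. It is only an illustration of the idea, not Spark or StreamAnalytix code; the function name, the event layout, and the sample sensor ids are all assumptions.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Count events per key in fixed, non-overlapping time windows.

    `events` is an iterable of (timestamp_seconds, key) pairs.
    Returns a dict mapping (window_start, key) to a count.
    """
    counts = defaultdict(int)
    for timestamp, key in events:
        # Align each event to the start of the window it falls into.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)

# Hypothetical sensor events: (seconds since start, sensor id).
events = [(1, "s1"), (4, "s1"), (9, "s2"), (11, "s1"), (12, "s2"), (19, "s2")]
result = tumbling_window_counts(events, window_seconds=10)
print(result)
```

The counts per window are the "state" that a stateful operator has to keep; a streaming engine would additionally need to decide when a window is complete and handle late events, which this sketch ignores.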
- 47. Provide tools that make it easy to manage and
monitor the platform and its interaction with
technology components
- 48. Offer tools for business users to visualize
insights from real-time data
- 55. Spark and Hadoop often coexist in the same cluster
- 56. Hadoop and Spark are friends, but…
- 59. Spark’s directed acyclic graph (DAG) engine
optimizes parallelization to dramatically reduce
intermediary data movement
- 60. Spark doesn’t need Hadoop; it just needs great compute and great storage
- 61. Spark includes capabilities for streaming analytics and
machine learning!
- 63. Unify batch and streaming analytics to create your real-time enterprise
Diagram labels: Ideate, Model, Detect, Adapt; Machine Learning; Streaming Analytics; Descriptive Analytics; Prescriptive Analytics; Real-time Analytics; Batch Analytics
- 69. A Strong Performer in The Forrester Wave™: Streaming Analytics, Q3 2017
“Impetus has the opportunity to make StreamAnalytix the de facto tooling standard for Spark and future streaming engines…”
Impetus Technologies covers open source bases without the headaches. Take your pick. Impetus’ StreamAnalytix supports Apache Storm and Apache Spark and is architecturally positioned to support other open source streaming analytics software such as Apache Flink.
StreamAnalytix also embeds EsperTech to provide advanced streaming analytics capabilities such as complex event processing.
What also shines about the StreamAnalytix solution is that it includes enterprise-grade visual tooling for both development and deployment of streaming applications.
StreamAnalytix tooling also unifies streaming and batch by supporting arbitrary Spark jobs such as machine learning.
- 70. ENABLING THE REAL TIME ENTERPRISE
1. Real-Time Streaming Data Analytics
2. Makes Spark Easy (Visual Spark Studio)
- 72. Wherever you are – we can make you faster
• Slow processing jobs: HADOOP-MR OR OTHER NON-BIG DATA TECH
• Faster due to in-memory: SPARK BATCH JOBS
• Faster due to micro batch: SPARK STREAMING JOBS
• Fastest: EVENT STREAM PROCESSING
- 73. Use Cases: Real-time Streaming Data Analytics
• Real-time C360 and Churn
• Fraud and Anomaly Detection
• IoT and Log Analytics
• Next Best Offer or Action
• Predictive Maintenance
• Cyber Security
• Real-time Call Center Analytics
- 74. Stack: Real-time Streaming Data Analytics
Learning / Training: Real-time + Batch
PMML, H2O, Python – on Spark
Kafka, Storm, Esper
Scoring: Real-time + Batch
Spark Streaming, Spark ML, MLlib
- 76. Shortage of Spark talent and the urgent need for it
• Spark projects are increasing
• Need to get done quickly, with budget controls
• But, there is a big barrier: Talent - both quality and quantity
• Deep Spark / Scala skills are hard to find
• Big gap between Spark prototype app vs. production grade,
scalable, stable apps that don’t need a lot of baby-sitting
IMPACT: SPARK PROJECTS
• SLOW
• DIFFICULT
• COSTLY
• RISK RIDDEN
- 77. Is the real-time enterprise possible?
With Spark use cases taking too long to deliver?
- 78. Is the real-time enterprise possible?
SOLUTION
• More people? (They don’t exist yet – just gets more messy and costly)
• Ditch Spark and buy proprietary platforms? ($$$$ - That’s going backwards)
• Just bite the bullet, and delay the project? (Oops!)
• Hire outsourcing companies? (Do they really have more skilled people?)
- 79. Is the real-time enterprise possible?
SOLUTION
• Get the right tools
• Make existing people and teams much more productive
- 80. The right Spark tool or platform – does this…
Diagram labels: Visual IDE; Apps (Ingest, ETL, Analytics/ML); Develop + Debug; Deploy; Monitor + Tune; Maintain; Scale; Performance
- 81. Visual Spark Studio
Data360
Visual Spark IDE – Drag and Drop
• Analytics – Feature extract, ML, Time windows
• Transform / Enrich – Filter, Blend, Lookup
• Streaming, Batch + Oozie Workflow
• Load – HDFS, HBase, Hive, Any NoSQL
• View – Real-time Dashboards
• Ingest – Tables, Files, Kafka, APIs
- 87. Deployment diagram
Components: User; Load Balancer (with sticky session); StreamAnalytix Web Container (Tomcat); MySQL/Postgres; RabbitMQ; Hadoop Cluster
Callouts:
1. StreamAnalytix Web Server (CentOS / RHEL 6.x or above)
2. Secured communication via Kerberos
3. Standalone Spark cluster or Spark over YARN
4. StreamAnalytix leverages ZooKeeper for configuration management
- 93. Transforming the Business - means….
• Creating a real-time enterprise
• Dramatic non-linear increase in performance / cost trade off
• Net new capabilities or revenue streams – that were previously not possible
- 94. Top airline boosts customer digital experience
• Funnels all app data to enterprise bus and into StreamAnalytix
• Couldn’t handle the volume and velocity of data earlier
• Analytical capacity went from 3 days to 3 months
• Ability to correlate events and see patterns across a larger time window
• Customer experience issues proactively resolved in real-time
• Foundation laid for real-time ML, predictive and prescriptive analytics
- 95. StreamAnalytix Pipeline Overview: High Level Solution Architecture
Solution
Diagram labels: Raw JSON Data (Multiple Apps, Multiple Services); Kafka (Data Ingestion); StreamAnalytix Spark Pipeline on YARN (Parsing, Filtering, Emitting); X Service data; All Services data; ElasticSearch; UI Data Diagnostic Tool (Data Querying, Data Search, Query Results); User
Highlights
• Input data velocity ~7K /sec
• Contributing to ~5 TB /day
• ES data retention of 30 days
• Custom-built Web UI for queries
• StreamAnalytix implementation providing easy onboarding of additional services and application logs
Benefits
• Diagnostic ability on a larger range of data
• SLAs unaffected: similar or better
• Improved searching with custom Web UI
• Scalable architecture
• Supporting even larger data sets
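As a rough back-of-the-envelope check on the highlights above, the quoted velocity and daily volume imply an average event size. The sketch below assumes decimal terabytes, a steady 7,000 events per second all day, and that everything ingested is retained for the full 30 days; none of these assumptions is stated on the slide, and index overhead and replication are ignored.

```python
# Figures taken from the slide highlights.
EVENTS_PER_SECOND = 7_000        # "~7K /sec"
BYTES_PER_DAY = 5 * 10**12       # "~5 TB /day", assuming 1 TB = 10**12 bytes
RETENTION_DAYS = 30              # "ES Data retention of 30 days"

SECONDS_PER_DAY = 24 * 60 * 60

# Assumes the rate is constant over the whole day.
events_per_day = EVENTS_PER_SECOND * SECONDS_PER_DAY
avg_event_bytes = BYTES_PER_DAY / events_per_day

# Raw data held at any time if all ingested data is retained (an assumption).
retained_tb = 5 * RETENTION_DAYS

print(f"events per day: {events_per_day:,}")
print(f"average event size: {avg_event_bytes / 1000:.1f} KB")
print(f"raw data retained: {retained_tb} TB")
```

Under these assumptions the average event is on the order of 8 KB, which is plausible for JSON application and service logs.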
- 96. Major bank - insider threat detection: 5X boost
• 5X performance gain from the same hardware
• New solution based on StreamAnalytix costs less
• Can onboard 5 times more application traffic for detecting threats
- 98. Pharmacy business processing giant
• Spark-based real-time CDC and flow management
• Sense-change, Ingest, Transform, Load
• 100s of source tables – data from a large number of pharmacies
• Plus some important real-time ETL / Analytics use cases
• Attunity → Kafka → StreamAnalytix / Spark → HDFS, Hive
• 2 mission-critical data pipelines delivered in 1 day, 2 days
• “I could hire a 3-person team instead of a 10-person team”
- 99. Problem Statement
• Oracle-based transactions merge to Hive reporting tables in seconds
ACHIEVEMENT
• Spark pipelines for this task built and deployed in 2 days
• Partner integration with Attunity for CDC
• Consume Oracle multi-table CDC events in real-time
• Capture and reconcile changes into Hive tables
• De-normalize data while landing into Hive
- 100. StreamAnalytix Solution
A complete CDC solution has 3 parts. Each aspect of the solution is modelled as a StreamAnalytix pipeline.
• Data Ingestion and Staging: Stream data from Attunity Replicate for multiple tables from Kafka and store raw data into HDFS
• Data De-normalization: Join transactional data with data at rest and store de-normalized data on HDFS
• Incremental Updates in Hive: Merge previously processed transactional data with new incremental updates
Workflow: Modelled as a StreamAnalytix Oozie workflow to automate execution of Spark pipelines that perform data de-normalization and incremental updates to Hive
- 101. Pipeline #1 - Data ingestion and staging (Streaming)
• Data ingestion via Attunity ‘Channel’: Reads the data from the Attunity target Kafka. This channel is configured to read data feeds as well as metadata from a separate topic
• Data enrichment: Enriches incoming data with metadata information and event timestamp
• HDFS: Stores CDC data on HDFS in the landing area using the out-of-the-box (OOB) HDFS emitter. HDFS files are rotated based on time and size configuration
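The enrichment and file-rotation steps of Pipeline #1 can be sketched in plain Python as follows. This is a simplified illustration, not the StreamAnalytix channel or HDFS emitter; the record layout, the `metadata` lookup, and the threshold parameters are hypothetical.

```python
import time

def enrich(record, table_metadata, now=None):
    """Return a copy of a CDC record with table metadata and an event timestamp."""
    enriched = dict(record)  # leave the incoming record untouched
    enriched["metadata"] = table_metadata.get(record["table"], {})
    enriched["event_ts"] = time.time() if now is None else now
    return enriched

def should_rotate(bytes_written, opened_at, now, max_bytes, max_age_seconds):
    """Rotate the current output file once it is too large or too old."""
    return bytes_written >= max_bytes or (now - opened_at) >= max_age_seconds

# Hypothetical metadata, as if read from the separate metadata topic.
metadata = {"orders": {"primary_key": "order_id"}}
record = {"table": "orders", "op": "UPDATE", "order_id": 42}

enriched = enrich(record, metadata, now=1_000.0)
print(enriched)
print(should_rotate(bytes_written=10, opened_at=0, now=60, max_bytes=100, max_age_seconds=60))
```

Rotating on whichever limit is hit first (size or age) keeps landing files bounded in size while still flushing data during quiet periods.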
- 102. Pipeline #2 - Data de-normalization (Batch)
• HDFS data channel: Ingests incremental data from the staging location written by previous runs of Pipeline #1
• Reads reference data (data at rest) from a fixed HDFS location
• Performs outer join to merge incremental and static data
• Stores de-normalized data to an HDFS directory
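The join in Pipeline #2 can be illustrated with a small plain-Python sketch. The slide only says "outer join", so a full outer join is assumed here; the column names and sample rows are hypothetical, and reference keys are assumed to be unique.

```python
def full_outer_join(incremental, reference, key):
    """Full outer join of two lists of dicts on `key`.

    Matching rows are merged; rows found on only one side are kept,
    with the other side's columns simply absent.
    Assumes `key` is unique within `reference`.
    """
    ref_by_key = {row[key]: row for row in reference}
    joined, seen = [], set()
    for row in incremental:
        k = row[key]
        seen.add(k)
        # Incremental values win if both sides carry the same column.
        joined.append({**ref_by_key.get(k, {}), **row})
    for k, ref_row in ref_by_key.items():
        if k not in seen:
            joined.append(dict(ref_row))
    return joined

# Hypothetical incremental (transactional) and static (reference) data.
incremental = [{"pharmacy_id": 1, "amount": 20}, {"pharmacy_id": 3, "amount": 5}]
reference = [{"pharmacy_id": 1, "city": "Austin"}, {"pharmacy_id": 2, "city": "Boston"}]

joined = full_outer_join(incremental, reference, key="pharmacy_id")
for row in joined:
    print(row)
```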
- 103. Pipeline #3 - Incremental updates in Hive (Batch)
• Hive SQL query to load a managed table from the HDFS incremental data generated from Pipeline #2
• Reconciliation step - Hive “merge into” SQL performs insert, update and delete operations based on the operation in the incremental data
• Clean up step - runs a drop table command on the managed table to clean up processed data, so that it doesn’t get repeatedly processed
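The reconciliation step relies on Hive's MERGE statement; the plain-Python sketch below only mimics its insert, update, and delete semantics on an in-memory table to show the logic. The operation names (`INSERT`, `UPDATE`, `DELETE`) and the row layout are assumptions, not the actual format of the incremental data.

```python
def apply_changes(target, changes, key="id"):
    """Apply CDC operations to `target` (a dict of key -> row) and return a new dict."""
    merged = dict(target)
    for change in changes:
        op, row = change["op"], change["row"]
        if op in ("INSERT", "UPDATE"):
            # Insert a new row or overwrite the existing one.
            merged[row[key]] = row
        elif op == "DELETE":
            # Deleting a key that is already gone is treated as a no-op.
            merged.pop(row[key], None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged

# Hypothetical current table contents and incremental changes.
target = {1: {"id": 1, "qty": 5}, 2: {"id": 2, "qty": 7}}
changes = [
    {"op": "INSERT", "row": {"id": 3, "qty": 1}},
    {"op": "UPDATE", "row": {"id": 1, "qty": 9}},
    {"op": "DELETE", "row": {"id": 2}},
]

reconciled = apply_changes(target, changes)
print(reconciled)
```

Because applying the same change set twice would still yield the same rows here, the clean-up step on the slide matters mainly for cost and for operations that are not idempotent in a real table.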
- 104. Workflow: Oozie Coordinator Job
Oozie orchestration flow created using StreamAnalytix webstudio. It orchestrates pipeline #2 & pipeline #3 into a single Oozie flow that can be scheduled as shown here
- 105. Hosted call center adds new premium product / revenue source
“After a long time we now have a new offering we can go sell proudly to our customers” - Product Manager
• Net new capability for real-time inspection and diagnostics of call quality and customer experience at the contact center
• Dramatically improves end-user service for their B2B customers
- 106. Hosted call center
Challenges solved
• Individual events scattered in different media servers
• Needed to filter a lot of noise in the data at the source itself
• Tech support took too long to correlate and solve issues
• Call center manager had no real-time view on IVR operations
• Needed a variety of call center metrics in real-time
- 107. Hosted call center solution
Diagram labels:
• Callers: Internet Caller (Chat, VOIP, E-mail, Collaboration, Video); Wireless Caller (Live Call, IVR, Voice Mail); Telephone Caller (Live Call, IVR, Voice Mail)
• Networks: Public Internet; Circuit Networks; GATEWAYS; Legacy Call Centers; ACD
• Servers: Core Servers (Routing, Admin, Stats, Logging); Agent Servers (Agent Interaction); Connection Servers (IVR, Voice, Chat, Video, Message); Dialing Servers (Predictive Engine, Campaign Manager)
• Agents: PC AGENT - SOFTPHONE; PC AGENT – IP PHONE; HYBRID AGENT; PHONE AGENTS
• ADMINISTRATOR/SUPERVISOR: Administration, Monitoring Service Creation, Recording Reports
• Legend: IP = Packet, C = Circuit
- 111. Tier 1 Telco deploys new “agent monitoring system”
• 8000+ agent desktops monitored for unethical behaviour in real-time
• Secures customer information
• Ensures top quality service
• Net new capability they couldn’t get earlier at any reasonable price point
- 112. Desktop Analytics
Key Business Metrics :
• Average Handling Time
• First Call Resolution
• Sales Close Rate
• Disconnect Save Rate
1yr benefit is $5.41M
in the form of Call Volume Reduction
30 sec AHT reduction for Tech
15 sec AHT reduction for Sales
- 113. Desktop analytics – desktop data pipeline
Diagram labels: Call Center Agent Machine; Big Data Platform
Stages: Desktop Raw data processing; App activity aggregation; Event activity aggregation; System data enrich and persist; App and Event data enrich and persist
• Consume raw ACD events
• Parse and split the bulk JSON message into individual events
• Process data for App, Event, and System events
• Aggregate data: mini batching, data sequencing, enriching data with agent hierarchy
• Persist data into Hive, HBase, and Elastic
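The parse, split, and aggregate steps above can be sketched in plain Python as shown here. The bulk message layout (an `events` array with `agent_id` and `type` fields) is purely hypothetical; the slide does not describe the real schema.

```python
import json
from collections import Counter

def split_bulk_message(bulk_json):
    """Parse one bulk JSON message and return its individual events."""
    payload = json.loads(bulk_json)
    return payload["events"]  # hypothetical field name

def aggregate_by_agent_and_type(events):
    """Count events per (agent, event type) pair within one mini batch."""
    return Counter((event["agent_id"], event["type"]) for event in events)

# Hypothetical bulk message carrying App, System, and Event records.
bulk = json.dumps({"events": [
    {"agent_id": "a1", "type": "app"},
    {"agent_id": "a1", "type": "app"},
    {"agent_id": "a1", "type": "system"},
    {"agent_id": "a2", "type": "event"},
]})

events = split_bulk_message(bulk)
totals = aggregate_by_agent_and_type(events)
print(totals)
```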
- 114. Desktop analytics - data volume
Pilot
Source System | Data Type | No. of Agents | Records/Day
Desktop Data | Raw | 9 | 69461
Desktop Data | Aggregated | 9 | 45428
Call Data | Raw | 7000 | 900000
Call Data | Aggregated | 7000 | 900000
GA
Source System | Data Type | No. of Agents | Records/Day
Desktop Data | Raw | 7000 | 60M
Desktop Data | Aggregated | 7000 | 20M
Call Data | Raw | 7000 | 900000
Call Data | Aggregated | 7000 | 900000
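A quick arithmetic check on the desktop data volumes, assuming the first set of rows is the pilot and the second is GA (the slide labels suggest this but the extraction does not make it certain), and reading "60M" and "20M" as 60 and 20 million records:

```python
# Desktop data figures from the slide; the pilot/GA assignment is an assumption.
pilot_agents, pilot_raw, pilot_aggregated = 9, 69_461, 45_428
ga_agents, ga_raw, ga_aggregated = 7_000, 60_000_000, 20_000_000

# Raw desktop records produced per agent per day.
pilot_per_agent = pilot_raw / pilot_agents
ga_per_agent = ga_raw / ga_agents

# Share of raw records that remain after aggregation.
pilot_ratio = pilot_aggregated / pilot_raw
ga_ratio = ga_aggregated / ga_raw

print(f"pilot: {pilot_per_agent:.0f} records/agent/day, aggregated/raw = {pilot_ratio:.2f}")
print(f"GA:    {ga_per_agent:.0f} records/agent/day, aggregated/raw = {ga_ratio:.2f}")
```

Under these readings the per-agent raw volume is similar in both phases (roughly 8,000 records per agent per day), while aggregation reduces the record count more strongly at GA scale.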