Operational Analytics Using Spark and NoSQL Data Stores
- 1. Delivering Operational Analytics Using
Spark and NoSQL Data Stores
Mike Ferguson
Managing Director
Intelligent Business Strategies
Basho Webinar
January, 2016
- 2.
Copyright © Intelligent Business Strategies 1992-2016
About Mike Ferguson
Mike Ferguson is Managing Director of Intelligent
Business Strategies Limited. As an analyst and
consultant he specializes in business
intelligence, data management and enterprise
business integration. With over 34 years of IT
experience, Mike has consulted for dozens of
companies, spoken at events all over the world
and written numerous articles. Formerly he was
a principal and co-founder of Codd and Date
Europe Limited – the inventors of the Relational
Model, a Chief Architect at Teradata on the
Teradata DBMS and European Managing
Director of DataBase Associates.
www.intelligentbusiness.biz
mferguson@intelligentbusiness.biz
Twitter: @mikeferguson1
Tel/Fax (+44)1625 520700
- 3.
Topics
The changing landscape of operational and analytical systems
Scalable operational applications and NoSQL data stores
Big data analytics – The era of Hadoop and Spark
The value of operational analytics
Operational analytics using The Basho Data Platform and
Apache Spark
Conclusions
- 5.
The Application Processing Spectrum
Source: BI-Research Copyright © BI-Research, 2013-Present
- 6.
Big Data Processing – There Is A Growing Number of Data
Stores Optimized for Operational or Analytical Workloads
OLTP RDBMS
NoSQL DBMS
• ACID support is missing in many NoSQL DBMSs
• Can you live with losing a transaction?
• OK for sensor data, for example
Analytical RDBMS
- 7.
A Closed Loop Is Still Needed – It Just Now Also Includes NoSQL Technologies
[Diagram: operational applications and scalable operational applications exchange new data and new insights with analytical systems and scalable analytical systems; both sides sit on relational & NoSQL systems.]
- 8.
Topics – Where Are We?
The changing landscape of operational and analytical systems
Scalable operational applications and NoSQL data stores
Big data analytics – The era of Hadoop and Spark
The value of operational analytics
Operational analytics using The Basho Data Platform and
Apache Spark
Conclusions
- 9.
Demand For Scalable Operational Systems With High Write Processing Is Driving Demand for NoSQL DBMS
[Diagram: operational applications and scalable operational applications exchange new data and new insights with analytical systems and scalable analytical systems.]
- 10.
Success of Big Data Analytics Depends On Being Able
To Scale To Capture High Velocity, High Volume Data
Successful big data analytics requires
1. Ability to scale operational systems to capture, stream and store
the required transactional and non-transactional data
– Support peak transaction rates
– Support peak capture of non-transactional data e.g. shopping
cart data
– Support peak data arrival rates e.g. sensor data
– Support peak ingestion rates
2. Scalable Big Data analytics
3. Closed loop integration of analytical systems back into core
operational transaction processing systems
– Make prescriptive insights available to all that need them to
continuously optimise operations and maximise effectiveness
- 11.
E-Business And Mobile Means Operational Systems Are Having To Scale To Support Masses Of Concurrent Users
[Diagram: many more users reach operational and transactional applications through the WWW and mobile devices; the applications run on a cluster over partitioned data, with web logs as a further data source.]
- 12.
Example Operational Applications Requiring Scalability
That Are Fuelling Demand For NoSQL DBMSs
Web and mobile commerce
• Shopping cart data, session storage
Internet of Things (IoT) and other time series applications
• Need to scale as the number of devices / things increases
Mobile gaming
• Player profile data, session storage, game performance stats
Healthcare
• Store unstructured healthcare digital imaging and video data
Social network applications
- 13.
Types Of NoSQL Database And Product Examples
NoSQL Database Type | NoSQL Product Examples
Key value store | Aerospike, Amazon DynamoDB, Basho Riak KV, Redis, MemcacheDB, Voldemort
Document database | CouchDB, IBM DB2 (XML & JSON), MongoDB, IBM Cloudant, MarkLogic, Terrastore, JackRabbit, RaptorDB
Column family database | Cassandra, DataStax, Google BigTable, Hadoop HBase, Hypertable, HPCC, Amazon SimpleDB
Graph database | AllegroGraph, GraphBase, Horton, InfiniteGraph, IBM DB2, Neo4j, Oracle Spatial and Graph, Titan, Cray Research, Teradata Aster
Multi-modal database | ArangoDB, CortexDB, MarkLogic, MongoDB, FoundationDB
Some NoSQL databases are aimed at write processing (data collection)
Others are aimed at specific big data analytical workloads
Issues include lack of standard APIs, a weak or missing optimizer and non-immediate consistency
- 14.
Global NoSQL Market Size And Forecast 2013 - 2020
Source: https://www.alliedmarketresearch.com/NoSQL-market
- 15.
Key Value Stores Can Store Any Data - Examples
Key Value
10034 John Smith
82771
93441
{ "firstName": "Wayne",
  "lastName": "Rooney",
  "age": 25,
  "address": {
    "streetAddress": "21 Sir Matt Busby Way",
    "city": "Manchester",
    "country": "England",
    "postalCode": "M1 6DY"
  },
  "phoneNumbers": [
    { "type": "home",
      "number": "0161-123-1234"
    },
    {
      "type": "mobile",
      "number": "07779-123234"
    }
  ]
}
Key value store features:
• Very simple to understand
• Very scalable - hash partitioning
• Data access is via the key
• The application controls what’s stored in
the value
• Very fast performance
• Acceleration via in-memory processing
• Eventual consistency
• Often no support for data types
• No built-in referential integrity
• No understanding of data relationships
• The application must understand any
relationships in data
• Programmer is in complete control
• Application must navigate complex data
Use for specific operational applications
- 16.
Key Value Stores
– The Key Is Hashed To Partition The Data
Source: Microsoft
The value can be anything
• A single data field
• A JSON document
• An XML document
• Text
• Image……
Key Value
Easy to partition (hash the key)
Very fast to retrieve and store data
The application needs to know
• What is stored in the VALUE
• How the value is structured
• How to process the value
Key needs to be unique
Can use HTTP to read and write data
e.g. curl -XPUT, curl -XGET
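As a rough illustration of the key value store features listed above, the following Python sketch models a store whose values are opaque bytes and whose keys are hash partitioned. The class name `TinyKVStore`, the partition count and the choice of SHA-1 with a modulo are assumptions made for this example; they do not describe any specific product.

```python
import hashlib
import json


class TinyKVStore:
    """Toy key value store: values are opaque bytes, keys are hash partitioned."""

    def __init__(self, num_partitions=4):
        # Each partition is a dict here; in a real store it would live on a node.
        self.partitions = [dict() for _ in range(num_partitions)]

    def _partition_for(self, key):
        # Hash the key and map the digest onto a partition number.
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % len(self.partitions)

    def put(self, key, value):
        if not isinstance(value, bytes):
            raise TypeError("the store only accepts bytes; the application serialises")
        self.partitions[self._partition_for(key)][key] = value

    def get(self, key):
        return self.partitions[self._partition_for(key)].get(key)


store = TinyKVStore()

# The application decides what goes in the value: a plain string...
store.put("10034", "John Smith".encode("utf-8"))

# ...or a JSON document. The store does not know or care which.
profile = {"firstName": "Wayne", "lastName": "Rooney", "city": "Manchester"}
store.put("93441", json.dumps(profile).encode("utf-8"))

# On read, the application must know how to interpret each value.
name = store.get("10034").decode("utf-8")
city = json.loads(store.get("93441"))["city"]
print(name, city)
```

The store never inspects the value, so any understanding of structure or relationships has to live in the application code that calls `get`.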
- 17.
Key Value Stores – A Basho Riak KV Cluster Has
Virtual Nodes Running on Physical Nodes
Source: Basho
SHA1 is a hashing function that hashes a key to determine the node
Riak hash partitions and replicates data (3 copies of the data is the default)
[Diagram: a client request (e.g. PUT, POST, GET) supplies the key and the value; the key is hashed to choose the node]
Nodes can be added to and removed from a Riak cluster while it is running
- 18.
Key Value Stores - A Basho Riak KV Ring
Riak uses partitions (64 partitions
are the default) and also replicates
the partitions for high availability
Source: Basho
Writing replicas
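The ring idea described on the two Riak slides above can be sketched in a few lines of Python. This is a simplified model, not Riak's implementation: the round-robin assignment of partitions to physical nodes, the way bucket and key are joined before hashing, and all function names are assumptions made for illustration. Only the SHA-1 hash, the 64 partitions and the 3 replicas come from the slides.

```python
import hashlib

RING_SIZE = 64          # number of partitions (the default quoted above)
N_VAL = 3               # number of replicas (the default quoted above)
KEYSPACE = 2 ** 160     # SHA-1 produces a 160-bit hash
PARTITION_WIDTH = KEYSPACE // RING_SIZE


def build_ring(physical_nodes):
    """Assign each partition (virtual node) to a physical node, round-robin."""
    return [physical_nodes[i % len(physical_nodes)] for i in range(RING_SIZE)]


def preference_list(ring, bucket, key, n_val=N_VAL):
    """Return (partition, physical node) pairs that hold the replicas of a key."""
    # Assumption: bucket and key are simply joined with "/" before hashing.
    digest = hashlib.sha1(f"{bucket}/{key}".encode("utf-8")).digest()
    first = int.from_bytes(digest, "big") // PARTITION_WIDTH
    # Replicas go to the partition that owns the hash and the next ones on the ring.
    partitions = [(first + i) % RING_SIZE for i in range(n_val)]
    return [(p, ring[p]) for p in partitions]


ring = build_ring(["node1", "node2", "node3", "node4"])
replicas = preference_list(ring, "customers", "93441")
for partition, node in replicas:
    print(f"partition {partition:2d} on {node}")
```

Because neighbouring partitions belong to different physical nodes in this model, the three replicas of a key land on three different machines, which is what gives the high availability mentioned above.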
- 19.
Topics – Where Are We?
The changing landscape of operational and analytical systems
Scalable operational applications and NoSQL data stores
Big data analytics – The era of Hadoop and Spark
The value of operational analytics
Operational analytics using The Basho Data Platform and
Apache Spark
Conclusions
- 20.
Demand For Scalable Analytical Systems Is Also Exploding
[Diagram: operational applications and scalable operational applications exchange new data and new insights with analytical systems and scalable analytical systems.]
- 21.
A Hadoop System
[Diagram: Hadoop stack. Labels: analytic application (Java, Python, Scala); PIG latin scripts; 3rd party SQL on Hadoop; BI tools (SQL); index and partition; Storm; YARN; MapReduce, Tez, Spark; HBase; HDFS files; webHDFS (an HTTP interface to HDFS with REST APIs); APIs to HBase, APIs to HDFS; "executes on MR, Tez & Spark".]
- 22.
Faster Execution Engines For Analytic Applications – Apache Spark
[Diagram: Hadoop stack. Labels: analytic application (Java, Python, Scala); PIG latin scripts; 3rd party SQL on Hadoop; BI tools (SQL); index and partition; Storm; YARN; MapReduce, Tez, Spark; HBase; HDFS files; webHDFS (an HTTP interface to HDFS with REST APIs); APIs to HBase, APIs to HDFS.]
- 23.
Spark Is A General Purpose In-Memory Execution Framework That Can Run With Or Without Hadoop
[Diagram: Spark alongside MapReduce, Tez and Storm on YARN, over HBase and HDFS files; Spark can also sit over HDFS, S3 and other stores, optionally through Tachyon.]
Tachyon is an HDFS-compatible in-memory file system. You can use Spark with or without Tachyon.
The Spark stack is integrated, e.g. you can use Spark Streaming, Spark SQL and MLlib together in the same application
Spark stack: Applications / BI Tools (SQL, Python, Scala, Java, R) on Spark Streaming, Spark SQL + DataFrames, GraphX (graph computation) and MLlib (machine learning), all on Spark Core
- 24.
Apache Spark
Applications / BI Tools use SQL, Python, Scala, Java and R
• Spark Core: provides distributed task dispatching, scheduling and basic I/O
• Spark Streaming: for analysis of real-time streaming data
• MLlib (machine learning): a library of pre-built analytic algorithms that can run in parallel across a Spark cluster
• GraphX (graph computation): a graph analysis engine running on Spark
• Spark SQL + DataFrames: query structured data in Spark apps using SQL or a DataFrames API
- 25.
Spark In-Memory Analytic Applications Can Do A Lot More Than Map Reduce Processing
Keep only one copy in memory in a JVM
Track lineage of job operators used to derive the data
Use the lineage to re-compute the data if there is a failure
No MapReduce execution needed
• Just Spark APIs
[Diagram (source: Amplab): a Spark application reads files from HDFS and chains map, map, join, filter and reduce operators; it runs on the Spark stack (Spark Core, Spark Streaming, Spark SQL + DataFrames, GraphX, MLlib).]
- 26.
Spark Applications Operate On RDDs (Data) – You Can Do A Lot More Than Map and Reduce
RDD = Resilient Distributed Dataset
An RDD is a read-only, partitioned collection of records
RDDs can only be created through operators on either
1. A dataset in stable storage or
2. Other existing RDDs
Spark operators available to Spark applications:
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin,
reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip,
sample, take, first, partitionBy, mapWith, pipe, save
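To make the RDD properties above more concrete (read-only, partitioned, derived only from stable storage or other RDDs, recomputed from lineage after a failure), here is a toy model in plain Python. It is not the Spark API: `ToyRDD`, `from_storage` and `lose_partition` are hypothetical names invented for this sketch, and real Spark handles scheduling, shuffles and fault recovery very differently.

```python
from functools import reduce as fold_left


class ToyRDD:
    """Read-only, partitioned collection that remembers how it was derived."""

    def __init__(self, compute_partition, num_partitions, lineage):
        self._compute_partition = compute_partition  # partition index -> list
        self.num_partitions = num_partitions
        self.lineage = lineage                       # operator names, oldest first
        self._cache = {}

    @classmethod
    def from_storage(cls, partitions):
        # 1. Create an RDD from a dataset in stable storage (a list of lists here).
        stable = [list(p) for p in partitions]
        return cls(lambda i: list(stable[i]), len(stable), ["from_storage"])

    def map(self, fn):
        # 2. Create an RDD from an existing RDD; the parent is never modified.
        return ToyRDD(lambda i: [fn(x) for x in self.partition(i)],
                      self.num_partitions, self.lineage + ["map"])

    def filter(self, pred):
        return ToyRDD(lambda i: [x for x in self.partition(i) if pred(x)],
                      self.num_partitions, self.lineage + ["filter"])

    def partition(self, i):
        # Compute on first use, then keep one in-memory copy.
        if i not in self._cache:
            self._cache[i] = self._compute_partition(i)
        return self._cache[i]

    def lose_partition(self, i):
        # Simulate a failure that drops the in-memory copy of one partition.
        self._cache.pop(i, None)

    def reduce(self, fn):
        items = [x for i in range(self.num_partitions) for x in self.partition(i)]
        return fold_left(fn, items)


readings = ToyRDD.from_storage([[1, 2, 3], [4, 5, 6]])
evens_squared = readings.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

total = evens_squared.reduce(lambda a, b: a + b)   # 4 + 16 + 36
evens_squared.lose_partition(1)                    # pretend a node failed
total_after_failure = evens_squared.reduce(lambda a, b: a + b)
print(evens_squared.lineage, total, total_after_failure)
```

The lost partition is rebuilt by re-running the recorded operators against the parent data, so no second copy of the derived data has to be kept.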
- 27.
Simplifying Access To Data Using Spark SQL and Spark DataFrames
A DataFrame is a distributed collection of data organized into named columns
Conceptually equivalent to a relational DBMS table or a data frame in R/Python
DataFrames can be constructed from a wide array of sources:
• Structured data files
• Hive tables
• External databases
• Existing RDDs
Uses schema on read
Image source: Databricks.com
Note that Spark data sources can be relational and NoSQL DBMSs
- 28.
Spark Is Going Over The Top of Multiple Data Stores For Scalable In-Memory Analytics Across The Entire Ecosystem
[Diagram: the Spark stack (Spark Core, Spark Streaming, Spark SQL + DataFrames, GraphX, MLlib; SQL, Python, Scala, Java, R) sits over streaming data (streaming analytics), a Hadoop data store (advanced analytics on multi-structured data), data warehouse RDBMSs (EDW, DW & marts), NoSQL DBMSs and operational NoSQL data stores, e.g. Cassandra, Basho Riak.]
- 29.
Topics – Where Are We?
The changing landscape of operational and analytical systems
Scalable operational applications and NoSQL data stores
Big data analytics – The era of Hadoop and Spark
The value of operational analytics
Operational analytics using The Basho Data Platform and
Apache Spark
Conclusions
- 30.
Key Business Drivers And Objectives For
Operational Analytics
Combine operational and analytical processing at scale to:
• Improve customer engagement
• Reduce risk
• Avoid unplanned operational cost
• Optimise operational effectiveness
Use BI/Analytics to drive and guide business operations to help
achieve specific target business goals and KPI targets
Automated analysis of operational events as they happen
Automated alerts
On-demand recommendations
Integrate BI/Analytics into every business process to:
• Create an ‘insight-driven’ employee base
• Enable mass execution of business strategy by facilitating
mass contribution towards achieving specific business goals
- 31.
Five Types Of Operational BI/Analytics
1. Simple operational reporting of current position/state e.g.
session state
2. Situational awareness via visualisation of live operational data
typically on dashboards
3. On-demand analytics of live operational and/or historical data
to improve operational decisions and effectiveness
4. On-demand recommendations for guidance
5. Event stream processing to monitor, automatically analyse and
act on events in real-time to prevent problems arising and to
optimise business operations
- 32.
Operational Analytics – What’s The Difference Between On-Demand Vs Event-Driven Analysis?
On-Demand: an application calls an analytical service (query, report, model, recommendation) offered by BI/analytics services
Event-Driven: a message, file arrival, pattern or trigger in streaming data invokes an analytical service (query, report, model, recommendation) offered by BI/analytics apps and services
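The difference between the two invocation styles above can be shown with a small Python sketch. Everything in it is hypothetical: `recommend` stands in for any analytical service, and `EventBus` is a minimal publish/subscribe helper written for this example, not a real messaging product.

```python
def recommend(customer):
    """Stand-in analytical service: returns a recommendation for one customer."""
    if customer["basket_value"] >= 100:
        return "offer free delivery"
    return "offer 10% discount"


# On-demand: the application asks for the analysis when it needs it.
on_demand_result = recommend({"id": "c1", "basket_value": 120})


class EventBus:
    """Very small publish/subscribe mechanism standing in for a message or trigger."""

    def __init__(self):
        self.handlers = []

    def subscribe(self, handler):
        self.handlers.append(handler)

    def publish(self, event):
        for handler in self.handlers:
            handler(event)


# Event-driven: the same service runs automatically whenever an event arrives.
event_driven_results = []
bus = EventBus()
bus.subscribe(lambda event: event_driven_results.append((event["id"], recommend(event))))

bus.publish({"id": "c2", "basket_value": 40})
bus.publish({"id": "c3", "basket_value": 250})
print(on_demand_result, event_driven_results)
```

The analytical logic is identical in both cases; only who initiates the call changes.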
- 33.
Analytics Need To Be Integrated Into Business Processes To Optimize Business Operations
[Diagram: integrated on-demand business intelligence underpins integrated intelligent business operations. Customers, employees, and partners & suppliers interact through customer relationship management, operations management and supply chain management, spanning marketing, sales, service/support, operations, finance/accounting, procurement, inventory control, shipping/distribution and human resources.]
- 34.
High Value Application Use Cases for Streaming Analytics
[Diagram: application use cases arranged around streaming analytics]
Source: Adapted from a slide by IBM
- 35.
Responding To Events And Event Patterns Means Reducing Action Time
Action distance or action time: the time between an event occurring and action being taken, which should be as close to zero as possible
Steps shown: event-driven data integration, automated analysis, automated decision and action taking
Source: Dr Richard Hackathorn
- 36.
With Event Stream Processing The Architecture Has To Change
[Diagram contrasting two pipelines.
Classic use of analytics: data cleansing & integration, store data, query/analyze (human).
Event / stream processing: data cleansing & integration, query/analyze (automated), act (automated or human), store data.]
- 37.
Time Series Analysis – Query Processing Uses a Time Window to Look at Continuously Streaming Data
Time window (T1 to T2), e.g. 5 seconds, 30 seconds or 5 minutes
Continuous time series queries (CQs) operate on high frequency data as it flows by, looking for patterns and correlations
A set of continuous queries resides in the stream processing server to process incoming data
Data is pushed into the queries
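A continuous query over a time window can be sketched as follows. This is a minimal single-process illustration in Python, assuming timestamps in seconds that arrive in order; the class name `SlidingWindowAverage` and the sample stream are invented for the example and do not correspond to any stream processing product.

```python
from collections import deque


class SlidingWindowAverage:
    """Continuous query: average of the readings seen in the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()   # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def push(self, timestamp, value):
        # Data is pushed into the query as it arrives.
        self.events.append((timestamp, value))
        self.total += value
        # Evict everything that has fallen out of the time window.
        while self.events and self.events[0][0] <= timestamp - self.window_seconds:
            _, old_value = self.events.popleft()
            self.total -= old_value
        return self.total / len(self.events)


query = SlidingWindowAverage(window_seconds=5)
stream = [(0, 10.0), (1, 12.0), (3, 14.0), (6, 30.0), (7, 32.0)]
averages = [query.push(ts, value) for ts, value in stream]
print(averages)
```

The query holds only the events inside the current window, so memory use depends on the window length and arrival rate rather than on the total volume of the stream.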
- 38.
Key Requirements For Operational Analytics
On-demand, event-driven and scheduled invocation of analytics
Monitor streaming events as they happen via automatic analysis
Automatic analysis via predictive and statistical models
Automatic interpretation of predictive/statistical model outcomes
Rule-driven automatic actions to automate decision making
• E.g. Alerts, recommendations, transaction and process invocation
Integrate operational analytics into operational applications
Operational reporting
Scale to support large numbers of events and concurrent users
Store relevant data together to speed up analytics execution
Run predictive and statistical models close to the data
Run analytics on a 24x365 basis
- 39.
The Importance of In Memory Processing
Massively parallel in-memory processing
is mission critical for scalable operational
systems and operational analytics
Why?
• Performance is critical
• Large number of concurrent user requests
for on-demand analytics
• Large number of concurrent application
requests for on-demand analytics
• Event driven operational analytics on very
high velocity data needs memory
- 40.
Topics – Where Are We?
The changing landscape of operational and analytical systems
Scalable operational applications and NoSQL data stores
Big data analytics – The era of Hadoop and Spark
The value of operational analytics
Operational analytics using The Basho Data Platform and
Apache Spark
Conclusions
- 41.
The Basho Data Platform
Source: Basho
Storage instances: Riak KV (key/value), Riak S2 (object storage), Riak TS (time series), document store, columnar, graph
Service instances: Spark, Redis (caching), Solr, Elastic Search, web services, 3rd party web services & integrations
Core services: replicate & synchronize, message routing, cluster management & monitoring, logging & analytics, internal data store
Legend: Basho developed, Basho integrated
Annotations:
• Riak KV: hash partitioning, cluster scalability, triple replication, multi-datacentre replication
• Riak TS: co-locates time-series data, high availability, scalability
• Replicate & synchronize: replicates and synchronises data within and across Riak KV, Redis and Spark clusters
• Cluster management: automated cluster management simplifies administration
• Redis: integrated in-memory caching for faster application performance
• Solr: search based query processing on Riak data using Solr indexes
• Spark: integrated in-memory analytics for Riak KV and Riak TS data
- 42.
Riak TS Is A New Basho Storage Instance
Optimised for Time Series Data And Analytics
A distributed NoSQL database optimised for time series
sequenced, unstructured data capture, aggregation and
analysis from the Internet of Things (IoT)
High availability
Scalability - add nodes to a cluster without sharding
Automated and uniform data distribution across the cluster
• Time or geohash based data co-location to ensure time series data
is located on the same node
Data validation on input
APIs and client libraries for Java, Ruby, Python, Go, Erlang,
Node.js and .NET.
Spark integration for operational analysis of time series data.
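Time-based co-location can be illustrated with a short Python sketch: readings from the same device that fall in the same time quantum share a partition key, so they hash to the same partition. The 15-minute quantum, the partition count, the SHA-1 plus modulo placement and the function names are all assumptions for this example rather than a description of Riak TS internals.

```python
import hashlib

QUANTUM_SECONDS = 15 * 60   # assumed quantum: 15 minutes
NUM_PARTITIONS = 64         # assumed partition count


def partition_key(device_id, timestamp):
    """Group readings by device and by the time quantum the timestamp falls in."""
    quantum_start = timestamp - (timestamp % QUANTUM_SECONDS)
    return (device_id, quantum_start)


def partition_for(device_id, timestamp):
    """Hash the partition key, not the raw timestamp, to pick a partition."""
    key = "{}:{}".format(*partition_key(device_id, timestamp))
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % NUM_PARTITIONS


# Three readings from one sensor: two in the same 15-minute quantum, one later.
t0 = 1_452_000_000          # an arbitrary epoch timestamp in seconds
same_quantum_a = partition_for("sensor-7", t0 + 10)
same_quantum_b = partition_for("sensor-7", t0 + 500)
print(partition_key("sensor-7", t0 + 10), partition_key("sensor-7", t0 + 500))
print(partition_key("sensor-7", t0 + 2000))
```

A range query over one quantum then touches a single partition instead of fanning out across the cluster, which is the point of co-locating time series data.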
- 43.
Operational Analytics Using The Basho Data Platform And Apache Spark
[Diagram: Operational Analytics Using The Basho Data Platform. A scalable operational application works on hash partitioned data. Spark (Spark Core with Spark Streaming, BlinkDB, Spark SQL, GraphX, SparkR and MLlib) analyses the data, with a write back path and a store of recent data. An operational analytics web service, an operational analytic application and a BI tool sit on top.]
- 44.
Operational Analytics Using The Basho Data
Platform And Apache Spark - 2
• Can develop Spark operational analytic applications on low latency data stored in Basho Riak KV
• Spark-based analytical web services can be invoked on demand to analyse data in Riak KV
• Use on-demand Spark jobs for historical analysis and predictions
• Insights produced from analysing Riak KV data can be written back to Riak KV for use by other applications
• A form of closed-loop processing
• Spark Streaming can be used to calculate rollups and detect abnormalities on streaming sensor data
• Recent data can be kept in Redis for dashboard visualization
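The closed loop in the bullets above (roll up sensor data, detect abnormalities, write insights back, keep recent data for a dashboard) can be sketched in plain Python. No Spark, Riak or Redis client is used here: the dict `riak_kv` and the bounded deque `redis_recent` are stand-ins, and the key layout, the two-standard-deviation rule and the function name `process_batch` are assumptions for illustration.

```python
from collections import deque
from statistics import mean, pstdev

riak_kv = {}                       # stand-in for Riak KV: key -> value
redis_recent = deque(maxlen=5)     # stand-in for Redis: only the most recent readings


def process_batch(sensor_id, batch_id, readings, threshold=2.0):
    """One micro-batch: roll up the readings, flag outliers, write results back."""
    avg = mean(readings)
    spread = pstdev(readings)
    # Flag readings more than `threshold` standard deviations from the batch mean.
    abnormal = [r for r in readings if spread > 0 and abs(r - avg) > threshold * spread]
    rollup = {"count": len(readings), "avg": avg, "min": min(readings),
              "max": max(readings), "abnormal": abnormal}
    # Write the insight back so other applications can read it (closed loop).
    riak_kv[f"rollup/{sensor_id}/{batch_id}"] = rollup
    # Keep only recent raw readings for the dashboard.
    redis_recent.extend(readings)
    return rollup


process_batch("sensor-7", 1, [20.0, 21.0, 19.0, 20.0, 20.0, 21.0, 19.0, 20.0])
result = process_batch("sensor-7", 2, [20.0, 21.0, 19.0, 20.0, 20.0, 21.0, 19.0, 60.0])
print(result["abnormal"], list(redis_recent))
```

An operational application would read the stored rollups by key, while a dashboard would read only the small recent-data cache.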
- 45.
Topics – Where Are We?
The changing landscape of operational and analytical systems
Scalable operational applications and NoSQL data stores
Big data analytics – The era of Hadoop and Spark
The value of operational analytics
Operational analytics using The Basho Data Platform and
Apache Spark
Conclusions
- 46.
Conclusions
As operational application processing scales, so too does
the need to scale operational analytics
Basho is using in-memory processing to accelerate
operational applications (via Redis) and to introduce
scalable operational analytics (via Spark) into these
applications
New scalable ‘smart’ operational applications are therefore
becoming possible with careful design in a NoSQL
environment
- 47.
Thank You!