Delivering Operational Analytics Using
Spark and NoSQL Data Stores
Mike Ferguson
Managing Director
Intelligent Business Strategies
Basho Webinar
January, 2016
Copyright © Intelligent Business Strategies 1992-2016
About Mike Ferguson
Mike Ferguson is Managing Director of Intelligent
Business Strategies Limited. As an analyst and
consultant he specializes in business
intelligence, data management and enterprise
business integration. With over 34 years of IT
experience, Mike has consulted for dozens of
companies, spoken at events all over the world
and written numerous articles. Formerly he was
a principal and co-founder of Codd and Date
Europe Limited – the inventors of the Relational
Model, a Chief Architect at Teradata on the
Teradata DBMS and European Managing
Director of DataBase Associates.
www.intelligentbusiness.biz
mferguson@intelligentbusiness.biz
Twitter: @mikeferguson1
Tel/Fax (+44)1625 520700
Topics
▪ The changing landscape of operational and analytical systems
▪ Scalable operational applications and NoSQL data stores
▪ Big data analytics – The era of Hadoop and Spark
▪ The value of operational analytics
▪ Operational analytics using The Basho Data Platform and Apache Spark
▪ Conclusions
The Application Processing Spectrum
Source: BI-Research Copyright © BI-Research, 2013-Present
Big Data Processing – There Is A Growing Number of Data
Stores Optimized for Operational or Analytical Workloads
[Diagram: data stores optimized for operational workloads (OLTP RDBMS, NoSQL DBMS) and for analytical workloads (analytical RDBMS, NoSQL)]
• ACID support is missing in many NoSQL DBMSs
• Can you live with losing a transaction?
• OK for sensor data, for example
A Closed Loop Is Still Needed – It Just Now Also Includes NoSQL Technologies
[Diagram: operational applications and scalable operational applications (relational & NoSQL systems) pass data and new data to analytical systems and scalable analytical systems (relational & NoSQL systems), which feed new insights back to the operational side]
Topics – Where Are We?
▪ The changing landscape of operational and analytical systems
▶ Scalable operational applications and NoSQL data stores
▪ Big data analytics – The era of Hadoop and Spark
▪ The value of operational analytics
▪ Operational analytics using The Basho Data Platform and Apache Spark
▪ Conclusions
Demand For Scalable Operational Systems With High Write Processing Is Driving Demand for NoSQL DBMS
[Diagram: the closed loop between operational applications, scalable operational applications, analytical systems and scalable analytical systems, exchanging data, new data and new insights]
Success of Big Data Analytics Depends On Being Able To Scale To Capture High Velocity, High Volume Data
▪ Successful big data analytics requires:
1. Ability to scale operational systems to capture, stream and store the required transactional and non-transactional data
   – Support peak transaction rates
   – Support peak capture of non-transactional data, e.g. shopping cart data
   – Support peak data arrival rates, e.g. sensor data
   – Support peak ingestion rates
2. Scalable big data analytics
3. Closed-loop integration of analytical systems back into core operational transaction processing systems
   – Make prescriptive insights available to all that need them to continuously optimise operations and maximise effectiveness
E-Business And Mobile Mean Operational Systems Are Having To Scale To Support Masses Of Concurrent Users
[Diagram: many more users reach operational and transactional applications over the WWW and from mobile devices; the applications write data and web logs to a cluster that holds partitioned data]
Example Operational Applications Requiring Scalability That Are Fuelling Demand For NoSQL DBMSs
▪ Web and mobile commerce
• Shopping cart data, session storage
▪ Internet of Things (IoT) and other time series applications
• Need to scale as the number of devices / things increases
▪ Mobile gaming
• Player profile data, session storage, game performance stats
▪ Healthcare
• Store unstructured healthcare digital imaging and video data
▪ Social network applications
Types Of NoSQL Database And Product Examples
NoSQL Database Type | NoSQL Product Examples
Key-value store | Aerospike, Amazon DynamoDB, Basho Riak KV, Redis, MemcacheDB, Voldemort
Document database | CouchDB, IBM DB2 (XML & JSON), MongoDB, IBM Cloudant, MarkLogic, Terrastore, JackRabbit, RaptorDB
Column family database | Cassandra, DataStax, Google BigTable, Hadoop HBase, Hypertable, HPCC, Amazon SimpleDB
Graph database | AllegroGraph, GraphBase, Horton, InfiniteGraph, IBM DB2, Neo4j, Oracle Spatial and Graph, Titan, Cray Research, Teradata Aster
Multi-model database | ArangoDB, CortexDB, MarkLogic, MongoDB, FoundationDB
▪ Some NoSQL databases are aimed at write processing (data collection)
▪ Others are aimed at specific big data analytical workloads
▪ Issues include lack of standard APIs, weak or no optimizer and non-immediate consistency
Global NoSQL Market Size And Forecast 2013 - 2020
Source: https://www.alliedmarketresearch.com/NoSQL-market
Key Value Stores Can Store Any Data - Examples
Key    Value
10034  John Smith
82771
93441
{ "firstName": "Wayne",
  "lastName": "Rooney",
  "age": 25,
  "address": {
    "streetAddress": "21 Sir Matt Busby Way",
    "city": "Manchester",
    "country": "England",
    "postalCode": "M1 6DY"
  },
  "phoneNumbers": [
    { "type": "home",
      "number": "0161-123-1234"
    },
    { "type": "mobile",
      "number": "07779-123234"
    }
  ]
}
Key value store features:
• Very simple to understand
• Very scalable - hash partitioning
• Data access is via the key
• The application controls what’s stored in
the value
• Very fast performance
• Acceleration via in-memory processing
• Eventual consistency
• Often no support for data types
• No built-in referential integrity
• No understanding of data relationships
• The application must understand any
relationships in data
• Programmer is in complete control
• Application must navigate complex data
Use for specific operational applications
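The key value behaviour listed above can be sketched in a few lines of Python. This is a toy in-memory stand-in and not the API of any product named in this deck: the `KeyValueStore` class and its `put`/`get` methods are hypothetical. It illustrates that access is by key only, that the store treats the value as opaque bytes, and that the application has to know the value is JSON and navigate it itself.

```python
import json


class KeyValueStore:
    """Toy in-memory store (hypothetical): values are opaque bytes, access is by key only."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]


store = KeyValueStore()

# A simple value: a single data field.
store.put("10034", b"John Smith")

# A complex value: the application serialises a JSON document itself.
profile = {
    "firstName": "Wayne",
    "lastName": "Rooney",
    "phoneNumbers": [
        {"type": "home", "number": "0161-123-1234"},
        {"type": "mobile", "number": "07779-123234"},
    ],
}
store.put("93441", json.dumps(profile).encode("utf-8"))

# The store does not know this value is JSON; the application decodes
# it and navigates the structure on its own.
document = json.loads(store.get("93441").decode("utf-8"))
home_numbers = [p["number"] for p in document["phoneNumbers"] if p["type"] == "home"]
print(home_numbers)
```

The store never inspects the value, which is why referential integrity and relationships are left to the application.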
Key Value Stores
– The Key Is Hashed To Partition The Data
Source: Microsoft
The value can be anything
• A single data field
• A JSON document
• An XML document
• Text
• Image, …
Easy to partition (hash the key)
Very fast to retrieve and store data
The application needs to know
• What is stored in the VALUE
• How the value is structured
• How to process the value
Key needs to be unique
Can use HTTP to read and write data
e.g. curl -XPUT, curl -XGET
Key Value Stores – A Basho Riak KV Cluster Has
Virtual Nodes Running on Physical Nodes
Source: Basho
Riak hashes each key with SHA-1 to determine the node that stores it
Riak hash-partitions and replicates data (three copies by default)
[Diagram: an HTTP request, e.g. PUT, POST or GET, supplies the key and the value; the key is hashed to locate the data]
Nodes can be added to and removed from a Riak cluster while it is running
Key Value Stores - A Basho Riak KV Ring
Riak splits the ring into partitions (64 by default) and replicates the partitions for high availability
Source: Basho
[Diagram: writing replicas to the ring]
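The hashing and replication described on these two slides can be approximated as follows. This is a minimal sketch and not Riak's actual implementation. The 64 partitions, the three replicas and the use of SHA-1 come from the slides; the way the bucket and key are combined before hashing, the round-robin assignment of partitions to physical nodes, and all function names are assumptions made for illustration.

```python
import hashlib

RING_SIZE = 64       # partitions in the ring (default quoted on the slide)
N_VAL = 3            # copies of each object (default quoted on the slide)
KEYSPACE = 2 ** 160  # SHA-1 produces 160-bit hashes


def partition_for(bucket, key):
    """Map a bucket/key pair to one of RING_SIZE equal slices of the SHA-1 keyspace."""
    # Assumption: bucket and key are joined with "/" before hashing.
    digest = hashlib.sha1(f"{bucket}/{key}".encode("utf-8")).digest()
    return int.from_bytes(digest, "big") * RING_SIZE // KEYSPACE


def replica_partitions(bucket, key, n_val=N_VAL):
    """Return the first partition plus the next n_val - 1 partitions around the ring."""
    first = partition_for(bucket, key)
    return [(first + i) % RING_SIZE for i in range(n_val)]


def owner(partition, nodes):
    """Assumption: partitions (virtual nodes) are claimed round-robin by physical nodes."""
    return nodes[partition % len(nodes)]


nodes = ["node-a", "node-b", "node-c", "node-d"]
for partition in replica_partitions("customers", "10034"):
    print(partition, owner(partition, nodes))
```

Because consecutive partitions belong to different physical nodes in this sketch, the three replicas of a key land on three different machines, which is the point of the virtual-node layout.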
Topics – Where Are We?
▪ The changing landscape of operational and analytical systems
▪ Scalable operational applications and NoSQL data stores
▶ Big data analytics – The era of Hadoop and Spark
▪ The value of operational analytics
▪ Operational analytics using The Basho Data Platform and Apache Spark
▪ Conclusions
Demand For Scalable Analytical Systems Is Also Exploding
[Diagram: the closed loop between operational applications, scalable operational applications, analytical systems and scalable analytical systems, exchanging data, new data and new insights]
A Hadoop System
[Diagram: files stored in HDFS, reachable through webHDFS (an HTTP interface to HDFS with REST APIs); YARN runs the MapReduce, Tez and Spark execution engines alongside Storm and HBase; analytic applications written in Java, Python or Scala use the APIs to HBase and HDFS; Pig Latin scripts and SQL execute on MapReduce, Tez and Spark; BI tools connect through SQL, including 3rd-party SQL on Hadoop engines with partitioned indexes]
Faster Execution Engines For Analytic Applications – Apache Spark
[Diagram: the Hadoop system again (HDFS and webHDFS, YARN, MapReduce, Tez, Spark, Storm, HBase, Pig Latin scripts, SQL, 3rd-party SQL on Hadoop, BI tools and analytic applications in Java, Python or Scala), this time presenting Spark as the faster execution engine]
Spark Is A General Purpose In-Memory Execution Framework That Can Run With Or Without Hadoop
[Diagram: inside Hadoop, Spark runs next to MapReduce, Tez, Storm and HBase on YARN over HDFS; outside Hadoop, Spark reads from stores such as HDFS or S3, optionally through Tachyon]
Tachyon is an HDFS-compatible in-memory file system available alongside Spark
You can use Spark with or without Tachyon
The Spark stack is integrated, e.g. you can use Spark Streaming, Spark SQL and MLlib together in the same application
Spark stack: Applications / BI Tools (SQL, Python, Scala, Java, R) over Spark SQL + DataFrames, Spark Streaming, MLlib (Machine Learning) and GraphX (Graph Computation), all on Spark Core
Apache Spark
Applications / BI Tools: access Spark through SQL, Python, Scala, Java and R
Spark Core: provides distributed task dispatching, scheduling and basic I/O
Spark Streaming: for analysis of real-time streaming data
MLlib (Machine Learning): a library of pre-built analytic algorithms that can run in parallel across a Spark cluster
GraphX (Graph Computation): a graph analysis engine running on Spark
Spark SQL + DataFrames: query structured data in Spark apps using SQL or a DataFrames API
Spark In-Memory Analytic Applications Can Do A Lot More Than Map Reduce Processing
▪ Keep only one copy in memory in a JVM
▪ Track lineage of job operators used to derive the data
▪ Use the lineage to re-compute the data if there is a failure
▪ No MapReduce execution needed
• Just Spark APIs
[Diagram: a Spark application reads files from HDFS and chains map, map, join, filter and reduce operators; Spark applications and BI tools run on the Spark stack. Source: AMPLab]
Spark Applications Operate On RDDs (Data) – You Can Do A Lot More Than Map and Reduce
▪ RDD = Resilient Distributed Dataset
▪ An RDD is a read-only, partitioned collection of records
▪ RDDs can only be created through operators on either
1. A dataset in stable storage or
2. Other existing RDDs
Spark operators available to Spark applications:
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin
reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip
sample, take, first, partitionBy, mapWith, pipe, save
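The read-only, lineage-based nature of RDDs can be illustrated without Spark itself. The sketch below is plain Python, not the Spark API: `MiniRDD` is a hypothetical class that only records a source and the chain of operators applied to it. Every operator returns a new object, and results are derived by replaying the lineage, which is the property that lets lost data be recomputed after a failure.

```python
class MiniRDD:
    """Hypothetical stand-in for an RDD: a source plus the operators applied to it."""

    def __init__(self, source, lineage=()):
        self._source = source    # callable returning the base records ("stable storage")
        self._lineage = lineage  # tuple of (operator name, function) pairs

    def map(self, fn):
        # Operators never modify this object; they return a new one.
        return MiniRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return MiniRDD(self._source, self._lineage + (("filter", predicate),))

    def lineage(self):
        return [name for name, _ in self._lineage]

    def collect(self):
        # Replay the recorded operators over the source records.
        records = list(self._source())
        for name, fn in self._lineage:
            if name == "map":
                records = [fn(r) for r in records]
            else:
                records = [r for r in records if fn(r)]
        return records


readings = MiniRDD(lambda: [3, 8, 15, 4])
high = readings.map(lambda x: x * 2).filter(lambda x: x > 10)

first_result = high.collect()
# If first_result were lost, the same answer is rebuilt from the lineage.
recomputed = high.collect()
print(high.lineage(), first_result, recomputed)
```

Real Spark additionally partitions the records across a cluster and can cache intermediate results in memory; neither is modelled here.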
Simplifying Access To Data Via Spark SQL and Spark DataFrames
▪ A DataFrame is a distributed collection of data organized into named columns
▪ Conceptually equivalent to a relational DBMS table or a data frame in R/Python
▪ DataFrames can be constructed from a wide array of sources:
• Structured data files
• Hive tables
• External databases
• Existing RDDs
▪ Uses schema on read
Image source: Databricks.com
Note: Spark data sources can be relational and NoSQL DBMSs
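Schema on read means the stored data stays untyped and the column names and types are applied when the data is read. The following is a rough illustration in plain Python, not the Spark DataFrame API; the sample data, the `SCHEMA` mapping and the `read_with_schema` function are all made up for this example.

```python
import csv
import io

# Raw text as it might sit in a file: no types are stored with the data.
RAW = "device,ts,temp\nd1,1452000000,21.5\nd2,1452000005,19.0\n"

# Hypothetical schema, supplied by the reader rather than by the storage layer.
SCHEMA = {"device": str, "ts": int, "temp": float}


def read_with_schema(text, schema):
    """Parse delimited text into rows with named, typed columns."""
    reader = csv.DictReader(io.StringIO(text))
    return [{column: cast(row[column]) for column, cast in schema.items()} for row in reader]


rows = read_with_schema(RAW, SCHEMA)

# Once columns are named and typed, the data can be queried much like a table.
warm_devices = [row["device"] for row in rows if row["temp"] > 20.0]
print(warm_devices)
```

A different reader could apply a different schema to the same raw text, which is the flexibility the slide refers to.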
Spark Is Going Over The Top of Multiple Data Stores For Scalable In-Memory Analytics Across The Entire Ecosystem
[Diagram: applications and BI tools use the Spark stack (Spark Core, Spark Streaming, Spark SQL + DataFrames, MLlib, GraphX; SQL, Python, Scala, Java, R) across streaming data (streaming analytics), operational NoSQL data stores (e.g. Cassandra, Basho Riak), a Hadoop data store and NoSQL DBMS (advanced analytics on multi-structured data), and a data warehouse RDBMS (EDW, DW & marts)]
Topics – Where Are We?
▪ The changing landscape of operational and analytical systems
▪ Scalable operational applications and NoSQL data stores
▪ Big data analytics – The era of Hadoop and Spark
▶ The value of operational analytics
▪ Operational analytics using The Basho Data Platform and Apache Spark
▪ Conclusions
Key Business Drivers And Objectives For Operational Analytics
▪ Combine operational and analytical processing at scale to:
• Improve customer engagement
• Reduce risk
• Avoid unplanned operational cost
• Optimise operational effectiveness
▪ Use BI/Analytics to drive and guide business operations to help achieve specific business goals and KPI targets
• Automated analysis of operational events as they happen
• Automated alerts
• On-demand recommendations
▪ Integrate BI/Analytics into every business process to:
• Create an 'insight-driven' employee base
• Enable mass execution of business strategy by facilitating mass contribution towards achieving specific business goals
Five Types Of Operational BI/Analytics
1. Simple operational reporting of current position/state e.g.
session state
2. Situational awareness via visualisation of live operational data
typically on dashboards
3. On-demand analytics of live operational and/or historical data
to improve operational decisions and effectiveness
4. On-demand recommendations for guidance
5. Event stream processing to monitor, automatically analyse and
act on events in real-time to prevent problems arising and to
optimise business operations
Operational Analytics – What's The Difference Between On-Demand Vs Event-Driven Analysis?
[Diagram: BI/analytics apps and services]
On-demand: an application requests an analytical service (query, report, model, recommendation)
Event-driven: a message, file arrival, pattern or trigger, for example in streaming data, invokes an analytical service (query, report, model, recommendation)
Analytics Need To Be Integrated Into Business Processes To Optimize Business Operations
[Diagram: integrated on-demand business intelligence underpins integrated intelligent business operations. Customers, employees, and partners & suppliers interact through customer relationship management, operations management and supply chain management, spanning marketing, sales, service/support, operations, finance/accounting, procurement, inventory control, shipping/distribution and human resources]
High Value Application Use Cases for Streaming
Analytics
Streaming
Analytics
Source: Adapted from a slide by IBM
Responding To Events And Event Patterns Means Reducing Action Time
The time between an event occurring and action being taken should be as close to zero as possible
Action distance (or action time) spans event-driven data integration, automated analysis, and automated decision and action taking
Source: Dr Richard Hackathorn
With Event Stream Processing The Architecture Has To Change
[Diagram comparing two flows. Classic use of analytics: data cleansing & integration, store data, query/analyze (human). Event / stream processing: data cleansing & integration, query/analyze (automated), store data. The flow leads to act (automated or human)]
Time Series Analysis – Query Processing Uses a Time Window to Look at Continuously Streaming Data
Time window from T1 to T2, e.g. 5 seconds, 30 seconds or 5 minutes, used to find patterns/correlations
Continuous time series queries (CQs) operate on the data as it flows by
A set of continuous queries (CQs) resides in the stream processing server to process incoming data
High-frequency data is pushed into the queries
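A continuous query over a time window can be sketched as an object that data is pushed into, rather than a query that pulls from a stored table. The example below is illustrative plain Python, not the interface of any stream processing server: `WindowedAverage` is a hypothetical name, and it assumes events arrive in timestamp order.

```python
from collections import deque


class WindowedAverage:
    """Hypothetical continuous query: average of the values seen in the last N seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, value) pairs in arrival order
        self._total = 0.0

    def push(self, timestamp, value):
        """Accept one event and return the current windowed average."""
        self._events.append((timestamp, value))
        self._total += value
        # Evict events that have fallen out of the time window.
        cutoff = timestamp - self.window_seconds
        while self._events[0][0] <= cutoff:
            _, old_value = self._events.popleft()
            self._total -= old_value
        return self._total / len(self._events)


query = WindowedAverage(window_seconds=5)
stream = [(0, 10.0), (2, 20.0), (4, 30.0), (6, 40.0)]
averages = [query.push(ts, value) for ts, value in stream]
print(averages)
```

When the event at time 6 arrives, the event from time 0 has left the 5-second window, so only the last three values contribute to the final average.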
Key Requirements For Operational Analytics
▪ On-demand, event-driven and scheduled invocation of analytics
▪ Monitor streaming events as they happen via automatic analysis
▪ Automatic analysis via predictive and statistical models
▪ Automatic interpretation of predictive/statistical model outcomes
▪ Rule-driven automatic actions to automate decision making
• E.g. alerts, recommendations, transaction and process invocation
▪ Integrate operational analytics into operational applications
▪ Operational reporting
▪ Scale to support large numbers of events and concurrent users
▪ Store relevant data together to speed up analytics execution
▪ Run predictive and statistical models close to the data
▪ Run analytics on a 24x365 basis
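As a small illustration of rule-driven automatic actions on a model outcome, the sketch below maps a predictive score to an action. The thresholds, the action names and the `decide` function are hypothetical and chosen only to show the shape of the idea.

```python
# Hypothetical rules, checked in order: (condition on the model score, action to take).
RULES = [
    (lambda score: score >= 0.9, "block_transaction"),
    (lambda score: score >= 0.6, "raise_alert"),
]


def decide(score):
    """Interpret a model score automatically and pick the first matching action."""
    for condition, action in RULES:
        if condition(score):
            return action
    return "no_action"


for score in (0.95, 0.7, 0.1):
    print(score, decide(score))
```

Keeping the rules as data rather than code paths makes it easier to change thresholds without touching the application that invokes them.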
The Importance of In-Memory Processing
▪ Massively parallel in-memory processing is mission critical for scalable operational systems and operational analytics
▪ Why?
• Performance is critical
• Large number of concurrent user requests for on-demand analytics
• Large number of concurrent application requests for on-demand analytics
• Event-driven operational analytics on very high velocity data needs memory
Topics – Where Are We?
▪ The changing landscape of operational and analytical systems
▪ Scalable operational applications and NoSQL data stores
▪ Big data analytics – The era of Hadoop and Spark
▪ The value of operational analytics
▶ Operational analytics using The Basho Data Platform and Apache Spark
▪ Conclusions
The Basho Data Platform
Source: Basho
[Diagram: the platform is drawn in three layers, with components marked as either Basho developed or Basho integrated]
Storage instances: Riak KV (key/value), Riak S2 (object storage), Riak TS (time series), document store, columnar, graph
Service instances: Spark, Redis (caching), Solr, Elasticsearch, web services, 3rd-party web services & integrations
Core services: replicate & synchronize, message routing, cluster management & monitoring, logging & analytics, internal data store
Annotations:
• Riak KV: hash partitioning, cluster scalability, triple replication, multi-datacentre replication
• Riak TS: co-locates time-series data, high availability, scalability
• Replicate & synchronize: replicates and synchronises data within and across Riak KV, Redis and Spark clusters
• Cluster management: automated cluster management simplifies administration
• Redis: integrated in-memory caching for faster application performance
• Solr: search-based query processing on Riak data using Solr indexes
• Spark: integrated in-memory analytics for Riak KV and Riak TS data
Riak TS Is A New Basho Storage Instance Optimised for Time Series Data And Analytics
▪ A distributed NoSQL database optimised for capture, aggregation and analysis of time-series, sequenced and unstructured data from the Internet of Things (IoT)
▪ High availability
▪ Scalability: add nodes to a cluster without sharding
▪ Automated and uniform data distribution across the cluster
• Time- or geohash-based data co-location to ensure time series data is located on the same node
▪ Data validation on input
▪ APIs and client libraries for Java, Ruby, Python, Go, Erlang, Node.js and .NET
▪ Spark integration for operational analysis of time series data
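Time-based co-location can be pictured as hashing on a time bucket instead of on each individual timestamp, so that all readings from one series within the same bucket land in the same partition. The sketch below illustrates that idea only; it does not reproduce how Riak TS is implemented. The 15-minute bucket length, the ring size reused from the earlier slides and the function names are assumptions.

```python
import hashlib

QUANTUM_SECONDS = 15 * 60  # assumed bucket length: 15 minutes
RING_SIZE = 64             # partitions, as on the earlier Riak KV slides


def quantum_start(timestamp):
    """Round a Unix timestamp down to the start of its time bucket."""
    return timestamp - (timestamp % QUANTUM_SECONDS)


def partition_for(series, timestamp):
    """Hash the series name plus the time bucket, not the exact timestamp."""
    key = f"{series}:{quantum_start(timestamp)}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(key).digest(), "big") % RING_SIZE


# Two readings from the same sensor inside one 15-minute bucket share a partition.
same_bucket = partition_for("sensor-7", 900) == partition_for("sensor-7", 1799)
print(same_bucket, quantum_start(1799), quantum_start(1800))
```

A range query over a short time span then touches few partitions, which is what makes time series reads cheap under this layout.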
Operational Analytics Using The Basho Data Platform And Apache Spark
[Diagram: a scalable operational application stores hash-partitioned data in the Basho Data Platform; the Spark stack (Spark Core with Spark Streaming, BlinkDB, Spark SQL, GraphX, SparkR and MLlib) analyses that data and writes results back; an operational analytics web service, an operational analytic application and a BI tool sit on top of Spark; a store of recent data is also shown]
Operational Analytics Using The Basho Data Platform And Apache Spark - 2
• Spark operational analytic applications can be developed on low-latency data stored in Basho Riak KV
• Spark-based analytical web services can be invoked on demand to analyse data in Riak KV
• Use on-demand Spark jobs for historical analysis and predictions
• Insights produced from analysing Riak KV data can be written back to Riak KV for use by other applications
• A form of closed-loop processing
• Spark Streaming can be used to calculate rollups and detect abnormalities on streaming sensor data
• Recent data can be kept in Redis for dashboard visualization
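The closed loop in these bullets (read operational data, compute a rollup, flag abnormal values, write the insight back for other applications) can be sketched in plain Python. Nothing here uses the Riak, Redis or Spark APIs: a dictionary stands in for the key-value store, and the key names, the tolerance and the `analyse` function are hypothetical.

```python
from statistics import median

# A dict stands in for a key-value bucket holding recent sensor readings.
kv_store = {"sensor-7/readings": [20.1, 19.8, 20.3, 20.0, 35.2, 19.9]}


def analyse(readings, tolerance=5.0):
    """Compute a simple rollup and flag readings far from the median."""
    middle = median(readings)
    return {
        "count": len(readings),
        "min": min(readings),
        "max": max(readings),
        "median": middle,
        "abnormal": [r for r in readings if abs(r - middle) > tolerance],
    }


# Write the insight back next to the source data so other applications
# can pick it up by key (the closed loop).
kv_store["sensor-7/insights"] = analyse(kv_store["sensor-7/readings"])
print(kv_store["sensor-7/insights"])
```

In a real deployment the analysis step would run as a Spark job or Spark Streaming computation over partitioned data; the read, analyse and write-back shape stays the same.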
Topics – Where Are We?
▪ The changing landscape of operational and analytical systems
▪ Scalable operational applications and NoSQL data stores
▪ Big data analytics – The era of Hadoop and Spark
▪ The value of operational analytics
▪ Operational analytics using The Basho Data Platform and Apache Spark
▶ Conclusions
Conclusions
▪ As operational application processing scales, so too does the need to scale operational analytics
▪ Basho is using in-memory processing to accelerate operational applications (via Redis) and to introduce scalable operational analytics (via Spark) into these applications
▪ New scalable 'smart' operational applications are therefore becoming possible with careful design in a NoSQL environment
www.intelligentbusiness.biz
mferguson@intelligentbusiness.biz
Twitter: @mikeferguson1
Tel/Fax (+44)1625 520700
Thank You!

More Related Content

Operational Analytics Using Spark and NoSQL Data Stores

  • 1. Delivering Operational Analytics Using Spark and NoSQL Data Stores Mike Ferguson Managing Director Intelligent Business Strategies Basho Webinar January, 2016
  • 2. 2 Copyright © Intelligent Business Strategies 1992-2016! About Mike Ferguson Mike Ferguson is Managing Director of Intelligent Business Strategies Limited. As an analyst and consultant he specializes in business intelligence, data management and enterprise business integration. With over 34 years of IT experience, Mike has consulted for dozens of companies, spoken at events all over the world and written numerous articles. Formerly he was a principal and co-founder of Codd and Date Europe Limited – the inventors of the Relational Model, a Chief Architect at Teradata on the Teradata DBMS and European Managing Director of DataBase Associates. www.intelligentbusiness.biz mferguson@intelligentbusiness.biz Twitter: @mikeferguson1 Tel/Fax (+44)1625 520700
  • 3. 3 Copyright © Intelligent Business Strategies 1992-2016! Topics  The changing landscape of operational and analytical systems  Scalable operational applications and NoSQL data stores  Big data analytics – The era of Hadoop and Spark  The value of operational analytics  Operational analytics using The Basho Data Platform and Apache Spark  Conclusions
  • 4. 4 Copyright © Intelligent Business Strategies 1992-2016! Topics The changing landscape of operational and analytical systems  Scalable operational applications and NoSQL data stores  Big data analytics – The era of Hadoop and Spark  The value of operational analytics  Operational analytics using The Basho Data Platform and Apache Spark  Conclusions
  • 5. 5 Copyright © Intelligent Business Strategies 1992-2016! The Application Processing Spectrum Source: BI-Research Copyright © BI-Research, 2013-Present
  • 6. 6 Copyright © Intelligent Business Strategies 1992-2016! Big Data Processing – There Is A Growing Number of Data Stores Optimized for Operational or Analytical Workloads OLTP RDBMS NoSQL DBMS NoSQL • ACID support missing in many NoSQL DBMSs • Can you live with losing a transaction? • OK for sensor data for example Analytical RDBMS
  • 7. 7 Copyright © Intelligent Business Strategies 1992-2016! Analytical Systems A Closed Loop Is Still Needed – It Just Now Also Includes NoSQL Technologies Operational applications Scalable Analytical Systems data data new data new insights Scalable Operational applications Relational & NoSQL systems Relational & NoSQL systems
  • 8. 8 Copyright © Intelligent Business Strategies 1992-2016! Topics - – Where Are We?  The changing landscape of operational and analytical systems Scalable operational applications and NoSQL data stores  Big data analytics – The era of Hadoop and Spark  The value of operational analytics  Operational analytics using The Basho Data Platform and Apache Spark  Conclusions
  • 9. 9 Copyright © Intelligent Business Strategies 1992-2016! Analytical Systems Demand For Scalable Operational Systems With High Write Processing Is Driving Demand for NoSQL DBMS Operational applications Scalable Analytical Systems data data new data new insights Scalable operational applications
  • 10. 10 Copyright © Intelligent Business Strategies 1992-2016! Success of Big Data Analytics Depends On Being Able To Scale To Capture High Velocity, High Volume Data  Successful big data analytics requires 1. Ability to scale operational systems to capture, stream and store the required transactional and non-transactional data – Support peak transaction rates – Support peak capture of non-transactional data e.g. shopping cart data – Support peak data arrival rates e.g. sensor data – Support peak ingestion rates 2. Scalable Big Data analytics 3. Closed loop integration of analytical systems back into core operational transaction processing systems – Make prescriptive insights available to all that need them to continuously optimise operations and maximise effectiveness
  • 11. 11 Copyright © Intelligent Business Strategies 1992-2016! E-Business And Mobile Means Operational Systems Are Having To Scale To Support Masses Of Concurrent Users Many more users Operational applications Transactional applications dataWeb logs Cluster Mobile devices WWW data data data partitioned data
  • 12. 12 Copyright © Intelligent Business Strategies 1992-2016! Example Operational Applications Requiring Scalability That Are Fuelling Demand For NoSQL DBMSs  Web and mobile commerce • Shopping cart data, session storage  Internet of Things (IoT) and other time series applications • Need to scale as the number of devices / things increase  Mobile gaming • Player profile data, session storage, game performance stats  Healthcare • Store unstructured healthcare digital imaging and video data  Social network applications
  • 13. 13 Copyright © Intelligent Business Strategies 1992-2016! Types Of NoSQL Database And Product Examples NoSQL Database Type NoSQL Product Examples Key Value store Aerospike, Amazon DynamoDB, Basho Riak KV, Redis, MemcacheDB, Voldemort Document database CouchDB, IBM DB2 (XML & JSON), MongoDB, IBM Cloudant, Marklogic, Terrastore, JackRabbit, RaptorDB Column Family database Casandra, DataStax, Google BigTable, Hadoop HBase, Hypertable, HPCC, Amazon SimpleDB Graph database AllegroGraph, GraphBase, Horton, InfiniteGraph, IBM DB2, Neo4j, Oracle Spatial and Graph, Titan, Cray Research, Teradata Aster Multi-modal database ArangoDB, CortexDB, MarkLogic , MongoDB FoundationDB,  Some NoSQL databases are aimed at write processing (data collection)  Others are aimed at specific big data analytical workloads  Issues include lack of standard APIs, weak or no optimizer and non- immediate consistency
  • 14. 14 Copyright © Intelligent Business Strategies 1992-2016! Global NoSQL Market Size And Forecast 2013 - 2020 Source: https://www.alliedmarketresearch.com/NoSQL-market
  • 15. 15 Copyright © Intelligent Business Strategies 1992-2016! Key Value Stores Can Store Any Data - Examples Key Value 10034 John Smith 82771 93441 { "firstName": ”Wayne", "lastName": ”Rooney", "age": 25, "address": { "streetAddress": "21 Sir Matt Busby Way", "city": ”Manchester”, “country”: “England”, "postalCode": “M1 6DY” }, "phoneNumbers": [ { "type": "home”, "number": ”0161-123-1234” }, { "type": ”mobile", "number": ”07779-123234” } ] } Key value store features: • Very simple to understand • Very scalable - hash partitioning • Data access is via the key • The application controls what’s stored in the value • Very fast performance • Acceleration via in-memory processing • Eventual consistency • Often no support for data types • No built-in referential integrity • No understanding of data relationships • The application must understand any relationships in data • Programmer is in complete control • Application must navigate complex data Use for specific operational applications
  • 16. 16 Copyright © Intelligent Business Strategies 1992-2016! Key Value Stores – The Key Is Hashed To Partition The Data Source: Microsoft The value can be anything • A single data field • A JSON document • An XML document • Text • Image…… Key Value Easy to partition (hash the key) Very fast to retrieve and store data The application needs to know • What is stored in the VALUE • How the value is structured • How to process the value Key needs to be unique Can use HTTP to read and write data e.g. CURL –XPUT, CURL -XGET
  • 17. 17 Copyright © Intelligent Business Strategies 1992-2016! Key Value Stores – A Basho Riak KV Cluster Has Virtual Nodes Running on Physical Nodes Source: Basho SHA1 is a hashing function that hashes a key to determine the node Riak hash partitions and replicates data (3 copies of the data is the default) e.g. PUT, POST, GET…. the valuethe key hash the key Nodes can be added and removed to a Riak cluster while it is running
  • 18. 18 Copyright © Intelligent Business Strategies 1992-2016! Key Value Stores - A Basho Riak KV Ring Riak uses partitions (64 partitions are the default) and also replicates the partitions for high availability Source: Basho Writing replicas
  • 19. 19 Copyright © Intelligent Business Strategies 1992-2016! Topics – Where Are We?  The changing landscape of operational and analytical systems  Scalable operational applications and NoSQL data stores Big data analytics – The era of Hadoop and Spark  The value of operational analytics  Operational analytics using The Basho Data Platform and Apache Spark  Conclusions
  • 20. 20 Copyright © Intelligent Business Strategies 1992-2016! Analytical Systems Demand For Scalable Analytical Systems Is Also Exploding Operational applications Scalable Analytical Systems data data new data new insights Scalable operational applications
  • 21. 21 Copyright © Intelligent Business Strategies 1992-2016! A Hadoop System Java, Python, Scala file file file file file file file file file file file file file file webHDFS (An HTTP interface to HDFS has REST APIs) HDFS file file file file PIG latin scripts 3rd Party SQL on Hadoop Analytic Application index indexIndex partition SQL BI Tools Storm YARN MapReduce Tez Spark SQL HBase w e b H D F S APIs to HBase, APIs to HDFS executes on MR, Tez & Spark
  • 22. 22 Copyright © Intelligent Business Strategies 1992-2016! Faster Execution Engines For Analytic Applications – Apache Spark Java, Python, Scala file file file file file file file file file file file file file file webHDFS (An HTTP interface to HDFS has REST APIs) HDFS file file file file PIG latin scripts 3rd Party SQL on Hadoop Analytic Application index indexIndex partition SQL BI Tools Storm YARN MapReduce Tez Spark SQL HBase w e b H D F S APIs to HBase, APIs to HDFS
  • 23. Spark Is A General Purpose In-Memory Execution Framework That Can Run With Or Without Hadoop. Spark can read from HDFS, S3 and other stores. Spark also includes an HDFS-compatible in-memory file system (Tachyon); you can use Spark with or without Tachyon. The Spark stack is integrated, e.g. you can use Spark Streaming, Spark SQL and MLlib together in the same application. Diagram labels: Hadoop (HDFS files, webHDFS, YARN, MapReduce, Tez, Spark, Storm, HBase); Spark stack (Spark Core, Spark Streaming, Spark SQL + DataFrames, GraphX for graph computation, MLlib for machine learning) accessed from SQL, Python, Scala, Java and R by applications and BI tools.
  • 24. Apache Spark. Spark Core: provides distributed task dispatching, scheduling and basic I/O. Spark Streaming: for analysis of real-time streaming data. MLlib (Machine Learning): a library of pre-built analytic algorithms that can run in parallel across a Spark cluster. GraphX (Graph Computation): a graph analysis engine running on Spark. Spark SQL + DataFrames: query structured data in Spark apps using SQL or a DataFrames API. Interfaces: SQL, Python, Scala, Java and R, used by applications and BI tools.
  • 25. Spark In-Memory Analytic Applications Can Do A Lot More Than MapReduce Processing (Source: AMPLab) • Keep only one copy in memory in a JVM • Track lineage of job operators used to derive the data • Use the lineage to re-compute the data if there is a failure • No MapReduce execution needed, just Spark APIs. Diagram: a Spark application reads files from HDFS and applies map, join, filter and reduce operators.
  • 26. Spark Applications Operate On RDDs (Data) – You Can Do A Lot More Than Map and Reduce • RDD = Resilient Distributed Dataset • An RDD is a read-only, partitioned collection of records • RDDs can only be created through operators on either (1) a dataset in stable storage or (2) other existing RDDs. Spark operators shown: Map, Reduce, Sample, Filter, Count, Take, Groupby, Fold, First, Sort, Reducebykey, Partitionby, Union, groupByKey, Mapwith, Join, Cogroup, Leftouterjoin, Cross, Pipe, Rightouterjoin, Zip, Save.
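To make the RDD properties above concrete (read-only, partitioned, created only from stable storage or from other RDDs, and recoverable from lineage), here is a toy single-process model in plain Python. `MiniRDD` and all of its methods are hypothetical names invented for this sketch; it is not the Spark API and it does no distribution, it only mirrors the ideas.

```python
from functools import reduce as _reduce


class MiniRDD:
    """Toy stand-in for an RDD (hypothetical; not the Spark API)."""

    def __init__(self, source=None, parent=None, op="source", fn=None):
        self._source = source    # partitions of a base dataset (tuples are read-only)
        self._parent = parent    # the RDD this one was derived from, if any
        self._op = op            # name of the operator that produced this RDD
        self._fn = fn            # per-partition function applied to the parent
        self._cache = None       # in-memory copy, may be lost at any time

    @classmethod
    def from_storage(cls, records, num_partitions=2):
        """Create a base RDD from a dataset in 'stable storage' (here: a list)."""
        parts = [tuple(records[i::num_partitions]) for i in range(num_partitions)]
        return cls(source=parts)

    def map(self, f):
        return MiniRDD(parent=self, op="map",
                       fn=lambda part: tuple(f(x) for x in part))

    def filter(self, pred):
        return MiniRDD(parent=self, op="filter",
                       fn=lambda part: tuple(x for x in part if pred(x)))

    def lineage(self):
        """List the operators used to derive this RDD, oldest first."""
        chain = [] if self._parent is None else self._parent.lineage()
        return chain + [self._op]

    def partitions(self):
        """Return the data, recomputing it from the lineage if no copy is cached."""
        if self._cache is None:
            if self._parent is None:
                self._cache = list(self._source)
            else:
                self._cache = [self._fn(p) for p in self._parent.partitions()]
        return self._cache

    def evict(self):
        """Simulate losing the in-memory copy (e.g. a failed node)."""
        self._cache = None

    def reduce(self, f):
        return _reduce(f, (x for part in self.partitions() for x in part))


if __name__ == "__main__":
    base = MiniRDD.from_storage([1, 2, 3, 4, 5, 6])
    even_squares = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
    print(even_squares.lineage())                   # ['source', 'map', 'filter']
    print(even_squares.reduce(lambda a, b: a + b))  # 56
    even_squares.evict()                            # lose the cached result
    print(even_squares.reduce(lambda a, b: a + b))  # 56, recomputed via lineage
```

Each transformation returns a new object rather than modifying its parent, which is what makes recomputation from the recorded operators safe.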
  • 27. Simplifying Access To Data Via Spark SQL and Spark DataFrames (Image source: Databricks.com) • A DataFrame is a distributed collection of data organized into named columns • Conceptually equivalent to a relational DBMS table or a data frame in R/Python • DataFrames can be constructed from a wide array of sources: structured data files; Hive tables; external databases; existing RDDs • Uses schema on read. Note that Spark data sources can be relational and NoSQL DBMSs.
  • 28. Spark Is Going Over The Top of Multiple Data Stores For Scalable In-Memory Analytics Across The Entire Ecosystem. Diagram labels: streaming data and streaming analytics; Hadoop data store; advanced analytic (multi-structured data) mart; data warehouse RDBMS (EDW, DW & marts); NoSQL DBMS; operational NoSQL data stores, e.g. Cassandra, Basho Riak; Spark stack (Spark Core, Spark Streaming, Spark SQL + DataFrames, GraphX, MLlib) accessed from SQL, Python, Scala, Java and R by applications and BI tools.
  • 29. Topics – Where Are We? • The changing landscape of operational and analytical systems • Scalable operational applications and NoSQL data stores • Big data analytics – The era of Hadoop and Spark • The value of operational analytics (current topic) • Operational analytics using the Basho Data Platform and Apache Spark • Conclusions
  • 30. Key Business Drivers And Objectives For Operational Analytics • Combine operational and analytical processing at scale to: improve customer engagement; reduce risk; avoid unplanned operational cost; optimise operational effectiveness • Use BI/analytics to drive and guide business operations to help achieve specific business goals and KPI targets, through automated analysis of operational events as they happen, automated alerts and on-demand recommendations • Integrate BI/analytics into every business process to: create an ‘insight-driven’ employee base; enable mass execution of business strategy by facilitating mass contribution towards achieving specific business goals
  • 31. Five Types Of Operational BI/Analytics 1. Simple operational reporting of current position/state, e.g. session state 2. Situational awareness via visualisation of live operational data, typically on dashboards 3. On-demand analytics of live operational and/or historical data to improve operational decisions and effectiveness 4. On-demand recommendations for guidance 5. Event stream processing to monitor, automatically analyse and act on events in real time to prevent problems arising and to optimise business operations
  • 32. Operational Analytics – What’s The Difference Between On-Demand And Event-Driven Analysis? On-demand: an application requests an analytical service (query, report, model, recommendation) from the BI/analytics services. Event-driven: a message, file arrival, pattern or trigger in streaming data invokes the analytical service (query, report, model, recommendation).
  • 33. Analytics Need To Be Integrated Into Business Processes To Optimize Business Operations. Diagram labels: integrated intelligent business operations and integrated on-demand business intelligence, spanning customers, partners & suppliers and employees; customer relationship management (marketing, sales, service/support); operations management (operations, finance/accounting, human resources); supply chain management (procurement, inventory control, shipping/distribution).
  • 34. High Value Application Use Cases for Streaming Analytics (Source: adapted from a slide by IBM)
  • 35. Responding To Events And Event Patterns Means Reducing Action Time (Source: Dr Richard Hackathorn). Action distance, or action time, is the time between an event occurring and action being taken; it should be as close to zero as possible. Diagram labels: event-driven data integration; automated analysis; automated decision and action taking.
  • 36. With Event Stream Processing The Architecture Has To Change. Classic use of analytics: data cleansing & integration; store data; query/analyze (human). Event/stream processing: data cleansing & integration; query/analyze (automated); act (automated or human); store data.
  • 37. Time Series Analysis – Query Processing Uses a Time Window to Look at Continuously Streaming Data. A time window (T1 to T2, e.g. 5 seconds, 30 seconds or 5 minutes) is used to find patterns and correlations in high frequency data. Continuous time series queries (CQs) operate on the data as it flows by. A set of continuous queries resides in the stream processing server to process incoming data. Data is pushed into the queries.
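The idea that data is pushed into a resident query that only looks at a time window can be sketched as follows. This is a minimal single-process illustration, not a stream processing server: the class name `SlidingWindowAverage`, its `push` method and the assumption that events arrive in timestamp order are all choices made for the example.

```python
from collections import deque


class SlidingWindowAverage:
    """A toy continuous query: average of the values seen in the last N seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._events = deque()   # (timestamp, value) pairs currently in the window
        self._total = 0.0

    def push(self, timestamp, value):
        """Push one event into the query and return the current windowed average.

        Assumes events arrive in non-decreasing timestamp order.
        """
        self._events.append((timestamp, value))
        self._total += value
        # Drop events that have fallen out of the window (timestamp - N, timestamp].
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, old_value = self._events.popleft()
            self._total -= old_value
        return self._total / len(self._events)


if __name__ == "__main__":
    cq = SlidingWindowAverage(window_seconds=5)
    print(cq.push(0, 10.0))   # 10.0
    print(cq.push(2, 20.0))   # 15.0
    print(cq.push(6, 30.0))   # 25.0, the event at t=0 has left the window
```

The query keeps only the events inside the window, so its memory use depends on the window length and event rate rather than on the total length of the stream.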
  • 38. Key Requirements For Operational Analytics • On-demand, event-driven and scheduled invocation of analytics • Monitor streaming events as they happen via automatic analysis • Automatic analysis via predictive and statistical models • Automatic interpretation of predictive/statistical model outcomes • Rule-driven automatic actions to automate decision making, e.g. alerts, recommendations, transaction and process invocation • Integrate operational analytics into operational applications • Operational reporting • Scale to support large numbers of events and concurrent users • Store relevant data together to speed up analytics execution • Run predictive and statistical models close to the data • Run analytics on a 24x365 basis
  • 39. The Importance of In-Memory Processing • Massively parallel in-memory processing is mission critical for scalable operational systems and operational analytics • Why? Performance is critical; large numbers of concurrent user requests for on-demand analytics; large numbers of concurrent application requests for on-demand analytics; event-driven operational analytics on very high velocity data needs memory
  • 40. Topics – Where Are We? • The changing landscape of operational and analytical systems • Scalable operational applications and NoSQL data stores • Big data analytics – The era of Hadoop and Spark • The value of operational analytics • Operational analytics using the Basho Data Platform and Apache Spark (current topic) • Conclusions
  • 41. The Basho Data Platform (Source: Basho). Service instances: Spark, Redis (caching), Solr, Elasticsearch, web services, 3rd party web services & integrations. Storage instances: Riak KV (key/value), Riak S2 (object storage), Riak TS (time series), document store, columnar, graph. Core services: replicate & synchronize, message routing, cluster management & monitoring, logging & analytics, internal data store. Legend: Basho developed; Basho integrated. Annotations: Riak KV provides hash partitioning, cluster scalability, triple replication and multi-datacentre replication; Riak TS co-locates time-series data and provides high availability and scalability; core services replicate and synchronise data within and across Riak KV, Redis and Spark clusters; automated cluster management simplifies administration; Redis gives integrated in-memory caching for faster application performance; Solr gives search-based query processing on Riak data using Solr indexes; Spark gives integrated in-memory analytics for Riak KV and Riak TS data.
  • 42. Riak TS Is A New Basho Storage Instance Optimised for Time Series Data And Analytics • A distributed NoSQL database optimised for time series sequenced, unstructured data capture, aggregation and analysis from the Internet of Things (IoT) • High availability • Scalability: add nodes to a cluster without sharding • Automated and uniform data distribution across the cluster, with time or geohash based data co-location to ensure time series data is located on the same node • Data validation on input • APIs and client libraries for Java, Ruby, Python, Go, Erlang, Node.js and .NET • Spark integration for operational analysis of time series data
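Time-based co-location can be illustrated by hashing a series identifier together with the start of a fixed time bucket, so that every reading from the same series in the same bucket maps to the same partition. The sketch below is an assumption-laden illustration rather than Riak TS's actual scheme: the 15-minute bucket size, the function names `quantum_start` and `partition_for_point`, and the key encoding are all hypothetical.

```python
import hashlib

NUM_PARTITIONS = 64          # ring size, reusing the Riak KV default from slide 18
QUANTUM_SECONDS = 15 * 60    # hypothetical time bucket of 15 minutes


def quantum_start(timestamp: int, quantum: int = QUANTUM_SECONDS) -> int:
    """Round a Unix timestamp (seconds) down to the start of its time bucket."""
    return timestamp - (timestamp % quantum)


def partition_for_point(series: str, timestamp: int) -> int:
    """Hash (series, time bucket) so one bucket of one series stays together."""
    key = f"{series}|{quantum_start(timestamp)}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(key).digest(), "big") % NUM_PARTITIONS


if __name__ == "__main__":
    # Readings from the same sensor inside one 15-minute bucket share a partition.
    print(partition_for_point("sensor-1", 900))
    print(partition_for_point("sensor-1", 1799))
    # A reading in the next bucket is hashed independently and may land elsewhere.
    print(partition_for_point("sensor-1", 1800))
```

Keeping a bucket on one partition means a range query over a short time span touches few nodes, while successive buckets still spread across the cluster.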
  • 43. Operational Analytics Using The Basho Data Platform And Apache Spark. Diagram labels: scalable operational application; hash partitioned data; recent data; Spark stack (Spark Core, Spark Streaming, BlinkDB, Spark SQL, GraphX, SparkR, MLlib); write back; operational analytics web service; operational analytic application; BI tool.
  • 44. Operational Analytics Using The Basho Data Platform And Apache Spark - 2 • Can develop Spark operational analytic applications on low latency data stored in Basho Riak KV • Spark-based analytical web services can be invoked on-demand to analyse data in Riak KV • Use on-demand Spark jobs for historical analysis and predictions • Insights produced from analysing Riak KV data can be written back to Riak KV for use by other applications (a form of closed-loop processing) • Spark Streaming can be used to calculate rollups and detect abnormalities on streaming sensor data • Recent data can be kept in Redis for dashboard visualization
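The closed loop described above (roll up a batch of sensor readings, flag abnormal values, write the results back for other applications, and keep the latest reading for a dashboard) can be sketched without any cluster. In this illustration plain dictionaries stand in for Riak KV and Redis, and `process_batch`, the key layout and the standard-deviation threshold are assumptions made for the example; a real deployment would use Spark Streaming and the stores' own clients.

```python
from statistics import mean, pstdev


def process_batch(batch, kv_store, cache, threshold=3.0):
    """Roll up one micro-batch of (sensor_id, value) readings and write results back.

    kv_store stands in for Riak KV and cache stands in for Redis (both plain dicts).
    """
    by_sensor = {}
    for sensor_id, value in batch:
        by_sensor.setdefault(sensor_id, []).append(value)

    for sensor_id, values in by_sensor.items():
        mu, sigma = mean(values), pstdev(values)
        # Rollup: summary statistics for this sensor in this batch.
        kv_store[("rollup", sensor_id)] = {
            "count": len(values), "mean": mu, "max": max(values),
        }
        # Abnormality rule (assumption): more than `threshold` std devs from the mean.
        anomalies = [v for v in values if sigma > 0 and abs(v - mu) > threshold * sigma]
        if anomalies:
            kv_store[("anomalies", sensor_id)] = anomalies
        # Keep the most recent reading available for dashboards.
        cache[sensor_id] = values[-1]


if __name__ == "__main__":
    riak_kv, redis = {}, {}
    readings = [("s1", 10.0)] * 10 + [("s1", 100.0)]
    process_batch(readings, riak_kv, redis)
    print(riak_kv[("rollup", "s1")])
    print(riak_kv.get(("anomalies", "s1")))
    print(redis["s1"])
```

Other applications can then read the rollup and anomaly entries from the store without running any analytics themselves, which is the point of writing insights back.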
  • 45. Topics – Where Are We? • The changing landscape of operational and analytical systems • Scalable operational applications and NoSQL data stores • Big data analytics – The era of Hadoop and Spark • The value of operational analytics • Operational analytics using the Basho Data Platform and Apache Spark • Conclusions (current topic)
  • 46. Conclusions • As operational application processing scales, so too does the need to scale operational analytics • Basho is using in-memory processing to accelerate operational applications (via Redis) and to introduce scalable operational analytics (via Spark) into these applications • New scalable ‘smart’ operational applications are therefore becoming possible with careful design in a NoSQL environment
  • 47. Thank You! www.intelligentbusiness.biz mferguson@intelligentbusiness.biz Twitter: @mikeferguson1 Tel/Fax (+44)1625 520700