© Copyright 2015 Glassbeam Inc.
Ad Hoc Analytics on Internet of Complex Things with Spark and Cassandra
Mohammed Guller
September 2015

About Me
• Principal Architect at Glassbeam
• Founded two startups
• Passionate about building products, big data analytics, and machine learning
www.linkedin.com/in/mohammedguller
@MohammedGuller
Available on Amazon

Internet of Things (IoT)
Network of objects embedded with software for collecting and exchanging data over the Internet

Internet of Complex Things (IoCT)
• Data Center Devices
  – Server, storage, controller
• Medical Devices
  – X-Ray, MRI scan, CT scan
• Manufacturing Systems
• Cars
• Electric Vehicle Chargers
• Other Complex Devices

Glassbeam target market is focused on driving operational & business analytics value for connected product companies in Industrial IoT market
• IT & Networks
• Medical & Healthcare
• EV Chargers & Smart Grid
• Industrial & Mfg
• Transportation
Advanced and Predictive Analytics for Connected Product Companies

Analytics on Operational Data
Operational Data to Powerful Insights

High-level Architecture
[Architecture diagram]
• Stages: Data Ingestion, Data Transformation, Data Stores, Middleware, Applications
• Input: Logs (streams/docs)
• Components: SPL Library, SCALAR, InfoServer
• Data Stores: S3 Amazon (raw logs), Cassandra (processed data), Solr Cloud (index)
• Analytics and Machine Learning: Spark SQL, Spark Streaming, MLlib
• Event Processing & Rules Engine
• Applications and Tools: LogVault, Explorer, Workbench, Standard Apps, Custom Apps, Rules & Alerts, DirectAccess
• Glassbeam Studio
• Cloud Enablement & Automation
End-to-end cloud-based architecture built on modern technologies to handle any machine, any data, any cloud
* SPL (Semiotic Parsing Language) and SCALAR are patent-pending technology inventions of Glassbeam

Key Properties of IoCT Data
• Volume: Terabytes of Data
• Variety: Multi-structured Data
• Velocity: Fast-Paced Batch Data, Streaming Data

Why We Chose C* (Cassandra)
• Volume: Economically Scale from Gigabytes to Terabytes of Data (Linear Scalability)
• Variety: Store Multi-structured Data (Dynamic Schema)
• Velocity: Fast Ingest of New Data, Quick Reload of Old Data (Fast Writes)

Modeling Data in C*
• Different from Modeling Data in RDBMS
• Queries Drive Table and Primary Key Definitions
  – Primary Key Definition Limits the Kind of Queries You Can Run
  – C* Does Not Support Joins

A Simple Table for Storing Event Data in C*
CREATE TABLE event (
    sys_id text,
    dt timestamp,
    ts timestamp,
    severity text,
    module text,
    message text,
    PRIMARY KEY ((sys_id, dt), ts)
) WITH CLUSTERING ORDER BY (ts DESC);

Another Table to Filter Events by Severity
CREATE TABLE event_by_severity (
    sys_id text,
    dt timestamp,
    ts timestamp,
    severity text,
    module text,
    message text,
    PRIMARY KEY ((sys_id, dt), severity, ts)
) WITH CLUSTERING ORDER BY (severity ASC, ts DESC);

Yet Another Table to Filter Events by Module
CREATE TABLE event_by_module (
    sys_id text,
    dt timestamp,
    ts timestamp,
    severity text,
    module text,
    message text,
    PRIMARY KEY ((sys_id, dt), module, ts)
) WITH CLUSTERING ORDER BY (module ASC, ts DESC);
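
Why each new filter needs its own table can be illustrated with a small model in plain Scala. This is only a sketch of the usual CQL rule for equality filters (every partition key column must be restricted, and restricted clustering columns must form a prefix of the clustering order), assuming no secondary indexes and no ALLOW FILTERING. The names PrimaryKey and canServe are hypothetical and are not part of any Cassandra API.

```scala
// Simplified model of a CQL primary key (illustration only, not a Cassandra API).
case class PrimaryKey(partitionKey: Seq[String], clustering: Seq[String])

// True when equality filters on `filterCols` can be served by the key alone:
// all partition key columns are restricted, and the restricted clustering
// columns form a prefix of the clustering order.
def canServe(pk: PrimaryKey, filterCols: Set[String]): Boolean = {
  val hasPartition = pk.partitionKey.forall(filterCols.contains)
  val clusteringUsed = pk.clustering.takeWhile(filterCols.contains)
  val covered = pk.partitionKey.toSet ++ clusteringUsed
  hasPartition && filterCols.subsetOf(covered)
}

// Keys copied from the event and event_by_severity tables.
val event = PrimaryKey(Seq("sys_id", "dt"), Seq("ts"))
val eventBySeverity = PrimaryKey(Seq("sys_id", "dt"), Seq("severity", "ts"))

println(canServe(event, Set("sys_id", "dt")))                       // true
println(canServe(event, Set("sys_id", "dt", "severity")))           // false
println(canServe(eventBySeverity, Set("sys_id", "dt", "severity"))) // true
println(canServe(eventBySeverity, Set("severity")))                 // false
```

Under this model, filtering by severity is possible only on event_by_severity, and even there only together with the full partition key (sys_id, dt).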

Ad Hoc Analytics with C*
• Oxymoron
• All Queries Must Be Known Upfront

Another Example
Sys_id Model Age OS City State Country

Intractable Number of Tables
Sys_id Model Age OS City State Country
• sys_by_model
• sys_by_os
• sys_by_age
• sys_by_state
• sys_by_state_age
• sys_by_age_state
• sys_by_model_age
• sys_by_age_model
• sys_by_age_model_state
• sys_by_model_state_age
• sys_by_model_state_os
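
A rough count shows how quickly the list grows. The sketch below assumes one table per non-empty ordered selection of the six non-key columns (order matters, since sys_by_state_age and sys_by_age_state are listed separately). That is an upper-bound assumption made for illustration, not a figure from the slides.

```scala
// Non-key columns of the example table; sys_id is assumed to stay in every key.
val columns = Seq("model", "age", "os", "city", "state", "country")

// Number of ordered selections of k distinct columns out of n: n! / (n - k)!
def orderedSelections(n: Int, k: Int): BigInt =
  (0 until k).map(i => BigInt(n - i)).product

// One table per non-empty ordered selection of columns.
val tableCount = (1 to columns.size).map(k => orderedSelections(columns.size, k)).sum

println(tableCount) // 1956
```

Even if only a fraction of these combinations is ever queried, maintaining a table per query pattern does not scale.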

Other Barriers to Ad Hoc Analytics
• No Aggregation
• No Group By
• No Joins

What Do I Do Now?

Spark
• Fast and General-purpose Cluster Computing Framework for Processing Large Datasets
• APIs in Scala, Java, Python, SQL, and R

Integrated Libraries for a Variety of Tasks
• Spark Core
• Spark SQL
• GraphX
• Spark Streaming
• MLlib & Spark ML

One Minor Problem!
• Spark Does Not Have Built-in Support for C*
• Built-in Support for HDFS, S3, and JDBC-compliant Databases

Spark Cassandra Connector
• Open Source Library for Integrating Spark with C*
• Enables a Spark Application to Process Data in C* Just Like Data from the Built-in Data Sources

Spark with C*
• Enables Ad Hoc Analytics
• CQL Limitations No Longer Apply
• Query Data Using SQL/HiveQL
  – Filter on Any Column
  – Aggregations
  – Group By
  – Join

Ad Hoc Analytics in Spark Shell

Launch the Spark Shell
/path/to/spark/bin/spark-shell \
    --master spark://host:7077 \
    --packages com.datastax.spark:spark-cassandra-connector_2.10:1.4.0

Create a DataFrame
val events = sqlContext.read
    .format("org.apache.spark.sql.cassandra")
    .options(Map(
        "keyspace" -> "test",
        "table" -> "event"))
    .load()

Fire Queries
// Cache the DataFrame so repeated queries do not reread the table from C*
events.cache()
// Filter on columns that are not part of the primary key
events.select("ts", "module", "message").where($"severity" === "ERROR").show
events.select("ts", "severity", "message").where($"module" === "m1").show
events.select("ts", "message").where($"severity" === "ERROR" &&
    $"module" === "m1").show
// Aggregate: number of events per severity
events.groupBy("severity").count().show
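
The same shell session can also run SQL, including a join, which CQL does not offer. The following is a minimal sketch for the Spark 1.4-era API in a spark-shell started with the connector, where sqlContext is predefined. The sys_info table (columns sys_id and model) and the loadTable helper are hypothetical and introduced only for illustration; the keyspace and the event table follow the earlier slides.

```scala
// Hypothetical helper: load a C* table from the "test" keyspace as a DataFrame.
def loadTable(table: String) = {
  sqlContext.read
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "test", "table" -> table))
    .load()
}

// Register both tables so they can be referenced by name in SQL.
loadTable("event").registerTempTable("event")
loadTable("sys_info").registerTempTable("sys_info") // hypothetical table

// Join events with system metadata, filter on a non-key column, and aggregate.
sqlContext.sql("""
  SELECT s.model, e.severity, count(1) AS total
  FROM event e JOIN sys_info s ON e.sys_id = s.sys_id
  WHERE e.module = 'm1'
  GROUP BY s.model, e.severity
""").show()
```

The join, the filter on module, and the GROUP BY are all evaluated by Spark; Cassandra only supplies the rows.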

Spark SQL JDBC/ODBC Server
• Analyze Data in C* with Just SQL/HiveQL
• Command Line Shell
  – Beeline
• Graphical SQL Client
  – SQuirreL SQL
• Data Visualization Applications
  – Tableau
  – Zoomdata
  – QlikView

Ad Hoc Analytics with Spark SQL JDBC/ODBC Server

Start the Spark SQL JDBC Server
/path/to/spark/sbin/start-thriftserver.sh \
    --master spark://hostname:7077 \
    --packages com.datastax.spark:spark-cassandra-connector_2.10:1.4.0

Launch Beeline From a Terminal
/path/to/spark/bin/beeline

Connect to the Spark SQL JDBC Server
beeline> !connect jdbc:hive2://localhost:10000

Create a Temporary Table
0: jdbc:hive2://localhost:10000> CREATE TEMPORARY TABLE event
. . . . . . . . . . . . . . . .> USING org.apache.spark.sql.cassandra
. . . . . . . . . . . . . . . .> OPTIONS (
. . . . . . . . . . . . . . . .> keyspace "test",
. . . . . . . . . . . . . . . .> table "event"
. . . . . . . . . . . . . . . .> );

Query Data with SQL/HiveQL
...> CACHE TABLE event;
...> SELECT severity, count(1) AS total FROM event GROUP BY severity;
...> SELECT module, severity, count(1) FROM event GROUP BY module, severity;

Caveats
• Latency
• Spark Query May Require Expensive Table Scan
  – Reads Every Row
  – Disk I/O Is Slow

Reduce the Impact of Slow Disk I/O
• Cache Tables
• Replace HDD with SSD
• Add More Nodes

Recommendations
• Known Queries Requiring Sub-second Response Time
  – Query C* Directly
  – Create Query-Specific Tables
  – Pre-aggregate Data
• Ad Hoc Queries
  – Spark
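
One possible way to combine the two recommendations is to let Spark pre-aggregate the data and write the result back to C*, where known queries can read it directly. This is a sketch under assumptions, not the presenter's implementation: it targets the Spark 1.4-era DataFrame API in a spark-shell started with the connector, and event_count_by_severity is a hypothetical table that must already exist with columns (sys_id, dt, severity, total) and PRIMARY KEY ((sys_id, dt), severity).

```scala
import org.apache.spark.sql.SaveMode

val cassandra = "org.apache.spark.sql.cassandra"
val source = Map("keyspace" -> "test", "table" -> "event")
// Hypothetical target table; it must be created in C* beforehand.
val target = Map("keyspace" -> "test", "table" -> "event_count_by_severity")

val events = sqlContext.read.format(cassandra).options(source).load()

// Count events per system, day, and severity; rename to match the target column.
val counts = events.groupBy("sys_id", "dt", "severity").count().withColumnRenamed("count", "total")

// Append the aggregated rows to the query-specific table.
counts.write.format(cassandra).options(target).mode(SaveMode.Append).save()
```

A dashboard that needs sub-second answers can then read event_count_by_severity with plain CQL, while ad hoc questions keep going through Spark.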