Visualizing Big Data in Realtime
Sasha Parfenov
sashap@apache.org
June 15, 2017
Agenda
Apache Apex
DataTorrent RTS
Real-time Dashboards and Widgets
App Data Framework
Apache Apex AutoMetrics
Exporting and Packaging Dashboards
Q&A
What is Apache Apex?
✓ Platform and Runtime Engine - enables development of scalable and
fault-tolerant distributed applications for processing streaming and batch data
✓ Highly Scalable - linear scalability to billions of events per second
✓ Highly Performant - millisecond end-to-end latency
✓ Fault Tolerant - automatically recovers from failures
✓ Stateful - guarantees that application state is preserved
✓ YARN Native - Uses Hadoop YARN for resource management
✓ Developer Friendly - Exposes an easy API for developing Operators, which
can include any custom logic written in Java
✓ Malhar Library - library of many popular operators and application examples
○ Input / Output Connectors - File Systems, RDBMS, NoSQL, Messaging, Social, …
○ Compute Operators - Parsers, Transforms, Stats, ML, Scripting, …
✓ Integrations - Calcite, SAMOA, Beam, Nifi, Geode, Bigtop, etc.
apex.apache.org
Apache Apex Use Cases
[Diagram: Data Sources feed a Streaming Computation DAG (Op1, Op2, Op3, Op4) running on Hadoop (YARN + HDFS); the DAG writes to Data Targets and drives Real-time Analytics & Visualizations and Actions & Insights]
Apache Apex Enables “Shift Left”
Apex Application Development
Application DAG is made up of connected
operators and streams
Stream is a sequence of data tuples
Operator takes one or more input streams,
performs computations & emits one or more
output streams
● Each Operator is YOUR custom business logic
in Java, or a built-in operator from our open
source library
● An Operator can have many instances that run in
parallel, and each instance is single-threaded
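The operator and stream model above can be illustrated with a toy, in-process sketch. This is plain Java and deliberately not the Apex API: `ToyDag`, `ToyOperator`, and `run` are hypothetical names made up for this example. It only shows the idea that each operator handles one tuple at a time and emits zero or more tuples to the next operator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Toy model of a two-operator DAG. Hypothetical names, not Apex classes.
final class ToyDag {
  /** An operator consumes one tuple and may emit zero or more tuples downstream. */
  interface ToyOperator<I, O> {
    void process(I tuple, Consumer<O> output);
  }

  /** Pushes every source tuple through first -> second and collects what reaches the sink. */
  static <I, M, O> List<O> run(List<I> source, ToyOperator<I, M> first, ToyOperator<M, O> second) {
    List<O> sink = new ArrayList<>();
    for (I tuple : source) {
      // The stream between the operators is modeled as a direct callback.
      first.process(tuple, mid -> second.process(mid, sink::add));
    }
    return sink;
  }

  /** Example DAG: drop empty lines, then emit the length of each remaining line. */
  static List<Integer> lengthsOfNonEmptyLines(List<String> lines) {
    ToyOperator<String, String> dropEmpty = (line, out) -> {
      if (!line.isEmpty()) {
        out.accept(line);
      }
    };
    ToyOperator<String, Integer> length = (line, out) -> out.accept(line.length());
    return run(lines, dropEmpty, length);
  }
}
```

A real Apex application would declare ports on each operator and connect them with streams in the DAG; the callback here only stands in for that wiring.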
Apache Apex & DataTorrent RTS
[Architecture diagram layers]
Solutions for Business: Fraud & Security | Ad Tech | ETL Pipelines | IoT & Industrial
Ingestion & Data Prep (Application Templates): FileSync | Kafka-to-HDFS | JDBC-to-HDFS | HDFS-to-HDFS | S3-to-HDFS
Awesome Visual Tools: GUI Application Assembly | Management & Monitoring | Real-Time Data Visualization
Dev Framework: High-level API | Transformation | ML & Score | SQL | Analytics | Batch Support
Core (Apache Apex): Apex-Malhar Operator Library | Apache Apex Core
Big Data Infrastructure: Hadoop 2.x - YARN + HDFS | On Prem & Cloud
DataTorrent RTS Visualization Demo
Realtime App Visualizations
● Apex App Visualizations
○ Events & Logs
○ Logical & Physical DAGs
○ Tuple Recordings
○ Stats & Metrics
○ Data Queries & Results
● Dashboards
○ Configurable
○ Export/Import via Apex app packages
● Widgets
○ Real-time data streams
○ Visualizations include tables, charts, maps, ...
○ Configurable
○ Support external development and dynamic
loading from Apex app packages.
Connecting Dashboards to App Data
[Diagram: Apex Applications with AppData Support exchange query and results messages with DataTorrent RTS Dashboard & Widgets through the DataTorrent RTS Gateway (dtGateway)]
App Data Framework
App Data Framework Documentation
http://docs.datatorrent.com/app_data_framework/
Data Sources are Query + Source + Result
operators exposed via Gateway Topics
App Data Framework Schema & Data Queries
Enables Real-time Visualization Widgets
[Diagram: Console and Gateway exchange Schema Subscribe, Schema Query, Schema Publish, Data Subscribe, Data Query, Data Renew, and Data Publish messages]
App Data Framework Schema Queries
1. Request application data sources
http://<gateway-host:port>/ws/v2/applications/<appId>
{
  ...
  "appDataSources": [
    {
      "name": "SnapshotServer.queryResult",
      "context": {...},
      "query": {
        "topic": "TwitterHashtagQueryDemo",
        ...
      },
      "result": {
        "topic": "TwitterHashtagQueryResultDemo",
        ...
      }
    }
  ]
}
2. Subscribe to schema result on a unique topic
ws://<gateway-host:port>/pubsub
{
  "type": "subscribe",
  "topic": "TwitterHashtagQueryResultDemo.0.20716154835833223"
}
3. Request schema from published DataSource topic
ws://<gateway-host:port>/pubsub
{
  "type": "publish",
  "topic": "TwitterHashtagQueryDemo",
  "data": {
    "id": 0.20716154835833223,
    "type": "schemaQuery",
    "context": {...}
  }
}
4. DataSource responds on unique topic
{
  "topic": "TwitterHashtagQueryResultDemo.0.20716154835833223",
  "data": {
    "id": "0.20716154835833223",
    "type": "schemaResult",
    "data": [
      {
        "values": [
          {"name": "hashtag", "type": "string"},
          {"name": "count", "type": "integer"}
        ],
        "schemaType": "snapshot",
        "schemaVersion": "1.0"
      }
    ]
  },
  "type": "data"
}
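The subscribe and publish messages from steps 2 and 3 can be assembled with a few string helpers. This is a minimal sketch, not DataTorrent client code: the class and method names are hypothetical, the elided "context" object is left out rather than guessed, and topic names are assumed to need no JSON escaping. The ids on the slides look like random fractions, so `newId` generates one the same way; whether the gateway accepts other id formats is not stated in the source.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper for building schema query messages as JSON strings.
final class SchemaQueryMessages {
  /** Result topic unique to one request: "<resultTopic>.<id>". */
  static String uniqueResultTopic(String resultTopic, String id) {
    return resultTopic + "." + id;
  }

  /** Step 2: subscribe to the unique result topic. */
  static String subscribe(String topic) {
    return "{\"type\":\"subscribe\",\"topic\":\"" + topic + "\"}";
  }

  /** Step 3: publish a schemaQuery on the data source's query topic ("context" omitted). */
  static String schemaQuery(String queryTopic, String id) {
    return "{\"type\":\"publish\",\"topic\":\"" + queryTopic + "\","
        + "\"data\":{\"id\":" + id + ",\"type\":\"schemaQuery\"}}";
  }

  /** Assumption: a random fraction is an acceptable request id, as on the slides. */
  static String newId() {
    return Double.toString(ThreadLocalRandom.current().nextDouble());
  }
}
```

The same id appears in both the subscribed topic and the published query, which is how the response in step 4 is matched to the request.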
App Data Framework Data Queries
1. Subscribe to data result on a unique topic
ws://<gateway-host:port>/pubsub
{
  "type": "subscribe",
  "topic": "TwitterHashtagQueryResultDemo.0.6760250790172551"
}
2. Request data on query topic with matching id
ws://<gateway-host:port>/pubsub
{
  "type": "publish",
  "topic": "TwitterHashtagQueryDemo",
  "data": {
    "id": 0.6760250790172551,
    "type": "dataQuery",
    "data": {
      "fields": [
        "hashtag",
        "count"
      ]
    },
    "countdown": 30,
    "incompleteResultOK": true
  }
}
3. Data is published on the unique result topic
{
  "topic": "TwitterHashtagQueryResultDemo.0.6760250790172551",
  "data": {
    "id": "0.6760250790172551",
    "type": "dataResult",
    "data": [
      {
        "count": "1398",
        "hashtag": "iHeartApache"
      },
      {
        "count": "1415",
        "hashtag": "ApexBigDataWorld"
      },
      {
        "count": "1498",
        "hashtag": "StreamingBigData"
      },
      {
        "count": "1521",
        "hashtag": "ApacheApex"
      },
      {
        "count": "1728",
        "hashtag": "DataTorrentRTS"
      },
      ...
    ],
    "countdown": "29"
  },
  "type": "data"
}
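The dataQuery message from step 2 can be built the same way. Again a sketch with hypothetical names, assuming field and topic names need no JSON escaping. The request carries "countdown": 30 and the result shows "countdown": "29", which suggests the value decreases with each published result until the query is renewed; that reading is an inference from the slides, not something they state.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for building a dataQuery publish message as a JSON string.
final class DataQueryMessage {
  static String dataQuery(String queryTopic, String id, List<String> fields, int countdown) {
    // Render the requested field names as a JSON array of strings.
    String fieldArray = fields.stream()
        .map(f -> "\"" + f + "\"")
        .collect(Collectors.joining(",", "[", "]"));
    return "{\"type\":\"publish\",\"topic\":\"" + queryTopic + "\",\"data\":{"
        + "\"id\":" + id + ","
        + "\"type\":\"dataQuery\","
        + "\"data\":{\"fields\":" + fieldArray + "},"
        + "\"countdown\":" + countdown + ","
        + "\"incompleteResultOK\":true}}";
  }
}
```

Calling `DataQueryMessage.dataQuery("TwitterHashtagQueryDemo", "0.6760250790172551", List.of("hashtag", "count"), 30)` yields a compact form of the message shown in step 2.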
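Apache Apex App Data with AutoMetrics
Easiest way to expose custom data in Apache Apex apps
import com.datatorrent.api.AutoMetric;
import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.common.util.BaseOperator;

public class LineReceiver extends BaseOperator
{
  // Tuples processed in the current streaming window
  @AutoMetric
  long evalsPerWindow;

  // Running total of tuples processed
  @AutoMetric
  long evalsTotal;

  public final transient DefaultInputPort<String> input = new DefaultInputPort<String>()
  {
    @Override
    public void process(String s)
    {
      evalsPerWindow++;
      evalsTotal++;
    }
  };

  @Override
  public void beginWindow(long windowId)
  {
    // Reset the per-window counter at the start of every window
    evalsPerWindow = 0;
  }
}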
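To see how the two counters in the LineReceiver operator diverge, the window lifecycle can be replayed without any Apex dependency. This is a plain-Java stand-in with hypothetical names (`WindowedCounters`, `replay`); it mimics only the beginWindow reset and the per-tuple increments, not the platform's metric collection.

```java
// Plain-Java stand-in for the counter logic of an operator with two metrics.
final class WindowedCounters {
  long evalsPerWindow;
  long evalsTotal;

  /** Mirrors beginWindow: only the per-window counter is reset. */
  void beginWindow() {
    evalsPerWindow = 0;
  }

  /** Mirrors process: both counters advance for every tuple. */
  void process(String tuple) {
    evalsPerWindow++;
    evalsTotal++;
  }

  /** Replays windows with the given tuple counts; returns {evalsPerWindow, evalsTotal}. */
  static long[] replay(int[] tuplesPerWindow) {
    WindowedCounters counters = new WindowedCounters();
    for (int tupleCount : tuplesPerWindow) {
      counters.beginWindow();
      for (int i = 0; i < tupleCount; i++) {
        counters.process("line");
      }
    }
    return new long[] {counters.evalsPerWindow, counters.evalsTotal};
  }
}
```

After windows of 3 and 2 tuples, the per-window counter holds the size of the last window while the total keeps accumulating.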
Example Operators with @AutoMetric
JsonParser.java, PojoToAvro.java, POJOKafkaOutputOperator.java
Custom Aggregators for non-numeric fields
Apache Apex - Building Custom Aggregators
Requesting AutoMetrics Data via StrAM API
http://<appMasterTrackingUrl>/ws/v2/stram/physicalPlan
{
  "operators": [{
    "name": "picalc",
    "metrics": {
      "evalsPerWindow": "23000",
      "evalsTotal": "1005787500"
    }
  }]
}
Get StrAM URL with Apex CLI
$ apex
apex> connect <appId>
apex (appId)> get-app-info
... "appMasterTrackingUrl": "node24.datatorrent.com:40466" …
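Once the tracking URL is known, the physical plan endpoint can be requested with the JDK's built-in HTTP client (Java 11 or later). A minimal sketch with a hypothetical class name; it assumes the endpoint is reachable over plain http without authentication, which may not hold on a secured cluster.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical client for the StrAM physical plan endpoint shown above.
final class StramMetricsClient {
  /** Builds the endpoint URI from a "host:port" tracking URL. */
  static URI physicalPlanUri(String appMasterTrackingUrl) {
    return URI.create("http://" + appMasterTrackingUrl + "/ws/v2/stram/physicalPlan");
  }

  /** Fetches the raw JSON body; per-operator metrics sit under "operators[].metrics". */
  static String fetchPhysicalPlan(String appMasterTrackingUrl)
      throws IOException, InterruptedException {
    HttpRequest request = HttpRequest.newBuilder(physicalPlanUri(appMasterTrackingUrl))
        .GET()
        .build();
    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    return response.body();
  }
}
```

Parsing the response is left to whichever JSON library the project already uses.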
Snapshot Schema Apps
Key Operators Enabling TopN Computation and Visualization
WindowedTopCounter<String> topCounts =
    dag.addOperator("TopCounter", new WindowedTopCounter<String>());

// Snapshot server configured with the schema and field mapping
AppDataSnapshotServerMap snapshotServer =
    dag.addOperator("SnapshotServer", new AppDataSnapshotServerMap());
snapshotServer.setSnapshotSchemaJSON(SNAPSHOT_SCHEMA);
snapshotServer.setTableFieldToMapField(conversionMap);

// Query operator embedded in the snapshot server
PubSubWebSocketAppDataQuery wsQuery = new PubSubWebSocketAppDataQuery();
wsQuery.setUri(uri);
snapshotServer.setEmbeddableQueryInfoProvider(wsQuery);

// Result operator added to the DAG as its own operator
PubSubWebSocketAppDataResult wsResult =
    dag.addOperator("QueryResult", new PubSubWebSocketAppDataResult());
wsResult.setUri(uri);
Operator.InputPort<String> queryResultPort = wsResult.input;
Snapshot Schema for SnapshotServer Operator
{
  "values": [{"name": "url", "type": "string"},
             {"name": "count", "type": "integer"}]
}
Available SnapshotServer Implementations
AppDataSnapshotServerMap.java
AppDataSnapshotServerPOJO.java
Example Applications with Snapshot Schema
TwitterTopCounterApplication.java (twitter)
ApplicationAppData.java (pi demo)
Twitter Demo Logical Plan with Snapshot Schema
Dimensions Schema Apps
Dimensions Schema for DimensionsComputation Operator
{
  "keys": [
    {"name":"channel","type":"string","enumValues":["Mobile","Online","Store"]},
    {"name":"region","type":"string","enumValues":["Dallas","New York","San Francisco", ... ]},
    {"name":"product","type":"string","enumValues":["Laptops","Printers","Routers", ...]}],
  "timeBuckets": ["1m", "1h", "1d", "5m"],
  "values": [
    {"name":"sales","type":"double","aggregators":["SUM"]},
    {"name":"discount","type":"double","aggregators":["SUM"]},
    {"name":"tax","type":"double","aggregators":["SUM"]}],
  "dimensions": [
    {"combination":[]},
    {"combination":["region"]},
    {"combination":["product"]},
    {"combination":["channel","product"]},
    {"combination":["channel","region","product"]}]
}
// full schema -> salesGenericEventSchema.json
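Each entry under "dimensions" in the schema names a combination of keys by which the values are aggregated. The idea can be shown with a plain-Java grouping sketch. This is not the DimensionsComputation operator: `DimensionSums` and `sumBy` are hypothetical names, events are modeled as string maps, only the SUM aggregator is covered, and time buckets are ignored.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustration of SUM aggregation over one key combination. Hypothetical names.
final class DimensionSums {
  /**
   * Sums valueField over all events, grouped by the given key combination.
   * The group key is the key values joined with "|"; the empty combination
   * produces a single group with key "" (a grand total).
   */
  static Map<String, Double> sumBy(
      List<Map<String, String>> events, List<String> combination, String valueField) {
    Map<String, Double> sums = new LinkedHashMap<>();
    for (Map<String, String> event : events) {
      String groupKey = combination.stream()
          .map(event::get)
          .collect(Collectors.joining("|"));
      sums.merge(groupKey, Double.parseDouble(event.get(valueField)), Double::sum);
    }
    return sums;
  }
}
```

Running `sumBy` once per combination in the schema gives one aggregate table per combination, which is roughly what a dimensional query then selects from.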
Key Operators Enabling Dimensions Computation and Visualization
DimensionsComputationFlexibleSingleSchemaMap dimensions =
    dag.addOperator("DimensionsComputation", DimensionsComputationFlexibleSingleSchemaMap.class);
AppDataSingleSchemaDimensionStoreHDHT store =
    dag.addOperator("Store", AppDataSingleSchemaDimensionStoreHDHT.class);
PubSubWebSocketAppDataQuery wsIn = new PubSubWebSocketAppDataQuery();
store.setEmbeddableQueryInfoProvider(wsIn);
PubSubWebSocketAppDataResult wsOut =
    dag.addOperator("QueryResult", new PubSubWebSocketAppDataResult());
Example Applications with Dimensions Schema
CDRDemoV2.java
SalesDemo.java
Sales Demo Logical Plan with Dimensions Schema
Exporting and Packaging Dashboards
1. Create and download dashboard from UI Console
2. Copy dashboards to Apex app project folder under
<Apex App>/src/main/resources/resources/ui/dashboards/
- TwitterDemo.dtdashboard
- SalesDemo.dtdashboard
3. Create ui.json in Apex app project folder under
<Apex App>/src/main/resources/resources/ui/ui.json
{
  "dashboards": [
    {
      "file": "TwitterDemo.dtdashboard"
    },
    {
      "name": "Sales Dimensions Demo",
      "file": "SalesDemo.dtdashboard",
      "appNames": ["SalesDemo-Sasha", "SalesDemo"]
    }
  ]
}
// "appNames" is used to auto-associate packaged dashboards with running apps
4. Compile Apex app project and verify .apa package has
myApp.apa
  + resources/
    + ui/
      - ui.json
      + dashboards/
        - TwitterDemo.dtdashboard
        - SalesDemo.dtdashboard
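The package check can be scripted. The sketch below assumes the .apa file is a zip-format archive readable with `java.util.zip` and that the dashboards land under resources/ui/ inside it, mirroring the source folder layout; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical check that a built package contains ui.json and its dashboards.
final class ApaDashboardCheck {
  /** Returns the expected entries that are absent from the given entry names. */
  static List<String> missingEntries(Set<String> entryNames, List<String> dashboardFiles) {
    List<String> expected = new ArrayList<>();
    expected.add("resources/ui/ui.json");
    for (String file : dashboardFiles) {
      expected.add("resources/ui/dashboards/" + file);
    }
    expected.removeAll(entryNames);
    return expected;
  }

  /** Lists entry names of the archive (assumption: .apa is zip-format). */
  static Set<String> entryNames(String apaPath) throws IOException {
    try (ZipFile zip = new ZipFile(apaPath)) {
      return zip.stream().map(ZipEntry::getName).collect(Collectors.toSet());
    }
  }
}
```

For example, `missingEntries(entryNames("myApp.apa"), List.of("TwitterDemo.dtdashboard", "SalesDemo.dtdashboard"))` returns an empty list when everything was packaged.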
Questions?
Sasha Parfenov
sashap@apache.org
@utdsasha
Thank You!
Resources
• Apache Apex - http://apex.apache.org/
• Subscribe to forums
○ Apex - http://apex.apache.org/community.html
○ DataTorrent - https://groups.google.com/forum/#!forum/dt-users
• Download - https://datatorrent.com/download/
• Twitter
○ @ApacheApex; Follow - https://twitter.com/apacheapex
○ @DataTorrent; Follow - https://twitter.com/datatorrent
• Meetups - http://meetup.com/topics/apache-apex
• Webinars - https://datatorrent.com/webinars/
• Videos - https://youtube.com/user/DataTorrent
• Slides - http://slideshare.net/DataTorrent/presentations
• Startup Accelerator Program - Full featured enterprise product
○ https://datatorrent.com/product/start-up-accelerator/
• Big Data Application Templates/Examples - https://datatorrent.com/apphub
We Are Hiring!
jobs@datatorrent.com
