Visualizing Big Data in Realtime
- 3.
What is Apache Apex?
✓ Platform and Runtime Engine - enables development of scalable and
fault-tolerant distributed applications for processing streaming and batch data
✓ Highly Scalable - linear scalability to billions of events per second
✓ Highly Performant - millisecond end-to-end latency
✓ Fault Tolerant - automatically recovers from failures
✓ Stateful - guarantees that application state is preserved
✓ YARN Native - Uses Hadoop YARN for resource management
✓ Developer Friendly - Exposes an easy API for developing Operators, which
can include any custom logic written in Java
✓ Malhar Library - library of many popular operators and application examples
○ Input / Output Connectors - File Systems, RDBMS, NoSQL, Messaging, Social, …
○ Compute Operators - Parsers, Transforms, Stats, ML, Scripting, …
✓ Integrations - Calcite, SAMOA, Beam, NiFi, Geode, Bigtop, etc.
apex.apache.org
- 4.
Apache Apex Use Cases
Diagram: Data Sources feed a streaming computation DAG (operators Op1-Op4) running on Hadoop (YARN + HDFS); the streaming computation drives Real-time Analytics & Visualizations, produces Actions & Insights, and writes to Data Targets.
- 6.
Apex Application Development
An Application DAG is made up of connected operators and streams.
A Stream is a sequence of data tuples.
An Operator takes one or more input streams, performs computations & emits one or more output streams.
● Each Operator is YOUR custom business logic in Java, or a built-in operator from our open source library
● Each Operator has many instances that run in parallel, and each instance is single-threaded
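The operator-and-stream model above can be illustrated outside of Apex: operators are computation steps, and streams carry tuples between them. A minimal, framework-free Python sketch (the operator names are illustrative only, not Apex API; real Apex operators are Java classes):

```python
# Minimal sketch of the operator/stream model: each operator consumes
# tuples from its input stream, computes, and emits tuples downstream.

def parse(lines):                 # operator 1: normalize raw tuples
    for line in lines:
        yield line.strip().lower()

def count_words(words):           # operator 2: stateful computation
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts

# "Streams" connect the operators: the output of one is the input of the next.
source = ["Apex ", "apex", "Hadoop"]
result = count_words(parse(source))
print(result)                     # {'apex': 2, 'hadoop': 1}
```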
- 7.
Apache Apex & DataTorrent RTS
Architecture (DataTorrent RTS on Apache Apex, bottom to top):
● Big Data Infrastructure - Hadoop 2.x (YARN + HDFS) | On Prem & Cloud
● Core - Apache Apex Core + Apex-Malhar Operator Library
● Dev Framework - High-level API, Batch Support, Transformation, ML & Score, SQL, Analytics
● Ingestion & Data Prep - Application Templates: FileSync | Kafka-to-HDFS | JDBC-to-HDFS | HDFS-to-HDFS | S3-to-HDFS
● Awesome Visual Tools - GUI Application Assembly, Management & Monitoring, Real-Time Data Visualization
● Solutions for Business - Fraud & Security, Ad Tech, ETL Pipelines, IoT & Industrial
- 9.
Realtime App Visualizations
● Apex App Visualizations
○ Events & Logs
○ Logical & Physical DAGs
○ Tuple Recordings
○ Stats & Metrics
○ Data Queries & Results
● Dashboards
○ Configurable
○ Export/Import via Apex app packages
● Widgets
○ Real-time data streams
○ Visualizations include tables, charts, maps, ...
○ Configurable
○ Support external development and dynamic loading from Apex app packages
- 10.
Connecting Dashboards to App Data
Diagram: DataTorrent RTS Dashboard & Widgets send queries to, and receive results from, the DataTorrent RTS Gateway (dtGateway), which relays them to Apex applications with AppData support.
- 11.
App Data Framework
App Data Framework Documentation
http://docs.datatorrent.com/app_data_framework/
Data Sources are Query + Source + Result operators exposed via Gateway Topics.
App Data Framework schema & data queries enable real-time visualization widgets.
Diagram: the Console issues schema queries, data queries, and data renewals through the Gateway's query topic, and subscribes to the schema and data topics on which the application publishes its schema and data results.
- 12.
App Data Framework Schema Queries
1. Request application data sources
http://<gateway-host:port>/ws/v2/applications/<appId>
{
...
"appDataSources": [
{
"name": "SnapshotServer.queryResult",
"context": {...},
"query": {
"topic": "TwitterHashtagQueryDemo",
...
},
"result": {
"topic": "TwitterHashtagQueryResultDemo",
...
}
}
]
}
2. Subscribe to schema result on a unique topic
ws://<gateway-host:port>/pubsub
{
"type": "subscribe",
"topic": "TwitterHashtagQueryResultDemo.0.20716154835833223"
}
3. Request schema from published DataSource topic
ws://<gateway-host:port>/pubsub
{
"type": "publish",
"topic": "TwitterHashtagQueryDemo",
"data": {
"id": 0.20716154835833223,
"type": "schemaQuery",
"context": {...}
}
}
4. DataSource responds on unique topic
{
"topic": "TwitterHashtagQueryResultDemo.0.20716154835833223",
"data": {
"id": "0.20716154835833223",
"type": "schemaResult",
"data": [
{
"values": [{
"name": "hashtag",
"type": "string"
},{
"name": "count",
"type": "integer"
}
],
"schemaType": "snapshot",
"schemaVersion": "1.0"
}
]
},
"type": "data"
}
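The four steps above can be reproduced by any WebSocket client: subscribe on a unique result topic first, then publish a schemaQuery with a matching id. A Python sketch that builds the two pubsub frames (topic names are taken from the demo above; the id would normally be a freshly generated random number):

```python
import json

QUERY_TOPIC = "TwitterHashtagQueryDemo"
RESULT_TOPIC = "TwitterHashtagQueryResultDemo"
query_id = "0.20716154835833223"   # normally generated per query

# Step 2: subscribe to the unique result topic before publishing the query
subscribe_frame = {
    "type": "subscribe",
    "topic": "%s.%s" % (RESULT_TOPIC, query_id),
}

# Step 3: publish the schemaQuery on the DataSource's query topic
schema_query_frame = {
    "type": "publish",
    "topic": QUERY_TOPIC,
    "data": {"id": query_id, "type": "schemaQuery"},
}

# Both frames would be sent as text messages to ws://<gateway-host:port>/pubsub
print(json.dumps(subscribe_frame))
print(json.dumps(schema_query_frame))
```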
- 13.
App Data Framework Data Queries
1. Subscribe to data result on a unique topic
ws://<gateway-host:port>/pubsub
{
"type": "subscribe",
"topic": "TwitterHashtagQueryResultDemo.0.6760250790172551"
}
2. Request data on query topic with matching id
ws://<gateway-host:port>/pubsub
{
"type": "publish",
"topic": "TwitterHashtagQueryDemo",
"data": {
"id": 0.6760250790172551,
"type": "dataQuery",
"data": {
"fields": [
"hashtag",
"count"
]
},
"countdown": 30,
"incompleteResultOK": true
}
}
3. Data is published on the unique result topic
{
"topic": "TwitterHashtagQueryResultDemo.0.6760250790172551",
"data": {
"id": "0.6760250790172551",
"type": "dataResult",
"data": [
{
"count": "1398",
"hashtag": "iHeartApache"
},
{
"count": "1415",
"hashtag": "ApexBigDataWorld"
},
{
"count": "1498",
"hashtag": "StreamingBigData"
},
{
"count": "1521",
"hashtag": "ApacheApex"
},
{
"count": "1728",
"hashtag": "DataTorrentRTS"
},
...
],
"countdown": "29"
},
"type": "data"
}
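A dataQuery frame like the one above can be assembled the same way; `countdown` asks the DataSource to keep republishing results for that many windows, so a dashboard widget renews the query periodically to keep data flowing. A Python sketch (the helper function is hypothetical, the payload shape follows the example):

```python
import json

def make_data_query(query_id, fields, countdown=30):
    # Shape follows the dataQuery example above; the DataSource republishes
    # results each window until the countdown expires (or the query is renewed).
    return {
        "type": "publish",
        "topic": "TwitterHashtagQueryDemo",
        "data": {
            "id": query_id,
            "type": "dataQuery",
            "data": {"fields": list(fields)},
            "countdown": countdown,
            "incompleteResultOK": True,
        },
    }

frame = make_data_query("0.6760250790172551", ["hashtag", "count"])
print(json.dumps(frame, indent=2))
```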
- 14.
Easiest way to expose custom data in Apache Apex apps
import com.datatorrent.api.AutoMetric;
import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.common.util.BaseOperator;
public class LineReceiver extends BaseOperator
{
@AutoMetric
long evalsPerWindow;
@AutoMetric
long evalsTotal;
public final transient DefaultInputPort<String> input = new DefaultInputPort<String>()
{
@Override
public void process(String s)
{
evalsPerWindow++;
evalsTotal++;
}
};
@Override
public void beginWindow(long windowId)
{
evalsPerWindow = 0;
}
}
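The operator above keeps two counters: `evalsPerWindow` is reset in `beginWindow`, so the platform reports a per-window rate alongside a running total. The same reset-per-window pattern, sketched in Python for illustration:

```python
# Mimics the LineReceiver metrics: a per-window counter reset at each
# window boundary, plus a monotonically growing total.
class LineReceiver:
    def __init__(self):
        self.evals_per_window = 0
        self.evals_total = 0

    def begin_window(self):
        self.evals_per_window = 0   # same reset as beginWindow() above

    def process(self, _tuple):
        self.evals_per_window += 1
        self.evals_total += 1

op = LineReceiver()
for window in range(3):          # three streaming windows
    op.begin_window()
    for t in ["a", "b"]:         # two tuples per window
        op.process(t)

print(op.evals_per_window, op.evals_total)   # 2 6
```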
Apache Apex App Data with AutoMetrics
Example Operators with @AutoMetric
JsonParser.java, PojoToAvro.java, POJOKafkaOutputOperator.java
Custom Aggregators for non-numeric fields
Apache Apex - Building Custom Aggregators
Requesting AutoMetrics Data via StrAM API
http://<appMasterTrackingUrl>/ws/v2/stram/physicalPlan
{
"operators": [{
"name": "picalc",
"metrics": {
"evalsPerWindow": "23000",
"evalsTotal": "1005787500"
}
}]
}
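Given the physicalPlan response above, pulling a metric out is a simple JSON traversal. A Python sketch against the sample payload (real code would first fetch the URL built from the StrAM tracking address):

```python
import json

# Sample payload shaped like the /ws/v2/stram/physicalPlan excerpt above
payload = json.loads("""
{"operators": [{"name": "picalc",
                "metrics": {"evalsPerWindow": "23000",
                            "evalsTotal": "1005787500"}}]}
""")

def metric(plan, operator_name, metric_name):
    # Metric values arrive as strings in the sample, so convert to int here.
    for op in plan["operators"]:
        if op["name"] == operator_name:
            return int(op["metrics"][metric_name])
    raise KeyError(operator_name)

print(metric(payload, "picalc", "evalsPerWindow"))   # 23000
```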
Get StrAM URL with Apex CLI
$ apex
apex> connect <appId>
apex (appId)> get-app-info
... "appMasterTrackingUrl": "node24.datatorrent.com:40466" …
- 15.
Key Operators Enabling TopN Computation and Visualization
WindowedTopCounter<String> topCounts = dag.addOperator("TopCounter", new WindowedTopCounter<String>());
AppDataSnapshotServerMap snapshotServer = dag.addOperator("SnapshotServer", new AppDataSnapshotServerMap());
snapshotServer.setSnapshotSchemaJSON(SNAPSHOT_SCHEMA);
snapshotServer.setTableFieldToMapField(conversionMap);
PubSubWebSocketAppDataQuery wsQuery = new PubSubWebSocketAppDataQuery();
wsQuery.setUri(uri);
snapshotServer.setEmbeddableQueryInfoProvider(wsQuery);
PubSubWebSocketAppDataResult wsResult = dag.addOperator("QueryResult", new PubSubWebSocketAppDataResult());
wsResult.setUri(uri);
Operator.InputPort<String> queryResultPort = wsResult.input;
Snapshot Schema for SnapshotServer Operator
{
"values": [{"name": "url", "type": "string"},
{"name": "count", "type": "integer"}]
}
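A snapshot schema like this tells widgets which fields and types to expect in each published row. A small Python sketch that checks rows against such a schema (the validator and its type mapping are assumptions for this sketch, not part of the framework):

```python
SNAPSHOT_SCHEMA = {
    "values": [{"name": "url", "type": "string"},
               {"name": "count", "type": "integer"}],
}

# Map schema type names to Python types -- an assumption for this sketch.
TYPES = {"string": str, "integer": int, "double": float}

def validate_row(schema, row):
    # A row is valid if every schema field is present with the right type.
    for field in schema["values"]:
        value = row.get(field["name"])
        if not isinstance(value, TYPES[field["type"]]):
            return False
    return True

ok = validate_row(SNAPSHOT_SCHEMA, {"url": "apex.apache.org", "count": 42})
bad = validate_row(SNAPSHOT_SCHEMA, {"url": "apex.apache.org", "count": "42"})
print(ok, bad)   # True False
```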
Snapshot Schema Apps
Available SnapshotServer Implementations
AppDataSnapshotServerMap.java
AppDataSnapshotServerPOJO.java
Example Applications with Snapshot Schema
TwitterTopCounterApplication.java (twitter)
ApplicationAppData.java (pi demo)
Twitter Demo Logical Plan with Snapshot Schema
- 16.
Dimensions Schema for DimensionsComputation Operator
{
"keys":[{"name":"channel","type":"string","enumValues":["Mobile","Online","Store"]},
{"name":"region","type":"string","enumValues":["Dallas","New York","San Francisco", ... ]},
{"name":"product","type":"string","enumValues":["Laptops","Printers","Routers", ...]}],
"timeBuckets":["1m", "1h", "1d", "5m"],
"values":
[{"name":"sales","type":"double","aggregators":["SUM"]},
{"name":"discount","type":"double","aggregators":["SUM"]},
{"name":"tax","type":"double","aggregators":["SUM"]}],
"dimensions":
[{"combination":[]},
{"combination":["region"]},
{"combination":["product"]},
{"combination":["channel","product"]},
{"combination":["channel","region","product"]}]
}
// full schema -> salesGenericEventSchema.json
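Each entry in `dimensions`, combined with each time bucket and each value/aggregator pair, yields one stored aggregate series, so the schema above implicitly defines combinations × timeBuckets × (values × aggregators) series. A Python sketch of that expansion (using a trimmed copy of the schema above):

```python
from itertools import product

# Trimmed version of the sales dimensions schema above
time_buckets = ["1m", "1h", "1d", "5m"]
values = {"sales": ["SUM"], "discount": ["SUM"], "tax": ["SUM"]}
combinations = [[], ["region"], ["product"],
                ["channel", "product"], ["channel", "region", "product"]]

# One aggregate series per (combination, time bucket, value, aggregator)
series = [(tuple(c), b, v, a)
          for c, b in product(combinations, time_buckets)
          for v, aggs in values.items()
          for a in aggs]

print(len(series))   # 5 combinations * 4 buckets * 3 values * 1 aggregator = 60
```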
Dimensions Schema Apps
Key Operators Enabling Dimensions Computation and
Visualization
DimensionsComputationFlexibleSingleSchemaMap dimensions =
  dag.addOperator("DimensionsComputation", DimensionsComputationFlexibleSingleSchemaMap.class);
AppDataSingleSchemaDimensionStoreHDHT store =
  dag.addOperator("Store", AppDataSingleSchemaDimensionStoreHDHT.class);
PubSubWebSocketAppDataQuery wsIn = new PubSubWebSocketAppDataQuery();
store.setEmbeddableQueryInfoProvider(wsIn);
PubSubWebSocketAppDataResult wsOut =
  dag.addOperator("QueryResult", new PubSubWebSocketAppDataResult());
Example Applications with Dimensions Schema
CDRDemoV2.java
SalesDemo.java
Sales Demo Logical Plan with Dimensions Schema
- 17.
Exporting and Packaging Dashboards
1. Create and download dashboard from UI Console
2. Copy dashboards to Apex app project folder under
<Apex App>/src/main/resources/resources/ui/dashboards/
- TwitterDemo.dtdashboard
- SalesDemo.dtdashboard
3. Create ui.json in Apex app project folder under
<Apex App>/src/main/resources/resources/ui/ui.json
{
"dashboards": [
{
"file": "TwitterDemo.dtdashboard"
},
{
"name": "Sales Dimensions Demo",
"file": "SalesDemo.dtdashboard",
"appNames": ["SalesDemo-Sasha", "SalesDemo"]
}
]
}
// "appNames" is used to auto-associate packaged dashboards with running apps
4. Compile Apex app project and verify the .apa package contains
myApp.apa
+ resources/
  + ui/
    - ui.json
    + dashboards/
      - TwitterDemo.dtdashboard
      - SalesDemo.dtdashboard
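The packaging check in step 4 can be automated: an .apa package is a zip archive, so its layout can be inspected programmatically. A Python sketch that builds a stand-in archive and verifies the expected entries (file contents here are placeholders):

```python
import io
import zipfile

# Build a stand-in .apa in memory with the layout from step 4
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as apa:
    apa.writestr("resources/ui/ui.json", "{}")
    apa.writestr("resources/ui/dashboards/TwitterDemo.dtdashboard", "")
    apa.writestr("resources/ui/dashboards/SalesDemo.dtdashboard", "")

expected = {"resources/ui/ui.json",
            "resources/ui/dashboards/TwitterDemo.dtdashboard",
            "resources/ui/dashboards/SalesDemo.dtdashboard"}

# Reopen and compare the entry list against the expected layout
with zipfile.ZipFile(buf) as apa:
    names = set(apa.namelist())

missing = expected - names
print(sorted(missing))   # [] -> package is complete
```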
- 20.
Resources
• Apache Apex - http://apex.apache.org/
• Subscribe to forums
ᵒ Apex - http://apex.apache.org/community.html
ᵒ DataTorrent - https://groups.google.com/forum/#!forum/dt-users
• Download - https://datatorrent.com/download/
• Twitter
ᵒ @ApacheApex; Follow - https://twitter.com/apacheapex
ᵒ @DataTorrent; Follow - https://twitter.com/datatorrent
• Meetups - http://meetup.com/topics/apache-apex
• Webinars - https://datatorrent.com/webinars/
• Videos - https://youtube.com/user/DataTorrent
• Slides - http://slideshare.net/DataTorrent/presentations
• Startup Accelerator Program - Full featured enterprise product
ᵒ https://datatorrent.com/product/start-up-accelerator/
• Big Data Application Templates/Examples - https://datatorrent.com/apphub
We Are Hiring!
jobs@datatorrent.com