The document discusses processing large amounts of "big data" in real time. It proposes developing a "gogobot checkins heat-map" service that would collect check-in locations from text addresses, geocode the locations, and display the locations as a heat map over time intervals. Key aspects discussed include using Storm for horizontal scalability and fault tolerance without message brokers. Sample check-in data would be used to test an initial topology design in Storm before connecting to real data streams.
10. Key Notes
● Collector Service - Collects checkins as text addresses
  – We need to use GeoLocation Service
● Upon elapsed interval, the last locations list will be displayed as Heat-Map in GUI.
● Web Scale service – 10Ks checkins/second all over the world (imaginary, but let's do it for the exercise).
● Accuracy – Sample data, NOT critical data.
  – Proportionately representative
  – Data volume is large enough to compensate for data loss.
18. Problems ?
● Tedious: You spend time configuring where to send messages, deploying workers, and deploying intermediate queues.
● Brittle: There's little fault-tolerance.
● Painful to scale: Partitioning running workers is complicated.
19. What We Want ?
● Horizontal scalability
● Fault-tolerance
● No intermediate message brokers!
● Higher level abstraction than message
passing
● “Just works”
● Guaranteed data processing (not in this
case)
22. What is Storm ?
● CEP - Open source and distributed realtime computation system.
  – Makes it easy to reliably process unbounded streams of tuples
  – Doing for realtime processing what Hadoop did for batch processing.
● Fast - 1M Tuples/sec per node.
  – It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.
27. Guarantee for Processing
● Storm guarantees the full processing of a tuple by tracking its state
● In case of failure, Storm can re-process it.
● Source tuples with full “acked” trees are removed from the system
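The tracking idea in this slide can be illustrated with a small sketch. Storm's acker is commonly described as keeping one XOR checksum per source tuple: every tuple id in the tree is XORed in once when the tuple is emitted and once when it is acked, so the checksum returns to zero exactly when every emitted tuple has been acked. The `AckTracker` class, its method names, and the small integer ids below are hypothetical and only illustrate that idea; they are not Storm's API, and real Storm relies on random 64-bit ids and timeouts.

```javascript
// Hypothetical in-memory tracker; an illustration only, not Storm's actual API.
class AckTracker {
  constructor() {
    this.pending = new Map(); // source tuple id -> XOR checksum of open tuple ids
  }

  // Called when a spout emits a new source tuple.
  emitRoot(rootId) {
    this.pending.set(rootId, rootId);
  }

  // Called when a bolt emits a child tuple anchored to the source tuple.
  emitChild(rootId, childId) {
    this.pending.set(rootId, this.pending.get(rootId) ^ childId);
  }

  // Called when any tuple in the tree is acked.
  // Returns true when the whole tree is acked and the source tuple was removed.
  ack(rootId, tupleId) {
    const checksum = this.pending.get(rootId) ^ tupleId;
    if (checksum === 0) {
      this.pending.delete(rootId);
      return true;
    }
    this.pending.set(rootId, checksum);
    return false;
  }

  // Source tuples that are still pending are the candidates for re-processing.
  unfinished() {
    return Array.from(this.pending.keys());
  }
}

// Small ids are used for readability; they were chosen so that the checksum
// never reaches zero before the last ack.
const tracker = new AckTracker();
tracker.emitRoot(7);                    // spout emits source tuple 7
tracker.emitChild(7, 21);               // a bolt emits two children anchored to it
tracker.emitChild(7, 42);
const afterRoot = tracker.ack(7, 7);    // root acked, children still open
const afterFirst = tracker.ack(7, 21);  // one child still open
const afterLast = tracker.ack(7, 42);   // last ack completes the tree
```

If a tuple is never acked, its source id stays in `unfinished()`, which is where a replay decision would be made in this sketch.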
30. Stream Grouping
● Shuffle grouping: pick a random task
● Fields grouping: consistent hashing on a subset of tuple fields
● All grouping: send to all tasks
● Global grouping: pick task with lowest id
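The four groupings can be sketched as functions that map a tuple to the task indexes that should receive it. This is a minimal illustration, not Storm's implementation: the function names, the `city` field, and the plain hash-modulo routing (used here as a simple stand-in for the hashing the slide mentions) are all assumptions.

```javascript
// Hypothetical helpers; each returns the list of task indexes that receive the tuple.

// Simple deterministic string hash, kept as an unsigned 32-bit value.
function hashString(text) {
  let hash = 0;
  for (let i = 0; i < text.length; i++) {
    hash = (hash * 31 + text.charCodeAt(i)) >>> 0;
  }
  return hash;
}

// Shuffle grouping: pick a random task.
function shuffleGrouping(tuple, numTasks, random = Math.random) {
  return [Math.floor(random() * numTasks)];
}

// Fields grouping: tuples with equal values in the chosen fields go to the same task.
function fieldsGrouping(tuple, numTasks, fieldNames) {
  const key = fieldNames.map((name) => String(tuple[name])).join('|');
  return [hashString(key) % numTasks];
}

// All grouping: send to all tasks.
function allGrouping(tuple, numTasks) {
  return Array.from({ length: numTasks }, (_, index) => index);
}

// Global grouping: pick the task with the lowest id.
function globalGrouping() {
  return [0];
}

// Two check-ins that share the (hypothetical) "city" field land on the same task.
const first = fieldsGrouping({ time: '9:00:07 PM', city: 'New York' }, 4, ['city']);
const second = fieldsGrouping({ time: '9:00:09 PM', city: 'New York' }, 4, ['city']);
```

Fields grouping is the one that matters when a bolt keeps per-key state, because every tuple for a given key reaches the same task.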
37. HeatMap Input/Output Tuples
● Input Tuples: Timestamp and Text Address:
  – (9:00:07 PM, “287 Hudson St New York NY 10013”)
● Output Tuple: Time interval, and a list of points for it:
  – (9:00:00 PM to 9:00:15 PM, List((40.719,-73.987),(40.726,-74.001),(40.719,-73.987)))
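The mapping from input tuples to output tuples can be sketched as a bucketing step: geocode each address, then group the resulting points by the 15-second interval that contains the timestamp. The sketch below is an assumption-laden illustration, not the topology from the talk: `KNOWN_ADDRESSES` is a hypothetical stand-in for the GeoLocation service, and its coordinates are placeholder values borrowed from the slide's output list, not a verified geocoding result.

```javascript
// Hypothetical stand-in for the GeoLocation service: a fixed lookup table.
// The coordinates are placeholders, not a verified geocoding result.
const KNOWN_ADDRESSES = {
  '287 Hudson St New York NY 10013': [40.726, -74.001],
};

const INTERVAL_SECONDS = 15;

// Converts a clock string such as "9:00:07 PM" into seconds since midnight.
function toSeconds(clock) {
  const match = /^(\d{1,2}):(\d{2}):(\d{2}) (AM|PM)$/.exec(clock);
  if (!match) throw new Error('Unrecognized time: ' + clock);
  const hours = (Number(match[1]) % 12) + (match[4] === 'PM' ? 12 : 0);
  return hours * 3600 + Number(match[2]) * 60 + Number(match[3]);
}

// Returns [lat, lng] for a known address, or null when it cannot be resolved.
function geocode(address) {
  return KNOWN_ADDRESSES[address] || null;
}

// Groups geocoded check-ins by the start of the interval they fall into.
function bucketCheckins(checkins) {
  const intervals = new Map(); // interval start (seconds) -> list of [lat, lng]
  for (const [clock, address] of checkins) {
    const point = geocode(address);
    if (point === null) continue; // sample data: dropping an unresolved address is acceptable
    const seconds = toSeconds(clock);
    const start = seconds - (seconds % INTERVAL_SECONDS);
    if (!intervals.has(start)) intervals.set(start, []);
    intervals.get(start).push(point);
  }
  return intervals;
}

const heatMap = bucketCheckins([
  ['9:00:07 PM', '287 Hudson St New York NY 10013'],
  ['9:00:12 PM', '287 Hudson St New York NY 10013'],
  ['9:00:16 PM', '287 Hudson St New York NY 10013'],
  ['9:00:20 PM', 'unknown address'],
]);
```

The first two check-ins share the interval starting at 9:00:00 PM, the third falls into the next interval, and the unresolved address is skipped.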
93. Reactor Pattern – Key Points
● Single thread / single event loop
● EVERYTHING runs on it
● You MUST NOT block the event loop
● Many Implementations (partial list):
  – Node.js (JavaScript), EventMachine (Ruby), Twisted (Python)... and Vert.X
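The key points above can be made concrete with a toy event loop. This is a minimal sketch under stated assumptions: the `EventLoop` class and the `checkin` / `geocode` event names are hypothetical, and the code is not how Node.js or Vert.X implement their loops. It only shows that every handler runs on one thread, one at a time, so a slow handler delays everything queued behind it.

```javascript
// Hypothetical single-threaded event loop; not the Node.js or Vert.X implementation.
class EventLoop {
  constructor() {
    this.handlers = new Map(); // event type -> list of handler functions
    this.queue = [];           // pending events, processed in arrival order
  }

  // Registers a handler for an event type.
  on(type, handler) {
    if (!this.handlers.has(type)) this.handlers.set(type, []);
    this.handlers.get(type).push(handler);
  }

  // Enqueues an event; nothing runs until the loop picks it up.
  emit(type, payload) {
    this.queue.push({ type, payload });
  }

  // Runs every queued event on the calling thread, one handler at a time.
  run() {
    while (this.queue.length > 0) {
      const event = this.queue.shift();
      for (const handler of this.handlers.get(event.type) || []) {
        handler(event.payload, this);
      }
    }
  }
}

const loop = new EventLoop();
const log = [];
loop.on('checkin', (address, self) => {
  log.push('received ' + address);
  self.emit('geocode', address); // handlers may enqueue follow-up events
});
loop.on('geocode', (address) => log.push('geocoding ' + address));
loop.emit('checkin', 'A');
loop.emit('checkin', 'B');
loop.run();
```

Both check-ins are received before either is geocoded, because follow-up events wait their turn at the back of the queue.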
94. Reactor Pattern Problems
● Some work is naturally blocking:
  – Intensive data crunching
  – 3rd-party blocking APIs (e.g. JDBC)
● Pure reactor (e.g. Node.js) is not a good fit for this kind of work!
99. Node.js vs Vert.X
● Node.js
  – JavaScript only
  – Inherently single-threaded
  – Not much help with IPC
  – All code MUST be in the event loop
● Vert.X
  – Polyglot (JavaScript, Java, Ruby, Python...)
  – Leverages JVM multi-threading
  – Event Bus acts as the "nervous system"
  – Blocking work can be done off the event loop
100. Node.js vs Vert.X Benchmark
http://vertxproject.wordpress.com/2012/05/09/vert-x-vs-node-js-simple-http-benchmarks/
Test machine: AMD Phenom II X6 (6 core), 8GB RAM, Ubuntu 11.04
102. Heat-Map Server – Only 6 LOC !
var vertx = require('vertx');
var container = require('vertx/container');
var console = require('vertx/console');
var config = container.config;

vertx.createHttpServer().requestHandler(function(request) {
  request.dataHandler(function(buffer) {
    // Send checkin to Vert.X EventBus
    vertx.eventBus.send(config.address, {payload: buffer.toString()});
  });
  request.response.end();
}).listen(config.httpServerPort, config.httpServerHost);
console.log("HTTP CheckinsReactor started on port " + config.httpServerPort);
103. Architecture diagram (labels recovered from the slide)
● Components: Checkin HTTP Firehose, Checkin HTTP Reactor, Checkin Kafka Topic, Storm Heat-Map Topology, Geo Location Service, Database, Hotzones Kafka Topic, Search Server, Web App
● Flow labels: Publish Checkins, Consume Checkins, GET Geo Location, Persist Checkin Intervals, Publish Interval Key, Index Interval Locations, Consume Intervals Keys, Get Interval Locations, Search, Index, Push via WebSocket
106. When You go out to Salsa Club
● Good Music
● Crowded
107. More Conclusions..
● Storm – Great for real-time BigData processing. Complementary to Hadoop batch jobs.
● Kafka – Great messaging for logs/events data; it has served as a good “source” for a Storm spout.
● Vert.X – Worth trying out as an alternative reactor implementation.