Processing
“BIG-DATA”
In Real Time
Yanai Franchi, Tikal

1
2
Vacation to Barcelona

3
After a Long Travel Day

4
Going to a Salsa Club

5
Best Salsa Club NOW
● Good Music
● Crowded – Now!
6
Same Problem in “gogobot”

7
8
Let's Develop
“Gogobot Checkins Heat-Map”

gogobot checkin
Heat Map Service

9
Key Notes
● Collector Service – collects checkins as text addresses
  – We need to use a GeoLocation Service
● Upon each elapsed interval, the latest locations list is displayed as a Heat-Map in the GUI.
● Web-scale service – tens of thousands of checkins/second from all over the world (imaginary, but let's assume it for the exercise).
● Accuracy – sample data, NOT critical data.
  – Proportionately representative
  – Data volume is large enough to compensate for data loss.
10
Heat-Map Context
[Context diagram: the Checkins Heat-Map Service lives inside the Gogobot system. Gogobot micro-services send it text-address checkins; it calls Get-GeoCode(Address) on an external Geo Location Service and serves the last interval's locations back to the GUI.]
11
Plan A
Simulate Checkins with a File
[Diagram: a sample checkins file (Check-in #1, #2, #3, ...) is read as text addresses by a "Processing Checkins" component, which calls GET Geo Location on the Geo Location Service and persists the checkin intervals to a database.]
12
Tons of Addresses
Arriving Every Second

13
Architect - First Reaction...

14
Second Reaction...

15
Developer
First
Reaction

16
Second
Reaction

17
Problems?
● Tedious: we spend time configuring where to send messages, deploying workers, and deploying intermediate queues.
● Brittle: there's little fault-tolerance.
● Painful to scale: partitioning the running worker(s) is complicated.
18
What We Want?
● Horizontal scalability
● Fault-tolerance
● No intermediate message brokers!
● Higher level abstraction than message
passing
● “Just works”
● Guaranteed data processing (not in this
case)
19
Apache Storm

✔Horizontal scalability
✔Fault-tolerance
✔No intermediate message brokers!
✔Higher level abstraction than message
passing
✔“Just works”
✔Guaranteed data processing

20
Anatomy of Storm

21
What is Storm?
● CEP – an open-source, distributed realtime computation system.
  – Makes it easy to reliably process unbounded streams of tuples.
  – Does for realtime processing what Hadoop did for batch processing.
● Fast – on the order of 1M tuples/sec per node.
  – Scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.
22
Streams
[Diagram: Tuple → Tuple → Tuple → Tuple → ...]
Unbounded sequence of tuples
23
Spouts
[Diagram: a spout emitting a stream of tuples]
Sources of streams
24
Bolts
[Diagram: a bolt consuming input tuples and emitting new ones]
Processes input streams and produces new streams
25
Storm Topology
[Diagram: spouts and bolts wired together by streams of tuples]
Network of spouts and bolts
26
Guarantee for Processing
● Storm guarantees the full processing of a tuple by tracking its state.
● In case of failure, Storm can re-process it.
● Source tuples with fully “acked” trees are removed from the system (see the sketch below).
27
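The deck's bolts later extend BaseBasicBolt, which anchors and acks tuples automatically; the following is a minimal sketch of what that tracking looks like when done by hand with BaseRichBolt (the bolt and its transform() helper are illustrative, not part of the Heat-Map code):

// imports from backtype.storm.* omitted, as in the other slides
public class ReliableBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        try {
            // Anchor the emitted tuple to its input so Storm adds it to the tuple tree.
            collector.emit(input, new Values(transform(input)));
            collector.ack(input);   // this node of the tree is done
        } catch (Exception e) {
            collector.fail(input);  // the spout re-emits the source tuple
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("str"));
    }

    private Object transform(Tuple input) { return input.getString(0); } // hypothetical
}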
Tasks (Bolt/Spout Instance)

Spouts and bolts execute as
many tasks across the cluster

28
Stream Grouping

When a tuple is emitted, which task
(instance) does it go to?

29
Stream Grouping
● Shuffle grouping: pick a random task
● Fields grouping: consistent hashing on a subset of tuple fields
● All grouping: send to all tasks
● Global grouping: pick the task with the lowest id
30
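A minimal sketch of how these four groupings are declared on a topology; the deck's GeocodeLookupBolt is reused here purely as a stand-in, and the fields grouping hashes on the spout's "str" field:

// imports from backtype.storm.* omitted, as in the other slides
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("checkins", new CheckinsSpout());
builder.setBolt("any-task", new GeocodeLookupBolt())
       .shuffleGrouping("checkins");                     // random task
builder.setBolt("same-task-per-value", new GeocodeLookupBolt())
       .fieldsGrouping("checkins", new Fields("str"));   // consistent hash on field(s)
builder.setBolt("all-tasks", new GeocodeLookupBolt())
       .allGrouping("checkins");                         // every task
builder.setBolt("one-task", new GeocodeLookupBolt())
       .globalGrouping("checkins");                      // lowest-id task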
Tasks, Executors, Workers
[Diagram: a Worker Process is a JVM. It runs Executors (one thread each), and each Executor runs one or more Tasks, i.e. spout/bolt instances.]
31
[Diagram: two nodes, each with a Supervisor and a Worker Process; each Worker Process hosts Executors running Spout A, Bolt B, and Bolt C tasks.]
32
Storm Architecture
[Diagram: the Heat-Map topology is uploaded/rebalanced via Nimbus, the master node (similar to the Hadoop JobTracker). Nimbus coordinates the Supervisor nodes through ZooKeeper and is NOT critical for an already-running topology.]
33
Storm Architecture
[Diagram: a few ZooKeeper nodes sit between Nimbus and the Supervisors and are used for cluster coordination.]
34
Storm Architecture
[Diagram: the Supervisor nodes run the worker processes.]
35
Assembling Heatmap Topology

36
HeatMap Input/Output Tuples
● Input tuple: a timestamp and a text address:
  – (9:00:07 PM, “287 Hudson St New York NY 10013”)
● Output tuple: a time interval and the list of points for it:
  – (9:00:00 PM to 9:00:15 PM, List((40.719,-73.987),(40.726,-74.001),(40.719,-73.987)))
37
Heat Map Storm Topology
[Diagram of the tuple flow:
Checkins Spout → (9:01 PM @ 287 Hudson St) → Geocode Lookup Bolt → (9:01 PM, (40.736, -74.354)) → Heatmap Builder Bolt → upon elapsed interval → (9:00 PM – 9:15 PM, List((40.73, -74.34), (51.36, -83.33), (69.73, -34.24))) → Persistor Bolt]
38
Checkins Spout

public class CheckinsSpout extends BaseRichSpout {

    // We hold state here; Storm calls a spout from a single thread,
    // so there is no need for thread safety.
    private List<String> sampleLocations;
    private int nextEmitIndex;
    private SpoutOutputCollector outputCollector;

    @Override
    public void open(Map map, TopologyContext topologyContext,
                     SpoutOutputCollector spoutOutputCollector) {
        this.outputCollector = spoutOutputCollector;
        this.nextEmitIndex = 0;
        try {
            sampleLocations = IOUtils.readLines(
                ClassLoader.getSystemResourceAsStream("sample-locations.txt"));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // Called iteratively by Storm.
    @Override
    public void nextTuple() {
        String address = sampleLocations.get(nextEmitIndex);
        String checkin = new Date().getTime() + "@ADDRESS:" + address;
        outputCollector.emit(new Values(checkin));
        nextEmitIndex = (nextEmitIndex + 1) % sampleLocations.size();
    }

    // Declare the output fields.
    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("str"));
    }
}
39
Geocode Lookup Bolt

public class GeocodeLookupBolt extends BaseBasicBolt {

    private LocatorService locatorService;

    @Override
    public void prepare(Map stormConf, TopologyContext context) {
        locatorService = new GoogleLocatorService();
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector outputCollector) {
        String str = tuple.getStringByField("str");
        String[] parts = str.split("@");
        Long time = Long.valueOf(parts[0]);
        String address = parts[1];

        // Get the geocode and create a DTO.
        LocationDTO locationDTO = locatorService.getLocation(address);
        if (locationDTO != null)
            outputCollector.emit(new Values(time, locationDTO));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer fieldsDeclarer) {
        fieldsDeclarer.declare(new Fields("time", "location"));
    }
}
40
Tick Tuple – Repeating Mantra

41
Two Streams to Heat-Map Builder
[Diagram: checkin tuples (Checkin 1 ... Checkin 4, Checkin 5, Checkin 6) and periodic tick tuples both arrive at the HeatMapBuilder Bolt.]
On a tick tuple, we flush our Heat-Map.
42
Tick Tuple in Action

public class HeatMapBuilderBolt extends BaseBasicBolt {

    // Holds the latest intervals.
    private Map<String, List<LocationDTO>> heatmaps =
        new HashMap<String, List<LocationDTO>>();

    // Tick interval: ask Storm to send this bolt a tick tuple every 60 seconds.
    @Override
    public Map<String, Object> getComponentConfiguration() {
        Config conf = new Config();
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 60);
        return conf;
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector outputCollector) {
        if (isTickTuple(tuple)) {
            // Emit accumulated intervals (a sketch of this branch follows the slide)
        } else {
            // Add check-in info to the current interval in the Map
        }
    }

    private boolean isTickTuple(Tuple tuple) {
        return tuple.getSourceComponent().equals(Constants.SYSTEM_COMPONENT_ID)
            && tuple.getSourceStreamId().equals(Constants.SYSTEM_TICK_STREAM_ID);
    }

    // Fields the Persistor bolt reads downstream.
    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("time-interval", "city", "locationsList"));
    }
}
43
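The two elided branches might look roughly like this - a sketch, assuming the map key is "<interval>@<city>" and that LocationDTO carries the city (neither is spelled out in the deck); selectTimeInterval() is a hypothetical helper that rounds a timestamp down to its interval:

    // On a tick tuple: emit every accumulated interval and start fresh.
    private void emitHeatmap(BasicOutputCollector outputCollector) {
        for (Map.Entry<String, List<LocationDTO>> entry : heatmaps.entrySet()) {
            String[] parts = entry.getKey().split("@"); // interval, city
            outputCollector.emit(new Values(
                Long.valueOf(parts[0]), parts[1], entry.getValue()));
        }
        heatmaps.clear();
    }

    // On a regular tuple: append the location to its interval's list.
    private void addCheckin(Tuple tuple) {
        LocationDTO location = (LocationDTO) tuple.getValueByField("location");
        long interval = selectTimeInterval(tuple.getLongByField("time"));
        String key = interval + "@" + location.getCity();
        List<LocationDTO> locations = heatmaps.get(key);
        if (locations == null) {
            locations = new ArrayList<LocationDTO>();
            heatmaps.put(key, locations);
        }
        locations.add(location);
    }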
Persister Bolt

public class PersistorBolt extends BaseBasicBolt {

    private Jedis jedis = new Jedis("localhost"); // assumes a local Redis
    private ObjectMapper objectMapper = new ObjectMapper();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector outputCollector) {
        try {
            Long timeInterval = tuple.getLongByField("time-interval");
            String city = tuple.getStringByField("city");
            String locationsList = objectMapper.writeValueAsString(
                tuple.getValueByField("locationsList"));
            String dbKey = "checkins-" + timeInterval + "@" + city;

            // Persist in Redis for 24h.
            jedis.setex(dbKey, 3600 * 24, locationsList);

            // Publish the key on a Redis channel for debugging.
            jedis.publish("location-key", dbKey);
        } catch (JsonProcessingException e) {
            throw new RuntimeException(e);
        }
    }
}
44
Transforming the Tuples
[Diagram: the Checkins Spout reads text addresses from the sample checkins file and feeds the Geocode Lookup Bolt via shuffle grouping; that bolt gets the geo location from the Geo Location Service; the Heatmap Builder Bolt receives its output via fields grouping on "city" (group by city); finally the Persistor Bolt receives the intervals via shuffle grouping and writes them to the database.]
45
Heat Map Topology

public class LocalTopologyRunner {

    public static void main(String[] args) {
        TopologyBuilder builder = buildTopology();
        StormSubmitter.submitTopology(
            "local-heatmap", new Config(), builder.createTopology());
    }

    private static TopologyBuilder buildTopology() {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("checkins", new CheckinsSpout());
        builder.setBolt("geocode-lookup", new GeocodeLookupBolt())
               .shuffleGrouping("checkins");
        builder.setBolt("heatmap-builder", new HeatMapBuilderBolt())
               .fieldsGrouping("geocode-lookup", new Fields("city"));
        builder.setBolt("persistor", new PersistorBolt())
               .shuffleGrouping("heatmap-builder");
        return builder;
    }
}
46
It's NOT Scaled

47
48
Scaling the Topology

public class LocalTopologyRunner {

    public static void main(String[] args) {
        TopologyBuilder builder = buildTopology();
        Config conf = new Config();
        // Set the number of worker processes (e.g. 2 locally, 20 on a real cluster).
        conf.setNumWorkers(2);
        StormSubmitter.submitTopology(
            "local-heatmap", conf, builder.createTopology());
    }

    private static TopologyBuilder buildTopology() {
        TopologyBuilder builder = new TopologyBuilder();
        // The extra argument is the parallelism hint (initial executor count).
        builder.setSpout("checkins", new CheckinsSpout(), 4);
        // Increase tasks beyond executors to leave room for future scaling.
        builder.setBolt("geocode-lookup", new GeocodeLookupBolt(), 8)
               .shuffleGrouping("checkins").setNumTasks(64);
        builder.setBolt("heatmap-builder", new HeatMapBuilderBolt(), 4)
               .fieldsGrouping("geocode-lookup", new Fields("city"));
        builder.setBolt("persistor", new PersistorBolt(), 2)
               .shuffleGrouping("heatmap-builder").setNumTasks(4);
        return builder;
    }
}
49
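Because the task counts exceed the executor counts, a running topology can later be spread over more workers and executors without downtime. With Storm's rebalance command that looks roughly like this (the numbers are illustrative):

storm rebalance local-heatmap -n 20 -e geocode-lookup=32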
Demo

50
Recap – Plan A
[Diagram: the sample checkins file (Check-in #1 ... #9) is read as text addresses by the Storm Heat-Map topology, which gets the geo location from the Geo Location Service and persists the checkin intervals to the database.]
51
We have
something working

52
Add Kafka Messaging

53
Plan B – Kafka Spout & Bolt to HeatMap
[Diagram: checkins are published to a Checkin Kafka topic; a Kafka Checkins Spout reads the text addresses and feeds the Geocode Lookup Bolt (which calls the Geo Location Service), then the Heatmap Builder Bolt and the Persistor Bolt (which writes to the database); finally a Kafka Locations Bolt publishes the results to a Locations topic.]
54
55
They are all good
But not for all use-cases

56
Kafka
A little introduction

57
58
Pub-Sub Messaging System

59
60
61
62
63
Doesn't Fear the File System

64
65
66
67
Topics
● Logical collections of partitions (the physical files).
● A broker contains some of the partitions of a topic.
68
A partition is Consumed by
Exactly One Group's Consumer

69
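To make that rule concrete: Kafka assigns each partition of a topic to exactly one consumer within a consumer group. A minimal sketch using the modern kafka-clients consumer API (the deck predates this API; the topic name, group id, and broker address are assumptions):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CheckinConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "heatmap-builders"); // members share the partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("checkins"));
        // Each partition of "checkins" goes to exactly one consumer in the
        // "heatmap-builders" group; adding a consumer triggers a rebalance.
        while (true) {
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(100))) {
                System.out.println(record.partition() + ": " + record.value());
            }
        }
    }
}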
Distributed &
Fault-Tolerant
70
[Diagram sequence, slides 71–83: two producers and two consumers talk to a broker cluster coordinated by ZooKeeper. A fourth broker joins and the partitions rebalance across it; later a broker drops out of the cluster, and finally one consumer fails, leaving the remaining consumer to take over its partitions. The cluster keeps serving throughout.]
Performance Benchmark
1 Broker
1 Producer
1 Consumer
84
85
86
Add Kafka to our Topology

public class LocalTopologyRunner {
    ...
    private static TopologyBuilder buildTopology() {
        ...
        // Kafka spout: reads checkins from the Kafka topic.
        builder.setSpout("checkins", new KafkaSpout(kafkaConfig));
        ...
        // Kafka bolt: publishes the persisted intervals to a Kafka topic.
        builder.setBolt("kafkaProducer", new KafkaOutputBolt(
                "localhost:9092",
                "kafka.serializer.StringEncoder",
                "locations-topic"))
            .shuffleGrouping("persistor");
        return builder;
    }
}
87
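The kafkaConfig above is elided on the slide. With the storm-kafka module it would be built roughly like this (a sketch; the ZooKeeper address, topic name, and ids are assumptions):

SpoutConfig kafkaConfig = new SpoutConfig(
    new ZkHosts("localhost:2181"),  // ZooKeeper used by the Kafka brokers
    "checkins-topic",               // topic to read checkins from
    "/kafka-spout",                 // ZK root where the spout stores its offsets
    "checkins-spout-id");           // id of this consumer
kafkaConfig.scheme = new SchemeAsMultiScheme(new StringScheme());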
Plan C – Add Reactor
[Diagram: a Checkin HTTP Reactor receives text-address checkins and publishes them to the Checkin Kafka topic. The Storm Heat-Map topology consumes the checkins, gets geo locations from the Geo Location Service, persists the checkin intervals to the database, and publishes the interval keys to a Locations Kafka topic, from which the interval locations are indexed into a search server.]
88
Why Reactor?

89
C10K
Problem
90
2008:
Thread Per Request/Response

91
Reactor Pattern Paradigm
[Diagram: the application registers handlers on the event loop; incoming events then trigger those handlers.]
92
Reactor Pattern – Key Points
● Single thread / single event loop
● EVERYTHING runs on it
● You MUST NOT block the event loop
● Many implementations (partial list):
  – Node.js (JavaScript), EventMachine (Ruby), Twisted (Python)... and Vert.X
93
Reactor Pattern Problems
● Some work is naturally blocking:
  – Intensive data crunching
  – 3rd-party blocking APIs (e.g. JDBC)
● A pure reactor (e.g. Node.js) is not a good fit for this kind of work!
94
95
Vert.X Architecture
[Diagram: Vert.X instances communicating over the Event Bus]
96
Vert.X Goodies
● Growing module repository
● TCP/SSL servers/clients
● HTTP/HTTPS servers/clients
● Web server
● WebSockets support
● SockJS support
● Persistors (Mongo, JDBC, ...)
● Work queue
● Timers
● Authentication manager
● Buffers
● Streams and Pumps
● Session manager
● Routing
● Socket.IO
● Asynchronous File I/O
97
Node.js vs Vert.X

98
Node.js vs Vert.X
● Node.js
  – JavaScript only
  – Inherently single-threaded
  – Not much help with IPC
  – All code MUST be on the event loop
● Vert.X
  – Polyglot (JavaScript, Java, Ruby, Python...)
  – Leverages JVM multi-threading
  – Event Bus as its “nervous system” for IPC
  – Blocking work can be done off the event loop (see the sketch below)
99
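For comparison, here is how blocking work (e.g. a JDBC call) can be pushed off the event loop in the modern Vert.x 4 Java API; a sketch inside a verticle, with blockingJdbcLookup() as a hypothetical stand-in for a blocking call (the deck itself used Vert.x 2, where worker verticles play this role):

import io.vertx.core.AbstractVerticle;

public class BlockingExample extends AbstractVerticle {
    @Override
    public void start() {
        vertx.<String>executeBlocking(promise -> {
            // Runs on a worker thread: safe to block (JDBC, heavy crunching...).
            promise.complete(blockingJdbcLookup());
        }, res -> {
            // Back on the event loop with the result.
            System.out.println("got " + res.result());
        });
    }

    private String blockingJdbcLookup() { return "..."; } // hypothetical
}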
Node.js vs Vert.X Benchmark

http://vertxproject.wordpress.com/2012/05/09/vert-x-vs-node-js-simple-http-benchmarks/

AMD Phenom II X6 (6 core), 8GB
RAM, Ubuntu 11.04

100
HeatMap Reactor Architecture
[Diagram: inside a Vert.X instance, an HTTP Server Verticle sends each checkin over the Event Bus to the Kafka module, which automatically forwards EventBus messages to a Kafka topic; the Storm topology consumes that topic.]
101
Heat-Map Server – Only 6 LOC!

var vertx = require('vertx');
var container = require('vertx/container');
var console = require('vertx/console');
var config = container.config;

// Send each checkin to the Vert.X EventBus.
vertx.createHttpServer().requestHandler(function(request) {
  request.dataHandler(function(buffer) {
    vertx.eventBus.send(config.address, {payload: buffer.toString()});
  });
  request.response.end();
}).listen(config.httpServerPort, config.httpServerHost);

console.log("HTTP CheckinsReactor started on port " + config.httpServerPort);

102
[Diagram, full Plan C architecture: a Checkin HTTP Firehose publishes checkins to the Checkin HTTP Reactor, which publishes them to the Checkin Kafka topic. The Storm Heat-Map topology consumes the checkins, gets geo locations from the Geo Location Service, persists the checkin intervals to the database, and publishes the interval keys to the Hotzones Kafka topic. A web app consumes the interval keys, gets the interval locations from the search server (where they are indexed), and pushes them to browsers via WebSocket.]
103
Demo

104
Summary

105
When You Go Out to a Salsa Club
● Good Music
● Crowded
106
More Conclusions...
● Storm – great for real-time Big-Data processing; complementary to Hadoop batch jobs.
● Kafka – great messaging for logs/events data; serves well as a “source” for a Storm spout.
● Vert.X – worth trying out as a reactor alternative.
107
Thanks

108
