ARCHITECTURE AND INFRASTRUCTURE
http://linkedin.com/in/alexvsilva
@thealexsilva
DESIGNING A REACTIVE
REAL-TIME DATA PLATFORM
Who am I?
- DATA Platform Architect
at Pluralsight
- Rackspace
- WDW
TECHNOLOGY
LEARNING
PLATFORM
What should I learn?
Where should I start?
Who can help me?
What did I learn?
• Online technology
learning platform
• Subscription model
• Data-driven
PLURALSIGHT
IN THE BEGINNING…
Development TIME became the bottleneck
Designing a reactive real-time data platform: Architecture and Infrastructure Challenges
HOW DO WE FIX THAT?
REQUIREMENTS
DISCOVERABLE
OPEN
EXTENSIBLE
Flexible
Contract
ADAPTABLE
ABSTRACTION
Reactive principles
RESPONSIVE
ELASTIC
RESILIENT
MESSAGE DRIVEN
RESPONSIVE
ELASTIC
asynchronous
share nothing
location transparency
divide and conquer
RESILIENT
MORE THAN “JUST” FAULT TOLERANCE
Software systems are complex systems.
“Complex systems run in degraded mode.”
“Complex systems run as broken systems.”
Richard Cook
COMPLEX OR COMPLICATED?
MESSAGE DRIVEN
asynchronous
FAILURES AS MESSAGES
location transparency
ISOLATION
Messages and events
SAVE
THIS!
SOMEBODY
LOGGED IN!
Events ARE…
Facts
Past
Topic
Messages ARE…
Addressable
Specific
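The distinction above can be sketched in a few lines of Python. This is only an illustration: `Bus`, `UserLoggedIn`, and `SaveDocument` are hypothetical names, not part of Hydra or Akka. An event is published to a topic as a fact about the past; a message is sent to one specific, addressable recipient.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class UserLoggedIn:
    """An event: an immutable fact about something that already happened."""
    user_id: str


@dataclass(frozen=True)
class SaveDocument:
    """A message: a request addressed to one specific recipient."""
    recipient: str
    document: str


class Bus:
    """Hypothetical in-memory bus, used only for this illustration."""

    def __init__(self) -> None:
        self.subscribers: Dict[str, List[Callable]] = {}
        self.mailboxes: Dict[str, List[SaveDocument]] = {}

    def subscribe(self, topic: str, handler: Callable) -> None:
        self.subscribers.setdefault(topic, []).append(handler)

    def publish(self, topic: str, event: object) -> None:
        # Events go to a topic; the publisher does not know who listens.
        for handler in self.subscribers.get(topic, []):
            handler(event)

    def send(self, message: SaveDocument) -> None:
        # Messages go to exactly one addressed mailbox.
        self.mailboxes.setdefault(message.recipient, []).append(message)
```

Any number of subscribers can react to "SOMEBODY LOGGED IN!", while "SAVE THIS!" lands in a single mailbox.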
Data platform at pluralsight
REAL-TIME DATA REPLICATION PLATFORM
HYDRA
INGEST
Ingestion + Replication
PORTAL
Schema Manager
“The Log”
HYDRA
STREAMS
Streaming + Replication
AKKA
Distributed
Fault-Tolerant
Asynchronous
Highly Concurrent
Akka challenges
Remoting
Type Safety
Debugging
Release Cycles
WHAT’S AN ACTOR?
behavior
state
MAILBOX
CHILD ACTORS
SUPERVISOR STRATEGY
ACTOR Refs are reactive!
ACTOR Refs Vs. ACTOR PATHS
akka.tcp://Hydra@localhost:9001/user/service/ingestor_registry
Protocol
Address
ActorSystem
Path
Actor paths enable location transparency
akka.tcp://Hydra/user/service/ingestor_registry
Protocol
ActorSystem
Path
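To make the anatomy of an actor path concrete, here is a minimal sketch that splits the two example paths above into their labeled parts. The regular expression and the `parse_actor_path` helper are assumptions written for this illustration; they are not Akka's own parser.

```python
import re
from typing import NamedTuple, Optional

# protocol://system[@host:port]/path
ACTOR_PATH = re.compile(
    r"^(?P<protocol>[\w.]+)://(?P<system>[^@/]+)"
    r"(?:@(?P<host>[^:/]+):(?P<port>\d+))?"
    r"(?P<path>/.*)$"
)


class ActorPath(NamedTuple):
    protocol: str
    system: str
    host: Optional[str]   # None when the path carries no address
    port: Optional[int]
    path: str


def parse_actor_path(uri: str) -> ActorPath:
    """Split an actor path string into protocol, system, address and path."""
    match = ACTOR_PATH.match(uri)
    if match is None:
        raise ValueError(f"not an actor path: {uri!r}")
    port = match["port"]
    return ActorPath(
        match["protocol"],
        match["system"],
        match["host"],
        int(port) if port else None,
        match["path"],
    )
```

Both forms name the same logical actor (`/user/service/ingestor_registry`); only the address part differs, which is what lets callers ignore where the actor actually runs.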
Deploying remote actors
akka {
  actor {
    provider = remote
    deployment {
      /web_analytics_ingestor {
        remote = "akka.tcp://Hydra-1@127.0.0.1:2553"
      }
    }
  }
}
ACTORS are ELASTIC
akka {
  actor {
    deployment {
      /services-manager/handler_registry/segment_handler {
        router = round-robin-pool
        optimal-size-exploring-resizer {
          enabled = on
          action-interval = 5s
          downsize-after-underutilized-for = 2h
        }
      }
      /services-manager/kafka_producer {
        router = round-robin-pool
        resizer {
          lower-bound = 5
          upper-bound = 50
          messages-per-resize = 500
        }
      }
    }
  }
}
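As a rough mental model of what a resizable round-robin pool does, the sketch below routes messages to routees in turn and clamps the pool size between a lower and an upper bound. `RoundRobinPool` here is a hypothetical class for illustration only; Akka's actual resizers decide the target size from mailbox pressure and utilization, which this sketch does not model.

```python
from typing import List


class RoundRobinPool:
    """Toy pool: each routee is modeled as a plain list acting as its mailbox."""

    def __init__(self, lower_bound: int, upper_bound: int) -> None:
        self.lower_bound = lower_bound
        self.upper_bound = upper_bound
        self.routees: List[List[str]] = [[] for _ in range(lower_bound)]
        self._next = 0

    def resize(self, wanted: int) -> int:
        """Grow or shrink the pool, never leaving [lower_bound, upper_bound]."""
        size = max(self.lower_bound, min(self.upper_bound, wanted))
        while len(self.routees) < size:
            self.routees.append([])
        del self.routees[size:]  # removed routees take their mailboxes with them
        self._next %= size
        return size

    def route(self, message: str) -> int:
        """Deliver to the next routee in turn and return its index."""
        index = self._next
        self.routees[index].append(message)
        self._next = (index + 1) % len(self.routees)
        return index
```

With `lower_bound=5` and `upper_bound=50`, as in the `kafka_producer` entry above, a request for 500 routees yields 50 and a request for 1 yields 5.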
Sending messages on akka
VS
Hydra ingest
Data capture at scale
Mitigate message loss
Data format is secondary
Automated replication
Schema driven
ENFORCE METADATA AT INGESTION TIME
DATA PIPELINES
DATA QUALITY
DATA REPLICATION
DATA DISCOVERY
Metadata is a first class citizen
Why avro?
Schema evolution
Smaller data footprint
Json friendly
Strong community support
Existing tools
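Schema evolution is the least obvious item on this list, so here is a small sketch of the record-resolution rule it relies on: a reader fills fields missing from older data with its own defaults and ignores fields it does not know. This is plain Python written for illustration, not the Avro library; `REQUIRED` and `resolve` are hypothetical names.

```python
import copy
from typing import Any, Dict

REQUIRED = object()  # marker for reader fields that have no default


def resolve(record: Dict[str, Any], reader_fields: Dict[str, Any]) -> Dict[str, Any]:
    """Read `record` through the reader's schema (field name -> default)."""
    resolved: Dict[str, Any] = {}
    for name, default in reader_fields.items():
        if name in record:
            resolved[name] = record[name]
        elif default is not REQUIRED:
            # Field added after the data was written: use the reader's default.
            resolved[name] = copy.deepcopy(default)
        else:
            raise ValueError(f"missing field without default: {name}")
    # Fields the reader does not declare are silently dropped.
    return resolved
```

A reader that added a `cars` field with default `[]` can still read old records, and an old reader can still read new records that carry `cars`.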
INGESTION PROTOCOL OVERVIEW
BRINGING REACTIVE PRINCIPLES TO THE MIX
HYDRA REQUEST
PAYLOAD + METADATA = HYDRA REQUEST
PAYLOAD
Anything, really
{
  "name": "John",
  "age": 30,
  "cars": ["Ford", "BMW"]
}
METADATA
kafka-topic = PersonTopic
validation = Strict
avro-schema = Person.avsc
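One way to picture a request is as an opaque payload paired with a metadata map. The `HydraRequest` class below is a hypothetical sketch, not Hydra's actual type; only the three metadata keys come from the slide.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class HydraRequest:
    """Illustrative shape only: payload is opaque, metadata drives the pipeline."""
    payload: Any
    metadata: Dict[str, str] = field(default_factory=dict)

    def meta(self, key: str, default: Optional[str] = None) -> Optional[str]:
        return self.metadata.get(key, default)


request = HydraRequest(
    payload={"name": "John", "age": 30, "cars": ["Ford", "BMW"]},
    metadata={
        "kafka-topic": "PersonTopic",
        "validation": "Strict",
        "avro-schema": "Person.avsc",
    },
)
```

Because routing and validation read only the metadata, the payload can stay "anything, really".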
HYDRA REPLICATION PROTOCOL
HYDRA REQUEST
INGESTORS
Publish
Akka Actors (remote)
Transport
Transports
Akka Actors (remote)
Kafka
Postgres
Elastic Search
Inspect metadata
and decide
Publish
INGESTORS
Join
STOP
Validate
Ingest
Valid
Invalid
Ignore
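The phases named above (Publish, Join or Ignore, Validate, Valid or Invalid, Ingest) can be sketched as a synchronous loop. In the real system these are asynchronous messages between actors; the `Ingestor` class, the `run_protocol` function, and the `postgres-table` key in the usage note are assumptions made for this illustration.

```python
from typing import Any, Callable, Dict, List

Request = Dict[str, Any]  # {"payload": ..., "metadata": {...}}


class Ingestor:
    def __init__(self, name: str,
                 accepts: Callable[[Request], bool],
                 is_valid: Callable[[Request], bool]) -> None:
        self.name = name
        self.accepts = accepts
        self.is_valid = is_valid
        self.ingested: List[Request] = []  # stands in for a transport

    def on_publish(self, request: Request) -> str:
        # Inspect metadata and decide whether to take part.
        return "Join" if self.accepts(request) else "Ignore"

    def on_validate(self, request: Request) -> str:
        return "Valid" if self.is_valid(request) else "Invalid"

    def on_ingest(self, request: Request) -> None:
        self.ingested.append(request)


def run_protocol(request: Request, ingestors: List[Ingestor]) -> Dict[str, str]:
    """Walk each ingestor through the phases; stop at the first negative reply."""
    outcomes: Dict[str, str] = {}
    for ingestor in ingestors:
        if ingestor.on_publish(request) == "Ignore":
            outcomes[ingestor.name] = "Ignore"
        elif ingestor.on_validate(request) == "Invalid":
            outcomes[ingestor.name] = "Invalid"
        else:
            ingestor.on_ingest(request)
            outcomes[ingestor.name] = "Ingested"
    return outcomes
```

For example, an ingestor that joins only when `kafka-topic` is present would ingest the Person request, while one keyed on a hypothetical `postgres-table` entry would ignore it.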
WHY DIFFERENT PHASES?
Divide and conquer
isolation
Small asynchronous tasks
recovery
HYDRA Message delivery guarantees
ALSO METADATA DRIVEN
AT-LEAST-ONCE SEMANTICS
AKKA PERSISTENT ACTOR
hydra-delivery-strategy
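At-least-once delivery boils down to resending until an acknowledgement arrives, which means the receiver may see duplicates. The sketch below shows that idea only; it does not use Akka persistence, and `FlakyChannel` and `deliver_at_least_once` are hypothetical names.

```python
from typing import Callable, List, Tuple

Message = Tuple[str, str]  # (message id, body)


def deliver_at_least_once(message: Message,
                          send: Callable[[Message], bool],
                          max_attempts: int = 5) -> int:
    """Resend until `send` reports an ack; return the number of attempts."""
    for attempt in range(1, max_attempts + 1):
        if send(message):
            return attempt
    raise RuntimeError(f"no ack after {max_attempts} attempts")


class FlakyChannel:
    """Delivers every message but loses the first `lost_acks` acknowledgements."""

    def __init__(self, lost_acks: int) -> None:
        self.lost_acks = lost_acks
        self.delivered: List[Message] = []

    def send(self, message: Message) -> bool:
        self.delivered.append(message)  # the receiver did get it
        if self.lost_acks > 0:
            self.lost_acks -= 1
            return False                # but the sender never hears back
        return True
```

Since the same message can arrive more than once, consumers typically deduplicate on a message id or make their writes idempotent.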
kafka
A messaging system based on distributed log semantics
Scalable
Fault tolerant
Stateful
Strong ordering
High concurrency
Topic: User
BROKER (User, 0)
BROKER (User, 0)
BROKER (User, 0)
READS/WRITES FROM/TO
Leader only
REPLICATION PROTOCOL
Replication is about RESILIENCY
BROKER BROKER BROKER BROKER
Looks like A GLOBALLY ORDERED QUEUE
BROKER
APPLICATION
APPLICATION
CONSUMER
APPLICATION
THE LOG is a linear structure
Old New
Messages are added here
Consumers have a position
Only sequential access
Read to offset and SCAN
Old New
Consumer 1
Consumer 2
MESSAGES CAN BE REPLAYED
FOR AS LONG AS THEY EXIST IN THE LOG
Old New
Consumer 1
Consumer 2
A DISTRIBUTED REPLICATION PROTOCOL
Rewind and Replay
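The log model described on these slides fits in a few lines: messages are appended at the new end, each consumer keeps its own position, and rewinding is just moving that position back. This is an in-memory sketch with hypothetical `Log` and `Consumer` classes, not the Kafka client API.

```python
from typing import Any, List


class Log:
    """Append-only sequence; an offset is simply an index into it."""

    def __init__(self) -> None:
        self._entries: List[Any] = []

    def append(self, message: Any) -> int:
        self._entries.append(message)
        return len(self._entries) - 1  # offset of the new message

    def read_from(self, offset: int) -> List[Any]:
        return self._entries[offset:]


class Consumer:
    """Each consumer tracks its own position; the log itself is unaffected."""

    def __init__(self, log: Log) -> None:
        self.log = log
        self.position = 0

    def poll(self) -> List[Any]:
        batch = self.log.read_from(self.position)
        self.position += len(batch)
        return batch

    def seek(self, offset: int) -> None:
        # Rewind (or skip ahead) by moving the position.
        self.position = offset
```

Two consumers can sit at different offsets of the same log, and either one can seek back to 0 and replay everything still retained.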
Hydra STREAMS
STREAMING FEDERATION LAYER
STREAM PROCESSING
Continuously updating datasets
Max(viewed_time) from
clip_views
where location=‘CA’
over 1 day window
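A continuously updating version of the query above can be sketched as an aggregate that is adjusted on every incoming record instead of being recomputed. The sketch assumes tumbling one-day windows and epoch-second timestamps; the `DailyMaxViewedTime` class and its field names are hypothetical.

```python
from typing import Dict

DAY_SECONDS = 24 * 60 * 60


class DailyMaxViewedTime:
    """Running max(viewed_time) per one-day window for a single location."""

    def __init__(self, location: str) -> None:
        self.location = location
        self.max_by_day: Dict[int, float] = {}  # day index -> max viewed_time

    def update(self, timestamp: int, location: str, viewed_time: float) -> None:
        if location != self.location:
            return  # the WHERE clause
        day = timestamp // DAY_SECONDS  # which one-day window the record falls in
        current = self.max_by_day.get(day)
        if current is None or viewed_time > current:
            self.max_by_day[day] = viewed_time
```

Each call to `update` plays the role of one record arriving on the `clip_views` stream; `max_by_day` is the always-current result set.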
Similar features as a database
JOIN
AGGREGATE
FILTER
VIEW
Streaming
platforms
Why spark?
Support for many different data formats
Structured streaming
Failover and lifecycle management
Medium latency
Unified api
EVENT STREAM / LOG
MATERIALIZED VIEWS / CACHE
HADOOP
ETL
SERVICE
TRANSF
Writes to
Replicates to
• Reproducible
• Stays in sync
Why kafka streams?
application that can run anywhere
Medium data volumes
Kafka specific
Low latency
Basic tasks
IT WORKS FOR MICROSERVICES TOO
HYDRA
Sends
BROKER
stores
(at a minimum)
INGESTION
Customer
HYDRA
STREAM DISPATCH
{ }
/dsls
submits
POSTs
Invoices Returns
joins/normalizes
streams
Hydra SPARK
What is it?
Abstraction layer on top of SPARK datasets
Models data flows
Sources and operations
Based on a custom dsl
Api-driven
The “DSL” abstraction
Example
WE ARE ON GITHUB!
github.com/pluralsight/hydra
github.com/pluralsight/hydra-spark
Thank You!
