Bootstrapping Microservices
with Kafka, Akka and Spark
http://linkedin.com/in/alexvsilva
@thealexsilva
ALEX SILVA
Who am I?
- Data Platform Architect
at Pluralsight
- Rackspace
- WDW
TECHNOLOGY
LEARNING
PLATFORM
What should I learn?
Where should I start?
Who can help me?
What did I learn?
• Online technology
learning platform
• Subscription model
• Data-driven
PLURALSIGHT
What are we covering today?
Microservices
Relational architecture
Commit logs
Data ingestion
Stream processing
Putting it all together
MICROSERVICES
MONOLITHIC APPS
[Diagram: a single deployment boundary containing the Order, Auth, Returns, Inventory, Shopping Cart and Fulfillment services]
[Diagram: the same services (Shopping Cart, Auth, Order, Returns, Inventory, Fulfillment), each with its own deployment boundary]
MICROSERVICES
Independence
[Diagram: services (Customer, Printer, Invoices, Job, Returns) and the data they share: customer invoices, jobs]
Most services need to share data.
What data do we share?
How do we do it?
[Diagram: overlapping datasets shared across services: Invoices, Customer, Job, Returns, Printer]
Encapsulation and loose coupling
“Sliceable”, domain-specific datasets
The service data mismatch
DATA WILL DIVERGE
OVER TIME
[Diagram: copies of the Invoices, Customer, Job, Returns and Printer data diverging across services]
Is there a better approach?
DATABASES
TRANSACTIONS
ACID is old school
What consistency do you really need, and when?
ACID 2.0
Associative
Commutative
Idempotent
Distributed
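A minimal Scala sketch of what these properties buy you (the type and merge rule here are illustrative, not from the talk): if merging replica state is associative, commutative and idempotent, updates can arrive in any order, any number of times, and every replica still converges.

```scala
// Illustrative ACID 2.0 merge: keep the max quantity seen per product.
// The merge is associative, commutative and idempotent, so replicas
// converge no matter how updates are ordered or duplicated.
case class WishlistQty(entries: Map[String, Int]) {
  def merge(other: WishlistQty): WishlistQty =
    WishlistQty((entries.keySet ++ other.entries.keySet).map { k =>
      k -> math.max(entries.getOrElse(k, 0), other.entries.getOrElse(k, 0))
    }.toMap)
}

val a = WishlistQty(Map("p123" -> 1))
val b = WishlistQty(Map("p123" -> 3, "p456" -> 2))

val ab = a.merge(b)     // merge in one order...
val ba = b.merge(a)     // ...and the other: same result (commutative)
val dedup = ab.merge(b) // re-applying b changes nothing (idempotent)
```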
INDEXING
Indexes are awesome and we need them: they make lookups fast!
“Why do you want to scan all the data if you know what you want? That’s dumb.”
- Dustin Vannoy
REPLICATION
LEADER FOLLOWER
REPLICATE
Failover + Resiliency
MUTATION vs. facts
UPDATE wishlist SET qty=3
WHERE user_id=121 AND product_id=123
(a state mutation)

“At 2:39pm, user 121 updated his wish list,
changing the quantity of product 123 from 1 to 3.”
(a fact)
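The slide’s example can be sketched in Scala (event-sourcing style; the types and field names are illustrative): append the immutable fact, and derive the mutable state as a fold over the facts.

```scala
// A fact: an immutable, append-only record of what happened.
case class QtyChanged(userId: Int, productId: Int, from: Int, to: Int, at: String)

val facts = List(
  QtyChanged(121, 123, from = 0, to = 1, at = "2:31pm"),
  QtyChanged(121, 123, from = 1, to = 3, at = "2:39pm") // the UPDATE above, kept as a fact
)

// The mutable-state view is derived: latest quantity per (user, product).
val current: Map[(Int, Int), Int] =
  facts.foldLeft(Map.empty[(Int, Int), Int]) { (state, f) =>
    state.updated((f.userId, f.productId), f.to)
  }
```

Because the facts are never destroyed, the state can always be rebuilt, audited, or re-derived differently later.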
VIEWS
virtual and materialized
Is there a better way?
[Diagram: the User service writes to a USER COMMIT LOG, which replicates to the Shopping Cart, Catalog, Fulfillment and Returns services]
WHAT IF…
SEPARATE READS FROM WRITES
KAFKA
A messaging system based on distributed log semantics
Scalable
Fault tolerant
Stateful
Strong ordering
High concurrency
[Diagram: partition (User, 0) of topic User replicated across three brokers; reads and writes go to the leader only]
REPLICATION PROTOCOL
Replication is about RESILIENCY
[Diagram: four brokers running the replication protocol]
Looks like a GLOBALLY ORDERED QUEUE
[Diagram: producer and consumer applications attached to a broker as one ordered queue]
THE LOG is a linear structure
[Diagram: the log runs from old to new; messages are appended at the new end, and each consumer keeps its own position]
Only sequential access: seek to an offset and scan.
MESSAGES CAN BE REPLAYED
FOR AS LONG AS THEY EXIST IN THE LOG
[Diagram: two consumers at different positions, each able to replay from its own offset]
A DISTRIBUTED REPLICATION PROTOCOL
Rewind and Replay
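These log semantics fit in a few lines of plain Scala (the message values are made up): the log is append-only, a consumer is just an offset, and replay is rereading from an earlier offset.

```scala
// An append-only log, oldest first.
val eventLog = Vector("m1", "m2", "m3", "m4", "m5")

// Sequential access only: read `count` messages starting at an offset.
def read(fromOffset: Int, count: Int): Seq[String] =
  eventLog.slice(fromOffset, fromOffset + count)

// Two consumers at independent positions.
var consumer1Offset = 4 // near the new end
var consumer2Offset = 1 // lagging behind

val c1Batch = read(consumer1Offset, 2) // only one message left to scan
val c2Batch = read(consumer2Offset, 2)

// Rewind and replay: reset the offset and scan the same messages again.
consumer2Offset = 0
val replayed = read(consumer2Offset, 3)
```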
LOG CLEANUP POLICY: delete
[Diagram: offsets 1 through 12, old to new, being scanned]
After log.retention.ms or log.retention.bytes is exceeded,
messages are dropped from the log.
LOG CLEANUP POLICY: compact
[Diagram: a compacted log with a tail (offsets 1, 8, 12, 13, 15) and a head (offsets 16, 19, 21, 23, 24, 25, 26); the cleaner point separates them, and the delete retention point is governed by delete.retention.ms]
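Compaction semantics in a pure-Scala sketch (keys and values are illustrative): only the latest record per key survives; history is truncated, but every key keeps its newest value.

```scala
// (key, value) records, oldest first, as they would sit in a partition.
val records = Vector(
  ("user-1", "v1"), ("user-2", "v1"), ("user-1", "v2"),
  ("user-3", "v1"), ("user-2", "v2")
)

// Fold old -> new, so a later record for a key overwrites the earlier one.
val compacted: Map[String, String] =
  records.foldLeft(Map.empty[String, String]) { case (acc, (k, v)) =>
    acc.updated(k, v)
  }
```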
STREAM PROCESSING
Continuously updating datasets
max(viewed_time) from clip_views
where location = 'CA'
over a 1-day window
Similar features to a database:
JOIN, AGGREGATE, FILTER, VIEW
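The query above can be mimicked in plain Scala over a finite batch (the records are made up; a real streaming engine would recompute this continuously as new events arrive):

```scala
case class ClipView(location: String, viewedTime: Int, day: String)

val clipViews = List(
  ClipView("CA", 120, "2017-06-01"),
  ClipView("CA", 300, "2017-06-01"),
  ClipView("NY", 500, "2017-06-01"), // filtered out: wrong location
  ClipView("CA", 90,  "2017-06-02")
)

// FILTER on location, then AGGREGATE max(viewed_time) per 1-day window.
val maxViewedByDay: Map[String, Int] =
  clipViews
    .filter(_.location == "CA")
    .groupBy(_.day)
    .map { case (day, views) => day -> views.map(_.viewedTime).max }
```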
Streaming
platforms
Why Spark?
Support for many different data formats
Structured streaming
Failover and lifecycle management
Medium latency
Unified API
[Diagram: services write to an event stream (the log), which replicates to materialized views/caches, Hadoop, and ETL/transformation jobs]
• Reproducible
• Stays in sync
How do we do it?
Separate data capture from replication
REAL-TIME DATA REPLICATION PLATFORM
Hydra ingest
Data capture at scale
[Diagram: a Hydra ingest request flows through ingestors and transports under the ingestion replication protocol]
What about metadata?
Always capture metadata at ingestion time
Automate data replication
Automate data pipelines
Automate data discovery
Data Metadata
Make more kinds of datasets:
1. readily available
2. easier to use for the entire organization
Message format: AVRO
Why Avro?
Schema evolution
Smaller data footprint
JSON friendly
Strong community support
Existing tools
Hydra SPARK
What is it?
Abstraction layer on top of datasets
Models data flows
Sources and operations
Based on a custom DSL
API-driven
The “DSL” abstraction
What is a Hydra DSL?
Example
Examples
Kafka Source
JSON File Source
SaveAsAvro Operation
DatabaseUpsert Operation
Putting it all together…
[Diagram: a Customer record is ingested through Hydra into a broker; Hydra stream dispatch reads DSL definitions from /dsls and materializes Invoices and Returns datasets]
WE ARE ON GITHUB!
github.com/pluralsight/hydra-spark
github.com/pluralsight/hydra
Thank You!
Editor's Notes

  1. Generate a lot of data and leverage it to make the product better
  2. Single process Codebase and development
  3. Split the monolith into Many processes, different codebases, different deployment pipelines
  4. Independently built: the build process for creating a service should be completely separate from building another service. Independently testable: our microservice should be testable independently of the test lifecycle of other services and components. Independently deployable: our microservice must be independently deployable; this is a fundamental aspect of enabling rapid change. Independent teams: small independent teams owning the full lifecycle of a service, from inception through to its final death. Independent data: one of the hardest aspects for the microservice purist to achieve is data independence.
  5. When it comes to being independent, data is usually a nagging point. Services still need to share data somehow: around deployment, contract schemas, deprecation, interconnectivity, etc. Very rarely will you find a service with a context so tightly bounded that data sharing is secondary. Maybe AuthN services, but even then.
  6. Most services will fall in this area, where they slice and dice the same core business facts and data; they just slice them differently.
  7. These applications/services must work together. Services force us to think about what we need to expose and share with the outside world. Mostly an afterthought.
  8. Future services will become even more interconnected and intertwined.
  9. Because of this, you end up with multiple copies of data across different services that will get out of sync. The more mutable copies of data there are, the more divergent the data will become.
  10. What do you do? Keep changing the contract of services to add more attributes? Turn your services into DAOs?
  11. A transaction is a sequence of one or more SQL operations that are treated as a unit. Specifically, each transaction appears to run in isolation, and furthermore, if the system fails, each transaction is either executed in its entirety or not at all. The concept of transactions is actually motivated by two completely independent concerns: one has to do with concurrent access to the database by multiple clients, and the other with having a system that is resilient to failures.
  12. ACID is overkill, or as some would say, old school.
  13. Databases do this really well!
  14. The idea is that you have a copy of the same data on multiple machines (nodes), so that you can serve reads in parallel, and so that the system keeps running if you lose a machine.
  15. This distinction between an imperative modification and an immutable fact is something you may have seen in the context of event sourcing. That’s a method of database design that says you should structure all of your data as immutable facts, and it’s an interesting idea.
  16. However, there’s something really compelling about this idea of materialized views. I see a materialized view almost as a kind of cache that magically keeps itself up-to-date. Instead of putting all of the complexity of cache invalidation in the application (risking race conditions and all the discussed problems), materialized views say that cache maintenance should be the responsibility of the data infrastructure.
  17. Stream of immutable facts are used to segregate reads from writes SHARED STATE IS ONLY IN THE CACHE SO THAT DATA CANNOT DIVERGE
  18. Let’s talk about Kafka as a commit log / source for a replication stream
  19. Kafka messages have a key and value.
  20. See the benefit if you’ve ever used a regular message queue.
  21. Data becomes an immutable stream of facts
  22. Keeps the latest record per key. History is truncated, but at least the latest version of every key will be present in the log.
  23. What differentiates Kafka from a traditional messaging system
  24. Medium latency; high-volume data flows; SQL en masse processing; massive scaling (10,000s of nodes); not for small volumes; rich options for SQL, etc. Latency lower limit: 0.5 seconds (we are OK with that). Failover and lifecycle management come from the cluster itself: restartability. (ADD TO WHY SPARK)
  25. Why we chose Akka and Scala: distributed systems, the functional paradigm, and datasets. Akka is really the backbone of this platform.