Bootstrapping Microservices with Kafka, Akka and Spark
- 2. Who am I?
- Data Platform Architect at Pluralsight
- Rackspace
- WDW
- 15. What data do we share?
How do we do it?
(diagram: Invoices, Customer, Job, Returns, Printer)
- 25. Indexes are awesome and we need them,
they make lookups fast!
Why do you want to scan all the data if you know
what you want?
That’s dumb.
- Dustin Vannoy
- 28. Mutation vs. facts
UPDATE wishlist SET qty=3
WHERE user_id=121 AND product_id=123
→ a state mutation
At 2:39pm, user 121 updated his wish list,
changing the quantity of product 123 from 1 to 3.
→ a fact
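The contrast above can be sketched in a few lines of plain Python (illustrative only, not code from the talk): the mutation overwrites the old quantity and loses history, while the fact records what happened, and the current state is derived by folding over the facts.

```python
# State mutation: the previous quantity (1) is lost after the update.
wishlist = {(121, 123): 1}
wishlist[(121, 123)] = 3  # UPDATE wishlist SET qty=3 WHERE ...

# Fact: an immutable record of what happened, including the old value.
facts = [
    {"at": "14:39", "user_id": 121, "product_id": 123, "qty_from": 1, "qty_to": 3},
]

# Current state is a fold (left-to-right reduction) over the facts.
def current_state(events):
    state = {}
    for e in events:
        state[(e["user_id"], e["product_id"])] = e["qty_to"]
    return state

assert current_state(facts) == wishlist  # same end state, but history survives
```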
- 34. A Messaging system based on
distributed log semantics
Scalable
Fault tolerant
Stateful
Strong ordering
High concurrency
- 37. Looks like a globally ordered queue
(diagram: applications write to a broker; a consumer application reads from it)
- 38. The log is a linear structure
(diagram: old → new; messages are added at the new end)
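A minimal sketch of the append-only structure (plain Python, purely illustrative): offsets are assigned sequentially, and writes only ever touch the new end.

```python
# The log: an append-only list; existing entries are never modified.
log = []

def append(message):
    offset = len(log)   # each message gets the next sequential offset
    log.append(message)
    return offset

assert append("m0") == 0   # oldest message, offset 0
assert append("m1") == 1
assert append("m2") == 2   # newest message, added at the tail
assert log == ["m0", "m1", "m2"]
```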
- 39. Consumers have a position
Only sequential access: read to an offset and scan.
(diagram: old → new, with Consumer 1 and Consumer 2 at different offsets)
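The offset semantics can be modeled in a few lines (illustrative Python, not Kafka's implementation; the `poll` name merely echoes the consumer API): each consumer tracks its own position and scans forward sequentially from it.

```python
# Two consumers reading the same log, each with an independent offset.
log = ["m0", "m1", "m2", "m3", "m4"]
offsets = {"consumer1": 0, "consumer2": 0}

def poll(consumer, max_records=2):
    start = offsets[consumer]
    batch = log[start:start + max_records]  # sequential scan from the offset
    offsets[consumer] = start + len(batch)  # advance this consumer's position
    return batch

assert poll("consumer1") == ["m0", "m1"]
assert poll("consumer1") == ["m2", "m3"]
assert poll("consumer2") == ["m0", "m1"]   # consumer2's position is independent
```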
- 40. Messages can be replayed
for as long as they exist in the log.
(diagram: old → new, with Consumer 1 and Consumer 2 at different offsets)
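Replay follows directly from the offset model: rewinding a consumer's position makes it re-read the same messages. A sketch in plain Python (the `seek`/`poll` names only mirror Kafka's consumer API; this is not its implementation):

```python
class Consumer:
    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        batch = self.log[self.offset:]  # read everything from the current position
        self.offset = len(self.log)
        return batch

    def seek(self, offset):
        self.offset = offset  # rewinding the offset enables replay

c = Consumer(["m0", "m1", "m2"])
assert c.poll() == ["m0", "m1", "m2"]
assert c.poll() == []                   # caught up, nothing new
c.seek(0)                               # rewind to the beginning
assert c.poll() == ["m0", "m1", "m2"]   # the same messages are replayed
```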
- 43. Log clean-up policy: delete
(diagram: a scan over offsets 1–12, old → new)
After log.retention.ms or log.retention.bytes is exceeded,
messages are dropped from the log.
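The delete policy can be sketched as dropping whole messages from the old end once a budget is exceeded (illustrative Python; `retention_bytes` merely stands in for the broker's log.retention.bytes setting, and real Kafka deletes whole segments, not single messages):

```python
def enforce_retention(log, retention_bytes):
    size = sum(len(m) for m in log)
    while log and size > retention_bytes:
        size -= len(log[0])
        log.pop(0)  # the oldest messages are dropped first
    return log

log = ["aaaa", "bbbb", "cccc", "dddd"]        # 16 bytes in total
assert enforce_retention(log, 8) == ["cccc", "dddd"]  # only the newest 8 bytes survive
```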
- 44. Log clean-up policy: compact
(diagram: log tail with offsets 1, 8, 12, 13, 15; log head with offsets 16–26;
cleaner point and delete retention point, governed by delete.retention.ms)
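Compaction semantics in miniature (a plain-Python sketch, not Kafka's log cleaner): history is truncated, but the last record for every key is retained, in log order.

```python
def compact(log):
    # Index of the last occurrence of each key in the (key, value) log.
    last_index = {key: i for i, (key, _) in enumerate(log)}
    # Keep only each key's final record, preserving the order of the survivors.
    return [(k, v) for i, (k, v) in enumerate(log) if last_index[k] == i]

log = [("user:121", "qty=1"), ("user:7", "qty=2"), ("user:121", "qty=3")]
assert compact(log) == [("user:7", "qty=2"), ("user:121", "qty=3")]
```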
- 52. Why Spark?
Support for many different data formats
Structured streaming
Failover and lifecycle management
Medium latency
Unified API
- 53. Event stream / log
(diagram: a service writes to the event stream/log, which replicates to
materialized views/cache, Hadoop, and ETL/transform jobs)
• Reproducible
• Stays in sync
- 62. Always capture metadata at ingestion time
Automate data replication
Automate data pipelines
Automate data discovery
- 67. What is it?
Abstraction layer on top of datasets
Models data flows
Sources and operations
Based on a custom DSL
API-driven
- 73. Putting it all together…
(diagram: Customer, Invoices, and Returns data enter through Hydra ingestion
to the broker, then flow to Hydra stream dispatch, driven by DSLs at the
/dsls endpoint)
- 74. WE ARE ON GITHUB!
github.com/pluralsight/hydra-spark
github.com/pluralsight/hydra
Editor's Notes
- Generate a lot of data and leverage it to make the product better
- Single process
Codebase and development
- Split the monolith into Many processes, different codebases, different deployment pipelines
- Independently Built
The build process for creating a service should be completely separate from building another service.
Independently Testable
Our microservice should be testable independently of the test lifecycle of other services and components.
Independently Deployable
Our microservice must be independently deployable, this is a fundamental aspect of enabling rapid change.
Independent Teams
Small independent teams owning the full lifecycle of a service, from inception through to its final death.
Independent Data
One of the hardest aspects for the microservice purist to achieve is data independence.
- When it comes to being independent, data is usually a nagging point.
Services still need to share data somehow
Around deployment, contract schemas, deprecation, interconnectivity, etc.
Only rarely will you find a service whose context is so tightly bounded that data sharing is secondary. Maybe AuthN services, but even then.
- Most services will fall in this area, where they slice and dice the same core business facts and data; they just slice them differently.
- These applications/services must work together.
Services force us to think about what we need to expose and share with the outside world.
Mostly an afterthought
- Future services will become even more interconnected and intertwined.
- Because of this, you end up with multiple copies of data across different services that will get out of sync.
The more mutable copies of data there are, the more divergent the data will become.
- What do you do? Keep changing the contracts of your services to add more attributes?
Turn your services into DAOs?
A transaction is a sequence of one or more SQL operations that are treated as a unit.
Specifically, each transaction appears to run in isolation, and furthermore, if the system fails, each transaction is either executed in its entirety or not at all.
The concept of transactions is actually motivated by two completely independent concerns. One has to do with concurrent access to the database by multiple clients and the other has to do with having a system that is resilient to system failures.
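This all-or-nothing behavior is easy to see with the stdlib sqlite3 module (an illustrative sketch; the schema and account names are invented): when a later statement in a transaction fails, the earlier statements are rolled back with it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # begins a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
        conn.execute("INSERT INTO accounts VALUES ('alice', 0)")  # duplicate key -> fails
except sqlite3.IntegrityError:
    pass  # the whole unit is rolled back, including the two successful updates

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
assert balances == {"alice": 100, "bob": 0}  # neither transfer leg took effect
```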
- ACID is overkill, or as some would say, old school
- Databases do this really well!
- the idea is that you have a copy of the same data on multiple machines (nodes), so that you can serve reads in parallel, and so that the system keeps running if you lose a machine.
- This distinction between an imperative modification and an immutable fact is something you may have seen in the context of event sourcing. That’s a method of database design that says you should structure all of your data as immutable facts, and it’s an interesting idea.
- However, there’s something really compelling about this idea of materialized views. I see a materialized view almost as a kind of cache that magically keeps itself up-to-date. Instead of putting all of the complexity of cache invalidation in the application (risking race conditions and all the discussed problems), materialized views say that cache maintenance should be the responsibility of the data infrastructure.
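The "cache that magically keeps itself up-to-date" idea can be sketched in a few lines (illustrative Python): the view is maintained incrementally from the event stream, and because the stream is the source of truth, a full rebuild yields the identical result, which is what makes the view reproducible and in sync.

```python
# An event stream of (product, price) updates: the source of truth.
stream = [("p1", 10), ("p2", 5), ("p1", 7)]

# Incremental maintenance: each event updates the view as it arrives,
# so the cache never needs manual invalidation by the application.
view = {}
for product, price in stream:
    view[product] = price

# A full rebuild from the log produces exactly the same view.
rebuilt = dict(stream)
assert view == rebuilt == {"p1": 7, "p2": 5}
```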
- A stream of immutable facts is used to segregate reads from writes.
Shared state lives only in the cache, so the data cannot diverge.
- Let’s talk about Kafka as a commit log / source for a replication stream
- Kafka messages have a key and value.
- You’ll see the benefit if you’ve ever used a regular message queue
- Data becomes an immutable stream of facts
- Keeps the latest record per key.
History is truncated, but at least the latest version of every key will be present in the log.
- What differentiates Kafka from a traditional messaging system
- Medium latency
High volume
data flows, SQL
en masse processing
massive scaling - 10,000s nodes
not for small volumes
rich options for SQL, etc.
Low limit: 0.5 seconds (we are ok with that)
Failover and lifecycle management from cluster itself - restartability (ADD TO WHY SPARK)
- Why did we choose Akka and Scala?
Distributed systems
Functional paradigm and datasets
Akka is really the backbone of this platform