Everybody loves microservices, but we all know how difficult it is to get them right. Distributed systems are much more complex to develop and maintain, and over time we even miss the simplicity of the old monoliths. In this talk, I propose a combination of infrastructure, architecture, and design principles to make your microservices bulletproof and easy to maintain, with a combination of high scalability, elasticity, fault tolerance, and resilience. This session will also include a discussion of some microservices blueprints: asynchronous communication, how to avoid cascading failures in synchronous calls, and why you should use different data stores according to the use case (document databases to speed up performance, RDBMSs for transactions, graphs for recommendations, etc.).
Who am I?
• Matthew D. Groves
• Developer Advocate for Couchbase
• Twitter: @mgroves
• Live Coding: https://twitch.tv/matthewdgroves
• "I am not an expert, but I am an enthusiast." – Alan Stevens
Disk is cheap: add versioning to your state and store
all received change requests. Good for:
• Debugging
• Fixing inconsistencies
• Auditing
• Querying the state of an entity within a period
Event Sourcing/Logging
https://blog.couchbase.com/event-sourcing-event-logging-an-essential-microservice-pattern/
Other things to consider
• Auto Retries (after a 502 for instance)
• Circuit Breakers
• Authentication
• Observed Latency
• Bulkheads
• Consistent Metrics
• Logging
• … and more
Microservices allow you to independently deploy and scale parts of your system.
If you determine that just the User Profile part of your system is being used a lot, with a monolith you have to scale the whole thing.
With a microservice, you can just scale the user part.
And not only scale, but also develop and deploy independently.
BUT
Microservices are a distributed system, which brings a lot of challenges with it
Monoliths are much simpler to develop – a single application containing all the features
So if I'm in the payment part of the system, and I need to access something about a user, then it's just a method call. Easy.
But in a microservice system, you're making an HTTP call to the user service, and there's a chance that the user service is down.
Also, it's not easy to refactor. If I add a new field to the user service, I can't just go and update all the other services. Some other team might be responsible for the payment service.
Transactions are a problem too. How can I guarantee that an update applies atomically to two different services?
More expensive to develop (time and money).
If you're a small team and don't need to scale a lot, stick to a monolith.
If scale is a problem then microservices might help you. Not just talking about scaling servers—scaling your team, scaling your deployments, scaling your company.
Migrating from monolith to microservices
And what you might see is that you start to develop a microservice LIKE a monolith
It's the worst of both worlds
You have all the problems that you had before, but now they are distributed
Microservices make up a system. It may be a dozen services, but they all act within the same system.
Unlike a monolith, they need to be autonomous.
If you are relying on synchronous calls, you will have trouble. Because the network is a whole new problem. If the user service is offline, you can't get user data.
Think about microservices as if they were human beings. We rely on each other to commute to work, for instance. We rely on a bus driver to get us someplace, but most of the day, I don't need the bus driver to accomplish my job. I'm autonomous, but I rely on the bus driver at some level.
So what does it mean to be an autonomous microservice? 4 characteristics.
By being a microservice, that makes them easier to scale. That's the whole point. You slice up a monolith and can scale parts of it independently.
But just slicing up a monolith doesn't necessarily give you these other things.
So imagine a microservices architecture
All these services depend on other services
But say the user service goes down; then the whole system goes down
So now it's like a single point of failure
So how do we isolate this failure? If the user service is offline, the other services need to keep going
One way to improve isolation is caching
So let's say that order, delivery, payment all depend on the user service, the user data
But if the user service goes offline, I'd like to keep going
So set up some kind of async communication
Whenever a user is updated in user service, it will trigger some event for the other services to subscribe to
These services will store the data that they NEED locally, not necessarily the entire user data, just the parts they need
This usually happens with something like RabbitMQ or Kafka, that sort of thing
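That flow can be sketched with a toy in-memory bus. All names here (EventBus, OrderService, the "user.updated" topic) are made up for illustration; in production the bus would be RabbitMQ or Kafka and each subscriber would be a separate service:

```java
import java.util.*;
import java.util.function.Consumer;

// Toy in-memory stand-in for a message broker like RabbitMQ or Kafka.
class EventBus {
    private final Map<String, List<Consumer<Map<String, String>>>> subscribers = new HashMap<>();

    void subscribe(String topic, Consumer<Map<String, String>> handler) {
        subscribers.computeIfAbsent(topic, t -> new ArrayList<>()).add(handler);
    }

    void publish(String topic, Map<String, String> event) {
        for (Consumer<Map<String, String>> h : subscribers.getOrDefault(topic, List.of())) {
            h.accept(event);
        }
    }
}

// The order service keeps a local copy of just the user fields it needs,
// so it can keep working even if the user service is offline.
class OrderService {
    final Map<String, String> userNameCache = new HashMap<>();

    OrderService(EventBus bus) {
        bus.subscribe("user.updated", e -> userNameCache.put(e.get("id"), e.get("name")));
    }
}
```

The point of the local cache is that the order service reads from it instead of calling the user service on every request.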
The order service could still go to the user service and fall back to the cache
But if you are caching data, that gives you a window of time to get the user service fixed, for instance
There could be some eventual consistency problems here. The cache might be out of date for a short period of time (it's not synchronous)
But thinking back to the metaphor of humans: Real life is not consistent. If I change my address, I fill out a card, but it might not get entered into the post office's system for hours or days. In the meantime, stuff will still get sent to my old address.
Other types of communication
We as devs generally tend to think about synchronous, although that's changing
But most problems can be reframed as async
Streaming is a whole different approach; I generally don't see it that much, but it's an option
Async should be the norm between services
Think about placing an order on Amazon
You place an order and you get a confirmation email right away, it *seems* synchronous
But generally an inventory service is invoked, then a delivery service, then a payment service, and eventually you get another email that your order has been shipped and you've been charged.
This means that even if the payment service is offline, you can always place an order. They get queued up, and once the payment service is back online, it goes to work.
So if most communication between services is async and you have some caching, then you can tolerate some failure, maybe for minutes, hours, or a day
But, our microservices are not yet resilient
Checkbook analogy
"writing a check", "balancing a checkbook"
Event sourcing is a way of building the history of the state of some entity
So you can then build the current state by using the history
Kinda like an accounting ledger
So let's say that some other team pushes an update, which causes inconsistencies in the service or data
And that system sends a wrong message to my service
So now my service stores the inconsistency, and now I'll send inconsistent messages
And now we have a distributed bug. What's the source? How do we debug it?
Event sourcing allows you to track the history and changes in your service. You can see when the inconsistency was generated.
So you can add an offsetting record to fix it, you can reset the state and reprocess from some snapshot, etc
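The ledger idea can be made concrete with a toy sketch. The Account class and its shape are hypothetical, not a real framework: state is never overwritten, only appended, and the current value is a fold over the history.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal event-sourcing sketch: instead of storing the current balance,
// store every change and fold the history into the current state,
// like an accounting ledger.
class Account {
    private final List<Integer> events = new ArrayList<>(); // signed amounts

    void append(int delta) { events.add(delta); }

    // Rebuild the current state from the full history.
    int balance() { return events.stream().mapToInt(Integer::intValue).sum(); }

    List<Integer> history() { return List.copyOf(events); }
}
```

To fix a bad +50 credit you append a -50 offsetting entry rather than editing history, and the history still shows exactly when the inconsistency appeared.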
Who has heard of event sourcing before? Who has used it? If you haven't, you should definitely start researching, and check out this blog post about event sourcing. Very important in a microservice system.
In this case, service 2 is down, or having problems
So maybe a thread in the web app is locked, because it is waiting on service 2
The other threads aren't using service 2, so they are working fine.
But what will happen over time (this might be a short period or a long one, depending on how often service 2 is being used)
is that the entire thread pool in the web app will get consumed
And now web app is on fire, cascading failures
So any request that DOESN'T need service 2, but only needs services 3 and 4, won't be able to go through
Everything is being blocked by service 2
Solution is to put in some timeouts
This is a snippet of Java code that will trigger this kind of problem
There is no timeout specified
There is no DEFAULT READ timeout
So define a timeout for how long to wait
For both connect and for read
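I won't reproduce the slide's exact snippet, but a sketch of the fix with `HttpURLConnection` looks like this. The URL is a placeholder, and the 2- and 5-second values are arbitrary; tune them per service:

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class TimeoutExample {
    public static HttpURLConnection open(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        // Both defaults are 0, which means "wait forever": a hung
        // downstream service would pin this thread indefinitely.
        conn.setConnectTimeout(2000); // ms allowed to establish the connection
        conn.setReadTimeout(5000);    // ms allowed to wait for data once connected
        return conn;
    }
}
```

Nothing actually connects until you read from the connection, so setting these up front is cheap.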
There are some other things to consider
Basically, you don't want your service to blow up because some other service has blown up
Just a couple examples:
Circuit breaker: basically after a certain number of failures, any more attempts to get that service will fail immediately until some timeout period.
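A toy circuit breaker, just to make the mechanics concrete. Real implementations (Hystrix, resilience4j) also handle half-open states, concurrency, and metrics; all names here are made up:

```java
import java.util.function.Supplier;

// Minimal circuit breaker sketch: after `threshold` consecutive failures
// the breaker "opens" and calls fail fast for `cooldownMillis`
// without touching the remote service at all.
class CircuitBreaker {
    private final int threshold;
    private final long cooldownMillis;
    private int failures = 0;
    private long openedAt = -1;

    CircuitBreaker(int threshold, long cooldownMillis) {
        this.threshold = threshold;
        this.cooldownMillis = cooldownMillis;
    }

    <T> T call(Supplier<T> remote) {
        if (openedAt >= 0 && System.currentTimeMillis() - openedAt < cooldownMillis) {
            throw new IllegalStateException("circuit open: failing fast");
        }
        try {
            T result = remote.get();
            failures = 0;          // success closes the circuit again
            openedAt = -1;
            return result;
        } catch (RuntimeException e) {
            if (++failures >= threshold) openedAt = System.currentTimeMillis();
            throw e;
        }
    }
}
```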
Bulkheads: Suppose we have a vital service that needs the user service, and suppose a less vital service, like analytics, that also needs user service. So a bulkhead allows you to prioritize and assign a limited number of threads to the analytics service. So even if analytics wants to get a million records, it will only be allowed, say, 2 threads at a time to do that.
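A bulkhead can be as simple as a semaphore; this sketch is illustrative and the names are hypothetical:

```java
import java.util.concurrent.Semaphore;

// Minimal bulkhead sketch: a Semaphore caps how many concurrent calls a
// low-priority consumer (e.g. analytics) may make to the user service,
// so it can never exhaust the capacity that vital services rely on.
class Bulkhead {
    private final Semaphore permits;

    Bulkhead(int maxConcurrent) { this.permits = new Semaphore(maxConcurrent); }

    boolean tryEnter() { return permits.tryAcquire(); } // false = rejected immediately
    void exit() { permits.release(); }
}
```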
This is a framework to help with many of these things
created by Netflix
Which is fine, but it pushes a lot of responsibility on the developer to handle these failure scenarios
And it's also Java only
And it's currently in maintenance mode
There's another pattern to deal with this called the Service Mesh pattern
Instead of the microservice itself handling network issues and the like, you deploy an application that takes care of it for you and acts as a proxy to the other services
This is a way to implement cross-cutting concerns across all of your microservices
So it has a circuit breaker built in to handle a lot of timeouts, for instance.
Here are some of the more well-known service mesh providers
Cross-cutting concerns they provide:
Externalized configuration - includes credentials, and network locations of external services such as databases and message brokers
Logging - configuring of a logging framework such as log4j or logback
Health checks - a url that a monitoring service can “ping” to determine the health of the application
Metrics - measurements that provide insight into what the application is doing and how it is performing
Distributed tracing - instrument services with code that assigns each external request a unique identifier that is passed between services.
These tools also come with monitoring
So, Hystrix is like a Java framework
But let's say you are using .NET or Node or whatever, so you'd have to use a different tool
Instead of Hystrix, you can run it through these service mesh providers no matter what language: it's language agnostic
And you can get analytics on them
So we've made our microservices resilient using event sourcing, timeouts, circuit breakers, etc
Elastic, meaning we scale out a service (by adding more servers, more nodes) to accommodate more load
The patterns to deal with this are Service Discovery and Load Balancing.
When a new instance of a service is deployed, it has to tell some service registry that it's online.
And when a service is needed, the service registry will be asked "hey, which instance can I call"
But this may not be as necessary anymore, because now we have Kubernetes
With Kubernetes you get the same thing using Services: a logical set of Pods and a policy for accessing them
So you define a service in kubernetes, and it will direct you to one of those instances
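As a hedged illustration (the names and ports are placeholders, not from the slides), such a Service definition might look like:

```yaml
# Hypothetical example: a Service that load-balances across all Pods
# labeled app=user-service, however many replicas exist at the moment.
apiVersion: v1
kind: Service
metadata:
  name: user-service
spec:
  selector:
    app: user-service
  ports:
    - port: 80
      targetPort: 8080
```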
Autoscaling is also available in Kubernetes
So we can specify a number of replicas
I want to scale up, I just change the number
Or I can use Pod Autoscaler to detect some situation and have Kubernetes launch another instance
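A sketch of that, with made-up names and thresholds: a HorizontalPodAutoscaler that lets Kubernetes grow a Deployment when average CPU gets high.

```yaml
# Hypothetical example: add Pods (up to 10) when average CPU
# across the user-service Pods goes above 70%.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: user-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: user-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```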
This is something that Azure / AWS can do, but with Kubernetes, this is cloud agnostic.
So anywhere you can run Kubernetes, including Azure, AWS, your own data center, or wherever
So at this point we've got an autonomous microservice
We've checked all these boxes
So now my microservice is highly scalable, right!?
(pause)
What about the database? I'm forgetting a big part of my system
If we could just store everything in memory, no matter the language, everything would be blazing fast
But we have to store data, persist data, so it's important to take databases into consideration for the performance of the entire system
So we have all these autonomous services, but maybe they're all talking to the same database
So, naturally, we could just scale up the database, right?
Scaling a relational database is not that easy
Vertical scaling is "easy" but can be very expensive and will eventually hit a ceiling
Scaling horizontally with relational is very challenging
Microservices open up polyglot opportunities
We can use the right tool for the right job
Service 1 can be in Python, etc.
Maybe Python is good for certain cases that involve natural language processing
Java microservice leverages the business logic that we've built over the years
Some .NET services were brought over in acquisition
And there's that one team that just loves JavaScript no matter what you tell them
So we can do the same thing for persistence
Maybe we store some financial data in relational, all my Java code is fine with Oracle
We have another part of the system that stores user profiles in documents to better engage users and increase flexibility, so I'll use Couchbase
Yet another part uses a graph database to detect outliers and fraud, so I'll use Neo4j there
And yet another part uses full-text search to help users navigate the site; I can use Couchbase or Solr or whatever to accomplish that
etc
Anyone seen this?
This was an incident that occurred in Central Park
Pokémon Go players were trying to capture a Vaporeon
This is Pokémon Go, going from 0 users to almost 300,000,000 users in a few weeks
Imagine the costs of scaling during this short period of time
Not everyone is a Pokémon Go, scaling to 200 million users
But it's not just about that. I'm sure Oracle could deliver this kind of scaling and performance
But how much is it going to cost?
NoSQL databases will cost much less, not just because of licensing, but also because of the technology of HOW they scale
So with polyglot databases, sometimes it's about using the right database
You could use a relational database to work with graphs, but chances are a graph database will do it better.
but sometimes it's about other tradeoffs, like cost.
Of course I have to mention Couchbase
It's a "replacement" not in the sense that I would say always replace your RDBMS with Couchbase
But if you haven't used NoSQL before, a document database (like Couchbase) would be a good place to start
But as you start to move towards microservices, and slice up your monolith, it may make sense to use
Couchbase with some of those microservices
I find Couchbase in particular to be easy to scale, but how do we make it "elastic" like our microservices?
Anyone know what a Kubernetes Operator is?
An Operator is an application-specific controller that extends the Kubernetes API to manage instances of complex stateful applications
So you could have a Couchbase operator, a MySQL operator, etc.
You don't NEED an operator, but if an operator is available, you should almost definitely use it
So here is a Couchbase cluster defined in Kubernetes that uses the operator
Notice that there is a "size" of "2"
This defines how many nodes of Couchbase I want
To add a 3rd node, I change this number to a 3
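I'm not copying the slide's file verbatim; this is a rough sketch following the 1.x operator's CRD, and the field names and versions may differ in your operator release, so check the docs before using it:

```yaml
# Hedged sketch of the idea: the cluster size lives in one field,
# and scaling out is editing that number and re-applying.
apiVersion: couchbase.com/v1
kind: CouchbaseCluster
metadata:
  name: cb-example
spec:
  baseImage: couchbase/server
  version: enterprise-6.0.1
  servers:
    - name: all_services
      size: 2          # change to 3 and re-apply to add a node
      services:
        - data
        - index
        - query
```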
To do this WITHOUT an operator, you would need to do a lot more work: scripting, manual steps, etc
Here's the YAML file for the Couchbase operator itself
Operator is on version 1.2
This is what the operator gives you.
Again, you don't strictly *need* an operator. We have at least one customer I know of who has been using Kubernetes and Couchbase since before we offered an operator
You could use Couchbase's APIs to do this stuff yourself. But unless you have a really good reason to, I'd stick to the operator
[time permitting]
I'm going to show you a quick demo of Kubernetes and the Couchbase Kubernetes operator
I'm running on Azure with AKS, but everything I'm showing you can be done on any Kubernetes cluster, whether it's Amazon or Google or whatever
There are some operators for other databases out there.
Many of these aren't "official" yet; they are community-driven
But in the long run, I think Kubernetes operators will be the way to go with databases
If you want to check out more about Kubernetes and/or Couchbase, here are some free resources for you
If anything looks interesting to you, you have questions or feedback, come talk to me afterwards
I want to hear from you!
My boss says I have to listen to you, it's my job. So now's your chance :)