http://particular.net
What to consider when monitoring
microservices
Sean Farmar holds the world record for answering the most
NServiceBus questions - even more than Udi.
With over 20 years of experience, he specializes in providing
simple solutions for complex business requirements using
NServiceBus and applying SOA principles inspired by Udi
Dahan.
As a solution architect with Particular Software, the creators of
NServiceBus, Sean provides support, training and consulting
for customers using NServiceBus and the Particular Platform.
A professional geek, William works for Particular Software
writing amazing software like NServiceBus. Passionate about
the web and security, he is engaged in a sordid love affair with
JavaScript, and spends most of his free time trying to convince
others of its beauty and elegance.
When not behind his laptop hacking away, this amateur beer
enthusiast can often be found playing boardgames or drinking
cold-brew coffee.
William Brander http://particular.net Sean Farmar
What to consider when monitoring
microservices
Agenda
• Introduction
• A Philosophy on Monitoring
• How things change when they’re distributed
• Monitoring Metrics
• Q & A
An average production system
Database
• Is the web server up?
• Is the database up?
• Can the webserver talk
to the db?
What are you actually monitoring?
Business
Capability
Application
Infrastructure
Are my servers running?
Is my application process running?
Can users access the system?
A Monitoring Philosophy
Business
Capability
Application
Infrastructure
Capacity
Performance
Health
Monitoring Area Monitoring Concern
Monitoring Concerns
Business Capability
• Health: Can users access the system?
• Performance: Are we meeting our SLAs?
• Capacity: What is the impact of adding another customer?
Application
• Health: Is my application generating exceptions?
• Performance: How quickly is my system processing messages?
• Capacity: Can I handle month end batch jobs?
Infrastructure
• Health: Is the server up?
• Performance: Is there high CPU?
• Capacity: Do I have enough disk space?
A Monitoring Philosophy
Business
Capability
Application
Infrastructure
Capacity
Performance
Health
Monitoring Area Monitoring Concern
Proactive
Reactive
Passive
Interaction Type
An average production system
Database
• Is the web server up?
• Is the database up?
• Can the webserver talk
to the db?
Infrastructure, Passive, Health
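As a rough sketch of how the philosophy can be used, each check can be tagged with its monitoring area, concern and interaction type, and the untouched cells of the area/concern grid then show what is not being monitored. The class and function names here (`Check`, `coverage`, `gaps`) are hypothetical and only illustrate the idea; the three checks and their classification come from the slide above.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Area(Enum):
    INFRASTRUCTURE = "Infrastructure"
    APPLICATION = "Application"
    BUSINESS_CAPABILITY = "Business Capability"


class Concern(Enum):
    HEALTH = "Health"
    PERFORMANCE = "Performance"
    CAPACITY = "Capacity"


class Interaction(Enum):
    PASSIVE = "Passive"
    REACTIVE = "Reactive"
    PROACTIVE = "Proactive"


@dataclass(frozen=True)
class Check:
    """One monitoring question, tagged along the three axes of the philosophy."""
    question: str
    area: Area
    concern: Concern
    interaction: Interaction


# The checks of the "average production system", all classified as
# Infrastructure / Health / Passive.
CHECKS = [
    Check("Is the web server up?", Area.INFRASTRUCTURE, Concern.HEALTH, Interaction.PASSIVE),
    Check("Is the database up?", Area.INFRASTRUCTURE, Concern.HEALTH, Interaction.PASSIVE),
    Check("Can the webserver talk to the db?", Area.INFRASTRUCTURE, Concern.HEALTH, Interaction.PASSIVE),
]


def coverage(checks):
    """Return the (area, concern) cells that have at least one check."""
    return {(c.area, c.concern) for c in checks}


def gaps(checks):
    """Return the (area, concern) cells that no check covers."""
    return set(product(Area, Concern)) - coverage(checks)
```

Calling `gaps(CHECKS)` returns every cell except Infrastructure/Health, which is the point of the slide: the usual three checks say nothing about the application or the business capability.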
Going Distributed
UI
BL
DAL
DB
Email
PDF
CRM
Going Distributed
Email
PDF
CRM
SQL
Monitoring distributed systems
Multiple processes and servers and queues
We want to monitor the time it takes for a message to be processed
We need to monitor the message queues
Let’s look at queue length
Queue Length
• Queue length is an indicator of work still outstanding
• High queue length doesn’t necessarily indicate a problem though
Stable or
decreasing
is good
Increasing
is bad
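One way to act on "increasing is bad" is to look at the trend of recent queue length samples rather than at the absolute value. The sketch below fits a least-squares slope to the samples; the function name `queue_length_trend` and the `tolerance` parameter are assumptions made for this illustration and are not part of NServiceBus.

```python
from statistics import mean


def queue_length_trend(samples, tolerance=0.0):
    """Classify evenly spaced queue length samples by their trend.

    Returns "increasing", "decreasing" or "stable" based on the
    least-squares slope (messages per sample interval).
    `tolerance` is a hypothetical dead band that absorbs small jitter.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    n = len(samples)
    x_mean = (n - 1) / 2
    y_mean = mean(samples)
    numerator = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(samples))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    slope = numerator / denominator
    if slope > tolerance:
        return "increasing"
    if slope < -tolerance:
        return "decreasing"
    return "stable"
```

A queue that sits around 5,000 messages but does not grow is classified as stable, which matches the point that a high queue length is not necessarily a problem.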
Processing Time
✔ ⌛
⏱️
⌛ ✔
Processing Time and fault tolerance
• Processing Time does not include error handling time
• Avoid losing data due to exceptions or temporary connectivity issues
• If all else fails, move the message to the error queue
Invoke
Exception Diagnostics – Immediate Retries
Input Queue
Immediate Retries x n
Start Delayed Retries
Timeout Queue
If all retries fail:
✘
✘
Exception Diagnostics – Delayed Retries
Return to timeout queue
Invoke
Error Queue
Retry x times
Timeout Queue
If all fails: move to error queue
✘
✘
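The flow on the two retry slides can be sketched roughly as follows. This is an in-memory illustration, not the NServiceBus implementation: the function name, its parameters and the use of a plain list as the error queue are all hypothetical, and the sketch assumes that each delayed retry round runs the immediate retries again.

```python
import time


def process_with_retries(message, handler, error_queue,
                         immediate_retries=2, delayed_retries=2,
                         delay_seconds=0.0, sleep=time.sleep):
    """Invoke `handler`, retrying immediately and then after delays.

    Returns True if the handler eventually succeeds. If every attempt
    fails, the message is appended to `error_queue` and False is returned.
    """
    last_error = None
    # Round 0 is the initial delivery; later rounds stand in for the
    # message coming back from the timeout queue.
    for round_number in range(delayed_retries + 1):
        if round_number > 0:
            # Assumed back-off: wait longer before each delayed round.
            sleep(delay_seconds * round_number)
        # One attempt plus `immediate_retries` immediate retries.
        for _ in range(immediate_retries + 1):
            try:
                handler(message)
                return True
            except Exception as exc:
                last_error = exc
    # If all else fails, move the message to the error queue.
    error_queue.append((message, repr(last_error)))
    return False
```

Under these assumptions a message gets at most `(immediate_retries + 1) * (delayed_retries + 1)` attempts before it lands in the error queue, and the time spent in those retries is the "retry time" that processing time leaves out.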
Detecting Connectivity
• Distributed systems typically keep working when other parts aren’t available
• How do you know the endpoint you’re sending messages to is
actually processing messages?
Detecting Connectivity
✔⌛
⏱️
Critical time
⏱️
Critical time = The entire time taken to process a
message successfully
⏱️
• Critical Time is the total duration from when a message is created
to when it is processed
Critical Time = Time in Queue +
Processing Time +
Retry Time +
Network Time
Critical Time
Stable or decreasing could
be good
Increasing is bad
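The formula above can be written out as a small sketch. The `MessageTimings` class and `critical_time_from_timestamps` function are hypothetical names for this illustration; they simply show that summing the four components and subtracting the creation timestamp from the processed timestamp describe the same quantity.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class MessageTimings:
    """The components of critical time for one message, in seconds."""
    time_in_queue: float
    processing_time: float
    retry_time: float
    network_time: float

    def critical_time(self) -> float:
        # Critical Time = Time in Queue + Processing Time
        #                 + Retry Time + Network Time
        return (self.time_in_queue + self.processing_time
                + self.retry_time + self.network_time)


def critical_time_from_timestamps(created_at: datetime,
                                  processed_at: datetime) -> float:
    """Critical time as elapsed seconds between creation and processing."""
    return (processed_at - created_at).total_seconds()
```

For example, a message that waits 1.5 s in the queue, takes 0.25 s to process, spends 2 s in retries and 0.25 s on the network has a critical time of 4 s, even though its processing time alone looks healthy.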
Putting these together
• Each of these metrics presents a piece of the puzzle
• Look at them from an endpoint’s perspective, not per message
• Looking at them together gives great insight into your system
Critical Time, Processing Time, Queue Length
Keeping your eye on everything
• These 5 metrics can give a lot of insight
• Some individual metrics are meaningful
• But most tell a story when combined with others
• Let the monitoring philosophy guide what you focus on
• NServiceBus already provides a lot of these metrics for you!
• Letting you focus on monitoring the metrics that impact your business
Learn more
• Try NServiceBus + the Particular Service Platform
• https://docs.particular.net/tutorials/quickstart/
• Take a look at the NServiceBus.Metrics NuGet package
• Follow us to find out about the next webinar in the series!
Q&A
Thank you!
@farmar sean.farmar@particular.net
@williambza william.brander@particular.net
https://www.particular.net
