SignalFx
Microservices and Devs in Charge:
Why Monitoring is an Analytics Problem
Phillip Liu
phillip@signalfx.com
@SignalFx - signalfx.com
Agenda
• My background
• Microservices, a review
• Analytics approach to monitoring
• Code push side effects, an example
• Summary
My Background
Experience
[2013 - ] SignalFx - Founder, CTO, Software Engineer
Microservices; Monitoring using Analytics
[2008 - 2012] Facebook - Software Engineer, Software Architect
Hyperscale SOA; Monitoring using Nagios, Ganglia, and in-house Analytics
[2004 - 2008] Opsware - Chief Architect, Software Engineer
Monolithic Architecture; Monitoring using Ganglia, Nagios, Splunk
[2000 - 2004] Loudcloud - Software Engineer
LAMP, Application Server; Monitoring using SNMP, Ganglia, NetCool
[1998 - 2000] Marimba - Software Engineer
Client / Server; Monitoring using SNMP, FreshWater Software
[ … ]
Microservices, a Review
A Microservices Definition
Loosely coupled service oriented architecture with bounded context.
Adrian Cockcroft
SignalFx’s Microservices
More than 15 internal services,
spanning hundreds of instances
across 3 AZs,
with dependencies on tens of external services.
Monitoring Challenges
• High iteration rate leads to shortened test cycles
• Integration test combinations are intractable
• Catch problems during rolling deployments
• Identify upstream/downstream side effects
  • e.g. backpressure
• Identify brownouts before the customer does
• etc.
Analytics Approach to Monitoring
Measure
Store
Analyze
Detect
Examples
Monitoring at SignalFx
• We use SignalFx to monitor SignalFx
• CollectD for OS and Docker metrics on all VMs
• Yammer metrics for all Java app servers
• Custom logger to count exception types
• All metrics are sent to an analytics service
• Each service deploys at its own cadence
• Push to lab, then to a canary in prod, then to the rest of the tier
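The "custom logger to count exception types" idea can be sketched in a few lines. The deck's services are Java; this is a Python illustration only, and the class name, logger name, and sample errors are all hypothetical, not SignalFx's implementation.

```python
import logging
from collections import Counter


class ExceptionCountingHandler(logging.Handler):
    """Counts logged exceptions by exception class name (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def emit(self, record):
        # exc_info is populated when a record is logged with an active
        # exception, e.g. via logger.exception(...).
        if record.exc_info and record.exc_info[0] is not None:
            self.counts[record.exc_info[0].__name__] += 1


logger = logging.getLogger("metadata-api")  # illustrative service name
logger.propagate = False                    # keep the demo output quiet
handler = ExceptionCountingHandler()
logger.addHandler(handler)

# Simulate a few failures of two different types.
for payload in ["1", "x", "y"]:
    try:
        int(payload)
    except ValueError:
        logger.exception("bad payload")

try:
    {}["missing"]
except KeyError:
    logger.exception("lookup failed")

print(dict(handler.counts))
```

In a real service the counters would be reported periodically as metrics, one time series per exception type, rather than printed.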
Code Push Side Effects
Push to a canary instance; the Metadata API dashboard shows a healthy tier.
Code Push Side Effects
However, the upstream UI dashboard shows an unusual number of timeouts.
Code Push Side Effects
In search of the root cause.
It is always safe to start by looking at exception counts.
Can't derive much from all the noise.
Code Push Side Effects
Sum the # of exceptions to create a single signal.
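A minimal sketch of that aggregation, assuming each exception series is a mapping from timestamp (seconds) to count. The data layout, host names, and exception names are made up for illustration.

```python
from collections import defaultdict


def sum_signal(series_by_source):
    """Sum many {timestamp: count} series into one {timestamp: total} signal."""
    total = defaultdict(int)
    for points in series_by_source.values():
        for ts, count in points.items():
            total[ts] += count
    return dict(sorted(total.items()))


# One noisy series per (host, exception type); values are hypothetical.
exceptions = {
    ("host-a", "IOException"): {0: 1, 60: 0, 120: 2},
    ("host-a", "TimeoutException"): {0: 0, 60: 1, 120: 5},
    ("host-b", "IOException"): {0: 2, 60: 1},
}

total = sum_signal(exceptions)
print(total)
```

Timestamps missing from a series simply contribute nothing, so hosts that report at different times can still be summed.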
Code Push Side Effects
Compare sum with time-shifted sum from a day ago.
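The time-shift comparison can be sketched as follows, assuming the summed signal is a {timestamp: count} mapping with timestamps in seconds. The function name and sample values are hypothetical.

```python
DAY = 24 * 60 * 60  # seconds


def delta_vs_day_ago(signal, shift=DAY):
    """Return {ts: now - value_at(ts - shift)} where both points exist."""
    deltas = {}
    for ts, now in signal.items():
        past = signal.get(ts - shift)
        if past is not None:
            deltas[ts] = now - past
    return deltas


# Two samples from yesterday and the matching two from today.
signal = {0: 10, 60: 12, DAY: 11, DAY + 60: 48}

deltas = delta_vs_day_ago(signal)
print(deltas)
```

A small delta means today looks like yesterday; a large one, such as the second sample here, stands out even when the absolute count would not look alarming on its own.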
Code Push Side Effects
Look at an outlier host: an Analytics service host.
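One simple way to surface such a host is to compare each host's exception count with the median across hosts. This is a sketch under that assumption; the deck does not say which method was used, and the host names, counts, and `factor` threshold are hypothetical.

```python
from statistics import median


def outlier_hosts(counts_by_host, factor=3.0):
    """Hosts whose count exceeds `factor` times the median across hosts."""
    mid = median(counts_by_host.values())
    # Floor the median at 1 so an all-quiet fleet does not flag every host.
    threshold = factor * max(mid, 1)
    return sorted(h for h, c in counts_by_host.items() if c > threshold)


counts = {"metadata-1": 4, "metadata-2": 5, "ui-1": 6, "analytics-7": 92}

outliers = outlier_hosts(counts)
print(outliers)
```

The median is used instead of the mean so that the outlier itself does not drag the baseline upward.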
Code Push Side Effects
java.io.InvalidObjectException: enum constant MURMUR128_MITZ_64 does not exist in class com.google.common.hash.BloomFilterStrategies
    at java.io.ObjectInputStream.readEnum(ObjectInputStream.java:1743) ~[na:1.7.0_79]
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347) ~[na:1.7.0_79]
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) ~[na:1.7.0_79]
    …
Looking at the Analytics service's logs revealed the source of the problem.
Code Push Side Effects
• Analytics across multiple microservices reduced the time to identify the problem. From push to resolution was ~15 min
• Service instrumentation helped narrow down the root cause
• The discovery allowed us to create a detector that uses analytics to notify us of similar problems in the future
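A detector of this kind can be sketched as a rule over the current signal and its day-ago baseline. This is only an illustration: the rule (exceed `factor` times the baseline for `min_points` consecutive samples), the function name, and the sample data are assumptions, not the detector SignalFx built.

```python
def detect(current, baseline, factor=2.0, min_points=3):
    """Return the index at which the detector fires, or None.

    Fires once current[i] > factor * baseline[i] has held for
    `min_points` consecutive samples, which filters single spikes.
    """
    streak = 0
    for i, (now, past) in enumerate(zip(current, baseline)):
        # Floor the baseline at 1 so a zero baseline does not fire on noise.
        if now > factor * max(past, 1):
            streak += 1
            if streak >= min_points:
                return i
        else:
            streak = 0
    return None


baseline = [10, 12, 11, 10, 12, 11]  # summed exceptions, one day ago
current = [11, 30, 12, 40, 45, 50]   # summed exceptions, now

fired_at = detect(current, baseline)
print(fired_at)
```

Requiring several consecutive samples trades a little detection latency for fewer false alerts; the isolated spike at index 1 does not fire the detector.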
Other Examples
• A customer started dropping data because they reverted to an unsupported API
• Compare tsdb write throughput of two different write strategies
• Create per-service capacity reports
• Identify memory usage patterns across our Analytics service
• Create a detector for every previously uncaught error condition (postmortem output)
Summary
• Measure and store as many metrics and events as possible
• Use data analytics techniques to
  • Identify problems
  • Chase down root cause
• Create analytics-based detectors to notify you of recurrence
Thank You!
Phillip Liu
phillip@signalfx.com
WE’RE HIRING
jobs@signalfx.com
@SignalFx - signalfx.com
