Microservices and Devs in Charge: Why Monitoring is an Analytics Problem

Phillip Liu
phillip@signalfx.com
@SignalFx - signalfx.com

Agenda
• My background
• Microservices, a review
• Analytics approach to monitoring
• Code push side effects, an example
• Summary

Experience
[2013 - ] SignalFx - Founder, CTO, Software Engineer
Microservices; Monitoring using Analytics
[2008 - 2012] Facebook - Software Engineer, Software Architect
Hyperscale SOA; Monitoring using Nagios, Ganglia, and in-house
Analytics
[2004 - 2008] Opsware - Chief Architect, Software Engineer
Monolithic Architecture; Monitoring using Ganglia, Nagios, Splunk
[2000 - 2004] Loudcloud - Software Engineer
LAMP, Application Server; Monitoring using SNMP, Ganglia, NetCool
[1998 - 2000] Marimba - Software Engineer
Client / Server; Monitoring using SNMP, FreshWater Software
[ … ]

Microservices, a Review

A Microservices Definition
Loosely coupled service
oriented architecture with
bounded context.
Adrian Cockcroft

SignalFx’s Microservices
More than 15 internal services.
Spanning hundreds of
instances.
Across 3 AZs.
Have dependencies on
tens of external services.

Monitoring Challenges
• High iteration rate leads to shortened test
cycles
• Integration test combinations are intractable
• Catch problems during rolling deployments
• Identify upstream/downstream side effects
• e.g. backpressure
• Identify brownouts before the customer
• etc.

Analytics Approach to Monitoring

Monitoring at SignalFx
• We use SignalFx to monitor SignalFx
• CollectD for OS and Docker metrics on all VMs
• Yammer metrics for all Java app servers
• Custom logger to count exception types
• All metrics are sent to an analytics service
• Each service deploys at its own cadence
• Push to lab, then canary in prod, then the rest of the tier
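The "custom logger to count exception types" idea can be sketched as a logging handler that increments a counter per exception class. This is a minimal illustration in Python, not the logger used at SignalFx (the talk describes Java app servers); the class name `ExceptionCountingHandler` and the logger name are assumptions made for the example.

```python
import logging
from collections import Counter


class ExceptionCountingHandler(logging.Handler):
    """Counts logged exceptions by class name (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def emit(self, record):
        # exc_info is populated when the call used logger.exception(...)
        # or passed exc_info=True.
        if record.exc_info and record.exc_info[0] is not None:
            self.counts[record.exc_info[0].__name__] += 1


logger = logging.getLogger("metadata-api")  # illustrative service name
logger.propagate = False  # keep the demo output quiet
handler = ExceptionCountingHandler()
logger.addHandler(handler)

for bad_input in ("x", "y"):
    try:
        int(bad_input)
    except ValueError:
        logger.exception("parse failed")

try:
    {}["missing"]
except KeyError:
    logger.exception("lookup failed")

print(dict(handler.counts))
```

In a real service the counts would be reported periodically as one metric per exception type rather than printed.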

Code Push Side Effects
Pushed a canary instance; the Metadata API
dashboard showed a healthy tier.

However, upstream UI dashboard
showed unusual # of timeouts.

In search of root cause.
Always safe to start by looking at exception counts.
Can’t derive much from all the noise.

Sum the # of exceptions to create a single signal.
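Collapsing many noisy per-host, per-type exception counters into one signal is a plain group-by-timestamp sum. A small sketch, assuming datapoints arrive as `(timestamp, host, exception_type, count)` tuples; the tuple layout and the sample values are invented for illustration.

```python
from collections import defaultdict

# (timestamp_seconds, host, exception_type, count): illustrative data only
datapoints = [
    (0, "host-a", "TimeoutException", 2),
    (0, "host-b", "IOException", 1),
    (60, "host-a", "TimeoutException", 3),
    (60, "host-b", "InvalidObjectException", 40),
]


def sum_signal(points):
    """Sum counts across all hosts and exception types per timestamp."""
    totals = defaultdict(int)
    for ts, _host, _exc_type, count in points:
        totals[ts] += count
    return dict(sorted(totals.items()))


total = sum_signal(datapoints)
print(total)
```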

Compare sum with time-shifted sum from a day ago.
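The time-shift comparison amounts to looking up the same signal 24 hours earlier and taking a ratio. A hedged sketch, assuming the summed signal is a mapping from timestamp (in seconds) to count; the function name and the handling of a zero baseline are choices made here, not taken from the talk.

```python
DAY = 24 * 60 * 60  # shift in seconds


def compare_with_day_ago(signal, now, shift=DAY):
    """Return (current, baseline, ratio) for the point at `now`."""
    current = signal.get(now, 0)
    baseline = signal.get(now - shift, 0)
    if baseline:
        ratio = current / baseline
    elif current:
        ratio = float("inf")  # errors now, none a day ago
    else:
        ratio = 1.0  # quiet both times
    return current, baseline, ratio


# Illustrative values: 4 exceptions a day ago, 48 now.
signal = {0: 4, DAY: 48}
print(compare_with_day_ago(signal, DAY))
```

Comparing against the same time of day keeps normal daily traffic patterns from looking like anomalies.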

Look at an outlier host - an Analytics
service host.
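One way to pick out an outlier host is to compare each host's exception count with the median across the tier, scaled by the median absolute deviation (MAD). The talk does not say which method was used; this is a sketch under that assumption, with invented host names and a threshold chosen for the example.

```python
from statistics import median


def outlier_hosts(counts, threshold=5.0):
    """Return hosts whose count sits far above the tier median."""
    values = list(counts.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    scale = mad or 1.0  # avoid dividing by zero when all hosts agree
    return sorted(
        host for host, v in counts.items() if (v - med) / scale > threshold
    )


# Illustrative per-host exception counts for one time window.
counts = {
    "analytics-1": 3,
    "analytics-2": 4,
    "analytics-3": 250,
    "analytics-4": 2,
    "analytics-5": 5,
}
print(outlier_hosts(counts))
```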

java.io.InvalidObjectException: enum constant MURMUR128_MITZ_64 does not exist in class com.google.common.hash.BloomFilterStrategies
    at java.io.ObjectInputStream.readEnum(ObjectInputStream.java:1743) ~[na:1.7.0_79]
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347) ~[na:1.7.0_79]
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) ~[na:1.7.0_79]
…
Looking at the Analytics service’s logs revealed
the source of the problem.

• Analytics across multiple microservices reduced the
time to identify the problem. From push to resolution
was ~15 min
• Service instrumentation helped narrow down the
root cause
• The discovery allowed us to create a detector using
analytics to notify us of similar problems in the future
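A detector of this kind can be approximated as a rule over two aligned series: the summed exception count now and the same sum from a day earlier. The sketch below fires only after several consecutive points exceed a multiple of the baseline, so a single spike does not page anyone. The `factor` and `min_consecutive` values are assumptions for illustration, not SignalFx's detector settings.

```python
def detect(current, baseline, factor=3.0, min_consecutive=3):
    """Return the index at which the detector fires, or None."""
    streak = 0
    for i, (cur, base) in enumerate(zip(current, baseline)):
        # max(base, 1) keeps a zero baseline from firing on tiny counts.
        if cur > factor * max(base, 1):
            streak += 1
            if streak >= min_consecutive:
                return i
        else:
            streak = 0
    return None


# Illustrative series: the push lands around index 2.
current = [5, 6, 40, 45, 50, 48]
day_ago = [5, 5, 6, 5, 5, 6]
print(detect(current, day_ago))
```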

Other Examples
• A customer started dropping data because they
reverted to an unsupported API
• Compare tsdb write throughput of two different
write strategies
• Create per-service capacity reports
• Identify memory usage patterns across our
Analytics service
• Create a detector for every previously uncaught
error condition - postmortem output

• Measure and store as many metrics and events as
possible
• Use data analytics techniques to
• Identify problems
• Chase down root cause
• Create analytics based detectors to notify you of
recurrence

Thank You!
Phillip Liu
phillip@signalfx.com
WE’RE HIRING
jobs@signalfx.com
@SignalFx - signalfx.com
