1©2018 VMware, Inc.
Improving Lives with the
World’s Best Transportation
2©2018 VMware, Inc.
About Me
HOBBIES
Include Guitar,
Golf,
Skateboarding,
Cooking/Baking,
and Automobiles
FORMER
Tech Lead of Lyft
Observability
CURRENTLY
Working closely with the
Express Drive team – Lyft’s
vehicle rental program for
drivers
OVER
9 years in tech
2010-2014
Zynga
EARLY
Lyft Infrastructure
(DevOps)
Engineer
3©2018 VMware, Inc.
One Billion Rides
2018
4©2018 VMware, Inc.
About Lyft
• Transportation as a service
• “Your friend with a car” redefines
personal transportation
• Founded in San Francisco in 2012
• Currently operating in the US and Canada
• Available in 300+ cities, with 1500 drivers
at any minute
5©2018 VMware, Inc.
Lyft – More Fun Facts
• 250,000 Lyft community members gave up their cars at the beginning of 2017
• The Lyft community will take 1 million cars off the road by the end of 2019
• Autonomous vehicle fleets will become widespread and will account for the majority of Lyft rides within
five years
• By 2025, private car ownership will all but end in major US cities
• Lyft rides are carbon-neutral
• Lyft Bikes and Scooters will be our solution to the last-mile commute
6©2018 VMware, Inc.
Lyft Stats – in 2017
• Annual rides: MM
• New Year's Eve rides: MM
• Employees: 2K+
• Halloween drop-offs/sec: K+
• Microservices: 200+
• EC2 instances: 10,000+
• Lots* of logs and metrics
7©2018 VMware, Inc.
Observability Team at Lyft
Founded in early 2016, a small and cohesive team of 5 engineers
Team collectively owns
• Client and Server logging infrastructure
• Metric ingest pipeline and real-time aggregation
• Distributed Tracing
• PagerDuty interactions and integrations
• The real-time business metric framework
• Dashboards and user experience with monitoring and alarming setup
• Logging and metric-based alerting
• Baseline monitoring systems for all microservices
• Core libraries
8©2018 VMware, Inc.
Metrics at Lyft:
The Before Times
9©2018 VMware, Inc.
Before Wavefront by VMware
Challenges with Open Source Tooling
• Manual maintenance
• Resource-hungry, which drives up
cost
• Query performance issues
• Ingest performance issues
• Hard to scale
• Sharding handled externally
Reliability Performance Maintainability
10©2018 VMware, Inc.
Observability Challenges Early in 2015
• Lyft used Graphite (with Whisper
files) hosted on i2 instances
• Hard to scale; we handled
sharding externally
• Relays provided poor control over
fan-out of data to alternate
destinations
• We computed top-level
aggregates from aggregates that
already existed
• This stack processed locally
aggregated per-minute samples
11©2018 VMware, Inc.
Observability Challenges Early in 2016
• Replace the poorly scaling Python-based
intermediaries with more
efficient components
• Reduce end-to-end latency
for the site from >3m to <2m
• Produce improved and accurate
top-level aggregates:
p95/p99/p99.9/p99.99
12©2018 VMware, Inc.
Early 2016 – Enter Wavefront
• Replaced the Node.js-based StatsD with a C implementation of the StatsD server: lower overhead, better data
quality
• Added fan-out for StatsD traffic to other clusters or receivers, e.g., Wavefront
• Wrote cluster-wide aggregated metrics to the existing Graphite cluster under a new namespace to allow
comparisons of latency and accuracy
• Aggregated StatsD packets over time in several dimensions, including per-host and per-cluster
Wavefront started serving 20% of read traffic in March 2016
• Time series ingestion
• Integrated alarms
• Wavefront Salt module for alert, dashboard, and user management
• Grafana integration
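The fan-out described above can be pictured with a small routing sketch. This is not the statsrelay implementation: the cluster names, host names, and the simple hash-modulo sharding are assumptions made for illustration. Each StatsD line goes to exactly one shard in every destination cluster, so the old and new backends both see the full stream and can be compared.

```python
import hashlib
from typing import Dict, List


def pick_shard(metric_name: str, shards: List[str]) -> str:
    """Deterministically map a metric name to one shard of a cluster."""
    digest = hashlib.md5(metric_name.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


def route(line: str, clusters: Dict[str, List[str]]) -> Dict[str, str]:
    """Return the destination chosen in every cluster for one StatsD line."""
    # A StatsD line looks like "name:value|type"; sharding uses the name only,
    # so all samples of one metric land on the same aggregator.
    metric_name = line.split(":", 1)[0]
    return {name: pick_shard(metric_name, shards) for name, shards in clusters.items()}


# Hypothetical destinations, not Lyft's real topology.
clusters = {
    "graphite": ["statsd-graphite-1:8125", "statsd-graphite-2:8125", "statsd-graphite-3:8125"],
    "wavefront": ["wavefront-proxy-1:2878"],
}

print(route("rides.requested:1|c", clusters))
```

Hash-modulo sharding reshuffles most keys when the shard count changes; a production relay would more likely use a consistent-hash ring to limit that movement.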
13©2018 VMware, Inc.
So Many Metrics!
System metrics
• CollectD
• Custom scripts
• Bash functions
Application metrics
Core libraries instrumentation
Scraper scripts - pull metrics
• CloudWatch metrics
• Google Cloud Platform metrics
• Mongo telemetry
Container-generated parameters (future Kubernetes)
14©2018 VMware, Inc.
So Many Metrics!
Billions per second,
even with aggregation
and sampling
Graphite
meltdown!
Opt-in mechanism for
per-host and per-second
data
Per-instance
cardinality limits
Only ~300K metrics per
second, thanks to rollups
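A per-instance cardinality limit can be sketched as below. The class name, the limit value, and the drop-and-count policy are assumptions for illustration; the slides do not say how Lyft enforces the limit.

```python
from typing import Dict, Set


class CardinalityLimiter:
    """Caps the number of distinct metric names accepted from each instance."""

    def __init__(self, max_series_per_instance: int) -> None:
        self.max_series = max_series_per_instance
        self.seen: Dict[str, Set[str]] = {}
        self.dropped = 0  # how many samples were rejected

    def allow(self, instance: str, metric_name: str) -> bool:
        names = self.seen.setdefault(instance, set())
        if metric_name in names:
            return True  # already-known series are always accepted
        if len(names) >= self.max_series:
            self.dropped += 1  # new series beyond the cap is rejected
            return False
        names.add(metric_name)
        return True


limiter = CardinalityLimiter(max_series_per_instance=2)
for name in ["cpu.user", "cpu.system", "request.id.12345", "cpu.user"]:
    print(name, limiter.allow("host-1", name))
```

Counting rejected samples instead of dropping them silently gives the owning team a signal that a service is emitting unbounded metric names.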
15©2018 VMware, Inc.
Wavefront by VMware at Lyft Today
• System Monitoring
• Application monitoring
• > 500,000 metrics/second - peaked at 800,000
• 1,000+ engineers using Wavefront
• 1,000+ Wavefront dashboards
• 18,000+ Wavefront alerts
16©2018 VMware, Inc.
Python and Golang
• Common base libraries for each language
• Hundreds of microservices, one monorepo (that is getting decomposed)
• Frequent deploys
• Common “base” deploy, Salt (masterless), AWS public cloud
• DevOps (Infrastructure team) enables other teams rather than operating their services
• Teams are responsible for their service
• No SRE
Today Lyft Relies on Wavefront for Time Series and Alarming
17©2018 VMware, Inc.
How Does the Metrics Aggregation Pipeline at Lyft Work?
Cascaded Approach
github.com/lyft/statsrelay.git
github.com/lyft/statsite.git
18©2018 VMware, Inc.
Service level aggregates centrally - correct histograms
Per host aggregates locally
Default metrics aggregated at 60s interval
The 1-second interval is possible with a whitelist
Data Aggregation
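Why service-level aggregates need to be computed centrally can be shown with a small worked example. The sample latencies and the nearest-rank percentile definition are assumptions chosen for illustration: averaging per-host percentiles gives a different answer from computing the percentile over the merged samples, because hosts carry different request counts.

```python
import math
from typing import List


def percentile(samples: List[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds.
host_a = [10.0] * 99 + [500.0]  # 100 requests, mostly fast
host_b = [400.0] * 10           # 10 requests, all slow

per_host_p95 = [percentile(host_a, 95), percentile(host_b, 95)]
average_of_p95 = sum(per_host_p95) / len(per_host_p95)
central_p95 = percentile(host_a + host_b, 95)

print("per-host p95:", per_host_p95)
print("average of per-host p95:", average_of_p95)
print("p95 over all samples:", central_p95)
```

Here the per-host values are 10 ms and 400 ms, their average is 205 ms, and the true p95 over all 110 requests is 400 ms. Only the central computation reflects what users of the service actually experienced.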
19©2018 VMware, Inc.
Transitioning from Graphite to Wavefront Format Is Easy
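The slide's illustration is not part of this transcript, so the following is only a sketch of the idea. A Graphite plaintext line is `path value timestamp`; a Wavefront data-format line is `metric value timestamp source=<host>` followed by optional point tags. The function name, the sample metric, and the `env` tag are hypothetical.

```python
from typing import Dict, Optional


def graphite_to_wavefront(line: str, source: str, tags: Optional[Dict[str, str]] = None) -> str:
    """Convert one Graphite plaintext line into a Wavefront data-format line."""
    path, value, timestamp = line.split()
    float(value)    # raises ValueError if the value is not numeric
    int(timestamp)  # raises ValueError if the timestamp is not an integer
    parts = [path, value, timestamp, f"source={source}"]
    for key in sorted(tags or {}):
        parts.append(f'{key}="{tags[key]}"')
    return " ".join(parts)


print(graphite_to_wavefront("rides.requested.count 42 1520000000", "api-host-1", {"env": "prod"}))
```

The main change is that the host moves out of the metric path and into `source=`, and other dimensions become tags rather than path segments.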
20©2018 VMware, Inc.
Lyft Business Metrics in Wavefront
Passenger metrics
• New user signups / installs / activations
• Current passengers with the app open
Driver metrics
• New driver applications / activations
• Current drivers with the app open
Ride metrics
• Rides requested / accepted / dropped off / canceled / lapsed
• Lyft Line rides dropped off
• Paid vs. Couponed rides dropped off
Marketplace metrics
• Drivers available
• Drivers en route
• Driver utilization %
21©2018 VMware, Inc.
Passenger - PAX Client Metrics - Wavefront Integration with Grafana
22©2018 VMware, Inc.
Techniques Used at Lyft to Avoid
Production Incidents with Hundreds
of Microservices
23©2018 VMware, Inc.
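from lyft_stats import stats

# Obtain a stats handler named 'test_prefix'
handler = stats.get_stats('test_prefix')
lookup = {'foo': 'bar'}
try:
    # Time the enclosed block and report it as 'sample.timer'
    with handler.timer('sample.timer'):
        # do other things
        print(lookup['test'])
except KeyError:
    # 'test' is not a key, so the failed lookup is counted here
    handler.incr('illegal.access')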
Easy Application Metrics Collection - Python Metrics Library
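`lyft_stats` is an internal library, so its implementation is not shown in the deck. A minimal stand-in with the same `timer` and `incr` surface, assuming it emits plain StatsD packets over UDP to a local agent, might look like this. The class name, the default address, and the swallow-all-socket-errors policy are assumptions.

```python
import socket
import time
from contextlib import contextmanager


class StatsHandler:
    """Hypothetical stand-in for a StatsD-backed stats handler."""

    def __init__(self, prefix, host="127.0.0.1", port=8125):
        self.prefix = prefix
        self.address = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def format(self, name, value, kind):
        # StatsD wire format: "<name>:<value>|<type>"
        return f"{self.prefix}.{name}:{value}|{kind}"

    def _send(self, payload):
        try:
            self.sock.sendto(payload.encode("utf-8"), self.address)
        except OSError:
            pass  # emitting a metric must never break the caller

    def incr(self, name, count=1):
        self._send(self.format(name, count, "c"))

    @contextmanager
    def timer(self, name):
        start = time.monotonic()
        try:
            yield
        finally:
            # Report the elapsed time even if the block raised.
            elapsed_ms = (time.monotonic() - start) * 1000.0
            self._send(self.format(name, f"{elapsed_ms:.3f}", "ms"))


handler = StatsHandler("test_prefix")
with handler.timer("sample.timer"):
    pass  # do other things
handler.incr("illegal.access")
```

UDP keeps instrumentation cheap and fire-and-forget: if no agent is listening, packets are lost but the application is unaffected.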
24©2018 VMware, Inc.
Easy Metrics Collection Go Metrics Library
https://github.com/lyft/gostats
25©2018 VMware, Inc.
Observability in the Age of Microservice Mesh
26©2018 VMware, Inc.
Envoy Primer
• Envoy Proxy: a modern, high-performance, small-footprint edge and service proxy
designed for cloud-native applications
• Out-of-process architecture (sidecar)
• C++11 code base
• Service discovery and active/passive health checking
• Advanced load balancing
• Edge and service proxy
• HTTP L7 filter architecture
• Best in class Observability (tracing, logging, and stats)
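As a small illustration of the stats side, Envoy's admin endpoint `/stats` returns plain-text lines of the form `name: value`. The sketch below parses such text and derives an upstream 5xx ratio. The cluster name `rides` and the sample numbers are made up, and how Lyft actually ships these stats into its pipeline is not covered here.

```python
from typing import Dict


def parse_envoy_stats(text: str) -> Dict[str, int]:
    """Parse 'name: value' lines, keeping only integer-valued stats."""
    stats: Dict[str, int] = {}
    for raw in text.splitlines():
        name, sep, value = raw.partition(": ")
        if not sep:
            continue  # not a 'name: value' line
        try:
            stats[name.strip()] = int(value.strip())
        except ValueError:
            continue  # histogram summaries and text values are skipped
    return stats


# Hypothetical sample of admin /stats output.
sample = """cluster.rides.upstream_rq_total: 2000
cluster.rides.upstream_rq_5xx: 10
server.uptime: 3600
"""

stats = parse_envoy_stats(sample)
error_ratio = stats["cluster.rides.upstream_rq_5xx"] / stats["cluster.rides.upstream_rq_total"]
print(f"upstream 5xx ratio: {error_ratio:.4f}")
```

Because the sidecar reports the same stat names for every service, one parser and one dashboard layout can cover all of them.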
27©2018 VMware, Inc.
Measure Everything!
28©2018 VMware, Inc.
• Monolithic repository for managing dashboards
• Close integration with our salt infrastructure
• Grafana and Wavefront modules for dashboard/alert management
• Dashboards/alerts defined as salt states (jinja2+yaml)
• Rigorous code review process
• Consistent look and feel
• Distributed ownership
Managed Dashboards and Alarms Hub
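The deck defines dashboards as Salt states written in jinja2+yaml, which are not reproduced in this transcript. The same idea, one shared template rendered per service so every dashboard has identical panels, can be sketched in plain Python. The panel list, metric naming scheme, and field names below are hypothetical and are not the Wavefront or Grafana schema.

```python
from typing import Dict, List, Tuple

# One shared panel template; every service gets the same layout.
STANDARD_PANELS: List[Tuple[str, str]] = [
    ("Requests per second", "{service}.requests.count"),
    ("Error rate", "{service}.errors.count"),
    ("p95 latency (ms)", "{service}.latency.p95"),
]


def build_dashboard(service: str, owner_team: str) -> Dict:
    """Render the shared template for one service."""
    return {
        "title": f"{service} overview",
        "owner": owner_team,  # distributed ownership: each team owns its definition
        "panels": [
            {"title": title, "query": query.format(service=service)}
            for title, query in STANDARD_PANELS
        ],
    }


for svc, team in [("rides", "marketplace"), ("payments", "billing")]:
    dashboard = build_dashboard(svc, team)
    print(dashboard["title"], [panel["query"] for panel in dashboard["panels"]])
```

Keeping the definitions in a repository means a change to the shared template goes through code review once and then applies to every service.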
29©2018 VMware, Inc.
Consistent Look and Feel Across All Our Microservices
30©2018 VMware, Inc.
Envoy Global Health Dashboard
Wavefront Integration with Grafana
31©2018 VMware, Inc.
Metrics-Based Alerting Using Wavefront
32©2018 VMware, Inc.
Metrics-Based Alerting Using Wavefront
33©2018 VMware, Inc.
Metrics-Based Alerting Using Wavefront
34©2018 VMware, Inc.
Enrichment
35©2018 VMware, Inc.
Benefits of Wavefront for Lyft
• Avoids multiple-system syndrome
- Fewer tools for triage, better and faster resolution
- Context switching is expensive
- Wavefront puts metrics and data from numerous sources up front
and makes them available in a single click
• Real-time visibility into the performance of our key services
• Highly efficient Alert Engine
- Lyft relies on Wavefront to create smart alerts that dynamically filter
noise and capture genuine anomalies
• Powerful metrics explorer and chart view
36©2018 VMware, Inc.
Finding a Needle in a Haystack
37©2018 VMware, Inc.
Help Us Arrive at Root Cause Quickly
38©2018 VMware, Inc.
Tight Coupling
39©2018 VMware, Inc.
Big Wins with Wavefront
Ability to monitor
releases to help
engineers make
accurate decisions
Predict the
future
Empirical data to
guide decision
making
Robust alerting - for
when you’re not
watching
A first-class citizen for answering
questions such as “Is Lyft up?” or
“How many rides did we
complete?”
Intuitive yet powerful
query language
Lyft - One billion rides - with wavefront
