Lyft - One Billion Rides - with Wavefront
- 2. 2©2018 VMware, Inc.
About Me
• HOBBIES: guitar, golf, skateboarding, cooking/baking, and automobiles
• FORMER: Tech Lead of Lyft Observability
• CURRENTLY: working closely with the Express Drive team, Lyft’s vehicle rental program for drivers
• OVER 9 years in tech
• 2010-2014: Zynga
• EARLY: Lyft Infrastructure (DevOps) Engineer
- 4.
About Lyft
• Transportation as a service
• “Your friend with a car” redefines personal transportation
• Founded in San Francisco in 2012
• Currently serving the US and Canada
• Available in 300+ cities, with 1500 drivers at any minute
- 5.
Lyft – More Fun Facts
• 250,000 Lyft community members gave up their cars at the beginning of 2017
• The Lyft community will take 1 million cars off the road by the end of 2019
• Autonomous vehicle fleets will become widespread and will account for the majority of Lyft rides within five years
• By 2025, private car ownership will all but end in major US cities
• Lyft rides are carbon-neutral
• Lyft Bikes and Scooters will be our solution to the last-mile commute
- 6.
Lyft Stats – in 2017
• Annual Rides (MM)
• New Year’s Eve Rides (MM)
• Employees: 2K+
• Halloween Drop-Offs/sec (K+)
• Microservices: 200+
• EC2 instances: 10,000+
• Lots* of logs and metrics
- 7.
Observability Team at Lyft
Founded in early 2016, a small and cohesive team of 5 engineers
Team collectively owns
• Client and Server logging infrastructure
• Metric ingest pipeline and real-time aggregation
• Distributed Tracing
• PagerDuty interactions and integrations
• The real-time business metric framework
• Dashboards and user experience with monitoring and alarming setup
• Logging and metric-based alerting
• Baseline monitoring systems for all microservices
• Core libraries
- 9.
Before Wavefront by VMware
Challenges with Open Source Tooling
• Manual maintenance
• Resource-hungry, which drives cost
• Query performance issues
• Ingest performance issues
• Hard to scale
• Sharding handled externally
Reliability Performance Maintainability
- 10.
Observability Challenges Early in 2015
• Lyft used Graphite (and whisper files) located on i2 instances
• Hard to scale; we handled sharding externally
• Relays provided poor control over fan-out of data to alternate destinations
• We computed top-level aggregates from aggregates that already existed
• This stack processed locally aggregated per-minute samples
- 11.
Observability Challenges Early in 2016
• Replace the poorly scaling Python-based intermediaries with more efficient components
• Reduce end-to-end latency for the site from >3m to <2m
• Produce improved and accurate top-level aggregates: p95/99/999/9999
- 12.
Early 2016 – Enter Wavefront
• Replaced the Node.js-based StatsD with a C implementation of the StatsD server: lower overhead, better data quality
• Added fan-out for StatsD traffic to other clusters or receivers, e.g., Wavefront
• Wrote cluster-wide aggregated metrics to the existing Graphite cluster under a new namespace to allow comparisons of latency and accuracy
• Aggregated StatsD packets over time in several dimensions, including per-host and per-cluster
Wavefront started serving 20% of read traffic in March 2016
• Time series ingestion
• Integrated alarms
• Wavefront salt module for alert, dashboard and user management
• Grafana integration
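The fan-out step above can be sketched in a few lines. This is a minimal illustration only, not the actual relay: it assumes the common StatsD line format (`name:value|type`, with an optional `|@rate`), and the two receiver lists are hypothetical stand-ins for the existing cluster and a Wavefront-bound receiver.

```python
from typing import Callable, Iterable, List, NamedTuple, Optional


class Sample(NamedTuple):
    name: str
    value: float
    kind: str           # e.g. "c" counter, "ms" timer, "g" gauge
    sample_rate: float


def parse_statsd_line(line: str) -> Optional[Sample]:
    """Parse one StatsD line such as 'api.latency:12.5|ms|@0.1'."""
    try:
        name, rest = line.strip().split(":", 1)
        fields = rest.split("|")
        value = float(fields[0])
        kind = fields[1]
        rate = 1.0
        for extra in fields[2:]:
            if extra.startswith("@"):
                rate = float(extra[1:])
        if not name or not kind:
            return None
        return Sample(name, value, kind, rate)
    except (ValueError, IndexError):
        # Malformed line: signal the caller to drop it.
        return None


def fan_out(lines: Iterable[str],
            receivers: List[Callable[[Sample], None]]) -> int:
    """Send every valid sample to every receiver; return the dropped-line count."""
    dropped = 0
    for line in lines:
        sample = parse_statsd_line(line)
        if sample is None:
            dropped += 1
            continue
        for receive in receivers:
            receive(sample)
    return dropped


graphite_bound: List[Sample] = []   # hypothetical stand-in for the existing cluster
wavefront_bound: List[Sample] = []  # hypothetical stand-in for a Wavefront receiver
dropped = fan_out(
    ["rides.requested:1|c", "api.latency:12.5|ms|@0.1", "garbage"],
    [graphite_bound.append, wavefront_bound.append],
)
```

Because every receiver sees the same samples, the old and new backends can be compared side by side for latency and accuracy.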
- 13.
So Many Metrics!
System metrics
• CollectD
• Custom scripts
• Bash functions
Application metrics
Core libraries instrumentation
Scraper scripts - pull metrics
• CloudWatch metrics
• Google Cloud Platform metrics
• Mongo telemetry
Container-generated parameters (future: Kubernetes)
- 14.
So Many Metrics!
• Billions per second, even with aggregation and sampling: Graphite meltdown!
• Opt-in mechanism for per-host and per-second data
• Per-instance cardinality limits
• Only ~300K metrics per second, thanks to rollups
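A per-instance cardinality limit could look roughly like the sketch below. The class name, the limit value, and the drop-on-overflow policy are all assumptions made for illustration; the slides do not describe how the limit is enforced.

```python
from typing import Dict, Set


class CardinalityLimiter:
    """Admit at most `limit` distinct metric names per instance (hypothetical policy)."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self._seen: Dict[str, Set[str]] = {}
        self.rejected = 0

    def admit(self, instance: str, metric: str) -> bool:
        names = self._seen.setdefault(instance, set())
        if metric in names:
            return True             # already tracked: always admitted
        if len(names) >= self.limit:
            self.rejected += 1      # new series over budget: dropped
            return False
        names.add(metric)
        return True


limiter = CardinalityLimiter(limit=2)
results = [
    limiter.admit("host-a", "cpu.user"),
    limiter.admit("host-a", "cpu.system"),
    limiter.admit("host-a", "user.12345.clicks"),  # third distinct name on host-a
    limiter.admit("host-a", "cpu.user"),           # existing name
    limiter.admit("host-b", "user.12345.clicks"),  # each instance has its own budget
]
```

Under this policy, a bug that embeds a user ID in a metric name exhausts only that instance's budget instead of flooding the backend with new series.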
- 15.
Wavefront by VMware at Lyft Today
• System Monitoring
• Application monitoring
• > 500,000 metrics/second - peaked at 800,000
• 1,000+ engineers using Wavefront
• 1,000+ Wavefront dashboards
• 18,000+ Wavefront alerts
- 16.
Today Lyft Relies on Wavefront for Time Series and Alarming
• Python and Golang
• Common base libraries for each language
• Hundreds of microservices, one monorepo (that is getting decomposed)
• Frequent deploys
• Common “base” deploy, Salt (masterless), AWS public cloud
• DevOps (Infrastructure team) has the role of enabling others, not operating their services
• Teams are responsible for their own services
• No SRE
- 17.
How Does the Metrics Aggregation Pipeline at Lyft Work?
Cascaded Approach
github.com/lyft/statsrelay.git
github.com/lyft/statsite.git
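One way to picture a cascaded relay-then-aggregate design: the relay tier routes each metric by name, so every sample of a given metric reaches the same aggregator. The sketch below assumes a plain hash-modulo scheme and hypothetical aggregator names; it is not taken from either repository, and a production relay would more likely use a consistent-hash ring so that adding a node moves fewer keys.

```python
import hashlib
from typing import Dict, List


def pick_aggregator(metric_name: str, aggregators: List[str]) -> str:
    """Map a metric name to one aggregator using a stable hash."""
    digest = hashlib.md5(metric_name.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(aggregators)
    return aggregators[index]


def route(lines: List[str], aggregators: List[str]) -> Dict[str, List[str]]:
    """Group StatsD lines by the aggregator responsible for their metric name."""
    routed: Dict[str, List[str]] = {a: [] for a in aggregators}
    for line in lines:
        name = line.split(":", 1)[0]
        routed[pick_aggregator(name, aggregators)].append(line)
    return routed


aggregators = ["aggregator-0", "aggregator-1", "aggregator-2"]  # hypothetical names
lines = ["api.latency:12|ms", "api.latency:48|ms", "rides.requested:1|c"]
routed = route(lines, aggregators)
```

Keeping all samples of a metric on one node is what lets the second tier compute service-wide percentiles from raw samples.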
- 18.
Data Aggregation
• Service-level aggregates computed centrally, for correct histograms
• Per-host aggregates computed locally
• Default metrics aggregated at a 60s interval
• A 1-second interval is possible with a whitelist
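A small worked example shows why service-level percentiles need central aggregation. The latency values are made up, and the nearest-rank percentile definition is an assumption chosen for simplicity.

```python
from typing import Dict, List


def percentile(samples: List[float], percent: int) -> float:
    """Nearest-rank percentile; `percent` is an integer from 1 to 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = -(-percent * len(ordered) // 100)  # ceiling division in integers
    return ordered[max(rank, 1) - 1]


# Made-up latencies (ms) from two hosts of the same service.
per_host: Dict[str, List[float]] = {
    "host-a": [10.0] * 19 + [20.0],     # healthy host
    "host-b": [10.0] * 10 + [200.0] * 10,  # degraded host
}

# Local view: one p95 per host, then averaged (the inaccurate shortcut).
local_p95 = {host: percentile(values, 95) for host, values in per_host.items()}
averaged_p95 = sum(local_p95.values()) / len(local_p95)

# Central view: merge raw samples first, then take the percentile once.
merged = [v for values in per_host.values() for v in values]
central_p95 = percentile(merged, 95)

print(local_p95, averaged_p95, central_p95)
```

Here the averaged per-host figure is 105 ms while the p95 over the merged samples is 200 ms, so averaging percentiles understates the tail that users actually experience.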
- 20.
Lyft Business Metrics in Wavefront
Passenger metrics
• New user signups / installs / activations
• Current passengers with the app open
Driver metrics
• New driver applications / activations
• Current drivers with the app open
Ride metrics
• Rides requested / accepted / dropped off / canceled / lapsed
• Lyft Line rides dropped off
• Paid vs. Couponed rides dropped off
Marketplace metrics
• Drivers available
• Drivers en route
• Driver utilization %
- 23.
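Easy Application Metrics Collection - Python Metrics Library
    from lyft_stats import stats

    handler = stats.get_stats('test_prefix')
    data = {'foo': 'bar'}  # renamed from `map` to avoid shadowing the builtin
    try:
        # time the block and report it as 'sample.timer'
        with handler.timer('sample.timer'):
            # do other things
            print(data['test'])  # 'test' is missing, so this raises KeyError
    except KeyError:
        # count the failed lookup
        handler.incr('illegal.access')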
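For readers without access to the internal library, the sketch below shows how a handler with `timer` and `incr` methods might behave. `SketchStats` is a hypothetical stand-in written for this example and is not the `lyft_stats` API; it collects StatsD-style lines in a list where a real client would send them over the network.

```python
import time
from contextlib import contextmanager
from typing import Iterator, List


class SketchStats:
    """Hypothetical stand-in for a prefixed stats handler; not the lyft_stats API."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix
        self.lines: List[str] = []  # a real client would send these to a StatsD server

    def incr(self, name: str, count: int = 1) -> None:
        """Record a counter increment."""
        self.lines.append(f"{self.prefix}.{name}:{count}|c")

    @contextmanager
    def timer(self, name: str) -> Iterator[None]:
        """Time the enclosed block and record the duration in milliseconds."""
        start = time.monotonic()
        try:
            yield
        finally:
            # Runs even when the block raises, so failed calls are timed too.
            elapsed_ms = (time.monotonic() - start) * 1000.0
            self.lines.append(f"{self.prefix}.{name}:{elapsed_ms:.3f}|ms")


handler = SketchStats("test_prefix")
data = {"foo": "bar"}
try:
    with handler.timer("sample.timer"):
        print(data["test"])  # missing key: raises KeyError
except KeyError:
    handler.incr("illegal.access")
```

After this runs, `handler.lines` holds one timer line followed by one counter line, mirroring the two metrics the slide's snippet emits.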
- 26.
Envoy Primer
• Envoy Proxy: a modern, high-performance, small-footprint edge and service proxy designed for cloud-native applications
• Out-of-process architecture (sidecar)
• C++11 code base
• Service discovery and active/passive health checking
• Advanced load balancing
• Edge and service proxy
• HTTP L7 filter architecture
• Best-in-class observability (tracing, logging, and stats)
- 28.
Managed Dashboards and Alarms Hub
• Monolithic repository for managing dashboards
• Close integration with our Salt infrastructure
• Grafana and Wavefront modules for dashboard/alert management
• Dashboards/alerts defined as Salt states (jinja2+yaml)
• Rigorous code review process
• Consistent look and feel
• Distributed ownership
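The idea of alerts defined as templated, reviewable code can be illustrated without Salt. The sketch below swaps jinja2+yaml for plain Python string formatting and JSON so that it runs with the standard library alone. The field names, metric names, team names, and condition syntax are all hypothetical and do not reflect the Wavefront alert schema or the actual Salt states.

```python
import json
from typing import Dict, List, Tuple

# Hypothetical baseline alert shared by every service.
ALERT_TEMPLATE: Dict[str, object] = {
    "name": "{service}: elevated 5xx rate",
    "condition": "{service}.http.5xx.rate > {threshold}",  # placeholder expression
    "minutes": 5,
    "owner": "{team}",
}


def render_alert(service: str, team: str, threshold: int) -> Dict[str, object]:
    """Fill the shared template for one service."""
    rendered: Dict[str, object] = {}
    for key, value in ALERT_TEMPLATE.items():
        if isinstance(value, str):
            rendered[key] = value.format(
                service=service, team=team, threshold=threshold
            )
        else:
            rendered[key] = value
    return rendered


# Hypothetical services and owning teams.
services: List[Tuple[str, str, int]] = [
    ("rides", "marketplace", 50),
    ("payments", "billing", 10),
]
alerts = [render_alert(*entry) for entry in services]

# A stable, sorted rendering keeps code-review diffs small.
document = json.dumps(alerts, indent=2, sort_keys=True)
print(document)
```

Generating every service's alerts from one template is what gives the consistent look and feel, while each team still owns the parameters for its own service.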
- 35.
Benefits of Wavefront for Lyft
• Multiple-system syndrome
- Fewer tools for triage, better and faster resolution
- Context switching is expensive
- Wavefront puts metrics and data from numerous sources up front and makes them available in a single click
• Real-time visibility into the performance of our key services
• Highly efficient Alert Engine
- Lyft relies on Wavefront to create smart alerts that dynamically filter noise and capture true anomalies
• Powerful metrics explorer and chart view
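One simple form of noise filtering is to fire only when a condition holds for several consecutive minutes. The sketch below is an assumption about what such a rule could look like, with made-up data; it does not describe how the Wavefront alert engine is implemented.

```python
from typing import List


def sustained_breaches(values: List[float], threshold: float, minutes: int) -> List[int]:
    """Return the indexes where `values` has exceeded `threshold`
    for at least `minutes` consecutive points (one point per minute)."""
    firing: List[int] = []
    streak = 0
    for index, value in enumerate(values):
        streak = streak + 1 if value > threshold else 0
        if streak >= minutes:
            firing.append(index)
    return firing


error_rate = [1, 9, 1, 1, 8, 9, 7, 1]  # made-up data, one point per minute
firing_at = sustained_breaches(error_rate, threshold=5, minutes=3)
naive_at = sustained_breaches(error_rate, threshold=5, minutes=1)
print(firing_at, naive_at)
```

With a three-minute window the single spike at minute 1 is ignored and the alert fires only at minute 6, whereas a one-minute window fires four times on the same data.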
- 39.
Big Wins with Wavefront
• Ability to monitor releases to help engineers make accurate decisions
• Predict the future
• Empirical data to guide decision making
• Robust alerting - for when you’re not watching
• A first-class citizen for answering questions such as “Is Lyft up?” or “How many rides did we complete?”
• Intuitive yet powerful query language