Lyft - One billion rides - with wavefront

©2018 VMware, Inc.
Improving Lives with the
World’s Best Transportation

About Me
• Hobbies include guitar, golf, skateboarding, cooking/baking, and automobiles
• Former Tech Lead of Lyft Observability
• Currently working closely with the Express Drive team – Lyft’s vehicle rental program for drivers
• Over 9 years in tech
• 2010-2014: Zynga
• Early Lyft Infrastructure (DevOps) Engineer

One Billion Rides
2018

About Lyft
• Transportation as a service
• “Your friend with a car,” redefines
personal transportation
• Founded in San Francisco 2012
• Currently serving in US and Canada
• Available in 300+ cities, with 1,500 drivers at any minute

Lyft – More Fun Facts
• 250,000 Lyft community members gave up their cars at the beginning of 2017
• The Lyft community will take 1 million cars off the road by the end of 2019
• Autonomous vehicle fleets will become widespread & will account for the majority of Lyft rides within
five years
• By 2025, private car ownership will all-but end in major US cities
• Lyft rides are carbon-neutral
• Lyft Bikes and Scooters will be our solution to last mile commute

Lyft Stats – in 2017
• Annual rides: MM
• New Year’s Eve rides: MM
• Employees: 2K+
• Halloween drop-offs/sec: K+
• Microservices: 200+
• EC2 instances: 10,000+
• Lots* of logs and metrics

Observability Team at Lyft
Founded in early 2016, a small and cohesive team of 5 engineers
Team collectively owns
• Client and Server logging infrastructure
• Metric ingest pipeline and real-time aggregation
• Distributed Tracing
• PagerDuty interactions and integrations
• The real-time business metric framework
• Dashboards and user experience with monitoring and alarming setup
• Logging and metric-based alerting
• Baseline monitoring systems for all microservices
• Core libraries

Metrics at Lyft:
The Before Times

Before Wavefront by VMware
Challenges with Open Source Tooling
• Manual maintenance
• Resource-hungry, which drives cost
• Query performance issues
• Ingest performance issues
• Hard to scale
• Sharding handled externally
Reliability, Performance, Maintainability

Observability Challenges Early in 2015
• Lyft used Graphite (with whisper files) running on i2 instances
• Hard to scale; we handled sharding externally
• Relays provided poor control over fan-out of data to alternate destinations
• We computed top-level aggregates from the aggregates that already existed
• This stack processed locally aggregated per-minute samples

Observability Challenges Early in 2016
• Replace the poorly scaling Python-based intermediaries with more efficient components
• Reduce end-to-end latency for the site from >3m to <2m
• Produce improved and accurate top-level aggregates: p95/p99/p99.9/p99.99

Early 2016 – Enter Wavefront
• Node.js based StatsD replaced by C implementation of StatsD server – lower overhead, better data
quality
• Added fan-out for StatsD traffic to other clusters or receivers, e.g., Wavefront
• Wrote cluster-wide aggregated metrics to the existing Graphite cluster under a new namespace to allow
comparisons of latency and accuracy
• Aggregated StatsD packets over time in several dimensions, including per-host and per-cluster
Wavefront started serving 20% of read traffic in March 2016
• Time series ingestion
• Integrated alarms
• Wavefront salt module for alert, dashboard and user management
• Grafana integration
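
The per-host and per-cluster aggregation of StatsD packets mentioned above can be sketched roughly as follows. This is an illustration only, not the code Lyft runs: the packet lines use the standard StatsD `name:value|type` shape, while the function names, host names, and metric names are assumptions made for the example, and only counters are handled.

```python
from collections import defaultdict


def parse_statsd(line):
    """Parse one StatsD line such as 'api.requests:1|c' into (name, value, type)."""
    name, rest = line.split(":", 1)
    value, metric_type = rest.split("|")[:2]  # ignore an optional '|@rate' suffix
    return name, float(value), metric_type


def aggregate_counters(packets):
    """Sum counter packets in two dimensions: per host and per cluster.

    packets: iterable of (host, statsd_line) pairs (hypothetical input shape).
    """
    per_host = defaultdict(float)
    per_cluster = defaultdict(float)
    for host, line in packets:
        name, value, metric_type = parse_statsd(line)
        if metric_type != "c":  # this sketch only aggregates counters
            continue
        per_host[(host, name)] += value
        per_cluster[name] += value
    return dict(per_host), dict(per_cluster)


packets = [
    ("host-a", "api.requests:1|c"),
    ("host-a", "api.requests:2|c"),
    ("host-b", "api.requests:4|c"),
    ("host-b", "api.latency:12|ms"),  # a timer, skipped by this sketch
]
per_host, per_cluster = aggregate_counters(packets)
```

Keeping both views means the cluster-wide total can be served by default while the per-host breakdown stays available for debugging.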

So Many Metrics!
System metrics
• CollectD
• Custom scripts
• Bash functions
Applications metrics
Core libraries instrumentation
Scraper scripts - pull metrics
• Cloudwatch metrics
• Google Cloud Platform metrics
• Mongo telemetry
Containers generated parameters (future Kubernetes)

So Many Metrics!
• Billions per second, even with aggregation and sampling
• Graphite meltdown!
• Opt-in mechanism for per-host and per-second data
• Per-instance cardinality limits
• Only ~300K metrics per second, thanks to rollups
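
One way to picture the per-instance cardinality limit and the opt-in for per-host data is the sketch below. The class name, the `limit` value, and the allowlist are all hypothetical; the slides do not describe how Lyft actually enforces these limits.

```python
class CardinalityLimiter:
    """Admit at most `limit` distinct metric names per instance (hypothetical policy)."""

    def __init__(self, limit, per_host_allowlist=()):
        self.limit = limit
        # Metrics that opted in to being kept at per-host granularity.
        self.per_host_allowlist = set(per_host_allowlist)
        self.seen = {}  # instance id -> set of admitted metric names

    def admit(self, instance, metric):
        """Return True if the metric may be emitted for this instance."""
        names = self.seen.setdefault(instance, set())
        if metric in names:
            return True           # already counted against the limit
        if len(names) >= self.limit:
            return False          # a new name would exceed the limit
        names.add(metric)
        return True

    def wants_per_host(self, metric):
        """Return True only for metrics that opted in to per-host data."""
        return metric in self.per_host_allowlist


limiter = CardinalityLimiter(limit=2, per_host_allowlist={"api.latency"})
results = [limiter.admit("i-1", m) for m in ["a", "b", "c", "a"]]
```

Here the third distinct name on instance `i-1` is rejected, while a name that was already admitted keeps flowing.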

Wavefront by VMware at Lyft Today
• System Monitoring
• Application monitoring
• > 500,000 metrics/second - peaked at 800,000
• 1,000+ engineers using Wavefront
• 1,000+ Wavefront dashboards
• 18,000+ Wavefront alerts

Today Lyft Relies on Wavefront for Time Series and Alarming
• Python and Golang
• Common base libraries for each language
• Hundreds of microservices, one monorepo (that is getting decomposed)
• Frequent deploys
• Common “base” deploy, Salt (masterless), AWS public cloud
• The DevOps (Infrastructure) team's role is to enable others, not to operate
• Teams are responsible for their service
• No SRE

How the Metrics Aggregation Pipeline at Lyft Works
Cascaded Approach
github.com/lyft/statsrelay.git
github.com/lyft/statsite.git

Data Aggregation
• Service-level aggregates computed centrally - correct histograms
• Per-host aggregates computed locally
• Default metrics aggregated at a 60s interval
• A 1-second interval is possible with a whitelist
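
A small worked example shows why service-level percentiles need central aggregation. The timings, host names, and the nearest-rank percentile helper below are invented for illustration; they are not Lyft data or Lyft code.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list of samples (p in 0..100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request timings in milliseconds.
host_timings = {
    "host-a": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],  # busy, healthy host
    "host-b": [10, 500],                                  # quiet host with one slow request
}

# Naive approach: average the p90 that each host computed locally.
naive_p90 = sum(percentile(t, 90) for t in host_timings.values()) / len(host_timings)

# Central approach: merge the raw timings first, then compute one p90.
merged = [t for timings in host_timings.values() for t in timings]
service_p90 = percentile(merged, 90)
```

With these numbers the averaged per-host value is 259.0 ms, while the p90 over the merged samples is 19 ms. The quiet host's single slow request dominates the naive figure, which is why percentiles cannot be combined by averaging.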

Transitioning from Graphite to Wavefront Format Is Easy
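
As a rough sketch of why the transition is straightforward: a Graphite plaintext line is `path value timestamp`, and a Wavefront data line adds a `source=` field and optional point tags. The function name and the sample metric, source, and tag below are assumptions for the example, not Lyft's converter.

```python
def graphite_to_wavefront(line, source, tags=None):
    """Convert a Graphite plaintext line ('path value timestamp') to a Wavefront data line."""
    path, value, timestamp = line.split()
    parts = [path, value, timestamp, f"source={source}"]
    # Point tags are emitted as key="value" pairs, sorted for stable output.
    for key, val in sorted((tags or {}).items()):
        parts.append(f'{key}="{val}"')
    return " ".join(parts)


wf_line = graphite_to_wavefront(
    "api.requests.count 42 1520000000",
    source="host-a",
    tags={"env": "prod"},
)
```

The metric name, value, and timestamp carry over unchanged, so existing Graphite namespaces can stay the same during a migration.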

Lyft Business Metrics in Wavefront
Passenger metrics
• New user signups / installs / activations
• Current passengers with the app open
Driver metrics
• New driver applications / activations
• Current drivers with the app open
Ride metrics
• Rides requested / accepted / dropped off / canceled / lapsed
• Lyft Line rides dropped off
• Paid vs. Couponed rides dropped off
Marketplace metrics
• Drivers available
• Drivers en route
• Driver utilization %

Passenger - PAX Client Metrics - Wavefront Integration with Grafana

Techniques Used at Lyft to Avoid Production Incidents with Hundreds of Microservices

Easy Application Metrics Collection - Python Metrics Library
from lyft_stats import stats

handler = stats.get_stats('test_prefix')
lookup = {'foo': 'bar'}  # renamed from `map` to avoid shadowing the builtin
try:
    # time the block and report it as 'sample.timer'
    with handler.timer('sample.timer'):
        # do other things
        print(lookup['test'])
except KeyError:
    # count the failed lookup
    handler.incr('illegal.access')
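
Since `lyft_stats` is an internal library, the sketch below uses a small stand-in class to show how a `timer` context manager and an `incr` call of this shape could work. `StatsHandler` and its in-memory `lines` list are hypothetical; a real handler would send the StatsD lines over the network rather than store them.

```python
import time
from contextlib import contextmanager


class StatsHandler:
    """Stand-in handler that records StatsD-style lines in memory."""

    def __init__(self, prefix):
        self.prefix = prefix
        self.lines = []

    def incr(self, name, count=1):
        """Record a counter increment."""
        self.lines.append(f"{self.prefix}.{name}:{count}|c")

    @contextmanager
    def timer(self, name):
        """Record how long the wrapped block took, even if it raises."""
        start = time.monotonic()
        try:
            yield
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000
            self.lines.append(f"{self.prefix}.{name}:{elapsed_ms:.3f}|ms")


handler = StatsHandler("test_prefix")
lookup = {"foo": "bar"}
try:
    with handler.timer("sample.timer"):
        lookup["test"]  # raises KeyError
except KeyError:
    handler.incr("illegal.access")
```

Because the timer records in a `finally` clause, the timing is still captured when the block fails, and the counter then records the failure itself.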

Easy Metrics Collection Go Metrics Library
https://github.com/lyft/gostats

Observability in the Age of Microservice Mesh

Envoy Primer
• Envoy Proxy: a modern, high-performance, small-footprint edge and service proxy
designed for cloud-native applications
• Out-of-process architecture (sidecar)
• C++11 code base
• Service discovery and active/passive health checking
• Advanced load balancing
• Edge and service proxy
• HTTP L7 filter architecture
• Best-in-class observability (tracing, logging, and stats)

Measure Everything!

Managed Dashboards and Alarms Hub
• Monolithic repository for managing dashboards
• Close integration with our Salt infrastructure
• Grafana and Wavefront modules for dashboard/alert management
• Dashboards/alerts defined as Salt states (Jinja2 + YAML)
• Rigorous code review process
• Consistent look and feel
• Distributed ownership
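
The idea of defining alerts as reviewed code from a shared template can be sketched in a few lines. Lyft defines them as Salt states in Jinja2 and YAML; this example swaps in plain Python dictionaries and JSON so that it stays self-contained. The alert names, metric suffixes, thresholds, and team name are all hypothetical.

```python
import json

# Shared baseline: (alert suffix, metric suffix, threshold). Values are made up.
BASELINE_ALERTS = [
    ("high_error_rate", "errors.rate", 0.05),
    ("high_p99_latency_ms", "latency.p99", 500),
]


def baseline_alerts(service, owner):
    """Render the shared baseline alerts for one service and its owning team."""
    return [
        {
            "name": f"{service}.{suffix}",
            "metric": f"{service}.{metric}",
            "threshold": threshold,
            "owner": owner,
        }
        for suffix, metric, threshold in BASELINE_ALERTS
    ]


alerts = baseline_alerts("rides", owner="rides-team")
rendered = json.dumps(alerts, indent=2, sort_keys=True)
```

Generating every service's alerts from one template is what gives the consistent look and feel, while the `owner` field keeps responsibility with the team that runs the service.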

Consistent Look and Feel Across All Our Microservices

Envoy Global Health Dashboard
Wavefront Integration with Grafana

Metrics-Based Alerting Using Wavefront

Enrichment

Benefits of Wavefront for Lyft
• Multiple-system syndrome
- Fewer tools for triage, better and faster resolution
- Context switching is expensive
- Wavefront puts metrics and data from numerous sources up front
and makes them available in a single click
• Real-time visibility into the performance of our key services
• Highly efficient Alert Engine
- Lyft relies on Wavefront to create smart alerts that dynamically filter
noise and capture genuine anomalies
• Powerful metrics explorer and chart view

Finding a Needle in a Haystack

Help Us Arrive at Root Cause Quickly

Tight Coupling

Big Wins with Wavefront
• Ability to monitor releases to help engineers make accurate decisions
• Predict the future
• Empirical data to guide decision making
• Robust alerting - for when you’re not watching
• A first-class citizen for answering questions such as “Is Lyft up?” or “How many rides did we complete?”
• Intuitive yet powerful query language
