Latency is one of the most common Service Level Indicators (SLIs), but where should it be measured from? There are three main ways to measure latency:
• Server-side latency: precise and high-cardinality, but missing the big picture
• Client-side latency: the big picture, but noisy
• Blackbox monitoring latency: a good trade-off between the other two
In this talk, we will dive deeper into each perspective and how all of them can be leveraged. We will use Criteo’s large-scale key/value infrastructure as a case study.
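As a rough illustration of how the first two perspectives differ, here is a minimal Python sketch (the endpoint and timing approach are assumptions for illustration, not Criteo's implementation): server-side latency is measured around the handler only, while a client-side or blackbox measurement also includes network and queuing time.

```python
import time
import urllib.request

def handle_request():
    """Server-side view: time only the work done inside the service."""
    start = time.perf_counter()
    # ... do the actual work (e.g., a key/value lookup) ...
    return (time.perf_counter() - start) * 1000  # ms, excludes network time

def probe(url="http://localhost:8080/healthz"):  # placeholder probe endpoint
    """Client-side / blackbox view: time the whole round trip, network included."""
    start = time.perf_counter()
    urllib.request.urlopen(url, timeout=2).read()
    return (time.perf_counter() - start) * 1000  # ms, includes network + queuing
```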
Applying the power of Continuous Delivery to performance testing. Process, techniques, best practices. This talk describes a pragmatic approach to building a robust performance testing strategy.
Ricardo Jiménez-Peris, PhD, researcher and founder of LeanXcale, explains how his latest invention enables relational databases to scale linearly.
This document discusses building resilient predictive data pipelines. It begins by distinguishing between ETL and predictive data pipelines, noting that predictive pipelines require high availability with downtimes of less than an hour. The document then outlines design goals for resilient data pipelines, including being scalable, available, instrumented/monitored/alert-enabled, and quickly recoverable. It proposes using AWS services like SQS, SNS, S3, and Auto Scaling Groups to build such pipelines. The document also recommends using Apache Airflow for workflow automation and scheduling to reliably manage pipelines as directed acyclic graphs. It presents an architecture using these techniques and assesses how well it meets the outlined design goals.
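To make the Airflow recommendation concrete, here is a minimal sketch of a pipeline expressed as a directed acyclic graph (assuming a recent Apache Airflow 2.x install; the DAG name and the extract/predict callables are hypothetical placeholders, not the architecture from the document):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(**context):
    # Hypothetical: pull the latest raw events, e.g. from an SQS queue or S3 prefix
    pass

def predict(**context):
    # Hypothetical: run the model over the extracted batch and write results to S3
    pass

with DAG(
    dag_id="predictive_pipeline",      # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    predict_task = PythonOperator(task_id="predict", python_callable=predict)
    extract_task >> predict_task       # dependencies form the DAG Airflow schedules and retries
```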
Chronix is a time series database designed specifically for anomaly detection in operational data. It offers several advantages over general purpose time series databases: 1) Chronix uses domain specific optimizations like optional timestamp compression, custom data records, and compression techniques tailored to the repetitive patterns in operational data. 2) It provides a programming interface to pre-compute representations of time series data and add domain-specific columns to speed up anomaly detection queries. 3) Chronix supports exploratory and correlating analyses through its multi-dimensional storage and ability to query on any combination of attributes. It also offers high-level domain-specific analysis functions evaluated server-side.
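To give a flavor of the "optional timestamp compression" idea, here is a generic delta-encoding sketch of the kind of optimization described (illustrative only, not Chronix's actual codec):

```python
def delta_encode(timestamps):
    """Store the first timestamp plus successive differences.

    Operational data is often sampled at a near-fixed interval, so the deltas
    are small and highly repetitive, which compresses far better than raw epochs.
    """
    if not timestamps:
        return []
    return [timestamps[0]] + [t2 - t1 for t1, t2 in zip(timestamps, timestamps[1:])]

def delta_decode(encoded):
    """Reconstruct the original timestamps from the deltas."""
    out, current = [], 0
    for i, v in enumerate(encoded):
        current = v if i == 0 else current + v
        out.append(current)
    return out

# Example: one sample per second collapses to a run of identical deltas
print(delta_encode([1000, 2000, 3000, 4000]))  # [1000, 1000, 1000, 1000]
```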
Create-Net is a research center that offers cloud computing research, consulting, training, and webinars. This webinar discusses monitoring in the cloud computing era, beginning with introductions to Ceilometer and Monasca. Ceilometer is OpenStack's metering framework that collects data from OpenStack services through agents and notifications. It stores data in a database and provides an API. Monasca is a monitoring as a service platform that processes metrics and events at scale through microservices and stores data for querying and visualization. The webinar concludes with a discussion of trends in cloud monitoring.
Amazon Redshift offers many powerful features. Yet there are many instances where customers encounter degraded performance and runaway costs. Scaling AWS Redshift clusters to meet increasing compute and reporting needs, while maintaining optimal cost, performance, and security standards, is quite a challenge for many organizations. This webinar covered the following:
• Key design/architectural considerations of AWS Redshift
• Tips & tricks to optimize cost & performance
• How Agilisium helped clients reduce AWS Redshift run costs by up to 40%
Presented by: Jay Palaniappan - CTO & Head of Innovation Labs || Smitha Basavaraju - Big Data Architect || Arun Chinnadurai - Associate Director – BD
High-speed reactive microservices (HSRM) are microservices that are in-memory, non-blocking, own their data through leasing, and use streams and batching. They provide advantages like lower costs, ability to handle more traffic with fewer resources, and cohesive codebases. The example service described handles 30k recommendations/second on a single thread through batching, streaming, and data faulting. The document discusses attributes of HSRM like single writer rules and service stores, and related concepts like reactive programming, streams, and service sharding.
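The batching and single-writer ideas behind the "30k recommendations/second on a single thread" claim can be sketched in a few lines of Python (a generic single-writer batching loop, not the actual service; the queue and batch size are assumptions):

```python
import queue
import threading

incoming = queue.Queue()

def single_writer_loop(batch_size=500):
    """One thread owns the data: it drains requests in batches rather than
    one at a time, amortizing per-call overhead and avoiding locks."""
    while True:
        batch = [incoming.get()]                      # block for the first item
        while len(batch) < batch_size:
            try:
                batch.append(incoming.get_nowait())   # drain whatever else is queued
            except queue.Empty:
                break
        handle_batch(batch)                           # one pass over the owned, in-memory data

def handle_batch(batch):
    for user_id in batch:
        pass  # placeholder: compute a recommendation from in-memory state

threading.Thread(target=single_writer_loop, daemon=True).start()
```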
The document discusses in-flux limiting for a multi-tenant logging service. It describes Symantec's logging and metrics architecture using Kafka, Elasticsearch, and InfluxDB. It addresses the issue of ingestion spikes overwhelming InfluxDB and presents a solution to normalize event rates using buffers that allocate ingestion quotas per tenant. The design implements rate limiting using a scheduled task pattern in Storm to track each tenant's event rate over a configurable window and throttle events if the threshold is exceeded.
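A simplified sketch of the per-tenant throttling idea follows (this is the general technique, not the Storm scheduled-task implementation; the quota and window values are assumptions):

```python
import time
from collections import defaultdict

class TenantRateLimiter:
    """Track each tenant's event count over a configurable window and
    throttle once the tenant's ingestion quota is exceeded."""

    def __init__(self, quota_per_window=10_000, window_seconds=60):
        self.quota = quota_per_window
        self.window = window_seconds
        self.counts = defaultdict(int)
        self.window_start = time.monotonic()

    def allow(self, tenant_id):
        now = time.monotonic()
        if now - self.window_start >= self.window:    # periodic reset, akin to a scheduled tick
            self.counts.clear()
            self.window_start = now
        self.counts[tenant_id] += 1
        return self.counts[tenant_id] <= self.quota   # False -> throttle this event

limiter = TenantRateLimiter()
if not limiter.allow("tenant-a"):
    pass  # drop, buffer, or delay the event instead of overwhelming InfluxDB
```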
This document discusses performance engineering for batch and web applications. It begins by outlining why performance testing is important. Key factors that influence performance testing include response time, throughput, tuning, and benchmarking. Throughput represents the number of transactions processed in a given time period and should scale roughly linearly with load until the system saturates. Response time is the duration between sending a request and receiving the first response. Tuning improves performance by adjusting configuration parameters without changing code. The performance testing process involves test planning, creating test scripts, executing tests, monitoring tests, and analyzing results. Methods for analyzing heap dumps and thread dumps to identify bottlenecks are also provided. The document concludes with tips for optimizing PostgreSQL performance by adjusting the shared_buffers configuration parameter.
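As a small worked example of the two core metrics (the sample data below is assumed, not taken from the document):

```python
# Latencies in milliseconds collected during a 60-second test run (assumed data)
latencies_ms = [120, 95, 110, 480, 105, 98, 102, 250, 99, 101]
test_duration_s = 60

throughput = len(latencies_ms) / test_duration_s            # transactions per second
p99 = sorted(latencies_ms)[int(0.99 * len(latencies_ms))]   # simple nearest-rank p99

print(f"throughput: {throughput:.2f} tps, p99 response time: {p99} ms")
```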
The presentation explains how to set up rate limits, how to work with the HTTP 429 status code, and how rate limits are implemented in Kubernetes, CNI plugins, load balancers, and so on.
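A minimal client-side sketch of working with 429 responses (the URL is a placeholder; a production client should also cap total wait time and add jitter):

```python
import time
import requests

def get_with_backoff(url, max_retries=5):
    """Retry on HTTP 429, honoring the Retry-After header when present."""
    for attempt in range(max_retries):
        resp = requests.get(url, timeout=5)
        if resp.status_code != 429:
            return resp
        retry_after = float(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(retry_after)                    # back off before retrying
    raise RuntimeError("rate limited: retries exhausted")

# resp = get_with_backoff("https://api.example.com/items")  # placeholder URL
```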
The objective of the engagement is for Citi to gain an understanding of, and a path forward for, monitoring their Confluent Platform, covering:
- Platform Monitoring
- Maintenance and Upgrades
Integrating content delivery networks into your application infrastructure can offer many benefits, including major performance improvements for your applications. So understanding how CDNs perform — especially for your specific use cases — is vital. However, testing and measurement are complicated and nuanced, and often result in metric overload and confusion. It's becoming increasingly important to understand measurement techniques, what they're telling you, and how to apply them to your actual content. In this session, we'll examine the challenges around measuring CDN performance and focus on the different methods for measurement. We'll discuss what to measure, important metrics to focus on, and different ways that numbers may mislead you. More specifically, we'll cover:
- Different techniques for measuring CDN performance
- Differentiating between network footprint and object delivery performance
- Choosing the right content to test
- Core metrics to focus on and how each impacts real traffic
- Understanding cache hit ratio, why it can be misleading, and how to measure for it
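To make the cache hit ratio point concrete, here is a small sketch contrasting a request-based hit ratio with a byte-weighted one, using an assumed log format (the field names and values are hypothetical); the two can diverge sharply, which is one way the headline number misleads:

```python
# Assumed per-request log entries: cache status plus object size in bytes
logs = [
    {"cache": "HIT",  "bytes": 2_000},
    {"cache": "HIT",  "bytes": 3_000},
    {"cache": "MISS", "bytes": 900_000},   # one large miss dominates the bytes served
]

hits = [e for e in logs if e["cache"] == "HIT"]

request_hit_ratio = len(hits) / len(logs)
byte_hit_ratio = sum(e["bytes"] for e in hits) / sum(e["bytes"] for e in logs)

print(f"request hit ratio: {request_hit_ratio:.0%}")  # 67%, looks healthy
print(f"byte hit ratio:    {byte_hit_ratio:.0%}")     # ~1%, most bytes came from origin
```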
Tom Valine and Bhinav Sura (Salesforce): We’ll present details about Argus, a time-series monitoring and alerting platform developed at Salesforce to provide insight into the health of infrastructure as an alternative to systems such as Graphite and Seyren.
BlazingMQ is a new open source* distributed message queuing system developed at and published by Bloomberg. It provides highly-performant queues to applications for asynchronous, efficient, and reliable communication. This system has been used at scale at Bloomberg for eight years, where it moves terabytes of data and billions of messages across tens of thousands of queues in production every day. BlazingMQ provides highly-available, fault-tolerant queues courtesy of replication based on the Raft consensus algorithm. In addition, it provides a rich set of enterprise message routing strategies, enabling users to implement a variety of scenarios for message processing. Written in C++ from the ground up, BlazingMQ has been architected with low latency as one of its core requirements. This has resulted in some unique design and implementation choices at all levels of the system, such as its lock-free threading model, custom memory allocators, compact wire protocol, multi-hop network topology, and more. This talk will provide an overview of BlazingMQ. We will then delve into the system’s core design principles, architecture, and implementation details in order to explore the crucial role they play in its performance and reliability. *BlazingMQ will be released as open source between now and P99 (exact timing is still TBD)
At Knewton we operate a total of 29 clusters across five different VPCs, each ranging from 3 to 24 nodes. For a team of three, maintaining this is not a herculean task; however, good tools to diagnose issues and gather information in a distributed manner are vital to moving quickly and minimizing engineering time spent. The database team at Knewton has been successfully using a combination of Ansible and custom open-sourced tools to maintain and improve the Cassandra deployment at Knewton. I will be talking about several of these tools and giving examples of how we are using them. Specifically, I will discuss the cassandra-tracing tool, which analyzes the contents of the system_traces keyspace, and the cassandra-stat tool, which gives real-time output of the operations of a Cassandra cluster. Distributed administration with ad-hoc Ansible will also be covered, and I will walk through examples of using these commands to identify and remediate cluster-wide issues.
About the Speaker
Jeffrey Berger, Lead Database Engineer, Knewton
Dr. Jeffrey Berger is currently the lead database engineer at Knewton, an education tech startup in NYC. He joined the tech scene in NYC in 2013 and spent two years working with MongoDB, becoming a certified MongoDB administrator and a MongoDB Master. He received his Cassandra Administrator certification at Cassandra Summit 2015. He holds a Ph.D. in Theoretical Physics from Penn State and spent several years working on high-energy nuclear interactions.
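For a flavor of the data cassandra-tracing digs into, here is a small sketch using the Python cassandra-driver to pull the slowest traced sessions from the system_traces keyspace (the contact point is a placeholder; this illustrates the data source, it is not the tool itself):

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])              # placeholder contact point
session = cluster.connect("system_traces")

# Each traced query has a row in system_traces.sessions with its total duration (microseconds)
rows = session.execute("SELECT session_id, duration, request FROM sessions LIMIT 1000")

slowest = sorted(rows, key=lambda r: r.duration or 0, reverse=True)[:10]
for r in slowest:
    print(r.session_id, f"{(r.duration or 0) / 1000:.1f} ms", r.request)
```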
This document discusses Cassandra drivers and how to optimize queries. It begins with an introduction to Cassandra drivers and examples of basic usage in Java, Python, and Ruby. It then covers the differences between synchronous and asynchronous queries. Prepared statements and consistency levels are also discussed. The document explores how consistency levels, driver policies, and node outages impact performance and latency. Hinted handoff is described as a performance optimization in which hints are stored for writes missed by a down node, so they can be replayed once the node recovers. Lastly, it provides best practices around driver usage.
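A minimal sketch of the prepared-statement, consistency-level, and asynchronous-query ideas with the Python driver (the contact point, keyspace, and table are placeholders):

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])                       # placeholder contact point
session = cluster.connect("my_keyspace")               # placeholder keyspace

# Prepared once, executed many times: skips re-parsing and lets the driver route by token
prepared = session.prepare("SELECT * FROM users WHERE user_id = ?")
prepared.consistency_level = ConsistencyLevel.LOCAL_QUORUM

# Synchronous execution blocks until the result arrives
row = session.execute(prepared, ["user-123"]).one()

# Asynchronous execution returns a future, letting the client pipeline many queries
future = session.execute_async(prepared, ["user-456"])
other_row = future.result()                            # wait only when the result is needed
```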
In this presentation, we explore how standard profiling and monitoring methods may fall short in identifying bottlenecks in low-latency data ingestion workflows. Instead, we showcase the power of simple yet clever methods that can uncover hidden performance limitations. Attendees will discover unconventional techniques, including clever logging, targeted instrumentation, and specialized metrics, to pinpoint bottlenecks accurately. Real-world use cases will be presented to demonstrate the effectiveness of these methods. By the end of the session, attendees will be equipped with alternative approaches to identify bottlenecks and optimize their low-latency data ingestion workflows for high throughput.
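One generic form the "targeted instrumentation" idea can take is wrapping a single suspect stage with a lightweight timer that only logs outliers, so the instrumentation stays cheap in a low-latency path (the stage name and threshold below are assumptions for illustration):

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)

def log_if_slow(threshold_ms=5.0):
    """Instrument one suspect stage only, and log only the outliers."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                if elapsed_ms > threshold_ms:
                    logging.info("%s took %.2f ms", fn.__name__, elapsed_ms)
        return wrapper
    return decorator

@log_if_slow(threshold_ms=5.0)
def decode_batch(batch):      # hypothetical ingestion stage
    time.sleep(0.01)          # stand-in for real work

decode_batch([])
```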
Stream processing is a crucial component of modern data infrastructure, but constructing an efficient and scalable stream processing system can be challenging. Decoupling compute and storage architecture has emerged as an effective solution to these challenges, but it can introduce high latency issues, especially when dealing with complex continuous queries that necessitate managing extra-large internal states. In this talk, we focus on addressing the high latency issues associated with S3 storage in stream processing systems that employ a decoupled compute and storage architecture. We delve into the root causes of latency in this context and explore various techniques to minimize the impact of S3 latency on stream processing performance. Our proposed approach is to implement a tiered storage mechanism that leverages a blend of high-performance and low-cost storage tiers to reduce data movement between the compute and storage layers while maintaining efficient processing. Throughout the talk, we will present experimental results that demonstrate the effectiveness of our approach in mitigating the impact of S3 latency on stream processing. By the end of the talk, attendees will have gained insights into how to optimize their stream processing systems for reduced latency and improved cost-efficiency.
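A highly simplified sketch of the tiered-storage idea for operator state: serve hot keys from memory, fall back to a local cache, and go to S3 only on a double miss (the bucket, directory, and boto3 usage are placeholders, not the system described in the talk):

```python
import boto3

class TieredStateStore:
    """Hot state in memory, warm state on local disk, cold state in S3."""

    def __init__(self, bucket="my-state-bucket", local_dir="/tmp/state"):  # placeholders
        self.memory = {}
        self.local_dir = local_dir
        self.bucket = bucket
        self.s3 = boto3.client("s3")

    def get(self, key):
        if key in self.memory:                                   # fastest tier
            return self.memory[key]
        try:
            with open(f"{self.local_dir}/{key}", "rb") as f:     # warm tier: local disk
                value = f.read()
        except FileNotFoundError:
            obj = self.s3.get_object(Bucket=self.bucket, Key=key)  # cold tier: high latency
            value = obj["Body"].read()
        self.memory[key] = value                                 # promote on access
        return value
```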
Widya Salim and Victor Ma will outline the causal impact analysis, framework, and key learnings used to quantify the impact of reducing Twitter's network latency.
Noisy Real User Monitoring (RUM) data can ruin your P99! We introduce a fresh concept called "Human Visible Navigations" (HVN) to tackle this risk; we focus on the experiences you actually care about when talking about the speed of our sites:
- Human: We exclude noise coming from bots and synthetic measurements.
- Visible: We remove any partially or fully hidden experiences. These tend to be very slow, but users don’t see this slowness.
- Navigations: We ignore lightning-fast back-forward navigations, which usually have few optimisation opportunities.
Adopting Human Visible Navigations provides you with these key benefits:
- Fewer changes staying below the radar
- Fewer data fluctuations
- Fewer blind spots when finding bottlenecks
- Better correlation with business metrics
This is supported by plenty of real-world examples coming from the world's largest scale modeling site (6M monthly visits), in combination with aggregated data from the brand new rumarchive.com (open source). After attending this session, your P99 and other percentiles will become less noisy and easier to tune!
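As a minimal sketch of what an HVN filter might look like in beacon processing (the field names and sample values are hypothetical, not the site's actual schema):

```python
def is_human_visible_navigation(beacon):
    """Keep only Human, Visible, Navigation experiences (hypothetical field names)."""
    if beacon.get("is_bot") or beacon.get("is_synthetic"):
        return False                                   # Human: drop bots and synthetic tests
    if beacon.get("visibility_state") != "visible":
        return False                                   # Visible: drop hidden page loads
    if beacon.get("navigation_type") == "back_forward":
        return False                                   # Navigations: drop bfcache restores
    return True

beacons = [
    {"is_bot": True,  "visibility_state": "visible", "navigation_type": "navigate", "lcp_ms": 12000},
    {"is_bot": False, "visibility_state": "hidden",  "navigation_type": "navigate", "lcp_ms": 9000},
    {"is_bot": False, "visibility_state": "visible", "navigation_type": "navigate", "lcp_ms": 1800},
]
clean = [b for b in beacons if is_human_visible_navigation(b)]
print(len(clean), "of", len(beacons), "beacons kept for percentile calculations")
```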
Understanding the impacts of running a containerized Go application inside Kubernetes with a focus on the CPU.
In this session, Tanel introduces a new open source eBPF tool for efficiently sampling both on-CPU and off-CPU events for every thread (task) in the OS. Standard Linux performance tools (like perf) make it easy to profile threads doing on-CPU work, but if we want to include the off-CPU timing and reasons for the full picture, things get complicated. Combining eBPF task state arrays with periodic sampling for profiling lets us get a system-level overview of where threads spend their time, even when blocked or sleeping, and also drill down to the individual thread level to understand why.
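The tool itself is eBPF-based, but the underlying sampling idea can be approximated with a few lines of Python that periodically read thread states from /proc (a crude illustration only; it misses the off-CPU reasons and precision the eBPF approach captures):

```python
import glob
import time
from collections import Counter

def sample_thread_states(pid, interval=0.01, samples=500):
    """Periodically sample the state of every thread (task) of a process.
    R = running/runnable, S = sleeping, D = uninterruptible sleep (often blocked on I/O)."""
    counts = Counter()
    for _ in range(samples):
        for stat_path in glob.glob(f"/proc/{pid}/task/*/stat"):
            try:
                content = open(stat_path).read()
            except OSError:
                continue                                  # thread exited between listing and read
            state = content.rsplit(")", 1)[1].split()[0]  # state field follows the (comm) field
            counts[state] += 1
        time.sleep(interval)
    return counts

# print(sample_thread_states(1234))  # e.g. Counter({'S': 4200, 'R': 600, 'D': 200})
```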
Performance budgets have been around for more than ten years. Over those years, we’ve learned a lot about what works, what doesn’t, and what we need to improve. In this session, Tammy revisits old assumptions about performance budgets and offers some new best practices. Topics include:
• Understanding performance budgets vs. performance goals
• Aligning budgets with user experience
• Pros and cons of Core Web Vitals
• How to stay on top of your budgets to fight regressions
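A tiny sketch of the "stay on top of your budgets" idea: compare measured metrics against budgets in CI and fail the build on regressions (the metric names and thresholds below are assumed for illustration):

```python
import sys

# Assumed budgets and a measurement from the latest build (illustrative values)
budgets = {"lcp_ms": 2500, "total_js_kb": 300, "image_kb": 500}
measured = {"lcp_ms": 2650, "total_js_kb": 280, "image_kb": 490}

over_budget = {
    metric: (measured[metric], limit)
    for metric, limit in budgets.items()
    if measured.get(metric, 0) > limit
}

for metric, (value, limit) in over_budget.items():
    print(f"BUDGET EXCEEDED: {metric} = {value} (budget {limit})")

sys.exit(1 if over_budget else 0)   # nonzero exit fails the build on any regression
```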
Trying to figure out why your application is responding late can be difficult, especially if it is because of interference from the operating system. This talk will briefly go over how to write a C program that can analyze what in the Linux system is interfering with your application. It will use trace-cmd to enable kernel trace events as well as tracing lock functions, and it will then go over a quick tutorial on how to use libtracecmd to read the created trace.dat file and uncover the cause of interference to your application.
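The talk's examples use the libtracecmd C API; as a rough stand-in, here is a Python sketch that drives the trace-cmd CLI to capture scheduler events for a few seconds and then scans the report for lines involving the application of interest (the PID is hypothetical; requires trace-cmd installed and root privileges):

```python
import subprocess

PID = 1234   # hypothetical PID of the latency-sensitive application

# Record scheduler events system-wide for 5 seconds into trace.dat (requires root).
subprocess.run(["trace-cmd", "record", "-e", "sched", "sleep", "5"], check=True)

# Dump the recorded trace and keep only lines mentioning our application,
# to see what it was switched out for or what woke up in its place.
report = subprocess.run(
    ["trace-cmd", "report"], check=True, capture_output=True, text=True
)
for line in report.stdout.splitlines():
    if f"pid={PID}" in line or f"-{PID} " in line:
        print(line)
```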
With the low-latency garbage collector ZGC, GC pause times are no longer a big problem in Java. With sub-millisecond pause times, other things in the GC and JVM can instead cause application threads to experience unexpected latencies. This talk will dig into a specific case where the GC pauses are no longer the cause of unexpected latencies and look at how adding generations to ZGC helps lower the p99 application latencies.
Linters are a type of database! They are a collection of lint rules — queries that look for rule violations to report — plus a way to execute those queries over a source code dataset. This is a case study about using database ideas to build a linter that looks for breaking changes in Rust library APIs. Maintainability and performance are key: new Rust releases tend to have mutually-incompatible ways of representing API information, and we cannot afford to reimplement and optimize dozens of rules for each Rust version separately. Fortunately, databases don't require rewriting queries when the underlying storage format or query plan changes! This allows us to ship massive optimizations and support multiple Rust versions without making any changes to the queries that describe lint rules. "Ship now, optimize later" can be a sustainable development practice after all — join us to see how!
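To illustrate the "lint rules as queries over a source code dataset" idea in the abstract's spirit (a generic toy model, not the linter's actual query language or schema):

```python
# Toy "database" of public API items for two releases of a library (hypothetical data)
old_api = [
    {"kind": "function", "name": "parse", "public": True},
    {"kind": "function", "name": "render", "public": True},
]
new_api = [
    {"kind": "function", "name": "parse", "public": True},
]

def public_function_removed(old, new):
    """One lint rule, expressed as a query: which public functions disappeared?
    The rule stays the same even if the underlying storage format changes."""
    new_names = {item["name"] for item in new if item["public"]}
    return [item for item in old
            if item["public"] and item["kind"] == "function"
            and item["name"] not in new_names]

for violation in public_function_removed(old_api, new_api):
    print(f"breaking change: public function `{violation['name']}` was removed")
```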