Instrumenting and Scaling Databases with Envoy
- 2. Seattle 2018
Database outage
1. Disk I/O wait spikes briefly
2. Client opens more connections
3. Slowdown due to auth overhead of new connections
4. Client opens more connections
5. Hit max connection limit
- 3. Seattle 2018
Databases in the cloud
Instantly provision resilient, high-throughput infrastructure
No access to underlying VM and/or shared hardware
Limited access to telemetry
Limited access to configuration
Closed source or no ability to run custom binary
- 4. Seattle 2018
Cloud Native
Cloud native technologies empower organizations to build and run
scalable applications in modern, dynamic environments such as
public, private, and hybrid clouds.
- 6. Seattle 2018
Instance topology
The application communicates locally with Envoy, which proxies all traffic
localhost:6001
localhost:6101
localhost:7000
…
(internal services)
(third-party services)
(cloud services)
and more!
- 7. Seattle 2018
Layer 3 / 4: Proxying TCP
Stats
cx_active
cx_connect_fail
cx_idle_timeout
cx_total
cx_tx_bytes_total
cx_rx_bytes_total
Other benefits
- DNS aware
- Load balancing: round robin, least request, ring hash, random, etc.
- Impose an idle timeout
- Health checking
- Access logging
localhost:7000
iot.us-east-1.amazonaws.com
174.217.14.202
174.217.14.234
- 8. Seattle 2018
Layer 5 / 6: Offloading SSL
Stats
handshakes
tls_session_reused
fail_verify_no_cert
fail_verify_ca_error
fail_verify_san
cipher.<cipher>
days_until_cert_expires
Other benefits
- Efficient
- Up-to-date and secure (TLS 1.3)
- SNI, cert pinning, session resumption, etc.
- Easier to upgrade
localhost:7000 172.217.14.202:443
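As a rough illustration of what a gauge like days_until_cert_expires measures, the sketch below derives it from a certificate's notAfter string. The function name and the clamping to zero are assumptions made for this example, not a description of how Envoy computes the stat.

```python
import ssl
from datetime import datetime, timezone


def days_until_cert_expires(not_after: str, now: datetime) -> int:
    """Whole days left before the certificate's notAfter time (0 if expired)."""
    # ssl.cert_time_to_seconds parses the "Mon DD HH:MM:SS YYYY GMT" format
    # used in the "notAfter" field returned by SSLSocket.getpeercert().
    expires_epoch = ssl.cert_time_to_seconds(not_after)
    remaining_s = expires_epoch - now.timestamp()
    return max(0, int(remaining_s // 86400))


now = datetime(2018, 12, 1, tzinfo=timezone.utc)
print(days_until_cert_expires("Dec 31 00:00:00 2018 GMT", now))
```

Alarming on a gauge like this gives a warning well before the handshake failure counters start to move.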
- 9. Seattle 2018
Layer 7: Managing HTTP
Stats
cx_http1_total
cx_http2_total
cx_protocol_error
rq_2xx
rq_4xx
rq_5xx
rq_retry
rq_time_ms (hist)
rq_timeout
Other benefits
- Transparent upgrade from HTTP/1 to HTTP/2 (multiplexed)
- Manage request retries and timeouts
- Access logging
- Offload GZIP decompression
HTTP/1
HTTP/2
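The rq_2xx / rq_4xx / rq_5xx counters group responses by status class. A minimal sketch of that bucketing, with a hypothetical helper name and made-up sample statuses:

```python
from collections import Counter


def status_class_stat(status: int) -> str:
    """Map an HTTP status code to its class counter, e.g. 503 -> rq_5xx."""
    return f"rq_{status // 100}xx"


stats = Counter()
for status in [200, 204, 404, 503, 200]:  # sample responses, invented for the example
    stats[status_class_stat(status)] += 1

print(dict(stats))
```

Because the proxy sees every response, the same counters exist for every client library and language without any application changes.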
- 10. Seattle 2018
Statistics
TCP (L3/L4)
cx_active
cx_connect_fail
cx_idle_timeout
cx_total
cx_tx_bytes_total
cx_rx_bytes_total
cx_length_ms (hist)
SSL (L5/L6)
handshakes
tls_session_reused
fail_verify_no_cert
fail_verify_ca_error
fail_verify_san
cipher.<cipher>
days_until_cert_expires
HTTP (L7)
cx_http1_total
cx_http2_total
cx_protocol_error
rq_2xx
rq_4xx
rq_5xx
rq_retry
rq_time_ms (hist)
rq_timeout
and more!
- 13. Seattle 2018
Observability
Libraries are heterogeneous!
SSL ciphers? Status code metrics? Retry?
import pynamodb
use Aws\DynamoDb\DynamoDbClient;
import "github.com/aws/dynamodb"
e.g.
&aws.Config{
    Endpoint: aws.String("http://localhost:8000"),
}
Envoy provides standard access logs, stats, alarms, retries, etc.
- 15. Seattle 2018
DynamoDB
- Protocol: JSON over HTTP
- CloudWatch telemetry
- min, avg, max latency
- per-table capacity unit throughput
- per-minute
- Benefits of Envoy:
- Histogram of latency (percentiles)
- Custom windowing of metrics
- Per-host, per-zone, and per-cluster statistics
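To see why a latency histogram and custom windows matter next to per-minute min/avg/max, the sketch below computes a nearest-rank percentile per time window. The helper names, the 30-second window, and the sample latencies are all invented for illustration.

```python
from collections import defaultdict


def percentile(samples, pct):
    """Nearest-rank percentile (integer pct, 1-100) of a non-empty list."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]


def windowed_percentile(events, window_s, pct):
    """events: (timestamp_s, latency_ms) pairs -> {window_start: percentile}."""
    buckets = defaultdict(list)
    for ts, latency_ms in events:
        buckets[ts - ts % window_s].append(latency_ms)
    return {start: percentile(vals, pct) for start, vals in sorted(buckets.items())}


# First 30 s: 98 fast requests plus two slow ones. Next 30 s: 100 fast requests.
events = [(i % 30, 5) for i in range(98)] + [(10, 900), (20, 950)]
events += [(30 + i % 30, 5) for i in range(100)]

first_window = [lat for ts, lat in events if ts < 30]
print(sum(first_window) / len(first_window))  # the average hides the slow tail
print(windowed_percentile(events, 30, 99))    # the per-window p99 exposes it
```

A one-minute average over the same data would blend both windows together and make the slow requests even harder to spot.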
- 17. Seattle 2018
DynamoDB with codec
POST / HTTP/1.1
X-Amz-Target: DynamoDB_20120810.GetItem
{
"TableName": "pets",
"Key": {
"Name": {"S": "Patty"}
}
}
dynamodb.table.pets.GetItem.upstream_rq_time
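The mapping from the request on this slide to the stat name on this slide can be sketched as follows. This is only an illustration of the idea: the function is hypothetical, and it assumes the operation carries a top-level TableName, which is not true of every DynamoDB operation.

```python
import json


def dynamodb_stat_name(headers: dict, body: bytes) -> str:
    """Build a per-table, per-operation timer name from a DynamoDB request."""
    # X-Amz-Target has the form "<api version>.<operation>".
    _, _, operation = headers["X-Amz-Target"].partition(".")
    table = json.loads(body)["TableName"]
    return f"dynamodb.table.{table}.{operation}.upstream_rq_time"


headers = {"X-Amz-Target": "DynamoDB_20120810.GetItem"}
body = b'{"TableName": "pets", "Key": {"Name": {"S": "Patty"}}}'
print(dynamodb_stat_name(headers, body))
```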
- 18. Seattle 2018
DynamoDB
What was the per-30s p99 for write requests from the
users-streamlistener canary to the pets table?
ts(
envoy.dynamodb.pets.PutItem.upstream_rq_time.p99,
window=30,
group=users-streamlistener,
canary=true,
)
- 19. Seattle 2018
MongoDB
- Protocol: Binary JSON (BSON)
- Benefits of Envoy in TCP mode:
- Per-host, per-cluster, per-zone network I/O
- Benefits of Envoy with Mongo codec:
- Per-operation latency
- Count size and number of documents
- Count scattered gets in sharded cluster
How did the number of documents returned by queries
change in us-east-1a after the 3pm deploy of my service?
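Counting returned documents requires decoding the wire protocol rather than just forwarding bytes. The sketch below reads the standard 16-byte MongoDB message header and, for a legacy OP_REPLY message, the numberReturned field. It is a minimal illustration that ignores newer opcodes such as OP_MSG, and the function name is made up.

```python
import struct

# Little-endian int32 fields: messageLength, requestID, responseTo, opCode
HEADER = struct.Struct("<iiii")
# OP_REPLY fields after the header: responseFlags, cursorID, startingFrom, numberReturned
REPLY_FIELDS = struct.Struct("<iqii")
OP_REPLY = 1


def documents_returned(message: bytes):
    """Return numberReturned for an OP_REPLY message, or None for other opcodes."""
    _length, _request_id, _response_to, op_code = HEADER.unpack_from(message, 0)
    if op_code != OP_REPLY:
        return None
    *_, number_returned = REPLY_FIELDS.unpack_from(message, HEADER.size)
    return number_returned


# A hand-built reply header claiming three documents (document bytes omitted)
reply = HEADER.pack(36, 1, 7, OP_REPLY) + REPLY_FIELDS.pack(0, 0, 0, 3)
print(documents_returned(reply))
```

Tagging such a count with zone and deploy time is what makes the question on this slide answerable.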
- 20. Seattle 2018
MongoDB at scale
Help! My Mongo database is experiencing outages:
- Disk I/O wait spikes briefly
- Client opens more connections
- Slowdown due to auth overhead of new connections
- Open more connections
- Hit max connection limit
Envoy will rate limit new connections to apply backpressure so that query
times can recover.
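One common way to apply this kind of backpressure is a token bucket on new connections. The sketch below is a local, single-process version under assumed parameters. It is not Envoy's implementation, and the class and method names are hypothetical.

```python
import time


class ConnectionRateLimiter:
    """Token bucket: admit at most `rate_per_s` new connections, with a burst."""

    def __init__(self, rate_per_s: float, burst: int, clock=time.monotonic):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def allow_new_connection(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller rejects or delays the connection attempt


limiter = ConnectionRateLimiter(rate_per_s=2, burst=2)
print([limiter.allow_new_connection() for _ in range(3)])
```

Rejected attempts never reach the database, so the auth overhead of new connections stops compounding while disk I/O recovers.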
- 21. Seattle 2018
MongoDB at scale
Help! I deleted an index. I read the code but it was in a 3,000-line class.
The index was still in use and everything fell over until we could
recreate it.
Envoy will efficiently log all Mongo queries in JSON format so that a week
of logs can be audited for usage of the index's fields.
Have you tried the built-in query profiler?
Yes, it caused a serious outage because it's expensive and results in 3x
CPU usage.
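With queries logged as JSON, the audit becomes a scan for any query that touches the index's fields. The sketch below assumes a hypothetical log shape, one JSON object per line with "collection" and "query" keys, which may not match the real log format.

```python
import json


def fields_in(query):
    """Yield every non-operator key found anywhere in a query document."""
    if isinstance(query, dict):
        for key, value in query.items():
            if not key.startswith("$"):
                yield key
            yield from fields_in(value)
    elif isinstance(query, list):
        for item in query:
            yield from fields_in(item)


def queries_using_field(log_lines, collection, field):
    """Return logged entries on `collection` whose query references `field`."""
    hits = []
    for line in log_lines:
        entry = json.loads(line)
        if entry.get("collection") != collection:
            continue
        if field in set(fields_in(entry.get("query", {}))):
            hits.append(entry)
    return hits


log_lines = [
    '{"collection": "pets", "query": {"owner_id": 7}}',
    '{"collection": "pets", "query": {"$or": [{"name": "Patty"}, {"age": {"$gt": 3}}]}}',
    '{"collection": "users", "query": {"name": "x"}}',
]
print(len(queries_using_field(log_lines, "pets", "name")))
```

An empty result over a week of logs is reasonable evidence that the index is safe to drop.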
- 22. Seattle 2018
MongoDB at scale
Envoy will:
- globally rate limit new connections
- efficiently log all Mongo queries
- track the number of queries with no timeout set
- parse the $comment field of a query so we can time and count queries of
individual application methods, log how many records they returned, etc.
… for applications in 3 different languages across 8 clusters.
… 6 months and several outages later …
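The $comment and timeout bullets can be sketched as per-query bookkeeping. The example assumes the application puts a JSON string with a callingFunction key in $comment and sets $maxTimeMS as a query modifier. Both conventions, and the stat names, are assumptions for illustration.

```python
import json
from collections import Counter


def calling_function(query: dict):
    """Extract the application method name from a JSON-encoded $comment."""
    comment = query.get("$comment")
    if not isinstance(comment, str):
        return None
    try:
        parsed = json.loads(comment)
    except json.JSONDecodeError:
        return None
    return parsed.get("callingFunction") if isinstance(parsed, dict) else None


def summarize(queries):
    """Count queries per call site and queries sent without a timeout."""
    stats = Counter()
    for query in queries:
        site = calling_function(query) or "unknown"
        stats[f"callsite.{site}.query_total"] += 1
        if "$maxTimeMS" not in query:
            stats["query_no_max_time"] += 1
    return stats


queries = [
    {"$query": {"name": "Patty"}, "$comment": '{"callingFunction": "getPet"}', "$maxTimeMS": 500},
    {"$query": {}, "$comment": '{"callingFunction": "getPet"}'},
    {"$query": {}},
]
print(dict(summarize(queries)))
```

Since the comment travels with the query itself, this works the same way for applications written in any language.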
- 25. Seattle 2018
Redis at scale
localhost:6379
…
SET msg hello
INCR comm
MGET lyft hello
SET msg hello
GET hello
INCR comm
GET lyft
OK
1
nil
To the application, the proxy looks like a single instance of Redis.
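The fan-out on this slide, where one MGET becomes several single-key GETs on different hosts, can be sketched as below. The host list is hypothetical, and plain modulo hashing is used only to keep the example short; a production proxy would more likely use consistent hashing.

```python
import zlib
from collections import defaultdict

HOSTS = ["redis-a:6379", "redis-b:6379", "redis-c:6379"]  # hypothetical backends


def host_for(key: str) -> str:
    """Pick the backend that owns a key (simple modulo hashing for brevity)."""
    return HOSTS[zlib.crc32(key.encode()) % len(HOSTS)]


def split_mget(keys):
    """Fan an MGET out as one GET per key, grouped by owning host."""
    plan = defaultdict(list)
    for key in keys:
        plan[host_for(key)].append(("GET", key))
    return dict(plan)


def merge_replies(keys, replies_by_key):
    """Reassemble per-key replies in the order the client asked for them."""
    return [replies_by_key.get(key) for key in keys]


plan = split_mget(["lyft", "hello"])
print(plan)
print(merge_replies(["lyft", "hello"], {"hello": "world"}))
```

The client only ever sees the merged reply, which is why the proxy can look like a single Redis instance.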
- 28. Seattle 2018
Roadmap
- More codecs
- Full L7 capability vs bump-in-the-wire
- Better integration of tracing
- More fault injection coverage
- Role-based access control