Monitoring Weave Cloud
With Prometheus
Matthias Radestock, CTO
matthias@weave.works
Show of Hands
Kubernetes?
Prometheus?
Together?
Kubernetes + Prometheus
Not a new topic...
• Prometheus and Kubernetes up and running by Fabian Reinartz
• Monitoring Kubernetes with Prometheus by Brian Brazil
Both focused on deploying prom in k8s and on monitoring k8s with prom
This talk is about...
monitoring your apps that are running in k8s, with prom
Weave Cloud
A SaaS for Microservice DevOps
Bring your own cluster and Just Add Weave
• one-liner to install Weave Cloud Agent
• integrates, enhances and operates several OSS projects
• adds access control, teams, sharing, history
built from microservices, running on k8s in AWS
It’s complicated!
Quite complicated.
• 7 apps
• 70 microservices
• 400 containers
• 4500 processes
• 15 hosts
• 12 cloud services
Prometheus to the rescue!
[Diagram, after Jamie Wilkinson (Google) @ Velocity 2016: monitoring activities arranged along axes of measurement coarseness and proximity in time (failure detection, incident response, performance analysis, capacity planning), served by Alerts, Dashboards, Queries and Notebooks across dev, ops and enterprise.]
Interlude - Kubernetes 101
[Diagram: Namespaces, Deployments, Services, Pods and containers]
Container: your application code, packaged and running in an isolated environment.
Pod: a set of containers sharing a network namespace and local volumes, co-scheduled on one machine. Mortal. Has a pod IP. Has labels.
Deployment: specifies how many replicas of a pod should run, then ensures that many are running across the cluster. Has labels.
Service: names things in DNS. Gets a virtual IP. Two types: ClusterIP for internal services, NodePort for publishing to the outside. Routes based on labels.
Namespace: grouping / segmentation / partitioning / scoping.
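To make the label-based wiring concrete, here is a minimal sketch of a Deployment plus a ClusterIP Service that selects its pods by label; the names, namespace and image are illustrative, not Weave Cloud's:

  # Sketch only: illustrative names, not from Weave Cloud.
  apiVersion: apps/v1              # older clusters used extensions/v1beta1
  kind: Deployment
  metadata:
    name: hello
    namespace: demo
  spec:
    replicas: 2
    selector:
      matchLabels:
        name: hello
    template:
      metadata:
        labels:
          name: hello              # the Service below routes on this label
      spec:
        containers:
        - name: hello
          image: nginx:1.21
          ports:
          - containerPort: 80
  ---
  apiVersion: v1
  kind: Service
  metadata:
    name: hello
    namespace: demo
  spec:
    type: ClusterIP                # NodePort would publish it outside the cluster
    selector:
      name: hello                  # matches the pod label above
    ports:
    - port: 80
      targetPort: 80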
Prometheus Deployment in Weave Cloud
It’s pretty standard…
• 1 pod each for prom, alertmanager, grafana
• 1 prom-node-exporter per node
• prom configured via k8s config map, reloaded by a watcher container in the prom pod
• SD via k8s API server
[Diagram: Prometheus pod, node exporters and Alertmanager]
What metrics?
How busy is my service? Request rate
Are there any errors in my service? Error rate
What is the latency in my service? Duration of requests
• use these for 95% of monitoring & alerting
• combine with Utilisation, Saturation, Errors (USE) metrics (Brendan Gregg) plus other metrics for fault finding
Life of a metric
“How long are my DynamoDB requests taking?”
Declare & register
var (
  // Histogram of DynamoDB request durations, labelled by method and status code.
  dynamoRequestDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
    Namespace: "scope",
    Name:      "dynamo_request_duration_seconds",
    Help:      "Time in seconds spent doing DynamoDB requests.",
    Buckets:   prometheus.DefBuckets,
  }, []string{"method", "status_code"})
)

func init() {
  prometheus.MustRegister(dynamoRequestDuration)
}
Export the same *_request_duration_seconds
metrics from every service. These are the RED metrics.
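As a hedged illustration (using the metric and label names above; treating any non-2xx status_code as an error is an assumption), all three RED signals fall out of this one histogram family in PromQL:

  # Rate: requests per second, by method
  sum(rate(scope_dynamo_request_duration_seconds_count[1m])) by (method)

  # Errors: requests per second with a non-2xx status code (assumed error convention)
  sum(rate(scope_dynamo_request_duration_seconds_count{status_code!~"2.."}[1m])) by (method)

  # Duration: 99th-percentile latency from the histogram buckets
  histogram_quantile(0.99,
    sum(rate(scope_dynamo_request_duration_seconds_bucket[1m])) by (le, method))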
Observe
Use a common library in all services:

  err = instrument.TimeRequestHistogram(
    ctx, "DynamoDB.PutItem", dynamoRequestDuration, func(_ context.Context) error {
      var err error
      resp, err = c.db.PutItem(&dynamodb.PutItemInput{
        // ... elided ...
      })
      return err
    })

...which behind the scenes is...

  dynamoRequestDuration.
    WithLabelValues("DynamoDB.PutItem", statusCode).Observe(elapsedTime)
Scrape
Prometheus scrapes every /metrics page on every pod
every 15 seconds.
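A minimal sketch of what such a scrape job could look like with Kubernetes service discovery; the 15-second interval is from the talk, while the job name and everything else are generic assumptions rather than Weave Cloud's actual config:

  # prometheus.yml (sketch): discover pods via the k8s API server and
  # scrape each pod's /metrics endpoint every 15 seconds.
  global:
    scrape_interval: 15s
  scrape_configs:
    - job_name: kubernetes-pods
      kubernetes_sd_configs:
        - role: pod
      relabel_configs: []    # the relabelling rules from the next slides plug in here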
Relabel
Tag pods with the service they belong to, so that

  scope_dynamo_request_duration_seconds_sum{
    method="DynamoDB.PutItem",status_code="200"}

becomes

  scope_dynamo_request_duration_seconds_sum{
    method="DynamoDB.PutItem",status_code="200",
    job="scope/collection",
    instance="collection-1557838395-p433q",
    node="172.4.0.3"}
Relabel - k8s metadata
  kind: Deployment
  metadata:
    name: collection
    namespace: scope
  spec:
    replicas: 3
    template:
      metadata:
        labels:
          name: collection
      spec:
        containers:
        - name: collection
          image: quay.io/weaveworks/scope:master-456ac0bf
          command:
          - /home/weave/scope
          args:
          - ...
Relabel - scrape config
  # Rename jobs to be <namespace>/<name>, where <name> comes from the pod's name label
  - source_labels: [__meta_kubernetes_namespace,
                    __meta_kubernetes_pod_label_name]
    action: replace
    separator: /
    target_label: job
    replacement: $1
  # Rename instances to be the pod name
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: instance
  # Include the node name as an extra label
  - source_labels: [__meta_kubernetes_pod_node_name]
    target_label: node
Store
After scraping, Prometheus stores metrics:
• locally, in the pod (ephemeral)
• remotely, in Weave Cloud (durable)
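The remote, durable copy is shipped via Prometheus's remote_write mechanism. A hedged sketch; the endpoint URL (including the /api/prom/push path) and the token-as-password auth are placeholders and assumptions, not confirmed Weave Cloud specifics:

  # prometheus.yml (sketch): in addition to local storage, stream every
  # sample to a remote, durable store such as Weave Cloud / Cortex.
  remote_write:
    - url: https://<your-weave-cloud-or-cortex-endpoint>/api/prom/push
      basic_auth:
        password: <your-service-token>   # assumed auth scheme, check your provider's docs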
Query
histogram_quantile(0.9,
sum(rate(scope_dynamo_request_duration_seconds_bucket{
job="scope/collection"}[1m])) by (le,node))
("How to Query Prometheus", Julius Volz,
www.digitalocean.com/community/tutorials/how-to-query-prometheus-on-ubuntu-14-04-part-1)
Alerts
Fact: When users are having a bad time, we can almost
always do something about it.
Fact: Watching to see if they are having a bad time is
vigilance, not engineering.
Conclusion: Have a machine tell us when they are
having a bad time, so we can do something about it.
What makes a good alert?
• Symptoms, not causes -> alert on ingress service REDs
• Driven by SLO
• Consistent across services (kind of alert; presentation)
• Backed by per-service playbooks relying on dashboards
Alert Definition

  ALERT CollectionLatency
    IF job:scope_request_duration_seconds:99quantile{job="scope/collection"} > 5.0
    FOR 5m
    LABELS { severity="warning" }
    ANNOTATIONS {
      summary = "scope/collection: high latency",
      impact = "Data appearing on the Weave Cloud Scope UI will be out of date",
      description = "The collection service has a max 99th-quantile latency of {{$value}} ms.",
      dashboardURL = "$${base_url}/admin/grafana/dashboard/file/scope-services.json",
      playbookURL = "https://github.com/weaveworks/playbooks/#collection",
    }
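The alert fires on a recording rule, job:scope_request_duration_seconds:99quantile, which the deck does not show. A plausible sketch in the same pre-2.0 rule syntax, assuming it is derived from the per-service request-duration histograms (the expression and the 1m window are assumptions):

  # Hypothetical recording rule (not shown in the talk): pre-compute the
  # per-job 99th-quantile request latency, in seconds.
  job:scope_request_duration_seconds:99quantile =
    histogram_quantile(0.99,
      sum(rate(scope_request_duration_seconds_bucket[1m])) by (le, job))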
Alert Example
Demo
Incident response
Dashboards
Reflecting system structure:
• grouped by namespace, i.e. app
• containing one row per service, in breadth-first traversal order
Reflecting metric structure:
• just two graphs per service row - showing RED
• auxiliary dashboards for USE and other metrics
github.com/weaveworks/grafanalib
Lessons
Metrics: RED, labelled with app/service
Alerts: symptoms, not causes -> Errors & Duration (E&D) of ingress services
Dashboards: layout based on system and metric structure
Weave Cortex
• highly available, horizontally scalable Prometheus
• built on top of Prometheus core APIs
• OSS - https://github.com/weaveworks/cortex
• part of Weave Cloud SaaS
– remote-write metrics to it from your prom install
– point your grafana at it
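For the 'point your grafana at it' step, one option is a datasource provisioning file. This is a sketch using Grafana's standard Prometheus datasource type; the URL, auth header and token are placeholders, not Weave Cloud specifics:

  # Grafana datasource provisioning (sketch): Cortex speaks the Prometheus
  # query API, so it is registered like any other Prometheus datasource.
  apiVersion: 1
  datasources:
    - name: weave-cloud
      type: prometheus
      access: proxy
      url: https://<your-weave-cloud-or-cortex-endpoint>/api/prom
      jsonData:
        httpHeaderName1: Authorization
      secureJsonData:
        httpHeaderValue1: Bearer <your-service-token>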
Prometheus with Weave Cloud
[Diagram: apps in YOUR CLUSTER sending metrics to WEAVE CLOUD]
Weave Cortex - what’s new?
Correctness: bugs squashed
Stability: fast, lossless upgrades; separation of read and write paths
Performance: gRPC, varbit, hotspot-free, better indexing; 3x faster than our standalone Prom
Features: interactive query builder, notebooks, alerts, grafana integration
Thanks! Questions?
• Try Weave Cloud! - weave.works/guides/
• Join the Weave user group! - meetup.com/pro/Weave/
• Talk to us! - weave.works/help
• Work with us! - weave.works/company/hiring
Credits: slides by Jonathan Lange and Tom Wilkie @ Weaveworks
