Monitoring Weave Cloud
With Prometheus
Matthias Radestock, CTO
matthias@weave.works
Show of Hands
Kubernetes?
Prometheus?
Together?
Kubernetes + Prometheus
Not a new topic...
• Prometheus and Kubernetes up and running by Fabian Reinartz
• Monitoring Kubernetes with Prometheus by Brian Brazil
Both focused on deploying prom in k8s and on monitoring k8s with prom
This talk is about...
monitoring your apps that are running in k8s, with prom
Weave Cloud
A SaaS for Microservice DevOps
Bring your own cluster and Just Add Weave
• one-liner to install Weave Cloud Agent
• integrates, enhances and operates several OSS projects
• adds access control, teams, sharing, history
built from microservices, running on k8s in AWS
It’s complicated!
Quite complicated.
• 7 apps
• 70 microservices
• 400 containers
• 4500 processes
• 15 hosts
• 12 cloud services
Prometheus to the rescue!
[Diagram, after Jamie Wilkinson (Google) @ Velocity 2016: monitoring activities arranged along axes of measurement coarseness and proximity in time (failure detection, incident response, performance analysis, capacity planning), served by Alerts, Dashboards, Queries and Notebooks across dev, ops and enterprise.]
Interlude - Kubernetes 101
[Diagram: Namespaces, Deployments, Services, Pods and containers]
Container: your application code, packaged and running in an isolated environment.
Pod: a set of containers sharing a network namespace and local volumes, co-scheduled on one machine. Mortal. Has a pod IP. Has labels.
Deployment: specifies how many replicas of a pod should run, then ensures that many are running across the cluster. Has labels.
Service: names things in DNS. Gets a virtual IP. Two types: ClusterIP for internal services, NodePort for publishing to the outside. Routes based on labels.
Namespace: grouping / segmentation / partitioning / scoping.
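To make the label-based wiring concrete, here is a minimal sketch of a Deployment plus a ClusterIP Service that selects its pods by label; the names, namespace and image are illustrative, not Weave Cloud's:

  # Sketch only: illustrative names, not from Weave Cloud.
  apiVersion: apps/v1              # older clusters used extensions/v1beta1
  kind: Deployment
  metadata:
    name: hello
    namespace: demo
  spec:
    replicas: 2
    selector:
      matchLabels:
        name: hello
    template:
      metadata:
        labels:
          name: hello              # the Service below routes on this label
      spec:
        containers:
        - name: hello
          image: nginx:1.21
          ports:
          - containerPort: 80
  ---
  apiVersion: v1
  kind: Service
  metadata:
    name: hello
    namespace: demo
  spec:
    type: ClusterIP                # NodePort would publish it outside the cluster
    selector:
      name: hello                  # matches the pod label above
    ports:
    - port: 80
      targetPort: 80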
Prometheus Deployment in Weave Cloud
It’s pretty standard…
• 1 pod each for prom, alertmanager, grafana
• 1 prom-node-exporter per node
• prom configured via k8s config map, reloaded by a watcher container in the prom pod
• SD via k8s API server
[Diagram: Prometheus pod, node exporters and Alertmanager]
What metrics?
How busy is my service? Request rate
Are there any errors in my service? Error rate
What is the latency in my service? Duration of requests
• use these for 95% of monitoring & alerting
• combine with Utilisation, Saturation, Errors (USE) metrics (Brendan Gregg) plus other metrics for fault finding
Life of a metric
“How long are my DynamoDB requests taking?”
Declare & register
var (
  // Histogram of DynamoDB request durations, labelled by method and status code.
  dynamoRequestDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
    Namespace: "scope",
    Name:      "dynamo_request_duration_seconds",
    Help:      "Time in seconds spent doing DynamoDB requests.",
    Buckets:   prometheus.DefBuckets,
  }, []string{"method", "status_code"})
)

func init() {
  prometheus.MustRegister(dynamoRequestDuration)
}
Export the same *_request_duration_seconds
metrics from every service. These are the RED metrics.
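As a hedged illustration (using the metric and label names above; treating any non-2xx status_code as an error is an assumption), all three RED signals fall out of this one histogram family in PromQL:

  # Rate: requests per second, by method
  sum(rate(scope_dynamo_request_duration_seconds_count[1m])) by (method)

  # Errors: requests per second with a non-2xx status code (assumed error convention)
  sum(rate(scope_dynamo_request_duration_seconds_count{status_code!~"2.."}[1m])) by (method)

  # Duration: 99th-percentile latency from the histogram buckets
  histogram_quantile(0.99,
    sum(rate(scope_dynamo_request_duration_seconds_bucket[1m])) by (le, method))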
Observe
Use a common library in all services:

  err = instrument.TimeRequestHistogram(
    ctx, "DynamoDB.PutItem", dynamoRequestDuration, func(_ context.Context) error {
      var err error
      resp, err = c.db.PutItem(&dynamodb.PutItemInput{
        // ... elided ...
      })
      return err
    })

...which behind the scenes is...

  dynamoRequestDuration.
    WithLabelValues("DynamoDB.PutItem", statusCode).Observe(elapsedTime)
Scrape
Prometheus scrapes every /metrics page on every pod
every 15 seconds.
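A minimal sketch of what such a scrape job could look like with Kubernetes service discovery; the 15-second interval is from the talk, while the job name and everything else are generic assumptions rather than Weave Cloud's actual config:

  # prometheus.yml (sketch): discover pods via the k8s API server and
  # scrape each pod's /metrics endpoint every 15 seconds.
  global:
    scrape_interval: 15s
  scrape_configs:
    - job_name: kubernetes-pods
      kubernetes_sd_configs:
        - role: pod
      relabel_configs: []    # the relabelling rules from the next slides plug in here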
Relabel
Tag pods with the service they belong to, so that

  scope_dynamo_request_duration_seconds_sum{
    method="DynamoDB.PutItem",status_code="200"}

becomes

  scope_dynamo_request_duration_seconds_sum{
    method="DynamoDB.PutItem",status_code="200",
    job="scope/collection",
    instance="collection-1557838395-p433q",
    node="172.4.0.3"}
Relabel - k8s metadata
  kind: Deployment
  metadata:
    name: collection
    namespace: scope
  spec:
    replicas: 3
    template:
      metadata:
        labels:
          name: collection
      spec:
        containers:
        - name: collection
          image: quay.io/weaveworks/scope:master-456ac0bf
          command:
          - /home/weave/scope
          args:
          - ...
Relabel - scrape config
  # Rename jobs to be <namespace>/<name>, where <name> comes from the pod's name label
  - source_labels: [__meta_kubernetes_namespace,
                    __meta_kubernetes_pod_label_name]
    action: replace
    separator: /
    target_label: job
    replacement: $1
  # Rename instances to be the pod name
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: instance
  # Include the node name as an extra label
  - source_labels: [__meta_kubernetes_pod_node_name]
    target_label: node
Store
After scraping, Prometheus stores metrics:
• locally, in the pod (ephemeral)
• remotely, in Weave Cloud (durable)
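The remote, durable copy is shipped via Prometheus's remote_write mechanism. A hedged sketch; the endpoint URL (including the /api/prom/push path) and the token-as-password auth are placeholders and assumptions, not confirmed Weave Cloud specifics:

  # prometheus.yml (sketch): in addition to local storage, stream every
  # sample to a remote, durable store such as Weave Cloud / Cortex.
  remote_write:
    - url: https://<your-weave-cloud-or-cortex-endpoint>/api/prom/push
      basic_auth:
        password: <your-service-token>   # assumed auth scheme, check your provider's docs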
Query
histogram_quantile(0.9,
sum(rate(scope_dynamo_request_duration_seconds_bucket{
job="scope/collection"}[1m])) by (le,node))
("How to Query Prometheus", Julius Volz,
www.digitalocean.com/community/tutorials/how-to-query-prometheus-on-ubuntu-14-04-part-1)
Alerts
Fact: When users are having a bad time, we can almost
always do something about it.
Fact: Watching to see if they are having a bad time is
vigilance, not engineering.
Conclusion: Have a machine tell us when they are
having a bad time, so we can do something about it.
What makes a good alert?
• Symptoms, not causes -> alert on ingress service REDs
• Driven by SLO
• Consistent across services (kind of alert; presentation)
• Backed by per-service playbooks relying on dashboards
Alert Definition

  ALERT CollectionLatency
    IF job:scope_request_duration_seconds:99quantile{job="scope/collection"} > 5.0
    FOR 5m
    LABELS { severity="warning" }
    ANNOTATIONS {
      summary = "scope/collection: high latency",
      impact = "Data appearing on the Weave Cloud Scope UI will be out of date",
      description = "The collection service has a max 99th-quantile latency of {{$value}} ms.",
      dashboardURL = "$${base_url}/admin/grafana/dashboard/file/scope-services.json",
      playbookURL = "https://github.com/weaveworks/playbooks/#collection",
    }
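The alert fires on a recording rule, job:scope_request_duration_seconds:99quantile, which the deck does not show. A plausible sketch in the same pre-2.0 rule syntax, assuming it is derived from the per-service request-duration histograms (the expression and the 1m window are assumptions):

  # Hypothetical recording rule (not shown in the talk): pre-compute the
  # per-job 99th-quantile request latency, in seconds.
  job:scope_request_duration_seconds:99quantile =
    histogram_quantile(0.99,
      sum(rate(scope_request_duration_seconds_bucket[1m])) by (le, job))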
Alert Example
Demo
Incident response
Dashboards
Reflecting system structure:
• grouped by namespace, i.e. app
• containing one row per service, in breadth-first traversal order
Reflecting metric structure:
• just two graphs per service row - showing RED
• auxiliary dashboards for USE and other metrics
github.com/weaveworks/grafanalib
Lessons
Metrics: RED, labelled with app/service
Alerts: symptoms, not causes -> Errors & Duration (E&D) of ingress services
Dashboards: layout based on system and metric structure
Weave Cortex
• highly available, horizontally scalable Prometheus
• built on top of Prometheus core APIs
• OSS - https://github.com/weaveworks/cortex
• part of Weave Cloud SaaS
– remote-write metrics to it from your prom install
– point your grafana at it
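For the 'point your grafana at it' step, one option is a datasource provisioning file. This is a sketch using Grafana's standard Prometheus datasource type; the URL, auth header and token are placeholders, not Weave Cloud specifics:

  # Grafana datasource provisioning (sketch): Cortex speaks the Prometheus
  # query API, so it is registered like any other Prometheus datasource.
  apiVersion: 1
  datasources:
    - name: weave-cloud
      type: prometheus
      access: proxy
      url: https://<your-weave-cloud-or-cortex-endpoint>/api/prom
      jsonData:
        httpHeaderName1: Authorization
      secureJsonData:
        httpHeaderValue1: Bearer <your-service-token>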
Prometheus with Weave Cloud
[Diagram: apps in YOUR CLUSTER sending metrics to WEAVE CLOUD]
Weave Cortex - what’s new?
Correctness: bugs squashed
Stability: fast, lossless upgrades; separation of read and write paths
Performance: gRPC, varbit, hotspot-free, better indexing; 3x faster than our standalone Prom
Features: interactive query builder, notebooks, alerts, grafana integration
Thanks! Questions?
• Try Weave Cloud! - weave.works/guides/
• Join the Weave user group! - meetup.com/pro/Weave/
• Talk to us! - weave.works/help
• Work with us! - weave.works/company/hiring
Credits: slides by Jonathan Lange and Tom Wilkie @ Weaveworks
