Oliver Moser, September 2017
Monitoring Stuff with Prometheus
Who am I
• Working at A1 and TAG for quite a while
• some Big Data
• DevOps/SRE/Containers/Orchestration
Fundamentals
Why do I need monitoring at all?
• Know when stuff breaks (and act on it)
• Understand the performance characteristics of your applications
• Meet service level objectives
• Improve performance and reliability
Fundamentals: Push vs Pull
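• Push: the Metrics Agent of the HTTP Server sends events (success_event, error_event, ...) to the Monitoring Service
• Pull: the Monitoring Service requests GET /metrics from the Metrics Agent of the HTTP Server and reads the current values (req_total, req_latency)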
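The pull model can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not the Prometheus client library: the names `METRICS`, `render_metrics`, `parse_metrics`, `MetricsHandler` and `scrape` are hypothetical, and the text format is reduced to plain `name value` lines.

```python
from http.server import BaseHTTPRequestHandler
from urllib.request import urlopen

# Hypothetical in-process metric values that the service updates itself.
METRICS = {"req_total": 0, "req_latency": 0.0}


def render_metrics(metrics):
    """Render {name: value} as one 'name value' line per metric."""
    return "".join(f"{name} {value}\n" for name, value in sorted(metrics.items()))


def parse_metrics(text):
    """Parse the lines produced by render_metrics back into a dict of floats."""
    result = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, value = line.rsplit(" ", 1)
        result[name] = float(value)
    return result


class MetricsHandler(BaseHTTPRequestHandler):
    """Service side: exposes the current metric values on GET /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def scrape(url):
    """Monitoring side: pull the current values from a target."""
    with urlopen(url, timeout=5) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

To try it, serve `MetricsHandler` with `http.server.HTTPServer(("127.0.0.1", 8000), MetricsHandler).serve_forever()` in one process and call `scrape("http://127.0.0.1:8000/metrics")` from another. The port is an arbitrary choice.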
Fundamentals: Blackbox vs. Whitebox Monitoring
Blackbox Monitoring
• restricted to external service behaviour
• e.g. ping, http
• Agent → HTTP Server: HTTP GET / → 200 OK
Whitebox Monitoring
• shows internal service information
• e.g. request_processing_time, request_errors_total
• Agent → HTTP Server: GET /metrics → errors_total, req_total, req_latency
Prometheus Project Overview
• Open-source monitoring tool, originally built at SoundCloud
• Heavily inspired by Google’s Borgmon
• Written in Go (mainly)
• Very active community (150+ contributors for core Prometheus)
• Member of the Cloud Native Computing Foundation
Prometheus Architecture
source: https://prometheus.io/docs/introduction/overview/
Prometheus Server
Service Discovery
• Static environments can get by without service discovery (SD)
• In dynamic environments (e.g. Kubernetes) you must use SD
• Pods, Services, ... come and go → impossible to configure statically
[Diagram: API Server with Pod 1, Pod 2, ..., Pod n]
Jobs, Targets and Exporters
Graphing and Visualization
Alertmanager
• Alerts are configured in Prometheus and, once triggered, forwarded to Alertmanager
• Alertmanager handles
• Notifications (SMS, Slack, PagerDuty, Email etc.)
• Deduplication/Grouping
• Silencing
• Inhibition
Prometheus Server → Alertmanager: forward 'Instance Down'
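Deduplication and grouping can be illustrated with a small sketch. It assumes alerts are plain dictionaries with a `labels` mapping and that two alerts with identical label sets count as the same alert. The function name `group_alerts` and the `group_by` parameter are chosen for this example. Silencing and inhibition are not covered.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Deduplicate alerts by label set and group them by selected labels.

    alerts:   list of dicts, each with a "labels" dict
    group_by: label names that decide which group an alert belongs to
    """
    groups = defaultdict(dict)
    for alert in alerts:
        labels = alert["labels"]
        # Alerts sharing these label values end up in one notification group.
        group_key = tuple((name, labels.get(name, "")) for name in group_by)
        # Identical label sets collapse into a single entry (deduplication).
        fingerprint = tuple(sorted(labels.items()))
        groups[group_key][fingerprint] = alert
    return {key: list(members.values()) for key, members in groups.items()}
```

Grouping by `["alertname"]` would turn many 'Instance Down' alerts from different instances into a single notification that lists the affected instances.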
Datamodel
Timeseries
• Tracks values of a metric over time (timestamp t, value v)
• Timestamps increase (strictly) monotonically
• $t_n < t_{n+1} \quad \forall n \in \mathbb{N}$
• Values can increase or decrease
(t1,v1) (t2,v2) (t3,v3)
time
Metric Types
• Only relevant for Client libs (only untyped timeseries on the server)
• Counters: Values only increase (or reset to zero on restart)
• requests_total, errors_total
• Gauges: Values can increase and decrease
• users_online, memory_free_bytes
• Histogram: Puts your measurements in buckets
• requests_latency_seconds_bucket
• Summary: Calculates percentiles over a sliding time window
• requests_latency_seconds_summary
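The difference between the first three metric types can be made concrete with a stripped-down sketch. These classes are illustrative stand-ins written for this example, not the API of any Prometheus client library. The histogram uses cumulative buckets, so an observation counts towards every bucket whose upper bound it does not exceed. Summaries are left out.

```python
class Counter:
    """A value that only ever goes up."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """A value that can go up and down."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value

    def inc(self, amount=1.0):
        self.value += amount

    def dec(self, amount=1.0):
        self.value -= amount


class Histogram:
    """Counts observations in cumulative buckets."""

    def __init__(self, upper_bounds):
        # The last bucket catches everything ("+Inf").
        self.upper_bounds = sorted(upper_bounds) + [float("inf")]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        self.sum += value
        self.count += 1
        for i, bound in enumerate(self.upper_bounds):
            if value <= bound:
                self.bucket_counts[i] += 1
```

With bounds `[0.1, 0.5, 1.0]`, a 0.3 second request would be counted in the 0.5, 1.0 and +Inf buckets but not in the 0.1 bucket.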
Labels
• A list of key/value pairs $(k_1 = v_1, k_2 = v_2, \ldots, k_n = v_n)$
• Labels partition a metric into timeseries
• Every distinct label combination of a given metric creates its own timeseries
req_total{job="job1", ver="0.1"}: 10 → timeseries_1
req_total{job="job1", ver="0.2"}: 3 → timeseries_2
req_total{job="job2", ver="0.1"}: 4 → timeseries_3
req_total{job="job1", ver="0.1"}: 12 → timeseries_1
req_total{job="job1"}: 1 → timeseries_4
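How labels partition a metric can be sketched with a dictionary keyed by metric name plus the sorted label pairs. `SeriesStore` and `series_key` are hypothetical names for this illustration, and the timestamps in the usage example are made up. This says nothing about how the Prometheus storage engine is actually implemented.

```python
from collections import defaultdict


def series_key(metric_name, labels):
    """Identity of a timeseries: metric name plus its full label set."""
    return (metric_name, tuple(sorted(labels.items())))


class SeriesStore:
    """Keeps one list of (timestamp, value) samples per label combination."""

    def __init__(self):
        self._series = defaultdict(list)

    def append(self, metric_name, labels, timestamp, value):
        self._series[series_key(metric_name, labels)].append((timestamp, value))

    def samples(self, metric_name, labels):
        return list(self._series.get(series_key(metric_name, labels), []))

    def series_count(self):
        return len(self._series)


store = SeriesStore()
store.append("req_total", {"job": "job1", "ver": "0.1"}, 1, 10)
store.append("req_total", {"job": "job1", "ver": "0.2"}, 1, 3)
store.append("req_total", {"job": "job2", "ver": "0.1"}, 1, 4)
store.append("req_total", {"job": "job1", "ver": "0.1"}, 2, 12)
store.append("req_total", {"job": "job1"}, 2, 1)
```

The five samples land in four timeseries: the first and fourth share the same label set and therefore the same series, while the sample without a `ver` label opens a series of its own.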
Overall Data Model
events_processed_total
{component="enricher-deployment",
instance="10.244.2.98:8080",
job="geo-enricher",
namespace="geo-analytics",
version="1.9.4-r5"} : 131241535
• metric name: events_processed_total
• labels: the pairs in braces, each of the form label key="label value"
• metric value: 131241535
PromQL
• Query language to select and aggregate timeseries
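• Queries evaluate to one of
• Instant Vector: Multiple timeseries, one sample each, all with the same timestamp
• requests_total{job="prom-boot"}
• Range Vector: Multiple timeseries, each with samples over a range of timestamps
• requests_total{job="prom-boot"}[5m]
• rate(requests_total{job="prom-boot"}[5m]) takes this range vector and returns an instant vector
• Scalar: A single numeric value
• scalar(sum(requests_total)) (sum(requests_total) alone is an instant vector with one element)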
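What a rate over a range such as `[5m]` computes can be approximated as follows. This is a simplified sketch: the helper names `increase` and `simple_rate` are chosen for this example, samples are `(timestamp_seconds, value)` tuples, and a drop in value is treated as a counter reset to zero. The real PromQL `rate()` additionally extrapolates to the window boundaries, which is omitted here.

```python
def increase(samples):
    """Total counter increase across samples, tolerating resets to zero."""
    total = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # The counter restarted; everything counted since then is new.
            total += current
    return total


def simple_rate(samples):
    """Average per-second increase over the time covered by the samples."""
    if len(samples) < 2:
        return None  # a rate needs at least two samples
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be strictly increasing")
    return increase(samples) / elapsed
```

For samples `[(0, 0), (150, 30), (300, 60)]` the counter grew by 60 within 300 seconds, so `simple_rate` returns 0.2 requests per second.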
Alerting Rules
ALERT ErrorRateHigh
IF rate(errors_total[5m]) > 10
LABELS {
service = "service_1",
severity = "warning"
}
ANNOTATIONS {
summary = "A high number of errors in service",
description = "{{ $value }} errors per second have been registered over the last 5 minutes for {{ $labels.instance }}"
}
Demo Time
Thanks!


Prometheus Introduction (InfraCoders Vienna)