Lessons from Cloud
Scaling Prometheus
metrics in
Kubernetes with
The curious case of the missing metrics
One Label too far...
The Suspects
● Prometheus
● Kubernetes
● Gateway
● Queryd
scrape_interval: 15s
- job_name: prod_twodotoh
- role: service
[Diagram: Gateway and Queryd pods]
Problem: Prometheus Debugging is Hard
prometheus_target_sync_length_seconds{scrape_job="prod_twodotoh",quantile="0.01"} 0.012562015
prometheus_target_sync_length_seconds{scrape_job="prod_twodotoh",quantile="0.05"} 0.012562015
prometheus_target_sync_length_seconds{scrape_job="prod_twodotoh",quantile="0.5"} 0.012562015
prometheus_target_sync_length_seconds{scrape_job="prod_twodotoh",quantile="0.9"} 0.012562015
prometheus_target_sync_length_seconds{scrape_job="prod_twodotoh",quantile="0.99"} 0.012562015
prometheus_target_sync_length_seconds_sum{scrape_job="prod_twodotoh"} 0.012562015
prometheus_target_sync_length_seconds_count{scrape_job="prod_twodotoh"} 1
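One way to read the output above: when a summary has a count of 1, every quantile is just that single observation, so the numbers say very little about how target syncs behave. The following is a minimal Python sketch that checks for this case. The parser is a simplified assumption that handles only lines shaped like the ones shown here; it is not a full exposition-format parser.

```python
import re

# Sample lines copied from the scrape output above (a subset of the quantiles).
EXPOSITION = """\
prometheus_target_sync_length_seconds{scrape_job="prod_twodotoh",quantile="0.01"} 0.012562015
prometheus_target_sync_length_seconds{scrape_job="prod_twodotoh",quantile="0.99"} 0.012562015
prometheus_target_sync_length_seconds_sum{scrape_job="prod_twodotoh"} 0.012562015
prometheus_target_sync_length_seconds_count{scrape_job="prod_twodotoh"} 1
"""

# Simplified patterns: metric name, one label block, one numeric value.
LINE_RE = re.compile(r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse(text):
    """Return a list of (name, labels, value) tuples for matching lines."""
    samples = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def summarize(samples, base):
    """Collect the quantiles, sum, and count that belong to one summary metric."""
    quantiles = {l["quantile"]: v for n, l, v in samples if n == base and "quantile" in l}
    total = next(v for n, _, v in samples if n == base + "_sum")
    count = next(v for n, _, v in samples if n == base + "_count")
    return quantiles, total, int(count)


samples = parse(EXPOSITION)
quantiles, total, count = summarize(samples, "prometheus_target_sync_length_seconds")

# With exactly one observation, all quantiles collapse to the same value as the sum.
single_observation = count == 1 and all(v == total for v in quantiles.values())
print(f"count={count} sum={total} single observation: {single_observation}")
```

A check like this only flags the symptom. It does not explain why the sync was observed once.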
Problem: Prometheus Scaling is Hard
scrape_interval: 15s
- job_name: prod_twodotoh_ns_a
- role: service
- a
scrape_interval: 15s
- job_name: prod_twodotoh_ns_b
- role: service
- b
Solution: Isolation with Telegraf Sidecar
apiVersion: apps/v1
kind: Deployment
name: "gateway"
serviceName: "gateway"
replicas: 100
name: "gateway"
app: "gateway"
- name: "telegraf"
image: ""
- name: "gateway"
image: ""
urls = [""]
urls = ["$MONITOR_HOST"]
database = "$MONITOR_DATABASE"
timeout = "5s"
token = "$TOKEN"
organization = "$ORG"
bucket = "$BUCKET"
timeout = "5s"
namepass = ["internal"]
Solution: Isolation with Telegraf Sidecar
Problem: A Prometheus sample has one and only one value
scrape_interval: 15s
- job_name: prod_twodotoh
- role: service
- regex: user_agent
action: labeldrop
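To illustrate what the labeldrop rule above does to the data, here is a small Python simulation. It is only a sketch: the metric name `http_requests_total`, the `handler` label, and the user agent strings are hypothetical, and the function models just the label-removal step, not Prometheus itself. Prometheus anchors relabel regexes, which is why the sketch uses a full match.

```python
import re


def labeldrop(labels, pattern):
    """Remove every label whose name fully matches the regex."""
    rx = re.compile(pattern)
    return {k: v for k, v in labels.items() if not rx.fullmatch(k)}


def distinct(label_sets):
    """Return the set of unique label sets (one entry per time series)."""
    return {tuple(sorted(ls.items())) for ls in label_sets}


# Hypothetical series: the same handler hit by three different clients.
series = [
    {"__name__": "http_requests_total", "handler": "/query", "user_agent": "curl/7.64"},
    {"__name__": "http_requests_total", "handler": "/query", "user_agent": "telegraf/1.14"},
    {"__name__": "http_requests_total", "handler": "/query", "user_agent": "Go-http-client/1.1"},
]

before = distinct(series)
after = distinct(labeldrop(ls, "user_agent") for ls in series)

# Three series collapse onto one label set, and the user agent context is gone.
print(f"series before: {len(before)}, after labeldrop: {len(after)}")
collisions = len(series) - len(after)
print(f"label sets that now collide: {collisions}")
```

The collision count is the point to watch. Once the label is removed, the remaining labels no longer identify each series uniquely, and the information the label carried cannot be recovered downstream.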
Solution: Influx for more context
urls = [""]
string = ["user_agent"]
urls = ["$MONITOR_HOST"]
database = "$MONITOR_DATABASE"
timeout = "5s"
token = "$TOKEN"
organization = "$ORG"
bucket = "$BUCKET"
timeout = "5s"
namepass = ["internal"]
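The `string = ["user_agent"]` setting above looks like a tag-to-field conversion: the value is kept, but it stops being part of the series identity. The Python sketch below models that idea under stated assumptions. The `Point` class, the `series_key` definition (measurement plus tags), and the sample values are hypothetical stand-ins, not Telegraf or InfluxDB code.

```python
from dataclasses import dataclass


@dataclass
class Point:
    measurement: str
    tags: dict
    fields: dict


def tags_to_string_fields(point, names):
    """Move the named tags into string fields, leaving other tags untouched."""
    tags = {k: v for k, v in point.tags.items() if k not in names}
    fields = dict(point.fields)
    for name in names:
        if name in point.tags:
            fields[name] = str(point.tags[name])
    return Point(point.measurement, tags, fields)


def series_key(point):
    """Assumed series identity: measurement name plus the sorted tag set."""
    return (point.measurement, tuple(sorted(point.tags.items())))


# Hypothetical points scraped from one handler with three different clients.
points = [
    Point("http_requests", {"handler": "/query", "user_agent": ua}, {"count": 1.0})
    for ua in ("curl/7.64", "telegraf/1.14", "Go-http-client/1.1")
]

converted = [tags_to_string_fields(p, ["user_agent"]) for p in points]

keys_before = {series_key(p) for p in points}
keys_after = {series_key(p) for p in converted}

# Cardinality drops, yet each point still carries its user agent as a field.
print(f"series before: {len(keys_before)}, after: {len(keys_after)}")
print([p.fields["user_agent"] for p in converted])
```

Compared with dropping the label, this keeps the context on each point while removing it from the series identity.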
Problem: Is there a way to prevent this?
scrape_interval: 15s
- job_name: prod_twodotoh
- role: service
- regex: user_agent
action: labeldrop
Solution: Telegraf Guard Rails
urls = [""]
limit = 4
## List of tags to preferentially preserve
keep = ["handler", "method", "status"]
urls = ["$MONITOR_HOST"]
database = "$MONITOR_DATABASE"
timeout = "5s"
token = "$TOKEN"
organization = "$ORG"
bucket = "$BUCKET"
timeout = "5s"
namepass = ["internal"]
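The `limit` and `keep` settings above describe a guard rail: cap the number of tags per metric and prefer to preserve a known-good list. The Python sketch below shows one plausible reading of those semantics. It is an assumption, not Telegraf's implementation: in particular, the order in which non-preferred tags are removed (alphabetical here) is chosen only to make the example deterministic, and the `host`, `request_id`, and `user_agent` tags are hypothetical.

```python
def apply_tag_limit(tags, limit, keep):
    """Drop tags that are not in `keep` until at most `limit` tags remain.

    Tags listed in `keep` are never removed, so the result can still exceed
    `limit` if `keep` itself names more than `limit` tags that are present.
    """
    if len(tags) <= limit:
        return dict(tags)
    result = dict(tags)
    for key in sorted(tags):
        if len(result) <= limit:
            break
        if key not in keep:
            del result[key]
    return result


KEEP = ["handler", "method", "status"]  # from the config above
LIMIT = 4                               # from the config above

# Hypothetical metric that picked up three extra tags.
tags = {
    "handler": "/query",
    "method": "POST",
    "status": "200",
    "host": "gateway-0",
    "request_id": "abc123",
    "user_agent": "curl/7.64",
}

limited = apply_tag_limit(tags, LIMIT, KEEP)
print(sorted(limited))
```

Note that with a limit of 4 and three preferred tags, one non-preferred tag still survives. A guard rail of this kind bounds the number of tags per metric rather than choosing which extra tag is the useful one.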
Problem: Hard to Rotate Prom Passwords
scrape_interval: 15s
- job_name: prod_twodotoh
- role: service
bearer_token_file: /etc/hunter2
Solution: Per Pod Credentials
urls = [""]
bearer_token = "/etc/telegraf/hunter2"
Scaling is NOT More Manual Processes
Scaling is NOT saying “You’re Doing it Wrong”
Scaling IS Empowering Developers
Scaling IS Predictability of Failure Modes
The time when we were
Watching the watchers...
Problem: Am I scraping all the pods?
scrape_interval: 15s
- job_name: prod_twodotoh
- role: service
Solution: Telegraf K8s Inventory
url = ""
urls = ["$MONITOR_HOST"]
database = "$MONITOR_DATABASE"
timeout = "5s"
token = "$TOKEN"
organization = "$ORG"
bucket = "$BUCKET"
timeout = "5s"
namepass = ["internal"]
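The question "am I scraping all the pods?" comes down to comparing two sets: what the cluster inventory says exists and what actually reported metrics. A minimal Python sketch of that comparison follows. The pod names and the shape of both inputs are hypothetical; in practice the two lists would come from whatever inventory and scrape data are available.

```python
def scrape_coverage(inventory_pods, scraped_pods):
    """Compare the pods that exist with the pods that produced metrics."""
    expected = set(inventory_pods)
    seen = set(scraped_pods)
    missing = sorted(expected - seen)      # exist but were never scraped
    unexpected = sorted(seen - expected)   # scraped but not in the inventory
    ratio = len(expected & seen) / len(expected) if expected else 1.0
    return missing, unexpected, ratio


# Hypothetical inventory and scrape results.
inventory = ["gateway-0", "gateway-1", "gateway-2", "queryd-0"]
scraped = ["gateway-0", "gateway-2", "queryd-0"]

missing, unexpected, ratio = scrape_coverage(inventory, scraped)
print(f"missing: {missing}, unexpected: {unexpected}, coverage: {ratio:.0%}")
```

Tracking the coverage ratio over time turns a silent gap in scraping into something that can be alerted on.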
Prometheus Scraping Designs
Scaling even more
Scaling even more with Influx Enterprise
Scaling even more with Kafka and Influx
Core Idea
● Measure and test metrics scaling
○ Are you missing metrics?
● Decentralize metrics gathering
○ Consider metrics as part of the program
● Empower Developers
○ They know their metrics the best. Allow them local tooling control
First Order Conclusion
● Too easy to shoot yourself in the foot with Prometheus metrics.
● Too much in Prometheus needs operations heroes.
● Too difficult to express vital information about your program in Prometheus
without a ton of centralized control.
● One mistake can impact everyone.
Second Order Conclusion
● Prometheus is not descriptive enough.
● Extremely difficult to change over time.
● The metrics game is not a solved problem.
○ OpenTelemetry?
● Probably not one answer to everything.
● Flux into Telegraf
○ Processor for transformation
○ Moving the program near the data
○ Flux Output
○ Monitoring and alerting at edge
● Telegraf Flux scripts hosted in InfluxDB API
○ Runtime plugins without re-compiling
○ Sampling rules from server-side
■ Aggregation on server with input to client
● What else?
Thank You!
The time when collecting metrics impacted storage...
Measure, measure, measure
Problem: Prometheus metrics are heavy

