Monitoring Akka with Kamon 1.0
Dr. Steffen Gebert
Insights into the inner workings of an application
become crucial at the latest when performance and
scalability issues are encountered. Gaining them is
especially challenging in distributed systems, such
as applications running on Akka Cluster.
A popular open-source solution for monitoring on the
JVM in general, and Akka in particular, is Kamon. With
its recently reached 1.0 milestone, it offers both
metrics collection and tracing for Akka applications,
whether they run standalone or distributed.
This talk gives an introduction to Kamon 1.0 with a
focus on its metrics features. It describes the basic
setup using Prometheus and Grafana and gives an
overview of the different modules and their APIs for
implementing custom metrics. The resulting setup
records both the automatically exposed metrics about
Akka’s actor systems and metrics tailored to the
monitored application’s domain and service level
indicators.
Finally, lessons learned from a first-time user’s
experience of getting started with Kamon are reported.
The example of adding instrumentation to EMnify’s
mobile core application illustrates how easy it is to
get started and how to kill Prometheus on a daily
basis.
Abstract
• Steffen
• has a heart beating for infrastructure
• writes code at EMnify
• PhD in computer science, topic: software-based networks
• EMnify
• MVNO focussed on IoT
• runs virtualized mobile core network
• Würzburg/Berlin, Germany
About Me & Us
@StGebert
Slides available at
• Kamon Overview
• Metrics Instrumentation
• Setup: Kamon with Prometheus & Grafana
• Experience at EMnify
• Summary
Agenda

• Our application is slow
• Nagios did not tell us
• APM did
Application Performance Monitoring
• Open Source
• Monitoring for the JVM
• Integrations for Akka
• Release 1.0 in January 2018
Kamon

Exemplary Trace
• Tracing
  • Per-request call graph
  • Context propagation across nodes
  • Exemplary objectives:
    • Request profiling
    • Understanding call graph
• Metrics
  • Time series data
  • Counters / gauges / distributions
  • Exemplary objectives:
    • Function call counts and latency
    • Open DB connections
    • User logins
    • Generated revenue
Kamon: Feature Set
• Custom Metrics
  • added to your code where it makes sense
• Automatic Instrumentation
  • integrations into Akka, Akka HTTP, Play, JDBC, Servlet
  • system and JVM metrics
Metrics
• Counter
  • function calls
  • customers buying our product
• Gauge
  • number of open DB connections
  • mailbox size
Custom Metric Types

• Histogram
  • latencies
  • shopping cart total prices
• Timer
  • latencies
• RangeSampler
  • number of open DB connections
  • mailbox size
Custom Metric Types (2)
[Figure: histogram of a single sample; x-axis: value (10 to 50), y-axis: observations]
• Kamon.counter("hello.krakow").increment();
• Histogram hist = Kamon.histogram("age");
  hist.record(33);
  hist.record(21);
• CounterMetric c = Kamon.counter("participants");
  Counter cReact = c.refine("conference", "react");
  Counter cScala = c.refine("conference", "scala");
  cReact.increment(42);
Custom Metrics: Implementation
• Actor system metrics
  • processed messages
  • active actors
  • unhandled messages
  • dead letters
• Per-actor performance metrics
  • processing time (per message)
  • time in mailbox
  • mailbox sizes
  • errors
Kamon Akka
[Diagram: Actors A, B, and C, each with its mailbox, and a message]
• Metrics related to
  • routers
  • dispatchers
  • executors
  • actor groups
  • remoting (with kamon-akka-remote)
• Requirement (AOP)
  • AspectJ Weaver or
  • Kanela (Kamon Agent)
Kamon Akka (2)

Kamon + Prometheus + Grafana
Setup
Related Projects
• Pipeline stages shown: Targets, Time Series DB, Dashboard
• Metrics libraries: simple_client, DropWizard Metrics, Micrometer
• Commercial Tools: Datadog, Dynatrace, Instana, NewRelic, etc.
• Time Series Database
  • collection, storage & query of metrics data
  • inspired by Google's Borgmon, CNCF project
• Pull-based model
  • scrapes configured targets
  • HTTP endpoints on monitored targets
• Easy deployment
  • statically linked Go binaries
  • single YAML config file
• Alertmanager... for alerting ;-)
Prometheus
• Integrated time series database
• on disk, no external dependency
• fixed retention period, no long-term storage / downsampling
• very efficient storage [1]
• query language PromQL
Prometheus TSDB
[1] Storing 16 bytes at scale, Fabian Reinartz @ PromCon 2017

Setup
[Diagram: targets (Application, Node Exporter, cAdvisor) found via Service Discovery (AWS EC2, Kubernetes, etc.), feeding the Time Series DB and the Dashboard]
• Exporter output (scraped by Prometheus via HTTP):
  myapp_checkouts{product="sim_4ff"} 42.0
  myapp_checkouts{product="sim_embedded"} 5412.0
  akka_system_dead_letters_total{system="test"} 224.0
  …
• Querying with PromQL
  rate(akka_system_dead_letters_total[5m])
  (rate() handles counter resets / overflows)
Ingesting & Querying
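To make "handles counter resets" concrete, here is a minimal sketch of a per-second rate over scraped counter samples. It is a simplified model and not Prometheus' actual implementation (which also extrapolates to the window boundaries); the class name, method name, and sample values are hypothetical.

```java
/** Simplified model of a per-second counter rate that tolerates counter resets. */
final class CounterRate {

    /**
     * @param timestampsSeconds scrape times in seconds, ascending
     * @param values            raw counter values observed at those times
     */
    static double ratePerSecond(double[] timestampsSeconds, double[] values) {
        if (values.length < 2 || values.length != timestampsSeconds.length) {
            return 0.0; // not enough data to compute a rate
        }
        double increase = 0.0;
        for (int i = 1; i < values.length; i++) {
            double previous = values[i - 1];
            double current = values[i];
            // A drop means the counter restarted from zero (e.g. the JVM was redeployed),
            // so everything counted since the reset is treated as new.
            increase += (current >= previous) ? current - previous : current;
        }
        double elapsed = timestampsSeconds[values.length - 1] - timestampsSeconds[0];
        return elapsed > 0 ? increase / elapsed : 0.0;
    }

    public static void main(String[] args) {
        double[] t = {0, 60, 120, 180};
        double[] deadLetters = {200, 224, 6, 30}; // reset between the 2nd and 3rd scrape
        System.out.println(ratePerSecond(t, deadLetters));
    }
}
```

In this example the increase is 24 + 6 + 24 = 54 dead letters over 180 seconds, although the raw counter value went down in between.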
• Just a frontend for running PromQL queries and building dashboards
• Kamon Akka dashboard available at
Grafana
EMnify's Experience with Kamon

• Tick interval (Kamon) and scrape frequency (Prometheus)
  • both should match!
  • usually (?) 30s or 60s
  • for load tests, we went for 5s
  • hope to go for 15s in production
• Deployment [for development / load tests]
  • EC2 instances tagged in CloudFormation plus EC2 service discovery
  • started simple (stupid): Prometheus in container on AWS ECS with EFS
Our Experiences with Kamon+Prometheus
Docker automated build config
• Little CPU resources + NFS storage + high cardinality =
• High cardinality?
  • akka_actor_processing_time_seconds_bucket{
      class="com.example.SomethingFrequentlyUsed",
      le="0.33", …
      path="mysystem/some-supervisor/$aX"}
How to Kill Prometheus (Regularly)
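The cardinality problem is simple arithmetic: every distinct combination of label values is its own time series, and a histogram multiplies that by its bucket count. A rough sketch, in which all numbers (20 buckets, 10,000 short-lived actor paths) are hypothetical and only the `_bucket` series are counted:

```java
/** Back-of-the-envelope count of time series produced by one per-actor histogram metric. */
final class SeriesCardinality {

    /** One series per (class, path, le) combination. */
    static long bucketSeries(long actorClasses, long pathsPerClass, long bucketsPerHistogram) {
        return Math.multiplyExact(
                Math.multiplyExact(actorClasses, pathsPerClass), bucketsPerHistogram);
    }

    public static void main(String[] args) {
        long buckets = 20; // hypothetical number of "le" buckets

        // One long-lived, named actor: a handful of series.
        System.out.println(bucketSeries(1, 1, buckets));      // prints 20

        // The same actor class spawned 10,000 times with generated names ($aX, $bX, ...):
        System.out.println(bucketSeries(1, 10_000, buckets)); // prints 200000
    }
}
```

Because unnamed actors get a fresh generated path each time they are created, the second number keeps growing for as long as the application runs.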
• Define actor groups += "mygroup"
  kamon.util.filters {
    "akka.tracked-actor" {
      excludes = ["mysystem/some-supervisor/*"]
    }
    mygroup {
      includes = ["mysystem/some-supervisor/*"]
    }
  }
• Delete Prometheus data to recover
• Continue to watch out for metrics with unnamed actors
How to Fix Kamon to Not Kill Prometheus
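The effect of the filter configuration can be sketched as follows: actors matching the glob are no longer tracked individually and are reported under the group name instead, so thousands of generated paths collapse into one label value. This is an illustrative model only, not Kamon's filter implementation; the glob semantics (`*` matching within one path segment) and all class and method names are assumptions.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

/** Illustrative model of grouping actor paths by a glob pattern (not Kamon's own code). */
final class ActorPathGlob {

    /** Translates a glob into a regex; assumption: '*' matches any characters except '/'. */
    static Pattern compile(String glob) {
        StringBuilder regex = new StringBuilder();
        for (char ch : glob.toCharArray()) {
            if (ch == '*') {
                regex.append("[^/]*");
            } else {
                regex.append(Pattern.quote(String.valueOf(ch)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    static boolean matches(String glob, String actorPath) {
        return compile(glob).matcher(actorPath).matches();
    }

    /** Label value a metric would carry: the group name if the path is grouped, else the path. */
    static String seriesKey(String actorPath) {
        return matches("mysystem/some-supervisor/*", actorPath) ? "mygroup" : actorPath;
    }

    public static void main(String[] args) {
        Set<String> keys = new LinkedHashSet<>();
        keys.add(seriesKey("mysystem/some-supervisor/$aX"));
        keys.add(seriesKey("mysystem/some-supervisor/$bX"));
        keys.add(seriesKey("mysystem/billing"));
        System.out.println(keys); // prints [mygroup, mysystem/billing]
    }
}
```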
• Limit the number of samples per scrape:
  <scrape_config>
  # Per-scrape limit on number of scraped samples that will be accepted.
  [ sample_limit: <int> | default = 0 ]
• Watch for limit kicking in:
  prometheus_target_scrapes_exceeded_sample_limit_total
How to Fix Prometheus to Not Kill Itself

Bonus: Kamino
• Hosted service
  • by Kamon developers
  • currently in private beta
  • no price tags yet
• Great user experience for us
  • tailored to Akka monitoring
  • distributions over time
  • still a few rough edges
Kamino Hosted Service
[Diagram: Targets, Time Series DB, Dashboard]
Per-Actor Metrics
Example: Fixing a Bottleneck
[Chart annotations: restart, deployment]

• Kamon offers a wide range of APM features
  • customized and automated metric collection
  • works with both on-prem/OSS and SaaS "backends"
  • super friendly community, thanks Ivan!
  • distributed tracing
• Monitor your application (from the inside!)
  • now!
  • better start small
Summary & Conclusion
Find me at the Speaker's Roundtable
Questions, please!

• Data Collection
  • Core
  • Akka
  • Akka Remote
  • Akka HTTP
  • Play
  • JDBC
  • Executors
  • System Metrics
• Reporting
  • Metrics: Prometheus, Kamino (WIP: Datadog, InfluxDB, statsd)
  • Tracing: Zipkin, Jaeger, Kamino
  • Logs: Logback
• Context Propagation
  • Akka Remote, Akka HTTP, Play
  • http4s
Kamon: Modules
Setup with Kamon
• JVM: Your Application (Port 80) with Kamon and Kamon-prometheus (Port 9095)
• Node Exporter (Port 9100)
• Prometheus (Port 9090): Retrieval (scrapes the targets), Storage, PromQL
• Grafana: Prometheus Data Source
Kamon.histogram(
  "datavolume",
  MeasurementUnit.information().gigabytes(),
  DynamicRange.apply(
    0, // lowestDiscernibleValue
    10000, // highestTrackableValue
    2 // significantValueDigits
  )
);
Measurement Units / Dynamic Ranges
Prometheus Architecture

• Kamon core trackable values
  • highest trackable values for range sampler / histogram
  • can be adjusted per metric
• Default Prometheus histogram buckets might not fit
  • global default can be adjusted
  • PR pending for overriding per metric [1]
Adjusting Value Ranges / Aggregation
[1] kamon-io/kamon-prometheus#12
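Why default buckets "might not fit" can be seen with a small sketch of Prometheus-style cumulative `le` buckets: each bucket counts all observations less than or equal to its upper bound, and anything above the highest finite bound is only visible in the `+Inf` bucket. The bucket bounds and values below are hypothetical, and the class is an illustration, not kamon-prometheus code.

```java
import java.util.Arrays;

/** Illustration of cumulative "le" buckets as exposed in the Prometheus text format. */
final class LeBuckets {

    /**
     * @return counts where counts[i] is the number of values <= upperBounds[i];
     *         the extra last slot is the +Inf bucket and counts every value.
     */
    static long[] cumulativeCounts(double[] upperBounds, double[] values) {
        long[] counts = new long[upperBounds.length + 1];
        for (double value : values) {
            for (int i = 0; i < upperBounds.length; i++) {
                if (value <= upperBounds[i]) {
                    counts[i]++;
                }
            }
            counts[upperBounds.length]++; // +Inf bucket
        }
        return counts;
    }

    public static void main(String[] args) {
        double[] bounds = {0.1, 1, 10};              // hypothetical bucket bounds
        double[] values = {0.05, 0.5, 50, 200, 900}; // e.g. data volumes, mostly above 10
        System.out.println(Arrays.toString(cumulativeCounts(bounds, values)));
        // prints [1, 2, 2, 5]: three of five values are only visible in +Inf
    }
}
```

With bounds like these, the values 50, 200, and 900 are indistinguishable from each other in the exported data, which is why the bucket configuration needs to match the metric's value range.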
Histograms
[Figures: histogram over time (value over t, shaded by observations from 0 to max); histogram of a single sample (observations per value, 10 to 50)]
• Describe a set of values better than avg/min/max do
• Can be aggregated across nodes
• Usually, percentiles/quantiles are computed from them
  • Xth percentile: X% of the values are lower than <n>
  • Median (= 50th percentile)
• SLO/SLA candidates: 90th/95th/99th percentile of response times
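As a small worked example of the percentile definition, the following sketch computes nearest-rank percentiles over raw response times and compares them with the mean. The nearest-rank method is just one common definition chosen for illustration (Prometheus itself estimates quantiles from bucket counts), and the sample values are made up.

```java
import java.util.Arrays;

/** Nearest-rank percentiles over raw values, for illustration only. */
final class Percentiles {

    /** Returns the smallest value such that at least p percent of all values are <= it. */
    static double percentile(double[] values, double p) {
        if (values.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need values and 0 < p <= 100");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Hypothetical response times in milliseconds with one slow outlier.
        double[] responseTimesMs = {12, 15, 11, 14, 13, 250, 16, 12, 18, 17};

        System.out.println(Arrays.stream(responseTimesMs).average().orElse(0)); // mean
        System.out.println(percentile(responseTimesMs, 50)); // prints 14.0
        System.out.println(percentile(responseTimesMs, 90)); // prints 18.0
        System.out.println(percentile(responseTimesMs, 99)); // prints 250.0
    }
}
```

The mean (about 37.8 ms) matches none of the requests, whereas the median and the 99th percentile show both the typical case and the outlier.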
Thanos: Prometheus Long-Term Storage
Thanos: Global Scale

Our Prometheus Config
global:
  scrape_interval: 5s
  scrape_timeout: 5s
  evaluation_interval: 1m
scrape_configs:
  - job_name: prometheus
    scrape_interval: 5s
    scrape_timeout: 5s
    metrics_path: /metrics
    scheme: http
    static_configs:
      - targets:
          - localhost:9090
  - job_name: kamon
    scrape_interval: 5s
    scrape_timeout: 5s
    metrics_path: /metrics
    scheme: http
    sample_limit: 5000
    ec2_sd_configs:
      - region: eu-west-1
        refresh_interval: 1m
        port: 9095
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Environment]
        separator: ;
        regex: (.*)
        target_label: environment
        replacement: $1
        action: replace
      - source_labels: [__meta_ec2_private_ip]
        separator: ;
        regex: (.*)
        target_label: __address__
        replacement: ${1}:9095
        action: replace
      - source_labels: [__meta_ec2_tag_Name]
        separator: ;
        regex: (.*)
        target_label: instance
        replacement: ${1}:9095
        action: replace
      - source_labels: [__meta_ec2_instance_id]
        separator: ;
        regex: (.*)
        target_label: instance_id
        replacement: $1
        action: replace
      - source_labels: [__meta_ec2_tag_Platform]
        separator: ;
        regex: akka
        target_label: platform
        replacement: $1
        action: keep
      - source_labels: [__meta_ec2_tag_AkkaApplication]
        separator: ;
        regex: (.*)
        target_label: akka_application
        replacement: $1
        action: replace
      - source_labels: [__meta_ec2_tag_AkkaRole]
        separator: ;
        regex: (.*)
        target_label: akka_role
        replacement: $1
        action: replace
