Monitoring Akka with Kamon 1.0

Dr. Steffen Gebert

Insights into the inner workings of an application become crucial at the latest when performance and scalability issues are encountered. Gaining them is especially challenging in distributed systems, such as those using Akka Cluster.
A popular open-source solution for monitoring on the JVM in general, and Akka in particular, is Kamon. With its recently reached 1.0 milestone, it provides both metrics collection and tracing for Akka applications, whether they run standalone or distributed.
This talk gives an introduction to Kamon 1.0 with a focus on its metrics features. It describes the basic setup using Prometheus and Grafana and gives an overview of the different modules and their APIs for implementing custom metrics. The resulting setup records both the metrics that are automatically exposed about Akka’s actor systems and metrics tailored to the monitored application’s domain and service level indicators.
Finally, lessons learned from a first-time user’s experience of getting started with Kamon will be reported. The example of adding instrumentation to EMnify’s mobile core application illustrates how easy it is to get started and how to kill Prometheus on a daily basis.
Abstract

• Steffen
• has a heart beating for infrastructure
• writes code at EMnify
• PhD in computer science, topic: software-based networks
• EMnify
• MVNO focussed on IoT
• runs virtualized mobile core network
• Würzburg/Berlin, Germany
About Me & Us
@StGebert
Slides available at st-g.de/speaking

• Kamon Overview
• Metrics Instrumentation
• Setup: Kamon with Prometheus & Grafana
• Experience at EMnify
• Summary
Agenda

• Our application is slow
• Nagios did not tell us
• APM did
Application Performance Monitoring

Kamon
• Open Source
• Monitoring for the JVM
• Integrations for Akka
• Release 1.0 in January 2018
kamon.io / github.com/kamon-io

• Tracing
• Per-request call graph
• Context propagation across nodes
• Exemplary objectives:
• Request profiling
• Understanding call graph
• Metrics
• Time series data
• Counters / gauges / distributions
• Function call counts and latency
• Open DB connections
• User logins
• Generated revenue
Kamon: Feature Set

• Custom Metrics
• added to your code where it makes sense
• Automatic Instrumentation
• integrations into Akka, Akka HTTP, Play, JDBC, Servlet
• system and JVM metrics
Metrics

• Counter
• function calls
• customer buying our product
• Gauge
• number of open DB connections
• mailbox size
Custom Metric Types

• Histogram
• latencies
• shopping cart total prices
• Timer
• latencies
• RangeSampler
• number of open DB connections
• mailbox size
Custom Metric Types (2)
[Figure: histogram of a single sample; x-axis: value (10 to 50), y-axis: observations]
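A gauge only reports the value at the moment it is read, so a spike between two reads goes unnoticed. A range sampler is meant to close that gap by also tracking the lowest and highest values seen in between. The sketch below is a hypothetical, single-threaded illustration of that idea; the class and method names are made up for this example and it does not reproduce Kamon's RangeSampler implementation.

```java
/** Hypothetical illustration of the range-sampler idea; not Kamon's implementation. */
final class RangeSamplerSketch {
    private long current = 0;
    private long min = 0;
    private long max = 0;

    /** Call when a tracked resource (e.g. a DB connection) is opened. */
    void increment() {
        current++;
        max = Math.max(max, current);
    }

    /** Call when a tracked resource is closed. */
    void decrement() {
        current--;
        min = Math.min(min, current);
    }

    /** Returns {min, max, current} seen since the previous call, then restarts the range. */
    long[] sample() {
        long[] snapshot = {min, max, current};
        min = current;
        max = current;
        return snapshot;
    }
}
```

If three connections are opened and two are closed between two samples, `sample()` reports a maximum of 3 even though only one connection is still open, which is the information a plain gauge would have lost.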

• Kamon.counter("hello.krakow").increment();
• Histogram hist = Kamon.histogram("age");
  // record individual observations
  hist.record(33);
  hist.record(21);
• CounterMetric c = Kamon.counter("participants");
  // one counter per tag value: conference=react, conference=scala
  Counter cReact = c.refine("conference", "react");
  Counter cScala = c.refine("conference", "scala");
  cReact.increment(42);
Custom Metrics: Implementation

• Actor system metrics
• processed messages
• active actors
• unhandled messages
• dead letters
• Per actor performance metrics
• processing time (per message)
• time in mailbox
• mailbox sizes
• errors
Kamon Akka
[Diagram: Actors A, B, and C, each with its own mailbox, and a message]

• Metrics related to
• routers
• dispatchers
• executors
• actor groups
• remoting (with kamon-akka-remote)
• Requirement (AOP)
• AspectJ Weaver or
• Kanela (Kamon Agent)
Kamon Akka (2)

Kamon + Prometheus + Grafana
Setup

Related Projects
Targets Time Series DB Dashboard
simple_client
DropWizard Metrics
Micrometer
Commercial Tools
Datadog, Dynatrace, Instana, NewRelic, etc.

• Time Series Database
• collection, storage & query of metrics data
• based on Google's Borgmon, CNCF project
• Pull-based model
• scrapes configured targets
• HTTP endpoints on monitored targets
• Easy deployment
• statically linked Golang binaries
• single YAML config file
• Alertmanager... for alerting ;-)
Prometheus

• Integrated time series database
• on disk, no external dependency
• fixed retention period, no long-term storage / downsampling
• very efficient storage [1]
• query language PromQL
Prometheus TSDB
[1] Storing 16 bytes at scale, Fabian Reinartz @ PromCon 2017

Setup
[Diagram: targets (Application, Node Exporter, cAdvisor), found via service discovery (AWS EC2, Kubernetes, etc.) → Time Series DB → Dashboard]

• Exporter output (scraped by Prom via HTTP):
myapp_checkouts{product="sim_4ff"} 42.0
myapp_checkouts{product="sim_embedded"} 5412.0
akka_system_dead_letters_total{system="test"} 224.0
…
• Querying with PromQL
rate(akka_system_dead_letters_total[5m]) 0
// handles counter resets / overflows
Ingesting & Querying
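What "handles counter resets" means can be shown with a small calculation. The sketch below is a simplified, hypothetical illustration and not Prometheus code: it sums the increases between neighbouring samples and treats any decrease as a restart from zero. The real `rate()` additionally extrapolates to the window boundaries, which is left out here; the class and method names are made up.

```java
/** Simplified illustration of a reset-aware per-second rate; not the Prometheus implementation. */
final class CounterRateSketch {

    /**
     * @param timestampsSeconds sample times in seconds, ascending
     * @param values            counter values observed at those times
     * @return average increase per second over the covered time span
     */
    static double rate(long[] timestampsSeconds, double[] values) {
        if (values.length < 2 || values.length != timestampsSeconds.length) {
            throw new IllegalArgumentException("need at least two samples with matching timestamps");
        }
        double increase = 0.0;
        for (int i = 1; i < values.length; i++) {
            double delta = values[i] - values[i - 1];
            // A counter never decreases on its own, so a drop means the process restarted
            // and the counter began again at zero: the whole new value counts as increase.
            increase += (delta >= 0) ? delta : values[i];
        }
        long elapsed = timestampsSeconds[timestampsSeconds.length - 1] - timestampsSeconds[0];
        if (elapsed <= 0) {
            throw new IllegalArgumentException("timestamps must span a positive duration");
        }
        return increase / elapsed;
    }
}
```

For samples 10, 40, 5, 30 taken 10 seconds apart, the drop from 40 to 5 is counted as an increase of 5, giving a total increase of 60 over 30 seconds, that is 2 per second.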

• Just a frontend to supply PromQL queries and build dashboards
• Kamon Akka dashboard available at grafana.com/dashboards/4469
Grafana

EMnify's Experience with Kamon

• Tick interval (Kamon) and scrape frequency (Prometheus)
• both should match!
• usually (?) 30s or 60s
• for load tests, we went for 5s
• hope to go for 15s in production
• Deployment [for development / load tests]
• EC2 instances tagged in CloudFormation plus EC2 service discovery
• started simple (stupid): Prometheus in container on AWS ECS with EFS
Our Experiences with Kamon+Prometheus
Docker automated build config github.com/EMnify/prometheus-docker

• Little CPU resources + NFS storage + high cardinality = a dead Prometheus
• High cardinality?
• akka_actor_processing_time_seconds_bucket{
    class="com.example.SomethingFrequentlyUsed",
    le="0.33", …
    path="mysystem/some-supervisor/$aX"}
How to Kill Prometheus (Regularly)
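The effect of such a label is easy to underestimate. In the Prometheus exposition format, every label combination of a histogram produces one `_bucket` series per bucket bound (including `le="+Inf"`) plus a `_sum` and a `_count` series. The following back-of-the-envelope sketch multiplies that out; the actor and bucket counts are made-up numbers for illustration, not measurements from the talk.

```java
/** Back-of-the-envelope estimate of how many time series one histogram metric produces. */
final class CardinalitySketch {

    /**
     * @param labelCombinations number of distinct label sets, e.g. one per actor path
     * @param finiteBuckets     number of configured bucket bounds, excluding +Inf
     */
    static long histogramSeries(long labelCombinations, int finiteBuckets) {
        // Per label set: finite buckets + the +Inf bucket + _sum + _count.
        return labelCombinations * (finiteBuckets + 3L);
    }

    public static void main(String[] args) {
        // Hypothetical numbers: 2,000 short-lived, unnamed actors and 10 bucket bounds.
        System.out.println(histogramSeries(2_000, 10)); // prints 26000
        // The same actors reported as a single group.
        System.out.println(histogramSeries(1, 10));     // prints 13
    }
}
```

Every restart of an unnamed actor creates a new `path` value, so the first number keeps growing over time, while the grouped variant stays constant.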

• Define actor groups
kamon.akka.actor-groups += "mygroup"
kamon.util.filters {
  "akka.tracked-actor" {
    excludes = ["mysystem/some-supervisor/*"]
  }
  mygroup {
    includes = ["mysystem/some-supervisor/*"]
  }
}
• Delete Prometheus data to recover
• Continue to watch out for metrics with unnamed actors
How to Fix Kamon to Not Kill Prometheus

• Limit the number of samples per scrape:
<scrape_config>
# Per-scrape limit on number of scraped samples that will be accepted.
[ sample_limit: <int> | default = 0 ]
• Watch for limit kicking in:
prometheus_target_scrapes_exceeded_sample_limit_total
How to Fix Prometheus to Not Kill Itself

• Hosted service
• by Kamon developers
• currently in private beta
• no price tags, yet
• Great user experience for us
• tailored to Akka monitoring
• distributions over time
• still, few rough edges
Kamino Hosted Service
Targets Time Series DB Dashboard

Example: Fixing a Bottleneck
[Chart annotated with a restart and a deployment]

• Kamon offers wide range of APM features
• customized and automated metric collection
• works with both on-prem/OSS and SaaS "backends"
• super friendly community, thanks Ivan!
• distributed tracing
• Monitor your application (from the inside!)
• now!
• better start small
Summary & Conclusion

Find me at the Speaker’s Roundtable
Questions, please!

• Data Collection
• Core
• Akka
• Akka Remote
• Akka HTTP
• Play
• JDBC
• Executors
• System Metrics
• Reporting
• Metrics: Prometheus, Kamino (WIP: Datadog, InfluxDB, statsd)
• Tracing: Zipkin, Jaeger, Kamino
• Logs: Logback
• Context Propagation
• Akka Remote, Akka HTTP, Play
• http4s
Kamon: Modules

Setup with Kamon
[Diagram: inside the JVM, your application (port 80) runs alongside Kamon and kamon-prometheus (port 9095). Prometheus (port 9090; retrieval, storage, PromQL) scrapes kamon-prometheus and the Node Exporter (port 9100). Grafana (*magic*) uses Prometheus as its data source.]

Kamon.histogram(
    "datavolume",
    MeasurementUnit.information().gigabytes(),
    DynamicRange.apply(
        0,     // lowestDiscernibleValue
        10000, // highestTrackableValue
        2      // significantValueDigits
    )
);
Measurement Units / Dynamic Ranges

• Kamon core trackable values
• highest trackable values for range sampler / histogram
• can be adjusted per metric
• Default Prometheus histogram buckets might not fit
• global default can be adjusted
• PR pending for overriding per metric [1]
Adjusting Value Ranges / Aggregation
[1] kamon-io/kamon-prometheus#12

Histograms
[Figures: a histogram over time (value vs. t, values 10 to 50) and a histogram of a single sample (observations vs. value, 10 to 50)]
• Describe values better than avg/min/max do
• Can be aggregated across nodes
• Usually, percentiles/quantiles are computed
• Xth percentile: X% of the values are lower than <n>
• Median (=50th percentile)
• SLO/SLA candidates: 90th/95th/99th percentile of response times
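Both properties, aggregation across nodes and percentile computation, follow from the bucket structure. The sketch below is a hypothetical illustration under simplifying assumptions: both nodes use identical bucket bounds, counts are per bucket (not cumulative), and the percentile is reported as the upper bound of the bucket that contains the requested rank. The class, method names, and sample counts are made up.

```java
/** Illustrative bucket-based histogram helpers; names and data are hypothetical. */
final class HistogramPercentileSketch {

    /** Merges two histograms with identical bucket bounds by adding their counts. */
    static long[] merge(long[] a, long[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("bucket layouts differ");
        }
        long[] merged = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            merged[i] = a[i] + b[i];
        }
        return merged;
    }

    /** Nearest-rank percentile: upper bound of the first bucket whose cumulative count reaches the rank. */
    static double percentile(double[] upperBounds, long[] counts, double p) {
        if (upperBounds.length != counts.length) {
            throw new IllegalArgumentException("bounds and counts differ in length");
        }
        long total = 0;
        for (long c : counts) {
            total += c;
        }
        if (total == 0) {
            throw new IllegalArgumentException("no observations");
        }
        long rank = Math.max(1, (long) Math.ceil(p * total / 100.0));
        long seen = 0;
        for (int i = 0; i < counts.length; i++) {
            seen += counts[i];
            if (seen >= rank) {
                return upperBounds[i];
            }
        }
        return upperBounds[upperBounds.length - 1];
    }

    public static void main(String[] args) {
        double[] bounds = {10, 20, 30, 40, 50};
        long[] nodeA = {1, 2, 3, 2, 1};
        long[] nodeB = {0, 1, 4, 4, 2};
        long[] cluster = merge(nodeA, nodeB); // {1, 3, 7, 6, 3}
        System.out.println(percentile(bounds, cluster, 50)); // prints 30.0
        System.out.println(percentile(bounds, cluster, 95)); // prints 50.0
    }
}
```

Averages cannot be combined this way: the average of two per-node 95th percentiles is not the cluster-wide 95th percentile, whereas summed bucket counts still describe the full distribution.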

https://github.com/improbable-eng/thanos
https://www.slideshare.net/BartomiejPotka/thanos-global-durable-prometheus-monitoring
Thanos: Prometheus Long-Term Storage

Our Prometheus Config
global:
  scrape_interval: 5s
  scrape_timeout: 5s
  evaluation_interval: 1m
scrape_configs:
  - job_name: prometheus
    scrape_interval: 5s
    scrape_timeout: 5s
    metrics_path: /metrics
    scheme: http
    static_configs:
      - targets:
          - localhost:9090
  - job_name: kamon
    scrape_interval: 5s
    scrape_timeout: 5s
    metrics_path: /metrics
    scheme: http
    sample_limit: 5000
    ec2_sd_configs:
      - region: eu-west-1
        refresh_interval: 1m
        port: 9095
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Environment]
        separator: ;
        regex: (.*)
        target_label: environment
        replacement: $1
        action: replace
      - source_labels: [__meta_ec2_private_ip]
        separator: ;
        regex: (.*)
        target_label: __address__
        replacement: ${1}:9095
        action: replace
      - source_labels: [__meta_ec2_tag_Name]
        separator: ;
        regex: (.*)
        target_label: instance
        replacement: ${1}:9095
        action: replace
      - source_labels: [__meta_ec2_instance_id]
        separator: ;
        regex: (.*)
        target_label: instance_id
        replacement: $1
        action: replace
      - source_labels: [__meta_ec2_tag_Platform]
        separator: ;
        regex: akka
        target_label: platform
        replacement: $1
        action: keep
      - source_labels: [__meta_ec2_tag_AkkaApplication]
        separator: ;
        regex: (.*)
        target_label: akka_application
        replacement: $1
        action: replace
      - source_labels: [__meta_ec2_tag_AkkaRole]
        separator: ;
        regex: (.*)
        target_label: akka_role
        replacement: $1
        action: replace
