Log Analytics with ELK Stack
(Architecture for aggressive cost optimization and infinite data scale)
Denis D’Souza | 27th July 2019
About me...
● Currently a DevOps engineer at Moonfrog Labs
● 6+ years working as a DevOps engineer, SRE and Linux administrator
Worked on a variety of technologies in both service-based and
product-based organisations
● How do I spend my free time?
Learning new technologies and playing PC games
www.linkedin.com/in/denis-dsouza
• A Mobile Gaming Company making mass market social games
• 5M+ Daily Active Users, 15M+ Weekly Active Users
• Real-time, cross-platform games optimised for primary
market(s): India and the subcontinent
• Profitable!
Current Scale
Who we are?
1. Our business requirements
2. Choosing the right option
3. ELK Stack overview
4. Our ELK architecture
5. Optimizations we did
6. Cost savings
7. Key takeaways
Our problem statement
● Log analytics platform (Web-Server, Application, Database logs)
● Data Ingestion rate: ~300GB/day
● Frequently accessed data: last 8 days
● Infrequently accessed data: older than 8 days
● Uptime: 99.90%
● Hot Retention period: 90 days
● Cold Retention period: 90 days (with potential to increase)
● Simple and Cost effective solution
● Fairly predictable concurrent user-base
● Not to be used for storing user/business data
Our business requirements
                 ELK stack             Splunk                     Sumo Logic
Product          Self-managed          Cloud                      Professional
Pricing          ~$30 per GB/month     ~$100 per GB/month *       ~$108 per GB/month *
Data Ingestion   ~300 GB/day           ~100 GB/day *              ~20 GB/day *
                                       (post-ingestion            (post-ingestion
                                       custom pricing)            custom pricing)
Retention        ~90 days              ~90 days *                 ~30 days *
Cost/GB/day      ~$0.98                ~$3.33 *                   ~$3.60 *
* Values are estimates taken from the product-pricing web pages of the respective products; they may not represent actual values and are meant for comparison only.
References:
https://www.splunk.com/en_us/products/pricing/calculator.html#tabs/tab2
https://www.sumologic.com/pricing/apac/
Choosing the right option
ELK Stack overview
● Index
● Shard
○ Primary
○ Replica
● Segment
● Node
References:
https://www.elastic.co/guide/en/elasticsearch/reference/5.6/_basic_concepts.html
ELK Stack overview: Terminologies
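As a quick, hedged illustration of these terms, the Elasticsearch _cat APIs can list the indexes, shards, segments and nodes of a running cluster (the host and port below are placeholders, not taken from the original setup):

# List indexes, their shards, and the nodes that hold them (illustrative host)
curl -s "http://localhost:9200/_cat/indices?v"
curl -s "http://localhost:9200/_cat/shards?v"
curl -s "http://localhost:9200/_cat/nodes?v"
# Segment-level details per shard
curl -s "http://localhost:9200/_cat/segments?v"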
Our ELK architecture
Our ELK architecture: Hot-Warm-Cold data storage
(infinite scale)
Service         Number of Nodes   Total CPU Cores   Total RAM   EBS Storage
Elasticsearch   7                 28                141 GB
Logstash        3                 6                 12 GB
Kibana          1                 1                 4 GB
Total           11                35                157 GB      ~20 TB

Data ingestion per day: ~300 GB
Hot retention period: 90 days
Docs/sec (at peak load): ~7K
Our ELK architecture: Size and scale
Application Side
● Logstash
● Elasticsearch
Infrastructure Side
● EC2
● EBS
● Data transfer
Optimizations we did
Optimizations we did: Application side
Logstash
Pipeline Workers:
● Adjusted "pipeline.workers" to x4 the number of
Cores to improve CPU utilisation on Logstash
server (as threads may spend significant time in
an I/O wait state)
### Core-count: 2 ###
...
pipeline.workers: 8
...
logstash.yml
References:
https://www.elastic.co/guide/en/logstash/current/tuning-logstash.html
Optimizations we did: Logstash
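To check whether the extra workers actually help, the Logstash monitoring API can report process and pipeline statistics. A minimal sketch, assuming the default API port 9600 and a placeholder host:

# Inspect CPU usage and per-pipeline event throughput after tuning pipeline.workers
curl -s "http://localhost:9600/_node/stats?pretty"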
'info' logs:
● Separated application 'info' logs into a different
index with a shorter retention period
if [sourcetype] == "app_logs" and [level] == "info"
{
elasticsearch {
index => "%{sourcetype}-%{level}-%{+YYYY.MM.dd}"
...
Output config
if [sourcetype] == "nginx" and [status] == "200"
{
elasticsearch {
index => "%{sourcetype}-%{status}-%{+YYYY.MM.dd}"
...
References:
https://www.elastic.co/guide/en/logstash/current/event-dependent-configuration.html
'200' response-code logs:
● Separated access logs with a '200' response code
into a different index with a shorter retention period
(a fuller output sketch follows this slide)
Optimizations we did: Logstash
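A fuller sketch of how this conditional routing might look inside a Logstash output block; the hosts value and the fallback index name are illustrative assumptions, not taken from the original configuration:

output {
  if [sourcetype] == "app_logs" and [level] == "info" {
    elasticsearch {
      hosts => ["http://localhost:9200"]                  # placeholder host
      index => "%{sourcetype}-%{level}-%{+YYYY.MM.dd}"    # short-retention index
    }
  } else if [sourcetype] == "nginx" and [status] == "200" {
    elasticsearch {
      hosts => ["http://localhost:9200"]                  # placeholder host
      index => "%{sourcetype}-%{status}-%{+YYYY.MM.dd}"   # short-retention index
    }
  } else {
    elasticsearch {
      hosts => ["http://localhost:9200"]                  # placeholder host
      index => "%{sourcetype}-%{+YYYY.MM.dd}"             # default index (illustrative)
    }
  }
}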
Log ‘message’ field:
● Removed the "message" field when the grok
pattern applied cleanly (no '_grokparsefailure'
tag) in Logstash
(reduced storage footprint by ~30% per doc)
if "_grokparsefailure" not in [tags] {
mutate {
remove_field => ["message"]
}
}
Filter config
Eg:
Nginx Log-message: 127.0.0.1 - - [26/Mar/2016:19:09:19 -0400] "GET / HTTP/1.1" 401 194 "" "Mozilla/5.0
Gecko" "-"
Grok Pattern: %{IPORHOST:clientip} (?:-|(%{WORD}.%{WORD})) %{USER:ident}
[%{HTTPDATE:timestamp}] "(?:%{WORD:verb} %{NOTSPACE:request}(?:
HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})" %{NUMBER:response} (?:%{NUMBER:bytes}|-)
%{QS:referrer} %{QS:agent} %{QS:forwarder}
Optimizations we did: Logstash
Elasticsearch
Optimizations we did: Application side
JVM heap vs non-heap memory:
● Optimised the JVM heap size by monitoring GC
intervals; this enabled efficient utilisation of system
memory (~33% for the JVM heap, ~66% for non-heap) *
jvm.options
### Total system Memory 15GB ###
-Xms5g
-Xmx5g
[GC graphs (not shown): heap too small, heap too large, optimised heap]
* Recommended heap-size settings by Elastic:
https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
Optimizations we did: Elasticsearch
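A hedged way to observe heap usage and GC activity while tuning the heap size is the Elasticsearch nodes-stats API (the host is a placeholder):

# Heap usage, GC collection counts and pause times per node
curl -s "http://localhost:9200/_nodes/stats/jvm?pretty"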
Shards:
● Created index templates whose shard counts are
multiples of the number of Elasticsearch nodes
(fixes shard-distribution imbalance, which had
caused uneven disk and compute resource usage)
### Number of ES nodes: 5 ###
{
"template": "appserver-*",
"settings": {
"number_of_shards": "5",
"number_of_replicas": "0",
...
}
}
Trade-offs:
● Removing replicas makes search queries slower,
since replicas are also used to serve search
requests
● It is not recommended to run production clusters
without replicas
Replicas:
● Removed replicas for the required indexes
(50% savings on storage cost, ~30% reduction in
compute resource utilization)
Optimizations we did: Elasticsearch
Template config
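For context, applying such a template might look like the following hedged sketch using the legacy _template API; the template name and host are placeholders, and the body is the one shown above:

curl -XPUT -H 'Content-Type: application/json' "http://localhost:9200/_template/appserver" -d'
{
  "template": "appserver-*",
  "settings": {
    "number_of_shards": "5",
    "number_of_replicas": "0"
  }
}'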
AWS
● EC2
● EBS
● Data transfer (Inter AZ)
The Spotinst platform allows users to reliably
leverage excess EC2 capacity, simplify cloud
operations and save up to 80% on compute costs.
Optimizations we did: Infrastructure side
Optimizations we did: Infrastructure side
EC2
Stateful EC2 Spot instances:
● Moved all ELK nodes to run on spot instances
(instances retain their IP addresses and EBS volumes)
Recovery time: < 10 mins
Trade-offs:
● Prefer using previous generation instance
types to reduce frequent spot take-backs
Optimizations we did: EC2 and spot
Auto-Scaling:
● Performance/time based auto-scaling for
Logstash Instances
Optimizations we did: EC2 and spot
Optimizations we did: Infrastructure side
EBS
"Hot-Warm" Architecture:
● "Hot" nodes: store active indexes, use GP2
EBS-disks (General purpose SSD)
● "Warm" nodes: store passive indexes, use SC1
EBS-disks (Cold storage)
(~69% savings on storage cost)
node.attr.box_type: hot
...
elasticsearch.yml
"template": "appserver-*",
"settings": {
"index": {
"routing": {
"allocation": {
"require": {
"box_type": "hot"}
}
}
},
...
Template config
Trade-offs:
● Since "Warm" nodes are using SC1 EBS-disks,
they have lower IOPS, throughput this will result
in search operations being comparatively slower
References:
https://cinhtau.net/2017/06/14/hot-warm-architecture/
Optimizations we did: EBS
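For a single index, the same re-allocation can be triggered manually by updating its routing setting. A hedged sketch (the index name and host are illustrative; Curator automates this, as shown on the next slide):

# Require the index to be allocated on nodes tagged box_type=warm
curl -XPUT -H 'Content-Type: application/json' "http://localhost:9200/appserver-2019.07.18/_settings" -d'
{
  "index.routing.allocation.require.box_type": "warm"
}'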
Moving indexes to "Warm" nodes:
● Reallocated indexes older than 8 days to "Warm"
nodes
● Recommended to perform this operation during
off-peak hours as it is I/O intensive
actions:
1:
action: allocation
description: "Move index to Warm-nodes after 8
days"
options:
key: box_type
value: warm
allocation_type: require
timeout_override:
continue_if_exception: false
filters:
- filtertype: age
source: name
direction: older
timestring: '%Y.%m.%d'
unit: days
unit_count: 8
...
Curator config
References:
https://www.elastic.co/blog/hot-warm-architecture-in-elasticsearch-5-x
Optimizations we did: EBS
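The action file above is executed with the Curator CLI; a minimal sketch, assuming the config and action files live under /etc/curator (paths and schedule are illustrative):

# Run once, or schedule during off-peak hours via cron
curator --config /etc/curator/curator.yml /etc/curator/move-to-warm.yml

# Example crontab entry (02:00 daily)
0 2 * * * /usr/local/bin/curator --config /etc/curator/curator.yml /etc/curator/move-to-warm.yml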
Single Availability Zone:
● Migrated all ELK nodes to a single availability zone
(reduced inter-AZ data transfer costs for ELK nodes
by 100%)
● Data transfer/day: ~700GB
(Logstash to Elasticsearch: ~300GB,
Elasticsearch inter-communication: ~400GB)
Trade-offs:
● It is not recommended to run production clusters in
a single AZ as it will result in downtime and
potential data loss in case of AZ failures
Optimizations we did: Inter-AZ data transfer
Using S3 for index Snapshots:
● Take snapshots of indexes and store them in S3
curl -XPUT -H 'Content-Type: application/json'
"http://<domain>:9200/_snapshot/s3_repository/snap1?wait_for_completion=true&pretty" -d'
{
  "indices": "index_1,index_2",
  "ignore_unavailable": true,
  "include_global_state": false
}'
Backup:
References:
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-snapshots.html
https://medium.com/@federicopanini/elasticsearch-backup-snapshot-and-restore-on-aws-s3-f1fc32fbca7f
Data backup and restore
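Before the snapshot call on this slide can work, the S3 repository has to be registered. A hedged sketch, assuming the repository-s3 plugin is installed; the bucket and base_path are placeholders, and credentials/region setup varies by Elasticsearch version:

curl -XPUT -H 'Content-Type: application/json' "http://<domain>:9200/_snapshot/s3_repository" -d'
{
  "type": "s3",
  "settings": {
    "bucket": "my-elk-snapshots",
    "base_path": "elk/indices"
  }
}'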
curl -s -XPOST -H 'Content-Type: application/json' --url
"http://<domain>:9200/_snapshot/s3_repository/snap1/_restore" -d'
{
  "indices": "index_1,index_2",
  "ignore_unavailable": true,
  "include_global_state": false
}'
On-demand Elasticsearch cluster:
● Launch an on-demand ES cluster and import
the snapshots from S3
Existing Cluster:
● Restore the required snapshots to the existing cluster
Restore:
Data backup and restore
Data corruption:
● List the indexes whose status is 'red' (see the
sketch after this slide)
● Delete the corrupted indexes
● Restore the indexes from S3 snapshots
● Recovery time: depends on the size of the data
Node failure due to AZ going down:
● Launch a new ELK cluster using AWS
CloudFormation templates
● Do the necessary config changes in Filebeat,
Logstash etc.
● Restore the required indexes from S3 snapshots
● Recovery time: depends on provisioning time and
size of data
Node failures due to underlying hardware issue:
● Recycle node in Spotinst console
(takes an AMI of the root volume, launches a new
instance, re-attaches EBS volumes, keeps the private IP)
● Recovery time: < 10 mins/node
Snapshot restore time (estimates):
● < 4 mins for a 20GB snapshot (test cluster: 3
nodes, multiple indexes with 3 primary shards
each, no replicas)
Disaster recovery
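For the data-corruption scenario above, listing and removing red indexes might look like this hedged sketch (the index name is illustrative):

# List only indexes whose health is red
curl -s "http://<domain>:9200/_cat/indices?health=red&v"
# Delete a corrupted index, then restore it from the S3 snapshot (see previous slides)
curl -XDELETE "http://<domain>:9200/appserver-2019.07.10"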
EC2
Instance type                Service         Daily cost
5 x r5.xlarge (20C, 160GB)   Elasticsearch   $40.80
3 x c5.large (6C, 12GB)      Logstash        $7.17
1 x t3.medium (2C, 4GB)      Kibana          $1.29
Total                                        ~$49.26

EC2 (optimized): daily cost reflects 65% spot savings plus Spotinst charges (20% of savings)
Instance type                Service              Daily cost   Total savings
5 x m4.xlarge (20C, 80GB)    Elasticsearch Hot    $14.64
2 x r4.xlarge (8C, 61GB)     Elasticsearch Warm   $7.50
3 x c4.large (6C, 12GB)      Logstash             $3.50
1 x t2.medium (2C, 4GB)      Kibana               $0.69
Total                                             ~$26.33      ~47%
Cost savings: EC2
Storage (ingesting 300GB/day, retention 90 days, replica count: 1)
Storage type   Retention   Daily cost
~54TB (GP2)    90 days     ~$237.60

Storage (optimized) (ingesting 300GB/day, retention 90 days, replica count: 0, daily S3 snapshots)
Storage type   Retention         Daily cost   Total savings
~3TB (GP2)     Hot: 8 days       $12.00
~24TB (SC1)    Warm: 82 days     $24.00
~27TB (S3)     Backup: 90 days   $22.50
Total                            ~$58.50      ~75%
Cost savings: Storage
                     ELK stack   ELK stack (optimized)   Savings
EC2                  $49.40      $26.33                  47%
Storage              $237.60     $58.50                  75%
Data transfer        $7.00       $0                      100%
Total (daily cost)   ~$294.00    ~$84.83                 ~71% *
Cost/GB (daily)      ~$0.98      ~$0.28

* Total savings are exclusive of some of the application-level optimizations done
Total savings
                 ELK Stack (optimized)   ELK Stack       Splunk            Sumo Logic
Product          Self-managed            Self-managed    Cloud             Professional
Data Ingestion   ~300 GB/day             ~300 GB/day     ~100 GB/day *     ~20 GB/day *
                                                         (post-ingestion   (post-ingestion
                                                         custom pricing)   custom pricing)
Retention        ~90 days                ~90 days        ~90 days *        ~30 days *
Cost/GB/day      ~$0.28                  ~$0.98          ~$3.33 *          ~$3.60 *

Savings over traditional ELK stack: 71% *
* Total savings are exclusive of some of the application-level optimizations done
Our Costs vs other Platforms
ELK Stack Scalability:
● Logstash: auto-scaling
● Elasticsearch: overprovisioning (nodes run at 60% capacity during peak load), predictive vertical/horizontal scaling
Handling potential data-loss while AZ is down:
● DR mechanisms in place; daily/hourly backups stored in S3; potential data loss of about 1 hour
● We do not store user-data or business metrics in ELK, users/business will not be impacted
Handling potential data-corruptions in Elasticsearch:
● DR mechanisms in place, recover index from S3 index-snapshots
Managing downtime during spot take-backs:
● Logstash: multiple nodes, minimal impact
● Elasticsearch/Kibana: < 10min downtime per node
● Use previous-generation instance types, as their spot take-back chances are comparatively lower
Key Takeaways
Handling back-pressure when a node is down:
● Filebeat: will automatically retry sending the backlog of old logs
● Logstash: use the ‘date’ filter for the document timestamp, auto-scaling (see the sketch below)
● Elasticsearch: overprovisioning
Other log analytics alternatives:
● We have only evaluated ELK, Splunk and Sumo Logic
ELK stack upgrade path:
● Blue Green deployment for major version upgrade
Key Takeaways
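For the back-pressure point above, a minimal sketch of a Logstash ‘date’ filter that keeps the document timestamp equal to the original event time even when logs arrive late; the field name and time format follow the nginx example earlier and are illustrative:

filter {
  date {
    match  => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"]   # nginx access-log time format
    target => "@timestamp"                              # index by event time, not arrival time
  }
}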
● We built a platform tailored to our requirements, yours might be different...
● Building a log analytics platform is not rocket science, but it can be painfully iterative if you
are not aware of the options
● Know the trade-offs you are ‘OK with’, and you can roll out a solution optimised for
your specific requirements
Reflection
Thank you!
Happy to take your questions.
Copyright Disclaimer: All rights to the materials used for this presentation belong to their respective owners.
