Log Analytics with ELK Stack
(Architecture for aggressive cost optimization and infinite data scale)
Denis D’Souza | 27th July 2019
About me...
● Currently a DevOps engineer at Moonfrog Labs
● 6+ years working as a DevOps engineer, SRE and Linux administrator
Worked on a variety of technologies in both service-based and
product-based organisations
● How do I spend my free time?
Learning new technologies and playing PC games
www.linkedin.com/in/denis-dsouza
• A Mobile Gaming Company making mass market social games
• 5M+ Daily Active Users, 15M+ Weekly Active Users
• Real-time, cross-platform games optimised for primary
market(s): India and the subcontinent
• Profitable!
Current Scale
Who we are?
1. Our business requirements
2. Choosing the right option
3. ELK Stack overview
4. Our ELK architecture
5. Optimizations we did
6. Cost savings
7. Key takeaways
Our problem statement
● Log analytics platform (Web-Server, Application, Database logs)
● Data Ingestion rate: ~300GB/day
● Frequently accessed data: last 8 days
● Infrequently accessed data: older than 8 days
● Uptime: 99.90%
● Hot Retention period: 90 days
● Cold Retention period: 90 days (with potential to increase)
● Simple and Cost effective solution
● Fairly predictable concurrent user-base
● Not to be used for storing user/business data
Our business requirements
                 ELK stack             Splunk                     Sumo Logic
Product          Self-managed          Cloud                      Professional
Pricing          ~$30 per GB/month     ~$100 per GB/month *       ~$108 per GB/month *
Data Ingestion   ~300 GB/day           ~100 GB/day *              ~20 GB/day *
                                       (post-ingestion            (post-ingestion
                                       custom pricing)            custom pricing)
Retention        ~90 days              ~90 days *                 ~30 days *
Cost/GB/day      ~$0.98                ~$3.33 *                   ~$3.60 *
* Values are estimates taken from the product-pricing web pages of the respective products; they may not represent actual values and are meant for comparison only.
References:
https://www.splunk.com/en_us/products/pricing/calculator.html#tabs/tab2
https://www.sumologic.com/pricing/apac/
Choosing the right option
ELK Stack overview
● Index
● Shard
○ Primary
○ Replica
● Segment
● Node
References:
https://www.elastic.co/guide/en/elasticsearch/reference/5.6/_basic_concepts.html
ELK Stack overview: Terminologies
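As a quick, hedged illustration of these terms, the Elasticsearch _cat APIs can list the indexes, shards, segments and nodes of a running cluster (the host and port below are placeholders, not taken from the original setup):

# List indexes, their shards, and the nodes that hold them (illustrative host)
curl -s "http://localhost:9200/_cat/indices?v"
curl -s "http://localhost:9200/_cat/shards?v"
curl -s "http://localhost:9200/_cat/nodes?v"
# Segment-level details per shard
curl -s "http://localhost:9200/_cat/segments?v"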
Our ELK architecture
Our ELK architecture: Hot-Warm-Cold data storage
(infinite scale)
Service         Number of Nodes   Total CPU Cores   Total RAM   EBS Storage
Elasticsearch   7                 28                141 GB
Logstash        3                 6                 12 GB
Kibana          1                 1                 4 GB
Total           11                35                157 GB      ~20 TB

Data ingestion per day: ~300 GB
Hot retention period: 90 days
Docs/sec (at peak load): ~7K
Our ELK architecture: Size and scale
Application Side
● Logstash
● Elasticsearch
Infrastructure Side
● EC2
● EBS
● Data transfer
Optimizations we did
Optimizations we did: Application side
Logstash
Pipeline Workers:
● Adjusted "pipeline.workers" to x4 the number of
Cores to improve CPU utilisation on Logstash
server (as threads may spend significant time in
an I/O wait state)
### Core-count: 2 ###
...
pipeline.workers: 8
...
logstash.yml
References:
https://www.elastic.co/guide/en/logstash/current/tuning-logstash.html
Optimizations we did: Logstash
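To check whether the extra workers actually help, the Logstash monitoring API can report process and pipeline statistics. A minimal sketch, assuming the default API port 9600 and a placeholder host:

# Inspect CPU usage and per-pipeline event throughput after tuning pipeline.workers
curl -s "http://localhost:9600/_node/stats?pretty"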
'info' logs:
● Separated application 'info' logs into a different
index with a shorter retention period
if [sourcetype] == "app_logs" and [level] == "info"
{
elasticsearch {
index => "%{sourcetype}-%{level}-%{+YYYY.MM.dd}"
...
Output config
if [sourcetype] == "nginx" and [status] == "200"
{
elasticsearch {
index => "%{sourcetype}-%{status}-%{+YYYY.MM.dd}"
...
References:
https://www.elastic.co/guide/en/logstash/current/event-dependent-configuration.html
'200' response-code logs:
● Separated access logs with a '200' response code
into a different index with a shorter retention period
(a fuller output sketch follows this slide)
Optimizations we did: Logstash
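A fuller sketch of how this conditional routing might look inside a Logstash output block; the hosts value and the fallback index name are illustrative assumptions, not taken from the original configuration:

output {
  if [sourcetype] == "app_logs" and [level] == "info" {
    elasticsearch {
      hosts => ["http://localhost:9200"]                  # placeholder host
      index => "%{sourcetype}-%{level}-%{+YYYY.MM.dd}"    # short-retention index
    }
  } else if [sourcetype] == "nginx" and [status] == "200" {
    elasticsearch {
      hosts => ["http://localhost:9200"]                  # placeholder host
      index => "%{sourcetype}-%{status}-%{+YYYY.MM.dd}"   # short-retention index
    }
  } else {
    elasticsearch {
      hosts => ["http://localhost:9200"]                  # placeholder host
      index => "%{sourcetype}-%{+YYYY.MM.dd}"             # default index (illustrative)
    }
  }
}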
Log ‘message’ field:
● Removed the "message" field when the grok
pattern applied cleanly (no '_grokparsefailure'
tag) in Logstash
(reduced storage footprint by ~30% per doc)
if "_grokparsefailure" not in [tags] {
mutate {
remove_field => ["message"]
}
}
Filter config
Eg:
Nginx Log-message: 127.0.0.1 - - [26/Mar/2016:19:09:19 -0400] "GET / HTTP/1.1" 401 194 "" "Mozilla/5.0
Gecko" "-"
Grok Pattern: %{IPORHOST:clientip} (?:-|(%{WORD}.%{WORD})) %{USER:ident}
[%{HTTPDATE:timestamp}] "(?:%{WORD:verb} %{NOTSPACE:request}(?:
HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})" %{NUMBER:response} (?:%{NUMBER:bytes}|-)
%{QS:referrer} %{QS:agent} %{QS:forwarder}
Optimizations we did: Logstash
Elasticsearch
Optimizations we did: Application side
JVM heap vs non-heap memory:
● Optimised the JVM heap size by monitoring GC
intervals; this enabled efficient utilisation of system
memory (~33% for the JVM heap, ~66% for non-heap) *
jvm.options
### Total system Memory 15GB ###
-Xms5g
-Xmx5g
[GC graphs (not shown): heap too small, heap too large, optimised heap]
* Recommended heap-size settings by Elastic:
https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
Optimizations we did: Elasticsearch
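A hedged way to observe heap usage and GC activity while tuning the heap size is the Elasticsearch nodes-stats API (the host is a placeholder):

# Heap usage, GC collection counts and pause times per node
curl -s "http://localhost:9200/_nodes/stats/jvm?pretty"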
Shards:
● Created index templates whose shard counts are
multiples of the number of Elasticsearch nodes
(fixes shard-distribution imbalance, which had
caused uneven disk and compute resource usage)
### Number of ES nodes: 5 ###
{
"template": "appserver-*",
"settings": {
"number_of_shards": "5",
"number_of_replicas": "0",
...
}
}
Trade-offs:
● Removing replicas makes search queries slower,
since replicas are also used to serve search
requests
● It is not recommended to run production clusters
without replicas
Replicas:
● Removed replicas for the required indexes
(50% savings on storage cost, ~30% reduction in
compute resource utilization)
Optimizations we did: Elasticsearch
Template config
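For context, applying such a template might look like the following hedged sketch using the legacy _template API; the template name and host are placeholders, and the body is the one shown above:

curl -XPUT -H 'Content-Type: application/json' "http://localhost:9200/_template/appserver" -d'
{
  "template": "appserver-*",
  "settings": {
    "number_of_shards": "5",
    "number_of_replicas": "0"
  }
}'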
AWS
● EC2
● EBS
● Data transfer (Inter AZ)
The Spotinst platform allows users to reliably
leverage excess EC2 capacity, simplify cloud
operations and save up to 80% on compute costs.
Optimizations we did: Infrastructure side
Optimizations we did: Infrastructure side
EC2
Stateful EC2 Spot instances:
● Moved all ELK nodes to run on spot instances
(instances retain their IP addresses and EBS volumes)
Recovery time: < 10 mins
Trade-offs:
● Prefer using previous generation instance
types to reduce frequent spot take-backs
Optimizations we did: EC2 and spot
Auto-Scaling:
● Performance/time based auto-scaling for
Logstash Instances
Optimizations we did: EC2 and spot
Optimizations we did: Infrastructure side
EBS
"Hot-Warm" Architecture:
● "Hot" nodes: store active indexes, use GP2
EBS-disks (General purpose SSD)
● "Warm" nodes: store passive indexes, use SC1
EBS-disks (Cold storage)
(~69% savings on storage cost)
node.attr.box_type: hot
...
elasticsearch.yml
"template": "appserver-*",
"settings": {
"index": {
"routing": {
"allocation": {
"require": {
"box_type": "hot"}
}
}
},
...
Template config
Trade-offs:
● Since "Warm" nodes are using SC1 EBS-disks,
they have lower IOPS, throughput this will result
in search operations being comparatively slower
References:
https://cinhtau.net/2017/06/14/hot-warm-architecture/
Optimizations we did: EBS
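For a single index, the same re-allocation can be triggered manually by updating its routing setting. A hedged sketch (the index name and host are illustrative; Curator automates this, as shown on the next slide):

# Require the index to be allocated on nodes tagged box_type=warm
curl -XPUT -H 'Content-Type: application/json' "http://localhost:9200/appserver-2019.07.18/_settings" -d'
{
  "index.routing.allocation.require.box_type": "warm"
}'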
Moving indexes to "Warm" nodes:
● Reallocated indexes older than 8 days to "Warm"
nodes
● Recommended to perform this operation during
off-peak hours as it is I/O intensive
actions:
1:
action: allocation
description: "Move index to Warm-nodes after 8
days"
options:
key: box_type
value: warm
allocation_type: require
timeout_override:
continue_if_exception: false
filters:
- filtertype: age
source: name
direction: older
timestring: '%Y.%m.%d'
unit: days
unit_count: 8
...
Curator config
References:
https://www.elastic.co/blog/hot-warm-architecture-in-elasticsearch-5-x
Optimizations we did: EBS
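The action file above is executed with the Curator CLI; a minimal sketch, assuming the config and action files live under /etc/curator (paths and schedule are illustrative):

# Run once, or schedule during off-peak hours via cron
curator --config /etc/curator/curator.yml /etc/curator/move-to-warm.yml

# Example crontab entry (02:00 daily)
0 2 * * * /usr/local/bin/curator --config /etc/curator/curator.yml /etc/curator/move-to-warm.yml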
Single Availability Zone:
● Migrated all ELK nodes to a single availability zone
(reduced inter-AZ data transfer costs for ELK nodes
by 100%)
● Data transfer/day: ~700GB
(Logstash to Elasticsearch: ~300GB,
Elasticsearch inter-communication: ~400GB)
Trade-offs:
● It is not recommended to run production clusters in
a single AZ as it will result in downtime and
potential data loss in case of AZ failures
Optimizations we did: Inter-AZ data transfer
Using S3 for index Snapshots:
● Take snapshots of indexes and store them in S3
curl -XPUT -H 'Content-Type: application/json'
"http://<domain>:9200/_snapshot/s3_repository/snap1?wait_for_completion=true&pretty" -d'
{
  "indices": "index_1,index_2",
  "ignore_unavailable": true,
  "include_global_state": false
}'
Backup:
References:
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-snapshots.html
https://medium.com/@federicopanini/elasticsearch-backup-snapshot-and-restore-on-aws-s3-f1fc32fbca7f
Data backup and restore
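Before the snapshot call on this slide can work, the S3 repository has to be registered. A hedged sketch, assuming the repository-s3 plugin is installed; the bucket and base_path are placeholders, and credentials/region setup varies by Elasticsearch version:

curl -XPUT -H 'Content-Type: application/json' "http://<domain>:9200/_snapshot/s3_repository" -d'
{
  "type": "s3",
  "settings": {
    "bucket": "my-elk-snapshots",
    "base_path": "elk/indices"
  }
}'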
curl -s -XPOST -H 'Content-Type: application/json' --url
"http://<domain>:9200/_snapshot/s3_repository/snap1/_restore" -d'
{
  "indices": "index_1,index_2",
  "ignore_unavailable": true,
  "include_global_state": false
}'
On-demand Elasticsearch cluster:
● Launch an on-demand ES cluster and import
the snapshots from S3
Existing Cluster:
● Restore the required snapshots to the existing cluster
Restore:
Data backup and restore
Data corruption:
● List the indexes whose status is 'red' (see the
sketch after this slide)
● Delete the corrupted indexes
● Restore the indexes from S3 snapshots
● Recovery time: depends on the size of the data
Node failure due to AZ going down:
● Launch a new ELK cluster using AWS
CloudFormation templates
● Do the necessary config changes in Filebeat,
Logstash etc.
● Restore the required indexes from S3 snapshots
● Recovery time: depends on provisioning time and
size of data
Node failures due to underlying hardware issue:
● Recycle node in Spotinst console
(takes an AMI of the root volume, launches a new
instance, re-attaches EBS volumes, keeps the private IP)
● Recovery time: < 10 mins/node
Snapshot restore time (estimates):
● < 4 mins for a 20GB snapshot (test cluster: 3
nodes, multiple indexes with 3 primary shards
each, no replicas)
Disaster recovery
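For the data-corruption scenario above, listing and removing red indexes might look like this hedged sketch (the index name is illustrative):

# List only indexes whose health is red
curl -s "http://<domain>:9200/_cat/indices?health=red&v"
# Delete a corrupted index, then restore it from the S3 snapshot (see previous slides)
curl -XDELETE "http://<domain>:9200/appserver-2019.07.10"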
EC2
Instance type                Service         Daily cost
5 x r5.xlarge (20C, 160GB)   Elasticsearch   $40.80
3 x c5.large (6C, 12GB)      Logstash        $7.17
1 x t3.medium (2C, 4GB)      Kibana          $1.29
Total                                        ~$49.26

EC2 (optimized): daily cost reflects 65% spot savings plus Spotinst charges (20% of savings)
Instance type                Service              Daily cost   Total savings
5 x m4.xlarge (20C, 80GB)    Elasticsearch Hot    $14.64
2 x r4.xlarge (8C, 61GB)     Elasticsearch Warm   $7.50
3 x c4.large (6C, 12GB)      Logstash             $3.50
1 x t2.medium (2C, 4GB)      Kibana               $0.69
Total                                             ~$26.33      ~47%
Cost savings: EC2
Storage (ingesting 300GB/day, retention 90 days, replica count: 1)
Storage type   Retention   Daily cost
~54TB (GP2)    90 days     ~$237.60

Storage (optimized) (ingesting 300GB/day, retention 90 days, replica count: 0, daily S3 snapshots)
Storage type   Retention         Daily cost   Total savings
~3TB (GP2)     Hot: 8 days       $12.00
~24TB (SC1)    Warm: 82 days     $24.00
~27TB (S3)     Backup: 90 days   $22.50
Total                            ~$58.50      ~75%
Cost savings: Storage
                     ELK stack   ELK stack (optimized)   Savings
EC2                  $49.40      $26.33                  47%
Storage              $237.60     $58.50                  75%
Data transfer        $7.00       $0                      100%
Total (daily cost)   ~$294.00    ~$84.83                 ~71% *
Cost/GB (daily)      ~$0.98      ~$0.28

* Total savings are exclusive of some of the application-level optimizations done
Total savings
                 ELK Stack (optimized)   ELK Stack       Splunk            Sumo Logic
Product          Self-managed            Self-managed    Cloud             Professional
Data Ingestion   ~300 GB/day             ~300 GB/day     ~100 GB/day *     ~20 GB/day *
                                                         (post-ingestion   (post-ingestion
                                                         custom pricing)   custom pricing)
Retention        ~90 days                ~90 days        ~90 days *        ~30 days *
Cost/GB/day      ~$0.28                  ~$0.98          ~$3.33 *          ~$3.60 *

Savings over traditional ELK stack: 71% *
* Total savings are exclusive of some of the application-level optimizations done
Our Costs vs other Platforms
ELK Stack Scalability:
● Logstash: auto-scaling
● Elasticsearch: overprovisioning (nodes run at 60% capacity during peak load), predictive vertical/horizontal scaling
Handling potential data-loss while AZ is down:
● DR mechanisms in place; daily/hourly backups stored in S3; potential data loss of about 1 hour
● We do not store user-data or business metrics in ELK, users/business will not be impacted
Handling potential data-corruptions in Elasticsearch:
● DR mechanisms in place, recover index from S3 index-snapshots
Managing downtime during spot take-backs:
● Logstash: multiple nodes, minimal impact
● Elasticsearch/Kibana: < 10min downtime per node
● Use previous-generation instance types, as their spot take-back chances are comparatively lower
Key Takeaways
Handling back-pressure when a node is down:
● Filebeat: will automatically retry sending the backlog of old logs
● Logstash: use the ‘date’ filter for the document timestamp, auto-scaling (see the sketch below)
● Elasticsearch: overprovisioning
Other log analytics alternatives:
● We have only evaluated ELK, Splunk and Sumo Logic
ELK stack upgrade path:
● Blue Green deployment for major version upgrade
Key Takeaways
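For the back-pressure point above, a minimal sketch of a Logstash ‘date’ filter that keeps the document timestamp equal to the original event time even when logs arrive late; the field name and time format follow the nginx example earlier and are illustrative:

filter {
  date {
    match  => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"]   # nginx access-log time format
    target => "@timestamp"                              # index by event time, not arrival time
  }
}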
● We built a platform tailored to our requirements, yours might be different...
● Building a log analytics platform is not rocket science, but it can be painfully iterative if you
are not aware of the options
● Know the trade-offs you are ‘OK with’, and you can roll out a solution optimised for
your specific requirements
Reflection
Thank you!
Happy to take your questions.
Copyright Disclaimer: All rights to the materials used for this presentation belong to their respective owners.
