Introducing log analysis to your organization

Rafał Kuć

Sematext and Me
logs & metrics
cloud

Next 60 minutes…
Log shipping
- buffers
- protocols
- parsing
Central buffering
- Kafka
- Redis
Storage & Analysis
- Elasticsearch
- Kibana
- Grafana
Why & How?
- Should I try?
- Open source
- Commercial

Why You Should Care
Environments are getting bigger
Containers are everywhere
Infrastructure work gets automated
Logs & metrics in the same place
Faster diagnostics == less money spent
Going For Commercial Solution
cloud


Going Open-Source – Today’s Focus

Log shipping architecture
File -> Shipper (several in parallel) -> Centralized Buffer -> ES data nodes

Focus: Shipper

What about the shipper?
logs -> Centralized Buffer
Which shipper to use?
Which protocol should be used?
What about buffering?
Log to JSON, or parse? And how?

Buffers
performance – batches & threads
availability – when central buffer is gone

Buffer types
Disk || memory || combined hybrid approach
On source || centralized
On source: App + Buffer, App + Buffer -> ES
- file or local log shipper
- easy scaling – fewer moving parts
- often with the use of lightweight shipper
Centralized: App, App -> Kafka / Redis / Logstash / etc… -> ES
- one place for all changes
- extra features made easy (like TTL)
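A minimal sketch of the combined (memory + disk) approach, assuming events are JSON-serializable dicts. The class name `HybridBuffer`, the spill file, and the `max_in_memory` limit are illustrative and not taken from any particular shipper.

```python
import json
import os
from collections import deque


class HybridBuffer:
    """Keep up to max_in_memory events in RAM; spill the rest to a file."""

    def __init__(self, spill_path, max_in_memory=1000):
        self.spill_path = spill_path
        self.max_in_memory = max_in_memory
        self.memory = deque()
        self.spilled = 0  # number of events currently on disk

    def push(self, event):
        # Once anything is on disk, keep appending there to preserve order.
        if self.spilled == 0 and len(self.memory) < self.max_in_memory:
            self.memory.append(event)
            return
        with open(self.spill_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")
        self.spilled += 1

    def drain(self):
        """Return all buffered events in arrival order and empty the buffer."""
        events = list(self.memory)
        self.memory.clear()
        if self.spilled:
            with open(self.spill_path, encoding="utf-8") as f:
                events.extend(json.loads(line) for line in f)
            os.remove(self.spill_path)
            self.spilled = 0
        return events
```

Memory absorbs normal traffic; the disk part only kicks in when the destination is slow or gone, which is the case a pure memory buffer cannot survive.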

Buffers Summary
Simple: App + Buffer, App + Buffer -> ES
Reliable: App, App -> centralized buffer -> ES

Protocols
UDP – fast, cool for the application, not reliable
TCP – reliable (almost), application gets ACK when written to buffer
Application level ACKs may be needed
HTTP – Logstash, rsyslog, Fluentd
RELP – Logstash, rsyslog
Beats – Logstash, Filebeat
Kafka – Logstash, rsyslog, Filebeat, Fluentd
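A sketch of what an application level ACK could look like on top of plain TCP. The `ACK` reply and the one-line-per-message framing are assumptions made for the example; this is not the wire format of RELP or Beats.

```python
import socket


def send_with_ack(sock, line, timeout=2.0):
    """Send one log line and wait for the receiver to confirm it."""
    sock.settimeout(timeout)
    sock.sendall(line.encode("utf-8") + b"\n")
    try:
        return sock.recv(16).startswith(b"ACK")
    except socket.timeout:
        return False  # caller should keep the line buffered and retry


def receive_and_ack(sock, store):
    """Read one line, store it, and only then acknowledge."""
    data = b""
    while not data.endswith(b"\n"):
        chunk = sock.recv(4096)
        if not chunk:
            return False  # sender went away before a full line arrived
        data += chunk
    store.append(data.decode("utf-8").rstrip("\n"))
    sock.sendall(b"ACK\n")
    return True
```

The point is the ordering on the receiving side: the ACK goes out after the line is stored, so a sender that never sees it knows the line has to be sent again.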

Choosing the shipper
application -> http / socket -> rsyslog (memory & disk assisted queues) -> Elasticsearch

Final Architecture
application -> http / socket -> rsyslog (memory & disk assisted queues)
application -> file -> Logagent / filebeat
consumer – Parsing Done Here

Focus: Centralized Buffer

Why Apache Kafka?
Fast & easy to use
Easy to scale
Fault tolerant and highly available
Supports streaming
Works in publish/subscribe mode

Kafka architecture
ZooKeeper ensemble: 3 nodes
Kafka brokers: 4 nodes

Kafka & topics
Kafka stores data in topics, written on disk
Topics: security_logs, access_logs, app1_logs, app2_logs

Kafka & topics & partitions & replicas
logs: partition 1, partition 2, partition 3, partition 4
logs replica: partition 1, partition 2, partition 3, partition 4

Scaling Kafka
logs: 1 partition -> 4 partitions -> 16 partitions

Things to remember when using Kafka
Scales by adding more partitions, not threads
The more IOPS the better
Keep the # of consumers equal to # of partitions
Replicas used for HA and FT only
Offsets stored per consumer – multiple destinations easily possible
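Why the number of consumers should match the number of partitions: within one consumer group, a partition is read by at most one consumer. The snippet below is a simplified round-robin simulation of that rule, not Kafka client code; `assign_partitions` and `idle_consumers` are names made up for the sketch.

```python
def assign_partitions(partitions, consumers):
    """Round-robin partitions over the consumers of one group."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


def idle_consumers(assignment):
    """Consumers that ended up with no partition to read."""
    return [consumer for consumer, parts in assignment.items() if not parts]


# 4 partitions, 6 consumers: two consumers have nothing to do.
crowded = assign_partitions([0, 1, 2, 3], ["c1", "c2", "c3", "c4", "c5", "c6"])
print(idle_consumers(crowded))

# 16 partitions, 4 consumers: each consumer reads 4 partitions.
busy = assign_partitions(list(range(16)), ["c1", "c2", "c3", "c4"])
print({consumer: len(parts) for consumer, parts in busy.items()})
```

More consumers than partitions adds no throughput; more partitions than consumers means each consumer has to keep up with several partitions.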

Focus: Elasticsearch

Elasticsearch cluster architecture
client nodes: 3
data nodes: 6
master nodes: 3
ingest nodes: 3

Dedicated masters please
discovery.zen.minimum_master_nodes -> N/2 + 1, where N is the number of master eligible nodes
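The N/2 + 1 rule uses integer division. A tiny sketch, with a hypothetical helper name:

```python
def minimum_master_nodes(master_eligible):
    """Quorum size for a given number of master eligible nodes."""
    return master_eligible // 2 + 1


for n in (1, 3, 5):
    print(n, "master eligible ->", minimum_master_nodes(n))
```

With 3 dedicated masters the value is 2, so the cluster keeps working after losing one master and cannot split into two halves that both elect a master.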

Elasticsearch – Indices
Index – logical place for data
Index – can be compared to a database in a relational DB
Index – built out of one or more shards
Shards – can be spread among multiple nodes

Scaling Elasticsearch
One index, one shard: Logs (Shard1)
Multiple indices: Logs, Users, Invoices (Shard1 each)
One index, multiple shards: Logs (Shard1, Shard2, Shard3, Shard4)
Shards spread among nodes: Logs Shard3, Shard2, Shard4, Shard1
Shards with replicas: Shard1 + Replica4, Shard2 + Replica3, Shard4 + Replica1, Shard3 + Replica2

One big index is a no-go
Not scalable enough for time based data
Indexing slows down with time
Expensive merges
Delete by query needed for data retention

Daily indices are a good start
2017.11.16 2017.11.17 2017.11.20 2017.11.21 . . .
indexing, most searches: newest index
Indexing is faster for smaller indices
Deletes are cheap – we delete whole indices
Searches can be limited to the indices that are needed
Static indices are cache friendly
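Retention with daily indices comes down to name arithmetic. A sketch, assuming a `logs_` prefix and the `YYYY.MM.DD` suffix; the function names and the `keep_days` parameter are illustrative.

```python
from datetime import date, timedelta


def daily_index_name(day, prefix="logs_"):
    """Index name for a given day, e.g. logs_2017.11.22."""
    return f"{prefix}{day:%Y.%m.%d}"


def expired_indices(existing, today, keep_days, prefix="logs_"):
    """Daily indices that fall outside the retention window."""
    keep = {
        daily_index_name(today - timedelta(days=n), prefix)
        for n in range(keep_days)
    }
    return sorted(
        name for name in existing
        if name.startswith(prefix) and name not in keep
    )


existing = ["logs_2017.11.20", "logs_2017.11.21", "logs_2017.11.22"]
print(expired_indices(existing, date(2017, 11, 22), keep_days=2))
```

Each returned name can then be dropped with a single index delete, which is far cheaper than a delete by query inside one big index.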

Daily indices are sub-optimal
Black Friday, Saturday, Sunday – load is not even

Size based indices are optimal
size limit for indices – around 5 – 10GB per shard on AWS
logs_01 -> logs_02 -> . . . -> logs_N
indexing goes to the newest index
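The size based scheme can be sketched as a naming rule: keep writing to the current index until it crosses the limit, then move to the next number. This is a plain simulation with made-up names (`next_index`, `limit_bytes`); it does not call Elasticsearch.

```python
GB = 1024 ** 3


def next_index(current, size_bytes, limit_bytes):
    """Return the index to write to, rolling over once the limit is hit."""
    if size_bytes < limit_bytes:
        return current
    prefix, number = current.rsplit("_", 1)
    # Keep the zero padding of the original name (logs_01 -> logs_02).
    return f"{prefix}_{int(number) + 1:0{len(number)}d}"


print(next_index("logs_01", 4 * GB, 5 * GB))
print(next_index("logs_01", 6 * GB, 5 * GB))
```

On a quiet day the index simply lives longer and on a spiky day it rolls over several times, so shard size stays roughly constant either way.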

Slice using size
Predictable searching and indexing performance
Better index balancing
Fewer shards
Easier handling of spiky loads
Lower costs because of better hardware utilization

Proper Elasticsearch configuration
Keep index.refresh_interval at maximum possible value
- 1 sec -> 100%, 5 sec -> 125%, 30 sec -> 175% indexing throughput
You can loosen up merges
- possible because of heavy aggregation use
- segments_per_tier -> higher
- max_merge_at_once -> higher
- max_merged_segment -> lower
- all prefixed with index.merge.policy
- result: higher indexing throughput
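A sketch of an index settings body that follows these hints. The setting names are the ones from the slide; the concrete values (30s, 50, 50, 1gb) are assumptions for illustration and need testing against your own load.

```python
import json


def indexing_tuned_settings(refresh_interval="30s",
                            segments_per_tier=50,
                            max_merge_at_once=50,
                            max_merged_segment="1gb"):
    """Index settings that trade search freshness for indexing throughput."""
    return {
        "index.refresh_interval": refresh_interval,
        # Looser merge policy: more segments allowed, smaller merged segments.
        "index.merge.policy.segments_per_tier": segments_per_tier,
        "index.merge.policy.max_merge_at_once": max_merge_at_once,
        "index.merge.policy.max_merged_segment": max_merged_segment,
    }


print(json.dumps(indexing_tuned_settings(), indent=2))
```

The printed JSON can be used as the body of an index settings update.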

Proper Elasticsearch configuration
Index only needed fields
Use doc values
Do not index _source
Do not store _all

Optimization time
We can optimize data nodes for time based data

Hot – cold architecture
ES hot ES cold ES cold
-Enode.attr.tag=hot -Enode.attr.tag=cold -Enode.attr.tag=cold

logs_2017.11.22 is created on the hot node
curl -XPUT -H 'Content-Type: application/json' localhost:9200/logs_2017.11.22 -d '{
  "settings" : {
    "index.routing.allocation.exclude.tag" : "cold",
    "index.routing.allocation.include.tag" : "hot"
  }
}'

hot: logs_2017.11.22 (indexing)

hot: logs_2017.11.22, logs_2017.11.23 (indexing)
move index after day ends
curl -XPUT -H 'Content-Type: application/json' localhost:9200/logs_2017.11.22/_settings -d '{
  "index.routing.allocation.exclude.tag" : "hot",
  "index.routing.allocation.include.tag" : "cold"
}'

hot: logs_2017.11.23 (indexing) | cold: logs_2017.11.22

hot: logs_2017.11.23, logs_2017.11.24 (indexing) | cold: logs_2017.11.22
move index after day ends

hot: logs_2017.11.24 (indexing) | cold: logs_2017.11.22, logs_2017.11.23
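The daily move is easy to script. A sketch that builds the settings body for the index allocation filters shown in the curl calls; the helper names and the default `localhost:9200` host are assumptions, and the function only prints a command rather than calling the cluster.

```python
import json


def allocation_settings(target, other, attr="tag"):
    """Settings that pull an index onto `target` nodes and off `other` nodes."""
    return {
        f"index.routing.allocation.include.{attr}": target,
        f"index.routing.allocation.exclude.{attr}": other,
    }


def move_command(index, target, other, host="localhost:9200"):
    """curl command that applies the allocation settings to one index."""
    body = json.dumps(allocation_settings(target, other))
    return (
        f"curl -XPUT -H 'Content-Type: application/json' "
        f"{host}/{index}/_settings -d '{body}'"
    )


# Yesterday's index goes from the hot tier to the cold tier.
print(move_command("logs_2017.11.22", target="cold", other="hot"))
```

Run once a day from cron (or any scheduler) after the new daily index has been created on the hot tier.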

Hot ES Tier
Good CPU
Lots of I/O
Cold ES Tier
Memory bound
Decent I/O

Hot – cold architecture summary
Optimize costs – different hardware for different tiers
Performance – use case optimized hardware
Isolation – long-running searches don’t affect indexing

Elasticsearch client node needs
No data = no IOPS
Large query throughput = high CPU usage
Lots of results = high memory usage
Lots of concurrent queries = higher resource utilization

Elasticsearch ingest node needs
No data = no IOPS
Large index throughput = high CPU & memory usage
Complicated rules = high CPU usage
Larger documents = higher resource utilization

Elasticsearch master node needs
No data = no IOPS
Large number of indices = high CPU & memory usage
Complicated mappings = high memory usage
Daily indices = spikes in resource utilization

What about OS?
Say NO to swap
Set the right disk scheduler
- CFQ for spinning disks
- deadline for SSD
Use proper mount options for ext4
- noatime
- nodiratime
- data=writeback, nobarrier
For bare metal
- check CPU governor
- disable transparent huge pages
- /proc/sys/vm/nr_hugepages=0
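A small sketch for checking a data volume against the ext4 mount options listed above (spelled as the kernel expects them). It only inspects an option string, for example the fourth column of an /etc/fstab entry; `missing_mount_options` is a made-up helper name.

```python
RECOMMENDED = ("noatime", "nodiratime", "data=writeback", "nobarrier")


def missing_mount_options(options, wanted=RECOMMENDED):
    """Return the recommended options absent from a comma-separated list."""
    present = {option.strip() for option in options.split(",")}
    return [option for option in wanted if option not in present]


print(missing_mount_options("rw,noatime,data=ordered"))
print(missing_mount_options("rw,noatime,nodiratime,data=writeback,nobarrier"))
```

An empty list means the volume already follows the recommendations.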

We are engineers!
We develop DevOps tools!
We are DevOps people!
We do fun stuff ;)
http://sematext.com/jobs

Thank you for listening! Get in touch!
Rafał
rafal.kuc@sematext.com
@kucrafal
http://sematext.com
@sematext http://sematext.com/jobs
