Introducing	Log	Analysis
To	Your	Organization
Rafał	Kuć
Sematext and Me
logs
metrics
cloud
&
Next	60	minutes…
Log	shipping	
- buffers
- protocols
- parsing
Central	buffering
- Kafka
- Redis
Storage	&	Analysis
- Elasticsearch
- Kibana
- Grafana
Why	&	How?
- Should	I	try?
- Open	source
- Commercial
Why	You	Should	Care
Environments	are	getting	bigger
Containers	are	everywhere
Infrastructure	work	gets	automated
Faster	diagnostics	==	less	money	spent
Logs	&	metrics	at	the	same	place
Going	For	Commercial	Solution
cloud
Going	Open-Source
Going	Open-Source	– Today’s	Focus
Log shipping architecture
[Diagram: File → Shipper (×3) → Centralized Buffer → Elasticsearch data nodes (ES ×9)]
Focus:	Shipper
What about the shipper?
[Diagram: logs → shipper → Centralized Buffer]
Which shipper to use?
Which protocol should be used?
What about buffering?
Log as JSON, or parse, and how?
Buffers
Performance: batches & threads
Availability: when the central buffer is gone
Buffer types
Disk || memory || combined hybrid approach
On source || centralized
On source (App + Buffer on each host → ES):
- file or local log shipper
- easy scaling – fewer moving parts
- often with the use of a lightweight shipper
Centralized (Apps → Kafka / Redis / Logstash / etc… → ES):
- one place for all changes
- extra features made easy (like TTL)
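A minimal sketch of the "combined hybrid approach" to buffering on the source: events stay in memory until a limit is reached, then spill to a local file, and nothing is dropped while the destination is unavailable. The class name `HybridBuffer`, the JSON-lines spill file and the `send` callback are assumptions made for illustration; real shippers implement this with their own queue types.

```python
import json
import os
import tempfile
from collections import deque


class HybridBuffer:
    """Memory queue that spills to a disk file once memory is full."""

    def __init__(self, spill_path, max_in_memory=1000):
        self.spill_path = spill_path
        self.max_in_memory = max_in_memory
        self.memory = deque()

    def add(self, event):
        if len(self.memory) < self.max_in_memory:
            self.memory.append(event)
        else:
            # Memory is full: append the event to the spill file, one JSON per line.
            with open(self.spill_path, "a", encoding="utf-8") as f:
                f.write(json.dumps(event) + "\n")

    def _spilled(self):
        if not os.path.exists(self.spill_path):
            return []
        with open(self.spill_path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    def drain(self, send, batch_size=100):
        """Ship events in batches; keep whatever send() did not accept.

        send(batch) must return True on success. Returns the number shipped.
        """
        pending = list(self.memory) + self._spilled()
        shipped = 0
        while shipped < len(pending):
            batch = pending[shipped:shipped + batch_size]
            if not send(batch):
                break  # destination is gone: stop and keep the rest
            shipped += len(batch)
        remaining = pending[shipped:]
        # Refill memory first, write the overflow back to disk.
        self.memory = deque(remaining[:self.max_in_memory])
        with open(self.spill_path, "w", encoding="utf-8") as f:
            for event in remaining[self.max_in_memory:]:
                f.write(json.dumps(event) + "\n")
        return shipped


# Usage: five events, room for two in memory, the rest go to disk.
spill_file = os.path.join(tempfile.mkdtemp(), "spill.jsonl")
buffer = HybridBuffer(spill_file, max_in_memory=2)
for n in range(5):
    buffer.add({"n": n})
```

Calling `buffer.drain(send)` with a `send` that returns `False` leaves all five events buffered, which is the "when central buffer is gone" case.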
Buffers Summary
Simple: App + Buffer, App + Buffer → ES (buffer on each source)
Reliable: App, App → centralized buffer → ES
Protocols
UDP – fast, cool for the application, not reliable
TCP – reliable (almost): the application gets an ACK when the data is written to the buffer
Application-level ACKs may be needed
Protocol – supported by
HTTP – Logstash, rsyslog, Fluentd
RELP – Logstash, rsyslog
Beats – Logstash, Filebeat
Kafka – Logstash, rsyslog, Filebeat, Fluentd
Choosing the shipper
[Diagram: application → socket → rsyslog (memory & disk assisted queues) → http → Elasticsearch]
Final Architecture
[Diagram: application → socket → rsyslog (memory & disk assisted queues) → http → Elasticsearch; application → file → rsyslog / Logagent / Filebeat → consumer; annotation: "Parsing Done Here"]
Focus:	Centralized	Buffer
Why	Apache	Kafka?
Fast &	easy	to	use
Easy	to	scale
Fault	tolerant	and	highly	available
Supports	streaming
Works	in	publish/subscribe mode
Kafka architecture
[Diagram: ZooKeeper ensemble (3 nodes) and Kafka cluster (4 brokers)]
Kafka & topics
Kafka stores data in topics, written on disk
Example topics: security_logs, access_logs, app1_logs, app2_logs
Kafka & topics & partitions & replicas
[Diagram: topic "logs" split into partitions 1–4, each partition with a replica]
Scaling Kafka
[Diagram: topic "logs" growing from 1 partition to 4 partitions, then to 16 partitions]
Things to remember when using Kafka
Scales by adding more partitions, not threads
The more IOPS the better
Keep the # of consumers equal to # of partitions
Replicas used for HA and FT only
Offsets stored per consumer – multiple destinations easily possible
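Why the number of consumers should match the number of partitions can be shown with a simplified model. Within one consumer group a partition is read by exactly one consumer, so extra consumers sit idle and too few consumers each carry several partitions. The function `assign_partitions` below is a hypothetical round-robin illustration, not Kafka's actual assignor code.

```python
def assign_partitions(num_partitions, consumers):
    """Spread partitions over consumers round-robin.

    Each partition gets exactly one owner within the group.
    """
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# 4 partitions, 2 consumers: every consumer handles two partitions.
too_few = assign_partitions(4, ["c1", "c2"])

# 4 partitions, 6 consumers: c5 and c6 get nothing to read.
too_many = assign_partitions(4, ["c1", "c2", "c3", "c4", "c5", "c6"])
idle = [name for name, parts in too_many.items() if not parts]
```

Adding threads to a consumer does not change this picture; only adding partitions gives the extra consumers something to do.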
Focus:	Elasticsearch
Elasticsearch cluster architecture
[Diagram: 3 client nodes, 6 data nodes, 3 master nodes, 3 ingest nodes]
Dedicated masters please
[Diagram: the same cluster, with 3 dedicated master nodes]
discovery.zen.minimum_master_nodes -> N/2 + 1 master eligible nodes
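The N/2 + 1 rule uses integer division, which is easy to get wrong by hand. A small sketch, with the helper names `minimum_master_nodes` and `tolerated_failures` chosen here for illustration:

```python
def minimum_master_nodes(master_eligible):
    """Quorum size: N/2 + 1 with integer division."""
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible):
    """How many master-eligible nodes can be lost while a quorum remains."""
    return master_eligible - minimum_master_nodes(master_eligible)


quorum_for_three = minimum_master_nodes(3)  # the 3 dedicated masters from the slide
```

Note that four master-eligible nodes tolerate no more failures than three, which is one reason an odd number of dedicated masters is the usual choice.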
Elasticsearch	– Indices
Index – logical	place	for	data
Index – can be compared to a database in a relational DB
Index	– built	out	of	one	or	more	shards
Shard – can	be	spread	among	multiple	nodes
Scaling Elasticsearch
[Diagram sequence: (1) index Logs with a single shard; (2) indices Logs, Users and Invoices with one shard each; (3) Logs split into Shard1–Shard4; (4) the four shards spread across nodes; (5) each node holding one shard plus a replica of another (Shard1 + Replica4, Shard2 + Replica3, Shard4 + Replica1, Shard3 + Replica2)]
One	big	index	is	a	no-go
Not	scalable	enough	for	time	based	data
Indexing	slows	down	with	time
Expensive	merges
Delete by	query needed	for	data	retention
Daily indices are a good start
2017.11.16 2017.11.17 . . . 2017.11.20 2017.11.21
[Diagram labels: indexing, most searches]
Indexing is faster for smaller indices
Deletes are cheap – we delete whole indices
Search can be performed on indices that are needed
Static indices are cache friendly
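Retention with daily indices comes down to name arithmetic: work out which index names fall inside the retention window and drop every other index whole. A minimal sketch, assuming the `logs_YYYY.MM.DD` naming used later in the deck; the function names and the `keep_days` parameter are hypothetical.

```python
from datetime import date, timedelta


def daily_index(day, prefix="logs_"):
    """Index name for a given day, e.g. logs_2017.11.21."""
    return f"{prefix}{day:%Y.%m.%d}"


def indices_to_delete(existing, today, keep_days, prefix="logs_"):
    """Names of whole indices that fall outside the retention window."""
    keep = {daily_index(today - timedelta(days=i), prefix) for i in range(keep_days)}
    return sorted(
        name for name in existing
        if name.startswith(prefix) and name not in keep
    )


existing = [
    "logs_2017.11.16",
    "logs_2017.11.17",
    "logs_2017.11.20",
    "logs_2017.11.21",
]
expired = indices_to_delete(existing, today=date(2017, 11, 21), keep_days=2)
```

Each name in `expired` would then be removed with a single index delete, with no delete-by-query involved.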
Daily indices are sub-optimal
Load is not even (Black Friday, Saturday, Sunday)
Size based indices are optimal
size limit for indices: around 5 – 10GB per shard on AWS
logs_01, logs_02, . . . logs_N (indexing goes to the newest index)
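The size based scheme can be sketched as a small simulation: keep writing to the current index until the next batch would push it over the limit, then move on to the next numbered index. The function names, the batch sizes and the 10 GB limit are assumptions for illustration; in a real cluster the size check would come from index statistics.

```python
def next_index_name(current):
    """logs_01 -> logs_02, keeping the zero padding."""
    prefix, _, number = current.rpartition("_")
    return f"{prefix}_{int(number) + 1:0{len(number)}d}"


def route(batch_sizes_gb, max_index_gb, first="logs_01"):
    """Return (index, batch_size) pairs, rolling over when an index is full."""
    index, used, placed = first, 0.0, []
    for size in batch_sizes_gb:
        if used > 0 and used + size > max_index_gb:
            index, used = next_index_name(index), 0.0
        placed.append((index, size))
        used += size
    return placed


# Five 4 GB batches against a 10 GB limit.
placement = route([4, 4, 4, 4, 4], max_index_gb=10)
```

A quiet day and a spiky day both produce indices of roughly the same size; only the number of indices changes.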
Slice	using	size
Predictable searching	and	indexing	performance
Better indices	balancing
Fewer	shards
Easier handling of	spiky	loads
Lower costs thanks to better hardware utilization
Proper Elasticsearch configuration
Keep index.refresh_interval at maximum possible value
1 sec -> 100%, 5 sec -> 125%, 30 sec -> 175%
You can loosen up merges
- possible because of heavy aggregation use
- segments_per_tier -> higher
- max_merge_at_once -> higher
- max_merged_segment -> lower
All prefixed with index.merge.policy
Result: higher indexing throughput
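The settings named above can be collected into one index settings body. This is a sketch only: the helper name `indexing_settings` and every value are illustrative assumptions, not recommendations from the talk, and the right numbers depend on the workload.

```python
import json


def indexing_settings(refresh_interval="30s", segments_per_tier=20,
                      max_merge_at_once=20, max_merged_segment="2gb"):
    """Build an index settings body that favours indexing throughput."""
    prefix = "index.merge.policy."
    return {
        "index.refresh_interval": refresh_interval,
        prefix + "segments_per_tier": segments_per_tier,
        prefix + "max_merge_at_once": max_merge_at_once,
        prefix + "max_merged_segment": max_merged_segment,
    }


# Serialized form, ready to be sent as the body of an index settings update.
body = json.dumps(indexing_settings(), indent=2)
```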
Proper	Elasticsearch	configuration
Index only	needed	fields
Use	doc	values
Do	not	index	_source
Do	not	store	_all
Optimization	time
We	can	optimize data	nodes	for	time	based	data
Hot – cold architecture
[Diagram: nodes ES hot, ES cold, ES cold; index logs_2017.11.22]
Node startup options: -Enode.attr.tag=hot on the hot node, -Enode.attr.tag=cold on each cold node
curl -XPUT localhost:9200/logs_2017.11.22 -H 'Content-Type: application/json' -d '{
  "settings" : {
    "index.routing.allocation.exclude.tag" : "cold",
    "index.routing.allocation.include.tag" : "hot"
  }
}'
Hot – cold architecture
[Diagram sequence: indexing goes to logs_2017.11.22 on ES hot; logs_2017.11.23 is created on ES hot and takes over indexing]
move index after day ends
curl -XPUT localhost:9200/logs_2017.11.22/_settings -H 'Content-Type: application/json' -d '{
  "index.routing.allocation.exclude.tag" : "hot",
  "index.routing.allocation.include.tag" : "cold"
}'
[Diagram sequence: logs_2017.11.22 now sits on an ES cold node while indexing continues into logs_2017.11.23 on ES hot; logs_2017.11.24 is created on ES hot, and after the day ends logs_2017.11.23 is moved to the other ES cold node]
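The daily "move index after day ends" step can be scripted. The sketch below only decides which indices should be on the cold tier and builds the settings body for them; sending the request is left out. The function names are hypothetical, and the `tag` attribute matches the node attribute used in the curl calls above.

```python
from datetime import date


def allocation_settings(target):
    """Settings body that pins an index to the 'hot' or 'cold' tier."""
    other = "cold" if target == "hot" else "hot"
    return {
        "index.routing.allocation.include.tag": target,
        "index.routing.allocation.exclude.tag": other,
    }


def indices_to_move(index_names, today, prefix="logs_"):
    """Daily indices older than today's; these belong on the cold tier.

    Zero-padded YYYY.MM.DD names sort chronologically as plain strings.
    """
    current = f"{prefix}{today:%Y.%m.%d}"
    return sorted(
        name for name in index_names
        if name.startswith(prefix) and name < current
    )


names = ["logs_2017.11.22", "logs_2017.11.23", "logs_2017.11.24"]
cold_candidates = indices_to_move(names, today=date(2017, 11, 24))
cold_body = allocation_settings("cold")
```

Re-applying the same settings to an index that already lives on a cold node changes nothing, so the list does not need to track which indices were moved earlier.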
Hot	– cold	architecture
Hot	ES	Tier
Good	CPU
Lots	of	I/O
Cold	ES	Tier
Memory	bound
Decent	I/O
Hot	– cold	architecture	summary
Optimize	costs – different	hardware	for	different	tier
Performance – use	case	optimized	hardware
Isolation – long	running	searches	don’t	affect	indexing
Elasticsearch	client node	needs
No	data	=	no	IOPS
Large	query	throughput	=	high	CPU	usage
Lots	of	results	=	high	memory usage
Lots of concurrent queries = higher resource utilization
Elasticsearch	ingest	node	needs
No	data	=	no	IOPS
Large	index	throughput	=	high	CPU	&	memory	usage
Complicated	rules	=	high	CPU	usage
Larger documents = higher resource utilization
Elasticsearch master node needs
No data = no IOPS
Large number of indices = high CPU & memory usage
Complicated mappings = high memory usage
Daily indices = spikes in resource utilization
What	about	OS?
Say	NO to	swap
Set	the	right	disk	scheduler
CFQ for	spinning	disks
deadline for	SSD
Use	proper	mount options	for	ext4
noatime
nodiratime
data=writeback, nobarrier
For	bare	metal
check	CPU	governor
disable	transparent	huge	pages
/proc/sys/vm/nr_hugepages=0
Analysis	- Kibana
Analysis	- Grafana
Where	To	Go	From	Here?
We	are	engineers!
We	develop DevOps	tools!
We	are	DevOps people!
We	do	fun	stuff	;)
http://sematext.com/jobs
Thank	you	for	listening!	Get	in	touch!
Rafał
rafal.kuc@sematext.com
@kucrafal
http://sematext.com
@sematext http://sematext.com/jobs
