The document discusses various techniques for analyzing time series data in Elasticsearch including:
1) Using a hot-cold architecture to separate indexing and query nodes for optimized performance.
2) Implementing shard allocation awareness to distribute shards across availability zones for high availability.
3) Employing time framed indices, index templates, and aliases to easily manage indices for different time periods.
4) Leveraging aggregations, metrics, buckets, and pipeline aggregations to analyze and summarize large volumes of time series data.
2. Table of Contents
- Hot-Cold Architecture
- Data High Availability
- Data design at large scale
- Search Execution
- Time framed indices
- Aggregations
4. Hot-Cold Architecture
Hot Data Nodes
Perform indexing
Hold most recent data
Use SSD storage, since writing is an I/O-intensive operation
Cold Data Nodes
Handle read-only operations
Can use large spinning disks
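The hot/cold split above is usually expressed through shard allocation filtering. The sketch below only builds the request bodies and sends nothing. It assumes the nodes were started with a custom attribute called box_type (a hypothetical name, any attribute name works) set to "hot" or "cold".

```python
import json

# Assumption: nodes were started with a custom attribute named "box_type"
# set to either "hot" or "cold"; the attribute name is only a convention.
TIER_ATTRIBUTE = "box_type"


def tier_requirement(tier):
    """Index-level setting that pins an index's shards to one tier."""
    if tier not in ("hot", "cold"):
        raise ValueError("tier must be 'hot' or 'cold'")
    return {"index.routing.allocation.require." + TIER_ATTRIBUTE: tier}


# Body for creating a fresh index on the hot nodes.
create_on_hot = {"settings": tier_requirement("hot")}

# Body for an update-settings request that relocates an aged index to cold nodes.
move_to_cold = tier_requirement("cold")

print(json.dumps(create_on_hot))
print(json.dumps(move_to_cold))
```

Sending move_to_cold to an index's _settings endpoint would ask Elasticsearch to relocate that index's shards to nodes tagged cold.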
11. Shard Allocation Awareness
● Replicas are spread across AZs
● No two copies of the same shard on the same rack
● Elasticsearch is fully aware of shard distribution
● Awareness can be set at the cluster or index level
● Elasticsearch will prefer using local shards
● Always balance your nodes across AZs
● Routing Allocation Awareness can be updated on a live cluster
Availability Zone 1: cluster.routing.allocation.awareness.attributes: rack_1
Availability Zone 2: cluster.routing.allocation.awareness.attributes: rack_2
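As a rough sketch of how awareness could be switched on for a live cluster, the snippet below builds the body of a cluster update-settings request. It assumes each node carries a custom attribute named rack_id (a hypothetical name) whose value, such as rack_1 or rack_2, identifies its rack or availability zone. The node attribute is set as node.attr.rack_id in Elasticsearch 5.x and later, and as node.rack_id in earlier releases.

```python
import json

# Assumption: every node carries a custom attribute named "rack_id" whose
# value identifies its rack or availability zone (for example "rack_1").
AWARENESS_ATTRIBUTE = "rack_id"


def awareness_update(attribute_names):
    """Body for a cluster update-settings request that enables awareness."""
    key = "cluster.routing.allocation.awareness.attributes"
    return {"persistent": {key: ",".join(attribute_names)}}


body = awareness_update([AWARENESS_ATTRIBUTE])
print(json.dumps(body, indent=2))
```

The setting lists attribute names, not their values: the per-node values (rack_1, rack_2) live in each node's own configuration.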
15. Shard Allocation Awareness
● Use Forced Awareness to avoid the extra load of reallocating missing shards
Make sure you can handle the load with fewer nodes!
16. Forced Awareness
● Forced awareness solves this problem by NEVER allowing copies of the same shard to be allocated to the same zone.
● Avoids the extra load of reallocating unassigned shards after a rack failure.
● Leaves no single point of failure in your system.
● Make sure you can handle the load with fewer nodes.
cluster.routing.allocation.awareness.attributes: zone
cluster.routing.allocation.awareness.force.zone.values: zone1,zone2
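A minimal sketch of generating the two forced-awareness settings together. It assumes the awareness attribute is named zone, since the attribute listed under awareness.attributes has to be the same name that appears in the force.<attribute>.values key. The zone values are the ones from the slide.

```python
def forced_awareness_settings(attribute, values):
    """Build the pair of cluster settings used for forced awareness."""
    prefix = "cluster.routing.allocation.awareness."
    return {
        # Name of the node attribute that Elasticsearch balances copies over.
        prefix + "attributes": attribute,
        # Every value the attribute may take, including zones that are down.
        prefix + "force." + attribute + ".values": ",".join(values),
    }


settings = forced_awareness_settings("zone", ["zone1", "zone2"])
for key in sorted(settings):
    print(key + ": " + settings[key])
```

Because both zones are declared up front, losing zone2 leaves its replica copies unassigned instead of rebuilding them inside zone1.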
33. Search Execution Preference
By default Elasticsearch targets primary and replica shards in a round-robin manner, so each copy of a shard receives a similar share of queries. The preference parameter changes this:
_primary              Query only primary shards (latest info from the index, or optimize for the writing path)
_primary_first        Query primary shards first, if available
_replica              Query replica shards only
_replica_first        Query replica shards first, if available
_local                Query shards available on the current node
_only_node:node_id    Query a specific node
_only_nodes:*         Query only a set of nodes
_prefer_node:node_id  Query a preferred node
_shards:1,3           Query specific shards; can be combined with another preference, e.g. _shards:1,3;_local
GET _search?preference=_replica
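The preference values above are passed as a query-string parameter. Below is a small sketch for assembling them. The helper names and the index name logs are hypothetical, and the ";" separator for combining _shards with a second preference follows the slide's syntax.

```python
from urllib.parse import urlencode


def shards_preference(shard_ids, then=None):
    """Build a _shards preference, optionally followed by a second one."""
    preference = "_shards:" + ",".join(str(s) for s in shard_ids)
    if then:
        preference += ";" + then
    return preference


def search_path(index, preference):
    """Path and query string for a search with the given preference."""
    return "/" + index + "/_search?" + urlencode({"preference": preference})


combined = shards_preference([1, 3], then="_local")
print(combined)
print(search_path("logs", "_replica"))
print(search_path("logs", combined))
```

urlencode percent-encodes the colon, comma, and semicolon in the combined form, which keeps the separator from being misread as a query-string delimiter.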
36. Closing/Opening Index
➔ Closing an index
◆ Removes all shard allocations from the cluster
◆ But keeps the index data around
◆ Helps reduce the resources used on the cluster
◆ Consumes only disk space
➔ Opening an index
◆ Makes a closed index available again
◆ Note: these are not millisecond-scale operations; opening an index can take from a few seconds to a couple of minutes
◆ Flushing before closing will reduce the opening time
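The flush-then-close order recommended above can be written down as an ordered list of requests. This sketch only produces the method and path pairs for the flush, close, and open index APIs. The index name is a hypothetical example.

```python
def archive_requests(index):
    """Requests to run, in order, before an index goes dormant."""
    return [
        ("POST", "/" + index + "/_flush"),  # flush first to shorten reopening
        ("POST", "/" + index + "/_close"),  # then release cluster resources
    ]


def restore_requests(index):
    """Request that brings a closed index back online."""
    return [("POST", "/" + index + "/_open")]


for method, path in archive_requests("logs-2016.01"):
    print(method, path)
for method, path in restore_requests("logs-2016.01"):
    print(method, path)
```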
37. Index Templates
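- Order lets a higher-order template override settings from lower-order templates
- Settings (for example the number of shards and replicas) can be changed in the template at any time and apply to newly created indices, which lets you scale
- Aliases can be defined on index creation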
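To illustrate how order resolves overlapping templates, the sketch below defines two template bodies and emulates the merge in plain Python: lower orders are applied first and higher orders override them. The template bodies use the older "template" pattern key from the 2.x era (later versions use index_patterns). The pattern names, shard counts, and alias name are hypothetical.

```python
def make_template(pattern, order, settings, aliases=None):
    """Template body in the legacy (2.x-style) format."""
    return {
        "template": pattern,
        "order": order,
        "settings": settings,
        "aliases": aliases or {},
    }


def effective_settings(templates):
    """Emulate the merge: lower order first, higher order overrides."""
    merged = {}
    for template in sorted(templates, key=lambda t: t["order"]):
        merged.update(template["settings"])
    return merged


base = make_template("*", 0, {"number_of_shards": 1, "number_of_replicas": 1})
logs = make_template("logs-*", 1, {"number_of_shards": 3}, {"logs-all": {}})

print(effective_settings([logs, base]))
```

A new logs-* index would pick up three shards from the higher-order template and keep the replica count from the base one.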
39. Time framed indices lifecycle
1. Use Index templates to generate mappings for new indices
2. Use aliases to decouple your application from data logic
3. Use hot nodes for fresh data
4. Move old data to cold nodes
5. Close old indices before deletion
6. Change your time frame at any point to scale (monthly, weekly, ...)
7. Use Routing if you have too many shards in a big cluster
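Changing the time frame in step 6 mostly comes down to how index names are derived from a timestamp. A minimal sketch follows. The logs prefix and the naming patterns are assumptions, not a convention required by Elasticsearch.

```python
from datetime import date


def index_name(prefix, day, frame):
    """Name of the time-framed index that a document dated `day` goes to."""
    if frame == "daily":
        return "%s-%s" % (prefix, day.strftime("%Y.%m.%d"))
    if frame == "weekly":
        iso = day.isocalendar()  # (ISO year, ISO week number, weekday)
        return "%s-%d.w%02d" % (prefix, iso[0], iso[1])
    if frame == "monthly":
        return "%s-%s" % (prefix, day.strftime("%Y.%m"))
    raise ValueError("unknown time frame: " + frame)


day = date(2016, 3, 7)
for frame in ("daily", "weekly", "monthly"):
    print(index_name("logs", day, frame))
```

Because the application reads and writes through aliases, switching from daily to monthly names only changes where new documents land.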
46. Doc Values
- Why do we need this?
  - Sorting, aggregations, some scripting
- Doc Values
  - Build a columnar-style data structure on disk
  - Created at indexing time, stored as part of the segment
  - Read like other pieces of the Lucene index
  - Don't take up heap space
  - Use the file system cache
  - Default for not_analyzed string and numeric fields in 2.0+
47. Raw Fields
- Use customer_name.raw for aggregations
- Use customer_name for search
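The two usages above rely on a multi-field mapping. This sketch shows one way it could look in the 2.x-era syntax this deck targets (string type with a not_analyzed sub-field; later versions use text and keyword instead). The search term and the aggregation name are hypothetical.

```python
import json

# Analyzed main field for full-text search, plus a not_analyzed "raw"
# sub-field that keeps the exact value for aggregations and sorting.
field_mapping = {
    "customer_name": {
        "type": "string",
        "fields": {"raw": {"type": "string", "index": "not_analyzed"}},
    }
}

# Full-text search goes against the analyzed field.
search_body = {"query": {"match": {"customer_name": "acme"}}}

# Aggregations go against the exact-value sub-field.
agg_body = {
    "size": 0,
    "aggs": {"top_customers": {"terms": {"field": "customer_name.raw"}}},
}

print(json.dumps(field_mapping, indent=2))
```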
52. Scripted Metric Aggregation
- init_script: Executed first. Allows initialization of variables.
- map_script: Executed once per document collected.
- combine_script: Executed once on each shard after document collection is complete.
- reduce_script: Executed once on the coordinating node after all shards have returned their results.
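The four phases can be emulated in plain Python to show where each one runs. This is not Elasticsearch scripting code, only an illustration of the control flow. The shard contents and the amount field are made up.

```python
def scripted_metric(shards, init, map_doc, combine, reduce_all):
    """Run the four scripted-metric phases over a list of shards."""
    shard_results = []
    for docs in shards:
        state = init()                        # init_script: once per shard
        for doc in docs:
            map_doc(state, doc)               # map_script: once per document
        shard_results.append(combine(state))  # combine_script: once per shard
    return reduce_all(shard_results)          # reduce_script: coordinating node


# Two shards holding three documents in total.
shards = [[{"amount": 10}, {"amount": 5}], [{"amount": 7}]]

total = scripted_metric(
    shards,
    init=lambda: {"amounts": []},
    map_doc=lambda state, doc: state["amounts"].append(doc["amount"]),
    combine=lambda state: sum(state["amounts"]),
    reduce_all=lambda results: sum(results),
)
print(total)
```

Combining on each shard means only one number per shard travels to the coordinating node instead of every collected value.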
53. Bucket Aggregations
- Children Aggregation
- Date Histogram Aggregation
- Date Range Aggregation
- Filter Aggregation
- Filters Aggregation
- Global Aggregation
- Histogram Aggregation
- Missing Aggregation
- Range Aggregation
- Reverse nested Aggregation
- Sampler Aggregation
- Significant Terms Aggregation
- Terms Aggregation
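For time series data, the date histogram is typically the outer bucket, with other bucket aggregations nested inside it. A sketch of such a request body follows. The field names timestamp and status are hypothetical, and the interval key matches the 2.x-era syntax (later versions use calendar_interval or fixed_interval).

```python
import json


def daily_breakdown(date_field, term_field):
    """Search body: one bucket per day, split by the top terms of a field."""
    return {
        "size": 0,  # return aggregation results only, no hits
        "aggs": {
            "per_day": {
                "date_histogram": {"field": date_field, "interval": "day"},
                "aggs": {"by_term": {"terms": {"field": term_field}}},
            }
        },
    }


body = daily_breakdown("timestamp", "status")
print(json.dumps(body, indent=2))
```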
60. Pipeline Aggregations
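Parent
- Receives the output of its parent aggregation and can compute new buckets or new aggregations to add to the existing buckets.
Sibling
- Receives the output of a sibling aggregation and can compute a new aggregation at the same level as that sibling.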
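The difference between the two families shows up in where the pipeline aggregation sits in the request. In the sketch below, derivative is used as a parent pipeline (nested inside the histogram) and avg_bucket as a sibling pipeline (next to it). The field names timestamp and price and the aggregation names are hypothetical, and interval follows the 2.x-era syntax.

```python
import json

body = {
    "size": 0,
    "aggs": {
        "per_month": {
            "date_histogram": {"field": "timestamp", "interval": "month"},
            "aggs": {
                "sales": {"sum": {"field": "price"}},
                # Parent pipeline: lives inside the histogram and adds a
                # month-over-month change to each existing bucket.
                "sales_change": {"derivative": {"buckets_path": "sales"}},
            },
        },
        # Sibling pipeline: sits next to the histogram and produces a single
        # value computed from all of its buckets.
        "avg_monthly_sales": {"avg_bucket": {"buckets_path": "per_month>sales"}},
    },
}

print(json.dumps(body, indent=2))
```

The buckets_path of the sibling uses ">" to step from the histogram into its sales metric, while the parent refers to sales directly because it already sits inside the bucket.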