Side by Side with Elasticsearch & Solr, Part 2

Side by Side with
Elasticsearch & Solr
Part 2 - Performance & Scalability
Radu Gheorghe
Rafał Kuć

Who are we?
Radu Rafał
Logsene
cftweia

One year progress
Facet by function
https://issues.apache.org/jira/browse/SOLR-1581
Analytics component
Solr as standalone application
top_hits aggregation
https://github.com/elastic/elasticsearch/pull/6124
minimum_should_match on has_child
filters aggregation

That's not all
JSON facets
Backup + restore
Streaming aggregations
Moving averages aggregation
Computation on aggregations
Cluster state diff support
Cross data center replication
Shadow replicas

This year’s agenda
Horizontal scaling
Products use-case
Logs use-case

Horizontal scaling in theory
Diagram sequence:
1 node, 1 shard
2 nodes, 1 shard each
3 nodes, 1 shard each
3 nodes, 4 shards each
3 nodes, 4 shards and 4 replicas each

Horizontal scaling - the API
Solr:
  Create / remove replicas on the fly - Collections API
  Moving shards around the cluster using add / delete replica
  Shard splitting using Collections API
  Migrating data with a given routing key to another collection using API
Elasticsearch:
  Create / remove replicas on the fly - Update Indices Settings
  Moving shards around the cluster using Cluster Reroute API
  Automatic shard balancing by default
Shard allocation awareness (Elasticsearch) & rule based shard placement (Solr)
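A minimal sketch of how the calls named above could be assembled, assuming Python. The host addresses, the collection/index name "products" and the node names are hypothetical; the functions only build the requests and do not send them.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoints, used only for illustration.
SOLR = "http://localhost:8983/solr"
ES = "http://localhost:9200"


def solr_collections_call(action, **params):
    """Build a Solr Collections API URL for the given action."""
    query = urlencode({"action": action, **params})
    return f"{SOLR}/admin/collections?{query}"


def es_set_replicas(index, replicas):
    """Build an Update Indices Settings request that changes the replica count."""
    body = {"index": {"number_of_replicas": replicas}}
    return "PUT", f"{ES}/{index}/_settings", json.dumps(body)


def es_move_shard(index, shard, from_node, to_node):
    """Build a Cluster Reroute request that moves one shard between two nodes."""
    command = {"move": {"index": index, "shard": shard,
                        "from_node": from_node, "to_node": to_node}}
    return "POST", f"{ES}/_cluster/reroute", json.dumps({"commands": [command]})


if __name__ == "__main__":
    print(solr_collections_call("ADDREPLICA", collection="products", shard="shard1"))
    print(solr_collections_call("SPLITSHARD", collection="products", shard="shard1"))
    print(es_set_replicas("products", 2))
    print(es_move_shard("products", 0, "node1", "node2"))
```

Building the requests as plain data keeps the sketch independent of any particular HTTP client.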

The products - assumptions
Steady data growth
Common, known data structure
Large QPS
Spikes in traffic

Hardware and data
2 x EC2 c3.large instances (2vCPU, 3.5GB RAM, 2x16GB SSD in RAID0)
vs
Wikipedia

Test requests
One, common query
https://github.com/sematext/berlin-buzzwords-samples/tree/master/2015
Dictionary of common and uncommon terms
JMeter hammering

Product search @ 5 threads

The products - summary
Prepare for high QPS
Plan for data growth
Both are fast, but with Wikipedia
<

Hardware and data (2nd try)
(2vCPU, 3.5GB RAM,
vs
Video search

Real product search @ 20 threads

The products – real summary
Prepare for high QPS
Plan for data growth
With video both are fast and configuration matters

The logs - assumptions
Lots of data
No updates
Low query count
Time oriented data

Hardware and data
(2vCPU, 3.5GB RAM,
vs
Apache logs

Tuning
Solr                        Elasticsearch
soft autocommit: 5s         refresh_interval: 5s
docValues: true             doc_values: true
catch all field: on         _all: enabled
hard autocommit: 200mb      flush_threshold_size: 200mb
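A rough sketch of where some of these settings live, assuming Python. The Elasticsearch setting names (`index.refresh_interval`, `index.translog.flush_threshold_size`) and the Solr `autoSoftCommit` element are real; the field name `response` and its `long` type are assumptions, and the exact configuration used in the tests is not shown on the slide.

```python
import json


def es_index_settings():
    """Index-level Elasticsearch settings matching the values on the slide."""
    return {
        "index": {
            "refresh_interval": "5s",
            "translog": {"flush_threshold_size": "200mb"},
        }
    }


def es_field_mapping():
    """Mapping for one hypothetical numeric field with doc values switched on."""
    return {"response": {"type": "long", "doc_values": True}}


def solr_soft_commit_xml(seconds):
    """solrconfig.xml fragment for soft autocommit; Solr expects milliseconds."""
    return f"<autoSoftCommit><maxTime>{seconds * 1000}</maxTime></autoSoftCommit>"


if __name__ == "__main__":
    print(json.dumps(es_index_settings(), indent=2))
    print(json.dumps(es_field_mapping(), indent=2))
    print(solr_soft_commit_xml(5))
```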

Test requests
Filters                          Aggregations/Facets
filter by client IP              date histogram
filter by word in user agent     top 10 response codes
wildcard filter on domain        # of unique IPs
                                 top IPs per response per time
https://github.com/sematext/berlin-buzzwords-samples/tree/master/2015
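For illustration, one filter combined with three of the aggregations could look like the sketch below on the Elasticsearch side, assuming Python. It uses the 1.x-era query DSL (`filtered` query, `interval` on the date histogram); later versions renamed these. The field names `clientip`, `response` and `@timestamp` are assumptions; the actual test queries are in the repository linked above.

```python
import json


def build_log_query(client_ip, interval="1h"):
    """Filter by client IP and compute three of the listed aggregations."""
    return {
        "size": 0,  # only aggregation results are needed, not the hits
        "query": {
            "filtered": {"filter": {"term": {"clientip": client_ip}}}
        },
        "aggs": {
            "over_time": {
                "date_histogram": {"field": "@timestamp", "interval": interval}
            },
            "top_response_codes": {"terms": {"field": "response", "size": 10}},
            "unique_ips": {"cardinality": {"field": "clientip"}},
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_log_query("10.0.0.1"), indent=2))
```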

Test runs
1. Write throughput
2. Capacity of a single index
3. Capacity with time-based indices on hot/cold setup

Single index write throughput

Time-based indices
Smaller indices:
  lighter indexing
  easier to isolate hot data from cold data
  easier to relocate
Bigger indices:
  less RAM
  less management overhead
  smaller cluster state
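With time-based indices, a query for a time range only needs to touch the indices that cover it. A minimal sketch, assuming Python, one index per day and a hypothetical `logs-YYYY.MM.DD` naming scheme:

```python
from datetime import date, timedelta


def daily_indices(start, end, prefix="logs-"):
    """Return the daily index names covering start..end, both inclusive."""
    if end < start:
        raise ValueError("end must not be before start")
    days = (end - start).days
    return [f"{prefix}{(start + timedelta(days=i)):%Y.%m.%d}"
            for i in range(days + 1)]


if __name__ == "__main__":
    # Indices needed for a query spanning the last days of May into June.
    print(daily_indices(date(2015, 5, 30), date(2015, 6, 1)))
```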

Hot / Cold in practice
Solr:
  Cron job does everything
  Creates indices in advance, on the hot nodes (createNodeSet property)
  Optimizes old indices, creates N new replicas of each shard on the cold nodes
  Removes all the replicas from the hot nodes
Elasticsearch:
  Uses node properties (e.g. node.tag)
  Index templates + shard allocation awareness = new indices go to hot nodes automagically
  Cron job optimizes old indices and changes shard allocation attributes => shards get moved to the cold nodes
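The Elasticsearch side of such a cron job might be sketched as below, assuming Python. The `logs-` daily naming, the `hot_days` cutoff and the `cold` tag value are hypothetical; `_optimize` is the 1.x-era endpoint (later renamed), and `index.routing.allocation.include.tag` assumes nodes carry a `node.tag` attribute as on the slide. The function only builds the requests.

```python
from datetime import date, timedelta


def retire_index_requests(today, hot_days=2, prefix="logs-"):
    """Build the requests that move the index leaving the hot window to cold nodes."""
    aging_day = today - timedelta(days=hot_days)
    index = f"{prefix}{aging_day:%Y.%m.%d}"
    return [
        # 1. Merge the old index down to one segment while still on hot nodes.
        ("POST", f"/{index}/_optimize?max_num_segments=1", None),
        # 2. Change the allocation attribute so its shards relocate to cold nodes.
        ("PUT", f"/{index}/_settings",
         {"index.routing.allocation.include.tag": "cold"}),
    ]


if __name__ == "__main__":
    for request in retire_index_requests(date(2015, 6, 3)):
        print(request)
```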

Time-based: 2 hot and 2 cold
Before vs. After

Time-based: 2 hot and 2 cold

The logs - summary
Faceting
Use time-based indices
Use hot / cold nodes policy
Filtering

One summary to rule them all
Differences in configuration often matter more than differences in products
Before the tests we expected results to be totally opposite in both use cases
Do your own tests with your data and queries before jumping to conclusions

We are hiring
Dig Search?
Dig Analytics?
Dig Big Data?
Dig Performance?
Dig Logging?
Dig working with and in open source?
We’re hiring worldwide!
http://sematext.com/about/jobs.html

Call me, maybe?
Radu Gheorghe
@radu0gheorghe
radu.gheorghe@sematext.com
Rafał Kuć
@kucrafal
rafal.kuc@sematext.com
Sematext
@sematext
http://sematext.com
