Deep Dive Into Elasticsearch

Deep Dive Into Elasticsearch
Kunal Kapoor
Software Consultant
Knoldus Software LLP

AGENDA
● What is Elasticsearch
● Getting Started
● Key Terminologies
● CRUD Operations
● Understanding the physical layout
● What happens when you index a document
● How to make an inverted index mutable
● How per-segment search works
● How a delete operation works
● Segment Merging

What is Elasticsearch
● Search engine based on Lucene.
● Provides near real-time search
● Distributed
● Fault Tolerant
● Notable users:
– Facebook
– Github
– CERN
– LinkedIn

Getting Started
● Download the elasticsearch distribution from
https://www.elastic.co/downloads/elasticsearch
● To start the elasticsearch server run the following
command from within the extracted directory
– ./bin/elasticsearch
● Once the server or node is created you can check the
health of your cluster by running
– curl 'localhost:9200/_cat/health?v'

Key Terminologies
● Node - A node is a single server that is part of your
cluster, stores your data, and participates in the cluster’s
indexing and search capabilities
● Cluster - A cluster is a collection of one or more nodes
(servers) that together holds your entire data and
provides indexing and search capabilities across all
nodes.
● Index - An index is a collection of documents that have
somewhat similar characteristics.

Key Terminologies
● Type - A type is a logical category/partition of your index.
It is defined for documents that have a set of common
fields.
● Shard – A shard is basically an lucene index. Contains
the documents and various data structures that help in
searching.

CRUD Operations
● Indexing a document
– curl -XPOST 'localhost:9200/test/test/1?pretty' -d '{"text":"Hello World"}'
● Updating a document
– curl -XPOST 'localhost:9200/test/test/1?pretty' -d '{"text":"Hello"}'
● Delete a document
– curl -XDELETE 'localhost:9200/test/test/1?pretty'
● Search a document
– curl -XPOST 'localhost:9200/test/_search?pretty' -d '{"query":
{ "match": {"text": "hello" }}}'

Understanding the physical layout

Inverted Index
● Data structure storing a mapping, from content such as
words or numbers, to its locations in a database file, or a
set of documents.
● Provides full-text search
● Consists of 2 parts
– Sorted Dictionary
– Postings
● Immutable

What happens when you index a
document?
● The node that receives the request becomes the
controller for that request.
● That node determines the shard in which the document
should reside on the basis of the documents Id.
– shard = hash(document_id) % number_of_primary_shards
● The request is then forwarded to the appropriate node
which contains the shard.
● The node forwards the request to the appropriate shard.

document?
● The shard performs analysis on the document and creates the
appropriate inverted index which is helpful for searching.
● The request is then sent to the replica shards.
● The documents are analyzed by the standard analyzer by
default.
● It split the documents on white-space and lowercases the
documents.
● The documents are then ready to be inserted in the inverted
index

document?
{
“text”:”Elasticsearch is an awesome search engine”
}
{
“text:”Elasticsearch is not a database”
}

elasticsearch, is, an, awesome, search, engine, elasticsearch,not, a, database

Terms Document1 Document2
elasticsearch ✓ ✓
is ✓ ✓
an ✓ -
awesome ✓ -
search ✓ -
engine ✓ -
not - ✓
a - ✓
database - ✓

How to make an inverted index
mutable?
● Earlier, the whole inverted index would be rewritten to
disk with the changes.
● Very costly approach
● Lucene introduced the concept of per-segment search.
● Now a Lucene index would mean a collection of
segments plus a commit point.
● A commit point is a file that contains the list of segments
that are ready for search.

How per-segment search works?
● New documents are collected in an in-memory buffer.
● Every so often, the buffer is commited (refresh)
– A new supplementary segment with a commit point is
written to file-system cache.
– The transaction log is updated with the request for a
full commit later.
● The buffer is cleared and the segment is made available
for search.

In-memory Buffer
COMMIT POINT
Transaction Log

How a delete operation works?
● Every shard in the Elasticsearch node maintains a .del file
along with the commit point.
● .del file lists which documents in which segments have
been deleted.
● When a delete request is encountered, the appropriate
document is marked as deleted.
● The deleted document will still match the search query but
will be filtered later on.
● Later the document is purged from the file-system while
segment merging.

Segment Merging
● Each segment consumes file handles, memory, and CPU
cycles.
● The more the number of segments, the slower the search
will be.
● Elasticsearch solves this problem by merging segments
in the background.
● Small segments are merged into bigger segments.
● This is the moment when those old deleted documents
are purged from the file-system.

Segment Merging
● This is how the merge process works:-
– The merge process selects a few segments of similar
size and merges them into a new bigger segment in
the background.
– The new segment is flushed to disk.
– A new commit point is written that includes the new
segment and excludes the old, smaller segments.
– The new segment is opened for search.
– The old segments are deleted.

References
● The Definitive Guide by Clinton Gormley and Zachary
Tong
● https://www.elastic.co/guide/en/elasticsearch/reference/current/

Deep Dive Into Elasticsearch

Related slideshows

Recommended for you

Recommended for you

Recommended for you

Recommended for you

Recommended for you

Recommended for you

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Deep Dive Into Elasticsearch

Similar to Deep Dive Into Elasticsearch (20)

More from Knoldus Inc.

More from Knoldus Inc. (20)

Recently uploaded

Recently uploaded (20)

Deep Dive Into Elasticsearch

Editor's Notes