ElasticSearch 7
Presented By
Anurag
ES1.1 -
Introduction to
ELK Stack
ElasticSearch
● Elasticsearch is a search engine based on the Apache Lucene
library.
● Open-core business model
● REST-based
● Distributed
● The most popular enterprise search engine
● Used by Netflix, LinkedIn, Amazon, Oracle and many other big names
Elastic (ELK) Stack
The Beats are lightweight data shippers, written in
Go, that run on your servers to capture all sorts of
operational data (logs, metrics, or network packet
data). Beats send the operational data to
Elasticsearch, either directly or via Logstash
Logstash is a server-side data processing
pipeline that ingests data from a multitude of
sources, transforms it, and then sends it to your
favorite "stash."
Kibana is a browser-based analytics and
search dashboard for Elasticsearch.
Distributed RESTful search Engine
How Do Elasticsearch and Lucene Differ?
Just as a car (ES) and its engine (Lucene) differ.
ES makes use of Lucene to manage its indices.
Lucene is a Java library: you include it in your project and call its functions directly.
Elasticsearch is a JSON-based, distributed web server built over Lucene. Though it is Lucene that does the actual work
beneath, Elasticsearch provides a convenient layer over Lucene. Each shard that gets created in Elasticsearch is a separate
Lucene instance. So to summarize:
1. Elasticsearch is built over Lucene and provides a JSON-based REST API to access Lucene features.
2. Elasticsearch provides a distributed system on top of Lucene. A distributed system is not something Lucene is
aware of or built for. Elasticsearch provides this abstraction of distributed structure.
3. Elasticsearch provides other supporting features like thread-pool, queues, node/cluster monitoring API, data
monitoring API, Cluster management, etc.
ES 1.2 Document
Ranking
Indexing
● Elasticsearch achieves low-latency
responses because, instead of
searching the text directly, it searches
an index
● Document? The basic unit of data in ES
● Inverted Index
○ Created by tokenizing the terms in
each document
○ Create a sorted list of all unique
terms (terms are normalized,
stemmed, etc.)
○ Associate each term with the list of
documents where it can be found
○ Similar to the index at the back of a
book
Doc1: I am learning the cool stuff
Doc2: I am learning to learn
Inverted Index:
am -> [Doc1, Doc2]
cool -> [Doc1]
i -> [Doc1, Doc2]
learn -> [Doc1, Doc2] // root form of "learning"
the -> [Doc1]
…
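The steps above can be sketched in a few lines. This is a toy illustration, not how Lucene builds its index: `normalize` is a crude stand-in for the analyzers that do real tokenization, normalization, and stemming.

```python
from collections import defaultdict

def normalize(token):
    """Toy normalizer: lowercase, plus a hard-coded stem for the example."""
    token = token.lower()
    # Crude stand-in for real stemming (Lucene delegates this to analyzers).
    return "learn" if token in ("learning", "learn") else token

def build_inverted_index(docs):
    """Map each normalized term to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return index

docs = {
    "Doc1": "I am learning the cool stuff",
    "Doc2": "I am learning to learn",
}
index = build_inverted_index(docs)
print(sorted(index["learn"]))  # ['Doc1', 'Doc2']
print(sorted(index["cool"]))   # ['Doc1']
```

Looking up a term is now a dictionary access instead of a scan over every document, which is where the low latency comes from.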
Retrieving
● Term Frequency (TF)
○ Frequency of term in given
document
● Document Frequency (DF)
○ Total frequency of the term
across all documents
● IDF (Inverse Document
Frequency)
○ IDF = 1 / DF
● Relevance
○ Relevance = TF * IDF
○ Relevance = TF / DF
Search Term: learn
TF1 = 1
TF2 = 2
IDF = ⅓
Rev1 = TF1 * IDF = ⅓
Rev2 = TF2 * IDF = ⅔
Rev2 > Rev1
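The worked example on this slide is easy to reproduce. Note this is only the slide's simplified TF/DF intuition; the actual scoring function in Elasticsearch 7 is BM25, which is more involved.

```python
def relevance(tf, df):
    """Slide's toy model: Relevance = TF * IDF = TF * (1 / DF) = TF / DF."""
    return tf / df

# Search term "learn": once in Doc1, twice in Doc2, 3 occurrences overall.
rev1 = relevance(tf=1, df=3)  # 1/3
rev2 = relevance(tf=2, df=3)  # 2/3
assert rev2 > rev1  # Doc2 ranks higher for the query "learn"
```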
ES 1.3 ES Cluster
Node Structure
● Index - Logical Namespace of collection of documents
● Shard - Horizontal Partition of an Index
○ E.g. documents 1-10 in one shard, 11-20 in another, and so on.
○ In Elasticsearch, each shard is a self-contained Lucene index.
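Shard and replica counts are per-index settings. As a sketch (the index name is made up; `number_of_shards` and `number_of_replicas` are the standard setting names), creating an index with four primary shards, each with one replica, might look like:

```
PUT /my-index
{
  "settings": {
    "number_of_shards": 4,
    "number_of_replicas": 1
  }
}
```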
Cluster Structure
[Diagram: 4 nodes, each holding one primary shard (P1-P4) and one replica of another node's primary (R1-R4)]
● Here we can see a cluster of 4
nodes
● Each node has 2 shards
● Primary and Replica shards
● For robustness and fault
tolerance, each shard is replicated
● Even if a node goes down, and a
primary shard is lost, a replica can
be made primary until recovery
● The number of primary shards is
fixed at index creation; the number
of replicas can be changed later
● Writes go to the primary and are
replicated to the replicas; reads can
be served by either
Types of Nodes
● Master Node
○ Cluster-wide operations (creating and deleting indices, keeping track of
cluster nodes, assigning shards, health checks, etc.)
● Data Node
○ Holds data and indices
● Client Node
○ Load balancer (neither a data nor a master node)
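As a configuration sketch, these roles map to flags in elasticsearch.yml. The boolean flags below are the legacy 7.x style; from 7.9 onwards a single `node.roles` list replaces them.

```yaml
# Dedicated master-eligible node
node.master: true
node.data: false
node.ingest: false
---
# Coordinating-only ("client") node: neither master nor data
node.master: false
node.data: false
node.ingest: false
```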
ElasticSearch 1.4
CRUD - Write
Operations
Breaking a shard into Segments
● For ES the basic unit of storage is a shard
● For Lucene the basic unit of storage is a segment
● Each segment is an inverted index
● New documents are added to a new segment
● Segments start in memory and are later persisted to
disk
● Segments are immutable
Coordination Stage
● shard_number = hash(document_id) % (num_of_primary_shards)
● All nodes know where a shard exists
● Document passed to node which contains particular shard_number
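The routing formula above can be sketched directly. The real implementation hashes the routing value (the document `_id` by default) with Murmur3; Python's built-in `hash` here is only a stand-in for illustration.

```python
def shard_number(document_id, num_primary_shards):
    """Route a document to a primary shard: hash(_id) % number_of_primary_shards."""
    return hash(document_id) % num_primary_shards

# Every document id deterministically maps to exactly one primary shard,
# which is why the number of primary shards cannot change after creation:
n = shard_number("doc-42", num_primary_shards=4)
assert 0 <= n < 4
assert n == shard_number("doc-42", num_primary_shards=4)  # stable routing
```

Changing `num_primary_shards` would change the result of the modulo for existing ids, so every document would have to be rerouted; that is the design reason the primary shard count is fixed.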
Translog
Source:
https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-translog.html
Translog and Memory Buffer
● The request is written to the translog
● The document is added to the memory buffer (which stores all newly indexed documents)
● If the request succeeds on the primary shard, it is sent to the replica shards in parallel
● In-sync replicas are the shard copies that are kept up to date with the primary
● The client receives an acknowledgement that the request was successful only after the translog is fsync'ed on the
primary and all in-sync replicas.
Refresh Operation
● In Elasticsearch, the _refresh operation is executed every second by default.
● During this operation, the in-memory buffer contents are copied to a newly created in-memory segment.
● As a result, the new data becomes available for search.
Flush Operation
● Flush essentially means that all the documents in the in-memory buffer are written to new Lucene
segments.
● These, along with all existing in-memory segments, are committed to disk, which clears the
translog. This commit is essentially a Lucene commit.
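The write path across the last three slides (buffer + translog, refresh, flush) can be modeled as a toy shard. All names here are illustrative; this is the lifecycle, not the real internals.

```python
class ToyShard:
    """Toy model of one shard's write path: buffer -> refresh -> flush."""

    def __init__(self):
        self.buffer = []    # in-memory buffer of newly indexed documents
        self.translog = []  # durability log, fsync'ed before the client is acked
        self.segments = []  # each segment is an immutable batch of documents

    def index(self, doc):
        # A write lands in both the translog and the in-memory buffer.
        self.translog.append(doc)
        self.buffer.append(doc)

    def refresh(self):
        # Runs every 1s by default: buffer contents become a new searchable segment.
        if self.buffer:
            self.segments.append(tuple(self.buffer))  # segments are immutable
            self.buffer.clear()

    def flush(self):
        # Lucene commit: persist segments to disk, then the translog can be cleared.
        self.refresh()
        self.translog.clear()

    def search(self, doc):
        # Only segment contents are searchable; the buffer is not.
        return any(doc in seg for seg in self.segments)

shard = ToyShard()
shard.index("doc1")
assert not shard.search("doc1")  # not searchable until a refresh happens
shard.refresh()
assert shard.search("doc1")      # visible after refresh
shard.flush()
assert shard.translog == []      # the commit clears the translog
```

This is why Elasticsearch is called "near real-time": a document is durable as soon as the translog is fsync'ed, but only becomes searchable after the next refresh.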
ElasticSearch 1.5
CRUD - Update &
Delete
Elasticsearch Delete
● Segments in Elasticsearch are immutable, so the documents in them cannot be deleted
or modified in place.
● Every segment on disk has a .del file associated with it.
● When a delete request is sent, the document is not really deleted, but marked as deleted
in the .del file.
● This document may still match a search query but is filtered out of the results.
● When segments are merged, the documents marked as deleted in the .del file are not
included in the new merged segment.
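The mark-then-merge idea can be sketched as follows. This is purely illustrative: the `.del` file becomes a plain Python set, and segments become dicts.

```python
def delete(segment, doc_id):
    """Soft delete: mark the doc in the segment's ".del" set; the segment itself is immutable."""
    segment["deleted"].add(doc_id)

def live_docs(segment):
    """Search-time view: marked docs are filtered out of results."""
    return [d for d in segment["docs"] if d not in segment["deleted"]]

def merge(segments):
    """Segment merge: only live documents are copied into the new segment."""
    live = [d for seg in segments for d in live_docs(seg)]
    return {"docs": live, "deleted": set()}

seg = {"docs": ["d1", "d2", "d3"], "deleted": set()}
delete(seg, "d2")
print(live_docs(seg))          # ['d1', 'd3'] - d2 is filtered, not removed
merged = merge([seg])
print(merged["docs"])          # ['d1', 'd3'] - d2 is physically gone after merge
```

The same mechanism explains updates on the next slide: the old version is marked deleted and the new version is indexed into a fresh segment.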
Elasticsearch Update
● When a new document is created, Elasticsearch assigns a version number to that
document.
● Every change to the document results in a new version number.
● When an update is performed, the old version is marked as deleted in the .del file and
the new version is indexed in a new segment.
● The older version may still match a search query, however, it is filtered out from the
results.
ElasticSearch 1.6
CRUD - Read
Operations
ElasticSearch Read
● In this phase, the coordinating node routes the search request to all the shards
(primary or replica) in the index.
● The shards perform search independently and create a set of results sorted by
relevance score.
● All the shards return the document IDs of the matched documents and their relevance
scores to the coordinating node.
● By default, each shard sends its top 10 results to the coordinating node.
● The coordinating node sorts the results globally and creates a list of the top 10 hits.
● The coordinating node then requests the original documents from the shards that hold
them. Those shards enrich the documents and return them to the coordinating node.
● The results are aggregated and sent to the client.
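The query phase of this scatter-gather flow can be sketched as a merge of per-shard top lists. The shard results below are made up; only the merging logic is the point.

```python
import heapq

def coordinate(shard_results, size=10):
    """Merge each shard's score-sorted (doc_id, score) hits into a global top-`size` list."""
    # Each shard already returns its hits sorted by score descending, so the
    # coordinating node only needs a k-way merge, not a full re-sort.
    merged = heapq.merge(*shard_results, key=lambda hit: -hit[1])
    return [doc_id for doc_id, _score in merged][:size]

shard1 = [("a", 3.2), ("b", 1.1)]  # top hits from shard 1, sorted by score
shard2 = [("c", 2.5), ("d", 0.4)]  # top hits from shard 2, sorted by score
print(coordinate([shard1, shard2], size=3))  # ['a', 'c', 'b']
```

Only the ids in that final list are then fetched from their shards, which is why the two phases are often called "query then fetch".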
That’s all folks!
References
1. https://qbox.io/blog/refresh-flush-operations-elasticsearch-guide
2. https://www.elastic.co/guide/index.html
3. https://blog.insightdatascience.com/anatomy-of-an-elasticsearch-cluster-part-i-7ac9a13b05db