ELASTICSEARCH
Name- Binit Pathak
Location- Guwahati, Assam
Agenda
 Introduction to Elasticsearch
 Elasticsearch Architecture
 How Elasticsearch works internally?
 Elasticsearch-Rest API and Java API
 Elasticsearch 5.X
2
Elasticsearch
 Elasticsearch is a search and analytics engine that enables fast and
scalable searches in a distributed environment.
 Elasticsearch is an Apache Lucene-based search server.
 Elasticsearch is a real-time distributed and open source full-text
search and analytics engine.
3
Use cases:
 Wikipedia: This uses Elasticsearch to provide full-text search, along with functionalities
such as search-as-you-type and did-you-mean suggestions.
 The Guardian: This uses Elasticsearch to process 40 million documents per
day, provide real-time analytics of site-traffic across the organization, and
help understand audience engagement better.
 StumbleUpon: This uses Elasticsearch to power intelligent searches across
its platform and provide great recommendations to millions of customers.
 SoundCloud: This uses Elasticsearch to provide real-time search
capabilities for millions of users across geographies.
 GitHub: This uses Elasticsearch to index over 8 million code repositories,
and index multiple events across the platform, hence providing real-time
search capabilities across it.
4
Features of Elasticsearch
 Elasticsearch is scalable up to petabytes of structured and unstructured data.
◦ Elasticsearch allows us to start small but grows with our business. It is built to
scale horizontally out of the box. As we need more capacity, we can simply add more
nodes and let the cluster reorganize itself to take advantage of the extra hardware.
 Elasticsearch uses denormalization to improve the search performance.
 Elasticsearch is open source.
• Elasticsearch is built on top of Apache Lucene.
 Apache Lucene is a high-performance, full-featured information retrieval library
written in Java. Elasticsearch uses Lucene internally to build its state-of-the-art
distributed search and analytics capabilities.
5
 Elasticsearch performs near real-time searches.
◦ Elasticsearch provides data manipulation and search capabilities in near real time.
By default, you can expect a one second delay (refresh interval) from the time you
index/update/delete your data until the time that it appears in your search results.
 Elasticsearch provides support for multi-tenancy.
◦ We can host multiple indexes on one Elasticsearch installation - node or cluster.
Each index can have multiple "types", which are essentially completely different
indexes.
• Elasticsearch is Configurable and Extensible.
 Many of Elasticsearch's configurations can be changed while Elasticsearch is
running, but some require a restart (and in some cases re-indexing).
 Elasticsearch has several extension points, such as rivers and plugins.
• Elasticsearch is Document Oriented.
 Elasticsearch is document oriented, meaning that it stores entire objects
or documents. It not only stores them, but also indexes the contents of each
document in order to make them searchable.
6
 Elasticsearch is schema-less.
◦ When indexing, if a mapping is not provided, a default mapping is created by
guessing the structure from the data fields that compose the document.
7
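For illustration, a minimal sketch of this dynamic mapping (the index, type, and field names here are made up, not part of the slides): indexing a document into an index that does not exist yet makes Elasticsearch create the index and guess a mapping, which can then be inspected.

# index name 'myindex' and the fields are just examples
curl -XPUT 'localhost:9200/myindex/mytype/1' -d'
{
  "title": "Hello world",
  "views": 10,
  "published": "2017/01/01"
}'

# inspect the mapping Elasticsearch guessed from the document
curl -XGET 'localhost:9200/myindex/_mapping?pretty'

Elasticsearch would typically map title as a string/text field, views as a numeric field, and published as a date, based purely on the values it saw.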
Basic Concepts
 Cluster:
◦ A cluster consists of one or more nodes which share the same cluster name. Each
cluster has a single master node which is chosen automatically by the cluster and
which can be replaced if the current master node fails.
 Node:
◦ A node is a running instance of Elasticsearch which belongs to a cluster. There are
various types of nodes, such as Master Node, Master-eligible Node, Data Node, Tribe
Node, etc.
 Index :
◦ An index is like a ‘database’ in a relational database. It has a mapping which defines
multiple types.
◦ An index is a logical namespace which maps to one or more primary shards and can
have zero or more replica shards.
 Type / Mapping:
◦ A type is like a ‘table’ in a relational database. Each type has a list of fields that can
be specified for documents of that type. The mapping defines how each field in the
document is analyzed.
8
 Document:
◦ A document is a JSON document which is stored in Elasticsearch. It is like a row in a
table in a relational database. Each document is stored in an index and has a type
and an id.
 Field:
◦ A document contains a list of fields, or key-value pairs. The value can be a simple
value (e.g. a string, integer, date), or a nested structure like an array or an object. A
field is similar to a column in a table in a relational database.
 Shard:
◦ A shard is a single Lucene instance. It is a low-level “worker” unit which is managed
automatically by Elasticsearch. An index is a logical namespace which points
to primary and replica shards. Other than defining the number of primary and replica
shards that an index should have, we never need to refer to shards directly. Instead,
our code should deal only with an index. Elasticsearch distributes shards amongst
all nodes in the cluster, and can move shards automatically from one node to another
in the case of node failure, or the addition of new nodes.
9
 Sharding is important for two primary reasons:
◦ It allows you to horizontally split/scale your content volume
◦ It allows you to distribute and parallelize operations across shards (potentially on
multiple nodes) thus increasing performance/throughput
 Elasticsearch allows us to make one or more copies of our index’s shards
into what are called replica shards, or replicas for short.
 Replication is important for two primary reasons:
◦ It provides high availability in case a shard/node fails. For this reason, it is important
to note that a replica shard is never allocated on the same node as the
original/primary shard that it was copied from.
◦ It allows you to scale out your search volume/throughput since searches can be
executed on all replicas in parallel.
10
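As a sketch of how these settings are applied (the index name blogs and the exact numbers are assumptions for illustration), the number of primary shards is fixed at index-creation time, while the number of replicas can be changed later:

# create an index with 3 primary shards and 1 replica of each
curl -XPUT 'localhost:9200/blogs' -d'
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}'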
Elasticsearch architecture
11
A single-node cluster with an index
12
If we were to check the cluster-health now, we would see this:
{
"cluster_name": "elasticsearch",
"status": "yellow",
"timed_out": false,
"number_of_nodes": 1,
"number_of_data_nodes": 1,
"active_primary_shards": 3,
"active_shards": 3,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 3,
……………….
}
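The response above is what the cluster health API returns; a minimal request looks roughly like this:

# ask the cluster for its overall health
curl -XGET 'localhost:9200/_cluster/health?pretty'

Status yellow here means all primary shards are active, but the replica shards are unassigned because there is no second node to host them.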
A two-node cluster—all primary and replica shards are allocated
13
{
"cluster_name": "elasticsearch",
"status": "green",
"timed_out": false,
"number_of_nodes": 2,
"number_of_data_nodes": 2,
"active_primary_shards": 3,
"active_shards": 6,
………………………………….
…………………………………..
}
A three-node cluster—shards have been reallocated to spread the load
14
• One shard each from Node 1 and Node 2 has moved to the new Node 3, and we have two
shards per node, instead of three.
• This means that the hardware resources (CPU, RAM, I/O) of each node are being shared
among fewer shards, allowing each shard to perform better.
Increasing the number_of_replicas to 2
15
• Just having more replica shards on the same number of nodes doesn’t increase
our performance at all because each shard has access to a smaller fraction of its
node’s resources. We need to add hardware to increase throughput.
• But these extra replicas do mean that we have more redundancy: with the node
configuration above, we can now afford to lose two nodes without losing any data.
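The replica count is a dynamic index setting, so it can be raised on a live index; a sketch (index name blogs assumed for illustration):

# raise the number of replicas per primary shard from 1 to 2
curl -XPUT 'localhost:9200/blogs/_settings' -d'
{
  "number_of_replicas": 2
}'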
Coping with failure-Cluster after killing one node
16
• The node we killed was the master node.
 so the first thing that happened was that the nodes elected a new master: Node 2
• Primary shards 1 and 2 were lost when we killed Node 1, and our index cannot function
properly if it is missing primary shards. If we had checked the cluster health at this point, we
would have seen status red: not all primary shards are active!
• A complete copy of the two lost primary shards exists on other nodes, so the first thing that
the new master node did was to promote the replicas of these shards on Node 2 and Node
3 to be primaries, putting us back into cluster health yellow. This promotion process was
instantaneous, like the flick of a switch.
Inside Shard
17
Creating a document
A Lucene index with new documents in the in-memory buffer, ready to commit
After a commit, a new segment is added to the commit point and the buffer is cleared
Deletes and Updates
 Segments are immutable, so documents cannot be removed from older
segments, nor can older segments be updated to reflect a newer version of a
document. Instead every commit point includes a .del file that lists which
documents in which segments have been deleted. When a document is
“deleted,” it is actually just marked as deleted in the .del file.
 Document updates work in a similar way: when a document is updated, the
old version of the document is marked as deleted, and the new version of the
document is indexed in a new segment.
18
Near Real-Time Search
19
A Lucene index with new documents in the in-memory buffer
• Committing a new segment to disk requires an fsync to ensure that the segment is physically
written to disk and that data will not be lost if there is a power failure. But an fsync is costly; it
cannot be performed every time a document is indexed without a big performance hit.
• FileSystem cache - new segment is written to the FileSystem cache first—which is cheap—
and only later is it flushed to disk—which is expensive. But once a file is in the cache, it can be
opened and read, just like any other file.
The buffer contents have been written to a segment, which is
searchable, but is not yet committed.
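In practice the refresh that makes buffered documents searchable can be triggered or tuned explicitly; a sketch (index name my_index assumed for illustration):

# force a refresh so recently indexed documents become searchable right away
curl -XPOST 'localhost:9200/my_index/_refresh'

# or relax the default 1s refresh interval during heavy indexing
curl -XPUT 'localhost:9200/my_index/_settings' -d'
{
  "refresh_interval": "30s"
}'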
Making Changes Persistent
20
Elasticsearch added a translog, or transaction log, which records every operation in
Elasticsearch as it happens.
New documents are added to the in-memory buffer and appended
to the transaction log
After a refresh, the buffer is cleared but the transaction log is
not
21
The transaction log keeps accumulating documents
After a flush, the segments are fully committed and the transaction log is cleared
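A flush (a full Lucene commit plus clearing of the translog) normally happens automatically, but it can also be requested explicitly; a sketch (index name assumed):

# force a flush: fsync segments to disk and truncate the transaction log
curl -XPOST 'localhost:9200/my_index/_flush'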
Segment Merging
22
Two committed segments and one uncommitted segment in the
process of being merged into a bigger segment
Once merging has finished, the old segments are deleted
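Merging can also be forced down to a small number of segments, which is sometimes done on indexes that are no longer written to. A sketch using the force-merge endpoint (named _optimize before Elasticsearch 2.1; index name assumed):

# merge all segments of the index into a single segment
curl -XPOST 'localhost:9200/my_index/_forcemerge?max_num_segments=1'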
Distributed Document Store
 Routing a Document to a Shard:
◦ shard = hash(routing) % number_of_primary_shards
 Creating, Indexing, and Deleting a Document
23
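By default the routing value is the document's _id, but a custom routing value can be supplied so that related documents hash to the same shard. A sketch (the routing value user123 is an assumption for illustration):

# route this document by 'user123' instead of by its _id
curl -XPUT 'localhost:9200/website/blog/1?routing=user123' -d'
{
  "title": "Routed by user123, not by _id"
}'

The same routing value must then be supplied when retrieving, updating, or deleting the document.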
 Retrieving a Document
24
 Partial Updates to a Document
25
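A partial update sends only the fields to change to the _update endpoint, and Elasticsearch merges them into the existing document (retrieve, change, reindex behind the scenes). A minimal sketch (the views field is an assumption for illustration):

# merge a new field into document 123 without resending the whole document
curl -XPOST 'localhost:9200/website/blog/123/_update' -d'
{
  "doc": {
    "views": 1
  }
}'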
USING CURL: Indexing a document
 http://localhost:9200/<index>/<type>/[<id>]
◦ _index -> Where the document lives
◦ _type ->The class of object that the document represents
◦ _id -> The unique identifier for the document
 curl -XPUT 'localhost:9200/website/blog/123' -d'
{
"title": "My first blog entry",
"text": "Just trying this out...",
"date": "2014/01/01"
}'
 Elasticsearch responds as follows:
◦ { "_index": "website", "_type": "blog", "_id": "123", "_version": 1, "created": true }
26
Retrieving a document
 GET /website/blog/123?pretty
 {
"_index" : "website",
"_type" : "blog",
"_id" : "123",
"_version" : 1,
"found" : true,
"_source" :
{
"title": "My first blog entry", "text": "Just trying this out...", "date": "2014/01/01"
}
}
27
Updating a document
 PUT /website/blog/123
 {
"title": "My first blog entry",
"text": "I am starting to get the hang of this...",
"date": "2014/01/02"
 }
 In the response, we can see that Elasticsearch has incremented the _version number:
{ "_index" : "website", "_type" : "blog", "_id" : "123", "_version" : 2, "created": false }
28
Deleting a document
 DELETE /website/blog/123
 {
"found" : true,
"_index" : "website",
"_type" : "blog",
"_id" : "123",
"_version" : 3
 }
29
JAVA API
 Two ways to connect to Elasticsearch:
◦ Node Client
 Instantiating a node based client is the simplest way to get a Client that
can execute operations against Elasticsearch.
 The node will be part of the cluster.
◦ Transport Client
 The TransportClient connects remotely to an Elasticsearch cluster
using the transport module. It does not join the cluster, but simply gets
one or more initial transport addresses and communicates with them in
round robin fashion on each action
30
Breaking changes in Elasticsearch 5.0
 Query DSL
◦ search_type=count is removed; use size: 0 instead
◦ search_type=scan is removed; use a scroll request sorted by _doc instead
◦ In 5.0, Elasticsearch rejects requests that would query more than
1000 shard copies (primaries or replicas).
◦ The fields parameter has been replaced by stored_fields.
The stored_fields parameter will only return stored fields — it will
no longer extract values from the _source (see the example after this slide).
 and many more………
31
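For example, a 5.0-style search that asks only for stored fields (index and field names assumed; a field is only returned this way if it is mapped with "store": true):

# request stored fields instead of the removed fields parameter
curl -XGET 'localhost:9200/website/_search?pretty' -d'
{
  "stored_fields": ["title"],
  "query": { "match_all": {} }
}'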
 Mapping Changes
◦ string fields are replaced by text/keyword fields (an example mapping follows this slide)
◦ Numeric fields are now indexed with a completely different data structure,
called the BKD tree, which is expected to require less disk space and be
faster for range queries than the previous way that numerics were indexed.
 Setting clusters
◦ In ES 2.4
 Settings settings = Settings.settingsBuilder() .put("cluster.name", "myClusterName").build();
TransportClient client = TransportClient.builder().settings(settings).build();
◦ In ES 5.0
 Settings settings = Settings.builder() .put("cluster.name", "myClusterName").build();
TransportClient client = new PreBuiltTransportClient(settings);
32
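A sketch of what the new field types look like in a 5.0 mapping (index, type, and field names assumed): text is analyzed for full-text search, while keyword stores the exact value for filtering, sorting, and aggregations.

# create an index whose mapping uses the new text and keyword types
curl -XPUT 'localhost:9200/myindex' -d'
{
  "mappings": {
    "mytype": {
      "properties": {
        "title": { "type": "text" },
        "status": { "type": "keyword" }
      }
    }
  }
}'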
API changes
 API 2.4
◦ TransportClient client = TransportClient.builder().build()
.addTransportAddress(new
InetSocketTransportAddress(InetAddress.getByName("host1"), 9300))
.addTransportAddress(new
InetSocketTransportAddress(InetAddress.getByName("host2"), 9300));
 API 5.0
◦ TransportClient client = new PreBuiltTransportClient(Settings.EMPTY)
.addTransportAddress(new
InetSocketTransportAddress(InetAddress.getByName("host1"), 9300))
.addTransportAddress(new
InetSocketTransportAddress(InetAddress.getByName("host2"), 9300));
33
Some drawbacks of Elasticsearch 5.0
 Elasticsearch requires at least Java 8.
 Spring Data Elasticsearch does not yet support ES 5.
34
THANKS
35