2. Agenda
Introduction to Elasticsearch
Elasticsearch Architecture
How Elasticsearch works internally
Elasticsearch-Rest API and Java API
Elasticsearch 5.X
3. Elasticsearch
Elasticsearch is a search and analytics engine that enables fast and
scalable searches in a distributed environment.
It is an open source, real-time, distributed full-text search and
analytics server built on top of Apache Lucene.
4. Use case:
Wikipedia: This uses Elasticsearch to provide full-text search, along with
features such as search-as-you-type and did-you-mean suggestions.
The Guardian: This uses Elasticsearch to process 40 million documents per
day, provide real-time analytics of site-traffic across the organization, and
help understand audience engagement better.
StumbleUpon: This uses Elasticsearch to power intelligent searches across
its platform and provide great recommendations to millions of customers.
SoundCloud: This uses Elasticsearch to provide real-time search
capabilities for millions of users across geographies.
GitHub: This uses Elasticsearch to index over 8 million code repositories,
and index multiple events across the platform, hence providing real-time
search capabilities across it.
5. Features of Elasticsearch
Elasticsearch is scalable up to petabytes of structured and unstructured data.
◦ Elasticsearch allows us to start small but grows with our business. It is built to
scale horizontally out of the box: as we need more capacity, we can simply add more
nodes and let the cluster reorganize itself to take advantage of the extra hardware.
Elasticsearch uses denormalization to improve search performance.
Elasticsearch is open source.
• Elasticsearch is built on top of Apache Lucene.
Apache Lucene is a high-performance, full-featured information retrieval library
written in Java. Elasticsearch uses Lucene internally to build its state-of-the-art
distributed search and analytics capabilities.
6. Elasticsearch performs near real-time searches.
◦ Elasticsearch provides data manipulation and search capabilities in near real time.
By default, you can expect a one second delay (refresh interval) from the time you
index/update/delete your data until the time that it appears in your search results.
Elasticsearch provides support for multi-tenancy.
◦ We can host multiple indexes on one Elasticsearch installation - node or cluster.
Each index can have multiple "types", which are essentially completely different
indexes.
• Elasticsearch is configurable and extensible.
Many Elasticsearch settings can be changed while Elasticsearch is
running, but some require a restart (and in some cases re-indexing).
Elasticsearch also has several extension points, such as rivers and plugins.
• Elasticsearch is Document Oriented.
Elasticsearch is document oriented, meaning that it stores entire objects
or documents. It not only stores them, but also indexes the contents of each
document in order to make them searchable.
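The steps above can be illustrated with a toy inverted index. The document-oriented store not only keeps each document whole, it also indexes every field's contents so they are searchable. This is a simplified sketch, not Elasticsearch's actual Lucene-backed implementation:

```java
import java.util.*;

// Toy inverted index: store each document whole, AND index its contents
// term-by-term so the document can be found by searching any of its words.
public class InvertedIndex {
    final Map<String, String> docs = new HashMap<>();          // id -> full document
    final Map<String, Set<String>> postings = new HashMap<>(); // term -> doc ids

    void index(String id, String text) {
        docs.put(id, text);                                    // store the whole document
        for (String term : text.toLowerCase().split("\\W+"))   // ...and index its contents
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
    }

    Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.index("1", "My first blog entry");
        idx.index("2", "Just trying this out");
        System.out.println(idx.search("blog"));   // [1]
        System.out.println(idx.search("trying")); // [2]
    }
}
```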
7. Elasticsearch is schema-less.
◦ When indexing, if a mapping is not provided, a default mapping is created by
guessing the structure from the data fields that compose the document.
8. Basic Concepts
Cluster:
◦ A cluster consists of one or more nodes which share the same cluster name. Each
cluster has a single master node which is chosen automatically by the cluster and
which can be replaced if the current master node fails.
Node:
◦ A node is a running instance of Elasticsearch which belongs to a cluster. There are
various types of nodes, such as master node, master-eligible node, data node, and
tribe node.
Index :
◦ An index is like a ‘database’ in a relational database. It has a mapping which defines
multiple types.
◦ An index is a logical namespace which maps to one or more primary shards and can
have zero or more replica shards.
Type / Mapping:
◦ A type is like a ‘table’ in a relational database. Each type has a list of fields that can
be specified for documents of that type. The mapping defines how each field in the
document is analyzed.
9. Document:
◦ A document is a JSON document which is stored in Elasticsearch. It is like a row in a
table in a relational database. Each document is stored in an index and has a type
and an id.
Field:
◦ A document contains a list of fields, or key-value pairs. The value can be a simple
value (e.g. a string, integer, date), or a nested structure like an array or an object. A
field is similar to a column in a table in a relational database.
Shard:
◦ A shard is a single Lucene instance. It is a low-level “worker” unit which is managed
automatically by Elasticsearch. An index is a logical namespace which points
to primary and replica shards. Other than defining the number of primary and replica
shards that an index should have, we never need to refer to shards directly. Instead,
our code should deal only with an index. Elasticsearch distributes shards amongst
all nodes in the cluster, and can move shards automatically from one node to another
in the case of node failure, or the addition of new nodes.
10. Sharding is important for two primary reasons:
◦ It allows you to horizontally split/scale your content volume
◦ It allows you to distribute and parallelize operations across shards (potentially on
multiple nodes) thus increasing performance/throughput
Elasticsearch allows us to make one or more copies of our index’s shards
into what are called replica shards, or replicas for short.
Replication is important for two primary reasons:
◦ It provides high availability in case a shard/node fails. For this reason, it is important
to note that a replica shard is never allocated on the same node as the
original/primary shard that it was copied from.
◦ It allows you to scale out your search volume/throughput since searches can be
executed on all replicas in parallel.
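The shard counts that appear in the cluster-health examples that follow can be derived with simple arithmetic. This is only an illustration of the numbers, not Elasticsearch code; the 3-primary, 1-replica index matches the examples below:

```java
// Arithmetic behind the shard counts reported by cluster health.
public class ShardMath {
    // Total shard copies an index wants allocated.
    static int totalShards(int primaries, int replicas) {
        return primaries * (1 + replicas);
    }

    // Replica copies that cannot be allocated on a single-node cluster,
    // because a replica is never placed on the same node as its primary.
    static int unassignedOnSingleNode(int primaries, int replicas) {
        return primaries * replicas;
    }

    public static void main(String[] args) {
        System.out.println(totalShards(3, 1));            // 6 active shards on >= 2 nodes
        System.out.println(unassignedOnSingleNode(3, 1)); // 3 unassigned -> status yellow
    }
}
```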
12. A single-node cluster with an index
If we were to check the cluster-health now, we would see this:
{
  "cluster_name": "elasticsearch",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 3,
  "active_shards": 3,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 3,
  ...
}
13. A two-node cluster—all primary and replica shards are allocated
{
  "cluster_name": "elasticsearch",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 2,
  "number_of_data_nodes": 2,
  "active_primary_shards": 3,
  "active_shards": 6,
  ...
}
14. A three-node cluster—shards have been reallocated to spread the load
• One shard each from Node 1 and Node 2 has moved to the new Node 3, so we have two
shards per node instead of three.
• This means that the hardware resources (CPU, RAM, I/O) of each node are being shared
among fewer shards, allowing each shard to perform better.
15. Increasing the number_of_replicas to 2
• Just having more replica shards on the same number of nodes doesn’t increase
our performance at all because each shard has access to a smaller fraction of its
node’s resources. We need to add hardware to increase throughput.
• But these extra replicas do mean that we have more redundancy: with the node
configuration above, we can now afford to lose two nodes without losing any data.
16. Coping with failure-Cluster after killing one node
• The node we killed was the master node, so the first thing that happened was that the
remaining nodes elected a new master: Node 2.
• Primary shards 1 and 2 were lost when we killed Node 1, and our index cannot function
properly if it is missing primary shards. If we had checked the cluster health at this point, we
would have seen status red: not all primary shards are active!
• A complete copy of the two lost primary shards exists on other nodes, so the first thing that
the new master node did was to promote the replicas of these shards on Node 2 and Node
3 to be primaries, putting us back into cluster health yellow. This promotion process was
instantaneous, like the flick of a switch.
17. Inside Shard
Creating a document
A Lucene index with new documents in the in-memory buffer, ready to commit.
After a commit, a new segment is added to the commit point and the buffer is cleared.
18. Deletes and Updates
Segments are immutable, so documents cannot be removed from older
segments, nor can older segments be updated to reflect a newer version of a
document. Instead, every commit point includes a .del file that lists which
documents in which segments have been deleted. When a document is
“deleted,” it is actually just marked as deleted in the .del file.
Document updates work in a similar way: when a document is updated, the
old version of the document is marked as deleted, and the new version of the
document is indexed in a new segment.
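The delete-by-marking behavior described above can be sketched with a toy model: segments are immutable, a delete only records the document in a tombstone set (standing in for the .del file), and an update is a delete plus a re-index into a new segment. This is a simplified sketch, not Lucene's actual data structures:

```java
import java.util.*;

// Toy model of immutable segments with tombstone-style deletes.
public class TombstoneModel {
    static class Segment {
        final Map<String, String> docs = new HashMap<>(); // never modified after creation
        final Set<String> tombstones = new HashSet<>();   // stands in for the .del file
    }
    final List<Segment> segments = new ArrayList<>();

    // Index (or update): mark any older copy deleted, write the new
    // version into a fresh segment.
    void index(String id, String doc) {
        delete(id);
        Segment seg = new Segment();
        seg.docs.put(id, doc);
        segments.add(seg);
    }

    // Delete: never touch segment contents, only the tombstone list.
    void delete(String id) {
        for (Segment s : segments)
            if (s.docs.containsKey(id)) s.tombstones.add(id);
    }

    // A search skips documents that are marked deleted.
    String get(String id) {
        for (int i = segments.size() - 1; i >= 0; i--) {
            Segment s = segments.get(i);
            if (s.docs.containsKey(id) && !s.tombstones.contains(id))
                return s.docs.get(id);
        }
        return null;
    }

    public static void main(String[] args) {
        TombstoneModel idx = new TombstoneModel();
        idx.index("1", "v1");
        idx.index("1", "v2");                    // old copy tombstoned, new segment added
        System.out.println(idx.get("1"));        // v2
        System.out.println(idx.segments.size()); // 2: the old segment is untouched
        idx.delete("1");
        System.out.println(idx.get("1"));        // null
    }
}
```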
19. Near Real-Time Search
A Lucene index with new documents in the in-memory buffer
• Committing a new segment to disk requires an fsync to ensure that the segment is physically
written to disk and that data will not be lost if there is a power failure. But an fsync is costly; it
cannot be performed every time a document is indexed without a big performance hit.
• Filesystem cache: a new segment is written to the filesystem cache first, which is cheap,
and only later is it flushed to disk, which is expensive. But once a file is in the cache, it can be
opened and read just like any other file.
The buffer contents have been written to a segment, which is
searchable but not yet committed.
20. Making Changes Persistent
Elasticsearch added a translog, or transaction log, which records every operation in
Elasticsearch as it happens.
New documents are added to the in-memory buffer and appended to the transaction log.
After a refresh, the buffer is cleared but the transaction log is not.
21.
The transaction log keeps accumulating documents.
After a flush, the segments are fully committed and the transaction log is cleared.
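The buffer / refresh / flush / translog cycle described above can be modeled with a small sketch. This is a deliberately simplified toy, not Elasticsearch's actual persistence machinery:

```java
import java.util.*;

// Toy model of the indexing buffer, refresh, flush, and transaction log.
public class TranslogModel {
    final List<String> buffer = new ArrayList<>();      // in-memory indexing buffer
    final List<String> searchable = new ArrayList<>();  // refreshed (searchable) segments
    final List<String> committed = new ArrayList<>();   // fully fsync'ed to disk
    final List<String> translog = new ArrayList<>();    // every operation, as it happens

    void index(String doc) {            // doc goes to the buffer AND the translog
        buffer.add(doc);
        translog.add(doc);
    }
    void refresh() {                    // cheap: buffer becomes searchable, translog kept
        searchable.addAll(buffer);
        buffer.clear();
    }
    void flush() {                      // expensive: full commit, translog cleared
        refresh();
        committed.addAll(searchable);
        searchable.clear();
        translog.clear();
    }
    List<String> recover() {            // after a crash: committed data + translog replay
        List<String> docs = new ArrayList<>(committed);
        docs.addAll(translog);
        return docs;
    }

    public static void main(String[] args) {
        TranslogModel m = new TranslogModel();
        m.index("doc1");
        m.refresh();                     // doc1 searchable, translog still holds it
        m.index("doc2");                 // doc2 only in buffer + translog
        System.out.println(m.recover()); // [doc1, doc2]: nothing lost on crash
        m.flush();
        System.out.println(m.translog.isEmpty()); // true
    }
}
```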
22. Segment Merging
Two committed segments and one uncommitted segment in the
process of being merged into a bigger segment.
Once merging has finished, the old segments are deleted.
23. Distributed Document Store
Routing a Document to a Shard:
◦ shard = hash(routing) % number_of_primary_shards
Creating, Indexing, and Deleting a Document
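The routing formula above can be sketched directly. Elasticsearch actually uses the Murmur3 hash of the routing value (the document ID by default); here `String.hashCode()` is substituted purely for illustration:

```java
// Sketch of shard = hash(routing) % number_of_primary_shards.
// Uses String.hashCode() in place of Elasticsearch's Murmur3 hash.
public class Routing {
    static int shardFor(String routing, int numberOfPrimaryShards) {
        // floorMod keeps the result in [0, numberOfPrimaryShards),
        // even when hashCode() is negative.
        return Math.floorMod(routing.hashCode(), numberOfPrimaryShards);
    }

    public static void main(String[] args) {
        // The same ID always routes to the same primary shard, which is why
        // number_of_primary_shards cannot change after index creation.
        System.out.println(shardFor("123", 3));
        System.out.println(shardFor("123", 3) == shardFor("123", 3)); // true
    }
}
```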
26. Using curl: Indexing a document
PUT http://localhost:9200/<index>/<type>/[<id>]
◦ _index -> Where the document lives
◦ _type ->The class of object that the document represents
◦ _id -> The unique identifier for the document
curl -XPUT 'localhost:9200/website/blog/123' -d'
{
"title": "My first blog entry",
"text": "Just trying this out...",
"date": "2014/01/01"
}'
Elasticsearch responds as follows:
◦ { "_index": "website", "_type": "blog", "_id": "123", "_version": 1, "created": true }
27. Retrieving a document
GET /website/blog/123?pretty
{
"_index" : "website",
"_type" : "blog",
"_id" : "123",
"_version" : 1,
"found" : true,
"_source" :
{
"title": "My first blog entry",
"text": "Just trying this out...",
"date": "2014/01/01"
}
}
28. Updating a document
PUT /website/blog/123
{
"title": "My first blog entry",
"text": "I am starting to get the hang of this...",
"date": "2014/01/02"
}
In the response, we can see that Elasticsearch has incremented the _version number:
{ "_index" : "website", "_type" : "blog", "_id" : "123", "_version" : 2, "created": false }
30. JAVA API
Two ways to connect to Elasticsearch:
◦ Node Client
Instantiating a node-based client is the simplest way to get a Client that
can execute operations against Elasticsearch.
The node becomes part of the cluster.
◦ Transport Client
The TransportClient connects remotely to an Elasticsearch cluster
using the transport module. It does not join the cluster; instead, it is
given one or more initial transport addresses and communicates with
them in round-robin fashion on each action.
31. Breaking changes in Elasticsearch 5.0
Query DSL
◦ search_type=count is removed; use a size of 0 instead
◦ search_type=scan is removed; use a regular scroll request instead
◦ In 5.0, Elasticsearch rejects requests that would query more than
1000 shard copies (primaries or replicas).
◦ The fields parameter has been replaced by stored_fields.
The stored_fields parameter will only return stored fields; it will
no longer extract values from the _source.
and many more………
32. Mapping Changes
◦ string fields are replaced by text/keyword fields
◦ Numeric fields are now indexed with a completely different data
structure, called a BKD tree, which is expected to require less disk
space and be faster for range queries than the previous way
numerics were indexed.
Setting the cluster name
◦ In ES 2.4
Settings settings = Settings.settingsBuilder()
    .put("cluster.name", "myClusterName").build();
TransportClient client = TransportClient.builder().settings(settings).build();
◦ In ES 5.0
Settings settings = Settings.builder()
    .put("cluster.name", "myClusterName").build();
TransportClient client = new PreBuiltTransportClient(settings);
33. API changes
API 2.4
◦ TransportClient client = TransportClient.builder().build()
      .addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("host1"), 9300))
      .addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("host2"), 9300));
API 5.0
◦ TransportClient client = new PreBuiltTransportClient(Settings.EMPTY)
      .addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("host1"), 9300))
      .addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("host2"), 9300));
34. Some drawbacks of Elasticsearch 5.0
Elasticsearch requires at least Java 8.
Spring Data Elasticsearch does not yet support ES 5.