Elasticsearch
federico.panini@fazland.com - CTO
Federico Panini
CTO @ fazland.com
email : federico.panini@fazland.com
LinkedIn : https://uk.linkedin.com/in/federicopanini
slides : http://www.slideshare.net/FedericoPanini
What is Elasticsearch
full-text search engine
“A search engine is an automated system that, upon
request, analyses a set of data and returns an index of its
contents, ranked by relevance to a search key using
mathematical and statistical algorithms.”
What’s Elasticsearch ?
full-text search engine
“It’s distributed, scalable, and highly available
real-time search and analytics software.”
What’s Elasticsearch ? - features
Real-time data
Real-time data analysis
Distributed system
High availability
Full-text search
Document-oriented DB
Schemaless DB
RESTful API
Per-operation persistence
Open source
Based on Apache Lucene
Optimistic version control
Apache Lucene #1
It’s the heart of Elasticsearch
Lucene is the search engine of Elasticsearch
Apache Lucene #1
It’s written in Java
It’s an Apache Software Foundation project, so it’s open source!
What Elasticsearch adds on top of Lucene
full-text searches
horizontal scaling
high availability
Easy to use
near real time
Architecture
requirements - CPU
Elasticsearch doesn’t need a lot of CPU.
The advice is to use the latest CPU model available.
In general, it is good practice to use machines with 2 to 8
cores.
requirements - Disk
Disk I/O matters a great deal for every cluster.
Use SSDs.
requirements - HD - bonus slide …
One very important thing to know is where your data is stored and,
above all, how it is written. The word to remember is scheduler.
On *nix systems the I/O scheduler decides when data is actually
written to disk and with which priority. Most Unix-like setups default
to cfq, a scheduler optimised for rotating disks. The advice is to
use SSDs and to configure the OS to use “noop” or “deadline”,
which are schedulers better suited to SSDs.
With the right scheduler you can reach write throughput
improvements of up to 500x !!!
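A minimal sketch of how the active scheduler could be checked. On Linux the scheduler of a block device is typically exposed in /sys/block/<device>/queue/scheduler, with the active one shown in square brackets. The sample string and the helper names below are illustrative assumptions, and scheduler names can differ between kernel versions.

```python
import re

def active_scheduler(sysfs_text: str) -> str:
    """Return the scheduler shown in square brackets, e.g. 'cfq'."""
    match = re.search(r"\[([^\]]+)\]", sysfs_text)
    if match is None:
        raise ValueError(f"no active scheduler found in {sysfs_text!r}")
    return match.group(1)

def is_ssd_friendly(sysfs_text: str) -> bool:
    """True when the active scheduler is one of those the slide recommends."""
    return active_scheduler(sysfs_text) in {"noop", "deadline"}

# Hypothetical content of /sys/block/sda/queue/scheduler
sample = "noop deadline [cfq]"
print(active_scheduler(sample))   # cfq
print(is_ssd_friendly(sample))    # False
```

On a real machine the string would be read from the sysfs file instead of being hard-coded.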
Operating Systems
Elasticsearch is written in
Java, so it’s a multiplatform
solution. Use the latest JDK
available.
requirements - RAM
Elasticsearch is hungry for RAM!!!
https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html
memory !?!?
Machines with 64GB of RAM are fine; more is not needed.
Give the Java heap no more than 32GB of RAM.
Use more than one machine for Elasticsearch in order to
set up the cluster correctly.
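The sizing rule can be sketched as a small calculation. The 32GB cap comes from the slide; giving the heap half of the machine's RAM is an assumption taken from the heap-sizing guide linked above, and the function name is hypothetical.

```python
def recommended_heap_gb(ram_gb: float, cap_gb: float = 32.0) -> float:
    """Half of the RAM for the Java heap (assumption), never above the cap."""
    return min(ram_gb / 2, cap_gb)

for ram in (16, 64, 128):
    print(ram, "GB RAM ->", recommended_heap_gb(ram), "GB heap")
```

The memory left outside the heap is not wasted: the operating system uses it for the file system cache that Lucene relies on.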
Installation
curl -L -O http://download.elasticsearch.org/PATH/TO/VERSION.zip
unzip elasticsearch-$VERSION.zip
cd elasticsearch-$VERSION
Packages are available for many distributions (Debian, RPM),
along with Puppet and Chef modules.
Java based
Elasticsearch has been developed in Java
Robust
Scalable
Multiplatform
Talking to Elasticsearch
clients Java #1
There are 2 clients available in Java:
Node client : the client joins the cluster
as a non-data node, which means that the
client knows exactly where the data lives
and on which node of the cluster.
clients Java #2
Transport client : a lightweight client
used to communicate
with the cluster remotely.
clients Java #2
Both Java clients talk to the cluster on
port 9300, which is the same port used by
the cluster nodes to talk to each other.
client API RESTful
All programming languages other than Java can
talk to the Elasticsearch cluster through
its REST API, available on port 9200.
There are many official clients
available in different programming
languages:
Groovy, JavaScript, .NET, PHP,
Perl, Python, and Ruby
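As a rough illustration of talking to the REST API without any client library, the sketch below only builds an HTTP request with Python's standard library and does not send it. The host, index, type and document values are placeholders, the index/type/id URL layout mirrors the curl examples in this deck, and the helper name is hypothetical.

```python
import json
import urllib.request

def build_index_request(host, index, doc_type, doc_id, document):
    """Build (but do not send) a PUT request that would index one JSON document."""
    url = f"http://{host}:9200/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )

req = build_index_request("localhost", "fazlab", "categories", 1, {"nome": "Federico"})
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# With a cluster running, urllib.request.urlopen(req) would send the request.
```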
Elastic
Document oriented
NoSql
Elasticsearch is a document-oriented
database. It is also a schema-less
database: documents can be indexed
without defining a schema first.
Documents inserted into
Elasticsearch are
indexed immediately.
Document oriented
JSON
Elasticsearch uses JSON as
the interchange format between
the server and the API clients.
glossary
cluster
nodes
indexes
shards
replica
segments
in-memory buffers
translog
cluster
A cluster is a set of one or more nodes
that share the same cluster.name property. The
cluster balances the load across its nodes.
Nodes can be added to or removed from the cluster, and the
cluster reorganises itself accordingly.
cluster
Inside a cluster one node is elected as Master. The
Master node is responsible for managing operations such as
creating or removing indexes and adding or removing nodes.
Any node can be elected as Master.
nodes
A node is the smallest building block of Elasticsearch: a single
running instance that keeps the cluster working.
RDBMS        Elasticsearch
Database     Index
Table        Type
Row          Document
Column       Field
shards
To start indexing data in Elasticsearch we
need to create an index. An index is only a
logical definition: it points to
one or more elements called SHARDS.
shards
The shard is the low-level element of Elasticsearch and
contains a subset of all the data inside an index.
A shard is in fact a single instance of Apache Lucene.
Replica shards
Replica shards are copies of primary shards that protect our
data from hardware failures. They serve read requests
exactly like primary shards.
shards immutability
The number of primary shards for an index is defined at index
creation time and is IMMUTABLE.
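A small sketch of why the shard count cannot change afterwards. The Definitive Guide describes document routing as shard = hash(routing) % number_of_primary_shards. Here zlib.crc32 is only a stand-in for the hash Elasticsearch really uses, and the document ids are made up.

```python
import zlib

def pick_shard(routing: str, number_of_primary_shards: int) -> int:
    """shard = hash(routing) % number_of_primary_shards (crc32 as stand-in hash)."""
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards

doc_ids = ["1", "2", "3", "4", "5", "6"]
with_3_shards = {doc_id: pick_shard(doc_id, 3) for doc_id in doc_ids}
with_4_shards = {doc_id: pick_shard(doc_id, 4) for doc_id in doc_ids}

# Documents whose shard would change if the shard count changed
would_move = [d for d in doc_ids if with_3_shards[d] != with_4_shards[d]]
print(with_3_shards)
print(with_4_shards)
print("documents that would no longer be found on their shard:", would_move)
```

Because the shard is derived from the shard count, changing the count would make existing documents unreachable at their computed location.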
shards immutability
curl -XPUT http://localhost:9200/blogs -d '{
  "settings" : {
    "number_of_shards" : 3,
    "number_of_replicas" : 1
  }
}'
shards immutability
curl http://localhost:9200/_cluster/health
{
  "cluster_name": "elasticsearch",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 3,
  "active_shards": 3,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 3
}
shards immutability
Replica shards on a single-node instance are useless: a replica
is never allocated on the same node as its primary, so it stays
unassigned. To make replica shards useful we need at least 2 nodes
to have data redundancy.
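The health colours can be sketched as a tiny function: red when some primary shard is not active, yellow when all primaries are active but some replicas are not, green otherwise. The function and parameter names are hypothetical; the numbers in the example mirror the single-node health output shown earlier.

```python
def cluster_status(active_primaries, total_primaries, active_replicas, total_replicas):
    """Derive the cluster health colour from shard allocation counts."""
    if active_primaries < total_primaries:
        return "red"      # some data is not available at all
    if active_replicas < total_replicas:
        return "yellow"   # all data available, but not fully redundant
    return "green"

# One node, 3 primaries, 3 replicas that cannot be assigned anywhere
print(cluster_status(3, 3, 0, 3))   # yellow
# After a second node joins and the replicas are allocated
print(cluster_status(3, 3, 3, 3))   # green
```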
BONUS : manage data conflicts #1
BONUS : manage conflicts #2 : Pessimistic Concurrency Control
Used in standard RDBMS
This approach assumes that conflicts are likely to
happen, so to avoid them the RDBMS
locks the resource.
A process locks the row before reading it;
this way the RDBMS is sure that only the process
holding the lock can modify the row, and
nobody else.
At the end of its operation (update/delete) the process
releases the LOCK.
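A toy illustration of the pessimistic approach, using an in-process lock in place of a database row lock. The LockedRow class and the worker function are hypothetical names, and a real RDBMS does far more than this.

```python
import threading

class LockedRow:
    """A toy 'row' that can only be read and modified while holding its lock."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def update(self, change):
        # Acquire the lock, read, modify, write, then release the lock.
        with self._lock:
            self._value = change(self._value)
            return self._value

def worker(row, times):
    for _ in range(times):
        row.update(lambda v: v + 1)

row = LockedRow(0)
threads = [threading.Thread(target=worker, args=(row, 1000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

final_value = row.update(lambda v: v)
print(final_value)  # 4000: no increment is lost
```

The price of this safety is that every writer has to wait for the lock, even when no conflict would have occurred.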
BONUS : manage conflicts #3 : Optimistic Concurrency Control
Elasticsearch uses OCC
This approach assumes that conflicts are infrequent. The
database doesn’t lock the resource when it is accessed.
The responsibility is given to the application : if the data is
changed between a read and a write, the update fails.
In that case you need to fetch the fresh data again and
retry the update.
BONUS : manage conflicts#4 : Optimistic Concurrency Control
Elasticsearch is a distributed, concurrent and
asynchronous solution. When a document is created / updated /
deleted, the change must be replicated
across the cluster.
Replication requests are sent to the nodes in parallel, so
they may arrive out of order: a change can reach its destination
(node) when it is already out of date.
BONUS : manage conflicts#5 : Optimistic Concurrency Control
We need a way to detect that the entry
we’re trying to update has already been
updated by another process.
BONUS : manage conflicts#6 : Optimistic Concurrency Control
VERSIONING
BONUS : manage conflicts#7 : Optimistic Concurrency Control
In Elasticsearch every document has a field named:
_version
This system field is incremented every time an operation
(update / delete) is applied to the document. In this way an
update based on _version:3 will never be applied to a
document whose _version is already 4.
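The version check can be sketched with an in-memory stand-in for the document store. VersionedStore, VersionConflict and update_with_retry are hypothetical names used for illustration, not Elasticsearch APIs; the point is only the compare-then-write logic and the retry on conflict.

```python
class VersionConflict(Exception):
    """Raised when a write is based on an outdated version."""

class VersionedStore:
    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def get(self, doc_id):
        return self._docs[doc_id]

    def index(self, doc_id, source, expected_version=None):
        current_version = self._docs.get(doc_id, (0, None))[0]
        # Reject the write if the caller read an older version.
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected version {expected_version}, current is {current_version}"
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, source)
        return new_version

def update_with_retry(store, doc_id, change, attempts=3):
    """Read, modify, write; on conflict fetch the fresh data and try again."""
    for _ in range(attempts):
        version, source = store.get(doc_id)
        try:
            return store.index(doc_id, change(dict(source)), expected_version=version)
        except VersionConflict:
            continue
    raise VersionConflict(f"gave up after {attempts} attempts")

store = VersionedStore()
store.index("1", {"views": 0})                             # version 1
version, doc = store.get("1")
store.index("1", {"views": 1}, expected_version=version)   # version 2
try:
    store.index("1", {"views": 99}, expected_version=version)  # stale read
except VersionConflict as exc:
    print("conflict:", exc)

update_with_retry(store, "1", lambda d: {**d, "views": d["views"] + 1})
print(store.get("1"))  # (3, {'views': 2})
```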
BONUS : manage conflicts #8 : Optimistic Concurrency Control
This approach moves all the responsibility from the
database to the application! So WE are responsible for not
creating conflicts over a document or an index. If we want
to be sure not to lose data we need to implement
writes that use versioning!
BONUS : manage conflicts #9 : Optimistic Concurrency Control
http://www.jillesvangurp.com/2014/12/03/optimistic-locking-for-updates-in-elasticsearch/
https://aphyr.com/posts/317-call-me-maybe-elasticsearch
https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html
Simple searches #1
Create Index
REST API
GET
DELETE
POST
SEARCH
Simple searches - CREATE AN INDEX
curl -XPUT http://fazlab.fazland.com:9200/fazlab -d '{
  "settings" : {
    "number_of_shards" : 3,
    "number_of_replicas" : 1
  }
}'
Simple searches - INDEX A DOCUMENT
curl -XPUT http://fazlab.fazland.com:9200/fazlab/categories/1?pretty -d '
{
  "nome": "Federico"
}'
Simple searches - GET A DOCUMENT
curl http://fazlab.fazland.com:9200/fazlab/categories/1?pretty
Simple searches - DELETE A DOCUMENT
curl -XDELETE http://fazlab.fazland.com:9200/fazlab/categories/2?pretty
Simple searches #1
DEMO SEARCHES!
mapping and analysis
EXACT MATCH vs FULL TEXT
Exact match : where name = 'Federico' and user_id = 2 and date > "2014-09-15"
Binary : does the document contain these values ?
Full text : “Frank has been to South beach” (Frank / FRANK / frank)
How relevant is the document to the terms used in the query ?
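The difference between the two kinds of question can be sketched as a yes/no check versus a score. The scoring below (share of query terms found in the text) is a deliberately naive assumption and is not how Lucene computes relevance; the function names and the sample document are made up.

```python
def exact_match(doc: dict, field: str, value) -> bool:
    """Binary answer: the field either holds this exact value or it does not."""
    return doc.get(field) == value

def relevance(text: str, query: str) -> float:
    """Naive score: fraction of query terms present in the text, ignoring case."""
    terms = text.lower().split()
    query_terms = query.lower().split()
    if not query_terms:
        return 0.0
    matched = sum(1 for term in query_terms if term in terms)
    return matched / len(query_terms)

doc = {"name": "Federico", "user_id": 2, "note": "Frank has been to South beach"}

print(exact_match(doc, "name", "Federico"))        # True
print(exact_match(doc, "name", "federico"))        # False: exact means exact
print(relevance(doc["note"], "FRANK beach"))       # 1.0
print(relevance(doc["note"], "frank mountain"))    # 0.5
```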
mapping and analysis
To support full-text search, Elasticsearch analyses the text
and uses the result to build an inverted index.
Inverted Index Analyzer
Inverted Index
1. The quick brown fox jumped
over the lazy dog
2. Quick brown foxes leap over
lazy dogs in summer
If we want to search for the words
“quick” and “brown” we pick
only the documents that contain both
words.
ANALYZERS
An analyzer has 3 functions:
Character filters
Tokenizer
Token Filters
ANALYZERS - Character Filters
The first step of an analyser passes every string through
character filters, which clean up / reorganise the string
before tokenisation.
During this phase HTML markup can be stripped,
or & converted to “and”.
ANALYZERS - Tokenizer
The second phase of an analyser is tokenisation, which
splits a sentence into individual terms.
ANALYZERS - Token Filters
After the strings have been tokenised into individual
terms, the selected token filters are
applied in sequence.
For example :
- lowercase the whole text
- remove stop words
- add synonyms
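The three stages can be sketched as a small pipeline: character filters, then a tokenizer, then token filters applied in sequence. The stop-word list, the synonym table and all function names are hypothetical, and the regular expressions are much simpler than the real components.

```python
import re

STOP_WORDS = {"the", "a", "an", "to", "by"}   # tiny illustrative list
SYNONYMS = {"leap": ["jump"]}                 # illustrative synonym table

def char_filter(text):
    """Strip HTML tags and turn '&' into 'and' before tokenization."""
    text = re.sub(r"<[^>]+>", "", text)
    return text.replace("&", " and ")

def tokenizer(text):
    """Split the cleaned string into individual terms."""
    return re.findall(r"\w+", text)

def lowercase_filter(tokens):
    return [t.lower() for t in tokens]

def stop_filter(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def synonym_filter(tokens):
    out = []
    for t in tokens:
        out.append(t)
        out.extend(SYNONYMS.get(t, []))
    return out

def analyze(text):
    tokens = tokenizer(char_filter(text))
    # Token filters run one after the other, in the order listed.
    for token_filter in (lowercase_filter, stop_filter, synonym_filter):
        tokens = token_filter(tokens)
    return tokens

print(analyze("<b>Quick</b> foxes & dogs leap"))
# ['quick', 'foxes', 'and', 'dogs', 'leap', 'jump']
```

The order matters: the stop-word and synonym filters here expect lowercase input, so the lowercase filter has to run first.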
Standard Analyzer
“Set the shape to semi-transparent by calling
set_trans(5)”
The standard analyzer is the default analyzer of
Elasticsearch. It splits text into individual words, removes
most punctuation and lowercases the terms.
“set, the, shape, to, semi, transparent, by, calling,
set_trans, 5”
Simple Analyzer
“Set the shape to semi-transparent by calling
set_trans(5)”
The simple analyser splits the text on anything that is
not a letter and lowercases the terms
“set, the, shape, to, semi, transparent, by, calling,
set, trans”
Whitespace Analyzer
“Set the shape to semi-transparent by calling
set_trans(5)”
The whitespace analyser splits the text on
whitespace and does not lowercase it
“Set, the, shape, to, semi-transparent, by, calling,
set_trans(5)”
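The behaviour of the standard, simple and whitespace analyzers on the sample sentence can be approximated with a few lines of Python. These regular expressions are rough stand-ins chosen for this one sentence, not the real tokenization rules, and the function names are hypothetical.

```python
import re

TEXT = "Set the shape to semi-transparent by calling set_trans(5)"

def standard_like(text):
    """Approximation: word characters (letters, digits, underscore), lowercased."""
    return re.findall(r"\w+", text.lower())

def simple_like(text):
    """Approximation: split on anything that is not a letter, lowercased."""
    return re.findall(r"[a-z]+", text.lower())

def whitespace_like(text):
    """Split on whitespace only; case and punctuation are left untouched."""
    return text.split()

print(standard_like(TEXT))
# ['set', 'the', 'shape', 'to', 'semi', 'transparent', 'by', 'calling', 'set_trans', '5']
print(simple_like(TEXT))
# ['set', 'the', 'shape', 'to', 'semi', 'transparent', 'by', 'calling', 'set', 'trans']
print(whitespace_like(TEXT))
# ['Set', 'the', 'shape', 'to', 'semi-transparent', 'by', 'calling', 'set_trans(5)']
```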
Language Analyzer
“Set the shape to semi-transparent by calling
set_trans(5)”
This analyser uses language-specific rules to
remove stop words and to do stemming. With the english analyser:
“set, shape, semi, transpar, call, set_tran, 5”
Language Analyzer
arabic, armenian, basque, brazilian, bulgarian, catalan,
chinese, cjk, czech, danish, dutch, english, finnish,
french, galician, german, greek, hindi, hungarian,
indonesian, irish, italian, latvian, norwegian, persian,
portuguese, romanian, russian, sorani, spanish,
swedish, turkish, thai.
Pre-built Analyzers
Standard Analyzer
Simple Analyzer
Whitespace Analyzer
Stop Analyzer
Keyword Analyzer
Pattern Analyzer
Language Analyzers
Snowball Analyzer
Custom Analyzer
Tokenizer
Standard Tokenizer
Edge NGram Tokenizer
Keyword Tokenizer
Letter Tokenizer
Lowercase Tokenizer
NGram Tokenizer
Whitespace Tokenizer
Pattern Tokenizer
UAX Email URL Tokenizer
Path Hierarchy Tokenizer
Token Filters
Standard Token Filter
ASCII Folding Token Filter
Length Token Filter
Lowercase Token Filter
NGram Token Filter
Edge NGram Token Filter
Porter Stem Token Filter
Shingle Token Filter
Stop Token Filter
…
more than 32 Filters
THE END.
References
• Elasticsearch : The Definitive Guide
• https://en.wikipedia.org/wiki/Full_text_search
• https://www.elastic.co/guide/en/elasticsearch/guide/current/hardware.html
• https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html
• https://mtalavera.wordpress.com/2015/02/16/monitoring-with-collectd-and-kibana/
• Fuzzy search : https://www.found.no/foundation/fuzzy-search/
• Phonetic-plugin : https://github.com/elastic/elasticsearch-analysis-phonetic