Introduction to Elasticsearch
Praveen Manvi July 2016
Agenda
• Overview
– History, Product overview
– ES Vocabulary
– Feature set
• Demo
– Setup/Configuration
– Ecosystem
– APIs for indexing, searching, and monitoring
What is Elasticsearch?
– Document-oriented (JSON) search engine
– Distributed
– Horizontally scalable and highly available
– Multi-tenancy enabled
– API-centric and RESTful
– Built on the Lucene search engine library
Used for
– Full-text search, structured search, analytics, or all three in combination
• Elasticsearch has become the de facto search solution
• A few popular examples:
– GitHub uses Elasticsearch to query 130 billion lines of code.
– Wikipedia uses Elasticsearch to provide full-text search with highlighted search snippets, plus search-as-you-type and did-you-mean suggestions.
– Stack Overflow combines full-text search with geolocation queries and uses more-like-this to find related questions and answers.
History
• Doug Cutting (@cutting)
– Started Lucene in 1999; it has been a top-level Apache project since 2005.
– Now part of Cloudera, which supports the rival solution Solr and commercial offerings.
• Shay Banon (@kimchy)
– Released Elasticsearch in February 2010.
– Worked on it for 6 years (started with Compass).
– Now part of http://elastic.co and its commercial offerings.
Building Blocks
Term (~ analogy with a relational database), then description
• Cluster (~ database cluster)
– Group of nodes
• Node (~ database instance)
– A JVM process, usually one per machine
• Index (~ database schema)
– Hosts mapping types and their definitions; contains many shards
• Mapping Type (~ database table)
– Field descriptions and indexing requirements
• Document (~ database row)
– A JSON document
• Shard
– A Lucene index; the unit of scalability and the heart of the search engine (primary and replica)
Physical Layout
Logical Layout
Lucene Inverted Index
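The idea behind an inverted index can be illustrated with a minimal Python sketch. This is not how Lucene is implemented (Lucene adds analyzers, compressed postings lists, and on-disk segments); the function names `build_inverted_index` and `search`, the whitespace tokenizer, and the sample documents are all assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of doc ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """Return ids of documents that contain every term of the query (AND)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical sample documents, keyed by doc id.
docs = {
    1: "Elasticsearch is built on Lucene",
    2: "Lucene is a search library",
    3: "Solr is also built on Lucene",
}
index = build_inverted_index(docs)
```

Looking up a term is a dictionary access rather than a scan over every document, which is why `search(index, "built lucene")` only has to intersect two short lists of ids.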
Value added over Lucene
• Distributed
– Combines results with a fork-join across multiple Lucene indexes, using the new building blocks
• Transaction log
– Guarantees durability: operations are automatically replayed when a shard is reopened
– Simplifies shard relocation and recovery: when a shard moves from one node to another, changes can be replayed while the committed segments are being transferred
• Flush/Refresh/Monitor APIs
– For managing cluster, node, and index status
• Query DSL
– Provides a rich grammar for expressing searches
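The replay behaviour of a transaction log can be sketched in a few lines of Python. This is a toy write-ahead log, not Elasticsearch's translog format; the `TinyShard` class, the JSON-lines file layout, and the operation names are assumptions made for illustration.

```python
import json
import os
import tempfile


class TinyShard:
    """Toy shard: every write is appended to a log file before it is applied."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.docs = {}
        self._replay()

    def _replay(self):
        # On (re)open, rebuild the in-memory state from the logged operations.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as log:
            for line in log:
                self._apply(json.loads(line))

    def _apply(self, op):
        if op["op"] == "index":
            self.docs[op["id"]] = op["doc"]
        elif op["op"] == "delete":
            self.docs.pop(op["id"], None)

    def _log_and_apply(self, op):
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(op) + "\n")
            log.flush()
            os.fsync(log.fileno())  # persist the operation before applying it
        self._apply(op)

    def index(self, doc_id, doc):
        self._log_and_apply({"op": "index", "id": doc_id, "doc": doc})

    def delete(self, doc_id):
        self._log_and_apply({"op": "delete", "id": doc_id})


log_path = os.path.join(tempfile.mkdtemp(), "translog.jsonl")
shard = TinyShard(log_path)
shard.index("1", {"title": "hello"})
shard.index("2", {"title": "world"})
shard.delete("1")

# Simulates a restart: the new instance recovers its state from the log alone.
reopened = TinyShard(log_path)
```

The same replay step is what makes relocation simpler: a target node can copy the committed data and then replay the operations logged in the meantime.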
mapping/index/search docs
Document Metadata Fields
• _id - The id of the document
• _type - The document type
• _source (enabled by default) - Stores the original document that was indexed
• _all (enabled by default) - Indexes all values of all document fields
• _timestamp (disabled by default) - Timestamp associated with the document
• _ttl (disabled by default) - Optionally defines an expiration time
• _size (disabled by default) - Indexes the size of the uncompressed
Search Controller
Query DSL
Search request in place
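As a rough illustration of what a Query DSL search request body looks like, the sketch below builds one as a Python dictionary. The `bool`, `match`, `term`, and `range` clauses and the `from`/`size` paging parameters are part of the Query DSL (the `filter` clause inside `bool` assumes Elasticsearch 2.0 or later); the helper name `build_search_body` and the field names `title`, `status`, and `year` are hypothetical.

```python
import json


def build_search_body(text, status, min_year, page=0, page_size=10):
    """Build a search body: full-text match plus structured filters, with paging."""
    return {
        "query": {
            "bool": {
                # Scored full-text clause.
                "must": [{"match": {"title": text}}],
                # Non-scored structured clauses.
                "filter": [
                    {"term": {"status": status}},
                    {"range": {"year": {"gte": min_year}}},
                ],
            }
        },
        "from": page * page_size,  # offset of the first hit to return
        "size": page_size,         # number of hits per page
    }


body = build_search_body("elasticsearch", "published", 2015, page=2)
print(json.dumps(body, indent=2))
```

The printed JSON is what would be sent as the body of a `_search` request against an index.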
Search Types
• COUNT
– Returns no hits, only the total count of documents matching the query, and thus executes in a single round trip to the shards
• SCAN
– Iterates over large amounts of data using a cursor to paginate, which makes it memory efficient; helpful for re-indexing and for decorating data outside ES
• SEARCH
– General search
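The cursor-style iteration behind SCAN can be sketched as a loop that keeps following scroll ids until a page comes back empty. The response shape (`_scroll_id`, `hits.hits`) follows the scroll API; the `scan` generator, the two callables it takes, and the in-memory `FakeCluster` standing in for a real cluster are assumptions made so the example runs without a server.

```python
def scan(start_scroll, continue_scroll):
    """Yield every hit by following scroll ids until a page comes back empty."""
    page = start_scroll()
    # With search_type=scan the first response carries only a scroll id and
    # no hits; a plain scroll request already returns the first batch.
    yield from page["hits"]["hits"]
    while True:
        page = continue_scroll(page["_scroll_id"])
        hits = page["hits"]["hits"]
        if not hits:
            break
        yield from hits


class FakeCluster:
    """Hypothetical in-memory stand-in that pages through a list of docs."""

    def __init__(self, docs, batch_size):
        self.docs = docs
        self.batch_size = batch_size

    def start(self):
        return {"_scroll_id": "0", "hits": {"hits": []}}

    def next_page(self, scroll_id):
        offset = int(scroll_id)
        batch = self.docs[offset:offset + self.batch_size]
        return {"_scroll_id": str(offset + len(batch)), "hits": {"hits": batch}}


cluster = FakeCluster([{"_id": str(i)} for i in range(7)], batch_size=3)
ids = [hit["_id"] for hit in scan(cluster.start, cluster.next_page)]
```

Because only one batch is held in memory at a time, the same loop works for re-indexing a large index document by document.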
Aggregation
Aggregations
Nested Aggregations
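A nested aggregation can be sketched as a `terms` bucket aggregation with an `avg` metric nested inside each bucket. The request and response shapes follow the aggregations API; the helper names, the aggregation names `by_group` and `avg_metric`, the field names `category` and `price`, and the hand-written sample response are hypothetical.

```python
def build_nested_agg_body(group_field, metric_field):
    """Group documents by one field and average another field per group."""
    return {
        "size": 0,  # return aggregation results only, no hits
        "aggs": {
            "by_group": {
                "terms": {"field": group_field},
                "aggs": {"avg_metric": {"avg": {"field": metric_field}}},
            }
        },
    }


def summarize(response):
    """Turn the bucket list into {key: (doc_count, average)}."""
    buckets = response["aggregations"]["by_group"]["buckets"]
    return {b["key"]: (b["doc_count"], b["avg_metric"]["value"]) for b in buckets}


body = build_nested_agg_body("category", "price")

# Hand-written sample in the shape of an aggregation response, not real data.
sample_response = {
    "aggregations": {
        "by_group": {
            "buckets": [
                {"key": "books", "doc_count": 3, "avg_metric": {"value": 12.5}},
                {"key": "music", "doc_count": 1, "avg_metric": {"value": 9.0}},
            ]
        }
    }
}
summary = summarize(sample_response)
```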
A Few Interesting Features
• Bulk Indexing
– Send multiple docs to ES in a single request
• Multi Get API
– Get multiple documents in a single API call
• Percolator
– The idea is to have ES notify your application when new content matches your filters, instead of constantly polling the search engine for new updates. Great for building alerts
• Pagination
• Highlighting
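Bulk indexing sends newline-delimited JSON to the `_bulk` endpoint: one action line followed by one source line per document, with a trailing newline at the end. The sketch below only builds that body. The helper name `build_bulk_body` and the index, type, and documents are hypothetical, and the `_type` key assumes an Elasticsearch version that still has mapping types, as in this presentation.

```python
import json


def build_bulk_body(index, doc_type, docs):
    """Build a _bulk request body from (doc_id, source) pairs."""
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))  # action and metadata line
        lines.append(json.dumps(source))  # document source line
    # The bulk format requires the body to end with a newline.
    return "\n".join(lines) + "\n"


bulk_body = build_bulk_body(
    "library",
    "book",
    [("1", {"title": "Elasticsearch in Action"}), ("2", {"title": "Lucene in Action"})],
)
```

Sending many documents in one request avoids paying the per-request overhead for each document.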
Ecosystem
(debug tools/development)
Client SDKs
Plugins
• head
• Elastic HQ
• Marvel
• BigDesk
[ES_HOME/bin] ./plugin install mobz/elasticsearch-head
Configuration
• Enabling store compression (LZF/Snappy) uses 55% less storage
• Disabling the '_all' field saves 13% in storage
• Removing _source saves ~26% of storage on disk
• Set ES_HEAP_SIZE to ½ of the machine's memory, leaving the rest for the OS file cache
• Setting bootstrap.mlockall to true avoids swapping
References
• https://www.youtube.com/watch?v=5444z-L2V2A&spfreload=1 - “Lucene now and then”, a talk by Lucene creator Doug Cutting at Twitter, covering the history of Lucene and how it evolved.
• https://www.youtube.com/watch?v=lpZ6ZajygDY - A talk by Elasticsearch creator Shay Banon (it's 3 years old, but has very good content on data design patterns).
• https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html - Official Elasticsearch documentation.
• https://www.manning.com/books/elasticsearch-in-action - Source of the diagrams used in this presentation.
