Introduction to Elasticsearch
with basics of Lucene
May 2014 Meetup
Rahul Jain
@rahuldausa
@http://www.meetup.com/Hyderabad-Apache-Solr-Lucene-Group/
Who am I
• Software Engineer
• 7 years of software development experience
• Built a platform to search logs in near real time at a volume of 1 TB/day#
• Worked on Solr-based SEO/SEM search software handling 40 billion records/month (topic of next talk?)
• Areas of expertise/interest
– High-traffic web applications
– Java/J2EE
– Big data, NoSQL
– Information retrieval, machine learning
# http://www.slideshare.net/lucenerevolution/building-a-near-real-time-search-engine-analytics-for-logs-using-solr
Agenda
• IR Overview
• Basic Concepts
• Lucene
• Elasticsearch
• Logstash & Kibana - Short Introduction
• Q&A
Information Retrieval (IR)
“Information retrieval is the activity of obtaining information resources (in the form of documents) relevant to an information need from a collection of information resources. Searches can be based on metadata or on full-text (or other content-based) indexing.”
- Wikipedia
Basic Concepts
• Term t : a noun or compound word used in a specific context
• tf (t in d) : term frequency in a document
– measure of how often a term appears in the document
– the number of times term t appears in the currently scored document d
• idf (t) : inverse document frequency
– measure of whether the term is common or rare across all documents, i.e. how often the term appears across the index
– obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient
• boost (index) : boost of the field at index-time
• boost (query) : boost of the field at query-time
Basic Concepts
TF - IDF
TF-IDF = Term Frequency × Inverse Document Frequency
Credit: http://whatisgraphsearch.com/
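The tf and idf definitions from the Basic Concepts slides can be tried on a toy corpus. The Python sketch below takes tf as a raw count and idf as log(N / df), as worded on the slide. The corpus and the function names are assumptions made for illustration; Lucene's actual scoring adds normalization and boosts that are not shown here.

```python
import math

def tf(term, doc_tokens):
    """Raw count of how often the term appears in one document."""
    return doc_tokens.count(term)

def idf(term, corpus):
    """log(total documents / documents containing the term)."""
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        return 0.0  # assumption of this sketch: unseen terms score zero
    return math.log(len(corpus) / df)

def tf_idf(term, doc_tokens, corpus):
    """Term frequency multiplied by inverse document frequency."""
    return tf(term, doc_tokens) * idf(term, corpus)

# Toy corpus (hypothetical), already split into tokens.
corpus = [
    "the quick brown fox".split(),
    "the lazy dog".split(),
    "the quick dog jumps".split(),
]

for term in ("the", "quick", "fox"):
    print(term, round(tf_idf(term, corpus[0], corpus), 4))
```

A term such as "the" that occurs in every document gets idf = log(1) = 0, so it contributes nothing to the score, while a rare term such as "fox" is weighted highest.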
Apache Lucene
• Fast, high performance, scalable search/IR library
• Open source
• Initially developed by Doug Cutting (also the author of Hadoop)
• Indexing and searching
• Inverted index of documents
• Provides advanced search options such as synonyms, stopwords, similarity-based matching, and proximity search.
• http://lucene.apache.org/
Lucene Internals - Inverted Index
Credit: https://developer.apple.com/library/mac/documentation/userexperience/conceptual/SearchKitConcepts/searchKit_basics/searchKit_basics.html
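The idea of an inverted index can be sketched in a few lines of Python: each term maps to the list of documents that contain it, so a query looks up terms instead of scanning documents. This is a minimal illustration only; the sample documents and function names are assumptions, and Lucene's on-disk structures (segments, postings with positions, and so on) are far more elaborate.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased term to the sorted ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, *terms):
    """Return ids of documents that contain every given term (AND query)."""
    postings = [set(index.get(term.lower(), [])) for term in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

# Hypothetical sample documents keyed by document id.
docs = {1: "The quick brown fox", 2: "The lazy dog", 3: "Quick dog"}
index = build_inverted_index(docs)

print(index["quick"])
print(search(index, "quick", "dog"))
```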
Lucene Internals (Contd.)
• Defines a document model
• An index contains documents.
• Each document consists of fields.
• Each field has attributes.
– What is the data type (FieldType)
– How to handle the content (Analyzers, Filters)
– Is it a stored field (stored="true") or an indexed field (indexed="true")
Indexing Pipeline
• Analyzer: creates tokens using a Tokenizer and/or by applying Filters (Token Filters)
• Each field can define an Analyzer for index time, query time, or both.
Credit: http://www.slideshare.net/otisg/lucene-introduction
Analysis Process - Tokenizer
WhitespaceAnalyzer
• Simplest built-in analyzer; splits on whitespace only
Input: The quick brown fox jumps over the lazy dog.
Tokens: [The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog.]
Analysis Process - Tokenizer
SimpleAnalyzer
• Lowercases and splits at non-letter boundaries
Input: The quick brown fox jumps over the lazy dog.
Tokens: [the] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog]
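The behaviour of the two analyzers can be approximated in Python to make the difference concrete. These functions are stand-ins written for this example, not Lucene code: the first splits on whitespace only, the second lowercases the text and keeps runs of letters.

```python
import re

def whitespace_analyze(text):
    """Split on whitespace only; case and punctuation are preserved."""
    return text.split()

def simple_analyze(text):
    """Lowercase the text, then keep maximal runs of letters."""
    # [^\W\d_]+ matches one or more Unicode letters (no digits, no underscore).
    return re.findall(r"[^\W\d_]+", text.lower())

sentence = "The quick brown fox jumps over the lazy dog."

print(whitespace_analyze(sentence))
print(simple_analyze(sentence))
```

The whitespace version keeps "The" and "dog." as they are, so a query for "dog" would not match the token "dog." unless the query is analyzed the same way.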
Elasticsearch
Introduction
• Enterprise search platform built on Apache Lucene
• Open source
• Highly reliable, scalable, fault tolerant
• Supports distributed indexing, replication, and load-balanced querying
• http://www.elasticsearch.org/
Elasticsearch - Features
• Distributed RESTful search server
• Document oriented
• Domain driven
• Schema-less
• RESTful
• Easy to scale horizontally
Elasticsearch - Features
• Highlighting
• Spelling Suggestions
• Facets (Group by)
• Query DSL
– based on JSON to define queries
• Automatic shard replication, routing
• Zen discovery
– Unicast
– Multicast
• Master Election
– Re-election if Master Node fails
APIs
• HTTP RESTful API
• Java API
• Clients
– Perl, Python, PHP, Ruby, .NET, etc.
• All APIs perform automatic node operation rerouting.
How to start
It’s this Easy.
Operations
INDEX CREATION
curl -XPUT "http://localhost:9200/movies/movie/1" -d '
{
  "title": "The Godfather",
  "director": "Francis Ford Coppola",
  "year": 1972
}'
http://localhost:9200/<index>/<type>/[<id>]
Credit: http://joelabrahamsson.com/elasticsearch-101/
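The same index request can be built from Python using only the standard library. The sketch below constructs the PUT request following the <index>/<type>/<id> URL pattern from the slide, without sending it. The helper name and the default host are assumptions for illustration; sending it requires a node listening on localhost:9200.

```python
import json
import urllib.request

def build_index_request(index, doc_type, doc_id, document,
                        host="http://localhost:9200"):
    """Build (but do not send) a PUT request for <host>/<index>/<type>/<id>."""
    url = f"{host}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )

request = build_index_request(
    "movies", "movie", 1,
    {"title": "The Godfather", "director": "Francis Ford Coppola", "year": 1972},
)

print(request.get_method(), request.full_url)
# With a node running locally, the request could be sent with:
#   urllib.request.urlopen(request)
```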
INDEX CREATION RESPONSE
Credit: http://joelabrahamsson.com/elasticsearch-101/
UPDATE
curl -XPUT "http://localhost:9200/movies/movie/1" -d '
{
  "title": "The Godfather",
  "director": "Francis Ford Coppola",
  "year": 1972,
  "genres": ["Crime", "Drama"]
}'
• The request adds a new field, genres
• The response reports an updated version
Credit: http://joelabrahamsson.com/elasticsearch-101/
GET
curl -XGET "http://localhost:9200/movies/movie/1" -d''
Credit: http://joelabrahamsson.com/elasticsearch-101/
DELETE
curl -XDELETE "http://localhost:9200/movies/movie/1" -d''
Credit: http://joelabrahamsson.com/elasticsearch-101/
SEARCH
• Search across all indexes and all types
– http://localhost:9200/_search
• Search across all types in the movies index
– http://localhost:9200/movies/_search
• Search explicitly for documents of type movie within the movies index
– http://localhost:9200/movies/movie/_search
curl -XPOST "http://localhost:9200/_search" -d'
{
  "query": {
    "query_string": {
      "query": "kill"
    }
  }
}'
Credit: http://joelabrahamsson.com/elasticsearch-101/
SEARCH RESPONSE
Credit: http://joelabrahamsson.com/elasticsearch-101/
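The three search scopes and the query_string body can be generated programmatically. The Python sketch below only builds the URL and the JSON body; the helper names and the default host are assumptions for this example, and nothing is sent to a server.

```python
import json

def search_url(index=None, doc_type=None, host="http://localhost:9200"):
    """Build the _search URL for all indexes, one index, or one type in an index."""
    if doc_type and not index:
        raise ValueError("a type can only be searched within an index")
    parts = [part for part in (index, doc_type) if part]
    return "/".join([host, *parts, "_search"])

def query_string_body(text):
    """Build the JSON body of a query_string search."""
    return json.dumps({"query": {"query_string": {"query": text}}})

print(search_url())
print(search_url("movies"))
print(search_url("movies", "movie"))
print(query_string_body("kill"))
```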
Updating existing Mapping
curl -XPUT "http://localhost:9200/movies/movie/_mapping" -d'
{
  "movie": {
    "properties": {
      "director": {
        "type": "multi_field",
        "fields": {
          "director": {"type": "string"},
          "original": {"type": "string", "index": "not_analyzed"}
        }
      }
    }
  }
}'
Credit: http://joelabrahamsson.com/elasticsearch-101/
Cluster Architecture
Source: http://www.slideshare.net/DmitriBabaev1/elastic-search-moscow-bigdata-cassandra-sept-2013-meetup
Index Request
Source: http://www.slideshare.net/DmitriBabaev1/elastic-search-moscow-bigdata-cassandra-sept-2013-meetup
Search Request
Source: http://www.slideshare.net/DmitriBabaev1/elastic-search-moscow-bigdata-cassandra-sept-2013-meetup
Who is using it
• GitHub
• StumbleUpon
• SoundCloud
• Datadog
• Stack Overflow
• Many more…
– http://www.elasticsearch.com/case-studies/
Logstash
• Open source, Apache licensed
• Written in JRuby
• Part of the Elasticsearch family
• http://logstash.net/
• Current version: 1.4.0
• This talk uses 1.3.3
Logstash
• Multiple Input/ Multiple Output
• Centralize logs
• Collect
• Parse
• Forward/Store
Architecture
Source: http://www.infoq.com/articles/review-the-logstash-book
Logstash – life of an event
• Input  Filters  Output
• Filters are processed in order of config file
• Outputs are processed in order of config file
• Input: Input stream
– File input (tail)
– Log4j
– Redis
– Syslog
– and many more…
• http://logstash.net/docs/1.3.3/
Logstash – life of an event
• Codecs: decoding log messages
– JSON
– Multiline
– Netflow
– and many more…
• Filters: processing messages
– Date: date format
– Grok: regular expression based extraction
– Mutate: change data type
– and many more…
• Output: storing the structured message
– Elasticsearch
– MongoDB
– Email
– Nagios
– and many more…
• http://logstash.net/docs/1.3.3/
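The Input → Filters → Output flow, with filters and outputs applied in configuration order, can be mimicked in plain Python. The sketch below is a conceptual model only and is not how Logstash is implemented: the log format, the field names, and the three filter functions (loosely modelled on grok, date, and mutate) are assumptions made for this example.

```python
import re
from datetime import datetime

# Hypothetical log format: "<ISO timestamp> <LEVEL> <bytes> <text>"
LINE = re.compile(r"(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<bytes>\d+) (?P<msg>.*)")

def grok_like(event):
    """Extract named fields from the raw message with a regular expression."""
    match = LINE.match(event["message"])
    if match:
        event.update(match.groupdict())
    return event

def date_filter(event):
    """Parse the extracted timestamp string into a datetime."""
    if "timestamp" in event:
        event["@timestamp"] = datetime.strptime(event["timestamp"], "%Y-%m-%dT%H:%M:%S")
    return event

def mutate_to_int(event):
    """Change the data type of the bytes field from string to integer."""
    if "bytes" in event:
        event["bytes"] = int(event["bytes"])
    return event

def run_pipeline(lines, filters, outputs):
    """Wrap each input line in an event, apply filters in order, then outputs in order."""
    for line in lines:
        event = {"message": line}
        for apply_filter in filters:
            event = apply_filter(event)
        for emit in outputs:
            emit(event)

stored = []  # stands in for an output such as Elasticsearch
run_pipeline(
    ["2014-05-01T10:15:00 INFO 512 request served"],
    filters=[grok_like, date_filter, mutate_to_int],
    outputs=[stored.append],
)
print(stored[0])
```

Order matters here just as in a config file: date_filter and mutate_to_int only find their fields because grok_like ran first.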
Quick Start
1.3.3 and earlier:
java -jar logstash-1.3.3-flatjar.jar agent -f agent.conf -- web
1.4 version:
bin/logstash agent -f agent.conf
bin/logstash –web
basic-agent.conf:
input {
  tcp {
    type => "apache"
    port => 3333
  }
}
output {
  stdout {
    debug => true
  }
  elasticsearch {
    embedded => true
  }
}
Kibana
Source: http://www.slideshare.net/AmazeeAG/2014-0422-loggingwithlogstashbastianwidmercampusbern
Analytics
• Analytics source: Kibana.org, based on Elasticsearch and Logstash
• Image source: http://semicomplete.com/presentations/logstash-monitorama-2013/#/8
Thanks!
@rahuldausa on twitter and slideshare
http://www.linkedin.com/in/rahuldausa
Find this interesting?
Join us @ http://www.meetup.com/Hyderabad-Apache-Solr-Lucene-Group/