Introduction to Elasticsearch with basics of Lucene

May 2014 Meetup
Rahul Jain
@rahuldausa
http://www.meetup.com/Hyderabad-Apache-Solr-Lucene-Group/

Who am I
• Software Engineer
• 7 years of software development experience
• Built a platform to search logs in near real time with a volume of 1 TB/day#
• Worked on Solr-based search software for SEO/SEM with 40 billion records/month (Topic of next talk?)
• Areas of expertise/interest
– High-traffic web applications
– Java/J2EE
– Big data, NoSQL
– Information retrieval, machine learning
# http://www.slideshare.net/lucenerevolution/building-a-near-real-time-search-engine-analytics-for-logs-using-solr

Agenda
• IR Overview
• Basic Concepts
• Lucene
• Elasticsearch
• Logstash & Kibana - Short Introduction
• Q&A

Information Retrieval (IR)
“Information retrieval is the activity of obtaining information resources (in the form of documents) relevant to an information need from a collection of information resources. Searches can be based on metadata or on full-text (or other content-based) indexing.”
- Wikipedia

Basic Concepts
• Term t: a noun or compound word used in a specific context
• tf (t in d): term frequency in a document
– a measure of how often a term appears in the document
– the number of times term t appears in the currently scored document d
• idf (t): inverse document frequency
– a measure of whether the term is common or rare across all documents, i.e. how often the term appears across the index
– obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient
• boost (index): boost of the field at index time
• boost (query): boost of the field at query time

Basic Concepts
TF - IDF
TF - IDF = Term Frequency X Inverse Document Frequency
Credit: http://whatisgraphsearch.com/
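
The tf and idf definitions above can be sketched in a few lines of Python. This is the plain textbook form only (raw term count times the logarithm of total documents over documents containing the term); Lucene's real scoring formula has further factors that are not reproduced here. The toy corpus and the function names are made up for illustration.

```python
import math

# Toy corpus (hypothetical); each document is already split into terms.
DOCS = [
    ["the", "quick", "brown", "fox"],
    ["the", "lazy", "dog"],
    ["the", "quick", "dog", "jumps"],
]

def tf(term, doc):
    """Number of times `term` appears in the document being scored."""
    return doc.count(term)

def idf(term, docs):
    """log(total documents / documents containing the term)."""
    containing = sum(1 for d in docs if term in d)
    if containing == 0:
        return 0.0  # term is absent from the index; treated here as no signal
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    """Term frequency multiplied by inverse document frequency."""
    return tf(term, doc) * idf(term, docs)

if __name__ == "__main__":
    for term in ("the", "quick", "fox"):
        print(term, round(tf_idf(term, DOCS[0], DOCS), 3))
```

A term that occurs in every document ("the") gets a score of 0, while the rarest term ("fox") gets the highest score of the three.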

Apache Lucene
• Fast, high-performance, scalable search/IR library
• Open source
• Initially developed by Doug Cutting (also the author of Hadoop)
• Indexing and searching
• Inverted index of documents
• Provides advanced search options such as synonyms, stopwords, and similarity- and proximity-based matching
• http://lucene.apache.org/

Lucene Internals - Inverted Index
Credit: https://developer.apple.com/library/mac/documentation/userexperience/conceptual/SearchKitConcepts/searchKit_basics/searchKit_basics.html
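
As a rough illustration of the idea behind an inverted index, the sketch below maps each term to the documents that contain it and answers a query by intersecting those lists. It is a toy model in Python, not Lucene's on-disk format; the documents and function names are invented for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, query):
    """Return ids of documents that contain every query term (AND semantics)."""
    postings = [set(index.get(term, ())) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

# Hypothetical documents keyed by id.
DOCS = {
    1: "The quick brown fox",
    2: "The lazy dog",
    3: "The quick dog jumps",
}
INDEX = build_inverted_index(DOCS)

if __name__ == "__main__":
    print(INDEX["quick"])
    print(search(INDEX, "quick dog"))
```

Lookups go from term to documents, so a query only touches the entries for its own terms instead of scanning every document.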

Lucene Internals (Contd.)
• Defines a document model
• An index contains documents.
• Each document consists of fields.
• Each field has attributes.
– What is the data type (FieldType)
– How to handle the content (Analyzers, Filters)
– Is it a stored field (stored="true") or an indexed field (indexed="true")

Indexing Pipeline
• Analyzer: creates tokens using a Tokenizer and/or by applying Filters (Token Filters)
• Each field can define an Analyzer for index time, for query time, or for both.
Credit: http://www.slideshare.net/otisg/lucene-introduction
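
The tokenizer-plus-filters structure described above can be modeled as simple function composition. The Python sketch below is only an analogy for how an analyzer chains its steps; it does not use Lucene's classes, and the stopword list and all names are assumptions made for the example.

```python
import re

def letter_tokenizer(text):
    """Tokenizer: split the text into runs of ASCII letters."""
    return re.findall(r"[A-Za-z]+", text)

def lowercase_filter(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]

def stop_filter(tokens, stopwords=frozenset({"the", "a", "an", "over"})):
    """Token filter: drop stopwords (hypothetical list)."""
    return [t for t in tokens if t not in stopwords]

def make_analyzer(tokenizer, *filters):
    """Build an analyzer: one tokenizer followed by filters applied in order."""
    def analyze(text):
        tokens = tokenizer(text)
        for token_filter in filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze

analyze = make_analyzer(letter_tokenizer, lowercase_filter, stop_filter)

if __name__ == "__main__":
    print(analyze("The quick brown fox jumps over the lazy dog."))
```

Running the same analyzer at index time and at query time is what lets a query for "Fox" match a document that contains "fox".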

Analysis Process - Tokenizer
WhitespaceAnalyzer
Simplest built-in analyzer
Input: The quick brown fox jumps over the lazy dog.
Tokens: [The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog.]

Analysis Process - Tokenizer
SimpleAnalyzer
Lowercases, split at non-letter boundaries
Input: The quick brown fox jumps over the lazy dog.
Tokens: [the] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog]
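
The behavior of the two analyzers on these slides can be imitated in Python as follows. These functions only mimic the described behavior for ASCII text (an assumption; Lucene's own notion of a letter is broader) and are not the Lucene implementations.

```python
import re

def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are left untouched."""
    return text.split()

def simple_analyzer(text):
    """Split at non-letter characters, then lowercase each token."""
    return [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]

SENTENCE = "The quick brown fox jumps over the lazy dog."

if __name__ == "__main__":
    print(whitespace_analyzer(SENTENCE))
    print(simple_analyzer(SENTENCE))
```

The whitespace version keeps "The" capitalized and leaves the trailing period on "dog.", so a search for "dog" would miss it; the simple version normalizes both.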

Elasticsearch - Introduction
• Enterprise search platform built on Apache Lucene
• Open source
• Highly reliable, scalable, fault tolerant
• Supports distributed indexing, replication, and load-balanced querying
• http://www.elasticsearch.org/

Elasticsearch - Features
• Distributed RESTful search server
• Document oriented
• Domain driven
• Schema-less
• Easy to scale horizontally

Elasticsearch - Features
• Highlighting
• Spelling Suggestions
• Facets (Group by)
• Query DSL
– based on JSON to define queries
• Automatic shard replication, routing
• Zen discovery
– Unicast
– Multicast
• Master Election
– Re-election if Master Node fails
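
The routing feature listed above boils down to one formula: the shard for a document is a hash of its routing value (the document ID by default) modulo the number of primary shards. The Python sketch below shows that formula with zlib.crc32 as a stand-in hash. Elasticsearch uses its own hash function, so the shard numbers computed here will not match a real cluster; the function name and shard count are assumptions for the example.

```python
import zlib

def pick_shard(routing_value, number_of_primary_shards):
    """shard = hash(routing) % number_of_primary_shards (stand-in hash)."""
    hashed = zlib.crc32(routing_value.encode("utf-8"))
    return hashed % number_of_primary_shards

NUM_SHARDS = 5  # hypothetical index with five primary shards

if __name__ == "__main__":
    for doc_id in ("1", "2", "3"):
        print(doc_id, "-> shard", pick_shard(doc_id, NUM_SHARDS))
```

Because the result depends on the shard count, changing the number of primary shards would send existing IDs to different shards, which is why that number is fixed when the index is created.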

APIs
• HTTP RESTful API
• Java API
• Clients
– Perl, Python, PHP, Ruby, .NET, etc.
• All APIs perform automatic node operation rerouting.

How to start
It’s this Easy.

INDEX CREATION
curl -XPUT "http://localhost:9200/movies/movie/1" -d '{
  "title": "The Godfather",
  "director": "Francis Ford Coppola",
  "year": 1972
}'
URL format: http://localhost:9200/<index>/<type>/[<id>]
Credit: http://joelabrahamsson.com/elasticsearch-101/
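
For readers who prefer a script to curl, the same request can be prepared with Python's standard library. This sketch follows the URL layout shown on the slide and only builds the request without sending it; newer Elasticsearch releases differ (for example, they require a Content-Type header and have dropped mapping types), so treat it as matching the version used in this talk. The helper names are made up.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # address used in the slides

def document_url(index, doc_type, doc_id=None):
    """Build http://localhost:9200/<index>/<type>/[<id>]."""
    parts = [BASE_URL, index, doc_type]
    if doc_id is not None:
        parts.append(str(doc_id))
    return "/".join(parts)

def index_request(index, doc_type, doc_id, document):
    """Prepare (but do not send) the PUT request that indexes `document`."""
    body = json.dumps(document).encode("utf-8")
    return Request(document_url(index, doc_type, doc_id), data=body, method="PUT")

movie = {"title": "The Godfather", "director": "Francis Ford Coppola", "year": 1972}
req = index_request("movies", "movie", 1, movie)

if __name__ == "__main__":
    # Sending requires a running node, e.g. urllib.request.urlopen(req).
    print(req.get_method(), req.full_url)
```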

INDEX CREATION RESPONSE

UPDATE
curl -XPUT "http://localhost:9200/movies/movie/1" -d '{
  "title": "The Godfather",
  "director": "Francis Ford Coppola",
  "year": 1972,
  "genres": ["Crime", "Drama"]
}'
• New field: "genres"
• Updated version: indexing again under the same ID replaces the document and increments its version

GET
curl -XGET "http://localhost:9200/movies/movie/1"

DELETE
curl -XDELETE "http://localhost:9200/movies/movie/1"

SEARCH
• Search across all indexes and all types
– http://localhost:9200/_search
• Search across all types in the movies index
– http://localhost:9200/movies/_search
• Search explicitly for documents of type movie within the movies index
– http://localhost:9200/movies/movie/_search
curl -XPOST "http://localhost:9200/_search" -d '
{
  "query": {
    "query_string": {
      "query": "kill"
    }
  }
}'
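
The three search scopes and the query body above follow a regular pattern, which the Python sketch below spells out. It only builds the URL and the JSON body in the layout shown on the slide and sends nothing; the function names are assumptions for the example.

```python
import json

BASE_URL = "http://localhost:9200"  # address used in the slides

def search_url(index=None, doc_type=None):
    """Search everything, one index, or one type within an index."""
    if doc_type is not None and index is None:
        raise ValueError("a type can only be given together with an index")
    parts = [BASE_URL] + [p for p in (index, doc_type) if p is not None]
    return "/".join(parts) + "/_search"

def query_string_body(text):
    """JSON body for a query_string query, as in the curl example."""
    return json.dumps({"query": {"query_string": {"query": text}}})

if __name__ == "__main__":
    print(search_url())
    print(search_url("movies"))
    print(search_url("movies", "movie"))
    print(query_string_body("kill"))
```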

SEARCH RESPONSE

Updating existing Mapping
curl -XPUT "http://localhost:9200/movies/movie/_mapping" -d '
{
  "movie": {
    "properties": {
      "director": {
        "type": "multi_field",
        "fields": {
          "director": {"type": "string"},
          "original": {"type": "string", "index": "not_analyzed"}
        }
      }
    }
  }
}'

Cluster Architecture
Source: http://www.slideshare.net/DmitriBabaev1/elastic-search-moscow-bigdata-cassandra-sept-2013-meetup

Index Request

Search Request

Who is using it
• GitHub
• StumbleUpon
• SoundCloud
• Datadog
• Stack Overflow
• Many more…
– http://www.elasticsearch.com/case-studies/

Logstash
• Open source, Apache licensed
• Written in JRuby
• Part of the Elasticsearch family
• http://logstash.net/
• Current version: 1.4.0
• This talk uses 1.3.3

Logstash
• Multiple inputs / multiple outputs
• Centralize logs
– Collect
– Parse
– Forward/Store

Architecture
Source: http://www.infoq.com/articles/review-the-logstash-book

Logstash – life of an event
• Input → Filters → Output
• Filters are processed in the order they appear in the config file
• Outputs are processed in the order they appear in the config file
• Input: input stream
– File input (tail)
– Log4j
– Redis
– Syslog
– and many more…
• http://logstash.net/docs/1.3.3/

Logstash – life of an event
• Codecs: decoding log messages
– JSON
– Multiline
– Netflow
– and many more…
• Filters: processing messages
– Date: date format
– Grok: regular-expression-based extraction
– Mutate: change data type
• Output: storing the structured message
– Elasticsearch
– MongoDB
– Email
– Nagios
http://logstash.net/docs/1.3.3/
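
The input, filters, output flow can be pictured with a small Python model. This is an analogy only, not how Logstash is implemented or configured: the log format, the regular expression, and every function name are hypothetical, and the three filters merely imitate what Grok, Date, and Mutate are described as doing above.

```python
import re
from datetime import datetime

# Grok-like pattern for a made-up format: "<date> <time> <LEVEL> <bytes> <text>"
LINE_RE = re.compile(
    r"(?P<timestamp>\S+ \S+) (?P<level>[A-Z]+) (?P<bytes>\d+) (?P<text>.*)"
)

def grok_filter(event):
    """Extract named fields from the raw message with a regular expression."""
    match = LINE_RE.match(event["message"])
    if match:
        event.update(match.groupdict())
    return event

def date_filter(event):
    """Parse the timestamp string into a datetime."""
    if "timestamp" in event:
        event["timestamp"] = datetime.strptime(event["timestamp"], "%Y-%m-%d %H:%M:%S")
    return event

def mutate_filter(event):
    """Change the data type of a field (string to int)."""
    if "bytes" in event:
        event["bytes"] = int(event["bytes"])
    return event

FILTERS = [grok_filter, date_filter, mutate_filter]  # applied in this order

def run_pipeline(lines, output):
    """Push each input line through the filters, then hand it to the output."""
    for line in lines:                  # input
        event = {"message": line}
        for event_filter in FILTERS:    # filters, in order
            event = event_filter(event)
        output.append(event)            # output (a list stands in for a store)

stored = []
run_pipeline(["2014-05-10 12:30:00 ERROR 512 disk almost full"], stored)

if __name__ == "__main__":
    print(stored[0])
```

The order of the filter list matters here for the same reason it matters in the config file: the date and mutate steps rely on fields that the grok step extracts first.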

Quick Start
Version 1.3.3 and earlier:
java -jar logstash-1.3.3-flatjar.jar agent -f agent.conf -- web
Version 1.4:
bin/logstash agent -f agent.conf
bin/logstash web
basic-agent.conf:
input {
  tcp {
    type => "apache"
    port => 3333
  }
}
output {
  stdout {
    debug => true
  }
  elasticsearch {
    embedded => true
  }
}

Source: http://www.slideshare.net/AmazeeAG/2014-0422-loggingwithlogstashbastianwidmercampusbern

Analytics
• Analytics source: Kibana.org, based on Elasticsearch and Logstash
• Image source: http://semicomplete.com/presentations/logstash-monitorama-2013/#/8

Thanks!
@rahuldausa on twitter and slideshare
http://www.linkedin.com/in/rahuldausa
Found this interesting?
Join us @ http://www.meetup.com/Hyderabad-Apache-Solr-Lucene-Group/
