Elasticsearch for beginners

Neil Baker
Software Engineer (Zapelin S.L.)

What is elasticsearch?
Elasticsearch is a free and open-source distributed inverted index created by Shay Banon.
Built on top of Apache Lucene
- Lucene is one of the most popular Java-based full-text search index implementations.
First public release (v0.4) in February 2010.
Developed in Java, so inherently cross-platform.
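
To make the term "inverted index" concrete, here is a minimal sketch in plain Python. It is only an illustration of the idea (terms pointing to the documents that contain them), not how Lucene is implemented; the function names and sample documents are made up for this example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of document ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """Return ids of documents that contain every term of the query (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "Boston is a city in Massachusetts",
    2: "Austin is a city in Texas",
    3: "Boston Harbor",
}
index = build_inverted_index(docs)
print(search(index, "boston"))
print(search(index, "city in texas"))
```

A lookup only touches the posting lists of the query terms, which is why this structure suits full-text search better than scanning every document.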

Which companies use elasticsearch?

Why Elasticsearch?
Easy to scale
Everything is one JSON call away (RESTful API)
Unleashed power of Lucene under the hood
Excellent Query DSL
Multi-tenancy
Support for advanced search features (Full Text)
Configurable and Extensible
Document Oriented
Schema free
Conflict management
Active community

Easy to Scale
Elasticsearch allows you to start small, but will grow
with your business. It is built to scale horizontally out
of the box.
As you need more capacity, just add more nodes, and
let the cluster reorganize itself to take advantage of
the extra hardware.
RESTful API
Elasticsearch is API driven. Almost any action can be performed using a simple
RESTful API using JSON over HTTP. Client libraries already exist for most popular
languages.
Responses are always in JSON, which is both machine and human readable.

Excellent Query DSL
The REST API exposes a rich and capable query DSL that is nonetheless easy to use. Every query is just a
JSON object that can contain practically any type of query, or even several of them combined.
Using filtered queries, with some clauses expressed as Lucene filters, lets Elasticsearch cache their results and thus speeds
up common queries, or complex queries with parts that can be reused.
Faceting, another very common search feature, can be requested along with a search; the facet counts are then
returned together with the search results, ready for you to use.
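
Because every query is just a JSON object, queries can be composed like ordinary data. The sketch below builds a combined query in Python and prints the JSON that would be sent as the request body. The helper functions are hypothetical names for this example; `query_string`, `term` and `bool` are the query types they wrap.

```python
import json


def query_string(text):
    """Full-text clause: match documents against a query string."""
    return {"query_string": {"query": text}}


def term(field, value):
    """Exact-value clause on a single field."""
    return {"term": {field: value}}


def bool_must(*clauses):
    """Combine clauses so that all of them must match."""
    return {"bool": {"must": list(clauses)}}


# Two clauses combined into one request body (field names follow the slides' sample document).
body = {"query": bool_must(query_string("boston"), term("abbreviation", "ma"))}
print(json.dumps(body, indent=2))
```

The resulting `body` is plain data, so it can be logged, stored, or extended with more clauses before it is sent.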
Per-operation Persistence
Elasticsearch puts your data safety first. Document changes are recorded in
transaction logs on multiple nodes in the cluster to minimize the chance of any
data loss.

Multi-tenancy
You can host multiple indexes on one Elasticsearch installation - node
or cluster. Each index can have multiple "types", which are essentially
completely different indexes.
The nice thing is you can query multiple types and multiple indexes
with one simple query. This opens quite a lot of options.
Support for advanced search features (Full Text)
Elasticsearch uses Lucene under the covers to provide the most powerful full text search
capabilities available in any open source product.
Search comes with multi-language support, a powerful query language, support for
geolocation, context aware did-you-mean suggestions, autocomplete and search snippets.
Scripts are supported in filters and scorers.

Configurable and Extensible
Many Elasticsearch settings can be changed while Elasticsearch is running, but some will require a restart (and in some cases
reindexing). Most settings can be changed using the REST API too.
Elasticsearch has several extension points - namely site plugins (which let you serve static content from ES, like monitoring JavaScript apps),
rivers (for feeding data into Elasticsearch), and plugins that let you add modules or components within Elasticsearch itself. This allows you
to switch almost every part of Elasticsearch if you so choose, fairly easily.
If you need to add REST endpoints to your Elasticsearch cluster, that is easily done as well.
Document Oriented
Store complex real world entities in Elasticsearch as structured JSON
documents. All fields are indexed by default, and all the indices can be
used in a single query to return results at breathtaking speed.

Schema free
Elasticsearch allows you to get started easily. Toss it a JSON document
and it will try to detect the data structure, index the data and make it
searchable. Later, apply your domain-specific knowledge of your data
to customize how your data is indexed.
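
As a rough illustration of what "detecting the data structure" means, the sketch below guesses a field type for each value of a parsed JSON document. It is a simplification written for this example: the type names only loosely mirror Elasticsearch's, and the real detection has more rules (dates, for instance).

```python
def infer_type(value):
    """Guess a field type name from a parsed JSON value (illustrative type names)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int in Python
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        # Nested objects get a nested mapping, one entry per field.
        return {key: infer_type(val) for key, val in value.items()}
    if isinstance(value, list):
        # Assume arrays are homogeneous and look at the first element only.
        return infer_type(value[0]) if value else "unknown"
    return "unknown"


doc = {
    "rank": 21,
    "city": "Boston",
    "land_area": 48.277,
    "location": {"lat": 42.332, "lon": 71.0202},
}
mapping = infer_type(doc)
print(mapping)
```

An explicit mapping replaces this guesswork with your own knowledge of the data, which is the "later" step mentioned above.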
Conflict management
Optimistic version control can be used where needed to ensure that data is
never lost due to conflicting changes from multiple processes.
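
The idea behind optimistic version control can be sketched with a small in-memory store: a write states which version it expects, and is rejected if another process changed the document in the meantime. The class and exception names are invented for this example and are not Elasticsearch APIs.

```python
class VersionConflict(Exception):
    """Raised when the stored version differs from the version the writer expected."""


class VersionedStore:
    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def index(self, doc_id, source, expected_version=None):
        """Store a document; if expected_version is given, it must match the current version."""
        current_version = self._docs.get(doc_id, (0, None))[0]
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected version {expected_version}, found {current_version}"
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, source)
        return new_version


store = VersionedStore()
v1 = store.index("21", {"city": "Boston"})
v2 = store.index("21", {"city": "Boston", "rank": 21}, expected_version=v1)

# A second writer still holding version 1 is rejected instead of silently overwriting.
try:
    store.index("21", {"city": "Bostn"}, expected_version=v1)
    conflict = False
except VersionConflict:
    conflict = True
print(v1, v2, conflict)
```

The rejected writer can re-read the document, reapply its change, and retry with the new version.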
Active community
The community, other than creating nice tools and plugins, is very
helpful and supportive. The overall vibe is really great, and this is an
important metric of any OSS project.
There are also some books currently being written by community
members, and many blog posts around the net sharing experiences
and knowledge.

Cluster :
A cluster consists of one or more nodes which share the same cluster name. Each cluster has a single master node which is chosen automatically by
the cluster and which can be replaced if the current master node fails.
Node :
A node is a running instance of elasticsearch which belongs to a cluster. Multiple nodes can be started on a single server for testing purposes, but
usually you should have one node per server.
At startup, a node will use unicast (or multicast, if specified) to discover an existing cluster with the same cluster name and will try to join that cluster.
Index :
An index is like a ‘database’ in a relational database. It has a mapping which defines multiple types.
An index is a logical namespace which maps to one or more primary shards and can have zero or more replica shards.
Type :
A type is like a ‘table’ in a relational database. Each type has a list of fields that can be specified for documents of that type. The mapping defines
how each field in the document is analyzed.

Document :
A document is a JSON document which is stored in elasticsearch. It is like a row in a table in a relational database. Each document is stored in an
index and has a type and an id.
A document is a JSON object (also known in other languages as a hash / hashmap / associative array) which contains zero or more fields, or key-value
pairs. The original JSON document that is indexed will be stored in the _source field, which is returned by default when getting or searching
for a document.
Field :
A document contains a list of fields, or key-value pairs. The value can be a simple (scalar) value (eg a string, integer, date), or a nested structure like
an array or an object. A field is similar to a column in a table in a relational database.
The mapping for each field has a field ‘type’ (not to be confused with document type) which indicates the type of data that can be stored in that field,
eg integer, string, object. The mapping also allows you to define (amongst other things) how the value for a field should be analyzed.
Mapping :
A mapping is like a ‘schema definition’ in a relational database. Each index has a mapping, which defines each type within the index, plus a number
of index-wide settings. A mapping can either be defined explicitly, or it will be generated automatically when a document is indexed.

Shard :
A shard is a single Lucene instance. It is a low-level “worker” unit which is managed automatically by elasticsearch. An index is a logical namespace
which points to primary and replica shards.
Elasticsearch distributes shards amongst all nodes in the cluster, and can move shards automatically from one node to another in the case of node
failure, or the addition of new nodes.
Primary Shard :
Each document is stored in a single primary shard. When you index a document, it is indexed first on the primary shard, then on all replicas of the
primary shard. By default, an index has 5 primary shards. You can specify fewer or more primary shards to scale the number of documents that
your index can handle.
Replica Shard :
Each primary shard can have zero or more replicas. A replica is a copy of the primary shard, and has two purposes:
1) increase failover: a replica shard can be promoted to a primary shard if the primary fails.
2) increase performance: get and search requests can be handled by primary or replica shards.

ElasticSearch Routing
All of your data lives in a primary shard, somewhere in the cluster. You may have five shards or five hundred, but any particular
document is only located in one of them. Routing is the process of determining which shard that document will reside in.
When you search without routing, Elasticsearch has no idea which shard holds the documents you are looking for. By default they were
spread across the cluster by a hash of their ID, which is effectively random, so Elasticsearch has no choice but to broadcast the
request to all shards. This is a non-negligible overhead and can easily impact performance.
Wouldn’t it be nice if we could tell Elasticsearch which shard the document lived in? Then you would only have to search one shard
to find the document(s) that you need.
Routing ensures that all documents with the same routing value are stored on the same shard, eliminating the need to broadcast
searches.
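
The routing idea boils down to "hash the routing value, then take it modulo the number of primary shards". The sketch below shows only that principle. CRC32 is a stand-in chosen because it is deterministic and ships with Python; it is not the hash Elasticsearch uses internally, and the function name and routing value are made up.

```python
import zlib


def pick_shard(routing_value, num_primary_shards):
    """Map a routing value to a shard number in [0, num_primary_shards)."""
    # Stand-in hash (assumption): any deterministic hash illustrates the principle.
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards


shards = 5  # the default number of primary shards mentioned in the glossary
a = pick_shard("user-42", shards)
b = pick_shard("user-42", shards)
print(a == b, 0 <= a < shards)
```

Because the result depends on the number of primary shards, changing that number would send existing routing values to different shards.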

Elasticsearch    MySQL
Indices          Databases
Types            Tables
Documents        Rows
Keys             Columns

Performance
(Core i7, 2 GHz, 8 GB RAM, 128 GB SSD)
Insert of 10 million records:
Elasticsearch: 23 minutes
MySQL without index: 56 minutes
MySQL with index: 228 minutes
Select name and first name of 100 entries:
Elasticsearch: 5 ms
MySQL: 9 ms
Select of 100 full entries:
Elasticsearch: 5 ms
MySQL: 9 ms
Select of the next 100 full entries:
Elasticsearch: 4 ms
MySQL: 18 ms

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.1.1.tar.gz
tar -xzf elasticsearch-5.1.1.tar.gz
cd elasticsearch-5.1.1/
./bin/elasticsearch

Is it running?
GET http://localhost:9200/?pretty
Response :
{
  "name": "Vivisector",
  "cluster_name": "elasticsearch",
  "version": {
    "number": "2.3.3",
    "build_hash": "218bdf10790eef486ff2c41a3df5cfa32dadcfde",
    "build_timestamp": "2016-05-17T15:40:04Z",
    "build_snapshot": false,
    "lucene_version": "5.5.0"
  },
  "tagline": "You Know, for Search"
}

Let's play with Elasticsearch

Indexing a document
Request :
$ curl -XPUT "http://localhost:9200/test-data/cities/21" -d '{
  "rank": 21,
  "city": "Boston",
  "state": "Massachusetts",
  "population2010": 617594,
  "land_area": 48.277,
  "location": {
    "lat": 42.332,
    "lon": 71.0202
  },
  "abbreviation": "MA"
}'
Response : {"ok":true,"_index":"test-data","_type":"cities","_id":"21","_version":1}
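
The same indexing call can be prepared from Python with only the standard library. This is a sketch: it builds the PUT request without sending it, so it runs without a cluster. The helper name is hypothetical, and the Content-Type header is an addition that the curl example does not send; newer Elasticsearch releases expect it.

```python
import json
import urllib.request


def build_index_request(base_url, index, doc_type, doc_id, document):
    """Build (but do not send) a PUT request that indexes a document under the given id."""
    url = f"{base_url}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},  # assumption: not in the slides
    )


request = build_index_request(
    "http://localhost:9200", "test-data", "cities", 21,
    {"rank": 21, "city": "Boston"},
)
print(request.get_method(), request.full_url)

# To actually send it (requires a running node):
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```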

Getting a document
Request:
$ curl -XGET "http://localhost:9200/test-data/cities/21?pretty"
Response:
{
  "_index" : "test-data",
  "_type" : "cities",
  "_id" : "21",
  "_version" : 1,
  "exists" : true,
  "_source" : {
    "rank": 21,
    "city": "Boston",
    "location": {
      "lat": 42.332,
      "lon": 71.0202
    }
  }
}

Updating a document
Indexing to an existing ID replaces the stored document and increments its version.
Request :
$ curl -XPUT "http://localhost:9200/test-data/cities/21" -d '{
  "rank": 21,
  "city": "Boston",
  "location": {
    "lat": 42.332,
    "lon": 71.0202
  }
}'
Response : {"ok":true,"_index":"test-data","_type":"cities","_id":"21","_version":2}

Searching
Searching and querying takes the format of: http://localhost:9200/[index]/[type]/[operation]
Search across all indexes and all types
http://localhost:9200/_search
Search across all types in the test-data index.
http://localhost:9200/test-data/_search
Search explicitly for documents of type cities within the test-data index.
http://localhost:9200/test-data/cities/_search
Search explicitly for documents of type cities within the test-data index using paging.
http://localhost:9200/test-data/cities/_search?size=5&from=10
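
The URL patterns above follow one rule: optional index, optional type, then `_search`, then optional paging parameters. A small sketch of that rule, using a hypothetical helper name:

```python
from urllib.parse import urlencode


def search_url(base_url, index=None, doc_type=None, params=None):
    """Build a search URL of the form base/[index]/[type]/_search?params."""
    parts = [base_url.rstrip("/")]
    if index:
        parts.append(index)
        if doc_type:  # a type only makes sense inside an index
            parts.append(doc_type)
    parts.append("_search")
    url = "/".join(parts)
    if params:
        url += "?" + urlencode(params)
    return url


base = "http://localhost:9200"
print(search_url(base))
print(search_url(base, "test-data"))
print(search_url(base, "test-data", "cities", {"size": 5, "from": 10}))
```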
There are 3 different types of search queries:
- Full Text Search (query string)
- Structured Search (filter)
- Analytics (facets)

Full Text Search (query string)
In this case you will be searching in bits of natural language for (partially) matching query strings. The Query DSL
alternative for searching for “Boston” in all documents would look like:
Request:
$ curl -XGET "http://localhost:9200/test-data/cities/_search?pretty=true" -d '{
  "query": { "query_string": { "query": "boston" }}}'
Response: {
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 6.1357985,
    "hits" : [ {
      "_index" : "test-data",
      "_type" : "cities",
      "_id" : "21",
      "_score" : 6.1357985,
      "_source" : {"rank":"21","city":"Boston",...}
    } ]
  }
}...

Structured Search (filter)
Structured search is about interrogating data that has inherent structure. Dates, times and numbers are all structured: they have a precise format that you
can perform logical operations on. Common operations include comparing ranges of numbers or dates, or determining which of two values is larger.
With structured search, the answer to your question is always a yes or no; something either belongs in the set or it does not. Structured search does not
worry about document relevance or scoring; it simply includes or excludes documents.
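
The yes-or-no nature of a filter can be sketched in a few lines of plain Python: each document either falls inside the range or it does not, and no score is computed. The function name, the field name `population`, and all sample cities except Boston's figure (taken from the indexing example) are made up for illustration.

```python
def in_range(doc, field, low, high):
    """True if doc[field] exists and lies within [low, high]; no scoring involved."""
    value = doc.get(field)
    return value is not None and low <= value <= high


# Invented sample data; only Boston's population comes from the slides.
cities = [
    {"city": "Boston", "population": 617594},
    {"city": "Smallville", "population": 1200},
    {"city": "Metropolis", "population": 8000000},
    {"city": "Nowhere"},  # no population field: simply excluded
]
matches = [c["city"] for c in cities if in_range(c, "population", 500000, 1000000)]
print(matches)
```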
Request:
$ curl -XGET "http://localhost:9200/test-data/cities/_search?pretty" -d '{
  "query": { "filtered": { "filter": { "term": { "city": "boston" }}}}}'
$ curl -XGET "http://localhost:9200/test-data/cities/_search?pretty" -d '{
  "query": {
    "range": {
      "population2012": {
        "from": 500000,
        "to": 1000000
      }}}}'
$ curl -XGET "http://localhost:9200/test-data/cities/_search?pretty" -d '{
  "query": { "bool": {
    "should": [{ "match": { "state": "Texas" } }, { "match": { "state": "California" } }],
    "must": { "range": { "population2012": { "from": 500000, "to": 1000000 } } },
    "minimum_should_match": 1}}}'

Analytics (facets)
Requests of this type will not return a list of matching documents, but a statistical breakdown of the documents.
Elasticsearch has functionality called aggregations, which allows you to generate sophisticated analytics over your data. It is similar to GROUP BY in SQL.
Request:
$ curl -XGET "http://localhost:9200/test-data/cities/_search?pretty" -d '{
  "aggs": { "all_states": { "terms": { "field": "state" }}}}'
Response:
{ ...
  "hits": { ... },
  "aggregations": {
    "all_states": {
      "buckets": [
        {"key": "massachusetts", "doc_count": 2},
        {"key": "danbury", "doc_count": 1}
      ]
}}}
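
What a terms aggregation computes can be mimicked in plain Python, much like a GROUP BY with a count. This is a sketch of the semantics only; the function name and sample documents are invented.

```python
from collections import Counter


def terms_aggregation(docs, field):
    """Count documents per distinct value of a field, most frequent first."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    return [{"key": key, "doc_count": count} for key, count in counts.most_common()]


# Invented sample documents.
docs = [
    {"city": "Boston", "state": "massachusetts"},
    {"city": "Austin", "state": "texas"},
    {"city": "Worcester", "state": "massachusetts"},
]
buckets = terms_aggregation(docs, "state")
print(buckets)
```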

ElasticSearch Monitoring
ElasticSearch-Head - https://github.com/mobz/elasticsearch-head
Marvel - http://www.elasticsearch.org/guide/en/marvel/current/#_marvel_8217_s_dashboards
Paramedic - https://github.com/karmi/elasticsearch-paramedic
Bigdesk - https://github.com/lukas-vlcek/bigdesk/

ElasticSearch Limitations
Security : Elasticsearch does not provide any built-in authentication or access control functionality.
Transactions : There is not much support for transactions or processing on data manipulation.
Durability : ES is distributed and fairly stable, but backups and durability are not as high a priority as in other data stores.
Large Computations : Commands for searching data are not suited to "large" scans of data and advanced computation on the db side.
Data Availability : ES makes data available in "near real-time", which may require additional considerations in your application (ie:
on a comments page where a user adds a new comment, refreshing the page might not actually show the new post because the index is still
updating).
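
The "near real-time" behaviour can be pictured as a buffer that only becomes searchable after a refresh. The toy class below is an invented model of that idea, not Elasticsearch code; in a real cluster the refresh happens periodically in the background.

```python
class NearRealTimeIndex:
    """Toy model: indexed documents become searchable only after refresh()."""

    def __init__(self):
        self._buffer = []      # indexed but not yet visible to searches
        self._searchable = []  # visible to searches

    def index(self, doc):
        self._buffer.append(doc)

    def refresh(self):
        """Make everything indexed so far visible to searches."""
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, predicate):
        return [doc for doc in self._searchable if predicate(doc)]


idx = NearRealTimeIndex()
idx.index({"comment": "first!"})
before = idx.search(lambda doc: True)  # the new comment is not visible yet
idx.refresh()
after = idx.search(lambda doc: True)   # now it is
print(len(before), len(after))
```

An application that must show a user's own new comment immediately can display it from its own state rather than re-querying the index right away.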

Open-Source Libraries
https://github.com/elasticsearch

http://stackoverflow.com/questions/tagged/elasticsearch
stackoverflow

Get started.
www.elasticsearch.org

About me
Founder and Technical Director of Terions Communication LTD (London/Berlin, 1996-2006)
- Datacenter Operator for Press and Image Agencies and Part of the DPA (German Press Agency)
Software Developer, Zapelin S.L. (Adeje)
- Development of IPTV Solutions for Hotels e.g.
RHCE - Red Hat Certified Engineer
CCDP - Cisco Certified Design Professional
DBA - Oracle Certified Professional, MySQL 5 Database Administrator
Contact for Questions: neil@cconnect.es
