ElasticSearch on AWS
                      Real Estate portal Case Study (Spitogatos.gr)


                                                    AWSUG GR meetup #7
                                                      27 September 2012




                                                    Andreas Chatzakis
                                     co-founder / IT Director – Spitogatos.gr

Event sponsored by:                                       @achatzakis on twitter
http://geekandpoke.typepad.com/geekandpoke/2010/09/instant-search.html
#about_us
Helping you find a property

Finding a property in Greece is complex and lacks transparency.
We make life easier for househunters via:
     Powerful search functionality
          Web & Mobile
          Location & Criteria
     Quality content
          Listings (we love photos)
          Articles
     mySpitogatos
          Email alerts
          Save your search
          Favorite listings & notes
          Contact the realtors


Realtors love us too!

Professionals need help in these turbulent times.
We add value in multiple ways:
     Cost effective promotion & high quality leads
          Targeted channel (very)
          Leads already filtered (“we've seen the photos!”)
     Technology services for realtors
          Turnkey web site solution
          Listing synchronization web service
     B2B via Spitogatos Network (SpiN): business network / collaboration tool for realtors
     Channel for foreign buyers via the English version




#background
To Search is to Find

Search is central to what we do
   Users searching for property come with structured criteria of huge variety
        Athens Center, residential - flat or studio, for sale, 100-150k €, 85-120 sq meter,
         with a garage
        Athens Center & N.Kosmos, residential - flat, for sale, 75-100k €, 70-100 sq meter,
         2+ bedrooms, only show listings with photos
        Piraeus centre or Mikrolimano, commercial – store, for rent, 500-750 € per
         month, only listings with recently reduced price
        Monetize: # of Listings grouped by paying member + above criteria
        iPhone app → Listings within geo-rectangle + above criteria
        As a result, caching is rarely our friend!
   We used to think Lucene/Solr, ElasticSearch, CloudSearch etc. were only useful
    for text search, not adding value for structured search
   We kept trying to optimize MySQL (multi-column indices etc.)
    while throwing replicas at the problem.
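A structured search like the first example above maps naturally onto term and range filters. The sketch below builds such a request body as a plain Python dict. The field names (`area`, `property_type`, `listing_type`, `price`, `sqm`, `has_garage`) are hypothetical, and the `bool`/`filter` layout follows the current Elasticsearch Query DSL rather than the 2012 syntax the talk would have used.

```python
import json


def build_search(areas, property_types, listing_type, price, sqm, garage=None):
    """Build a filter-only search body from structured criteria.

    All field names are hypothetical; adapt them to your own mapping.
    """
    filters = [
        {"terms": {"area": areas}},                    # any of the chosen areas
        {"terms": {"property_type": property_types}},  # flat OR studio
        {"term": {"listing_type": listing_type}},      # sale or rent
        {"range": {"price": {"gte": price[0], "lte": price[1]}}},
        {"range": {"sqm": {"gte": sqm[0], "lte": sqm[1]}}},
    ]
    if garage is not None:
        filters.append({"term": {"has_garage": garage}})
    return {"query": {"bool": {"filter": filters}}}


# "Athens Center, flat or studio, for sale, 100-150k EUR, 85-120 sq meter, with a garage"
body = build_search(
    ["athens-center"], ["flat", "studio"], "sale",
    price=(100000, 150000), sqm=(85, 120), garage=True,
)
print(json.dumps(body, indent=2))
```

Every criterion is an independent filter clause, so adding or dropping one does not require a new index the way a new column combination might in MySQL.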
Why ElasticSearch

We selected ElasticSearch after (very) brief research* into alternatives:
   AWS's own CloudSearch:
        Zero management service: nice!
        Not available on eu-west-1
        Currently lacks functionality that ES has (e.g. geospatial, non-English analyzers)
   Sphinx
        Easy MySQL integration
        How do you scale it?*
   Solr
        Industry standard
        Seems to be perceived as somewhat harder to scale/operate?*
   ElasticSearch:
        Piece of cake to set up on AWS (stay tuned!)
        Super distributed, scales & is easy on IT ops (more on that later!)
* Disclaimer: We did not go through a detailed product selection process!
#elasticsearch
ElasticSearch basics

A distributed, RESTful Search engine built on top of Lucene
   Schema free
        JSON documents
        Analyzers
        Boost levels
   Easy & flexible Search
        Lucene query string or JSON based search query DSL
        Facets & Highlighting
        Spatial search
        Custom scripts
   Multi Tenancy
        Store & search across multiple indices
        Each with its own settings
        Use-case: Logs – recent in memory, old on disk

Scaling ElasticSearch

Designed from the ground up to be Scalable & Highly Available
   Distributed
        Indices automatically broken into shards
        Replicas for read performance & availability
        Multiple cluster nodes, each hosting 1+ shards/replicas
        Peer-to-peer: each node can delegate operations to other nodes
        Add or remove nodes at will
              Rebalancing & routing automagically behind the scenes
   Discovery
        Multicast or unicast (declarative)
   Gateway
        Allows recovery in case all nodes go down
        Local or shared storage
        Async replication in case of shared storage

A scale-up example

Assume a cluster with 4 shards and 1 replica configuration
   1 node example – Status Yellow



   2 nodes example – Status Green



   3 nodes example




     Legend: primary shard, replica shard, master node, regular node (diagrams not reproduced)

The master node maintains cluster state and reacts when nodes join or leave the cluster by reassigning shards.
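The yellow/green statuses above follow from one rule: a replica is never placed on the same node as its primary. The toy allocator below illustrates that rule for the 4 shards / 1 replica configuration. It is a simplified sketch, not ElasticSearch's actual allocation algorithm, and the function names are made up for the example.

```python
def allocate(num_nodes, num_shards=4, num_replicas=1):
    """Toy allocator: place each shard copy on the least loaded node that
    does not already hold a copy of the same shard.

    Returns (nodes, unassigned) where nodes is a list of [(shard, "P"|"R")].
    """
    nodes = [[] for _ in range(num_nodes)]
    unassigned = 0
    for shard in range(num_shards):
        for copy in range(num_replicas + 1):  # copy 0 is the primary
            candidates = [n for n in nodes if all(s != shard for s, _ in n)]
            if not candidates:
                unassigned += 1  # e.g. a replica with no second node to live on
                continue
            target = min(candidates, key=len)
            target.append((shard, "P" if copy == 0 else "R"))
    return nodes, unassigned


def health(unassigned):
    """Green when every copy is assigned, yellow when some replicas are not."""
    return "green" if unassigned == 0 else "yellow"


for n in (1, 2, 3):
    nodes, unassigned = allocate(n)
    print(n, "node(s):", health(unassigned), [len(node) for node in nodes])
```

With one node all four replicas stay unassigned (yellow); from two nodes onward every copy has a home (green), and a third node simply spreads the eight shard copies more thinly.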
ElasticSearch on AWS

2 modules make deployment on AWS a breeze
   EC2 discovery
        Filter by security group, AZ, tags
        Requires IAM user with certain EC2 privileges:
         DescribeAvailabilityZones, DescribeInstances, DescribeRegions,
         DescribeSecurityGroups, DescribeTags
        Very useful in autoscaling setups with ephemeral servers
   S3 gateway
        Long term reliable async persistence of cluster state and indices
        Allows deployment without EBS volumes
        Still, local gateway with EBS volumes performs better (less network used,
         faster recovery)
        Won't protect from accidental deletion of index (deletion will propagate to
         shared storage)
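The EC2 privileges listed for the discovery module can be granted with a small read-only IAM policy. The sketch below generates one as JSON; the `ec2:` action prefix and the policy layout are standard IAM, but the policy itself is an illustration using today's policy language version, not the one used in the talk.

```python
import json

# The five read-only EC2 actions named on the slide, in IAM notation.
EC2_DISCOVERY_ACTIONS = [
    "ec2:DescribeAvailabilityZones",
    "ec2:DescribeInstances",
    "ec2:DescribeRegions",
    "ec2:DescribeSecurityGroups",
    "ec2:DescribeTags",
]


def discovery_policy():
    """Return a minimal IAM policy document allowing EC2 discovery lookups."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": EC2_DISCOVERY_ACTIONS,
                "Resource": "*",  # Describe* calls are not scoped to single resources
            }
        ],
    }


print(json.dumps(discovery_policy(), indent=2))
```

Keeping the policy limited to Describe calls means the credentials stored on each search node cannot modify any AWS resource.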


#implementation
Indexation

Indexation of Spitogatos.gr ads
   DB is still the “source of truth”
        We propagate DELETEs synchronously, INSERTs & UPDATEs asynchronously
              KISS: Cron job (re)indexes never-indexed or least-recently-indexed listings
              ORM marks new/modified listings as never-indexed (so they go first)
   Location: Multivalue field instead of nested set model in the DB
        e.g. this property is in Greece, Attica, Piraeus, Port of Piraeus
        Property will be included in results when I search for any of the above.
   Flat schema
        Searchable listing owner fields are included in the document (vs a JOIN in our DB)
        Changes to other tables might lead to large # of listings requiring reindexation
         (e.g. real estate agent becomes a paying member)
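The cron-based approach above boils down to an ordering rule plus a flattening step. Here is a minimal sketch of both, assuming listings are dicts with a hypothetical `indexed_at` timestamp (None meaning never indexed) and a multivalue `areas` field; none of these names come from the actual Spitogatos code.

```python
from datetime import datetime


def next_batch(listings, batch_size):
    """Pick listings to (re)index: never-indexed first, then least recently indexed."""
    def key(listing):
        indexed_at = listing["indexed_at"]
        # (False, ...) sorts before (True, ...), so never-indexed listings go first
        return (indexed_at is not None, indexed_at or datetime.min)
    return sorted(listings, key=key)[:batch_size]


def to_document(listing, owner):
    """Flatten a listing and its owner into one document (no JOIN at search time)."""
    return {
        "id": listing["id"],
        "areas": listing["areas"],  # multivalue: every level of the location hierarchy
        "owner_id": owner["id"],
        "owner_is_paying": owner["is_paying"],
    }


listings = [
    {"id": 1, "indexed_at": datetime(2012, 9, 1), "areas": ["Greece", "Attica"]},
    {"id": 2, "indexed_at": None, "areas": ["Greece", "Attica", "Piraeus", "Port of Piraeus"]},
    {"id": 3, "indexed_at": datetime(2012, 8, 1), "areas": ["Greece", "Attica", "Piraeus"]},
]
owner = {"id": 7, "is_paying": True}

batch = next_batch(listings, 2)
print([l["id"] for l in batch])
print(to_document(batch[0], owner))
```

Because owner fields are copied into each document, flipping `is_paying` for one agent means marking all of that agent's listings as never-indexed so the same job picks them up.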




Index Integrity

Making sure our index is consistent with the DB
   Scrutineer ( https://github.com/Aconex/scrutineer )
        Compares DB and ElasticSearch index for mismatches
             Exists in ES but not in the DB (or vice versa)
             ES version not up to date
        Relies on the “_version” field, which our ORM increments on change
        When indexing we explicitly set versioning to “external”
        Had to “hack” it as it doesn't work with EC2 discovery module
           http://labs.spitogatos.gr/?p=45
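The kind of check described above can be pictured as a comparison of two id-to-version maps. The function below is only an illustration of the idea, not Scrutineer's implementation; it assumes both sides fit in memory as dicts.

```python
def find_mismatches(db_versions, es_versions):
    """Compare {id: version} maps from the DB and from the search index."""
    db_ids, es_ids = set(db_versions), set(es_versions)
    return {
        "missing_in_es": sorted(db_ids - es_ids),
        "missing_in_db": sorted(es_ids - db_ids),
        # present on both sides, but the index holds an older version
        "stale_in_es": sorted(
            i for i in db_ids & es_ids if es_versions[i] < db_versions[i]
        ),
    }


db = {101: 3, 102: 1, 103: 5}
es = {101: 3, 103: 4, 104: 2}
print(find_mismatches(db, es))
```

With external versioning the index stores exactly the version number supplied by the application, which is what makes a direct comparison like this meaningful.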




Search – Shards & Routing

How does ElasticSearch decide in which shard to store a doc?
   By default this is done based on hash of document id
   Can be overridden while indexing and while searching (routing parameter)
   We route based on the hash of the area id
       - Most users search for listings within a specific area
       - We hit only a single shard for a large percentage of the searches.




           No routing specified                                      Routing by specific areaId
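Routing is essentially "hash the routing value, take it modulo the number of shards". The sketch below uses `zlib.crc32` as a stand-in for a stable hash; ElasticSearch uses its own hash function, so the actual shard numbers will differ. The function names are hypothetical.

```python
import zlib


def shard_for(routing_value, num_shards):
    """Map a routing value (document id or area id) to a shard number."""
    digest = zlib.crc32(str(routing_value).encode("utf-8"))
    return digest % num_shards


def shards_to_query(area_id, num_shards):
    """Without a routing value every shard is searched; with one, a single shard."""
    if area_id is None:
        return list(range(num_shards))
    return [shard_for(area_id, num_shards)]


NUM_SHARDS = 4
print(shards_to_query(None, NUM_SHARDS))       # fan-out to all shards
print(shards_to_query("area-42", NUM_SHARDS))  # single shard
```

The same routing value has to be supplied at index time and at search time, otherwise the search looks in a shard that does not hold the area's listings.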
Search – Flat Schema, Facets & Scoring

We rely a lot on ElasticSearch's Flat Schema, Facets & Scoring
   No joins due to flat schema => fast!
   Multivalue fields => fast filtering for listings in areas of various hierarchy levels
   Facets functionality returns list of paying agents with # listings matching criteria
   Old slow ranking algorithm replaced by ElasticSearch scoring functionality
        We used to go through our DB and refresh each listing's score
             ad age is part of the equation
        Now ES computes this dynamically on every search
        We use custom scoring
        We can modify scoring algorithm and see changes instantly
             no need to recalculate scores for all listings
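To see why computing the score at search time helps, consider a score that decays with ad age. The formula below is purely hypothetical (the talk does not give the real one): a quality value multiplied by an exponential freshness factor. Changing the half-life changes the ranking immediately, with no stored scores to refresh.

```python
from datetime import date


def score(listing, today, half_life_days=30.0):
    """Hypothetical score: quality damped by ad age (halves every half_life_days)."""
    age_days = (today - listing["published"]).days
    freshness = 0.5 ** (age_days / half_life_days)
    return listing["quality"] * freshness


def rank(listings, today, **params):
    """Rank listings by a score computed on the fly."""
    return sorted(listings, key=lambda l: score(l, today, **params), reverse=True)


today = date(2012, 9, 27)
listings = [
    {"id": "a", "quality": 1.0, "published": date(2012, 7, 29)},  # 60 days old
    {"id": "b", "quality": 0.5, "published": date(2012, 9, 27)},  # published today
]

print([l["id"] for l in rank(listings, today)])                      # fresh ad wins
print([l["id"] for l in rank(listings, today, half_life_days=365)])  # quality wins
```

In a stored-score design the second ranking would have required recalculating and rewriting the score of every listing.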




Monitoring

Sematext SPM offers a (currently free) ES monitoring solution
   Cluster Health       Search rate & latency      Disk
   Index Stats          Cache                      Network
   Shard Stats          CPU & RAM                  JVM & GC




Tooling

ElasticSearch-Head is a GUI for browsing/interacting with a cluster




Backups

 We take periodic copies from the Gateway
    Because the Gateway is no cure for accidental deletions or bugs
    S3cmd syncs S3 gateway contents to local folder
         Expect some errors here as files get deleted/modified
    Disables snapshots to gateway
    Syncs again (no errors this time and much faster)
    Reenables snapshots to gateway
    Zips local folder contents, splits into smaller files & uploads to secondary S3 bucket
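The order of the steps above matters: a tolerant first sync, then a quiet window for a clean second sync, with snapshots re-enabled no matter what. The sketch below captures only that control flow; `sync`, `set_snapshots` and `archive` are hypothetical callables standing in for the s3cmd call, the cluster settings change and the zip/split/upload step of the real script.

```python
def backup_gateway(sync, set_snapshots, archive):
    """Run the backup steps in order, always re-enabling gateway snapshots."""
    sync(tolerate_errors=True)       # first pass: files may change underneath us
    set_snapshots(False)             # stop the cluster from writing to the gateway
    try:
        sync(tolerate_errors=False)  # second pass: small delta, should be clean
    finally:
        set_snapshots(True)          # re-enable even if the second sync failed
    archive()                        # zip, split and upload the local copy


# Demo with recording stubs instead of real commands.
calls = []
backup_gateway(
    sync=lambda tolerate_errors: calls.append(("sync", tolerate_errors)),
    set_snapshots=lambda enabled: calls.append(("snapshots", enabled)),
    archive=lambda: calls.append(("archive",)),
)
print(calls)
```

If the second sync raises, the exception propagates after snapshots are switched back on, so a broken copy is never archived.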




Get the script here: http://labs.spitogatos.gr/?p=17


Learnings

Issues & lessons learned:
   Faceted search can return wrong (smaller) counts (on multiple shards)
        Due to the way sorting/merging is done
        Increase the facet size field depending on the cardinality of the faceted field
   We use Elastica – a PHP client for ElasticSearch - https://github.com/ruflin/Elastica
        Lacking Document Routing and Version Type support
        Pull request by our own Jerry Manolarakis to add setRouting, setVersionType
   Filters vs queries (Query DSL)
        Filters perform an order of magnitude better than plain queries since no scoring is
         performed and they are automatically cached.
   Do it! Your DB will thank you
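The facet issue mentioned at the top of this list is easy to reproduce on paper: each shard returns only its own top terms, and the merged totals miss whatever a shard left out. The simulation below uses made-up counts for three agents on two shards; `shard_size` is a name chosen for the example to represent asking each shard for more terms.

```python
from collections import Counter


def top_terms(counts, size):
    """What a single shard returns: its own `size` most frequent terms."""
    return dict(Counter(counts).most_common(size))


def merged_facet(shards, size, shard_size=None):
    """Merge per-shard top terms and return the overall top `size`."""
    shard_size = shard_size or size
    merged = Counter()
    for counts in shards:
        merged.update(top_terms(counts, shard_size))
    return merged.most_common(size)


# Listings per agent on each shard; true totals are a=20, b=16, c=9.
shards = [
    {"a": 10, "b": 9, "c": 1},
    {"a": 10, "c": 8, "b": 7},
]

print(merged_facet(shards, size=2))                # b is under-counted
print(merged_facet(shards, size=2, shard_size=3))  # counts are exact
```

Agent b misses the second shard's top two, so its 7 listings there are silently dropped; requesting more terms per shard than are finally displayed closes the gap, at the cost of larger per-shard responses.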




CPU Utilization                                  Response time pattern

Read more
    Useful resources:

   https://speakerdeck.com/u/jmikola/p/symfony-live-london-elasticsearch
   http://blog.sematext.com/2010/05/03/elastic-search-distributed-lucene/
   http://www.slideshare.net/elasticsearch/elasticsearch-at-berlinbuzzwords-2010
   http://www.slideshare.net/kucrafal/scaling-massive-elastic-search-clusters-rafa-ku-sematext


    Need help integrating ElasticSearch into your app?




    http://bacterials.net/


                                                     Follow us on twitter: @spitogatosLabs
                                                 Check out our blog: http://labs.spitogatos.gr

#questions
