Munching & crunching
Lucene index post-processing and applications



              Andrzej Białecki

  <andrzej.bialecki@lucidimagination.com>
Intro
   Started using Lucene in 2003 (1.2-dev?)
   Created Luke – the Lucene Index Toolbox
   Nutch, Hadoop committer, Lucene PMC member
   Nutch project lead
Munching and crunching? But really...
   Stir your imagination
   Think outside the box
   Show some unorthodox use and practical applications
   Close ties to scalability, performance, distributed search and
    query latency
Agenda
  ●   Post-processing
      ●   Splitting, merging, sorting, pruning

  ●   Tiered search
  ●   Bit-wise search
  ●   (Map-reduce indexing models)




Apache Lucene EuroCon   20 May 2010
Why post-process indexes?
     Isn't it better to build them right from the start?
     Sometimes it's not convenient or feasible
         Correcting impact of unexpected common words
          Targeting a specific index size or composition:
             Creating evenly-sized shards
             Re-balancing shards across servers
             Fitting indexes completely in RAM

     … and sometimes impossible to do it right
         Trimming index size while retaining quality of top-N results


Apache Lucene EuroCon   20 May 2010
Merging indexes
     It's easy to merge several small indexes into one
     Fundamental Lucene operation during indexing
      (SegmentMerger)
         Command-line utilities exist: IndexMergeTool
         API:
             IndexWriter.addIndexes(IndexReader...)
             IndexWriter.addIndexesNoOptimize(Directory...)
             Hopefully a more flexible API on the flex branch

     Solr: through CoreAdmin action=mergeindexes
             Note: schema must be compatible
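
A minimal merge sketch using the 3.x-era calls listed above; the exact IndexWriter constructor and analyzer choice vary between Lucene versions, so treat the wiring (paths included) as illustrative:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class MergeIndexes {
      public static void main(String[] args) throws Exception {
        Directory target = FSDirectory.open(new File("merged-index"));
        Directory[] sources = {
            FSDirectory.open(new File("shard-a")),   // hypothetical source paths
            FSDirectory.open(new File("shard-b"))
        };
        IndexWriter writer = new IndexWriter(target,
            new StandardAnalyzer(Version.LUCENE_30),
            true, IndexWriter.MaxFieldLength.UNLIMITED);
        // Copy segments from the source directories without optimizing them first
        writer.addIndexesNoOptimize(sources);
        writer.close();
      }
    }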
Apache Lucene EuroCon   20 May 2010
Splitting indexes
     IndexSplitter tool:
         Moves whole segments to standalone indexes
         Pros: nearly no IO/CPU involved – just rename & create a new SegmentInfos file
         Cons:
             Requires a multi-segment index!
             Very limited control over content of resulting indexes → MergePolicy
     [Diagram: an original multi-segment index (segments_2 pointing at segments _0, _1, _2)
      is split into new standalone indexes, each with its own segments_0 file]




Apache Lucene EuroCon    20 May 2010
Splitting indexes, take 2
     MultiPassIndexSplitter tool:
         Uses an IndexReader that keeps the list of deletions in memory
         The source index remains unmodified
         For each partition:
             Marks all source documents not in the partition as deleted
             Writes a target split using IndexWriter.addIndexes(IndexReader)
                 IndexWriter knows how to skip deleted documents
             Removes the “deleted” mark from all source documents
     Pros:
         Arbitrary splits possible (even partially overlapping)
         Source index remains intact
     Cons:
         Reads the complete index N times – I/O is O(N * indexSize)
         Takes twice as much space (source index remains intact)
          … but maybe it's a feature?
     [Diagram: original index with documents d1–d4; pass 1 copies d1 and d3, pass 2
      copies d2 and d4 into the new indexes, using in-memory deletion marks]
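
The flow above can be sketched against the plain 3.x-era IndexReader delete/undelete API. This is only a rough illustration of the idea, not the actual contrib tool (which keeps its "deletion" marks purely in memory); the even/odd partitioning, paths and analyzer are assumptions, and it presumes the source index has no pre-existing deletions:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class TwoWaySplit {
      public static void main(String[] args) throws Exception {
        for (int part = 0; part < 2; part++) {
          // One full pass over the source per partition (hence I/O = O(N * indexSize))
          IndexReader reader = IndexReader.open(FSDirectory.open(new File("source-index")), false);
          for (int doc = 0; doc < reader.maxDoc(); doc++) {
            if (doc % 2 != part) {
              reader.deleteDocument(doc);     // mark everything outside this partition as deleted
            }
          }
          IndexWriter writer = new IndexWriter(
              FSDirectory.open(new File("split-" + part)),
              new StandardAnalyzer(Version.LUCENE_30),
              true, IndexWriter.MaxFieldLength.UNLIMITED);
          writer.addIndexes(reader);          // deleted documents are skipped during the copy
          writer.close();
          reader.undeleteAll();               // restore the source (would also revive real deletions)
          reader.close();
        }
      }
    }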

Apache Lucene EuroCon   20 May 2010
Splitting indexes, take 3
     SinglePassSplitter
         Uses the same processing workflow as SegmentMerger, only with multiple outputs
             Write new SegmentInfos and FieldInfos
             Merge (pass-through) stored fields
             Merge (pass-through) term dictionary
             Merge (pass-through) postings with payloads
             Merge (pass-through) term vectors
         Renumbers document id-s on-the-fly to form a contiguous space
         Pros: flexibility as with MultiPassIndexSplitter
         Status: work started, to be contributed soon...
     [Diagram: a partitioner routes source documents 1 2 3 4 5 6 7 8 9 10 ... into
      partitions (e.g. 1 3 5 … and 2 4 6 …); each output is renumbered to a contiguous
      1' 2' 3' ... space, with its own stored fields, term dictionary, postings and
      term vectors]
Apache Lucene EuroCon   20 May 2010
Splitting indexes, summary
     SinglePassSplitter – best tradeoff of flexibility/IO/CPU
     Interesting scenarios with SinglePassSplitter:
         Split by ranges, round-robin, by field value, by frequency, to a target size, etc...
          “Extract” a handful of documents to a separate index
         “Move” documents between indexes:
             “extract” from source
             Add to target (merge)
             Delete from source
         Now the source index may reside on a network FS – the amount of IO is
          O(1 * indexSize)

Apache Lucene EuroCon   20 May 2010
Index sorting - introduction
     “Early termination” technique
         If full execution of a query takes too long then terminate and estimate

     Termination conditions:
         Number of documents – LimitedCollector in Nutch
         Time – TimeLimitingCollector
          (see also extended LUCENE-1720 TimeLimitingIndexReader)

     Problems:
         Difficult to estimate total hits
         Important docs may not be collected if they have high docID-s
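
A hedged sketch of the time-based condition, using the TimeLimitingCollector named above with its 3.x-era constructor (a wrapped collector plus a millisecond budget; later Lucene versions changed the constructor to take an explicit counter):

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TimeLimitingCollector;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.TopScoreDocCollector;

    public class EarlyTermination {
      // Returns whatever was collected before the time budget ran out.
      static TopDocs searchWithBudget(IndexSearcher searcher, Query query,
                                      int n, long timeAllowedMs) throws Exception {
        TopScoreDocCollector topDocs = TopScoreDocCollector.create(n, true);
        TimeLimitingCollector collector = new TimeLimitingCollector(topDocs, timeAllowedMs);
        try {
          searcher.search(query, collector);
        } catch (TimeLimitingCollector.TimeExceededException e) {
          // Partial results only: the total hit count is an underestimate.
        }
        return topDocs.topDocs();
      }
    }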


Apache Lucene EuroCon   20 May 2010
Index sorting - details
     Define a global ordering of documents (e.g. PageRank, popularity, quality, etc.)
         Documents with good rank should generally score higher
     Sort (internal) ID-s by this ordering, descending
     Map from old to new ID-s to follow this ordering
     Change the ID-s in postings
     Example:
         Original index:  doc ID 0 1 2 3 4 5 6 7, rank c e h f a d g b   → early termination == poor
         ID mapping:      old doc ID 4 7 0 5 1 3 6 2 → new doc ID 0 1 2 3 4 5 6 7
         Sorted index:    doc ID 0 1 2 3 4 5 6 7, rank a b c d e f g h   → early termination == good
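
To make the mapping concrete, a small stand-alone helper (illustrative only, not the Nutch IndexSorter code): sort the old docIDs by descending rank, and let each document's position in that order become its new ID:

    import java.util.Arrays;
    import java.util.Comparator;

    public class DocIdMapping {
      // oldToNew[oldDocId] == newDocId, ordered by descending rank
      static int[] oldToNew(final float[] rank) {
        Integer[] byRank = new Integer[rank.length];
        for (int i = 0; i < byRank.length; i++) byRank[i] = i;
        Arrays.sort(byRank, new Comparator<Integer>() {
          public int compare(Integer a, Integer b) {
            return Float.compare(rank[b], rank[a]);   // best rank first
          }
        });
        int[] map = new int[rank.length];
        for (int newId = 0; newId < byRank.length; newId++) {
          map[byRank[newId]] = newId;
        }
        return map;
      }
    }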
Apache Lucene EuroCon   20 May 2010
Index sorting - summary
     Implementation in Nutch: IndexSorter
         Based on PageRank – sorts by decreasing page quality
         Uses FilterIndexReader

     NOTE: “Early termination” will (significantly) reduce quality of
      results with non-sorted indexes – use both or neither




Apache Lucene EuroCon   20 May 2010
Index pruning
     Quick refresh on the index composition:
         Stored fields
         Term dictionary
         Term frequency data
         Positional data (postings)
             With or without payload data
         Term frequency vectors

     The number of documents may run into the millions
     The number of terms is commonly well into the millions
         Not to mention individual postings …
Apache Lucene EuroCon   20 May 2010
Index pruning & top-N retrieval
     N is usually << 1000
     Very often search quality is judged based on top-20
     Question:
       Do we really need to keep and process ALL terms and ALL
        postings for a good-quality top-N search for common
        queries?




Apache Lucene EuroCon   20 May 2010
Index pruning hypothesis
     There should be a way to remove some of the less important
      data
         While retaining the quality of top-N results!
     Question: what data is less important?
     Some answers:
         That of poorly-scoring documents
         That of common (less selective) terms
     Dynamic pruning: skips less relevant data during query
      processing → runtime cost...
     But can we do this work in advance (static pruning)?
Apache Lucene EuroCon   20 May 2010
What do we need for top-N results?
     Work backwards
     “Foreach” common query:
         Run it against the full index
         Record the top-N matching documents

     “Foreach” document in results:
         Record terms and term positions that contributed to the score

     Finally: remove all non-recorded postings and terms
     First proposed by D. Carmel (2001) for single term queries
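
The procedure can be sketched as a loop. This is a deliberately coarse illustration (it keeps every query term for every top-N hit, rather than only the positions that actually contributed to the score), and the query log, the keep-set encoding and the class name are assumptions:

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    public class ContributionRecorder {
      // Entries look like "field:text@docId"; a pruning pass would keep only these postings.
      static Set<String> record(IndexSearcher searcher, List<Query> commonQueries, int n)
          throws IOException {
        Set<String> keep = new HashSet<String>();
        for (Query q : commonQueries) {
          TopDocs top = searcher.search(q, n);
          Set<Term> terms = new HashSet<Term>();
          q.rewrite(searcher.getIndexReader()).extractTerms(terms);   // the query's terms
          for (ScoreDoc sd : top.scoreDocs) {
            for (Term t : terms) {
              keep.add(t.field() + ":" + t.text() + "@" + sd.doc);
            }
          }
        }
        return keep;
      }
    }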
Apache Lucene EuroCon   20 May 2010
… but it's too simplistic:
     [Example: a tiny index with terms quick (0), brown (1), fox (2), shown before and
      after pruning]
         Query 1: brown        – topN(full) == topN(pruned)
         Query 2: “brown fox”  – topN(full) != topN(pruned)
     Hmm, what about less common queries?
         80/20 rule of “good enough”?
     Term-level is too primitive
         Document-centric pruning
         Impact-centric pruning
         Position-centric pruning
Apache Lucene EuroCon   20 May 2010
Smarter pruning
     Not all term positions are equally important
     Metrics of term and position importance:
         Plain in-document term frequency (TF)
         TF-IDF score obtained from top-N results of TermQuery (Carmel method)
         Residual IDF – a measure of term informativeness (selectivity)
         Key-phrase positions, or term clusters
         Kullback-Leibler divergence from a language model
     [Chart: term frequency distribution (Freq vs. Term), comparing the corpus language
      model with a document language model]
Apache Lucene EuroCon   20 May 2010
Applications
     Obviously, performance-related
         Some papers claim a modest impact on quality when pruning up to 60% of
          postings
         See LUCENE-1812 for some benchmarks confirming this claim

     Removal / restructuring of (some) stored content
     Legacy indexes, or ones created with a fossilized external indexing chain




Apache Lucene EuroCon   20 May 2010
Stored field pruning
     Some stored data can be compacted, removed, or restructured
     Use case: source text for generating “snippets”
         Split content into sentences
         Reorder sentences by a static “importance” score (e.g. how many rare terms they
          contain)
              NOTE: this may use collection-wide statistics!
         Remove the bottom x% of sentences
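
A plain-Java sketch of the snippet-source compaction described above; the sentence splitter and the importance scorer (which could be backed by collection-wide statistics such as docFreq) are stand-ins, not part of any Lucene API:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;

    public class SnippetSourcePruning {
      interface SentenceScorer { double importance(String sentence); }

      // Reorder sentences by descending importance and keep only the top fraction.
      static String compact(String content, double keepFraction, final SentenceScorer scorer) {
        List<String> sentences = new ArrayList<String>(
            Arrays.asList(content.split("(?<=[.!?])\\s+")));
        Collections.sort(sentences, new Comparator<String>() {
          public int compare(String a, String b) {
            return Double.compare(scorer.importance(b), scorer.importance(a));
          }
        });
        int keep = Math.max(1, (int) Math.round(sentences.size() * keepFraction));
        StringBuilder sb = new StringBuilder();
        for (String s : sentences.subList(0, keep)) {
          if (sb.length() > 0) sb.append(' ');
          sb.append(s);
        }
        return sb.toString();
      }
    }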




Apache Lucene EuroCon   20 May 2010
LUCENE-1812: contrib/pruning tools and API
     Based on FilterIndexReader
     Produces output indexes via
      IndexWriter.addIndexes(IndexReader[])

     Design:
         PruningReader – subclass of FilterIndexReader with necessary boilerplate and
          hooks for pruning policies
         StorePruningPolicy – implements rules for modifying stored fields (and list of field
          names)
         TermPruningPolicy – implements rules for modifying term dictionary, postings and
          payloads
         PruningTool – command-line utility to configure and run PruningReader
Apache Lucene EuroCon   20 May 2010
Details of LUCENE-1812
     [Diagram: source index (stored fields, term dict, postings+payloads, term vectors)
      → PruningReader applying StorePruningPolicy and TermPruningPolicy
      → IndexWriter via IW.addIndexes(IndexReader...)
      → target index with the same structure]
     IndexWriter consumes source data filtered via PruningReader
     Internal document ID-s are preserved – suitable for bitset ops
      and retrieval by internal ID
         If source index has no deletions
         If target index is empty
Apache Lucene EuroCon   20 May 2010
API: StorePruningPolicy
     May remove (some) fields from (some) documents
     May as well modify the values
     May rename / add fields




Apache Lucene EuroCon   20 May 2010
API: TermPruningPolicy
     Thresholds (in the order of precedence):
         Per term
         Per field
         Default

     Plain TF pruning – TFTermPruningPolicy
         Removes all postings for a term where TF (in-document term frequency) is below
          a threshold

     Top-N term-level – CarmelTermPruningPolicy
         TermQuery search for top-N docs
         Removes all postings for a term outside the top-N docs
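
The decision at the heart of the Carmel-style policy fits in a few lines; this shows the idea only and is not the CarmelTermPruningPolicy API:

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    public class CarmelDecision {
      // Documents whose postings for 'term' survive: those the term itself ranks in its top N.
      static Set<Integer> docsToKeep(IndexSearcher searcher, Term term, int n) throws IOException {
        Set<Integer> keep = new HashSet<Integer>();
        TopDocs top = searcher.search(new TermQuery(term), n);
        for (ScoreDoc sd : top.scoreDocs) {
          keep.add(sd.doc);
        }
        return keep;   // postings for this term in any other document get pruned
      }
    }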

Apache Lucene EuroCon   20 May 2010
Results so far...
     TF pruning:
         Term query recall very good
         Phrase query recall very poor – expected...

     Carmel pruning – slightly better term position selection, but
      still heavy negative impact on phrase queries
     Recognizing and keeping key phrases would help
         Use query log for frequent-phrase mining?
         Use collocation miner (Mahout)?
         Savings on pruning will be smaller, but quality will significantly improve


Apache Lucene EuroCon   20 May 2010
References
     Static Index Pruning for Information Retrieval Systems, Carmel et al., SIGIR'01
     A document-centric approach to static index pruning in text retrieval systems, Büttcher & Clarke, CIKM'06
     Locality-based pruning methods for web search, de Moura et al., ACM TOIS '08
     Pruning strategies for mixed-mode querying, Anh & Moffat, CIKM'06

Apache Lucene EuroCon   20 May 2010
Index pruning applied ...
     Index 1: A heavily pruned index that fits in RAM:
         excellent speed
         poor search quality for many less-common query types
     Index 2: Slightly pruned index that fits partially in RAM:
         good speed, good quality for many common query types,
         still poor quality for some other rare query types
     Index 3: Full index on disk:
         Slow speed
         Excellent quality for all query types
     QUESTION: Can we come up with a combined search strategy?
Apache Lucene EuroCon   20 May 2010
Tiered search
     [Diagram: three tiers of search boxes – tier 1 holds a 70% pruned index in RAM,
      tier 2 a 30% pruned index on SSD, tier 3 the full (0% pruned) index on HDD;
      a predict/evaluate component decides which tier answers each query]
     Can we predict the best tier without actually running the query?
     How to evaluate if the predictor was right?
Apache Lucene EuroCon     20 May 2010
Tiered search: tier selector and evaluator
     Best tier can be predicted (often enough):
         Carmel pruning yields excellent results for simple term queries
         Phrase-based pruning yields good results for phrase queries (though less often)

     Quality evaluator: when is predictor wrong?
         Could be very complex, based on gold standard and qrels
         Could be very simple: acceptable number of results

     Fall-back strategy:
         Serial: poor latency, but minimizes load on bulkier tiers
         Partially parallel:
             submit to the next tier only the border-line queries
             Pick the first acceptable answer – reduces latency
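
A minimal sketch of the serial fall-back strategy with the simple "acceptable number of results" evaluator; the tier ordering, the threshold and the searcher wiring are assumptions rather than an existing Lucene or Solr API (TopDocs.totalHits is the 3.x-era int):

    import java.io.IOException;
    import java.util.List;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;

    public class TieredSearch {
      // Tiers are ordered from most pruned (cheapest) to the full index.
      static TopDocs search(List<IndexSearcher> tiers, Query q, int n, int minAcceptableHits)
          throws IOException {
        TopDocs result = null;
        for (IndexSearcher tier : tiers) {
          result = tier.search(q, n);
          if (result.totalHits >= minAcceptableHits) {
            return result;            // good enough – stop at the cheapest tier that answers
          }
        }
        return result;                // fell through to the full, unpruned index
      }
    }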
Apache Lucene EuroCon   20 May 2010
Tiered versus distributed
     Both applicable to indexes and query loads exceeding single
      machine capabilities
     Distributed sharded search:
         increases latency for all queries (send + execute + integrate from all shards)
             … plus replicas to increase QPS:
                 Increases hardware / management costs
                 While not improving latency

     Tiered search:
         Excellent latency for common queries
         More complex to build and maintain
         Arguably lower hardware cost for comparable scale / QPS
Apache Lucene EuroCon   20 May 2010
Tiered search benefits
     Majority of common queries handled by first tier: RAM-based,
      high QPS, low latency
     Partially parallel mode reduces average latency for more
      complex queries
     Hardware investment likely smaller than for distributed search
      setup of comparable QPS / latency




Apache Lucene EuroCon   20 May 2010
Example Lucene API for tiered search
                                      Could be implemented as
                                      a Solr SearchComponent...




Apache Lucene EuroCon   20 May 2010
Lucene implementation details




Apache Lucene EuroCon   20 May 2010
References
     Efficiency trade-offs in two-tier web search systems, Baeza-Yates et al., SIGIR'09
     ResIn: A combination of results caching and index pruning for high-performance web search engines, Baeza-Yates et al., SIGIR'08
     Three-level caching for efficient query processing in large Web search engines, Long & Suel, WWW'05



Apache Lucene EuroCon   20 May 2010
Bit-wise search
     Given a bit pattern query:
      1010 1001 0101 0001
     Find documents with matching bit patterns in a field
     Applications:
         Permission checking
         De-duplication
         Plagiarism detection

     Two variants: non-scoring (filtering) and scoring

Apache Lucene EuroCon   20 May 2010
Non-scoring bitwise search (LUCENE-2460)
     Builds a Filter from the intersection of:
         DocIdSet of documents matching a Query
         Integer value and operation (AND, OR, XOR)
         “Value source” that caches integer values of a field (from FieldCache)
     Corresponding Solr field type and QParser: SOLR-1913
     Useful for filtering (not scoring)
     [Example: docs 0–4 carry flags 0x01–0x05 and type a/b; the DocIdSet for “type:a”
      is intersected with op=AND, val=0x01 over the cached flags values to produce the
      final Filter]
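
A rough sketch of the idea, not the LUCENE-2460 patch itself: a Filter that keeps documents whose cached integer field value matches the query value under AND semantics, written against the 3.x-era FieldCache/Filter/OpenBitSet API:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.DocIdSet;
    import org.apache.lucene.search.FieldCache;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.util.OpenBitSet;

    public class BitwiseFilter extends Filter {
      private final String field;
      private final int value;

      public BitwiseFilter(String field, int value) {
        this.field = field;
        this.value = value;
      }

      @Override
      public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
        int[] flags = FieldCache.DEFAULT.getInts(reader, field);  // cached per-doc int values
        OpenBitSet bits = new OpenBitSet(reader.maxDoc());
        for (int doc = 0; doc < flags.length; doc++) {
          // AND semantics: the document must have all bits of 'value' set
          if ((flags[doc] & value) == value && !reader.isDeleted(doc)) {
            bits.set(doc);
          }
        }
        return bits;   // OpenBitSet is itself a DocIdSet
      }
    }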

Apache Lucene EuroCon   20 May 2010
Scoring bitwise search (SOLR-1918)
     BooleanQuery in disguise:
         1010 = Y-1000 | N-0100 | Y-0010 | N-0001
     Solr 32-bit BitwiseField
         Analyzer creates the bitmasks field
         Currently supports only a single value per field
         Creates a BooleanQuery from the query int value
     Useful when searching for best matching (ranked) bit patterns
     [Example: D1=1010, D2=1011, D3=0011 indexed as per-bit Y/N tokens;
      Q = bits:Y1000 bits:N0100 bits:Y0010 bits:N0001 →
      D1 matches 4 of 4 → #1, D2 matches 3 of 4 → #2, D3 matches 2 of 4 → #3]
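
The "BooleanQuery in disguise" can be sketched as follows; the per-bit Y/N token convention follows the slide, while the field name, bit width and class are assumptions rather than the SOLR-1918 code:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    public class BitwiseScoringQuery {
      public static BooleanQuery build(String field, int value, int numBits) {
        BooleanQuery bq = new BooleanQuery();
        for (int bit = numBits - 1; bit >= 0; bit--) {
          int mask = 1 << bit;
          // One SHOULD clause per bit: documents score higher the more bits they match.
          String token = ((value & mask) != 0 ? "Y" : "N") + maskString(mask, numBits);
          bq.add(new TermQuery(new Term(field, token)), BooleanClause.Occur.SHOULD);
        }
        return bq;
      }

      private static String maskString(int mask, int numBits) {
        StringBuilder sb = new StringBuilder();
        for (int bit = numBits - 1; bit >= 0; bit--) {
          sb.append((mask & (1 << bit)) != 0 ? '1' : '0');
        }
        return sb.toString();
      }

      public static void main(String[] args) {
        // 1010 → bits:Y1000 bits:N0100 bits:Y0010 bits:N0001, as on the slide
        System.out.println(build("bits", Integer.parseInt("1010", 2), 4));
      }
    }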
Apache Lucene EuroCon       20 May 2010
Summary
     Index post-processing covers a range of useful scenarios:
         Merging and splitting, remodeling, extracting, moving ...
         Pruning less important data

     Tiered search + pruned indexes:
         High performance
         Practically unchanged quality
         Less hardware

     Bitwise search:
         Filtering by matching bits
         Ranking by best matching patterns
Apache Lucene EuroCon   20 May 2010
Meta-summary
     Stir your imagination
     Think outside the box
     Show some unorthodox use and practical applications
     Close ties to scalability, performance, distributed search and
      query latency




Apache Lucene EuroCon   20 May 2010
Q&A




Apache Lucene EuroCon   20 May 2010
Thank you!




Apache Lucene EuroCon   05/25/10
Massive indexing with map-reduce
     Map-reduce indexing models
         Google model
         Nutch model
         Modified Nutch model
         Hadoop contrib/indexing model

     Tradeoff analysis and recommendations




Apache Lucene EuroCon   20 May 2010
Google model
   Map():
       IN: <seq, docText>
       terms = analyze(docText)
       foreach (term)
           emit(term, <seq, position>)

   Reduce():
       IN: <term, list(<seq, pos>)>
       foreach (<seq, pos>)
           docId = calculate(seq, taskId)
           Postings(term).append(docId, pos)

       Pros: analysis on the map side
       Cons:
           Too many tiny intermediate records → Combiner
           DocID synchronization across map and reduce tasks
           Lucene: very difficult (impossible?) to create index this way
Apache Lucene EuroCon    20 May 2010
Nutch model (also in SOLR-1301)
   Map():
       IN: <seq, docPart>
       docId = docPart.get(“url”)
       emit(docId, docPart)

   Reduce():
       IN: <docId, list(docPart)>
       doc = luceneDoc(list(docPart))
       indexWriter.addDocument(doc)



       Pros: easy to build Lucene index
       Cons:
           Analysis on the reduce side
           Many costly merge operations (large indexes built from scratch on reduce side)
            (plus currently needs copy from local FS to HDFS – see LUCENE-2373)
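
A skeletal Hadoop reducer in the spirit of this model: parts grouped by docId arrive together, are assembled into one Lucene Document, and added to a task-local IndexWriter whose output would later be copied to HDFS. The value encoding ("field<TAB>text"), the field handling and the writer setup are assumptions for illustration; the actual Nutch / SOLR-1301 code differs in many details:

    import java.io.File;
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class IndexingReducer extends Reducer<Text, Text, Text, Text> {
      private IndexWriter writer;

      @Override
      protected void setup(Context context) throws IOException {
        // Task-local index on the local FS; the finished shard is copied out afterwards
        writer = new IndexWriter(
            FSDirectory.open(new File("index-" + context.getTaskAttemptID())),
            new StandardAnalyzer(Version.LUCENE_30),
            true, IndexWriter.MaxFieldLength.UNLIMITED);
      }

      @Override
      protected void reduce(Text docId, Iterable<Text> parts, Context context)
          throws IOException, InterruptedException {
        Document doc = new Document();
        doc.add(new Field("url", docId.toString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        for (Text part : parts) {
          String[] kv = part.toString().split("\t", 2);   // assumed "fieldName \t fieldText"
          if (kv.length == 2) {
            // Analysis happens here, on the reduce side – the main cost of this model
            doc.add(new Field(kv[0], kv[1], Field.Store.NO, Field.Index.ANALYZED));
          }
        }
        writer.addDocument(doc);
      }

      @Override
      protected void cleanup(Context context) throws IOException {
        writer.close();
      }
    }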
Apache Lucene EuroCon   20 May 2010
Modified Nutch model (N/A...)
   Map():
       IN: <seq, docPart>
       docId = docPart.get(“url”)
       ts = analyze(docPart)
       emit(docId, <docPart, ts>)

   Reduce():
       IN: <docId, list(<docPart, ts>)>
       doc = luceneDoc(list(<docPart, ts>))
       indexWriter.addDocument(doc)

       Pros:
           Analysis on map side
           Easy to build Lucene index
       Cons:
           Many costly merge operations (large indexes built from scratch on reduce side)
            (plus currently needs copy from local FS to HDFS – see LUCENE-2373)
Apache Lucene EuroCon   20 May 2010
Hadoop contrib/indexing model
   Map():
       IN: <seq, docText>
       doc = luceneDoc(docText)
       indexWriter.addDocument(doc)
       emit(random, indexData)

   Reduce():
       IN: <random, list(indexData)>
       foreach (indexData)
           indexWriter.addIndexes(indexData)

       Pros:
           analysis on the map side
           Many merges on the map side
        Also supports other operations (deletes, updates)
       Cons:
           Serialization is costly, records are big and require more RAM to sort
Apache Lucene EuroCon   20 May 2010
Massive indexing - summary
     If you first need to collect document parts → SOLR-1301 model
     If you use complex analysis → Hadoop contrib/index
        NOTE: there is no good integration yet between Solr and the Hadoop contrib/index module...




Apache Lucene EuroCon   20 May 2010
