Munching & crunching
Lucene index post-processing and applications



              Andrzej Białecki

  <andrzej.bialecki@lucidimagination.com>
Intro
   Started using Lucene in 2003 (1.2-dev?)
   Created Luke – the Lucene Index Toolbox
   Nutch, Hadoop committer, Lucene PMC member
   Nutch project lead
Munching and crunching? But really...
   Stir your imagination
   Think outside the box
   Show some unorthodox use and practical applications
   Close ties to scalability, performance, distributed search and
    query latency
Agenda
  ●   Post-processing
      ●   Splitting, merging, sorting, pruning

  ●   Tiered search
  ●   Bit-wise search
  ●   (Map-reduce indexing models)




Apache Lucene EuroCon   20 May 2010
Why post-process indexes?
     Isn't it better to build them right from the start?
     Sometimes it's not convenient or feasible
         Correcting impact of unexpected common words
          Targeting a specific index size or composition:
             Creating evenly-sized shards
             Re-balancing shards across servers
             Fitting indexes completely in RAM

     … and sometimes impossible to do it right
         Trimming index size while retaining quality of top-N results


Apache Lucene EuroCon   20 May 2010
Merging indexes
     It's easy to merge several small indexes into one
     Fundamental Lucene operation during indexing
      (SegmentMerger)
         Command-line utilities exist: IndexMergeTool
         API:
             IndexWriter.addIndexes(IndexReader...)
             IndexWriter.addIndexesNoOptimize(Directory...)
             Hopefully a more flexible API on the flex branch

     Solr: through CoreAdmin action=mergeindexes
             Note: schema must be compatible
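
A minimal merge sketch using the 3.x-era calls listed above; the exact IndexWriter constructor and analyzer choice vary between Lucene versions, so treat the wiring (paths included) as illustrative:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class MergeIndexes {
      public static void main(String[] args) throws Exception {
        Directory target = FSDirectory.open(new File("merged-index"));
        Directory[] sources = {
            FSDirectory.open(new File("shard-a")),   // hypothetical source paths
            FSDirectory.open(new File("shard-b"))
        };
        IndexWriter writer = new IndexWriter(target,
            new StandardAnalyzer(Version.LUCENE_30),
            true, IndexWriter.MaxFieldLength.UNLIMITED);
        // Copy segments from the source directories without optimizing them first
        writer.addIndexesNoOptimize(sources);
        writer.close();
      }
    }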
Apache Lucene EuroCon   20 May 2010
Splitting indexes
     IndexSplitter tool:
         Moves whole segments to standalone indexes
         Pros: nearly no IO/CPU involved – just rename & create a new SegmentInfos file
         Cons:
             Requires a multi-segment index!
             Very limited control over content of resulting indexes → MergePolicy
     [Diagram: an original multi-segment index (segments_2 pointing at segments _0, _1, _2)
      is split into new standalone indexes, each with its own segments_0 file]




Apache Lucene EuroCon    20 May 2010
Splitting indexes, take 2
     MultiPassIndexSplitter tool:
         Uses an IndexReader that keeps the list of deletions in memory
         The source index remains unmodified
         For each partition:
             Marks all source documents not in the partition as deleted
             Writes a target split using IndexWriter.addIndexes(IndexReader)
                 IndexWriter knows how to skip deleted documents
             Removes the “deleted” mark from all source documents
     Pros:
         Arbitrary splits possible (even partially overlapping)
         Source index remains intact
     Cons:
         Reads the complete index N times – I/O is O(N * indexSize)
         Takes twice as much space (source index remains intact)
          … but maybe it's a feature?
     [Diagram: original index with documents d1–d4; pass 1 copies d1 and d3, pass 2
      copies d2 and d4 into the new indexes, using in-memory deletion marks]
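
The flow above can be sketched against the plain 3.x-era IndexReader delete/undelete API. This is only a rough illustration of the idea, not the actual contrib tool (which keeps its "deletion" marks purely in memory); the even/odd partitioning, paths and analyzer are assumptions, and it presumes the source index has no pre-existing deletions:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class TwoWaySplit {
      public static void main(String[] args) throws Exception {
        for (int part = 0; part < 2; part++) {
          // One full pass over the source per partition (hence I/O = O(N * indexSize))
          IndexReader reader = IndexReader.open(FSDirectory.open(new File("source-index")), false);
          for (int doc = 0; doc < reader.maxDoc(); doc++) {
            if (doc % 2 != part) {
              reader.deleteDocument(doc);     // mark everything outside this partition as deleted
            }
          }
          IndexWriter writer = new IndexWriter(
              FSDirectory.open(new File("split-" + part)),
              new StandardAnalyzer(Version.LUCENE_30),
              true, IndexWriter.MaxFieldLength.UNLIMITED);
          writer.addIndexes(reader);          // deleted documents are skipped during the copy
          writer.close();
          reader.undeleteAll();               // restore the source (would also revive real deletions)
          reader.close();
        }
      }
    }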

Apache Lucene EuroCon   20 May 2010
Splitting indexes, take 3
     SinglePassSplitter
         Uses the same processing workflow as SegmentMerger, only with multiple outputs
             Write new SegmentInfos and FieldInfos
             Merge (pass-through) stored fields
             Merge (pass-through) term dictionary
             Merge (pass-through) postings with payloads
             Merge (pass-through) term vectors
         Renumbers document id-s on-the-fly to form a contiguous space
         Pros: flexibility as with MultiPassIndexSplitter
         Status: work started, to be contributed soon...
     [Diagram: a partitioner routes source documents 1 2 3 4 5 6 7 8 9 10 ... into
      partitions (e.g. 1 3 5 … and 2 4 6 …); each output is renumbered to a contiguous
      1' 2' 3' ... space, with its own stored fields, term dictionary, postings and
      term vectors]
Apache Lucene EuroCon   20 May 2010
Splitting indexes, summary
     SinglePassSplitter – best tradeoff of flexibility/IO/CPU
     Interesting scenarios with SinglePassSplitter:
         Split by ranges, round-robin, by field value, by frequency, to a target size, etc...
          “Extract” a handful of documents to a separate index
         “Move” documents between indexes:
             “extract” from source
             Add to target (merge)
             Delete from source
         Now the source index may reside on a network FS – the amount of IO is
          O(1 * indexSize)

Apache Lucene EuroCon   20 May 2010
Index sorting - introduction
     “Early termination” technique
         If full execution of a query takes too long then terminate and estimate

     Termination conditions:
         Number of documents – LimitedCollector in Nutch
         Time – TimeLimitingCollector
          (see also extended LUCENE-1720 TimeLimitingIndexReader)

     Problems:
         Difficult to estimate total hits
         Important docs may not be collected if they have high docID-s
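
A hedged sketch of the time-based condition, using the TimeLimitingCollector named above with its 3.x-era constructor (a wrapped collector plus a millisecond budget; later Lucene versions changed the constructor to take an explicit counter):

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TimeLimitingCollector;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.TopScoreDocCollector;

    public class EarlyTermination {
      // Returns whatever was collected before the time budget ran out.
      static TopDocs searchWithBudget(IndexSearcher searcher, Query query,
                                      int n, long timeAllowedMs) throws Exception {
        TopScoreDocCollector topDocs = TopScoreDocCollector.create(n, true);
        TimeLimitingCollector collector = new TimeLimitingCollector(topDocs, timeAllowedMs);
        try {
          searcher.search(query, collector);
        } catch (TimeLimitingCollector.TimeExceededException e) {
          // Partial results only: the total hit count is an underestimate.
        }
        return topDocs.topDocs();
      }
    }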


Apache Lucene EuroCon   20 May 2010
Index sorting - details
     Define a global ordering of documents (e.g. PageRank, popularity, quality, etc.)
         Documents with good rank should generally score higher
     Sort (internal) ID-s by this ordering, descending
     Map from old to new ID-s to follow this ordering
     Change the ID-s in postings
     Example:
         Original index:  doc ID 0 1 2 3 4 5 6 7, rank c e h f a d g b   → early termination == poor
         ID mapping:      old doc ID 4 7 0 5 1 3 6 2 → new doc ID 0 1 2 3 4 5 6 7
         Sorted index:    doc ID 0 1 2 3 4 5 6 7, rank a b c d e f g h   → early termination == good
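
To make the mapping concrete, a small stand-alone helper (illustrative only, not the Nutch IndexSorter code): sort the old docIDs by descending rank, and let each document's position in that order become its new ID:

    import java.util.Arrays;
    import java.util.Comparator;

    public class DocIdMapping {
      // oldToNew[oldDocId] == newDocId, ordered by descending rank
      static int[] oldToNew(final float[] rank) {
        Integer[] byRank = new Integer[rank.length];
        for (int i = 0; i < byRank.length; i++) byRank[i] = i;
        Arrays.sort(byRank, new Comparator<Integer>() {
          public int compare(Integer a, Integer b) {
            return Float.compare(rank[b], rank[a]);   // best rank first
          }
        });
        int[] map = new int[rank.length];
        for (int newId = 0; newId < byRank.length; newId++) {
          map[byRank[newId]] = newId;
        }
        return map;
      }
    }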
Apache Lucene EuroCon   20 May 2010
Index sorting - summary
     Implementation in Nutch: IndexSorter
         Based on PageRank – sorts by decreasing page quality
         Uses FilterIndexReader

     NOTE: “Early termination” will (significantly) reduce quality of
      results with non-sorted indexes – use both or neither




Apache Lucene EuroCon   20 May 2010
Index pruning
     Quick refresh on the index composition:
         Stored fields
         Term dictionary
         Term frequency data
         Positional data (postings)
             With or without payload data
         Term frequency vectors

     The number of documents may run into the millions
     The number of terms is commonly well into the millions
         Not to mention individual postings …
Apache Lucene EuroCon   20 May 2010
Index pruning & top-N retrieval
     N is usually << 1000
     Very often search quality is judged based on top-20
     Question:
       Do we really need to keep and process ALL terms and ALL
        postings for a good-quality top-N search for common
        queries?




Apache Lucene EuroCon   20 May 2010
Index pruning hypothesis
     There should be a way to remove some of the less important
      data
         While retaining the quality of top-N results!
     Question: what data is less important?
     Some answers:
         That of poorly-scoring documents
         That of common (less selective) terms
     Dynamic pruning: skips less relevant data during query
      processing → runtime cost...
     But can we do this work in advance (static pruning)?
Apache Lucene EuroCon   20 May 2010
What do we need for top-N results?
     Work backwards
     “Foreach” common query:
         Run it against the full index
         Record the top-N matching documents

     “Foreach” document in results:
         Record terms and term positions that contributed to the score

     Finally: remove all non-recorded postings and terms
     First proposed by D. Carmel (2001) for single term queries
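
The procedure can be sketched as a loop. This is a deliberately coarse illustration (it keeps every query term for every top-N hit, rather than only the positions that actually contributed to the score), and the query log, the keep-set encoding and the class name are assumptions:

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    public class ContributionRecorder {
      // Entries look like "field:text@docId"; a pruning pass would keep only these postings.
      static Set<String> record(IndexSearcher searcher, List<Query> commonQueries, int n)
          throws IOException {
        Set<String> keep = new HashSet<String>();
        for (Query q : commonQueries) {
          TopDocs top = searcher.search(q, n);
          Set<Term> terms = new HashSet<Term>();
          q.rewrite(searcher.getIndexReader()).extractTerms(terms);   // the query's terms
          for (ScoreDoc sd : top.scoreDocs) {
            for (Term t : terms) {
              keep.add(t.field() + ":" + t.text() + "@" + sd.doc);
            }
          }
        }
        return keep;
      }
    }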
Apache Lucene EuroCon   20 May 2010
… but it's too simplistic:
     [Example: a tiny index with terms quick (0), brown (1), fox (2), shown before and
      after pruning]
         Query 1: brown        – topN(full) == topN(pruned)
         Query 2: “brown fox”  – topN(full) != topN(pruned)
     Hmm, what about less common queries?
         80/20 rule of “good enough”?
     Term-level is too primitive
         Document-centric pruning
         Impact-centric pruning
         Position-centric pruning
Apache Lucene EuroCon   20 May 2010
Smarter pruning
     Not all term positions are equally important
     Metrics of term and position importance:
         Plain in-document term frequency (TF)
         TF-IDF score obtained from top-N results of TermQuery (Carmel method)
         Residual IDF – a measure of term informativeness (selectivity)
         Key-phrase positions, or term clusters
         Kullback-Leibler divergence from a language model
     [Chart: term frequency distribution (Freq vs. Term), comparing the corpus language
      model with a document language model]
Apache Lucene EuroCon   20 May 2010
Applications
     Obviously, performance-related
         Some papers claim a modest impact on quality when pruning up to 60% of
          postings
         See LUCENE-1812 for some benchmarks confirming this claim

     Removal / restructuring of (some) stored content
     Legacy indexes, or ones created with a fossilized external indexing chain




Apache Lucene EuroCon   20 May 2010
Stored field pruning
     Some stored data can be compacted, removed, or restructured
     Use case: source text for generating “snippets”
         Split content into sentences
         Reorder sentences by a static “importance” score (e.g. how many rare terms they
          contain)
              NOTE: this may use collection-wide statistics!
         Remove the bottom x% of sentences
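
A plain-Java sketch of the snippet-source compaction described above; the sentence splitter and the importance scorer (which could be backed by collection-wide statistics such as docFreq) are stand-ins, not part of any Lucene API:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;

    public class SnippetSourcePruning {
      interface SentenceScorer { double importance(String sentence); }

      // Reorder sentences by descending importance and keep only the top fraction.
      static String compact(String content, double keepFraction, final SentenceScorer scorer) {
        List<String> sentences = new ArrayList<String>(
            Arrays.asList(content.split("(?<=[.!?])\\s+")));
        Collections.sort(sentences, new Comparator<String>() {
          public int compare(String a, String b) {
            return Double.compare(scorer.importance(b), scorer.importance(a));
          }
        });
        int keep = Math.max(1, (int) Math.round(sentences.size() * keepFraction));
        StringBuilder sb = new StringBuilder();
        for (String s : sentences.subList(0, keep)) {
          if (sb.length() > 0) sb.append(' ');
          sb.append(s);
        }
        return sb.toString();
      }
    }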




Apache Lucene EuroCon   20 May 2010
LUCENE-1812: contrib/pruning tools and API
     Based on FilterIndexReader
     Produces output indexes via
      IndexWriter.addIndexes(IndexReader[])

     Design:
         PruningReader – subclass of FilterIndexReader with necessary boilerplate and
          hooks for pruning policies
         StorePruningPolicy – implements rules for modifying stored fields (and list of field
          names)
         TermPruningPolicy – implements rules for modifying term dictionary, postings and
          payloads
         PruningTool – command-line utility to configure and run PruningReader
Apache Lucene EuroCon   20 May 2010
Details of LUCENE-1812
     [Diagram: source index (stored fields, term dict, postings+payloads, term vectors)
      → PruningReader applying StorePruningPolicy and TermPruningPolicy
      → IndexWriter via IW.addIndexes(IndexReader...)
      → target index with the same structure]
     IndexWriter consumes source data filtered via PruningReader
     Internal document ID-s are preserved – suitable for bitset ops
      and retrieval by internal ID
         If source index has no deletions
         If target index is empty
Apache Lucene EuroCon   20 May 2010
API: StorePruningPolicy
     May remove (some) fields from (some) documents
     May as well modify the values
     May rename / add fields




Apache Lucene EuroCon   20 May 2010
API: TermPruningPolicy
     Thresholds (in the order of precedence):
         Per term
         Per field
         Default

     Plain TF pruning – TFTermPruningPolicy
         Removes all postings for a term where TF (in-document term frequency) is below
          a threshold

     Top-N term-level – CarmelTermPruningPolicy
         TermQuery search for top-N docs
         Removes all postings for a term outside the top-N docs
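
The decision at the heart of the Carmel-style policy fits in a few lines; this shows the idea only and is not the CarmelTermPruningPolicy API:

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    public class CarmelDecision {
      // Documents whose postings for 'term' survive: those the term itself ranks in its top N.
      static Set<Integer> docsToKeep(IndexSearcher searcher, Term term, int n) throws IOException {
        Set<Integer> keep = new HashSet<Integer>();
        TopDocs top = searcher.search(new TermQuery(term), n);
        for (ScoreDoc sd : top.scoreDocs) {
          keep.add(sd.doc);
        }
        return keep;   // postings for this term in any other document get pruned
      }
    }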

Apache Lucene EuroCon   20 May 2010
Results so far...
     TF pruning:
         Term query recall very good
         Phrase query recall very poor – expected...

     Carmel pruning – slightly better term position selection, but
      still heavy negative impact on phrase queries
     Recognizing and keeping key phrases would help
         Use query log for frequent-phrase mining?
         Use collocation miner (Mahout)?
         Savings on pruning will be smaller, but quality will significantly improve


Apache Lucene EuroCon   20 May 2010
References
     Static Index Pruning for Information Retrieval Systems, Carmel et al., SIGIR'01
     A document-centric approach to static index pruning in text retrieval systems, Büttcher & Clarke, CIKM'06
     Locality-based pruning methods for web search, de Moura et al., ACM TOIS '08
     Pruning strategies for mixed-mode querying, Anh & Moffat, CIKM'06

Apache Lucene EuroCon   20 May 2010
Index pruning applied ...
     Index 1: A heavily pruned index that fits in RAM:
         excellent speed
         poor search quality for many less-common query types
     Index 2: Slightly pruned index that fits partially in RAM:
         good speed, good quality for many common query types,
         still poor quality for some other rare query types
     Index 3: Full index on disk:
         Slow speed
         Excellent quality for all query types
     QUESTION: Can we come up with a combined search strategy?
Apache Lucene EuroCon   20 May 2010
Tiered search
     [Diagram: three tiers of search boxes – tier 1 holds a 70% pruned index in RAM,
      tier 2 a 30% pruned index on SSD, tier 3 the full (0% pruned) index on HDD;
      a predict/evaluate component decides which tier answers each query]
     Can we predict the best tier without actually running the query?
     How to evaluate if the predictor was right?
Apache Lucene EuroCon     20 May 2010
Tiered search: tier selector and evaluator
     Best tier can be predicted (often enough):
         Carmel pruning yields excellent results for simple term queries
         Phrase-based pruning yields good results for phrase queries (though less often)

     Quality evaluator: when is predictor wrong?
         Could be very complex, based on gold standard and qrels
         Could be very simple: acceptable number of results

     Fall-back strategy:
         Serial: poor latency, but minimizes load on bulkier tiers
         Partially parallel:
             submit to the next tier only the border-line queries
             Pick the first acceptable answer – reduces latency
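
A minimal sketch of the serial fall-back strategy with the simple "acceptable number of results" evaluator; the tier ordering, the threshold and the searcher wiring are assumptions rather than an existing Lucene or Solr API (TopDocs.totalHits is the 3.x-era int):

    import java.io.IOException;
    import java.util.List;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;

    public class TieredSearch {
      // Tiers are ordered from most pruned (cheapest) to the full index.
      static TopDocs search(List<IndexSearcher> tiers, Query q, int n, int minAcceptableHits)
          throws IOException {
        TopDocs result = null;
        for (IndexSearcher tier : tiers) {
          result = tier.search(q, n);
          if (result.totalHits >= minAcceptableHits) {
            return result;            // good enough – stop at the cheapest tier that answers
          }
        }
        return result;                // fell through to the full, unpruned index
      }
    }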
Apache Lucene EuroCon   20 May 2010
Tiered versus distributed
     Both applicable to indexes and query loads exceeding single
      machine capabilities
     Distributed sharded search:
         increases latency for all queries (send + execute + integrate from all shards)
             … plus replicas to increase QPS:
                 Increases hardware / management costs
                 While not improving latency

     Tiered search:
         Excellent latency for common queries
         More complex to build and maintain
         Arguably lower hardware cost for comparable scale / QPS
Apache Lucene EuroCon   20 May 2010
Tiered search benefits
     Majority of common queries handled by first tier: RAM-based,
      high QPS, low latency
     Partially parallel mode reduces average latency for more
      complex queries
     Hardware investment likely smaller than for distributed search
      setup of comparable QPS / latency




Apache Lucene EuroCon   20 May 2010
Example Lucene API for tiered search
                                      Could be implemented as
                                      a Solr SearchComponent...




Apache Lucene EuroCon   20 May 2010
Lucene implementation details




Apache Lucene EuroCon   20 May 2010
References
     Efficiency trade-offs in two-tier web search systems, Baeza-Yates et al., SIGIR'09
     ResIn: A combination of results caching and index pruning for high-performance web search engines, Baeza-Yates et al., SIGIR'08
     Three-level caching for efficient query processing in large Web search engines, Long & Suel, WWW'05



Apache Lucene EuroCon   20 May 2010
Bit-wise search
     Given a bit pattern query:
      1010 1001 0101 0001
     Find documents with matching bit patterns in a field
     Applications:
         Permission checking
         De-duplication
         Plagiarism detection

     Two variants: non-scoring (filtering) and scoring

Apache Lucene EuroCon   20 May 2010
Non-scoring bitwise search (LUCENE-2460)
     Builds a Filter from the intersection of:
         DocIdSet of documents matching a Query
         Integer value and operation (AND, OR, XOR)
         “Value source” that caches integer values of a field (from FieldCache)
     Corresponding Solr field type and QParser: SOLR-1913
     Useful for filtering (not scoring)
     [Example: docs 0–4 carry flags 0x01–0x05 and type a/b; the DocIdSet for “type:a”
      is intersected with op=AND, val=0x01 over the cached flags values to produce the
      final Filter]
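
A rough sketch of the idea, not the LUCENE-2460 patch itself: a Filter that keeps documents whose cached integer field value matches the query value under AND semantics, written against the 3.x-era FieldCache/Filter/OpenBitSet API:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.DocIdSet;
    import org.apache.lucene.search.FieldCache;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.util.OpenBitSet;

    public class BitwiseFilter extends Filter {
      private final String field;
      private final int value;

      public BitwiseFilter(String field, int value) {
        this.field = field;
        this.value = value;
      }

      @Override
      public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
        int[] flags = FieldCache.DEFAULT.getInts(reader, field);  // cached per-doc int values
        OpenBitSet bits = new OpenBitSet(reader.maxDoc());
        for (int doc = 0; doc < flags.length; doc++) {
          // AND semantics: the document must have all bits of 'value' set
          if ((flags[doc] & value) == value && !reader.isDeleted(doc)) {
            bits.set(doc);
          }
        }
        return bits;   // OpenBitSet is itself a DocIdSet
      }
    }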

Apache Lucene EuroCon   20 May 2010
Scoring bitwise search (SOLR-1918)
     BooleanQuery in disguise:
         1010 = Y-1000 | N-0100 | Y-0010 | N-0001
     Solr 32-bit BitwiseField
         Analyzer creates the bitmasks field
         Currently supports only a single value per field
         Creates a BooleanQuery from the query int value
     Useful when searching for best matching (ranked) bit patterns
     [Example: D1=1010, D2=1011, D3=0011 indexed as per-bit Y/N tokens;
      Q = bits:Y1000 bits:N0100 bits:Y0010 bits:N0001 →
      D1 matches 4 of 4 → #1, D2 matches 3 of 4 → #2, D3 matches 2 of 4 → #3]
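
The "BooleanQuery in disguise" can be sketched as follows; the per-bit Y/N token convention follows the slide, while the field name, bit width and class are assumptions rather than the SOLR-1918 code:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    public class BitwiseScoringQuery {
      public static BooleanQuery build(String field, int value, int numBits) {
        BooleanQuery bq = new BooleanQuery();
        for (int bit = numBits - 1; bit >= 0; bit--) {
          int mask = 1 << bit;
          // One SHOULD clause per bit: documents score higher the more bits they match.
          String token = ((value & mask) != 0 ? "Y" : "N") + maskString(mask, numBits);
          bq.add(new TermQuery(new Term(field, token)), BooleanClause.Occur.SHOULD);
        }
        return bq;
      }

      private static String maskString(int mask, int numBits) {
        StringBuilder sb = new StringBuilder();
        for (int bit = numBits - 1; bit >= 0; bit--) {
          sb.append((mask & (1 << bit)) != 0 ? '1' : '0');
        }
        return sb.toString();
      }

      public static void main(String[] args) {
        // 1010 → bits:Y1000 bits:N0100 bits:Y0010 bits:N0001, as on the slide
        System.out.println(build("bits", Integer.parseInt("1010", 2), 4));
      }
    }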
Apache Lucene EuroCon       20 May 2010
Summary
     Index post-processing covers a range of useful scenarios:
         Merging and splitting, remodeling, extracting, moving ...
         Pruning less important data

     Tiered search + pruned indexes:
         High performance
         Practically unchanged quality
         Less hardware

     Bitwise search:
         Filtering by matching bits
         Ranking by best matching patterns
Apache Lucene EuroCon   20 May 2010
Meta-summary
     Stir your imagination
     Think outside the box
     Show some unorthodox use and practical applications
     Close ties to scalability, performance, distributed search and
      query latency




Apache Lucene EuroCon   20 May 2010
Q&A




Apache Lucene EuroCon   20 May 2010
Thank you!




Apache Lucene EuroCon   05/25/10
Massive indexing with map-reduce
     Map-reduce indexing models
         Google model
         Nutch model
         Modified Nutch model
         Hadoop contrib/indexing model

     Tradeoff analysis and recommendations




Apache Lucene EuroCon   20 May 2010
Google model
   Map():
       IN: <seq, docText>
       terms = analyze(docText)
       foreach (term)
           emit(term, <seq, position>)

   Reduce():
       IN: <term, list(<seq, pos>)>
       foreach (<seq, pos>)
           docId = calculate(seq, taskId)
           Postings(term).append(docId, pos)

       Pros: analysis on the map side
       Cons:
           Too many tiny intermediate records → Combiner
           DocID synchronization across map and reduce tasks
           Lucene: very difficult (impossible?) to create index this way
Apache Lucene EuroCon    20 May 2010
Nutch model (also in SOLR-1301)
   Map():
       IN: <seq, docPart>
       docId = docPart.get(“url”)
       emit(docId, docPart)

   Reduce():
       IN: <docId, list(docPart)>
       doc = luceneDoc(list(docPart))
       indexWriter.addDocument(doc)



       Pros: easy to build Lucene index
       Cons:
           Analysis on the reduce side
           Many costly merge operations (large indexes built from scratch on reduce side)
            (plus currently needs copy from local FS to HDFS – see LUCENE-2373)
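
A skeletal Hadoop reducer in the spirit of this model: parts grouped by docId arrive together, are assembled into one Lucene Document, and added to a task-local IndexWriter whose output would later be copied to HDFS. The value encoding ("field<TAB>text"), the field handling and the writer setup are assumptions for illustration; the actual Nutch / SOLR-1301 code differs in many details:

    import java.io.File;
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class IndexingReducer extends Reducer<Text, Text, Text, Text> {
      private IndexWriter writer;

      @Override
      protected void setup(Context context) throws IOException {
        // Task-local index on the local FS; the finished shard is copied out afterwards
        writer = new IndexWriter(
            FSDirectory.open(new File("index-" + context.getTaskAttemptID())),
            new StandardAnalyzer(Version.LUCENE_30),
            true, IndexWriter.MaxFieldLength.UNLIMITED);
      }

      @Override
      protected void reduce(Text docId, Iterable<Text> parts, Context context)
          throws IOException, InterruptedException {
        Document doc = new Document();
        doc.add(new Field("url", docId.toString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        for (Text part : parts) {
          String[] kv = part.toString().split("\t", 2);   // assumed "fieldName \t fieldText"
          if (kv.length == 2) {
            // Analysis happens here, on the reduce side – the main cost of this model
            doc.add(new Field(kv[0], kv[1], Field.Store.NO, Field.Index.ANALYZED));
          }
        }
        writer.addDocument(doc);
      }

      @Override
      protected void cleanup(Context context) throws IOException {
        writer.close();
      }
    }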
Apache Lucene EuroCon   20 May 2010
Modified Nutch model (N/A...)
   Map():
       IN: <seq, docPart>
       docId = docPart.get(“url”)
       ts = analyze(docPart)
       emit(docId, <docPart, ts>)

   Reduce():
       IN: <docId, list(<docPart, ts>)>
       doc = luceneDoc(list(<docPart, ts>))
       indexWriter.addDocument(doc)

       Pros:
           Analysis on map side
           Easy to build Lucene index
       Cons:
           Many costly merge operations (large indexes built from scratch on reduce side)
            (plus currently needs copy from local FS to HDFS – see LUCENE-2373)
Apache Lucene EuroCon   20 May 2010
Hadoop contrib/indexing model
   Map():
       IN: <seq, docText>
       doc = luceneDoc(docText)
       indexWriter.addDocument(doc)
       emit(random, indexData)

   Reduce():
       IN: <random, list(indexData)>
       foreach (indexData)
           indexWriter.addIndexes(indexData)

       Pros:
           analysis on the map side
           Many merges on the map side
        Also supports other operations (deletes, updates)
       Cons:
           Serialization is costly, records are big and require more RAM to sort
Apache Lucene EuroCon   20 May 2010
Massive indexing - summary
     If you first need to collect document parts → SOLR-1301 model
     If you use complex analysis → Hadoop contrib/index
        NOTE: there is no good integration yet between Solr and the Hadoop contrib/index module...




Apache Lucene EuroCon   20 May 2010
