Tweaking the Base Score: Lucene/Solr Similarities Explained
Demo: github.com/sematext/activate/tree/master/2019
More info: sematext.com/blog/search-relevance-solr-elasticsearch-similarity
Radu Gheorghe, Rafał Kuć
www.sematext.com
Agenda
BM25 - Best Match: the default
DFR - Divergence From Randomness framework
DFI - Divergence From Independence
IB - Information-Based models
LM - Language Models
Custom similarity
Putting it all together
TF*IDF
You know, for historical reasons
BM25 - the TF part
freq / (freq + k1 * (1 - b + b * dl / avgdl))
Best for Most 😁
BM25 tunables
freq / (freq + k1 * (1 - b + b * dl / avgdl))
k1 - how quickly the score saturates as freq grows (with the classic (k1 + 1) numerator, it also raises or lowers the ceiling)
b - doc length normalization
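To make the effect of the two tunables concrete, here is a minimal sketch of the TF part exactly as written on the slide. The class and method names are hypothetical, and this is not Lucene's BM25Similarity; the values k1 = 1.2 and b = 0.75 are the usual defaults.

```java
// Sketch of the BM25 term-frequency part from the slide.
// Bm25TfSketch and tf(...) are hypothetical names, not Lucene classes.
final class Bm25TfSketch {

    /** freq / (freq + k1 * (1 - b + b * dl / avgdl)) */
    static double tf(double freq, double k1, double b, double dl, double avgdl) {
        // Equals 1.0 when the document has average length or when b == 0.
        double lengthNorm = 1 - b + b * dl / avgdl;
        return freq / (freq + k1 * lengthNorm);
    }

    public static void main(String[] args) {
        double k1 = 1.2, b = 0.75; // usual defaults
        for (int freq : new int[] {1, 2, 5, 50}) {
            System.out.printf("freq=%d avg-length doc: %.3f, double-length doc: %.3f%n",
                    freq, tf(freq, k1, b, 100, 100), tf(freq, k1, b, 200, 100));
        }
    }
}
```

Running it shows the score flattening out as freq grows, and the longer document scoring lower for the same freq. Setting b to 0 switches length normalization off.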
BM25 demo
yes, that’s how we look when we give demos
BM25
Good default. You can tune the weight of freq and docLength.
Divergence From Randomness
Basic Model: G, I(n), I(ne), I(F)
After Effect: L, B
Normalization: H1, H2, H3, Z, none
Divergence From Randomness - H1
tf * c * avgFieldLength / docFieldLength
Plot: no normalization, and H1 with c == 1, 3, 5, 7
Divergence From Randomness - H2
tf * log2(1 + c * (avgFieldLength / docFieldLength))
Plot: no normalization, and H2 with c == 1, 3, 5, 7
Divergence From Randomness - Z
tf * (avgFieldLength / docFieldLength)^z
Plot: no normalization, and Z with z == 0.1, 0.2, 0.3, 0.4
Divergence From Randomness - H3
(tf + mu * ((totalTermFreq + 1) / (#fieldTokens + 1))) / (docFieldLength + mu) * mu
Plot: no normalization, and H3 with mu == 1, 3, 5, 7
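The four normalizations are easier to compare side by side. The sketch below computes the normalized term frequency (tfn) for each one; all class and method names are hypothetical. H3 is written with `tf + mu * ...` in the numerator and a division by `(docFieldLength + mu)`, which is an assumption based on how Lucene's NormalizationH3 appears to compute it.

```java
// Sketch of the DFR length normalizations; names are hypothetical, not Lucene's API.
final class DfrNormalizationSketch {

    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    /** H1: tf * c * avgFieldLength / docFieldLength */
    static double h1(double tf, double c, double avgLen, double docLen) {
        return tf * c * avgLen / docLen;
    }

    /** H2: tf * log2(1 + c * avgFieldLength / docFieldLength) */
    static double h2(double tf, double c, double avgLen, double docLen) {
        return tf * log2(1 + c * avgLen / docLen);
    }

    /** Z: tf * (avgFieldLength / docFieldLength)^z */
    static double z(double tf, double z, double avgLen, double docLen) {
        return tf * Math.pow(avgLen / docLen, z);
    }

    /** H3 (Dirichlet-style smoothing); assumed form, see the note above. */
    static double h3(double tf, double mu, double totalTermFreq,
                     double fieldTokens, double docLen) {
        return (tf + mu * ((totalTermFreq + 1) / (fieldTokens + 1))) / (docLen + mu) * mu;
    }

    public static void main(String[] args) {
        // Same term frequency, document twice as long as average.
        System.out.println("H1: " + h1(2, 1, 100, 200));
        System.out.println("H2: " + h2(2, 1, 100, 200));
        System.out.println("Z:  " + z(2, 0.3, 100, 200));
        System.out.println("H3: " + h3(2, 800, 1000, 1_000_000, 200));
    }
}
```

With c == 1 and an average-length document, H1, H2 and Z all leave tf unchanged; they only differ in how strongly they penalize longer documents.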
DFR demo
Only one, I promise
DFR
Framework. Tunable: choose algorithm and tune parameters for both IDF* and docLength.
* generic name for importance of this term
Divergence From Independence
expected frequency = docLength * totalTermFrequency / numberOfFieldTokens
DFI: Standardized
(actual - expected)/sqrt(expected)
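As a rough illustration of the two DFI formulas above, the sketch below computes the expected frequency and the standardized measure. The class and method names are hypothetical, and the sample numbers are made up for the example.

```java
// Sketch of the DFI expected frequency and the "standardized" measure.
// DfiSketch and its methods are hypothetical names, not Lucene's API.
final class DfiSketch {

    /** docLength * totalTermFrequency / numberOfFieldTokens */
    static double expected(double docLength, double totalTermFreq, double numberOfFieldTokens) {
        return docLength * totalTermFreq / numberOfFieldTokens;
    }

    /** (actual - expected) / sqrt(expected) */
    static double standardized(double actual, double expected) {
        return (actual - expected) / Math.sqrt(expected);
    }

    public static void main(String[] args) {
        // Made-up numbers: a 100-token doc, a term seen 1000 times in 1M field tokens.
        double expected = expected(100, 1000, 1_000_000);
        System.out.println("expected frequency: " + expected);
        System.out.println("measure for 3 actual occurrences: " + standardized(3, expected));
    }
}
```

The measure is zero when the term occurs exactly as often as expected and grows as the actual frequency diverges upward from it.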
DFI demo
Oh, but don’t remove stopwords*!
* Because: 1) it arbitrarily chops field length; 2) stopwords aren’t always stopwords ;)
DFI
Simple. Parameterless. Flexible: works well with various datasets.
Information Based
how much information do we get from this term?
Distribution: Log-Logistic, Smoothed Power-Law
Lambda: DF, TTF
Normalization: H1, H2, H3, Z, none
Information Based - Log-Logistic
log( tfn / lambda + 1 )
Plot: lambda = 0.1 (red), 0.3 (black), 0.8 (blue)
Information Based - Retrieval Function
the average of the document information brought by each query term
Information Based - Retrieval Function - DF
number of matching documents
(docFrequency + 1) / (numberOfDocuments + 1)
Information Based - Retrieval Function - TTF
total number of term occurrences
(totalTermFrequency + 1) / (numberOfDocuments + 1)
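The two lambda options and the log-logistic distribution can be sketched together. All names below are hypothetical. The log-logistic information is written as log(tfn / lambda + 1), which is an assumption: it should be equivalent to the `-log(lambda / (tfn + lambda))` form that Lucene's DistributionLL appears to use.

```java
// Sketch of the IB lambda choices and log-logistic information.
// IbSketch and its methods are hypothetical names, not Lucene's API.
final class IbSketch {

    /** Lambda DF: (docFrequency + 1) / (numberOfDocuments + 1) */
    static double lambdaDf(double docFrequency, double numberOfDocuments) {
        return (docFrequency + 1) / (numberOfDocuments + 1);
    }

    /** Lambda TTF: (totalTermFrequency + 1) / (numberOfDocuments + 1) */
    static double lambdaTtf(double totalTermFrequency, double numberOfDocuments) {
        return (totalTermFrequency + 1) / (numberOfDocuments + 1);
    }

    /** Log-logistic information, assumed form: log(tfn / lambda + 1) */
    static double logLogistic(double tfn, double lambda) {
        return Math.log(tfn / lambda + 1);
    }

    public static void main(String[] args) {
        double rare = lambdaDf(9, 999);     // term in 9 of 999 docs
        double common = lambdaDf(799, 999); // term in 799 of 999 docs
        System.out.println("rare term:   " + logLogistic(2, rare));
        System.out.println("common term: " + logLogistic(2, common));
    }
}
```

A smaller lambda (a rarer term) yields more information for the same normalized term frequency, which is the IDF-like behavior in this framework.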
IB demo
IB
Framework, like DFR. Even has the same normalization options. But newer and, in the paper, better.
Language Models
probability of a term being our term = totalTermFreq / totalFieldTokens
Language Models: Jelinek-Mercer
log( ((1 - λ) * tf / docLength) / (λ * probability) )
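A minimal sketch of the Jelinek-Mercer score, reading the slide's formula as the log of the ratio between the smoothed document probability, (1 - λ) * tf / docLength, and the smoothed collection probability, λ * probability. The names are hypothetical. Lucene's LMJelinekMercerSimilarity also appears to add 1 inside the log, which keeps scores non-negative; that detail is an assumption and is left out here.

```java
// Sketch of the Jelinek-Mercer formula as read from the slide.
// JelinekMercerSketch and its methods are hypothetical names, not Lucene's API.
final class JelinekMercerSketch {

    /** Collection probability: totalTermFreq / totalFieldTokens */
    static double collectionProbability(double totalTermFreq, double totalFieldTokens) {
        return totalTermFreq / totalFieldTokens;
    }

    /** log( ((1 - lambda) * tf / docLength) / (lambda * probability) ) */
    static double score(double tf, double docLength, double lambda, double probability) {
        return Math.log(((1 - lambda) * tf / docLength) / (lambda * probability));
    }

    public static void main(String[] args) {
        double p = collectionProbability(2_000, 100_000); // 0.02
        for (double lambda : new double[] {0.1, 0.5, 0.7}) {
            System.out.println("lambda=" + lambda + " score=" + score(2, 100, lambda, p));
        }
    }
}
```

Lower λ puts more weight on the document's own term frequency; higher λ leans on the collection-wide probability.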
LM demo
feat. Jelinek-Mercer
LM
Two probabilistic models. Similar approach to DFI, but tunable.
Custom Similarity
compute a similarity score using custom code
Custom Similarity - Activate Similarity Factory
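import org.apache.lucene.search.similarities.Similarity;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.schema.SimilarityFactory;

// Solr factory that hands out a single shared ActivateSimilarity instance.
public class ActivateSimilarityFactory extends SimilarityFactory {
    // volatile so that a lazily created instance is visible to all threads
    private volatile Similarity similarity;

    @Override
    public void init(SolrParams params) {
        super.init(params);
    }

    @Override
    public Similarity getSimilarity() {
        // Lazy creation; a race here only creates an extra, equivalent instance.
        if (similarity == null) {
            similarity = new ActivateSimilarity();
        }
        return similarity;
    }
}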
Custom Similarity - Similarity
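import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.CollectionStatistics;
import org.apache.lucene.search.TermStatistics;
import org.apache.lucene.search.similarities.Similarity;

public class ActivateSimilarity extends Similarity {
    public ActivateSimilarity() {}
    // Constant norm: document length is ignored.
    @Override
    public long computeNorm(FieldInvertState state) { return 1; }

    @Override
    public Similarity.SimScorer scorer(float boost,
            CollectionStatistics collectionStats, TermStatistics... termStats) {
        return new ActivateSimScorer();
    }
}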
Custom Similarity - SimScorer
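import org.apache.lucene.search.similarities.Similarity;

public class ActivateSimScorer extends Similarity.SimScorer {
    // The score is the raw term frequency; the norm is ignored.
    @Override
    public float score(float freq, long norm) {
        return freq;
    }
}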
Custom Similarity demo
Custom
When you need something special, like disregarding term frequency.
Multiple similarities demo
THANK YOU
