Tweaking the Base Score: Lucene/Solr Similarities Explained

Demo: github.com/sematext/activate/tree/master/2019
More info: sematext.com/blog/search-relevance-solr-elasticsearch-similarity
Radu Gheorghe
Rafał Kuć
www.sematext.com

Agenda
BM25 - Best Match: the default
DFR - Divergence From Randomness framework
DFI - Divergence From Independence
IB - Information-Based models
LM - Language Models
Custom similarity
Putting it all together

TF*IDF
You know, for historical reasons

BM25 - the TF part
freq / (freq + k1 * (1 - b + b * dl / avgdl))
Best for Most 😁

BM25 tunables
k1 - raise or lower the ceiling for term frequency

BM25 tunables
b - doc length normalization
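
To make the TF formula and its two tunables concrete, here is a minimal plain-Java sketch. The class and method names are hypothetical and this is not Lucene's API; it only evaluates the formula from the slide. The values k1 = 1.2 and b = 0.75 used in `main` are Lucene's BM25 defaults.

```java
/** Sketch of the BM25 term-frequency part; names are illustrative, not Lucene's API. */
final class Bm25TfSketch {

    // freq / (freq + k1 * (1 - b + b * dl / avgdl))
    static double tfPart(double freq, double k1, double b, double dl, double avgdl) {
        // b = 0 ignores document length, b = 1 normalizes by it fully
        double lengthNorm = 1 - b + b * dl / avgdl;
        // the result grows with freq but never reaches 1
        return freq / (freq + k1 * lengthNorm);
    }

    public static void main(String[] args) {
        // Print how the TF part saturates for a document of average length
        for (int freq : new int[] {1, 2, 5, 50}) {
            System.out.println(freq + " -> " + tfPart(freq, 1.2, 0.75, 100, 100));
        }
    }
}
```

A larger k1 makes the curve saturate more slowly, so repeated occurrences keep adding to the score for longer; a larger b penalizes documents that are longer than average more strongly.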

BM25 demo
yes, that’s how we look
when we give demos

BM25
Good default. You can
tune the weight of freq
and docLength.

Divergence From Randomness
Basic Model
G, I(n), I(ne), I(F)
After Effect
L, B
Normalization
H1, H2, H3, Z, none

Divergence From Randomness - H1
tf * c * avgFieldLength / docFieldLength

No normalization, and H1 with c == 1, 3, 5, 7

Divergence From Randomness - H2
tf * log2(1 + c * (avgFieldLength / docFieldLength))

No normalization, and H2 with c == 1, 3, 5, 7

Divergence From Randomness - Z
tf * (avgFieldLength / docFieldLength)^z

Divergence From Randomness - Z
No normalization, and Z with z == 0.1, 0.2, 0.3, 0.4

Divergence From Randomness - H3
(tf + mu * ((totalTermFreq + 1) / (#fieldTokens + 1))) / (docFieldLength + mu) * mu

No normalization, and H3 with mu == 1, 3, 5, 7
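
The four DFR length normalizations can be compared side by side with a small plain-Java sketch. Class, method, and parameter names are hypothetical. H1, H2, and Z follow the slide formulas; for H3 the sketch assumes the Dirichlet-style form `(tf + mu * p) / (docFieldLength + mu) * mu`, where `p` is the smoothed collection probability of the term.

```java
/** Sketch of the DFR term-frequency normalizations; names are illustrative. */
final class DfrNormalizationSketch {

    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    // H1: tf * c * avgFieldLength / docFieldLength
    static double h1(double tf, double c, double avgLen, double docLen) {
        return tf * c * avgLen / docLen;
    }

    // H2: tf * log2(1 + c * (avgFieldLength / docFieldLength))
    static double h2(double tf, double c, double avgLen, double docLen) {
        return tf * log2(1 + c * (avgLen / docLen));
    }

    // Z: tf * (avgFieldLength / docFieldLength)^z
    static double z(double tf, double z, double avgLen, double docLen) {
        return tf * Math.pow(avgLen / docLen, z);
    }

    // H3 (assumed form): (tf + mu * p) / (docFieldLength + mu) * mu
    static double h3(double tf, double mu, double totalTermFreq,
                     double fieldTokens, double docLen) {
        double p = (totalTermFreq + 1) / (fieldTokens + 1);
        return (tf + mu * p) / (docLen + mu) * mu;
    }
}
```

For a document of exactly average length, H1 and H2 with c = 1 and Z with any z leave tf unchanged, which makes them easy to sanity-check.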

DFR
Framework. Tunable:
choose algorithm and
tune parameters for
both IDF* and
docLength.
* generic name for importance
of this term

Divergence From Independence
expected frequency:
docLength * totalTermFrequency / numberOfFieldTokens

DFI: Standardized
(actual - expected)/sqrt(expected)
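
As a worked example of the two DFI formulas, here is a minimal plain-Java sketch. The names are hypothetical, and it covers only the expected frequency and the standardized measure, not the full similarity.

```java
/** Sketch of DFI's expected frequency and standardized measure; names are illustrative. */
final class DfiSketch {

    // How often the term would occur in this document if it were spread
    // evenly across the field: docLength * totalTermFrequency / numberOfFieldTokens
    static double expected(double docLength, double totalTermFreq, double numberOfFieldTokens) {
        return docLength * totalTermFreq / numberOfFieldTokens;
    }

    // Standardized: (actual - expected) / sqrt(expected)
    static double standardized(double actual, double expected) {
        return (actual - expected) / Math.sqrt(expected);
    }

    public static void main(String[] args) {
        // A 100-token document, a term with 50 occurrences in a 10,000-token field
        double e = expected(100, 50, 10_000);
        System.out.println("expected = " + e);
        System.out.println("standardized for 3 occurrences = " + standardized(3, e));
    }
}
```

A term that occurs exactly as often as expected gets a measure of 0, so only occurrences beyond the expectation contribute.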

DFI demo
Oh, but don’t remove
stopwords*!
1) arbitrarily chops field length
2) stopwords aren’t always
stopwords ;)

DFI
Simple. Parameterless.
Flexible: works well
with various datasets.

Information Based
how much information do we get from this term?

Information Based
Distribution
Log-Logistic, Smoothed Power-Law
Lambda
DF, TTF
Normalization
H1, H2, H3, Z, none

Information Based - Log-Logistic
log(tfn / lambda + 1)

Information Based - Log-Logistic
lambda: 0.1 (red), 0.3 (black), 0.8 (blue)

Information Based - Retrieval Function
the average of the document information brought
by each query term

Information Based - Retrieval Function - DF
number of matching documents
(docFrequency + 1) / (numberOfDocuments + 1)

Information Based - Retrieval Function - TTF
total number of term occurrences
(totalTermFrequency + 1) / (numberOfDocuments + 1)
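
The two lambda choices and the log-logistic distribution fit together as in the plain-Java sketch below. The names are hypothetical. The log-logistic form is assumed to be `log(tfn / lambda + 1)`, which is equivalent to `-log(lambda / (tfn + lambda))`; `tfn` stands for the term frequency after one of the normalizations (H1, H2, H3, Z, or none).

```java
/** Sketch of the Information-Based lambdas and log-logistic distribution; names are illustrative. */
final class IbSketch {

    // DF lambda: (docFrequency + 1) / (numberOfDocuments + 1)
    static double lambdaDf(double docFreq, double numberOfDocuments) {
        return (docFreq + 1) / (numberOfDocuments + 1);
    }

    // TTF lambda: (totalTermFrequency + 1) / (numberOfDocuments + 1)
    static double lambdaTtf(double totalTermFreq, double numberOfDocuments) {
        return (totalTermFreq + 1) / (numberOfDocuments + 1);
    }

    // Log-logistic (assumed form): log(tfn / lambda + 1)
    static double logLogistic(double tfn, double lambda) {
        return Math.log(tfn / lambda + 1);
    }

    public static void main(String[] args) {
        // A term found in 9 of 99 documents, occurring twice in this one
        double lambda = lambdaDf(9, 99);
        System.out.println("lambda = " + lambda);
        System.out.println("information = " + logLogistic(2, lambda));
    }
}
```

A smaller lambda, meaning a rarer term, yields more information for the same normalized frequency.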

IB
Framework, like DFR.
Even has the same
normalization options.
But newer and, in the
paper, better.

Language Models
probability of a term being our term:
totalTermFreq / totalFieldTokens

Language Models: Jelinek-Mercer
log(1 + ((1 - λ) * tf / docLength) / (λ * probability))
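
A minimal plain-Java sketch of the Jelinek-Mercer score, under the assumption that the formula reads `log(1 + ((1 - λ) * tf / docLength) / (λ * probability))`, with `probability` being the collection probability totalTermFreq / totalFieldTokens. The names are hypothetical and the query boost is left out.

```java
/** Sketch of Jelinek-Mercer smoothing; names are illustrative, boost omitted. */
final class JelinekMercerSketch {

    // Probability of a term being our term: totalTermFreq / totalFieldTokens
    static double collectionProbability(double totalTermFreq, double totalFieldTokens) {
        return totalTermFreq / totalFieldTokens;
    }

    // log(1 + ((1 - lambda) * tf / docLength) / (lambda * probability))
    static double score(double tf, double docLength, double lambda, double probability) {
        double documentPart = (1 - lambda) * tf / docLength;  // evidence from this document
        double collectionPart = lambda * probability;         // evidence from the whole index
        return Math.log(1 + documentPart / collectionPart);
    }

    public static void main(String[] args) {
        double p = collectionProbability(10, 100);
        // Compare a low and a high lambda for the same document
        System.out.println("lambda 0.1: " + score(2, 10, 0.1, p));
        System.out.println("lambda 0.7: " + score(2, 10, 0.7, p));
    }
}
```

A higher λ gives more weight to the collection-wide probability and less to the document's own term frequency, so the score for a matching document drops.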

LM
Two probabilistic
models. Similar
approach to DFI, but
tunable.

Custom Similarity
compute a similarity score using custom code

Custom Similarity - Activate Similarity Factory
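import org.apache.lucene.search.similarities.Similarity;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.schema.SimilarityFactory;

public class ActivateSimilarityFactory extends SimilarityFactory {
  // Created lazily on first use, then reused
  private volatile Similarity similarity;

  @Override
  public void init(SolrParams params) {
    super.init(params);
  }

  @Override
  public Similarity getSimilarity() {
    if (similarity == null) {
      similarity = new ActivateSimilarity();
    }
    return similarity;
  }
}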

Custom Similarity - Similarity
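import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.CollectionStatistics;
import org.apache.lucene.search.TermStatistics;
import org.apache.lucene.search.similarities.Similarity;

public class ActivateSimilarity extends Similarity {
  // Constant norm: document length is not encoded at index time
  @Override
  public long computeNorm(FieldInvertState state) { return 1; }

  @Override
  public SimScorer scorer(float boost,
      CollectionStatistics collectionStats, TermStatistics... termStats) {
    return new ActivateSimScorer();
  }
}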

Custom Similarity - SimScorer
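import org.apache.lucene.search.similarities.Similarity;

public class ActivateSimScorer extends Similarity.SimScorer {
  // The score is the raw term frequency; the norm is ignored
  @Override
  public float score(float freq, long norm) {
    return freq;
  }
}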

Custom
When you need
something special, like
disregarding term
frequency.
