
I'm using the gensim Python library to work on small corpora (around 1,500 press articles each time). Let's say I'm interested in creating clusters of articles covering the same news story.

So for each corpus of articles I've tokenized the texts, detected collocations, stemmed, built a small dictionary (around 20k tokens) and passed the resulting bag-of-words corpus through a TF-IDF model.
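
Roughly, the preprocessing looks like this (variable names are illustrative, and the stemming step is omitted here):

from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.models.phrases import Phrases, Phraser

# tokenized_docs: one list of (stemmed) tokens per article
bigram = Phraser(Phrases(tokenized_docs, min_count=5, threshold=10.0))
texts = [bigram[doc] for doc in tokenized_docs]

dictionary = Dictionary(texts)                          # ~20k tokens
bow_corpus = [dictionary.doc2bow(doc) for doc in texts]

tfidf = TfidfModel(bow_corpus, id2word=dictionary)
tfidf_corpus = tfidf[bow_corpus]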

Finally I've used the TF-IDF corpus to build an LSI model, and with the help of gensim's document similarity functions I was able to get very good results.
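
The LSI and similarity part is essentially the following (num_topics=200 is just the value I picked, not something special):

from gensim.models import LsiModel
from gensim.similarities import MatrixSimilarity

lsi = LsiModel(tfidf_corpus, id2word=dictionary, num_topics=200)
index = MatrixSimilarity(lsi[tfidf_corpus], num_features=lsi.num_topics)

# every query is a full document that is already part of the corpus
query = lsi[tfidf[dictionary.doc2bow(texts[0])]]
sims = sorted(enumerate(index[query]), key=lambda item: -item[1])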

But I was curious and ran a coherence check on the LSI model with:

from gensim.models import CoherenceModel

lsi_topics = [[word for word, prob in topic]
              for topicid, topic in lsi.show_topics(formatted=False)]
# texts must be the tokenized documents (lists of tokens), not the BoW vectors
lsi_coherence = CoherenceModel(topics=lsi_topics[:10], texts=corpus,
                               dictionary=dictionary, window_size=10).get_coherence()
logger.info("lsi coherence: %.3f" % lsi_coherence)

And I always get values around 0.45, which seems pretty weak.

So I was wondering: how should this coherence value be interpreted? And does it even matter when all I need is similarity between documents already in the index (i.e. every query is a full document taken from the corpus itself)?

Edit: I tried different things for text preprocessing, such as splitting each document into real sentences before feeding the Phrases class, generating bigrams or trigrams, and removing accents or not. In some cases I was able to get a coherence value around 0.55, so I guess coherence can at least help find the most effective way to process the raw data... A rough sketch of the sentence-level variant is below.
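
For reference, the sentence-level variant looks roughly like this (split_into_sentences stands in for whatever sentence splitter is used, it's not a gensim function, and the whitespace split is a simplification of my tokenization):

from gensim.models.phrases import Phrases, Phraser
from gensim.utils import deaccent

# train Phrases on real sentences instead of whole documents
sentences = [deaccent(s).split() for doc in raw_docs for s in split_into_sentences(doc)]
bigram = Phraser(Phrases(sentences, min_count=5, threshold=10.0))
trigram = Phraser(Phrases(bigram[sentences], min_count=5, threshold=10.0))

# then apply both models to the tokenized articles before building the dictionary
texts = [trigram[bigram[doc]] for doc in tokenized_docs]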
