
I'm using the gensim Python library to work on small corpora (around 1,500 press articles each time). Let's say I'm interested in creating clusters of articles covering the same news story.

So for each corpus of articles I've tokenized the texts, detected collocations, stemmed, built a small dictionary (around 20k tokens) and passed the resulting bag-of-words corpus through a TF-IDF model.
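
Roughly, the preprocessing looks like this (variable names are illustrative, and the stemming step is omitted here):

from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.models.phrases import Phrases, Phraser

# tokenized_docs: one list of (stemmed) tokens per article
bigram = Phraser(Phrases(tokenized_docs, min_count=5, threshold=10.0))
texts = [bigram[doc] for doc in tokenized_docs]

dictionary = Dictionary(texts)                          # ~20k tokens
bow_corpus = [dictionary.doc2bow(doc) for doc in texts]

tfidf = TfidfModel(bow_corpus, id2word=dictionary)
tfidf_corpus = tfidf[bow_corpus]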

Finally I've used the TF-IDF corpus to build an LSI model, and with the help of gensim's document similarity functions I was able to get very good results.
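
The LSI and similarity part is essentially the following (num_topics=200 is just the value I picked, not something special):

from gensim.models import LsiModel
from gensim.similarities import MatrixSimilarity

lsi = LsiModel(tfidf_corpus, id2word=dictionary, num_topics=200)
index = MatrixSimilarity(lsi[tfidf_corpus], num_features=lsi.num_topics)

# every query is a full document that is already part of the corpus
query = lsi[tfidf[dictionary.doc2bow(texts[0])]]
sims = sorted(enumerate(index[query]), key=lambda item: -item[1])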

But I was curious and ran a coherence check on the LSI model with:

from gensim.models import CoherenceModel

lsi_topics = [[word for word, prob in topic]
              for topicid, topic in lsi.show_topics(formatted=False)]
# texts must be the tokenized documents (lists of tokens), not the BoW vectors
lsi_coherence = CoherenceModel(topics=lsi_topics[:10], texts=corpus,
                               dictionary=dictionary, window_size=10).get_coherence()
logger.info("lsi coherence: %.3f" % lsi_coherence)

And I always get values around 0.45, which seems pretty weak.

So I was wondering: how should this coherence value be interpreted? And does it even matter when all I need is similarity between documents already in the index (i.e. every query is a full document taken from the corpus itself)?

Edit: I tried different things for text preprocessing, such as splitting each document into real sentences before feeding the Phrases class, generating bigrams or trigrams, and removing accents or not. In some cases I was able to get a coherence value around 0.55, so I guess coherence can at least help find the most effective way to process the raw data... A rough sketch of the sentence-level variant is below.
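
For reference, the sentence-level variant looks roughly like this (split_into_sentences stands in for whatever sentence splitter is used, it's not a gensim function, and the whitespace split is a simplification of my tokenization):

from gensim.models.phrases import Phrases, Phraser
from gensim.utils import deaccent

# train Phrases on real sentences instead of whole documents
sentences = [deaccent(s).split() for doc in raw_docs for s in split_into_sentences(doc)]
bigram = Phraser(Phrases(sentences, min_count=5, threshold=10.0))
trigram = Phraser(Phrases(bigram[sentences], min_count=5, threshold=10.0))

# then apply both models to the tokenized articles before building the dictionary
texts = [trigram[bigram[doc]] for doc in tokenized_docs]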
