Hierarchical Dirichlet Process Gensim topic number independent of corpus size

Question

I am using the Gensim HDP module on a set of documents.

>>> hdp = models.HdpModel(corpusB, id2word=dictionaryB)
>>> topics = hdp.print_topics(topics=-1, topn=20)
>>> len(topics)
150
>>> hdp = models.HdpModel(corpusA, id2word=dictionaryA)
>>> topics = hdp.print_topics(topics=-1, topn=20)
>>> len(topics)
150
>>> len(corpusA)
1113
>>> len(corpusB)
17

Why is the number of topics independent of corpus length?

Roko Mijic · Accepted Answer · 2017-06-06 15:20:16Z

@aaron's code is broken due to gensim API changes. I rewrote and simplified it as follows. It works as of June 2017 with gensim v2.1.0.

import pandas as pd

def topic_prob_extractor(gensim_hdp):
    # Each entry is (topic_id, [(word, weight), ...])
    shown_topics = gensim_hdp.show_topics(num_topics=-1, formatted=False)
    topics_nos = [topic_id for topic_id, _ in shown_topics]
    # Sum the weights of the words returned for each topic
    weights = [sum(weight for _, weight in words) for _, words in shown_topics]

    return pd.DataFrame({'topic_id': topics_nos, 'weight': weights})

Just saw this. Thanks for the refactor, @Roko
– aaron
Commented Feb 14, 2019 at 11:49

I tried this and I get an empty dataframe. Unfortunately, I can't post the data because it is proprietary. Has anyone seen this before?
– John Doe
Commented Apr 8, 2020 at 20:48
@JohnDoe, I did, but then got a non-empty DataFrame after replacing the -1 with my actual number of topics
– Ghillie Dhu
Commented Nov 30, 2020 at 23:12

Rafs · Accepted Answer · 2018-01-19 18:33:17Z

@aaron's and @Roko Mijic's approaches neglect the fact that show_topics returns only the top 20 words of each topic by default. If you return all the words that compose a topic, every approximated topic probability will be 1 (or 0.999999), because each topic's word weights sum to 1. I experimented with the following code, which is an adaptation of @Roko Mijic's:

import pandas as pd

def topic_prob_extractor(gensim_hdp, t=-1, w=25, isSorted=True):
    """
    Input the gensim model to get the rough topics' probabilities
    """
    # Each entry is (topic_id, [(word, weight), ...])
    shown_topics = gensim_hdp.show_topics(num_topics=t, num_words=w, formatted=False)
    topics_nos = [topic_id for topic_id, _ in shown_topics]
    weights = [sum(weight for _, weight in words) for _, words in shown_topics]
    df = pd.DataFrame({'topic_id': topics_nos, 'weight': weights})
    if isSorted:
        return df.sort_values(by="weight", ascending=False)
    return df

A better approach, although I'm not sure it is 100% valid, is the one mentioned here. You can get the topics' true weights (the alpha vector) of the HDP model as:

alpha = hdpModel.hdp_to_lda()[0]

Examining the topics' equivalent alpha values is more logical than tallying up the weights of the first 20 words of each topic to approximate its probability of usage in the data.
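As a rough sketch of how the alpha vector could be used, the helper below normalises the values and keeps only topics above a cutoff. The function name, the `min_share` cutoff, and the example numbers are made up for illustration, and treating normalised alpha as a relative topic share is an assumption rather than something gensim documents.

```python
def rank_topics_by_alpha(alpha, min_share=0.01):
    """Return (topic_id, share) pairs, largest first, keeping shares >= min_share.

    `alpha` is any sequence of non-negative numbers, one per topic.
    """
    total = float(sum(alpha))
    if total == 0:
        return []
    # Normalise so the values can be compared as relative shares
    shares = [(topic_id, value / total) for topic_id, value in enumerate(alpha)]
    kept = [pair for pair in shares if pair[1] >= min_share]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Made-up alpha values; with a trained model they would come from
# hdpModel.hdp_to_lda()[0] as in the answer above.
example_alpha = [0.50, 0.02, 0.30, 0.001, 0.179]
for topic_id, share in rank_topics_by_alpha(example_alpha):
    print(topic_id, round(share, 3))
```

Topics whose share falls below the cutoff are dropped, which gives a shorter list than the full truncation level.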

Farhood ET · Accepted Answer · 2020-09-07 08:21:29Z

There is apparently a bug in Gensim (version 3.8.3) in which passing -1 to show_topics doesn't return anything at all. So I have tweaked the answers by Roko Mijic and aaron to pass the model's truncation level, m_T, instead.

import pandas as pd

def topic_prob_extractor(gensim_hdp):
    # Ask for m_T topics (the model's top-level truncation) instead of -1
    shown_topics = gensim_hdp.show_topics(num_topics=gensim_hdp.m_T, formatted=False)
    topics_nos = [topic_id for topic_id, _ in shown_topics]
    # Sum the weights of the words returned for each topic
    weights = [sum(weight for _, weight in words) for _, words in shown_topics]
    return pd.DataFrame({'topic_id': topics_nos, 'weight': weights})
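Since the behaviour of -1 seems to differ between gensim versions, one option is to try -1 first and fall back to m_T only when the result is empty. The sketch below assumes the model exposes show_topics(num_topics, num_words, formatted) and an m_T attribute, as used in the answers here; the helper names are hypothetical.

```python
def show_all_topics(hdp_model, num_words=20):
    """Return the unformatted show_topics output for every topic.

    Tries num_topics=-1 first and falls back to the model's truncation
    level (m_T) if that comes back empty.
    """
    topics = hdp_model.show_topics(num_topics=-1, num_words=num_words, formatted=False)
    if not topics:
        topics = hdp_model.show_topics(
            num_topics=hdp_model.m_T, num_words=num_words, formatted=False
        )
    return topics


def topic_weight_sums(shown_topics):
    """Map each topic id to the summed weight of its returned words."""
    return {
        topic_id: sum(weight for _, weight in words)
        for topic_id, words in shown_topics
    }
```

With a trained model this would be called as topic_weight_sums(show_all_topics(hdp)), and the resulting dict can be passed to pd.DataFrame if a table is preferred.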

aaron · Accepted Answer · 2016-04-24 04:32:34Z

@user3907335 is exactly correct here: HDP will calculate as many topics as the assigned truncation level. However, it may be the case that many of these topics have basically zero probability of occurring. To help with this in my own work, I wrote a handy little function that performs a rough estimate of the probability weight associated with each topic. Note that this is a rough metric only: it does not account for the probability associated with each word. Even so, it provides a pretty good metric for which topics are meaningful and which aren't:

import pandas as pd
import numpy as np 

def topic_prob_extractor(hdp=None, topn=None):
    topic_list = hdp.show_topics(topics=-1, topn=topn)
    topics = [int(x.split(':')[0].split(' ')[1]) for x in topic_list]
    split_list = [x.split(' ') for x in topic_list]
    weights = []
    for lst in split_list:
        sub_list = []
        for entry in lst: 
            if '*' in entry: 
                sub_list.append(float(entry.split('*')[0]))
        weights.append(np.asarray(sub_list))
    sums = [np.sum(x) for x in weights]
    return pd.DataFrame({'topic_id' : topics, 'weight' : sums})

I assume that you already know how to calculate an HDP model. Once you have an hdp model calculated by gensim you call the function as follows:

topic_weights = topic_prob_extractor(hdp, 500)

Note that this answer is no longer current due to changes in gensim: see @Roko's answer below (stackoverflow.com/a/44393919/2074981). — ASGM, Commented May 3, 2019 at 12:48

Karup · Accepted Answer · 2015-12-12 09:48:00Z

I think you misunderstood the operation performed by the called method. Directly from the documentation you can see:

Alias for show_topics() that prints the top n most probable words for topics number of topics to log. Set topics=-1 to print all topics.

You trained the model without specifying the truncation level on the number of topics, and the default is 150. Calling print_topics with topics=-1 returns the top 20 words for every topic, which in your case means 150 topics regardless of corpus size.

I'm still new to the library, so maybe I'm wrong.
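To see why a fixed truncation level always yields the same number of topics, here is a small standalone illustration of truncated stick-breaking, the construction that truncated HDP approximations are built on. It is not gensim's inference code; the function name, concentration value, and seed are arbitrary choices for the example.

```python
import random


def truncated_stick_breaking(truncation=150, concentration=1.0, seed=0):
    """Draw topic weights from a stick-breaking process cut off at `truncation`."""
    rng = random.Random(seed)
    weights = []
    remaining = 1.0  # length of stick not yet assigned to a topic
    for k in range(truncation):
        # The last topic takes whatever is left so the weights sum to 1
        if k == truncation - 1:
            fraction = 1.0
        else:
            fraction = rng.betavariate(1.0, concentration)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    return weights


weights = truncated_stick_breaking()
print(len(weights))                         # always the truncation level
print(sum(1 for w in weights if w > 0.01))  # topics with non-negligible weight
```

The list always has as many entries as the truncation level, but most of the probability mass sits in a handful of them, which is why the count of topics alone says little about how many are actually used.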

answered Dec 12, 2015 at 8:51 by p_mesh; edited Dec 12, 2015 at 9:48 by Karup

While I suspect you are right, this doesn't make sense. HDP should infer the number of topics. It doesn't make sense that it always goes to the maximum it can, especially given 2 corpuses with such a size difference. If it always goes to the max, it is basically useless for what it was supposed to do, I can just use LDA with N=150
– Makers_F
Commented Dec 29, 2015 at 9:08

user3907335 · Accepted Answer · 2015-08-06 05:35:16Z

3

I haven't used gensim for HDPs, but is it possible that most of the topics in the smaller corpus have an extremely low probability of occurring? Can you try printing the topic probabilities? The length of the topics array doesn't necessarily mean that all those topics were actually found in the corpus.

nope they are all in topics regardless of probability
– Sam Weisenthal
Commented Dec 6, 2015 at 20:12

score 0 · Accepted Answer · 2020-10-17 18:56:09Z

Deriving the average coherence of HDP topics from their coherence at the individual text level is a way to order (and potentially truncate) them. The following function does just that:

def order_subset_by_coherence(dirichlet_model, bow_corpus, num_topics=10, num_keywords=10):
    """
    Orders topics based on their average coherence across the corpus

    Parameters
    ----------
        dirichlet_model : gensim.models.hdpmodel.HdpModel
        bow_corpus : list of lists (contains (id, freq) tuples)
        num_topics : int (default=10)
        num_keywords : int (default=10)

    Returns
    -------
        ordered_topics: list of lists containing topic tokens
    """
    shown_topics = dirichlet_model.show_topics(num_topics=150, # return all topics
                                               num_words=num_keywords,
                                               formatted=False)
    model_topics = [[word[0] for word in topic[1]] for topic in shown_topics]
    topic_corpus = dirichlet_model.__getitem__(bow=bow_corpus, eps=0) # cutoff probability to 0 

    topics_per_response = [response for response in topic_corpus]
    flat_topic_coherences = [item for sublist in topics_per_response for item in sublist]

    significant_topics = list(set([t_c[0] for t_c in flat_topic_coherences])) # those that appear
    topic_averages = [sum([t_c[1] for t_c in flat_topic_coherences if t_c[0] == topic_num]) / len(bow_corpus) \
                      for topic_num in significant_topics]

    topic_indexes_by_avg_coherence = [tup[0] for tup in sorted(enumerate(topic_averages), key=lambda i:i[1])[::-1]]
    significant_topics_by_avg_coherence = [significant_topics[i] for i in topic_indexes_by_avg_coherence]
    ordered_topics = [model_topics[i] for i in significant_topics_by_avg_coherence][:num_topics] # truncate if desired

    return ordered_topics

A version of this function that also outputs the average coherences associated with the topics, for keyword (tag) generation for a corpus, can be found in this answer. A similar process for generating keywords for individual texts can be found in this answer.
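The averaging step in the function above can also be sketched without gensim, in a single pass over the per-document topic lists. The input here is assumed to have the same shape as what the model returns for each document, a list of (topic_id, probability) pairs; the function name and sample data are hypothetical.

```python
from collections import defaultdict


def average_topic_weights(doc_topics):
    """Average each topic's per-document weight over all documents.

    `doc_topics` is an iterable with one list of (topic_id, probability)
    pairs per document. Returns (topic_id, average) pairs, largest first.
    """
    doc_topics = list(doc_topics)
    num_docs = len(doc_topics)
    if num_docs == 0:
        return []
    totals = defaultdict(float)
    for doc in doc_topics:
        for topic_id, prob in doc:
            totals[topic_id] += prob
    # Topics missing from a document contribute zero to its average
    averages = {topic_id: total / num_docs for topic_id, total in totals.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Two made-up documents: topic 3 dominates, topic 0 appears once
sample = [[(0, 0.5), (3, 0.5)], [(3, 1.0)]]
print(average_topic_weights(sample))
```

Topics that never appear in any document are absent from the result, so the output doubles as a list of the topics that are actually in use.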
