$\begingroup$

I am currently doing a collocational analysis in the Russian National Corpus (more precisely, its national news subcorpus) to find the most significant collocates of the lemma "gay".

For the analysis I am using the "logDice" association measure by Rychlý (2008), which has the following formula:

$$ \text{logDice} = 14 + \log_2 \frac{2(w_1 w_2)}{w_1 + w_2} $$

Where:

  • $w_1 w_2$ = the co-occurrence frequency of the words $x$ and $y$

  • $w_1$ = the frequency of the word $x$ (the lemma)

  • $w_2$ = the frequency of the word $y$ (the collocate)
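As a minimal sketch, the formula above can be written as a small Python function. The function and argument names (`log_dice`, `f_xy`, `f_x`, `f_y`) are illustrative assumptions, not names used by the corpus or by Rychlý (2008).

```python
import math


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """Return 14 + log2(2 * f_xy / (f_x + f_y)).

    f_xy: co-occurrence frequency (w1w2), f_x: lemma frequency (w1),
    f_y: collocate frequency (w2).
    """
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# If two words only ever occur together, the Dice coefficient is 1,
# its logarithm is 0, and the score reaches its maximum of 14.
print(log_dice(50, 50, 50))
```

Any score below 14 therefore reflects how far the pair falls short of always co-occurring.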

This is a sample of the data for the collocational analysis, gathered from the Russian National Corpus:

  lex_1       lex_2 w1w2   w1    w2               dice
1   гей   лесбиянка  256 3035  1000 11.935563044335458
2   гей   бисексуал   56 3035   214 10.632396335625995
3   гей -пропаганда   33 3035    33  10.16087357953928
4   гей трансгендер   40 3035  1125 10.048756281418573
5   гей   -активист   22 3035    25  9.758019438971836
6   гей  пропаганда  109 3035 14989  9.585035580677008

The dice column holds the logDice value as reported by the Russian National Corpus. However, when I recalculate logDice with Rychlý's formula (the "recalculated_logDice" column), this is the result I obtain:

  lex_1       lex_2 w1w2   w1    w2               dice recalculated_logDice
1   гей   лесбиянка  256 3035  1000 11.935563044335458            11.021647
2   гей   бисексуал   56 3035   214 10.632396335625995             9.141575
3   гей -пропаганда   33 3035    33  10.16087357953928             8.461311
4   гей трансгендер   40 3035  1125 10.048756281418573             8.299560
5   гей   -активист   22 3035    25  9.758019438971836             7.880116
6   гей  пропаганда  109 3035 14989  9.585035580677008             7.630553

In other words, my logDice values differ from those reported by the Russian National Corpus. I therefore repeated the calculation using the natural log instead of log2 (the "lnDice" column), which yielded this result:

  lex_1       lex_2 w1w2   w1    w2               dice recalculated_logDice    lnDice
1   гей   лесбиянка  256 3035  1000 11.935563044335458            11.021647 11.935563
2   гей   бисексуал   56 3035   214 10.632396335625995             9.141575 10.632396
3   гей -пропаганда   33 3035    33  10.16087357953928             8.461311 10.160874
4   гей трансгендер   40 3035  1125 10.048756281418573             8.299560 10.048756
5   гей   -активист   22 3035    25  9.758019438971836             7.880116  9.758019
6   гей  пропаганда  109 3035 14989  9.585035580677008             7.630553  9.585036

This matches the values reported by the Russian National Corpus.
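The comparison can be reproduced with a short Python sketch. The frequencies and reported values are copied from the table above; the variable names are illustrative assumptions.

```python
import math

# (collocate, w1w2, w1, w2, value reported by the corpus), copied from the table
rows = [
    ("лесбиянка", 256, 3035, 1000, 11.935563044335458),
    ("бисексуал", 56, 3035, 214, 10.632396335625995),
    ("-пропаганда", 33, 3035, 33, 10.16087357953928),
    ("трансгендер", 40, 3035, 1125, 10.048756281418573),
    ("-активист", 22, 3035, 25, 9.758019438971836),
    ("пропаганда", 109, 3035, 14989, 9.585035580677008),
]

for collocate, f_xy, f_x, f_y, reported in rows:
    dice = 2 * f_xy / (f_x + f_y)      # plain Dice coefficient
    with_log2 = 14 + math.log2(dice)   # base 2, as in the formula above
    with_ln = 14 + math.log(dice)      # natural log
    print(f"{collocate:12} reported={reported:.6f} "
          f"log2={with_log2:.6f} ln={with_ln:.6f}")
```

For every row the `ln` value, not the `log2` value, agrees with the reported one.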

My question is: can log2 be substituted with the natural log? Did I make a mistake in my calculation? I cannot work out why the Russian National Corpus uses the natural log in place of log2 in its calculation of logDice, given the formula in Rychlý (2008).

$\endgroup$
  • $\begingroup$ This would benefit from an explicit reference for Rychlý, 2008. What is the point of the extra 14: just to avoid negative values? $\endgroup$
    – Nick Cox
    Commented Mar 31 at 10:47
  • $\begingroup$ I agree with you. To answer your question: Dice coefficients are typically very small, so their logarithms are negative. Adding 14 shifts the scale so that 14 is the theoretical maximum of logDice. In addition, any value that is still negative indicates a collocation that is not statistically significant (Rychlý, 2008). $\endgroup$
    – pindakazen
    Commented Mar 31 at 11:07

1 Answer

$\begingroup$

As you probably know, frequency data typically follow an approximately log-normal distribution (Winter, 2019), which is the rationale for applying a log transformation at all. In linguistics and psycholinguistics, there is a tendency to use log2 or log10 rather than the natural log. None of these is necessarily better than the others: they just lend themselves to different interpretations (particularly when fitting these values in regressions).

The appeal of log2 and log10 is that their values read directly as orders of magnitude: a one-unit increase corresponds to a doubling (log2) or a tenfold increase (log10). For example, log10 of 10 is 1 and log10 of 1000 is 3 (the number of zeroes), while log2 of 4 is 2. Consequently, log10 compresses the data much more than log2. This guide and pp. 107-115 of Winter (2019) give examples using log10. Interpretations of regressions fitted to natural-log values are similarly described here.
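To make the difference in scale concrete, here is a small Python sketch that prints the same numbers under the three bases. The chosen inputs are arbitrary illustrations.

```python
import math

# Compare the three log bases on a few illustrative values.
for x in (4, 10, 1000):
    print(f"x={x:5d}  log2={math.log2(x):.3f}  "
          f"log10={math.log10(x):.3f}  ln={math.log(x):.3f}")
```

The larger the base, the smaller the resulting value for the same input.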

There is no single correct log transform to use. It really depends on what interpretation you are after. However, log2, log10, and the natural log put values on different scales, so swapping one for another in the formula in your post changes the numbers and how a one-unit difference should be read. The bases differ only by a constant factor ($\ln x = \ln 2 \cdot \log_2 x$), so the ranking of the collocates is unaffected, and if you want to keep the scale of a particular log, a change-of-base calculation converts between them.
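A change of base for this measure could be sketched in Python as follows, assuming the corpus value has the form 14 + ln(Dice), as the numbers in the question suggest. The function name `ln_score_to_log2_score` is a hypothetical helper, not part of any corpus tool.

```python
import math


def ln_score_to_log2_score(score_ln: float, offset: float = 14.0) -> float:
    """Convert offset + ln(D) into offset + log2(D), using log2(D) = ln(D) / ln(2)."""
    return offset + (score_ln - offset) / math.log(2)


# First row of the question's table; compare with its recalculated_logDice column.
print(round(ln_score_to_log2_score(11.935563044335458), 6))
```

Because the conversion is a fixed rescaling around the offset, it changes the values but not the order of the collocates.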

Reference

Winter, B. (2019). Statistics for linguists: An introduction using R (1st ed.). Routledge. https://doi.org/10.4324/9781315165547

$\endgroup$
  • $\begingroup$ Thank you for the input, I find it very useful! Thanks also for the readings! I think I will go with Rychlý's (2008) log to the base 2, so that I can refer to the published formula :) $\endgroup$
    – pindakazen
    Commented Mar 31 at 10:09
