
In the context of my thesis, I have created a corpus in order to compare the use of "z" vs "s" in specific words like organisation, organised, recognise, authorise, etc. For each of those words I have calculated the percentage of "z" spellings, so every percentage shows the proportion of "z" uses within my corpus for that word. Then I used another corpus (via Sketch Engine) and found the percentage of "z" for each of the same words. Could you tell me exactly which test I should conduct in order to find out whether there is a statistically significant difference between the two corpora? A t-test? A chi-squared test? A z-test? Something else? Furthermore, should I consider the two samples as independent or not? I already know that most people avoid the t-test for percentages/proportions. Please keep in mind that my corpus is about 400,000 words and the other corpus about 100 million words, and that, according to Shapiro-Wilk and Anderson-Darling tests, my samples are not normally distributed.

My corpora consist of ordinary texts: articles, administrative documents, newspapers, etc. I am still considering which option is best. Should I calculate the total number of "s" uses (organise, organisation, authorise, recognise, etc.) and "z" uses respectively, and then create a simple 2×2 table in order to conduct a chi-squared test, as follows?

| | "s" uses | "z" uses |
|---|---|---|
| My corpus | 1250 | 3245 |
| Sketch Engine | 352000 | 890000 |
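To make this first option concrete, here is a minimal Python sketch of the 2×2 test I have in mind. It assumes SciPy is available (my choice of tool, nothing prescribed) and uses the pooled counts given above. I switched off the continuity correction, which is also just an assumption on my part.

```python
from scipy.stats import chi2_contingency

# Rows: my corpus, Sketch Engine corpus; columns: "s" uses, "z" uses.
# Counts are the pooled totals quoted in the question.
table = [[1250, 3245],
         [352000, 890000]]

# correction=False disables the Yates continuity correction (an assumption).
chi2, p_value, dof, expected = chi2_contingency(table, correction=False)

# Share of "z" spellings in each corpus, for reference.
z_share_mine = 3245 / (1250 + 3245)
z_share_ref = 890000 / (352000 + 890000)

print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.3f}")
print(f"z share: mine = {z_share_mine:.4f}, reference = {z_share_ref:.4f}")
```

Note that this pools all words together, so it ignores any word-to-word differences in the "z" share.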

Or should I compute the "z" percentage for each individual word and conduct, for example, a Wilcoxon-Mann-Whitney test in order to find which corpus has a tendency towards "z" uses?
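As a rough sketch of this second option, again assuming SciPy: the per-word percentages below are made-up placeholders, not my real data, and the variable names are only illustrative. The test as written treats the two sets of percentages as independent samples, which is exactly the point I am unsure about.

```python
from scipy.stats import mannwhitneyu

# HYPOTHETICAL per-word "z" percentages (placeholders, not real results).
# Each position stands for one word, e.g. organise, organisation, ...
z_pct_mine = [72.0, 68.5, 75.1, 70.3]
z_pct_ref = [71.2, 69.9, 73.4, 70.8]

# Two-sided Wilcoxon-Mann-Whitney test, treating the samples as independent.
u_stat, p_value = mannwhitneyu(z_pct_mine, z_pct_ref, alternative="two-sided")

print(f"U = {u_stat}, p = {p_value:.3f}")
```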

Thank you in advance.

  • Welcome to Cross Validated, vigilantius22 :-). Do your corpora contain only the words that you are monitoring (organisation, recognise, etc.), so that you have 100 million and 400 thousand such words, respectively? And are you interested in comparing the percentages of "z" variants between the two groups for each of these words, or just in total?
    – Ute
    Commented Dec 14, 2023 at 13:05