I have a question about building a rationalised vocabulary list that works hand in hand with kanji learning. If one wants to learn Japanese, at some point there is no choice but to learn long lists of vocabulary, simply because there is little to relate to the learner's mother tongue (at least if one's native language is not Chinese). Most kanji books also give a list of vocabulary associated with each kanji, and the average number of words per kanji varies from book to book (say from 4 to 20). I therefore asked myself what the optimal number of words to actually learn per kanji would be.
So we have a list of N kanji and, for each i = 1, ..., N, let m_i be the number of words associated with kanji number i. One should be careful to build the list without repetitions, while at the same time minimising the average number of words associated with each kanji. So suppose we have arranged all the words, relating each one to exactly one kanji, such that the mean

$$\bar{m} = \frac{1}{N} \sum_{i=1}^{N} m_i$$
is minimal (the reason for this condition will become clear later). We then look for an integer n, the number of words learned per kanji, such that the probability of knowing an arbitrary word is high while n itself is not too large (for example, if one has to learn 100 words per kanji, the system does not work). For a word w associated with kanji i (let A_i denote the set of words associated with kanji i), the probability of knowing the word is

$$P(w \text{ is known} \mid w \in A_i) = \frac{n}{m_i}$$
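with an obvious abuse of notation. Therefore, for a word w drawn uniformly from the whole list,

$$P(w \text{ is known}) = \sum_{i=1}^{N} \frac{m_i}{\sum_{j=1}^{N} m_j} \cdot \frac{n}{m_i} = \frac{nN}{\sum_{j=1}^{N} m_j} = \frac{n}{\bar{m}}$$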
and one would know approximately a fraction n/m̅ of the vocabulary. To know most of the vocabulary, n must therefore be approximately equal to the average m̅, which explains why we had to minimise this quantity.
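To make the estimate concrete, here is a minimal Python sketch, not a definitive implementation. It assumes the word list is given as a dictionary mapping each word to the single kanji it has been assigned to; the function name `coverage` and the toy data are hypothetical. It computes m̅ together with the fraction of words known when n words are learned per kanji, capping n at m_i for kanji that have fewer than n words (this cap is what the n/m_i notation glosses over).

```python
from collections import Counter


def coverage(assignment, n):
    """Return (mean words per kanji, fraction of words known).

    assignment: dict mapping each word to exactly one kanji (assumed input format).
    n: number of words learned per kanji.
    """
    # m_i for every kanji that has at least one word assigned to it
    sizes = Counter(assignment.values())
    total_words = sum(sizes.values())
    mean = total_words / len(sizes)
    # For kanji i we can learn at most m_i words, hence min(n, m_i)
    known = sum(min(n, m_i) for m_i in sizes.values())
    return mean, known / total_words


# Toy data, invented for illustration only
toy = {
    "人口": "人", "人間": "人", "大人": "人",
    "日本": "日", "毎日": "日",
    "山脈": "脈",
}

mean, fraction = coverage(toy, n=2)
print(mean, fraction)
```

When n does not exceed any m_i, the returned fraction coincides with n/m̅; otherwise n/m̅ overestimates it, as in the toy data where one kanji has a single word.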
Therefore my question is two-sided: what is the minimal value of m̅? Indeed, if we compare a simple kanji like 人, which has more than 2000 matches, with a rare kanji that has only one or two occurrences, it is not a priori obvious whether this quantity m̅ is large (say > 100) or not. I should add that, obviously, if we consider all the words in the Japanese language, this system is unrealistic; for smaller ambitions, however, it could actually be useful provided m̅ is not too large. One could think, for example, of the kanji and vocabulary lists for standardised international tests.