4

I have a question about creating a rationalised vocabulary list that works in tandem with kanji learning. If one wants to learn Japanese, at some point there is no choice but to learn long lists of vocabulary, simply because there is not much to relate to the learner's mother tongue (at least if one's native language is not Chinese). Most kanji books also give lists of vocabulary associated with each kanji, and the average number of words per kanji varies from book to book (say 4 to 20). I therefore asked myself what the optimal number of words to actually learn per kanji would be.

So we have our list of $N$ kanji, and for all $i = 1, \dots, N$, let $m_i$ be the number of words associated with kanji number $i$. One should be careful to make the list without repetitions, while at the same time minimising the average number of words associated with each kanji. So we have arranged all words, relating each to only one kanji, such that the mean

$$\bar{m} = \frac{1}{N} \sum_{i=1}^{N} m_i$$

is minimal (the reason for this condition will become clear later). Then we look for an integer $n$, the number of words we learn per kanji, such that the expectation of knowing an arbitrary word is high, but $n$ is not too high either (for example, if one has to learn 100 words per kanji, this system does not work). For a word $w$ associated with kanji $i$ (let us denote by $A_i$ the set of words associated with kanji $i$, so that $|A_i| = m_i$), the probability of knowing the word is

$$P(w) = \frac{n}{m_i}$$

with obvious abuse of notation. Therefore

$$\sum_{i=1}^{N} \sum_{w \in A_i} P(w) = \sum_{i=1}^{N} m_i \cdot \frac{n}{m_i} = nN$$

and, since the list contains $\sum_{i=1}^{N} m_i = N\bar{m}$ words in total, one would know approximately $n/\bar{m}$ of the vocabulary. So $n$ is approximately equal to the average $\bar{m}$, which explains why we had to minimise this quantity.
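
As a quick numerical check of this estimate, here is a minimal Python sketch. The function name `expected_coverage` and the sample counts are made up for illustration; they are not taken from any real kanji book. It assumes that for a kanji with fewer than n words one simply learns all of them, which is why n divided by the mean number of words per kanji is only an approximation.

```python
from typing import Sequence


def expected_coverage(words_per_kanji: Sequence[int], n: int) -> float:
    """Fraction of the vocabulary known when learning up to n words per kanji.

    words_per_kanji[i] plays the role of m_i. A kanji with fewer than n
    words contributes all of its words and nothing more.
    """
    total_words = sum(words_per_kanji)
    if total_words == 0:
        raise ValueError("the word list is empty")
    known = sum(min(n, m) for m in words_per_kanji)
    return known / total_words


# Invented counts for five kanji (illustrative only, not real data).
m = [2, 4, 6, 8, 20]
mean_m = sum(m) / len(m)

print(expected_coverage(m, 4))  # exact fraction for this toy list
print(4 / mean_m)               # the n / mean estimate
```

On this toy list the exact fraction is 0.45 while the estimate gives 0.5; the gap comes from the one kanji that has fewer than 4 words.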

Therefore my question is two-sided: what is the minimal value of the mean number of words per kanji, $\bar{m}$? Indeed, if we compare a simple kanji like 人, which has more than 2000 matches, with some rare kanji that has only one or two occurrences, it is not a priori obvious whether this quantity ($\bar{m}$) is large (> 100, say) or not. I should add that if we consider all the words in the Japanese language, this system is obviously unrealistic, but for smaller ambitions it could actually be useful, provided the number is not too large. One could for example think of the kanji and vocabulary lists for standardized international tests.

7
  • 4
    We don't have MathJax here, I'm afraid. Here is an image of what the code above looks like rendered on MO: i.sstatic.net/PjYog.png
    – user1478
    Commented Oct 21, 2016 at 16:21
  • The inherent flaw is the assumption that the majority of the words in the dictionary, or in those word lists, are actually useful. I argue that most are not. It is very hard to put a number on any kind of usefulness. Commented Oct 21, 2016 at 16:28
  • 5
    First time I've ever seen math used to learn a language.
    – KyloRen
    Commented Oct 22, 2016 at 5:09
  • I don't know if I understand you correctly, but how many kanji must you learn before learning 挨 and 拶, which are effectively only used in Japanese in the word 挨拶, which is in turn among the commonest words? Commented Oct 22, 2016 at 6:27
  • @GoBusto : thanks for the edit, too bad there is no MathJax here. @ broccoli forest : I don't know, because I think that these kanji are not even part of the official 2141 kanji list. So these kanji are to be learnt after the official list. And once someone gets to this level, it is possible to read just about any text (well, at least newspapers...) and learn vocabulary in a more natural way than by learning big lists. Commented Oct 22, 2016 at 9:27

1 Answer

4

I've long been wanting to create a list of kanji (say, all jōyō kanji) ordered by weight, where the weight is determined by the frequency of the kanji itself and by the weight of all those kanji in which this kanji appears as a radical. (Also see the end of this post.)
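
A minimal Python sketch of such a weight, under stated assumptions: the frequencies below are invented, the `CONTAINED_IN` table (which kanji directly contain a given kanji as a component) is a hypothetical hand-made example, and the containment relation is assumed to have no cycles. Real data would have to come from a corpus and a dictionary.

```python
from functools import lru_cache

# Invented corpus frequencies (illustrative only).
FREQUENCY = {"日": 100, "月": 60, "明": 30, "盟": 5}

# Hypothetical table: for each kanji, the kanji that directly contain it.
CONTAINED_IN = {"日": ["明"], "月": ["明"], "明": ["盟"], "盟": []}


@lru_cache(maxsize=None)
def weight(kanji: str) -> int:
    """Own frequency plus the weights of the kanji that contain this one."""
    own = FREQUENCY.get(kanji, 0)
    return own + sum(weight(parent) for parent in CONTAINED_IN.get(kanji, []))


# Kanji ordered from heaviest to lightest.
ordered = sorted(FREQUENCY, key=weight, reverse=True)
print(ordered)
```

Because each weight includes the weights of the containing kanji, a component such as 日 ends up ahead of the characters built from it, which is the ordering one would want for learning.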

If the average person has a vocabulary of at least 30,000 words, a list of words covering a larger range (say 95%) of people is obviously much larger than this. Maybe more like 50,000–70,000.

Even though a good portion of words has no (or only obsolete) kanji representations, it seems impractical to try to create a reasonably complete vocabulary list of 50,000 words, or even an "average" list of 30,000. (This would require around 15–25 words per kanji.)

Still, you could choose a number (say 5) and create a list of 10,000 words (kanji by kanji, in the order of their "weights") by choosing the five most common words containing a particular kanji, eliminating words that you chose previously.

This seems to be a reasonable way to learn kanji, as well as to create vocabulary lists that are more tuned to a gradually increasing knowledge of kanji. (Otherwise pure frequency lists would do the job.)

I also think it gives you what you want, because the resulting vocabulary list of 10,000 words is essentially just a reordering of the same list ordered by frequency. (This is not entirely true, as there are extremely rare kanji, like 訃 (the least frequent kanji in BCCWJ), which probably don't appear in 5 of the top 10,000 words. It's basically a reordering with holes at those words whose kanji have already been represented at least 5 times in the list by more frequent words, the holes being patched by less frequent words whose kanji haven't been represented yet.)
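
The selection procedure described above could be sketched roughly as follows. This is only an illustration: the function name `build_word_list`, the toy word frequencies and the kanji order are all invented, and a word is treated as "containing" a kanji by a simple substring test.

```python
from typing import Dict, List


def build_word_list(kanji_by_weight: List[str],
                    word_frequency: Dict[str, int],
                    per_kanji: int = 5) -> List[str]:
    """For each kanji in weight order, take its most frequent unused words."""
    # Most frequent words first, so each kanji picks its commonest words.
    ranked = sorted(word_frequency, key=lambda w: word_frequency[w], reverse=True)
    chosen: List[str] = []
    seen = set()
    for kanji in kanji_by_weight:
        picked = 0
        for word in ranked:
            if picked == per_kanji:
                break
            # Skip words already chosen for an earlier kanji.
            if kanji in word and word not in seen:
                chosen.append(word)
                seen.add(word)
                picked += 1
    return chosen


# Toy data (invented frequencies), with two words per kanji instead of five.
toy_frequency = {"日本": 90, "本": 80, "日本語": 70, "今日": 60, "毎日": 50, "語る": 10}
toy_list = build_word_list(["日", "本", "語"], toy_frequency, per_kanji=2)
print(toy_list)
```

In the toy run 本 only gets one word, because its other candidates were already taken by 日; rare kanji would end up with fewer than the chosen number of words in the same way.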


There are several ways in which the weights could be adapted to account for more than just frequency:

  1. One could also take into account the stage at which individual kanji are taught in Japanese schools in order to optimize the list not only for usage (in newspapers/books), but also exposure in existing learning materials, which are mostly based on the kyōiku kanji.

  2. I would also like to associate to each kanji all kanji which contain the same phonetic component, even if the on'yomi is slightly different, like 番 幡 or 読 続. One could also consider that such kanji are easier to learn in groups and thus allow for the weights to be increased.

  3. One might also give a small preference to pictographs over ideographs etc.

(I have started this and if I were only half as tech-savvy as the people here on Stack Exchange, I might have finished already.)

3
  • @ Earthliŋ : Thank you very much for this interesting and thorough answer. But how would you actually create this list? I don't know where to find this kind of data or how to process it efficiently. Commented Oct 22, 2016 at 13:18
  • Frequency data is available in corpora, for example the Balanced Corpus of Contemporary Written Japanese kotonoha.gr.jp/shonagon. Reliable information on phonetic components or classification into pictographs/ideographs/etc. would probably have to be taken from dictionaries. The list of kyōiku kanji I have linked in my answer.
    – Earthliŋ
    Commented Oct 22, 2016 at 14:48
  • Thank you very much, I think there is still substantial work to do but I have now enough information about my question to accept your answer. Commented Oct 22, 2016 at 15:41
