What do you know if you learned the most frequent Chinese characters?

By Confused Laowai | Date: August 20th, 2012 | Category: Experiments

I have recently developed an interest in Chinese characters and statistics. This interest comes from my research into Chinese orthography and spaced repetition systems for my Master’s degree, for instance the subconscious effect of radicals on the recall of Chinese characters. The other day a question popped into my head about the ratio between Chinese character knowledge and Chinese vocabulary knowledge.

As you might know, most Chinese words are formed by combining two characters. So it follows that your character knowledge produces even more vocabulary knowledge. For instance, knowing the characters 饭, 商 and 店 not only gives you three single-character words, but also two more words: 饭店 and 商店.

Now, this beneficial relationship between character knowledge and vocabulary knowledge isn’t always that apparent. We don’t always learn all the words afforded by the characters we know; we just don’t have the time and resources to check all these relationships. It also gets very interesting as character knowledge grows. The growth essentially starts to look exponential: any new character you learn has the opportunity to form words with any of the previous ones.

Out of curiosity I decided to play around with the most frequent Chinese characters and see which words, and how many, can be formed from them.

How it was done

I used some code from HanziJS to do dictionary lookups using the open-source CC-CEDICT dictionary. I then used Jun Da’s character frequency list as the source for the characters.

With some terrible coding skills I then computed all possible two-, three- and four-character combinations and looked each one up in the dictionary to see whether such a word existed. I also counted the single characters as words themselves.
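
The brute-force idea can be sketched roughly like this. It is not the original code from the experiment: the function names and the tiny `dictionary` set are hypothetical stand-ins for the CC-CEDICT headwords, and the `Set` lookup is a simplification chosen to keep the sketch short.

```javascript
// Hypothetical stand-in for the CC-CEDICT headwords (the real file has over 100,000 entries).
const dictionary = new Set(["饭", "商", "店", "饭店", "商店"]);

// Build every string of length 1..maxLen from the given characters (repeats allowed).
function allCombinations(chars, maxLen) {
  const result = [];
  let current = [""];
  for (let len = 1; len <= maxLen; len++) {
    const next = [];
    for (const prefix of current) {
      for (const ch of chars) next.push(prefix + ch);
    }
    for (const candidate of next) result.push(candidate);
    current = next;
  }
  return result;
}

// Keep only the candidates that exist as dictionary entries.
function findWords(chars, maxLen, dict) {
  return allCombinations(chars, maxLen).filter((candidate) => dict.has(candidate));
}

const found = findWords(["饭", "商", "店"], 4, dictionary);
console.log(found);
```

For n characters this generates n + n² + n³ + n⁴ candidates before any pruning, which is 11,110 for n = 10 and 101,010,100 for n = 100. Holding them all in memory, as this sketch does, only makes sense for small character lists.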

At first I only ran tests with combinations of up to two characters. The lookups went quite quickly. Then I increased it to three- and four-character combinations. It slowed down dramatically, especially as the number of characters increased. I contemplated only looking at two-character combinations, but then I might miss some opportunities for chengyu to appear.

So I decided to apply an assumption to reduce the number of lookups:

Assumption: Words with three repetitions of the same character next to each other are discarded. For example: “得得得x”, where ‘x’ is any other character from the list. In my four and a half years of Chinese study I have not encountered words in which a character repeats three times. The same can thus be said for ‘x得得得’.
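
One way to express this assumption in code is a small predicate that rejects a candidate before it is looked up. This is only a sketch; `hasTripleRepeat` is a hypothetical helper name, not something taken from the original script.

```javascript
// True when the same character appears three times in a row,
// e.g. "得得得人" or "人得得得".
function hasTripleRepeat(candidate) {
  const chars = Array.from(candidate); // split by code point
  for (let i = 0; i + 2 < chars.length; i++) {
    if (chars[i] === chars[i + 1] && chars[i] === chars[i + 2]) return true;
  }
  return false;
}

// Candidates that fail the check can be skipped without a dictionary lookup.
const candidates = ["得得得人", "人得得得", "得得人得", "不在了"];
const toLookUp = candidates.filter((c) => !hasTripleRepeat(c));
console.log(toLookUp);
```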

But that wasn’t very efficient. At the 25-character level, for instance, it only removed a few hundred lookups.

Something needed to change. CC-CEDICT has over 100,000 entries, and I did not need to look at all of them to see if a word combination exists. So I manually checked the file and created indexes for each character. For instance, there are very few dictionary entries starting with ‘的’. Only 11!

This reduced the lookup times by a major factor.
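
The indexes were made by hand, but the same idea can be sketched in code. The version below groups headwords by their first character so that a lookup only scans one small bucket. The sample `headwords` array and both function names are hypothetical.

```javascript
// Hypothetical sample of dictionary headwords (the real CC-CEDICT has over 100,000).
const headwords = ["的", "的话", "的确", "不", "不在了", "人", "他人", "我人"];

// Group headwords by their first character.
function buildFirstCharIndex(words) {
  const index = new Map();
  for (const word of words) {
    const first = Array.from(word)[0];
    if (!index.has(first)) index.set(first, []);
    index.get(first).push(word);
  }
  return index;
}

// Check a candidate against the bucket for its first character only.
function lookup(index, candidate) {
  const bucket = index.get(Array.from(candidate)[0]) || [];
  return bucket.includes(candidate);
}

const index = buildFirstCharIndex(headwords);
console.log(lookup(index, "的确"), lookup(index, "的人"));
```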

For the curious coders out there, I used Node.js. Also at the end of the post I’ve added a Google Doc link to all the words per category that were generated, along with their meanings.

I should also note that each lookup only produces one dictionary entry, even though some words have two or more entries in CC-CEDICT. I chose to stop at one because I want to compare character knowledge with possible word combinations, not necessarily with all the different meanings.

The Short Graph Version

[Graph: number of characters by frequency and how many words they form]

[Graph: character-to-word ratio]

[Graph: number of words by number of characters per word]

The Long In-Depth Version

10 Characters

Characters: 的,一,是,不,了,在,人,有,我,他

Total Dictionary Lookups (with assumption): 11,010

Total Dictionary Entries Found: 25

Percentage of Entries Found: 0.23%

Ratio of characters to words: 1:2.5

Total Computing Time Before Index: <1min

Total Computing Time After Index: <1sec

I did not expect very interesting results from only 10 characters. Most of the first 10 characters are grammatical characters and pronouns. Some interesting words that came out were 不在了 (to be dead/to have passed away), 我人 (we) & 他人 (another/sb else/other people). The latter two are the first time I’ve seen 人 used in the same way as 们. Quite cool.

25 Characters

Characters: 的,一,是,不,了,在,人,有,我,他,这,个,们,中,来,上,大,为,和,国,地,到,以,说,时

Total Dictionary Lookups (with assumption): 406,275

Total Dictionary Entries Found: 106

Percentage of Words Found: 0.026%

Ratio of characters to words: 1:4.24

Total Computing Time Before Index: 28min

Total Computing Time After Index: 8secs

Immediately the exponential nature is apparent. We gained quite a few more words with the increase from 10 to 25 characters: the additional 15 characters created 81 more words. The first four-character dictionary entries arrived: 中国人大 (China’s National People’s Congress) & 大有人在 (there are plenty such people). They aren’t idioms yet, though.

50 Characters

Characters: 的,一,是,不,了,在,人,有,我,他,这,个,们,中,来,上,大,为,和,国,地,到,以,说,时,要,就,出,会,可,也,你,对,生,能,而,子,那,得,于,着,下,自,之,年,过,发,后,作,里

Total Dictionary Lookups (with assumption): 6,375,050

Total Dictionary Entries Found: 326

Percentage of Words Found: 0.005%

Ratio of characters to words: 1:6.52

Total Computing Time Before Index: 7 hours

Total Computing Time After Index: 1min 30 secs

Ooh boy. The number of possible combinations is surely hitting massive proportions now. I did not expect 50 characters to take this long to compute. Luckily the second time around it went by a lot quicker.

100 Characters

Characters: 的,一,是,不,了,在,人,有,我,他,这,个,们,中,来,上,大,为,和,国,地,到,以,说,时,要,就,出,会,可,也,你,对,生,能,而,子,那,得,于,着,下,自,之,年,过,发,后,作,里,用,道,行,所,然,家,种,事,成,方,多,经,么,去,法,学,如,都,同,现,当,没,动,面,起,看,定,天,分,还,进,好,小,部,其,些,主,样,理,心,她,本,前,开,但,因,只,从,想,实

Total Dictionary Lookups (with assumption): 101,000,100

Total Dictionary Entries Found: 1128

Percentage of Words Found: 0.001%

Ratio of characters to words: 1:11.28

Total Computing Time Before Index: NA

Total Computing Time After Index: 31min

Things seem to have escalated quickly. I did 101,000,100 (yes, that’s over 100 million!) dictionary lookups. With some more heuristics about character combinations I could’ve brought that number down, but I’m not a pro at natural language processing just yet, and it would need quite a bit of analysis.

Besides the big numbers involved, I’m glad we finally crossed the 1000 word mark.

Furthermore, the ratio of characters to words went up by 73% from 50 to 100 characters. From 25 to 50 the increase was closer to 53.8%. It would be interesting to see this ratio played out across higher ranges (200 characters, 400 characters, 800 characters) to see if there is a similar exponential trend.
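
The ratios and growth figures can be recomputed from the totals reported in each section. The sketch below only does that arithmetic; the variable names are made up for illustration.

```javascript
// Figures reported above: [characters, dictionary lookups, entries found].
const runs = [
  [10, 11010, 25],
  [25, 406275, 106],
  [50, 6375050, 326],
  [100, 101000100, 1128],
];

// Words per character, and the share of lookups that hit a dictionary entry.
const stats = runs.map(([chars, lookups, found]) => ({
  chars,
  ratio: found / chars,
  hitPercent: (found / lookups) * 100,
}));

// Relative growth of the words-per-character ratio between consecutive runs.
const growth = stats.slice(1).map((s, i) => ({
  from: stats[i].chars,
  to: s.chars,
  percent: (s.ratio / stats[i].ratio - 1) * 100,
}));

for (const g of growth) {
  console.log(`${g.from} -> ${g.to}: +${g.percent.toFixed(1)}%`);
}
```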

Quite a few four-character words popped up. Here are my favourites: 有人想你 (bless you!), 自我实现 (Maslow’s Hierarchy of Needs), 自然而然 (Involuntarily) and 同道中人 (Kindred Spirit).

Other Findings

I decided to run further analysis on the 1128 words that were generated with the 100 characters. Here are some interesting stats:

Characters that start the most words: 不 (59 words), 大 (47 words) and then surprisingly 自 (30).

Number of two character words: 785

Number of three character words: 180

Number of four character words: 63

New order of character frequency based on only the words generated (highest to lowest): 不,人,大,一,家,有,事,心,来,得,学,中,上,下,自,当,生,理,好,地,分,道,子,行,看,时,天,出,说,国,过,成,年,小,面,要,用,本,对,会,为,法,想,多,动,前,是,方,在,所,作,了,起,样,然,同,发,以,个,没,可,能,开,定,这,着,现,从,主,实,之,去,到,后,部,进,的,而,种,就,如,那,我,只,些,于,里,因,其,经,都,还,和,么,们,你,他,但,也,她
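
A ranking like this can be produced by counting characters across the generated words. The sketch below counts every occurrence of a character; whether the original analysis counted occurrences or distinct words is an assumption on my part, and the short `words` list is a hypothetical sample, not the real 1128-word output.

```javascript
// Hypothetical handful of generated words (the real list has 1128 entries).
const words = ["不", "人", "大", "不在了", "他人", "大人", "不大", "大有人在"];

// Count how often each character occurs across all words.
function charFrequency(wordList) {
  const counts = new Map();
  for (const word of wordList) {
    for (const ch of word) counts.set(ch, (counts.get(ch) || 0) + 1);
  }
  return counts;
}

// Count how many words each character starts.
function startCounts(wordList) {
  const counts = new Map();
  for (const word of wordList) {
    const first = Array.from(word)[0];
    counts.set(first, (counts.get(first) || 0) + 1);
  }
  return counts;
}

// Characters sorted from most to least frequent within the vocabulary.
const ranked = [...charFrequency(words)]
  .sort((a, b) => b[1] - a[1])
  .map(([ch]) => ch);
console.log(ranked.join(","));
```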

It’s interesting to look at the new frequency of the characters. For example 他, which places among the highest in frequency according to Jun Da’s corpus (which takes a lot of novels and texts into account), is now near the bottom of the 100 most frequent characters when ranked only by frequency in vocabulary.

One could almost say that there are two kinds of character frequencies: frequency of occurrence in texts vs frequency of occurrence in vocabulary. I wonder if there are interesting correlations to be made here. Maybe if one wants to be efficient in learning Chinese characters, one needs to look at both frequencies, because they both can create good angles of approach.

We have all learned 他, 她 and 你 as some of the first characters we encounter. It makes sense. So looking at frequency only isn’t necessarily the best approach, but I think there are definitely interesting ways to use frequency to your advantage.

Take 不 above, for example. It is high in both frequencies and has created the most words, thus it is an awesome character.

Conclusion

It was a fun experiment. Perhaps there are more interesting experiments like this to be made. Most of the words are obscure, though, and I’m not sure of their usefulness: 成说 (accepted theory or formulation), 天年 (Natural Life Span), 心经 (The Heart Sutra) and 地心说 (geocentric theory) among many others.

I was curious to see how many words there actually are when one takes the most frequent characters. Come to think of it, if you know 100 characters, you could potentially know 1128 words. That’s kinda cool. As I mentioned above, though, that probably doesn’t translate into useful words.

Ultimately, it would be really cool to take one’s own character knowledge and compute that. Unfortunately, it’s hard to get that exact knowledge (perhaps from Anki or Skritter?), and doing those potential vocabulary lookups past 100 characters can get very, very computationally expensive.

I’ll see if I can create a little app that looks up character combinations for characters a user manually provides.

Anyhoo, here is the list of all the words per character category (10, 25, 50 and 100) + all four character combinations: Google Doc Link. Have fun!

EDIT: I got a comment on reddit (and here) from Rob who also did a similar experiment. He used the most frequent characters from Harry Potter up to 6000 characters! Check the data here. Awesome stuff.


  • http://www.1kli.com/ Vince

    Ohh, very cool! Thanks for sharing this research!

  • http://thomasroten.com/ Thomas

    Great post! I have been thinking about character frequency lists a lot lately. They are nice because they provide a basic structure so you know what you are learning is generally useful. But seeing how many words can be made from so few characters is even more encouraging.
    Another helpful and related thing is Jun Da’s cumulative frequency percentile column in his statistical data.

  • http://niel.delarouviere.com NielDLR

    Hi Vince,

    thanks for the kind words. It’s a pleasure to share it. I’m glad other people found it as cool as I did.

  • http://niel.delarouviere.com NielDLR

    Hi Rob,

    I replied to your comment on reddit.

    I did not do remote lookups. Node.js can do local file I/O. I read the CEDICT file into a four-field array.

    Then I read the frequency list (a simple txt file too) into an array as well.

    I then created a four-level-deep for loop, which I think might be my problem. It technically creates all the possible combinations of up to four characters from the whole list of characters.

    Check my github gist of the code. It might be a bit confusing. I tried to comment where I could.
    https://gist.github.com/3413354

    I think we are approaching it differently. Mine could possibly be the wrong one.

    I’m not sure I exactly understand how your code works. What I do is take all possible combinations and check whether each of those combinations is found in the dictionary, by iterating through the whole dictionary using a loop.

    Is this the right way? I made it quicker, by creating an end and start index for each character in the frequency list for words that start with the chosen character.

    I really want to learn how to do this better, as you did. I’m sure, the difference between Java and Node.js will not be in magnitudes of tens of minutes, thus I’m sure my coding is inefficient.

    I’m going to keep looking at your code to try and understand it.

    Thanks a lot in advance!

  • http://niel.delarouviere.com NielDLR

    I added the link to your pastebin into the post. It’s just too interesting to not share. Thanks for the contribution!

  • http://niel.delarouviere.com NielDLR

    Hi Thomas,

    thanks for the comment.

    I agree, they are nice, but one needs to work it into a holistic approach.

  • http://twitter.com/HackingChinese Olle Linge

    I agree with the others, this is very interesting. I feel that there is much to do related to statistics and Chinese characters. I posted an article about the most common radicals yesterday and included a brief discussion about frequency.

    Essentially, what I would like to do is stop discussing “most frequent among the 10 000 most common characters”, which is what most lists do. Instead, I’d like to check what’s the most relevant to learn for someone who is learning their first 1000 characters.

    This could be applied to your experiment as well. What if you checked what you have checked, but only included the 10 000 most common words? Sure, we’ll run into problems such as “what is a word” and “what is common”, but just because these questions are hard to deal with doesn’t mean the result would be bad or useless. I think there is much that can and ought to be done to make sure people can focus on learning the most relevant things, be it radicals, characters or words.

  • http://beijing10000.wordpress.com/ Wai Man Chan

    Very cool post! This really makes me think about the amazing exponential nature of Chinese Characters. Thank you for sharing!

  • http://www.1kli.com/ Vince

    I really like where this is going. My biggest problem with the character lists is that they’re mostly pretty useless. Have you ever studied directly off one? Because I haven’t. I want them to be as useful as they are entertaining.

    Finding out the answer to “what’s the most relevant to learn for someone who is learning their first 1000 characters” is a step in the right direction.

  • http://niel.delarouviere.com NielDLR

    Thanks for the comment. Glad you enjoyed it!

  • http://niel.delarouviere.com NielDLR

    Hi Olle, thanks for the comment.

    I definitely agree with you. That is indeed a good experiment idea.

    I might just try it out within the next week or two (very busy with my thesis at the moment!) to see if I find some interesting results.

    Relevancy is definitely an issue!

  • http://niel.delarouviere.com NielDLR

    Hi Sander,

    thanks for the comment.

    That’s what I ended up doing with the indexes. I did this manually though (looking into the dictionary file). I could’ve just coded it, but it was just a 100 characters so I decided to do it like that.

    I spent the whole day yesterday seeing if I could optimize the code. I ended up creating a much quicker method using a different data structure for the dictionary. I managed to do the 100-character category in just 1 min 30 secs, instead of the previous 30 minutes. It even checks through the whole dictionary, not just the indexes, but that’s the next step with the new data structure.

    I’m kind of getting obsessed now with trying to make it even quicker. It’s kind of fun!

  • http://thomasroten.com/ Thomas

    You should be able to make it a lot faster. I have a script that can search through 30000+ sentences and only find those that have “known” characters in about 2 seconds. This seems to be the same thing you are doing; mine is just using strings with 8+ characters and yours is 2+.

    The problem boils down to this: assuming you have a string of all allowed characters, find all strings that consist solely of characters from the allowed character string.

    I have two different algorithms I wrote.
    1) Take every sentence in a list and run a regex on each one. The regex consists of all allowed or known characters. This method is fast, but memory intensive.
    2) Have an index that consists of character => sentence ids. Select all characters that are not known. Then select all the sentences that aren’t linked to those characters.
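
A minimal sketch of the two approaches described above, written for Node.js. It is not the commenter’s actual script: the sample data and all variable names are hypothetical, and the regex version assumes the known characters contain no regex metacharacters.

```javascript
// Hypothetical sample data, for illustration only.
const known = "我是人他不";
const sentences = ["我是人", "他不是我", "你是他"];

// Approach 1: a character-class regex that accepts only known characters.
const onlyKnown = new RegExp(`^[${known}]+$`, "u");
const viaRegex = sentences.filter((s) => onlyKnown.test(s));

// Approach 2: an inverted index from character to the ids of sentences containing it.
const index = new Map();
sentences.forEach((sentence, id) => {
  for (const ch of sentence) {
    if (!index.has(ch)) index.set(ch, new Set());
    index.get(ch).add(id);
  }
});

// Collect ids linked to any unknown character, then keep the remaining sentences.
const knownSet = new Set(known);
const blocked = new Set();
for (const [ch, ids] of index) {
  if (!knownSet.has(ch)) ids.forEach((id) => blocked.add(id));
}
const viaIndex = sentences.filter((_, id) => !blocked.has(id));

console.log(viaRegex, viaIndex);
```

Both approaches should select the same sentences; the index costs more to build but can be reused when the set of known characters changes.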

    Hope that helps! Let me know if you want to see any code from my project. It will be useable on http://chinesesentences.com next week or the week after.

  • http://niel.delarouviere.com NielDLR

    Hi Thomas,

    I think I understand what you mean. Like I mentioned, I haven’t coded in ages. I’m behind a lot on data structures and efficient methods to do large data handling.

    I might be wrong here, but I think the difference between your code and mine, is that I do a lot of lookups, rather than one lookup into big data. I hope I’m explaining this correctly.

    So my experiment is a many to many situation, where yours is a one to many situation. I might be wrong in understanding this.

    If I just do one lookup, into my dictionary, this happens under 1sec.

    In my experiment (on the 100 character level) I did 100 million lookups into a dictionary file of over 100 000 entries. If all these happen under 1sec each, then the time starts adding up. I think that’s my problem.

    But hey, send me an email confusedlaowai@gmail.com with some examples of your code. I’m curious. Also, looking forward to your site. Sounds interesting!

  • p2beijing

    Woooooooooooo very very helpful ;) Great job, thank you for your time and for sharing this interesting study

  • http://niel.delarouviere.com NielDLR

    Thanks for the kind words! It is a great pleasure :)

  • Chad Redman

    Really interesting! I have thought about doing some similar experiments, but my thoughts were more of “which characters are used in the most words” which wouldn’t necessarily be the top N characters. A similar question is “which minimal set of characters would you need to know X% (e.g. 90%) of a text”. But this latter question is just a matter of counting the characters in the text and taking the most frequent characters until they add to X%.
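
The counting described for the coverage question could look roughly like this. It is a sketch only: `charsForCoverage` is a hypothetical name, and a real text would first need punctuation and whitespace stripped, which is left out here.

```javascript
// Return the most frequent characters whose combined count
// reaches the `target` share (0..1) of the text.
function charsForCoverage(text, target) {
  const counts = new Map();
  let total = 0;
  for (const ch of text) {
    counts.set(ch, (counts.get(ch) || 0) + 1);
    total++;
  }
  // Most frequent characters first.
  const sorted = [...counts].sort((a, b) => b[1] - a[1]);
  const picked = [];
  let covered = 0;
  for (const [ch, n] of sorted) {
    if (total > 0 && covered / total >= target) break;
    picked.push(ch);
    covered += n;
  }
  return picked;
}

// Ten characters: 我 x4, 是 x3, 人 x2, 他 x1.
console.log(charsForCoverage("我我我我是是是人人他", 0.9));
```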

    To get a little geeky here, regular expressions would be a faster way to get the results, depending on the language you are using. You can simply create a regular expression out of the characters, and then test every word to match them. For example (pseudocode):

    for each word (words) {
        // "+" lets the pattern match words of any length, not only single characters
        print word if word.matches("^[的一是不了在人有我他]+$")
    }

    This way, you would have one test for every dictionary word, but never more than that. Performance tends to be good even for long strings of characters (I got 0.3 seconds for the top 100 characters) because many languages turn the expressions into ad hoc decision trees that compute faster than the looping and testing that you would be writing in the high-level language.
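
As a runnable Node.js sketch of the same idea: the `headwords` sample is hypothetical, and a real run would test every CC-CEDICT entry. Note that the character class needs a `+` quantifier so that it matches words of any length rather than a single character.

```javascript
// The ten most frequent characters from the post.
const chars = "的一是不了在人有我他";

// Hypothetical sample of dictionary headwords.
const headwords = ["不在了", "他人", "我人", "中国", "有的是", "人们"];

// One pattern, one test per headword: the word must consist solely of allowed characters.
const allowed = new RegExp(`^[${chars}]+$`, "u");
const matches = headwords.filter((word) => allowed.test(word));

console.log(matches);
```

The number of tests equals the number of dictionary entries, however many characters are allowed, instead of growing with the fourth power of the character count.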

    Your analysis touched on one thing I have noticed in the past. While rare words often contain at least one character that clearly hints at their meaning, some words contain common characters but have less obvious meanings. This makes them hard to remember, even if they are fairly frequent words. I still have problems with 要是 and 那是, because they look like “want to be” and “that is”, but really mean “if/in case” and “of course”.

  • http://niel.delarouviere.com NielDLR

    Hi Chad,

    thanks for the comment.

    Yes, regular expressions. Ooh boy. I have heard of this elusive thing. I’m going to definitely read up a bit more on it. It makes sense. I understand regular expressions, but making statements for them is where I lack the knowledge. I’ll ask my brothers for help, they are better programmers than I.

    Your last point is true. I remember asking people on the Chinese Language stackexchange section how 要是 became “if”, but the general response was, it just is. It’s like asking how “if” became “if”.

  • http://twitter.com/DaveFlynn Dave Flynn 茶米

    The issue of frequency lists came up for us recently when creating the new version of Mandarin Poster. We found that the most frequently used characters are not necessarily the most useful to beginners: the order of characters in a frequency list might not reflect how a beginner would talk. Sort of along the lines of what you came across with 他. This is a character that’s usually learned in your first ever Mandarin lesson, yet it dropped right down in your testing.

    Then there’s the corpus to consider, which could be influenced by any number of factors, and what is ‘frequent’ (as mentioned by lots of people above) – spoken word, written word, age groups, geographical location all influence this. Thinking on, maybe an all-encompassing frequency list might not be a solution to any useful question, but rather focused lists, that concentrate on distinct groups of people or topics, since this is how our lives are based anyway, would be more useful.

  • http://niel.delarouviere.com NielDLR

    Hey Dave,

    thanks for the comment.

    I completely agree. I think frequency lists, especially character lists, although fascinating in their own right, might be limited in their solution as you mentioned.

    You know, now that I think about it, Chinese would be one of the few languages where we actually would like to look at a frequency list based on the graphemes. I mean, in English we don’t look at frequency lists for letters of the alphabet.

    I’m making a bad comparison here, I know, but it sometimes feels the same.

    Because Chinese orthography is such a complex system, finding the relationship between characters and the vocabulary is an interesting question in itself.

    I think that’s what we are trying to answer here. How can we find the best characters to learn that will enable us to promote efficient, effective (and fun!) vocabulary learning?

    Another thing to consider is radicals!

    If you want to chat about this more, send me an email, we can maybe do some research together.

  • http://3000hanzi.com/ Steven Daniels

    Cool analysis.

    One afternoon I coded up a Chinese boggle style game. It would grab 20 – 25 characters and then you’d have to figure out as many words as you could from the characters.

    An interesting thing I noticed: when I tried making it easier by putting a higher percentage of common characters, the game actually got harder. Lots and lots of obscure words. Totally useless for studying.

    I agree with Olle: we need to find better ways of finding relevant characters for Chinese learners to study.

  • Bendy-Ren

    Would it be possible to use data like this to calculate extended HSK vocab lists?

    What I mean is, to take all the vocab words in a certain HSK list, say level 5, dump all of the individual characters that appear there, and run this analysis on each one. Of course the output would be many times larger than the original list, but if you could filter it by frequency, it might be really useful…

  • http://niel.delarouviere.com NielDLR

    Hi Bendy-Ren,

    that is not a bad idea at all. I’ll consider doing this soon!

  • http://www.cleverclogs.org/ Marjolein Hoekstra

    Michael Burkhardt has self-published a number of books that seem to fit your goal seamlessly. His three-volume series Eating the Dragon takes a character frequency list and progressively matches it to vocabulary from the HSK exams. I find Burkhardt’s approach very intuitive and highly recommend you check it out http://www.lulu.com/shop/search.ep?contributorId=688001

  • http://www.wearyourchinesename.com/chinese-symbols/chinese-symbols.html Giuseppe Romanazzi

    I read with interest the comments of Dave Flynn and Olle Linge and what they say about “most frequently used” versus “most useful to beginners” and “most common” versus “most relevant”.

    They are right and I do agree that “the order of characters in a frequency list might not reflect how a beginner would talk”.

    Anyway, please let me give my opinion, having lost my hair in Beijing when I was a beginner:

    A beginner who wants something “useful” and “relevant” doesn’t care at all about Chinese characters!

    You want to teach “useful” and “relevant” Chinese, right? The word is: pinyin!

    In my textbook, lesson 1 was really very “useful” and “relevant”, I’m not joking, it was just what I wanted to say as a beginner: Mary says to David “How do you do?” and then “How are you?” and “Very well” and “I’m very well too”… but it came with 15 Chinese characters to learn and memorize!

    Lesson 2 was about professor Li and professor Wang greeting each other and “How is your health?” and “Thanks” and “Goodbye” etc…. WITH 25 MORE CHINESE CHARACTERS!

    You author of those lessons, do you know what I think?

    Option 1: You are Chinese, so, I do understand, you don’t know, yes, don’t know how difficult it is for me, an Italian beginner, to learn Chinese characters;

    Option 2: You are completely out of your mind, please see a good physician.

    So, want to be relevant and useful? Go with pinyin.
    Want to teach characters? Go with the most commonly used first, and only a few each day.

    My two cents.

    Giuseppe Romanazzi
    WearYourChineseName.com