Chinese Character Frequency Lists

Hi. This is in response to a comment left on the Kongzi page.

Calimeron writes:

Hi, I’m using the Yahoo Widget “Mandarin Flashier Cards.” That uses a cedict dictionary with 2 frequencies (in sqlite format). Do you know anything about these frequencies? Or other frequencies available on the web? (I don’t like to use these frequencies if after learning 1000’s of words, I discover the frequencies are not right.) Tip for you: do something similar! If you have a program that starts automatically at the startup of the computer, and then automatically flashes the 5-10000 cards you’d like to study, it’s very difficult not to study!

Hi Calimeron! I have been interested in frequency lists for a long time. My conclusion is that no single frequency list is the be-all and end-all, because a frequency list only reflects the material it was compiled from. Let me give you an example. If I took, say, 200 daily newspapers, 100 novels and business books, and a broad selection of 100 magazines, all from last year, and threw in a few scripts from plays and transcribed news broadcasts, I’d have several million analyzed words. The Chinese government put out something like that; it lists the frequency of simplified characters. That’s one list. But if I added a lot of classical Chinese or religious books like the Dao De Jing, the frequency list would change.
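To make the point concrete, here is a minimal Python sketch of how a character frequency list falls out of whatever text you feed it. The sample sentences are tiny and made up for illustration (the two classical lines are from the Dao De Jing), and the helper names `is_cjk` and `frequency_list` are my own, not part of any existing tool.

```python
from collections import Counter

def is_cjk(ch):
    """True for characters in the main CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"

def frequency_list(texts):
    """Count Chinese characters across all texts, most common first."""
    counts = Counter()
    for text in texts:
        # Punctuation and Latin letters are skipped by the is_cjk filter.
        counts.update(ch for ch in text if is_cjk(ch))
    return counts.most_common()

# Illustrative samples only; a real corpus would be millions of characters.
modern = ["我的书在这里。", "你的书是我的。"]
classical = ["道可道，非常道。", "道生一，一生二。"]

modern_only = frequency_list(modern)
mixed = frequency_list(modern + classical)

print(modern_only[0])  # ('的', 3)
print(mixed[0])        # ('道', 4)
```

With only the modern sentences, 的 comes out on top; add two classical lines and 道 overtakes it. The same thing happens, on a much larger scale, between any two real corpora.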

Another example is the list put out by the Ministry of Education in Taiwan back in 1997. This is a very modern list with nearly 20 million analyzed words from a broad spectrum of books, magazines, newspapers, television, and even some classical material. This is the frequency order used in Far East’s 3000 Chinese Character dictionary, a book I cannot recommend highly enough for its value as a learning tool.

Yet another example is the internet frequency order. Someone compiled something like 5,000 websites from China and came up with 280 million words. It’s huge, and it covers simplified characters. Another list was compiled in 1993 and 1994 from Chinese newsgroups. All of these lists order the characters very differently.

There are other lists. The point I am making is, what are you trying to learn? Obviously using one frequency list over another will target you towards that particular culture. Newspapers in Singapore may use shorthand characters for twenty and thirty which will not appear in other lists – although they may be included in a list of characters targeting Singaporean newspapers.

My advice is that if you’re just starting out and you’re confused about which frequency list to use, use them all! Study the first 100 or 200 characters in one list, then go study the new characters in another list. Seeing which characters are the same and which are different will be very revealing. For example, Yi1 (one) and de5 (ownership particle) are almost certain to be in the top ten no matter what. Other words may rank closer together or farther apart. You may also wish to know that some lists count bigrams such as “ta1 de” (his) separately from ta1 (he) and de (ownership). So you really have to understand what your frequency list is targeting before you can use it.
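The "study one list, then the new characters in another" routine is easy to automate. The following is a rough Python sketch; the two rankings are invented for illustration and do not come from any published list, and `new_characters` is a hypothetical helper name.

```python
def new_characters(studied, other, top_n):
    """Characters in the top_n of `other` that are missing from the
    top_n of `studied`, kept in `other`'s frequency order."""
    known = set(studied[:top_n])
    return [ch for ch in other[:top_n] if ch not in known]

# Two made-up top-ten rankings (most frequent first), for illustration only.
list_a = list("的一是不了人我在有他")
list_b = list("的是一我不人在了有这")

to_study = new_characters(list_a, list_b, 10)
shared = 10 - len(to_study)

print(to_study)  # ['这']
print(shared)    # 9
```

If the lists come from very different material, `to_study` grows quickly as you raise `top_n`, which is a quick way to see how far apart two lists really are.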

The way I solve this problem with Kongzi is by allowing the user to import whatever frequency list they like from the internet. It currently can import five or six different lists.

For anyone learning Chinese, I strongly recommend Far East’s 3,000 Character Dictionary. It is indispensable and the frequency order is very good for learning daily Chinese you will see and speak all over the world. I’ve never seen anything better.

You may also be interested in James E. Dew’s “6,000 Chinese Words”. It is another indispensable book which lists many things including CYY and BLI grading, and several different frequency lists.

For advanced students only, I recommend buying an entire series of very easy children’s books (think Aesop’s Fables or The Three Little Pigs), indexing and counting each character in each book, and making your own frequency list from that. That should take you a couple of months! You will come up with about 800 words that are significantly common, and you will discover that learning as few as 200 of them will allow you to read 90% of the material in the books. Again, I cover this technique in my new textbook, “Welcome to Chinese”.
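If the books are available as plain text, the counting itself takes seconds rather than months. Here is a small Python sketch of the coverage calculation, i.e. how many of the most common characters you need before they account for 90% of the running text. The function name `characters_needed` and the sample string are assumptions for illustration, not data from real books.

```python
from collections import Counter

def characters_needed(text, target=0.90):
    """Return (needed, distinct): how many of the most frequent characters
    cover at least `target` of all Chinese characters in `text`, and how
    many distinct characters the text contains."""
    counts = Counter(ch for ch in text if "\u4e00" <= ch <= "\u9fff")
    total = sum(counts.values())
    covered = 0
    for needed, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered >= target * total:
            return needed, len(counts)
    return 0, 0  # no Chinese characters found

# Artificial sample: 20 characters in total, 4 distinct.
sample = "的" * 10 + "我" * 6 + "书" * 3 + "里"
print(characters_needed(sample))  # (3, 4)
```

On real children’s books, the gap between the two numbers is what makes the approach worthwhile: a small fraction of the distinct characters does most of the work.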

For super advanced students, here’s something to think about. What percentage of the most common 3000 characters are nouns, what percentage are verbs, and so forth? Choose the most common 20 or 30 nouns, the most common 20 or 30 verbs, and so forth, and you could probably set up the kernel of an MCI system in 100 words. This puts near-conversational fluency in the hands of first year college students. This is something akin to what I’m doing for Kongzi and WTC.
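As a sketch of that exercise, suppose you already have a frequency-ordered list in which each entry carries a part-of-speech label. The Python below computes the percentage per category and picks the top few words from each. The tagged list is a hypothetical toy (real characters often belong to more than one category), and the function names are my own.

```python
from collections import Counter, defaultdict

def pos_breakdown(entries):
    """Percentage of entries in each part-of-speech category."""
    counts = Counter(pos for _, pos in entries)
    return {pos: 100 * n / len(entries) for pos, n in counts.items()}

def top_per_pos(entries, k):
    """First k words of each category, assuming entries are already
    sorted from most to least frequent."""
    picked = defaultdict(list)
    for word, pos in entries:
        if len(picked[pos]) < k:
            picked[pos].append(word)
    return dict(picked)

# Hypothetical tagged list in frequency order, for illustration only.
tagged = [
    ("的", "particle"), ("是", "verb"), ("人", "noun"), ("有", "verb"),
    ("来", "verb"), ("书", "noun"), ("水", "noun"), ("了", "particle"),
]

print(pos_breakdown(tagged))
print(top_per_pos(tagged, 2))
```

With a real tagged list of 3000 entries, `top_per_pos(tagged, 25)` would give the kind of starter vocabulary described above.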

Good luck on your Chinese journey!

2 Responses

  1. Hi Oliver! Thank you very much for your very warmhearted, thoughtful and complete reply! I’ll certainly order the “Far East 3000 Character Dictionary” and James E. Dew’s “6000 Chinese Words.”

    In reply to your “For advanced students only,” there’s a program on the web that counts frequencies of words in a text, for example: http://www.lextutor.ca/freq/eng/

    Problem with Chinese is that you first have to put spaces between the words, changing “问你一个问题好吗” to “问 你 一 个 问题 好 吗”

    I haven’t found a good automated way to do this yet. Programs that put pinyin on top of the characters might do the trick, as long as they don’t convert the characters to pictures.

    A word macro to put spaces between individual characters:

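    Sub Space()
        ' Test Macro
        ' Macro created 11-3-2007 by Jos
        ' Inserts a space after each character, starting from the top of the document.
        Dim i As Long

        ' Move the cursor to the start of the document.
        Selection.HomeKey Unit:=wdStory

        ' The loop bound is computed once, before any spaces are inserted.
        For i = 0 To ActiveDocument.ComputeStatistics(wdStatisticCharacters)
            ' Step over one character, then type a space after it.
            Selection.MoveRight Unit:=wdCharacter, Count:=1
            Selection.TypeText Text:=" "
        Next

    End Sub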

    This macro works fine and then stops before the end of the document, so you still have to convert the end of the document.

    Best Regards,
    Calimeron

  2. I’ve just found a segmentation tool (only works for me in Internet Explorer): http://www-rohan.sdsu.edu/~chinese/annotate.html

    Select “Segment only”

    Renli replies: Thanks, I will check it out, although I find I have no free time recently :/
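For readers who want to try the spacing step from the first response outside of Word, here is a minimal Python sketch. `space_characters` does what the macro does (a space between every character), and `segment` is a simple greedy longest-match segmenter. It is only an illustration of the idea: it assumes you supply your own word list (the one-word set below is hypothetical), and real segmentation tools are considerably more sophisticated.

```python
def space_characters(text):
    """Put a space between every character, like the Word macro."""
    return " ".join(text)

def segment(text, words, max_len=4):
    """Greedy forward longest-match segmentation.

    At each position, take the longest piece (up to max_len characters)
    found in `words`; fall back to a single character otherwise."""
    out = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in words:
                out.append(piece)
                i += size
                break
    return " ".join(out)

# Hypothetical word list; a real one would come from a dictionary file.
words = {"问题"}

print(space_characters("问题"))            # 问 题
print(segment("问你一个问题好吗", words))  # 问 你 一 个 问题 好 吗
```

The output can then be pasted into a word-frequency counter that expects space-separated words. The quality of the result depends entirely on the word list you pass in.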
