Chinese Character Frequency Lists

Hi. This is in response to a comment left on the Kongzi page.

Calimeron writes:

Hi, I’m using the Yahoo Widget “Mandarin Flashier Cards.” That uses a cedict dictionary with 2 frequencies (in sqlite format). Do you know anything about these frequencies? Or other frequencies available on the web? (I don’t like to use these frequencies if after learning 1000’s of words, I discover the frequencies are not right.) Tip for you: do something similar! If you have a program that starts automatically at the startup of the computer, and then automatically flashes the 5-10000 cards you’d like to study, it’s very difficult not to study!

Hi Calimeron! I have been interested in frequency lists for a long time. The conclusion I’ve come to is that no one frequency list is the be-all and end-all. The reason is that a frequency list only reflects the material it was compiled from. Let me give you an example. If I took, say, 200 daily newspapers, 100 novels and business books, and a broad selection of 100 magazines, all from last year, and threw in a few scripts from plays and transcribed news broadcasts, I’d have several million analyzed words. The Chinese government put out something like that. It lists the frequency of simplified characters. That’s one list. But if I added a lot of classical Chinese or religious books like the Dao De Jing, the frequency list would change.

Another example is the list put out by the Ministry of Education in Taiwan back in 1997. This is a very modern list with nearly 20 million analyzed words from a broad spectrum of books, magazines, newspapers, television, and even some classical material. This is the frequency order listed in Far East’s 3000 Chinese Character dictionary, a book I cannot recommend highly enough for its value as a learning tool.

Yet another example is the internet frequency order. Someone compiled something like 5,000 websites from China and came up with 280 million words. It’s huge, and it covers simplified characters. Another list was compiled in 1993 and 1994 from Chinese newsgroups. All of these lists have vastly different orders for characters.

There are other lists. The point I am making is, what are you trying to learn? Obviously using one frequency list over another will target you towards that particular culture. Newspapers in Singapore may use shorthand characters for twenty and thirty which will not appear in other lists – although they may be included in a list of characters targeting Singaporean newspapers.

My advice is that if you’re just starting out and you’re confused about which frequency list to use, use them all! Study the first 100 or 200 characters in one list, then go study the new characters in another list. It will be very revealing to see which characters are the same and which are different. For example, yi1 (one) and de5 (ownership particle) are almost certain to be in the top ten no matter what. Other words may rank closer together or farther apart. You may also wish to know that some lists count bigrams such as “ta1 de” (his) separately from ta1 (he) and de (ownership). So you really have to understand what your frequency list is targeting before you can use it.
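Here is a rough sketch of that comparison in Python. It assumes each frequency list has already been loaded as a plain Python list of characters, most frequent first. The function name `compare_top` and the two tiny sample lists are made up for illustration; they are not real frequency data and not how Kongzi does it.

```python
def compare_top(list_a, list_b, n=100):
    """Compare the top-n entries of two frequency-ordered character lists.

    Returns three lists: characters in both top-n sets, characters only in
    list_a's top n, and characters only in list_b's top n. Each result keeps
    the order of the list it came from.
    """
    top_a = list_a[:n]
    top_b = list_b[:n]
    set_a, set_b = set(top_a), set(top_b)
    shared = [ch for ch in top_a if ch in set_b]
    only_a = [ch for ch in top_a if ch not in set_b]
    only_b = [ch for ch in top_b if ch not in set_a]
    return shared, only_a, only_b


# Made-up sample lists, for illustration only (NOT real frequency data).
list_a = ["的", "一", "是", "不", "了"]
list_b = ["的", "一", "在", "人", "是"]

shared, only_a, only_b = compare_top(list_a, list_b, n=5)
print("In both lists:", shared)
print("Only in list A:", only_a)
print("Only in list B:", only_b)
```

After studying the top 100 of one list, the "only in list B" result is your set of new characters to study from the second list.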

The way I solve this problem with Kongzi is by allowing the user to import whatever frequency list they like from the internet. It currently can import five or six different lists.

For anyone learning Chinese, I strongly recommend Far East’s 3,000 Character Dictionary. It is indispensable, and the frequency order is very good for learning the daily Chinese you will see and speak all over the world. I’ve never seen anything better.

You may also be interested in James E. Dew’s “6,000 Chinese Words”. It is another indispensable book which lists many things including CYY and BLI grading, and several different frequency lists.

For advanced students only, I recommend buying an entire series of very easy children’s books (think Aesop’s Fables or The Three Little Pigs), indexing and counting each character in each book, and making your own frequency list from that. That should take you a couple of months! You will come up with about 800 words that are significantly common, and you will discover that learning as few as 200 of them will allow you to read 90% of the material in the books. Again, I cover this technique in my new textbook, “Welcome to Chinese”.
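If you have the text of the books in digital form, the counting can be sketched in a few lines of Python. This is a minimal sketch under some assumptions: it counts single characters rather than words, it treats only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) as Chinese characters, and the helper names and the three short sample "books" are invented for illustration. The `coverage` function only measures how much of the text the top characters account for; the figures you get will depend entirely on the books you feed it.

```python
from collections import Counter


def is_cjk(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def char_frequencies(texts):
    """Count every Chinese character across a collection of texts."""
    counts = Counter()
    for text in texts:
        counts.update(ch for ch in text if is_cjk(ch))
    return counts


def coverage(counts, top_n):
    """Fraction of all counted characters covered by the top_n most common."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / total


# Tiny made-up sample "books", for illustration only.
texts = ["我是一只小猪", "我的家", "小猪的家"]
counts = char_frequencies(texts)

# Your own frequency list: most common characters first.
for ch, count in counts.most_common():
    print(ch, count)

print("Coverage of the top 5 characters:", coverage(counts, 5))
```

With real books you would read each file into a string first, then try different values of `top_n` to see how quickly the coverage climbs.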

For super advanced students, here’s something to think about. What percentage of the most common 3000 characters are nouns, what percentage are verbs, and so forth? Choose the most common 20 or 30 nouns, the most common 20 or 30 verbs, and so forth, and you could probably set up the kernel of an MCI system in 100 words. This puts near-conversational fluency in the hands of first-year college students. This is something akin to what I’m doing for Kongzi and WTC.

Good luck on your Chinese journey!


Using Styled Documents

A few beta testers reported back that on their netbooks’ small, high-resolution screens, the HTML-limited size of the Chinese characters was too small. So I redid it using styled documents.

Notice anything different?

[Screenshot: kongzi-styled-docs]

It’s pretty much the same, except now you can blow the Chinese character up as much as you like.

Perfect for netbooks! This was definitely a step I needed to take in order to move towards Kongzi on portable devices.

One of the next big goals for Kongzi is adding a Speech module (i.e. press F1 to hear the Chinese). Associated with this will be a new “listen and type” style of quiz, “listen mode” for multiple choice, etc.

Development is proceeding extremely rapidly. I’m working on some UI issues now, making it better. Kongzi is becoming more and more robust. I am currently using it to teach myself, but entering characters is slow going; there are only 250 words in the dictionary now.

I have a really good feeling about Kongzi these days. I think it’s going to be a success!