Slapped in the Face by Natural Language Processing

I just spent three weeks solving some of the hardest logic puzzles I’ve done in a while. And what did I learn?

Well, I finally found a class for that little bit of code in the main program that didn’t seem to belong there. Long story.

Recall my last post where I described the new Memory game module. I always had a feeling that once I finished rewriting all the old code and fixing up the UI a bit, the rest of the program would come out in a landslide. And it sort of did. I was blessed with a small holiday due to summer vacation and being ill, so I had quite a few solid days to work on Kongzi. And work I did. I found and fixed countless tiny widdle bugs, and I also added a phrase module — Kongzi can now give multiple-choice best-word quizzes.

This caused me to encounter a rather well-known, and actually unsolvable, problem related to Natural Language Processing (NLP; and no, not neuro-linguistic programming). Let's say I have a phrase, any phrase, and I choose one of the a-b-c-d answers to be correct. How do I then choose answers b, c and d to be words which are not correct, in such a way that they will not clue the reader in to the answer? This is much more difficult than it appears. First of all, it is not acceptable to record a small number of "wrong" answers per question, because if those words are not in the currently targeted vocabulary, the remaining (correct) word will be conspicuously familiar. In practice this leads students to pass a test even when they do not really understand the meaning of the words. Limiting the words which may appear by part of speech is, for the same reason, unacceptable. Limiting words by tense, also. It becomes an exercise in choosing the word based on its grammatical fit more than anything else.

Example: “I take the ____ to work every morning.” a) grape  b) taxi   c) car    d) airplane

Obviously a is not a good answer, and d is illogical. But between b and c, the answer is not clear. It's probably c, but it might be b. This is the problem: the student might choose an answer which is right, and the computer would not know that.

The real problem is that with thousands of words and thousands of example phrases it is simply not feasible to list every possible answer to a question.

So the solution I came up with is to choose the distractors randomly from the least-known words in the target vocabulary. This has the advantage of creating a situation where the student cannot intentionally choose a word which is a wrong answer, because the definition of that word is one the student does not know. The key here is that the problem described above occurs infrequently. Words which the student chooses wrongly in this way will remain in "unknown" status, but words which the student knows will slowly rise to the top regardless of a few occasional errors.
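
Here's a minimal sketch of the idea in Java. The class and field names (VocabEntry, familiarity) are just illustrative, not Kongzi's actual code; the point is simply "filter out the correct word, sort by how poorly each word is known, and pick from the bottom of the pile":

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: pick wrong answers from the least-known words
// in the target vocabulary, so the student can't rule them out on purpose.
class DistractorPicker {
    static class VocabEntry {
        String word;
        int familiarity; // lower = less known
        VocabEntry(String word, int familiarity) {
            this.word = word;
            this.familiarity = familiarity;
        }
    }

    static List<String> pickDistractors(List<VocabEntry> vocab, String correct, int count) {
        List<VocabEntry> pool = new ArrayList<>();
        for (VocabEntry e : vocab) {
            if (!e.word.equals(correct)) {
                pool.add(e);
            }
        }
        // Sort so the least-known words come first...
        pool.sort(Comparator.comparingInt((VocabEntry e) -> e.familiarity));
        // ...then keep only the bottom slice and shuffle it so the
        // same distractors don't show up on every question.
        List<VocabEntry> leastKnown = pool.subList(0, Math.min(pool.size(), count * 3));
        Collections.shuffle(leastKnown, new Random());
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(count, leastKnown.size()); i++) {
            result.add(leastKnown.get(i).word);
        }
        return result;
    }
}
```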

Quality of phrase is also important here. A phrase like "He is a very ______ boy." can lead to a lot of such errors, because there are a lot of possibilities. However, a phrase like "He is a very _____ boy; he is punished every day." cuts down a great deal on the possible adjectives. It's very difficult to have a real collision with a sentence like this. Another example:

Bad: Would you like to watch a ______ movie with me tonight?
Good: I like comedy movies, but I don’t like _____ movies. They’re just too scary!

The second sentence above probably has just one or two words which could "normally" apply to it, and the chances of an idea collision are very low. But with the first sentence, if two plausible adjectives are chosen for the multiple choice (or if part of speech is used to choose only adjectives), the reader might as well flip a coin to decide what the right answer is.

So far the solution seems to be holding up well, and I have only seen the occasional collision.

Concentration


I feel as if I have come full circle. The Memory game code is the meat of what I lost back in 2007 when my computer crashed. Now, I have finally gotten around to rewriting it all these years later, and it feels as if a weight has been lifted from my shoulders. Each time I would think of coding for Kongzi I would remember how royally pissed off I was at losing my work. No more.

But, this is not the memory game of 2007. This is the new and improved version. As you can see, you can configure it to display different items as a pair. Of course, you can just use it to play regular “memory” as well, and increase your recognition of foreign language words or letters. However I find this particular mode especially well-suited for learning hiragana pronunciation.

I believe this end of the program is well-polished now, and that it may finally be almost time for the feature freeze before public release.

Kongzi: Japanese

Okay I finally decided to start learning Japanese. I’ve been “learning” Chinese for 20 years, don’t hate me for switching horses midstream, okay? I will always love the old horse too.

Anyways.

As soon as I started learning Japanese I realized that I would have to first learn Hiragana. Hiragana is of course the basic phonetic syllabary, one of the Japanese kana. So I dutifully started entering Hiragana into Kongzi. I immediately discovered, much to my chagrin, that for all the "talk" I made about using Kongzi to learn Japanese as well as Chinese, it wasn't nearly as convenient. Here's what I was faced with:

Something's not right...

I always knew I would have to design a new set of tags, so I wasn't too worried about that. What bothered me was that it said "Traditional / Simplified", "Pinyin", "Zhuyin" and so forth. Terms which I'm sure Japanese-learning users would grow to love, but which were unsettling to me, personally. Call me a perfectionist. I was almost going to let it go when I realized that there were deeper, structural problems. Japanese has six pieces of information which need to be remembered: kanji, of course, and hiragana and katakana, yes, but also romaji and phonetics, alongside a definition.
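
To make the structural problem concrete, this is roughly the kind of record an entry needs to become. It's only an illustrative sketch (the field names aren't Kongzi's real internals); for Chinese entries the kana and romaji fields simply stay empty, and vice versa:

```java
// Illustrative sketch of a dictionary entry that can hold both languages.
class DictionaryEntry {
    // Chinese side
    String traditional;
    String simplified;
    String pinyin;
    String zhuyin;
    // Japanese side
    String kanji;
    String hiragana;
    String katakana;
    String romaji;
    // Shared
    String phonetics;   // e.g. a learner-friendly transcription
    String definition;
}
```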

I realized I had been cheating with Chinese as well, because I had not included a space for phonetics. After some tinkering however, it all worked out well. I modified the options dialog to include a spot for learning Japanese:

and added the relevant fields to the Add Entries dialog:

Voila!

Then I realized the "phonetic display options" and certain selections in the quiz settings dialog weren't set up to handle the default display of a third phonetic. Never mind that the quizzes themselves weren't coded to accept or display a third phonetic. Oh bother. So I coded new interfaces for setting up what's shown on flashcards, as follows:

Right now I'm slowly changing all the labels to reflect what language is being learned. You can see that in the Add Entries dialog above, where it says "Hiragana", but not in the quiz configuration directly above, where it says "Phonetic 3".

But, other than those few cosmetic display issues, Kongzi is up and running… in Japanese! やった!

Some final comments.

I can now see how some of the comments I made about Anki vs. Kongzi were unfounded. I understand now why they stirred up a hornet's nest of Anki supporters who accused me of not knowing what I was talking about. However, after updating the program to support Chinese and Japanese properly, I can at least reaffirm one aspect of that discussion. What I said about Japanese-learning programs claiming they can be used to teach Chinese, "or anything", still holds true. Such as Stackz' claim that it can be used to teach flags, or Anki's claim that it can be used to teach guitar chords. It's a gimmick.

To truly be able to use a program to learn a language, that program must be specifically designed to handle that language.

Saying your program is versatile may be true, and it may be possible to hack and kludge your way into using a program in a way it wasn't truly designed to be used — but I feel very happy with Kongzi now, and with its native support for Japanese learners.

And Korean? Thai? Tagalog? At the moment I wouldn’t even think of saying Kongzi could be used to learn Korean. I realize now that although it COULD be, it never WOULD be, until proper coding is done, and proper support is given for Korean users. Yet this remains my dream — a single, simple, small and beautiful program that runs on anything, and can be used to learn any language in the world. So I’m sorry Korean learners, but I have a lot on my plate fixing this pig up for dual-language support. Korean will have to wait until next year.

And about releasing the program, well, one day…. I promise….

This is renli, over and out.

JSync and Kongzi Beta-6

You know, I’d probably make more blog posts if wordpress didn’t suck so much. I just have this intangible feeling that it sucks, that something sucks about wordpress. More on that later when I remember what sucks so much about wordpress. Right now I can’t think of anything.

Anyways I have been working on Kongzi off and on. I've added a timer to the Multiple Choice quiz, which is kind of cool. Now you know how much time it took you to finish. I'll add the same to Flashcards and so forth. I've also added the concept of "Sticky Tags" to the Add Entries dialog. The thing that upset me the most when adding lots of entries was that the tags wouldn't be remembered from entry to entry. This may or may not be a desired behavior (from personal experience), so I put it in as a checkbox.
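
The behavior itself is trivial; here's a hypothetical sketch of what the checkbox does, with placeholder field names rather than the real Add Entries code:

```java
import javax.swing.JCheckBox;
import javax.swing.JTextField;

// Hypothetical sketch of the "Sticky Tags" behavior: when the box is
// checked, the tags field keeps its value after an entry is saved
// instead of being cleared along with the other fields.
class AddEntryForm {
    JTextField wordField = new JTextField();
    JTextField tagsField = new JTextField();
    JCheckBox stickyTags = new JCheckBox("Sticky tags");

    void onEntrySaved() {
        wordField.setText("");
        if (!stickyTags.isSelected()) {
            tagsField.setText("");   // only clear the tags when not sticky
        }
    }
}
```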

Another annoying thing I fixed: if you accidentally clicked "Open" instead of "Save", it wiped the Dictionary. That killed me twice during actual class time, so I fixed it. "Open" doesn't clear your database until you select a file.
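
For the curious, the fix amounts to nothing more than waiting for the file chooser to actually return a file. A sketch using Swing's JFileChooser, with placeholder dictionary methods standing in for the real ones:

```java
import java.io.File;
import javax.swing.JFileChooser;
import javax.swing.JFrame;

// Sketch of the fix (placeholder methods, not Kongzi's real code):
// nothing is wiped until the user actually picks a file.
class OpenAction {
    void openDictionary(JFrame parent) {
        JFileChooser chooser = new JFileChooser();
        if (chooser.showOpenDialog(parent) == JFileChooser.APPROVE_OPTION) {
            File file = chooser.getSelectedFile();
            clearDictionary();      // only now is the old data cleared
            loadDictionary(file);
        }
        // Cancel or close: the current dictionary is left untouched.
    }

    void clearDictionary() { /* placeholder */ }
    void loadDictionary(File f) { /* placeholder */ }
}
```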

So yeah I’ve been real busy at work lately which is why I haven’t bothered to seriously work on Kongzi. I spent most of 2010 with an insane number of teaching hours and I am only now recovering from the shock. It was that bad. What brought me back? Simple, file sharing. The idea came up at a meeting that we should be able to share files. It’s better than e-mailing files to everyone all the time, and it would encourage a greater amount of collaboration. But every option we looked at was “expensive”. So me being me I decided, why pay? Or rather, I should say, why pay someone else? haha. That was something I could do. So I whipped up JSync.

JSync is pretty cool. It doesn't actually copy files yet; I'm banging out the code for that as we speak. First I'll talk about the job selection menu, which is a cool little piece of work. You'll notice one item is blue and one item is greyed. The greyed items, toggled greyed or un-greyed with the right mouse button, are "jobs which will be processed". The blue item is the currently displayed item. I'm rather proud of that interesting user interface. Next there are three modes: synchronize, contribute, and mirror. Finally there are three types of connection: local, FTP and JSync. JSync will be a very interesting mode which I'll keep secret for now. Actually I'll discuss the whole file sharing back end later. I just wanted to mention that I'm working on JSync now. And it's gonna be good. Real good. Just like Kongzi.
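
To give a rough idea of what a "job" holds, here is a sketch of the data behind that menu. The enum and field names are illustrative only, not JSync's actual code:

```java
// Rough sketch of the job selection data model.
class SyncJob {
    enum Mode { SYNCHRONIZE, CONTRIBUTE, MIRROR }
    enum Connection { LOCAL, FTP, JSYNC }

    String name;
    Mode mode;
    Connection connection;
    String localPath;
    String remotePath;
    boolean enabled;     // greyed/un-greyed with the right mouse button
    boolean displayed;   // the blue, currently displayed item
}
```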

I know, I know, I haven't released Kongzi yet… so when am I gonna release JSync? Well, I've been using Kongzi to teach now for a very long time. It's mainly "finished" and could be released, except for one or two small things. The only feature I want to add to Kongzi now is a cloze test generator (also known as a trigger exercise). I've realized the best way to do this is to accompany each dictionary with a separate database of sentences and dialogue. I could attach sentences and dialogue to each vocabulary word, but that's a little clunky. Yeah, well, this post is long enough. I just needed to write down some ideas. There's actually a lot here, which I will be discussing in greater detail as time goes by. But for what it's worth, this is it… probably… Once this round of development is complete, I'll probably release Kongzi. It works really well for teaching and self-learning.
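
For the cloze generator, the core operation is dead simple once a sentence database exists: find a sentence containing the target word and blank it out. A toy sketch (the names here are hypothetical, not the eventual design):

```java
// Minimal sketch of a cloze (trigger exercise) generator: take a sentence
// that contains the target word and blank it out. The sentence database
// and lookup are placeholders for whatever the real design ends up being.
class ClozeGenerator {
    static String makeCloze(String sentence, String targetWord) {
        return sentence.replace(targetWord, "_____");
    }

    public static void main(String[] args) {
        String sentence = "I take the taxi to work every morning.";
        System.out.println(makeCloze(sentence, "taxi"));
        // -> I take the _____ to work every morning.
    }
}
```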

Actually there's something else I want to do. I want to add hiragana, katakana and a basic list of kanji as well. Then really work on the presentation a bit and trim the dictionaries so I have a basic list of, say, 500 English words, 500 Chinese and 500 Japanese kanji. Maybe 500 Korean as well. As a basic release package. Man, there's so much I want to do! Why did I stop working on Kongzi? I got sick and had a motorcycle accident and I was so busy at work… Well, I guess I learned something about myself… back to work!

it’s alive!

Kongzi is working.

I have made a dozen or more small changes, mainly adding some great new features in the last month. But the important thing is, it’s working. I have 40 teaching hours a week now, and I use Kongzi in the classroom to teach English. The students LOVE it. And the schools LOVE the fact that I can give audiovisual teaching. Time to ask for a raise? Maybe 😉

The downside to all this is that I don't have time to really work on the program. I still haven't finished the memory game or the new idea I had, the unscrambling game. But other than that the program is very mature now. As I continue to use it over the next months, it's my hope that, a weekend here and a weekend there, I can finish it off and then step 3: profit.

I know I’ve been rambling about releasing this program for years but perhaps I lacked confidence in the code. Now that I know it works, and works extremely well, I will have a selling point to work from.

Actually, the way these things work, there would probably be more money in selling it as a tool to teach English. But hey, either way, it works like a charm 🙂 I hope I can find the time to finish it soon. Please leave supporting comments; maybe I just need a pep talk to hunker down and start coding again. Right now I just feel tired.

Chinese Character Frequency Lists

Hi. This is in response to a comment left on the Kongzi page.

Calimeron writes:

Hi, I’m using the Yahoo Widget “Mandarin Flashier Cards.” That uses a cedict dictionary with 2 frequencies (in sqlite format). Do you know anything about these frequencies? Or other frequencies available on the web? (I don’t like to use these frequencies if after learning 1000’s of words, I discover the frequencies are not right.) Tip for you: do something similar! If you have a program that starts automatically at the startup of the computer, and then automatically flashes the 5-10000 cards you’d like to study, it’s very difficult not to study!

Hi Calimeron! I have been interested in frequency lists for a long time. The conclusion I've come to is that no one frequency list is the ultimate be-all and end-all. The reason is that frequency lists merely reflect whatever corpus they were analyzed from. Let me give you an example. If I took, say, 200 daily newspapers, 100 novels and business books, and a broad selection of 100 magazines, all from last year, and threw in a few scripts from plays and transcribed news broadcasts, I'd have several million analyzed words. The Chinese government put something out like that. It lists the frequency of simplified characters. That's one list. But if I added a lot of classical Chinese or religious books like the Dao De Jing, the frequency list would change.

Another example is the list put out by the Ministry of Education in Taiwan back in 1997. This is a very modern list with nearly 20 million analyzed words from a broad spectrum of books, magazines, newspapers, television, and even some classical material. This is the frequency order listed in Far East's 3000 Chinese Character dictionary – a book I cannot recommend highly enough for its value as a learning tool.

Yet another example is the internet frequency order. Someone compiled something like 5,000 websites from China and came up with 280 million words. It's huge. Simplified characters. Another list was compiled in 1993 and 1994 from Chinese newsgroups. All of these lists have vastly different orders for characters.

There are other lists. The point I am making is, what are you trying to learn? Obviously using one frequency list over another will target you towards that particular culture. Newspapers in Singapore may use shorthand characters for twenty and thirty which will not appear in other lists – although they may be included in a list of characters targeting Singaporean newspapers.

My advice is that if you're just starting out and you're confused about which frequency list to use, use them all! Study the first 100 or 200 characters in one list, then go study the new characters in another list. It will be very revealing to see which characters are the same and which are different. For example, yi1 (one) and de5 (ownership particle) are going to be in the top ten no matter what, almost for sure. Other words may be closer or farther away. Also, you may wish to know that some lists count bigrams such as "ta1 de" (his) separately from ta1 (he) and de (ownership). So you really have to understand what your frequency list is targeting before you can use it.

The way I solve this problem with Kongzi is by allowing the user to import whatever frequency list they like from the internet. It can currently import five or six different lists.
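
Lists come in all sorts of formats, but many boil down to a character and a count per line, so an importer for the simple case is short. This is just a sketch of the general idea, not Kongzi's actual importer, and it assumes a UTF-8 file with whitespace-separated columns:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of importing a plain-text frequency list where each line is
// "character <whitespace> count" (a common layout; real lists vary).
class FrequencyListReader {
    static Map<String, Long> read(String path) throws IOException {
        Map<String, Long> freq = new LinkedHashMap<>();
        for (String line : Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8)) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length >= 2) {
                freq.put(parts[0], Long.parseLong(parts[1]));
            }
        }
        return freq;
    }
}
```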

For anyone learning Chinese, I strongly recommend Far East's 3,000 Character Dictionary. It is indispensable, and the frequency order is very good for learning the daily Chinese you will see and speak all over the world. I've never seen anything better.

You may also be interested in James E. Dew’s “6,000 Chinese Words”. It is another indispensable book which lists many things including CYY and BLI grading, and several different frequency lists.

For advanced students only, I recommend buying an entire series of very easy children's books (think Aesop's fables, the Three Little Pigs), then indexing and counting each character in each book. That should take you a couple of months! And making your own frequency list from that. You will come up with about 800 words that are significantly common, and you will discover that learning as few as 200 of them will allow you to read 90% of the material in the books. Again, I cover this technique in my new textbook, "Welcome to Chinese".
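
If you want to skip the index cards, the counting part is easy to automate. Here's a sketch that tallies Han characters in a UTF-8 text file and prints them in descending order of frequency; run it over your whole pile of books and you have your own list:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Count how often each Chinese character appears in a UTF-8 text file,
// ignoring punctuation, Latin letters and everything non-Han.
class CharacterCounter {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        String text = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                counts.merge(new String(Character.toChars(cp)), 1, Integer::sum);
            }
            i += Character.charCount(cp);
        }
        counts.entrySet().stream()
              .sorted((a, b) -> b.getValue() - a.getValue())
              .forEach(e -> System.out.println(e.getKey() + "\t" + e.getValue()));
    }
}
```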

For super advanced students, here's something to think about. What percentage of the most common 3000 characters are nouns, what percentage are verbs, and so forth? Choose the most common 20 or 30 nouns, the most common 20 or 30 verbs, and so forth, and you could probably set up the kernel of an MCI system in 100 words. This puts near-conversational fluency in the hands of first-year college students. This is something akin to what I'm doing for Kongzi and WTC.

Good luck on your Chinese journey!

Using Styled Documents

A few beta testers reported back that on their netbooks' small, high-resolution screens, the HTML-limited size of the Chinese characters was too small. So I redid it using styled documents.

Notice anything different?

[Screenshot: kongzi-styled-docs]

It’s pretty much the same, except now you can blow the Chinese character up as much as you like.

Perfect for netbooks! This was definitely a step I needed to take in order to move towards Kongzi on portable devices.
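
For anyone curious about the mechanism, this is roughly the Swing idea involved (not the literal Kongzi code): with a StyledDocument you can set the character size to any point value you like, instead of being limited to HTML's handful of font sizes:

```java
import javax.swing.JFrame;
import javax.swing.JTextPane;
import javax.swing.SwingUtilities;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.StyleConstants;
import javax.swing.text.StyledDocument;

// Blow a character up to an arbitrary point size via a StyledDocument.
class BigCharacterDemo {
    public static void main(String[] args) {
        SwingUtilities.invokeLater(() -> {
            JTextPane pane = new JTextPane();
            pane.setText("學");
            StyledDocument doc = pane.getStyledDocument();
            SimpleAttributeSet attrs = new SimpleAttributeSet();
            StyleConstants.setFontSize(attrs, 200);   // any size you like
            doc.setCharacterAttributes(0, doc.getLength(), attrs, false);

            JFrame frame = new JFrame("Styled document demo");
            frame.setDefaultCloseOperation(JFrame.DISPOSE_ON_CLOSE);
            frame.add(pane);
            frame.setSize(400, 400);
            frame.setVisible(true);
        });
    }
}
```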

One of the next big goals for Kongzi is adding a Speech module (i.e. press F1 to hear the Chinese). Associated with this will be a new “listen and type” style of quiz, “listen mode” for multiple choice, etc.
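
I haven't settled on a speech engine yet, so the only part I can sketch is the key-binding side. The binding below is standard Swing; speak() is purely a placeholder until a real TTS back end is chosen:

```java
import java.awt.event.ActionEvent;
import javax.swing.AbstractAction;
import javax.swing.JComponent;
import javax.swing.JPanel;
import javax.swing.KeyStroke;

// Sketch of the planned F1-to-speak hookup. The key binding is standard
// Swing; speak() is only a stand-in for whatever speech engine gets used.
class SpeechHookSketch {
    static void installSpeechKey(JPanel panel, String chineseText) {
        panel.getInputMap(JComponent.WHEN_IN_FOCUSED_WINDOW)
             .put(KeyStroke.getKeyStroke("F1"), "speak");
        panel.getActionMap().put("speak", new AbstractAction() {
            @Override
            public void actionPerformed(ActionEvent e) {
                speak(chineseText);   // placeholder for a real TTS call
            }
        });
    }

    static void speak(String text) {
        System.out.println("Would speak: " + text);  // stand-in for TTS
    }
}
```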

Development is proceeding extremely rapidly. I'm working on some UI issues now, making it better. Kongzi is becoming more and more robust. I am currently using it to teach myself, but entering characters is slow going; there are only 250 words in the dictionary now.

I have a really good feeling about Kongzi these days. I think it’s going to be a success!