Slapped in the Face by Natural Language Processing

I just spent three weeks solving some of the hardest logic puzzles I’ve done in a while. And what did I learn?

Well, I finally found a class for that little bit of code in the main program that didn’t seem to belong there. Long story.

Recall my last post, where I described the new Memory game module. I had always had a feeling that once I finished rewriting all the old code and fixing up the UI a bit, the rest of the program would come out in a landslide. And it sort of did. I was blessed with a small holiday thanks to summer vacation and being ill, so I had quite a few solid days to work on Kongzi. And work I did. I found and fixed countless tiny widdle bugs, and I also added a phrase module: Kongzi can now give multiple-choice best-word quizzes.

This led me to a rather well-known, and actually unsolvable, problem in Natural Language Processing (NLP; and no, not neuro-linguistic programming). Say I have a phrase, any phrase, and I choose one of the a-b-c-d answers to be correct. How do I then choose answers b, c, and d to be words which are not correct, in such a way that they do not clue the reader in to the answer? This is much more difficult than it appears. First of all, it is not acceptable to record a small number of "wrong" answers per question, because if those words are not in the current targeted vocabulary, the remaining (correct) word will be conspicuously familiar. In practice this lets students pass a test even when they do not really understand the meaning of the words. Limiting candidate words by part of speech is, for the same reason, unacceptable. Likewise limiting words by tense. It becomes an exercise in choosing the word based on its grammatical fit more than anything else.

Example: “I take the ____ to work every morning.” a) grape  b) taxi   c) car    d) airplane

Obviously a is not a good answer, and d is illogical. But between b and c, the answer is not clear. It's probably c, but it might be b. This is the problem: the student might choose an answer which is actually right, and the computer would not know that.

The real problem is that with thousands of words and thousands of example phrases, it is simply not feasible to list every possible wrong answer for every question.

So the solution I came up with is to choose the wrong answers randomly from the target vocabulary, drawing from the least-known words. This creates a situation where the student cannot intentionally choose a word which is a wrong answer, because the definition of that word is one the student does not know. The key here is that the problem described above occurs infrequently. Words the student chooses wrongly in this way will remain in "unknown" status, but words the student knows will slowly rise to the top regardless of a few occasional errors.
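The idea can be sketched roughly like this. The post doesn't show Kongzi's actual data structures, so the `Word` class, the familiarity scores, and the pool-size heuristic below are all assumptions for illustration, not the real implementation:

```python
import random

class Word:
    """Hypothetical vocabulary entry with a familiarity score."""
    def __init__(self, text, familiarity):
        self.text = text                  # the vocabulary word itself
        self.familiarity = familiarity    # 0.0 = unknown, 1.0 = well known

def pick_distractors(vocabulary, correct, n=3):
    """Pick n wrong answers, favoring the least-known words."""
    candidates = [w for w in vocabulary if w.text != correct.text]
    # Sort so the least-familiar words come first, then sample randomly
    # from that least-known pool so the choice isn't fully predictable.
    candidates.sort(key=lambda w: w.familiarity)
    pool = candidates[:max(n * 3, n)]     # small pool of least-known words
    return random.sample(pool, n)

# Toy vocabulary matching the earlier example sentence.
vocab = [Word("taxi", 0.9), Word("grape", 0.1), Word("airplane", 0.2),
         Word("car", 0.8), Word("bicycle", 0.05), Word("subway", 0.15)]
correct = Word("car", 0.8)

distractors = pick_distractors(vocab, correct)
choices = distractors + [correct]
random.shuffle(choices)                   # present a, b, c, d in random order
```

Because the distractors come from the bottom of the familiarity ranking, a student who genuinely knows the tested word can't eliminate the wrong answers by recognizing them, which is the whole point of the scheme.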

Quality of phrase is also important here. A phrase like “He is a very ______ boy.” can lead to a lot of such errors, because there are a lot of possibilities. However, a phrase like “He is a very _____ boy, he is punished every day.” cuts down a great deal on the possible adjectives. It’s very difficult to have a real collision with a sentence like this. Another example:

Bad: Would you like to watch a ______ movie with me tonight?
Good: I like comedy movies, but I don’t like _____ movies. They’re just too scary!

The second sentence above probably has just one or two words which could "normally" apply to it, and the chances of an idea collision are very low. But with the first sentence, if two adjectives end up in the multiple choice (or if part of speech is used to choose only adjectives), the reader might as well flip a coin to decide what the right answer is.

So far the solution seems to be holding up well, and I have only seen the odd rare collision.
