Kongzi can now import CC-CEDICT

Yes, Kongzi Beta-5 can now import data from CEDICT (or CC-CEDICT). It has no problem loading up each of the over 100,000 entries and making them editable, searchable, and quizzable. Yet, there is a problem. It turns out that there is so much non-chinese information (such as korean characters, just as an example) and many unsupported (ancient?) characters, and the definitions so unwieldly – so many mistakes and missing pieces of information – that I don’t feel using CC-CEDICT is a good idea.

Looks like I’ll have to just keep adding characters one by one. It will take a while but beginners would only be interested in the first few hundred anyways, so I have my work cut out for me.

I suspect things would be different if there was any sort of organizational data in CC-CEDICT to help draw only on what would be useful, but due to several of the project’s stated rules (such as not creating a separate entry for parts of speech – a huge mistake in and of itself) and not including any frequency data or grading data (another huge mistake where learners are concerned) CC-CEDICT is actually quite useless for learners compared to better organized dictionaries or lists of words (such as the BLI or CYY lists, or the ones used to make Far East’s 3000 Learner’s Dictionary). In that respect you probably wouldn’t want to use CC-CEDICT. I’ll leave the import feature in, however, since you never know who may wish to use it for whatever purpose.

In other news:


As you can see I’ve recovered the Mix ‘n Match quiz feature. The Multiple choice quiz feature was also recovered. Some cosmetic changes will come later, of course, but all the logic is there. Some of the logic is even more advanced than previously. You will notice that at least one of the answers (if not more than one) is highly similar to the one being presented. For example, we include the character for seven, and the phonetics for eight and nine, when the proper answer is three. And so forth. In Mix ‘n Match, we include no, when the correct answer is yes. And so forth. It makes for a more difficult (and therefore more interesting/useful) quiz. Note also that you can’t use similar terms to help you narrow down an answer. The logic will also include head fakes (see the SI prefixes above or the numbers included with yes/no below). You really have to know the answer, no guessing allowed!

You’ll also notice in the above that pinyin tone numbers and tone marks have been mixed. This is because I haven’t finished entering the shortcuts in. That’s fine, the program always recognizes tone numbers (and will always recognize both interchangeably once I finish the shortcuts).


The program is stable enough that I’ve started adding characters again (in order to have fun using it, and beta test it while doing so, of course!) In it’s current state it is very useable, however there are a number of things I’d like to do before releasing it and there are still a few bugs and things I haven’t hammered out yet. I’m still annoyed that I lost the memory game quiz. That was fun and amazing. Although now, the problem of word size has been solved by the implication it will select only from words of the same size (as such are organized by tags, providing a convenient solution). When I do reimplement it, I will cause it to toss in characters with the same radical (or which differ only by a radical) to help the user learn to visualize and discern between characters. I have high hopes for that feature.

I’d like to have at least 1,000 characters in the dictionary before I start public beta testing. One important organization is the grading scheme (A, B, ss, or Jia, Yi, Bing, Ding, etc.) used by some government testing programs, HSK grading, and that sort of thing. Which brings me to frequency representation. What a headache! I could easily include three different items for frequency and have them all be different. Instead of causing needless headaches, or lying to the user that only one number is accurate, I think that the best way to use ‘frequency data’ will be to allow the user to import frequency order lists from a file. That way if the user decides to switch from one order to another it would be an easy thing to do. This way I also save myself from having to write special code to determine which particular frequency order they were using. Brilliant!

Now, back to adding characters to the dictionary…


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: