Automated Annotation Tool

The other day I sat down with my Chinese copy of Alice in Wonderland, which I picked up in Beijing last year. My intention was to lie in the sun by the lake until I had finished the first page, using the dictionary as needed to achieve basic comprehension. The result was a bad sunburn and only two of four paragraphs finished. What went wrong?

The first page of Alice in Wonderland in Chinese

Given my limited knowledge of Chinese vocabulary, most of my reading time is spent looking up unknown characters and phrases in a dictionary. Using CEDICT on the iPhone is far faster than looking words up in a dead-tree tome, but it is still incredibly time-consuming. About an hour into my Alice reading session, I realized that my process of looking up characters and words in the dictionary was quite methodical and could easily be automated. So, I began work on an Automatic Foreign Language Annotation Tool (German: Fremdspracheannotationswerkzeug, or SAWZ for short). The tool will have the following features:

written in Python
takes Unicode plaintext as input
looks up every "unfamiliar" word and generates an annotation
annotations appear as either footnotes or marginal glosses to minimize interruption to reading
generates a TeX source file for rendering by pdflatex into a PDF
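
The lookup-and-annotate loop behind these features might be sketched roughly as follows. This is a minimal illustration and not the actual SAWZ code: it assumes the dictionary is in the CC-CEDICT line format (`Traditional Simplified [pin1 yin1] /gloss/gloss/`), it splits the text with a greedy longest-match segmentation (an assumption, since the post does not say how words are found), and the function names `parse_cedict`, `segment`, and `annotate` are hypothetical.

```python
import re

# CC-CEDICT line format: Traditional Simplified [pin1 yin1] /gloss 1/gloss 2/
CEDICT_LINE = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/\s*$")

# Minimal escaping for gloss text; backslash, ~ and ^ are not handled here.
TEX_SPECIALS = {"&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#",
                "_": r"\_", "{": r"\{", "}": r"\}"}


def tex_escape(s):
    """Escape the most common TeX special characters in plain text."""
    return "".join(TEX_SPECIALS.get(c, c) for c in s)


def parse_cedict(lines):
    """Map each simplified headword to a list of (pinyin, glosses) entries."""
    entries = {}
    for line in lines:
        if line.startswith("#"):  # comment lines in the dictionary file
            continue
        m = CEDICT_LINE.match(line)
        if m:
            _trad, simp, pinyin, glosses = m.groups()
            entries.setdefault(simp, []).append((pinyin, glosses.split("/")))
    return entries


def segment(text, entries, max_len=4):
    """Greedy longest-match segmentation (assumed strategy).

    At each position, take the longest dictionary word of up to max_len
    characters; fall back to a single character if nothing matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if n == 1 or chunk in entries:
                tokens.append(chunk)
                i += n
                break
    return tokens


def annotate(text, entries):
    """Return TeX source with a footnote after every word found in entries.

    Only the first dictionary entry is used; no disambiguation is attempted.
    """
    out = []
    for tok in segment(text, entries):
        if tok in entries:
            pinyin, glosses = entries[tok][0]
            note = tex_escape(f"{pinyin}: {'; '.join(glosses)}")
            out.append(f"{tok}\\footnote{{{note}}}")
        else:
            out.append(tok)
    return "".join(out)


sample = ["白兔 白兔 [bai2 tu4] /white rabbit/", "兔 兔 [tu4] /rabbit/"]
entries = parse_cedict(sample)
tex_body = annotate("小白兔", entries)
```

The returned string is only the body text; it would still need a document preamble with CJK font support before pdflatex (or a similar engine) could render it. Swapping `\footnote` for `\marginpar` would give marginal glosses instead.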

I hacked together a preliminary version of the program over the weekend. It generates annotated PDFs. Here's an example:

The first few lines of Alice annotated by SAWZ. Note that this is the most verbose setting, wherein the entire dictionary entry for every new word is given.

Already this program could be useful. However, it clearly gives far too much annotation. The primary improvements that I would like to implement are:

word disambiguation to allow annotation with only the most correct dictionary entry, resulting in a traditional one-word marginal gloss
an ignore list filled with words that the reader already knows to reduce the number of annotations

Implementing the disambiguation would be a non-trivial NLP task, which I'm up for. However, I have a bit to learn before I can attempt it. The ignore list is easy -- it just requires a list or lists of words sorted by difficulty (or by the order in which they are learned). The words learned in Rosetta Stone, the HSK lists, and/or Wheatley's course make good starting points. Frequency analysis of phrases from some corpus could also provide useful data for deciding when to annotate.
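
A rough sketch of how such an ignore list could work is shown below. It is illustrative only: the layout of the word lists (a dict from level number to words), the sample words, and the function names are all assumptions, and annotating each unknown word only on its first appearance is an extra design choice rather than something the tool is known to do.

```python
from collections import Counter


def build_ignore_set(level_lists, reader_level):
    """Union of all word lists at or below the reader's level.

    level_lists is a hypothetical {level: [words]} mapping, e.g. one
    list per HSK level or per course lesson.
    """
    known = set()
    for level, words in level_lists.items():
        if level <= reader_level:
            known.update(words)
    return known


def common_words(corpus_tokens, top_n):
    """The top_n most frequent tokens in a corpus, as a set.

    These can be added to the ignore set on the theory that very
    frequent words are the ones a reader learns first.
    """
    counts = Counter(corpus_tokens)
    return {word for word, _ in counts.most_common(top_n)}


def words_to_annotate(tokens, ignore):
    """Tokens not in the ignore set, each once, in order of appearance."""
    seen = set()
    result = []
    for tok in tokens:
        if tok not in ignore and tok not in seen:
            seen.add(tok)
            result.append(tok)
    return result


# Made-up sample data, not taken from any real word list.
level_lists = {1: ["我", "你"], 2: ["兔子"], 3: ["洞"]}
ignore = build_ignore_set(level_lists, reader_level=2)
ignore |= common_words(["的", "的", "了", "兔"], top_n=1)
todo = words_to_annotate(["我", "的", "洞", "兔子", "洞"], ignore)
```

Raising `reader_level` or `top_n` shrinks the set of annotated words, which is the knob that trades annotation density against reading flow.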

The White Rabbit (白兔)

As usual, if anyone is interested in this program, please email me and I'd be glad to share it with you.

After writing this program, I found this thread, which links to two online annotation tools:

zhtoolkit
MandarinSpot Annotate -- try the "For Printing" option

These tools are available now, and work quite well. The MandarinSpot tool has a decent print mode, although it will not generate beautifully typeset PDFs like LaTeX will.

Originally published on Quasiphysics.