Classes for character set conversion and Unicode-based tokenization.

@author Roger Levy