Getting the dictionary into a suitable XML format

It's hard to give general instructions for getting the dictionary into a suitable format. One possibility is that you already have your dictionary in XML format. In this case you might need to do nothing. However, at least as commonly, people have dictionaries in (SIL's) Shoebox or some similar "backslash" format. Or your dictionary may be in yet another format. And often people have maintained dictionaries in text editors or word processors, and there are then usually numerous errors in formatting and consistency. Kirrkirr requires a correct ("well-formed") XML file, and will make inconsistencies in your data formatting glaringly obvious!

Here are some more details on what's required for the dictionary format. This section assumes a basic knowledge of XML. If you don't know what the terms used here mean, you should look at an XML intro. There are now numerous XML books, and you can find various introductions at Robin Cover's XML pages. (Don't try to read the official XML specifications though! They're impenetrable even to Kirrkirr's author. Find a simple introduction. You only need to know the basics.)

If your dictionary isn't currently in XML, then you need to convert it to XML. In practice, we've usually done this using scripts or programs, written in Perl, Python, or Java. One could do it by hand, but it would get very tedious for a large dictionary. There are many XML editor programs that you could use (see the XML resource page above). A particularly common case is that you have a Shoebox dictionary (a.k.a. FOSF (Field Ordered Standard Format) or SFM (Standard Format Markers)). In that case you can follow these instructions. You need Shoebox version 5 or Toolbox for this to work.

  1. Start with a SFM dictionary. Make sure you can load it in Shoebox 5 or Toolbox, if you haven't regularly maintained it with these tools.
  2. Shoebox backslash codes can start with things such as numbers, but XML elements must start with a letter. Unfortunately, though, the Shoebox XML export doesn't ensure this and one can end up with ill-formed XML from a Shoebox/Toolbox export. If you have any tags that do not start with letters, change them so that they do. You can do this before the XML export, or after the XML export by doing search-and-replace on the XML file. Update: I just tested this again with Toolbox 1.5.1 of Feb 2007, and this is still true. The Shoebox XML export does not guarantee well-formed, let alone valid, XML output. Of course, if you stick to MDF markers, which do all start with letters, then this problem won't bite you.
  3. Shoebox fields may contain non-ASCII, non-Unicode characters (such as Windows CP-1252 characters like dashes and "smart" quotes), and Shoebox will export these into the XML file. These should be replaced either before or after the export. The XML file should either contain only US-ASCII characters (using numeric character entities for other characters) or be UTF-8 encoded Unicode. If there are just a few stray "smart" quotes or similar, then this can easily be done by search and replace; if some fields are systematically encoded in some other character set you probably need some software to do this. If the whole file is in another encoding, then you can use some general character set encoding changing software, but if only certain fields are in a different encoding, then what you really need is SIL's TECkit software (use the sfconv component).
  4. Define in Shoebox grouping information for what should be related or subsidiary fields. You should say what is the "record divider" - normally the tag for the headword, and then say which fields should go under other fields. For example, translations should normally go under examples, and if there are subsenses, various fields should be placed under them. This structure will be exported into the XML file. If you've built up your dictionary in Shoebox, you've probably already done this, certainly if you followed the manual, but if you are freshly working with a "backslash" dictionary in Shoebox, then you will have to do this now.
  5. Use the Shoebox menu item File | Export | XML Basic. In the dialog box, accept the default options and have it export the entire file.
  6. If there are cases of multiple identical headwords (homographs), you need to provide something to make them unique. If you want to have links to these words, this is best done by hand by providing a field with this information (this can just be a number used to distinguish each homograph, akin to the superscript letters or numbers that are commonly used for this purpose in printed dictionaries). However, if everything else is correct, you can technically fulfil this requirement by using the Unique Headwords tool inside Kirrkirr on an XML dictionary for which you have a DictionaryInfo file.
  7. If you have picture or sound files, you need them linked to the dictionary headwords. What you want is some field whose value is the filename of a picture or sound file. This may already be the case in your Shoebox dictionary. Otherwise, you might need to add these links. We've used a tool called Picture Adder to do this. The picture or sound file names should just be file names and should not contain folder names or paths, since Kirrkirr will always look for sound or picture files in a particular directory for a particular language (the audio and images folders that we discussed creating previously).
  8. After doing the XML export, you may have to fix up the treatment of tags within fields. The Shoebox manual recommends using <XX>...</XX>-style XML tags in the value of a field if you want to denote structure in a field entry, but the Shoebox XML export doesn't support this and indeed mangles such annotations by simply escaping the angle brackets. If you had any such nested tags, they need to be reconstructed so that tags within fields are real tags again, rather than the escaped &lt; and &gt; codes that Shoebox exported. In the past, we've done this with a regular expression replace operation in a text editor. Most plain text editors support such regular expression replace operations.
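
Steps 2, 3, and 8 are the ones that lend themselves most readily to scripting. What follows is a minimal Python sketch of those three clean-ups, not the scripts we actually used. The function names, the "m" prefix added to tag names, and the fallback to CP-1252 are all assumptions made for illustration. The inline-tag pattern only handles simple tags without attributes, and it will also un-escape any text that merely looks like such a tag, so inspect the result.

```python
import re

# Step 8: an escaped open or close tag such as &lt;XX&gt; or &lt;/XX&gt;.
# Only simple tags whose names start with a letter are matched; a bare
# "&lt;" used as a less-than sign is left alone.
ESCAPED_TAG = re.compile(r"&lt;(/?[A-Za-z][A-Za-z0-9_.-]*)&gt;")

# Step 2: an open or close tag whose name starts with a digit,
# for example <1s> or </1s>.
DIGIT_TAG = re.compile(r"<(/?)([0-9])")


def decode_export(raw):
    """Step 3: turn the exported bytes into text.

    Assumption: the file is either valid UTF-8 already, or is entirely
    in Windows CP-1252. Mixed encodings need a tool such as TECkit.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("cp1252")


def fix_numeric_tag_names(text, prefix="m"):
    """Step 2: prefix tag names that start with a digit (hypothetical prefix)."""
    return DIGIT_TAG.sub(lambda m: "<" + m.group(1) + prefix + m.group(2), text)


def restore_inline_tags(text):
    """Step 8: turn escaped inline tags back into real tags."""
    return ESCAPED_TAG.sub(r"<\1>", text)


def clean_export(in_path, out_path):
    """Apply the three clean-ups and write the result as UTF-8."""
    with open(in_path, "rb") as source:
        text = decode_export(source.read())
    text = restore_inline_tags(fix_numeric_tag_names(text))
    with open(out_path, "w", encoding="utf-8") as target:
        target.write(text)
```

Calling clean_export("export.xml", "dictionary.xml") (both file names hypothetical) would write a cleaned copy and leave the original export untouched. If the export carries an XML header naming a different encoding, that header needs to be corrected by hand to match the UTF-8 output.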

Before going any further, check that your dictionary XML file is well-formed XML. You can verify this by loading it into any piece of software that contains an XML parser. If it won't go through an XML parser, it won't work with Kirrkirr. Kirrkirr itself includes an XML parser, but it is easier at this stage to simply load the dictionary into a Web browser, such as Internet Explorer or Mozilla. (Note: checking by loading it in a browser will only work if the file doesn't define an XSLT stylesheet, or if it does and the stylesheet is available.) For example, here is a well-formed, if very small, XML dictionary destined for Kirrkirr: tinydict.xml. Try loading it in your web browser. It should be well-formed. Here's a dictionary with an error in it: tinybung.xml.
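
If you would rather check from a script than from a browser, any XML parser will do. Here is a minimal sketch using the parser in Python's standard library. The function names are made up for this example, and the element names in the usage note are hypothetical rather than taken from tinydict.xml. It reads the whole dictionary into memory, which should be fine for dictionaries of ordinary size.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(xml_bytes):
    """Return None if the bytes parse as XML, else a description of the error."""
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError as err:
        # The parser reports where it gave up as a (line, column) pair.
        line, column = err.position
        return "line %d, column %d: %s" % (line, column, err)
    return None


def check_file(path):
    """Check one file on disk and report the outcome as a short message."""
    with open(path, "rb") as source:
        problem = well_formedness_error(source.read())
    return "well-formed" if problem is None else "NOT well-formed, " + problem
```

For instance, well_formedness_error(b"<DICTIONARY><ENTRY><HW>jaja</ENTRY></DICTIONARY>") returns an error message, because the HW element is never closed, whereas the same input with the closing </HW> tag in place returns None.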

Character encoding

An XML file can specify its character encoding in the XML header. If it doesn't specify a character encoding, the character encoding is assumed to be UTF-8. A restriction of Kirrkirr at present is that it does not work correctly with all character encodings. The dictionary needs to be either in US-ASCII (with any non-ASCII characters indicated as numeric character entities such as &#237;) or in UTF-8. Also, note that any character entities must be numeric character entities or the built-in ones like &lt;. Because of the way Kirrkirr cuts dictionaries up into individual entries, any character entities defined in the DTD are currently lost. We hope to better address some of these issues in a future version.
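
These two restrictions can be checked mechanically. The sketch below is an illustration only, with made-up function names, and is not part of Kirrkirr. It relies on the fact that US-ASCII is a subset of UTF-8, so a single UTF-8 decoding test covers both permitted encodings. The entity search is a plain regular expression, so it will also flag entity-like text inside comments or CDATA sections.

```python
import re

# The five entities that every XML parser knows without a DTD.
BUILT_IN = {"lt", "gt", "amp", "apos", "quot"}

# A named entity reference such as &eacute; (numeric ones start with "#"
# and so are not matched).
NAMED_ENTITY = re.compile(r"&([A-Za-z_][A-Za-z0-9_.-]*);")


def encoding_problems(raw):
    """List the ways the raw bytes break the restrictions described above."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as err:
        return ["not valid UTF-8 at byte offset %d" % err.start]
    names = sorted(set(NAMED_ENTITY.findall(text)) - BUILT_IN)
    return ["named entity &%s; is not built in" % name for name in names]


def to_ascii_with_char_refs(text):
    """Encode text as US-ASCII, writing other characters as &#NNN; references."""
    return text.encode("ascii", errors="xmlcharrefreplace")
```

An empty list from encoding_problems means the file passed both checks. Calling to_ascii_with_char_refs on the single character í gives the bytes for &#237;, the same numeric character entity used as the example above.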

Proceed to Building an XSLT stylesheet


http://nlp.stanford.edu/kirrkirr/dictionaries/xmldictionary.html
Christopher Manning -- <manning@cs.stanford.edu> -- Last modified: Sat Jul 14 23:32:35 PDT 2007