Getting the dictionary into a suitable XML format

It's hard to give general instructions for getting the dictionary into a suitable format. One possibility is that you already have your dictionary in XML format. In this case you might need to do nothing. However, at least as commonly, people have dictionaries in (SIL's) Shoebox or some similar "backslash" format. Or your dictionary may be in yet another format. And often people have maintained dictionaries in text editors or word processors, and there are then usually numerous errors in formatting and consistency. Kirrkirr requires a correct ("well-formed") XML file, and will make inconsistencies in your data formatting glaringly obvious!

Here are some more details on what's required for the dictionary format. This section assumes a basic knowledge of XML. If you don't know what the terms used here mean, you should look at an XML intro. There are now numerous XML books, and you can find various introductions at Robin Cover's XML pages. (Don't try to read the official XML specifications though! They're impenetrable even to Kirrkirr's author. Find a simple introduction. You only need to know the basics.)

Your XML dictionary must be well-formed (syntactically correct XML). If there are any errors at all in the XML (missing brackets or quote marks or things like that), you will be completely unable to use the dictionary within Kirrkirr.
You do not need to have a DTD, and indeed it's okay if there's some inconsistency in entry structure, though there must be enough consistency that you can specify things that Kirrkirr needs. But it's okay for the data to be truly "semi-structured" in the sense that occasional entries might define extra fields in an only semi-consistent way.
Related to the above point, there isn't a single XML structure that Kirrkirr works with. Kirrkirr can be made to work with basically any XML dictionary (with a few restrictions noted below). Kirrkirr uses the idea of a mediator, which says how to get semantic notions that Kirrkirr needs out of another XML file. The cost of this flexibility is that you have to provide a DictionaryInfo XML file specifying how to find information in your dictionary. One way to avoid creating this file is to copy the structure of an existing Kirrkirr dictionary which already has a DictionaryInfo file. Here are the restrictions on XML formats for Kirrkirr:
- Kirrkirr assumes that a dictionary is a list of entries. There may be front matter or end matter, and there may be headers corresponding to letters etc., but the essential structure of the dictionary must be as a list of entries. That is, the top level XML element must contain all the dictionary entries as a flat list (perhaps interspersed with other elements). This is most likely already true of any "document-oriented" XML format -- certainly if you have made an XML dictionary by exporting one from Shoebox. Kirrkirr probably can't handle a dictionary done as "standoff annotation" XML, or, say, if you write out relational database tables containing a dictionary as XML.
- Kirrkirr mainly uses XPath to access the dictionary, but uses line-based regular expression searching in the advanced search capability. This means that line breaks will matter for matching in searches, and lines of the file should correspond to natural elements of the document. Kirrkirr would work badly if there were no line breaks in the file or, say, a line break after every word. Ideally, complete examples or definitions will be on one line.
- Kirrkirr builds a unique key to refer to every entry. In the simplest case, this will just be the headword. However, many dictionaries have multiple headwords that are the identical orthographic string (homographs). If this occurs in your dictionary, you must provide something (an XML attribute or another XML element) that can be used to distinguish the homographs so that they can be uniquely referred to. Kirrkirr will use this additional element (in a multipart key) to reference words, and you will need to provide it to crossreference to such words. (If this all seems a lot of trouble, just make sure all the headwords are unique by always using multiple senses for words with multiple meanings rather than allowing identical headwords - however, this isn't usual dictionary-making practice.)
Notwithstanding the above, people sometimes wonder about what a suitable DTD for Kirrkirr is. Sometimes they're wondering particularly about the DTD used with the Warlpiri dictionary. In case you're interested, you can download the DTD for the Warlpiri dictionary. But note that the other sample dictionaries provided with Kirrkirr 4.0 use XML files with quite different structures, and do not provide an explicit DTD.
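
As a rough illustration of the flat-list and unique-key restrictions above, here is a minimal Python sketch that looks only at the direct children of the top-level element and reports any keys that occur more than once. The element names ENTRY and HW, the attribute hnum, and the sample data are all invented for this example; they are not names that Kirrkirr requires, so substitute whatever your own dictionary uses.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical names for illustration only; use the ones in your dictionary.
ENTRY_TAG = "ENTRY"
HEADWORD_TAG = "HW"
HOMOGRAPH_ATTR = "hnum"


def find_duplicate_keys(xml_text):
    """Return the (headword, homograph marker) keys that occur more than once."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    # Only direct children of the root are examined, because the entries
    # are expected to form a flat list under the top-level element.
    for entry in root.findall(ENTRY_TAG):
        hw = entry.find(HEADWORD_TAG)
        if hw is None:
            continue
        headword = (hw.text or "").strip()
        counts[(headword, hw.get(HOMOGRAPH_ATTR, ""))] += 1
    return sorted(key for key, n in counts.items() if n > 1)


# Invented sample data: "bank" is disambiguated, "tree" is not.
SAMPLE = """<DICTIONARY>
<ENTRY><HW hnum="1">bank</HW><DEF>side of a river</DEF></ENTRY>
<ENTRY><HW hnum="2">bank</HW><DEF>place that keeps money</DEF></ENTRY>
<ENTRY><HW>tree</HW><DEF>first entry</DEF></ENTRY>
<ENTRY><HW>tree</HW><DEF>second entry</DEF></ENTRY>
</DICTIONARY>"""

print(find_duplicate_keys(SAMPLE))
```

Any key this reports would need a distinguishing attribute or element before the entries can be referred to unambiguously.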

If your dictionary isn't currently in XML, then you need to convert it to XML. In practice, we've usually done this using scripts or programs, written in Perl, Python, or Java. One could do it by hand, but it would get very tedious for a large dictionary. There are many XML editor programs that you could use (see the XML resource page above). A particularly common case is that you have a Shoebox (a.k.a. FOSF (Field Ordered Standard Format) or SFM (Standard Format Markers)) dictionary. In that case you can follow these instructions. You need Shoebox version 5 or Toolbox for this to work.
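
For readers who would rather script the conversion than use the Shoebox export, the following is a minimal Python sketch of the general idea, not the script we actually used. It assumes that every field sits on a single line, that the record divider is \lx, and that a flat list of fields under each entry is good enough; it builds no grouping hierarchy. The names DICTIONARY and ENTRY and the "f" prefix for markers that begin with a digit are arbitrary choices made for this example.

```python
import re
import xml.etree.ElementTree as ET

RECORD_MARKER = "lx"  # assumed record divider (the headword marker)


def element_name(marker):
    """Turn a backslash marker into something usable as an XML element name."""
    # XML element names cannot begin with a digit, so such markers get a
    # prefix. Other characters that are invalid in names are NOT handled.
    return marker if re.match(r"[A-Za-z_]", marker) else "f" + marker


def sfm_to_xml(sfm_text, root_name="DICTIONARY", entry_name="ENTRY"):
    """Convert one-field-per-line backslash text to a flat XML entry list."""
    root = ET.Element(root_name)
    root.text = "\n"
    entry = None
    for line in sfm_text.splitlines():
        match = re.match(r"\\(\S+)\s*(.*)", line)
        if not match:
            continue  # this sketch ignores blank and continuation lines
        marker, value = match.groups()
        if marker == RECORD_MARKER:
            entry = ET.SubElement(root, entry_name)
            entry.text = "\n"
            entry.tail = "\n"
        if entry is None:
            continue  # skip any header fields before the first record
        field = ET.SubElement(entry, element_name(marker))
        field.text = value.strip()
        field.tail = "\n"  # one field per line, which suits line-based search
    return ET.tostring(root, encoding="unicode")


# Invented sample data, including a marker that starts with a digit.
SAMPLE = r"""\lx dog
\ps n
\ge canine
\2d hound

\lx cat
\ge feline
"""

print(sfm_to_xml(SAMPLE))
```

Because the output is produced by an XML library rather than by string concatenation, characters such as & and < in field values are escaped automatically.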

Start with an SFM dictionary. Make sure you can load it in Shoebox 5 or Toolbox, if you haven't regularly maintained it with these tools.
Shoebox backslash codes can start with things such as numbers, but XML elements must start with a letter. Unfortunately, though, the Shoebox XML export doesn't ensure this and one can end up with ill-formed XML from a Shoebox/Toolbox export. If you have any tags that do not start with letters, change them so that they do. You can do this before the XML export, or after the XML export by doing search-and-replace on the XML file. Update: I just tested this again with Toolbox 1.5.1 of Feb 2007, and this is still true. The Shoebox XML export does not guarantee well-formed, let alone valid, XML output. Of course, if you stick to MDF markers, which do all start with letters, then this problem won't bite you.
Shoebox fields may contain non-ASCII, non-Unicode characters (such as Windows CP-1252 characters like dashes and "smart" quotes), and Shoebox will export these into the XML file. These should be replaced either before or after the export. The XML file should either contain only US-ASCII characters (using numeric character entities for other characters) or be UTF-8 encoded Unicode. If there are just a few stray "smart" quotes or similar, then this can easily be done by search and replace; if some fields are systematically encoded in some other character set you probably need some software to do this. If the whole file is in another encoding, then you can use some general character set encoding changing software, but if only certain fields are in a different encoding, then what you really need is SIL's TECkit software (use the sfconv component).
Define in Shoebox grouping information for what should be related or subsidiary fields. You should say what is the "record divider" - normally the tag for the headword, and then say which fields should go under other fields. For example, translations should normally go under examples, and if there are subsenses, various fields should be placed under them. This structure will be exported into the XML file. If you've built up your dictionary in Shoebox, you've probably already done this, certainly if you followed the manual, but if you are freshly working with a "backslash" dictionary in Shoebox, then you will have to do this now.
Use the Shoebox menu item File | Export | XML Basic. In the dialog box, accept the default options and have it export the entire file.
If there are cases of multiple identical headwords (homographs), you need to provide something to make them unique. If you want to have links to these words, this is best done by hand by providing a field with this information (this can just be a number used to distinguish each homograph, akin to the superscript letters or numbers that are commonly used for this purpose in printed dictionaries). However, if everything else is correct, you can technically fulfil this requirement by using the Unique Headwords tool inside Kirrkirr on an XML dictionary for which you have a DictionaryInfo file.
If you have picture or sound files, you need them linked to the dictionary headwords. What you want is to have some field whose value is a filename that contains a picture or sound file. This may already be the case in your Shoebox dictionary. Otherwise, you might need to add these links. We've used a tool Picture Adder to do this. The picture or sound file names should just be file names and should not contain folder names or paths, since Kirrkirr will always look for sound or picture files in a particular directory for a particular language (the audio and images folders that we discussed creating previously).
After doing the XML Export, you may perhaps have to fix up the treatment of tags within fields. The Shoebox manual recommends using <XX>...</XX>-style XML tags in the value of a field if you want to denote structure in a field entry, but their XML export doesn't support this and indeed mangles such annotations by simply escaping the angle brackets in the export. If you had any such nested tags, then they need to be reconstructed so that tags within fields are real tags again, rather than the escaped &lt; and &gt; codes that Shoebox exported. In the past, we've done this with a regular expression replace operation in a text editor. Most plain text editors support such regular expression replace operations.
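
The regular expression replacement described in the last step can also be scripted. Here is a minimal Python sketch of one way to do it; it is not the exact expression we used. The tag names in INLINE_TAGS are hypothetical placeholders: list only the tags you really used inside fields, so that other escaped angle brackets (for instance a literal less-than sign in a definition) are left alone.

```python
import re

# Hypothetical inline tag names; replace with the ones used in your fields.
INLINE_TAGS = ("XX", "SRC")


def restore_inline_tags(text, tags=INLINE_TAGS):
    """Turn escaped inline tags such as &lt;XX&gt; back into real tags."""
    names = "|".join(re.escape(tag) for tag in tags)
    # Matches &lt;XX&gt; and &lt;/XX&gt; for the listed names only.
    pattern = re.compile(r"&lt;(/?)(%s)&gt;" % names)
    return pattern.sub(r"<\1\2>", text)


line = "<ex>a &lt;XX&gt;b&lt;/XX&gt; c &lt; d</ex>"
print(restore_inline_tags(line))
```

Restricting the pattern to known tag names is a deliberate choice: a blanket replacement of every escaped bracket could turn legitimate text into ill-formed XML.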

Before going any further, check that your dictionary XML file is well-formed XML. You can verify this by loading it into any piece of software that contains an XML parser. If it won't go through an XML parser, it won't work with Kirrkirr. Kirrkirr itself includes an XML parser, but it is easier at this stage to simply load the dictionary into a Web browser, such as Internet Explorer or Mozilla. (Note: checking by loading it in a browser will only work if the file doesn't define an XSLT stylesheet, or if it does and the stylesheet is available.) For example, here is a well-formed, if very small, XML dictionary destined for Kirrkirr: tinydict.xml. Try loading it in your web browser. It should be well-formed. Here's a dictionary with an error in it: tinybung.xml.
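
If you would rather check from a script than in a browser, any XML parser will do. The sketch below uses the parser in Python's standard library; the function name check_well_formed is made up for this example, and the check covers well-formedness only, not validity against a DTD.

```python
import xml.etree.ElementTree as ET


def check_well_formed(path):
    """Return None if the file parses, else a description of the first error."""
    try:
        ET.parse(path)
    except ET.ParseError as err:
        line, column = err.position
        return "line %d, column %d: %s" % (line, column, err)
    return None
```

Calling check_well_formed("tinydict.xml") on a local copy of that file should return None, while a file with a missing bracket or quote mark should produce a message pointing at where the parser gave up.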

Character encoding

An XML file can specify its character encoding in the XML header. If it doesn't specify a character encoding, the character encoding is assumed to be UTF-8. A restriction of Kirrkirr at present is that it does not work correctly with all character encodings. The dictionary needs to be either in US-ASCII (with any non-ASCII characters written as numeric character entities, such as &#237; for í) or in UTF-8. Also, note that any character entities must be numeric character entities or the built-in ones like &lt;. Because of the way Kirrkirr cuts dictionaries up into individual entries, any character entities defined in the DTD are currently lost. We hope to better address some of these issues in a future version.
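
When a whole file is in a single legacy encoding, the conversion to US-ASCII with numeric character entities can be sketched in a few lines of Python. This assumes the entire file is in one known encoding (Windows CP-1252 is used here purely as an example); it does not help when only certain fields use a different encoding. The function name to_ascii_xml is invented for this illustration.

```python
def to_ascii_xml(raw_bytes, source_encoding="cp1252"):
    """Re-encode bytes as pure ASCII, using numeric character references.

    Assumes the whole input is in source_encoding. Bytes that are not
    defined in that encoding raise UnicodeDecodeError rather than being
    silently guessed at.
    """
    text = raw_bytes.decode(source_encoding)
    # Every non-ASCII character becomes a reference of the form &#NNN;
    return text.encode("ascii", errors="xmlcharrefreplace")


# CP-1252 bytes: e-acute (0xE9) and a pair of "smart" quotes (0x93, 0x94).
print(to_ascii_xml(b"<d>caf\xe9 \x93hi\x94</d>"))
```

If the file's XML header names the old encoding, it is probably tidiest to update or remove that declaration after converting, so that the header matches the content.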

Proceed to Building an XSLT stylesheet

http://nlp.stanford.edu/kirrkirr/dictionaries/xmldictionary.html

Christopher Manning -- <manning@cs.stanford.edu> -- Last modified: Sat Jul 14 23:32:35 PDT 2007