Building a DictionaryInfo XML File for the Kirrkirr Dictionary Visualization Program

Rather than demanding that a dictionary be in a particular XML format (that is, to have a particular Document Type Definition (DTD) or XML Schema), Kirrkirr attempts to provide a mechanism so that it will work effectively with almost any XML dictionary format, without requiring any changes to the program or your data. Nevertheless, in order for Kirrkirr to interface with different dictionaries, each dictionary must be accompanied by a specification file partially describing the structure of the XML dictionary file, in particular how various XML elements map onto Kirrkirr constructs. This specification file is also an XML file. The specification file is effectively a way of passing parameters and characteristics of the dictionary to Kirrkirr. Most of these concern what the structure of the dictionary is, but some provide additional Kirrkirr-specific information, such as how to color links. The rest of this document describes all the possible parameters for a dictionary specification file, with examples from the Warlpiri dictionary.

Making the necessary specification file for a new dictionary isn't rocket science, but it does require reasonable familiarity with XML, XPath, XSLT, and the structure of the target dictionary. Developing this file is the most technically challenging part of getting Kirrkirr working on a new dictionary. Since the format of the DictionaryInfo file needs to be followed exactly, we suggest you include the DictionaryInfo DTD inside your DictionaryInfo file and make sure that the file is valid according to the DTD (for example, by loading it in a web browser such as Internet Explorer or Firefox). However, this isn't required. (Note: unfortunately, the current version of Kirrkirr does not support external DTDs, so you really do need to include it inside your DictionaryInfo file.)

The values of elements in the specification can variously be XML element names ("tags"), XPath expressions, and regular expressions. We try to note carefully which is which. It will be useful to have some familiarity with XPath (certainly if you are trying to do something complicated). XPath is covered in most XML and XSLT books, and there are various web tutorials.

Note that for an L1->L2 dictionary, we sometimes refer below to L1 as the headword language and L2 as the gloss language.

What do I need to put in the specification file?

The DictionaryInfo file should be an XML file with a header (<?xml version="1.0"?>), a DTD, and then a single KIRRKIRR_SPEC element, inside which the rest of the elements are defined, mainly as a list, but with some nesting. It may be easiest to start with an existing DictionaryInfo file and modify it to fit your dictionary, based on the discussion below.

DictionaryInfo specification file elements

KIRRKIRR_SPEC - (This is a necessary element for the specification file.) This is simply the XML root of the specification file. All other specification elements are contained within it.

DICTIONARY_ENTRY_XPATH - (This is a necessary element for the specification file.) This tells Kirrkirr the path to follow to pick out individual entries in your dictionary. For example, for the Warlpiri dictionary, that path is: /DICTIONARY/ENTRY. The value of this element is an XPath. Important: in the current version of Kirrkirr, the value of this element must be a particular, highly restricted form of XPath: an absolute XPath consisting of an element for the dictionary followed by an element for a dictionary entry. Kirrkirr depends on this assumption, and will reject any other form of XPath. This restriction allows Kirrkirr to automatically chop the dictionary up into pieces corresponding to different entries. (In a future version, we plan to at least slightly generalize things so that there can be intervening elements between the dictionary element and the entry element (e.g., for letters or sections of the dictionary), but will probably continue to insist that a dictionary is basically just a list of entries.) In practice, this restriction usually isn't too problematic, since this is exactly how most people write their dictionaries. For example, the XML export format of a Shoebox dictionary is necessarily always in this form.

Warlpiri example:


In the Warlpiri XML dictionary structure, DICTIONARY is the root element, and ENTRY is the element that contains each entry.
    <DEF>dog </DEF>
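For the structure just described, the corresponding specification element would look something like the following. This is a sketch based on the path /DICTIONARY/ENTRY given above, not a copy of the actual Warlpiri file:

```xml
<!-- Absolute XPath: the dictionary root element followed by the entry element -->
<DICTIONARY_ENTRY_XPATH>/DICTIONARY/ENTRY</DICTIONARY_ENTRY_XPATH>
```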

HEADWORD_XPATH - (This is a necessary element for the specification file.) This is the XPath to the headword of each entry. There should be one headword per entry. If multiple headwords are spelled the same way, you need to specify a UNIQUIFIER_XPATH to distinguish them. (If we let HW_TAG be the element name used for headwords, then standardly one would expect that HEADWORD_XPATH = DICTIONARY_ENTRY_XPATH + "/" + HW_TAG, since DICTIONARY_ENTRY_XPATH is already an absolute path.) This can be an arbitrary unrestricted XPath. For example, one could have the XPath be /DICTIONARY/ENTRY/HW[@TYPE != "OMIT"] to include in the dictionary only words that aren't marked for omission in the TYPE attribute of the HW element.

Warlpiri example:


In the Warlpiri XML dictionary structure, HW is the tag for the headword of each entry.
       <HW> jajina </HW>
       <DEF> mulgara </DEF>
       <HW> maliki </HW>
       <DEF> dog </DEF>
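Given that HW sits directly under ENTRY, the headword specification would presumably be along these lines (a sketch, assuming the /DICTIONARY/ENTRY entry path described earlier):

```xml
<!-- Entry path plus the headword element name -->
<HEADWORD_XPATH>/DICTIONARY/ENTRY/HW</HEADWORD_XPATH>
```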

UNIQUIFIER_XPATH - If all of the headwords in the dictionary are unique, this is an optional (unnecessary) element. However, if there are multiple entries in the dictionary with the same headword (i.e., through polysemy/homonymy), this is a necessary element, because there must be a way for Kirrkirr to distinguish entries. The XPath can pick out either an attribute or another element. (Note that some dictionaries prefer to list all words spelled the same way under one entry, whereas others distinguish between polysemy and homonymy by having multiple senses per entry as well as multiple entries per headword. Either way works fine with Kirrkirr.)

Warlpiri example:


In the Warlpiri XML dictionary structure, homonyms are distinguished by an HNUM attribute on the headword tag of homonyms.
       <HW HNUM="1"> -ja </HW>
        <DEF>Assertive clitic.</DEF>
       <HW HNUM="2"> -ja </HW>
        <DEF>Past tense verb suffix.</DEF>
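Since the uniquifier here is an attribute of the headword element, a plausible specification selects it with '@'. This is a sketch that assumes HW is a child of /DICTIONARY/ENTRY, as in the earlier examples:

```xml
<!-- Selects the HNUM attribute of the headword; absent for unique headwords -->
<UNIQUIFIER_XPATH>/DICTIONARY/ENTRY/HW/@HNUM</UNIQUIFIER_XPATH>
```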

HEADWORD_REGEXP and UNIQUIFIER_REGEXP - (These are optional elements in the specification file.) You don't need these, but if you define a HEADWORD_REGEXP (and, if you have a UNIQUIFIER_XPATH, also a UNIQUIFIER_REGEXP), then Kirrkirr will use these in searches rather than matching XPaths. Using them will give faster searching over large dictionaries, but should otherwise give exactly the same results as not defining these fields at all. See the SEARCH_REGEXP section below for more discussion of regular expressions in Kirrkirr. Here, the capturing group of the regular expression picks out the appropriate element content.

<HEADWORD_REGEXP>&lt;HW[^&gt;]*&gt;([^&lt;]+)&lt;/HW&gt;</HEADWORD_REGEXP>
<UNIQUIFIER_REGEXP>&lt;HW[^&gt;]*HNUM=\"([^\"]+)\"</UNIQUIFIER_REGEXP>

SEARCH_REGEXPS - (This is an optional element for the specification file.) With no further work on your part, by utilizing the information in other specification elements, Kirrkirr will be able to search based on headwords, gloss words, and through entire dictionary entries. However, people commonly would also like to be able to do searches restricted in various other ways, for example, to look for uses of a word in dictionary examples. In this element one can define a list of SEARCH_REGEXP elements which define other kinds of searches. This list may be empty (which is the same as the element not being present).

SEARCH_REGEXP - (This is an optional element in the optional SEARCH_REGEXPS element.) This holds the attributes relating to one user-defined search. Each of these elements must contain two elements: the MENU_NAME under which this search is listed, and the REGEXP that is used in the search.

MENU_NAME - (This is a required element in the optional SEARCH_REGEXP element.) The content of this element is an arbitrary string which will be used as the menu name for a particular restricted search in the Advanced Search panel. (In order to facilitate internationalization, please use underscores rather than spaces, if needed, in your string.)

REGEXP - (This is a required element in the optional SEARCH_REGEXP element.) This holds a regular expression used to do a restricted search. Note that Kirrkirr uses line-oriented regular expression searching, so one can only successfully define restricted searches where what one wants to match occurs on one line of the underlying XML file. The regular expression should define a single capturing group. The contents of this capturing group (the first if there is more than one) will be used as the content that is searched for a match. These regular expressions typically match some element. Note that the angle brackets in the element should be escaped in the usual XML way. Some typical examples are:


	        <MENU_NAME>Warlpiri example</MENU_NAME>
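As an illustration of the two forms, here is a sketch of a SEARCH_REGEXPS element. The element name EXAMPLE and both menu names are hypothetical; substitute whatever element your dictionary uses. The first REGEXP assumes the element never carries attributes, while the second tolerates attributes after the tag name:

```xml
<SEARCH_REGEXPS>
  <SEARCH_REGEXP>
    <MENU_NAME>Example_sentences</MENU_NAME>
    <!-- Form 1: no attributes, at most one EXAMPLE element per line -->
    <REGEXP>&lt;EXAMPLE&gt;([^&lt;]+)&lt;/EXAMPLE&gt;</REGEXP>
  </SEARCH_REGEXP>
  <SEARCH_REGEXP>
    <MENU_NAME>Example_sentences_with_attributes</MENU_NAME>
    <!-- Form 2: the extra character class skips any attributes after the tag name -->
    <REGEXP>&lt;EXAMPLE[^&gt;]*&gt;([^&lt;]+)&lt;/EXAMPLE&gt;</REGEXP>
  </SEARCH_REGEXP>
</SEARCH_REGEXPS>
```

In both cases the single capturing group picks out the element content that will be searched.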

The first form will work for matching an element assuming it has no attributes and there is only ever one case of this element on a line of the file. The second form allows attributes after the tag. As we all know, you can get a long way with regular expressions, but there are some things that you just can't capture this way. Kirrkirr uses the Jakarta ORO library to match text. This library supports Perl version 5.004 regular expression syntax.

XSL_FILES - (This is a necessary element for the specification file.) This element holds the group of tags which describe the XSL files to be used for this dictionary. It must have at least one XSL_FILE as a descendant, so Kirrkirr can display the dictionary in HTML.

XSL_FILE - (There must be at least one XSL_FILE element). This holds the attributes relating to one XSL file.

XSL_FILENAME - (Each XSL_FILE must have an XSL_FILENAME). This is simply the filename of the file. The file must be in the xsl directory which is in your dictionary directory. To keep the program platform independent, try to avoid 'special' characters in filenames (backslashes, forward slashes, etc.), which may work on some systems and not others.

SHORT_NAME - (Each XSL_FILE must have a SHORT_NAME). This is the name that will be shown to the user (in the Preferences | Formatted Entry options). (In order to facilitate internationalization, please use underscores rather than spaces in your string.)

XSL_DESCRIPTION - (Each XSL_FILE must have an XSL_DESCRIPTION). This is the description that will be shown to the user (in the Preferences | Formatted Entry options). (In order to facilitate internationalization, please use underscores rather than spaces in your string.)

Warlpiri example:

The Warlpiri specification describes two XSL files. The first file is named basic.xsl. The user will see the following information in the options panel: Basic View: Display just the definitions. Similarly, the second is named all.xsl and the user will see: Advanced View: Display everything in the database. Note the underscores in the specification strings, which are removed when the user sees the text (indeed, the user might even see different text in another language).
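A sketch of how those two descriptions could be written follows. The filenames and display strings are taken from the description above; the order of the child elements is an assumption and should be checked against the DTD:

```xml
<XSL_FILES>
  <XSL_FILE>
    <XSL_FILENAME>basic.xsl</XSL_FILENAME>
    <!-- Underscores are shown to the user as spaces -->
    <SHORT_NAME>Basic_View</SHORT_NAME>
    <XSL_DESCRIPTION>Display_just_the_definitions</XSL_DESCRIPTION>
  </XSL_FILE>
  <XSL_FILE>
    <XSL_FILENAME>all.xsl</XSL_FILENAME>
    <SHORT_NAME>Advanced_View</SHORT_NAME>
    <XSL_DESCRIPTION>Display_everything_in_the_database</XSL_DESCRIPTION>
  </XSL_FILE>
</XSL_FILES>
```
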
This is the last required element that you need to define in a DictionaryInfo file to get Kirrkirr to sort of work with that dictionary. We'd suggest at first just entering this much information and trying things out. In your file, give a filename for the dictionary.index, but put nothing after the equals sign for the reverseIndex and domainFile properties, since you haven't yet specified enough information for these to be made, and go through the step of making an index file. You should then be able to use your new dictionary in Kirrkirr.

REF_HEADWORD_XPATH and REF_UNIQUIFIER_XPATH - (These are optional elements.) Some elements in a dictionary entry contain links to other headwords in the dictionary. (The link may be one of the network links (defined in LINKS below) or simply a link between words in the Formatted Entry panel, which allows the user to click on a hyperlink and get to another Formatted Entry.) The easiest case is that the entire content of this crossreferencing element uniquely determines the headword that the crossreference is to. In this case (and only this case), one need not define REF_HEADWORD_XPATH. Not setting it is equivalent to giving it the value of . in XPath -- its value is the crossreferencing node, and the entire content of that element is used for the crossreference. In many cases you need to do more, because the word to link to is actually stored as an attribute or subelement and/or a uniquifier is defined to resolve homographs. The word to link to is determined by the REF_HEADWORD_XPATH and the REF_UNIQUIFIER_XPATH (if present). These XPaths are evaluated relative to the larger element being decoded (e.g., relative to a LINK), and so they should not be full XPaths over the dictionary, but should be defined relative to the containing element. For instance, if one had a structure like:

Then the relevant parts of the DictionaryInfo would be as follows (the LINK_XPATH picks out the whole link element, and the other two XPaths are defined relative to it):
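A sketch of such a case follows. The XREF element and its HW and HN attributes are hypothetical names chosen purely for illustration, and where exactly the two REF_ elements belong in the file should be checked against the DTD:

```xml
<!-- Assumed dictionary structure (hypothetical):
     <ENTRY> ... <XREF HW="pama" HN="2">person</XREF> ... </ENTRY> -->

<!-- Picks out the whole cross-referencing element -->
<LINK_XPATH>/DICTIONARY/ENTRY/XREF</LINK_XPATH>
<!-- Both of these are evaluated relative to the XREF element selected above -->
<REF_HEADWORD_XPATH>@HW</REF_HEADWORD_XPATH>
<REF_UNIQUIFIER_XPATH>@HN</REF_UNIQUIFIER_XPATH>
```
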
For another example of these elements, see the example under LINKS below.

LINKS - (This is an optional element.) This holds the group of tags which describe the links between words in the dictionary. The links are visualized in the network panel. (Common links include synonyms, antonyms, etc). In the dictionary, each word can have any number of links of various types. Each link is described by a name and a color (for display in the network panel). Also, each link must include a way to reference the word that is being linked to (see below for examples).

LINK - (Each LINKS must have at least one LINK). This describes the attributes of one link: the name, color and XPath used to access it for a word in the dictionary.

LINK_NAME - (Each LINK must have one LINK_NAME). This is the short name for the link, displayed on the legend in the background of the network panel. (In order to facilitate internationalization, please use underscores rather than spaces in your string.)

LINK_XPATH - (Each LINK must have one LINK_XPATH). This is the XPath to follow in the dictionary to retrieve the linked-to item for a given entry, perhaps /DICTIONARY/ENTRY/SYNONYM. If you do nothing more, the content of that element is expected to be a uniquely referring headword, and the link will be clickable if and only if that is the case. More complicated cases are handled by defining REF_HEADWORD_XPATH and REF_UNIQUIFIER_XPATH. See the examples.

LINK_COLOR - (Each LINK must have one LINK_COLOR). This can be expressed either as a color or RGB numbers.

COLOR_NAME - (Each LINK_COLOR must have either a COLOR_NAME or a set of COLOR_R, COLOR_B and COLOR_G). This must be one of the thirteen canonical Java colors: black, darkGray, gray, lightGray, pink, red, orange, yellow, green, cyan, blue, magenta, or white. (Though white is probably a bad idea!)

COLOR_R, COLOR_B, and COLOR_G - (Each LINK_COLOR must have either a COLOR_NAME or a set of COLOR_R, COLOR_B and COLOR_G). Each element must be an integer between 0 and 255, inclusive. These represent the red, blue and green components of a color. You can get a color wheel in Kirrkirr by going under "Options -> Preferences -> Word Colors -> RGB".

Warlpiri example:


In the Warlpiri XML dictionary structure, each synonym is a SYNI element under the SYN group (which may be empty) under an ENTRY. The word that is linked to is indicated by attributes in the SYNI element which tell the headword to link to (HENTRY) and the (optional) unique padding for that headword (HNUM). The links show up in the network as blue and the legend names the link as "Same Meaning" (with underscores removed).
       <HW HNUM="1">jajina</HW>
         <SYNI HENTRY="jalurti">jalurti</SYNI>
         <SYNI HENTRY="murrja">murrja</SYNI>
         <SYNI HENTRY="jajina" HNUM="1">jajina</SYNI>
         <SYNI HENTRY="murrja">murrja</SYNI>
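Putting the pieces of this description together, the LINKS specification might look roughly like the sketch below. The names SYN, SYNI, HENTRY and HNUM, the color, and the legend name come from the description above; nesting the REF_ elements inside LINK and the order of the children are assumptions to check against the DTD:

```xml
<LINKS>
  <LINK>
    <!-- Shown in the legend as "Same Meaning" -->
    <LINK_NAME>Same_Meaning</LINK_NAME>
    <!-- Each SYNI element under SYN is one synonym link -->
    <LINK_XPATH>/DICTIONARY/ENTRY/SYN/SYNI</LINK_XPATH>
    <!-- Relative to SYNI: the headword and optional uniquifier of the target -->
    <REF_HEADWORD_XPATH>@HENTRY</REF_HEADWORD_XPATH>
    <REF_UNIQUIFIER_XPATH>@HNUM</REF_UNIQUIFIER_XPATH>
    <LINK_COLOR>
      <COLOR_NAME>blue</COLOR_NAME>
    </LINK_COLOR>
  </LINK>
</LINKS>
```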

SENSE_XPATH - This is an optional element. This is the XPath that will pick out senses within a dictionary entry, assuming that dictionary entries are broken into senses. This XPath is used only in the context of looking for links and semantic domains: the program will attempt to identify links and semantic domains with senses, by synthesizing an XPath that assumes that the XML structure of a sense mirrors that of an entry in the relevant terms. (That is, if /DICT/ENTRY/SYN picks out a synonym in a basic entry and /DICT/ENTRY/SENSE is the path to sense elements, then /DICT/ENTRY/SENSE/SYN must be the path to a synonym inside a sense. This is assuming that synonyms sometimes appear inside senses and sometimes not; if they always appear inside senses, one can just define the path to synonyms as /DICT/ENTRY/SENSE/SYN.) See the discussion of DOMAIN_XPATH for some further discussion of SENSE_XPATH.


IMAGES_XPATH - This is an optional element. This is the XPath that will pick out images within a dictionary entry.


AUDIO_XPATH - This is an optional element.


This may match multiple times in an entry, but at present Kirrkirr only supports one sound file per entry.

DOMAIN_XPATH and DOMAIN_COMPONENT_XPATH - (This is an optional element.) DOMAIN_XPATH is an absolute XPath from the document root which specifies the path that will extract the (one or more) semantic domain descriptions for a word. For example,


The contents of this element must either be #PCDATA or else must contain one or more daughter elements. In the first case the content is interpreted as the domain name. In the second the specification file should also define a DOMAIN_COMPONENT_XPATH, which is a relative XPath:


The sequence of matches of this XPath is interpreted as domain, subdomain, subsubdomain, etc. (or alternatively as a sequence of implicitly ordered meaning components in a componential analysis), and any other elements inside DOMAIN_XPATH are ignored.
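To make the two-level case concrete, here is a sketch. The element names DOMAIN and DMI and the sample domain values are hypothetical, chosen only to illustrate the absolute and relative paths:

```xml
<!-- Assumed entry content (hypothetical):
     <DOMAIN><DMI>fauna</DMI><DMI>mammal</DMI></DOMAIN> -->

<!-- Absolute XPath from the document root to each domain description -->
<DOMAIN_XPATH>/DICTIONARY/ENTRY/DOMAIN</DOMAIN_XPATH>
<!-- Relative to each DOMAIN match: "fauna" is read as the domain, "mammal" as the subdomain -->
<DOMAIN_COMPONENT_XPATH>DMI</DOMAIN_COMPONENT_XPATH>
```

If DOMAIN instead held plain text, DOMAIN_COMPONENT_XPATH would simply be left out.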

Words with multiple semantic domains: Kirrkirr assumes that each sense of a word belongs to a unique semantic domain. To have Kirrkirr correctly work with words that list multiple semantic domains, each must be regarded as a separate sense. One possibility is that the dictionary entry is divided into senses, and then one can show that division with the SENSE_XPATH. If the SENSE_XPATH element is defined, then Kirrkirr will also look for domain information within senses by synthesizing an XPath based on the one given here and the SENSE_XPATH. This will work correctly only if both of these are simple straightforward absolute XPaths accessing everything implicitly via the child:: axis. If you have multiple

FREQUENCY_XPATH - This is an optional element.

SUBWORD_XPATH - This is an optional element. The SUBWORD_XPATH is evaluated as an XPath on an entry (from the dictionary root). If this path has a value, then the word is assumed to be a subword (i.e., a dictionary subentry). That means that, for a main entry, you should ensure that the given XPath has no value. This involves a little backwards thinking, so let's go through a few cases of what you should do.

  1. Dictionary has an attribute (perhaps on headwords or entries) that only occurs when the word is a subword. For example, we might have:
    <ENTRY><HW TYPE="SUB">jirrama-jinta</HW>
    for a subentry, and the below for a main entry:
    This is the conceptually easiest case. We write an XPath expression that matches that attribute when present and nothing otherwise (note the use of '@' in an XPath to select an attribute of an element):
  2. But what if all words have a SUBWORD element, which has the value "yes" or "no"?
    <ENTRY><HW>cricket team</HW>
    An XPath like the above won't work (Kirrkirr can't interpret "yes" and "no" any more than "oui" or "non"). What you must do is use a restriction on an XPath so the XPath only matches and returns something for subwords. Restrictions on XPaths are placed inside square brackets, and you can use '.' to refer to the current node:
  3. If you have an element linking from subwords to mainwords, then you shouldn't need to introduce a new element or attribute especially for marking a word as a subword, you should just be able to test for the presence of that link type. So if the dictionary looks like:
    <ENTRY><HW>cricket team</HW>
    then the following will work:
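The three cases above can be sketched as follows. Only one SUBWORD_XPATH would appear in a real file; the root element name DICTIONARY, the SUBWORD child element in case 2, and the MAINWORD link element in case 3 are assumptions made for illustration:

```xml
<!-- Case 1: the TYPE attribute is present only on subentries -->
<SUBWORD_XPATH>/DICTIONARY/ENTRY/HW/@TYPE</SUBWORD_XPATH>

<!-- Case 2: every entry has a SUBWORD element; match it only when its value is "yes" -->
<SUBWORD_XPATH>/DICTIONARY/ENTRY/SUBWORD[.="yes"]</SUBWORD_XPATH>

<!-- Case 3: subentries carry a link back to their main entry; test for its presence -->
<SUBWORD_XPATH>/DICTIONARY/ENTRY/MAINWORD</SUBWORD_XPATH>
```

In each case the XPath selects nothing for a main entry, which is what tells Kirrkirr that the entry is not a subword.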

CUSTOM_FIELDS - This is an optional element. Its descendants are CUSTOM_FIELD, FIELD_NAME, and FIELD_XPATH.
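A sketch of what a custom field definition might look like, assuming that each CUSTOM_FIELD wraps one FIELD_NAME and one FIELD_XPATH. The field name and the ETYM element are hypothetical:

```xml
<CUSTOM_FIELDS>
  <CUSTOM_FIELD>
    <!-- Hypothetical field name and dictionary element -->
    <FIELD_NAME>Etymology</FIELD_NAME>
    <FIELD_XPATH>/DICTIONARY/ENTRY/ETYM</FIELD_XPATH>
  </CUSTOM_FIELD>
</CUSTOM_FIELDS>
```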

DIALECT_XPATH - This is an optional element.

REGISTER_XPATH - This is an optional element.

REVERSE_DICTIONARY_XPATH - This is an optional element. This is a full XPath from the dictionary root that specifies the field or fields which should be used to build gloss language entries for the gloss-to-headword index. Ideally, it is a field that contains glosses (as opposed to full sentence definitions). For instance, if a dictionary entry has a field like:

<ENTRY> ...
<GL><GLI>thorn</GLI>, <GLI>prickle</GLI>, <GLI>sticker</GLI>, <GLI>spike</GLI>, <GLI>spine</GLI></GL> ...
Then the appropriate element would be:
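A sketch of that element, assuming the entries sit directly under a root element named DICTIONARY (the fragment above does not show the root):

```xml
<!-- Selects each individual gloss item rather than the whole GL group -->
<REVERSE_DICTIONARY_XPATH>/DICTIONARY/ENTRY/GL/GLI</REVERSE_DICTIONARY_XPATH>
```
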
(Although all the glosses shown above are single words, multiword glosses are quite acceptable!) If your dictionary has no such gloss field, then all is not lost. You should give an XPath to whatever field gives the definition (or one word translations in a word list). If this field has full sentence definitions, you will certainly want to define quite a lot of STOPWORDS. The end results will not be as good, but will still be quite usable.

GLOSS_XPATH - This is an optional element. This is a full XPath from the dictionary root that specifies the field or fields which should be used to give glosses of words, such as are shown in the Network pane. Ideally, it is a field that contains glosses. If there is no such field, it should be the field with definitions, and a more generous set of stopwords should be supplied.

POS_XPATH - This is an optional element. It is a full XPath to part of speech information.

Warlpiri example:


In the Warlpiri XML dictionary structure, /DICTIONARY/ENTRY/POS picks out the part of speech of an entry.
       <HW> jajina </HW>
       <DEF> mulgara </DEF>
       <POS> Noun </POS>
       <HW> maliki </HW>
       <DEF> dog </DEF>
       <POS> Noun </POS>
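Written as a specification element, the path named above would be:

```xml
<POS_XPATH>/DICTIONARY/ENTRY/POS</POS_XPATH>
```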

FUZZY_SPELLING_SUBSTITUTIONS - This is an optional element. Its descendants are HEADWORD_LANGUAGE_SUBS and/or GLOSS_LANGUAGE_SUBS. Both of these in turn contain a list of SUBSTITUTION elements, each of which consists of an ORIGINAL_REGEXP element followed by a REPLACEMENT_REGEXP element. These are used to implement fuzzy matching in both the basic search box and in advanced search.
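The nesting just described can be sketched as follows. The particular substitutions (rr to r, ph to f) are invented for illustration, and exactly how Kirrkirr applies each pair during matching is not covered here, so treat this only as a guide to the element structure:

```xml
<FUZZY_SPELLING_SUBSTITUTIONS>
  <HEADWORD_LANGUAGE_SUBS>
    <SUBSTITUTION>
      <!-- Hypothetical: let a single r stand in for a double rr -->
      <ORIGINAL_REGEXP>rr</ORIGINAL_REGEXP>
      <REPLACEMENT_REGEXP>r</REPLACEMENT_REGEXP>
    </SUBSTITUTION>
  </HEADWORD_LANGUAGE_SUBS>
  <GLOSS_LANGUAGE_SUBS>
    <SUBSTITUTION>
      <!-- Hypothetical: treat ph and f alike -->
      <ORIGINAL_REGEXP>ph</ORIGINAL_REGEXP>
      <REPLACEMENT_REGEXP>f</REPLACEMENT_REGEXP>
    </SUBSTITUTION>
  </GLOSS_LANGUAGE_SUBS>
</FUZZY_SPELLING_SUBSTITUTIONS>
```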

STOPWORDS - This is an optional element. Each child is a STOPWORD element. Each STOPWORD is removed, if present, from the words in the gloss of each headword, in the order given in this file, unless there is only one word left in the gloss, which is then retained. Only words that survive this stopwords filter become main entries in the gloss list. Note: because a stopword is only removed if there are other words present in the gloss, it is in general safe and a good idea to list lots of words in the stopwords list, especially if the gloss field you are using has lengthy glosses/definitions. Putting a word like your on the stopword list won't negatively affect a word which is glossed simply as "your", since the last gloss word is always retained. Also, since words are removed in the order listed, it is best to put the most common 'real' stopwords like the and of earliest in the list.
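A short sketch of a stopword list following that advice, with the most common words first. The selection of words is only an example:

```xml
<STOPWORDS>
  <!-- Most common stopwords first, since removal follows the listed order -->
  <STOPWORD>the</STOPWORD>
  <STOPWORD>of</STOPWORD>
  <STOPWORD>a</STOPWORD>
  <!-- Safe to list: a gloss consisting only of "your" is still retained -->
  <STOPWORD>your</STOPWORD>
</STOPWORDS>
```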

STOPCHARS - This is an optional element. Each child is a STOPCHAR element. Each STOPCHAR element is stripped from strings (with a few special rules for apostrophe (')) before glosses are separated into words for the reverse index.

DICTERRORS - This is an optional element. Each child is a DICTERROR element. Each DICTERROR is matched (string equality after removing leading and trailing whitespace) against both headwords and gloss items during index construction. If something matches, it is deleted from the headword or gloss index. In theory this element should never be used. If something is to be omitted or sometimes omitted from a dictionary, then it should be appropriately marked in the XML markup, and then the necessary conditions could be specified on either the HEADWORD_XPATH or the GLOSS_XPATH. However, in practice, the world isn't perfect, and most of our dictionaries mark a few special cases for omission with this element.

GLOSS_LANGUAGE_NAME - This is an optional element. The name of the gloss language.

DICTIONARY_LANGUAGE_NAME - This is an optional element. The name of the headword language.

GLOSS_LANGUAGE_ICON - This is an optional element. The icon for the gloss language (a JPEG or GIF picture).

DICTIONARY_LANGUAGE_ICON - This is an optional element. The icon for the headword language (a JPEG or GIF picture).


This is the current KIRRKIRR_SPEC DTD.

Small example
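As a minimal sketch, a DictionaryInfo file containing only the required elements might look like the following. It assumes the simple /DICTIONARY/ENTRY/HW structure used in the examples above and a single stylesheet named basic.xsl; the internal DTD is omitted here, and the element order should be checked against it:

```xml
<?xml version="1.0"?>
<!-- An internal DTD would normally go here; it is omitted in this sketch -->
<KIRRKIRR_SPEC>
  <!-- Root element followed by the entry element -->
  <DICTIONARY_ENTRY_XPATH>/DICTIONARY/ENTRY</DICTIONARY_ENTRY_XPATH>
  <!-- One headword per entry -->
  <HEADWORD_XPATH>/DICTIONARY/ENTRY/HW</HEADWORD_XPATH>
  <!-- At least one stylesheet, stored in the xsl directory of the dictionary -->
  <XSL_FILES>
    <XSL_FILE>
      <XSL_FILENAME>basic.xsl</XSL_FILENAME>
      <SHORT_NAME>Basic_View</SHORT_NAME>
      <XSL_DESCRIPTION>Display_just_the_definitions</XSL_DESCRIPTION>
    </XSL_FILE>
  </XSL_FILES>
</KIRRKIRR_SPEC>
```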


Warlpiri DictionaryInfo

You can download the Warlpiri dictionary specification file. A copy is also included with the web download of Kirrkirr as part of the MiniWrl dictionary.

Getting Kirrkirr working on a dictionary with homophones

Homophones are an issue that requires care to work correctly in Kirrkirr, but Kirrkirr is designed to support them. If your dictionary has no homophones (that is, if every headword of the dictionary is different), then you can ignore this section. But many dictionaries do have repeated headwords (often distinguished by superscript numerals in print dictionaries), and this section goes through how to get such a dictionary working flawlessly in Kirrkirr.

Essentially, Kirrkirr requires a unique key to refer to each dictionary entry. By default it uses just the headword. But then, if there are homophones, the key is the same for multiple words: although they are all listed in the word list, Kirrkirr will always display just one of them in the various panels, which definitely isn't what you want.

To fix this, one needs to define an XML element or attribute that will distinguish homophones. This field is the equivalent of the superscript numeral of the print dictionary. Indeed, numbering homophones is the simplest and best thing to do in XML as well.

Then, to have homophones work as headwords you need to specify this element or attribute in the DictionaryInfo file, perhaps as:
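For instance, assuming the homophone number is stored in an element named hm (the name used in the regular expression discussed below) and that entries live at the hypothetical path /dictionary/entry, the definition might be:

```xml
<!-- hm holds the homophone number, like the superscript numeral in print -->
<UNIQUIFIER_XPATH>/dictionary/entry/hm</UNIQUIFIER_XPATH>
```
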


If you've also defined HEADWORD_REGEXP (which speeds up searches, but isn't required), then you should also similarly define a UNIQUIFIER_REGEXP:
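A sketch of such a definition for an hm element, with the angle brackets escaped as XML requires. The capturing group picks out the content of the element:

```xml
<UNIQUIFIER_REGEXP>&lt;hm&gt;([^&lt;]+)&lt;/hm&gt;</UNIQUIFIER_REGEXP>
```
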


The inside bit is really the regular expression <hm>([^<]+)</hm>, but the angle brackets need to be escaped for XML.

Finally, there is the issue of links to words that are homophones. If a link element contains just the headword pama, and there are two homophones pama, then that's again not going to uniquely identify one, and so the link will not be live in places like the Network display. Kirrkirr lets you define in the XML a unique key for the word to match (which can even differ from the displayed text for the link) via the DictionaryInfo fields REF_HEADWORD_XPATH and REF_UNIQUIFIER_XPATH.


For instance, if the contents of the synonym element was always the headword, you could define just a uniquifier, with an XML attribute of hn:

<synonym hn="2">pama</synonym>

and then in the DictionaryInfo file you would simply define:
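A sketch of that definition: the XPath is relative to the synonym element, so it only needs to select the hn attribute. REF_HEADWORD_XPATH can be left undefined because the element content is already the headword:

```xml
<REF_UNIQUIFIER_XPATH>@hn</REF_UNIQUIFIER_XPATH>
```
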


With all that done, everything should work fine! (However, since homophones are usually few (but important!) words, you may want to get the dictionary working in Kirrkirr with non-homophones, before trying to get these details right.)

Proceed to Building the Kirrkirr indices

Christopher Manning -- Last modified: Tue Aug 2 10:49:20 PDT 2005