Rather than demanding that a dictionary be in a particular XML format (that is, to have a particular Document Type Definition (DTD) or XML Schema), Kirrkirr attempts to provide a mechanism so that it will work effectively with almost any XML dictionary format, without requiring any changes to the program or your data. Nevertheless, in order for Kirrkirr to interface with different dictionaries, each dictionary must be accompanied by a specification file partially describing the structure of the XML dictionary file, in particular how various XML elements map onto Kirrkirr constructs. This specification file is also an XML file. The specification file is effectively a way of passing parameters and characteristics of the dictionary to Kirrkirr. Most of these concern what the structure of the dictionary is, but some provide additional Kirrkirr-specific information, such as how to color links. The rest of this document describes all the possible parameters for a dictionary specification file, with examples from the Warlpiri dictionary.
Making the necessary specification file for a new dictionary isn't rocket science, but it does require reasonable familiarity with XML, XPath, XSLT, and the structure of the target dictionary. Developing this file is the most technically challenging part of getting Kirrkirr working on a new dictionary. Since the format of the DictionaryInfo file needs to be followed exactly, we suggest you include the DictionaryInfo DTD inside your DictionaryInfo file and make sure that the file is valid according to the DTD (for example, by loading it in a web browser such as Internet Explorer or Firefox). However, this isn't required. (Note: unfortunately, the current version of Kirrkirr does not support external DTDs, so if you use the DTD you really do need to include it inside your DictionaryInfo file.)
The values of elements in the specification can variously be XML element names ("tags"), XPath expressions, and regular expressions. We try to note carefully which is which. It will be useful to have some familiarity with XPath (certainly if you are trying to do something complicated). XPath is covered in most XML and XSLT books, and there are various web tutorials.
Note that for an L1->L2 dictionary, we sometimes refer below to L1 as the headword language and L2 as the gloss language.
The DictionaryInfo file should be an XML file with a header (
<?xml version="1.0"?>), a DTD, and then a
KIRRKIRR_SPEC element, inside which the rest
of the elements are defined, mainly as a list, but with some
nesting. It may be easiest to start with an existing
DictionaryInfo file, and to modify it based on the discussion
below to fit your dictionary.
DICTIONARY_ENTRY_XPATH - (This is a necessary element for the specification file.) This tells
Kirrkirr the path to follow to pick out individual entries in your
dictionary. For example, for the dictionary at right, that path is:
HEADWORD_XPATH - (This is a necessary element for the specification file.) This is the XPath to the headword of each entry. There should be one headword per entry. If multiple headwords are spelled the same way, you need to specify a UNIQUIFIER_XPATH to distinguish them. (If we let HW_TAG be the element name used for headwords, then standardly one would expect that HEADWORD_XPATH = "/" + DICTIONARY_ENTRY_XPATH + "/" + HW_TAG.) This can be an arbitrary unrestricted XPath. For example, one could have the XPath be /DICTIONARY/ENTRY/HW[@TYPE != "OMIT"] to only include in the dictionary words that aren't marked for omission in the TYPE attribute of the HW element.
UNIQUIFIER_XPATH - If all of the headwords in the dictionary are unique, this is an optional (unnecessary) element. However, if there are multiple entries in the dictionary with the same headword (i.e., through polysemy/homonymy), this is a necessary element, because there must be a way for Kirrkirr to distinguish entries. This can be either an attribute or another element. (Note that some dictionaries prefer to list all words spelled the same way under one entry, whereas others distinguish between polysemy and homonymy by having multiple senses per entry as well as multiple entries per headword. Either way works fine with Kirrkirr).
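Before writing the specification file, it can help to check whether your dictionary actually needs a UNIQUIFIER_XPATH. The following is a minimal sketch in Python, not part of Kirrkirr. It assumes a toy dictionary whose element names (DICTIONARY, ENTRY, HW, HNUM) and headwords are hypothetical, and it uses the limited XPath support of Python's standard library, where paths are written relative to the root element rather than from the document root.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Toy dictionary; the element names DICTIONARY, ENTRY, HW and HNUM are
# hypothetical stand-ins for whatever your own dictionary uses.
SAMPLE = """<DICTIONARY>
  <ENTRY><HW>pama</HW><HNUM>1</HNUM></ENTRY>
  <ENTRY><HW>pama</HW><HNUM>2</HNUM></ENTRY>
  <ENTRY><HW>word_c</HW></ENTRY>
</DICTIONARY>"""


def repeated_headwords(xml_text, entry_path="ENTRY", headword_path="HW"):
    """Return the sorted headwords that appear in more than one entry."""
    root = ET.fromstring(xml_text)
    counts = Counter(
        entry.findtext(headword_path, default="").strip()
        for entry in root.findall(entry_path)
    )
    return sorted(hw for hw, n in counts.items() if n > 1)


def entry_keys(xml_text, entry_path="ENTRY", headword_path="HW",
               uniquifier_path="HNUM"):
    """Pair each headword with its uniquifier (empty string if absent)."""
    root = ET.fromstring(xml_text)
    return [
        (entry.findtext(headword_path, default="").strip(),
         entry.findtext(uniquifier_path, default="").strip())
        for entry in root.findall(entry_path)
    ]


print(repeated_headwords(SAMPLE))   # headwords that need a uniquifier
print(entry_keys(SAMPLE))           # (headword, uniquifier) pairs
```

If the first function returns an empty list, the headwords alone are unique. Otherwise, the pairs returned by the second function should all be distinct once a suitable uniquifier is in place.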
HEADWORD_REGEXP and UNIQUIFIER_REGEXP - (These are optional elements in the specification file.) You don't need these, but if you define a HEADWORD_REGEXP (and, if you have a UNIQUIFIER_XPATH, also a UNIQUIFIER_REGEXP), then Kirrkirr will use these in searches rather than matching XPaths. Using them will give faster searching over large dictionaries, but should otherwise give exactly the same results as not defining these fields at all. See the SEARCH_REGEXPS section below for more discussion of regular expressions in Kirrkirr. Here, the capturing group of the regular expression picks out the appropriate element content.
SEARCH_REGEXPS - (This is an optional
element for the specification file.)
With no further work on your part, by utilizing the information in other
specification elements, Kirrkirr will be able to search based on
headwords, gloss words, and through entire dictionary entries. However,
people commonly would also like to do searches restricted in
various other ways, for example, to look for uses of a word in
dictionary examples. In this element one can define a list of
SEARCH_REGEXP elements.
SEARCH_REGEXP - (This is an optional
element in the optional SEARCH_REGEXPS element.)
This holds the attributes relating to one user-defined search. Each of
these elements must contain two elements: the MENU_NAME and the REGEXP.
MENU_NAME - (This is a required element in the optional SEARCH_REGEXP element.) The content of this element is an arbitrary string which will be used as the menu name for a particular restricted search in the Advanced Search panel. (In order to facilitate internationalization, please use underscores rather than spaces, if needed, in your string.)
REGEXP - (This is a required element in the optional SEARCH_REGEXP element.) This holds a regular expression used to do a restricted search. Note that Kirrkirr uses line-oriented regular expression searching, so one can only successfully define restricted searches where what one wants to match occurs on one line of the underlying XML file. The regular expression should define a single capturing group. The contents of this capturing group (the first if there is more than one) will be used as the content that is searched for a match. These regular expressions typically would match some element. Note that the angle bracket signs in the element should be escaped in the usual XML way. Some typical examples are:
<SEARCH_REGEXP>
  <MENU_NAME>Definition</MENU_NAME>
  <REGEXP>&lt;DEF&gt;(.*)&lt;/DEF&gt;</REGEXP>
</SEARCH_REGEXP>
<SEARCH_REGEXP>
  <MENU_NAME>Warlpiri example</MENU_NAME>
  <REGEXP>&lt;WE[^&gt;]*&gt;(.*)&lt;/WE&gt;</REGEXP>
</SEARCH_REGEXP>
The first form will work for matching an element assuming it has no attributes and there is only ever one case of this element on a line of the file. The second form allows attributes after the tag. As we all know, you can get a long way with regular expressions, but there are some things that you just can't capture this way. Kirrkirr uses the Jakarta ORO library to match text. This library supports Perl version 5.004 regular expression syntax.
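The behavior of the two forms can be tried outside Kirrkirr. The sketch below uses Python's re module as a stand-in for Jakarta ORO (an assumption; for patterns this simple the two behave alike) and shows the two example patterns as the regular expression engine sees them, with plain angle brackets. The sample lines, the TYPE and SRC attributes, and the function names are hypothetical.

```python
import re

# The two example patterns, written with plain angle brackets.
DEF_PATTERN = re.compile(r"<DEF>(.*)</DEF>")
WE_PATTERN = re.compile(r"<WE[^>]*>(.*)</WE>")


def searchable_text(pattern, xml_line):
    """Return the content of the first capturing group, or None if no match."""
    match = pattern.search(xml_line)
    return match.group(1) if match else None


def restricted_search(pattern, query, xml_lines):
    """Return the lines whose captured content contains the query string."""
    hits = []
    for line in xml_lines:
        content = searchable_text(pattern, line)
        if content is not None and query in content:
            hits.append(line)
    return hits


# Hypothetical lines of a dictionary file, one string per line of the file.
LINES = [
    "<DEF>dog</DEF>",
    '<DEF TYPE="short">dog</DEF>',               # attribute: first form fails
    '<WE SRC="text1">first example sentence</WE>',
    "<WE>second example sentence, split",        # element spans two lines,
    "over two lines</WE>",                       # so neither line matches
]

print(restricted_search(DEF_PATTERN, "dog", LINES))
print(restricted_search(WE_PATTERN, "example", LINES))
```

The last two sample lines illustrate the line-oriented restriction: an element whose content is split over two lines of the file is never matched, whichever form is used.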
XSL_FILES - (This is a necessary element for the specification file.)
This element holds the group of tags which describe the XSL files to be
used for this dictionary. It must have at least one
XSL_FILE as a descendant, so Kirrkirr can display the dictionary in HTML.
This is the last required element that you need to define in a
DictionaryInfo file to get Kirrkirr to sort of work with
that dictionary. We'd suggest at first just entering this
much information and trying things out. In your
REF_HEADWORD_XPATH and REF_UNIQUIFIER_XPATH - (These are optional elements.) Some elements in a dictionary entry contain links to other headwords in the dictionary. (The link may be one of the network links (defined in LINKS below) or simply a link between words in the Formatted Entry panel, which allows the user to click on a hyperlink and get to another Formatted Entry.) The easiest case is that the entire content of this crossreferencing element uniquely determines the headword that the crossreference is to. In this case (and only this case), one need not define REF_HEADWORD_XPATH. Not setting it is equivalent to giving it the value of . in XPath -- its value is the crossreferencing node, and the entire content of that element is used for the crossreference. In many cases you need to do more, because the word to link to is actually stored as an attribute or subelement and/or a uniquifier is defined to resolve homographs. The word to link to is determined by the REF_HEADWORD_XPATH and the REF_UNIQUIFIER_XPATH (if present). These XPaths are evaluated relative to the larger element being decoded (e.g., relative to a LINK), and so they should not be a full XPath over the dictionary, but defined relative to the containing element. For instance, if one had a structure like:
<DICTIONARY><ENTRY> ... <SYNONYM><WORD>xalta</WORD><HNUM>2</HNUM><SENSE>3</SENSE></SYNONYM> ... </ENTRY></DICTIONARY>
Then the relevant parts of the DictionaryInfo would be as follows (the LINK_XPATH picks out the whole link element, and the other two XPaths are defined relative to it):
<LINK_XPATH>DICTIONARY/ENTRY/SYNONYM</LINK_XPATH>
<REF_HEADWORD_XPATH>WORD</REF_HEADWORD_XPATH>
<REF_UNIQUIFIER_XPATH>HNUM</REF_UNIQUIFIER_XPATH>
For another example of these elements, see the example under LINKS below.
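To see what "relative to the containing element" means in practice, here is a small sketch in Python. It is not how Kirrkirr itself is implemented; it uses the standard library's ElementTree, whose paths start below the root element (so the link path is written ENTRY/SYNONYM rather than DICTIONARY/ENTRY/SYNONYM), and the function name link_targets is hypothetical.

```python
import xml.etree.ElementTree as ET

# The SYNONYM structure discussed above, as a complete toy document.
SAMPLE = """<DICTIONARY>
  <ENTRY>
    <SYNONYM><WORD>xalta</WORD><HNUM>2</HNUM><SENSE>3</SENSE></SYNONYM>
  </ENTRY>
</DICTIONARY>"""


def link_targets(xml_text, link_path, ref_headword_path=".",
                 ref_uniquifier_path=None):
    """Return (headword, uniquifier) for every link element found.

    The two ref paths are evaluated relative to each link element.
    A headword path of "." means: use the whole text content of the link.
    """
    root = ET.fromstring(xml_text)
    targets = []
    for link in root.findall(link_path):
        if ref_headword_path == ".":
            headword = "".join(link.itertext()).strip()
        else:
            headword = link.findtext(ref_headword_path, default="").strip()
        uniquifier = None
        if ref_uniquifier_path is not None:
            uniquifier = link.findtext(ref_uniquifier_path)
        targets.append((headword, uniquifier))
    return targets


# With both relative paths defined, the link resolves to headword + uniquifier.
print(link_targets(SAMPLE, "ENTRY/SYNONYM", "WORD", "HNUM"))
# Without them, the whole element content is taken as the headword.
print(link_targets(SAMPLE, "ENTRY/SYNONYM"))
```

The second call shows why the default is only suitable when the link element contains nothing but the headword: here the headword, homograph number, and sense number run together into a single string.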
LINKS - (This is an optional element.) This holds the group of tags which describe the links between words in the dictionary. The links are visualized in the network panel. (Common links include synonyms, antonyms, etc). In the dictionary, each word can have any number of links of various types. Each link is described by a name and a color (for display in the network panel). Also, each link must include a way to reference the word that is being linked to (see below for examples).
LINK - (Each LINKS must have at least one LINK). This describes the attributes of one link: the name, color, and XPath used to access it for a word in the dictionary.
LINK_NAME - (Each LINK must have one LINK_NAME). This is the short name for the link, displayed on the legend in the background of the network panel. (In order to facilitate internationalization, please use underscores rather than spaces in your string.)
LINK_XPATH - (Each LINK must have one LINK_XPATH). This is the XPath to follow in the dictionary to retrieve the linked to thing for a given entry, perhaps
LINK_COLOR - (Each LINK must have one LINK_COLOR). This can be expressed either as a color name or as RGB numbers.
COLOR_NAME - (Each LINK_COLOR must have either a COLOR_NAME or a set of COLOR_R, COLOR_B and COLOR_G). This must be one of the thirteen canonical Java colors: black, darkGray, gray, lightGray, pink, red, orange, yellow, green, cyan, blue, magenta, or white. (Though white is probably a bad idea!)
COLOR_R, COLOR_B, and COLOR_G - (Each LINK_COLOR must have either a COLOR_NAME or a set of COLOR_R, COLOR_B and COLOR_G). Each element must be an integer between 0 and 255, inclusive. These represent the red, blue and green components of a color, respectively. Note that the DTD requires them to appear in the order COLOR_R, COLOR_G, COLOR_B. You can get a color wheel in Kirrkirr by going under "Options -> Preferences -> Word Colors -> RGB".
SENSE_XPATH - This is an optional element. This is the XPath that will pick out senses within a dictionary entry, assuming that dictionary entries are broken into senses. This XPath is used only in the context of looking for links and semantic domains: the program will attempt to identify links and semantic domains with senses, by synthesizing an XPath that assumes that the XML structure of a sense mirrors that of an entry in the relevant terms. That is, if /DICT/ENTRY/SYN picks out a synonym in a basic entry and /DICT/ENTRY/SENSE is the path to sense elements, then /DICT/ENTRY/SENSE/SYN must be the path to a synonym inside a sense. (This assumes that synonyms sometimes appear inside senses and sometimes not; if they always appear inside senses, one can just define the path to synonyms as /DICT/ENTRY/SENSE/SYN.) See the discussion of DOMAIN_XPATH for further discussion of SENSE_XPATH.
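The path synthesis described here can be pictured with a few lines of Python. This is only an illustration of the idea under stated assumptions, not Kirrkirr's actual code: both paths are assumed to be simple absolute paths made of element names only, and the function name synthesize_sense_xpath is hypothetical.

```python
def synthesize_sense_xpath(field_xpath, sense_xpath):
    """Rewrite an entry-level XPath so that it looks inside senses instead.

    Both arguments are assumed to be simple absolute paths of element names
    (no predicates, no attributes), sharing the same leading entry path.
    """
    field_steps = field_xpath.strip("/").split("/")
    sense_steps = sense_xpath.strip("/").split("/")
    # The shared entry path is the longest common leading run of steps.
    shared = 0
    while (shared < min(len(field_steps), len(sense_steps))
           and field_steps[shared] == sense_steps[shared]):
        shared += 1
    if shared == 0:
        raise ValueError("paths do not share a common root")
    # Whatever follows the shared part of the field path is re-attached
    # underneath the sense element.
    remainder = field_steps[shared:]
    return "/" + "/".join(sense_steps + remainder)


print(synthesize_sense_xpath("/DICT/ENTRY/SYN", "/DICT/ENTRY/SENSE"))
```

An XPath with predicates or attribute steps would not survive this kind of rewriting, which is one reason to keep the paths involved simple when senses are in play.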
AUDIO_XPATH - This XPath may match multiple times in an entry, but at present Kirrkirr only supports one sound file per entry.
DOMAIN_XPATH and DOMAIN_COMPONENT_XPATH - (This is an optional element.) DOMAIN_XPATH is an absolute XPath from the document root which specifies the path that will extract the (one or more) semantic domain descriptions for a word. For example,
The contents of this element must either be #PCDATA or else must contain one or more daughter elements. In the first case the content is interpreted as the domain name. In the second the specification file should also define a DOMAIN_COMPONENT_XPATH, which is a relative XPath:
The sequence of matches of this XPath are interpreted as domain, subdomain, subsubdomain, etc. (or alternatively as a sequence of implicitly ordered meaning components in a componential analysis), and any other elements inside DOMAIN_XPATH are ignored.
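The two cases (plain text content versus component elements) can be sketched as follows. This is an illustrative Python example rather than Kirrkirr's own logic; the element names DOMAIN, DMI, and NOTE and the domain values are hypothetical, and the paths are written relative to the root element because that is what Python's ElementTree expects.

```python
import xml.etree.ElementTree as ET

# Case 1: the domain element holds plain text.
FLAT = """<DICTIONARY>
  <ENTRY><HW>word_a</HW><DOMAIN>fauna</DOMAIN></ENTRY>
</DICTIONARY>"""

# Case 2: the domain element holds component elements (plus one other
# element, NOTE, which is ignored).
NESTED = """<DICTIONARY>
  <ENTRY>
    <HW>word_b</HW>
    <DOMAIN><DMI>fauna</DMI><NOTE>checked</NOTE><DMI>birds</DMI></DOMAIN>
  </ENTRY>
</DICTIONARY>"""


def domain_chains(xml_text, domain_path, component_path=None):
    """Return one list per domain element: [domain, subdomain, ...]."""
    root = ET.fromstring(xml_text)
    chains = []
    for domain in root.findall(domain_path):
        if component_path is None:
            # Text-only content: the content itself is the domain name.
            chains.append([(domain.text or "").strip()])
        else:
            # Component elements, in document order; anything else is skipped.
            chains.append([(part.text or "").strip()
                           for part in domain.findall(component_path)])
    return chains


print(domain_chains(FLAT, "ENTRY/DOMAIN"))
print(domain_chains(NESTED, "ENTRY/DOMAIN", "DMI"))
```

In the second call the matches are read in document order, so the first DMI is the domain and the second is its subdomain.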
Words with multiple semantic domains: Kirrkirr assumes that each
sense of a word belongs to a unique semantic domain. To
have Kirrkirr correctly work with words that list multiple
semantic domains, each must be regarded as a separate
sense. One possibility is that the dictionary entry is
divided into senses, and then one can show that division
with the SENSE_XPATH.
If the SENSE_XPATH element is
defined, then Kirrkirr will also look for domain information
within senses by synthesizing an XPath based on the one
given here and the SENSE_XPATH. This will work correctly
only if both of these are simple straightforward absolute
XPaths accessing everything implicitly via the
SUBWORD_XPATH - This is an optional element. The SUBWORD_XPATH is evaluated as an XPath on an entry (from the dictionary root). If this path has a value, then the word is assumed to be a subword (i.e., a dictionary subentry). What that means is that for a main entry, you should ensure that the given XPath has no value. That involves a little backwards thinking, so let's go through a few cases of what you should do.
REVERSE_DICTIONARY_XPATH - This is an optional element. This is a full XPath from the dictionary root that specifies the field or fields which should be used to build gloss language entries for the gloss to headword index. Ideally, it is a field that contains glosses (as opposed to full sentence definitions). For instance, if a dictionary entry has a field like:
<ENTRY> ...
Then the appropriate element would be:
<REVERSE_DICTIONARY_XPATH>/DICTIONARY/ENTRY/GL/GLI</REVERSE_DICTIONARY_XPATH>
(Although all the glosses shown above are single words, multiword glosses are quite acceptable!) If your dictionary has no such gloss field, then all is not lost. You should give an XPath to whatever field gives the definition (or one word translations in a word list). If this field has full sentence definitions, you will certainly want to define quite a lot of STOPWORDS. The end results will not be as good, but will still be quite usable.
GLOSS_XPATH - This is an optional element. This is a full XPath from the dictionary root that specifies the field or fields which should be used to give glosses of words, such as are shown in the Network pane. Ideally, it is a field that contains glosses. If there is no such field, it should be the field with definitions, and a more generous set of stopwords should be supplied.
FUZZY_SPELLING_SUBSTITUTIONS - This is an optional element. Its descendants are HEADWORD_LANGUAGE_SUBS and/or GLOSS_LANGUAGE_SUBS. Both of these in turn contain a list of SUBSTITUTION elements, each of which consists of an ORIGINAL_REGEXP element followed by a REPLACEMENT_REGEXP element. These are used to implement fuzzy matching in both the basic search box and in advanced search.
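This document does not spell out how Kirrkirr applies the substitution pairs, so the following Python sketch shows just one plausible reading: each ORIGINAL_REGEXP / REPLACEMENT_REGEXP pair is applied in turn to both the query and the candidate word, and the normalized forms are compared. The substitution rules, function names, and test words are all hypothetical, and Python's replacement syntax (\1) is used here rather than Perl's.

```python
import re

# Hypothetical (ORIGINAL_REGEXP, REPLACEMENT_REGEXP) pairs.
SUBSTITUTIONS = [
    (r"rr", "r"),              # treat a doubled r like a single r
    (r"([aiu])\1", r"\1"),     # treat a doubled vowel like a single vowel
]


def normalize(word, substitutions):
    """Apply every substitution, in order, to the word."""
    for original, replacement in substitutions:
        word = re.sub(original, replacement, word)
    return word


def fuzzy_equal(query, candidate, substitutions):
    """Compare two words after normalizing both with the same rules."""
    return (normalize(query.lower(), substitutions)
            == normalize(candidate.lower(), substitutions))


print(normalize("kirrkirr", SUBSTITUTIONS))
print(fuzzy_equal("Kirkir", "kirrkirr", SUBSTITUTIONS))
```

Under this reading, the order of the SUBSTITUTION elements matters, because a later rule sees the output of the earlier ones.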
STOPWORDS - This is an optional element. Each child is a STOPWORD element. Each STOPWORD is removed, if present, from the words in the gloss of each headword, in the order given in this file, unless there is only one word left in the gloss, which is then retained. Only words that survive this stopwords filter become main entries in the gloss list. Note: because a stopword is only removed if there are other words present in the gloss, it is in general safe and a good idea to list lots of words in the stopwords list, especially if the gloss field you are using has lengthy glosses/definitions. Putting a word like your on the stopword list won't negatively affect a word which is glossed simply as "your", since the last gloss word is always retained. Also, since words are removed in the order listed, it is best to put the most common 'real' stopwords like the and of earliest in the list.
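The filtering rule can be written down compactly. The sketch below is a Python illustration of the rule as described above, not Kirrkirr's actual code; it assumes the gloss has already been split into lower-case words, and the function name filter_gloss_words is hypothetical.

```python
def filter_gloss_words(gloss_words, stopwords):
    """Remove stopwords in list order, but never remove the last word."""
    words = list(gloss_words)
    for stopword in stopwords:
        # Remove every occurrence of this stopword, stopping as soon as
        # only one word would remain in the gloss.
        while stopword in words and len(words) > 1:
            words.remove(stopword)
    return words


STOPWORDS = ["the", "of", "your"]

print(filter_gloss_words(["the", "name", "of", "your", "dog"], STOPWORDS))
print(filter_gloss_words(["your"], STOPWORDS))
print(filter_gloss_words(["of", "the"], STOPWORDS))
```

The last call shows why the order of the list matters: with "the" listed first, a gloss consisting only of "of the" is reduced to "of", whereas listing "of" first would leave "the".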
STOPCHARS - This is an optional element. Each child is a STOPCHAR element. Each STOPCHAR element is stripped from strings (with a few special rules for apostrophe (')) before glosses are separated into words for the reverse index.
DICTERRORS - This is an optional element. Each child is a DICTERROR element. Each DICTERROR is matched (string equality after removing leading and trailing whitespace) against both headwords and gloss items during index construction. If something matches, it is deleted from the headword or gloss index. In theory this element should never be used. If something is to be omitted or sometimes omitted from a dictionary, then it should be appropriately marked in the XML markup, and then the necessary conditions could be specified on either the HEADWORD_XPATH or the GLOSS_XPATH. However, in practice, the world isn't perfect, and most of our dictionaries mark a few special cases for omission with this element.
This is the current KIRRKIRR_SPEC DTD.
<!DOCTYPE KIRRKIRR_SPEC [
<!ELEMENT KIRRKIRR_SPEC (DICTIONARY_ENTRY_XPATH, HEADWORD_XPATH, UNIQUIFIER_XPATH?,
    HEADWORD_REGEXP?, UNIQUIFIER_REGEXP?, SEARCH_REGEXPS?, XSL_FILES,
    REF_HEADWORD_XPATH?, REF_UNIQUIFIER_XPATH?, LINKS?, SENSE_XPATH?,
    IMAGES_XPATH?, AUDIO_XPATH?, DOMAIN_XPATH?, DOMAIN_COMPONENT_XPATH?,
    FREQUENCY_XPATH?, SUBWORD_XPATH?, CUSTOM_FIELDS?, DIALECT_XPATH?,
    REGISTER_XPATH?, REVERSE_DICTIONARY_XPATH?, GLOSS_XPATH?, POS_XPATH?,
    FUZZY_SPELLING_SUBSTITUTIONS?, STOPWORDS?, STOPCHARS?, DICTERRORS?,
    GLOSS_LANGUAGE_NAME?, DICTIONARY_LANGUAGE_NAME?, GLOSS_LANGUAGE_ICON?,
    DICTIONARY_LANGUAGE_ICON?)>
<!ELEMENT ENTRY_TAG (#PCDATA)>
<!ELEMENT DICTIONARY_ENTRY_XPATH (#PCDATA)>
<!ELEMENT HEADWORD_XPATH (#PCDATA)>
<!ELEMENT UNIQUIFIER_XPATH (#PCDATA)>
<!ELEMENT HEADWORD_REGEXP (#PCDATA)>
<!ELEMENT UNIQUIFIER_REGEXP (#PCDATA)>
<!ELEMENT SEARCH_REGEXPS (SEARCH_REGEXP)*>
<!ELEMENT SEARCH_REGEXP (MENU_NAME, REGEXP)>
<!ELEMENT MENU_NAME (#PCDATA)>
<!ELEMENT REGEXP (#PCDATA)>
<!ELEMENT XSL_FILES (XSL_FILE)*>
<!ELEMENT REF_HEADWORD_XPATH (#PCDATA)>
<!ELEMENT REF_UNIQUIFIER_XPATH (#PCDATA)>
<!ELEMENT XSL_FILE (XSL_FILENAME, SHORT_NAME, XSL_DESCRIPTION)>
<!ELEMENT LINKS ((LINK)*)>
<!ELEMENT LINK (LINK_NAME, LINK_COLOR, LINK_XPATH)>
<!ELEMENT LINK_NAME (#PCDATA)>
<!ELEMENT LINK_XPATH (#PCDATA)>
<!ELEMENT LINK_COLOR ((COLOR_NAME) | (COLOR_R, COLOR_G, COLOR_B))>
<!ELEMENT COLOR_NAME (#PCDATA)>
<!ELEMENT COLOR_R (#PCDATA)>
<!ELEMENT COLOR_B (#PCDATA)>
<!ELEMENT COLOR_G (#PCDATA)>
<!ELEMENT XSL_FILENAME (#PCDATA)>
<!ELEMENT SHORT_NAME (#PCDATA)>
<!ELEMENT XSL_DESCRIPTION (#PCDATA)>
<!ELEMENT SENSE_XPATH (#PCDATA)>
<!ELEMENT IMAGES_XPATH (#PCDATA)>
<!ELEMENT AUDIO_XPATH (#PCDATA)>
<!ELEMENT DOMAIN_XPATH (#PCDATA)>
<!ELEMENT DOMAIN_COMPONENT_XPATH (#PCDATA)>
<!ELEMENT FREQUENCY_XPATH (#PCDATA)>
<!ELEMENT SUBWORD_XPATH (#PCDATA)>
<!ELEMENT DIALECT_XPATH (#PCDATA)>
<!ELEMENT REGISTER_XPATH (#PCDATA)>
<!ELEMENT POS_XPATH (#PCDATA)>
<!ELEMENT REVERSE_DICTIONARY_XPATH (#PCDATA)>
<!ELEMENT GLOSS_XPATH (#PCDATA)>
<!ELEMENT CUSTOM_FIELDS (CUSTOM_FIELD)*>
<!ELEMENT CUSTOM_FIELD (FIELD_NAME, FIELD_XPATH)>
<!ELEMENT FIELD_NAME (#PCDATA)>
<!ELEMENT FIELD_XPATH (#PCDATA)>
<!ELEMENT FUZZY_SPELLING_SUBSTITUTIONS (HEADWORD_LANGUAGE_SUBS?, GLOSS_LANGUAGE_SUBS?)>
<!ELEMENT HEADWORD_LANGUAGE_SUBS (SUBSTITUTION)*>
<!ELEMENT GLOSS_LANGUAGE_SUBS (SUBSTITUTION)*>
<!ELEMENT SUBSTITUTION (ORIGINAL_REGEXP, REPLACEMENT_REGEXP)>
<!ELEMENT ORIGINAL_REGEXP (#PCDATA)>
<!ELEMENT REPLACEMENT_REGEXP (#PCDATA)>
<!ELEMENT STOPWORDS (STOPWORD)*>
<!ELEMENT STOPWORD (#PCDATA)>
<!ELEMENT STOPCHARS (STOPCHAR)*>
<!ELEMENT STOPCHAR (#PCDATA)>
<!ELEMENT DICTERRORS (DICTERROR)*>
<!ELEMENT DICTERROR (#PCDATA)>
<!ELEMENT GLOSS_LANGUAGE_NAME (#PCDATA)>
<!ELEMENT DICTIONARY_LANGUAGE_NAME (#PCDATA)>
<!ELEMENT GLOSS_LANGUAGE_ICON (#PCDATA)>
<!ELEMENT DICTIONARY_LANGUAGE_ICON (#PCDATA)>
]>
You can download the Warlpiri dictionary specification file. A copy is also included with the web download of Kirrkirr as part of the MiniWrl dictionary.
Homophones are an issue that requires care to work correctly in Kirrkirr, but Kirrkirr is designed to support them. If your dictionary has no homophones (that is, if every headword of the dictionary is different), then you can ignore this section. But many dictionaries do have repeated headwords (often distinguished by superscript numerals in print dictionaries), and this section goes through how to get such a dictionary working flawlessly in Kirrkirr.
Essentially, Kirrkirr requires a unique key to refer to each dictionary entry. By default it uses just the headword. If there are homophones, however, then although they are all listed in the word list, the key is the same for multiple words, and Kirrkirr will always display just one of them in the various panels, which definitely isn't what you want.
To fix this, one needs to define an XML element or attribute that will distinguish homophones. This field is the equivalent of the superscript numeral of the print dictionary. Indeed, numbering homophones is the simplest and best thing to do in XML as well.
Then, to have homophones work as headwords you need to specify this element or attribute in the DictionaryInfo file, perhaps as:
If you've also defined a HEADWORD_REGEXP (which
speeds up searches, but isn't required), then you should also similarly
define a UNIQUIFIER_REGEXP. The inside bit is really the regular expression
<hm>([^<]+)</hm> but the
angle brackets need to be quoted for XML.
Finally, there is then the issue of links to words that are homophones. If you just have:
and there are two homophones
pama, then that's again not going to
uniquely identify one, and so the link will not be live in places like
the Network display. Kirrkirr lets the XML define a
unique key for the word to match (which can even
differ from the displayed text for the link) via the
REF_HEADWORD_XPATH and REF_UNIQUIFIER_XPATH elements.
For instance, if the contents of the
synonym element were just the
headword, you could define just a uniquifier, with an XML attribute of
and then in the DictionaryInfo file you would simply define:
With all that done, everything should work fine! (However, since homophones are usually few (but important!) words, you may want to get the dictionary working in Kirrkirr with non-homophones, before trying to get these details right.)
Proceed to Building the Kirrkirr indices