Getting the dictionary into a suitable XML format
It's hard to give general instructions for getting the dictionary
into a suitable format. One possibility is that you already have your
dictionary in XML format, in which case you might need to do nothing.
However, at least as commonly, people have dictionaries in (SIL's)
Shoebox or some similar "backslash" format, or in yet another format
altogether. Often people have maintained dictionaries in text editors
or word processors, and there are then usually numerous errors in
formatting and consistency. Kirrkirr requires a correct
("well-formed") XML file, and will make inconsistencies in your data
formatting glaringly obvious!
Here are some more details on what's required for the dictionary
format. This section assumes a basic knowledge of XML. If you don't
know what the terms used here mean, you should look at an XML
introduction. There are now numerous XML books, and you can find
various introductions at Robin Cover's XML pages.
(Don't try to read the official XML specifications though!
They're impenetrable even to Kirrkirr's author. Find a simple
introduction. You only need to know the basics.)
- Your XML dictionary must be well-formed (syntactically correct
XML). If there are any errors at all in the XML (missing brackets or quote
marks or things like that), you will be completely unable to use the
dictionary within Kirrkirr.
- You do not need to have a DTD, and indeed it's okay if there's some
inconsistency in entry structure, though there must be enough
consistency that you can specify things that Kirrkirr needs. But it's
okay for the data to be truly "semi-structured" in the sense
that occasional entries might define extra fields in an only
semi-consistent way.
- Related to the above point, there isn't a single XML
structure that Kirrkirr works with. Kirrkirr can be made to work with
basically any XML dictionary (with a few restrictions noted below).
Kirrkirr uses the idea of a mediator, which says how to get
semantic notions that Kirrkirr needs out of another XML file. The
cost of this flexibility is that you have to provide a DictionaryInfo
XML file specifying how to find information in your dictionary. One way
to avoid creating this file is to copy the structure of an existing Kirrkirr
dictionary which already has a DictionaryInfo file. Here are the
restrictions on XML formats for Kirrkirr:
- Kirrkirr assumes that a dictionary is a list of entries. There
may be front matter or end matter, and there may be headers
corresponding to letters etc., but the essential structure of the
dictionary must be as a list of entries. That is, the top
level XML element must contain all the dictionary entries as
a flat list (perhaps interspersed with other elements).
This is most likely already true
of any "document-oriented" XML format -- certainly if you have
made an XML dictionary by exporting one from Shoebox. Kirrkirr
probably can't
handle a dictionary done as "standoff annotation" XML, or, say, if you
write out relational database tables containing a dictionary as XML.
- Kirrkirr mainly uses XPath to access the dictionary, but uses
line-based regular expression searching in the advanced search
capability. This means that line breaks will matter for matching in
searches, and lines of the file should correspond to natural elements
of the document. Kirrkirr would work badly either if there are no
line breaks in the file, or, say, one after every word. Ideally,
complete examples or definitions will be on one line.
- Kirrkirr builds a unique key to refer to every entry. In the
simplest case, this will just be the headword. However, many
dictionaries have multiple headwords that are the identical
orthographic string (homographs). If this occurs in your dictionary,
you must provide something (an XML attribute or another XML element)
that can be used to distinguish the homographs so that they can be uniquely
referred to. Kirrkirr will use this additional element (in a
multipart key) to reference words, and you will need to supply it in
any cross-reference to such words. (If this all seems a lot of trouble,
just make sure all the headwords are unique by always using multiple
senses for words with multiple meanings rather than allowing
identical headwords. However, this isn't usual dictionary-making
practice.)
- Notwithstanding the above, people sometimes wonder about what a
suitable DTD for Kirrkirr is. Sometimes they're wondering particularly
about the DTD used with the Warlpiri dictionary. In case you're
interested, you can download the DTD for the Warlpiri
dictionary. But note that the other sample dictionaries provided with
Kirrkirr 4.0 use XML files with quite different structures, and do not
provide an explicit DTD.
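The unique-key restriction above can be checked mechanically before you
load a dictionary into Kirrkirr. Here is a minimal Python sketch of such
a check. It is not part of Kirrkirr, and the element and attribute names
(DICTIONARY, ENTRY, HW, hnum) and the sample entries are invented for
illustration; substitute whatever your own dictionary uses.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample: a flat list of entries under one top-level element.
# "hnum" is a hypothetical attribute that tells homographs apart.
SAMPLE = """<DICTIONARY>
<ENTRY><HW hnum="1">bank</HW><DEF>edge of a river</DEF></ENTRY>
<ENTRY><HW hnum="2">bank</HW><DEF>financial institution</DEF></ENTRY>
<ENTRY><HW>bat</HW><DEF>flying mammal</DEF></ENTRY>
</DICTIONARY>"""


def duplicate_keys(root, entry_tag, headword_tag, homograph_attr=None):
    """Return the (headword, homograph id) keys shared by several entries.

    Only direct children of the top-level element are examined, which
    mirrors the "flat list of entries" restriction.
    """
    counts = Counter()
    for entry in root.findall(entry_tag):
        headword = entry.find(headword_tag)
        if headword is None:
            continue  # entry without a headword: nothing to key on
        text = (headword.text or "").strip()
        # Use "" when there is no distinguishing attribute, so keys sort cleanly.
        hom_id = headword.get(homograph_attr, "") if homograph_attr else ""
        counts[(text, hom_id)] += 1
    return sorted(key for key, n in counts.items() if n > 1)


root = ET.fromstring(SAMPLE)
# Headword alone is not a unique key in this sample ...
print(duplicate_keys(root, "ENTRY", "HW"))
# ... but headword plus the homograph attribute is.
print(duplicate_keys(root, "ENTRY", "HW", "hnum"))
```

If the second call still reports duplicates on your data, those are the
entries that need a distinguishing field added.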
If your dictionary isn't currently in XML, then you need to convert
it to XML. In practice, we've usually done this using scripts or
programs, written in Perl, Python, or Java. One could do it by hand,
but it would get very tedious for a large dictionary. There are many
XML editor programs that you could use (see the XML resource page
above). A particularly common case is that you have a Shoebox
(a.k.a. FOSF (Field Ordered Standard Format) or SFM (Standard Format
Markers)) dictionary. In that case you can follow these instructions.
You need Shoebox version 5 or Toolbox for this to work.
- Start with an SFM dictionary. Make sure you can load it in Shoebox
5 or Toolbox, if you haven't regularly maintained it with these tools.
- Shoebox backslash codes can start with things such as numbers, but
XML element names must start with a letter (or an underscore). Unfortunately, though,
the Shoebox XML export doesn't ensure this and one can end up with
ill-formed XML from a Shoebox/Toolbox export. If you have any tags that do
not start with letters, change them so that they do. You can do this
before the XML export, or after the XML export by doing
search-and-replace on the XML file. Update: I just tested this
again with Toolbox 1.5.1 of Feb 2007, and this is still true. The
Shoebox XML export does not guarantee well-formed, let alone valid,
XML output. Of course, if you stick to MDF markers, which do all
start with letters, then this problem won't bite you.
- Shoebox fields may contain non-ASCII, non-Unicode characters (such
as Windows CP-1252 characters like dashes and "smart" quotes), and
Shoebox will export these into the XML file. These should be replaced
either before or after the export. The XML file should either contain
only US-ASCII characters (using numeric character entities for other
characters) or be UTF-8 encoded Unicode. If there are just a few stray
"smart" quotes or similar, then this can easily be done by search and
replace; if some fields are systematically encoded in some other
character set, you probably need some software to do this. If the whole
file is in another encoding, then you can use any general-purpose
character encoding conversion software, but if only certain fields are
in a different encoding, then what you really need is SIL's TECkit
software (use the sfconv component).
- In Shoebox, define grouping information for what should be related or
subsidiary fields. You should say what the "record divider" is
(normally the tag for the headword), and then say which fields should go
under other fields. For example, translations should normally go under
examples, and if there are subsenses, various fields should be placed
under them. This structure will be exported into the XML file. If
you've built up your dictionary in Shoebox, you've probably
already done this, certainly if you followed the manual, but if
you are freshly working with a "backslash" dictionary in
Shoebox, then you will have to do this now.
- Use the Shoebox menu item File | Export | XML Basic. In the dialog
box, accept the default options and have it export the entire file.
- If there are cases of multiple identical headwords (homographs), you
need to provide something to make them unique. If you want to have
links to these words, this is best done by hand by providing a field
with this information (this can just be a number used to distinguish
each homograph, akin to the superscript letters or numbers that
are commonly used for this purpose in printed dictionaries).
However, if everything else is correct, you can technically fulfil this
requirement by using the Unique Headwords tool inside Kirrkirr on an
XML dictionary for which you have a DictionaryInfo file.
- If you have picture or sound files, you need them linked to the
dictionary headwords. What you want is to have some field whose value
is the name of a picture or sound file. This may already be the case in
your Shoebox dictionary. Otherwise, you might need to add these links.
We've used a tool, Picture Adder, to do this. The picture or sound file
names should just be file names and should not contain folder names or
paths, since Kirrkirr will always look for sound or picture files in a
particular directory for a particular language (the audio and images
folders that we discussed creating previously).
- After doing the XML export, you may have to fix up the treatment of
tags within fields. The Shoebox manual recommends using
<XX>...</XX>-style XML tags in the value of a field if you want to
denote structure in a field entry, but the XML export doesn't support
this and indeed mangles such annotations by simply escaping the angle
brackets in the export. If you had any such nested tags, then they
need to be reconstructed so that tags within fields are real tags,
rather than the escaped &lt; and &gt; codes that Shoebox exported. In
the past, we've done this with a regular expression replace operation
in a text editor. Most plain text editors support such regular
expression replace operations.
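Two of the post-export repairs in the list above (renaming tags that
start with a digit, and restoring escaped tags inside fields) are simple
regular expression replacements. If you would rather script them than do
them in a text editor, something like the following Python sketch might
serve as a starting point. The "n" prefix, the function names, and the
sample tag names (1s, ge, XX) are assumptions made for illustration,
and the sketch assumes the export escaped the brackets as &lt; and &gt;.

```python
import re


def prefix_digit_tags(xml_text, prefix="n"):
    """Rename tags whose names start with a digit, e.g. <1s> becomes <n1s>.

    A literal "<" in exported text can only open a tag (in content it is
    escaped), so "<" or "</" directly followed by a digit marks a bad name.
    """
    return re.sub(r"<(/?)(?=\d)",
                  lambda m: "<" + m.group(1) + prefix,
                  xml_text)


def restore_inline_tags(xml_text, tag_names):
    """Turn escaped &lt;XX&gt; ... &lt;/XX&gt; back into real tags.

    Only the listed tag names are restored, so a genuine escaped "<" in
    running text (as in "x &lt; y") is left alone.
    """
    names = "|".join(re.escape(name) for name in tag_names)
    pattern = re.compile(r"&lt;(/?)(" + names + r")&gt;")
    return pattern.sub(r"<\1\2>", xml_text)


exported = "<1s><ge>see &lt;XX&gt;word&lt;/XX&gt; if x &lt; y</ge></1s>"
repaired = restore_inline_tags(prefix_digit_tags(exported), ["XX"])
print(repaired)
```

Listing the inline tag names explicitly is a deliberate choice: a blanket
replacement of every &lt; would also turn legitimate "less than" signs
into broken markup. Check the result for well-formedness afterwards.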
Before going any further, check that your dictionary XML file is
well-formed XML.
You can verify whether or not you have a well-formed XML file by
loading it into any piece of software that contains an XML parser.
If it won't go through an XML parser, it won't work with Kirrkirr.
Kirrkirr itself includes an XML parser, but it is easier at this stage
to simply load the dictionary into
a Web browser, such as Internet Explorer or Mozilla. (Note: checking by
loading it in a browser will only work if the file doesn't define an
XSLT stylesheet, or if it does and it is available.) For example, here
is a well-formed, if very small, XML dictionary destined for Kirrkirr:
tinydict.xml. Try loading it in your web browser. It should be
well-formed. Here's a dictionary with an error in it: tinybung.xml.
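If you prefer a script to a browser, any XML parser will do the same
job. The following is a minimal sketch using the parser in Python's
standard library; the function name and the two tiny sample documents
are made up for illustration (they are not the contents of tinydict.xml
or tinybung.xml).

```python
import xml.etree.ElementTree as ET


def check_well_formed(data):
    """Return None if data parses as XML, else a message locating the error.

    data should be the raw bytes of the file, so that the parser can
    honour any encoding named in the XML declaration.
    """
    try:
        ET.fromstring(data)
    except ET.ParseError as err:
        line, column = err.position  # where the parser gave up
        return f"line {line}, column {column}: {err}"
    return None


# Invented samples: the second one closes ENTRY while HW is still open.
GOOD = b"<DICTIONARY><ENTRY><HW>bat</HW></ENTRY></DICTIONARY>"
BAD = b"<DICTIONARY><ENTRY><HW>bat</ENTRY></DICTIONARY>"

for sample in (GOOD, BAD):
    problem = check_well_formed(sample)
    print("well-formed" if problem is None else "NOT well-formed, " + problem)
```

To check a real dictionary, read the file in binary mode
(open(path, "rb").read()) and pass the result to check_well_formed.
For a very large dictionary, parsing the file incrementally rather than
reading it all into memory may be preferable.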
Character encoding
An XML file can specify its character encoding in the XML header (the
XML declaration on the first line). If it doesn't specify a character
encoding, the character encoding is assumed to be UTF-8. A restriction
of Kirrkirr at present is that it does not work correctly with all
character encodings. The dictionary needs to either be in US-ASCII
(with any non-ASCII characters indicated as numeric character entities
such as &#237;) or in UTF-8. Also, note that any character entities
must be numeric character entities or the built-in ones like &lt;.
Because of the way Kirrkirr cuts dictionaries up into individual
entries, any character entities defined in the DTD are currently lost.
We hope to better address some of these issues in a future version.
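As an illustration of the US-ASCII option, here is a small Python sketch
that re-encodes a file's bytes so that every non-ASCII character becomes
a numeric character entity. It assumes the whole file is in a single
known encoding (Windows CP-1252 is used as the default only because it
is the case mentioned earlier); the function name is hypothetical. It
does not help when only some fields use a different encoding.

```python
def to_ascii_xml(raw, source_encoding="cp1252"):
    """Re-encode XML bytes as US-ASCII with numeric character entities.

    Decoding raises UnicodeDecodeError if the bytes are not valid in
    source_encoding, which is a useful warning that the guess was wrong.
    """
    text = raw.decode(source_encoding)
    # "xmlcharrefreplace" writes each non-ASCII character as &#NNN;
    return text.encode("ascii", errors="xmlcharrefreplace")


# CP-1252 bytes: 0x93/0x94 are "smart" quotes, 0xED is i-acute.
sample = b"<ge>\x93s\xed\x94</ge>"
print(to_ascii_xml(sample))
```

Since US-ASCII is a subset of UTF-8, the output satisfies either of the
two options above. If the original file has an XML declaration naming
the old encoding, remember to change or remove that encoding name after
converting.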
Proceed to Building an XSLT stylesheet
http://nlp.stanford.edu/kirrkirr/dictionaries/xmldictionary.html
Christopher Manning
-- <manning@cs.stanford.edu>
--
Last modified: Sat Jul 14 23:32:35 PDT 2007