Kevin Jansz [HREF1], Department of Linguistics, University of Sydney, Australia. kjansz@cs.usyd.edu.au
Wee Jim Sng [HREF2], School of Applied Science, Nanyang Technological University, Singapore. jimemail@singnet.com.sg
Nitin Indurkhya [HREF3], School of Applied Science, Nanyang Technological University, Singapore. nitin@cs.usyd.edu.au
Christopher Manning [HREF4], Departments of Computer Science and Linguistics, Stanford University, USA. manning@cs.stanford.edu
XML is highly suited to representing richly structured information such as dictionary content, and conversely this well-structured storage of the information enables innovative browsing interfaces. We demonstrated this previously with the development of Kirrkirr, a web-based application that allows users to interactively explore a Warlpiri (a Central Australian language) dictionary in XML format. Two key design issues are customised presentation and efficient access. The greater the level of customisation, the broader the range of users Kirrkirr can accommodate. Efficient access is important if the application is to scale up to larger, more complex dictionaries. In this paper, we discuss these two issues and describe the use of XSL and XQL to further enhance Kirrkirr. While Kirrkirr already provides a number of interfaces to the lexical information, the challenge of creating new features and providing even greater flexibility lies in allowing the user to access certain parts of the XML database without the overhead of greater memory and time usage. By enhancing the indexing techniques of the original Kirrkirr, users may use XSL to personalise the information they access. Emerging technologies such as XQL offer the potential for efficient access not only to dictionary entries but also to the fields within them. Performance evaluation suggests that the use of XSL and XQL has had a very significant impact on Kirrkirr. The results can easily be seen to apply to a broad range of similar applications.
A language is more than individual words with a definition. It is a vast network of associations between words and within and across the concepts represented by words. The work described here is part of a broader project that has the general aim of providing people with a better understanding of this conceptual map. In particular, traditional paper dictionaries offer very limited ways of making such networks of meaning visible, whereas, on a computer, there are, in theory, no such limitations on the way information can be displayed. However, while dictionaries on computers are now commonplace, there has been little attempt to utilise the potential of the new medium. Most existing electronic dictionaries, whether on the web or on CD-ROM, present a plain, search-oriented representation of the paper version. In contrast, our goal has been to build fun dictionary tools that are effective for browsing and incidental language learning. Minimally, they should be as effective for browsing as the process of flicking through pages of a paper dictionary, but beyond that we aim to use the computer to provide new means of effective browsing.
For instance, we can show words grouped by synonyms (as in a thesaurus), antonyms, and many other types of linguistic categories rather than merely spelling. The key to the effectiveness of this type of browsing is that the user has control over the level of complexity in the relationships that are presented to them.
The initial focus of our project was Warlpiri, an Australian Aboriginal language spoken in the Tanami desert northwest of Alice Springs. There were a number of factors influencing this choice:
Details of the first version of Kirrkirr, our Warlpiri dictionary browser, have been described before (Jansz 1998; Jansz, Manning and Indurkhya 1999a; Jansz, Manning and Indurkhya 1999b). The ideas in the design are quite general and applicable to other dictionaries as well. As an environment for the interactive exploration of dictionaries, it attempts to more fully utilise graphical interfaces, hypertext, multimedia, and different ways of indexing and accessing information.
Written in Java, it can either be run over the web (high bandwidth) or run locally (here Java's main advantage is cross-platform support). As shown in Figure 1, it is currently made up of five main modules:
Figure 1: Kirrkirr
As reflected in the system modules, this application is unlike any other e-dictionary in that it was designed to cater for the needs of Warlpiri speakers with various levels of competence. Features such as the searching facility allow information to be accessed easily and quickly, while the incorporation of animation and sounds makes the dictionary usable by speakers with little or no background in the language.
Storing the lexical database in an XML formatted file is an effective middle ground between the structure and built-in querying of a relational database and the flexibility and portability of a plain text document. The strengths of this approach were best appreciated in the development of the Kirrkirr dictionary browser.
Initial testing showed that if the program simply read in the entire XML file (about 10Mb of text) and stored it as parsed data structures within memory, memory usage was excessively high. A simple solution was to create an index file that contained the words, information about cross-references for the graphical display, and the corresponding file position of each entry in the XML file. As a result, only the index need reside in memory, and of the 9300 entries in the dictionary, only those requested by the user will be read in and processed. The use of an index resulted in a significant performance improvement. This approach worked well with the XML parser being used (Microstar's Ælfred parser). Although the parser (like other XML parsers of which we are aware) was built to parse an entire file at a time, it was relatively easy to adapt the code to allow processing of just one entry. The use of an index file was also well suited to usage over the Web. Because the parser processes only those parts of the XML file that are required, the system is very efficient and can be used even over a low bandwidth connection. Once whole entries are parsed, they are kept temporarily in a memory cache that speeds up subsequent accesses to the same entry in the browsing session. The XDI scheme is described in Figure 2.
Figure 2: XDI: Indexing the XML lexical database for better information access
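The offset-index idea can be sketched as follows. This is a minimal illustration, not Kirrkirr's actual code: the class, method and tag handling are hypothetical, and it assumes a single-byte encoding so that character positions coincide with byte offsets.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of the XDI idea: a small in-memory index of headword -> file
 * position, so that only the requested <ENTRY> is ever read from disk.
 * Names are illustrative; a single-byte encoding is assumed so that
 * character positions coincide with byte offsets.
 */
public class XdiSketch {
    private final Map<String, long[]> index = new HashMap<>(); // headword -> {offset, length}
    private final File xmlFile;

    public XdiSketch(File xmlFile) throws IOException {
        this.xmlFile = xmlFile;
        buildIndex();
    }

    /** One linear pass over the file records where each entry starts and ends. */
    private void buildIndex() throws IOException {
        String text = new String(Files.readAllBytes(xmlFile.toPath()), StandardCharsets.UTF_8);
        int pos = 0;
        while ((pos = text.indexOf("<ENTRY>", pos)) >= 0) {
            int end = text.indexOf("</ENTRY>", pos) + "</ENTRY>".length();
            int hwStart = text.indexOf("<HW>", pos) + "<HW>".length();
            String headword = text.substring(hwStart, text.indexOf("</HW>", hwStart));
            index.put(headword, new long[]{pos, end - pos});
            pos = end;
        }
    }

    /** Seek straight to the stored offset; the rest of the file is never touched. */
    public String readEntry(String headword) throws IOException {
        long[] loc = index.get(headword);
        if (loc == null) return null;
        try (RandomAccessFile raf = new RandomAccessFile(xmlFile, "r")) {
            raf.seek(loc[0]);
            byte[] buf = new byte[(int) loc[1]];
            raf.readFully(buf);
            return new String(buf, StandardCharsets.UTF_8);
        }
    }
}
```

Only the index map lives in memory; each `readEntry` call reads one entry's bytes and hands them to the XML parser on demand.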
The idea of using an index is well known from database systems. By using an index, Kirrkirr implicitly recognises the dictionary as a database (albeit a richly structured one), and XDI can be seen as a customised index for such a database. One might well argue that if one is to view the dictionary as a database, then it is better to use off-the-shelf database products that have their own customised index structures for efficient access. However, this is a tradeoff against preserving the rich structure in the dictionary. By using XDI, Kirrkirr tries to strike a balance between the indexing capabilities of standard database systems and the expressiveness and portability of XML. On the other hand, a customised solution increases system complexity and precludes taking advantage of alternative solutions to the general problem of XML information retrieval.
XQL is a set of extensions to the Extensible Stylesheet Language (XSL) specification that allows developers using XML to easily execute powerful, complex queries on XML documents. Proposed to the W3C by representatives from Microsoft, Texcel and WebMethods in 1998, it competes with the SQL-oriented XML-QL, whose specification was submitted to the W3C by AT&T Labs. However, we will not further discuss XML-QL here, but only the XQL API underlying the new data access model of Kirrkirr.
Traditionally, structured queries have been used primarily for relational or object-oriented databases, and documents were queried with relatively unstructured full-text queries. Although sophisticated query engines for structured documents have existed for some time, they have not been a mainstream application. XML documents are structured documents - they blur the distinction between data and documents, allowing documents to be treated as data sources, and traditional data sources to be treated as documents. Some XML documents are nothing more than an ASCII representation of data that might traditionally have been stored in a database. Others are documents containing very little structure beyond the use of headers and tables. Kirrkirr is somewhere in between: an e-dictionary that has complex recursive structure, but also much relatively unstructured free text, and clearly needs effective query mechanisms for access.
Database developers have taken for granted the ability to execute queries on data stores for decades. However, XML is a young data technology, and until recently its querying functionality has been very limited. XQL gives developers the querying functionality they have become used to in the database world, including the following:
The major differences between SQL and XQL are summarized in Table 1. It is clear that XQL is an invaluable tool with huge potential for accessing dictionary information stored in XML format.
SQL | XQL |
The database is a set of tables. | The database is a set of one or more XML documents. |
Queries are done in SQL, a query language that uses tables as a basic model. | Queries are done in XQL, a query language that uses the structure of XML as a basic model. |
The FROM clause determines the tables which are examined by the query. | A query is given a set of input nodes from one or more documents, and examines those nodes and their descendants. |
The result of a query is a table containing a set of rows. | The result of a query is a set of XML document nodes, which can be wrapped in a root node to create a well-formed XML document. |
Table 1: Main differences between SQL and XQL
XQL implementations typically operate on a model of the XML document known as the DOM (Document Object Model). The DOM is a platform-independent, programming-language-neutral application programming interface (API) for HTML and XML documents. Its core outlines a family of types that represent all the objects that make up an XML document: elements, attributes, entity references, comments, textual data and processing instructions. With that, it defines the logical structure of documents and the way a document is accessed and manipulated. (In short, the DOM specifies how XML documents are represented as objects, so that they may be used in object-oriented programs.)
Increasingly, XML is being used as a way of representing many different kinds of information that may be stored in diverse systems, and much of this would traditionally be seen as data rather than as documents. Nevertheless, XML presents this data as documents, and the DOM may be used to manage this data to allow programs to access and modify the content and the structure of XML documents from within applications. Anything found in an XML document can be accessed, changed, deleted, or added using the DOM, except for the XML internal and external subsets for which DOM interfaces have not yet been provided.
After the XML document has been parsed into a collection of objects conforming to DOM, the object model can be manipulated in any way that makes sense. This mechanism is also known as the "random access" protocol, as any part of the data can be visited at any time. The DOM usually resides in memory (it is the output of the XML parser), but it can also be stored on disk (to save on the time needed to parse the XML repeatedly) as a Persistent DOM (PDOM). When an XML document is large and not likely to change much, as is the case for dictionaries, using its PDOM representation can significantly speedup XQL querying.
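As a minimal illustration of this "random access" protocol, the following sketch parses a document into a DOM tree with the standard JAXP API and then visits an arbitrary node directly. The toy element names stand in for the dictionary's real markup.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/**
 * Sketch of random access over a parsed DOM, using the standard JAXP API.
 * The toy element names stand in for the dictionary's real markup.
 */
public class DomSketch {
    /** Parse an XML string into an in-memory DOM tree. */
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    /** Any node can be visited at any time: here, the headword of entry i. */
    public static String headwordOf(Document doc, int i) {
        NodeList entries = doc.getElementsByTagName("ENTRY");
        Element entry = (Element) entries.item(i);
        return entry.getElementsByTagName("HW").item(0).getTextContent();
    }
}
```

Once the tree is built, navigation never re-reads the file; a PDOM keeps the same tree on disk so that even the initial parse is paid only once.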
The primary purpose of XPath [HREF12] is to address parts of an XML document by operating on the abstract, logical structure of an XML document, rather than its surface syntax. XPath gets its name from its use of a path notation (as in URLs) for navigating through the hierarchical structure of an XML document. XPath is useful in that it provides a common model and syntax to express the patterns required by queries and transformation patterns for XML documents. It is intended primarily as a component that can be used by other specifications and does not define any conformance criteria for independent implementations. This is why XQL (which is compatible with XPath) is used in Kirrkirr instead of a specific XPath implementation.
The XQL search engine and the PDOM used for the new dictionary representation in Kirrkirr originated from a research project at GMD-IPSI [HREF9], the Institute for Integrated Publication and Information Systems of the German National Research Centre for Information Technology. The PDOM consolidates concepts from many years of leading-edge research and development in the fields of federated databases and document management systems. Persistency is achieved through indexed binary files. The XML document is parsed once and stored in binary form, accessible to DOM operations without the overhead of parsing it again when the information is required. The implementation uses a robust and efficient mix of indexing and query optimisation techniques, and a cache architecture further boosts performance. This approach scales very well beyond the limitations of main memory. The PDOM is generated from the XML dictionary; subsequently, Kirrkirr uses XQL to query the PDOM. Parsing of the XML need not be done repeatedly (it is only necessary when the dictionary changes) and access is faster.
The following is a simplified XML hierarchy of the dictionary.
Figure 3: A sample DOM tree
The dictionary is a sequence of many entries, which include some subset of a large number of dictionary components, including a headword (HW element) and perhaps one or more pictures (IMAGE element).
To find an entry whose headword (<HW>) is 'jaja', the following query ([ ] is the filter clause, equivalent to the WHERE clause in SQL) may be used:
/DICTIONARY/ENTRY[HW='jaja']
Alternatively, if the PDOM index is known, say index = 9 for the word jaja, we can use the query:
/DICTIONARY/ENTRY[9]
Executing the above queries is very slow, and the time taken depends heavily on the number of <ENTRY> nodes in <DICTIONARY>. This is bad news even for the 9300 entries of the current Warlpiri dictionary, and would be totally impractical for something like a large English dictionary, which might have 100,000 or more headwords.
One solution is to split the DOM representation into multiple smaller DOM trees. However, this approach does not scale very well if more words are added to the dictionary in the future and the increase in code complexity is difficult to justify.
Fortunately, implementations of the DOM API provide efficient ways to extract a child node given its index in the parent's node list, and this can be used to execute the above queries more efficiently and extract the required entries. After extracting the required entry from the DOM tree, there may be a need to query the multiple descendant IMGI nodes to extract the filenames of the image files of a dictionary entry (<ENTRY>). This can be very simply achieved by the following query:
ENTRY/IMAGE//IMGI
The '//' denotes recursive descent and in the above context means finding all the IMGI nodes under the IMAGE node.
Yet another virtue of storing the dictionary in XML format is that the content of the data is abstracted from the way it is used or displayed. This makes XML data very flexible in the way it can be formatted. At a fairly basic and straightforward level, the formatting of the textual content can be controlled by an Extensible Stylesheet Language (XSL) stylesheet.
Formatted dictionary entries with hypertext links are an adaptation of traditional dictionary structure that is useful for displaying and navigating related words. Letting the user point and click on a referenced word to jump immediately to its corresponding entry is a better way of presenting dictionary entries than requiring users to remember referenced words and perform a new search for each one.
At the time of Kirrkirr's initial development, the challenge of formatting the XML data in a useable format was finding an XSL processor that implemented at least the basic features of the draft standard.
While the msxsl application from Microsoft was effective in generating HTML from an XSL file, its limitation to Windows platforms and its inability to deal with large amounts of data meant it could not be included with the Kirrkirr application or applet. Hence the HTML needed to be pre-generated in a process that required each entry to be put in a separate XML file and fed individually into msxsl to generate an HTML file (9000+ files in all, one for each entry in the dictionary). These files would then be packaged together with Kirrkirr.
An XSL file can be used to convert an XML document into a specific output format such as HTML, Rich Text Format (RTF) or PostScript. This conversion is done by an XSL processor, which takes the XML file and the XSL file and creates the formatted file. The structure of an XSL document is essentially a list of rules that tell the XSL processor what to produce when it encounters certain "target elements" in the XML file. XSL Transformations are now at the stage of a W3C Recommendation (XSLT [HREF14]).
The development of XSL tools is still at a fairly early stage, but there is nevertheless increasing availability of XSL processors. In particular, with James Clark's Java-based XSL processor XT, it became possible for Kirrkirr to dynamically generate the formatted dictionary entries in real time. As shown in Figure 4 (continued from Figure 2), Kirrkirr takes the XML document object returned from the XML parser and, together with the XSL style sheet specified by the user, has the XSL processor generate an HTML document.
Figure 4: The dynamic HTML generation process in Kirrkirr
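The transformation step itself can be sketched with the standard JAXP transformation API. Kirrkirr used James Clark's XT, but any compliant XSLT processor performs the same XML-plus-stylesheet-to-HTML step; the class name and stylesheet content here are illustrative.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/**
 * Sketch of the runtime formatting step: a user-chosen XSL stylesheet
 * turns one XML entry into formatted output. Kirrkirr used XT; this
 * sketch uses the JAXP API, which performs the same transformation.
 */
public class XslSketch {
    /** Compile the stylesheet and apply it to the XML, returning the result. */
    public static String transform(String xml, String xsl) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }
}
```

Swapping one stylesheet string for another changes the rendered entry without touching the XML data, which is exactly what makes per-user customisation cheap.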
While HTML generation at runtime eliminated the need for HTML files to be packaged with the application (as discussed in Section 4.1), the key benefit was the ability to let the user customise the formatting process (the performance implications are discussed in Section 5.4). At present the user can select from a few pre-defined style sheets, which vary from including all the details in the dictionary entry (related words, semantic domains, examples, senses, sub-senses etc.) to simply displaying the word, its definition and the links to its related words. An important aspect of providing this level of customisation is that elementary users need not be overwhelmed by the detailed lexical information in the dictionary yet can still have access to the rich network of word relationships.
While the use of XQL and XSL appear, on paper, to be useful and to advance the state-of-the-art, critical questions about efficiency and performance remain unanswered. In this section we examine some of these issues within the context of Kirrkirr. The tests in this section were conducted on a Pentium-133 machine with 48 Mbytes of RAM and running JDK 1.2.1. We deliberately used a comparatively slow and old machine because this is more typical of the equipment likely to be available to potential users of the Warlpiri dictionary.
Method | Size of index file | Size of file to be loaded at start-up | Start-up time needed |
XML+XDI | Index file: 2.13Mb | 2.13Mb | 7min |
One-PDOM | PDOM: 12.5Mb | N.A. | 13min 04s |
One PDOM + Index | PDOM: 12.5Mb; Index: 520Kb | 520Kb | 3min 30s |
Segmented PDOM + Index | PDOM: 12.5Mb; Index: 454Kb | 454Kb | 55.48s |
Table 2: Comparison of the start-up times for Kirrkirr with different representations.
Table 2 summarises the comparison results of start-up time. XML+XDI is the original Kirrkirr. In One-PDOM, the dictionary information needed is extracted from one monolithic PDOM. The performance of this method is very unsatisfactory. It is evident that a start-up index created from the PDOM is required. This start-up index need not be as complex as the XDI - it need only have the bare minimum information (just enough to get the rest by querying the PDOM). The use of such an index reduces the start-up time to half that of the original method. An optimised and enhanced version of this is also considered where dictionary information is broken into blocks. Part of the dictionary can be loaded and the user interface created and presented to the user, while the rest of the dictionary loads in the background. This speeds up the start-up time significantly and allows the user to play with some of the words, while the rest are being loaded in the background.
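The segmented loading idea can be sketched as follows. The class and the dummy entry contents are hypothetical, not Kirrkirr's code; the point is only the shape of the scheme: the first block loads synchronously so the interface can appear, and a background thread fills in the rest.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

/**
 * Sketch of segmented start-up: the first block of entries is loaded
 * before the interface appears; the remaining blocks are filled in by a
 * background thread. Entry contents here are dummy strings.
 */
public class SegmentedLoader {
    private final Map<String, String> entries = new ConcurrentHashMap<>();
    private final CountDownLatch done = new CountDownLatch(1);

    public void start(List<List<String>> blocks) {
        loadBlock(blocks.get(0));          // enough for the UI to come up
        Thread background = new Thread(() -> {
            for (int i = 1; i < blocks.size(); i++) loadBlock(blocks.get(i));
            done.countDown();              // all blocks now available
        });
        background.setDaemon(true);
        background.start();
    }

    private void loadBlock(List<String> words) {
        for (String w : words) entries.put(w, "entry for " + w);
    }

    public String lookup(String word) { return entries.get(word); }

    public void awaitAll() throws InterruptedException { done.await(); }
}
```

Words from the first block are usable immediately after `start` returns, while lookups for later blocks succeed once the background thread has reached them.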
One of the main justifications for using XQL was speedier access through a standardised mechanism. Here we examine whether this is true, and what more needs to be done to achieve it.
Method | Time taken for extract of dictionary information for a word |
XDI | 118 ns |
XQL + PDOM | 3700 ms (first query) 400 ms (subsequent query) |
Table 3: Summary of time taken for dictionary information extraction
Table 3 compares the time taken for dictionary information extraction. The original method stores all the dictEntries (a dictEntry is an object encapsulating the most often used information of a dictionary entry) in a Hashtable in main memory. There are two test results for the XQL method because of the self-optimising cache architecture of the XQL engine, which ensures faster queries after the first.
Extraction of dictionary information by querying the PDOM is much slower than getting it from a Hashtable, even using the faster timing for comparison (about three orders of magnitude slower). The difference in speed is not a major concern when information for only one word is needed (for example, when a user selects a word from the list), since the user's own reaction time dominates. Unfortunately, when dictionary information for a large number of words is required (say, during a search), performance drops significantly, as the extra time for extracting each of the many dictionary entries accumulates.
Figure 5: Faster dictionary information retrieval through addition of a cache.
A solution that matches the speed of XDI retrieval while keeping the fast start-up time of the PDOM implementation is possible by adding a "cache" (basically a HashMap) between the PDOM and the user interface. All dictEntries retrieved from the PDOM are stored in the cache. Only the first request for a dictEntry not present in the cache incurs a penalty; subsequent requests for the same dictEntry are answered directly from the cache. Unlike a hardware cache, this cache does not use a replacement policy (e.g. FIFO or Least Recently Used) to displace dictEntries.
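A minimal sketch of such a cache follows. The names are hypothetical, and a plain function simulates the slow PDOM/XQL lookup; in Kirrkirr the cached values would be dictEntry objects rather than strings.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Sketch of the dictEntry cache: a HashMap in front of the slow PDOM
 * lookup, with no replacement policy. The PDOM/XQL query is simulated
 * by a plain function here.
 */
public class EntryCache {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> pdomLookup; // stands in for an XQL query
    int misses = 0; // counts how often the slow path was taken

    public EntryCache(Function<String, String> pdomLookup) {
        this.pdomLookup = pdomLookup;
    }

    /** Only the first request for a headword pays the PDOM query cost. */
    public String get(String headword) {
        if (!cache.containsKey(headword)) {
            misses++;
            cache.put(headword, pdomLookup.apply(headword));
        }
        return cache.get(headword);
    }
}
```

Since the dictionary is read-only during a browsing session, never evicting entries is safe: the cache can only grow to the set of words the user has actually visited.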
The original Kirrkirr takes around 32min 26s to create the XDI index (on the same Pentium 133 machine). Apart from being slow, the index created was far too big and required too much time to load during start-up (as discussed in Section 5.1).
The new dictionary representation requires a smaller index (containing only essential dictionary information) for program start-up. Two steps are involved in creating this new index:
Step | Time Taken |
XML -> PDOM |
3min 2s |
PDOM -> Index |
10min 52s |
Total | 13min 54s |
Table 4: Time taken for each step of the creation of start-up index (new version).
As shown above, index creation is now considerably faster: 13min 54s, against 32min 26s for the original, an improvement of 133%.
Although the technique of pre-generating the HTML files for each of the dictionary entries has the benefit of taking the load off the application, there were a number of reasons for moving to dynamic generation of HTML.
With an HTML file for each entry (9000+) in a single folder, not only does the application become cumbersome, there is also a performance load on the file system. As demonstrated in Table 5, the performance impact of performing the XSL processing at the instant the user requests a word is not significant.
Step | Time Taken |
First word | 4s |
Other words | 1-2s |
Table 5: Time taken by XSL processor to generate a HTML entry (new version).
There is a noticeable load on the system when processing the first word request, as the Java security framework establishes that the user has sufficient permission to access the file system. Subsequent formatted-entry requests are processed considerably faster.
The performance cost of dynamic XSL generation of HTML is low enough to justify inclusion of the interactive formatting functionality.
In this paper we have described the integration of XQL and XSL components into the Kirrkirr interface for dictionary visualisation. A great advantage in using XML for the storage of data is that it can be converted into any format simply by specifying rules in an XSL style sheet. We have discussed how various XSL style sheets can customise output for different user needs, and in the future we plan to continue exploring the use of XSL for producing printed versions of the dictionary as well.
Being able to produce multiple customised versions of printed dictionaries at low cost (in terms of human labour) is another central need in the domain of dictionaries for languages with few speakers. We are also exploring the development of a GUI interface for defining XSL stylesheets, so they can more easily be customised by users. For efficient access to and searching of dictionary information, we have explored replacing the original custom indexing of Kirrkirr with XQL. While XQL offers many advantages in terms of flexibility and standardisation, our current results suggest that a combination of custom indexing and XQL is the way to achieve optimal performance.
In other work not reported here, we have also continued to investigate the usability of dictionary software by diverse user populations, and to explore various alternative dictionary interfaces modules. While the focus of all this research has been on Warlpiri, this research (and the software constructed) can easily be applied to other languages, including other Australian languages and English. In general, our hope is to provide a better understanding of the usefulness and practicality of innovative dictionary browsing environments. Flexible and efficient infrastructure components of the sort we have explored here are a necessary foundational part of this research.
M. Corris, C. Manning, S. Poetsch and J. Simpson, "Dictionaries and endangered languages", to appear in David Bradley and Maya Bradley (eds), Language Endangerment and Language Maintenance: an active approach.
K. Jansz, "Intelligent processing, storage and visualisation of dictionary information", Computer Science Honours Thesis, Basser Department of Computer Science, University of Sydney, 1998 [HREF6]
K. Jansz, C. Manning, N. Indurkhya, "Kirrkirr: Interactive Visualisation and Multimedia from a Structured Warlpiri Dictionary", Ausweb 99 [HREF7]
K. Jansz, C. Manning, N. Indurkhya, "Kirrkirr: A Java-based visualisation tool for XML dictionaries of Australian Languages", XML/SGML Asia Pacific 1999 [HREF8]
M. Laughren and D. Nash, "Warlpiri Dictionary Project: aims, method, organisation and problems of definition" Papers in Australian Linguistics No. 15: Australian Aboriginal Lexicography pp 109-133, Pacific Linguistics 1983.
Kevin Jansz, Wee Jim Sng, Christopher Manning and Nitin Indurkhya, © 2000. The authors assign to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The authors also grant a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers and for the document to be published on mirrors on the World Wide Web.