Kevin Jansz [HREF1], Department of Computer Science, University of Sydney, Australia. firstname.lastname@example.org
Christopher Manning [HREF2], Department of Linguistics, University of Sydney, Australia. email@example.com
Nitin Indurkhya [HREF3], School of Applied Science, Nanyang Technological University, Singapore. firstname.lastname@example.org
While dictionaries on the web -- and on other electronic media such as CD-ROMs -- are now commonplace, there has been surprisingly little work to utilise the web's capabilities for hypertext linking and multimedia to provide a richer visualisation of dictionary content. Existing electronic dictionaries still follow design principles more appropriate for paper dictionaries, thereby limiting their scope and effectiveness. In this paper, we take a fundamentally different approach, designing an innovative environment that focuses on fully using the capabilities of the web. We describe Kirrkirr, a web-based application for interactive exploration of dictionaries. It currently targets Warlpiri (a Central Australian language). A key feature of our work is that we have converted the existing Warlpiri dictionary into a richly-structured XML version. The flexibility and hierarchical structure of XML are ideally suited for supporting rich but loosely structured content such as dictionaries, while web-based distribution is particularly attractive because dictionary maintenance can be done on a central server, and Java-based clients can access up-to-date dictionary information as needed. Kirrkirr provides a graph-based display of semantic links between words, which offers an engaging interface that can be explored, manipulated and customised interactively by the user.
Past research into electronic dictionaries falls into the two very distinct categories of dictionary databases or Machine Readable Dictionaries for computational linguistics, and the use of dictionaries for language learners. As identified in (Kegl 1995), it is surprising that despite the enormous potential of having dictionaries in electronic databases there has been almost nothing in the way of combining these two areas of research, and using advances in electronic dictionaries, information visualization, and language education to benefit speakers and learners of a language.
Another problem with current electronic dictionaries is that they are mostly focused on searching. They tend to simply display information in a format mimicking (but normally worse than) the paper versions from which they were adapted. There has been little research on browsing interfaces to dictionaries, or on innovative ways of displaying the informational content of dictionaries in ways that are more suited to the computer medium, making use of ideas such as multimedia and hypertext. The current focus can be partly attributed to the fact that dictionaries have evolved into primarily vocabulary databases. Weiner (1994) discusses the initial purpose of the Oxford English Dictionary (OED) and the eventual diversion from their goal:
"… To create a record of vocabulary so that English literature could be understood by all. But English scholarship grew up and lexicography grew with it … inevitably parting company with the man in the street".
It seems that the initial purpose of the OED has, if not faded, been significantly blurred. Although the OED is comprehensive in its documentation of lexicographic information and the history of words, this information is inaccessible to casual dictionary users who get better value from a shorter work such as the Concise Oxford English Dictionary. For a rather different language and language situation, our research aim has been to try to produce something that comes nearer to meeting the OED's original goal.
An interesting issue raised in (Sharpe 1995) is the "distinction between information gained and knowledge sought", which is very important but rarely identified in e-dictionary research. The speed of information retrieval that e-dictionaries (characteristically) deliver and the focused decontextualized search results they provide can frequently lead to loss of the contextual and memory retention benefits that manually searching through paper dictionaries provides. For language learning purposes, using paper dictionaries has the benefit of exposing the user to many other words as they flick through the pages to the entry they are looking up and as they look at the other entries surrounding the word on the same page. Current on-line systems lose this functionality. They fail to encourage random learning and interest-driven exploration, and the speed of access is probably detrimental to the inadvertent memorisation of word meanings. Providing interactivity and pointers that encourage browsing of the web of entries in a dictionary may be one way of rectifying this, while at the same time it is important to be able to deliver quick and informative search results for the more experienced user.
A large potential advantage of a computerised dictionary is that there is an interface between the user and the complete dictionary. This means that the dictionary can be a large lexical warehouse of all potentially relevant information, and the user can choose or customise the interface so that it provides appropriate subsets of this information presented in a way that suits their preferences. This means that the dictionary can deal with a broader range of intentions and a greater range of language competency than is possible with printed copy. Some of the possibilities were impossible in principle with paper dictionaries, while for others, the cost of producing different dictionary editions for different user needs was prohibitive (for all except the very largest languages). In utilising the potential of an electronic dictionary we may come closer to the initial aims of the OED by customising the overall experience of vocabulary to the individual user. For a particular dictionary application to satisfy a diverse range of users depends on the developers utilising this potential of working in an electronic medium.
In this paper, we describe Kirrkirr [HREF4], a web-based application for interactive exploration of dictionaries. It currently targets the Warlpiri (a Central Australian language) dictionary (Laughren and Nash 1983). The application incorporates a number of novel approaches to utilising the web's capabilities to provide a richer visualization of dictionary content. The aim is to build an environment where exploring words and their meaning is fun for learners.
There are a number of reasons for implementing an electronic bilingual dictionary for Warlpiri. One is the large amount of work that has gone in over a period of decades to gather encyclopedic dictionary information on Warlpiri (Laughren and Nash 1983), resulting in the most detailed lexical materials available for an Australian language. A second is the reasonable number of speakers of the language, and the interest in and availability of bilingual education programs over a number of years. There is interest among Aborigines in speaking their traditional languages (Goddard and Thieberger 1997) and in order to sustain the initial interest, part of the challenge of creating a computerised system is to make it easy and fun to use. Finally, the generally low level of literacy in the region impedes use of printed dictionaries, and offers opportunities for exploring new methods of access in electronic dictionaries. Hence a constraint in the system's design was that it not be heavily reliant on written/typed interaction from the user. Features such as being able to point-and-click at words and hear their pronunciation are important in making the system approachable and readily usable, and hence educationally useful. A strong dependence on alphabetical order would detract from that goal. In our system, if people cannot remember a word, they can hope to find it from any related word. Warlpiri words and their relationships can be explored simply by clicking, whereas with a conventional dictionary, exploring related words is only practical for people with strong literacy skills (rapid alphabetical order lookups, good scanning ability, etc.).
The preparation of different printed editions of the Warlpiri dictionary (comprehensive, concise, and learners' dictionaries) is precluded for economic reasons as the community of speakers is relatively small. In a similar way to the need for the construction of the OED, there is a need for documentation of the language for it to be able to be understood and preserved. Creating a computer interface to the present Warlpiri dictionary makes the information instantly accessible and also customisable to the relative needs of the user.
The application developed was called Kirrkirr (the Warlpiri word for the kind of 'click' sound made by a young kangaroo). It is written in Java using the Swing API. As shown in Figure 1, it is made up of five main modules:
As reflected in the system modules, this application is unlike any other e-dictionary as it was designed to cater for the needs of Warlpiri speakers with various levels of competence. Features such as the searching facility allow information to be accessed easily and quickly, while incorporation of animation and sounds makes the dictionary usable by speakers with little to no background with the language.
Another feature of this application is that it is highly customisable, encouraging the user to interact with the interface, tailor it to their needs, and use the system to its full potential. The flexibility engineered into the system also makes Kirrkirr very easy to extend in the future. We plan some further components, such as being able to browse words by semantic domains.
While there has been enormous support for the adoption of XML, the eXtensible Markup Language [HREF5], much of the discussion has centred on representing rigid and simple structures such as database records in XML. In contrast, a dictionary exploits the capabilities of XML much more thoroughly. The entries of an unabridged dictionary typically have a rich hierarchical structure and, moreover, the make-up and structure of the entries vary greatly depending on the type of word being defined (Ooi 1998). This is the kind of data for which XML excels. The Warlpiri dictionary is in an ad-hoc format, with some resemblance to the FOSF format (the Summer Institute of Linguistics' Field Ordered Standard Format, described in Goddard and Thieberger 1997) commonly used by Australian linguists, but already possessing hierarchical structure like SGML. The dictionary contains about 5450 entries covering about 9000 headwords, with many of the entries having a rich structure of senses and subsenses akin in detail to the OED. We converted it into XML using an error-correcting stack-based parser written in Perl. Design of an adequate parser required considerable work due to the numerous inconsistencies and errors in the original, reflecting its maintenance by hand for many years, using just plain text editors, and regular expressions for searching and replacing text. The XML version allows definition of the precise semantics of the dictionary content, while leaving unspecified its form of presentation to the user, a flexibility that we exploit in our application.
The need for structured markup for the web has influenced rapid development of XML. The advantage of using a globally recognised markup language for the Warlpiri dictionary is that the research benefits from the various general purpose tools available, rather than having to develop specific tools, as would be the case for the previous dictionary format. The dictionary file also becomes more portable and accessible. The cost is that it is now less easily maintained than before by a human using a plain text editor. Computer support tools become necessary. Fortunately, XML tools are widely available and increasing in number.
Storing the lexical database in an XML formatted file is an effective middle ground between the structure and built-in querying of a relational database and the flexibility and portability of a plain text document. The strengths of this approach were best appreciated in the development of the Kirrkirr dictionary browser.
Initial testing showed that if the program simply read in the entire XML file (about 8.7 MB of text) and stored it as parsed data structures within memory, then the memory used by the system was excessively high. The best solution was to create an index file that contained the words, information about cross-references for the graphical display, and the corresponding file position of each word's entry in the XML file. Hence, of the 5450 entries in the dictionary, only those requested by the user will be read in and processed. This led to a considerable improvement in the speed of the application and its resource efficiency.
This approach worked well with the XML parser being used (Microstar's Ælfred parser). Although the parser (like other XML parsers of which we are aware) was built to parse an entire file at a time, it was relatively easy to adapt the code to allow either processing of just one entry or a whole XML document.
The use of an index file was also well suited to using the application over the Web. Because the parser only processes the parts of the XML file that are required, the system is very efficient over a low bandwidth connection.
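The index-file idea described above can be illustrated with a minimal sketch. It is not the Kirrkirr implementation: the element names `ENTRY` and `HW`, the class `EntryIndex`, and the in-memory map (standing in for a stored index file) are all assumptions made for illustration. The sketch records the byte offset and length of each entry once, then reads a single entry on demand with a seek rather than parsing the whole file.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

final class EntryIndex {
    private static final String OPEN = "<ENTRY>";   // hypothetical element name
    private static final String CLOSE = "</ENTRY>";

    /** Maps each headword to {byte offset, byte length} of its entry. */
    static Map<String, long[]> build(Path xml) {
        try {
            byte[] bytes = Files.readAllBytes(xml);
            // ISO-8859-1 maps one byte to one char, so char indexes are byte offsets.
            String raw = new String(bytes, StandardCharsets.ISO_8859_1);
            Map<String, long[]> index = new LinkedHashMap<>();
            int from = 0;
            while (true) {
                int start = raw.indexOf(OPEN, from);
                if (start < 0) break;
                int close = raw.indexOf(CLOSE, start);
                if (close < 0) break;
                int end = close + CLOSE.length();
                String entry = new String(bytes, start, end - start, StandardCharsets.UTF_8);
                int hwStart = entry.indexOf("<HW>");   // hypothetical headword element
                int hwEnd = entry.indexOf("</HW>");
                if (hwStart >= 0 && hwEnd > hwStart) {
                    String headword = entry.substring(hwStart + 4, hwEnd).trim();
                    index.put(headword, new long[] {start, end - start});
                }
                from = end;
            }
            return index;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads exactly one entry from disk, leaving the rest of the file untouched. */
    static String readEntry(Path xml, long[] span) {
        try (RandomAccessFile file = new RandomAccessFile(xml.toFile(), "r")) {
            byte[] buffer = new byte[(int) span[1]];
            file.seek(span[0]);
            file.readFully(buffer);
            return new String(buffer, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In such a design the map would be built once, offline, and saved alongside the dictionary; the string returned by `readEntry` is a small well-formed fragment that could then be handed to an XML parser on its own.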
A virtue of storing the dictionary in XML is that there is an abstraction of the content of the data from the way it is going to be used or displayed. At a fairly basic and straightforward level, the formatting of the textual content can be controlled by an eXtensible Style Language (XSL) stylesheet.
An XSL file can be used to render an XML document in a specific output format such as HTML, Rich Text Format (RTF) or Postscript. This process is done by an XSL processor, which takes the XML file and the XSL file and creates the formatted file. The structure of an XSL document is essentially a list of rules that tell the XSL processor what to produce when it encounters certain "target-elements" in the XML file. This allows us to cater presentation to different user needs: for example, leaving out technical and obscure fields for young non-technical users. XSL is currently only at the stage of a draft standard [HREF6], but we expect that processors with more functionality will soon be available, and possibilities in this area will expand.
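As a rough illustration of rule-based formatting, the sketch below applies a small stylesheet to a single entry using the standard Java `javax.xml.transform` API. It assumes XSLT 1.0 as later standardised, not the draft discussed here, and the element names (`ENTRY`, `HW`, `GLOSS`, `ETYM`), the class `EntryFormatter`, and the stylesheet itself are hypothetical. The empty rule for `ETYM` shows how a technical field might be left out for non-technical users.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class EntryFormatter {
    /** Hypothetical learner stylesheet: one rule per target element. */
    static final String LEARNER_XSL =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='html'/>"
      + "<xsl:template match='ENTRY'><p><xsl:apply-templates/></p></xsl:template>"
      + "<xsl:template match='HW'><b><xsl:value-of select='.'/></b>: </xsl:template>"
      + "<xsl:template match='GLOSS'><xsl:value-of select='.'/></xsl:template>"
      // An empty rule suppresses the element entirely.
      + "<xsl:template match='ETYM'/>"
      + "</xsl:stylesheet>";

    /** Runs the XSL processor over one entry and returns the formatted text. */
    static String format(String entryXml, String stylesheet) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(entryXml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not format entry", e);
        }
    }

    public static void main(String[] args) {
        String entry = "<ENTRY><HW>alpha</HW><GLOSS>placeholder gloss</GLOSS>"
                     + "<ETYM>technical note</ETYM></ENTRY>";
        System.out.println(format(entry, LEARNER_XSL));
    }
}
```

Swapping in a different stylesheet string would change the presentation without touching the dictionary data, which is the separation of content from display that the XML representation is meant to provide.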
For applications with simple, lexical databases behind them, there is little that can be done in the way of accessing the language apart from giving an on-line reflection of what one would see on a printed page. Simply knowing the definition of a word does not achieve much if the language learner does not know the usage of the word and how it relates to other words. In a computerised medium there is no such limitation on the way information can be conveyed to the user.
Despite considerable research into computerising dictionaries, there has been little done in the way of making the network of relationships between words explorable. WordNet (Miller et al. 1993) is the best known example of a lexical database which was created on computer with the idea of structuring words, meanings, and relationships in fundamentally new ways that departed from the paper dictionary tradition, and which brought out relationships that were reflected in human cognition. While this project has created a richly structured dictionary, which has been used in many computational linguistic projects, the linking that can be represented is limited because the organisation of the database is confined to specific linguistic categories (e.g., meronymy, hyponymy, synonymy). What is worse, in terms of WordNet's computerisation of the dictionary, is that very little has been done to make this linking known to the user. Merely including in the text gloss for a dictionary entry linguistically technical information about words that are synonyms, antonyms, or super- or sub-classes of the entry does not come near to utilising the potential for information visualisation that working in an electronic medium provides.
Most current dictionary systems present the search-dominated interface of classic information retrieval (IR) systems. These are only effective when the user has a clearly specified information need and a good understanding of the content being searched. Search interfaces are ineffective for information needs such as just "getting an idea of" a new document collection, and so some work in IR has emphasised the need for new methods of information access for browsing document collections (Pirolli et al. 1996). Moreover, the essence of hypertext on the World Wide Web is browsing. The issue of searching has only come to the fore on the WWW because of the impossibility of finding what one wants by browsing. This raises the need for better tools for visualising and navigating the contents of the web in browse-mode. One approach to this is to visualise the web using a graph, an approach explored by Huang et al. (1998). Very similar issues arise for us in the visualisation of a dictionary (even if the scale of the problem is not quite so vast).
While also providing a conventional search interface, our main emphasis has been on developing alternative interfaces for browsing the dictionary (Jansz 1998). One of these is the hypertext version of the dictionary, but the more interesting interface is a colour-coded graph representation of the dictionary. An interesting approach to displaying the web of word inter-relationships in a language is to represent the network by a graph, where the words act as nodes and the relationships between them are represented by edges. This is more effective for learning than alphabetical order since semantically related words are shown nearby, even when they are not close by in alphabetical order. Graphical networks have already been exploited for web page relationships (Huang et al. 1998) and for dictionaries. The plumbdesign visual-thesaurus [HREF7] is a good demonstration of how graphical techniques can be used, but it is also a fine example of difficulties that arise from an information visualisation perspective once the graph becomes complicated. In this system, it doesn't take long before the 'sprouted' synonyms and the links between them begin to overlap and move chaotically until the overall image becomes confusing (see Figure 4).
Both these applications have only one link type. However, the richer semantic markup of the dictionary, with many kinds of semantic links (such as synonyms, antonyms, hyponyms, and other forms of relationships) allows us to provide a richer and more meaningful browsing experience. Moreover, the ability to display different link types graphically as different colours solves one of the recurring problems of the present web, with its one type of link: users have some idea of what type of relationship there is to another word before clicking.
In making the Kirrkirr application that allows the Warlpiri dictionary to be browsed it was also important that the interface be appealing to the eye and "fun" to use. And in contrast with the plumbdesign system, from a serious language learning viewpoint, the clarity and simplicity of the graph displayed was of much greater importance.
A graph is more readable when no nodes overlap, intersections between edges are kept to a minimum, and the nodes are spread across the window in which they are viewed. Huang et al. (1998) propose a fairly effective solution. Their approach involves associating gravitational repulsion forces with the nodes and spring forces with the links between them and then, in accordance with the laws of physics, letting the nodes virtually lay themselves out. Their research also explores the issues associated with displaying only part of what may be a very large and complicated graph, and allowing the user to progressively see different sections, but for us the most useful part of their approach is that the iteratively updating algorithm they use can provide an interactive environment in which users can "play".
The algorithm for laying out the nodes in the current frame considers nodes as steel rings, which have gravitational repulsion between them (Huang et al. 1998; Eades et al. 1998). The edges connecting the nodes are considered as steel springs. The algorithm attempts to calculate a layout where these rings and springs are in equilibrium. The conceptual advantage of this analogy is that the nodes will attempt to keep a sufficient distance away from each other so as not to overlap. The nature of a spring is to exert a force so as to maintain its zero energy or relaxed length. Treating edges in this way means that connected nodes will be repelled from being too close, while at the same time forced closer if they venture beyond the zero energy length.
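The ring-and-spring analogy can be sketched as a single relaxation step of a generic force-directed layout. This is not the modified spring algorithm of Huang et al. or its particular weights: the class `SpringLayout`, the inverse-square repulsion, the linear spring law, and the four parameters are assumptions chosen to keep the example short.

```java
final class SpringLayout {
    /**
     * Performs one relaxation step in place.
     * pos[i] = {x, y} for node i; edges[k] = {a, b} connects nodes a and b.
     */
    static void step(double[][] pos, int[][] edges, double repulsion,
                     double stiffness, double restLength, double stepSize) {
        int n = pos.length;
        double[][] force = new double[n][2];

        // Every pair of nodes repels, more strongly the closer they are.
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double dx = pos[j][0] - pos[i][0];
                double dy = pos[j][1] - pos[i][1];
                double dist = Math.max(Math.hypot(dx, dy), 1e-9);
                double f = repulsion / (dist * dist);
                force[i][0] -= f * dx / dist;
                force[i][1] -= f * dy / dist;
                force[j][0] += f * dx / dist;
                force[j][1] += f * dy / dist;
            }
        }

        // Each edge is a spring pulling or pushing towards its relaxed length.
        for (int[] edge : edges) {
            int a = edge[0];
            int b = edge[1];
            double dx = pos[b][0] - pos[a][0];
            double dy = pos[b][1] - pos[a][1];
            double dist = Math.max(Math.hypot(dx, dy), 1e-9);
            double f = stiffness * (dist - restLength); // positive when stretched
            force[a][0] += f * dx / dist;
            force[a][1] += f * dy / dist;
            force[b][0] -= f * dx / dist;
            force[b][1] -= f * dy / dist;
        }

        // Move each node a small distance along its net force.
        for (int i = 0; i < n; i++) {
            pos[i][0] += stepSize * force[i][0];
            pos[i][1] += stepSize * force[i][1];
        }
    }

    public static void main(String[] args) {
        double[][] pos = {{0, 0}, {5, 0}, {5, 5}};
        int[][] edges = {{0, 1}, {1, 2}};
        for (int iteration = 0; iteration < 200; iteration++) {
            step(pos, edges, 500.0, 0.1, 50.0, 0.1);
        }
        for (double[] p : pos) {
            System.out.printf("%.1f, %.1f%n", p[0], p[1]);
        }
    }
}
```

Because each call only nudges the nodes, a display loop can call `step` once per frame and redraw, which is what makes it possible to drag nodes or change parameters while the layout keeps settling.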
The network of relationships in a language can become too much for a user to seriously absorb from a detailed diagram. With some entries in the Warlpiri dictionary having more than thirty related words listed, the application needed a high level of customisability that went beyond simply being able to move nodes around.
For this form of visualisation to be of educational benefit, it is important that the graph looks the way the user wants and contains the information they require. For an animated network of words, there is the added requirement that the user can make the graph behave the way they want as well.
One of the features of the modified spring algorithm is that the layout behaviour is determined by just five arbitrary weights, so it is possible to allow the user to customise the graphical layout simply by allowing them to adjust the weights. Because of the iterative nature of the algorithm, these weights can be changed "online", allowing the user to see the effects of the change instantly. These weights include:
Each of these weights was made editable in the Graph Layout module of the Kirrkirr application. The user was also allowed to filter the appearance of the graph to only display certain relationships between nodes.
In addition to customising the layout of the graph, the user can 'anchor' nodes in place by right clicking on them and selecting the 'anchor' option. Also incorporated into this node menu are options to delete the specific node, sprout a note-taking window for the word contained in the node, or include the short English gloss for that entry inside the rectangle node.
For speakers with limited literacy skills, it is particularly important to have an interface to the dictionary that does not rely on typing. Although the written word cannot be totally avoided in a dictionary, features such as sounds, images and animated words reduce the need to know the written form of a word before the application can be used.
An important aspect of learning a word is knowing how to say it. Although many printed dictionaries include phonetic symbols with the lexical entries, it is rare that a language learner bothers to look up the jacket of the dictionary to understand how this phonetic alphabet can be decoded. In many cases, the technical nature of this sort of information acts more as a distraction or a perceived barrier to dictionary use rather than a help.
With most computers now having sound playing capabilities, an electronic dictionary can achieve much more in teaching the pronunciation of words by letting users hear the words themselves. This allows immediate recognition of the spoken word, which can then be related to the written form of the word. Including pictures and photos with dictionary entries is also an aspect that makes the dictionary significantly more interesting to use than a standard printed dictionary. It's surprising that few applications bother to include this very basic approach to making the dictionary more visually appealing. Being able to see pictures of especially plants and animals is an effective way of making the system fun to use, while still being educational.
Despite the use of multimedia in an e-dictionary, there must still be some facility for users to read more detailed dictionary information if it is to be considered as a serious reference. Although it can hardly be considered as a non-written interface, having formatted dictionary entries that have hypertext is a useful way of displaying related words. The functionality to let the user point and click on a referenced word to jump immediately to its corresponding entry is a better way of displaying written dictionary entries than requiring users to remember referenced words and perform a new search for their entry.
HTML also has the nice quality of allowing colours, which means that related words can be shown in exactly the same colours as their edges in the animated graph. Moreover, the ability to display different link types (such as synonyms, antonyms, hyponyms, and other forms of relationships) graphically as different colours gives users some idea of what type of relationship there is to another word before clicking. These aspects help to take away the importance of fully understanding the written words in the dictionary application before the application can be used.
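One way to keep the hypertext and graph colours in step is to derive both from a single table of link types. The sketch below is only an illustration: the class `LinkColours`, the set of link types, the colour values, and the `#headword` anchor scheme are all assumptions, and a real version would also need to escape the headword for HTML.

```java
final class LinkColours {
    /** Hypothetical link types; the real dictionary markup may define others. */
    enum LinkType { SYNONYM, ANTONYM, HYPONYM, SEE_ALSO }

    /** Single source of truth for the colour of a link type (values are arbitrary). */
    static String colourOf(LinkType type) {
        switch (type) {
            case SYNONYM: return "#0000cc";
            case ANTONYM: return "#cc0000";
            case HYPONYM: return "#008800";
            default:      return "#666666";
        }
    }

    /** Renders a cross-reference as a hyperlink in the colour of its link type. */
    static String htmlLink(String headword, LinkType type) {
        return "<a href=\"#" + headword + "\" style=\"color:" + colourOf(type) + "\">"
             + headword + "</a>";
    }

    public static void main(String[] args) {
        System.out.println(htmlLink("alpha", LinkType.SYNONYM));
    }
}
```

The graph display would call the same `colourOf` method when drawing an edge, so a synonym appears in one colour in both views.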
The search interface to the dictionary provides a user-friendly console where search results can be sorted and manipulated by the user. As well as standard keyword search, which can optionally be restricted to appearance within a specified XML entity, the system provides two features targeted towards two principal groups of users. Linguists often want to search for particular sound patterns (such as certain types of consonant clusters), and so the system allows regular expression matching for such expert users. On the other hand, we expect that many potential users will have limited literacy in Warlpiri, and will have problems in looking up words. In considerable part this is because the phonetic orthography of Warlpiri does not match very closely the (rather arcane) spelling rules of English, on which their literacy skills are usually based. To alleviate this problem, we have implemented a "fuzzy search" algorithm which attempts to find the intended word by using rules which capture common mistakes, sound confusions and alternative spellings (for example, Warlpiri is spelt as "Warlbiri" or "Walbiri" in some publications and catalogues).
The "fuzzy search" is implemented by taking the user's query string and substituting characters with regular expression 'character classes'. This "brute force" approach to the approximate spelling problem proved to be quite effective. The biggest time cost in using a regular expression is in defining it (this is when the expression is compiled into an automaton). Once defined, using it to match strings or perform substitutions is relatively efficient.
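A minimal sketch of this substitution approach is given below. The substitution rules are hypothetical and were chosen only so that the spellings "Warlbiri" and "Walbiri" mentioned above both find "Warlpiri"; they are not the rule set used in Kirrkirr, and the class `FuzzySearch` is an illustrative name. The pattern is compiled once per query and then reused for every headword, in line with the cost observation above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

final class FuzzySearch {
    // Hypothetical confusion rules: each character maps to a small regex
    // that also accepts its likely misspellings.
    private static final Map<Character, String> RULES = Map.of(
        'p', "[pb]", 'b', "[pb]",
        't', "[td]", 'd', "[td]",
        'k', "[kg]", 'g', "[kg]",
        'l', "r?l");   // "rl" is often written as plain "l"

    /** Turns a possibly misspelt query into a tolerant regular expression. */
    static Pattern toPattern(String query) {
        // Normalise "rl" to "l" so both spellings produce the same pattern.
        String normalised = query.toLowerCase().replace("rl", "l");
        StringBuilder regex = new StringBuilder();
        for (char c : normalised.toCharArray()) {
            String rule = RULES.get(c);
            regex.append(rule != null ? rule : Pattern.quote(String.valueOf(c)));
        }
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    /** Returns the headwords that match the query under the confusion rules. */
    static List<String> search(String query, List<String> headwords) {
        Pattern pattern = toPattern(query);   // compiled once, reused below
        List<String> hits = new ArrayList<>();
        for (String headword : headwords) {
            if (pattern.matcher(headword).matches()) {
                hits.add(headword);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> headwords = List.of("Warlpiri", "Kirrkirr");
        System.out.println(search("Walbiri", headwords));
    }
}
```

Characters without a rule are quoted literally, so a query containing regular expression metacharacters cannot change the meaning of the pattern.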
While the various aspects of electronic dictionaries that have been researched and incorporated into the Kirrkirr application cover a broad spectrum, there are many other interesting areas of dictionary processing, storage, and visualisation.
With the latest release of the Java programming language (Java 2), the Media API (Application Programming Interface) provides the functionality for playing MPEG videos and various other sound file formats from a standard Java program. This opens up many exciting prospects in the area of multi-media content in the dictionary. Warlpiri, like many Aboriginal languages, has an extensive sign language, and the dictionary already includes cross-references to a printed dictionary of Warlpiri signs. It would be valuable if videos of signs were available from within the program.
As discussed in Section 2, a great advantage in using XML for the storage of data is that it can be converted into any format simply by specifying rules in an XSL style sheet. At present the Kirrkirr program uses only one style sheet for converting the XML entries to HTML. As support for XSL in Java applications improves, it will be possible for an XSL definition interface to be provided to allow the user to define how they want dictionary entries formatted. This sort of online processing would also facilitate printing of customised data from the program (for example in postscript).
Speech technology is also becoming easier to incorporate into programs. The Java 2 release includes a Speech API which facilitates Java applications taking spoken input from the user via a microphone. This would be a particularly useful feature to incorporate into a Warlpiri dictionary application and would go even further in meeting the needs of users with limited literacy skills. If this sort of technology could be combined with speech synthesis software capable of producing Warlpiri sounds, every word in the application could be read out to the user. This includes headwords, examples and even program instructions.
The diversity of areas researched is rare relative to past work in electronic dictionaries, which often addresses the problems of storage, processing and visualisation/teaching as unrelated. Despite some significant research into the construction of lexical databases that go beyond the confined dimensions of their paper ancestors, there has been little attempt to see this work through to the point where it helps people such as language learners, who could truly benefit from a better interface to dictionary information.
We have addressed a broad range of issues relating to electronic dictionaries. The challenge of making dictionary information usable has been addressed in the creation of an application that exploits the strengths of the content and structure of the database to provide real users with a richer dictionary/vocabulary experience. We are currently testing the interface with various potential users (speakers, learners, schoolchildren), and expect that this feedback will lead to new directions in further development.
In creating the Kirrkirr application for use by Warlpiris with the Warlpiri dictionary, the system needed to be developed with usability being a priority. The system has attempted to reduce the importance of knowing the written form of the word before the application can be used, and to provide features such as an animated, clearly laid out network of words and their relationships, multimedia and hypertext aimed at making the system interesting and enjoyable to use. At the same time, features such as advanced search capabilities and note-taking make the system practical as a reference tool. Having designed the system to be highly customisable by the user, it is also highly extensible, allowing new modules to be incorporated with relative ease. While the focus of this research has been on Warlpiri, this research (and the software constructed) can be easily applied to other languages, including other Australian languages and English.
In a paper regarding the issues of Machine Readable Dictionaries (MRDs) and education Kegl concludes:
"... The best future applications of MRDs in education will be those most able to respond to the insights and needs of their users" (Kegl 1995)
This research can be viewed as the first step toward achieving this vision. It is hoped that the application developed will spur more efforts to harness the enormous potential of computers to improve not only dictionary storage, but also dictionary usage.
Goddard, C. and Thieberger, N., (1997), "Lexicographic Research on Australian Languages 1968-1993", Boundary Rider: Essays in Honour of Geoffrey O'Grady, pp 175-208, Pacific Linguistics
Huang, M., Eades, P. and Cohen, R. F., (1998), "WebOFDAV: Navigating and visualizing the Web on-line with animated context swapping", Proceedings of the 7th International World Wide Web Conference, pp. 638--642.
Jansz, K., (1998), "Intelligent processing, storage and visualisation of dictionary information", Computer Science Honours Thesis, Basser Department of Computer Science, University of Sydney [HREF10]
Kegl, J., (1995), "Machine-Readable Dictionaries and Education", Automating the Lexicon -- Research and Practice in a Multilingual Environment, Edited by: D. Walker, A. Zampolli and N. Calzolari, Oxford University Press Inc.
Laughren, M. and Nash, D., (1983), "Warlpiri Dictionary Project: aims, method, organisation and problems of definition", Papers in Australian Linguistics No. 15: Australian Aboriginal Lexicography pp 109-133, Pacific Linguistics
Miller, G., Beckwith, R., Fellbaum, C., Gross, R. and Miller, K., (1993), "Introduction to WordNet: An On-line Lexical Database" [HREF8]
Ooi, V. B. Y., (1998), "Computer corpus lexicography", Edinburgh: Edinburgh University Press.
Pirolli, P., Schank, P., Hearst, M. A. and Diehl, C., (1996), "Scatter/Gather Browsing Communicates the Topic Structure of a Very large Text Collection", Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems (CHI '96). [HREF9]
Sharpe, P., (1995), "Electronic Dictionaries with Particular Reference to an Electronic Bilingual Dictionary for English-speaking Learners of Japanese", International Journal of Lexicography, Vol. 8, No. 1, pp. 39-54
Weiner, E., (1994), "The Lexicographical Workstation and the Scholarly Dictionary", Computational Approaches to the Lexicon pp. 413-439, Oxford University Press
Kevin Jansz, Christopher Manning and Nitin Indurkhya, © 1999. The authors assign to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The authors also grant a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers and for the document to be published on mirrors on the World Wide Web.