Alembic Workbench User's Guide

7. Editing the Tagset Definition File

When the user saves a file after tagging it, the Workbench consults a tagset definition file to determine the priority with which to insert various coextensive and nested annotations into the SGML-encoded file. The tagset definition file contains an ordered listing of SGML generic identifiers. In this list, tags are entered from lowest to highest priority. For example, the more general Wall Street Journal (WSJ) tags that surround large extents of text are of lower priority than part-of-speech tags that are lexeme-based. This ordering ensures that if a word is tagged coextensively with a part-of-speech tag and a WSJ tag, the part-of-speech tag will be the innermost tag.
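The effect of this ordering can be illustrated with a minimal Python sketch. It is not the Workbench's own code: the function name `wrap_coextensive` and the identifiers `P` (standing in for a broad WSJ tag) and `LEX` (standing in for a part-of-speech tag) are assumptions made for illustration only.

```python
# Hypothetical priority list, entered from lowest to highest priority,
# mirroring the ordering described above. "P" and "LEX" are placeholder
# generic identifiers, not names taken from the real awb.tsd.
PRIORITY = ["P", "ENAMEX", "LEX"]


def wrap_coextensive(text, tags, priority):
    """Wrap `text` in coextensive tags so that the lowest-priority tag
    ends up outermost and the highest-priority tag innermost."""
    rank = {gi: i for i, gi in enumerate(priority)}
    # Sort from lowest to highest priority (outermost to innermost).
    ordered = sorted(tags, key=lambda gi: rank[gi])
    result = text
    # Apply the innermost (highest-priority) tag first, then work outward.
    for gi in reversed(ordered):
        result = f"<{gi}>{result}</{gi}>"
    return result


print(wrap_coextensive("Monday", ["LEX", "P"], PRIORITY))
```

In this sketch a tag missing from the priority list raises a `KeyError`; how the Workbench itself orders unlisted tags is not specified here.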

In addition, the tagset definition file classifies annotations by "family" (for example, the generic identifiers ENAMEX, NUMEX, and TIMEX belong to the Named Entity (NE) family) and organizes them in separate, tractable tag files. Any generic identifiers that are not explicitly listed in the tagset definition file are saved in an unknown tag file, which is created for miscellaneous, undefined tags. For example, a file that contains both WSJ and Named Entity (NE) markup, along with unknown markup, is maintained as six distinct files: the SGML-encoded file, the raw text file, the central Document Description file, and the WSJ.tag, NE.tag, and UNK.tag files.
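The routing of generic identifiers to tag files might be sketched as follows in Python. This is only an illustration under stated assumptions: the `FAMILIES` mapping, the function names, and the WSJ identifiers `P` and `S` are hypothetical, while the NE identifiers and the `.tag` file names follow the description above.

```python
from collections import defaultdict

# Hypothetical family table: abbreviation -> generic identifiers.
# The NE entries come from the text; the WSJ entries are placeholders.
FAMILIES = {
    "NE": ["ENAMEX", "NUMEX", "TIMEX"],
    "WSJ": ["P", "S"],
}


def tag_file_for(gi, families):
    """Return the tag file name for a generic identifier, falling back
    to UNK.tag when the identifier is not listed in any family."""
    for abbrev, identifiers in families.items():
        if gi in identifiers:
            return f"{abbrev}.tag"
    return "UNK.tag"


def group_by_tag_file(gis, families):
    """Group generic identifiers by the tag file they would be saved in."""
    groups = defaultdict(list)
    for gi in gis:
        groups[tag_file_for(gi, families)].append(gi)
    return dict(groups)


print(group_by_tag_file(["ENAMEX", "FOO", "TIMEX"], FAMILIES))
```

Here the unlisted identifier `FOO` is grouped under UNK.tag, matching the behavior described for undefined tags.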

By default, the Alembic Workbench consults the tagset definition file found in:
		$AWB/tagsetdefs/awb.tsd
As distributed, this file is set up to manage annotations from the WSJ, NE, and part-of-speech tagsets, among others. However, the user can define new tagsets and add them to the tagset definition file.

To add a new tagset to the tagset definition file:

1. Use a text editor to open the tagset definition file.

2. Designate an abbreviation that describes the tagset. For example, NE describes the Named Entity tagset. The Workbench uses this abbreviation as the name of the tag files it creates (for example, NE.tag).

3. List the generic identifiers that belong to the tagset. For example, the generic identifiers ENAMEX, TIMEX, and NUMEX are listed under the NE tagset.

4. Add the above information at the appropriate position in the file, keeping in mind that tags are ordered from lowest to highest priority. See the default tagset definition file, awb.tsd (at the location given above), for more detailed instructions.
