edu.stanford.nlp.trees
Interface TreebankLanguagePack

All Superinterfaces:
Serializable
All Known Implementing Classes:
AbstractTreebankLanguagePack, ArabicTreebankLanguagePack, ChineseTreebankLanguagePack, FrenchTreebankLanguagePack, HebrewTreebankLanguagePack, NegraPennLanguagePack, PennTreebankLanguagePack, TueBaDZLanguagePack

public interface TreebankLanguagePack
extends Serializable

This interface specifies language/treebank specific information for a Treebank, which a parser or other treebank user might need to know.

Some of this is fixed for a (treebank,language) pair, but some of it reflects feature extraction decisions, so it can be sensible to have multiple implementations of this interface for the same (treebank,language) pair.

So far this covers punctuation, character encodings, and characters reserved for label annotations. It should probably be expanded to cover other information (perhaps unknown words).

Various methods in this interface return arrays. Treat them as read-only, even though Java cannot enforce that.
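As a rough illustration of the read-only convention, the sketch below copies a returned array before changing it. The class `ReadOnlyArrayDemo`, its tag values, and the helper `sortedCopy` are hypothetical stand-ins and are not part of this API.

```java
import java.util.Arrays;

public class ReadOnlyArrayDemo {
    // Hypothetical shared array, standing in for what a language pack might return.
    private static final String[] PUNCTUATION_TAGS = {".", ",", ":"};

    // Stand-in for a method such as punctuationTags(); returns the shared array itself.
    public static String[] punctuationTags() {
        return PUNCTUATION_TAGS;
    }

    // Sort a private copy so that the shared array is never modified.
    public static String[] sortedCopy(String[] tags) {
        String[] copy = tags.clone();
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        String[] shared = punctuationTags();
        String[] mine = sortedCopy(shared);
        System.out.println(Arrays.toString(mine));   // sorted private copy
        System.out.println(Arrays.toString(shared)); // original order, untouched
    }
}
```

Cloning costs one small allocation per call site, which is usually acceptable for tag lists of this size.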

Implementations of this interface do not call basicCategory() on arguments before testing them, so if needed, call basicCategory() explicitly before passing arguments to these routines for testing.

This class should be able to be an immutable singleton. It contains data on various things, but no state. At some point we should make it a real immutable singleton.

Author:
Christopher Manning

Field Summary
static String DEFAULT_ENCODING
          Use this as the default encoding for Readers and Writers of Treebank data.
 
Method Summary
 String basicCategory(String category)
          Returns the basic syntactic category of a String by truncating stuff after a (non-word-initial) occurrence of one of the labelAnnotationIntroducingCharacters().
 String categoryAndFunction(String category)
          Returns the syntactic category and 'function' of a String.
 Filter<String> evalBIgnoredPunctuationTagAcceptFilter()
          Returns a filter that accepts a String that is a punctuation tag that should be ignored by EVALB-style evaluation, and rejects everything else.
 Filter<String> evalBIgnoredPunctuationTagRejectFilter()
          Returns a filter that accepts everything except a String that is a punctuation tag that should be ignored by EVALB-style evaluation.
 String[] evalBIgnoredPunctuationTags()
          Returns a String array of punctuation tags that EVALB-style evaluation should ignore for this treebank/language.
 Function<String,String> getBasicCategoryFunction()
          Returns a Function object that maps Strings to Strings according to this TreebankLanguagePack's basicCategory method.
 Function<String,String> getCategoryAndFunctionFunction()
          Returns a Function object that maps Strings to Strings according to this TreebankLanguagePack's categoryAndFunction method.
 String getEncoding()
          Return the charset encoding of the Treebank.
 TokenizerFactory<? extends HasWord> getTokenizerFactory()
          Return a tokenizer factory which might be suitable for tokenizing text that will be used with this Treebank/Language pair.
 GrammaticalStructureFactory grammaticalStructureFactory()
          Return a GrammaticalStructureFactory suitable for this language/treebank.
 GrammaticalStructureFactory grammaticalStructureFactory(Filter<String> puncFilter)
          Return a GrammaticalStructureFactory suitable for this language/treebank.
 HeadFinder headFinder()
          The HeadFinder to use for your treebank.
 boolean isEvalBIgnoredPunctuationTag(String str)
          Accepts a String that is a punctuation tag that should be ignored by EVALB-style evaluation, and rejects everything else.
 boolean isLabelAnnotationIntroducingCharacter(char ch)
          Say whether this character is an annotation introducing character.
 boolean isPunctuationTag(String str)
          Accepts a String that is a punctuation tag name, and rejects everything else.
 boolean isPunctuationWord(String str)
          Accepts a String that is a punctuation word, and rejects everything else.
 boolean isSentenceFinalPunctuationTag(String str)
          Accepts a String that is a sentence end punctuation tag, and rejects everything else.
 boolean isStartSymbol(String str)
          Accepts a String that is a start symbol of the treebank.
 char[] labelAnnotationIntroducingCharacters()
          Return an array of characters at which a String should be truncated to give the basic syntactic category of a label.
 Filter<String> punctuationTagAcceptFilter()
          Return a filter that accepts a String that is a punctuation tag name, and rejects everything else.
 Filter<String> punctuationTagRejectFilter()
          Return a filter that rejects a String that is a punctuation tag name, and accepts everything else.
 String[] punctuationTags()
          Returns a String array of punctuation tags for this treebank/language.
 Filter<String> punctuationWordAcceptFilter()
          Returns a filter that accepts a String that is a punctuation word, and rejects everything else.
 Filter<String> punctuationWordRejectFilter()
          Returns a filter that accepts a String that is not a punctuation word, and rejects punctuation.
 String[] punctuationWords()
          Returns a String array of punctuation words for this treebank/language.
 Filter<String> sentenceFinalPunctuationTagAcceptFilter()
          Returns a filter that accepts a String that is a sentence end punctuation tag, and rejects everything else.
 String[] sentenceFinalPunctuationTags()
          Returns a String array of sentence final punctuation tags for this treebank/language.
 String[] sentenceFinalPunctuationWords()
          Returns a String array of sentence final punctuation words for this treebank/language.
 void setGfCharacter(char gfCharacter)
          Sets the grammatical function indicating character to gfCharacter.
 String startSymbol()
          Returns a String which is the first (perhaps unique) start symbol of the treebank, or null if none is defined.
 Filter<String> startSymbolAcceptFilter()
          Return a filter that accepts a String that is a start symbol of the treebank, and rejects everything else.
 String[] startSymbols()
          Returns a String array of treebank start symbols.
 String stripGF(String category)
          Returns the category for a String with everything following the gf character (which may be language specific) stripped.
 String treebankFileExtension()
          Returns the extension of treebank files for this treebank.
 TreeReaderFactory treeReaderFactory()
          Returns a TreeReaderFactory suitable for general purpose use with this language/treebank.
 TokenizerFactory<Tree> treeTokenizerFactory()
          Return a TokenizerFactory for Trees of this language/treebank.
 

Field Detail

DEFAULT_ENCODING

static final String DEFAULT_ENCODING
Use this as the default encoding for Readers and Writers of Treebank data.

See Also:
Constant Field Values
Method Detail

isPunctuationTag

boolean isPunctuationTag(String str)
Accepts a String that is a punctuation tag name, and rejects everything else.

Parameters:
str - The string to check
Returns:
Whether this is a punctuation tag

isPunctuationWord

boolean isPunctuationWord(String str)
Accepts a String that is a punctuation word, and rejects everything else. If one can't tell for sure (as for ' in the Penn Treebank), it makes the best guess that it can.

Parameters:
str - The string to check
Returns:
Whether this is a punctuation word

isSentenceFinalPunctuationTag

boolean isSentenceFinalPunctuationTag(String str)
Accepts a String that is a sentence end punctuation tag, and rejects everything else.

Parameters:
str - The string to check
Returns:
Whether this is a sentence final punctuation tag

isEvalBIgnoredPunctuationTag

boolean isEvalBIgnoredPunctuationTag(String str)
Accepts a String that is a punctuation tag that should be ignored by EVALB-style evaluation, and rejects everything else. Traditionally, EVALB has ignored a subset of the total set of punctuation tags in the English Penn Treebank (quotes and period, comma, colon, etc., but not brackets).

Parameters:
str - The string to check
Returns:
Whether this is an EVALB-ignored punctuation tag

punctuationTagAcceptFilter

Filter<String> punctuationTagAcceptFilter()
Return a filter that accepts a String that is a punctuation tag name, and rejects everything else.

Returns:
The filter

punctuationTagRejectFilter

Filter<String> punctuationTagRejectFilter()
Return a filter that rejects a String that is a punctuation tag name, and accepts everything else.

Returns:
The filter
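The accept and reject filters are complements of each other. The sketch below shows that relationship using `java.util.function.Predicate` as a stand-in for `Filter<String>`; the class name, the tag set, and the helper `keep` are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class PunctuationFilterSketch {
    // Assumed punctuation tag set; a real language pack defines its own.
    private static final Set<String> TAGS =
            new HashSet<>(Arrays.asList(",", ".", ":", "``", "''"));

    // Accepts punctuation tag names and rejects everything else.
    public static Predicate<String> acceptFilter() {
        return TAGS::contains;
    }

    // Rejects punctuation tag names and accepts everything else.
    public static Predicate<String> rejectFilter() {
        return acceptFilter().negate();
    }

    // Keep only the tags that the given filter accepts.
    public static List<String> keep(List<String> tags, Predicate<String> filter) {
        return tags.stream().filter(filter).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tags = Arrays.asList("NN", ",", "VBD", ".");
        System.out.println(keep(tags, rejectFilter())); // [NN, VBD]
        System.out.println(keep(tags, acceptFilter())); // [,, .]
    }
}
```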

punctuationWordAcceptFilter

Filter<String> punctuationWordAcceptFilter()
Returns a filter that accepts a String that is a punctuation word, and rejects everything else. If one can't tell for sure (as for ' in the Penn Treebank), it makes the best guess that it can.

Returns:
The Filter

punctuationWordRejectFilter

Filter<String> punctuationWordRejectFilter()
Returns a filter that accepts a String that is not a punctuation word, and rejects punctuation. If one can't tell for sure (as for ' in the Penn Treebank), it makes the best guess that it can.

Returns:
The Filter

sentenceFinalPunctuationTagAcceptFilter

Filter<String> sentenceFinalPunctuationTagAcceptFilter()
Returns a filter that accepts a String that is a sentence end punctuation tag, and rejects everything else.

Returns:
The Filter

evalBIgnoredPunctuationTagAcceptFilter

Filter<String> evalBIgnoredPunctuationTagAcceptFilter()
Returns a filter that accepts a String that is a punctuation tag that should be ignored by EVALB-style evaluation, and rejects everything else. Traditionally, EVALB has ignored a subset of the total set of punctuation tags in the English Penn Treebank (quotes and period, comma, colon, etc., but not brackets).

Returns:
The Filter

evalBIgnoredPunctuationTagRejectFilter

Filter<String> evalBIgnoredPunctuationTagRejectFilter()
Returns a filter that accepts everything except a String that is a punctuation tag that should be ignored by EVALB-style evaluation. Traditionally, EVALB has ignored a subset of the total set of punctuation tags in the English Penn Treebank (quotes and period, comma, colon, etc., but not brackets).

Returns:
The Filter

punctuationTags

String[] punctuationTags()
Returns a String array of punctuation tags for this treebank/language.

Returns:
The punctuation tags

punctuationWords

String[] punctuationWords()
Returns a String array of punctuation words for this treebank/language.

Returns:
The punctuation words

sentenceFinalPunctuationTags

String[] sentenceFinalPunctuationTags()
Returns a String array of sentence final punctuation tags for this treebank/language. The first in the list is assumed to be the most basic one.

Returns:
The sentence final punctuation tags

sentenceFinalPunctuationWords

String[] sentenceFinalPunctuationWords()
Returns a String array of sentence final punctuation words for this treebank/language.

Returns:
The sentence final punctuation words

evalBIgnoredPunctuationTags

String[] evalBIgnoredPunctuationTags()
Returns a String array of punctuation tags that EVALB-style evaluation should ignore for this treebank/language. Traditionally, EVALB has ignored a subset of the total set of punctuation tags in the English Penn Treebank (quotes and period, comma, colon, etc., but not brackets).

Returns:
The EVALB-ignored punctuation tags

grammaticalStructureFactory

GrammaticalStructureFactory grammaticalStructureFactory()
Return a GrammaticalStructureFactory suitable for this language/treebank.

Returns:
A GrammaticalStructureFactory suitable for this language/treebank

grammaticalStructureFactory

GrammaticalStructureFactory grammaticalStructureFactory(Filter<String> puncFilter)
Return a GrammaticalStructureFactory suitable for this language/treebank.

Parameters:
puncFilter - A filter which should reject punctuation words (as Strings)
Returns:
A GrammaticalStructureFactory suitable for this language/treebank

getEncoding

String getEncoding()
Return the charset encoding of the Treebank. See documentation for the Charset class.

Returns:
Name of Charset
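Since the return value is a charset name, it can be resolved with `Charset.forName` before building a Reader or decoding bytes. The sketch below assumes the name has already been obtained from a language pack; the class `EncodingSketch` and its methods are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

public class EncodingSketch {
    // Decode raw treebank bytes using a charset name such as "UTF-8" or "ISO-8859-1".
    public static String decode(byte[] data, String encodingName) {
        return new String(data, Charset.forName(encodingName));
    }

    // Wrap a byte stream in a Reader that uses the named charset.
    public static Reader openReader(InputStream in, String encodingName) {
        return new InputStreamReader(in, Charset.forName(encodingName));
    }

    public static void main(String[] args) throws Exception {
        byte[] latin1 = {(byte) 0xE9};                // e-acute in ISO-8859-1
        System.out.println(decode(latin1, "ISO-8859-1").equals("\u00e9"));
        try (Reader r = openReader(new ByteArrayInputStream(latin1), "ISO-8859-1")) {
            System.out.println(r.read() == 0xE9);     // one character, code point U+00E9
        }
    }
}
```

`Charset.forName` throws an unchecked exception for an unknown or illegal name, so a misconfigured encoding fails early rather than producing garbled text.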

getTokenizerFactory

TokenizerFactory<? extends HasWord> getTokenizerFactory()
Return a tokenizer factory which might be suitable for tokenizing text that will be used with this Treebank/Language pair. This is for real text of this language pair, not for reading stuff inside the treebank files.

Returns:
A tokenizer factory

labelAnnotationIntroducingCharacters

char[] labelAnnotationIntroducingCharacters()
Return an array of characters at which a String should be truncated to give the basic syntactic category of a label. The idea here is that Penn Treebank style labels follow a syntactic category with various functional and cross-referencing information introduced by special characters (such as "NP-SBJ=1"). This would be truncated to "NP" by the array containing '-' and '='.
Note that these characters never trigger truncation when they are the first character of a label (so they are okay as one-character tags, etc.), but only when they occur as subsequent characters.

Returns:
An array of characters that set off label name suffixes

isLabelAnnotationIntroducingCharacter

boolean isLabelAnnotationIntroducingCharacter(char ch)
Say whether this character is an annotation introducing character.

Parameters:
ch - A char
Returns:
Whether this char introduces functional annotations

basicCategory

String basicCategory(String category)
Returns the basic syntactic category of a String by truncating stuff after a (non-word-initial) occurrence of one of the labelAnnotationIntroducingCharacters(). This function should work on phrasal category and POS tag labels, but needn't (and couldn't be expected to) work on arbitrary Word strings.

Parameters:
category - The whole String name of the label
Returns:
The basic category of the String
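The truncation rule can be sketched as follows. This is a simplified illustration, not the implementation used by any class listed above: it assumes the annotation characters are '-' and '=', and the class and method names are hypothetical.

```java
public class BasicCategorySketch {
    // Assumed annotation-introducing characters (Penn Treebank style).
    private static final char[] ANNOTATION_CHARS = {'-', '='};

    public static boolean isAnnotationChar(char ch) {
        for (char c : ANNOTATION_CHARS) {
            if (c == ch) {
                return true;
            }
        }
        return false;
    }

    // Truncate at the first annotation character that is not the label's first character.
    public static String basicCategory(String category) {
        if (category == null) {
            return null;
        }
        for (int i = 1; i < category.length(); i++) {
            if (isAnnotationChar(category.charAt(i))) {
                return category.substring(0, i);
            }
        }
        return category;
    }

    public static void main(String[] args) {
        System.out.println(basicCategory("NP-SBJ=1")); // NP
        System.out.println(basicCategory("-"));        // - (a one-character tag is kept)
        System.out.println(basicCategory("S=2"));      // S
    }
}
```

This sketch makes no special provision for labels that both start and end with an annotation character: it would shorten "-NONE-" to "-NONE". A real implementation may treat such labels differently.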

stripGF

String stripGF(String category)
Returns the category for a String with everything following the gf character (which may be language specific) stripped.

Parameters:
category - The String name of the label (may previously have had basic category called on it)
Returns:
The String stripped of grammatical functions

getBasicCategoryFunction

Function<String,String> getBasicCategoryFunction()
Returns a Function object that maps Strings to Strings according to this TreebankLanguagePack's basicCategory method.

Returns:
the String->String Function object

categoryAndFunction

String categoryAndFunction(String category)
Returns the syntactic category and 'function' of a String. This normally involves truncating numerical coindexation showing coreference, etc. By 'function', this means keeping, say, Penn Treebank functional tags or ICE phrasal functions, perhaps returning them as category-function.

Parameters:
category - The whole String name of the label
Returns:
A String giving the category and function
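One way to picture this is to strip trailing numeric indices while keeping functional tags. The sketch below is an assumption about what such a method might do, not the behaviour of any particular implementation. It treats '-' and '=' as the annotation characters, and the class name is hypothetical.

```java
public class CategoryAndFunctionSketch {
    // Remove trailing numeric coindexation such as "-1" or "=2", keeping functional tags.
    // The first character is never touched, so one-character labels survive intact.
    public static String categoryAndFunction(String category) {
        if (category == null || category.isEmpty()) {
            return category;
        }
        String rest = category.substring(1).replaceAll("(?:[-=]\\d+)+$", "");
        return category.charAt(0) + rest;
    }

    public static void main(String[] args) {
        System.out.println(categoryAndFunction("NP-SBJ=1")); // NP-SBJ
        System.out.println(categoryAndFunction("WHNP-1"));   // WHNP
        System.out.println(categoryAndFunction("NP-SBJ"));   // NP-SBJ
    }
}
```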

getCategoryAndFunctionFunction

Function<String,String> getCategoryAndFunctionFunction()
Returns a Function object that maps Strings to Strings according to this TreebankLanguagePack's categoryAndFunction method.

Returns:
the String->String Function object

isStartSymbol

boolean isStartSymbol(String str)
Accepts a String that is a start symbol of the treebank.

Parameters:
str - The String to test
Returns:
Whether this is a start symbol

startSymbolAcceptFilter

Filter<String> startSymbolAcceptFilter()
Return a filter that accepts a String that is a start symbol of the treebank, and rejects everything else.

Returns:
The filter

startSymbols

String[] startSymbols()
Returns a String array of treebank start symbols.

Returns:
The start symbols

startSymbol

String startSymbol()
Returns a String which is the first (perhaps unique) start symbol of the treebank, or null if none is defined.

Returns:
The start symbol
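The relationship between startSymbols(), startSymbol(), and isStartSymbol() might look like the sketch below. The class name and the sample symbols "ROOT" and "TOP" are illustrative assumptions, and the array is passed in explicitly to keep the example self-contained.

```java
public class StartSymbolSketch {
    // First start symbol, or null if none is defined.
    public static String startSymbol(String[] startSymbols) {
        if (startSymbols == null || startSymbols.length == 0) {
            return null;
        }
        return startSymbols[0];
    }

    // Whether str is one of the given start symbols.
    public static boolean isStartSymbol(String[] startSymbols, String str) {
        if (startSymbols == null || str == null) {
            return false;
        }
        for (String symbol : startSymbols) {
            if (symbol.equals(str)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] symbols = {"ROOT", "TOP"};                // assumed sample values
        System.out.println(startSymbol(symbols));          // ROOT
        System.out.println(startSymbol(new String[0]));    // null
        System.out.println(isStartSymbol(symbols, "TOP")); // true
    }
}
```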

treebankFileExtension

String treebankFileExtension()
Returns the extension of treebank files for this treebank. This should be passed as an argument to Treebank loading classes. It might be "mrg" or "fid", for example. Don't include the period.

Returns:
the extension on files for this treebank
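Because the extension is returned without the period, code that matches file names has to add it back. The sketch below builds a standard `java.io.FileFilter` from such an extension; the class name and the sample file names are hypothetical.

```java
import java.io.File;
import java.io.FileFilter;

public class TreebankExtensionSketch {
    // Build a filter from an extension given without its leading period, e.g. "mrg".
    public static FileFilter filterFor(String extension) {
        String suffix = "." + extension;
        return file -> file.getName().endsWith(suffix);
    }

    public static void main(String[] args) {
        FileFilter mrg = filterFor("mrg");
        System.out.println(mrg.accept(new File("wsj_0001.mrg"))); // true
        System.out.println(mrg.accept(new File("wsj_0001.txt"))); // false
    }
}
```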

setGfCharacter

void setGfCharacter(char gfCharacter)
Sets the grammatical function indicating character to gfCharacter.

Parameters:
gfCharacter - The character in label names that sets off grammatical function marking (from the phrase label).

treeReaderFactory

TreeReaderFactory treeReaderFactory()
Returns a TreeReaderFactory suitable for general purpose use with this language/treebank.

Returns:
A TreeReaderFactory suitable for general purpose use with this language/treebank.

treeTokenizerFactory

TokenizerFactory<Tree> treeTokenizerFactory()
Return a TokenizerFactory for Trees of this language/treebank.

Returns:
A TokenizerFactory for Trees of this language/treebank.

headFinder

HeadFinder headFinder()
The HeadFinder to use for your treebank.

Returns:
A suitable HeadFinder


Stanford NLP Group