edu.stanford.nlp.trees
Interface TreebankLanguagePack

All Known Implementing Classes:
AbstractTreebankLanguagePack, ChineseTreebankLanguagePack

public interface TreebankLanguagePack

This interface specifies language/treebank specific information for a Treebank, which a parser might need to know.

Some of this is fixed for a (treebank,language) pair, but some of it reflects feature extraction decisions, so it can be sensible to have multiple implementations of this interface for the same (treebank,language) pair.

So far this covers punctuation, character encodings, and characters reserved for label annotations. It should probably be expanded to cover other stuff (unknown words?).

Various methods in this class return arrays. You should treat them as read-only, even though one cannot enforce that in Java.
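
As an illustration of the read-only convention, the following minimal sketch copies a returned array before sorting it. The class `ReadOnlyArraySketch`, its static `punctuationTags()` stand-in, and the tag values are all hypothetical and are not taken from any real language pack.

```java
import java.util.Arrays;

class ReadOnlyArraySketch {
    // Hypothetical stand-in for an array that a language pack might hand out.
    // The tag values are illustrative only.
    private static final String[] PUNCTUATION_TAGS = {"''", "``", ".", ":", ","};

    // Mimics a method that returns its internal array without copying it.
    public static String[] punctuationTags() {
        return PUNCTUATION_TAGS;
    }

    // Works on a private copy so the shared array is never modified.
    public static String[] sortedCopy(String[] shared) {
        String[] copy = shared.clone();
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        String[] shared = punctuationTags();
        String[] sorted = sortedCopy(shared);
        System.out.println("shared: " + Arrays.toString(shared));
        System.out.println("sorted: " + Arrays.toString(sorted));
    }
}
```

Cloning first means that other callers holding the same array continue to see it in its original order.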

Implementations of the test methods in this interface do not call basicCategory() on their arguments before testing them, so if needed, you should explicitly call basicCategory() yourself before passing arguments to these routines for testing.
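
A minimal sketch of why the explicit call matters is shown below. It does not use the real interface: `NormalizeBeforeTestSketch`, the start symbol "ROOT", the annotated label "ROOT-1", and the simplified one-character `basicCategory` are assumptions made for illustration.

```java
import java.util.Set;

class NormalizeBeforeTestSketch {
    // Hypothetical start symbol set; a real language pack defines its own.
    private static final Set<String> START_SYMBOLS = Set.of("ROOT");

    // Stand-in for a test method: it compares the label exactly as given.
    public static boolean isStartSymbol(String label) {
        return START_SYMBOLS.contains(label);
    }

    // Simplified stand-in for basicCategory(): cuts at the first '-'
    // that is not the first character of the label.
    public static String basicCategory(String label) {
        int cut = label.indexOf('-', 1);
        return cut < 0 ? label : label.substring(0, cut);
    }

    public static void main(String[] args) {
        String annotated = "ROOT-1"; // hypothetical annotated label
        // The raw label is not recognized because of its annotation suffix.
        System.out.println(isStartSymbol(annotated));
        // Normalizing first lets the test see the basic category.
        System.out.println(isStartSymbol(basicCategory(annotated)));
    }
}
```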

Author:
Christopher Manning

Field Summary
static String DEFAULT_ENCODING
          Use this as the default encoding for Readers and Writers of Treebank data.
 
Method Summary
 String basicCategory(String category)
          Returns the basic syntactic category of a String by truncating it at the first (non-word-initial) occurrence of one of the labelAnnotationIntroducingCharacters().
 Filter evalBIgnoredPunctuationTagAcceptFilter()
          Returns a filter that accepts a String that is a punctuation tag that should be ignored by EVALB-style evaluation, and rejects everything else.
 String[] evalBIgnoredPunctuationTags()
          Returns a String array of punctuation tags that EVALB-style evaluation should ignore for this treebank/language.
 String getEncoding()
          Return the charset encoding of the Treebank.
 Tokenizer getTokenizer()
          Return a tokenizer which might be suitable for tokenizing text that will be used with this Treebank/Language pair.
 boolean isEvalBIgnoredPunctuationTag(String str)
          Accepts a String that is a punctuation tag that should be ignored by EVALB-style evaluation, and rejects everything else.
 boolean isLabelAnnotationIntroducingCharacter(char ch)
          Say whether this character is an annotation introducing character.
 boolean isPunctuationTag(String str)
          Accepts a String that is a punctuation tag name, and rejects everything else.
 boolean isPunctuationWord(String str)
          Accepts a String that is a punctuation word, and rejects everything else.
 boolean isSentenceFinalPunctuationTag(String str)
          Accepts a String that is a sentence end punctuation tag, and rejects everything else.
 boolean isStartSymbol(String str)
          Accepts a String that is a start symbol of the treebank.
 char[] labelAnnotationIntroducingCharacters()
          Return an array of characters at which a String should be truncated to give the basic syntactic category of a label.
 Filter punctuationTagAcceptFilter()
          Return a filter that accepts a String that is a punctuation tag name, and rejects everything else.
 String[] punctuationTags()
          Returns a String array of punctuation tags for this treebank/language.
 Filter punctuationWordAcceptFilter()
          Returns a filter that accepts a String that is a punctuation word, and rejects everything else.
 String[] punctuationWords()
          Returns a String array of punctuation words for this treebank/language.
 Filter sentenceFinalPunctuationTagAcceptFilter()
          Returns a filter that accepts a String that is a sentence end punctuation tag, and rejects everything else.
 String[] sentenceFinalPunctuationTags()
          Returns a String array of sentence final punctuation tags for this treebank/language.
 String[] sentenceFinalPunctuationWords()
          Returns a String array of sentence final punctuation words for this treebank/language.
 String startSymbol()
          Returns a String which is the first (perhaps unique) start symbol of the treebank, or null if none is defined.
 Filter startSymbolAcceptFilter()
          Return a filter that accepts a String that is a start symbol of the treebank, and rejects everything else.
 String[] startSymbols()
          Returns a String array of treebank start symbols.
 

Field Detail

DEFAULT_ENCODING

public static final String DEFAULT_ENCODING
Use this as the default encoding for Readers and Writers of Treebank data.

See Also:
Constant Field Values

Method Detail

isPunctuationTag

public boolean isPunctuationTag(String str)
Accepts a String that is a punctuation tag name, and rejects everything else.

Returns:
Whether this is a punctuation tag

isPunctuationWord

public boolean isPunctuationWord(String str)
Accepts a String that is a punctuation word, and rejects everything else. If one can't tell for sure (as for ' in the Penn Treebank), it makes the best guess that it can.

Returns:
Whether this is a punctuation word

isSentenceFinalPunctuationTag

public boolean isSentenceFinalPunctuationTag(String str)
Accepts a String that is a sentence end punctuation tag, and rejects everything else.

Returns:
Whether this is a sentence final punctuation tag

isEvalBIgnoredPunctuationTag

public boolean isEvalBIgnoredPunctuationTag(String str)
Accepts a String that is a punctuation tag that should be ignored by EVALB-style evaluation, and rejects everything else. Traditionally, EVALB has ignored a subset of the total set of punctuation tags in the English Penn Treebank (quotes and period, comma, colon, etc., but not brackets).

Returns:
Whether this is an EVALB-ignored punctuation tag

punctuationTagAcceptFilter

public Filter punctuationTagAcceptFilter()
Return a filter that accepts a String that is a punctuation tag name, and rejects everything else.

Returns:
The filter

punctuationWordAcceptFilter

public Filter punctuationWordAcceptFilter()
Returns a filter that accepts a String that is a punctuation word, and rejects everything else. If one can't tell for sure (as for ' in the Penn Treebank), it makes the best guess that it can.

Returns:
The Filter

sentenceFinalPunctuationTagAcceptFilter

public Filter sentenceFinalPunctuationTagAcceptFilter()
Returns a filter that accepts a String that is a sentence end punctuation tag, and rejects everything else.

Returns:
The Filter

evalBIgnoredPunctuationTagAcceptFilter

public Filter evalBIgnoredPunctuationTagAcceptFilter()
Returns a filter that accepts a String that is a punctuation tag that should be ignored by EVALB-style evaluation, and rejects everything else. Traditionally, EVALB has ignored a subset of the total set of punctuation tags in the English Penn Treebank (quotes and period, comma, colon, etc., but not brackets).

Returns:
The Filter

punctuationTags

public String[] punctuationTags()
Returns a String array of punctuation tags for this treebank/language.

Returns:
The punctuation tags

punctuationWords

public String[] punctuationWords()
Returns a String array of punctuation words for this treebank/language.

Returns:
The punctuation words

sentenceFinalPunctuationTags

public String[] sentenceFinalPunctuationTags()
Returns a String array of sentence final punctuation tags for this treebank/language.

Returns:
The sentence final punctuation tags

sentenceFinalPunctuationWords

public String[] sentenceFinalPunctuationWords()
Returns a String array of sentence final punctuation words for this treebank/language.

Returns:
The sentence final punctuation words

evalBIgnoredPunctuationTags

public String[] evalBIgnoredPunctuationTags()
Returns a String array of punctuation tags that EVALB-style evaluation should ignore for this treebank/language. Traditionally, EVALB has ignored a subset of the total set of punctuation tags in the English Penn Treebank (quotes and period, comma, colon, etc., but not brackets).

Returns:
The EVALB-ignored punctuation tags

getEncoding

public String getEncoding()
Return the charset encoding of the Treebank. See documentation for the Charset class.

Returns:
Name of Charset
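
A minimal sketch of how the returned charset name might be used to open a Reader follows. `EncodingSketch` and its `getEncoding()` stand-in are hypothetical, and "UTF-8" is an assumed value chosen for the example, not the documented encoding of any particular treebank.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSketch {
    // Hypothetical stand-in for getEncoding(); the value is an assumption.
    public static String getEncoding() {
        return "UTF-8";
    }

    // Resolves the charset name and reads the first line of the stream with it.
    public static String readFirstLine(InputStream in, String encodingName) {
        Charset charset = Charset.forName(encodingName);
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // In-memory bytes stand in for a treebank file.
        byte[] data = "(ROOT (NP (NN caf\u00e9)))".getBytes(StandardCharsets.UTF_8);
        System.out.println(readFirstLine(new ByteArrayInputStream(data), getEncoding()));
    }
}
```

Passing the charset explicitly avoids depending on the platform default encoding when reading treebank data.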

getTokenizer

public Tokenizer getTokenizer()
Return a tokenizer which might be suitable for tokenizing text that will be used with this Treebank/Language pair.

Returns:
A tokenizer

labelAnnotationIntroducingCharacters

public char[] labelAnnotationIntroducingCharacters()
Return an array of characters at which a String should be truncated to give the basic syntactic category of a label. The idea here is that Penn Treebank style labels follow a syntactic category with various functional and cross-referencing information introduced by special characters (such as "NP-SBJ=1"). This would be truncated to "NP" by an array containing '-' and '='.
Note that such a character is never treated as a truncation point when it is the first character of a label (so these characters are okay as one-character tags, etc.), but only when it appears as a subsequent character.

Returns:
An array of characters that set off label name suffixes
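
The truncation rule described above can be sketched as follows. This is an illustration of the documented rule only, not the code of any actual implementing class; `BasicCategorySketch` and its two-argument `basicCategory` are hypothetical names.

```java
class BasicCategorySketch {
    // Truncates the label at the first annotation-introducing character
    // found at index 1 or later. Index 0 is skipped, so a label that
    // starts with such a character keeps its first character.
    public static String basicCategory(String label, char[] introducers) {
        for (int i = 1; i < label.length(); i++) {
            char ch = label.charAt(i);
            for (char introducer : introducers) {
                if (ch == introducer) {
                    return label.substring(0, i);
                }
            }
        }
        return label; // no annotation found
    }

    public static void main(String[] args) {
        char[] introducers = {'-', '='};
        System.out.println(basicCategory("NP-SBJ=1", introducers));
        System.out.println(basicCategory("-", introducers));
    }
}
```

With the array containing '-' and '=', "NP-SBJ=1" is reduced to "NP", while the one-character label "-" is returned unchanged.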

isLabelAnnotationIntroducingCharacter

public boolean isLabelAnnotationIntroducingCharacter(char ch)
Say whether this character is an annotation introducing character.


basicCategory

public String basicCategory(String category)
Returns the basic syntactic category of a String by truncating it at the first (non-word-initial) occurrence of one of the labelAnnotationIntroducingCharacters().

Parameters:
category - The whole String name of the label
Returns:
The basic category of the String

isStartSymbol

public boolean isStartSymbol(String str)
Accepts a String that is a start symbol of the treebank.

Returns:
Whether this is a start symbol

startSymbolAcceptFilter

public Filter startSymbolAcceptFilter()
Return a filter that accepts a String that is a start symbol of the treebank, and rejects everything else.

Returns:
The filter

startSymbols

public String[] startSymbols()
Returns a String array of treebank start symbols.

Returns:
The start symbols

startSymbol

public String startSymbol()
Returns a String which is the first (perhaps unique) start symbol of the treebank, or null if none is defined.

Returns:
The start symbol
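
One plausible way to derive this value from the start symbol array is sketched below. That startSymbol() is computed from startSymbols() is an assumption; `StartSymbolSketch`, `firstOrNull`, and the example symbols are hypothetical.

```java
class StartSymbolSketch {
    // Returns the first start symbol, or null when none is defined.
    public static String firstOrNull(String[] startSymbols) {
        if (startSymbols == null || startSymbols.length == 0) {
            return null;
        }
        return startSymbols[0];
    }

    public static void main(String[] args) {
        // Illustrative symbol names only.
        System.out.println(firstOrNull(new String[] {"ROOT", "TOP"}));
        System.out.println(firstOrNull(new String[0]));
    }
}
```

Callers should check the result for null before using it, since a treebank need not define a start symbol.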


Stanford NLP Group