This interface specifies language/treebank specific information for a Treebank, which a parser might need to know.
Some of this is fixed for a (treebank,language) pair, but some of it reflects feature extraction decisions, so it can be sensible to have multiple implementations of this interface for the same (treebank,language) pair.
So far this covers punctuation, character encodings, and characters reserved for label annotations. It should probably be expanded to cover other things (unknown words?).
Various methods in this interface return arrays. You should treat them as read-only, even though Java cannot enforce that.
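A minimal sketch of the read-only convention from the caller's side: copy a returned array before sorting or otherwise changing it. The class name, the tag values, and the static punctuationTags() stand-in are hypothetical and only mirror the shape of the array-returning methods documented here.

```java
import java.util.Arrays;

final class ReadOnlyArrayDemo {
    // Hypothetical internal table of an implementation (values are assumptions).
    private static final String[] PUNCTUATION_TAGS = {".", ",", ":"};

    // Stand-in with the same shape as punctuationTags(): hands out the shared array.
    static String[] punctuationTags() {
        return PUNCTUATION_TAGS;
    }

    // Caller-side helper: work on a private copy so the shared array stays untouched.
    static String[] sortedCopy(String[] shared) {
        String[] copy = shared.clone();
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        String[] sorted = sortedCopy(punctuationTags());
        System.out.println("sorted copy:  " + Arrays.toString(sorted));
        System.out.println("shared array: " + Arrays.toString(punctuationTags()));
    }
}
```

Cloning costs one small allocation per call site, which is usually negligible for tag lists of this size.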
Implementations of this interface do not call basicCategory() on arguments before testing them, so if needed, you should explicitly call basicCategory() yourself before passing arguments to these routines for testing.
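The following sketch illustrates why the explicit basicCategory() call matters. It is not taken from any real implementation: the tag list, the '^' annotation character, and the label ",^NP" are assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;

final class NormalizeBeforeTestDemo {
    // Hypothetical punctuation tags and annotation character (assumptions).
    private static final List<String> PUNCTUATION_TAGS = Arrays.asList(".", ",", ":");
    private static final char ANNOTATION_CHAR = '^';

    // Simplified stand-in for basicCategory(): cut at the first
    // non-initial occurrence of the annotation character.
    static String basicCategory(String label) {
        int cut = label.indexOf(ANNOTATION_CHAR, 1);
        return cut < 0 ? label : label.substring(0, cut);
    }

    // Stand-in for isPunctuationTag(): tests the argument exactly as given.
    static boolean isPunctuationTag(String str) {
        return PUNCTUATION_TAGS.contains(str);
    }

    public static void main(String[] args) {
        String annotated = ",^NP"; // a comma tag carrying a hypothetical annotation
        System.out.println(isPunctuationTag(annotated));
        System.out.println(isPunctuationTag(basicCategory(annotated)));
    }
}
```

In this sketch the annotated label fails the test until it is reduced to its basic category, which is the situation the paragraph above warns about.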
Field Summary | |
static String |
DEFAULT_ENCODING
Use this as the default encoding for Readers and Writers of Treebank data. |
Method Summary | |
String |
basicCategory(String category)
Returns the basic syntactic category of a String by truncating it at a (non-word-initial) occurrence of one of the labelAnnotationIntroducingCharacters(). |
Filter |
evalBIgnoredPunctuationTagAcceptFilter()
Returns a filter that accepts a String that is a punctuation tag that should be ignored by EVALB-style evaluation, and rejects everything else. |
String[] |
evalBIgnoredPunctuationTags()
Returns a String array of punctuation tags that EVALB-style evaluation should ignore for this treebank/language. |
String |
getEncoding()
Return the charset encoding of the Treebank. |
Tokenizer |
getTokenizer()
Return a tokenizer which might be suitable for tokenizing text that will be used with this Treebank/Language pair. |
boolean |
isEvalBIgnoredPunctuationTag(String str)
Accepts a String that is a punctuation tag that should be ignored by EVALB-style evaluation, and rejects everything else. |
boolean |
isLabelAnnotationIntroducingCharacter(char ch)
Say whether this character is an annotation introducing character. |
boolean |
isPunctuationTag(String str)
Accepts a String that is a punctuation tag name, and rejects everything else. |
boolean |
isPunctuationWord(String str)
Accepts a String that is a punctuation word, and rejects everything else. |
boolean |
isSentenceFinalPunctuationTag(String str)
Accepts a String that is a sentence end punctuation tag, and rejects everything else. |
boolean |
isStartSymbol(String str)
Accepts a String that is a start symbol of the treebank. |
char[] |
labelAnnotationIntroducingCharacters()
Return an array of characters at which a String should be truncated to give the basic syntactic category of a label. |
Filter |
punctuationTagAcceptFilter()
Return a filter that accepts a String that is a punctuation tag name, and rejects everything else. |
String[] |
punctuationTags()
Returns a String array of punctuation tags for this treebank/language. |
Filter |
punctuationWordAcceptFilter()
Returns a filter that accepts a String that is a punctuation word, and rejects everything else. |
String[] |
punctuationWords()
Returns a String array of punctuation words for this treebank/language. |
Filter |
sentenceFinalPunctuationTagAcceptFilter()
Returns a filter that accepts a String that is a sentence end punctuation tag, and rejects everything else. |
String[] |
sentenceFinalPunctuationTags()
Returns a String array of sentence final punctuation tags for this treebank/language. |
String[] |
sentenceFinalPunctuationWords()
Returns a String array of sentence final punctuation words for this treebank/language. |
String |
startSymbol()
Returns a String which is the first (perhaps unique) start symbol of the treebank, or null if none is defined. |
Filter |
startSymbolAcceptFilter()
Return a filter that accepts a String that is a start symbol of the treebank, and rejects everything else. |
String[] |
startSymbols()
Returns a String array of treebank start symbols. |
Field Detail |
public static final String DEFAULT_ENCODING
Method Detail |
public boolean isPunctuationTag(String str)
public boolean isPunctuationWord(String str)
public boolean isSentenceFinalPunctuationTag(String str)
public boolean isEvalBIgnoredPunctuationTag(String str)
public Filter punctuationTagAcceptFilter()
public Filter punctuationWordAcceptFilter()
public Filter sentenceFinalPunctuationTagAcceptFilter()
public Filter evalBIgnoredPunctuationTagAcceptFilter()
public String[] punctuationTags()
public String[] punctuationWords()
public String[] sentenceFinalPunctuationTags()
public String[] sentenceFinalPunctuationWords()
public String[] evalBIgnoredPunctuationTags()
public String getEncoding()
Return the charset encoding of the Treebank. See the documentation for the Charset class.
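As a hedged illustration of how an encoding name such as the one returned by getEncoding() might be used, the sketch below resolves the name with Charset.forName and reads bytes through an InputStreamReader. The "UTF-8" value and the sample tree text are assumptions; the actual encoding depends on the implementation.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

final class EncodingDemo {
    // Decode the first line of raw treebank bytes using a named charset.
    static String decodeFirstLine(byte[] data, String encodingName) {
        Charset charset = Charset.forName(encodingName);
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String encoding = "UTF-8"; // assumed stand-in for the value of getEncoding()
        byte[] data = "(ROOT (NP (NN caf\u00e9)))\n".getBytes(Charset.forName(encoding));
        System.out.println(decodeFirstLine(data, encoding));
    }
}
```

Passing the charset explicitly avoids relying on the platform default, which can differ between machines.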
public Tokenizer getTokenizer()
public char[] labelAnnotationIntroducingCharacters()
public boolean isLabelAnnotationIntroducingCharacter(char ch)
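public String basicCategory(String category)
Returns the basic syntactic category of a String by truncating it at a (non-word-initial) occurrence of one of the labelAnnotationIntroducingCharacters().
Parameters: category - The whole String name of the label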
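A simplified sketch of the truncation that basicCategory() describes. The annotation characters below are hypothetical, and the sketch assumes the cut happens at the first non-initial occurrence; it does nothing special for labels that both begin and end with an annotation character, which a real implementation may treat differently.

```java
final class BasicCategorySketch {
    // Hypothetical annotation-introducing characters (assumptions).
    private static final char[] ANNOTATION_CHARS = {'-', '=', '|'};

    // Mirrors isLabelAnnotationIntroducingCharacter(char).
    static boolean isAnnotationChar(char ch) {
        for (char c : ANNOTATION_CHARS) {
            if (c == ch) {
                return true;
            }
        }
        return false;
    }

    // Truncate at the first annotation character found at index 1 or later;
    // index 0 is skipped so a label-initial character is never cut.
    static String basicCategory(String category) {
        if (category == null) {
            return null;
        }
        for (int i = 1; i < category.length(); i++) {
            if (isAnnotationChar(category.charAt(i))) {
                return category.substring(0, i);
            }
        }
        return category;
    }

    public static void main(String[] args) {
        for (String label : new String[] {"NP-TMP", "NP=2", "VP", "-"}) {
            System.out.println(label + " -> " + basicCategory(label));
        }
    }
}
```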
public boolean isStartSymbol(String str)
public Filter startSymbolAcceptFilter()
public String[] startSymbols()
public String startSymbol()
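The start-symbol methods relate to one another in a simple way, which the sketch below illustrates under stated assumptions: startSymbol() returns the first entry of startSymbols(), or null when none is defined, and isStartSymbol() tests membership. The class and the "ROOT"/"TOP" values are hypothetical.

```java
final class StartSymbolSketch {
    // Start symbols supplied by the caller; the values used in main are assumptions.
    private final String[] startSymbols;

    StartSymbolSketch(String... startSymbols) {
        this.startSymbols = startSymbols.clone();
    }

    // All start symbols; callers should treat the array as read-only.
    String[] startSymbols() {
        return startSymbols;
    }

    // First start symbol, or null if none is defined.
    String startSymbol() {
        return startSymbols.length == 0 ? null : startSymbols[0];
    }

    // True only for an exact match against one of the start symbols.
    boolean isStartSymbol(String str) {
        for (String s : startSymbols) {
            if (s.equals(str)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        StartSymbolSketch pack = new StartSymbolSketch("ROOT", "TOP");
        System.out.println(pack.startSymbol());
        System.out.println(pack.isStartSymbol("TOP"));
        System.out.println(new StartSymbolSketch().startSymbol());
    }
}
```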