public interface TreebankLanguagePack extends Serializable
Some of the information specified by this interface is fixed for a (treebank, language) pair, but some of it reflects feature extraction decisions, so it can be sensible to have multiple implementations of this interface for the same (treebank, language) pair.
So far this covers punctuation, character encodings, and characters reserved for label annotations. It should probably be expanded to cover other stuff (unknown words?).
Various methods in this class return arrays. You should treat them as read-only, even though one cannot enforce that in Java.
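The read-only convention above can be illustrated with a minimal sketch. It assumes that a returned array may be shared with the implementation, so a caller clones it before reordering or modifying it. The class name, the tag values, and the helper sortedCopy are hypothetical and not part of this interface.

```java
import java.util.Arrays;

class ReadOnlyArraySketch {

    // Hypothetical stand-in for a language pack's internal tag table.
    private static final String[] PUNCTUATION_TAGS = {".", ",", ":"};

    // Assumption: like the array-returning methods described above, this
    // hands out the internal array itself rather than a copy.
    public static String[] punctuationTags() {
        return PUNCTUATION_TAGS;
    }

    // Callers that need to reorder or modify the data work on a clone,
    // leaving the shared array untouched.
    public static String[] sortedCopy(String[] shared) {
        String[] copy = shared.clone();
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        String[] sorted = sortedCopy(punctuationTags());
        System.out.println(Arrays.toString(sorted));
        System.out.println(Arrays.toString(punctuationTags()));
    }
}
```

Cloning costs one small allocation per call, which is usually acceptable for tag tables of this size.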
Implementations of this interface do not call basicCategory() on arguments before testing them, so if needed, you should explicitly call basicCategory() yourself before passing arguments to these routines for testing.
This class should be able to be an immutable singleton: it contains data on various things, but no state. At some point we should make it a real immutable singleton.
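The advice about calling basicCategory() before a test method can be sketched as follows. This is a simplified, self-contained stand-in rather than the library's implementation: the annotation characters ('-', '=', '^') and the punctuation tags are assumptions chosen for illustration, and the truncation rule is reduced to "cut at the first annotation character that is not at index 0".

```java
import java.util.Set;

class BasicCategorySketch {

    // Assumed annotation-introducing characters; a real language pack defines its own.
    private static final char[] ANNOTATION_CHARS = {'-', '=', '^'};

    // Assumed punctuation tags, for illustration only.
    private static final Set<String> PUNCTUATION_TAGS = Set.of(".", ",", ":");

    public static boolean isAnnotationChar(char ch) {
        for (char c : ANNOTATION_CHARS) {
            if (c == ch) {
                return true;
            }
        }
        return false;
    }

    // Simplified rule: truncate at the first annotation character that is
    // not the first character of the label.
    public static String basicCategory(String category) {
        for (int i = 1; i < category.length(); i++) {
            if (isAnnotationChar(category.charAt(i))) {
                return category.substring(0, i);
            }
        }
        return category;
    }

    // Like the test methods described above, this does not normalize its argument.
    public static boolean isPunctuationTag(String tag) {
        return PUNCTUATION_TAGS.contains(tag);
    }

    public static void main(String[] args) {
        String annotated = ".^S";
        System.out.println(isPunctuationTag(annotated));                // false
        System.out.println(isPunctuationTag(basicCategory(annotated))); // true
    }
}
```

The annotated label fails the raw test and passes only after normalization, which is why the caller is responsible for the basicCategory() call.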
Modifier and Type | Field and Description |
---|---|
static String | DEFAULT_ENCODING: Use this as the default encoding for Readers and Writers of Treebank data. |
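As a rough illustration of using an encoding name with a Reader, the sketch below decodes bytes through an InputStreamReader built from a charset name. The value "UTF-8" for DEFAULT_ENCODING is an assumption made for this example, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSketch {

    // Assumed value for illustration; the actual constant is defined by the library.
    static final String DEFAULT_ENCODING = "UTF-8";

    // Decodes the given bytes by reading them through a Reader that uses
    // the named charset.
    public static String readAll(byte[] data, String encoding) {
        Charset charset = Charset.forName(encoding);
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            StringBuilder sb = new StringBuilder();
            int ch;
            while ((ch = reader.read()) != -1) {
                sb.append((char) ch);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "(NP (NN caf\u00e9))".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(data, DEFAULT_ENCODING));
    }
}
```

Passing the encoding explicitly avoids depending on the platform default charset, which can differ between machines.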
Modifier and Type | Method and Description |
---|---|
String | basicCategory(String category): Returns the basic syntactic category of a String by truncating stuff after a (non-word-initial) occurrence of one of the labelAnnotationIntroducingCharacters(). |
String | categoryAndFunction(String category): Returns the syntactic category and 'function' of a String. |
java.util.function.Predicate<String> | evalBIgnoredPunctuationTagAcceptFilter(): Returns a filter that accepts a String that is a punctuation tag that should be ignored by EVALB-style evaluation, and rejects everything else. |
java.util.function.Predicate<String> | evalBIgnoredPunctuationTagRejectFilter(): Returns a filter that accepts everything except a String that is a punctuation tag that should be ignored by EVALB-style evaluation. |
String[] | evalBIgnoredPunctuationTags(): Returns a String array of punctuation tags that EVALB-style evaluation should ignore for this treebank/language. |
java.util.function.Function<String,String> | getBasicCategoryFunction(): Returns a Function object that maps Strings to Strings according to this TreebankLanguagePack's basicCategory method. |
java.util.function.Function<String,String> | getCategoryAndFunctionFunction(): Returns a Function object that maps Strings to Strings according to this TreebankLanguagePack's categoryAndFunction method. |
String | getEncoding(): Return the charset encoding of the Treebank. |
TokenizerFactory<? extends HasWord> | getTokenizerFactory(): Return a tokenizer factory which might be suitable for tokenizing text that will be used with this Treebank/Language pair. |
GrammaticalStructureFactory | grammaticalStructureFactory(): Return a GrammaticalStructureFactory suitable for this language/treebank. |
GrammaticalStructureFactory | grammaticalStructureFactory(java.util.function.Predicate<String> puncFilter): Return a GrammaticalStructureFactory suitable for this language/treebank. |
GrammaticalStructureFactory | grammaticalStructureFactory(java.util.function.Predicate<String> puncFilter, HeadFinder typedDependencyHF): Return a GrammaticalStructureFactory suitable for this language/treebank. |
HeadFinder | headFinder(): The HeadFinder to use for your treebank. |
boolean | isEvalBIgnoredPunctuationTag(String str): Accepts a String that is a punctuation tag that should be ignored by EVALB-style evaluation, and rejects everything else. |
boolean | isLabelAnnotationIntroducingCharacter(char ch): Say whether this character is an annotation introducing character. |
boolean | isPunctuationTag(String str): Accepts a String that is a punctuation tag name, and rejects everything else. |
boolean | isPunctuationWord(String str): Accepts a String that is a punctuation word, and rejects everything else. |
boolean | isSentenceFinalPunctuationTag(String str): Accepts a String that is a sentence end punctuation tag, and rejects everything else. |
boolean | isStartSymbol(String str): Accepts a String that is a start symbol of the treebank. |
char[] | labelAnnotationIntroducingCharacters(): Return an array of characters at which a String should be truncated to give the basic syntactic category of a label. |
MorphoFeatureSpecification | morphFeatureSpec(): The morphological feature specification for the language. |
java.util.function.Predicate<String> | punctuationTagAcceptFilter(): Return a filter that accepts a String that is a punctuation tag name, and rejects everything else. |
java.util.function.Predicate<String> | punctuationTagRejectFilter(): Return a filter that rejects a String that is a punctuation tag name, and accepts everything else. |
String[] | punctuationTags(): Returns a String array of punctuation tags for this treebank/language. |
java.util.function.Predicate<String> | punctuationWordAcceptFilter(): Returns a filter that accepts a String that is a punctuation word, and rejects everything else. |
java.util.function.Predicate<String> | punctuationWordRejectFilter(): Returns a filter that accepts a String that is not a punctuation word, and rejects punctuation. |
String[] | punctuationWords(): Returns a String array of punctuation words for this treebank/language. |
java.util.function.Predicate<String> | sentenceFinalPunctuationTagAcceptFilter(): Returns a filter that accepts a String that is a sentence end punctuation tag, and rejects everything else. |
String[] | sentenceFinalPunctuationTags(): Returns a String array of sentence final punctuation tags for this treebank/language. |
String[] | sentenceFinalPunctuationWords(): Returns a String array of sentence final punctuation words for this treebank/language. |
void | setGfCharacter(char gfCharacter): Sets the grammatical function indicating character to gfCharacter. |
String | startSymbol(): Returns a String which is the first (perhaps unique) start symbol of the treebank, or null if none is defined. |
java.util.function.Predicate<String> | startSymbolAcceptFilter(): Return a filter that accepts a String that is a start symbol of the treebank, and rejects everything else. |
String[] | startSymbols(): Returns a String array of treebank start symbols. |
String | stripGF(String category): Returns the category for a String with everything following the gf character (which may be language specific) stripped. |
boolean | supportsGrammaticalStructures(): Whether or not we have typed dependencies for this language. |
String | treebankFileExtension(): Returns the extension of treebank files for this treebank. |
TreeReaderFactory | treeReaderFactory(): Returns a TreeReaderFactory suitable for general purpose use with this language/treebank. |
TokenizerFactory<Tree> | treeTokenizerFactory(): Return a TokenizerFactory for Trees of this language/treebank. |
HeadFinder | typedDependencyHeadFinder(): The HeadFinder to use when making typed dependencies. |
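Several methods in the summary above come in accept/reject pairs that return a java.util.function.Predicate<String>. The sketch below shows one way such a pair could relate, with the reject filter derived from the accept filter through Predicate.negate(). The tag set, the class, and the helper keep are hypothetical; real implementations may build their filters differently.

```java
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class PunctuationFilterSketch {

    // Assumed punctuation tags, for illustration only.
    private static final Set<String> PUNCTUATION_TAGS = Set.of(".", ",", ":");

    // Accepts punctuation tags and rejects everything else.
    public static Predicate<String> punctuationTagAcceptFilter() {
        return PUNCTUATION_TAGS::contains;
    }

    // Rejects punctuation tags and accepts everything else.
    public static Predicate<String> punctuationTagRejectFilter() {
        return punctuationTagAcceptFilter().negate();
    }

    // Keeps only the tags that the given filter accepts, preserving order.
    public static List<String> keep(List<String> tags, Predicate<String> filter) {
        return tags.stream().filter(filter).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tags = List.of("DT", "NN", ",", "VBZ", ".");
        System.out.println(keep(tags, punctuationTagAcceptFilter())); // [,, .]
        System.out.println(keep(tags, punctuationTagRejectFilter())); // [DT, NN, VBZ]
    }
}
```

Because the filters are plain Predicate<String> values, they compose with and(), or(), and negate() and can be passed directly to stream operations.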
static final String DEFAULT_ENCODING
boolean isPunctuationTag(String str)
str - The string to check
boolean isPunctuationWord(String str)
str - The string to check
boolean isSentenceFinalPunctuationTag(String str)
str - The string to check
boolean isEvalBIgnoredPunctuationTag(String str)
str - The string to check
java.util.function.Predicate<String> punctuationTagAcceptFilter()
java.util.function.Predicate<String> punctuationTagRejectFilter()
java.util.function.Predicate<String> punctuationWordAcceptFilter()
java.util.function.Predicate<String> punctuationWordRejectFilter()
java.util.function.Predicate<String> sentenceFinalPunctuationTagAcceptFilter()
java.util.function.Predicate<String> evalBIgnoredPunctuationTagAcceptFilter()
java.util.function.Predicate<String> evalBIgnoredPunctuationTagRejectFilter()
String[] punctuationTags()
String[] punctuationWords()
String[] sentenceFinalPunctuationTags()
String[] sentenceFinalPunctuationWords()
String[] evalBIgnoredPunctuationTags()
GrammaticalStructureFactory grammaticalStructureFactory()
GrammaticalStructureFactory grammaticalStructureFactory(java.util.function.Predicate<String> puncFilter)
puncFilter - A filter which should reject punctuation words (as Strings)
GrammaticalStructureFactory grammaticalStructureFactory(java.util.function.Predicate<String> puncFilter, HeadFinder typedDependencyHF)
puncFilter - A filter which should reject punctuation words (as Strings)
typedDependencyHF - A HeadFinder which finds heads for typed dependencies
boolean supportsGrammaticalStructures()
String getEncoding()
Return the charset encoding of the Treebank. See the Charset class.
TokenizerFactory<? extends HasWord> getTokenizerFactory()
char[] labelAnnotationIntroducingCharacters()
boolean isLabelAnnotationIntroducingCharacter(char ch)
ch - A char
String basicCategory(String category)
Returns the basic syntactic category of a String by truncating everything after a (non-word-initial) occurrence of one of the labelAnnotationIntroducingCharacters(). This function should work on phrasal category and POS tag labels, but needn't (and couldn't be expected to) work on arbitrary Word strings.
category - The whole String name of the label
String stripGF(String category)
category - The String name of the label (basicCategory() may already have been called on it)
java.util.function.Function<String,String> getBasicCategoryFunction()
Returns a Function object that maps Strings to Strings according to this TreebankLanguagePack's basicCategory method.
String categoryAndFunction(String category)
Returns the syntactic category and 'function' of a String, possibly in the form category-function.
category - The whole String name of the label
java.util.function.Function<String,String> getCategoryAndFunctionFunction()
Returns a Function object that maps Strings to Strings according to this TreebankLanguagePack's categoryAndFunction method.
boolean isStartSymbol(String str)
str - The String to test
java.util.function.Predicate<String> startSymbolAcceptFilter()
String[] startSymbols()
String startSymbol()
String treebankFileExtension()
void setGfCharacter(char gfCharacter)
gfCharacter - The character in label names that sets off grammatical function marking (from the phrase label).
TreeReaderFactory treeReaderFactory()
TokenizerFactory<Tree> treeTokenizerFactory()
HeadFinder headFinder()
HeadFinder typedDependencyHeadFinder()
MorphoFeatureSpecification morphFeatureSpec()
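As a closing illustration of how setGfCharacter(char) and stripGF(String) interact, here is a minimal sketch. It is not the library's implementation. It assumes a default grammatical function character of '-', and it assumes that the character itself is removed along with everything after it and that an occurrence at index 0 is left alone. The class name is hypothetical.

```java
class GfStripSketch {

    // Assumed default; the real character may be language specific.
    private char gfCharacter = '-';

    // Changes the character that sets off grammatical function marking.
    public void setGfCharacter(char gfCharacter) {
        this.gfCharacter = gfCharacter;
    }

    // Simplified rule: drop everything from the first non-initial
    // occurrence of the gf character onward.
    public String stripGF(String category) {
        int index = category.indexOf(gfCharacter, 1);
        return index > 0 ? category.substring(0, index) : category;
    }

    public static void main(String[] args) {
        GfStripSketch pack = new GfStripSketch();
        System.out.println(pack.stripGF("NP-SBJ")); // NP
        pack.setGfCharacter('+');
        System.out.println(pack.stripGF("NP+SBJ")); // NP
        System.out.println(pack.stripGF("NP-SBJ")); // NP-SBJ
    }
}
```

The setter makes the sketch stateful, so an instance shared between callers would need to agree on the character before stripGF is called.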