edu.stanford.nlp.trees.international.arabic
Class ArabicTreebankLanguagePack

java.lang.Object
  extended by edu.stanford.nlp.trees.AbstractTreebankLanguagePack
      extended by edu.stanford.nlp.trees.international.arabic.ArabicTreebankLanguagePack
All Implemented Interfaces:
TreebankLanguagePack, Serializable

public class ArabicTreebankLanguagePack
extends AbstractTreebankLanguagePack

Specifies the treebank/language-specific components needed for parsing the Penn Arabic Treebank.

The encoding for Arabic is the default UTF-8 specified in AbstractTreebankLanguagePack.

Author:
Christopher Manning, Mona Diab, Roger Levy
See Also:
Serialized Form

Field Summary
 
Fields inherited from class edu.stanford.nlp.trees.AbstractTreebankLanguagePack
DEFAULT_ENCODING, DEFAULT_GF_CHAR, gfCharacter
 
Constructor Summary
ArabicTreebankLanguagePack()
          Initialize an Arabic Treebank.
ArabicTreebankLanguagePack(boolean detPlusNounIsBasicCategory)
          Initialize an Arabic Treebank.
 
Method Summary
 String basicCategory(String category)
          Returns the basic syntactic category of a String.
 String[] evalBIgnoredPunctuationTags()
          Returns a String array of punctuation tags that EVALB-style evaluation should ignore for this treebank/language.
 TokenizerFactory<? extends HasWord> getTokenizerFactory()
          Returns a tokenizer factory that might be suitable for tokenizing text that will be used with this Treebank/Language pair.
 HeadFinder headFinder()
          The HeadFinder to use for your treebank.
 char[] labelAnnotationIntroducingCharacters()
          Return an array of characters at which a String should be truncated to give the basic syntactic category of a label.
static void main(String[] args)
           
 String[] punctuationTags()
          Returns a String array of punctuation tags for this treebank/language.
 String[] punctuationWords()
          Returns a String array of punctuation words for this treebank/language.
 String[] sentenceFinalPunctuationTags()
          Returns a String array of sentence final punctuation tags for this treebank/language.
 String[] sentenceFinalPunctuationWords()
          Returns a String array of sentence final punctuation words for this treebank/language.
 void setTokenizerFactory(TokenizerFactory<? extends HasWord> tf)
           
 String[] startSymbols()
          Returns a String array of treebank start symbols.
 String toString()
           
 String treebankFileExtension()
          Returns the extension of treebank files for this treebank.
 TreeReaderFactory treeReaderFactory()
          Returns a TreeReaderFactory suitable for general purpose use with this language/treebank.
 
Methods inherited from class edu.stanford.nlp.trees.AbstractTreebankLanguagePack
categoryAndFunction, evalBIgnoredPunctuationTagAcceptFilter, evalBIgnoredPunctuationTagRejectFilter, getBasicCategoryFunction, getCategoryAndFunctionFunction, getEncoding, getGfCharacter, grammaticalStructureFactory, grammaticalStructureFactory, isEvalBIgnoredPunctuationTag, isLabelAnnotationIntroducingCharacter, isPunctuationTag, isPunctuationWord, isSentenceFinalPunctuationTag, isStartSymbol, punctuationTagAcceptFilter, punctuationTagRejectFilter, punctuationWordAcceptFilter, punctuationWordRejectFilter, sentenceFinalPunctuationTagAcceptFilter, setGfCharacter, startSymbol, startSymbolAcceptFilter, stripGF, treeTokenizerFactory
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

ArabicTreebankLanguagePack

public ArabicTreebankLanguagePack()
Initialize an Arabic Treebank.


ArabicTreebankLanguagePack

public ArabicTreebankLanguagePack(boolean detPlusNounIsBasicCategory)
Initialize an Arabic Treebank.

Parameters:
detPlusNounIsBasicCategory - If true, the category DET+NOUN is considered a basic category for purposes of basicCategory(String). (Note: this option may be obsolete. More recent practice uses tags such as DTNN.)
Method Detail

punctuationTags

public String[] punctuationTags()
Returns a String array of punctuation tags for this treebank/language.

Specified by:
punctuationTags in interface TreebankLanguagePack
Specified by:
punctuationTags in class AbstractTreebankLanguagePack
Returns:
The punctuation tags

punctuationWords

public String[] punctuationWords()
Returns a String array of punctuation words for this treebank/language.

Specified by:
punctuationWords in interface TreebankLanguagePack
Specified by:
punctuationWords in class AbstractTreebankLanguagePack
Returns:
The punctuation words

sentenceFinalPunctuationTags

public String[] sentenceFinalPunctuationTags()
Returns a String array of sentence final punctuation tags for this treebank/language.

Specified by:
sentenceFinalPunctuationTags in interface TreebankLanguagePack
Specified by:
sentenceFinalPunctuationTags in class AbstractTreebankLanguagePack
Returns:
The sentence final punctuation tags

sentenceFinalPunctuationWords

public String[] sentenceFinalPunctuationWords()
Returns a String array of sentence final punctuation words for this treebank/language.

Returns:
The sentence final punctuation words

evalBIgnoredPunctuationTags

public String[] evalBIgnoredPunctuationTags()
Returns a String array of punctuation tags that EVALB-style evaluation should ignore for this treebank/language. Traditionally, EVALB has ignored a subset of the total set of punctuation tags in the English Penn Treebank (quotes and period, comma, colon, etc., but not brackets).

Specified by:
evalBIgnoredPunctuationTags in interface TreebankLanguagePack
Overrides:
evalBIgnoredPunctuationTags in class AbstractTreebankLanguagePack
Returns:
The EVALB-ignored punctuation tags
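
To illustrate how such a tag array is typically applied, here is a minimal standalone sketch that drops ignored punctuation tags from a tag sequence before scoring. It does not call the Stanford classes. The class name EvalbIgnoreSketch, the method scoredTags, and the tag values are assumptions for illustration only (the values follow the English Penn Treebank description above, not the Arabic pack's actual array).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class EvalbIgnoreSketch {
    // Assumed, English-PTB-style values: quotes, period, comma, colon.
    // Brackets such as -LRB- are deliberately absent, so they are still scored.
    static final String[] IGNORED_TAGS = {"''", "``", ".", ",", ":"};

    // Returns the tags that remain once the ignored punctuation tags are removed.
    static List<String> scoredTags(List<String> tags, String[] ignored) {
        Set<String> skip = new HashSet<>(Arrays.asList(ignored));
        List<String> kept = new ArrayList<>();
        for (String tag : tags) {
            if (!skip.contains(tag)) {
                kept.add(tag);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tags = Arrays.asList("NN", ",", "-LRB-", "VBD", "-RRB-", ".");
        System.out.println(scoredTags(tags, IGNORED_TAGS));
    }
}
```

In real use, the array returned by evalBIgnoredPunctuationTags() would take the place of IGNORED_TAGS.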

labelAnnotationIntroducingCharacters

public char[] labelAnnotationIntroducingCharacters()
Returns an array of characters at which a String should be truncated to give the basic syntactic category of a label. The idea here is that Penn Treebank style labels follow a syntactic category with various functional and cross-referencing information introduced by special characters (such as "NP-SBJ=1"). This would be truncated to "NP" by the array containing '-' and '='.

Specified by:
labelAnnotationIntroducingCharacters in interface TreebankLanguagePack
Overrides:
labelAnnotationIntroducingCharacters in class AbstractTreebankLanguagePack
Returns:
An array of characters that set off label name suffixes

startSymbols

public String[] startSymbols()
Returns a String array of treebank start symbols.

Specified by:
startSymbols in interface TreebankLanguagePack
Specified by:
startSymbols in class AbstractTreebankLanguagePack
Returns:
The start symbols

setTokenizerFactory

public void setTokenizerFactory(TokenizerFactory<? extends HasWord> tf)

getTokenizerFactory

public TokenizerFactory<? extends HasWord> getTokenizerFactory()
Returns a tokenizer factory that might be suitable for tokenizing text that will be used with this Treebank/Language pair. We currently assume that the Arabic text has already been tokenized, and so use the whitespace tokenizer of the superclass.

Specified by:
getTokenizerFactory in interface TreebankLanguagePack
Overrides:
getTokenizerFactory in class AbstractTreebankLanguagePack
Returns:
A tokenizer factory
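
Because the input is assumed to be pre-tokenized, whitespace tokenization amounts to splitting on runs of whitespace. The sketch below shows that idea with the Java standard library only. WhitespaceTokenizeSketch and tokenize are hypothetical names, not part of the Stanford API, and the placeholder tokens are made up.

```java
import java.util.ArrayList;
import java.util.List;

final class WhitespaceTokenizeSketch {
    // Splits already-segmented text on runs of whitespace; empty pieces are dropped.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Placeholder tokens separated by mixed spaces and a tab.
        System.out.println(tokenize("tok1  tok2\ttok3 ."));
    }
}
```

Note that no punctuation splitting or clitic segmentation happens here. Any such processing must already have been done upstream.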

treebankFileExtension

public String treebankFileExtension()
Returns the extension of treebank files for this treebank. This is "tree".

Returns:
the extension on files for this treebank
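
One plausible use of this value is selecting treebank files from a directory listing. The following is a small sketch under that assumption. TreeFileFilterSketch and treebankFiles are hypothetical names, and the file names are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TreeFileFilterSketch {
    // Keeps only the names that end in "." followed by the given extension.
    static List<String> treebankFiles(List<String> names, String extension) {
        String suffix = "." + extension;
        List<String> kept = new ArrayList<>();
        for (String name : names) {
            if (name.endsWith(suffix)) {
                kept.add(name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // "tree" is the extension documented above; the file names are made up.
        List<String> names = Arrays.asList("part1.tree", "notes.txt", "part2.tree");
        System.out.println(treebankFiles(names, "tree"));
    }
}
```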

treeReaderFactory

public TreeReaderFactory treeReaderFactory()
Description copied from class: AbstractTreebankLanguagePack
Returns a TreeReaderFactory suitable for general purpose use with this language/treebank.

Specified by:
treeReaderFactory in interface TreebankLanguagePack
Overrides:
treeReaderFactory in class AbstractTreebankLanguagePack
Returns:
A TreeReaderFactory suitable for general purpose use with this language/treebank.

main

public static void main(String[] args)

basicCategory

public String basicCategory(String category)
Description copied from class: AbstractTreebankLanguagePack
Returns the basic syntactic category of a String. This implementation truncates the label at the first occurrence of one of the labelAnnotationIntroducingCharacters(). There are also special cases for labelAnnotationIntroducingCharacters that are part of the category label itself: (i) a character from this set in the first position never triggers truncation (e.g., '-' or '=' as a token), and (ii) if the label starts with a character from this set, a second instance of the same character is also skipped (to deal with '-LRB-', '-RCB-', etc.).

Specified by:
basicCategory in interface TreebankLanguagePack
Overrides:
basicCategory in class AbstractTreebankLanguagePack
Parameters:
category - The whole String name of the label
Returns:
The basic category of the String
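
The truncation rule and its two special cases can be sketched as follows. This is a standalone reading of the description above, not the library's source. BasicCategorySketch is a hypothetical class, and the character set {'-', '='} is an assumption borrowed from the "NP-SBJ=1" example rather than the value the Arabic pack actually returns.

```java
final class BasicCategorySketch {
    // Assumed annotation-introducing characters (illustrative only).
    static final char[] ANNOTATION_CHARS = {'-', '='};

    static boolean isAnnotationChar(char ch) {
        for (char c : ANNOTATION_CHARS) {
            if (c == ch) {
                return true;
            }
        }
        return false;
    }

    // Truncates the label at the first annotation character, except that
    // (i) such a character at index 0 is kept, and
    // (ii) if the label starts with one, its next matching occurrence is kept too.
    static String basicCategory(String category) {
        if (category == null) {
            return null;
        }
        boolean openedAtZero = false;
        char opener = '\0';
        int i = 0;
        for (; i < category.length(); i++) {
            char ch = category.charAt(i);
            if (!isAnnotationChar(ch)) {
                continue;
            }
            if (i == 0) {
                openedAtZero = true;   // case (i): leading '-' or '='
                opener = ch;
            } else if (openedAtZero && ch == opener) {
                openedAtZero = false;  // case (ii): closing char of e.g. -LRB-
            } else {
                break;                 // ordinary case: truncate here
            }
        }
        return category.substring(0, i);
    }

    public static void main(String[] args) {
        for (String label : new String[] {"NP-SBJ=1", "-LRB-", "-", "="}) {
            System.out.println(label + " -> " + basicCategory(label));
        }
    }
}
```

Under these assumptions, "NP-SBJ=1" reduces to "NP", while "-LRB-", "-", and "=" are returned unchanged.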

toString

public String toString()
Overrides:
toString in class Object

headFinder

public HeadFinder headFinder()
The HeadFinder to use for your treebank.

Returns:
A suitable HeadFinder


Stanford NLP Group