Package edu.stanford.nlp.trees

A package for (NLP) trees, sentences, and similar things.

Interface Summary
ConstituentFactory A ConstituentFactory is a factory for creating objects of class Constituent, or some descendent class.
HasAttributes Something that implements the HasAttributes interface knows about attributes stored in a Map.
HasCategory Something that implements the HasCategory interface knows about categories.
HasFollow Something that implements the HasFollow interface knows about the characters that follow a token.
HasTag Something that implements the HasTag interface knows about part-of-speech tags.
HasWord Something that implements the HasWord interface knows about words.
HeadFinder An interface for finding the "head" daughter of a phrase structure tree.
Label Something that implements the Label interface can act as a constituent, node, or word label with linguistic attributes.
Labeled Interface for Objects which have a Label.
LabelFactory A LabelFactory object acts as a factory for creating objects of class Label, or some descendent class.
SentenceProcessor This is a simple interface for applying a transformer to a Sentence.
SentenceReaderFactory A SentenceReaderFactory is a factory for creating objects of class SentenceReader, or some descendent class.
SentenceVisitor This is a simple interface for operations that are going to be applied to a Sentence.
TreebankLanguagePack This interface specifies language/treebank specific information for a Treebank, which a parser might need to know.
TreeFactory A TreeFactory acts as a factory for creating objects of class Tree, or some descendent class.
TreeProcessor This is a simple interface for operations that are going to be applied to a Tree.
TreeReader A TreeReader adds functionality to another Reader by reading in Trees, or some descendant class.
TreeReaderFactory A TreeReaderFactory is a factory for creating objects of class TreeReader, or some descendent class.
TreeTransformer This is a simple interface for a function that alters a local Tree.
 

Class Summary
AbstractLabel An AbstractLabel object acts as a Label with linguistic attributes.
AbstractTreebankLanguagePack This provides an implementation of parts of the TreebankLanguagePack API to reduce the load on fresh implementations.
AdaptiveLabelFactory An AdaptiveLabelFactory object makes simple Labels for objects, by creating a label of an appropriate type depending on the arguments passed in.
AdwaitSentenceReaderFactory This class implements a SentenceReaderFactory which is suitable for reading in tagged or untagged sentences formatted one per line with underscores used only to separate words from POS tags.
AdwaitStreamTokenizer Builds a tokenizer for files where whitespace separates tokens, and eol is significant.
BobChrisTreeNormalizer Normalizes trees in the way used in Manning and Carpenter 1997.
Category A Category object acts as a Label by containing a String that is a category (nonterminal).
CategoryWordTag A CategoryWordTag object acts as a complex Label which contains a category, a head word, and a tag.
CategoryWordTagFactory A CategoryWordTagFactory is a factory that makes a Label which is a CategoryWordTag triplet.
ChineseCollinizer Performs collinization operations on Chinese trees similar to those for English. Namely: strips all functional & automatically-added tags; strips all punctuation; merges PRN and ADVP; eliminates ROOT (note that there are a few non-unary ROOT nodes; these are not eliminated).
ChineseHeadFinder HeadFinder for the Penn Chinese Treebank.
ChineseTreebankLanguagePack Language pack for Chinese treebank.
CollinsHeadFinder Implements the HeadFinder found in Michael Collins' 1999 thesis.
CollinsSemanticHeadFinder Implements a 'semantic head' variant of the HeadFinder found in Michael Collins' 1999 thesis.
Constituent A Constituent object defines a generic edge in a graph.
DanBobChrisTreeNormalizer Normalizes trees roughly the way used in Manning and Carpenter 1997.
DiskSentencebank A DiskSentencebank object stores merely the information to get at a corpus of sentences that is stored on disk.
DiskTreebank A DiskTreebank object stores merely the information to get at a corpus of trees that is stored on disk.
LabeledConstituent A LabeledConstituent object represents a single bracketing in a derivation, including start and end points and Label information, but excluding probabilistic information.
LabeledScoredConstituent A LabeledScoredConstituent object defines an edge in a graph with a label and a score.
LabeledScoredConstituentFactory A LabeledScoredConstituentFactory acts as a factory for creating objects of class LabeledScoredConstituent.
LabeledScoredTreeFactory A LabeledScoredTreeFactory acts as a factory for creating trees with labels and scores.
LabeledScoredTreeLeaf A LabeledScoredTreeLeaf represents the leaf of a tree in a parse tree with labels and scores.
LabeledScoredTreeNode A LabeledScoredTreeNode represents a tree composed of a root label, a score, and an array of daughter parse trees.
LabeledScoredTreeReaderFactory This class implements a TreeReaderFactory that produces labeled, scored array-based Trees, which have been cleaned up to delete empties, etc.
LeftHeadFinder HeadFinder that always returns the leftmost daughter as head.
MemorySentencebank A MemorySentencebank object stores a corpus of examples with given sentence structures in memory (as a Collection).
MemoryTreebank A MemoryTreebank object stores a corpus of examples with given tree structures in memory (as a List).
ModCollinsHeadFinder Implements a variant on the HeadFinder found in Michael Collins' 1999 thesis.
NegraTreeNormalizer Tree normalizer for Negra.
NoPunctTreeNormalizer Normalizes trees roughly the way used in Manning and Carpenter 1997.
NPTmpRetainingTreeNormalizer Same TreeNormalizer as BobChrisTreeNormalizer, but optionally provides four extras.
NullLabel A NullLabel object acts as a Label with linguistic attributes, but doesn't actually store or return anything.
OnePerLineSentenceNormalizer A class for sentence normalization.
ParametricTreeNormalizer Normalizes trees based on parameter settings.
ParentTransformedLabeledNormalizedTreeReaderFactory This class implements a TreeReaderFactory that produces labeled, scored array-based Trees, which have been cleaned up to delete empties, etc.
PennSentenceMrgNormalizer A class for sentence normalization.
PennSentenceNormalizer A class for Penn tag directory sentence normalization.
PennSentenceReaderFactory This class implements a SentenceReaderFactory which is suitable for reading in tagged sentences from the Penn Treebank.
PennTagbankStreamTokenizer Builds a tokenizer for Penn pos tagged directories.
PennTreebankLanguagePack Specifies the treebank/language specific components needed for parsing the English Penn Treebank.
PennTreebankStreamTokenizer Builds a tokenizer for English PennTreebank (release 2) trees.
PennTreeReader A PennTreeReader is a TreeReader that reads in Penn Treebank-style files.
PruneNodesStripSubtagsTreeNormalizer ??
QuasiSemantics Builds a neo-Davidsonian-style semantic representation corresponding to a Penn Treebank style tree.
SbjRetainingTreeNormalizer Same TreeNormalizer as BobChrisTreeNormalizer, but retains -SBJ labels on NP with the new identification NP#SBJ.
Sentence Sentence holds a single sentence, and mediates between word numbers and words.
Sentencebank A Sentencebank object provides access to a corpus of sentences -- just plain sentences or tagged sentences, etc.
SentenceNormalizer A class for sentence normalization.
SentenceReader A SentenceReader adds functionality to a Reader by reading in Sentences, or some descendant class.
SepTreeNormalizer Exactly like BobChrisTreeNormalizer, except does not strip functional tags from NPs.
SimpleConstituent A SimpleConstituent object defines a generic edge in a graph.
SimpleConstituentFactory A ConstituentFactory acts as a factory for creating objects of class Constituent, or some descendent class.
SimpleSentenceReaderFactory This class implements a simple default SentenceReaderFactory.
SimpleTree A SimpleTree is a minimal concrete implementation of an unlabeled, unscored Tree.
SimpleTreeFactory A SimpleTreeFactory acts as a factory for creating objects of class SimpleTree.
SimpleTreeReaderFactory This class implements a simple default TreeReaderFactory.
Span A Span is an optimized SimpleConstituent object.
StringLabel A StringLabel object acts as a Label by containing a single String, which it sets or returns in response to requests.
StringLabeledScoredTreeReaderFactory This class implements a TreeReaderFactory that produces labeled, scored array-based Trees, which have been cleaned up to delete empties, etc.
StringLabelFactory A StringLabelFactory object makes a simple StringLabel out of a String.
Tag A Tag object acts as a Label by containing a String that is a part-of-speech tag.
TaggedWord A TaggedWord object contains a word and its tag.
TaggedWordFactory A TaggedWordFactory acts as a factory for creating objects of class TaggedWord.
TagMapper A POS tag to POS tag mapper.
Tree The abstract class Tree is used to collect all of the tree types, and acts as a generic composite type.
Treebank A Treebank object provides access to a corpus of examples with given tree structures.
TreeJugglers  
TreeLengthComparator A TreeLengthComparator orders trees by their yield sentence lengths.
TreeNormalizer A class for tree normalization.
TreeNormalizers A collection of static methods that return a TreeNormalizer.
WeightedFollowedTaggedWord A WeightedFollowedTaggedWord object contains a word and its tag, but it also records what text follows the token.
Word A Word object acts as a Label by containing a String.
WordFactory A WordFactory acts as a factory for creating objects of class Word.
WordLabeledScoredTreeReaderFactory This class implements a TreeReaderFactory that produces Word labeled, scored array-based Trees, which have been cleaned up to delete empties, etc., according to the BobChrisTreeNormalizer.
WordTag A WordTag corresponds to a tagged (e.g., for part of speech) word and is implemented with String-valued word and tag.
WordTagFactory A WordTagFactory acts as a factory for creating objects of class WordTag.
 

Package edu.stanford.nlp.trees Description

A package for (NLP) trees, sentences, and similar things. This package provides several key abstractions (via abstract classes) and a number of further classes for related objects. Most of these classes use a Factory pattern to instantiate objects.

A Label is something that can be the label of a Tree or a Constituent. The simplest label is a StringLabel. A Word or a TaggedWord is a Label. They can be constructed with a LabelFactory. A Label often implements various interfaces, such as HasWord.

A Constituent object defines a generic edge in a graph. It has a start and end, and usually a Label. A ConstituentFactory builds a Constituent.
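
As a rough illustration of the idea, the sketch below models a constituent as a labeled edge over word positions, built through a factory. It is a self-contained mini-model, not the package's own classes: the names ConstituentSketch, Edge, EdgeFactory and width() are hypothetical, and treating the end position as inclusive is an assumption made only for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical mini-model of a labeled constituent; not the package's own classes. */
class ConstituentSketch {

    /** An edge over word positions start..end (assumed inclusive here) with a String label. */
    static final class Edge {
        final int start;
        final int end;
        final String label;

        Edge(int start, int end, String label) {
            this.start = start;
            this.end = end;
            this.label = label;
        }

        /** Number of words covered, under the inclusive-end assumption. */
        int width() {
            return end - start + 1;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Edge)) {
                return false;
            }
            Edge e = (Edge) o;
            return start == e.start && end == e.end && label.equals(e.label);
        }

        @Override
        public int hashCode() {
            return 31 * (31 * start + end) + label.hashCode();
        }

        @Override
        public String toString() {
            return label + "(" + start + "," + end + ")";
        }
    }

    /** Plays the role of a ConstituentFactory: the one place that decides which class to build. */
    interface EdgeFactory {
        Edge newEdge(int start, int end, String label);
    }

    static final EdgeFactory FACTORY = new EdgeFactory() {
        public Edge newEdge(int start, int end, String label) {
            return new Edge(start, end, label);
        }
    };

    /** Edges for "This was good" bracketed as (S (NP This) (VP was good)). */
    static List<Edge> example() {
        List<Edge> edges = new ArrayList<Edge>();
        edges.add(FACTORY.newEdge(0, 2, "S"));
        edges.add(FACTORY.newEdge(0, 0, "NP"));
        edges.add(FACTORY.newEdge(1, 2, "VP"));
        return edges;
    }

    public static void main(String[] args) {
        for (Edge e : example()) {
            System.out.println(e + " covers " + e.width() + " word(s)");
        }
    }
}
```

Because code asks the factory for edges rather than calling a constructor, swapping in a richer edge class (say, one carrying a score) only requires a different factory.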

A Tree object provides generic facilities for manipulating NLP trees. A TreeFactory can build a Tree. A Treebank provides an interface to a collection of parsed sentences (normally found on disk as a corpus). A TreeReader reads trees from an InputStream. A TreeReaderFactory builds a TreeReader. A TreeNormalizer canonicalizes a Tree on input from a File. A HeadFinder finds the head daughter of a Tree. The TreeProcessor interface is for general sequential processing of trees, and the TreeTransformer interface is for changing them.
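
To make the roles above a little more concrete, here is a minimal self-contained sketch of a tree, its yield, and a head finder that always picks the leftmost daughter. All names in it (TreeSketch, Node, HeadPicker, yieldWords, headDaughter) are hypothetical stand-ins chosen for illustration; they are not the signatures of Tree or HeadFinder in this package.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical mini-model of a tree, its yield, and a head finder; not the package's own classes. */
class TreeSketch {

    static final class Node {
        final String label;
        final List<Node> children;

        Node(String label, Node... children) {
            this.label = label;
            this.children = Arrays.asList(children);
        }

        boolean isLeaf() {
            return children.isEmpty();
        }

        /** A preterminal has exactly one child, and that child is a leaf. */
        boolean isPreTerminal() {
            return children.size() == 1 && children.get(0).isLeaf();
        }

        /** The yield: the left-to-right sequence of leaf labels under this node. */
        List<String> yieldWords() {
            List<String> words = new ArrayList<String>();
            collectLeaves(this, words);
            return words;
        }

        private static void collectLeaves(Node n, List<String> out) {
            if (n.isLeaf()) {
                out.add(n.label);
                return;
            }
            for (Node kid : n.children) {
                collectLeaves(kid, out);
            }
        }
    }

    /** Plays the role of a HeadFinder: picks the head daughter of a local tree. */
    interface HeadPicker {
        Node headDaughter(Node parent);
    }

    /** Always picks the leftmost daughter; returns null for a leaf, which has no daughters. */
    static final HeadPicker LEFTMOST = new HeadPicker() {
        public Node headDaughter(Node parent) {
            return parent.isLeaf() ? null : parent.children.get(0);
        }
    };

    /** Hand-built tree for (S (NP (DT This)) (VP (VBD was) (JJ good))). */
    static Node example() {
        return new Node("S",
            new Node("NP", new Node("DT", new Node("This"))),
            new Node("VP", new Node("VBD", new Node("was")), new Node("JJ", new Node("good"))));
    }

    public static void main(String[] args) {
        Node s = example();
        System.out.println("yield: " + s.yieldWords());
        System.out.println("head daughter of S: " + LEFTMOST.headDaughter(s).label);
    }
}
```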

A Sentence is a subclass of an ArrayList. A Sentencebank provides an interface to a large number of sentences (normally found on disk as a corpus). A SentenceReader reads sentences from an InputStream. A SentenceReaderFactory builds a SentenceReader. A SentenceNormalizer canonicalizes a Sentence on input from a File. The SentenceProcessor interface is for general sequential processing of sentences.

There are also various subclasses of StreamTokenizer. The class PairFinder should probably be removed to samples.

Design notes: This package is the result of several iterations of trying to come up with a reusable and extendable set of tree classes. It may still be nonoptimal, but some thought went into it! At any rate, there are several things that it is important to understand to use the package effectively. One is that a Label has a primary value() which is always a String, and this is the only thing that matters for fundamental Label operations, such as checking equality. While anything else (or nothing) can be stored in a Label, all other Label content is regarded as purely decorative. All Label implementations should implement a labelFactory() method that returns a LabelFactory for the appropriate kind of Label. Since this depends on the exact class, this method should always be overridden when a Label class is extended. The existing Label classes also provide a static factory() method which returns the same thing.
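
The two conventions in these notes (equality decided by value() alone, and labelFactory() overridden in every subclass) can be sketched in a few lines. This is a hypothetical mini-model written for illustration: the interfaces below only borrow the names Label, LabelFactory, value() and labelFactory() from the text, while PlainLabel, DecoratedLabel, newLabel and distinctCount are invented here and are not the package's source.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical mini-model of the Label conventions described above; not the package's own source. */
class LabelContractSketch {

    /** Stand-in for the Label interface, reduced to the two methods discussed. */
    interface Label {
        String value();
        LabelFactory labelFactory();
    }

    /** Stand-in for LabelFactory; the method name newLabel is an assumption of this sketch. */
    interface LabelFactory {
        Label newLabel(String value);
    }

    /** The simplest label: just a String value. */
    static class PlainLabel implements Label {
        private final String value;

        PlainLabel(String value) {
            this.value = value;
        }

        public String value() {
            return value;
        }

        public LabelFactory labelFactory() {
            return new LabelFactory() {
                public Label newLabel(String v) {
                    return new PlainLabel(v);
                }
            };
        }

        /** Equality looks only at value(); any other content is decorative. */
        @Override
        public boolean equals(Object o) {
            return o instanceof Label && value.equals(((Label) o).value());
        }

        @Override
        public int hashCode() {
            return value.hashCode();
        }
    }

    /** A label with an extra decorative field that equality ignores. */
    static class DecoratedLabel extends PlainLabel {
        private final String note;

        DecoratedLabel(String value, String note) {
            super(value);
            this.note = note;
        }

        String note() {
            return note;
        }

        /** Overridden so the factory builds this class rather than the parent class. */
        @Override
        public LabelFactory labelFactory() {
            return new LabelFactory() {
                public Label newLabel(String v) {
                    return new DecoratedLabel(v, null); // decoration left empty
                }
            };
        }
    }

    /** Counts labels that are distinct under value-based equality. */
    static int distinctCount(Label... labels) {
        Set<Label> set = new HashSet<Label>();
        for (Label l : labels) {
            set.add(l);
        }
        return set.size();
    }

    public static void main(String[] args) {
        Label plain = new PlainLabel("NP");
        Label decorated = new DecoratedLabel("NP", "subject");
        System.out.println(plain.equals(decorated));
        System.out.println(distinctCount(plain, decorated));
        System.out.println(decorated.labelFactory().newLabel("VP").getClass().getSimpleName());
    }
}
```

If DecoratedLabel did not override labelFactory(), code that clones or rebuilds labels through the factory would silently produce PlainLabel objects and lose the subclass, which is the failure the note above warns about.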

Road Map: There are some plans to change things. We plan to redo Label, so that all Label classes just inherit from AbstractLabel, and do a full equality test on all their fields. The default type of Treebank should be useful. TreeReader should be PennTreeReader. And there is probably more.

Illustrations of use of the trees package

Treebank and Tree

Here is some fairly straightforward code for loading trees from a treebank and iterating over the trees contained therein. It builds a histogram of sentence lengths.

import java.util.Iterator;

import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.io.NumberRangesFileFilter;
import edu.stanford.nlp.util.Timing;

/** This class just prints out sentences and their lengths.
 *  Use: java SentenceLengths /turing/corpora/Treebank2/combined/wsj/07 
 *              [fileRange]
 */
public class SentenceLengths {

    private static final int maxleng = 100;
    private static int[] lengthCounts = new int[maxleng+1];
    private static int numSents = 0;


    public static void main(String[] args) {
	Timing.startTime();
	Treebank treebank = new DiskTreebank(
				     new LabeledScoredTreeReaderFactory());
	if (args.length > 1) {
	    treebank.loadPath(args[0], new NumberRangesFileFilter(args[1],
								  true));
	} else {
	    treebank.loadPath(args[0]);
	}
	
	for (Iterator it = treebank.iterator(); it.hasNext(); ) {
	    Tree t = (Tree) it.next();
	    numSents++;
	    int len = t.yield().length();
	    if (len <= maxleng) {
		lengthCounts[len]++;
	    }
	}
	System.out.print("Files " + args[0] + " ");
	if (args.length > 1) {
	    System.out.print(args[1] + " ");
	}
	System.out.println("consists of " + numSents + " sentences");
	for (int i = 0; i <= maxleng; i++) {
	    System.out.println("  " + lengthCounts[i] + " of length " + i);
	}
	Timing.endTime("Read/count all trees");
    }

}

Treebank, custom TreeReaderFactory, Tree, and Constituent

This example illustrates building a Treebank by hand, specifying a custom TreeReaderFactory, and illustrates more of the Tree package, and the notion of a Constituent. A Constituent has a start and end point and a Label.

import java.io.*;
import java.util.*;

import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.util.*;

/** This class counts how often each constituent appears
 *  Use: java ConstituentCounter /turing/corpora/Treebank2/combined/wsj/07
 */
public class ConstituentCounter {

    public static void main(String[] args) {
	Treebank treebank = new DiskTreebank(new TreeReaderFactory() {
		public TreeReader newTreeReader(Reader in) {
		    return new PennTreeReader(in, 
			new LabeledScoredTreeFactory(new StringLabelFactory()),
				  new BobChrisTreeNormalizer());
		}
	    });

	treebank.loadPath(args[0]);
	Counter cnt = new Counter();
	
	ConstituentFactory confac = LabeledConstituent.factory();
	for (Iterator it = treebank.iterator(); it.hasNext(); ) {
	    Tree t = (Tree) it.next();
	    Set constituents = t.constituents(confac);
	    for (Iterator it2 = constituents.iterator(); it2.hasNext(); ) {
		Constituent c = (Constituent) it2.next();
		cnt.increment(c);
	    }
	}
	SortedSet ss = new TreeSet(cnt.seenSet());
	for (Iterator it = ss.iterator(); it.hasNext(); ) {
	    Constituent c = (Constituent) it.next();
	    System.out.println(c + "  " + cnt.countOf(c));
	}
    }

}

Tree and Label

Dealing with the Tree and Label classes is a central part of using this package. This code works out the set of tags (preterminal labels) used in a Treebank. It illustrates writing one's own code to recurse through a Tree, and getting a String value for a Label.

import java.util.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.util.Counter;

/** This class prints out trees from strings and counts their preterminals.
 *  Use: java TreesFromStrings '(S (NP (DT This)) (VP (VBD was) (JJ good)))'
 */
public class TreesFromStrings {

    private static void addTerminals(Tree t, Counter c) {
	if (t.isLeaf()) {
	    // do nothing
	} else if (t.isPreTerminal()) {
	    c.increment(t.label().value());
	} else {
	    // phrasal node
	    Tree[] kids = t.children();
	    for (int i = 0; i < kids.length; i++) {
		addTerminals(kids[i], c);
	    }
	}
    }

    public static void main(String[] args) {
       Treebank tb = new MemoryTreebank();
       for (int i = 0; i < args.length; i++) {
	   try {
	       Tree t = Tree.valueOf(args[i]);
	       tb.add(t);
	   } catch (Exception e) {
	       e.printStackTrace();
	   }
       }
       Counter c = new Counter();
       for (Iterator it = tb.iterator(); it.hasNext(); ) {
	   Tree t = (Tree) it.next();
	   addTerminals(t, c);
       }
       System.out.println(c);
   }

}
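
For readers who want to see what reading a bracketed string and counting its preterminals involves without the library on hand, the following is a small self-contained sketch. It is not how Tree.valueOf is implemented: BracketSketch, Node, parse and countTags are hypothetical names, and the parser assumes every opening bracket is followed directly by a label, so it does not accept the empty-label outer bracket found in some treebank files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical recursive-descent reader for simple bracketed trees; not the package's own parser. */
class BracketSketch {

    static final class Node {
        final String label;
        final List<Node> kids = new ArrayList<Node>();

        Node(String label) {
            this.label = label;
        }
    }

    private final String text;
    private int pos = 0;

    private BracketSketch(String text) {
        this.text = text;
    }

    /** Parses one tree and rejects any trailing non-whitespace text. */
    static Node parse(String text) {
        BracketSketch p = new BracketSketch(text);
        Node root = p.readNode();
        p.skipSpaces();
        if (p.pos != text.length()) {
            throw new IllegalArgumentException("trailing text at " + p.pos);
        }
        return root;
    }

    private void skipSpaces() {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
    }

    /** Reads a run of characters up to whitespace or a bracket. */
    private String readToken() {
        int start = pos;
        while (pos < text.length()) {
            char ch = text.charAt(pos);
            if (ch == '(' || ch == ')' || Character.isWhitespace(ch)) {
                break;
            }
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("expected a token at " + pos);
        }
        return text.substring(start, pos);
    }

    private Node readNode() {
        skipSpaces();
        if (pos < text.length() && text.charAt(pos) == '(') {
            pos++; // consume '('
            skipSpaces();
            Node node = new Node(readToken());
            skipSpaces();
            while (pos < text.length() && text.charAt(pos) != ')') {
                node.kids.add(readNode());
                skipSpaces();
            }
            if (pos >= text.length()) {
                throw new IllegalArgumentException("missing ')'");
            }
            pos++; // consume ')'
            return node;
        }
        return new Node(readToken()); // a bare token is a leaf (a word)
    }

    /** Counts preterminal labels: nodes whose single child is a leaf. */
    static void countTags(Node n, Map<String, Integer> counts) {
        if (n.kids.isEmpty()) {
            return; // a word, nothing to count
        }
        if (n.kids.size() == 1 && n.kids.get(0).kids.isEmpty()) {
            Integer old = counts.get(n.label);
            counts.put(n.label, old == null ? 1 : old + 1);
            return;
        }
        for (Node kid : n.kids) {
            countTags(kid, counts);
        }
    }

    public static void main(String[] args) {
        Node t = parse("(S (NP (DT This)) (VP (VBD was) (JJ good)))");
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        countTags(t, counts);
        System.out.println(counts);
    }
}
```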

As well as the Treebank classes, there are corresponding Sentencebank classes (though they are not quite so extensively developed). This final example shows use of a Sentencebank. It also illustrates the Visitor pattern for examining sentences in a Sentencebank. This was actually the original visitation pattern for Treebank and Sentencebank, but these days, it's in general easier to use an Iterator. You can also get Sentences from a Treebank, by taking the yield() or taggedYield() of each Tree.

import java.io.*;

import edu.stanford.nlp.trees.*;

public class SentencePrinter {

    /** Loads SentenceBank from first argument and prints it out.
     *  Usage: java SentencePrinter sentencebankPath
     *  @param args Array of command-line arguments
     */
    public static void main(String[] args) {
        SentenceReaderFactory srf = new SentenceReaderFactory() {
                public SentenceReader newSentenceReader(Reader in) {
                    return new SentenceReader(in, new TaggedWordFactory(),
                                new PennSentenceNormalizer(),
                                new PennTagbankStreamTokenizer(in));
                }
            };
        Sentencebank sentencebank = new DiskSentencebank(srf);
        sentencebank.loadPath(args[0]);

        sentencebank.apply(new SentenceVisitor() {
                public void visitSentence(final Sentence s) {
                    // also print tag as well as word
                    System.out.println(s.toString(false));
                }
            });
    }

}
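
The difference between the visitor style and the iterator style mentioned above can be seen side by side in a self-contained sketch. Everything here is a hypothetical mini-model (TraversalSketch, Bank, Visitor, and plain word lists standing in for sentences); it only mirrors the shape of apply() with a visitor versus a loop over an iterator, not the actual Sentencebank API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical mini-model contrasting visitor-style and iterator-style traversal. */
class TraversalSketch {

    /** Callback in the style of a sentence visitor. */
    interface Visitor {
        void visit(List<String> sentence);
    }

    /** A tiny in-memory bank of sentences supporting both styles. */
    static final class Bank implements Iterable<List<String>> {
        private final List<List<String>> sentences = new ArrayList<List<String>>();

        void add(String... words) {
            sentences.add(Arrays.asList(words));
        }

        /** Visitor style: the bank drives the loop and calls back into the visitor. */
        void apply(Visitor v) {
            for (List<String> s : sentences) {
                v.visit(s);
            }
        }

        /** Iterator style: the caller drives the loop. */
        public Iterator<List<String>> iterator() {
            return sentences.iterator();
        }
    }

    static Bank example() {
        Bank bank = new Bank();
        bank.add("This", "was", "good");
        bank.add("Trees", "grow");
        return bank;
    }

    static int countWordsWithVisitor(Bank bank) {
        // An anonymous class can only capture final locals, so the running
        // total lives in a one-element array.
        final int[] total = {0};
        bank.apply(new Visitor() {
            public void visit(List<String> sentence) {
                total[0] += sentence.size();
            }
        });
        return total[0];
    }

    static int countWordsWithIterator(Bank bank) {
        int total = 0;
        for (List<String> s : bank) {
            total += s.size();
        }
        return total;
    }

    public static void main(String[] args) {
        Bank bank = example();
        System.out.println(countWordsWithVisitor(bank));
        System.out.println(countWordsWithIterator(bank));
    }
}
```

The iterator version keeps its state in an ordinary local variable and can stop early with a break, which is one reason it tends to be the easier style to write.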

Since:
1.2
Author:
Christopher Manning, Dan Klein


Stanford NLP Group