java.lang.Object
  java.util.AbstractCollection<E>
    java.util.AbstractList<E>
      java.util.ArrayList<Word>
        edu.stanford.nlp.ling.BasicDocument<L>
Type Parameters:
L - The type of the labels
public class BasicDocument<L>
Basic implementation of Document that should be suitable for most needs.
BasicDocument is an ArrayList for storing words; it tokenizes its text when one of the init methods is called.
Override parse(String) to provide support for custom document formats or to do a custom job of tokenization.
BasicDocument should only be used for documents that are small enough to store in memory.
Typical usage: Document doc = new BasicDocument().init(file);
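The advice above about overriding parse(String) can be illustrated with a small sketch. It deliberately does not use the library: MiniDocument and CsvDocument are hypothetical stand-ins, words are plain Strings rather than Word objects, and the default whitespace split is an assumption standing in for the TokenizerFactory.

```java
import java.util.ArrayList;

// Hypothetical stand-in for BasicDocument: a list of words filled in by parse(String).
class MiniDocument extends ArrayList<String> {
    // Mirrors the documented pattern: tokenize the text, then return this document.
    MiniDocument init(String text) {
        parse(text == null ? "" : text);
        return this;
    }

    // Default tokenization for this sketch: split on whitespace.
    // (The real class delegates to a TokenizerFactory instead.)
    protected void parse(String text) {
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                add(token);
            }
        }
    }
}

// Custom document format: comma-separated fields instead of whitespace-separated words.
class CsvDocument extends MiniDocument {
    @Override
    protected void parse(String text) {
        for (String field : text.split(",")) {
            String trimmed = field.trim();
            if (!trimmed.isEmpty()) {
                add(trimmed);
            }
        }
    }
}
```

With these stand-ins, new CsvDocument().init("red, green ,blue") holds the three fields red, green and blue, while the base class would have split the same text on whitespace.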
Field Summary | |
---|---|
protected List<L> |
labels
Label(s) for this document. |
protected String |
originalText
original text of this document (may be null). |
protected String |
title
title of this document (never null). |
protected TokenizerFactory<Word> |
tokenizerFactory
TokenizerFactory used to convert the text into words inside parse(String) . |
Fields inherited from class java.util.AbstractList |
---|
modCount |
Constructor Summary | |
---|---|
BasicDocument()
Constructs a new (empty) BasicDocument using a PTBTokenizer . |
|
BasicDocument(Collection<Word> d)
|
|
BasicDocument(Document<L,Word,Word> d)
|
|
BasicDocument(TokenizerFactory<Word> tokenizerFactory)
Constructs a new (empty) BasicDocument using the given tokenizer. |
Method Summary | ||
---|---|---|
void |
addLabel(L label)
Adds the given label to the List of labels for this Document if it is not null. |
|
Collection<Word> |
asFeatures()
Returns this (the features are the list of words). |
|
Document<L,Word,Word> |
blankDocument()
Returns a new empty BasicDocument with the same title, labels, and tokenizer as this Document. |
|
static <L> BasicDocument<L> |
init()
Calls init((String)null,null,true) |
|
BasicDocument<L> |
init(File textFile)
Calls init(textFile,textFile.getCanonicalPath(),true) |
|
BasicDocument<L> |
init(File textFile,
boolean keepOriginalText)
Calls init(textFile,textFile.getCanonicalPath(),keepOriginalText) |
|
BasicDocument<L> |
init(File textFile,
String title)
Calls init(textFile,title,true) |
|
BasicDocument<L> |
init(File textFile,
String title,
boolean keepOriginalText)
Inits a new BasicDocument by reading in the text from the given File. |
|
BasicDocument<L> |
init(List<? extends Word> words)
Calls init(words,null) |
|
BasicDocument<L> |
init(List<? extends Word> words,
String title)
Inits a new BasicDocument with the given list of words and title. |
|
BasicDocument<L> |
init(Reader textReader)
Calls init(textReader,null,true) |
|
BasicDocument<L> |
init(Reader textReader,
boolean keepOriginalText)
Calls init(textReader,null,keepOriginalText) |
|
BasicDocument<L> |
init(Reader textReader,
String title)
Calls init(textReader,title,true) |
|
static <L> BasicDocument<L> |
init(Reader textReader,
String title,
boolean keepOriginalText)
Inits a new BasicDocument by reading in the text from the given Reader. |
|
static <L> BasicDocument<L> |
init(String text)
Calls init(text,null,true) |
|
static <L> BasicDocument<L> |
init(String text,
boolean keepOriginalText)
Calls init(text,null,keepOriginalText) |
|
static <L> BasicDocument<L> |
init(String text,
String title)
Calls init(text,title,true) |
|
static <L> BasicDocument<L> |
init(String text,
String title,
boolean keepOriginalText)
Inits a new BasicDocument with the given text contents and title. |
|
BasicDocument<L> |
init(URL textURL)
Calls init(textURL,textURL.toExternalForm(),true) |
|
BasicDocument<L> |
init(URL textURL,
boolean keepOriginalText)
Calls init(textURL,textURL.toExternalForm(),keepOriginalText) |
|
BasicDocument<L> |
init(URL textURL,
String title)
Calls init(textURL,title,true) |
|
BasicDocument<L> |
init(URL textURL,
String title,
boolean keepOriginalText)
Constructs a new BasicDocument by reading in the text from the given URL. |
|
L |
label()
Returns the first label for this Document, or null if none have been set. |
|
Collection<L> |
labels()
Returns the complete List of labels for this Document. |
|
static void |
main(String[] args)
For internal debugging purposes only. |
|
String |
originalText()
Returns the text originally used to construct this document, or null if there was no original text. |
|
protected void |
parse(String text)
Tokenizes the given text to populate the list of words this Document represents. |
|
String |
presentableText()
Returns a "pretty" version of the words in this Document suitable for display. |
|
static <L> void |
printState(BasicDocument<L> bd)
For internal debugging purposes only. |
|
void |
setLabel(L label)
Removes all currently assigned labels for this Document then adds the given label. |
|
void |
setLabels(Collection<L> labels)
Removes all currently assigned labels for this Document then adds all of the given labels. |
|
void |
setTitle(String title)
Sets the title of this Document to the given title. |
|
void |
setTokenizerFactory(TokenizerFactory<Word> tokenizerFactory)
Sets the tokenizerFactory to be used by parse(String) . |
|
String |
title()
Returns the title of this document. |
|
TokenizerFactory<Word> |
tokenizerFactory()
Returns the current TokenizerFactory used by parse(String) . |
Methods inherited from class java.util.ArrayList |
---|
add, add, addAll, addAll, clear, clone, contains, ensureCapacity, get, indexOf, isEmpty, lastIndexOf, remove, remove, removeRange, set, size, toArray, toArray, trimToSize |
Methods inherited from class java.util.AbstractList |
---|
equals, hashCode, iterator, listIterator, listIterator, subList |
Methods inherited from class java.util.AbstractCollection |
---|
containsAll, removeAll, retainAll, toString |
Methods inherited from class java.lang.Object |
---|
finalize, getClass, notify, notifyAll, wait, wait, wait |
Methods inherited from interface java.util.List |
---|
add, add, addAll, addAll, clear, contains, containsAll, equals, get, hashCode, indexOf, isEmpty, iterator, lastIndexOf, listIterator, listIterator, remove, remove, removeAll, retainAll, set, size, subList, toArray, toArray |
Field Detail |
---|
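protected String title
title of this document (never null).
protected String originalText
original text of this document (may be null).
protected final List<L> labels
Label(s) for this document.
protected TokenizerFactory<Word> tokenizerFactory
TokenizerFactory used to convert the text into words inside parse(String).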
Constructor Detail |
---|
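public BasicDocument()
Constructs a new (empty) BasicDocument using a PTBTokenizer. Call one of the init* methods to populate the document from a desired source.
public BasicDocument(TokenizerFactory<Word> tokenizerFactory)
Constructs a new (empty) BasicDocument using the given tokenizer.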
public BasicDocument(Document<L,Word,Word> d)
public BasicDocument(Collection<Word> d)
Method Detail |
---|
public static <L> BasicDocument<L> init(String text, String title, boolean keepOriginalText)
Inits a new BasicDocument with the given text contents and title. The text is tokenized using parse(String) to populate the list of words ("" is used if text is null). If keepOriginalText is true, a reference to the original text is also maintained so that the originalText() method returns the text given to this method. Returns a reference to the initialized BasicDocument for convenience (so it's more like a constructor, but inherited).
public static <L> BasicDocument<L> init(String text, String title)
public static <L> BasicDocument<L> init(String text, boolean keepOriginalText)
public static <L> BasicDocument<L> init(String text)
public static <L> BasicDocument<L> init()
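The way the shorter init(String, ...) overloads funnel into the three-argument form can be sketched as follows. This is a miniature model under stated assumptions, not the library source: TextDoc is a hypothetical class, whitespace splitting stands in for parse(String), and replacing a null title with "" is an assumption made only to honor the "never null" note on the title field.

```java
import java.util.ArrayList;

// Hypothetical miniature of the init(String, ...) overloads; not the library class.
class TextDoc extends ArrayList<String> {
    private String title = "";
    private String originalText; // stays null unless keepOriginalText is true

    // Full form: every shorter overload below funnels into this one.
    static TextDoc init(String text, String title, boolean keepOriginalText) {
        TextDoc doc = new TextDoc();
        String safeText = (text == null) ? "" : text; // "" is used if text is null
        doc.title = (title == null) ? "" : title;     // assumption: keeps the title non-null
        if (keepOriginalText) {
            doc.originalText = safeText;              // assumption: stores the null-safe text
        }
        // Stand-in for parse(String): split on whitespace.
        for (String token : safeText.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                doc.add(token);
            }
        }
        return doc;
    }

    // Defaults follow the method summary: title null, keepOriginalText true.
    static TextDoc init(String text, String title) { return init(text, title, true); }
    static TextDoc init(String text, boolean keepOriginalText) { return init(text, null, keepOriginalText); }
    static TextDoc init(String text) { return init(text, null, true); }
    static TextDoc init() { return init((String) null, null, true); }

    String title() { return title; }
    String originalText() { return originalText; }
}
```

Keeping all the logic in one method means the defaults are stated exactly once per overload, which is the same design the "Calls init(...)" entries in the method summary describe.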
public static <L> BasicDocument<L> init(Reader textReader, String title, boolean keepOriginalText) throws IOException
Inits a new BasicDocument by reading in the text from the given Reader.
Throws: IOException
See Also: init(String,String,boolean)
public BasicDocument<L> init(Reader textReader, String title) throws IOException
Throws: IOException
public BasicDocument<L> init(Reader textReader, boolean keepOriginalText) throws IOException
Throws: IOException
public BasicDocument<L> init(Reader textReader) throws IOException
Throws: IOException
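The Reader-based variants point to init(String,String,boolean), which suggests the text is read in full and then handed to the String form. A minimal sketch of that reading step, assuming this is how the hand-off works (ReaderInitSketch and readAll are hypothetical names, not part of the library):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper; the class and method names are illustrative only.
class ReaderInitSketch {
    // Drains the Reader into a String so the text can be passed to a String-based init.
    static String readAll(Reader textReader) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4096];
        int read;
        while ((read = textReader.read(buffer)) != -1) {
            text.append(buffer, 0, read);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // try-with-resources closes the Reader once the text has been read.
        try (Reader reader = new StringReader("first line\nsecond line")) {
            System.out.println(readAll(reader));
        }
    }
}
```

Because the whole text ends up in memory, this matches the note that the class is only suitable for documents small enough to store in memory.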
public BasicDocument<L> init(File textFile, String title, boolean keepOriginalText) throws FileNotFoundException, IOException
Inits a new BasicDocument by reading in the text from the given File.
Throws: FileNotFoundException, IOException
See Also: init(String,String,boolean)
public BasicDocument<L> init(File textFile, String title) throws FileNotFoundException, IOException
Throws: FileNotFoundException, IOException
public BasicDocument<L> init(File textFile, boolean keepOriginalText) throws FileNotFoundException, IOException
Throws: FileNotFoundException, IOException
public BasicDocument<L> init(File textFile) throws FileNotFoundException, IOException
Throws: FileNotFoundException, IOException
public BasicDocument<L> init(URL textURL, String title, boolean keepOriginalText) throws IOException
Constructs a new BasicDocument by reading in the text from the given URL.
Throws: IOException
See Also: init(String,String,boolean)
public BasicDocument<L> init(URL textURL, String title) throws FileNotFoundException, IOException
Throws: FileNotFoundException, IOException
public BasicDocument<L> init(URL textURL, boolean keepOriginalText) throws FileNotFoundException, IOException
Throws: FileNotFoundException, IOException
public BasicDocument<L> init(URL textURL) throws FileNotFoundException, IOException
Throws: FileNotFoundException, IOException
public BasicDocument<L> init(List<? extends Word> words, String title)
public BasicDocument<L> init(List<? extends Word> words)
protected void parse(String text)
Tokenizes the given text to populate the list of words this Document represents.
See Also: setTokenizerFactory(edu.stanford.nlp.objectbank.TokenizerFactory)
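public Collection<Word> asFeatures()
Returns this (the features are the list of words).
Specified by: asFeatures in interface Featurizable<Word>
public L label()
Returns the first label for this Document, or null if none have been set.
Specified by: label in interface Labeled<L>
public Collection<L> labels()
Returns the complete List of labels for this Document.
Specified by: labels in interface Labeled<L>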
public void setLabel(L label)
public void setLabels(Collection<L> labels)
public void addLabel(L label)
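The label methods interact in a few ways that are easy to miss: addLabel ignores null, setLabel and setLabels clear the list first, and label() returns only the first entry. The sketch below models that behavior in a hypothetical LabelHolder class; it is not the library code, and treating a null collection in setLabels as "no labels" is an assumption.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical model of the documented label behavior; not the library class.
class LabelHolder<L> {
    private final List<L> labels = new ArrayList<>();

    // First label, or null if none have been set.
    L label() {
        return labels.isEmpty() ? null : labels.get(0);
    }

    // The complete list of labels.
    Collection<L> labels() {
        return labels;
    }

    // Adds the label only if it is not null.
    void addLabel(L label) {
        if (label != null) {
            labels.add(label);
        }
    }

    // Removes all current labels, then adds the given one.
    void setLabel(L label) {
        labels.clear();
        addLabel(label);
    }

    // Removes all current labels, then adds all of the given ones.
    void setLabels(Collection<L> newLabels) {
        labels.clear();
        if (newLabels != null) { // assumption: a null collection means "no labels"
            labels.addAll(newLabels);
        }
    }
}
```

One consequence of this model is that setLabel(null) leaves the document with no labels at all, since the list is cleared before the null is ignored.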
public String title()
Returns the title of this document.
Specified by: title in interface Document<L,Word,Word>
public void setTitle(String title)
public TokenizerFactory<Word> tokenizerFactory()
Returns the current TokenizerFactory used by parse(String).
public void setTokenizerFactory(TokenizerFactory<Word> tokenizerFactory)
Sets the tokenizerFactory to be used by parse(String). Set this tokenizer before calling one of the init methods, because init will probably call parse. Note that the tokenizer can equivalently be passed in to the constructor.
See Also: BasicDocument(TokenizerFactory)
public Document<L,Word,Word> blankDocument()
Returns a new empty BasicDocument with the same title, labels, and tokenizer as this Document. Subclasses that want to preserve extra state should override this method and add the extra state to the new document before returning it. The new BasicDocument is created by calling getClass().newInstance(), so it should be of the correct subclass, and you should be able to cast it down and add extra metadata directly. Note, however, that if an Exception is thrown on instantiation (e.g. if your subclass doesn't have a public empty constructor, which it should), a new BasicDocument is used instead. Thus, if you want to be paranoid (or, some would say, "correct"), you should check that your instance is of the correct subtype as follows (this example assumes the subclass is called NumberedDocument and has the additional number property):
Document blankDocument = super.blankDocument();
if (blankDocument instanceof NumberedDocument) {
    ((NumberedDocument) blankDocument).setNumber(getNumber());
}
Specified by: blankDocument in interface Document<L,Word,Word>
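A self-contained sketch of the reflective-copy-with-fallback pattern described for blankDocument() follows. All classes here are hypothetical stand-ins rather than library types, only the title is copied, and getDeclaredConstructor().newInstance() is used in place of getClass().newInstance() because the latter is deprecated in current Java versions.

```java
import java.util.ArrayList;

// Hypothetical stand-in for BasicDocument; only the blankDocument() logic is modelled.
class BaseDoc extends ArrayList<String> {
    protected String title = "";

    public BaseDoc() {}

    BaseDoc blankDocument() {
        BaseDoc blank;
        try {
            // Instantiate the runtime class so subclasses get an instance of their own type.
            blank = getClass().getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            // No usable empty constructor: fall back to the base type.
            blank = new BaseDoc();
        }
        blank.title = title;
        return blank;
    }
}

// Subclass with extra state that must survive blankDocument().
class NumberedDocument extends BaseDoc {
    private int number;

    public NumberedDocument() {}

    int getNumber() { return number; }
    void setNumber(int number) { this.number = number; }

    @Override
    BaseDoc blankDocument() {
        BaseDoc blank = super.blankDocument();
        // Defensive check: the fallback path may have produced a plain BaseDoc.
        if (blank instanceof NumberedDocument) {
            ((NumberedDocument) blank).setNumber(getNumber());
        }
        return blank;
    }
}

// A subclass without an empty constructor triggers the fallback path.
class NoDefaultCtorDocument extends BaseDoc {
    NoDefaultCtorDocument(String title) { this.title = title; }
}
```

In this model, calling blankDocument() on a NumberedDocument yields another NumberedDocument with the same number, while calling it on a NoDefaultCtorDocument yields a plain BaseDoc, which is exactly the case the instanceof check guards against.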
public String originalText()
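public String presentableText()
Returns a "pretty" version of the words in this Document suitable for display. Each element that is a HasWord has its HasWord.word() printed, and other elements are skipped.
Subclasses that maintain additional information may wish to override this method.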
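The filtering rule above (print word() for word-bearing elements, skip everything else) can be sketched without the library. WordLike is a hypothetical stand-in for HasWord, and joining the words with single spaces is an assumption about what "pretty" means here.

```java
import java.util.List;

// Hypothetical stand-in for the HasWord interface.
interface WordLike {
    String word();
}

class PresentableTextSketch {
    // Builds display text from the word-bearing elements and skips all others.
    static String presentableText(List<?> elements) {
        StringBuilder sb = new StringBuilder();
        for (Object element : elements) {
            if (element instanceof WordLike) {
                if (sb.length() > 0) {
                    sb.append(' '); // assumption: words are separated by one space
                }
                sb.append(((WordLike) element).word());
            }
        }
        return sb.toString();
    }
}
```

A subclass that tracks extra information, such as the NumberedDocument example used for blankDocument(), could override the method to prepend that information to this text.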
public static void main(String[] args)
public static <L> void printState(BasicDocument<L> bd) throws Exception
For internal debugging purposes only.
Throws: Exception