Type Parameters:
L - The type of the labels

public class BasicDocument<L> extends java.util.ArrayList<Word> implements Document<L,Word,Word>
Override parse(String) to provide support for custom document formats or to do a custom job of tokenization. BasicDocument should only be used for documents that are small enough to store in memory.
To read a file and tokenize it, you can call
Document doc = new BasicDocument().init(file);
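The construct-then-init idiom above can be illustrated with a minimal sketch. It uses only the Java standard library: MiniDocument and LowercaseDocument are hypothetical stand-ins, not classes from this library, and whitespace splitting is an assumption that replaces the real tokenizer. The sketch shows how an init method that returns this is inherited by a subclass, and how overriding parse changes tokenization.

```java
import java.util.ArrayList;
import java.util.Locale;

// Hypothetical stand-in for a document class: a list of tokens with init
// methods that populate the list and return this.
class MiniDocument extends ArrayList<String> {
    private String originalText;

    // Convenience overload: keeps the original text by default.
    MiniDocument init(String text) {
        return init(text, true);
    }

    MiniDocument init(String text, boolean keepOriginalText) {
        originalText = keepOriginalText ? text : null;
        parse(text == null ? "" : text);   // "" is used if text is null
        return this;
    }

    String originalText() {
        return originalText;
    }

    // Assumption: whitespace splitting stands in for a real tokenizer.
    // Subclasses can override this for a custom job of tokenization.
    protected void parse(String text) {
        clear();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                add(token);
            }
        }
    }
}

// A subclass inherits both init methods and only customizes parse.
class LowercaseDocument extends MiniDocument {
    @Override
    protected void parse(String text) {
        super.parse(text.toLowerCase(Locale.ROOT));
    }
}
```

With these stand-ins, new LowercaseDocument().init("Hello  World") yields a LowercaseDocument holding the tokens hello and world, while originalText() still returns the unmodified input.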
Modifier and Type | Field and Description |
---|---|
protected java.util.List<L> | labels - Label(s) for this document. |
protected java.lang.String | originalText - Original text of this document (may be null). |
protected java.lang.String | title - Title of this document (never null). |
protected TokenizerFactory<Word> | tokenizerFactory - TokenizerFactory used to convert the text into words inside parse(String). |
Constructor and Description |
---|
BasicDocument() - Constructs a new (empty) BasicDocument using a PTBTokenizer. |
BasicDocument(java.util.Collection<Word> d) |
BasicDocument(Document<L,Word,Word> d) |
BasicDocument(TokenizerFactory<Word> tokenizerFactory) - Constructs a new (empty) BasicDocument using the given tokenizer. |
Modifier and Type | Method and Description |
---|---|
void | addLabel(L label) - Adds the given label to the List of labels for this Document if it is not null. |
java.util.Collection<Word> | asFeatures() - Returns this (the features are the list of words). |
<OUT> Document<L,Word,OUT> | blankDocument() - Returns a new empty BasicDocument with the same title, labels, and tokenizer as this Document. |
static <L> BasicDocument<L> | init() - Calls init((String)null,null,true). |
BasicDocument<L> | init(java.io.File textFile) - Calls init(textFile,textFile.getCanonicalPath(),true). |
BasicDocument<L> | init(java.io.File textFile, boolean keepOriginalText) - Calls init(textFile,textFile.getCanonicalPath(),keepOriginalText). |
BasicDocument<L> | init(java.io.File textFile, java.lang.String title) - Calls init(textFile,title,true). |
BasicDocument<L> | init(java.io.File textFile, java.lang.String title, boolean keepOriginalText) - Inits a new BasicDocument by reading in the text from the given File. |
BasicDocument<L> | init(java.util.List<? extends Word> words) - Calls init(words,null). |
BasicDocument<L> | init(java.util.List<? extends Word> words, java.lang.String title) - Initializes a new BasicDocument with the given list of words and title. |
BasicDocument<L> | init(java.io.Reader textReader) - Calls init(textReader,null,true). |
BasicDocument<L> | init(java.io.Reader textReader, boolean keepOriginalText) - Calls init(textReader,null,keepOriginalText). |
BasicDocument<L> | init(java.io.Reader textReader, java.lang.String title) - Calls init(textReader,title,true). |
static <L> BasicDocument<L> | init(java.io.Reader textReader, java.lang.String title, boolean keepOriginalText) - Inits a new BasicDocument by reading in the text from the given Reader. |
static <L> BasicDocument<L> | init(java.lang.String text) - Calls init(text,null,true). |
static <L> BasicDocument<L> | init(java.lang.String text, boolean keepOriginalText) - Calls init(text,null,keepOriginalText). |
static <L> BasicDocument<L> | init(java.lang.String text, java.lang.String title) - Calls init(text,title,true). |
static <L> BasicDocument<L> | init(java.lang.String text, java.lang.String title, boolean keepOriginalText) - Inits a new BasicDocument with the given text contents and title. |
BasicDocument<L> | init(java.net.URL textURL) - Calls init(textURL,textURL.toExternalForm(),true). |
BasicDocument<L> | init(java.net.URL textURL, boolean keepOriginalText) - Calls init(textURL,textURL.toExternalForm(),keepOriginalText). |
BasicDocument<L> | init(java.net.URL textURL, java.lang.String title) - Calls init(textURL,title,true). |
BasicDocument<L> | init(java.net.URL textURL, java.lang.String title, boolean keepOriginalText) - Constructs a new BasicDocument by reading in the text from the given URL. |
L | label() - Returns the first label for this Document, or null if none have been set. |
java.util.Collection<L> | labels() - Returns the complete List of labels for this Document. |
static void | main(java.lang.String[] args) - For internal debugging purposes only. |
java.lang.String | originalText() - Returns the text originally used to construct this document, or null if there was no original text. |
protected void | parse(java.lang.String text) - Tokenizes the given text to populate the list of words this Document represents. |
java.lang.String | presentableText() - Returns a "pretty" version of the words in this Document suitable for display. |
void | setLabel(L label) - Removes all currently assigned labels for this Document then adds the given label. |
void | setLabels(java.util.Collection<L> labels) - Removes all currently assigned labels for this Document then adds all of the given labels. |
void | setTitle(java.lang.String title) - Sets the title of this Document to the given title. |
void | setTokenizerFactory(TokenizerFactory<Word> tokenizerFactory) - Sets the tokenizerFactory to be used by parse(String). |
java.lang.String | title() - Returns the title of this document. |
TokenizerFactory<Word> | tokenizerFactory() - Returns the current TokenizerFactory used by parse(String). |
Methods inherited from class java.util.ArrayList:
add, add, addAll, addAll, clear, clone, contains, ensureCapacity, forEach, get, indexOf, isEmpty, iterator, lastIndexOf, listIterator, listIterator, remove, remove, removeAll, removeIf, removeRange, replaceAll, retainAll, set, size, sort, spliterator, subList, toArray, toArray, trimToSize

Methods inherited from class java.lang.Object:
finalize, getClass, notify, notifyAll, wait, wait, wait

Methods inherited from interface java.util.List:
add, add, addAll, addAll, clear, contains, containsAll, equals, get, hashCode, indexOf, isEmpty, iterator, lastIndexOf, listIterator, listIterator, remove, remove, removeAll, replaceAll, retainAll, set, size, sort, spliterator, subList, toArray, toArray
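protected java.lang.String title
Title of this document (never null).

protected java.lang.String originalText
Original text of this document (may be null).

protected final java.util.List<L> labels
Label(s) for this document.

protected TokenizerFactory<Word> tokenizerFactory
TokenizerFactory used to convert the text into words inside parse(String).

public BasicDocument()
Constructs a new (empty) BasicDocument using a PTBTokenizer. Call one of the init methods to populate the document from a desired source.

public BasicDocument(TokenizerFactory<Word> tokenizerFactory)
Constructs a new (empty) BasicDocument using the given tokenizer. Call one of the init methods to populate the document from a desired source.

public BasicDocument(java.util.Collection<Word> d)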
public static <L> BasicDocument<L> init(java.lang.String text, java.lang.String title, boolean keepOriginalText)
Inits a new BasicDocument with the given text contents and title. The text is tokenized using parse(String) to populate the list of words ("" is used if text is null). If specified, a reference to the original text is also maintained so that originalText() returns the text given to this method. Returns a reference to the BasicDocument for convenience (so it's more like a constructor, but inherited).

public static <L> BasicDocument<L> init(java.lang.String text, java.lang.String title)
Calls init(text,title,true).

public static <L> BasicDocument<L> init(java.lang.String text, boolean keepOriginalText)
Calls init(text,null,keepOriginalText).

public static <L> BasicDocument<L> init(java.lang.String text)
Calls init(text,null,true).

public static <L> BasicDocument<L> init()
Calls init((String)null,null,true).
public static <L> BasicDocument<L> init(java.io.Reader textReader, java.lang.String title, boolean keepOriginalText) throws java.io.IOException
Inits a new BasicDocument by reading in the text from the given Reader.
Throws: java.io.IOException
See Also: init(String,String,boolean)

public BasicDocument<L> init(java.io.Reader textReader, java.lang.String title) throws java.io.IOException
Calls init(textReader,title,true).
Throws: java.io.IOException

public BasicDocument<L> init(java.io.Reader textReader, boolean keepOriginalText) throws java.io.IOException
Calls init(textReader,null,keepOriginalText).
Throws: java.io.IOException

public BasicDocument<L> init(java.io.Reader textReader) throws java.io.IOException
Calls init(textReader,null,true).
Throws: java.io.IOException
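public BasicDocument<L> init(java.io.File textFile, java.lang.String title, boolean keepOriginalText) throws java.io.IOException
Inits a new BasicDocument by reading in the text from the given File.
Throws: java.io.IOException
See Also: init(String,String,boolean)

public BasicDocument<L> init(java.io.File textFile, java.lang.String title) throws java.io.IOException
Calls init(textFile,title,true).
Throws: java.io.IOException

public BasicDocument<L> init(java.io.File textFile, boolean keepOriginalText) throws java.io.IOException
Calls init(textFile,textFile.getCanonicalPath(),keepOriginalText).
Throws: java.io.IOException

public BasicDocument<L> init(java.io.File textFile) throws java.io.IOException
Calls init(textFile,textFile.getCanonicalPath(),true).
Throws: java.io.IOException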
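public BasicDocument<L> init(java.net.URL textURL, java.lang.String title, boolean keepOriginalText) throws java.io.IOException
Constructs a new BasicDocument by reading in the text from the given URL.
Throws: java.io.IOException
See Also: init(String,String,boolean)

public BasicDocument<L> init(java.net.URL textURL, java.lang.String title) throws java.io.IOException
Calls init(textURL,title,true).
Throws: java.io.IOException

public BasicDocument<L> init(java.net.URL textURL, boolean keepOriginalText) throws java.io.IOException
Calls init(textURL,textURL.toExternalForm(),keepOriginalText).
Throws: java.io.IOException

public BasicDocument<L> init(java.net.URL textURL) throws java.io.IOException
Calls init(textURL,textURL.toExternalForm(),true).
Throws: java.io.IOException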
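public BasicDocument<L> init(java.util.List<? extends Word> words, java.lang.String title)
Initializes a new BasicDocument with the given list of words and title.

public BasicDocument<L> init(java.util.List<? extends Word> words)
Calls init(words,null).

protected void parse(java.lang.String text)
Tokenizes the given text to populate the list of words this Document represents.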
public java.util.Collection<Word> asFeatures()
Returns this (the features are the list of words).
Specified by: asFeatures in interface Featurizable<Word>
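public L label()
Returns the first label for this Document, or null if none have been set.

public java.util.Collection<L> labels()
Returns the complete List of labels for this Document.

public void setLabel(L label)
Removes all currently assigned labels for this Document then adds the given label. Calling setLabel(null) effectively clears all labels.

public void setLabels(java.util.Collection<L> labels)
Removes all currently assigned labels for this Document then adds all of the given labels.

public void addLabel(L label)
Adds the given label to the List of labels for this Document if it is not null.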
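The label semantics described for label(), labels(), setLabel, setLabels, and addLabel can be sketched in isolation. LabelHolder below is a hypothetical stand-in built on the standard library, not the library's own implementation. Skipping a null collection in setLabels is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical stand-in that mirrors only the label behaviour described above.
class LabelHolder<L> {
    private final List<L> labels = new ArrayList<>();

    // Adds the label unless it is null.
    void addLabel(L label) {
        if (label != null) {
            labels.add(label);
        }
    }

    // Clears all labels, then adds the given one.
    // Because addLabel ignores null, setLabel(null) simply clears.
    void setLabel(L label) {
        labels.clear();
        addLabel(label);
    }

    // Clears all labels, then adds all of the given labels.
    // Assumption: a null collection is treated as "no labels".
    void setLabels(Collection<L> newLabels) {
        labels.clear();
        if (newLabels != null) {
            labels.addAll(newLabels);
        }
    }

    // First label, or null if none have been set.
    L label() {
        return labels.isEmpty() ? null : labels.get(0);
    }

    // The complete list of labels.
    Collection<L> labels() {
        return labels;
    }
}
```

The sketch makes the interaction explicit: setLabel delegates to addLabel, which is why passing null leaves the holder with no labels rather than with a single null entry.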
public java.lang.String title()
Returns the title of this document.

public void setTitle(java.lang.String title)
Sets the title of this Document to the given title.

public TokenizerFactory<Word> tokenizerFactory()
Returns the current TokenizerFactory used by parse(String).

public void setTokenizerFactory(TokenizerFactory<Word> tokenizerFactory)
Sets the tokenizerFactory to be used by parse(String). Set this tokenizer before calling one of the init methods, because init will probably call parse. Note that the tokenizer can equivalently be passed in to the constructor.
See Also: BasicDocument(TokenizerFactory)
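The ordering requirement above (set the tokenizer before calling init) can be shown with a small sketch. TokenizedDoc is a hypothetical stand-in: it uses a java.util.function.Function in place of TokenizerFactory, and whitespace splitting in place of the default PTBTokenizer. Both substitutions are assumptions made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in: a token list whose tokenizer can be swapped.
class TokenizedDoc extends ArrayList<String> {
    // Assumed default: split on whitespace.
    private Function<String, List<String>> tokenizer =
            text -> Arrays.asList(text.trim().split("\\s+"));

    TokenizedDoc() {
    }

    // Equivalent to constructing and then calling setTokenizer.
    TokenizedDoc(Function<String, List<String>> tokenizer) {
        this.tokenizer = tokenizer;
    }

    void setTokenizer(Function<String, List<String>> tokenizer) {
        this.tokenizer = tokenizer;
    }

    // Tokenizes immediately, so the tokenizer in effect at call time is used.
    TokenizedDoc init(String text) {
        clear();
        addAll(tokenizer.apply(text));
        return this;
    }
}
```

In this sketch, a tokenizer installed after init has no effect on tokens already stored, which is the reason for setting it first or passing it to the constructor.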
public <OUT> Document<L,Word,OUT> blankDocument()
Returns a new empty BasicDocument with the same title, labels, and tokenizer as this Document. The new document is created by calling getClass().newInstance(), so it should be of the correct subclass, and thus you should be able to cast it down and add extra metadata directly. Note, however, that if an Exception is thrown on instantiation (e.g. if your subclass doesn't have a public empty constructor, which it should), then a new BasicDocument is used instead. Thus, if you want to be paranoid (or some would say "correct"), you should check that your instance is of the correct subtype as follows (this example assumes the subclass is called NumberedDocument and has the additional number property):

    Document blankDocument = super.blankDocument();
    if (blankDocument instanceof NumberedDocument) {
        ((NumberedDocument) blankDocument).setNumber(getNumber());
    }

Specified by: blankDocument in interface Document<L,Word,Word>
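The instantiate-with-fallback behaviour described for blankDocument() can be sketched with standard-library reflection. BaseDoc, NumberedDoc, and RequiredArgDoc are hypothetical stand-ins, not the library's classes. The sketch uses getDeclaredConstructor().newInstance(), a substitution for getClass().newInstance(), which is deprecated since Java 9.

```java
// Hypothetical stand-in for the base document class.
class BaseDoc {
    private String title = "";

    BaseDoc() {
    }

    String title() {
        return title;
    }

    void setTitle(String title) {
        this.title = title;
    }

    // Creates an empty copy of the runtime class, falling back to BaseDoc
    // when the subclass cannot be instantiated without arguments.
    BaseDoc blankDocument() {
        BaseDoc blank;
        try {
            blank = getClass().getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            blank = new BaseDoc();
        }
        blank.setTitle(title);
        return blank;
    }
}

// Subclass with extra state that it copies only after checking the type.
class NumberedDoc extends BaseDoc {
    private int number;

    NumberedDoc() {
    }

    int getNumber() {
        return number;
    }

    void setNumber(int number) {
        this.number = number;
    }

    @Override
    BaseDoc blankDocument() {
        BaseDoc blank = super.blankDocument();
        if (blank instanceof NumberedDoc) {
            ((NumberedDoc) blank).setNumber(getNumber());
        }
        return blank;
    }
}

// Subclass without an empty constructor: triggers the fallback path.
class RequiredArgDoc extends BaseDoc {
    RequiredArgDoc(String title) {
        setTitle(title);
    }
}
```

Calling blankDocument() on a NumberedDoc returns another NumberedDoc with the number copied, whereas calling it on a RequiredArgDoc returns a plain BaseDoc, so an unchecked downcast would fail there.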
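public java.lang.String originalText()
Returns the text originally used to construct this document, or null if there was no original text.

public java.lang.String presentableText()
Returns a "pretty" version of the words in this Document suitable for display. Every element that implements HasWord has its HasWord.word() printed, and other elements are skipped. Subclasses that maintain additional information may wish to override this method.

public static void main(java.lang.String[] args)
For internal debugging purposes only.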