Type Parameters: L - The type of the labels
public class BasicDocument<L> extends ArrayList<Word> implements Document<L,Word,Word>
Override parse(String) to provide support for custom document formats or to do a custom job of tokenization. BasicDocument should only be used for documents that are small enough to store in memory.
The easiest way to use BasicDocuments is to construct them and call an init
method in the same line (we use init methods instead of constructors because
they're inherited and allow subclasses to have other more specific constructors).
For example, to read in a File named file and tokenize it, you can call
Document doc = new BasicDocument().init(file);
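To illustrate why init methods are used instead of constructors, here is a minimal, self-contained sketch of the pattern. SketchDocument and NumberedSketchDocument are hypothetical stand-ins, not the real BasicDocument, and the whitespace split is only an assumed placeholder for real tokenization.

```java
import java.util.ArrayList;
import java.util.Arrays;

// Hypothetical stand-in for BasicDocument: a list of tokens populated by init().
class SketchDocument extends ArrayList<String> {
    protected String title = "";

    // Instance init method: fills this document and returns it so the call can be chained.
    public SketchDocument init(String text, String title) {
        this.title = (title == null) ? "" : title;
        clear();
        String trimmed = (text == null) ? "" : text.trim();
        if (!trimmed.isEmpty()) {
            // Assumed placeholder tokenization: split on runs of whitespace.
            addAll(Arrays.asList(trimmed.split("\\s+")));
        }
        return this;
    }

    public SketchDocument init(String text) {
        return init(text, null);
    }
}

// A subclass can define its own, more specific constructor;
// the init methods are inherited unchanged.
class NumberedSketchDocument extends SketchDocument {
    private final int number;

    NumberedSketchDocument(int number) {
        this.number = number;
    }

    int getNumber() {
        return number;
    }
}

class InitPatternDemo {
    public static void main(String[] args) {
        // Construct and initialize in the same line.
        SketchDocument doc = new NumberedSketchDocument(7).init("a small example", "demo");
        System.out.println(doc.size() + " tokens in " + doc.title);
    }
}
```

Because init is an ordinary inherited method, NumberedSketchDocument needs no init code of its own, and the object returned by the chained call is still an instance of the subclass.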
Modifier and Type | Field and Description |
---|---|
protected List<L> | labels - Label(s) for this document. |
protected String | originalText - Original text of this document (may be null). |
protected String | title - Title of this document (never null). |
protected TokenizerFactory<Word> | tokenizerFactory - TokenizerFactory used to convert the text into words inside parse(String). |

Fields inherited from class java.util.AbstractList: modCount

Constructor and Description |
---|
BasicDocument() - Constructs a new (empty) BasicDocument using a PTBTokenizer. |
BasicDocument(Collection<Word> d) |
BasicDocument(Document<L,Word,Word> d) |
BasicDocument(TokenizerFactory<Word> tokenizerFactory) - Constructs a new (empty) BasicDocument using the given tokenizer. |

Modifier and Type | Method and Description |
---|---|
void | addLabel(L label) - Adds the given label to the List of labels for this Document if it is not null. |
Collection<Word> | asFeatures() - Returns this (the features are the list of words). |
<OUT> Document<L,Word,OUT> | blankDocument() - Returns a new empty BasicDocument with the same title, labels, and tokenizer as this Document. |
static <L> BasicDocument<L> | init() - Calls init((String)null,null,true) |
BasicDocument<L> | init(File textFile) - Calls init(textFile,textFile.getCanonicalPath(),true) |
BasicDocument<L> | init(File textFile, boolean keepOriginalText) - Calls init(textFile,textFile.getCanonicalPath(),keepOriginalText) |
BasicDocument<L> | init(File textFile, String title) - Calls init(textFile,title,true) |
BasicDocument<L> | init(File textFile, String title, boolean keepOriginalText) - Inits a new BasicDocument by reading in the text from the given File. |
BasicDocument<L> | init(List<? extends Word> words) - Calls init(words,null) |
BasicDocument<L> | init(List<? extends Word> words, String title) - Inits a new BasicDocument with the given list of words and title. |
BasicDocument<L> | init(Reader textReader) - Calls init(textReader,null,true) |
BasicDocument<L> | init(Reader textReader, boolean keepOriginalText) - Calls init(textReader,null,keepOriginalText) |
BasicDocument<L> | init(Reader textReader, String title) - Calls init(textReader,title,true) |
static <L> BasicDocument<L> | init(Reader textReader, String title, boolean keepOriginalText) - Inits a new BasicDocument by reading in the text from the given Reader. |
static <L> BasicDocument<L> | init(String text) - Calls init(text,null,true) |
static <L> BasicDocument<L> | init(String text, boolean keepOriginalText) - Calls init(text,null,keepOriginalText) |
static <L> BasicDocument<L> | init(String text, String title) - Calls init(text,title,true) |
static <L> BasicDocument<L> | init(String text, String title, boolean keepOriginalText) - Inits a new BasicDocument with the given text contents and title. |
BasicDocument<L> | init(URL textURL) - Calls init(textURL,textURL.toExternalForm(),true) |
BasicDocument<L> | init(URL textURL, boolean keepOriginalText) - Calls init(textURL,textURL.toExternalForm(),keepOriginalText) |
BasicDocument<L> | init(URL textURL, String title) - Calls init(textURL,title,true) |
BasicDocument<L> | init(URL textURL, String title, boolean keepOriginalText) - Constructs a new BasicDocument by reading in the text from the given URL. |
L | label() - Returns the first label for this Document, or null if none have been set. |
Collection<L> | labels() - Returns the complete List of labels for this Document. |
static void | main(String[] args) - For internal debugging purposes only. |
String | originalText() - Returns the text originally used to construct this document, or null if there was no original text. |
protected void | parse(String text) - Tokenizes the given text to populate the list of words this Document represents. |
String | presentableText() - Returns a "pretty" version of the words in this Document suitable for display. |
static <L> void | printState(BasicDocument<L> bd) - For internal debugging purposes only. |
void | setLabel(L label) - Removes all currently assigned labels for this Document then adds the given label. |
void | setLabels(Collection<L> labels) - Removes all currently assigned labels for this Document then adds all of the given labels. |
void | setTitle(String title) - Sets the title of this Document to the given title. |
void | setTokenizerFactory(TokenizerFactory<Word> tokenizerFactory) - Sets the tokenizerFactory to be used by parse(String). |
String | title() - Returns the title of this document. |
TokenizerFactory<Word> | tokenizerFactory() - Returns the current TokenizerFactory used by parse(String). |

Methods inherited from class java.util.ArrayList: add, add, addAll, addAll, clear, clone, contains, ensureCapacity, forEach, get, indexOf, isEmpty, iterator, lastIndexOf, listIterator, listIterator, remove, remove, removeAll, removeIf, removeRange, replaceAll, retainAll, set, size, sort, spliterator, subList, toArray, toArray, trimToSize
Methods inherited from class java.util.AbstractList: equals, hashCode
Methods inherited from class java.util.AbstractCollection: containsAll, toString
Methods inherited from class java.lang.Object: finalize, getClass, notify, notifyAll, wait, wait, wait
Methods inherited from interface java.util.List: add, add, addAll, addAll, clear, contains, containsAll, equals, get, hashCode, indexOf, isEmpty, iterator, lastIndexOf, listIterator, listIterator, remove, remove, removeAll, replaceAll, retainAll, set, size, sort, spliterator, subList, toArray, toArray
Methods inherited from interface java.util.Collection: parallelStream, removeIf, stream
protected String title
protected String originalText
protected TokenizerFactory<Word> tokenizerFactory
TokenizerFactory used to convert the text into words inside parse(String).
public BasicDocument()
Constructs a new (empty) BasicDocument using a PTBTokenizer. Call one of the init* methods to populate the document from a desired source.
public BasicDocument(TokenizerFactory<Word> tokenizerFactory)
public BasicDocument(Collection<Word> d)
public static <L> BasicDocument<L> init(String text, String title, boolean keepOriginalText)
Inits a new BasicDocument with the given text contents and title. The text is tokenized using parse(String) to populate the list of words ("" is used if text is null). If specified, a reference to the original text is also maintained so that the originalText() method returns the text given to this method. Returns a reference to the BasicDocument for convenience (so it's more like a constructor, but inherited).
public static <L> BasicDocument<L> init(String text, String title)
public static <L> BasicDocument<L> init(String text, boolean keepOriginalText)
public static <L> BasicDocument<L> init(String text)
public static <L> BasicDocument<L> init()
public static <L> BasicDocument<L> init(Reader textReader, String title, boolean keepOriginalText) throws IOException
Throws: IOException
See Also: init(String,String,boolean)
public BasicDocument<L> init(Reader textReader, String title) throws IOException
Throws: IOException
public BasicDocument<L> init(Reader textReader, boolean keepOriginalText) throws IOException
Throws: IOException
public BasicDocument<L> init(Reader textReader) throws IOException
Throws: IOException
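The Reader variants point to init(String,String,boolean), which suggests the text is read in full and then handled like a String. The sketch below shows one way to drain a Reader into a String using only the standard library. ReaderInitSketch and readAll are hypothetical names, and it is an assumption, not something this page confirms, that BasicDocument reads the text this way.

```java
import java.io.IOException;
import java.io.Reader;

class ReaderInitSketch {
    // Drains the reader into a String so the text could be passed to a String-based init.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[4096];
        int n;
        // read() returns -1 once the end of the stream is reached.
        while ((n = reader.read(buffer)) != -1) {
            sb.append(buffer, 0, n);
        }
        return sb.toString();
    }
}
```

Holding the whole text in one String is consistent with the note that BasicDocument is only meant for documents small enough to store in memory.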
public BasicDocument<L> init(File textFile, String title, boolean keepOriginalText) throws FileNotFoundException, IOException
Throws: FileNotFoundException, IOException
See Also: init(String,String,boolean)
public BasicDocument<L> init(File textFile, String title) throws FileNotFoundException, IOException
Throws: FileNotFoundException, IOException
public BasicDocument<L> init(File textFile, boolean keepOriginalText) throws FileNotFoundException, IOException
Throws: FileNotFoundException, IOException
public BasicDocument<L> init(File textFile) throws FileNotFoundException, IOException
Throws: FileNotFoundException, IOException
public BasicDocument<L> init(URL textURL, String title, boolean keepOriginalText) throws IOException
Throws: IOException
See Also: init(String,String,boolean)
public BasicDocument<L> init(URL textURL, String title) throws FileNotFoundException, IOException
Throws: FileNotFoundException, IOException
public BasicDocument<L> init(URL textURL, boolean keepOriginalText) throws FileNotFoundException, IOException
Throws: FileNotFoundException, IOException
public BasicDocument<L> init(URL textURL) throws FileNotFoundException, IOException
Throws: FileNotFoundException, IOException
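According to the method summary, the File and URL variants that take no title default it to textFile.getCanonicalPath() and textURL.toExternalForm() respectively. The short sketch below shows what those two standard-library calls produce. DefaultTitleSketch and defaultTitle are hypothetical names used only for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.net.URL;

class DefaultTitleSketch {
    // Default title for a file: its canonical (absolute, normalized) path.
    static String defaultTitle(File textFile) throws IOException {
        return textFile.getCanonicalPath();
    }

    // Default title for a URL: its full string form.
    static String defaultTitle(URL textURL) {
        return textURL.toExternalForm();
    }
}
```

Note that getCanonicalPath() can throw an IOException, which is one reason the File variants declare it even when no title is passed.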
public BasicDocument<L> init(List<? extends Word> words, String title)
public BasicDocument<L> init(List<? extends Word> words)
protected void parse(String text)
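Overriding parse(String) is the documented hook for custom document formats or custom tokenization. The sketch below shows the idea with hypothetical stand-in classes (ParseSketchDocument and CsvSketchDocument), not the real BasicDocument. The whitespace default and the comma-separated format are assumptions chosen only to keep the example self-contained.

```java
import java.util.ArrayList;

// Hypothetical stand-in base class: init() delegates tokenization to parse().
class ParseSketchDocument extends ArrayList<String> {
    public ParseSketchDocument init(String text) {
        clear();
        parse(text == null ? "" : text); // "" is used if text is null
        return this;
    }

    // Default tokenization (assumed for this sketch): split on runs of whitespace.
    protected void parse(String text) {
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                add(token);
            }
        }
    }
}

// Custom format: one comma-separated record, each field kept as a single token.
class CsvSketchDocument extends ParseSketchDocument {
    @Override
    protected void parse(String text) {
        for (String field : text.split(",")) {
            String trimmed = field.trim();
            if (!trimmed.isEmpty()) {
                add(trimmed);
            }
        }
    }
}
```

Only parse changes in the subclass. The inherited init still drives it, so callers construct and initialize a CsvSketchDocument exactly as they would the base class.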
public Collection<Word> asFeatures()
Specified by: asFeatures in interface Featurizable<Word>
public L label()
public Collection<L> labels()
public void setLabel(L label)
public void setLabels(Collection<L> labels)
public void addLabel(L label)
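The label methods follow the rules given in the method summary: addLabel ignores null, setLabel and setLabels clear the existing labels first, and label() returns the first label or null. A minimal sketch of those rules, using a hypothetical LabelSketch class rather than BasicDocument itself, might look like this. How null arguments to setLabel and setLabels are handled is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical holder that mirrors the documented label behaviour.
class LabelSketch<L> {
    private final List<L> labels = new ArrayList<>();

    // Adds the label unless it is null.
    void addLabel(L label) {
        if (label != null) {
            labels.add(label);
        }
    }

    // Removes every current label, then adds the given one.
    // Assumption: a null argument simply leaves the list empty.
    void setLabel(L label) {
        labels.clear();
        addLabel(label);
    }

    // Removes every current label, then adds all of the given ones.
    // Assumption: a null collection is treated as empty.
    void setLabels(Collection<L> newLabels) {
        labels.clear();
        if (newLabels != null) {
            for (L label : newLabels) {
                addLabel(label);
            }
        }
    }

    // Returns the first label, or null if none have been set.
    L label() {
        return labels.isEmpty() ? null : labels.get(0);
    }

    Collection<L> labels() {
        return labels;
    }
}
```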
public String title()
public void setTitle(String title)
public TokenizerFactory<Word> tokenizerFactory()
Returns the current TokenizerFactory used by parse(String).
public void setTokenizerFactory(TokenizerFactory<Word> tokenizerFactory)
Sets the tokenizerFactory to be used by parse(String). Set this tokenizer before calling one of the init methods because it will probably call parse. Note that the tokenizer can equivalently be passed in to the constructor.
See Also: BasicDocument(TokenizerFactory)
public <OUT> Document<L,Word,OUT> blankDocument()
Returns a new empty BasicDocument with the same title, labels, and tokenizer as this Document. Subclasses that want to preserve extra state should override this method and add the extra state to the new document before returning it. The new BasicDocument is created by calling getClass().newInstance(), so it should be of the correct subclass, and you should be able to cast it down and add extra metadata directly. Note, however, that if an Exception is thrown on instantiation (e.g. if your subclass doesn't have a public empty constructor, which it should), a new BasicDocument is used instead. Thus, if you want to be paranoid (or some would say "correct"), you should check that your instance is of the correct subtype as follows (this example assumes the subclass is called NumberedDocument and has the additional number property):
Document blankDocument = super.blankDocument();
if (blankDocument instanceof NumberedDocument) {
    ((NumberedDocument) blankDocument).setNumber(getNumber());
}
Specified by: blankDocument in interface Document<L,Word,Word>
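The instantiate-then-fall-back behaviour described for blankDocument() can be sketched with stand-in classes. BlankSketchDocument, NumberedDocument, and NoDefaultCtorDocument below are hypothetical and are not the library's classes. The sketch uses getDeclaredConstructor().newInstance() in place of the getClass().newInstance() call mentioned above, since the latter is deprecated in newer Java versions.

```java
import java.util.ArrayList;

// Hypothetical stand-in for BasicDocument, reduced to what blankDocument() needs.
class BlankSketchDocument extends ArrayList<String> {
    protected String title = "";

    public BlankSketchDocument() {}

    public BlankSketchDocument blankDocument() {
        BlankSketchDocument blank;
        try {
            // Instantiate the runtime class so a subclass gets an instance of its own type.
            blank = getClass().getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            // No usable empty constructor: fall back to the base class.
            blank = new BlankSketchDocument();
        }
        blank.title = title;
        return blank;
    }
}

// Subclass with extra state that it wants blank copies to preserve.
class NumberedDocument extends BlankSketchDocument {
    private int number;

    public NumberedDocument() {}

    int getNumber() { return number; }

    void setNumber(int number) { this.number = number; }

    @Override
    public BlankSketchDocument blankDocument() {
        BlankSketchDocument blank = super.blankDocument();
        // The "paranoid" check: only copy state if the instance has the right type.
        if (blank instanceof NumberedDocument) {
            ((NumberedDocument) blank).setNumber(getNumber());
        }
        return blank;
    }
}

// Subclass without an empty constructor, which triggers the fallback path.
class NoDefaultCtorDocument extends BlankSketchDocument {
    NoDefaultCtorDocument(String title) {
        this.title = title;
    }
}
```

With these classes, a NumberedDocument yields a blank NumberedDocument carrying the same number, while a NoDefaultCtorDocument yields a plain BlankSketchDocument, which is why the instanceof check matters.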
public String originalText()
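public String presentableText()
Returns a "pretty" version of the words in this Document suitable for display. Each element that implements HasWord has its HasWord.word() printed, and other elements are skipped. Subclasses that maintain additional information may wish to override this method.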
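The skip-or-print rule for presentableText() can be illustrated as follows. HasWordSketch is a hypothetical local stand-in for the HasWord interface, and joining words with a single space is an assumption of this sketch, not something this page specifies.

```java
import java.util.List;

// Hypothetical local stand-in for the HasWord interface.
interface HasWordSketch {
    String word();
}

class PresentableTextSketch {
    // Prints word() for every element that implements HasWordSketch; other elements are skipped.
    static String presentableText(List<?> elements) {
        StringBuilder sb = new StringBuilder();
        for (Object element : elements) {
            if (element instanceof HasWordSketch) {
                if (sb.length() > 0) {
                    sb.append(' '); // assumed separator
                }
                sb.append(((HasWordSketch) element).word());
            }
        }
        return sb.toString();
    }
}
```

A subclass that tracks extra information per token could override such a method to include it in the displayed text.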
public static void main(String[] args)
public static <L> void printState(BasicDocument<L> bd) throws Exception
Throws: Exception