edu.stanford.nlp.sequences
Class CoNLLDocumentReaderAndWriter

java.lang.Object
  extended by edu.stanford.nlp.sequences.CoNLLDocumentReaderAndWriter
All Implemented Interfaces:
IteratorFromReaderFactory<List<CoreLabel>>, DocumentReaderAndWriter<CoreLabel>, Serializable

public class CoNLLDocumentReaderAndWriter
extends Object
implements DocumentReaderAndWriter<CoreLabel>

DocumentReader for the original CoNLL 03 format. In this format there is one word per line, with additional attributes of the word (POS tag, chunk, etc.) in further space- or tab-separated columns; leading and trailing whitespace on a line is ignored. Sentences are nominally separated by a blank line (one with no non-whitespace characters), but in practice the placement of blank lines is often fairly arbitrary, and entities sometimes span them. Nevertheless, this class, like our original CoNLL system, preserves each blank line as a special BOUNDARY token, which some features detect and exploit. The text is divided into documents at each '-DOCSTART-' token, which is treated as a special token and is also preserved. The reader can read data in any of the IOB/IOE/etc. formats and output tokens in any other, based on the entitySubclassification flag.

This reader is specifically for replicating CoNLL systems. For normal use, you should use the saner ColumnDocumentReaderAndWriter.
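
The tag-scheme conversion mentioned in the description can be illustrated with a small standalone sketch. It converts IOB1 tags (where B- appears only when an entity directly follows another of the same type) to IOB2 tags (where every entity starts with B-). The class name Iob1ToIob2Sketch and its method are hypothetical and are not part of this class; the sketch assumes every tag is either "O" or has the form "I-TYPE" or "B-TYPE", and it does not model the entitySubclassification flag itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Iob1ToIob2Sketch {

    /** Rewrites IOB1 tags so that every entity begins with a B- tag (IOB2). */
    public static List<String> toIob2(List<String> tags) {
        List<String> out = new ArrayList<>();
        String prev = "O";
        for (String tag : tags) {
            String fixed = tag;
            if (tag.startsWith("I-")) {
                String type = tag.substring(2);
                // An I- tag continues an entity only if the previous tag has the same type
                boolean continues = !prev.equals("O") && prev.substring(2).equals(type);
                if (!continues) {
                    fixed = "B-" + type;
                }
            }
            out.add(fixed);
            prev = tag; // compare against the original (IOB1) tag of the previous token
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> iob1 = Arrays.asList("I-PER", "I-PER", "O", "I-LOC", "B-LOC", "I-ORG");
        // prints [B-PER, I-PER, O, B-LOC, B-LOC, B-ORG]
        System.out.println(toIob2(iob1));
    }
}
```

Only the first tag of each entity changes; tags inside an entity and "O" tags pass through untouched.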

Author:
Jenny Finkel, Huy Nguyen, Christopher Manning
See Also:
Serialized Form

Field Summary
static String BOUNDARY
           
static String OTHER
           
 
Constructor Summary
CoNLLDocumentReaderAndWriter()
           
 
Method Summary
 Iterator<List<CoreLabel>> getIterator(Reader r)
          Return an iterator over the contents read from r.
 void init(SeqClassifierFlags flags)
          This will be called immediately after construction.
static void main(String[] args)
          Count some statistics on what occurs in a file.
 void printAnswers(List<CoreLabel> doc, PrintWriter out)
          Write a standard CoNLL format output file.
 String toString()
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

BOUNDARY

public static final String BOUNDARY
See Also:
Constant Field Values

OTHER

public static final String OTHER
See Also:
Constant Field Values
Constructor Detail

CoNLLDocumentReaderAndWriter

public CoNLLDocumentReaderAndWriter()
Method Detail

init

public void init(SeqClassifierFlags flags)
Description copied from interface: DocumentReaderAndWriter
This will be called immediately after construction. An init() method is used, rather than constructor arguments, because DocumentReaderAndWriter objects are usually created using reflection.

Specified by:
init in interface DocumentReaderAndWriter<CoreLabel>
Parameters:
flags - Flags specifying behavior
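
The construct-then-init pattern described for init can be sketched without the library. The interface Configurable, the class UpperCaseReader, and the "upperCase" property below are all hypothetical stand-ins (java.util.Properties takes the place of SeqClassifierFlags). The sketch only shows why a no-argument constructor followed by an init call suits reflective creation.

```java
import java.util.Properties;

public class ReflectiveInitSketch {

    /** Hypothetical stand-in for an interface with a post-construction init method. */
    public interface Configurable {
        void init(Properties flags);
    }

    /** Hypothetical implementation with the public no-argument constructor reflection needs. */
    public static class UpperCaseReader implements Configurable {
        private boolean upperCase;

        public UpperCaseReader() { }

        @Override
        public void init(Properties flags) {
            upperCase = Boolean.parseBoolean(flags.getProperty("upperCase", "false"));
        }

        public String read(String token) {
            return upperCase ? token.toUpperCase() : token;
        }
    }

    /** Instantiates the named class reflectively, then hands it its configuration. */
    public static Configurable create(String className, Properties flags)
            throws ReflectiveOperationException {
        Configurable obj = (Configurable) Class.forName(className)
                .getDeclaredConstructor()
                .newInstance();
        obj.init(flags);
        return obj;
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        Properties flags = new Properties();
        flags.setProperty("upperCase", "true");
        Configurable c = create(UpperCaseReader.class.getName(), flags);
        // prints STANFORD
        System.out.println(((UpperCaseReader) c).read("Stanford"));
    }
}
```

Because the class is chosen by name at run time, the caller cannot know which constructor arguments a given implementation would want; a uniform init call avoids that problem.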

toString

public String toString()
Overrides:
toString in class Object

getIterator

public Iterator<List<CoreLabel>> getIterator(Reader r)
Description copied from interface: IteratorFromReaderFactory
Return an iterator over the contents read from r.

Specified by:
getIterator in interface IteratorFromReaderFactory<List<CoreLabel>>
Parameters:
r - Where to read objects from
Returns:
An Iterator over the objects
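
As a rough illustration of what iterating over CoNLL 03 input involves, the following standalone sketch groups lines into documents at each '-DOCSTART-' token and keeps blank lines as placeholder entries. It is not this class's implementation: the class ConllGroupingSketch, the "<BLANK>" placeholder, and the use of String[] rows instead of CoreLabel are assumptions made for the example, and blank lines before a '-DOCSTART-' line simply stay with the preceding document.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ConllGroupingSketch {

    // Placeholder marker for blank lines; an assumption for this sketch only
    public static final String BOUNDARY_PLACEHOLDER = "<BLANK>";

    /** Reads all lines and groups them into documents, each a list of column arrays. */
    public static Iterator<List<String[]>> documents(Reader r) throws IOException {
        List<List<String[]>> docs = new ArrayList<>();
        List<String[]> current = new ArrayList<>();
        BufferedReader in = new BufferedReader(r);
        for (String line; (line = in.readLine()) != null; ) {
            String trimmed = line.trim(); // leading and trailing whitespace is ignored
            if (trimmed.isEmpty()) {
                // Keep the blank line as a boundary entry instead of dropping it
                current.add(new String[] { BOUNDARY_PLACEHOLDER });
                continue;
            }
            String[] cols = trimmed.split("\\s+"); // space- or tab-separated columns
            if (cols[0].equals("-DOCSTART-") && !current.isEmpty()) {
                docs.add(current);
                current = new ArrayList<>();
            }
            current.add(cols); // the -DOCSTART- row itself is preserved
        }
        if (!current.isEmpty()) {
            docs.add(current);
        }
        return docs.iterator();
    }

    public static void main(String[] args) throws IOException {
        String data = "-DOCSTART- -X- O O\n\nEU NNP I-NP I-ORG\nrejects VBZ I-VP O\n\n"
                + "-DOCSTART- -X- O O\n\nPeter NNP I-NP I-PER\n";
        Iterator<List<String[]>> it = documents(new StringReader(data));
        while (it.hasNext()) {
            System.out.println("document with " + it.next().size() + " rows");
        }
    }
}
```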

printAnswers

public void printAnswers(List<CoreLabel> doc,
                         PrintWriter out)
Write a standard CoNLL format output file.

Specified by:
printAnswers in interface DocumentReaderAndWriter<CoreLabel>
Parameters:
doc - The document: A List of CoreLabel
out - Where to send the answers to
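
The exact columns written by printAnswers are not listed here. As a hedged sketch of CoNLL-style output in general, the example below writes one token per line with the gold label and the predicted label as the last two columns, and a blank line at each sentence boundary. The Token class, the "<BLANK>" placeholder, and the three-column layout are assumptions for illustration, not the behavior of this method.

```java
import java.io.PrintWriter;
import java.util.Arrays;
import java.util.List;

public class ConllOutputSketch {

    // Placeholder for the sentence-boundary token; an assumption for this sketch only
    public static final String BOUNDARY_PLACEHOLDER = "<BLANK>";

    /** Hypothetical token holder: word, gold label, predicted label. */
    public static final class Token {
        final String word;
        final String gold;
        final String guess;

        public Token(String word, String gold, String guess) {
            this.word = word;
            this.gold = gold;
            this.guess = guess;
        }
    }

    /** Writes one token per line; boundary tokens become blank lines. */
    public static void write(List<Token> doc, PrintWriter out) {
        for (Token t : doc) {
            if (BOUNDARY_PLACEHOLDER.equals(t.word)) {
                out.println();
            } else {
                out.println(t.word + " " + t.gold + " " + t.guess);
            }
        }
        out.flush();
    }

    public static void main(String[] args) {
        List<Token> doc = Arrays.asList(
                new Token("EU", "I-ORG", "I-ORG"),
                new Token("rejects", "O", "O"),
                new Token(BOUNDARY_PLACEHOLDER, "O", "O"),
                new Token("Peter", "I-PER", "O"));
        write(doc, new PrintWriter(System.out));
    }
}
```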

main

public static void main(String[] args)
                 throws IOException,
                        ClassNotFoundException
Count some statistics on what occurs in a file.

Throws:
IOException
ClassNotFoundException


Stanford NLP Group