Foundations of Statistical Natural Language Processing
Christopher D. Manning and Hinrich Schütze
Format of document vector file
Each entry consists of 25 lines, one field per line, in this order:
- the document id
- is the document in the training set (T) or in the evaluation set
(E)?
- is the document in the core training set (C) or in the validation
set (V)? (X where this doesn't apply.)
- is the document in the earnings category (Y) or not (N)?
- feature weight for "vs"
- feature weight for "mln"
- feature weight for "cts"
- feature weight for ";"
- feature weight for "&"
- feature weight for "000"
- feature weight for "loss"
- feature weight for "'"
- feature weight for """ (the double-quote character)
- feature weight for "3"
- feature weight for "profit"
- feature weight for "dlrs"
- feature weight for "1"
- feature weight for "pct"
- feature weight for "is"
- feature weight for "s"
- feature weight for "that"
- feature weight for "net"
- feature weight for "lt"
- feature weight for "at"
- semicolon (separator between entries)
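
The layout above can be read with a short script. What follows is a minimal sketch in Python, not code distributed with the file. It makes several assumptions: each weight is a number that float() accepts, blank lines carry no information, every entry (including the last) ends with its semicolon line, and the ninth feature is the double-quote character. The names FEATURES, Entry and parse_entries are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator

# Feature names in the order listed in the format description.
# Assumption: the ninth feature is the double-quote character.
FEATURES = ["vs", "mln", "cts", ";", "&", "000", "loss", "'", '"', "3",
            "profit", "dlrs", "1", "pct", "is", "s", "that", "net", "lt", "at"]

# 4 header lines + 20 feature weights + 1 separator line = 25
LINES_PER_ENTRY = 4 + len(FEATURES) + 1


@dataclass
class Entry:
    doc_id: str
    split: str                 # "T" (training) or "E" (evaluation)
    subset: str                # "C" (core), "V" (validation) or "X" (n/a)
    earnings: bool             # True for "Y", False for "N"
    weights: Dict[str, float]  # feature name -> weight


def parse_entries(lines: Iterable[str]) -> Iterator[Entry]:
    """Yield one Entry per 25-line block."""
    # Assumption: blank lines can be skipped.
    rows = [ln.strip() for ln in lines if ln.strip()]
    if len(rows) % LINES_PER_ENTRY != 0:
        raise ValueError("line count is not a multiple of 25")
    for start in range(0, len(rows), LINES_PER_ENTRY):
        chunk = rows[start:start + LINES_PER_ENTRY]
        doc_id, split, subset, label = chunk[:4]
        if chunk[-1] != ";":
            raise ValueError(f"entry {doc_id}: missing ';' separator")
        if split not in ("T", "E") or subset not in ("C", "V", "X") \
                or label not in ("Y", "N"):
            raise ValueError(f"entry {doc_id}: unexpected flag value")
        # Assumption: weights are numeric (integers also parse as floats).
        values = [float(v) for v in chunk[4:4 + len(FEATURES)]]
        yield Entry(doc_id, split, subset, label == "Y",
                    dict(zip(FEATURES, values)))
```

Entries are grouped by counting lines rather than by splitting on ";", because ";" is also the name of one of the features; only the 25th line of each block is treated as the separator. With a file opened as f, list(parse_entries(f)) would return all entries, and filtering on split == "T" would select the training set.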
Christopher Manning and Hinrich Schütze -- last modified 10 October 1999.