Foundations of Statistical Natural Language Processing

Christopher D. Manning and Hinrich Schütze

Format of document vector file

Each entry consists of 25 lines:
  1. the document id
  2. is the document in the training set (T) or in the evaluation set (E)?
  3. is the document in the core training set (C) or in the validation set (V)? (X where this doesn't apply.)
  4. is the document in the earnings category (Y) or not (N)?
  5. feature weight for "vs"
  6. feature weight for "mln"
  7. feature weight for "cts"
  8. feature weight for ";"
  9. feature weight for "&"
  10. feature weight for "000"
  11. feature weight for "loss"
  12. feature weight for "'"
  13. feature weight for """ (the double-quote character)
  14. feature weight for "3"
  15. feature weight for "profit"
  16. feature weight for "dlrs"
  17. feature weight for "1"
  18. feature weight for "pct"
  19. feature weight for "is"
  20. feature weight for "s"
  21. feature weight for "that"
  22. feature weight for "net"
  23. feature weight for "lt"
  24. feature weight for "at"
  25. semicolon (separator between entries)
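
The layout above can be read with a short script. What follows is only a minimal sketch in Python, not code distributed with the file. It assumes that each of the 25 items sits on its own line, that blank lines carry no meaning and can be skipped, and that the feature weights parse as numbers. The names DocVector, parse_entries and _parse_block are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List

LINES_PER_ENTRY = 25   # lines 1-4 metadata, 5-24 weights, 25 separator
NUM_FEATURES = 20


@dataclass
class DocVector:
    doc_id: str
    split: str            # "T" (training) or "E" (evaluation)
    subset: str           # "C" (core training), "V" (validation) or "X"
    earnings: bool        # True for "Y", False for "N"
    weights: List[float]  # 20 feature weights, in the order listed above


def _parse_block(block: List[str]) -> DocVector:
    """Turn one 25-line entry into a DocVector, checking the flag values."""
    doc_id, split, subset, category = block[:4]
    if split not in ("T", "E"):
        raise ValueError(f"{doc_id}: expected T or E, got {split!r}")
    if subset not in ("C", "V", "X"):
        raise ValueError(f"{doc_id}: expected C, V or X, got {subset!r}")
    if category not in ("Y", "N"):
        raise ValueError(f"{doc_id}: expected Y or N, got {category!r}")
    if block[LINES_PER_ENTRY - 1] != ";":
        raise ValueError(f"{doc_id}: entry does not end with a semicolon")
    # Assumption: weights are numeric; float() accepts integers as well.
    weights = [float(w) for w in block[4:4 + NUM_FEATURES]]
    return DocVector(doc_id, split, subset, category == "Y", weights)


def parse_entries(lines: Iterable[str]) -> Iterator[DocVector]:
    """Group non-blank lines into 25-line entries and parse each one."""
    block: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are not part of the format
        block.append(line)
        if len(block) == LINES_PER_ENTRY:
            yield _parse_block(block)
            block = []
    if block:
        raise ValueError(f"incomplete final entry ({len(block)} lines)")
```

An open file object can be passed directly to parse_entries, since it iterates line by line. Entries are grouped by counting 25 lines rather than by splitting on the semicolon, so that the separator can be checked in its expected position.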

Christopher Manning and Hinrich Schütze -- last modified 10 October 1999.