Stanford CoreNLP FAQ

Questions

  1. What encoding does Stanford CoreNLP use?
  2. Can you say more about adding a custom annotator?
  3. What is the format of the XML output for coref?
  4. CoreNLP runs out of memory?

Questions with answers

  1. What encoding does Stanford CoreNLP use?

    By default, it uses UTF-8. You can change the encoding used when reading files by setting the encoding property or by supplying the command line flag -encoding FOO.
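
    As a minimal sketch of the property route, the following builds a java.util.Properties object with the encoding property set. The class name EncodingExample, the helper buildProps, and the choice of ISO-8859-1 and of the tokenize,ssplit annotators are illustrative assumptions; the pipeline construction is left as a comment so the snippet runs without CoreNLP on the classpath.

```java
import java.util.Properties;

class EncodingExample {
    // Hypothetical helper: builds pipeline properties requesting a given input encoding.
    static Properties buildProps(String encoding) {
        Properties props = new Properties();
        // Assumed annotator list, chosen only to keep the example small.
        props.setProperty("annotators", "tokenize,ssplit");
        // The "encoding" property named in the answer above.
        props.setProperty("encoding", encoding);
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildProps("ISO-8859-1");
        // With CoreNLP on the classpath, the properties would be handed to the pipeline:
        // StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        System.out.println("encoding = " + props.getProperty("encoding"));
    }
}
```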

  2. Can you say more about adding a custom annotator?

    Here are the steps:

  3. What is the format of the XML output for coref?

    Here is a sample block of coref XML output:
        <coreference>
          <coreference>
            <mention representative="true">
              <sentence>1</sentence>
              <start>1</start>
              <end>3</end>
              <head>2</head>
            </mention>
            <mention>
              <sentence>2</sentence>
              <start>1</start>
              <end>2</end>
              <head>1</head>
            </mention>
          </coreference>
        </coreference>
    

    The entire coref section is enclosed in an outer <coreference> element. Each individual chain is in turn enclosed in a nested <coreference> element. (This naming is perhaps unfortunate, but at this point there are no plans to change it.)

    Inside the <coreference> element for each chain is one <mention> block per mention. One mention will be labeled the representative mention. Each mention has fields for sentence, indexed from 1; the range of words, from start (inclusive) to end (exclusive), also indexed from 1; and head, the index in the sentence of the head word of this mention.
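
    The layout described above can be read with the standard Java DOM parser. This is only a sketch: the class CorefXmlReader and its Mention type are hypothetical names, and the code assumes the outer <coreference> element is the root of the XML string, as in the sample block, rather than being embedded in a full CoreNLP output file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class CorefXmlReader {
    // The sample block from this FAQ, with the outer <coreference> as the root (assumption).
    static final String SAMPLE =
        "<coreference><coreference>"
        + "<mention representative=\"true\"><sentence>1</sentence>"
        + "<start>1</start><end>3</end><head>2</head></mention>"
        + "<mention><sentence>2</sentence>"
        + "<start>1</start><end>2</end><head>1</head></mention>"
        + "</coreference></coreference>";

    // One mention: all indices are 1-based; the word span is [start, end).
    static final class Mention {
        final int sentence, start, end, head;
        final boolean representative;
        Mention(int sentence, int start, int end, int head, boolean representative) {
            this.sentence = sentence; this.start = start; this.end = end;
            this.head = head; this.representative = representative;
        }
    }

    // Returns one list of mentions per chain (per nested <coreference> element).
    static List<List<Mention>> readChains(String xml) {
        Element outer;
        try {
            outer = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse coref XML", e);
        }
        List<List<Mention>> chains = new ArrayList<>();
        // Only direct children of the outer element are chains.
        for (Node n = outer.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (!(n instanceof Element) || !"coreference".equals(n.getNodeName())) continue;
            List<Mention> chain = new ArrayList<>();
            NodeList mentions = ((Element) n).getElementsByTagName("mention");
            for (int i = 0; i < mentions.getLength(); i++) {
                Element m = (Element) mentions.item(i);
                chain.add(new Mention(intField(m, "sentence"), intField(m, "start"),
                    intField(m, "end"), intField(m, "head"),
                    "true".equals(m.getAttribute("representative"))));
            }
            chains.add(chain);
        }
        return chains;
    }

    // Reads the integer text of the first child element with the given tag name.
    static int intField(Element mention, String tag) {
        return Integer.parseInt(mention.getElementsByTagName(tag).item(0).getTextContent().trim());
    }

    public static void main(String[] args) {
        for (List<Mention> chain : readChains(SAMPLE)) {
            for (Mention m : chain) {
                // end is exclusive, so the mention covers (end - start) words.
                System.out.println("sentence " + m.sentence + ", words " + m.start + ".."
                    + (m.end - 1) + ", head " + m.head
                    + (m.representative ? " (representative)" : ""));
            }
        }
    }
}
```

    Because end is exclusive, a mention's length in words is end - start, and its last word is at index end - 1.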

  4. CoreNLP runs out of memory?

    Either give the JVM more memory (for example, by raising the maximum heap size with the -Xmx flag), use fewer annotators, or give CoreNLP smaller documents. Nearly all our annotators load large model files, which use a lot of memory, and running the full CoreNLP pipeline requires the sum of all these memory requirements. Additionally, the coreference module operates over an entire document, so as the document size increases, its processing time and memory use keep growing, with no upper bound.
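
    One way to give CoreNLP smaller documents is to split the input on blank lines before annotating each piece separately. The sketch below is an illustration, not part of CoreNLP: DocumentChunker, chunkByParagraph, and the character limit are hypothetical. Coreference links that cross chunk boundaries are lost with this approach.

```java
import java.util.ArrayList;
import java.util.List;

class DocumentChunker {
    // Groups paragraphs (separated by blank lines) into chunks of at most maxChars
    // characters. A single paragraph longer than maxChars is kept whole as its own chunk.
    static List<String> chunkByParagraph(String text, int maxChars) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String para : text.split("\\n\\s*\\n")) {
            String p = para.trim();
            if (p.isEmpty()) continue;
            // Start a new chunk if adding this paragraph (plus separator) would overflow.
            if (current.length() > 0 && current.length() + p.length() + 2 > maxChars) {
                chunks.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) current.append("\n\n");
            current.append(p);
        }
        if (current.length() > 0) chunks.add(current.toString());
        return chunks;
    }

    public static void main(String[] args) {
        // The heap limit the JVM was started with (controlled by -Xmx).
        System.out.println("max heap bytes: " + Runtime.getRuntime().maxMemory());
        String text = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph.";
        for (String chunk : chunkByParagraph(text, 40)) {
            // Each chunk would be annotated as its own document.
            System.out.println("--- chunk of " + chunk.length() + " chars");
        }
    }
}
```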