By default, it uses Unicode's UTF-8. You can change the encoding used when reading files by either setting the Java encoding property or more simply by supplying the program with the command line flag -encoding FOO (or including the corresponding property in a properties file that you are using).
Here are the steps:
<coreference>
  <coreference>
    <mention representative="true">
      <sentence>1</sentence>
      <start>1</start>
      <end>3</end>
      <head>2</head>
    </mention>
    <mention>
      <sentence>2</sentence>
      <start>1</start>
      <end>2</end>
      <head>1</head>
    </mention>
  </coreference>
</coreference>
The entire coref section is demarcated by a
<coreference> section. Each individual chain is
then demarcated by another
<coreference>. (This is
perhaps an unfortunate naming, but at this point there are no plans to
change it.) Inside the <coreference> section for each chain is
a block describing each of the mentions. One mention will be labeled
as the representative mention. There are fields for the
sentence, indexed from 1; the range of words,
start (inclusive) to
end (not inclusive), also indexed from 1; and
head, the index in
the sentence of the head word of this mention.
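As a rough illustration of the layout described above, the following sketch reads the chains and mentions out of a coref snippet using only the JDK's DOM parser. The class name CorefXmlSketch and its Mention holder are hypothetical, and the sketch assumes the outer <coreference> element is the root of the string passed in; in a full CoreNLP XML file that element sits deeper in the document and would have to be located first.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class CorefXmlSketch {

    /** Hypothetical holder for one <mention>; field names mirror the XML tags. */
    public static final class Mention {
        public final boolean representative;
        public final int sentence, start, end, head;

        Mention(boolean representative, int sentence, int start, int end, int head) {
            this.representative = representative;
            this.sentence = sentence;
            this.start = start;
            this.end = end;
            this.head = head;
        }
    }

    /** Returns one list of mentions per inner <coreference> chain. */
    public static List<List<Mention>> readChains(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // Assumption: the outer <coreference> section is the root element.
            Element section = doc.getDocumentElement();
            List<List<Mention>> chains = new ArrayList<>();
            for (Element chain : childElements(section, "coreference")) {
                List<Mention> mentions = new ArrayList<>();
                for (Element m : childElements(chain, "mention")) {
                    mentions.add(new Mention(
                            "true".equals(m.getAttribute("representative")),
                            intField(m, "sentence"),
                            intField(m, "start"),
                            intField(m, "end"),
                            intField(m, "head")));
                }
                chains.add(mentions);
            }
            return chains;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse coref XML", e);
        }
    }

    /** Direct child elements of parent with the given tag name. */
    private static List<Element> childElements(Element parent, String name) {
        List<Element> result = new ArrayList<>();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node n = children.item(i);
            if (n.getNodeType() == Node.ELEMENT_NODE && name.equals(n.getNodeName())) {
                result.add((Element) n);
            }
        }
        return result;
    }

    /** Integer content of the first child element with the given tag name. */
    private static int intField(Element mention, String name) {
        return Integer.parseInt(childElements(mention, name).get(0).getTextContent().trim());
    }

    public static void main(String[] args) {
        String xml = "<coreference><coreference>"
                + "<mention representative=\"true\"><sentence>1</sentence>"
                + "<start>1</start><end>3</end><head>2</head></mention>"
                + "<mention><sentence>2</sentence>"
                + "<start>1</start><end>2</end><head>1</head></mention>"
                + "</coreference></coreference>";
        for (List<Mention> chain : readChains(xml)) {
            for (Mention m : chain) {
                System.out.println("sentence " + m.sentence + ", words " + m.start + " to " + m.end
                        + ", head " + m.head + (m.representative ? " (representative)" : ""));
            }
        }
    }
}
```

Filtering on direct children, rather than calling getElementsByTagName, keeps the outer section and the inner chains apart even though both use the same tag name.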
Either add more memory, use fewer annotators, or give CoreNLP smaller documents. Nearly all our annotators load large model files which use lots of memory. Running the full CoreNLP pipeline requires the sum of all these memory requirements. Additionally, the coreference module operates over an entire document. As the document size increases, its processing time and space increase without bound.
This is part of SUTime. It applies to repeating events such as "every other week" or "every two weeks". SET is not the best name for such an event, but it matches the TIMEX3 standard (see section 2.3 of the linked document)....
Other than English, we currently provide trained CoreNLP models for Chinese. To run CoreNLP on Chinese text, you first have to download the models, which can be found in our release history.
java -cp stanford-corenlp-YYYY-MM-DD.jar:stanford-chinese-corenlp-YYYY-MM-DD-models.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -file your-chinese-file.txt
If you see an Exception stacktrace message like:
Exception in thread "main" java.lang.NoSuchFieldError: featureFactoryArgs
    at edu.stanford.nlp.ie.AbstractSequenceClassifier.<init>(AbstractSequenceClassifier.java:127)
    at edu.stanford.nlp.ie.crf.CRFClassifier.<init>(CRFClassifier.java:173)
or
Caused by: java.lang.NoSuchMethodError: edu.stanford.nlp.util.Generics.newHashMap()Ljava/util/Map;
    at edu.stanford.nlp.pipeline.AnnotatorPool.<init>(AnnotatorPool.java:27)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.getDefaultAnnotatorPool(StanfordCoreNLP.java:305)
then this isn't caused by the shiny new Stanford NLP tools that you've just downloaded. It is because you also have old versions of one or more Stanford NLP tools on your classpath.
The straightforward case is if you have an older version of a Stanford NLP tool. For example, you may still have a version of Stanford NER on your classpath that was released in 2009. In this case, you should upgrade, or at least use matching versions. For any releases from 2011 on, just use tools released at the same time ... such as the most recent version of everything :) ... and they will all be compatible and play nicely together.
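When it is unclear which jar an old class is coming from, one option is to ask the JVM where it loaded the class. The sketch below uses only standard JDK calls; the class name ClassOrigin is hypothetical. Run it with the same classpath as the failing program and pass a class name from the stack trace, for example edu.stanford.nlp.util.Generics.

```java
import java.net.URL;
import java.security.CodeSource;
import java.util.Optional;

public class ClassOrigin {

    /**
     * Returns the jar or directory a class was loaded from, or empty when the
     * JVM does not record one (as is typical for core JDK classes).
     */
    public static Optional<URL> locationOf(Class<?> cls) {
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        return source == null ? Optional.empty() : Optional.ofNullable(source.getLocation());
    }

    public static void main(String[] args) throws ClassNotFoundException {
        String name = args.length > 0 ? args[0] : "java.lang.String";
        // Load without running static initializers; only the origin is of interest.
        Class<?> cls = Class.forName(name, false, ClassOrigin.class.getClassLoader());
        System.out.println(name + " loaded from "
                + locationOf(cls).map(URL::toString).orElse("(bootstrap or unknown)"));
    }
}
```

If the reported location is not the CoreNLP jar you just downloaded, that other jar is the one shadowing it.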
The tricky case of this is when people distribute jar files that hide
other people's classes inside them. People think this will make it easy
for users, since they can distribute one jar that has everything you
need, but, in practice, as soon as people are building applications
using multiple components, this results in a particularly bad form
of jar hell. People just shouldn't do this.
The only way to check that other jar files do not
contain conflicting versions of Stanford tools is to look at what is inside
them (for example, with the
jar -tf command).
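The same check can be scripted. The following is a minimal sketch, assuming you want to scan a jar for class files under the edu/stanford/nlp/ package path; the class name BundledClassFinder is hypothetical and only the JDK's java.util.jar API is used.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.stream.Collectors;

public class BundledClassFinder {

    /** Lists the .class entries in a jar whose path starts with the given prefix. */
    public static List<String> entriesWithPrefix(Path jarPath, String prefix) {
        try (JarFile jar = new JarFile(jarPath.toFile())) {
            // Collect inside the try block: the entry stream is only valid while the jar is open.
            return jar.stream()
                    .map(JarEntry::getName)
                    .filter(name -> name.startsWith(prefix) && name.endsWith(".class"))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Usage: java BundledClassFinder some-library.jar another-library.jar
        for (String arg : args) {
            List<String> hits = entriesWithPrefix(Paths.get(arg), "edu/stanford/nlp/");
            System.out.println(arg + ": " + hits.size() + " bundled Stanford NLP classes");
            hits.stream().limit(5).forEach(name -> System.out.println("  " + name));
        }
    }
}
```

Any third-party jar that reports a nonzero count is a candidate for the conflict described here.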
In practice, if you're having problems, the usual cause is that you have
ark-tweet-nlp on your classpath. The jar file in their GitHub
download hides old versions of many other people's jar files, including Apache
commons-codec (v1.4), commons-lang, commons-math, commons-io, Lucene; Twitter
commons; Google Guava (v10); Jackson; Berkeley NLP code; Percy Liang's fig;
GNU trove; and an outdated version of the Stanford POS tagger
(from 2011). You should complain to them for causing you and us
grief. But you can then fix the problem by using
their jar file from
Maven Central. It doesn't have all those other libraries stuffed inside.
You need to add the flag
-parse.flags "" (or the corresponding property
parse.flags: ). (It's sort of a misfeature/bug
that the default properties of CoreNLP turn this option on, because it is useful for
English, but it isn't defined for other languages, and so you get an error.)
The parser can be instructed to keep certain sets of tokens together as a single constituent. If you do this, it will try to make a parse which contains a subtree where the exact set of tokens in that subtree are the ones specified in the constraint.
For any sentence where you want to add constraints, attach
ParserAnnotations.ConstraintAnnotation to that
sentence. This annotation is a List<ParserConstraint>, where each
ParserConstraint specifies the start (inclusive)
and end (exclusive) of the range and a pattern which the enclosing
constituent must match. However, there is a bug in the way patterns
are handled in the parser, so it is strongly recommended to use
.* for the matching pattern.