public class ProtobufAnnotationSerializer extends AnnotationSerializer
A serializer using Google's protocol buffer format. The files produced by this serializer are language-independent and, compared with the default Java serialization (see GenericAnnotationSerializer), are a little over 10% of the size and about 4x faster to read and write when both files are compressed with gzip.
Note that this handles only a subset of the possible annotations that can be attached to a sentence. Nonetheless, it is guaranteed to be lossless with the default set of named annotators you can create from a StanfordCoreNLP pipeline, with default properties defined for each annotator.
Note that the serializer does not gzip automatically: wrap the output in a java.util.zip.GZIPOutputStream when writing, and the input in a java.util.zip.GZIPInputStream when reading. For most Annotations, gzipping provides a notable decrease in size (~2.5x), since most of the data is raw Strings.
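As a minimal sketch of the manual gzip step, the snippet below wraps the standard java.util.zip streams around an in-memory buffer. The class name GzipWrapSketch and the byte payload are assumptions made for illustration; in real use, the GZIPOutputStream would be handed to write(Annotation, OutputStream) and the GZIPInputStream to read(InputStream).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of CoreNLP.
final class GzipWrapSketch {

  /** Compress bytes by writing them through a GZIPOutputStream. */
  static byte[] gzip(byte[] raw) {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(sink)) {
      // Stand-in payload: with CoreNLP, serializer.write(annotation, gz) would go here.
      gz.write(raw);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    // The gzip trailer is only written on close, so read the buffer afterwards.
    return sink.toByteArray();
  }

  /** Decompress bytes by reading them back through a GZIPInputStream. */
  static byte[] gunzip(byte[] compressed) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
      // With CoreNLP, serializer.read(gz) would consume the stream instead.
      byte[] buffer = new byte[4096];
      int n;
      while ((n = gz.read(buffer)) != -1) {
        out.write(buffer, 0, n);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out.toByteArray();
  }

  public static void main(String[] args) {
    StringBuilder text = new StringBuilder();
    for (int i = 0; i < 200; i++) {
      text.append("The quick brown fox jumps over the lazy dog. ");
    }
    byte[] raw = text.toString().getBytes(StandardCharsets.UTF_8);
    byte[] packed = gzip(raw);
    System.out.println(raw.length + " bytes raw, " + packed.length + " bytes gzipped");
    System.out.println("round trip intact: " + Arrays.equals(raw, gunzip(packed)));
  }
}
```

Closing the GZIPOutputStream before reading the underlying buffer matters here, because the compressed stream is incomplete until it is closed.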
To allow lossy serialization, pass false to ProtobufAnnotationSerializer(boolean). Otherwise, an exception is thrown if the annotation contains an unknown key that would not be saved to the protocol buffer. If such keys exist and are part of the standard CoreNLP pipeline, please let us know!
If you would like to serialize keys in addition to those serialized by default (e.g., you are attaching your own annotations), do the following:
Create a .proto file that extends a CoreNLP message (Sentence, in this example):
    package edu.stanford.nlp.pipeline;

    option java_package = "com.example.my.awesome.nlp.app";
    option java_outer_classname = "MyAppProtos";

    import "CoreNLP.proto";

    extend Sentence {
      optional uint32 myNewField = 101;
    }
Compile the file with protoc:
    protoc -I=src/edu/stanford/nlp/pipeline/:/path/to/folder/containing/your/proto/file --java_out=/path/to/output/src/folder/ /path/to/proto/file
Extend ProtobufAnnotationSerializer to serialize and deserialize your field. Generally, this entails overriding two methods: one to write the proto and one to read it. In both cases, you usually want to call the superclass implementation and build on its result. In our running example, adding a field to the CoreNLPProtos.Sentence proto, you would override:
toProtoBuilder(edu.stanford.nlp.util.CoreMap, java.util.Set)
fromProtoNoTokens(edu.stanford.nlp.pipeline.CoreNLPProtos.Sentence)
Note, importantly, that for the serializer to be able to check for lossless serialization, every annotation added to the proto must be marked as handled by removing its key from the set passed to toProtoBuilder(edu.stanford.nlp.util.CoreMap, java.util.Set) (and the analogous functions for documents and tokens).
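The key bookkeeping described here can be pictured with a small stand-alone model. This is a sketch of the idea only, not CoreNLP's implementation: KeyTrackingSketch, TextKey, ScoreKey, and all method names are hypothetical, and plain Maps stand in for the CoreMap and the proto builder.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the "remove each key you serialize" contract.
final class KeyTrackingSketch {

  // Hypothetical annotation keys; in CoreNLP these would be annotation classes.
  static final class TextKey {}
  static final class ScoreKey {}

  /** Base step: knows only TextKey and removes it from the set once written. */
  static Map<String, Object> toBuilder(Map<Class<?>, Object> sentence,
                                       Set<Class<?>> keysToSerialize) {
    Map<String, Object> builder = new LinkedHashMap<>();
    if (keysToSerialize.remove(TextKey.class)) {
      builder.put("text", sentence.get(TextKey.class));
    }
    return builder;
  }

  /** Extension step: call the base, add the custom field, mark its key as handled. */
  static Map<String, Object> toBuilderWithScore(Map<Class<?>, Object> sentence,
                                                Set<Class<?>> keysToSerialize) {
    Map<String, Object> builder = toBuilder(sentence, keysToSerialize);
    if (keysToSerialize.remove(ScoreKey.class)) {
      builder.put("myNewField", sentence.get(ScoreKey.class));
    }
    return builder;
  }

  /** Any key still in the set was not written, so the result would be lossy. */
  static Map<String, Object> serialize(Map<Class<?>, Object> sentence,
                                       boolean withScore,
                                       boolean enforceLossless) {
    Set<Class<?>> remaining = new HashSet<>(sentence.keySet());
    Map<String, Object> builder = withScore
        ? toBuilderWithScore(sentence, remaining)
        : toBuilder(sentence, remaining);
    if (enforceLossless && !remaining.isEmpty()) {
      throw new IllegalStateException("Keys not serialized: " + remaining);
    }
    return builder;
  }

  public static void main(String[] args) {
    Map<Class<?>, Object> sentence = new HashMap<>();
    sentence.put(TextKey.class, "Hello world");
    sentence.put(ScoreKey.class, 42);
    System.out.println(serialize(sentence, true, true));
    try {
      serialize(sentence, false, true);
    } catch (IllegalStateException e) {
      System.out.println("lossy: " + e.getMessage());
    }
  }
}
```

In this model, forgetting the remove call in the extension step would make the lossless check fail even though the field was written, which is why the removal is part of the contract.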
Lastly, the new annotations must be registered in the original .proto file; this can be achieved by including a static block in the overriding class:
    static {
      ExtensionRegistry registry = ExtensionRegistry.newInstance();
      registry.add(MyAppProtos.myNewField);
      CoreNLPProtos.registerAllExtensions(registry);
    }

Modifier and Type | Class | Description |
---|---|---|
static class | ProtobufAnnotationSerializer.LossySerializationException | An exception to denote that the serialization would be lossy. |

Nested classes inherited from class AnnotationSerializer: AnnotationSerializer.IntermediateEdge, AnnotationSerializer.IntermediateNode, AnnotationSerializer.IntermediateSemanticGraph

Modifier and Type | Field | Description |
---|---|---|
boolean | enforceLosslessSerialization | If true, serialization is guaranteed to be lossless or else a runtime exception is thrown at serialization time. |

Constructor | Description |
---|---|
ProtobufAnnotationSerializer() | Create a new Annotation serializer outputting to a protocol buffer format. |
ProtobufAnnotationSerializer(boolean enforceLosslessSerialization) | Create a new Annotation serializer outputting to a protocol buffer format. |

Modifier and Type | Method | Description |
---|---|---|
CoreNLPProtos.IndexedWord | createIndexedWordProtoFromCL(CoreLabel cl) | |
CoreNLPProtos.IndexedWord | createIndexedWordProtoFromIW(IndexedWord iw) | |
static SemanticGraph | fromProto(CoreNLPProtos.DependencyGraph proto, java.util.List<CoreLabel> sentence, java.lang.String docid) | Voodoo magic to convert a serialized dependency graph into a SemanticGraph. |
Annotation | fromProto(CoreNLPProtos.Document proto) | Returns a complete document, intended to mimic a document passed as input to toProto(Annotation) as closely as possible. |
static Tree | fromProto(CoreNLPProtos.FlattenedParseTree proto) | Retrieve a Tree object from a flattened tree protobuf. |
static Language | fromProto(CoreNLPProtos.Language lang) | Return a CoreNLP language from a Protobuf language. |
static java.util.HashMap<java.lang.Integer,java.lang.String> | fromProto(CoreNLPProtos.MapIntString proto) | Convert a serialized Map back into a Java Map. |
static java.util.HashMap<java.lang.String,java.lang.String> | fromProto(CoreNLPProtos.MapStringString proto) | Convert a serialized Map back into a Java Map. |
static OperatorSpec | fromProto(CoreNLPProtos.Operator operator) | Return a CoreNLP Operator (Natural Logic operator) from a Protobuf operator. |
static Tree | fromProto(CoreNLPProtos.ParseTree proto) | Retrieve a Tree object from a saved protobuf. |
static Tree | fromProto(CoreNLPProtos.ParseTree proto, java.util.List<CoreLabel> tokens) | Retrieve a Tree object and then attach the tokens passed in. |
static Polarity | fromProto(CoreNLPProtos.Polarity polarity) | Return a CoreNLP Polarity (Natural Logic polarity) from a Protobuf polarity. |
static RelationTriple | fromProto(CoreNLPProtos.RelationTriple proto, Annotation doc, int sentenceIndex) | Return a RelationTriple object from the serialized representation. |
CoreMap | fromProto(CoreNLPProtos.Sentence proto) | Deprecated. |
static SentenceFragment | fromProto(CoreNLPProtos.SentenceFragment fragment, SemanticGraph tree) | Returns a sentence fragment from a given protocol buffer, and an associated parse tree. |
CoreLabel | fromProto(CoreNLPProtos.Token proto) | Create a CoreLabel from its serialized counterpart. |
protected CoreMap | fromProtoNoTokens(CoreNLPProtos.Sentence proto) | Create a CoreMap representing a sentence from this protocol buffer. |
protected void | loadSentenceMentions(CoreNLPProtos.Sentence proto, CoreMap sentence) | |
Pair<Annotation,java.io.InputStream> | read(java.io.InputStream is) | Read a single object from this stream. |
Annotation | readUndelimited(java.io.File in) | Read a single protocol buffer, which constitutes the entire stream. |
protected java.lang.String | recoverOriginalText(java.util.List<CoreLabel> tokens, CoreNLPProtos.Sentence sentence) | Recover the CoreAnnotations.TextAnnotation field of a sentence from the tokens. |
protected void | setSentenceTokenAnnotations(CoreMap sentence, CoreNLPProtos.Sentence protoSentence, java.util.List<CoreLabel> sentenceTokens, java.lang.String docid) | On a partially finished deserialized sentence, set some annotations which should reuse the same token objects as the parent sentence. |
static CoreNLPProtos.FlattenedParseTree | toFlattenedTree(Tree tree) | Turn the given tree into a FlattenedParseTree object from the proto. The new structure is useful because the ParseTree object can't represent trees past a certain depth. |
static void | toFlattenedTree(Tree tree, CoreNLPProtos.FlattenedParseTree.Builder treeBuilder) | |
static CoreNLPProtos.MapIntString | toMapIntStringProto(java.util.Map<java.lang.Integer,java.lang.String> map) | Serialize a Map (from Integers to Strings) to a proto. |
static CoreNLPProtos.MapStringString | toMapStringStringProto(java.util.Map<java.lang.String,java.lang.String> map) | Serialize a Map (from Strings to Strings) to a proto. |
CoreNLPProtos.Document | toProto(Annotation doc) | Create a Document proto from an Annotation instance. |
CoreNLPProtos.CorefChain | toProto(CorefChain chain) | Create a CorefChain protocol buffer from the given coref chain. |
CoreNLPProtos.Token | toProto(CoreLabel coreLabel) | Create a CoreLabel proto from a CoreLabel instance. |
CoreNLPProtos.Sentence | toProto(CoreMap sentence) | Create a Sentence proto from a CoreMap instance. |
CoreNLPProtos.Entity | toProto(EntityMention ent) | Serialize the given entity mention to the corresponding protocol buffer. |
static CoreNLPProtos.Language | toProto(Language lang) | Serialize a CoreNLP Language to a Protobuf Language. |
CoreNLPProtos.Mention | toProto(Mention mention) | |
static CoreNLPProtos.Operator | toProto(OperatorSpec op) | Return a Protobuf operator from an OperatorSpec (Natural Logic). |
static CoreNLPProtos.Polarity | toProto(Polarity pol) | Return a Protobuf polarity from a CoreNLP Polarity (Natural Logic). |
CoreNLPProtos.Relation | toProto(RelationMention rel) | Serialize the given relation mention to the corresponding protocol buffer. |
CoreNLPProtos.RelationTriple | toProto(RelationTriple triple) | Return a Protobuf RelationTriple from a RelationTriple. |
CoreNLPProtos.DependencyGraph | toProto(SemanticGraph graph) | Create a compact representation of the semantic graph for this dependency parse. |
CoreNLPProtos.DependencyGraph | toProto(SemanticGraph graph, boolean storeTokens) | Create a compact representation of the semantic graph for this dependency parse. |
static CoreNLPProtos.SentenceFragment | toProto(SentenceFragment fragment) | Return a Protobuf SentenceFragment from a SentenceFragment. |
CoreNLPProtos.SpeakerInfo | toProto(SpeakerInfo speakerInfo) | |
CoreNLPProtos.Timex | toProto(Timex timex) | Convert the given Timex object to a protocol buffer. |
static CoreNLPProtos.ParseTree | toProto(Tree parseTree) | Create a ParseTree proto from a Tree. |
CoreNLPProtos.Document.Builder | toProtoBuilder(Annotation doc) | Create a protobuf builder, rather than a compiled protobuf. |
protected CoreNLPProtos.Document.Builder | toProtoBuilder(Annotation doc, java.util.Set<java.lang.Class<?>> keysToSerialize) | The method to extend by subclasses of the Protobuf Annotator if custom additions are added to Documents. |
protected CoreNLPProtos.Token.Builder | toProtoBuilder(CoreLabel coreLabel, java.util.Set<java.lang.Class<?>> keysToSerialize) | The method to extend by subclasses of the Protobuf Annotator if custom additions are added to Tokens. |
CoreNLPProtos.Sentence.Builder | toProtoBuilder(CoreMap sentence) | Create a protobuf builder, rather than a compiled protobuf. |
protected CoreNLPProtos.Sentence.Builder | toProtoBuilder(CoreMap sentence, java.util.Set<java.lang.Class<?>> keysToSerialize) | The method to extend by subclasses of the Protobuf Annotator if custom additions are added to Sentences. |
CoreNLPProtos.NERMention | toProtoMention(CoreMap mention) | Convert a mention object to a protocol buffer. |
CoreNLPProtos.Quote | toProtoQuote(CoreMap quote) | Convert a quote object to a protocol buffer. |
CoreNLPProtos.Section | toProtoSection(CoreMap section) | Create a Section CoreMap protocol buffer from the given Section CoreMap. |
java.io.OutputStream | write(Annotation corpus, java.io.OutputStream os) | Append a single object to this stream. |

Methods inherited from class AnnotationSerializer: readCoreDocument, writeCoreDocument
public final boolean enforceLosslessSerialization
public ProtobufAnnotationSerializer()
public ProtobufAnnotationSerializer(boolean enforceLosslessSerialization)
enforceLosslessSerialization - If set to true, a ProtobufAnnotationSerializer.LossySerializationException is thrown at serialization time if the serialization would be lossy. If set to false, these exceptions are ignored.
public java.io.OutputStream write(Annotation corpus, java.io.OutputStream os) throws java.io.IOException
Specified by: write in class AnnotationSerializer
corpus - The document to serialize to the stream.
os - The output stream to serialize to.
Throws: java.io.IOException - Thrown if the underlying output stream throws the exception.
public Pair<Annotation,java.io.InputStream> read(java.io.InputStream is) throws java.io.IOException, java.lang.ClassNotFoundException, java.lang.ClassCastException
Specified by: read in class AnnotationSerializer
is - The input stream to read a document from.
Throws: java.io.IOException - Thrown if the underlying stream throws the exception.
Throws: java.lang.ClassNotFoundException - Thrown if an object was read that does not exist in the classpath.
Throws: java.lang.ClassCastException - Thrown if the signature of a class changed in a way that was incompatible with the serialized document.
public Annotation readUndelimited(java.io.File in) throws java.io.IOException
in - The file to read.
Throws: java.io.IOException - In case the stream cannot be read from.
public CoreNLPProtos.Token toProto(CoreLabel coreLabel)
coreLabel - The CoreLabel to convert.
protected CoreNLPProtos.Token.Builder toProtoBuilder(CoreLabel coreLabel, java.util.Set<java.lang.Class<?>> keysToSerialize)
The method to extend by subclasses of the Protobuf Annotator if custom additions are added to Tokens. In contrast to toProto(edu.stanford.nlp.ling.CoreLabel), this function returns a builder that can be extended.
coreLabel - The token to save to a protocol buffer.
keysToSerialize - A set tracking which keys have been saved. It's important to remove any keys added to the proto from this set, as the code tracks annotations to ensure lossless serialization.
public CoreNLPProtos.Sentence.Builder toProtoBuilder(CoreMap sentence)
sentence - The sentence to serialize.
public CoreNLPProtos.Sentence toProto(CoreMap sentence)
sentence - The CoreMap to convert. Note that it should not be a CoreLabel or an Annotation, and should represent a sentence.
Throws: java.lang.IllegalArgumentException - If the sentence is not a valid sentence (e.g., is a document or a word).
protected CoreNLPProtos.Sentence.Builder toProtoBuilder(CoreMap sentence, java.util.Set<java.lang.Class<?>> keysToSerialize)
The method to extend by subclasses of the Protobuf Annotator if custom additions are added to Sentences. In contrast to toProto(CoreMap), this function returns a builder that can be extended.
sentence - The sentence to save to a protocol buffer.
keysToSerialize - A set tracking which keys have been saved. It's important to remove any keys added to the proto from this set, as the code tracks annotations to ensure lossless serialization.
public CoreNLPProtos.Document toProto(Annotation doc)
doc - The Annotation to convert.
public CoreNLPProtos.Document.Builder toProtoBuilder(Annotation doc)
doc - The document to serialize.
protected CoreNLPProtos.Document.Builder toProtoBuilder(Annotation doc, java.util.Set<java.lang.Class<?>> keysToSerialize)
The method to extend by subclasses of the Protobuf Annotator if custom additions are added to Documents. In contrast to toProto(Annotation), this function returns a builder that can be extended.
doc - The document to save to a protocol buffer.
keysToSerialize - A set tracking which keys have been saved. It's important to remove any keys added to the proto from this set, as the code tracks annotations to ensure lossless serialization.
public static CoreNLPProtos.ParseTree toProto(Tree parseTree)
parseTree - The parse tree to convert.
public CoreNLPProtos.DependencyGraph toProto(SemanticGraph graph)
graph - The dependency graph to save.
public CoreNLPProtos.DependencyGraph toProto(SemanticGraph graph, boolean storeTokens)
graph - The dependency graph to save.
public CoreNLPProtos.CorefChain toProto(CorefChain chain)
chain - The coref chain to convert.
public CoreNLPProtos.Section toProtoSection(CoreMap section)
section - The CoreMap representing the section to serialize to a proto.
public CoreNLPProtos.IndexedWord createIndexedWordProtoFromIW(IndexedWord iw)
public CoreNLPProtos.IndexedWord createIndexedWordProtoFromCL(CoreLabel cl)
public CoreNLPProtos.Mention toProto(Mention mention)
public CoreNLPProtos.SpeakerInfo toProto(SpeakerInfo speakerInfo)
public CoreNLPProtos.Timex toProto(Timex timex)
timex - The timex to convert.
public CoreNLPProtos.Entity toProto(EntityMention ent)
ent - The entity mention to serialize.
public CoreNLPProtos.Relation toProto(RelationMention rel)
rel - The relation mention to serialize.
public static CoreNLPProtos.Language toProto(Language lang)
lang - The language to serialize.
public static CoreNLPProtos.Operator toProto(OperatorSpec op)
public static CoreNLPProtos.Polarity toProto(Polarity pol)
public static CoreNLPProtos.SentenceFragment toProto(SentenceFragment fragment)
public CoreNLPProtos.RelationTriple toProto(RelationTriple triple)
public static CoreNLPProtos.MapStringString toMapStringStringProto(java.util.Map<java.lang.String,java.lang.String> map)
map - The map to serialize.
public static CoreNLPProtos.MapIntString toMapIntStringProto(java.util.Map<java.lang.Integer,java.lang.String> map)
map - The map to serialize.
public CoreNLPProtos.Quote toProtoQuote(CoreMap quote)
public CoreNLPProtos.NERMention toProtoMention(CoreMap mention)
public CoreLabel fromProto(CoreNLPProtos.Token proto)
proto - The serialized protobuf to read the CoreLabel from.
@Deprecated public CoreMap fromProto(CoreNLPProtos.Sentence proto)
Annotation expects.
proto - The protocol buffer to read from.
protected CoreMap fromProtoNoTokens(CoreNLPProtos.Sentence proto)
proto - The serialized protobuf to read the sentence from.
protected void setSentenceTokenAnnotations(CoreMap sentence, CoreNLPProtos.Sentence protoSentence, java.util.List<CoreLabel> sentenceTokens, java.lang.String docid)
protected void loadSentenceMentions(CoreNLPProtos.Sentence proto, CoreMap sentence)
public Annotation fromProto(CoreNLPProtos.Document proto)
Returns a complete document, intended to mimic a document passed as input to toProto(Annotation) as closely as possible. That is, most common fields are serialized, but there is no guarantee that custom additions will be saved and retrieved.
proto - The protocol buffer to read the document from.
public static void toFlattenedTree(Tree tree, CoreNLPProtos.FlattenedParseTree.Builder treeBuilder)
public static CoreNLPProtos.FlattenedParseTree toFlattenedTree(Tree tree)
public static Tree fromProto(CoreNLPProtos.FlattenedParseTree proto)
proto - The serialized tree.
LabeledScoredTreeNode.
public static Tree fromProto(CoreNLPProtos.ParseTree proto, java.util.List<CoreLabel> tokens)
public static Tree fromProto(CoreNLPProtos.ParseTree proto)
proto - The serialized tree.
LabeledScoredTreeNode.
public static Language fromProto(CoreNLPProtos.Language lang)
public static OperatorSpec fromProto(CoreNLPProtos.Operator operator)
public static Polarity fromProto(CoreNLPProtos.Polarity polarity)
public static SemanticGraph fromProto(CoreNLPProtos.DependencyGraph proto, java.util.List<CoreLabel> sentence, java.lang.String docid)
Voodoo magic to convert a serialized dependency graph into a SemanticGraph.
fromProto(CoreNLPProtos.Document) method.
proto - The serialized representation of the graph. This relies heavily on indexing into the original document.
sentence - The raw sentence that this graph was saved from must be provided, as it is not saved in the serialized representation.
docid - A docid must be supplied, as it is not saved by the serialized representation.
public static RelationTriple fromProto(CoreNLPProtos.RelationTriple proto, Annotation doc, int sentenceIndex)
Return a RelationTriple object from the serialized representation. This requires a sentence and a document so that (1) we have a docid with which the dependency tree can be accurately rebuilt, and (2) we have references to the tokens to include in the relation triple.
proto - The serialized relation triple.
doc - The document we are deserializing. This document should already have a docid annotation set, if there is one.
sentenceIndex - The index of the sentence this extraction should be attached to.
public static SentenceFragment fromProto(CoreNLPProtos.SentenceFragment fragment, SemanticGraph tree)
fragment - The saved sentence fragment.
tree - The parse tree for the whole sentence.
Returns: A SentenceFragment object corresponding to the saved proto.
public static java.util.HashMap<java.lang.String,java.lang.String> fromProto(CoreNLPProtos.MapStringString proto)
proto - The serialized map.
public static java.util.HashMap<java.lang.Integer,java.lang.String> fromProto(CoreNLPProtos.MapIntString proto)
proto - The serialized map.
protected java.lang.String recoverOriginalText(java.util.List<CoreLabel> tokens, CoreNLPProtos.Sentence sentence)
Recover the CoreAnnotations.TextAnnotation field of a sentence from the tokens. This is useful if the text was not set in the protocol buffer, and therefore needs to be reconstructed from tokens.
tokens - The list of tokens representing this sentence.
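One way to picture rebuilding a sentence's text from its tokens is to concatenate each token's surface text, preceded by the whitespace recorded before it. The sketch below is a simplified stand-in under that assumption, not the actual code of recoverOriginalText: RecoverTextSketch, Tok, its fields, and recoverText are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; CoreLabel carries much more than these two fields.
final class RecoverTextSketch {

  /** Hypothetical token: surface text plus the whitespace that preceded it. */
  static final class Tok {
    final String text;
    final String before;

    Tok(String text, String before) {
      this.text = text;
      this.before = before;
    }
  }

  /** Rebuild the sentence text by joining tokens with their recorded whitespace. */
  static String recoverText(List<Tok> tokens) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < tokens.size(); i++) {
      Tok t = tokens.get(i);
      // Whitespace before the first token lies outside the sentence, so skip it.
      if (i > 0 && t.before != null) {
        sb.append(t.before);
      }
      sb.append(t.text);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    List<Tok> tokens = Arrays.asList(
        new Tok("Hello", ""),
        new Tok(",", ""),
        new Tok("world", " "),
        new Tok("!", ""));
    System.out.println(recoverText(tokens)); // prints "Hello, world!"
  }
}
```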