public class Ssurgeon
extends java.lang.Object
<ssurgeon-pattern-list>
<ssurgeon-pattern>
<uid>...</uid>
<notes>...</notes>
<semgrex>...</semgrex>
<language>...</language>
<edit-list>...</edit-list>
</ssurgeon-pattern>
</ssurgeon-pattern-list>
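As a rough illustration of the layout above, the following sketch reads one pattern entry with the JDK's DOM parser and looks up each child tag by name. It is not Ssurgeon's own reader: the class `PatternListSketch`, the helper `firstChildText`, and the sample `uid` and `notes` values are assumptions made for this example only.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper class; only JDK XML classes are used here.
final class PatternListSketch {
  // Sample content; the uid and notes values are made up for illustration.
  static final String SAMPLE =
      "<ssurgeon-pattern-list>"
      + "<ssurgeon-pattern>"
      + "<uid>1</uid>"
      + "<notes>relabel csubj when nsubj is also present</notes>"
      + "<semgrex>{}=source &gt;nsubj {} &gt;csubj=bad {}</semgrex>"
      + "<language>UniversalEnglish</language>"
      + "<edit-list>relabelNamedEdge -edge bad -reln advcl</edit-list>"
      + "</ssurgeon-pattern>"
      + "</ssurgeon-pattern-list>";

  /** Returns the text of the first direct child with the given tag, or null if absent. */
  static String firstChildText(Element parent, String tag) {
    NodeList children = parent.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node child = children.item(i);
      if (child instanceof Element && ((Element) child).getTagName().equals(tag)) {
        return child.getTextContent().trim();
      }
    }
    return null;
  }

  /** Parses the XML text and returns its first ssurgeon-pattern element. */
  static Element firstPattern(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      return (Element) doc.getElementsByTagName("ssurgeon-pattern").item(0);
    } catch (Exception e) {
      throw new IllegalArgumentException("Could not parse pattern list", e);
    }
  }

  public static void main(String[] args) {
    Element pattern = firstPattern(SAMPLE);
    for (String tag : new String[] {"uid", "notes", "semgrex", "language", "edit-list"}) {
      System.out.println(tag + " = " + firstChildText(pattern, tag));
    }
  }
}
```

In practice the patterns would be loaded through the class's own `readFromFile` or `readFromString` methods; the sketch only shows which tag carries which piece of information.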
uid is the id of the Ssurgeon operation. notes are comments on the Ssurgeon. semgrex is a Semgrex pattern to use when matching for this operation. edit-list is the actual Ssurgeon operation to execute. language is an optional field that determines which language formalism to use when making new dependencies. By default it will be English for SD when using the Java API, although most people probably want UniversalEnglish for UD (including non-English UD datasets).
Available operations and their arguments include:
addEdge -gov node1 -dep node2 -reln depType -weight 0.5
relabelNamedEdge -edge edgename -reln depType
removeEdge -gov node1 -dep node2 -reln depType
removeNamedEdge -edge edgename
reattachNamedEdge -edge edgename -gov gov -dep dep
addDep -gov node1 -reln depType -position where ...attributes...
editNode -node node ...attributes...
lemmatize -node node
combineMWT -node node -word word
splitWord -node node -headIndex idx -reln depType -regex w1 -regex w2 ...
setRoots n1 (n2 n3 ...)
mergeNodes n1 n2
killAllIncomingEdges -node node
delete -node node
deleteLeaf -node node
killNonRootedNodes
reindexGraph
addEdge adds a new edge between two existing nodes.
-gov and -dep will be nodes matched by the Semgrex pattern.
-reln is the name of the dependency type to add.
relabelNamedEdge changes the dependency type of a named edge.
-edge is the name of the edge in the Semgrex pattern.
-reln is the name of the dependency type to use.
removeEdge deletes an edge based on its description.
-gov is the governor to delete, a named node from the Semgrex pattern.
-dep is the dependent to delete, a named node from the Semgrex pattern.
-reln is the name of the dependency to delete.
If -gov or -dep are left empty, then all (matching) edges to or from the
remaining argument will be deleted.
removeNamedEdge deletes an edge based on its name.
-edge is the name of the edge in the Semgrex pattern.
reattachNamedEdge changes an edge's gov and/or dep based on its name.
-edge is the name of the edge in the Semgrex pattern.
-gov is the governor to attach to, a named node from the Semgrex pattern. If left blank, no edit.
-dep is the dependent to attach to, a named node from the Semgrex pattern. If left blank, no edit.
At least one of -gov or -dep must be set.
addDep adds a word and a dependency arc to the dependency graph.
-gov is the governor to attach to, a named node from the Semgrex pattern.
-reln is the name of the dependency type to use.
-position is where in the sentence the word should go. - will be the first word of the sentence,
+ will be the last word of the sentence, and -node or +node will be before or after the
named node.
...attributes... means any attributes which can be set from a string or numerical value
e.g. -text ... sets the text of the word
-pos ... sets the xpos of the word, -cpos ... sets the upos of the word, etc.
You cannot set the index of a word this way; an exception will be thrown.
To put whitespace in an attribute, you can quote it.
So, for example, a Vietnamese word can be set as -word "xin chào"
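The quoting rule can be pictured with a small tokenizer. This is a minimal sketch of one way an edit line could be split so that a quoted value stays together; it is not the parser Ssurgeon actually uses, and `EditLineTokenizerSketch` and `tokenize` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tokenizer: splits on whitespace, except inside double quotes.
final class EditLineTokenizerSketch {
  static List<String> tokenize(String line) {
    List<String> tokens = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    boolean inQuotes = false;
    boolean hasToken = false;
    for (int i = 0; i < line.length(); i++) {
      char c = line.charAt(i);
      if (c == '"') {
        // Quotes toggle quoting mode and are not part of the value.
        inQuotes = !inQuotes;
        hasToken = true;
      } else if (Character.isWhitespace(c) && !inQuotes) {
        // Unquoted whitespace ends the current token, if there is one.
        if (hasToken) {
          tokens.add(current.toString());
          current.setLength(0);
          hasToken = false;
        }
      } else {
        current.append(c);
        hasToken = true;
      }
    }
    if (inQuotes) {
      throw new IllegalArgumentException("Unterminated quote in: " + line);
    }
    if (hasToken) {
      tokens.add(current.toString());
    }
    return tokens;
  }

  public static void main(String[] args) {
    // \u00e0 is the letter a with grave accent, as in the Vietnamese example.
    System.out.println(tokenize("editNode -node word -word \"xin ch\u00e0o\""));
  }
}
```

With this splitting, the value after -word arrives as a single token containing the space, which is the behavior the quoting rule describes.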
editNode will edit the attributes of a word.
-node is the node to edit.
...attributes... are the attributes to change, same as with addDep
-morphofeatures ... will set the features to be exactly as written.
-updateMorphoFeatures ... will edit or add the features without overwriting existing features.
-removeMorphoFeatures ... will remove this one morpho feature.
-remove ... will remove the attribute entirely, such as doing -remove lemma to remove the lemma.
lemmatize will put a lemma on a word.
-node is the node to edit.
This only works on English text.
combineMWT will add MWT attributes to a sequence of two or more words.
-node (repeated) gives the nodes to edit.
-word is the optional text to use for the new MWT. If not set, the words will be concatenated.
splitWord will split a single word into multiple pieces from the text of the current word
-node is the node to split.
-headIndex is the index (counting from 0) of the word piece to make the head.
-reln is the name of the dependency type to use. Pieces other than the head will connect to the head using this relation.
-regex is a regex which must match the matched node. All matching groups will be concatenated to form a new word. At least 2 regexes are needed to split a word.
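The -regex behavior described for splitWord can be sketched with `java.util.regex` alone. The code below is an illustration of the stated rule (each regex must match the word, and its groups are concatenated into one piece), not the library's implementation; `SplitWordSketch` and `pieces` are hypothetical names, and treating "must match" as a full match of the word is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of how each -regex could yield one word piece.
final class SplitWordSketch {
  static List<String> pieces(String word, List<String> regexes) {
    if (regexes.size() < 2) {
      throw new IllegalArgumentException("Need at least 2 regexes to split a word");
    }
    List<String> result = new ArrayList<>();
    for (String regex : regexes) {
      Matcher m = Pattern.compile(regex).matcher(word);
      // Assumption: the regex has to match the entire word.
      if (!m.matches()) {
        throw new IllegalArgumentException(regex + " does not match " + word);
      }
      // Concatenate all capturing groups that took part in the match.
      StringBuilder piece = new StringBuilder();
      for (int g = 1; g <= m.groupCount(); g++) {
        if (m.group(g) != null) {
          piece.append(m.group(g));
        }
      }
      result.add(piece.toString());
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(pieces("cannot", Arrays.asList("(can)not", "can(not)")));
  }
}
```

Under these assumptions, -headIndex 0 would make the first piece the head and attach the second piece to it with the relation given by -reln.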
setRoots sets the roots of the sentence to a new root.
n1, n2, ... are the names of the nodes from the Semgrex to use as the root(s).
This is best done in conjunction with other operations which actually manipulate the structure
of the graph, or the new root will weirdly have dependents and the graph will be incorrect.
mergeNodes will merge n1 and n2, assuming they are mergeable.
The nodes can be merged if one of the nodes is the head of a phrase
and the other node depends on the head. TODO: can make it process
more than two nodes at once.
killAllIncomingEdges deletes all edges to a node.
-node is the node to edit.
Note that this is the same as removeEdge with only the dependent set.
delete deletes all nodes reachable from a specific node.
-node is the node to delete.
You will only want to do this after separating the node from the parts of the graph you want to keep.
deleteLeaf deletes a node as long as it is a leaf.
-node is the node to delete.
If the node is not a leaf (a leaf is a node with no outgoing edges), it will not be deleted.
killNonRootedNodes searches the graph and deletes all nodes which have no path to a root.
reindexGraph reindexes the graph from 1 in case there are gaps or the node indices start later than 1. (Warning: does not work for first index less than 1)
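The effect of reindexing can be shown on plain arrays. This is a minimal sketch assuming a sentence is given as word indices in sentence order plus a head index per word (0 for the root), in the style of CoNLL-U; `ReindexSketch`, `renumber`, and `remapHeads` are hypothetical names and do not reflect how the library stores its graph.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: close gaps in word indices so they run 1..n.
final class ReindexSketch {
  /** Maps each old index (given in sentence order) to its new 1-based position. */
  static Map<Integer, Integer> renumber(int[] indices) {
    Map<Integer, Integer> mapping = new LinkedHashMap<>();
    int next = 1;
    for (int old : indices) {
      mapping.put(old, next++);
    }
    return mapping;
  }

  /** Rewrites head indices with the mapping; 0 (the root marker) is kept as is. */
  static int[] remapHeads(int[] heads, Map<Integer, Integer> mapping) {
    int[] result = new int[heads.length];
    for (int i = 0; i < heads.length; i++) {
      if (heads[i] == 0) {
        result[i] = 0;
      } else {
        Integer mapped = mapping.get(heads[i]);
        if (mapped == null) {
          throw new IllegalArgumentException("Unknown head index " + heads[i]);
        }
        result[i] = mapped;
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Words numbered 3, 4, 7 (a gap, and not starting at 1); the middle word is the root.
    Map<Integer, Integer> mapping = renumber(new int[] {3, 4, 7});
    System.out.println(mapping);
    System.out.println(Arrays.toString(remapHeads(new int[] {4, 0, 4}, mapping)));
  }
}
```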
A practical example comes from the UD_English-Pronouns
dataset, where some words had both nsubj and csubj
dependencies:
1 Hers hers PRON PRP Gender=Fem|Number=Sing|Person=3|Poss=Yes|PronType=Prs 3 nsubj _ _
2 is be AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 3 cop _ _
3 easy easy ADJ JJ Degree=Pos 0 root _ _
4 to to PART TO _ 5 mark _ _
5 clean clean VERB VB VerbForm=Inf 3 csubj _ SpaceAfter=No
6 . . PUNCT . _ 5 punct _ _
We can update this with the following Semgrex/Ssurgeon pair:
{}=source >nsubj {} >csubj=bad {}
relabelNamedEdge -edge bad -reln advcl
The result is that the csubj dependency is relabeled as advcl.
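To make the effect of this pair concrete, the sketch below applies the same rule directly to CoNLL-U rows in plain Java: if a word has both an nsubj and a csubj dependent, the csubj relation becomes advcl. It does not call Semgrex or Ssurgeon; `RelabelCsubjSketch` and `relabel` are hypothetical names, and the rows are assumed to be tab-separated with HEAD and DEPREL in columns 7 and 8.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the Semgrex/Ssurgeon pair, working on raw CoNLL-U rows.
final class RelabelCsubjSketch {
  static final List<String> SAMPLE = Arrays.asList(
      "1\tHers\thers\tPRON\tPRP\tGender=Fem|Number=Sing|Person=3|Poss=Yes|PronType=Prs\t3\tnsubj\t_\t_",
      "2\tis\tbe\tAUX\tVBZ\tMood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin\t3\tcop\t_\t_",
      "3\teasy\teasy\tADJ\tJJ\tDegree=Pos\t0\troot\t_\t_",
      "4\tto\tto\tPART\tTO\t_\t5\tmark\t_\t_",
      "5\tclean\tclean\tVERB\tVB\tVerbForm=Inf\t3\tcsubj\t_\tSpaceAfter=No",
      "6\t.\t.\tPUNCT\t.\t_\t5\tpunct\t_\t_");

  static List<String> relabel(List<String> rows) {
    // First pass: collect the heads that govern an nsubj dependent.
    Set<String> headsWithNsubj = new HashSet<>();
    for (String row : rows) {
      String[] cols = row.split("\t");
      if (cols.length >= 8 && cols[7].equals("nsubj")) {
        headsWithNsubj.add(cols[6]);
      }
    }
    // Second pass: relabel csubj dependents of those same heads.
    List<String> result = new ArrayList<>();
    for (String row : rows) {
      String[] cols = row.split("\t");
      if (cols.length >= 8 && cols[7].equals("csubj") && headsWithNsubj.contains(cols[6])) {
        cols[7] = "advcl";
      }
      result.add(String.join("\t", cols));
    }
    return result;
  }

  public static void main(String[] args) {
    for (String row : relabel(SAMPLE)) {
      System.out.println(row);
    }
  }
}
```

Only the row for "clean" changes; every other row is passed through untouched, which mirrors the fact that the Ssurgeon edit touches only the edge named bad.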
For the most part, each of these operations is already bomb-proof,
i.e. the pattern will execute once and not repeat on the same part of
the same dependency graph.
However, in the case of addDep, it is not possible to automatically bomb-proof the command,
as certain sentences may legitimately have multiple words with the same attributes as dependents
of the same governor. In this case, it is necessary to make the Semgrex pattern itself bomb-proof.
As an example, if the intent is to change "Jennifer has lovely antennae" to "Jennifer has lovely blue antennae", the following command would "bomb":
{word:antennae}=antennae
addDep -gov antennae -reln dep -word blue
The following would not:
{word:antennae}=antennae !> {word:blue}
addDep -gov antennae -reln dep -word blue
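The difference between the two patterns can be seen in a toy model of the match-and-edit loop. This is only a sketch under simplifying assumptions: the dependents of "antennae" are a list of strings, the unguarded pattern always matches, the guarded pattern matches only when no dependent is "blue", and an iteration cap stands in for whatever limit a real run would hit. `BombProofSketch` and `run` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of repeatedly applying "add a dependent 'blue'" while the pattern matches.
final class BombProofSketch {
  /** Returns how many edits were applied before the pattern stopped matching or the cap was hit. */
  static int run(List<String> dependents, boolean guarded, int maxIterations) {
    int edits = 0;
    while (edits < maxIterations) {
      // Unguarded: the node "antennae" is always there, so the pattern always matches.
      // Guarded: the pattern matches only if there is no dependent "blue" yet.
      boolean matches = !guarded || !dependents.contains("blue");
      if (!matches) {
        break;
      }
      dependents.add("blue");
      edits++;
    }
    return edits;
  }

  public static void main(String[] args) {
    System.out.println(run(new ArrayList<>(Arrays.asList("lovely")), false, 10));
    System.out.println(run(new ArrayList<>(Arrays.asList("lovely")), true, 10));
  }
}
```

The unguarded run only stops because of the cap, while the guarded run stops by itself after a single edit, which is the role the `!> {word:blue}` clause plays in the second pattern.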
Some patterns which leave the node in the same format will bomb because of the way the dirty bit works. For example:
{word:/pattern/;cpos:VERB;morphofeatures:{VerbForm:Inf}}=word
EditNode -node word -remove morphofeatures
EditNode -node word -updatemorphofeatures Aspect=Imp -updatemorphofeatures VerbForm=Inf
Here, the end result will be the same after at most one iteration through the loop,
but -remove morphofeatures sets the dirty bit and does not go away
when -updatemorphofeatures puts back the deleted features.
TODO: this one at least can be fixed

| Modifier and Type | Class and Description |
|---|---|
| static class | Ssurgeon.ArgsBox |
| static class | Ssurgeon.RUNTYPE |
| protected static class | Ssurgeon.SsurgeonArgs |

| Modifier and Type | Field and Description |
|---|---|
| protected static Ssurgeon.ArgsBox | argsBox |
| static java.lang.String | DEP_NODENAME_ARG |
| static java.lang.String | EDGE_NAME_ARG |
| static java.lang.String | EXACT_ARG |
| static java.lang.String | GOV_NODENAME_ARG |
| static java.lang.String | HEAD_INDEX_ARG |
| static java.lang.String | HEAD_INDEX_LOWER_ARG |
| static java.lang.String | NAME_ARG |
| static java.lang.String | NODE_PROTO_ARG |
| static java.lang.String | NODENAME_ARG |
| static java.lang.String | POSITION_ARG |
| static java.lang.String | REGEX_ARG |
| static java.lang.String | RELN_ARG |
| static java.lang.String | REMOVE |
| static java.lang.String | REMOVE_MORPHO_FEATURES |
| static java.lang.String | UPDATE_MORPHO_FEATURES |
| static java.lang.String | UPDATE_MORPHO_FEATURES_LOWER |
| static java.lang.String | WEIGHT_ARG |

| Modifier and Type | Method and Description |
|---|---|
| static SsurgPred | assemblePredFromXML(org.w3c.dom.Element elt) Constructs a SsurgPred structure from file, given the root element. |
| java.util.Collection<SemanticGraph> | exhaustFromPatterns(java.util.List<SsurgeonPattern> patternList, SemanticGraph sg) Similar to expandFromPatterns, but performs an exhaustive search, performing simplifications on the graphs until exhausted. |
| java.util.List<SemanticGraph> | expandFromPatterns(java.util.List<SsurgeonPattern> patternList, SemanticGraph sg) Given a list of SsurgeonPattern edit scripts, and a SemanticGraph to operate over, returns a list of expansions of that graph, with the result of each edit applied against a copy of the graph. |
| static java.lang.String | getEltText(org.w3c.dom.Element element) For a given Element, treats the first child as a text element and returns its value. |
| static SsurgeonPattern | getOperationFromFile(java.lang.String path) Given a path to a file, converts it into a SsurgeonPattern. TODO: finish implementing this stub. |
| SsurgeonWordlist | getResource(java.lang.String id) Returns the resource with the given id. |
| java.util.Collection<SsurgeonWordlist> | getResources() |
| static java.lang.String | getTagText(org.w3c.dom.Element element, java.lang.String tag) For the given element, returns the text for the first child Element with the given tag. |
| void | initLog(java.io.File logFilePath) |
| static Ssurgeon | inst() |
| static void | main(java.lang.String[] args) Performs a simple test and print of a given file. |
| static SsurgeonEdit | parseEditLine(java.lang.String editLine, java.util.Map<java.lang.String,java.lang.String> attributeArgs, Language language) Given a string entry, converts it into a SsurgeonEdit object. |
| java.util.List<SsurgeonPattern> | readFromDirectory(java.io.File dir) Reads all Ssurgeon patterns from the files in the given directory. |
| java.util.List<SsurgeonPattern> | readFromDocument(org.w3c.dom.Document doc) |
| java.util.List<SsurgeonPattern> | readFromFile(java.io.File file) Given a path to a file containing a list of SsurgeonPatterns, returns them as a list. TODO: deal with resources. |
| java.util.List<SsurgeonPattern> | readFromString(java.lang.String text) |
| void | setLogPrefix(java.lang.String logPrefix) |
| static SsurgeonPattern | ssurgeonPatternFromXML(org.w3c.dom.Element elt) Given the root Element for a SsurgeonPattern (SSURGEON_ELEM_TAG), converts it into its corresponding SsurgeonPattern object. |
| void | testRead(java.io.File tgtDirPath) Reads in the test file and prints it in readable form (for debugging). |
| static void | writeToFile(java.io.File tgtFile, java.util.List<SsurgeonPattern> patterns) Given a target filepath and a list of Ssurgeon patterns, writes them out as XML forms. |
| static java.lang.String | writeToString(SsurgeonPattern pattern) |

public static final java.lang.String GOV_NODENAME_ARG
public static final java.lang.String DEP_NODENAME_ARG
public static final java.lang.String EDGE_NAME_ARG
public static final java.lang.String NODENAME_ARG
public static final java.lang.String REGEX_ARG
public static final java.lang.String EXACT_ARG
public static final java.lang.String RELN_ARG
public static final java.lang.String NODE_PROTO_ARG
public static final java.lang.String WEIGHT_ARG
public static final java.lang.String HEAD_INDEX_ARG
public static final java.lang.String HEAD_INDEX_LOWER_ARG
public static final java.lang.String NAME_ARG
public static final java.lang.String POSITION_ARG
public static final java.lang.String UPDATE_MORPHO_FEATURES
public static final java.lang.String UPDATE_MORPHO_FEATURES_LOWER
public static final java.lang.String REMOVE
public static final java.lang.String REMOVE_MORPHO_FEATURES
protected static Ssurgeon.ArgsBox argsBox
public static Ssurgeon inst()
public void initLog(java.io.File logFilePath) throws java.io.IOException
public void setLogPrefix(java.lang.String logPrefix)
public java.util.List<SemanticGraph> expandFromPatterns(java.util.List<SsurgeonPattern> patternList, SemanticGraph sg) throws java.lang.Exception
public java.util.Collection<SemanticGraph> exhaustFromPatterns(java.util.List<SsurgeonPattern> patternList, SemanticGraph sg) throws java.lang.Exception
public static SsurgeonPattern getOperationFromFile(java.lang.String path)
public SsurgeonWordlist getResource(java.lang.String id)
public java.util.Collection<SsurgeonWordlist> getResources()
public static SsurgeonEdit parseEditLine(java.lang.String editLine, java.util.Map<java.lang.String,java.lang.String> attributeArgs, Language language)
public static void writeToFile(java.io.File tgtFile,
java.util.List<SsurgeonPattern> patterns)
public static java.lang.String writeToString(SsurgeonPattern pattern)
public java.util.List<SsurgeonPattern> readFromString(java.lang.String text)
public java.util.List<SsurgeonPattern> readFromFile(java.io.File file)
public java.util.List<SsurgeonPattern> readFromDocument(org.w3c.dom.Document doc)
public java.util.List<SsurgeonPattern> readFromDirectory(java.io.File dir) throws java.lang.Exception
public static SsurgeonPattern ssurgeonPatternFromXML(org.w3c.dom.Element elt)
public static SsurgPred assemblePredFromXML(org.w3c.dom.Element elt)
Constructs a SsurgPred structure from file, given the root element.
public void testRead(java.io.File tgtDirPath)
throws java.lang.Exception
public static java.lang.String getTagText(org.w3c.dom.Element element,
java.lang.String tag)
public static java.lang.String getEltText(org.w3c.dom.Element element)
public static void main(java.lang.String[] args)