public abstract class SemgrexPattern
extends java.lang.Object
implements java.io.Serializable
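A SemgrexPattern is a pattern for matching node and edge configurations in a dependency graph. Patterns are written in a style similar to tgrep or Tregex and operate over SemanticGraph objects, which contain IndexedWord nodes. Unlike tgrep but like Unix grep, there is no pre-indexing of the data to be searched. Rather, there is a linear scan through the graph where matches are sought.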
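A node is written as a set of attribute:value pairs inside curly braces, so {} matches any node. Values may be plain strings or regular expressions delimited by slashes. For example, {lemma:slice;tag:/VB.*/} represents any verb node with "slice" as its lemma. Attributes are extracted using AnnotationLookup.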
The root of the graph can be marked by the $ sign, that is {$}
represents the root node.
A node description can be negated with '!'. !{lemma:boy} matches any token whose lemma isn't "boy".
Another way to negate a node description is with a negative lookahead regex, although this starts to look a little ugly. For example, {lemma:/^(?!boy).*$/} will also match tokens with a lemma that isn't "boy". Strictly, this regex rejects every lemma that begins with "boy", such as "boyfriend"; writing the lookahead as (?!boy$) excludes only the exact lemma. Note, however, that if you use this style, there needs to be some lemma attached to the token.
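The difference between the two lookahead forms can be checked with plain java.util.regex, without any graph at all. The sketch below is only an illustration: the class name, the accepts helper, and the sample lemmas are hypothetical, and it assumes the regex is applied to the whole lemma string.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; it only exercises the regexes, not Semgrex itself.
final class NegativeLookaheadDemo {
    // The regex from the text, without the surrounding Semgrex slashes.
    static final Pattern NOT_BOY_PREFIX = Pattern.compile("^(?!boy).*$");
    // Variant with the lookahead anchored at the end: excludes exactly "boy".
    static final Pattern NOT_EXACTLY_BOY = Pattern.compile("^(?!boy$).*$");

    // Assumption: a token without a lemma (null here) can never satisfy a regex,
    // which mirrors the note that some lemma must be attached to the token.
    static boolean accepts(Pattern p, String lemma) {
        return lemma != null && p.matcher(lemma).matches();
    }

    public static void main(String[] args) {
        for (String lemma : new String[] {"girl", "boy", "boyfriend"}) {
            System.out.println(lemma
                + " prefix-form=" + accepts(NOT_BOY_PREFIX, lemma)
                + " exact-form=" + accepts(NOT_EXACTLY_BOY, lemma));
        }
    }
}
```

With these inputs the two forms agree on "girl" and "boy" and differ only on "boyfriend", which the first form rejects and the second accepts.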
A relation is written as a symbol followed by the name of the relation. A relation name of % means any relationship. It is also OK simply to omit the relation name altogether.
Currently supported node relations and their symbols:
Symbol | Meaning |
---|---|
A <reln B | A is the dependent of a relation reln with B |
A >reln B | A is the governor of a relation reln with B |
A <<reln B | A is the dependent of a relation reln in a chain to B following dep->gov paths |
A >>reln B | A is the governor of a relation reln in a chain to B following gov->dep paths |
A x,y<<reln B | A is the dependent of a relation reln in a chain to B following dep->gov paths between distances of x and y |
A x,y>>reln B | A is the governor of a relation reln in a chain to B following gov->dep paths between distances of x and y |
A == B | A and B are the same nodes in the same graph |
A . B | A immediately precedes B, i.e. A.index() == B.index() - 1 |
A - B | A immediately succeeds B, i.e. A.index() == B.index() + 1 |
A .. B | A precedes B, i.e. A.index() < B.index() |
A -- B | A succeeds B, i.e. A.index() > B.index() |
A $+ B | B is a right immediate sibling of A, i.e. A and B have the same parent and A.index() == B.index() - 1 |
A $- B | B is a left immediate sibling of A, i.e. A and B have the same parent and A.index() == B.index() + 1 |
A $++ B | B is a right sibling of A, i.e. A and B have the same parent and A.index() < B.index() |
A $-- B | B is a left sibling of A, i.e. A and B have the same parent and A.index() > B.index() |
A @ B | A is aligned to B (this is only used when you have two dependency graphs which are aligned) |
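The index-based rows of the table reduce to simple comparisons on token indices, plus a shared-parent check for the sibling relations. The following is a minimal sketch of those conditions only; the Node class and method names are hypothetical, and it assumes each node has a single parent, which a real dependency graph does not guarantee.

```java
// Hypothetical sketch of the index conditions from the table; not the library's implementation.
final class IndexRelationSketch {
    // Stand-in for a graph node: its token index and the index of its (single) parent.
    static final class Node {
        final int index;
        final int parent;
        Node(int index, int parent) {
            this.index = index;
            this.parent = parent;
        }
    }

    // A . B : A immediately precedes B.
    static boolean immediatelyPrecedes(Node a, Node b) {
        return a.index == b.index - 1;
    }

    // A .. B : A precedes B.
    static boolean precedes(Node a, Node b) {
        return a.index < b.index;
    }

    // A $+ B : same parent, and A sits directly to the left of B.
    static boolean rightImmediateSibling(Node a, Node b) {
        return a.parent == b.parent && a.index == b.index - 1;
    }

    // A $++ B : same parent, and A sits anywhere to the left of B.
    static boolean rightSibling(Node a, Node b) {
        return a.parent == b.parent && a.index < b.index;
    }

    public static void main(String[] args) {
        Node a = new Node(2, 1);
        Node b = new Node(3, 1);
        System.out.println("A . B: " + immediatelyPrecedes(a, b));
        System.out.println("A $+ B: " + rightImmediateSibling(a, b));
    }
}
```

The succeeding and left-sibling relations follow by swapping the direction of each comparison.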
In a chain of relations, all relations are relative to the first node in the chain. For example, "{} >nsubj {} >dobj {}" means "any node that is the governor of both an nsubj and a dobj relation". If instead what you want is a node that is the governor of an nsubj relation with a node that is itself the governor of a dobj relation, you should use parentheses and write: "{} >nsubj ({} >dobj {})".
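The scoping rule can be illustrated with a toy edge list. This is a sketch of the two readings described above, not the Semgrex matcher: the Edge class, the method names, and the sample sentence are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch contrasting the flat and parenthesized chains.
final class ChainScopeSketch {
    // Stand-in for a typed dependency edge: gov -reln-> dep.
    static final class Edge {
        final String gov;
        final String reln;
        final String dep;
        Edge(String gov, String reln, String dep) {
            this.gov = gov;
            this.reln = reln;
            this.dep = dep;
        }
    }

    // True if node is the governor of at least one edge of the given type.
    static boolean governs(List<Edge> graph, String node, String reln) {
        return graph.stream().anyMatch(e -> e.gov.equals(node) && e.reln.equals(reln));
    }

    // {} >nsubj {} >dobj {} : both relations are checked against the first node.
    static boolean flat(List<Edge> graph, String node) {
        return governs(graph, node, "nsubj") && governs(graph, node, "dobj");
    }

    // {} >nsubj ({} >dobj {}) : the dobj relation is checked against the nsubj dependent.
    static boolean nested(List<Edge> graph, String node) {
        return graph.stream().anyMatch(e ->
            e.gov.equals(node) && e.reln.equals("nsubj") && governs(graph, e.dep, "dobj"));
    }

    public static void main(String[] args) {
        List<Edge> graph = Arrays.asList(
            new Edge("ate", "nsubj", "Bill"),
            new Edge("ate", "dobj", "muffins"));
        System.out.println("flat matches 'ate': " + flat(graph, "ate"));
        System.out.println("nested matches 'ate': " + nested(graph, "ate"));
    }
}
```

In this toy graph "ate" governs both relations directly, so only the flat reading applies to it; the nested reading would need "Bill" to govern a dobj of its own.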
If a relation type is specified for the << relation, the relation type is only used for the first relation in the sequence. Therefore, if B depends on A with the relation type foo, the pattern {} <<foo {} will then match B and everything that depends on B.
Similarly, if a relation type is specified for the >> relation, the relation type is only used for the last relation in the sequence. Therefore, if A governs B with the relation type foo, the pattern {} >>foo {} will then match A and all of the nodes which have a sequence leading to A.
Relations can be combined with '&' and '|' and grouped using brackets '[' and ']'. So the expression
{} [<subj {} | <agent {}] & @ {}
matches a node that is the dependent of either a subj or an agent relation and has an alignment to some other node.
Relations can be negated with the '!' operator, in which case the expression will match only if there is no node satisfying the relation.
Relations can be made optional with the '?' operator. This way the expression will match even if the optional relation is not satisfied.
The operator ":" partitions a pattern into separate patterns, each of which must be matched. For example, the following is a pattern where the matched node must have both "foo" and "bar" as descendants:
{}=a >> {word:foo} : {}=a >> {word:bar}
This pattern could have been written
{}=a >> {word:foo} >> {word:bar}
However, for more complex examples, partitioning a pattern may make
it more readable.
Nodes can be given names using '='. For example, ({tag:NN}=noun) will match a singular noun node; after a match is found, the matched node can be retrieved by name using SemgrexMatcher.getNode(String o) with the String argument "noun" (not "=noun"). Note that you are not allowed to name a node that is under the scope of a negation operator (the semantics would be unclear, since you can't store a node that never gets matched). Trying to do so will cause a ParseException to be thrown. Named nodes can be put within the scope of an optionality operator.
Named nodes that refer back to previously named nodes need not have a node description; this is known as "backreferencing". In this case, the expression will match only when all instances of the same name get matched to the same node.
For example:
{} >dobj ({} > {}=foo) >mod ({} > {}=foo)
will match a graph in which there are two nodes, X and Y, for which X is the grandparent of Y and there are two paths to Y, one of which goes through a dobj and one of which goes through a mod.
Relations can also be named. For example:
{idx:1} >=reln {idx:2}
The name of the relation will then be stored in the matcher and can be extracted with getRelnName("reln"). If the relation is later referenced a second time, the type of relation must be the same, or the potential match will not be accepted.
In the case of ancestor and descendant relations, the last relation in the sequence of relations is the name used.
Edges can be named in a similar way, using '~'. For example:
{$} >~edge {}
The edge itself is now stored with the matcher and can be extracted with getEdgeName("edge"). If the edge is later referenced a second time, the exact edge must be the same, or the potential match will not be accepted.
Modifier and Type | Class and Description |
---|---|
static class | SemgrexPattern.OutputFormat |
Modifier and Type | Method and Description |
---|---|
static SemgrexPattern | compile(java.lang.String semgrex) |
static SemgrexPattern | compile(java.lang.String semgrex, Env env): Creates a pattern from the given string. |
boolean | equals(java.lang.Object o) |
int | hashCode() |
static void | help() |
static void | main(java.lang.String[] args): Prints out all matches of a semgrex pattern on a file of dependencies. |
SemgrexMatcher | matcher(SemanticGraph sg): Get a SemgrexMatcher for this pattern in this graph. |
SemgrexMatcher | matcher(SemanticGraph hypGraph, Alignment alignment, SemanticGraph txtGraph) |
SemgrexMatcher | matcher(SemanticGraph hypGraph, Alignment alignment, SemanticGraph txtGraph, boolean ignoreCase) |
SemgrexMatcher | matcher(SemanticGraph sg, boolean ignoreCase): Get a SemgrexMatcher for this pattern in this graph. |
SemgrexMatcher | matcher(SemanticGraph sg, IndexedWord root): Get a SemgrexMatcher for this pattern in this graph. |
SemgrexMatcher | matcher(SemanticGraph sg, java.util.Map<java.lang.String,IndexedWord> variables): Get a SemgrexMatcher for this pattern in this graph, with some initial conditions on the variable assignments. |
java.lang.String | pattern() |
void | prettyPrint(): Print a multi-line representation of the pattern illustrating its syntax to System.out. |
void | prettyPrint(java.io.PrintStream ps): Print a multi-line representation of the pattern illustrating its syntax. |
void | prettyPrint(java.io.PrintWriter pw): Print a multi-line representation of the pattern illustrating its syntax. |
void | setEnv(Env env): Recursively sets the env variable to this pattern in this and in all its children. |
abstract java.lang.String | toString(): The goal is to return a string which will be compiled to the same pattern. |
abstract java.lang.String | toString(boolean hasPrecedence) |
protected Env env
public SemgrexMatcher matcher(SemanticGraph sg)
Get a SemgrexMatcher for this pattern in this graph.
sg - The SemanticGraph to match on

public SemgrexMatcher matcher(SemanticGraph sg, IndexedWord root)
Get a SemgrexMatcher for this pattern in this graph.
sg - The SemanticGraph to match on
root - The IndexedWord from which to start the search

public SemgrexMatcher matcher(SemanticGraph sg, java.util.Map<java.lang.String,IndexedWord> variables)
Get a SemgrexMatcher for this pattern in this graph, with some initial conditions on the variable assignments.

public SemgrexMatcher matcher(SemanticGraph sg, boolean ignoreCase)
Get a SemgrexMatcher for this pattern in this graph.
sg - The SemanticGraph to match on
ignoreCase - Will ignore case for matching a pattern with a node; not implemented by Coordination Pattern

public SemgrexMatcher matcher(SemanticGraph hypGraph, Alignment alignment, SemanticGraph txtGraph)

public SemgrexMatcher matcher(SemanticGraph hypGraph, Alignment alignment, SemanticGraph txtGraph, boolean ignoreCase)

public static SemgrexPattern compile(java.lang.String semgrex, Env env)
Creates a pattern from the given string.
semgrex - The pattern string

public static SemgrexPattern compile(java.lang.String semgrex)

public java.lang.String pattern()

public void setEnv(Env env)
Recursively sets the env variable to this pattern in this and in all its children.
env - An Env

public abstract java.lang.String toString()
The goal is to return a string which will be compiled to the same pattern.
Overrides: toString in class java.lang.Object

public abstract java.lang.String toString(boolean hasPrecedence)
hasPrecedence - indicates that this pattern has precedence in terms of "order of operations", so there is no need to parenthesize the expression

public void prettyPrint(java.io.PrintWriter pw)
Print a multi-line representation of the pattern illustrating its syntax.

public void prettyPrint(java.io.PrintStream ps)
Print a multi-line representation of the pattern illustrating its syntax.

public void prettyPrint()
Print a multi-line representation of the pattern illustrating its syntax to System.out.

public boolean equals(java.lang.Object o)
Overrides: equals in class java.lang.Object

public int hashCode()
Overrides: hashCode in class java.lang.Object

public static void help()

public static void main(java.lang.String[] args) throws java.io.IOException
Prints out all matches of a semgrex pattern on a file of dependencies.
Usage:
java edu.stanford.nlp.semgraph.semgrex.SemgrexPattern [args]
See the help() function for a list of possible arguments to provide.
Throws: java.io.IOException