A TregexPattern is a tgrep-type pattern for matching tree node configurations. Unlike tgrep but like Unix grep, there is no pre-indexing of the data to be searched. Rather, there is a linear scan through the trees where matches are sought.
TregexPattern instances can be matched against instances of trees.
Currently supported node/node relations and their symbols:
Symbol | Meaning |
---|---|
A << B | A dominates B |
A >> B | A is dominated by B |
A < B | A immediately dominates B |
A > B | A is immediately dominated by B |
A $ B | A is a sister of B (and not equal to B) |
A .. B | A precedes B |
A . B | A immediately precedes B |
A ,, B | A follows B |
A , B | A immediately follows B |
A <<, B | B is a leftmost descendant of A |
A <<- B | B is a rightmost descendant of A |
A >>, B | A is a leftmost descendant of B |
A >>- B | A is a rightmost descendant of B |
A <, B | B is the first child of A |
A >, B | A is the first child of B |
A <- B | B is the last child of A |
A >- B | A is the last child of B |
A <` B | B is the last child of A |
A >` B | A is the last child of B |
A <i B | B is the ith child of A (i > 0) |
A >i B | A is the ith child of B (i > 0) |
A <-i B | B is the ith-to-last child of A (i > 0) |
A >-i B | A is the ith-to-last child of B (i > 0) |
A <: B | B is the only child of A |
A >: B | A is the only child of B |
A <<: B | A dominates B via an unbroken chain (length > 0) of unary local trees. |
A >>: B | A is dominated by B via an unbroken chain (length > 0) of unary local trees. |
A $++ B | A is a left sister of B (same as $.. for context-free trees) |
A $-- B | A is a right sister of B (same as $,, for context-free trees) |
A $+ B | A is the immediate left sister of B (same as $. for context-free trees) |
A $- B | A is the immediate right sister of B (same as $, for context-free trees) |
A $.. B | A is a sister of B and precedes B |
A $,, B | A is a sister of B and follows B |
A $. B | A is a sister of B and immediately precedes B |
A $, B | A is a sister of B and immediately follows B |
A <+(C) B | A dominates B via an unbroken chain of (zero or more) nodes matching description C |
A >+(C) B | A is dominated by B via an unbroken chain of (zero or more) nodes matching description C |
A .+(C) B | A precedes B via an unbroken chain of (zero or more) nodes matching description C |
A ,+(C) B | A follows B via an unbroken chain of (zero or more) nodes matching description C |
A <<# B | B is a head of phrase A |
A >># B | A is a head of phrase B |
A <# B | B is the immediate head of phrase A |
A ># B | A is the immediate head of phrase B |
A == B | A and B are the same node |
A : B | [this is a pattern-segmenting operator that places no constraints on the relationship between A and B] |
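
To make a few of the relations in the table concrete, here is a minimal sketch over a hand-rolled tree. The RelNode class and its method names are hypothetical and exist only for this illustration; they are not the Stanford Tree class and not how Tregex implements its relations.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal tree node for illustration only (not the Stanford Tree class).
final class RelNode {
    final String label;
    final List<RelNode> children = new ArrayList<>();
    RelNode parent;

    RelNode(String label, RelNode... kids) {
        this.label = label;
        for (RelNode k : kids) {
            k.parent = this;
            children.add(k);
        }
    }

    // A < B : this node immediately dominates other.
    boolean immediatelyDominates(RelNode other) {
        return other.parent == this;
    }

    // A << B : this node is a proper ancestor of other.
    boolean dominates(RelNode other) {
        for (RelNode n = other.parent; n != null; n = n.parent) {
            if (n == this) return true;
        }
        return false;
    }

    // A $ B : both nodes share a parent and are not the same node.
    boolean isSisterOf(RelNode other) {
        return this != other && parent != null && parent == other.parent;
    }

    // A $+ B : this node is the immediate left sister of other.
    boolean isImmediateLeftSisterOf(RelNode other) {
        if (!isSisterOf(other)) return false;
        int i = parent.children.indexOf(this);
        return i + 1 < parent.children.size() && parent.children.get(i + 1) == other;
    }

    // A <, B : other is the first child of this node.
    boolean hasFirstChild(RelNode other) {
        return !children.isEmpty() && children.get(0) == other;
    }
}
```

For a tree S over NP and VP, where NP is over NN, the S node dominates NN but does not immediately dominate it, and NP is the immediate left sister of VP.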
Label descriptions can be literal strings, which must match labels
exactly, or regular expressions in regular expression bars: /regex/.
Literal string matching proceeds as String equality.
In order to prevent ambiguity with other Tregex symbols, only standard
"identifiers" are allowed as literals, i.e., strings matching
[a-zA-Z]([a-zA-Z0-9_])* .
If you want to use other symbols, you can do so by using a regular
expression instead of a literal string.
A disjunctive list of literal strings can be given separated by '|'.
The special string '__' (two underscores) can be used to match any
node. (WARNING!! Use of the '__' node description may seriously
slow down search.) If a label description is preceded by '@', the
label will match any node whose basicCategory matches the
description. Regular expressions are matched with find() semantics,
as in Perl/tgrep; you need to specify ^ or $ to constrain matches.
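The two label rules above (the identifier shape for literals, and find()-style matching for regular expressions) can be sketched with java.util.regex alone. The class and method names below are hypothetical, and the code only mirrors the rules as stated in the text; it is not Tregex's own parser or matcher.

```java
import java.util.regex.Pattern;

// Illustration only: mirrors the label rules described in the text.
final class LabelDescriptions {
    // The identifier shape the text gives for literal label descriptions.
    private static final Pattern LITERAL = Pattern.compile("[a-zA-Z]([a-zA-Z0-9_])*");

    // True if the label could be written as a bare literal in a pattern.
    static boolean isLiteralIdentifier(String label) {
        return LITERAL.matcher(label).matches();
    }

    // Mimics /regex/ label matching: a match anywhere in the label counts.
    static boolean regexDescriptionMatches(String regex, String label) {
        return Pattern.compile(regex).matcher(label).find();
    }
}
```

Under these rules a label such as NP-SBJ is not a valid literal and would need a regular expression, for example /^NP-SBJ$/. An unanchored /NP/ would also match WHNP-1, which is why the anchors matter.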
In a chain of relations, all relations are relative to the first node in the chain. For example, (S < VP < NP) means "an S over a VP and also over an NP".
If instead what you want is an S above a VP above an NP, you should write "S < (VP < NP)".
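The difference between the two readings can be sketched with a small hand-written check. ChainNode and both predicate names are hypothetical and used only for illustration; the sketch hard-codes the two patterns rather than parsing them.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical minimal tree node used only for this illustration.
final class ChainNode {
    final String label;
    final List<ChainNode> children;

    ChainNode(String label, ChainNode... children) {
        this.label = label;
        this.children = Arrays.asList(children);
    }

    // Returns the first child with the given label, or null if there is none.
    ChainNode child(String childLabel) {
        for (ChainNode c : children) {
            if (c.label.equals(childLabel)) return c;
        }
        return null;
    }

    // (S < VP < NP): both relations are tested against the S node.
    static boolean sOverVpAndNp(ChainNode s) {
        return s.label.equals("S") && s.child("VP") != null && s.child("NP") != null;
    }

    // S < (VP < NP): the second relation is tested against the VP node.
    static boolean sOverVpOverNp(ChainNode s) {
        if (!s.label.equals("S")) return false;
        for (ChainNode c : s.children) {
            if (c.label.equals("VP") && c.child("NP") != null) return true;
        }
        return false;
    }
}
```

A flat S with NP and VP daughters satisfies only the first predicate; an S whose VP daughter itself has an NP daughter satisfies only the second, unless the S also has an NP daughter of its own.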
Nodes can be grouped using parens '(' and ')', as in S < (NP $++ VP) to match an S
over an NP, where the NP has a VP as a right sister.
Relations can be combined using the '&' and '|' operators, negated with the '!' operator, and made optional with the '?' operator. Thus
(NP < NN | < NNS) will match an NP node dominating either an NN or an NNS.
(NP > S & $++ VP) matches an NP that is both under an S and has a VP as a right sister.
Relations can be grouped using brackets '[' and ']'. So the expression
NP [< NN | < NNS] & > S
matches an NP that (1) dominates either an NN or an NNS, and (2) is under an S. Without
brackets, & takes precedence over |, and equivalent operators are
left-associative. Also note that & is the default combining operator if the
operator is omitted in a chain of relations, so that these two patterns are equivalent:
(S < VP < NP)
(S < VP & < NP)
As another example, (VP < VV | < NP % NP)
can be written explicitly as (VP [< VV | [< NP & % NP] ] )
Relations can be negated with the '!' operator, in which case the
expression will match only if there is no node satisfying the relation.
For example (NP !< NNP) matches only NPs not dominating an NNP.
Label descriptions can also be negated with '!': (NP < !NNP|NNS) matches
NPs dominating some node that is not an NNP or an NNS.
Relations can be made optional with the '?' operator. This way the expression will match even if the optional relation is not satisfied. This is useful when used together with node naming (see below).
In order to consider only the "basic category" of a tree label,
i.e. to ignore functional tags or other annotations on the label,
prefix that node's description with the @ symbol. For example,
(@NP < @/NN.?/)
matches a node whose basic category is NP and which immediately dominates a node whose basic category matches /NN.?/.
This can only be used for individual nodes;
if you want all nodes to use the basic category, it would be more efficient
to use a TreeNormalizer.
The ":" operator allows you to segment a pattern into two pieces. This can simplify your pattern writing. For example, the pattern
S : NP matches only those S nodes in trees that also have an NP node.
Nodes can be given names (a.k.a. handles) using '='. A named node will be stored in a
map that maps names to nodes so that if a match is found, the node
corresponding to the named node can be extracted from the map. For
example (NP < NNP=name) will match an NP dominating an NNP,
and after a match is found, the map can be queried with the
name to retrieve the matched node.
Note that you are not allowed to name a node that is under the scope of a negation operator (the semantics would
be unclear, since you can't store a node that never gets matched to).
Named nodes can be put within the scope of an optionality operator.
Named nodes that refer back to previous named nodes need not have a node description; this is known as "backreferencing". In this case, the expression will match only when all instances of the same name get matched to the same tree node. For example, the pattern
(@NP <, (@NP $+ (/,/ $+ (@NP $+ /,/=comma))) <- =comma)
matches only an NP dominating exactly the sequence NP , NP , (that is,
the mother NP cannot have any other daughters). Multiple
backreferences are allowed. If the node with no node description does not refer
to a previously named node, there will be no error; the expression simply will
not match anything.
Another way to refer to previously named nodes is with the "link" symbol: '~'. A link is like a backreference, except that instead of having to be equal to the referred-to node, the current node only has to match the label of the referred-to node. A link cannot have a node description, i.e. the '~' symbol must immediately follow a relation symbol.
The HeadFinder used to determine heads for the head relations <#, >#, <<#, and >>#, and also
the Function mapping from labels to Basic Category tags, can be
chosen.
If you write a node description using a regular expression, you can assign its matching groups to variable names. If more than one node has a group assigned to the same variable name, then matching will only occur when all such groups capture the same string. This is useful for enforcing coindexation constraints. The syntax is
/ <regex-stuff> /#<group-number>%<variable-name>
For example, the pattern (designed for Penn Treebank trees)
@SBAR < /^WH.*-([0-9]+)$/#1%index << (__=empty < (/^-NONE-/ < /^\*T\*-([0-9]+)$/#1%index))
will match only when the WH- node under the SBAR is coindexed with the trace node that gets the name empty.
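The coindexation constraint amounts to requiring that capture group 1 of both regular expressions yields the same string. A minimal sketch of that idea with java.util.regex follows; the CoindexCheck class is hypothetical and checks two labels in isolation, whereas Tregex applies the constraint while matching tree nodes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: compares the captured indices of two labels.
final class CoindexCheck {
    // The two regular expressions from the pattern above.
    private static final Pattern WH = Pattern.compile("^WH.*-([0-9]+)$");
    private static final Pattern TRACE = Pattern.compile("^\\*T\\*-([0-9]+)$");

    // Returns capture group 1 if the pattern is found in the label, else null.
    static String group1(Pattern p, String label) {
        Matcher m = p.matcher(label);
        return m.find() ? m.group(1) : null;
    }

    // True only if both labels match and capture the same index string.
    static boolean coindexed(String whLabel, String traceLabel) {
        String a = group1(WH, whLabel);
        String b = group1(TRACE, traceLabel);
        return a != null && a.equals(b);
    }
}
```

With these expressions, WHNP-1 and *T*-1 share the captured string 1 and would satisfy the constraint, while WHNP-1 and *T*-2 would not.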
java edu.stanford.nlp.trees.tregex.TregexPattern [[-TCwfosnu] [-h <node-name>]]* pattern filepath
Arguments:
pattern
: the tree pattern, which optionally names some set of nodes (i.e., gives each one a "handle") using =name (for some arbitrary string "name")
filepath
: the path to files with trees. If this is a directory, there will be recursive descent and the pattern will be run on all files beneath the specified directory.
Options:
-T
causes all trees to be printed as processed. Otherwise only matching nodes
are printed.
-C
suppresses printing of matches, so only the
number of matches is printed.
-w
causes the whole of a tree that matches to be printed.
-f
causes the filename to be printed.
-i <filename>
causes the pattern to be matched to be read from <filename>
rather than the command line. Don't specify a pattern when this option is used.
-o
allows each tree node to be reported only once as the root of a match (by default a node will
be printed once for every way the pattern matches).
-s
causes trees to be printed all on one line (by default they are pretty printed).
-n
causes the number of the tree in which the match was found to be
printed before every match.
-u
causes only the label of each matching node to be printed.
-t
causes only the yield of the selected node (or the yield of the whole tree, if the -w option is used) to be printed.
-encoding <charset_encoding>
option allows specification of character set.
-h <node-handle>
If an -h option is given, the root tree node will not be printed. Instead,
for each node-handle specified, the node matched and given that handle will be printed. Multiple nodes can be printed by using the
-h option multiple times on a single command line.
-hf <headfinder-class-name>
use the specified HeadFinder class to determine headship relations.
-hfArg <string>
pass a string argument in to the HeadFinder class's constructor. -hfArg can be used multiple times to pass in multiple arguments.
-trf <TreeReaderFactory-class-name>
use the specified TreeReaderFactory class to read trees from files.
-v
print every tree that contains no matches of the specified pattern, but print no matches to the pattern.
-x
Instead of the matched subtree, print the matched subtree's identifying number as defined in tgrep2: a
unique identifier for the subtree, in the form s:n, where s is an integer specifying
the sentence number in the corpus (starting with 1), and n is an integer giving the order
in which the node is encountered in a depth-first search starting with 1 at the top node in the
sentence tree.
-extract <code> <tree-file>
extracts the subtree s:n specified by code from the specified tree-file. Overrides all other behavior of tregex. Can't specify multiple encodings etc. yet.
-extractFile <code-file> <tree-file>
extracts every subtree specified by the subtree codes in code-file, which must appear exactly one per line, from the specified tree-file. Overrides all other behavior of tregex. Can't specify multiple encodings etc. yet.
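As a rough illustration of the s:n codes used by -x, -extract, and -extractFile, the sketch below numbers nodes in the order a depth-first walk first encounters them, starting at 1 for the root. It assumes a left-to-right pre-order traversal, which is one reading of the description above. The SubtreeCodes class and its Node type are hypothetical and are not part of Tregex.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: assumes left-to-right pre-order numbering from 1.
final class SubtreeCodes {
    // Hypothetical minimal tree node (not the Stanford Tree class).
    static final class Node {
        final String label;
        final List<Node> children;

        Node(String label, Node... children) {
            this.label = label;
            this.children = Arrays.asList(children);
        }
    }

    // Collects nodes in the order a depth-first, left-to-right walk first meets them.
    static void preorder(Node node, List<Node> out) {
        out.add(node);
        for (Node c : node.children) {
            preorder(c, out);
        }
    }

    // Builds the s:n code for target in the tree rooted at root, or null if absent.
    static String code(int sentenceNumber, Node root, Node target) {
        List<Node> order = new ArrayList<>();
        preorder(root, order);
        int i = order.indexOf(target);
        return i < 0 ? null : sentenceNumber + ":" + (i + 1);
    }
}
```

Under that assumption, in the third sentence of a corpus with tree S over NP (over NN) and VP, the root would be 3:1, NP 3:2, NN 3:3, and VP 3:4.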