public class RegexNERSequenceClassifier extends AbstractSequenceClassifier<CoreLabel>
Each line of the mapping file has the form:

regex1 TYPE overwritableType1,Type2... priority
regex2 TYPE overwritableType1,Type2... priority
...

where each argument is tab-separated, and the last two arguments are optional. Several regexes can be associated with a single type. In the case where multiple regexes match a phrase, the priority ranking is used to choose between the possible types.

This classifier is designed to be used as part of a full NER system to label entities that don't fall into the usual NER categories. It only records the label if the token has not already been NER-annotated, or it has been annotated but the NER-type has been designated overwritable (the third argument). Note that this is evaluated token-wise in this classifier, and so it may assign a label against a token sequence that is partly background and partly overwritable. (In contrast, RegexNERAnnotator doesn't allow this.) It assigns labels to AnswerAnnotation, while checking for existing labels in NamedEntityTagAnnotation.

The first column regex may be a sequence of regexes, each separated by whitespace (matching "\\s+"). The sequence matches if the successive regexes match a sequence of tokens in the input. Spaces can only be used to separate regular expression tokens; within a token, \s or a similar non-space representation needs to be used instead.

Notes:

Following Java regex conventions, some characters in the file need to be escaped. Only a single backslash should be used though, as these are not String literals.

The input to RegexNER will have already been tokenized. So, for example, with our usual English tokenization, things like genitives and commas at the end of words will be separated in the input and matched as separate tokens.

This class isn't implemented very efficiently, since every regex is evaluated at every token position. So it can and does get quite slow if you have a lot of patterns in your NER rules.
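The mapping format and the token-sequence matching described above can be illustrated with a small stand-alone sketch. It uses only java.util.regex and is not the CoreNLP implementation: the Rule class, its fields, and the sample rules are hypothetical, it reads only the first two columns, and it assumes each whitespace-separated regex must match one whole token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class MappingLineSketch {

    /** Hypothetical holder for one parsed mapping line; not a CoreNLP class. */
    static final class Rule {
        final List<Pattern> tokenPatterns = new ArrayList<>();
        final String type;

        Rule(String line, boolean ignoreCase) {
            // Columns are tab-separated; the optional third and fourth
            // columns (overwritable types, priority) are ignored here.
            String[] columns = line.split("\t");
            int flags = ignoreCase ? Pattern.CASE_INSENSITIVE : 0;
            // The first column is a whitespace-separated sequence of regexes,
            // one per token.
            for (String tokenRegex : columns[0].trim().split("\\s+")) {
                tokenPatterns.add(Pattern.compile(tokenRegex, flags));
            }
            type = columns[1].trim();
        }

        /** True if the successive regexes match the tokens at start, start+1, ... */
        boolean matchesAt(List<String> tokens, int start) {
            if (start + tokenPatterns.size() > tokens.size()) {
                return false;
            }
            for (int i = 0; i < tokenPatterns.size(); i++) {
                // Assumption: each regex must match the whole token.
                if (!tokenPatterns.get(i).matcher(tokens.get(start + i)).matches()) {
                    return false;
                }
            }
            return true;
        }
    }

    public static void main(String[] args) {
        // In a mapping file this rule would be written with single backslashes:
        //   Ph\.?D\.?<TAB>DEGREE
        // Here it is a Java String literal, so each backslash is doubled.
        Rule degree = new Rule("Ph\\.?D\\.?\tDEGREE", false);
        Rule school = new Rule("Stanford University\tSCHOOL", true);

        List<String> tokens = Arrays.asList(
                "She", "got", "a", "PhD", "at", "stanford", "university", ".");

        // Every rule is tried at every token position, which is why a large
        // rule set gets slow.
        for (int i = 0; i < tokens.size(); i++) {
            for (Rule rule : Arrays.asList(degree, school)) {
                if (rule.matchesAt(tokens, i)) {
                    System.out.println(rule.type + " at token " + i);
                }
            }
        }
        // prints:
        // DEGREE at token 3
        // SCHOOL at token 5
    }
}
```

The nested loop in main mirrors the efficiency note: the work grows with the number of rules times the number of token positions.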
TokensRegex is a more general framework to provide the functionality of this class. But at present we still use this class.

Modifier and Type | Field and Description |
---|---|
static java.lang.String | DEFAULT_VALID_POS |
classIndex, featureFactories, flags, knownLCWords, pad, windowSize
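The token-wise overwrite rule from the class description (a label is recorded only if the token has no NER annotation yet, or its existing type is listed as overwritable) can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the class and method names are hypothetical, and "O" is assumed to be the background symbol for unlabeled tokens.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class OverwriteCheckSketch {

    // Assumption: "O" is the background symbol used for unlabeled tokens.
    static final String BACKGROUND = "O";

    /**
     * Token-wise check: may a rule whose third column lists overwritableTypes
     * record its label on a token that currently carries existingNerTag?
     */
    static boolean mayLabel(String existingNerTag, Set<String> overwritableTypes) {
        return existingNerTag == null
                || BACKGROUND.equals(existingNerTag)
                || overwritableTypes.contains(existingNerTag);
    }

    public static void main(String[] args) {
        // Third column of a hypothetical rule: MISC,LOCATION
        Set<String> overwritable = new HashSet<>(Arrays.asList("MISC", "LOCATION"));

        // Existing NER tags of three tokens.
        List<String> existingTags = Arrays.asList("O", "MISC", "PERSON");
        for (String tag : existingTags) {
            String verdict = mayLabel(tag, overwritable) ? "may be labeled" : "kept as is";
            System.out.println(tag + " -> " + verdict);
        }
        // prints:
        // O -> may be labeled
        // MISC -> may be labeled
        // PERSON -> kept as is
    }
}
```

Because the check runs per token, a span whose tokens are tagged O and MISC passes at every position, which is the "partly background and partly overwritable" case mentioned in the description.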
Constructor and Description |
---|
RegexNERSequenceClassifier(java.io.BufferedReader reader, boolean ignoreCase, boolean overwriteMyLabels, java.lang.String validPosRegex): Make a new instance of this classifier. |
RegexNERSequenceClassifier(java.lang.String mapping, boolean ignoreCase, boolean overwriteMyLabels) |
RegexNERSequenceClassifier(java.lang.String mapping, boolean ignoreCase, boolean overwriteMyLabels, java.lang.String validPosRegex): Make a new instance of this classifier. |
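The validPosRegex argument is documented in the constructor details as being matched with find() rather than matches(), and as a postfilter that is satisfied when the POS of at least one word in the sequence matches. The sketch below shows what that difference means in plain java.util.regex terms. The helper name passesPosFilter and the sample pattern "^NN" are hypothetical illustrations; the pattern is not claimed to be the value of DEFAULT_VALID_POS.

```java
import java.util.regex.Pattern;

public class ValidPosSketch {

    /**
     * Hypothetical postfilter: passes if validPosRegex is null or empty, or if
     * the POS tag of at least one word contains a find() match.
     */
    static boolean passesPosFilter(String validPosRegex, String... posTags) {
        if (validPosRegex == null || validPosRegex.isEmpty()) {
            return true; // any (or no) POS is valid
        }
        Pattern pattern = Pattern.compile(validPosRegex);
        for (String pos : posTags) {
            if (pos != null && pattern.matcher(pos).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Pattern nounPrefix = Pattern.compile("^NN"); // sample pattern, an assumption

        // find() succeeds on a prefix; matches() needs the whole tag to match.
        System.out.println(nounPrefix.matcher("NNP").find());    // prints true
        System.out.println(nounPrefix.matcher("NNP").matches()); // prints false

        System.out.println(passesPosFilter("^NN", "DT", "NNP")); // prints true
        System.out.println(passesPosFilter("^NN", "DT", "VBZ")); // prints false
        System.out.println(passesPosFilter(null, "DT", "VBZ"));  // prints true
    }
}
```

Since the filter is applied after matching, it narrows which matches get labeled but does not reduce the number of regex evaluations.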
Modifier and Type | Method and Description |
---|---|
java.util.List<CoreLabel> | classify(java.util.List<CoreLabel> document): Classify a List of something that extends CoreMap. |
java.util.List<CoreLabel> | classifyWithGlobalInformation(java.util.List<CoreLabel> tokenSeq, CoreMap doc, CoreMap sent): Classify a List of something that extends CoreMap using as additional information whatever is stored in the document and sentence. |
java.util.Set<java.lang.String> | labels(): Most AbstractSequenceClassifiers have classIndex set. |
void | loadClassifier(java.io.ObjectInputStream in, java.util.Properties props): Load a classifier from the specified input stream. |
void | serializeClassifier(java.io.ObjectOutputStream oos): Serialize a sequence classifier to an object output stream. |
void | serializeClassifier(java.lang.String serializePath): Serialize a sequence classifier to a file on the given path. |
void | train(java.util.Collection<java.util.List<CoreLabel>> docs, DocumentReaderAndWriter<CoreLabel> readerAndWriter): Trains a classifier from a Collection of sequences. |
apply, backgroundSymbol, classify, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswersKBest, classifyAndWriteAnswersKBest, classifyAndWriteViterbiSearchGraph, classifyFile, classifyFilesAndWriteAnswers, classifyFilesAndWriteAnswers, classifyKBest, classifyRaw, classifySentence, classifySentenceWithGlobalInformation, classifyStdin, classifyStdin, classifyToCharacterOffsets, classifyToString, classifyToString, classifyWithInlineXML, countResults, countResultsSegmenter, defaultReaderAndWriter, dumpFeatures, finalizeClassification, getKnownLCWords, getSampler, getSequenceModel, loadClassifier, loadClassifier, loadClassifier, loadClassifier, loadClassifier, loadClassifier, loadClassifierNoExceptions, loadClassifierNoExceptions, loadClassifierNoExceptions, loadClassifierNoExceptions, loadClassifierNoExceptions, makeObjectBankFromFile, makeObjectBankFromFile, makeObjectBankFromFiles, makeObjectBankFromFiles, makeObjectBankFromFiles, makeObjectBankFromReader, makeObjectBankFromString, makePlainTextReaderAndWriter, makePlainTextReaderAndWriter, makeReaderAndWriter, plainTextReaderAndWriter, printFeatureLists, printFeatures, printProbs, printProbs, printProbsDocument, printProbsDocuments, printResults, reinit, segmentString, segmentString, train, train, train, train, train, train, windowSize, writeAnswers
public static final java.lang.String DEFAULT_VALID_POS
public RegexNERSequenceClassifier(java.lang.String mapping, boolean ignoreCase, boolean overwriteMyLabels)
public RegexNERSequenceClassifier(java.lang.String mapping, boolean ignoreCase, boolean overwriteMyLabels, java.lang.String validPosRegex)
Parameters:
mapping - A String describing a file/classpath/URI for the RegexNER patterns
ignoreCase - The regex in the mapping file should be compiled ignoring case
overwriteMyLabels - If true, this classifier overwrites NE labels generated through this regex NER. This is necessary because sometimes the RegexNERSequenceClassifier is run successively over the same text (e.g., to overwrite some older annotations).
validPosRegex - May be null or an empty String, in which case any (or no) POS is valid in matching. Otherwise, this is a regex which is matched with find() [not matches()] and which must be matched by the POS of at least one word in the sequence for it to be labeled via any matching rules. (Note that this is a postfilter; using this will not speed up matching.)
public RegexNERSequenceClassifier(java.io.BufferedReader reader, boolean ignoreCase, boolean overwriteMyLabels, java.lang.String validPosRegex)
Parameters:
reader - A Reader for the RegexNER patterns
ignoreCase - The regex in the mapping file should be compiled ignoring case
overwriteMyLabels - If true, this classifier overwrites NE labels generated through this regex NER. This is necessary because sometimes the RegexNERSequenceClassifier is run successively over the same text (e.g., to overwrite some older annotations).
validPosRegex - May be null or an empty String, in which case any (or no) POS is valid in matching. Otherwise, this is a regex, and only words with a POS that matches the regex will be labeled via any matching rules.
public java.util.Set<java.lang.String> labels()
labels in class AbstractSequenceClassifier<CoreLabel>
public java.util.List<CoreLabel> classify(java.util.List<CoreLabel> document)
Description copied from class: AbstractSequenceClassifier
Classify a List of something that extends CoreMap. The classifications are added in place to the items of the document, which is also returned by this method.
Warning: In many circumstances, you should not call this method directly. In particular, if you call this method directly, your document will not be preprocessed to add things like word distributional similarity class or word shape features that your classifier may rely on to work correctly. In such cases, you should call classifySentence instead.
classify in class AbstractSequenceClassifier<CoreLabel>
Parameters:
document - A List of something that extends CoreMap.
Returns:
The same List, but with the elements annotated with their answers (stored under the CoreAnnotations.AnswerAnnotation key). The answers will be the class labels defined by the CRF Classifier. They might be things like entity labels (in BIO notation or not) or something like "1" vs. "0" on whether to begin a new token here or not (in word segmentation).
public java.util.List<CoreLabel> classifyWithGlobalInformation(java.util.List<CoreLabel> tokenSeq, CoreMap doc, CoreMap sent)
Description copied from class: AbstractSequenceClassifier
Classify a List of something that extends CoreMap using as additional information whatever is stored in the document and sentence. This is needed for SUTime (NumberSequenceClassifier), which requires the document date to resolve relative dates.
classifyWithGlobalInformation in class AbstractSequenceClassifier<CoreLabel>
Parameters:
tokenSeq - A List of something that extends CoreMap
public void train(java.util.Collection<java.util.List<CoreLabel>> docs, DocumentReaderAndWriter<CoreLabel> readerAndWriter)
Description copied from class: AbstractSequenceClassifier
Trains a classifier from a Collection of sequences.
train in class AbstractSequenceClassifier<CoreLabel>
Parameters:
docs - An ObjectBank or a collection of sequences of IN
readerAndWriter - A DocumentReaderAndWriter to use when loading test files
public void serializeClassifier(java.lang.String serializePath)
Description copied from class: AbstractSequenceClassifier
Serialize a sequence classifier to a file on the given path.
serializeClassifier in class AbstractSequenceClassifier<CoreLabel>
Parameters:
serializePath - The path/filename to write the classifier to.
public void serializeClassifier(java.io.ObjectOutputStream oos)
Description copied from class: AbstractSequenceClassifier
Serialize a sequence classifier to an object output stream.
serializeClassifier in class AbstractSequenceClassifier<CoreLabel>
public void loadClassifier(java.io.ObjectInputStream in, java.util.Properties props) throws java.io.IOException, java.lang.ClassCastException, java.lang.ClassNotFoundException
Description copied from class: AbstractSequenceClassifier
Load a classifier from the specified input stream.
loadClassifier in class AbstractSequenceClassifier<CoreLabel>
Parameters:
in - The InputStream to load the serialized classifier from
props - This Properties object will be used to update the SeqClassifierFlags which are read from the serialized classifier
Throws:
java.io.IOException - If there are problems accessing the input stream
java.lang.ClassCastException - If there are problems interpreting the serialized data
java.lang.ClassNotFoundException - If there are problems interpreting the serialized data