public class GetPatternsFromDataMultiClass extends Object implements Serializable
This class is multi-threaded (the nthread parameter sets the number of
threads). It takes as input text file(s) and seed word files, specified with
the flags below.
To use the default options, run
java -mx1000m edu.stanford.nlp.patterns.surface.GetPatternsFromDataMultiClass -file text_file -seedWordsFiles label1,seedwordlist1;label2,seedwordlist2;... -outDir output_directory (optional)
fileFormat: (Optional) Default is text. Valid values are text (or txt) and ser, where the serialized file is of the type Map<String, List<CoreLabel>>.
file: (Required) Input file(s) (default assumed text). Can be one or more of the following, concatenated by comma or semicolon: file, directory, files with regex in the filename (for example: "mydir/health-.*-processed.txt").
seedWordsFiles: (Required) label1,file_seed_words1;label2,file_seed_words2;... where each file_seed_words is a file with a list of seed words, one per line.
outDir: (Optional) Output directory where visualization/output files are stored.
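The seedWordsFiles format above can be illustrated with a small standalone sketch. It is not the library's parser (this page lists readSeedWords for that purpose); the class SeedSpecSketch, the helper parseSeedWordsFiles, and the file paths are hypothetical names chosen for illustration, and the sketch only splits the flag value into label and file path without reading any file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SeedSpecSketch {
    /** Splits "label1,file1;label2,file2" into label -> seed file path (hypothetical helper). */
    static Map<String, String> parseSeedWordsFiles(String spec) {
        Map<String, String> labelToFile = new LinkedHashMap<>();
        for (String entry : spec.split(";")) {
            if (entry.trim().isEmpty()) {
                continue; // tolerate a trailing semicolon
            }
            // Split on the first comma only: the label comes first, the path second.
            String[] parts = entry.split(",", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected label,file but got: " + entry);
            }
            labelToFile.put(parts[0].trim(), parts[1].trim());
        }
        return labelToFile;
    }

    public static void main(String[] args) {
        // Prints {DISEASE=seeds/disease.txt, DRUG=seeds/drug.txt}
        System.out.println(parseSeedWordsFiles("DISEASE,seeds/disease.txt;DRUG,seeds/drug.txt"));
    }
}
```

When the flag is passed on a command line, most shells treat the semicolon as a command separator, so the value usually needs to be quoted.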
For other flags, see individual comments for each flag.
To use a properties file, see
projects/core/data/edu/stanford/nlp/patterns/surface/example.properties or patterns/example.properties (depends on which codebase you are using)
as an example for the flags and their brief descriptions. Run the code as:
java -mx1000m -cp classpath edu.stanford.nlp.patterns.surface.GetPatternsFromDataMultiClass -props dir-as-above/example.properties
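As an alternative to a properties file on disk, the same flags can be collected in a java.util.Properties object. The sketch below is a minimal illustration using only the four flags documented above; the class PatternPropsSketch, the helper defaultFlags, and the example paths are hypothetical, and other flags are not covered.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Properties;

class PatternPropsSketch {
    /** Collects the flags documented on this page into a Properties object (hypothetical helper). */
    static Properties defaultFlags(String file, String seedWordsFiles, String outDir) {
        Properties props = new Properties();
        props.setProperty("file", file);                     // required
        props.setProperty("seedWordsFiles", seedWordsFiles); // required
        props.setProperty("fileFormat", "text");             // documented default
        if (outDir != null) {
            props.setProperty("outDir", outDir);             // optional
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        Properties props = defaultFlags("mydir/notes.txt", "DISEASE,seeds/disease.txt", "out");

        // Round-trip through the .properties text format to show what a file would contain.
        StringWriter written = new StringWriter();
        props.store(written, "example flags");
        Properties loaded = new Properties();
        loaded.load(new StringReader(written.toString()));
        System.out.println(loaded.getProperty("seedWordsFiles"));

        // With CoreNLP on the classpath, props could then be handed to
        // GetPatternsFromDataMultiClass.run(props), which is listed on this page.
    }
}
```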
IMPORTANT: Many flags are described in the classes ConstantsAndVariables, CreatePatterns, and PhraseScorer.
Modifier and Type | Class and Description |
---|---|
static class |
GetPatternsFromDataMultiClass.LabelWithSeedWords |
static class |
GetPatternsFromDataMultiClass.PatternScoring
RlogF is from Riloff 1996, when R's denominator is (pos+neg+unlabeled)
|
Constructor and Description |
---|
GetPatternsFromDataMultiClass(Properties props,
Map<String,List<CoreLabel>> sents,
Map<String,Set<String>> seedSets,
boolean labelUsingSeedSets) |
GetPatternsFromDataMultiClass(Properties props,
Map<String,List<CoreLabel>> sents,
Map<String,Set<String>> seedSets,
boolean labelUsingSeedSets,
Map<String,Class<? extends TypesafeMap.Key<String>>> answerClass) |
GetPatternsFromDataMultiClass(Properties props,
Map<String,List<CoreLabel>> sents,
Map<String,Set<String>> seedSets,
boolean labelUsingSeedSets,
Map<String,Class<? extends TypesafeMap.Key<String>>> answerClass,
Map<String,Class> generalizeClasses,
Map<String,Map<Class,Object>> ignoreClasses)
generalizeClasses maps label strings to a map of generalized strings and the
corresponding class. The values in ignoreClasses have to be boolean.
|
GetPatternsFromDataMultiClass(Properties props,
Map<String,List<CoreLabel>> sents,
Set<String> seedSet,
boolean labelUsingSeedSets,
Class answerClass,
String answerLabel) |
GetPatternsFromDataMultiClass(Properties props,
Map<String,List<CoreLabel>> sents,
Set<String> seedSet,
boolean labelUsingSeedSets,
Class answerClass,
String answerLabel,
Map<String,Class> generalizeClasses,
Map<Class,Object> ignoreClasses) |
GetPatternsFromDataMultiClass(Properties props,
Map<String,List<CoreLabel>> sents,
Set<String> seedSet,
boolean labelUsingSeedSets,
String answerLabel) |
GetPatternsFromDataMultiClass(Properties props,
Map<String,List<CoreLabel>> sents,
Set<String> seedSet,
boolean labelUsingSeedSets,
String answerLabel,
Map<String,Class> generalizeClasses,
Map<Class,Object> ignoreClasses) |
Modifier and Type | Method and Description |
---|---|
static void |
countResults(List<CoreLabel> doc,
Counter<String> entityTP,
Counter<String> entityFP,
Counter<String> entityFN,
String background,
Counter<String> wordTP,
Counter<String> wordTN,
Counter<String> wordFP,
Counter<String> wordFN,
Class<? extends TypesafeMap.Key<String>> whichClassToCompare,
boolean evalPerEntity) |
static boolean |
countResultsPerEntity(List<CoreLabel> doc,
Counter<String> entityTP,
Counter<String> entityFP,
Counter<String> entityFN,
String background,
Counter<String> wordTP,
Counter<String> wordTN,
Counter<String> wordFP,
Counter<String> wordFN,
Class<? extends TypesafeMap.Key<String>> whichClassToCompare)
COPIED from CRFClassifier: Count the successes and failures of the model on
the given document.
|
static void |
countResultsPerToken(List<CoreLabel> doc,
Counter<String> entityTP,
Counter<String> entityFP,
Counter<String> entityFN,
String background,
Counter<String> wordTP,
Counter<String> wordTN,
Counter<String> wordFP,
Counter<String> wordFN,
Class<? extends TypesafeMap.Key<String>> whichClassToCompare)
Count the successes and failures of the model on the given document on a
per-token basis.
|
static String |
elapsedTime(Date d1,
Date d2) |
void |
evaluate(Map<String,List<CoreLabel>> testSentences,
boolean evalPerEntity) |
static <D> Counter<D> |
FScore(Counter<D> precision,
Counter<D> recall,
double beta) |
double |
FScore(double precision,
double recall,
double beta) |
static List<File> |
getAllFiles(String file) |
Map<String,Counter<Integer>> |
getLearnedPatterns() |
Counter<Integer> |
getLearnedPatterns(String label) |
Counter<SurfacePattern> |
getLearnedPatternsSurfaceForm(String label) |
Counter<String> |
getLearnedWords(String label) |
Set<String> |
getNonBackgroundLabels(CoreLabel l) |
PatternsForEachToken |
getPatsForEachToken() |
Counter<Integer> |
getPatterns(String label,
Set<Integer> alreadyIdentifiedPatterns,
Integer p0,
Counter<String> p0Set,
Set<Integer> ignorePatterns) |
static Class |
getPatternScoringClass(GetPatternsFromDataMultiClass.PatternScoring patternScoring) |
static List<Integer> |
getSubListIndex(String[] l1,
String[] l2,
String[] subl2,
Set<String> englishWords,
HashSet<String> seenFuzzyMatches,
int minLen4Fuzzy)
If l1 is a part of l2, finds the starting index of l1 in l2. If l1 is not
a sub-array of l2, returns -1. Note that l2 should have the exact
elements and order as in l1.
|
void |
iterateExtractApply() |
void |
iterateExtractApply(Map<String,Integer> p0,
Map<String,Counter<String>> p0Set,
String wordsOutputFile,
String sentsOutFile,
String patternsOutFile,
Map<String,Set<Integer>> ignorePatterns) |
Pair<Counter<Integer>,Counter<String>> |
iterateExtractApply4Label(String label,
Integer p0,
Counter<String> p0Set,
BufferedWriter wordsOutput,
String sentsOutFile,
BufferedWriter patternsOut,
Set<Integer> ignorePatterns,
int numIter,
Set<String> ignoreWords,
CollectionValuedMap<Integer,Triple<String,Integer,Integer>> matchedTokensByPat,
TwoDimensionalCounter<String,Integer> terms) |
void |
labelWords(String label,
Map<String,List<CoreLabel>> sents,
Set<String> identifiedWords,
String outFile,
CollectionValuedMap<Integer,Triple<String,Integer,Integer>> matchedTokensByPat) |
static void |
main(String[] args) |
static Counter<String> |
normalizeSoftMaxMinMaxScores(Counter<String> scores,
boolean minMaxNorm,
boolean softmax,
boolean oneMinusSoftMax) |
void |
processSents(Map<String,List<CoreLabel>> sents) |
static Map<String,Set<String>> |
readSeedWords(Properties props) |
static Map<String,Set<String>> |
readSeedWords(String seedWordsFiles) |
void |
removeOverLappingLabels(Map<String,List<CoreLabel>> sents)
If a token is labeled for two or more labels, then keep the one that has the longest matching phrase.
|
static GetPatternsFromDataMultiClass |
run(Properties props)
Execute the system given a properties file or object.
|
static void |
runLabelSeedWords(Map<String,List<CoreLabel>> sents,
Class answerclass,
String label,
Set<String> seedWords,
ConstantsAndVariables constVars) |
static Map<String,List<CoreLabel>> |
runPOSNEROnTokens(List<CoreMap> sentsCM,
String posModelPath,
boolean useTargetNERRestriction,
String prefix,
boolean useTargetParserParentRestriction,
String numThreads) |
void |
setLearnedPatterns(Counter<Integer> patterns,
String label) |
void |
setLearnedWords(Counter<String> words,
String label) |
static <E> List<List<E>> |
splitIntoNumThreads(List<E> c,
int n,
int numThreads) |
static int |
tokenize(Iterator<String> textReader,
String posModelPath,
boolean lowercase,
boolean useTargetNERRestriction,
String sentIDPrefix,
boolean useTargetParserParentRestriction,
String numThreads,
boolean batchProcessSents,
int numMaxSentencesPerBatchFile,
File saveSentencesSerDirFile,
Map<String,List<CoreLabel>> sents,
int numFilesTillNow) |
void |
writeColumnOutput(String outFile) |
void |
writeLabeledData(String outFile) |
public Map<String,TwoDimensionalCounter<String,Integer>> wordsPatExtracted
public ScorePhrases scorePhrases
public ConstantsAndVariables constVars
public CreatePatterns createPats
public Map<String,TwoDimensionalCounter<Integer,String>> patternsandWords
public TwoDimensionalCounter<String,ConstantsAndVariables.ScorePhraseMeasures> phInPatScoresCache
public GetPatternsFromDataMultiClass(Properties props, Map<String,List<CoreLabel>> sents, Set<String> seedSet, boolean labelUsingSeedSets, String answerLabel) throws IOException, InstantiationException, IllegalAccessException, IllegalArgumentException, InvocationTargetException, NoSuchMethodException, SecurityException, InterruptedException, ExecutionException, ClassNotFoundException
public GetPatternsFromDataMultiClass(Properties props, Map<String,List<CoreLabel>> sents, Set<String> seedSet, boolean labelUsingSeedSets, Class answerClass, String answerLabel) throws IOException, InstantiationException, IllegalAccessException, IllegalArgumentException, InvocationTargetException, NoSuchMethodException, SecurityException, InterruptedException, ExecutionException, ClassNotFoundException
public GetPatternsFromDataMultiClass(Properties props, Map<String,List<CoreLabel>> sents, Set<String> seedSet, boolean labelUsingSeedSets, String answerLabel, Map<String,Class> generalizeClasses, Map<Class,Object> ignoreClasses) throws IOException, InstantiationException, IllegalAccessException, IllegalArgumentException, InvocationTargetException, NoSuchMethodException, SecurityException, InterruptedException, ExecutionException, ClassNotFoundException
public GetPatternsFromDataMultiClass(Properties props, Map<String,List<CoreLabel>> sents, Set<String> seedSet, boolean labelUsingSeedSets, Class answerClass, String answerLabel, Map<String,Class> generalizeClasses, Map<Class,Object> ignoreClasses) throws IOException, InstantiationException, IllegalAccessException, IllegalArgumentException, InvocationTargetException, NoSuchMethodException, SecurityException, InterruptedException, ExecutionException, ClassNotFoundException
public GetPatternsFromDataMultiClass(Properties props, Map<String,List<CoreLabel>> sents, Map<String,Set<String>> seedSets, boolean labelUsingSeedSets) throws IOException, InstantiationException, IllegalAccessException, IllegalArgumentException, InvocationTargetException, NoSuchMethodException, SecurityException, ClassNotFoundException, InterruptedException, ExecutionException
public GetPatternsFromDataMultiClass(Properties props, Map<String,List<CoreLabel>> sents, Map<String,Set<String>> seedSets, boolean labelUsingSeedSets, Map<String,Class<? extends TypesafeMap.Key<String>>> answerClass) throws IOException, InstantiationException, IllegalAccessException, IllegalArgumentException, InvocationTargetException, NoSuchMethodException, SecurityException, InterruptedException, ExecutionException, ClassNotFoundException
public GetPatternsFromDataMultiClass(Properties props, Map<String,List<CoreLabel>> sents, Map<String,Set<String>> seedSets, boolean labelUsingSeedSets, Map<String,Class<? extends TypesafeMap.Key<String>>> answerClass, Map<String,Class> generalizeClasses, Map<String,Map<Class,Object>> ignoreClasses) throws IOException, InstantiationException, IllegalAccessException, IllegalArgumentException, InvocationTargetException, NoSuchMethodException, SecurityException, InterruptedException, ExecutionException, ClassNotFoundException
public PatternsForEachToken getPatsForEachToken()
public void removeOverLappingLabels(Map<String,List<CoreLabel>> sents)
If a token is labeled for two or more labels, keep the label that has the longest matching phrase. This requires PatternsAnnotations.Ln to be set, which is already done in the runLabelSeedWords function.
public static Map<String,List<CoreLabel>> runPOSNEROnTokens(List<CoreMap> sentsCM, String posModelPath, boolean useTargetNERRestriction, String prefix, boolean useTargetParserParentRestriction, String numThreads)
public static int tokenize(Iterator<String> textReader, String posModelPath, boolean lowercase, boolean useTargetNERRestriction, String sentIDPrefix, boolean useTargetParserParentRestriction, String numThreads, boolean batchProcessSents, int numMaxSentencesPerBatchFile, File saveSentencesSerDirFile, Map<String,List<CoreLabel>> sents, int numFilesTillNow) throws InterruptedException, ExecutionException, IOException
public static List<Integer> getSubListIndex(String[] l1, String[] l2, String[] subl2, Set<String> englishWords, HashSet<String> seenFuzzyMatches, int minLen4Fuzzy)
l1 - array you want to find in l2
l2 -
public static void runLabelSeedWords(Map<String,List<CoreLabel>> sents, Class answerclass, String label, Set<String> seedWords, ConstantsAndVariables constVars) throws InterruptedException, ExecutionException, IOException
public void processSents(Map<String,List<CoreLabel>> sents) throws IOException, ClassNotFoundException
Throws: IOException, ClassNotFoundException
public Counter<Integer> getPatterns(String label, Set<Integer> alreadyIdentifiedPatterns, Integer p0, Counter<String> p0Set, Set<Integer> ignorePatterns) throws IOException, ClassNotFoundException
Throws: IOException, ClassNotFoundException
public static Class getPatternScoringClass(GetPatternsFromDataMultiClass.PatternScoring patternScoring)
public static <E> List<List<E>> splitIntoNumThreads(List<E> c, int n, int numThreads)
public static Counter<String> normalizeSoftMaxMinMaxScores(Counter<String> scores, boolean minMaxNorm, boolean softmax, boolean oneMinusSoftMax)
public void labelWords(String label, Map<String,List<CoreLabel>> sents, Set<String> identifiedWords, String outFile, CollectionValuedMap<Integer,Triple<String,Integer,Integer>> matchedTokensByPat) throws IOException
Throws: IOException
public void iterateExtractApply() throws IOException, ClassNotFoundException
Throws: IOException, ClassNotFoundException
public void iterateExtractApply(Map<String,Integer> p0, Map<String,Counter<String>> p0Set, String wordsOutputFile, String sentsOutFile, String patternsOutFile, Map<String,Set<Integer>> ignorePatterns) throws IOException, ClassNotFoundException
p0 - Null in most cases. Only used for BPB.
p0Set - Null in most cases.
wordsOutputFile - If null, output is in the output directory.
sentsOutFile -
patternsOutFile -
ignorePatterns -
Throws: IOException, ClassNotFoundException
public Pair<Counter<Integer>,Counter<String>> iterateExtractApply4Label(String label, Integer p0, Counter<String> p0Set, BufferedWriter wordsOutput, String sentsOutFile, BufferedWriter patternsOut, Set<Integer> ignorePatterns, int numIter, Set<String> ignoreWords, CollectionValuedMap<Integer,Triple<String,Integer,Integer>> matchedTokensByPat, TwoDimensionalCounter<String,Integer> terms) throws IOException, ClassNotFoundException
Throws: IOException, ClassNotFoundException
public Counter<SurfacePattern> getLearnedPatternsSurfaceForm(String label)
public static boolean countResultsPerEntity(List<CoreLabel> doc, Counter<String> entityTP, Counter<String> entityFP, Counter<String> entityFN, String background, Counter<String> wordTP, Counter<String> wordTN, Counter<String> wordFP, Counter<String> wordFN, Class<? extends TypesafeMap.Key<String>> whichClassToCompare)
public static void countResultsPerToken(List<CoreLabel> doc, Counter<String> entityTP, Counter<String> entityFP, Counter<String> entityFN, String background, Counter<String> wordTP, Counter<String> wordTN, Counter<String> wordFP, Counter<String> wordFN, Class<? extends TypesafeMap.Key<String>> whichClassToCompare)
public static void countResults(List<CoreLabel> doc, Counter<String> entityTP, Counter<String> entityFP, Counter<String> entityFN, String background, Counter<String> wordTP, Counter<String> wordTN, Counter<String> wordFP, Counter<String> wordFN, Class<? extends TypesafeMap.Key<String>> whichClassToCompare, boolean evalPerEntity)
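The token-based counting described for countResultsPerToken can be pictured with a simplified sketch. It is an assumption-laden illustration, not the class's code: plain string lists stand in for List<CoreLabel>, TreeMap tallies stand in for Counter<String>, the word-level counters are omitted, and the class TokenCountsSketch and its count method are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TokenCountsSketch {
    // Per-label tallies; TreeMap keeps the printed order deterministic.
    final Map<String, Integer> tp = new TreeMap<>();
    final Map<String, Integer> fp = new TreeMap<>();
    final Map<String, Integer> fn = new TreeMap<>();

    /** Compares gold and guessed labels token by token, ignoring the background label. */
    void count(List<String> gold, List<String> guess, String background) {
        if (gold.size() != guess.size()) {
            throw new IllegalArgumentException("gold and guess must have the same length");
        }
        for (int i = 0; i < gold.size(); i++) {
            String g = gold.get(i);
            String p = guess.get(i);
            if (g.equals(p)) {
                if (!g.equals(background)) {
                    tp.merge(g, 1, Integer::sum); // correct non-background token
                }
            } else {
                if (!p.equals(background)) {
                    fp.merge(p, 1, Integer::sum); // guessed a label that is not in gold
                }
                if (!g.equals(background)) {
                    fn.merge(g, 1, Integer::sum); // missed a gold label
                }
            }
        }
    }

    public static void main(String[] args) {
        TokenCountsSketch counts = new TokenCountsSketch();
        counts.count(
            Arrays.asList("O", "DISEASE", "DISEASE", "O", "DRUG"),
            Arrays.asList("O", "DISEASE", "O", "DRUG", "DRUG"),
            "O");
        // Prints TP={DISEASE=1, DRUG=1} FP={DRUG=1} FN={DISEASE=1}
        System.out.println("TP=" + counts.tp + " FP=" + counts.fp + " FN=" + counts.fn);
    }
}
```

Entity-based counting differs in that a multi-token phrase counts as one unit, so a partially matched phrase would not earn a true positive.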
public void writeLabeledData(String outFile) throws IOException, ClassNotFoundException
Throws: IOException, ClassNotFoundException
public void writeColumnOutput(String outFile) throws IOException, ClassNotFoundException
Throws: IOException, ClassNotFoundException
public void evaluate(Map<String,List<CoreLabel>> testSentences, boolean evalPerEntity) throws IOException
Throws: IOException
public double FScore(double precision, double recall, double beta)
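This page does not state the formula behind FScore. The sketch below shows the standard F-beta measure, (1 + beta^2) * P * R / (beta^2 * P + R), on the assumption that the method follows it; the class FScoreSketch, the method name fScore, and the zero-denominator guard are choices made for this illustration.

```java
class FScoreSketch {
    /** Standard F-beta measure; beta > 1 weights recall more, beta < 1 weights precision more. */
    static double fScore(double precision, double recall, double beta) {
        double betaSq = beta * beta;
        double denominator = betaSq * precision + recall;
        if (denominator == 0.0) {
            return 0.0; // guard added in this sketch to avoid dividing by zero
        }
        return (1 + betaSq) * precision * recall / denominator;
    }

    public static void main(String[] args) {
        // With beta = 1 this is the harmonic mean of precision and recall.
        System.out.println(fScore(0.5, 0.5, 1.0)); // prints 0.5
        System.out.println(fScore(1.0, 0.5, 2.0));
    }
}
```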
public static Map<String,Set<String>> readSeedWords(Properties props)
public static GetPatternsFromDataMultiClass run(Properties props) throws IOException, ClassNotFoundException, IllegalAccessException, InterruptedException, ExecutionException, InstantiationException, NoSuchMethodException, InvocationTargetException, SQLException
public static void main(String[] args)