public class NumberNormalizer extends Object
TODO: To be merged into QuantifiableEntityNormalizer.
QuantifiableEntityNormalizer can use this class
to first convert numbers expressed as words
into numeric quantities, before working
out higher-level combinations
(like one hundred dollars and five cents).
TODO: Known not to handle the following:
- oh: two oh one
- non-integers: one and a half, one point five, three fifths
- funky numbers: pi
TODO: This class is very language dependent.
It should really be named AmericanEnglishNumberNormalizer.
TODO: Make things not static
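As a rough illustration of the word-to-number step described above, here is a minimal sketch in plain Java. It is not the CoreNLP implementation: the class `WordNumberSketch` and its method `toNumber` are hypothetical names introduced only for this example, and the vocabulary is deliberately limited to simple English integers. Inputs from the "known not to handle" list, such as "two oh one" or "one and a half", are not supported here either; the sketch simply throws on them.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; this is not how NumberNormalizer is implemented.
final class WordNumberSketch {
    private static final Map<String, Long> SMALL = new HashMap<>();       // zero..nineteen and the tens
    private static final Map<String, Long> MULTIPLIERS = new HashMap<>(); // thousand and above

    static {
        String[] ones = {"zero", "one", "two", "three", "four", "five", "six", "seven",
                "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
                "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"};
        for (int i = 0; i < ones.length; i++) SMALL.put(ones[i], (long) i);
        String[] tens = {"twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"};
        for (int i = 0; i < tens.length; i++) SMALL.put(tens[i], 20L + 10L * i);
        MULTIPLIERS.put("thousand", 1_000L);
        MULTIPLIERS.put("million", 1_000_000L);
        MULTIPLIERS.put("billion", 1_000_000_000L);
    }

    /** Converts a simple English integer phrase, e.g. "four hundred and seven", to a long. */
    static long toNumber(String text) {
        long total = 0;   // sum of completed thousand/million/billion groups
        long current = 0; // value of the group currently being read
        for (String word : text.trim().toLowerCase(Locale.ROOT).split("[\\s-]+")) {
            if (word.equals("and")) {
                continue; // "and" carries no value in "four hundred and seven"
            } else if (SMALL.containsKey(word)) {
                current += SMALL.get(word);
            } else if (word.equals("hundred")) {
                current = Math.max(current, 1) * 100; // a bare "hundred" counts as 100
            } else if (MULTIPLIERS.containsKey(word)) {
                total += Math.max(current, 1) * MULTIPLIERS.get(word);
                current = 0;
            } else {
                throw new IllegalArgumentException("Unrecognized number word: " + word);
            }
        }
        return total + current;
    }

    public static void main(String[] args) {
        System.out.println(toNumber("one hundred"));            // prints 100
        System.out.println(toNumber("four hundred and seven")); // prints 407
    }
}
```

Higher-level combinations such as "one hundred dollars and five cents" are out of scope for this sketch, in line with the division of labour described above.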
Modifier and Type | Field and Description |
---|---|
protected static Pattern | digitsPattern |
Modifier and Type | Method and Description |
---|---|
static List<CoreMap> | findAndAnnotateNumericExpressions(CoreMap annotation) |
static List<CoreMap> | findAndAnnotateNumericExpressionsWithRanges(CoreMap annotation) |
static List<CoreMap> | findAndMergeNumbers(CoreMap annotationRaw)
Takes an annotation and identifies the numbers in it.
Returns a list of tokens (as CoreMaps) with the numbers merged.
As a by-product, also marks each individual token with TokenBeginAnnotation and TokenEndAnnotation;
this is mainly to make it easier for the rest of the code to work out the token offsets.
|
static List<CoreMap> | findNumberRanges(CoreMap annotation) |
static List<CoreMap> | findNumbers(CoreMap annotation)
Finds and marks numbers (does not need NumberSequenceClassifier).
Each token is annotated with its numeric value and type:
- CoreAnnotations.NumericTypeAnnotation.class: ORDINAL, UNIT (hundred, thousand, ..., dozen, gross, ...), NUMBER
- CoreAnnotations.NumericValueAnnotation.class: Number representing the numeric value of the token
(two thousand => 2 1000)
Also tries to separate individual numbers like four five six,
while keeping numbers like four hundred and seven together.
Tokens belonging to each composite number are annotated with:
- CoreAnnotations.NumericCompositeTypeAnnotation.class: ORDINAL (1st, 2nd), NUMBER (one hundred)
- CoreAnnotations.NumericCompositeValueAnnotation.class: Number representing the composite numeric value
(two thousand => 2000 2000)
Also returns a list of CoreMaps representing the identified numbers.
The function is overly aggressive in marking possible numbers:
it should either do more checks or be used in conjunction with NumberSequenceClassifier
to avoid marking certain tokens (like second/NN) as numbers.
|
static Env | getNewEnv() |
static void | initEnv(Env env) |
static void | setVerbose(boolean verbose) |
static Number | wordToNumber(String str)
Fairly generous utility function to convert a string that (hopefully) represents
a number to a Number.
|
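The findNumbers entry in the table above distinguishes token-level values (two thousand => 2 1000) from composite values (two thousand => 2000 2000), and mentions splitting sequences like four five six while keeping four hundred and seven together. The following plain-Java sketch illustrates that idea under simplifying assumptions. Everything in it is hypothetical: the class `CompositeNumberSketch`, its methods, and the grouping rule (two plain number words combine only as tens plus ones) are introduced for this example and are not taken from CoreNLP, which works on CoreMap tokens and annotations rather than strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; names and grouping rules are assumptions, not CoreNLP code.
final class CompositeNumberSketch {
    private static final Map<String, Long> NUMBERS = new HashMap<>(); // token type NUMBER
    private static final Map<String, Long> UNITS = new HashMap<>();   // token type UNIT

    static {
        String[] ones = {"one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
                "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
                "seventeen", "eighteen", "nineteen"};
        for (int i = 0; i < ones.length; i++) NUMBERS.put(ones[i], i + 1L);
        String[] tens = {"twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"};
        for (int i = 0; i < tens.length; i++) NUMBERS.put(tens[i], 20L + 10L * i);
        UNITS.put("hundred", 100L);
        UNITS.put("thousand", 1_000L);
        UNITS.put("million", 1_000_000L);
    }

    /** Token-level value ("two" => 2, "thousand" => 1000), or null if the token is not a number word. */
    static Long tokenValue(String token) {
        Long value = NUMBERS.get(token);
        return value != null ? value : UNITS.get(token);
    }

    /** True if two adjacent number words cannot belong to the same number ("four five"). */
    private static boolean startsNewNumber(String previous, String token) {
        if (previous == null || UNITS.containsKey(previous) || UNITS.containsKey(token)) {
            return false;
        }
        // Two NUMBER words combine only as tens + ones, e.g. "twenty one".
        return !(NUMBERS.get(previous) >= 20 && NUMBERS.get(token) < 10);
    }

    /** Splits a token sequence into groups, each forming one composite number. */
    static List<List<String>> group(List<String> tokens) {
        List<List<String>> groups = new ArrayList<>();
        List<String> current = new ArrayList<>();
        String previous = null; // last number word added to the current group
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            // Keep "and" only between a unit and a following number word ("hundred and seven").
            if (token.equals("and") && previous != null && UNITS.containsKey(previous)
                    && i + 1 < tokens.size() && NUMBERS.containsKey(tokens.get(i + 1))) {
                current.add(token);
                continue;
            }
            boolean isNumberWord = tokenValue(token) != null;
            if (!isNumberWord || startsNewNumber(previous, token)) {
                if (!current.isEmpty()) groups.add(current); // close the group read so far
                current = new ArrayList<>();
                previous = null;
            }
            if (isNumberWord) {
                current.add(token);
                previous = token;
            }
        }
        if (!current.isEmpty()) groups.add(current);
        return groups;
    }

    /** Composite value of one group, e.g. [two, thousand] => 2000. */
    static long compositeValue(List<String> group) {
        long total = 0, current = 0;
        for (String token : group) {
            if (token.equals("and")) continue;
            long value = tokenValue(token);
            if (NUMBERS.containsKey(token)) current += value;
            else if (value == 100L) current = Math.max(current, 1) * value;
            else { total += Math.max(current, 1) * value; current = 0; }
        }
        return total + current;
    }

    public static void main(String[] args) {
        for (List<String> g : group(Arrays.asList("four", "five", "six"))) {
            System.out.println(g + " => " + compositeValue(g));
        }
    }
}
```

Note that a sketch like this shares the weakness mentioned for findNumbers: it looks only at surface words, so it has no way to tell a number word from the same word used in another sense.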
protected static final Pattern digitsPattern
public static void setVerbose(boolean verbose)
public static Number wordToNumber(String str)
str - The String to convert
public static Env getNewEnv()
public static void initEnv(Env env)
public static List<CoreMap> findNumbers(CoreMap annotation)
annotation - The annotation structure
public static List<CoreMap> findAndMergeNumbers(CoreMap annotationRaw)
annotationRaw - The annotation to find numbers in
public static List<CoreMap> findAndAnnotateNumericExpressions(CoreMap annotation)