java.lang.Object
  edu.stanford.nlp.process.WordShapeClassifier
public class WordShapeClassifier
Provides static methods that map any String to another String indicative of its "word shape", e.g., whether it is capitalized, numeric, etc. Different implementations may embody quite different, normally language-specific, ideas of which word shapes are useful.
Field Summary | |
---|---|
static int | NOWORDSHAPE |
static int | WORDSHAPECHRIS1 |
static int | WORDSHAPECHRIS2 |
static int | WORDSHAPECHRIS2USELC |
static int | WORDSHAPECHRIS3 |
static int | WORDSHAPECHRIS3USELC |
static int | WORDSHAPECHRIS4 |
static int | WORDSHAPEDAN1 |
static int | WORDSHAPEDAN2 |
static int | WORDSHAPEDAN2BIO |
static int | WORDSHAPEDAN2BIOUSELC |
static int | WORDSHAPEDAN2USELC |
static int | WORDSHAPEJENNY1 |
static int | WORDSHAPEJENNY1USELC |
Method Summary | |
---|---|
static void | addKnownLowerCaseWords(Collection<String> words) |
static Set<String> | getKnownLowerCaseWords() |
static int | lookupShaper(String name) |
static void | main(String[] args): Usage: java edu.stanford.nlp.process.WordShapeClassifier [-wordShape name] string+, where name is an argument to lookupShaper. |
static void | setKnownLowerCaseWords(Set<String> words) |
static boolean | usesLC(int shape): Returns true if the specified word shaper uses known lower case words. |
static String | wordShape(String inStr, int wordShaper): Returns the result of applying the word shaper identified by the int constant to the given string. |
static String | wordShape(String inStr, int wordShaper, boolean markKnownLC) |
static String | wordShape(String inStr, int wordShaper, Set<String> knownLCWords): Returns the result of applying the word shaper identified by the int constant to the given string. |
static String | wordShapeChris1(String s) |
static String | wordShapeChris2(String s, boolean markKnownLC, boolean omitIfInBoundary): Picks up on Dan2 ideas, but seeks to make fewer distinctions mid-sequence by sorting for long words, while maintaining extra distinctions for short words. |
static String | wordShapeChris2Long(String s, boolean markKnownLC, boolean omitIfInBoundary, int len) |
static String | wordShapeChris4(String s, boolean markKnownLC, boolean omitIfInBoundary): Picks up on Dan2 ideas, but seeks to make fewer distinctions mid-sequence by sorting for long words, while maintaining extra distinctions for short words by always recording the class of the first and last two characters of the word. |
static String | wordShapeDan1(String s): A fairly basic 5-way classifier that notes digits, upper case, lower case, mixed, and non-alphanumeric. |
static String | wordShapeDan2(String s, boolean markKnownLC): A fine-grained word shape classifier that groups characters into equivalence classes. |
static String | wordShapeDan2Bio(String s, boolean useKnownLC): Returns a fine-grained word shape that puts lower case, upper case, and digits into equivalence classes and collapses sequences of the same type, but keeps all punctuation. |
static String | wordShapeJenny1(String s, boolean markKnownLC) |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final int NOWORDSHAPE
public static final int WORDSHAPEDAN1
public static final int WORDSHAPECHRIS1
public static final int WORDSHAPEDAN2
public static final int WORDSHAPEDAN2USELC
public static final int WORDSHAPEDAN2BIO
public static final int WORDSHAPEDAN2BIOUSELC
public static final int WORDSHAPEJENNY1
public static final int WORDSHAPEJENNY1USELC
public static final int WORDSHAPECHRIS2
public static final int WORDSHAPECHRIS2USELC
public static final int WORDSHAPECHRIS3
public static final int WORDSHAPECHRIS3USELC
public static final int WORDSHAPECHRIS4
Method Detail |
---|
public static int lookupShaper(String name)
public static boolean usesLC(int shape)
shape - One of the defined shape constants
public static String wordShape(String inStr, int wordShaper)
inStr - String to calculate the word shape of
wordShaper - Constant for which shaping formula to use
public static String wordShape(String inStr, int wordShaper, Set<String> knownLCWords)
inStr - String to calculate the word shape of
wordShaper - Constant for which shaping formula to use
knownLCWords - A set of known lowercase words, which some shapers use to decide the class of sentence-initial capitalized words. Warning: This set will be stored in a static variable and will overwrite any previously set list of knownLCWords.
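The warning about the static variable can be illustrated with a small standalone sketch. It does not call this class; `StaticKnownWordsSketch`, its two `shape` overloads, and the `Xx`/`x`/`k` output strings are hypothetical and exist only to show how a set passed in one call keeps affecting later calls until another call overwrites it.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for a shaper that keeps its known-lowercase set in a
// static field. The output strings are illustrative, not the library's.
final class StaticKnownWordsSketch {
    private static Set<String> knownLC = Collections.emptySet();

    // Stores the supplied set statically (replacing any earlier set),
    // then shapes the word.
    static String shape(String word, Set<String> knownLCWords) {
        knownLC = knownLCWords;
        return shape(word);
    }

    // Uses whichever set was stored most recently.
    static String shape(String word) {
        if (word.isEmpty()) {
            return "";
        }
        String base = Character.isUpperCase(word.charAt(0)) ? "Xx" : "x";
        boolean known = knownLC.contains(word.toLowerCase(Locale.ROOT));
        return known ? base + "k" : base;
    }
}
```

In this sketch, a call that passes a set containing "the" makes a later one-argument call on "The" return the marked form, and a subsequent call with a different set silently removes that marking. Code that shares the class across threads or models would need to account for the same effect.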
public static String wordShape(String inStr, int wordShaper, boolean markKnownLC)
public static String wordShapeDan1(String s)
s - String to find word shape of
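As a rough illustration of what a 5-way classifier over digits, upper case, lower case, mixed, and non-alphanumeric input might look like, here is a minimal standalone sketch. It is not the implementation of wordShapeDan1: the class name `FiveWayShapeSketch`, the exact category rules, and the label strings are all assumptions made for the example.

```java
// Hypothetical sketch: the labels below are illustrative and are not
// claimed to match the strings returned by wordShapeDan1.
final class FiveWayShapeSketch {
    static String shape(String s) {
        boolean digit = false, upper = false, lower = false, other = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isDigit(c)) {
                digit = true;
            } else if (Character.isUpperCase(c)) {
                upper = true;
            } else if (Character.isLowerCase(c)) {
                lower = true;
            } else {
                other = true; // anything that is not a digit or a cased letter
            }
        }
        int kinds = (digit ? 1 : 0) + (upper ? 1 : 0) + (lower ? 1 : 0);
        if (other || kinds == 0) {
            return "NON-ALNUM"; // contains some other character, or is empty
        }
        if (kinds > 1) {
            return "MIXED"; // more than one of digit / upper / lower
        }
        if (digit) {
            return "ALL-DIGITS";
        }
        return upper ? "ALL-UPPER" : "ALL-LOWER";
    }
}
```

Under these assumed rules, "2024", "NASA", "word", "Word", and "$%" each fall into a different one of the five classes.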
public static String wordShapeDan2(String s, boolean markKnownLC)
Note: We treat '_' as a lowercase letter, sort of like many programming languages. We do this because we use '_' joining of tokens in some applications like RTE.
s - The String whose shape is to be returned
markKnownLC - Whether to mark words whose lower case form is found in the previously initialized list of known lower case words
See also: addKnownLowerCaseWords(Collection)
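The general idea of a fine-grained shape that treats '_' as a lowercase letter, collapses runs of the same character class, and keeps punctuation can be sketched as follows. This is a standalone illustration, not the code behind wordShapeDan2: the class name `CollapsedShapeSketch` and the class symbols `X`, `x`, and `d` are assumptions, and known-lowercase marking is left out.

```java
// Hypothetical sketch of a collapsing word shape. The symbols X (upper),
// x (lower or '_'), and d (digit) are assumed for illustration only.
final class CollapsedShapeSketch {
    static String shape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            char symbol;
            boolean isClass = true;
            if (Character.isDigit(c)) {
                symbol = 'd';
            } else if (Character.isLowerCase(c) || c == '_') {
                symbol = 'x'; // '_' is treated like a lowercase letter
            } else if (Character.isUpperCase(c)) {
                symbol = 'X';
            } else {
                symbol = c; // punctuation and other characters pass through
                isClass = false;
            }
            boolean repeatsPrevious = sb.length() > 0
                    && sb.charAt(sb.length() - 1) == symbol;
            // Collapse only runs of class symbols; every punctuation
            // character is kept, even when repeated.
            if (!(isClass && repeatsPrevious)) {
                sb.append(symbol);
            }
        }
        return sb.toString();
    }
}
```

With these assumed symbols, a token joined with underscores such as "foo_bar" collapses to a single lowercase run, while "McDonald's" keeps its case alternation and its apostrophe.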
public static String wordShapeJenny1(String s, boolean markKnownLC)
public static String wordShapeChris2(String s, boolean markKnownLC, boolean omitIfInBoundary)
s - The String to find the word shape of
markKnownLC - Whether to add a k suffix to a word whose lowercased form is in the list of known lowercase words
omitIfInBoundary - If true, character classes present in the first or last two (i.e., BOUNDARY_SIZE) letters of the word are not also registered as classes that appear in the middle of the word.
public static String wordShapeChris2Long(String s, boolean markKnownLC, boolean omitIfInBoundary, int len)
public static String wordShapeChris4(String s, boolean markKnownLC, boolean omitIfInBoundary)
s - The String to find the word shape of
markKnownLC - Whether to add a k suffix to a word whose lowercased form is in the list of known lowercase words
omitIfInBoundary - If true, character classes present in the first or last two (i.e., BOUNDARY_SIZE) letters of the word are not also registered as classes that appear in the middle of the word.
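The effect of omitIfInBoundary is easier to see in a small standalone sketch. The code below is an assumption-laden illustration rather than the library's algorithm: it records the class of the first and last two characters, summarizes the middle of a long word as a sorted set of classes, and optionally drops middle classes already seen in the boundary. The class name `BoundaryShapeSketch`, the symbols `X`, `x`, and `d`, and the rule for short words are hypothetical, and the k suffix is not modeled.

```java
import java.util.TreeSet;

// Hypothetical sketch of a boundary-aware word shape.
final class BoundaryShapeSketch {
    static final int BOUNDARY_SIZE = 2;

    static char classOf(char c) {
        if (Character.isDigit(c)) return 'd';
        if (Character.isLowerCase(c)) return 'x';
        if (Character.isUpperCase(c)) return 'X';
        return c; // other characters stand for themselves
    }

    static String shape(String s, boolean omitIfInBoundary) {
        int len = s.length();
        StringBuilder head = new StringBuilder();
        if (len <= 2 * BOUNDARY_SIZE) {
            // Short word: keep the class of every character.
            for (int i = 0; i < len; i++) head.append(classOf(s.charAt(i)));
            return head.toString();
        }
        StringBuilder tail = new StringBuilder();
        for (int i = 0; i < BOUNDARY_SIZE; i++) {
            head.append(classOf(s.charAt(i)));
        }
        for (int i = len - BOUNDARY_SIZE; i < len; i++) {
            tail.append(classOf(s.charAt(i)));
        }
        // Middle classes are kept as a sorted set, so order and repetition
        // inside a long word are ignored.
        TreeSet<Character> middle = new TreeSet<>();
        for (int i = BOUNDARY_SIZE; i < len - BOUNDARY_SIZE; i++) {
            char cls = classOf(s.charAt(i));
            String key = String.valueOf(cls);
            boolean inBoundary = head.indexOf(key) >= 0 || tail.indexOf(key) >= 0;
            if (!(omitIfInBoundary && inBoundary)) {
                middle.add(cls);
            }
        }
        StringBuilder sb = new StringBuilder(head);
        for (char cls : middle) sb.append(cls);
        return sb.append(tail).toString();
    }
}
```

In this sketch, turning on omitIfInBoundary shortens the shape of a word such as "Stanford", because the lowercase class in the middle is already recorded by the boundary characters.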
public static Set<String> getKnownLowerCaseWords()
public static void setKnownLowerCaseWords(Set<String> words)
public static void addKnownLowerCaseWords(Collection<String> words)
public static String wordShapeDan2Bio(String s, boolean useKnownLC)
public static String wordShapeChris1(String s)
public static void main(String[] args)
Usage: java edu.stanford.nlp.process.WordShapeClassifier [-wordShape name] string+
where name is an argument to lookupShaper. Known names have patterns along the lines of: dan[12](bio)?(UseLC)?, jenny1(useLC)?, chris[1234](useLC)?.
args - Command-line arguments, as above.
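The name patterns listed above can be checked with java.util.regex before a name is passed on. The sketch below only tests a string against those documented patterns; it does not call lookupShaper, and the names actually accepted may differ. The class `ShaperNamePatternSketch` is hypothetical, and matching case-insensitively is an assumption made because the patterns spell the suffix both as UseLC and useLC.

```java
import java.util.regex.Pattern;

// Hypothetical helper that tests a name against the documented patterns.
final class ShaperNamePatternSketch {
    // Assumption: case is ignored, since the documentation mixes
    // "UseLC" and "useLC".
    private static final Pattern KNOWN = Pattern.compile(
            "dan[12](bio)?(uselc)?|jenny1(uselc)?|chris[1234](uselc)?",
            Pattern.CASE_INSENSITIVE);

    static boolean looksKnown(String name) {
        return KNOWN.matcher(name).matches();
    }
}
```

For example, "dan2bioUseLC" and "chris4" fit the patterns, while "jenny2" and "chris5" do not.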