edu.stanford.nlp.process
Class WordShapeClassifier

java.lang.Object
  extended by edu.stanford.nlp.process.WordShapeClassifier

public class WordShapeClassifier
extends Object

Provides static methods that map any String to another String indicative of its "word shape", e.g., whether it is capitalized, numeric, etc. Different shapers may implement quite different, normally language-specific, ideas of which word shapes are useful.

Author:
Christopher Manning, Dan Klein

Field Summary
static int NOWORDSHAPE
           
static int WORDSHAPECHRIS1
           
static int WORDSHAPECHRIS2
           
static int WORDSHAPECHRIS2USELC
           
static int WORDSHAPECHRIS3
           
static int WORDSHAPECHRIS3USELC
           
static int WORDSHAPECHRIS4
           
static int WORDSHAPEDAN1
           
static int WORDSHAPEDAN2
           
static int WORDSHAPEDAN2BIO
           
static int WORDSHAPEDAN2BIOUSELC
           
static int WORDSHAPEDAN2USELC
           
static int WORDSHAPEJENNY1
           
static int WORDSHAPEJENNY1USELC
           
 
Method Summary
static void addKnownLowerCaseWords(Collection<String> words)
           
static Set<String> getKnownLowerCaseWords()
           
static int lookupShaper(String name)
           
static void main(String[] args)
          Usage: java edu.stanford.nlp.process.WordShapeClassifier [-wordShape name] string+
where name is an argument to lookupShaper.
static void setKnownLowerCaseWords(Set<String> words)
           
static boolean usesLC(int shape)
          Returns true if the specified word shaper uses known lower case words.
static String wordShape(String inStr, int wordShaper)
          Returns the result of applying the word shaper identified by the given int constant to the given string.
static String wordShape(String inStr, int wordShaper, boolean markKnownLC)
           
static String wordShape(String inStr, int wordShaper, Set<String> knownLCWords)
          Returns the result of applying the word shaper identified by the given int constant to the given string, using the given set of known lower case words.
static String wordShapeChris1(String s)
           
static String wordShapeChris2(String s, boolean markKnownLC, boolean omitIfInBoundary)
          This one picks up on Dan2 ideas, but seeks to make fewer distinctions mid-sequence by sorting for long words, while maintaining extra distinctions for short words.
static String wordShapeChris2Long(String s, boolean markKnownLC, boolean omitIfInBoundary, int len)
           
static String wordShapeChris4(String s, boolean markKnownLC, boolean omitIfInBoundary)
          This one picks up on Dan2 ideas, but seeks to make fewer distinctions mid-sequence by sorting for long words, while maintaining extra distinctions for short words by always recording the class of the first and last two characters of the word.
static String wordShapeDan1(String s)
          A fairly basic 5-way classifier that distinguishes digits, upper case, lower case, mixed, and non-alphanumeric.
static String wordShapeDan2(String s, boolean markKnownLC)
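          A fine-grained word shape classifier that equivalence-classes lower and upper case and digits, and collapses sequences of the same type, but keeps all punctuation.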
static String wordShapeDan2Bio(String s, boolean useKnownLC)
          A fine-grained word shape classifier that equivalence-classes lower and upper case and digits, and collapses sequences of the same type, but keeps all punctuation.
static String wordShapeJenny1(String s, boolean markKnownLC)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

NOWORDSHAPE

public static final int NOWORDSHAPE
See Also:
Constant Field Values

WORDSHAPEDAN1

public static final int WORDSHAPEDAN1
See Also:
Constant Field Values

WORDSHAPECHRIS1

public static final int WORDSHAPECHRIS1
See Also:
Constant Field Values

WORDSHAPEDAN2

public static final int WORDSHAPEDAN2
See Also:
Constant Field Values

WORDSHAPEDAN2USELC

public static final int WORDSHAPEDAN2USELC
See Also:
Constant Field Values

WORDSHAPEDAN2BIO

public static final int WORDSHAPEDAN2BIO
See Also:
Constant Field Values

WORDSHAPEDAN2BIOUSELC

public static final int WORDSHAPEDAN2BIOUSELC
See Also:
Constant Field Values

WORDSHAPEJENNY1

public static final int WORDSHAPEJENNY1
See Also:
Constant Field Values

WORDSHAPEJENNY1USELC

public static final int WORDSHAPEJENNY1USELC
See Also:
Constant Field Values

WORDSHAPECHRIS2

public static final int WORDSHAPECHRIS2
See Also:
Constant Field Values

WORDSHAPECHRIS2USELC

public static final int WORDSHAPECHRIS2USELC
See Also:
Constant Field Values

WORDSHAPECHRIS3

public static final int WORDSHAPECHRIS3
See Also:
Constant Field Values

WORDSHAPECHRIS3USELC

public static final int WORDSHAPECHRIS3USELC
See Also:
Constant Field Values

WORDSHAPECHRIS4

public static final int WORDSHAPECHRIS4
See Also:
Constant Field Values
Method Detail

lookupShaper

public static int lookupShaper(String name)

usesLC

public static boolean usesLC(int shape)
Returns true if the specified word shaper uses known lower case words.

Parameters:
shape - One of the defined shape constants
Returns:
true if the specified word shaper uses known lower case words.

wordShape

public static String wordShape(String inStr,
                               int wordShaper)
Returns the result of applying the word shaper identified by the given int constant to the given string.

Parameters:
inStr - String to calculate the word shape of
wordShaper - Constant for which shaping formula to use
Returns:
The wordshape String

wordShape

public static String wordShape(String inStr,
                               int wordShaper,
                               Set<String> knownLCWords)
Returns the result of applying the word shaper identified by the given int constant to the given string, using the given set of known lower case words.

Parameters:
inStr - String to calculate the word shape of
wordShaper - Constant for which shaping formula to use
knownLCWords - A set of known lowercase words, which some shapers use to decide the class of sentence-initial capitalized words. Warning: This set will be stored in a static variable and overwrite any previously set list of knownLCWords.
Returns:
The wordshape String
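
The warning on knownLCWords can be illustrated with a small stand-alone sketch. It does not call this class; KnownLowerCaseSketch, its placeholder shapes ("Xx" and "x"), and its two shape overloads are hypothetical names made up for the example. The only details taken from this page are that the set is held in a static variable and that a known word is tagged with a k suffix.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only; not the Stanford NLP implementation.
final class KnownLowerCaseSketch {

    // Shared by every caller, so a later call can replace an earlier set.
    private static Set<String> knownLCWords = new HashSet<>();

    // Stores the given set in the static field, then shapes the word.
    static String shape(String word, Set<String> known) {
        knownLCWords = known; // overwrites any previously stored set
        return shape(word, true);
    }

    // Uses whatever set is currently stored in the static field.
    static String shape(String word, boolean markKnownLC) {
        // Deliberately crude placeholder shape (an assumption for this demo).
        boolean capitalized = !word.isEmpty() && Character.isUpperCase(word.charAt(0));
        String base = capitalized ? "Xx" : "x";
        boolean known = knownLCWords.contains(word.toLowerCase(Locale.ROOT));
        return (markKnownLC && known) ? base + "k" : base;
    }

    public static void main(String[] args) {
        System.out.println(shape("The", Collections.singleton("the"))); // Xxk
        System.out.println(shape("Dog", Collections.singleton("dog"))); // Xxk
        // The first set is gone, so "the" is no longer known:
        System.out.println(shape("The", true)); // Xx
    }
}
```

In this sketch the third call no longer marks "The", because the second call replaced the stored set. Callers that need a stable list may prefer to build the full set once before shaping any words.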

wordShape

public static String wordShape(String inStr,
                               int wordShaper,
                               boolean markKnownLC)

wordShapeDan1

public static String wordShapeDan1(String s)
A fairly basic 5-way classifier that distinguishes digits, upper case, lower case, mixed, and non-alphanumeric.

Parameters:
s - String to find word shape of
Returns:
Its word shape: a 5-way classification
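
A 5-way classification of this kind might look like the following stand-alone sketch. It is one possible reading of the description above, not the code behind wordShapeDan1: the class name FiveWayShapeSketch, the label strings, and the exact rule for "mixed" (letters only, both cases present) are all assumptions, since this page does not list the strings the method returns.

```java
// Illustrative sketch only; not the Stanford NLP implementation.
final class FiveWayShapeSketch {

    // The label strings below are assumptions made for this example.
    static String shape(String s) {
        if (s.isEmpty()) {
            return "OTHER";
        }
        boolean allDigits = true;
        boolean allUpper = true;
        boolean allLower = true;
        boolean allLetters = true;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!Character.isDigit(c)) allDigits = false;
            if (!Character.isUpperCase(c)) allUpper = false;
            if (!Character.isLowerCase(c)) allLower = false;
            if (!Character.isLetter(c)) allLetters = false;
        }
        if (allDigits) return "ALL-DIGITS";
        if (allUpper) return "ALL-UPPER";
        if (allLower) return "ALL-LOWER";
        if (allLetters) return "MIXED-CASE"; // letters only, both cases present
        return "OTHER";                      // anything else, e.g. punctuation
    }

    public static void main(String[] args) {
        for (String w : new String[] {"2024", "NASA", "word", "Stanford", "x-ray"}) {
            System.out.println(w + " -> " + shape(w));
        }
    }
}
```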

wordShapeDan2

public static String wordShapeDan2(String s,
                                   boolean markKnownLC)
A fine-grained word shape classifier that equivalence-classes lower and upper case and digits, and collapses sequences of the same type, but keeps all punctuation, etc.

Note: We treat '_' as a lowercase letter, as many programming languages do. We do this because some applications, such as RTE, join tokens with '_'.

Parameters:
s - The String whose shape is to be returned
markKnownLC - Whether to mark words whose lower case form is found in the previously initialized list of known lower case words
Returns:
The word shape
See Also:
addKnownLowerCaseWords(Collection)
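
The idea of equivalence-classing characters and collapsing runs can be sketched as below. This is a stand-alone illustration, not the code behind wordShapeDan2: the class name CollapsedShapeSketch and the class symbols 'X', 'x' and 'd' are assumptions, and the known lower case marking is left out. The treatment of '_' as a lowercase letter follows the note above.

```java
// Illustrative sketch only; not the Stanford NLP implementation.
final class CollapsedShapeSketch {

    // Maps a character to its class symbol (symbols are assumptions).
    static char charClass(char c) {
        if (Character.isDigit(c)) return 'd';
        if (Character.isLowerCase(c) || c == '_') return 'x'; // '_' counts as lowercase
        if (Character.isUpperCase(c)) return 'X';
        return c; // punctuation and everything else is kept as itself
    }

    // Emits one symbol per run of identical class symbols.
    static String shape(String s) {
        StringBuilder sb = new StringBuilder();
        int last = -1; // no previous symbol yet
        for (int i = 0; i < s.length(); i++) {
            char m = charClass(s.charAt(i));
            if (m != last) {
                sb.append(m);
            }
            last = m;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(shape("McDonald"));   // XxXx
        System.out.println(shape("snake_case")); // x
        System.out.println(shape("U.S.A."));     // X.X.X.
        System.out.println(shape("Sept-11"));    // Xx-d
    }
}
```

Note that in this sketch a run of the same punctuation character also collapses to a single character; whether the real shaper does so is not stated on this page.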

wordShapeJenny1

public static String wordShapeJenny1(String s,
                                     boolean markKnownLC)

wordShapeChris2

public static String wordShapeChris2(String s,
                                     boolean markKnownLC,
                                     boolean omitIfInBoundary)
This one picks up on Dan2 ideas, but seeks to make fewer distinctions mid-sequence by sorting for long words, while maintaining extra distinctions for short words. It exactly preserves the character shape of the first and last 2 (i.e., BOUNDARY_SIZE) characters and then records the shapes that occur between them (perhaps only if they are different).

Parameters:
s - The String to find the word shape of
markKnownLC - Whether to tag with a k suffix a word whose lowercased form is in the list of known lower case words
omitIfInBoundary - If true, character classes present in the first or last two (i.e., BOUNDARY_SIZE) letters of the word are not also registered as classes that appear in the middle of the word.
Returns:
A word shape for the word.
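
The boundary-preserving scheme described above can be sketched as follows. This is a stand-alone approximation, not the code behind wordShapeChris2: the class name BoundaryShapeSketch, the class symbols, the use of a sorted set for the middle of the word, and the rule that words of at most 2 * BOUNDARY_SIZE characters keep their full shape are all assumptions. The known lower case marking is omitted.

```java
import java.util.TreeSet;

// Illustrative sketch only; not the Stanford NLP implementation.
final class BoundaryShapeSketch {

    static final int BOUNDARY_SIZE = 2;

    // Class symbols are assumptions made for this example.
    static char charClass(char c) {
        if (Character.isDigit(c)) return 'd';
        if (Character.isLowerCase(c)) return 'x';
        if (Character.isUpperCase(c)) return 'X';
        return c;
    }

    static String shape(String s, boolean omitIfInBoundary) {
        int len = s.length();
        StringBuilder start = new StringBuilder();
        if (len <= 2 * BOUNDARY_SIZE) {
            // Short word: keep the class of every character.
            for (int i = 0; i < len; i++) {
                start.append(charClass(s.charAt(i)));
            }
            return start.toString();
        }
        StringBuilder end = new StringBuilder();
        TreeSet<Character> middle = new TreeSet<>(); // sorted, no duplicates
        for (int i = 0; i < len; i++) {
            char m = charClass(s.charAt(i));
            if (i < BOUNDARY_SIZE) {
                start.append(m);
            } else if (i >= len - BOUNDARY_SIZE) {
                end.append(m);
            } else {
                middle.add(m);
            }
        }
        String boundary = start.toString() + end;
        StringBuilder sb = new StringBuilder(start);
        for (char m : middle) {
            // Optionally skip classes already visible in the boundary.
            if (omitIfInBoundary && boundary.indexOf(m) >= 0) {
                continue;
            }
            sb.append(m);
        }
        sb.append(end);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(shape("AT&T", true));       // XX&X
        System.out.println(shape("Stanford", false));  // Xxxxx
        System.out.println(shape("Stanford", true));   // Xxxx
        System.out.println(shape("iPhone-12", false)); // xX-xdd
        System.out.println(shape("iPhone-12", true));  // xX-dd
    }
}
```

Because the middle of the word is reduced to a sorted set of classes, long words that differ only in length or in the order of their interior characters receive the same shape in this sketch.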

wordShapeChris2Long

public static String wordShapeChris2Long(String s,
                                         boolean markKnownLC,
                                         boolean omitIfInBoundary,
                                         int len)

wordShapeChris4

public static String wordShapeChris4(String s,
                                     boolean markKnownLC,
                                     boolean omitIfInBoundary)
This one picks up on Dan2 ideas, but seeks to make fewer distinctions mid-sequence by sorting for long words, while maintaining extra distinctions for short words by always recording the class of the first and last two characters of the word. Compared to chris2, on which it is based, it uses more Unicode classes, and so collapses things like punctuation more, and might work better with real Unicode. Implementation note: This hasn't been optimized like wordShapeChris2, but could be.

Parameters:
s - The String to find the word shape of
markKnownLC - Whether to tag with a k suffix a word whose lowercased form is in the list of known lower case words
omitIfInBoundary - If true, character classes present in the first or last two (i.e., BOUNDARY_SIZE) letters of the word are not also registered as classes that appear in the middle of the word.
Returns:
A word shape for the word.

getKnownLowerCaseWords

public static Set<String> getKnownLowerCaseWords()

setKnownLowerCaseWords

public static void setKnownLowerCaseWords(Set<String> words)

addKnownLowerCaseWords

public static void addKnownLowerCaseWords(Collection<String> words)

wordShapeDan2Bio

public static String wordShapeDan2Bio(String s,
                                      boolean useKnownLC)
A fine-grained word shape classifier that equivalence-classes lower and upper case and digits, and collapses sequences of the same type, but keeps all punctuation. This adds an extra recognizer for a Greek letter embedded in the String, which is useful for bio.
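
One way to layer a Greek-letter recognizer on top of a collapsed shape is sketched below. This is a stand-alone illustration, not the code behind wordShapeDan2Bio. It assumes that "Greek letter" means a spelled-out name such as alpha or beta, and that a hit is marked with a "-GREEK" suffix; neither detail is stated on this page. The class name GreekShapeSketch and the short list of names are likewise made up for the example.

```java
import java.util.regex.Pattern;

// Illustrative sketch only; not the Stanford NLP implementation.
final class GreekShapeSketch {

    // A small, incomplete list of spelled-out Greek letter names (assumption).
    private static final Pattern GREEK_NAME = Pattern.compile(
            "alpha|beta|gamma|delta|epsilon|kappa|lambda|sigma|theta|omega",
            Pattern.CASE_INSENSITIVE);

    // Collapsed shape: one symbol per run of upper case, lower case or digits.
    static String collapsedShape(String s) {
        StringBuilder sb = new StringBuilder();
        int last = -1; // no previous symbol yet
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            char m = c; // punctuation is kept as itself
            if (Character.isDigit(c)) m = 'd';
            else if (Character.isLowerCase(c) || c == '_') m = 'x';
            else if (Character.isUpperCase(c)) m = 'X';
            if (m != last) sb.append(m);
            last = m;
        }
        return sb.toString();
    }

    // Appends a marker when a Greek letter name occurs anywhere in the string.
    static String shape(String s) {
        String base = collapsedShape(s);
        return GREEK_NAME.matcher(s).find() ? base + "-GREEK" : base;
    }

    public static void main(String[] args) {
        System.out.println(shape("TGF-beta1")); // X-xd-GREEK
        System.out.println(shape("IL-2"));      // X-d
        System.out.println(shape("kinase"));    // x
    }
}
```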


wordShapeChris1

public static String wordShapeChris1(String s)

main

public static void main(String[] args)
Usage: java edu.stanford.nlp.process.WordShapeClassifier [-wordShape name] string+
where name is an argument to lookupShaper. Known names have patterns along the lines of: dan[12](bio)?(UseLC)?, jenny1(useLC)?, chris[1234](useLC)?.

Parameters:
args - Command-line arguments, as above.
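
The name patterns quoted above can be checked with a small stand-alone sketch. It does not call lookupShaper; ShaperNameSketch and looksLikeShaperName are hypothetical names. Case-insensitive matching is an assumption, and since the page only says names are "along the lines of" these patterns, a match here does not guarantee that lookupShaper accepts the name.

```java
import java.util.regex.Pattern;

// Illustrative sketch only; not the Stanford NLP implementation.
final class ShaperNameSketch {

    // The three patterns from the usage text, combined; case is ignored (assumption).
    private static final Pattern KNOWN_NAME = Pattern.compile(
            "dan[12](bio)?(uselc)?|jenny1(uselc)?|chris[1234](uselc)?",
            Pattern.CASE_INSENSITIVE);

    // True if the whole name fits one of the documented patterns.
    static boolean looksLikeShaperName(String name) {
        return KNOWN_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeShaperName("dan2bioUseLC")); // true
        System.out.println(looksLikeShaperName("chris4"));       // true
        System.out.println(looksLikeShaperName("chris5"));       // false
    }
}
```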


Stanford NLP Group