edu.stanford.nlp.process
Class WordShapeClassifier

java.lang.Object
  extended by edu.stanford.nlp.process.WordShapeClassifier

public class WordShapeClassifier
extends Object

Provides static methods that map any String to another String indicative of its "word shape", e.g., whether it is capitalized, numeric, etc. Different implementations may embody quite different, normally language-specific, ideas of which word shapes are useful.

Author:
Christopher Manning, Dan Klein

Field Summary
static int NOWORDSHAPE
           
static int WORDSHAPECHRIS1
           
static int WORDSHAPECHRIS2
           
static int WORDSHAPECHRIS2USELC
           
static int WORDSHAPECHRIS3
           
static int WORDSHAPECHRIS3USELC
           
static int WORDSHAPECHRIS4
           
static int WORDSHAPEDAN1
           
static int WORDSHAPEDAN2
           
static int WORDSHAPEDAN2BIO
           
static int WORDSHAPEDAN2BIOUSELC
           
static int WORDSHAPEDAN2USELC
           
static int WORDSHAPEJENNY1
           
static int WORDSHAPEJENNY1USELC
           
 
Method Summary
static void addKnownLowerCaseWords(Collection words)
           
static Set getKnownLowerCaseWords()
           
static int lookupShaper(String name)
           
static void main(String[] args)
          Usage: java edu.stanford.nlp.process.WordShapeClassifier [-wordShape name] string+
where name is an argument to lookupShaper.
static void setKnownLowerCaseWords(Set words)
           
static boolean usesLC(int shape)
          Returns true if the specified word shaper uses known lower case words.
static String wordShape(String inStr, int wordShaper)
          Returns the result of applying the word shaper identified by wordShaper to the string inStr.
static String wordShape(String inStr, int wordShaper, boolean markKnownLC)
           
static String wordShape(String inStr, int wordShaper, boolean markKnownLC, Set knownLCWords)
           
static String wordShape(String inStr, int wordShaper, Set knownLCWords)
          Returns the result of applying the word shaper identified by wordShaper to the string inStr.
static String wordShapeChris1(String s)
           
static String wordShapeChris2(String s, boolean markKnownLC, boolean omitIfInBoundary)
          Builds on the Dan2 ideas, but seeks to make fewer distinctions mid-sequence by sorting for long words, while maintaining extra distinctions for short words.
static String wordShapeChris4(String s, boolean markKnownLC, boolean omitIfInBoundary)
          Builds on the Dan2 ideas, but seeks to make fewer distinctions mid-sequence by sorting for long words, while maintaining extra distinctions for short words by always recording the class of the first and last two characters of the word.
static String wordShapeDan1(String s)
          A fairly basic 5-way classifier that distinguishes digits, upper case, lower case, mixed, and non-alphanumeric.
static String wordShapeDan2(String s, boolean markKnownLC)
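          A fine-grained word shape classifier that equivalence-classes lower and upper case and digits, and collapses sequences of the same type, but keeps all punctuation, etc.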
static String wordShapeDan2Bio(String s, boolean useKnownLC)
          A fine-grained word shape classifier that equivalence-classes lower and upper case and digits, and collapses sequences of the same type, but keeps all punctuation.
static String wordShapeJenny1(String s, boolean markKnownLC)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

NOWORDSHAPE

public static final int NOWORDSHAPE
See Also:
Constant Field Values

WORDSHAPEDAN1

public static final int WORDSHAPEDAN1
See Also:
Constant Field Values

WORDSHAPECHRIS1

public static final int WORDSHAPECHRIS1
See Also:
Constant Field Values

WORDSHAPEDAN2

public static final int WORDSHAPEDAN2
See Also:
Constant Field Values

WORDSHAPEDAN2USELC

public static final int WORDSHAPEDAN2USELC
See Also:
Constant Field Values

WORDSHAPEDAN2BIO

public static final int WORDSHAPEDAN2BIO
See Also:
Constant Field Values

WORDSHAPEDAN2BIOUSELC

public static final int WORDSHAPEDAN2BIOUSELC
See Also:
Constant Field Values

WORDSHAPEJENNY1

public static final int WORDSHAPEJENNY1
See Also:
Constant Field Values

WORDSHAPEJENNY1USELC

public static final int WORDSHAPEJENNY1USELC
See Also:
Constant Field Values

WORDSHAPECHRIS2

public static final int WORDSHAPECHRIS2
See Also:
Constant Field Values

WORDSHAPECHRIS2USELC

public static final int WORDSHAPECHRIS2USELC
See Also:
Constant Field Values

WORDSHAPECHRIS3

public static final int WORDSHAPECHRIS3
See Also:
Constant Field Values

WORDSHAPECHRIS3USELC

public static final int WORDSHAPECHRIS3USELC
See Also:
Constant Field Values

WORDSHAPECHRIS4

public static final int WORDSHAPECHRIS4
See Also:
Constant Field Values
Method Detail

lookupShaper

public static int lookupShaper(String name)

usesLC

public static boolean usesLC(int shape)
Returns true if the specified word shaper uses known lower case words.


wordShape

public static String wordShape(String inStr,
                               int wordShaper)
Returns the result of applying the word shaper identified by wordShaper to the string inStr.


wordShape

public static String wordShape(String inStr,
                               int wordShaper,
                               Set knownLCWords)
Returns the result of applying the word shaper identified by wordShaper to the string inStr.


wordShape

public static String wordShape(String inStr,
                               int wordShaper,
                               boolean markKnownLC)

wordShape

public static String wordShape(String inStr,
                               int wordShaper,
                               boolean markKnownLC,
                               Set knownLCWords)

wordShapeDan1

public static String wordShapeDan1(String s)
A fairly basic 5-way classifier that distinguishes digits, upper case, lower case, mixed, and non-alphanumeric.
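
The five-way split described above can be illustrated with a minimal sketch. This is not the Stanford implementation: the class name FiveWayShapeSketch, the label strings, and the reading of "mixed" as "all letters, but neither all upper nor all lower case" are assumptions made for this example.

```java
// Illustrative sketch only; NOT the code behind wordShapeDan1.
class FiveWayShapeSketch {

    // The label strings are hypothetical; the real method may return others.
    static String shape(String s) {
        if (s.isEmpty()) {
            return "OTHER";
        }
        boolean allDigits = true;
        boolean allUpper = true;
        boolean allLower = true;
        boolean allLetters = true;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!Character.isDigit(c)) allDigits = false;
            if (!Character.isUpperCase(c)) allUpper = false;
            if (!Character.isLowerCase(c)) allLower = false;
            if (!Character.isLetter(c)) allLetters = false;
        }
        if (allDigits) return "DIGITS";
        if (allUpper) return "UPPER";
        if (allLower) return "LOWER";
        if (allLetters) return "MIXED";   // letters only, with both cases present
        return "OTHER";                   // anything containing non-alphanumerics, etc.
    }

    public static void main(String[] args) {
        for (String w : new String[] {"2024", "NASA", "shape", "McDonald", "e-mail"}) {
            System.out.println(w + " -> " + shape(w));
        }
    }
}
```

A classifier this coarse yields very few distinct feature values, which is the trade-off against the finer-grained shapers documented below.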


wordShapeDan2

public static String wordShapeDan2(String s,
                                   boolean markKnownLC)
A fine-grained word shape classifier that equivalence-classes lower and upper case and digits, and collapses sequences of the same type, but keeps all punctuation, etc.

Note: '_' is treated as a lowercase letter, as in many programming languages. This is done because some applications, such as RTE, join tokens with '_'.

Parameters:
s - The String whose shape is to be returned
markKnownLC - Whether to mark words whose lower case form is found in the previously initialized list of known lower case words
Returns:
The word shape
See Also:
addKnownLowerCaseWords(Collection)
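
As a rough illustration of the behaviour described for wordShapeDan2, the sketch below maps each character to a class marker, collapses runs of the same marker, keeps punctuation verbatim, and treats '_' as lowercase. The class name CollapsedShapeSketch, the markers 'X', 'x' and 'd', the trailing 'k' for known lower case words, and the explicit Set parameter are all assumptions of this example and may differ from what the library actually emits.

```java
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only; NOT the code behind wordShapeDan2.
class CollapsedShapeSketch {

    // Hypothetical class markers: 'd' digit, 'x' lowercase (and '_'), 'X' uppercase.
    static char classOf(char c) {
        if (Character.isDigit(c)) return 'd';
        if (Character.isLowerCase(c) || c == '_') return 'x';
        if (Character.isUpperCase(c)) return 'X';
        return c; // punctuation and everything else is kept as-is
    }

    static String shape(String s, boolean markKnownLC, Set<String> knownLCWords) {
        StringBuilder sb = new StringBuilder();
        char prev = '\0';
        for (int i = 0; i < s.length(); i++) {
            char m = classOf(s.charAt(i));
            boolean collapsible = (m == 'd' || m == 'x' || m == 'X');
            // Collapse runs of the same class, but never drop punctuation.
            if (!(collapsible && m == prev)) {
                sb.append(m);
            }
            prev = m;
        }
        // Assumed convention: append 'k' when the lowercased word is known.
        if (markKnownLC && knownLCWords.contains(s.toLowerCase(Locale.ROOT))) {
            sb.append('k');
        }
        return sb.toString();
    }
}
```

Under these assumptions "McDonald" maps to "XxXx" and "R2-D2" to "Xd-Xd", so words of different lengths but the same capitalization pattern share one shape.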

wordShapeJenny1

public static String wordShapeJenny1(String s,
                                     boolean markKnownLC)

wordShapeChris2

public static String wordShapeChris2(String s,
                                     boolean markKnownLC,
                                     boolean omitIfInBoundary)
Builds on the Dan2 ideas, but seeks to make fewer distinctions mid-sequence by sorting for long words, while maintaining extra distinctions for short words.

Parameters:
omitIfInBoundary - If true, character classes present in the first or last two letters of the word are not also registered as classes that appear in the middle of the word.
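
One way to read the description above, including the effect of omitIfInBoundary, is sketched here. The class name BoundaryShapeSketch, the class markers, and the length threshold that separates "short" from "long" words are assumptions for illustration and are not taken from the library source.

```java
import java.util.TreeSet;

// Illustrative sketch only; NOT the code behind wordShapeChris2.
class BoundaryShapeSketch {

    static final int BOUNDARY = 2; // first and last two characters are always recorded

    static char classOf(char c) {
        if (Character.isDigit(c)) return 'd';
        if (Character.isLowerCase(c)) return 'x';
        if (Character.isUpperCase(c)) return 'X';
        return c;
    }

    static String classes(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(classOf(s.charAt(i)));
        }
        return sb.toString();
    }

    static String shape(String s, boolean omitIfInBoundary) {
        int n = s.length();
        if (n <= 2 * BOUNDARY) {
            return classes(s); // short word: keep every position (assumed threshold)
        }
        String head = classes(s.substring(0, BOUNDARY));
        String tail = classes(s.substring(n - BOUNDARY));
        // Long word: the middle contributes only a sorted set of classes.
        TreeSet<Character> middle = new TreeSet<>();
        for (int i = BOUNDARY; i < n - BOUNDARY; i++) {
            char m = classOf(s.charAt(i));
            boolean inBoundary = head.indexOf(m) >= 0 || tail.indexOf(m) >= 0;
            if (!(omitIfInBoundary && inBoundary)) {
                middle.add(m);
            }
        }
        StringBuilder sb = new StringBuilder(head);
        for (char m : middle) {
            sb.append(m);
        }
        return sb.append(tail).toString();
    }
}
```

In this sketch "AB-1234x" yields "XX-ddx" with omitIfInBoundary set to false and "XX-dx" with it set to true, because the digit class already appears in the final two characters.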

wordShapeChris4

public static String wordShapeChris4(String s,
                                     boolean markKnownLC,
                                     boolean omitIfInBoundary)
Builds on the Dan2 ideas, but seeks to make fewer distinctions mid-sequence by sorting for long words, while maintaining extra distinctions for short words by always recording the class of the first and last two characters of the word. Compared to chris2, on which it is based, it uses more Unicode classes, and so collapses things like punctuation more, and might work better with real Unicode.

Parameters:
omitIfInBoundary - If true, character classes present in the first or last two letters of the word are not also registered as classes that appear in the middle of the word.
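
To show what "uses more Unicode classes" could look like in practice, the following sketch derives a coarse character class from the Unicode general category reported by java.lang.Character.getType. The class name UnicodeClassSketch, the grouping of categories, and the marker characters are hypothetical; the library may group categories differently.

```java
// Illustrative sketch only; NOT the code behind wordShapeChris4.
class UnicodeClassSketch {

    // Maps a character to a coarse class based on its Unicode general category.
    static char classOf(char c) {
        switch (Character.getType(c)) {
            case Character.UPPERCASE_LETTER:
            case Character.TITLECASE_LETTER:
                return 'X';
            case Character.LOWERCASE_LETTER:
                return 'x';
            case Character.DECIMAL_DIGIT_NUMBER:
                return 'd';
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.CONNECTOR_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
                return '.'; // all punctuation collapses to one class
            case Character.MATH_SYMBOL:
            case Character.CURRENCY_SYMBOL:
            case Character.MODIFIER_SYMBOL:
            case Character.OTHER_SYMBOL:
                return '$'; // all symbols collapse to one class
            default:
                return '?'; // separators, controls, other letters, etc.
        }
    }
}
```

Because the decision rests on general categories rather than ASCII ranges, a Greek lowercase letter and an Arabic-Indic digit fall into the same classes as their Latin counterparts.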

getKnownLowerCaseWords

public static Set getKnownLowerCaseWords()

setKnownLowerCaseWords

public static void setKnownLowerCaseWords(Set words)

addKnownLowerCaseWords

public static void addKnownLowerCaseWords(Collection words)

wordShapeDan2Bio

public static String wordShapeDan2Bio(String s,
                                      boolean useKnownLC)
A fine-grained word shape classifier that equivalence-classes lower and upper case and digits, and collapses sequences of the same type, but keeps all punctuation. It adds an extra recognizer for a Greek letter embedded in the String, which is useful for biomedical text.
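
The extra Greek-letter recognizer might be approximated as follows. Whether the library looks for spelled-out names, for characters from the Greek Unicode block, or for both is not stated here, so this sketch checks both. The class name GreekLetterSketch, the short name list, and the "-GREEK" suffix are assumptions for illustration.

```java
import java.util.regex.Pattern;

// Illustrative sketch only; NOT the code behind wordShapeDan2Bio.
class GreekLetterSketch {

    // Hypothetical, deliberately short list of spelled-out Greek letter names.
    private static final Pattern GREEK_NAME =
        Pattern.compile("alpha|beta|gamma|delta|kappa", Pattern.CASE_INSENSITIVE);

    static boolean containsGreek(String s) {
        if (GREEK_NAME.matcher(s).find()) {
            return true;
        }
        for (int i = 0; i < s.length(); i++) {
            if (Character.UnicodeBlock.of(s.charAt(i)) == Character.UnicodeBlock.GREEK) {
                return true;
            }
        }
        return false;
    }

    // Appends an assumed marker to a shape computed elsewhere.
    static String mark(String baseShape, String s) {
        return containsGreek(s) ? baseShape + "-GREEK" : baseShape;
    }
}
```

A substring match like this also fires on ordinary words such as "alphabet", so a real recognizer would likely need tighter conditions.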


wordShapeChris1

public static String wordShapeChris1(String s)

main

public static void main(String[] args)
Usage: java edu.stanford.nlp.process.WordShapeClassifier [-wordShape name] string+
where name is an argument to lookupShaper. Known names have patterns along the lines of: dan[12](bio)?(UseLC)?, jenny1(useLC)?, chris[1234](useLC)?.
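
The name patterns listed above can be checked with a small regular expression. The sketch below only mirrors those patterns; it does not call lookupShaper, and the set of names that method actually accepts may differ. The class name ShaperNameSketch and the case-insensitive matching (the patterns above spell the suffix both "UseLC" and "useLC") are assumptions.

```java
import java.util.regex.Pattern;

// Illustrative sketch only; mirrors the documented name patterns.
class ShaperNameSketch {

    // dan[12](bio)?(UseLC)?, jenny1(useLC)?, chris[1234](useLC)?
    private static final Pattern NAME = Pattern.compile(
        "(dan[12](bio)?|jenny1|chris[1234])(uselc)?", Pattern.CASE_INSENSITIVE);

    static boolean looksLikeShaperName(String name) {
        return NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        for (String n : new String[] {"dan2bioUseLC", "chris4", "jenny2"}) {
            System.out.println(n + " -> " + looksLikeShaperName(n));
        }
    }
}
```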



Stanford NLP Group