public class LexerUtils
extends java.lang.Object
Modifier and Type | Class and Description |
---|---|
static class | LexerUtils.DashesEnum |
static class | LexerUtils.EllipsesEnum |
static class | LexerUtils.QuotesEnum |
Modifier and Type | Method and Description |
---|---|
static java.lang.String | asciiQuotes(java.lang.String in): Convert all single and double quote-like characters to the ASCII quote characters ' and ". |
static java.lang.String | escapeChar(java.lang.String s, char c): Escape a character with a backslash, unless the character is already preceded by a backslash. |
static java.lang.String | handleDashes(java.lang.String tok, LexerUtils.DashesEnum dashesStyle) |
static java.lang.String | handleEllipsis(java.lang.String tok, LexerUtils.EllipsesEnum ellipsesStyle) |
static java.lang.String | handleQuotes(java.lang.String tok, boolean probablyLeft, LexerUtils.QuotesEnum quoteStyle) |
static java.lang.String | minimallyNormalizeCurrency(java.lang.String in): Minimal currency normalization: at least convert the cp1252 euro symbol to the Unicode one. |
static java.lang.String | normalizeAmp(java.lang.String in): Convert an XML-escaped ampersand back into an ampersand. |
static java.lang.String | normalizeCurrency(java.lang.String in) |
static java.lang.String | normalizeFractions(boolean normalizeFractions, boolean escapeForwardSlashAsterisk, java.lang.String in): Change precomposed fraction characters to spelled out letter forms. |
static java.lang.String | pennNormalizeParens(java.lang.String input, boolean normalizeParentheses) |
static java.lang.String | processCp1252misc(java.lang.String arg) |
static java.lang.String | removeSoftHyphens(java.lang.String in): Remove from words several Unicode characters that aren't really part of the word form: (i) U+00AD, the soft hyphen; (ii) U+2060, the (zero-width) no-break word joiner; (iii) U+200C, the zero-width non-joiner. |
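As a rough illustration of what the normalizeFractions summary describes, the sketch below rewrites three precomposed fraction characters as ASCII sequences. It is not the library source: the class name `FractionSketch` is hypothetical, and the choice to cover only U+00BC, U+00BD and U+00BE is an assumption made for brevity. The second flag follows the documented idea of escaping the forward slash with a backslash.

```java
// Hypothetical sketch; not the LexerUtils implementation.
final class FractionSketch {

    // Assumption: only the three Latin-1 fractions (U+00BC, U+00BD, U+00BE) are handled.
    static String normalizeFractions(boolean normalize, boolean escapeSlash, String in) {
        if (!normalize) {
            return in; // documented behaviour: if false, do nothing
        }
        // With the escape flag set, the slash is written with a leading backslash.
        String slash = escapeSlash ? "\\/" : "/";
        return in.replace("\u00BC", "1" + slash + "4")
                 .replace("\u00BD", "1" + slash + "2")
                 .replace("\u00BE", "3" + slash + "4");
    }

    public static void main(String[] args) {
        System.out.println(normalizeFractions(true, false, "\u00BD cup")); // 1/2 cup
        System.out.println(normalizeFractions(true, true, "\u00BD cup"));  // 1\/2 cup
    }
}
```

`String.replace` is used here because it treats both arguments literally, so the backslash needs no regex escaping.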
public static java.lang.String normalizeFractions(boolean normalizeFractions, boolean escapeForwardSlashAsterisk, java.lang.String in)
normalizeFractions - If false, do nothing; if true, normalize to an ASCII character sequence
escapeForwardSlashAsterisk - If true, also escape the forward slash with a backslash (deprecated historical PTB behavior)
in - The input string to normalize
public static java.lang.String normalizeCurrency(java.lang.String in)
public static java.lang.String minimallyNormalizeCurrency(java.lang.String in)
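The summary for minimallyNormalizeCurrency mentions turning the cp1252 euro symbol into the Unicode one. In Windows-1252 the euro sign is byte 0x80, which shows up as the control character U+0080 when the text was decoded as ISO-8859-1. A minimal sketch under that assumption follows; `EuroSketch` and `fixEuro` are hypothetical names, and the real method may handle more cases.

```java
// Hypothetical sketch; not the LexerUtils implementation.
final class EuroSketch {

    // Assumption: the "cp1252 euro symbol" appears in the string as U+0080.
    static String fixEuro(String in) {
        return in.replace('\u0080', '\u20AC'); // U+20AC is the Unicode euro sign
    }

    public static void main(String[] args) {
        String fixed = fixEuro("\u0080" + "50");
        System.out.println(fixed.charAt(0) == '\u20AC'); // true
    }
}
```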
public static java.lang.String removeSoftHyphens(java.lang.String in)
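The three characters listed for removeSoftHyphens (U+00AD, U+2060, U+200C) can be filtered with a single pass over the string. The following is only a sketch of that idea with hypothetical names (`SoftHyphenSketch`, `stripInvisible`); how the library treats edge cases, such as a token made up entirely of these characters, is not stated in this page and is not modelled here.

```java
// Hypothetical sketch; not the LexerUtils implementation.
final class SoftHyphenSketch {

    static String stripInvisible(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char ch = in.charAt(i);
            // Skip soft hyphen, word joiner, and zero-width non-joiner.
            if (ch == '\u00AD' || ch == '\u2060' || ch == '\u200C') {
                continue;
            }
            out.append(ch);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripInvisible("co\u00ADoperate")); // cooperate
    }
}
```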
public static java.lang.String processCp1252misc(java.lang.String arg)
public static java.lang.String normalizeAmp(java.lang.String in)
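Converting an XML-escaped ampersand back into an ampersand, as the normalizeAmp summary puts it, amounts to replacing the entity `&amp;` with `&`. A minimal sketch, assuming exact lowercase matching (the library's matching rules are not documented here); `AmpSketch` and `unescapeAmp` are hypothetical names.

```java
// Hypothetical sketch; not the LexerUtils implementation.
final class AmpSketch {

    // Assumption: only the exact lowercase entity "&amp;" is recognised.
    static String unescapeAmp(String in) {
        return in.replace("&amp;", "&");
    }

    public static void main(String[] args) {
        System.out.println(unescapeAmp("AT&amp;T")); // AT&T
    }
}
```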
public static java.lang.String escapeChar(java.lang.String s, char c)
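The escapeChar contract (put a backslash before the character unless one is already there) can be sketched as follows. This is an illustrative re-implementation with a hypothetical class name, `EscapeCharSketch`, not the library code. It checks only the immediately preceding character, so a sequence such as an escaped backslash followed by the target character is treated as already escaped.

```java
// Hypothetical sketch; not the LexerUtils implementation.
final class EscapeCharSketch {

    static String escapeChar(String s, char c) {
        StringBuilder out = new StringBuilder(s.length() + 4);
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            // Only the immediately preceding character is inspected.
            boolean alreadyEscaped = i > 0 && s.charAt(i - 1) == '\\';
            if (ch == c && !alreadyEscaped) {
                out.append('\\');
            }
            out.append(ch);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeChar("a/b", '/'));   // a\/b
        System.out.println(escapeChar("a\\/b", '/')); // a\/b (input was already escaped)
    }
}
```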
public static java.lang.String asciiQuotes(java.lang.String in)
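To make the asciiQuotes description concrete, the sketch below maps a handful of Unicode quotation marks to `'` and `"`. The set of characters covered (U+2018 to U+201F) is an assumption chosen for illustration; the page does not list which "quote-like" characters the library actually converts, and `AsciiQuotesSketch` is a hypothetical name.

```java
// Hypothetical sketch; not the LexerUtils implementation.
final class AsciiQuotesSketch {

    // Assumption: only the General Punctuation quotes U+2018..U+201F are mapped.
    static String toAsciiQuotes(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char ch = in.charAt(i);
            switch (ch) {
                case '\u2018': case '\u2019': case '\u201A': case '\u201B':
                    out.append('\''); // single-quote-like characters
                    break;
                case '\u201C': case '\u201D': case '\u201E': case '\u201F':
                    out.append('"');  // double-quote-like characters
                    break;
                default:
                    out.append(ch);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toAsciiQuotes("\u201CIt\u2019s fine\u201D")); // "It's fine"
    }
}
```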
public static java.lang.String handleQuotes(java.lang.String tok, boolean probablyLeft, LexerUtils.QuotesEnum quoteStyle)
public static java.lang.String handleEllipsis(java.lang.String tok, LexerUtils.EllipsesEnum ellipsesStyle)
public static java.lang.String handleDashes(java.lang.String tok, LexerUtils.DashesEnum dashesStyle)
public static java.lang.String pennNormalizeParens(java.lang.String input, boolean normalizeParentheses)