edu.stanford.nlp.process
Class StripTagsProcessor<L,F>
java.lang.Object
edu.stanford.nlp.process.AbstractListProcessor<Word,Word,L,F>
edu.stanford.nlp.process.StripTagsProcessor<L,F>
- Type Parameters:
L
- The type of the labels
F
- The type of the features
- All Implemented Interfaces:
- DocumentProcessor<Word,Word,L,F>, ListProcessor<Word,Word>
public class StripTagsProcessor<L,F>
- extends AbstractListProcessor<Word,Word,L,F>
A Processor whose process method deletes all SGML/XML/HTML tags
(tokens starting with < and ending with >). Optionally, newlines can be
inserted after the end of block-level tags to roughly simulate where
continuous text was broken up (this helps in finding sentence boundaries,
for example).
- Author:
- Christopher Manning, Sarah Spikes (sdspikes@cs.stanford.edu) (Templatization)
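The tag test described above (a token that starts with < and ends with >) can be illustrated with a small stand-alone sketch. It uses plain String tokens instead of Word objects, and the class TagStripSketch and its methods isTag and stripTags are hypothetical names chosen for this illustration; it is not the library's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch; not the Stanford NLP implementation.
class TagStripSketch {

    // A token counts as a tag if it starts with '<' and ends with '>'.
    static boolean isTag(String token) {
        return token.startsWith("<") && token.endsWith(">");
    }

    // Returns a new list holding every non-tag token, in the original order.
    static List<String> stripTags(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (!isTag(token)) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("<p>", "Hello", "<b>", "world", "</b>", "</p>");
        System.out.println(stripTags(tokens)); // prints [Hello, world]
    }
}
```

Because the test looks only at the first and last character of each token, it relies on the tokenizer having already kept each tag together as a single token.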
Field Summary |
static Set<String> |
blockTags
Block-level HTML tags that are rendered with surrounding line breaks. |
Constructor Summary |
StripTagsProcessor()
Constructs a new StripTagsProcessor that doesn't mark line breaks. |
StripTagsProcessor(boolean markLineBreaks)
Constructs a new StripTagsProcessor that marks line breaks as specified. |
Method Summary |
boolean |
getMarkLineBreaks()
Returns whether the output of the processor will contain newline words
("\n") at the end of block-level tags. |
static void |
main(String[] args)
For internal debugging purposes only. |
List<Word> |
process(List<Word> in)
Returns a new Document with the same meta-data as in,
and the same words except tags are stripped. |
void |
setMarkLineBreaks(boolean markLineBreaks)
Sets whether the output of the processor will contain newline words
("\n") at the end of block-level tags. |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
blockTags
public static final Set<String> blockTags
- Block-level HTML tags that are rendered with surrounding line breaks.
StripTagsProcessor
public StripTagsProcessor()
- Constructs a new StripTagsProcessor that doesn't mark line breaks.
StripTagsProcessor
public StripTagsProcessor(boolean markLineBreaks)
- Constructs a new StripTagsProcessor that marks line breaks as specified.
getMarkLineBreaks
public boolean getMarkLineBreaks()
- Returns whether the output of the processor will contain newline words
("\n") at the end of block-level tags.
setMarkLineBreaks
public void setMarkLineBreaks(boolean markLineBreaks)
- Sets whether the output of the processor will contain newline words
("\n") at the end of block-level tags.
process
public List<Word> process(List<Word> in)
- Returns a new Document with the same meta-data as in,
and the same words except tags are stripped.
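How tag stripping and the markLineBreaks option might interact can be sketched as follows. This is a stand-alone illustration over plain String tokens, not the library's code. Several details are assumptions made for the example: the contents of BLOCK_TAGS (the real blockTags set is not listed in this page), the tagName helper, the choice to emit a newline word for both opening and closing block-level tags, and the collapsing of consecutive newline words into one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-alone sketch; not the Stanford NLP implementation.
class LineBreakSketch {

    // Assumed sample of block-level tag names, for illustration only.
    static final Set<String> BLOCK_TAGS =
            new HashSet<>(Arrays.asList("p", "div", "br", "li", "table", "h1"));

    // Extracts the lower-cased tag name, e.g. "<p>", "</p>" and "<P class='x'>" all give "p".
    static String tagName(String tag) {
        int start = 1;
        while (start < tag.length() && !Character.isLetterOrDigit(tag.charAt(start))) {
            start++;
        }
        int end = start;
        while (end < tag.length() && Character.isLetterOrDigit(tag.charAt(end))) {
            end++;
        }
        return tag.substring(start, end).toLowerCase(Locale.ROOT);
    }

    // Drops tag tokens; if markLineBreaks is set, a block-level tag leaves a "\n" token behind.
    static List<String> process(List<String> in, boolean markLineBreaks) {
        List<String> out = new ArrayList<>();
        boolean justInsertedNewline = false; // avoids runs of consecutive "\n" tokens
        for (String token : in) {
            boolean isTag = token.startsWith("<") && token.endsWith(">");
            if (!isTag) {
                out.add(token);
                justInsertedNewline = false;
            } else if (markLineBreaks && !justInsertedNewline
                    && BLOCK_TAGS.contains(tagName(token))) {
                out.add("\n");
                justInsertedNewline = true;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "<p>", "Hello", "<b>", "world", "</b>", "</p>", "<p>", "Bye", "</p>");
        System.out.println(process(tokens, false)); // prints [Hello, world, Bye]
    }
}
```

With markLineBreaks set to true, the same input would yield a "\n" token before "Hello", one between "world" and "Bye", and one after "Bye" under the assumptions above; the inline tags around "world" leave no trace in either mode.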
main
public static void main(String[] args)
- For internal debugging purposes only.
Stanford NLP Group