edu.stanford.nlp.process
Class StripTagsProcessor<L,F>

java.lang.Object
  extended by edu.stanford.nlp.process.AbstractListProcessor<Word,Word,L,F>
      extended by edu.stanford.nlp.process.StripTagsProcessor<L,F>
Type Parameters:
L - The type of the labels
F - The type of the features
All Implemented Interfaces:
DocumentProcessor<Word,Word,L,F>, ListProcessor<Word,Word>

public class StripTagsProcessor<L,F>
extends AbstractListProcessor<Word,Word,L,F>

A Processor whose process method deletes all SGML/XML/HTML tags (tokens starting with < and ending with >). Optionally, newline words can be inserted where block-level tags occurred, to roughly mark where continuous text was broken up. This helps, for example, when finding sentence boundaries.
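The behaviour described above can be illustrated with a minimal, self-contained sketch. It works on plain strings rather than Word objects and does not use the Stanford NLP classes. The class name TagStripSketch, the helper tagName, and the contents of BLOCK_TAGS are hypothetical and chosen for illustration only. Emitting a newline for both opening and closing block-level tags, and suppressing consecutive newlines, are assumptions of this sketch, not a statement of how the library is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class TagStripSketch {

    // Hypothetical subset of block-level tag names; the real blockTags field may differ.
    static final Set<String> BLOCK_TAGS =
            new HashSet<>(Arrays.asList("p", "div", "br", "li", "h1", "h2", "table"));

    /** Drops every token that starts with '<' and ends with '>'. */
    static List<String> stripTags(List<String> tokens, boolean markLineBreaks) {
        List<String> out = new ArrayList<>();
        boolean lastWasNewline = false; // assumption: avoid emitting runs of newline tokens
        for (String tok : tokens) {
            boolean isTag = tok.startsWith("<") && tok.endsWith(">");
            if (!isTag) {
                out.add(tok);
                lastWasNewline = false;
            } else if (markLineBreaks && !lastWasNewline
                    && BLOCK_TAGS.contains(tagName(tok))) {
                out.add("\n"); // newline token stands in for the removed block-level tag
                lastWasNewline = true;
            }
        }
        return out;
    }

    /** Extracts the lower-cased tag name, e.g. "p" from "</p>" or "br" from "<br/>". */
    static String tagName(String tag) {
        String body = tag.substring(1, tag.length() - 1).trim();
        if (body.startsWith("/")) {
            body = body.substring(1).trim();
        }
        // Cut off attributes: keep only the text before the first whitespace character.
        for (int i = 0; i < body.length(); i++) {
            if (Character.isWhitespace(body.charAt(i))) {
                body = body.substring(0, i);
                break;
            }
        }
        if (body.endsWith("/")) {
            body = body.substring(0, body.length() - 1);
        }
        return body.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        List<String> in = Arrays.asList("<p>", "Hello", "</p>", "<b>", "world", "</b>");
        System.out.println(stripTags(in, false)); // prints [Hello, world]
    }
}
```

With markLineBreaks set to true, the same input would additionally yield a "\n" token wherever a tag from BLOCK_TAGS was removed, while inline tags such as <b> are dropped silently.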

Author:
Christopher Manning, Sarah Spikes (sdspikes@cs.stanford.edu) (Templatization)

Field Summary
static Set<String> blockTags
          Block-level HTML tags that are rendered with surrounding line breaks.
 
Constructor Summary
StripTagsProcessor()
          Constructs a new StripTagsProcessor that doesn't mark line breaks.
StripTagsProcessor(boolean markLineBreaks)
          Constructs a new StripTagsProcessor that marks line breaks as specified.
 
Method Summary
 boolean getMarkLineBreaks()
          Returns whether the output of the processor will contain newline words ("\n") at the end of block-level tags.
static void main(String[] args)
          For internal debugging purposes only.
 List<Word> process(List<? extends Word> in)
          Returns a new list containing the same words as in, except that tags are stripped.
 void setMarkLineBreaks(boolean markLineBreaks)
          Sets whether the output of the processor will contain newline words ("\n") at the end of block-level tags.
 
Methods inherited from class edu.stanford.nlp.process.AbstractListProcessor
processDocument, processLists
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

blockTags

public static final Set<String> blockTags
Block-level HTML tags that are rendered with surrounding line breaks.

Constructor Detail

StripTagsProcessor

public StripTagsProcessor()
Constructs a new StripTagsProcessor that doesn't mark line breaks.


StripTagsProcessor

public StripTagsProcessor(boolean markLineBreaks)
Constructs a new StripTagsProcessor that marks line breaks as specified.

Method Detail

getMarkLineBreaks

public boolean getMarkLineBreaks()
Returns whether the output of the processor will contain newline words ("\n") at the end of block-level tags.

Returns:
Whether the output of the processor will contain newline words ("\n") at the end of block-level tags.

setMarkLineBreaks

public void setMarkLineBreaks(boolean markLineBreaks)
Sets whether the output of the processor will contain newline words ("\n") at the end of block-level tags.


process

public List<Word> process(List<? extends Word> in)
Returns a new list containing the same words as in, except that tags are stripped.


main

public static void main(String[] args)
For internal debugging purposes only.



Stanford NLP Group