edu.stanford.nlp.web
Class HTMLParser

java.lang.Object
  extended by javax.swing.text.html.HTMLEditorKit.ParserCallback
      extended by edu.stanford.nlp.web.HTMLParser

public class HTMLParser
extends javax.swing.text.html.HTMLEditorKit.ParserCallback

Parses an HTML document and returns its plain text (and title). HTMLParser is used mainly through its parse methods (parse(URL url), parse(Reader r), and parse(String text)), which return a String with the contents of an HTML page, without the tags. After calling parse, call title() to get the HTML title (the contents of the TITLE tag). Subclasses may override handleText(), handleComment(), handleStartTag(), and the other handler methods so that parse returns something other than the text of the web page, for example only part of the text, or only the links.
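
The callback pattern described above can be sketched with the JDK alone. This is a minimal illustration, not the Stanford source: the class name PlainTextSketch, its fields, and its static parse helper are hypothetical, and only javax.swing.text.html.parser.ParserDelegator and HTMLEditorKit.ParserCallback are real JDK classes. It tracks a TITLE flag in the start/end tag handlers and routes text to either the title or the page text.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical stand-in for the pattern; NOT edu.stanford.nlp.web.HTMLParser itself.
class PlainTextSketch extends HTMLEditorKit.ParserCallback {
    private final StringBuilder text = new StringBuilder();
    private final StringBuilder title = new StringBuilder();
    private boolean inTitle; // true while the parser is inside <title>...</title>

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.TITLE) {
            inTitle = true;
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.TITLE) {
            inTitle = false;
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        // Title text and page text are collected separately.
        if (inTitle) {
            title.append(data);
        } else {
            text.append(data).append(' ');
        }
    }

    public String text() {
        return text.toString().trim();
    }

    public String title() {
        return title.toString().trim();
    }

    // Parses an HTML string and returns the populated callback.
    public static PlainTextSketch parse(String html) {
        PlainTextSketch callback = new PlainTextSketch();
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback;
    }

    public static void main(String[] args) {
        PlainTextSketch page = parse(
            "<html><head><title>Example Page</title></head>"
            + "<body><p>Hello <b>world</b></p></body></html>");
        System.out.println("title: " + page.title());
        System.out.println("text:  " + page.text());
    }
}
```

With the real class, the equivalent calls would be parse(...) followed by title(), as listed in the Method Summary below.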


Field Summary
protected  boolean isBody
           
protected  boolean isScript
           
protected  boolean isTitle
           
protected  java.lang.StringBuffer textBuffer
           
protected  java.lang.String title
           
 
Fields inherited from class javax.swing.text.html.HTMLEditorKit.ParserCallback
IMPLIED
 
Constructor Summary
HTMLParser()
           
 
Method Summary
 void handleEndTag(javax.swing.text.html.HTML.Tag tag, int pos)
          Sets a flag if the end tag is the "TITLE" element end tag.
 void handleStartTag(javax.swing.text.html.HTML.Tag tag, javax.swing.text.MutableAttributeSet attrSet, int pos)
          Sets a flag if the start tag is the "TITLE" element start tag.
 void handleText(char[] data, int pos)
           
static void main(java.lang.String[] args)
           
 java.lang.String parse(java.io.Reader r)
           
 java.lang.String parse(java.lang.String text)
          The parse method that actually does the work.
 java.lang.String parse(java.net.URL url)
           
 java.lang.String title()
           
 
Methods inherited from class javax.swing.text.html.HTMLEditorKit.ParserCallback
flush, handleComment, handleEndOfLineString, handleError, handleSimpleTag
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

textBuffer

protected java.lang.StringBuffer textBuffer

title

protected java.lang.String title

isTitle

protected boolean isTitle

isBody

protected boolean isBody

isScript

protected boolean isScript

Constructor Detail

HTMLParser

public HTMLParser()

Method Detail

handleText

public void handleText(char[] data,
                       int pos)
Overrides:
handleText in class javax.swing.text.html.HTMLEditorKit.ParserCallback

handleStartTag

public void handleStartTag(javax.swing.text.html.HTML.Tag tag,
                           javax.swing.text.MutableAttributeSet attrSet,
                           int pos)
Sets a flag if the start tag is the "TITLE" element start tag.

Overrides:
handleStartTag in class javax.swing.text.html.HTMLEditorKit.ParserCallback
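
Because handleStartTag receives every start tag together with its attributes, an override can collect something other than text, such as link targets. The sketch below shows the idea with the JDK alone; LinkCollectorSketch and its collect helper are hypothetical names, and the class extends ParserCallback directly rather than HTMLParser so that it runs without the Stanford library on the classpath.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical example class; a subclass of HTMLParser could override
// handleStartTag in the same way.
class LinkCollectorSketch extends HTMLEditorKit.ParserCallback {
    private final List<String> links = new ArrayList<>();

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        // Only anchor tags are of interest; record the href if one is present.
        if (tag == HTML.Tag.A) {
            Object href = attrs.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                links.add(href.toString());
            }
        }
    }

    // Parses an HTML string and returns the href values in document order.
    public static List<String> collect(String html) {
        LinkCollectorSketch callback = new LinkCollectorSketch();
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.links;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"http://example.com/a\">A</a> and "
            + "<a href=\"/b\">B</a></body></html>";
        for (String link : collect(html)) {
            System.out.println(link);
        }
    }
}
```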

handleEndTag

public void handleEndTag(javax.swing.text.html.HTML.Tag tag,
                         int pos)
Sets a flag if the end tag is the "TITLE" element end tag.

Overrides:
handleEndTag in class javax.swing.text.html.HTMLEditorKit.ParserCallback

parse

public java.lang.String parse(java.net.URL url)
                       throws java.io.IOException
Throws:
java.io.IOException

parse

public java.lang.String parse(java.io.Reader r)
                       throws java.io.IOException
Throws:
java.io.IOException

parse

public java.lang.String parse(java.lang.String text)
                       throws java.io.IOException
The parse method that actually does the work. It first removes singleton tags from the input and then runs the parser.

Parameters:
text - the HTML text to parse
Returns:
the contents of the page as a String, without the tags
Throws:
java.io.IOException
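
The description does not say how singleton tags are removed. As an illustration only, assuming "singleton tags" means XML-style self-closing tags such as <br/>, one way to normalize them before parsing is a simple textual rewrite. The class and method names here are hypothetical, and this is not taken from the HTMLParser source.

```java
// Hypothetical preprocessing sketch; the actual behavior of parse(String text)
// is an assumption here, not something the documentation confirms.
class SingletonTagSketch {

    // Rewrites self-closing syntax ("<br/>", "<hr />") to a plain start tag
    // ("<br>", "<hr>"). This is a purely textual substitution, so a "/>"
    // that appears inside ordinary text would be rewritten as well.
    public static String stripSelfClosing(String html) {
        return html.replaceAll("\\s*/>", ">");
    }

    public static void main(String[] args) {
        System.out.println(stripSelfClosing("<p>a<br/>b<hr />c</p>"));
    }
}
```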

title

public java.lang.String title()

main

public static void main(java.lang.String[] args)
                 throws java.io.IOException
Throws:
java.io.IOException