edu.stanford.nlp.web
Class HTMLParser
java.lang.Object
javax.swing.text.html.HTMLEditorKit.ParserCallback
edu.stanford.nlp.web.HTMLParser
public class HTMLParser
- extends javax.swing.text.html.HTMLEditorKit.ParserCallback
Parses an HTML document and returns the plain text (and title).
HTMLParser is mainly used through its parse methods
(parse(URL url), parse(Reader r), and parse(String text)),
each of which returns a String with the contents of an HTML page,
without the tags. After calling parse, you can get the HTML title
(the contents of the TITLE tag) by calling title().
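The text and title extraction described above can be approximated with the standard Swing HTML parser alone. The following is a minimal sketch, not the Stanford implementation: the class name PlainTextSketch and the inTitle flag are hypothetical, and the real HTMLParser may handle whitespace, scripts, and the title text differently.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical stand-in for HTMLParser (an assumption, not the library class):
// collects page text and remembers the contents of the TITLE element.
class PlainTextSketch extends HTMLEditorKit.ParserCallback {
  private final StringBuilder text = new StringBuilder();
  private String title = "";
  private boolean inTitle = false;

  @Override
  public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
    // Set a flag when the TITLE element opens.
    if (tag == HTML.Tag.TITLE) {
      inTitle = true;
    }
  }

  @Override
  public void handleEndTag(HTML.Tag tag, int pos) {
    // Clear the flag when the TITLE element closes.
    if (tag == HTML.Tag.TITLE) {
      inTitle = false;
    }
  }

  @Override
  public void handleText(char[] data, int pos) {
    if (inTitle) {
      // Text seen while the flag is set is treated as the title.
      title = new String(data);
    } else {
      // Separate text chunks with a single space (a choice made for this sketch).
      if (text.length() > 0) {
        text.append(' ');
      }
      text.append(data);
    }
  }

  // Parses HTML source text and returns the collected text without tags.
  String parse(String html) {
    text.setLength(0);
    title = "";
    inTitle = false;
    try {
      new ParserDelegator().parse(new StringReader(html), this, true);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return text.toString();
  }

  String title() {
    return title;
  }
}
```

With the documented class, the analogous sequence would be a call to one of the parse methods followed by title().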
Subclasses may override handleText(), handleComment(),
handleStartTag(), and the other handler methods so that parse
returns something other than the full text of the web page
(for example, only part of the text, or only the links).
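As an illustration of the "only the links" case, the sketch below overrides handleStartTag to record the href attribute of each anchor tag. It extends HTMLEditorKit.ParserCallback directly so that it runs without the Stanford library. A real subclass would presumably extend HTMLParser and override the same handler. The names LinkCollectorSketch and collect are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical callback (not part of the documented API) that keeps
// only link targets instead of the page text.
class LinkCollectorSketch extends HTMLEditorKit.ParserCallback {
  private final List<String> links = new ArrayList<>();

  @Override
  public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
    if (tag == HTML.Tag.A) {
      // Anchors without an href (for example, named anchors) are skipped.
      Object href = attrs.getAttribute(HTML.Attribute.HREF);
      if (href != null) {
        links.add(href.toString());
      }
    }
  }

  // Parses HTML source text and returns the href values in document order.
  List<String> collect(String html) {
    links.clear();
    try {
      new ParserDelegator().parse(new StringReader(html), this, true);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return new ArrayList<>(links);
  }
}
```

Because handleText is not overridden here, the page text is simply ignored. A subclass of HTMLParser would have to decide whether to keep or suppress the inherited text collection.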
- Author:
- Sepandar Kamvar (sdkamvar@stanford.edu)
Fields inherited from class javax.swing.text.html.HTMLEditorKit.ParserCallback
IMPLIED
Method Summary
void handleEndTag(javax.swing.text.html.HTML.Tag tag, int pos)
- Sets a flag if the end tag is the "TITLE" element end tag.
void handleStartTag(javax.swing.text.html.HTML.Tag tag, javax.swing.text.MutableAttributeSet attrSet, int pos)
- Sets a flag if the start tag is the "TITLE" element start tag.
void handleText(char[] data, int pos)
static void main(java.lang.String[] args)
java.lang.String parse(java.io.Reader r)
java.lang.String parse(java.lang.String text)
- The parse method that actually does the work.
java.lang.String parse(java.net.URL url)
java.lang.String title()
Methods inherited from class javax.swing.text.html.HTMLEditorKit.ParserCallback
flush, handleComment, handleEndOfLineString, handleError, handleSimpleTag
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
textBuffer
protected java.lang.StringBuffer textBuffer
title
protected java.lang.String title
isTitle
protected boolean isTitle
isBody
protected boolean isBody
isScript
protected boolean isScript
HTMLParser
public HTMLParser()
handleText
public void handleText(char[] data,
int pos)
- Overrides:
handleText
in class javax.swing.text.html.HTMLEditorKit.ParserCallback
handleStartTag
public void handleStartTag(javax.swing.text.html.HTML.Tag tag,
javax.swing.text.MutableAttributeSet attrSet,
int pos)
- Sets a flag if the start tag is the "TITLE" element start tag.
- Overrides:
handleStartTag
in class javax.swing.text.html.HTMLEditorKit.ParserCallback
handleEndTag
public void handleEndTag(javax.swing.text.html.HTML.Tag tag,
int pos)
- Sets a flag if the end tag is the "TITLE" element end tag.
- Overrides:
handleEndTag
in class javax.swing.text.html.HTMLEditorKit.ParserCallback
parse
public java.lang.String parse(java.net.URL url)
throws java.io.IOException
- Throws:
java.io.IOException
parse
public java.lang.String parse(java.io.Reader r)
throws java.io.IOException
- Throws:
java.io.IOException
parse
public java.lang.String parse(java.lang.String text)
throws java.io.IOException
- The parse method that actually does the work.
It now removes singleton tags from the input before parsing.
- Throws:
java.io.IOException
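The documentation does not say how singleton tags are removed. One plausible reading, shown here purely as an assumption, is that XML-style self-closing syntax such as <br/> is rewritten to the plain form <br> before the text reaches the parser. The helper name SingletonTagSketch.normalize is hypothetical.

```java
// Hypothetical pre-processing step (an assumption, not the documented behavior):
// rewrite XML-style self-closing tags so that the HTML parser sees plain tags.
class SingletonTagSketch {
  static String normalize(String html) {
    // Drops optional whitespace plus the slash in front of the closing '>'.
    // This is a plain text substitution, so it also rewrites "/>" sequences
    // that appear outside of tags.
    return html.replaceAll("\\s*/>", ">");
  }
}
```

A caller would apply such a step to the input string first and then hand the result to the parser.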
title
public java.lang.String title()
main
public static void main(java.lang.String[] args)
throws java.io.IOException
- Throws:
java.io.IOException
Stanford NLP Group