edu.stanford.nlp.util
Class XMLUtils

java.lang.Object
  extended by edu.stanford.nlp.util.XMLUtils

public class XMLUtils
extends java.lang.Object

Provides some utilities for dealing with XML files, both by properly parsing them and by using the methods of a desperate Perl hacker.

Author:
Teg Grenager

Nested Class Summary
static class XMLUtils.XMLTag
           
 
Field Summary
static java.util.Set<java.lang.String> breakingTags
          Block-level HTML tags that are rendered with surrounding line breaks.
 
Method Summary
static java.lang.String escapeAttributeXML(java.lang.String in)
          Returns a String in which some XML special characters have been escaped.
static java.lang.String escapeElementXML(java.lang.String in)
          Returns a String in which some of the XML special characters have been escaped: just the ones that need escaping in element content.
static java.lang.String escapeTextAroundXMLTags(java.lang.String s)
           
static java.lang.String escapeXML(java.lang.String in)
          Returns a String in which all the XML special characters have been escaped.
static int findSpace(java.lang.String haystack, int begin)
          Returns the position of either the first space or the first non-breaking space (nbsp).
static javax.xml.parsers.DocumentBuilder getValidatingXmlParser(java.io.File schemaFile)
          Returns a validating XML parser given an XSD (not DTD!).
static javax.xml.parsers.DocumentBuilder getXmlParser()
          Returns a non-validating XML parser.
static boolean isBreaking(java.lang.String tag)
           
static boolean isBreaking(XMLUtils.XMLTag tag)
           
static void main(java.lang.String[] args)
          Tests a few methods.
static XMLUtils.XMLTag parseTag(java.lang.String tagString)
           
static XMLUtils.XMLTag readAndParseTag(java.io.Reader r)
           
static org.w3c.dom.Document readDocumentFromFile(java.lang.String filename)
           
static org.w3c.dom.Document readDocumentFromString(java.lang.String s)
           
static java.lang.String readTag(java.io.Reader r)
          Reads all text of the XML tag and returns it as a String.
static java.lang.String readUntilTag(java.io.Reader r)
          Reads all text up to next XML tag and returns it as a String.
static java.lang.String stripTags(java.io.Reader r, java.util.List<java.lang.Integer> mapBack, boolean markLineBreaks)
           
static java.lang.String unescapeStringForXML(java.lang.String s)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

breakingTags

public static final java.util.Set<java.lang.String> breakingTags
Block-level HTML tags that are rendered with surrounding line breaks.

Method Detail

getXmlParser

public static javax.xml.parsers.DocumentBuilder getXmlParser()
Returns a non-validating XML parser. The parser ignores both DTDs and XSDs.

Returns:
An XML parser in the form of a DocumentBuilder
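
As an illustration only, a non-validating DocumentBuilder can be obtained from the standard JAXP factory roughly as follows. This is a sketch, not the body of getXmlParser: the class and method names are hypothetical, and the real method may set further parser features to skip DTDs.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

// Hypothetical helper class; not part of edu.stanford.nlp.util.
class NonValidatingParserSketch {

  /** Builds a DocumentBuilder with DTD validation off and no schema attached. */
  static DocumentBuilder newNonValidatingParser() throws ParserConfigurationException {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setValidating(false);     // setValidating controls DTD validation only
    factory.setNamespaceAware(true);  // assumption: namespace awareness is wanted
    return factory.newDocumentBuilder();
  }
}
```

A caller would then pass an InputStream, File, or InputSource to the builder's parse method.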

getValidatingXmlParser

public static javax.xml.parsers.DocumentBuilder getValidatingXmlParser(java.io.File schemaFile)
Returns a validating XML parser given an XSD (not DTD!).

Parameters:
schemaFile - The XSD schema file to validate against
Returns:
An XML parser in the form of a DocumentBuilder
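
One way to get XSD validation with plain JAXP is to compile the schema with SchemaFactory and attach it to the DocumentBuilderFactory. The sketch below assumes that approach; the class name, method name, and the strict error handler are illustrative choices, not the library's implementation.

```java
import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

// Hypothetical helper class; not part of edu.stanford.nlp.util.
class ValidatingParserSketch {

  /** Builds a DocumentBuilder that validates parsed documents against the given XSD. */
  static DocumentBuilder newValidatingParser(File schemaFile) throws Exception {
    SchemaFactory schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
    Schema schema = schemaFactory.newSchema(schemaFile);

    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);  // schema validation needs namespace awareness
    factory.setValidating(false);     // this flag is for DTDs; the XSD is attached below
    factory.setSchema(schema);

    DocumentBuilder builder = factory.newDocumentBuilder();
    // Without a handler, validation errors may only be logged; rethrow them instead.
    builder.setErrorHandler(new ErrorHandler() {
      @Override public void warning(SAXParseException e) { /* ignored in this sketch */ }
      @Override public void error(SAXParseException e) throws SAXParseException { throw e; }
      @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
    });
    return builder;
  }
}
```

With this handler in place, parsing a document that violates the schema fails with a SAXParseException rather than silently producing a DOM tree.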

stripTags

public static java.lang.String stripTags(java.io.Reader r,
                                         java.util.List<java.lang.Integer> mapBack,
                                         boolean markLineBreaks)
Parameters:
r - the reader to read the XML/HTML from
mapBack - a List of Integers mapping positions in the result buffer to positions in the original Reader; it is cleared on receipt
Returns:
the String containing the resulting text
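
The role of mapBack may be easier to see in a reduced example. The sketch below is a simplified stand-in, not the library code: it takes a String instead of a Reader, ignores markLineBreaks, and treats everything between '<' and '>' as a tag. The class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for stripTags; not part of edu.stanford.nlp.util.
class StripTagsSketch {

  /**
   * Drops everything between '<' and '>' and records, for each kept
   * character, its offset in the original text.
   */
  static String stripTags(String text, List<Integer> mapBack) {
    mapBack.clear();  // mirrors the documented "cleared on receipt" behavior
    StringBuilder result = new StringBuilder();
    boolean insideTag = false;
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      if (c == '<') {
        insideTag = true;
      } else if (c == '>' && insideTag) {
        insideTag = false;
      } else if (!insideTag) {
        result.append(c);
        mapBack.add(i);  // index in result -> index in original text
      }
    }
    return result.toString();
  }
}
```

After the call, mapBack.get(k) gives the original offset of the k-th character of the returned String, which lets offsets computed on the stripped text be projected back onto the source markup.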

isBreaking

public static boolean isBreaking(java.lang.String tag)

isBreaking

public static boolean isBreaking(XMLUtils.XMLTag tag)

readUntilTag

public static java.lang.String readUntilTag(java.io.Reader r)
                                     throws java.io.IOException
Reads all text up to next XML tag and returns it as a String.

Returns:
the String of the text read, which may be empty.
Throws:
java.io.IOException

readAndParseTag

public static XMLUtils.XMLTag readAndParseTag(java.io.Reader r)
                                       throws java.io.IOException
Returns:
the new XMLTag object, or null if one couldn't be created
Throws:
java.io.IOException

unescapeStringForXML

public static java.lang.String unescapeStringForXML(java.lang.String s)

escapeXML

public static java.lang.String escapeXML(java.lang.String in)
Returns a String in which all the XML special characters have been escaped. The resulting String is valid to print in an XML file as an attribute or element value in all circumstances. (Note that it may escape characters that didn't need to be escaped.)

Parameters:
in - The String to escape
Returns:
The escaped String
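
For illustration, escaping every special character can be sketched as below. This assumes "all the XML special characters" means the five characters with predefined XML entities; the class and method names are hypothetical and the actual implementation may differ.

```java
// Hypothetical sketch; not part of edu.stanford.nlp.util.
class EscapeAllSketch {

  /** Replaces &, <, >, " and ' with their predefined XML entities. */
  static String escapeAll(String in) {
    StringBuilder sb = new StringBuilder(in.length() + 16);
    for (int i = 0; i < in.length(); i++) {
      char c = in.charAt(i);
      switch (c) {
        case '&':  sb.append("&amp;");  break;
        case '<':  sb.append("&lt;");   break;
        case '>':  sb.append("&gt;");   break;
        case '"':  sb.append("&quot;"); break;
        case '\'': sb.append("&apos;"); break;
        default:   sb.append(c);
      }
    }
    return sb.toString();
  }
}
```

Working character by character avoids the double-escaping that chained replacements can cause when the ampersand is not handled first.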

escapeElementXML

public static java.lang.String escapeElementXML(java.lang.String in)
Returns a String in which some of the XML special characters have been escaped: just the ones that need escaping in element content.

Parameters:
in - The String to escape
Returns:
The escaped String

escapeAttributeXML

public static java.lang.String escapeAttributeXML(java.lang.String in)
Returns a String in which some XML special characters have been escaped. This just escapes attribute value ones, assuming that you're going to quote with double quotes. That is, only " and & are escaped.

Parameters:
in - The String to escape
Returns:
The escaped String
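
The behavior described above (only " and & are escaped) can be sketched as follows. The class and helper names are hypothetical, and the code illustrates the documented contract rather than reproducing the library source.

```java
// Hypothetical sketch; not part of edu.stanford.nlp.util.
class EscapeAttributeSketch {

  /** Escapes only & and ", which is enough for a double-quoted attribute value. */
  static String escapeAttribute(String in) {
    // Escape '&' first so the '&' introduced by "&quot;" is not escaped again.
    return in.replace("&", "&amp;").replace("\"", "&quot;");
  }

  /** Formats name="value" with the value escaped for double-quoted use. */
  static String attribute(String name, String value) {
    return name + "=\"" + escapeAttribute(value) + "\"";
  }
}
```

Note that characters such as < and ' pass through unchanged, so the result is only safe inside double quotes.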

escapeTextAroundXMLTags

public static java.lang.String escapeTextAroundXMLTags(java.lang.String s)

findSpace

public static int findSpace(java.lang.String haystack,
                            int begin)
Returns the position of either the first space or the first non-breaking space (nbsp).
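
A possible reading of this contract is sketched below. It assumes the search starts at index begin, that the earlier of the two characters wins, and that -1 is returned when neither occurs; none of these details are stated above, and the class name is hypothetical.

```java
// Hypothetical sketch; not part of edu.stanford.nlp.util.
class FindSpaceSketch {

  /** Returns the index of the first ' ' or U+00A0 at or after begin, or -1 if none. */
  static int findSpace(String haystack, int begin) {
    int space = haystack.indexOf(' ', begin);
    int nbsp = haystack.indexOf('\u00A0', begin);
    if (space < 0) {
      return nbsp;   // may itself be -1 when neither character occurs
    }
    if (nbsp < 0) {
      return space;
    }
    return Math.min(space, nbsp);
  }
}
```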


readTag

public static java.lang.String readTag(java.io.Reader r)
                                throws java.io.IOException
Reads all text of the XML tag and returns it as a String. Assumes that a '<' character has already been read.

Parameters:
r - The reader to read from
Returns:
The String representing the tag, or null if one couldn't be read (i.e., EOF). The returned item is a complete tag including angle brackets, such as <TXT>.
Throws:
java.io.IOException
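
Because readTag assumes the '<' has already been read, it pairs naturally with readUntilTag. The sketch below re-implements both in simplified form to show that alternation; it assumes readUntilTag consumes the '<' it stops at, and all names in it are hypothetical rather than the library's code.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not part of edu.stanford.nlp.util.
class TagReaderSketch {

  /** Reads up to the next '<' (which is consumed) or EOF and returns the text before it. */
  static String readUntilTag(Reader r) throws IOException {
    StringBuilder sb = new StringBuilder();
    int c;
    while ((c = r.read()) >= 0 && c != '<') {
      sb.append((char) c);
    }
    return sb.toString();
  }

  /** Assumes '<' was already consumed; returns the complete tag, or null at EOF. */
  static String readTag(Reader r) throws IOException {
    StringBuilder sb = new StringBuilder("<");
    int c;
    while ((c = r.read()) >= 0) {
      sb.append((char) c);
      if (c == '>') {
        return sb.toString();
      }
    }
    return null;  // EOF reached before a closing '>'
  }

  /** Splits markup into alternating text and tag pieces. */
  static List<String> tokenize(String xml) throws IOException {
    List<String> pieces = new ArrayList<>();
    Reader r = new StringReader(xml);
    while (true) {
      String text = readUntilTag(r);
      if (!text.isEmpty()) {
        pieces.add(text);
      }
      String tag = readTag(r);
      if (tag == null) {
        break;
      }
      pieces.add(tag);
    }
    return pieces;
  }
}
```

Under these assumptions, tokenizing `<TXT>hello</TXT>` yields the three pieces `<TXT>`, `hello`, and `</TXT>`.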

parseTag

public static XMLUtils.XMLTag parseTag(java.lang.String tagString)

readDocumentFromFile

public static org.w3c.dom.Document readDocumentFromFile(java.lang.String filename)
                                                 throws java.lang.Exception
Throws:
java.lang.Exception

readDocumentFromString

public static org.w3c.dom.Document readDocumentFromString(java.lang.String s)
                                                   throws java.lang.Exception
Throws:
java.lang.Exception
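
Parsing a DOM Document from an in-memory String with standard JAXP usually goes through an InputSource wrapped around a StringReader. The following is a minimal sketch of that pattern under a hypothetical class name; it is not taken from the library source.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical sketch; not part of edu.stanford.nlp.util.
class ReadDocumentSketch {

  /** Parses the given XML text into a DOM Document using a default JAXP parser. */
  static Document readDocumentFromString(String s) throws Exception {
    DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    // InputSource over a Reader avoids any guess about byte encodings.
    return builder.parse(new InputSource(new StringReader(s)));
  }
}
```

The returned Document can then be traversed with the usual org.w3c.dom calls such as getDocumentElement and getElementsByTagName.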

main

public static void main(java.lang.String[] args)
                 throws java.lang.Exception
Tests a few methods. If the first arg is -readDoc then this method tests readDocumentFromFile. Otherwise, it tests readTag/readUntilTag and slurpFile.

Throws:
java.lang.Exception


Stanford NLP Group