If you want to do anything with character data other than
simply copying it unconditionally to an output file, you are probably
interested in knowing which element it belongs to. Unfortunately the SAX
interface doesn't give you this information directly. If you need such
contextual information, your application will have to maintain a data structure
that retains some memory of previous events. The most common is a stack. In the
next section we will show how you can use some simple data structures both to
assemble character data supplied piecemeal by the parser, and to determine what
element it is part of.
There is a second method for reporting character data,
namely
- ignorableWhitespace(char[] chars, int start, int len)
This interface can be used to report what the SAX
specification rather loosely refers to as "ignorable white space". If
the DTD defines an element with "element content" (that is, the
element can have children but cannot contain PCDATA), then XML permits the
child elements to be separated by spaces, tabs, and newlines, even though
"real" character data is not allowed. This white space is probably
insignificant, so a SAX application will almost invariably ignore it, which you
can do simply by supplying an ignorableWhitespace()
method that does nothing. The only time you might want to do anything else is
if your application is copying the data unchanged to an output file.
The XML specification allows a parser to ignore information
in the external DTD, however. A non-validating parser will not necessarily
distinguish between an element with element content and one with mixed content.
In this case the ignorable white space is likely to be reported via the
ordinary characters() interface. Unfortunately
there is no way within a SAX application of telling whether the parser is a
validating one or not, so a portable application must be prepared for either.
This is another limitation that is remedied in SAX 2.0.
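A portable handler can treat the two callbacks in much the same way. The following is only a sketch, assuming that white-space-only runs are never significant to your application (which is not true of every document type). The class name WhitespaceTolerantHandler and the helper isAllWhitespace() are our own illustrative names, not part of SAX.

```java
import org.xml.sax.HandlerBase;

// Hypothetical handler: collects non-blank character data and drops
// white-space-only runs, whichever callback the parser uses to report them.
class WhitespaceTolerantHandler extends HandlerBase {
    private final StringBuffer text = new StringBuffer();

    // A validating parser reports white space in element content here.
    public void ignorableWhitespace(char[] chars, int start, int len) {
        // deliberately empty
    }

    // A non-validating parser may deliver the same white space here.
    public void characters(char[] chars, int start, int len) {
        if (!isAllWhitespace(chars, start, len)) {
            text.append(chars, start, len);
        }
    }

    // True if every character in the range is a space, tab, or line end.
    static boolean isAllWhitespace(char[] chars, int start, int len) {
        for (int i = start; i < start + len; i++) {
            char c = chars[i];
            if (c != ' ' && c != '\t' && c != '\n' && c != '\r') {
                return false;
            }
        }
        return true;
    }

    String getText() {
        return text.toString();
    }
}
```

Because the parser is free to deliver character data in several pieces, a white space run that happens to arrive as a separate piece in the middle of real text would also be dropped by this sketch. A more careful version would buffer the pieces and make the decision when the end tag arrives.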
Processing Instructions
There is one more kind of event that parsers report, namely
processing instructions. You probably won't meet these very often: they are the
instructions that can appear anywhere in an XML document between the symbols
"<?" and "?>". A processing instruction has a
name (called a target),
and arbitrary character data (instructions for the target application
concerned).
The parser reports processing instructions to the DocumentHandler by calling the method:
- processingInstruction(String name, String data)
By convention, you should ignore any processing instruction
(or copy it unchanged) unless you recognize its name.
Note that the XML declaration at the start
of a document may look like a processing instruction, but it is not a true
processing instruction, and is not reported to the application via this
interface – indeed, it is not reported at all.
Processing instructions are often written to look like
element start tags, with a sequence of keyword="value" attributes.
This syntax, however, is purely an application convention, and is not defined
by the XML standard. So SAX doesn't recognize it; the contents of the
processing instruction data are passed over in an amorphous lump.
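If your application does adopt the keyword="value" convention, it has to pick the data apart itself. The following sketch shows one possible way, using the standard java.util.regex classes. The class name PseudoAttributes and its parse() method are hypothetical, and the pattern accepts only simple names and quoted values, which is an assumption about how the instructions are written.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits processing-instruction data written in the
// keyword="value" convention into a map. Data that does not follow the
// convention simply yields fewer (or no) entries.
class PseudoAttributes {
    // name, optional spaces, '=', optional spaces, then a quoted value
    private static final Pattern PAIR =
        Pattern.compile("([A-Za-z_][\\w.-]*)\\s*=\\s*(\"([^\"]*)\"|'([^']*)')");

    static Map<String, String> parse(String data) {
        Map<String, String> result = new LinkedHashMap<String, String>();
        if (data == null) {
            return result;
        }
        Matcher m = PAIR.matcher(data);
        while (m.find()) {
            // group 3 holds a double-quoted value, group 4 a single-quoted one
            String value = (m.group(3) != null) ? m.group(3) : m.group(4);
            result.put(m.group(1), value);
        }
        return result;
    }
}
```

A processingInstruction() method could call PseudoAttributes.parse(data) once it has recognized the name, and ignore the instruction otherwise.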
Error Handling
We've glossed over error handling so far, but as always, it
needs careful thought in a real production application.
There are three main kinds of errors that can occur:
- Failure to open the XML input file, or
another file that it refers to, for example the DTD or another external entity.
In this case the parser will throw an IOException (input/output exception), and it is up to your application to
handle it.
- XML errors detected by the parser,
including well-formedness errors and validity errors. These are handled by
calling an error handler which your application can supply, as described below.
- Errors detected by the application:
for example, an invalid date or number in an attribute. You handle these by
throwing an exception in the DocumentHandler
method that detects the error.
Handling XML errors
The SAX specification defines three levels of error
severity, based on the terminology used in the XML standard itself. These are:
- Fatal errors: These usually mean the XML is not
well-formed. The parser will call the registered error handler if there is
one; if not, it will throw a SAXParseException.
In most cases a parser will stop after the first fatal error it finds.
- Errors: These usually mean the XML is
well-formed but not valid. The parser will call the registered error handler
if there is one; if not, it will ignore the error.
- Warnings: These mean that the XML is correct,
but there is some condition that the parser considers it useful to report.
For example this might be a violation of one of the
"interoperability" rules: input that is correct XML but not correct
SGML. The parser will call the registered error handler if there is one; if
not, it will ignore the error.
The application can register an error handler using the
parser's setErrorHandler() method. An error
handler contains three methods, fatalError(),
error(), and warning(),
reflecting the three different error severities. If you don't want to define
all three, you can make an error handler that inherits from HandlerBase: this contains versions of all
three methods that take the same action as if no error handler were registered.
The parameter to the error handling method, in all three
cases, is a SAXParseException object. You
probably think of Java Exceptions as things that are thrown and caught when
errors occur; but in fact an Exception is a regular Java object and can be
passed as a parameter to methods just like any other: it might never be thrown
at all. The SAXParseException contains
information about the error, including where in the source XML file it
occurred. The most common thing for an error handler method to do is to extract
this information to construct an error message, which can be written to a
suitable destination: for example, a web server log file.
The other useful thing the error handling method can do is
to throw an exception: usually, but not necessarily, the exception that the
parser supplied as a parameter. If you do this, the parse will typically be
aborted, and the top-level application will see the same exception thrown by
the parse() method. It then has another
opportunity to output diagnostics. Whether you generate a fatal error message
from within the error handler, or do it by letting the top-level application
catch the exception, is entirely up to you.
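To make this concrete, here is a minimal sketch of an error handler that logs warnings and errors and aborts the parse on a fatal error. The class name LoggingErrorHandler, the describe() helper, and the message layout are our own choices for illustration; only the three callback methods are defined by SAX.

```java
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical error handler: logs warnings and errors, aborts on fatal errors.
class LoggingErrorHandler implements ErrorHandler {

    // Builds "systemId:line:column: message" from the exception's coordinates.
    static String describe(SAXParseException e) {
        return e.getSystemId() + ":" + e.getLineNumber() + ":" +
               e.getColumnNumber() + ": " + e.getMessage();
    }

    public void warning(SAXParseException e) {
        System.err.println("Warning " + describe(e));
    }

    public void error(SAXParseException e) {
        System.err.println("Error " + describe(e));
    }

    // Rethrowing the supplied exception typically ends the parse; the caller
    // of parse() then sees the same exception.
    public void fatalError(SAXParseException e) throws SAXException {
        System.err.println("Fatal " + describe(e));
        throw e;
    }
}
```

The handler would be registered before parsing with a call such as p.setErrorHandler(new LoggingErrorHandler()). Here the messages go to System.err; a server application would more likely write them to its log file.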
Application-Detected Errors
When your application detects an error within a DocumentHandler method (for example, a badly
formatted date), the method should throw a SAXException
containing an appropriate message to explain the problem. After this, the
parser deals with the situation exactly as if it had detected the error itself.
Typically, it doesn't attempt to catch the exception, but exits immediately
from the parse() method with the same
exception, which the top-level application can then catch.
Identifying Where the Error Occurred
When the parser detects an XML syntax error, it will supply
details of the error in a SAXParseException
object. This object will include details of the URL, line, and column where the
error occurred (a line number on its own is not much use, because the error may
be in some external entity not in the main document). When you catch the SAXParseException in your application, you
can extract this information and display it so the user can locate the error.
If the problem with the XML file is detected at application
level (for example, an invalid date), it is equally important to tell the user
where the problem was found, but this time you can't rely on the SAXParseException to locate it. Instead, SAX
defines a Locator interface. The SAX
specification doesn't insist that parsers supply a Locator,
but most parsers do.
One of the methods you must implement in a document handler is the setDocumentLocator() method. If the parser maintains
location information it will call this method to tell the document handler where to
find the Locator object. At any subsequent
time while your document handler is processing an event it can ask the Locator object for details of the current
coordinates in the source document. There are three coordinates:
- The URL of the document or external
entity currently being processed
- The line number within that URL
- The column number within that line
This is of course exactly the same information
that you can get from a SAXParseException
object, and in fact one of the things you can do very easily when your
application detects an error is to throw a SAXParseException that takes the coordinates directly from the Locator object: just write:
if ( [data is not valid] )
{
throw new SAXParseException("Invalid data", locator);
}
Why wasn't the location information simply included in the
events passed to the document handler,
such as startElement()? The reason is
efficiency: most applications only want location information if something goes
wrong, so there should be minimal overhead incurred when it is not needed.
Supplying location information with each call from the parser to the document
handler would be unnecessarily expensive.
Another Example: Using Character Data and Attributes
After this excursion into the world of error handling, let's
develop a slightly more complex example SAX application.
The task this time is for the application to print the
average price of fiction books in the catalog. We'll use the same data file (books.xml) as in our previous example.
We are interested only in those <book>
elements that have the attribute category="fiction",
and for these we are interested only in the contents of the <price> child element. We add up the
prices, count the books, and at the end divide the total price by the number of
books.
Here's our first version of the application:
import org.xml.sax.*;

public class AveragePrice extends HandlerBase
{
    private int count = 0;                              // number of fiction books seen
    private boolean isFiction = false;                  // is the current book fiction?
    private double totalPrice = 0.0;                    // sum of the fiction prices
    private StringBuffer content = new StringBuffer();  // text of the current element

    public void determineAveragePrice() throws Exception
    {
        Parser p = new com.jclark.xml.sax.Driver();
        p.setDocumentHandler(this);
        p.parse("file:///c:/data/books.xml");
    }

    public void startElement(String name, AttributeList atts) throws SAXException
    {
        if (name.equals("book"))
        {
            String category = atts.getValue("category");
            isFiction = (category!=null && category.equals("fiction"));
            if (isFiction) count++;
        }
        content.setLength(0);       // start collecting afresh
    }

    public void characters(char[] chars, int start, int len) throws SAXException
    {
        // the text may arrive in several pieces, so just accumulate it
        content.append(chars, start, len);
    }

    public void endElement(String name) throws SAXException
    {
        if (name.equals("price") && isFiction)
        {
            try
            {
                double price = new Double(content.toString()).doubleValue();
                totalPrice += price;
            }
            catch (java.lang.NumberFormatException err)
            {
                throw new SAXException("Price is not numeric");
            }
        }
        content.setLength(0);
    }

    public void endDocument() throws SAXException
    {
        System.out.println("The average price of fiction books is " +
            totalPrice / count);
    }

    public static void main (String args[]) throws java.lang.Exception
    {
        try
        {
            (new AveragePrice()).determineAveragePrice();
        }
        catch (SAXException err)
        {
            System.err.println("Parsing failed: " + err.getMessage());
        }
    }
}
There are three main points to note in this code:
- The application needs to maintain one
piece of context, namely whether the current book is fiction or not. It uses an
instance variable to remember this, setting isFiction
to true when a start tag for a fiction book is encountered, and to false when a
start tag for a non-fiction book is read.
- See how the character content is
accumulated in a Java StringBuffer and is
not actually processed until the endElement() event is notified. This kills two birds with one stone: it solves
the problem that the content of a single element might be broken up and
notified piecemeal; at the same time, it means that when we handle the data, we
know which element we are dealing with. The StringBuffer is emptied whenever a start or end tag is read, which means that
when the application gets to the end tag of a PCDATA element (one that contains
character data only) the buffer will contain the character data of that
element.
- The application needs to do something
sensible when the price of a book is not a valid number. (Until XML Schemas
become standardized, we can't rely on the parser to do this piece of validation
for us: DTDs provide no way of restricting the data type of character data
within an element.) This condition is detected by the fact that the Java
constructor Double(String s), which converts a
String to a number, reports an exception. The relevant code catches this
exception, and reports a SAXException
describing the problem. This will cause the parsing to be terminated with an
appropriate error message.
When the code is run on our example XML file it produces the
following output:
>java AveragePrice
The average price of fiction books is 10.99
But the program isn't yet perfect.
Firstly, it can easily fail if the structure of the input
document is not as expected. For example, it will give wrong answers if the <price> element occurs other than in a
<book>, or if there is a <book> with no <price>,
or if a <price> element has its own
child elements. Such things might happen because there is no DTD, or because a
non-validating parser is used that doesn't check the DTD, or because a document
is submitted that uses a different DTD from that expected, or because the DTD
has been enhanced since the program was written.
Secondly, the diagnostics when errors are detected are
rather unfriendly. The user will be told that a price is not numeric, but there
may be hundreds of books in the list: it would be more helpful to say which
one. Even more helpful would be to report all the errors in a single run, so
that the user doesn't have to run the program once to find and correct each
separate error. (Actually, most XML parsers will only report one syntax error
in a single run, so there's a limit to what we can achieve here.)
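One way of reporting every bad price in a single run is to record each problem and carry on, rather than throwing an exception at the first one. The sketch below illustrates the idea on its own, without a parser. The class PriceTotals and its methods are hypothetical names, and note that it counts valid prices rather than books, which is a simplifying assumption that differs from the program above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical accumulator: records bad prices instead of aborting the parse.
class PriceTotals {
    private int count = 0;
    private double total = 0.0;
    private final List<String> problems = new ArrayList<String>();

    // text is the buffered element content; where is a caller-supplied
    // description of the location (for example, built from a Locator).
    void add(String text, String where) {
        try {
            total += Double.parseDouble(text.trim());
            count++;
        } catch (NumberFormatException err) {
            problems.add(where + ": price \"" + text + "\" is not numeric");
        }
    }

    // Average of the valid prices, or NaN when there were none.
    double average() {
        return (count == 0) ? Double.NaN : total / count;
    }

    List<String> getProblems() {
        return problems;
    }
}
```

An endElement() method could call add() for each price, and endDocument() could then print either the average or the accumulated list of problems.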
In the next section we'll look at how to maintain more
information about element context, which is necessary if we're to do more
thorough validation. Before that, we'll make one improvement in the area of
error handling. We'll use the Locator
object to determine where in the source document the error occurred, and report
it accordingly.
In order to show what happens clearly we've switched from
James Clark's xp parser to IBM Alphaworks' xml4j, which provides clearer
messages. Here is the revised program.
This version of the application can also be found on our web
site at https://www.wrox.com
import org.xml.sax.*;

public class AveragePrice extends HandlerBase
{
    private int count = 0;
    private boolean isFiction = false;
    private double totalPrice = 0.0;
    private StringBuffer content = new StringBuffer();
    private Locator locator;        // remains null if the parser supplies none

    public void determineAveragePrice() throws Exception
    {
        Parser p = new com.ibm.xml.parsers.SAXParser();
        p.setDocumentHandler(this);
        p.parse("file:///c:/data/books.xml");
    }

    // called by the parser, if it maintains location information
    public void setDocumentLocator(Locator loc)
    {
        locator = loc;
    }

    public void startElement(String name, AttributeList atts) throws SAXException
    {
        if (name.equals("book"))
        {
            String category = atts.getValue("category");
            isFiction = (category!=null && category.equals("fiction"));
            if (isFiction) count++;
        }
        content.setLength(0);
    }

    public void characters(char[] chars, int start, int len) throws SAXException
    {
        content.append(chars, start, len);
    }

    public void endElement(String name) throws SAXException
    {
        if (name.equals("price") && isFiction)
        {
            try
            {
                double price = new Double(content.toString()).doubleValue();
                totalPrice += price;
            }
            catch (java.lang.NumberFormatException err)
            {
                // report where the error was found, if we know
                if (locator!=null)
                {
                    System.err.println("Error in " + locator.getSystemId() +
                        " at line " + locator.getLineNumber() +
                        " column " + locator.getColumnNumber());
                }
                // pass on the root cause inside the SAXException
                throw new SAXException("Price is not numeric", err);
            }
        }
        content.setLength(0);
    }

    public void endDocument() throws SAXException
    {
        System.out.println("The average price of fiction books is " +
            totalPrice / count);
    }

    public static void main (String args[]) throws java.lang.Exception
    {
        try
        {
            (new AveragePrice()).determineAveragePrice();
        }
        catch (SAXException err)
        {
            System.err.println("Parsing failed: " + err.getMessage());
        }
    }
}
This version of the code improves the diagnostics with very
little extra effort. The revised application does three things:
- It keeps a note of the Locator object supplied by the parser.
- When an error occurs, it uses the Locator object to print information about the location of the error before
generating the SAXException. Note that the application
has to allow for the case where there is no Locator,
because SAX doesn't require the parser to supply one.
- It also includes details of the
original "root cause" exception (the NumberFormatException) encapsulated within the
SAXException, again allowing more precise diagnostics to be written.
This is the output we got from the xml4j parser, after
modifying the price of Moby Dick from
"8.99" to "A.99":
>java AveragePrice
Error in file:///c:/data/books.xml at line 16 column 22
Parsing failed: Price is not numeric
In this example the application produces a message
containing location information before throwing the exception, and then
produces the real error message when the exception is caught at the top level.
An alternative is to pass the location information as part of the exception,
which could be done by throwing a SAXParseException
instead of an ordinary SAXException.
However, the application still has to deal with the case where there is no Locator, in which case throwing a SAXParseException is not very convenient. An
alternative here would be for your application to create its own default
locator (containing no useful information) for use when the parser doesn't
supply one.
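The default-locator idea might be sketched as follows, using the LocatorImpl helper class from org.xml.sax.helpers. The class LocationSource and its methods are hypothetical names of our own. The stand-in locator uses -1 for the line and column, following the SAX convention for "not available".

```java
import org.xml.sax.Locator;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.LocatorImpl;

// Hypothetical helper: always has a Locator to hand, so application errors
// can be raised as SAXParseExceptions whether or not the parser supplied one.
class LocationSource {
    private Locator locator = unknownLocation();

    // Stand-in locator carrying no useful information.
    static Locator unknownLocation() {
        LocatorImpl none = new LocatorImpl();
        none.setLineNumber(-1);
        none.setColumnNumber(-1);
        return none;
    }

    // Call this from the handler's setDocumentLocator() method.
    void setDocumentLocator(Locator loc) {
        if (loc != null) {
            locator = loc;
        }
    }

    // Builds an exception carrying the current coordinates.
    SAXParseException error(String message) {
        return new SAXParseException(message, locator);
    }
}
```

With a helper like this the handler can simply write throw location.error("Price is not numeric"), and the top-level application can check whether getLineNumber() returns -1 before printing the location.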
Maintaining Context
We've seen in both the examples so far that the DocumentHandler generally needs to maintain
some kind of context information as the parse proceeds. In the first case all
that it did was to accumulate a count of elements; in the second example the DocumentHandler kept track of whether or not
we were currently within a <book>
element with category="fiction".
Nearly all realistic SAX applications will need to maintain
some context information of this kind. Very often, it's appropriate to keep
track of the current element nesting, and in many cases it's also useful to
know the attributes of all the ancestor elements of the data we're currently
processing.
The obvious data structure to hold this information is a Stack, because it's natural to add
information about an element when we reach its start tag, and to remove that
information when we reach its end tag. A stack, of course, still requires far
less memory than you would need to store the whole document, because the
maximum number of entries on the stack is only as great as the maximum nesting
of elements, which even in large and complex documents rarely exceeds a depth
of ten or so.
We can see how a stack can be useful if we modify the
requirements for the previous example application. This time we'll allow our
book catalog to include multi-volume books, with a price for each volume and a
price for the whole set. In calculating the average price, we want to consider
the price of the whole set, not the price of the individual volumes.
The source document might now look like this (it's also
available via the web site at https://www.wrox.com):
<?xml version="1.0"?>
<books>
    <book category="reference">
        <author>Nigel Rees</author>
        <title>Sayings of the Century</title>
        <price>8.95</price>
    </book>
    <book category="fiction">
        <author>Evelyn Waugh</author>
        <title>Sword of Honour</title>
        <price>12.99</price>
    </book>
    <book category="fiction">
        <author>Herman Melville</author>
        <title>Moby Dick</title>
        <price>8.99</price>
    </book>
    <book category="fiction">
        <author>J. R. R. Tolkien</author>
        <title>The Lord of the Rings</title>
        <price>22.99</price>
        <volume number="1">
            <title>The Fellowship of the Ring</title>
            <price>8.95</price>
        </volume>
        <volume number="2">
            <title>The Two Towers</title>
            <price>8.95</price>
        </volume>
        <volume number="3">
            <title>The Return of the King</title>
            <price>8.95</price>
        </volume>
    </book>
</books>
One way of handling this would be to introduce another flag
in the program, which we set when we encounter a <volume>
start tag, and unset when we find a </volume>
end tag; we could ignore a <price>
element if this flag is set. But this style of programming quickly leads to a
proliferation of flags and complex nesting of if-then-else conditions. A better
approach is to put all the information about the currently open elements on a stack,
which we can then interrogate as required.
Here's the new version of the application:
import org.xml.sax.*;
import org.xml.sax.helpers.AttributeListImpl;
import java.util.Stack;

public class AveragePrice1 extends HandlerBase
{
    private int count = 0;
    private double totalPrice = 0.0;
    private StringBuffer content = new StringBuffer();
    private Locator locator;
    private Stack context = new Stack();    // one entry per open element

    public void determineAveragePrice() throws Exception
    {
        Parser p = new com.jclark.xml.sax.Driver();
        p.setDocumentHandler(this);
        p.parse("file:///c:/data/books1.xml");
    }

    public void setDocumentLocator(Locator loc)
    {
        locator = loc;
    }

    public void startElement(String name, AttributeList atts) throws SAXException
    {
        // remember this element until its end tag is reached
        ElementDetails details = new ElementDetails(name, atts);
        context.push(details);
        if (name.equals("book"))
        {
            if (isFiction()) count++;
        }
        content.setLength(0);
    }

    public void characters(char[] chars, int start, int len) throws SAXException
    {
        content.append(chars, start, len);
    }

    public void endElement(String name) throws SAXException
    {
        // only the price of the whole book counts, not that of a volume
        if (name.equals("price") && isFiction() && !isVolume())
        {
            try
            {
                double price = new Double(content.toString()).doubleValue();
                totalPrice += price;
            }
            catch (java.lang.NumberFormatException err)
            {
                if (locator!=null)
                {
                    System.err.println("Error in " + locator.getSystemId() +
                        " at line " + locator.getLineNumber() +
                        " column " + locator.getColumnNumber());
                }
                throw new SAXException("Price is not numeric", err);
            }
        }
        content.setLength(0);
        context.pop();
    }

    public void endDocument() throws SAXException
    {
        System.out.println("The average price of fiction books is " +
            totalPrice / count );
    }

    public static void main (String args[]) throws java.lang.Exception
    {
        (new AveragePrice1()).determineAveragePrice();
    }

    // are we inside a <book> element with category="fiction"?
    private boolean isFiction()
    {
        for (int p=context.size()-1; p>=0; p--) {
            ElementDetails elem = (ElementDetails)context.elementAt(p);
            if (elem.name.equals("book") &&
                elem.attributes.getValue("category")!=null &&
                elem.attributes.getValue("category").equals("fiction"))
            {
                return true;
            }
        }
        return false;
    }

    // are we inside a <volume> element?
    private boolean isVolume()
    {
        for (int p=context.size()-1; p>=0; p--) {
            ElementDetails elem = (ElementDetails)context.elementAt(p);
            if (elem.name.equals("volume"))
            {
                return true;
            }
        }
        return false;
    }

    private class ElementDetails
    {
        public String name;
        public AttributeList attributes;

        public ElementDetails(String name, AttributeList atts)
        {
            this.name = name;
            this.attributes = new AttributeListImpl(atts); // make a copy
        }
    }
}
It might seem that maintaining this stack is a lot of effort
for rather a small return. But it's a worthwhile investment. All real
applications become more complex over time, and it's worth having a structure
that allows the logic to evolve without destroying the structure of the
program. Note how the condition tests such as isFiction()
and isVolume() have now become methods
applied to the context data structure rather than flags that are maintained as
events occur. As the number of conditions to be tested multiplies, we can write
more of these methods without increasing the complexity of the startElement() and endElement()
methods.
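As a hedged illustration of where this can lead, the context stack and its tests could be gathered into a small class of their own, so that each new condition becomes a one-line call. The class ElementContext and its methods below are hypothetical names, not part of SAX; the sketch reuses the AttributeListImpl copy technique from the program above.

```java
import java.util.Stack;
import org.xml.sax.AttributeList;
import org.xml.sax.helpers.AttributeListImpl;

// Hypothetical wrapper around the context stack, offering general-purpose tests.
class ElementContext {
    // One entry per currently open element.
    private static class Entry {
        final String name;
        final AttributeList attributes;

        Entry(String name, AttributeList atts) {
            this.name = name;
            this.attributes = new AttributeListImpl(atts); // make a copy
        }
    }

    private final Stack<Entry> stack = new Stack<Entry>();

    // Call from startElement().
    void push(String name, AttributeList atts) {
        stack.push(new Entry(name, atts));
    }

    // Call at the end of endElement().
    void pop() {
        stack.pop();
    }

    // True if any open element has the given name.
    boolean isInside(String name) {
        for (Entry e : stack) {
            if (e.name.equals(name)) return true;
        }
        return false;
    }

    // True if any open element has the given name and attribute value.
    boolean isInside(String name, String attName, String value) {
        for (Entry e : stack) {
            if (e.name.equals(name) && value.equals(e.attributes.getValue(attName))) {
                return true;
            }
        }
        return false;
    }

    // Slash-separated names of the open elements, outermost first.
    String path() {
        StringBuffer sb = new StringBuffer();
        for (Entry e : stack) {
            sb.append('/').append(e.name);
        }
        return sb.toString();
    }
}
```

With such a class, isFiction() reduces to context.isInside("book", "category", "fiction") and isVolume() to context.isInside("volume"), while path() gives a convenient string to include in diagnostics.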
©1999 Wrox Press Limited, US and UK.