IDevResource.com - XML Channel - Professional XML by Wrox Press


SAX 1.0: The Simple API for XML

(Reproduced with kind permission of Wrox Press: https://www.wrox.com)

In Chapter 5 we looked at how to write applications using the Document Object Model. In this chapter we'll look at an alternative way of processing an XML document: the SAX interface. We'll start by discussing why you might choose to use the SAX interface rather than the DOM. Then we'll explore the interface by writing some simple applications. We'll also discuss some design patterns that are useful when creating more complex SAX applications, and finally we'll look at where SAX is going next.

SAX is a very different style of interface from DOM. With DOM, your application asks what is in the document by following object references in memory; with SAX, the parser tells the application what is in the document by notifying the application of a stream of parsing events.

SAX stands for "Simple API for XML". Or if you really want it in full, the Simple Application Programming Interface for Extensible Markup Language.

As the name implies, SAX is an interface that allows you to write applications to read the data held in an XML document. It's primarily a Java interface, and all of our examples will be in Java. (Since we don't have the space to explain Java in this chapter we will assume knowledge of it for the purposes of this exposition. See Beginning Java 2, Wrox Press ISBN 1861002238, or the documentation at https://www.java.sun.com for more information.)

The SAX interface is supported by virtually every Java XML parser, and the level of compatibility is excellent. For a list of some of the implementations see xmlsoftware.com or David Megginson's site at https://www.megginson.com/SAX/.

To write a SAX application in Java, you'll need to install the SAX classes (in addition to the Java JDK, of course). In most cases you'll find that the XML Parser does this for you automatically (we'll tell you where you can get parsers shortly). Check to see that classes such as org.xml.sax.Parser are present somewhere on your classpath. If not, you can install them from https://www.megginson.com/SAX/.

We'll say a few words later on about where SAX came from and where it's going. But for the moment, we'll just mention a most remarkable feature: SAX doesn't belong to any standards body or consortium, nor to any company or individual; it just exists in cyberspace for anyone to implement and everyone to use. In particular, unlike most of the XML family of standards it has nothing to do with the W3C.

SAX development is co-ordinated by David Megginson, and its specification can be found on his site: https://www.megginson.com/SAX/. That specification, with trivial editorial changes, is reproduced for convenience in Appendix C of this book.

An Event-Based Interface

There are essentially three ways you can read an XML document from a program.

You can use a parser that analyses the document and constructs a tree representation of its contents in memory: the output from the parser passes into the Document Object Model, or DOM. Your program can then start at the top of the tree and navigate around it, following references from one element to another to find the information it needs.
You can use a parser that reads the document and tells your program about the symbols it finds, as it finds them. For example it will tell you when it finds a start tag, when it finds some character data, and when it finds an end tag. This is called an event-based interface because the parser notifies the application of significant events as they occur. If this is the right kind of interface for you, use SAX.

Let's look at event-based parsing in a little more detail.

You may have come across the term 'event-based' in user interface programming, where an application is written to respond to events such as mouse-clicks as they occur. An event-based parser is similar: in particular, you have to get used to the idea that your application is not in control. Once things have been set in motion you don't call the parser, the parser calls you. That can seem strange at first, but once you get used to it, it's not a problem. In fact, it's much easier than user-interface programming, because unlike a user going crazy with a mouse, the XML parsing events occur in a rather predictable sequence. XML elements have to be properly nested, so you know that every element that's been opened will sooner or later be closed, and so on.

Consider a simple XML file such as the following:


<?xml version="1.0"?>
<books>
   <book>Professional XML</book>
</books>

As the parser processes this, it will call a sequence of methods such as the following (we'll describe the actual method names and parameters later; this is just for illustration):


startDocument()
startElement( "books" )
startElement( "book" )
characters( "Professional XML" )
endElement( "book" )
endElement( "books" )
endDocument()

All your application has to do is provide the methods to be called when events such as startElement and endElement occur.
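To make the idea concrete, here is a minimal sketch of such an application. Note the assumptions: it uses the JAXP parser bundled with the JDK and the SAX 2 DefaultHandler class, not the SAX 1.0 classes this chapter describes, so the method signatures differ from the SAX 1.0 ones (for example, SAX 1.0's startElement takes a name and an AttributeList). The class name EventTrace is ours, and whitespace-only text is skipped so that the trace matches the illustration above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that records each parsing event as a line of text.
class EventTrace extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override public void startDocument() { events.add("startDocument()"); }
    @Override public void endDocument()   { events.add("endDocument()"); }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("startElement(" + qName + ")");
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("endElement(" + qName + ")");
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser also reports the whitespace between tags; skip it here.
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) events.add("characters(" + text + ")");
    }

    // Parse the given XML text and return the recorded events.
    static List<String> trace(String xml) throws Exception {
        EventTrace handler = new EventTrace();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?>\n<books>\n   <book>Professional XML</book>\n</books>";
        for (String event : trace(xml)) System.out.println(event);
    }
}
```

Notice that the application never asks the parser for anything after calling parse: the parser drives, and the handler simply reacts. A parser is also free to deliver a run of text in more than one characters call, so a production handler should not assume one call per text node.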

Why Use an Event-Based Interface?

Given that you have a choice, it's important to understand when it's best to use an event-based interface like SAX, and when it's better to use a tree-based interface like the DOM.

Both interfaces are well standardized and widely supported, so whichever you choose, you have a wide choice of good quality parsers available, most of which are free. In fact many of the parsers support both interfaces.

The Benefits of SAX

The following sections outline the most obvious benefits of the SAX interface.

It Can Parse Files of Any Size

Because there is no need to load the whole file into memory, memory consumption is typically much lower than with the DOM, and it doesn't increase with the size of the file. Of course the actual amount of memory used by the DOM depends on the parser, but in many cases a 100 KB document will occupy at least 1 MB of memory.

A word of caution though: if your SAX application builds its own in-memory representation of the document, it is likely to take up just as much space as if you allowed the parser to build it.

It Is Useful When You Want to Build Your Own Data Structure

Your application might want to construct a data structure using high-level objects such as books, authors, and publishers rather than low-level elements, attributes, and processing instructions. These "business objects" might only be distantly related to the contents of the XML file; for example, they may combine data from the XML file and other sources. If you want to build up an application-oriented data structure in memory in this way, there is very little advantage in building up a low-level DOM structure first and then demolishing it. Just process each event as it occurs, to make the appropriate incremental change to your business object model.
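As a rough sketch of this approach, the handler below turns events directly into a list of Book objects without ever building a tree. Everything specific here is an assumption made for illustration: the Book class, the title and publisher element names, and the use of the JDK's SAX 2 DefaultHandler in place of the SAX 1.0 classes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that builds business objects as events arrive.
class LibraryLoader extends DefaultHandler {
    // Hypothetical business object.
    static final class Book {
        String title;
        String publisher;
    }

    final List<Book> books = new ArrayList<>();
    private Book current;                                   // book being built, if any
    private final StringBuilder text = new StringBuilder(); // text of the current element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("book".equals(qName)) current = new Book();
        text.setLength(0); // start collecting text afresh for each element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current == null) return;
        switch (qName) {
            case "title":     current.title = text.toString().trim(); break;
            case "publisher": current.publisher = text.toString().trim(); break;
            case "book":      books.add(current); current = null; break;
            default:          break; // anything else is ignored
        }
    }

    static List<Book> load(String xml) throws Exception {
        LibraryLoader handler = new LibraryLoader();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.books;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<books><book><title>Professional XML</title>"
                   + "<publisher>Wrox Press</publisher></book></books>";
        for (Book b : load(xml)) System.out.println(b.title + " (" + b.publisher + ")");
    }
}
```

Each event makes one small change to the object model, which is the incremental style described above.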

It Is Useful When You Only Want A Small Subset Of The Information

If you are only interested, say, in counting how many books have arrived in the library this week, or in determining their average price, it is very inefficient and quite unnecessary to read all the data that you don't want into memory along with the small amount that you do want. One of the beauties of SAX is that it makes it very easy to ignore the data you aren't interested in.
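The following sketch shows how little code that can take. It assumes, purely for illustration, that each book is a book element carrying a price attribute; the class name PriceStats is ours, and the handler is written against the SAX 2 DefaultHandler shipped with the JDK, not the SAX 1.0 classes. The handler overrides a single method and lets every other event fall through to the do-nothing defaults.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that counts books and averages their prices.
class PriceStats extends DefaultHandler {
    int count;     // number of book elements seen
    int priced;    // number of those that carried a price attribute
    double total;  // sum of the prices seen

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!"book".equals(qName)) return; // not interested in anything else
        count++;
        String price = atts.getValue("price"); // assumed attribute name
        if (price != null) {
            priced++;
            total += Double.parseDouble(price);
        }
    }

    double average() {
        return priced == 0 ? 0.0 : total / priced;
    }

    static PriceStats scan(String xml) throws Exception {
        PriceStats stats = new PriceStats();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), stats);
        return stats;
    }

    public static void main(String[] args) throws Exception {
        PriceStats s = scan("<books><book price=\"10.0\">A</book>"
                          + "<book price=\"20.0\">B</book><book price=\"30.0\">C</book></books>");
        System.out.println(s.count + " books, average price " + s.average());
    }
}
```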

It Is Simple

As the name suggests, it's really quite simple to use.

It Is Fast

If it's possible to get the information you need from a single serial pass through the document, SAX will almost certainly be the fastest way to get it.

The Drawbacks of SAX

Having looked at the benefits it is only fair to address the potential drawbacks in using SAX.

There's No Random Access to the Document

Because the document is not in memory you have to handle the data in the order it arrives. SAX can be difficult to use when the document contains a lot of internal cross-references, for example using ID and IDREF attributes.
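One common workaround is to remember what you have seen and resolve the references once the whole document has gone by. The sketch below does this for a made-up vocabulary in which an id attribute defines a target and a ref attribute points at one; these attribute names are assumptions, as is the class name CrossRefCheck, and without a DTD the parser does not know that they are of type ID and IDREF. As elsewhere, the code uses the SAX 2 DefaultHandler from the JDK.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that reports references to identifiers never defined.
class CrossRefCheck extends DefaultHandler {
    private final Set<String> ids = new HashSet<>();    // identifiers defined so far
    private final List<String> refs = new ArrayList<>(); // references seen, in document order
    final List<String> dangling = new ArrayList<>();     // filled in at endDocument

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String id = atts.getValue("id");   // assumed attribute name
        if (id != null) ids.add(id);
        String ref = atts.getValue("ref"); // assumed attribute name
        if (ref != null) refs.add(ref);    // may point forwards, so defer the check
    }

    @Override
    public void endDocument() {
        // Only now do we know every identifier in the document.
        for (String ref : refs) {
            if (!ids.contains(ref)) dangling.add(ref);
        }
    }

    static List<String> check(String xml) throws Exception {
        CrossRefCheck handler = new CrossRefCheck();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.dangling;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><cite ref=\"b2\"/><cite ref=\"b9\"/>"
                   + "<book id=\"b1\"/><book id=\"b2\"/></doc>";
        System.out.println("Dangling references: " + check(xml));
    }
}
```

The price is that the set of identifiers has to be held in memory, which is exactly the kind of bookkeeping the DOM would otherwise do for you.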

Complex Searches Can Be Difficult to Implement

Complex searches can be quite messy to program as the responsibility is on you to maintain data structures holding any context information you need to retain, for example the attributes of the ancestors of the current element.
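A stack is the usual data structure for this kind of context. The sketch below keeps a copy of each open element's attributes and uses it to find the nearest ancestor that carries a given attribute. The element and attribute names (shelf, subject, book, title) and the class name AncestorContext are invented for the example, and the code relies on the SAX 2 classes DefaultHandler and AttributesImpl from the JDK, not on SAX 1.0.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that remembers the attributes of all open elements.
class AncestorContext extends DefaultHandler {
    // Innermost open element is at the head of the deque.
    private final Deque<Attributes> ancestors = new ArrayDeque<>();
    final List<String> found = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("book".equals(qName)) {
            found.add(atts.getValue("title") + ": " + inherited("subject"));
        }
        // The parser may reuse the Attributes object, so keep a copy.
        ancestors.push(new AttributesImpl(atts));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        ancestors.pop();
    }

    // Value of the named attribute on the nearest enclosing element, or null.
    private String inherited(String name) {
        for (Attributes a : ancestors) { // iterates from innermost to outermost
            String value = a.getValue(name);
            if (value != null) return value;
        }
        return null;
    }

    static List<String> scan(String xml) throws Exception {
        AncestorContext handler = new AncestorContext();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.found;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<library><shelf subject=\"XML\"><row><book title=\"Professional XML\"/></row></shelf>"
                   + "<shelf subject=\"Java\"><book title=\"Beginning Java 2\"/></shelf></library>";
        for (String line : scan(xml)) System.out.println(line);
    }
}
```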

The DTD Is Not Available

SAX 1.0 doesn't tell you anything about the contents of the DTD. Actually the DOM doesn't tell you much about it either, though some vendors have extended the DOM interface to do so. This isn't a problem for most applications: the DTD is mainly of interest to the parser; and as we'll see towards the end of the chapter the problem is fixed in SAX 2.0.

Lexical Information Is Not Available

The design principle in SAX is that it doesn't provide you with lexical information. SAX tries to tell you what the writer of the document wanted to say, and avoids troubling you with details of the way they chose to say it. For example:

You can't find out whether the original document contained "&#xA;" or "&#10;" or whether it contained a real newline character: all three are reported to the application in the same way.
You don't get told about comments in the document: SAX assumes that comments are there for the author's benefit, not for the reader's.
You don't get told about the order in which attributes were written: it isn't supposed to matter.

These restrictions are only a problem if you want to reproduce the way the document was written, perhaps for the benefit of future editing. For example, if you are writing an application designed to leave the existing content of the document intact, but to add some extra information from another source, the document author might get upset if you change the order of the attributes arbitrarily, or lose all the comments. In fact, most of the restrictions apply just as much to the DOM, although it does give you a little more information in some areas: for example, it retains comments. Again, many of the restrictions are fixed in SAX 2.0, though not all: for example, the order of attributes is still a closely guarded secret, as is the choice of delimiter (single or double quotes).
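The first point in the list above is easy to check for yourself. The sketch below parses three tiny documents that write a newline as a hexadecimal character reference, as a decimal character reference, and as a literal newline, and collects the text the handler receives in each case. The class name NewlineDemo is ours, and the code uses the SAX 2 DefaultHandler bundled with the JDK in place of the SAX 1.0 classes.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that simply accumulates all character data.
class NewlineDemo extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // text may arrive in several pieces
    }

    // Return the character data reported for the given document.
    static String textOf(String xml) throws Exception {
        NewlineDemo handler = new NewlineDemo();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        String hex     = textOf("<a>x&#xA;y</a>"); // hexadecimal character reference
        String decimal = textOf("<a>x&#10;y</a>"); // decimal character reference
        String literal = textOf("<a>x\ny</a>");    // real newline character
        System.out.println("hex == decimal: " + hex.equals(decimal));
        System.out.println("hex == literal: " + hex.equals(literal));
    }
}
```

From the handler's point of view there is no way to tell the three documents apart, which is precisely the design principle at work.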

SAX Is Read-Only

The DOM allows you to create or modify a document in memory, as well as reading a document from an XML source file. SAX, by contrast, is designed for reading XML documents, not for writing them.

Actually it turns out that the SAX interface is quite handy for writing XML documents as well as reading them. As we'll see later, the same stream of events that the parser sends to the application when reading an XML document can equally be sent from the application to an XML generator when writing one.
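Here is a minimal sketch of that idea: a handler that, instead of consuming events from a parser, turns whatever events it is sent into XML text. The application then plays the role of the parser and calls the event methods itself. The class name XmlWriterHandler and its escaping rules are our own simplifications (no namespaces, no pretty-printing), and the code is written against the SAX 2 DefaultHandler from the JDK, not the SAX 1.0 interfaces.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical XML generator driven by SAX-style events.
class XmlWriterHandler extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();

    @Override public void startDocument() { out.append("<?xml version=\"1.0\"?>\n"); }
    @Override public void endDocument()   { out.append("\n"); }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        out.append('<').append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            out.append(' ').append(atts.getQName(i))
               .append("=\"").append(escape(atts.getValue(i))).append('"');
        }
        out.append('>');
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        out.append("</").append(qName).append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        out.append(escape(new String(ch, start, length)));
    }

    // Escape the characters that are special in XML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    String result() { return out.toString(); }

    // The application sends the events a parser would have sent.
    static String demo() {
        XmlWriterHandler w = new XmlWriterHandler();
        Attributes none = new AttributesImpl();
        char[] title = "Professional XML".toCharArray();
        w.startDocument();
        w.startElement("", "books", "books", none);
        w.startElement("", "book", "book", none);
        w.characters(title, 0, title.length);
        w.endElement("", "book", "book");
        w.endElement("", "books", "books");
        w.endDocument();
        return w.result();
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

Because the generator accepts the same events that a parser produces, it can also be plugged directly onto a parser to copy a document, or placed at the end of a chain of handlers that modify the event stream.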

SAX Is Not Supported In Current Browsers

Although there are many XML parsers that support the SAX interface, at the time of writing there isn't a parser built into a mainstream web browser that supports it. You can incorporate a SAX-compliant parser within a Java applet, of course, but the overhead of downloading it from the server may strain the patience of a user with a slow Internet connection. In practice, your choice of interfaces for client-side XML programming is rather limited.
