SAX 1.0: The Simple API for XML
(Reproduced with kind permission of Wrox Press: https://www.wrox.com)
In Chapter 5 we looked at how to write applications using
the Document Object Model. In this chapter we'll look at an alternative way of
processing an XML document: the SAX interface. We'll start by discussing why
you might choose to use the SAX interface rather than the DOM. Then we'll
explore the interface by writing some simple applications. We'll also discuss
some design patterns that are useful when creating more complex SAX
applications, and finally we'll look at where SAX is going next.
SAX is a very different style of interface from DOM. With
DOM, your application asks what is in the document by following object
references in memory; with SAX, the parser tells the application what is in the
document by notifying the application of a stream of parsing events.
SAX stands for "Simple API for XML". Or if you
really want it in full, the Simple Application Programming Interface for
Extensible Markup Language.
As the name implies, SAX is an interface that allows you to
write applications to read the data held in an XML document. It's primarily a
Java interface, and all of our examples will be in Java. (Since we don't have
the space to explain Java in this chapter we will assume knowledge of it for
the purposes of this exposition. See Beginning Java 2, Wrox Press ISBN
1861002238, or the documentation at https://www.java.sun.com for more information.)
The SAX interface is supported by virtually every Java XML
parser, and the level of compatibility is excellent. For a list of some of the
implementations see xmlsoftware.com or David
Megginson's site at https://www.megginson.com/SAX/.
To write a SAX application in Java, you'll need to install
the SAX classes (in addition to the Java JDK, of course). In most cases you'll
find that the XML Parser does this for you automatically (we'll tell you where
you can get parsers shortly). Check to see that classes such as org.xml.sax.Parser are present somewhere on
your classpath. If not, you can install them from https://www.megginson.com/SAX/.
We'll say a few words later on about where SAX came from and
where it's going. But for the moment, we'll just mention a most remarkable
feature: SAX doesn't belong to any standards body or consortium, nor to any
company or individual; it just exists in cyberspace for anyone to implement and
everyone to use. In particular, unlike most of the XML family of standards it
has nothing to do with the W3C.
SAX development is co-ordinated by David Megginson, and its
specification can be found on his site: https://www.megginson.com/SAX/. That specification, with trivial editorial
changes, is reproduced for convenience in Appendix C of this book.
An Event-Based Interface
There are essentially three ways you can read an XML
document from a program.
- You can just read it as a file and
sort out the tags for yourself. This is the hacker's approach, and we don't
recommend it. You'll quickly find that dealing with all the special cases
(different character encodings, escape conventions, internal and external entities,
defaulted attributes and so on) is much harder work than you thought; probably
you won't deal with all these special cases correctly and sooner or later
someone will feed you a perfectly good XML document that your program can't
handle. Avoid the temptation: it's not as if XML parsers are expensive (most
are free).
- You can use a parser that analyses
the document and constructs a tree representation of its contents in memory:
the output from the parser passes into the Document Object Model, or DOM. Your
program can then start at the top of the tree and navigate around it, following
references from one element to another to find the information it needs.
- You can use a parser that reads
the document and tells your program about the symbols it finds, as it finds
them. For example it will tell you when it finds a start tag, when it finds
some character data, and when it finds an end tag. This is called an event-based interface because the parser notifies the application of
significant events as they occur. If this is the right kind of interface for
you, use SAX.
Let's look at event-based parsing in a little more detail.
You may have come across the term 'event-based' in user
interface programming, where an application is written to respond to events such
as mouse-clicks as they occur. An event-based parser is similar: in particular,
you have to get used to the idea that your application is not in control. Once
things have been set in motion you don't call the parser, the parser calls you.
That can seem strange at first, but once you get used to it, it's not a
problem. In fact, it's much easier than user-interface programming, because
unlike a user going crazy with a mouse, the XML parsing events occur in a
rather predictable sequence. XML elements have to be properly nested, so you
know that every element that's been opened will sooner or later be closed, and
so on.
Consider a simple XML file such as the following:
<?xml version="1.0"?>
<books>
<book>Professional XML</book>
</books>
As the parser processes this, it will call a sequence of
methods such as the following (we'll describe the actual method names and
parameters later; this is just for illustration):
startDocument()
startElement( "books" )
startElement( "book" )
characters( "Professional XML" )
endElement( "book" )
endElement( "books" )
endDocument()
All your application has to do is to provide methods to be
called when events such as startElement
and endElement occur.
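To make the call sequence above concrete, here is a minimal sketch of a handler that simply records each event as it arrives. It rests on a few assumptions that are ours, not the chapter's: it obtains a parser through the JAXP classes bundled with current JDKs (javax.xml.parsers.SAXParserFactory), it uses the SAX 1.0 HandlerBase convenience class (which modern JDKs mark as deprecated), and the class name EventTrace and the decision to skip whitespace-only text are purely illustrative.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;

// Hypothetical class: records each SAX 1.0 callback as a line of text.
@SuppressWarnings("deprecation") // the SAX 1.0 classes are deprecated in modern JDKs
class EventTrace extends HandlerBase {
    final List<String> events = new ArrayList<>();

    @Override public void startDocument() { events.add("startDocument()"); }

    @Override public void endDocument() { events.add("endDocument()"); }

    @Override public void startElement(String name, AttributeList atts) {
        events.add("startElement( \"" + name + "\" )");
    }

    @Override public void endElement(String name) {
        events.add("endElement( \"" + name + "\" )");
    }

    @Override public void characters(char[] ch, int start, int length) {
        // Skip whitespace-only runs such as the line breaks between tags.
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            events.add("characters( \"" + text + "\" )");
        }
    }

    // Assumption: JAXP is used to locate a parser; the parser then calls us.
    static List<String> trace(String xml) throws Exception {
        EventTrace handler = new EventTrace();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?>\n<books>\n"
                + "<book>Professional XML</book>\n</books>";
        trace(xml).forEach(System.out::println);
    }
}
```

Notice that the application never asks the parser for the next item: it hands over the handler and the parser drives the calls. A parser is free to deliver character data in more than one chunk, so a real application should not rely on one characters call per text node.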
Why Use an Event-Based Interface?
Given that you have a choice, it's important to understand
when it's best to use an event-based interface like SAX, and when it's better
to use a tree-based interface like the DOM.
Both interfaces are well standardized and widely supported,
so whichever you choose, you have a wide choice of good quality parsers
available, most of which are free. In fact many of the parsers support both
interfaces.
The Benefits of SAX
The following sections outline the most obvious benefits of
the SAX interface.
It Can Parse Files of Any Size
Because there is no need to load the whole file into memory,
memory consumption is typically much less than with the DOM, and it doesn't increase
with the size of the file. Of course the actual amount of memory used by the
DOM depends on the parser, but in many cases a 100 KB document will occupy at
least 1 MB of memory.
A word of caution though: if your SAX
application builds its own in-memory representation of the document, it is
likely to take up just as much space as if you allowed the parser to build it.
It Is Useful When You Want to Build Your Own Data Structure
Your application might want to construct a data structure
using high-level objects such as books, authors, and publishers rather than
low-level elements, attributes, and processing instructions. These
"business objects" might only be distantly related to the contents of
the XML file; for example, they may combine data from the XML file and other
sources. If you want to build up an application-oriented data structure in
memory in this way, there is very little advantage in building up a low-level
DOM structure first and then demolishing it. Just process each event as it
occurs, to make the appropriate incremental change to your business object
model.
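As a rough sketch of this idea, the handler below builds a list of Book objects directly from the event stream, with no intermediate tree. Everything specific here is an assumption made for illustration: the Book class, the BookLoader name, and a document layout in which each book element contains title and author child elements. It also assumes the JAXP classes bundled with current JDKs as the way of obtaining a parser, and it uses the SAX 1.0 HandlerBase class, which modern JDKs mark as deprecated.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;

// Hypothetical loader: turns parsing events straight into business objects.
@SuppressWarnings("deprecation") // the SAX 1.0 classes are deprecated in modern JDKs
class BookLoader extends HandlerBase {
    // Hypothetical business object, unrelated to any XML API.
    static final class Book {
        String title = "";
        String author = "";
    }

    final List<Book> books = new ArrayList<>();
    private Book current;                              // the book being assembled
    private final StringBuilder text = new StringBuilder();

    @Override public void startElement(String name, AttributeList atts) {
        if (name.equals("book")) {
            current = new Book();
        }
        text.setLength(0);                             // start collecting fresh text
    }

    @Override public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);                // text may arrive in chunks
    }

    @Override public void endElement(String name) {
        if (current == null) {
            return;                                    // not inside a book
        }
        switch (name) {
            case "title":  current.title = text.toString().trim(); break;
            case "author": current.author = text.toString().trim(); break;
            case "book":   books.add(current); current = null; break;
            default:       break;                      // ignore everything else
        }
    }

    static List<Book> load(String xml) throws Exception {
        BookLoader handler = new BookLoader();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.books;
    }
}
```

Each event makes one small change to the object model, and nothing is kept once its element has been closed.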
It Is Useful When You Only Want a Small Subset of the Information
If you are only interested, say, in counting how many books
have arrived in the library this week, or in determining their average price,
it is very inefficient and quite unnecessary to read all the data that you
don't want into memory along with the small amount that you do want. One of the
beauties of SAX is that it makes it very easy to ignore the data you aren't
interested in.
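The counting example might be sketched as follows. We are assuming, purely for illustration, that each book element carries its price in a price attribute; the class name PriceStats is likewise hypothetical. As elsewhere, the sketch assumes the JAXP classes bundled with current JDKs and the (now deprecated) SAX 1.0 HandlerBase class.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;

// Hypothetical handler: counts books and totals prices, ignoring everything else.
@SuppressWarnings("deprecation") // the SAX 1.0 classes are deprecated in modern JDKs
class PriceStats extends HandlerBase {
    int count;        // number of book elements seen
    int priced;       // number of those that carried a price attribute
    double total;     // sum of the prices seen

    // The only callback we override; all other events fall through to
    // HandlerBase's do-nothing defaults.
    @Override public void startElement(String name, AttributeList atts) {
        if (!name.equals("book")) {
            return;
        }
        count++;
        String price = atts.getValue("price");   // assumed attribute name
        if (price != null) {
            total += Double.parseDouble(price);
            priced++;
        }
    }

    double averagePrice() {
        return priced == 0 ? 0.0 : total / priced;
    }

    static PriceStats scan(String xml) throws Exception {
        PriceStats stats = new PriceStats();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), stats);
        return stats;
    }
}
```

Ignoring data costs nothing here: the text, the other elements, and the remaining attributes are never stored anywhere.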
It Is Simple
As the name suggests, it's really quite simple to use.
It Is Fast
If it's possible to get the information you need from a
single serial pass through the document, SAX will almost certainly be the
fastest way to get it.
The Drawbacks of SAX
Having looked at the benefits it is only fair to address the
potential drawbacks in using SAX.
There's No Random Access to the Document
Because the document is not in memory you have to handle the
data in the order it arrives. SAX can be difficult to use when the document
contains a lot of internal cross-references, for example using ID and IDREF
attributes.
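One common way of coping, sketched below, is to remember what you have seen and resolve the references only once the whole document has gone past. The document layout is an assumption for illustration: book elements with id and title attributes, plus an optional sequel attribute that refers to another book's id. The class name CrossRefs is hypothetical too, and the sketch again assumes the JAXP classes bundled with current JDKs and the deprecated SAX 1.0 HandlerBase class.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;

// Hypothetical handler: resolves forward references after the parse completes.
@SuppressWarnings("deprecation") // the SAX 1.0 classes are deprecated in modern JDKs
class CrossRefs extends HandlerBase {
    private final Map<String, String> titlesById = new HashMap<>();
    private final List<String[]> pending = new ArrayList<>(); // {fromId, toId}
    final List<String> resolved = new ArrayList<>();

    @Override public void startElement(String name, AttributeList atts) {
        if (!name.equals("book")) {
            return;
        }
        String id = atts.getValue("id");
        titlesById.put(id, atts.getValue("title"));
        String sequel = atts.getValue("sequel");   // assumed IDREF-style attribute
        if (sequel != null) {
            // The target may not have been seen yet, so just remember the link.
            pending.add(new String[] {id, sequel});
        }
    }

    @Override public void endDocument() {
        // Every id is known by now, so the links can be followed.
        for (String[] link : pending) {
            resolved.add(titlesById.get(link[0]) + " -> " + titlesById.get(link[1]));
        }
    }

    static List<String> resolve(String xml) throws Exception {
        CrossRefs handler = new CrossRefs();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.resolved;
    }
}
```

The price is that the map grows with the number of identified elements, which eats into the memory advantage described earlier.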
Complex Searches Can Be Difficult to Implement
Complex searches can be quite messy to program as the
responsibility is on you to maintain data structures holding any context
information you need to retain, for example the attributes of the ancestors of
the current element.
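A stack is the usual tool for holding this context. The sketch below keeps one entry per open element and consults it when a book ends, collecting only the books that sit on a shelf of a given genre. The shelf and book element names, the genre attribute, and the ShelfSearch class are all assumptions for illustration; the sketch also assumes the JAXP classes bundled with current JDKs and the deprecated SAX 1.0 HandlerBase class.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;

// Hypothetical handler: tracks open elements so a test can look at ancestors.
@SuppressWarnings("deprecation") // the SAX 1.0 classes are deprecated in modern JDKs
class ShelfSearch extends HandlerBase {
    // What we choose to remember about each open element.
    private static final class Frame {
        final String name;
        final String genre;   // null when the element has no genre attribute
        Frame(String name, String genre) { this.name = name; this.genre = genre; }
    }

    private final Deque<Frame> open = new ArrayDeque<>();
    private final StringBuilder text = new StringBuilder();
    private final String wantedGenre;
    final List<String> matches = new ArrayList<>();

    ShelfSearch(String wantedGenre) { this.wantedGenre = wantedGenre; }

    @Override public void startElement(String name, AttributeList atts) {
        // Copy what we need now: the AttributeList is only valid during this call.
        open.push(new Frame(name, atts.getValue("genre")));
        text.setLength(0);
    }

    @Override public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override public void endElement(String name) {
        open.pop();   // what remains on the stack are this element's ancestors
        if (name.equals("book") && enclosingShelfMatches()) {
            matches.add(text.toString().trim());
        }
    }

    private boolean enclosingShelfMatches() {
        for (Frame f : open) {   // iterates from the nearest ancestor outwards
            if (f.name.equals("shelf")) {
                return wantedGenre.equals(f.genre);
            }
        }
        return false;
    }

    static List<String> find(String xml, String genre) throws Exception {
        ShelfSearch handler = new ShelfSearch(genre);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.matches;
    }
}
```

The bookkeeping is not hard, but it is yours to write; with the DOM the parent and its attributes would simply be there to ask.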
The DTD Is Not Available
SAX 1.0 doesn't tell you anything about the contents of the
DTD. Actually the DOM doesn't tell you much about it either, though some
vendors have extended the DOM interface to do so. This isn't a problem for most
applications: the DTD is mainly of interest to the parser; and as we'll see
towards the end of the chapter the problem is fixed in SAX 2.0.
Lexical Information Is Not Available
The design principle in SAX is that it doesn't provide you
with lexical information. SAX tries to tell you what the writer of the
document wanted to say, and avoids troubling you with details of the way they
chose to say it. For example:
- You can't find out whether the
original document contained "&#xA;" or "&#10;" or
whether it contained a real newline character: all three are reported to the
application in the same way.
- You don't get told about comments in
the document: SAX assumes that comments are there for the author's benefit, not
for the reader's.
- You don't get told about the order in
which attributes were written: it isn't supposed to matter.
These restrictions are only a problem if you want to
reproduce the way the document was written, perhaps for the benefit of future
editing. For example, if you are writing an application designed to leave the
existing content of the document intact, but to add some extra information from
another source, the document author might get upset if you change the order of
the attributes arbitrarily, or lose all the comments. In fact, most of the
restrictions apply just as much to the DOM, although it does give you a little
more information in some areas: for example, it retains comments. Again, many
of the restrictions are fixed in SAX 2.0, though not all: for example, the order
of attributes is still a closely guarded secret, as is the choice of delimiter
(single or double quotes).
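A small experiment illustrates the first two points in the list. The sketch below collects the character data reported for a document, so that documents which differ only lexically can be compared. The class name LexicalDemo and the note element are hypothetical; the sketch assumes the JAXP classes bundled with current JDKs and the deprecated SAX 1.0 HandlerBase class.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;

// Hypothetical handler: gathers all character data reported by the parser.
@SuppressWarnings("deprecation") // the SAX 1.0 classes are deprecated in modern JDKs
class LexicalDemo extends HandlerBase {
    private final StringBuilder text = new StringBuilder();

    @Override public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    static String textOf(String xml) throws Exception {
        LexicalDemo handler = new LexicalDemo();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        String viaHex = textOf("<note>a&#xA;b</note>");      // hex character reference
        String viaDecimal = textOf("<note>a&#10;b</note>");  // decimal character reference
        String literal = textOf("<note>a\nb</note>");        // a real newline character
        String commented = textOf("<note>a<!-- aside -->\nb</note>");

        // All four documents look the same from the application's side.
        System.out.println(viaHex.equals(viaDecimal));
        System.out.println(viaDecimal.equals(literal));
        System.out.println(literal.equals(commented));
    }
}
```

HandlerBase has no callback for comments at all, which is why the fourth document cannot be told apart from the third.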
SAX Is Read-Only
The DOM allows you to create or modify a document in memory,
as well as reading a document from an XML source file. SAX, by contrast, is
designed for reading XML documents, not for writing them.
Actually it turns out that the SAX interface is quite handy
for writing XML documents as well as reading them. As we'll see later, the same
stream of events that the parser sends to the application when reading an XML
document can equally be sent from the application to an XML generator when
writing one.
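To give a flavor of the idea, here is a minimal sketch of such a generator: a handler that turns each event it receives back into markup. The application, rather than a parser, makes the calls. The XmlEmitter class and its escape and sample methods are hypothetical names, and the sketch handles only elements, attributes, and text. It uses the SAX 1.0 HandlerBase and AttributeListImpl classes, which modern JDKs mark as deprecated.

```java
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.helpers.AttributeListImpl;

// Hypothetical generator: receives SAX 1.0 events and writes XML text.
@SuppressWarnings("deprecation") // the SAX 1.0 classes are deprecated in modern JDKs
class XmlEmitter extends HandlerBase {
    private final StringBuilder out = new StringBuilder();

    @Override public void startDocument() {
        out.append("<?xml version=\"1.0\"?>\n");
    }

    @Override public void endDocument() {
        // Nothing to flush in this in-memory sketch.
    }

    @Override public void startElement(String name, AttributeList atts) {
        out.append('<').append(name);
        for (int i = 0; i < atts.getLength(); i++) {
            out.append(' ').append(atts.getName(i))
               .append("=\"").append(escape(atts.getValue(i))).append('"');
        }
        out.append('>');
    }

    @Override public void endElement(String name) {
        out.append("</").append(name).append('>');
    }

    @Override public void characters(char[] ch, int start, int length) {
        out.append(escape(new String(ch, start, length)));
    }

    // Escapes the characters that would otherwise be read as markup.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    String result() { return out.toString(); }

    // The application plays the parser's role and fires the events itself.
    static String sample() {
        XmlEmitter emitter = new XmlEmitter();
        AttributeListImpl noAttributes = new AttributeListImpl();
        char[] title = "Professional XML".toCharArray();
        emitter.startDocument();
        emitter.startElement("books", noAttributes);
        emitter.startElement("book", noAttributes);
        emitter.characters(title, 0, title.length);
        emitter.endElement("book");
        emitter.endElement("books");
        emitter.endDocument();
        return emitter.result();
    }

    public static void main(String[] args) {
        System.out.println(sample());
    }
}
```

Because the emitter is an ordinary handler, the same object could equally be handed to a parser, in which case it would copy the elements, attributes, and text of the input to its output.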
SAX Is Not Supported in Current Browsers
Although there are many XML parsers that support the SAX
interface, at the time of writing there isn't a parser built into a mainstream
web browser that supports it. You can incorporate a SAX-compliant parser within
a Java applet, of course, but the overhead of downloading it from the server
may strain the patience of a user with a slow Internet connection. In practice,
your choice of interfaces for client-side XML programming is rather limited.
©1999 Wrox Press Limited, US and UK.