SAX 1.0: The Simple API for XML
(Reproduced with kind permission of Wrox Press: https://www.wrox.com)
Some SAX Design Patterns
Our example SAX applications have only been interested in
processing one or two different element types, and the processing has been very
simple. In real applications where there is a need to process many different
element types, this style of program can quickly become very unstructured. This
happens for two reasons: firstly, the interactions of different events
processing the same global context data can become difficult to disentangle,
and secondly, each of the event-handling methods is doing a number of quite
unrelated tasks.
So there is a need to think carefully about the design of a
SAX application to prevent this happening. This section presents some of the
possibilities. We'll look at two commonly used patterns: the filter pattern and
the rule-based pattern.
The Filter Design Pattern
In the filter design pattern, which is also sometimes called
the pipeline pattern, each stage of processing can be represented as a section
of a pipeline: the data flows through the pipe, and each section of the pipe
filters the data as it passes through. This is illustrated in the diagram
below:
There are many different things a filter can do, for
example:
- Remove elements of the source document that are not wanted
- Modify tags or attribute names
- Perform validation
- Normalize data values such as dates
The important characteristic of this design is that each
filter has an input and an output, both of which conform to the same interface.
The filter implements the interface at one end, and is a client of the same
interface at the other end. So if we consider any adjacent pair of filters, the
left-hand one acts as the Parser, the right-hand one as the DocumentHandler. And indeed, the filters in this
structure will generally implement both the SAX Parser
and DocumentHandler interfaces.
("Parser," of course, is a misnomer here. The characteristic of a SAX
Parser is not that it understands the lexical and syntactic rules of XML, but
that it notifies events to a DocumentHandler. Any program that performs such
notification can implement the SAX Parser interface, even though it doesn't do
any actual parsing.)
It is also possible for a filter to have more than one
output, notifying the events to more than one recipient, or less commonly, for
a filter to have more than one input, merging events from several sources.
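As a rough illustration of a filter with more than one output, the sketch below forwards every event to two recipients. It is only a sketch: TeeHandler, TagLog and TeeDemo are hypothetical names invented for this example, and only the DocumentHandler side of a filter is shown.

```java
import org.xml.sax.AttributeList;
import org.xml.sax.DocumentHandler;
import org.xml.sax.HandlerBase;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributeListImpl;

/** Hypothetical "tee" stage: every event it receives goes to two recipients. */
class TeeHandler implements DocumentHandler {
    private final DocumentHandler first;
    private final DocumentHandler second;

    TeeHandler(DocumentHandler first, DocumentHandler second) {
        this.first = first;
        this.second = second;
    }

    public void setDocumentLocator(Locator locator) {
        first.setDocumentLocator(locator);
        second.setDocumentLocator(locator);
    }
    public void startDocument() throws SAXException {
        first.startDocument();
        second.startDocument();
    }
    public void endDocument() throws SAXException {
        first.endDocument();
        second.endDocument();
    }
    public void startElement(String name, AttributeList atts) throws SAXException {
        first.startElement(name, atts);
        second.startElement(name, atts);
    }
    public void endElement(String name) throws SAXException {
        first.endElement(name);
        second.endElement(name);
    }
    public void characters(char[] ch, int start, int length) throws SAXException {
        first.characters(ch, start, length);
        second.characters(ch, start, length);
    }
    public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException {
        first.ignorableWhitespace(ch, start, length);
        second.ignorableWhitespace(ch, start, length);
    }
    public void processingInstruction(String target, String data) throws SAXException {
        first.processingInstruction(target, data);
        second.processingInstruction(target, data);
    }
}

/** Hypothetical recipient that records the tags it is told about. */
class TagLog extends HandlerBase {
    final StringBuilder log = new StringBuilder();

    @Override
    public void startElement(String name, AttributeList atts) {
        log.append('<').append(name).append('>');
    }
    @Override
    public void endElement(String name) {
        log.append("</").append(name).append('>');
    }
}

class TeeDemo {
    /** Sends one element through the tee and returns what each recipient saw. */
    static String run() {
        TagLog left = new TagLog();
        TagLog right = new TagLog();
        DocumentHandler tee = new TeeHandler(left, right);
        try {
            tee.startDocument();
            tee.startElement("book", new AttributeListImpl());
            tee.endElement("book");
            tee.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return left.log + " | " + right.log;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Both recipients see the same sequence of events, so either one could itself be the start of a further pipeline.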
The power of the filter design pattern is that the filters
are highly reusable, because just like real plumbing, the same standard filters
can be plugged together in many different ways.
The ParserFilter class
There are a number of tools around for constructing a
pipeline of this form. The simplest is John Cowan's ParserFilter
class, available from https://www.ccil.org/~cowan/XML/.
This is an abstract class: it does the things that every filter needs to do,
and leaves you to define a subclass for each specific filter needed in your own
pipeline.
As you might expect, ParserFilter
implements both the SAX Parser and DocumentHandler interfaces; in fact, for
good measure, it implements the other SAX event-handling interfaces as well (DTDHandler, ErrorHandler,
and EntityResolver). All that the
event-handling methods in this class do is to pass the event on to the next
filter in the pipeline: it's up to your subclass to override any methods that
need to do useful work.
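The division of labour between a pass-through base class and a subclass can be sketched as follows. This is not John Cowan's code: ForwardingFilter, UpperCaseTags, MarkupLog and UpperCaseDemo are hypothetical names, and the sketch covers only the DocumentHandler half of what ParserFilter does.

```java
import java.util.Locale;
import org.xml.sax.AttributeList;
import org.xml.sax.DocumentHandler;
import org.xml.sax.HandlerBase;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributeListImpl;

/** Hypothetical base class: each event is handed unchanged to the next stage. */
class ForwardingFilter implements DocumentHandler {
    private DocumentHandler next = new HandlerBase(); // does nothing until connected

    void setDocumentHandler(DocumentHandler handler) { next = handler; }

    public void setDocumentLocator(Locator locator) { next.setDocumentLocator(locator); }
    public void startDocument() throws SAXException { next.startDocument(); }
    public void endDocument() throws SAXException { next.endDocument(); }
    public void startElement(String name, AttributeList atts) throws SAXException {
        next.startElement(name, atts);
    }
    public void endElement(String name) throws SAXException { next.endElement(name); }
    public void characters(char[] ch, int start, int length) throws SAXException {
        next.characters(ch, start, length);
    }
    public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException {
        next.ignorableWhitespace(ch, start, length);
    }
    public void processingInstruction(String target, String data) throws SAXException {
        next.processingInstruction(target, data);
    }
}

/** Subclass that overrides only the methods where it has work to do. */
class UpperCaseTags extends ForwardingFilter {
    @Override
    public void startElement(String name, AttributeList atts) throws SAXException {
        super.startElement(name.toUpperCase(Locale.ROOT), atts);
    }
    @Override
    public void endElement(String name) throws SAXException {
        super.endElement(name.toUpperCase(Locale.ROOT));
    }
}

/** Hypothetical end of the pipeline: rebuilds a string from the events. */
class MarkupLog extends HandlerBase {
    final StringBuilder log = new StringBuilder();

    @Override
    public void startElement(String name, AttributeList atts) {
        log.append('<').append(name).append('>');
    }
    @Override
    public void endElement(String name) {
        log.append("</").append(name).append('>');
    }
    @Override
    public void characters(char[] ch, int start, int length) {
        log.append(ch, start, length);
    }
}

class UpperCaseDemo {
    /** Pushes one element through the filter and returns the rebuilt markup. */
    static String run() {
        MarkupLog out = new MarkupLog();
        UpperCaseTags filter = new UpperCaseTags();
        filter.setDocumentHandler(out);
        try {
            filter.startElement("title", new AttributeListImpl());
            filter.characters("SAX".toCharArray(), 0, 3);
            filter.endElement("title");
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.log.toString();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The character data passes through untouched because UpperCaseTags does not override characters(); only the tag names are changed.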
The ParserFilter class
has a constructor that takes a Parser as
its parameter: the effect is to create a piece of the pipeline and connect it
to another piece on its left. To construct our three-stage pipeline in the
diagram above, we could write:
ParserFilter pipeline = new Filter3(
                            new Filter2(
                                new Filter1(
                                    new com.jclark.xml.sax.Driver())));
pipeline.setDocumentHandler(outputHandler);
The initial input to the pipeline is of course a SAX Parser and the final output is a SAX DocumentHandler.
An Example ParserFilter: an Indenter
Here is a complete working example of a ParserFilter called Indenter.
This filter takes a stream of SAX events, and massages the data by adding
whitespace before start and end tags to make the nested structure of the
document visible on display. It then passes the massaged data to the next DocumentHandler (which might, of course, be
another filter).
The code should be self-explanatory. Note how it relies on
the methods in the superclass to actually send the events to the DocumentHandler:
import java.util.*;
import org.xml.sax.*;
import org.ccil.cowan.sax.ParserFilter;
/**
 * Indenter: This ParserFilter indents elements, by adding whitespace where appropriate.
 * The string used for indentation is fixed at four spaces.
 */
public class Indenter extends ParserFilter {
    private final static String indentChars = "    "; // indent by four spaces
    private int level = 0;                            // current indentation level
    private boolean sameline = false;                 // true if no newlines in element
    private StringBuffer buffer = new StringBuffer(); // buffer to hold character data
    /**
     * Constructor: supply the underlying parser used to feed input to this filter
     */
    public Indenter(Parser p) {
        super(p);
    }
    /**
     * Output an element start tag.
     */
    public void startElement(String tag, AttributeList atts) throws SAXException
    {
        flush();                       // clear out pending character data
        indent();                      // output whitespace to achieve indentation
        super.startElement(tag, atts); // output the start tag and attributes
        level++;                       // we're now one level deeper
        sameline = true;               // assume a single line of content
    }
    /**
     * Output element end tag
     */
    public void endElement(String tag) throws SAXException
    {
        flush();                       // clear out pending character data
        level--;                       // we've come out by one level
        if (!sameline) indent();       // output indentation if a new line was found
        super.endElement(tag);         // output the end tag
        sameline = false;              // next tag will be on a new line
    }
    /**
     * Output a processing instruction
     */
    public void processingInstruction(String target, String data) throws SAXException
    {
        flush();                       // clear out pending character data
        indent();                      // output whitespace for indentation
        super.processingInstruction(target, data); // output the processing instruction
    }
    /**
     * Output character data
     */
    public void characters(char[] chars, int start, int len) throws SAXException
    {
        buffer.append(chars, start, len); // add the character data to a buffer for now
    }
    /**
     * Output ignorable white space
     */
    public void ignorableWhitespace(char[] ch, int start, int len) throws SAXException
    {
        // ignore it
    }
    /**
     * Output white space to reflect the current indentation level
     */
    private void indent() throws SAXException
    {
        // construct an array holding a newline character
        // and the correct number of spaces
        int len = indentChars.length();
        char[] array = new char[level*len + 1];
        array[0] = '\n';
        for (int i=0; i<level; i++)
        {
            indentChars.getChars(0, len, array, len*i + 1);
        }
        // output this array as character data
        super.characters(array, 0, level*len+1);
    }
    /**
     * Flush the buffer containing accumulated character data.
     * White space adjacent to markup is trimmed.
     */
    public void flush() throws SAXException
    {
        // copy the buffer into a character array
        int end = buffer.length();
        if (end==0) return;
        char[] array = new char[end];
        buffer.getChars(0, end, array, 0);
        // trim whitespace from the start and end
        int start=0;
        while (start<end && Character.isWhitespace(array[start])) start++;
        while (start<end && Character.isWhitespace(array[end-1])) end--;
        // test to see if there is a newline in the buffer
        for (int i=start; i<end; i++)
        {
            if (array[i]=='\n') {
                sameline = false;
                break;
            }
        }
        // output the remaining character data
        super.characters(array, start, end-start);
        // clear the contents of the buffer
        buffer.setLength(0);
    }
}
To actually run this example, we will need a DocumentHandler
that outputs the XML. Let's suppose this exists and is called XMLOutputter (we'll show how XMLOutputter is written in the next
section). We can then write a main program as follows:
public static void main(String[] args) throws Exception
{
    Indenter app = new Indenter(ParserManager.makeParser());
    app.setDocumentHandler(new XMLOutputter());
    app.parse(args[0]);
}
And you will also have to add an import statement for the ParserManager class at the top of the file:
import java.util.*;
import org.xml.sax.*;
import com.icl.saxon.ParserManager;
import org.ccil.cowan.sax.ParserFilter;
We've made the program a bit more realistic by making the
input file an argument that you can specify on the command line (retrieved from
args[0]), and by creating the underlying SAX
Parser using the ParserManager
class that we introduced earlier. It's still not a production-quality program
(for example, it falls over if called without an input argument), but it's getting
closer. Once you have set up the classpath (remember that to use ParserManager, the file ParserManager.properties must also be on the
classpath), you can run this program from the command line, for example:
java Indenter file:///c:/data/books.xml
The output appears nicely indented. Because the argument is
a URL, you can format any XML file on the web.
The End of the Pipeline: Generating XML
Very often, as in the previous example, the final output of
the pipeline will be a new XML document. So you will often need a DocumentHandler that uses the events coming
out of the pipeline to generate an XML document: a sort of parser in reverse.
Surprisingly we couldn't find a DocumentHandler
on the web that does this, so we've written one and included it here.
Here is the class. It's reasonably straightforward, except
for the code that generates entity and character references for special
characters, which uses some of Java's less intuitive methods for manipulating
Strings and arrays.
import org.xml.sax.*;
import java.io.*;
/**
 * XMLOutputter is a DocumentHandler that uses the notified events to
 * reconstruct the XML document on the standard output
 */
public class XMLOutputter implements DocumentHandler
{
    private Writer writer = null;
    /**
     * Set Document Locator. Provided merely to satisfy the interface.
     */
    public void setDocumentLocator(Locator locator) {}
    /**
     * Start of the document. Make the writer and write the XML declaration.
     */
    public void startDocument () throws SAXException
    {
        try
        {
            writer = new BufferedWriter(new PrintWriter(System.out));
            writer.write("<?xml version='1.0' ?>\n");
        }
        catch (java.io.IOException err)
        {
            throw new SAXException(err);
        }
    }
    /**
     * End of the document. Close the output stream.
     */
    public void endDocument () throws SAXException
    {
        try
        {
            writer.close();
        }
        catch (java.io.IOException err)
        {
            throw new SAXException(err);
        }
    }
    /**
     * Start of an element. Output the start tag, escaping special characters.
     */
    public void startElement (String name, AttributeList attributes)
        throws SAXException
    {
        try
        {
            writer.write("<");
            writer.write(name);
            // output the attributes
            for (int i=0; i<attributes.getLength(); i++)
            {
                writer.write(" ");
                writeAttribute(attributes.getName(i), attributes.getValue(i));
            }
            writer.write(">");
        }
        catch (java.io.IOException err)
        {
            throw new SAXException(err);
        }
    }
    /**
     * Write attribute name=value pair
     */
    protected void writeAttribute(String attname, String value)
        throws SAXException
    {
        try
        {
            writer.write(attname);
            writer.write("='");
            char[] attval = value.toCharArray();
            char[] attesc = new char[value.length()*8]; // worst case scenario
            int newlen = escape(attval, 0, value.length(), attesc);
            writer.write(attesc, 0, newlen);
            writer.write("'");
        }
        catch (java.io.IOException err)
        {
            throw new SAXException(err);
        }
    }
    /**
     * End of an element. Output the end tag.
     */
    public void endElement (String name) throws SAXException
    {
        try
        {
            writer.write("</" + name + ">");
        }
        catch (java.io.IOException err)
        {
            throw new SAXException(err);
        }
    }
    /**
     * Character data.
     */
    public void characters (char[] ch, int start, int length) throws SAXException
    {
        try
        {
            char[] dest = new char[length*8];
            int newlen = escape(ch, start, length, dest);
            writer.write(dest, 0, newlen);
        }
        catch (java.io.IOException err)
        {
            throw new SAXException(err);
        }
    }
    /**
     * Ignorable whitespace: treat it as characters
     */
    public void ignorableWhitespace(char[] ch, int start, int length)
        throws SAXException
    {
        characters(ch, start, length);
    }
    /**
     * Handle a processing instruction.
     */
    public void processingInstruction (String target, String data)
        throws SAXException
    {
        try
        {
            writer.write("<?" + target + ' ' + data + "?>");
        }
        catch (java.io.IOException err)
        {
            throw new SAXException(err);
        }
    }
    /**
     * Escape special characters for display.
     * @param ch The character array containing the string
     * @param start The start position of the input string within the character array
     * @param length The length of the input string within the character array
     * @param out Character array to receive the output. In the worst case, this
     * should be 8 times the length of the input array.
     * @return The number of characters used in the output array
     */
    private int escape(char ch[], int start, int length, char[] out)
    {
        int o = 0;
        for (int i = start; i < start+length; i++)
        {
            if (ch[i]=='<')
            {
                ("&lt;").getChars(0, 4, out, o); o+=4;
            }
            else if (ch[i]=='>')
            {
                ("&gt;").getChars(0, 4, out, o); o+=4;
            }
            else if (ch[i]=='&')
            {
                ("&amp;").getChars(0, 5, out, o); o+=5;
            }
            else if (ch[i]=='\"')
            {
                ("&#34;").getChars(0, 5, out, o); o+=5;
            }
            else if (ch[i]=='\'')
            {
                ("&#39;").getChars(0, 5, out, o); o+=5;
            }
            else if (ch[i]<127)
            {
                out[o++]=ch[i];
            }
            else
            {
                // output character reference
                out[o++]='&';
                out[o++]='#';
                String code = Integer.toString(ch[i]);
                int len = code.length();
                code.getChars(0, len, out, o); o+=len;
                out[o++]=';';
            }
        }
        return o;
    }
}
Now you can see how SAX can be used to write XML documents
as well as reading them. In fact, you can run SAX back-to-front: instead of the
Parser being standard software that someone else writes, and the
DocumentHandler being your specific application code, you can write an
implementation of org.xml.sax.Parser
that contains your application logic for generating XML, and couple it to this
off-the-shelf DocumentHandler for writing XML
output!
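To make the back-to-front idea concrete, here is a minimal sketch of a Parser that does no parsing at all: it reports a fixed list of titles as SAX events. TitleListParser, EventWriter and GeneratorDemo are hypothetical names, and EventWriter is a simplified stand-in for XMLOutputter that collects its output in a string and does no escaping.

```java
import java.util.Locale;
import org.xml.sax.AttributeList;
import org.xml.sax.DTDHandler;
import org.xml.sax.DocumentHandler;
import org.xml.sax.EntityResolver;
import org.xml.sax.ErrorHandler;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributeListImpl;

/** Hypothetical Parser whose "document" is application data, not XML text. */
class TitleListParser implements Parser {
    private final String[] titles;
    private DocumentHandler handler = new HandlerBase();

    TitleListParser(String... titles) { this.titles = titles; }

    public void setDocumentHandler(DocumentHandler handler) { this.handler = handler; }

    // The remaining setters are required by the interface but unused here.
    public void setLocale(Locale locale) {}
    public void setEntityResolver(EntityResolver resolver) {}
    public void setDTDHandler(DTDHandler dtdHandler) {}
    public void setErrorHandler(ErrorHandler errorHandler) {}

    // Both parse() variants ignore their argument and simply emit the events.
    public void parse(InputSource source) throws SAXException { generate(); }
    public void parse(String systemId) throws SAXException { generate(); }

    private void generate() throws SAXException {
        AttributeListImpl none = new AttributeListImpl();
        handler.startDocument();
        handler.startElement("books", none);
        for (String title : titles) {
            handler.startElement("title", none);
            handler.characters(title.toCharArray(), 0, title.length());
            handler.endElement("title");
        }
        handler.endElement("books");
        handler.endDocument();
    }
}

/** Simplified stand-in for XMLOutputter: writes markup into a string. */
class EventWriter extends HandlerBase {
    final StringBuilder out = new StringBuilder();

    @Override
    public void startElement(String name, AttributeList atts) {
        out.append('<').append(name).append('>');
    }
    @Override
    public void endElement(String name) {
        out.append("</").append(name).append('>');
    }
    @Override
    public void characters(char[] ch, int start, int length) {
        out.append(ch, start, length);
    }
}

class GeneratorDemo {
    /** Couples the generating Parser to the writer and returns the markup. */
    static String run() {
        TitleListParser parser = new TitleListParser("SAX", "DOM");
        EventWriter writer = new EventWriter();
        parser.setDocumentHandler(writer);
        try {
            parser.parse("ignored");
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return writer.out.toString();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Because TitleListParser implements org.xml.sax.Parser, it could equally be placed at the head of a filter pipeline in place of a real XML parser.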
Other Useful ParserFilters
NamespaceFilter
This ParserFilter
implements the XML Namespaces recommendation, described in Chapter 7. It is
available from John Cowan's web site at https://www.ccil.org/~cowan/XML/.
SAX was defined before the XML Namespaces recommendation was
published, and takes no account of it. If an element name is written in the
source document as <html:table>,
then the element name passed to the startElement()
method will be "html:table". There is no
simple way for the application to determine which namespace "html" is referring to.
The NamespaceFilter
solves this problem. It keeps track of all the namespace declarations in the
document (that is, the "xmlns:xxx"
attributes), and when a prefixed element or attribute name is reported by the
SAX parser, it substitutes the full namespace URI for the prefix before passing
it on down the pipeline. For example, if the element start tag is <html:table xmlns:html="https://www.w3.org/TR/REC-html40">
then the element name passed on to the next DocumentHandler will be "https://www.w3.org/TR/REC-html40^table".
The circumflex character was chosen to separate the namespace URI from the
local part of the element name because it's a character that can't appear in
URIs or in XML names.
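The prefix substitution described above might be sketched as follows. This is an illustration of the technique only, not the NamespaceFilter source: PrefixExpander, NameLog and NamespaceDemo are hypothetical names, and the sketch handles prefixed element names only, ignoring attribute names and the default namespace.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.xml.sax.AttributeList;
import org.xml.sax.DocumentHandler;
import org.xml.sax.HandlerBase;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributeListImpl;

/** Hypothetical filter: rewrites "prefix:local" element names as "URI^local". */
class PrefixExpander extends HandlerBase {
    private final DocumentHandler next;
    // One prefix-to-URI map per open element; the top map is the current scope.
    private final Deque<Map<String, String>> scopes = new ArrayDeque<>();

    PrefixExpander(DocumentHandler next) {
        this.next = next;
        scopes.push(new HashMap<String, String>());
    }

    @Override
    public void startElement(String name, AttributeList atts) throws SAXException {
        // Start from the enclosing scope, then add this element's declarations.
        Map<String, String> scope = new HashMap<>(scopes.peek());
        for (int i = 0; i < atts.getLength(); i++) {
            String attName = atts.getName(i);
            if (attName.startsWith("xmlns:")) {
                scope.put(attName.substring("xmlns:".length()), atts.getValue(i));
            }
        }
        scopes.push(scope);
        next.startElement(expand(name), atts);
    }

    @Override
    public void endElement(String name) throws SAXException {
        next.endElement(expand(name));
        scopes.pop(); // declarations made on this element go out of scope
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        next.characters(ch, start, length);
    }

    /** Replaces a declared prefix with its URI; other names pass unchanged. */
    private String expand(String name) {
        int colon = name.indexOf(':');
        if (colon < 0) return name;
        String uri = scopes.peek().get(name.substring(0, colon));
        return uri == null ? name : uri + "^" + name.substring(colon + 1);
    }
}

/** Hypothetical recipient that records the element names it receives. */
class NameLog extends HandlerBase {
    final List<String> names = new ArrayList<>();

    @Override
    public void startElement(String name, AttributeList atts) {
        names.add(name);
    }
}

class NamespaceDemo {
    /** Feeds a small event sequence through the filter and joins the names seen. */
    static String run() {
        NameLog log = new NameLog();
        PrefixExpander filter = new PrefixExpander(log);
        AttributeListImpl decl = new AttributeListImpl();
        decl.addAttribute("xmlns:html", "CDATA", "https://www.w3.org/TR/REC-html40");
        AttributeListImpl none = new AttributeListImpl();
        try {
            filter.startElement("html:table", decl);
            filter.startElement("html:tr", none);
            filter.endElement("html:tr");
            filter.endElement("html:table");
            filter.startElement("html:note", none); // prefix no longer in scope
            filter.endElement("html:note");
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return String.join(" ", log.names);
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Keeping one map per open element is a simple way to respect scoping: a declaration applies to the element that carries it and to its descendants, and disappears when that element ends.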
Sometimes applications want to know the prefix as well as
the namespace URI (for example, for use in error messages). NamespaceFilter doesn't provide this
information, but it could easily be extended to do so.
InheritanceFilter
This is also available from John Cowan's web site at https://www.ccil.org/~cowan/XML/.
Many XML document designs use the concept of an inheritable attribute.
The idea is that if a particular attribute is not present on an element, the
value is taken from the same attribute on a containing element. The XML
standard itself uses this idea for the special attributes xml:lang and xml:space,
and it is extensively used in some other standards such as the XSL Formatting
Objects proposal.
InheritanceFilter
is a ParserFilter that
extends the attribute list passed to the startElement()
method to include attributes that were not actually present on that element,
but were inherited from parent elements. The InheritanceFilter needs to be primed with a
list of attribute names that are to be treated as inherited attributes.
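One way such a filter could work is sketched below. It is not the InheritanceFilter source: AttributeInheritor, LangLog and InheritanceDemo are hypothetical names, and the choice of CDATA as the type of an inherited attribute is an assumption made for the example.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import org.xml.sax.AttributeList;
import org.xml.sax.DocumentHandler;
import org.xml.sax.HandlerBase;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributeListImpl;

/** Hypothetical filter: adds inherited attributes to each element's list. */
class AttributeInheritor extends HandlerBase {
    private final DocumentHandler next;
    private final Set<String> inheritable;
    // One map of inherited values per open element; the top is the current one.
    private final Deque<Map<String, String>> inherited = new ArrayDeque<>();

    /** The filter is primed with the names of the attributes to inherit. */
    AttributeInheritor(DocumentHandler next, String... names) {
        this.next = next;
        this.inheritable = new LinkedHashSet<>(Arrays.asList(names));
        inherited.push(new HashMap<String, String>());
    }

    @Override
    public void startElement(String name, AttributeList atts) throws SAXException {
        Map<String, String> values = new HashMap<>(inherited.peek());
        AttributeListImpl extended = new AttributeListImpl(atts); // copy, then extend
        for (String attName : inheritable) {
            String own = atts.getValue(attName);
            if (own != null) {
                values.put(attName, own);            // element sets its own value
            } else if (values.containsKey(attName)) {
                extended.addAttribute(attName, "CDATA", values.get(attName));
            }
        }
        inherited.push(values);
        next.startElement(name, extended);
    }

    @Override
    public void endElement(String name) throws SAXException {
        inherited.pop();
        next.endElement(name);
    }
}

/** Hypothetical recipient: records each element with its xml:lang value. */
class LangLog extends HandlerBase {
    final StringBuilder log = new StringBuilder();

    @Override
    public void startElement(String name, AttributeList atts) {
        if (log.length() > 0) log.append(' ');
        log.append(name).append('[').append(atts.getValue("xml:lang")).append(']');
    }
}

class InheritanceDemo {
    private static AttributeListImpl lang(String value) {
        AttributeListImpl atts = new AttributeListImpl();
        if (value != null) atts.addAttribute("xml:lang", "CDATA", value);
        return atts;
    }

    /** Nests four elements, two of which set xml:lang themselves. */
    static String run() {
        LangLog log = new LangLog();
        AttributeInheritor filter = new AttributeInheritor(log, "xml:lang");
        try {
            filter.startElement("doc", lang("en"));
            filter.startElement("para", lang(null));
            filter.startElement("quote", lang("fr"));
            filter.startElement("item", lang(null));
            filter.endElement("item");
            filter.endElement("quote");
            filter.endElement("para");
            filter.endElement("doc");
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return log.log.toString();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Later stages in the pipeline can then read an inherited attribute straight from the attribute list, without tracking ancestors themselves.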
XLinkFilter
This ParserFilter provides support for the draft XLink
specification for creating hyperlinks between XML documents. It is published by
Simon St. Laurent at https://www.simonstl.com/projects/xlinkfilter/.
Unlike most ParserFilters, an XLinkFilter
passes all the events through unchanged. While doing so, however, it constructs
a data structure reflecting the XLink attributes encountered in the document.
This data structure can then be interrogated by subsequent stages in the
pipeline.
One kind of link defined in the XLink specification is a
so-called "inclusion" link where the linked text is designed to
appear inline within the main document – rather like a preprocessor #include directive in C. The XLink syntax
for this is show="parsed". This is very
similar to an external entity reference, except that the application has some
control over the decision whether and when to include the text: for example,
the user might have a choice to display the long or short forms of a document.
It would be quite possible, of course, to implement a filter that expanded such
links directly, presenting an included document to subsequent pipeline stages
as if it were physically embedded in the original document.
Pipelines with Shared Context
One potential difficulty with a pipeline is that each filter
in the pipeline has to work out for itself things that other filters already
know; a common example is knowing the parent of the current element. If one
filter is already maintaining a stack of elements so that it can determine
this, it is wasteful for another filter to do the same thing.
You can get round this by allowing one filter to access data
structures set up by a previous filter, either directly or via public methods.
However, this requires that the filters in the pipeline know rather more about
each other than the pure pipeline model suggests, which reduces your ability to
plug filters together in any order. Arguably, when processing reaches this
level of complexity, it might be better to forget event-based processing
entirely and use the DOM (with a navigational design pattern) instead.
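The shared-context arrangement described in this section can be sketched as below: one stage maintains the stack of open elements and exposes it through a public method, and a later stage queries it instead of keeping its own stack. ElementTracker, ParentReporter and SharedContextDemo are hypothetical names invented for this sketch.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import org.xml.sax.AttributeList;
import org.xml.sax.DocumentHandler;
import org.xml.sax.HandlerBase;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributeListImpl;

/** Hypothetical stage that tracks the open elements on behalf of later stages. */
class ElementTracker extends HandlerBase {
    private final Deque<String> open = new ArrayDeque<>();
    private DocumentHandler next = new HandlerBase();

    void setDocumentHandler(DocumentHandler handler) { next = handler; }

    /** Name of the parent of the current element, or null at the root. */
    String getParent() {
        Iterator<String> it = open.iterator(); // most recently opened first
        if (it.hasNext()) it.next();           // skip the current element
        return it.hasNext() ? it.next() : null;
    }

    @Override
    public void startElement(String name, AttributeList atts) throws SAXException {
        open.push(name);                       // update context before passing on
        next.startElement(name, atts);
    }

    @Override
    public void endElement(String name) throws SAXException {
        next.endElement(name);
        open.pop();
    }
}

/** Later stage: relies on the tracker rather than maintaining its own stack. */
class ParentReporter extends HandlerBase {
    private final ElementTracker tracker;
    final StringBuilder report = new StringBuilder();

    ParentReporter(ElementTracker tracker) { this.tracker = tracker; }

    @Override
    public void startElement(String name, AttributeList atts) {
        report.append(name).append(" in ").append(tracker.getParent()).append("; ");
    }
}

class SharedContextDemo {
    /** Sends three nested elements down the two-stage pipeline. */
    static String run() {
        ElementTracker tracker = new ElementTracker();
        ParentReporter reporter = new ParentReporter(tracker);
        tracker.setDocumentHandler(reporter);
        AttributeListImpl none = new AttributeListImpl();
        try {
            tracker.startElement("books", none);
            tracker.startElement("book", none);
            tracker.startElement("title", none);
            tracker.endElement("title");
            tracker.endElement("book");
            tracker.endElement("books");
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return reporter.report.toString();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The cost is visible in the constructor of ParentReporter: it can only be used in a pipeline that has an ElementTracker upstream, which is exactly the loss of plug-compatibility noted above.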
©1999 Wrox Press Limited, US and UK.