IDevResource.com - XML Channel - Professional XML by Wrox Press


SAX 1.0: The Simple API for XML

(Reproduced with kind permission of Wrox Press: https://www.wrox.com)


Some SAX Design Patterns

Our example SAX applications have only been interested in processing one or two different element types, and the processing has been very simple. In real applications where there is a need to process many different element types, this style of program can quickly become very unstructured. This happens for two reasons: firstly, the interactions of different events processing the same global context data can become difficult to disentangle, and secondly, each of the event-handling methods is doing a number of quite unrelated tasks.

So there is a need to think carefully about the design of a SAX application to prevent this happening. This section presents some of the possibilities. We'll look at two commonly used patterns: the filter pattern and the rule-based pattern.

The Filter Design Pattern

In the filter design pattern, which is also sometimes called the pipeline pattern, each stage of processing can be represented as a section of a pipeline: the data flows through the pipe, and each section of the pipe filters the data as it passes through.

[Diagram not reproduced: a three-stage pipeline in which events flow from a Parser through three filters to a DocumentHandler.]

There are many different things a filter can do, for example:

Remove elements of the source document that are not wanted
Modify tags or attribute names
Perform validation
Normalize data values such as dates

The important characteristic of this design is that each filter has an input and an output, both of which conform to the same interface. The filter implements the interface at one end, and is a client of the same interface at the other end. So if we consider any adjacent pair of filters, the left-hand one acts as the Parser, the right-hand one as the DocumentHandler. And indeed, the filters in this structure will generally implement both the SAX Parser and DocumentHandler interfaces. ("Parser," of course, is a misnomer here. The characteristic of a SAX Parser is not that it understands the lexical and syntactic rules of XML, but that it notifies events to a DocumentHandler. Any program that performs such notification can implement the SAX Parser interface, even though it doesn't do any actual parsing.)

It is also possible for a filter to have more than one output, notifying the events to more than one recipient, or less commonly, for a filter to have more than one input, merging events from several sources.
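As a rough illustration of a filter with two outputs, here is a minimal sketch that repeats each event to two recipients. The class names TeeHandler and EventLog are our own inventions for this example, not part of SAX or of any library mentioned here, and only the DocumentHandler side of the filter is shown.

```java
import org.xml.sax.AttributeList;
import org.xml.sax.DocumentHandler;
import org.xml.sax.HandlerBase;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributeListImpl;

/** Hypothetical fan-out stage: repeats every event to two recipients. */
class TeeHandler implements DocumentHandler {
    private final DocumentHandler first;
    private final DocumentHandler second;

    TeeHandler(DocumentHandler first, DocumentHandler second) {
        this.first = first;
        this.second = second;
    }

    public void setDocumentLocator(Locator locator) {
        first.setDocumentLocator(locator);
        second.setDocumentLocator(locator);
    }

    public void startDocument() throws SAXException {
        first.startDocument();
        second.startDocument();
    }

    public void endDocument() throws SAXException {
        first.endDocument();
        second.endDocument();
    }

    public void startElement(String name, AttributeList atts) throws SAXException {
        first.startElement(name, atts);
        second.startElement(name, atts);
    }

    public void endElement(String name) throws SAXException {
        first.endElement(name);
        second.endElement(name);
    }

    public void characters(char[] ch, int start, int length) throws SAXException {
        first.characters(ch, start, length);
        second.characters(ch, start, length);
    }

    public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException {
        first.ignorableWhitespace(ch, start, length);
        second.ignorableWhitespace(ch, start, length);
    }

    public void processingInstruction(String target, String data) throws SAXException {
        first.processingInstruction(target, data);
        second.processingInstruction(target, data);
    }

    /** Feeds a tiny event stream through the tee and returns what each side saw. */
    static String demo() {
        EventLog left = new EventLog();
        EventLog right = new EventLog();
        TeeHandler tee = new TeeHandler(left, right);
        try {
            tee.startElement("a", new AttributeListImpl());
            tee.characters("hi".toCharArray(), 0, 2);
            tee.endElement("a");
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return left.log + " | " + right.log;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}

/** Hypothetical recipient that records element and text events as a string. */
class EventLog extends HandlerBase {
    final StringBuilder log = new StringBuilder();

    @Override
    public void startElement(String name, AttributeList atts) {
        log.append('<').append(name).append('>');
    }

    @Override
    public void endElement(String name) {
        log.append("</").append(name).append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        log.append(ch, start, length);
    }
}
```

Either recipient could itself be another filter, so a tee like this can split one pipeline into two branches.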

The power of the filter design pattern is that the filters are highly reusable, because just like real plumbing, the same standard filters can be plugged together in many different ways.

The ParserFilter class

There are a number of tools around for constructing a pipeline of this form. The simplest is John Cowan's ParserFilter class, available from https://www.ccil.org/~cowan/XML/. This is an abstract class: it does the things that every filter needs to do, and leaves you to define a subclass for each specific filter needed in your own pipeline.

As you might expect, ParserFilter implements both the SAX Parser and DocumentHandler interfaces; in fact, for good measure, it implements the other SAX event-handling interfaces as well (DTDHandler, ErrorHandler, and EntityResolver). All that the event-handling methods in this class do is to pass the event on to the next filter in the pipeline: it's up to your subclass to override any methods that need to do useful work.
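To make the idea concrete, here is a minimal sketch of the same arrangement. It is not John Cowan's code: ForwardingHandler, UpperCaseTags, and TagRecorder are hypothetical names of our own, and only the DocumentHandler side is shown (the real ParserFilter also implements Parser and the other handler interfaces). The base class passes every event on unchanged, and the subclass overrides just the two methods it needs.

```java
import java.util.Locale;
import org.xml.sax.AttributeList;
import org.xml.sax.DocumentHandler;
import org.xml.sax.HandlerBase;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributeListImpl;

/** Hypothetical filter: overrides only the two events it cares about. */
class UpperCaseTags extends ForwardingHandler {
    UpperCaseTags(DocumentHandler next) {
        super(next);
    }

    @Override
    public void startElement(String name, AttributeList atts) throws SAXException {
        super.startElement(name.toUpperCase(Locale.ROOT), atts);  // modify the tag, keep the attributes
    }

    @Override
    public void endElement(String name) throws SAXException {
        super.endElement(name.toUpperCase(Locale.ROOT));
    }

    /** Pushes three events through the filter and returns what came out. */
    static String demo() {
        TagRecorder recorder = new TagRecorder();
        DocumentHandler filter = new UpperCaseTags(recorder);
        try {
            filter.startElement("book", new AttributeListImpl());
            filter.characters("SAX".toCharArray(), 0, 3);
            filter.endElement("book");
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return recorder.log.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}

/** Hypothetical base class: every event is passed on unchanged to the next stage. */
class ForwardingHandler implements DocumentHandler {
    private final DocumentHandler next;

    ForwardingHandler(DocumentHandler next) {
        this.next = next;
    }

    public void setDocumentLocator(Locator locator) { next.setDocumentLocator(locator); }
    public void startDocument() throws SAXException { next.startDocument(); }
    public void endDocument() throws SAXException { next.endDocument(); }
    public void startElement(String name, AttributeList atts) throws SAXException {
        next.startElement(name, atts);
    }
    public void endElement(String name) throws SAXException { next.endElement(name); }
    public void characters(char[] ch, int start, int length) throws SAXException {
        next.characters(ch, start, length);
    }
    public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException {
        next.ignorableWhitespace(ch, start, length);
    }
    public void processingInstruction(String target, String data) throws SAXException {
        next.processingInstruction(target, data);
    }
}

/** Hypothetical end of the pipeline: records tags and text as a string. */
class TagRecorder extends HandlerBase {
    final StringBuilder log = new StringBuilder();

    @Override
    public void startElement(String name, AttributeList atts) {
        log.append('<').append(name).append('>');
    }

    @Override
    public void endElement(String name) {
        log.append("</").append(name).append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        log.append(ch, start, length);
    }
}
```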

The ParserFilter class has a constructor that takes a Parser as its parameter: the effect is to create a piece of the pipeline and connect it to another piece on its left. To construct our three-stage pipeline in the diagram above, we could write:

ParserFilter pipeline = new Filter3(
                  new Filter2 (
                     new Filter1 (
                        new com.jclark.xml.sax.Driver())));
pipeline.setDocumentHandler(outputHandler);

The initial input to the pipeline is of course a SAX Parser and the final output is a SAX DocumentHandler.

An Example ParserFilter: an Indenter

Here is a complete working example of a ParserFilter called Indenter. This filter takes a stream of SAX events, and massages the data by adding whitespace before start and end tags to make the nested structure of the document visible on display. It then passes the massaged data to the next DocumentHandler (which might, of course, be another filter).

The code should be self-explanatory. Note how it relies on the methods in the superclass to actually send the events to the DocumentHandler:

import java.util.*;
import org.xml.sax.*;
import org.ccil.cowan.sax.ParserFilter;

/**
* Indenter: This ParserFilter indents elements, by adding whitespace where appropriate.
* The string used for indentation is fixed at four spaces.
*/


public class Indenter extends ParserFilter {
    
    private final static String indentChars = "    ";  // indent by four spaces
    private int level = 0;                             // current indentation level
    private boolean sameline = false;                  // true if no newlines in element
    private StringBuffer buffer = new StringBuffer();  // buffer to hold character data

    /**
    * Constructor: supply the underlying parser used to feed input to this filter
    */

    public Indenter(Parser p) {
        super(p);
    }

    /**
    * Output an element start tag.
    */

    public void startElement(String tag, AttributeList atts) throws SAXException
    {
        flush();                     // clear out pending character data
        indent();                    // output whitespace to achieve indentation
        super.startElement(tag, atts);  // output the start tag and attributes
        level++;                      // we're now one level deeper
        sameline = true;              // assume a single line of content
    }

    /**
    * Output element end tag
    */
    
    public void endElement(String tag) throws SAXException 
    {
        flush();                        // clear out pending character data
        level--;                        // we've come out by one level
        if (!sameline) indent();  // output indentation if a new line was found
        super.endElement(tag);          // output the end tag
        sameline = false;               // next tag will be on a new line
    }

    /**
    * Output a processing instruction
    */

    public void processingInstruction(String target, String data) throws 
                                                            SAXException 
    {
        flush();                        // clear out pending character data
        indent();                       // output whitespace for indentation
        super.processingInstruction(    // output the processing instruction
                           target, data);
    }

    /**
    * Output character data
    */

    public void characters(char[] chars, int start, int len) throws SAXException 
    {
        buffer.append(chars,       // add the character data to a buffer for now
            start, len);
    }

    /**
    * Output ignorable white space
    */

    public void ignorableWhitespace(char[] ch, int start, int len) throws 
                                                             SAXException 
    {
      // ignore it
    }

    /**
    * Output white space to reflect the current indentation level
    */

    private void indent() throws SAXException 
    {
                                // construct an array holding a newline 
                                //character 
                                // and the correct number of spaces
        int len = indentChars.length();
        char[] array = new char[level*len + 1];
        array[0] = '\n';
        for (int i=0; i<level; i++) 
        {
            indentChars.getChars(0, len, array, len*i + 1); 
        }
                                // output this array as character data
        super.characters(array, 0, level*len+1);
    }

    /**
    * Flush the buffer containing accumulated character data.
    * White space adjacent to markup is trimmed.
    */

    public void flush() throws SAXException 
    {
                                // copy the buffer into a character array
        int end = buffer.length();
        if (end==0) return;
        char[] array = new char[end];
        buffer.getChars(0, end, array, 0);
                                // trim whitespace from the start and end
        int start=0;
        while (start<end && Character.isWhitespace(array[start])) start++;
        while (start<end && Character.isWhitespace(array[end-1])) end--;
                                // test to see if there is a newline in the buffer
        for (int i=start; i<end; i++) 
        {
            if (array[i]=='\n') {
                sameline = false;
                break;
            }
        }
                                // output the remaining character data
        super.characters(array, start, end-start);
                                // clear the contents of the buffer
        buffer.setLength(0);
    }

}

To actually run this example, we will need a DocumentHandler that outputs the XML. Let's suppose this exists and is called XMLOutputter (we'll show how XMLOutputter is written in the next section). We can then write a main program as follows:

public static void main(String[] args) throws Exception
{
    Indenter app = new Indenter(ParserManager.makeParser());
    app.setDocumentHandler(new XMLOutputter());
    app.parse(args[0]);
}

And you will also have to add an import statement for the ParserManager class at the top of the file:

import java.util.*;
import org.xml.sax.*; 
import com.icl.saxon.ParserManager;
import org.ccil.cowan.sax.ParserFilter;

We've made the program a bit more realistic by making the input file an argument that you can specify on the command line (retrieved from args[0]), and by creating the underlying SAX Parser using the ParserManager class that we introduced earlier. It's still not a production-quality program (for example, it falls over if called without an input argument), but it's getting closer. Once you have set up the classpath (remember that to use ParserManager, the file ParserManager.properties must also be on the classpath), you can run this program from the command line, for example:
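One possible way to handle a missing argument is sketched below. IndenterArgs and its systemId() method are hypothetical helpers of our own; the sketch shows only the argument check, with the call to the Indenter left as a comment.

```java
/** Hypothetical helper: validates the command line before any parsing starts. */
class IndenterArgs {

    /** Returns the URL to parse, or null if the command line is not usable. */
    static String systemId(String[] args) {
        if (args == null || args.length != 1 || args[0].trim().isEmpty()) {
            return null;
        }
        return args[0].trim();
    }

    public static void main(String[] args) {
        String id = systemId(args);
        if (id == null) {
            System.err.println("Usage: java Indenter <url-of-xml-document>");
            System.exit(2);
        }
        // In the real program this is where app.parse(id) would be called.
        System.out.println("Would parse: " + id);
    }
}
```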

java Indenter file:///c:/data/books.xml

The output appears nicely indented. Because the argument is a URL, you can format any XML file on the web.

The End of the Pipeline: Generating XML

Very often, as in the previous example, the final output of the pipeline will be a new XML document. So you will often need a DocumentHandler that uses the events coming out of the pipeline to generate an XML document: a sort of parser in reverse.

Surprisingly we couldn't find a DocumentHandler on the web that does this, so we've written one and included it here.

Here is the class. It's reasonably straightforward, except for the code that generates entity and character references for special characters, which uses some of Java's less intuitive methods for manipulating Strings and arrays.

import org.xml.sax.*;
import java.io.*;

/**
  * XMLOutputter is a DocumentHandler that uses the notified events to
  * reconstruct the XML document on the standard output
  */
  
public class XMLOutputter implements DocumentHandler 
{

    private Writer writer = null;

    /**
    * Set Document Locator. Provided merely to satisfy the interface.
    */

    public void setDocumentLocator(Locator locator) {}
    

    /**
    * Start of the document. Make the writer and write the XML declaration.
    */
    
    public void startDocument () throws SAXException 
    {
        try 
        {
            writer = new BufferedWriter(new PrintWriter(System.out));
            writer.write("<?xml version='1.0' ?>\n");
        } 
        catch (java.io.IOException err) 
        {
            throw new SAXException(err);
        }
    }

    /**
    * End of the document. Close the output stream.
    */
    
    public void endDocument () throws SAXException 
    {
        try 
        {
            writer.close();
        } 
        catch (java.io.IOException err) 
        {
            throw new SAXException(err);
        }            
    }

    /**
    * Start of an element. Output the start tag, escaping special characters.
    */
    
    public void startElement (String name, AttributeList attributes)
                                                 throws SAXException 
    {
        try 
        {
            writer.write("<");
            writer.write(name);

            // output the attributes

            for (int i=0; i<attributes.getLength(); i++) 
            {
                writer.write(" ");
                writeAttribute(attributes.getName(i), attributes.getValue(i));
            }
            writer.write(">");
        } 
        catch (java.io.IOException err) 
        {
            throw new SAXException(err);
        }            
    }

    /**
    * Write attribute name=value pair
    */

    protected void writeAttribute(String attname, String value) throws 
                                                          SAXException 
    {
        try 
        {
            writer.write(attname);
            writer.write("='");
            char[] attval = value.toCharArray();
            char[] attesc = new char[value.length()*8];  // worst case scenario
            int newlen = escape(attval, 0, value.length(), attesc);
            writer.write(attesc, 0, newlen);
            writer.write("'");        
        } 
        catch (java.io.IOException err) 
        {
            throw new SAXException(err);
        }
    }

    /**
    * End of an element. Output the end tag.
    */

    public void endElement (String name) throws SAXException 
    {
        try 
        {
            writer.write("</" + name + ">");
        } 
        catch (java.io.IOException err) 
        {
            throw new SAXException(err);
        }        
    }

    /**
    * Character data.
    */
    
    public void characters (char[] ch, int start, int length) throws SAXException
    {
        try 
        {
            char[] dest = new char[length*8];
            int newlen = escape(ch, start, length, dest);
            writer.write(dest, 0, newlen);
        } 
        catch (java.io.IOException err) 
        {
            throw new SAXException(err);
        }
    }

    /**
    * Ignorable whitespace: treat it as characters
    */

    public void ignorableWhitespace(char[] ch, int start, int length)
    throws SAXException 
    {
        characters(ch, start, length);
    }

    /**
    * Handle a processing instruction.
    */
    
    public void processingInstruction (String target, String data)
                                               throws SAXException 
    {
        try 
        {
            writer.write("<?" + target + ' ' + data + "?>");
        } 
        catch (java.io.IOException err) 
        {
            throw new SAXException(err);
        }
    }

    /**
    * Escape special characters for display.
    * @param ch The character array containing the string
    * @param start The start position of the input string within the character 
    *              array
    * @param length The length of the input string within the character array
    * @param out Character array to receive the output. In the worst case, 
    *            this should be 8 times the length of the input array.
    * @return The number of characters used in the output array
    */
    
    private int escape(char ch[], int start, int length, char[] out) 
    {        
        int o = 0;
        for (int i = start; i < start+length; i++) 
        {
            if (ch[i]=='<') 
            {
                ("&lt;").getChars(0, 4, out, o); o+=4;
            } 
            else if (ch[i]=='>') 
            {
                ("&gt;").getChars(0, 4, out, o); o+=4;
            } 
            else if (ch[i]=='&') 
            {
                ("&amp;").getChars(0, 5, out, o); o+=5;
            } 
            else if (ch[i]=='\"') 
            {
                ("&#34;").getChars(0, 5, out, o); o+=5;
            } 
            else if (ch[i]=='\'') 
            {
                ("&#39;").getChars(0, 5, out, o); o+=5;
            } 
            else if (ch[i]<127)  
            {
                out[o++]=ch[i];
            } 
            else 
            {
                // output character reference
                out[o++]='&';
                out[o++]='#';
                String code = Integer.toString(ch[i]);
                int len = code.length();
                code.getChars(0, len, out, o); o+=len;
                out[o++]=';';
            }
        }

        return o;
    }

}

Now you can see how SAX can be used to write XML documents as well as reading them. In fact, you can run SAX back-to-front: instead of the Parser being standard software that someone else writes, and the DocumentHandler being your specific application code, you can write an implementation of org.xml.sax.Parser that contains your application logic for generating XML, and couple it to this off-the-shelf DocumentHandler for writing XML output!
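A minimal sketch of this back-to-front arrangement follows. BookListGenerator and TextSink are hypothetical names of our own: the generator shows only the event-firing part of what a full org.xml.sax.Parser implementation would do, and TextSink is a simplified stand-in for XMLOutputter that performs no escaping.

```java
import org.xml.sax.AttributeList;
import org.xml.sax.DocumentHandler;
import org.xml.sax.HandlerBase;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributeListImpl;

/** Hypothetical application code that generates SAX events instead of consuming them. */
class BookListGenerator {
    private final DocumentHandler handler;

    BookListGenerator(DocumentHandler handler) {
        this.handler = handler;
    }

    /** Fires the events for a <books> document with one <book> per title. */
    void generate(String[] titles) throws SAXException {
        handler.startDocument();
        handler.startElement("books", new AttributeListImpl());
        for (int i = 0; i < titles.length; i++) {
            AttributeListImpl atts = new AttributeListImpl();
            atts.addAttribute("id", "CDATA", String.valueOf(i + 1));
            handler.startElement("book", atts);
            char[] text = titles[i].toCharArray();
            handler.characters(text, 0, text.length);
            handler.endElement("book");
        }
        handler.endElement("books");
        handler.endDocument();
    }

    /** Convenience for the example: generate into a TextSink and return the text. */
    static String render(String[] titles) {
        TextSink sink = new TextSink();
        try {
            new BookListGenerator(sink).generate(titles);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return sink.out.toString();
    }

    public static void main(String[] args) {
        System.out.println(render(new String[] {"SAX", "DOM"}));
    }
}

/** Simplified stand-in for XMLOutputter: no XML declaration and no escaping. */
class TextSink extends HandlerBase {
    final StringBuilder out = new StringBuilder();

    @Override
    public void startElement(String name, AttributeList atts) {
        out.append('<').append(name);
        for (int i = 0; i < atts.getLength(); i++) {
            out.append(' ').append(atts.getName(i))
               .append("='").append(atts.getValue(i)).append('\'');
        }
        out.append('>');
    }

    @Override
    public void endElement(String name) {
        out.append("</").append(name).append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        out.append(ch, start, length);
    }
}
```

Because the generator depends only on the DocumentHandler interface, the same code could feed a pipeline of filters instead of an output stage.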

Other Useful ParserFilters

NamespaceFilter

This ParserFilter implements the XML Namespaces recommendation, described in Chapter 7. It is available from John Cowan's web site at https://www.ccil.org/~cowan/XML/.

SAX was defined before the XML Namespaces recommendation was published, and takes no account of it. If an element name is written in the source document as <html:table>, then the element name passed to the startElement() method will be "html:table". There is no simple way for the application to determine which namespace "html" is referring to.

The NamespaceFilter solves this problem. It keeps track of all the namespace declarations in the document (that is, the "xmlns:xxx" attributes), and when a prefixed element or attribute name is reported by the SAX parser, it substitutes the full namespace URI for the prefix before passing it on down the pipeline. For example, if the element start tag is <html:table xmlns:html="https://www.w3.org/TR/REC-html40"> then the element name passed on to the next DocumentHandler will be "https://www.w3.org/TR/REC-html40^table". The circumflex character was chosen to separate the namespace URI from the local part of the element name because it's a character that can't appear in URIs or in XML names.
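The substitution itself can be sketched as follows. This is not the NamespaceFilter source: NamespaceScopes is a hypothetical class of our own that assumes the caller has already extracted the "xmlns:xxx" declarations from each start tag, and it deliberately ignores the default namespace.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: tracks prefix declarations in scope and expands prefixed names. */
class NamespaceScopes {
    // One map per open element, holding every prefix visible at that point.
    private final Deque<Map<String, String>> scopes = new ArrayDeque<>();

    /** Call at each start tag with the prefix -> URI declarations found on that tag. */
    void push(Map<String, String> declared) {
        Map<String, String> scope = new HashMap<>();
        if (!scopes.isEmpty()) {
            scope.putAll(scopes.peek());   // inherit declarations from the parent element
        }
        scope.putAll(declared);            // declarations on this tag override inherited ones
        scopes.push(scope);
    }

    /** Call at each end tag. */
    void pop() {
        scopes.pop();
    }

    /** Replaces a declared prefix with its URI, joined to the local name by '^'. */
    String expand(String name) {
        int colon = name.indexOf(':');
        if (colon < 0 || scopes.isEmpty()) {
            return name;                   // unprefixed name, or nothing declared yet
        }
        String uri = scopes.peek().get(name.substring(0, colon));
        return uri == null ? name : uri + "^" + name.substring(colon + 1);
    }

    public static void main(String[] args) {
        NamespaceScopes ns = new NamespaceScopes();
        ns.push(Map.of("html", "https://www.w3.org/TR/REC-html40"));
        System.out.println(ns.expand("html:table"));
        ns.pop();
    }
}
```

Keeping one map per open element means that a declaration goes out of scope automatically when its element ends.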

Sometimes applications want to know the prefix as well as the namespace URI (for example, for use in error messages). NamespaceFilter doesn't provide this information, but it could easily be extended to do so.

InheritanceFilter

This ParserFilter is also available from John Cowan's web site at https://www.ccil.org/~cowan/XML/.

Many XML document designs use the concept of an inheritable attribute. The idea is that if a particular attribute is not present on an element, the value is taken from the same attribute on a containing element. The XML standard itself uses this idea for the special attributes xml:lang and xml:space, and it is extensively used in some other standards such as the XSL Formatting Objects proposal.

InheritanceFilter is a ParserFilter that extends the attribute list passed to the startElement() method to include attributes that were not actually present on that element, but were inherited from parent elements. The InheritanceFilter needs to be primed with a list of attribute names that are to be treated as inherited attributes.
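The lookup rule can be sketched like this. Again, this is our own illustration rather than the InheritanceFilter source: InheritedAttributes is a hypothetical class, and for brevity it works on plain maps of attribute name to value instead of SAX AttributeList objects.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: computes the effective attributes of each element. */
class InheritedAttributes {
    private final Set<String> inheritable;                     // names primed by the caller
    private final Deque<Map<String, String>> stack = new ArrayDeque<>();

    InheritedAttributes(Set<String> inheritable) {
        this.inheritable = new HashSet<>(inheritable);
    }

    /** Call at each start tag with the element's own attributes. */
    Map<String, String> startElement(Map<String, String> own) {
        Map<String, String> effective = new HashMap<>(own);
        Map<String, String> parent = stack.peek();             // null at the root element
        if (parent != null) {
            for (String name : inheritable) {
                String inherited = parent.get(name);
                if (inherited != null && !effective.containsKey(name)) {
                    effective.put(name, inherited);            // absent here, so take the parent's value
                }
            }
        }
        stack.push(effective);
        return effective;
    }

    /** Call at each end tag. */
    void endElement() {
        stack.pop();
    }

    public static void main(String[] args) {
        InheritedAttributes ia = new InheritedAttributes(Set.of("xml:lang"));
        ia.startElement(Map.of("xml:lang", "en", "id", "a"));
        System.out.println(ia.startElement(Map.of()));
    }
}
```

Because each stack entry already contains the values its element inherited, a value declared on a distant ancestor reaches every descendant that does not override it.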

XLinkFilter

This ParserFilter provides support for the draft XLink specification for creating hyperlinks between XML documents. It is published by Simon St. Laurent on https://www.simonstl.com/projects/xlinkfilter/

Unlike most ParserFilters, an XLinkFilter passes all the events through unchanged. While doing so, however, it constructs a data structure reflecting the XLink attributes encountered in the document. This data structure can then be interrogated by subsequent stages in the pipeline.

One kind of link defined in the XLink specification is a so-called "inclusion" link where the linked text is designed to appear inline within the main document – rather like a preprocessor #include directive in C. The XLink syntax for this is show="parsed". This is very similar to an external entity reference, except that the application has some control over the decision whether and when to include the text: for example, the user might have a choice to display the long or short forms of a document. It would be quite possible, of course, to implement a filter that expanded such links directly, presenting an included document to subsequent pipeline stages as if it were physically embedded in the original document.

Pipelines with Shared Context

One potential difficulty with a pipeline is that each filter in the pipeline has to work out for itself things that other filters already know; a common example is knowing the parent of the current element. If one filter is already maintaining a stack of elements so that it can determine this, it is wasteful for another filter to do the same thing.

You can get round this by allowing one filter to access data structures set up by a previous filter, either directly or via public methods. However, this requires that the filters in the pipeline know rather more about each other than the pure pipeline model suggests, which reduces your ability to plug filters together in any order. Arguably, when processing reaches this level of complexity, it might be better to forget event-based processing entirely and use the DOM (with a navigational design pattern) instead.
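As a rough sketch of the shared-context approach, the example below has one stage maintain the stack of open elements while a later stage queries it through a public method. ElementContext, ContextTracker, and ParentReporter are hypothetical names of our own, and the tracker forwards only element events to keep the example short.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import org.xml.sax.AttributeList;
import org.xml.sax.DocumentHandler;
import org.xml.sax.HandlerBase;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributeListImpl;

/** Hypothetical shared context: the stack of currently open elements. */
class ElementContext {
    private final Deque<String> open = new ArrayDeque<>();

    void enter(String name) { open.push(name); }
    void leave() { open.pop(); }

    /** Name of the parent of the current element, or null at the root. */
    String parent() {
        Iterator<String> it = open.iterator();  // iterates from the innermost element outwards
        if (!it.hasNext()) return null;
        it.next();                              // skip the current element
        return it.hasNext() ? it.next() : null;
    }

    /** Wires the two stages together and returns what the downstream stage saw. */
    static List<String> demo() {
        ElementContext context = new ElementContext();
        ParentReporter reporter = new ParentReporter(context);
        ContextTracker tracker = new ContextTracker(context, reporter);
        AttributeListImpl none = new AttributeListImpl();
        try {
            tracker.startElement("books", none);
            tracker.startElement("book", none);
            tracker.endElement("book");
            tracker.endElement("books");
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return reporter.seen;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}

/** Upstream stage: updates the shared stack, then passes the event on. */
class ContextTracker extends HandlerBase {
    private final ElementContext context;
    private final DocumentHandler next;

    ContextTracker(ElementContext context, DocumentHandler next) {
        this.context = context;
        this.next = next;
    }

    @Override
    public void startElement(String name, AttributeList atts) throws SAXException {
        context.enter(name);
        next.startElement(name, atts);
    }

    @Override
    public void endElement(String name) throws SAXException {
        next.endElement(name);
        context.leave();
    }
}

/** Downstream stage: asks the shared context instead of keeping its own stack. */
class ParentReporter extends HandlerBase {
    final List<String> seen = new ArrayList<>();
    private final ElementContext context;

    ParentReporter(ElementContext context) {
        this.context = context;
    }

    @Override
    public void startElement(String name, AttributeList atts) {
        String parent = context.parent();
        seen.add(parent == null ? name : parent + "/" + name);
    }
}
```

The cost is visible in the constructors: ParentReporter only works if some earlier stage is updating the same ElementContext, which is exactly the coupling described above.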
