JAXP: Java API for XML Processing

Abstract

An XML parser is an essential package for any serious programming effort involving XML documents. The XML parser is the link between the XML document representing data and the application code. There are many ways to parse XML documents with Java; DOM and SAX are the standard parsing techniques.

The Java API for XML Processing (JAXP) enables applications to parse and transform XML documents independent of a particular XML processing (parser) implementation. JAXP is a Java API that provides a standard approach to parsing XML documents.

JAXP gives access to parsers for both the DOM and SAX approaches to processing XML documents. It also provides the Transformation API for XML (TrAX), which can be used to transform XML documents into other XML documents.

JAXP adds a few factory classes to fill some holes in these APIs and enable Java programmers to write completely parser-independent code. The factory class you use determines the approach you use. A factory class is a standard design pattern that gives you the ability to manufacture objects as needed without naming their concrete classes in your code.

A basic JAXP session will look like this:

With JAXP, you can use either the DocumentBuilderFactory to create DocumentBuilder classes or the SAXParserFactory to create SAXParser classes. The difference is that DOM parsers read the entire document into memory and allow you to traverse the document in a random access way, while SAX parsers call handlers to interpret XML data as it's encountered in the document.
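The difference between the two factories can be sketched as follows. This is a minimal illustration, not a definitive implementation: the class name DomVersusSaxSketch and the sample document are assumptions made up for this example. The DOM method loads the whole tree and then looks up one element; the SAX method simply records element names as the parser reports them.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class DomVersusSaxSketch {
    // Hypothetical sample document, used only for this illustration.
    static final String XML = "<book><title>JAXP</title><author>Anon</author></book>";

    // DOM: the whole document is read into memory, then accessed at random.
    static String titleViaDom() throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(XML)));
        return doc.getElementsByTagName("title").item(0).getTextContent();
    }

    // SAX: the parser calls the handler as each element is encountered.
    static List<String> elementNamesViaSax() throws Exception {
        List<String> names = new ArrayList<>();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(XML)), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                names.add(qName);
            }
        });
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(titleViaDom());
        System.out.println(elementNamesViaSax());
    }
}
```

Note that neither method names a vendor class; both go through the JAXP factories only.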

The one major new invention in JAXP that was not based on previous standards was a series of abstract factory classes in the javax.xml.parsers package. These allow a Java program to obtain a DOM parser, a SAX1 Parser, or a DOMImplementation in a parser-independent fashion. The SAX1 factories are now obsolete, but the DOM factories are still quite useful.

Simple API for XML (SAX)

The Simple API for XML (SAX) is available with JAXP; SAX is one of two common ways to write software that accesses XML data. SAX is an event-driven methodology for XML processing and consists of many callbacks. Using SAX with JAXP allows developers to traverse XML data sequentially, one element at a time, using a delegation event model. Each time an element of the XML structure is encountered, an event is triggered. Developers write event handlers to define custom processing for the events they deem important. Each element is parsed down to its leaf nodes before the parser moves on to the next sibling of that element in the XML document; therefore the parser itself gives no indication of the current level in the tree, and a handler that needs this information must track it.

SAX

SAX is very useful for processing very large XML documents or streams, because the XML data being processed does not all need to be kept in memory at once. SAX is also very useful for retrieving a specific value from an XML document and for creating a subset of an XML document. It cannot access the XML data randomly or modify it; in such cases, the Document Object Model (DOM) should be used.

The SAX API provides the following handler interfaces: ContentHandler, ErrorHandler, DTDHandler, and EntityResolver.

The SAX programmer implements one of the SAX interfaces that define event processing callbacks. SAX also provides a class called DefaultHandler (in the org.xml.sax.helpers package) that implements all of these callbacks and provides default, empty implementations of all the callback methods. The SAX developer needs only extend this class, then implement methods that require insertion of specific logic. So the key in SAX is to provide code for these various callbacks, then let a parser trigger each of them when appropriate. Here's the typical SAX routine:

  1. Create a SAXParser instance using a specific vendor's parser implementation.
  2. Create an event handler object and register the event handler object to the parser (the event handler implementation extends DefaultHandler, for example).
  3. Start parsing and sending each event to the handler.
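The three steps above can be sketched as follows. This is a minimal example under stated assumptions: the class SaxRoutineSketch, the nested NameHandler, and the element name "name" are all hypothetical, chosen only to show a handler that extends DefaultHandler and overrides just the callbacks it needs.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxRoutineSketch {
    // Collects the text of every <name> element (hypothetical element name).
    static class NameHandler extends DefaultHandler {
        final List<String> names = new ArrayList<>();
        private StringBuilder current; // non-null only while inside <name>

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes attrs) {
            if ("name".equals(qName)) {
                current = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // The parser may deliver text in several chunks, so append.
            if (current != null) {
                current.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("name".equals(qName) && current != null) {
                names.add(current.toString());
                current = null;
            }
        }
    }

    static List<String> extractNames(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser(); // step 1
        NameHandler handler = new NameHandler();                          // step 2
        parser.parse(new InputSource(new StringReader(xml)), handler);    // step 3
        return handler.names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extractNames(
            "<people><person><name>Ann</name></person>"
            + "<person><name>Bob</name></person></people>"));
    }
}
```

The handler keeps only the values it is interested in, which is why this style suits large documents.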

JAXP's SAX component provides a simple means for doing all of this. Without JAXP, a SAX parser instance either must be instantiated directly from a vendor class (such as org.apache.xerces.parsers.SAXParser), or it must use a SAX helper class called XMLReaderFactory (also in the org.xml.sax.helpers package). The problem with the first methodology is obvious: It isn't vendor neutral. The problem with the second is that the factory requires, as an argument, the String name of the parser class to use (that Apache class, org.apache.xerces.parsers.SAXParser, again). You can change the parser by passing in a different parser class as a String. With this approach, if you change the parser name, you won't need to change any import statements, but you will still need to recompile the class. This is obviously not a best-case solution. It would be much easier to be able to change parsers without recompiling the class.

JAXP offers that better alternative: It lets you provide a parser as a Java system property. Of course, when you download a distribution from Sun, you get a JAXP implementation that uses Sun's version of Xerces.  The developer can move from one parser implementation to another through a system property rather than having to refer to it in the actual code. This means that the code does not need to be recompiled each time the parser implementation is changed.
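As a small illustration of this point, the sketch below prints which factory implementation JAXP selected, without naming any vendor class in the code. The class name FactoryChoiceSketch is an assumption made for this example.

```java
import javax.xml.parsers.SAXParserFactory;

class FactoryChoiceSketch {
    static final String KEY = "javax.xml.parsers.SAXParserFactory";

    // Returns the class name of whatever implementation JAXP resolved.
    static String implementationInUse() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // null here means the property was not set and a default was used
        System.out.println("System property: " + System.getProperty(KEY));
        System.out.println("Implementation:  " + implementationInUse());
    }
}
```

Running the same compiled class with a -Djavax.xml.parsers.SAXParserFactory=... option naming another factory class available on the classpath should change the second line of output, with no recompilation.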

An Abstract Factory of SAXParserFactory

The JAXP SAXParserFactory class builds and configures SAXParser objects in an implementation-independent fashion. The concrete subclass to load is read from the javax.xml.parsers.SAXParserFactory Java system property. SAX2 introduced the org.xml.sax.helpers.XMLReaderFactory class as another way to obtain a parser, and SAXParserFactory was at one time described as obsolete for that reason; XMLReaderFactory has since been deprecated (as of Java 9) in favor of SAXParserFactory. To obtain a SAXParser, you must first create a new instance of SAXParserFactory through its static SAXParserFactory.newInstance() method and then obtain the SAXParser itself from the factory's newSAXParser() method.

Every factory's newInstance() method uses a specific algorithm for finding the JAXP implementation. Since JAXP 1.1.3 (also part of JDK 1.4), the factory lookup algorithm is the following:

  • Search for a system property named after the appropriate factory:
      • javax.xml.parsers.DocumentBuilderFactory
      • javax.xml.parsers.SAXParserFactory
      • javax.xml.transform.TransformerFactory
  • Use the properties file "lib/jaxp.properties" in the JRE directory, which contains the fully qualified name of the implementation class, keyed by the system property name from above.
  • If this file is not found, or it does not contain a property for the factory being searched for, the actual search logic follows (used also in the J2EE Engine):
      • Get a class loader by invoking Thread.currentThread().getContextClassLoader().
      • Use this class loader to load a resource with ClassLoader.getResourceAsStream(factoryResource), where factoryResource is a file named after the properties above and located in META-INF/services/, for example META-INF/services/javax.xml.parsers.DocumentBuilderFactory.
      • If this resource is found, it is loaded; its content is a string naming the class of the JAXP factory implementation.
      • This factory class is then loaded using the same class loader.
  • If no such file is found, a fallback value is loaded. This fallback value is specific to the JDK or the JAXP interfaces provider (the platform default implementation of, for example, SAXParserFactory).
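The inputs to this lookup can be inspected with a few lines of code. The sketch below is only a diagnostic aid, not the lookup algorithm itself: for each factory key it reports the system property value and whether a matching META-INF/services resource is visible to the context class loader. The class name LookupDiagnosticsSketch is an assumption made for this example.

```java
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;

class LookupDiagnosticsSketch {
    static final String[] FACTORY_KEYS = {
        "javax.xml.parsers.DocumentBuilderFactory",
        "javax.xml.parsers.SAXParserFactory",
        "javax.xml.transform.TransformerFactory"
    };

    // Reports, per factory key, the system property value and whether a
    // META-INF/services resource with that name can be found.
    static Map<String, String> describe() {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        Map<String, String> report = new LinkedHashMap<>();
        for (String key : FACTORY_KEYS) {
            String property = System.getProperty(key);
            URL service = loader.getResource("META-INF/services/" + key);
            report.put(key, "property=" + property
                + ", serviceResourceFound=" + (service != null));
        }
        return report;
    }

    public static void main(String[] args) {
        describe().forEach((key, value) -> System.out.println(key + " -> " + value));
    }
}
```

If both values are empty for a factory, the platform default implementation is the one that newInstance() falls back to.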

In addition to the basic job of creating instances of SAX parsers, the factory lets you set configuration options. These options affect all parser instances obtained through the factory. The two most commonly used options available in JAXP 1.3 are to set namespace awareness with setNamespaceAware(boolean awareness), and to turn on DTD validation with setValidating(boolean validating). Remember that once these options are set, they affect all instances obtained from the factory after the method invocation.
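The last point can be checked with a short sketch. The class name FactoryOptionsSketch is hypothetical; the example creates one parser before and one after calling setNamespaceAware(true) and reports what each parser says about itself. On the JDK's bundled implementation the first parser is expected to keep its original setting, but treat that as an assumption to verify with your own parser.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

class FactoryOptionsSketch {
    // Returns { awareness of parser created before the change,
    //           awareness of parser created after the change }.
    static boolean[] awarenessBeforeAndAfter()
            throws ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        SAXParser before = factory.newSAXParser(); // created with factory defaults
        factory.setNamespaceAware(true);           // affects parsers created from now on
        SAXParser after = factory.newSAXParser();
        return new boolean[] { before.isNamespaceAware(), after.isNamespaceAware() };
    }

    public static void main(String[] args) throws Exception {
        boolean[] result = awarenessBeforeAndAfter();
        System.out.println("before: " + result[0] + ", after: " + result[1]);
    }
}
```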

Once you have set up the factory, invoking newSAXParser() returns a ready-to-use instance of the JAXP SAXParser class. This class wraps an underlying SAX parser (an instance of the SAX class org.xml.sax.XMLReader). It also protects you from using any vendor-specific additions to the parser class. This class allows actual parsing behavior to be kicked off.

The following example shows how you can create, configure, and use a SAX factory:

import java.io.File;

// JAXP
import javax.xml.parsers.FactoryConfigurationError;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.parsers.SAXParser;

// SAX
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class TestSAXParsing {
    public static void main(String[] args) {
        try {
            if (args.length != 1) {
                System.err.println ("Usage: java TestSAXParsing [filename]");
                System.exit (1);
            }

            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Turn on validation, and turn off namespaces
            factory.setValidating(true);
            factory.setNamespaceAware(false);
            SAXParser parser = factory.newSAXParser();
            parser.parse(new File(args[0]), new MyHandler());
        } catch (ParserConfigurationException e) {
            System.out.println("The underlying parser does not support " +
                               "the requested features.");
        } catch (FactoryConfigurationError e) {
            System.out.println("Error occurred obtaining SAX Parser Factory.");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

class MyHandler extends DefaultHandler {
    // SAX callback implementations from ContentHandler, ErrorHandler, etc.
}

In the example, you can see that two JAXP-specific problems can occur in using the factory: the inability to obtain or configure a SAX factory, and the inability to configure a SAX parser. The first of these problems, represented by a FactoryConfigurationError, usually occurs when the parser specified in a JAXP implementation or system property cannot be obtained. The second problem, represented by a ParserConfigurationException, occurs when a requested feature is not available in the parser being used. Both are easy to deal with and shouldn't pose any difficulty when using JAXP. In fact, you might want to write code that attempts to set several features and gracefully handles situations where a certain feature isn't available.
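One way to attempt several features and handle unavailable ones gracefully is sketched below. The class name FeatureProbeSketch is an assumption, and the second feature URI used in main() is deliberately made up so that a failure path is exercised; which URIs succeed depends on the parser in use.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;

class FeatureProbeSketch {
    // Tries to switch on each feature and records whether the factory accepted it.
    static Map<String, Boolean> probe(String... featureUris) {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        Map<String, Boolean> accepted = new LinkedHashMap<>();
        for (String uri : featureUris) {
            try {
                factory.setFeature(uri, true);
                accepted.put(uri, true);
            } catch (ParserConfigurationException
                     | SAXNotRecognizedException
                     | SAXNotSupportedException e) {
                // Feature unavailable: note it and carry on with the rest.
                accepted.put(uri, false);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        Map<String, Boolean> result = probe(
            "http://xml.org/sax/features/namespaces",
            "http://example.org/hypothetical-feature"); // made-up URI
        result.forEach((uri, ok) -> System.out.println(uri + " -> " + ok));
    }
}
```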

A SAXParser instance is obtained once you get the factory, turn off namespace support, and turn on validation; then parsing begins. The SAX parser's parse() method takes an instance of the SAX DefaultHandler helper class mentioned earlier, which your custom handler class extends. You also pass in the File to parse. However, the SAXParser class contains much more than this single method.

SAXParser Class

Once you have an instance of the SAXParser class, you can do a lot more than just pass it a File to parse. Because of the way components in large applications communicate, it's not always safe to assume that the creator of an object instance is its user. One component might create the SAXParser instance, while another component (perhaps coded by another developer) might need to use that same instance. For this reason, JAXP provides methods to determine the parser's settings. For example, you can use isValidating() to determine if the parser will -- or will not -- perform validation, and isNamespaceAware() to see if the parser can process namespaces in an XML document. These methods can give you information about what the parser can do, but users with just a SAXParser instance -- and not the SAXParserFactory itself -- do not have the means to change these features. You must do this at the parser factory level.

You also have a variety of ways to request parsing of a document. Instead of just accepting a File and a SAX DefaultHandler instance, the SAXParser's parse() method can also accept a SAX InputSource, a Java InputStream, or a URL in String form, all with a DefaultHandler instance. So you can still parse documents wrapped in various forms.

Finally, you can obtain the underlying SAX parser (an instance of org.xml.sax.XMLReader) and use it directly through the SAXParser's getXMLReader() method. Once you get this underlying instance, the usual SAX methods are available. The following listing shows examples of the various uses of the SAXParser class, the core class in JAXP for SAX parsing:

// Get a SAX Parser instance
SAXParser saxParser = saxFactory.newSAXParser();
// Find out if validation is supported
boolean isValidating = saxParser.isValidating();
// Find out if namespaces are supported
boolean isNamespaceAware = saxParser.isNamespaceAware();
// Parse, in a variety of ways
// Use a file and a SAX DefaultHandler instance
saxParser.parse(new File(args[0]), myDefaultHandlerInstance);
// Use a SAX InputSource and a SAX DefaultHandler instance
saxParser.parse(mySaxInputSource, myDefaultHandlerInstance);
// Use an InputStream and a SAX DefaultHandler instance
saxParser.parse(myInputStream, myDefaultHandlerInstance);
// Use a URI and a SAX DefaultHandler instance
saxParser.parse("http://www.newInstance.com/xml/doc.xml",
                myDefaultHandlerInstance);
// Get the underlying (wrapped) SAX parser
org.xml.sax.XMLReader parser = saxParser.getXMLReader();
// Use the underlying parser
parser.setContentHandler(myContentHandlerInstance);
parser.setErrorHandler(myErrorHandlerInstance);
parser.parse(new org.xml.sax.InputSource(args[0]));

JAXP's added functionality is fairly minor, especially where SAX is involved. This minimal functionality makes your code more portable and lets other developers use it, either freely or commercially, with any SAX-compliant XML parser. That's it. There's nothing more to using SAX with JAXP. If you already know SAX, you're about 98 percent of the way there. You just need to learn two new classes and a couple of Java exceptions, and you're ready to roll. If you've never used SAX, it's easy enough to start now.

SAX Pluggability

The SAX Pluggability classes allow an application programmer to provide an implementation of the org.xml.sax.helpers.DefaultHandler API to a SAXParser implementation and parse XML documents. As the parser processes the XML document, it will call methods on the provided DefaultHandler.

After obtaining the parser, which is an XMLReader, you plug in the event handlers you need:
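A minimal sketch of this step follows. The class name ReaderHandlersSketch is an assumption made for the example; a single DefaultHandler subclass is registered as both content handler and error handler, and it records the start and end events it receives.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ReaderHandlersSketch {
    static List<String> parseWithReader(String xml) throws Exception {
        List<String> events = new ArrayList<>();

        // Obtain the underlying XMLReader through JAXP.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                events.add("start:" + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("end:" + qName);
            }
        };

        // Plug in the handlers; DefaultHandler implements both interfaces.
        reader.setContentHandler(handler);
        reader.setErrorHandler(handler);

        reader.parse(new InputSource(new StringReader(xml)));
        return events;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseWithReader("<a><b/></a>"));
    }
}
```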

The class org.xml.sax.helpers.DefaultHandler is a convenience class that implements the org.xml.sax.ContentHandler interface (plus the org.xml.sax.DTDHandler, org.xml.sax.ErrorHandler, and org.xml.sax.EntityResolver interfaces) with empty methods.

Features

(See the SAX-standardized features and properties list at http://www.saxproject.org/?selected=get-set)

Once you have an instance of your parser, you need to configure it. Note that this isn't the same as setting up the parser to deal with errors, content, or structures in XML; instead, configuration is the process of actually telling the parser how to behave. You may turn on validation, turn off namespace checking, and expand entities. These behaviors are totally independent of a specific XML document, and therefore involve interaction with your new parser instance.

Note: For those of you who are overly anxious (I know you're out there), I will indeed be dealing with content, error handling, and the like. However, those subjects will be addressed in future tips, so you'll have to check back. For now, just concentrate on configuration, features, and properties.

You can configure parsers in two ways: features and properties. Features involve turning on or off a specific piece of functionality, like validation. Properties involve setting the value of a specific item that the parser uses, like the location of a schema to validate all documents against. I'll deal with features first, and then look at properties in the next section.

Features are set, not surprisingly, through a method on your parser called setFeature(). The syntax looks like that in the following listing.

// Obtain an instance of an XMLReader implementation from a system property
XMLReader parser = org.xml.sax.helpers.XMLReaderFactory.createXMLReader();

String featureName = "some feature URI";
boolean featureOn = true;

try {
  parser.setFeature(featureName, featureOn);
} catch (SAXNotRecognizedException e) {
  System.err.println("Unknown feature specified: " + e.getMessage());
} catch (SAXNotSupportedException e) {
  System.err.println("Unsupported feature specified: " + e.getMessage());
} catch (SAXException e) {
  System.err.println("Error in setting feature: " + e.getMessage());
}

This is pretty self-explanatory; the key is knowing the common features available to SAX parsers. Each feature is identified by a specific URI. A complete list of these URIs is available online at the SAX Web site. Some of the most common features are validation and namespace processing. The following listing shows an example of setting both of these features.

// Obtain an instance of an XMLReader implementation from a system property
XMLReader parser = org.xml.sax.helpers.XMLReaderFactory.createXMLReader();

try {
  // Turn on validation
  parser.setFeature("http://xml.org/sax/features/validation", true);
  // Ensure namespace processing is on (the default)
  parser.setFeature("http://xml.org/sax/features/namespaces", true);
} catch (SAXNotRecognizedException e) {
  System.err.println("Unknown feature specified: " + e.getMessage());
} catch (SAXNotSupportedException e) {
  System.err.println("Unsupported feature specified: " + e.getMessage());
} catch (SAXException e) {
  System.err.println("Error in setting feature: " + e.getMessage());
}

Note that while parsers have several standard SAX features, they are free to add their own vendor-specific features. For example, Apache Xerces-J adds features that allow for dynamic validation and the continuance of processing after encountering a fatal error. Consult your parser vendor's documentation for the relevant feature URIs.

Properties

Once you understand features, making sense of properties is easy. They behave in exactly the same manner, except that properties take an object as an argument where features take a boolean value. You use the setProperty() method for this purpose, as shown in the following listing.

// Obtain an instance of an XMLReader implementation from a system property
XMLReader parser = org.xml.sax.helpers.XMLReaderFactory.createXMLReader();

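String propertyName = "some property URI";
// Placeholder value: the required type depends on the property being set
Object propertyValue = new Object();

try {
  parser.setProperty(propertyName, propertyValue);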
} catch (SAXNotRecognizedException e) {
  System.err.println("Unknown property specified: " + e.getMessage());
} catch (SAXNotSupportedException e) {
  System.err.println("Unsupported property specified: " + e.getMessage());
} catch (SAXException e) {
  System.err.println("Error in setting property: " + e.getMessage());
}

The same error-handling framework is in play here, so you can easily duplicate code between the two types of configuration options. As with features, SAX provides a standard set of properties, and vendors can add their own extensions. Common SAX-standard properties allow for setting a Lexical Handler and a Declaration Handler (two handlers I'll discuss in later tips). Parsers like Apache Xerces extend these with, for example, the ability to set the input buffer size and the location of an external schema to use in validation. The following listing shows a few properties in action.

// Obtain an instance of an XMLReader implementation from a system property
XMLReader parser = org.xml.sax.helpers.XMLReaderFactory.createXMLReader();

try {
  // Set the size of the chunks read in by the parser (Xerces-specific property)
  parser.setProperty("http://apache.org/xml/properties/input-buffer-size",
      Integer.valueOf(2048));
  // Set a LexicalHandler
  parser.setProperty("http://xml.org/sax/properties/lexical-handler",
      new MyLexicalHandler());
} catch (SAXNotRecognizedException e) {
  System.err.println("Unknown property specified: " + e.getMessage());
} catch (SAXNotSupportedException e) {
  System.err.println("Unsupported property specified: " + e.getMessage());
} catch (SAXException e) {
  System.err.println("Error in setting property: " + e.getMessage());
}

With an understanding of features and properties, you can make your parser do almost anything. Once you understand setting up your parser in this fashion, you're ready for my next tip, which will discuss building a basic content handler.


XSL Transformations

XSL Transformations (XSLT) is a language for transforming XML documents into other XML documents or into other formats such as HTML; JAXP provides the API for running such transformations. A stylesheet written in the Extensible Stylesheet Language (XSL) is usually needed to perform the transformation. The stylesheet contains instructions which specify how the output is to be produced from the input document.

XSLT

The Extensible Stylesheet Language Transformations (XSLT) W3C recommendation describes a transformation vocabulary used to specify how to create new structured information from existing XML documents. 

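The JAXP Transformation APIs include the javax.xml.transform package together with its javax.xml.transform.dom, javax.xml.transform.sax, and javax.xml.transform.stream subpackages.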

To transform an input document into an output document follow these steps:

  1. Obtain a TransformerFactory with the static TransformerFactory.newInstance() factory method.
  2. Form a Source object from the XSLT stylesheet.
  3. Pass this Source object to the factory's newTransformer() factory method to build a Transformer object.
  4. Build a Source object from the input XML document you wish to transform. The Source object may be a DOMSource, SAXSource, or StreamSource.
  5. Build a Result object for the target of the transformation. The Result object may be a DOMResult, SAXResult, or StreamResult.
  6. Pass both the source and the result to the Transformer object's transform() method.

Steps four through six can be repeated for as many different input documents as you want. You can reuse the same Transformer object repeatedly in series, though you can't use it in multiple threads in parallel because neither TransformerFactory nor Transformer is guaranteed to be thread-safe. The following is sample code:

package com.xyzws.jaxp.xslt.sample;

import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XSLTSample1 {

    public static void main(String[] args) {
        try {
              TransformerFactory factory = TransformerFactory.newInstance();
              Source xsl = new StreamSource("Sample1.xsl");
              Transformer transformer = factory.newTransformer(xsl);

              Source request  = new StreamSource("input.xml");
              Result response = new StreamResult("output.xml");
              transformer.transform(request, response);
 
            }
            catch (TransformerException e) {
              System.err.println(e);
            }
    }
}

Neither TransformerFactory nor Transformer is guaranteed to be thread-safe. If your program is multi-threaded, the simplest solution is just to give each separate thread its own TransformerFactory and Transformer objects. However, this can be expensive, especially if you frequently reuse the same large stylesheet, since it will need to be read from disk or the network and parsed every time you create a new Transformer object. There is also likely to be some overhead in building the processor’s internal representation of an XSLT stylesheet from the parsed XML tree.

An alternative is to ask the TransformerFactory to build a Templates object instead. A Templates object represents the parsed stylesheet. You can then ask the Templates object to give you as many separate Transformer objects as you need, each of which can be created very quickly by copying the processor's in-memory data structures rather than by reparsing the entire stylesheet from disk or the network. The Templates object itself can be safely used across multiple threads. To transform an input document into an output document follow these steps:

  1. Obtain a TransformerFactory with the static TransformerFactory.newInstance() factory method. The factory allows you to create different transformers for different stylesheets.
  2. Form a Source object from the XSLT stylesheet.
  3. Pass this Source object to the TransformerFactory's newTemplates() factory method to create a Templates object.
  4. Use this Templates object's newTransformer() method to generate a Transformer object.
  5. Build a Source object from the input XML document you wish to transform.
  6. Build a Result object for the target of the transformation.
  7. Pass both the source and the result to the Transformer object's transform() method.

Steps four through seven can be repeated for as many different input documents as you want. In step four you create a separate Transformer object for each thread, so that no thread-unsafe Transformer object is shared by multiple threads. All the time-consuming work is done when the Templates object is created; calling templates.newTransformer() is very quick by comparison. Here is sample code using a Templates object:

package com.xyzws.jaxp.xslt.sample;

import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XSLTSample2 {


    public static void main(String[] args) {
        try {
              TransformerFactory factory = TransformerFactory.newInstance();
              Source xsl = new StreamSource("Sample1.xsl");
              Templates template = factory.newTemplates(xsl);
    
              Transformer transformer = template.newTransformer();
              Source request  = new StreamSource("input.xml");
              Result response = new StreamResult("output.xml");
              transformer.transform(request, response);
              
            }
            catch (TransformerException e) {
              System.err.println(e);
            }
    }
}
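The sample above runs in a single thread. A sketch of the multi-threaded case described earlier follows, under stated assumptions: the class name TemplatesThreadsSketch, the inline stylesheet, and the <greeting> input documents are all invented for the illustration. One Templates object is shared, and every task asks it for its own Transformer.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class TemplatesThreadsSketch {
    // Hypothetical stylesheet: outputs the text of the <greeting> element.
    static final String XSL =
        "<xsl:stylesheet version='1.0' "
        + "xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'><xsl:value-of select='/greeting'/></xsl:template>"
        + "</xsl:stylesheet>";

    static List<String> transformAll(List<String> inputs) throws Exception {
        // The stylesheet is parsed once; the Templates object is shared.
        Templates templates = TransformerFactory.newInstance()
            .newTemplates(new StreamSource(new StringReader(XSL)));

        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String xml : inputs) {
                futures.add(pool.submit(() -> {
                    // Each task gets its own Transformer; none is shared.
                    Transformer transformer = templates.newTransformer();
                    StringWriter out = new StringWriter();
                    transformer.transform(new StreamSource(new StringReader(xml)),
                                          new StreamResult(out));
                    return out.toString();
                }));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get()); // results stay in input order
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> inputs = new ArrayList<>();
        inputs.add("<greeting>hello</greeting>");
        inputs.add("<greeting>goodbye</greeting>");
        System.out.println(transformAll(inputs));
    }
}
```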

Typically, the XSL source is a StreamSource object. You can easily convert an XSL document to a StreamSource through a File, Reader, InputStream, or URI (represented as a string) reference to the document. Similar to the XSL source, the XML source is typically a StreamSource constructed from a File, Reader, InputStream, or URI. The transformed StreamResult can be a File, Writer, OutputStream or URI.

Here is an example that takes a DOM object and transforms it into an XML document:

//create xsl
File xsltFile = new File(xsltFileName);
StreamSource xslSource = new StreamSource(xsltFile); 
 
//create a new DOMSource using the root node of an existing DOM tree
DOMSource source = new DOMSource(thePhonebook);
StreamResult result = new StreamResult(System.out);

TransformerFactory tFactory = TransformerFactory.newInstance();
Transformer transformer = tFactory.newTransformer(xslSource);
transformer.transform(source, result);

We first instantiate a new TransformerFactory object by calling newInstance, which goes through the ordered lookup procedure (see "An Abstract Factory of SAXParserFactory") to determine the Transformer implementation to use. As with SAX and DOM factories, there are several settings that can be set which determine the way Transformer objects are created. After a new transformer is created using newTransformer, the transform method is then called, which takes a Source object (implemented by DOMSource, SAXSource, and StreamSource) and transforms it into the format of the Result object (implemented by DOMResult, SAXResult, and StreamResult).
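A self-contained variation on the DOM example is sketched below. It is an illustration under assumptions, not the original author's code: the class name DomToStringSketch and the phonebook content are made up, and instead of a stylesheet it uses the no-argument newTransformer(), which performs an identity transformation and so simply serializes the DOM tree. It also sets one output property to show a transformer setting in use.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DomToStringSketch {
    static String serialize() throws Exception {
        // Build a small DOM tree in memory (hypothetical phonebook data).
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().newDocument();
        Element root = doc.createElement("phonebook");
        Element entry = doc.createElement("entry");
        entry.setAttribute("name", "Ann");
        entry.setTextContent("555-0100");
        root.appendChild(entry);
        doc.appendChild(root);

        // No stylesheet: this Transformer copies the source to the result.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serialize());
    }
}
```

Swapping the StreamResult for a DOMResult or SAXResult changes only the last argument to transform(); the rest of the code stays the same.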



Summary

So as you can see, by utilizing the JAXP API, code is written to interact directly with the abstraction layer. This guarantees vendor independence and the ability to swap out the backend implementation quickly and easily. In parsing an XML document, the Java developer has two options depending on their specific needs. SAX is an event-based parsing model that utilizes callback procedures, while DOM is a tree-walking model that parses the XML data into a tree before manipulating it. All in all, what we have in JAXP is a powerful, flexible, and easy to use set of tools that will meet the XML processing needs of most Java developers.