Chapter 7. Java and XML

Extensible Markup Language (XML) has become an essential technology for enterprise applications. The XML specification[*] lets users define unique, structured document formats, which allows easy and flexible data exchange between applications. Since the syntax of an XML document is bound by a public specification, XML documents can be read and manipulated by a wide variety of tools. Because XML documents are text-based, they can be easily transmitted between different systems using a number of transport mechanisms, from JMS to HTTP.

[*] The complete specification, developed by the World Wide Web Consortium, is available from the W3C web site at http://www.w3.org/XML. The site also includes a variety of related specifications and other useful resources.

XML documents can be freely structured, although they must abide by a basic set of XML rules that define a well-formed document. More commonly, however, the document structure is further defined by a Document Type Definition (DTD). With a standardized DTD, enterprise applications can exchange data without knowledge of each other's native formats. Industry working groups have defined DTDs for everything from bank transactions to medical records to electronic books. DTDs are very common, although they have been partially supplanted by XML Schemas, a more sophisticated way of describing the structure of an XML document that, among other things, supports defining data types.

The advantages of combining Java and XML are obvious: a cross-platform language and a cross-platform data specification. We don't have space here to discuss XML itself in depth; for more information, try Learning XML by Erik T. Ray (O'Reilly). That book covers the XML specification itself, including topics such as XML namespaces, DTDs, the XLink and XPointer specifications for rich links, and the XSLT transformation specification (which we will touch on later in this chapter). We have tried to include enough information here to give newcomers a taste of what can be done.

In this chapter, we're going to take a quick look at Sun's Java API for XML Processing (JAXP) Version 1.2, which provides a standardized approach to processing XML files in Java. JAXP is included in Version 1.4 of the J2EE specification. While originally an enterprise API, JAXP joined the standard J2SE distribution with J2SE 1.4. JAXP includes three other specifications by reference: the Simple API for XML parsing (SAX), Version 2; the W3C's Document Object Model (DOM), Level 2; and the XSLT (XSL Transformations) specification. We also discuss using JAXP to access DOM and SAX parsers and XSLT processors, and offer a quick introduction to using DOM and SAX. If you want to know more about using Java with XML, try Brett McLaughlin's Java and XML (O'Reilly).

7.1. Using XML Documents

XML allows developers to create tag-based markup structures that are bound by a set of rules defined in a public specification. The actual content of any particular XML file is left undefined by the specification. Here's an example, orders.xml, that represents two orders made to a fictional online shopping site. Each order includes identifying information (an order number and a customer number), a shipping address, and one or more items. The shipping address encapsulates both the shipping method and the shipping destination, and each item includes an identifying number and the quantity ordered, as well as an optional handling instruction.

Most elements include an opening and closing tag, with the element attributes set in the opening tag. Some elements are "empty," such as the first example of the <item> tag. Empty elements terminate with /> instead of simply >, and don't need a separate closing tag. The significance of an empty tag is in either its attributes or its mere presence. The data is simple enough that the structure should be clear to the reader. Here's the actual XML:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE orders SYSTEM "orders.dtd">
    <orders>
      <order idnumber="3123" custno="121312">
        <shippingaddr method="Camel"><![CDATA[One Main St Boston, MA 02112]]></shippingaddr>
        <item idnumber="7231" quantity="13"/>
        <item idnumber="1296" quantity="2">
          <handling>Please embroider in a tasteful manner!</handling>
        </item>
      </order>
      <order idnumber="3124" custno="12">
        <shippingaddr method="FedEx"><![CDATA[285 York St. New Haven, CT 06510]]></shippingaddr>
        <item idnumber="12" quantity="8"/>
      </order>
    </orders>

At the simplest level, an XML document must merely be well-formed, meaning that the document adheres to all of the syntax rules defined by the XML specification. These rules define the XML declaration on the first line and specify how tags may be formed and nested.[*] The requirements for a well-formed document don't include any particular XML tags except for the XML version and specify structure in only the broadest possible terms. So for most applications, simply knowing that a document is well-formed is not particularly helpful. Of course, one can specify that only files of a particular format should be used as input for a particular program, but without a way to define what that format should be and whether documents conform to it, an extensible markup language doesn't make a great deal of sense.

[*] And quite a bit more, including entity escape sequences, namespaces, valid character sets, and so on. But we don't have room here for a full discussion, so again, we recommend O'Reilly's Learning XML.

However, there is a solution. The second line of orders.xml specifies that the file should conform to a DTD. The DTD goes a step beyond the well-formed requirement and specifies the allowable XML tags, their formats, and the allowable structure. The DTD for the orders.xml file requires that all <order> tags be nested within an <orders> tag, all orders have at least one item and a shipping address, all items include identifier and quantity attributes, and so forth. Here's the DTD orders.dtd:

    <?xml version="1.0" encoding="UTF-8"?>
    <!ELEMENT orders (order+)>
    <!ELEMENT order (shippingaddr, item+)>
    <!ELEMENT shippingaddr (#PCDATA)>
    <!ELEMENT item (handling?)>
    <!ELEMENT handling (#PCDATA)>
    <!ATTLIST order
      idnumber CDATA #REQUIRED
      custno CDATA #REQUIRED
    >
    <!ATTLIST shippingaddr
      method (FedEx | UPS | USPS | Camel) #REQUIRED
    >
    <!ATTLIST item
      idnumber CDATA #REQUIRED
      quantity CDATA #REQUIRED
    >

The XML layer of an application generally consists of one or more DTDs and a set of documents. The DTDs are written ahead of time, by an individual developer, a working group, an application server vendor, or other provider. Some documents, particularly those related to configuration and profiling tasks (such as the J2EE deployment descriptors), are edited by hand and read by software.

The previous example is more transaction-oriented. Documents like orders.xml would likely be generated by a purchasing frontend (such as a web site) and transmitted over the network to a fulfillment system (such as a corporate order tracking database) via HTTP, JMS, or some other transport layer. The receiving software reads the document and processes it, often without any human intervention at all. Standardized DTDs mean that the two sides of the exchange can easily be provided by different vendors.

7.1.1. XML Schema

You may have noticed in orders.dtd that the DTD standard is missing a few things. For one, the specification of data types for elements and their attributes is extremely limited. Your application may require that the contents of the idnumber attribute be a valid integer, for example. The closest you can get to this with a DTD is to declare the attribute as an enumerated type. But this is useful only if you can enumerate all of the allowed integer values for idnumber, which isn't possible in most cases. So typically the only option you'd have is to declare the attribute as character data, as we've done in the example, and then validate the attribute values in your application code. Things are even more limited for data within elements, where DTDs allow only character data, no data (empty), or completely unspecified content. Again, in most cases, you would need to do additional data validation within your application.

DTDs are also limiting in terms of entity namespaces and granularity. You can have only a single DTD associated with a given XML document, and that DTD applies to the entire document.

The XML Schema standard provides a much richer mechanism for describing the structure and content of an XML file. A schema defines a set of data types that is then used to define element and attribute types. One or more schemas can then be referenced in a given XML document and applied to specific elements in the document. This provides a much more flexible model than DTDs. The data type descriptions are more powerful, the namespace facility allows you to integrate multiple schemas into a single document, and schema types can be applied at the individual element or attribute level in your XML documents.

In addition, XML schema files are regular XML files, so they can be processed using standard XML-handling tools. This makes it much easier to write software that can respond to incoming XML dynamically.

An XML schema describing the orders.xml file might look something like this:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <!-- elementFormDefault="qualified" places the nested elements in the
         target namespace, so that they match a document that declares
         jent:xml-orders as its default namespace -->
    <xs:schema xmlns="jent:xml-orders"
               xmlns:xs="http://www.w3.org/2001/XMLSchema"
               targetNamespace="jent:xml-orders"
               elementFormDefault="qualified">

      <xs:complexType name="itemType">
        <xs:sequence>
          <xs:element name="handling" type="xs:string" minOccurs="0"/>
        </xs:sequence>
        <xs:attribute name="idnumber" type="xs:positiveInteger" use="required"/>
        <xs:attribute name="quantity" type="xs:integer" use="required"/>
      </xs:complexType>

      <xs:complexType name="shippingaddrType" mixed="true">
        <xs:attribute name="method" use="required">
          <xs:simpleType>
            <xs:restriction base="xs:NMTOKEN">
              <xs:enumeration value="USPS"/>
              <xs:enumeration value="Camel"/>
              <xs:enumeration value="FedEx"/>
              <xs:enumeration value="UPS"/>
            </xs:restriction>
          </xs:simpleType>
        </xs:attribute>
      </xs:complexType>

      <xs:complexType name="orderType">
        <xs:sequence>
          <xs:element name="shippingaddr" type="shippingaddrType"/>
          <xs:element name="item" type="itemType" maxOccurs="unbounded"/>
        </xs:sequence>
        <xs:attribute name="idnumber" type="xs:positiveInteger" use="required"/>
        <xs:attribute name="custno" type="xs:positiveInteger" use="required"/>
      </xs:complexType>

      <xs:complexType name="ordersType">
        <xs:sequence>
          <xs:element name="order" type="orderType" maxOccurs="unbounded"/>
        </xs:sequence>
      </xs:complexType>

      <xs:element name="orders" type="ordersType"/>
    </xs:schema>

The schema representation is longer than the DTD representation, but it's also easier to read. The most important pieces are the <xs:element> and <xs:complexType> elements. Complex types in XML Schema are collections of element and attribute descriptions. These types are bound to XML elements using the type attribute on element declarations. In our example, the root element, orders, is declared at the end of the schema, and it uses the type ordersType, which is defined by the complexType just above it.[*] The type attribute on an element or attribute declaration specifies that the data included in the element or attribute value should conform to a particular data type.

[*] This is a two-step process because we might want to use the same complex type definition for two or more different elements without having to copy and paste.

The complexType structure is the top of the food chain in terms of the data-typing facilities in XML Schema. In orders.xsd, we also constrain various elements and attributes using the xs:string, xs:integer, and xs:positiveInteger data types, which are standard data types defined in the XML Schema specification. XML Schema also allows you to define new data types based on existing data types, using the <xs:simpleType> and <xs:restriction> tags. These are used in the shippingaddrType to define the type for the method attribute, specified as an enumerated type based on the built-in NMTOKEN type. Other restrictions are available for other data types. The xs:string data type, in particular, supports restrictions for minimum and maximum length and for regular expression matching, allowing fine-grained control of data formatting.

As mentioned earlier, XML schemas are referenced within XML documents as a namespace that defines entities (elements and attributes) to be used in the document. Typically, this is done by specifying the namespace (and its corresponding schema) for the root element in your XML document. Adjusting our orders.xml document to refer to our XML schema would look like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <orders xmlns="jent:xml-orders"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:schemaLocation="jent:xml-orders http://mycompany.com/orders.xsd">
      <order idnumber="3123" custno="121312">
        <shippingaddr method="Camel"><![CDATA[One Main St Boston, MA 02112]]></shippingaddr>
        <item idnumber="7231" quantity="13"/>
        <item idnumber="1296" quantity="2">
          <handling>Please embroider in a tasteful manner!</handling>
        </item>
      </order>
      <order idnumber="3124" custno="12">
        <shippingaddr method="FedEx"><![CDATA[285 York St. New Haven, CT 06510]]></shippingaddr>
        <item idnumber="12" quantity="8"/>
      </order>
    </orders>

Note the new attributes on the <orders> element. xmlns is a standard attribute supported by all XML elements and is used to specify a namespace to be used by the element and its children. If you want or need to use a prefix with the elements and attributes referenced from the schema, you can specify the prefix by appending it (with a colon separator) to the xmlns attribute entry. We've done this with our orders.xml example when we import the XML Schema "instance" schema, a standard schema that defines some attributes that are useful when dealing with schemas in XML documents. We've imported the schema as a namespace using the xsi prefix. When we want to reference elements or attributes from this namespace, we use this prefix on their names. The schemaLocation attribute shows this in action: this attribute comes from the "instance" schema, so we refer to it as xsi:schemaLocation.

The schemaLocation attribute we use on the orders element also demonstrates one way to specify the location of schema definition files. In the xmlns attribute, we identified our orders schema using a namespace URI, jent:xml-orders. The schemaLocation attribute tells the XML parser how to resolve this URI to a physical schema file. In our case, we're telling the parser that the schema file is located at the URL http://mycompany.com/orders.xsd.

7.2. Java API for XML Processing

The JAXP API is bundled into the JDK as of 1.4 and is an optional package for earlier versions. It is also a standard component of the J2EE 1.3 and 1.4 platforms. XML Schema support is available only as of JAXP 1.2, which is part of J2EE 1.4 and JDK 1.4. J2EE 1.3 includes the 1.1 release of JAXP, which is otherwise functionally identical. If you're working in a Java 5.0 environment, you're using JAXP 1.3, which adds XPath and XInclude support to the API, among other things. The full specification and a reference implementation are available from http://java.sun.com/xml.

The SAX and DOM APIs that are actually used for processing XML files don't include a standard method for creating a parser object; this is one of the voids JAXP fills. The API provides a set of Factory objects that will create parsers or XSLT processors. Additionally, JAXP defines a programmatic interface to XSLT processors.

The actual parser and processor implementations used by JAXP are pluggable. You can use the Crimson parser, the Apache Xerces parser (available from http://xml.apache.org), or any other JAXP-compatible parser. Version 1.1 of the reference implementation shipped with Sun's Crimson XML parser and the Xalan XSL engine from the Apache XML project (again, see http://xml.apache.org). In JAXP 1.2, the Crimson parser was replaced with the Xerces parser. There are still variations in support for different levels of functionality across parser implementations. The examples in this chapter have been tested with the Xerces parser that shipped with JAXP 1.2.

7.2.1. Getting a Parser or Processor

To retrieve a parser or processor from inside a Java program, call the newInstance( ) method of the appropriate factory class: SAXParserFactory, DocumentBuilderFactory, or TransformerFactory. The actual factory implementation is provided by the parser vendor. For example, to retrieve the platform default SAX parser:

    SAXParserFactory spf = SAXParserFactory.newInstance( );
    spf.setValidating(true); // request a validating parser
    try {
        SAXParser saxParser = spf.newSAXParser( );
        // Process XML here
    } catch (SAXException e) {
        e.printStackTrace( );
    } catch (ParserConfigurationException pce) {
        pce.printStackTrace( );
    } catch (IOException ioe) {
        // Thrown by the parse( ) calls that replace the placeholder above
        ioe.printStackTrace( );
    }

The next three sections will deal with what you can do once you've actually retrieved a parser. For the time being, let's treat it as an end in itself.

SAXParserFactory includes a static method called newInstance( ). When this method is called, the JAXP implementation searches for an implementation of javax.xml.parsers.SAXParserFactory, instantiates it, and returns it. The implementation of SAXParserFactory is provided by the parser vendor; it's org.apache.xerces.jaxp.SAXParserFactoryImpl for the Xerces parser.

The system looks for the name of the class to instantiate in the following four locations, in order:

1. In one of these system properties:

javax.xml.parsers.SAXParserFactory

javax.xml.parsers.DocumentBuilderFactory

javax.xml.transform.TransformerFactory

2. In the lib/jaxp.properties file in the JRE directory. The configuration file is in key=value format, and the key is the name of the corresponding system property. Therefore, to set Crimson as the default parser, jaxp.properties would contain the following line:

javax.xml.parsers.SAXParserFactory=org.apache.crimson.jaxp.SAXParserFactoryImpl

3. In the application jar file, via the Services API. The API looks for the classname in a file called META-INF/services/parserproperty, in which the filename (parserproperty) is the property name corresponding to the desired factory. The runtime environment checks every available jar file, so if you have multiple parsers available to your application, specify the desired factory using one of the previous methods to prevent nondeterministic behavior.

4. In a platform default factory instance.
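The result of this lookup can be inspected at runtime. What follows is a minimal sketch rather than part of the chapter's own examples: the class name FactoryLookup and its helper methods are hypothetical, and the implementation class names it prints depend on the JRE and on any parser jars on the classpath.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;

// Hypothetical helper: reports which factory implementations the lookup selects.
class FactoryLookup {

    // Name of the concrete SAXParserFactory class chosen by newInstance().
    static String saxFactoryName() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    // Name of the concrete DocumentBuilderFactory class chosen by newInstance().
    static String domFactoryName() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    // Name of the concrete TransformerFactory class chosen by newInstance().
    static String transformerFactoryName() {
        return TransformerFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // A null value means step 1 (the system property) does not apply,
        // so the lookup falls through to the later steps.
        System.out.println("System property override: "
                + System.getProperty("javax.xml.parsers.SAXParserFactory"));
        System.out.println("SAX factory:         " + saxFactoryName());
        System.out.println("DOM factory:         " + domFactoryName());
        System.out.println("Transformer factory: " + transformerFactoryName());
    }
}
```

Running the class with -Djavax.xml.parsers.SAXParserFactory=<classname> on the command line is one way to observe step 1 taking precedence, provided the named class is on the classpath.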

Once you have a factory, various parser options can be set using the factory-specific set methods. SAXParserFactory and DocumentBuilderFactory, for instance, include setNamespaceAware( ) and setValidating( ) methods, which tell the factory whether to produce a parser that is aware of XML namespaces (and will fail if the document being parsed doesn't properly conform to the namespace specification) and whether to validate against any DTD and/or schema specified by the XML document itself. To enable schema validation in parsers, you'll need to set the following attribute on the DocumentBuilderFactory, or set the equivalent property on the SAXParser using setProperty( ):

    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    dbf.setAttribute(
        "http://java.sun.com/xml/jaxp/properties/schemaLanguage",
        "http://www.w3.org/2001/XMLSchema");

This attribute is a standard property defined in the JAXP specification and it tells the parser which schema language to use for validation. In this example, we're using the 2001 version of the XML Schema language.
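To see the attribute at work end to end, here is a hedged sketch that validates a small document against an inline schema. Several things in it are assumptions that go beyond the text above: it also sets the companion schemaSource property (defined by JAXP 1.2 but not discussed here) so that the schema can be supplied from memory, it turns on both setValidating( ) and setNamespaceAware( ), and the class name, the tiny one-element schema, and the isValid( ) helper are all invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical sketch: schema validation through the JAXP 1.2 properties.
class SchemaValidationSketch {
    static final String SCHEMA_LANGUAGE =
            "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
    static final String SCHEMA_SOURCE =
            "http://java.sun.com/xml/jaxp/properties/schemaSource";
    static final String W3C_XML_SCHEMA = "http://www.w3.org/2001/XMLSchema";

    // Illustrative schema: a single <item> with a required positive idnumber.
    static final String SCHEMA =
            "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
          + "<xs:element name='item'><xs:complexType>"
          + "<xs:attribute name='idnumber' type='xs:positiveInteger' use='required'/>"
          + "</xs:complexType></xs:element></xs:schema>";

    // Returns true if the document parses and conforms to SCHEMA.
    static boolean isValid(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);  // schema validation relies on namespaces
            dbf.setValidating(true);      // must be enabled before the attributes
            dbf.setAttribute(SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
            dbf.setAttribute(SCHEMA_SOURCE,
                    new ByteArrayInputStream(SCHEMA.getBytes(StandardCharsets.UTF_8)));

            DocumentBuilder db = dbf.newDocumentBuilder();
            // Turn validation errors into exceptions so the caller sees them.
            db.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { /* ignored here */ }
                @Override public void error(SAXParseException e) throws SAXException { throw e; }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            db.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;  // malformed or invalid document
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<item idnumber='12'/>"));
        System.out.println(isValid("<item idnumber='-3'/>"));
    }
}
```

Without an error handler that rethrows, a validating parser typically reports schema violations and carries on, so the parse( ) call alone would not reveal that the document is invalid.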

Factories are threadsafe, so a single instance can be shared by multiple threads. This allows parser factories to be instantiated in a Java Servlet init( ) method or other centralized location. Parsers and processors, however, aren't guaranteed to be threadsafe.

7.3. SAX

The SAX API provides a procedural approach to parsing an XML file. As a SAX parser iterates through an XML file, it performs callbacks to a user-specified object. These calls indicate the start or end of an element, the presence of character data, and other significant events during the life of the parser.

SAX doesn't provide random access to the structure of the XML file; each tag must be handled as it is encountered by the parser. This means that SAX provides a relatively fast and efficient method of parsing. Because the SAX parser deals with only one element at a time, implementations can be extremely memory-efficient, making it often the only reasonable choice for dealing with particularly large files.

7.3.1. SAX Handlers

The SAX API allows you to create objects that handle XML parsing events, by implementing the org.xml.sax.ContentHandler, org.xml.sax.ErrorHandler, and org.xml.sax.DTDHandler interfaces.[*] Processing a document with SAX involves passing a handler implementation to the parser and calling the parse( ) method of SAXParser. The parser will read the contents of the XML file, calling the appropriate method on the handler when significant events (such as the start of a tag) occur. All handler methods may throw a SAXException in the event of an error.

[*] We're not covering DTDHandler here, because it's rarely used. It is useful only if you need to know about unparsed entities and notations in the DTD associated with an XML document.

We'll take a look at the ContentHandler and ErrorHandler interfaces, and at the generic but useful DefaultHandler class, next.

7.3.1.1. ContentHandler

Most, if not all, SAX applications implement the ContentHandler interface. The SAX parser will call methods on a ContentHandler when it encounters basic XML constructs: chiefly, the start or end of a document, the start or end of an element, and character data within an element.

The startDocument( ) and endDocument( ) methods are called at the beginning[*] and end of the parsing process and take no parameters. Most applications use startDocument( ) to create any necessary internal data stores and use endDocument( ) to dispose of them (for example, by writing to the database).

[*] The first method called by the parser is actually setDocumentLocator( ), which provides the handler with an implementation of org.xml.sax.Locator. This object can report the current position of the parser within the XML file via its getColumnNumber( ), getLineNumber( ), getPublicId( ), and getSystemId( ) methods. However, while parser implementations are strongly encouraged to implement this method, they aren't required to.
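As a small illustration of the Locator described in the note above, the following sketch records the line on which each element name is first seen. The class name LineTracker and the firstLines( ) helper are hypothetical, and the handler falls back to -1 on the assumption that some parsers may never supply a locator.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: remembers the line on which each element first appears.
class LineTracker extends DefaultHandler {
    private Locator locator;  // stays null if the parser never supplies one
    private final Map<String, Integer> firstLine = new LinkedHashMap<>();

    @Override
    public void setDocumentLocator(Locator locator) {
        // Called before startDocument() by parsers that support locators.
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        int line = (locator == null) ? -1 : locator.getLineNumber();
        firstLine.putIfAbsent(qName, line);
    }

    // Parses the text and returns element name -> line of first occurrence.
    static Map<String, Integer> firstLines(String xml) {
        LineTracker tracker = new LineTracker();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), tracker);
        } catch (SAXException | ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return tracker.firstLine;
    }

    public static void main(String[] args) {
        System.out.println(firstLines("<orders>\n<order/>\n</orders>"));
    }
}
```

The locator is only meaningful while a callback is running, which is why the handler reads it inside startElement( ) rather than after parsing has finished.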

When the parser encounters a new element, it calls the startElement( ) method of the ContentHandler, passing a namespace URI, the local name of the element, the qualified name of the element (the namespace prefix and the local name), and an org.xml.sax.Attributes object containing the element attributes.

The Attributes interface allows the parser to inform the ContentHandler of attributes attached to an XML tag. For instance, the <order> tag in our earlier example contained two attributes, idnumber and custno, specified like this:

    <order idnumber="3123" custno="121312">

To retrieve attributes when processing an element, call the getValue( ) method of the Attributes object:

    public void startElement(String namespaceURI, String localName,
                             String qName, Attributes atts) throws SAXException {
        // localName is reliably populated only by a namespace-aware parser
        if (localName.equals("order")) {
            System.out.println("New Order Number " + atts.getValue("idnumber") +
                " for Customer Number " + atts.getValue("custno"));
        }
    }

Note that before we can safely run this line, we need to make sure that we are processing an <order> tag; otherwise, there is no guarantee that the particular attributes we are querying will be available.[*]

[*] The parser returns all the attributes specified in the XML document, either explicitly or through a default value specified in the DTD. Attributes without defaults that aren't explicitly specified in the XML document itself aren't included.

When the parser encounters the closing tag of an element (</order>, in this case), the parser calls the endElement( ) method, passing the same namespace URI, local name, and qualified name that were passed to startElement( ). Every startElement( ) call will have a corresponding endElement( ) call, even when the element is empty.
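Because every startElement( ) call is balanced by an endElement( ) call, a handler can track nesting depth with a simple counter. The sketch below is illustrative only: the ElementOutline class and its outline( ) helper are hypothetical names, and it matches on the qualified name because the default factory is assumed not to be namespace-aware.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: builds an indented outline of element names.
class ElementOutline extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();
    private int depth = 0;

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        // Indent by two spaces per nesting level, then go one level deeper.
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(qName).append('\n');
        depth++;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Called for empty elements too, so the counter always balances.
        depth--;
    }

    // Parses the text and returns one line per element, indented by depth.
    static String outline(String xml) {
        ElementOutline handler = new ElementOutline();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException | ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler.out.toString();
    }

    public static void main(String[] args) {
        System.out.print(outline(
                "<order><item/><item><handling>x</handling></item></order>"));
    }
}
```

If endElement( ) were skipped for the empty item element, the second item would be printed one level too deep.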

These four methods all deal with handling information about XML tags but not with the data within a tag (unless that data is another tag). Much XML content consists of textual data outside the confines of tags and attributes. For example, here's the handling instruction from orders.xml:

<handling>Please embroider in a tasteful manner!</handling>

When the SAX parser encounters the text between the tags, it calls the characters( ) method, passing a character array, a starting index within that array, and the length of the relevant character sequence within the array. This simple implementation of characters( ) prints the output to the screen:

    public void characters(char[] ch, int start, int length)
            throws SAXException {
        System.out.print(new String(ch, start, length));
    }

Note that there is no guarantee that all of the characters you want will be delivered in the same call. Also, since the characters( ) method doesn't include any references to the parent element, to perform more complicated tasks (such as treating the characters differently depending on the element that contains them), you will need to store the name of the current element within the handler class itself. Example 7-1, later in this chapter, shows how to do this via the startElement( ) method.
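One common way to cope with text arriving in several pieces is to buffer it and act only when the closing tag is seen. The following is a sketch of that idea, not the chapter's own code: HandlingCollector and collect( ) are hypothetical names, and elements are matched by qualified name on the assumption that the parser is not namespace-aware.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: gathers the complete text of each <handling> element.
class HandlingCollector extends DefaultHandler {
    private final List<String> instructions = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private boolean inHandling = false;  // remembers the current element

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        if (qName.equals("handling")) {
            inHandling = true;
            text.setLength(0);  // start a fresh buffer
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one run of text; just accumulate.
        if (inHandling) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("handling")) {
            instructions.add(text.toString());  // the text is complete now
            inHandling = false;
        }
    }

    // Parses the text and returns the content of every <handling> element.
    static List<String> collect(String xml) {
        HandlingCollector handler = new HandlingCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException | ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler.instructions;
    }

    public static void main(String[] args) {
        System.out.println(collect(
                "<order><item><handling>Gift &amp; wrap</handling></item></order>"));
    }
}
```

Entity references such as &amp; are a typical place where a parser splits text across calls, which is why the example uses one.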

The characters( ) method might also be called when the parser encounters ignorable whitespace, such as a carriage return separating nested elements that don't otherwise have nested character data. If the parser is validating the document against a DTD, it must instead call the ignorableWhitespace( ) method to report these characters.

7.3.1.2. ErrorHandler

Since SAX is a language-independent specification, it doesn't handle parsing errors by throwing exceptions. Instead, a SAX parser reports errors by calling methods on a user-supplied object that implements the ErrorHandler interface. The ErrorHandler interface includes three methods: error( ), fatalError( ), and warning( ). Each method takes an org.xml.sax.SAXParseException parameter. The programmer is free to handle the errors in whatever manner she deems appropriate; however, the specification doesn't require parsing to continue after a call to fatalError( ).
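As an illustration of the three callbacks, the sketch below records every problem instead of stopping at the first one, and rethrows only fatal errors. It is one possible policy under stated assumptions, not a recommended one: CollectingErrorHandler and validate( ) are hypothetical names, and the exact message text comes from the parser and will vary between implementations.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

// Hypothetical handler: collects warnings and errors, gives up on fatal errors.
class CollectingErrorHandler implements ErrorHandler {
    private final List<String> problems = new ArrayList<>();

    @Override
    public void warning(SAXParseException e) {
        problems.add("warning: " + e.getMessage());
    }

    @Override
    public void error(SAXParseException e) {
        // Recoverable (for example, a validity violation): note it and go on.
        problems.add("error: " + e.getMessage());
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        // Not recoverable (for example, malformed XML): record and abort.
        problems.add("fatal: " + e.getMessage());
        throw e;
    }

    // Parses with DTD validation enabled and returns every reported problem.
    static List<String> validate(String xml) {
        CollectingErrorHandler handler = new CollectingErrorHandler();
        try {
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setValidating(true);
            XMLReader reader = spf.newSAXParser().getXMLReader();
            reader.setErrorHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // Parsing was abandoned; anything reported is already in the list.
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler.problems;
    }

    public static void main(String[] args) {
        String dtd = "<!DOCTYPE order [<!ELEMENT order (item+)>"
                + "<!ELEMENT item EMPTY>"
                + "<!ATTLIST item quantity CDATA #REQUIRED>]>";
        // The item below omits the required quantity attribute.
        validate(dtd + "<order><item/></order>").forEach(System.out::println);
    }
}
```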

7.3.1.3. DefaultHandler

The API also provides the org.xml.sax.helpers.DefaultHandler class, which implements all three handler interfaces. Since most handlers don't need to override every handler method, or even most, the easiest way to write a custom handler is to extend this class and override methods as necessary.

7.3.2. Using a SAX Parser

Once you have a handler or set of handlers, you need a parser. JAXP generates SAX parsers via a SAXParserFactory, as we saw earlier. The SAXParserFactory has three methods for further specifying parser behavior: setValidating( ) (which instructs the parser to validate the incoming XML file against its DTD or schema), setNamespaceAware( ) (which requests support for XML namespaces), and setFeature( ) (which allows configuration of implementation-specific attributes for parsers from particular vendors).

It is possible to parse a document directly from a SAXParser object by passing a DefaultHandler subclass (which implements the ContentHandler interface) to the parse( ) method, along with a File, URI, or InputStream containing the XML to be parsed. For more control, call the getXMLReader( ) method of SAXParser, which returns an org.xml.sax.XMLReader object. This is the underlying parser that actually processes the input XML and calls the three handler objects. Accessing the XMLReader directly allows programs to set specific ErrorHandler and DTDHandler objects, rather than being limited to a single handler object.

All events in the SAX parsing cycle are synchronous. The parse( ) method will not return until the entire document has been parsed, and the parser will wait for each handler method to return before calling the next one.
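A practical consequence of this synchronous behavior is that any state a handler has accumulated is complete as soon as parse( ) returns. The sketch below relies on that by reading a counter immediately after the call; ElementCounter and countElements( ) are hypothetical names, and it uses the direct SAXParser.parse( ) route rather than an XMLReader.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: counts the elements in a document.
class ElementCounter extends DefaultHandler {
    private int count = 0;

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        count++;
    }

    // Parses the text and returns the number of elements it contains.
    static int countElements(String xml) {
        ElementCounter counter = new ElementCounter();
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            InputStream in =
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
            // Blocks until the whole document, including endDocument(),
            // has been delivered to the handler.
            parser.parse(in, counter);
        } catch (SAXException | ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        // Safe to read here: no callbacks can still be pending.
        return counter.count;
    }

    public static void main(String[] args) {
        System.out.println(
                countElements("<orders><order><item/><item/></order></orders>"));
    }
}
```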

7.3.2.1. A SAX example: Processing orders

Example 7-1 uses a SAX DefaultHandler to process an XML document containing a set of incoming orders for a small business. It uses the startElement( ) method of ContentHandler to process each element, displaying relevant information. Element attributes are processed via the Attributes object passed to the startElement( ) method. When the parser encounters text within a tag, it calls the characters( ) method of ContentHandler.

You can also call the setProperty( ) method on the SAXParser to control its behavior. The standard JAXP property we saw in the previous section can be set using this method, for example.

Example 7-1. Parsing XML with SAX

    import javax.xml.parsers.*;
    import org.xml.sax.*;

    public class OrderHandler extends org.xml.sax.helpers.DefaultHandler {

        public static void main(String[] args) {
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setValidating(true);     // request a validating parser
            spf.setNamespaceAware(true); // so that localName is populated
            XMLReader xmlReader = null;
            try {
                SAXParser saxParser = spf.newSAXParser();
                /* We need an XMLReader to use an ErrorHandler.
                   We could just pass the DefaultHandler to the parser if we
                   wanted to use the default error handler. */
                xmlReader = saxParser.getXMLReader();
                xmlReader.setContentHandler(new OrderHandler());
                xmlReader.setErrorHandler(new OrderErrorHandler());
                xmlReader.parse("orders.xml");
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        // The startDocument() method is called at the beginning of parsing
        public void startDocument() throws SAXException {
            System.out.println("Incoming Orders:");
        }

        // The startElement() method is called at the start of each element
        public void startElement(String namespaceURI, String localName,
                                 String rawName, Attributes atts)
                throws SAXException {
            if (localName.equals("order")) {
                System.out.print("\nNew Order Number " + atts.getValue("idnumber") +
                    " for Customer Number " + atts.getValue("custno"));
            } else if (localName.equals("item")) {
                System.out.print("\nLine Item: " + atts.getValue("idnumber") +
                    " (Qty " + atts.getValue("quantity") + ")");
            } else if (localName.equals("shippingaddr")) {
                System.out.println("\nShip by " + atts.getValue("method") + " to:");
            } else if (localName.equals("handling")) {
                System.out.print("\n\tHandling Instructions: ");
            }
        }

        // Print characters within a tag.
        // This will print the contents of the <shippingaddr> and <handling> tags.
        // There is no guarantee that all characters will be delivered in a
        // single call.
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            System.out.print(new String(ch, start, length));
        }

        /* A custom error handling class, although DefaultHandler implements
           both interfaces. Here we just throw the exception back to the user. */
        private static class OrderErrorHandler implements ErrorHandler {

            public void error(SAXParseException spe) throws SAXException {
                throw new SAXException(spe);
            }

            public void warning(SAXParseException spe) throws SAXException {
                System.out.println("\nParse Warning: " + spe.getMessage());
            }

            public void fatalError(SAXParseException spe) throws SAXException {
                throw new SAXException(spe);
            }
        }
    }

In a real application, we would want to treat error handling in a more robust fashion, probably by reporting parse errors to a logging utility or EJB. An actual order management utility would populate a database table or an Enterprise JavaBean's object.

7.4. DOM

The DOM API, unlike the SAX API, allows programmers to construct an object model representing a document and then traverse and modify that representation. The DOM API is not Java-specific; it was developed by the W3C XML working group as a cross-platform API for manipulating XML files (see http://www.w3c.org/XML). As a result, it sometimes doesn't take the most direct Java-based path to a particular result. The JAXP 1.1 API incorporated DOM Level 2. In JAXP 1.3, this was updated to support DOM Level 3.

DOM is useful when programs need random access to a complex XML document or to a document whose format is not known ahead of time. This flexibility does come at a cost, however, as the parser must build a complete in-memory object representation of the document. For larger documents, the resource requirements mount quickly. Consequently, many applications use a combination of SAX and DOM, using SAX to parse longer documents (such as importing large amounts of transactional data from an enterprise reporting system) and using DOM to deal with smaller, more complex documents that may require alteration (such as processing configuration files or transforming existing XML documents).

7.4.1. Getting a DOM Parser

The DOM equivalent of a SAXParser is the javax.xml.parsers.DocumentBuilder. Many DocumentBuilder implementations actually use SAX to parse the underlying document, so the DocumentBuilder implementation itself can be thought of as a layer that sits on top of SAX to provide a different view of the structure of an XML document. We use the JAXP API to get a DocumentBuilder instance in the first place, via the DocumentBuilderFactory class:

    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance( );
    // Validation
    dbf.setValidating(false);
    // Keep whitespace-only text nodes (true would drop them, but only
    // when the parser is validating against a DTD)
    dbf.setIgnoringElementContentWhitespace(false);
    // Expand XML entities according to the DTD
    dbf.setExpandEntityReferences(true);
    // Treat CDATA sections the same as text
    dbf.setCoalescing(true);

    DocumentBuilder db = null;
    try {
        db = dbf.newDocumentBuilder( );
    } catch (ParserConfigurationException pce) {
        pce.printStackTrace( );
    }

The set( ) methods, as with the SAXParserFactory, provide a simple method for configuring parser options:

setCoalescing( )

Joins XML CDATA nodes with adjoining text nodes. The default is false.

setExpandEntityReferences( )

Expands XML entity reference nodes. The default is true.

setIgnoringComments( )

Ignores XML comments. The default is false.

setIgnoringElementContentWhitespace( )

Ignores whitespace in areas defined as element-only by the DTD. The default is false.

setNamespaceAware( )

Requests a namespace-aware parser. The default is false.

setValidating( )

Requests a validating parser. The default is false.

setAttribute()

Sets various standard and implementation-specific parser attributes.

Once the DocumentBuilder is instantiated, call the parse(String URI) method to return an org.w3c.dom.Document object.
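The effect of the factory options is easiest to see by parsing the same input twice. The following is a minimal sketch, not part of the original examples: the class name ParseOptionsSketch, the helper rootChildCount( ), and the sample <note> document are all hypothetical, and it uses the parse(InputSource) overload so that the XML can come from a String rather than a URI.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, used only for illustration.
final class ParseOptionsSketch {

    // Parses the XML text with the given factory options and returns the
    // number of child nodes directly under the document element.
    static int rootChildCount(String xml, boolean coalescing, boolean ignoreComments) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setCoalescing(coalescing);
            dbf.setIgnoringComments(ignoreComments);
            DocumentBuilder db = dbf.newDocumentBuilder();
            Document doc = db.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getChildNodes().getLength();
        } catch (Exception e) {
            // Wrap checked parser exceptions to keep the sketch short
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        // One text run, one CDATA section, and one comment
        String xml = "<note>plain<![CDATA[ <raw> ]]><!-- remark --></note>";
        // Defaults: text node + CDATA section node + comment node
        System.out.println(rootChildCount(xml, false, false));
        // Coalescing merges the CDATA into the text node; the comment is dropped
        System.out.println(rootChildCount(xml, true, true));
    }
}
```

With the defaults the three pieces of content arrive as three separate nodes; with coalescing and comment suppression enabled, only a single text node remains.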

7.4.2. Navigating the DOM Tree

The Document object provides the starting point for working with a DOM tree. Once the parser has produced a Document, your program can traverse the document structure and make changes. In addition, Document implements the Node interface, which is the core of DOM's tree structure, and provides methods for traversing the tree and retrieving information about the current node.

Each element, attribute, entity, and text string (indeed, every distinct component within an XML document) is represented in DOM as a node. To determine what kind of node you are working with, you can call the getNodeType( ) method. This returns one of the constants specified by the Node interface. All node objects have methods for dealing with child nodes, although not every node type can have children. The DOM API also provides a set of interfaces that map to each node type.

The most important DOM node types and their corresponding interfaces are listed in Table 7-1. If you attempt to add child elements to a node that doesn't support children, a DOMException is thrown.

Table 7-1. Important DOM node types

| Interface | Name property contains | Value | Children | Node constant |
|---|---|---|---|---|
| Attr | Name of attribute | Yes | No | ATTRIBUTE_NODE |
| CDATASection | #cdata-section | Yes | No | CDATA_SECTION_NODE |
| Comment | #comment | Yes | No | COMMENT_NODE |
| Document | #document | No | Yes | DOCUMENT_NODE |
| DocumentFragment | #document-fragment | No | Yes | DOCUMENT_FRAGMENT_NODE |
| DocumentType | Document type name | No | No | DOCUMENT_TYPE_NODE |
| Element | Tag name | No | Yes | ELEMENT_NODE |
| Entity | Entity name | No | No | ENTITY_NODE |
| EntityReference | Name of referenced entity | No | No | ENTITY_REFERENCE_NODE |
| ProcessingInstruction | PI target | Yes | No | PROCESSING_INSTRUCTION_NODE |
| Text | #text | Yes | No | TEXT_NODE |

For most applications, element nodes (identified by Node.ELEMENT_NODE) and text nodes (identified by Node.TEXT_NODE) are the most important. An element node is created when the parser encounters an XML markup tag. A text node is created when the parser encounters text that is not included within a tag. For example, if the input XML (we're using XHTML in this example) looks like this:

    <p> Here is some <b>boldface</b> text. </p>

The parser creates a top-level node that is an element node with a local name of p. The top-level node contains three child nodes: a text node containing "Here is some," an element node named b, and another text node containing "text." The b element node contains a single child text node containing the word "boldface."

The getNodeValue( ) method returns the contents of a text node or the value of other node types. It returns null for element nodes.

To iterate through a node's children, use the getFirstChild( ) method, which will return a Node reference. To retrieve subsequent child nodes, call the getNextSibling( ) method of the node that was returned by getFirstChild( ). To print the names of all the children of a particular Node (assume that the node variable is a valid Node):

    for (Node c = node.getFirstChild( ); c != null; c = c.getNextSibling( )) {
        System.out.println(c.getLocalName( ));
    }

Note that there is no getNextChild( ) method: when you walk the children this way, the only route from one child to the next is its getNextSibling( ) method. As a result, if you use the removeChild( ) method to remove one of a node's children, calls to the removed node's getNextSibling( ) method immediately return null.
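The sibling walk and the removeChild( ) pitfall can be sketched as follows, using the <p> fragment from the text. This is an illustrative sketch only: the class SiblingWalkSketch and its helper methods are hypothetical names, not part of the chapter's examples. The removal loop saves the next sibling before detaching the current node, which is one common way around the problem.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, used only for illustration.
final class SiblingWalkSketch {

    // Parses XML text and returns its document element.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    // Describes each child as "nodeName=nodeValue". Element nodes have a
    // null value, so they show up as "name=null".
    static List<String> describeChildren(Node parent) {
        List<String> out = new ArrayList<>();
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            out.add(c.getNodeName() + "=" + c.getNodeValue());
        }
        return out;
    }

    // Removes every text child of parent and returns how many were removed.
    // The next sibling is fetched BEFORE removal, because a detached node's
    // getNextSibling() returns null.
    static int removeTextChildren(Node parent) {
        int removed = 0;
        Node c = parent.getFirstChild();
        while (c != null) {
            Node next = c.getNextSibling();
            if (c.getNodeType() == Node.TEXT_NODE) {
                parent.removeChild(c);
                removed++;
            }
            c = next;
        }
        return removed;
    }

    public static void main(String[] args) {
        Element p = parseRoot("<p> Here is some <b>boldface</b> text. </p>");
        System.out.println(describeChildren(p)); // two #text entries around "b=null"
        System.out.println(removeTextChildren(p)); // 2
    }
}
```

Example 7-2 later in this section solves the same problem differently, by deferring each removal until the loop has already advanced past the node.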

7.4.2.1. Element attributes

Element attributes are accessed via the getAttributes( ) method, which returns a NamedNodeMap object. The NamedNodeMap contains a set of Node objects of type ATTRIBUTE_NODE. The getNodeValue( ) method can read the value of a particular attribute.

    NamedNodeMap atts = elementNode.getAttributes( );
    if (atts != null) {
        Node sizeNode = atts.getNamedItem("size");
        if (sizeNode != null) {
            // getNamedItem( ) returns null when the attribute is absent
            String size = sizeNode.getNodeValue( );
        }
    }

Alternatively, you can cast a node to its true type (in this case, an org.w3c.dom.Element) and retrieve attributes or other data more directly:

    if (myNode.getNodeType( ) == Node.ELEMENT_NODE) {
        Element myElement = (org.w3c.dom.Element)myNode;
        String attributeValue = myElement.getAttribute("attr");
        // attributeValue will be an empty string if "attr" does not exist
    }

This is often easier than retrieving attribute nodes from a NamedNodeMap.
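The two lookup styles differ in how they report a missing attribute, which matters when an attribute may legitimately be empty. The sketch below compares them; the class AttributeLookupSketch, its method names, and the sample <font> element are hypothetical and serve only as illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, used only for illustration.
final class AttributeLookupSketch {

    // Parses XML text and returns its document element.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    // NamedNodeMap style: a missing attribute yields a null Node,
    // which this helper reports as a null String.
    static String viaNodeMap(Element e, String name) {
        NamedNodeMap atts = e.getAttributes();
        Node attr = atts.getNamedItem(name);
        return attr == null ? null : attr.getNodeValue();
    }

    // Element style: getAttribute() returns "" for a missing attribute,
    // so hasAttribute() is needed to tell "missing" from "empty".
    static String viaElement(Element e, String name) {
        return e.hasAttribute(name) ? e.getAttribute(name) : null;
    }

    public static void main(String[] args) {
        Element font = parseRoot("<font size=\"+2\" face=\"\"/>");
        System.out.println(viaNodeMap(font, "size"));
        System.out.println(viaElement(font, "color"));
    }
}
```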

7.4.3. Manipulating DOM Trees

DOM is particularly useful when you need to manipulate the structure of an XML file. Example 7-2 is an HTML document "condenser." It loads an HTML file (which must be well-formed XML, although not necessarily XHTML), iterates through the tree, and preserves only the important content. In this case, that is text within <em>, <b>, <th>, <title>, <li>, and <h1> through <h6>, plus text within <font> tags whose size attribute starts with "+". All text nodes that aren't contained within one of these tags are removed. A more sophisticated algorithm is no doubt possible, but this one is good enough to demonstrate the DOM principle.

Example 7-2. DocumentCondenser

    import javax.xml.parsers.*;
    import javax.xml.transform.*;
    import javax.xml.transform.dom.*;
    import javax.xml.transform.stream.*;
    import org.w3c.dom.*;

    public class DocumentCondenser {

        public static void main(String[] args) throws Exception {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance( );
            // For HTML, we don't want to validate without a DTD
            dbf.setValidating(false);
            // Keep whitespace-only text nodes (true would drop them, but only
            // when the parser is validating against a DTD)
            dbf.setIgnoringElementContentWhitespace(false);
            dbf.setExpandEntityReferences(true);
            dbf.setCoalescing(true);
            // Ensure that getLocalName() returns the HTML element name
            dbf.setNamespaceAware(true);

            DocumentBuilder db = null;
            try {
                db = dbf.newDocumentBuilder( );
            } catch (ParserConfigurationException pce) {
                pce.printStackTrace();
                return;
            }

            Document html = null;
            try {
                html = db.parse("enterprisexml.html");
                process(html);

                // Use the XSLT Transformer to see the output
                TransformerFactory tf = TransformerFactory.newInstance();
                Transformer output = tf.newTransformer();
                output.transform(new DOMSource(html), new StreamResult(System.out));
            } catch (Exception ex) {
                ex.printStackTrace();
                return;
            }
        }

        /* We want to keep text if the parent is <em>, <title>, <b>, <li>, <th>
           or <h1>..<h6>. We also want to keep text if it is in a <font> tag
           with a size attribute set to a larger than normal size */
        private static boolean keepText(Node parentNode) {
            if (parentNode == null) return true; // top level
            String parentName = parentNode.getLocalName();
            if ((parentName.equalsIgnoreCase("em")) ||
                (parentName.equalsIgnoreCase("title")) ||
                (parentName.equalsIgnoreCase("b")) ||
                (parentName.equalsIgnoreCase("li")) ||
                (parentName.equalsIgnoreCase("th")) ||
                ((parentName.toLowerCase().startsWith("h")) &&
                 (parentName.length() == 2))) {
                return true;
            }
            if ((parentNode.getNodeType() == Node.ELEMENT_NODE) &&
                (parentName.equalsIgnoreCase("font"))) {
                NamedNodeMap atts = parentNode.getAttributes();
                if (atts != null) {
                    Node sizeNode = atts.getNamedItem("size"); // get an attribute Node
                    if (sizeNode != null) {
                        if (sizeNode.getNodeValue().startsWith("+")) {
                            return true;
                        }
                    }
                }
            }
            return false;
        }

        private static void process(Node node) {
            Node c = null;
            Node delNode = null;
            for (c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
                // Remove the node flagged on the previous pass; by now the
                // loop has already moved on to its next sibling
                if (delNode != null) {
                    delNode.getParentNode().removeChild(delNode);
                }
                delNode = null;
                if ((c.getNodeType() == Node.TEXT_NODE) &&
                    (!keepText(c.getParentNode()))) {
                    delNode = c;
                } else if (c.getNodeType() != Node.TEXT_NODE) {
                    process(c);
                }
            } // End For
            if (delNode != null) // Delete, if the last child was text
                delNode.getParentNode().removeChild(delNode);
        }
    }

After the DOM tree has been processed, use the JAXP XSLT API to output new HTML. We will discuss how to use XSL with JAXP in the next section.

If you want to replace the text with a condensed version, call the setNodeValue( ) method of Node when processing a text node.
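As a rough illustration of that idea, the sketch below rewrites each text node in place with setNodeValue( ) instead of deleting it. The "condensed version" here is simply the text with runs of whitespace collapsed, which is an assumption made for the example; the class TextCondenserSketch and its methods are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, used only for illustration.
final class TextCondenserSketch {

    // Parses XML text into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    // Walks the tree and replaces the value of every text node with a
    // whitespace-collapsed, trimmed copy. No nodes are added or removed,
    // so the plain getNextSibling() loop is safe here.
    static void collapseWhitespace(Node node) {
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE) {
                String condensed = c.getNodeValue().replaceAll("\\s+", " ").trim();
                c.setNodeValue(condensed);
            } else {
                collapseWhitespace(c);
            }
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<p>  Hello \n   world  <b> bold   text </b></p>");
        collapseWhitespace(doc);
        System.out.println(doc.getDocumentElement().getFirstChild().getNodeValue());
    }
}
```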

7.4.4. Extending DOM Trees

Manipulating DOM trees falls, broadly, into three categories. We can add, remove, and modify nodes on existing trees; we can create new trees; and finally, we can merge trees together.

Back in the last example, we saw how to delete nodes from a DOM tree with the removeChild( ) method. If we want to add new nodes, we have two options. While there is no direct way to instantiate a new Node object, we can copy an existing Node using its cloneNode( ) method. The cloneNode( ) method takes a single Boolean parameter, which specifies whether the node's children will be cloned as well:

    Node newNodeWithChildren = oldElementNode.cloneNode(true);
    Node childlessNode = oldElementNode.cloneNode(false);

Regardless of whether children are cloned, clones of an element node will include all of the attributes of the original node. The DOM specification leaves certain cloning behaviors, specifically those for Document, DocumentType, Entity, and Notation nodes, up to the implementation.
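The difference between deep and shallow clones can be checked directly. This is a small sketch under assumed names (CloneSketch and its helpers are hypothetical); the sample <order> element is invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper class, used only for illustration.
final class CloneSketch {

    // Parses XML text and returns its document element.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    // Number of children carried over by a deep or shallow clone.
    static int childCountOfClone(Element original, boolean deep) {
        return original.cloneNode(deep).getChildNodes().getLength();
    }

    // Attribute value as seen on the clone; attributes are copied
    // whether or not the clone is deep.
    static String attributeOfClone(Element original, String name, boolean deep) {
        return ((Element) original.cloneNode(deep)).getAttribute(name);
    }

    public static void main(String[] args) {
        Element order = parseRoot("<order orderno=\"123433\"><item/><item/></order>");
        System.out.println(childCountOfClone(order, true));
        System.out.println(childCountOfClone(order, false));
        System.out.println(attributeOfClone(order, "orderno", false));
    }
}
```

A clone has no parent until it is inserted somewhere, so it does not appear in the tree until you add it with one of the insertion methods.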

New nodes can also be created via the createXXX( ) methods of the Document object. The createElement( ) method accepts a String containing the new element name and returns a new element Node. The createElementNS( ) method does the same thing, but accepts two parameters, a namespace URI and a qualified element name. The createAttribute( ) method also has a version that is namespace-aware, createAttributeNS( ).

Once a new Node is created, it can be inserted or appended into the tree using the appendChild( ), insertBefore( ), and replaceChild( ) methods. Attribute nodes can be inserted into the NamedNodeMap returned by the getAttributes( ) method of Node. You can also add attributes to an element by casting the Node to an Element and calling setAttribute( ).
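The insertion methods can be sketched together on a small tree. Everything specific here is an assumption made for illustration: the class InsertNodesSketch, the element names, and the namespace URI http://example.com/ns/notes are hypothetical. The sketch starts from an empty Document obtained with DocumentBuilder.newDocument( ), a shortcut that skips DOMImplementation.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper class, used only for illustration.
final class InsertNodesSketch {

    static Document build() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no parser available", e);
        }

        Element order = doc.createElement("order");
        doc.appendChild(order);                 // becomes the document element

        Element item = doc.createElement("item");
        order.appendChild(item);                // appended as the last child

        Element customer = doc.createElement("customer");
        order.insertBefore(customer, item);     // placed ahead of <item>

        Element lineItem = doc.createElement("lineitem");
        order.replaceChild(lineItem, item);     // <item> is swapped out

        // Attribute set directly on the Element
        order.setAttribute("orderno", "123433");

        // Attribute node inserted through the NamedNodeMap
        Attr status = doc.createAttribute("status");
        status.setValue("open");
        order.getAttributes().setNamedItem(status);

        // Namespace-aware creation: namespace URI plus qualified name
        Element note = doc.createElementNS("http://example.com/ns/notes", "n:note");
        order.appendChild(note);
        return doc;
    }

    // Node names of the direct children, in document order.
    static List<String> childNames(Node parent) {
        List<String> names = new ArrayList<>();
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            names.add(c.getNodeName());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(childNames(build().getDocumentElement()));
    }
}
```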

Creating new trees involves creating a new Document object. The easiest way to do this is via the DOMImplementation interface. An implementation of DOMImplementation can be retrieved from a DocumentBuilder object via the getDOMImplementation( ) method. Example 7-3 builds an XML order document from a blank slate.

Example 7-3. TreeBuilder

    import javax.xml.parsers.*;
    import javax.xml.transform.*;
    import javax.xml.transform.dom.*;
    import javax.xml.transform.stream.*;
    import org.w3c.dom.*;

    public class TreeBuilder {

        public static void main(String[] args) {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setValidating(false);

            DocumentBuilder db = null;
            try {
                db = dbf.newDocumentBuilder();
            } catch (ParserConfigurationException pce) {
                pce.printStackTrace();
                return;
            }

            Document doc = db.getDOMImplementation().createDocument(null,
                "orders", null);

            // create the initial document element
            Element orderNode = doc.createElement("order");
            orderNode.setAttribute("orderno", "123433");

            Node item = doc.createElement("item");
            Node subitem = doc.createElement("number");
            subitem.appendChild(doc.createTextNode("3AGM-5"));
            item.appendChild(subitem);

            subitem = doc.createElement("handling");
            subitem.appendChild(doc.createTextNode("With Care"));
            item.appendChild(subitem);

            orderNode.appendChild(item);
            doc.getDocumentElement().appendChild(orderNode);

            // View the output
            try {
                TransformerFactory tf = TransformerFactory.newInstance();
                Transformer output = tf.newTransformer();
                output.transform(new DOMSource(doc), new StreamResult(System.out));
            } catch (TransformerException e) {
                e.printStackTrace();
            }
        }
    }

The second parameter to createDocument( ) specifies the name of the base document element, in this case the <orders> tag. Subsequent tags can be appended to the base tag. If we were to look at the results of this program as regular XML, it would look like this (we've added some whitespace formatting to make it more readable):

    <?xml version="1.0" encoding="UTF-8"?>
    <orders>
      <order orderno="123433">
        <item>
          <number>3AGM-5</number>
          <handling>With Care</handling>
        </item>
      </order>
    </orders>

You've probably noticed that each Node implementation we've created has been based on a particular instance of the Document object. Since each node is related to its parent document, we can't go around inserting one document's nodes into another document without triggering an exception. The solution is to use the importNode( ) method of Document, which creates a copy of a node from another document. The original node from the source document is left untouched. Here's an example that takes the <orders> tag from the first document and puts it into a new document under an <ordersummary> tag:

    Document doc2 = db.getDOMImplementation( ).createDocument(
        null, "ordersummary", null);
    DocumentFragment df = doc.createDocumentFragment( );
    df.appendChild(doc.getDocumentElement( ).cloneNode(true));
    doc2.getDocumentElement( ).appendChild(doc2.importNode(df, true));

We use a DocumentFragment object to hold the data we're moving. Document fragment nodes provide a lightweight structure for dealing with subsets of a document. Fragments must be well-formed XML, but don't need to be DTD-conformant and can have multiple top-level children. When appending a document fragment to a document tree, the DocumentFragment node itself is ignored, and its children are appended directly to the parent node. In this example, we cloned the source element when creating the document fragment, since assigning a node to a fragment releases the node's relationship with its previous parent. The XML in the second document object looks like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <ordersummary>
      <orders>
        <order orderno="123433">
          <item><number>3AGM-5</number>
          <handling>With Care</handling></item>
        </order>
      </orders>
    </ordersummary>
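When only a single subtree needs to move, importNode( ) can be applied to the source element directly, since it makes its own copy and leaves the source untouched. The following sketch shows that variant; it is an alternative to the fragment-based code above rather than a replacement for it, and the class ImportNodeSketch and its method names are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper class, used only for illustration.
final class ImportNodeSketch {

    // Creates an empty document whose document element has the given name.
    static Document newDocument(String rootName) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .getDOMImplementation().createDocument(null, rootName, null);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no parser available", e);
        }
    }

    // Copies the source's document element (and its descendants) into the
    // target document, under the target's document element. Returns the copy.
    static Node copyRootInto(Document source, Document target) {
        // importNode() produces a copy owned by the target document;
        // the copy has no parent until it is appended.
        Node copy = target.importNode(source.getDocumentElement(), true);
        return target.getDocumentElement().appendChild(copy);
    }

    public static void main(String[] args) {
        Document orders = newDocument("orders");
        orders.getDocumentElement().appendChild(orders.createElement("order"));
        Document summary = newDocument("ordersummary");
        Node copy = copyRootInto(orders, summary);
        System.out.println(copy.getOwnerDocument() == summary);
    }
}
```

A DocumentFragment remains the more convenient container when several sibling nodes have to be carried across together.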

7.5. XSLT

Another specification incorporated into the JAXP API is XSL Transformations (XSLT), part of the Extensible Stylesheet Language (XSL) family. An XSLT transformation takes an input XML document and transforms it into an output format (not necessarily XML) according to a set of rules specified in an XSL stylesheet. One common application of XSL stylesheets is transforming XML into HTML for presentation in a browser; this is often done in the browser directly or on the web server via a content management system such as the Apache project's Cocoon (http://cocoon.apache.org).

The following XSL document converts the orders.xml file from the SAX example into HTML:

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:fo="http://www.w3.org/1999/XSL/Format">

      <xsl:template match="orders">
        <html>
          <head><title>Order Summary</title></head>
          <body>
            <xsl:apply-templates/>
          </body>
        </html>
      </xsl:template>

      <xsl:template match="order">
        <h1>Order Number <xsl:value-of select="@idnumber"/></h1>
        Ship To:
        <pre>
          <xsl:value-of select="shippingaddr"/>
        </pre>
        <ul>
          <xsl:apply-templates select="item"/>
        </ul>
      </xsl:template>

      <xsl:template match="item">
        <li><xsl:value-of select="@quantity"/> of item
        <xsl:value-of select="@idnumber"/><xsl:apply-templates/></li>
      </xsl:template>

      <xsl:template match="handling">
        <br/>Special Instructions: <xsl:value-of select="."/>
      </xsl:template>

    </xsl:stylesheet>

The XSL file consists of a series of templates. The XSL processor matches each template to a particular XML tag and replaces the tag with the template contents. For example, when the processor encounters an <orders> and an </orders> tag, it replaces them with:

    <html>
      <head><title>Order Summary</title></head>
      <body>
        <xsl:apply-templates/>
      </body>
    </html>

The <xsl:apply-templates/> command tells the XSLT processor to recursively apply all of the available templates to the XML content contained within the <orders> tag pair. Other XSL commands allow the processor to display element attributes, limit recursion to specific templates, and so forth. For more examples, please check one of the books recommended at the beginning of this chapter. After running the orders.xml file through the stylesheet, we get this HTML:

    <html xmlns:fo="http://www.w3.org/1999/XSL/Format">
    <head>
    <META http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <title>Order Summary</title>
    </head>
    <body>
    <h1>Order Number 3123</h1>
    Ship To:
    <pre>One Main St Boston, MA 02112</pre>
    <ul>
    <li>13 of item 7231</li>
    <li>2 of item 1296
    <br>Special Instructions: Please embroider in a tasteful manner!
    </li>
    </ul>
    </body>
    </html>

The classes in the JAXP javax.xml.transform package allow programs to transform XML documents through Transformer objects backed by an XSLT processor, such as Apache Xalan (which ships with the JAXP reference implementation).

At this point, we should point out that JAXP doesn't require XSLT at all: instead, it is a generic transformations API that theoretically can be used with a variety of transformation systems. However, XSLT is by far the most popular and widely implemented, so the remainder of this section (and most books on the subject) will assume that XSLT is used as the transformation system.

7.5.1. JAXP Data Sources

The DOM example earlier in this chapter briefly introduced the TransformerFactory class. TransformerFactory works just like the SAX and DOM parser factories, but returns a Transformer object instead of a parser. The default behavior of a Transformer is to pass the input XML through without modification, which is what will occur if no transformation stylesheet is given.

Input and output from a transformer are handled via the javax.xml.transform.Source and javax.xml.transform.Result interfaces. Each has three implementing classes, one each for DOM, SAX, and streams (DOMSource, SAXSource, StreamSource, DOMResult, SAXResult, and StreamResult). Note that the JAXP processors will not necessarily support all six. We handled the output in the DOM example by creating a DOMSource and outputting to a StreamResult that targeted System.out (in this example, the document variable contains a DOM Document object):

    TransformerFactory tf = TransformerFactory.newInstance( );
    Transformer output = tf.newTransformer( );
    output.transform(new DOMSource(document), new StreamResult(System.out));
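The same identity transformation can write into a String instead of System.out, and output properties let you adjust the serialized form. The sketch below is one way to do this under assumed names: the class SerializeSketch and its helpers are hypothetical, while OutputKeys.OMIT_XML_DECLARATION is a standard JAXP output property.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, used only for illustration.
final class SerializeSketch {

    // Parses XML text into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    // Serializes a DOM node with an identity transformation
    // (no stylesheet), leaving out the <?xml ...?> declaration.
    static String toXml(Node node) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<orders><order orderno=\"1\"/></orders>");
        System.out.println(toXml(doc));
    }
}
```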

The Source objects can also specify the source of the stylesheet used for the conversion. If we want to use a stylesheet located in /home/will/orderdisplay.xsl:

    TransformerFactory tf = TransformerFactory.newInstance( );
    Transformer output = tf.newTransformer(
        new StreamSource("file:///home/will/orderdisplay.xsl"));
    output.transform(new DOMSource(document), new StreamResult(System.out));
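A StreamSource does not have to point at a file. As a self-contained sketch, the stylesheet and the input below are both supplied as in-memory strings. The class StylesheetSketch, the constant SAMPLE_XSL, and the tiny text-output stylesheet are hypothetical and much simpler than the order stylesheet shown earlier.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class, used only for illustration.
final class StylesheetSketch {

    // A minimal stylesheet: emits "Order <idnumber>;" for each <order>.
    static final String SAMPLE_XSL =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"text\"/>"
        + "<xsl:template match=\"order\">"
        + "Order <xsl:value-of select=\"@idnumber\"/>;"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet text to the XML text and returns the result.
    static String transform(String xsl, String xml) {
        try {
            TransformerFactory tf = TransformerFactory.newInstance();
            // The Source passed to newTransformer() is the stylesheet
            Transformer t = tf.newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            // The Source passed to transform() is the document to convert
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<orders><order idnumber=\"3123\"/><order idnumber=\"3124\"/></orders>";
        System.out.println(transform(SAMPLE_XSL, xml));
    }
}
```

The <orders> element has no template of its own here, so XSLT's built-in rules descend into its children and the order template fires once per <order>.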

The different source and result types can streamline processing. Rather than load an XML file, transform it, write it to disk, reload it, and parse it with a SAX ContentHandler, the transformation process can feed its results directly into a ContentHandler that can deal with the transformation results.

Example 7-4 shows transformation of a document and its immediate processing by a SAX ContentHandler. We'll use the OrderHandler program from earlier in the chapter as the content handler, and we'll use the same orders.xml file. To keep things simple, we'll do a one-to-one transformation, rather than use an XSLT file to actually alter the structure of the orders.xml file.

Example 7-4. Transforming a document into a SAXResult

    import javax.xml.transform.*;
    import javax.xml.transform.sax.*;
    import javax.xml.transform.stream.*;
    import org.xml.sax.*;
    import org.xml.sax.helpers.*;

    public class SAXTransformTarget {

        public static void main(String[] args) {
            try {
                StreamSource ss = new StreamSource("orders.xml");
                SAXResult sr = new SAXResult(new OrderHandler( ));

                TransformerFactory tf = TransformerFactory.newInstance( );
                Transformer t = tf.newTransformer( );
                t.transform(ss, sr);
            } catch (TransformerConfigurationException e) {
                e.printStackTrace( );
            } catch (TransformerException e) {
                e.printStackTrace( );
            }
        }
    }

The output from this program should be identical to the output from Example 7-1; we've simply replaced the XMLReader with a transformation stream.
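For readers who want to try this without the OrderHandler class at hand, here is a self-contained sketch of the same pattern with a much simpler handler. The class SaxResultSketch and its nested ElementCounter are hypothetical stand-ins; the handler only counts start tags by local name.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, used only for illustration.
final class SaxResultSketch {

    // A stand-in for OrderHandler: counts start tags by local name.
    static final class ElementCounter extends DefaultHandler {
        final Map<String, Integer> counts = new TreeMap<>();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            counts.merge(localName, 1, Integer::sum);
        }
    }

    // Runs an identity transformation whose output is delivered as SAX
    // events to the counting handler, then returns the counts.
    static Map<String, Integer> countElements(String xml) {
        try {
            ElementCounter counter = new ElementCounter();
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.transform(new StreamSource(new StringReader(xml)), new SAXResult(counter));
            return counter.counts;
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countElements("<orders><order><item/><item/></order></orders>"));
    }
}
```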

7.5.1.1. Determining data source support

The TransformerFactory class includes a method named getFeature( ), which takes a String and returns true if the feature identified by the String is supported by the processor. Each of the Source and Result implementations includes a String constant named FEATURE. So to determine whether the XSL processor supports DOM source:

    TransformerFactory tf = TransformerFactory.newInstance( );
    boolean supportsDOMSource = tf.getFeature(DOMSource.FEATURE);
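The same check extends naturally to all six Source and Result types. The sketch below collects the answers into a map; the class FeatureCheckSketch and the method supportedTypes( ) are hypothetical names, and the values returned depend on whichever processor the factory locates at runtime.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class, used only for illustration.
final class FeatureCheckSketch {

    // Asks the default TransformerFactory about each Source/Result type.
    static Map<String, Boolean> supportedTypes() {
        TransformerFactory tf = TransformerFactory.newInstance();
        Map<String, Boolean> support = new LinkedHashMap<>();
        support.put("DOMSource", tf.getFeature(DOMSource.FEATURE));
        support.put("DOMResult", tf.getFeature(DOMResult.FEATURE));
        support.put("SAXSource", tf.getFeature(SAXSource.FEATURE));
        support.put("SAXResult", tf.getFeature(SAXResult.FEATURE));
        support.put("StreamSource", tf.getFeature(StreamSource.FEATURE));
        support.put("StreamResult", tf.getFeature(StreamResult.FEATURE));
        return support;
    }

    public static void main(String[] args) {
        supportedTypes().forEach((type, ok) ->
            System.out.println(type + ": " + (ok ? "supported" : "not supported")));
    }
}
```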

7.5.1.2. Custom URI resolution

When processing XSL files, it is sometimes necessary to resolve relative URIs. Ordinarily, the parser will do the best possible job, but sometimes it is necessary to override this behavior (for instance, in a web content management system in which the document tree apparent to the content creator and client might not match the system structure or when XML output from a servlet is being transformed within the servlet). In these cases, the setURIResolver(URIResolver) method of Transformer allows you to specify resolution behavior by implementing the URIResolver interface. The resolve( ) method of URIResolver must return a Source or null:

    class XSLResolver implements URIResolver {
        public Source resolve(String href, String base)
                throws TransformerException {
            // Check for a null base URI, and provide one if so
            if ((base == null) || (base.equals("/servlet/")))
                base = "http://www.oreilly.com/catalog/jentnut3/";
            if (href == null)
                return null;
            return new StreamSource(base + href);
        }
    }
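Plain string concatenation works when every href is a simple filename, but it does not handle references such as ../shared.xsl. A variant that delegates to java.net.URI for the resolution rules might look like the following sketch. The class BaseUriResolverSketch, the helper systemIdFor( ), and the base URL http://example.com/xsl/ are all hypothetical.

```java
import java.net.URI;
import javax.xml.transform.Source;
import javax.xml.transform.TransformerException;
import javax.xml.transform.URIResolver;
import javax.xml.transform.stream.StreamSource;

// Hypothetical resolver, used only for illustration.
final class BaseUriResolverSketch implements URIResolver {

    private final URI defaultBase;

    // defaultBase is used whenever the processor supplies no base URI.
    BaseUriResolverSketch(String defaultBase) {
        this.defaultBase = URI.create(defaultBase);
    }

    @Override
    public Source resolve(String href, String base) throws TransformerException {
        if (href == null) {
            return null; // null lets the processor fall back to its default
        }
        try {
            URI baseUri = (base == null || base.isEmpty())
                    ? defaultBase : URI.create(base);
            // URI.resolve() applies the standard relative-reference rules,
            // including "../" segments
            return new StreamSource(baseUri.resolve(href).toString());
        } catch (IllegalArgumentException e) {
            throw new TransformerException("Cannot resolve " + href, e);
        }
    }

    // Convenience for experiments: the system ID the resolver would hand back.
    static String systemIdFor(URIResolver resolver, String href, String base) {
        try {
            Source s = resolver.resolve(href, base);
            return s == null ? null : s.getSystemId();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        URIResolver r = new BaseUriResolverSketch("http://example.com/xsl/");
        System.out.println(systemIdFor(r, "common/header.xsl", null));
        System.out.println(systemIdFor(r, "../shared.xsl",
                "http://example.com/xsl/orders/main.xsl"));
    }
}
```

A resolver like this is registered with transformer.setURIResolver(...). TransformerFactory has a setURIResolver( ) method as well, which also covers xsl:import and xsl:include references met while the stylesheet is being compiled.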

Finally, JAXP supports XSLT transformations that can convert between DOM, SAX, and streams; transform an XML document based on an XSL stylesheet; or both.
