Post on 16-Jan-2016
SDPL 2011 3.4 Streaming API for XML 1
3.4 Streaming API for XML (StAX)
Could we process XML documents more conveniently than with SAX, and yet more efficiently?
A: Yes, with the Streaming API for XML (StAX)
– general introduction
– an example
– comparison with SAX
StAX: General
Latest of the standard Java XML parser interfaces
– Origin: the XMLPull API (A. Slominski, ~2000)
– developed as a Java Community Process led by BEA Systems (2003)
– included in JAXP 1.4, in Java WSDP 1.6, and in Java SE 6 (JDK 1.6)
An event-driven streaming API, like SAX
– does not build an in-memory representation
A "pull API"
– lets the application ask for individual events
– unlike a "push API" like SAX

Advantages of Pull Parsing
A pull API provides events, on demand, from the chosen stream
– can cancel parsing, say, after processing the header of a long message
– can read multiple documents simultaneously
– application-controlled access (~ iterator design pattern) is usually simpler than SAX-style call-backs (~ observer design pattern)
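
The "cancel parsing after the header" point can be illustrated with a minimal sketch. The class name HeaderScan, the method firstText, and the sample message are made up for this illustration; only the javax.xml.stream calls are standard StAX. Because the application drives the loop, it simply returns once it has what it needs and never asks for the remaining events.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of the original slides.
final class HeaderScan {

    /**
     * Returns the text content of the first element with the given local
     * name, or null if there is none. Parsing stops as soon as the element
     * has been read; later events are never requested.
     */
    static String firstText(String xml, String localName) {
        try {
            XMLStreamReader xsr = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                while (xsr.hasNext()) {
                    if (xsr.next() == XMLStreamConstants.START_ELEMENT
                            && xsr.getLocalName().equals(localName)) {
                        // Pull the element's text, then stop asking for events.
                        return xsr.getElementText();
                    }
                }
                return null;
            } finally {
                xsr.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String msg = "<msg><header><id>42</id></header>"
                   + "<body>a long payload</body></msg>";
        System.out.println(firstText(msg, "id"));
    }
}
```

With SAX, stopping early typically means throwing an exception out of a call-back; here it is an ordinary return from a loop.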
Cursor and Iterator APIs
StAX consists of two sets of APIs
– (1) cursor APIs, and (2) iterator APIs
– they differ in how parse events are represented
(1) cursor API XMLStreamReader
– lower-level
– methods hasNext() and next() to scan events, represented as int constants START_DOCUMENT, START_ELEMENT, ...
– access methods, depending on the current event type: getName(), getAttributeValue(..), getText(), ...
(2) XMLEventReader Iterator API
XMLEventReader provides the contents of an XML document to the application using an event object iterator
Parse events are represented as immutable XMLEvent objects
– received using methods hasNext() and nextEvent()
– event properties are accessed through their methods
– can be stored (if needed)
– require more resources than the cursor API (see later)
Event lookahead, without advancing in the stream, with XMLEventReader.peek() and XMLStreamReader.getEventType()
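
A small sketch of the lookahead methods just mentioned. The class LookaheadSketch and its method names are invented for illustration, and the behaviour is as observed with the JDK's bundled StAX implementation. peek() returns the upcoming event without consuming it, so a following nextEvent() delivers the same event; getEventType() reports the cursor's current event and can be called repeatedly without moving.

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class, not part of the original slides.
final class LookaheadSketch {

    /** Returns "peekedName|readName" for the first start tag, or null. */
    static String peekThenRead(String xml) {
        try {
            XMLEventReader xer = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml));
            // Consume events (e.g. StartDocument) until the NEXT one is a start tag.
            while (xer.hasNext() && !xer.peek().isStartElement()) {
                xer.nextEvent();
            }
            if (!xer.hasNext()) {
                xer.close();
                return null;
            }
            // Look ahead first, then actually read: both see the same start tag.
            String peeked = xer.peek().asStartElement().getName().getLocalPart();
            String read = xer.nextEvent().asStartElement().getName().getLocalPart();
            xer.close();
            return peeked + "|" + read;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    /** Returns the cursor's event type right after creation, or -1 if it changed between calls. */
    static int initialEventType(String xml) {
        try {
            XMLStreamReader xsr = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            int first = xsr.getEventType();   // does not advance the cursor
            int second = xsr.getEventType();  // so this sees the same event
            xsr.close();
            return first == second ? first : -1;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(peekThenRead("<root><child/></root>"));
    }
}
```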
Writing APIs
StAX is a bidirectional API
– also allows writing XML data, through an XMLStreamWriter or an XMLEventWriter
– useful for "marshaling" data structures into XML
Writers are not required to enforce well-formedness (not to mention validity)
– they provide some support: escaping of reserved characters like & and <, and adding missing end tags
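
The writer support described above can be sketched as follows. The class WriterSketch and the element names are made up for this illustration, and the exact serialization may differ between StAX implementations. writeCharacters() escapes reserved characters, and writeEndDocument() closes any elements that are still open.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical demo class, not part of the original slides.
final class WriterSketch {

    /** Serializes a tiny "note" structure and returns the XML text. */
    static String write(String recipient) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter xsw = XMLOutputFactory.newInstance()
                    .createXMLStreamWriter(out);
            xsw.writeStartElement("note");
            xsw.writeAttribute("lang", "en");
            xsw.writeStartElement("to");
            xsw.writeCharacters(recipient); // & and < are escaped by the writer
            // No explicit writeEndElement() calls: writeEndDocument()
            // closes the still-open <to> and <note> elements.
            xsw.writeEndDocument();
            xsw.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not write output", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(write("A & B < C"));
    }
}
```

Note that nothing stops the application from writing, say, two root elements; the writer helps with escaping and closing tags but does not check well-formedness in general.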
Example of Using StAX (1/6)
Use the StAX iterator interfaces to
– fold element tag names to uppercase, and to
– strip comments
Outline:
– Initialize
  » an XMLEventReader for the input document
  » an XMLEventWriter (for System.out)
  » an XMLEventFactory for creating modified StartElement and EndElement events
– Use them to read all input events, and to write some of them, possibly modified
StAX example (2/6)
First import the relevant interfaces and classes:
    import java.io.*;
    import javax.xml.stream.*;
    import javax.xml.stream.events.*;
    import javax.xml.namespace.QName;

    public class capitalizeTags {
      public static void main(String[] args)
          throws FactoryConfigurationError, XMLStreamException, IOException {
        if (args.length != 1) System.exit(1);
        InputStream input = new FileInputStream(args[0]);
StAX example (3/6)
Initialize the XMLEventReader/Writer/Factory:
        XMLInputFactory xif = XMLInputFactory.newInstance();
        xif.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, true);
        XMLEventReader xer = xif.createXMLEventReader(input);

        XMLOutputFactory xof = XMLOutputFactory.newInstance();
        XMLEventWriter xew = xof.createXMLEventWriter(System.out);

        XMLEventFactory xef = XMLEventFactory.newInstance();
StAX example (4/6)
Iterate over the events of the InputStream:
        while (xer.hasNext()) {
          XMLEvent inEvent = xer.nextEvent();
          if (inEvent.isStartElement()) {
            StartElement se = (StartElement) inEvent;
            QName inQName = se.getName();
            String localName = inQName.getLocalPart();
            xew.add( xef.createStartElement(
                       inQName.getPrefix(),
                       inQName.getNamespaceURI(),
                       localName.toUpperCase(),
                       se.getAttributes(),
                       se.getNamespaces() ) );
StAX example (5/6)
Event iteration continues, to capitalize end tags:
          } else if (inEvent.isEndElement()) {
            EndElement ee = (EndElement) inEvent;
            QName inQName = ee.getName();
            String localName = inQName.getLocalPart();
            xew.add( xef.createEndElement(
                       inQName.getPrefix(),
                       inQName.getNamespaceURI(),
                       localName.toUpperCase(),
                       ee.getNamespaces() ) );
StAX example (6/6)
Output other events, except for comments; finish when the input ends:
          } else if (inEvent.getEventType() != XMLStreamConstants.COMMENT) {
            xew.add(inEvent);
          }
        } // while (xer.hasNext())
        xer.close(); input.close();
        xew.flush(); xew.close();
      } // main()
    } // class capitalizeTags
Efficiency of Streaming APIs?
An experiment of SAX vs StAX for scanning documents
Task: Count and report the number of elements, attributes, character fragments, and total character length
Inputs: Similar prose-oriented documents of different sizes
– repeated fragments of the W3C XML Schema Rec (Part 1)
Tested on OpenJDK 1.6.0 (different updates), with
– Red Hat Linux 6.0.52, 3 GHz Pentium, 1 GB RAM ("OLD")
– 64-bit CentOS Linux 5, 2.93 GHz Intel Core 2 Duo, 4 GB RAM ("NEW")
Essentials of the SAX Solution
Obtain and use a JAXP SAX parser:
    String docFile; // initialized from cmd line
    SAXParserFactory spf = SAXParserFactory.newInstance();
    spf.setValidating(validate); // from cmd option
    spf.setNamespaceAware(true);
    SAXParser sp = spf.newSAXParser();
    CountHandler ch = new CountHandler();
    sp.parse( new File(docFile), ch );
    ch.printResult(); // print the statistics
SAX Solution: CountHandler
    public static class CountHandler extends DefaultHandler {
      // Instance vars for statistics:
      int elemCount = 0, charFragCount = 0,
          totalCharLen = 0, attrCount = 0;

      public void startElement(String nsURI, String locName,
                               String qName, Attributes atts) {
        elemCount++;
        attrCount += atts.getLength();
      }

      public void characters(char[] buf, int start, int length) {
        charFragCount++;
        totalCharLen += length;
      }
      // (printResult() not shown)
Essentials of the StAX Solution
First, initialize:
    XMLInputFactory xif = XMLInputFactory.newInstance();
    xif.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, true);
    InputStream input = new FileInputStream( docFile );
    int elemCount = 0, charFragCount = 0,
        totalCharLen = 0, attrCount = 0;
Then parse the InputStream, using (a) the cursor API, or (b) the event iterator API
(a) StAX Cursor API Solution (1)
    XMLStreamReader xsr = xif.createXMLStreamReader(input);
    while (xsr.hasNext()) {
      int eventType = xsr.next();
      switch (eventType) {
        case XMLEvent.START_ELEMENT:
          elemCount++;
          attrCount += xsr.getAttributeCount();
          break;
(a) StAX Cursor API Solution (2)
        case XMLEvent.CHARACTERS:
          charFragCount++;
          totalCharLen += xsr.getTextLength();
          break;
        default: break;
      } // switch
    } // while (xsr.hasNext())
    xsr.close();
    input.close();
(b) StAX Iterator API Solution (1)
    XMLEventReader xer = xif.createXMLEventReader(input);
    while (xer.hasNext()) {
      XMLEvent event = xer.nextEvent();
      if (event.isStartElement()) {
        elemCount++;
        Iterator attrs = event.asStartElement().getAttributes();
        while (attrs.hasNext()) {
          attrs.next(); attrCount++;
        }
      } // if (event.isStartElement())
(b) StAX Iterator API Solution (2)
      if (event.isCharacters()) {
        charFragCount++;
        totalCharLen += ((Characters) event).getData().length();
      }
    } // while (xer.hasNext())
    xer.close();
    input.close();
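
To check that the SAX and StAX solutions count the same things, the two approaches can be run side by side on a small in-memory document. This is only a sketch: the class CountBoth, its method names, and the sample document are invented for illustration, and only element and attribute counts are compared, since parsers are free to split character data into fragments differently.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical comparison harness, not part of the original slides.
final class CountBoth {

    /** Returns {elements, attributes}, counted with the StAX cursor API. */
    static int[] countWithStax(String xml) {
        try {
            XMLStreamReader xsr = XMLInputFactory.newInstance().createXMLStreamReader(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            int elems = 0, attrs = 0;
            while (xsr.hasNext()) {
                if (xsr.next() == XMLStreamConstants.START_ELEMENT) {
                    elems++;
                    attrs += xsr.getAttributeCount();
                }
            }
            xsr.close();
            return new int[] {elems, attrs};
        } catch (Exception e) {
            throw new IllegalStateException("StAX parsing failed", e);
        }
    }

    /** Returns {elements, attributes}, counted with a SAX call-back handler. */
    static int[] countWithSax(String xml) {
        final int[] counts = new int[2];
        try {
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true);
            spf.newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                    new DefaultHandler() {
                        @Override
                        public void startElement(String nsURI, String locName,
                                                 String qName, Attributes atts) {
                            counts[0]++;
                            counts[1] += atts.getLength();
                        }
                    });
        } catch (Exception e) {
            throw new IllegalStateException("SAX parsing failed", e);
        }
        return counts;
    }

    public static void main(String[] args) {
        String doc = "<a x=\"1\" y=\"2\"><b>hi</b><c z=\"3\"/></a>";
        int[] stax = countWithStax(doc);
        int[] sax = countWithSax(doc);
        System.out.println("StAX: " + stax[0] + " elements, " + stax[1] + " attributes");
        System.out.println("SAX:  " + sax[0] + " elements, " + sax[1] + " attributes");
    }
}
```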
Efficiency of SAX vs StAX
[Figure: Document scanning times. Time (ms) against document size (KB, 0 to 3000); series: SAX + validate, SAX, StAX events, StAX cursor.]
Efficiency of SAX vs StAX (NEW)
[Figure: Document scanning times. Time (ms) against document size (KB, 0 to 3000); series: StAX events, SAX + validate, SAX, StAX cursor.]
Observations
The StAX cursor API is the most efficient
Overhead of XMLEvent objects makes the StAX iterator some 50–80% slower
On small documents, SAX is ~40–100% slower than the StAX cursor API
Overhead of DTD validation adds ~5–10% to SAX parsing time
StAX loses its advantage with bigger documents:
Times on Larger Documents
[Figure: Document scanning times. Time (ms) against document size (MB, 5 to 50); series: StAX events, StAX cursor, SAX.]
Why? Let's take a look at memory usage

Memory Usage of SAX vs StAX
The StAX implementation has a memory leak! (Should get fixed in future releases)
[Figure: Used main memory. Memory (MB) against document size (MB, 5 to 50); series: StAX events, StAX cursor, SAX; annotation: "< 6 MB".]
Memory Usage of SAX vs StAX (NEW)
Memory leak also in the SAX implementation!
[Figure: Used main memory. Memory (MB) against document size (MB, 5 to 50); series: StAX events, SAX, StAX cursor.]
Circumventing the Memory Leak
The bug appears to be related to a DOCTYPE declaration with an external DTD
Without a DOCTYPE declaration
– In the first experiment, each API uses less than 6 MB
– In the second experiment, the StAX event objects still require increasing amounts of memory; see next
SAX vs StAX memory need (w.o. DTD)
[Figure: Used main memory (without DTD). Memory (MB) against document size (MB, 5 to 50); series as labelled in the chart: StAX events, SAX DTD, StAX cursor.]
Speed on documents without DTD
[Figure: Scan times for documents w.o. DTD. Time (ms) against document size (MB, 5 to 50); series: StAX events, StAX cursor, SAX.]
Speed on documents without DTD (NEW)
[Figure: Scan times for documents w.o. DTD. Time (ms) against document size (MB, 5 to 50); series: StAX events, SAX, StAX cursor.]
StAX: Summary
Event-based streaming pull API for XML documents
More convenient than SAX
– and often more efficient, esp. the cursor API with small documents
Also supports writing of XML data
A potential substitute for SAX
– NB: the Sun Java Streaming XML Parser (in JDK 1.6) is non-validating (but the API allows validation, too)
– once some implementation bugs (in JDK 1.6) get eliminated