Effective XML
Elliotte Rusty Harold
elharo@metalab.unc.edu
Part I: Syntax
Stay with XML 1.0
• XML 1.1:
  • New name characters
  • C0 control characters
  • C1 control characters
  • NEL
  • Undeclare namespace prefixes
• Incompatible with:
  • Most XML parsers
  • W3C and RELAX NG schema languages
  • XOM, JDOM
Part II: Structure
The XML Stack
Allow all XML syntax
• CDATA sections
• Entity references
• Processing instructions
• Comments
• Numeric character references
• Document type declarations
• Different ways of representing the same core content; not different information
Distinguish text from markup
• A DocBook element:
<programlisting><![CDATA[<value>
  <double>28657</double>
</value>]]></programlisting>
• The content is:
<value>
  <double>28657</double>
</value>
• This is the same:
<programlisting>&lt;value&gt;
  &lt;double&gt;28657&lt;/double&gt;
&lt;/value&gt;</programlisting>
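One way to check this is to parse both forms and compare the text the parser reports. The following is a minimal sketch using the JDK's DOM parser; the class and method names are hypothetical, and the white space from the slide is dropped for brevity.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CdataVersusEscapes {

    // Parses an XML string and returns the text content of the root element.
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String cdata = "<programlisting><![CDATA[<value><double>28657</double></value>]]></programlisting>";
        String escaped = "<programlisting>&lt;value&gt;&lt;double&gt;28657&lt;/double&gt;&lt;/value&gt;</programlisting>";
        // Both documents carry the same character data.
        System.out.println(rootText(cdata).equals(rootText(escaped)));
    }
}
```

Code that reads the content this way never needs to know which syntax the author chose.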
The reverse problem
• Tools that create XML from strings:
  • Tree-based editors like <Oxygen/> or XML Spy
  • WYSIWYG applications like OpenOffice Writer
  • Programming APIs such as DOM, JDOM, and XOM
• The tool automatically escapes reserved characters like <, >, or &.
• Just because something looks like an XML tag does not mean it is an XML tag.
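The escaping behavior can be observed with the JDK's own DOM and serializer. This is a sketch, not a recommendation of any particular API; the `note` element and the class name are made up for illustration.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class EscapedOnOutput {

    // Builds <note>text</note> through the DOM API and serializes it to a string.
    static String serializeNote(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element note = doc.createElement("note");
            note.setTextContent(text); // stored as text, never parsed as markup
            doc.appendChild(note);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The string looks like a tag, but it goes in as text and comes out escaped.
        System.out.println(serializeNote("<b>bold</b> & more"));
    }
}
```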
White space matters
• Parsers report all white space in element content, including boundary white space
• An xml:space attribute is for the client application only, not the parser
• White space in attribute values is normalized
• Parsers do not report white space in the prolog, the epilog, the document type declaration, or tags.
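Boundary white space is easy to see by counting child nodes. A minimal sketch with the JDK's DOM parser, assuming no DTD is present; the class name and sample documents are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class BoundaryWhiteSpace {

    // Returns how many child nodes the parser reports for the root element.
    static int childCount(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getChildNodes().getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Indented: white-space text node, <b>, white-space text node.
        System.out.println(childCount("<a>\n  <b>x</b>\n</a>")); // 3
        // Not indented: only <b>.
        System.out.println(childCount("<a><b>x</b></a>"));       // 1
    }
}
```

Code that assumes the first child of an element is an element tends to break as soon as someone pretty-prints the document.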
Make structure explicit through markup
• Bad:
<Transaction>Withdrawal 2003 12 15 200.00</Transaction>
• Better:
<Transaction type="withdrawal">
  <Date>2003-12-15</Date>
  <Amount>200.00</Amount>
</Transaction>
Store metadata in attributes
• Material the reader doesn’t want to see:
  • URLs
  • IDs
  • Styles
  • Revision dates
  • Author’s name
• No substructure:
  • Revision tracking
  • Citations
• No multiple elements
Remember mixed content
• Narrative documents
• Record-like documents
• The RSS problem:
<item>
  <title>Xerlin 1.3 released</title>
  <description>
    Xerlin 1.3, an open source XML Editor written in Java, has been released. Users can extend the application via custom editor interfaces for specific DTDs. New features in version 1.3 include XML Schema support, WebDAV capabilities, and various user interface enhancements. Java 1.2 or later is required.
  </description>
  <link>http://www.cafeconleche.org/#news2003April7</link>
</item>
What you really want is this:
<description>
  <p><a href="http://www.xerlin.org"><strong>Xerlin 1.3</strong></a>, an open source XML Editor written in Java, has been released. Users can extend the application via custom editor interfaces for specific DTDs. New features in version 1.3 include:</p>
  <ul>
    <li>XML Schema support</li>
    <li>WebDAV capabilities</li>
    <li>Various user interface enhancements</li>
  </ul>
  <p>Java 1.2 or later is required.</p>
</description>
What people do is this:
<description>
  &lt;p&gt;&lt;a href="http://www.xerlin.org"&gt;&lt;strong&gt;Xerlin 1.3&lt;/strong&gt;&lt;/a&gt;, an open source XML Editor written in Java, has been released. Users can extend the application via custom editor interfaces for specific DTDs. New features in version 1.3 include:&lt;/p&gt;
  &lt;ul&gt;
    &lt;li&gt;XML Schema support&lt;/li&gt;
    &lt;li&gt;WebDAV capabilities&lt;/li&gt;
    &lt;li&gt;Various user interface enhancements&lt;/li&gt;
  &lt;/ul&gt;
  &lt;p&gt;Java 1.2 or later is required.&lt;/p&gt;
</description>
Prefer URLs to unparsed entities and notations
• URLs are simple and well understood
• Notations and unparsed entities are confusing and little used
• URLs don’t require the DTD to be read
• Many APIs don’t even support notations and unparsed entities
Part III: Semantics
Use processing instructions for process-specific content
• For a very particular, even local, process
• Describes how a particular process acts on the data in the document
• Does not describe or add to the content itself
• A unit that can be treated in isolation
• Content is not XML-like
• Applies to the entire document
Processing instructions are not appropriate when:
• Content is closely related to the content of the document itself
• Structure extends beyond a single processing instruction
• Needs to be validated
Include all information in instance documents
• Not all parsers read the DTD
  • Especially browsers
• Beware:
  • Default attribute values
  • Parsed entity references
  • XInclude
  • ID type dependence (XPath, DOM, etc.)
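The default attribute value case can be sketched as follows. The `Transaction` element and `type` attribute are borrowed from the earlier slide; the class name, the DTD, and the choice of the JDK's DOM parser (which does read the internal DTD subset) are assumptions for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DefaultFromDtd {

    // Returns the root element's "type" attribute, or "" if the parser sees none.
    static String typeAttribute(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute("type");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String withDtd =
                "<!DOCTYPE Transaction [\n"
              + "  <!ATTLIST Transaction type CDATA \"withdrawal\">\n"
              + "]>\n"
              + "<Transaction/>";
        String withoutDtd = "<Transaction/>";

        // The value exists only if the DTD is read.
        System.out.println("[" + typeAttribute(withDtd) + "]");    // [withdrawal]
        // A processor that never sees the DTD gets nothing.
        System.out.println("[" + typeAttribute(withoutDtd) + "]"); // []
    }
}
```

Writing `type="withdrawal"` directly in the instance document removes the dependence on the DTD.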
Encode binary data using quoted printable and/or Base64
• Quoted printable works well for mostly text
• Base64 for non-text data
• Can you link to the data with a URL instead?
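A minimal Base64 round trip, as a sketch: `java.util.Base64` is the JDK's encoder, while the `data` element, its `encoding` attribute, and the class name are hypothetical. Base64 output uses only letters, digits, `+`, `/`, and `=`, so it needs no further escaping inside element content.

```java
import java.io.StringReader;
import java.util.Base64;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class Base64InXml {

    // Wraps arbitrary bytes in a (hypothetical) <data> element as Base64 text.
    static String wrap(byte[] bytes) {
        return "<data encoding=\"base64\">"
                + Base64.getEncoder().encodeToString(bytes)
                + "</data>";
    }

    // Parses the element with a real parser and decodes its text content.
    static byte[] unwrap(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            String text = doc.getDocumentElement().getTextContent().trim();
            return Base64.getDecoder().decode(text);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Bytes that are illegal or reserved in XML text: NUL, 0xFF, '<', '&'.
        byte[] original = {0x00, (byte) 0xFF, 0x3C, 0x26};
        String xml = wrap(original);
        System.out.println(xml);
        System.out.println(java.util.Arrays.equals(original, unwrap(xml)));
    }
}
```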
Use namespaces for modularity and extensibility
• Not hard; simple cases can use one default namespace
• http URIs are normally preferred
• DTD validation is tricky
• Code to namespace URIs, not prefixes
• Avoid namespace prefixes in element content and attribute values
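"Code to namespace URIs, not prefixes" might look like this with a namespace-aware DOM parser. It is a sketch only; the sample document, the two prefixes, and the class name are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CodeToNamespaceUri {

    static final String SVG_NS = "http://www.w3.org/2000/svg";

    // Counts rect elements in the SVG namespace, whatever prefix they use.
    static int countSvgRects(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagNameNS(SVG_NS, "rect").getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml =
                "<doc xmlns:svg='http://www.w3.org/2000/svg'"
              + "     xmlns:g='http://www.w3.org/2000/svg'>"
              + "<svg:rect/><g:rect/><rect/>"
              + "</doc>";
        // Two prefixes map to the same URI; the unprefixed rect is in no namespace.
        System.out.println(countSvgRects(xml)); // 2
    }
}
```

Matching on the qualified name `svg:rect` would have missed `g:rect`, which is the same element as far as namespaces are concerned.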
Reuse XHTML for generic narrative content
Choose the right schema language for the job
• DTDs
• The W3C XML Schema Language
• RELAX NG
• Schematron
Use only what you need
• You need:
  • Well-formed XML 1.0
  • A parser
• You probably need:
  • Namespaces
• You may not need:
  • DTDs
  • Schemas
  • XInclude
  • SOAP
  • WS-Kitchen-Sink
  • etc.
Always use a parser
• Can’t use regular expressions:
  • Detecting encoding
  • Comments and processing instructions that contain tags
  • CDATA sections
  • Unexpected placement of spaces and line breaks within tags
  • Default attribute values
  • Character and entity references
  • Malformed documents
  • Internal DTD subset
• Why not?
  • Unfamiliarity with parsers
  • Too slow
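The "comments that contain tags" case from the list above is enough to trip up a naive regular expression. A small sketch comparing the two approaches; the `list`/`item` vocabulary and the class name are hypothetical.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class RegexVersusParser {

    // Naive approach: count literal "<item>" start-tags in the raw text.
    static int countWithRegex(String xml) {
        Matcher m = Pattern.compile("<item>").matcher(xml);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Parser approach: count actual item elements in the document.
    static int countWithParser(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName("item").getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<list><!-- <item>old</item> --><item>new</item></list>";
        System.out.println(countWithRegex(xml));  // 2: also matches inside the comment
        System.out.println(countWithParser(xml)); // 1: the comment is not an element
    }
}
```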
Layer Functionality
[Flowchart: book.xml uses XInclude to pull in preface.xml, xmlsyntax.xml, xmlprotocols.xml, and 16 more chapters, producing finished_book.xml. A "Valid?" check prints an error message on "No". On "Yes", XSLT transforms produce book.xhtml, book.html, and book.fo; fop turns book.fo into book.pdf. A further XSLT transform extracts chapter1.xml, chapter2.xml, and so on through chapter 17, which are transformed to XSL-FO and rendered to PDF by fop. A SAX program extracts the example source code files.]
Program to standard APIs
• Easier to deploy in Java 1.4/1.5
• Different implementations have different performance characteristics
• SAX is fast
• DOM interoperates
• Semi-standard:
  • JDOM
  • XOM
• Bleeding edge:
  • StAX
  • JAXB
Read the complete DTD
• Be conservative in what you generate; liberal in what you accept
• Important content from DTD:
  • Default attribute values
  • Namespace declarations
  • Entity references
  • ID types
Navigate with XPath
• More robust against unexpected structure
• Allows optimization by the engine
• Easier to code; enhanced programmer productivity
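A short sketch of XPath navigation with the JDK's `javax.xml.xpath` API. The `Transaction` markup comes from the earlier slide; the enclosing `Statement` element and the class name are assumptions.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathNavigation {

    // Evaluates an XPath expression against a document and returns its string value.
    static String evaluate(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml =
                "<Statement>"
              + "<Transaction type='deposit'><Amount>50.00</Amount></Transaction>"
              + "<Transaction type='withdrawal'>"
              + "<Date>2003-12-15</Date><Amount>200.00</Amount>"
              + "</Transaction>"
              + "</Statement>";
        // One expression replaces several nested loops over child nodes.
        System.out.println(
                evaluate(xml, "//Transaction[@type='withdrawal']/Amount")); // 200.00
    }
}
```

The expression keeps working if comments, white space, or new sibling elements appear around `Amount`, which hand-written child-by-child navigation often does not.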
Validate inside your program with schemas
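In Java, one way to do this is the `javax.xml.validation` API. The sketch below assumes the W3C XML Schema Language and a one-element schema invented for illustration; the class name is hypothetical, and the slide does not prescribe this particular API.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class InProgramValidation {

    // A tiny schema: the document must be a single Amount element holding a decimal.
    static final String XSD =
            "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
          + "<xs:element name='Amount' type='xs:decimal'/>"
          + "</xs:schema>";

    // Compiles the schema; a failure here is a programming error, not bad input.
    static Schema compileSchema() {
        try {
            return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns true if the document is well-formed and valid against the schema.
    static boolean isValid(String xml) {
        try {
            compileSchema().newValidator()
                    .validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the default error handling throws on validity errors
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<Amount>200.00</Amount>"));      // true
        System.out.println(isValid("<Amount>two hundred</Amount>")); // false
    }
}
```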
Part IV: Implementation
Write documents in Unicode
• Prefer UTF-8
  • Smaller in English
  • ASCII compatible
• Normalization
  • É, ü, ì and so forth
  • NFC
  • ICU
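A minimal NFC example. The slide names ICU; this sketch swaps in the JDK's built-in `java.text.Normalizer` to stay dependency-free, and the class name is hypothetical.

```java
import java.text.Normalizer;

class NfcNormalization {

    // Converts a string to Unicode Normalization Form C (composed form).
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "E\u0301"; // E followed by a combining acute accent
        String composed = "\u00C9";    // the single precomposed character É

        // The two strings look identical on screen but differ code point by code point.
        System.out.println(decomposed.equals(composed));        // false
        // After normalization they compare equal.
        System.out.println(toNfc(decomposed).equals(composed)); // true
    }
}
```

Normalizing before writing the document means string comparisons on names and content behave the way readers expect.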
Avoid vendor lock-in; beware:
• Opaque, binary data used in place of marked up text
• Over-abbreviated, non-obvious names like F17354 and grgyt
• APIs that hide the XML
• Products that focus on the "Infoset"
• Alternate serializations of XML
• Patented formats
Hang on to your relational database
Document Namespaces with RDDL
<!DOCTYPE html PUBLIC "-//XML-DEV//DTD XHTML RDDL 1.0//EN"
  "http://www.rddl.org/rddl-xhtml.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:xlink="http://www.w3.org/1999/xlink"
      xmlns:rddl="http://www.rddl.org/">
<head>
  <title>MegaBank Statement Markup Language (MBSML)</title>
</head>
<body>
<p>This is the XML namespace for the
<a href="http://developer.megabank.com/xml/">MegaBank Statement Markup Language</a>.</p>
<rddl:resource xlink:type="simple"
  xlink:href="http://developer.megabank.com/xml/spec.html"
  xlink:role="http://www.w3.org/TR/html4/"
  xlink:arcrole="http://www.rddl.org/purposes#normative-reference">
  <p>The <a href="http://developer.megabank.com/xml/spec.html">MegaBank Statement Markup Language Specification 1.0</a></p>
</rddl:resource>
</body>
</html>
Pick the correct MIME type
• application/xml
• Not text/xml!
• Don't use charset
• application/mathml+xml
• image/svg+xml
• application/xslt+xml
TagSoup Your HTML
Catalog common resources
<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <public publicId="-//OASIS//DTD DocBook XML V4.2//EN"
          uri="file:///opt/xml/docbook/docbookx.dtd"/>
</catalog>
Compress if space is a problem
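import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.parsers.*;
import org.apache.xml.serialize.*; // Xerces OutputFormat and XMLSerializer
import org.w3c.dom.Document;

// Output: serialize an existing DOM Document named document through gzip
OutputStream fout = new FileOutputStream("data.xml.gz");
OutputStream out = new GZIPOutputStream(fout);
OutputFormat format = new OutputFormat(document);
XMLSerializer output = new XMLSerializer(out, format);
output.serialize(document);
out.close(); // finishes the gzip stream and closes the file

// Input: decompress while parsing
InputStream fin = new FileInputStream("data.xml.gz");
InputStream in = new GZIPInputStream(fin);
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = factory.newDocumentBuilder();
Document doc = parser.parse(in);
// work with the document...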
To Learn More
• This Presentation: http://cafeconleche.org/slides/lxny/effectivexml
• Effective XML: 50 Specific Ways to Improve Your XML Documents
  • Elliotte Rusty Harold
  • Addison-Wesley, 2003
  • ISBN 0-321-15040-6
  • $44.99
  • http://cafeconleche.org/books/effectivexml