Python and XML

About me

• Paul Prescod, ([email protected])

• ActiveState Senior Developer

• Co-Author, XML Handbook

• About Python

• Python SAX/DOM

• PyXML Package

• Python XSLT/XPath


• XML and Zope

What is Python?

• Python is an easy to learn, powerful programming language.

– Efficient high-level data structures

– Simple approach to object-oriented programming.

– Elegant syntax and dynamic typing

Brief History of Python

• CWI, early 90s.

• Dynamic Object Oriented High Level Language.

• More than a text processing language.

• More than a scripting language.

• Scalable and object oriented from the beginning.

• Dynamically type checked.

Python's business case

• Python can displace many other languages in the organization.

• The Python interpreter is free.

• Python is legally unencumbered.

• Professional programmers find Python more flexible than most languages.

• Amateur programmers are (often) more comfortable than with Perl or Java.

Usability features

• Exceptionally clear syntax.

• Provides an obvious way to do most things.

• Small set of features combine in powerful ways.

• Only innovative where innovation is really necessary.

More Usability features

• Huge amount of free code and libraries.

• Interactive.

• Designed to talk to the world.

• Runs with Unix, Mac and Windows.

• Integrates with JVM (Jython) and .NET Framework (Python.NET)

• Talks MS COM, XPCOM,


Scalability features

• Simple but powerful module system.

• Simple but powerful class system.

• Structured, standardized exceptions.

• Unix (almost all)

• Windows (3.1, 95, NT, CE)

• Mac


• Various legacy systems...

• New data types -- in Python or C

• Modules -- in Python or C

• Functions -- in Python or C

Python isn't picky!




• You can write code that is portable or platform-specific.

Compared to Perl

• Simpler syntactically.

• More object oriented.

• Easier to extend.

• But slower regular expressions...

Compared to Java

• Java is more difficult for amateur programmers.

• Static type checking can be inconvenient in text processing.

• Puritanical OO can be inconvenient.

• Bottom line: Java can make simple projects harder.

Why not Java: political

• "100% pure Java" gets in the way.

• The Java environment punishes interoperability. (e.g. getenv is deprecated)

• Java is designed to have interoperability limitations.

• Embedding Java is relatively painful.

Jython (nee JPython)

• Compiles Python classes to Java classes

• Embedded interpreter allows interactive coding.

• Access to all Java classes.

• For better or worse: maintains Java's security/platform-independence bubble.

Jython can use Java tools


• XPointer

• Various parsers

• Swing GUI

• Unicode

Python Limitations

• "Ordinary Python" has 8-bit and Unicode string types.

– Handling explicit conversions can be annoying.

• Not as fast as C++.

• Raw text searching is not as fast as Perl.

• Dynamic type checking requires more care in


Python "Hello world"

print "Hello, World"

Python interpreter

• Just type:

C:\> python
Python 1.5.2 (#0, Apr 13 1999, 10:51:12) [MSC 32 bit (Intel)] on win32
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> print "Hello, World"
Hello, World
>>> print "Goodbye, World"
Goodbye, World
>>> ^Z


• Python automatically bytecompiles modules.

• Next execution does not require compilation.

• .py files get a .pyc in the same directory

• When the .py is updated, the .pyc is updated

• DOS/Win32 (last slide)

• Unix (use ^D to exit)

• Graphical: “IDLE”, “PythonWin”

Python variables

• Any Python variable can hold any value.

>>> width = 20
>>> height = 5 * 9
>>> width * height
900
>>> width = "really wide"
>>> width
'really wide'

Numeric types

• int: 32 bit, e.g. "x=5"

• long: arbitrary sized, e.g. "x=2L**128"

• float: accuracy depends on platform, e.g. "x=3.14"

• complex: real+imag., "x=5.3+3.2j"
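The four numeric types can be tried directly. This is a minimal sketch written for a current Python 3 interpreter, which is an assumption: the slides target Python 1.5/2.x, where long values carry an "L" suffix, while Python 3 merges int and long into one arbitrary-precision type.

```python
# Numeric types, written for Python 3 (assumption; the slides use 1.5/2.x syntax).
small = 5                 # int
big = 2 ** 128            # arbitrary precision, no "L" suffix in Python 3
ratio = 3.14              # float, precision depends on the platform's C double
z = 5.3 + 3.2j            # complex: real + imaginary part

print(big)
print(type(ratio).__name__)
print(z.real, z.imag)
```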

Sequence types:

• Strings: "abcd"

• Tuples: (1,2,"b")

• Lists: [1,"a",3]

Sequence operations

• Iteration:

for i in myList:
    print i

• Numeric indexing:

k = myList[3]

• Slicing:

k = myList[2:5]

Sequence types: string

myStr = "abc"                   # assignment
myStr = myStr + "def"           # = "abcdef"
for char in myStr: print char   # iterate
otherstr = myStr[1:4]           # = "bcd"

Sequence types: lists

myList = ["a", 5, 3.25, 2L, 4+3j]
anotherList = ["a", myList, ["3", "2"]]
anotherList2 = myList + myList   # = ["a",5,...,"a",5,...]
yetAnotherList = myList[1:3]     # = [5,3.25]

Iterating over sequences

strlist = ["abc", "def", "ghi"]
for item in strlist:
    for char in item:
        print char

Sequence Concatenation

>>> word = 'Help' + 'A'
>>> word
'HelpA'
>>> list = ["Hello"] + ["World"]
>>> print list
['Hello', 'World']

Sequence Indexing

>>> str="abc"
>>> str[0]
'a'
>>> str[1]
'b'

Negative indexes

>>> word[-1]     # The last character
'A'
>>> word[-2]     # The last-but-one character
'p'
>>> word[-2:]    # The last two characters
'pA'
>>> word[:-2]    # All but the last two characters
'Hel'


Getting the length

• The len() function gets a sequence's length

>>> len( "abc" )
3
>>> len( ["abc","def"] )
2

• Immutable list-like objects are called "tuples"

>>> a=(1,2)
>>> a[0]=3
Traceback (innermost last):
  File "<stdin>", line 1, in ?
TypeError: object doesn't support item assignment

• Serve as a lookup table

• Maps "keys" to "values".

• Keys can be of any immutable type

• Assignment adds or changes members

• keys() method returns keys

>>> dict={"a":"alpha", "b":"bravo","c":"charlie"}
>>> dict["abc"]=10
>>> dict[5]="def"
>>> dict[2.52]=6.71
>>> print dict
{2.52: 6.71, 5: 'def', 'abc': 10, 'b': 'bravo', 'c': 'charlie', 'a': 'alpha'}

Dictionary Methods

>>> dict.keys()
[2.52, 5, 'abc', 'b', 'c', 'a']
>>> dict.values()
[6.71, 'def', 10, 'bravo', 'charlie', 'alpha']
>>> dict.items()
[(2.52, 6.71), (5, 'def'), ('abc', 10), ('b', 'bravo'), …]
>>> dict.clear()
>>> print dict
{}

File Objects

• Represent opened files:

myFile = open( "catalog.txt", "r" )
data = myFile.read()
myFile = open( "catalog2.txt", "w" )
data = data + "more data"
myFile.write( data )
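The read, extend, write sequence on this slide can be run end to end. The sketch below assumes a current Python 3 interpreter and uses a throwaway temporary directory; the file contents and the `with` statement (which closes each file automatically) are illustrative choices, not part of the slides.

```python
import os
import tempfile

# Work in a temporary directory so nothing real is overwritten.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "catalog.txt")
dst = os.path.join(workdir, "catalog2.txt")

with open(src, "w") as f:          # create some input to read back
    f.write("first line\n")

with open(src, "r") as f:          # read the whole file into one string
    data = f.read()

data = data + "more data"
with open(dst, "w") as f:          # write the extended text to a second file
    f.write(data)

with open(dst, "r") as f:
    copied = f.read()
print(repr(copied))
```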

Function definitions

• Encapsulate bits of code.

• Can take a fixed or variable number of arguments.

• Arguments can have default values.
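A short sketch of the three argument styles named above. The helper `make_tag` is hypothetical and exists only to illustrate fixed arguments, a default value, extra positional arguments (`*classes`) and extra keyword arguments (`**attrs`); it assumes a current Python 3 interpreter.

```python
def make_tag(name, content="", *classes, **attrs):
    """Build a small XML-like string (illustrative helper, not from the slides)."""
    if classes:
        attrs["class"] = " ".join(classes)
    # Sort the attributes so the output order is deterministic.
    attr_text = "".join(' %s="%s"' % (k, v) for k, v in sorted(attrs.items()))
    return "<%s%s>%s</%s>" % (name, attr_text, content, name)

print(make_tag("emph"))                         # fixed argument only, default content
print(make_tag("emph", "hi"))                   # default overridden
print(make_tag("p", "hi", "a", "b", id="x1"))   # variable positional and keyword arguments
```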

Functions are objects

>>> def myClickFunction():
...     print "I was clicked"
...
>>> # assume button is a GUI button
>>> button.OnClick = myClickFunction
>>> print button.OnClick.__name__
myClickFunction
>>>

Flow Control Statements

• if/elif/else

• while

• for

• try

Exception handling

• Python exception handling like Java/C++.

• Errors are reported in tracebacks.

• Exceptions propagate up.

Exception traceback

Traceback (innermost last):
  File "", line 10, in ?
    a()
  File "", line 2, in a
    b( )
  File "", line 5, in b
    c( )
  File "", line 8, in c
    1/0
ZeroDivisionError: integer division or modulo
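The same call chain (a calls b, b calls c, c divides by zero) can be used to show propagation and handling. This is a sketch for a current Python 3 interpreter; the `cleanup_log` list is a hypothetical stand-in for whatever cleanup a real program would do in `finally`.

```python
cleanup_log = []

def c():
    return 1 // 0          # raises ZeroDivisionError

def b():
    return c()             # no handler here, so the exception passes through

def a():
    try:
        return b()
    except ZeroDivisionError as exc:
        # Handle the error two frames above the place where it was raised.
        return "caught: %s" % exc.__class__.__name__
    finally:
        # Runs whether or not an exception occurred.
        cleanup_log.append("a finished")

result = a()
print(result)
print(cleanup_log)
```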

• Classes combine code and data.

• They represent real world objects.

• We create "instance objects" from classes.

• Closest languages in terms of object model are Smalltalk or Ruby.

• Much more flexible than Java or C++

• More central to the language than


• Classes can specify a base class.

• The new class "inherits" methods and data.

• The new class can

– "override" methods.

– add data and methods.

• Multiple Inheritance is okay

• All methods are virtual.
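A minimal sketch of these points, assuming a current Python 3 interpreter. The class names (`Node`, `Element`, `Named`, `NamedElement`) are hypothetical and unrelated to the DOM classes discussed later.

```python
class Node:
    """Base class: combines data (name) with code (describe)."""
    def __init__(self, name):
        self.name = name

    def kind(self):
        return "node"

    def describe(self):
        # Calls self.kind(), so a subclass override is picked up automatically.
        return "%s %s" % (self.kind(), self.name)

class Element(Node):
    def __init__(self, name, attrs=None):
        Node.__init__(self, name)      # reuse the base class initializer
        self.attrs = attrs or {}       # added data

    def kind(self):                    # overridden method
        return "element"

class Named:
    def upper_name(self):
        return self.name.upper()

class NamedElement(Element, Named):    # multiple inheritance
    pass

e = NamedElement("title")
print(e.describe())
print(e.upper_name())
```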

Modules and Packages

• A module is a set of code in a single file.

• A package is a collection of related modules.

XML and Python

• Accessing XML with Python

• Parsing XML with Python

– Non validating Parsers

– Validating Parsers

Reading XML

• XML as a character data stream

– the RE module

• XML as a tree structure

– lists of node objects

• XML as an event source

– event dispatching to methods

Parsers in Python

• C extension modules

– PyExpat

– sgmlop

• Written in Python code:

– xmllib

– xmlproc

Parsers for Jython

• Apache

• Sun XML

• XP

• Oracle

• ...

Manipulating XML

• Flat file processing with RE's (briefly!)

• PySAX - Simple API for XML

• PyDOM - W3C Document Object Model

• …

Flat File Processing

• XML documents are text.

• Ordinary textual tools continue to work.

• E.g. search for emph elements:

import re

for i in re.findall( r"<emph>(.*)</emph>", input ):
    print i
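One caveat with a regular-expression search for emph elements is greediness. The sketch below (Python 3, with a made-up sample string) compares the greedy pattern `(.*)` with a non-greedy `(.*?)` when two elements share a line.

```python
import re

text = "<p>Some <emph>important</emph> and <emph>urgent</emph> text.</p>"

# Greedy (.*) runs on to the LAST </emph> on the line.
greedy = re.findall(r"<emph>(.*)</emph>", text)

# Non-greedy (.*?) stops at the first closing tag: one match per element.
lazy = re.findall(r"<emph>(.*?)</emph>", text)

print(greedy)
print(lazy)
```

Neither pattern copes with nested emph elements, attributes or CDATA sections, which is one reason to move on to a real parser.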

Copyright 2001, ActiveState

Flat File Recipe

• Unless your needs are very simple, let me help you!

• I’ve already converted the ultimate XML parsing regular expression to Python:

• Think of an XML document as a series of events

• "Start tag", "End tag", "Characters", etc.

• We can handle hierarchy by tracking start/end tags.

• We can deal with the document a little at a time.
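Tracking hierarchy from start and end events can be sketched with a stack. This assumes a current Python 3 interpreter and the standard library's `xml.sax`; the `PathHandler` class and the sample document are illustrative, not from the slides.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class PathHandler(ContentHandler):
    """Record the element path at each start tag (illustrative class)."""
    def __init__(self):
        ContentHandler.__init__(self)
        self.stack = []      # elements that are currently open
        self.paths = []      # one entry per start tag seen

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.paths.append("/".join(self.stack))

    def endElement(self, name):
        self.stack.pop()     # the innermost open element has just closed

doc = b"<folder><title>XML bookmarks</title><bookmark><title>SIG</title></bookmark></folder>"
handler = PathHandler()
xml.sax.parseString(doc, handler)
for path in handler.paths:
    print(path)
```

Only the stack of open elements is held in memory, which is what lets an event-based application process a document a little at a time.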

• "Simple API for XML"

• Common API for parsers.

• Based on Java API.

• Parser implements certain interfaces.

• Application implements callback interfaces.

SAX Model

• The application hands the parser an event handler object.

• The parser sends events to the handler.

• The handler can

– store them somehow,

– build something,

– re-route them to other parts of the


Application side

• Applications must provide:

– ContentHandler

– ErrorHandler

– DTDHandler

– EntityResolver

• Parser developer implements:

– XMLReader

– A few more (out of scope)

• Captures document instance events.

• App can:

– Build app. objects.

– Output something.

– Build a GUI

– ...

ContentHandler callbacks

• Main ones:

startElement(name, attrs)



ignorableWhitespace(whitespace)

processingInstruction(target, data)


ContentHandler eg

from xml.sax.handler import ContentHandler

class countHandler(ContentHandler):
    def __init__(self):
        self.tags = {}

    def startElement(self, name, attr):
        if not self.tags.has_key(name):
            self.tags[name] = 0
        self.tags[name] += 1

ContentHandler eg

import xml.sax

parser = xml.sax.make_parser()

handler = countHandler()



print handler.tags
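For reference, a complete version of the tag-counting example, as a sketch for a current Python 3 interpreter. The slide does not show how the handler is attached or which file is parsed, so the `setContentHandler` call, the in-memory sample document and the use of `io.BytesIO` are assumptions made to keep the example self-contained.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler

class CountHandler(ContentHandler):
    """Count how often each element type occurs."""
    def __init__(self):
        ContentHandler.__init__(self)
        self.tags = {}

    def startElement(self, name, attrs):
        # dict.get replaces the has_key test used in older Python versions.
        self.tags[name] = self.tags.get(name, 0) + 1

# Sample document held in memory (assumption; the slides parse a file).
sample = b"<folder><title>a</title><bookmark><title>b</title></bookmark></folder>"

parser = xml.sax.make_parser()
handler = CountHandler()
parser.setContentHandler(handler)      # attach the handler to the parser
parser.parse(io.BytesIO(sample))       # parse() also accepts a file name

print(sorted(handler.tags.items()))
```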

PySax Distribution

• Default content handler implementation is provided.

• Subclass can override only what it needs.

• Function to get parser is also provided.

• In addition to content handler, we should assign an error handler.

class MyErrorHandler:
    def warning(self, exception):
        print "Whoa, nelly!"
        print exception

    def error(self, exception):
        print "Whoa, nelly!"
        raise exception

    def fatalError(self, exception):
        print "Whoa, nelly!"
        raise exception

ErrorHandling (cont'd)

...
errHandler = MyErrorHandler()
parser.setErrorHandler( errHandler )
parser.parse("\\temp\\test.xml")
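A sketch of what an error handler does when the input is not well-formed, for a current Python 3 interpreter. The `CollectingErrorHandler` class, the `check` function and the sample inputs are hypothetical; the handler re-raises errors as in the slides, and the caller catches the resulting `SAXParseException`.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, ErrorHandler

class CollectingErrorHandler(ErrorHandler):
    """Keep warnings, re-raise errors (illustrative policy)."""
    def __init__(self):
        self.warnings = []

    def warning(self, exception):
        self.warnings.append(exception)

    def error(self, exception):
        raise exception

    def fatalError(self, exception):
        raise exception

def check(xml_bytes):
    """Return "ok" or a short message describing the first parse error."""
    parser = xml.sax.make_parser()
    parser.setContentHandler(ContentHandler())        # default handler ignores all events
    parser.setErrorHandler(CollectingErrorHandler())
    try:
        parser.parse(io.BytesIO(xml_bytes))
    except xml.sax.SAXParseException as exc:
        return "not well-formed at line %d" % exc.getLineNumber()
    return "ok"

print(check(b"<a><b/></a>"))
print(check(b"<a><b></a>"))     # <b> is never closed
```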

Character handling

# print out characters in document
from xml.sax.handler import ContentHandler
import xml.sax, sys

class textHandler(ContentHandler):
    def characters(self, ch):
        sys.stdout.write(ch.encode("Latin-1"))

parser = xml.sax.make_parser()
parser.setContentHandler(textHandler())
parser.parse("test.xml")

Document Object Model

• Document Object Model

• The DOM is a W3C standard.

• Extended version of "Dynamic HTML"

• Defined in CORBA IDL.

• Implemented in various languages.

• Implemented in IE5.0 and eventually Netscape

• The DOM is a tree-based API.

• This implies a certain amount of overhead.

• But also a lot of convenience and flexibility.

• XPath implementation essentially requires tree-based APIs.

DOM Nodes

• Elements, attributes, comments, etc. called "nodes".

• Classes represent node types.

• All node types subclass the "node" base class.

Node Objects

• Example methods include:

– getNodeType

– getParentNode

– getChildNodes

– getAttributes

– insertBefore

– cloneNode

Element Objects

• Elements are a representative subclass:

• getTagName

• getAttribute

• setAttribute

• getElementsByTagName

DOM node types


More DOM node types


Navigation properties

• parentNode - Parent of this node

• firstChild - First child of this node

• lastChild - Last child of this node

• previousSibling - Node immediately preceding this node

• nextSibling - Node immediately following this node

• childNodes - List containing all the children of this node

<folder>
  <title>XML bookmarks</title>
  <bookmark href="" >
    <title>SIG for XML Processing in Python</title>
  </bookmark>
</folder>

[Figure: tree diagram of the folder element with its title and bookmark children and their text nodes]


First "title" node


• parentNode: folder element

• firstChild: Text node 'XML bookmarks'

• lastChild: Text node 'XML bookmarks'

• previousSibling: None

• nextSibling: bookmark element

• childNodes: A 1-element list: [ Text node 'XML bookmarks' ]
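These navigation properties can be checked with `xml.dom.minidom`. This is a sketch for a current Python 3 interpreter; the href value is a placeholder, and the document is written without whitespace between tags because minidom would otherwise report whitespace-only text nodes as siblings.

```python
from xml.dom import minidom

# The href value is a placeholder; the URL on the slide is not shown.
doc = minidom.parseString(
    '<folder><title>XML bookmarks</title>'
    '<bookmark href="http://example.invalid/">'
    '<title>SIG for XML Processing in Python</title>'
    '</bookmark></folder>'
)

title = doc.documentElement.firstChild       # the first "title" element
print(title.parentNode.nodeName)             # parent element
print(title.firstChild.data)                 # text of the only child
print(title.firstChild is title.lastChild)   # a single child is both first and last
print(title.previousSibling)                 # nothing precedes the first child
print(title.nextSibling.nodeName)            # the element that follows
print(len(title.childNodes))
```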

• The DOM API is very large and beyond the scope of the tutorial.

• A few short examples will illustrate the basic model.

Building a DOM

from xml.dom import minidom

dom = minidom.parse("test.xml")
rootel = dom.documentElement
print rootel.nodeName
topnodes = rootel.childNodes

for toplevel in topnodes:
    print toplevel.nodeName

Searching a DOM

# print the last point element
# in the tree
print h.document.documentElement.\
    getElementsByTagName('point')[-1]
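A self-contained variant of this search using `xml.dom.minidom`, as a sketch for a current Python 3 interpreter. The `shape` document and its attribute names are made up for illustration.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<shape><point x='0' y='0'/><point x='1' y='2'/><point x='3' y='5'/></shape>"
)

# getElementsByTagName returns a list-like object, so negative indexes work.
points = doc.documentElement.getElementsByTagName("point")
last = points[-1]

print(len(points))
print("%s,%s" % (last.getAttribute("x"), last.getAttribute("y")))
```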

Modifying a DOM


insertBefore(newChild, refChild)

replaceChild(newChild, oldChild)
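A sketch of these calls on a small `xml.dom.minidom` tree, for a current Python 3 interpreter. `appendChild` and `removeChild` are also part of the standard DOM Node interface; the `make_item` helper and the sample document are hypothetical.

```python
from xml.dom import minidom

doc = minidom.parseString("<list><item>b</item><item>c</item></list>")
root = doc.documentElement
first, second = root.childNodes               # the "b" and "c" items

def make_item(text):
    """Create a new <item> element containing the given text."""
    item = doc.createElement("item")
    item.appendChild(doc.createTextNode(text))
    return item

root.insertBefore(make_item("a"), first)      # new node goes in front of "b"
root.replaceChild(make_item("C"), second)     # "c" is swapped for "C"
root.appendChild(make_item("d"))              # new node goes at the end
root.removeChild(first)                       # drop "b"

print(root.toxml())
```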


The Document Node

• One Document node per document.

• The base of the entire tree

• documentElement attribute contains a single Element node

• childNodes may have additional children, such as ProcessingInstruction nodes.
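The difference between `documentElement` and the Document node's `childNodes` shows up when a document has a processing instruction or comment before the root element. A sketch for a current Python 3 interpreter, with a made-up document:

```python
from xml.dom import minidom, Node

doc = minidom.parseString(
    '<?xml version="1.0"?><?style sheet="a.css"?><!-- note --><folder/>'
)

print(doc.nodeType == Node.DOCUMENT_NODE)
print(doc.documentElement.nodeName)           # always the single root element

# The Document node's children also include the PI and the comment.
kinds = [child.nodeName for child in doc.childNodes]
print(kinds)
```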

PyXML Package


• Collection of lots of useful Python XML stuff.

• Collectively maintained.

• A richer, more robust DOM than minidom.

• More classes, support for DOM 2+

• Integration with XPath and XSLT

PyXML Marshalling

• Convert Python types into XML

• xml.marshal.generic – generic base class

• xml.marshal.wddx – marshal Python types as WDDX

• xml.marshal.xmlrpc – marshal Python types as XML-RPC elements
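To give a feel for marshalling Python types as XML-RPC elements, the sketch below uses the standard library's `xmlrpc.client` module (Python 3) rather than PyXML's `xml.marshal` package, whose API is not shown in these slides. The swap is an assumption made so the example runs on a stock interpreter; the method name "echo" and the sample values are arbitrary.

```python
import xmlrpc.client

# Python values to send: an int, a string, a list and a dictionary.
values = (42, "alpha", [1.5, True], {"b": "bravo"})

# dumps() turns the values into an XML-RPC request document (a string).
payload = xmlrpc.client.dumps(values, methodname="echo")
print(payload)

# loads() reverses the process and returns (params, methodname).
params, method = xmlrpc.client.loads(payload)
print(params)
print(method)
```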

PyXML Parsers

• xml.parsers.xmlproc

• qp_xml

• xml.sax.drivers

• PyTrex is a schema processor for the TREX schema language



• PythonWare distributes the XML-RPC client:

• There are various SOAP implementations:

– 4Suite

– …

Python SOAP Example


import SOAP

server = SOAP.SOAPProxy( "http://localhost:8000/")

print server.echo("Hello world")

XML and Zope

• Zope is an Open Source application server that publishes objects on the Internet.

• ParsedXML: Breaks up an XML document into bits.

• XML-RPC: You can plumb the depths of Zope with XML-RPC.

• Zcatalog: Index based on element-type names, attribute names, etc.

• A free Zope “product” (extension)

• Every element is a first-class Zope object.

• You can add “behavior” to XML documents

• RSS Channel Product

content=content.replace( 'test', 'CHANGED')


• Redfoot is a framework for distributed RDF-based applications, written in Python.

– an RDF database

– a query API for RDF

– an RDF parser and serializer

– a simple HTTP server providing a web interface for viewing and editing RDF

– a fully customizable UI

– the beginnings of a peer-to-peer architecture for communication between different RDF databases

More Information

• XML Topic Guide

• SIG

• ActiveState Programmers Network

• XML-DEV: subscribe at [email protected]

General XML

• Definitive Spec.

• Annotated Spec.

• FAQ

• Definitive Reference to all things XML