CSCI/CMPE 4341 Topic: Programming in Python Chapter 9: Python XML Processing Xiang Lian The...

22
CSCI/CMPE 4341 Topic: CSCI/CMPE 4341 Topic: Programming in Python Programming in Python Chapter 9: Python XML Chapter 9: Python XML Processing Processing Xiang Lian The University of Texas – Pan American Edinburg, TX 78539 [email protected] 1

Transcript of CSCI/CMPE 4341 Topic: Programming in Python Chapter 9: Python XML Processing Xiang Lian The...

CSCI/CMPE 4341 Topic: CSCI/CMPE 4341 Topic: Programming in PythonProgramming in Python

Chapter 9: Python XML ProcessingChapter 9: Python XML Processing

Xiang Lian

The University of Texas – Pan American

Edinburg, TX 78539

[email protected]

1

Objectives

• In this chapter, you will:– Understand XML

– Become familiar with the types of markup languages created with XML

– Learn to create XML markup programmatically

– Use the Document Object Model (DOM) to manipulate XML documents

– Explore ElementTree package to retrieve data from XML documents

2

Introduction

• XML developed by World Wide Consortium’s (W3C’s) XML Working Group (1996)

• XML portable, widely supported, open technology for describing data

• XML quickly becoming standard for data exchange between applications

3

XML Documents

• XML documents end with .xml extension

• XML marks up data using tags, which are names enclosed in angle brackets – <tag> elements </tag>

– Elements: individual units of markup (i.e., everything included between a start tag and its corresponding end tag)

– Nested elements form hierarchies

– Root element contains all other document elements

4

2002 Prentice Hall.All rights reserved.

Outline5

article.xml

<?xml version = "1.0"?> <!-- Fig. 15.1: article.xml --><!-- Article structured with XML. --> <article>

<title>Simple XML</title>

<date>December 21, 2001</date>

<author><firstName>John</firstName><lastName>Doe</lastName>

</author>

<summary>XML is pretty easy.</summary>

<content>In this chapter, we present a wide variety of examplesthat use XML.

</content> </article>

Optional XML declaration includes version information parameter

XML comments delimited by <!– and -->

Root element contains all other document elements

End tag has format </start tag name>

XML Document

• View XML documents– Any text editor• Internet Explorer, Notepad, Visual Studio, etc.

6

Minus sign

2002 Prentice Hall.All rights reserved.

Outline7

letter.xml

1 <?xml version = "1.0"?>2 3 <!-- Fig. 15.3: letter.xml -->4 <!-- Business letter formatted with XML. -->5 6 <letter>7 <contact type = "from">8 <name>Jane Doe</name>9 <address1>Box 12345</address1>10 <address2>15 Any Ave.</address2>11 <city>Othertown</city>12 <state>Otherstate</state>13 <zip>67890</zip>14 <phone>555-4321</phone>15 <flag gender = "F" />16 </contact>17 18 <contact type = "to">19 <name>John Doe</name>20 <address1>123 Main St.</address1>21 <address2></address2>22 <city>Anytown</city>23 <state>Anystate</state>24 <zip>12345</zip>25 <phone>555-1234</phone>26 <flag gender = "M" />27 </contact>28 29 <salutation>Dear Sir:</salutation>30

Root element letterChild element contactAttribute (name-value pair)

Empty elements do not contain character data

2002 Prentice Hall.All rights reserved.

Outline8

letter.xml

31 <paragraph>It is our privilege to inform you about our new32 database managed with <technology>XML</technology>. This33 new system allows you to reduce the load on34 your inventory list server by having the client machine35 perform the work of sorting and filtering the data.36 </paragraph>37 38 <paragraph>Please visit our Web site for availability39 and pricing.40 </paragraph>41 42 <closing>Sincerely</closing>43 44 <signature>Ms. Doe</signature>45 </letter>

9

XML Namespaces

• Provided for unique identification of XML elements

• Namespace prefixes identify namespace to which an element belongs<Xiang:CSCI/CMPE4341>

Topic: Programming in Python

</Xiang:CSCI/CMPE4341>

2002 Prentice Hall.All rights reserved.

Outline10

namespace.xml

1 <?xml version = "1.0"?>2 3 <!-- Fig. 15.4: namespace.xml -->4 <!-- Demonstrating namespaces. -->5 6 <text:directory xmlns:text = "http://www.deitel.com/ns/python1e"7 xmlns:image = "http://www.deitel.com/images/ns/120101">8 9 <text:file filename = "book.xml">10 <text:description>A book list</text:description>11 </text:file>12 13 <image:file filename = "funny.jpg">14 <image:description>A funny picture</image:description>15 <image:size width = "200" height = "100" />16 </image:file>17 18 </text:directory>

Attribute xmlns creates namespace prefixNamespace prefix bound to an URI

Uses prefix text to describe element file

2002 Prentice Hall.All rights reserved.

Outline11

defaultnamespace.xml

1 <?xml version = "1.0"?>2 3 <!-- Fig. 15.5: defaultnamespace.xml -->4 <!-- Using default namespaces. -->5 6 <directory xmlns = "http://www.deitel.com/ns/python1e"7 xmlns:image = "http://www.deitel.com/images/ns/120101">8 9 <file filename = "book.xml">10 <description>A book list</description>11 </file>12 13 <image:file filename = "funny.jpg">14 <image:description>A funny picture</image:description>15 <image:size width = "200" height = "100" />16 </image:file>17 18 </directory>

Creates default namespace by binding URI to attribute xmlns without prefix

Element without prefix defined in default namespace

12

Document Object Model (DOM)

• DOM parser retrieves data from XML document

• Hierarchical tree structure called a DOM tree– Each component of an XML document represented

as a tree node– Parent nodes contain child nodes– Sibling nodes have same parent– Single root (or document) node contains all other

document nodes

13

Example of Document Object Model (DOM)

article

title

author

summary

contents

lastName

firstName

date

Processing XML in Python

• Python packages for XML support– 4DOM and xml.sax

– Generating XML dynamically similar to generating HTML

– Python scripts can use print statements or XSLT to output XML

14

2002 Prentice Hall.All rights reserved.

Outline15

names.txt

O'Black, JohnGreen, SueRed, BobBlue, MaryWhite, MikeBrown, JaneGray, Bill

Fig. 16.1 Text file names.txt used in Fig. 16.2.

2002 Prentice Hall.All rights reserved.

Outline16

fig16_02.py

#!c:\Python\python.exe# Fig. 16.2: fig16_02.py# Marking up a text file's data as XML. import sys

# write XML declaration and processing instructionprint ("""<?xml version = "1.0"?> """)

# open data filetry:

file = open( "names.txt", "r" )except IOError:

sys.exit( "Error opening file" ) print ("<contacts>") # write root element # list of tuples: ( special character, entity reference )replaceList = [ ( "&", "&amp;" ),

( "<", "&lt;" ),( ">", "&gt;" ),( '"', "&quot;" ),( "'", "&apos;" ) ]

# replace special characters with entity referencesfor currentLine in file.readlines():

for oldValue, newValue in replaceList:currentLine = currentLine.replace( oldValue, newValue )

Print XML declaration

Open text file if it exists

Print root element

List of special characters and their entity references

Replace special characters with entity references

2002 Prentice Hall.All rights reserved.

Outline17

fig16_02.py

# extract lastname and firstnamelast, first = currentLine.split( ", " )first = first.strip() # remove carriage return

# write contact elementprint (""" <contact>

<LastName>%s</LastName><FirstName>%s</FirstName>

</contact>""" % ( last, first )) file.close() print ("</contacts>")

Extract first and last nameRemove carriage return

Print contact element

Print root’s closing tag

18

XML Processing Packages

• Third-party package 4DOM, included with package PyXML, complies with W3C’s DOM Recommendation

• xml.sax, included with Python, contains classes and functions for SAX-based parsing

• 4XSLT, located in package 4Suite, contains an XSLT processor for transforming XML documents into other text-based formats

• import xml.etree.ElementTree

2002 Prentice Hall.All rights reserved.

Outline19

article2.xml

<?xml version = "1.0"?> <!-- Fig. 16.5: article2.xml --><!-- Article formatted with XML --> <article>

<title>Simple XML</title>

<date>December 19, 2001</date>

<author><firstName>Jane</firstName><lastName>Doe</lastName>

</author>

<summary>XML is easy.</summary>

<content>Once you have mastered XHTML, XML is learnedeasily. Remember that XML is not for displayinginformation but for managing information.</content>

</article>

XML document used by fig16_04.py

2002 Prentice Hall.All rights reserved.

Outline20

fig16_04.py

# Fig. 16.4: fig16_04.py# Using 4DOM to traverse an XML Document. import sysimport xml.etree.ElementTree as etree # open XML filetry:

tree = etree.parse("article2.xml")except IOError:

sys.exit( "Error opening file" ) # get root elementrootElement = tree.getroot()print ("Here is the root element of the document: %s" % \

rootElement.tag)

# traverse all child nodes of root elementprint ("The following are its child elements:" )

for node in rootElement:print (node)

2002 Prentice Hall.All rights reserved.

Outline

fig16_04.py

# get first child node of root elementchild = rootElement[0]print ("\nThe first child of root element is:", child.tag)print ("whose next sibling is:" )

# get next sibling of first childsibling = rootElement[1] print (sibling.tag)

print ('Value of "%s" is:' % sibling.tag, end="")print (sibling.text)

22