XML SNU OOPSLA Lab. October 2005


XML

SNU OOPSLA Lab. October 2005


Contents

- Semistructured Data
- Introduction
- History
- XML Application
- DTD & XML Schema
- DOM & SAX
- Summary
- Online Resources


Semistructured Data(1/3)

Semistructured Data and XML
- Integration of heterogeneous sources
- Data sources with non-rigid structure
  - Biological data
  - Web data

Characteristics of Semistructured Data
- Missing or additional attributes
- Multiple attributes
- Different types in different objects

Self-describing, irregular data, no a priori structure


Semistructured Data(2/3)

Data Model: Object Exchange Model (OEM)

[Figure: an OEM graph rooted at Bib (&o1). Edges carry labels such as paper, book, references, author, title, year, http, publisher, page, firstname, lastname, first, and last. Internal nodes (for example &o12, &o24, &o29) are complex objects; leaves such as "Serge", "Abiteboul", 1997, "Victor", "Vianu", 122, and 133 are atomic objects.]


Semistructured Data(3/3)

Syntax for Semistructured Data

Bib: &o1 {
  paper: &o12 { … },
  book: &o24 { … },
  paper: &o29 {
    author: &o52 "Abiteboul",
    author: &o96 {
      firstname: &243 "Victor",
      lastname: &o206 "Vianu"
    },
    title: &o93 "Regular path queries with constraints",
    references: &o12,
    references: &o24,
    pages: &o25 { first: &o64 122, last: &o92 133 }
  }
}
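As a rough illustration, the OEM fragment above can be modeled as a table of objects keyed by identifier: a complex object holds a list of (label, oid) edges and an atomic object holds a plain value. The names `objects` and `children` are hypothetical, and the bodies of &o12 and &o24, which the slide elides, are left empty rather than guessed.

```python
# Sketch of the OEM fragment as a labeled graph (all Python names are illustrative).
# A complex object is a list of (label, oid) edges; an atomic object is a plain value.
objects = {
    "&o1": [("paper", "&o12"), ("book", "&o24"), ("paper", "&o29")],
    "&o12": [],  # body elided on the slide
    "&o24": [],  # body elided on the slide
    "&o29": [
        ("author", "&o52"), ("author", "&o96"), ("title", "&o93"),
        ("references", "&o12"), ("references", "&o24"), ("pages", "&o25"),
    ],
    "&o52": "Abiteboul",
    "&o96": [("firstname", "&243"), ("lastname", "&o206")],
    "&243": "Victor",
    "&o206": "Vianu",
    "&o93": "Regular path queries with constraints",
    "&o25": [("first", "&o64"), ("last", "&o92")],
    "&o64": 122,
    "&o92": 133,
}


def children(oid, label):
    """Return the oids reachable from `oid` over edges carrying `label`."""
    value = objects[oid]
    if not isinstance(value, list):  # atomic objects have no outgoing edges
        return []
    return [target for (edge, target) in value if edge == label]


authors = children("&o29", "author")
print(authors)
```

Note that one author is an atomic string while the other is a complex object with firstname and lastname, which is the "different types in different objects" irregularity described earlier.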


Introduction(1/4)

XML
- An acronym for 'eXtensible Markup Language'
- A meta-language that describes other languages
- A data format for storing structured and semi-structured text for dissemination and ultimate publication, perhaps on a variety of media


Introduction(2/4)

Properties
- Tags enclose identifiable parts of the document
- Self-describing
- Physical/logical structure
  - Physical structure: allows components of the document, called entities, to be named and stored separately
  - Logical structure: allows a document to be divided into named units and sub-units, called elements


Introduction(3/4)

[Figure: Logical Structure: a Document is divided into elements (units and sub-units). Physical Structure: the document is made up of entities, either internal or separate.]


Introduction(4/4)

XML markup:

<warning>
  <para>This substance is hazardous to health</para>
  <para>See procedure 12A.7 for information on protective clothing required.</para>
  <logo …/>
</warning>

XML document:

<transaction>
  <time date="19980509"/>
  <amount>123</amount>
  <currency type="pounds"/>
  <from id="x98765">J. Smith</from>
  <to id="x56565">M. Jones</to>
</transaction>
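To make the element/attribute distinction concrete, here is a minimal sketch that reads the transaction document with Python's standard `xml.etree.ElementTree`. The choice of Python and the variable names are assumptions for illustration; the slides do not prescribe a language. The document is written with straight quotes and a closing quote on the `to` element's `id`, which any XML parser requires.

```python
import xml.etree.ElementTree as ET

# The transaction document from the slide, in a form a parser will accept.
doc = """<transaction>
  <time date="19980509"/>
  <amount>123</amount>
  <currency type="pounds"/>
  <from id="x98765">J. Smith</from>
  <to id="x56565">M. Jones</to>
</transaction>"""

root = ET.fromstring(doc)

# Element content is reached by tag name; attributes behave like a dictionary.
amount = int(root.findtext("amount"))
currency = root.find("currency").get("type")
sender = root.find("from")

print(root.tag, amount, currency, sender.get("id"), sender.text.strip())
```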


History(1/2)

[Figure: timeline of markup languages: GM (1960), SGML (1986), HTML (1992), XML (1997), shown alongside the Internet and the WWW. GM = Generalized Markup]


History(2/2)

- 1960's, IBM GML (Generalized Markup Language)
- 1980's, ISO 8879, SGML (Standard Generalized Markup Language)
- Early 1990's, HTML (HyperText Markup Language)
- 1996, W3C's XML
- 1998, XML 1.0
- 1999, RDF (Resource Description Framework)


Application

Data exchange applications

[Figure: components shown are XML, DTD, Parser, SAX Events, DOM Tree, DOM API, XSL Processor, DBMS, HTML, Browser, and application code (ASP, Java, VB).]

- DOM (Document Object Model)
- SAX (Simple API for XML)
- XSL (eXtensible Stylesheet Language)
- ASP (Active Server Page)


An XML Document

<?xml version="1.0"?>
<!DOCTYPE sigmodRecord SYSTEM "sigmodRecord.dtd">
<sigmodRecord>
  <issue>
    <volume>1</volume>
    <number>1</number>
    <articles>
      <article>
        <title>XML Research Issues</title>
        <initPage>1</initPage>
        <endPage>5</endPage>
        <authors>
          <author AuthorPosition="00">Tom Hanks</author>
          …
        </authors>
      </article>
    </articles>
  </issue>
</sigmodRecord>


DTD(1/2)

DTD (Document Type Definition)
- An optional but powerful feature of XML
- Comprises a set of declarations that define a document structure tree
- Some XML processors read the DTD and use it to build the document model in memory
- A parser uses it to check the validity of documents


DTD(2/2)

A DTD defines
- Element types, attributes, and entities

Valid vs. Invalid
- Valid: conforms to the DTD
- Invalid: fails to conform to the DTD

[Figure: valid XML documents form a subset of well-formed XML documents.]
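The distinction can be seen in practice: well-formedness is a purely syntactic property that any XML parser checks, while validity needs a validating parser and a DTD. The sketch below covers only the first half, using Python's standard `xml.etree.ElementTree`, which does not validate against a DTD. The helper name `is_well_formed` and the two sample strings are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if `text` parses as XML. This says nothing about DTD validity."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


good = "<issue><volume>1</volume><number>1</number></issue>"
bad = "<issue><volume>1</number></issue>"  # end tag does not match start tag

print(is_well_formed(good), is_well_formed(bad))
```

A document such as `good` can still be invalid with respect to a particular DTD, for example if the DTD requires an `articles` child that is missing.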


XML Schema

Schema
- W3C standard: specifies the structure of XML documents
- Data types for elements/attributes
  - String, int, float
  - Unordered sets are also allowed
  - Derivation of types is allowed
- Designed as a replacement for DTDs
  - Removes the syntactic distinction between DTD and XML (a schema is itself written in XML)
  - Richer types compared to DTD


XML Schema Example

XML Schema:

<xsd:element name="article" minOccurs="0" maxOccurs="unbounded">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element name="title" type="xsd:string"/>
      <xsd:element name="initPage" type="xsd:string"/>
      <xsd:element name="endPage" type="xsd:string"/>
      <xsd:element name="author" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>

DTD:

<!ELEMENT article (title,initPage,endPage,author)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT initPage (#PCDATA)>
<!ELEMENT endPage (#PCDATA)>
<!ELEMENT author (#PCDATA)>
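Both declarations above express the same content model: an `article` contains `title`, `initPage`, `endPage`, and `author`, in that order. The following hand-rolled check illustrates what that constraint means. It is not a schema or DTD validator, and the names `EXPECTED` and `matches_article_model` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Child order required by both the xsd:sequence and the DTD content model.
EXPECTED = ["title", "initPage", "endPage", "author"]


def matches_article_model(article):
    """True if `article` has exactly the expected children, in the expected order."""
    return article.tag == "article" and [child.tag for child in article] == EXPECTED


ok = ET.fromstring(
    "<article><title>XML Research Issues</title><initPage>1</initPage>"
    "<endPage>5</endPage><author>Tom Hanks</author></article>"
)
swapped = ET.fromstring(
    "<article><initPage>1</initPage><title>XML Research Issues</title>"
    "<endPage>5</endPage><author>Tom Hanks</author></article>"
)

print(matches_article_model(ok), matches_article_model(swapped))
```

A real validator would also check data types, which is where XML Schema goes beyond the `#PCDATA` of a DTD.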


DOM(1/2)

Characteristics
- Hierarchical (tree) object model for XML documents
- Associates a list of children with every node
- Preserves the sequence of the elements in the XML documents

[Figure: tree for the XML document]

sigmodRecord
  issue
    volume
    number
    articles
      title
      initPage
      endPage


DOM(2/2)

DOM interfaces
- Node: The base data type of the DOM.
- Element: The vast majority of the objects you'll deal with are Elements.
- Attr: Represents an attribute of an element.
- Text: The actual content of an Element or Attr.
- Document: Represents the entire XML document.
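The interfaces listed above can be observed with Python's standard `xml.dom.minidom`, a lightweight DOM implementation. This is a sketch only: the small `issue` document and its `id` attribute are made up for illustration and are not the sigmodRecord file from the earlier slide.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<issue><volume>1</volume>"
    "<article id='a1'><title>XML Research Issues</title></article></issue>"
)

root = doc.documentElement                      # Element node for <issue>
article = root.getElementsByTagName("article")[0]
id_attr = article.getAttributeNode("id")        # Attr node
title = article.getElementsByTagName("title")[0]
text = title.firstChild                         # Text node holding the title string

# Every object is a Node; nodeType tells the interfaces apart.
for node in (doc, root, id_attr, text):
    print(type(node).__name__, node.nodeType, node.nodeName)

# Each node keeps an ordered list of children, in document order.
child_names = [child.nodeName for child in root.childNodes]
print(child_names)
```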


SAX(1/2)

DOM: expensive to materialize for a large XML collection

Characteristics
- Event-driven: fires an event for every open tag/end tag
- Does not require full parsing
- Enables custom object model building

[Figure: Application, Parser, and Document Handler. The application creates the parser and gives it the document handler. While parsing, the parser reports events to the handler through startDocument(), startElement(), characters(), endElement(), and endDocument().]


SAX(2/2)

The SAX API actually defines four interfaces for handling events
- EntityResolver
- DTDHandler
- DocumentHandler
- ErrorHandler

All of these interfaces are implemented by HandlerBase.
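The interfaces named above are the Java SAX 1.0 ones. As a sketch of the same event-driven style in another setting, Python's standard `xml.sax` module offers `ContentHandler`, which plays the role of `DocumentHandler`. The class name `EventLogger` and the tiny input document are assumptions for illustration.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Records each callback the parser fires, in the order it fires them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def startElement(self, name, attrs):
        self.events.append("startElement:" + name)

    def characters(self, content):
        if content.strip():  # ignore whitespace-only text
            self.events.append("characters:" + content.strip())

    def endElement(self, name):
        self.events.append("endElement:" + name)

    def endDocument(self):
        self.events.append("endDocument")


handler = EventLogger()
xml.sax.parseString(b"<issue><volume>1</volume></issue>", handler)
print(handler.events)
```

No tree is built at any point: the handler sees each event once and keeps only what it chooses to record.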


DOM vs SAX(1/3)

Why use DOM?
- Need to know a lot about the structure of a document
- Need to move parts of the document around
- Need to use the information in the document more than once

Why use SAX?
- Only need to extract a few elements from an XML document


DOM vs SAX(2/3)

<book id="1">
  <verse>Sing, O goddess, the anger of Achilles son of Peleus, that brought countless ills upon the Achaeans. Many a brave soul did it send hurrying down to Hades, and many a hero did it yield a prey to dogs and vultures, for so were the counsels of Jove fulfilled from the day on which the son of Atreus, king of men, and great Achilles, first fell out with one another.</verse>
  <verse>And which of the gods was it that set them on to quarrel? It was the son of Jove and Leto; for he was angry with the king and sent a pestilence upon ...

- SAX API would be much more efficient
- Doing this with the DOM would take a lot of memory


DOM vs SAX(3/3)

...
<address>
  <name>
    <title>Mrs.</title>
    <first-name>Mary</first-name>
    <last-name>McGoon</last-name>
  </name>
  <street>1401 Main Street</street>
  <city>Anytown</city>
  <state>NC</state>
  <zip>34829</zip>
</address>
<address>
  <name>
  ...

What if we were parsing an XML document containing 10,000 addresses and wanted to sort them by last name?
- DOM would automatically store all of the data
- We could use DOM functions to move the nodes in the DOM tree
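A minimal sketch of that sorting scenario, using Python's standard `xml.dom.minidom`. The `addresses` wrapper element, the second entry (Ann Abel), and the helper `last_name` are made up for illustration; a real file would hold thousands of `address` elements.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<addresses>"
    "<address><name><first-name>Mary</first-name>"
    "<last-name>McGoon</last-name></name></address>"
    "<address><name><first-name>Ann</first-name>"
    "<last-name>Abel</last-name></name></address>"
    "</addresses>"
)
root = doc.documentElement


def last_name(address):
    """Text of the first <last-name> descendant of an <address> element."""
    return address.getElementsByTagName("last-name")[0].firstChild.data


# The whole tree is already in memory, so the nodes can be collected and sorted.
addresses = sorted(root.getElementsByTagName("address"), key=last_name)

# appendChild moves a node that is already in the tree, so re-appending the
# nodes in sorted order reorders the root's children in place.
for address in addresses:
    root.appendChild(address)

sorted_names = [last_name(a) for a in root.getElementsByTagName("address")]
print(sorted_names)
```

With SAX the application would have to build and keep its own list of addresses before sorting, which is the storage DOM provides automatically.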


Summary

XML
- eXtensible Markup Language
- A data format for storing structured and semi-structured text
- Physical/logical structure

DTD & XML Schema
- Establish formal document structure rules

DOM & SAX API
- DOM: when you need to know a lot about the structure of a document
- SAX: when you only need to extract a few elements from an XML document


Online Resources

XML tutorial
- http://www.xml.com
- http://www.w3c.org
- http://www.w3schools.com/
- http://www.xmltraining.com/course-search-xml+online+tutorials
- http://xmlfiles.com/