XML and QUERY. Shilpi Ahuja, CSE 591 - Data Mining, 4th April 2002.


1

XML and QUERY

Shilpi Ahuja

CSE 591 - Data Mining

4th April 2002

2

What is XML? (Extensible Markup Language)

A markup language for structured documents.

A structural and semantic language, not a formatting language.

Not just for Web pages

3

HTML vs. XML

External Presentation

Xaver Roe

Wikingerufer 7

10555 Berlin

XML

<Address>
  <name>Xaver Roe</name>
  <street>Wikingerufer 7</street>
  <Town>Berlin</Town>
</Address>

HTML

<em>Xaver Roe</em> <br> Wikingerufer 7 <br> <strong>Berlin</strong>
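A short Python sketch may make the contrast concrete. It uses the standard library module xml.etree.ElementTree; the variable names are illustrative and not part of the slides. Because the XML version names its parts, a program can ask for the town directly, which the HTML version cannot offer since its tags only describe emphasis and line breaks.

```python
import xml.etree.ElementTree as ET

# The address from the slide, written as XML.
address_xml = (
    "<Address>"
    "<name>Xaver Roe</name>"
    "<street>Wikingerufer 7</street>"
    "<Town>Berlin</Town>"
    "</Address>"
)

address = ET.fromstring(address_xml)

# Element names say what each piece of text is, so lookup is by meaning.
name = address.findtext("name")
town = address.findtext("Town")
print(name, "lives in", town)
```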

4

Why "Extensible Markup Language"?

Language
• It has a grammar
• It has a vocabulary (sort of)
• It can be parsed by machines

Markup Language
• A mechanism to identify structures in a document
• It says what things are, not what they do
• It is not a programming language
• It is not compiled

Extensible
• You can add words to the language

5

XML describes structure and semantics, not formatting

• XML documents form a tree
• Element and attribute names reflect the kind of the element
• Formatting can be added with a style sheet
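The tree structure can be made visible with a few lines of Python. This is a minimal sketch using the standard library parser; the helper name outline and the shortened newspaper fragment are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# A shortened fragment in the style of the newspaper example.
doc = ET.fromstring(
    '<NEWSPAPER>'
    '<ARTICLE AUTHOR="Jane Doe">'
    '<HEADLINE>Extensible Markup Language Proposed</HEADLINE>'
    '<BYLINE>Jane Doe, Staff Writer</BYLINE>'
    '</ARTICLE>'
    '</NEWSPAPER>'
)

def outline(element, depth=0):
    """Return one indented line per element, in document order."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(doc)
print("\n".join(tree_lines))
```

Each level of indentation in the printed outline corresponds to one level of nesting in the document.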

6

So Is XML Just Like HTML?

Discussion Question ?

7

Answer: No

• In HTML, both the tag semantics and the tag set are fixed.
• XML specifies neither semantics nor a tag set.
• XML lets you define your own tags.
• HTML describes layout.
• XML describes the structure of a document.
• XML separates content from presentation.

8

So Is XML Just Like SGML?

No. Well, yes, sort of!

• XML is a much-restricted form of SGML.
• It is defined as an application profile of SGML.
• SGML is not well suited to serving documents over the web.

9

So Why XML?

XML was created so that richly structured documents could be used over the web.

HTML is bound to a fixed set of semantics and does not allow arbitrary structure.

SGML provides arbitrary structure, but is too difficult to implement just for a web browser.

10

What is the advantage of using XML?

Discussion Question ?

11

A Simple XML Document

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE NEWSPAPER SYSTEM "newspaper.dtd">
<NEWSPAPER>
  <ARTICLE EDITOR="Ernie Pyle" DATE="11/15/98" EDITION="Evening" AUTHOR="Jane Doe">
    <HEADLINE>Extensible Markup Language Proposed</HEADLINE>
    <BYLINE>Jane Doe, Staff Writer</BYLINE>
    <LEAD>The newly proposed XML Specification has been making a splash in the community.</LEAD>
    <BODY>The newly proposed XML draft stands to revolutionize the exchange of easily.</BODY>
    <NOTES>No Notes</NOTES>
  </ARTICLE>
  <ARTICLE AUTHOR="John Doe" EDITOR="Ernie Pyle" DATE="02/15/98" EDITION="Morning">
    <HEADLINE>XML 1.0 Recommendation Released</HEADLINE>
    <BYLINE>John E. Doe, Reporter</BYLINE>
    <LEAD>The W3C today released the final recommendation for XML</LEAD>
    <BODY>XML Developers, are already using the released recommendation </BODY>
    <NOTES>See www.w3c.org for more information</NOTES>
  </ARTICLE>
</NEWSPAPER>

12

Characteristics

• The document begins with an XML declaration, <?xml ...?>, written in lowercase and using processing-instruction syntax
• Open and close all tags
• Empty tags end with />
• There is a unique root element
• Elements may not overlap
• Attribute values are quoted
• < and & are only used to start tags and entity references
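These rules are what a parser enforces as "well-formedness". The following sketch, assuming Python's standard xml.etree.ElementTree parser and a hypothetical helper named is_well_formed, feeds the parser small samples that follow or break the rules listed above.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

samples = {
    "properly nested": "<a><b>text</b></a>",
    "empty tag": "<a><b/></a>",
    "overlapping elements": "<a><b>text</a></b>",
    "unquoted attribute": "<a id=1>text</a>",
    "two root elements": "<a/><b/>",
    "bare ampersand": "<a>AT&T</a>",
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'rejected'}")
```

Only the first two samples are accepted; each of the others violates one rule from the list.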

13

Elements

• The most common form of markup.
• Example: Article, Headline, Byline are all elements.
• Delimited by angle brackets, most elements identify the nature of the content they surround.
• Some elements may be empty, i.e. they have no content.
• A non-empty element always begins with a start-tag, <element>, and ends with an end-tag, </element>.

14

Attributes

Attributes are name-value pairs that occur inside tags after the element name.

For example, <ARTICLE EDITOR="Ernie Pyle"> is the ARTICLE element with the attribute EDITOR having the value Ernie Pyle.

In XML, all attribute values must be quoted.
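As an illustration, attributes can be read back as name-value pairs once the tag is parsed. This sketch assumes Python's standard xml.etree.ElementTree; the fallback text "unspecified" is an arbitrary choice for the example.

```python
import xml.etree.ElementTree as ET

# The start-tag of the first article, written as an empty element.
article = ET.fromstring(
    '<ARTICLE EDITOR="Ernie Pyle" DATE="11/15/98" EDITION="Evening" AUTHOR="Jane Doe"/>'
)

# get() looks an attribute up by name; the second argument is a fallback
# used when the attribute is absent from the tag.
editor = article.get("EDITOR")
status = article.get("STATUS", "unspecified")

# attrib behaves like a dictionary of all name-value pairs on the tag.
attribute_names = sorted(article.attrib)
print(editor, status, attribute_names)
```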

15

Entity References

Entities are used to represent special characters, such as the left angle bracket, “<”.

They are also used to refer to often repeated or varying text and to include the content of external files.

Every entity must have a unique name.

Entity references begin with an ampersand and end with a semicolon.

16

Declaring & Referencing Entities

<!ENTITY NEWSPAPER "Vervet Logic Times">

Using &NEWSPAPER; anywhere in the document inserts “Vervet Logic Times” at that location.

Internal entities allow you to define shortcuts for frequently typed text or text that is expected to change, such as the revision status of a document.
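Entity expansion can be observed with a small experiment. The sketch below assumes Python's standard xml.etree.ElementTree parser and declares the NEWSPAPER entity in an internal DTD subset; the NOTES wrapper text is made up for the example. It also uses the predefined entities for <, & and >.

```python
import xml.etree.ElementTree as ET

# The entity is declared inside the DOCTYPE, then referenced in the content.
document = """<!DOCTYPE NOTES [
  <!ENTITY NEWSPAPER "Vervet Logic Times">
]>
<NOTES>Published by &NEWSPAPER;. 1 &lt; 2 &amp; 3 &gt; 2.</NOTES>"""

notes = ET.fromstring(document)

# The parser replaces each reference with the entity's replacement text.
print(notes.text)
```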

17

Comments

Comments begin with “<!--” and end with “-->”.

Comments can contain any data except the literal string “--”.

Comments are not part of the textual content of an XML document. An XML processor is not required to pass them along to an application.
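Python's standard parser is one example of a processor that drops comments unless asked to keep them. The sketch below assumes Python 3.8 or later, where TreeBuilder accepts insert_comments; the sample text is invented for the illustration.

```python
import xml.etree.ElementTree as ET

source = "<NOTES><!-- checked by the editor -->See www.w3c.org</NOTES>"

# Default behaviour: the comment is not passed along to the application.
default_tree = ET.fromstring(source)

# Opt in to receiving comments as special child nodes.
keeping_parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
kept_tree = ET.fromstring(source, parser=keeping_parser)

print(len(default_tree), len(kept_tree))
```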

18

DTD (Document Type Definition)

Formally identifies the relationships between the various elements that form the document.

Can express constraints on the sequence and nesting of tags.

Can express constraints on attribute values and their types and defaults.

Can declare the names of external files that may be referenced, the formats of some external (non-XML) data that may be included, and entities that may be encountered.

19

The Newspaper DTD

<!-- A Sample Newspaper Article DTD -->
<!ENTITY NEWSPAPER "Vervet Logic Times">
<!ENTITY PUBLISHER "Vervet Logic Press">
<!ENTITY COPYRIGHT "Copyright 1998 Vervet Logic Press">

<!ELEMENT NEWSPAPER (ARTICLE+)>
<!ELEMENT ARTICLE (HEADLINE, BYLINE+, LEAD, BODY, NOTES?)>
<!ATTLIST ARTICLE
  AUTHOR  CDATA #REQUIRED
  EDITOR  CDATA #IMPLIED
  DATE    CDATA #IMPLIED
  EDITION CDATA #IMPLIED>
<!ELEMENT HEADLINE (#PCDATA)>
<!ELEMENT BYLINE (#PCDATA)>
<!ELEMENT LEAD (#PCDATA)>
<!ELEMENT BODY (#PCDATA)>
<!ELEMENT NOTES (#PCDATA)>

ELEMENT symbols
* as many times as needed (including zero)
+ at least once
? once or not at all
, must be in listed order
| either one or the other, any order

ATTRIBUTE options
#REQUIRED – must be present
#IMPLIED – may be present (optional)

Attribute Data Type
CDATA – character data
ENUMERATED – list of values
ID – unique ID
IDREF, IDREFS – referred value
ENTITY, ENTITIES – binary data
NMTOKEN, NMTOKENS, NOTATION

ELEMENT Data Type
#PCDATA – parsed character data (any characters)

20

Types of declarations in XML

• Element declarations
• Attribute list declarations
• Entity declarations
• Notation declarations

21

Element Declarations

Identifies the names of elements and the nature of their content

Example

<!ELEMENT ARTICLE (HEADLINE, BYLINE+, LEAD, BODY, NOTES?)>

An ARTICLE must contain a HEADLINE, one or more BYLINEs, a LEAD and a BODY, in that order, and may end with NOTES.
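A content model of this kind behaves much like a regular expression over the sequence of child elements. The sketch below is not a DTD validator (Python's standard library does not include one); it hand-translates the ARTICLE model into a regex, and the names ARTICLE_MODEL and matches_article_model are assumptions made for the illustration.

```python
import re
import xml.etree.ElementTree as ET

# Content model from the slide: (HEADLINE, BYLINE+, LEAD, BODY, NOTES?)
# Each child tag is followed by one space so the regex matches whole names.
ARTICLE_MODEL = re.compile(r"HEADLINE (BYLINE )+LEAD BODY (NOTES )?")

def matches_article_model(article):
    """Check the order and count of child elements against ARTICLE_MODEL."""
    child_tags = "".join(child.tag + " " for child in article)
    return ARTICLE_MODEL.fullmatch(child_tags) is not None

valid = ET.fromstring(
    "<ARTICLE><HEADLINE/><BYLINE/><BYLINE/><LEAD/><BODY/></ARTICLE>"
)
missing_byline = ET.fromstring(
    "<ARTICLE><HEADLINE/><LEAD/><BODY/><NOTES/></ARTICLE>"
)
wrong_order = ET.fromstring(
    "<ARTICLE><BYLINE/><HEADLINE/><LEAD/><BODY/></ARTICLE>"
)

for label, article in [("valid", valid),
                       ("missing byline", missing_byline),
                       ("wrong order", wrong_order)]:
    print(label, matches_article_model(article))
```

The comma maps to plain sequence in the regex, + to one-or-more, and ? to an optional group.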

22

ELEMENT DATA TYPE (PCDATA)

• PCDATA is parsed character data
• Example:
  <!ELEMENT Byline (#PCDATA | quote)*>
  <!ELEMENT Body (#PCDATA)*>
• The vertical bar indicates an “or” relationship
• The asterisk indicates that the content is optional (may occur zero or more times)
• Byline may contain zero or more characters and quote tags.

23

Attribute Declarations

Identify:

• Which elements may have attributes
• What attributes they may have
• What values the attributes may hold
• What default value each attribute has

24

Attributes: Example

<!ATTLIST ARTICLE
  AUTHOR ID #REQUIRED
  EDITOR CDATA #IMPLIED
  STATUS ( funny | notfunny ) 'funny'>

• Author, which is an ID and is required;

• Editor, which is a string and is not required;

• Status, which must be either funny or notfunny and defaults to funny if not specified.
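The effect of such a declaration can be imitated in a few lines of Python. This is only a sketch: the rule table ARTICLE_ATTRIBUTES and the function check_attributes are hypothetical, written by hand to mirror the ATTLIST above, and the uniqueness that the ID type demands across a document is not checked.

```python
import xml.etree.ElementTree as ET

# Hypothetical Python mirror of the ATTLIST on the slide.
ARTICLE_ATTRIBUTES = {
    "AUTHOR": {"required": True},
    "EDITOR": {"required": False},
    "STATUS": {"required": False,
               "allowed": {"funny", "notfunny"},
               "default": "funny"},
}

def check_attributes(element, rules):
    """Return (effective attribute values, list of problems)."""
    values = dict(element.attrib)
    problems = []
    for name, rule in rules.items():
        if name not in values:
            if rule["required"]:
                problems.append(f"missing required attribute {name}")
            elif "default" in rule:
                values[name] = rule["default"]  # apply the declared default
        elif "allowed" in rule and values[name] not in rule["allowed"]:
            problems.append(f"{name} must be one of {sorted(rule['allowed'])}")
    return values, problems

good_values, good_problems = check_attributes(
    ET.fromstring('<ARTICLE AUTHOR="a1"/>'), ARTICLE_ATTRIBUTES)
bad_values, bad_problems = check_attributes(
    ET.fromstring('<ARTICLE STATUS="sad"/>'), ARTICLE_ATTRIBUTES)

print(good_values, good_problems)
print(bad_problems)
```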

25

Types of Attributes

• CDATA
• ID
• IDREF or IDREFS
• ENTITY or ENTITIES
• NMTOKEN or NMTOKENS
• A list of names

26

Types of Default Values

• #REQUIRED
• #IMPLIED
• "value"
• #FIXED "value"

27

Notation Declarations

Identify specific types of external binary data.

This information is passed to the processing application, which may make whatever use of it it wishes.

A typical notation declaration is:

<!NOTATION GIF87A SYSTEM "GIF">

28

XML-QL: A Query Language for XML

• Designed at AT&T Labs
• XML-QL has a SELECT-WHERE construct, like SQL
• It borrows features of query languages recently developed by the database research community for semi-structured data
• XML-QL can express queries, which extract pieces of data from XML documents

29

Features of XML-QL

• Declarative: like SQL.
• Relationally complete: it can express joins.
• Easy implementation.
• Data Extraction: XML-QL can extract data from existing XML documents and construct new XML documents.
• Views: Supports both ordered and unordered views on an XML document.
• Availability: XML-QL is implemented as a prototype and is freely available in a Java version.

30

Features of XML-QL

• Path Expressions: Supports partially specified path expressions.
• Building new Elements: Supports creation of new elements.
• Combining Data Sources: Supports querying several data sources at the same time.
• Negation: XML-QL doesn’t support negation.
• Aggregation: Doesn’t support aggregate functions like min, max, sum, count and avg.
• Update Language: XML-QL doesn’t provide any support for insert, delete and update of elements.

31

Queries in XML-QL

Query 1: Produce all editors of the articles where author is John Doe

Feature Exploited: Selection, Projection and Data Extraction on element values

32

Query

Function query() {
  CONSTRUCT <result> {
    WHERE <NEWSPAPER.ARTICLE>
            <AUTHOR><NAME>"John Doe"</></>
            <EDITOR>$b</>
          </> IN "newspaper.xml"
    CONSTRUCT $b
  } </result>
}

33

Query Output

OUTPUT:

<?xml version="1.0" encoding="UTF-8"?>
<result>
  <NAME>Ernie Pyle</NAME>
</result>

34

Explanation

This query matches every <ARTICLE> element in the XML document newspaper.xml that has at least one <AUTHOR> element and an <EDITOR> element, and whose author name is “John Doe”. For each such match, it binds the variable b to the content of the editor element. The result is the list of editors bound to b.
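For readers without an XML-QL implementation at hand, the same selection can be approximated in Python. This is a sketch of the matching logic only, not of XML-QL itself. The inline data is a hypothetical stand-in for newspaper.xml in which AUTHOR and EDITOR are child elements containing a NAME, the shape the query pattern expects, and the name Sam Smith is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for newspaper.xml with AUTHOR and EDITOR as elements.
newspaper = ET.fromstring("""
<NEWSPAPER>
  <ARTICLE>
    <AUTHOR><NAME>Jane Doe</NAME></AUTHOR>
    <EDITOR><NAME>Sam Smith</NAME></EDITOR>
  </ARTICLE>
  <ARTICLE>
    <AUTHOR><NAME>John Doe</NAME></AUTHOR>
    <EDITOR><NAME>Ernie Pyle</NAME></EDITOR>
  </ARTICLE>
</NEWSPAPER>
""")

result = ET.Element("result")
for article in newspaper.findall("ARTICLE"):
    # WHERE part: the article must have an author named John Doe.
    author_names = [name.text for name in article.findall("AUTHOR/NAME")]
    if "John Doe" in author_names:
        for editor in article.findall("EDITOR"):
            # $b binds to the content of EDITOR, so copy its children.
            result.extend(list(editor))

output = ET.tostring(result, encoding="unicode")
print(output)
```

The loop plays the role of the WHERE clause and the result element plays the role of the outer CONSTRUCT.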

35

Discussion Question ?

Can XML be used for things besides the Internet?