1
XML and QUERY
Shilpi Ahuja
CSE 591 - Data Mining
4th April 2002
2
What is XML? (Extensible Markup Language)
A Markup language for structured documentation.
A Structural and Semantic language, not a formatting language
Not just for Web pages
3
HTML vs. XML
External Presentation
Xaver Roe
Wikingerufer 7
10555 Berlin
XML
<Address><name> Xaver Roe </name><street> Wikingerufer 7 </street><Town> Berlin </Town></Address>
HTML
<em>Xaver Roe</em> <br> Wikingerufer 7 <br> <strong> Berlin </strong>
4
Why "Extensible Markup Language"?
Language
  It has a grammar
  It has a vocabulary (sort of)
  It can be parsed by machines
Markup Language
  A mechanism to identify structures in a document
  It says what things are, not what they do
  It is not a programming language
  It is not compiled
Extensible
  You can add words to the language
5
XML describes structure and semantics, not formatting
XML documents form a tree
Element and attribute names reflect the kind of the element
Formatting can be added with a style sheet
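The tree structure can be made visible with a short Python sketch. It assumes the standard-library `xml.etree.ElementTree` parser and reuses the address markup from the HTML vs. XML slide; the helper name `outline` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Address markup taken from the HTML vs. XML slide.
ADDRESS_XML = (
    "<Address><name> Xaver Roe </name>"
    "<street> Wikingerufer 7 </street>"
    "<Town> Berlin </Town></Address>"
)

def outline(element, depth=0):
    """Return one indented line per element, visiting the tree depth-first."""
    text = (element.text or "").strip()
    lines = ["  " * depth + element.tag + (": " + text if text else "")]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(ADDRESS_XML)
address_outline = outline(root)
print("\n".join(address_outline))
```

The element names (name, street, Town) say what each piece of text is; nothing in the tree says how it should be displayed.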
7
Answer: No
In HTML, both the tag semantics and the tag set are fixed.
XML specifies neither semantics nor a tag set
XML lets you define your own tags
HTML describes layout
XML describes the structure of a document
XML separates content from presentation
8
So IS XML Just Like SGML?
No. Well, yes, sort of!
XML is a much-restricted form of SGML
It is defined as an application profile of SGML.
SGML is not well suited to serving documents over the web
9
So Why XML ?
XML was created so that richly structured documents could be used over the web
HTML is bound to a fixed set of semantics and provides no arbitrary structure
SGML provides arbitrary structure, but is too difficult to implement just for a web browser
11
A Simple XML Document
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE NEWSPAPER SYSTEM "newspaper.dtd">
<NEWSPAPER>
  <ARTICLE EDITOR="Ernie Pyle" DATE="11/15/98" EDITION="Evening" AUTHOR="Jane Doe">
    <HEADLINE>Extensible Markup Language Proposed</HEADLINE>
    <BYLINE>Jane Doe, Staff Writer</BYLINE>
    <LEAD>The newly proposed XML Specification has been making a splash in the community.</LEAD>
    <BODY>The newly proposed XML draft stands to revolutionize the exchange of easily.</BODY>
    <NOTES>No Notes</NOTES>
  </ARTICLE>
  <ARTICLE AUTHOR="John Doe" EDITOR="Ernie Pyle" DATE="02/15/98" EDITION="Morning">
    <HEADLINE>XML 1.0 Recommendation Released</HEADLINE>
    <BYLINE>John E. Doe, Reporter</BYLINE>
    <LEAD>The W3C today released the final recommendation for XML</LEAD>
    <BODY>XML Developers, are already using the released recommendation </BODY>
    <NOTES>See www.w3c.org for more information</NOTES>
  </ARTICLE>
</NEWSPAPER>
12
Characteristics
The document begins with an XML declaration, <?xml ...?> (lowercase; it looks like a processing instruction)
Open and close all tags
Empty tags end with />
There is a unique root element
Elements may not overlap
Attribute values are quoted
< and & are only used to start tags and entity references
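These rules are what a parser enforces as well-formedness. The sketch below assumes Python's standard `xml.etree.ElementTree` parser and feeds it small made-up fragments, each breaking one rule; the helper name `is_well_formed` is hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Made-up fragments: one valid, the rest each violating a single rule.
samples = {
    "ok": '<a x="1"><b/></a>',
    "unclosed tag": "<a><b></a>",
    "overlapping elements": "<a><b></a></b>",
    "unquoted attribute": "<a x=1/>",
    "two root elements": "<a/><b/>",
    "bare ampersand": "<a>R & D</a>",
}
results = {name: is_well_formed(text) for name, text in samples.items()}
for name, ok in results.items():
    print(f"{name}: {'well-formed' if ok else 'rejected'}")
```

Only the first fragment is accepted. A well-formedness check like this does not need a DTD.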
13
Elements
Most common form of markup.
Example: Article, Headline, Byline are all elements
Delimited by angle brackets, most elements identify the nature of the content they surround.
Some elements may be empty, i.e. they have no content.
A non-empty element always begins with a start-tag, <element>, and ends with an end-tag, </element>.
14
Attributes
Attributes are name-value pairs that occur inside start-tags after the element name.
For example, <ARTICLE EDITOR="Ernie Pyle"> is the ARTICLE element with the attribute EDITOR having the value Ernie Pyle.
In XML, all attribute values must be quoted.
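As a small illustration, the attributes of the first ARTICLE start-tag from the sample document can be read back as name-value pairs. This sketch assumes Python's standard `xml.etree.ElementTree`; the STATUS lookup and its "unknown" fallback are only there to show what happens when an attribute is absent.

```python
import xml.etree.ElementTree as ET

# Start-tag copied from the sample document, written as an empty element.
article = ET.fromstring(
    '<ARTICLE EDITOR="Ernie Pyle" DATE="11/15/98" '
    'EDITION="Evening" AUTHOR="Jane Doe"/>'
)

editor = article.get("EDITOR")            # value of an attribute that is present
status = article.get("STATUS", "unknown")  # fallback for an attribute that is absent
names = sorted(article.attrib)             # attribute order carries no meaning

print(editor, status, names)
```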
15
Entity References
Entities are used to represent special characters like left angle bracket, “<”
They’re also used to refer to often repeated or varying text and to include the content of external files.
Every entity must have a unique name
Entity references begin with an ampersand and end with a semicolon.
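The predefined entities for special characters can be tried out with the standard library. This is a minimal sketch assuming Python's `xml.sax.saxutils.escape` and `xml.etree.ElementTree`; the sample text and the `code` element name are made up.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "if a < b && b > c"   # made-up text containing special characters
escaped = escape(raw)       # replaces &, < and > with entity references

# The parser turns the entity references back into the original characters.
element = ET.fromstring("<code>" + escaped + "</code>")

print(escaped)
print(element.text == raw)
```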
16
Declaring & Referencing Entities
<!ENTITY NEWSPAPER "Vervet Logic Times">
Using &NEWSPAPER; anywhere in the document inserts “Vervet Logic Times” at that location.
Internal entities allow you to define shortcuts for frequently typed text or text that is expected to change, such as the revision status of a document.
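A quick way to see an internal entity being expanded is to declare it in an internal DTD subset and parse the document. The sketch assumes Python's standard `xml.etree.ElementTree`, whose underlying parser expands internal entities; the sentence inside NOTES is made up.

```python
import xml.etree.ElementTree as ET

# The entity is declared in the internal DTD subset and referenced in the body.
DOC = """<!DOCTYPE NEWSPAPER [
  <!ENTITY NEWSPAPER "Vervet Logic Times">
]>
<NEWSPAPER><NOTES>Published by &NEWSPAPER; daily.</NOTES></NEWSPAPER>"""

root = ET.fromstring(DOC)
notes = root.find("NOTES").text
print(notes)
```

Changing the replacement text in the one declaration changes every place that references the entity.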
17
Comments
Comments begin with “<!--” and end with “-->”.
Comments can contain any data except the literal string “--”.
Comments are not part of the textual content of an XML document. An XML processor is not required to pass them along to an application.
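One processor that does not pass comments along by default is Python's standard `xml.etree.ElementTree`, which makes for a short demonstration. The comment text below is made up; the NOTES content comes from the sample document.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<NOTES><!-- internal remark -->No Notes</NOTES>")

notes_text = root.text     # only the character data remains
child_count = len(root)    # the comment did not become a child node

print(repr(notes_text), child_count)
```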
18
DTD (Document Type Definition)
Formally identifies the relationships between the various elements that form the document.
Can express constraints on the sequence and nesting of tags.
Can express constraints on attribute values and their types and defaults
Can declare the names of external files that may be referenced, the formats of some external (non-XML) data that may be included, and entities that may be encountered.
19
The Newspaper DTD
<!-- A Sample Newspaper Article DTD -->
<!ENTITY NEWSPAPER "Vervet Logic Times">
<!ENTITY PUBLISHER "Vervet Logic Press">
<!ENTITY COPYRIGHT "Copyright 1998 Vervet Logic Press">
<!ELEMENT NEWSPAPER (ARTICLE+)>
<!ELEMENT ARTICLE (HEADLINE, BYLINE+, LEAD, BODY, NOTES?)>
<!ATTLIST ARTICLE AUTHOR  CDATA #REQUIRED
                  EDITOR  CDATA #IMPLIED
                  DATE    CDATA #IMPLIED
                  EDITION CDATA #IMPLIED>
<!ELEMENT HEADLINE (#PCDATA)>
<!ELEMENT BYLINE (#PCDATA)>
<!ELEMENT LEAD (#PCDATA)>
<!ELEMENT BODY (#PCDATA)>
<!ELEMENT NOTES (#PCDATA)>
ELEMENT symbols
*  as many times as needed (including zero)
+  at least once
?  once or not at all
,  must be in listed order
|  either one or the other, any order
ATTRIBUTE options
#REQUIRED – must be specified
#IMPLIED – may be omitted
Attribute Data Types
CDATA – character data
ENUMERATED – list of values
ID – unique ID
IDREF, IDREFS – referred value
ENTITY, ENTITIES – binary data
NMTOKEN, NMTOKENS, NOTATION
ELEMENT Data Type
#PCDATA – any characters
20
Types of declarations in XML
Element declarations
Attribute list declarations
Entity declarations
Notation declarations
21
Element Declarations
Identifies the names of elements and the nature of their content
Example
<!ELEMENT ARTICLE (HEADLINE, BYLINE+, LEAD, BODY, NOTES?)>
An ARTICLE must contain a HEADLINE, one or more BYLINEs, a LEAD and a BODY, in that order, and may contain NOTES
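A content model behaves much like a regular expression over child element names. The sketch below is not a DTD validator: it assumes the ARTICLE content model has been translated by hand into a Python `re` pattern, and the names `ARTICLE_MODEL` and `matches_article_model` are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# (HEADLINE, BYLINE+, LEAD, BODY, NOTES?) rewritten by hand as a regular
# expression over child names, each name followed by one space.
ARTICLE_MODEL = re.compile(r"HEADLINE (BYLINE )+LEAD BODY (NOTES )?")

def matches_article_model(article):
    """Check only the sequence of direct children, not their content."""
    children = "".join(child.tag + " " for child in article)
    return ARTICLE_MODEL.fullmatch(children) is not None

valid = ET.fromstring(
    "<ARTICLE><HEADLINE/><BYLINE/><BYLINE/><LEAD/><BODY/></ARTICLE>"
)
wrong_order = ET.fromstring(
    "<ARTICLE><HEADLINE/><LEAD/><BYLINE/><BODY/></ARTICLE>"
)
no_byline = ET.fromstring(
    "<ARTICLE><HEADLINE/><LEAD/><BODY/><NOTES/></ARTICLE>"
)

print(matches_article_model(valid))
print(matches_article_model(wrong_order))
print(matches_article_model(no_byline))
```

A validating parser performs this kind of check for every declared element, using the DTD itself rather than a hand-written pattern.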
22
ELEMENT DATA TYPE (PCDATA)
Parsed character data
Example:
<!ELEMENT Byline (#PCDATA | quote)*>
<!ELEMENT Body (#PCDATA)*>
The vertical bar indicates an “or” relationship
The asterisk indicates that the content is optional (may occur zero or more times)
Byline may contain zero or more characters and quote tags.
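Mixed content such as the Byline declaration shows up in a parsed tree as text interleaved with child elements. The sketch assumes Python's standard `xml.etree.ElementTree`, which stores the text before the first child in `text` and the text after a child in that child's `tail`; the byline wording is made up.

```python
import xml.etree.ElementTree as ET

# Made-up Byline mixing character data with a quote element.
byline = ET.fromstring(
    "<Byline>Jane Doe, <quote>Staff Writer</quote> since 1998</Byline>"
)
quote = byline.find("quote")

leading_text = byline.text               # character data before the child
quoted_text = quote.text                 # character data inside the child
trailing_text = quote.tail               # character data after the child
full_text = "".join(byline.itertext())   # all character data in document order

print(repr(leading_text), repr(quoted_text), repr(trailing_text))
print(full_text)
```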
23
Attribute Declarations
Identify which elements may have attributes
What attributes they may have
What values the attributes may hold
What default value each attribute has
24
Attributes: Example
<!ATTLIST ARTICLE
  AUTHOR ID #REQUIRED
  EDITOR CDATA #IMPLIED
  STATUS ( funny | notfunny ) 'funny'>
• Author, which is an ID and is required;
• Editor, which is a string and is not required;
• Status, which must be either funny or notfunny and defaults to funny if not specified.
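The effect of these three declarations can be mimicked in a few lines. The sketch is not a real DTD processor: it assumes the ATTLIST has been copied by hand into a Python dictionary, it does not check ID uniqueness, and the names `ARTICLE_ATTLIST` and `apply_attlist` are hypothetical.

```python
REQUIRED, IMPLIED = "#REQUIRED", "#IMPLIED"

# Hand-written mirror of the ATTLIST: name -> (type, default declaration).
# A tuple as the type stands for an enumerated list of allowed values.
ARTICLE_ATTLIST = {
    "AUTHOR": ("ID", REQUIRED),
    "EDITOR": ("CDATA", IMPLIED),
    "STATUS": (("funny", "notfunny"), "funny"),
}

def apply_attlist(attrs, attlist=ARTICLE_ATTLIST):
    """Return attrs with defaults filled in; raise ValueError on violations."""
    result = dict(attrs)
    for name, (kind, default) in attlist.items():
        if name not in result:
            if default == REQUIRED:
                raise ValueError(f"missing required attribute {name}")
            if default != IMPLIED:
                result[name] = default   # literal default value applies
        if isinstance(kind, tuple) and name in result and result[name] not in kind:
            raise ValueError(f"{name} must be one of {kind}")
    return result

filled = apply_attlist({"AUTHOR": "jdoe"})
print(filled)
```

Here STATUS is filled in with its default, EDITOR stays absent because it is #IMPLIED, and leaving out AUTHOR would raise an error.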
25
Types of Attributes
CDATA
ID
IDREF or IDREFS
ENTITY or ENTITIES
NMTOKEN or NMTOKENS
A list of names
27
Notation Declarations
Identify specific types of external binary data.
This information is passed to the processing application, which may make whatever use of it it wishes.
A typical notation declaration is:
<!NOTATION GIF87A SYSTEM "GIF">
28
XML-QL: A Query Language for XML
Designed at AT&T Labs
XML-QL has a SELECT-WHERE construct, like SQL
It borrows features of query languages recently developed by the database research community for semi-structured data.
XML-QL can express queries, which extract pieces of data from XML documents
29
Features of XML-QL
Declarative: like SQL.
Relationally complete: it can express joins.
Easy implementation
Data Extraction: XML-QL can extract data from existing XML documents and construct new XML documents.
Views: Supports both ordered and unordered views on an XML document.
Availability: XML-QL is implemented as a prototype and is freely available in a Java version.
30
Features of XML-QL
Path Expressions: Supports partially specified path expressions.
Building new Elements: Supports creation of new elements.
Combining Data Sources: Supports querying several data sources at the same time.
Negation: XML-QL doesn’t support negation.
Aggregation: Doesn’t support aggregate functions like min, max, sum, count and avg.
Update Language: XML-QL doesn’t provide any support for insert, delete and update of elements.
31
Queries in XML-QL
Query 1: Produce all editors of the articles where author is John Doe
Feature Exploited: Selection, Projection and Data Extraction on element values
32
Query
Function query() {
  CONSTRUCT <result> {
    WHERE <NEWSPAPER.ARTICLE>
            <AUTHOR><NAME>"John Doe"</></>
            <EDITOR>$b</>
          </> IN "newspaper.xml"
    CONSTRUCT $b
  } </result>
}
33
Query Output
OUTPUT:
<?xml version="1.0" encoding="UTF-8"?>
<result>
  <NAME>Ernie Pyle</NAME>
</result>
34
Explanation
This query matches every <ARTICLE> element in the XML document newspaper.xml that has at least one <AUTHOR> element and an <EDITOR> element, and whose author name is “John Doe”. For each such match, it binds the variable $b to the content of the <EDITOR> element. The result is the list of editors bound to $b.
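For readers without an XML-QL implementation, the same matching and binding can be approximated in Python. This is a sketch using the standard `xml.etree.ElementTree`, not XML-QL itself. The inline document is hypothetical: it stores AUTHOR and EDITOR as child elements holding a NAME, which is the shape the query pattern and its output imply, and the second article and the function name `editors_of` are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for newspaper.xml, shaped to fit the query pattern.
NEWSPAPER_XML = """<NEWSPAPER>
  <ARTICLE>
    <AUTHOR><NAME>John Doe</NAME></AUTHOR>
    <EDITOR><NAME>Ernie Pyle</NAME></EDITOR>
  </ARTICLE>
  <ARTICLE>
    <AUTHOR><NAME>Jane Doe</NAME></AUTHOR>
    <EDITOR><NAME>Ann Smith</NAME></EDITOR>
  </ARTICLE>
</NEWSPAPER>"""

def editors_of(author_name, newspaper):
    """Build a <result> element holding the editors of the author's articles."""
    result = ET.Element("result")
    for article in newspaper.findall("ARTICLE"):       # <NEWSPAPER.ARTICLE>
        authors = [name.text for name in article.findall("AUTHOR/NAME")]
        if author_name not in authors:                 # WHERE condition
            continue
        for editor in article.findall("EDITOR"):
            result.extend(list(editor))                # CONSTRUCT $b
    return result

newspaper = ET.fromstring(NEWSPAPER_XML)
result = editors_of("John Doe", newspaper)
print(ET.tostring(result, encoding="unicode"))
```

The loop plays the role of the WHERE clause, and copying the children of EDITOR into `result` plays the role of CONSTRUCT.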