XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science.

37
XML for Scientific Applications

Transcript of XML for Scientific Applications. Outline SQ1 returned History Manipulating XML XML for Science.

Page 1: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

XML for Scientific Applications

Page 2: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

Outline

SQ1 returned

History Manipulating XML XML for Science

Page 3: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

XML History

late 1960s “Generic Coding” for elect. manuscripts

'heading' rather than 'format-17' 1969

Goldfarb, Mosher, Lorie @ IBM Generalized Markup Language

1980 SGML; Standardized by ANSI IRS, DoD were early adopters

Page 4: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

SGML

Representation of documents Device-Independent, Software-

Independent Content

Tagged Text Structure

Document Type Definition (DTD) Style

Usually proprietary

Page 5: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

SGML: Content

Tagged Text start tags and end tags indentation not important

<section> <heading>SGML Example</heading> <paragraph> <line>first line of SGML</line> <line>second line of SGML</line> </paragraph></section>

Page 6: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

SGML: Structure

A section is an optional heading, followed by one or more paragraphs.

A paragraph is one or more lines or a quote.

A heading is just text.

A line is just text.

Dcoument Type Definition (DTD)

<!ELEMENT section - - (heading?, paragraph+)><!ELEMENT paragraph - O (line+ | quote)><!ELEMENT heading - O (#PCDATA) ><!ELEMENT line - O (#PCDATA) ><!ELEMENT quote - O (#PCDATA) >

Page 7: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

SGML: Benefits and Weaknesses

Benefits Human-readable Portable Structure-oriented

Weaknesses DTD difficult to define up-front Difficult to implement

syntax abbreviations/rules inference

Page 8: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

AN SGML Application: HTML

1990 Tim Berners-Lee @ CERN

a Scientific Data Management problem… writes a DTD for “Hypertext Markup Language”

1994-1996 Web takes off Page designers need guarantees about how

their page is going to look <font> tags, etc. introduced; purists despair MS and Netscape fight

Page 9: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

Data on the Web

<title>,<h1>,<font> are for humans

Back to SGML for data, but Stricter syntax (must use end tags)

good for implementors looser semantics (DTD not required)

good for users

Non-standard markup is better than no markup at all

Page 10: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

Growth

Sun introduces Java New hope for interoperability

Microsoft pushes XML because…it’s not Java

Individuals can publish data unilaterally Contrast: How do you publish relations? Contrast: Documents? (MS Word? as pdf?) A non-partisan, unimpeachable format

Page 11: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

XML File Sample

<?xml version="1.0"?> <dining-room>

<manufacturer>The Wood Shop</manufacturer>

<table type="round" wood="maple">

<price>$199.99</price> </table> <chair wood="maple">

<quantity>6</quantity> <price>$39.99</price>

</chair> </dining-room>

Page 12: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

Abstract View of XML

dining-room

table chairmanufacturer

type wood wood“The Wood Shop” price quantity

“round” “maple” “$199.99” “$39.99” 6“maple”

price

Page 13: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

Manipulating XML

Page 14: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

XPath

dining-room

table chairmanufacturer

type wood wood“The Wood Shop” price quantity

“round” “maple” 199.99 39.99 6“maple”

price

1) //price

2) /*/*/wood

3) //wood/../

4) /dining-room/*[price>100]/type

5) //wood[1]/text()

Page 15: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

XPath

XML [Node] language is not “closed” cannot compose xpath expressions

Answers are references to nodes works for local memory…

Page 16: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

XSLTdining-room

table chairmanufacturer

type wood wood“The Wood Shop” price quantity

“round” “maple” 199.99 39.99 6“maple”

price

<xsl:stylesheet version="1.0" xmlns:xsl=“…"> <xsl:template match=“//wood">

<xsl:if test=“/price > 100”><material>

<xsl:value-of select=“.”/> </material>

</xsl:if></xsl:template> </xsl:stylesheet>

Page 17: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

XSLT

logically, a set of rules pattern1 -> result1 pattern2 -> result2 …

physically, an XML document why is this a good idea?

XML XML, via XML

Page 18: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

XQuerydining-room

table chairmanufacturer

type wood wood“The Wood Shop” price quantity

“round” “maple” 199.99 39.99 6“maple”

price

for $v in doc(“example.xml”)//woodwhere $v/../price > 100return <material>

$v </material>

Page 19: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

XQuerydining-room

table chairmanufacturer

type wood wood“The Wood Shop” price quantity

“round” “maple” 199.99 39.99 6“maple”

price

for $m in doc(“example.xml”)//table[wood=“maple”]for $o in doc(“example.xml”)//table[wood=“oak”]where $m/price > $o/pricereturn <pair> <maple>$m/type </maple>

<oak>$m/type </oak> </pair>

Page 20: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

XQuery

Feels more like a Query language Joins are natural to express

XML XML, via special syntax

Page 21: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

XSLT and XQuery

Today’s author sums up: XSLT is for transformation

document-centric view of XML XQuery is for query

data-centric view of world

Page 22: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

XML for Science

Exercise: Sensor Data again

Page 23: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

Sensor Data

Would Homework 1 be any easier with XML?

Page 24: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

Benefits for Science

“better, more precise searching of full-content;

automatic extraction of object metadata; better support for compound document

formats and dynamic formatting generally; and,

improved processes for scholarly dissemination tasks such as peer review, versioning, and archiving.”

-- from NSF / NSDL Workshop on Scientific Markup Languages

Page 25: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

XML for Science

Recall features of Science Data: Read-oriented access Provenance

who, what, when, where, why Interesting Data Types

timeseries spatial arrays images

Scale

Page 26: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

XML for Science

Read-oriented access? perfect!

Provenance requires some flexibility; no problem

Interesting Data Types …and special file formats

Scale could get ugly

Page 27: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

Interesting Data Types

Data locked in binary file formats Binary Format Description Language

[Myers, Chappell 2000] Data Format Description Language

[OpenGrid Project] Retrofitting Data Models

[Howe, Maier SSDBM 2005] PADX

[Fernandez et al, PLANX 2006] XDTM

[Foster, Voeckler et al. Global Grid Forum 2005]

Page 28: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

Binary Format Description Language

Page 29: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

PADX

Page 30: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

1. Precord Pstruct summary_header_t {2. "0|";3. Punixtime tstamp;4. };5. Pstruct no_ramp_t {6. "no_ii";7. Puint64 id;8. };9. Punion dib_ramp_t {10. Pint64 ramp;11. no_ramp_t genRamp;12. };13. Pstruct order_header_t {14. Puint32 order_num;15. ’|’; Puint32 att_order_num;16. ’|’; Puint32 ord_version;17. ’|’; Popt pn_t service_tn;18. ’|’; Popt pn_t billing_tn;19. ’|’; Popt pn_t nlp_service_tn;20. ’|’; Popt pn_t nlp_billing_tn;21. ’|’; Popt Pzip zip_code;22. ’|’; dib_ramp_t ramp;23. ’|’; Pstring(:’|’:) order_type;24. ’|’; Puint32 order_details;25. ’|’; Pstring(:’|’:) unused;26. ’|’; Pstring(:’|’:) stream;27. };28. Pstruct event_t {29. Pstring(:’|’:) state;30. ’|’; Punixtime tstamp;31. };32. Parray event_seq_t {33. event_t[] : Psep(’|’) && Pterm(Peor);34. };35. Precord Pstruct order_t {36. order_header_t order_header;37. ’|’; event_seq_t events;38. };39. Parray orders_t {40. order_t[];41. };42. Psource Pstruct summary_t{43. summary_header_t summary_header;44. orders_t orders;45. };

Page 31: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

XML as Data

Maier’s Maxim:

Every popular data exchange format eventually becomes a data storage format.

Page 32: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

XML Storage: Shredding

Use RDBMS as your storage engine Two approaches:

Schema-aware Schema-oblivious

dining-room

table chairmanufacturer

type wood wood“The Wood Shop” price quantity

“round” “maple” 199.99 39.99 6“maple”

price

Page 33: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

XML Storage: Schema-aware

Table(SKU, Wood, Type, Price)Chair(SKU, Wood, Price)

DiningRoom(Manufacturer, Chairs, Quantity, Table)

Page 34: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

XML Storage: Schema-oblivious

Remember fancy node-labeling schemes…

Edge(NodeId, Tag, Value, ParentNodeId)

Page 35: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

Left/Right Labeling

dining-room

table chairmanufacturer

type wood wood“The Wood Shop” price quantity

“round” “maple” 199.99 39.99 6“maple”

price

0

1

2 3

4 5

7

6

8

9

34

10 …

Which queries are easy and fast?

What did we say the problems were?

Page 36: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

Path Labeling

dining-room

table chairmanufacturer

type wood wood“The Wood Shop” price quantity

“round” “maple” 199.99 39.99 6“maple”

price

0

0.0

0.0.0 0.1.2

0.1

0.1.0.0

0.1.0

0.1.1.0

0.1.1

What queries are fast and/or easy?

What did we say the problems were?

0.1.2.0

Page 37: XML for Scientific Applications. Outline  SQ1 returned  History  Manipulating XML  XML for Science.

Next time

Geographic Information Systems Spatial Data

Study Questions 2 (I’ll post them tonight)