Describing XML Wrappers for Information Integration Research project XRAKE November 15, 2001.

Describing XML Wrappers for Information Integration

Research project XRAKE

November 15, 2001

November 15, 2001 Research project XRAKE 2

Research project XRAKE

Merja Ek, Heli Hakkarainen, Pekka Kilpeläinen, Eila Kuikka, and Tommi Penttinen

University of Kuopio, Department of Computer Science and Applied Mathematics

Content

• Research project XRAKE

• Introduction

• General ideas of XW

• Examples

• Implementation & Future

Introduction

• programming: error-prone, tedious
• XW
  – declarative
  – serialised data
  – influenced by XML Schema, XSLT
• XW wrapper
  – well-formed XML
  – highly readable

General ideas of XW

• XW can
  – remove data items
  – add structure
  – remove structure
• XW cannot
  – change order of data
• crude transformation
  – followed by e.g. XSLT

[Figure: input data → XW engine, driven by the XW wrapper specification → XML → XSLT]

Input data:

AA x1 x2
BB y1 y2
z1 z2

XML produced by the XW engine:

<part-a>
  <e1>x1</e1>
  <e2>x2</e2>
</part-a>
<part-b>
  <line-1>
    <d1>y1</d1>
    <d2>y2</d2>
  </line-1>
  <d3>z2</d3>
</part-b>

General ideas of XW (cont'd)

• wrapper is a template for output
  – element names
  – structure ~ input structure
• input hierarchically divided into parts
• part ~ element
• part + subparts ~ element + child elements

[Figure: the input (<the-whole …>) is divided into parts (<part-X …>, <part-Y …>), and each part into subparts (<subpart-1 …>, <subpart-2 …>, <subpart-3 …>)]

Examples

• positional text data
  – phone invoices

• separator-delimited text

• binary data

INVOICE INVOICE NUMBER: 44196
CUSTOMER NUMBER: 25272
PERSONAL REFERENCE: WORK

John Smith
Garden Avenue 40
43234 Bigtown

PHONE SPECIFICATION

DATE UNITS DURATION NUMBER PRICE
11.1.1992 5 307 min 37126 50.00
23.6.1995 10 193 min 53829 122.00
----------------------------------------------------------------
John Smith
Garden Avenue 40
43234 Bigtown

595324
17.8.1996 907.00

XW Wrapper Specification

<xw:wrapper xw:name="phone-invoice"
            xw:sourcetype="text"
            xmlns:xw="http://www.cs.uku.fi/XW/2001">
  <invoice xw:starter="\^INVOICE" xw:occurs="unbounded">
    …
  </invoice>
</xw:wrapper>

<xw:wrapper xw:name="phone-invoice"
            xw:sourcetype="text"
            xmlns:xw="http://www.cs.uku.fi/XW/2001">
  <invoice xw:starter="\^INVOICE" xw:occurs="unbounded">
    <identifierdata ...>
      ...
    </identifierdata>
    <specification xw:starter="\^PHONE SPECIFICATION" ...>
      ...
    </specification>
    <invoicedata xw:starter="\^----------" ...>
      ...
    </invoicedata>
  </invoice>
</xw:wrapper>
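The way starters divide an invoice into its three parts can be sketched in a few lines of Python. This is only an illustration, not the XW engine: the function `split_by_starters` and the `PARTS` table are hypothetical names, and the sketch assumes that a starter is a regular expression matched at the beginning of a line and that a part without a starter begins where its parent begins.

```python
import re

# Hypothetical part table: (element name, starter regex or None).
# None means the part begins where its parent begins (an assumption).
PARTS = [
    ("identifierdata", None),
    ("specification", r"PHONE SPECIFICATION"),
    ("invoicedata", r"----------"),
]

def split_by_starters(lines, parts):
    """Return {name: lines of that part}, scanning the input once, in order."""
    starts = []
    pos = 0
    for name, starter in parts:
        if starter is not None:
            # re.match anchors at the start of the line, like the leading \^
            while pos < len(lines) and not re.match(starter, lines[pos]):
                pos += 1
        starts.append((name, pos))
    # Each part ends where the next one starts; the last runs to the end.
    ends = [start for _, start in starts[1:]] + [len(lines)]
    return {name: lines[start:end] for (name, start), end in zip(starts, ends)}

invoice = [
    "INVOICE INVOICE NUMBER: 44196",
    "CUSTOMER NUMBER: 25272",
    "PERSONAL REFERENCE: WORK",
    "John Smith",
    "Garden Avenue 40",
    "43234 Bigtown",
    "PHONE SPECIFICATION",
    "DATE UNITS DURATION NUMBER PRICE",
    "11.1.1992 5 307 min 37126 50.00",
    "23.6.1995 10 193 min 53829 122.00",
    "-" * 64,
    "John Smith",
    "Garden Avenue 40",
    "43234 Bigtown",
    "595324",
    "17.8.1996 907.00",
]
parts = split_by_starters(invoice, PARTS)
```

Because the input is scanned strictly front to back, the sketch preserves the order of the data, in line with the restriction that XW cannot reorder it.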

<xw:wrapper xw:name="phone-invoice"
            xw:sourcetype="text"
            xmlns:xw="http://www.cs.uku.fi/XW/2001">
  <invoice xw:starter="\^INVOICE" xw:occurs="unbounded">
    <identifierdata xw:childterminator="\n"
                    xw:ignoreemptysubpart="true">
      <invoicenumber xw:position="53 64"/>
      <customernumber xw:position="60 71"/>
      <personalreference xw:position="60 71"/>
      <name xw:position="1 22"/>
      <streetaddress xw:position="1 22"/>
      <postoffice xw:position="1 22"/>
    </identifierdata>
    <specification xw:starter="\^PHONE SPECIFICATION" ...>
      ...
    </specification>
    <invoicedata xw:starter="\^----------" ...>
      ...
    </invoicedata>
  </invoice>
</xw:wrapper>
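A minimal Python sketch of positional extraction follows. It assumes that `xw:position="53 64"` means columns 53 to 64, counted from 1 and inclusive, and that surrounding whitespace is trimmed; neither is stated on the slide. The helper names and the column layout of the sample lines are made up for the illustration, since the original alignment of the invoice is not preserved here.

```python
# Child templates of <identifierdata>, copied from the wrapper above.
CHILDREN = [
    ("invoicenumber", "53 64"),
    ("customernumber", "60 71"),
    ("personalreference", "60 71"),
    ("name", "1 22"),
    ("streetaddress", "1 22"),
    ("postoffice", "1 22"),
]

def extract_position(line, position):
    """Cut columns 'first last' (assumed 1-based, inclusive) and trim blanks."""
    first, last = (int(n) for n in position.split())
    return line[first - 1:last].strip()

def wrap_positional(lines, children, ignore_empty=True):
    """Pair each child template with the next line (childterminator = newline)."""
    if ignore_empty:  # mimics xw:ignoreemptysubpart="true"
        lines = [line for line in lines if line.strip()]
    return [(name, extract_position(line, pos))
            for (name, pos), line in zip(children, lines)]

# Sample lines padded so that the values sit in the declared columns
# (the padding widths are assumptions of this sketch).
sample = [
    "INVOICE".ljust(35) + "INVOICE NUMBER:".ljust(17) + "44196",
    " " * 35 + "CUSTOMER NUMBER:".ljust(24) + "25272",
    " " * 35 + "PERSONAL REFERENCE:".ljust(24) + "WORK",
    "",
    "John Smith",
    "Garden Avenue 40",
    "43234 Bigtown",
]
fields = wrap_positional(sample, CHILDREN)
```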

<xw:wrapper ...>
  <invoice xw:starter="\^INVOICE" xw:occurs="unbounded">
    <identifierdata xw:childterminator="\n" ...> </identifierdata>
    <specification xw:starter="\^PHONE SPECIFICATION"
                   xw:childterminator="\n"
                   xw:ignoreemptysubpart="true">
      <xw:ignore/>
      <specificationrow xw:occurs="unbounded">
        <date xw:position="1 12"/>
        <units xw:position="14 22"/>
        <duration xw:position="24 33"/>
        <number xw:position="35 43"/>
        <price xw:position="45 52"/>
      </specificationrow>
    </specification>
    <invoicedata xw:starter="\^----------" ...> </invoicedata>
  </invoice>
</xw:wrapper>
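The combination of `<xw:ignore/>` and `xw:occurs="unbounded"` can be sketched as "skip one line, then repeat the row template for every remaining line". The Python below is an illustration under assumptions: the starter line is taken to be consumed already, positions are read as 1-based inclusive columns, and the column widths of the sample rows are invented to fit the declared positions. `wrap_rows` and `field` are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

# Row template copied from <specificationrow> above.
ROW_FIELDS = [
    ("date", "1 12"),
    ("units", "14 22"),
    ("duration", "24 33"),
    ("number", "35 43"),
    ("price", "45 52"),
]

def field(line, position):
    """Cut columns 'first last' (assumed 1-based, inclusive) and trim blanks."""
    first, last = map(int, position.split())
    return line[first - 1:last].strip()

def wrap_rows(lines, skip=1):
    """Skip `skip` lines (like <xw:ignore/>), then emit one row per line."""
    spec = ET.Element("specification")
    data = [line for line in lines if line.strip()]  # ignore empty subparts
    for line in data[skip:]:
        row = ET.SubElement(spec, "specificationrow")
        for name, position in ROW_FIELDS:
            ET.SubElement(row, name).text = field(line, position)
    return spec

def make_row(date, units, duration, number, price):
    # Assumed layout: each value starts at the first column of its position.
    return f"{date:<13}{units:<10}{duration:<11}{number:<10}{price}"

lines = [
    make_row("DATE", "UNITS", "DURATION", "NUMBER", "PRICE"),  # header, ignored
    make_row("11.1.1992", "5", "307 min", "37126", "50.00"),
    make_row("23.6.1995", "10", "193 min", "53829", "122.00"),
]
spec = wrap_rows(lines)
```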

<xw:wrapper xw:name="phone-invoice"
            xw:sourcetype="text"
            xmlns:xw="http://www.cs.uku.fi/XW/2001">
  <invoice xw:starter="\^INVOICE" xw:occurs="unbounded">
    <identifierdata xw:childterminator="\n"
                    xw:ignoreemptysubpart="true"> </identifierdata>
    <specification xw:starter="\^PHONE SPECIFICATION"
                   xw:childterminator="\n"
                   xw:ignoreemptysubpart="true"> </specification>
    <invoicedata xw:starter="\^----------"
                 xw:childterminator="\n"
                 xw:ignoreemptysubpart="true">
      <xw:ignore xw:occurs="4"/>
      <reference xw:position="30 48"/>
      <xw:collapse>
        <duedate xw:position="30 39"/>
        <totalsum xw:position="42 50"/>
      </xw:collapse>
    </invoicedata>
  </invoice>
</xw:wrapper>

Resulting XML

<invoice>
  <identifierdata>
    <invoicenumber>44196</invoicenumber>
    <customernumber>25272</customernumber>
    <personalreference>WORK</personalreference>
    <name>John Smith</name>
    <streetaddress>Garden Avenue 40</streetaddress>
    <postoffice>43234 Bigtown</postoffice>
  </identifierdata>
  <specification>
    <specificationrow>
      <date>11.1.1992</date>
      <units>5</units>
      <duration>307 min</duration>
      <number>37126</number>
      <price>50.00</price>
    </specificationrow>
    <specificationrow>
      <date>23.6.1995</date>
      <units>10</units>
      <duration>193 min</duration>
      <number>53829</number>
      <price>122.00</price>
    </specificationrow>
  </specification>
  <invoicedata>
    <reference>595324</reference>
    <duedate>17.8.1996</duedate>
    <totalsum>907.00</totalsum>
  </invoicedata>
</invoice>

Examples

• positional text data

• separator-delimited text
  – HL7 version 2.3 messages

• binary data

MSH|^~\&|KL-Lab||CCIMS|RDNT01|200001071300||ORU^R01...
PID|||311244A0112|ExamMod1|Smith^John||19441231|M...
OBR||76551|Res_01||||20000107060000|||||||||||||||||CH|C
OBX||NM|1535^aB-pO2^||11||||||F
NTE|||This is a comment for aB-pO2.
NTE|||Another comment for aB-pO2.
OBX||NM|1026^S -ALAT^||61|||*|||F

<!-- MSH, PID and OBR lines processed above -->
<xw:CHOICE xw:occurs='unbounded'>
  <xw:collapse xw:starter='\^OBX' xw:childseparator='|'>
    <xw:ignore xw:occurs='3'/>
    <observation/>
    <xw:ignore/>
    <result/>
    <xw:ignore xw:occurs='2'/>
    <flag/>
    <xw:ignore xw:occurs='2'/>
    <responsetype/>
  </xw:collapse>
  <xw:collapse xw:starter='\^NTE' xw:childseparator='|'
               xw:occurs='unbounded'>
    <xw:ignore xw:occurs='3'/>
    <xw:collapse/>
  </xw:collapse>
</xw:CHOICE>

<!-- MSH, PID and OBR lines processed above -->
<xw:CHOICE xw:occurs='unbounded'>
  <xw:collapse xw:starter='\^OBX' xw:childseparator='|'>
    <xw:ignore xw:occurs='3'/>
    <observation/>
    <xw:ignore/>
    <result/>
    <xw:ignore xw:occurs='2'/>
    <flag/>
    <xw:ignore xw:occurs='2'/>
    <responsetype/>
  </xw:collapse>
  <xw:ELEMENT xw:name='comment'>
    <xw:collapse xw:starter='\^NTE' xw:childseparator='|'
                 xw:occurs='unbounded'>
      <xw:ignore xw:occurs='3'/>
      <xw:collapse/>
    </xw:collapse>
  </xw:ELEMENT>
</xw:CHOICE>

Resulting XML

<response>
  ...
  <observation>1535^aB-pO2^</observation>
  <result>11</result>
  <responsetype>F</responsetype>
  <comment>This is a comment for aB-pO2.Another comment for aB-pO2.</comment>
  <observation>1026^S -ALAT^</observation>
  <result>61</result>
  <flag>*</flag>
  ...
</response>
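The handling of the OBX and NTE segments can be sketched in Python as follows. This is an illustration rather than the XW engine: the function names are hypothetical, empty fields are assumed to produce no element (as the missing `<flag>` in the output above suggests), and consecutive NTE segments are assumed to be concatenated into one `comment`.

```python
# Field layout of an OBX segment, mirroring the wrapper: None = <xw:ignore/>.
OBX_FIELDS = [None, None, None, "observation", None, "result",
              None, None, "flag", None, None, "responsetype"]

def wrap_obx(segment):
    """Split on '|' and keep the named, non-empty fields in input order."""
    values = segment.split("|")
    return [(name, value) for name, value in zip(OBX_FIELDS, values)
            if name is not None and value]

def wrap_segments(segments):
    """Choose per segment between the OBX and NTE alternatives."""
    out, comment = [], []

    def flush_comment():
        if comment:
            out.append(("comment", "".join(comment)))
            comment.clear()

    for segment in segments:
        if segment.startswith("NTE"):
            # Ignore the first three fields, collapse the rest into text.
            comment.append(segment.split("|", 3)[3])
        elif segment.startswith("OBX"):
            flush_comment()
            out.extend(wrap_obx(segment))
    flush_comment()
    return out

message = [
    "OBX||NM|1535^aB-pO2^||11||||||F",
    "NTE|||This is a comment for aB-pO2.",
    "NTE|||Another comment for aB-pO2.",
    "OBX||NM|1026^S -ALAT^||61|||*|||F",
]
elements = wrap_segments(message)
```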

Examples

• positional text data

• separator-delimited text

• binary data
  – packet of IP-based communications protocol

Binary data

length  16b    16b    16b    16b    4*8b    4*8b    varies
name    len    chk    id     off    src     dst     pay
type    short  short  short  short  4*byte  4*byte  array of bytes

<xw:wrapper xw:name="IP-like-protocol"
            xw:sourcetype="binary"
            xmlns:xw="http://www.cs.uku.fi/XW/2001">
  <datagram>
    <xw:ignore xw:name="total-length" xw:type="short"/>
    <checksum xw:type="short"/>
    <id xw:type="short"/>
    <segment-offset xw:type="short"/>
    ...
  </datagram>
</xw:wrapper>

<xw:wrapper xw:name="IP-like-protocol"
            xw:sourcetype="binary"
            xmlns:xw="http://www.cs.uku.fi/XW/2001">
  <datagram>
    <xw:ignore xw:name="total-length" xw:type="short"/>
    <checksum xw:type="short"/>
    <id xw:type="short"/>
    <segment-offset xw:type="short"/>
    <xw:ELEMENT xw:name="source-address">
      <a xw:type="byte"/>
      <b xw:type="byte"/>
      <c xw:type="byte"/>
      <d xw:type="byte"/>
    </xw:ELEMENT>
    <xw:ELEMENT xw:name="destination-address">
      <a xw:type="byte"/>
      <b xw:type="byte"/>
      <c xw:type="byte"/>
      <d xw:type="byte"/>
    </xw:ELEMENT>
    <xw:ELEMENT xw:name="payload">
      <xw:collapse xw:occurs="total-length - 16"
                   xw:type="byte"
                   xw:numeric-output-format="hexadecimal"/>
    </xw:ELEMENT>
  </datagram>
</xw:wrapper>

Resulting XML

<datagram>
  <checksum>397485</checksum>
  <id>37</id>
  <segment-offset>0</segment-offset>
  <source-address>
    <a>193</a><b>167</b><c>232</c><d>253</d>
  </source-address>
  <destination-address>
    <a>193</a><b>167</b><c>224</c><d>8</d>
  </destination-address>
  <payload>e6a9ff120a</payload>
</datagram>
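The same field layout can be read with Python's `struct` module, which gives a feel for what the binary wrapper declares. The sketch assumes network (big-endian) byte order, unsigned values, and that `total-length` counts the whole datagram including the 16-byte header, as the expression `total-length - 16` suggests; none of this is confirmed by the slides. `parse_datagram` is a hypothetical name, and the checksum in the sample packet is an arbitrary value chosen to fit in 16 bits.

```python
import struct

# Four 16-bit shorts, then 4 + 4 single bytes: 16 bytes in total.
# "!" selects network byte order without padding (an assumption).
HEADER = struct.Struct("!HHHH4B4B")

def parse_datagram(data):
    """Decode one datagram into a dict of named fields."""
    total_length, checksum, ident, offset, *addr = HEADER.unpack_from(data)
    payload = data[HEADER.size:total_length]  # total-length - 16 bytes
    return {
        # total-length is read but not reported, like <xw:ignore>
        "checksum": checksum,
        "id": ident,
        "segment-offset": offset,
        "source-address": tuple(addr[:4]),
        "destination-address": tuple(addr[4:]),
        "payload": payload.hex(),  # hexadecimal output format
    }

# Build a sample packet; the checksum 0x1234 is made up for the example.
body = bytes.fromhex("e6a9ff120a")
packet = HEADER.pack(HEADER.size + len(body), 0x1234, 37, 0,
                     193, 167, 232, 253,
                     193, 167, 224, 8) + body
datagram = parse_datagram(packet)
```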

Implementation & Future

• Java program
  – reads wrapper specification into DOM tree
  – produces output as SAX events: characters, startElement, endElement
• further development of XW
  – attribute generation
  – content generation from input
  – enhancements to alternative/optional parts
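Producing output as SAX events means the engine calls `startElement`, `characters` and `endElement` on a handler instead of building the result document in memory. The implementation described here is in Java; the sketch below uses Python's `xml.sax` module, which offers the same three callbacks, purely as an illustration. The `emit` helper and the nested-tuple representation of the result are assumptions of this sketch.

```python
import io
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl

def emit(handler, name, content):
    """Emit one element as SAX events; content is text or a list of children."""
    handler.startElement(name, AttributesImpl({}))
    if isinstance(content, str):
        handler.characters(content)
    else:
        for child_name, child_content in content:
            emit(handler, child_name, child_content)
    handler.endElement(name)

# XMLGenerator is a ready-made handler that serialises the events it receives.
out = io.StringIO()
generator = XMLGenerator(out, encoding="utf-8")
generator.startDocument()
emit(generator, "invoicedata", [
    ("reference", "595324"),
    ("duedate", "17.8.1996"),
    ("totalsum", "907.00"),
])
generator.endDocument()
xml_text = out.getvalue()
```

Any other handler with the same callbacks, for example one feeding an XSLT processor, could be passed to `emit` in place of the serialiser.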

The end of the presentation

• Questions?