SCHEMALESS APPROACH OF MAPPING XML DOCUMENTS INTO RELATIONAL DATABASE
Ibrahim Dweib, Ayman Awadi, Seif Elduola Fath Elrhman, Joan Lu
CIT 2008, Sydney, Australia, 8-11 July 2008
Why schema-less
Many applications deal with highly flexible XML documents from different sources, which makes it difficult to define their structure by a fixed schema or a DTD. Schema-less approaches are therefore needed to deal with such XML documents.
The method aims to overcome the challenges faced due to fixed shredding
No loss of information while shredding.
Reconstruction of original XML documents is easier and much faster.
Maintaining XML document structure.
Preserving the ordering nature of XML data.
Theory guidance
The main mathematical concepts used in this method are:
Definition 1: An XML tree is composed of many sub-trees of different levels; it can be defined as follows, where i = 1, 2, …, n represents the levels of the XML tree and 0 represents the root:
E_i is a finite set of elements in level i.
A_i is a finite set of attributes in level i.
X_i is a finite set of texts in level i.
r_{i-1} is the root of the sub-tree of level i.
Theory guidance (Con’t)
Definition 2: A dynamic fragment (shred) df(i) is defined to be the attributes and texts (leaf children) of the sub-tree i of the XML tree plus its root r_{i-1}, as follows:
df(i) = (A_i, X_i, r_{i-1}),
Where:
A_i is a finite set of attributes in level i.
X_i is a finite set of texts in level i.
r_{i-1} is the root of the sub-tree of level i.
Design framework
A master table, called the "documents" table, keeps information about the documents themselves:
documents(doc_id, doc_structure, …)
Additional fields may be added to hold further information about the document itself, such as dates, statistics, and types.
The doc_id is a unique id generated per document to identify it.
The doc_structure is a large text field containing a coded string that describes each document's structure. Any change to the document structure should be reflected in this field, such as adding a new tag or property, deleting an existing tag or property, or relocating a given tag or property to a different location in the same document.
Design framework (Con’t)
A second table stores the actual contents of all documents. Documents are shredded into pieces of data called tokens; each document element, tag, or property is considered a token. The tokens table has at minimum this structure:
tokens(doc_id, token_id, token_name, token_value)
The token_id is the generated primary id for each token.
The doc_id is the foreign key linking the tokens table to the documents table.
token_name is the tag name or the property name as found in the original XML document.
token_value is the text value of the XML tag or property.
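The two tables can be sketched as follows. This is a minimal illustration that swaps in SQLite (through Python's standard sqlite3 module) for the Microsoft Access target used in the experiments; the column types and constraints are assumptions, since the slides only name the fields.

```python
import sqlite3

# Column types and constraints are assumptions; the slides only list field names.
SCHEMA = """
CREATE TABLE documents (
    doc_id        INTEGER PRIMARY KEY,
    doc_structure TEXT NOT NULL
);
CREATE TABLE tokens (
    doc_id      INTEGER NOT NULL REFERENCES documents(doc_id),
    token_id    INTEGER PRIMARY KEY,
    token_name  TEXT NOT NULL,
    token_value TEXT
);
"""


def create_schema(conn):
    """Create the documents and tokens tables on an open connection."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)

# One document whose structure is a single element token, plus that token.
conn.execute("INSERT INTO documents VALUES (?, ?)", (10, "T99"))
conn.execute("INSERT INTO tokens VALUES (?, ?, ?, ?)", (10, 99, "books", None))
```

Making token_id the primary key follows the slide's description of it as the generated primary id; a composite key on (doc_id, token_id) would be an equally plausible reading.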
Design framework (Con’t): "doc_structure" field construction rules
The doc_structure field is where the document structure is maintained.
It consists of a long series of related keys.
Each key starts with a letter: 'T' for an element (child) and 'A' for an attribute.
These letters are necessary to delimit keys in the sequence.
The letter is followed by a number representing the token_id that this key refers to.
Example: T120 is a key referring to the token in the tokens table whose token_id = 120.
Design framework, "doc_structure" field construction rules: (Con’t)
If the token has properties, then the key representing this token in the doc_structure is followed by a set of keys defining these properties.
Example: T120A12A17A2 is a valid key string for token number 120, which has three properties defined by tokens number 12, 17, and 2. These properties appear in the original document in this order.
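As a small illustration of how the letters delimit keys, the sketch below splits a bracket-free key string into (letter, token_id) pairs. The function name split_keys is hypothetical and not part of the original method.

```python
import re

# A key is the letter T or A followed by the digits of a token_id.
KEY = re.compile(r"([TA])(\d+)")


def split_keys(key_string):
    """Split a flat key string such as 'T120A12A17A2' into (letter, token_id) pairs."""
    if not re.fullmatch(r"(?:[TA]\d+)+", key_string):
        raise ValueError(f"not a flat key string: {key_string!r}")
    return [(kind, int(number)) for kind, number in KEY.findall(key_string)]


pairs = split_keys("T120A12A17A2")
```

For the example key string, pairs holds the element key for token 120 followed by the attribute keys for tokens 12, 17, and 2, in document order.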
Design framework, "doc_structure" field construction rules: (Con’t)
If the token has child tags, then these children are represented as a key string surrounded by angle brackets.
Example: T120<T12T7<T2T1>T77> is a valid string that can be read as follows: token 120 has three sub-tags in this order: token 12, followed by token 7, then token 77; token 7 itself has two sub-tags, 2 and 1, in the given order.
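The construction rules amount to a small grammar: an element key, optionally followed by attribute keys, optionally followed by a bracketed list of child elements. The recursive-descent sketch below parses such a string into a tree; the Node class and parse_structure function are hypothetical names, and the grammar is inferred from the examples on the slides rather than stated by the authors.

```python
import re

LEXEME = re.compile(r"[TA]\d+|[<>]")


class Node:
    """One element key with its attribute token ids and child elements."""

    def __init__(self, token_id):
        self.token_id = token_id
        self.attributes = []
        self.children = []


def parse_structure(structure):
    """Parse a doc_structure string into a list of top-level Node objects."""
    lexemes = LEXEME.findall(structure)
    if "".join(lexemes) != structure:
        raise ValueError("unexpected characters in structure string")
    pos = 0

    def parse_nodes():
        nonlocal pos
        nodes = []
        while pos < len(lexemes) and lexemes[pos].startswith("T"):
            node = Node(int(lexemes[pos][1:]))
            pos += 1
            # Attribute keys directly follow the element key.
            while pos < len(lexemes) and lexemes[pos].startswith("A"):
                node.attributes.append(int(lexemes[pos][1:]))
                pos += 1
            # Children, if any, are wrapped in angle brackets.
            if pos < len(lexemes) and lexemes[pos] == "<":
                pos += 1
                node.children = parse_nodes()
                if pos >= len(lexemes) or lexemes[pos] != ">":
                    raise ValueError("unbalanced angle brackets")
                pos += 1
            nodes.append(node)
        return nodes

    nodes = parse_nodes()
    if pos != len(lexemes):
        raise ValueError("malformed structure string")
    return nodes


root = parse_structure("T120<T12T7<T2T1>T77>")[0]
```

Here root is token 120 with children 12, 7, and 77, and the node for token 7 has children 2 and 1, matching the reading given on the slide.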
Theory implementation on simple case study
<books>
  <book id="11210" category="fiction">
    <author id="a1" sex="m">M. John</author>
    <name>Computer Science 101</name>
  </book>
  <book id="11211">
    <author>A. Mark</author>
    <name>Applied Math 101</name>
    <subject>Math</subject>
  </book>
</books>
Figure 1: XML document
Figure 2: A tree representation for XML document in figure 1
doc_id  doc_structure
10      T99<T100A101A102<T103A104A105T106>T107A108<T109T110T111>>
Figure 5: Documents table
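The shredding step for the case study can be sketched as below. This is an illustrative reading of the method, not the authors' Visual Basic implementation: it uses Python's xml.etree.ElementTree, numbers tokens in document order (element first, then its attributes, then its children), and ignores mixed content. The function name shred and its parameters are hypothetical.

```python
import xml.etree.ElementTree as ET

FIGURE_1 = """
<books>
  <book id="11210" category="fiction">
    <author id="a1" sex="m">M. John</author>
    <name>Computer Science 101</name>
  </book>
  <book id="11211">
    <author>A. Mark</author>
    <name>Applied Math 101</name>
    <subject>Math</subject>
  </book>
</books>
"""


def shred(xml_text, doc_id, first_token_id):
    """Return (doc_structure, token rows) for one XML document."""
    tokens = []
    next_id = first_token_id

    def add_token(name, value):
        nonlocal next_id
        token_id = next_id
        next_id += 1
        tokens.append((doc_id, token_id, name, value))
        return token_id

    def walk(elem):
        # Whitespace-only text becomes None (shown as Null in the tokens table).
        text = (elem.text or "").strip()
        key = f"T{add_token(elem.tag, text or None)}"
        for name, value in elem.attrib.items():
            key += f"A{add_token(name, value)}"
        children = "".join(walk(child) for child in elem)
        if children:
            key += f"<{children}>"
        return key

    return walk(ET.fromstring(xml_text)), tokens


structure, tokens = shred(FIGURE_1, doc_id=10, first_token_id=99)
```

With doc_id 10 and token ids starting at 99, structure is the doc_structure string shown in Figure 5, and tokens holds one row per element and attribute.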
doc_id  token_id  token_name  token_value
10      99        books       Null
10      100       book        Null
10      101       id          11210
10      102       category    fiction
10      103       author      M. John
10      104       id          a1
10      105       sex         m
10      106       name        Computer Science 101
10      107       book        Null
10      108       id          11211
10      109       author      A. Mark
10      110       name        Applied Math 101
10      111       subject     Math
Figure 6: Tokens table
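Reconstruction can be sketched as a single pass over the doc_structure string that looks up each key in the tokens table. The code below is an assumption about how such a pass might look, not the authors' implementation; reconstruct is a hypothetical name, the token rows are copied from Figure 6, and the structure string is assumed to be well formed.

```python
import re
from xml.sax.saxutils import escape, quoteattr

STRUCTURE = "T99<T100A101A102<T103A104A105T106>T107A108<T109T110T111>>"

# token_id -> (token_name, token_value), copied from Figure 6.
TOKENS = {
    99: ("books", None), 100: ("book", None), 101: ("id", "11210"),
    102: ("category", "fiction"), 103: ("author", "M. John"),
    104: ("id", "a1"), 105: ("sex", "m"),
    106: ("name", "Computer Science 101"), 107: ("book", None),
    108: ("id", "11211"), 109: ("author", "A. Mark"),
    110: ("name", "Applied Math 101"), 111: ("subject", "Math"),
}


def reconstruct(structure, tokens):
    """Rebuild the XML text for one document from its structure string."""
    lexemes = re.findall(r"[TA]\d+|[<>]", structure)
    pos = 0

    def element():
        nonlocal pos
        name, value = tokens[int(lexemes[pos][1:])]
        pos += 1
        attrs = ""
        while pos < len(lexemes) and lexemes[pos][0] == "A":
            attr_name, attr_value = tokens[int(lexemes[pos][1:])]
            attrs += f" {attr_name}={quoteattr(attr_value)}"
            pos += 1
        body = escape(value or "")
        if pos < len(lexemes) and lexemes[pos] == "<":
            pos += 1  # skip '<'
            while lexemes[pos] != ">":
                body += element()
            pos += 1  # skip '>'
        return f"<{name}{attrs}>{body}</{name}>"

    return element()


xml_text = reconstruct(STRUCTURE, TOKENS)
```

The result is the document of Figure 1 without its indentation, since whitespace between tags is not stored as a token in this sketch.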
EXPERIMENTAL ENVIRONMENT
An Intel Core 2 Duo computer with a 2 GHz CPU, 1 GB RAM, and 256 MB shared cache.
OS: Windows Vista Home edition.
Visual Basic 6 is used as the software development kit, with Microsoft Access 2003 as the relational database target.
Five XML documents of different sizes are used in the experiment.
The data is taken from the XML data repository available at the web site of the School of Computer Science and Engineering, University of Washington.
The performance metric is the time spent mapping XML documents to the relational database and the time spent reconstructing these documents from the relational database.
The experiment is repeated five times and the mean value of those times is reported to obtain realistic and accurate results.
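The measurement protocol (five repetitions, mean reported) can be sketched as a small timing helper. This is an illustration in Python rather than the Visual Basic 6 setup described above; mean_runtime and its parameters are hypothetical names.

```python
import statistics
import time


def mean_runtime(task, repeats=5):
    """Run task() `repeats` times and return the mean wall-clock time in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        task()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)


# Stand-in workload; in the experiment this would be a mapping or reconstruction run.
elapsed = mean_runtime(lambda: sum(range(10_000)))
```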
EXPERIMENTAL RESULTS
Document size               4 KB         28 KB       64 KB      602 KB     1 MB
Mapping time (secs)         0.01988238   0.14977736  0.3551445  3.574335   5.85278136
Reconstructing time (secs)  0.018990234  0.44980958  1.926836   18.305544  32.06255104
Table 1: The time spent for mapping XML documents to RDBMS, and the time for reconstructing them
Chart: The time spent for mapping XML documents to RDBMS and the time spent for reconstructing them (x-axis: document size, 4 KB to 1 MB; y-axis: time spent in secs, 0 to 35; series: mapping time, reconstructing time)
Conclusion (1)
By using this method:
Document structure is maintained easily and at low cost,
Building the original document is straightforward,
First-level semantic search is also achievable, either on a single document or on all documents.
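As an illustration of a first-level search over the tokens table, the sketch below matches a tag or property name against a value, either in one document or across all of them. It assumes that this is what "first level semantic search" means here, uses SQLite instead of Microsoft Access, and introduces find_tokens as a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tokens ("
    "doc_id INTEGER, token_id INTEGER PRIMARY KEY, "
    "token_name TEXT, token_value TEXT)"
)
# A few rows taken from Figure 6.
conn.executemany(
    "INSERT INTO tokens VALUES (?, ?, ?, ?)",
    [
        (10, 103, "author", "M. John"),
        (10, 106, "name", "Computer Science 101"),
        (10, 109, "author", "A. Mark"),
        (10, 111, "subject", "Math"),
    ],
)


def find_tokens(conn, name, value, doc_id=None):
    """Return (doc_id, token_id) pairs whose name and value match exactly."""
    sql = "SELECT doc_id, token_id FROM tokens WHERE token_name = ? AND token_value = ?"
    params = [name, value]
    if doc_id is not None:
        # Restrict the search to a single document.
        sql += " AND doc_id = ?"
        params.append(doc_id)
    return conn.execute(sql + " ORDER BY doc_id, token_id", params).fetchall()


hits = find_tokens(conn, "author", "A. Mark")
```

A query that depends on where a token sits in the tree would additionally have to decode doc_structure, which is the kind of complex search the method does not make easy.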
Conclusion (2)
Method limitations:
Complex semantic search is not easily achievable in this structure.
Document size is limited by memory size, since DOM-based parsing is used.
Future Works
Improving this method to support complex semantic search and to differentiate between XML data types (e.g., strings, dates, integers), in order to apply less-than or greater-than queries.
Testing the method intensively and comparing it with other methods in the literature to assess its performance.
Using SAX parsing for XML documents to remove the document size limitation.
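A SAX-based shredder might look like the sketch below, which builds the same tokens and doc_structure string from parser events without first loading a DOM tree. This is only an exploration of the proposed direction using Python's xml.sax; the ShredHandler class is hypothetical, attribute order relies on Python dictionaries preserving insertion order, and the token rows are still collected in a list here, whereas a real streaming version would write them to the database as elements close.

```python
import xml.sax

FIGURE_1 = """
<books>
  <book id="11210" category="fiction">
    <author id="a1" sex="m">M. John</author>
    <name>Computer Science 101</name>
  </book>
  <book id="11211">
    <author>A. Mark</author>
    <name>Applied Math 101</name>
    <subject>Math</subject>
  </book>
</books>
"""


class ShredHandler(xml.sax.ContentHandler):
    """Collect token rows and doc_structure pieces from SAX events."""

    def __init__(self, doc_id, first_token_id):
        super().__init__()
        self.doc_id = doc_id
        self.next_id = first_token_id
        self.tokens = []  # rows: [doc_id, token_id, token_name, token_value]
        self.parts = []   # pieces of the doc_structure string
        self.stack = []   # per open element: [token row, text chunks, has_children]

    def _add(self, name, value):
        row = [self.doc_id, self.next_id, name, value]
        self.next_id += 1
        self.tokens.append(row)
        return row

    def startElement(self, name, attrs):
        if self.stack and not self.stack[-1][2]:
            self.parts.append("<")  # first child opens the parent's bracket
            self.stack[-1][2] = True
        row = self._add(name, None)
        self.parts.append(f"T{row[1]}")
        for attr_name, attr_value in attrs.items():
            self.parts.append(f"A{self._add(attr_name, attr_value)[1]}")
        self.stack.append([row, [], False])

    def characters(self, content):
        if self.stack:
            self.stack[-1][1].append(content)

    def endElement(self, name):
        row, chunks, has_children = self.stack.pop()
        row[3] = "".join(chunks).strip() or None  # element text is known only now
        if has_children:
            self.parts.append(">")


handler = ShredHandler(doc_id=10, first_token_id=99)
xml.sax.parseString(FIGURE_1.encode("utf-8"), handler)
structure = "".join(handler.parts)
```

The opening bracket is emitted lazily, when the first child of an element starts, because a streaming parser cannot know in advance whether an element has children.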
Thank You for Your Time