Some of my XML/Internet Research Projects
CSCI 6530
October 5, 2005
Kwok-Bun Yue
University of Houston-Clear Lake
10/5/2005 Bun Yue: [email protected], http://dcm.uhcl.edu/yue slide 2
Content
• Areas of My Research Interest
• Some Current Projects
• Storage of XML in Relational Database
• Example Internet Computing Projects
• Conclusions
Areas of My Research Interest
• Internet Computing
• XML
• Databases
• Concurrent Programming
Some Current Projects
• Storage of XML in relational database
• Measuring Web bias using authorities and hubs
• Measuring information quality of Web pages
• Distributed computer security laboratory
• Collaborative Open Community for developing educational resources
• Generalized exchanges within organizations
Some Recent Student Work
• McDowell, A., Schmidt, C. & Yue, K., Analysis and Metrics of XML Schema, Proceedings of the 2004 International Conference on Software Engineering Research and Practice, pp 538-544, Las Vegas, June 2004.
• Yang A., Yue K., Liaw K., Collins G., Venkatraman J., Achar S., Sadasivam K., and Chen P., Distributed Computer Security Lab and Projects, Journal of Computing Sciences in Colleges. Volume 20, Issue 1. October 2004.
• Yue, K., Alakappan, S. and Cheung, W., A Framework of Inlining Algorithms for Mapping DTDs to Relational Schemas, Technical Report COMP-05-005, Computer Science Department, the Hong Kong Baptist University, 2005, http://www.comp.hkbu.edu.hk/en/research/?content=tech-reports.
Storing XML in RDB
• Advantages:
  – Mature database technologies.
  – May be queried by
    • XML technology: e.g. XPath, XQuery.
    • RDB technology: e.g. SQL.
• Disadvantages:
  – Impedance mismatch: XML and relations are different data models.
Related Issues
• Effective mapping of XML DTDs (~ ordered tree model) to relational schemas.
• Mapping of XML queries (e.g. XQuery) to RDB queries (e.g. SQL).
• Mapping of RDB query results back to XML format.
Related Work and Context
• Mapping
  – With or without schemas for XML.
  – With or without user input.
• Schemas for XML:
  – Document Type Definition (DTD)
  – XML Schema
• We consider mapping with DTD and without user input.
Naïve Mapping
• An XML element is mapped to a relation.
Example 1a:
XML: <a><b><c><d>hello</d></c></b></a>
-> Relations: a, b, c and d.
Problems of Naïve Mapping
• Many relations.
• Inefficient queries: multiple joins are needed.
Example 1b:
XPath Query: //a
SQL Query: need to join the relations a, b, c and d.
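A minimal Python sketch of Example 1b, using the standard sqlite3 module with an in-memory database. The slides name only the relations a, b, c and d; the column names (id, parent_id, text) and the function name are assumptions made for illustration.

```python
import sqlite3

def naive_mapping_query():
    """Store <a><b><c><d>hello</d></c></b></a> with one relation per element."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: each child row points at its parent row.
    conn.executescript("""
        CREATE TABLE a (id INTEGER PRIMARY KEY);
        CREATE TABLE b (id INTEGER PRIMARY KEY, parent_id INTEGER REFERENCES a(id));
        CREATE TABLE c (id INTEGER PRIMARY KEY, parent_id INTEGER REFERENCES b(id));
        CREATE TABLE d (id INTEGER PRIMARY KEY, parent_id INTEGER REFERENCES c(id),
                        text TEXT);
        INSERT INTO a VALUES (1);
        INSERT INTO b VALUES (1, 1);
        INSERT INTO c VALUES (1, 1);
        INSERT INTO d VALUES (1, 1, 'hello');
    """)
    # Rebuilding the content of <a> needs three joins across four relations.
    rows = conn.execute("""
        SELECT d.text
        FROM a
        JOIN b ON b.parent_id = a.id
        JOIN c ON c.parent_id = b.id
        JOIN d ON d.parent_id = c.id
    """).fetchall()
    conn.close()
    return rows

print(naive_mapping_query())
```

Each extra level of nesting in the XML adds one more relation and one more join to the query.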
Inlining Algorithms
• First proposed by Shanmugasundaram et al.
• Expanded by Lu, Lee, Chu and others.
• Extended in various directions by various researchers, e.g.,
  – Preserving XML element orders.
  – Preserving XML constraints.
• Extensions are not considered here.
Basic Idea of Inlining Algorithms
• Inline child element into the relation for the parent element when appropriate.
• Different inlining algorithms differ in inlining criteria.
Example 1c: XML: <a><b><c><d>hello</d></c></b></a>
Inlined Relation: a.
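A minimal Python sketch of Example 1c, using the standard sqlite3 module. It assumes that b, c and d are all inlined into the relation for a; the column name b_c_d and the function name are hypothetical, since the slides do not say how inlined columns are named.

```python
import sqlite3

def inlined_mapping_query():
    """Store <a><b><c><d>hello</d></c></b></a> in a single inlined relation."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: the text of a/b/c/d becomes one column of relation a.
    conn.executescript("""
        CREATE TABLE a (id INTEGER PRIMARY KEY, b_c_d TEXT);
        INSERT INTO a VALUES (1, 'hello');
    """)
    # The whole element is read back from one relation, with no joins.
    rows = conn.execute("SELECT b_c_d FROM a").fetchall()
    conn.close()
    return rows

print(inlined_mapping_query())
```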
Inlining Algorithms
• Child elements & attributes may be inlined.
• Child elements may not have their own relations.
• Results in fewer relations.
• In general, more inlining -> fewer joins.
Inlining Algorithm Structure
1. Simplification of DTD.
2. Generation of DTD graphs.
3. Generation of Relational Schemas.
Our Preliminary Results
1. A more complete and optimal DTD Simplification Algorithm
2. A generic DTD Graph that can be used by inlining algorithms.
3. Inlining Considerations: a framework for analyzing inlining algorithms
4. A new and aggressive inlining algorithm
Examples of Our Work
• Use DTD Simplification as an example of the flavor of our work.
• Show the new Inlining Algorithm.
Brief Introduction to DTD
• DTD: a simple language to describe XML vocabulary:
  – Element declarations: contents of elements.
  – Attribute declarations: types and properties of attributes.
• DTD is still very popular.
DTD Element Declarations
• Define element contents:
  – #PCDATA: string
  – ANY: anything goes
  – EMPTY: no content (attributes only)
  – Content models: child elements.
  – Mixed contents: child elements and strings.
DTD Example
Example 2: A complete DTD
<!ELEMENT addressBook (person+)>
<!ELEMENT person (name,email*)>
<!ELEMENT name (last,first)>
<!ELEMENT first (#PCDATA)>
<!ELEMENT last (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ATTLIST person id ID #REQUIRED>
Operators for Element Declaration
• ,: sequence
• +: 1 or more
• *: 0 or more
• ?: optional; 0 or 1
• |: choice
• (): parenthesis
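These operators behave like regular-expression operators over child element names. The Python sketch below illustrates that reading by translating a content model into a regular expression and testing a sequence of child names against it. It is only an illustration, not a DTD validator: the function names are hypothetical, and #PCDATA, ANY, EMPTY and mixed content are not handled.

```python
import re

NAME = r"[A-Za-z_][\w.\-]*"  # simplified pattern for XML element names

def model_to_regex(model):
    """Translate a DTD content model such as (name,email*) into a regex."""
    model = model.replace(" ", "")
    # Each element name becomes a group that consumes "name " (note the space).
    pattern = re.sub(NAME, lambda m: "(?:" + re.escape(m.group(0)) + " )", model)
    # Sequence is plain concatenation in a regex, so commas are dropped;
    # the operators + * ? | ( ) mean the same thing in both notations.
    return re.compile(pattern.replace(",", ""))

def conforms(model, children):
    """Return True if the list of child element names matches the model."""
    text = "".join(child + " " for child in children)
    return model_to_regex(model).fullmatch(text) is not None

print(conforms("(name,email*)", ["name", "email", "email"]))
print(conforms("(name,email*)", ["email"]))
```

The trailing space after each name keeps one element name from matching as a prefix of another.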
Simplification of DTD
• Mapping of DTD to Relational Schemas:
  – Input: DTDs
  – Output: Relational Schemas
• DTD can be complicated => simplification.
Example 3:
<!ELEMENT a (b,((b+,c)|(d,b*,c?)),(e*,f)?)>
Simplification Principles
• The relational schema needs to store all possible scenarios.
• Some relations/columns may not be populated in some instances.
Example 3:
<!ELEMENT a (b|c)> and
<!ELEMENT a (b,c)>:
May be the same from the RDB’s point of view.
Simplification Details
• Reduce each declaration to comma-separated clauses in which the only remaining operators are (), "," and *.
  – + -> *, e.g. a+ -> a*.
  – Removal of | and ?, e.g. (a|b?) -> (a,b)
  – Removal of nested (), e.g. (a, (b)) -> (a,b)
  – Removal of repetition, e.g. (a, b, a) -> (a*, b)
• Note that element orders are not preserved.
Previous Simplification Results
• Not complete, e.g.
  – Shanmugasundaram: does not specify how to handle |.
  – Lu: does not specify how to remove ().
• Not optimal (may generate * when it is not needed).
Example 4a: For Lu and Lee, 2 steps:
(b|(b,c)) -> (b,b,c) -> (b*,c)
Our Simplification Algorithm
• A set of definitions.
• A set of 7 simplification rules.
• An algorithm on how and when to use them.
Example 4b: For us, 1 step:
(b|(b,c)) -> (b,c)
Complexity
Time complexity = $O(N_{op})$
where $N_{op}$ is the total number of operators (including parentheses) in the element declarations of the DTD.
Advantages
• Complete: handle all DTDs.
• Optimal: in the sense that * will not be generated if not needed.
Example 5:
<!ELEMENT a (b,((b+,c)|(d,b*,c?)),(e*,f)?)>
=> <!ELEMENT a (b*,c,d,e*,f)>
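One way to see why Example 5 needs * only on b and e is to count, for each child element, the maximum number of times it can occur. The Python sketch below does this with a small recursive-descent parser. It is not the seven-rule algorithm from the technical report, whose rules are not listed in these slides; it is an illustrative substitute that assumes a sequence adds counts, a choice takes the larger branch, and + or * marks everything inside as repeatable. All names are hypothetical, and only element content models (no #PCDATA, ANY or EMPTY) are handled.

```python
import re

MANY = 2  # two or more occurrences are all treated as "repeatable"

def tokenize(model):
    """Split a content model into element names and operator tokens."""
    return re.findall(r"[A-Za-z_][\w.\-]*|[(),|+*?]", model)

def max_occurrences(tokens):
    """Map each element name to its maximum occurrence count, capped at MANY."""
    pos = 0

    def particle():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            counts = group()
            pos += 1  # consume the closing ")"
        else:
            counts = {tok: 1}
        if pos < len(tokens) and tokens[pos] in ("+", "*", "?"):
            if tokens[pos] != "?":  # + and * allow repetition; ? does not
                counts = {name: MANY for name in counts}
            pos += 1
        return counts

    def group():
        nonlocal pos
        counts = particle()
        while pos < len(tokens) and tokens[pos] in (",", "|"):
            sep = tokens[pos]
            pos += 1
            for name, n in particle().items():
                seen = counts.get(name, 0)
                # sequence: occurrences add up; choice: take the larger branch
                counts[name] = min(MANY, seen + n) if sep == "," else max(seen, n)
        return counts

    return group()

def simplify(model):
    """Return a flat, comma-separated model using only the * operator."""
    counts = max_occurrences(tokenize(model))
    parts = [name + ("*" if n >= MANY else "") for name, n in counts.items()]
    return "(" + ",".join(parts) + ")"

print(simplify("(b,((b+,c)|(d,b*,c?)),(e*,f)?)"))
print(simplify("(b|(b,c))"))
```

Under these assumptions the sketch reproduces the simplified forms shown in Examples 4b and 5. As in the slides, element order is not preserved as a constraint; names are simply listed in order of first appearance.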
A New Inlining Algorithm (1)
• Aggressive in inlining.
• More complete.
• Elaborated algorithms.
• Handle more details: e.g. element types of ANY, EMPTY and mixed contents.
Main Results
• Yue, K., Alakappan, S. and Cheung, W., A Framework of Inlining Algorithms for Mapping DTDs to Relational Schemas, Technical Report COMP-05-005, Computer Science Department, the Hong Kong Baptist University, 2005, http://www.comp.hkbu.edu.hk/en/research/?content=tech-reports.
Future Work
• Implemented the algorithms and tested with many DTDs.
• Need to implement the XQuery/SQL bridge for performance study.
Measuring Web Bias
• Search engines dominate how information is accessed.
• Search results have major social, political and commercial consequences.
• Are search engines biased?
• How biased are they?
Previous Work
• To measure bias, results should be compared to a norm.
• The norm may be from human experts.
• Mowshowitz and Kawaguchi: the average search result of a collection of popular search engines as the norm.
Mowshowitz and Kawaguchi
[Figure: flow diagram. Labels: SE1 … SEn; URLS1 … URLSn; union; NORMURLS; URLVector1 … URLVectorn; NORMURLVector; Bias1 … Biasn.]
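The flow in this diagram can be sketched in Python as follows. The slides do not give the distance formula, so this sketch assumes bias is one minus the cosine similarity between a search engine's URL vector and the norm vector, and that the norm is built by pooling the results of all engines. The function names and the sample URLs are hypothetical.

```python
import math
from collections import Counter

def norm_vector(result_lists):
    """Pool the URL lists of all engines into one norm vector (URL -> count)."""
    norm = Counter()
    for urls in result_lists:
        norm.update(urls)
    return norm

def bias(engine_urls, norm):
    """Assumed bias measure: 1 - cosine similarity to the norm vector."""
    vec = Counter(engine_urls)
    dot = sum(count * norm[url] for url, count in vec.items())
    length = (math.sqrt(sum(c * c for c in vec.values()))
              * math.sqrt(sum(c * c for c in norm.values())))
    return 1.0 - dot / length if length else 1.0

# Hypothetical result sets from three search engines for the same query.
results = {
    "SE1": ["u1", "u2"],
    "SE2": ["u1", "u3"],
    "SE3": ["u4"],
}
norm = norm_vector(results.values())
for engine, urls in results.items():
    print(engine, round(bias(urls, norm), 3))
```

Summing counts instead of averaging them does not change the result, because cosine similarity ignores vector length.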
Limitations
• Based on URL Vector -> cannot measure bias quality.
Our Approach
• Use Kleinberg’s HITS algorithm to create clusters, authorities and hubs of the result norm URLs.
• Use them as norm clusters, authorities and hubs.
• Measure distances between norms and individual results as bias.
HITS
• Obtain a directed graph G where
  – Node: page
  – Edge: URL link between pages.
• Two indices: $x_{p,i}$ (authority) & $y_{p,i}$ (hub)
• Iterate until steady state:
  – $x_{p,i+1} \leftarrow \sum_{q:\, q \to p} y_{q,i}$
  – $y_{p,i+1} \leftarrow \sum_{q:\, p \to q} x_{q,i}$
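The two update rules can be sketched in Python as follows. The slide does not show a normalization step, so this sketch rescales both score vectors to unit length after every round so the values stay bounded, and it runs a fixed number of rounds instead of testing for a steady state. The function names and the tiny example graph are hypothetical.

```python
import math

def normalize(scores):
    """Scale a score dictionary to unit Euclidean length."""
    length = math.sqrt(sum(v * v for v in scores.values()))
    return {k: (v / length if length else 0.0) for k, v in scores.items()}

def hits(edges, iterations=50):
    """Return (authority, hub) scores for a list of (source, target) links."""
    nodes = sorted({node for edge in edges for node in edge})
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of p: sum of hub scores of the pages q that link to p.
        new_auth = {p: sum(hub[q] for q, t in edges if t == p) for p in nodes}
        # Hub of p: sum of authority scores of the pages q that p links to.
        new_hub = {p: sum(auth[q] for s, q in edges if s == p) for p in nodes}
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub

# Pages h1 and h2 both link to a; only h1 also links to b.
edges = [("h1", "a"), ("h2", "a"), ("h1", "b")]
auth, hub = hits(edges)
print(sorted(auth, key=auth.get, reverse=True))
print(sorted(hub, key=hub.get, reverse=True))
```

Both updates read the scores from the previous round, matching the indices on the slide.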
Our Approach
[Figure: flow diagram. Labels: SE1 … SEn; URLS1 … URLSn; union; NORMURLS; URLVector1 … URLVectorn; NORMCluster; ClusterVector1 … ClusterVectorn; NORMClusterVector; Bias1 … Biasn.]
Current Progress
• Implemented previous results.
• Implemented vector analysis.
• Implemented HITS algorithm, but it is not accurate enough:
  – ‘Conglomerate’ effect.
Measuring Page’s Information Quality
• People find information from Web pages.
• How good is the content of a given page?
Previous Work
• Measuring different kinds of quality:
  – Web site design quality
  – Navigational quality
• Many frameworks on how to measure information quality:
  – Most result in surveys so users can rank informational quality.
  – Very few automated or semi-automated tools.
Our Objectives
• Build automated and/or semi-automated tools to measure, and/or assist users in measuring, the information quality of a Web page.
Approach
• Hypothesis, measure, usage guidelines.
• Example:
  – Hypothesis: a Web page with many spelling mistakes is likely to have low information quality.
  – Measures:
    • Show frequencies of word occurrences.
    • Show percentage of spelling ‘mistakes’.
  – Usage guideline:
    • Spelling ‘mistakes’ may not be actual mistakes (e.g. UHCL).
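The two measures in this example can be sketched in Python as follows. This is not the code of the prototype; the function name, the tiny word list and the `allowed` parameter (for terms such as UHCL that a dictionary would flag wrongly) are all assumptions made for illustration.

```python
import re
from collections import Counter

def spelling_report(text, dictionary, allowed=()):
    """Return (word frequencies, unknown words, percentage of unknown words)."""
    words = re.findall(r"[A-Za-z']+", text)
    freq = Counter(word.lower() for word in words)
    # Words on the allow-list are never counted as spelling 'mistakes'.
    known = {w.lower() for w in dictionary} | {w.lower() for w in allowed}
    unknown = sorted(word for word in freq if word not in known)
    misses = sum(freq[word] for word in unknown)
    percent = 100.0 * misses / len(words) if words else 0.0
    return freq, unknown, percent

# Hypothetical page text and word list.
page = "UHCL offers good coursse"
dictionary = {"offers", "good", "courses"}

print(spelling_report(page, dictionary)[1:])
print(spelling_report(page, dictionary, allowed=("UHCL",))[1:])
```

Adding UHCL to the allow-list lowers the reported percentage, which is the kind of adjustment the usage guideline calls for.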
Metrics
• Many potential metrics. Some examples:
  – Broken links
  – HTML Quality
  – Domain names
  – Page ranking and popularity
  – Appearance in directory structure
  – History (e.g. Wayback Machine)
  – Currency (e.g. last modified)
  – Author (e.g. Meta tag)
Current Progress
• ‘Pre-alpha’ prototype: http://dcm.cl.uh.edu/yue/util/pageInfo.pl
• A capstone project
Conclusions
• Good time to do applied computing research in the Web and XML areas.
• Style: hands-on supervision + publications.
• Don't forget to donate a scholarship to the School if your future research leads to a windfall.
Questions?
• Any Questions?
• Thanks!