An Adaptive Path Index for XML Data using the Query...

An Adaptive Path Index for XML Data using

the Query Workload

Jun-Ki Min a,1 Chin-Wan Chung a,2 Kyuseok Shim b,3

a Division of Computer ScienceDepartment of Electrical Engineering & Computer Science

Korea Advanced Institute of Science and Technology373-1,Kusong-dong, Yusong-gu, Taejon, 305-701, Koreab School of Electrical Engineering and Computer Science

Seoul National UniversitySan 56-1, Shilim-dong, Kwanak-gu, Seoul, 151-742, Korea

Abstract

Due to its flexibility, XML is becoming the de facto standard for exchanging andquerying documents over the Web. Many XML query languages such as XQuery andXPath use label paths to traverse the irregularly structured XML data. Without astructural summary and efficient indexes, query processing can be quite inefficientdue to an exhaustive traversal on XML data. To overcome the inefficiency, severalpath indexes have been proposed in the research community. Traditional indexesgenerally record all label paths from the root element in XML data and are con-structed with the use of data only. Such path indexes may result in performancedegradation due to large sizes and exhaustive navigations for partial matching pathqueries which start with the self-or-descendent axis(“//”). To improve the queryperformance, we propose an adaptive path index for XML data (termed APEX).APEX does not keep all paths starting from the root and utilizes frequently usedpaths on query workloads. APEX also has a nice property that it can be updated in-crementally according to the changes of query workloads. Experimental results withsynthetic and real-life data sets clearly confirm that APEX improves the query pro-cessing cost typically 2 to 69 times compared with the traditional indexes, with theperformance gap increasing with the irregularity of XML data.

Key words: XML, Semistructured Data, Path Index, Query Processing

1 Email: [email protected] Corresponding author. Email: [email protected] Email: [email protected]

Preprint submitted to Information Systems 18 May 2004

1 Introduction

The emergence of the World Wide Web has dramatically increased the amountof data of all kinds available electronically. And, there has been an increas-ing interest in XML (eXtensible Markup Language) [5]. XML can describea wide range of data, from regular to irregular, from flat to deeply nested,and from tree shaped to graph shaped. Due to its flexibility, XML is rapidlyemerging as the de facto standard for exchanging and querying documentson the Web required for the next generation Web applications including elec-tronic commerce and intelligent Web searching. XML data is an instance ofsemistructured data [1]. XML documents comprise hierarchically nested collec-tions of elements, where each element can be either atomic (i.e., raw characterdata) or composite (i.e., a sequence of nested subelements). Tags stored withelements in an XML document describe the semantics of the data. Thus, XMLdata, like semistructured data, is hierarchically structured and self-describing.

Several XML query languages [2,4,7,8,11] have been also proposed recently.XML Query languages such as XPath [8] and XQuery [4] use path expressionsto traverse irregularly structured XML data. Thus, the navigation of irreg-ularly structured graph is one of essential components for processing XMLqueries. Since the objects may be scattered at different locations in the disk,processing XML queries may result in significant performance degradation.Furthermore, query processing with a label path for partial matching is veryinefficient due to the navigation of an entire XML data graph. However, struc-tural summaries or path indexes can speed up query evaluation on XML databy restricting the search to only relevant portions of the XML data. Thus, theextraction of the structural summary and index structures for the semistruc-tured data in order to improve the performance of query processing havereceived a lot of attention recently. Examples of such index structures includeDataGuides [14], (1/2/T)-indexes [21], the Index Fabric [9], A(k)-index [16],and extensions of inverted indexes [18,25]. The details on these index struc-tures are described in Section 2.

Our Contributions.DataGuides and 1-indexes are in the category of gen-eralized path indexes that represent all paths starting from the root in XMLdata. They are generally useful for processing queries with path expressionsstarting from the root. The A(k)-index is a generalized index of the 1-index.This is useful for path expressions whose lengths are less than or equal tok. However, these indexes are very inefficient for processing queries with par-tial matching due to the exhaustive navigation of the indexes. Furthermore,these path indexes are constructed with the use of data only. Therefore, theydo not take advantages of query workloads to process frequently used pathexpressions effectively.

2

In this paper, we propose APEX which is an Adaptive Path indEx for XMLdata. APEX does not keep all paths starting from the root and utilizes fre-quently used paths to improve the query performance. In contrast to the tra-ditional indexes such as DataGuides, 1-indexes and the Index Fabric, APEXis constructed by utilizing a data mining technique to summarize paths thatappear frequently in query workloads. APEX also guarantees to maintain allpaths of length two so that any label path query can be evaluated by joinsof extents in APEX without scanning original data. APEX has the followingnovel combination of characteristics to improve the performance of processingqueries.

• Efficient Processing of Partial Matching Queries: Since traditionalpath indexes keep all label paths from the root element, they are efficient tohandle queries with a simple path expression which is a sequence of labelsstarting from the root of the XML data. However, partial matching querieswith the self-or-descendent axis(“//”) should be rewritten to queries withsimple path expressions. The problem here is that, the exhaustive navigationfor the query rewriting degrades the performance. In contrast to traditionalpath indexes, APEX is designed to support these path expressions efficiently.APEX consists of two structures: a hash tree and a graph structure. Sincethe hash tree maps incoming label paths to nodes of graph structure, thenodes needed to compute the results of partial matching path queries areobtained easily.

• Workload-Aware Path Indexes: Traditional path indexes for semistruc-tured data are constructed with the use of data only. Therefore, it is verydifficult to tune the indexes toward efficient processing of frequently usedqueries. In APEX, frequently used path expressions in query workloads aretaken into account using the sequential pattern mining technique [3,13] sothat the cost of query processing can be improved significantly. Althoughany sequential pattern mining algorithm with the anti-monotonicity prop-erty can be used to extract frequently used paths, the problem space of thesequential pattern mining is subtly different from ours. Thus, we devise asimple algorithm that counts all subpaths in query workloads by one scan.

• Incremental Update: When we decide to rebuild APEX due to the changeof query workloads, we do not build APEX from the scratch. Instead, APEXis incrementally updated in order to reduce the overhead of construction.

We implemented our APEX and conducted an extensive experimental studywith both real-life and synthetic data sets. Experimental results show thatAPEX improves query processing cost typically 2 to 69 times better thanexisting indexes, with the performance gap increasing with the irregularity ofXML data.

The remainder of the paper is organized as follows. In Section 2, we discussrelated work. In Section 3, we present the data model and basic notations

3

for APEX. We present an overview of APEX in Section 4 and describe theconstruction algorithms for APEX in Section 5. Section 6 contains the resultsof our experiments, showing the effectiveness and comparing the performanceof APEX to existing path indexes. Finally, Section 7 summarizes our work.

2 Related Work

Many database researchers developed various path indexes to support labelpath expressions. Goldman and Widom [14] provided a path index, called thestrong DataGuide. The strong DataGuide is restricted to a simple label pathand is not useful in complex path queries with several regular expressions [21].The building algorithm of the strong DataGuide emulates the conversion algo-rithm from the non-deterministic finite automaton (NFA) to the deterministicfinite automaton (DFA) [15]. This conversion takes linear time for tree struc-tured data and exponential time in the worst case for graph structured data.Furthermore, on very irregularly strucutured data, the strong DataGuide maybe much larger than the original data.

Milo and Suciu [21] provided another family of indexes (1/2/T-index). Theirapproach is based on the backward simulation and the backward bisimulationwhich are originated from the graph verification area. The 1-Index coincideswith the strong DataGuide on tree structured data. The 1-Index can be con-sidered as a non-deterministic version of the strong DataGuide.

In object-oriented databases, access support relations [17] are used to supportfrequently used reference chains between two object instances. Therefore, itmaterializes access paths of arbitrary lengths and thus it can be used forindexing XML documents. Note that access support relations and the T-indexsupport only predefined subsets of paths.

The A(k)-index [16], a generalized index of the 1-index, keeps all the labelpaths whose lengths are smaller than or equal to k and approximates labelpaths longer than k. The performance of the A(k)-index largely depends onthe parameter k. In spite of its importance, the A(k)-index does not takeadvantage of the query workload to choose the parameter k, but assigns anarbitrary value to k. In addition, since the A(k)-index controls the shape ofthe index using the parameter k, the A(k)-index does not effectively reflectthe diverse usage patterns of path expressions.

Cooper et al. [9] presented the Index Fabric which is conceptually similar tothe strong DataGuide in that it keeps all label paths starting from the rootelement. The Index Fabric encodes each label path to each XML elementhaving a data value as a string and inserts the composition of the encoded

4

label path and the data value into an efficient index such as the Patricia trie.The index block and XML data are both stored in relational database systems.The evaluation of queries encodes the desired path traversal as a search keystring and performs a lookup. The Index Fabric does not keep all parent-child relationships of elements since it does not keep the information of XMLelements which do not have data values. Thus, the Index Fabric is not effectiveto process partial matching queries.

Many queries to XML data has the partial matching path expression becauseusers of XML data may not be concerned with the structure of data and inten-tionally make the partial matching path expression to get the intended result.Since the strong DataGuide, the 1-Index, and the Index Fabric record onlypaths starting from the root in the data graph, the query processor rewritespartial matching path queries into the queries with simple path expressionsby the exhaustive navigation of index structures [12,19]. This results in per-formance degradation. In contrast, APEX is constructed from label pathswhich are frequently used in query workloads. Thus, APEX is very effectivefor queries with partial matching expressions.

3 Preliminary

In this section, we describe our representation of XML data and define somebasic notations to explain our proposed index.

As shown in Figure 1, we represent the structure of XML data as the labeleddirected graph which is similar to the OEM model [24]. Especially, two par-ticular attributes, ID and IDREF, allow us to represent the structure of XMLdata as a graph.

Definition 1 The structure of XML data is represented by the directed la-beled edge graph GXML = (V, E, A, λ, root). V = Vc ∪ Va is the set of nodeswhere Vc is the universe of non-leaf nodes and Va is the universe of leaf nodes,E ⊆ Vc×V is the set of edges, A is the universe of labels, λ is a label functionmapping E to A where the edges pointing to a node have same labels, androot ∈ V is the root of GXML. Each node in GXML has a unique node identifier(nid).

Since XML elements are ordered and the results of XML queries must be indocument order, each node keeps the document-order information. And nodesreturned by the index are sorted using this information as a post-processingstep. As shown in Figure 1, the reference relationship (i.e., ID-IDREF) isrepresented as an edge from a node for an IDREF typed attribute to a nodefor an element which has the corresponding ID typed attribute. In addition,

5

<MovieDB><actor id = “a1”>

<name> actor1 </name></actor><actor id = “a2” ,movie = “m1”>

<name> actor2 </name></actor><director id = “d1”>

<name> director1</name><movie id = “m1” director = “d1”>

<title> movie2 </title></movie></director><director id = “d2”>

<name> director2</name></director><movie id = “m2” actor = “a1” , director = “d2”>

<title> movie1</title></movie></MovieDB>

(a) XML data

1

MovieDB

2 4 1214

actoractor director

movie

3

name

13

name

5

name

17

title

11

7

director

name

director

8

movie@movie

@actor

10

title

0

6 movie

15

@director9

@director

director

16actor

(b) Representation of XML data structure

Fig. 1. A sample XML data

the label of the edge from an element to IDREF typed child node starts with’@’ and the edge for the reference relationship has the normal tag, which isthe tag for the target node, as its label.

Definition 2 A label path of a node o in GXML, is a sequence of one or moredot-separated labels l1.l2 . . . ln, such that we can traverse a path of n edges(e1 . . . en) from o, where the edge ei has the label li.

In Figure 1, movie.title and name are both valid label paths of node 7. InXML data, queries are based on label paths such as //movie/title.

Definition 3 A data path of a node o in GXML is a dot-separated alternatingsequence of labels and nids of the form l1.o1.l2.o2 . . . ln.on, such that we cantraverse from o a path of n edges (e1 . . . en) through n nodes (x1 . . . xn), wherethe edge ei has the label li and the node xi has the nid oi.

6

In Figure 1, movie.8.title.10 and name.11 are data paths of node 7.

Definition 4 A data path d is an instance of a label path p if the sequencelabels made from d by eliminating nids is equal to p.

Again in Figure 1, movie.8.title.10 is an instance of movie.title and name.11is an instance of name.

Definition 5 A label path A = a1.a2 . . . an is contained in another label pathB = b1.b2 . . . bm if we have a1 = bi, a2 = bi+1, . . . , an = bi+n−1 where 1 ≤ i andi+n− 1 ≤ m. When A is contained in B, we also call that B contains A or Ais a subpath of B. Furthermore, when A is a subpath of B and m = i + n− 1,we call that A is a suffix of B.

For example, the label path movie is a subpath of movie.title. And, the labelpath title is a suffix of movie.title.

4 Overview of APEX

In this section, we propose an example and the formal definition of APEX.

��

� � � � � ��

�� !

"$#�% &('�)+* ,+-/.10

2�3+4�5

6$798+: ;�<>=@?

A�B�CED F

G�H I JLK�M N+OQP�R

S�T�U V�W X�Y

Z\[9]�^+_ `�a

b�c�d�e�f g h�i9j

k�l mon�p�q r+s t@uvxw�y1z�{ |\}E~9��

�x��1�� E�9��9�

��

�� ¢¡

£9¤�¥ ¦�§ ¨@©�ªE« ¬

�®+¯@°

± ² ³ ´ µ

¶�· ¸o¹�º�» ¼+½

¾�¿+À@Á

ÂÄÃ�Å Æ(Ç�ÈLÉ Ê+ËÌ�Í@Î�Ï�Ð Ñ

Ò�Ó�Ô Õ�Ö ×ÄØ9Ù+Ú Û�Ü

Ý�Þ�ßEà á

â ã ä å æç è é ê ë

ìîí

ï/ð

ñóò

ôöõ

÷îøúù

ûöüý/þ

ÿ��

��

� �

��

��

��

�� ! #" $&%

' (�) *,+!-&. /&0

132�4 5,6#7#8 9&:

;�<>=,?�@BADCE�FHGJILKNM&O�P Q R�S&TVUW�X�YJZL[B\�] ^`_Vab�cHdJeLfhg!i j�kml nho!p q�rNs tvuLw�x y&z|{}�~H�J�L�h�!� ��m� �h�!� �L��N� �B�J� ��m� ��L�� L��V��N J¡L¢m£�¤ ¥L¦�§N¨�©vª¬«> ®¬¯�°V±²�³H´JµL¶>·>¸ ¹�ºV»¼�½H¾J¿LÀmÁ�Â Ã�ÄmÅ�ÆhÇ¬È>É ÊLË&ÌVÍÎÐÏBÑJÒLÓhÔ¬Õ>Ö ×LØ�Ù�ÚÛÝÜJÞJßLàhá!â ã¬ä�å�æçÐèBéJêLëHìJí î#ïVðñ�òLóHô õ`öB÷�ø ù&úVûü�ý�þ�ÿ �� !"�#%$ &('�)* +�, -/.

Fig. 2. An example of APEX

An example of APEX for Figure 1 is shown in Figure 2 when the requiredpaths = A ∪ {director.movie, @movie.movie, actor.name} (see Definition 6).It is not necessary to know how to construct APEX at this point. The purposeof this example is to help understanding the definitions in this section.

Intuitively, APEX keeps the required paths which are composed of the labels ofGXML and frequently used paths extracted from query workloads. Accordingto the required paths, APEX groups the edges in GXML as extents and denotes

7

an extent as a node of APEX’s graph structure. The required paths are keptin a hash tree of APEX. There is the one to one correspondence between anode of the graph structure and an entry of the hash tree’s node.

As shown in Figure 2, APEX consists of two structures: a graph structure(GAPEX) and a hash tree (HAPEX). GAPEX represents the structural summaryof XML data. It is useful for query pruning and rewriting. HAPEX representsincoming label paths to nodes of GAPEX . HAPEX consists of nodes, called thehnode and each hnode contains a hash table. In an hnode, each entry of thehash table points to another hnode or a node of GAPEX but not both. Eachnode of GAPEX maps to an entry of an hnode of HAPEX . HAPEX is a usefulstructure to find a node of GAPEX for given label path. Furthermore, HAPEX

is useful in the incremental update phase (see details in Section 5.2). Also,each node of GAPEX corresponds to an extent. The extent is similar to thematerialized view in the sense that it keeps the set of edges whose endingnodes are the result of a label path expression of a query.

name

1

MovieDB

16

2,4

15

14actor

movie

2

actor

12

director3,5

17

title

11,13

7,12

director

name @director

6

movie

@director

@movie@actor

10

title

movie

9

8

name

0

movie

7director

11

name

13

name

3

(a) Strong DataGuide

1

MovieDB

2 4 1214

actor actor director

movie

3 13

name

5

name

17

title

11

7

director

namedirector

8

movie@movie

@actor

10

title

0

6 movie 15

@director9

@director

director

16actor

name

(b) 1-Index

Fig. 3. Path indexes for the sample XML data

8

The strong DataGuide and the 1-Index of Figure 1 are shown in Figure 3. Notethat, the strong DataGuide is larger than the original data and the 1-Index isequal to the structure graph of XML data. The following XPath query q1 isan example query that retrieves all actors’ names.

q1: //actor/name

To compute q1 on the strong DataGuide in Figure 3(a), the edge lookupoccurs 14 times on the index structure to prune and rewrite q1 at compile-time [19]. The query processor obtains the extent for MovieDB.actor.name.The behavior of query processor on the 1-Index is similar to that of the strongDataGuide.

However, APEX in Figure 2 is very efficient to compute q1 since the queryprocessor just looks up the hash tree with actor.name in the reverse order.That is, the hash tree of APEX enables efficient finding of the nodes of GAPEX

for partial matching path queries.

Since making an effective index structure for all the queries is very hard, APEXchanges its structure according to the frequently used paths. To extract fre-quently used paths, we assume that a database system keeps the bag (= work-load) of queries (=label paths). Furthermore, we adopt the support concept ofthe sequential pattern mining to identify frequently used paths [3,13].

Let the support of a label path p = li . . . lj, denoted by sup(p), be the ratioof the number of queries having p as a subpath to the total number queries.Also, let minSup denote the user-specified minimum support.

Definition 6 A label path p = li . . . lj in GXML is a frequently used path ifsup(p) ≥ minSup. Let p be a required path if it is either a frequently used pathor the length of p is one.

Definition 7 For a label path p of the form li.li+1 . . . lj in GXML, an edgeset, T (p) = { < oj−1, oj > | li.oi . . . lj−1.oj−1. lj.oj is an instance of the labelpath p }. That is, an edge set T (p) is a set of pairs of nids for the incomingedges to the last nodes that are reachable by traversing a given label path p.In addition, we call p an incoming label path of oj

For example, the edge set T (title) in Figure 1 is {<8,10>, <14,17>}. Labelpaths from root to 10 and 17 include MovieDB.movie.title, MovieDB.director.movie.title, [email protected], [email protected]. And, the common suffix of these label paths is “title”.

Definition 8 Given a required path set R, a label path p ∈ R is a maximalsuffix in R if there is no path q ∈ R of which p is a suffix except p.

9

For example, in Figure 2, we assume that the required paths = A ∪ {director.movie, @movie.movie, actor.name}. In this case, actor.name is a maximalsuffix in the required paths since a required path whose suffix is actor.namedoes not exist except actor.name.

Suppose there are two required paths q = li . . . lj and p = lk . . . lj, wherep 6= q and p is a suffix of q. That means, p is not a maximal suffix in therequired paths. In this case, making edge set T (q) and T (p) increases thestorage overhead since T (q) ⊆ T (p). Thus, we introduce the target edge setTR(p) instead of the edge set T (p).

Definition 9 Let QXML be a set of label paths of the root node in GXML.For each label path p in the required path set R, let QG(p) = { s | s ∈ QXML

and p is a suffix of s}, let QA(p) = { s | s ∈ QXML and some path q ( 6= p )∈ R having p as a suffix is a suffix of s }, and let Q(p) = QG(p)−QA(p).Finally, a target edge set, TR(p) = ∪r∈Q(p)T (r).

Note that, by Definition 9, if p is a maximal suffix in the required paths, T (p)= TR(p) since QA(p) = {}. If p is not a maximal suffix in the required paths,TR(p) is pointed to by a remainder entry of HAPEX . Again in Figure 2,we assume that name and actor.name are required paths. Thus, as shownin Figure 1, T (actor.name) = {<2,3>, <4,5>}, and T (name) = {<2,3>,<4,5>, <7,11>, <12,13>}. In this case, TR(actor.name) = T (actor.name)since QA(actor.name) = {}.

However, T (name) 6= TR(name). By definition 9, QG(name) = { MovieDB.director.name, MovieDB.actor.name, [email protected], . . .}, and QA(name) = { MovieDB.actor.name, [email protected] } since actor.name is a required path and has name as a suffix. ThenQ(name) = {MovieDB.director.name, MovieDB.director.movie.@director. di-rector.name, . . . }. Thus, TR(name) = {<7,11>, <12,13>}.

Now, we will define APEX.

Definition 10 Given a GXML and a required path set R, APEX can be de-fined as follows. We introduce the root node of GAPEX , xroot in APEX whichcorresponds to the root node of GXML. By considering every required path p∈ R, we introduce a node of GAPEX , with an incoming label path p, that keepsTR(p) as an extent only if TR(p) is not empty. For each edge v.l.v′ in GXML,there is an edge x.l.x′ in APEX where the extent of x′ contains < v, v′ > andthe extent of x contains < u, v > where u is a parent node of v.

The following theorem proves that APEX is sufficient for the path index. Bythe definition of the simulation [6], if there is a simulation from GXML toGAPEX , all the label paths in GXML exist in GAPEX . Thus, all queries basedon label paths can be evaluated on APEX.

10

Theorem 1 There is a simulation from GXML to GAPEX .

Proof: Given GAPEX = (Vx, Ex, xroot, A) and GXML = (V,E, r, A), there isa simulation from r to xroot. Suppose, there is a simulation from v ∈ V to x∈ Vx, a full label path q to v is l1 . . . lm, and ∃ v.lm+1.v

′ ∈ E. By Definition 10,there is a node x′ for TR(p′ = li . . . lm+1 where 1 ≤ i ≤ m) whose incominglabel path is p′, and ∃ x.lm+1.x

′ ∈ Ex. Therefore, there is a simulation fromGXML to GAPEX .

Furthermore, GAPEX satisfies the following theorem.

Theorem 2 All the label paths whose lengths are 2 in GAPEX are on GXML.

Proof: Recall that APEX groups the edges with respect to the required paths.Thus, the labels of all edges in the extent of x ∈ Vx are the same and equalto the labels of x’s incoming edges. By Definition 10, ∀ x.lj.x

′ ∈ Ex, ∃ v.lj.v′

∈ E. Therefore, by letting the labels of incoming edges of x be li, the labelpath li.lj exists in GXML.

APEX is a general path index since (1) all label paths in GXML are in GAPEX

and (2) some label paths in GAPEX are in GXML. To be more precise on (2),(a) all label paths whose length are at most 2 in GAPEX are in GXML, (b) withrespect to frequently used paths, label paths in GAPEX may or may not bein GXML. Thus, any label path query can be evaluated by look-up of HAPEX

and/or joins of extents.

5 Construction and Management of APEX

The architecture of the APEX management tool is illustrated in Figure 4.As shown in the figure, the system consists of three main components: theinitialization module, the frequently used path extraction module and theupdate module.

FrequentlyUsed Path Extraction Moduleworkload

Update Module

APEX

Initialization Module

APEX0

Fig. 4. Architecture of APEX Management tool

11

The initialization module is invoked without query workloads only when APEXis built first. This module generates APEX0 that is the simplest form of APEXand is used as a seed to build a more sophisticated APEX. As the query work-load is collected with the use of the current APEX, the frequently used pathsare computed and used to update the current APEX dynamically into a moredetailed version of APEX. The last two steps are repeated whenever queryworkloads change. Although the decision on how often the last two steps in-voke is important, it is not the focus of our paper and we omit any furtherdiscussion on this issue since it depends on the preference of end user (e.g.,by request or periodical).

5.1 APEX0: Initial index structure

APEX0 is the initial structure to build APEX. This step is executed only onceat the beginning. Since there is no workload at the beginning, the requiredpath set has paths of length one that is equivalent to the set of all labels inthe XML data.

MovieDB

actormovie

name

director

name

director

@movie

actor

movie

@actortitle

@director

&1

&2 &3

&4

&5

&7

&6

&8

&9

&0

&7@actor

&6@director

&5@movie

&4title

&9name

&8movie

&3director

&2actor

&1MovieDB

&0xroot

xnodelabel

movie

Fig. 5. An example of APEX0

An example of the APEX0 for the XML data in Figure 1 is presented in Figure5. The structure of APEX0 is similar to that of the 1-Representative Object(1-RO) proposed as a structural summary in [22]. As 1-RO contains all pathsof length two in the XML data, APEX0 includes every label path of lengthtwo. However, in APEX, we have not only the structural summary in GAPEX

but also the extents in the nodes of GAPEX . Since APEX0 groups the edgesaccording to labels, it is similar to the Bindex in [20] and their performanceshould be very similar. However, in contrast to APEX0, the Bindex does nothave the structural summary. Thus, the Bindex is not used for query pruningand rewriting.

The algorithm of building APEX0 is shown in Figure 6. Each node in APEX0

represents a set of edges that have the same incoming label. Basically, wetraverse every node in XML data (GXML) in the depth first fashion. We first

12

visit the root node of XML data (GXML) and generate the root node forGAPEX first. We add an edge <NULL,root > to the extent of the root node inGAPEX . Since each node in GAPEX represents a unique label and there is noincoming label for the root node of APEX, we represent the root node witha special incoming label ’xroot’ for convenience. We then call the functionexploreAPEX0 with the root node in GAPEX and the extent of the root nodein GAPEX .

Procedure buildAPEX0(root)begin1. xnode := hash(‘xroot’)2. xnode.extent := {< NULL, root >} //root is the root node of GXML

3. exploreAPEX0(xnode, xnode.extent)endProcedure exploreAPEX0(x, ∆ESet)begin1. ESet := ∅2. for each < u, v >∈ ∆ESet do3. ESet := ESet ∪ {o | o is an outgoing edge from v}4. for each unique label l in ESet do {5. y := hash(l)6. if (y = NULL) {7. y := newXNode()8. insert y into hash table9. }10. make edge(x, y, l)11. ∆newESet := a set of edges having l in ESet - y.extent12. y.extent := y.extent ∪ ∆newESet13. exploreAPEX0(y, ∆newESet)14. }end

Fig. 6. An algorithm to build APEX0

In each invocation of exploreAPEX0, we have two input arguments: the newlyvisited node x in GAPEX and new edges just added to the extent of x in theprevious step. We traverse all outgoing edges from the end point of the edgesin ∆ESet and group them by labels. Then we process edges in each grouphaving the same label l one by one.

Let us assume that the node representing the label l in GAPEX is y. Intuitively,we need to put the edges having the label l to the extent of y and connectx and y in GAPEX . Then, we call exploreAPEX0 recursively with y and thenewly added edges to the extent of y. We give only newly added edges at thisstep as ∆ESet for recursive invocation of exploreAPEX0 since the outgoingedges from the edges included previously to the extent have been all traversedalready.

13

When we consider the edges for each group with a distinct label, we have tocheck whether the node y exists already. To find this, we can maintain a hashtable and use it. In case, the node y for the label l does not exist, we generatea new node and put it to GAPEX . We also make sure that the new node canbe located quickly by inserting the node into HAPEX . The hash at Line (5)in Figure 6 is the hash function which returns a node for a given label. Theprocedure make edge makes an edge from x to y with label l if there is nota edge. For preventing the infinite traversal of cyclic data, we do not considerthe edges which are already in the extent of the node y.

5.2 Frequently Used Path Extraction

Any sequential pattern mining algorithms such as the one in [3,13] may beused to extract frequently used paths from the path expressions appearing inquery workloads. While we need to use the traditional algorithms with theanti-monotonicity property [23] for pruning, we have to modify them.

Consider a mail order company. Assume that many customers buy A first,then B and finally C. In the traditional sequential pattern problem, when asequence of (A,B,C) is frequent, all subsequences of length two (i.e.,(A,B),(A,C) (B, C)) including (A,C) are frequent. However, for the problem offinding frequently used path expressions, it is not valid any more. In otherwords, even though the path expression of A.B.C is frequently used, the pathexpression of A.C may not be frequent. Therefore, in case we want to usetraditional data mining algorithms which use the anti-monotonicity pruningtechnique, we need a minor modification to handle the difference.

Actually, we also found that the size of the query workload is not so large asthat of data for sequential pattern mining applications. Thus, we used a naivealgorithm in our implementation that simply counts all subpaths that appearin the query workload by one scan.

The basic behavior of the frequently used path extraction module is describedin Figure 7. Each entry of hash table in a node of HAPEX consists of five fields:label, count, new, xnode, and next. The label field keeps the key value for theentry. The count field keeps the frequency of label path which is representedby the entry. The new field is used to check a newly created entry in a nodeof HAPEX . The xnode field points to a node in GAPEX whose incoming labelpath is represented by the entry. Finally, The next field points to another nodein HAPEX . For simplicity, we omit new fields in Figure 7.

Suppose that the required path set was {A, B, C, D, B.D}. Then, the currentstate of HAPEX is represented as Figure 7-(a). A label path B.D is representedas an entry in the subnode of D entry in the root node(HashHead) of HAPEX .

14

NULLDNULL&3CNULL&2BNULL&1ANULL&0xrootnextxnodecountlabel

NULL&5remainder

NULL&4Bnextxnodecountlabel

(a) Current state

Qworkload = {A.D, C, A.D}

NULL2DNULL&31CNULL&20BNULL&12ANULL&00xrootnextxnodecountlabel

NULLNULL2A

NULL&5remainder

NULL&40B

nextxnodecountlabel

(b) After frequency count

NULL2DNULL&31CNULL&20BNULL&12ANULL&00xrootnextxnodecountlabel

NULLNULL2A

NULLNULLremainder

nextxnodecountlabel

(c) After pruning

Fig. 7. The behavior of frequently used path extraction

Let the workload Qworkload become {A.D, C, A.D}. We first count the fre-quency of each label path which appeared in Qworkload and store the countsin HAPEX . When we count, we do not use the remainder entry for counting.Figure 7-(b) shows the status of HAPEX after the frequency count.

Finally, we prune out the label paths whose frequency is less than minSup.The status of HAPEX after pruning is illustrated in Figure 7-(c). Assumingthat minSup is 0.6, the label path whose frequency is less than 2 is removed.Thus, a label path B.D is pruned. However, label paths B and C still remainsince a label path of length one is always in the required path set. Also, in thepruning step, the xnode fields, which are not valid any more by the changeof frequently used paths, are set to NULL. The content for TR(D) in Figure7-(a) representing remainder.D in HAPEX is the set of edges whose end nodesare reachable by traversing a label path D but not B.D. However, the contentfor TR(D) in Figure 7-(c) should be changed to the set of edges whose endnodes are reachable by traversing a label path D but not A.D. Thus, we setthe xnode field of remainder entry to NULL to update it later.

15

Procedure frequentlyUsedPathExtraction()begin1. reset all count fields to 0 and new fields to FALSE2. frequencyCount()3. pruningHAPEX(HashHead)endFunction pruningHAPEX(hnode)begin1. is empty := FALSE2. if hnode = NULL return is empty3. for each entry t ∈ hnode do {4. if (t.count < minsup) {5. t.next := NULL6. if (t 6∈ HashHead) {7. is empty := hnode.delete(t)8. }9. } else {10. if (pruningHAPEX(t.next) = TRUE)11. t.next := NULL12. if (t.next 6= NULL) and (t.xnode 6= NULL)13. t.xnode := NULL14. if (t.new = TRUE) and (hnode.remainder.xnode 6= NULL)15. hnode.remainder = NULL16. }17. }18. return is emptyend

Fig. 8. Frequently Used Path Extraction Algorithm

The algorithm of frequently used path extraction is presented in Figure 8.To extract frequently used paths from the given workload (Qworkload), thealgorithm first sets count fields to 0 and new fields to FALSE of all entries inHAPEX .

The algorithm consists of two parts; the first part counts frequencies and thesecond part is the pruning phase. HAPEX is used to keep the change of theworkload. The algorithm invokes the procedure frequencyCount to count thefrequency of each label path and its subpaths in Qworkload. In this procedure,the new field of a newly created entry in a node in HAPEX is set to TRUE toidentify a newly created entry in a node in HAPEX during pruning phase.

The function pruningHAPEX removes the hash entry whose frequency is lessthan the given threshold minSup(Line (4)-(5)). Even though the frequencyof an entry in the root node (HashHead) of HAPEX is less than minSup, itshould not be removed since a label path of length one should always be inthe required path set by Definition 6. If the frequency of an entry of the nodein HAPEX is less than minSup and the entry is not in the root node, the entry

16

is removed from the hash node by the function hnode.delete (Line (6)-(7)). Ifall entries in a node of HAPEX except the remainder entry are removed by thefunction hnode.delete, hnode.delete returns TRUE. And then we remove thisnode of HAPEX (Line (10)-(11)).

Finally, it sets the xnode field to NULL because it points to the wrong nodein GAPEX . As mentioned earlier, contents of some nodes of GAPEX may beaffected by the change of frequently used paths. There are two cases. First, alabel path q was a maximal suffix previously but it is not anymore. This iscaptured by the fact that an entry has not NULL value in both xnode andnext fields (Line (12)-(13)). In this case, the algorithm sets the xnode field toNULL to update the xnode field appropriately later. Second, a new frequentlyused path influences the contents of the node of GAPEX for the remainderentry in the same node of HAPEX since the contents of remainder is affectedby the change of frequently used paths. This is represented by the fact thata new entry appears in the node of HAPEX and the remainder entry in thisnode points to a node in GAPEX using the xnode field (Line (14)-(15)). In thiscase, the algorithm sets the content of the xnode field in the remainder entryto NULL to update it later.

5.3 The Update with Frequently Used Paths

After the entries in HAPEX was updated with frequently used paths computedfrom the changes of query workloads, we have to update the graph GAPEX andxnode fields of entries in the nodes of HAPEX that locates the correspondingnode in GAPEX .

After extraction of frequently used paths, each entry in a node of HAPEX mayhave a pointer to another node of HAPEX in the next field or a pointer tothe node of GAPEX in the xnode field, but the entry cannot have non-NULLvalue for both next and xnode fields. If the entry of a node n in HAPEX hasa pointer to another node m of HAPEX , there exists a longer frequently usedpath represented by the entry in m whose suffix is represented by the entryin the node n. For example, consider the example of HAPEX in Figure 7-(c).The next field of the entry with the label D in HAPEX points to a node thathas two entries; one is for the path A.D and the other is for the rest of pathsending with D except A.D. Recall that the paths are represented in HAPEX inthe reverse order. If the xnode field in the entry of a node n in HAPEX pointsto a node g in GAPEX , the extent of g has edges with an incoming label pathrepresented by the entry in n.

The basic idea of update is to traverse the nodes in GAPEX , check the validityof the nodes in GAPEX with respect to HAPEX , and update not only the

17

structure of GAPEX with frequently used paths but also the xnode field ofentries in HAPEX .

Function lookup(path = l1 . . . ln)begin1. hnode := HashHead // the root of HAPEX

2. for i = n to 1 do {3. t := hnode.hash(li)4. if (t = NULL) {5. // HAPEX keeps the la.li+1 . . . ln where la 6= li.6. //Thus, l1 . . . ln is mapped to remaider.li+1 . . . ln7. return hnode.remainder.xnode8. } else { // HAPEX keeps the path for li . . . ln9. if (t.next = NULL)10. return t.xnode // l1 . . . ln is mapped to li . . . ln11. else12. hnode := t.next13. }14. }end

Fig. 9. Lookup Function of APEX

First of all, we present the hash tree look-up function in Figure 9 which is a corefunction of the update module of APEX. To update GAPEX according to thechange of frequently used paths, we identify the nodes in GAPEX with respectto HAPEX . Thus, before visiting a node in GAPEX , we look for HAPEX usingthe lookup function. The input of the lookup function is the label path from theroot to a node in GAPEX and the output is the xnode field of the entry whichrepresents the longest suffix of the input, where the longest suffix is amongrequired paths. Note that an entry for the longest suffix path always exists inHAPEX since, by the definition of the required path (See Definition 6), the lastlabel of the path to look for in HAPEX always exists. Thus, while traversingthe nodes in GAPEX , we can check the validity of the nodes in GAPEX withrespect to the change of frequently used paths.

Consider that the input of the lookup function is l1 . . . ln. The lookup functionlooks for the nodes in HAPEX iteratively in the reverse order of l1 . . . ln (Line(2)). After accessing a node in HAPEX by li+1 . . . ln, the lookup function looksfor the entry for li . . . ln using the hash function (Line (3)).

If there is no entry for li . . . ln, the lookup function returns the xnode field ofthe remainder entry of the node since HAPEX keeps the entry for la.li+1 . . . ln(where la 6= li) and the longest suffix of l1 . . . ln among required paths isli+1 . . . ln (Line (4)-(7)).

When the entry for li . . . ln exists (Line (8)-(13)), the lookup function checks

18

the next field whether the entry keeps the subnode in HAPEX or not (Line(9)). If the next field is NULL, the label path li . . . ln is a maximal suffix ofrequired paths. Thus, lookup returns xnode of the entry since li . . . ln is thelongest suffix of l1 . . . ln among required paths (Line (11)). Otherwise (i.e.,next is not NULL), the node pointed to by next is retrieved since li . . . ln maynot be the longest suffix of l1 . . . ln among required paths (Line (12)).

Consider that the update module checks the validity of a node’s subnode inGAPEX using lookup. If the xnode field of the entry returned by lookup actuallypoints to the subnode, we do not need to do anything on HAPEX , since HAPEX

has the right information. However, if both next and xnode fields of the entryare NULLs, there are two cases. One is when the path represented by theentry in HAPEX is a newly generated frequently used path due to the changeof the query workload. The other is when the entry in the node n pointed toa node m in HAPEX previously, but the node m was deleted because the labelpaths represented by entries in the node m are not frequent any more. In thiscase, both next and xnode fields of the entry in the node n were set to NULL.The xnode field of the entry returned by lookup needs to be replaced to pointto the corresponding node in GAPEX after the modification of GAPEX . Wegenerate a new node in GAPEX that is reachable with the same label of theedge from the parent node of the subnode and set the entry’s xnode field topoint the node generated. Actually, the extent for the newly generated nodein GAPEX can be obtained from old GAPEX . However, for easy understanding,we describe the APEX update algorithm such that the extent for the newlygenerated node in GAPEX is computed by examining GXML and is attachedto the new node.

We recursively visit the newly generated node with the extent. We still keepthe subnode because it may be visited again through a different path laterduring the update phase.

For understanding, we explain the basic behavior of the dynamic update ofAPEX. Suppose we visit a node x in GAPEX with an label path path1 and anedge set ∆ESet as illustrated in Figure 10. ∆ESet is a newly added edges tothe extent of x just before visiting x. ∆ESet is needed to update the subgraphof x.

�ESet, path1

x

…l1 l2

ln

y

Fig. 10. Visiting a node by UpdateAPEX

If x was previously visited and ∆ESet is empty, then we do nothing since alledges and their subgraphs of x were traversed before.

19

If x is newly visited and ∆ESet is empty, we should traverse all outgoing edgesof x in GAPEX to verify all subnodes of x using HAPEX . For example, while xhas an outgoing edge to y with label l1, we verify the node y using the lookupfunction with the label path path1.l1

Whether x is previously visited or not, if there is a change of the extent ofit, we should update the subgraph rooted at x according to the change of theextent (i.e., ∆Eset).

We present the updateAPEX in Figure 11 that does the modification of APEXwith frequently used paths stored in HAPEX . Before calling the updateAPEX,we first initialize visited flags of all nodes in GAPEX to FALSE. The up-dateAPEX is executed with the root node in GAPEX by calling updateAPEX(xroot,∅, NULL) where xroot is the root node of GAPEX .

If xnode was previously visited and ∆ESet is empty, then we do nothing(Line (1)). If xnode is newly visited and ∆ESet is empty, we should traverseall outgoing edges of xnode in GAPEX to verify all subnodes of xnode usingHAPEX . For each ending vertex (i.e. e.end) of xnode in GAPEX , we get thepointer xchild by calling the lookup function with the input the label pathfrom the root to e.end (Line (6)-(7)). If the value of xchild is NULL, it meansthat the valid node in GAPEX does not exist. Thus, we allocate a new nodeof GAPEX and have xchild point to the new node (Line (8)). If xchild is notNULL, the xnode field returned by lookup points to a node in GAPEX for thegiven label path.

If xchild and e.end are different, we compute the edge set which should beadded to the extent of xchild (Line (10)-(14)). We select the edges, fromcomputed edges, which are not already in the extent of xchild, and assignthem to ∆EdgeSet. We then update the extent of xchild by adding edges in∆EdgeSet (Line (15)-(16)). We next make an edge by invoking make edge fromxnode to xchild with label e.label, if the edge does not exist (Line (17)-(18)).If xnode has an outgoing edge to a node in GAPEX which is different fromxchild with label e.label, make edge removes this edge. If xchild and e.end areequal, there is no change of the extent of the node xchild (Line (20)). Thus,∆ESet is set to empty. Now, we call updateAPEX recursively for xchild whichis a child node of xnode.

Whether xnode is previously visited or not, if there is a change of the extentof it, we should update the subgraph rooted at xnode (Line (23)-(37)). In thiscase, we obtain outgoing edges from the end points of edges in ∆ESet (Line(24)-(25)). In order to update the subgraph rooted at xnode, we partition theedges based on their labels and update HAPEX and GAPEX similarly as weprocessed for the case when ∆ESet was empty. (Line (26)-(36)).

The overhead of the update module is mainly affected by the number of the

20

Procedure updateAPEX(xnode, ∆ESet, path)begin1. if (xnode.visited = TRUE) and (∆ESet = ∅) return2. xnode.visited := TRUE3. EdgeSet := ∅4. if ∆ESet = ∅ {5. for each e that is an outgoing edge of xnode do {6. newpath := concatenate(path, e.label)7. xchild := lookup(newpath)8. if (xchild = NULL) xchild := newXNode()9. if (xchild != e.end) {10. if (EdgeSet = ∅) {11. for each < u, v > ∈ xnode.extent do12. EdgeSet := EdgeSet ∪ {o | o is an outgoing edge from v}13. }14. subEdgeSet := a set of edges with the label e.label in EdgeSet15. ∆EdgeSet := subEdgeset - xchild.extent16. xchild.extent := xchild.extent ∪ ∆EdgeSet17. make edge(xnode, xchild, e.label);18. hash.append(newpath, xchild);19. }20. else ∆EdgeSet := ∅21. updateAPEX(xchild, ∆EdgeSet, newpath);22. }23. } else {24. for each < u, v > ∈ ∆ESet do25. EdgeSet := EdgeSet ∪ {o | o is an outgoing edge from v}26. for each unique label l in EdgeSet do {27. newpath := concatenate(path, e.label)28. xchild := lookup(newpath)29. if (xchild = NULL) xchild := newXNode()30. subEdgeSet := set of edges labeled l in EdgeSet31. ∆EdgeSet := subEdgeset - xchild.extent32. xchild.extent := xchild.extent ∪ ∆EdgeSet33. make edge(xnode, xchild, l)34. hash.append(newpath, xchild)35. updateAPEX(xchild, ∆EdgeSet, newpath);36. }37. }end

Fig. 11. An algorithm to update APEX

frequently used paths newly extracted from the workload. As mentioned ear-lier, updateAPEX visits the node in GAPEX with the label path path= l1. . . . .lnand ∆ESet. Suppose path′=li. . . . .ln.ln+1 is a new frequently used path. Then,as shown in Line (14) and (30) of Figure 11, edges whose labels are ln+1 areextracted from an edge set. In the case of Line (14), the edges, whose la-

21

bels are ln+1 and start nodes are the end nodes of edges in the extent forthe label path path, are extracted. In the case of Line (30), the edges whoselabels are ln+1 and start nodes are the end nodes of edges in ∆ESet are ob-tained. If all edges are clustered with respect to the labels and the sort-mergejoin is applied, the overhead to obtain the edge set (i.e., extent) for path′ isO(Eln + Eln+1log(Eln + Eln+1)), where Eln is the number of edges whose labelare ln. And then, to compute the change of the extent (i.e., ∆ESet), O(Eln+1)is required. Therefore, if the number of new frequently used paths are k, thetotal update complexity is O(k · (|E| + |E|)log(|E| + |E|)), where |E| is thetotal number of edges in GXML.

Here, we explain the behavior of the algorithm updateAPEX with an example.Let us consider the HAPEX , GAPEX and GXML in Figure 12-(a) and Figure 12-(b). Figure 12 is a continuation of the example depicted in Figure 7. It showsthe change of GAPEX in Figure 12-(b) due to the changed state of HAPEX asin Figure 12-(b), which is same as HAPEX in Figure 7-(c).

We invoke updateAPEX (xroot, ∅, NULL). Since the ∆ESet is empty, thecode in Line (4)-(23) in Figure 11 will be executed. Since there is only oneoutgoing edge with the label A and the ending node &1, we check the entryin HAPEX for the path of A. The xnode field of the entry returned by lookuppoints to the node &1. Thus, we do nothing and call updateAPEX(&1, ∅,A) recursively. This recursive call visits the node &1 with label path A. Wehave three outgoing edges from the node &1. Suppose we consider the node&2 first in the for-loop in line (5). We check the entry in HAPEX with thepath of A.B and find that the xnode field of the entry points to the node &2.Thus, we do nothing again and invoke updateAPEX(&2, ∅, A.B) recursively.Inside this call, we check outgoing edges of &2. As illustrated in Figure 12-(b), the end node of an outgoing edge of &2 with label D is &4. However,for the input A.B.D, lookup returns NULL that is the xnode field of the entryfor remainder.D, which should point to the node in GAPEX representing alllabel paths ending with D except A.D. Thus, we make a new node &6 forremainder.D, compute extent of &6 and change the outgoing edge of &2 withlabel D to point to &6. Since there is no outgoing edge, we return back to &1from the recursive call. We next consider the outgoing edge with the end node&5 with the label path A.D from &1. For the input of A.D, lookup returnsNULL that is the xnode field of the entry for A.D. Then we make a node &7for A.D and make an extent (Figure 12-(c)).

After backtracking to &1, we consider the outgoing edge with the end node&3 associated with label path A.C. Since lookup returns &3 and ∆ESet isempty, we do nothing and call updateAPEX(&3,∅,A.C). In this call, the endnode of an outgoing edge of &3 with label D is &5. But, for the input A.C.D,lookup returns &6 which is the xnode field of the entry for remainder.D. Thus,Line (9)-(19) in Figure 11 is executed. The HAPEX and GAPEX including the

22

A

DD D

CB1

2 5

3 4 6

0

(a) GXML

A

DD

D

CB&1

&2 &3

&4 &5

extent&0: {<null,0>}&1: {<0,1>}&2: {<1,2>}&3: {<1,5>}&4: {<2,3>}&5: {<1,4>, <5,6>}

&0


NULLNULLA

NULLNULLremainder

nextxnodecountlabel

(b) before Update

extent&0: {<null,0>}&1: {<0,1>}&2: {<1,2>}&3: {<1,5>}&5: {<1,4>, <5,6>}&6: {<2,3>}&7: {<1,4>}

A

D D

DCB&1

&2 &3

&6

&7

&5

&0


NULL&7A

NULL&6remainder

nextxnodecountlabel

(c) snapshot of updating

A

D D

DCB&1

&2 &3

&6

&7

extent&0: {<null,0>}&1: {<0,1>}&2: {<1,2>}&3: {<1,5>}&6: {<2,3>,<5,6>}&7: {<1,4>}

&0


NULL&7A

NULL&6remainder

nextxnodecountlabel

(d) after Update

Fig. 12. An example for updateAPEX

extents after traversing every node in GAPEX with updateAPEX are shown inFigure 12-(d).

23

5.4 Query Processing with APEX

In this section, we present the query processing mechanism using APEX.Conceptually, the query processing for label path p is equal to obtain edge setT(p) (defined in Definition 7).

Suppose that a label path pi,n = li. . . . .ln whose XML query form is //li/ . . . /lnis given as a query. Then, the query processor looks up HAPEX using pi,n. If pi,n

is a required path, then, HAPEX has the label paths q whose suffixes are pi,n.For each node’s extent TR(q) which is pointed by the entry for q in HAPEX ,T (pi,n) =

⋃∀q TR(q). Thus, the query processor gathers the edges from the

extents of nodes in GAPEX which can be located by HAPEX with pi,n. Theseending nodes of the edges are the result of the given query.

If pi,n is not a required path, pk,n = lk. . . . .ln (i < k ≤ n) is kept in HAPEX .Thus, using pk,n, the query processor obtains T (pk,n). Then, the query pro-cessor repeatedly looks up HAPEX using the sub path pi,j = li.li+1. . . . .lj ofpi,n by decreasing the label subscript j from n to i until the longest suffixfound from HAPEX is equal to the label path li.li+1. . . . .lj. For each look-up ofHAPEX (for each j), the query processor keeps the union of extents of nodesin GAPEX which can be located using HAPEX . Finally, to obtain the queryresult, the query processor performs a multi-way join operation on the edgesets T (pi,j), . . . , T (pk,n) acquired by look-ups.

To evaluate a complicated XML query such as //li//lj , the query pruning andrewriting technique [12] is applied. This partial matching query is rewrittenas the set of label paths with the format of li.li+1. . . . .lj in APEX. And then,the query result is obtained by union of the result of each label path.

The query processing for the path-value mixed queries such that “select nodeswhose value satisfies a certain condition among nodes which satisfies a la-bel path expression” consists two steps. In the first step, the nodes whichsatisfy the label path expression in the query are acquired using the index.The behavior of the query processor in this step is the same as that for alabel path presented above. In the second step, the query processor tests thenodes by looking up the data table which keeps all node identifiers (nid) andcorresponding data values.

6 Experiment Results

We empirically compared the performance of APEX with the strong DataGuideand the Index Fabric on real-life and synthetic data sets. In our experiments,

24

we found that APEX shows significantly better performance. In addition, in anumber of cases, it is more than an order of magnitude faster than the otherindexes.

6.1 Experimental Environment

The experiments were performed on Pentium III-866MHz platform with MS-Windows 2000 and 512 MBytes of main memory. The XML4J parser 4 andthe XML Generator 5 from IBM were used to parse and to generate XMLdata. We implemented the strong DataGuide, the Index Fabric, and APEX inJava. The data sets were stored on a local disk. We begin by describing theXML data sets and query workloads used in the experiment.

Data Sets. To evaluate APEX over various structured XML data, we usedreal-life and synthetic data sets.

Real-life data sets come from the collection of the plays of Shakespeare [10].To test the efficiency of APEX with regard to various data sizes, we combinedsome plays of Shakespeare into bigger ones: four tragedy.xml consists of fourtragedies (Hamlet, Macbeth, Othello, and Lear), shakes 11.xml consists ofeleven plays, and shakes all.xml is an XML file containing all Shakespears’splays. Because the Shakespeare’s play data sets do not have ID and IDREFtyped attributes, these are tree structured XML data.

Synthetic data sets are generated using the XML Generator from IBM withreal-life DTDs : FlixML and GedML. The Flix Markup Language (FlixML)is a markup language for categorizing B-movie reviews for the XML-based B-movie guides 6 . The GedML is a markup language for the genealogical XMLdata [10]. These two kinds of synthetic data sets are graph structured. TheShakespeare’s play data sets show minor irregularities in the structure. TheFlixML data sets (Flix01.xml, Flix02.xml, and Flix03.xml) have moderateirregularities and the structures of GedML data sets (Ged01.xml, Ged02.xml,and Ged03.xml) are highly irregular.

The other characteristics of the three kinds of data sets used in the experimentare summarized in Table 1. The two numbers in the last column of the tablerepresent the number of distinct labels and IDREF typed labels (inside ofparentheses), respectively.

4 available at http://www.alpahworks.ibm/tech/xml4j5 available at http://www.alpahworks.ibm/tech/xmlgenerator6 available at http://www.xml.com

25

Table 1XML Data Set

Data Set nodes edges lables

four tragedy.xml 22791 22790 17(0)

shakes 11.xml 48818 48817 21(0)

shakes all.xml 179691 179690 22(0)

Flix01.xml 14734 14763 62(3)

Flix02.xml 41691 41723 64(3)

Flix03.xml 335401 335432 70(3)

Ged01.xml 8259 9699 65(14)

Ged02.xml 30875 36228 77(14)

Ged03.xml 381046 447524 84(14)

Query Workloads. To estimate the efficiency of APEX for query processing,we generated 5000 XML queries randomly. A simple path expression is asequence of labels starting from the root of the XML data. There can be adereference operator (=>) with an attribute in the simple path expression dueto the IDREF type attribute in XML data. In order to generate XML queries,we stored all possible simple path expressions in XML data. To generate aquery, we randomly selected a simple path expression, selected a subsequenceof the simple expression randomly, and then put the self-or-descendent axisin front of the subsequence if the subsequence does not start from the rootof the XML data. We repeated this process until 5000 queries are generated.The queries generated can be thought of as XQuery queries [4] having formatsof either //li/li+1/ . . . /ln or //li/ . . . /lj => lj+1/ . . . /ln where lj is a tag oran attribute (with the prefix ’@’). We will represent the type of these queriesas QTYPE1. We randomly selected 20% of the 5000 queries as the queryworkload. We found that the percentage of simple path expressions in thequery workload generated by the above method was about 25%. Since thestrong DataGuide keeps the all label path staring from the root of the XMLdata, the strong DataGuide is useful for processing the simple path expression.However, as mentioned earlier, some users may not be concerned with thestructure of the XML data and intentionally make the partial matching pathexpression. Thus, the queries in the query workload reflect the diverse users’interests.

To evaluate more complicated partial matching queries, we also generated500 XML queries having formats of //li//lj on each data set. To make thiskind of queries, we randomly selected a simple path expression and choosetwo distinct labels from the simple path. We will represent the type of thesequeries as QTYPE2. On the evaluation of the QTYPE2 queries, the query

26

processor does not use the reference relationship (i.e., lk => lk+1). However,there may be only one label path, which contains the reference relationship,from li to lj. In this case, the result of such a query becomes empty. Thus, wedid not make sure that the results of the queries are not empty.

As mentioned earlier, the Index Fabric does not keep the information of XMLelements which do not have data values. Thus, the Index Fabric is not ef-fective to process queries of QTYPE1 and QTYPE2. To compare the queryperformance of APEX with that of the Index Fabric, we generated 1000 XMLqueries represented as QTYPE3. This typed queries retrieve the XML elementswhich have the data value. The XQuery format for the QTYPE3 queries is//li/li+1/ . . . /ln[text() = value]. Since, as shown in [9], the Index Fabric doesnot keep the dereference relationship information, the QTYPE3 queries donot contain the dereference operator. We also made sure that the results ofthe queries are not empty.

Query Processor Implementation. The query processing mechanism usingAPEX is presented in Section 5.4.

As mentioned in Section 5.4, a complicated query //li//lj of QTYPE2 istransformed into the set of label paths with the form as li.li+1. . . . .lj in APEX.However, in the strong DataGuide, these partial matching queries should berewritten as a set of simple path expressions in the form of l1. . . . .li.li+1. . . . .lj.

Like query processing in APEX, query processing of the QTYPE3 queries inthe strong DataGuide consists of two steps: one for the label path, the other forthe value condition. However, in the Index Fabric, a query of QTYPE3 is en-coded as the search key string and the query processor performs the key search.Since the Index Fabric keeps all simple paths like the strong DataGuide, thewhole index structure should be traversed to perform the partial matchingpath expression. The performance of the Index Fabric is affected by the size ofthe index block. In our preliminary test, the Index Fabric whose index blocksize is 8Kbytes shows the best performance among the other Index Fabricswith various sized index blocks. Thus, in our experiment, we compared theperformance of APEX with the Index Fabric whose index block size is 8Kbytes.

6.2 Performance Result

In order to get the feeling about the structures generated by the strongDataGuide and APEX, we present the statistics regarding indexes in Table 2.

For APEX, we varied minSup between 0.002 and 0.05. Typically, the strongDataGuide produces more complex structures than APEX variants. It is notsurprising since the strong DataGuide keeps all the possible paths from the

27

Table 2Statictics of Index Structures.

minSup SDG APEX0 0.002 0.005 0.01 0.03 0.05

Nodes 24 18 24 24 24 22 19 four_tradegy.xml

Edges 23 23 23 23 23 23 23

Nodes 43 22 43 43 43 25 23 shakes_11.xml

Edges 42 36 42 42 42 42 38

Nodes 59 23 59 59 49 30 27 shakes_all.xml

Edges 58 40 58 58 58 56 51

Nodes 163 63 159 126 79 66 65 Flix01.xml

Edges 162 127 162 161 153 153 142

Nodes 139 65 137 119 84 67 66 Flix02.xml

Edges 138 117 138 137 132 131 121

Nodes 176 71 172 123 83 74 73 Flix03.xml

Edges 175 142 175 170 167 167 156

Nodes 3976 66 880 428 243 121 97 Ged01.xml

Edges 4587 178 2956 1692 1032 518 382

Nodes 13992 78 1500 652 363 182 118 Ged02.xml

Edges 16105 221 6193 3307 1648 793 468

Nodes 114679 85 2106 762 441 216 147 Ged03.xml

Edges 133071 255 10395 3961 2250 996 685

root of XML data. Particularly, the size of the strong DataGuide for the highlyirregularly structured data (GedML) becomes very large. As expected from thedefinition of APEX0, it has the most compact structure. When we increase thevalue of minSup, the number of frequently used paths decreases. In the queryworkload we generated, when the value of minSup is at least 0.05, the lengthof almost every required path becomes one. Thus, the structure of APEX inthis case becomes very close to that of the APEX0.

We also plot the total query processing costs of the QTY PE1 queries for thethree kinds of data sets over the strong DataGuide and APEX as minSup

varied from 0.002 to 0.05 in Figure 13. We also show the cost of APEX0

because it represents the upper bound of the cost as we increase minSup. Notethat the required path set becomes a set of paths with the length of onewhen we increase minSup to high values closed to 0.05. Obviously, the queryprocessing cost of APEX0 should be the most slow among APEX variantssince we need to perform joins of extents for the path expression of the querywith the length of at least two. Therefore, the query processing pays severeperformance penalty and it was illustrated by the graphs for all experimentalresults.

Note that the query processing of the strong DataGuide is more inefficientthan that of APEX0 for the moderately and highly irregularly structured data(FlixML and GedML). The query processing times of APEX0 for Ged02.xml

28

Shakespeare's Play Data Sets

0200400600800

100012001400160018002000

four_tragedy.xml shakes_11.xml shakes_all.xml

time(

seco

nds)

SDG

APEX0

0.002

0.005

0.01

0.03

0.05

(a) Play

FlixML Data Sets

0

500

1000

1500

2000

2500

Flix01.xml Flix02.xml Flix03.xml

time(

seco

nds)

SDG

APEX0

0.002

0.005

0.01

0.03

0.05

(b) FlixML

GedML Data Sets

0

5000

10000

15000

20000

25000

30000

Ged01.xml Ged02.xml Ged03.xml

time(

seco

nds)

SDG

APEX0

0.002

0.005

0.01

0.03

0.05

(c) GedML

Fig. 13. Total execution times of QTYPE1

29

and Ged03.xml are about 1500 seconds and 25000 seconds, respectively. Whereas,those of the strong DataGuide are about 27000 seconds and 125000 seconds,respectively. The inefficiency of the strong DataGuide comes from the factthat its size generated for the XML document with a complex structure be-comes very large. As Table 2 illustrates, the number of nodes and the numberof edges of the strong DataGuide generated for Ged02.xml data are 13392and 16105 respectively, while these of APEX0 are 78 and 221. Furthermore,to process a partial matching path query, the query processor should traversea lot of nodes in the strong DataGuide. This result confirms that the strongDataGuide is generally inefficient for complex XML data. In contrast to thestrong DataGuide, in APEX0, the results of some queries whose lengths areone can be directly obtained by HAPEX .

The performance of APEX is affected by the value of minSup. As minSup de-creases, the number of frequently used paths increases. Thus, the results ofmore queries can be directly obtained by HAPEX . However, as minSup de-creases, the query processing time does not always decrease. As shown in Fig-ure 13-(b) and (c), when minSup is extremely small (i.e., 0.002), APEX doesnot show the best performance of APEX variants since APEX with smallminSup keeps too many frequently used paths and there are overheads for theextent union operations to compute the query result.

As shown in Table 2, the sizes of the graph structures of the strong DataGuideand APEX with minSup of 0.002 and 0.005 for the moderately and less irreg-ularly structured data (Play and FlixML) are similar. However, the queryprocessing cost of APEX is cheaper than that of the strong DataGuide. Thisis because the query result can be directly obtained by HAPEX without travers-ing GAPEX of APEX. In addition, the query performance of APEX is signif-icantly better for the highly irregularly structured XML data (GedML) thanthat of the strong DataGuide. From the result, we can conclude that APEXshows much better performance as the structure of the XML data gets morecomplex.

Figure 14 describes the total query execution time of the QTYPE2 queriesover the strong DataGuide, APEX0, and APEX with minSup of 0.005. Notethat the Y-axis is in log-scale. We have chosen the value of 0.005 because itis one of values of minSup that generates an efficient APEX. We omit theevaluation results for the other data sets (e.g., four tragedy.xml, Flix01.xml)since they show similar results.

To evaluate a query of QTYPE2, the query pruning and rewriting shouldbe applied. To perform the query pruning and rewriting with the strongDataGuide, the query processor generally traverses the whole index structurefrom the root several times. However, the query processor for APEX traversesthe index structure of GAPEX starting from the nodes whose incoming edges’

30

QTYPE2

1

10

100

1000

10000

shakes_11.xml Flix02.xml Ged02.xml

time(

seco

nds)

SDG

APEX0

APEX(0.005)

Fig. 14. Total evaluation times of QTYPE2 [log scale]

labels are li. Therefore, the query pruning and rewriting overhead of APEXis less than that of the strong DataGuide. Even though there is the join over-head to obtain the query result, APEX shows the best performance over thevarious data sets. As shown in Table 2, the graph structure of APEX0 is themost compact. Thus, the query pruning and rewriting on the APEX0 takes lesstime than that of the others. However, the query processing cost of APEX0 ishigh since there are massive join operations to obtain the query result. TheQTYPE2 processing time of APEX0 for Ged02.xml is up to 210000 seconds.

QTYPE3

1

10

100

1000

10000

shakes_11.xml Flix02.xml Ged02.xml

time(

seco

nds)

Index Fabric

SDG

APEX(0.005)

Fig. 15. Total evaluation times of QTYPE3 [log scale]

Figure 15 describes the total execution time of the QTYPE3 queries over theIndex Fabric, the strong DataGuide, and APEX with minSup of 0.005. We alsoomit the results of the other data sets. As mentioned in Section 2, the IndexFabric records each element which has a data value using the compositionsof the encoded label path and the data value of this element. Thus, in theIndex Fabric, the result of a query of QTYPE3 is obtained using only theindex structure. On the contrary, in the strong DataGuide and APEX, thedata nodes obtained by the index are tested whether the nodes have the givendata value by looking up the data table. Thus, the Index Fabric shows the

31

best performance for less irregularly structured data.

In contrast to the result for the minor irregular structure, the Index Fab-ric shows the worst performance for the moderately and highly irregularlystructured data (FlixML and GedML). This result comes from the lossy com-pression inherent in the Patricia Trie. The Index Fabric keeps a compositionof the encoded label path and the data value of an element as a string usingthe Patricia Trie. Thus, to evaluate a QTYPE3 query using the Index Fabric,the traversal of the whole index structure and the validation of each nodewith regard to the given label path expression are needed. This overhead de-grades the performance of the Index Fabric. Whereas, APEX shows the bestperformance for complex XML data.

The effectiveness of APEX is affected by minSup. In our experiment, APEXis efficient with the value of minSup ranging between 0.002 and 0.01. WhenminSup is 0.005, APEX shows the best performance with the query workloadwe generated over various XML data.

7 Conclusion

In this paper, we propose APEX which is an Adaptive Path indEx for XMLdata. APEX does not keep all paths starting from the root and utilizes fre-quently used paths to improve the query performance. In contrast to the tra-ditional indexes such as DataGuides, 1-indexes and the Index Fabric, it isconstructed by utilizing a data mining technique to summarize paths that ap-pear frequently in query workloads. APEX can be incrementally updated inorder to minimize the overhead of construction whenever the query workloadchanges. APEX also guarantees all paths of length two so that any label pathexpression can be evaluated by joins of extents in APEX without scanningoriginal data.

To support efficient query processing, APEX consists of two structures: thegraph structure GAPEX and the hash tree HAPEX . GAPEX represents the struc-tural summary of XML data with extents. HAPEX keeps the information forfrequently used paths and points to the nodes in GAPEX which correspond tothe ending nodes of the frequently used paths. Given a query, we use HAPEX

to locate the nodes of GAPEX that have extents required to evaluate a query.

We implemented APEX and conducted an extensive experimental study withboth real-life and synthetic data sets. Experimental results show that APEXimproves query processing cost typically 2 to 69 times compared with thetraditional indexes, with the performance gap increasing with the irregularityof XML data.

32

References

[1] S. Abiteboul. Querying semi-structured data. In Proceedings of the 6thInternational Conference on Database Theory, pages 1–18, January 1997.

[2] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. L. Wiener. The lorelquery languages for semistructured data. International Journal on DigitalLibraries, 1(1):68–88, 1997.

[3] R. Agrawal and R. Srikant. Mining sequential patterns. In Proceedings of the11th International Conference on Data Engineering, pages 3–14, March 1995.

[4] S. Boag, D. Chamberlin, M. Fernandez, D. Florescu, J. Robie, J. Simeon,and M. Stefanescu. XQuery 1.0: An XML query language. Working Draft,http://www.w3.org/TR/2001/WD-xquery-20011220, 20 December 2001.

[5] T. Bray, J. Paoli, C. M. Sperberg-McQueen, and E. Maler. Extensiblemarkup language (xml) 1.0 (second edition). W3C Recommendation,http://www.w3.org/TR/REC-xml, October 2000.

[6] P. Buneman, S. B. Davidson, M. F. Fernandez, and D. Suciu. Adding structureto unstructed data. In Proceedings of the 6th International Conference onDatabase Theory, pages 336–350, January 1997.

[7] P. Buneman, S. B. Davidson, G. G. Hillebrand, and D. Suciu. A query languageand optimization techniques for unstructured data. In Proceedings of the 1996ACM SIGMOD International Conference on Management of Data, pages 505–516, June 1996.

[8] J. Clark and S. DeRose. XML path language(XPath) version 1.0. W3CRecommendation, http://www.w3.org/TR/xpath, November 1999.

[9] B. Cooper, N. Sample, M. J. Franklin, G. R. Hjaltason, and M. Shadmon. A fastindex for semistructured data. In Proceedings of 27th International Conferenceon Very Large Data Bases, January 2001.

[10] R. Cover. The XML cover pages. http://www.oasis-open.org/cover/xml.html,2001.

[11] M. F. Fernandez, D. Florescu, A. Y. Levy, and D. Suciu. A query lanaguagefor a web-site management system. ACM SIGMOD Record, 26(3):4–11, 1997.

[12] M. F. Fernandez and D. Suciu. Optimizing regular path expressions usinggraph schemas. In Proceedings of the 14th International Conference on DataEngineering, pages 14–23, February 1998.

[13] M. N. Garofalakis, R. Rastogi, and K. Shim. SPIRIT: Sequential patternmining with regular expression constraints. In Proceedings of 25th InternationalConference on Very Large Data Bases, pages 223–234, September 1999.

[14] R. Goldman and J. Widom. DataGuides: Enable query formulation andoptimization in semistructured databases. In Proceedings of 23rd InternationalConference on Very Large Data Bases, pages 436–445, August 1997.

33

[15] J. E. Hopcraft and J. D. Ullman. Introduction to Automata Theory,Language, and Computation. Addison-Wesley Publishing Company, Reading,Massachusetts, 1979.

[16] R. Kaushik, P. Shhenoy, P. Bohannon, and E. Gudes. Exploting local similarityfor indexing paths in graph-structured data. In Proceedings of the 18thInternational Conference on Data Engineering, pages 129–140, Feburary 2002.

[17] A. Kemper and G. Moerkotte. Access support relations: An indexing methodfor object bases. Information Systems, 17(2):117–145, 1992.

[18] Q. Li and B. Moon. Indexing and querying XML data for regular pathexpressions. In Proceedings of 27th International Conference on Very LargeData Bases, pages 361–370, September 2001.

[19] J. McHugh and J. Widom. Compile-time path expansion in lore. In Proceedingsof the Workshop on Query Processing for Semistructured Data and Non-Standard Data Formats, 1999.

[20] J. McHugh and J. Widom. Query optimization for xml. In Proceedings of 25thInternational Conference on Very Large Data Bases, pages 315–326, September1999.

[21] T. Milo and D. Suciu. Index structures for path expression. In Proceedingsof 7th International Conference on Database Theory, pages 277–295, January1999.

[22] S. Nestorov, J. D. Ullman, J. L. Wiener, and S. S. Chawathe. RepresentativeObjects: Concise representations of semistructured, hierarchial data. InProceedings of the 13th International Conference on Data Engineering, pages79–90, April 1997.

[23] R. T. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Exploratory mining andpruning optimizations of constrained association rules. In Proceedings of the1998 ACM SIGMOD International Conference on Management of Data, pages13–24, June 1998.

[24] Y. Papakonstantinou, H. Garcia-Molina, and J. Widom. Object exchange acrossheterogeneous information source. In Proceedings of the 11th InternationalConference on Data Engineering, pages 251–260, March 1995.

[25] C. Zhang, J. F. Naughton, D. J. DeWitt, Q. Luo, and G. M. Lohman. Onsupporting containment queries in relational database management systems.In Proceedings of the 2001 ACM SIGMOD International Conference onManagement of Data, May 2001.

34

An Adaptive Path Index for XML Data using the Query...

Documents

Transcript of An Adaptive Path Index for XML Data using the Query...