Web Data Management

XPath

In this lecture

• Review of the XPath specification
– data model
– examples
– syntax

Resources:

A formal semantics of patterns in XSLT by Phil Wadler.
http://cm.bell-labs.com/cm/cs/who/wadler/papers/xsl-semantics/xsl-semantics.pdf

XML Path Language (XPath) www.w3.org/TR/xpath

XPath

• http://www.w3.org/TR/xpath (11/99)

• Building block for other W3C standards:
– XSL Transformations (XSLT)
– XML Link (XLink)
– XML Pointer (XPointer)
– XML Query

• Was originally part of XSL

XPath

• An expression language to be used in another host language (e.g., XSLT, XQuery).

• Allows the description of paths in an XML tree, and the retrieval of nodes that match these paths.

• Can also be used for performing some (limited) operations on XML data.

Example for XPath Queries

<bib>
  <book>
    <publisher> Addison-Wesley </publisher>
    <author> Serge Abiteboul </author>
    <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>
    <author> Victor Vianu </author>
    <title> Foundations of Databases </title>
    <year> 1995 </year>
  </book>
  <book price="55">
    <publisher> Freeman </publisher>
    <author> Jeffrey D. Ullman </author>
    <title> Principles of Database and Knowledge Base Systems </title>
    <year> 1998 </year>
  </book>
</bib>

Data Model for XPath

• XPath expressions operate over XML trees, which consist of the following node types:
– Document: the root node of the XML document;
– Element: element nodes;
– Attribute: attribute nodes, attached to an Element node (their parent);
– Text: text nodes, i.e., leaves of the XML tree.

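Data Model for XPath

[Figure: XPath tree of the bib example. The root sits above the root element bib, whose book children contain publisher and author elements with text leaves (Addison-Wesley, Serge Abiteboul). The legend distinguishes element, attribute (Attr = "1"), text, processing instruction and comment nodes.]

Much like the XQuery data model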

Data Model for XPath

• The root node of an XML tree is the (unique) Document node;

• The root element is the (unique) Element child of the root node;

• A node has a name, or a value, or both:
– an Element node has a name, but no value;
– a Text node has a value (a character string), but no name;
– an Attribute node has both a name and a value.

• Attributes are special! Attributes are not considered as first-class nodes in an XML tree. They must be addressed specifically, when needed.

XPath: Simple Expressions

/bib/book/year

Result: <year> 1995 </year>

<year> 1998 </year>

/bib/paper/year

Result: empty (there were no papers)
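The two queries above can be tried with Python's standard xml.etree.ElementTree module. This is only a sketch: ElementTree supports a limited XPath subset and evaluates paths relative to an element, so /bib/book/year is written as book/year from the bib element, and the document is an abridged copy of the bib example.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the bib example (only the parts these queries touch).
BIB = """<bib>
  <book><author>Serge Abiteboul</author><year>1995</year></book>
  <book price="55"><author>Jeffrey D. Ullman</author><year>1998</year></book>
</bib>"""

bib = ET.fromstring(BIB)  # bib is the root element

# /bib/book/year, written relative to the bib element
years = [y.text for y in bib.findall("book/year")]

# /bib/paper/year: there are no paper elements, so the result is empty
paper_years = bib.findall("paper/year")

print(years)        # ['1995', '1998']
print(paper_years)  # []
```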

XPath Tree Nodes

• Seven node types:
– root, element, attribute, text, comment, processing instruction and namespace

• Namespace and attribute nodes have parent nodes, but are not children of those parent nodes.

• The relationship between a parent node and a child node is containment

• Attribute nodes and namespace nodes describe their parent nodes

XPath Tree Nodes

<?xml version = "1.0"?>

<html xmlns = "http://www.w3.org/TR/REC-html40">
  <head>
    <title>Processing Instruction and Namespace Nodes</title>
  </head>

  <?deitelprocessor example = "fig11_03.xml"?>

  <body>
    <deitel:book deitel:edition = "1" xmlns:deitel = "http://www.deitel.com/xmlhtp1">
      <deitel:title>XML How to Program</deitel:title>
    </deitel:book>
  </body>
</html>

XPath Tree Nodes

• String-value: Each XPath tree node has a string representation that XPath uses to compare nodes.

• The string-value of a text node consists of the character data contained in the node.

• Document order: Nodes in an XPath tree have an ordering that is determined by the order in which the nodes appear in the original XML document.

• The reverse document order is the reverse ordering of the nodes in a document.

• The string-value for the html element node is determined by concatenating the string-values for all of its descendant text nodes in document order.

• The string-value for element node html is

Processing Instruction and Namespace NodesXML How to Program

• Because all whitespace is removed when the text nodes are normalized, there is no space in the concatenation.

XPath Tree Nodes

• For processing instructions, the string-value consists of the remainder of the processing instruction after the target, including whitespace, but excluding the ending ?>

• The string-value for the processing instruction is

• example = "fig11_03.xml"

• Namespace-node string-values consist of the URI for the namespace.

• The string-value for the deitel namespace declaration is

http://www.deitel.com/xmlhtp1

XPath Tree Nodes

• For the root node of the document, the string-value is also determined by concatenating the string-values of its text-node descendants in document order.

• The string-value of the root node is therefore identical to the string-value calculated for the html element node.

• The string-value for the edition attribute node consists of its value, which is 1.

• The string-value for a comment node consists only of the comment's text, excluding <!-- and -->.

• The string-value for the second comment node is therefore: Processing instructions and namespaces.

XPath Tree Nodes

• Expanded-name: Certain nodes (i.e., element, attribute, processing instruction and namespace) also have an expanded-name that can be used to locate specific nodes in the XPath tree.

• Expanded-names consist of both a local part and a namespace URI.

• The local part for the element node html is therefore html.

• If there is a prefix for the element node, the namespace URI of the expanded-name is the URI to which the prefix is bound.

• If there is no prefix for the element node, the namespace URI of the expanded-name is the URI for the default namespace.

XPath Tree Nodes

• The local part of the expanded name for a processing instruction node corresponds to the target of the processing instruction in the XML document.

• For processing instructions, the namespace URI of the expanded-name is null

• The local part of the expanded-name for a namespace node corresponds to the prefix for the namespace, if one exists; or, if it is a default namespace, the local part is empty (i.e., the empty string).

• The namespace URI of the expanded-name for a namespace node is always null.

XPath Tree Nodes

Summary by node type (string-value, expanded-name, description):

root
– string-value: Determined by concatenating the string-values of all text-node descendants in document order.
– expanded-name: None.
– description: Represents the root of an XML document. This node exists only at the top of the tree and may contain element, comment or processing-instruction children.

element
– string-value: Determined by concatenating the string-values of all text-node descendants in document order.
– expanded-name: The element tag, including the namespace prefix (if applicable).
– description: Represents an XML element and may contain element, text, comment or processing-instruction children.

attribute
– string-value: The normalized value of the attribute.
– expanded-name: The name of the attribute, including the namespace prefix (if applicable).
– description: Represents an attribute of an element.

text
– string-value: The character data contained in the text node.
– expanded-name: None.
– description: Represents the character data content of an element.

comment
– string-value: The content of the comment (not including <!-- and -->).
– expanded-name: None.
– description: Represents an XML comment.

processing instruction
– string-value: The part of the processing instruction that follows the target and any whitespace.
– expanded-name: The target of the processing instruction.
– description: Represents an XML processing instruction.

namespace
– string-value: The URI of the namespace.
– expanded-name: The namespace prefix.
– description: Represents an XML namespace.

XPath: Axes

• A location path is an expression that specifies how to navigate an XPath tree from one node to another.

• A location path is composed of location steps, each of which is composed of an "axis," a "node test" and an optional "predicate."

• Searching through an XML document begins at a context node in the XPath tree.

• Searches through the XPath tree are made relative to this context node.

• An axis indicates which nodes, relative to the context node, should be included in the search.

• The axis also dictates the ordering of the nodes in the set.

• Axes that select nodes that follow the context node in document order are called forward axes.

• Axes that select nodes that precede the context node in document order are called reverse axes.

XPath Context

• A step is evaluated in a specific context [<N1, N2, ..., Nn>, Nc] which consists of:
– a context list <N1, N2, ..., Nn> of nodes from the XML tree;
– a context node Nc belonging to the context list.

• The context length n is a positive integer indicating the size of the context list of nodes; it can be obtained with the function last();

• The context node position c ∈ [1, n] is a positive integer indicating the position of the context node in the context list of nodes; it can be obtained with the function position().
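The context can be pictured with a few lines of Python. This is an illustrative sketch only: the helper name contexts and the string node labels are hypothetical, and plain lists stand in for node lists.

```python
def contexts(context_list):
    """Yield (context node, position, size) for every node of a context list."""
    size = len(context_list)  # the value last() would return
    for position, node in enumerate(context_list, start=1):
        yield node, position, size  # position is what position() would return

nodes = ["N1", "N2", "N3"]  # stand-ins for XML nodes
triples = list(contexts(nodes))
print(triples)  # [('N1', 1, 3), ('N2', 2, 3), ('N3', 3, 3)]
```

Positions start at 1, not 0, which is why enumerate is given start=1.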

XPath Steps

• The basic components of XPath expressions are steps, of the form: axis::node-test[P1][P2]...[Pn]

– axis is an axis name indicating what the direction of the step in the XML tree is (child is the default).

– node-test is a node test, indicating the kind of nodes to select.

– Pi is a predicate, that is, any XPath expression, evaluated as a boolean, indicating an additional condition. There may be no predicates at all.

• A step is evaluated with respect to a context, and returns a node list.

Path Expressions

• A path expression is of the form: [/]step1/step2/.../stepn
– A path that begins with / is an absolute path expression;
– A path that does not begin with / is a relative path expression.

• Examples
– /A/B is an absolute path expression denoting the Element nodes with name B, children of the root named A;
– ./B/descendant::text() is a relative path expression which denotes all the Text nodes descendant of an Element B, itself child of the context node;
– /A/B/@att1[. > 2] denotes all the Attribute nodes @att1 whose value is greater than 2.

Evaluation of Path Expressions

• Each step_i is interpreted with respect to a context; its result is a node list.

• A step step_i is evaluated with respect to the context of step_(i−1). More precisely:
– For i = 1 (first step): if the path is absolute, the context is a singleton, the root of the XML tree; else (relative paths) the context is defined by the environment;
– For i > 1: if N = <N1, N2, ..., Nn> is the result of step step_(i−1), step_i is successively evaluated with respect to the context [N, Nj], for each j ∈ [1, n].

• The result of the path expression is the node set obtained after evaluating the last step.

Evaluation of Path Expressions

• Evaluation of /A/B/@att1
– The path expression is absolute: the context consists of the root node of the tree.
– The first step, A, is evaluated with respect to this context. The result is A, the root element.
– A is the context for the evaluation of the second step, B. The result is a node list with two nodes, B[1] and B[2].
– @att1 is first evaluated with the context node B[1]. The result is the attribute node of B[1].
– @att1 is also evaluated with the context node B[2]. The result is the attribute node of B[2].
– Final result: the node set union of all the results of the last step, @att1.
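The step-by-step evaluation of /A/B/@att1 can be mimicked in Python. This is a sketch under assumptions: the sample document is hypothetical (the tree drawn on the slides is not reproduced in this transcript), and since ElementTree has no attribute nodes, (owner, name, value) tuples stand in for them.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with two B children under the root element A.
doc = ET.fromstring('<A><B att1="1"><D/></B><B att1="2"/><C att2="3"/></A>')

# Step 1 (A): the context is the root node; select its element child named A.
step1 = [doc] if doc.tag == "A" else []

# Step 2 (B): evaluate child::B once for each node of the previous result.
step2 = [b for a in step1 for b in a.findall("B")]

# Step 3 (@att1): evaluate attribute::att1 once per B, then take the union.
step3 = [(b, "att1", b.get("att1")) for b in step2 if b.get("att1") is not None]

values = [value for _, _, value in step3]
print(len(step2))  # 2
print(values)      # ['1', '2']
```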

XPath: Axes

Axis                 Ordering   Description
self                 None       The context node itself.
parent               Reverse    The context node's parent, if one exists.
child                Forward    The context node's children, if they exist.
ancestor             Reverse    The context node's ancestors, if they exist.
ancestor-or-self     Reverse    The context node's ancestors and also itself.
descendant           Forward    The context node's descendants.
descendant-or-self   Forward    The context node's descendants and also itself.
following            Forward    The nodes in the XML document following the context node, not including descendants.
following-sibling    Forward    The sibling nodes following the context node.
preceding            Reverse    The nodes in the XML document preceding the context node, not including ancestors.
preceding-sibling    Reverse    The sibling nodes preceding the context node.
attribute            Forward    The attribute nodes of the context node.
namespace            Forward    The namespace nodes of the context node.

XPath: Axes

• An axis has a principal node type that corresponds to the type of node the axis may select.

• For the attribute axis, the principal node type is attribute.

• For the namespace axis, the principal node type is namespace.

• All other axes have element as their principal node type.

XPath: Axes

• Child axis: denotes the Element or Text children of the context node.

• Important: An Attribute node has a parent (the element on which it is located), but an attribute node is not one of the children of its parent.

• Example: child::D

XPath: Axes

• Parent axis: denotes the parent of the context node.
– The node test is either an element name, or * which matches all names, or node() which matches all node types.
– The result is always an Element or Document node, or an empty node-set (if the parent does not match the node test or does not satisfy a predicate).
– .. is an abbreviation for parent::node(): the parent of the context node.

• Example: parent::node()

XPath: Axes

• Attribute axis: denotes the attributes of the context node.
– The node test is either the attribute name, or * which matches all the names.

• Example: attribute::*

XPath: Axes

• Descendant axis: all the descendant nodes, except the Attribute nodes.
– The node test is either the node name (for Element nodes), or * (any Element node) or text() (any Text node) or node() (all nodes).
– The context node does not belong to the result: use descendant-or-self instead.

• Example: descendant::node()

• Example: descendant::*

XPath: Axes

• Ancestor axis: all the ancestor nodes.
– The node test is either the node name (for Element nodes), or node() (any Element node, and the Document root node).
– The context node does not belong to the result: use ancestor-or-self instead.

• Example: ancestor::node()

XPath: Axes

• Following axis: all the nodes that follow the context node in the document order.
– Attribute nodes are not selected.
– The node test is either the node name, *, text() or node().
– The axis preceding denotes all the nodes that precede the context node.

• Example: following::node()

XPath: Axes

• Following sibling axis: all the nodes that follow the context node, and share the same parent node.
– Same node tests as descendant or following.
– The axis preceding-sibling denotes all the siblings that precede the context node.

• Example: following-sibling::node()

Location Path Abbreviations

Location path                  Abbreviation
child::                        (none) used by default if no axis is supplied, so it may be omitted
attribute::                    @
/descendant-or-self::node()/   //
self::node()                   .
parent::node()                 ..
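The abbreviation table can be read as a rewriting procedure. The Python sketch below expands an abbreviated path into its verbose form; the function names expand_step and expand_path are hypothetical, and the sketch deliberately ignores predicates (a "/" or "@" inside square brackets would be mishandled).

```python
def expand_step(step):
    """Expand one abbreviated location step (predicates are not handled)."""
    if step == ".":
        return "self::node()"
    if step == "..":
        return "parent::node()"
    if step.startswith("@"):
        return "attribute::" + step[1:]
    if "::" in step:
        return step             # the axis is already explicit
    return "child::" + step     # child is the default axis

def expand_path(path):
    """Expand //, ., .., @ and the implicit child axis in a simple path."""
    path = path.replace("//", "/descendant-or-self::node()/")
    # Empty parts come from a leading "/" and are kept as they are.
    return "/".join(expand_step(p) if p else p for p in path.split("/"))

print(expand_path("//author/*"))
# /descendant-or-self::node()/child::author/child::*
print(expand_path("bib/book/@price"))
# child::bib/child::book/attribute::price
print(expand_path("../title"))
# parent::node()/child::title
```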

XPath: Node Tests

• The set of selected nodes is refined with node tests.

• Node tests rely upon the principal node type of an axis for selecting nodes in a location path.

Node Test                  Description
*                          Selects all nodes of the same principal node type.
node()                     Selects all nodes, regardless of their type.
text()                     Selects all text nodes.
comment()                  Selects all comment nodes.
processing-instruction()   Selects all processing-instruction nodes.
node name                  Selects all nodes with the specified node name.

XPath: Axes

• Location Paths Using Axes and Node Tests
– Location paths are composed of sequences of location steps.
– A location step contains an axis and a node test separated by a double-colon (::) and, optionally, a "predicate" enclosed in square brackets ([ ]).

child::*
– The above location path selects all element-node children of the context node, because the principal node type for the child axis is element.

XPath: Wildcard

//author/child::* or //author/*

Result: <first-name> Rick </first-name>

<last-name> Hull </last-name>

* Matches any element

XPath: Axes and Node Tests

• child::text() selects all text-node children of the context node

• Combining two location steps to form the location path
– child::*/child::text() selects all text-node grandchildren of the context node

XPath: Node Tests

/bib/book/author/text()

Result: Serge Abiteboul
Victor Vianu
Jeffrey D. Ullman

Rick Hull doesn't appear because his name is held in first-name and last-name child elements

/bib/book/author/*/text()

Result: Rick
Hull

XPath: Restricted Kleene Closure

• Select all author element nodes in an entire document: /descendant-or-self::node()/child::author

• Instead use the abbreviation: //author

Result:
<author> Serge Abiteboul </author>
<author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>
<author> Victor Vianu </author>
<author> Jeffrey D. Ullman </author>

/bib//first-name

Result: <first-name> Rick </first-name>
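The text(), wildcard and // queries on the bib example can be approximated with xml.etree.ElementTree. This is a sketch with stated limits: ElementTree has no text() node test, so the .text property (the character data before the first child element) stands in for it, and the document is written without whitespace between tags.

```python
import xml.etree.ElementTree as ET

BIB = (
    "<bib>"
    "<book><publisher>Addison-Wesley</publisher>"
    "<author>Serge Abiteboul</author>"
    "<author><first-name>Rick</first-name><last-name>Hull</last-name></author>"
    "<author>Victor Vianu</author>"
    "<title>Foundations of Databases</title><year>1995</year></book>"
    '<book price="55"><publisher>Freeman</publisher>'
    "<author>Jeffrey D. Ullman</author>"
    "<title>Principles of Database and Knowledge Base Systems</title>"
    "<year>1998</year></book>"
    "</bib>"
)
bib = ET.fromstring(BIB)

# /bib/book/author/text(): authors that directly contain character data
direct_text = [a.text for a in bib.findall("book/author") if a.text]

# /bib/book/author/*/text(): text of the element children of author
nested_text = [e.text for e in bib.findall("book/author/*")]

# //author and /bib//first-name: search at any depth below bib
all_authors = bib.findall(".//author")
first_names = [e.text for e in bib.findall(".//first-name")]

print(direct_text)       # ['Serge Abiteboul', 'Victor Vianu', 'Jeffrey D. Ullman']
print(nested_text)       # ['Rick', 'Hull']
print(len(all_authors))  # 4
print(first_names)       # ['Rick']
```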

XPath: Attribute Nodes

/bib/book/@price

Result: “55”

@price means that price has to be an attribute

XPath: Predicates

• Boolean expression, built with tests and the Boolean connectors and/or (negation is expressed with the not() function);

• a test is
– either an XPath expression, whose result is converted to a Boolean;
– or a comparison or a call to a Boolean function.

• Important: predicate evaluation requires several rules for converting nodes and node sets to the appropriate type.

Predicate Evaluation

• A step is of the form axis::node-test[P]

• First axis::node-test is evaluated: one obtains an intermediate result I

• Second, for each node in I, P is evaluated: the step result consists of those nodes in I for which P is true.

/A/B/descendant::text()[1]

Predicate Evaluation

• Beware: an XPath step is always evaluated with respect to the context of the previous step.
– Here the result consists of those Text nodes, first descendant (in the document order) of a node B.

• /A/B//text()[1]
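The difference between descendant::x[1] and //x[1] is easy to miss, so here is a small Python sketch. It rests on assumptions: the document is hypothetical, C elements stand in for text nodes (ElementTree cannot select text nodes), and the descendant axis is emulated with Element.iter because ElementTree has no axis syntax.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: C elements play the role of the slides' text nodes.
a = ET.fromstring("<A><B><C>1</C><D><C>2</C></D></B></A>")
b = a.find("B")

# B/descendant::C[1]: the predicate applies to the list of all C descendants
# of B, so only the first one in document order is kept.
first_descendant = [next(b.iter("C")).text]

# B//C[1] expands to B/descendant-or-self::node()/child::C[1]: the predicate
# is applied per parent, keeping every C that is the first C child of its parent.
first_children = [c.text for c in b.findall(".//C[1]")]

print(first_descendant)  # ['1']
print(first_children)    # ['1', '2']
```

The second query returns more nodes because [1] is evaluated in the context of the last step (child::C), once for each parent node.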

XPath: Predicates

<?xml version = "1.0"?>
<books>
  <book>
    <title>Java How to Program</title>
    <translation edition="1">Spanish</translation>
    <translation edition="1">Chinese</translation>
    <translation edition="1">Japanese</translation>
    <translation edition="2">French</translation>
    <translation edition="2">Japanese</translation>
  </book>
  <book>
    <title>C++ How to Program</title>
    <translation edition="1">Korean</translation>
    <translation edition="2">French</translation>
    <translation edition="2">Spanish</translation>
    <translation edition="3">Italian</translation>
    <translation edition="3">Japanese</translation>
  </book>
</books>

XPath: Predicates

• Select the title element node for each book that has a Japanese translation

/books/book/translation[. = 'Japanese']/../title

• A predicate is a Boolean expression used as part of a location path to filter nodes from the search.

• Select the edition attribute node for books with Japanese translations

/books/book/translation[. = 'Japanese']/@edition
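Both predicate queries can be run against the books document with xml.etree.ElementTree (the [.='text'] predicate needs Python 3.7 or later). A sketch only: paths are relative to the books element, and because ElementTree cannot return attribute nodes, the edition values are read with get().

```python
import xml.etree.ElementTree as ET

BOOKS = """<books>
  <book><title>Java How to Program</title>
    <translation edition="1">Spanish</translation>
    <translation edition="1">Chinese</translation>
    <translation edition="1">Japanese</translation>
    <translation edition="2">French</translation>
    <translation edition="2">Japanese</translation>
  </book>
  <book><title>C++ How to Program</title>
    <translation edition="1">Korean</translation>
    <translation edition="2">French</translation>
    <translation edition="2">Spanish</translation>
    <translation edition="3">Italian</translation>
    <translation edition="3">Japanese</translation>
  </book>
</books>"""
books = ET.fromstring(BOOKS)

# /books/book/translation[. = 'Japanese']/../title
titles = [t.text for t in
          books.findall("book/translation[.='Japanese']/../title")]

# /books/book/translation[. = 'Japanese']/@edition
japanese = books.findall("book/translation[.='Japanese']")
editions = [t.get("edition") for t in japanese]

print(titles)    # ['Java How to Program', 'C++ How to Program']
print(editions)  # ['1', '2', '3']
```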

XPath 1.0 Type System

• Four primitive types:

Type      Description         Literals     Examples
Boolean   Boolean values      None         true(), not($a=3)
Number    Floating-point      12, 12.5     1 div 33
String    Character strings   "to", 'ti'   concat('Hello','!')
Nodeset   Node set            None         /a/b[c=1 or @e]/d

• The boolean(), number(), string() functions convert types into each other (no conversion to nodesets is defined), but this conversion is done in an implicit way most of the time.

• Rules for converting to a Boolean:
– A number is true if it is neither 0 nor NaN.
– A string is true if its length is not 0.
– A nodeset is true if it is not empty.

XPath 1.0 Type System

• Rules for converting a nodeset to a string:
– The string value of a nodeset is the string value of its first item in document order.
– The string value of an element or document node is the concatenation of the character data in all text nodes below.
– The string value of a text node is its character data.
– The string value of an attribute node is the attribute value.

• Examples (whitespace-only text nodes removed)

<a toto="3">
  <b titi='tutu'><c /></b>
  <d>tata</d>
</a>

string(/)          "tata"
string(/a/@toto)   "3"
boolean(/a/b)      true()
boolean(/a/e)      false()
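The conversion rules can be written down as two small Python functions. This is an illustrative sketch, not an XPath engine: the names xpath_boolean and xpath_string are hypothetical, Python lists of ElementTree elements stand in for node-sets, and the sample document is written without whitespace between tags.

```python
import math
import xml.etree.ElementTree as ET

def xpath_boolean(value):
    """Convert a Boolean, number, string or node-set (a list) to a Boolean."""
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        return not (value == 0 or math.isnan(value))  # neither 0 nor NaN
    if isinstance(value, str):
        return len(value) > 0                         # non-empty string
    return len(value) > 0                             # non-empty node-set

def xpath_string(nodes):
    """String value of a node-set: that of its first node, or '' if empty."""
    return "".join(nodes[0].itertext()) if nodes else ""

a = ET.fromstring('<a toto="3"><b titi="tutu"><c/></b><d>tata</d></a>')

print(xpath_string([a]))              # tata
print(a.get("toto"))                  # 3
print(xpath_boolean(a.findall("b")))  # True
print(xpath_boolean(a.findall("e")))  # False
```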

Operators

• Node-set operators make it possible to combine node sets to form other node sets.

Node-set Operators   Description
pipe (|)             union of node-sets (Example: node()|@*)
slash (/)            separates location steps
double-slash (//)    abbreviation for the location path /descendant-or-self::node()/
+, -, *, div, mod    standard arithmetic operators
or, and              Boolean operators (Example: @a and c=3)
<, <=, >=, >         relational operators (Example: ($a<2) and ($a>0))

Node-set Functions

• Node-set functions perform an action on a node-set returned by a location path.

Node-set Functions          Description
last()                      Returns a number equal to the context size from the expression evaluation context.
position()                  Returns the position number of the current node in the node-set being tested.
count( node-set )           Returns the number of nodes in node-set.
id( string )                Returns the element node whose ID attribute matches the value specified by argument string.
local-name( node-set )      Returns the local part of the expanded-name for the first node in node-set.
namespace-uri( node-set )   Returns the namespace URI of the expanded-name for the first node in node-set.
name( node-set )            Returns the qualified name for the first node in node-set.

Node-set Functions

//book/author[last()]
• Returns the last author child of each book node (in the bib example: Victor Vianu and Jeffrey D. Ullman)

//book/author[position() = 3] or //book/author[3]
• Selects the third author element of each book node

count(/book/*)
• Returns the total number of element-node children of the book node

//book
• Selects all book element nodes in the document
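The positional predicates can be checked on the bib example with xml.etree.ElementTree, which supports [last()] and [n]. A sketch only: paths are relative to the bib element, and count() is emulated with len() on the list of element children of each book.

```python
import xml.etree.ElementTree as ET

BIB = (
    "<bib>"
    "<book><publisher>Addison-Wesley</publisher>"
    "<author>Serge Abiteboul</author>"
    "<author><first-name>Rick</first-name><last-name>Hull</last-name></author>"
    "<author>Victor Vianu</author>"
    "<title>Foundations of Databases</title><year>1995</year></book>"
    '<book price="55"><publisher>Freeman</publisher>'
    "<author>Jeffrey D. Ullman</author>"
    "<title>Principles of Database and Knowledge Base Systems</title>"
    "<year>1998</year></book>"
    "</bib>"
)
bib = ET.fromstring(BIB)

# //book/author[last()]: the last author child of each book
last_authors = [a.text for a in bib.findall("book/author[last()]")]

# //book/author[3]: the third author child of each book, if there is one
third_authors = [a.text for a in bib.findall("book/author[3]")]

# count(*) evaluated with each book as context node
child_counts = [len(book.findall("*")) for book in bib.findall("book")]

print(last_authors)   # ['Victor Vianu', 'Jeffrey D. Ullman']
print(third_authors)  # ['Victor Vianu']
print(child_counts)   # [6, 4]
```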

String Functions

String Function           Description
concat($s1,...,$sn)       concatenates the strings $s1, ..., $sn
starts-with($a,$b)        returns true() if the string $a starts with $b
contains($a,$b)           returns true() if the string $a contains $b
substring-before($a,$b)   returns the substring of $a before the first occurrence of $b
substring-after($a,$b)    returns the substring of $a after the first occurrence of $b
substring($a,$n,$l)       returns the substring of $a of length $l starting at index $n (indexes start from 1). $l may be omitted
string-length($a)         returns the length of the string $a
normalize-space($a)       removes all leading and trailing whitespace from $a, and collapses each run of whitespace into a single space
translate($a,$b,$c)       returns the string $a, where every occurrence of a character from $b has been replaced by the character at the same position in $c

Boolean and Number Functions

Functions     Description
not($b)       returns the logical negation of the boolean $b
sum($s)       returns the sum of the values of the nodes in the nodeset $s
floor($n)     rounds the number $n to the next lowest integer
ceiling($n)   rounds the number $n to the next greatest integer
round($n)     rounds the number $n to the closest integer

Examples

count(//*) returns the number of elements in the document
normalize-space(' titi toto ') returns the string "titi toto"
translate('baba','abcdef','ABCDEF') returns the string "BABA"
round(3.457) returns the number 3
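For readers who know Python better than XPath, a few of these functions can be approximated as follows. This is a sketch under assumptions: the Python function names are hypothetical, numbers are plain floats, and edge cases such as NaN and infinities are ignored by xpath_round.

```python
import math
import re

def normalize_space(s):
    """Trim s and collapse each run of XML whitespace into one space."""
    return re.sub(r"[ \t\r\n]+", " ", s).strip(" ")

def substring_before(a, b):
    """Part of a before the first occurrence of b ('' if b does not occur)."""
    i = a.find(b)
    return a[:i] if i >= 0 else ""

def substring_after(a, b):
    """Part of a after the first occurrence of b ('' if b does not occur)."""
    i = a.find(b)
    return a[i + len(b):] if i >= 0 else ""

def translate(a, b, c):
    """Replace characters of b by those at the same position in c.
    Characters of b with no counterpart in c are removed."""
    table = {}
    for i, ch in enumerate(b):
        if ord(ch) not in table:  # the first occurrence in b wins
            table[ord(ch)] = c[i] if i < len(c) else None
    return a.translate(table)

def xpath_round(n):
    """Round to the closest integer; ties go towards positive infinity."""
    return math.floor(n + 0.5)

print(normalize_space(" titi   toto "))         # titi toto
print(translate("baba", "abcdef", "ABCDEF"))    # BABA
print(xpath_round(3.457))                       # 3
print(substring_before("CSCO - Cisco", " - "))  # CSCO
print(substring_after("CSCO - Cisco", " - "))   # Cisco
```

Note that Python's built-in round() rounds ties to the nearest even integer, which is why floor(n + 0.5) is used here instead.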

XPath String functions

<?xml version = "1.0"?>
<stocks>
  <stock symbol = "INTC">
    <name>Intel Corporation</name>
  </stock>
  <stock symbol = "CSCO">
    <name>Cisco Systems, Inc.</name>
  </stock>
  <stock symbol = "DELL">
    <name>Dell Computer Corporation</name>
  </stock>
  <stock symbol = "MSFT">
    <name>Microsoft Corporation</name>
  </stock>
  <stock symbol = "SUNW">
    <name>Sun Microsystems, Inc.</name>
  </stock>
  <stock symbol = "CMGI">
    <name>CMGI, Inc.</name>
  </stock>
</stocks>

<?xml version = "1.0"?>
<xsl:stylesheet version = "1.0" xmlns:xsl = "http://www.w3.org/1999/XSL/Transform">
  <xsl:template match = "/stocks">
    <html>
      <body>
        <ul>
          <xsl:for-each select = "stock">
            <xsl:if test = "starts-with(@symbol, 'C')">
              <li>
                <xsl:value-of select = "concat(@symbol,' - ',name)"/>
              </li>
            </xsl:if>
          </xsl:for-each>
        </ul>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
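What the stylesheet computes can be reproduced in a few lines of Python. This sketch does not run XSLT: it re-implements the for-each / starts-with / concat logic by hand with xml.etree.ElementTree, and it yields only the text of the list items, not the surrounding HTML.

```python
import xml.etree.ElementTree as ET

STOCKS = """<stocks>
  <stock symbol="INTC"><name>Intel Corporation</name></stock>
  <stock symbol="CSCO"><name>Cisco Systems, Inc.</name></stock>
  <stock symbol="DELL"><name>Dell Computer Corporation</name></stock>
  <stock symbol="MSFT"><name>Microsoft Corporation</name></stock>
  <stock symbol="SUNW"><name>Sun Microsystems, Inc.</name></stock>
  <stock symbol="CMGI"><name>CMGI, Inc.</name></stock>
</stocks>"""
stocks = ET.fromstring(STOCKS)

items = []
for stock in stocks.findall("stock"):          # xsl:for-each select="stock"
    symbol = stock.get("symbol")
    if symbol.startswith("C"):                 # starts-with(@symbol, 'C')
        name = stock.findtext("name")
        items.append(symbol + " - " + name)    # concat(@symbol, ' - ', name)

print(items)  # ['CSCO - Cisco Systems, Inc.', 'CMGI - CMGI, Inc.']
```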

XPath: Qualifiers

/bib/book/author[first-name]

Result: <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>

/bib/book[@price < "60"]

/bib/book/author[@age < "25"]

/bib/book/author[text()]
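The first two qualifiers can be tried with xml.etree.ElementTree on an abridged copy of the bib example. A sketch with one stated workaround: ElementTree supports existence tests such as [first-name] and [@price] but not numeric comparisons, so the price test is finished in Python.

```python
import xml.etree.ElementTree as ET

# Abridged bib example: only the elements these qualifiers look at.
BIB = (
    "<bib>"
    "<book><author>Serge Abiteboul</author>"
    "<author><first-name>Rick</first-name><last-name>Hull</last-name></author>"
    "<title>Foundations of Databases</title></book>"
    '<book price="55"><author>Jeffrey D. Ullman</author>'
    "<title>Principles of Database and Knowledge Base Systems</title></book>"
    "</bib>"
)
bib = ET.fromstring(BIB)

# /bib/book/author[first-name]: authors that have a first-name child.
# The result is the author element itself, not the first-name element.
with_first_name = bib.findall("book/author[first-name]")

# /bib/book[@price < "60"]: books with a price attribute below 60
cheap = [b for b in bib.findall("book[@price]")
         if float(b.get("price")) < 60]

print([a.findtext("last-name") for a in with_first_name])  # ['Hull']
print([b.findtext("title") for b in cheap])
# ['Principles of Database and Knowledge Base Systems']
```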

XPath Examples

• child::A/descendant::B : B elements, descendant of an A element, itself child of the context node; can be abbreviated to A//B.

• child::*/child::B : all the B grand-children of the context node

• descendant-or-self::B : elements B descendants of the context node, plus the context node itself if its name is B.

• child::B[position()=last()] : the last child named B of the context node. Abbreviated to B[last()].

• following-sibling::B[1] : the first sibling of type B (in the document order) of the context node

XPath Examples

• /descendant::B[10] : the tenth element of type B in the document.
– Not: the tenth element of the document, if its type is B!

• child::B[child::C] : child elements B that have a child element C. Abbreviated to B[C].

• /descendant::B[@att1 or @att2] : elements B that have an attribute att1 or an attribute att2; abbreviated to //B[@att1 or @att2]

• *[self::B or self::C] : children elements named B or C

XPath: Summary

bib                                     matches a bib element
*                                       matches any element
/                                       matches the root
/bib                                    matches a bib element under root
bib/paper                               matches a paper in bib
bib//paper                              matches a paper in bib, at any depth
//paper                                 matches a paper at any depth
paper|book                              matches a paper or a book
@price                                  matches a price attribute
bib/book/@price                         matches price attribute in book, in bib
bib/book[@price<"55"]/author/lastname   matches…

The Root and the Root

• <bib> <paper> 1 </paper> <paper> 2 </paper> </bib>

• bib is the “document element”

• The “root” is above bib

• /bib = returns the document element

• / = returns the root

• Why ? Because we may have comments before and after <bib>; they become siblings of <bib>
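The distinction between the root and the document element can be observed with Python's xml.dom.minidom, whose Document object plays the role of the XPath root. A sketch only: the comments placed before and after bib are added here for illustration and are not part of the slide's example.

```python
from xml.dom import minidom

# The two comments are illustrative additions around the slide's document.
doc = minidom.parseString(
    "<!-- before --><bib><paper>1</paper><paper>2</paper></bib><!-- after -->"
)

# "/" corresponds to the Document node: bib and the comments are its children.
root_children = [node.nodeName for node in doc.childNodes]

# "/bib" corresponds to the document element.
document_element = doc.documentElement.tagName

print(root_children)     # ['#comment', 'bib', '#comment']
print(document_element)  # bib
```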

XPath: More Details

• Examples:
– child::author/child::lastname = author/lastname
– child::author/descendant::zip = author//zip
– child::author/parent::* = author/..
– child::author/attribute::age = author/@age

• What does this mean?
– paper/publisher/parent::*/author
– /bib//address[ancestor::book]
– /bib//author/ancestor::*//zip

XPath: Even More Details

• name() = the name of the current node
– /bib//*[name()='book'] same as /bib//book

• What does this mean? /bib//*[ancestor::*[name()!='book']]
– In a different notation: bib.[^book]*._

• Navigation axes give us strictly more power!

XPath 2.0

• An extension of XPath 1.0, backward compatible with XPath 1.0. Main differences:

• Improved data model, tightly associated with XML Schema:
– a new sequence type, representing ordered sets of nodes and/or values, with duplicates allowed;
– XSD types can be used for node tests.

• More powerful: new operators (loops) and better control of the output (limited tree restructuring capabilities)

• Extensible: many new built-in functions; possibility to add user-defined functions.

• XPath 2.0 is also a subset of XQuery 1.0.

Path expressions in XPath 2.0

• New node tests in XPath 2.0:

Node tests              Description
item()                  any node or atomic value
element()               any element (eq. to child::* in XPath 1.0)
element(author)         any element named author
element(*, xs:person)   any element of type xs:person
attribute()             any attribute

• Nested path expressions:
– Any expression that returns a sequence of nodes can be used as a step

/book/(author | editor)/name

XPath 1.0 Implementations

• libxml2: Free C library for parsing XML documents, supporting XPath.
• javax.xml.xpath: Java package, included with JDK versions starting from 1.5.
• System.Xml.XPath: .NET classes for XPath.
• XML::XPath: Free Perl module, includes a command-line tool.
• DOMXPath: PHP class for XPath, included in PHP5.
• PyXML: Free Python library for parsing XML documents, supporting XPath.
