Post on 18-Dec-2015
Reasoning and Identifying Relevant Matches for XML Keyword Search
Ziyang Liu, Yi Chen
Arizona State University
VLDB 2008, Auckland, New Zealand
[Figure: example XML tree]
league
  name: NBA
  founded: 1946
  team
    name: Grizzlies
    founded: 1995
    division: southwest
    arena: FedExForum
    players
      player (name: Gasol, nationality: Spain, position: forward)
      player (name: Miller, nationality: USA, position: guard)
      player (name: Brown, nationality: USA, position: forward)
  team …
  team …
Motivation
Identifying relevant matches is a critical step in processing XML keyword search.
Query: “Gasol, position”
[Figure legend: keyword matches in the tree are marked as relevant matches or irrelevant matches.]
How to Evaluate Various Strategies?
Existing approaches for identifying relevant matches:
XKSearch (SLCA) [Xu and Papakonstantinou 2005]
XRank [Guo et al. 2003]
XSEarch [Cohen et al. 2003]: Star-semantics, All-semantics
Schema-free XQuery (MLCA) [Li et al. 2004]
CVLCA [Li et al. 2007]
How to Evaluate Various Strategies?
The traditional approach
  Obtain ground truth of query results by user studies on a large number of documents and queries.
  Measure the precision and recall of a strategy with respect to the ground truth.
  Costly
An axiomatic approach
  Formalize broad intuitions as a collection of simple axioms and evaluate strategies based on the axioms.
  It has been successful in many areas, e.g. mathematical economics, clustering, location theory, collaborative filtering, etc.
  Cost-effective
Problem: Is it possible to evaluate and reason about XML keyword search strategies in a formal axiomatic framework?
Roadmap
Motivation and Problem Definition
Challenges and Contributions
Four properties that an XML search engine should satisfy
  Query Monotonicity/Consistency
  Data Monotonicity/Consistency
MaxMatch: the first system that satisfies all four properties
Experimental Evaluation
Conclusions
Challenge
It is easy for an individual to assess the relevance of matches.
But it is extremely difficult to formalize the relevance assessment, independently of any query, data, algorithm, and user.
Query: “Gasol, position”
[Figure: the example XML tree, with relevant and irrelevant matches marked.]
Example: Similar Queries
Interestingly, some abnormal behaviors become clearly observable when we examine the results that the same search engine produces for two similar queries, or for one query on two similar documents.
Q1: “Gasol, position”
Q2: “Grizzlies, Gasol, position”
The two “position” nodes that are irrelevant for Q1 (those of the other two players) should still be irrelevant for Q2.
Example: Similar Data
[Figure: the example XML tree, except that player Brown initially has only name and nationality; a position node with value forward is then inserted under Brown.]
Q: “Grizzlies, Gasol, Brown, position”
An empty result after data insertion is abnormal.
How to capture the logical connection between query results?
Contributions of This Work
The first work that formally reasons about keyword search in an axiomatic framework.
We identified four desirable properties that an XML search engine should satisfy.
  Data/Query Monotonicity capture the desirable changes to the number of query results.
  Data/Query Consistency capture the desirable changes to the content of a query result.
We reasoned about existing XML keyword search strategies.
We proposed MaxMatch, the only XML keyword search strategy that possesses all four properties.
Experiments verified our intuition and demonstrated the effectiveness and efficiency of MaxMatch.
Properties wrt Similar Queries
Query Monotonicity: When we add a keyword to the query, the query becomes more restrictive; therefore the number of query results should not increase.
Query Consistency: When we add a new keyword to the query, each delta subtree that newly becomes (part of) a query result should contain the new keyword.
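The consistency condition can be sketched in code. The following Python is a minimal illustration and not the paper's algorithm. It assumes that a query result is a tree given by a `children` map over node ids, that `labels` maps each node to the tag and value keywords it carries, and that a delta subtree is a maximal subtree of the new result none of whose nodes appeared in the old results. All function names and node ids are hypothetical.

```python
def fully_new(node, children, old_nodes):
    """True if node and all its descendants in the result are absent from the old results."""
    return node not in old_nodes and all(
        fully_new(c, children, old_nodes) for c in children.get(node, [])
    )

def delta_roots(root, children, old_nodes):
    """Roots of the maximal fully-new subtrees in the result rooted at root."""
    if fully_new(root, children, old_nodes):
        return [root]
    return [r for c in children.get(root, [])
            for r in delta_roots(c, children, old_nodes)]

def subtree_labels(node, children, labels):
    """All keywords carried by node or any of its descendants in the result."""
    found = set(labels.get(node, ()))
    for c in children.get(node, []):
        found |= subtree_labels(c, children, labels)
    return found

def is_query_consistent(root, children, labels, old_nodes, new_keyword):
    """Every delta subtree must contain the newly added keyword."""
    return all(
        new_keyword in subtree_labels(r, children, labels)
        for r in delta_roots(root, children, old_nodes)
    )

# Assumed result of Q1 = "Gasol, position": Gasol's player subtree.
old_nodes = {"player1", "name1", "position1"}

labels = {
    "team": {"team"}, "team_name": {"name", "Grizzlies"}, "players": {"players"},
    "player1": {"player"}, "name1": {"name", "Gasol"},
    "position1": {"position", "forward"},
    "player2": {"player"}, "position2": {"position", "guard"},
}

# Q2 = "Grizzlies, Gasol, position": an engine that adds only the team name.
good = {"team": ["team_name", "players"], "players": ["player1"],
        "player1": ["name1", "position1"]}
# An engine that also pulls in Miller's position node.
bad = {"team": ["team_name", "players"], "players": ["player1", "player2"],
       "player1": ["name1", "position1"], "player2": ["position2"]}

print(is_query_consistent("team", good, labels, old_nodes, "Grizzlies"))  # True
print(is_query_consistent("team", bad, labels, old_nodes, "Grizzlies"))   # False
```

In the second result the subtree rooted at `player2` is new but does not contain “Grizzlies”, which is what the check rejects.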
Example: Query Monotonicity/Consistency
Q1: “forward, name”
Q2: “forward, USA, name” (new keyword: “USA”)
Monotonicity: the number of query results decreases from 2 to 1.
Consistency: in each result, the delta subtree (if any) contains “USA”.
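The result counts in this example can be reproduced with a small sketch. The Python below assumes SLCA semantics (one of the existing strategies listed earlier), assumes that a keyword matches a tag name or a text value exactly, and models only the Grizzlies team subtree of the example tree. `Node`, `keywords_in`, and `slca` are hypothetical helper names; this is an illustration rather than any system's actual implementation.

```python
class Node:
    def __init__(self, tag, text=None, children=()):
        self.tag = tag
        self.text = text
        self.children = list(children)

def player(name, nationality, position):
    return Node("player", children=[
        Node("name", name),
        Node("nationality", nationality),
        Node("position", position),
    ])

team = Node("team", children=[
    Node("name", "Grizzlies"),
    Node("founded", "1995"),
    Node("division", "southwest"),
    Node("arena", "FedExForum"),
    Node("players", children=[
        player("Gasol", "Spain", "forward"),
        player("Miller", "USA", "guard"),
        player("Brown", "USA", "forward"),
    ]),
])

def keywords_in(node):
    """All tag names and text values in the subtree rooted at node."""
    found = {node.tag}
    if node.text is not None:
        found.add(node.text)
    for child in node.children:
        found |= keywords_in(child)
    return found

def slca(node, query):
    """Nodes whose subtree contains every keyword while no descendant's subtree does."""
    if not set(query) <= keywords_in(node):
        return []
    deeper = [hit for child in node.children for hit in slca(child, query)]
    return deeper or [node]

r1 = slca(team, ["forward", "name"])         # Q1
r2 = slca(team, ["forward", "USA", "name"])  # Q2 adds "USA"
print(len(r1), len(r2))  # 2 1
```

Under these assumptions Q1 returns the player subtrees of Gasol and Brown, and Q2 returns only Brown's, so the count does not increase when the keyword is added.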
Example Revisited: Violation of Query Consistency
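Q1: “Gasol, position”
Q2: “Grizzlies, Gasol, position”
An XML keyword search engine that considers the other players' “position” nodes relevant for the new query Q2 violates query consistency, because the subtrees they add to the result do not contain the new keyword “Grizzlies”.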
Properties wrt Similar Data
Data Monotonicity: When we add a node to the data, the data content becomes richer, and the number of query results should not decrease.
Data Consistency: After we add a node to the data, each delta subtree that becomes (part of) a query result should contain the newly inserted node.
Example: Data Monotonicity/Consistency
[Figure: the example XML tree in which player Brown initially has only name and nationality; a position node with value forward (a new match) is inserted under Brown.]
Q: “forward, name”
Monotonicity: the number of query results increases from 1 to 2.
Consistency: in each result, the delta subtree (if any) contains the new data node.
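The effect of the insertion can be traced with a short sketch. As an assumption, the Python below again uses SLCA semantics with exact matching on tag names and text values, and models only the players subtree. The helper names (`node`, `player`, `terms`, `slca`) are hypothetical and the code is illustrative only.

```python
def node(tag, text=None, *children):
    return {"tag": tag, "text": text, "children": list(children)}

def player(name, nationality, position=None):
    p = node("player", None, node("name", name), node("nationality", nationality))
    if position is not None:
        p["children"].append(node("position", position))
    return p

def terms(n):
    """Tag names and text values found in the subtree rooted at n."""
    found = {n["tag"], n["text"]} - {None}
    for c in n["children"]:
        found |= terms(c)
    return found

def slca(n, query):
    """Nodes whose subtree contains every keyword while no descendant's subtree does."""
    if not set(query) <= terms(n):
        return []
    deeper = [hit for c in n["children"] for hit in slca(c, query)]
    return deeper or [n]

brown = player("Brown", "USA")  # Brown has no position node yet
players = node("players", None,
               player("Gasol", "Spain", "forward"),
               player("Miller", "USA", "guard"),
               brown)

query = ["forward", "name"]
before = slca(players, query)

inserted = node("position", "forward")  # the newly inserted data node
brown["children"].append(inserted)
after = slca(players, query)

print(len(before), len(after))  # 1 2
```

The only new result is Brown's player subtree, and it contains the inserted node, which is the behavior both data properties ask for.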
Example Revisited: Violation of Data Monotonicity
[Figure: the example XML tree in which a position node with value forward is inserted under player Brown.]
Q: “Grizzlies, Gasol, Brown, position”
An XML keyword search engine that outputs an empty result on the updated data violates data monotonicity.
The Proposed Axiomatic Framework
Four desirable properties: Query Monotonicity, Query Consistency, Data Monotonicity, Data Consistency
These properties are:
  Non-trivial: No prior XML keyword search system satisfies all of them.
  Non-redundant: An algorithm may violate any one of them while satisfying the others.
  Satisfiable: We propose a novel technique, MaxMatch, that satisfies all four properties.
MaxMatch
MaxMatch’s name comes from “Maximal Match”.
MaxMatch preserves each subtree whose set of descendant keyword matches is “maximal” among its siblings.
Intuitively, the subtrees that are removed are strictly less relevant to the query since fewer keywords are contained.
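The pruning rule stated above can be sketched as follows. This Python is a simplified illustration, not the algorithm from the paper. It assumes the root of the result (here the team node) has already been chosen, reads “maximal” as “not a strict subset of any sibling's keyword set”, matches keywords exactly against tag names and text values, and also drops subtrees that contain no match at all. `Node`, `d_match`, and `prune` are hypothetical names.

```python
class Node:
    def __init__(self, tag, text=None, children=()):
        self.tag = tag
        self.text = text
        self.children = list(children)

def player(name, nationality, position):
    return Node("player", children=[
        Node("name", name),
        Node("nationality", nationality),
        Node("position", position),
    ])

team = Node("team", children=[
    Node("name", "Grizzlies"),
    Node("founded", "1995"),
    Node("division", "southwest"),
    Node("arena", "FedExForum"),
    Node("players", children=[
        player("Gasol", "Spain", "forward"),
        player("Miller", "USA", "guard"),
        player("Brown", "USA", "forward"),
    ]),
])

def d_match(node, query):
    """Query keywords matched by the tag or text of node or of any descendant."""
    hits = {k for k in query if k in (node.tag, node.text)}
    for child in node.children:
        hits |= d_match(child, query)
    return hits

def prune(node, query):
    """Copy node's subtree, keeping only children whose match set is maximal among siblings."""
    sets = [d_match(c, query) for c in node.children]
    kept = []
    for child, s in zip(node.children, sets):
        dominated = any(s < other for other in sets)  # strict subset of a sibling's set
        if s and not dominated:  # simplification: match-free subtrees are dropped too
            kept.append(prune(child, query))
    return Node(node.tag, node.text, kept)

query = ["Grizzlies", "Gasol", "Brown", "position"]
pruned = prune(team, query)
players = pruned.children[-1]
kept_names = [p.children[0].text for p in players.children]
print(kept_names)  # ['Gasol', 'Brown']
```

Miller's subtree matches only “position”, a strict subset of what Gasol's and Brown's subtrees match, so it is the one removed.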
Q: “Grizzlies, Gasol, Brown, position”
The subtree of player Miller matches only “position”, so it is not as informative as its siblings and is discarded.
MaxMatch satisfies all four properties.
Proof details and algorithms can be found in the paper.
Search Quality
Data sets: Baseball, Mondial
Query set: 36 queries in total
Ground truth: obtained by user study.
User perception of search results on query pairs and document pairs confirms our intuition of the proposed properties.
[Figure: F-measure of MaxMatch vs. existing approaches]
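For reference, the F-measure combines precision and recall against the ground truth. The sketch below is a generic Python computation, assuming the balanced setting beta = 1 (the slides do not state which weighting was used). The node labels in the usage lines are made up for illustration and are not data from the experiments.

```python
def f_measure(returned, relevant, beta=1.0):
    """F-measure of a returned set of matches against a ground-truth set."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(returned)  # fraction of returned matches that are relevant
    recall = hits / len(relevant)     # fraction of relevant matches that are returned
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical example: three returned matches, two of them relevant.
returned = {"Gasol/name", "Gasol/position", "Miller/position"}
relevant = {"Gasol/name", "Gasol/position"}
score = f_measure(returned, relevant)
print(round(score, 3))  # 0.8
```

Here precision is 2/3 and recall is 1, so returning an irrelevant match lowers the score even though nothing relevant was missed.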
Processing Time
[Figures: processing time on Mondial data (515KB) and on Baseball data (1014KB)]
Conclusions
This is the first work on reasoning about and evaluating XML keyword search strategies using a formal axiomatic framework.
Four intuitive and elegant properties are proposed: query monotonicity/consistency, data monotonicity/consistency.
We designed and developed MaxMatch - the only XML keyword search strategy that satisfies all properties.
Experiments verified the intuition of the properties and the effectiveness and efficiency of MaxMatch.
MaxMatch is incorporated as part of XSeek [Liu & Chen, SIGMOD 07].