Intensional Associations in Dataspaces
Marcos Vaz Salles
Cornell University
Jens Dittrich
Saarland University
Lukas Blunschi
ETH Zurich
ICDE 2010
Irrelevant results that sound like Kevin Spacey
Potentially relevant results
What is missing?
Potentially relevant results: items connected to Kevin Spacey by relationships
• Movies in Common: colleagues who acted together with Kevin Spacey
– Samuel L. Jackson (37), Tom Hanks (34), Robin Williams (34), Dustin Hoffman (34), Morgan Freeman (32)
• Same Last Name: other members of the Spacey family in the trade
– John Graham Spacey (great-great-uncle!)
• Same Place of Birth: other folks from NJ in the trade
– Zach Braff, Adam Horovitz, Andrew Shue, Joel Silver, Craig Kingsbury, Joseph Kraft, Drew Rosenhaus, Lauryn Hill, Stacey Kent
The Problem
• Keywords are not enough
– If an item is not tagged, it is not returned
– No meaningful definition of relatedness
• Relationships essential, but hard to get right
– Searches do not include related items
– Adding relationships to search queries hurts response time
– The more flexible the definition of relatedness, the higher the cost
Our Solution
• Keywords are not enough
– Declarative mini-language to define intensional associations
• Relationships essential, but hard to get right
– Special class of neighborhood-enriched search queries over virtual associations
– New index structure for neighborhood searches to process these queries efficiently
Association Trails
A := QL ⟹θ(L,R) QR
• QL, QR: search queries that select elements in the dataspace
• θ(L, R): join predicate that relates elements from the left with elements from the right
• Meaning: every element from the query on the left has an intensional edge to θ-matching elements from the query on the right
• Example: Actors in the same movies
moviesInCommon := //person[type=“actor”] ⟹θ1 //person[type=“actor”],
θ1 = (∃ m1 ∈ L/movies : m1 ∈ R/movies)
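The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: items are plain dictionaries, the two queries are modeled as boolean filters rather than path expressions, self-edges are skipped, and the names `AssociationTrail`, `DATASPACE`, and `is_actor` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Item = dict  # hypothetical item representation: one dict per dataspace element


@dataclass(frozen=True)
class AssociationTrail:
    """A trail: left query, right query, and join predicate theta(L, R)."""
    name: str
    query_left: Callable[[Item], bool]
    query_right: Callable[[Item], bool]
    theta: Callable[[Item, Item], bool]

    def edges(self, dataspace: List[Item]) -> List[Tuple[str, str]]:
        """Enumerate the intensional edges the trail defines (self-edges skipped by assumption)."""
        left = [x for x in dataspace if self.query_left(x)]
        right = [x for x in dataspace if self.query_right(x)]
        return [(l["name"], r["name"])
                for l in left for r in right
                if l is not r and self.theta(l, r)]


def is_actor(item: Item) -> bool:
    # Stands in for the query //person[type="actor"]
    return item.get("type") == "actor"


# Toy data, invented for illustration only.
DATASPACE = [
    {"name": "a1", "type": "actor", "movies": ["m1", "m2"]},
    {"name": "a2", "type": "actor", "movies": ["m2"]},
    {"name": "a3", "type": "actor", "movies": ["m3"]},
    {"name": "d1", "type": "director", "movies": ["m1"]},
]

# theta1: some movie of L is also a movie of R.
movies_in_common = AssociationTrail(
    name="moviesInCommon",
    query_left=is_actor,
    query_right=is_actor,
    theta=lambda l, r: bool(set(l["movies"]) & set(r["movies"])),
)

print(movies_in_common.edges(DATASPACE))
```

On this toy data, only `a1` and `a2` share a movie, so the trail yields one edge in each direction; `d1` is filtered out by the queries even though it shares `m1` with `a1`.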
Neighborhood Search Queries
• Combine search with pre-defined joins in association trails to get related items
• Examples:
– Search for “kevin spacey” also returns colleagues who acted together, other family members, etc.
– Search for “actors who won the Oscar” also returns other actors strongly related to this set by virtual associations
[Figure: search results and the related items reached from them]
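A neighborhood search can be illustrated as a keyword search followed by a join against each trail predicate. The sketch below is a deliberately simple version that evaluates the joins at query time; the item fields (`last`, `pob`), the function names, and the toy people are all hypothetical.

```python
def keyword_search(items, keyword):
    """Plain keyword match on the item name (stand-in for the original search)."""
    kw = keyword.lower()
    return [x for x in items if kw in x["name"].lower()]


def neighborhood_search(items, keyword, trails):
    """Return the search results plus related items, labeled by the trail that links them."""
    results = keyword_search(items, keyword)
    result_ids = {id(x) for x in results}
    related = {}
    for trail_name, theta in trails.items():
        for l in results:
            for r in items:
                # Related items are those outside the result set that theta-match a result.
                if id(r) not in result_ids and theta(l, r):
                    related.setdefault(r["name"], set()).add(trail_name)
    return [x["name"] for x in results], related


# Toy data, invented for illustration only.
ITEMS = [
    {"name": "kevin example", "last": "example", "pob": "NJ"},
    {"name": "john example", "last": "example", "pob": "TX"},
    {"name": "zoe sample", "last": "sample", "pob": "NJ"},
    {"name": "ann other", "last": "other", "pob": "CA"},
]

TRAILS = {
    "sameLastName": lambda l, r: l["last"] == r["last"],
    "samePlaceOfBirth": lambda l, r: l["pob"] == r["pob"],
}

results, related = neighborhood_search(ITEMS, "kevin", TRAILS)
print(results)
print(related)
```

The nested loops make the cost of joining at query time visible: every result is compared with every item for every trail, which is the runtime cost that indexing is meant to avoid.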
Query Processing over Association Trails
• Intuition: Index at association trail definition time to avoid costly joins at runtime
• Naive Approach
– Materialize all association trails into join index
– Probe join index to get related items
– Given m association trails and n items, index size is worst-case O(mn²)
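To see where the quadratic bound comes from, here is a minimal sketch of a materialized join index, assuming items are addressed by list position and each trail is a plain predicate. The function names are hypothetical and the layout is not taken from the paper.

```python
from collections import defaultdict


def build_naive_index(items, trails):
    """Materialize every (trail, left item) -> {right items} edge set."""
    index = defaultdict(set)
    for trail_name, theta in trails.items():
        for i, l in enumerate(items):
            for j, r in enumerate(items):
                if i != j and theta(l, r):
                    index[(trail_name, i)].add(j)
    return index


def index_size(index):
    """Number of stored edges."""
    return sum(len(targets) for targets in index.values())


def probe(index, trail_name, i):
    """Look up the related items of item i under one trail."""
    return index.get((trail_name, i), set())


# Worst case for an equi-join: all n items share the same join value.
n = 5
items = [{"pob": "NJ"} for _ in range(n)]
trails = {"samePlaceOfBirth": lambda l, r: l["pob"] == r["pob"]}

naive = build_naive_index(items, trails)
print(index_size(naive))  # n * (n - 1) edges for a single trail
```

With one trail and five items that all match each other, the index stores 5 × 4 = 20 edges; with m such trails the count grows to m · n · (n − 1), which matches the O(mn²) worst case stated above.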
Grouping-Compressed Index (GCI)
• Still materializes join, but in compressed form
• Takes advantage of redundancy in join output
– O(mn) worst-case on equi-joins
• Intuition: for each clique, only represent the pivot, the edges from the pivot, and the elements in the clique
– samePlaceOfBirth: θ(L, R) = (L.placeOfBirth = R.placeOfBirth)
[Figure: items grouped into cliques by place of birth (NJ, CA)]
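The compression idea for equi-joins can be sketched by grouping items on their join value, so that each clique is stored once as a list of members instead of as all pairwise edges. This is an illustration only: it assumes the join value itself plays the role of the pivot, and the function names are hypothetical rather than the paper's actual structure.

```python
from collections import defaultdict


def build_grouped_index(items, key_fns):
    """For each equi-join trail, map join value (pivot) -> list of member item positions."""
    index = {}
    for trail_name, key_fn in key_fns.items():
        groups = defaultdict(list)
        for i, item in enumerate(items):
            groups[key_fn(item)].append(i)
        index[trail_name] = dict(groups)
    return index


def grouped_size(index):
    """Entries stored: one per (trail, item), i.e. at most m * n."""
    return sum(len(members) for groups in index.values() for members in groups.values())


def materialized_edges(index):
    """Edges a fully materialized join would need for the same cliques."""
    return sum(len(members) * (len(members) - 1)
               for groups in index.values() for members in groups.values())


# Seven items with the place-of-birth labels from the slide's figure: four NJ, three CA.
items = [{"pob": p} for p in ["NJ", "NJ", "CA", "CA", "CA", "NJ", "NJ"]]
gci = build_grouped_index(items, {"samePlaceOfBirth": lambda x: x["pob"]})

print(gci["samePlaceOfBirth"])
print(grouped_size(gci), materialized_edges(gci))
```

For these seven items the grouped form stores 7 entries, while materializing the same two cliques would take 4 × 3 + 3 × 2 = 18 directed edges. The gap widens as cliques grow, which is the redundancy the slide refers to.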
Grouping-Compressed Index (GCI)
• Technical challenge is to answer neighborhood queries without decompressing
• Intuition: probe each pivot only once, even when several search results fall in the same clique
– Search: actors who won the Oscar
– samePlaceOfBirth: θ(L, R) = (L.placeOfBirth = R.placeOfBirth)
[Figure: search results among the NJ and CA cliques; each pivot is probed only once]
• Details in the full version of the paper
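The "probe pivot only once" intuition can be sketched as follows: collect the distinct pivots of the search results first, then read each clique a single time. The layout (`groups`, `key_of`) and the function name are assumptions made for illustration; the paper's full version describes the actual algorithm.

```python
def related_via_groups(groups, key_of, result_ids):
    """Find items related to the search results without expanding cliques into edges.

    groups: pivot (join value) -> list of member item positions
    key_of: item position -> pivot
    Returns (related item positions, number of pivot probes).
    """
    pivots = {key_of[i] for i in result_ids}  # deduplicate pivots across results
    related = set()
    probes = 0
    for pivot in pivots:
        probes += 1  # each clique is read once, however many results it contains
        related.update(groups[pivot])
    return related - set(result_ids), probes


# Grouped samePlaceOfBirth data for seven toy items (four NJ, three CA).
groups = {"NJ": [0, 1, 5, 6], "CA": [2, 3, 4]}
key_of = {i: pivot for pivot, members in groups.items() for i in members}

# Suppose the original search returned items 0, 1 (both NJ) and 2 (CA).
related, probes = related_via_groups(groups, key_of, [0, 1, 2])
print(sorted(related), probes)
```

Three search results touch only two cliques, so two probes suffice, whereas an uncompressed join index would be probed once per result.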
Experiments with IMDb Dataset
• Dataset:
– IMDb biographies and filmographies
– ~2M people, ~1.5M movies
• Queries:
– Original search returns a subset of people
– Neighborhood processing includes all people related to original set through association trails
• Association trails: moviesInCommon, samePlaceOfBirth, sameHeight, sameLastName, sameBirthdate
Experiments with IMDb Dataset
• Indexing: gains of over an order of magnitude
• Querying:
– Naive method very sensitive to selectivity
– Querying the compressed index is comparable to querying the uncompressed one at high selectivity
Related Work
• Neighborhood queries in dataspaces / IR: Dong & Halevy [SIGMOD 2007], Carmel et al. [SIGIR 2003]
• Intensional Associations: Srivastava & Velegrakis [SIGMOD 2007]
• Graph Indexing: Trissl and Leser [SIGMOD 2007], Neumann & Weikum [VLDB 2008], Weiss et al. [VLDB 2008], XML
• Recursive Queries: Declarative Networking & Datalog [SIGMOD 2006]
Conclusion
• Association Trails
– Declarative mini-language to specify intensional associations in dataspaces
• Neighborhood Search Queries
– Return associated items along with search results
– Search combined with joins
• Grouping-Compressed Index (GCI)
– Efficient scheme to index intensional associations and process neighborhood search queries
Thank you!
Backup Slides
Association Trail Examples
• Actors in the same movies
moviesInCommon := //person[type=“actor”] ⟹θ1 //person[type=“actor”],
θ1 = (∃ m1 ∈ L/movies : m1 ∈ R/movies)
• Actors born in same place
samePOB := //person[type=“actor”] ⟹θ2 //person[type=“actor”],
θ2 = (L.placeOfBirth = R.placeOfBirth)