Predicate-based Indexing of Annotated Data
Donald Kossmann, ETH Zurich
http://www.dbis.ethz.ch
Observations
• Data is annotated by apps and humans
– Word: versions, comments, references, layout, ...
– humans: tags on del.icio.us
• Applications provide views on the data
– Word: Version 1.3, text without comments, …
– del.icio.us: Provides search on tags + data
• Search Engines see and index the raw data
– E.g., treat all versions as one document
• Search Engine's view != User's view
– Search Engine returns the wrong results
– or hard-code the right search logic (del.icio.us)
Desktop Search
[Figure: the User sends a query to the Desktop Search Engine (e.g., Spotlight, Google Desktop, …), which crawls and indexes the File System (e.g., x.doc, y.xls, …). The Application (e.g., Word, Wiki, Outlook, …) reads and updates the File System and presents Views to the User.]
Example 1: Bulk Letters
Raw data (x.doc, y.xls):
x.doc: <address/> Dear <recipient/>, The meeting is at 12. CU, Donald
y.xls: Peter …, Paul …, Mary …
View:
… Dear Peter, The meeting is at 12. CU, Donald
… Dear Paul, The meeting is at 12. CU, Donald
…
Example 1: Traditional Search Engines
DocId Keyword …
x.doc Dear …
x.doc Meeting …
y.xls Peter …
y.xls Paul …
y.xls Mary …
… … …
Inverted File
Query: Paul, meeting; Answer: -; Correct Answer: x.doc
Query: Paul, Mary; Answer: y.xls; Correct Answer: x.doc
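The two wrong answers above can be reproduced with a minimal sketch of a traditional inverted file built over the raw files. The file contents below are simplified stand-ins for x.doc and y.xls (assumptions for illustration), and `build_index` and `search` are hypothetical helper names, not part of any real search engine.

```python
import re
from collections import defaultdict

# Simplified raw contents (assumption): the template letter and the recipient list.
RAW_FILES = {
    "x.doc": "Dear The meeting is at 12 CU Donald",
    "y.xls": "Peter Paul Mary",
}

def build_index(files):
    """Map each lower-cased keyword to the set of raw files that contain it."""
    index = defaultdict(set)
    for doc_id, text in files.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index

def search(index, keywords):
    """Conjunctive keyword query: return files containing every keyword."""
    postings = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*postings) if postings else set()

index = build_index(RAW_FILES)
print(search(index, ["Paul", "meeting"]))  # no raw file contains both keywords
print(search(index, ["Paul", "Mary"]))     # only the recipient list matches
```

Because the index only sees raw files, "Paul" and "meeting" never co-occur, although the user's view of x.doc contains a letter with both.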
Example 2: Versioning (OpenOffice)
Raw Data:
<deleted id="1"><info date="5/15/2006"/></deleted>
<inserted id="2"><info date="5/15/2006"/></inserted>
<delete id="1">Mickey likes Minnie</delete>
<insert id="2">Donald likes Daisy</insert>
Instance 1 (Version 1): Mickey likes Minnie
Instance 2 (Version 2): Donald likes Daisy
Example 2: Versioning (OpenOffice)
DocId Keyword …
z.swx Mickey …
z.swx likes …
z.swx Minnie …
z.swx Donald …
z.swx Daisy …
Inverted File
Query: Mickey likes Daisy; Answer: z.swx; Correct Answer: -
Query: Mickey likes Minnie; Answer: z.swx; Correct Answer: z.swx (V1)
Example 3: Personalization, Localization, Authorization
Raw Data:
<header>
  <data row="duck" id="man">Donald</data>
  <data row="duck" id="woman">Daisy</data>
  <data row="mouse" id="man">Mickey</data>
  <data row="mouse" id="woman">Minnie</data>
</header>
<body> <field id="man"/> likes <field id="woman"/>.</body>
Instances:
Donald likes Daisy.
Mickey likes Minnie.
Raw text: Donald Daisy Mickey Minnie likes .
Example 4: del.icio.us
• Query: "Joe, software, Yahoo"
– both A and B are relevant, but in different worlds
– if context info available, choice is possible
user tag URL
Joe business A
Mary software B
Tag Table
Yahoo builds software.
Joe is a programmer at Yahoo.
http://A.com
http://B.com
Example 5: Enterprise Search
• Web Applications
– Application defined using "templates" (e.g., JSP)
– Data both in JSP pages and database
– Content = JSP + Java + Database
– Content depends on Context (roles, workflow)
– Links = URL + function + parameters + context
• Enterprise Search
– Search: Map Content to Link
– Enterprise Search: Content and Link are complex
• Example: Search Engine for J2EE PetStore
– (see demo at CIDR 2007)
Possible Solutions
• Extend Applications with Search Capabilities
– Re-invents the wheel for each application
– Not worth the effort for small apps
– No support for cross-app search
• Extend Search Engines
– Application-specific rules for "encoded" data
– "Possible Worlds" Semantics of Data
– Materialize view, normalize view
– Index normalized view
– Extended query processing
– Challenge: Views become huge!
[Figure: the desktop search diagram again (Application, File System, Desktop Search Engine, User; crawl & index, read & update, query), now with rules from which the Desktop Search Engine derives its own Views.]
Size of Views
• One rule: size of view grows linearly with size of document
– E.g., for each version, one instance in view
– Constant can be high! (e.g., many versions)
• Several rules: size of view grows exponentially with number of rules
– E.g., #versions x #alternatives
• Prototype, experiments: Wiki, Office, E-Mail…
– About 30 rules; 5-6 applicable per document
– View ~ 1000 x Raw data
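The multiplicative growth can be illustrated with a small sketch that enumerates the instances ("possible worlds") a fully materialized view would have to store. The rule names and their alternatives are made up for illustration; only the "#versions x #alternatives" relationship comes from the slide.

```python
from itertools import product
from math import prod

# Hypothetical rules applicable to one document, each with its alternatives.
ALTERNATIVES = {
    "version": ["v1", "v2", "v3"],
    "recipient": ["Peter", "Paul"],
}

def count_instances(alternatives_per_rule):
    """Number of instances in a materialized view: product over all rules."""
    return prod(len(values) for values in alternatives_per_rule.values())

def enumerate_worlds(alternatives_per_rule):
    """List every combination of one alternative per rule."""
    rules = list(alternatives_per_rule)
    return [dict(zip(rules, combo))
            for combo in product(*alternatives_per_rule.values())]

print(count_instances(ALTERNATIVES))    # 3 versions x 2 recipients
for world in enumerate_worlds(ALTERNATIVES):
    print(world)
```

Each additional applicable rule multiplies the count by its number of alternatives, which is why materializing every instance quickly becomes impractical.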
Solution Architecture
Rules and Patterns
• Analogy: Operators of relational algebra
• Patterns sufficient for LaTeX, MS Office, OpenOffice, TWiki, E-Mail (Outlook)
Normalized View
Raw Data:
<header>
  <data row="duck" id="man">Donald</data>
  <data row="duck" id="woman">Daisy</data>
  <data row="mouse" id="man">Mickey</data>
  <data row="mouse" id="woman">Minnie</data>
</header>
<body> <field id="man"/> likes <field id="woman"/>.</body>
Rule:
<field match="//field" ref="//data[@id=$m/@id]/text()" key="$r/../@row" />
Normalized View:
<body> <select pred="R1=duck">Donald</select> <select pred="R1=mouse">Mickey</select> likes <select pred="R1=duck">Daisy</select> <select pred="R1=mouse">Minnie</select>.</body>
Normalized View
Raw Data:
<inserted id="1"><info date="5/1/2006"/></inserted>
<inserted id="2"><info date="5/16/2006"/></inserted>
Mickey <insert id="1">Mouse</insert> likes Minnie <insert id="2">Mouse</insert>.
Rule:
<version match="//insert" key="//inserted[@id eq $m/@id]/info/@date" />
Normalized View:
Mickey <select pred="R2>=5/1/2006">Mouse</select> likes Minnie <select pred="R2>=5/16/2006">Mouse</select>.
General Idea:
• Factor out common parts: "Mickey likes Minnie."
• Markup variable parts: <select …/>, <select …/>
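To make the role of the <select> markup concrete, here is a minimal sketch that turns a normalized view back into one instance by fixing a value for each rule. It handles only equality predicates such as R1=duck; the `holds` and `instantiate` helpers and the whitespace-trimmed view string are assumptions for illustration, not the prototype's code.

```python
import xml.etree.ElementTree as ET

# Normalized view of the duck/mouse example, with inter-element whitespace trimmed.
NORMALIZED_VIEW = (
    '<body>'
    '<select pred="R1=duck">Donald</select>'
    '<select pred="R1=mouse">Mickey</select>'
    ' likes '
    '<select pred="R1=duck">Daisy</select>'
    '<select pred="R1=mouse">Minnie</select>'
    '.</body>'
)

def holds(pred, world):
    """Evaluate an equality predicate like 'R1=duck' against a rule binding."""
    rule, value = pred.split("=", 1)
    return world.get(rule) == value

def instantiate(view_xml, world):
    """Keep the common text plus every <select> whose predicate holds."""
    root = ET.fromstring(view_xml)
    parts = [root.text or ""]
    for child in root:
        if child.tag == "select" and holds(child.get("pred"), world):
            parts.append(child.text or "")
        parts.append(child.tail or "")  # text after the element is common text
    return "".join(parts)

print(instantiate(NORMALIZED_VIEW, {"R1": "duck"}))   # Donald likes Daisy.
print(instantiate(NORMALIZED_VIEW, {"R1": "mouse"}))  # Mickey likes Minnie.
```

The common part is stored once, and each variable part carries the condition under which it is visible.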
Normalization Algorithm
• Step 1: Construct Tagging Table
– Evaluate "match" expression
– Evaluate "key" expression
– Compute Operator from Pattern (e.g., > for version)
• Step 2: Tagging Table -> Normalized View
– Wrap each match in <select> tags
Rule NodeId Key Value Op
R1 19 duck =
R1 19 mouse =
R1 22 duck =
R1 22 mouse =
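The two steps can be sketched for the field rule (R1) as follows. This is an illustration under stated assumptions: the "match", "ref" and "key" XQuery expressions are hand-coded in Python rather than evaluated by an XQuery engine, the <doc> wrapper element is added only to make the raw data a single well-formed document, and node ids are simple positions rather than the ids 19 and 22 shown in the table.

```python
import xml.etree.ElementTree as ET

RAW = """<doc>
<header>
  <data row="duck" id="man">Donald</data>
  <data row="duck" id="woman">Daisy</data>
  <data row="mouse" id="man">Mickey</data>
  <data row="mouse" id="woman">Minnie</data>
</header>
<body><field id="man"/> likes <field id="woman"/>.</body>
</doc>"""

def build_tagging_table(root, rule="R1"):
    """Step 1: one row (rule, node_id, key, value, op) per match and key value."""
    table = []
    for node_id, field in enumerate(root.iter("field")):   # match: //field
        for data in root.iter("data"):                      # ref: //data[@id=$m/@id]
            if data.get("id") == field.get("id"):
                # key: the row attribute of the referenced data element
                table.append((rule, node_id, data.get("row"), data.text, "="))
    return table

def normalize(root, table):
    """Step 2: replace each matched node by one <select> per tagging-table row."""
    body = root.find("body")
    out = [body.text or ""]
    for node_id, field in enumerate(body.iter("field")):
        for rule, n, key, value, op in table:
            if n == node_id:
                out.append(f'<select pred="{rule}{op}{key}">{value}</select>')
        out.append(field.tail or "")
    return "<body>" + "".join(out) + "</body>"

root = ET.fromstring(RAW)
table = build_tagging_table(root)
view = normalize(root, table)
print(table)
print(view)
```

The resulting table has two rows per matched <field> node, one for each key value (duck, mouse), mirroring the shape of the tagging table above.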
Predicate-based Indexing
Normalized View:
<body> <select pred="R1=duck">Donald</select> <select pred="R1=mouse">Mickey</select> likes <select pred="R1=duck">Daisy</select> <select pred="R1=mouse">Minnie</select>.</body>
Inverted File:
DocId Keyword Condition
z.swx Donald R1=duck
z.swx Mickey R1=mouse
z.swx likes true
z.swx Daisy R1=duck
z.swx Minnie R1=mouse
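A minimal sketch of how such a predicate-based inverted file could be derived from the normalized view: keywords inside a <select> inherit its predicate, and all other keywords get the condition true. The `index_view` helper is hypothetical and assumes that <select> elements are not nested.

```python
import re
import xml.etree.ElementTree as ET

NORMALIZED_VIEW = (
    '<body> <select pred="R1=duck">Donald</select> '
    '<select pred="R1=mouse">Mickey</select> likes '
    '<select pred="R1=duck">Daisy</select> '
    '<select pred="R1=mouse">Minnie</select>.</body>'
)

def index_view(doc_id, view_xml):
    """Return (doc_id, keyword, condition) postings for one normalized view."""
    root = ET.fromstring(view_xml)
    postings = []

    def add(text, condition):
        for word in re.findall(r"\w+", text or ""):
            postings.append((doc_id, word, condition))

    add(root.text, "true")                       # common text before the first element
    for child in root:
        add(child.text, child.get("pred", "true"))  # variable part: guarded by pred
        add(child.tail, "true")                  # common text after the element
    return postings

for row in index_view("z.swx", NORMALIZED_VIEW):
    print(*row)
```

Each keyword is indexed once per document and condition, rather than once per materialized instance.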
Query Processing
Donald likes Minnie: R1=duck ^ true ^ R1=mouse -> false
Donald likes Daisy: R1=duck ^ true ^ R1=duck -> R1=duck
DocId Keyword Condition
z.swx Donald R1=duck
z.swx Mickey R1=mouse
z.swx likes true
z.swx Daisy R1=duck
z.swx Minnie R1=mouse
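The query evaluation above can be sketched as a conjunction of the conditions attached to the matching postings, discarding combinations that contradict each other. This sketch supports only equality conditions and the literal true; the `conjoin` and `search` names are hypothetical and the actual extended query processor may work differently.

```python
from itertools import product

INVERTED_FILE = [
    ("z.swx", "Donald", "R1=duck"),
    ("z.swx", "Mickey", "R1=mouse"),
    ("z.swx", "likes", "true"),
    ("z.swx", "Daisy", "R1=duck"),
    ("z.swx", "Minnie", "R1=mouse"),
]

def conjoin(conditions):
    """AND equality conditions; return a rule binding, or None if contradictory."""
    binding = {}
    for cond in conditions:
        if cond == "true":
            continue
        rule, value = cond.split("=", 1)
        if binding.setdefault(rule, value) != value:
            return None  # e.g. R1=duck ^ R1=mouse simplifies to false
    return binding

def search(keywords, inverted_file=INVERTED_FILE):
    """Return {doc_id: [binding, ...]} for documents with a satisfiable conjunction."""
    results = {}
    for doc in sorted({d for d, _, _ in inverted_file}):
        per_keyword = [
            [c for d, k, c in inverted_file if d == doc and k == kw]
            for kw in keywords
        ]
        bindings = []
        for combo in product(*per_keyword):  # one posting per keyword
            binding = conjoin(combo)
            if binding is not None and binding not in bindings:
                bindings.append(binding)
        if bindings:
            results[doc] = bindings
    return results

print(search(["Donald", "likes", "Minnie"]))  # contradictory conditions: no hit
print(search(["Donald", "likes", "Daisy"]))   # hit under R1=duck
```

Returning the surviving binding alongside the document tells the caller in which instance (here R1=duck) the keywords actually co-occur.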
Qualitative Assessment
• Expressiveness of rules / patterns
– Good enough for "desktop data"
– Extensible for other data
– Unclear how good for general applications (e.g., SAP)
• Normalized View
– Size: O(n), with n the size of the raw data
– Generation Time: depends on complexity of XQuery expressions in rules; typically O(n)
• Predicate-based Inverted File
– Size: O(n), same as traditional inverted files
– Generation Time: O(n)
– But, constants can be large
• Query Processing
– Polynomial in #keywords in query (~ traditional)
– High constants!
Experiments
• Data sets from my personal desktop
– E-Mail, TWiki, LaTeX, OpenOffice, MS Office, …
• Data-set dependent rules
– E-Mail: different rule sets (here conversations)
– LaTeX: include, footnote, exclude, …
– TWiki: versioning, exclude, …
• Hand-crafted queries
– Vary selectivity and the degree to which instances are involved
• Measure size of data sets, indexes, precision & recall, query running times
Data Size (TWiki)
Traditional Enhanced
Raw Data (MB) 4.77 4.77
Normalized View (MB) - 4.53
Index (MB) 0.56 1.07
Creation Time (secs) 9.00 9.62
Data Size (E-Mail)
Traditional Enhanced
Raw Data (MB) 51.46 51.46
Normalized View (MB) - 51.77
Index (MB) 2.86 12.61
Creation Time (secs) 106.61 132.62
Precision (Twiki)
Traditional Enhanced
Query 1 0.985 1
Query 2 0.071 1
Query 3 0.339 1
Query 4 0.875 1
Recall is 1 in all cases. TWiki: an example of "false positives".
Recall (E-Mail)
Traditional Enhanced
Query 1 0.322 1
Query 2 0.821 1
Query 3 0.499 1
Query 4 0.5 1
Precision is 1 in all cases. E-Mail: an example of "false negatives".
Response Time in ms (Twiki)
Traditional Enhanced
Query 1 0.201 0.907
Query 2 0.218 1.224
Query 3 0.033 0.122
Query 4 0.054 0.212
Enhanced is up to an order of magnitude slower (about 4x to 6x here), but still within milliseconds.
Response Time in ms (E-Mail)
Traditional Enhanced
Query 1 0.003 0.864
Query 2 0.004 6.091
Query 3 0.020 1.845
Query 4 0.027 0.055
Enhanced is up to three orders of magnitude slower (Query 2), but still within milliseconds.
Conclusion & Future Work
• See data with the eyes of users!
– Give search engines the right glasses
– Flexibility in search: reveal hidden data
– Compressed indexes using predicates
• Future Work
– Other apps: e.g., JSP, Tagging, Semantic Web
– Consolidate different view definitions (security)
– Search on streaming data