WikiAnalytics
Andrey Balmin (IBM Almaden), Emiran Curtmola (UC San Diego)
© 2010 IBM Corporation
Data: English Wikipedia Infoboxes
{{Infobox officeholder
|name = Arnold Schwarzenegger
|nick = Governator
|order = [[List of Governors of California|38th]]
|office = Governor of California
|term_start = November 17, 2003
|birth_place = [[Thal, Austria]]
|religion = Roman Catholic
|net worth = $100–$200 [[million]] USD
}}
{{Infobox Governor
|name=Edmund Gerald Brown, Jr.
|office=California Attorney General
|order3=[[List of Governors of California|34th]]
|office3=Governor of California
|term_start3=January 6, 1975
|religion=[[Catholic Church|Roman Catholic]]
}}
{{Infobox Governor
|order=16th Governor of California
}}
Main cluster
Structural outliers
Wikipedia is Sparse
Wiki Infoboxes: ~1.7M infoboxes (documents), ~50k distinct fields
Universal Table: logical abstraction (documents × fields)
[Chart: number of fields (distinct type, attribute pairs) that occur at least X times in Wikipedia; X on a logarithmic axis]
Almost 20,000 fields occur only once
Only 300 fields occur over 4,000 times in Wikipedia
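A minimal sketch of the Universal Table abstraction, assuming flat infoboxes with no nested templates. The helper names `split_top_level` and `parse_infobox` are hypothetical, and the three infoboxes are shortened versions of the ones on the data slide. Each infobox becomes one sparse row, and the columns are the union of all field names seen.

```python
def split_top_level(body):
    """Split on '|' characters that are not inside a [[...]] wiki link."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        two = body[i:i + 2]
        if two == "[[":
            depth += 1
            current.append(two)
            i += 2
        elif two == "]]":
            depth -= 1
            current.append(two)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    return parts


def parse_infobox(text):
    """Return (type, {field: value}) for a flat '{{Infobox type|f=v|...}}' string."""
    body = text.strip()[2:-2]  # drop the surrounding {{ and }}
    head, *pairs = split_top_level(body)
    doc_type = head.strip()
    if doc_type.lower().startswith("infobox"):
        doc_type = doc_type[len("infobox"):].strip()
    fields = {}
    for pair in pairs:
        name, sep, value = pair.partition("=")
        if sep:  # skip fragments without a 'field = value' shape
            fields[name.strip()] = value.strip()
    return doc_type, fields


infoboxes = [
    "{{Infobox officeholder|name = Arnold Schwarzenegger"
    "|office = Governor of California|religion = Roman Catholic}}",
    "{{Infobox Governor|name=Edmund Gerald Brown, Jr."
    "|order3=[[List of Governors of California|34th]]|office3=Governor of California}}",
    "{{Infobox Governor|order=16th Governor of California}}",
]

# Universal Table (logically): one sparse row per document, one column per distinct field.
rows = [parse_infobox(text)[1] for text in infoboxes]
columns = sorted({name for row in rows for name in row})
filled = sum(len(row) for row in rows)
sparsity = 1 - filled / (len(rows) * len(columns))
print(columns, round(sparsity, 2))
```

Even in this three-row sample most cells are empty, which is the same effect the slide reports at Wikipedia scale.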
Sparse Data
Data is produced by humans and for humans
–No pressing need for schema consistency
–Domain examples:
•Healthcare patient records
•Electronic forms
•Product catalogs
How does one query such data?
WikiAnalytics Approach
Start with a simple keyword search interface
–E.g. “California Governor Religion!”
–Returns a superset of the result (123 infoboxes)
Cluster the results based on where the keywords are
–“California” in office field vs. “California” in birth_place
Present the user with hierarchies of clusters
–Let them accept/reject clusters and features
–Not unlike faceted search, but facets are not a property of the data alone: they also depend on the query
Clustering Features
A feature corresponds to
– The mapping of keyword k on field f in the corpus
– It defines a dynamic dimension on the corpus
3 kinds of features
– type: the keyword occurs inside documents of type “type”
• E.g., F1: type = Governor
– field-keyword: the keyword occurs in a field
• E.g., F2: “California” → office vs. “California” → birth_place
– field value: the field has a particular value that contains the keyword
• E.g., F3: office = “Governor of California” vs. office = “Governor of Baja California”
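The three kinds of features can be sketched as follows. This is one possible reading of the slide, not the system's actual code: `extract_features` and the tuple encoding of a feature are hypothetical, matching is plain case-insensitive substring search, and the type feature is emitted whenever a keyword hits any field of the document.

```python
def extract_features(doc_type, fields, keywords):
    """Return the set of features one infobox contributes for the query keywords."""
    features = set()
    for keyword in keywords:
        needle = keyword.lower()
        hit = False
        for name, value in fields.items():
            if needle in value.lower():
                hit = True
                # field-keyword feature: the keyword occurs somewhere in this field
                features.add(("field-keyword", keyword, name))
                # field-value feature: the field has this particular value
                features.add(("field-value", name, value))
        if hit:
            # type feature: the keyword occurs inside a document of this type
            features.add(("type", doc_type))
    return features


# First document follows the data slide; the second is a made-up contrast case.
doc_a = ("officeholder", {"office": "Governor of California", "birth_place": "Thal, Austria"})
doc_b = ("Governor", {"office": "Governor of Texas", "birth_place": "California"})

features_a = extract_features(*doc_a, ["California"])
features_b = extract_features(*doc_b, ["California"])
print(sorted(features_a))
print(sorted(features_b))
```

Both documents match the keyword, but they land on different field-keyword features (office vs. birth_place), which is what lets the clustering separate them.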
Clustering on Hundreds of Features
For a typical result, there are more features than rows
– No way to tell which ones are relevant
– Many overlapping hierarchies
Our solution: produce all possible clusterings
– Pack all hierarchies into a lattice structure
– Heuristically filter clusters to display
• Don’t show clusters that are smaller than a user-controlled “minimum support” parameter
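The “minimum support” heuristic amounts to a size filter over clusters. A small sketch, where `filter_by_min_support` is a hypothetical name and the cluster contents are made-up document ids:

```python
def filter_by_min_support(clusters, min_support):
    """Keep only clusters that contain at least min_support documents."""
    return {
        features: docs
        for features, docs in clusters.items()
        if len(docs) >= min_support
    }


# Hypothetical clusters: feature set -> documents in the cluster.
clusters = {
    ("type=governor",): {1, 2, 3, 4, 5},
    ("type=governor", "California office"): {1, 2, 3},
    ("type=judge",): {9},
}

shown = filter_by_min_support(clusters, min_support=2)
print(sorted(shown))
```

Raising the parameter hides small clusters, including structural outliers, so it is left under user control.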
Universal Navigational Lattice (UNL)
Wiki Infoboxes
F1: type=governor
type=president
type=judge
F1, F2: California office
F1, F3: governor office
F1, F2, F3
User Interface
System Architecture
Components: Storage & Indexing, Query Processing, BackEnd (Java), FrontEnd (Flex)
Storage & Indexing
– Heterogeneous, sparse data (Wiki Infoboxes)
– Universal Table (logically): documents × fields
– DB2 pureXML® + DB2 Text Search®
Query Processing
– Input: keyword search query, e.g., California governor religion!
– Compute navigational dimensions on the fly: map query keywords to fields
– Universal Navigational Lattice (UNL): in-memory computation
User interaction
– Enable faceted-like search UI to explore over the lattice
– Interactive selection of final answers
– Generate and publish the output feed
The final feed is useful for
• further processing in community-based mashups
• data cube analytics
• many-eyes analytics and visualization
• help formulating a structural query
Community support by leveraging previously cleaned feeds
Example lattice over Wiki Infoboxes: type=governor, type=president, type=judge; California office; governor office; California, governor office
– e.g., governor of California is also a former US president (see Ronald Reagan)
Summary: WikiAnalytics
Input: heterogeneous, sparse data
WikiAnalytics: dynamic interface that lets users navigate, identify and extract all the documents of interest
Output: structured data feeds
– Mashups (e.g., Mashup Hub)
– Data analytics
– Formulate a structural query over the heterogeneous data
Future work:
– More heuristics for pruning the lattice
– Collaborative features
Thank you!
Questions?
Wikipedia: The Human Factor
Same schema used differently
– E.g., Populate “order” attribute vs. “office” attribute
order=“16th Governor of California” for Washington Bartlett
vs. order=“38th” office=“Governor of California” for Arnold Schwarzenegger
Same data in different schemas
– E.g., The governor’s data in the input of the president’s schema
order2=“33rd Governor of California” for Ronald Reagan
Schema conflicts
• E.g., {{Infobox Officeholder/Personal data ...
• | birth_date = ...
• | date of birth = ...
No universal format for representing field values
Etc.
How to efficiently query such data and derive a complete set of answers?
Existing approaches
Use keyword ranked search + heuristics
– Queries are underspecified: hard to capture user’s intentions
Use query languages (SQL, XQuery/XPath) after data integration
– Strict, complex, expensive and hard to express
Faceted search
– Static dimensions, too restrictive
Data summarization with SEDA [CIDR’09]
– Nice system but too generic
Common problem: hard to identify the complete set of answers
Universal Framework (search space)
Key idea
–Cluster documents D based on the features F corresponding to query Q=(k1,…,kn)
Universal Navigational Lattice (UNL)
–Given D, Q and F, produce all possible groupings of documents by all sets of features
–Connect the groups by the subset relationship of documents
Universal computation framework to express
–Traditional faceted search, OLAP navigation
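One way to write the UNL definition above down formally. The notation is an assumption made for illustration: D(F_i) is the set of documents matching feature F_i, S ranges over non-empty sets of features, and the direction of the edge arrow (from the larger group to the smaller one) is a choice, not something the slide states.

```latex
\[
\begin{aligned}
D(S) &= \bigcap_{F_i \in S} D(F_i), \qquad \emptyset \neq S \subseteq F \\
\text{nodes}(\mathrm{UNL}) &= \{\, D(S) \;:\; \emptyset \neq S \subseteq F,\ D(S) \neq \emptyset \,\} \\
n_1 \to n_2 &\iff n_2.D \subset n_1.D
\end{aligned}
\]
```

Two feature sets with the same D(S) describe the same group, so they share one node.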
Computing Features F
Universal Table + Keywords → F = {F1, F2, …, Fn}
Input: keyword search query, e.g., California governor religion!
Focus on the set of documents containing all keywords
F = { type = Governor, “California” → office, office = “Governor of California”, “California” → order, “religion” → religion, “California” → born }
UNL Lattice Construction
Bottom-up construction
–Start with groups of documents that match single features Fi in F
–Consider groups of documents by all pairs of features
–Etc.
Documents stored in DB2: 1, 2, 3, 10, 15, 20, 30, 40
F1: 1,2,3
F2: 1,2,10
F3: 1,2,20
F1, F2: 1,2
F1, F3: 1,2
F2, F3: 1,2
F1, F2, F3: 1,2
Redundant to construct different navigational nodes (with different sets of features) for the same group of documents!
Solution: consolidate all the buckets with the same set of documents
UNL Lattice Construction
F1: 1,2,3
F2: 1,2,10
F3: 1,2,20
F1, F2, F3: 1,2
Solution: consolidate all the buckets with the same set of documents
• add edges
• merge features
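The consolidation step can be illustrated with the F1–F3 example from these slides. This is a brute-force sketch that enumerates every feature combination, which is fine for three features but is not how one would build a large lattice. The variable names are hypothetical.

```python
from itertools import combinations

# Feature -> documents it matches (values taken from the slide).
features = {"F1": {1, 2, 3}, "F2": {1, 2, 10}, "F3": {1, 2, 20}}

combos_tried = 0
buckets = {}  # frozenset of document ids -> merged set of feature names
for size in range(1, len(features) + 1):
    for combo in combinations(sorted(features), size):
        combos_tried += 1
        docs = frozenset(set.intersection(*(features[f] for f in combo)))
        if docs:
            # Same document set as an earlier combination: merge the features
            # into the existing bucket instead of creating a new node.
            buckets.setdefault(docs, set()).update(combo)

for docs, names in buckets.items():
    print(sorted(names), sorted(docs))
```

Seven feature combinations collapse into four buckets: the pairs and the triple all select documents 1 and 2, so they share a single node labeled F1, F2, F3.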
Construction Example
Documents stored in DB2: 1, 2, 3, 10, 15, 20, 30, 40
F1: 1,2,3
F2: 1,2,10
F3: 40,15,10
F4: 20,2,3
F5: 30,20,3
Lattice construction rules for a new node n
• n.D = ∅ → do not add n
• n.D already exists → consolidate with existing node
• otherwise, add n and its edges
F1,F2: 1,2
F1,F4: 2,3
F1,F5: 3
Construction Example
F1,F2: 1,2
F2,F3: 10
F1,F4: 2,3
F2,F4: 2
F1,F5: 3
Construction Example
F1,F2: 1,2
F2,F3: 10
F1,F4: 2,3
F2,F4: 2
F1,F5: 3
F4,F5: 3,20
Construction Example
F1,F2: 1,2
F2,F3: 10
F1,F4: 2,3
F1,F5: 3
F4,F5: 3,20
F1,F2,F4: 2
Construction Example
F1,F2: 1,2
F2,F3: 10
F1,F4: 2,3
F1,F2,F4: 2
F1,F4,F5: 3
F4,F5: 3,20
Construction Example
Lattice construction rules for a new node n
• n.D = ∅ → do not add n
• n.D already exists → consolidate with existing node
• otherwise, add n and its edges
• triangle rule
F1,F2: 1,2
F2,F3: 10
F1,F4: 2,3
F1,F2,F4: 2
F1,F4,F5: 3
F4,F5: 3,20
Construction Example
Invariant: n1 → n2 ⇒ n1.D ⊃ n2.D and n1.F ⊂ n2.F
F1,F2: 1,2
F2,F3: 10
F1,F4: 2,3
F1,F2,F4: 2
F1,F4,F5: 3
F4,F5: 3,20
Conceptually, UNL captures all possible ways to group documents based on where the query keywords hit in the corpus
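The construction rules can be sketched as a level-wise procedure and run on the F1–F5 example. Several points are assumptions rather than the authors' implementation: the function names `build_unl` and `build_edges`, the choice to key nodes by their document set, the reading of the “triangle rule” as dropping an edge whenever an intermediate node lies between parent and child, and the direction of the invariant (parent has more documents and fewer features).

```python
def build_unl(features):
    """features: feature name -> set of doc ids. Returns doc set -> merged features."""
    nodes = {}
    frontier = []
    for name, docs in features.items():
        docs = frozenset(docs)
        if not docs:
            continue
        if docs not in nodes:
            frontier.append(docs)
        nodes.setdefault(docs, set()).add(name)
    while frontier:
        next_frontier = []
        for docs in frontier:
            for name, feature_docs in features.items():
                new_docs = docs & frozenset(feature_docs)
                if not new_docs:
                    continue  # rule 1: n.D is empty, do not add n
                merged = nodes[docs] | {name}
                if new_docs in nodes:
                    nodes[new_docs] |= merged  # rule 2: consolidate with existing node
                else:
                    nodes[new_docs] = set(merged)  # rule 3: add n
                    next_frontier.append(new_docs)
        frontier = next_frontier
    return nodes


def build_edges(nodes):
    """Parent -> child edges by document subset, skipping edges implied by a middle node."""
    edges = set()
    for parent in nodes:
        for child in nodes:
            if child < parent and not any(child < mid < parent for mid in nodes):
                edges.add((parent, child))
    return edges


features = {
    "F1": {1, 2, 3},
    "F2": {1, 2, 10},
    "F3": {40, 15, 10},
    "F4": {20, 2, 3},
    "F5": {30, 20, 3},
}
nodes = build_unl(features)
edges = build_edges(nodes)

for docs, names in sorted(nodes.items(), key=lambda item: (len(item[1]), sorted(item[1]))):
    print(",".join(sorted(names)), sorted(docs))

# Invariant check: along every edge, documents shrink and features grow.
assert all(child < parent and nodes[parent] < nodes[child] for parent, child in edges)
```

On this input the sketch yields the same eleven nodes as the final slide of the example: the five single-feature nodes plus F1,F2; F2,F3; F1,F4; F4,F5; F1,F2,F4 and F1,F4,F5.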
Big Question: GUI
How to display the UNL to facilitate discovery of complete answers?
F1: 1,2,3
F2: 1,2,10
F3: 40,15,10
F4: 20,2,3
F5: 30,20,3
F1,F2: 1,2
F2,F3: 10
F1,F4: 2,3
F1,F2,F4: 2
F1,F4,F5: 3
F4,F5: 3,20
UNL lends itself to a bottom-up visual representation:
• start from all root nodes, traversing towards the leaves
Effective Visualization
Problem
– There are still lots of nodes to explore
Challenge
– Need to find both large cluster(s) of documents and the outliers to find the complete set of answers
Solution: Y-feature filtering
– Impose threshold Y on the features entering the lattice computation
• Features representing more than Y documents
– To find the big chunks: choose Y bigger
• Filter out the less representative features
– To find the smaller chunks: choose Y smaller
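Y-feature filtering can be sketched as a pre-pass over the features before the lattice is built. The function name `y_filter` is hypothetical. The feature sets reuse the construction example plus one made-up single-document feature, F6.

```python
def y_filter(features, y):
    """Keep only features that represent more than y documents."""
    return {name: docs for name, docs in features.items() if len(docs) > y}


features = {
    "F1": {1, 2, 3},
    "F2": {1, 2, 10},
    "F3": {40, 15, 10},
    "F6": {15},  # hypothetical outlier feature, not on the slides
}

big_chunks = y_filter(features, y=2)   # larger Y: only well-represented features survive
everything = y_filter(features, y=0)   # smaller Y: outlier features enter the lattice too
print(sorted(big_chunks), sorted(everything))
```

Because the filter runs before construction, a larger Y also shrinks the lattice itself, not just what is displayed.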
Effective Visualization (2)
Complementary solution: Partition based
– Introduce some order among the computed groups by prioritizing certain sets of features to appear at the top levels of the lattice
Block1 Features (e.g., type = Governor): UNL1
Block2 Features (e.g., type = President): UNL2
Block3 Features (e.g., type = Judge): UNL3
Documents stored in DB2
Demo Scenario
Example 1:
Feed0 = California governor religion!
Example 2: Find the number of released jazz albums in the world per country
Generate the following feeds with WikiAnalytics
–Feed1 = jazz album artist! released! (columns: album, artist, release date)
–Feed2 = jazz artist origin! (columns: artist, place of origin)
Feed them to Mashup Hub
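To make Example 2 concrete, here is a sketch of the join that a mashup over the two feeds would perform. It is plain Python standing in for Mashup Hub, and every row is a made-up placeholder, not real feed output.

```python
from collections import Counter

# Hypothetical Feed1 rows: (album, artist, release date).
feed1 = [
    ("Album A", "Artist X", "1959"),
    ("Album B", "Artist X", "1961"),
    ("Album C", "Artist Y", "1970"),
    ("Album D", "Artist Z", "1985"),  # artist missing from Feed2
]
# Hypothetical Feed2 rows: (artist, place of origin).
feed2 = [
    ("Artist X", "Country 1"),
    ("Artist Y", "Country 2"),
]

# Join the feeds on artist, then count albums per country of origin.
origin_by_artist = dict(feed2)
albums_per_country = Counter(
    origin_by_artist[artist]
    for _album, artist, _date in feed1
    if artist in origin_by_artist
)
print(albums_per_country.most_common())
```

Albums whose artist has no origin row are dropped here; a real mashup would have to decide how to report them.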
User Interface