Visualization of Heterogeneous Data
Mike Cammarano, Xin (Luna) Dong, Bryan Chan, Jeff Klingner, Justin Talbot, Alon Halevy, Pat Hanrahan
Homogeneous data is easy.
Company    Founded  Headquarters      Logo
Microsoft  1975     47.6 N, 122.1 W   [logo]
Enron      1985     29.7 N, 95.3 W    [logo]
Google     1998     37.4 N, 122.0 W   [logo]
[Figure: timeline of the founding years on an axis from 1970 to 2000, with marks at 1975, 1985, and 1998]
Multiple sources?
• Collaborative content
• Semi-structured data
{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848 ...
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]] [[United States|U.S.]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]] [[United States|U.S.]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]...
DBpedia.org
According to DBpedia.org:
• DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web.
• The DBpedia dataset currently provides information about more than 1.95 million “things”, including at least:
  • 80,000 persons
  • 70,000 places
  • 35,000 music albums
  • 12,000 films
Database size
We use a subset of DBpedia, mostly infoboxes and geonames.
• 30 M triples
• 2.5 GB
We currently use an in-memory database.
The hardware is a dual-processor machine with dual-core AMD Opteron 280s and 8 GB of RAM.
A glimpse inside DBpedia
Kerry: dbp:PLACE_OF_BIRTH -> dbp:latitude -> 39° 41´ 45˝ N
Poe: dbp:birth_place -> w3c:owl#sameAs -> geonames:latitude -> 42.358403
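The two paths above can be read as chains of predicates over subject-predicate-object triples. The following is a minimal sketch, assuming a toy in-memory triple list; the intermediate node names (`place_K`, `place_P`, `geo_P`) and the helper `follow` are hypothetical placeholders, not actual DBpedia identifiers or the authors' database.

```python
from collections import defaultdict

# Toy triples (subject, predicate, object). Intermediate node names are
# placeholders invented for this sketch.
TRIPLES = [
    ("Kerry", "dbp:PLACE_OF_BIRTH", "place_K"),
    ("place_K", "dbp:latitude", "39° 41´ 45˝ N"),
    ("Poe", "dbp:birth_place", "place_P"),
    ("place_P", "w3c:owl#sameAs", "geo_P"),
    ("geo_P", "geonames:latitude", "42.358403"),
]

# Index objects by (subject, predicate) so each hop is a dictionary lookup.
INDEX = defaultdict(list)
for s, p, o in TRIPLES:
    INDEX[(s, p)].append(o)


def follow(start, path):
    """Return every value reached from `start` by following `path` in order."""
    frontier = [start]
    for predicate in path:
        frontier = [o for node in frontier for o in INDEX[(node, predicate)]]
    return frontier


kerry = follow("Kerry", ["dbp:PLACE_OF_BIRTH", "dbp:latitude"])
poe = follow("Poe", ["dbp:birth_place", "w3c:owl#sameAs", "geonames:latitude"])
```

The same attribute (latitude of the birthplace) needs a two-hop path for one person and a three-hop path for the other, and applying Kerry's path to Poe returns nothing.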
Heterogeneity
• Types
  • Decimal vs. sexagesimal coordinates: 39° 41´ 45˝ N vs. 39.70
• Names
  • PLACE_OF_BIRTH vs. birth_place
• Paths
  • dbp:PLACE_OF_BIRTH -> dbp:latitude
    vs. dbp:birth_place -> w3c:owl#sameAs -> geonames:latitude
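Type heterogeneity of this kind can be handled by normalizing both notations to decimal degrees. The sketch below is an illustration only; the function name `to_decimal_degrees` and its parsing rules are assumptions, not the authors' code.

```python
import re


def to_decimal_degrees(text):
    """Parse a decimal or degrees-minutes-seconds coordinate into decimal degrees."""
    text = text.strip()
    try:
        return float(text)  # already decimal, e.g. "42.358403"
    except ValueError:
        pass
    # Pull out up to three numbers: degrees, minutes, seconds.
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text)]
    if not numbers or len(numbers) > 3:
        raise ValueError(f"unrecognized coordinate: {text!r}")
    degrees, minutes, seconds = (numbers + [0.0, 0.0])[:3]
    value = degrees + minutes / 60 + seconds / 3600
    # Southern and western hemispheres are negative by convention.
    return -value if text[-1].upper() in "SW" else value
```

For the value on the slide, 39° 41´ 45˝ N becomes 39 + 41/60 + 45/3600, which is about 39.70.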
Contributions
• Visualize heterogeneous data represented as a graph of relationships between objects
• Describe inputs to a visualization:
  • Visualization template
  • Set of keywords per attribute
• Find attributes needed for a visualization by searching paths
  • Within an iterative process of search, visualization, and refinement
• Present algorithm for finding and ranking paths based on keywords
  • Efficiently enumerate paths
    • A*
    • Random sampling
  • Rank according to:
    • Keywords
    • Heuristics about graph structure
Integrate searching and visualization
[Diagram: an iterative cycle of three steps]
• Search for potentially desirable paths
• Visualize results in context
• Refine path selections
Matching problem
• Find the best path to a number for “state latitude”
[Figure: example graph around the senators Dianne Feinstein and Harry Reid, with edges labelled state, capital, latitude, pop, spouse, birthplace, party, house, leader, name, color, governor, and children, and literal values 42.4, 6349000, 39.0, blue, and 4]
Candidate paths:
• state.capital.latitude
• state.pop
• spouse.birth_place.latitude
• state.governor.children
Basic algorithm
1. Explore graph
2. Find paths ending in a number
3. Score and rank paths using TF/IDF
• Find the best path to a number for “state latitude”
[Figure: the example graph, with the candidate paths annotated with scores 0.8, 0.5, 0.6, and 0.5]
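The three steps can be sketched on a toy graph as follows. Everything here is an assumption made for illustration: the graph wiring is loosely based on the figure, the node names are invented, and the scoring formula is one simple TF-IDF variant (each path's labels form one document), since the slides do not give the exact formula. The scores it produces are not the ones shown in the figure.

```python
import math
from collections import Counter

# Toy graph: node -> list of (edge label, target). Numbers are literals.
GRAPH = {
    "senator": [("state", "st"), ("spouse", "sp"), ("name", "Dianne Feinstein")],
    "st": [("capital", "cap"), ("pop", 6349000), ("governor", "gov")],
    "cap": [("latitude", 42.4)],
    "sp": [("birthplace", "bp")],
    "bp": [("latitude", 39.0)],
    "gov": [("children", 4)],
}


def numeric_paths(node, max_depth, prefix=()):
    """Steps 1 and 2: yield (labels, value) for paths of at most
    max_depth edges that end in a number."""
    if max_depth == 0:
        return
    for label, target in GRAPH.get(node, []):
        labels = prefix + (label,)
        if isinstance(target, (int, float)):
            yield labels, target
        elif target in GRAPH:
            yield from numeric_paths(target, max_depth - 1, labels)


def rank(paths, keywords):
    """Step 3: rank paths by a TF-IDF style keyword score, best first."""
    n = len(paths)
    # Document frequency: in how many paths does each label occur?
    df = Counter(label for labels, _ in paths for label in set(labels))

    def score(labels):
        tf = Counter(labels)
        return sum(tf[k] / len(labels) * math.log(1 + n / df[k])
                   for k in keywords if df[k])

    return sorted(paths, key=lambda p: score(p[0]), reverse=True)


ranked = rank(list(numeric_paths("senator", 3)), ["state", "latitude"])
best_labels, best_value = ranked[0]
```

On this toy graph the top-ranked path for the keywords "state latitude" is state.capital.latitude, because it is the only path that matches both keywords.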
Improving execution time
• New pruning techniques since the paper submission
  • A*
  • Bidirectional search on terms
  • Random sampling
Pruning techniques
• Most paths do not correspond to a “state latitude”
  • How can we avoid such bad paths?
[Figure: the example graph, with bad paths annotated “No mention of latitude”, “Many unrelated terms”, and “No potential paths”]
Pruning techniques / A* Search
• Use a scoring function that penalizes unrelated terms
  • Then an A* search ignores paths with many such terms
[Figure: the example graph, with one branch annotated “Many unrelated terms”]
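The pruning idea can be sketched as a best-first search whose cost is the number of unrelated terms on a path. This is an assumption-laden illustration: the slides do not give the scoring function or the heuristic, so the sketch uses A* with a zero heuristic (equivalent to uniform-cost search) and a hypothetical `max_penalty` cutoff, on an invented toy graph.

```python
import heapq

GRAPH = {  # toy graph: node -> list of (edge label, target)
    "senator": [("state", "st"), ("party", "pty"), ("spouse", "sp")],
    "st": [("capital", "cap"), ("pop", 6349000)],
    "cap": [("latitude", 42.4)],
    "pty": [("house", "hs")],
    "hs": [("leader", "ldr")],
    "ldr": [("name", "Harry Reid")],
    "sp": [("birthplace", "bp")],
    "bp": [("latitude", 39.0)],
}


def search(start, keywords, max_penalty=1, max_depth=4):
    """Return (results, edges examined); results are (penalty, labels, value).

    The penalty counts labels that match no keyword. Partial paths whose
    penalty exceeds max_penalty are dropped and never expanded.
    """
    results, examined = [], 0
    queue = [(0, (), start)]  # (penalty so far, labels so far, node)
    while queue:
        penalty, labels, node = heapq.heappop(queue)
        for label, target in GRAPH.get(node, []):
            examined += 1
            cost = penalty + (0 if label in keywords else 1)
            if cost > max_penalty:
                continue  # pruned: too many unrelated terms
            path = labels + (label,)
            if isinstance(target, (int, float)):
                results.append((cost, path, target))
            elif len(path) < max_depth:
                heapq.heappush(queue, (cost, path, target))
    return sorted(results), examined


pruned_results, pruned_edges = search("senator", {"state", "latitude"})
_, all_edges = search("senator", {"state", "latitude"}, max_penalty=10)
```

With the cutoff, the party.house.leader.name and spouse.birthplace branches are abandoned after their second unrelated term, so fewer edges are examined than with an effectively unlimited penalty budget.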
A* pruning results
Senators on map

Average # of edges examined at each depth, full enumeration:
Depth     1    2     3       4
Image     66   5409  134226  1393766
Name      66   5446  168673  5245035
latitude  66   5408  145549  1009247

Average # of edges examined at each depth, using A*:
Depth     1    2     3       4
Image     66   2049  1615    198
Name      66   9     5092    228
latitude  66   598   2272    2148
Pruning techniques / Random Sampling
• Do normal A* search for n randomly chosen nodes
• Only search known hits for the remaining nodes
  • Prevents repeatedly checking where there are likely no paths
[Figure: the example graph, extended with John Kerry, with branches annotated “No potential paths” and “A hit!”]
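The sampling strategy can be sketched as two phases: a full search on a few seed nodes to learn which label paths produce hits, then a cheap replay of only those paths on every node. This is a simplified illustration under stated assumptions: the seed search here is an exhaustive enumeration rather than A*, and all function names are hypothetical.

```python
import random


def full_search(graph, node, keyword, max_depth=3, prefix=()):
    """Exhaustively yield label paths from `node` that reach a number
    through an edge labelled `keyword`."""
    if max_depth == 0:
        return
    for label, target in graph.get(node, []):
        path = prefix + (label,)
        if isinstance(target, (int, float)):
            if label == keyword:
                yield path
        else:
            yield from full_search(graph, target, keyword, max_depth - 1, path)


def follow(graph, node, path):
    """Follow one known label path from `node`; return the values reached."""
    frontier = [node]
    for label in path:
        frontier = [t for n in frontier
                    for l, t in graph.get(n, []) if l == label]
    return frontier


def sampled_search(graph, instances, keyword, n_seeds, rng=random):
    """Search n_seeds random instances fully, then replay their hits on all."""
    seeds = rng.sample(instances, n_seeds)
    hits = {p for s in seeds for p in full_search(graph, s, keyword)}
    return {
        inst: [v for p in sorted(hits) for v in follow(graph, inst, p)]
        for inst in instances
    }
```

The trade-off is that a path that exists only for non-seed nodes is never discovered, which is why the seeds are chosen at random rather than taken from the front of the list.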
Sampling results
Average # edges examined at all depths:
           Seed nodes (10)  Others (89)
Image      920              82
Name       40               35
State      200              175
Latitude   3100             144
Longitude  3100             144
TOTAL      7360             580

Total edges examined:
without sampling: 7360 × 99 = 728640
with sampling: 7360 × 10 + 580 × 89 = 125220
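The totals can be rechecked directly from the per-field averages. The short script below only reproduces the arithmetic on the slide; the variable names are made up for this sketch, and the reduction factor is derived here rather than stated in the source.

```python
# Average edges examined per node, copied from the sampling results table.
seed_edges = {"Image": 920, "Name": 40, "State": 200,
              "Latitude": 3100, "Longitude": 3100}
other_edges = {"Image": 82, "Name": 35, "State": 175,
               "Latitude": 144, "Longitude": 144}

nodes, seeds = 99, 10
per_seed = sum(seed_edges.values())    # edges per fully searched node
per_other = sum(other_edges.values())  # edges per node that reuses known hits

without_sampling = per_seed * nodes
with_sampling = per_seed * seeds + per_other * (nodes - seeds)
reduction = without_sampling / with_sampling

print(without_sampling, with_sampling, round(reduction, 1))
# prints: 728640 125220 5.8
```

So on this example, sampling cuts the number of edges examined by a factor of roughly 5.8.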
Performance
Runtime for senators’ example (seconds):
State latitude  State longitude  Image  Name   Instances  Total
0.911           0.854            0.542  0.513  0.187      3.007

Runtime for astronauts’ example (seconds):
Mission launch  Mission insignia  Name   Instances  Total
1.109           1.151             0.743  1.102      4.105

Runtime for each field in countries’ example (seconds):
GDP per capita  Inflation  Flag   Name   Instances  Total
1.142           2.228      0.867  1.108  1.136      6.481

• Performance now interactive
• With new pruning techniques, ~100x faster than reported in paper.
Precision / Recall
Senators – state latitude:
          Correct  Incorrect
Accepted  64       34
Rejected  1        0

Countries – GDP per capita:
          Correct  Incorrect
Accepted  206      58
Rejected  9        0

Senators – image:
          Correct  Incorrect
Accepted  86       6
Rejected  0        6
Summary
• Visualize heterogeneous data represented as a graph of relationships between objects
• Produce visualizations conforming to templates by searching for needed attributes
• Present algorithm for finding and ranking paths based on keywords
  • Efficiently enumerate paths
  • Rank
• Now fast enough for interactive use
• High precision and recall
Future work
• Improvements
  • UI support for initial discovery and query refinement
  • Robustness of terms / Improved ranking
  • Automatic selection of visualization
  • Visualizing missing data
  • Visualizations that reflect result relevance (selective emphasis)
• Deploy on the web
  • Wikipedia
  • The whole web
Acknowledgements
Funding sources:
• Boeing
• RVAC
• CALO
Tools and data:
• DBpedia
• MIT SIMILE project timeline
• Tom Patterson’s map artwork
Pruning techniques / Bidirectional Search
• Before A*, search one step back from each literal, following only edges that match keywords
• This saves one step during forward A* search
[Figure: the example graph, with one branch annotated “No mention of latitude”]
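A minimal sketch of the backward step, under stated assumptions: the toy graph, the function names, and the plain depth-limited forward walk (standing in for the forward A* search) are all invented for illustration. A real system would presumably start from the literals through a reverse index rather than scan every edge as done here.

```python
def keyword_sources(graph, keywords):
    """Backward step: find nodes with a keyword-labelled edge to a number,
    and record the (label, value) pairs found at each such node."""
    sources = {}
    for node, edges in graph.items():
        for label, target in edges:
            if label in keywords and isinstance(target, (int, float)):
                sources.setdefault(node, []).append((label, target))
    return sources


def forward_search(graph, start, sources, max_depth=3):
    """Forward step: walk at most max_depth - 1 edges, since the last edge
    of every useful path is already known from `sources`."""
    found, stack = [], [(start, ())]
    while stack:
        node, path = stack.pop()
        for label, value in sources.get(node, []):
            found.append((path + (label,), value))
        if len(path) < max_depth - 1:
            for label, target in graph.get(node, []):
                if target in graph:
                    stack.append((target, path + (label,)))
    return found


GRAPH = {  # toy graph: node -> list of (edge label, target)
    "senator": [("state", "st"), ("spouse", "sp")],
    "st": [("capital", "cap"), ("pop", 6349000)],
    "cap": [("latitude", 42.4)],
    "sp": [("birthplace", "bp")],
    "bp": [("latitude", 39.0)],
}

sources = keyword_sources(GRAPH, {"latitude"})
paths = sorted(forward_search(GRAPH, "senator", sources))
```

Because the final hop is resolved in advance, the forward walk never follows the last edge of a path such as state.pop, whose literal has no mention of latitude.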