Entity Search and the Web of Data Centralised Entity Search Federated Entity Search P2P Entity Search Conclusions and Future Work
Semantic and Distributed Entity Search in
the Web of Data
Robert [email protected]
Norwegian University of Science and Technology, Trondheim, Norway
March 6, 2013
Outline
1. Entity Search and the Web of Data
   The Web of Data
   What are Entities?
2. Centralised Entity Search
   Entity Modelling
   Experiments
3. Federated Entity Search
   Introduction
   Experimental Results
4. P2P Entity Search
   Introduction and Approach
   Experiments
5. Conclusions and Future Work
   Future Work
Overview
• Describe the main components of the last four years of my research
• Try to give a good motivation and show the “whole picture”
• Show real-world examples
• Pointers on future work
• Do it in an accessible way
What?
• Semantic and Distributed Entity Search in the Web of Data
• Definitions (in reverse order)
  • Web of Data
  • Entities
  • Entity Search
  • Centralised or distributed?
The Web of Data
• Blog post by Tim Heath
• . . . slight disagreement
• Terms:
  • Linked Data
  • Web of Data
“. . . Linked Data is just an attempt to rebrand the Semantic Web . . . ”
“. . . Personally I use the term Web of data largely interchangeably with the term Semantic Web . . . ”
“. . . The precise term I use depends on the audience. With Semantic Web geeks I say Semantic Web, with others I tend to say Web of data . . . ”
. . . How We Use the Terms
• Linked Data
  • Technical foundation
  • “means of publishing/exchanging interconnected data”
• Web of Data / Semantic Web
  • Largely interchangeable
  • “an interconnected Web of Data available for search and research”
  • Example: Wikipedia connecting to other resources
Linked Open Data 2007
[Figure: Linked Open Data cloud diagram (2007). Datasets shown: SW Conference Corpus, DBpedia, RDF Book Mashup, DBLP Berlin, DBLP Hannover, Revyu, Project Gutenberg, FOAF, Geonames, Musicbrainz, Magnatune, Jamendo, World Factbook, SIOC, SemWebCentral, Eurostat, ECS Southampton, BBC Later + TOTP, Freshmeat, OpenGuides, GovTrack, US Census Data, W3C WordNet, flickr wrappr, Wikicompany, OpenCyc, lingvoj, Ontoworld. Several datasets are marked as new.]
Entities 1/4
• Knowledge bases are growing, so what?
• “Something’s interesting when Google do it”
• Google Knowledge Graph (2012)
Entities 2/4
• What is an entity?
• (Typed) object
Entities 3/4
• Once identified, the entity has
  • Attributes and relations
Entities 4/4
• Free text
• Date
• Director
• Relations (Links)
  • Outgoing
  • Ingoing
The Entity Search Task
ad-hoc entity retrieval [1]:
answering arbitrary information needs related to particular aspects of objects [entities], expressed in unconstrained natural language and resolved using a collection of structured data
• Our main focus
• Realistic and frequent type of search
[1] J. Pound, P. Mika, and H. Zaragoza. “Ad-hoc object retrieval in the web of data”. In: Proc. of the 19th Int. Conference on World Wide Web (WWW’10). 2010.
Top Google Searches 2012 [2]
• People do search for entities
  • Persons
  • Products
  • Events
• BBB12 is Big Brother Brazil . . .
[2] http://www.google.com/zeitgeist/2012
Overview
• (Centralised) entity search
• Federated entity search
• Peer-to-peer (P2P) networks
Overview of Publications 1/3
• (Centralised) entity search
• Semantic Search Challenge [3]
• Hierarchical Entity Model [4]
• Strong Baselines [5]
[3] K. Balog, M. Ciglan, R. Neumayer, W. Wei, and K. Nørvåg. “NTNU at SemSearch 2011”. In: Proc. of the 4th Int. Semantic Search Workshop of the 20th Int. World Wide Web Conference (WWW2011). 2011.
[4] R. Neumayer, K. Balog, and K. Nørvåg. “On the Modeling of Entities for Ad-hoc Entity Search in the Web of Data”. In: Proc. of the 34th European Conference on Information Retrieval (ECIR’12). 2012.
[5] R. Neumayer, K. Balog, and K. Nørvåg. “When Simple is (more than) Good Enough: Effective Semantic Search with (almost) no Semantics”. In: Proc. of the 34th European Conference on Information Retrieval (ECIR’12). 2012.
Overview of Publications 2/3
• Federated entity search
• Collection ranking and selection [6]
• Ranking Distributed Knowledge Repositories [7]
[6] K. Balog, R. Neumayer, and K. Nørvåg. “Collection Ranking and Selection for Federated Entity Search”. In: Proc. of 18th Int. Symposium of String Processing and Information Retrieval (SPIRE’12). Lecture Notes in Computer Science. 2012.
[7] R. Neumayer, K. Balog, and K. Nørvåg. “Ranking Distributed Knowledge Repositories”. In: Proc. of the Int. Conference on Theory and Practice of Digital Libraries Research and Advanced Technology for Digital Libraries (TPDL’12). Lecture Notes in Computer Science. 2012.
Overview of Publications 3/3
• Peer-to-peer (P2P) networks
• Aggregation of Document Frequencies [8]
• Hybrid Aggregation in P2P Networks [9]
[8] R. Neumayer, C. Doulkeridis, and K. Nørvåg. “Aggregation of Document Frequencies in Unstructured P2P Networks”. In: Proc. of 10th Int. Conference on Web Information Systems Engineering (WISE’09). Lecture Notes in Computer Science. 2009.
[9] R. Neumayer, C. Doulkeridis, and K. Nørvåg. “A Hybrid Approach for Estimating Document Frequencies in Unstructured P2P Networks”. In: Information Systems 36.3 (2011).
Centralised Entity Search
• Research questions
  • How can traditional ad-hoc document retrieval techniques be applied in the context of the Web of Data?
  • How can the structure of entities be exploited for the purpose of ad-hoc retrieval?
  • How does field weighting affect search quality?
From Predicates to Fields: Structured Retrieval
• How to represent entity data in terms of structured fields?

Text
Serenity
2005
119
Serenity is a . . .
Joss Whedon
United States
Films based on tv series
Space Westerns
Film
Adam Baldwin
Summer Glau
Jewel Staite
(a) Unstructured entity model

Pred. type    Value
Name          Serenity
Attributes    2005
              119
              Serenity is a . . .
OutRelations  Joss Whedon
              United States
              Films based on tv series
              Space Westerns
              Film
              Adam Baldwin
              Summer Glau
InRelations   Best 2005 sci-fi film
              Favourite film
(b) Structured entity model
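The two representations in the table can be sketched in a few lines of Python. This is a minimal illustration only: the triples, the predicate names (`label`, `released`, and so on), and the helper names are hypothetical and are not taken from the talk or from any real dataset.

```python
from collections import defaultdict

# Hypothetical triples for one entity: (predicate, predicate type, value).
# Predicate names are made up for illustration.
TRIPLES = [
    ("label", "Name", "Serenity"),
    ("released", "Attributes", "2005"),
    ("runtime", "Attributes", "119"),
    ("director", "OutRelations", "Joss Whedon"),
    ("country", "OutRelations", "United States"),
    ("starring", "OutRelations", "Adam Baldwin"),
    ("starring", "OutRelations", "Summer Glau"),
]


def unstructured(triples):
    """Collapse all predicate values into one text field."""
    return " ".join(value for _, _, value in triples)


def structured(triples):
    """Fold predicates into one text field per predicate type."""
    fields = defaultdict(list)
    for _, ptype, value in triples:
        fields[ptype].append(value)
    return {ptype: " ".join(values) for ptype, values in fields.items()}


if __name__ == "__main__":
    print(unstructured(TRIPLES))
    print(structured(TRIPLES))
```

The unstructured variant loses the information about where a value came from, while the structured variant keeps one field per predicate type, which is what makes per-type weighting possible later on.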
Entity Modelling Approaches
• Fields and predicates
• Somewhere in between one field and one field per predicate
• We consider:
  • Unstructured entity model
    • Collapse all predicates
  • Structured entity model with predicate folding
    • Collapse within predicate types
  • Hierarchical entity model
    • Use individual fields
    • Predicate type weighting
Structured Entity Model
• Collapsing all fields per type
• Name, Attribute, InRelation, OutRelation
• Smoothing on type level
• Linear mixture of types (mixture of LMs)
[Figure: diagram of the structured entity model with nodes e, p_t, and t.]
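A linear mixture of per-type language models can be sketched as follows. The sketch assumes Jelinek-Mercer smoothing against the collection-wide text of the same predicate type and fixed type weights; the slides do not say which smoothing method or weights were used, so both are assumptions, as are all function and parameter names.

```python
import math
from collections import Counter


def field_lm(term, field_text, collection_text, lam=0.1):
    """Jelinek-Mercer smoothed P(term | field of one predicate type).

    Smoothing uses the collection-wide text of the same predicate type
    (an assumption; `lam` is an arbitrary illustrative value).
    """
    field = Counter(field_text.lower().split())
    coll = Counter(collection_text.lower().split())
    p_field = field[term] / max(sum(field.values()), 1)
    p_coll = coll[term] / max(sum(coll.values()), 1)
    return (1 - lam) * p_field + lam * p_coll


def score(query, entity_fields, collection_fields, type_weights):
    """Log query likelihood under a linear mixture of per-type LMs."""
    log_p = 0.0
    for term in query.lower().split():
        p = sum(
            weight * field_lm(term,
                              entity_fields.get(ptype, ""),
                              collection_fields.get(ptype, ""))
            for ptype, weight in type_weights.items()
        )
        log_p += math.log(p) if p > 0 else float("-inf")
    return log_p
```

With weights that sum to one, an entity whose Name field matches the query scores higher than one that matches only through the smoothing term.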
Hierarchical Entity Model
• Type folding viable alternative
• Preserve info about individual predicates
• Use individual fields
• Three model components
  • Term generation
  • Predicate generation
  • Predicate type generation
[Figure: diagram of the hierarchical entity model with nodes e, p_t, p, and t.]
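One plausible way to combine the three components is a nested mixture, written here for a term t, a predicate p, a predicate type p_t, and an entity e. This is a reading of the slide under stated assumptions, not necessarily the exact formulation in the cited paper.

```latex
P(t \mid e) \;=\; \sum_{p_t} \underbrace{P(p_t \mid e)}_{\text{predicate type generation}}
\;\sum_{p \in p_t} \underbrace{P(p \mid p_t, e)}_{\text{predicate generation}}
\;\underbrace{P(t \mid p, e)}_{\text{term generation}}
```

Setting the inner sum to a single field per type recovers a structured model with predicate folding, which is why type folding can be seen as a special case.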
2010/2011 Semantic Search Challenge
• Given a keyword query, targeting a particular entity, provide a ranked list of relevant entities (i.e., URIs)
• Queries
  • Sampled from web search engine logs (142 in total)
• Data collection
  • Billion Triple Challenge 2009 (BTC) dataset
  • About 70 million entities
  • From sources like dbpedia.org or livejournal.com
• Relevance judgments
  • On a 3-point scale, collected using crowdsourcing
Experimental Results
• Ingoing relations have a marginal effect only
• Structured entity model improves compared to unstructured model
• Hierarchical model improves, but only for individual predicate types
• Overall our results are competitive with the ones achieved at evaluation initiatives
• Preprocessing, preprocessing
• Collection quality
When Simple is Good Enough
• Rather straightforward approach
• Three components
  • Extended preprocessing
    • Process entity names
  • Fielded representation
    • Title and content fields
  • Domain boosting
    • Boost DBpedia
• Compare state-of-the-art fielded retrieval models
Results
• Outperform all results from Semantic Search Challenge
• Still not outperformed by others
• Entity titles answer entity queries very well
• Extent of improvements surprising
Federated Search 1/2
• Moving from centralised retrieval to a distributed setting
  • Starting from a “broker,” query is “routed” to the right collection
• Main research question:
  • Can federated entity search benefit from entity modelling?
Federated Search
1 Collection representation
2 Collection selection
3 Result merging
[Figure: federated search architecture. Collections A, B, and C are represented by Summary A, Summary B, and Summary C at a central broker; a query Q is annotated with steps 1 to 3.]
Collection Representation 1/2
• Collection-centric model
• Treat each collection as one large document
• Low cost, less accurate results expected
Collection Representation 2/2
• Entity-centric model
• Consider each collection in terms of its entities
• High cost, more accurate results expected
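The difference between the collection-centric and the entity-centric model can be illustrated with a small sketch. It assumes a smoothed query-likelihood scorer and, for the entity-centric case, a plain sum of entity scores per collection; both choices, and all names below, are illustrative assumptions rather than the method used in the experiments.

```python
from collections import Counter


def likelihood(query, text, background, lam=0.5):
    """Jelinek-Mercer smoothed query likelihood of `query` given `text`."""
    doc = Counter(text.lower().split())
    bg = Counter(background.lower().split())
    n_doc = max(sum(doc.values()), 1)
    n_bg = max(sum(bg.values()), 1)
    p = 1.0
    for term in query.lower().split():
        p *= (1 - lam) * doc[term] / n_doc + lam * bg[term] / n_bg
    return p


def _background(collections):
    """Concatenate all entity texts of all collections."""
    return " ".join(e for ents in collections.values() for e in ents)


def collection_centric(query, collections):
    """Score each collection as one large concatenated document."""
    background = _background(collections)
    return {name: likelihood(query, " ".join(ents), background)
            for name, ents in collections.items()}


def entity_centric(query, collections):
    """Score every entity on its own, then sum the scores per collection."""
    background = _background(collections)
    return {name: sum(likelihood(query, e, background) for e in ents)
            for name, ents in collections.items()}
```

The cost difference is visible in the code: the collection-centric variant needs one scoring call per collection, the entity-centric variant one per entity.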
Collection Selection
• Predefined threshold
  • Top-k collection selection
  • Typically 5-20
• AENN: “All an Entity Needs is a Name”
  • Central repository of entity names
• AENN collection selection
• Trade-off between entity-centric (EC) and collection-centric (CC) approaches
  • Precision-oriented
  • Recall-oriented
  • Balanced
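Top-k selection and a name-based selection in the spirit of AENN might look roughly like this. The sketch assumes the central repository is a flat list of (entity name, collection) pairs and that collections are scored by counting name matches; the actual AENN scoring is not described on the slide, so treat `name_based_scores` as a hypothetical stand-in.

```python
from collections import Counter


def top_k(collection_scores, k=5):
    """Keep the k best-scoring collections (k is the predefined threshold)."""
    ranked = sorted(collection_scores.items(),
                    key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def name_based_scores(query, name_index):
    """Count, per collection, the entity names sharing a term with the query.

    `name_index` is a hypothetical central repository of entity names:
    a list of (entity name, collection) pairs.
    """
    terms = set(query.lower().split())
    scores = Counter()
    for name, collection in name_index:
        if terms & set(name.lower().split()):
            scores[collection] += 1
    return dict(scores)
```

A small `k` favours precision, a large `k` favours recall, which is one way to read the precision-oriented, recall-oriented, and balanced settings.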
Result merging
• Once we have multiple collections selected
• These collections rank their respective entities
• . . . and the resultant rankings have to be merged into one final list
[Figure: federated search architecture with Collections A, B, and C, their summaries, and the central broker.]
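Because each collection scores its entities on its own scale, the rankings need to be made comparable before merging. The following sketch uses min-max normalisation per collection and an optional collection weight; this is one common, simple merging strategy chosen for illustration, not necessarily the one evaluated in the talk.

```python
def merge_results(results_per_collection, collection_weights=None):
    """Merge per-collection rankings into one list.

    `results_per_collection` maps a collection name to a list of
    (entity, score) pairs. Scores are min-max normalised per collection
    and multiplied by an optional collection weight (default 1.0).
    """
    weights = collection_weights or {}
    merged = []
    for collection, results in results_per_collection.items():
        if not results:
            continue
        scores = [s for _, s in results]
        lo, hi = min(scores), max(scores)
        weight = weights.get(collection, 1.0)
        for entity, s in results:
            norm = (s - lo) / (hi - lo) if hi > lo else 1.0
            merged.append((entity, weight * norm))
    return sorted(merged, key=lambda pair: pair[1], reverse=True)
```

The collection weight could, for example, be the score the broker assigned to the collection during selection.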
Experimental Setup
• Distributed environment
  • Top 100 largest second-level domains from BTC
  • Three sets with different handling of DBpedia
• Relevance
  • Considered the number of relevant entities from each collection
• Metrics
  • Collection ranking and result merging: standard IR metrics (MAP, MRR, nDCG)
  • Collection selection: analogues of precision and recall, plus the average number of collections selected
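For reference, the per-query building blocks of the three ranking metrics can be written compactly. MAP and MRR are the means of `average_precision` and `reciprocal_rank` over all queries. The nDCG variant below uses linear gains and a log2 discount, which is one common definition and an assumption here; for collection ranking, the gain of a collection could be its number of relevant entities.

```python
import math


def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant items."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def ndcg(ranked, gains, k=10):
    """nDCG@k with linear gains; `gains` maps item -> graded relevance."""
    dcg = sum(gains.get(item, 0) / math.log2(rank + 1)
              for rank, item in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1)
               for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```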
Experimental Results
• CC and EC methods are competitive
• Content-based methods stronger
  • Small difference for the DBpedia-only collection
• AENN outperforms other “title-only” methods
• AENN has positive effects on collection selection
P2P Search
• A query can originate from every peer and has to be “routed” via possibly many others
• Research questions:
  • Is P2P search a viable alternative to broker-based (i.e., federated search) architectures for entity retrieval?
  • How can the proposed frequency estimation technique be further improved?
Text documents, terms, and distribution
• Many problems are caused by distributed collections
• What is distributed and how? Random is easy
• Local / global document frequencies
  • Different numbers of documents per node
  • Local importance and influence of collections
• Global information improves search results
  • How frequent is a term on the global level?
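Why local document frequencies can mislead is easy to show with a toy example. The sketch below sums exact counts over all peers, which is the idealised global view; an actual P2P system would only estimate these values. The smoothed IDF formula and all names are illustrative assumptions.

```python
import math
from collections import Counter


def aggregate(peers):
    """Sum document counts and document frequencies over all peers.

    `peers` is a list of (number of documents, {term: local df}) pairs.
    """
    total_docs = sum(n_docs for n_docs, _ in peers)
    global_df = Counter()
    for _, local_df in peers:
        global_df.update(local_df)
    return total_docs, global_df


def idf(term, n_docs, df):
    """Smoothed inverse document frequency, never negative for df <= n_docs."""
    return math.log((n_docs + 1) / (df.get(term, 0) + 1))


if __name__ == "__main__":
    # A small peer where "norway" looks common, and a large peer where it is rare.
    peers = [(10, {"norway": 9}), (1000, {"norway": 10})]
    n, df = aggregate(peers)
    print(idf("norway", 10, peers[0][1]), idf("norway", n, df))
```

On the small peer the term looks almost useless for ranking, while globally it is quite discriminative, which is the local versus global importance problem named above.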
DESENT
• We employ DESENT for P2P network creation
• Completely distributed and decentralised
• Hierarchical overlay generation
  • Individual peers
  • Zones formed by neighbouring peers
  • Super zones based on the previous level
Local Term Selection Process
• Based on local peer’s knowledge only
• Considers local terms and their frequencies
• Problems
  • Number of documents per peer
  • Document frequencies are unstable
  • Local / global importance issues
• Compare to central case
  • Full info
• Central case without term info
  • Lucene scoring
• Aggregated values score in between
• Portable to LM and entity use-case
Summary of Contributions
• Analysis of retrieval models wrt. their applicability to entity search
• Hierarchical models
• Structured retrieval for entity search
• Formalisation of federated search task in a language model framework
• AENN method
• Benchmark data sets for federated entity search
• Entity search in P2P contexts
Query Target Type Identification
• Queries often target specific types (e.g. cars, actors, . . . )
• Sub problem: DBpedia ontology target type identification
  • What is a query’s “type”?
  • Ontology linking
  • How to exploit this info?
• See CIKM’12 poster, partly INEX’12 submission
Query to Field/Predicate Mapping
• Which field/predicate best answers a query?
• Simple example: IMDB
  • field:actor
  • field:director
  • field:trivia
• Example query: “Clint Eastwood”
• What is the best field to answer the query?
• What is the best field to answer the individual query terms?
• What results are we looking for (actor/director)?
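One simple starting point for the mapping question is to estimate, per query term, how the term's occurrences are distributed over fields. The sketch below does this with raw counts over a toy collection; the records are made-up illustration data in an IMDB-like shape, and the estimator is an assumption, not a result from the talk.

```python
from collections import Counter, defaultdict


def field_statistics(entities):
    """Count term occurrences per field over a list of {field: text} records."""
    counts = defaultdict(Counter)
    for fields in entities:
        for field, text in fields.items():
            counts[field].update(text.lower().split())
    return counts


def field_distribution(term, counts):
    """P(field | term), proportional to the term's count in each field."""
    raw = {field: c[term.lower()] for field, c in counts.items()}
    total = sum(raw.values())
    return {field: v / total for field, v in raw.items()} if total else {}


if __name__ == "__main__":
    # Toy records, for illustration only.
    entities = [
        {"actor": "Clint Eastwood", "director": "Sergio Leone"},
        {"actor": "Clint Eastwood", "director": "Clint Eastwood"},
        {"actor": "Morgan Freeman", "director": "Clint Eastwood",
         "trivia": "Eastwood composed the score"},
    ]
    counts = field_statistics(entities)
    print(field_distribution("eastwood", counts))
```

For an ambiguous term the distribution is spread over several fields, which mirrors the actor versus director ambiguity of the example query.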
Last Slide
• Three basic purposes of oral presentations (in the spirit of trusting Wikipedia [10])
  • Inform
  • Persuade
  • Good will
• I tried to do all of these things!
• Thanks for help and support
[10] http://en.wikipedia.org/wiki/Presentation