Linked data: Predicting missing properties

Linked data: Predicting missing properties Klemen Simonic, Jan Rupnik, Primoz Skraba {klemen.simonic, jan.rupnik, primoz.skraba}@ijs.si


Transcript of Linked data: Predicting missing properties

Page 1: Linked data: Predicting missing properties

Linked data: Predicting missing properties

Klemen Simonic, Jan Rupnik, Primoz Skraba {klemen.simonic, jan.rupnik, primoz.skraba}@ijs.si

Page 2: Linked data: Predicting missing properties

Overview

1. Linked Data (Motivation for the work)

2. Problem Definition

3. Approaches

4. Results

Page 3: Linked data: Predicting missing properties

An example

Page 4: Linked data: Predicting missing properties

Linked Data

- connect related data that was not previously linked
- practice for exposing, sharing, and connecting pieces of data and information

How:
- URI (Uniform Resource Identifier)
- RDF (Resource Description Framework) (description of how to model/present the data)

Page 5: Linked data: Predicting missing properties

Linked Data, tiny example

Page 6: Linked data: Predicting missing properties

Linked Data, tiny example

Resource                           Predicate / Property                Resource / Literal
http://www.w3.org/res/Audi         http://www.w3.org/rel/manufacturer  http://www.w3.org/Audi_A6
http://www.w3.org/res/Audi         http://www.w3.org/rel/name          “Audi”
http://www.w3.org/res/Audi         http://www.w3.org/rel/industry      http://www.w3.org/res/Automotive_industry
http://www.w3.org/res/Claus_Luthe  http://www.w3.org/rel/employer      http://www.w3.org/res/Audi
http://www.w3.org/res/Audi         http://www.w3.org/rel/sameAs        http://en.wikipedia.org/wiki/Audi

Page 7: Linked data: Predicting missing properties

Linked Data, one dataset

- Nodes are resources
- Edges are relations
- Edge labels are properties
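To make the graph view concrete, here is a minimal sketch that turns the triples from the tiny example into a labeled directed graph. The `build_graph` helper and the choice to store literals as quoted strings are assumptions made for illustration, not something taken from the slides.

```python
from collections import defaultdict

W3 = "http://www.w3.org/"

# (resource, property, resource-or-literal) triples from the tiny example.
# The literal is kept as a quoted string so it cannot be confused with a URI.
TRIPLES = [
    (W3 + "res/Audi", W3 + "rel/manufacturer", W3 + "Audi_A6"),
    (W3 + "res/Audi", W3 + "rel/name", '"Audi"'),
    (W3 + "res/Audi", W3 + "rel/industry", W3 + "res/Automotive_industry"),
    (W3 + "res/Claus_Luthe", W3 + "rel/employer", W3 + "res/Audi"),
    (W3 + "res/Audi", W3 + "rel/sameAs", "http://en.wikipedia.org/wiki/Audi"),
]


def build_graph(triples):
    """Return (out_edges, in_edges), each mapping node -> [(property, neighbor)]."""
    out_edges = defaultdict(list)
    in_edges = defaultdict(list)
    for subj, prop, obj in triples:
        out_edges[subj].append((prop, obj))  # edge leaving the subject
        in_edges[obj].append((prop, subj))   # same edge, seen from the object
    return out_edges, in_edges


out_edges, in_edges = build_graph(TRIPLES)
```

Keeping both directions makes it cheap to ask which properties point out of a resource and which point into it.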

Page 8: Linked data: Predicting missing properties

Linked Data cloud diagram

Page 9: Linked data: Predicting missing properties

DBpedia

DBpedia extracts its information from the infoboxes of Wikipedia pages.

Resource                                       Property     Resource / Literal
en.wikipedia.org/wiki/University_of_Ljubljana  Location     http://en.wikipedia.org/wiki/Ljubljana (resource)
en.wikipedia.org/wiki/University_of_Ljubljana  Established  “1919” (literal)

Page 10: Linked data: Predicting missing properties

DBpedia

DBraw: contains all the properties from all the infoboxes within the English Wikipedia articles.

DBmapped: the properties are unified (mapped onto a DBpedia ontology).

Semantics of properties: PlaceOfBirth = BirthPlace

The data is much cleaner and better structured than the raw properties dataset.

Page 11: Linked data: Predicting missing properties

Freebase

An entity graph of people, places and things, built by people.

- Collaborative knowledge base

- Property schemas

- Google Knowledge Graph

Page 12: Linked data: Predicting missing properties

Scale of Datasets

Dataset   #nodes  #edges  #objects  #properties  avgDeg
DBmapped  5M      17M     2M        1296         5.92
DBraw     11M     47M     3M        44463        8.45
Freebase  141M    607M    23M       19700        8.58

DBpedia version 3.7 (additional properties and resources may have been added in the meantime).

Freebase: largest and most structured dataset (large number of edges and objects, and a relatively small number of properties).

DBraw: messy and noisy dataset (large number of different properties because they are not unified).

Page 13: Linked data: Predicting missing properties

Missing properties

Problem: What are the missing properties for Fiat?

For a given resource, we want a ranking of its missing properties by likelihood.

Page 14: Linked data: Predicting missing properties

Approach

- Similar objects

- Measure of similarity

- Neighborhood

- Ranking function

Page 15: Linked data: Predicting missing properties

Approach

Ranking = weighted average of the k nearest-neighbor objects’ property frequency vectors.

General framework (Kernel smoother):

We can replace d with a normalized kernel function. (More math on this topic is in the paper.)

The function g(o) depends on the choice of the measure of closeness d(o, o_i).
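The slide's formula is not preserved in this transcript, so the following is only a sketch of one plausible reading: each candidate property is scored by the closeness-weighted fraction of the k nearest neighbors that carry it, with the weights normalized to sum to one. The function name, the use of property sets (binary indicators rather than counts), and the tie-breaking rule are assumptions.

```python
from collections import Counter


def rank_missing_properties(query_props, neighbors):
    """Rank the properties that the query object does not have yet.

    query_props: set of properties already present on the query object o.
    neighbors:   list of (weight, property_set) pairs for the k nearest
                 objects o_i, where weight plays the role of d(o, o_i)
                 and a larger weight means a closer object.
    Returns [(property, score)] sorted by descending score.
    """
    total = sum(weight for weight, _ in neighbors)
    if total <= 0:
        return []
    scores = Counter()
    for weight, props in neighbors:
        for prop in props:
            if prop not in query_props:
                scores[prop] += weight / total  # normalized weighted vote
    # Ties are broken alphabetically so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


ranked = rank_missing_properties(
    {"name", "industry"},
    [(3.0, {"name", "industry", "founder"}), (1.0, {"name", "location"})],
)
```

Any of the closeness measures discussed later can supply the weights; only the `neighbors` list changes.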

Page 16: Linked data: Predicting missing properties

Evaluation protocol

The evaluation procedure:

1. For a given object, we delete one or more of its properties, denoting (o, {p1, …, pk} )

2. Run the recommendation algorithm for the object

3. Compute several evaluation metrics
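The three steps above can be sketched as a small hold-out routine. The names `hold_out` and `evaluate_object`, and the convention that the recommender returns a best-first list of properties, are hypothetical choices for illustration.

```python
import random


def hold_out(obj_props, k, rng):
    """Split an object's properties into (kept, deleted), deleting k of them."""
    props = sorted(obj_props)            # sorted for reproducible sampling
    deleted = set(rng.sample(props, k))
    return set(props) - deleted, deleted


def evaluate_object(obj_props, k, recommend, rng):
    """Delete k properties, rerun the recommender, and look up their ranks.

    recommend(kept) must return a list of candidate properties, best first.
    Returns {deleted_property: 1-based rank, or None if it was not returned}.
    """
    kept, deleted = hold_out(obj_props, k, rng)
    ranking = recommend(kept)
    position = {prop: i + 1 for i, prop in enumerate(ranking)}
    return {prop: position.get(prop) for prop in deleted}


def toy_recommend(kept):
    # Toy recommender: proposes a fixed vocabulary minus what the object has.
    return [p for p in ["a", "b", "c", "d", "e"] if p not in kept]


result = evaluate_object({"a", "b", "c", "d"}, 2, toy_recommend, random.Random(0))
```

The ranks returned here are the input to whatever evaluation metrics are computed in step 3.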

Page 17: Linked data: Predicting missing properties

Evaluation metrics

- Inverse rank (IRank)

- Top 5

- Top 10
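The definitions of these metrics did not survive in the transcript. A common reading, assumed here and not confirmed by the slides, is that IRank is the reciprocal of the rank at which a deleted property reappears, and Top 5 / Top 10 indicate whether that rank is at most 5 or 10, each averaged over all deleted properties.

```python
def inverse_rank(rank):
    """1/rank for a 1-based rank; 0.0 if the property was never ranked."""
    return 1.0 / rank if rank else 0.0


def top_n(rank, n):
    """1.0 if the property was ranked within the first n positions, else 0.0."""
    return 1.0 if rank is not None and rank <= n else 0.0


def mean_metrics(ranks):
    """Average the three metrics over the ranks of all deleted properties."""
    count = len(ranks)
    return {
        "IRank": sum(inverse_rank(r) for r in ranks) / count,
        "Top5": sum(top_n(r, 5) for r in ranks) / count,
        "Top10": sum(top_n(r, 10) for r in ranks) / count,
    }


# Four deleted properties: found at ranks 1, 4 and 7, and one never recovered.
metrics = mean_metrics([1, 4, 7, None])
```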

Page 18: Linked data: Predicting missing properties

Measure of Closeness

- Local Measures: local graph properties

- Baselines:
  - Random Objects
  - Objects with Common Properties
  - Property Co-occurrence

- Global Measures: global graph properties

- Exogenous Measures: external information (text)

Page 19: Linked data: Predicting missing properties

Local Graph Measures

We focus on a local description, based on the property distributions:

- PropertyCount

- DirPropertyCount

- NeighbDirPropertyCount

Page 20: Linked data: Predicting missing properties

Random objects

Choose some number of objects in the network uniformly at random.

Page 21: Linked data: Predicting missing properties

Objects with common properties

Take the objects which share a minimum number of properties with the query object

The number of shared properties is taken as the weight for the object
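A minimal sketch of this baseline, assuming each object is represented simply as a set of property names. The function name and the `min_shared` threshold parameter are illustrative, not taken from the slides.

```python
def common_property_neighbors(query_props, objects, min_shared=2):
    """Select objects that share at least min_shared properties with the query.

    objects: dict mapping object name -> set of properties
             (the query object itself should not be included).
    Returns [(weight, name)], where weight is the number of shared
    properties, sorted by descending weight.
    """
    neighbors = []
    for name, props in objects.items():
        shared = len(query_props & props)
        if shared >= min_shared:
            neighbors.append((shared, name))
    return sorted(neighbors, key=lambda pair: (-pair[0], pair[1]))


objects = {
    "Audi": {"name", "industry", "founder"},
    "BMW": {"name", "industry", "location", "founder"},
    "Ljubljana": {"name", "country"},
}
selected = common_property_neighbors({"name", "industry", "location"}, objects)
```

The returned weights can be used directly as the closeness values in the weighted average.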

Page 22: Linked data: Predicting missing properties

Property Co-occurrence

Approximate resource similarities through property co-occurrence patterns

Only pairwise co-occurrences are considered for the purposes of scalability and feasibility of estimation
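The slides do not give the estimator, so the sketch below shows only one plausible pairwise scheme: count how often two properties appear on the same object, then score a candidate by the average conditional frequency count(candidate, q) / count(q) over the query's existing properties q. All names and the averaging rule are assumptions.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(objects):
    """Count single properties and unordered property pairs over all objects."""
    single, pair = Counter(), Counter()
    for props in objects:
        single.update(props)
        for a, b in combinations(sorted(props), 2):
            pair[(a, b)] += 1          # key is always in sorted order
    return single, pair


def score_candidates(query_props, single, pair):
    """Rank properties missing from query_props by average co-occurrence rate."""
    scores = {}
    for cand in single:
        if cand in query_props:
            continue
        total = 0.0
        for q in query_props:
            key = (cand, q) if cand < q else (q, cand)
            if single[q]:
                total += pair[key] / single[q]   # estimate of P(cand | q)
        scores[cand] = total / len(query_props) if query_props else 0.0
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


data = [
    {"name", "birthPlace", "birthDate"},
    {"name", "birthPlace"},
    {"name", "location"},
]
single, pair = cooccurrence_counts(data)
ranked = score_candidates({"name", "birthPlace"}, single, pair)
```

Storing only pairs keeps the table at most quadratic in the number of properties, which matches the scalability argument on the slide.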

Page 23: Linked data: Predicting missing properties

Our method

Each object is described by a DirPropertyCount vector.

The similarity is determined by computing the dot product between DirPropertyCount vectors.
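The transcript does not define DirPropertyCount precisely. The sketch below assumes it counts, for each (direction, property) pair, how many incoming or outgoing edges of the object carry that property, and stores the vector sparsely. The entity names in the toy data are made up for illustration.

```python
from collections import Counter


def dir_property_count(node, triples):
    """Sparse vector: (direction, property) -> number of such edges at node."""
    vec = Counter()
    for subj, prop, obj in triples:
        if subj == node:
            vec[("out", prop)] += 1
        if obj == node:
            vec[("in", prop)] += 1
    return vec


def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u                     # iterate over the shorter vector
    return sum(count * v[key] for key, count in u.items() if key in v)


triples = [
    ("Audi", "industry", "Automotive_industry"),
    ("Fiat", "industry", "Automotive_industry"),
    ("Claus_Luthe", "employer", "Audi"),
    ("Some_Engineer", "employer", "Fiat"),
    ("Audi", "name", '"Audi"'),
]
audi = dir_property_count("Audi", triples)
fiat = dir_property_count("Fiat", triples)
industry = dir_property_count("Automotive_industry", triples)
```

Here Audi and Fiat match on one outgoing and one incoming property, while the industry node shares no direction-labeled property with Audi.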

Page 24: Linked data: Predicting missing properties

Comparison

Page 25: Linked data: Predicting missing properties

Other Measures of Closeness

- Local Measures: local graph properties

- Baselines:
  - Random Objects
  - Objects with Common Properties
  - Property Co-occurrence

- Global Measures: global graph properties

- Exogenous Measures: external (non-graph) information

Page 26: Linked data: Predicting missing properties

Global Graph Measures

We use two global measures of closeness based on graph geodesics and graph diffusion. (We treat the graph as a simple undirected graph. We also remove all the literals and constants from the set of nodes to remove unintuitive paths.)

- Shortest path length
  - The length of a shortest path between two objects
  - We calculate the distances corresponding to the k nearest objects
- Exponential diffusion kernel
  - Based on computing the matrix exponential of the graph adjacency matrix A
  - Parameter α controls how local/global the similarities are
  - Takes into account both the total number of paths between nodes as well as their respective lengths
  - Robust measure
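Both global measures can be sketched on a toy graph. The code below assumes an undirected graph given as an adjacency mapping for breadth-first search, and approximates exp(αA) with a truncated Taylor series in pure Python. That is workable only for tiny matrices; a real dataset would need a sparse or approximate method, which the slides do not describe. Function names and the `terms` cutoff are assumptions.

```python
from collections import deque


def k_nearest_by_hops(adj, source, k):
    """Breadth-first search; returns up to k (node, distance) pairs, nearest first."""
    dist = {source: 0}
    queue = deque([source])
    found = []
    while queue and len(found) < k:
        u = queue.popleft()
        for v in sorted(adj[u]):
            if v not in dist:
                dist[v] = dist[u] + 1
                found.append((v, dist[v]))
                queue.append(v)
    return found[:k]


def diffusion_kernel(A, alpha, terms=20):
    """Approximate exp(alpha * A) by the first `terms` terms of its power series."""
    n = len(A)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]          # current term, starts at identity
    for step in range(1, terms):
        # term_step = term_{step-1} @ A * alpha / step = (alpha*A)^step / step!
        term = [
            [sum(term[i][m] * A[m][j] for m in range(n)) * alpha / step
             for j in range(n)]
            for i in range(n)
        ]
        for i in range(n):
            for j in range(n):
                result[i][j] += term[i][j]
    return result


# Path graph 0 - 1 - 2.
adj = {0: {1}, 1: {0, 2}, 2: {1}}
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
nearest = k_nearest_by_hops(adj, 0, 2)
K = diffusion_kernel(A, 0.5)
```

In the kernel, entry (i, j) sums over all walks between i and j, with a walk of length n weighted by αⁿ/n!, so a small α emphasizes short walks and a larger α lets longer walks contribute.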

Page 27: Linked data: Predicting missing properties

Exogenous Measures

- Independent of the graph structure
- Rely on additional external information about the objects
- Helpful for nodes with few connections in the graph

Textual information:
- For some of the objects, we have extended abstracts describing the objects
- TF-IDF weighting + cosine similarity
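A minimal sketch of the text-based measure, assuming whitespace-tokenized abstracts, raw term counts for TF, and idf = log(N / df). The slides do not state which TF-IDF variant was used, and the three toy abstracts are invented for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse {term: tf * idf} dict per doc."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))               # document frequency of each term
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


abstracts = [
    "german car maker".split(),
    "italian car maker".split(),
    "university in ljubljana".split(),
]
vecs = tfidf_vectors(abstracts)
sim_cars = cosine(vecs[0], vecs[1])
sim_unrelated = cosine(vecs[0], vecs[2])
```

Because the similarity uses no graph edges at all, it can still find neighbors for an object that has very few properties.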

Page 28: Linked data: Predicting missing properties

Results - IRank

Page 29: Linked data: Predicting missing properties

Results - Top10

Page 30: Linked data: Predicting missing properties

In vs. Out properties

Page 31: Linked data: Predicting missing properties

Deleting several properties

Method: DirPropertyCount vector

Dataset: DBraw

We remove a fixed fraction of in and out properties.

Page 32: Linked data: Predicting missing properties

Degradation – nodes / edges

The negative effect of deleting a fraction of edges or nodes from the network

Page 33: Linked data: Predicting missing properties

Degradation – properties

The effect of deleting the K most frequent properties from the network

Page 34: Linked data: Predicting missing properties

Conclusion

- Method for predicting missing properties

- Use kernel smoother

- Measure similarity in a number of different ways:
  - Local properties
  - Global graph structure
  - External data (text)

- Extensive experimentation

- Investigate more on combining measures

- More details about the research are in the papers:
  - Linked data: Predicting missing properties [machine learning]
  - Predicting Instance Properties in Linked Data [semantics of data]

Page 35: Linked data: Predicting missing properties

Take home message

- Big redundancy / regularity in the data

- Local measures perform well

- Scale changes the structure -> we need a different method

Page 36: Linked data: Predicting missing properties

Questions?