1 ESTEEM: Trust-aware P2P data integration Carola Aiello,Tiziana Catarci, Diego Milano, Monica...

Post on 27-Mar-2015

1

ESTEEM: Trust-aware P2P data integration

Carola Aiello,Tiziana Catarci, Diego Milano, Monica Scannapieco

Dipartimento di Informatica e Sistemistica Università di Roma “La Sapienza”

2

Outline

• Previous projects
• ESTEEM objectives
• Research issues and directions of the unit
  • Data quality: quality-aware query processing
  • Privacy: privacy-aware record matching
  • Trust: trust model for data sources

3

DaQuinCIS project (2003)

• MIUR – COFIN/PRIN
• Main focus: data quality in cooperative information systems (CISs)
• Data quality problems:
  • Record matching
  • Quality-driven query processing

4

Motivations

A real example: an e-Government project to integrate data about Italian companies

[Figure: the query "Company XYZ?" is posed to a DATA INTEGRATION LAYER placed over three sources: Chambers of Commerce, Social Insurance Agency, Accident Insurance Agency]

5

[Figure: the three sources (Chambers of Commerce, Social Insurance Agency, Accident Insurance Agency) and the company attributes: Id, Name, Type of activity, Address, City]

6

The Three Real Records

ID            | Type of Activity                 | City              | Name                                             | Address
CNCBTB765SDV  | Retail of bovine and ovine meats | Novi Ligure       | Meat production of Bartoletti Benito             | National Street dei Giovi
0111232223    | Grocer’s shop, beverages         | Pizzolo Formigaro | Bartoletti Benito Meat production                | 9, Rome Street
CNCBTR765LDV  | Butcher                          | Ovada             | Meat production in Piemonte of Bartoletti Benito | 4, Mazzini Square

Which is the actual company XYZ to be returned to the client?

• One of the 3? Which one?
• A “merge” of the 3?

7

Objectives of the Research

Given a set of distributed and heterogeneous data sources that are affected by data quality problems:

1. Improve the quality of each data source
   • Record matching across sources

2. Provide unified and transparent access to the data sources
   • Data integration & quality-driven query processing
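
As an illustration of record matching across sources, the sketch below compares the company names of the three records from the motivating example. The similarity measure (Jaccard overlap of word tokens) and the thresholds are assumptions chosen for illustration; the slides do not state which matching technique the project uses.

```python
from itertools import combinations

# The three records from the motivating example (ID, name and city only).
records = [
    {"id": "CNCBTB765SDV", "name": "Meat production of Bartoletti Benito", "city": "Novi Ligure"},
    {"id": "0111232223", "name": "Bartoletti Benito Meat production", "city": "Pizzolo Formigaro"},
    {"id": "CNCBTR765LDV", "name": "Meat production in Piemonte of Bartoletti Benito", "city": "Ovada"},
]


def name_similarity(a: str, b: str) -> float:
    """Jaccard similarity over lower-cased word tokens (word order is ignored)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def candidate_matches(recs, threshold):
    """Return (id1, id2, score) for every pair whose name similarity reaches the threshold."""
    pairs = []
    for r1, r2 in combinations(recs, 2):
        score = name_similarity(r1["name"], r2["name"])
        if score >= threshold:
            pairs.append((r1["id"], r2["id"], round(score, 3)))
    return pairs


# The threshold value is a hypothetical tuning parameter.
for pair in candidate_matches(records, threshold=0.5):
    print(pair)
```

Note that the identifiers alone would not link these records: the three IDs all differ, so the match has to rely on descriptive attributes such as the name.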

8

Improving quality of addresses in Italian PA (2004)

• Collaboration agreement between AIPA (now CNIPA) and ISTAT, April 2002 to July 2004
• Proposal of standard formats for the acquisition and interchange of addresses
• Proposal for redesigning the flows used to update addresses
• Methodology for measuring the quality of addresses
• Experimental measurement of address quality in three national archives: Agenzia delle Entrate, Camere di Commercio, INPS

9

Data Quality and Data Privacy (Current)

• Joint activity with Purdue University, Indiana, USA
• Publishing elementary data may violate privacy requirements, even when data are anonymized
  • Anonymization removes principal identifiers such as SSN, Name+Surname+DOB, etc.
• Privacy-aware record matching
  • Only the result of the intersection (A ∩ B) across data sets is shared and nothing else (neither A − (A ∩ B) nor B − (A ∩ B))

10

ESTEEM Objectives

• Study of trust and data quality issues in P2P systems
• Specification of P2P data integration systems with trust requirements
• Definition of quality- and trust-aware query processing algorithms

11

P2P Systems

• P2P systems
  • Loosely coupled, dynamic, open
• Data sharing in such systems
  • No centralized global schema
  • Peer mappings are built dynamically
  • New peers can make new data schemas available

12

Data Quality

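EmployeeS1

EmployeeID | Name     | Surname | Salary | Email
arpa78     | John     | Smith   | 2600   | smith@abc.it
eugi98     | Edward   | Monroe  | 1500   | monroe@abc.it
ghjk09     | Anthony  | White   | 1250   | white@abc.it
dref43     | Marianne | Collins | 1150   | collins@abc.it

EmployeeS2

EmployeeID | Name     | Surname | Salary | Email
arpa78     | John     | Smith   | 2000   | smith@abc.it
eugi98     | Edward   | Monroe  | 1500   | monroe@abc.it
ghjk09     | Anthony  | Wite    | 1250   | white@abc.it
treg23     | Marianne | Collins | 1150   | collins@abc.it

• Attribute conflict: same key, different attribute values (e.g. the Salary of arpa78 is 2600 in EmployeeS1 and 2000 in EmployeeS2)
• Key conflict: same real-world entity, different keys (Marianne Collins is dref43 in EmployeeS1 and treg23 in EmployeeS2)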

13

Quality-aware query processing - 1

• Key conflicts require the application of record matching techniques
• Attribute conflicts are solved by query-time conflict resolution techniques
• The resolution of such conflicts in P2P systems is an open issue:
  • Definition of a quality-aware semantics for query answering in P2P systems
  • Need to develop techniques for solving such conflicts according to the defined semantics
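
The two kinds of conflict can be detected mechanically on the EmployeeS1 and EmployeeS2 tables from the Data Quality slide. The sketch below is only an illustration: it finds key conflicts by exact agreement on Name, Surname and Email, which is an assumed stand-in for a real record matching technique, and the function names are hypothetical.

```python
# The two employee tables from the Data Quality slide, keyed by EmployeeID.
S1 = {
    "arpa78": {"Name": "John", "Surname": "Smith", "Salary": 2600, "Email": "smith@abc.it"},
    "eugi98": {"Name": "Edward", "Surname": "Monroe", "Salary": 1500, "Email": "monroe@abc.it"},
    "ghjk09": {"Name": "Anthony", "Surname": "White", "Salary": 1250, "Email": "white@abc.it"},
    "dref43": {"Name": "Marianne", "Surname": "Collins", "Salary": 1150, "Email": "collins@abc.it"},
}
S2 = {
    "arpa78": {"Name": "John", "Surname": "Smith", "Salary": 2000, "Email": "smith@abc.it"},
    "eugi98": {"Name": "Edward", "Surname": "Monroe", "Salary": 1500, "Email": "monroe@abc.it"},
    "ghjk09": {"Name": "Anthony", "Surname": "Wite", "Salary": 1250, "Email": "white@abc.it"},
    "treg23": {"Name": "Marianne", "Surname": "Collins", "Salary": 1150, "Email": "collins@abc.it"},
}


def attribute_conflicts(s1, s2):
    """For keys present in both sources, report attributes whose values differ."""
    conflicts = {}
    for key in s1.keys() & s2.keys():
        diff = {
            attr: (s1[key][attr], s2[key].get(attr))
            for attr in s1[key]
            if s1[key][attr] != s2[key].get(attr)
        }
        if diff:
            conflicts[key] = diff
    return conflicts


def key_conflicts(s1, s2, match_on=("Name", "Surname", "Email")):
    """Report pairs of records with different keys that agree on all match_on attributes."""
    pairs = []
    for k1, r1 in s1.items():
        for k2, r2 in s2.items():
            if k1 != k2 and all(r1[a] == r2[a] for a in match_on):
                pairs.append((k1, k2))
    return pairs


print(attribute_conflicts(S1, S2))
print(key_conflicts(S1, S2))
```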

14

Quality-aware query processing - 2

Query language supporting the specification of conflict resolution strategies

Important in P2P systems: search space pruning on the basis of the quality characterization of sources
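
A minimal sketch of how the two ideas on this slide could fit together: sources whose quality falls below a threshold are pruned, and the remaining conflicting values are resolved with a strategy chosen by the user. The quality scores, the strategy names and the `resolve` function are all hypothetical; the slides do not define a concrete query language or quality model.

```python
# Hypothetical per-source quality scores in [0, 1]; not taken from the slides.
source_quality = {"S1": 0.9, "S2": 0.6, "S3": 0.2}


def prune_sources(quality, min_quality):
    """Keep only the sources whose quality score reaches min_quality."""
    return [source for source, q in quality.items() if q >= min_quality]


def best_source(candidates, quality):
    """Strategy: take the value coming from the highest-quality source."""
    return max(candidates, key=lambda sv: quality[sv[0]])[1]


def max_value(candidates, quality):
    """Strategy: take the largest value, regardless of its source."""
    return max(value for _, value in candidates)


# Strategy names a query could refer to (hypothetical).
STRATEGIES = {"best_source": best_source, "max": max_value}


def resolve(candidates, strategy, quality, min_quality=0.0):
    """Resolve conflicting (source, value) pairs after pruning low-quality sources."""
    kept = set(prune_sources(quality, min_quality))
    usable = [(s, v) for s, v in candidates if s in kept]
    if not usable:
        return None  # every source was pruned
    return STRATEGIES[strategy](usable, quality)


# Conflicting salary values reported by three sources (S3's value is made up).
salary_candidates = [("S1", 2600), ("S2", 2000), ("S3", 9999)]
print(resolve(salary_candidates, "best_source", source_quality, min_quality=0.5))
print(resolve(salary_candidates, "max", source_quality, min_quality=0.5))
```

Pruning before resolution means a low-quality source never influences the answer, even under a strategy such as `max` that ignores quality when choosing among the surviving values.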

15

Privacy

• How to protect privacy when sharing data?
• With sources S1 and S2 issuing queries Q1 and Q2 respectively, at the end of the interaction:
  • S1 must learn the result of Q1 and nothing else
  • S2 must learn the result of Q2 and nothing else

[Figure: S1 and S2 exchange Query Q1, Query Q2, Result Q1 and Result Q2]

16

Privacy-aware Record Matching - 1

[Figure: data sets A and B and their intersection A ∩ B]

• Secure set intersection: (i) exact matching; (ii) not record-based; (iii) costly
• Private data sharing: (i) exact matching; (ii) schema-unaware
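
The exact-matching limitation can be seen in a toy example. The sketch below compares keyed hashes of values instead of the values themselves; it is not a secure set intersection protocol, and the shared key and function names are assumptions made for illustration only. Because a hash either matches or does not, a single typo ("Wite" instead of "White", as in the EmployeeS2 table) removes a record from the intersection.

```python
import hashlib
import hmac


def blind(value: str, key: bytes) -> str:
    """Keyed hash (HMAC-SHA256) of a whitespace- and case-normalized value."""
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()


def blinded_intersection(a, b, key):
    """Return the values of a whose blinded form also appears among the blinded values of b."""
    blinded_a = {blind(v, key): v for v in a}
    blinded_b = {blind(v, key) for v in b}
    return {v for h, v in blinded_a.items() if h in blinded_b}


# Hypothetical shared secret; a real protocol would not rely on a plain shared key.
shared_key = b"example-key"

surnames_a = ["Smith", "Monroe", "White", "Collins"]
surnames_b = ["Smith", "Monroe", "Wite", "Collins"]

# "White" is missing from the result: the typo "Wite" hashes to a different value.
print(sorted(blinded_intersection(surnames_a, surnames_b, shared_key)))
```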

17

Privacy-aware Record Matching - 2

• Algorithms that enable privacy-aware record matching in P2P contexts
• The third-party problem
• First proposals: El Abbadi, ICDE 2006, but with exact matching only

18

Trust

• Trust is typically associated with a source as a whole
• Need for a finer-grained characterization
  • E.g.: the Ministero delle Finanze is reliable with respect to Codici Fiscali (tax codes)

19

Trust model for data sources - 1

Previous proposals: the whole organization (peer)

Our proposal: <Organization, Data Type>

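[Formula: the trust value R(Org_k, D) of organization Org_k with respect to data type D, expressed in terms of:
  • C_{i,k,D}, for i = 1..n: # of <D, Org_k> complaints sent by Org_i
  • O(Org_k, D): # of D-exchanges of Org_k]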
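
The sketch below shows one simple way to turn complaint and exchange counts into a trust value per <Organization, Data Type> pair. The specific ratio used here (1 minus total complaints over total exchanges, clipped at 0) is an assumption for illustration and is not necessarily the formula on the slide; the class and the example names are hypothetical.

```python
from collections import defaultdict


class TrustRegistry:
    """Centralized store of exchange and complaint counts per (organization, data type)."""

    def __init__(self):
        # (org, data_type) -> number of exchanges of that data type by org
        self.exchanges = defaultdict(int)
        # (org, data_type) -> {complaining org -> number of complaints sent}
        self.complaints = defaultdict(lambda: defaultdict(int))

    def record_exchange(self, org, data_type):
        self.exchanges[(org, data_type)] += 1

    def record_complaint(self, complainer, org, data_type):
        self.complaints[(org, data_type)][complainer] += 1

    def trust(self, org, data_type):
        """Assumed form: 1 - complaints / exchanges, clipped at 0; None if no exchanges."""
        n_exchanges = self.exchanges.get((org, data_type), 0)
        if n_exchanges == 0:
            return None
        n_complaints = sum(self.complaints.get((org, data_type), {}).values())
        return max(0.0, 1.0 - n_complaints / n_exchanges)


registry = TrustRegistry()
for _ in range(10):
    registry.record_exchange("MinisteroFinanze", "CodiceFiscale")
    registry.record_exchange("MinisteroFinanze", "Address")
registry.record_complaint("OrgA", "MinisteroFinanze", "CodiceFiscale")
for complainer in ("OrgA", "OrgB", "OrgB", "OrgC"):
    registry.record_complaint(complainer, "MinisteroFinanze", "Address")

# The same organization gets a different trust value for each data type.
print(registry.trust("MinisteroFinanze", "CodiceFiscale"))
print(registry.trust("MinisteroFinanze", "Address"))
```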

20

Trust model for data sources - 2

• Drawback: centralized
• Need for:
  • A decentralized model
  • A more flexible model (e.g. trust associated with views)

21

Trust model for data sources - 3

• More general trust characterization based on the evaluation of a peer’s assertions on some metadata:
  • Data quality-aware: trust computed on the basis of the declared quality of the provided data
  • Privacy-aware: trust computed on the basis of the declared privacy level
    • Different roles for providers and consumers: e.g. a provider can decide not to release data if a requester is not privacy-trusted (or to adopt a specific technique)