Quality views: capturing and exploiting the user perspective on data quality Paolo Missier, Suzanne Embury, Mark Greenwood School of Computer Science University of Manchester, UK Alun Preece, Binling Jin Department of Computing Science University of Aberdeen, UK http://www.qurator.org

Page 1:

Quality views: capturing and exploiting the user perspective on data quality

Paolo Missier, Suzanne Embury, Mark Greenwood
School of Computer Science
University of Manchester, UK

Alun Preece, Binling Jin
Department of Computing Science
University of Aberdeen, UK

http://www.qurator.org

Page 2:

Combining the strengths of UMIST and The Victoria University of Manchester

Integration of public data (in biology)

GenBank

UniProt

EnsEMBL

Entrez

dbSNP

• Large volumes of data in many public repositories

• Increasingly creative uses for this data

• Their quality is largely unknown

Page 3:

Quality of e-science data

Defining quality can be challenging:

• In-silico experiments express cutting-edge research

– Experimental data liable to change rapidly

– Definitions of quality are themselves experimental

• Scientists’ quality requirements often just a hunch

– Quality tests missing or based on experimental heuristics

– Often implicit and embedded in the experiment, hence not reusable

A data consumer’s view on quality:

Criteria for data acceptability within a specific data processing context

Page 4:

Example: protein identification

Data output

Protein identification algorithm

“Wet lab” experiment

Reference databases

Protein Hitlist

Protein function prediction

Quality filtering:

• Remove likely false positives

• Improve prediction accuracy

Goal: to explicitly define and automatically add the additional filtering step in a principled way

Support evidence: provenance metadata

Page 5:

Our goals

Offer e-scientists a principled way to:

• Discover quality definitions for specific data domains

• Make them explicit using a formal model

• Implement them in their data processing environment

• Test them on their data

… in an incremental refinement cycle

Benefits:

• Automated processing

• Reusability

• “plug-in” quality components

Page 6:

Approach

Research hypothesis: adding quality to data can be made cost-effective

– By separating out generic quality processing from domain-specific definitions

Define abstract quality views on the data

Map quality view to an executable process

Execute quality views

Qurator architectural framework:

– runtime environment

– data-specific quality services

Page 7:

Abstract quality view model

[Figure: abstract quality view model. Labels: Data; Data annotation; Quality Metadata; Evidence (e1, e2, e3; examples: Coverage, PeptidesCount); Assertions (Classification 1 over class space 1 = {C11, C12, …}; Classification 2 over class space 2 = {C21, C22, …}); Conditions: regions specification; Actions on regions]

Page 8:

Semantic model for quality concepts

Quality “upper ontology” (OWL)

Quality evidence types

Evidence annotations are class instances

Evidence meta-data model (RDF)

Page 9:

Quality hypotheses discovery and testing

[Figure: quality hypothesis refinement cycle. Labels: Abstract quality view; Targeted compilation; Target-specific quality component; Deployment; Quality-enhanced user environment; Execution on test data; Performance assessment]

Multiple target environments:

• Workflow

• Query processor
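The "execution on test data" and "performance assessment" steps can be sketched as follows. This is an illustrative sketch only: the slides do not say which measures Qurator uses, so precision and recall are an assumption here, and the function name `assess`, the test items, and the thresholds are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Evidence = Dict[str, float]


def assess(
    quality_filter: Callable[[Evidence], bool],
    test_data: List[Tuple[Evidence, bool]],
) -> Tuple[float, float]:
    """Return (precision, recall) of a filter on labelled test items.

    Each test item pairs its evidence with a label saying whether the
    item is known to be a correct result.
    """
    tp = sum(1 for ev, good in test_data if quality_filter(ev) and good)
    fp = sum(1 for ev, good in test_data if quality_filter(ev) and not good)
    fn = sum(1 for ev, good in test_data if not quality_filter(ev) and good)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labelled test set.
TEST_DATA = [
    ({"Coverage": 35.0}, True),
    ({"Coverage": 20.0}, False),
    ({"Coverage": 8.0}, True),
    ({"Coverage": 5.0}, False),
]

# Two competing quality hypotheses, compared on the same test data.
first = assess(lambda ev: ev["Coverage"] > 12, TEST_DATA)
second = assess(lambda ev: ev["Coverage"] > 30, TEST_DATA)
print(first, second)
```

Comparing the two results is one way to drive the refinement cycle: a hypothesis is kept, tightened, or discarded depending on how it performs on the test data.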

Page 10:

Generic quality process pattern

Collect evidence

– Fetch persistent annotations

– Compute on-the-fly annotations

<variables>
  <var variableName="Coverage" evidence="q:Coverage"/>
  <var variableName="PeptidesCount" evidence="q:PeptidesCount"/>
</variables>

Compute assertions (classifiers)

<QualityAssertion
    serviceName="PIScoreClassifier"
    serviceType="q:PIScoreClassifier"
    tagSemType="q:PIScoreClassification"
    tagName="ScoreClass"

Evaluate conditions, execute actions

<action>
  <filter>
    <condition>ScoreClass in {"q:high", "q:mid"} and Coverage > 12</condition>
  </filter>
</action>

Persistent evidence
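A minimal Python sketch of the pattern on this slide, not Qurator's actual implementation: evidence is collected per data item, a classifier assertion tags each item with a ScoreClass, and a filter action keeps the items that satisfy the condition from the XML fragment. The hit identifiers, evidence values, and classifier thresholds are invented for illustration.

```python
from typing import Dict, List

Evidence = Dict[str, float]

# Hypothetical persistent evidence, keyed by protein hit identifier.
PERSISTENT_EVIDENCE: Dict[str, Evidence] = {
    "P1": {"Coverage": 35.0, "PeptidesCount": 9},
    "P2": {"Coverage": 8.0, "PeptidesCount": 6},
    "P3": {"Coverage": 20.0, "PeptidesCount": 1},
}


def collect_evidence(items: List[str]) -> Dict[str, Evidence]:
    """Step 1: fetch the evidence annotations for each data item."""
    return {item: dict(PERSISTENT_EVIDENCE[item]) for item in items}


def pi_score_classifier(evidence: Evidence) -> str:
    """Step 2: a quality assertion mapping evidence to a class label.

    The thresholds are assumptions, not the real PIScoreClassifier.
    """
    if evidence["PeptidesCount"] >= 8:
        return "q:high"
    if evidence["PeptidesCount"] >= 4:
        return "q:mid"
    return "q:low"


def condition(tags: Dict[str, str], evidence: Evidence) -> bool:
    """Step 3: ScoreClass in {"q:high", "q:mid"} and Coverage > 12."""
    return tags["ScoreClass"] in {"q:high", "q:mid"} and evidence["Coverage"] > 12


def run_quality_view(items: List[str]) -> List[str]:
    """Step 4: the filter action keeps only items meeting the condition."""
    evidence = collect_evidence(items)
    kept = []
    for item in items:
        tags = {"ScoreClass": pi_score_classifier(evidence[item])}
        if condition(tags, evidence[item]):
            kept.append(item)
    return kept


print(run_quality_view(["P1", "P2", "P3"]))
```

Only P1 passes: P2 is classified q:mid but its Coverage is below 12, and P3 has enough Coverage but is classified q:low.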

Page 11:

Bindings: assertion service

service class        Web service endpoint
PIScoreClassifier    http://localhost/axis/services/PIScoreClassifierSvc

All services implement the same WSDL interface

• Makes concrete assertion functions homogeneous

• Facilitates compilation

• Uniform input / output messages

[Figure labels: PIScoreClassifierSvc; PI_Top_k_svc; Common WSDL interface; (service registry)]

Input: D = {(d_i, evidence(d_i))}

Output: {class(d_i)} or {score(d_i)}
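The idea of a single interface shared by all assertion services can be sketched in Python. This is an in-process analogy under stated assumptions, not the WSDL services themselves: the classes `ScoreClassifier` and `TopK`, the `score` evidence key, the 0.5 threshold, and the `REGISTRY` dictionary are hypothetical stand-ins.

```python
from typing import Dict, List, Protocol, Tuple

Evidence = Dict[str, float]
DataSet = List[Tuple[str, Evidence]]  # D = {(d_i, evidence(d_i))}


class AssertionService(Protocol):
    """The shared interface: same input message, same output shape."""

    def assert_quality(self, data: DataSet) -> Dict[str, object]: ...


class ScoreClassifier:
    """Stand-in for a classifier service; returns {class(d_i)}."""

    def assert_quality(self, data: DataSet) -> Dict[str, object]:
        return {d: ("q:high" if ev["score"] >= 0.5 else "q:low") for d, ev in data}


class TopK:
    """Stand-in for a scoring service; returns {score(d_i)}."""

    def assert_quality(self, data: DataSet) -> Dict[str, object]:
        return {d: ev["score"] for d, ev in data}


# Hypothetical registry: service class name -> bound implementation.
REGISTRY: Dict[str, AssertionService] = {
    "PIScoreClassifier": ScoreClassifier(),
    "PI_Top_k": TopK(),
}


def invoke(service_class: str, data: DataSet) -> Dict[str, object]:
    """Resolve the binding, then call through the common interface."""
    return REGISTRY[service_class].assert_quality(data)


data = [("P1", {"score": 0.9}), ("P2", {"score": 0.2})]
print(invoke("PIScoreClassifier", data))
print(invoke("PI_Top_k", data))
```

Because every service accepts and returns the same shapes, `invoke` needs no service-specific code, which is the property the slide credits with facilitating compilation.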

Page 12:

Execution model for Quality views

Binding → compilation → executable component

– Sub-flow of an existing workflow

– Query processing interceptor

[Figure: embedding a quality view in a host workflow. Labels: Host workflow; Abstract quality view; QV compiler; Embedded quality workflow; Qurator quality framework; Services registry; Services implementation]

Host workflow: D → D’

Quality view on D’
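Embedding a compiled quality view as a sub-flow of a host workflow can be sketched as below. This is a toy model, not the QV compiler or Taverna: a workflow is reduced to a list of Python functions, the quality view to a single keep/discard predicate, and the steps `identify` and `report` are invented for illustration.

```python
from typing import Callable, List

Step = Callable[[list], list]


def compile_quality_view(predicate: Callable[[object], bool]) -> Step:
    """'Compile' an abstract quality view into an executable step.

    Here the view is reduced to a single keep/discard predicate.
    """
    def quality_step(data: list) -> list:
        return [d for d in data if predicate(d)]
    return quality_step


def embed(host: List[Step], quality_step: Step, position: int) -> List[Step]:
    """Return a new workflow with the quality step inserted at `position`."""
    return host[:position] + [quality_step] + host[position:]


def run(workflow: List[Step], data: list) -> list:
    """Execute the steps in order, feeding each output to the next step."""
    for step in workflow:
        data = step(data)
    return data


# Hypothetical host workflow: identify hits, then report them.
identify: Step = lambda spectra: [s * 10 for s in spectra]
report: Step = lambda hits: sorted(hits)

host = [identify, report]
quality_aware = embed(host, compile_quality_view(lambda hit: hit > 12), position=1)

print(run(host, [3, 1, 2]))           # unfiltered
print(run(quality_aware, [3, 1, 2]))  # filtered before reporting
```

The host workflow is left unchanged; `embed` returns a new list, so the original and the quality-aware variants can be run side by side.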

Page 13:

Example: original proteomics workflow

Taverna (*): workflow language and enactment engine for e-science applications

(*) part of the myGrid project, University of Manchester - taverna.sourceforge.net

Quality flow embedding point

Page 14:

Example: embedded quality workflow

Page 15:

Interactive conditions / actions

Page 16:

Quality views for queries

[Figure: quality views for queries. Labels: Data; Query processor; Query client; QualityView manager; Q; R; R1; annotate; assert; act; evidence; dump]

Actions: filtering, dump to DB / file
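A query-processing interceptor of this kind can be sketched as a wrapper around a query function. This is a sketch under assumptions, not the QualityView manager: `fake_processor`, the row fields, and the way annotation and assertion are merged into one `annotate` callable are all hypothetical, and the "dump" action is modelled as appending rejected rows to a list rather than writing to a DB or file.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
QueryProcessor = Callable[[str], List[Row]]


def with_quality_view(
    processor: QueryProcessor,
    annotate: Callable[[Row], Row],
    keep: Callable[[Row], bool],
    dump: List[Row],
) -> QueryProcessor:
    """Wrap a query processor so results pass through a quality view."""
    def intercepted(query: str) -> List[Row]:
        # Annotate a copy of every result row with quality metadata.
        result = [annotate(dict(row)) for row in processor(query)]
        # Act: rejected rows are dumped, accepted rows go to the client.
        dump.extend(row for row in result if not keep(row))
        return [row for row in result if keep(row)]
    return intercepted


def fake_processor(query: str) -> List[Row]:
    """Stand-in for a real query processor; ignores the query text."""
    return [{"id": "P1", "coverage": 35}, {"id": "P2", "coverage": 8}]


rejected: List[Row] = []
quality_query = with_quality_view(
    fake_processor,
    annotate=lambda row: {
        **row,
        "ScoreClass": "q:high" if row["coverage"] > 12 else "q:low",
    },
    keep=lambda row: row["ScoreClass"] == "q:high",
    dump=rejected,
)

accepted = quality_query("SELECT * FROM hits")
print(accepted)
print(rejected)
```

The client calls `quality_query` exactly as it would call the unwrapped processor, so the quality view stays transparent to existing query code.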

Page 17:

Qurator architecture

Page 18:

Summary

For complex data types, there is often no single “correct” and agreed-upon definition of data quality

• Qurator provides an environment for fast prototyping of quality hypotheses

– Based on the notion of “evidence” supporting a quality hypothesis

– With support for an incremental learning cycle

• Quality views offer an abstract model for making data processing environments quality-aware

– To be compiled into executable components and embedded

– Qurator provides an invocation framework for Quality Views

More info and papers: http://www.qurator.org

Live demos (informal) available