Modelling and computing the quality of information in e-science Paolo Missier, Suzanne Embury, Mark...

Modelling and computing the quality of information in e-science Paolo Missier , Suzanne Embury, Mark Greenwood School of Computer Science University of Manchester, UK Alun Preece, Binling Jin Department of Computing Science University of Aberdeen, UK http://www.qurator.org Roma, 3/4/07


Page 1

Modelling and computing the quality of information in e-science

Paolo Missier, Suzanne Embury, Mark Greenwood
School of Computer Science
University of Manchester, UK

Alun Preece, Binling Jin
Department of Computing Science
University of Aberdeen, UK

http://www.qurator.org

Roma, 3/4/07

Page 2

Combining the strengths of UMIST and The Victoria University of Manchester

Quality of data

Main driver, historically: data cleaning for

• Integration: use of same IDs across data sources

• Warehousing, analytics:

– restore completeness,

– reconcile referential constraints

– cross-validation of numeric data by aggregation

Focus:

• Record de-duplication, reconciliation, “linkage”

– Ample literature – see e.g. the Nov 2006 issue of IEEE TKDE

• Consistency of data across sources

• Managing uncertainty in databases (Trio - Stanford)

Data quality control in the data management practice

Page 3

Common quality issues

• Completeness: not missing any of the results

• Correctness: each data item should reflect the actual real-world entity that it is intended to model

– The actual address where you live, the correct balance in your bank account…

• Timeliness: delivered in time for use by a consumer process

– E.g. stock information

• …

Page 4

Taxonomy for data quality dimensions

Page 5

Our motivation: quality in public e-science data

GenBank, UniProt, EnsEMBL, Entrez, dbSNP

• Large volumes of data in many public repositories

• Increasingly creative uses for this data

Problem: using third party data of unknown quality may result in misleading scientific conclusions


Page 6

Some quality issues in biology

“Quality” covers a broader spectrum of issues than traditional DQ

• “X% of database A may be wrong (unreliable) – but I have no easy way to test that”

• “This microarray data looks ok but is testing the wrong hypothesis”

• The output from this sequence matching algorithm produces false positives

• …

Each of these issues calls for a separate testing procedure. Difficult to generalize.

Page 7

Correctness in biology - examples

Data type | Creation process | Correctness

UniProt protein annotation | Manual curation | Functional annotation f for p correct if function f can reliably be attributed to p

Qualitative proteomics: protein identification | Generate peptides peak lists, match peak lists (e.g. Imprint) | No false positives: every protein in the output is actually present in the cell sample

Transcriptomics: gene expression report (up/down-regulation) | Microarray data analysis | No false positives, no false negatives

Page 8

Defining quality in e-science is challenging

• In-silico experiments express cutting-edge research

– Experimental data liable to change rapidly

– Definitions of quality are themselves experimental

• Scientists’ quality requirements often just a hunch

– Quality tests missing or based on experimental heuristics

– Definitions of quality criteria are personal and subjective

• Quality controls tightly coupled to data processing

– Often implicit and embedded in the experiment

– Not reusable

“Quality” = personal criteria for data acceptability

Page 9

Research goals

1. Make personal definitions of quality explicit and formal

– Identify a common denominator for quality concepts

– Expressed as a conceptual model for Information Quality

Elicit “nuggets” of latent quality knowledge from the experts

2. Make existing data processing quality-aware

– Define an architectural framework that accommodates personal definitions of quality

– Compute quality levels and expose them to the user

Page 10

Example: protein identification

[Figure: processing pipeline. Elements shown: “Wet lab” experiment, data output, protein identification algorithm, protein hitlist, protein function prediction.]

Correct entry = true positive

Evidence:

• Mass coverage (MC) measures the amount of protein sequence matched

• Hit ratio (HR) gives an indication of the signal-to-noise ratio in a mass spectrum

• ELDP reflects the completeness of the digestion that precedes the peptide mass fingerprinting

This evidence is independent of the algorithm / software package. It is readily available and inexpensive to obtain.

Page 11

Correctness of protein identification

Estimator function: (computes a score rather than a probability)

PMF score = (HR x 100) + MC + (ELDP x 10)

Prediction performance – comparing 3 models:

ROC curve: true positives vs false positives
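The estimator above can be sketched in a few lines of Python. This is only an illustration of the formula as stated on the slide; the `Evidence` class, the function name, and the sample values are hypothetical and not taken from the Qurator code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """Evidence values for one protein hit (field names are assumptions)."""
    hit_ratio: float      # HR
    mass_coverage: float  # MC
    eldp: float           # ELDP


def pmf_score(e: Evidence) -> float:
    """PMF score = (HR x 100) + MC + (ELDP x 10); a score, not a probability."""
    return (e.hit_ratio * 100) + e.mass_coverage + (e.eldp * 10)


# Made-up evidence values, used only to show how hits could be ranked.
hits = {
    "protein_a": Evidence(hit_ratio=0.5, mass_coverage=30.0, eldp=2.0),
    "protein_b": Evidence(hit_ratio=0.25, mass_coverage=12.0, eldp=1.0),
}
ranked = sorted(hits, key=lambda name: pmf_score(hits[name]), reverse=True)
```

Because the result is a score, it induces an ordering on the hit list; turning it into accept/reject decisions needs a separate choice of thresholds.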

Page 12

Quality process components

[Figure: processing pipeline. Elements shown: “Wet lab” experiment, data output, protein identification algorithm, protein hitlist, protein function prediction.]

Goal: to automatically add the additional filtering step in a principled way

Quality filtering

Quality assertion: PMF score = (HR x 100) + MC + (ELDP x 10)

Evidence:

• Mass coverage (MC)

• Hit ratio (HR)

• ELDP

Page 13

Quality Assertions

QA(D): any function of evidence (metadata for D) that computes equivalence classes on D

1. Score model (total or partial order)

2. Classification model:

[Figure: quality-equivalent regions (labels D, B, A, C in the original diagram)]

Actions associated with regions: e.g. accept/reject, but possibly more
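As a rough sketch of the classification model, a quality assertion can be written as a function that maps each data item's score to a class label, so that items with the same label form one quality-equivalent region. The cut-off values, the labels, and the three-way action table below are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Dict, List


def classify(score: float, low: float = 40.0, high: float = 80.0) -> str:
    """Map a score to a class label (the cut-offs are hypothetical)."""
    if score >= high:
        return "high"
    if score >= low:
        return "mid"
    return "low"


def partition(scores: Dict[str, float]) -> Dict[str, List[str]]:
    """Group items into quality-equivalent regions by class label."""
    regions: Dict[str, List[str]] = defaultdict(list)
    for item, score in scores.items():
        regions[classify(score)].append(item)
    return dict(regions)


# One action per region; more than plain accept/reject is possible.
ACTIONS = {"high": "accept", "mid": "review", "low": "reject"}

regions = partition({"a": 100.0, "b": 47.0, "c": 10.0})
decisions = {item: ACTIONS[label] for label, items in regions.items() for item in items}
```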

Page 14

Layered definition of Quality

[Figure: layered definition of quality. Layers shown: data sources (DB, Env); annotation functions producing quality evidence annotations; Quality Assertion functions (QA) encoding custom quality knowledge; Quality Views (QV): definition of acceptability regions. Side labels: long-lived, reusable; commodities, expert-defined; dynamic, user-controlled.]

Page 15

Abstract Quality Views

An operational definition for personal quality:

1. Formulate a quality assertion on the dataset

– e.g. a ranking of proteins by PMF score

– “quality knowledge, possibly subjective”

2. Identify the underlying evidence necessary to compute the assertion

– the variables used to compute the score (HR, MC, ELDP)

– Objective, inexpensive

3. Define annotation functions that compute evidence values

– Functions that compute HR, MC, ELDP

4. Define quality regions on the ranked dataset

– In this case, intervals of acceptability

5. Associate actions with each region
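The five steps above can be pictured as one small data structure. The sketch below is an assumption-laden illustration, not the Qurator representation: the class name, field names, the acceptability threshold of 50, and the trivial annotators that simply read precomputed values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Mapping


@dataclass
class AbstractQualityView:
    annotators: Dict[str, Callable[[Mapping], float]]  # step 3: compute evidence (step 2)
    assertion: Callable[[Dict[str, float]], float]     # step 1: quality assertion
    region_of: Callable[[float], str]                  # step 4: quality regions
    actions: Dict[str, str]                            # step 5: action per region

    def apply(self, item: Mapping) -> str:
        """Annotate the item, compute the assertion, and look up the action."""
        evidence = {name: fn(item) for name, fn in self.annotators.items()}
        score = self.assertion(evidence)
        return self.actions[self.region_of(score)]


view = AbstractQualityView(
    # Hypothetical annotators: real ones would derive these from experimental data.
    annotators={
        "HR": lambda item: item["hr"],
        "MC": lambda item: item["mc"],
        "ELDP": lambda item: item["eldp"],
    },
    assertion=lambda ev: ev["HR"] * 100 + ev["MC"] + ev["ELDP"] * 10,
    # Hypothetical interval of acceptability.
    region_of=lambda score: "acceptable" if score >= 50 else "unacceptable",
    actions={"acceptable": "accept", "unacceptable": "reject"},
)
```

Keeping the annotators, the assertion, and the regions as separate fields mirrors the idea that each can be changed independently, for example by a user who only wants to move the acceptability threshold.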

Page 16

Computable quality views as commodities

Cost-effective quality-awareness for data processing:

• Reuse of high-level definitions of quality views

• Compilation of abstract quality views into quality components

Abstract quality views → binding and compilation → executable quality process

Qurator architectural framework:

– runtime environment

– data-specific quality services

Page 17

Quality hypotheses discovery and testing

[Figure: lifecycle of a quality hypothesis. Stages shown: quality model definition; abstract quality view; targeted compilation; target-specific quality component; deployment; quality-enhanced user environment; execution on test data; quality model performance assessment.]

Multiple target environments:

• Workflow

• Query processor

Page 18

Experimental quality

Making data processing quality-aware using Quality Views

– Query, browsing, retrieval, data-intensive workflows

Discovery and validation: “nuggets of quality knowledge”

[Figure labels: Quality View; model testing; test datasets; embedding quality views and flow-through testing]

Page 19

Execution model for Quality views

Binding → compilation → executable component

– Sub-flow of an existing workflow

– Query processing interceptor

[Figure: the QV compiler takes an abstract quality view and a host workflow and produces an embedded quality workflow; it relies on the Qurator quality framework (services registry, services implementation).]

Host workflow: D → D’

Quality view on D’
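One way to picture the embedding is to treat the host workflow as a list of steps and the compiled quality view as one extra step inserted at a chosen point. This is a simplified sketch in plain Python, not the Taverna or Qurator mechanism; the step functions, the score field, and the threshold are hypothetical.

```python
from typing import Any, Callable, List, Sequence

Step = Callable[[Any], Any]


def embed_quality_view(host: Sequence[Step], quality_step: Step, embedding_point: int) -> List[Step]:
    """Return a new workflow with quality_step inserted after host[embedding_point]."""
    steps = list(host)
    steps.insert(embedding_point + 1, quality_step)
    return steps


def run(workflow: Sequence[Step], data: Any) -> Any:
    """Run the steps in order, feeding each output to the next step."""
    for step in workflow:
        data = step(data)
    return data


# Hypothetical host workflow: identification followed by function prediction.
def identify(samples):
    return [{"protein": name, "score": score} for name, score in samples]


def predict_function(hits):
    return [hit["protein"] for hit in hits]


# Hypothetical compiled quality view: keep hits with a score of at least 50.
def quality_filter(hits):
    return [hit for hit in hits if hit["score"] >= 50]


host = [identify, predict_function]
embedded = embed_quality_view(host, quality_filter, embedding_point=0)
samples = [("P1", 100.0), ("P2", 47.0)]
```

The host workflow is left untouched; the embedded variant is a new list, so both can be run side by side on the same input to compare D’ with and without quality filtering.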

Page 20

Example: original proteomics workflow

Taverna workflow

Quality flow embedding point

Page 21

Example: embedded quality workflow

Page 22

Interactive conditions / actions

Page 23

Generic quality process pattern

Collect evidence

– Fetch persistent annotations (persistent evidence)

– Compute on-the-fly annotations

<variables>
  <var variableName="Coverage" evidence="q:Coverage"/>
  <var variableName="PeptidesCount" evidence="q:PeptidesCount"/>
</variables>

Compute assertions (classifiers)

<QualityAssertion serviceName="PIScoreClassifier" serviceType="q:PIScoreClassifier" tagSemType="q:PIScoreClassification" tagName="ScoreClass"

(fragment; the remainder of the element is not shown on the slide)

Evaluate conditions, execute actions

<action>
  <filter>
    <condition>ScoreClass in {"q:high", "q:mid"} and Coverage > 12</condition>
  </filter>
</action>
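The three stages of the pattern (collect evidence, compute assertions, evaluate conditions and act) can be sketched as follows. The filter condition is the one shown in the XML fragment; everything else is an assumption for illustration, in particular the classifier rule, which is a hypothetical stand-in for the PIScoreClassifier service whose actual logic is not given in the slides.

```python
from typing import Callable, Dict, List


def collect_evidence(item_id: str,
                     persistent: Dict[str, Dict[str, float]],
                     on_the_fly: Dict[str, Callable[[str], float]]) -> Dict[str, float]:
    """Fetch persistent annotations, then compute any evidence still missing."""
    evidence = dict(persistent.get(item_id, {}))
    for name, annotate in on_the_fly.items():
        if name not in evidence:
            evidence[name] = annotate(item_id)
    return evidence


def score_class(evidence: Dict[str, float]) -> str:
    """Hypothetical stand-in classifier: thresholds on PeptidesCount are invented."""
    count = evidence["PeptidesCount"]
    if count >= 10:
        return "q:high"
    if count >= 5:
        return "q:mid"
    return "q:low"


def passes_filter(tags: Dict[str, str], evidence: Dict[str, float]) -> bool:
    """Condition from the slide: ScoreClass in {q:high, q:mid} and Coverage > 12."""
    return tags["ScoreClass"] in {"q:high", "q:mid"} and evidence["Coverage"] > 12


def quality_process(items: List[str], persistent, on_the_fly) -> List[str]:
    kept = []
    for item_id in items:
        evidence = collect_evidence(item_id, persistent, on_the_fly)
        tags = {"ScoreClass": score_class(evidence)}
        if passes_filter(tags, evidence):
            kept.append(item_id)
    return kept


# Made-up data: Coverage is stored, PeptidesCount is computed on demand.
persistent = {"hit1": {"Coverage": 20.0}, "hit2": {"Coverage": 8.0}, "hit3": {"Coverage": 30.0}}
peptide_counts = {"hit1": 12, "hit2": 11, "hit3": 2}
on_the_fly = {"PeptidesCount": lambda item_id: peptide_counts[item_id]}
kept = quality_process(["hit1", "hit2", "hit3"], persistent, on_the_fly)
```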

Page 24

Reference (semantic) model

[Figure: the layered quality model (data sources DB and Env; annotation functions producing quality evidence annotations; Quality Assertion functions (QA) encoding custom quality knowledge; Quality Views (QV): definition of acceptability regions), with all layers referring to a Common Semantic Model (IQ Ontology).]

Page 25

A semantic model for quality concepts

Quality “upper ontology” (OWL)

Quality evidence types

Evidence meta-data model (RDF)

Evidence annotations are class instances

Page 26

Main taxonomies and properties

Class restriction: MassCoverage is-evidence-for . ImprintHitEntry

Class restrictions:

PIScoreClassifier assertion-based-on-evidence . HitScore

PIScoreClassifier assertion-based-on-evidence . MassCoverage

assertion-based-on-evidence: QualityAssertion → QualityEvidence

is-evidence-for: QualityEvidence → DataEntity

Page 27

The ontology-driven user interface

Detecting inconsistencies: no annotators for this Evidence type

Detecting inconsistencies: unsatisfied input requirements for Quality Assertion
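The two consistency checks named above can be illustrated with plain sets, assuming the ontology facts have already been extracted into dictionaries. The `required` mapping reuses the PIScoreClassifier restrictions from the previous page; the annotator name and the function itself are hypothetical, and the real user interface presumably works from the OWL ontology rather than from dictionaries.

```python
from typing import Dict, List, Set


def find_inconsistencies(required: Dict[str, Set[str]],
                         annotators: Dict[str, Set[str]],
                         selected_evidence: Set[str]) -> List[str]:
    """Report evidence types nobody can compute and assertions missing inputs."""
    produced: Set[str] = set().union(*annotators.values())
    problems = []
    # Check 1: evidence selected in the view that no annotator produces.
    for evidence_type in sorted(selected_evidence - produced):
        problems.append(f"no annotators for evidence type {evidence_type}")
    # Check 2: assertions whose required evidence is not all selected.
    for assertion, needs in sorted(required.items()):
        missing = needs - selected_evidence
        if missing:
            problems.append(
                f"unsatisfied input requirements for {assertion}: {', '.join(sorted(missing))}"
            )
    return problems


required = {"PIScoreClassifier": {"HitScore", "MassCoverage"}}
annotators = {"ExampleAnnotator": {"HitScore"}}  # hypothetical annotator name
problems = find_inconsistencies(required, annotators, selected_evidence={"MassCoverage"})
```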

Page 28

Qurator architecture

Page 29

Quality-aware query processing

[Figure: quality-aware query processing. Elements shown: query client; quality-aware query (SQL, XQuery); query processor; data; result R; Quality View component (annotate, assert, act); evidence; filtered result R’; dump.]
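A minimal sketch of the interceptor idea, assuming an in-memory SQLite database stands in for the query processor: the query result R is passed through annotate, assert, and act steps to give R’. The table, column names, and the coverage threshold are hypothetical, and the three callables are placeholders for whatever a compiled quality view would supply.

```python
import sqlite3
from typing import Callable, Dict, List, Tuple


def quality_aware_query(conn: sqlite3.Connection,
                        sql: str,
                        annotate: Callable[[Tuple], Dict[str, float]],
                        assert_quality: Callable[[Dict[str, float]], str],
                        act: Callable[[str], bool]) -> List[Tuple]:
    """Run the query (R), then keep only rows the quality view accepts (R')."""
    rows = conn.execute(sql).fetchall()  # R
    accepted = []
    for row in rows:
        evidence = annotate(row)        # annotate
        tag = assert_quality(evidence)  # assert
        if act(tag):                    # act
            accepted.append(row)
    return accepted                     # R'


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (protein TEXT, coverage REAL)")
conn.executemany("INSERT INTO hits VALUES (?, ?)", [("P1", 25.0), ("P2", 5.0)])

r_prime = quality_aware_query(
    conn,
    "SELECT protein, coverage FROM hits ORDER BY protein",
    annotate=lambda row: {"Coverage": row[1]},
    assert_quality=lambda ev: "q:high" if ev["Coverage"] > 12 else "q:low",
    act=lambda tag: tag == "q:high",
)
```

The client's SQL is unchanged; the quality logic sits between the query processor and the client, which is what makes it reusable across queries.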

Page 30

Research issues

Quality modelling:

• Provenance as evidence

– Can data/process provenance be turned into evidence?

• Experimental elicitation of new Quality Assertions

– Seeking new collaborations with biologists!

• Classification with uncertainty

– Data elements belong to a quality class with some probability

• Computing Quality Assertions with limited evidence

– Evidence may be expensive and sometimes unavailable

– Robust classification / score models

Architecture:

• Metadata management model

– Quality Evidence is a type of metadata with known features…

Page 31

Summary

For complex data types, there is often no single “correct” and agreed-upon definition of data quality

• Qurator provides an environment for fast prototyping of quality hypotheses

– Based on the notion of “evidence” supporting a quality hypothesis

– With support for an incremental learning cycle

• Quality views offer an abstract model for making data processing environments quality-aware

– To be compiled into executable components and embedded

– Qurator provides an invocation framework for Quality Views

Publications: http://www.qurator.org

Qurator is registered with OMII-UK