Invited talk @Aberdeen, '07: Modelling and computing the quality of information in e-science


Transcript of Invited talk @Aberdeen, '07: Modelling and computing the quality of information in e-science

Page 1

Modelling and computing the quality of information in e-science

Paolo Missier, Suzanne Embury, Mark Greenwood
School of Computer Science, University of Manchester, UK

Alun Preece, Binling Jin
Department of Computing Science, University of Aberdeen, UK

http://www.qurator.org

Aberdeen, 24/1/07

Page 2

Quality of data

Main driver, historically: data cleaning for

• Integration: use of the same IDs across data sources

• Warehousing, analytics:

– restoring completeness

– reconciling referential constraints

– cross-validating numeric data by aggregation

Focus:

• Record de-duplication, reconciliation, “linkage”

– Ample literature – see e.g. the Nov 2006 issue of IEEE TKDE

• Consistency of data across sources

• Managing uncertainty in databases (Trio - Stanford)

The need for data quality control is rooted in data management practice

Page 3

Common quality issues

• Completeness: not missing any of the results

• Correctness: each data item should reflect the actual real-world entity that it is intended to model

– The actual address where you live, the correct balance in your bank account…

• Timeliness: delivered in time for use by a consumer process

– E.g. stock information

• …

Page 4

Taxonomy for data quality dimensions

Page 5

Our motivation: quality in public e-science data

[Diagram: public repositories – GenBank, UniProt, EnsEMBL, Entrez, dbSNP]

• Large volumes of data in many public repositories

• Increasingly creative uses for this data

Problem: using third party data of unknown quality may result in misleading scientific conclusions

Page 6

Some quality issues in biology

“Quality” covers a broader spectrum of issues than traditional DQ

• “X% of database A may be wrong (unreliable) – but I have no easy way to test that”

• “This microarray data looks ok but is testing the wrong hypothesis”

• “The output from this sequence matching algorithm produces false positives”

• …

Each of these issues calls for a separate testing procedure – difficult to generalize

Page 7

Correctness in biology - examples

Data type – Creation process – Correctness criterion:

• UniProt protein annotation – manual curation – functional annotation f for protein p is correct if function f can reliably be attributed to p

• Qualitative proteomics, protein identification – generate peptide peak lists, match peak lists (e.g. Imprint) – no false positives: every protein in the output is actually present in the cell sample

• Transcriptomics, gene expression report (up/down-regulation) – microarray data analysis – no false positives, no false negatives

Page 8

Defining quality in e-science is challenging

• In-silico experiments express cutting-edge research

– Experimental data liable to change rapidly

– Definitions of quality are themselves experimental

• Scientists’ quality requirements are often just a hunch

– Quality tests missing or based on experimental heuristics

– Definitions of quality criteria are personal and subjective

• Quality controls tightly coupled to data processing

– Often implicit and embedded in the experiment

– Not reusable

Page 9

Research goals

1. Make personal definitions of quality explicit and formal

– Identify a common denominator for quality concepts

– Expressed as a conceptual model for Information Quality

2. Make existing data processing quality-aware

– Define an architectural framework that accommodates personal definitions of quality

– Compute quality levels and expose them to the user

Elicit “nuggets” of latent quality knowledge from the experts

Page 10

Example: protein identification

[Workflow diagram: “wet lab” experiment → protein identification algorithm → protein hitlist (data output) → protein function prediction]

Correct entry = true positive

Evidence:

• Mass coverage (MC) measures the amount of protein sequence matched

• Hit ratio (HR) gives an indication of the signal-to-noise ratio in a mass spectrum

• ELDP reflects the completeness of the digestion that precedes the peptide mass fingerprinting

This evidence is independent of the algorithm / SW package. It is readily available and inexpensive to obtain.

Page 11

Correctness of protein identification

Estimator function (computes a score rather than a probability):

PMF score = (HR x 100) + MC + (ELDP x 10)
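As a concrete illustration, here is a minimal sketch of how the estimator could be applied to rank candidate identifications; the Hit record and its field names are hypothetical, and only the scoring formula itself comes from the slide.

# Sketch: rank protein hits by the PMF score defined above (data layout is assumed).
from dataclasses import dataclass

@dataclass
class Hit:
    protein: str
    hr: float    # hit ratio
    mc: float    # mass coverage
    eldp: float  # ELDP

def pmf_score(h: Hit) -> float:
    # PMF score = (HR x 100) + MC + (ELDP x 10)
    return (h.hr * 100) + h.mc + (h.eldp * 10)

hits = [Hit("candidate A", 0.42, 35.0, 2), Hit("candidate B", 0.15, 12.0, 0)]
for h in sorted(hits, key=pmf_score, reverse=True):
    print(h.protein, round(pmf_score(h), 1))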

Prediction performance – comparing 3 models:

ROC curve: true positives vs. false positives

Page 12

Quality process components

[Workflow diagram as on Page 10: “wet lab” experiment → protein identification algorithm → protein hitlist → protein function prediction, now with a quality filtering step added]

Goal: to automatically add the additional filtering step in a principled way

PMF score = (HR x 100) + MC + (ELDP x 10)

Quality filtering

Quality assertion:

Evidence: mass coverage (MC), hit ratio (HR), ELDP

Page 13

Quality Assertions

QA(D): any function of evidence (metadata for D) that computes a partial order on D

1. Score model (total or partial order)

2. Classification model with class ordering:

[Diagram: dataset D partitioned into ordered quality regions labelled reject, analyze, accept]

reject < analyze < accept; actions associated to regions
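Below is a minimal sketch of a classification-style quality assertion with ordered classes and an action attached to each region; the thresholds and helper names are illustrative assumptions, not values taken from the slides.

# Sketch: classification-model quality assertion (reject < analyze < accept).
ORDER = {"reject": 0, "analyze": 1, "accept": 2}

def classify(score: float) -> str:
    # Thresholds are purely illustrative; a real model would be elicited or learned.
    if score >= 60:
        return "accept"
    if score >= 30:
        return "analyze"
    return "reject"

def at_least(cls: str, minimum: str) -> bool:
    # The class ordering induces the partial order on D required of a quality assertion.
    return ORDER[cls] >= ORDER[minimum]

ACTIONS = {
    "accept": lambda item: item,    # pass the item through
    "analyze": lambda item: item,   # e.g. flag the item for manual inspection
    "reject": lambda item: None,    # drop the item
}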

Page 14

Abstract quality views

An operational definition for personal quality:

1. Formulate a quality assertion on the dataset:

– e.g. a ranking of proteins by PMF score

2. Identify underlying evidence necessary to compute the assertion

– the variables used to compute the score (HR, MC, ELDP)

3. Define annotation functions that compute evidence values

• Functions that compute HR, MC, ELDP

4. Define quality regions on the ranked dataset

• In this case, intervals of acceptability

5. Associate actions to each region
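Read together, these five steps form the skeleton of an abstract quality view. The sketch below renders that skeleton as a plain data structure; the class and field names are hypothetical and are not the actual Qurator specification language.

# Sketch: an abstract quality view mirroring steps 1-5 above (names are assumed).
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class QualityView:
    assertion: Callable[[dict], float]               # 1. quality assertion (e.g. PMF score)
    evidence: List[str]                              # 2. evidence variables (HR, MC, ELDP)
    annotators: Dict[str, Callable[[dict], float]]   # 3. annotation functions computing evidence
    regions: Dict[str, Callable[[float], bool]]      # 4. quality regions over the ranked data
    actions: Dict[str, Callable[[dict], object]]     # 5. action associated with each region

    def apply(self, item: dict) -> object:
        for name in self.evidence:                   # compute any missing evidence values
            item.setdefault(name, self.annotators[name](item))
        score = self.assertion(item)
        for region, contains in self.regions.items():
            if contains(score):
                return self.actions[region](item)
        return None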

Page 15

Computable quality views as commodities

Cost-effective quality-awareness for data processing:

• Reuse of high-level definitions of quality views

• Compilation of abstract quality views into quality components

Qurator architectural framework:

[Diagram: abstract quality views → binding and compilation → executable quality process (runtime environment, data-specific quality services)]

Page 16

Quality hypotheses discovery and testing

[Diagram: a cycle of quality hypothesis discovery and testing – quality model definition → abstract quality view → targeted compilation → target-specific quality component → deployment → quality-enhanced user environment → execution on test data → quality model performance assessment, feeding back into the quality model definition. Multiple target environments: workflow, query processor]

Page 17

Experimental quality

Making data processing quality-aware using Quality Views

– Query, browsing, retrieval, data-intensive workflows

Discovery and validation of “Quality nuggets”

[Diagram: quality view + test datasets → model testing; embedding quality views and flow-through testing]

Page 18

Execution model for Quality views

Binding + compilation → executable component

– Sub-flow of an existing workflow

– Query processing interceptor

[Diagram: the QV compiler takes an abstract quality view and, using the services registry and service implementations of the Qurator quality framework, produces an embedded quality workflow; this is inserted into the host workflow, which transforms D into D’, so that the quality view is computed on D’]

Page 19

Example: original proteomics workflow

Taverna workflow

Quality flow embedding point

Page 20

Example: embedded quality workflow

Page 21

Interactive conditions / actions

Page 22

Generic quality process pattern

Collect evidence:

– Fetch persistent annotations

– Compute on-the-fly annotations

<variables>
  <var variableName="Coverage" evidence="q:Coverage"/>
  <var variableName="PeptidesCount" evidence="q:PeptidesCount"/>
</variables>

Evaluate conditions, execute actions:

<action>
  <filter>
    <condition>ScoreClass in {"q:high", "q:mid"} and Coverage > 12</condition>
  </filter>
</action>

Compute assertions

Classifier

<QualityAssertion
  serviceName="PIScoreClassifier"
  serviceType="q:PIScoreClassifier"
  tagSemType="q:PIScoreClassification"
  tagName="ScoreClass"/>

Persistent evidence
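A minimal sketch of this pattern as a small pipeline is given below; the helper names and the evidence store are assumptions made for illustration, whereas the actual Qurator components are services driven by XML fragments like the ones above.

# Sketch: collect evidence -> compute assertions -> evaluate conditions -> execute actions.
def collect_evidence(item, annotators, store):
    evidence = dict(store.get(item["id"], {}))   # fetch persistent annotations
    for name, fn in annotators.items():
        evidence.setdefault(name, fn(item))      # compute on-the-fly annotations
    return evidence

def quality_process(items, annotators, store, classifier, condition, action):
    for item in items:
        ev = collect_evidence(item, annotators, store)
        ev["ScoreClass"] = classifier(ev)        # compute the quality assertion
        if condition(ev):                        # evaluate the condition
            yield action(item, ev)               # execute the associated action

# A condition mirroring the XML example above, using the declared variable names:
keep = lambda ev: ev["ScoreClass"] in {"q:high", "q:mid"} and ev["Coverage"] > 12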

Page 23

A semantic model for quality concepts

• Quality “upper ontology” (OWL)

• Quality evidence types

• Evidence metadata model (RDF)

• Evidence annotations are class instances

Page 24

Main taxonomies and properties

Class restriction: MassCoverage ⊑ ∃ is-evidence-for . ImprintHitEntry

Class restrictions: PIScoreClassifier ⊑ ∃ assertion-based-on-evidence . HitScore, PIScoreClassifier ⊑ ∃ assertion-based-on-evidence . MassCoverage

Property assertion-based-on-evidence: QualityAssertion → QualityEvidence

Property is-evidence-for: QualityEvidence → DataEntity
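As an illustration of how evidence annotations might be materialised as class instances, here is a small sketch using the rdflib Python library; the namespace URIs, instance identifiers and the hasValue property are assumptions, and only the class and property names come from the ontology above.

# Sketch: an evidence annotation recorded as RDF, following the taxonomy above.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF

Q = Namespace("http://qurator.org/ontology#")      # assumed ontology namespace
D = Namespace("http://example.org/experiment/")    # assumed data namespace

g = Graph()
g.bind("q", Q)

hit = D["hit42"]                # a data entity (an Imprint hit list entry)
mc = D["hit42-massCoverage"]    # its mass coverage evidence annotation

g.add((hit, RDF.type, Q.ImprintHitEntry))
g.add((mc, RDF.type, Q.MassCoverage))
g.add((mc, Q["is-evidence-for"], hit))     # evidence linked to the data entity
g.add((mc, Q.hasValue, Literal(37.5)))     # assumed value property

print(g.serialize(format="turtle"))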

Page 25

The ontology-driven user interface

Detecting inconsistencies: no annotators for this Evidence type

Detecting inconsistencies: unsatisfied input requirements for a Quality Assertion

Page 26

Qurator architecture

Page 27

Quality-aware query processing

[Diagram – quality-aware query processing: the query client issues a quality-aware query (SQL, XQuery) to the query processor over the data; the result set R is passed to the Quality View component, which annotates it with evidence, asserts and acts, and the filtered result R’ is returned to the query client]
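On the client side, the interception might look roughly like the sketch below, which wraps an ordinary DB-API query with annotation, assertion and filtering; the function and parameter names are assumptions, and the actual Qurator component sits as a service between the query processor and the client.

# Sketch: a quality-aware query wrapper turning the raw result set R into R'.
import sqlite3

def quality_aware_query(conn, sql, annotate, assert_class, accept):
    rows = [dict(r) for r in conn.execute(sql)]             # R, the raw result set
    for row in rows:
        row["evidence"] = annotate(row)                     # attach evidence metadata
        row["quality"] = assert_class(row["evidence"])      # compute the quality assertion
    return [r for r in rows if accept(r["quality"])]        # act: keep acceptable rows (R')

# Toy usage with an in-memory database and dummy quality functions.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE hits (protein TEXT, coverage REAL)")
conn.executemany("INSERT INTO hits VALUES (?, ?)", [("P1", 40.0), ("P2", 5.0)])
print(quality_aware_query(
    conn, "SELECT * FROM hits",
    annotate=lambda r: {"Coverage": r["coverage"]},
    assert_class=lambda ev: "q:high" if ev["Coverage"] > 12 else "q:low",
    accept=lambda c: c == "q:high",
))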

Page 28

Research issues

Quality modelling:

• Provenance as evidence

– Can data/process provenance be turned into evidence?

• Experimental elicitation of new Quality Assertions

– Seeking new collaborations with biologists!

• Classification with uncertainty

– Data elements belong to a quality class with some probability

• Computing Quality Assertions with limited evidence

– Evidence may be expensive and sometimes unavailable

– Robust classification / score models

Architecture:

• Metadata management model

– Quality Evidence is a type of metadata with known features…

Page 29

Summary

For complex data types, there is often no single “correct”, agreed-upon definition of data quality

• Qurator provides an environment for fast prototyping of quality hypotheses

– Based on the notion of “evidence” supporting a quality hypothesis

– With support for an incremental learning cycle

• Quality views offer an abstract model for making data processing environments quality-aware

– To be compiled into executable components and embedded

– Qurator provides an invocation framework for Quality Views

Publications: http://www.qurator.org

Qurator is registered with OMII-UK