Data Integration vs Transparency: Tackling the tension

Post on 07-Aug-2015


Paul Groth, Elsevier Labs
@pgroth | pgroth.com

Data Integration & Transparency: Tackling the tension

University of Fribourg Informatics Colloquium, June 15, 2015

Outline

• Data integration for analysis
  – i.e. remixing data
• The need for transparency
• Provenance as a solution
• The downloads folder problem

60% of time is spent on data preparation

http://openphacts.org
pmu@openphacts.org

@Open_PHACTS

Why?

Public Domain Drug Discovery Data: Pharma are accessing, processing, storing & re-processing

[Figure: public sources (Literature, PubChem, Genbank, Patents, Databases, Downloads) go through Data Integration and Data Analysis into Firewalled Databases; repeat @ each company.]

Prioritised Research Questions

| Number | Sum | Nr of 1 | Question |
| 15 | 12 | 9 | All oxido-reductase inhibitors active <100nM in both human and mouse |
| 18 | 14 | 8 | Given compound X, what is its predicted secondary pharmacology? What are the on- and off-target safety concerns for a compound? What is the evidence and how reliable is that evidence (journal impact factor, KOL) for findings associated with a compound? |
| 24 | 13 | 8 | Given a target find me all actives against that target. Find/predict polypharmacology of actives. Determine ADMET profile of actives. |
| 32 | 13 | 8 | For a given interaction profile, give me compounds similar to it. |
| 37 | 13 | 8 | The current Factor Xa lead series is characterised by substructure X. Retrieve all bioactivity data in serine protease assays for molecules that contain substructure X. |
| 38 | 13 | 8 | Retrieve all experimental and clinical data for a given list of compounds defined by their chemical structure (with options to match stereochemistry or not). |
| 41 | 13 | 8 | A project is considering Protein Kinase C Alpha (PRKCA) as a target. What are all the compounds known to modulate the target directly? What are the compounds that may modulate the target directly? i.e. return all cmpds active in assays where the resolution is at least at the level of the target family (i.e. PKC) both from structured assay databases and the literature. |
| 44 | 13 | 8 | Give me all active compounds on a given target with the relevant assay data |
| 46 | 13 | 8 | Give me the compound(s) which hit most specifically the multiple targets in a given pathway (disease) |
| 59 | 14 | 8 | Identify all known protein-protein interaction inhibitors |

www.openphacts.org

Research question 15: All oxido reductase inhibitors active < 100nM in both human and mouse

From Mabel Loza - USC team


ChEMBL:

Search target Oxidoreductase: 481 targets from different species

Selection of all the oxidoreductases and filtering bioactivities with the criteria IC50 < 100 (no units could be selected): 11497 data obtained

Table exported to an Excel spreadsheet and manually filtered
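The manual filtering step could, in principle, be scripted. The following is a minimal sketch under stated assumptions: the record fields (`compound`, `species`, `value`, `unit`), the unit labels and the rows themselves are invented for illustration and are not the ChEMBL export schema. It normalises IC50 values to nM and keeps compounds that are active below 100 nM in both human and mouse.

```python
from collections import defaultdict

# Conversion factors to nanomolar; the unit labels are assumptions for this sketch.
TO_NM = {"nM": 1.0, "uM": 1_000.0, "mM": 1_000_000.0}


def active_in_both(records, threshold_nm=100.0, species=("human", "mouse")):
    """Return compounds with at least one IC50 below the threshold in every species."""
    hits = defaultdict(set)
    for rec in records:
        factor = TO_NM.get(rec["unit"])
        if factor is None:
            continue  # unknown unit: skip the row rather than guess a conversion
        if rec["value"] * factor < threshold_nm:
            hits[rec["compound"]].add(rec["species"])
    return sorted(c for c, seen in hits.items() if set(species) <= seen)


# Invented example rows (not real bioactivity data).
records = [
    {"compound": "A", "species": "human", "value": 50, "unit": "nM"},
    {"compound": "A", "species": "mouse", "value": 0.08, "unit": "uM"},
    {"compound": "B", "species": "human", "value": 20, "unit": "nM"},
    {"compound": "B", "species": "mouse", "value": 5, "unit": "uM"},
    {"compound": "C", "species": "human", "value": 90, "unit": "ug/mL"},
]

result = active_in_both(records)
print(result)
```

Making the unit conversion explicit addresses the "no units could be selected" problem noted above: a bare `IC50 < 100` filter would otherwise mix nM and µM values.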

5 people working 6 hours

Problem: Data Integration

[Figure: two integration architectures. Data warehouse: Data Sources → Extract, Transform, Load → Data Warehouse, which answers Queries. Mediator: Queries → Mediator → Query Reformulation → Data Sources.]
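The difference between the two architectures can be sketched in a few lines. This is an illustrative toy, not how Open PHACTS is implemented: the two sources, their field names and the rows are all invented. The warehouse copies and transforms data up front (ETL), while the mediator rewrites each query into the sources' own terms at query time.

```python
# Two toy sources with different schemas (invented data and field names).
SOURCE_A = [{"cmpd": "cmpd-1", "target": "T1"}]
SOURCE_B = [{"name": "cmpd-2", "protein": "T2"}]


def from_a(row):
    """Transform a SOURCE_A row into the shared schema."""
    return {"compound": row["cmpd"], "target": row["target"]}


def from_b(row):
    """Transform a SOURCE_B row into the shared schema."""
    return {"compound": row["name"], "target": row["protein"]}


class Warehouse:
    """ETL: extract and transform everything once, then query the local copy."""

    def __init__(self):
        self.rows = []

    def load(self):
        self.rows = [from_a(r) for r in SOURCE_A] + [from_b(r) for r in SOURCE_B]

    def query(self, target):
        return [r["compound"] for r in self.rows if r["target"] == target]


class Mediator:
    """Query reformulation: translate the query for each source at query time."""

    def query(self, target):
        hits = [r["cmpd"] for r in SOURCE_A if r["target"] == target]
        hits += [r["name"] for r in SOURCE_B if r["protein"] == target]
        return hits


warehouse = Warehouse()
warehouse.load()
mediator = Mediator()

# A source update after loading is visible to the mediator but not the warehouse.
SOURCE_B.append({"name": "cmpd-3", "protein": "T2"})
warehouse_hits = warehouse.query("T2")
mediator_hits = mediator.query("T2")
print(warehouse_hits, mediator_hits)
```

The trade-off shown here is the usual one: the warehouse answers from a single cleaned copy but can go stale, while the mediator stays current but pays the translation cost on every query.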

Using the Power of Open PHACTS, London, 22-23 April 2013

[Figure: Open PHACTS architecture. Data sources (public content, commercial data, public ontologies, user annotations), provided as RDF, nanopublications or databases and each described with VoID. Core Platform components: Data Cache (Virtuoso Triple Store), Semantic Workflow Engine, Identity Resolution Service, Identifier Management Service, Chemistry Registration, Normalisation & Q/C, and an index. Domain Specific Services and a Linked Data API (RDF/XML, TTL, JSON) serve Applications. Example identifiers shown: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”.]
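One way to picture what an identity resolution service does is as grouping identifiers from different sources into equivalence sets. The sketch below uses a union-find structure for that. It is an assumption about the general technique, not the Open PHACTS implementation, and the identifiers are hypothetical placeholders.

```python
class IdentityResolver:
    """Groups identifiers that have been declared equivalent (union-find)."""

    def __init__(self):
        self.parent = {}

    def _find(self, x):
        # Register unseen identifiers as their own group, then walk to the root.
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def link(self, a, b):
        """Record that identifiers a and b refer to the same thing."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def equivalents(self, x):
        """Return every identifier known to be equivalent to x, sorted."""
        root = self._find(x)
        return sorted(i for i in list(self.parent) if self._find(i) == root)


resolver = IdentityResolver()
# Hypothetical links between records in three made-up sources.
resolver.link("dbA:001", "dbB:XYZ")
resolver.link("dbB:XYZ", "label:example protein")
resolver.link("dbA:002", "dbC:77")

same = resolver.equivalents("label:example protein")
print(same)
```

Keeping the links separate from the data means a query posed with any one identifier can be expanded to all of its equivalents before it reaches the individual sources.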

Open PHACTS Explorer


?

Credits: Curt Tilmes, Peter Fox

Tilmes, C.; Fox, P.; Ma, X.; McGuinness, D.L.; Privette, A.P.; Smith, A.; Waple, A.; Zednik, S.; Zheng, J.G., "Provenance Representation for the National Climate Assessment in the Global Change Information System," IEEE Transactions on Geoscience and Remote Sensing, vol. 51, no. 11, pp. 5160-5168, Nov. 2013

Problem: I don’t trust your assessment. What is it based on?

Tension:

Integrated & Summarized Data

Transparency & Trust

Solution

Integrating and exposing provenance provided by multiple sources

provbook.org

National Climate Change Assessment Provenance

Tooling

http://asdf.readthedocs.org/en/latest/provenance.html

http://www.slideshare.net/soilandreyes/20130529-taverna-provenance

Towards Workflow Ecosystems Through Semantic and Standard Representations Garijo, D.; Gil, Y.; and Corcho, O. In Proceedings of the Ninth Workshop on Workflows in Support of Large-Scale Science (WORKS), held in conjunction with the IEEE ACM International Conference on High-Performance Computing (SC), New Orleans, LA, 2014.

https://github.com/pgroth/PROVTutorial

Great... but data integration is manual

Look to OS techniques

1. Taint Tracking
2. Record and Replay
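As a rough illustration of the first technique, taint tracking attaches labels to data at its source and propagates them through every operation, so that any result can be traced back to the inputs that influenced it. The toy below does this at the level of Python values; OS-level taint tracking works on bytes and instructions instead. The `Tainted` class, the `apply` helper and the file names are hypothetical.

```python
class Tainted:
    """A value paired with the set of source labels it was derived from."""

    def __init__(self, value, sources=()):
        self.value = value
        self.sources = frozenset(sources)


def apply(op, *args):
    """Run op on the raw values and propagate the union of the inputs' taint."""
    values = [a.value if isinstance(a, Tainted) else a for a in args]
    sources = frozenset().union(
        *(a.sources for a in args if isinstance(a, Tainted))
    )
    return Tainted(op(*values), sources)


# Values read from two (hypothetical) files carry their origin as taint.
a = Tainted(40, {"file:a.csv"})
b = Tainted(2, {"file:b.csv"})

total = apply(lambda x, y: x + y, a, b)
scaled = apply(lambda x, k: x * k, total, 10)  # the constant 10 adds no taint

print(scaled.value, sorted(scaled.sources))
```

The set of labels on `scaled` is exactly the kind of information a provenance record needs: which inputs a derived value depended on.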

Work with Manolis Stamatogiannakis & Herbert Bos, VU University Amsterdam – Security Group

https://www.youtube.com/watch?v=BD0h6M5mVoo

http://www.androidreran.com

Use R&R for provenance

1. Execution Capture
2. Application of instrumentation
3. Provenance analysis
4. Selection and iteration

Implemented using plugins for Platform for Architecture-Neutral Dynamic Analysis (PANDA)
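The idea behind the record-and-replay steps can be shown with a toy that has nothing to do with the PANDA plugin API. It records the nondeterministic inputs of a run once, then replays the same run with instrumentation attached, so analysis can be repeated or changed without re-running the original. All function names here are hypothetical, and the instrumentation hook is built into the toy program for brevity, whereas a whole-system recorder needs no cooperation from the program.

```python
import random


def program(read_input, on_event=None):
    """A toy 'program' whose result depends on nondeterministic input."""
    total = 0
    for step in range(3):
        value = read_input()
        if on_event is not None:
            on_event(step, value)  # instrumentation hook, inactive by default
        total += value
    return total


# 1. Execution capture: run once, logging every nondeterministic input.
log = []


def recording_input():
    value = random.randint(1, 100)
    log.append(value)
    return value


recorded_result = program(recording_input)


# 2-3. Replay the log with instrumentation attached, then analyse the events.
def replay(recorded, on_event):
    inputs = iter(recorded)
    return program(lambda: next(inputs), on_event)


events = []
replayed_result = replay(log, lambda step, value: events.append((step, value)))

# 4. Selection and iteration: replay again with a different analysis.
large_inputs = []
replay(log, lambda step, value: large_inputs.append(value) if value > 50 else None)

print(recorded_result == replayed_result, events)
```

Because the recording is cheap and the analysis runs offline, expensive instrumentation does not slow down the original execution.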

An Example (1)

<exe://pam-foreground-~3451> prov:endedAtTime 199090196 .
<exe://getent~3451> a prov:Activity .
<exe://getent~3451> rdf:type dt:getent .
<exe://cut~3452> a prov:Activity .
<exe://cut~3452> rdf:type dt:cut .
<file:/etc/nsswitch.conf> a prov:Entity .
<file:/etc/nsswitch.conf> rdfs:label "/etc/nsswitch.conf" .
<file:/etc/nsswitch.conf> rdf:type dt:Unknown .
<exe://getent~3451> prov:used <file:/etc/nsswitch.conf> .
# unused file:3477815296:getent~3451:/etc/passwd:r0:w0:f524288
<exe://getent~3451> prov:startedAtTime 199090196 .
<exe://getent~3451> prov:endedAtTime 200392668 .
<file:FD0_3452> a prov:Entity .
<file:FD0_3452> rdfs:label "FD0_3452"
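To show how such a trace can be queried, here is a small sketch that reads a few of the statements above and reports which files each process used. It is a simplified line-based reader written for this illustration, not a Turtle parser; a real analysis would more likely load the trace with an RDF library. The time values are kept as raw integers because the slide does not state their unit.

```python
import re
from collections import defaultdict

# A subset of the statements from the example trace.
TRACE = """\
<exe://getent~3451> a prov:Activity .
<exe://cut~3452> a prov:Activity .
<file:/etc/nsswitch.conf> a prov:Entity .
<exe://getent~3451> prov:used <file:/etc/nsswitch.conf> .
# unused file:3477815296:getent~3451:/etc/passwd:r0:w0:f524288
<exe://getent~3451> prov:startedAtTime 199090196 .
<exe://getent~3451> prov:endedAtTime 200392668 .
"""

# One "<subject> predicate object ." statement per line.
TRIPLE = re.compile(r"^<([^>]+)>\s+(\S+)\s+(.+?)\s+\.$")


def parse(trace):
    """Return (subject, predicate, object) tuples, skipping comments and blanks."""
    triples = []
    for line in trace.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = TRIPLE.match(line)
        if match:
            triples.append(match.groups())
    return triples


triples = parse(TRACE)

used = defaultdict(list)   # activity -> entities it used
times = defaultdict(dict)  # activity -> start/end values
for subject, predicate, obj in triples:
    if predicate == "prov:used":
        used[subject].append(obj.strip("<>"))
    elif predicate in ("prov:startedAtTime", "prov:endedAtTime"):
        times[subject][predicate] = int(obj)

getent = "exe://getent~3451"
elapsed = times[getent]["prov:endedAtTime"] - times[getent]["prov:startedAtTime"]
print(dict(used), elapsed)
```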

An Example (2)

Example (3)

Example (4)

Conclusion

• Tension between:
  – putting stuff together; and
  – documenting what’s been done
• Provenance helps
• Issues in collection
• Standards + Stealth¹

¹ hat tip Carole Goble

Questions?

• More info:
  – openphacts.org
  – data2semantics.org
  – provbook.org
  – https://github.com/m000/dtracker
  – Manolis Stamatogiannakis, Paul Groth and Herbert Bos. Decoupling Provenance Capture and Analysis from Execution. Theory and Practice of Provenance 2015
  – Luc Moreau, Paul Groth, James Cheney, Timothy Lebo, Simon Miles, The rationale of PROV, Web Semantics: Science, Services and Agents on the World Wide Web, Available online 20 April 2015 http://dx.doi.org/10.1016/j.websem.2015.04.001.
  – Paul Groth, "Transparency and Reliability in the Data Supply Chain," IEEE Internet Computing, vol. 17, no. 2, pp. 69-71, March-April 2013
  – Paul Groth, "The Knowledge-Remixing Bottleneck," IEEE Intelligent Systems, vol. 28, no. 5, pp. 44-48, Sept.-Oct. 2013