An Identity Crisis in the Life Sciences
![Page 1: An Identity Crisis in the Life Sciences](https://reader036.fdocuments.in/reader036/viewer/2022081512/56814583550346895db261cd/html5/thumbnails/1.jpg)
An Identity Crisis in the Life Sciences
Jun Zhao, Carole Goble and Robert Stevens
The University of Manchester, UK
Thanks to: Tom Oinn, Matthew Pocock, Daniele Turi
And our users
And the EPSRC
![Page 2: An Identity Crisis in the Life Sciences](https://reader036.fdocuments.in/reader036/viewer/2022081512/56814583550346895db261cd/html5/thumbnails/2.jpg)
UK e-Science project
Middleware for in silico experiments by individual life scientists, stuck in under-resourced labs, who use other people’s applications.
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt 12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt 12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct 12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt 12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt 12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
![Page 3: An Identity Crisis in the Life Sciences](https://reader036.fdocuments.in/reader036/viewer/2022081512/56814583550346895db261cd/html5/thumbnails/3.jpg)
![Page 4: An Identity Crisis in the Life Sciences](https://reader036.fdocuments.in/reader036/viewer/2022081512/56814583550346895db261cd/html5/thumbnails/4.jpg)
![Page 5: An Identity Crisis in the Life Sciences](https://reader036.fdocuments.in/reader036/viewer/2022081512/56814583550346895db261cd/html5/thumbnails/5.jpg)
Bioinformatics workflows
Taverna workflow workbench
[Figure labels: collected metabolic pathway; computed BLAST report; Bioinformatician users]
• Data pipelines
• Collect data
• Compute data
• Frequently updated public resources
• Open world
• Get the same data product in different experiment contexts
![Page 6: An Identity Crisis in the Life Sciences](https://reader036.fdocuments.in/reader036/viewer/2022081512/56814583550346895db261cd/html5/thumbnails/6.jpg)
[Figure: provenance graph in three layers: Data (data generated by services/workflows), Services and Concepts, plus literals and Properties.
Data nodes: urn:data1, urn:data2, urn:data12, urn:data:3, urn:data:f1, urn:data:f2, urn:hit1…, urn:hit2…., urn:hit5…, urn:hit8…., urn:hit10….., urn:hit50…..
Service invocations: urn:BlastNInvocation3, urn:compareinvocation3, urn:invocation5
Concepts: Blast_report, Sequence_hit, SwissProt_seq, LSDatum, DatumCollection, "Find similar sequence"
Literals: "New sequence", "Missed sequence"
Edge labels: input, output, instanceOf, type, hasHits, contains, hasName, performsTask, similar_sequence_to, directlyDerivedFrom, distantlyDerivedFrom]
![Page 7: An Identity Crisis in the Life Sciences](https://reader036.fdocuments.in/reader036/viewer/2022081512/56814583550346895db261cd/html5/thumbnails/7.jpg)
Fusion between different data models using shared concepts and shared data
[Figure: a second provenance model joined to the provenance graph of the previous slide through shared concepts and shared data.
Nodes: urn:genbank1…, urn:genbank2…, urn:genbank50…, urn:BlastNInvocation3, urn:data:3, urn:data2, urn:williamsA, urn:williamsB, urn:run5, urn:run7
Concepts: Blast_report, DNA_sequence, Blast_service
Sources and identifiers: GenBank, UniProt, LSID
Edge labels: outputOf, inputOf, createdFrom, createdBy, runOf, instanceOf, contains_similiar_seq_to]
• Add assertions
• Add rules
• Reason over assertions
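As a minimal sketch of "add rules, reason over assertions", the block below forward-chains a single rule over a handful of triples. The rule itself (a chain of directlyDerivedFrom steps implies distantlyDerivedFrom) is an assumption made for illustration: the slides name both properties but do not state how they relate. The node ids appear in the figure, while the edges between them are invented here.

```python
# Provenance assertions as (subject, predicate, object) triples.
# The specific edges are illustrative, not read off the slide.
assertions = {
    ("urn:data12", "directlyDerivedFrom", "urn:data2"),
    ("urn:data2", "directlyDerivedFrom", "urn:data1"),
}


def infer_distant(triples):
    """Apply one assumed rule until nothing new appears:
    a chain of two or more direct derivations yields distantlyDerivedFrom."""
    direct = {(s, o) for s, p, o in triples if p == "directlyDerivedFrom"}
    reach = set(direct)  # every (descendant, ancestor) pair found so far
    changed = True
    while changed:
        changed = False
        for a, b in list(reach):
            for c, d in direct:
                if b == c and (a, d) not in reach:
                    reach.add((a, d))
                    changed = True
    # Pairs reachable only through a chain are the "distant" derivations.
    return {(s, "distantlyDerivedFrom", o) for s, o in reach - direct}


inferred = infer_distant(assertions)
```

In an RDF store the same effect would normally come from the store's own rule engine; the loop above only shows what such a rule computes.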
![Page 8: An Identity Crisis in the Life Sciences](https://reader036.fdocuments.in/reader036/viewer/2022081512/56814583550346895db261cd/html5/thumbnails/8.jpg)
Putting Provenance to Use
• Single workflow
  – audit trail
  – recipe
• Multiple workflow runs (versions)
  – Aggregation: gathering
  – Integration: merging
  – Comparison: differencing
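The three multi-run operations can be sketched with plain sets of triples. This is an illustration only, assuming each run's provenance is a set of (subject, predicate, object) tuples; the triples are invented, and the helper `nodes` is a hypothetical name.

```python
# Each run's provenance is modelled as a set of (subject, predicate, object) triples.
# The triples are illustrative; "urn:gb:seq1" stands for a datum collected by both runs.
run1 = {
    ("urn:lsid:tav:brpt1", "derivedFrom", "urn:gb:seq1"),
    ("urn:lsid:tav:brpt1", "rdf:type", "BLASTReport"),
}
run2 = {
    ("urn:lsid:tav:brpt2", "derivedFrom", "urn:gb:seq1"),
    ("urn:lsid:tav:brpt2", "rdf:type", "BLASTReport"),
}


def nodes(graph):
    """All subjects and objects mentioned in a graph."""
    return {s for s, _, _ in graph} | {o for _, _, o in graph}


# Aggregation (gathering): keep the runs side by side, keyed by run.
aggregated = {"Run1": run1, "Run2": run2}

# Integration (merging): one graph; nodes common to both runs join them.
integrated = run1 | run2
shared_nodes = nodes(run1) & nodes(run2)

# Comparison (differencing): statements present in one run but not the other.
only_in_run1 = run1 - run2
only_in_run2 = run2 - run1
```

Integration is only useful when `shared_nodes` is non-empty, which is why data identity matters for combining provenance.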
![Page 9: An Identity Crisis in the Life Sciences](https://reader036.fdocuments.in/reader036/viewer/2022081512/56814583550346895db261cd/html5/thumbnails/9.jpg)
Any idea?
• 30350027
• gi:30350027
• Life Science Identifier
• A ruddy great lump of RDF
![Page 10: An Identity Crisis in the Life Sciences](https://reader036.fdocuments.in/reader036/viewer/2022081512/56814583550346895db261cd/html5/thumbnails/10.jpg)
URIs for Data
urn:lsid:mygrid.ac.uk:data:49841:1
• Life Science Identifier
• Protocol for allocation and resolution
• Adopted by a range of data providers
• LSIDs from the data providers' databases for the data we collect during workflow execution
• LSIDs for the data products we compute during workflow execution
http://www.omg.org/cgi-bin/doc?lifesci/2003-12-02
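An LSID has the form `urn:lsid:authority:namespace:object`, with an optional trailing revision. The sketch below splits the example identifier from this slide into those parts; the `LSID` tuple and `parse_lsid` function are hypothetical names introduced for illustration, not part of any LSID library.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text: str) -> LSID:
    """Split an LSID of the form urn:lsid:authority:namespace:object[:revision]."""
    parts = text.split(":")
    has_prefix = [p.lower() for p in parts[:2]] == ["urn", "lsid"]
    if not has_prefix or len(parts) not in (5, 6) or "" in parts[2:]:
        raise ValueError(f"not an LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


lsid = parse_lsid("urn:lsid:mygrid.ac.uk:data:49841:1")
# authority "mygrid.ac.uk", namespace "data", object "49841", revision "1"
```

A bare accession such as `gi:30350027` is rejected by this check, which is the point of the previous slide: without the authority and namespace the number alone says little.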
![Page 11: An Identity Crisis in the Life Sciences](https://reader036.fdocuments.in/reader036/viewer/2022081512/56814583550346895db261cd/html5/thumbnails/11.jpg)
Having a BLAST in every workflow!
[Figure: workflow diagram. Inputs: Seq, database, score. Processors: BLAST, BLAST_simplifer, GenBank_retrieve. Intermediate and final data: BlastReport, a list of Sequences, GenBankReport.]
![Page 12: An Identity Crisis in the Life Sciences](https://reader036.fdocuments.in/reader036/viewer/2022081512/56814583550346895db261cd/html5/thumbnails/12.jpg)
Alignment of sequence AC005089
![Page 13: An Identity Crisis in the Life Sciences](https://reader036.fdocuments.in/reader036/viewer/2022081512/56814583550346895db261cd/html5/thumbnails/13.jpg)
Computed Collections and Collected data items
[Figure: three BLAST reports. Two of them list Sequence1, Sequence2 and Sequence3; the third lists Sequence1, Sequence2 and Sequence4. BLASTsimplifer turns each report into a SEQ list (listOf).]
![Page 14: An Identity Crisis in the Life Sciences](https://reader036.fdocuments.in/reader036/viewer/2022081512/56814583550346895db261cd/html5/thumbnails/14.jpg)
[Figure: two BLAST reports, one listing Sequence1, Sequence2 and Sequence3, the other Sequence1, Sequence2 and Sequence4. BLASTsimplifer turns each report into a SEQ list (listOf). Annotations: Equivalent data; Corresponding data; Data co-references; Context of the workflow.]
![Page 15: An Identity Crisis in the Life Sciences](https://reader036.fdocuments.in/reader036/viewer/2022081512/56814583550346895db261cd/html5/thumbnails/15.jpg)
Aggregation of repeated runs
[Figure: provenance graphs of Run1 and Run2 for sequence AC005089.
Nodes: urn:lsid:tav:ic531, urn:lsid:tav:ic537, urn:lsid:tav:ic538, urn:lsid:tav:57b6, urn:lsid:tav:57b13, urn:lsid:tav:57b14
Types (rdf:type): BLASTReport, DNASeq
Edge labels: derivedFrom, refersTo]
![Page 16: An Identity Crisis in the Life Sciences](https://reader036.fdocuments.in/reader036/viewer/2022081512/56814583550346895db261cd/html5/thumbnails/16.jpg)
![Page 17: An Identity Crisis in the Life Sciences](https://reader036.fdocuments.in/reader036/viewer/2022081512/56814583550346895db261cd/html5/thumbnails/17.jpg)
External Duplicates
gi:15145617
ac073846
urn:lsid:myg:ac073846
mmu:11423
Different providers
A replica
Different tool providers
Sequence
![Page 18: An Identity Crisis in the Life Sciences](https://reader036.fdocuments.in/reader036/viewer/2022081512/56814583550346895db261cd/html5/thumbnails/18.jpg)
LSID Assignment Process
[Figure: architecture diagram. Labels: Workflow enactor; Provenance service; Data service; External domain service; Data storage group; wfEvents; Taverna LSID Authority; MySQL relational stores; KAVE; BAKLAVA; Customized DB (twice); Jena/Sesame RDF store.]
• Equivalent data in repeated runs
• Duplicate ids for these data
![Page 19: An Identity Crisis in the Life Sciences](https://reader036.fdocuments.in/reader036/viewer/2022081512/56814583550346895db261cd/html5/thumbnails/19.jpg)
Provenance from two repeated runs
[Figure: provenance graphs of Run1 and Run2.
Nodes: urn:lsid:tav:brpt1, urn:lsid:tav:brpt2, urn:lsid:tav:seqcollection1, urn:lsid:tav:seqcollection2, urn:lsid:tav:seq1, urn:lsid:tav:seq2
Edge labels: my:derivedFrom, my:hasElement
Annotation: No convergence]
![Page 20: An Identity Crisis in the Life Sciences](https://reader036.fdocuments.in/reader036/viewer/2022081512/56814583550346895db261cd/html5/thumbnails/20.jpg)
Execution duplicates
[Figure: the workflow (BLAST, BlastReport, BLAST_simplifer, a list of Seq, GenBank_retrieve) with two BLAST reports, urn:lsid:tav:brpt1 and urn:lsid:tav:brpt2. Each contains Sequence1, identified as urn:gb:seq1.]
But hidden!!
![Page 21: An Identity Crisis in the Life Sciences](https://reader036.fdocuments.in/reader036/viewer/2022081512/56814583550346895db261cd/html5/thumbnails/21.jpg)
Execution duplicates
[Figure: the workflow (BLAST, BlastReport, BLAST_simplifer, a list of Seq, GenBank_retrieve) with two SEQ1 lists, each holding Sequence1, Sequence2 and Sequence3 (listOf).
Identifiers shown: urn:tav:seqc1, urn:tav:seq1, urn:tav:seqc2, urn:tav:seq2, urn:gb:seq1]
![Page 22: An Identity Crisis in the Life Sciences](https://reader036.fdocuments.in/reader036/viewer/2022081512/56814583550346895db261cd/html5/thumbnails/22.jpg)
Managing identity co-reference
• Identity co-reference:
  – Identifying duplicate identities that refer to the same object while keeping their context
• An approach:
  – An IDSet entity
    • Identity equivalence for collected data
    • Identity correspondence for computed data
  – An identity service
  – Identity normalisation and cleansing activity
![Page 23: An Identity Crisis in the Life Sciences](https://reader036.fdocuments.in/reader036/viewer/2022081512/56814583550346895db261cd/html5/thumbnails/23.jpg)
IDSet entity
• IDSet(BLASTrpt) = {{urn:tav:brpt1}, {urn:tav:brpt3}}
[Figure: IDSets. Labels: Sequence urn:gb:seq1; BLASTreport urn:lsid:tav:brpt1; IDSet1; IDSet3; Query by its identity; Query by its content; merge; IDSet created by another organization.]
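One way to picture an IDSet is as a labelled set of identities with a merge operation. The sketch below is an assumption-laden illustration, not the project's data model: the `kind` field, the class layout, the pairing of `urn:tav:seq1` with `urn:gb:seq1`, and the identifier `urn:example:seq-a` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class IDSet:
    """A group of identities believed to denote the same object.

    kind is "equivalence" (collected data) or "correspondence" (computed
    data); recording the distinction as a field is an assumption.
    """
    kind: str
    ids: set = field(default_factory=set)

    def merge(self, other: "IDSet") -> "IDSet":
        """Return a new IDSet containing the identities of both sets."""
        if self.kind != other.kind:
            raise ValueError("cannot merge IDSets of different kinds")
        return IDSet(self.kind, self.ids | other.ids)


local = IDSet("equivalence", {"urn:gb:seq1", "urn:tav:seq1"})
# A hypothetical IDSet published by another organization for the same sequence.
remote = IDSet("equivalence", {"urn:gb:seq1", "urn:example:seq-a"})

# Merge only when the two sets already share at least one identity.
merged = local.merge(remote) if local.ids & remote.ids else None
```

Keeping the original identities inside the set, rather than replacing them, is what lets each identity retain the context in which it was created.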
![Page 24: An Identity Crisis in the Life Sciences](https://reader036.fdocuments.in/reader036/viewer/2022081512/56814583550346895db261cd/html5/thumbnails/24.jpg)
Extended Architecture
[Figure: architecture diagram extended with an identity service. Labels: Workflow enactor; Provenance service; Data service; External domain service; Data storage group; wfEvents; Taverna LSID Authority; MySQL relational stores; BAKLAVA; Customized DB (twice); Identity service; KAVE+; Jena/Sesame RDF store; MySQL relational store; Identity store; KAVE.]
![Page 25: An Identity Crisis in the Life Sciences](https://reader036.fdocuments.in/reader036/viewer/2022081512/56814583550346895db261cd/html5/thumbnails/25.jpg)
Identifying collected product
[Figure: KAVE+ passes the identity urn:gb:seq1 to the identity service, which works against the identity store.]
1. Receive an identity
2. Look for or create its IDSet
3. Store the id and the IDSet
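The receive, look-for-or-create, and store steps for a collected product can be sketched as a small in-memory service. Everything here is a stand-in: the class, its method names, and the `IDSet1`-style naming are hypothetical, and a dictionary replaces the identity store.

```python
import itertools


class IdentityService:
    """In-memory stand-in for the identity service and its identity store."""

    def __init__(self):
        self._idset_of = {}   # identity -> IDSet name
        self._members = {}    # IDSet name -> set of identities
        self._counter = itertools.count(1)

    def register(self, identity):
        """Receive an identity, look for or create its IDSet, store both."""
        # Look for an IDSet that already holds this identity.
        name = self._idset_of.get(identity)
        if name is None:
            # None found: create a fresh, empty IDSet.
            name = f"IDSet{next(self._counter)}"
            self._members[name] = set()
        # Store the identity together with its IDSet.
        self._members[name].add(identity)
        self._idset_of[identity] = name
        return name

    def members(self, name):
        return set(self._members[name])


service = IdentityService()
first = service.register("urn:gb:seq1")    # a first run collects the sequence
second = service.register("urn:gb:seq1")   # a repeated run collects it again
other = service.register("gi:15145617")    # a different collected item
```

Because collected data arrive with the provider's identifier, a repeated run lands in the IDSet created by the first run instead of producing a new one.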
![Page 26: An Identity Crisis in the Life Sciences](https://reader036.fdocuments.in/reader036/viewer/2022081512/56814583550346895db261cd/html5/thumbnails/26.jpg)
Identifying a collection product
[Figure: KAVE+ passes the collection identity urn:lsid:seqc1 (a SEQ2 list, listOf Seq1, Seq2, Seq3) to the identity service, which works against the identity store. The resulting IDSet holds urn:lsid:seqc1 and urn:lsid:seqc2.]
• Receive an identity
• Look for equivalent collection
• Look for or create its IDSet
• Store the id and the IDSet
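For a computed collection the identity is minted fresh on every run, so the lookup has to go through the collection's content. The sketch below assumes two collections are equivalent when they have the same members, ignoring order; that criterion, the class name, and the identifier `urn:lsid:seqc3` are assumptions made for illustration.

```python
class CollectionIdentityStore:
    """Groups collection ids into IDSets by comparing member identities."""

    def __init__(self):
        # frozenset of member ids -> IDSet (a set of collection ids)
        self._by_members = {}

    def register(self, collection_id, member_ids):
        # Look for an equivalent collection: same members, order ignored
        # (an assumed equivalence criterion).
        key = frozenset(member_ids)
        idset = self._by_members.setdefault(key, set())
        # Store the id in the IDSet that was found or created.
        idset.add(collection_id)
        return idset


store = CollectionIdentityStore()
store.register("urn:lsid:seqc1", ["Seq1", "Seq2", "Seq3"])
idset = store.register("urn:lsid:seqc2", ["Seq3", "Seq1", "Seq2"])
# A hypothetical third collection with a different member.
other = store.register("urn:lsid:seqc3", ["Seq1", "Seq2", "Seq4"])
```

For collections nested in several levels, the member ids would themselves need resolving to IDSets first, which is one source of the scalability concern raised in the discussion.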
![Page 27: An Identity Crisis in the Life Sciences](https://reader036.fdocuments.in/reader036/viewer/2022081512/56814583550346895db261cd/html5/thumbnails/27.jpg)
Putting the Identity Service to Use
Provenance Integration
Provenance Aggregation
Identity Management
Provenance Normalization
[Figure: Run1 with nodes b1, c1, s1 and Run2 with nodes b2, c2, s2.]
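Provenance normalization can be sketched as rewriting every identity to one representative of its IDSet, so that graphs from repeated runs converge. The edges below are assumptions (the slide shows only node labels), and picking the lexicographically smallest id as the representative is an arbitrary choice for this sketch.

```python
def normalise(triples, idsets):
    """Rewrite every identity to one canonical member of its IDSet.

    The smallest id in each set is used as the canonical one; this is an
    arbitrary choice made for the sketch.
    """
    canonical = {}
    for members in idsets:
        representative = min(members)
        for identity in members:
            canonical[identity] = representative
    return {
        (canonical.get(s, s), p, canonical.get(o, o))
        for s, p, o in triples
    }


# Assumed edges between the nodes shown for each run.
run1 = {("c1", "derivedFrom", "b1"), ("c1", "hasElement", "s1")}
run2 = {("c2", "derivedFrom", "b2"), ("c2", "hasElement", "s2")}
idsets = [{"b1", "b2"}, {"c1", "c2"}, {"s1", "s2"}]

before = run1 | run2              # four statements, no shared nodes
after = normalise(before, idsets)  # two statements once ids converge
```

The original ids remain available in the IDSets, so the context of each run can still be recovered after normalization.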
![Page 28: An Identity Crisis in the Life Sciences](https://reader036.fdocuments.in/reader036/viewer/2022081512/56814583550346895db261cd/html5/thumbnails/28.jpg)
Discussion
• Scalability issues:
  – Normalizing provenance graphs
  – Building IDSet for collections with multiple hierarchies
• Open world data type-free context
• Use experimental context more effectively: workflows are not independently executed.
• Granularity of identity
• Identity aware operations in workflow
• Multiple naming schemes
• Migration duplicates
• Compacting data results
![Page 29: An Identity Crisis in the Life Sciences](https://reader036.fdocuments.in/reader036/viewer/2022081512/56814583550346895db261cd/html5/thumbnails/29.jpg)
Conclusion
• Combining provenance depends on finding points of commonality, such as data identity.
• Duplicate identities will occur in an open world
• Hard to achieve uniqueness without community commitment
• Different types of equivalent objects
• How much can be avoided?
• And how much has to be repaired?