Spreading Semantics Over Biology

Spreading Semantics Over Biology Phillip Lord Newcastle University


Transcript of Spreading Semantics Over Biology

Page 1: Spreading Semantics Over Biology

Spreading Semantics Over Biology

Phillip Lord

Newcastle University

Page 2: Spreading Semantics Over Biology

Overview

• Conclusions

• Data Integration in ComparaGRID

• Annotation in CARMEN and CISBAN

• Computing with Semantics

• The future

Page 3: Spreading Semantics Over Biology

Conclusions

• Thin Semantics is Good

• More Semantics is Better

• Shared Semantics is Wonderful

Page 4: Spreading Semantics Over Biology

Key Problems

• Scalability – both in technology and processes

• Usability

• Autonomy

Page 5: Spreading Semantics Over Biology

Methods for Data Integration

– Combining data from multiple, autonomous data sources.

• TAMBIS – ontology-driven mediation of querying

• EcoCyc – ontology-driven schema for warehousing

• BioPAX – ontology-defined interchange format

– More recently, ComparaGRID

Page 6: Spreading Semantics Over Biology

ComparaGRID

Roslin

Newcastle

Cambridge

John Innes

NCYC

Manchester

• 6 Investigators

• 5 Researchers

• Commenced: 2003

Page 7: Spreading Semantics Over Biology

ComparaGRID’s Problem Domain

Page 8: Spreading Semantics Over Biology

Many Model Organism Databases

12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa

Page 9: Spreading Semantics Over Biology

Data Models, Model Data

Page 10: Spreading Semantics Over Biology

Databases and Knowledge

[Diagram labels] domain ontology; database

SequenceRecord

Sequence

S_hasID, S_hasSeqStr, S_hasLength

Molecule

DNA SequenceRepresentation

Representation

seqString, length, id
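The link between a database record and ontology properties on this slide can be sketched in code. This is only an illustration, assuming that the database fields id, seqString and length pair up with the properties S_hasID, S_hasSeqStr and S_hasLength. That pairing is inferred from the label names. The class, the function, the subject naming scheme and the example values are hypothetical and not taken from ComparaGRID.

```python
from dataclasses import dataclass


@dataclass
class SequenceRecord:
    """A database row, using the field names shown on the slide."""
    id: str
    seqString: str
    length: int


def to_triples(record):
    """Express one row as (subject, property, value) statements.

    The field-to-property pairing is an assumption based on the names.
    """
    subject = f"SequenceRecord/{record.id}"  # hypothetical naming scheme
    return [
        (subject, "S_hasID", record.id),
        (subject, "S_hasSeqStr", record.seqString),
        (subject, "S_hasLength", record.length),
    ]


# Made-up example row.
row = SequenceRecord(id="seq1", seqString="acatttctac", length=10)
triples = to_triples(row)
```

Once rows are expressed against shared properties like this, a query can be phrased in ontology terms rather than per-database column names.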

Page 11: Spreading Semantics Over Biology

The Fluxion Stack

Raw data

Raw data

Pub service

Trans service

integrator

query

data

Aggregation, Semantics, Syntax

JDBC, OWL, OWL

Page 12: Spreading Semantics Over Biology

The difficulties

• The Cost of Integration – building ontologies is often hard

• The Cost of Managing Change – biological knowledge tends to undergo a lot of flux

• The Scalability of Expressive Ontologies.

Page 13: Spreading Semantics Over Biology

Getting the Semantics Upfront

• Instead of annotating heterogeneous data sources after the event, why not do so upfront?

• Originators of the data are likely to understand it best.

• Spreads the cost among those contributing.

Page 14: Spreading Semantics Over Biology

CARMEN

Code, Analysis, Repository and Modelling for e-Neuroscience

www.carmen.org.uk

Engineering and Physical Sciences Research Council

Page 15: Spreading Semantics Over Biology

Consortium & Profile

Stirling

St. Andrews

Newcastle

York

Sheffield

Cambridge

Imperial

Plymouth

Warwick

Leicester

Manchester

• £4M over 4 years

• 20 Investigators

• Commenced 1st October 2006

Page 16: Spreading Semantics Over Biology

Virtual Laboratory for Neurophysiology

• Enabling sharing and collaborative exploitation of data, analysis code and expertise that are not physically collocated

Page 17: Spreading Semantics Over Biology

The need for clear metadata

• Most neuroscience data is relatively simple in structure

• But often contextually complex

• Sometimes associated with behavioural features

Page 18: Spreading Semantics Over Biology

How do we represent…

Laboratory Experiments

In silico Analysis

Derived data

Page 19: Spreading Semantics Over Biology

Functional Genomics Experiment (FuGE)

• Model of common components in science investigations, such as materials, data, protocols, equipment and software.

• Provides a framework for capturing complete laboratory workflows, enabling the integration of pre-existing data formats.

Page 20: Spreading Semantics Over Biology

Re-use

CARMEN

Brain anatomy

BIRNLex, FMA

Taxonomy

NCBI Taxonomy

Sample preparation

sepCV

Page 21: Spreading Semantics Over Biology

What we need – lab based

CARMEN

Age/stage development

Subject preparation

Subject training

Subject task

Experiment process

Equipment

Subject stimulus

Page 22: Spreading Semantics Over Biology

What we need – In silico

CARMEN

File formats

Data structures

Statistics

Algorithms

Software

Page 23: Spreading Semantics Over Biology

Align with OBI

• Aims to provide an ontology for the life sciences

• Consortium of 15 communities, from crop science to neuroscience

• CARMEN will align and contribute to OBI

Ontology for Biomedical Investigations

Page 24: Spreading Semantics Over Biology

The Difficulties

• Even with a lot of pre-existing work there is a lot to describe

• OBI has 15 communities involved in it

Bio-Imaging
– Jeff Grethe; Biomedical Informatics Research Network (BIRN) Coordinating Center; University of California, San Diego; Winter 2007
– Daniel Rubin; Radiological Society of North America (RSNA); National Center for Biomedical Ontology at Stanford Medical Informatics and the Department of Radiology, Stanford University; Winter 2007
– Bill Bug; Biomedical Informatics Research Network (BIRN); Laboratory of Bioimaging and Anatomical Informatics, in the Department of Neurobiology and Anatomy, Drexel University College of Medicine; Spring 2006

Cellular Assays
– Stefan Wiemann; DKFZ

Clinical Investigations
– Jennifer Fostel; Clinical Trial Ontology; NIEHS, National Institute for Environmental Health Sciences; Spring 2004
– Tina Hernandez-Boussard; Department of Genetics, Stanford Medical School; Fall 2007

Crop Sciences
– Richard Bruskiewich; Generation Challenge Programme; IRRI

Electrophysiology
– Frank Gibson; CARMEN; School of Computing Science, Newcastle University; Spring 2007

Environmental Omics
– Norman Morrison; NERC Environmental Bioinformatic Centre and School of Computer Science, The University of Manchester; Spring 2004

Flow Cytometry
– Ryan Brinkman; ISAC and FICCS; British Columbia Cancer Research Center and University of British Columbia in the Department of Medical Genetics, Vancouver, BC, Canada; Spring 2004

Genomics/Metagenomics
– Dawn Field; Genome Catalogue; NERC Centre for Ecology and Hydrology; Winter 2005
– Tanya Gray; Winter 2005

Immunology
– Richard Scheuermann; ImmPort, FICCS, BioHealthBase; University of Texas Southwestern Medical Center, in the Department of Pathology and Division of Biomedical Informatics; Spring 2006
– Bjoern Peters; Immune Epitope Database and Analysis Resource; La Jolla Institute for Allergy and Immunology; Spring 2006

In Situ Hybridization and Immunohistochemistry
– Eric Deutsch; MISFISHIE

Metabolomics
– Susanna Sansone; MSI; The European Bioinformatics Institute EBI-EMBL, NET Project; Spring 2004
– Daniel Schober; Spring 2006

Neuroinformatics
– Bill Bug; Biomedical Informatics Research Network (BIRN); Laboratory of Bioimaging and Anatomical Informatics, in the Department of Neurobiology and Anatomy, Drexel University College of Medicine; Spring 2006
– Frank Gibson; CARMEN; School of Computing Science, Newcastle University; Spring 2007

Nutrigenomics
– Philippe Rocca-Serra; RSBI; The European Bioinformatics Institute EBI-EMBL, NET Project; Spring 2004

Polymorphism
– Tina Hernandez-Boussard; PharmGKB; Department of Genetics, Stanford Medical School; Winter 2006, Fall 2007

Proteomics
– Susanna Sansone; PSI; The European Bioinformatics Institute EBI-EMBL, NET Project; Spring 2004
– Daniel Schober; Spring 2006
– Luisa Montecchi; The European Bioinformatics Institute EBI-EMBL; Spring 2006
– Chris Taylor
– Trish Whetzel; Spring 2004
– Frank Gibson; School of Computing Science, Newcastle University; Spring 2007

Toxicogenomics
– Jennifer Fostel; Toxicogenomics; NIEHS, National Institute for Environmental Health Sciences; Spring 2004
– Susanna Sansone; RSBI; The European Bioinformatics Institute EBI-EMBL, NET Project; Spring 2004

Transcriptomics
– Susanna Sansone; MGED; The European Bioinformatics Institute EBI-EMBL, NET Project; Spring 2004
– Philippe Rocca-Serra; Spring 2004
– Trish Whetzel; Spring 2004
– Chris Stoeckert; Department of Genetics and Center for Bioinformatics, University of Pennsylvania; Spring 2004
– Gilberto Fragoso; NCI Center for Bioinformatics; Spring 2004
– Joe White
– Helen Parkinson; The European Bioinformatics Institute EBI-EMBL; Spring 2004
– Mervi Heiskanen
– Liju Fan; Ontology Workshop, LLC, Columbia, MD, USA; Spring 2004
– Helen Causton; Imperial College; Spring 2004

Page 25: Spreading Semantics Over Biology

Information Extraction

• More semantics is better?

• How do we extract the information?

http://en.wikipedia.org/wiki/Image:Brain_090407.jpg

Page 26: Spreading Semantics Over Biology

Centre for Integrated Systems Biology of Ageing and Nutrition (CISBAN)

Page 27: Spreading Semantics Over Biology

Identification of novel interactions between nutrition and damage using automated yeast screening and analysis

Robot

Reference set of 5,000 mutant strains

Robot

Screen mutants for sensitivity to damage/nutrition

‘Folate’ + - + -

‘MMS’ - - + +

• Data curation.

• Functional analysis.

• Interactions with in silico programme.

Page 28: Spreading Semantics Over Biology

CISBAN dataflow

Page 29: Spreading Semantics Over Biology

Data Entry with SYMBA

http://symba.sourceforge.net/

Page 30: Spreading Semantics Over Biology

Data Entry with SYMBA

Page 31: Spreading Semantics Over Biology

CARMEN and CISBAN

• We can provide more semantics upfront

• This should make data more explicit

• If we still need to integrate, it should be easier.

• Like much of biology, these projects largely use structurally simple technologies that are not Semantic Web (SW) based.

• This is a lot of effort to go to; what do we hope to gain?

Page 32: Spreading Semantics Over Biology

Yeast Hub

YeastHub: a semantic web use case for integrating data in the life sciences domain

Kei-Hoi Cheung, Kevin Y. Yip, Andrew Smith, Remko deKnikker, Andy Masiar and Mark Gerstein

doi:10.1093/bioinformatics/bti1026

Page 33: Spreading Semantics Over Biology

A rapturous reception

• So the general idea is take a bunch of data, convert it to RDF, dump it into a RDF triple store […] to discover interesting things ?

– http://www.nodalpoint.org/user/greg

• Putting a lot of RDF in a bucket isn’t integration. Not unless the RDF is the same schema and using the same concepts

– Carole Goble, University of Manchester
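The point about shared schema and concepts can be illustrated with a toy example. This is a sketch only: the two sources, the gene identifier, the predicate names and the mapping are all hypothetical, and plain Python tuples stand in for RDF triples.

```python
# Two hypothetical sources stating the same fact with different predicates.
source_a = {("gene:G1", "hasFunction", "DNA repair")}
source_b = {("gene:G1", "function_of_gene", "DNA repair")}


def align(triples, predicate_map):
    """Rewrite predicates onto a shared vocabulary; unmapped ones pass through."""
    return {(s, predicate_map.get(p, p), o) for s, p, o in triples}


def objects(triples, subject, predicate):
    """All objects asserted for a given subject and predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}


# "RDF in a bucket": the union holds two unrelated-looking statements.
bucket = source_a | source_b

# With an agreed vocabulary the two statements collapse into one fact.
shared = align(bucket, {"function_of_gene": "hasFunction"})
```

In the unaligned bucket a query on either predicate sees only one source's statement. After alignment both sources answer the same query, which is the sense in which shared concepts are what make the merge an integration.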

Page 34: Spreading Semantics Over Biology

A thin layer of semantics

• Inverse Document Frequency (IDF) is a method for weighting terms when ranking documents; rare words carry more information than common ones.

• In this case, YeastHub has a common semantics describing the type of document.

• “protein” or “sequence” occurs a lot in Uniprot, but less in the bulk corpus

• Rather than treating all documents equally, they use IDF twice.

• Leveraging Biological Identifier Relationships and Related Documents to Enhance Information Retrieval for Proteomics, Smith et al., Bioinformatics, doi:10.1093/bioinformatics/btm452
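The idea on this slide can be sketched with a toy corpus. This is a minimal illustration, assuming the textbook definition IDF = log(N / df) and assuming that "using IDF twice" means computing it once over the whole corpus and once within a single document type. How Smith et al. actually combine the two values is not stated here, so the sketch only reports both. The documents and the function names are made up.

```python
import math


def idf(term, docs):
    """Inverse document frequency: log(N / df), or 0.0 if the term never occurs."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df) if df else 0.0


# Toy corpus: (document type, set of words). Entirely made up.
corpus = [
    ("uniprot", {"protein", "sequence", "kinase"}),
    ("uniprot", {"protein", "sequence", "phosphatase"}),
    ("uniprot", {"protein", "domain"}),
    ("literature", {"yeast", "ageing", "protein"}),
    ("literature", {"yeast", "screen", "folate"}),
    ("literature", {"neuron", "spike", "recording"}),
]

all_docs = [words for _, words in corpus]
uniprot_docs = [words for doc_type, words in corpus if doc_type == "uniprot"]


def two_level_idf(term):
    """Return (IDF over the whole corpus, IDF within the UniProt documents)."""
    return idf(term, all_docs), idf(term, uniprot_docs)
```

In this toy data "protein" appears in every UniProt document, so its within-type IDF is zero even though it still carries some weight over the bulk corpus. Knowing the document type is the thin layer of semantics that makes the second calculation possible.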

Page 35: Spreading Semantics Over Biology

Thin Semantics

• The semantics of YeastHub is not deep.

• But even a thin layer of semantics is useful.

• If we modify our technologies to use it.

• A large part of library sciences has been encoded in 15 tags – Dublin Core
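As an illustration of how thin such a layer can be, the 15 elements of the Dublin Core Metadata Element Set fit in a single tuple. The sketch below describes this talk with two of them. The record and the helper function are hypothetical, and only the title and creator values come from the slides.

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = (
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
)

# A hypothetical record for this talk; every element is optional.
record = {
    "title": "Spreading Semantics Over Biology",
    "creator": "Phillip Lord",
}


def unknown_elements(rec):
    """Return the keys of a record that are not Dublin Core elements."""
    return sorted(key for key in rec if key not in DC_ELEMENTS)
```

A tool that understands only these 15 names can already search, sort and group records from otherwise unrelated collections.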

Page 36: Spreading Semantics Over Biology

Using Ontology to Classify Members of a Protein Family

• Katy Wolstencroft (Bioinformatics)

• Daniele Turi (Instance Store)

• Phil Lord (myGrid)

• Lydia Tabernero (Protein Scientist)

• Matt Horridge, Nick Drummond et al (Protégé OWL)

• Andy Brass and Robert Stevens (Bioinformatics)

Page 37: Spreading Semantics Over Biology

The Protein Phosphatases

• A large superfamily of proteins

• Motifs determine a protein’s place within the family

• Recognising that motifs imply class membership is normally manual

• Can these be captured in an ontology?

Page 38: Spreading Semantics Over Biology

Phosphatase Functional Domains

Andersen et al (2001) Mol. Cell. Biol. 21 7117-36

Page 39: Spreading Semantics Over Biology

Definition of Tyrosine Phosphatase

Class TyrosineReceptorProteinPhosphatase

EquivalentTo: Protein That

- (contains atLeast-1 ProteinTyrosinePhosphataseDomain) and

- (contains 1 TransmembraneDomain)
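The class definition above can be read as a membership test over the domains found in a protein. The sketch below is a plain Python stand-in for that test, not an OWL reasoner and not the system described in the talk. It assumes "contains 1" means exactly one, and the function name and example domain lists are hypothetical.

```python
from collections import Counter


def is_tyrosine_receptor_protein_phosphatase(domains):
    """Apply the slide's definition to a list of domain names.

    Assumes: at least one ProteinTyrosinePhosphataseDomain and
    exactly one TransmembraneDomain.
    """
    counts = Counter(domains)
    return (
        counts["ProteinTyrosinePhosphataseDomain"] >= 1
        and counts["TransmembraneDomain"] == 1
    )


# Hypothetical domain compositions.
receptor_like = [
    "ProteinTyrosinePhosphataseDomain",
    "ProteinTyrosinePhosphataseDomain",
    "TransmembraneDomain",
]
cytoplasmic_like = ["ProteinTyrosinePhosphataseDomain"]
```

A reasoner does more than this hand-written check: it derives membership from the definition itself, so changing the ontology changes the classification without touching any code.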

Page 40: Spreading Semantics Over Biology

Classifying Proteins

>uniprot|Q15262|PTPK_HUMAN Receptor-type protein-tyrosine phosphatase kappa precursor (EC 3.1.3.48) (R-PTP-kappa).

MDTTAAAALPAFVALLLLSPWPLLGSAQGQFSAGGCTFDDGPGACDYHQDLYDDFEWVHVSAQEPHYLPPEMPQGSYMIVDSSDHDPGEKARLQLPTMKENDTHCIDFSYLLYSQKGLNPGTLNILVRVNKGPLANPIWNVTGFTGRDWLRAELAVSSFWPNEYQVIFEAEVSGGRSGYIAIDDIQVLSYPCDKSPHFLRLGDVEVNAGQNATFQCIATGRDAVHNKLWLQRRNGEDIPV………..

InterPro

Instance Store

Reasoner

Translate

Codify

Page 41: Spreading Semantics Over Biology

Results

• Human phosphatases have been classified using the system

• The ontology system refined classification

- DUSC contains zinc finger domain characterised and conserved – but not in classification

- DUSA contains a disintegrin domain previously uncharacterised – evolutionarily conserved

• We have automated a part of the scientific process

– We have defined our domain model in a computational form

– We have collected some data

– We have let the reasoner test whether the model fits the data

• The semantics here are deeper than with YeastHub, which allows us to reason

Page 42: Spreading Semantics Over Biology

Summary

• Ontologies have been used in life sciences for data integration

• Increasingly, they are being used to describe the data early in the scientific process

• Even thin semantics can be exploited for information retrieval

• Richer semantics allows more use of computational inference

Page 43: Spreading Semantics Over Biology

Richer Expressivity

• There are applications of more expressive semantics

• Can we move from specific software to generic software with specific knowledge models?

• But, scalability and usability remain the bottleneck

Page 44: Spreading Semantics Over Biology

Industrialisation

• Semantics in the life sciences is moving from small to large scale

– building ontologies has now become very committee driven

– we don’t understand ontology engineering as we do software engineering

– Encapsulation, modularisation, continuous integration.

Page 45: Spreading Semantics Over Biology

Future

• ComparaGRID has semantics describing schema which means data integration can happen on-the-fly.

• Death to data warehouses!

• CARMEN and CISBAN are gathering semantically enriched data in the first place. An End to Integration!

• Semantics during dissemination

• Knowledge for All.

Page 46: Spreading Semantics Over Biology

Acknowledgements

The ComparaGRID consortium is Madhuchhanda Bhattacharjee, Richard Boys, Tony Burdett, Rob Davey, Jo Dicks, David Marshall, Andy Law, Phillip Lord, Trevor Paterson, Matthew Pocock, Peter Rice, Ian Roberts, Robert Steven, Paul Watson, Darren Wilkinson and Neil Wipat, Andy Gibson

CISBAN is Tom Kirkwood (PI), Thomas von Zglinicki (PI), David Lydall (PI), Anil Wipat (PI), Stephen Addinall (Research Associate), Suzanne Advani (Technician), Kim Clugston (Research Associate), Sharon Denley (PA to Professor Tom Kirkwood), Amanda Greenall (Research Associate), Jennifer Hallinan (Research Associate), Dominic Kurian (Research Associate), Conor Lawless (Research Associate), Guiyuan Lei (Research Associate), Allyson Lister (Research Associate), Mandy Maddick (Research Associate), Satomi Miwa (Research Associate), Glyn Nelson (Research Associate), Bob Nicholson (Superintendent), Sharon Oljslagers (Technician), Joao Passos (Research Associate), Carole Proctor (Research Associate), Daryl Shanley (Research Associate), Oliver Shaw (Research Associate), Donna Stark (Research Secretary), Laura Steedman (Technician), Joyce Wang (Technician), Darren Wilkinson (Professor of Stochastic Modelling)

Page 47: Spreading Semantics Over Biology

CARMEN Acknowledgements

Professor Colin Ingram, Professor Jim Austin, Professor Leslie Smith, Professor Paul Watson, Dr. Stuart Baker, Professor Roman Borisyuk, Dr. Stephen Eglen, Professor Jianfeng Feng, Dr. Kevin Gurney, Dr. Tom Jackson, Dr. Marcus Kaiser, Dr. Phillip Lord, Dr. Paul Overton, Dr. Stefano Panzeri, Dr. Rodrigo Quian Quiroga, Dr. Simon Schultz, Dr. Evelyne Sernagor, Dr. V. Anne Smith, Dr. Tom Smulders, Professor Miles Whittington, Christoph Echtermeyer, Martyn Fletcher, Frank Gibson, Mark Jessop, Dr. Bojian Liang, Juan Martinez-Gomez, Dr. Chris Mountford, Agah Ogungboye, Georgios Pitsilis, Dr. Daniel Swan

University of St Andrews

The University of Sheffield

Page 48: Spreading Semantics Over Biology

Holiday Pictures

Page 49: Spreading Semantics Over Biology

Questions?