Challenges in developing and implementing standards-based approaches in bioinformatics
South African National Bioinformatics Institute
Electric Genetics
University of the Western Cape
The impact of open standards is much like that of open source software
Free software
Open source software
Myriad of licenses
Low or no cost access
Tools
• Existing and growing numbers of initiatives
• Applications: EMBOSS, BLAST
• Environments, vocabularies, databases
Access to and support of open source (OS) tools
• Understanding
– How do I install and use this system/application?
– What if I have never used a non-Windows environment?
– Who else is using this that I can share my questions with?
– Is anyone out there going to help me?
– Is there a credible user base?
Impact
• Commercial
– Are there any legal or regulatory hurdles to employing open source tools?
– Is it a time sink?
– Any impact on early adopters within the company?
– How is this supported in terms of impact on the enterprise?
Impact
• Academic
– Funding threat
• Calls for public funds to be spent on development that should be made available back to the public
– Requirement to distribute freely
– If a developer wishes to do OS projects, there is sometimes a requirement to commercialise as part of funding
Population Genetics of Open Source
– Longevity of use as a function of penetrance
– Most new mutations, even if they are not selected against, never succeed in entering the population.
– Where N is the total finite population size, 1/(2N) is the probability that a new mutation will become fixed.
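The 1/(2N) figure can be checked numerically. The sketch below is only an illustration, not part of the original talk: it assumes a diploid population (hence 2N gene copies), a neutral mutation, and a simple Wright-Fisher resampling model; the function names are hypothetical.

```python
import random


def neutral_fixation_probability(n_individuals: int) -> float:
    """Chance that a single new neutral mutation fixes among 2N gene copies."""
    return 1.0 / (2 * n_individuals)


def simulate_fixation(n_individuals: int, trials: int, seed: int = 0) -> float:
    """Estimate the same probability with a neutral Wright-Fisher model (assumed)."""
    rng = random.Random(seed)
    total_copies = 2 * n_individuals
    fixed = 0
    for _ in range(trials):
        count = 1  # the mutation starts as one copy
        while 0 < count < total_copies:
            freq = count / total_copies
            # Each copy in the next generation is drawn independently.
            count = sum(rng.random() < freq for _ in range(total_copies))
        if count == total_copies:
            fixed += 1
    return fixed / trials


if __name__ == "__main__":
    print("expected :", neutral_fixation_probability(10))
    print("simulated:", simulate_fixation(10, 4000))
```

In the analogy of the slide, a small population (a small user community) gives any one new tool a better chance of becoming the only one in use, while in a large population most newcomers are lost.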
Software zygosity
• Two possible forms of software or data
– Open
– Controlled access
• Heterozygous
– Both are used
• Homozygous
– Only one is used
Software selection - packaging
• Opportunity to use
• Support and documentation
• Distribution and marketing
• Training
• User base
• Knowledge of users
• Repeat uses = impact
• Funded/stable development
• Commercial or open source support or both
Effects of software selection
• Same selection can have different outcomes
– Robertson and Reeve
• Change in wing size in Drosophila
• Number of cells
• Size of cells
• Selection for a web browser
– Mosaic*
– Netscape $ > *
– Mozilla*
– Internet Explorer *
– Opera $ > *
Manifesto for bioinformatics
• open source
• open standards
• open annotation
• open data
• open health care
North – South Divide
• Generation of genome data has been performed mostly in the developed west
• Major laboratories and researchers are not in developing countries
• Researchers at ‘site of infection’ have to compete with developed country researchers for access to genome projects
• Developing countries lack resources for large scale projects
• Developing countries provide the genetic material
Lessons so far
• Sharing knowledge is key to developing knowledge
• Sharing is difficult if there is an impediment to access
• Open philosophies provide access to those with limited resources but a need for knowledge
• Standards improve sharing
• Those who would benefit from access to knowledge should contribute to standards and sharing
Why did SANBI get involved in controlled vocabularies?
Legacy expertise in gene expression data
RP1 expression product is unique to retina
ESTs have nasty annotations
Genome imminent
Leverage
Looking at ESTs across libraries
• Library descriptions are diverse and in many cases non-informative
• NCI_CGAP_Lip2
• UT0117 (75% of all EST libraries)
• Soares foetal %^&*()
• What were the actual expression states that these libraries captured?
eVOC: Controlled Vocabulary for Unifying Gene Expression Data
• Consistent description of different libraries
• Mapped orthogonal vocabularies
• Anatomy, Cell type, Pathology, Development
• 7016 EST libraries classified + 104 SAGE
• 700 controlled terms
• Applying terms to SAGE and EST allows cross comparisons for the first time, Microarray to follow…
Uses of eVOC
• Provided as an integrated public resource which allows:
– Linking libraries, transcripts and genes with expression terms
– Analysis of expression level and tissue expression profiles
– Comparison of expression between species
– Linkage of genome sequence with expression phenotype information
Data Structure
• 4 orthogonal, mutually exclusive knowledge domains
• Independent pure hierarchies
– One parent but multiple children
• Advantages of pure hierarchies over more complex data structures
– Easily maintained
– Easily expanded
– Easily visualised
– Human and computer readable
– Powerful simple querying
• Each node has a specific concept
– One or more synonymous terms
• Nasal
• Nose
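A pure hierarchy of this kind is simple to hold in memory. The following is a minimal sketch, not the eVOC implementation: the class names, the case-insensitive index and the placement of "nose" under "respiratory" are all assumptions made for illustration. Each node has exactly one parent, any number of children, and a set of synonyms that resolve to the same concept.

```python
class Node:
    """One concept in a pure hierarchy: a single parent, many children."""

    def __init__(self, name, synonyms=(), parent=None):
        self.name = name
        self.synonyms = set(synonyms)
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def path(self):
        """Return the path from the root, e.g. 'a > b > c'."""
        names, node = [], self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return " > ".join(reversed(names))


class Hierarchy:
    """A tree of concepts with a label index that covers synonyms."""

    def __init__(self, root_name):
        self.root = Node(root_name)
        self._index = {root_name.lower(): self.root}

    def add(self, name, parent_name, synonyms=()):
        labels = [name, *synonyms]
        for label in labels:
            if label.lower() in self._index:
                raise ValueError(f"label already used: {label}")
        parent = self._index[parent_name.lower()]
        node = Node(name, synonyms, parent)
        for label in labels:
            self._index[label.lower()] = node
        return node

    def find(self, label):
        """Look up a concept by its name or any synonym."""
        return self._index.get(label.lower())


# Illustrative terms only; the real eVOC placement may differ.
anatomy = Hierarchy("anatomical system")
anatomy.add("respiratory", "anatomical system")
anatomy.add("lung", "respiratory")
anatomy.add("nose", "respiratory", synonyms=["nasal"])
print(anatomy.find("Nasal").path())
```

Because every node has one parent, the path to the root is unique, which is what keeps maintenance, display and querying simple.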
No More Tangles?
• Where multiple parents/relationship types exist and could be represented in a DAG, these can often be “untangled” into more than one hierarchy
• Untangling a tangled ontology. A complex mixed ontology can be simplified by creating simpler ontologies representing distinct domains.
[Figure: Untangled Ontology. Separate hierarchies for Entities (Person; Body Substance: Steroid, Organic Ion, Testosterone, Glutamate), Roles (Clinical Role: Patient, Doctor; Physiological Role: Neurotransmitter, Hormone) and Value Types (Sex: Male, Female; Age: Adult, Child).]
[Figure: Tangled Ontology. A single structure under System that mixes Person, Body Substance, Man, Woman, Doctor, Patient, Female doctor, Male doctor, Steroid, Hormone, Neurotransmitter, Organic Ion, Testosterone and Glutamate.]
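One way to see where an ontology is tangled is to look for nodes with more than one parent. The sketch below is illustrative only: the parent lists are an assumed reading of the tangled figure (the slide does not give the edges), and the function name is hypothetical.

```python
# Assumed edges for part of the tangled example; not taken from the slide.
TANGLED_PARENTS = {
    "Person": ["System"],
    "Body Substance": ["System"],
    "Man": ["Person"],
    "Woman": ["Person"],
    "Doctor": ["Person"],
    "Patient": ["Person"],
    "Female doctor": ["Woman", "Doctor"],
    "Male doctor": ["Man", "Doctor"],
}


def tangled_nodes(parents):
    """Return the nodes that have more than one distinct parent."""
    return sorted(
        node for node, node_parents in parents.items()
        if len(set(node_parents)) > 1
    )


# After untangling, the same concept is one term from each simple hierarchy.
FEMALE_DOCTOR = {"Entities": "Person", "Roles": "Doctor", "Value Types": "Female"}

print(tangled_nodes(TANGLED_PARENTS))
```

A structure for which `tangled_nodes` returns an empty list is a pure hierarchy in the sense used on the Data Structure slide.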
Relationships
• Single type of relationship between nodes
• Anatomical System
– part-of
• Cell Type + Pathology
– subclass
• Developmental Stage
– is-a
Anatomical System Ontology
• Untangling of Computational Biology and Informatics Laboratory’s (CBIL) terms (ICDM9)
• removal of all references to tissue type, cell type or developmental stage
• digestive system > pancreatic islets – Anatomical Site (spatial position)
• 372 terms
Cell Type ontology
• fine-grained description of where a gene is expressed.
• listing of human cell types extracted from Gray’s Anatomy (Gray, H. L., Bannister, L. H., Williams, P. L., Collins, P., and Berry, M. M., 1995).
• 154 different cell types.
Developmental Stage ontology
• Ordered timeline of human development for the description of gene expression in temporal space
• Examples: “embryo” and “adult”.
• Embryogenesis is further divided into the standard Carnegie stages (www.ana.ed.ac.uk/anatomy/database/humat/) – first two months of human development.
• further divided into weekly and yearly categories
• 133 terms
Pathology Ontology
• WHO ICD-9-CM basis
• classification of morbidity and mortality information
– Stats and indexing of hospital records by disease and surgery performed
• first two levels
• sample description
• 141 terms
[Figure: libraries mapped to the Anatomical System term “liver” and to the Pathology term “neoplasia”, shown as two overlapping sets.]
Query: “liver AND neoplasia”
Result: Intersection of libraries mapped to liver and to neoplasia
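Because the ontologies are orthogonal, an AND query across them reduces to a set intersection. The following is a minimal sketch with made-up library identifiers; the dictionary layout and function names are assumptions, not the eVOC query interface.

```python
# Hypothetical term-to-library mappings; the library IDs are invented.
LIBRARY_TERMS = {
    "anatomical_system": {
        "liver": {"lib_A", "lib_B", "lib_C"},
        "brain": {"lib_D"},
    },
    "pathology": {
        "neoplasia": {"lib_B", "lib_C", "lib_E"},
        "normal": {"lib_A", "lib_D"},
    },
}


def libraries_for(ontology, term):
    """Libraries mapped to one term in one ontology (empty set if none)."""
    return set(LIBRARY_TERMS.get(ontology, {}).get(term, set()))


def query_and(*criteria):
    """Intersect the library sets for every (ontology, term) pair given."""
    sets = [libraries_for(ontology, term) for ontology, term in criteria]
    return set.intersection(*sets) if sets else set()


result = query_and(("anatomical_system", "liver"), ("pathology", "neoplasia"))
print(sorted(result))
```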
Total cDNA library collection

Ontology              Total cDNA Libraries   Annotated Libraries   Not Annotated
Anatomical System     7016                   6752                  5.2%
Cell type             7016                   410                   94.2%
Developmental Stage   7016                   5891                  17.3%
Pathology             7016                   6401                  10.1%
Most libraries can be annotated with Anatomical System terms as these are generally present in the library record. Less information is available for Cell Type and Developmental Stage as these are not consistently recorded when library information is captured.
[Figure: Ontologies, Clone Libraries, ESTs. ESTs U30152, U30154, U30159, U30162, U30163, U30164 and U58979 are linked to the clone libraries “Human TNF-treated BG9 fibroblasts (ID:1260)” and “Homo sapiens foreskin fibroblast (ID:1620)”, which are linked to ontology terms. Anatomical System: foreskin. Pathology: Not classified. Developmental Stage: Not classified. Cell Type: fibroblast.]
The four expression ontologies are used to annotate cDNA clone libraries. ESTs can be transitively associated with ontology terms via their association with a unique clone library.
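The transitive step can be written as two lookups: EST to library, then library to terms. The sketch below is hedged: the figure does not say which ESTs belong to which library or which library carries the annotation shown, so the assignments here are assumptions chosen for illustration, as are the names.

```python
# Library annotation, one slot per ontology (None means not classified).
# Attaching this annotation to library 1620 is an assumption.
LIBRARY_ANNOTATION = {
    1620: {
        "Anatomical System": "foreskin",
        "Pathology": None,
        "Developmental Stage": None,
        "Cell Type": "fibroblast",
    },
}

# Each EST comes from exactly one clone library (assignments assumed).
EST_TO_LIBRARY = {
    "U30152": 1620,
    "U30154": 1620,
    "U58979": 1620,
}


def terms_for_est(accession):
    """Return the ontology terms an EST inherits from its clone library."""
    library_id = EST_TO_LIBRARY[accession]
    annotation = LIBRARY_ANNOTATION[library_id]
    return {
        ontology: term if term is not None else "Not classified"
        for ontology, term in annotation.items()
    }


print(terms_for_est("U30152"))
```

Annotating at the library level means thousands of ESTs receive terms from a single curated record.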
Browsing, Querying and Curation
An interface for browsing, curating and querying the ontologies is under development by Electric Genetics (see poster by Visagie et al. this meeting).
Curation
• Central, versioned database of the eVOC ontologies
• Curators who are domain experts add and delete terms or synonyms and make changes to the hierarchies on an ongoing basis
• Groups that modify the ontologies are encouraged to contribute these modifications back to eVOC
Applications
What happens when you link libraries (cDNA/SAGE) or microarray probes to terms in each ontology?
– Expression profile selection of libraries
– Terms > Libraries > Transcripts > Genes
– Genes > Terms
– Breadth of expression
• Assess differential expression levels (SAGE)
• Assess differential tissue expression (cDNA & SAGE)
– Physical distribution of expression across the genome
• Expression profile prioritisation of disease candidates
• Link genome to standardised controlled terms
– Assess expression clustering
– Cross species expression comparison
– Comparison of local data with whole picture
– Choice of libraries by, for instance, molecular pathology: Neoplasia
– Transitive integration with GO
Integration
Current
ICL Candidate Gene Profiler
A disease gene candidate identification system which integrates genomic data with the GO and eVOC ontologies to identify and rank genes which are candidates for known diseases.
Swiss Institute of Bioinformatics Transcriptome Database
Future
Ensembl Datamart: select expression profile in a defined genome region
GOBO apply for incorporation
MGED apply for inclusion as an MGED-approved expression ontology
Human Transcriptome Database
• H-Invitational, Odaiba, Japan
• Human FLcDNA annotation jamboree
• Non-redundant set of mapped, manually curated, expression profiled, classified cDNAs
• eVOC terms used to describe mRNA expression
High resolution of eVOC
• Xu et al. detected alternatively spliced transcripts genome-wide and identified those which show tissue specificity (Xu, Q., Modrek, B., and Lee, C. 2002)
• flat list of 46 human tissue classes
• isoform-specific EST lists provided for a subset of the genes
Gene Name: IRP3
  Isoform 1
    Xu et al.: Brain-specific
    eVOC: 5 nervous > brain; 1 respiratory > lung
  Isoform 2
    Xu et al.: No specificity
    eVOC: 2 urogenital > genital > female > uterus; 1 urogenital > genital > female > placenta; 1 haematological > blood
  Developmental stage terms: 4 infant; 3 adult
Gene Name: WNK1
  Isoform 1
    Xu et al.: Kidney-specific
    eVOC: 7 urinary > kidney
  Isoform 2
    Xu et al.: No specificity
    eVOC: 2 urogenital > genital > male > penis; 1 alimentary > pancreas
eVOC extends the expression information that can be obtained from other sources. IRP3, described by Xu et al. as having a brain-specific isoform, was shown to be infant brain specific by combining information gathered from the eVOC ontologies. The ESTs for each isoform were submitted to eVOC and the associated terms in each of the four ontologies were examined to identify expression state specificity.
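The comparison described above amounts to tallying the terms attached to each isoform's ESTs. The sketch below uses the IRP3 anatomical counts from the table; the function name and the 75% threshold for calling an isoform specific are assumptions for illustration, not the criterion used by the authors.

```python
from collections import Counter

# Anatomical System terms per EST for IRP3, using the counts in the table.
ISOFORM_1 = ["nervous > brain"] * 5 + ["respiratory > lung"]
ISOFORM_2 = (
    ["urogenital > genital > female > uterus"] * 2
    + ["urogenital > genital > female > placenta"]
    + ["haematological > blood"]
)


def dominant_term(est_terms, min_fraction=0.75):
    """Return the most common term if it covers enough ESTs, else None.

    min_fraction is an assumed cut-off, not a published value.
    """
    counts = Counter(est_terms)
    if not counts:
        return None
    term, hits = counts.most_common(1)[0]
    return term if hits / sum(counts.values()) >= min_fraction else None


print(Counter(ISOFORM_1))
print(dominant_term(ISOFORM_1))
print(dominant_term(ISOFORM_2))
```

Running the same tally over each of the four ontologies is what allows a finer description such as infant brain rather than brain alone.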
[Figure: GANESH deep annotation engine and ENSEMBL annotation engine linked to the Controlled Expression Vocabulary and the Candidate Gene Profiler (candidate gene enrichment). Annotation uses sequences generated in the lab and local domain expertise; annotation is served using DAS.]
Exon Skipping in Cancer:
• Determine chromosomal location of 1011 gene set on human genome sequence
• Assess the frequency and tissue distribution of exon skipping
• Determine functional significance of exon skipping
• Can the presence of transcripts demonstrating exon skipping be used as diagnostic/prognostic markers?
• Can the biological effect of the skip on the resulting protein be explained?
Genes with exon skipped transcripts found uniquely in cancer tissues
GENE: CD53 antigen
NORMAL FUNCTION: Panleukocyte marker; may function in the transduction of CD2-generated signals in T cells and natural killer (NK) cells
EFFECT OF SKIP ON PROTEIN: Reading frame remains intact. Prenyl group removed – effect unknown

GENE: Human trans-Golgi p230 (GOLG4A)
NORMAL FUNCTION: Peripheral membrane protein
EFFECT OF SKIP ON PROTEIN: Reading frame remains intact. No known functional motif affected

GENE: PTPN13 (protein tyrosine phosphatase)
NORMAL FUNCTION: Signaling molecule that regulates a variety of cellular processes including cell growth, differentiation, mitotic cycle, and oncogenic transformation.
EFFECT OF SKIP ON PROTEIN: Reading frame remains intact. PDZ domain removed – possible effect on intracellular signalling cascade
Case Study: TRANS-GOLGI P230
• Trans-Golgi p230 gene on chr3
• Membrane protein implicated in vesicular transport from cytoplasmic face of the Golgi
• 17 ESTs confirm skip of exon 2
• Other distinct exon skipping events previously described
• What were the expression terms associated with the exon 2 skip?
Tissue Expression Profile
[Figure: constitutive exon versus skipped exon, compared across Pathological (Normal, Neoplastic), Anatomical (Alimentary, Dermal, Reproductive, Lymphoreticular, Multisystem, Nervous), Developmental (Adult, Fetus) and Cell Type (Glioblast, Epithelial, Melanocyte) terms.]
Encouraging distribution
• Broad acceptance
• Supported offering
– Open Source to maximise
• Community involvement
• Acceptance
• Speed of development and resources
• Quality of science
– Commercial model to provide
• Support
• Documentation
• Customisation
• Distribution
• Accountability and interface to other commercial entities
Availability
– Can be used and modified without restriction under a BSD-style license
– Mailing list for comments, questions and suggestions: [email protected]
– Commercial support with Electric Genetics commercial grade software, customisation and enhanced mappings to commercial clones, libraries, proprietary data
Anatomy and Gene Expression Workshop
• Resource repository site @ NCI and mirrors
• Resource name
• Access type (FTP)
• Language/access software
• Purpose
• Domain
• Level of commitment
• Corresponding author (and when will respond by?)
• Status (dev/production)
• Used by? - applications
SANBI: Janet Kelso, Alan Christoffels, Soraya Bardien
Electric Genetics: Johann Visagie, Darren Otgaar, Gary Greyling, Tania Hide
Imperial College London: Damian Smedley, Mark McCarthy
Swiss Institute of Bioinformatics: Gregory Theiler, Victor Jongeneel
ENSEMBL: Arek Kasprzyk and Datamart team
SANBI
www.sanbi.ac.za
electric genetics