Bioinformatics Seminar Department of Computer Science, UIUC February 25, 2005 Analysis Environments...

Bioinformatics Seminar, Department of Computer Science, UIUC, February 25, 2005
Analysis Environments For Functional Genomics
Bruce R. Schatz, CANIS Laboratory, Institute for Genomic Biology, University of Illinois at Urbana-Champaign
[email protected] , www.canis.uiuc.edu

What are Analysis Environments

Functional Analysis
- Find the underlying mechanisms of genes, behaviors, diseases

Comparative Analysis
- Top-down data mining (vs bottom-up)
- Multiple sources, especially literature

Building Analysis Environments

Manual, by Humans
- Interaction: user navigation
- Classification: collection indexing

Automatic, by Computers
- Federation: search bridges
- Integration: results links

Trends in Analysis Environments

Central versus Distributed Viewpoints

The 90s, Pre-Genome
- Entrez (NIH NCBI) versus WCS (NSF Arizona)

The 00s, Post-Genome
- GO (NIH curators) versus BeeSpace (NSF Illinois)

Pre-Genome Environments

Focused on Syntax, pre-Web

WCS (Worm Community System)
- Search words across sources
- Follow links across sources
- Words automatic, Links manual

Towards Uniform Searching

Post-Genome Environments

Focused on Semantics, post-Web

BeeSpace (Honey Bee Inter Space)
- Navigate concepts across sources
- Integrate data across sources
- Concepts automatic, Links automatic

Towards Question Answering

Paradigm Shift
Towards Dry-Lab Biology, Walter Gilbert (Jan 1991)

“The new paradigm, now emerging, is that all the 'genes' will be known (in the sense of being resident in databases available electronically), and that the starting point of a biological investigation will be theoretical. An individual scientist will begin with a theoretical conjecture, only then turning to experiment to follow or test that hypothesis. ...

To use this flood of knowledge [the total sequence of the human and model organisms], which will pour across the computer networks of the world, biologists not only must become computer-literate, but also change their approach to the problem of understanding life. ...”

The Coming of Informational Science
Correlation of Information across Sources

NCBI Entrez

Community Systems

browse and share all the knowledge of a community

Formal
- data (database management)
- literature (information retrieval)

Informal
- results (electronic mail)
- news (bulletin boards)

knowledge (hypertext annotations)

Worm Community System

WCS Information
- Literature: BIOSIS, MEDLINE, newsletters, meetings
- Data: genes, maps, sequences, strains, cells

WCS Functionality
- Browsing: search, navigation
- Filtering: selection, analysis
- Sharing: linking, publishing

WCS: 250 users at 50 labs across Internet (1991)

WCS Molecular

WCS Cellular

WCS Publishing

WCS Linking

WCS invokes gm

WCS vis-à-vis acedb

Towards the Interspace

from Objects to Concepts
from Syntax to Semantics

Infrastructure is Interaction with Abstraction
- Internet is packet transmission across computers
- Interspace is concept navigation across repositories

THE THIRD WAVE OF NET EVOLUTION

PACKETS
OBJECTS
CONCEPTS

LEVELS OF INDEXES

[Figure: levels of indexes, ranging from FORMAL (manual) to INFORMAL (automatic). Labels: Technology, Engineering, Electrical, IEEE. Scopes: communities, groups, individuals.]

COMPUTING CONCEPTS

‘92: 4,000 (molecular biology)
‘93: 40,000 (molecular biology)
‘95: 400,000 (electrical engineering)
‘96: 4,000,000 (engineering)
‘98: 40,000,000 (medicine)

Simulating a New World

Obtain discipline-scale collection
- MEDLINE from NLM, 10M bibliographic abstracts
- human classification: Medical Subject Headings (MeSH)

Partition discipline into Community Repositories
- 4 core terms per abstract for MeSH classification
- 32K nodes with core terms (classification tree)

Community is all abstracts classified by core term
- 40M abstracts containing 280M concepts
- concept spaces took 2 days on NCSA Origin 2000

Simulating World of Medical Communities
- 10K repositories with > 1K abstracts (1K with > 10K)
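The partitioning step above can be sketched in a few lines. This is only an illustration under stated assumptions, not the original system: the `Abstract` record, the function names, and the toy data are all hypothetical. Each abstract is filed under every one of its core terms, which would be consistent with 10M abstracts at 4 core terms each giving roughly 40M abstract entries across the repositories.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Abstract:
    """Hypothetical record: an abstract ID plus its MeSH core terms."""
    pmid: str
    core_terms: tuple


def partition_by_core_term(abstracts):
    """File each abstract under every one of its core terms."""
    repositories = defaultdict(list)
    for abstract in abstracts:
        for term in abstract.core_terms:
            repositories[term].append(abstract.pmid)
    return dict(repositories)


def count_larger_than(repositories, threshold):
    """Count repositories holding more than `threshold` abstracts."""
    return sum(1 for ids in repositories.values() if len(ids) > threshold)


# Toy data, invented for illustration only.
abstracts = [
    Abstract("a1", ("Arthritis, Rheumatoid", "Analgesics")),
    Abstract("a2", ("Analgesics", "Gastrointestinal Hemorrhage")),
    Abstract("a3", ("Arthritis, Rheumatoid", "Analgesics")),
]
repositories = partition_by_core_term(abstracts)
total_entries = sum(len(ids) for ids in repositories.values())
```

With two core terms per toy abstract, three abstracts produce six repository entries, the same multiplication the slide figures suggest at full scale.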

Interspace Remote Access Client

Navigation in MEDSPACE

For a patient with Rheumatoid Arthritis, find a drug that reduces the pain (analgesic) but does not cause stomach (gastrointestinal) bleeding

Choose Domain

Concept Search

Concept Navigation

Retrieve Document

Navigate Document

Retrieve Document

Informational Science

Computational Science is widely accepted as The Third Branch of Science (beyond Experimental and Theoretical)

Genes are Computed, Proteins are Computed, Sequence “equivalences” are Computed.

Informational Science is coming to be accepted as The Fourth Branch of Science

Based on Information Science technologies for Functional Analysis across Information Sources

Post-Genome Informatics I

Comparative Analysis within the Dry Lab of Biological Knowledge

Classical Organisms have Genetic Descriptions. There will be NO more classical organisms beyond Mice and Men, Worms and Flies, Yeasts and Weeds.

Must use comparative genomics on classical organisms via sequence homologies and literature analysis.

Post-Genome Informatics II

Functional Analysis within the Dry Lab of Biological Knowledge

Automatic annotation of genes to standard classifications, e.g. Gene Ontology via homology on computed protein sequences.

Automatic analysis of functions to scientific literature, e.g. concept spaces via text extractions. Thus must use functions in literature descriptions.

Informatics: From Bases to Spaces

data Bases support genome data
- e.g. FlyBase has sequences and maps; genes annotated by GO and linked to literature
- e.g. BeeBase has computed annotations; protein homologies for similar genes via GO

information Spaces support biomedical literature
- e.g. BeeSpace uses automatically generated conceptual relationships to navigate functions

Gene Ontology

Gene Ontology

Gene Symbol | Data Source | Full Name…
Calca | MGI | calcitonin-related polypeptide
Cat-1 | WormBase | None
Cat-2 | WormBase | None
CCKR-Human | UniProt | Cholecystokinin receptor
CRF2-Rat | UniProt | Corticotropin releasing factor
Crhr2 | RGD | corticotrophin relse hormone
Egl-10 | WormBase | None
Egl-30 | WormBase | None
Feh-1 | WormBase | None
For | FlyBase | None

Conceptual Navigation in BeeSpace

[Figure: conceptual navigation in BeeSpace. Sources: Neuroscience Literature, Molecular Biology Literature, Bee Literature, FlyBase, WormBase, Bee Genome, Brain Region Localization, Brain Gene Expression Profiles. Users: Behavioral Biologist, Molecular Biologist, Neuroscientist.]

BeeSpace Analysis Environment

Build Concept Space of Biomedical Literature for Functional Analysis of Bee Genes
- Partition Literature into Community Collections
- Extract and Index Concepts within Collections
- Navigate Concepts within Documents
- Follow Links from Documents into Databases

Locate Candidate Genes in Related Literatures, then follow links into Genome Databases

Question Answering

Behaviour | Organism | Gene | Molecular Function | Reference

Foraging
Rover vs sitter phenotype | Drosophila melanogaster | for | Protein kinase G | 8
Roamer vs dweller phenotype | C. elegans | egl-4 | Protein kinase G | 16
Division of labour: age at onset of foraging | Apis mellifera | for | Protein kinase G | 9
Division of labour: age at onset of foraging | Apis mellifera | mvl | Mn transporter | 19
Division of labour: foraging-related? | Apis mellifera | per | Transcription cofactor | 68
Division of labour: foraging-related? | Apis mellifera | ache | Acetylcholine esterase | 69
Division of labour: foraging-related? | Apis mellifera | IP(3)K | Inositol signaling | 70
Foraging specialization: nectar vs. pollen | Apis mellifera | pkc | Protein kinase C | 71
Social feeding | Drosophila melanogaster | dpnf | Neuropeptide Y (NPY) homolog | 21
Social feeding (aggregation) | C. elegans | npr-1 | Receptor for NPY | 22, 23

Functional Phrases

<gene> encodes <chemical>
- Sokolowski and colleagues demonstrated in Drosophila melanogaster that the foraging gene (for) encodes a cGMP dependent protein kinase (PKG).
- The dg2 gene encodes a cyclic guanosine monophosphate (cGMP)-dependent protein kinase (PKG).

<chemical> affects/causes <behavior>
- Thus, PKG levels affected food-search behavior.
- cGMP treatment elevated PKG activity and caused foraging behavior.

<gene> regulates <behavior>
- Amfor, an ortholog of the Drosophila for gene, is involved in the regulation of age at onset of foraging in honey bees.
- This idea is supported by results for malvolio (mvl), which encodes a manganese transporter and is involved in regulating Drosophila feeding and age at onset of foraging in honey bees.
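As a rough illustration of how a template such as `<gene> encodes <chemical>` might be matched, the sketch below uses small hand-built lexicons and regular expressions. It is an assumption-laden stand-in, not the BeeSpace method, which the slides describe only as natural language processing; the lexicons, `GENE_ALIASES`, and `find_encodes` are hypothetical names introduced here.

```python
import re

# Hypothetical lexicons, hand-built for this sketch only.
GENE_ALIASES = {"for": "for", "foraging": "for", "dg2": "dg2",
                "mvl": "mvl", "malvolio": "mvl"}
CHEMICALS = ["protein kinase", "manganese transporter"]

GENE_RE = re.compile(r"\b(%s)\b" % "|".join(map(re.escape, GENE_ALIASES)))
CHEMICAL_RE = re.compile("|".join(map(re.escape, CHEMICALS)))


def find_encodes(sentence):
    """Match '<gene> encodes <chemical>'; return (gene, chemical) or None."""
    verb = re.search(r"\bencodes\b", sentence)
    if verb is None:
        return None
    # Heuristic: take the gene mention nearest the verb. This keeps the
    # preposition "for" earlier in a sentence from being read as the gene
    # symbol for, although it will not catch every such case.
    genes = list(GENE_RE.finditer(sentence, 0, verb.start()))
    chemical = CHEMICAL_RE.search(sentence, verb.end())
    if not genes or chemical is None:
        return None
    return GENE_ALIASES[genes[-1].group(1)], chemical.group(0)


s1 = ("Sokolowski and colleagues demonstrated in Drosophila melanogaster "
      "that the foraging gene (for) encodes a cGMP dependent protein kinase (PKG).")
s2 = ("This idea is supported by results for malvolio (mvl), which encodes "
      "a manganese transporter.")
s3 = "Thus, PKG levels affected food-search behavior."
```

The second sentence shows why the gene symbol `for` is awkward for pattern matching: the same string appears as an ordinary preposition, so entity recognition has to use context and cannot rely on string lookup alone.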

BeeSpace Software Implementation

Natural Language Processing
- Identify noun and verb phrases
- Recognize biological entities
- Compute biological relations

Statistical Information Retrieval
- Compute statistical contexts
- Support conceptual navigation
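One simple way to picture "statistical contexts" is a co-occurrence table: concepts that appear in the same documents are treated as related, and navigation follows the strongest relations. The sketch below assumes plain co-occurrence counts; the slides do not say which weighting was actually used, and `build_concept_space`, `related`, and the toy documents are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_concept_space(documents):
    """Map each concept to a Counter of the concepts it co-occurs with."""
    space = defaultdict(Counter)
    for concepts in documents:
        for a, b in combinations(sorted(set(concepts)), 2):
            space[a][b] += 1
            space[b][a] += 1
    return space


def related(space, concept, top_n=3):
    """Concepts most often seen with `concept`; ties broken alphabetically."""
    ranked = sorted(space[concept].items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]


# Toy documents, each reduced to its extracted concepts (invented data).
docs = [
    ["foraging", "protein kinase G", "Drosophila"],
    ["foraging", "protein kinase G", "honey bee"],
    ["foraging", "manganese transporter", "honey bee"],
]
space = build_concept_space(docs)
neighbours = related(space, "foraging")
```

Navigating from "foraging" would then suggest the concepts in `neighbours` as the next steps, which is the interactive behaviour the slide calls conceptual navigation.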

Data Integration (FlyBase Gene)

D. melanogaster gene foraging, abbreviated as for, is reported here. It has also been known in FlyBase as BcDNA:GM08338, CG10033 and l(2)06860. It encodes a product with cGMP-dependent protein kinase activity (EC:2.7.1.-) involved in protein amino acid phosphorylation which is a component of the cellular_component unknown. It has been sequenced and its amino acid sequence contains an eukaryotic protein kinase, a protein kinase C-terminal domain, a tyrosine kinase catalytic domain, a serine/Threonine protein kinase family active site, a cAMP-dependent protein kinase and a cGMP-dependent protein kinase. It has been mapped by recombination to 2-10 and cytologically to 24A2--4. It interacts genetically with Csr. There are 27 recorded alleles: 1 in vitro construct (not available from the public stock centers), 25 classical mutants (3 available from the public stock centers) and 1 wild-type. Mutations have been isolated which affect the larval nerve terminal and are behavioral, pupal recessive lethal, hyperactive, larval neurophysiology defective and larval neuroanatomy defective. for is discussed in 80 references (excluding sequence accessions), dated between 1988 and 2003. These include at least 6 studies of mutant phenotypes, 2 studies of wild-type function, 3 studies of natural polymorphisms and 7 molecular studies. Among findings on for function, for activity levels influence adult olfactory trap response to a food medium attractant. Among findings on for polymorphisms, the frequency of for R and for s strains in three natural populations are studied to determine the contribution of the local parasitoid community to the differences in for R and for s frequencies.

BeeSpace Information Sources

Biomedical Literature
- Medline (medicine)
- Biosis (biology)
- Agricola, CAB Abstracts, Agris (agriculture)

Model Organisms (heredity)
- Gene Descriptions (FlyBase, WormBase)

Natural Histories (environment)
- BeeKeeping Books (Cornell Library, Harvard Press)

Medical Concept Spaces (1998)

Medical Literature (Medline, 10M abstracts)
- Partition with Medical Subject Headings (MeSH)

Community is all abstracts classified by core term
- 40M abstracts containing 280M concepts
- computation is 2 days on NCSA Origin 2000

Simulating World of Medical Communities
- 10K repositories with > 1K abstracts (1K with > 10K)

Biological Concept Spaces (2005)

Compute concept spaces for All of Biology
- BioSpace across entire biomedical literature
- 50M abstracts across 50K repositories

Use Gene Ontology to partition literature into biological communities for functional analysis
- GO same scale as MeSH, but adequate coverage?
- GO light on social behavior (biological process)

Paradigm Shift
Dissecting Human Disease, Victor McKusick (Feb 2001)

Structural genomics → Functional genomics
Genomics → Proteomics
Map-based gene discovery → Sequence-based gene discovery
Monogenic disorders → Multifactorial disorders
Specific DNA diagnosis → Monitoring susceptibility
Analysis of one gene → Analysis of multi-gene pathways
Gene action → Gene regulation
Etiology (mutation) → Pathogenesis (mechanism)
One species → Several species

Needles and Haystacks

Genes
- Honey Bees have 13K genes
- Perhaps 100 have known functions

Paths
- Perhaps 30K protein families exist
- KEGG has 200 known pathways

Statistical Clustering for Interactive Discovery Across Two Orders of Magnitude!
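The "two orders of magnitude" claim can be checked directly from the slide's own round numbers. The snippet below takes those figures at face value (they are stated as rough estimates) and introduces a hypothetical helper, `orders_of_magnitude`, purely for the arithmetic.

```python
import math


def orders_of_magnitude(total, known):
    """Base-10 logarithm of the ratio between what exists and what is known."""
    return math.log10(total / known)


# Figures as given on the slide: 13K genes vs 100 with known functions,
# and 30K protein families vs 200 known KEGG pathways.
genes_gap = orders_of_magnitude(13_000, 100)
paths_gap = orders_of_magnitude(30_000, 200)
```

Both ratios (130 and 150) come out slightly above two orders of magnitude, which matches the slide's summary.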

Concept Switching

In the Interspace…
- each Community maintains its own repository
- Switching is navigating across repositories
- use your specialty vocabulary to search another specialty

CONCEPT SWITCHING

“Concept” versus “Term”
- set of “semantically” equivalent terms

Concept switching
- region to region (set to set) match

[Figure: a term inside a semantic region of a Concept Space]
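A set-to-set match between regions can be sketched with a plain overlap measure. Jaccard overlap is used here only as a stand-in, since the slides do not name the matching measure; `region_overlap`, `switch_concept`, and the toy vocabularies are hypothetical.

```python
def region_overlap(region_a, region_b):
    """Jaccard overlap between two sets of terms (0.0 when either is empty)."""
    a, b = set(region_a), set(region_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def switch_concept(source_region, target_space):
    """Pick the target concept whose term region best matches the source region."""
    scored = {name: region_overlap(source_region, region)
              for name, region in target_space.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]


# Toy regions, invented for illustration: one community's region of terms
# is matched against the regions of another community's concept space.
source = {"foraging", "food search", "feeding"}
target_space = {
    "locomotion": {"movement", "roaming", "crawling"},
    "feeding behavior": {"feeding", "food search", "ingestion"},
}
best_concept, score = switch_concept(source, target_space)
```

Matching whole regions instead of single terms is what lets a query phrased in one specialty's vocabulary land on the nearest concept in another, even when the exact query term is absent there.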

Biomedical Session

Categories and Concepts

Concept Switching

Document Retrieval

Future Technologies

Concept Switching
- Spreading activation, type tagging

Dynamic Indexing
- On-the-fly collections, during session

Path Matching
- Aggregating indexes, many repositories
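Spreading activation, listed above under Concept Switching, is a general technique in which activation starts at seed concepts and flows along weighted links, fading with distance. The sketch below is a generic textbook-style version under assumed parameters (`decay`, `steps`) and a toy graph; it is not taken from the Interspace software.

```python
def spread_activation(graph, seeds, decay=0.5, steps=2):
    """Propagate activation from seed concepts along weighted links."""
    activation = dict.fromkeys(seeds, 1.0)
    frontier = dict(activation)
    for _ in range(steps):
        next_frontier = {}
        for node, energy in frontier.items():
            for neighbour, weight in graph.get(node, {}).items():
                # Energy passed on shrinks with link weight and decay.
                passed = energy * weight * decay
                next_frontier[neighbour] = next_frontier.get(neighbour, 0.0) + passed
        for node, energy in next_frontier.items():
            activation[node] = activation.get(node, 0.0) + energy
        frontier = next_frontier
    return activation


# Toy concept graph with link weights in [0, 1] (invented data).
graph = {
    "foraging": {"protein kinase G": 1.0, "feeding": 0.5},
    "protein kinase G": {"cGMP": 1.0},
}
activation = spread_activation(graph, ["foraging"])
```

Concepts two links away from the seed still receive some activation, so a switch can reach related concepts that share no direct link with the starting term.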

THE NET OF THE 21st CENTURY

Beyond Objects to Concepts
Beyond Search to Analysis
Problem Solving via Cross-Correlating Multimedia Information across the Net

Every community has its own special library
Every community does semantic indexing

The Interspace approximates Cyberspace

Interactive Functional Analysis

BeeSpace will enable users to navigate a uniform space of diverse databases and literature sources for hypothesis development and testing, with a software system beyond a searchable database, using literature analyses to discover functional relationships between genes and behavior.

- Genes to Behaviors
- Behaviors to Genes
- Concepts to Concepts
- Clusters to Clusters
- Navigation across Sources

XSpace Information Sources

Organize Genome Databases (XBase)
Compute Gene Descriptions from Model Organisms
Partition Scientific Literature for Organism X
Compute XSpace using Semantic Indexing

Boost the Functional Analysis from Special Sources
Collecting Useful Data about Natural Histories
e.g. CowSpace Leverage in AIPL Databases

Towards the Interspace

The Analysis Environment technology is GENERAL!

BirdSpace? BeeSpace? PigSpace? CowSpace? BehaviorSpace? BrainSpace?

BioSpace… Interspace