Genomics and Bioinformatics - Computational Services and
Genomics and Bioinformatics
Doug Brutlag, Professor Emeritus
Biochemistry & Medicine (by courtesy)
Computational Molecular Biology
Biochem 218 – BioMedical Informatics 231
http://biochem218.stanford.edu/
Faculty, TAs and Staff
Doug Brutlag
Lee Kozar
Dan Davison
Maeve O’Huallachain
Course and Video Availability
• Alway M114 – Tuesdays & Thursdays 2:15-3:30 PM
• Course Web Site – http://biochem218.stanford.edu/
• Stanford Center for Professional Development – http://scpd.stanford.edu/
• Videos available 24 hours/day, 7 days/week
• Course offered Autumn, Winter and Spring quarters
Course Requirements
• Lectures
– Theoretical background of current methods
– Strengths and weaknesses of current approaches
– Future directions for improvements
• Demonstrations
– Applications (Mac, PC, Unix, Web)
– Web applications
– Illustrate homework
• All homework and questions must be submitted by email to [email protected]
• Several homework assignments (35%)
– Due one week after assigned
• Final project (Due March 12th)
– A critical or comparative review of computational approaches to any problem in computational molecular biology
– Propose new approach
– Implement a new approach
– Examples of previous projects for the class can be found at http://biochem218.stanford.edu/Projects.html
David Mount, Bioinformatics: Sequence and Genome Analysis, 2nd Edition
Jin Xiong, Essential Bioinformatics
Richard Durbin et al., Biological Sequence Analysis
Jones & Pevzner, Bioinformatics Algorithms
Dan Gusfield, Algorithms on Strings, Trees & Sequences
Baldi & Brunak, Bioinformatics: The Machine Learning Approach
Higgins & Taylor, Bioinformatics: Sequence, Structure & Databanks
NCBI Handbook, http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=handbook
EMBL-EBI Home Page, http://www.ebi.ac.uk/
Berg, Tymoczko & Stryer, Biochemistry, Fifth Edition
Benjamin Lewin, Genes IX
Genomics, Bioinformatics & Computational Biology
Computational Biology
Computational Molecular Biology
Bioinformatics, Genomics
Proteomics, Structural Genomics, Systems Biology
Databases, Machine Learning, Robotics
Statistics & Probability, Artificial Intelligence
Graph Theory, Information Theory
Algorithms
What is Bioinformatics?
Biological Information
DNA, RNA, Protein, Phenotype
Selection, Evolution
Individuals, Populations
Computational Goals of Bioinformatics
• Learn & Generalize: Discover conserved patterns (models) of sequences, structures, interactions, metabolism & chemistries from well-studied examples.
• Prediction: Infer function or structure of newly sequenced genes, genomes, proteins or proteomes from these generalizations.
• Organize & Integrate: Develop a systematic and genomic approach to molecular interactions, metabolism, cell signaling, gene expression…
• Simulate: Model gene expression, gene regulation, protein folding, protein-protein interaction, protein-ligand binding, catalytic function, metabolism…
• Engineer: Construct novel organisms, novel functions, or novel regulation of genes and proteins.
• Gene Therapy: Target specific genes or mutations, or use RNAi, to change a disease phenotype.
Central Paradigm of Molecular Biology
DNA → RNA → Protein → Phenotype (Symptoms)
Molecular Biology of the Gene 1965
Central Paradigm of Bioinformatics
Genetic Information → Molecular Structure → Biochemical Function → Phenotype (Symptoms)
MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVANALAHKYH
Challenges Understanding Genetic Information
Genetic Information → Molecular Structure → Biochemical Function → Phenotype
• Genetic information is redundant
• Structural information is redundant
• Genes and proteins are meta-stable
• Single genes have multiple functions
• Genes are one dimensional but function depends on three-dimensional structure
Redundancy in Genomic & Protein Sequences
• DNA is double-stranded
• Genetic code
• Acceptable amino-acid replacements
• Intron-exon variation
• Alternative splicing
• Strain variations (SNPs)
• Sequencing errors
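Two of these sources of redundancy are easy to make concrete. The sketch below assumes the standard genetic code (NCBI translation table 1) and plain uppercase DNA strings; the function and variable names are illustrative and do not come from the course materials. It shows that one strand of double-stranded DNA determines the other, and that different codon spellings can encode the same peptide.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code with codons enumerated in T, C, A, G order
# (assumption: NCBI translation table 1; "*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons in frame 1; trailing bases are ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# Group codons by the amino acid they encode to expose the degeneracy.
SYNONYMS = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    SYNONYMS[aa].append(codon)

# Two different DNA spellings of the same four-residue peptide.
print(translate("ATGGTGCATCTG"), translate("ATGGTACACTTA"))
print(reverse_complement("ATGGTGCATCTG"))
print({aa: len(codons) for aa, codons in sorted(SYNONYMS.items())})
```

A search that looks at only one strand, or that compares DNA where protein would do, can miss matches for exactly these reasons.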
Using A Controlled Vocabulary for Literature Search: http://www.ncbi.nlm.nih.gov/sites/entrez?db=mesh
Gene Ontology Database: http://www.geneontology.org/
UCSC Genome Browser: http://genome.ucsc.edu/
ExPASy Proteomics Server: http://www.expasy.ch/doc.html
Inferring Biological Function from Protein Sequence
Consensus Sequences or Sequence Motifs
Zinc Finger (C2H2 type)
C x {2,4} C x {12} H x {3,5} H
Sequence Similarity
Query VLSPADKTNVKAAWGKVGAHAGEVGAEALERMFLSFPTTKTYFPHF------DLSHGS
Match HLTPEEKSAVTALWGKV--NVDEYGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGN
Sequences of Common Structure or Function
A Typical Motif: Zinc Finger DNA Binding Motif
C..C............H....H
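The C2H2 consensus above maps directly onto a regular expression: each "x" becomes "." (any residue) and the repeat ranges carry over unchanged. The sketch below is a minimal illustration using Python's standard `re` module; the helper name `find_c2h2` and the test peptide are assumptions chosen to fit the pattern, not course data.

```python
import re

# "C x{2,4} C x{12} H x{3,5} H" written as a regular expression.
C2H2_PATTERN = re.compile(r"C.{2,4}C.{12}H.{3,5}H")


def find_c2h2(protein):
    """Return (start_index, matched_text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in C2H2_PATTERN.finditer(protein)]


# Illustrative peptide chosen to fit the pattern (an assumption).
peptide = "RPYACPVESCDRRFSRSDELTRHIRIHTGQKP"
print(find_c2h2(peptide))

# A globin fragment from the alignment above contains no such motif.
print(find_c2h2("HLTPEEKSAVTALWGKV"))
```

A pattern like this either matches or it does not. It cannot express that some residues are merely preferred at a position, which is the gap that weight matrices and profiles fill.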
Profiles, PSI-BLAST
Hidden Markov Models
HMM states: AA1 AA2 AA3 AA4 AA5 AA6; I1 I2 I3 I4 I5; D2 D3 D4 D5
Weight Matrices or Position-Specific Scoring Matrices
Pos    1   2   3   4   5   6   7   8   9  10  11  12
A      2   1   3  13  10  12  67   4  13   9   1   2
R      7   5   8   9   4   0   1  16   7   0   1   0
N      0   8   0   1   0   0   0   2   1   1  10   0
D      0   1   0   1  13   0   0  12   1   0   4   0
C      0   0   1   0   0   0   0   0   0   2   2   1
Q      1   1  21   8  10   0   0   7   6   0   0   2
E      2   0   0   9  21   0   0  15   7   3   3   0
G      9   7   1   4   0   0   8   0   0   0  46   0
H      4   3   1   1   2   0   0   2   2   0   5   0
I     10   0  11   1   2  10   0   4   9   3   0  16
L     16   1  17   0   1  31   0   3  11  24   0  14
K      3   4   5  10  11   1   1  13  10   0   5   2
M      7   1   1   0   0   0   0   0   5   7   1   8
F      4   0   3   0   0   4   0   0   0  10   0   0
P      0   6   0   1   0   0   0   0   0   0   0   0
S      1  17   0   8   3   1   3   0   2   2   2   0
T      5  22   3  11   1   5   0   2   2   2   0   5
W      2   0   0   0   0   0   0   0   0   1   0   1
Y      1   0   4   2   0   1   0   0   2   4   0   1
V      6   3   1   1   2  15   0   0   2  12   0  28
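A count matrix like this one becomes a scoring matrix once each count is converted to a log-odds value against a background frequency. The sketch below is one common way to do that, not necessarily the method used in the lecture: the counts are transcribed from the 12-position matrix above, while the uniform background (1/20), the single pseudocount, and the function names are assumptions.

```python
import math

# Residue counts per position, transcribed from the matrix above
# (20 amino acids x 12 positions; each column sums to 80).
COUNTS = {
    "A": [2, 1, 3, 13, 10, 12, 67, 4, 13, 9, 1, 2],
    "R": [7, 5, 8, 9, 4, 0, 1, 16, 7, 0, 1, 0],
    "N": [0, 8, 0, 1, 0, 0, 0, 2, 1, 1, 10, 0],
    "D": [0, 1, 0, 1, 13, 0, 0, 12, 1, 0, 4, 0],
    "C": [0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 2, 1],
    "Q": [1, 1, 21, 8, 10, 0, 0, 7, 6, 0, 0, 2],
    "E": [2, 0, 0, 9, 21, 0, 0, 15, 7, 3, 3, 0],
    "G": [9, 7, 1, 4, 0, 0, 8, 0, 0, 0, 46, 0],
    "H": [4, 3, 1, 1, 2, 0, 0, 2, 2, 0, 5, 0],
    "I": [10, 0, 11, 1, 2, 10, 0, 4, 9, 3, 0, 16],
    "L": [16, 1, 17, 0, 1, 31, 0, 3, 11, 24, 0, 14],
    "K": [3, 4, 5, 10, 11, 1, 1, 13, 10, 0, 5, 2],
    "M": [7, 1, 1, 0, 0, 0, 0, 0, 5, 7, 1, 8],
    "F": [4, 0, 3, 0, 0, 4, 0, 0, 0, 10, 0, 0],
    "P": [0, 6, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "S": [1, 17, 0, 8, 3, 1, 3, 0, 2, 2, 2, 0],
    "T": [5, 22, 3, 11, 1, 5, 0, 2, 2, 2, 0, 5],
    "W": [2, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
    "Y": [1, 0, 4, 2, 0, 1, 0, 0, 2, 4, 0, 1],
    "V": [6, 3, 1, 1, 2, 15, 0, 0, 2, 12, 0, 28],
}
LENGTH = 12
N_SEQS = 80
BACKGROUND = 1 / 20  # assumption: uniform amino-acid background
PSEUDOCOUNT = 1.0    # assumption: one pseudocount shared across residues


def log_odds(residue, position):
    """log2 of (smoothed column frequency / background frequency)."""
    smoothed = COUNTS[residue][position] + PSEUDOCOUNT * BACKGROUND
    return math.log2(smoothed / (N_SEQS + PSEUDOCOUNT) / BACKGROUND)


def score(peptide):
    """Sum of per-position log-odds for a peptide of exactly LENGTH residues."""
    if len(peptide) != LENGTH:
        raise ValueError(f"expected {LENGTH} residues, got {len(peptide)}")
    return sum(log_odds(aa, i) for i, aa in enumerate(peptide))


def best_window(protein):
    """Return (score, start) of the best window; needs len(protein) >= LENGTH."""
    return max(
        (score(protein[i:i + LENGTH]), i)
        for i in range(len(protein) - LENGTH + 1)
    )


# Most frequent residue in each column.
consensus = "".join(
    max(COUNTS, key=lambda aa: COUNTS[aa][i]) for i in range(LENGTH)
)
print(consensus, round(score(consensus), 2))
print(best_window("MK" + consensus + "GG"))
```

Unlike a consensus pattern, the matrix still gives partial credit to a window that differs from the most common residue at a few positions.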
Buried Treasure
Clustal Globin Alignment
Consensus Sequence From a Multiple Sequence Alignment
ClustalW Insulin Alignments
              10        20        30
IPGP                           FVSRH
IPDK                           AANQH
IPDG   MALWMRLLPLLALLALWAPAPTRAFVNQH
IPCH   MALWIRSLPLLALLVFSGPG-TSYAANQH
IPCA   MAVWIQAGALLFLLAVSSVNANAGAP-QH
IPBO                           FVNQH
IPAF  MAALWLQSFSLLVLLVVSWPGSQAVAPAQH
Consensus symbols (blank columns not shown): A . W . . L L L L A N Q H

              40        50        60
IPGP  LCGSNLVETLYSVCQDDGFFYIPKDXXELE
IPDK  LCGSHLVEALYLVCGERGFFYSPKTXXDVE
IPDG  LCGSHLVEALYLVCGERGFFYTPKARREVE
IPCH  LCGSHLVEALYLVCGERGFFYSPKARRDVE
IPCA  LCGSHLVDALYLVCGPTGFFYNPKRDVDPP
IPBO  LCGSHLVEALYLVCGERGFFYTPKARREVE
IPAF  LCGSHLVDALYLVCGDRGFFYNPKRDVDQL
Consensus symbols (blank columns not shown): L C G S H L V E A L Y L V C G E R G F F Y . P K . D V E

              70        80        90
IPGP  DPQVEQTELGMG-----LGAGGLQP--LQG
IPDK  QP-LVNGPLHGE-----VGELPFQ----HE
IPDG  DLQVRDVELAGA-----PGEGGLQPLALEG
IPCH  QP-LVSSPLRGE-----AGVLPFQ----QE
IPCA  LGFLPPKS------AQETEVADFAFKDHAE
IPBO  GPQVGALELAGG-----PGAGGLE-----G
IPAF  LGFLPPKSGGAAAAGADNEVAEFAFKDQME
Consensus symbols (blank columns not shown): P L L G G F Q E

             100       110       120
IPGP  ALQXX--GIVDQCCTGTCTRHQLQSYCN
IPDK  EYQXX--GIVEQCCENPCSLYQLENYCN
IPDG  ALQKR--GIVEQCCTSICSLYQLENYCN
IPCH  EYEKVKRGIVEQCCHNTCSLYQLENYCN
IPCA  VIRKR--GIVEQCCHKPCSIFELQNYCN
IPBO  PPQKR--GIVEQCCASVCSLYQLENYCN
IPAF  MMVKR--GIVEQCCHRPCNIFDLQNYCN
Consensus symbols (blank columns not shown): . Q K R G I V E Q C C C S L Y Q L E N Y C N
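A consensus line can be derived from an alignment column by column. The sketch below uses a simple majority rule on the seven insulin rows covering alignment columns 31 to 60; the 50% threshold, the "." no-call symbol, and the function name are assumptions, so the output need not match the consensus line shown on the slide in low-agreement columns.

```python
from collections import Counter

# Alignment columns 31-60 of the seven insulin sequences
# (IPGP, IPDK, IPDG, IPCH, IPCA, IPBO, IPAF), transcribed from the slide.
ROWS = [
    "LCGSNLVETLYSVCQDDGFFYIPKDXXELE",
    "LCGSHLVEALYLVCGERGFFYSPKTXXDVE",
    "LCGSHLVEALYLVCGERGFFYTPKARREVE",
    "LCGSHLVEALYLVCGERGFFYSPKARRDVE",
    "LCGSHLVDALYLVCGPTGFFYNPKRDVDPP",
    "LCGSHLVEALYLVCGERGFFYTPKARREVE",
    "LCGSHLVDALYLVCGDRGFFYNPKRDVDQL",
]


def consensus(rows, threshold=0.5, no_call="."):
    """Majority-rule consensus: emit the most common residue in a column
    if its frequency exceeds `threshold`, otherwise emit `no_call`."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    symbols = []
    for column in zip(*rows):
        residue, count = Counter(column).most_common(1)[0]
        agreed = residue != "-" and count / len(column) > threshold
        symbols.append(residue if agreed else no_call)
    return "".join(symbols)


print(consensus(ROWS))
print(consensus(ROWS, threshold=0.4))
```

Lowering the threshold fills in more columns at the cost of reporting residues that fewer sequences actually share.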
HMM Model of Hemoglobins: http://decypher.stanford.edu/
GrowTree VegF Neighbor Joining Tree
Human Gene Expression Signatures
T Cells Signaling
DNA Damage
Fibroblast Stimulation
B Cells Signaling
CMV Infection
Anoxia
Polio Infection
Monocytes Signaling
IL4
Hormone
Clustering Gene Expression Profiles: Comparison of Methods
D'haeseleer P (2005). Nat Biotechnol. 23,1499-501.
TAMO: Tools for the Analysis of Motifs
Finding Transcription Factor Binding Sites
Co-expressed Genes and Upstream Regions
Pho 5    GATGGCTGCACCACGTGTATGC...ACGATGTCTCGC
Pho 8    CACATCGCATCACGTGACCAGT...GACATGGACGGC
Pho 81   GCCTCGCACGTGGTGGTACAGT...AACATGACTAAA
Pho 84   TCTCGTTAGGACCATCACGTGA...ACAATGAGAGCG
Pho …    CGCTAGCCCACGTGGATCTTGA...AGAATGACTGGC
Transcription Start
Upstream Regions, Co-expressed Genes
GATGGCTGCACCACGTGTATGC...ACGATGTCTCGC
CACATCGCATCACGTGACCAGT...GACATGGACGGC
GCCTCGCACGTGGTGGTACAGT...AACATGACTAAA
TCTCGTTAGGACCATCACGTGA...ACAATGAGAGCG
CGCTAGCCCACGTGGATCTTGT...AGAATGGCCTAT
Upstream Regions, Co-expressed Genes
ATGGCTGCACCACGTTTATGC...ACGATGTCTCGC
CACATCGCATCACGTGACCAGT...GACATGGACGGC
GCCTCGCACGTGGTGGTACAGT...AACATGACTAAA
TTAGGACCATCACGTGA...ACAATGAGAGCG
CGCTAGCCCACGTTGATCTTGT...AGAATGGCCTAT
Pho4 binding
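One simple way to surface a candidate site such as the one labeled "Pho4 binding" is to look for short words that occur in every upstream region. The sketch below does exact k-mer matching on the five 5' fragments from the first version of the slide (the part of each line before the "..."). It is a toy illustration with assumed function names, not TAMO's algorithm, and it would miss degenerate sites such as the CACGTT variants shown in the last version of the slide.

```python
# 5' fragments of the upstream regions, transcribed from the slide
# (text after the "..." in each line is left out).
UPSTREAM_REGIONS = [
    "GATGGCTGCACCACGTGTATGC",
    "CACATCGCATCACGTGACCAGT",
    "GCCTCGCACGTGGTGGTACAGT",
    "TCTCGTTAGGACCATCACGTGA",
    "CGCTAGCCCACGTGGATCTTGA",
]


def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(seqs, k):
    """k-mers that occur (exactly) in every sequence."""
    return set.intersection(*(kmers(seq, k) for seq in seqs))


def occurrences(seqs, motif):
    """Start index of the first exact occurrence in each sequence (-1 if absent)."""
    return [seq.find(motif) for seq in seqs]


for k in (5, 6, 7):
    print(k, sorted(shared_kmers(UPSTREAM_REGIONS, k)))

for motif in shared_kmers(UPSTREAM_REGIONS, 6):
    print(motif, occurrences(UPSTREAM_REGIONS, motif))
```

Real motif finders relax the exact-match requirement, for example by scoring windows against a weight matrix, so that near matches in some promoters still contribute.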
Metabolic Networks: BioCyc, http://biocyc.org/
C. crescentus Cell Cycle Gene Expression
Genome Wide Associations in Rheumatoid Arthritis
Pearson, T. A. et al. JAMA 2008;299:1335-1344
Leveraging Genomic Information in Medicine
• Novel Diagnostics
– Microchips & Microarrays - DNA
– Gene Expression - RNA
– Proteomics - Protein
• Understanding Metabolism
• Understanding Disease
– Inherited Diseases - OMIM
– Infectious Diseases: Pathogenic Bacteria, Viruses
• Novel Therapeutics
– Drug Target Discovery
– Rational Drug Design
– Molecular Docking
– Gene Therapy
– Stem Cell Therapy