Computational biology, bioinformatics, and high performance computing
Craig A. Stewart
Indiana University
SC2003 Tutorial 16 November 2003
S14
License terms
• Please cite as: Stewart, C.A. 2003. Computational Biology. Tutorial presented at SC2003, 15-21 Nov, Phoenix, AZ. http://hdl.handle.net/2022/14000
• Some figures shown here are taken from the web, under an interpretation of fair
use that seemed reasonable at the time and within reasonable readings of copyright interpretations. Such diagrams are indicated here with a source url. In several cases these web sites are no longer available, so the diagrams are included here for historical value. Except where otherwise noted, by inclusion of a source url or some other note, the contents of this presentation are © by the Trustees of Indiana University. This content is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share – to copy, distribute and transmit the work and to remix – to adapt the work under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.
2
3
Table of Contents• Class Plan and Objectives 3• A rapid introduction to key elements of biology 11• Bioinformatics data sources 32• Similarity matching 48• Phylogenetics 95• RNA and Protein Structure 108• Systems Biology 126• Grand challenge problems 140• Acknowledgements & references 163
Note: Slides with the Indiana University wordmark in the bottom left corner were generated at Indiana University, with images sometimes from other sources. In such cases the url for the source of the image is indicated on the slide. Slides with a plain white background have been graciously provided by someone outside IU, and sources are attributed on such slides.
4
Class Plan & Objectives
• Class Plan & Strategy
– Materials focus on open source software (generally not the presenter's own work)
– One critical application will be covered in great depth, and several others will be reviewed
• Objectives. At the end of the class, participants should:
– understand enough biology to understand key computational biology problems
– be conversant with current key applications, and current problems facing bioinformatics and computational biology
– Be familiar with some strategies for collaborating with biologists and biomedical scientists
5
Motivation
• The “-omics” trend• Finding press pieces about huge computing problems is easy• How many bio codes really scale to hundreds of processors?• What are the coming high performance needs of biologists?• Importance of computational biology and bioinformatics to the
HPC community• The challenges and promise are real• Successes and failures so far
– Successes: Protein structure, Genome assembly, Surgical assistance, Phylogenetics
– Mismatched priorities: Ab initio protein folding– Not yet successful: Drug discovery
6
What has changed recently?
• Bioinformatics not new– Protein structure– Phylogenetics
• What is new is high-throughput sequencing:– Lots more data– The possibility of going
from a knowledge of the DNA sequence to an understanding of diseases and health
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
7
Genome Projects Timeline
• 1978 First virus (SV40) sequenced (5224 base pairs)• 1986 DOE announces Human Genome Initiative • 1994 First complete map of all human chromosomes • 1995 First living organism sequenced (H. influenzae) 2 Mb• 1996 Yeast (S. cerevisiae) - 12 Mb• 1997 Intestinal bacterium (E. coli) - 5 Mb• 1998 Nematode worm (C. elegans) - 100 Mb• 1998 Celera announcement; Public effort regroups• 1999 Human Chromosome 22 – 34 Mb• 2000 Joint announcement by NHGRI – Celera• 2003 “As good as it gets” human genome
This slide based on slide by Manfred D. Zorn
8
Definitions
• Computational Biology: any use of advanced information technology in the study of biological problems.
• “Bioinformatics applies the principles of information sciences and technologies to make the vast, diverse and complex life sciences data more understandable and useful” (NIH BISTIC Committee grants1.nih.gov/grants/bistic/CompuBioDef.pdf)
• Genomics – study of genomes and gene function• Proteomics – study of proteins and protein function• ___omics –
9
Challenges
• Different types of biological data at different scales• Data of varying quality• Much of the underlying biology is not well understood• Prior to the availability of high-throughput sequencing,
scientists could only study small pieces of the genetic information of any organism.
• Now the entire genome of several organisms has been completed, but knowing the genome is different than knowing how it works!
10
Comparison of Complexity• Physics & Chemistry
– 2 elementary particles– 4 forces– 112 elements– When random events occur
it is often possible to study average behavior
– Typically ahistoric (astrophysics an exception)
• Biology– 3B base pairs in humans– Min. 30,000 genes in
humans– ~1.5M species– Individual random events
important; no law of large numbers
– Intensely historic, heavily contingent
11
Complexity, Con't• Chip design
– All components known– Device physics for
individual components known
– Itanium has 3 x 10^8 connections and 2 x 10^8 devices
– Unified basic currency (electrons)
– Computer program required to understand
• Cells– Components not known– Function of individual
components not known– # components ~10^13– No unified basic currency– Ecell, Karyote, etc.
attempting to model cells
12
A rapid introduction to key elements of biology
13
Why is it important to know some biology?
• Would you study numerical methods without knowing some mathematics?
• Much current biological knowledge is very specific to particular organisms, genes, or diseases
• If you just wade into the available data online you can do some very silly things.
Anopheles gambiae
From www.sciencemag.org/feature/data/mosquito/mtm/index.html
Source Library: Centers for Disease Control. Photo Credit: Jim Gathany
14
Central dogma of biology
• The central dogma of biology (first put forward by Crick in 1958) is that genes act to create phenotypes through a flow of information from DNA to RNA to proteins, to interactions among proteins (regulatory circuits and metabolic pathways), and ultimately to phenotypes. Collections of individual phenotypes constitute a population.
http://www.ncbi.nlm.nih.gov/About/primer/genetics_cell.html
15
Cell Structure
Eukaryotes
• Chromosomes linear
• Introns, exons, postprocessing
• Nucleus & nuclear wall
• Mitochondria and (in plants) chloroplasts
http://www.ncbi.nlm.nih.gov/About/primer/genetics_cell.html
Prokaryotes
• Chromosome circular
• Location is everything
• No nucleus
• No plastids
16
Four (or Five) Bases
• DNA consists of four nucleotides: Cytosine, Thymine, Adenine, and Guanine.
• In the double helix, A&T are always bound, and C&G are always bound to each other
• RNA consists of four nucleotides as well: Cytosine, Uracil, Adenine, and Guanine
• RNA may loop back on itself but it does not form a double helix
http://www.ornl.gov/TechResources/Human_Genome/graphics/slides/images/structur.gif
17
http://www.ornl.gov/TechResources/Human_Genome/graphics/slides/images/98-647.jpg
18
Genetic Code
Ala Alanine
Arg Arginine
Asn Asparagine
Asp Aspartic acid
Cys Cysteine
Glu Glutamic acid
Gln Glutamine
Gly Glycine
His Histidine
Ile Isoleucine
Leu Leucine
Lys Lysine
Met Methionine
Phe Phenylalanine
Pro Proline
Ser Serine
Thr Threonine
Trp Tryptophan
Tyr Tyrosine
Val Valine
http://www.ncbi.nlm.nih.gov/Class/MLACourse/Original8Hour/Genetics/geneticcode.html
19
Transcribing DNA to RNA and Translating RNA to Proteins
DNA AAAAAGGAGCAAATT
RNA UUUUUCCUCGUUUAA
One possible amino acid string (reading from the first base): Phe Phe Leu Val (stop)
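The two steps on this slide can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, the codon table is a deliberately tiny subset of the standard genetic code (just the codons that occur in this RNA string), and complementing base by base ignores strand direction, as the slide does. Reading the RNA string from its first base gives Phe Phe Leu Val followed by a stop codon; the other two reading frames give different strings, which is why the slide says "one possible".

```python
# Subset of the standard genetic code: only the codons that occur in the
# slide's RNA string in its three forward reading frames.
CODON_TABLE = {
    "UUU": "Phe", "UUC": "Phe", "CUC": "Leu", "GUU": "Val", "UAA": "Stop",
    "UCC": "Ser", "UCG": "Ser", "CCU": "Pro", "CGU": "Arg", "UUA": "Leu",
}

# DNA template base -> complementary RNA base.
COMPLEMENT = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_dna):
    """Complement a DNA template base by base into RNA (direction ignored)."""
    return "".join(COMPLEMENT[base] for base in template_dna)


def translate(rna, frame=0):
    """Split the RNA into codons starting at offset `frame` and look each up."""
    codons = [rna[i:i + 3] for i in range(frame, len(rna) - 2, 3)]
    return [CODON_TABLE[codon] for codon in codons]


rna = transcribe("AAAAAGGAGCAAATT")
for frame in range(3):
    print(frame, rna, translate(rna, frame))
```

Running all three frames side by side is a quick way to see how much the choice of reading frame matters.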
20
Human Chromosomes
http://www.ncbi.nlm.nih.gov/Class/MLACourse/Original8Hour/Genetics/cytogenetic.html
http://www.ornl.gov/TechResources/Human_Genome/graphics/slides/elsikaryotype.html
21
Sickle Cell
Normal RBC
• GAG codes for glutamic acid
• disc-shaped, soft
• easily flow through small blood vessels
• lives for 120 days
Sickle RBC
• GTG codes for valine
• sickle-shaped, hard
• often get stuck in small blood vessels
• lives for 20 days or less
Malaria vs. Anemia!
http://www.nlm.nih.gov/medlineplus/ency/imagepages/1223.htm
22
What is a Gene?
• An inheritable trait associated with a region of DNA that codes for a polypeptide chain or specifies an RNA molecule which in turn has an influence on some characteristic phenotype of the organism.– Early views: genes lined up on the chromosome like beads
on a string; one gene => one protein– Examples of genes: color blindness, sickle-cell anemia– Mendelian genes, Sex-linked genes, Quantitative traits
• Annotation: Extraction, definition, and interpretation of features on the genome sequence
• Annotations vs. genes: – Many annotations describe features that constitute a gene.– Others may not always directly correspond in this way– An annotation is what we think… nature may disagree!
• Inheritance problem with annotations
23
Gene Components
• Prokaryotes
– Location is everything
– Essentially all of the DNA is transcribed (few mitochondrial diseases)
• Eukaryotes
– Non-contiguous pieces of DNA may comprise one gene
– Start sequence (complicated and long)
– Stop codons – end translation
– Exons – portions of sequence that are transcribed and retained in the mature mRNA
– Introns – portions of sequence that are transcribed but spliced out and not used
• Genes and Chromosomes– In eukaryotes, an organism has two of each chromosome (in pairs).– Among sexually reproducing organisms, one chromosome comes from
each parent– In “simple Mendelian genes” there are two alleles for each gene – one
on each chromosome (e.g. wrinkly)
24
Alternate splicing
http://www.blc.arizona.edu/marty/411/Modules/altsplice.html
25
A (very) little about evolutionary genetics
Parents: Ww x Ww
Offspring: WW, Ww, Ww, ww
Based on this, can you explain why the gene for Sickle Cell Anemia persists in populations of people in Africa?
Hardy-Weinberg Law
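A small sketch of the arithmetic behind this slide, assuming a single gene with two alleles W and w. The function names are invented for this example. `cross` enumerates the offspring of two parents as in a Punnett square; `hardy_weinberg` gives the expected genotype frequencies p^2, 2pq and q^2 for allele frequency p of W (with q = 1 - p), under the usual Hardy-Weinberg assumptions of random mating and no selection, mutation, or migration.

```python
from collections import Counter
from itertools import product


def cross(parent1, parent2):
    """Count offspring genotypes from one allele of each parent (Punnett square)."""
    offspring = Counter()
    for allele1, allele2 in product(parent1, parent2):
        # Sort so that "wW" and "Ww" are counted as the same genotype.
        offspring["".join(sorted(allele1 + allele2))] += 1
    return dict(offspring)


def hardy_weinberg(p):
    """Expected genotype frequencies for allele frequency p of W."""
    q = 1.0 - p
    return {"WW": p * p, "Ww": 2 * p * q, "ww": q * q}


print(cross("Ww", "Ww"))      # heterozygous parents
print(hardy_weinberg(0.5))    # both alleles equally common
```

With both alleles at frequency 0.5 the expected frequencies match the 1:2:1 ratio of the cross between two heterozygotes.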
26
Population genetics & evolution
• Mutations create the raw material for evolution
• Natural selection and chance affect the frequency with which particular genes or DNA sequences are present in populations
• Given enough time and enough change, evolution, speciation, and so forth happen
• Genes can be ‘fixed’ or ‘maintained in an equilibrium’ in a population by chance or through natural selection
http://faculty.wm.edu/bsgran/
27
How do sequences differ?
• Differences in individual bases
• Bases may be added to a sequence
• Bases may be deleted from a sequence
CGTACCGTTAATAT
CGTACCGATAATAT

CGTACCCCGTAATAT
CGTACC . .GTAATAT

CGTACCGTTAATAT
CGTACCG . . .ATAT
28
Random genetic change
• “Things happen”
• Molecular clock
– Theory: ~2% change per million years (2 x 10^-9 substitutions per base location per year)
– Practice: a rule of thumb is different than something like Newton’s 2nd law of motion
• Random change may often be responsible for speciation – e.g. two populations of birds, separated by a geographic barrier, may at random eventually develop into two different species
29
Key points (so far)
• Biological processes are complicated; the historicity and complexity of biological processes and our lack of understanding of many matters make biology an interesting topic!
• The fundamental dogma of molecular biology is that genes act to create phenotypes through a flow of information from DNA to RNA to proteins, to interactions among proteins (regulatory circuits and metabolic pathways), and ultimately to phenotypes. Collections of individual phenotypes constitute a population.
• DNA consists of four bases (A, T, C, G). A is always paired with T; C is always paired with G.
• DNA is transcribed into RNA. RNA consists of four bases as well (A, U, C, G).
• The linear structure of DNA is transcribed into RNA and then translated into proteins. Proteins depend on their 3D configuration as the basis for their function.
30
DNA sequencing
Send in the clones!• DNA chopped into
blocks• Blocks inserted into
bacterial cells using viruses
• The bacterial clones make lots of copies of DNA so that you have something to work with
• The sequence of each chunk of genetic material is determined using gel electrophoresis
31
Dye-terminator Sequencing
www.ornl.gov/TechResources/Human_Genome/graphics/slides/images/standardRGB200.jpg
• Cut DNA at various places (at T, G, C, A)
• Add a radioactive molecule at the end of the DNA chain
• Find out how long the chain is by gel electrophoresis
• Read off the sequence
Sanger
www.ornl.gov/TechResources/Human_Genome/publicat/primer/
32
Sequence assembly
• Phred – base calling• Phrap – shotgun sequence assembly• Consed – finishing• http://www.phrap.org/• High quality software
33
Bioinformatics data sources
34
Bioinformatics Data Sources
• There are many• Characteristics vary• There are many ways to organize view of the biological data• A pragmatic approach:
– Biomedical literature sources– Structured vocabularies– DNA, RNA, Protein etc. data sources
35
Biomedical literature• Abstracts of biomedical lit.
largely available online• Text processing itself is an
interesting problem• U.S. National Library of
Medicine – NLM Medline http://www.nlm.nih.gov/
• ~12 million references on life sciences/biomedicine.
• Covers 1966 to present.• Citations from over 4,600
journals; most published in English
36
PubMed
• Standard search tool for Medline
• http://www.ncbi.nlm.nih.gov/entrez/
• Useful limit terms:– Gender– Age Groups– Human or Animal– Publication Date
• You can save queries
37
Structured Languages
• NLP or write with agreed-upon terms?• Three important structured languages:
– MeSH– GO (Gene Ontology)– LOINC
38
MeSH
• Medical Subject Heading• http://www.nlm.nih.gov/
mesh/MBrowser.html• ~17,000 Thesaurus Terms• Typically 10-15 used per
article in MedLine; 3-4 as major points (indicated with * in PubMed)
• When done right…. the terms used are the most specific possible
• There are both advantages and disadvantages!
39
GO (Gene Ontology)
• http://www.geneontology.org/• “The goal of the Gene OntologyTM Consortium is to produce
a dynamic controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing.”
• Based on xml file format• Several browsers (AmiGO, QuickGO, MGO)• Directed Acyclic Graph (child may have multiple parents)
– ISA (is a) %– Part of <
• Three ontologies– Molecular function– Biological processes– Cellular components
40
Genomic, Proteomic, etc. data sources
• A tremendous amount of data is available through public data sources via the Web, ftp, or by other means.
• To analyze biological data, we first have to get it…. • Several ways to organize presentation of material – by site, by
type, etc. We will organize by data type.• Types of Databases:
– Chromosomal (http://www.ncbi.nlm.nih.gov/mapview)– DNA/Genes– Protein– Biochemistry and metabolic pathways– Microarray– Web collections
41
Types of genomic data
• Genomic DNA: DNA sequences, typically complete with coding and noncoding sequences
• GSS: Genome survey sequence. Single pass sequence read directly from robot.
• mRNA: an RNA sequence from an mRNA molecule. May or may not cover all of a particular gene
• cDNA: complementary DNA – a DNA sequence generated by conversion of an mRNA sequence
• EST: Expressed Sequence Tag – short cDNA sequences from studies of cells under particular circumstances. Typically incomplete.
• SNP – Single Nucleotide Polymorphism
42
DNA databases
• GenBank. Operated by NCBI (National Center for Biotechnology Information). http://www.ncbi.nlm.nih.gov
• European Molecular Biology Laboratory – Nucleotide Sequence Database. http://www.ebi.ac.uk/genomes/
• DNA Database of Japan (DDBJ). http://www.ddbj.nig.ac.jp• All share data daily. Update conflicts avoided by policy. • Differences are in “value added” and interfaces
43
http://www.ncbi.nlm.nih.gov
44
Data Structures
• Current– Primary DNA repository data based on ASN.1. Makes
possible linkages among many types of biomedical info.– The software libraries now often handle XML as well– Software toolkits and docs available at
http://www.ncbi.nlm.nih.gov/IEB/• Genbank Flat File format
– http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html• FASTA
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
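FASTA is simple enough to read by hand: a header line starting with ">" followed by one or more lines of sequence. The sketch below is a minimal reader written for this example (`parse_fasta` is a made-up name, not an NCBI library call), applied to the record shown on this slide. Real files deserve a tested parser from an established toolkit.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
"""

for header, sequence in parse_fasta(example):
    print(header, len(sequence))
```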
45
Primary vs. Secondary Data sources
• Primary data sources:
– Genetic sequences in NCBI, EMBL, DDBJ
– Protein sequences in PDB
• Secondary data sources:– Inferred protein sequences (what do we know already about
issues here?)– Curated data sources
46
Protein Structure• NCBI (of course…)• Swiss-Prot/TrEMBL at http://www.expasy.org/
– Note: 125,744 chemically determined vs 861,482 inferred from automated translation of DNA sequences!!!!!
• Protein Data Bank – PDB http://www.rcsb.org/pdb/ - one of the first online bioinformatics databases!
47
Biochemistry and pathways
• Biochemistry– ENZYME (part of the ExPASy system)– BIND (part of the NCBI system)
• Pathways– PathDB http://www.ncgr.org/software/version_2_0.html– Kegg WIT http://wit.mcs.anl.gov/WIT2/
48
Web Resources - General• NCBI
http://www.ncbi.nlm.nih.gov/• EBI Biocatalog
http://www.ebi.ac.uk/biocat/• IUBio Archive
http://iubio.bio.indiana.edu
http://www.ncbi.nlm.nih.gov/
49
Similarity matching
50
Why pattern matching (and what are the problems)
Bonobo
http://www.sandiegozoo.org/special/zoo-featured/pygmy_chimps.html
and… US!
51
Problems!
• For proteins, 95% similarity is ~ identical, 80% similarity is a lot. Even less similarity than that needed for DNA
• Database techniques inadequate – they are too precise!• Datasets very large to search• Homology
• Common ancestry • Sequence (and usually structure) conservation • Homology is inferred rather than measured
• Identity• Objective and well defined • Can be quantified easily, but not very useful!
• Similarity• Most common method used, but not as easily defined
52
Alignment
• An alignment is an arrangement of two sequences opposite one another
• It shows where they are different and where they are similar • We want to find the optimal alignment - the most similarity
and the least differences• Alignments have two aspects:
– Quantity: To what degree are the sequences similar (percentage, other scoring method)
– Quality: Regions of similarity in a given sequence
53
Alignment
• Methods:– dynamic programming – Hidden Markov Models– Pattern matching
• Key problem: keeping the calculation time manageable• Some alignment packages:
– BLAST (http://www.ncbi.nlm.nih.gov/BLAST/)– FASTA (http://gcg.nhri.org.tw/fasta.html)
54
Scoring Alignments
GCTAAATTC
++ x x
GC AAGTT
• Matches are good: they get a positive value
• Mismatches are bad: they get a negative value
• Gaps are bad: they get a negative value
– Gap opening penalty
– Gap extension penalty
– Score = Matches – Mismatches – ∑{gap opening penalty + (gap length) x (gap extension penalty)}, summed over all gaps

CGTACCGTTAATAT
CGTACCG . . .ATAT

CGTACCGTTAATAT
CGT. C . GTT .ATAT
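The scoring rule on this slide can be tried out directly. In the sketch below the numeric values (match 1, mismatch 1, gap opening 2, gap extension 1) are arbitrary assumptions chosen for illustration, and `score_alignment` is a made-up name. Each run of gap characters costs the opening penalty once plus the extension penalty for every gapped column.

```python
def score_alignment(row_a, row_b, match=1, mismatch=1,
                    gap_open=2, gap_extend=1, gap_char="."):
    """Score two equal-length alignment rows with an affine gap penalty.

    Simplification: a gap in one row directly followed by a gap in the other
    row is treated as a single gap.
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    score = 0
    in_gap = False
    for a, b in zip(row_a, row_b):
        if a == gap_char or b == gap_char:
            # The first column of a gap pays opening plus extension.
            score -= gap_extend if in_gap else gap_open + gap_extend
            in_gap = True
        else:
            score += match if a == b else -mismatch
            in_gap = False
    return score


# One gap of length 3 and eleven matching columns.
print(score_alignment("CGTACCGTTAATAT", "CGTACCG...ATAT"))
```

With these assumed values the single three-base gap costs 2 + 3 x 1 = 5, so the eleven matches give a total of 6. Splitting the same three gap columns into separate gaps would pay the opening penalty each time, which is why the second alignment on the slide scores worse.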
55
Now what?
• Taking a sequence and simply comparing it against all existing sequences in a database in all possible ways approaches O(N!) if you do it badly enough. Plus it would be silly.
• So: many algorithms possible• Algorithms are in some ways the same, and in some ways
different, between DNA and proteins.• We’ll start with DNA, and not do things in historical order
56
Dotter• Simple way to get a feel for how
sequences compare to each other.• Used both with DNA and Protein
sequences• http://www.cgr.ki.se/cgr/groups/
sonnhammer/Dotter.html/• "A dot-matrix program with
dynamic threshold control suited for genomic DNA and protein sequence analysis" Erik L.L. Sonnhammer and Richard Durbin Gene 167(2):GC1-10 (1995)
• And now (hopefully) a live demo• Modular nature of proteins
57
Local Alignments with BLAST
• Basic Local Alignment Search Tool
• We’ll spend a LOT of time with BLAST
• First a quick demo (hopefully)
• http://www.ncbi.nlm.nih.gov/BLAST
• So, what did we do?
– BLAST – Basic Local Alignment Search Tool
– In particular, BLASTn (for nucleotides)
– Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. Journal of Molecular Biology 215:403-410
58
(Original) BLAST Algorithm
• Original algorithm does not permit gaps• The original BLAST algorithm is a local (heuristic) alignment
tool• Given a search sequence, e.g. ACGTAGGCATGAA• BLAST first makes a list of all “words” of a given length that
would possibly have a score of at least T against the search string.
• In the case of this example there would be (at least) the following:– ACGTAGGCATG– CGTAGGCATGA– GTAGGCATGAA
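The word list for this example can be generated mechanically. The sketch below (with the invented name `query_words`) lists only the exact subwords of the query. For protein searches BLAST also adds "neighborhood" words that score at least T against the query, which is not shown here.

```python
def query_words(query, word_size=11):
    """Return every contiguous subword of length word_size in the query."""
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]


for word in query_words("ACGTAGGCATGAA"):
    print(word)
```

A query of length 13 with word size 11 yields 13 - 11 + 1 = 3 words, the three listed on the slide.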
59
(Original) BLAST Algorithm, 2
• BLAST takes the list of all words with a score of at least T against the string one is trying to match…. and then searches a database for any matches to these words. So if one were using the example and the NR database, BLAST would search NR for all occurrences of the words:– ACGTAGGCATG– CGTAGGCATGA– GTAGGCATGAA
• Suppose BLAST finds in the NR database an exact match to – ACGTAGGCATG
• BLAST then attempts to extend the match in both directions– ACGTAGGCATGA– ACGTAGGCATGA
• So now we have an exact match of 12 letters
60
(Original) BLAST algorithm,3
• So BLAST keeps going, and in this case would stop at an exact match of 13 letters (if one existed), since 13 letters was the entire initial search string:– ACGTAGGCATGAA– ACGTAGGCATGAA
• BLAST has a stopping algorithm for dropping particular search directions, or stopping altogether
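The seed-and-extend idea of the last three slides can be sketched as follows. This is a simplified illustration, not the BLAST source: the function names are invented, the database is a single string, and extension stops at the first mismatch rather than using a score-based stopping rule.

```python
def find_seeds(query, subject, word_size=11):
    """Return (query_pos, subject_pos) for every exact word match."""
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)
    seeds = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            seeds.append((i, j))
    return seeds


def extend_exact(query, subject, i, j, word_size):
    """Grow an exact seed in both directions until the first mismatch."""
    left = 0
    while (i - left - 1 >= 0 and j - left - 1 >= 0
           and query[i - left - 1] == subject[j - left - 1]):
        left += 1
    right = 0
    while (i + word_size + right < len(query)
           and j + word_size + right < len(subject)
           and query[i + word_size + right] == subject[j + word_size + right]):
        right += 1
    # Start in the query, start in the subject, and total matched length.
    return i - left, j - left, word_size + left + right


query = "ACGTAGGCATGAA"
subject = "TTTACGTAGGCATGAACCC"   # hypothetical database sequence
i, j = find_seeds(query, subject)[0]
print(extend_exact(query, subject, i, j, 11))
```

Here the first seed (the 11-letter word ACGTAGGCATG) extends to the full 13-letter query, as in the slide's example.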
61
Scoring of DNA
     A   C   G   T   R   Y   M   W   S   K   D   H   V   B   N
A    4
C   -3   4
G   -3  -3   4
T   -3  -3  -3   4
R    1  -1   1  -1   1
Y   -1   1  -1   1  -3   1
M    1   1  -2  -2   0   0   1
W    1  -2  -2   1   0   0   0   1
S   -2   1   1  -2   0   0   0   0   1
K   -2  -2   1   1   0   0   0   0   0   1
D    1  -2   1   1   1   0   0   1   0   1   1
H    1   1  -2   1   0   1   1   1   0   0   0   1
V    1   1   1  -2   1   0   1   0   1   0   0   0   1
B   -2   1   1   1   0   1   0   0   1   1   0   0   0   1
N    1   1   1   1   0   0   0   0   0   0   0   0   0   0   1
62
BLAST algorithm in more detail
• The BLAST algorithm searches for MSPs (Maximal Segment Pairs): pairs of segments whose score cannot be improved either by lengthening or by shortening them. “Pairs” here refers to a string, or a substring, of the initial search string, and one or more strings or substrings found in a database.
• The search starts with the creation of all possible subwords of a given length (default typically 11 for DNA sequences, 3 amino acids for protein sequences) that would score at least T when matched against the original search string. (T is short for Threshold.)
• BLAST then goes through the database being searched, looking for any occurrence of each of these words. Each occurrence is a “hit”, the seed of a “High Scoring Pair (HSP)”.
• The search then continues by trying to extend these hits.
• Suppose “S” is the best score found so far while extending a hit. BLAST stops trying to extend it when the score drops a certain amount below that best value S.
• BLAST continues until it is no longer possible to improve the score of HSPs by making them longer.
• Then it generates a list of the best HSPs. Default is a cutoff E-value of 10.
• BLAST (original) has an infinite gap penalty, i.e. gaps are not permitted.
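The stopping rule described above (give up once the running score falls a fixed amount below the best score seen) can be sketched for an ungapped extension to the right. The match and mismatch values of 4 and -3 are taken from the A/C/G/T corner of the DNA scoring matrix shown earlier. The drop-off value of 10 and the function name are assumptions for this example only.

```python
def extend_right_xdrop(query, subject, i, j, match=4, mismatch=-3, x_drop=10):
    """Extend an ungapped alignment rightwards from query[i] and subject[j].

    Returns (best_score, best_length): the highest running score seen and the
    number of aligned columns at which it was reached.
    """
    score = best = 0
    best_len = 0
    k = 0
    while i + k < len(query) and j + k < len(subject):
        score += match if query[i + k] == subject[j + k] else mismatch
        k += 1
        if score > best:
            best, best_len = score, k
        elif best - score > x_drop:
            break  # score fell too far below the best: stop extending
    return best, best_len


print(extend_right_xdrop("ACGTACGTTTTT", "ACGTACGTAAAA", 0, 0))
```

In this toy case the eight matching bases score 32, and the extension is abandoned once four mismatches have pulled the running score 12 below that best value.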
63
BLAST Statistics
• BLAST reports E values rather than P values, but it turns out that when E < 0.01, E~P
• What do we do about the fact that we have done many tests?• If the sequence is length n, and the total length of the database being
searched is N, then a reasonable approach is to multiply E by N/n• Edge effects – statistics tend to be conservative for short sequences• Problems:
– Highly repetitive segments– Low complexity regions– Bias in composition
• Solution: low complexity regions can be excluded
64
BLAST Options
• Set subsequence (of the submitted sequence)• Choose Database (NB: nr ≠ non redundant!)• Limit by entrez query or select an organism• Choose Filter• Expect Value• Word size (default = 11 for nucleotides)
65
Protein Sequence Alignment
• What most people do most of the time• DNA sequences are useful for relationships that are close, but
DNA sequences are not nearly as well conserved as Amino Acid sequences
• Now we need to talk about the characteristics of Amino Acids and ways to compare what is similar and what is not!
• Amino acids can have similar chemical properties, and similar functions as part of a protein, without being identical!
66
Point Accepted Mutations (PAM)• For scoring amino acid sequence
alignments• Dayhoff, M.O., Schwartz, R.M., Orcutt,
B.C. 1978. "A model of evolutionary change in proteins." In Atlas of Protein Sequence and Structure 5(3) M.O. Dayhoff (ed.), 345 - 352, National Biomedical Research Foundation, Washington.
• PAM N corresponds to N accepted point mutations per 100 amino acids. N can be greater than 100, because the same position can change more than once.
• PAM 250 is most commonly used; PAM 100 is also used. PAM 250 => chains with ~20% identity
• PAM matrix calculator at www.cmbi.kun.nl/bioinf/tools/pam.shtml
http://www.psc.edu/biomed/training/tutorials/sequence/db/index.html
67
BLOSUM Matrices
• Henikoff and Henikoff (1992) Proc Natl Acad Sci 89(22):10915-9
• Based on analysis of the BLOCKS database (http://www.blocks.fhcrc.org/)
• BLOSUM = BLOcks SUbstitution Matrix
• Based on analysis of conserved and variable regions of proteins
• Naming convention is different than for PAM matrices.
• BLOSUMxy is based on likelihood ratios derived from blocks of aligned amino acid sequences that are no more than xy% identical
• BLOSUM62 is the ‘typical default’• PAM250 is roughly equivalent to BLOSUM45
68
PSI BLAST
• Position Specific Iterative BLAST• http://nar.oupjournals.org/cgi/content/full/25/17/3389• Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z,
Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997 Sep 1;25(17):3389-402
• Requires two non-overlapping hits against the search term within a certain distance (A) of one another before an extension is attempted
• Permits gaps in the alignments
• Can be iterated: each round builds a position-specific scoring matrix from the hits of the previous round
• By default, uses the BLOSUM-62 matrix
69
PSI BLAST
• In the original BLAST, the step of extending the length of the ‘hits’ took ~90% of execution time.
• The initial threshold value T must be lower than with the original BLAST, but far fewer hits are pursued, meaning that the extension time is lower
http://nar.oupjournals.org/content/vol25/issue17/images/gka56202.gif
Two hits, T=11 A=40 vs One hit, T=13
70
http://nar.oupjournals.org/content/vol25/issue17/images/gka56201.gif
71
Gaps in PSI-Blast
• PSI BLAST seeks alignments with single gaps
• Gaps are sought only when a two-hit score exceeds the value Sg
• Gaps: handled by using a different gap cost function: -(a + bk + cj)
– a is the cost for opening a gap
– b is the per-unit cost for the length of the gap
– k is the length of the gap
– c is the per-unit cost of unaligned sequences in the gap
– j is the number of sequences left unaligned
72
Discontiguous MegaBLAST
• Useful especially for identifying diverged DNA sequences• Uses templates; within the template only those items with “1”s
are compared.• E.g. 1101101101101101
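A template comparison of this kind is easy to sketch: positions marked "1" must agree, positions marked "0" are ignored. The function name below is invented for this example. With the template from the slide every third position is ignored, which tolerates changes at the third base of a codon.

```python
def template_match(template, a, b):
    """True if a and b agree at every position where the template has a '1'."""
    if not (len(template) == len(a) == len(b)):
        raise ValueError("template and both words must have the same length")
    return all(x == y for t, x, y in zip(template, a, b) if t == "1")


template = "1101101101101101"
word_a = "ACGTACGTACGTACGT"
# Same word with every ignored ("0") position replaced by N.
word_b = "".join(x if t == "1" else "N" for t, x in zip(template, word_a))

print(word_b)
print(template_match(template, word_a, word_b))
```

The two words differ at five of sixteen positions yet still count as a hit, which is what makes such templates useful for diverged sequences.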
How many BLASTs?
http://www.ncbi.nlm.nih.gov/BLAST/producttable.html
73
mpiBLAST http://mpiblast.lanl.gov/
74
mpiBLAST Algorithm
• Darling, A.E., L. Carey, W.-C. Feng. 2003. The design, implementation, and evaluation of mpiBLAST. Presented at ClusterWorld2003. http://www.cs.wisc.edu/%7Edarling/mpiblast-cwce2003.pdf
• Algorithm– Database is segmented. Portions of database are placed on data
storage devices on multiple nodes in a HPC system. mpiformatdb is a wrapper for the BLAST formatdb program. Number of subdivisions specified by user
– Foreman/worker algorithm. Portions of the database are assigned to workers, using a greedy algorithm
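The slide does not say which greedy rule mpiBLAST uses, so the following is only one plausible sketch of greedy assignment, not the mpiBLAST implementation: hand the largest remaining database fragment to whichever worker currently has the least work. The function name and fragment sizes are hypothetical.

```python
import heapq


def greedy_assign(fragment_sizes, n_workers):
    """Assign fragments (largest first) to the currently least-loaded worker."""
    heap = [(0, worker) for worker in range(n_workers)]  # (load, worker id)
    heapq.heapify(heap)
    assignment = {worker: [] for worker in range(n_workers)}
    by_size = sorted(enumerate(fragment_sizes), key=lambda item: -item[1])
    for fragment_id, size in by_size:
        load, worker = heapq.heappop(heap)   # least-loaded worker so far
        assignment[worker].append(fragment_id)
        heapq.heappush(heap, (load + size, worker))
    return assignment


sizes = [5, 3, 3, 2, 2, 1]          # hypothetical fragment sizes
plan = greedy_assign(sizes, 2)
for worker, fragments in plan.items():
    print(worker, fragments, sum(sizes[f] for f in fragments))
```

A rule like this keeps the workers roughly balanced, which matters because the results cannot be assembled until the slowest worker has finished.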
75
mpiBLAST performance
• Scaling can be superlinear when pieces are small enough that they fit into memory
• Scalability limitations due to communication, implicit barrier before assembly of results
• If pieces of data distributed out to workers are larger than available RAM, then scaling is still good but not superlinear
• Blast is the most heavily used bioinformatics tool in existence. Parallelization of BLAST has huge payoff for practicing biologists
76
Motivation: BLAST with Low Memory
• Standard BLAST running on a system with 128 MB of memory.
Slide courtesy of Wu-chun Feng, Los Alamos National Laboratory
77
mpiBLAST: Low-Memory Performance
• Environment– 1, 2, or 4 nodes.– Each node w/ dual
550-MHz CPUs and 128-MB memory.
– Same query and database used.
• Conclusions– blastn is I/O bound.
Superlinear speed-up possible.
– tblastx is CPU bound.
Slide courtesy of Wu-chun Feng, Los Alamos National Laboratory
78
mpiBLAST on Green Destiny
BLAST Run Time for 300-kB Query against nt
Nodes Runtime (s) Speedup over 1 node
1 80774.93 1.00
4 8751.97 9.23
8 4547.83 17.76
16 2436.60 33.15
32 1349.92 59.84
64 850.75 94.95
128 473.79 170.49
The Bottom Line: mpiBLAST reduces search time from 1346 minutes (or 22.4 hours) to under 8 minutes!
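The speedup column can be recomputed from the runtimes, and dividing by the node count gives the parallel efficiency. The short sketch below uses the runtimes from the table above; the function name is made up. An efficiency above 1.0 means superlinear speedup.

```python
# Runtimes in seconds, copied from the table above (nodes -> runtime).
RUNTIMES = {
    1: 80774.93, 4: 8751.97, 8: 4547.83, 16: 2436.60,
    32: 1349.92, 64: 850.75, 128: 473.79,
}


def speedup_table(runtimes):
    """Return {nodes: (speedup over 1 node, efficiency = speedup / nodes)}."""
    base = runtimes[1]
    return {n: (base / t, base / t / n) for n, t in runtimes.items()}


for nodes, (speedup, efficiency) in speedup_table(RUNTIMES).items():
    print(f"{nodes:4d} nodes  speedup {speedup:7.2f}  efficiency {efficiency:5.2f}")
```

Every multi-node row has an efficiency greater than 1, consistent with the earlier point that the database fragments fit in memory once they are small enough.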
Slide courtesy of Wu-chun Feng, Los Alamos National Laboratory
79
Global Alignments: Needleman-Wunsch Algorithm
• Start at the beginning, end at the end
• Needleman, S.B., and C.D. Wunsch. 1970. A general method
applicable to the search for similarities in the amino acid sequences of two proteins. J. Mol. Bio. 48: 443-453.
• “The amino acid sequences of a number of proteins have been compared to determine whether the relationships existing between them could have occurred by chance. Generally, these sequences are from proteins having closely related functions and are so similar that simple visual comparisons can reveal sequence coincidence….”
80
Needleman-Wunsch
• Amino acid sequences are lined up as column and row headers for a matrix
• Ai is the ith amino acid in protein A
• Bj is the jth amino acid in protein B
• Start with a matrix where the entry for Ai and Bj is 1 if there is a match, 0 otherwise
• The optimal alignment can be represented as a path through the matrix
• If MATmn is part of a pathway including MATij, the only permissible relationships are m>i and n>j, or m<i and n<j
• The optimal pathway is found by filling out the matrix from the bottom right corner towards the upper left, where in each cell you insert the maximum score arising from an alignment that includes this cell in the matrix
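A compact sketch of global alignment by dynamic programming follows. It is the textbook formulation rather than a transcription of the 1970 paper: the matrix is filled from the top left instead of the bottom right (the result is equivalent), and gaps use a single per-column score. The defaults (match 1, mismatch 0, gap 0) mirror the simple 1/0 scoring on this slide. The function name and parameters are choices made for this example.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=0):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom right corner to recover one optimal path.
    i, j = n, m
    row_a, row_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                row_a.append(a[i - 1])
                row_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


result = needleman_wunsch("CGTACCGTTAATAT", "CGTACCGATAT")
print(result[0])
print(result[1])
print(result[2])
```

With the 1/0/0 scoring the score simply counts matched positions. Passing negative values for `mismatch` and `gap` gives the more usual penalized version.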
81
Needleman-Wunsch and Smith-Waterman
• Shortcomings of Needleman-Wunsch?
• Can you think of biological situations in which you might want to use Needleman-Wunsch?
• Smith-Waterman: similar to Needleman-Wunsch, except
– Requires a penalty for gaps
– Will do partial (local) alignments (i.e. has a stopping point)
• Computational requirements
– Original Needleman-Wunsch and Smith-Waterman both require O(N*M) time and O(N*M) memory
– There are improvements of Smith-Waterman that require O(N*M) time and O(N) space
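The reduced-memory point can be illustrated with a score-only Smith-Waterman that keeps just two rows of the matrix. This sketch returns only the best local score. Recovering the alignment itself in linear space needs additional machinery that is not shown. The scoring values and the function name are assumptions for this example.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score using O(len(b)) memory."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere: this is what makes it local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr  # only the previous row is ever needed
    return best


# The shared core ACGTACGT is flanked by letters that never match each other.
print(smith_waterman_score("NNNNACGTACGTNNNN", "RRACGTACGTRR"))
```

The eight shared letters score 16 with these values, and the unrelated flanks are simply left out of the alignment, which a global method could not do.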
82
ALIGN
• Simple protein alignment tool• Included in FASTA distributions 2.x, but not 3.x• Still, it’s a nice learning tool• Can be downloaded for Linux or for Windows• Can also be run from web at
http://fasta.bioch.virginia.edu/fasta/align.htm• Can also be run from web at http://us.expasy.org/tools
83
Protein Alignment with the FASTA family
• FASTA is one of the earliest protein alignment tools, and still actively maintained
• Pronounced FAST and then a long A• A local alignment, heuristic tool• Can be downloaded from
http://www.people.virginia.edu/~wrp/pearson.html• FASTA family maintained by Prof. William R. Pearson• Can also be run from Web
84
FASTA Algorithm
• Ktup = word length (2 default; 1 sometimes used)
• FASTA searches for words of length ktup matching between sequences
• FASTA searches for ungapped regions of a particular length that have the highest number of identical ktups
• FASTA scores the 10 ungapped alignments that have the highest number of identical ktups, scoring with a scoring matrix (default is BLOSUM50)
• FASTA then tests for the ability to merge the ungapped alignments into a single alignment without dropping the overall score too much
• FASTA uses the Smith-Waterman algorithm within the local alignment regions!
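The first stage of the search, finding ungapped regions rich in identical ktups, can be sketched as a word lookup followed by counting hits per diagonal. This is a simplified illustration, not FASTA's actual implementation; the function names and the idea of returning whole diagonals (rather than bounded regions) are assumptions made to keep the sketch short.

```python
from collections import Counter, defaultdict

def ktup_diagonal_hits(query, subject, ktup=2):
    """Count identical ktup words shared by query and subject on each diagonal."""
    # Lookup table: word -> positions where it starts in the query.
    positions = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        positions[query[i:i + ktup]].append(i)
    # A word at query position i and subject position j lies on diagonal i - j.
    diagonals = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in positions.get(subject[j:j + ktup], ()):
            diagonals[i - j] += 1
    return diagonals

def best_diagonals(query, subject, ktup=2, top=10):
    """Return up to `top` (diagonal, hit count) pairs with the most ktup hits."""
    return ktup_diagonal_hits(query, subject, ktup).most_common(top)
```

The diagonals returned by `best_diagonals` correspond to the candidate ungapped alignments that would then be rescored with a substitution matrix.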
85
Multiple Alignment - Clustal-W
• Why do we need to align many different sequences at once?
  – Look for highly conserved regions
  – Gene searching (of mice and men)
• http://www.ebi.ac.uk/clustalw/
• Thompson et al. 1994. Nucleic Acids Res. 22: 4673-4680
• Heuristic & Progressive
  – Begin with 2 sequences
  – Add others one-by-one
• Uses profile alignment
  – Align sequence with group of aligned sequences
  – Align groups of aligned sequences
  – Misalignments in conserved regions penalized heavily
86
Example output
FOS_RAT     MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNT
FOS_MOUSE   MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNT
FOS_CHICK   MMYQGFAGEYEAPSSRCSSASPAGDSLTYYPSPADSFSSMGSPVNS
FOSB_MOUSE  -MFQAFPGDYDS-GSRCSS-SPSAESQ--YLSSVDSFGSPPTAAAS
FOSB_HUMAN  -MFQAFPGDYDS-GSRCSS-SPSAESQ--YLSSVDSFGSPPTAAAS
*:..* .:*:: .***** **:.:* * *..***.* :.. :*:

FOS_RAT     IPTVTAISTSPDLQWLVQPTLVSSVAPSQ-------TRAPHPYGLP
FOS_MOUSE   IPTVTAISTSPDLQWLVQPTLVSSVAPSQ-------TRAPHPYGLP
FOS_CHICK   VPTVTAISTSPDLQWLVQPTLISSVAPSQ-------NRG-HPYGVP
FOSB_MOUSE  VPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMP
FOSB_HUMAN  VPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMP
:******:** **********:**:* **... ::. .**.:* :
87
Clustal-W Algorithm
• Construct matrix of distances
  – Alignment scores from all pairwise combinations
  – Alignments by dynamic programming method
  – Alignment scores transformed to evolutionary distances
  – Cluster distances into hierarchical tree (neighbor joining)
• Progressively align sequences using tree as a guide
  – Begin with closest pair
  – Work up tree in order of decreasing similarity
  – Use pairwise alignment for pairs
  – Use sequence-profile alignment to add sequences to clusters
  – Use profile-profile alignment to join clusters
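The "matrix of distances" and "begin with closest pair" steps can be illustrated with a deliberately simplified sketch. It assumes the sequences are already available as equal-length aligned rows and uses the fraction of differing positions as the distance; Clustal-W itself derives its distances from pairwise dynamic programming alignments and builds a neighbor-joining tree, neither of which is reproduced here. The function names are hypothetical.

```python
from itertools import combinations

def p_distance(a, b):
    """Fraction of differing positions between two equal-length aligned rows."""
    if len(a) != len(b):
        raise ValueError("rows must have equal length")
    # Skip columns where either row has a gap.
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)

def closest_pair(rows):
    """Return (distance, name1, name2) for the most similar pair of rows."""
    return min(
        (p_distance(rows[p], rows[q]), p, q)
        for p, q in combinations(sorted(rows), 2)
    )
```

In a progressive aligner the pair returned by `closest_pair` would be aligned first, with the remaining sequences added in order of decreasing similarity.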
88
CLUSTAL-W key features
• Sequences weighted to reduce representation bias associated with large subfamilies (usual sum-of-pairs score problem)
• Substitution matrix used for scoring depends on distance between sequences.
  – BLOSUM80 for near sequences
  – BLOSUM50 for distant sequences
• Gap penalties at hydrophobic residues heavier than those at hydrophilic residues
• Gap penalties also contingent upon exact residue identity at gap site
• Gaps corralled by increasing penalties at sites where gaps are rare when gaps are common nearby
• When building alignment, low-scoring additions rescheduled to be added later
89
ClustalW-MPI
• Li, K.-B. 2003. ClustalW-MPI: ClustalW analysis using distributed and parallel computing. Bioinformatics 19: 1585-1586
• Initial pairwise alignment process is parallelized and scales very well
• Multiple alignment process is parallelized and scales modestly
• Scaling tests published thus far up to 16 processors, reduces time from hours to minutes
90
HMMER
• http://hmmer.wustl.edu/
• Profile HMMs for protein sequence analysis
• A profile is a statistical model of patterns that are likely for multiple alignments, including variability at various positions and probabilities of various residues
• Useful when similarities are too faint to be picked up by BLAST
• Several profiles based on existing alignments exist
• Available as a parallel code using PVM
• Scales reasonably well as regards number of processors. Does not scale as well as regards size of the biological problem
91
GeneIndex
• Location of initiators, promoters, etc. a key question in genomics
• First step in this is creating a dictionary of words of various lengths (many possible next steps)
• To be useful, analysis must be performed on entire genomes at once
• GeneIndex finds frequencies and positions of all words of a given length in a DNA sequence. Visualization with Tcl/Tk.
92
GeneIndex Parallelization
• Genome is broken up into n sections, where n = number of processors
• After each segment is analyzed, linked lists are joined
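The word dictionary and the section-wise parallelization can be sketched as follows. This is an illustration only: the sections are processed in a plain loop rather than on separate processors, Python lists stand in for the linked lists, and the overlap of k - 1 bases between sections is an assumption introduced here so that words spanning a section boundary are not lost.

```python
from collections import defaultdict

def index_words(sequence, k, offset=0):
    """Map each word of length k to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(offset + i)
    return index

def index_in_sections(sequence, k, n):
    """Index n sections of a non-empty sequence separately, then join the lists."""
    size = -(-len(sequence) // n)  # ceiling division
    merged = defaultdict(list)
    for start in range(0, len(sequence), size):
        # Overlap by k - 1 bases so words spanning a boundary are kept.
        section = sequence[start:start + size + k - 1]
        for word, hits in index_words(section, k, offset=start).items():
            merged[word].extend(hits)
    return merged
```

Word frequencies follow directly from the index as the length of each position list, so the sectioned index gives the same answer as indexing the whole sequence in one pass.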
93
94
GeneIndex Scalability: Processing Time (Drosophila)
[Figure: time (seconds) versus number of CPUs]
95
GeneIndex Scalability: Speedup (Drosophila)
[Figure: speedup versus number of CPUs]
96
Phylogenetics
97
Building Phylogenetic Trees
• Goal: an objective means by which phylogenetic trees can be estimated in tolerable amounts of wall-clock time, producing phylogenetic trees with measures of their uncertainty
• All evolutionary changes are described as bifurcating trees
  – genes or gene products
  – organisms
98
Phylogenetic trees from DNA sequences
• Changes in DNA are modeled as Markov processes
• Sequences available:
  – DNA (sequences are series of the base molecules; aligned sequences will also contain +s for gaps)
  – Amino acid sequences (series of letters indicating the 20 amino acids). Computational challenges more severe than with DNA sequences.
  – RNA
• The availability of data at present exceeds the ability of researchers to analyze it!
99
Why is tree-building a HPC problem?
• The number of bifurcating unrooted trees for n taxa is $(2n-5)!/\left[(n-3)!\,2^{n-3}\right]$
• For 50 taxa the number of possible trees is ~$10^{74}$; most scientists are interested in much larger problems
• NP-hard problem
• The number of bifurcating rooted trees is $(2n-3)!/\left[(n-2)!\,2^{n-2}\right]$
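The growth in the number of trees is easy to check numerically. The sketch below uses the standard counts for bifurcating trees, written as products of odd numbers: (2n-5)!/[(n-3)! 2^(n-3)] = 1·3·5···(2n-5) for unrooted trees and 1·3·5···(2n-3) for rooted trees. The function names are illustrative.

```python
def double_factorial_odd(k):
    """Product of the odd numbers 1 * 3 * ... * k (1 if k < 3)."""
    result = 1
    for odd in range(3, k + 1, 2):
        result *= odd
    return result

def unrooted_trees(n):
    """Number of bifurcating unrooted trees for n >= 3 taxa."""
    return double_factorial_odd(2 * n - 5)

def rooted_trees(n):
    """Number of bifurcating rooted trees for n >= 2 taxa."""
    return double_factorial_odd(2 * n - 3)
```

For 50 taxa `unrooted_trees(50)` is a 75-digit number, on the order of 10^74, which is why exhaustive search is out of the question.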
100
Phylogenetic software
• Phylip. (J. Felsenstein). Collection of software packages that cover most types of analysis. One of the most popular software collections. Free.
• PAUP. (D. Swofford). Parsimony, distance, and ML methods. Also one of the most popular software collections. Not free, but not expensive.
• fastDNAml. (G. Olsen). Maximum likelihood method for DNA; becoming one of the more popular ML packages. MPI version available soon; well suited to tree searching in large data sets. Free.
• GRAPPA (Bader et al.): Breakpoint analysis program - scales well
101
Stochastic change of DNA
• Markov process, independent for each site: 4 x 4 matrix for DNA, 20 x 20 for amino acids
      A        C        G        T
  A   p(A->A)  p(A->C)  p(A->G)  …
  C   p(C->A)  p(C->C)  p(C->G)  …
  G   .
  T   .
• Transitions more probable than transversions.
• Must account for heterogeneity in substitution rates among sites (DNArates – Olsen)
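A 4 x 4 substitution matrix of this kind can be sketched with two parameters, one probability for transitions (A<->G, C<->T) and a smaller one for each transversion. The numeric defaults below are illustrative assumptions, not values from fastDNAml or DNArates, and the sketch ignores rate heterogeneity among sites.

```python
BASES = "ACGT"
PURINES = {"A", "G"}

def substitution_matrix(transition=0.02, transversion=0.005):
    """Build a 4 x 4 one-step matrix p(x->y) with rows summing to 1."""
    matrix = {}
    for x in BASES:
        row = {}
        for y in BASES:
            if x == y:
                continue
            # A<->G and C<->T are transitions; all other changes are transversions.
            same_class = (x in PURINES) == (y in PURINES)
            row[y] = transition if same_class else transversion
        row[x] = 1.0 - sum(row.values())
        matrix[x] = row
    return matrix

def step(matrix_a, matrix_b):
    """Compose two matrices: probability of x->y after both steps."""
    return {
        x: {y: sum(matrix_a[x][z] * matrix_b[z][y] for z in BASES) for y in BASES}
        for x in BASES
    }
```

Because the process is Markovian, the change over a longer branch is obtained by composing one-step matrices with `step`.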
102
fastDNAml
• Developed by Gary Olsen
• Derived from Felsenstein’s PHYLIP programs
• One of the more commonly used ML methods
• The first phylogenetic software implemented in a parallel program (at Argonne National Laboratory, using P4 libraries)
• Olsen, G.J., et al. 1994. fastDNAml: a tool for construction of phylogenetic trees of DNA sequences using maximum likelihood. Computer Applications in the Biosciences 10: 41-48
• MPI version produced by Indiana University in collaboration with Gary Olsen available from http://www.indiana.edu/~rac/hpc/fastDNAml/
103
fastDNAml algorithm – adding taxa
• Optimize tree for 3 (randomly chosen) taxa - only one topology possible
• Randomly pick another taxon – (2i-5) trees possible
• Keep the best (maximum likelihood tree)
104
Basic fastDNAml algorithm - Branch rearrangement
• Move any subtree crossing n vertices (if n=1 there are 2i-6 possibilities)
• Keep best resulting tree
• Repeat this step until local swapping no longer improves likelihood value
105
fastDNAml algorithm cont’d: Iterate
• Get sequence data for next taxon
• Add new taxa (2i-5)
• Keep best
• Local rearrangements (2i-6)
• Keep best
• Keep going….
• When all taxa have been added, perform a full tree check
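The payoff of stepwise addition shows up in a simple count. The sketch below tallies how many candidate trees are evaluated when taxa 4 through n are added one at a time (2i-5 placements for the ith taxon), and how many trees one round of local rearrangement examines (2i-6, for subtrees crossing one vertex). It counts trees only and does not evaluate likelihoods; the function names are illustrative.

```python
def stepwise_addition_trees(n):
    """Trees evaluated when adding taxa 4..n one at a time (2i-5 each)."""
    return sum(2 * i - 5 for i in range(4, n + 1))

def rearrangement_trees(i):
    """Trees examined by one round of local rearrangement on i taxa."""
    return 2 * i - 6
```

For 50 taxa the addition steps evaluate 2,303 trees in total, compared with roughly 10^74 possible topologies, which is why the heuristic is tractable at the cost of possibly missing the best tree.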
106
Overview of parallel program flow
• Program modules
  – Master (generates trees, receives back from Foreman best tree at each step)
  – Foreman (dispatches trees to workers, determines best tree, tracks activity of workers)
  – Worker
  – Monitor (instrumentation)
  – Parallel versions include fault tolerance features (useful in large clusters and grid computing)
107
Performance of fastDNAml
[Figure: speedup versus number of processors for 50, 101, and 150 taxa, compared with perfect scaling]
108
Why bother with parallel code?
• Why not just achieve speedup of n on n processors by running n independent jobs?
• Practical benefits of seeing results quickly
• Parallel program permits assault on much more complicated problems (e.g. protein sequences)
109
RNA & Protein Structure
110
RNA Structure – Vienna RNA
• http://www.tbi.univie.ac.at/~ivo/RNA/
• Package consists of several parts (from the web site):
  – RNAfold - predict minimum energy secondary structures and pair probabilities
  – RNAeval - evaluate energy of RNA secondary structures
  – RNAheat - calculate the specific heat (melting curve) of an RNA sequence
  – RNAinverse - inverse fold (design) sequences with predefined structure
  – RNAdistance - compare secondary structures
  – RNApdist - compare base pair probabilities
  – RNAsubopt - complete suboptimal folding
111
Types of Proteins
• Enzymes – biological catalysts. Most of the chemical reactions which occur in biological systems are catalyzed by enzymes.
• Storage. Various ions, small molecules and other metabolites are stored by complexing with proteins; for example haemoglobin carries oxygen.
• Transport. Proteins are involved in the transportation of particles ranging from electrons to macromolecules.
• Messengers. Proteins are involved in the transmission of nervous impulses. Hormones play a coordinating role.
• Antibodies. Proteins which bind to specific foreign particles such as bacteria and viruses.
• Regulation. Enzymes synthesize proteins by translating sequences of DNA.
• Structural proteins. Mechanical proteins (e.g. collagen)
112
Proteins – a sparse vocabulary built up from amino acids
• Average time to fold based on random motion
• Actual folding – small fractions of a second
• Only a small subset of possible amino acid sequences actually code for a real protein
• Minimization of free energy – the key in real life and in analysis!
113
http://bmbiris.bmb.uga.edu/wampler/tutorial/prot0.html
114
http://bmbiris.bmb.uga.edu/wampler/tutorial/prot0.html
115
Molecular viewing software options
• VRML – Cosmo Player http://www.karmanaut.com/cosmo/player/
• RASMOL - http://www.openrasmol.org/
• CHIME - http://www.mdl.com/chime/index.html
• Swiss Pdb Viewer - http://www.expasy.ch/spdbv/
• MICE - http://mice.sdsc.edu/
• Many tend to be touchy about browsers and plugins
116
Different ways to view molecules
• Wireframe
• Stick
• Ball and stick
• Space filled (Van der Waals radii)
• Some examples:
  – http://class.fst.ohio-state.edu/FST822/Images/helix.pdb
  – http://www.rcsb.org/pdb/
  – http://www.rcsb.org/pdb/cgi/explore.cgi?job=graphics;pdbId=1GFL;page=;pid=264201048789105&opt=vrml_default
117
Protein structure determination
• X-ray crystallography
• X-ray reflections form a pattern
• Model the known sequence of atoms fitting into a 3D structure so that the reflection pattern matches the observed pattern
• Spectroscopic analysis of molecule structure precise but still slow!
• ~127,863 entries in SwissProt
• ~857,950 entries in TrEMBL
http://crystal.uah.edu/~carter/protein/xray.htm
118
Protein structure prediction methods
• Knowledge-based methods
  – Based on information extracted from existing structures to estimate structure
• Physico-chemical methods
  – “Ab initio” protein structure prediction
• Feature detection methods:
  – Look for post-translational modification signals
    • Cleavage sites
    • Glycosylation sites
    • Phosphorylation
• Site for prediction server: http://www.cbs.dtu.dk/services/
119
Protein Structure Prediction
• Key requirement: prediction of molecule position within 1 angstrom
• Measuring quality of fit
  – Root mean square of atom distances: $\mathrm{RMSD} = \sqrt{\left(\sum_i d_i^2\right)/N}$
  – Q3 = (true positives + true negatives)/total residues
• Better than 70% right is really good!
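Both quality measures are short to compute. The sketch below is a minimal illustration: `rmsd` assumes the two structures are already superimposed and that atoms are paired by position in the lists, and `q3` takes the common per-residue reading of Q3, the fraction of residues whose three-state label (helix H, strand E, coil C) is predicted correctly. The function names and input formats are assumptions made for the example.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square distance between paired atom coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = [
        sum((p - q) ** 2 for p, q in zip(atom_a, atom_b))
        for atom_a, atom_b in zip(coords_a, coords_b)
    ]
    return math.sqrt(sum(squared) / len(squared))

def q3(predicted, observed):
    """Fraction of residues whose 3-state label is predicted correctly."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty label strings of equal length")
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)
```

For example, a prediction of `"HHEC"` against an observed `"HHCC"` gets three of four residues right, a Q3 of 0.75.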
120
Secondary Structure Prediction
• Secondary – or local – structure prediction is the first step in classifying amino acid sequences
  – Alpha helix
  – Beta sheet
  – Coil
http://www.cryst.bbk.ac.uk/PPS95/course/3_geometry/rama.html
http://www.cryst.bbk.ac.uk/PPS95/course/3_geometry/helix1.html
121
Different approaches to tertiary structure prediction
• Do a sequence alignment to find a protein that is like the unknown sequence in whole or in part
• Threading
  – Thread a molecule on to a guide
  – Add sidechains
  – Optimize sidechains
• Piecewise reconstruction
  – Estimate the structure of smaller pieces
  – Then estimate how they fit together
122
SDSC Biology Workbench
• Probably one of the best overall sites in the US
• http://workbench.sdsc.edu
• Requires registration but this is relatively painless
• You do need to read the instructions first…
123
Ab initio methods - Amber
• http://amber.scripps.edu/#ff
• sander: Simulated annealing with NMR-derived energy restraints.
• gibbs: Free energy perturbation (FEP) and thermodynamic integration (TI); also allows potential of mean force (PMF) calculations.
• roar: Allows mixed quantum-mechanical/molecular-mechanical (QM/MM) calculations, "true" Ewald simulations, and alternate molecular dynamics integrators.
• nmode: Normal mode analysis program using first and second derivative information, used to search for local minima, perform vibrational analysis, and search for transition states.
• (from http://amber.scripps.edu/#code)
124
Ab initio methods - GAMESS
• Schmidt, M.W., K.K. Baldridge, J.A. Boatz, S.T. Elbert, M.S. Gordon, J.H. Jensen, S. Koseki, N. Matsunaga, K.A. Nguyen, S. Su, T.L. Windus, M. Dupuis, J.A. Montgomery. 1993. General Atomic and Molecular Electronic Structure System. J. Comput. Chem. 14: 1347-63.
• NPACI/SDSC Web portal for GAMESS: https://gridport.npaci.edu/gamess/
125
Hybrid approaches: Rosetta
• Library of short sequence motifs that correlate strongly with protein local structural properties.
• Basic idea:
  – sequence-dependent local interactions bias segments of the chain
  – nonlocal interactions select the lowest free-energy tertiary structures from the many conformations compatible with these local biases
  – Use protein database and take the distribution of local structures adopted by short sequence segments (fewer than 10 residues in length) in known three-dimensional structures
  – Put these structures together using non-local interactions: hydrophobic burial, electrostatics, main-chain hydrogen bonding and excluded volume.
• Free energy is then minimized to create candidate structures
126
Molecular Docking
• Key in drug searching
• Autodock is a commonly used package
• http://www.scripps.edu/pub/olson-web/doc/autodock/
• “AutoDock is a suite of automated docking tools. It is designed to predict how small molecules, such as substrates or drug candidates, bind to a receptor of known 3D structure.” (from the web site)
• Nice visualization of an AutoDock docking simulation: http://wwwcmc.pharm.uu.nl/moret/dockings/home.html
127
Systems Biology
128
Systems Biology
• Special issue of Science: 295, Mar. 2002
• Special issue of Nature: 420, Nov. 2002
• Nobody’s quite sure what it is, but it sure is hot!
http://www.ornl.gov/TechResources/Human_Genome/graphics/slides/images/01-0052_web.gif
129
Historical approach to biological experiments
• From Lazebnik, Y. 2002. Cancer Cell 2:179
• Traditional biological experimentation much like the process of trying to fix a broken radio
• (Or, for those of us who have experienced either being or living with a 12-year old boy, the process of breaking a functioning radio)
• Some typical steps:
  – Cataloguing components and their attributes
  – Perturbing the system
  – Knock-out experiments
  – Drawing diagrams
• Eventually may find a component that, when replaced, repairs the radio
130
Issues
• In a very complex system, knowing what all of the parts are, and knowing the function of individual pathways, may still not tell you how the system works. It may simply be impossible to deduce this from first-order interactions
• Interactions, multiple changes
  – Power supply and other components (well-known PC repair example!)
  – Change everything all at once so that we’ll never know what worked!
131
Systems Biology
• Systems biology emphasizes close integration of experiment, theory and computational modeling
• Goal: understanding the structure and dynamics of biological systems, placing the parts in the context of the dynamic whole
  – Studies the complex interactions of many levels of biological information
  – Quantitative, predictive models are central
  – Computational modeling in particular is a key tool
• Why model
  – You are forced to really state what you are hypothesizing
  – Allows you to understand an *approximation* of reality in great detail
• Computational Cell Biology. 2002. Springer Verlag (Fall et al, eds).
• Foundations of systems biology. MIT Press, 2001. Kitano (ed)
132
Example - MCell
• MCell is: A General Monte Carlo Simulator of Cellular Microphysiology. http://www.mcell.cnl.salk.edu/
• MCell focuses on simulations using a Brownian dynamics random walk algorithm.
• MCell's use to date has been focused on the microphysiology of synaptic transmission.
• Images and MCell-related material courtesy of Joel R. Stiles, Pittsburgh Supercomputing Center and Carnegie Mellon University, and Thomas M. Bartol, Computational Neurobiology Laboratory, The Salk Institute. http://www.mcell.cnl.salk.edu/
133
MCell Scalability
Images and MCell-related material courtesy of Joel R. Stiles, Pittsburgh Supercomputing Center and Carnegie Mellon University, and Thomas M. Bartol, Computational Neurobiology Laboratory, The Salk Institute. http://www.mcell.cnl.salk.edu/
134
M-Cell
• Uses Model Description Language (MDL), designed with biologically-oriented users in mind.
• Embarrassingly parallel Monte Carlo application
• Supports checkpointing!
Images and MCell-related material courtesy of Joel R. Stiles, Pittsburgh Supercomputing Center and Carnegie Mellon University, and Thomas M. Bartol, Computational Neurobiology Laboratory, The Salk Institute. http://www.mcell.cnl.salk.edu/
135
CompuCell
• CompuCell currently uses a combination of "extended Potts model" for cell sorting and clustering, and "Schnakenberg Reaction Diffusion" equations to establish the underlying chemical field to which cells respond and form typical patterns found in such biological systems as a growing chicken limb.
• http://www.nd.edu/~icsb/
Image courtesy of James Glazier
http://www.biocomplexity.indiana.edu/software.php
136
Karyote
• Information theory approach - construction of probability for parameters so that uncertainty in their estimation is assessed.
• The incompleteness of model is addressed via a probability functional approach for computing the time-dependence of the concentration of key enzymes
• Small features such as ribosomes or viruses behave in ways that rely on their atomic scale structure but which take part in the overall (macroscopic) balance of metabolic reaction and transport. “Zones” may be treated in more detail via the solution of mesoscopic models using finite element methods.
• Can be run over web at http://biodynamics.indiana.edu/overview/
137
Issue: Getting Tools to Interoperate
• There is currently a proliferation of software, but no single package answers all needs
• No single tool is likely to do so in the near future
• But: problems with using multiple packages
• One effort to address this problem:
  – Systems Biology Workbench Project
• Purpose: develop software and standards to
  – Enable sharing of simulation & analysis software
  – Enable sharing of models
• Goal: make it easier to share than to reimplement
138
The Systems Biology Workbench Project
• http://www.sbw-sbml.org/
• Simple framework for application interaction.
• Cross-platform compatible & language-neutral
• Modules are separately compiled executables. A module defines services which have methods
• SBW native-language libraries provide APIs.
• SBW Broker acts as coordinator
[Diagram: SBW connecting the Visual Editor, Stochastic Simulator, ODE-based Simulator, Script Interpreter, and Database Interface modules]
139
CellML
• http://www.cellml.org/public/about/what_is_cellml.html
• XML-based specification of interchange of cell model information
• Includes:
  – Information about model structure
  – Math, based on MathML
  – Metadata about the model
• Project of Bioengineering Institute of University of Auckland with support from Physiome Sciences Inc.
140
Systems biology URLs
• SBW & SBML: www.sbw-sbml.org
• NetBuilder: strc.herts.ac.uk/bio/Maria/NetBuilder
• CellML: www.cellml.org
• Jarnac + JDesigner: www.cds.caltech.edu/~hsauro
• Gepasi: www.gepasi.org
• Virtual Cell: www.nrcam.uchc.edu/
• E-CELL: www.e-cell.org
• JigCell: gnida.cs.vt.edu/~cellcyclepse/
• DARPA BioSPICE: www.biospice.org
• Karyote: http://biodynamics.indiana.edu/overview/
141
Grand challenge problems and some thoughts about the future
142
Modeling Heart Function
• Based on Noble, D. 2002. Modeling the heart – from genes to cells to the whole organ. Science 295: 1678-1682
• Two mutations known for sodium channels
  – DeltaKPQ – deletion of 3 amino acids (lysine-proline-glutamine) – causes persistent sodium flow through the cell membrane
  – Missense mutations in sodium channels which cause ventricular fibrillation that can be fatal
• Models of heart function can produce counterintuitive predictions
• Grand challenge problem: the full scale reconstruction of a heart attack
143
Real-time fMRI
[Figure: 3.0T MRI scanner, SGI Onyx, and CRAY T3E]
In 1996, this required a supercomputer. Today, it’s routine.
Slide courtesy of Ralph Roskies, Pittsburgh Supercomputing Center
144
Gamma Knife
• Used to treat inoperable tumors
• Treatment methods currently use a standardized head model
• UITS is working with IU School of Medicine to adapt Penelope code to work with detailed model of an individual patient’s head
145
PENELOPE Basics
• “PENELOPE performs Monte Carlo simulation of coupled electron-photon transport in arbitrary materials and complex quadric geometries”(http://www.nea.fr/abs/html/nea-1525.html)
• Improvement of targeting based on CT scans of patient’s head – 200 512 x 512 voxel slices
• Simulation takes ~7 hours using a serial version of PENELOPE running on a 1 GHz PIII Windows system
• Goal: 5 minutes to one hour
146
Parallelization of PENELOPE
• Each processor:
  – Views entire target
  – Generates its own random numbers
  – Generates a set number of independent trajectories
  – Accumulates data
• Process 0:
  – Collects the raw data
  – Computes desired results
• Uses F90 for parallel random number generator from MILC consortium
• Uses MPI elsewhere
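The parallel structure described above can be sketched without MPI by simulating each process in turn. Everything below is a toy stand-in: the "physics" (an exponential free path through a slab) is invented for illustration and is not PENELOPE's transport model, the per-rank seeding with `base_seed + rank` is an assumption and a much weaker guarantee of independent streams than a true parallel random number generator, and a real code would replace the list comprehension with an MPI reduction onto process 0.

```python
import random

def run_rank(rank, trajectories, thickness=1.0, base_seed=12345):
    """One process: simulate its own trajectories with its own random stream."""
    rng = random.Random(base_seed + rank)  # distinct seed per rank (assumption)
    absorbed = 0
    for _ in range(trajectories):
        # Toy physics: exponential free path with mean 1; the particle is
        # counted as absorbed if it stops inside the slab.
        if rng.expovariate(1.0) < thickness:
            absorbed += 1
    return absorbed

def simulate(ranks=4, trajectories_per_rank=25000):
    """Stand-in for process 0: collect raw counts and compute the result."""
    counts = [run_rank(r, trajectories_per_rank) for r in range(ranks)]
    return sum(counts) / (ranks * trajectories_per_rank)
```

Because the trajectories are independent and only the final counts are combined, the work divides almost perfectly across processors, which is consistent with the near-linear scaling reported for this kind of Monte Carlo code.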
147
PENELOPE Scalability: processing time
[Figure: total wallclock time (sec., logarithmic scale) versus number of processors, on IBM SP/Power3]
148
PENELOPE Scalability: Speedup
[Figure: speedup versus number of processors]
149
Some very boring Vampir traces of PENELOPE
150
“Simulation-only” studies
• Aquaporins – proteins which conduct large volumes of water through cell membranes while filtering out charged particles like hydrogen ions.
• Massive simulation (35,000 hours TCS) showed that water moves through aquaporin channels in single file. Oxygen leads the way in. Half way through, the water molecule flips over.
• That breaks the ‘proton wire’
• Work done at Pittsburgh Supercomputing Center
• Klaus Schulten et al., U. of Illinois, Science (April 19, 2002)
151
Other example large-scale computational biology grid projects
• Department of Energy “Genomes to Life” http://doegenomestolife.org/
• Encyclopedia of Life (http://eol.sdsc.edu/)
• Biomedical Informatics Research Network (BIRN) http://birn.ncrr.nih.gov/birn/
• Asia Pacific BioGrid (http://www.apbionet.org/)
• eDiamond – breast cancer/mammography grid (http://www.mirada-solutions.com/PH1.asp?PAGE_ID=739)
152
Visualization: OpenDX
• http://www.opendx.org/
• OpenDX is the open source software version of IBM's Visualization Data Explorer Product
• Good sources of information in books, tutorials, etc.
• Interesting example of open source
• Animations as well
http://www.opendx.org/highlights.php
153
Visualization: SCIRun
• Some of the most dramatic biological visualizations ever done
• Has been used for surgical support
• Scientific Computing and Imaging Institute – Christopher R. Johnson
• http://www.sci.utah.edu/
154
Genomes to Life
• http://www.doegenomestolife.org/
• Goals:
  – Identify and Characterize the Molecular Machines of Life – the Multiprotein Complexes That Execute Cellular Functions and Govern Cell Form
  – Characterize Gene Regulatory Networks
  – Characterize the Functional Repertoire of Complex Microbial Communities in Their Natural Environments at the Molecular Level
  – Develop the Computational Methods and Capabilities to Advance Understanding of Complex Biological Systems and Predict Their Behavior
  – (Goals taken directly from Genomes to Life web site)
155
EOL Basic Topology
Putative Functional and 3D Assignment
Genomic Data
Integration with Other Resources
Public and Private DatabasesTo Serve Thousands Worldwide
http://eol.sdsc.edu/methodology.html
156
Current Genomic Pipeline
Arabidopsis Protein sequences
Prediction of : signal peptides (SignalP, PSORT) transmembrane (TMHMM, PSORT) coiled coils (COILS) low complexity regions (SEG)
Structural assignment of domains by PSI-BLAST on FOLDLIB
Only sequences w/out A-prediction
Only sequences w/out A-prediction
Structural assignment of domains by 123D on FOLDLIB
Create PSI-BLAST profiles for Protein sequences
Store assigned regions in the DB
Functional assignment by PFAM, NR, PSIPred assignments
FOLDLIB
NR, PFAM
Building FOLDLIB:
PDB chains SCOP domains PDP domains CE matches PDB vs. SCOP
90% sequence non-identical minimum size 25 aa coverage (90%, gaps <30, ends<30)
Domain location prediction by sequence
structure infosequence info
SCOP, PDB
http://eol.sdsc.edu/methodology.html
157
Scale of Multi-genome Analysis
Genomes Protein sequences
Prediction of : signal peptides (SignalP, PSORT) transmembrane (TMHMM, PSORT) coiled coils (COILS) low complexity regions (SEG)
Structural assignment of domains by PSI-BLAST on FOLDLIB
Only sequences w/out A-prediction
Only sequences w/out A-prediction
Structural assignment of domains by 123D on FOLDLIB
Create PSI-BLAST profiles for Protein sequences
Store assigned regions in the DB
Functional assignment by PFAM, NR, PSIPred assignments
FOLDLIB
NR, PFAM
Building FOLDLIB:
PDB chains SCOP domains PDP domains CE matches PDB vs. SCOP
90% sequence non-identical minimum size 25 aa coverage (90%, gaps <30, ends<30)
Domain location prediction by sequence
structure infosequence info
SCOP, PDB
~800 genomes @ 10k-20k per = ~$10^7$ ORFs
4 CPU years
228 CPU years
3 CPU years
9 CPU years
252 CPU years
3 CPU years
$10^4$ entries
http://eol.sdsc.edu/methodology.html
158
BIRN
• Biomedical Informatics Research Network
• http://www.nbirn.net/
• NIH-sponsored attempt to create health-oriented cyberinfrastructure
• Function BIRN – brain function and disorders, e.g. schizophrenia
• Morphometry BIRN – brain structural disorders, e.g. Alzheimer’s
• Mouse BIRN – studying mouse brain and mouse models of human brain disorders
• Grid technology, using federated data system approach, based on Globus, SRB, etc.
159
Drug Design
• Target generation – so what
• Target verification – that’s important!
• Toxicity prediction – VERY important!!
• (Cholesterol example)
• Counterintuitive problem: the more personalized a therapy is, the smaller its target audience!
160
What is the killer application in computational biology?
• Systems biology – latest buzzword, but….
• Goal: multiscale modeling from cell chemistry up to multiple populations
• Current software tools still inadequate
• Multiscale modeling calls for use of established HPC techniques – e.g. adaptive mesh refinement, coupled applications
• Current challenge examples: actin fiber creation, heart attack modeling
• Opportunity for predictive biology?
161
Computational biology, biomedical research, and HPC
• Two challenges:
  – Scalability of applications
  – Wall-clock time sensitivity
• Bioinformatics, Genomics, Proteomics, ____ics will radically change understanding of biological function and the way biomedical research is done.
• Traditional biomedical researchers must take advantage of new possibilities
• Computer-oriented researchers must take advantage of the knowledge held by traditional biomedical researchers
162
Peta-Scale applications?
• Is this what most biologists really need?
• Many biologists are unfamiliar with the real possibilities
• Useful – even lifesaving – applications may require straightforward application of well known principles.
• The low hanging fruit taste just fine, e.g. “Parallel” Matlab, GeneIndex, batch scripts (www.indiana.edu/~rac/bioinformatics/iubatchscripts.html)
• Writing a parallel application that can be used to treat people is a very difficult challenge
• Attacks on all fronts simultaneously are needed
• Interactive Tera-scale applications might for many biologists be more valuable right now than Peta-scale applications (even if we had them!)
• All of these open source codes are out there waiting for you to parallelize and/or tune them!
163
So how do you find biologists with whom to collaborate?
• Chicken and egg problem?
• Or more like fishing?
• Or bank robbery?
• Willie Sutton, a famous American bank robber, was asked why he robbed banks, and reportedly said “because that's where the money is.” (This is, sadly, an urban legend: Sutton never said this)
• Cultivating collaborations with biologists in the short run will require:
  – Active outreach
  – Different expectations than we might usually have
  – Patience
• There are lots of opportunities open for HPC centers willing to take the effort to cultivate relationships. To do this, we’ll all have to spend a bit of time “going where the biologists are.”
164
Acknowledgments
• Some of the research described herein was supported in part by the Indiana Genomics Initiative. The Indiana Genomics Initiative of Indiana University is supported in part by Lilly Endowment Inc.
• Some of the research described herein was supported in part by Shared University Research grants from IBM, Inc. to Indiana University.
• Some of the material described herein is based upon work supported by the National Science Foundation under Grant No. 0116050 and Grant No. CDA-9601632. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).
• Some of the ideas presented here were developed while the senior author was a visiting scientist at Höchstleistungsrechenzentrum Universität Stuttgart. The support and collaboration of HLRS and Michael Resch, Matthias Müller, Peggy Lindner, Matthias Hess, and Rainer Keller are gratefully acknowledged.
• Thanks to UITS Research and Academic Computing Division managers: Mary Papakhian, Stephen Simms, Richard Repasky, Matt Link, John Samuel, Eric Wernert, Anurag Shankar
• Indiana Genomics Initiative Staff: Andy Arenson, Chris Garrison, Huian Li, Jagan Lakshmipathy, David Hancock
• UITS Senior Management: Associate Vice President and Dean (Retired) Christopher Peebles, RAC(Data) Director Gerry Bernbom, Associate Vice President and Dean Bradley Wheeler
• Assistance with this presentation: John Herrin, Malinda Lingwall, W. Les Teach
165
Some Good Books
• Winter, P.C., G.I. Hickey, H.L. Fletcher. 1998. Instant notes in genetics. Springer-Verlag, NY. ISBN 0-387-91562-1
• Durbin, R., S. Eddy, A. Krogh, G. Mitchison. 2000. Biological sequence analysis. Cambridge University Press.
• Gibas, C., and P. Jambeck. 2001. Developing bioinformatics computer skills. O’Reilly.
• Tisdall, J. 2001. Beginning perl for bioinformatics. O’Reilly.
• Gusfield, D. 1997. Algorithms on strings, trees, and sequences. Cambridge University Press.
• Berman, F., G.C. Fox, A.J.G. Hey. (eds) 2003. Grid computing: making the grid infrastructure a reality. Wiley, Sussex