1

Computational biology, bioinformatics, and high performance computing

Craig A. Stewart

stewart@iu.edu

Indiana University

SC2003 Tutorial 16 November 2003

S14

License terms

• Please cite as: Stewart, C.A. 2003. Computational Biology. Tutorial presented at SC2003, 15-21 Nov, Phoenix, AZ. http://hdl.handle.net/2022/14000
• Some figures shown here are taken from the web, under an interpretation of fair use that seemed reasonable at the time and within reasonable readings of copyright interpretations. Such diagrams are indicated here with a source url. In several cases these web sites are no longer available, so the diagrams are included here for historical value. Except where otherwise noted, by inclusion of a source url or some other note, the contents of this presentation are © by the Trustees of Indiana University. This content is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share – to copy, distribute and transmit the work and to remix – to adapt the work under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.

2

3

Table of Contents
• Class Plan and Objectives 3
• A rapid introduction to key elements of biology 11
• Bioinformatics data sources 32
• Similarity matching 48
• Phylogenetics 95
• RNA and Protein Structure 108
• Systems Biology 126
• Grand challenge problems 140
• Acknowledgements & references 163

Note: Slides with the Indiana University wordmark in the bottom left corner were generated at Indiana University, with images sometimes from other sources. In such cases the url for the source of the image is indicated on the slide. Slides with a plain white background have been graciously provided by someone outside IU, and sources are attributed on such slides.

4

Class Plan & Objectives

• Class Plan & Strategy
  – Materials focus on open source software (generally not the presenter's own work)
  – One critical application will be covered in great depth, and several others will be reviewed
• Objectives. At the end of the class, participants should:
  – understand enough biology to understand key computational biology problems
  – be conversant with current key applications, and current problems facing bioinformatics and computational biology
  – be familiar with some strategies for collaborating with biologists and biomedical scientists

5

Motivation

• The “-omics” trend
• Finding press pieces about huge computing problems is easy
• How many bio codes really scale to hundreds of processors?
• What are the coming high performance needs of biologists?
• Importance of computational biology and bioinformatics to the HPC community
• The challenges and promise are real
• Successes and failures so far
  – Successes: Protein structure, Genome assembly, Surgical assistance, Phylogenetics
  – Mismatched priorities: Ab initio protein folding
  – Not yet successful: Drug discovery

6

What has changed recently?

• Bioinformatics not new
  – Protein structure
  – Phylogenetics
• What is new is high-throughput sequencing:
  – Lots more data
  – The possibility of going from a knowledge of the DNA sequence to an understanding of diseases and health

http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

7

Genome Projects Timeline

• 1978 First virus (SV40) sequenced (5224 base pairs)
• 1986 DOE announces Human Genome Initiative
• 1994 First complete map of all human chromosomes
• 1995 First living organism sequenced (H. influenzae) - 2 Mb
• 1996 Yeast (S. cerevisiae) - 12 Mb
• 1997 Intestinal bacterium (E. coli) - 5 Mb
• 1998 Nematode worm (C. elegans) - 100 Mb
• 1998 Celera announcement; Public effort regroups
• 1999 Human Chromosome 22 – 34 Mb
• 2000 Joint announcement by NHGRI – Celera
• 2003 “As good as it gets” human genome

This slide based on slide by Manfred D. Zorn

8

Definitions

• Computational Biology: any use of advanced information technology in the study of biological problems.

• “Bioinformatics applies the principles of information sciences and technologies to make the vast, diverse and complex life sciences data more understandable and useful” (NIH BISTIC Committee grants1.nih.gov/grants/bistic/CompuBioDef.pdf)

• Genomics – study of genomes and gene function• Proteomics – study of proteins and protein function• ___omics –

9

Challenges

• Different types of biological data at different scales
• Data of varying quality
• Much of the underlying biology is not well understood
• Prior to the availability of high-throughput sequencing, scientists could only study small pieces of the genetic information of any organism.
• Now the entire genomes of several organisms have been sequenced, but knowing the genome is different than knowing how it works!

10

Comparison of Complexity
• Physics & Chemistry
  – 2 elementary particles
  – 4 forces
  – 112 elements
  – When random events occur it is often possible to study average behavior
  – Typically ahistoric (astrophysics an exception)
• Biology
  – 3B base pairs in humans
  – Min. 30,000 genes in humans
  – ~1.5M species
  – Individual random events important; no law of large numbers
  – Intensely historic, heavily contingent

11

Complexity, Con't
• Chip design
  – All components known
  – Device physics for individual components known
  – Itanium has 3 x 10^8 connections and 2 x 10^8 devices
  – Unified basic currency (electrons)
  – Computer program required to understand
• Cells
  – Components not known
  – Function of individual components not known
  – # components ~10^13
  – No unified basic currency
  – Ecell, Karyote, etc. attempting to model cells

12

A rapid introduction to key elements of biology

13

Why is it important to know some biology?

• Would you study numerical methods without knowing some mathematics?

• Much current biological knowledge is very specific to particular organisms, genes, or diseases

• If you just wade into the available data online you can do some very silly things.

Anopheles gambiae

From www.sciencemag.org/feature/data/mosquito/mtm/index.html
Source Library: Centers for Disease Control
Photo Credit: Jim Gathany

14

Central dogma of biology

• The central dogma of biology (first put forward by Crick in 1958) is that genes act to create phenotypes through a flow of information from DNA to RNA to proteins, to interactions among proteins (regulatory circuits and metabolic pathways), and ultimately to phenotypes. Collections of individual phenotypes constitute a population.

http://www.ncbi.nlm.nih.gov/About/primer/genetics_cell.html

15

Cell Structure

Eukaryotes
• Chromosomes linear
• Introns, exons, postprocessing
• Nucleus & nuclear wall
• Mitochondria and (in plants) Chloroplasts

Prokaryotes
• Chromosome circular
• Location is everything
• No nucleus
• No plastids

http://www.ncbi.nlm.nih.gov/About/primer/genetics_cell.html

16

Four (or Five) Bases

• DNA consists of four nucleotides: Cytosine, Thymine, Adenine, and Guanine.

• In the double helix, A&T are always bound, and C&G are always bound to each other

• RNA consists of four nucleotides as well: Cytosine, Uracil, Adenine, and Guanine

• RNA may loop back on itself but it does not form a double helix

http://www.ornl.gov/TechResources/Human_Genome/graphics/slides/images/structur.gif

17

http://www.ornl.gov/TechResources/Human_Genome/graphics/slides/images/98-647.jpg

18

Genetic Code
Ala Alanine
Arg Arginine
Asn Asparagine
Asp Aspartic acid
Cys Cysteine
Glu Glutamic acid
Gln Glutamine
Gly Glycine
His Histidine
Ile Isoleucine
Leu Leucine
Lys Lysine
Met Methionine
Phe Phenylalanine
Pro Proline
Ser Serine
Thr Threonine
Trp Tryptophan
Tyr Tyrosine
Val Valine

http://www.ncbi.nlm.nih.gov/Class/MLACourse/Original8Hour/Genetics/geneticcode.html

19

Transcribing DNA to RNA and Translating RNA to Proteins

DNA AAAAAGGAGCAAATT

RNA UUUUUCCUCGUUUAA

One possible amino acid string (reading the RNA in frame from its first base): Phe Phe Leu Val Stop
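The DNA-to-RNA-to-protein example on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the DNA string is the template strand and is complemented base by base, as on the slide; the function names `transcribe` and `translate` are hypothetical, and the codon table is the standard genetic code written with one-letter amino acid codes (`*` marks a stop codon).

```python
# Standard genetic code; codons are enumerated with bases in the order U, C, A, G.
BASES = "UCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def transcribe(template_dna):
    """Base-by-base complement of a DNA template strand, written as RNA."""
    pairing = {"A": "U", "T": "A", "G": "C", "C": "G"}
    return "".join(pairing[base] for base in template_dna)


def translate(rna, frame=0):
    """Translate the complete codons found at the given frame offset (0, 1 or 2)."""
    codons = (rna[i:i + 3] for i in range(frame, len(rna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


rna = transcribe("AAAAAGGAGCAAATT")
print(rna)
for frame in range(3):
    # Each frame offset yields a different amino acid string.
    print("frame", frame + 1, translate(rna, frame))
```

Shifting the starting base by one or two positions changes every codon, which is why a single sequence has several possible reading frames.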

20

Human Chromosomes

http://www.ncbi.nlm.nih.gov/Class/MLACourse/Original8Hour/Genetics/cytogenetic.html

http://www.ornl.gov/TechResources/Human_Genome/graphics/slides/elsikaryotype.html

21

Sickle Cell
Normal RBC
• GAG codes for Glutamic acid
• Disc-shaped, soft
• Easily flows through small blood vessels
• Lives for 120 days
Sickle RBC
• GTG codes for Valine
• Sickle-shaped, hard
• Often gets stuck in small blood vessels
• Lives for 20 days or less
Malaria vs. Anemia!

http://www.nlm.nih.gov/medlineplus/ency/imagepages/1223.htm

22

What is a Gene?

• An inheritable trait associated with a region of DNA that codes for a polypeptide chain or specifies an RNA molecule which in turn has an influence on some characteristic phenotype of the organism.
  – Early views: genes lined up on the chromosome like beads on a string; one gene => one protein
  – Examples of genes: color blindness, sickle-cell anemia
  – Mendelian genes, Sex-linked genes, Quantitative traits
• Annotation: Extraction, definition, and interpretation of features on the genome sequence
• Annotations vs. genes:
  – Many annotations describe features that constitute a gene.
  – Others may not always directly correspond in this way
  – An annotation is what we think… nature may disagree!

• Inheritance problem with annotations

23

Gene Components
• Prokaryotes
  – Location is everything
  – Essentially all of the DNA is transcribed (few mitochondrial diseases)
• Eukaryotes
  – Non-contiguous pieces of DNA may comprise one gene
  – Start sequence (complicated and long)
  – Stop codons – end translation
  – Exons – portions of sequence that are transcribed and used
  – Introns – portions of sequence that are not used
• Genes and Chromosomes
  – In eukaryotes, an organism has two of each chromosome (in pairs).
  – Among sexually reproducing organisms, one chromosome comes from each parent
  – In “simple Mendelian genes” there are two alleles for each gene – one on each chromosome (e.g. wrinkly)

24

Alternative splicing

http://www.blc.arizona.edu/marty/411/Modules/altsplice.html

25

A (very) little about evolutionary genetics

Parents: Ww x Ww

Offspring: WW, Ww, Ww, ww

Based on this, can you explain why the gene for Sickle Cell Anemia persists in populations of people in Africa?

Hardy-Weinberg Law
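The cross shown on this slide and the Hardy-Weinberg expectation can be written out as a small sketch. The function names `punnett` and `hardy_weinberg` are hypothetical; the only assumption is the textbook form of the law, in which an allele frequency p (with q = 1 - p) gives genotype frequencies p^2, 2pq and q^2.

```python
def punnett(parent1, parent2):
    """All equally likely offspring genotypes from two parental genotypes."""
    return sorted("".join(sorted(a + b)) for a in parent1 for b in parent2)


def hardy_weinberg(p):
    """Expected genotype frequencies (WW, Ww, ww) when allele W has frequency p."""
    q = 1.0 - p
    return p * p, 2 * p * q, q * q


print(punnett("Ww", "Ww"))   # one WW, two Ww, one ww
print(hardy_weinberg(0.5))   # the same 1:2:1 ratio expressed as frequencies
print(hardy_weinberg(0.9))   # a rare allele is carried mostly by heterozygotes
```

The last line is the useful one for the sickle-cell question: when an allele is rare, almost all copies sit in heterozygotes, so selection acting on heterozygotes matters most.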

26

Population genetics & evolution

• Mutations create the raw material for evolution

• Natural selection and chance affect the frequency with which particular genes or DNA sequences are present in populations

• Given enough time and enough change, evolution, speciation, and so forth happen

• Genes can be ‘fixed’ or ‘maintained in an equilibrium’ in a population by chance or through natural selection

http://faculty.wm.edu/bsgran/

27

How do sequences differ?

• Differences in individual bases
  CGTACCGTTAATAT
  CGTACCGATAATAT
• Bases may be added to a sequence
  CGTACCCCGTAATAT
  CGTACC..GTAATAT
• Bases may be deleted from a sequence
  CGTACCGTTAATAT
  CGTACCG...ATAT

28

Random genetic change

• “Things happen”
• Molecular clock
  – Theory: ~2% change per million years (2 x 10^-9 substitutions per base location per year)
  – Practice: a rule of thumb is different than something like Newton’s 2nd law of motion
• Random change may often be responsible for speciation – e.g. two populations of birds, separated by a geographic barrier, may at random eventually develop into two different species

29

Key points (so far)

• Biological processes are complicated; the historicity and complexity of biological processes and our lack of understanding of many matters make biology an interesting topic!
• The fundamental dogma of molecular biology is that genes act to create phenotypes through a flow of information from DNA to RNA to proteins, to interactions among proteins (regulatory circuits and metabolic pathways), and ultimately to phenotypes. Collections of individual phenotypes constitute a population.
• DNA consists of four bases (ATCG). A is always paired with T; C always paired with G.
• DNA is transcribed into RNA. RNA consists of four bases as well (AUCG).
• The linear structure of DNA is transcribed into RNA and then translated into proteins. Proteins have their 3D configuration as the basis for their function.

30

DNA sequencing

Send in the clones!
• DNA chopped into blocks
• Blocks inserted into bacterial cells using viruses
• The bacterial clones make lots of copies of DNA so that you have something to work with
• The sequence of each chunk of genetic material is determined using gel electrophoresis

31

Dye-terminator Sequencing

www.ornl.gov/TechResources/Human_Genome/graphics/slides/images/standardRGB200.jpg

• Cut DNA at various places (at T, G, C, A)

• Add a radioactive molecule at the end of the DNA chain

• Find out how long the chain is by gel electrophoresis

• Read off the sequence

Sanger

www.ornl.gov/TechResources/Human_Genome/publicat/primer/

32

Sequence assembly

• Phred – base calling
• Phrap – shotgun sequence assembly
• Consed – finishing
• http://www.phrap.org/
• High quality software

33

Bioinformatics data sources

34

Bioinformatics Data Sources

• There are many
• Characteristics vary
• There are many ways to organize views of the biological data
• A pragmatic approach:
  – Biomedical literature sources
  – Structured vocabularies
  – DNA, RNA, Protein etc. data sources

35

Biomedical literature
• Abstracts of biomedical lit. largely available online
• Text processing itself is an interesting problem
• U.S. National Library of Medicine – NLM Medline http://www.nlm.nih.gov/
• ~12 million references on life sciences/biomedicine.
• Covers 1966 to present.
• Citations from over 4,600 journals; most published in English

36

PubMed

• Standard search tool for Medline

• http://www.ncbi.nlm.nih.gov/entrez/

• Useful limit terms:
  – Gender
  – Age Groups
  – Human or Animal
  – Publication Date

• You can save queries

37

Structured Languages

• NLP or write with agreed-upon terms?
• Three important structured languages:
  – MeSH
  – GO (Gene Ontology)
  – LOINC

38

MeSH

• Medical Subject Headings
• http://www.nlm.nih.gov/mesh/MBrowser.html
• ~17,000 Thesaurus Terms
• Typically 10-15 used per article in MedLine; 3-4 as major points (indicated with * in PubMed)
• When done right… the terms used are the most specific possible
• There are both advantages and disadvantages!

39

GO (Gene Ontology)

• http://www.geneontology.org/
• “The goal of the Gene Ontology™ Consortium is to produce a dynamic controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing.”
• Based on xml file format
• Several browsers (AmiGO, QuickGO, MGO)
• Directed Acyclic Graph (child may have multiple parents)
  – ISA (is a) %
  – Part of <
• Three ontologies
  – Molecular function
  – Biological processes
  – Cellular components

40

Genomic, Proteomic, etc. data sources

• A tremendous amount of data is available through public data sources via the Web, ftp, or by other means.

• To analyze biological data, we first have to get it…
• Several ways to organize presentation of material – by site, by type, etc. We will organize by data type.
• Types of Databases:
  – Chromosomal (http://www.ncbi.nlm.nih.gov/mapview)
  – DNA/Genes
  – Protein
  – Biochemistry and metabolic pathways
  – Microarray
  – Web collections

41

Types of genomic data

• Genomic DNA: DNA sequences, typically complete with coding and noncoding sequences

• GSS: Genome survey sequence. Single pass sequence read directly from robot.

• mRNA: an RNA sequence from an mRNA molecule. May or may not cover all of a particular gene

• cDNA: complementary DNA – a DNA sequence generated by conversion of an mRNA sequence

• EST: Expressed Sequence Tag – short cDNA sequences from studies of cells under particular circumstances. Typically incomplete.

• SNP – Single Nucleotide Polymorphism

42

DNA databases

• GenBank. Operated by NCBI (National Center for Biotechnology Information). http://www.ncbi.nlm.nih.gov

• European Molecular Biology Laboratory – Nucleotide Sequence Database. http://www.ebi.ac.uk/genomes/

• DNA Database of Japan (DDBJ). http://www.ddbj.nig.ac.jp
• All share data daily. Update conflicts avoided by policy.
• Differences are in “value added” and interfaces

43

http://www.ncbi.nlm.nih.gov

44

Data Structures

• Current
  – Primary DNA repository data based on ASN.1. Makes possible linkages among many types of biomedical info.
  – The software libraries now often handle XML as well
  – Software toolkits and docs available at http://www.ncbi.nlm.nih.gov/IEB/
• Genbank Flat File format
  – http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
• FASTA

>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
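Reading a FASTA record like the one on this slide takes very little code. The sketch below is a minimal parser, not the NCBI toolkit; `parse_fasta` is a hypothetical helper name, and the only format assumptions are that each record starts with a `>` header line and that the sequence may be wrapped over several lines.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are concatenated; wrapping carries no meaning.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


record = """>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
"""

for header, sequence in parse_fasta(record):
    fields = header.split("|")
    print(fields[1], len(sequence))
```

The pipe-separated header fields are a convention of the data source rather than part of the FASTA format itself, so code that splits on `|` should be treated as source-specific.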

45

Primary vs. Secondary Data sources

• Primary data sources:
  – Genetic sequences in NCBI, EMBL, DDBJ
  – Protein sequences in PDB
• Secondary data sources:
  – Inferred protein sequences (what do we know already about issues here?)
  – Curated data sources

46

Protein Structure
• NCBI (of course…)
• Swiss-Prot/TrEMBL at http://www.expasy.org/
  – Note: 125,744 chemically determined vs 861,482 inferred from automated translation of DNA sequences!
• Protein Data Bank – PDB http://www.rcsb.org/pdb/ - one of the first online bioinformatics databases!

47

Biochemistry and pathways

• Biochemistry– ENZYME (part of the ExPASy system)– BIND (part of the NCBI system)

• Pathways– PathDB http://www.ncgr.org/software/version_2_0.html– Kegg WIT http://wit.mcs.anl.gov/WIT2/

48

Web Resources - General
• NCBI http://www.ncbi.nlm.nih.gov/
• EBI Biocatalog http://www.ebi.ac.uk/biocat/
• IUBio Archive http://iubio.bio.indiana.edu

http://www.ncbi.nlm.nih.gov/

49

Similarity matching

50

Why pattern matching (and what are the problems)

Bonobo http://www.sandiegozoo.org/special/zoo-featured/pygmy_chimps.html

and… US!

51

Problems!

• For proteins, 95% similarity is ~ identical, 80% similarity is a lot. Even less similarity than that needed for DNA
• Database techniques inadequate – they are too precise!
• Datasets very large to search
• Homology
  – Common ancestry
  – Sequence (and usually structure) conservation
  – Homology is inferred rather than measured
• Identity
  – Objective and well defined
  – Can be quantified easily, but not very useful!
• Similarity
  – Most common method used, but not as easily defined

52

Alignment

• An alignment is an arrangement of two sequences opposite one another

• It shows where they are different and where they are similar
• We want to find the optimal alignment - the most similarity and the fewest differences
• Alignments have two aspects:
  – Quantity: To what degree are the sequences similar (percentage, other scoring method)
  – Quality: Regions of similarity in a given sequence

53

Alignment

• Methods:
  – Dynamic programming
  – Hidden Markov Models
  – Pattern matching
• Key problem: keeping the calculation time manageable
• Some alignment packages:
  – BLAST (http://www.ncbi.nlm.nih.gov/BLAST/)
  – FASTA (http://gcg.nhri.org.tw/fasta.html)

54

Scoring Alignments

GCTAAATTC
++ x x
GC AAGTT

• Matches are good: they get a positive value
• Mismatches are bad: they get a negative value
• Gaps are bad: they get a negative value
  – Gap opening penalty
  – Gap extension penalty
  – Score = Matches – Mismatches – ∑ {gap opening penalty + (gap length) * gap extension penalty}, summed over all gaps

CGTACCGTTAATAT
CGTACCG...ATAT

CGTACCGTTAATAT
CGT.C.GTT.ATAT
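The scoring rule on this slide can be sketched directly. The function name `score_alignment` and all the numeric values (match +1, mismatch 1, gap opening 2, gap extension 1) are assumptions chosen for illustration, not values from any particular tool; gaps are written with `.` as on the slide, and consecutive gap columns are treated as one gap regardless of which sequence they fall in.

```python
def score_alignment(a, b, match=1, mismatch=1, gap_open=2, gap_extend=1, gap="."):
    """Score two already-aligned strings of equal length.

    Score = matches - mismatches - sum over gaps of
            (gap_open + gap_length * gap_extend).
    """
    if len(a) != len(b):
        raise ValueError("aligned strings must have equal length")
    score, in_gap = 0, False
    for x, y in zip(a, b):
        if x == gap or y == gap:
            # The opening penalty is charged once, at the first column of a gap.
            score -= gap_extend + (0 if in_gap else gap_open)
            in_gap = True
        else:
            score += match if x == y else -mismatch
            in_gap = False
    return score


print(score_alignment("CGTACCGTTAATAT", "CGTACCG...ATAT"))  # one gap of length 3
print(score_alignment("CGTACCGTTAATAT", "CGT.C.GTT.ATAT"))  # three gaps of length 1
```

Both alignments contain 11 matches and 3 gap columns, but the one with a single long gap scores higher, which is the point of separating the opening penalty from the extension penalty.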

55

Now what?

• Taking a sequence and simply comparing it against all existing sequences in a database in all possible ways approaches O(N!) if you do it badly enough. Plus it would be silly.

• So: many algorithms possible
• Algorithms are in some ways the same, and in some ways different, between DNA and proteins.
• We’ll start with DNA, and not do things in historical order

56

Dotter
• Simple way to get a feel for how sequences compare to each other.
• Used both with DNA and Protein sequences
• http://www.cgr.ki.se/cgr/groups/sonnhammer/Dotter.html/
• "A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis" Erik L.L. Sonnhammer and Richard Durbin Gene 167(2):GC1-10 (1995)
• And now (hopefully) a live demo
• Modular nature of proteins

57

Local Alignments with BLAST

• Basic Local Alignment Search Tool
• We’ll spend a LOT of time with BLAST
• First a quick demo (hopefully)
• http://www.ncbi.nlm.nih.gov/BLAST
• So, what did we do?
  – BLAST – Basic Local Alignment Search Tool
  – In particular, BLASTn (for nucleotides)
  – Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. Journal of Molecular Biology 215:403-410

58

(Original) BLAST Algorithm

• Original algorithm does not permit gaps
• The original BLAST algorithm is a local (heuristic) alignment tool
• Given a search sequence, e.g. ACGTAGGCATGAA
• BLAST first makes a list of all “words” of a given length that would possibly have a score of at least T against the search string.
• In the case of this example there would be (at least) the following:
  – ACGTAGGCATG
  – CGTAGGCATGA
  – GTAGGCATGAA
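The word list for the example query can be generated as below. This is a simplified sketch: it only produces the exact substrings of the query (word length 11, as in the example) and looks them up in a subject sequence, ignoring the "score at least T" neighborhood that the slide mentions. The names `query_words` and `find_seeds`, and the subject string, are hypothetical.

```python
def query_words(query, w=11):
    """All overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def find_seeds(query, subject, w=11):
    """Return (query_pos, subject_pos) for every exact word match."""
    positions = {}
    for i, word in enumerate(query_words(query, w)):
        positions.setdefault(word, []).append(i)
    return [(qi, sj)
            for sj in range(len(subject) - w + 1)
            for qi in positions.get(subject[sj:sj + w], [])]


query = "ACGTAGGCATGAA"
print(query_words(query))
# A made-up database sequence that happens to contain the whole query.
print(find_seeds(query, "TTACGTAGGCATGAATT"))
```

Indexing the query words in a dictionary means each database position costs one lookup, which is what makes the seeding step fast compared with aligning everything against everything.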

59

(Original) BLAST Algorithm, 2

• BLAST takes the list of all words with a score of at least T against the string one is trying to match… and then searches a database for any matches to these words. So if one were using the example and the NR database, BLAST would search NR for all occurrences of the words:
  – ACGTAGGCATG
  – CGTAGGCATGA
  – GTAGGCATGAA
• Suppose BLAST finds in the NR database an exact match to
  – ACGTAGGCATG
• BLAST then attempts to extend the match in both directions
  – ACGTAGGCATGA
  – ACGTAGGCATGA
• So now we have an exact match of 12 letters

60

(Original) BLAST algorithm,3

• So BLAST keeps going, and in this case would stop at an exact match of 13 letters (if one existed), since 13 letters was the entire initial search string:
  – ACGTAGGCATGAA
  – ACGTAGGCATGAA
• BLAST has a stopping algorithm for dropping particular search directions, or stopping altogether

61

Scoring of DNA

     A   C   G   T   R   Y   M   W   S   K   D   H   V   B   N
A    4
C   -3   4
G   -3  -3   4
T   -3  -3  -3   4
R    1  -1   1  -1   1
Y   -1   1  -1   1  -3   1
M    1   1  -2  -2   0   0   1
W    1  -2  -2   1   0   0   0   1
S   -2   1   1  -2   0   0   0   0   1
K   -2  -2   1   1   0   0   0   0   0   1
D    1  -2   1   1   1   0   0   1   0   1   1
H    1   1  -2   1   0   1   1   1   0   0   0   1
V    1   1   1  -2   1   0   1   0   1   0   0   0   1
B   -2   1   1   1   0   1   0   0   1   1   0   0   0   1
N    1   1   1   1   0   0   0   0   0   0   0   0   0   0   1

62

BLAST algorithm in more detail
• The BLAST algorithm searches for MSPs – Maximal Segment Pairs – such that the score of the pair cannot be improved either by lengthening it or shortening it. “Pairs” here refers to a string – or a substring – of the initial string used as the search string – and one or more strings or substrings found in a database.
• The search starts with the creation of all possible subwords of a given length (default typically 11 for DNA sequences, 3 amino acids for protein sequences) that would score at least T when matched against the original search string. (T is short for Threshold)
• BLAST then goes through the database being searched against looking for any occurrence of each of these words that has a score of at least T. This is a “hit” – or a “High-scoring Segment Pair (HSP)”
• The search then continues by trying to extend these HSPs.
• Suppose “S” is the best score found for a word of length k. BLAST stops trying to extend words when the score drops a certain amount below the best value S in the previous round.
• BLAST continues on and on until it is no longer possible to improve the score of HSPs by making them longer.
• Then it generates a list of the best HSPs. Default is a cutoff E-value of 10
• BLAST (original) has an infinite gap penalty
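The extension step described above can be sketched as an ungapped extension that gives up once the running score falls too far below the best score seen. This is an illustration of the idea, not BLAST's actual code: `extend_right` and the `x_drop` value of 10 are assumptions, the +4/-3 scores are the A/C/G/T match and mismatch values from the DNA scoring matrix shown earlier, and only the rightward direction is shown (leftward extension is symmetric).

```python
def extend_right(query, subject, qi, sj, match=4, mismatch=-3, x_drop=10):
    """Extend an ungapped alignment rightwards from query[qi], subject[sj].

    Returns (best_score, length_of_best_extension).
    """
    score = best = 0
    best_len = 0
    k = 0
    while qi + k < len(query) and sj + k < len(subject):
        score += match if query[qi + k] == subject[sj + k] else mismatch
        k += 1
        if score > best:
            best, best_len = score, k
        elif best - score > x_drop:
            # The score has dropped too far below the best value: stop extending.
            break
    return best, best_len


# 13 matching letters followed by a run of mismatches.
print(extend_right("ACGTAGGCATGAACCCC", "ACGTAGGCATGAATTTT", 0, 0))
```

Because there is no gap handling at all, this matches the slide's remark that the original algorithm behaves as if the gap penalty were infinite.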

63

BLAST Statistics

• BLAST reports E values rather than P values, but it turns out that when E < 0.01, E~P

• What do we do about the fact that we have done many tests?
• If the sequence is length n, and the total length of the database being searched is N, then a reasonable approach is to multiply E by N/n
• Edge effects – statistics tend to be conservative for short sequences
• Problems:
  – Highly repetitive segments
  – Low complexity regions
  – Bias in composition
• Solution: low complexity regions can be excluded
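The remark that E and P nearly coincide for small E follows from the usual Poisson relationship between them, P = 1 - exp(-E), where P is the probability of seeing at least one chance hit with a score that good. The helper name `p_from_e` below is hypothetical; the formula is the standard one.

```python
import math


def p_from_e(e_value):
    """Probability of at least one chance hit, given the expected number E."""
    # -expm1(-E) computes 1 - exp(-E) accurately even for very small E.
    return -math.expm1(-e_value)


for e in (1e-5, 0.01, 0.1, 1.0, 10.0):
    print(f"E = {e:<8g} P = {p_from_e(e):.6g}")
```

At E = 0.01 the two values differ only in the third significant digit, while at the default cutoff of E = 10 the P value is essentially 1, which is why E is the more informative number to report.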

64

BLAST Options

• Set subsequence (of the submitted sequence)
• Choose Database (NB: nr ≠ non redundant!)
• Limit by entrez query or select an organism
• Choose Filter
• Expect Value
• Word size (default = 11 for nucleotides)

65

Protein Sequence Alignment

• What most people do most of the time• DNA sequences are useful for relationships that are close, but

DNA sequences are not nearly as well conserved as Amino Acid sequences

• Now we need to talk about the characteristics of Amino Acids and ways to compare what is similar and what is not!

• Amino acids can have similar chemical properties, and similar functions as part of a protein, without being identical!

66

Point Accepted Mutations (PAM)
• For scoring amino acid sequence alignments
• Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C. 1978. "A model of evolutionary change in proteins." In Atlas of Protein Sequence and Structure 5(3) M.O. Dayhoff (ed.), 345 - 352, National Biomedical Research Foundation, Washington.
• PAM N corresponds to N accepted point mutations per 100 amino acids. N can be greater than 100.
• PAM 250 is most commonly used; PAM 100 is also used. PAM 250 => chains with ~20% identity
• PAM matrix calculator at www.cmbi.kun.nl/bioinf/tools/pam.shtml

http://www.psc.edu/biomed/training/tutorials/sequence/db/index.html

67

BLOSUM Matrices

• Henikoff and Henikoff (1992) Proc Natl Acad Sci 89(22):10915-9

• Based on analysis of the BLOCKS database (http://www.blocks.fhcrc.org/)

• BLOSUM = BLOcks SUbstitution Matrix
• Based on analysis of conserved and variable regions of proteins
• Naming convention is different than for PAM matrices.
• BLOSUMxy is based on likelihood ratios derived from blocks of aligned amino acid chains that are no more than xy% identical
• BLOSUM62 is the ‘typical default’
• PAM250 is roughly equivalent to BLOSUM45

68

PSI BLAST

• Position Specific Iterative BLAST
• http://nar.oupjournals.org/cgi/content/full/25/17/3389
• Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25(17):3389-402
• Requires two non-overlapping similarities with the search term to occur within a certain distance (A) of each other on the same diagonal
• Permits gaps in the alignments
• Can be iterated to allow for user-specified scoring matrices
• By default, uses the BLOSUM-62 Matrix

69

PSI BLAST
• In the original BLAST, the step of extending the length of the ‘hits’ took ~90% of execution time.
• The initial threshold value T must be lower than with the original BLAST, but far fewer hits are pursued, meaning that the extension time is lower

Two hits, T=11 A=40 vs One hit, T=13

http://nar.oupjournals.org/content/vol25/issue17/images/gka56202.gif

70

http://nar.oupjournals.org/content/vol25/issue17/images/gka56201.gif

71

Gaps in PSI-Blast

• PSI BLAST seeks alignments with single gaps
• Gaps are sought only when a two-hit score exceeds the value Sg
• Gaps: handled by using a different gap cost function:

-(a+bk+cj)

a is the cost for opening a gap
b is the per unit cost for the length of the gap
k is the length of the gap
c is the per unit cost of unaligned sequences in the gap
j is the number of sequences left unaligned

72

Discontiguous MegaBLAST

• Useful especially for identifying diverged DNA sequences
• Uses templates; within the template only those positions marked with “1” are compared.
• E.g. 1101101101101101
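The template idea can be sketched as a mask applied to each window of sequence: only the positions marked `1` contribute to the word that is looked up. The function name `masked_word` and the two example sequences are hypothetical; the template is the one from the slide, which ignores every third position.

```python
TEMPLATE = "1101101101101101"


def masked_word(seq, start, template=TEMPLATE):
    """Letters of seq under the '1' positions of the template, starting at start."""
    window = seq[start:start + len(template)]
    if len(window) < len(template):
        raise ValueError("window runs past the end of the sequence")
    return "".join(base for base, flag in zip(window, template) if flag == "1")


# Two made-up sequences that differ only at ignored ('0') positions 2, 5 and 8.
a = "ACGTACGTACGTACGT"
b = "ACTTAAGTCCGTACGT"
print(masked_word(a, 0), masked_word(b, 0))
```

Since the third base of a codon is the one most free to vary without changing the amino acid, a template of this shape lets diverged coding sequences still produce identical seed words.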

How many BLASTs?

http://www.ncbi.nlm.nih.gov/BLAST/producttable.html

73

mpiBLAST http://mpiblast.lanl.gov/

74

mpiBLAST Algorithm

• Darling, A.E., L. Carey, W.-C. Feng. 2003. The design, implementation, and evaluation of mpiBLAST. Presented at ClusterWorld2003. http://www.cs.wisc.edu/%7Edarling/mpiblast-cwce2003.pdf

• Algorithm
  – Database is segmented. Portions of database are placed on data storage devices on multiple nodes in a HPC system. mpiformatdb is a wrapper for the BLAST formatdb program. Number of subdivisions specified by user
  – Foreman/worker algorithm. Portions of the database are assigned to workers, using a greedy algorithm
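The slide does not spell out which greedy rule mpiBLAST uses, so the sketch below shows one generic possibility rather than the mpiBLAST implementation: fragments are handed out largest first, each to the worker with the least work assigned so far. The function name `greedy_assign`, the fragment sizes, and the rule itself are assumptions made for illustration, and no MPI is involved.

```python
import heapq


def greedy_assign(fragment_sizes, n_workers):
    """Assign database fragments to workers, largest fragment first,
    always choosing the currently least-loaded worker."""
    heap = [(0, worker) for worker in range(n_workers)]
    heapq.heapify(heap)
    assignment = {worker: [] for worker in range(n_workers)}
    by_size = sorted(enumerate(fragment_sizes), key=lambda item: -item[1])
    for fragment, size in by_size:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(fragment)
        heapq.heappush(heap, (load + size, worker))
    return assignment


sizes = [5, 3, 3, 2, 2, 1]   # hypothetical fragment sizes
plan = greedy_assign(sizes, 2)
for worker, fragments in plan.items():
    print(worker, fragments, sum(sizes[f] for f in fragments))
```

A real foreman would also take into account which fragments a worker already holds on local disk, since avoiding a copy can matter more than perfectly even loads.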

75

mpiBLAST performance

• Scaling can be superlinear when pieces are small enough that they fit into memory

• Scalability limitations due to communication, implicit barrier before assembly of results

• If pieces of data distributed out to workers are larger than available RAM, then scaling is still good but not superlinear

• Blast is the most heavily used bioinformatics tool in existence. Parallelization of BLAST has huge payoff for practicing biologists

76

Motivation: BLAST with Low Memory

• Standard BLAST running on a system with 128 MB of memory.

Slide courtesy of Wu-chun Feng, feng@lanl.gov, Los Alamos National Laboratory

77

mpiBLAST: Low-Memory Performance

• Environment
  – 1, 2, or 4 nodes.
  – Each node w/ dual 550-MHz CPUs and 128-MB memory.
  – Same query and database used.
• Conclusions
  – blastn is I/O bound. Superlinear speed-up possible.
  – tblastx is CPU bound.

Slide courtesy of Wu-chun Feng, feng@lanl.gov, Los Alamos National Laboratory

78

mpiBLAST on Green Destiny

BLAST Run Time for 300-kB Query against nt

Nodes   Runtime (s)   Speedup over 1 node
    1      80774.93                  1.00
    4       8751.97                  9.23
    8       4547.83                 17.76
   16       2436.60                 33.15
   32       1349.92                 59.84
   64        850.75                 94.95
  128        473.79                170.49

The Bottom Line: mpiBLAST reduces search time from 1346 minutes (or 22.4 hours) to under 8 minutes!
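The speedup column can be recomputed from the runtimes, and dividing by the node count gives the parallel efficiency. The runtimes below are the ones in the table; the helper names `speedup` and `efficiency` are hypothetical.

```python
# Runtime in seconds for each node count, taken from the table above.
RUNTIMES = {1: 80774.93, 4: 8751.97, 8: 4547.83, 16: 2436.60,
            32: 1349.92, 64: 850.75, 128: 473.79}


def speedup(nodes):
    """Single-node runtime divided by the runtime on the given number of nodes."""
    return RUNTIMES[1] / RUNTIMES[nodes]


def efficiency(nodes):
    """Speedup per node; values above 1.0 indicate superlinear scaling."""
    return speedup(nodes) / nodes


for nodes in RUNTIMES:
    print(f"{nodes:4d} nodes  speedup {speedup(nodes):7.2f}  "
          f"efficiency {efficiency(nodes):5.2f}  "
          f"{RUNTIMES[nodes] / 60:8.1f} min")
```

Every multi-node row has an efficiency above 1.0, which is the superlinear effect described on the performance slide: once each fragment fits in memory, the workers stop paying for disk I/O.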

Slide courtesy of Wu-chun Feng, feng@lanl.gov, Los Alamos National Laboratory

79

Global Alignments: Needleman-Wunsch Algorithm

• Start at the beginning, end at the end
• Needleman, S.B., and C.D. Wunsch. 1970. A general method applicable to the search for similarities in the amino acid sequences of two proteins. J. Mol. Biol. 48: 443-453.
• “The amino acid sequences of a number of proteins have been compared to determine whether the relationships existing between them could have occurred by chance. Generally, these sequences are from proteins having closely related functions and are so similar that simple visual comparisons can reveal sequence coincidence….”

80

Needleman-Wunsch

• Amino acid sequences are lined up as column and row headers for a matrix
• Ai is the ith amino acid in protein A
• Bj is the jth amino acid in protein B
• Start with a matrix where the entry for Ai and Bj is 1 if there is a match, 0 otherwise
• The optimal alignment can be represented as a path through the matrix
• If MATmn is part of a pathway including MATij, the only permissible relationships are m>i and n>j, or m<i and n<j
• The optimal pathway is found by filling out the matrix from the bottom right corner towards the upper left, where in each cell you insert the maximum score arising from an alignment that includes this cell in the matrix
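A compact global alignment in the spirit of Needleman-Wunsch is sketched below. It is not the 1970 formulation described on the slide: it fills the matrix from the top left rather than the bottom right, and it uses a simple linear gap penalty. The function name `needleman_wunsch` and the scores (match +1, mismatch -1, gap -1) are assumptions chosen for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal path.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))
print(needleman_wunsch("CGTACCGTTAATAT", "CGTACCGATAATAT"))
```

The full matrix is kept so the path can be traced back, which is where the O(N*M) memory requirement comes from.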

81

Needleman-Wunsch and Smith-Waterman

• Shortcomings of Needleman-Wunsch?
• Can you think of biological situations in which you might want to use Needleman-Wunsch?
• Smith-Waterman: similar to Needleman-Wunsch, except
  – Requires a penalty for gaps
  – Will do partial alignments (i.e. has a stopping point)
• Computational requirements
  – Original Needleman-Wunsch and Smith-Waterman both require O(N*M) time and O(N*M) memory
  – There are improvements of Smith-Waterman that require O(N*M) time and O(N) space
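The time and space figures above can be made concrete with a score-only Smith-Waterman that keeps just two rows of the matrix. This is a sketch under assumed scores (match +2, mismatch -1, linear gap -2), and `smith_waterman_score` is a hypothetical name; recovering the alignment itself in linear space needs a more elaborate divide-and-conquer scheme that is not shown here.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score using O(len(b)) memory and O(len(a)*len(b)) time."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            # The floor at 0 is what makes the alignment local: a bad prefix is dropped.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best


print(smith_waterman_score("XXACGTXX", "YYACGTYY"))  # only the shared ACGT scores
print(smith_waterman_score("AAAA", "TTTT"))          # nothing in common
```

Only the previous row is needed to compute the current one, so memory grows with the length of one sequence rather than with the product of both.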

82

ALIGN

• Simple protein alignment tool
• Included in FASTA distributions 2.x, but not 3.x
• Still, it’s a nice learning tool
• Can be downloaded for Linux or for Windows
• Can also be run from web at http://fasta.bioch.virginia.edu/fasta/align.htm
• Can also be run from web at http://us.expasy.org/tools

83

Protein Alignment with the FASTA family

• FASTA is one of the earliest protein alignment tools, and is still actively maintained
• Pronounced FAST and then a long A
• A local alignment, heuristic tool
• Can be downloaded from http://www.people.virginia.edu/~wrp/pearson.html
• FASTA family maintained by Prof. William R. Pearson
• Can also be run from Web

84

FASTA Algorithm

• Ktup = word length (2 default; 1 sometimes used)
• FASTA searches for words of length ktup matching between sequences
• FASTA searches for ungapped regions of a particular length that have the highest number of identical ktups
• FASTA scores the 10 ungapped alignments that have the highest number of identical ktups, scoring with a scoring matrix (default is BLOSUM50)
• FASTA then tests for the ability to merge the ungapped alignments into a single alignment without dropping the overall score too much
• FASTA uses the Smith-Waterman algorithm within the local alignment regions!
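The first two steps above (finding matching ktup words and ranking ungapped regions) can be sketched as a word lookup followed by a count per diagonal. This is an illustrative simplification, not FASTA's actual code: the function names are hypothetical, whole diagonals stand in for "ungapped regions of a particular length", and the later matrix scoring and merging steps are not shown.

```python
from collections import defaultdict

def ktup_diagonals(query, subject, ktup=2):
    """Count identical ktup-length words on each diagonal (i - j)."""
    # Index every word of length ktup in the subject by its start position.
    index = defaultdict(list)
    for j in range(len(subject) - ktup + 1):
        index[subject[j:j + ktup]].append(j)
    # A word at query position i and subject position j lies on diagonal i - j;
    # many hits on one diagonal suggest an ungapped region of similarity.
    hits = defaultdict(int)
    for i in range(len(query) - ktup + 1):
        for j in index.get(query[i:i + ktup], ()):
            hits[i - j] += 1
    return dict(hits)

def best_diagonals(query, subject, ktup=2, top=10):
    """Return up to `top` (diagonal, hit count) pairs, best first."""
    hits = ktup_diagonals(query, subject, ktup)
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))[:top]
```

A larger ktup makes the lookup faster and more selective, at the cost of missing weaker similarities, which is why ktup = 1 is sometimes used for sensitivity.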

85

Multiple Alignment - Clustal-W

• Why do we need to align many different sequences at once?
– Look for highly conserved regions
– Gene searching (of mice and men)
• http://www.ebi.ac.uk/clustalw/
• Thompson et al. 1994. Nucleic Acids Res. 22: 4673-4680
• Heuristic & Progressive
– Begin with 2 sequences
– Add others one-by-one
• Uses profile alignment
– Align sequence with group of aligned sequences
– Align groups of aligned sequences
– Misalignments in conserved regions penalized heavily

86

Example output

FOS_RAT     MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNT
FOS_MOUSE   MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNT
FOS_CHICK   MMYQGFAGEYEAPSSRCSSASPAGDSLTYYPSPADSFSSMGSPVNS
FOSB_MOUSE  -MFQAFPGDYDS-GSRCSS-SPSAESQ--YLSSVDSFGSPPTAAAS
FOSB_HUMAN  -MFQAFPGDYDS-GSRCSS-SPSAESQ--YLSSVDSFGSPPTAAAS
            *:..* .:*:: .***** **:.:* * *..***.* :.. :*:

FOS_RAT     IPTVTAISTSPDLQWLVQPTLVSSVAPSQ-------TRAPHPYGLP
FOS_MOUSE   IPTVTAISTSPDLQWLVQPTLVSSVAPSQ-------TRAPHPYGLP
FOS_CHICK   VPTVTAISTSPDLQWLVQPTLISSVAPSQ-------NRG-HPYGVP
FOSB_MOUSE  VPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMP
FOSB_HUMAN  VPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMP
            :******:** **********:**:* **... ::. .**.:* :

87

Clustal-W Algorithm

• Construct matrix of distances
– Alignment scores from all pairwise combinations
– Alignments by dynamic programming method
– Alignment scores transformed to evolutionary distances
– Cluster distances into hierarchical tree (neighbor joining)
• Progressively align sequences using tree as a guide
– Begin with closest pair
– Work up tree in order of decreasing similarity
– Use pairwise alignment for pairs
– Use sequence-profile alignment to add sequences to clusters
– Use profile-profile alignment to join clusters
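The "distance matrix, then guide tree, then merge order" idea can be sketched as follows. This is a deliberately simplified stand-in and not Clustal-W's method: distances are taken as one minus the fraction of identical positions between equal-length sequences (rather than from pairwise dynamic programming alignments), and clusters are joined by simple average linkage instead of neighbor joining. All names are hypothetical.

```python
from itertools import combinations

def identity_distance(a, b):
    """1 - fraction of identical positions; assumes equal-length sequences."""
    same = sum(x == y for x, y in zip(a, b))
    return 1.0 - same / len(a)

def guide_order(seqs):
    """Return the list of cluster merges, closest pair first.

    seqs maps a sequence name to its (equal-length) sequence string.
    """
    clusters = {name: (name,) for name in seqs}
    dist = {frozenset(p): identity_distance(seqs[p[0]], seqs[p[1]])
            for p in combinations(seqs, 2)}

    def cluster_dist(c1, c2):
        # Average linkage: mean distance over all cross-cluster pairs.
        pairs = [(x, y) for x in clusters[c1] for y in clusters[c2]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(sorted(clusters), 2),
                     key=lambda p: cluster_dist(*p))
        members = clusters.pop(c1) + clusters.pop(c2)
        merges.append((c1, c2))
        clusters[c1 + "+" + c2] = members
    return merges
```

In a progressive aligner, each merge in this list would correspond to one pairwise, sequence-profile, or profile-profile alignment step.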

88

CLUSTAL-W key features

• Sequences weighted to reduce representation bias associated with large subfamilies (usual sum-of-pairs score problem)
• Substitution matrix used for scoring depends on distance between sequences.
– BLOSUM80 for near sequences
– BLOSUM50 for distant sequences
• Gap penalties at hydrophobic residues heavier than those at hydrophilic residues
• Gap penalties also contingent upon exact residue identity at gap site
• Gaps corralled by increasing penalties at sites where gaps are rare when gaps are common nearby
• When building alignment, low-scoring additions rescheduled to be added later

89

ClustalW-MPI

• Li, K.-B. 2003. ClustalW-MPI: ClustalW analysis using distributed and parallel computing. Bioinformatics 19: 1585-1586
• Initial pairwise alignment process is parallelized and scales very well
• Multiple alignment process is parallelized and scales modestly
• Scaling tests published thus far go up to 16 processors, reducing run time from hours to minutes

90

HMMER

• http://hmmer.wustl.edu/
• Profile HMMs for protein sequence analysis
• A profile is a statistical model of patterns that are likely for multiple alignments, including variability at various positions and probabilities of various residues
• Useful when similarities are too faint to be picked up by BLAST
• Several profiles based on existing alignments exist
• Available as a parallel code using PVM
• Scales reasonably well as regards number of processors. Does not scale as well as regards size of the biological problem

91

GeneIndex

• Location of initiators, promoters, etc. a key question in genomics

• First step in this is creating a dictionary of words of various lengths (many possible next steps)

• To be useful, analysis must be performed on entire genomes at once

• GeneIndex finds frequencies and positions of all words of a given length in a DNA sequence. Visualization with Tcl/Tk.

92

GeneIndex Parallelization

• Genome is broken up into n sections, where n = number of processors

• After each segment is analyzed, linked lists are joined
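The word-dictionary step and its section-wise parallelization can be sketched as below. This is an illustration under assumptions, not GeneIndex's implementation: Python dictionaries of lists stand in for the linked lists, the sections are processed in a plain loop where a real code would hand each one to a separate processor, and the overlap of k-1 characters between sections is an assumed detail added so that words spanning a boundary are not lost.

```python
from collections import defaultdict

def index_words(seq, k, offset=0):
    """Map each word of length k to the list of its start positions."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(offset + i)
    return positions

def sectioned_index(genome, k, n_sections):
    """Split the genome into n_sections, index each, then join the lists."""
    size = -(-len(genome) // n_sections)  # ceiling division
    merged = defaultdict(list)
    for s in range(n_sections):
        start = s * size
        # Extend each section by k-1 characters so boundary words are kept;
        # each word start position still belongs to exactly one section.
        chunk = genome[start:start + size + k - 1]
        for word, pos in index_words(chunk, k, offset=start).items():
            merged[word].extend(pos)
    return merged
```

Word frequencies follow directly as the length of each position list, for example `len(sectioned_index(genome, 3, 4)["ACG"])`.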

93

94

GeneIndex Scalability: Processing Time (Drosophila)

[Figure: processing time in seconds (0 to 3000) versus number of CPUs (0 to 60)]

95

GeneIndex Scalability: Speedup (Drosophila)

[Figure: speedup (0 to 70) versus number of CPUs (0 to 80)]

96

Phylogenetics

97

Building Phylogenetic Trees

• Goal: an objective means by which phylogenetic trees can be estimated in tolerable amounts of wall-clock time, producing phylogenetic trees with measures of their uncertainty

• All evolutionary changes are described as bifurcating trees
– genes or gene products
– organisms

98

Phylogenetic trees from DNA sequences

• Changes in DNA are modeled as Markov processes
• Sequences available:
• DNA (sequences are series of the base molecules; aligned sequences will also contain +s for gaps)
• Amino acid sequences (series of letters indicating the 20 amino acids). Computational challenges more severe than with DNA sequences.
• RNA
• The availability of data at present exceeds the ability of researchers to analyze it!

99

Why is tree-building a HPC problem?

• The number of bifurcating unrooted trees for n taxa is $(2n-5)! / \left[(n-3)!\, 2^{n-3}\right]$
• For 50 taxa the number of possible trees is ~$10^{74}$; most scientists are interested in much larger problems
• NP-hard problem
• The number of bifurcating rooted trees is $(2n-3)! / \left[(n-2)!\, 2^{n-2}\right]$
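The growth in tree counts is easy to check numerically. The sketch below uses the standard counting formulas for bifurcating trees, (2n-5)!/[(n-3)! 2^(n-3)] for unrooted and (2n-3)!/[(n-2)! 2^(n-2)] for rooted trees; the function names are chosen for the example.

```python
from math import factorial

def unrooted_trees(n):
    """Number of bifurcating unrooted trees for n >= 3 taxa."""
    return factorial(2 * n - 5) // (factorial(n - 3) * 2 ** (n - 3))

def rooted_trees(n):
    """Number of bifurcating rooted trees for n >= 2 taxa."""
    return factorial(2 * n - 3) // (factorial(n - 2) * 2 ** (n - 2))

if __name__ == "__main__":
    for n in (4, 5, 10, 20, 50):
        print(n, f"{unrooted_trees(n):.3e}", f"{rooted_trees(n):.3e}")
```

Even at 20 taxa the count is far beyond what exhaustive search can cover, which is why practical programs rely on heuristic tree searches.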

100

Phylogenetic software

• Phylip. (J. Felsenstein). Collection of software packages that cover most types of analysis. One of the most popular software collections. Free.

• PAUP. (D. Swofford). Parsimony, distance, and ML methods. Also one of the most popular software collections. Not free, but not expensive.

• fastDNAml. (G. Olsen). Maximum likelihood method for DNA; becoming one of the more popular ML packages. MPI version available soon; well suited to tree searching in large data sets. Free.

• GRAPPA (Bader et al.): Breakpoint analysis program - scales well

101

Stochastic change of DNA

• Markov process, independent for each site: 4 x 4 matrix for DNA, 20 x 20 for amino acids

        A          C          G          T
    A   p(A->A)    p(A->C)    p(A->G)    …
    C   p(C->A)    p(C->C)    p(C->G)    …
    G   .
    T   .

• Transitions more probable than transversions.
• Must account for heterogeneity in substitution rates among sites (DNArates – Olsen)
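A small sketch of such a 4 x 4 matrix, with transitions (A<->G, C<->T) given a higher probability than transversions, is shown below. This is a two-parameter illustration in the spirit of Kimura's model; the function name and the example probabilities are assumptions, and rate heterogeneity among sites is not modeled.

```python
BASES = "ACGT"
PURINES = {"A", "G"}  # C and T are the pyrimidines

def substitution_matrix(p_transition, p_transversion):
    """Per-step substitution probabilities as a nested dict matrix[x][y].

    Each base has one transition partner and two transversion partners,
    so the probability of no change is 1 - p_transition - 2*p_transversion.
    """
    matrix = {}
    for x in BASES:
        row = {}
        for y in BASES:
            if x == y:
                row[y] = 1.0 - p_transition - 2 * p_transversion
            elif (x in PURINES) == (y in PURINES):
                row[y] = p_transition      # purine<->purine or pyrimidine<->pyrimidine
            else:
                row[y] = p_transversion    # purine<->pyrimidine
        matrix[x] = row
    return matrix
```

Each row sums to 1, as required for the rows of a Markov transition matrix.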

102

fastDNAml

• Developed by Gary Olsen
• Derived from Felsenstein’s PHYLIP programs
• One of the more commonly used ML methods
• The first phylogenetic software implemented in a parallel program (at Argonne National Laboratory, using P4 libraries)
• Olsen, G.J., et al. 1994. fastDNAml: a tool for construction of phylogenetic trees of DNA sequences using maximum likelihood. Computer Applications in Biosciences 10: 41-48
• MPI version produced by Indiana University in collaboration with Gary Olsen available from http://www.indiana.edu/~rac/hpc/fastDNAml/

103

fastDNAml algorithm – adding taxa

• Optimize tree for 3 (randomly chosen) taxa - only one topology possible
• Randomly pick another taxon – (2i-5) trees possible
• Keep the best (maximum likelihood tree)

104

Basic fastDNAml algorithm - Branch rearrangement

• Move any subtree crossing n vertices (if n=1 there are 2i-6 possibilities)
• Keep best resulting tree
• Repeat this step until local swapping no longer improves likelihood value

105

fastDNAml algorithm cont’d: Iterate

• Get sequence data for next taxon
• Add new taxon (2i-5)
• Keep best
• Local rearrangements (2i-6)
• Keep best
• Keep going….
• When all taxa have been added, perform a full tree check
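To see why stepwise addition is tractable, one can count the candidate trees it scores. The sketch below sums the 2i-5 insertion points and the 2i-6 single-vertex rearrangements for each added taxon i. The function name is hypothetical, and treating the rearrangement step as a fixed number of rounds per taxon is an assumption; in practice it repeats until the likelihood stops improving, and the final full tree check is not counted.

```python
def stepwise_candidates(n_taxa, rearrangement_rounds=1):
    """Candidate trees scored while adding taxa 4..n_taxa one at a time."""
    total = 0
    for i in range(4, n_taxa + 1):
        total += 2 * i - 5                           # places to attach taxon i
        total += rearrangement_rounds * (2 * i - 6)  # local rearrangements
    return total
```

Under these assumptions, 50 taxa require a few thousand likelihood evaluations, compared with roughly 10^74 possible unrooted trees.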

106

Overview of parallel program flow

• Program modules
– Master (generates trees, receives back from Foreman best tree at each step)
– Foreman (dispatches trees to workers, determines best tree, tracks activity of workers)
– Worker
– Monitor (instrumentation)
• Parallel versions include fault tolerance features (useful in large clusters and grid computing)
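The foreman's role can be sketched with a thread pool standing in for the message-passing layer. This is a toy illustration, not the fastDNAml code: the function name, the `evaluate` callback, and the example scores are hypothetical, and fault tolerance and the monitor are not modeled.

```python
from concurrent.futures import ThreadPoolExecutor

def foreman(candidate_trees, evaluate, n_workers=4):
    """Dispatch trees to workers and return the best (tree, score) pair.

    `evaluate` plays the worker role: it takes one tree and returns its
    log-likelihood. Higher scores are better.
    """
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map returns results in the same order as the inputs.
        scores = list(pool.map(evaluate, candidate_trees))
    return max(zip(candidate_trees, scores), key=lambda ts: ts[1])

if __name__ == "__main__":
    # Hypothetical precomputed log-likelihoods for three topologies.
    toy_scores = {"((A,B),C,D)": -120.5, "((A,C),B,D)": -98.2, "((A,D),B,C)": -101.0}
    print(foreman(list(toy_scores), toy_scores.get))
```

The master would call something like `foreman` once per step, passing in the candidate trees it has generated and keeping the returned best tree.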

107

Performance of fastDNAml

[Figure: speedup (0 to 70) versus number of processors (0 to 70); series: perfect scaling, 50 taxa, 101 taxa, 150 taxa]

108

Why bother with parallel code?

• Why not just achieve speedup of n on n processors by running n independent jobs?

• Practical benefits of seeing results quickly

• Parallel program permits assault on much more complicated problems (e.g. protein sequences)

109

RNA & Protein Structure

110

RNA Structure – Vienna RNA

• http://www.tbi.univie.ac.at/~ivo/RNA/
• Package consists of several parts (from the web site):
– RNAfold - predict minimum energy secondary structures and pair probabilities
– RNAeval - evaluate energy of RNA secondary structures
– RNAheat - calculate the specific heat (melting curve) of an RNA sequence
– RNAinverse - inverse fold (design) sequences with predefined structure
– RNAdistance - compare secondary structures
– RNApdist - compare base pair probabilities
– RNAsubopt - complete suboptimal folding

http://www.tbi.univie.ac.at/~ivo/RNA/

111

Types of Proteins

• Enzymes: biological catalysts. Most of the chemical reactions that occur in biological systems are catalyzed by enzymes.
• Storage. Various ions, small molecules and other metabolites are stored by complexing with proteins; for example haemoglobin carries oxygen.
• Transport. Proteins are involved in the transportation of particles ranging from electrons to macromolecules.
• Messengers. Proteins are involved in the transmission of nervous impulses. Hormones play a coordinating role.
• Antibodies. Proteins which bind to specific foreign particles such as bacteria and viruses.
• Regulation. Enzymes synthesize proteins by translating sequences of DNA.
• Structural proteins. Mechanical proteins (e.g. collagen)

112

Proteins – a sparse vocabulary build up from amino acids

• Average time to fold based on random motion
• Actual folding – small fractions of a second
• Only a small subset of possible amino acid sequences actually code for a real protein
• Minimization of free energy – the key in real life and in analysis!

113

http://bmbiris.bmb.uga.edu/wampler/tutorial/prot0.html

114

http://bmbiris.bmb.uga.edu/wampler/tutorial/prot0.html

115

Molecular viewing software options

• VRML – Cosmo Player http://www.karmanaut.com/cosmo/player/
• RASMOL - http://www.openrasmol.org/
• CHIME - http://www.mdl.com/chime/index.html
• Swiss Pdb Viewer - http://www.expasy.ch/spdbv/
• MICE - http://mice.sdsc.edu/
• Many tend to be touchy about browsers and plugins

116

Different ways to view molecules

• Wireframe
• Stick
• Ball and stick
• Space filled (Van der Waals radii)
• Some examples:
– http://class.fst.ohio-state.edu/FST822/Images/helix.pdb
– http://www.rcsb.org/pdb/
– http://www.rcsb.org/pdb/cgi/explore.cgi?job=graphics;pdbId=1GFL;page=;pid=264201048789105&opt=vrml_default

117

Protein structure determination

• X-ray crystallography
• X-ray reflections form a pattern
• Model the known sequence of atoms fitting into a 3D structure so that the reflection pattern matches the observed pattern
• Spectroscopic analysis of molecule structure is precise but still slow!
• ~127,863 entries in SwissProt
• ~857,950 entries in TrEMBL

http://crystal.uah.edu/~carter/protein/xray.htm

118

Protein structure prediction methods

• Knowledge-based methods
– Based on information extracted from existing structures to estimate structure
• Physico-chemical methods
– “Ab initio” protein structure prediction
• Feature detection methods:
– Look for post-translational modification signals
  • Cleavage sites
  • Glycosylation sites
  • Phosphorylation
• Site for prediction server: http://www.cbs.dtu.dk/services/

119

Protein Structure Prediction

• Key requirement: prediction of molecule position within 1 angstrom
• Measuring quality of fit
– Root mean square of atom distances: $\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} d_i^2}$
– Q3 = (true positives + true negatives)/total residues
• Better than 70% right is really good!
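Both measures are straightforward to compute. The sketch below assumes the two structures are already superimposed and have atoms listed in the same order, and it takes Q3 in its common form as the fraction of residues whose predicted state (helix H, strand E, or coil C) matches the observed one. The function names and example data are hypothetical.

```python
from math import sqrt

def rmsd(coords_a, coords_b):
    """Root mean square deviation between matched (x, y, z) coordinates."""
    squared = [sum((p - q) ** 2 for p, q in zip(a, b))
               for a, b in zip(coords_a, coords_b)]
    return sqrt(sum(squared) / len(squared))

def q3(predicted, observed):
    """Fraction of residues whose predicted state equals the observed state."""
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)

if __name__ == "__main__":
    model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    target = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
    print(rmsd(model, target))          # distances 0 and 2, so sqrt(4/2)
    print(q3("HHHEEC", "HHHECC"))       # 5 of 6 residues correct
```

In practice an optimal superposition of the two structures is found first; that fitting step is not shown here.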

120

Secondary Structure Prediction

• Secondary – or local – structure prediction is the first step in classifying amino acid sequences
– Alpha helix
– Beta sheet
– Coil

http://www.cryst.bbk.ac.uk/PPS95/course/3_geometry/rama.html

http://www.cryst.bbk.ac.uk/PPS95/course/3_geometry/helix1.html

121

Different approaches to tertiary structure prediction

• Do a sequence alignment to find a protein that is like the unknown sequence in whole or in part
• Threading
– Thread a molecule on to a guide
– Add sidechains
– Optimize sidechains
• Piecewise reconstruction
– Estimate the structure of smaller pieces
– Then estimate how they fit together

122

SDSC Biology Workbench

• Probably one of the best overall sites in the US
• http://workbench.sdsc.edu
• Requires registration but this is relatively painless
• You do need to read the instructions first…

123

Ab initio methods - Amber

• http://amber.scripps.edu/#ff
• sander: Simulated annealing with NMR-derived energy restraints.
• gibbs: Free energy perturbation (FEP) and thermodynamic integration (TI); also allows potential of mean force (PMF) calculations.
• roar: Allows mixed quantum-mechanical/molecular-mechanical (QM/MM) calculations, "true" Ewald simulations, and alternate molecular dynamics integrators.
• nmode: Normal mode analysis program using first and second derivative information, used to search for local minima, perform vibrational analysis, and search for transition states.
• (from http://amber.scripps.edu/#code)

124

Ab initio methods - GAMESS

• Schmidt, M.W., K.K. Baldridge, J.A. Boatz, S.T. Elbert, M.S. Gordon, J.H. Jensen, S. Koseki, N. Matsunaga, K.A. Nguyen, S. Su, T.L. Windus, M. Dupuis, J.A. Montgomery. 1993. General Atomic and Molecular Electronic Structure System. J. Comput. Chem. 14: 1347-63.

• NPACI/SDSC Web portal for GAMESS: https://gridport.npaci.edu/gamess/

125

Hybrid approaches: Rosetta

• Identification of a library of short sequence motifs that correlate strongly with protein local structural properties.
• Basic idea:
– Sequence-dependent local interactions bias segments of the chain
– Nonlocal interactions select the lowest free-energy tertiary structures from the many conformations compatible with these local biases
– Use protein database and take the distribution of local structures adopted by short sequence segments (fewer than 10 residues in length) in known three-dimensional structures
– Put these structures together using non-local interactions
  • hydrophobic burial, electrostatics, main-chain hydrogen bonding and excluded volume.
• Free energy is then minimized to create candidate structures

126

Molecular Docking

• Key in drug searching
• AutoDock is a commonly used package
• http://www.scripps.edu/pub/olson-web/doc/autodock/
• “AutoDock is a suite of automated docking tools. It is designed to predict how small molecules, such as substrates or drug candidates, bind to a receptor of known 3D structure.” (from the web site)
• Nice visualization of an AutoDock docking simulation: http://wwwcmc.pharm.uu.nl/moret/dockings/home.html

127

Systems Biology

128

Systems Biology

• Special issue of Science: 295, Mar. 2002

• Special issue of Nature: 420, Nov. 2002

• Nobody’s quite sure what it is, but it sure is hot!

http://www.ornl.gov/TechResources/Human_Genome/graphics/slides/images/01-0052_web.gif

129

Historical approach to biological experiments

• From Lazebnik, Y. 2002. Cancer Cell 2:179
• Traditional biological experimentation is much like the process of trying to fix a broken radio
• (Or, for those of us who have experienced either being or living with a 12-year-old boy, the process of breaking a functioning radio)
• Some typical steps:
– Cataloguing components and their attributes
– Perturbing the system
– Knock-out experiments
– Drawing diagrams
• Eventually may find a component that, when replaced, repairs the radio

130

Issues

• In a very complex system, knowing what all of the parts are, and knowing the function of individual pathways, may still not tell you how the system works. It may simply be impossible to deduce this from first-order interactions
• Interactions, multiple changes
– Power supply and other components (well-known PC repair example!)
– Change everything all at once so that we’ll never know what worked!

131

Systems Biology

• Systems biology emphasizes close integration of experiment, theory and computational modeling
• Goal: understanding the structure and dynamics of biological systems, placing the parts in the context of the dynamic whole
– Studies the complex interactions of many levels of biological information
– Quantitative, predictive models are central
– Computational modeling in particular is a key tool
• Why model
– You are forced to really state what you are hypothesizing
– Allows you to understand an *approximation* of reality in great detail
• Computational Cell Biology. 2002. Springer Verlag (Fall et al, eds).
• Foundations of systems biology. MIT Press, 2001. Kitano (ed)

132

Example - MCell

• MCell is: A General Monte Carlo Simulator of Cellular Microphysiology. http://www.mcell.cnl.salk.edu/
• MCell focuses on simulations using a Brownian dynamics random walk algorithm.
• MCell's use to date has been focused on the microphysiology of synaptic transmission.
• Images and MCell-related material courtesy of Joel R. Stiles, Pittsburgh Supercomputing Center and Carnegie Mellon University, and Thomas M. Bartol, Computational Neurobiology Laboratory, The Salk Institute. http://www.mcell.cnl.salk.edu/

133

MCell Scalability

Images and MCell-related material courtesy of Joel R. Stiles, Pittsburgh Supercomputing Center and Carnegie Mellon University, and Thomas M. Bartol, Computational Neurobiology Laboratory, The Salk Institute. http://www.mcell.cnl.salk.edu/

134

MCell

• Uses the Model Description Language (MDL), designed with biologically oriented users in mind.

• Embarrassingly parallel Monte Carlo application

• Supports checkpointing!

Images and MCell-related material courtesy of Joel R. Stiles, Pittsburgh Supercomputing Center and Carnegie Mellon University, and Thomas M. Bartol, Computational Neurobiology Laboratory, The Salk Institute. http://www.mcell.cnl.salk.edu/

135

CompuCell

• CompuCell currently uses a combination of "extended Potts model" for cell sorting and clustering, and "Schnakenberg Reaction Diffusion" equations to establish the underlying chemical field to which cells respond and form typical patterns found in such biological systems as a growing chicken limb.

• http://www.nd.edu/~icsb/

Image courtesy of James Glazier, http://www.biocomplexity.indiana.edu/software.php

136

Karyote

• Information theory approach - construction of probability for parameters so that uncertainty in their estimation is assessed.

• The incompleteness of model is addressed via a probability functional approach for computing the time-dependence of the concentration of key enzymes

• Small features such as ribosomes or viruses behave in ways that rely on their atomic scale structure but which take part in the overall (macroscopic) balance of metabolic reaction and transport. “Zones” may be treated in more detail via the solution of mesoscopic models using finite element methods.

• Can be run over web at http://biodynamics.indiana.edu/overview/

137

Issue: Getting Tools to Interoperate

• There is currently a proliferation of software, but no single package answers all needs

• No single tool is likely to do so in the near future
• But: problems with using multiple packages
• One effort to address this problem:
– Systems Biology Workbench Project
• Purpose: develop software and standards to
– Enable sharing of simulation & analysis software
– Enable sharing of models
• Goal: make it easier to share than to reimplement

138

The Systems Biology Workbench Project

• http://www.sbw-sbml.org/
• Simple framework for application interaction.
• Cross-platform compatible & language-neutral

• Modules are separately compiled executables. A module defines services which have methods

• SBW native-language libraries provide APIs.

• SBW Broker acts as coordinator

[Figure: SBW with modules: Visual Editor, Stochastic Simulator, ODE-based Simulator, Script Interpreter, Database Interface]

139

CellML

• http://www.cellml.org/public/about/what_is_cellml.html
• XML-based specification of interchange of cell model information
• Includes:
– Information about model structure
– Math, based on MathML
– Metadata about the model
• Project of Bioengineering Institute of University of Auckland with support from Physiome Sciences Inc.

140

Systems biology URLs

• SBW & SBML: www.sbw-sbml.org
• NetBuilder: strc.herts.ac.uk/bio/Maria/NetBuilder
• CellML: www.cellml.org
• Jarnac + JDesigner: www.cds.caltech.edu/~hsauro
• Gepasi: www.gepasi.org
• Virtual Cell: www.nrcam.uchc.edu/
• E-CELL: www.e-cell.org
• JigCell: gnida.cs.vt.edu/~cellcyclepse/
• DARPA BioSPICE: www.biospice.org
• Karyote: http://biodynamics.indiana.edu/overview/

141

Grand challenge problems and some thoughts about the future

142

Modeling Heart Function

• Based on Noble, D. 2002. Modeling the heart – from genes to cells to the whole organ. Science 295: 1678-1682

• Two mutations known for sodium channels
– DeltaKPQ – deletion of 3 amino acids (lysine-proline-glutamine) – causes persistent sodium flow through the cell membrane
– Missense mutations in sodium channels which cause ventricular fibrillation that can be fatal

• Models of heart function can produce counterintuitive predictions

• Grand challenge problem: the full scale reconstruction of a heart attack

143

3.0T MRI Scanner SGI Onyx

Real-time fMRI

In 1996, this required a supercomputer. Today, it’s routine.

CRAY T3E

Slide courtesy of Ralph Roskies, Pittsburgh Supercomputing Center, roskies@psc.edu

144

Gamma Knife

• Used to treat inoperable tumors

• Treatment methods currently use a standardized head model

• UITS is working with IU School of Medicine to adapt Penelope code to work with detailed model of an individual patient’s head

145

PENELOPE Basics

• “PENELOPE performs Monte Carlo simulation of coupled electron-photon transport in arbitrary materials and complex quadric geometries”(http://www.nea.fr/abs/html/nea-1525.html)

• Improvement of targeting based on CT scans of patient’s head – 200 512 x 512 voxel slices

• Simulation takes ~7 hours using a serial version of PENELOPE running on a 1 GHz PIII Windows system

• Goal: 5 minutes to one hour

146

Parallelization of PENELOPE

• Each processor:
– Views entire target
– Generates its own random numbers
– Generates a set number of independent trajectories
– Accumulates data
• Process 0:
– Collects the raw data
– Computes desired results
• Uses F90 for parallel random number generator from MILC consortium
• Uses MPI elsewhere
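The pattern on this slide (independent trajectories per processor, each with its own random numbers, then a reduction on process 0) can be sketched without MPI as follows. This is a toy stand-in, not PENELOPE: the "physics" is a single exponential free path through a slab, the ranks run in a plain loop, and seeding each rank with `base_seed + rank` is a simplification of a proper parallel random number generator. All names and parameter values are assumptions.

```python
import math
import random

def simulate_rank(rank, n_trajectories, thickness=1.0, mu=1.0, base_seed=12345):
    """One rank's share of the work: count particles crossing the slab."""
    rng = random.Random(base_seed + rank)  # this rank's own random stream
    transmitted = 0
    for _ in range(n_trajectories):
        free_path = rng.expovariate(mu)    # distance to first interaction
        if free_path > thickness:
            transmitted += 1
    return transmitted                     # this rank's accumulated data

def run(n_ranks=8, n_trajectories=20000):
    """Process-0 role: collect raw counts and compute the desired result."""
    counts = [simulate_rank(r, n_trajectories) for r in range(n_ranks)]
    return sum(counts) / (n_ranks * n_trajectories)

if __name__ == "__main__":
    print(run(), "expected about", math.exp(-1.0))
```

Because the trajectories are independent, the only communication is the final collection step, which is why this kind of Monte Carlo code tends to scale well.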

147

PENELOPE Scalability: processing time

[Figure: total wallclock time in seconds (log scale, 1 to 100000) versus number of processors (0 to 300), on IBM SP/Power3]

148

PENELOPE Scalability: Speedup

[Figure: speedup (0 to 300) versus number of processors (0 to 300)]

149

Some very boring Vampir traces of PENELOPE

150

“Simulation-only” studies

• Aquaporins: proteins which conduct large volumes of water through cell walls while filtering out charged particles like hydrogen ions.
• Massive simulation (35,000 hours TCS) showed that water moves through aquaporin channels in single file. Oxygen leads the way in. Half way through, the water molecule flips over.
• That breaks the ‘proton wire’
• Work done at Pittsburgh Supercomputing Center
• Klaus Schulten et al, U. of Illinois, SCIENCE (April 19, 2002)

151

Other example large-scale computational biology grid projects

• Department of Energy “Genomes to Life” http://doegenomestolife.org/
• Encyclopedia of Life (http://eol.sdsc.edu/)
• Biomedical Informatics Research Network (BIRN) http://birn.ncrr.nih.gov/birn/
• Asia Pacific BioGrid (http://www.apbionet.org/)
• eDiamond – breast cancer/mammography grid (http://www.mirada-solutions.com/PH1.asp?PAGE_ID=739)

152

Visualization: OpenDX

• http://www.opendx.org/
• OpenDX is the open source software version of IBM's Visualization Data Explorer Product
• Good sources of information in books, tutorials, etc.
• Interesting example of open source
• Animations as well

http://www.opendx.org/highlights.php

153

Visualization: SCIRun

• Some of the most dramatic biological visualizations ever done
• Has been used for surgical support
• Scientific Computing and Imaging Institute – Christopher R. Johnson
• http://www.sci.utah.edu/

154

Genomes to Life

• http://www.doegenomestolife.org/
• Goals:
– Identify and Characterize the Molecular Machines of Life – the Multiprotein Complexes That Execute Cellular Functions and Govern Cell Form
– Characterize Gene Regulatory Networks
– Characterize the Functional Repertoire of Complex Microbial Communities in Their Natural Environments at the Molecular Level
– Develop the Computational Methods and Capabilities to Advance Understanding of Complex Biological Systems and Predict Their Behavior
– (Goals taken directly from Genomes to Life web site)

155

EOL Basic Topology

Putative Functional and 3D Assignment

Genomic Data

Integration with Other Resources

Public and Private DatabasesTo Serve Thousands Worldwide

http://eol.sdsc.edu/methodology.html

156

Current Genomic Pipeline

[Figure: flow diagram with the following labels]
• Input: Arabidopsis protein sequences
• Prediction of: signal peptides (SignalP, PSORT), transmembrane (TMHMM, PSORT), coiled coils (COILS), low complexity regions (SEG)
• Structural assignment of domains by PSI-BLAST on FOLDLIB
• Only sequences w/out A-prediction
• Structural assignment of domains by 123D on FOLDLIB
• Only sequences w/out A-prediction
• Create PSI-BLAST profiles for protein sequences
• Store assigned regions in the DB
• Functional assignment by PFAM, NR, PSIPred assignments
• Domain location prediction by sequence
• Data sources: FOLDLIB; NR, PFAM; SCOP, PDB (structure info, sequence info)
• Building FOLDLIB:
– PDB chains, SCOP domains, PDP domains, CE matches PDB vs. SCOP
– 90% sequence non-identical
– Minimum size 25 aa
– Coverage (90%, gaps <30, ends <30)

http://eol.sdsc.edu/methodology.html

157

Scale of Multi-genome Analysis

[Figure: the genomic pipeline flow diagram, with genome protein sequences as input, annotated with the following costs]
• ~800 genomes @ 10k-20k per = ~$10^7$ ORFs
• Stage costs shown on the diagram: 4 CPU years, 228 CPU years, 3 CPU years, 9 CPU years, 252 CPU years, 3 CPU years
• 104 entries

http://eol.sdsc.edu/methodology.html

158

BIRN

• Biomedical Informatics Research Network
• http://www.nbirn.net/
• NIH-sponsored attempt to create health-oriented cyberinfrastructure
• Function BIRN – brain function and disorders, e.g. schizophrenia
• Morphometry BIRN – brain structural disorders, e.g. Alzheimer’s
• Mouse BIRN – studying mouse brain and mouse models of human brain disorders
• Grid technology, using federated data system approach, based on Globus, SRB, etc.

159

Drug Design

• Target generation – so what
• Target verification – that’s important!
• Toxicity prediction – VERY important!!
• (Cholesterol example)
• Counterintuitive problem: the more personalized a therapy is, the smaller its target audience!

160

What is the killer application in computational biology?

• Systems biology – latest buzzword, but….
• Goal: multiscale modeling from cell chemistry up to multiple populations
• Current software tools still inadequate
• Multiscale modeling calls for use of established HPC techniques – e.g. adaptive mesh refinement, coupled applications
• Current challenge examples: actin fiber creation, heart attack modeling
• Opportunity for predictive biology?

161

Computational biology, biomedical research, and HPC

• Two challenges:
– Scalability of applications
– Wall-clock time sensitivity
• Bioinformatics, Genomics, Proteomics, ____ics will radically change understanding of biological function and the way biomedical research is done.
• Traditional biomedical researchers must take advantage of new possibilities
• Computer-oriented researchers must take advantage of the knowledge held by traditional biomedical researchers

162

Peta-Scale applications?

• Is this what most biologists really need?
• Many biologists are unfamiliar with the real possibilities
• Useful – even lifesaving – applications may require straightforward application of well known principles.
• The low hanging fruit taste just fine, e.g. “Parallel” Matlab, GeneIndex, batch scripts (www.indiana.edu/~rac/bioinformatics/iubatchscripts.html)
• Writing a parallel application that can be used to treat people is a very difficult challenge
• Attacks on all fronts simultaneously are needed
• Interactive Tera-scale applications might for many biologists be more valuable right now than Peta-scale applications (even if we had them!)
• All of these open source codes are out there waiting for you to parallelize and/or tune them!

163

So how do you find biologists with whom to collaborate?

• Chicken and egg problem?
• Or more like fishing?
• Or bank robbery?
• Willie Sutton, a famous American bank robber, was asked why he robbed banks, and reportedly said “because that's where the money is.” (This is, sadly, an urban legend: Sutton never said this)
• Cultivating collaborations with biologists in the short run will require:
– Active outreach
– Different expectations than we might usually have
– Patience
• There are lots of opportunities open for HPC centers willing to take the effort to cultivate relationships. To do this, we’ll all have to spend a bit of time “going where the biologists are.”

164

Acknowledgments

• Some of the research described herein was supported in part by the Indiana Genomics Initiative. The Indiana Genomics Initiative of Indiana University is supported in part by Lilly Endowment Inc.

• Some of the research described herein was supported in part by Shared University Research grants from IBM, Inc. to Indiana University.

• Some of the material described herein is based upon work supported by the National Science Foundation under Grant No. 0116050 and Grant No. CDA-9601632. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).

• Some of the ideas presented here were developed while the senior author was a visiting scientist at Höchstleistungsrechenzentrum Universität Stuttgart. The support and collaboration of HLRS and Michael Resch, Matthias Müller, Peggy Lindner, Matthias Hess, and Rainer Keller are gratefully acknowledged.

• Thanks to UITS Research and Academic Computing Division managers: Mary Papakhian, Stephen Simms, Richard Repasky, Matt Link, John Samuel, Eric Wernert, Anurag Shankar

• Indiana Genomics Initiative Staff: Andy Arenson, Chris Garrison, Huian Li, Jagan Lakshmipathy, David Hancock

• UITS Senior Management: Associate Vice President and Dean (Retired) Christopher Peebles, RAC(Data) Director Gerry Bernbom, Associate Vice President and Dean Bradley Wheeler

• Assistance with this presentation: John Herrin, Malinda Lingwall, W. Les Teach

165

Some Good Books

• Winter, P.C., G.I. Hickey, H.L. Fletcher. 1998. Instant notes in genetics. Springer-Verlag, NY. ISBN 0-387-91562-1
• Durbin, R., S. Eddy, A. Krogh, G. Mitchison. 2000. Biological sequence analysis. Cambridge University Press.
• Gibas, C., and P. Jambeck. 2001. Developing bioinformatics computer skills. O’Reilly.
• Tisdall, J. 2001. Beginning perl for bioinformatics. O’Reilly.
• Gusfield, D. 1997. Algorithms on strings, trees, and sequences. Cambridge University Press.
• Berman, F., G.C. Fox, A.J.G. Hey. (eds) 2003. Grid computing: making the grid infrastructure a reality. Wiley, Sussex