“Computers are to Biology what Mathematics is to Physics” - Harold Morowitz Corollaries: 1.A...

“Computers are to Biology what Mathematics is to Physics”

- Harold Morowitz

Corollaries:

1. A computer scientist who does not understand the subject matter and questions arising from Biological research will very likely not be able to make a significant contribution.

2. A Biologist who does not understand algorithmics and the construction of algorithms will be severely handicapped in doing the Biological Research of the future.

Report of the

National Research Council

National Academy of Sciences

2002

“Biological concepts, models, and theories are becoming more quantitative, and the connections between the life and physical sciences are becoming deeper and stronger”

“Computers now play a central role in the acquisition, storage, analysis, interpretation, and visualization of vast quantities of biological data”

Some Descriptions of Bioinformatics

1. (A Computer Scientist, 2005) Bioinformatics is a study of the algorithms and programs that are used by Molecular Biologists and others in the Biological and Medical Sciences in their quest for understanding protein structure and function in living organisms.

2. (Claverie & Notredame, 2003) Bioinformatics is nothing but good, sound, regular biology appropriately dressed so that it can fit into a computer.

3. (Attwood & Parry-Smith, 1999) The current research drive is to be able to understand evolutionary relationships in terms of the expression of protein function. Two computational approaches have been brought to bear on the problem: tackling it from the perspectives of sequence analysis and structure analysis, respectively.

4. (Augen, 2005) Bioinformatics lies at the intersection of Information Science and Molecular Biology. Furthermore, its development is highly dependent on simultaneous technical advances in both areas.

5. (Pevsner, 2003) Bioinformatics is designed to help the biologist use computer programs and databases to solve biological problems related to proteins, genes, and genomes with a larger goal of understanding broader issues such as the relationship of structure to function, development, and disease.

6. (Krane & Raymer, 2003) Bioinformatics strives to determine what information is biologically important and to decipher how it is used to precisely control the chemical environment within living organisms.

Computational Biology

[Diagram with boxes labeled Computer Science, Biology, Bioinformatics, Micro Biology & Medical Science, and Computational Biology]

(Note the two-way arrow)

Early Prehistory

[Diagram relating Computer Science, Micro Biology, and Bioinformatics]

Late Prehistory

[Diagram relating Computer Science, Micro Biology, and Bioinformatics]

Recent History

[Diagram relating Computer Science, Micro Biology, and Bioinformatics]

The Net (no pun intended) Result is an Astounding Growth of Biological Information

Source: Scrabanek op. cit.

Let’s look at that word algorithm:

Another term – recipe

• Tells the steps to take to obtain the desired answer

• Logical organization of steps

• It is a prescription for a computer program

• Algorithms are the result of human thought enhanced by human senses and experiences.

• Computer programs are constrained by a restricted language

Consider the following example:

You have compared five sequences to an ancestral sequence A

B has 70% similarity with A; C has 65%; D has 80%; E has 90%; F has 67%.

Rank B, C, D, E, and F in order of having the most similarities with A.

You see it right away, but what about a computer?

Computers have neither eyes, nor a BRAIN!!

We must tell them every step of the process. All a computer knows is that it has 5 numbers to sort.

We use an algorithm that is similar to the way we sort cards in a hand we are dealt.

Initial ordering: B C D E F (recall the similarities: 70, 65, 80, 90, 67)

Compare B and C and arrange them in order: B C D E F

Compare D to B and C and insert in proper position: D B C E F

Compare E to D, B, and C and insert in proper position: E D B C F

Compare F to E, D, B, and C and insert in proper position: E D B F C
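The card-sorting steps above can be sketched in a few lines of Python. This is only an illustration: the function name `insertion_sort_desc` and the dictionary `similarity` are made up for this example, and the sketch scans the already-placed sequences from the front, which is one of several reasonable ways to find the insertion position.

```python
def insertion_sort_desc(names, scores):
    """Order names from highest to lowest score, the way we sort a hand of cards."""
    ordered = []      # sequences placed so far, best first
    comparisons = 0   # number of pairwise comparisons performed
    for name in names:
        pos = 0
        # Walk from the front until we meet a sequence with a lower score.
        while pos < len(ordered):
            comparisons += 1
            if scores[name] > scores[ordered[pos]]:
                break
            pos += 1
        ordered.insert(pos, name)
    return ordered, comparisons


# Hypothetical container for the percentages given on the slide.
similarity = {"B": 70, "C": 65, "D": 80, "E": 90, "F": 67}
ranking, used = insertion_sort_desc(["B", "C", "D", "E", "F"], similarity)
print(ranking)  # ['E', 'D', 'B', 'F', 'C']
print(used)     # comparisons actually needed for this input
```

Every step the computer takes is spelled out: pick up the next sequence, compare it with those already placed, insert it.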

How efficient is the algorithm? This is important to know in order to determine whether there is a faster way to solve the problem.

Judge the efficiency by the number of comparisons needed in the worst case.

2 sequences: 1 comparison; 3 sequences: 3; 4 sequences: 6; 5 sequences: 10; 6 sequences: 15; 7 sequences: 21. What about 8 and 9?

If we have n sequences, the worst case for the number of comparisons needed is: 1 + 2 + ... + (n - 1) = n(n - 1)/2.

Note that when n is greater than 3, n(n - 1)/2 is bigger than n, and it grows much faster: its graph is a quadratic. Based on this analysis, we say that the Insertion Sort Algorithm is of order n^2.
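A quick way to convince yourself of the formula is to let the computer add up the comparisons. The sketch below assumes the worst case described above, where each new sequence is compared with every sequence already placed; the function name `worst_case_comparisons` is made up for this example.

```python
def worst_case_comparisons(n):
    """Comparisons needed when each new item is compared with all earlier ones."""
    return sum(range(1, n))  # 1 + 2 + ... + (n - 1)


for n in range(2, 10):
    # The running sum agrees with the closed form n(n - 1)/2.
    assert worst_case_comparisons(n) == n * (n - 1) // 2
    print(n, "sequences:", worst_case_comparisons(n), "comparisons")
```

Doubling n roughly quadruples the count, which is what "order n^2" means in practice.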

Let’s return to the problem at hand: aligning DNA sequences.

In Lab 2 we analyzed a single sequence of cDNA. We determined the number of each type of nucleotide making up the sequence and also looked for repeated sequences. This type of analysis is useful for identifying key subsequences, determining the makeup of the sample, and searching for genetically caused diseases. It may also serve to help explain the protein that is manufactured by this sample of DNA.

In Lab 3 we began our consideration of the problem of comparing two DNA samples. This comparison can lead to identification of the query sequence, finding homologous subsequences within the sequence, identifying strongly conserved regions, and establishing an evolutionary relationship between some of the genes contained within the given sample. Unfortunately, the problem is not a simple one. For example, where do we begin the comparison? What about the nucleotides between conserved regions?

An Analogy

Information is stored on a floppy disk as a sequence of 0’s and 1’s. We may be able to read this code and still not tell what is on the disk. Two disks may contain the same files and programs, yet the patterns of 0’s and 1’s may be entirely different.

WHY?

Information is not stored as single sequential files. As files are written and erased, new files are placed in ‘holes’ and may be distributed across several of these available spaces.

Sequence Comparison

A segment of a BLAST report:

Score: 76

What does this number mean?

Where did it come from?

How good a score is this?

We will back off a little and work up to answering these questions.

First Attempt at Alignment – Dot Plots

Basic Idea –

1. Form a rectangle.

2. Place the first sequence along the left-hand side of the rectangle and the second along the top.

3. Starting with the character at the top of the first sequence, place a dot in the rectangle for every character in the second sequence that matches that character from the first sequence. Repeat for each remaining character of the first sequence.

4. Examine the result to determine if there are sequences of consecutive characters that match.

5. If there are such sequences, align them.

6. Use mismatches and/or gaps to match as many matching sequences as possible.

Well, That’s the Idea, But ...

The following is the exact implementation of the previous slide. Recall that black pixels mean that there was a match. What have you learned from the plot about the sequence alignment?

The two sequences that were aligned were:

Escherichia coli O157:H7 mutS

And

Shigella boydii strain 374(S37) phosphoprotein phosphatase B (prpB) gene, partial sequence

The second is a well-conserved homolog of a subsequence of the first. If the Dot Plot method is to be of any value, it should make this fact obvious.

The plotting method is modified by using a “sliding window” of some specified length. If the window is of length n, a point is plotted only if the character and its next n - 1 characters match.

What is a good value for n? There is no consensus. Some other methods use a default value of 11, since (1/4)^11 = .000000238..., which would be the probability that a match of 11 characters occurs at random (assuming the four nucleotides are equally likely). The Dot Plot Tool that is part of Mol Kit at Colorado State uses n = 9.
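To see what the sliding window does, here is a minimal text-only dot plot. It is a sketch, not the tool used for the plots in these slides: the function names `dot_plot`, `render`, and `chance_match_probability` are invented for this example, and the two short sequences are stand-ins for the much longer genes.

```python
def dot_plot(seq_side, seq_top, window=1):
    """Return the (row, col) positions where `window` consecutive characters match."""
    dots = set()
    for i in range(len(seq_side) - window + 1):
        for j in range(len(seq_top) - window + 1):
            if seq_side[i:i + window] == seq_top[j:j + window]:
                dots.add((i, j))
    return dots


def render(seq_side, seq_top, dots):
    """Draw the plot as text: '*' marks a dot, '.' marks no match."""
    lines = ["  " + seq_top]
    for i, ch in enumerate(seq_side):
        row = "".join("*" if (i, j) in dots else "." for j in range(len(seq_top)))
        lines.append(ch + " " + row)
    return "\n".join(lines)


def chance_match_probability(n):
    """Probability that n characters match at random, assuming equal base frequencies."""
    return 0.25 ** n


a, b = "ACACTG", "ACACTGATCG"
print(render(a, b, dot_plot(a, b, window=1)))  # noisy: many isolated dots
print(render(a, b, dot_plot(a, b, window=3)))  # only the diagonal run survives
print(chance_match_probability(11))
```

With a window of 1 the plot is cluttered with chance matches; with a window of 3 only the shared run along the diagonal remains.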

Here is the Dot Plot with n = 11:

In this case the relationship is obvious. So the moral of the story is that it is not just the tool, it is how skilled you are at using it.

Let’s do something a little simpler that may also involve some gaps, i.e., indels. How about the sequences

ACACTG and ACACTGATCG?

GACGGATTAG and GATCGGAATAG?

For the first pair, it looks like the alignment

ACACTG----
ACACTGATCG

is best. Gaps are used to fill out the end of the sequence.

For the second pair, one possibility is:

GA-CGGATTAG
GATCGGAATAG

Note that without the gap almost nothing after GA would match up.

In the last two examples the gaps played significantly different roles.

In the first:

ACACTG----
ACACTGATCG

the gaps probably indicate that the first sequence was terminated prematurely.

In the second:

GA-CGGATTAG
GATCGGAATAG

the gap denotes either an insertion or a deletion. Such a gap is called an “indel”. Note there is also a possible mutation four characters from the end.

Dealing with Gaps

Consider the two sequences: AATCTATA and AAGATA

These sequences have different lengths – 8 versus 6

If we insist that the sequences do not contain any gaps, i.e., are ungapped, then there are only 3 ways to align them:

AATCTATA    AATCTATA    AATCTATA
AAGATA       AAGATA       AAGATA

However, if we decide to put gaps in the shorter sequence so that the two sequence lengths match, we have 8 positions in which to place the two gaps (note that they may be separated). This means there are 8 choose 2, C(8,2) = 28, possible placements for the gaps and, thus, 28 different possible alignments. Note: C(n,k) = n!/(k!*(n-k)!), where n! = n*(n-1)*(n-2)*...*3*2*1.

If, in addition, we were to allow 3 gaps in the top sequence, we would have C(11,3)*C(11,5) = 76,230 candidate alignments.

The inclusion of gaps makes the problem much larger. But, it is an unavoidable necessity.
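The counts on this slide are easy to check with Python's standard library. The helper name `ungapped_placements` is made up for this example; `math.comb(n, k)` computes n choose k.

```python
from math import comb


def ungapped_placements(long_len, short_len):
    """Number of ways to slide the shorter sequence along the longer one without gaps."""
    return long_len - short_len + 1


print(ungapped_placements(8, 6))   # 3 ungapped alignments
print(comb(8, 2))                  # 28 ways to place two gaps in 8 positions
print(comb(11, 3) * comb(11, 5))   # 76230 candidates once the top sequence may also have 3 gaps
```

Even for sequences this short the number of candidates jumps from 3 to tens of thousands, which is why trying every alignment by brute force is not practical for real genes.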

Three of the 28 possible alignments with gaps allowed in the shorter sequence:

AATCTATA    AATCTATA    AATCTATA
AAG-AT-A    AAG--ATA    AAGATA--

When scoring these alignments we need to assign a “penalty” for these gaps or “indels”.

One possibility is to give a -1 for each gap character in the alignment.

Another is to consider starting a gap to be more serious than allowing a gap to continue, i.e., G-AT-A is considered a more serious difference than G--ATA.

For example, we might assess a penalty of -5 for starting a gap and only -1 for continuing it. In the above example the first instance would have a penalty of -10 and the second, -6.
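The two penalty schemes can be compared with a small function. This is a sketch under the convention used in the example above: the first character of each gap costs the opening penalty and every further character of the same gap costs the continuation penalty. The name `gap_penalty` and its parameters are made up for this illustration.

```python
import re


def gap_penalty(aligned, gap_open=-5, gap_extend=-1):
    """Total penalty for every run of '-' in one aligned sequence."""
    total = 0
    for run in re.findall(r"-+", aligned):
        # The first gap character pays gap_open; the rest pay gap_extend each.
        total += gap_open + gap_extend * (len(run) - 1)
    return total


print(gap_penalty("AAG-AT-A"))           # -10: two separate gaps
print(gap_penalty("AAG--ATA"))           # -6: one gap of length two
print(gap_penalty("AAG--ATA", -1, -1))   # -2: flat -1 per gap character
```

Setting both parameters to -1 recovers the simpler scheme of -1 per gap character.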

Overall Scoring of Sequence Alignments

Strategy 1

The Identity Matching Scheme

This is essentially a “no penalty for mismatches” scheme

     A   T   C   G
A    1   0   0   0
T    0   1   0   0
C    0   0   1   0
G    0   0   0   1

Penalty for indel = -1

Scores for our previous alignment candidates:

AATCTATA
AAG-AT-A
1+1+0-1+0+0-1+1 = 1

AATCTATA
AAG--ATA
1+1+0-1-1+1+1+1 = 3

AATCTATA
AAGATA--
1+1+0+0+1+1-1-1 = 2

Or, if we do not assign a penalty for ending gaps, the third alignment scores 1+1+0+0+1+1 = 4.
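The column-by-column sums above can be reproduced with a short function. This is a sketch of the identity scheme only; the name `identity_score` and the flag `penalize_end_gaps` are invented for this example.

```python
def identity_score(top, bottom, gap=-1, penalize_end_gaps=True):
    """Score an alignment: +1 per match, 0 per mismatch, `gap` per gap column."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    columns = list(zip(top, bottom))
    if not penalize_end_gaps:
        # Drop trailing columns in which either sequence has a gap.
        while columns and "-" in columns[-1]:
            columns.pop()
    score = 0
    for a, b in columns:
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += 1
    return score


print(identity_score("AATCTATA", "AAG-AT-A"))  # 1
print(identity_score("AATCTATA", "AAG--ATA"))  # 3
print(identity_score("AATCTATA", "AAGATA--"))  # 2
print(identity_score("AATCTATA", "AAGATA--", penalize_end_gaps=False))  # 4
```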

Discussion of Strategy 1

• The second alignment has the highest score

• Problem is: what does this score represent?

• There is no recognition of the quality of the mismatch

• No reference to the history of the two sequences is taken into account. We have just assigned numbers.

• May not be at all appropriate for a situation where we have a quickly mutating organism, such as a virus. In such an organism substitutions are common, and a given site may undergo several changes.

Strategy Due to Jukes/Cantor

Make the assumption that any nucleotide is equally likely to change into any of the other nucleotides.

     A   T   C   G
A    5  -4  -4  -4
T   -4   5  -4  -4
C   -4  -4   5  -4
G   -4  -4  -4   5

Jukes and Cantor made no provision for indels. Others have added a gap penalty, but there is no consensus as to what it should be. We will arbitrarily choose -8.

Scoring the three alignments:

AATCTATA
AAG-AT-A
5+5-4-8-4-4-8+5 = -13

AATCTATA
AAG--ATA
5+5-4-8-8+5+5+5 = 5

AATCTATA
AAGATA--
5+5-4-4+5+5-8-8 = -4

OR, if we ignore gaps at the end of the sequence, the third alignment scores 5+5-4-4+5+5 = 12.
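A scoring matrix of this kind is just a lookup table, so the sums can be checked mechanically. The sketch below builds the uniform matrix shown above and applies the arbitrary gap penalty of -8; the names `uniform_matrix` and `matrix_score` are made up for this example.

```python
def uniform_matrix(match=5, mismatch=-4, alphabet="ATCG"):
    """Build a matrix in which every mismatch is penalized equally."""
    return {(a, b): (match if a == b else mismatch) for a in alphabet for b in alphabet}


def matrix_score(top, bottom, matrix, gap=-8):
    """Sum the matrix entry for each column, or the gap penalty for a gap column."""
    score = 0
    for a, b in zip(top, bottom):
        score += gap if "-" in (a, b) else matrix[(a, b)]
    return score


jc = uniform_matrix()
for bottom in ("AAG-AT-A", "AAG--ATA", "AAGATA--"):
    print(bottom, matrix_score("AATCTATA", bottom, jc))
```

Because the matrix is passed in as data, the same `matrix_score` function works unchanged with any other 4 x 4 scoring table.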

Discussion of Jukes/Cantor

• The scoring of the alignments yielded exactly the same ranking as was the case for the identity matrix score

• The “quality” of the scores is different – It is based on observations of frequency data of “known” sequences.

• Assumption that all mismatches are equally likely is not supported by our previous lectures or by examining larger, more recent, databases of known sequences where ancestral data is known.

• No standard way to choose a gap penalty.

The Kimura Two Parameter Model

Based on assigning different probabilities to different types of mismatches, i.e., purine-purine and pyrimidine-pyrimidine substitutions (transitions) as opposed to purine-pyrimidine substitutions (transversions).

     A   T   C   G
A    1  -5  -5  -1
T   -5   1  -1  -5
C   -5  -1   1  -5
G   -1  -5  -5   1

Once again, the issue of gaps is decided arbitrarily. We will use -8, as in the previous case.

Scoring the three alignments once more:

AATCTATA
AAG-AT-A
1+1-5-8-5-5-8+1 = -28

AATCTATA
AAG--ATA
1+1-5-8-8+1+1+1 = -16

AATCTATA
AAGATA--
1+1-5-5+1+1-8-8 = -22

OR, ignoring gaps at the end of the sequence, the third alignment scores 1+1-5-5+1+1 = -6.
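The two-parameter idea can be written directly in terms of the chemistry: A and G are purines, C and T are pyrimidines, and a substitution within a class is penalized less than one across classes. The sketch below uses the values from the matrix above and the arbitrary gap penalty of -8; the function names are made up for this example.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def kimura_pair_score(a, b, match=1, transition=-1, transversion=-5):
    """Score one column: a match, a within-class change, or a cross-class change."""
    if a == b:
        return match
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_class else transversion


def kimura_score(top, bottom, gap=-8):
    """Sum the column scores of an alignment, charging `gap` for each gap column."""
    total = 0
    for a, b in zip(top, bottom):
        total += gap if "-" in (a, b) else kimura_pair_score(a, b)
    return total


for bottom in ("AAG-AT-A", "AAG--ATA", "AAGATA--"):
    print(bottom, kimura_score("AATCTATA", bottom))
```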

Discussion of the Kimura Matrix

• More realistically set up – uses more recent (1980) database information to calculate frequency probabilities and assign scoring values.

• Seems to have a propensity towards negative scores

• Still no resolution of the “proper” weighting for gaps

• Note the rankings of the three sequences still maintained the same order.

• Again, the issue is the quality of the rankings.

No matter what scoring scheme is used the basic problem is simply this: Given two strands of DNA, how can we find the best possible alignment for these two sequences?

We need to follow certain rules:

• Must use a “sensible” scoring matrix and gap penalty

• The order of the characters in a sequence cannot be changed, but gaps can be inserted between them.

• Gaps may appear in either of the sequences.

• Each pairing, whether of two characters (one from each sequence) or of a gap with a character from the other sequence, must be chosen so that the score up to that point is the best possible score.

How do we implement the last rule?

• In Mathematics this is called “The Principle of Optimality” or POO (The method of problem solving using POO is called Dynamic Programming).

• How can the score for a particular position within the alignment be calculated?

We can pair the two available characters. This adds the score for a match or mismatch to the previous score:

Score_i = Score_{i-1} + s(a,b)

We can pair the character in the first string with a gap, or pair the character in the second string with a gap. Either of these adds the gap penalty to the previous score:

Score_i = Score_{i-1} + s(a,-)

Score_i = Score_{i-1} + s(-,b)
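The three options translate into one small function: compute each candidate score and keep the largest. This is only a sketch. The name `best_step`, the move labels, and the default values (match 1, mismatch 0, gap -1) are assumptions chosen for illustration.

```python
def best_step(diag, up, left, a, b, match=1, mismatch=0, gap=-1):
    """Return (score, move) for one position, given the three previous scores.

    diag: previous score if we pair a with b
    up:   previous score if a is paired with a gap
    left: previous score if b is paired with a gap
    """
    candidates = [
        (diag + (match if a == b else mismatch), "pair a with b"),
        (up + gap, "pair a with a gap"),
        (left + gap, "pair b with a gap"),
    ]
    # max() keeps the first of any tied candidates, so pairing wins ties here.
    return max(candidates, key=lambda c: c[0])


print(best_step(diag=0, up=-1, left=-1, a="A", b="A"))  # (1, 'pair a with b')
```

Recording the move alongside the score is what later lets us recover the alignment itself, not just its score.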

The Three Ways To Score A Position

      G   T   A
C     *   *   *
T     *   *

First make a table with the first sequence along the left and the second along the top. Calculate the three possible values for each cell. Choose the largest of these values.

The three possibilities for a cell:

• Gap in second sequence

• Gap in first sequence

• Pair the two symbols

Let’s try one

        A   C   T   C   G
    0  -1  -2  -3  -4  -5
A  -1
C  -2
A  -3
G  -4
T  -5
A  -6
G  -7

Scoring Scheme: Match = 1, Mismatch = 0, Gap = -1

The negative numbers along the top and down the left are the gap penalties for leading gaps in either of the sequences.

Calculating a Cell Value

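        A   C   T   C   G
    0  -1  -2  -3  -4  -5
A  -1   1
C  -2
A  -3
G  -4
T  -5
A  -6
G  -7

Scoring Scheme: Match = 1, Mismatch = 0, Gap = -1

The three candidate values for the first cell are -2 (gap in one sequence), -2 (gap in the other sequence), and 1 (pairing A with A). The largest, 1, is entered in the cell.

The negative numbers along the top and down the left are the gap penalties for leading gaps in either of the sequences.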

Filling in the row and column

        A   C   T   C   G
    0  -1  -2  -3  -4  -5
A  -1   1   0  -1  -2  -3
C  -2   0
A  -3  -1
G  -4  -2
T  -5  -3
A  -6  -4
G  -7  -5

Scoring Scheme: Match = 1, Mismatch = 0, Gap = -1

Indicate the arrow that gave the highest score for each cell. Dynamic Programming records the best score and where it came from. This will be important when the table is filled.

The Complete Matrix

        A   C   T   C   G
    0  -1  -2  -3  -4  -5
A  -1   1   0  -1  -2  -3
C  -2   0   2   1   0  -1
A  -3  -1   1   2   1   0
G  -4  -2   0   1   2   2
T  -5  -3  -1   1   1   2
A  -6  -4  -2   0   1   1
G  -7  -5  -3  -1   0   2

Scoring Scheme: Match = 1, Mismatch = 0, Gap = -1
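Filling the table by hand is tedious, so here is a minimal sketch that fills it with the same scoring scheme (Match = 1, Mismatch = 0, Gap = -1). The function name `fill_table` and the variable names are made up for this example; the approach is the standard global-alignment dynamic programming table.

```python
def fill_table(side, top, match=1, mismatch=0, gap=-1):
    """Fill the dynamic programming table for aligning `side` (rows) with `top` (columns)."""
    rows, cols = len(side) + 1, len(top) + 1
    table = [[0] * cols for _ in range(rows)]
    # Leading gaps: the first row and first column accumulate gap penalties.
    for j in range(1, cols):
        table[0][j] = j * gap
    for i in range(1, rows):
        table[i][0] = i * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = table[i - 1][j - 1] + (match if side[i - 1] == top[j - 1] else mismatch)
            gap_in_top = table[i - 1][j] + gap    # vertical move
            gap_in_side = table[i][j - 1] + gap   # horizontal move
            table[i][j] = max(pair, gap_in_top, gap_in_side)
    return table


table = fill_table("ACAGTAG", "ACTCG")
print("        " + "   ".join("ACTCG"))
for label, row in zip(" ACAGTAG", table):
    print(label + "".join(f"{value:4d}" for value in row))
print("Best score for the complete alignment:", table[-1][-1])
```

Each cell is computed once from three neighbours already filled in, so the work grows with the product of the two sequence lengths rather than with the enormous number of possible alignments.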

Dynamic Programming Finds the Best Score and the Corresponding Alignment

        A   C   T   C   G
    0  -1  -2  -3  -4  -5
A  -1   1   0  -1  -2  -3
C  -2   0   2   1   0  -1
A  -3  -1   1   2   1   0
G  -4  -2   0   1   2   2
T  -5  -3  -1   1   1   2
A  -6  -4  -2   0   1   1
G  -7  -5  -3  -1   0   2

Alignment: Start in the lower right corner and work backwards:

AC--TCG
ACAGTAG

Rules to Discover the Alignment

1. Start in the lower right box – this box contains the best alignment score for the two sequences relative to this particular scoring scheme. NOTE: This may NOT be the largest value in the table, but it is the best score for completely aligning the two sequences. All other scores in the table are for partial alignments of the sequences.

2. Work backwards following the arrows from the present box in reverse order.

3. Diagonal arrow is a pairing of the characters

4. Vertical arrow represents a gap in the sequence across the top

5. Horizontal arrow represents a gap in the sequence along the side.
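The rules above can be sketched as a single function that fills the table, records an arrow for every cell, and then walks the arrows back from the lower right box. This is an illustration under stated assumptions, not a definitive implementation: the name `align` and the arrow labels are made up, the scoring scheme is Match = 1, Mismatch = 0, Gap = -1, and when two arrows tie the sketch arbitrarily prefers the diagonal, then the vertical arrow.

```python
def align(side, top, match=1, mismatch=0, gap=-1):
    """Return (best score, aligned top sequence, aligned side sequence)."""
    rows, cols = len(side) + 1, len(top) + 1
    score = [[0] * cols for _ in range(rows)]
    arrow = [[None] * cols for _ in range(rows)]
    for j in range(1, cols):            # leading gaps in the side sequence
        score[0][j], arrow[0][j] = j * gap, "left"
    for i in range(1, rows):            # leading gaps in the top sequence
        score[i][0], arrow[i][0] = i * gap, "up"
    for i in range(1, rows):
        for j in range(1, cols):
            options = [
                (score[i - 1][j - 1] + (match if side[i - 1] == top[j - 1] else mismatch), "diag"),
                (score[i - 1][j] + gap, "up"),
                (score[i][j - 1] + gap, "left"),
            ]
            score[i][j], arrow[i][j] = max(options, key=lambda o: o[0])

    # Rules 1 and 2: start in the lower right box and follow the arrows backwards.
    i, j = len(side), len(top)
    top_out, side_out = [], []
    while i > 0 or j > 0:
        move = arrow[i][j]
        if move == "diag":              # Rule 3: pair the two characters
            top_out.append(top[j - 1])
            side_out.append(side[i - 1])
            i, j = i - 1, j - 1
        elif move == "up":              # Rule 4: gap in the sequence across the top
            top_out.append("-")
            side_out.append(side[i - 1])
            i -= 1
        else:                           # Rule 5: gap in the sequence along the side
            top_out.append(top[j - 1])
            side_out.append("-")
            j -= 1
    return score[-1][-1], "".join(reversed(top_out)), "".join(reversed(side_out))


best, top_aligned, side_aligned = align("ACAGTAG", "ACTCG")
print(best)          # 2
print(top_aligned)   # AC--TCG
print(side_aligned)  # ACAGTAG
```

A different tie-breaking rule can produce a different alignment with the same best score, so the alignment found this way is a best alignment, not necessarily the only one.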