Fundamental Concepts of Bioinformatics Miami University, May 2008 Michael L. Raymer Computer...

Fundamental Concepts of Bioinformatics

Miami University, May 2008

Michael L. Raymer
Computer Science, Biomedical Sciences

Wright State University

Bioinformatics Research Group

Part I – Background

Some basics of molecular biology, and some of the fundamental

problems facing bioinformaticians

OCCBIO 2006 – Fundamental Bioinformatics 3

The Central Dogma of molecular biology


DNA structure and base pairing

Polymer of:
• Deoxyribose sugar
• Phosphate
• Nitrogenous base

Four bases
• A, C, G, T

Base pairing
• A–T
• G–C


DNA is an information carrying molecule

Arranged into 23 chromosome pairs in the nucleus of each cell (in humans)

Genes: coding information
• < 5% of all DNA
• Instructions for protein synthesis
• Directions on when and where to synthesize proteins (regulatory regions)


The Genetic Code

Redundancy/robustness
• Synonymous codons
• Dual strands
• Diploidy
• Amino acid structure (?)

Transcription

DNA → (transcription) → RNA → (translation) → Protein


Messenger RNA

Carries instructions for a protein out of the nucleus to the ribosome

The ribosome is a complex of RNA and protein that synthesizes new proteins


Prokaryotic gene structure

[Figure: a prokaryotic gene drawn 5' to 3', labeled: promoter (RNA polymerase binding), operator (regulation), 5' UTR, coding region, stop codon, 3' UTR]

[Figure: Yeast RNA Polymerase II, Darst et al., 1991 (Cell 66, pp. 121-128)]


Regulation of transcription

Energy budget
Cellular differentiation & tissue function

From W. Becker, L. Kleinsmith, and J. Hardin, The World of the Cell, Fourth Edition. Copyright © Addison Wesley Longman, Inc.


Bioinformatics problems

Shotgun sequencing
Sequence alignment & multiple alignment
• Database searches
Phylogenetic tree induction
Protein structure determination, modeling, and prediction
Ligand screening and docking
Many, many more


Bioinformatics data

DNA sequence information
• Genome projects, etc.
mRNA expression information
• Microarrays, SAGE
Metabolite concentrations
• Mass Spec., NMR Spec., etc.
Protein sequence information
Protein structure information
• X-Ray Crystallography

Part II – Obtaining Sequences

Sanger Sequencing
Primer Walking
Shotgun Approaches
Fragment Assembly Algorithms


Outline

PCR
Sanger Sequencing
Primer Walking
Shotgun Sequencing
• Models
• Algorithms
• Analysis


Polymerase chain reaction (PCR)


Gel electrophoresis


Sanger sequencing


Limitations to sequencing

You must have a primer of known sequence to initiate PCR
Only about 1000 nt can be sequenced in a single reaction
The sequencing process is slow, so it is beneficial to do as much in parallel as possible
• Primer hopping
• Shotgun approach


Shotgun Sequencing


The Ideal Case

Find maximal overlaps between fragments:

    Fragments: ACCGT, CGTGC, TTAC, TACCGT

    --ACCGT--
    ----CGTGC
    TTAC-----
    -TACCGT--
    ---------
    TTACCGTGC    (consensus sequence, determined by vote)


Quality Metrics

The coverage at position i of the target or consensus sequence is the number of fragments that overlap that position

[Figure: fragments laid out against the target; a region with no coverage splits the layout into two contigs]


Quality Metrics

Linkage – the degree of overlap between fragments

[Figure: fragments laid out against the target with perfect coverage, but poor average linkage and poor minimum linkage]


Real World Complications

Base call errors
Chimeric fragments, contamination (e.g. from the vector)

    Base call error    Insertion error    Deletion error

    --ACCGT--          --ACC-GT--         --ACCGT--
    ----CGTGC          ----CAGTGC         ----CGTGC
    TTAC-----          TTAC------         TTAC-----
    -TGCCGT--          -TACC-GT--         -TAC-GT--
    ---------          ----------         ---------
    TTACCGTGC          TTACC-GTGC         TTACCGTGC


Unknown Orientation

A fragment can come from either strand

    Fragments: CACGT, ACGT, ACTACG, GTACT, ACTGA, CTGA

    CACGT
    -ACGT
    --CGTAGT          (reverse complement of ACTACG)
    -----AGTAC        (reverse complement of GTACT)
    --------ACTGA
    ---------CTGA


Repeats

Direct repeats

    A X B X C X D
    A X C X B X D


Repeats

Direct repeats

    A X B Y C X D Y E
    A X D Y C X B Y E


Repeats

Inverted repeats

[Figure: a repeat X appearing on both strands of the target]


Sequence Alignment Models

Shortest common superstring (SCS)
• Input: A collection, F, of strings (fragments)
• Output: A shortest possible string S such that for every f ∈ F, S is a superstring of f.

Example:
• F = {ACT, CTA, AGT}
• S = ACTAGT
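The SCS definition can be checked on the slide's example with a brute-force search. This is a minimal sketch, not an assembler: the function names are made up for illustration, and trying every ordering is only feasible for a handful of fragments.

```python
from itertools import permutations


def overlap(u, v):
    """Length of the longest suffix of u that is also a prefix of v."""
    for k in range(min(len(u), len(v)), 0, -1):
        if u.endswith(v[:k]):
            return k
    return 0


def shortest_common_superstring(fragments):
    """Try every ordering of the fragments and keep the shortest merge."""
    # Fragments contained in another fragment never change the answer.
    frags = [f for f in set(fragments)
             if not any(f != g and f in g for g in fragments)]
    best = None
    for order in permutations(frags):
        merged = order[0]
        for f in order[1:]:
            merged += f[overlap(merged, f):]  # append only the non-overlapping tail
        if best is None or len(merged) < len(best):
            best = merged
    return best


print(shortest_common_superstring(["ACT", "CTA", "AGT"]))
```

The factorial number of orderings is exactly why the later slides turn to graph formulations and heuristics.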


Problems with the SCS model

[Figure: a target containing two copies of a repeat x, and a target containing x and an inexact copy x´]

Directionality of fragments must be known
No consideration of coverage
Some simple consideration of linkage
No consideration of base call errors


Reconstruction

Deals with errors and unknown orientation

Definitions
• f is an approximate substring of S at error level ε when $d_s(f, S) \le \varepsilon |f|$
• $d_s$ = substring edit distance, with Match = 0, Mismatch = 1, Gap = 1

Reconstruction
• Input: A collection, F, of strings, and a tolerance level, ε
• Output: Shortest possible string, S, such that for every f ∈ F:

  $\min\left(d_s(f, S),\ d_s(\bar{f}, S)\right) \le \varepsilon |f|$

  where $\bar{f}$ is the reverse complement of f


Reconstruction Example

Input: F = {ATCAT, GTCG, CGAG, TACCA}, ε = 0.25

Output:

    ATGAT-----    (reverse complement of ATCAT)
    ------CGAC    (reverse complement of GTCG)
    -CGAG-----
    ----TACCA-
    ----------
    ACGATACGAC

$d_s(\text{CGAG}, \text{ACGATACGAC}) = 1 = 0.25 \times 4$

So this output is OK for ε = 0.25
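The acceptance test in the Reconstruction model can be sketched in a few lines. The code below is an illustrative assumption about how one might compute the substring edit distance (a dynamic program whose first row is all zeros, so the match may start anywhere in S); the function names are hypothetical.

```python
def substring_edit_distance(f, s):
    """Smallest edit distance between f and any substring of s
    (match = 0, mismatch = 1, gap = 1)."""
    prev = [0] * (len(s) + 1)          # f may start anywhere in s for free
    for i in range(1, len(f) + 1):
        cur = [i] + [0] * len(s)
        for j in range(1, len(s) + 1):
            cost = 0 if f[i - 1] == s[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # align f[i-1] with s[j-1]
                         prev[j] + 1,         # gap in s
                         cur[j - 1] + 1)      # gap in f
        prev = cur
    return min(prev)                   # f may end anywhere in s for free


COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def is_reconstruction(s, fragments, eps):
    """True if every fragment, in one orientation or the other,
    is an approximate substring of s at error level eps."""
    return all(
        min(substring_edit_distance(f, s),
            substring_edit_distance(reverse_complement(f), s)) <= eps * len(f)
        for f in fragments
    )


F = ["ATCAT", "GTCG", "CGAG", "TACCA"]
print(is_reconstruction("ACGATACGAC", F, 0.25))
```

Note that this only verifies a candidate S; finding the shortest such S is the hard part of the problem.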


Gaps in Reconstruction

Reconstruction allows gaps in fragments:

    AT-GA-----
    ATCGATAGAC

$d_s = 1$


Limitations of Reconstruction

Models errors and unknown orientation
Doesn’t handle repeats
Doesn’t model coverage
Only handles linkage in a very simple way
Always produces a single contig


Contigs

Sometimes you just can’t put all of the fragments together into one contiguous sequence:

[Figure: two contigs separated by a gap marked "?"]

No way to tell the order of these two contigs.
No way to tell how much sequence is missing between them.


Multicontig

Definitions
• A layout, L, is a multiple alignment of the fragments
  Columns numbered from 1 to |L|
• Endpoints of a fragment: l(f) and r(f)
• An overlap is a link if no other fragment completely covers the overlap

[Figure: two overlaps, one labeled "Link" and one labeled "Not a link"]


Multicontig

More definitions
• The size of a link is the number of overlapping positions
• The weakest link is the smallest link in the layout
• A t-contig has a weakest link of size t
• A collection, F, admits a t-contig if a t-contig can be constructed from the fragments in F

    ACGTATAGCATGA---
    --GTA-----------
    --------CATGATCA
    ACGTATAG--------
    -----------GATCA

A link of size 5


Perfect Multicontig

Input: F, and t
Output: a minimum number of collections, Ci, such that every Ci admits a t-contig

Let F = {GTAC, TAATG, TGTAA}

t = 3:

    --TAATG        GTAC
    TGTAA--

t = 1:

    TGTAA-----
    --TAATG---
    ------GTAC


Handling errors in Multicontig

The image of a fragment is the portion of the consensus sequence, S, corresponding to the fragment in the layout

S is an ε-consensus for a collection of fragments when the edit distance between each fragment, f, and its image is at most ε|f|

    Fragments: TATAGCATCAT, CGTC, CATGATCA, ACGGATAG, GTCCA
    Consensus: ACGTATAGCATGATCA

An ε-consensus for ε = 0.4


Definition of Multicontig

Input: A collection, F, of strings, an integer t ≥ 0, and an error tolerance ε between 0 and 1

Output: A partition of F into the minimum number of collections Ci such that every Ci admits a t-contig with an ε-consensus


Example of Multicontig

Let ε = 0.4, t = 3

TATAGCATCATACGTC CATGATCAGACGGATAG GTCCAGACGTATAGCATGATCAG


Algorithms

Most of the algorithms to solve the fragment assembly problem are based on a graph model
A graph, G, is a collection of edges, e, and vertices, v.
• Directed or undirected
• Weighted or unweighted

We will discuss representations and other issues shortly…

[Figure: a directed, unweighted graph]


The Maximum Overlap Graph

The text calls it an overlap multigraph
Each directed edge, (u,v), is weighted with the length of the maximal overlap between a suffix of u and a prefix of v

[Figure: overlap graph on four fragments, a = TACGA, b = ACCC, c = CTAAAG, d = GACA; one edge has weight 2 and four edges have weight 1. 0-weight edges omitted!]
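The edge weights of such a graph follow directly from the definition. The sketch below assumes the four fragments shown in the figure and the labels implied by the path dbc on the next slide; the function names are illustrative only.

```python
def overlap(u, v):
    """Length of the longest suffix of u that is also a prefix of v."""
    for k in range(min(len(u), len(v)), 0, -1):
        if u.endswith(v[:k]):
            return k
    return 0


def overlap_graph(fragments):
    """Map each directed edge (u, v) to its overlap weight, omitting 0-weight edges."""
    edges = {}
    for u, seq_u in fragments.items():
        for v, seq_v in fragments.items():
            if u != v:
                w = overlap(seq_u, seq_v)
                if w > 0:
                    edges[(u, v)] = w
    return edges


def path_to_superstring(fragments, path):
    """Merge the fragments along a path, overlapping consecutive pairs."""
    merged = fragments[path[0]]
    for prev, cur in zip(path, path[1:]):
        merged += fragments[cur][overlap(fragments[prev], fragments[cur]):]
    return merged


frags = {"a": "TACGA", "b": "ACCC", "c": "CTAAAG", "d": "GACA"}
print(overlap_graph(frags))
print(path_to_superstring(frags, ["d", "b", "c"]))
```

Computing all pairwise overlaps this way takes quadratic time in the number of fragments, which is acceptable for a sketch but not for a real sequencing project.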


Paths and Layouts

The path dbc leads to the alignment:

    GACA--------
    ---ACCC-----
    ------CTAAAG


Superstrings

Every path that covers every node is a superstring
Zero weight edges result in alignments like:

    GACA----------
    ----GCCC------
    --------TTAAAG

Higher weights produce more overlap, and thus shorter strings

The shortest common superstring is the highest weight path that covers every node


Graph formulation of SCS

Input: A weighted, directed graph
Output: The highest-weight path that touches every node of the graph

Does this problem sound familiar?


The Greedy Algorithm

    Algorithm greedy
        Sort edges in decreasing weight order
        For each edge in this order
            If the edge does not form a cycle
               and the edge does not start or end at the same node as another edge in the set
            then add the edge to the current set
        End for
    End Algorithm

Figure 4.16, page 125


Greedy Example

[Figure: a weighted overlap graph with edge weights 7, 6, 5, 4, 3, 2, 2, 2, and 1, on which the greedy algorithm is traced]


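Greedy does not always find the best path

[Figure: overlap graph on the fragments ATGC, TGCAT, and GCC, with edge weights ATGC→TGCAT = 3, TGCAT→ATGC = 2, ATGC→GCC = 2, and TGCAT→GCC = 0]

Greedy takes the weight-3 edge first and must then finish with the 0-weight edge, for a total of 3. The path TGCAT→ATGC→GCC has total weight 4.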
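The greedy edge selection can be sketched and run on these three fragments. This is an illustrative implementation under simplifying assumptions (distinct fragments, ties broken by input order, cycles detected by walking the partial path), not the textbook's code.

```python
def overlap(u, v):
    """Length of the longest suffix of u that is also a prefix of v."""
    for k in range(min(len(u), len(v)), 0, -1):
        if u.endswith(v[:k]):
            return k
    return 0


def greedy_order(frags):
    """Pick edges in decreasing weight order, keeping at most one outgoing
    and one incoming edge per fragment and never closing a cycle."""
    edges = sorted(((overlap(u, v), u, v) for u in frags for v in frags if u != v),
                   key=lambda e: e[0], reverse=True)
    succ, pred = {}, {}
    for _, u, v in edges:
        if u in succ or v in pred:
            continue                   # u already has a successor, or v a predecessor
        end = v
        while end in succ:             # follow the partial path that starts at v
            end = succ[end]
        if end == u:
            continue                   # adding u -> v would close a cycle
        succ[u], pred[v] = v, u
    start = next(f for f in frags if f not in pred)
    order = [start]
    while order[-1] in succ:
        order.append(succ[order[-1]])
    return order


def total_overlap(order):
    return sum(overlap(u, v) for u, v in zip(order, order[1:]))


frags = ["ATGC", "TGCAT", "GCC"]
greedy = greedy_order(frags)
print(greedy, total_overlap(greedy))                  # greedy path and its weight
print(total_overlap(["TGCAT", "ATGC", "GCC"]))        # weight of the better path
```

Because 0-weight edges are included in the candidate list, the selected edges always join into a single path through all fragments.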


Tools for Shotgun Sequencing


Common Difficulty

Each of these problems is a method for modeling fragment assembly
Each of these problems is NP-hard, and so believed to be intractable
How?


Embedding problems

Suppose I told you that I had found a clever way to model the TSP as a shortest common superstring problem
• Paths between cities are represented as fragments
• The shortest path is the shortest common superstring of the fragments

If this is true, then there are only two possibilities:

1. This problem is just as intractable as TSP

2. TSP is actually a tractable problem!


NP-Complete Problems

There is a collection of problems that computer scientists believe to be intractable
• TSP is one of them

Each of them has been modeled as one or more of the other NP-complete problems

If you solve one, you solve them all
A problem, p, is NP-hard if you can model one of these NP-complete problems as an instance of p


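NP-Completeness

[Figure: P drawn as a region inside NP; TSP, 3-SAT, graph 3-coloring, vertex cover, subset sum, set packing, and bin packing are placed in NP]


P = NP?

[Figure: the same problems (3-SAT, graph 3-coloring, vertex cover, subset sum, set packing, bin packing) with the regions P and NP redrawn for the case P = NP]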

Part III – Sequence Alignments

Needleman-Wunsch

Smith-Waterman

Dynamic Programming


Why align sequences?

The draft human genome is available
Automated gene finding is possible
Gene: AGTACGTATCGTATAGCGTAA
• What does it do?

One approach: Is there a similar gene in another species?
• Align sequences with known genes
• Find the gene with the “best” match


Comparing two sequences

Point mutations, easy:

    ACGTCTGATACGCCGTATAGTCTATCT
    ACGTCTGATTCGCCCTATCGTCTATCT

Indels are difficult, must align sequences:

    ACGTCTGATACGCCGTATAGTCTATCT
    CTGATTCGCATCGTCTATCT

    ACGTCTGATACGCCGTATAGTCTATCT
    ----CTGATTCGC---ATCGTCTATCT


Scoring a sequence alignment

Match score: +1
Mismatch score: +0
Gap penalty: –1

    ACGTCTGATACGCCGTATAGTCTATCT
        ||||| |||   || ||||||||
    ----CTGATTCGC---ATCGTCTATCT

Matches: 18 × (+1)
Mismatches: 2 × 0
Gaps: 7 × (–1)

Score = +11


DNA Replication

Prior to cell division, all the genetic instructions must be “copied” so that each new cell will have a complete set

DNA polymerase is the enzyme that copies DNA
• Synthesizes in the 5' to 3' direction


Over time, genes accumulate mutations

Environmental factors
• Radiation
• Oxidation
Mistakes in replication or repair
Deletions, Duplications
Insertions
Inversions
Point mutations


Deletions

Codon deletion:
    ACG ATA GCG TAT GTA TAG CCG…
• Effect depends on the protein, position, etc.
• Almost always deleterious
• Sometimes lethal

Frame shift mutation:
    ACG ATA GCG TAT GTA TAG CCG…
    ACG ATA GCG ATG TAT AGC CG?…
• Almost always lethal


Indels

Comparing two genes it is generally impossible to tell if an indel is an insertion in one gene, or a deletion in another, unless ancestry is known:

    ACGTCTGATACGCCGTATCGTCTATCT
    ACGTCTGAT---CCGTATCGTCTATCT


Origination and length penalties

We want to find alignments that are evolutionarily likely.
Which of the following alignments seems more likely to you?

    ACGTCTGATACGCCGTATAGTCTATCT
    ACGTCTGAT-------ATAGTCTATCT

    ACGTCTGATACGCCGTATAGTCTATCT
    AC-T-TGA--CG-CGT-TA-TCTATCT

We can achieve this by penalizing more for a new gap than for extending an existing gap


Scoring a sequence alignment (2)

Match/mismatch score: +1/+0
Origination/length penalty: –2/–1

    ACGTCTGATACGCCGTATAGTCTATCT
        ||||| |||   || ||||||||
    ----CTGATTCGC---ATCGTCTATCT

Matches: 18 × (+1)
Mismatches: 2 × 0
Origination: 2 × (–2)
Length: 7 × (–1)

Score = +7
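Both scoring schemes can be reproduced with one small function. This is a sketch with hypothetical parameter names: `gap_open` is the origination penalty charged once per run of gaps, and `gap_extend` is the length penalty charged per gap position.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap_open=0, gap_extend=-1):
    """Score two aligned strings of equal length, with '-' marking gaps."""
    assert len(top) == len(bottom)
    score = 0
    current_gap = None                 # which row the running gap is in, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            if row != current_gap:
                score += gap_open      # a new gap starts here
            score += gap_extend
            current_gap = row
        else:
            current_gap = None
            score += match if a == b else mismatch
    return score


top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
print(score_alignment(top, bottom))                # simple gap penalty
print(score_alignment(top, bottom, gap_open=-2))   # origination + length penalties
```

With `gap_open=0` the function reduces to the simple per-position gap penalty of the earlier slide.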


How can we find an optimal alignment?

Finding the alignment is computationally hard:

    ACGTCTGATACGCCGTATAGTCTATCT
    CTGAT---TCG-CATCGTC--T-ATCT

C(27,7) gap positions = ~888,000 possibilities
It’s possible, as long as we don’t repeat our work!
Dynamic programming: The Needleman & Wunsch algorithm


What is the optimal alignment?

    ACTCG
    ACAGTAG

Match: +1
Mismatch: 0
Gap: –1


Needleman-Wunsch: Step 1

Each sequence along one axis
Gap penalty multiples in first row/column
0 in [1,1] (or [0,0] for the CS-minded)

            A   C   T   C   G
        0  -1  -2  -3  -4  -5
    A  -1   1
    C  -2
    A  -3
    G  -4
    T  -5
    A  -6
    G  -7


Needleman-Wunsch: Step 2

Vertical/Horiz. move: Score + (simple) gap penalty
Diagonal move: Score + match/mismatch score
Take the MAX of the three possibilities

            A   C   T   C   G
        0  -1  -2  -3  -4  -5
    A  -1   1
    C  -2
    A  -3
    G  -4
    T  -5
    A  -6
    G  -7


Needleman-Wunsch: Step 2 (cont’d)

Fill out the rest of the table likewise…

            a   c   t   c   g
        0  -1  -2  -3  -4  -5
    a  -1   1   0  -1  -2  -3
    c  -2
    a  -3
    g  -4
    t  -5
    a  -6
    g  -7


Needleman-Wunsch: Step 2 (cont’d)

Fill out the rest of the table likewise…

The optimal alignment score is calculated in the lower-right corner

            a   c   t   c   g
        0  -1  -2  -3  -4  -5
    a  -1   1   0  -1  -2  -3
    c  -2   0   2   1   0  -1
    a  -3  -1   1   2   1   0
    g  -4  -2   0   1   2   2
    t  -5  -3  -1   1   1   2
    a  -6  -4  -2   0   1   1
    g  -7  -5  -3  -1   0   2


But what is the optimal alignment?

[Table: the completed score table for actcg and acagtag, repeated from the previous slide]

To reconstruct the optimal alignment, we must determine where the MAX at each step came from…


A path corresponds to an alignment

↓ = GAP in top sequence
→ = GAP in left sequence
↘ = ALIGN both positions

One path from the previous table:
Corresponding alignment (start at the end):

    AC--TCG
    ACAGTAG

Score = +2
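The table fill and the traceback can be sketched together. This is a minimal illustration of the Needleman-Wunsch procedure described on these slides, with the slide's scores as defaults; when several optimal alignments exist, the tie-breaking order (diagonal, then vertical, then horizontal) is an arbitrary choice, so the alignment printed may differ from the one shown above while having the same score.

```python
def needleman_wunsch(s, t, match=1, mismatch=0, gap=-1):
    """Global alignment of s (rows) and t (columns); returns (score, aligned_s, aligned_t)."""
    m, n = len(s), len(t)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap              # first column: multiples of the gap penalty
    for j in range(1, n + 1):
        F[0][j] = j * gap              # first row: multiples of the gap penalty
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,  # diagonal move
                          F[i - 1][j] + gap,      # vertical move
                          F[i][j - 1] + gap)      # horizontal move
    # Traceback: start at the lower-right corner and ask where each MAX came from.
    a, b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return F[m][n], "".join(reversed(a)), "".join(reversed(b))


score, top, bottom = needleman_wunsch("ACTCG", "ACAGTAG")
print(score)
print(top)
print(bottom)
```

Filling the table takes time and space proportional to the product of the two sequence lengths, compared with the exponential number of alignments that would otherwise need to be enumerated.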


Practice Problem

Find an optimal alignment for these two sequences:

    GCGGTT
    GCGT

Match: +1
Mismatch: 0
Gap: –1

            g   c   g   g   t   t
        0  -1  -2  -3  -4  -5  -6
    g  -1
    c  -2
    g  -3
    t  -4


Practice Problem

Find an optimal alignment for these two sequences:

    GCGGTT
    GCGT

            g   c   g   g   t   t
        0  -1  -2  -3  -4  -5  -6
    g  -1   1   0  -1  -2  -3  -4
    c  -2   0   2   1   0  -1  -2
    g  -3  -1   1   3   2   1   0
    t  -4  -2   0   2   3   3   2

    GCGGTT
    GCG-T-

Score = +2


Semi-global alignment

Suppose we are aligning:

    GCG
    GGCG

            g   c   g
        0  -1  -2  -3
    g  -1   1   0  -1
    g  -2   0   1   1
    c  -3  -1   1   1
    g  -4  -2   0   2

Which do you prefer?

    G-CG      -GCG
    GGCG      GGCG

Semi-global alignment allows gaps at the ends for free.


Semi-global alignment

            g   c   g
        0   0   0   0
    g   0   1   0   1
    g   0   1   1   1
    c   0   0   2   1
    g   0   1   1   3

Semi-global alignment allows gaps at the ends for free.

Initialize first row and column to all 0’s
Allow free horizontal/vertical moves in last row and column
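The two changes relative to global alignment can be sketched as follows. This is a minimal illustration, assuming the slide's scores; the free moves in the last row and column are modeled by taking the best value found anywhere in them.

```python
def semiglobal_table(s, t, match=1, mismatch=0, gap=-1):
    """Score table for aligning s (rows) with t (columns) with free end gaps."""
    m, n = len(s), len(t)
    F = [[0] * (n + 1) for _ in range(m + 1)]    # first row and column stay 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    return F


def semiglobal_score(s, t, **scores):
    F = semiglobal_table(s, t, **scores)
    # Trailing gaps are free, so the alignment may end anywhere in the
    # last row or the last column.
    return max(max(F[-1]), max(row[-1] for row in F))


for row in semiglobal_table("GGCG", "GCG"):
    print(row)
print(semiglobal_score("GGCG", "GCG"))
```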


Local alignment

Global alignments – score the entire alignment
Semi-global alignments – allow unscored gaps at the beginning or end of either sequence
Local alignment – find the best matching subsequence

    CGATG
    AAATGGA

This is achieved by allowing a 4th alternative at each position in the table: zero.


Local alignment

Mismatch = –1 this time

    CGATG
    AAATGGA

            c   g   a   t   g
        0  -1  -2  -3  -4  -5
    a  -1   0   0   0   0   0
    a  -2   0   0   1   0   0
    a  -3   0   0   1   0   0
    t  -4   0   0   0   2   1
    g  -5   0   1   0   1   3
    g  -6   0   1   0   0   2
    a  -7   0   0   2   1   1
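A local alignment can be sketched by adding the zero alternative and starting the traceback at the largest cell. Note one assumption that differs from the table border shown on the slide: this sketch initializes the first row and column to 0, as in the usual Smith-Waterman formulation, so a few interior cells differ from the slide's table while the best score (3, for ATG against ATG) is the same.

```python
def smith_waterman(s, t, match=1, mismatch=-1, gap=-1):
    """Local alignment of s (rows) and t (columns); returns (score, aligned_s, aligned_t)."""
    m, n = len(s), len(t)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(0,                       # the 4th alternative: start over
                          H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, bi, bj = H[i][j], i, j
    # Trace back from the best cell until a zero is reached.
    a, b = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))


print(smith_waterman("AAATGGA", "CGATG"))
```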


Optimal Substructure in Alignments

Consider the alignment:

    ACGTCTGATACGCCGTATAGTCTATCT
        ||||| |||   || ||||||||
    ----CTGATTCGC---ATCGTCTATCT

Is it true that the alignment in the boxed region must be optimal?


A Greedy Strategy

Consider this pair of sequences

    GAGC
    CAGC

Gap = –1
Match = +1
Mismatch = –2

Greedy approach: choose the best-scoring first column

    G      G      -
    C  or  -  or  C

Leads to:          Better:

    GAGC---            GAGC
    ---CAGC            CAGC


Breaking apart the problem

Suppose we are aligning:

    ACTCG
    ACAGTAG

First position choices:

    A        CTCG
    A   +1 + CAGTAG

    A        CTCG
    -   –1 + ACAGTAG

    -        ACTCG
    A   –1 + CAGTAG


A Recursive Approach to Alignment

Choose the best alignment based on these three possibilities:

    align(seq1, seq2) {
        if (both sequences empty) { return 0; }
        if (one string empty) {
            return (gapscore * num chars in nonempty seq);
        } else {
            // align the two first characters, or put a gap opposite one of them
            score1 = score(firstchar(seq1), firstchar(seq2)) + align(tail(seq1), tail(seq2));
            score2 = align(tail(seq1), seq2) + gapscore;
            score3 = align(seq1, tail(seq2)) + gapscore;
            return max(score1, score2, score3);   // higher scores are better
        }
    }


Time Complexity of RecurseAlign

What is the recurrence equation for the time needed by RecurseAlign?

$T(n) = 3\,T(n-1) + 3$

[Figure: recursion tree of depth n in which every call makes 3 further calls; successive levels contain 3, 9, 27, …, $3^n$ calls]


RecurseAlign repeats its work

[Figure: a grid with the sequence ACGTATCGCGTATA along the top and GATGCTCTCG down the side, showing the same subproblems being reached repeatedly]


Dynamic Programming

Remember all the subproblem answers along the way:

This is possible for any problem that exhibits optimal substructure

            a   c   t   c   g
        0  -1  -2  -3  -4  -5
    a  -1   1   0  -1  -2  -3
    c  -2   0   2   1   0  -1
    a  -3  -1   1   2   1   0
    g  -4  -2   0   1   2   2
    t  -5  -3  -1   1   1   2
    a  -6  -4  -2   0   1   1
    g  -7  -5  -3  -1   0   2


Saving Space

Note that we can throw away the previous rows of the table as we fill it in:

            a   c   t   c   g
        0  -1  -2  -3  -4  -5
    a  -1   1   0  -1  -2  -3
    c  -2   0   2   1   0  -1
    a  -3  -1   1   2   1   0
    g  -4  -2   0   1   2   2
    t  -5  -3  -1   1   1   2
    a  -6  -4  -2   0   1   1
    g  -7  -5  -3  -1   0   2

Each row is based only on the row directly above it


Saving Space (2)

Each row of the table contains the scores for aligning a prefix of the left-hand sequence with all prefixes of the top sequence:

            a   c   t   c   g
        0  -1  -2  -3  -4  -5
    a  -1   1   0  -1  -2  -3
    c  -2   0   2   1   0  -1
    a  -3  -1   1   2   1   0    ← scores for aligning aca with all prefixes of actcg
    g  -4  -2   0   1   2   2
    t  -5  -3  -1   1   1   2
    a  -6  -4  -2   0   1   1
    g  -7  -5  -3  -1   0   2
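The row-at-a-time idea can be sketched by keeping only the previous and current rows. The function name is illustrative; it returns the final row, which holds the scores for aligning all of s with every prefix of t.

```python
def nw_last_row(s, t, match=1, mismatch=0, gap=-1):
    """Scores for aligning all of s with every prefix of t, using two rows of memory."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        cur = [i * gap]
        for j, b in enumerate(t, start=1):
            sub = match if a == b else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur                     # the older row is no longer needed
    return prev


print(nw_last_row("ACA", "ACTCG"))          # the row for the prefix aca
print(nw_last_row("ACAGTAG", "ACTCG")[-1])  # the optimal global score
```

This gives the optimal score in linear space, but not the alignment itself; recovering the alignment without the full table is what the divide and conquer approach on the following slides addresses.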


Divide and Conquer

By using a recursive approach, we can use only two rows of the matrix at a time:
• Choose the middle character of the top sequence, i
• Find out where i aligns to the bottom sequence
  Needs two vectors of scores
• Recursively align the sequences before and after the fixed positions

    ACGCTAT G CTCATAG
            i
    CGACGCTCATCG


Finding where i lines up

Find out where i aligns to the bottom sequence
Needs two vectors of scores

Assuming i lines up with a character:

    alignscore = align(ACGCTAT, prefix(t)) + score(G, char from t) + align(CTCATAG, suffix(t))

Which character is best?
• Can quickly find out the score for aligning ACGCTAT with every prefix of t.

    s: ACGCTAT G CTCATAG
               i
    t: CGACGCTCATCG


Finding where i lines up

But, i may also line up with a gap

Assuming i lines up with a gap:

    alignscore = align(ACGCTAT, prefix(t)) + gapscore + align(CTCATAG, suffix(t))

    s: ACGCTAT G CTCATAG
               i
    t: CGACGCTCATCG


Recursive Call

Fix the best position for i
Call align recursively for the prefixes and suffixes:

    s: ACGCTAT G CTCATAG
               i
    t: CGACGCTCATCG


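Complexity

Let len(s) = m and len(t) = n
Space: 2m
Time:
• Each call to build similarity vector = m´n´
• First call + recursive call:

$T(m, n) = mn + T\left(\tfrac{m}{2}, j\right) + T\left(\tfrac{m}{2}, n - j\right) = mn + \tfrac{m}{2}\,j + \tfrac{m}{2}\,(n - j) + \dots = mn + \tfrac{mn}{2} + \dots \le 2mn$

    s: ACGCTAT G CTCATAG
               i
    t: CGACGCTCATCG      (j = position in t where i lines up)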


General Gap Penalties

Suppose we are no longer using simple gap penalties:
• Origination = −2
• Length = −1

Consider the last position of the alignment for ACGTA with ACG

We can’t determine the score for

    G        -
    -   or   G

unless we know the previous positions!


Scoring Blocks

Now we must score a block at a time

A block is a pair of characters, or a maximal group of gaps paired with characters

To score a position, we need to either start a new block or add it to a previous block

A A C --- A TATCCG A C T AC

A C T ACC T ------ C G C --


The Algorithm

Three tables
• a – scores for alignments ending in char-char blocks
• b – scores for alignments ending in gaps in the top sequence (s)
• c – scores for alignments ending in gaps in the left sequence (t)

Scores no longer depend on only three positions, because we can put any number of gaps into the last block


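The Recurrences

$a[i,j] = p(i,j) + \max\{\, a[i-1,j-1],\ b[i-1,j-1],\ c[i-1,j-1] \,\}$

$b[i,j] = \max\{\, a[i,j-k] - w(k)\ \text{for } 1 \le k \le j,\quad c[i,j-k] - w(k)\ \text{for } 1 \le k \le j \,\}$

$c[i,j] = \max\{\, a[i-k,j] - w(k)\ \text{for } 1 \le k \le i,\quad b[i-k,j] - w(k)\ \text{for } 1 \le k \le i \,\}$

Here p(i,j) is the match/mismatch score for the characters at positions i and j, and w(k) is the penalty for a gap of length k.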


The Optimal Alignment

The optimal alignment is found by looking at the maximal value in the lower right of all three arrays

The algorithm runs in $O(n^3)$ time
• Uses $O(n^2)$ space
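The three-table algorithm can be sketched as below. Several details are assumptions made for illustration: w(k) is passed in as a function returning a non-negative penalty for a gap of length k, unreachable cells are set to minus infinity, and the function only returns the score, not the alignment.

```python
NEG_INF = float("-inf")


def general_gap_score(s, t, w, match=1, mismatch=0):
    """Optimal global alignment score with an arbitrary gap penalty function w(k).

    a[i][j]: best score ending in a char-char block
    b[i][j]: best score ending in a block of gaps in s
    c[i][j]: best score ending in a block of gaps in t
    """
    m, n = len(s), len(t)
    a = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    b = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    c = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    a[0][0] = 0
    for j in range(1, n + 1):
        b[0][j] = -w(j)                # t[:j] aligned against one block of gaps
    for i in range(1, m + 1):
        c[i][0] = -w(i)                # s[:i] aligned against one block of gaps
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = p + max(a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1])
            # A gap block of any length k may end here, hence the inner loops.
            b[i][j] = max(max(a[i][j - k], c[i][j - k]) - w(k) for k in range(1, j + 1))
            c[i][j] = max(max(a[i - k][j], b[i - k][j]) - w(k) for k in range(1, i + 1))
    return max(a[m][n], b[m][n], c[m][n])


# Origination 2 plus 1 per gap position, as on the "General Gap Penalties" slide.
print(general_gap_score("ACGTA", "ACG", lambda k: 2 + k))
```

The inner loops over k are what raise the running time from quadratic to cubic. With a linear penalty w(k) = k the result coincides with the simple Needleman-Wunsch score.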

Part IV – Database Searches

BLAST

Search statistics


Database Searching

How can we find a particular short sequence in a database of sequences (or one HUGE sequence)?

Problem is identical to local sequence alignment, but on a much larger scale.

We must also have some idea of the significance of a database hit.
• Databases always return some kind of hit, how much attention should be paid to the result?


BLAST

BLAST – Basic Local Alignment Search Tool
A heuristic approximation of optimal local alignment (the Smith-Waterman algorithm)
Sacrifices some search sensitivity for speed


Scoring Matrices

DNA
• Identity
• Transition/Transversion

Proteins
• PAM
• BLOSUM

        A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
    A   2
    R  -2   6
    N   0   0   2
    D   0  -1   2   4
    C  -2  -4  -4  -5   4
    Q   0   1   1   2  -5   4
    E   0  -1   1   3  -5   2   4
    G   1  -3   0   1  -3  -1   0   5
    H  -1   2   2   1  -3   3   1  -2   6
    I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5
    L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6
    K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5
    M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
    F  -4  -4  -4  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
    P   1   0  -1  -1  -3   0  -1  -1   0  -2  -3  -1  -2  -5   6
    S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   3
    T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -2   0   1   3
    W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
    Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
    V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6   2   4


The BLAST algorithm

Break the search sequence into words
• W = 3 for proteins, W = 12 for DNA

    MCGPFILGTYC
    MCG, CGP, GPF, PFI, FIL, ILG, LGT, GTY, TYC

Include in the search all words that score above a certain value (T) for any search word

    MCG   CGP   …
    MCT   MGP   …
    MCN   CTP   …
    …     …

This list can be computed in linear time
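The word list and the neighborhood of a word can be sketched as follows. The scoring function, alphabet, and threshold in the second half are toy assumptions chosen to keep the example small (a real protein search would score candidates with a PAM or BLOSUM matrix), and enumerating every candidate word is the naive approach rather than the linear-time construction the slide mentions. Whether the threshold is inclusive is also an assumption here.

```python
from itertools import product


def query_words(query, w=3):
    """All overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def neighborhood(word, alphabet, score, threshold):
    """Every word over the alphabet that scores at least `threshold` against `word`."""
    return ["".join(candidate)
            for candidate in product(alphabet, repeat=len(word))
            if sum(score(x, y) for x, y in zip(word, candidate)) >= threshold]


print(query_words("MCGPFILGTYC"))


# Toy DNA example: +2 for a match, -1 for a mismatch (hypothetical values).
def toy_score(x, y):
    return 2 if x == y else -1


print(neighborhood("ACG", "ACGT", toy_score, threshold=3))
```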


The BLAST Algorithm (2)

Search for the words in the database
• Word locations can be precomputed and indexed
• Searching for a short string in a long string
  Regular expression matching: FSA

HSP (High Scoring Pair) = A match between a query word and the database

Find a “hit”: Two non-overlapping HSP’s on a diagonal within distance A

Extend the hit until the score falls below a threshold value, X


Results from a BLAST search


Search Significance Scores

A search will always return some hits.

How can we determine how “unusual” a particular alignment score is?
• ORF’s

Assumptions


Assessing significance requires a distribution

I have an apple of diameter 5”. Is that unusual?

[Figure: histogram of apple diameters; x-axis: Diameter (cm), y-axis: Frequency]


Is a match significant?

Match scores for aligning my sequence with random sequences.
Depends on:
• Scoring system
• Database
• Sequence to search for
  Length
  Composition

How do we determine the random sequences?

[Figure: histogram of match scores; x-axis: Match score, y-axis: Frequency]


Generating “random” sequences

Random uniform model:
P(G) = P(A) = P(C) = P(T) = 0.25
• Doesn’t reflect nature

Use sequences from a database
• Might have genuine homology
  We want unrelated sequences

Random shuffling of sequences
• Preserves composition
• Removes true homology


What distribution do we expect to see?

The mean of n random (i.i.d.) events tends towards a Gaussian distribution.
• Example: Throw n dice and compute the mean.
• Distribution of means:

[Figure: histograms of the mean of n dice for n = 2 and n = 1000]


The extreme value distribution

This means that if we get the match scores for

our sequence with n other sequences, the mean would follow a Gaussian distribution.

The maximum of n (i.i.d.) random events tends towards the extreme value distribution as n grows large.


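Comparing distributions

Extreme Value:

$f(x) = \frac{1}{\sigma}\, e^{-\frac{x-\mu}{\sigma}}\, e^{-e^{-\frac{x-\mu}{\sigma}}}$

Gaussian:

$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$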


Determining P-values

If we can estimate μ and σ, then we can determine, for a given match score x, the probability that a random match with score x or greater would have occurred in the database.

For sequence matches, a scoring system and database can be parameterized by two parameters, K and λ, related to μ and σ.
• It would be nice if we could compare hit significance without regard to the database and scoring system used!


Bit Scores

The expected number of hits with score at least S is:

$E = K m n\, e^{-\lambda S}$

• Where m and n are the sequence lengths

Normalize the raw score using:

$S' = \frac{\lambda S - \ln K}{\ln 2}$

Obtains a “bit score” S’, with a standard set of units.

The new E-value is:

$E = m n\, 2^{-S'}$


P values and E values

BLAST reports E-values
E = 5, E = 10 versus P = 0.993 and P = 0.99995
When E < 0.01, P-values and E-values are nearly identical
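The relationships between raw scores, bit scores, E-values, and P-values can be checked numerically. In this sketch the values of K, λ, m, n, and S are made up purely for illustration, and the conversion P = 1 − e^(−E) is the standard one for the probability of at least one hit, which reproduces the figures quoted above.

```python
import math


def e_value(raw_score, m, n, K, lam):
    """Expected number of hits scoring at least raw_score: E = K m n e^(-lambda S)."""
    return K * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, K, lam):
    """Normalized score S' = (lambda S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value_from_bits(bits, m, n):
    """E = m n 2^(-S')."""
    return m * n * 2 ** (-bits)


def p_value(e):
    """Probability of at least one hit when E hits are expected."""
    return 1 - math.exp(-e)


# Hypothetical parameters, chosen only to exercise the formulas.
K, lam, m, n, S = 0.1, 0.3, 100, 1_000_000, 50
print(e_value(S, m, n, K, lam))
print(e_value_from_bits(bit_score(S, K, lam), m, n))   # same value via the bit score
print(p_value(5), p_value(10), p_value(0.001))
```

Because the bit score absorbs K and λ, two hits from different databases or scoring systems can be compared using only S’ and the search space size.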


BLAST parameters

Lowering the neighborhood word threshold (T) allows more distantly related sequences to be found, at the expense of increased noise in the results set.

Raising the segment extension cutoff (X) returns longer extensions for each hit.

Changing the E-value cutoff changes the threshold for reporting a hit.

Optional – Phylogenies

Preliminaries

Distance-based methods

Parsimony Methods


Phylogenetic Trees

Hypothesis about the relationship between organisms
Can be rooted or unrooted

[Figure: a rooted tree on species A–E with the root marked and a Time axis, alongside an unrooted tree on the same species]


Tree proliferation

$N_R = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$        $N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$

    Species   Number of Rooted Trees           Number of Unrooted Trees
    2         1                                1
    3         3                                1
    4         15                               3
    5         105                              15
    10        34,459,425                       2,027,025
    15        213,458,046,676,875              7,905,853,580,625
    20        8,200,794,532,637,891,559,375    221,643,095,476,699,771,875
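The two formulas are easy to evaluate directly, which also makes the growth rate concrete. This is a small sketch; the function names are illustrative, and the unrooted count for two species is handled as a special case because the formula needs n ≥ 3.

```python
from math import factorial


def rooted_trees(n):
    """Number of rooted bifurcating trees on n species: (2n-3)! / (2^(n-2) (n-2)!)."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_trees(n):
    """Number of unrooted bifurcating trees on n species: (2n-5)! / (2^(n-3) (n-3)!)."""
    if n == 2:
        return 1                       # a single branch joining the two species
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in (2, 3, 4, 5, 10, 15, 20):
    print(n, f"{rooted_trees(n):,}", f"{unrooted_trees(n):,}")
```

The number of unrooted trees on n species equals the number of rooted trees on n − 1 species, since rooting an unrooted tree amounts to attaching one more leaf.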


Molecular phylogenetics

Specific genomic sequence variations (alleles) are much more reliable than phenotypic characteristics

More than one gene should be considered


An ongoing dialectic

Pheneticists tend to prefer distance based metrics, as they emphasize relationships among data sets, rather than the paths they have taken to arrive at their current states.

Cladists are generally more interested in evolutionary pathways, and tend to prefer more evolutionarily based approaches such as maximum parsimony.


Distance matrix methods

    Species   A     B     C     D
    B         9     –     –     –
    C         8     11    –     –
    D         12    15    10    –
    E         15    18    13    5


UPGMA

Similar to average-link clustering
Merge the closest two groups
• Replace the distances for the new, merged group with the average of the distance for the previous two groups

Repeat until all species are joined


UPGMA Step 1

    Species   A     B     C     D
    B         9     –     –     –
    C         8     11    –     –
    D         12    15    10    –
    E         15    18    13    5

Merge D & E

    Species   A      B      C
    B         9      –      –
    C         8      11     –
    DE        13.5   16.5   11.5

[Figure: partial tree joining D and E]


UPGMA Step 2

    Species   A      B      C
    B         9      –      –
    C         8      11     –
    DE        13.5   16.5   11.5

Merge A & C

    Species   B      AC
    AC        10     –
    DE        16.5   12.5

[Figure: partial trees joining D with E and A with C]


UPGMA Steps 3 & 4

    Species   B      AC
    AC        10     –
    DE        16.5   12.5

Merge B & AC

Merge ABC & DE

[Figure: final tree in which A and C are joined first, then B, with D and E forming the other branch]

(((A,C),B),(D,E))
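The merging procedure can be sketched on the distance matrix from these slides. One caveat is worth stating as an assumption: this sketch follows the rule exactly as written on the slide (the new distance is the plain average of the two merged groups' distances), whereas strict UPGMA weights that average by the number of species in each group. The group naming and the returned structure are illustrative choices.

```python
def average_link_tree(names, dist):
    """Repeatedly merge the two closest groups.

    dist maps frozenset({x, y}) to the distance between groups x and y.
    Returns the final tree as a nested string and the list of merges.
    """
    d = dict(dist)
    groups = list(names)
    merges = []
    while len(groups) > 1:
        pairs = [(x, y) for i, x in enumerate(groups) for y in groups[i + 1:]]
        a, b = min(pairs, key=lambda p: d[frozenset(p)])
        merged = f"({a},{b})"
        for g in groups:
            if g not in (a, b):
                # Slide's rule: average of the two previous groups' distances.
                d[frozenset((merged, g))] = (d[frozenset((a, g))] + d[frozenset((b, g))]) / 2
        merges.append((a, b, d[frozenset((a, b))]))
        groups = [g for g in groups if g not in (a, b)] + [merged]
    return groups[0], merges


distances = {
    frozenset("AB"): 9, frozenset("AC"): 8, frozenset("BC"): 11,
    frozenset("AD"): 12, frozenset("BD"): 15, frozenset("CD"): 10,
    frozenset("AE"): 15, frozenset("BE"): 18, frozenset("CE"): 13, frozenset("DE"): 5,
}
tree, merges = average_link_tree("ABCDE", distances)
print(tree)
for step in merges:
    print(step)
```

The printed tree groups the species the same way as the slide's result, with the children of each node possibly listed in a different order.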


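Parsimony approaches

Belong to the broader class of character based methods of phylogenetics
Emphasize simpler, and thus more likely evolutionary pathways

    I:  GCGGACG
    II: GTGGACG

[Figure: two candidate histories for the C/T difference between I and II. In the first, the common ancestor is C or T; in the second, an earlier ancestor A precedes the (C or T) ancestor, which requires more substitutions]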

Informative and uninformative sites

           Position
    Seq    1  2  3  4  5  6
    1      G  G  G  G  G  G
    2      G  G  G  A  G  T
    3      G  G  A  T  A  G
    4      G  A  T  C  A  T

For positions 5 & 6, it is possible to select more parsimonious trees – those that invoke fewer substitutions.
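Informative sites can be picked out mechanically. The sketch below uses the usual criterion, stated here as an assumption since the slide does not spell it out: a site is informative for parsimony when at least two different nucleotides each occur in at least two sequences.

```python
from collections import Counter


def informative_sites(sequences):
    """1-based positions where at least two nucleotides each occur at least twice."""
    sites = []
    for position, column in enumerate(zip(*sequences), start=1):
        counts = Counter(column)
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            sites.append(position)
    return sites


seqs = ["GGGGGG",
        "GGGAGT",
        "GGATAG",
        "GATCAT"]
print(informative_sites(seqs))
```

Restricting the tree search to informative sites reduces the work, because every tree requires the same number of substitutions at the uninformative ones.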


Parsimony methods

Enumerate all possible trees
Note the number of substitution events invoked by each possible tree
• Can be weighted by transition/transversion probabilities, etc.

Select the most parsimonious


Branch and Bound methods

Key problem – number of possible trees grows enormous as the number of species gets large
Branch and bound – a technique that allows large numbers of candidate trees to be rapidly disregarded

Requires a “good guess” at the cost of the best tree


Branch and Bound for TSP

Find a minimum cost round-trip path that visits each intermediate city exactly once
NP-complete
Greedy approach: A,G,E,F,B,D,C,A = 251

[Figure: weighted graph on the cities A–G; edge weights include A–G = 20, A–B = 46, and A–C = 93]


Search all possible paths

[Figure: the weighted city graph from the previous slide, shown beside the search tree]

    All paths
        AG (20)
            AGF (88)
                AGFB
                AGFE
                AGFC
            AGE (55)
        AB (46)
        AC (93)
            ACB (175)
                ACBE (257)
            ACD
            ACF

Best estimate: 251
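The pruning idea can be sketched on a small example. The four-city distance matrix below is hypothetical (the weights of the seven-city graph on the slide are not fully recoverable from the text), and the function name and return format are illustrative.

```python
def tsp_branch_and_bound(dist, bound=float("inf")):
    """Cheapest round trip starting and ending at city 0.

    `bound` is the cost of the best tour known so far (for example from a
    greedy tour); any partial path that already costs at least this much
    is abandoned without being extended.
    """
    n = len(dist)
    best = {"cost": bound, "tour": None}

    def extend(path, cost):
        if cost >= best["cost"]:
            return                     # prune: cannot beat the best estimate
        if len(path) == n:
            total = cost + dist[path[-1]][path[0]]   # close the round trip
            if total < best["cost"]:
                best["cost"], best["tour"] = total, path[:]
            return
        for city in range(n):
            if city not in path:
                path.append(city)
                extend(path, cost + dist[path[-2]][city])
                path.pop()

    extend([0], 0)
    return best["cost"], best["tour"]


# Hypothetical symmetric distances between four cities.
distances = [
    [0, 10, 15, 20],
    [10, 0, 35, 25],
    [15, 35, 0, 30],
    [20, 25, 30, 0],
]
print(tsp_branch_and_bound(distances))
```

If an initial bound is supplied and no strictly cheaper tour exists, the returned tour is None, signalling that the tour which produced the bound is already optimal. The tighter the initial estimate, the more branches are cut early, which is why a good starting guess matters.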


Parsimony – Branch and Bound

Use the UPGMA tree for an initial best estimate of the minimum cost (most parsimonious) tree
Use branch and bound to explore all feasible trees
Replace the best estimate as better trees are found
Choose the most parsimonious


Parsimony example

           Position
    Seq    1  2  3  4  5  6
    1      G  G  G  G  G  G
    2      G  G  G  A  G  T
    3      G  G  A  T  A  G
    4      G  A  T  C  A  T

Position 5:

    All trees
        (1,2) [0]    (1,3) [1]    (1,4) [1]

Etc.

Part V – Protein Structure

Preliminaries

Lattice Models

Protein Folding Algorithms

Illustrations from: C Branden and J Tooze, Introduction to Protein Structure, 2nd ed. Garland Pub. ISBN 0815302703


The many functions of proteins

Mechanoenzymes: myosin, actin
Rhodopsin: allows vision
Globins: transport oxygen
Antibodies: immune system
Enzymes: pepsin, renin, carboxypeptidase A
Receptors: transmit messages through membranes
Vitellogenin: molecular velcro
• And hundreds of thousands more…


Proteins are chains of amino acids

Polymer – a molecule composed of repeating units


Amino acid composition

Basic Amino Acid Structure:
• The side chain, R, varies for each of the 20 amino acids

[Figure: a central carbon bonded to a hydrogen, an amino group, a carboxyl group, and the side chain R]


The Peptide Bond

Dehydration synthesis
Repeating backbone: N–Cα–C–N–Cα–C
• Convention – start at amino terminus and proceed to carboxy terminus


Peptidyl polymers

A few amino acids in a chain are called a polypeptide. A protein is usually composed of 50 to 400+ amino acids.

Since part of the amino acid is lost during dehydration synthesis, we call the units of a protein amino acid residues.

[Figure: peptide chain with the carbonyl carbon and amide nitrogen labeled]


Side chain properties

Recall that the electronegativity of carbon is at about the middle of the scale for light elements
• Carbon does not make hydrogen bonds with water easily – hydrophobic
• O and N are generally more likely than C to h-bond to water – hydrophilic

We group the amino acids into three general groups:
• Hydrophobic
• Charged (positive/basic & negative/acidic)
• Polar


The Hydrophobic Amino Acids

Proline severely limits allowable conformations!


The Charged Amino Acids


The Polar Amino Acids


More Polar Amino Acids

And then there’s…


Planarity of the peptide bond

Phi (φ) – the angle of rotation about the N–Cα bond.

Psi (ψ) – the angle of rotation about the Cα–C bond.

The planar bond angles and bond lengths are fixed.


Phi and psi

φ = ψ = 180° is extended conformation

φ: Cα to N–H
ψ: C=O to Cα

[Figure: backbone segment with Cα, C=O, and N–H labeled]


The Ramachandran Plot

G. N. Ramachandran – first calculations of sterically allowed regions of phi and psi

Note the structural importance of glycine

[Figure: Ramachandran plots. Panels: Calculated; Observed (non-glycine); Observed (glycine)]


Primary & Secondary Structure

Primary structure = the linear sequence of amino acids comprising a protein: AGVGTVPMTAYGNDIQYYGQVT…

Secondary structure
• Regular patterns of hydrogen bonding in proteins result in two patterns that emerge in nearly every protein structure known: the α-helix and the β-sheet
• The location and direction of these periodic, repeating structures is known as the secondary structure of the protein


The alpha helix

[Figure: alpha helix, backbone angle label 60°]


Properties of the alpha helix

Hydrogen bonds between C=O of residue n and NH of residue n+4
3.6 residues/turn
1.5 Å/residue rise
100°/residue turn


Properties of α-helices

4 – 40+ residues in length
Often amphipathic or “dual-natured”
• Half hydrophobic and half hydrophilic
• Mostly when surface-exposed
If we examine many α-helices, we find trends…
• Helix formers: Ala, Glu, Leu, Met
• Helix breakers: Pro, Gly, Tyr, Ser


The beta strand (& sheet)

[Figure: beta strand, φ ≈ −135°, ψ ≈ +135°]


Properties of beta sheets

Formed of stretches of 5-10 residues in extended conformation
Pleated – each Cα a bit above or below the previous
Parallel/antiparallel, contiguous/non-contiguous


Parallel and anti-parallel β-sheets

Anti-parallel is slightly energetically favored

[Figure: anti-parallel and parallel β-sheets]


Turns and Loops

Secondary structure elements are connected by regions of turns and loops

Turns – short regions of non-α, non-β conformation

Loops – larger stretches with no secondary structure. Often disordered.
• “Random coil”
• Sequences vary much more than secondary structure regions


Levels of Protein Structure

Secondary structure elements combine to form tertiary structure

Quaternary structure occurs in multienzyme complexes
• Many proteins are active only as homodimers, homotetramers, etc.


Disulfide Bonds

Two cysteines in close proximity will form a covalent bond

Disulfide bond, disulfide bridge, or dicysteine bond.

Significantly stabilizes tertiary structure.


Protein Structure Examples


Determining Protein Structure

There are O(100,000) distinct proteins in the human proteome.

3D structures have been determined for 14,000 proteins, from all organisms
• Includes duplicates with different ligands bound, etc.

Coordinates are determined by X-ray crystallography


X-Ray Crystallography

[Figure: protein crystal, ~0.5 mm]

• The crystal is a mosaic of millions of copies of the protein.

• As much as 70% is solvent (water)!

• May take months (and a “green” thumb) to grow.


X-Ray diffraction

Image is averaged over:
• Space (many copies)
• Time (of the diffraction experiment)


Electron Density Maps

Resolution is dependent on the quality/regularity of the crystal

R-factor is a measure of “leftover” electron density

Solvent fitting

Refinement


The Protein Data Bank

ATOM      1  N   ALA E   1      22.382  47.782 112.975  1.00 24.09      3APR 213
ATOM      2  CA  ALA E   1      22.957  47.648 111.613  1.00 22.40      3APR 214
ATOM      3  C   ALA E   1      23.572  46.251 111.545  1.00 21.32      3APR 215
ATOM      4  O   ALA E   1      23.948  45.688 112.603  1.00 21.54      3APR 216
ATOM      5  CB  ALA E   1      23.932  48.787 111.380  1.00 22.79      3APR 217
ATOM      6  N   GLY E   2      23.656  45.723 110.336  1.00 19.17      3APR 218
ATOM      7  CA  GLY E   2      24.216  44.393 110.087  1.00 17.35      3APR 219
ATOM      8  C   GLY E   2      25.653  44.308 110.579  1.00 16.49      3APR 220
ATOM      9  O   GLY E   2      26.258  45.296 110.994  1.00 15.35      3APR 221
ATOM     10  N   VAL E   3      26.213  43.110 110.521  1.00 16.21      3APR 222
ATOM     11  CA  VAL E   3      27.594  42.879 110.975  1.00 16.02      3APR 223
ATOM     12  C   VAL E   3      28.569  43.613 110.055  1.00 15.69      3APR 224
ATOM     13  O   VAL E   3      28.429  43.444 108.822  1.00 16.43      3APR 225
ATOM     14  CB  VAL E   3      27.834  41.363 110.979  1.00 16.66      3APR 226
ATOM     15  CG1 VAL E   3      29.259  41.013 111.404  1.00 17.35      3APR 227
ATOM     16  CG2 VAL E   3      26.811  40.649 111.850  1.00 17.03      3APR 228

http://www.rcsb.org/pdb/
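A short sketch of how ATOM records like these might be read. It assumes whitespace-separated fields exactly as they appear on the slide (record type, serial number, atom name, residue name, chain, residue number, x, y, z, occupancy, temperature factor); real PDB files are fixed-column, so production code should slice by column or use a dedicated library. The `Atom` type and function names are illustrative only.

```python
import math
from typing import NamedTuple, Optional


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float


def parse_atom(line: str) -> Optional[Atom]:
    """Parse one whitespace-separated ATOM line; return None for other lines."""
    fields = line.split()
    if len(fields) < 11 or fields[0] != "ATOM":
        return None
    return Atom(
        serial=int(fields[1]),
        name=fields[2],
        res_name=fields[3],
        chain=fields[4],
        res_seq=int(fields[5]),
        x=float(fields[6]),
        y=float(fields[7]),
        z=float(fields[8]),
        occupancy=float(fields[9]),
        b_factor=float(fields[10]),
    )


def distance(a: Atom, b: Atom) -> float:
    """Euclidean distance between two atoms, in the file's units (angstroms)."""
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))


# The first two records from the slide.
n_atom = parse_atom("ATOM 1 N ALA E 1 22.382 47.782 112.975 1.00 24.09 3APR 213")
ca_atom = parse_atom("ATOM 2 CA ALA E 1 22.957 47.648 111.613 1.00 22.40 3APR 214")
n_ca_bond = distance(n_atom, ca_atom)
```

The N to CA distance computed from these two records comes out a little under 1.5 Å, which is a quick sanity check that the coordinates were read in the right order.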


Views of a protein

Wireframe
Ball and stick


Views of a protein

Spacefill
Cartoon

CPK colors

Carbon = green, black, or grey

Nitrogen = blue

Oxygen = red

Sulfur = yellow

Hydrogen = white


The Protein Folding Problem

Central question of molecular biology: “Given a particular sequence of amino acid residues (primary structure), what will the tertiary/quaternary structure of the resulting protein be?”

Input: AAVIKYGCAL…
Output: φ1, ψ1, φ2, ψ2, … = backbone conformation (no side chains yet)


Forces driving protein folding

It is believed that hydrophobic collapse is a key driving force for protein folding
• Hydrophobic core
• Polar surface interacting with solvent

Minimum volume (no cavities)
Disulfide bond formation stabilizes
Hydrogen bonds
Polar and electrostatic interactions


Folding help

Proteins are, in fact, only marginally stable
• Native state is typically only 5 to 10 kcal/mole more stable than the unfolded form

Many proteins help in folding
• Protein disulfide isomerase – catalyzes shuffling of disulfide bonds
• Chaperones – break up aggregates and (in theory) unfold misfolded proteins


The Hydrophobic Core

Hemoglobin A is the protein in red blood cells (erythrocytes) responsible for binding oxygen.

The mutation E6V in the β chain places a hydrophobic Val on the surface of hemoglobin

The resulting “sticky patch” causes hemoglobin S to agglutinate (stick together) and form fibers which deform the red blood cell and do not carry oxygen efficiently

Sickle cell anemia was the first identified molecular disease


Sickle Cell Anemia

Sequestering hydrophobic residues in the protein core protects proteins from hydrophobic agglutination.


Computational Problems in Protein Folding

Two key questions:
• Evaluation – how can we tell a correctly-folded protein from an incorrectly folded protein?
  H-bonds, electrostatics, hydrophobic effect, etc.
  Derive a function, see how well it does on “real” proteins
• Optimization – once we get an evaluation function, can we optimize it?
  Simulated annealing/Monte Carlo
  EC
  Heuristics
  We’ll talk more about these methods later…


Fold Optimization

Simple lattice models (HP-models)
• Two types of residues: hydrophobic and polar
• 2-D or 3-D lattice
• The only force is hydrophobic collapse
• Score = number of H–H contacts


Scoring Lattice Models

H/P model scoring: count noncovalent hydrophobic interactions.

Sometimes:
• Penalize for buried polar or surface hydrophobic residues
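The basic H/P score can be sketched as follows, assuming a 2-D square lattice and a fold given as one (x, y) coordinate per residue. The function name and data layout are illustrative choices, and the optional penalties mentioned above are not included.

```python
def hh_contacts(sequence, coords):
    """Count H-H pairs that are lattice neighbours but not chain neighbours.

    sequence: string of 'H' and 'P'; coords: list of (x, y), one per residue.
    """
    if len(sequence) != len(coords):
        raise ValueError("need one coordinate per residue")
    if len(set(coords)) != len(coords):
        raise ValueError("two residues occupy the same lattice site (bump)")

    index_at = {xy: i for i, xy in enumerate(coords)}
    score = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that each neighbouring pair is counted once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbour)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                score += 1
    return score


# Four residues folded around a unit square: the two end residues touch.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
score_hpph = hh_contacts("HPPH", square)
score_hhhh = hh_contacts("HHHH", square)
```

Both calls give a score of 1: covalently bonded neighbours along the chain are excluded, so only the contact between the first and last residue counts.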


What can we do with lattice models?

For smaller polypeptides, exhaustive search can be used
• Looking at the “best” fold, even in such a simple model, can teach us interesting things about the protein folding process

For larger chains, other optimization and search methods must be used
• Greedy, branch and bound
• Evolutionary computing, simulated annealing
• Graph theoretical methods
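For a short chain, exhaustive search can be written as a recursive enumeration of every self-avoiding walk on the lattice. The sketch below assumes a 2-D square lattice and scores folds by H–H contacts only; the names are illustrative, and the number of walks grows exponentially with chain length, so this is only practical for small inputs.

```python
MOVES = [(1, 0), (0, 1), (-1, 0), (0, -1)]


def hh_contacts(sequence, coords):
    """Number of non-bonded H-H pairs on adjacent lattice sites."""
    index_at = {xy: i for i, xy in enumerate(coords)}
    score = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        for neighbour in ((x + 1, y), (x, y + 1)):  # count each pair once
            j = index_at.get(neighbour)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                score += 1
    return score


def best_fold(sequence):
    """Try every self-avoiding walk and keep the one with the most contacts."""
    best_score, best_coords = -1, None

    def grow(coords):
        nonlocal best_score, best_coords
        if len(coords) == len(sequence):
            score = hh_contacts(sequence, coords)
            if score > best_score:
                best_score, best_coords = score, list(coords)
            return
        x, y = coords[-1]
        for dx, dy in MOVES:
            nxt = (x + dx, y + dy)
            if nxt not in coords:  # skip occupied sites (no bumps)
                coords.append(nxt)
                grow(coords)
                coords.pop()

    grow([(0, 0)])
    return best_score, best_coords


score, fold = best_fold("HPPH")
```

A natural refinement, not shown here, is to fix the first move to remove rotational symmetry, which cuts the search by roughly a factor of four.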


Learning from Lattice Models

The “hydrophobic zipper” effect:

Ken Dill ~ 1997


Representing a lattice model

Absolute directions
• UURRDLDRRU

Relative directions
• LFRFRRLLFFL
• Advantage: immediate reversals (UD or RL in absolute directions) cannot occur
• Only three directions: LRF

What about bumps? LFRRR
• Bad score
• Use a better representation
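Both encodings can be decoded into lattice coordinates with a few lines of code. The sketch below assumes a 2-D square lattice, a chain that starts at the origin, and (for relative directions) an initial heading of "up"; these conventions and all names are assumptions, not something the slide specifies.

```python
# Headings in clockwise order, so a right turn is +1 and a left turn is -1.
HEADINGS = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # up, right, down, left
TURN = {"L": -1, "F": 0, "R": 1}
ABSOLUTE = {"U": (0, 1), "R": (1, 0), "D": (0, -1), "L": (-1, 0)}


def decode_absolute(moves):
    """Coordinates visited by a string of absolute moves (U, D, L, R)."""
    coords = [(0, 0)]
    for move in moves:
        dx, dy = ABSOLUTE[move]
        x, y = coords[-1]
        coords.append((x + dx, y + dy))
    return coords


def decode_relative(moves, heading=0):
    """Coordinates visited by a string of relative moves (L, F, R)."""
    coords = [(0, 0)]
    for move in moves:
        heading = (heading + TURN[move]) % 4
        dx, dy = HEADINGS[heading]
        x, y = coords[-1]
        coords.append((x + dx, y + dy))
    return coords


def has_bump(coords):
    """True if any lattice site is visited more than once."""
    return len(set(coords)) != len(coords)


absolute_ok = not has_bump(decode_absolute("UURRDLDRRU"))
relative_ok = not has_bump(decode_relative("LFRFRRLLFFL"))
bumps = has_bump(decode_relative("LFRRR"))
```

Under these conventions the two example strings from the slide decode without collisions, while LFRRR walks back onto an occupied site after its third right turn.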


Preference-order representation

Each position has two “preferences”
• If it can’t have either of the two, it will take the “least favorite” path if possible

Example: {LR},{FL},{RL},{FR},{RL},{RL},{FR},{RF}

Can still cause bumps: {LF},{FR},{RL},{FL},{RL},{FL},{RF},{RL},{FL}
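One possible reading of this representation, sketched in code: each position lists its first and second choice of relative direction, the remaining direction is the "least favorite", and the decoder takes the first of the three that leads to a free site. The starting position (origin), initial heading (up), and the decision to return None when all three sites are occupied are assumptions made for this sketch.

```python
HEADINGS = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # up, right, down, left
TURN = {"L": -1, "F": 0, "R": 1}


def decode_preferences(genes, heading=0):
    """Decode a list of two-letter preference pairs into lattice coordinates.

    Returns None if some residue has no free neighbouring site (a bump).
    """
    coords = [(0, 0)]
    occupied = {(0, 0)}
    for first, second in genes:
        least_favorite = ({"L", "F", "R"} - {first, second}).pop()
        for move in (first, second, least_favorite):
            new_heading = (heading + TURN[move]) % 4
            dx, dy = HEADINGS[new_heading]
            x, y = coords[-1]
            nxt = (x + dx, y + dy)
            if nxt not in occupied:
                coords.append(nxt)
                occupied.add(nxt)
                heading = new_heading
                break
        else:
            return None  # all three candidate sites are already occupied
    return coords


example = ["LR", "FL", "RL", "FR", "RL", "RL", "FR", "RF"]
stuck = ["LF", "FR", "RL", "FL", "RL", "FL", "RF", "RL", "FL"]

fold = decode_preferences(example)
dead_end = decode_preferences(stuck)
```

With these conventions the first example decodes to a valid fold (falling back to the least favorite direction once), and the second runs into a dead end at its last position, consistent with the remark that bumps can still occur.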


More realistic models

Higher resolution lattices (45° lattice, etc.)

Off-lattice models
• Local moves
• Optimization/search methods and φ/ψ representations
  Greedy search
  Branch and bound
  EC, Monte Carlo, simulated annealing, etc.


The Other Half of the Picture

Now that we have a more realistic off-lattice model, we need a better energy function to evaluate a conformation (fold).

Theoretical force field:
$G = G_{\text{van der Waals}} + G_{\text{h-bonds}} + G_{\text{solvent}} + G_{\text{coulomb}}$

Empirical force fields
• Start with a database
• Look at neighboring residues – similar to known protein folds?


Threading: Fold recognition

Given:
• Sequence: IVACIVSTEYDVMKAAR…
• A database of molecular coordinates

Map the sequence onto each fold

Evaluate
• Objective 1: improve scoring function
• Objective 2: folding


Secondary Structure Prediction

AGVGTVPMTAYGNDIQYYGQVT…
A-VGIVPM-AYGQDIQY-GQVT…
AG-GIIP--AYGNELQ--GQVT…
AGVCTVPMTA---ELQYYG--T…

AGVGTVPMTAYGNDIQYYGQVT…
----hhhHHHHHHhhh--eeEE…


Secondary Structure Prediction

Easier than folding
• Current algorithms can predict secondary structure with 70-80% accuracy

Chou, P.Y. & Fasman, G.D. (1974). Biochemistry, 13, 211-222.
• Based on frequencies of occurrence of residues in helices and sheets

PhD – Neural network based
• Uses a multiple sequence alignment
• Rost & Sander, Proteins, 1994, 19, 55-72


Chou-Fasman Parameters

Name            Abbrv  P(a)  P(b)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Alanine         A      142   83    66       0.06   0.076   0.035   0.058
Arginine        R      98    93    95       0.07   0.106   0.099   0.085
Aspartic Acid   D      101   54    146      0.147  0.11    0.179   0.081
Asparagine      N      67    89    156      0.161  0.083   0.191   0.091
Cysteine        C      70    119   119      0.149  0.05    0.117   0.128
Glutamic Acid   E      151   37    74       0.056  0.06    0.077   0.064
Glutamine       Q      111   110   98       0.074  0.098   0.037   0.098
Glycine         G      57    75    156      0.102  0.085   0.19    0.152
Histidine       H      100   87    95       0.14   0.047   0.093   0.054
Isoleucine      I      108   160   47       0.043  0.034   0.013   0.056
Leucine         L      121   130   59       0.061  0.025   0.036   0.07
Lysine          K      114   74    101      0.055  0.115   0.072   0.095
Methionine      M      145   105   60       0.068  0.082   0.014   0.055
Phenylalanine   F      113   138   60       0.059  0.041   0.065   0.065
Proline         P      57    55    152      0.102  0.301   0.034   0.068
Serine          S      77    75    143      0.12   0.139   0.125   0.106
Threonine       T      83    119   96       0.086  0.108   0.065   0.079
Tryptophan      W      108   137   96       0.077  0.013   0.064   0.167
Tyrosine        Y      69    147   114      0.082  0.065   0.114   0.125
Valine          V      106   170   50       0.062  0.048   0.028   0.053


Chou-Fasman Algorithm

Identify α-helices
• Find 4 out of 6 contiguous amino acids that have P(a) > 100
• Extend the region until 4 amino acids with P(a) < 100 are found
• Compute P(a) and P(b); if the region is > 5 residues and P(a) > P(b), identify as a helix

Repeat for β-sheets [use P(b)]

If an α and a β region overlap, the overlapping region is predicted according to P(a) and P(b)
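The helix step can be sketched as below. This is a simplified reading of the rules on the slide, not the full published method: it extends the nucleus toward the C-terminus only, sums P(a) and P(b) over the region, and returns just the first helix found. The P(a) and P(b) values are taken from the parameter table; the function name and return format are illustrative.

```python
P_A = {"A": 142, "R": 98, "D": 101, "N": 67, "C": 70, "E": 151, "Q": 111,
       "G": 57, "H": 100, "I": 108, "L": 121, "K": 114, "M": 145, "F": 113,
       "P": 57, "S": 77, "T": 83, "W": 108, "Y": 69, "V": 106}
P_B = {"A": 83, "R": 93, "D": 54, "N": 89, "C": 119, "E": 37, "Q": 110,
       "G": 75, "H": 87, "I": 160, "L": 130, "K": 74, "M": 105, "F": 138,
       "P": 55, "S": 75, "T": 119, "W": 137, "Y": 147, "V": 170}


def first_helix(seq):
    """Return (start, end, sum P(a), sum P(b)) for the first predicted helix.

    `end` is exclusive. Returns None if no region qualifies.
    """
    for start in range(len(seq) - 5):
        nucleus = seq[start:start + 6]
        # Nucleation: at least 4 of 6 contiguous residues with P(a) > 100.
        if sum(P_A[r] > 100 for r in nucleus) < 4:
            continue
        # Extension: stop in front of 4 contiguous residues with P(a) < 100.
        end = start + 6
        while end < len(seq):
            ahead = seq[end:end + 4]
            if len(ahead) == 4 and all(P_A[r] < 100 for r in ahead):
                break
            end += 1
        region = seq[start:end]
        total_a = sum(P_A[r] for r in region)
        total_b = sum(P_B[r] for r in region)
        if len(region) > 5 and total_a > total_b:
            return start, end, total_a, total_b
    return None


SEQ = "CAENKLDHVRGPTCILFMTWYNDGP"
helix = first_helix(SEQ)
helix_residues = SEQ[helix[0]:helix[1]]
```

The same structure applies to the β-sheet pass with the roles of P(a) and P(b) swapped.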


Chou-Fasman, cont’d

Identify hairpin turns:
• P(t) = f(i) of the residue × f(i+1) of the next residue × f(i+2) of the following residue × f(i+3) of the residue at position (i+3)
• Predict a hairpin turn starting at positions where:
  P(t) > 0.000075
  The average P(turn) for the four residues > 100
  P(a) < P(turn) > P(b) for the four residues

Accuracy 60-65%
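The three turn criteria translate directly into code. The sketch below loads the parameter table from a text block (values copied from the Chou-Fasman Parameters slide) and evaluates a four-residue window; the function names and the column layout of `TABLE` are illustrative choices.

```python
# Columns: residue, P(a), P(b), P(turn), f(i), f(i+1), f(i+2), f(i+3)
TABLE = """
A 142  83  66 0.060 0.076 0.035 0.058
R  98  93  95 0.070 0.106 0.099 0.085
D 101  54 146 0.147 0.110 0.179 0.081
N  67  89 156 0.161 0.083 0.191 0.091
C  70 119 119 0.149 0.050 0.117 0.128
E 151  37  74 0.056 0.060 0.077 0.064
Q 111 110  98 0.074 0.098 0.037 0.098
G  57  75 156 0.102 0.085 0.190 0.152
H 100  87  95 0.140 0.047 0.093 0.054
I 108 160  47 0.043 0.034 0.013 0.056
L 121 130  59 0.061 0.025 0.036 0.070
K 114  74 101 0.055 0.115 0.072 0.095
M 145 105  60 0.068 0.082 0.014 0.055
F 113 138  60 0.059 0.041 0.065 0.065
P  57  55 152 0.102 0.301 0.034 0.068
S  77  75 143 0.120 0.139 0.125 0.106
T  83 119  96 0.086 0.108 0.065 0.079
W 108 137  96 0.077 0.013 0.064 0.167
Y  69 147 114 0.082 0.065 0.114 0.125
V 106 170  50 0.062 0.048 0.028 0.053
"""

PARAMS = {}
for row in TABLE.strip().splitlines():
    residue, *values = row.split()
    PARAMS[residue] = [float(v) for v in values]


def turn_stats(window):
    """Return (P(t), avg P(turn), avg P(a), avg P(b)) for four residues."""
    rows = [PARAMS[r] for r in window]
    p_t = 1.0
    for offset, row in enumerate(rows):
        p_t *= row[3 + offset]  # f(i), f(i+1), f(i+2), f(i+3) in turn

    def average(column):
        return sum(row[column] for row in rows) / len(rows)

    return p_t, average(2), average(0), average(1)


def is_hairpin_turn(window):
    """Apply the three criteria listed on the slide."""
    p_t, avg_turn, avg_a, avg_b = turn_stats(window)
    return p_t > 0.000075 and avg_turn > 100 and avg_a < avg_turn > avg_b


p_t, avg_turn, avg_a, avg_b = turn_stats("VRGP")
```

For the window VRGP this gives P(t) of roughly 0.000085 and averages of 113.25, 79.5, and 98.25, so all three criteria hold.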


Chou-Fasman Example

CAENKLDHVRGPTCILFMTWYNDGP

CAENKL – Potential helix (!C and !N)

Residues with P(a) < 100: RNCGPSTY
• Extend: When we reach RGPT, we must stop
• CAENKLDHV: P(a) = 972, P(b) = 843
• Declare alpha helix

Identifying a hairpin turn
• VRGP: P(t) = 0.000085
• Average P(turn) = 113.25
• Avg P(a) = 79.5, Avg P(b) = 98.25


Additional Information – Aligning protein sequences

PAM matrices

BLOSUM matrices


Sequence Alignments Revisited

Scoring nucleotide sequence alignments was easier
• Match score
• Possibly different scores for transitions and transversions

For amino acids, there are many more possible substitutions

How do we score which substitutions are highly penalized and which are moderately penalized?
• Physical and chemical characteristics
• Empirical methods


Scoring Mismatches

Physical and chemical characteristics
• V → I – Both small, both hydrophobic, conservative substitution, small penalty
• V → K – Small → large, hydrophobic → charged, large penalty
• Requires some expert knowledge and judgement

Empirical methods
• How often does the substitution V → I occur in proteins that are known to be related?
  Scoring matrices: PAM and BLOSUM


PAM matrices

PAM = “Point Accepted Mutation”: interested only in mutations that have been “accepted” by natural selection

Starts with a multiple sequence alignment of very similar (>85% identity) proteins, assumed to be homologous

Compute the relative mutability, m_i, of each amino acid
• e.g. m_A = how many times was alanine substituted with anything else?


Relative mutability

ACGCTAFKI
GCGCTAFKI
ACGCTAFKL
GCGCTGFKI
GCGCTLFKI
ASGCTAFKL
ACACTAFKL

Across all pairs of sequences, there are 28 A → X substitutions

There are 10 ALA residues, so m_A = 2.8
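The counting behind these numbers can be reproduced with a short script. It compares every pair of aligned sequences column by column and counts the positions where exactly one of the two residues is the amino acid of interest; the function name is illustrative.

```python
from itertools import combinations

ALIGNMENT = [
    "ACGCTAFKI",
    "GCGCTAFKI",
    "ACGCTAFKL",
    "GCGCTGFKI",
    "GCGCTLFKI",
    "ASGCTAFKL",
    "ACACTAFKL",
]


def relative_mutability(seqs, residue):
    """Return (substitutions, occurrences, substitutions / occurrences)."""
    substitutions = 0
    for s, t in combinations(seqs, 2):
        for a, b in zip(s, t):
            # A mismatch in which one side is the residue of interest.
            if a != b and residue in (a, b):
                substitutions += 1
    occurrences = sum(s.count(residue) for s in seqs)
    return substitutions, occurrences, substitutions / occurrences


subs_A, count_A, m_A = relative_mutability(ALIGNMENT, "A")
```

For alanine this yields 28 substitutions over 10 occurrences, matching the slide's m_A = 2.8.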


PAM Matrices, cont’d

Construct a phylogenetic tree for the sequences in the alignment

Calculate substitution frequencies F_{i,j}

Substitutions may have occurred either way, so A → G also counts as G → A.

ACGCTAFKI
  A→G: GCGCTAFKI
    A→G: GCGCTGFKI
    A→L: GCGCTLFKI
  I→L: ACGCTAFKL
    C→S: ASGCTAFKL
    G→A: ACACTAFKL

F_{G,A} = 3
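Counting substitutions along the branches of the tree can be sketched as follows. The edge list is read off the tree on the slide (parent sequence, child sequence); storing each substitution as an unordered pair implements the rule that A → G also counts as G → A. The names are illustrative.

```python
from collections import Counter

# (parent, child) pairs for each branch of the tree shown on the slide.
TREE_EDGES = [
    ("ACGCTAFKI", "GCGCTAFKI"),
    ("ACGCTAFKI", "ACGCTAFKL"),
    ("GCGCTAFKI", "GCGCTGFKI"),
    ("GCGCTAFKI", "GCGCTLFKI"),
    ("ACGCTAFKL", "ASGCTAFKL"),
    ("ACGCTAFKL", "ACACTAFKL"),
]


def substitution_counts(edges):
    """Count substitutions per unordered residue pair across all branches."""
    counts = Counter()
    for parent, child in edges:
        for a, b in zip(parent, child):
            if a != b:
                counts[frozenset((a, b))] += 1
    return counts


F = substitution_counts(TREE_EDGES)
F_GA = F[frozenset("GA")]
# Total substitutions involving alanine: the denominator used for M_{i,A}.
total_A = sum(n for pair, n in F.items() if "A" in pair)
```

This gives F_{G,A} = 3 and a total of 4 substitutions involving alanine (three with glycine, one with leucine).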


Mutation Probabilities

M_{i,j} represents the probability of J → I substitution.

$$M_{i,j} = \frac{m_j F_{i,j}}{\sum_i F_{i,j}}$$

$$M_{G,A} = \frac{2.7 \times 3}{4} = 2.025$$

[Figure: the phylogenetic tree from the previous slide, with substitutions A→G, I→L, A→G, A→L, C→S, G→A on its branches]


The PAM matrix

The entries, R_{i,j}, are the log of the M_{i,j} values divided by the frequency of occurrence, f_i, of residue i.

f_G = 10 GLY / 63 residues = 0.1587

R_{G,A} = log(2.025/0.1587) = log(12.760) = 1.106 (base-10 log)

The log is taken so that we can add, rather than multiply, entries to get compound probabilities.

Log-odds matrix

Diagonal entries are 1 – m_j
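The two formulas can be checked numerically. The sketch below uses the values from the slides: F_{G,A} = 3, a total of 4 substitutions involving alanine, and the mutability 2.7 that appears in the slide's worked arithmetic for M_{G,A}. The function names are illustrative, and the base-10 logarithm is inferred from the slide's result.

```python
import math


def mutation_probability(m_j, f_ij, f_total_j):
    """M_{i,j} = m_j * F_{i,j} / sum_i F_{i,j}."""
    return m_j * f_ij / f_total_j


def log_odds(m_ij, freq_i):
    """R_{i,j} = log10(M_{i,j} / f_i)."""
    return math.log10(m_ij / freq_i)


m_A = 2.7        # mutability of Ala as used in the slide's calculation
F_GA = 3         # A <-> G substitutions counted on the tree
F_total_A = 4    # all substitutions involving Ala on the tree
f_G = 10 / 63    # frequency of Gly in the alignment

M_GA = mutation_probability(m_A, F_GA, F_total_A)
R_GA = log_odds(M_GA, f_G)
```

M_GA evaluates to 2.025 and R_GA to about 1.106, in agreement with the slide (which rounds f_G to 0.1587 before dividing).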


Interpretation of PAM matrices

PAM-1 – one substitution per 100 residues (a PAM unit of time)

Multiply them together to get PAM-100, etc.

“Suppose I start with a given polypeptide sequence M at time t, and observe the evolutionary changes in the sequence until 1% of all amino acid residues have undergone substitutions at time t+n. Let the new sequence at time t+n be called M’. What is the probability that a residue of type j in M will be replaced by i in M’?”
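"Multiply them together" means repeated matrix multiplication of the PAM-1 mutation probability matrix (the probabilities, not the log-odds scores). The sketch below uses a hypothetical two-letter alphabet with a 1% substitution rate per PAM unit purely to show the mechanics; a real PAM-1 matrix is 20 × 20 with entries estimated from data.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(matrix, power):
    """Multiply `matrix` by itself `power` times (power >= 0)."""
    n = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, matrix)
    return result


# Hypothetical PAM-1 for a two-letter alphabet: entry [i][j] is the
# probability that residue j is replaced by residue i in one PAM unit.
PAM1_TOY = [
    [0.99, 0.01],
    [0.01, 0.99],
]

PAM100_TOY = mat_pow(PAM1_TOY, 100)
unchanged_after_100 = PAM100_TOY[0][0]
```

In this toy case a residue is still its original type with probability of about 0.57 after 100 PAM units, not 0, because later substitutions can hit the same position again or revert it.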


PAM matrix considerationsPAM matrix considerations

If Mi,j is very small, we may not have a large enough sample to estimate the real probability. When we multiply the PAM matrices many times, the error is magnified.

PAM-1 – similar sequences, PAM-1000 very dissimilar sequences


BLOSUM matrix

Starts by clustering proteins by similarity

Avoids problems with small probabilities by using averages over clusters

Numbering works opposite
• BLOSUM-62 is appropriate for sequences of about 62% identity, while BLOSUM-80 is appropriate for more similar sequences.


Other topics?

Tools and languages
Forensic DNA
Microarray analysis
