Fundamental Concepts of Bioinformatics Miami University, May 2008 Michael L. Raymer Computer...
Fundamental Concepts of Bioinformatics
Miami University, May 2008
Michael L. Raymer
Computer Science, Biomedical Sciences
Wright State University
Bioinformatics Research Group
Part I – Background
Some basics of molecular biology, and some of the fundamental
problems facing bioinformaticians
OCCBIO 2006 – Fundamental Bioinformatics 3
The Central Dogma of molecular biology
DNA structure and base pairing
Polymer of:
• Deoxyribose sugar
• Phosphate
• Nitrogenous base
Four bases
• A, C, G, T
Base pairing
• A-T
• G-C
DNA is an information-carrying molecule
Arranged into 23 chromosome pairs in the nucleus of each cell
Genes: coding information
• Less than 5% of all DNA
• Instructions for protein synthesis
• Directions on when and where to synthesize proteins (regulatory regions)
The Genetic Code
Redundancy/robustness
• Synonymous codons
• Dual strands
• Diploidy
• Amino acid structure (?)
Transcription
DNA → (transcription) → RNA → (translation) → Protein
Messenger RNA
Carries instructions for a protein out of the nucleus to the ribosome
The ribosome is a complex of RNA and protein that synthesizes new proteins
Prokaryotic gene structure
[Figure: gene drawn from 5' to 3', labeled with promoter (RNA polymerase binding), operator (regulation), 5' UTR, coding region, stop codon, and 3' UTR]
Yeast RNA Polymerase II: Darst et al., 1991 (Cell 66, pp. 121-128)
Regulation of transcription
Energy budget
Cellular differentiation & tissue function
From W. Becker, L. Kleinsmith, and J. Hardin, The World of the Cell, Fourth Edition. Copyright © Addison Wesley Longman, Inc.
Bioinformatics problems
Shotgun sequencing
Sequence alignment & multiple alignment
• Database searches
Phylogenetic tree induction
Protein structure determination, modeling, and prediction
Ligand screening and docking
Many, many more
Bioinformatics data
DNA sequence information
• Genome projects, etc.
mRNA expression information
• Microarrays, SAGE
Metabolite concentrations
• Mass Spec., NMR Spec., etc.
Protein sequence information
Protein structure information
• X-Ray Crystallography
Part II – Obtaining Sequences
Sanger Sequencing
Primer Walking
Shotgun Approaches
Fragment Assembly Algorithms
Outline
PCR
Sanger Sequencing
Primer Walking
Shotgun Sequencing
• Models
• Algorithms
• Analysis
Polymerase chain reaction (PCR)
Gel electrophoresis
Sanger sequencing
Limitations to sequencing
You must have a primer of known sequence to initiate PCR
Only about 1,000 nt can be sequenced in a single reaction
The sequencing process is slow, so it is beneficial to do as much in parallel as possible
• Primer hopping
• Shotgun approach
Shotgun Sequencing
The Ideal Case
Find maximal overlaps between fragments:
Fragments: ACCGT, CGTGC, TTAC, TACCGT
--ACCGT--
----CGTGC
TTAC-----
-TACCGT--
TTACCGTGC   (consensus sequence, determined by vote)
Quality Metrics
The coverage at position i of the target or consensus sequence is the number of fragments that overlap that position
[Figure: fragments laid over the target; a region with no coverage splits the assembly into two contigs]
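As a rough illustration of coverage and of a consensus determined by vote, the sketch below stores a layout as (offset, fragment) pairs. That representation, the function names, and the use of 'N' for uncovered columns are assumptions made for this example, not something the slides prescribe.

```python
from collections import Counter

def coverage(layout):
    """Number of fragments overlapping each column of the layout."""
    width = max(offset + len(frag) for offset, frag in layout)
    counts = [0] * width
    for offset, frag in layout:
        for col in range(offset, offset + len(frag)):
            counts[col] += 1
    return counts

def consensus(layout):
    """Majority vote per column; columns with no fragment become 'N'."""
    width = max(offset + len(frag) for offset, frag in layout)
    votes = [Counter() for _ in range(width)]
    for offset, frag in layout:
        for k, base in enumerate(frag):
            votes[offset + k][base] += 1
    return "".join(v.most_common(1)[0][0] if v else "N" for v in votes)

# Hypothetical layout of the fragments ACCGT, CGTGC, TTAC, TACCGT
layout = [(2, "ACCGT"), (4, "CGTGC"), (0, "TTAC"), (1, "TACCGT")]
print(coverage(layout))
print(consensus(layout))
```

A column whose coverage is 0 would mark the boundary between two contigs.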
Quality Metrics
Linkage: the degree of overlap between fragments
[Figure: a target with perfect coverage but poor average linkage and poor minimum linkage]
Real World Complications
Base call errors
Chimeric fragments, contamination (e.g. from the vector)
Base call error:
--ACCGT--
----CGTGC
TTAC-----
-TGCCGT--
TTACCGTGC
Insertion error:
--ACC-GT--
----CAGTGC
TTAC------
-TACC-GT--
TTACC-GTGC
Deletion error:
--ACCGT--
----CGTGC
TTAC-----
-TAC-GT--
TTACCGTGC
Unknown Orientation
A fragment can come from either strand
Fragments: CACGT, ACGT, ACTACG, GTACT, ACTGA, CTGA
CACGT
-ACGT
--CGTAGT   (reverse complement of ACTACG)
-----AGTAC   (reverse complement of GTACT)
--------ACTGA
---------CTGA
Repeats
Direct repeats
A X B X C X D
A X C X B X D
Repeats
Direct repeats
A X B Y C X D Y E
A X D Y C X B Y E
Repeats
Inverted repeats
[Figure: a sequence containing two copies of X in inverted orientation]
Sequence Alignment Models
Shortest common superstring
• Input: A collection, F, of strings (fragments)
• Output: A shortest possible string S such that for every f ∈ F, S is a superstring of f.
Example:
• F = {ACT, CTA, AGT}
• S = ACTAGT
Problems with the SCS model
[Figure: a sequence with two exact copies of a repeat x, and one with an approximate copy x´]
Directionality of fragments must be known
No consideration of coverage
Some simple consideration of linkage
No consideration of base call errors
Reconstruction
Deals with errors and unknown orientation
Definitions
• f is an approximate substring of S at error level ε when $d_s(f, S) \le \epsilon |f|$
• $d_s$ = substring edit distance (Match = 0, Mismatch = 1, Gap = 1)
Reconstruction
• Input: A collection, F, of strings, and a tolerance level, ε
• Output: Shortest possible string, S, such that for every f ∈ F: $\min(d_s(f, S),\ d_s(\bar{f}, S)) \le \epsilon |f|$, where $\bar{f}$ is the reverse complement of f
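The substring edit distance can be computed with a small dynamic-programming table in which f may start and end anywhere in S at no cost. The sketch below is one possible way to do it under the costs listed above; the function names are illustrative only.

```python
def reverse_complement(seq):
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq))

def substring_edit_distance(f, s):
    """Smallest edit distance between f and any substring of s."""
    prev = [0] * (len(s) + 1)              # empty prefix of f fits anywhere for free
    for i in range(1, len(f) + 1):
        curr = [i] + [0] * len(s)          # i characters of f against nothing: i gaps
        for j in range(1, len(s) + 1):
            mismatch = 0 if f[i - 1] == s[j - 1] else 1
            curr[j] = min(prev[j - 1] + mismatch,   # match or mismatch
                          prev[j] + 1,              # gap in s
                          curr[j - 1] + 1)          # gap in f
        prev = curr
    return min(prev)                       # f may end at any column of s

def is_approximate_substring(f, s, eps):
    """True when f or its reverse complement fits s within eps * |f| errors."""
    best = min(substring_edit_distance(f, s),
               substring_edit_distance(reverse_complement(f), s))
    return eps * len(f) >= best

print(substring_edit_distance("CGAG", "ACGATACGAC"))
```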
Reconstruction Example
Input: F = {ATCAT, GTCG, CGAG, TACCA}, ε = 0.25
Output:
ATGAT-----   (reverse complement of ATCAT)
------CGAC   (reverse complement of GTCG)
-CGAG-----
----TACCA-
ACGATACGAC
$d_s$(CGAG, ACGATACGAC) = 1 = 0.25 × 4
So this output is OK for ε = 0.25
Gaps in Reconstruction
Reconstruction allows gaps in fragments:
AT-GA-----
ATCGATAGAC
$d_s$ = 1
Limitations of Reconstruction
Models errors and unknown orientation
Doesn’t handle repeats
Doesn’t model coverage
Only handles linkage in a very simple way
Always produces a single contig
Contigs
Sometimes you just can’t put all of the fragments together into one contiguous sequence:
[Figure: two contigs separated by a gap]
No way to tell the order of these two contigs.
No way to tell how much sequence is missing between them.
Multicontig
Definitions
• A layout, L, is a multiple alignment of the fragments, with columns numbered from 1 to |L|
• Endpoints of a fragment: l(f) and r(f)
• An overlap is a link if no other fragment completely covers the overlap
[Figure: an overlap that is a link, and an overlap that is not a link]
Multicontig
More definitions
• The size of a link is the number of overlapping positions
• The weakest link is the smallest link in the layout
• A t-contig has a weakest link of size t
• A collection, F, admits a t-contig if a t-contig can be constructed from the fragments in F
ACGTATAGCATGA GTA CATGATCAACGTATAG GATCA
A link of size 5
Perfect Multicontig
Input: F, and t
Output: a minimum number of collections, Ci, such that every Ci admits a t-contig
Let F = {GTAC, TAATG, TGTAA}
t = 3 (two contigs):
--TAATG
TGTAA--
and
GTAC
t = 1 (one contig):
TGTAA-----
--TAATG---
------GTAC
Handling errors in Multicontig
The image of a fragment is the portion of the consensus sequence, S, corresponding to the fragment in the layout
S is an ε-consensus for a collection of fragments when the edit distance from each fragment, f, and its image is at most ε|f|
[Layout: the fragments TATAGCATCAT, CGTC, CATGATCA, ACGGATAG, and GTCCA aligned against the consensus ACGTATAGCATGATCA]
An ε-consensus for ε = 0.4
Definition of Multicontig
Input: A collection, F, of strings, an integer t ≥ 0, and an error tolerance ε between 0 and 1
Output: A partition of F into the minimum number of collections Ci such that every Ci admits a t-contig with an ε-consensus
Example of Multicontig
Let ε = 0.4, t = 3
[Layout: the fragments TATAGCATCAT, ACGTC, CATGATCAG, ACGGATAG, and GTCCAG aligned against the consensus ACGTATAGCATGATCAG]
Algorithms
Most of the algorithms to solve the fragment assembly problem are based on a graph model
A graph, G, is a collection of edges, e, and vertices, v.
• Directed or undirected
• Weighted or unweighted
We will discuss representations and other issues shortly…
[Figure: a directed, unweighted graph]
The Maximum Overlap Graph
The text calls it an overlap multigraph
Each directed edge, (u,v), is weighted with the length of the maximal overlap between a suffix of u and a prefix of v
[Figure: overlap graph on the fragments a = TACGA, b = ACCC, c = CTAAAG, d = GACA, with edge weights 1, 1, 1, 2, 1; 0-weight edges omitted]
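A minimal sketch of how the edge weights of such a graph could be computed, assuming a brute-force suffix/prefix comparison is acceptable for short fragments. The function names are illustrative.

```python
def overlap(u, v):
    """Length of the longest suffix of u that is also a prefix of v."""
    for k in range(min(len(u), len(v)), 0, -1):
        if u[-k:] == v[:k]:
            return k
    return 0

def overlap_graph(fragments):
    """Map each ordered pair (u, v) to its overlap; 0-weight edges are dropped."""
    return {(u, v): overlap(u, v)
            for u in fragments for v in fragments
            if u != v and overlap(u, v) > 0}

graph = overlap_graph(["TACGA", "ACCC", "CTAAAG", "GACA"])
for (u, v), weight in graph.items():
    print(u, "->", v, weight)
```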
Paths and Layouts
The path dbc leads to the alignment:
GACA--------
---ACCC-----
------CTAAAG
Superstrings
Every path that covers every node is a superstring
Zero weight edges result in alignments like:
GACA----------
----GCCC------
--------TTAAAG
Higher weights produce more overlap, and thus shorter strings
The shortest common superstring is the highest weight path that covers every node
Graph formulation of SCS
Input: A weighted, directed graph
Output: The highest-weight path that touches every node of the graph
Does this problem sound familiar?
The Greedy Algorithm
Algorithm greedy
    Sort edges in decreasing weight order
    For each edge in this order
        If the edge does not form a cycle
           and the edge does not start or end at the same node as another edge in the set
        then add the edge to the current set
    End for
End Algorithm
Figure 4.16, page 125
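The greedy procedure can be sketched in Python as follows. This is an illustration rather than the textbook's code: it assumes no fragment is a substring of another, keeps 0-weight edges so that a full path always exists, and uses a small union-find structure (an implementation choice of this sketch) to detect cycles.

```python
def overlap(u, v):
    """Length of the longest suffix of u that is also a prefix of v."""
    for k in range(min(len(u), len(v)), 0, -1):
        if u[-k:] == v[:k]:
            return k
    return 0

def greedy_superstring(fragments):
    n = len(fragments)
    edges = [(overlap(fragments[i], fragments[j]), i, j)
             for i in range(n) for j in range(n) if i != j]
    edges.sort(key=lambda e: -e[0])        # decreasing weight; ties keep input order

    parent = list(range(n))                # union-find, used to detect cycles
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    next_node, has_incoming = {}, set()
    for weight, i, j in edges:
        if i in next_node or j in has_incoming:
            continue                       # shares a start or end node with a chosen edge
        if find(i) == find(j):
            continue                       # would close a cycle
        next_node[i] = (j, weight)
        has_incoming.add(j)
        parent[find(i)] = find(j)

    # Follow the chosen path from the one node without an incoming edge.
    node = next(i for i in range(n) if i not in has_incoming)
    result = fragments[node]
    while node in next_node:
        node, weight = next_node[node]
        result += fragments[node][weight:]
    return result

print(greedy_superstring(["TACGA", "ACCC", "CTAAAG", "GACA"]))
```

On the fragments ATGC, TGCAT, and GCC this sketch returns a 9-character superstring, although the 8-character superstring TGCATGCC exists, which shows that the greedy choice is not always optimal.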
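Greedy Example
[Figure: a weighted, directed graph with edge weights 7, 6, 5, 4, 3, 2, 2, 2, 1]
Greedy does not always find the best path
[Figure: overlap graph on ATGC, TGCAT, and GCC with edge weights 3, 2, 2, and 0]
Greedy takes the weight-3 edge ATGC → TGCAT first and is then forced onto a 0-weight edge (total weight 3); the path TGCAT → ATGC → GCC has total weight 4.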
Tools for Shotgun Sequencing
Common Difficulty
Each of these problems is a method for modeling fragment assembly
Each of these problems is provably NP-hard, and so believed to be intractable
How?
Embedding problems
Suppose I told you that I had found a clever way to model the TSP as a shortest common superstring problem
• Paths between cities are represented as fragments
• The shortest path is the shortest common superstring of the fragments
If this is true, then there are only two possibilities:
1. This problem is just as intractable as TSP
2. TSP is actually a tractable problem!
NP-Complete Problems
There is a collection of problems that computer scientists believe to be intractable
• TSP is one of them
Each of them has been modeled as one or more of the other NP-complete problems
If you solve one, you solve them all
A problem, p, is NP-hard if you can model one of these NP-complete problems as an instance of p
NP-Completeness
[Figure: the regions P and NP, with the problems TSP, 3-SAT, graph 3-coloring, vertex cover, subset sum, set packing, and bin packing placed in NP]
P = NP?
[Figure: the same problems (3-SAT, graph 3-coloring, vertex cover, subset sum, set packing, bin packing) with the regions P and NP redrawn]
Part III – Sequence Alignments
Needleman-Wunsch
Smith-Waterman
Dynamic Programming
Why align sequences?
The draft human genome is available
Automated gene finding is possible
Gene: AGTACGTATCGTATAGCGTAA
• What does it do?
One approach: Is there a similar gene in another species?
• Align sequences with known genes
• Find the gene with the “best” match
Comparing two sequences
Point mutations, easy:
ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGATTCGCCCTATCGTCTATCT
Indels are difficult, must align sequences:
ACGTCTGATACGCCGTATAGTCTATCT
CTGATTCGCATCGTCTATCT
Aligned:
ACGTCTGATACGCCGTATAGTCTATCT
----CTGATTCGC---ATCGTCTATCT
Scoring a sequence alignment
Match score: +1
Mismatch score: +0
Gap penalty: –1
ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT
Matches: 18 × (+1)
Mismatches: 2 × 0
Gaps: 7 × (–1)
Score = +11
DNA Replication
Prior to cell division, all the genetic instructions must be “copied” so that each new cell will have a complete set
DNA polymerase is the enzyme that copies DNA
• Synthesizes in the 5' to 3' direction
Over time, genes accumulate mutations
Environmental factors
• Radiation
• Oxidation
Mistakes in replication or repair
Deletions, Duplications
Insertions
Inversions
Point mutations
Deletions
Codon deletion:
ACG ATA GCG TAT GTA TAG CCG…
• Effect depends on the protein, position, etc.
• Almost always deleterious
• Sometimes lethal
Frame shift mutation:
ACG ATA GCG TAT GTA TAG CCG…
ACG ATA GCG ATG TAT AGC CG?…
• Almost always lethal
Indels
Comparing two genes it is generally impossible to tell if an indel is an insertion in one gene, or a deletion in another, unless ancestry is known:
ACGTCTGATACGCCGTATCGTCTATCT
ACGTCTGAT---CCGTATCGTCTATCT
Origination and length penalties
We want to find alignments that are evolutionarily likely.
Which of the following alignments seems more likely to you?
ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGAT-------ATAGTCTATCT
or
ACGTCTGATACGCCGTATAGTCTATCT
AC-T-TGA--CG-CGT-TA-TCTATCT
We can achieve this by penalizing more for a new gap than for extending an existing gap
Scoring a sequence alignment (2)
Match/mismatch score: +1/+0
Origination/length penalty: –2/–1
ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT
Matches: 18 × (+1)
Mismatches: 2 × 0
Origination: 2 × (–2)
Length: 7 × (–1)
Score = +7
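Both scoring schemes can be checked with a short function. The sketch below is an assumption-level illustration: it treats '-' as the gap symbol and charges the origination penalty whenever a gap starts or switches rows.

```python
def score_alignment(top, bottom, match=1, mismatch=0, origination=0, length=-1):
    """Score two equal-length aligned rows; '-' marks a gap."""
    assert len(top) == len(bottom)
    score = 0
    previous_gap = None                      # row gapped in the previous column, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gap_row = "top" if a == "-" else "bottom"
            if gap_row != previous_gap:
                score += origination         # a new gap starts here
            score += length
            previous_gap = gap_row
        else:
            score += match if a == b else mismatch
            previous_gap = None
    return score

top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
print(score_alignment(top, bottom))                  # simple gap penalty of -1
print(score_alignment(top, bottom, origination=-2))  # origination -2, length -1
```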
How can we find an optimal alignment?
Finding the alignment is computationally hard:
ACGTCTGATACGCCGTATAGTCTATCT
CTGAT---TCG-CATCGTC--T-ATCT
C(27,7) gap positions = ~888,000 possibilities
It’s possible, as long as we don’t repeat our work!
Dynamic programming: The Needleman & Wunsch algorithm
What is the optimal alignment?
ACTCG
ACAGTAG
Match: +1
Mismatch: 0
Gap: –1
Needleman-Wunsch: Step 1
Each sequence along one axis
Gap penalty multiples in first row/column
0 in [1,1] (or [0,0] for the CS-minded)
       A   C   T   C   G
   0  -1  -2  -3  -4  -5
A -1   1
C -2
A -3
G -4
T -5
A -6
G -7
Needleman-Wunsch: Step 2
Vertical/Horiz. move: Score + (simple) gap penalty
Diagonal move: Score + match/mismatch score
Take the MAX of the three possibilities
       A   C   T   C   G
   0  -1  -2  -3  -4  -5
A -1   1
C -2
A -3
G -4
T -5
A -6
G -7
Needleman-Wunsch: Step 2 (cont’d)
Fill out the rest of the table likewise…
       a   c   t   c   g
   0  -1  -2  -3  -4  -5
a -1   1   0  -1  -2  -3
c -2
a -3
g -4
t -5
a -6
g -7
Needleman-Wunsch: Step 2 (cont’d)
Fill out the rest of the table likewise…
The optimal alignment score is calculated in the lower-right corner
       a   c   t   c   g
   0  -1  -2  -3  -4  -5
a -1   1   0  -1  -2  -3
c -2   0   2   1   0  -1
a -3  -1   1   2   1   0
g -4  -2   0   1   2   2
t -5  -3  -1   1   1   2
a -6  -4  -2   0   1   1
g -7  -5  -3  -1   0   2
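The table-filling steps can be sketched as follows, using the same scores (match +1, mismatch 0, gap –1). The function name and the choice to prefer diagonal moves during traceback are assumptions of this sketch; other tie-breaking rules give other, equally optimal alignments.

```python
def needleman_wunsch(top, left, match=1, mismatch=0, gap=-1):
    rows, cols = len(left) + 1, len(top) + 1
    table = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        table[0][j] = j * gap                # first row: multiples of the gap penalty
    for i in range(1, rows):
        table[i][0] = i * gap                # first column likewise

    def sub(i, j):
        return match if left[i - 1] == top[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            table[i][j] = max(table[i - 1][j - 1] + sub(i, j),   # diagonal
                              table[i - 1][j] + gap,             # vertical
                              table[i][j - 1] + gap)             # horizontal

    # Trace back from the lower-right corner to recover one optimal alignment.
    i, j, a_top, a_left = rows - 1, cols - 1, [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and table[i][j] == table[i - 1][j - 1] + sub(i, j):
            a_top.append(top[j - 1]); a_left.append(left[i - 1]); i -= 1; j -= 1
        elif i > 0 and table[i][j] == table[i - 1][j] + gap:
            a_top.append("-"); a_left.append(left[i - 1]); i -= 1
        else:
            a_top.append(top[j - 1]); a_left.append("-"); j -= 1
    return table, "".join(reversed(a_top)), "".join(reversed(a_left))

table, aligned_top, aligned_left = needleman_wunsch("ACTCG", "ACAGTAG")
print(table[-1][-1])
print(aligned_top)
print(aligned_left)
```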
But what is the optimal alignment?
       a   c   t   c   g
   0  -1  -2  -3  -4  -5
a -1   1   0  -1  -2  -3
c -2   0   2   1   0  -1
a -3  -1   1   2   1   0
g -4  -2   0   1   2   2
t -5  -3  -1   1   1   2
a -6  -4  -2   0   1   1
g -7  -5  -3  -1   0   2
To reconstruct the optimal alignment, we must determine where the MAX at each step came from…
A path corresponds to an alignment
Vertical move = GAP in top sequence
Horizontal move = GAP in left sequence
Diagonal move = ALIGN both positions
One path from the previous table:
Corresponding alignment (start at the end):
AC--TCG
ACAGTAG
Score = +2
Practice Problem
Find an optimal alignment for these two sequences:
GCGGTT
GCGT
Match: +1
Mismatch: 0
Gap: –1
       g   c   g   g   t   t
   0  -1  -2  -3  -4  -5  -6
g -1
c -2
g -3
t -4
Practice Problem
Find an optimal alignment for these two sequences:
GCGGTT
GCGT
       g   c   g   g   t   t
   0  -1  -2  -3  -4  -5  -6
g -1   1   0  -1  -2  -3  -4
c -2   0   2   1   0  -1  -2
g -3  -1   1   3   2   1   0
t -4  -2   0   2   3   3   2
GCGGTT
GCG-T-
Score = +2
Semi-global alignment
Suppose we are aligning:
GCG
GGCG
Which do you prefer?
G-CG      -GCG
GGCG      GGCG
Global alignment table:
       g   c   g
   0  -1  -2  -3
g -1   1   0  -1
g -2   0   1   1
c -3  -1   1   1
g -4  -2   0   2
Semi-global alignment allows gaps at the ends for free.
Semi-global alignment
      g   c   g
  0   0   0   0
g 0   1   0   1
g 0   1   1   1
c 0   0   2   1
g 0   1   1   3
Semi-global alignment allows gaps at the ends for free.
Initialize first row and column to all 0’s
Allow free horizontal/vertical moves in last row and column
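One way to realize these two rules in code is sketched below. Treating "free moves in the last row and column" as taking the best cell found anywhere on the last row or last column is an interpretation made for this example.

```python
def semiglobal_score(top, left, match=1, mismatch=0, gap=-1):
    rows, cols = len(left) + 1, len(top) + 1
    table = [[0] * cols for _ in range(rows)]       # first row and column stay 0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if left[i - 1] == top[j - 1] else mismatch
            table[i][j] = max(table[i - 1][j - 1] + sub,
                              table[i - 1][j] + gap,
                              table[i][j - 1] + gap)
    # Trailing gaps are free: the answer is the best cell on the last row or column.
    best = max(max(table[-1]), max(row[-1] for row in table))
    return best, table

best, table = semiglobal_score("GCG", "GGCG")
print(best)
```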
Local alignment
Global alignments: score the entire alignment
Semi-global alignments: allow unscored gaps at the beginning or end of either sequence
Local alignment: find the best matching subsequence
CGATG
AAATGGA
This is achieved by allowing a 4th alternative at each position in the table: zero.
Local alignment
Mismatch = –1 this time
CGATG
AAATGGA
       c   g   a   t   g
   0  -1  -2  -3  -4  -5
a -1   0   0   0   0   0
a -2   0   0   1   0   0
a -3   0   0   1   0   0
t -4   0   0   0   2   1
g -5   0   1   0   1   3
g -6   0   1   0   0   2
a -7   0   0   2   1   1
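A compact sketch of local alignment with the zero alternative is given below. It follows the usual Smith-Waterman convention of a zero first row and column, which differs from the negative border shown in the table above, so a few cells in the first rows may differ while the best local score is the same.

```python
def smith_waterman(top, left, match=1, mismatch=-1, gap=-1):
    rows, cols = len(left) + 1, len(top) + 1
    table = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if left[i - 1] == top[j - 1] else mismatch
            table[i][j] = max(0,                              # the 4th alternative
                              table[i - 1][j - 1] + sub,
                              table[i - 1][j] + gap,
                              table[i][j - 1] + gap)
            if table[i][j] > best:
                best, best_pos = table[i][j], (i, j)

    # Trace back from the best cell until a zero is reached.
    i, j = best_pos
    a_top, a_left = [], []
    while i > 0 and j > 0 and table[i][j] > 0:
        sub = match if left[i - 1] == top[j - 1] else mismatch
        if table[i][j] == table[i - 1][j - 1] + sub:
            a_top.append(top[j - 1]); a_left.append(left[i - 1]); i -= 1; j -= 1
        elif table[i][j] == table[i - 1][j] + gap:
            a_top.append("-"); a_left.append(left[i - 1]); i -= 1
        else:
            a_top.append(top[j - 1]); a_left.append("-"); j -= 1
    return best, "".join(reversed(a_top)), "".join(reversed(a_left))

print(smith_waterman("CGATG", "AAATGGA"))
```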
Optimal Substructure in Alignments
Consider the alignment:
ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT
Is it true that the alignment in the boxed region must be optimal?
A Greedy Strategy
Consider this pair of sequences
GAGC
CAGC
Greedy Approach:
G      G      -
C  or  -  or  G
Leads to
GAGC---      Better:   GACG
---CAGC                CACG
Gap penalty = 1
Match = +1
Mismatch penalty = 2
Breaking apart the problem
Suppose we are aligning:
ACTCG
ACAGTAG
First position choices:
A         CTCG
A   +1    CAGTAG

A         CTCG
-   -1    ACAGTAG

-         ACTCG
A   -1    CAGTAG
A Recursive Approach to Alignment
Choose the best alignment based on these three possibilities:
align(seq1, seq2) {
    if (both sequences empty) { return 0; }
    if (one sequence empty) {
        return (gapscore * num chars in nonempty seq);
    } else {
        score1 = score(firstchar(seq1), firstchar(seq2)) + align(tail(seq1), tail(seq2));
        score2 = align(tail(seq1), seq2) + gapscore;
        score3 = align(seq1, tail(seq2)) + gapscore;
        return max(score1, score2, score3);
    }
}
Time Complexity of RecurseAlign
What is the recurrence equation for the time needed by RecurseAlign?
$T(n) = 3T(n-1) + 3$
[Figure: recursion tree of depth n in which every call makes 3 further calls, so successive levels contain 3, 9, 27, …, $3^n$ calls]
RecurseAlign repeats its work
[Figure: alignment grid with ACGTATCGCGTATA along the top and GATGCTCTCG down the side]
Dynamic Programming
Remember all the subproblem answers along the way:
       a   c   t   c   g
   0  -1  -2  -3  -4  -5
a -1   1   0  -1  -2  -3
c -2   0   2   1   0  -1
a -3  -1   1   2   1   0
g -4  -2   0   1   2   2
t -5  -3  -1   1   1   2
a -6  -4  -2   0   1   1
g -7  -5  -3  -1   0   2
This is possible for any problem that exhibits optimal substructure
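"Remembering the subproblem answers" can be illustrated by memoizing the three-way recursion. The sketch below assumes suffix indices as the memoization key and uses Python's functools.lru_cache; the function names are illustrative.

```python
from functools import lru_cache

def align_score(seq1, seq2, match=1, mismatch=0, gapscore=-1):
    @lru_cache(maxsize=None)
    def align(i, j):
        """Best score for aligning seq1[i:] with seq2[j:]."""
        if i == len(seq1) or j == len(seq2):
            # One (or both) sequences exhausted: the rest is all gaps.
            return gapscore * ((len(seq1) - i) + (len(seq2) - j))
        pair = match if seq1[i] == seq2[j] else mismatch
        return max(pair + align(i + 1, j + 1),      # align both first characters
                   gapscore + align(i + 1, j),      # first character of seq1 against a gap
                   gapscore + align(i, j + 1))      # first character of seq2 against a gap
    return align(0, 0)

print(align_score("ACTCG", "ACAGTAG"))
```

Each (i, j) pair is now solved once, so the work is proportional to the size of the table rather than growing as $3^n$.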
Saving Space
Note that we can throw away the previous rows of the table as we fill it in:
       a   c   t   c   g
   0  -1  -2  -3  -4  -5
a -1   1   0  -1  -2  -3
c -2   0   2   1   0  -1
a -3  -1   1   2   1   0
g -4  -2   0   1   2   2
t -5  -3  -1   1   1   2
a -6  -4  -2   0   1   1
g -7  -5  -3  -1   0   2
Each row is based only on the row directly above it
Saving Space (2)
Each row of the table contains the scores for aligning a prefix of the left-hand sequence with all prefixes of the top sequence:
       a   c   t   c   g
   0  -1  -2  -3  -4  -5
a -1   1   0  -1  -2  -3
c -2   0   2   1   0  -1
a -3  -1   1   2   1   0
g -4  -2   0   1   2   2
t -5  -3  -1   1   1   2
a -6  -4  -2   0   1   1
g -7  -5  -3  -1   0   2
The third row (a) holds the scores for aligning aca with all prefixes of actcg
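A sketch of the space-saving idea: only the previous row and the row being filled are kept. The function name is illustrative, and the scores are the ones used in the table above.

```python
def last_row_scores(top, left, match=1, mismatch=0, gap=-1):
    """Scores for aligning all of `left` with every prefix of `top`, keeping two rows."""
    prev = [j * gap for j in range(len(top) + 1)]
    for i in range(1, len(left) + 1):
        curr = [i * gap]
        for j in range(1, len(top) + 1):
            sub = match if left[i - 1] == top[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap))
        prev = curr                          # the older row is no longer needed
    return prev

print(last_row_scores("ACTCG", "ACA"))
print(last_row_scores("ACTCG", "ACAGTAG"))
```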
Divide and Conquer
By using a recursive approach, we can use only two rows of the matrix at a time:
• Choose the middle character of the top sequence, i
• Find out where i aligns to the bottom sequence (needs two vectors of scores)
• Recursively align the sequences before and after the fixed positions
s: ACGCTATGCTCATAG   (i is the middle character, G)
t: CGACGCTCATCG
Finding where i lines up
Find out where i aligns to the bottom sequence (needs two vectors of scores)
Assuming i lines up with a character:
alignscore = align(ACGCTAT, prefix(t)) + score(G, char from t) + align(CTCATAG, suffix(t))
Which character is best?
• Can quickly find out the score for aligning ACGCTAT with every prefix of t.
Finding where i lines up
But i may also line up with a gap
Assuming i lines up with a gap:
alignscore = align(ACGCTAT, prefix(t)) + gapscore + align(CTCATAG, suffix(t))
Recursive Call
Fix the best position for i
Call align recursively for the prefixes and suffixes
Complexity
Let len(s) = m and len(t) = n
Space: 2m
Time:
• Each call to build similarity vector = m′n′
• First call + recursive call, where i is matched at position j of t:
$T(m, n) = mn + T(m/2, j) + T(m/2, n - j)$
$\le mn + 2 \cdot \frac{m}{2} \cdot j + 2 \cdot \frac{m}{2} \cdot (n - j)$
$= mn + mj + m(n - j) = 2mn$
General Gap Penalties
Suppose we are no longer using simple gap penalties:
• Origination = −2
• Length = −1
Consider the last position of the alignment for ACGTA with ACG
We can’t determine the score for
G      -
-  or  G
unless we know the previous positions!
Scoring Blocks
Now we must score a block at a time
A block is a pair of characters, or a maximal group of gaps paired with characters
To score a position, we need to either start a new block or add it to a previous block
A | A | C | --- | A | TATCCG | A | C | T | AC
A | C | T | ACC | T | ------ | C | G | C | --
The Algorithm
Three tables
• a: scores for alignments ending in char-char blocks
• b: scores for alignments ending in gaps in the top sequence (s)
• c: scores for alignments ending in gaps in the left sequence (t)
Scores no longer depend on only three positions, because we can put any number of gaps into the last block
The Recurrences
$a[i,j] = p(i,j) + \max\{a[i-1,j-1],\ b[i-1,j-1],\ c[i-1,j-1]\}$
$b[i,j] = \max\{a[i,j-k] - w(k),\ c[i,j-k] - w(k)\}$ for $1 \le k \le j$
$c[i,j] = \max\{a[i-k,j] - w(k),\ b[i-k,j] - w(k)\}$ for $1 \le k \le i$
Here p(i, j) is the score for pairing position i with position j, and w(k) is the penalty for a gap of length k.
The Optimal Alignment
The optimal alignment is found by looking at the maximal value in the lower right of all three arrays
The algorithm runs in $O(n^3)$ time
• Uses $O(n^2)$ space
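The three-table scheme can be sketched as below. This is an illustration under stated assumptions: w(k) is passed in as a positive penalty function (for example 2 + k for origination 2 and length 1 per position), the boundary cells are initialized so that only leading gaps are allowed in row 0 and column 0, and i indexes s while j indexes t.

```python
def general_gap_score(s, t, w, match=1, mismatch=0):
    NEG = float("-inf")
    m, n = len(s), len(t)
    a = [[NEG] * (n + 1) for _ in range(m + 1)]   # ends in a char-char block
    b = [[NEG] * (n + 1) for _ in range(m + 1)]   # ends in gaps aligned with characters of t
    c = [[NEG] * (n + 1) for _ in range(m + 1)]   # ends in gaps aligned with characters of s
    a[0][0] = 0
    for j in range(1, n + 1):
        b[0][j] = -w(j)                           # one leading gap block of length j
    for i in range(1, m + 1):
        c[i][0] = -w(i)                           # one leading gap block of length i
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = p + max(a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1])
            b[i][j] = max(max(a[i][j - k], c[i][j - k]) - w(k) for k in range(1, j + 1))
            c[i][j] = max(max(a[i - k][j], b[i - k][j]) - w(k) for k in range(1, i + 1))
    return max(a[m][n], b[m][n], c[m][n])

print(general_gap_score("ACGT", "AT", lambda k: 2 + k))
```

With w(k) = k the result coincides with the simple gap penalty of 1 per position, which gives a convenient sanity check against the earlier table.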
Part IV – Database Searches
BLAST
Search statistics
Database Searching
How can we find a particular short sequence in a database of sequences (or one HUGE sequence)?
Problem is identical to local sequence alignment, but on a much larger scale.
We must also have some idea of the significance of a database hit.
• Databases always return some kind of hit; how much attention should be paid to the result?
BLAST
BLAST: Basic Local Alignment Search Tool
A heuristic approximation of optimal local (Smith-Waterman) alignment
Sacrifices some search sensitivity for speed
Scoring Matrices
DNA
• Identity
• Transition/Transversion
Proteins
• PAM
• BLOSUM
    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A   2
R  -2  6
N   0  0  2
D   0 -1  2  4
C  -2 -4 -4 -5  4
Q   0  1  1  2 -5  4
E   0 -1  1  3 -5  2  4
G   1 -3  0  1 -3 -1  0  5
H  -1  2  2  1 -3  3  1 -2  6
I  -1 -2 -2 -2 -2 -2 -2 -3 -2  5
L  -2 -3 -3 -4 -6 -2 -3 -4 -2  2  6
K  -1  3  1  0 -5  1  0 -2  0 -2 -3  5
M  -1  0 -2 -3 -5 -1 -2 -3 -2  2  4  0  6
F  -4 -4 -4 -6 -4 -5 -5 -5 -2  1  2 -5  0  9
P   1  0 -1 -1 -3  0 -1 -1  0 -2 -3 -1 -2 -5  6
S   1  0  1  0  0 -1  0  1 -1 -1 -3  0 -2 -3  1  3
T   1 -1  0  0 -2 -1  0  0 -1  0 -2  0 -1 -2  0  1  3
W  -6  2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4  0 -6 -2 -5 17
Y  -3 -4 -2 -4  0 -4 -4 -5  0 -1 -1 -4 -2  7 -5 -3 -3  0 10
V   0 -2 -2 -2 -2 -2 -2 -1 -2  4  2 -2  2 -1 -1 -1  0 -6  2  4
The BLAST algorithm
Break the search sequence into words
• W = 3 for proteins, W = 12 for DNA
MCGPFILGTYC
MCG, CGP, GPF, PFI, FIL, ILG, LGT, GTY, TYC
Include in the search all words that score above a certain value (T) for any search word
• MCG: MCT, MCN, …
• CGP: MGP, CTP, …
This list can be computed in linear time
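The word-splitting step, together with a precomputed index of word locations, might look like the sketch below. It is deliberately simplified: only exact word matches are looked up, and the neighborhood words that score above T (which would require a scoring matrix) are left out. All function names are hypothetical.

```python
from collections import defaultdict

def query_words(query, w):
    """All overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]

def index_words(database, w):
    """Map every length-w word in the database string to its start positions."""
    index = defaultdict(list)
    for i in range(len(database) - w + 1):
        index[database[i:i + w]].append(i)
    return index

def exact_word_hits(query, database, w):
    """(query position, database position) pairs where a query word occurs exactly."""
    index = index_words(database, w)
    return [(q, d)
            for q, word in enumerate(query_words(query, w))
            for d in index.get(word, [])]

print(query_words("MCGPFILGTYC", 3))
```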
The BLAST Algorithm (2)
Search for the words in the database
• Word locations can be precomputed and indexed
• Searching for a short string in a long string
• Regular expression matching: FSA
HSP (High Scoring Pair) = A match between a query word and the database
Find a “hit”: Two non-overlapping HSP’s on a diagonal within distance A
Extend the hit until the score falls below a threshold value, X
Results from a BLAST search
Search Significance Scores
A search will always return some hits.
How can we determine how “unusual” a particular alignment score is?
• ORF’s
Assumptions
Assessing significance requires a distribution
I have an apple of diameter 5”. Is that unusual?
[Figure: histogram of frequency versus diameter (cm)]
Is a match significant?
Match scores for aligning my sequence with random sequences.
Depends on:
• Scoring system
• Database
• Sequence to search for: length, composition
How do we determine the random sequences?
[Figure: histogram of frequency versus match score]
Generating “random” sequences
Random uniform model: P(G) = P(A) = P(C) = P(T) = 0.25
• Doesn’t reflect nature
Use sequences from a database
• Might have genuine homology
• We want unrelated sequences
Random shuffling of sequences
• Preserves composition
• Removes true homology
What distribution do we expect to see?
The mean of n random (i.i.d.) events tends towards a Gaussian distribution.
• Example: Throw n dice and compute the mean.
• Distribution of means:
[Figure: distribution of the mean for n = 2 and for n = 1000]
The extreme value distribution
This means that if we get the match scores for our sequence with n other sequences, the mean would follow a Gaussian distribution.
The maximum of n (i.i.d.) random events tends towards the extreme value distribution as n grows large.
Comparing distributions
Extreme Value: $f(x) = \frac{1}{\beta} e^{-\frac{x-\mu}{\beta}} e^{-e^{-\frac{x-\mu}{\beta}}}$
Gaussian: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
Determining P-values
If we can estimate the two parameters of the distribution, then we can determine, for a given match score x, the probability that a random match with score x or greater would have occurred in the database.
For sequence matches, a scoring system and database can be parameterized by two parameters, K and λ, related to the parameters of the extreme value distribution.
• It would be nice if we could compare hit significance without regard to the database and scoring system used!
Bit Scores
The expected number of hits with score at least S is:
$E = K m n e^{-\lambda S}$
• Where m and n are the sequence lengths
Normalize the raw score using:
$S' = \frac{\lambda S - \ln K}{\ln 2}$
Obtains a “bit score” S’, with a standard set of units.
The new E-value is:
$E = m n 2^{-S'}$
P values and E values
BLAST reports E-values
E = 5, E = 10 versus P = 0.993 and P = 0.99995
When E is below 0.01, P-values and E-values are nearly identical
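The relationships between raw scores, bit scores, E-values, and P-values can be checked numerically. The sketch below uses the standard relation P = 1 − e^(−E), which is not written on the slides but reproduces their numbers; the K and λ values in the usage lines are made-up placeholders, not parameters of any real scoring system.

```python
import math

def expected_hits(raw_score, m, n, K, lam):
    """E-value from a raw score: K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * raw_score)

def bit_score(raw_score, K, lam):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)

def expected_hits_from_bits(bits, m, n):
    """E-value from a bit score: m * n * 2 ** (-S')."""
    return m * n * 2 ** (-bits)

def p_value(e_value):
    """Probability of at least one such hit, assuming hit counts are Poisson."""
    return 1 - math.exp(-e_value)

K, lam = 0.1, 0.3                      # hypothetical parameter values
print(expected_hits(50, 100, 1_000_000, K, lam))
print(expected_hits_from_bits(bit_score(50, K, lam), 100, 1_000_000))
print(round(p_value(5), 3), round(p_value(10), 5))
```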
BLAST parameters
Lowering the neighborhood word threshold (T) allows more distantly related sequences to be found, at the expense of increased noise in the results set.
Raising the segment extension cutoff (X) returns longer extensions for each hit.
Changing the E-value cutoff changes the threshold for reporting a hit.
Optional – Phylogenies
Preliminaries
Distance-based methods
Parsimony Methods
Phylogenetic Trees
Hypothesis about the relationship between organisms
Can be rooted or unrooted
[Figure: a rooted tree on A, B, C, D, E with a time axis running from the root, and an unrooted tree on the same five taxa]
Tree proliferation
$N_R = \frac{(2n-3)!}{2^{n-2}(n-2)!}$
$N_U = \frac{(2n-5)!}{2^{n-3}(n-3)!}$
Species   Number of Rooted Trees           Number of Unrooted Trees
2         1                                1
3         3                                1
4         15                               3
5         105                              15
10        34,459,425                       2,027,025
15        213,458,046,676,875              7,905,853,580,625
20        8,200,794,532,637,891,559,375    221,643,095,476,699,771,875
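The two counting formulas are easy to evaluate with exact integer arithmetic. The sketch below assumes n ≥ 2 for rooted trees and n ≥ 3 for unrooted trees, since the unrooted formula is undefined for n = 2.

```python
from math import factorial

def rooted_trees(n):
    """Number of rooted bifurcating trees on n species (n >= 2)."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

def unrooted_trees(n):
    """Number of unrooted bifurcating trees on n species (n >= 3)."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

for n in (3, 4, 5, 10):
    print(n, rooted_trees(n), unrooted_trees(n))
```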
Molecular phylogeneticsMolecular phylogenetics Specific genomic
sequence variations (alleles) are much more reliable than phenotypic characteristics
More than one gene should be considered
An ongoing didactic
Pheneticists tend to prefer distance based metrics, as they emphasize relationships among data sets, rather than the paths they have taken to arrive at their current states.
Cladists are generally more interested in evolutionary pathways, and tend to prefer more evolutionarily based approaches such as maximum parsimony.
Distance matrix methods
Species A B C D
B 9 – – –
C 8 11 – –
D 12 15 10 –
E 15 18 13 5
UPGMA
Similar to average-link clustering
Merge the closest two groups
• Replace the distances for the new, merged group with the average of the distance for the previous two groups
Repeat until all species are joined
UPGMA Step 1
Species A B C D
B 9 – – –
C 8 11 – –
D 12 15 10 –
E 15 18 13 5

Merge D & E
[Figure: partial tree joining D and E]

Species A B C
B 9 – –
C 8 11 –
DE 13.5 16.5 11.5
UPGMA Step 2
Merge A & C
[Figure: partial trees joining D with E and A with C]

Species A B C
B 9 – –
C 8 11 –
DE 13.5 16.5 11.5

Species B AC
AC 10 –
DE 16.5 12.5
UPGMA Steps 3 & 4
Merge B & AC
[Figure: partial trees joining (A,C) with B, and D with E]

Species B AC
AC 10 –
DE 16.5 12.5

Merge ABC & DE
[Figure: final tree joining ((A,C),B) with (D,E)]
(((A,C),B),(D,E))
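The four merge steps can be reproduced with a small UPGMA sketch. This version weights each average by cluster size, which is the standard UPGMA update; on this data it gives the same intermediate distances as the slides (13.5, 16.5, 11.5, 10, 12.5). The function name and the dictionary-based matrix layout are my own choices.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster `names` by UPGMA.

    dist maps (a, b) pairs to distances. Returns a Newick-like string and
    the list of merges as (cluster_a, cluster_b, distance).
    """
    sizes = {name: 1 for name in names}          # cluster label -> leaf count
    d = {frozenset(pair): v for pair, v in dist.items()}
    history = []
    while len(sizes) > 1:
        # Pick the closest pair of current clusters
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        history.append((a, b, d[frozenset((a, b))]))
        merged = f"({a},{b})"
        na, nb = sizes.pop(a), sizes.pop(b)
        # Distance from the merged cluster to every remaining cluster is the
        # size-weighted average of the two old distances
        for other in sizes:
            d[frozenset((merged, other))] = (
                na * d[frozenset((a, other))] + nb * d[frozenset((b, other))]
            ) / (na + nb)
        sizes[merged] = na + nb
    return next(iter(sizes)), history

names = ["A", "B", "C", "D", "E"]
dist = {
    ("A", "B"): 9, ("A", "C"): 8, ("B", "C"): 11,
    ("A", "D"): 12, ("B", "D"): 15, ("C", "D"): 10,
    ("A", "E"): 15, ("B", "E"): 18, ("C", "E"): 13, ("D", "E"): 5,
}
tree, history = upgma(names, dist)
print(tree)
for a, b, height in history:
    print(f"merge {a} & {b} at distance {height:g}")
```

The printed grouping has the same topology as the tree on the slide; only the left-to-right order of the subtrees differs.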
Parsimony approaches
Belong to the broader class of character based methods of phylogenetics
Emphasize simpler, and thus more likely, evolutionary pathways
I: GCGGACG
II: GTGGACG
[Figure: two candidate histories for position 2. In the first, an ancestor (C or T) leads directly to I (C) and II (T); in the second, an intermediate A is invoked between the ancestor (C or T) and the two sequences]
Informative and uninformative sites
      Position
Seq   1 2 3 4 5 6
1     G G G G G G
2     G G G A G T
3     G G A T A G
4     G A T C A T
For positions 5 & 6, it is possible to select more parsimonious trees – those that invoke fewer substitutions.
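A sketch of how informative sites can be picked out automatically. It assumes the usual definition (not spelled out on the slide): a site is parsimony-informative when at least two different states each occur in at least two sequences.

```python
from collections import Counter

def informative_sites(seqs):
    """Return 1-based positions of parsimony-informative columns."""
    sites = []
    for pos, column in enumerate(zip(*seqs), start=1):
        counts = Counter(column)
        # Count the states that appear in two or more sequences
        shared_states = sum(1 for c in counts.values() if c >= 2)
        if shared_states >= 2:
            sites.append(pos)
    return sites

seqs = ["GGGGGG", "GGGAGT", "GGATAG", "GATCAT"]  # the four sequences above
print(informative_sites(seqs))
```

Positions 1 to 4 are uninformative because every tree requires the same number of changes there, so they cannot help choose between trees.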
Parsimony methods
Enumerate all possible trees
Note the number of substitution events invoked by each possible tree
• Can be weighted by transition/transversion probabilities, etc.
Select the most parsimonious
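One way to count the substitutions a tree invokes is Fitch's small-parsimony algorithm, sketched here for the three possible groupings of four sequences. Fitch's method is my choice of counting procedure (the slides do not name one), the counts are unweighted, and the sequences are the four used on the informative-sites slide.

```python
def fitch(tree, states):
    """Return (possible ancestral states, substitutions) for one site.

    tree is a leaf label or a 2-tuple of subtrees; states maps leaf -> base.
    """
    if not isinstance(tree, tuple):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:                      # children agree: no new substitution
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def tree_length(tree, seqs):
    """Total substitutions over all sites for the given tree."""
    n_sites = len(next(iter(seqs.values())))
    return sum(
        fitch(tree, {leaf: s[i] for leaf, s in seqs.items()})[1]
        for i in range(n_sites)
    )

seqs = {1: "GGGGGG", 2: "GGGAGT", 3: "GGATAG", 4: "GATCAT"}
trees = [((1, 2), (3, 4)), ((1, 3), (2, 4)), ((1, 4), (2, 3))]
for t in trees:
    print(t, tree_length(t, seqs))
```

With only four sequences all trees can be enumerated; the counts differ only at the informative positions.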
Branch and Bound methods
Key problem – the number of possible trees grows enormous as the number of species gets large
Branch and bound – a technique that allows large numbers of candidate trees to be rapidly disregarded
Requires a “good guess” at the cost of the best tree
Branch and Bound for TSP
Find a minimum cost round-trip path that visits each intermediate city exactly once
NP-complete
Greedy approach: A,G,E,F,B,D,C,A = 251
[Figure: weighted graph on the cities A–G]
Search all possible paths
[Figure: the weighted graph on cities A–G, shown twice]
All paths
AG (20) AB (46) AC (93)
AGF (88) AGE (55)
AGFB AGFE AGFC
ACB (175) ACD ACF
ACBE (257)
Best estimate: 251
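A sketch of the same idea in code: a greedy tour supplies the initial best estimate, and any partial path whose cost already reaches that estimate is abandoned (as ACBE at 257 is above). The edge weights of the slide's graph are not fully recoverable from the transcript, so the 5-city distance matrix below is a hypothetical example.

```python
def greedy_tour(dist, start=0):
    """Nearest-neighbour tour, used only as an initial upper bound."""
    n = len(dist)
    tour, cost = [start], 0
    while len(tour) < n:
        here = tour[-1]
        nxt = min((c for c in range(n) if c not in tour),
                  key=lambda c: dist[here][c])
        cost += dist[here][nxt]
        tour.append(nxt)
    return tour + [start], cost + dist[tour[-1]][start]

def branch_and_bound(dist, start=0):
    """Exhaustive search that prunes partial paths costing >= the best tour."""
    n = len(dist)
    best_tour, best_cost = greedy_tour(dist, start)

    def extend(path, cost):
        nonlocal best_tour, best_cost
        if cost >= best_cost:       # bound: this branch cannot improve
            return
        if len(path) == n:          # all cities visited: close the loop
            total = cost + dist[path[-1]][start]
            if total < best_cost:
                best_tour, best_cost = path + [start], total
            return
        for city in range(n):
            if city not in path:
                extend(path + [city], cost + dist[path[-1]][city])

    extend([start], 0)
    return best_tour, best_cost

# Hypothetical symmetric distances between 5 cities
dist = [
    [0, 2, 9, 10, 7],
    [2, 0, 6, 4, 3],
    [9, 6, 0, 8, 5],
    [10, 4, 8, 0, 6],
    [7, 3, 5, 6, 0],
]
print("greedy:", greedy_tour(dist))
print("branch and bound:", branch_and_bound(dist))
```

The pruning test is only valid because edge weights are non-negative, so a partial path can never get cheaper as it grows.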
Parsimony – Branch and Bound
Use the UPGMA tree for an initial best estimate of the minimum cost (most parsimonious) tree
Use branch and bound to explore all feasible trees
Replace the best estimate as better trees are found
Choose the most parsimonious
Parsimony example
      Position
Seq   1 2 3 4 5 6
1     G G G G G G
2     G G G A G T
3     G G A T A G
4     G A T C A T
All trees
(1,2) [0] (1,3) [1] (1,4) [1]
Position 5:
Etc.
Part V – Protein Structure
Preliminaries
Lattice Models
Protein Folding Algorithms
Illustrations from: C Branden and J Tooze, Introduction to Protein Structure, 2nd ed. Garland Pub. ISBN 0815302703
The many functions of proteins
Mechanoenzymes: myosin, actin
Rhodopsin: allows vision
Globins: transport oxygen
Antibodies: immune system
Enzymes: pepsin, renin, carboxypeptidase A
Receptors: transmit messages through membranes
Vitellogenin: molecular velcro
• And hundreds of thousands more…
Proteins are chains of amino acids
Polymer – a molecule composed of repeating units
Amino acid composition
Basic Amino Acid Structure:
• The side chain, R, varies for each of the 20 amino acids
[Figure: generic amino acid with the amino group, carboxyl group, and side chain (R) labeled]
The Peptide Bond
Dehydration synthesis
Repeating backbone: N–Cα–C–N–Cα–C
• Convention – start at amino terminus and proceed to carboxy terminus
Peptidyl polymers
A few amino acids in a chain are called a polypeptide. A protein is usually composed of 50 to 400+ amino acids.
Since part of the amino acid is lost during dehydration synthesis, we call the units of a protein amino acid residues.
[Figure labels: carbonyl carbon, amide nitrogen]
Side chain properties
Recall that the electronegativity of carbon is at about the middle of the scale for light elements
• Carbon does not make hydrogen bonds with water easily – hydrophobic
• O and N are generally more likely than C to h-bond to water – hydrophilic
We group the amino acids into three general groups:
• Hydrophobic
• Charged (positive/basic & negative/acidic)
• Polar
The Hydrophobic Amino Acids
Proline severely limits allowable conformations!
The Charged Amino Acids
The Polar Amino Acids
More Polar Amino Acids
And then there’s…
Planarity of the peptide bond
Phi (φ) – the angle of rotation about the N–Cα bond.
Psi (ψ) – the angle of rotation about the Cα–C bond.
The planar bond angles and bond lengths are fixed.
Phi and psi
φ = ψ = 180° is extended conformation
φ: Cα to N–H
ψ: C=O to Cα
[Figure labels: Cα, C=O, N–H]
The Ramachandran Plot
G. N. Ramachandran – first calculations of sterically allowed regions of phi and psi
Note the structural importance of glycine
[Figure panels: Calculated; Observed (non-glycine); Observed (glycine)]
Primary & Secondary Structure
Primary structure = the linear sequence of amino acids comprising a protein: AGVGTVPMTAYGNDIQYYGQVT…
Secondary structure
• Regular patterns of hydrogen bonding in proteins result in two patterns that emerge in nearly every protein structure known: the α-helix and the β-sheet
• The location and direction of these periodic, repeating structures is known as the secondary structure of the protein
The alpha helix
φ ≈ ψ ≈ −60°
Properties of the alpha helix
φ ≈ ψ ≈ −60°
Hydrogen bonds between C=O of residue n and NH of residue n+4
3.6 residues/turn
1.5 Å/residue rise
100°/residue turn
Properties of α-helices
4 – 40+ residues in length
Often amphipathic or “dual-natured”
• Half hydrophobic and half hydrophilic
• Mostly when surface-exposed
If we examine many α-helices, we find trends…
• Helix formers: Ala, Glu, Leu, Met
• Helix breakers: Pro, Gly, Tyr, Ser
The beta strand (& sheet)
φ ≈ −135°, ψ ≈ +135°
Properties of beta sheets
Formed of stretches of 5-10 residues in extended conformation
Pleated – each Cα a bit above or below the previous
Parallel/antiparallel, contiguous/non-contiguous
Parallel and anti-parallel β-sheets
Anti-parallel is slightly energetically favored
[Figure panels: Anti-parallel; Parallel]
Turns and Loops
Secondary structure elements are connected by regions of turns and loops
Turns – short regions of non-α, non-β conformation
Loops – larger stretches with no secondary structure. Often disordered.
• “Random coil”
• Sequences vary much more than secondary structure regions
Levels of Protein Structure
Secondary structure elements combine to form tertiary structure
Quaternary structure occurs in multienzyme complexes
• Many proteins are active only as homodimers, homotetramers, etc.
Disulfide Bonds
Two cysteines in close proximity will form a covalent bond
Disulfide bond, disulfide bridge, or dicysteine bond.
Significantly stabilizes tertiary structure.
Protein Structure Examples
Determining Protein Structure
There are O(100,000) distinct proteins in the human proteome.
3D structures have been determined for 14,000 proteins, from all organisms
• Includes duplicates with different ligands bound, etc.
Coordinates are determined by X-ray crystallography
X-Ray Crystallography
[Figure: protein crystal, ~0.5 mm across]
• The crystal is a mosaic of millions of copies of the protein.
• As much as 70% is solvent (water)!
• May take months (and a “green” thumb) to grow.
X-Ray diffraction
Image is averaged over:
• Space (many copies)
• Time (of the diffraction experiment)
Electron Density Maps
Resolution is dependent on the quality/regularity of the crystal
R-factor is a measure of “leftover” electron density
Solvent fitting
Refinement
The Protein Data Bank
ATOM      1  N   ALA E   1      22.382  47.782 112.975  1.00 24.09      3APR 213
ATOM      2  CA  ALA E   1      22.957  47.648 111.613  1.00 22.40      3APR 214
ATOM      3  C   ALA E   1      23.572  46.251 111.545  1.00 21.32      3APR 215
ATOM      4  O   ALA E   1      23.948  45.688 112.603  1.00 21.54      3APR 216
ATOM      5  CB  ALA E   1      23.932  48.787 111.380  1.00 22.79      3APR 217
ATOM      6  N   GLY E   2      23.656  45.723 110.336  1.00 19.17      3APR 218
ATOM      7  CA  GLY E   2      24.216  44.393 110.087  1.00 17.35      3APR 219
ATOM      8  C   GLY E   2      25.653  44.308 110.579  1.00 16.49      3APR 220
ATOM      9  O   GLY E   2      26.258  45.296 110.994  1.00 15.35      3APR 221
ATOM     10  N   VAL E   3      26.213  43.110 110.521  1.00 16.21      3APR 222
ATOM     11  CA  VAL E   3      27.594  42.879 110.975  1.00 16.02      3APR 223
ATOM     12  C   VAL E   3      28.569  43.613 110.055  1.00 15.69      3APR 224
ATOM     13  O   VAL E   3      28.429  43.444 108.822  1.00 16.43      3APR 225
ATOM     14  CB  VAL E   3      27.834  41.363 110.979  1.00 16.66      3APR 226
ATOM     15  CG1 VAL E   3      29.259  41.013 111.404  1.00 17.35      3APR 227
ATOM     16  CG2 VAL E   3      26.811  40.649 111.850  1.00 17.03      3APR 228
http://www.rcsb.org/pdb/
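A minimal sketch of reading ATOM records like the ones shown above. It splits on whitespace, which is an assumption that only holds for simple records such as these; the PDB format is defined by fixed columns, and fields can run together in real files, so a production parser should slice by column instead. The `Atom` type and function name are my own.

```python
import math
from typing import NamedTuple

class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float

def parse_atom(line):
    """Parse one whitespace-separated ATOM record (simplified)."""
    f = line.split()
    if f[0] != "ATOM":
        raise ValueError("not an ATOM record")
    return Atom(int(f[1]), f[2], f[3], f[4], int(f[5]),
                float(f[6]), float(f[7]), float(f[8]),
                float(f[9]), float(f[10]))

RECORDS = """\
ATOM 1 N ALA E 1 22.382 47.782 112.975 1.00 24.09 3APR 213
ATOM 2 CA ALA E 1 22.957 47.648 111.613 1.00 22.40 3APR 214
ATOM 3 C ALA E 1 23.572 46.251 111.545 1.00 21.32 3APR 215
"""

atoms = [parse_atom(line) for line in RECORDS.splitlines()]
n, ca = atoms[0], atoms[1]
bond = math.dist((n.x, n.y, n.z), (ca.x, ca.y, ca.z))
print(f"N to CA distance in {n.res_name} {n.res_seq}: {bond:.3f} Å")
```

Coordinates are in ångströms, so distances between bonded atoms come out at the familiar 1 to 2 Å scale.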
Views of a protein
[Figure panels: Wireframe; Ball and stick]
Views of a protein
[Figure panels: Spacefill; Cartoon]
CPK colors
Carbon = green, black, or grey
Nitrogen = blue
Oxygen = red
Sulfur = yellow
Hydrogen = white
The Protein Folding Problem
Central question of molecular biology: “Given a particular sequence of amino acid residues (primary structure), what will the tertiary/quaternary structure of the resulting protein be?”
Input: AAVIKYGCAL…
Output: φ1ψ1, φ2ψ2, … = backbone conformation (no side chains yet)
Forces driving protein folding
It is believed that hydrophobic collapse is a key driving force for protein folding
• Hydrophobic core
• Polar surface interacting with solvent
Minimum volume (no cavities)
Disulfide bond formation stabilizes
Hydrogen bonds
Polar and electrostatic interactions
Folding help
Proteins are, in fact, only marginally stable
• Native state is typically only 5 to 10 kcal/mole more stable than the unfolded form
Many proteins help in folding
• Protein disulfide isomerase – catalyzes shuffling of disulfide bonds
• Chaperones – break up aggregates and (in theory) unfold misfolded proteins
The Hydrophobic Core
Hemoglobin A is the protein in red blood cells (erythrocytes) responsible for binding oxygen.
The mutation E6V in the β chain places a hydrophobic Val on the surface of hemoglobin
The resulting “sticky patch” causes hemoglobin S to agglutinate (stick together) and form fibers which deform the red blood cell and do not carry oxygen efficiently
Sickle cell anemia was the first identified molecular disease
Sickle Cell Anemia
Sequestering hydrophobic residues in the protein core protects proteins from hydrophobic agglutination.
Computational Problems in Protein Folding
Two key questions:
• Evaluation – how can we tell a correctly-folded protein from an incorrectly folded protein?
  H-bonds, electrostatics, hydrophobic effect, etc.
  Derive a function, see how well it does on “real” proteins
• Optimization – once we get an evaluation function, can we optimize it?
  Simulated annealing/Monte Carlo
  EC
  Heuristics
  We’ll talk more about these methods later…
Fold Optimization
Simple lattice models (HP-models)
• Two types of residues: hydrophobic and polar
• 2-D or 3-D lattice
• The only force is hydrophobic collapse
• Score = number of HH contacts
Scoring Lattice Models
H/P model scoring: count noncovalent hydrophobic interactions.
Sometimes:
• Penalize for buried polar or surface hydrophobic residues
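A sketch of the basic H/P score on a 2-D square lattice: place the chain using absolute moves (U, D, L, R) and count pairs of H residues that are lattice neighbours but not adjacent in the chain. The two example sequences are hypothetical; the fold string is the absolute-direction example used elsewhere in these slides. The optional burial penalties are not implemented.

```python
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}

def fold_coordinates(moves):
    """Lattice coordinates of each residue, starting at the origin."""
    x, y = 0, 0
    coords = [(x, y)]
    for m in moves:
        dx, dy = MOVES[m]
        x, y = x + dx, y + dy
        coords.append((x, y))
    return coords

def hp_score(sequence, moves):
    """Number of noncovalent H-H contacts for an HP sequence and a fold."""
    coords = fold_coordinates(moves)
    if len(coords) != len(sequence):
        raise ValueError("need exactly one move per bond")
    if len(set(coords)) != len(coords):
        raise ValueError("fold is not self-avoiding (bump)")
    index = {c: i for i, c in enumerate(coords)}
    score = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that each neighbouring pair is seen once
        for dx, dy in ((1, 0), (0, 1)):
            j = index.get((x + dx, y + dy))
            if j is not None and abs(i - j) > 1 and sequence[j] == "H":
                score += 1
    return score

fold = "UURRDLDRRU"                      # 10 moves place 11 residues
print(hp_score("HHHHHHHHHHH", fold))     # every lattice contact counts
print(hp_score("HPPHPPHPPHP", fold))     # only H-H contacts count
```

Residues that are neighbours along the chain are always lattice neighbours too, which is why covalent pairs are excluded from the score.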
What can we do with lattice models?
For smaller polypeptides, exhaustive search can be used
• Looking at the “best” fold, even in such a simple model, can teach us interesting things about the protein folding process
For larger chains, other optimization and search methods must be used
• Greedy, branch and bound
• Evolutionary computing, simulated annealing
• Graph theoretical methods
Learning from Lattice Models
The “hydrophobic zipper” effect:
Ken Dill ~ 1997
Representing a lattice model
Absolute directions
• UURRDLDRRU
Relative directions
• LFRFRRLLFFL
• Advantage: immediate reversals (UD or RL in absolute directions) cannot occur
• Only three directions: LRF
What about bumps? LFRRR
• Bad score
• Use a better representation
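A sketch of decoding relative directions and detecting bumps. The starting position and the initial heading (facing "up" before the first move) are assumptions; any fixed choice gives the same shape up to rotation.

```python
# Turning left or right rotates the current heading by 90 degrees
LEFT = {(0, 1): (-1, 0), (-1, 0): (0, -1), (0, -1): (1, 0), (1, 0): (0, 1)}
RIGHT = {new: old for old, new in LEFT.items()}

def walk_relative(moves, heading=(0, 1)):
    """Coordinates visited by a chain encoded with L (left), R (right), F (forward)."""
    x, y = 0, 0
    coords = [(x, y)]
    for m in moves:
        if m == "L":
            heading = LEFT[heading]
        elif m == "R":
            heading = RIGHT[heading]
        elif m != "F":
            raise ValueError(f"unknown move {m!r}")
        x, y = x + heading[0], y + heading[1]
        coords.append((x, y))
    return coords

def has_bump(moves):
    """True if the chain visits the same lattice site twice."""
    coords = walk_relative(moves)
    return len(set(coords)) < len(coords)

print(has_bump("LFRFRRLLFFL"))  # the relative-direction example above
print(has_bump("LFRRR"))        # three right turns in a row close a square
```

Relative encoding rules out immediate back-tracking, but longer loops such as LFRRR can still revisit a site, so a bump check (or a penalty in the score) is still needed.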
Preference-order representation
Each position has two “preferences”
• If it can’t have either of the two, it will take the “least favorite” path if possible
Example: {LR},{FL},{RL},{FR},{RL},{RL},{FR},{RF}
Can still cause bumps: {LF},{FR},{RL},{FL},{RL},{FL},{RF},{RL},{FL}
More realistic models
Higher resolution lattices (45° lattice, etc.)
Off-lattice models
• Local moves
• Optimization/search methods and φ/ψ representations
  Greedy search
  Branch and bound
  EC, Monte Carlo, simulated annealing, etc.
The Other Half of the Picture
Now that we have a more realistic off-lattice model, we need a better energy function to evaluate a conformation (fold).
Theoretical force field:
$G = G_{\text{van der Waals}} + G_{\text{h-bonds}} + G_{\text{solvent}} + G_{\text{coulomb}}$
Empirical force fields
• Start with a database
• Look at neighboring residues – similar to known protein folds?
Threading: Fold recognition
Given:
• Sequence: IVACIVSTEYDVMKAAR…
• A database of molecular coordinates
Map the sequence onto each fold
Evaluate
• Objective 1: improve scoring function
• Objective 2: folding
Secondary Structure Prediction
AGVGTVPMTAYGNDIQYYGQVT…
A-VGIVPM-AYGQDIQY-GQVT…
AG-GIIP--AYGNELQ--GQVT…
AGVCTVPMTA---ELQYYG--T…

AGVGTVPMTAYGNDIQYYGQVT…
----hhhHHHHHHhhh--eeEE…
Secondary Structure Prediction
Easier than folding
• Current algorithms can predict secondary structure with 70-80% accuracy
Chou, P.Y. & Fasman, G.D. (1974). Biochemistry, 13, 211-222.
• Based on frequencies of occurrence of residues in helices and sheets
PhD – Neural network based
• Uses a multiple sequence alignment
• Rost & Sander, Proteins, 1994, 19, 55-72
Chou-Fasman Parameters
Name           Abbrv  P(a)  P(b)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Alanine        A      142   83    66       0.06   0.076   0.035   0.058
Arginine       R      98    93    95       0.07   0.106   0.099   0.085
Aspartic Acid  D      101   54    146      0.147  0.11    0.179   0.081
Asparagine     N      67    89    156      0.161  0.083   0.191   0.091
Cysteine       C      70    119   119      0.149  0.05    0.117   0.128
Glutamic Acid  E      151   37    74       0.056  0.06    0.077   0.064
Glutamine      Q      111   110   98       0.074  0.098   0.037   0.098
Glycine        G      57    75    156      0.102  0.085   0.19    0.152
Histidine      H      100   87    95       0.14   0.047   0.093   0.054
Isoleucine     I      108   160   47       0.043  0.034   0.013   0.056
Leucine        L      121   130   59       0.061  0.025   0.036   0.07
Lysine         K      114   74    101      0.055  0.115   0.072   0.095
Methionine     M      145   105   60       0.068  0.082   0.014   0.055
Phenylalanine  F      113   138   60       0.059  0.041   0.065   0.065
Proline        P      57    55    152      0.102  0.301   0.034   0.068
Serine         S      77    75    143      0.12   0.139   0.125   0.106
Threonine      T      83    119   96       0.086  0.108   0.065   0.079
Tryptophan     W      108   137   96       0.077  0.013   0.064   0.167
Tyrosine       Y      69    147   114      0.082  0.065   0.114   0.125
Valine         V      106   170   50       0.062  0.048   0.028   0.053
Chou-Fasman Algorithm
Identify α-helices
• 4 out of 6 contiguous amino acids that have P(a) > 100
• Extend the region until 4 amino acids with P(a) < 100 found
• Compute ΣP(a) and ΣP(b); if the region is >5 residues and ΣP(a) > ΣP(b), identify as a helix
Repeat for β-sheets [use P(b)]
If an α and a β region overlap, the overlapping region is predicted according to ΣP(a) and ΣP(b)
Chou-Fasman, cont’d
Identify hairpin turns:
• P(t) = f(i) of the residue × f(i+1) of the next residue × f(i+2) of the following residue × f(i+3) of the residue at position (i+3)
• Predict a hairpin turn starting at positions where:
  P(t) > 0.000075
  The average P(turn) for the four residues > 100
  ΣP(a) < ΣP(turn) > ΣP(b) for the four residues
Accuracy 60-65%
Chou-Fasman Example
CAENKLDHVRGPTCILFMTWYNDGP
CAENKL – Potential helix (!C and !N)
Residues with P(a) < 100: RNCGPSTY
• Extend: When we reach RGPT, we must stop
• CAENKLDHV: ΣP(a) = 972, ΣP(b) = 843
• Declare alpha helix
Identifying a hairpin turn
• VRGP: P(t) = 0.000085
• Average P(turn) = 113.25
  Avg P(a) = 79.5, Avg P(b) = 98.25
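The arithmetic in this example can be checked with a short sketch. The parameter table below is copied from the Chou-Fasman Parameters slide, restricted to the residues that the example needs; the helper names are my own.

```python
# residue: (P(a), P(b), P(turn), f(i), f(i+1), f(i+2), f(i+3))
PARAMS = {
    "A": (142, 83, 66, 0.06, 0.076, 0.035, 0.058),
    "R": (98, 93, 95, 0.07, 0.106, 0.099, 0.085),
    "D": (101, 54, 146, 0.147, 0.11, 0.179, 0.081),
    "N": (67, 89, 156, 0.161, 0.083, 0.191, 0.091),
    "C": (70, 119, 119, 0.149, 0.05, 0.117, 0.128),
    "E": (151, 37, 74, 0.056, 0.06, 0.077, 0.064),
    "G": (57, 75, 156, 0.102, 0.085, 0.19, 0.152),
    "H": (100, 87, 95, 0.14, 0.047, 0.093, 0.054),
    "L": (121, 130, 59, 0.061, 0.025, 0.036, 0.07),
    "K": (114, 74, 101, 0.055, 0.115, 0.072, 0.095),
    "P": (57, 55, 152, 0.102, 0.301, 0.034, 0.068),
    "V": (106, 170, 50, 0.062, 0.048, 0.028, 0.053),
}
P_A, P_B, P_TURN, F_I = 0, 1, 2, 3  # column indices

def total(seq, col):
    """Sum of one propensity column over a stretch of residues."""
    return sum(PARAMS[r][col] for r in seq)

def mean(seq, col):
    return total(seq, col) / len(seq)

def turn_probability(tetra):
    """P(t) = f(i) * f(i+1) * f(i+2) * f(i+3) for four consecutive residues."""
    p = 1.0
    for offset, residue in enumerate(tetra):
        p *= PARAMS[residue][F_I + offset]
    return p

def is_hairpin_turn(tetra):
    p_a, p_b, p_turn = (mean(tetra, c) for c in (P_A, P_B, P_TURN))
    return turn_probability(tetra) > 0.000075 and p_turn > 100 and p_a < p_turn > p_b

print(total("CAENKLDHV", P_A), total("CAENKLDHV", P_B))
print(f"{turn_probability('VRGP'):.6f}", mean("VRGP", P_TURN), is_hairpin_turn("VRGP"))
```

The helix call follows from the first line of output (the P(a) total exceeds the P(b) total), and the turn call from all three turn conditions holding at once.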
Additional Information – Aligning protein sequences
PAM matrices
BLOSUM matrices
Sequence Alignments Revisited
Scoring nucleotide sequence alignments was easier
• Match score
• Possibly different scores for transitions and transversions
For amino acids, there are many more possible substitutions
How do we score which substitutions are highly penalized and which are moderately penalized?
• Physical and chemical characteristics
• Empirical methods
Scoring Mismatches
Physical and chemical characteristics
• V → I – Both small, both hydrophobic, conservative substitution, small penalty
• V → K – Small → large, hydrophobic → charged, large penalty
• Requires some expert knowledge and judgement
Empirical methods
• How often does the substitution V → I occur in proteins that are known to be related?
  Scoring matrices: PAM and BLOSUM
PAM matrices
PAM = “Point Accepted Mutation” – interested only in mutations that have been “accepted” by natural selection
Starts with a multiple sequence alignment of very similar (>85% identity) proteins. Assumed to be homologous
Compute the relative mutability, mi, of each amino acid
• e.g. mA = how many times was alanine substituted with anything else?
Relative mutability
ACGCTAFKI
GCGCTAFKI
ACGCTAFKL
GCGCTGFKI
GCGCTLFKI
ASGCTAFKL
ACACTAFKL
Across all pairs of sequences, there are 28 A → X substitutions
There are 10 ALA residues, so mA = 2.8
PAM Matrices, cont’d
Construct a phylogenetic tree for the sequences in the alignment
Calculate substitution frequencies FX,X
Substitutions may have occurred either way, so A → G also counts as G → A.
ACGCTAFKI
  A→G: GCGCTAFKI
    A→G: GCGCTGFKI
    A→L: GCGCTLFKI
  I→L: ACGCTAFKL
    C→S: ASGCTAFKL
    G→A: ACACTAFKL
FG,A = 3
Mutation Probabilities
Mi,j represents the probability of a j → i substitution.
$M_{ij} = \frac{m_j F_{ij}}{\sum_i F_{ij}}$
$M_{G,A} = \frac{2.8 \times 3}{4} = 2.1$
[Figure: the phylogenetic tree of the seven sequences, with substitutions A→G, I→L, A→G, A→L, C→S, G→A]
The PAM matrix
The entries, Ri,j, are the Mi,j values divided by the frequency of occurrence, fi, of residue i.
fG = 10 GLY / 63 residues = 0.1587
RG,A = log(2.1/0.1587) = log(13.233) = 1.122
The log is taken so that we can add, rather than multiply, entries to get compound probabilities.
Log-odds matrix
Diagonal entries are 1 – mj
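A sketch that recomputes the quantities in this example from the seven-sequence alignment: the relative mutability of alanine (28 substitutions over 10 residues), the glycine frequency (10 of 63 residues), and then M and R for the G, A entry. The substitution counts F(G,A) = 3 and ΣF(i,A) = 4 are read off the tree rather than computed, and the log is taken base 10; both are assumptions about how the slide values were obtained.

```python
from itertools import combinations
from math import log10

alignment = [
    "ACGCTAFKI", "GCGCTAFKI", "ACGCTAFKL", "GCGCTGFKI",
    "GCGCTLFKI", "ASGCTAFKL", "ACACTAFKL",
]

def relative_mutability(aln, residue):
    """Pairwise substitutions involving `residue`, per occurrence of it."""
    substitutions = 0
    for s, t in combinations(aln, 2):
        substitutions += sum(
            1 for a, b in zip(s, t) if a != b and residue in (a, b)
        )
    occurrences = sum(seq.count(residue) for seq in aln)
    return substitutions / occurrences

def frequency(aln, residue):
    """Fraction of all residues in the alignment that are `residue`."""
    return sum(seq.count(residue) for seq in aln) / sum(len(seq) for seq in aln)

def mutation_probability(m_j, f_ij, f_column_total):
    """M(i,j) = m_j * F(i,j) / sum over i of F(i,j)."""
    return m_j * f_ij / f_column_total

def log_odds(m_ij, f_i):
    """R(i,j) = log10(M(i,j) / f_i)."""
    return log10(m_ij / f_i)

m_A = relative_mutability(alignment, "A")
f_G = frequency(alignment, "G")
M_GA = mutation_probability(m_A, f_ij=3, f_column_total=4)
R_GA = log_odds(M_GA, f_G)
print(f"m_A = {m_A:.2f}, f_G = {f_G:.4f}, M_GA = {M_GA:.3f}, R_GA = {R_GA:.3f}")
```

Working in log space means the score of an alignment is the sum of its per-position entries, which is what makes these matrices convenient for dynamic programming.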
Interpretation of PAM matrices
PAM-1 – one substitution per 100 residues (a PAM unit of time)
Multiply them together to get PAM-100, etc.
“Suppose I start with a given polypeptide sequence M at time t, and observe the evolutionary changes in the sequence until 1% of all amino acid residues have undergone substitutions at time t+n. Let the new sequence at time t+n be called M’. What is the probability that a residue of type j in M will be replaced by i in M’?”
PAM matrix considerations
If Mi,j is very small, we may not have a large enough sample to estimate the real probability. When we multiply the PAM matrices many times, the error is magnified.
PAM-1 – similar sequences, PAM-1000 very dissimilar sequences
BLOSUM matrix
Starts by clustering proteins by similarity
Avoids problems with small probabilities by using averages over clusters
Numbering works opposite
• BLOSUM-62 is appropriate for sequences of about 62% identity, while BLOSUM-80 is appropriate for more similar sequences.
Other topics?
Tools and languages
Forensic DNA
Microarray analysis