Lesson 3: Aligning sequences and searching databases
Date posted: 19-Dec-2015
1
Lesson 3
Aligning sequences and searching databases
2
Homology
Similarity between objects due to a common ancestry
3
Sequence homology
Similarity between sequences that results from a common ancestor
VLSPAVKWAKVGAHAAGHG
VLSEAVLWAKVEADVAGHG
Basic assumption: sequence homology → similar structure/function
4
Sequence alignment
Alignment: comparing two (pairwise) or more (multiple) sequences, searching for a series of identical or similar characters in the sequences.
5
Homology
Ortholog – homolog with similar function (via speciation)
Paralog – homolog which arose from gene duplication
Orthologs – 2 homologs from different species
Paralogs – 2 homologs within the same species
[Figure: two gene trees for a gene G; in the second tree G appears as the duplicated pair G1, G2]
6
How close?
Rule of thumb:
Proteins are homologous if over 25% identical (length > 100)
DNA sequences are homologous if over 70% identical
7
Twilight zone
< 20% identity in proteins: the sequences may be homologous, and may not be.
(Note that 5% identity will be obtained completely by chance!)
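The identity thresholds above can be checked with a few lines of code. This is a minimal sketch, not a standard tool: the function name `percent_identity` is made up here, and it assumes identity is counted over ungapped columns of two already-aligned sequences (other definitions divide by the full alignment length). The two peptides are the example pair from the sequence homology slide.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of ungapped aligned columns in which both sequences agree."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Assumption: columns containing a gap ('-') are ignored.
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    identical = sum(x == y for x, y in columns)
    return 100.0 * identical / len(columns)


seq1 = "VLSPAVKWAKVGAHAAGHG"
seq2 = "VLSEAVLWAKVEADVAGHG"
print(round(percent_identity(seq1, seq2), 1))  # 14 of 19 positions agree -> 73.7
```

These peptides are far shorter than the length > 100 in the rule of thumb, so the number only illustrates the calculation.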
8
Why sequence alignment?
Predict characteristics of a protein: use the structure/function of known proteins to predict the structure/function of an unknown protein
9
Sequence modifications
Sequences change in the course of evolution due to random mutations
Three types of mutations:
1. Insertion – an insertion of a nucleotide or several nucleotides into the sequence. AAGA → AAGTA
2. Deletion – a deletion of a nucleotide (or more) from the sequence. AAGA → AGA
3. Substitution – a replacement of a nucleotide by another. AAGA → AACA
Insertion or deletion? → Indel
10
Local vs. Global
Global alignment – finds the best alignment across the entire two sequences.
Local alignment – finds regions of similarity in parts of the sequences.
Global alignment (forces alignment in regions which differ):
ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ
Local alignment (returns only regions of good alignment):
ADLG   CDRYFQ
||||   |||| |
ADLG   CDRYYQ
11
When global and when local?
12
Global alignment
PTK2 (protein tyrosine kinase 2) of human and rhesus monkey
13
Protein tyrosine kinase domain
14
Protein tyrosine kinase domain
Human PTK2 and leukocyte tyrosine kinase
Both function as tyrosine kinases, in completely different contexts
Ancient duplication
15
Global alignment of PTK and LTK
16
Local alignment of PTK and LTK
17
Pairwise alignment
AAGCTGAATTCGAA
AGGCTCATTTCTGA
One possible alignment:
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
18
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
This alignment includes:
2 mismatches
4 indels (gaps)
10 perfect matches
19
Choosing an alignment:
Many different alignments are possible:
AAGCTGAATTCGAA
AGGCTCATTTCTGA
Alignment 1:
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
Alignment 2:
A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
Which alignment is better?
20
Alignment scoring – scoring of sequence similarity:
Assumes independence between positions
– Each position is considered separately
Scores each position
– Positive if identical (match)
– Negative if different (mismatch) or gap (indel)
Total score = sum of position scores
– Can be positive or negative
21
Example – naïve scoring system:
Perfect match: +1
Mismatch: -2
Indel (gap): -1
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
Score = (+1)×10 + (-2)×2 + (-1)×4 = 2
A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
Score = (+1)×9 + (-2)×2 + (-1)×6 = -1
Higher score → better alignment
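The naïve scoring system can be written out directly. The sketch below is only an illustration: the function name `score_alignment` is invented here. It counts matches, mismatches and indels column by column and sums the position scores, using the +1 / -2 / -1 values from the slide as defaults.

```python
def score_alignment(row1, row2, match=1, mismatch=-2, gap=-1):
    """Return (score, matches, mismatches, indels) for two aligned rows."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    matches = mismatches = indels = 0
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            indels += 1          # a gap in either row is an indel
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    score = match * matches + mismatch * mismatches + gap * indels
    return score, matches, mismatches, indels


first = score_alignment("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-")
second = score_alignment("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-")
print(first)   # (2, 10, 2, 4)
print(second)  # (-1, 9, 2, 6)
```

Because each position is scored independently, changing the three parameters is enough to try a different scoring system on the same alignments.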
22
Scoring system:
The choice of +1, -2, and -1 scores is quite arbitrary
Different scoring systems → different alignments
Scoring systems implicitly represent a particular theory of evolution
– Some mismatches are more plausible
  Transition vs. transversion
  Lys → Arg ≠ Lys → Cys
– Gap extension ≠ gap opening
23
Scoring matrix
Representing the scoring system as a table or matrix n × n (n is the number of letters the alphabet contains: n = 4 for nucleotides, n = 20 for amino acids)
Symmetric
     A    G    C    T
A    2
G   -6    2
C   -6   -6    2
T   -6   -6   -6    2
24
DNA scoring matrices
Uniform substitutions between all nucleotides:
From \ To
     A    G    C    T
A    2
G   -6    2
C   -6   -6    2
T   -6   -6   -6    2
Match: 2 (diagonal), mismatch: -6 (off-diagonal)
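A scoring matrix like this one maps naturally onto a dictionary keyed by pairs of letters. This is a small sketch with made-up helper names (`uniform_dna_matrix`, `ungapped_score`). It builds the full symmetric 4 × 4 table from the match (2) and mismatch (-6) values shown above and scores two equal-length sequences without gaps.

```python
NUCLEOTIDES = "AGCT"


def uniform_dna_matrix(match=2, mismatch=-6):
    """Full symmetric matrix: every substitution gets the same penalty."""
    return {(a, b): (match if a == b else mismatch)
            for a in NUCLEOTIDES for b in NUCLEOTIDES}


def ungapped_score(s1, s2, matrix):
    """Sum of per-position scores for two equal-length sequences."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have equal length")
    return sum(matrix[(a, b)] for a, b in zip(s1, s2))


matrix = uniform_dna_matrix()
print(ungapped_score("AAGA", "AACA", matrix))  # 2 + 2 - 6 + 2 = 0
```

A matrix that distinguishes transitions from transversions would have the same shape, with different off-diagonal values.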
25
DNA scoring matrices
Can take into account biological phenomena such as:
Transition-transversion
26
Amino-acid scoring matrices
Take into account physico-chemical properties
27
Amino-acid substitution matrices
Actual substitutions:
– Based on empirical data
– Commonly used by many bioinformatics programs
– PAM & BLOSUM
28
Protein matrices – actual substitutions
The idea: given an alignment of a large number of closely related sequences, we can score the relation between amino acids based on how frequently they substitute each other
M G Y D E
M G Y D E
M G Y E E
M G Y D E
M G Y Q E
M G Y D E
M G Y E E
M G Y E E
In the fourth column, E and D are found in 7 of the 8 sequences
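The counting step behind this idea is easy to reproduce. The sketch below tallies the residues in one alignment column for the eight example sequences. It covers only the counting, not the full PAM or BLOSUM derivation, and `column_counts` is a name chosen for illustration.

```python
from collections import Counter

sequences = [
    "MGYDE", "MGYDE", "MGYEE", "MGYDE",
    "MGYQE", "MGYDE", "MGYEE", "MGYEE",
]


def column_counts(seqs, index):
    """Count how often each residue appears in one alignment column."""
    return Counter(seq[index] for seq in seqs)


fourth = column_counts(sequences, 3)  # zero-based index 3 = fourth column
print(fourth["D"], fourth["E"], fourth["Q"])  # 4 3 1
print(fourth["D"] + fourth["E"], "of", len(sequences))  # 7 of 8
```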
29
PAM Matrix – Point Accepted Mutations
Based on a database of 1,572 changes in 71 groups of closely related proteins (85% identity)
– Alignment was easy
Counted the number of substitutions per amino-acid pair (20 × 20)
Found that common substitutions occurred between chemically similar amino acids
30
PAM Matrices
Family of matrices: PAM 80, PAM 120, PAM 250
The number on the PAM matrix represents evolutionary distance
Larger numbers are for larger distances
31
Example: PAM 250
Similar amino acids have a greater score
32
PAM – limitations
Based only on a single, limited dataset
Examines proteins with few differences (85% identity)
Based mainly on small globular proteins, so the matrix is biased
33
BLOSUM
Henikoff and Henikoff (1992) derived a set of matrices based on a much larger dataset
BLOSUM observes significantly more replacements than PAM, even for infrequent pairs
34
BLOSUM: Blocks Substitution Matrix
Based on the BLOCKS database
– ~2000 blocks from 500 families of related proteins
– Families of proteins with identical function
Blocks are short conserved patterns of 3-60 aa without gaps
AABCDA----BBCDA
DABCDA----BBCBB
BBBCDA-AA-BCCAA
AAACDA-A--CBCDB
CCBADA---DBBDCC
AAACAA----BBCCC
35
BLOSUM
Each block represents a sequence alignment with a different identity percentage
For each block, the amino-acid substitution rates were calculated to create the BLOSUM matrix
36
BLOSUM Matrices
BLOSUMn is based on sequences that share at least n percent identity
BLOSUM62 represents closer sequences than BLOSUM45
37
Example: BLOSUM62
Derived from blocks where the sequences share at least 62% identity
38
PAM vs. BLOSUM
Approximate equivalences, from closer sequences (top) to more distant sequences (bottom):
PAM100 ≈ BLOSUM90
PAM120 ≈ BLOSUM80
PAM160 ≈ BLOSUM60
PAM200 ≈ BLOSUM52
PAM250 ≈ BLOSUM45
39
Scoring system = substitution matrix + gap penalty
40
Gap penalty
We penalize gaps
Scoring for gap opening & gap extension:
– Gap-extension penalty < gap-open penalty
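One common way to express "extension is cheaper than opening" is an affine gap penalty. The sketch below is an assumption-laden illustration: the function name and the values 10 and 1 are hypothetical, and tools differ on whether the first gap position is charged the opening penalty alone or opening plus extension.

```python
def affine_gap_penalty(length, gap_open=10, gap_extend=1):
    """Penalty for one gap of `length` columns.

    Assumed convention: the first column costs gap_open,
    each further column costs gap_extend.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


print(affine_gap_penalty(3))      # one gap of length 3 -> 12
print(3 * affine_gap_penalty(1))  # three separate gaps of length 1 -> 30
```

With these values a single long indel is penalized much less than several scattered ones, which fits the idea that one mutation event can insert or delete several residues at once.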
41
Optimal alignment algorithms
Needleman-Wunsch (global)
Smith-Waterman (local)
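Both algorithms fill a dynamic-programming table cell by cell. The following is a minimal score-only sketch (no traceback), assuming a linear gap penalty and the naïve +1 / -2 / -1 scores rather than a substitution matrix with affine gaps. The function names are chosen here for illustration.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-2, gap=-1):
    """Best global alignment score; time O(len(a) * len(b))."""
    prev = [j * gap for j in range(len(b) + 1)]   # first row: all gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]                          # first column: all gaps
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,    # align a[i-1] with b[j-1]
                            prev[j] + gap,        # gap in b
                            curr[j - 1] + gap))   # gap in a
        prev = curr
    return prev[-1]


def smith_waterman_score(a, b, match=1, mismatch=-2, gap=-1):
    """Best local alignment score; cells never drop below zero."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


s1, s2 = "AAGCTGAATTCGAA", "AGGCTCATTTCTGA"
print(needleman_wunsch_score(s1, s2))
print(smith_waterman_score(s1, s2))
```

The only differences between the two functions are the zero floor and taking the maximum over all cells instead of the last cell. That is why the local score is never lower than the global one under the same scoring.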
42
Alignment Search Space
The “search space” (number of possible gapped alignments) for optimally aligning two sequences is exponential in the length of the sequences (n).
If n = 100, there are 100^100 = 10^200 different alignments!
Average protein length is about n = 250!
43
Searching databases
44
Searching a sequence database
Using a sequence as a query to find homologous sequences in a sequence database
45
Query sequence: DNA or protein?
For coding sequences, we can use the DNA sequence or the protein sequence to search for similar sequences.
Which is preferable?
46
Protein is better!
Selection (and hence conservation) works (mostly) on the protein level:
CTT TCA = Leu-Ser
TTG AGT = Leu-Ser
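The point can be made concrete by comparing the two DNA strings and their translations. This sketch reads the slide's example as CTTTCA and TTGAGT and uses a deliberately tiny codon table holding only the four codons involved (a full table has 64 entries). The helper names are illustrative.

```python
# Only the four codons from the example; not a complete genetic code.
CODON_TABLE = {"CTT": "Leu", "TTG": "Leu", "TCA": "Ser", "AGT": "Ser"}


def translate(dna):
    """Translate a DNA string codon by codon, ignoring a trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return [CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3)]


def identical_positions(a, b):
    """Number of positions at which two equal-length strings agree."""
    return sum(x == y for x, y in zip(a, b))


dna1, dna2 = "CTTTCA", "TTGAGT"
print(translate(dna1), translate(dna2))  # ['Leu', 'Ser'] ['Leu', 'Ser']
print(identical_positions(dna1, dna2), "of", len(dna1))  # 1 of 6
```

The proteins are identical while the DNA agrees at only one of six positions, so a protein-level search would detect a relationship that a DNA-level search would likely miss.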
47
Query type
Nucleotides: a four-letter alphabet
Amino acids: a twenty-letter alphabet
• Two random DNA sequences will, on average, have 25% identity
• Two random protein sequences will, on average, have 5% identity
48
Conclusions
Using the amino-acid sequence is preferable for homology search
Why use a nucleotide sequence after all?
– No ORF found, e.g. newly sequenced genome
– No similar protein sequences were found
– Specific DNA databases are available (EST)
49
Some terminology
Query sequence – the sequence with which we are searching
Hit – a sequence found in the database, suspected as homologous
50
How do we search a database?
Assume we perform pairwise alignment of the query against all the sequences in the database
Exact pairwise alignment is O(mn) ≈ O(n^2) (m – length of sequence 1, n – length of sequence 2)
51
How much time will it take?
O(n^2) computations per pairwise alignment.
Assume n = 200, so we have 40,000 computations per alignment
Size of database: ~60 million entries
2.4 × 10^12 computations for each sequence search we perform!
Assume each computation takes 10^-6 seconds
2,400,000 seconds ≈ 667 hours (about 28 days) for each sequence search
150,000 searches (at least!!) are performed per day
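The estimate can be reproduced in a few lines. The figures are the slide's own assumptions (n = 200, ~60 million entries, 10^-6 seconds per computation). Only the variable names are invented here.

```python
n = 200                          # assumed length of query and database sequences
db_entries = 60_000_000          # ~60 million database entries
seconds_per_computation = 1e-6   # assumed cost of one table cell

per_alignment = n * n                       # O(n^2) cells per pairwise alignment
per_search = per_alignment * db_entries     # one query against the whole database
seconds = per_search * seconds_per_computation
hours = seconds / 3600

print(per_alignment)        # 40000
print(per_search)           # 2400000000000, i.e. 2.4 x 10^12
print(round(seconds))       # 2400000
print(round(hours))         # 667
```

Under these assumptions one exhaustive search takes about 2.4 × 10^6 seconds (roughly 667 hours), far too long for a service handling on the order of 150,000 searches per day.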
52
Conclusion
Running the exact pairwise alignment algorithm between the query and all DB entries is too slow
53
Heuristic
Definition: a heuristic is a method for solving a problem that does not guarantee an exact solution (but gives one that is not too bad) and reduces the time complexity compared with the exact solution
54
BLAST
BLAST – Basic Local Alignment Search Tool
A heuristic for searching a database for similar sequences
55
DNA or Protein
All types of searches are possible
Query: DNA or protein
Database: DNA or protein
blastn – nuc vs. nuc
blastp – prot vs. prot
blastx – translated query vs. protein database
tblastn – protein vs. translated nuc. DB
tblastx – translated query vs. translated database
Translated databases: trEMBL, GenPept
56
BLAST – underlying hypothesis
The underlying hypothesis: when two sequences are similar, there are short ungapped regions of high similarity between them
The heuristic:
1. Discard irrelevant sequences
2. Perform exact local alignment with the remaining sequences
57
How do we discard irrelevant sequences quickly?
Divide the database into words of length w (default: w = 3 for protein and w = 7 for DNA)
Save the words in a look-up table that can be searched quickly
WTDFGYPAILKGGTAC
WTD TDF DFG FGY GYP …
BLASTBLAST:: discarding sequences discarding sequences
When the user gives a query When the user gives a query sequence, divide it also into wordssequence, divide it also into words
Search the Search the databasedatabase for consecutive for consecutive neighbor wordsneighbor words
59
Neighbor words
Neighbor words are defined according to a scoring matrix (e.g., BLOSUM62 for proteins) with a certain cutoff level
GFB
GFC (20)
GPC (11)
WAC (5)
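Generating neighbor words amounts to scoring every candidate word against the query word and keeping those at or above the cutoff. The sketch below uses a toy scoring function (+5 for identity, -1 otherwise) as a stand-in for a real matrix such as BLOSUM62, so its scores do not correspond to the numbers on the slide. All names are illustrative.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(a, b):
    # Hypothetical stand-in for a substitution matrix lookup.
    return 5 if a == b else -1


def neighbor_words(word, threshold, score=toy_score, alphabet=AMINO_ACIDS):
    """All words of the same length scoring >= threshold against `word`."""
    neighbors = {}
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(score(x, y) for x, y in zip(word, candidate))
        if total >= threshold:
            neighbors["".join(candidate)] = total
    return neighbors


hits = neighbor_words("GFC", threshold=9)
print(hits["GFC"])  # 15: the word itself
print(len(hits))    # 58: the word plus 3 x 19 single-substitution variants
```

With a real matrix the cutoff admits conservative substitutions and excludes unlikely ones, instead of treating all mismatches alike.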
60
Search for consecutive words
[Figure: dot plot of the query against a database record; each dot marks a neighbor word hit]
Look for a seed: hits on the same diagonal which can be connected
At least 2 hits on the same diagonal, with a distance smaller than a predetermined cutoff
This is the filtering stage: many unrelated hits are filtered out, saving lots of time!
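The "two hits on the same diagonal" test can be sketched as follows. A hit at query position q and database position d lies on diagonal d - q, so hits are grouped by that difference and neighboring hits are compared. The function name, the example coordinates and the cutoff of 10 are all made up for illustration.

```python
from collections import defaultdict


def find_seeds(hits, max_distance):
    """Return pairs of adjacent hits on one diagonal closer than max_distance.

    hits: iterable of (query_pos, db_pos) word hits.
    """
    by_diagonal = defaultdict(list)
    for q, d in hits:
        by_diagonal[d - q].append((q, d))
    seeds = []
    for diagonal_hits in by_diagonal.values():
        diagonal_hits.sort()
        for first, second in zip(diagonal_hits, diagonal_hits[1:]):
            if second[0] - first[0] < max_distance:
                seeds.append((first, second))
    return seeds


hits = [(2, 10), (7, 15), (30, 38), (5, 40)]
print(find_seeds(hits, max_distance=10))  # [((2, 10), (7, 15))]
```

The hit at (30, 38) is on the same diagonal but too far away, and (5, 40) is alone on its diagonal, so neither contributes a seed.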
62
Try to extend the alignment
Stop extending when the score of the alignment drops X beneath the maximal score obtained so far
Discard segments with score < S
ASKIOPLLWLAASFLHNEQAPALSDAN
JWQEOPLWPLAASOIHLFACNSIFYAS
Score=15 Score=17 Score=14
X=4
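The stopping rule can be sketched as an ungapped extension in one direction. This is a simplified illustration and not BLAST's implementation: it uses hypothetical +1 / -1 scores instead of a substitution matrix, extends only to the right, and the function name is invented.

```python
def extend_right(query, subject, q_start, s_start, x_drop,
                 match=1, mismatch=-1):
    """Extend without gaps; stop once the score is x_drop below the best.

    Returns (best_score, length_of_best_extension).
    """
    score = best = 0
    length = best_len = 0
    while q_start + length < len(query) and s_start + length < len(subject):
        same = query[q_start + length] == subject[s_start + length]
        score += match if same else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score >= x_drop:
            break                      # dropped X beneath the maximum
    return best, best_len


print(extend_right("AAAATTTTTT", "AAAACCCCCC", 0, 0, x_drop=3))  # (4, 4)
```

The extension keeps the best-scoring prefix (four matches) and gives up after three consecutive mismatches. A segment whose best score is below S would then be discarded.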
63
The result – local alignment
The result of BLAST will be a series of local alignments between the query and the different hits found
64
E-value
The expected number of alignments with a score ≥ Y that we would find by chance when searching a random sequence against a random database
Theoretically, we could trust any result with an E-value ≤ 1
In practice, BLAST uses estimations:
E-values of 10^-4 and lower indicate a significant homology.
E-values between 10^-4 and 10^-2 should be checked (similar domains, maybe non-homologous).
E-values between 10^-2 and 1 do not indicate a good homology.
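These rules of thumb translate into a small classifier. The thresholds are the ones on the slide. The function name and the wording of the labels are invented here, and treating E-values above 1 the same as the last category is an assumption.

```python
def interpret_e_value(e_value):
    """Map a BLAST E-value to the rule-of-thumb categories above."""
    if e_value <= 1e-4:
        return "significant homology"
    if e_value <= 1e-2:
        return "check manually (similar domains, maybe non-homologous)"
    # Assumption: anything above 10^-2, including values > 1, lands here.
    return "not a good indication of homology"


for e in (1e-30, 1e-3, 0.5):
    print(e, "->", interpret_e_value(e))
```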
65
Filtering low complexity
Low complexity regions: e.g., proline-rich areas (in proteins), Alu repeats (in DNA)
Regions of low complexity generate a high alignment score, BUT this does not indicate homology
66
Solution
In BLAST there is an option to mask low-complexity regions in the query sequence (such regions are represented as XXXXX in the query)
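To show what masking does to a query, here is a toy filter that replaces long runs of a single residue with X. It is only an illustration of the effect. BLAST itself relies on dedicated low-complexity filters (SEG for proteins, DUST for nucleotides) rather than a run-length rule. The function name and the `min_run` threshold are hypothetical.

```python
import re


def mask_runs(sequence, min_run=5, mask_char="X"):
    """Toy filter: mask any run of >= min_run identical residues."""
    # (.)\1{k,} matches one character followed by at least k repeats of it.
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), sequence)


query = "MKT" + "P" * 7 + "AGLV"   # made-up peptide with a proline run
print(mask_runs(query))            # the seven prolines become X
```

Masked positions no longer produce word hits, so the proline run cannot inflate the alignment score against unrelated proline-rich proteins.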