bio1 sequences handout - Göteborgs universitetbio.lundberg.gu.se/courses/bio1/seq.pdf · 2011. 8....
SEQUENCE ANALYSIS
Sequences - Problem of sequence alignment - interaction between molecular biology / computer science / statistics

* What are the biological problems?
* Algorithm (dynamic programming)
* 'Simple' implementation / source code / compiling
* Statistics and probability theory of alignments
* Common implementations in molecular biology software packages
* Results of biological significance
Sequence analysis - Where and why
Sequencing projects, assembly of sequence data
Identification of functional elements in sequences
Sequence comparison
Classification of proteins
Comparative genomics
RNA structure prediction
Protein structure prediction
Evolutionary history
Overview of methods
1. Comparison

a) Identity
Examples:
• finding restriction sites (GAATTC)
• pattern matches ((A,G)x4GK[S,T])
(SeqWeb package: FindPatterns)

b) Comparing non-identical sequences

Alignments
Pairwise comparison
   global alignment (SW: Gap)
   local alignment (Smith-Waterman) (SW: Bestfit)
   Fasta (SW: Fasta, Tfasta)
   Blast (SW: NetBlast)
      blastn compares a DNA sequence to a DNA database
      blastp compares a protein sequence to a protein database
      tblastn compares a protein sequence to all possible translation products of a DNA database

Multiple sequence alignment
   Clustalw (SW: Pileup)

Methods in phylogeny
• Distances (SW: Growtree)
• Parsimony: the tree that corresponds to the smallest number of changes is preferred
• Maximum likelihood

2. Analyzing for properties other than a simple linear sequence of letters.
Examples:
• statistical composition of residues
• profile analysis, Position Specific Iterated BLAST (PSI-BLAST) (www.ncbi.nlm.nih.gov/BLAST)
• HMMs
• Prediction of higher order structure
   1. Protein secondary structure (alpha, beta)
   2. RNA (folding by base pairing within the molecule)

3. Simple transformation / extraction
• Translation, DNA -> protein (SW: Translate, Map)
• Reverse translation
• Splicing
Identity. Pattern matching
Pattern matching is used for finding short sequence patterns in a single sequence, in a group of sequences or in the databases.
Examples of patterns (regular expressions):
GAATTC                                        Recognition site for the restriction enzyme EcoRI
GDSGGP                                        Typical of serine proteases
[AG]-x(4)-G-K-[ST]                            Motif A of the ATP/GTP-binding site
C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H    Zinc finger proteins
The program Findpatterns uses these types of patterns to search a set of sequences, such as those of a database. The program Motifs specifically searches a protein sequence or set of sequences for the motifs present in the PROSITE database.
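The pattern notation above maps closely onto regular expressions. The following is a minimal sketch, not the Findpatterns implementation: it assumes only the pattern elements used in the examples above (single residues, `[..]` classes, `x`, and repeat counts `(n)` or `(n,m)`). The function names and the test peptide are illustrative.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Convert a pattern such as [AG]-x(4)-G-K-[ST] into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        # Separate the residue part from an optional repeat count "(n)" or "(n,m)".
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        residue, low, high = match.groups()
        if residue == "x":
            residue = "."  # x stands for any residue
        if low is None:
            parts.append(residue)
        elif high is None:
            parts.append(f"{residue}{{{low}}}")
        else:
            parts.append(f"{residue}{{{low},{high}}}")
    return "".join(parts)

def find_pattern(pattern: str, sequence: str) -> list[tuple[int, str]]:
    """Return (0-based start, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]

# Illustrative peptide containing the ATP/GTP-binding motif A.
print(find_pattern("[AG]-x(4)-G-K-[ST]", "MTEYKLVVVGAGGVGKSALT"))
# A restriction site is a plain string, so it passes through unchanged.
print(find_pattern("GAATTC", "TTGAATTCAA"))
```

Because `re.finditer` reports non-overlapping matches only, overlapping occurrences of a motif would need a lookahead-based search instead.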
Comparing non-identical sequences
Protein sequence comparison - basic concepts
Comparing two nearly identical protein sequences
GWFTREKLREEDHIKK
GWFTKEKIREEDHIKK

When two protein sequences are compared and the similarity is considered statistically significant, it is highly likely that the two proteins are evolutionarily related. There are really only two kinds of biological relationships:

Orthologs   Proteins that carry out the same function in different species
Paralogs    Proteins that perform different but related functions within one organism
Proteins are homologous if they are related by divergence from a common ancestor.
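[Figure: What are orthologs? A gene X in an ancestral organism is passed on, through speciation, to organism A and organism B (labelled X1 and X2). The two copies are orthologs.]

[Figure: What are paralogs? A gene X is copied within one organism by gene duplication, giving Xa and Xb. The two copies are paralogs.]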
Mouse trypsin      -- orthologs --  Human trypsin
      |                                  |
   paralogs                           paralogs
      |                                  |
Mouse chymotrypsin -- orthologs --  Human chymotrypsin
Comparing 2 sequences - Dotplot analysis

  M A K L Q G A L G K R Y
M *
A   *         *
K     *             *
I
Q         *
G           *     *
A   *         *
L       *       *
A   *         *
K     *             *
R                     *
Y                       *

Sequence alignment

M A K L Q G A L G K R Y
* * *   * * * *   * * *
M A K I Q G A L A K R Y
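A dotplot of this kind is simple to compute: every cell where the vertical and horizontal residues are identical gets a mark. The sketch below reproduces the plot for the two sequences of this example; the function name `dotplot` is illustrative.

```python
def dotplot(vertical: str, horizontal: str) -> list[str]:
    """One row per residue of `vertical`; '*' marks identity with `horizontal`."""
    return ["".join("*" if v == h else " " for h in horizontal) for v in vertical]

vertical = "MAKIQGALAKRY"
horizontal = "MAKLQGALGKRY"

rows = dotplot(vertical, horizontal)
print("  " + horizontal)
for residue, row in zip(vertical, rows):
    print(residue + " " + row)

# Marks on the main diagonal correspond to the matches in the ungapped alignment.
diagonal_matches = sum(rows[i][i] == "*" for i in range(len(vertical)))
print(diagonal_matches)
```

A long unbroken diagonal indicates a region of similarity; marks off the diagonal come from residues that simply occur more than once.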
Comparing 2 sequences - Gaps

  M A K L Q L G K R Y
M *
A   *
K     *         *
L       *   *
Q         *
G             *
A   *
L       *   *
G             *
K     *         *
R                 *
Y                   *

Sequence alignment

M A K L Q - - L G K R Y
* * * * *     * * * * *
M A K L Q G A L G K R Y
          Gap
Comparing 2 sequences: What are really gaps?

Gaps are the result of mutations (changes in DNA) that occur during evolution.

For instance, consider this deletion mutation:

Before the deletion:

DNA       AACTTGACG GACTGGGCGTAT CGGGCACCG
          TTGAACTGC CTGACCCGCATA GCCCGTGGC
protein    N  L  T   D  W  A  Y   R  A  P

After the deletion:

DNA       AACTTGACG CGGGCACCG
          TTGAACTGC GCCCGTGGC
protein    N  L  T   R  A  P
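The effect of this deletion can be checked by translating the coding strand before and after the middle segment is removed. This is a minimal sketch using the standard genetic code; the names `CODON_TABLE` and `translate` are illustrative and not taken from any of the packages mentioned in this handout.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna: str) -> str:
    """Translate a coding strand in reading frame 1; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

before = "AACTTGACG" + "GACTGGGCGTAT" + "CGGGCACCG"
after = "AACTTGACG" + "CGGGCACCG"   # the middle 12 nucleotides are deleted

print(translate(before))
print(translate(after))
```

The deleted segment is a multiple of three nucleotides long, so the reading frame is kept and the protein simply loses four residues. Aligning the two proteins then requires a gap of four positions.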
Pairwise comparison
In pairwise comparison, gaps cannot be inserted in an unrestricted manner, so a penalty is assigned to each gap. Two parameters are frequently used in sequence comparison (for example in the programs Gap, Bestfit and Fasta):

- Gap creation penalty
- Gap extension penalty

There are two parameters because it is more 'difficult' to create a gap than to extend an existing gap.
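The two parameters can be combined into a single penalty per gap. The sketch below assumes one common convention, in which a gap of a given length costs the creation penalty once plus the extension penalty for every gapped position. Programs differ in the exact convention, and the default values 12 and 2 are only illustrative.

```python
def gap_penalty(length: int, creation: float = 12.0, extension: float = 2.0) -> float:
    """Penalty for a single gap spanning `length` positions (assumed convention)."""
    if length <= 0:
        return 0.0
    return creation + extension * length

one_long_gap = gap_penalty(3)
three_short_gaps = 3 * gap_penalty(1)
print(one_long_gap, three_short_gaps)
```

Under this scheme one gap of three positions is much cheaper than three separate gaps of one position each, which reflects the idea that creating a gap is the 'difficult' step.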
Substitution matrices
In the scoring of an alignment we do not only take into account whether amino acids are identical or not. To better evaluate the biological significance of an alignment we make use of the fact that not all amino acid substitutions occur with the same frequency.

Example: the substitution
Asp (D) -> Glu (E) is more likely than
Asp (D) -> Cys (C)

For each pair of amino acids one can estimate a probability for the pair to occur in a correct alignment of related protein sequences. This kind of data is used to produce a substitution matrix. The first substitution matrix to be used was PAM250. Now the matrix BLOSUM62 is most often used.
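As a small illustration of how a substitution matrix is used, the sketch below scores aligned residue pairs with a handful of entries for D, E and C. The values are quoted from the commonly distributed BLOSUM62 table; only this subset is included, and the helper names are illustrative.

```python
# Subset of BLOSUM62 (symmetric), keyed by the alphabetically sorted residue pair.
BLOSUM62_SUBSET = {
    ("D", "D"): 6, ("E", "E"): 5, ("C", "C"): 9,
    ("D", "E"): 2, ("C", "D"): -3, ("C", "E"): -4,
}

def score_pair(a: str, b: str) -> int:
    """Score for aligning residue a with residue b."""
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]

def score_ungapped(seq_a: str, seq_b: str) -> int:
    """Sum of pair scores for two aligned sequences of equal length, no gaps."""
    return sum(score_pair(a, b) for a, b in zip(seq_a, seq_b))

print(score_pair("D", "E"))   # conservative substitution, positive score
print(score_pair("D", "C"))   # unlikely substitution, negative score
print(score_ungapped("DDE", "DEC"))
```

A positive score means the pair is seen more often in alignments of related proteins than expected by chance, a negative score means it is seen less often.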
Neurospora_crassa GSVDGYAYTD ANKQKGITWD ENTLFEYLEN PKKYIPGTKM AFGGLKKDKD
Stellaria_longipes GSVEGFSYTD ANKAKGIEWN KDTLFEYLEN PKKYIPGTKM AFGGLKKDKD
Thermomyces_lanuginosus GSVEGYSYTD ANKQAGITWN EDTLFEYLEN PKKFIPGTKM AFGGLKKNKD
Arabidopsis_thaliana GSVAGYSYTD ANKQKGIEWK DDTLFEYLEN PKKYIPGTKM AFGGLKKPKD
Aspergillus_niger GQSEGYAYTD ANKQAGVTWD ENTLFSYLEN PKKFIPGTKM AFGGLKKGKE
Debaryomyces_occidentalis GQAAGYSYTD ANKKKGVEWT EQTMSDYLEN PKKYIPGTKM AFGGLKKPKD
Schizosaccharomyces_pombe GQAEGFSYTE ANRDKGITWD EETLFAYLEN PKKYIPGTKM AFAGFKKPAD
Fagopyrum_esculentum GTTAGYSYSA ANKNKAVTWG EDTLYEYLLN PKKYIPGTKM VFPGLKKPQE
Sesamum_indicum GTTPGYSYSA ANKNMAVIWG ENTLYDYLLN PKKYIPGTKM VFPGLKKPQE
Haematobia_irritans GQAAGFAYTN ANKAKGITWQ DDTLFEYLEN PKKYIPGTKM IFAGLKKPNE
Lucilia_cuprina GQAPGFAYTN ANKAKGITWQ DDTLFEYLEN PKKYIPGTKM IFAGLKKPNE
Ceratitis_capitata GQAAGFAYTD ANKAKGITWN EDTLFEYLEN PKKYIPGTKM IFAGLKKPNE
Sarcophaga_peregrina GQAPGFAYTD ANKAKGITWN EDTLFEYLEN PKKYIPGTKM IFAGLKKPNE
Manduca_sexta GQAPGFSYSD ANKAKGITWN EDTLFEYLEN PKKYIPGTKM VFAGLKKANE
Samia_cynthia GQAPGFSYSN ANKAKGITWG DDTLFEYLEN PKKYIPGTKM VFAGLKKANE
Schistocerca_gregaria GQAPGFSYTD ANKSKGITWD ENTLFIYLEN PKKYIPGTKM VFAGLKKPEE
Apis_mellifera GQAPGYSYTD ANKGKGITWN KETLFEYLEN PKKYIPGTKM VFAGLKKPQE
Macaca_mulatta GQAPGYSYTA ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE
Pan_troglodytes GQAPGYSYTA ANKNKGIIWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE
Anas_platyrhynchos GQAEGFSYTD ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFAGIKKKSE
Aptenodytes_patagonicus GQAEGFSYTD ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFAGIKKKSE
Global alignment (Needleman-Wunsch algorithm): Considers similarity across the full extent of the sequences (Gap program in SeqWeb)
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
   |   |     |||||||      |    |
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
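A global alignment can be computed by dynamic programming. The following is a minimal sketch of the Needleman-Wunsch recurrence with traceback, under simplifying assumptions: identities score +1, mismatches -1, and every gapped position costs 1 (a linear gap penalty rather than the creation/extension scheme, and no substitution matrix). It is not the implementation used by the Gap program.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))

result = needleman_wunsch("MAKLQLGKRY", "MAKLQGALGKRY")
print(result[0])
print(result[1])
print(result[2])
```

Applied to the two sequences from the gap example earlier in this handout, the traceback places the two-residue gap after MAKLQ. The table has one cell per residue pair, so time and memory grow with the product of the sequence lengths.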
Local alignment (Smith-Waterman algorithm): Considers regions of similarity in parts of the sequences only. (Bestfit program in SeqWeb)

       xxxxxxx
       |||||||
       xxxxxxx
  region of similarity
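Local alignment uses the same dynamic programming table as global alignment, with two changes: a cell is never allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The sketch below assumes the same simplified scoring as a basic global alignment (+1 identity, -1 mismatch, -1 per gapped position); the two test sequences are hypothetical and share only the serine protease motif GDSGGP.

```python
def smith_waterman(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score falls to zero.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))

local = smith_waterman("AAAGDSGGPTTT", "CCGDSGGPCC")
print(local)
```

The unrelated flanking residues are left out of the reported alignment, which is what distinguishes a local from a global comparison.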
This means, for instance, that a global alignment is always produced, even when there is no significant similarity between the two sequences. How can one decide in such global comparisons whether the similarity is significant? In the Gap program there is an option "Generate statistics from randomized alignment". When this option is selected, the second sequence is repeatedly shuffled, maintaining its length and composition, and then realigned to the first sequence. The average alignment score, plus or minus the standard deviation, of all randomized alignments is reported in the output file. You can compare this average quality score to the quality score of the actual alignment to help evaluate the significance of the alignment.
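The shuffling procedure described above can be sketched in a few lines. This is an illustration of the idea, not the Gap program: the scoring function is a simplified global alignment score (+1 identity, -1 mismatch, -1 per gapped position), and the number of trials and the random seed are arbitrary choices.

```python
import random
import statistics

def global_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Global alignment score by dynamic programming, keeping only one row in memory."""
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            current.append(max(previous[j - 1] + sub, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]

def randomized_scores(a: str, b: str, trials: int = 100, seed: int = 0):
    """Shuffle b repeatedly (same length and composition) and realign it to a."""
    rng = random.Random(seed)
    residues = list(b)
    scores = []
    for _ in range(trials):
        rng.shuffle(residues)
        scores.append(global_score(a, "".join(residues)))
    return statistics.mean(scores), statistics.stdev(scores)

actual = global_score("MAKLQGALGKRY", "MAKIQGALAKRY")
mean, stdev = randomized_scores("MAKLQGALGKRY", "MAKIQGALAKRY")
print(f"actual score: {actual}")
print(f"randomized:   {mean:.1f} +/- {stdev:.1f}")
```

If the actual score lies many standard deviations above the randomized average, the similarity is unlikely to be due to composition alone.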
Database searches
Searching databases with FASTA / BLAST
Improvement of speed as compared to the local alignment algorithm:

Initial search is for short words.
Word hits are then extended in either direction.
Fasta and Blast are programs frequently used to search sequence databases for homology to a query sequence. Programs of this kind answer practical questions posed by molecular biologists, such as: Is my sequence similar to anything in the database? I seem to have identified a new protein; what is the relationship of this protein to proteins that have been described previously?

Blast and Fasta are local similarity search methods that concentrate on finding short identical matches, which may contribute to a total match.
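The two steps named above, finding short word hits and extending them in either direction, can be sketched as follows. This is a deliberately simplified illustration: it uses exact identical words and extends only while residues stay identical, whereas the real programs score with a substitution matrix. The function names and the subject sequence are hypothetical.

```python
from collections import defaultdict

def find_word_hits(query: str, subject: str, word_size: int = 3) -> list[tuple[int, int]]:
    """Return (query_pos, subject_pos) for every identical word shared by both sequences."""
    index = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        index[query[i:i + word_size]].append(i)
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            hits.append((i, j))
    return hits

def extend_hit(query: str, subject: str, i: int, j: int, word_size: int = 3) -> str:
    """Extend a word hit left and right for as long as the residues are identical."""
    start_q, start_s = i, j
    while start_q > 0 and start_s > 0 and query[start_q - 1] == subject[start_s - 1]:
        start_q -= 1
        start_s -= 1
    end_q, end_s = i + word_size, j + word_size
    while end_q < len(query) and end_s < len(subject) and query[end_q] == subject[end_s]:
        end_q += 1
        end_s += 1
    return query[start_q:end_q]

query = "MAKLQGALGKRY"
subject = "SSLQGALGKRW"
hits = find_word_hits(query, subject)
print(hits)
print(extend_hit(query, subject, *hits[0]))
```

Indexing the query words once means each subject sequence is scanned in a single pass, which is where the speed advantage over a full dynamic programming comparison comes from.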
FastA uses the method of Pearson and Lipman (Proc. Natl. Acad. Sci. USA 85; 2444-2448 (1988)) to search for similarities between one sequence (the query) and any group of sequences of the same type (nucleic acid or protein) as the query sequence. In the first step of this search, the comparison can be viewed as a set of dot plots, with the query as the vertical sequence and the group of sequences to which the query is being compared as the different horizontal sequences. This first step finds the registers of comparison (diagonals) having the largest number of short perfect matches (words) for each comparison. In the second step, these "best" regions are rescored using a scoring matrix that allows conservative replacements, ambiguity symbols, and runs of identities shorter than the size of a word. In the third step, the program checks to see if some of these initial highest-scoring diagonals can be joined together. Finally, the search set sequences with the highest scores are aligned to the query sequence for display.
ktup or wordsize: the length of the initial peptide match. The default is 2, i.e. the program starts identifying a diagonal by extending a dipeptide match. ktup=1 is used for a more sensitive search.
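The first FASTA step, counting word matches per diagonal, can be illustrated with a short sketch. It covers only this counting step under the assumption of exact word identity; rescoring and joining of diagonals are not implemented, and the function name is illustrative.

```python
from collections import Counter, defaultdict

def diagonal_word_counts(query: str, subject: str, ktup: int = 2) -> Counter:
    """Count identical words of length ktup on each diagonal (subject_pos - query_pos)."""
    positions = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        positions[query[i:i + ktup]].append(i)
    diagonals = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in positions.get(subject[j:j + ktup], []):
            diagonals[j - i] += 1
    return diagonals

counts = diagonal_word_counts("MAKLQLGKRY", "MAKLQGALGKRY", ktup=2)
print(dict(counts))
```

For the two sequences of the gap example in this handout, the dipeptide matches fall on two diagonals, offset by two positions. These are the kind of regions that the third step would try to join across a gap.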
Output from Fasta
 Fasta searches a protein or DNA sequence data bank
 version 3.3t04 January 25, 2000
Please cite:
 W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448

../seq/ramp4.seq: 75 aa
 >ramp4.seq
 vs /vol1/gcgdata/ncbi_nr/nr.dat library
searching /vol1/gcgdata/ncbi_nr/nr.dat library

173831120 residues in 553635 sequences
 statistics extrapolated from 60000 to 552908 sequences
 Expectation_n fit: rho(ln(x))= 4.8232+/-0.0004; mu= 0.7959+/- 0.022;
 mean_var=53.2306+/- 9.966, 0's: 686 Z-trim: 26 B-trim: 2227 in 1/63
 Kolmogorov-Smirnov statistic: 0.0519 (N=29) at 46

FASTA (3.34 January 2000) function [optimized, BL50 matrix (15:-5)] ktup: 2
 join: 36, opt: 24, gap-pen: -12/ -2, width: 16
 Scan time: 102.010
The best scores are:                                      opt bits E(552908)
gi|4585827|emb|CAB40910.1| (AJ238236) ribosome as (  75)  483  130 1.9e-30
gi|7657552|ref|NP_055260.1| stress-associated end (  66)  426  116 3.7e-26
gi|7504801|pir||T23009 hypothetical protein F59F4 (  65)  251   71 8.5e-13
gi|9802529|gb|AAF99731.1|AC004557_10 (AC004557) F (  77)  145   45 0.00012
gi|2498673|sp|Q47415|NRDI_ECOLI NRDI PROTEIN gi|2 ( 136)  105   35 0.22
gi|1800061|dbj|BAA16538.1| (D90891) similar to [S ( 217)  105   35 0.33
gi|6319639|ref|NP_009721.1| involved in the secre (  65)   92   31 1.2
gi|2498674|sp|Q56109|NRDI_SALTY NRDI PROTEIN gi|1 ( 136)   93   32 1.8
>>gi|4585827|emb|CAB40910.1| (AJ238236) ribosome associa (75 aa)
 initn: 483 init1: 483 opt: 483 Z-score: 682.4 bits: 130.3 E(): 1.9e-30
Smith-Waterman score: 483; 100.000% identity in 75 aa overlap (1-75:1-75)

               10        20        30        40        50        60
ramp4. MVGAGGAAKMVAKQRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVC
       ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
gi|458 MVGAGGAAKMVAKQRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVC
               10        20        30        40        50        60

               70
ramp4. GSAIFQIIQSIRMGM
       :::::::::::::::
gi|458 GSAIFQIIQSIRMGM
               70
>>gi|7657552|ref|NP_055260.1| stress-associated endoplas (66 aa)
 initn: 426 init1: 426 opt: 426 Z-score: 605.1 bits: 115.8 E(): 3.7e-26
Smith-Waterman score: 426; 100.000% identity in 66 aa overlap (10-75:1-66)

               10        20        30        40        50        60
ramp4. MVGAGGAAKMVAKQRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVC
                :::::::::::::::::::::::::::::::::::::::::::::::::::
gi|765          MVAKQRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVC
                        10        20        30        40        50

               70
ramp4. GSAIFQIIQSIRMGM
       :::::::::::::::
gi|765 GSAIFQIIQSIRMGM
              60
>>gi|7504801|pir||T23009 hypothetical protein F59F4.2 - (65 aa)
 initn: 227 init1: 143 opt: 251 Z-score: 365.3 bits: 71.4 E(): 8.5e-13
Smith-Waterman score: 251; 53.846% identity in 65 aa overlap (10-74:1-64)

               10        20        30        40        50        60
ramp4. MVGAGGAAKMVAKQRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVC
                :. :::. .::.. :::...::::::. . : :.:  ..:::..::.::::
gi|750          MAPKQRMTLANKQFSKNVNNRGNVAKSLKPA-EDKYPAAPWLIGLFVFVVC
                        10        20        30         40        50

               70
ramp4. GSAIFQIIQSIRMGM
       :::.:.::. ..::
gi|750 GSAVFEIIRYVKMGW
               60
>>gi|9802529|gb|AAF99731.1|AC004557_10 (AC004557) F17L21 (77 aa)
 initn: 139 init1: 100 opt: 145 Z-score: 218.9 bits: 44.6 E(): 0.00012
Smith-Waterman score: 145; 44.262% identity in 61 aa overlap (17-74:15-74)

               10        20           30        40        50
ramp4. MVGAGGAAKMVAKQRIRMAN---EKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIF
                       :.:.   :: .::: .:: : .:. .  ..   ::: ::..:.:
gi|980   MVDLERTITNTTSKRLADRKIEKFDKNILKRGFVPETTTKKGKDYP-VGPILLGFFVF
                 10        20        30        40         50

        60        70
ramp4. VVCGSAIFQIIQSIRMGM
       :: ::..::::..   :
gi|980 VVIGSSLFQIIRTATSGGMA
        60        70
>>gi|2498673|sp|Q47415|NRDI_ECOLI NRDI PROTEIN gi|212121 (136 aa)
 initn: 66 init1: 41 opt: 105 Z-score: 160.3 bits: 34.6 E(): 0.22
Smith-Waterman score: 105; 30.488% identity in 82 aa overlap (3-75:50-125)

                                           10        20        30
ramp4.                             MVGAGGAAKMVAKQRIRMANEKHSKNITQRGN
                                     :.::.:  : .: ::. :..:.. .  ::
gi|249 RLGLPAVRIPLNERERIQVDEPYILIVPSYGGGGTAGAVPRQVIRFLNDEHNRALL-RGV
      20        30        40        50        60        70

             40                 50        60        70
ramp4. VAKTSRNAPEEKASVG---------PWLLALFIFVVCGSAIFQIIQSIRMGM
       .:. .::  :  . .:         :::   . : . :.   . :...: :.
gi|249 IASGNRNFGEAYGRAGDVIARKCGVPWL---YRFELMGTQ--SDIENVRKGVTEFWQRQP
       80        90       100          110         120       130

gi|249 QNA

...

75 residues in 1 query sequences
173831120 residues in 553635 library sequences
 Scomplib [version 3.3t04 January 25, 2000]
 start: Thu Sep 14 12:30:17 2000 done: Thu Sep 14 12:32:19 2000
 Scan time: 102.010 Display time: 0.020
E-value
For a given score, the number of hits in a database search that we expect to see by chance with this score or better. The E-value takes into account the size of the database that was searched. The lower the E-value, the more significant the score is.

P-value
Like an E-value, but a P-value is the probability of a hit occurring by chance with this score or better, as opposed to the expected number of hits. A P-value has a maximum of 1.0, while an E-value has a maximum of the number of sequences in the database that was searched. For small (significant) P-values, P and E are approximately equal, so the choice of one or the other in a software package is arbitrary. NCBI BLAST 2.0, FASTA, and HMMER report E-values. WU-BLAST 2.0 reports P-values.
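The relation between the two values can be made concrete. The sketch below assumes that chance hits follow a Poisson distribution, so that the probability of at least one hit is P = 1 - exp(-E). This assumption is commonly made in database search statistics but is not stated in this handout. The E-values are taken from the Fasta output shown earlier.

```python
import math

def e_to_p(e_value: float) -> float:
    """P = 1 - exp(-E); expm1 keeps precision for very small E (Poisson assumption)."""
    return -math.expm1(-e_value)

for e_value in (1.9e-30, 0.00012, 0.22, 1.8):
    print(f"E = {e_value:<8g}  P = {e_to_p(e_value):.3g}")
```

For the highly significant hits P and E are practically identical, while for E-values near 1 the two diverge, because a probability can never exceed 1.0.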
Blast
blastp compares an amino acid query sequence against a protein sequence database
blastn compares a nucleotide query sequence against a nucleotide sequence database
blastx compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database
tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands).
tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
            Query      Database

blastp      Protein    Protein
blastn      DNA        DNA
tblastn     Protein    DNA
blastx      DNA        Protein
tblastx     DNA        DNA
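The six-frame conceptual translation used by blastx, tblastn and tblastx can be sketched as follows: three reading frames on the given strand and three on its reverse complement. This is an illustration using the standard genetic code, not BLAST code; the names are illustrative, and the short test sequence is the first DNA segment of the deletion example in this handout.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna: str) -> str:
    """Translate complete codons from the start of dna; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def six_frame_translations(dna: str) -> list[str]:
    """Frames +1, +2, +3 on the given strand, then the same on the reverse complement."""
    reverse_complement = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:])
            for strand in (dna, reverse_complement)
            for offset in range(3)]

frames = six_frame_translations("AACTTGACG")
print(frames)
```

Each of the six translations is then compared as an ordinary protein sequence, which is why these programs are slower than blastn or blastp on the same input.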
Databases used in BLAST at NCBI
Peptide Sequence Databases
nr          All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF

swissprot   The last major release of the SWISS-PROT protein sequence database (no updates)

pdb         Sequences derived from the 3-dimensional structures in the Brookhaven Protein Data Bank

Nucleotide Sequence Databases

nr          All non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences)

dbest       Non-redundant database of GenBank+EMBL+DDBJ EST divisions

dbsts       Non-redundant database of GenBank+EMBL+DDBJ STS divisions

htgs        Unfinished High Throughput Genomic Sequences: phases 0, 1 and 2 (finished, phase 3 HTG sequences are in nr)

pdb         Sequences derived from the 3-dimensional structure

gss         Genome Survey Sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences.
Output from Blast
BLASTP 2.0.11 [Jan-20-2000]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.
Query= ramp4.seq (75 letters)
Database: nr 457,798 sequences; 140,871,481 total letters
Searching..................................................done
                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value

gi|4585827|emb|CAB40910.1| (AJ238236) ribosome associated membr...   126  2e-29
gi|3851666 (AF100470) ribosome attached membrane protein 4 [Rat...   126  2e-29
gi|3877972|emb|CAB03157.1| (Z81095) predicted using Genefinder;...    74  1e-13
gi|3935169 (AC004557) F17L21.12 [Arabidopsis thaliana]                46  3e-05
gi|3935171 (AC004557) F17L21.14 [Arabidopsis thaliana]                36  0.048
gi|5921764|sp|O13394|CHS5_USTMA CHITIN SYNTHASE 5 (CHITIN-UDP A...    29  3.6
>gi|4585827|emb|CAB40910.1| (AJ238236) ribosome associated membrane protein RAMP4 [Rattus norvegicus]
          Length = 75

 Score = 126 bits (313), Expect = 2e-29
 Identities = 62/62 (100%), Positives = 62/62 (100%)

Query: 14 QRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVCGSAIFQIIQSIRM 73
          QRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVCGSAIFQIIQSIRM
Sbjct: 14 QRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVCGSAIFQIIQSIRM 73

Query: 74 GM 75
          GM
Sbjct: 74 GM 75

>gi|3851666 (AF100470) ribosome attached membrane protein 4 [Rattus norvegicus]
 >gi|5326497|dbj|BAA81894.1| (AB018546) similar to rat RAMP4 and yeast YSY6 [Rattus norvegicus]
 >gi|5326499|dbj|BAA81895.1| (AB022427) SERP1 [Homo sapiens]
          Length = 66

 Score = 126 bits (313), Expect = 2e-29
 Identities = 62/62 (100%), Positives = 62/62 (100%)

Query: 14 QRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVCGSAIFQIIQSIRM 73
          QRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVCGSAIFQIIQSIRM
Sbjct: 5  QRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVCGSAIFQIIQSIRM 64

Query: 74 GM 75
          GM
Sbjct: 65 GM 66
>gi|3877972|emb|CAB03157.1| (Z81095) predicted using Genefinder; cDNA EST EMBL:D71338 comes from this gene; cDNA EST EMBL:D74010 comes from this gene; cDNA EST EMBL:D74852 comes from this gene; cDNA EST EMBL:C07354 comes from this gene; cDNA EST EMBL:C0...
          Length = 65

 Score = 74.1 bits (179), Expect = 1e-13
 Identities = 33/61 (54%), Positives = 48/61 (78%), Gaps = 1/61 (1%)

Query: 14 QRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVCGSAIFQIIQSIRM 73
          QR+ +AN++ SKN+  RGNVAK+ + A E+K    PWL+ LF+FVVCGSA+F+II+ ++M
Sbjct: 5  QRMTLANKQFSKNVNNRGNVAKSLKPA-EDKYPAAPWLIGLFVFVVCGSAVFEIIRYVKM 63

Query: 74 G 74
          G
Sbjct: 64 G 64

>gi|3935169 (AC004557) F17L21.12 [Arabidopsis thaliana]
          Length = 68

 Score = 46.1 bits (107), Expect = 3e-05
 Identities = 25/54 (46%), Positives = 35/54 (64%), Gaps = 1/54 (1%)

Query: 21 EKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVCGSAIFQIIQSIRMG 74
          EK  KNI +RG V +T+    ++   VGP LL  F+FVV GS++FQII++   G
Sbjct: 13 EKFDKNILKRGFVPETTTKKGKDYP-VGPILLGFFVFVVIGSSLFQIIRTATSG 65

>gi|3935171 (AC004557) F17L21.14 [Arabidopsis thaliana]
          Length = 106

 Score = 35.6 bits (80), Expect = 0.048
 Identities = 20/42 (47%), Positives = 26/42 (61%), Gaps = 1/42 (2%)

Query: 21 EKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVCGS 62
          EK  KNI +RG V +T+    ++   VGP LL  F+FVV GS
Sbjct: 13 EKFDKNILKRGFVPETTTKKGKDYP-VGPILLGFFVFVVIGS 53
Filtering of low complexity regions
1 MSAAPVQDKDTLSNAERAKNVNGLLQVLMDINTLNGGSSDTADKIRIHAKNFEAALFAKS 60
61 SSKKEYMDSMNEKVAVMRNTYNTRKNAVTAAAANNNIKPVEQHHINNLKNSGNSANNMNV 120
121 NMNLNPQMFLNQQAQARQQVAQQLRNQQQQQQQQQQQQRRQLTPQQQQLVNQMKVAPIPK 180
181 QLLQRIPNIPPNINTWQQVTALAQQKLLTPQDMEAAKEVYKIHQQLLFKARLQQQQAQAQ 240
241 AQANNNNNGLPQNGNINNNINIPQQQQMQPPNSSANNNPLQQQSSQNTVPNVLNQINQIF 300
301 SPEEQRSLLQEAIETCKNFEKTQLGSTMTEPVKQSFIRKYINQKALRKIQALRDVKNNNN 360
361 ANNNGSNLQRAQNVPMNIIQQQQQQNTNNNDTIATSATPNAAAFSQQQNASSKLYQ
Multiple sequence alignment
The homology search programs Fasta and Blast both rely on a basic procedure to compare two sequences with each other. Multiple sequence alignment programs, on the other hand, allow you to align and directly compare more than two related sequences. This procedure is a very useful tool if you want to analyze a family of proteins, for instance to identify the structural elements that are characteristic of that family. Common programs for multiple sequence alignment are Clustalw and Pileup.

Clustalw and Pileup exploit the fact that similar sequences are likely to be evolutionarily related. The programs first calculate pairwise alignment scores for each sequence relative to all others and use them to cluster the sequences into groups. The sequences are then aligned following the branching order of this family tree: similar sequences are aligned first, and more distantly related sequences and groups are added later to generate the final multiple alignment.
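The order in which sequences are brought into a progressive alignment can be illustrated with a small clustering sketch. This is a simplified average-linkage scheme based on fraction of identical residues, not the method implemented in Clustalw or Pileup. It assumes the sequences are already of equal length, and uses the first ten residues of four sequences from the alignment shown earlier in this handout.

```python
from itertools import combinations

def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def merge_order(sequences: dict[str, str]) -> list[tuple[tuple[str, ...], tuple[str, ...]]]:
    """Repeatedly merge the two most similar clusters and record each merge."""
    clusters = [(name,) for name in sequences]
    order = []

    def similarity(pair):
        first, second = pair
        values = [identity(sequences[a], sequences[b]) for a in first for b in second]
        return sum(values) / len(values)

    while len(clusters) > 1:
        best = max(combinations(clusters, 2), key=similarity)
        order.append(best)
        clusters = [c for c in clusters if c not in best] + [best[0] + best[1]]
    return order

sequences = {
    "Neurospora_crassa": "GSVDGYAYTD",
    "Apis_mellifera": "GQAPGYSYTD",
    "Macaca_mulatta": "GQAPGYSYTA",
    "Pan_troglodytes": "GQAPGYSYTA",
}
order = merge_order(sequences)
for first, second in order:
    print(" + ".join(first), "<->", " + ".join(second))
```

The two primate sequences, which are identical over this stretch, are joined first, the insect sequence is added next, and the fungal sequence is added last, following the pattern of similar sequences first and distant ones later.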