Sequence information - lectures
Pairwise alignment
Alignments in database searches
Multiple alignments
Profiles
Patterns
RNA secondary structure / Transformational grammars
Genome organisation / Gene prediction / HMMs
Phylogeny
Dates in the history of sequence information
Experimental
1953   Amino acid sequence of insulin
1965   Base sequence of a tRNA molecule, 75 bases
1970s  Birth of cloning
1976   Sequence of MS2 RNA, 3569 bases
1977   DNA sequencing (Sanger, F)
1978   Bacteriophage genome fd (6408 bases)
1982   EMBL database
1995   Genomes of H influenzae & M genitalium (1.8 / 0.6 million bases)
2000   >95% of human genome (3000 million bases)

Sequence analysis
1970   Global alignment Needleman-Wunsch
1975   PAM matrix (Dayhoff)
1981   Local alignment Smith-Waterman
1985   FASTA
1985   Multiple sequence alignment, 1st attempts
1988   Neural networks (protein secondary structure)
1989   Clustal
1989   Profilesearch
1991   BLAST
1994   Profile HMMs
1994   Context-free grammars (RNA secondary structure)
1997   PSI-BLAST
Sequences - The analysis of raw data from the laboratory
Base-calling (sequence from raw data):
   Phred
Sequence assembly:
   Phrap
   CAP3 / CAP4
Cleaning up sequences:
   Quality filtering
   Vector filtering
Sequence formats / ASCII-Binary formats:
In ASCII text format (= human readable) each character is stored as a byte, for instance the ASCII code of 'A' = 65 as a decimal number = 01000001 as a binary number
However, sequence data is often stored in a binary format:
For instance in a binary system the four bases may be stored as:
A = 00 T = 01 C = 10 G = 11
In this way there is room in one byte (= 8 bits) for 4 bases.
Typically databases are downloaded by the bioinformatician in ASCII format but then reformatted in a binary format for use with different sequence analysis tools. One example is databases for BLAST searches, which may be formatted by the NCBI utility 'formatdb'.
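The 2-bit encoding described above can be sketched in a few lines of Python. This is only an illustration of the idea, using the codes given in the text (A = 00, T = 01, C = 10, G = 11); the function names pack and unpack are made up for this example, and real binary sequence formats also have to deal with ambiguous bases such as N and store the sequence length.

```python
# 2-bit codes taken from the text: A = 00, T = 01, C = 10, G = 11
CODE = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}
BASE = {value: key for key, value in CODE.items()}


def pack(seq):
    """Pack a DNA string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a final chunk that is shorter than four bases.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data, length):
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASE[(byte >> shift) & 0b11])
    return "".join(bases[:length])


packed = pack("ATCGGCTA")
print(len(packed), "bytes for 8 bases:", unpack(packed, 8))
```

The sequence length has to be kept separately, because a partly filled last byte is padded with 00 bits that would otherwise be read back as extra A's.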
How do you know if a file contains ASCII text or has a binary format?
Using cat, more or less on a binary file produces unreadable material, for instance:
% cat /usr/local/bin/less
^?ELF^A^B^A^@^@^@^@^@^@^@^@^@^@^B^^@^@^@^A^@@:\260^@^@^@4^@^Az0^@^@^@^D^@4^@ ^@^G^@(^@^Q^@^P^@^@^@^F^@^@^@4^@@^@4^@@^@4^@^@^@\340 ...
The unix utility 'file' will attempt to determine the file type, for instance
% file README.txt
ascii text
% file /usr/local/bin/less
ELF 32-bit MSB dynamic executable MIPS - version 1
% file data.Z
compressed data
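A rough version of the same check can be written in Python. The sketch below is a simple heuristic and not how 'file' works internally: it treats data as binary if it contains a NUL byte or a large share of non-printable bytes. The function names and the 30% threshold are arbitrary choices made for this example.

```python
def looks_binary(data, threshold=0.30):
    """Guess whether a block of bytes is binary rather than ASCII text."""
    if not data:
        return False
    if b"\x00" in data:
        # NUL bytes are common in executables but not in plain text.
        return True
    # Printable ASCII plus tab, newline and carriage return count as text.
    text_bytes = set(range(32, 127)) | {9, 10, 13}
    non_text = sum(1 for b in data if b not in text_bytes)
    return non_text / len(data) > threshold


def file_looks_binary(path, block_size=1024):
    """Apply the heuristic to the first block of a file."""
    with open(path, "rb") as handle:
        return looks_binary(handle.read(block_size))


print(looks_binary(b">LISOD L.ivanovii sod gene\ncgttatttaa\n"))
print(looks_binary(b"\x7fELF\x01\x02\x01\x00\x00\x00"))
```

Compressed files such as *.Z or *.gz will normally be flagged as binary by this test as well, since compressed data contains bytes from the whole 0-255 range.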
Sequence formats
Examples
1. Embl
ID   LISOD      standard; DNA; PRO; 756 BP.
XX
AC   X64011; S78972;
XX
SV   X64011.1
XX
DT   28-APR-1992 (Rel. 31, Created)
DT   30-JUN-1993 (Rel. 36, Last updated, Version 6)
XX
DE   L.ivanovii sod gene for superoxide dismutase
....
....
SQ   Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;
     cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat        60
     gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa       120
     ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg       180
....
....
2. Fasta
>LISOD L.ivanovii sod gene for superoxide dismutase
cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat
gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa
ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg
....
....
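Because the Fasta format is so simple (a '>' header line followed by sequence lines), it is easy to read with a short script. The following is a minimal sketch, assuming well-formed input; parse_fasta and _make_record are names invented for this example, and established libraries handle many more special cases.

```python
def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append(_make_record(header, chunks))
            header, chunks = line[1:], []
        elif header is not None:
            # Drop the blanks that group the bases in blocks of ten.
            chunks.append(line.replace(" ", ""))
    if header is not None:
        records.append(_make_record(header, chunks))
    return records


def _make_record(header, chunks):
    """Split the header into identifier and description, join the sequence."""
    parts = header.split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description, "".join(chunks)


example = """>LISOD L.ivanovii sod gene for superoxide dismutase
cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat
gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa
"""
for identifier, description, sequence in parse_fasta(example):
    print(identifier, description, len(sequence))
```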
3. GCG
lisod.seq Length: 756 October 27, 2000 13:17 Type: N Check: 5188 ..
1 cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc
51 gccttacaat gtaatttctt ttcacataaa taataaacaa tccgaggagg
101 aatttttaat gacttacgaa ttaccaaaat taccttatac ttatgatgct
....
....
Multiple sequence format of GCG
1.msf MSF: 44 Type: P October 24, 2002 12:41 Check: 9117 ..
 Name: ftsy_bucai  Len: 44  Check: 4221  Weight: 1.00
 Name: ftsy_ecoli  Len: 44  Check: 2326  Weight: 1.00
 Name: ftsy_aquae  Len: 44  Check: 2339  Weight: 1.00
 Name: ftsy_bacsu  Len: 44  Check: 6177  Weight: 1.00
 Name: sr54_aciam  Len: 44  Check: 6296  Weight: 1.00
 Name: sr54_aerpe  Len: 44  Check: 7291  Weight: 1.00
 Name: sr54_arcfu  Len: 44  Check:  122  Weight: 1.00
 Name: sr54_aquae  Len: 44  Check:  345  Weight: 1.00

//

            1                                             44
ftsy_bucai  KNS.EKLYFL LKRKMFNILK KVEIP...LE ISSHSPFVIL VVGV
ftsy_ecoli  RDA.EALYGL LKEEMGEILA KVDEP...LN VEGKAPFVIL MVGV
ftsy_aquae  KEG.EKIKEL LKKELKELLK NCQ...GELK IPEKVGAVLL FVGV
ftsy_bacsu  QDP.KEVQSV ISEKLVEIYN SGDEQISELN IQDGRLNVIL LVGV
sr54_aciam  PTYIERREWF IKIVYDELSN LFGGDKEPKV IPDKIPYVIM LVGV
sr54_aerpe  PPGVTRRDWM IKIVYEELVK LFGGDQEPQV DPPKTPWIVL LVGV
sr54_arcfu  LPALNAKEQI LKIVYEELLR GVGEGLEIPL KKAK....IM LVGL
sr54_aquae  PKNLSPAEFV IKTVYEELVD ILGGEK.... .ADLKKGTVL FVGL
FASTA multiple sequence format
>ftsy_bucai, 44 bases, 52FF1180 checksum.
KNS-EKLYFLLKRKMFNILKKVEIP---LEISSHSPFVILVVGV
>ftsy_ecoli, 44 bases, B14506B6 checksum.
RDA-EALYGLLKEEMGEILAKVDEP---LNVEGKAPFVILMVGV
>ftsy_aquae, 44 bases, 27571BEE checksum.
KEG-EKIKELLKKELKELLKNCQ---GELKIPEKVGAVLLFVGV
>ftsy_bacsu, 44 bases, FA023A4F checksum.
QDP-KEVQSVISEKLVEIYNSGDEQISELNIQDGRLNVILLVGV
>sr54_aciam, 44 bases, 2FC13632 checksum.
PTYIERREWFIKIVYDELSNLFGGDKEPKVIPDKIPYVIMLVGV
>sr54_aerpe, 44 bases, 37AFB895 checksum.
PPGVTRRDWMIKIVYEELVKLFGGDQEPQVDPPKTPWIVLLVGV
>sr54_arcfu, 44 bases, 8294461 checksum.
LPALNAKEQILKIVYEELLRGVGEGLEIPLKKAK----IMLVGL
>sr54_aquae, 44 bases, 794768A2 checksum.
PKNLSPAEFVIKTVYEELVDILGGEK-----ADLKKGTVLFVGL
CLUSTAL multiple sequence format
CLUSTAL W (1.81) multiple sequence alignment
ftsy_bucai      KNS-EKLYFLLKRKMFNILKKVEIP---LEISSHSPFVILVVGV
ftsy_ecoli      RDA-EALYGLLKEEMGEILAKVDEP---LNVEGKAPFVILMVGV
ftsy_aquae      KEG-EKIKELLKKELKELLKNCQ---GELKIPEKVGAVLLFVGV
ftsy_bacsu      QDP-KEVQSVISEKLVEIYNSGDEQISELNIQDGRLNVILLVGV
sr54_aciam      PTYIERREWFIKIVYDELSNLFGGDKEPKVIPDKIPYVIMLVGV
sr54_aerpe      PPGVTRRDWMIKIVYEELVKLFGGDQEPQVDPPKTPWIVLLVGV
sr54_arcfu      LPALNAKEQILKIVYEELLRGVGEGLEIPLKKAK----IMLVGL
sr54_aquae      PKNLSPAEFVIKTVYEELVDILGGEK-----ADLKKGTVLFVGL
                                               .:. :: ::.**:
Phylip multiple sequence format
 8 44
ftsy_bucai KNS-EKLYFL LKRKMFNILK KVEIP---LE ISSHSPFVIL VVGV
ftsy_ecoli RDA-EALYGL LKEEMGEILA KVDEP---LN VEGKAPFVIL MVGV
ftsy_aquae KEG-EKIKEL LKKELKELLK NCQ---GELK IPEKVGAVLL FVGV
ftsy_bacsu QDP-KEVQSV ISEKLVEIYN SGDEQISELN IQDGRLNVIL LVGV
sr54_aciam PTYIERREWF IKIVYDELSN LFGGDKEPKV IPDKIPYVIM LVGV
sr54_aerpe PPGVTRRDWM IKIVYEELVK LFGGDQEPQV DPPKTPWIVL LVGV
sr54_arcfu LPALNAKEQI LKIVYEELLR GVGEGLEIPL KKAK----IM LVGL
sr54_aquae PKNLSPAEFV IKTVYEELVD ILGGEK---- -ADLKKGTVL FVGL
Conversion of sequence formats - Readseq
(one useful version of readseq is part of the SAM package http://www.cse.ucsc.edu/research/compbio/sam2src/)
Readseq can convert between the following formats:
 1. IG/Stanford
 2. GenBank/GB
 3. NBRF
 4. EMBL
 5. GCG
 6. DNAStrider
 7. Fitch
 8. Pearson/Fasta
 9. Zuker (in-only)
10. Olsen (in-only)
11. Phylip3.2
12. Phylip
13. Plain/Raw
14. PIR/CODATA
15. MSF
16. ASN.1
17. PAUP/NEXUS
18. Pretty (out-only)
GCG package utilities:
                  Convert from    to
* Reformat        text            GCG
* FromEMBL        EMBL            GCG
* FromGenBank     Genbank         GCG
* FromFasta       Fasta           GCG
* ToFastA         GCG             Fasta
Clustal:
clustalw my_alignment -convert -output=gcg
clustalw my_alignment -convert -output=phylip
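To show what such a conversion involves, here is a small Python sketch that turns an aligned FASTA file into the Phylip layout shown earlier. It is not a replacement for Readseq or Clustal: the function name is invented for this example, names are simply cut or padded to ten characters, and the sequence is written on one line without blocks of ten.

```python
def fasta_alignment_to_phylip(text):
    """Convert an aligned FASTA string to a simple sequential Phylip string."""
    names, seqs = [], []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # Use the first word of the header, without a trailing comma.
            words = line[1:].split()
            names.append(words[0].rstrip(",") if words else "")
            seqs.append("")
        elif line and seqs:
            seqs[-1] += line
    lengths = {len(s) for s in seqs}
    if len(lengths) != 1:
        raise ValueError("sequences in an alignment must have equal length")
    # First line: number of sequences and alignment length.
    lines = [" %d %d" % (len(seqs), lengths.pop())]
    for name, seq in zip(names, seqs):
        # Classic Phylip names occupy exactly ten characters.
        lines.append(name[:10].ljust(10) + " " + seq)
    return "\n".join(lines)


example = """>ftsy_bucai, 44 bases, 52FF1180 checksum.
KNS-EKLYFLLKRKMFNILKKVEIP---LEISSHSPFVILVVGV
>ftsy_ecoli, 44 bases, B14506B6 checksum.
RDA-EALYGLLKEEMGEILAKVDEP---LNVEGKAPFVILMVGV
"""
print(fasta_alignment_to_phylip(example))
```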
Downloading bioinformatics data and software for local use
Data and software downloaded from the net are typically compressed in one way or the other. The most common formats are *.Z and *.gz
These are binary files that cannot be read with normal editors.
To uncompress a *.Z file:
% uncompress [file]
To uncompress a *.gz file:
% gunzip [file]
A set of files is often downloaded as a compressed tar archive. For instance if you download a file called dna.tar.gz you can do
% gunzip dna.tar.gz
This will yield a file called dna.tar. Then do
% tar -xvf dna.tar
This will unpack all files contained in dna.tar
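The two steps (gunzip, then tar) can also be done from a script. The sketch below uses the tarfile module of the Python standard library; so that it runs on its own it first builds a tiny dna.tar.gz in a temporary directory. The function name unpack_tar_gz and the file names are made up for this example.

```python
import os
import tarfile
import tempfile


def unpack_tar_gz(archive_path, destination):
    """Decompress and unpack a .tar.gz archive in one step."""
    os.makedirs(destination, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(destination)
    return names


# Self-contained demonstration: build a tiny dna.tar.gz, then unpack it.
workdir = tempfile.mkdtemp()
fasta_path = os.path.join(workdir, "lisod.fasta")
with open(fasta_path, "w") as handle:
    handle.write(">LISOD\ncgttatttaa\n")

archive_path = os.path.join(workdir, "dna.tar.gz")
with tarfile.open(archive_path, "w:gz") as archive:
    archive.add(fasta_path, arcname="lisod.fasta")

unpacked_dir = os.path.join(workdir, "unpacked")
print(unpack_tar_gz(archive_path, unpacked_dir))
```

As with tar on the command line, only unpack archives from sources you trust, since an archive may contain files with unexpected paths.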
Alignments and database searches
Common biological problem: We have a novel protein sequence. What can we infer from this sequence about the biological function of the protein?
* Pattern search - PROSITE
* Profile search - Pfam
* Prediction of transmembrane domains (~25% of all proteins are membrane bound!)
* Sequence homology - BLAST, FASTA, SSEARCH
Simple example: an unknown human protein is highly homologous to a protein with known function from another organism => The human protein has the same function (it's an ortholog or a paralog)
BLAST
1. Heuristic step where short word matches are identified
2. Extension of the matches using a dynamic programming method, as for local pairwise alignments (substitution matrix, gap penalties)
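Step 1 can be illustrated with a short Python sketch. It is a simplification: only identical words are reported, whereas BLAST for proteins also accepts similar words that score above a threshold with the substitution matrix, and step 2 (the extension) is not shown. The function names and the word size of 3 are choices made for this example; the two sequences are taken from the protein evolution example in this lecture.

```python
def word_index(sequence, word_size):
    """Map every word of length `word_size` to its start positions."""
    index = {}
    for i in range(len(sequence) - word_size + 1):
        index.setdefault(sequence[i:i + word_size], []).append(i)
    return index


def find_seeds(query, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs where identical words occur."""
    index = word_index(query, word_size)
    seeds = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            seeds.append((i, j))
    return sorted(seeds)


query = "MAKLEKLNQAGLMVAG"
subject = "MVRLEKINQAGLLVAG"
for i, j in find_seeds(query, subject):
    print(i, j, query[i:i + 3])
```

Indexing the words of the query once is what makes this step fast: each word of a database sequence is then looked up directly instead of being compared with every position of the query.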
-The nature of protein sequence and structure evolution -what is the sensitivity of BLAST searches?
-General principles of database searches
Evolution of protein genes

ATGGCAAAACTTGAAAAACTGAATCAAGCAGGCCTGATGGTCGCTGGT
M  A  K  L  E  K  L  N  Q  A  G  L  M  V  A  G

ATGGCTAGGTTGGAGAAGAUAAACCAAGCTGGGATAATAGTTGCAGGA    60%
M  V  R  L  E  K  I  N  Q  A  G  L  L  V  A  G      69%

M  V  R  I  Q  K  I  N  E  K  G  A  L  L  A  G      38%

Q  V  R  I  Q  K  I  Y  E  K  G  A  L  L  A  A      19% ('twilight zone')

Q  V  R  I  Q  K  I  Y  E  K  T  A  L  L  F  A      6% ('midnight zone')
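The percentages in this example are percent identities relative to the first sequence. A minimal Python sketch of the calculation is given below; it assumes the two sequences are already aligned, have the same length and contain no gaps, and the function name is chosen for this example only. For the last three protein sequences it gives values that round to the 38%, 19% and 6% shown.

```python
def percent_identity(a, b):
    """Percentage of aligned positions carrying the same residue."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


reference = "MAKLEKLNQAGLMVAG"
variants = ("MVRIQKINEKGALLAG", "QVRIQKIYEKGALLAA", "QVRIQKIYEKTALLFA")
for variant in variants:
    print(variant, "%.2f" % percent_identity(reference, variant))
```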
Blast report
Sequences producing significant alignments:                          (bits)  Value

pir||F69494 (R)-hydroxyglutaryl-CoA dehydratase activator (hgdC)...  462  e-129
gb|AAD31675.1| (AF123384) (R)-2-hydroxyglutaryl-CoA dehydratase ...  233  1e-060
sp|P39383|YJIL_ECOLI HYPOTHETICAL 27.4 KD PROTEIN IN IADA-MCRD I...  184  9e-046
emb|CAA67409.1| (X98916) orf6 [Methanopyrus kandleri]  170  1e-041
gb|AAF13150.1|AF156260_1 (AF156260) unknown [Methanosarcina bark...  143  2e-033
pir||A69117 activator of (R)-2-hydroxyglutaryl-CoA - Methanobact...  132  4e-030
pir||A72369 (R)-2-hydroxyglutaryl-CoA dehydratase activator-rela...  129  4e-029
gb|AAC23928.1| (U75363) benzoyl-CoA reductase subunit [Rhodopseu...  117  1e-025
pir||S04476 hypothetical protein (hdgA 5' region) - Acidaminococ...  104  1e-021
sp|P27542|DNAK_CHLPN DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...  42  0.005
gb|AAC15473.1| (AF016711) heat shock protein 70 [Burkholderia ps...  39  0.036
pir||F75029 o-sialoglycoprotein endopeptidase (gcp) PAB1159 - Py...  38  0.082
pir||F72514 probable glucokinase APE2091 - Aeropyrum pernix (str...  37  0.18
sp|P42373|DNAK_BURCE DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...  37  0.18
emb|CAA10035.1| (AJ012470) mitochondrial-type hsp70 [Encephalito...  36  0.31
sp|P56836|DNAK_CHLMU DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...  36  0.41
gb|AAF39496.1| (AE002336) dnaK protein [Chlamydia muridarum]  36  0.41
pir||B70189 rod shape-determining protein (mreB-1) homolog - Lym...  36  0.41
sp|O57716|GCP_PYRHO PUTATIVE O-SIALOGLYCOPROTEIN ENDOPEPTIDASE (...  36  0.54
sp|O33522|DNAK_ALCEU DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...  36  0.54
ref|NP_012874.1| Ykl050cp >gi|549677|sp|P35736|YKF0_YEAST HYPOTH...  36  0.54
emb|CAA53420.1| (X75781) D513 [Saccharomyces cerevisiae] >gi|158...  36  0.54
sp|P30722|DNAK_PAVLU DNAK PROTEIN (HEAT SHOCK PROTEIN 70) >gi|99...  36  0.54
pir||A40158 dnaK-type molecular chaperone - Chlamydia trachomati...  34  1.2
gb|AAF07742.1|AE001584_39 (AE001584) hypothetical protein [Borre...  34  1.6
gb|AAF07521.1|AE001577_35 (AE001577) hypothetical protein [Borre...  34  1.6
gb|AAF38963.1| (AE002276) cell shape-determining protein MreB [C...  34  2.1
gb|AAG08147.1|AE004889_10 (AE004889) DnaK protein [Pseudomonas a...  33  2.7
dbj|BAB03215.1| (AB017035) dnaK [Bacillus thermoglucosidasius]  33  2.7
sp|P43736|DNAK_HAEIN DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...  33  2.7
sp|P45554|DNAK_STAAU DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...  33  2.7
sp|Q58303|FLA3_METJA FLAGELLIN B3 PRECURSOR  32  4.7
gb|AAG08239.1|AE004898_10 (AE004898) phosphoribosylaminoimidazol...  32  6.1
Conclusions / comments
If two proteins have significant sequence homology it is highly likely that the two proteins have the same 3D structure (and same function)
If two proteins have the same 3D structure it does not necessarily mean that the sequences are related
Very remote evolutionary relationships are difficult or impossible to detect with normal BLAST / pairwise alignment
Database sequence searches involving proteins should be carried out at the protein level and not at the DNA level
More rules of database searches
- Compare sequences as proteins and not as DNA*
- Use of smallest possible database (not too small though)
- Sequence statistics should be used rather than percent identity/similarity as criterion for homology
- Consider different scoring matrices and gap penalties

* 1) DNA sequences encoding the same protein sequence can be very different, due to the degeneracy of the genetic code.

TTTCGATTCTCAACAAGAAGC
**  * **    **  *   *
TTCAGGTTTAGCACGCGGTCC
 F  R  F  S  T  R  S
2) Amino acid substitution matrices may be taken into account.
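Point 1 can be checked with a few lines of Python: the two DNA sequences of the example are translated with the standard genetic code and compared at both levels. This is a sketch for illustration; the function names are chosen for this example, and the codon table is built from the usual 64-letter string with the bases in the order T, C, A, G.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA string in the first reading frame."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def identity(a, b):
    """Percent identical positions between two equally long sequences."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


dna1 = "TTTCGATTCTCAACAAGAAGC"
dna2 = "TTCAGGTTTAGCACGCGGTCC"
print(translate(dna1), translate(dna2))
print("DNA identity:     %.0f%%" % identity(dna1, dna2))
print("protein identity: %.0f%%" % identity(translate(dna1), translate(dna2)))
```

The two DNA sequences agree at fewer than half of their positions although they encode exactly the same peptide, which is why the comparison is better made at the protein level.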
PAM matrices
Starting point of PAM matrices are closely related orthologs, full sequences are taken into account.
In the calculation of the PAM matrix extrapolation is done to more distant relationships

Neurospora_crassa          GSVDGYAYTD ANKQKGITWD ENTLFEYLEN PKKYIPGTKM AFGGLKKDKD
Stellaria_longipes         GSVEGFSYTD ANKAKGIEWN KDTLFEYLEN PKKYIPGTKM AFGGLKKDKD
Thermomyces_lanuginosus    GSVEGYSYTD ANKQAGITWN EDTLFEYLEN PKKFIPGTKM AFGGLKKNKD
Arabidopsis_thaliana       GSVAGYSYTD ANKQKGIEWK DDTLFEYLEN PKKYIPGTKM AFGGLKKPKD
Aspergillus_niger          GQSEGYAYTD ANKQAGVTWD ENTLFSYLEN PKKFIPGTKM AFGGLKKGKE
Debaryomyces_occidentalis  GQAAGYSYTD ANKKKGVEWT EQTMSDYLEN PKKYIPGTKM AFGGLKKPKD
Schizosaccharomyces_pombe  GQAEGFSYTE ANRDKGITWD EETLFAYLEN PKKYIPGTKM AFAGFKKPAD
Fagopyrum_esculentum       GTTAGYSYSA ANKNKAVTWG EDTLYEYLLN PKKYIPGTKM VFPGLKKPQE
Sesamum_indicum            GTTPGYSYSA ANKNMAVIWG ENTLYDYLLN PKKYIPGTKM VFPGLKKPQE
Haematobia_irritans        GQAAGFAYTN ANKAKGITWQ DDTLFEYLEN PKKYIPGTKM IFAGLKKPNE
Lucilia_cuprina            GQAPGFAYTN ANKAKGITWQ DDTLFEYLEN PKKYIPGTKM IFAGLKKPNE
Ceratitis_capitata         GQAAGFAYTD ANKAKGITWN EDTLFEYLEN PKKYIPGTKM IFAGLKKPNE
Sarcophaga_peregrina       GQAPGFAYTD ANKAKGITWN EDTLFEYLEN PKKYIPGTKM IFAGLKKPNE
Manduca_sexta              GQAPGFSYSD ANKAKGITWN EDTLFEYLEN PKKYIPGTKM VFAGLKKANE
Samia_cynthia              GQAPGFSYSN ANKAKGITWG DDTLFEYLEN PKKYIPGTKM VFAGLKKANE
Schistocerca_gregaria      GQAPGFSYTD ANKSKGITWD ENTLFIYLEN PKKYIPGTKM VFAGLKKPEE
Apis_mellifera             GQAPGYSYTD ANKGKGITWN KETLFEYLEN PKKYIPGTKM VFAGLKKPQE
Macaca_mulatta             GQAPGYSYTA ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE
Pan_troglodytes            GQAPGYSYTA ANKNKGIIWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE
Anas_platyrhynchos         GQAEGFSYTD ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFAGIKKKSE
Aptenodytes_patagonicus    GQAEGFSYTD ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFAGIKKKSE
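The extrapolation step can be pictured as repeated matrix multiplication: a matrix of substitution probabilities for a short evolutionary distance is multiplied by itself to model longer distances (for example PAM250 from PAM1). The Python sketch below shows only this principle. The three-letter alphabet and the probabilities in TOY_PAM1 are invented for this example and are not Dayhoff's values, and the final conversion to log-odds scores is left out.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Return m raised to the n-th power (n >= 1) by repeated multiplication."""
    result = m
    for _ in range(n - 1):
        result = mat_mul(result, m)
    return result


# Hypothetical one-step substitution probabilities for a toy alphabet X, Y, Z.
# Row i, column j: probability that residue i is replaced by residue j.
TOY_PAM1 = [
    [0.990, 0.007, 0.003],
    [0.007, 0.990, 0.003],
    [0.003, 0.003, 0.994],
]

toy_pam250 = mat_power(TOY_PAM1, 250)
for row in toy_pam250:
    print(" ".join("%.3f" % value for value in row))
```

After many steps the probability of keeping the same residue drops and the off-diagonal probabilities grow, which is what makes the extrapolated matrix suitable for more distant relationships.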
BLOSUM matrices. Starting point is conserved elements in protein families.
Protein sequences are more distantly related than the sequences used for PAM matrices.
Example of entry from BLOCKS DATABASE:
Block BL00134A
ID   TRYPSIN_HIS; BLOCK
AC   BL00134A; distance from previous block=(23,4353)
DE   Serine proteases, trypsin family, histidine proteins.
BL   CHC; width=17; seqs=259; 99.5%=867; strength=1439
CTR2_VESCR|P00769  (  24) FCGGSISKRYVLTAAHC  63
CTRL_HALRU|P35003  (  53) CGCVLYTTSKALTAAHC  80
EAST_DROME|P13582  ( 158) CGGSLISTRYVITASHC  18
GILX_HELHO|P43685  (  26) WSGVLLNRDWILTAAHC  45
NDL_DROME|P98159   (1170) CGGTIYSDRWIISAAHC  10
PCE_TACTR|P21902   ( 157) CGGALVTNRHVITASHC  25
TRYP_ASTFL|P00765  (  30) CGASIYNENYAITAGHC  36
TRYU_DROME|P42279  (  59) CGGCILDAVTIATAAHC  31
ACRO_HUMAN|P10323  (  73) CGGSLLNSRWVLTAAHC   4
ACRO_MOUSE|P23578  (  74) CGGSLLNSHWVLTAAHC   5
ACRO_PIG|P08001    (  71) CGGILLNSHWVLTAAHC   7
ACRO_RABIT|P48038  (  71) CGGVLLNAHWVLTAAHC   6
ACRO_RAT|P29293    (  74) CGGSLLNSHWVLTAAHC   5
ANC1_AGKRH|P26324  (  28) CGGVLIHPEWVITAEHC  25
ANC2_AGKRH|P47797  (  52) CGGVLIHPEWVITAKHC  16
ANCR_AGKCO|P09872  (  25) CGGTLINQEWVLTARHC  25
BATX_BOTAT|P04971  (  50) CGMTLINQEWVLTAAHC  49
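How scores can be derived from such a block may be sketched as follows: count all residue pairs within each column, compare the observed pair frequencies with those expected from the residue frequencies, and take a log-odds ratio. The Python code below is a strongly simplified illustration of that idea using the first five sequences of the block above. It leaves out the clustering of similar sequences used for the real BLOSUM matrices and only scores pairs that were actually observed; the function names are invented for this example.

```python
import math
from collections import Counter
from itertools import combinations


def pair_counts(block):
    """Count unordered residue pairs in every column of an ungapped block."""
    counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


def log_odds(block):
    """Log-odds scores in half-bit units for the residue pairs observed."""
    counts = pair_counts(block)
    total = sum(counts.values())
    observed = {pair: n / total for pair, n in counts.items()}
    # Residue frequencies implied by the pair frequencies.
    residue = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            residue[a] += freq
        else:
            residue[a] += freq / 2
            residue[b] += freq / 2
    scores = {}
    for (a, b), freq in observed.items():
        expected = residue[a] ** 2 if a == b else 2 * residue[a] * residue[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


block = [
    "FCGGSISKRYVLTAAHC",
    "CGCVLYTTSKALTAAHC",
    "CGGSLISTRYVITASHC",
    "WSGVLLNRDWILTAAHC",
    "CGGTIYSDRWIISAAHC",
]
scores = log_odds(block)
print(scores[("H", "H")], scores[("C", "C")])
```

Residues that are conserved in a column, such as the histidine near the end of this block, get high positive scores because the pair is seen far more often than expected by chance.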
Variants of BLAST / FASTA
Query/database   DNA/DNA     P/P         DNA/P            P/DNA                    DNA/DNA (translated)
blastall         -p blastn   -p blastp   -p blastx        -p tblastn               -p tblastx
fasta            fasta       fasta       fastx, fasty*    tfasta, tfastx, tfasty
*Compare a DNA sequence to a protein sequence database, by comparing the translated DNA sequence in three frames and allowing gaps and frameshifts. fastx3 uses a simpler, faster algorithm for alignments that allows frameshifts only between codons; fasty3 is slower but produces better alignments with poor quality sequences because frameshifts are allowed within codons.
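The translation step behind these programs can be sketched in Python. The example translates a DNA sequence in the three forward reading frames using the standard genetic code. It only shows the translation, not the alignment with gaps and frameshifts that fastx and fasty perform; the function names are chosen for this example and unknown codons are written as X.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, frame=0):
    """Translate DNA starting at offset `frame` (0, 1 or 2)."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    # Codons with characters other than A, C, G, T are written as X.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frame_translation(dna):
    """Return the translations of the three forward reading frames."""
    return [translate(dna, frame) for frame in range(3)]


dna = "ATGGCAAAACTTGAAAAACTGAATCAAGCAGGCCTGATGGTCGCTGGT"
for frame, protein in enumerate(three_frame_translation(dna), start=1):
    print(frame, protein)
```

Only one of the three frames gives the protein shown earlier in the lecture for this sequence; a frameshift caused by a sequencing error would move the correct translation from one frame to another, which is the situation fastx and fasty are designed to handle.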