NCBI Fundamentals of Sequence Analysis Chuong Huynh NIH/NLM/NCBI Sept 30, 2004...
NC
BI
Molecular Evolution
[Figure: timeline of divergence (3000 Myr, 1000 Myr, 540 Myr) relating Bacteria, Yeast, Worm, Fly, and Human; example gene family: MutL / MLH1; human diseases shown: Alzheimer's disease, ataxia telangiectasia, colon cancer, pancreatic carcinoma]
Common ancestry allows us to infer similar function
Why do we need similarity searching?
• Identification and annotation
– Incomplete or no annotations (GenBank)
– Incorrectly annotated sequences
• Evolutionary relationships
– Homologous molecules may have similar functions
Why Search Databases?
• To find out if a new DNA sequence is already fully or partially present in the databanks.
• To find homologous proteins to a putative coding ORF that might share similar 3D structure.
• To identify homology (“relatedness”) between a query and entries in a database.
Some Terminology
Searching Sequence Databases
• Two sequences are homologous when they share a common ancestry. This ancestry is often reflected in strong sequence similarity.
• Computationally, threshold limits for sequence similarity can be defined by:
– length of the stretch of similar sequence
– percentage of identity between the sequences
– statistical measurements, like E-value, P-value, bit score, etc.
Similarity and Homology
• Similarity can be expressed as a percentage. It does not imply any reasons for the observed sameness.
• Homology is an evolutionary term used to describe relationship via descent from a common ancestor.
• Homologous things are often similar, but not always (whale flipper <-> human arm)
• Homology is NEVER expressed as a percentage
Orthologs vs Paralogs
• Homologs can be separated into two classes: orthologs and paralogs.
• Orthologs are homologous genes in different species that arose by speciation; they usually perform the same function.
• Paralogs are homologous genes that arose by duplication within a genome; they may perform different functions.
Similarity and Homology
• Sequence homology can be reliably inferred from statistically significant similarity over a majority of the sequence length.
• Non-homology CANNOT be inferred from non-similarity because non-similar things can still share a common ancestor.
• Homologous proteins share common structures, but not necessarily common sequence or function (e.g. FtsZ <-> tubulin)
• Remember: a pair of sequences either is or isn't homologous. There is no such thing as “64% homologous”.
Searching sequence databases
• When we search a sequence database, we are usually looking for related sequences.
• Unfortunately, the algorithms that we have for searching databases do not search for homology; they search for similarity.
• When similarity is found, we must determine whether it is a result of homology or comes from another source.
Pairwise Sequence Alignments
• Purpose:
– identification of sequences with significant similarity to (a) sequence(s) in a sequence repository
– identification of all homologous sequences in the repository
– identification of domains with sequence similarity
• Terminology:
– Global alignment
– Local alignment
Terminology: Global Alignment
• Finds the optimal alignment over the entire length of the two compared sequences
• Unlikely to detect genes that have evolved by recombination (e.g. domain shuffling) or insertion/deletion of DNA
• Suitable for sequences of homologous molecules
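Global alignment of this kind is classically computed by dynamic programming (the Needleman-Wunsch method). The following is a minimal sketch; the function name and the scoring values (match +1, mismatch -1, linear gap -2) are illustrative assumptions, not parameters given in the slides.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment.

    Scoring values are illustrative assumptions (linear gap penalty).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))
```

Because every residue of both sequences must appear in the result, a long insertion or a shuffled domain is penalized along the whole alignment, which is why global alignment suits sequences of similar length.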
Terminology: Local Alignment
• Finds short regions of similarity between a pair of sequences.
• Compared sequences can receive high local similarity scores without needing high levels of similarity over their entire length.
• Useful when looking for domains within proteins or for regions of genomic DNA that contain coding exons.
Substitutions, Insertions, Deletions
• Mutation: one of
– switch from one nucleotide to another
– insertion
– deletion
• Substitution: a switch in nucleotides which spreads throughout most of a species.
• Substitutions, insertions and deletions passed along two independent lines of descent cause a divergence of the two sequences from the original (and from each other):
ancestor: cggtatgcca
descendant 1: cgggtatccaa
descendant 2: ccctaggtccca
Example
• For the previous example (ancestor cggtatgcca; descendants cgggtatccaa and ccctaggtccca), the two descendant sequences align as follows:
c g g g t a - - t - c c a a
c c c - t a g g t c c c - a
• “-” (indel) represents an insertion or deletion.
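As a small worked example, the alignment above can be summarized by counting matches, mismatches, and indel columns. The function name is hypothetical, and dividing identities by the alignment length is one common convention, assumed here.

```python
def alignment_stats(row_a, row_b):
    """Count (matches, mismatches, gap columns) in a gapped pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gaps = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            gaps += 1          # indel column
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps


# The two descendant sequences from the slide, written without spaces.
row_1 = "cgggta--t-ccaa"
row_2 = "ccc-taggtccc-a"
matches, mismatches, gaps = alignment_stats(row_1, row_2)
percent_identity = 100.0 * matches / len(row_1)
```

For this alignment the counts are 7 matches, 2 mismatches, and 5 indel columns over 14 columns, i.e. 50% identity.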
Algorithms: definition
Webster’s definition: “a procedure for solving a mathematical problem in a finite number of steps that frequently involves a repetition of an operation; or broadly: a step-by-step procedure for solving a problem or accomplishing some end”
Algorithms
• Needleman-Wunsch
– Exhaustive global alignment
– Most rigorous method when aligning conserved sequences of similar length (no exon shuffling, insertion/deletion, etc.)
• Smith-Waterman
– Exhaustive local alignment
– Alignment does not have to extend along the full length of the sequences
– In contrast to N-W, alignments initiating at all possible positions in the sequence space are considered
– Can be very slow
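The Smith-Waterman idea can be sketched as the same dynamic programming table with one change: a cell is never allowed to drop below zero, so an alignment may start and end anywhere. The function name and scoring values (match +2, mismatch -1, linear gap -2) are illustrative assumptions.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment.

    Scoring values are illustrative assumptions (linear gap penalty).
    """
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets a new alignment start at any position.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the highest-scoring cell until a zero cell is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

The table has one cell per pair of positions, so time and memory grow with the product of the two sequence lengths; this is the cost that heuristic tools try to avoid.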
Basic Local Alignment Search Tool
http://www.ncbi.nlm.nih.gov/BLAST/
Basic Local Alignment Search Tool
• Widely used similarity search tool
• Heuristic approach based on the Smith-Waterman algorithm
• Finds best local alignments
• Provides statistical significance
• All combinations (DNA/Protein) of query and database:
– DNA vs DNA
– DNA translation vs Protein
– Protein vs Protein
– Protein vs DNA translation
– DNA translation vs DNA translation
• WWW, standalone, and network clients
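The five query/database combinations listed above correspond to the five classic BLAST programs (blastn, blastx, blastp, tblastn, tblastx). A small lookup sketch follows; the labels "dna", "protein", and "translated_dna" and the function name are hypothetical conventions chosen for this example.

```python
# (query form, database form) -> BLAST program, following the list above.
BLAST_PROGRAMS = {
    ("dna", "dna"): "blastn",
    ("translated_dna", "protein"): "blastx",
    ("protein", "protein"): "blastp",
    ("protein", "translated_dna"): "tblastn",
    ("translated_dna", "translated_dna"): "tblastx",
}


def choose_blast_program(query, database):
    """Return the BLAST program name for a query/database combination."""
    key = (query.lower(), database.lower())
    if key not in BLAST_PROGRAMS:
        raise ValueError(f"no BLAST program compares {query!r} against {database!r}")
    return BLAST_PROGRAMS[key]
```

For example, `choose_blast_program("protein", "translated_dna")` returns `"tblastn"`.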
BLAST Selection Matrix
Choosing the Right BLAST Flavor for Proteins
What do you want to do? / The right BLAST flavor
• Find out something about the function of the protein / Use blastp to compare your protein with other proteins contained in the databases.
• Discover new genes encoding similar proteins / Use tblastn to compare your protein with DNA sequences translated into their 6 possible reading frames.
(Claverie & Notredame 2003)
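The "6 possible reading frames" are three offsets on the given strand plus three on its reverse complement. The sketch below illustrates this with the standard genetic code; the function names are hypothetical, and incomplete or ambiguous codons are rendered as "X" by assumption.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames
```

A translated search then compares the protein query against each of the six translations of every database sequence, with "*" marking stop codons.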
Choosing the Right BLAST Flavor for DNA
Question / Answer
• Am I interested in non-coding DNA? / Yes: use blastn. Note: blastn is only for closely related DNA sequences (more than 70% identical).
• Do I want to discover new proteins? / Yes: use tblastx.
• Do I want to discover proteins encoded in my query DNA sequences? / Yes: use blastx.
• Am I unsure of the quality of my DNA? / Yes: use blastx, especially if you suspect your DNA sequence codes for a protein but may contain sequencing errors.
(Claverie & Notredame 2003)
Choosing the Right BLAST Flavor for DNA Sequences
Usage / Query / Database / Program
• Find very similar DNA sequence / DNA / DNA / blastn
• Protein discovery and ESTs / Translated DNA / Translated DNA / tblastx
• Analysis of query DNA sequence / Translated DNA / Protein / blastx
(Claverie & Notredame 2003)
WWW BLAST
Web BLAST
BLAST Databases: Non-redundant protein
• nr (non-redundant protein sequences)
– GenBank CDS translations
– NP_ RefSeqs
– Outside Protein: PIR, Swiss-Prot, PRF
– PDB (sequences from structures)
BLAST Databases: Nucleic Acid
• nr (nt)
– Traditional GenBank Divisions
– NM_ and XM_ RefSeqs
• dbest – EST Division
• htgs – HTG division
• gss – GSS division
• chromosome – NC_ RefSeqs
Protein BLAST Page
Query (FASTA format):
>Mutated in Colon Cancer
IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLLGSNSSRMYFTQTLLPGLAGPSGEMVKSTTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDAFLQPLSKPLSS
Database: protein database
BLAST Formatting Page
BLAST Output Overview
• Graphic Display: Shows you where your query is similar to other sequences.
• Hit List: Names of sequences similar to your query, ranked by similarity
• Alignments: Every alignment between your query and the reported hits
• Parameters: List of the various parameters used for the search
BLAST Output: Graphic
• Mouse over and click for active links
• Sort by taxonomy
• Color key: red bar = most similar sequence; pink = almost as similar; green = even less similar; blue/black = worse scores
BLAST Output: Descriptions
[Screenshot annotations]
• Bacterial mismatch repair proteins
• Link to Entrez
• Sorted by E-values
• 4 × 10^-56
• Default E-value cutoff: 10
• LocusLink
• Bit scores < 50 unreliable
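The two cutoffs noted on this slide (default E-value cutoff of 10, bit scores below 50 treated as unreliable) can be applied to a hit list as a simple filter. This is a minimal sketch: the tuple layout and function name are assumptions, and the four sample hits are copied from the "Results from nr" slide later in this deck.

```python
# (identifier, bit score, E-value), copied from the "Results from nr" slide.
HITS = [
    ("gi|4557757|ref|NP_000240.1|", 233, 4e-61),
    ("gi|13878583|sp|Q9JK91|MLH1_MOUSE", 214, 2e-55),
    ("gi|6323819|ref|NP_013890.1|", 72, 1e-12),
    ("gi|15642797|ref|NP_227838.1|", 48, 2e-05),
]


def filter_hits(hits, max_evalue=10.0, min_bits=50.0):
    """Keep hits passing both cutoffs, sorted by E-value (best first)."""
    kept = [h for h in hits if h[2] <= max_evalue and h[1] >= min_bits]
    return sorted(kept, key=lambda h: h[2])
```

With the defaults, the last hit (48 bits) is dropped even though its E-value passes the cutoff of 10.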
A Little on Interpretation
• How similar must sequences be in order to be considered homologous?
• Rule of thumb: more than 25% of the amino acids are identical for proteins, and more than 70% of the nucleotides are identical for DNA. Above these limits, you can be reasonably confident that two proteins have a similar structure and a common ancestor.
• Note: this applies only to alignments longer than 100 aa or nt.
A Little on Interpretation: E-value
• Determine how much you can trust your conclusion on homology.
• E-value = Expectation value.
• Allows comparison of pairwise alignments with different similarities and different lengths; this is its advantage over percent identity (not discussed).
• Definition: the number of times a database match as good as yours would be expected to occur by chance. A match unlikely to occur by chance is a good match. The lowest E-values (as close to 0 as possible) are the best: they are the most significant, and we can trust them enough to infer homology.
• If you want to be certain of homology, your E-values must be below 10^-4 (0.0001).
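The slides do not give a formula, but BLAST's statistics relate the E-value to the bit score S' as E = m · n · 2^(-S'), where m and n are the query and database lengths. The sketch below applies that relation; it ignores the effective-length corrections BLAST makes, the function names are hypothetical, and any lengths passed in are the caller's assumptions.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E-value from a bit score: E = m * n * 2**(-S').

    Simplification (assumption): raw lengths are used instead of the
    effective lengths that BLAST computes.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


def supports_homology(e_value, threshold=1e-4):
    """Apply the slide's rule: E-values below 10^-4 support homology."""
    return e_value < threshold
```

Each additional bit halves the expected number of chance matches, and doubling the database size doubles it, which is why the same alignment can look less significant against a larger database.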
TaxBLAST: Taxonomy Reports
BLAST Output: Pairwise Alignments
>gi|127552|sp|P23367|MUTL_ECOLI DNA mismatch repair protein mutL
 Length = 615
 Score = 44.3 bits (103), Expect = 5e-05
 Identities = 25/59 (42%), Positives = 33/59 (55%), Gaps = 8/59 (13%)
Query: 9   LPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHF-----LHE---ESILERVQQHIESKL 59
           L  +  P   L LEI P  VDVNVHP KHEV F     +H+   + +L  +QQ +E+ L
Sbjct: 280 LGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQQQLETPL 338
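The statistics lines of an alignment like the one above can be pulled out with regular expressions. This is a sketch written against the text format shown on these slides only; the function name and the returned dictionary keys are hypothetical, and other BLAST versions may format these lines differently.

```python
import re

# Patterns written against the statistics lines shown on these slides.
SCORE_RE = re.compile(
    r"Score\s*=\s*([\d.]+)\s*bits\s*\((\d+)\),\s*"
    r"Expect(?:\(\d+\))?\s*=\s*([\d.eE+-]+)"
)
FRACTION_RE = re.compile(r"(Identities|Positives|Gaps)\s*=\s*(\d+)/(\d+)")


def parse_hsp_header(text):
    """Extract bit score, raw score, E-value and count fractions from text."""
    m = SCORE_RE.search(text)
    if m is None:
        raise ValueError("no 'Score = ... Expect = ...' line found")
    expect = m.group(3)
    if expect.startswith("e"):
        # Assumption: some outputs abbreviate 1e-100 as "e-100".
        expect = "1" + expect
    result = {
        "bits": float(m.group(1)),
        "raw_score": int(m.group(2)),
        "expect": float(expect),
    }
    for name, num, den in FRACTION_RE.findall(text):
        result[name.lower()] = (int(num), int(den))
    return result
```

Applied to the MutL alignment above, this yields 44.3 bits, an E-value of 5e-05, and identities of 25 out of 59 aligned columns.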
BLAST Output: Alignments
>gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair protein Mlh1 (MutL protein homolog 1)
 Length = 756
 Score = 233 bits (593), Expect = 8e-62
 Identities = 117/131 (89%), Positives = 117/131 (89%)
Query: 1   IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 60
           IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL
Sbjct: 276 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 335
Query: 61  GSNSSRMYFTQTLLPGLAGPSGEMVKXXXXXXXXXXXXXXDKVYAHQMVRTDSREQKLDA 120
           GSNSSRMYFTQTLLPGLAGPSGEMVK              DKVYAHQMVRTDSREQKLDA
Sbjct: 336 GSNSSRMYFTQTLLPGLAGPSGEMVKSTTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDA 395
Query: 121 FLQPLSKPLSS 131
           FLQPLSKPLSS
Sbjct: 396 FLQPLSKPLSS 406
[Annotation on the X run in the query: low complexity sequence filtered]
Results from nr
Sequences producing significant alignments: (bits) Value
gi|604369|gb|AAA85687.1| (U17857) hMLH1 gene product [Homo ... 233 3e-61
gi|4557757|ref|NP_000240.1| (NM_000249) mutL homolog 1; mut... 233 4e-61
gi|466462|gb|AAA17374.1| (U07418) human homolog of E. coli ... 233 4e-61
gi|13878583|sp|Q9JK91|MLH1_MOUSE DNA mismatch repair protei... 214 2e-55
gi|19387852|ref|NP_081086.1| (NM_026810) mutL homolog 1; DN... 213 2e-55
gi|13591989|ref|NP_112315.1| (NM_031053) mismatch repair pr... 212 5e-55
gi|12835158|dbj|BAB23172.1| (AK004105) DNA MISMATCH REPAIR ... 205 6e-53
gi|3192877|gb|AAC19117.1| (AF068257) mutL homolog [Drosophi... 128 1e-29
gi|17136968|ref|NP_477022.1| (NM_057674) Mlh1-P1 [Drosophil... 127 1e-29
gi|17861656|gb|AAL39305.1| (AY069160) GH18717p [Drosophila ... 125 8e-29
gi|20146218|dbj|BAB89000.1| (AP003238) putative MLH1 [Oryza... 87 2e-17
gi|11357265|pir||T51620 DNA mismatch repair protein MLH1 [i... 83 5e-16
gi|18413196|ref|NP_567345.1| (NM_116983) MLH1 protein [Arab... 83 5e-16
gi|6323819|ref|NP_013890.1| (NC_001145) Required for mismat... 72 1e-12
gi|460627|gb|AAA16835.1| (U07187) Mlh1p [Saccharomyces cere... 71 2e-12
gi|19112991|ref|NP_596199.1| (NC_003423) putative DNA misma... 70 5e-12
gi|13517948|gb|AAK29067.1|AF346620_1 (AF346620) MLH1 [Trypa... 57 3e-08
gi|16272041|ref|NP_438240.1| (NC_000907) DNA mismatch repai... 54 3e-07
gi|19173567|ref|NP_597370.1| (NC_003232) DNA MISMATCH REPAI... 52 9e-07
gi|13543339|gb|AAH05833.1|AAH05833 (BC005833) Similar to mu... 50 5e-06
gi|15602769|ref|NP_245841.1| (NC_002663) MutL [Pasteurella ... 50 6e-06
gi|15642797|ref|NP_227838.1| (NC_000853) DNA mismatch repai... 48 2e-05
>gi|4557757|ref|NP_000240.1| (NM_000249) mutL homolog 1; mutL (E. coli) homolog 1; coli) homolog 1 (colon cancer, nonpolyposis type 2) [Homo sapiens]
 gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair protein Mlh1 (MutL protein homolog 1)
 gi|631299|pir||S43085 DNA mismatch repair protein MLH1 - human
 gi|463989|gb|AAC50285.1| (U07343) hMLH1 [Homo sapiens]
 gi|1079787|gb|AAA82079.1| (U40978) DNA mismatch repair protein homolog [Homo sapiens]
 gi|13905126|gb|AAH06850.1|AAH06850 (BC006850) mutL (E. coli) homolog 1 type 2) [Homo sapiens]
 gi|741682|prf||2007430A DNA mismatch repair protein [Homo sapiens]
 Length = 756
 Score = 233 bits (593), Expect = 4e-61
 Identities = 117/131 (89%), Positives = 117/131 (89%)
Query: 1   IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 60
           IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL
Sbjct: 276 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 335
tblastn Results Against ESTs
>gi|12794555|emb|AL531062.1|AL531062 AL531062 LTI_NFL001_NBC4 Homo sapiens cDNA clone CS0DM005YM23 5 prime.
 Length = 878
 Score = 167 bits (422), Expect(3) = 1e-42
 Identities = 81/82 (98%), Positives = 81/82 (98%)
 Frame = +2
Query: 1   IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 60
           IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL
Sbjct: 512 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 691
Query: 61  GSNSSRMYFTQTLLPGLAGPSG 82
           GSNSSRMYFTQTLLPGLAGP G
Sbjct: 692 GSNSSRMYFTQTLLPGLAGPLG 757
 Score = 24.3 bits (51), Expect(3) = 1e-42
 Identities = 11/26 (42%), Positives = 11/26 (42%)
 Frame = +1
Query: 80  PSGEMVKXXXXXXXXXXXXXXDKVYA 105
           PSG MVK              DKVYA
Sbjct: 748 PSG*MVKSTTSLTSSSTSGSSDKVYA 825
[Annotation on Expect(3): combined Expect for hits to multiple frames]
BLAST Tips
• It is faster and more accurate to BLAST proteins (blastp) rather than nucleotides.
• If in doubt, use blastp.
• When possible, restrict to the subset of the database you are interested in.
• Look around for the database you need or create your own custom BLAST database. BUT HOW???
• When is the best time to use the BLAST server?
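On the "BUT HOW???" for custom databases: with the current NCBI BLAST+ command-line suite (which postdates these slides; the legacy tool was formatdb), a database is built from a FASTA file with makeblastdb and then searched with blastp. The sketch below only assembles the command lines and runs them if the programs are installed. The helper names and the file names my_proteins.fasta, my_db, and query.fasta are hypothetical.

```python
import shutil
import subprocess


def makeblastdb_command(fasta_path, db_name, dbtype="prot"):
    """Build the BLAST+ command that formats a FASTA file as a database."""
    if dbtype not in ("prot", "nucl"):
        raise ValueError("dbtype must be 'prot' or 'nucl'")
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype, "-out", db_name]


def blastp_command(query_path, db_name, evalue=1e-4):
    """Build a blastp search against the custom database (tabular output)."""
    return ["blastp", "-query", query_path, "-db", db_name,
            "-evalue", str(evalue), "-outfmt", "6"]


def run_if_installed(command):
    """Run the command only if its executable is on PATH; otherwise return None."""
    if shutil.which(command[0]) is None:
        return None
    return subprocess.run(command, check=True)


# Hypothetical file names, for illustration only.
build = makeblastdb_command("my_proteins.fasta", "my_db")
search = blastp_command("query.fasta", "my_db")
```

Restricting the search to a small custom database also reduces the search space, which lowers E-values for the same alignment score.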
Asking Biological Problems with BLAST
What you want to do / General (but more complicated) computational method / Using BLAST
• Finding genes in a genome
– General method: run gene prediction software or an ORF finder (for bacteria).
– Using BLAST: cut your genome sequence into small (2-5 kb) overlapping sequences. Use blastx to BLAST each piece of genome against NR (nonredundant protein db). Works better for sequences with no introns (bacteria).
• Predicting protein function
– General method: domain analysis or wet-lab experimentation.
– Using BLAST: use blastp to BLAST your protein sequence against SWISS-PROT (future = UniProt). If you get a good hit (more than 25% identity) over the complete length of the protein, then your protein likely has the same function as the SWISS-PROT protein.
• Predicting protein 3-D structure
– General method: homology modeling, X-ray, or NMR analysis of the protein of interest.
– Using BLAST: use blastp to BLAST your protein against PDB (protein structure DB). If you get a hit with >25% identity, then your protein and the good hit(s) likely have a similar 3-D structure.
• Finding protein family members
– General method: clone new family members using PCR techniques.
– Using BLAST: use blastp (or better, PSI-BLAST) and run against NR (nonredundant protein db). Once you have all members of the family, you can build a multiple sequence alignment and a phylogenetic tree.
(Claverie & Notredame 2003)
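The "cut your genome into overlapping pieces" step from the gene-finding entry above can be sketched as follows. The 5000-base chunk size sits in the slide's 2-5 kb range; the 500-base overlap and the function name are assumptions.

```python
def overlapping_chunks(sequence, size=5000, overlap=500):
    """Split a sequence into (start, chunk) pieces that overlap their neighbors.

    start is the 0-based offset of the chunk in the original sequence,
    so blastx hits can be mapped back to genome coordinates.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not sequence:
        return []
    step = size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append((start, sequence[start:start + size]))
        if start + size >= len(sequence):
            break  # this chunk reached the end of the sequence
        start += step
    return chunks
```

The overlap keeps a gene that straddles a cut point intact in at least one piece, provided the gene is shorter than the overlap; longer genes will still be split across pieces.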
BLAST and PSI-BLAST Servers on the Internet
Country / Program / URL
• USA / BLAST, PSI-BLAST / http://www.ncbi.nlm.nih.gov/BLAST
• USA / BLAST / http://genome.wustl.edu/gsc/BLAST
• Europe / BLAST / http://www.ch.embnet.org/software/bBLAST.html
• Europe / BLAST / http://www.ebi.ac.uk/blast2/
• Japan / BLAST, PSI-BLAST / http://www.ddbj.nig.ac.jp/E-mail/homology.html
Alternative Methods for Homology Searches
• Smith-Waterman (ssearch): slower but more accurate
• FASTA: slower than BLAST, but more accurate when making DNA comparisons
• BLAT: for locating cDNA in a genome or finding close proteins in a genome
Common Mistake
• Seq1 has domains A and B; Seq2 has domain A and Seq3 has domain B.
• Use Seq1 as the query sequence.
• What happens? Both hits may have very significant (very low) E-values if domains A and B are long and well conserved.
• Seq1 is homologous to Seq2 and Seq3, but remember that Seq1 is not homologous over its entire length to either of them.
• Don't depend on the E-value alone.
• “BLAST hits are not transitive, unless the alignments are overlapping”
• Most proteins have more than one domain, so be careful when looking at BLAST results: not all reported hits belong to the same big family.
Sequence 1: AAAAAABBBBBB
Sequence 2: AAAAAA
Sequence 3: BBBBBB
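The "overlapping alignments" condition can be checked directly from the query coordinates of two hits. The sketch below assumes 1-based inclusive (start, end) ranges on the query, as printed in BLAST alignments; the function name and the `min_overlap` parameter are hypothetical.

```python
def query_ranges_overlap(hit_a, hit_b, min_overlap=1):
    """Return True if two hits cover a shared region of the query.

    Each hit is a (start, end) pair on the query, 1-based and inclusive.
    """
    (a_start, a_end), (b_start, b_end) = hit_a, hit_b
    shared = min(a_end, b_end) - max(a_start, b_start) + 1
    return shared >= min_overlap


# Seq1 = AAAAAABBBBBB as query: Seq2 aligns to positions 1-6, Seq3 to 7-12.
seq2_on_query = (1, 6)
seq3_on_query = (7, 12)
related_by_same_region = query_ranges_overlap(seq2_on_query, seq3_on_query)
```

Here the two hits share no query positions, so nothing about a relationship between Seq2 and Seq3 can be concluded from this search, however low both E-values are.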
Common Questions
• When I run a BLAST job using WU-BLAST vs NCBI BLAST with the same query sequence, I get different results. Both are based on the same algorithm but are different implementations. So why the difference?
Usually this is due to slight variation in the database version, but differences in BLAST program version also play a minor role. Usually the results do not change dramatically, but they do change a bit.
Self Guided Exercises - BLAST
• If you need further help on BLAST, first READ, then try the problem set.
• BLAST Course: http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
• BLAST Tutorial: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html
• BLAST Quick Start (click on P for the problem set): http://www.ncbi.nlm.nih.gov/Class/minicourses/blastex2.html