Sequence Database Searching
Computing in Molecular Biology
Hugues Sicotte
National Center for Biotechnology Information
Alignment methods
[Figure: dot plot with axes "Query sequence" and "Subject sequence"]
A sequence alignment can be represented using a dot plot.
For a query of N letters against a subject sequence of M letters, it requires MxN comparisons.
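The MxN cost of a dot plot can be seen in a minimal sketch like the one below. The function names `dot_plot` and `render` and the two example sequences are illustrative assumptions, not part of the original slides.

```python
def dot_plot(query, subject):
    """Return an M x N grid: grid[j][i] is True when subject[j] == query[i].

    Every cell is filled by one letter comparison, so the grid costs M*N comparisons.
    """
    return [[s == q for q in query] for s in subject]


def render(query, subject):
    """Draw the grid as text, query across the top and subject down the side."""
    lines = ["  " + query]
    for letter, row in zip(subject, dot_plot(query, subject)):
        lines.append(letter + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


# Hypothetical example sequences; a shared region shows up as a diagonal of '*'.
print(render("VISWASHERE", "GGWASHEREGG"))
```

A run of matches appears as a diagonal line, which is what the alignment methods in the following slides try to find without filling in every cell.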
HASHING METHODS
Hashing is a common method for accelerating database searches.
Query sequence: MLIIKRDELVISWASHERE
All overlapping words of size 3:
MLI LII IIK IKR KRD RDE DEL ELV LVI VIS ISW SWA WAS ASH SHE HER ERE
Compile a “dictionary” of words from the query sequence. Put each word in a look-up table that points to its original position in the sequence. Thus, given one word, you can tell whether it is in the query in a single operation.
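As a rough sketch of this dictionary, a Python dict can map each word of size 3 to the positions where it starts in the query. The function name `build_word_table` is an assumption made for illustration; the query is the one shown on the slide.

```python
from collections import defaultdict


def build_word_table(query, k=3):
    """Map every overlapping word of size k to the list of its start positions."""
    table = defaultdict(list)
    for pos in range(len(query) - k + 1):
        table[query[pos:pos + k]].append(pos)
    return dict(table)


table = build_word_table("MLIIKRDELVISWASHERE")
print(len(table))         # number of distinct words in the query
print(table.get("WAS"))   # positions (0-based) where WAS starts
print("QQQ" in table)     # a single lookup tells whether a word is in the query
```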
Index lookup
Each word is assigned a unique integer.
E.g. for a word of 3 letters made up of an alphabet of 20 letters:
1. Assign a code to each letter, Code(l) (0 to 19).
2. For a word of 3 letters L1 L2 L3 the code is
index = Code(L1)*20^2 + Code(L2)*20^1 + Code(L3)
3. Keep an array, indexed by this code, that lists the positions in the query sequence where that word occurs.
[Figure: array indexed by word code, each entry pointing to the positions of that word in the query sequence]
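The three steps above can be sketched as follows. The letter order in `ALPHABET` is an arbitrary assumption (any fixed assignment of 0 to 19 works), and the names `word_index` and `build_index_array` are hypothetical.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # assumed order; only consistency matters
CODE = {letter: i for i, letter in enumerate(ALPHABET)}


def word_index(word):
    """Code(L1)*20^2 + Code(L2)*20^1 + Code(L3), written as a base-20 number."""
    index = 0
    for letter in word:
        index = index * 20 + CODE[letter]
    return index


def build_index_array(query, k=3):
    """Array of 20^k lists; entry i holds the query positions of the word with code i."""
    positions = [[] for _ in range(20 ** k)]
    for pos in range(len(query) - k + 1):
        positions[word_index(query[pos:pos + k])].append(pos)
    return positions


positions = build_index_array("MLIIKRDELVISWASHERE")
print(word_index("WAS"), positions[word_index("WAS")])
```

An array lookup avoids hashing strings altogether, at the cost of 20^3 = 8000 entries for words of size 3.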
Building the dictionary for the query sequence requires (N-2) operations.
The database contains (M-2) words, and it takes only one operation to see whether each word is in the query.
[Figure: matrix with axes "Query sequence" and "Subject sequence"]
Scan the subject, looking up words in the dictionary.
Use word hits to determine where to search for alignments.
This fills the dynamic programming matrix in (N-2)+(M-2) operations instead of MxN.
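A minimal sketch of this scan, assuming words of size 3 and exact word matches only. The function name `find_word_hits` and the subject sequence are made up for illustration.

```python
def find_word_hits(query, subject, k=3):
    """Return (query_pos, subject_pos, diagonal) for every shared word of size k."""
    # Pass 1: dictionary of query words, about N-2 operations for k = 3.
    table = {}
    for qpos in range(len(query) - k + 1):
        table.setdefault(query[qpos:qpos + k], []).append(qpos)
    # Pass 2: one lookup per subject word, about M-2 operations for k = 3.
    hits = []
    for spos in range(len(subject) - k + 1):
        for qpos in table.get(subject[spos:spos + k], ()):
            hits.append((qpos, spos, spos - qpos))
    return hits


# Hypothetical subject sequence that contains part of the query.
for hit in find_word_hits("MLIIKRDELVISWASHERE", "GGWASHEREGG"):
    print(hit)
```

Hits that share the same diagonal (subject position minus query position) lie on one line of the matrix, which marks where an alignment is worth looking for.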
[Figure: matrix with axes "Query sequence" and "Subject sequence"]
Scan the database, looking up words in the dictionary.
Use word hits to determine where to search for alignments.
FASTA searches in a band.
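One way to picture the band is to count word hits per diagonal and keep a fixed-width strip around the busiest one. This is a simplified sketch under that assumption, not a description of FASTA's full procedure; `best_band`, `half_width` and the example subject are hypothetical.

```python
from collections import Counter


def best_band(query, subject, k=2, half_width=8):
    """Return (low, high) diagonals of a band centred on the diagonal with most word hits."""
    table = {}
    for qpos in range(len(query) - k + 1):
        table.setdefault(query[qpos:qpos + k], []).append(qpos)
    diagonal_counts = Counter()
    for spos in range(len(subject) - k + 1):
        for qpos in table.get(subject[spos:spos + k], ()):
            diagonal_counts[spos - qpos] += 1
    if not diagonal_counts:
        return None  # no shared words, nothing to align
    best_diagonal, _ = diagonal_counts.most_common(1)[0]
    return best_diagonal - half_width, best_diagonal + half_width


print(best_band("MLIIKRDELVISWASHERE", "GGWASHEREGG"))
```

Dynamic programming restricted to such a band touches only a strip of the matrix rather than all MxN cells.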
[Figure: matrix with axes "Query sequence" and "Database sequence"]
BLAST extends from word hits.
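Extension from a word hit can be sketched as an ungapped walk in both directions that stops once the score drops too far below the best seen so far. The match/mismatch scores, the `x_drop` cutoff and the name `extend_hit` are illustrative assumptions; real BLAST scoring uses substitution matrices and further heuristics.

```python
def extend_hit(query, subject, qpos, spos, k=3, match=1, mismatch=-1, x_drop=3):
    """Extend an exact word hit without gaps.

    Returns (score, query_start, query_end, subject_start) of the best extension.
    """
    def extend(qi, si, step):
        best = score = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            score += match if query[qi] == subject[si] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break  # fell too far below the best score: give up
            qi += step
            si += step
        return best, best_len

    right_score, right_len = extend(qpos + k, spos + k, 1)
    left_score, left_len = extend(qpos - 1, spos - 1, -1)
    score = k * match + left_score + right_score
    return score, qpos - left_len, qpos + k + right_len, spos - left_len


query, subject = "MLIIKRDELVISWASHERE", "GGWASHEREGG"  # hypothetical subject
score, q_start, q_end, s_start = extend_hit(query, subject, 12, 2)
print(score, query[q_start:q_end])
```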
Database Search Space
[Figure: matrix with axes "Query sequence" and "Concatenated database sequence"]
The simplest database search can be viewed as one large dynamic programming problem, with all the database sequences concatenated one after another.
Which alignment is more significant?
A score can be used to judge alignments, but the absolute value of a score is a function of the scoring parameters.
Match=+1, Mismatch=-1, Gap_open=5, gap_extend=1
yields the same alignments as
Match=+10, Mismatch=-10, Gap_open=50, gap_extend=10
Scores are useful for relative ranking.
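The effect of rescaling the parameters can be checked with a small scorer for alignments written as two gapped strings. This sketch assumes a gap of length L costs gap_open + L*gap_extend (conventions differ between programs); `alignment_score` and the example alignments are hypothetical.

```python
def alignment_score(a, b, match, mismatch, gap_open, gap_extend):
    """Score a pairwise alignment given as two equal-length strings, '-' marking gaps."""
    if len(a) != len(b):
        raise ValueError("aligned strings must have equal length")
    score = 0
    in_gap = False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            if not in_gap:
                score -= gap_open   # charged once per gap
            score -= gap_extend     # charged for every gap column
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score


small = dict(match=1, mismatch=-1, gap_open=5, gap_extend=1)
large = dict(match=10, mismatch=-10, gap_open=50, gap_extend=10)
candidates = [("MLIIKRDELV", "MLIIKRDELV"), ("MLIIKRDELV", "MLI-KRDQLV")]
for a, b in candidates:
    print(alignment_score(a, b, **small), alignment_score(a, b, **large))
```

Every candidate alignment scores exactly ten times higher under the second parameter set, so the ranking of alignments, and therefore the best alignment, does not change.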
To judge the relevance of an alignment, one needs to judge whether the match is significant.
E-value = Expect(S) is a function of the score, the database size and composition, and the query size.
It is the number of alignments with scores >= S expected if the query were a random sequence, given the database size and composition.
An Expect of 0.0 means a very good match that is unlikely to be random.
DATABASE SEARCHING
Compare one query sequence against an entire database.
A typical search has four basic elements:
> fasta myquery swissprot -ktup 2
search program: fasta
query sequence: myquery
sequence database: swissprot
optional parameters: -ktup 2
With exponential database growth, searches keep taking more time
> fasta myquery swissprot -ktup 2
searching . . . . . .
E-value
“Hits” can be sorted according to their E-value or their score.
The E-value is better known as the EXPECT value and is a function of score, database size and query sequence length.
E-value: the number of alignments with a score >= S that you expect to find if the database were a collection of random letters.
E.g. a score of 1 requires only 1 match, so there should be an enormous number of such alignments. One expects to find fewer alignments with a score of 5, and so on. Eventually, when the score is big enough, one expects to find only an insignificant number of alignments that could be due to chance.
E-values of less than 1e-6 (1 x 10^-6 in scientific notation) are usually very good, and for proteins E < 1e-2 is usually considered significant. A hit with E > 1 can still be biologically meaningful, but more analysis is required to confirm that.
Even for VERY good hits, it is possible that the hit is due to a biological artifact (sequencing/cloning vector, repeats, low-complexity sequence…)
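The dependence of the E-value on score, database size and query length can be illustrated with the Karlin-Altschul form used in BLAST statistics, E = K * m * n * exp(-lambda * S), where m is the query length and n the database length. The values of `lam` and `k` below are placeholders chosen for illustration (the real ones depend on the scoring system), and other programs may estimate E differently.

```python
import math


def expect_value(score, query_len, db_len, lam=0.267, k=0.041):
    """E = K * m * n * exp(-lambda * S); lam and k are assumed example values."""
    return k * query_len * db_len * math.exp(-lam * score)


# Hypothetical search: a 150-letter query against a database of 30 million letters.
for score in (20, 40, 60, 80, 100):
    print(score, "%.2g" % expect_value(score, 150, 30_000_000))
```

The expected number of chance alignments falls exponentially as the score grows and rises in proportion to the database size, which is why the same score becomes less significant as databases get larger.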
DATABASE SEARCHING
The “hit list” gives titles and scores for matched sequences.
> fasta myquery swissprot -ktup 2
The best scores are: initn init1 opt z-sc E(77110)
gi|1706794|sp|P49789|FHIT_HUMAN BIS(5'-ADENOSYL)- 996 996 996 1262.1 0
gi|1703339|sp|P49776|APH1_SCHPO BIS(5'-NUCLEOSYL) 412 382 395 507.6 1.4e-21
gi|1723425|sp|P49775|HNT2_YEAST HIT FAMILY PROTEI 238 133 316 407.4 5.4e-16
gi|3915958|sp|Q58276|Y866_METJA HYPOTHETICAL HIT- 153 98 190 253.1 2.1e-07
gi|3916020|sp|Q11066|YHIT_MYCTU HYPOTHETICAL 15.7 163 163 184 244.8 6.1e-07
gi|3023940|sp|O07513|HIT_BACSU HIT PROTEIN 164 164 170 227.2 5.8e-06
gi|2506515|sp|Q04344|HNT1_YEAST HIT FAMILY PROTEI 130 91 157 210.3 5.1e-05
gi|2495235|sp|P75504|YHIT_MYCPN HYPOTHETICAL 16.1 125 125 148 199.7 0.0002
gi|418447|sp|P32084|YHIT_SYNP7 HYPOTHETICAL 12.4 42 42 140 191.3 0.00058
gi|3025190|sp|P94252|YHIT_BORBU HYPOTHETICAL 15.9 128 73 139 188.7 0.00082
gi|1351828|sp|P47378|YHIT_MYCGE HYPOTHETICAL HIT- 76 76 133 181.0 0.0022
gi|418446|sp|P32083|YHIT_MYCHR HYPOTHETICAL 13.1 27 27 119 165.2 0.017
gi|1708543|sp|P49773|IPK1_HUMAN HINT PROTEIN (PRO 66 66 118 163.0 0.022
gi|2495231|sp|P70349|IPK1_MOUSE HINT PROTEIN (PRO 65 65 116 160.5 0.03
gi|1724020|sp|P49774|YHIT_MYCLE HYPOTHETICAL HIT- 52 52 117 160.3 0.031
gi|1170581|sp|P16436|IPK1_BOVIN HINT PROTEIN (PRO 66 66 115 159.3 0.035
gi|2495232|sp|P80912|IPK1_RABIT HINT PROTEIN (PRO 66 66 112 155.5 0.057
gi|1177047|sp|P42856|ZB14_MAIZE 14 KD ZINC-BINDIN 73 73 112 155.4 0.058
gi|1177046|sp|P42855|ZB14_BRAJU 14 KD ZINC-BINDIN 76 76 110 153.8 0.072
gi|1169825|sp|P31764|GAL7_HAEIN GALACTOSE-1-PHOSP 58 58 104 138.5 0.51
gi|113999|sp|P16550|APA1_YEAST 5',5'''-P-1,P-4-TE 47 47 103 137.8 0.56
gi|1351948|sp|P49348|APA2_KLULA 5',5'''-P-1,P-4-T 63 63 98 131.3 1.3
gi|123331|sp|P23228|HMCS_CHICK HYDROXYMETHYLGLUTA 58 58 99 129.4 1.6
gi|1170899|sp|P06994|MDH_ECOLI MALATE DEHYDROGENA 70 48 91 122.9 3.7
gi|3915666|sp|Q10798|DXR_MYCTU 1-DEOXY-D-XYLULOSE 75 50 92 121.9 4.3
gi|124341|sp|P05113|IL5_HUMAN INTERLEUKIN-5 PRECU 36 36 85 121.3 4.7
gi|1170538|sp|P46685|IL5_CERTO INTERLEUKIN-5 PREC 36 36 84 120.0 5.5
gi|121369|sp|P15124|GLNA_METCA GLUTAMINE SYNTHETA 45 45 90 118.9 6.3
gi|2506868|sp|P33937|NAPA_ECOLI PERIPLASMIC NITRA 48 48 92 117.4 7.6
gi|119377|sp|P10403|ENV1_DROME RETROVIRUS-RELATED 59 59 89 117.0 8
gi|1351041|sp|P48415|SC16_YEAST MULTIDOMAIN VESIC 48 48 97 117.0 8
gi|4033418|sp|O67501|IPYR_AQUAE INORGANIC PYROPHO 38 38 83 116.8 8.3
Detailed alignments are shown farther down in the output.
> fasta myquery swissprot -ktup 2
>>gi|1703339|sp|P49776|APH1_SCHPO BIS(5'-NUCLEOSYL)-TETR (182 aa)
initn: 412 init1: 382 opt: 395 z-score: 507.6 E(): 1.4e-21
Smith-Waterman score: 395; 52.3% identity in 109 aa overlap

10 20 30 40 50
gi|170 MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLVCPLRPVERFHDLRPDEVADLF
: X: .:.:: :.:: ::..:::::: : : : :..:: :.:..:::
gi|170 MPKQLYFSKFPVGSQVFYRTKLSAAFVNLKPILPGHVLVIPQRAVPRLKDLTPSELTDLF
10 20 30 40 50 60

60 70 80 90 100 110
gi|170 QTTQRVGTVVEKHFHGTSLTFSMQDGPEAGQTVKHVHVHVLPRKAGDFHRNDSIYEELQK
....: :.:: : ... ....::: .::::: :::::..::: .:: .:: .: :X.:
gi|170 TSVRKVQQVIEKVFSASASNIGIQDGVDAGQTVPHVHVHIIPRKKADFSENDLVYSELEK
70 80 90 100 110 120

120 130 140
gi|170 HDKEDFPASWRSEEEMAAEAAALRVYFQ
..
gi|170 NEGNLASLYLTGNERYAGDERPPTSMRQAIPKDEDRKPRTLEEMEKEAQWLKGYFSEEQE
130 140 150 160 170 180

>>gi|1723425|sp|P49775|HNT2_YEAST HIT FAMILY PROTEIN 2 (217 aa)
initn: 238 init1: 133 opt: 316 z-score: 407.4 E(): 5.4e-16
Smith-Waterman score: 316; 37.4% identity in 131 aa overlap

10 20 30 40
gi|170 MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLVCPLRP-VER
:.. :. .v^: :.. ..:::: ::.::::::. ::X :
Database Search Space
[Figure: matrix with axes "Query sequence" and "Concatenated database sequence"]
Some matches are not meaningful because they occur VERY often in the database, e.g.:
nucleotide AAA (from polyA)
biological repeated elements (retroposons, ALU)
low-complexity repeated patterns (CAGCAG, QQQ, KKK, …)
These elements should be FILTERED or MASKED to avoid generating false ‘hits’. It is ‘OK’ to align through them if they are near meaningful diagonal ‘hits’.
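A very crude form of masking replaces long runs of a single letter before the sequence is used as a query. This is only a sketch: it catches homopolymer runs such as polyA or KKKK but not repeats like CAGCAG or Alu elements, and it is not how dedicated filtering programs work. The name `mask_homopolymers` and the `min_run` threshold are assumptions.

```python
import re


def mask_homopolymers(seq, min_run=4, mask_char="X"):
    """Replace every run of min_run or more identical letters with mask_char."""
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), seq)


# Protein example with runs of R and K; nucleotide example with a polyA tail.
print(mask_homopolymers("MKKRVTNRERHWTHRRRRQRTRKKKKKKKRVLGRRALG"))
print(mask_homopolymers("ACGTTGCAAAAAAAAAA", mask_char="N"))
```

Masked positions no longer produce word hits, so they cannot seed an alignment on their own.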
Score and Statistics
Some amino acid mutations do not affect structure/function very much. Amino acids with similar physico-chemical and steric properties can often replace each other.
Use a scoring system that does not heavily penalize mutations to similar amino acids.
PAM matrices: Point Accepted Mutations. Defined in terms of a divergence of 1 percent (1 PAM). For distant sequences use PAM250, while for closer sequences (like DNA) use PAM100. Some sites accumulate mutations and others don’t, so using the PAM100 matrix doesn’t mean that the sequences compared were 100% mutated.
BLOSUM: BLOCKS substitution matrices. Derived from the BLOCKS database of multiple alignments, using only distant sequences. BLOSUM62 means that the proteins compared were never closer than 62% identity. The BLOSUM50 matrix was built from alignments of more distant sequences. BLOSUM matrices (BLOSUM62) are recommended for most protein alignments.
SCORING SYSTEMS
BLOSUM62 (Figure 7.8)
A 4
R -1 5
N -2 0 6
D -2 -2 1 6
C 0 -3 -3 -3 9
Q -1 1 0 0 -3 5
E -1 0 0 2 -4 2 5
G 0 -2 0 -1 -3 -2 -2 6
H -2 0 1 -1 -3 0 0 -2 8
I -1 -3 -3 -3 -1 -3 -3 -4 -3 4
L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4
K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5
M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5
F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7
S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4
T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7
V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4
A R N D C Q E G H I L K M F P S T W Y V
Some amino acid substitutions are more common than others
Substitution scores come from an odds ratio based on measured substitution rates
Identities get positive scores, but some are better than others
Some non-identities have positive scores, but most are negative
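The triangular BLOSUM62 table above can be loaded into a symmetric lookup and used to score an ungapped alignment. The helper names `load_triangle` and `ungapped_score` are hypothetical; the numbers are copied from the table.

```python
BLOSUM62_TRIANGLE = """\
A 4
R -1 5
N -2 0 6
D -2 -2 1 6
C 0 -3 -3 -3 9
Q -1 1 0 0 -3 5
E -1 0 0 2 -4 2 5
G 0 -2 0 -1 -3 -2 -2 6
H -2 0 1 -1 -3 0 0 -2 8
I -1 -3 -3 -3 -1 -3 -3 -4 -3 4
L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4
K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5
M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5
F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7
S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4
T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7
V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4
"""


def load_triangle(text):
    """Parse a lower-triangular matrix into a dict keyed by (letter, letter)."""
    letters, scores = [], {}
    for line in text.strip().splitlines():
        fields = line.split()
        row = fields[0]
        letters.append(row)
        # Row i lists scores against letters 0..i, in the order the rows appear.
        for col, value in zip(letters, fields[1:]):
            scores[(row, col)] = scores[(col, row)] = int(value)
    return scores


def ungapped_score(a, b, scores):
    """Sum of substitution scores over aligned positions (no gaps)."""
    return sum(scores[(x, y)] for x, y in zip(a, b))


BLOSUM62 = load_triangle(BLOSUM62_TRIANGLE)
print(BLOSUM62[("W", "W")], BLOSUM62[("A", "A")])   # identities differ in value
print(BLOSUM62[("I", "V")], BLOSUM62[("W", "D")])   # similar vs dissimilar pair
print(ungapped_score("WASHERE", "WASHERE", BLOSUM62))
```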
BLAST and BLAST2SEQUENCES
BLAST is a database search engine that uses hashing to accelerate the search.
blastn (nucleotide query against nucleotide database)
blastp (protein query against protein database)
blastx (nucleotide query against protein database) - translates a nucleotide query in all 6 reading frames and compares it to a protein database.
tblastn (protein query against nucleotide database) - compares a protein against a nucleotide database translated in all 6 reading frames.
tblastx (nucleotide query against nucleotide database) - compares a nucleotide sequence against a nucleotide database by translating the query and database in all 6 reading frames. Very slow!
A pairwise alignment implementation of this program is available at:
http://www.ncbi.nlm.nih.gov/gorf/bl2.html
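The six-frame translation that blastx, tblastn and tblastx rely on can be sketched as below, assuming the standard genetic code and an unambiguous A/C/G/T sequence. The function names are illustrative, and codons containing any other letter are translated to X.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return translations for frames +1, +2, +3 and -1, -2, -3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[strand + str(offset + 1)] = translate(seq[offset:])
    return frames


for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```

Each nucleotide sequence yields six protein sequences to compare, and tblastx does this for both the query and the database, which is consistent with the "Very slow!" note above.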
Protein BLAST databases
nr All non-redundant GenBank CDS translations + PDB + SwissProt + PIR + PRF
month All new or revised GenBank CDS translations + PDB + SwissProt + PIR + PRF released in the last 30 days.
swissprot Last major release of the SWISS-PROT protein sequence database (no updates)
Drosophila Drosophila genome proteins provided by Celera and Berkeley Drosophila Genome Project (BDGP).
yeast Yeast (Saccharomyces cerevisiae) genomic CDS translations
ecoli Escherichia coli genomic CDS translations
pdb Sequences derived from the 3-dimensional structure from Brookhaven Protein Data Bank
kabat [kabatpro] Kabat's database of sequences of immunological interest
alu Translations of select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences.
Nucleotide BLAST databases
nr All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences). No longer "non-redundant".
month All new or revised GenBank+EMBL+DDBJ+PDB sequences released in the last 30 days.
Drosophila genome Drosophila genome provided by Celera and Berkeley Drosophila Genome Project (BDGP).
dbest Database of GenBank+EMBL+DDBJ sequences from EST Divisions
dbsts Database of GenBank+EMBL+DDBJ sequences from STS Divisions
htgs Unfinished High Throughput Genomic Sequences: phases 0, 1 and 2 (finished, phase 3 HTG sequences are in nr)
gss Genome Survey Sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences.
yeast Yeast (Saccharomyces cerevisiae) genomic nucleotide sequences
E. coli Escherichia coli genomic nucleotide sequences
pdb Sequences derived from the 3-dimensional structure from Brookhaven Protein Data Bank
kabat [kabatnuc] Kabat's database of sequences of immunological interest
vector Vector subset of GenBank(R), NCBI, in ftp://ncbi.nlm.nih.gov/blast/db/
mito Database of mitochondrial sequences
alu Select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences.
epd Eukaryotic Promoter Database, found on the web at http://www.genome.ad.jp/dbget-bin/www_bfind?epd
BLASTN SEARCH (M29204)
Search Nucleotide sequence M29204 against nr.
http://www.ncbi.nlm.nih.gov/blast/blast.cgi?Jform=1
BLASTP and filtering
Search using blastp against nr with filtering ON (default), then with filtering OFF.
>GCF
MKKRVTNRERHWTHRRRRQRTRKKKKKKKRVLGRRALGPRPWLTGRKGLFGSARLIPATA
BLASTN vs BLASTX
Search U15595 using blastn against nr (nucleotide).
Now search using blastx against nr (protein).
Now search using blastx against ALU.
TBLASTX against dbEST
Search tblastx against dbEST.
Picks up homologs based on protein homology of translations.
>OCRL-selected mRNA, partial sequence
TTGAACATCATGAAACATGAGGTTGTCATTTGGTTGGGAGATTTGAATTATAGACTTTGCATGCCTGATGCCAATGAGGTGAAAAGTCTTATTAATAAGAAAGACCTTCAGAGACTCTTGAAATTCGACCAGCTAAATATTCAGCGCACACAGAAAAAAGCTTTTGTTGACTTCAATGAAGGGGAAATCAAGTTCATCCCCACTTATAAGTATGACTCTAA
Prosite search
Search Prosite for NP_000271 (Pax6a)
http://www.expasy.ch/prosite
PHI-Blast search
Search the Prosite db using NCBI’s PHI-BLAST (Pattern-Hit Initiated BLAST) with the pattern for Pax6a.
[LIVMFYG]-[ASLVR]-X(2)-[LIVMSTACN]-X-(4)-[LIV]-[RKNQESTAIY]-[LIVFSTNKH]-W
-e 2e-14
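Patterns written in this style can be turned into regular expressions for a quick local check. The converter below is a minimal sketch that assumes each element is a single letter, x, a [..] set or a {..} exclusion, optionally followed by a repeat count written as x(4) or x(2,4); `prosite_to_regex` and the short test pattern are hypothetical and are not the Pax6a pattern.

```python
import re

ELEMENT = re.compile(r"([A-Za-z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a hyphen-separated PROSITE-style pattern to a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported element: " + element)
        body, low, high = match.groups()
        if body in ("x", "X"):
            body = "."                        # any residue
        elif body.startswith("{"):
            body = "[^" + body[1:-1] + "]"    # any residue except these
        if low is not None:
            body += "{" + low + ("," + high if high else "") + "}"
        parts.append(body)
    return "".join(parts)


regex = prosite_to_regex("[LIV]-x(2)-W")      # hypothetical toy pattern
print(regex)
print(re.search(regex, "AALKRWAA"))
```

A plain regex search only reports where the pattern occurs; PHI-BLAST additionally uses such occurrences as seeds for alignment.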
PSI-Blast search
Search AB026911 using PSI-BLAST (at NCBI).
PSI stands for Position-Specific Iterated.
It modifies the scoring matrix as a function of conserved or unconserved residues in alignments.
ONLINE tutorials
http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
Details of Blast methodology.
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html
Blast usage and Tutorial
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/similarity.html
Quick overview of terminology.