Access to sequences: GenBank – a place to start and then some more... Links: embl nucleotide...

Access to sequences: GenBank – a place to start and then some more...

Links:
European Nucleotide Archive (EMBL-EBI): http://www.ebi.ac.uk/ena/
DNA Data Bank of Japan (DDBJ): http://www.ddbj.nig.ac.jp/
GenBank (NCBI): http://www.ncbi.nlm.nih.gov/

GenBank contains a wealth of many types of data,

…but the main part consists of sequences (DNA, RNA, amino acid; short fragments, genomes…).

For an annotated sample of a GenBank sequence record, see the NCBI sample record page linked after the FASTA example.

A record has lots of categories and information, but you can also view the sequence in a much more streamlined form, called FASTA format:

>gi|1293613|gb|U49845.1|SCU49845 Saccharomyces cerevisiae TCP1-beta gene, partial cds; and Axl2p (AXL2) and Rev7p (REV7) genes, complete cds
GATCCTCCATATACAACGGTATCTCCACCTCAGGTTTAGATCTCAACAACGGAACCATTGCCGACATGAGACAGTTAGGTATCGTCGAGAGTTACAAGCTAAAACGAGCAGTAGTCAGCTCTGCATCTGAAGCCGCTGAAGTTCTACTAAGGGTGGATAACATCATCCGTGCAAGACCAAGAACCGCCAATAGACAACATATGTAACATATTTAGGATATACCTCGAAAATAATAAACCGCCACACTGTCATTATTATAATTAGAAACAGAACGCAAAAATTATCCACTATATAATTCAAAGACGCGAAAAAAAAAGAACAACGCGTCATAGAACTTTTGGCAATTCGCGTCACAAATAAATTTTGGCAACTTATGTTTCCTCTTCGAGCAGTACTCGAGCCCTGTCTCAAGAATGTAATAATACCCATCGTAGGTATGGTTAAAGATAGCATCTCCACAACCTCAAAGCTCCTTGCCGAGAGTCGCCCTCCTTTGTCGAGTAATTTTCACTTTTCATATGAGAACTTATTTTCTTATTCTTTACTCTCACATCCTGTAGTGATTGACACTGCAACAGCCACCATCACTAGAAGAACAGAACAATTACTTAATAGAAAAATTATATCTTCCTCGAAACGATTTCCTGCTTCCAACATCTACGTATATCAAGAAGCATTCACTTACCATGACACAGCTTCAGATTTCATTATTGCTGACAGCTACTATATCACTACTCCATCTAGTAGTGGCCACGCCCTATGAGGCATATCCTATCGGAAAACAATACCCCCCAGTGGCAAGAGTCAATGAATCGTTTACATTTCAAATTTCCAATGATACCTATAAATCGTCTGTAGACAAGACAGCTCAAATAACATACAATTGCTTCGACTTACCGAGCTGGCTTTCGTTTGACTCTAGTTCTAGAACGTTCTCAGGTGAACCTTCTTCTGACTTACTATCTGATGCGAACACCACGTTGTATTTCAATGTAATACTCGAGGGTACGGACTCTGCCGACAGCACGTCTTTGAACAATACATACCAATTTGTTGTTACAAACCGTCCATCCATCTCGCTATCGTCAGATTTCAATCTATTGGCGTTGTTAAAAAACTATGGTTATACTAACGGCAAAAACGCTCTGAAACTAGATCCTAATGAAGTCTTCAACGTGACTTTTGACCGTTCAATGTTCACTAACGAAGAATCCATTGTGTCGTATTACGGACGTTCTCAGTTGTATAATGCGCCGTTACCCAATTGGCTGTTCTTCGATTCTGGCGAGTTGAAGTTTACTGGGACGGCACCGGTGATAAACTCGGCGATTGCTCCAGAAACAAGCTACAGTTTTGTCATCATCGCTACAGACATTGAAGGATTTTCTGCCGTTGAGGTAGAATTCGAATTAGTCATCGGGGCTCACCAGTTAACTACCTCTATTCAAAATAGTTTGATAATCAACGTTACTGACACAGGTAACGTTTCATATGACTTACCTCTAAACTATGTTTATCTCGATGACGATCCTATTTCTTCTGATAAATTGGGTTCTATAAACTTATTGGATGCTCCAGACTGGGTGGCATTAGATAATGCTACCATTTCCGGGTCTGTCCCAGATGAATTACTCGGTAAGAACTCCAATCCTGCCAATTTTTCTGTGTCCATTTATGATACTTATGGTGATGTGATTTATTTCAACTTCGAAGTTGTCTCCACAACGGATTTGTTTGCCATTAGTTCTCTTCCCAATATTAACGCTACAAGGGGTGAATGGTTCTCCTACTATTTTTTGCCTTCTCAGTTTACAGACTACGTGAATACAAACGTTTCATTAGAGTTTACTAATTCAAGCC
AAGACCATGACTGGGTGAAATTCCAATCATCTAATTTAACATTAGCTGGAGAAGTGCCCAAGAATTTCGACAAGCTTTCATTAGGTTTGAAAGCGAACCAAGGTTCACAATCTCAAGAGCTATATTTTAACATCATTGGCATGGATTCAAAGATAACTCACTCAAACCACAGTGCGAATGCAACGTCCACAAGAAGTTCTCACCACTCCACCTCAACAAGTTCTTACACATCTTCTACTTACACTGCAAAAATTTCTTCTACCTCCGCTGCTGCTACTTCTTCTGCTCCAGCAGCGCTGCCAGCAGCCAATAAAACTTCATCTCACAATAAAAAAGCAGTAGCAATTGCGTGCGGTGTTGCTATCCCATTAGGCGTTATCCTAGTAGCTCTCATTTGCTTCCTAATATTCTGGAGACGCAGAAGGGAAAATCCAGACGATGAAAACTTACCGCATGCTATTAGTGGACCTGATTTGAATAATCCTGCAAATAAACCAAATCAAGAAAACGCTACACCTTTGAACAACCCCTTTGATGATGATGCTTCCTCGTACGATGATACTTCAATAGCAAGAAGATTGGCTGCTTTGAACACTTTGAAATTGGATAACCACTCTGCCACTGAATCTGATATTTCCAGCGTGGATGAAAAGAGAGATTCTCTATCAGGTATGAATACATACAATGATCAGTTCCAATCCCAAAGTAAAGAAGAATTATTAGCAAAACCCCCAGTACAGCCTCCAGAGAGCCCGTTCTTTGACCCACAGAATAGGTCTTCTTCTGTGTATATGGATAGTGAACCAGCAGTAAATAAATCCTGGCGATATACTGGCAACCTGTCACCAGTCTCTGATATTGTCAGAGACAGTTACGGATCACAAAAAACTGTTGATACAGAAAAACTTTTCGATTTAGAAGCACCAGAGAAGGAAAAACGTACGTCAAGGGATGTCACTATGTCTTCACTGGACCCTTGGAACAGCAATATTAGCCCTTCTCCCGTAAGAAAATCAGTAACACCATCACCATATAACGTAACGAAGCATCGTAACCGCCACTTACAAAATATTCAAGACTCTCAAAGCGGTAAAAACGGAATCACTCCCACAACAATGTCAACTTCATCTTCTGACGATTTTGTTCCGGTTAAAGATGGTGAAAATTTTTGCTGGGTCCATAGCATGGAACCAGACAGAAGACCAAGTAAGAAAAGGTTAGTAGATTTTTCAAATAAGAGTAATGTCAATGTTGGTCAAGTTAAGGACATTCACGGACGCATCCCAGAAATGCTGTGATTATACGCAACGATATTTTGCTTAATTTTATTTTCCTGTTTTATTTTTTATTAGTGGTTTACAGATACCCTATATTTTATTTAGTTTTTATACTTAGAGACATTTAATTTTAATTCCATTCTTCAAATTTCATTTTTGCACTTAAAACAAAGATCCAAAAATGCTCTCGCCCTCTTCATATTGAGAATACACTCCATTCAAAATTTTGTCGTCACCGCTGATTAATTTTTCACTAAACTGATGAATAATCAAAGGCCCCACGTCAGAACCGACTAAAGAAGTGAGTTTTATTTTAGGAGGTTGAAAACCATTATTGTCTGGTAAATTTTCATCTTCTTGACATTTAACCCAGTTTGAATCCCTTTCAATTTCTGCTTTTTCCTCCAAACTATCGACCCTCCTGTTTCTGTCCAACTTATGTCCTAGTTCCAATTCGATCGCATTAATAACTGCTTCAAATGTTATTGTGTCATCGTTGACTTTAGGTAATTTCTCCAAATGCATAATCAAACTATTTAAGGAAGATCGGAATTCGTCGAACACTTCAGTTTCCGTAATGATCTGATCGTCTTTATCCACATGTTGTAATTCACTAAAATCTAAAACGTATTTTTCAATGCATAAATCGTTCTTTTTATTAATAATGCAGATGGAAAATCTGTAAACGTGCGTTAATTTAGAAAGAACATCCAGTATAAGTTC
TTCTATATAGTCAATTAAAGCAGGATGCCTATTAATGGGAACGAACTGCGGCAAGTTGAATGACTGGTAAGTAGTGTAGTCGAATGACTGAGGTGGGTATACATTTCTATAAAATAAAATCAAATTAATGTAGCATTTTAAGTATACCCTCAGCCACTTCTCTACCCATCTATTCATAAAGCTGACGCAACGATTACTATTTTTTTTTTCTTCTTGGATCTCAGTCGTCGCAAAAACGTATACCTTCTTTTTCCGACCTTTTTTTTAGCTTTCTGGAAAAGTTTATATTAGTTAAACAGGGTCTAGTCTTAGTGTGAAAGCTAGTGGTTTCGATTGACTGATATTAAGAAAGTGGAAATTAAATTAGTAGTGTAGACGTATATGCATATGTATTTCTCGCCTGTTTATGTTTCTACGTACTTTTGATTTATAGCAAGGGGAAAAGAAATACATACTATTTTTTGGTAAAGGTGAAAGCATAATGTAAAAGCTAGAATAAAATGGACGAAATAAAGAGAGGCTTAGTTCATCTTTTTTCCAAAAAGCACCCAATGATAATAACTAAAATGAAAAGGATTTGCCATCTGTCAGCAACATCAGTTGTGTGAGCAATAATAAAATCATCACCTCCGTTGCCTTTAGCGCGTTTGTCGTTTGTATCTTCCGTAATTTTAGTCTTATCAATGGGAATCATAAATTTTCCAATGAATTAGCAATTTCGTCCAATTCTTTTTGAGCTTCTTCATATTTGCTTTGGAATTCTTCGCACTTCTTTTCCCATTCATCTCTTTCTTCTTCCAAAGCAACGATCCTTCTACCCATTTGCTCAGAGTTCAAATCGGCCTCTTTCAGTTTATCCATTGCTTCCTTCAGTTTGGCTTCACTGTCTTCTAGCTGTTGTTCTAGATCCTGGTTTTTCTTGGTGTAGTTCTCATTATTAGATCTCAAGTTATTGGAGTCTTCAGCCAATTGCTTTGTATCAGACAATTGACTCTCTAACTTCTCCACTTCACTGTCGAGTTGCTCGTTTTTAGCGGACAAAGATTTAATCTCGTTTTCTTTTTCAGTGTTAGATTGCTCTAATTCTTTGAGCTGTTCTCTCAGCTCCTCATATTTTTCTTGCCATGACTCAGATTCTAATTTTAAGCTATTCAATTTCTCTTTGATC

Here the first line, introduced by ‘>’, is the header; anything after the first line break is considered to be the sequence. FASTA (or Pearson) format is the most widely used sequence format in bioinformatics!
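The header/sequence rule above can be sketched in a few lines of Python. This is a minimal illustration, not a full parser: the function name `parse_fasta` and the two toy records are made up for the example, and real files may need extra handling (comments, lowercase masking, very large inputs).

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Everything after the header line belongs to the sequence,
            # regardless of how it is wrapped across lines.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy input, invented for illustration only.
example = """>seq1 toy record (made up for illustration)
GATCCTCCAT
ATACAACGGT
>seq2 another toy record
ACGT
"""

records = parse_fasta(example)
for header, seq in records:
    print(header, len(seq))
```

Joining the wrapped lines is the important step: line breaks inside the sequence carry no meaning.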

http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html#ModificationsDateB

But first, you have to find it!

You can search by keyword (could be a name, an abbreviation...)

... or by unique identifier (‘Accession number’)

... or first restrict the search to sequences of a particular organism

... and then use a keyword.

Check the results you want to save, click ‘Display settings’, then ‘Apply’,

and copy the results into any text editor,

or click ‘Send to’, set Format to FASTA and save to wherever you want.

This way, you can also download the whole protein/nucleotide set of any particular taxonomic unit, or even the genomic sequence. Try to figure out how!
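The slides describe the web interface only. As an aside, NCBI also offers programmatic access through its E-utilities service; the sketch below only builds an `efetch` URL for a FASTA download and does not contact the server. The helper name `efetch_fasta_url` is hypothetical, and the choice of the `nuccore` database is an assumption that fits the nucleotide record shown earlier.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch endpoint.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession, db="nuccore"):
    """Build a URL that asks efetch for one record in FASTA format.

    `accession` is an accession number such as the one in the FASTA header
    above; `db` is the Entrez database name (assumed: nuccore for nucleotides).
    """
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


url = efetch_fasta_url("U49845.1")
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the same FASTA text that ‘Send to’ produces in the web interface.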

... you can also search by similarity/homology using BLAST

• set of sequence comparison algorithms (1990)
• search sequence databases for optimal local alignments to a query
• heuristic approach based on the Smith-Waterman algorithm
• finds best local alignments
• provides statistical significance
• www, standalone, and network clients

The BLAST programs (Basic Local Alignment Search Tools)

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) “Basic local alignment search tool.” J. Mol. Biol. 215:403-410.

Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.” Nucleic Acids Res. 25:3389-3402.

BLAST+

1) Choose the sequence (query)

2) Select the BLAST program

3) Choose the database to search

4) Choose optional parameters

The BLAST programs (Basic Local Alignment Search Tools)

Program    Description
blastp     Compares an amino acid query sequence against a protein sequence database.
blastn     Compares a nucleotide query sequence against a nucleotide sequence database.
blastx     Compares a nucleotide query sequence translated in all reading frames against a protein sequence database. You could use this option to find potential translation products of an unknown nucleotide sequence.
tblastn    Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
tblastx    Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.

The BLAST programs: Select the BLAST program

Program              Notes
Megablast (nucleotide only)
  Contiguous         Nearly identical sequences
  Discontiguous      Cross-species comparison
Position Specific (protein only)
  PSI-BLAST          Automatically generates a position specific score matrix (PSSM)
  RPS-BLAST          Searches a database of PSI-BLAST PSSMs

The BLAST programs: Select the BLAST program

First choose the appropriate database/algorithm: e.g. if you have an amino acid sequence and you are after proteins, use blastp (protein BLAST); if you are looking for the coding sequence, use tblastn (translated BLAST), etc.

Paste your query sequence or accession number into the query box.

Sometimes it is handy to narrow the search to a specific group of organisms.
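The choice of program follows directly from the type of the query and the type of the database, as listed in the program table. A minimal sketch of that lookup is given below; the function name `choose_blast_program` and the `translate_both` flag are hypothetical names used only for this illustration.

```python
# (query type, database type) -> program, as listed in the program table.
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_blast_program(query_type, database_type, translate_both=False):
    """Pick a BLAST program from the query and database types.

    translate_both=True selects tblastx, which compares six-frame
    translations of a nucleotide query and a nucleotide database.
    """
    if translate_both:
        if (query_type, database_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    return BLAST_PROGRAMS[(query_type, database_type)]


print(choose_blast_program("protein", "protein"))      # protein vs proteins
print(choose_blast_program("protein", "nucleotide"))   # protein vs coding sequence
```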

How does it work? BLAST Algorithm in layers

“The central idea of the BLAST algorithm is to confine attention to segment pairs that contain a word pair of length w with a score of at least T.” Altschul et al. (1990)

Three heuristic layers: seeding, extension, and evaluation

• Seeding – identify where to start alignment

• Extension – extending alignment from seeds

• Evaluation – Determine which alignments are statistically significant

BLAST Algorithm: Seeding

Compile a list of word pairs (w=3) above threshold T.

Example: for a human RBP query …FSGTWYA… (the query word is shown in red on the slide)

A list of words (w=3) is:
FSG SGT GTW TWY WYA
YSG TGT ATW SWY WFA
FTG SVT GSW TWF WYS

BLAST locates all common words in a pair of sequences, then uses them as seeds for the alignment
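A simplified sketch of the seeding step follows. It only lists the overlapping words of a query and finds exact common words between two sequences; it deliberately leaves out the neighbourhood words that score above threshold T with a substitution matrix, so it is not the full BLAST seeding procedure. The function names are hypothetical, and the two peptides are the RBP and lactoglobulin fragments used in these slides for the extension example.

```python
def words(seq, w=3):
    """All overlapping words of length w in seq, in order."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def exact_seeds(query, subject, w=3):
    """Return sorted (query_pos, subject_pos, word) for every shared word.

    Simplification: only identical words count as hits; real BLAST also
    accepts similar words scoring at least T.
    """
    index = {}
    for i, word in enumerate(words(query, w)):
        index.setdefault(word, []).append(i)
    seeds = []
    for j, word in enumerate(words(subject, w)):
        for i in index.get(word, []):
            seeds.append((i, j, word))
    return sorted(seeds)


print(words("FSGTWYA"))

rbp = "KENFDKARFSGTWYAMAKKDPEG"           # query fragment
lactoglobulin = "MKGLDIQKVAGTWYSLAMAASD"  # database fragment
print(exact_seeds(rbp, lactoglobulin))
```

Indexing the query words in a dictionary first means the subject is scanned only once, which is the basic reason word-based seeding is fast.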

Discriminating between real and artificial matches is done using an estimate of probability that the match might occur by chance.

scores (S) and e-values (E) of BLAST hits

word=defined number of letters

BLAST Algorithm: Seeding: Score

score=alignment quality

• Substitution matrices are used for amino acid alignments. – each possible residue substitution is given a score

• A simpler unitary matrix is used for DNA pairs (+1 for match, -2 mismatch)
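The DNA scoring scheme above (+1 for a match, -2 for a mismatch) can be written out as a short function. This is a minimal sketch for two equal-length, ungapped sequences; the function name and the example sequences are made up for illustration.

```python
def ungapped_dna_score(a, b, match=1, mismatch=-2):
    """Score two aligned, equal-length DNA strings without gaps."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(match if x == y else mismatch for x, y in zip(a, b))


# Six matching positions and one mismatch: 6 * (+1) + 1 * (-2)
score = ungapped_dna_score("GATTACA", "GACTACA")
print(score)
```

For amino acids the same loop would look up each residue pair in a substitution matrix instead of using two fixed values.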

BLAST Algorithm: Seeding: Scoring matrix

aa frequency, aa properties

BLOSUM vs PAM

• BLOSUM 62 is the default in BLAST 2.0.
- tailored for comparisons of moderately distant proteins, performs well in detecting closer relationships
- a search for distant relatives may be more sensitive with a different matrix

More Divergent                      Less Divergent
BLOSUM 45       BLOSUM 62       BLOSUM 90
PAM 250         PAM 160         PAM 100

PAM (Percent Accepted Mutation)
- theoretical approach
- based on assumptions of mutation probabilities

BLOSUM (BLOcks SUbstitution Matrix)
- empirical
- constructed from multiply aligned protein families
- ungapped segments (blocks) clustered based on percent identity

BLAST Algorithm: Seeding: E value

• Low E-values suggest that sequences are homologous
• Statistical significance depends on both the size of the alignments and the size of the sequence database
‣ Important consideration for comparing results across different searches
‣ E-value increases as database gets bigger
‣ E-value decreases as alignments get longer

Suggested BLAST Cutoffs

• For nucleotide based searches, one should look for hits with E-values of 10^-6 or less and sequence identity of 70% or more

• For protein based searches, one should look for hits with E-values of 10^-3 or less and sequence identity of 25% or more

E-value = significance of the alignment

The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score.
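The dependence of E on score and search-space size can be illustrated with the standard bit-score relation from BLAST statistics, E = m · n · 2^(-S'), where S' is the bit score, m the query length and n the database length. The sketch below is a simplified illustration: it uses plain lengths rather than the effective lengths that BLAST computes, and the function name and the numbers are invented for the example.

```python
def expected_hits(bit_score, query_len, db_len):
    """Expected number of chance alignments scoring at least bit_score.

    Simplified form E = m * n * 2**(-S'), using raw lengths for m and n
    (an assumption; BLAST applies edge-effect corrections).
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Illustrative numbers only: a 100-residue query against 1,000,000 residues.
small_db = expected_hits(30, 100, 1_000_000)
large_db = expected_hits(30, 100, 10_000_000)
better_hit = expected_hits(40, 100, 1_000_000)
print(small_db, large_db, better_hit)
```

The same bit score gives a ten times larger E-value in a ten times larger database, which is why E-values from different searches are not directly comparable.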

When you manage to find a hit (i.e. a match between a “word” and a database entry), extend the hit in both directions.

Keep track of the score (use a scoring matrix)

Stop when the score drops below some cutoff.

KENFDKARFSGTWYAMAKKDPEG 50 RBP (query)

MKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin (hit)

Hit! Extend to the left and to the right.
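The extension step can be sketched as follows, under two loudly stated assumptions: scoring uses simple identity values (+1 match, -2 mismatch) instead of a protein substitution matrix such as BLOSUM 62, and the stopping rule is an X-drop cutoff (stop once the running score falls more than `x_drop` below the best score seen). The function name and parameter values are hypothetical.

```python
def extend_seed(query, subject, q_pos, s_pos, w=3,
                match=1, mismatch=-2, x_drop=3):
    """Ungapped extension of a word hit; returns (score, q_start, q_end).

    q_start/q_end delimit the extended segment in the query (end exclusive).
    Identity scoring is a simplification of a real substitution matrix.
    """
    def pair_score(a, b):
        return match if a == b else mismatch

    def extend(q, s, step):
        best = running = 0
        best_len = length = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            running += pair_score(query[q], subject[s])
            length += 1
            if running > best:
                best, best_len = running, length   # keep the best prefix
            elif best - running > x_drop:
                break                              # score dropped too far
            q += step
            s += step
        return best, best_len

    seed = sum(pair_score(query[q_pos + i], subject[s_pos + i]) for i in range(w))
    right, right_len = extend(q_pos + w, s_pos + w, +1)
    left, left_len = extend(q_pos - 1, s_pos - 1, -1)
    return seed + left + right, q_pos - left_len, q_pos + w + right_len


rbp = "KENFDKARFSGTWYAMAKKDPEG"
lactoglobulin = "MKGLDIQKVAGTWYSLAMAASD"
hsp = extend_seed(rbp, lactoglobulin, 10, 10)  # seed word GTW at position 10
print(hsp, rbp[hsp[1]:hsp[2]])
```

With identity scoring the segment grows by only one residue; a substitution matrix would also reward conservative replacements and typically extends further.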

BLAST Algorithm: Extension and Evaluation

Originally, each hit was extended in either direction. Refinement of BLAST: two independent hits are required before extension.


BLAST algorithm extends the initial “seed” hit into an HSP

HSP = high scoring segment pair = Local optimal alignment

BLAST-related tools for genomic DNA

• MegaBLAST at NCBI

• BLAT (BLAST-like alignment tool). BLAT parses an entire genomic DNA database into words (11-mers), then searches them against a query: a mirror image of the BLAST strategy. http://genome.ucsc.edu

• SSAHA at Ensembl uses a similar strategy to BLAT. http://www.ensembl.org

The results page will even tell you whether it found any known domain

... or level of similarity

scroll down to bottom...

the more the better

Check the hits you want to save, then click ‘Download’.

Access to sequenced data: Species and Taxa Specific Databases

https://genome.ucsc.edu/ENCODE/

http://www.genecards.org/

http://www.biobase-international.com/product/hgmd

Comparative database of eukaryotic pathogens

gene/metabolic pathway oriented databases
