NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Transcript of NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

NCBI Molecular Biology Resources

A Field Guide, part 2

August 2-3, 2005

Web Access

• Entrez: Text
• BLAST: Sequence
• VAST: Structure

Why do we need similarity searching?

To identify and annotate sequences with…
• incomplete (or no) annotations (GenBank)
• incorrect annotations

To assemble genomes

To explore evolutionary relationships by…
• finding homologous molecules
• developing phylogenetic trees

NOTE: Similar sequences may NOT have similar function!

Searching with Sequences

Basic Local Alignment Search Tool

• Widely used similarity search tool
• Heuristic approach based on the Smith-Waterman algorithm
• Finds best local alignments
• Provides statistical significance
• All combinations (DNA/Protein) of query and database
– DNA vs DNA
– DNA translation vs Protein
– Protein vs Protein
– Protein vs DNA translation
– DNA translation vs DNA translation
• www, standalone, and network clients

Global vs Local Alignment

[Figure: Seq 1 and Seq 2 shown twice, once as a global alignment and once as a local alignment]
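The difference between the two alignment modes can be sketched with a small dynamic-programming routine. This is an illustrative sketch only, not BLAST itself: it uses a linear gap cost, and the scoring values (+1, -3, -3) and the two test sequences are assumptions chosen for the example.

```python
def alignment_score(seq1, seq2, match=1, mismatch=-3, gap=-3, local=True):
    """Best alignment score by dynamic programming with a linear gap cost.

    local=True  -> Smith-Waterman style (cells floor at 0, best cell wins)
    local=False -> Needleman-Wunsch style (end-to-end score in the last cell)
    """
    rows, cols = len(seq1) + 1, len(seq2) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment must pay for gaps at the start of either sequence.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            cell = max(score[i - 1][j - 1] + pair,   # align two residues
                       score[i - 1][j] + gap,        # gap in seq2
                       score[i][j - 1] + gap)        # gap in seq1
            if local:
                cell = max(cell, 0)                  # a local alignment may restart
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


# Hypothetical sequences: a shared 11-base core with unrelated flanks.
a = "TTTTGTACTGGACATAAAA"
b = "CCCCGTACTGGACATGGGG"
local_score = alignment_score(a, b, local=True)
global_score = alignment_score(a, b, local=False)
print(local_score, global_score)
```

The local score rewards only the shared core (11 here), while the global score is dragged down by the unrelated flanks, which is why a local method suits database searching.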

Global vs. Local Alignment

BLASTp

H:  15 IAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREGVKAGTKLSLMPWFH 84
           +A + +    + DL F K D+L I+  T+   W+      GR G IP+NYV + + +++       PW+
Worm:   63 VALFQYDARTDDDLSFKKDDILEILNDTQGDWWFARHKATGRTGYIPSNYVAREKSIES------QPWYF 125

H:  85 GKITREQAERLLYPP--ETGLFLVRESTNYPGDYTLCVSCDGKVEHYRI-MYHASKLSIDEEVYFENLMQ 151
           GK+ R  AE+ L     E G FLVR+S +   D +L V  +  V+HYRI + H     I     F  L
Worm:  126 GKMRRIDAEKCLLHTLNEHGAFLVRDSESRQHDLSLSVRENDSVKHYRIQLDHGGYF-IARRRPFATLHD 194

H: 152 LVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGWALNMKELKLLQTIGKGEFGDVMLGDYRGN-KVA 220
           L+ HY  +ADGLC  L  P             Y   W ++ + ++L++ IG G+FG+V  G +  N  VA
Worm:  195 LIAHYQREADGLCVNLGAPCAKSEAPQTTTFTYDDQWEVDRRSVRLIRQIGAGQFGEVWEGRWNVNVPVA 264

H: 221 VKCIK-NDATAQAFLAEASVMTQLRHSNLVQLLGVIVEEKGGLYIVTEYMAKGSLVDYLRSRGRSVLGGD 289
           VK +K   A    FLAEA +M +LRH  L+ L  V   ++  + IVTE M + +L+ +L+ RGR
Worm:  265 VKKLKAGTADPTDFLAEAQIMKKLRHPKLLSLYAVCTRDE-PILIVTELMQE-NLLTFLQRRGRQCQMPQ 332

H: 290 CLLKFSLDVCEAMEYLEGNNFVHRDLAARNVLVSEDNVAKVSDFGLT----KEASSTQDTG-KLPVKWTA 353
            L++ S  V   M YLE  NF+HRDLAARN+L++     K++DFGL     KE      TG + P+KWTA
Worm:  333 -LVEISAQVAAGMAYLEEMNFIHRDLAARNILINNSLSVKIADFGLARILMKENEYEARTGARFPIKWTA 401

H: 354 PEALREKKFSTKSDVWSFGILLWEIYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMKNCWH 423
           PEA    +F+TKSDVWSFGILL EI +FGR+PYP +   +V+ +V+ GY+M  P GCP  +Y++M+ CW
Worm:  402 PEAANYNRFTTKSDVWSFGILLTEIVTFGRLPYPGMTNAEVLQQVDAGYRMPCPAGCPVTLYDIMQQCWR 471

H: 424 LDAAMRPSFLQLREQLEHI 443
            D   RP+F  L+ +LE +
Worm:  472 SDPDKRPTFETLQWKLEDL 492

Align program (Lipman and Pearson)

human M--------------SAIQ----------------------AAWPSGT------------ECIAKYNFHG
      M              S ..                      AA  SG.            . .A ... .
worm  MGSCIGKEDPPPGATSPVHTSSTLGRESLPSHPRIPSIGPIAASSSGNTIDKNQNISQSANFVALFQYDA
      1                 20                  40                  60

       440               450
human REQLEHI--------KTHELHL
      . .:: .        :   ...
worm  QWKLEDLFNLDSSEYKEASINF
                  500

Nucleotide Words

Query: GTACTGGACATGGACCCTACAGGAA

Word Size = 11

GTACTGGACAT
TACTGGACATG
ACTGGACATGG
CTGGACATGGA
TGGACATGGAC
GGACATGGACC
GACATGGACCC
ACATGGACCCT
...........

Make a lookup table of words

Minimum word size = 7
blastn default = 11
megablast default = 28
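The word list and lookup table above can be sketched in a few lines. This is a minimal illustration, assuming a plain dictionary keyed by word; the function name `build_lookup_table` is hypothetical and BLAST's real table layout is not described on the slide.

```python
from collections import defaultdict


def build_lookup_table(query, word_size=11):
    """Map every overlapping word of length word_size to its start offsets."""
    table = defaultdict(list)
    for start in range(len(query) - word_size + 1):
        table[query[start:start + word_size]].append(start)
    return table


query = "GTACTGGACATGGACCCTACAGGAA"      # query from the slide
table = build_lookup_table(query, word_size=11)
words = sorted(table, key=lambda w: table[w][0])
for word in words[:8]:
    print(table[word][0], word)
```

Scanning a database sequence then reduces to looking up each of its words in the table and recording where an identical word occurs in the query.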

Protein Words

Query: GTQITVEDLFYNIATRRKALKN

Word Size = 3

GTQ
TQI
QIT
ITV
TVE
VED
EDL
DLF
...

Neighborhood Words (for ITV): LTV, MTV, ISV, LSV, etc.

Make a lookup table of words

Word Size can be 2 or 3 (default = 3)
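How a neighborhood word is scored against a query word can be sketched as follows. The BLOSUM62 entries are copied from the matrix printed later in this document; the threshold value of 11 is an assumption made for illustration, since the slide does not state one, and the helper names are hypothetical.

```python
# Only the BLOSUM62 entries needed for this example (partial matrix).
BLOSUM62_SUBSET = {
    ("I", "I"): 4, ("T", "T"): 5, ("V", "V"): 4,
    ("I", "L"): 2, ("I", "M"): 1, ("T", "S"): 1,
}


def pair_score(a, b):
    """Symmetric lookup in the partial matrix above."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def word_score(query_word, candidate):
    """Sum of substitution scores, position by position."""
    return sum(pair_score(a, b) for a, b in zip(query_word, candidate))


def neighborhood(query_word, candidates, threshold):
    """Candidates whose score against the query word reaches the threshold."""
    return [w for w in candidates if word_score(query_word, w) >= threshold]


candidates = ["ITV", "LTV", "MTV", "ISV", "LSV"]
scores = {w: word_score("ITV", w) for w in candidates}
print(scores)
print(neighborhood("ITV", candidates, threshold=11))  # threshold is an assumed value
```

Lowering the threshold admits more neighborhood words, which raises sensitivity at the cost of more initial hits to examine.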

Initial Matches and Extensions

Protein BLAST requires two neighboring matches within 40 aa

GTQITVEDLFYNI
<---- SEI YYN ---->   (neighborhood words, two matches)

Nucleotide BLAST requires one exact match

ATCGCCATGCTTAATTGGGCTT
<--- CATGCTTAATT ----->   (exact word match, one match)
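The two-match requirement for protein searches can be sketched as a filter over word hits. This is a simplified illustration under stated assumptions: hits are given as (query position, subject position) pairs, two hits count as neighbors only if they lie on the same diagonal, and overlapping hits are not treated specially. The same-diagonal condition and the function name are not taken from the slide.

```python
from collections import defaultdict


def two_hit_seeds(hits, window=40):
    """Return (diagonal, first_pos, second_pos) for hit pairs within `window`.

    hits: iterable of (query_pos, subject_pos) word-match start positions.
    """
    by_diagonal = defaultdict(list)
    for q, s in hits:
        by_diagonal[s - q].append(q)      # same diagonal: equal subject - query offset
    seeds = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        for first, second in zip(positions, positions[1:]):
            if 0 < second - first <= window:
                seeds.append((diagonal, first, second))
    return seeds


# Hypothetical hits: two close hits on diagonal 10, one far away, one on another diagonal.
hits = [(3, 13), (20, 30), (50, 75), (120, 130)]
seeds = two_hit_seeds(hits)
print(seeds)
```

Only the pair at query positions 3 and 20 would trigger an extension here; the isolated hits are ignored, which is what keeps the protein search fast.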

An alignment that BLAST can’t find

1   GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
    || | || || || |  || || ||   ||  |  ||| |||||| | | || | ||| |
1   GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

61  GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
    | || ||     ||  ||| ||  | |||||| || | ||||||  |||||  |     |
61  GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
    |||| || ||||| ||  ||    |  | ||||  || |||
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC

An Alignment BLAST Can Make

Solution: compare protein sequences; BLASTX

BLAST 2 Sequences (blastx) output:

Score = 290 bits (741), Expect = 7e-77
Identities = 147/331 (44%), Positives = 206/331 (61%), Gaps = 8/331 (2%)
Frame = +3

Scoring Systems - Nucleotides

Identity matrix

     A    G    C    T
A   +1   -3   -3   -3
G   -3   +1   -3   -3
C   -3   -3   +1   -3
T   -3   -3   -3   +1

CAGGTAGCAAGCTTGCATGTCA
|| ||||||||||||  |||||     raw score = 19-9 = 10
CACGTAGCAAGCTTG-GTGTCA
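The raw score of this example alignment can be recomputed column by column. A minimal sketch: the +1/-3 values come from the identity matrix, while the gap cost of -3 is an assumption inferred from the slide's arithmetic (19 matches minus a total penalty of 9 for two mismatches and one gap).

```python
def raw_score(top, bottom, match=1, mismatch=-3, gap=-3):
    """Sum column scores for two aligned strings of equal length ('-' is a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap          # assumed gap cost, see lead-in
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


score = raw_score("CAGGTAGCAAGCTTGCATGTCA", "CACGTAGCAAGCTTG-GTGTCA")
print(score)
```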

Scoring Systems - Proteins

Position Independent Matrices

PAM Matrices (Percent Accepted Mutation)
• Derived from observation; small dataset of alignments
• Implicit model of evolution
• All calculated from PAM1
• PAM250 widely used

BLOSUM Matrices (BLOck SUbstitution Matrices)
• Derived from observation; large dataset of highly conserved blocks
• Each matrix derived separately from blocks with a defined percent identity cutoff
• BLOSUM62 - default matrix for BLAST

Position Specific Score Matrices (PSSMs)
• PSI- and RPS-BLAST

BLOSUM62

A   4
R  -1  5
N  -2  0  6
D  -2 -2  1  6
C   0 -3 -3 -3  9
Q  -1  1  0  0 -3  5
E  -1  0  0  2 -4  2  5
G   0 -2  0 -1 -3 -2 -2  6
H  -2  0  1 -1 -3  0  0 -2  8
I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
X   0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X

Common amino acids have low weights

Rare amino acids have high weights

Negative for less likely substitutions

Positive for more likely substitutions

Gapped Alignments

• Gapping provides more biologically realistic alignments
• Statistical behavior is not completely understood for gapped alignments
• Gapped BLAST parameters must be found by simulations for each matrix
• Affine gap costs = -(a + bk), where a = gap open penalty, b = gap extend penalty, and k = gap length. A gap of length 1 receives the score -(a + b)
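The affine gap formula can be written directly as a function. A minimal sketch; the penalty values a = 11 and b = 1 used in the example calls are illustrative assumptions, not values stated on this slide.

```python
def affine_gap_score(length, gap_open, gap_extend):
    """Score of a single gap of `length` residues: -(a + b*k)."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return -(gap_open + gap_extend * length)


# Illustrative penalties (assumed): a = 11, b = 1.
for k in (1, 2, 5):
    print(k, affine_gap_score(k, gap_open=11, gap_extend=1))
```

Because the opening cost is paid once, one long gap is cheaper than several short ones of the same total length, which matches how insertions and deletions tend to occur.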

Scores

            V    D    S    –    C    Y
            V    E    T    L    C    F
BLOSUM62   +4   +2   +1  -12   +9   +3   = 7
PAM30      +7   +2    0  -10  +10   +2   = 11

Simply add the scores for each pair of aligned residues

Different matrices produce different scores!
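The two totals can be checked by adding the per-column values. A small sketch that uses only the numbers printed on the slide; the gap column is taken as given, already including each matrix's gap penalty.

```python
# Per-column scores as printed on the slide for the alignment VDS-CY / VETLCF.
COLUMNS = ["V/V", "D/E", "S/T", "-/L", "C/C", "Y/F"]
SLIDE_SCORES = {
    "BLOSUM62": [4, 2, 1, -12, 9, 3],
    "PAM30": [7, 2, 0, -10, 10, 2],
}

totals = {name: sum(values) for name, values in SLIDE_SCORES.items()}
for name, total in totals.items():
    print(f"{name}: {total}")
```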

Local Alignment Statistics

High scores of local alignments between two random sequences follow the Extreme Value Distribution

[Figure: distribution of alignment scores; x-axis: Score, y-axis: Alignments]

Expect Value

E = number of database hits you expect to find by chance

$E = K m n \, e^{-\lambda S}$ (applies to ungapped alignments)

$E = m n \, 2^{-S'}$

• K = scale for search space
• λ = scale for scoring system
• S' = bitscore = (λS - ln K)/ln 2
• E = expected number of random hits; mn reflects the size of the database; S = your score
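The relation between raw score, bit score, and Expect value can be sketched numerically. The formulas follow the slide; the parameter values λ = 0.267 and K = 0.041 are assumptions (values often quoted for BLOSUM62 with gap costs 11/1) and do not come from this document, and the search-space sizes in the example are hypothetical.

```python
import math


def bit_score(raw_score, lam, k):
    """S' = (lambda * S - ln K) / ln 2"""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_from_raw(raw_score, m, n, lam, k):
    """E = K * m * n * exp(-lambda * S)"""
    return k * m * n * math.exp(-lam * raw_score)


def expect_from_bits(bits, m, n):
    """E = m * n * 2 ** (-S')"""
    return m * n * 2.0 ** (-bits)


# Assumed parameters, not taken from the slides.
LAMBDA, K = 0.267, 0.041
bits = bit_score(741, LAMBDA, K)           # raw score from the blastx example
m, n = 300, 1_000_000                      # hypothetical query and database lengths
print(round(bits, 1))
print(expect_from_raw(741, m, n, LAMBDA, K), expect_from_bits(bits, m, n))
```

With these assumed parameters a raw score of 741 converts to about 290 bits, in line with the blastx output shown earlier, and both forms of the equation give the same E for a given search space.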

Advanced BLAST Options: Nucleotide

Example Entrez Queries
nucleotide all[Filter] NOT mammalia[Organism]
green plants[Organism]
biomol mrna[Properties]
gbdiv est[Properties] AND rat[organism]

Other Advanced
-e 10000   expect value
-v 2000    descriptions
-b 2000    alignments

Advanced BLAST Options: Protein

Matrix Selection
• PAM30 -- most stringent
• BLOSUM45 -- least stringent

Limit by taxon
Mus musculus[Organism]
Mammalia[Organism]
Viridiplantae[Organism]

Example Entrez Queries
proteins all[Filter] NOT mammalia[Organism]
green plants[Organism]
srcdb refseq[Properties]

Other Advanced
-e 10000   expect value
-v 2000    descriptions
-b 2000    alignments

Low Complexity Filtering

Filtered / Unfiltered

sp|P27476|NSR1_YEAST NUCLEAR LOCALIZATION SEQUENCE BINDING PROTEIN (P67)
Length = 414

Score = 40.2 bits (92), Expect = 0.013
Identities = 35/131 (26%), Positives = 56/131 (42%), Gaps = 4/131 (3%)

Query: 362 STTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDAFLQPLSKPLS---SQPQAIVTEDKTD 418
           S++S  SSS+S SS +  +     ++S      +  +  S   S   S+ +    E K
Sbjct:  29 SSSSSESSSSSSSSSESESESESESESSSSSSSSDSESSSSSSSDSESEAETKKEESKDS 88

Low Complexity Filter

>gi|20140146|sp|Q96RF0|SNXI_HUMAN Sorting nexin 18
Length = 628

Score = 1048 bits (2710), Expect = 0.0
Identities = 528/628 (84%), Positives = 528/628 (84%)

Query:   1 MALRARALYDFRSENPGEISLREHEVLSLCSEQDIEGWLEGVNSRGDRGLFPASYVQVIR 60
           MALRARALYDFRSENPGEISLREHEVLSLCSEQDIEGWLEGVNSRGDRGLFPASYVQVIR
Sbjct:   1 MALRARALYDFRSENPGEISLREHEVLSLCSEQDIEGWLEGVNSRGDRGLFPASYVQVIR 60

Query:  61 XXXXXXXXXXXXXXXXXXXNVPPGGFEXXXXXXXXXXXXXXXXXXXXXXXXXXXXXSTFQ 120
                              NVPPGGFE                             STFQ
Sbjct:  61 APEPGPAGDGGPGAPARYANVPPGGFEPLPVAPPASFKPPPDAFQALLQPQQAPPPSTFQ 120

. . .

low complexity sequence (the X runs in the query)
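The idea behind masking can be sketched with a sliding-window entropy test. This is a simplified stand-in, not the filter BLAST itself uses: the window length, the entropy threshold, and the function names are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition of a string."""
    counts = Counter(window)
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mask_low_complexity(sequence, window=12, threshold=2.2):
    """Replace every residue that falls in a low-entropy window with 'X'."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


# Hypothetical sequence: a serine-rich run between two compositionally diverse ends.
example = "ACDEFGHIKLMN" + "S" * 15 + "PQRTVWYACDEF"
print(mask_low_complexity(example))
```

Windows dominated by one residue have low entropy and are replaced by X, so a serine-rich stretch like the one in the yeast hit above would no longer seed spurious matches.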

Neighbors: Precomputed BLAST

Nucleotide

Protein

Entrez Related Sequences produces a list of sequences sorted by BLAST score, but with no alignment details.

Blink – Protein BLAST Alignments

• Lists only 200 hits
• List is nonredundant

Blink – Best Hits

Megablast: NCBI’s Genome Annotator

• Long alignments of similar DNA sequences
• Greedy algorithm
• Concatenation of query sequences
• Faster than blastn; less sensitive

MegaBLAST

AI217550
AI251192
AI254381
BE645079

C:\seq\hs.4.fsa

> 1133045 gnl|UG|Hs#S1133045 qd43b11.x1 Homo sapiens cDNA, 3' end
CATGTAAGCCATTTATTGGTTTGTTTTAAAAATATGTATTTTATTTATACATGAAGTTTGGTGAGAAGTGCTCGATTAGTTCAGACAACATCTGGCACTTGATGTCTGTCCTTCCCTCCTTTTTCCTACTCTCTTCTCCCCTCCTGCTGGTCATTGTGCAGTTCTGGAAATTAAAAAGGTGACAGCCAGGCTAAAAGCTAAGGGTTGGGTCTAGCTCACCTCCCACCCCCAACCACACCGTCTGCAGCCAGCCCCAGGCACCTGTCTCAAAGCTCCCGGGCTGTCCACACACACAAAAACCACAGTCTCCTTCCGGCCAGCTGGGCTGGCAGCCCGACCTGC
> 1141828 gnl|UG|Hs#S1141828 qv37f11.x1 Homo sapiens cDNA, 3' end
GAGAAGACGACAGAAGGGGAGAAGAGAGTAGGAAAAAGGAGGGAAGGACAGACATCAAGTGCCAGATGTTGTCTGAACTAATCGAGCACTTCTCACCAAACTTCATGTATAAATAAAATACATATTTTTAAAACAAACCAATAAATGGCTTACATCAAAAAAAAAAAAAAAAAAAAAAAAGTCGTATCGATGT
> 1145899 gnl|UG|Hs#S1145899 qv33c06.x1 Homo sapiens cDNA, 3' end
GAGAAGACGACAGAAGGGGAGAAGAGAGTAGGAAAAAGGAGGGAAGGACAGACATCAAGTGCCAGATGTTGTCTGAACTAATCGAGCACTTCTCACCAAACTTCATGTATAAATAAAATACATATTTTTAAAACAAACCAATAAATGGCTTACATCAAAAAAAAAAAAAAAAAAAAAAAAGTCGTATCGATGT
> 2291670 gnl|UG|Hs#S2291670 7e65f04.x1 Homo sapiens cDNA, 3' end
TTTCATGTAAGCCATTTATTGGTTTGTTTTAAAAATATGTATTTTATTTATACATGAAGTTTGGTGAGAAGTGCTCGATTAGTTCAAACAACATCTGGCACTTGATGTCTGTCCTTCCCTCCTTTTTCCTACTCTCTTCTCCCCTCCTGCTGGTCATTGTGCAGTTCTGGAAATTAAAAAGGTGACAGCCAGGCTAAAAGCTAAGGGTTGGGTCTAGCTCACCTCCCACCCCCAACCACACCGTCTGCAGCCAGCCCCAGGCACCTGTCTCAAAGCTCCCGGGCTGTCCACACACACAAAAACCACAGTCTCCTTCCGGCCAGCTGGGCTGGCAGCCCGACCTGCCTCCCAACCGCATTCCTGCCTGTGTAGCAGGCGGTGAGCACCCAGAAGGGGCACATACCTCTCCAAGCCTTGAAAGCAAAGCATGGAGATCTACAAAAATAGGATTTCCACTTGGAGAAATGTCGCTGGGACAGT

Discontiguous Megablast

• Uses discontiguous word matches

• Better for cross-species comparisons

Templates for Discontiguous MegaBLAST

W = 11, t = 16, coding: 1101101101101101
W = 11, t = 16, non-coding: 1110010110110111
W = 12, t = 16, coding: 1111101101101101
W = 12, t = 16, non-coding: 1110110110110111
W = 11, t = 18, coding: 101101100101101101
W = 11, t = 18, non-coding: 111010010110010111
W = 12, t = 18, coding: 101101101101101101
W = 12, t = 18, non-coding: 111010110010110111
W = 11, t = 21, coding: 100101100101100101101
W = 11, t = 21, non-coding: 111010010100010010111
W = 12, t = 21, coding: 100101101101100101101
W = 12, t = 21, non-coding: 111010010110010010111

Ma, B., Tromp, J., Li, M., "PatternHunter: faster and more sensitive homology search", Bioinformatics 2002 Mar;18(3):440-5
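How a template turns a window of sequence into a discontiguous word can be sketched as follows. This assumes that W counts the 1 positions and t is the template length, which matches the strings listed above; the function name and the test sequences are hypothetical.

```python
def apply_template(sequence, start, template):
    """Return the bases under the 1 positions of a template placed at `start`."""
    window = sequence[start:start + len(template)]
    if len(window) < len(template):
        raise ValueError("template runs past the end of the sequence")
    return "".join(base for base, flag in zip(window, template) if flag == "1")


CODING_W11_T16 = "1101101101101101"     # copied from the template list

seq_a = "ACGTACGTACGTACGT"
seq_b = "ACATACGTACGTACGT"              # differs from seq_a only at a 0 position
word_a = apply_template(seq_a, 0, CODING_W11_T16)
word_b = apply_template(seq_b, 0, CODING_W11_T16)
print(word_a, word_b, word_a == word_b)
```

Because positions marked 0 are ignored, two sequences that differ only there still produce the same word. In the coding templates every third position tends to be skipped, which suits the comparison of diverged coding sequences.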

Nucleotide vs. Protein BLAST

Comparing ADSS from H. sapiens and A. thaliana

        aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc
Human:   N  R  V  T  V  V  L  G  A  Q  W  G  D  E  G
         +  +  V  +     V  L  G     Q  W  G  D  E  G
A.th.:   S  Q  V  S  G  V  L  G  C  Q  W  G  D  E  G
        agtcaagtatctggtgtactcggttgccaatggggagatgaaggt

BLASTp finds three matching words

BLASTn finds no match, because the two sequences share no identical 7 bp words

Protein searches are generally more sensitive than nucleotide searches.
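The word comparison on this slide can be reproduced with set operations. A minimal sketch that counts only exact, identical words anywhere in the two sequences; BLASTp also uses neighborhood words, and overlapping identical words are each counted here, so the protein tally need not equal the slide's count of three.

```python
def words(sequence, size):
    """Set of all overlapping words of the given size."""
    return {sequence[i:i + size] for i in range(len(sequence) - size + 1)}


human_nt = "aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc"
athal_nt = "agtcaagtatctggtgtactcggttgccaatggggagatgaaggt"
human_aa = "NRVTVVLGAQWGDEG"
athal_aa = "SQVSGVLGCQWGDEG"

shared_nt = words(human_nt, 7) & words(athal_nt, 7)   # minimum blastn word size
shared_aa = words(human_aa, 3) & words(athal_aa, 3)   # default protein word size
print(sorted(shared_nt))
print(sorted(shared_aa))
```

The nucleotide set comes out empty even though the encoded proteins share several identical three-residue words, because synonymous codon differences interrupt every run of seven identical bases.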

Translated BLAST

Program    Query                      Database
blastx     Nucleotide (translated)    Protein
tblastn    Protein                    Nucleotide (translated)
tblastx    Nucleotide (translated)    Nucleotide (translated)

Particularly useful for nucleotide sequences without protein annotations, such as ESTs or genomic DNA
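The translation step behind these programs can be sketched as a six-frame translation. This is an illustrative sketch using the standard genetic code; the function names and frame labels are assumptions, and the test sequence is the human ADSS fragment shown earlier in this document.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG ordering of codons.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from the start of dna; unknown codons give 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Translations of the three forward and three reverse-complement frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


frames = six_frames("aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc")
for name in sorted(frames):
    print(name, frames[name])
```

A translated search compares each of these conceptual translations as a protein, which is why a hit is reported with a frame such as +3.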

Genomic BLAST

• These pages provide customized nucleotide and protein databases for each genome
• If a Map Viewer is available, the BLAST hits can be viewed on the maps

BLAST the Chicken Genome

Program

Accession for human TPO mRNA

BLAST Hit on the Genome

BLASTn Hit on the Map Viewer

TBLASTN Results Using NP_000538

Linking Protein Sequence, Structure, and Function

• sequence → function (pfam, smart)
• BLASTp: sequence → structure
• VAST: structure → structure
• sequence → structure + function (cd)

Tools shown: PSI-BLAST, RPS-BLAST, BLASTp, VAST

Position Specific Substitution Rates

[Figure labels: Active site serine; Weakly conserved serine]

Position Specific Score Matrix (PSSM)

        A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
206 D   0 -2  0  2 -4  2  4 -4 -3 -5 -4  0 -2 -6  1  0 -1 -6 -4 -1
207 G  -2 -1  0 -2 -4 -3 -3  6 -4 -5 -5  0 -2 -3 -2 -2 -1  0 -6 -5
208 V  -1  1 -3 -3 -5 -1 -2  6 -1 -4 -5  1 -5 -6 -4  0 -2 -6 -4 -2
209 I  -3  3 -3 -4 -6  0 -1 -4 -1  2 -4  6 -2 -5 -5 -3  0 -1 -4  0
210 S  -2 -5  0  8 -5 -3 -2 -1 -4 -7 -6 -4 -6 -7 -5  1 -3 -7 -5 -6
211 S   4 -4 -4 -4 -4 -1 -4 -2 -3 -3 -5 -4 -4 -5 -1  4  3 -6 -5 -3
212 C  -4 -7 -6 -7 12 -7 -7 -5 -6 -5 -5 -7 -5  0 -7 -4 -4 -5  0 -4
213 N  -2  0  2 -1 -6  7  0 -2  0 -6 -4  2  0 -2 -5 -1 -3 -3 -4 -3
214 G  -2 -3 -3 -4 -4 -4 -5  7 -4 -7 -7 -5 -4 -4 -6 -3 -5 -6 -6 -6
215 D  -5 -5 -2  9 -7 -4 -1 -5 -5 -7 -7 -4 -7 -7 -5 -4 -4 -8 -7 -7
216 S  -2 -4 -2 -4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4  7 -2 -6 -5 -5
217 G  -3 -6 -4 -5 -6 -5 -6  8 -6 -8 -7 -5 -6 -7 -6 -4 -5 -6 -7 -7
218 G  -3 -6 -4 -5 -6 -5 -6  8 -6 -7 -7 -5 -6 -7 -6 -2 -4 -6 -7 -7
219 P  -2 -6 -6 -5 -6 -5 -5 -6 -6 -6 -7 -4 -6 -7  9 -4 -4 -7 -7 -6
220 L  -4 -6 -7 -7 -5 -5 -6 -7  0 -1  6 -6  1  0 -6 -6 -5 -5 -4  0
221 N  -1 -6  0 -6 -4 -4 -6 -6 -1  3  0 -5  4 -3 -6 -2 -1 -6 -1  6
222 C   0 -4 -5 -5 10 -2 -5 -5  1 -1 -1 -5  0 -1 -4 -1  0 -5  0  0
223 Q   0  1  4  2 -5  2  0  0  0 -4 -2  1  0  0  0 -1 -1 -3 -3 -4
224 A  -1 -1  1  3 -4 -1  1  4 -3 -4 -3 -1 -2 -2 -3  0 -2 -2 -2 -3

Serine is scored differently in these two positions

Active site nucleophile
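The position-specific scoring of serine can be read straight from the table. A small sketch using the three rows whose query residue is S; which two of them the slide highlights is not recoverable from the transcript, so all three are shown, and the helper name is hypothetical.

```python
COLUMNS = "A R N D C Q E G H I L K M F P S T W Y V".split()

# Rows copied from the PSSM above (positions where the query residue is S).
PSSM_ROWS = {
    210: [-2, -5, 0, 8, -5, -3, -2, -1, -4, -7, -6, -4, -6, -7, -5, 1, -3, -7, -5, -6],
    211: [4, -4, -4, -4, -4, -1, -4, -2, -3, -3, -5, -4, -4, -5, -1, 4, 3, -6, -5, -3],
    216: [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6, -6, -3, -5, -6, -4, 7, -2, -6, -5, -5],
}


def pssm_score(position, residue):
    """Score for placing `residue` at `position` according to the PSSM rows above."""
    return PSSM_ROWS[position][COLUMNS.index(residue)]


serine_scores = {pos: pssm_score(pos, "S") for pos in PSSM_ROWS}
print(serine_scores)
print("BLOSUM62 S/S score is 4 at every position")
```

A position-independent matrix such as BLOSUM62 gives serine the same score everywhere, whereas the PSSM rewards it strongly at position 216 and only weakly elsewhere.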

PSI-BLAST

Create your own PSSM: Confirming relationships of purine nucleotide metabolism proteins

[Figure: the query is searched with BLOSUM62 to give an alignment; the alignment yields a PSSM, which is searched to give a new alignment]

PSI BLAST

>gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE AMINOH
MAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGFLAKFDYYVIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVDLVNQGLQEQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAYEGAVKNGRTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGAWDPKTTHVRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKKELLERLY

e value cutoff for PSSM

PSI Results: Initial BLAST Run

First PSSM Search

Other purine nucleotide metabolizing enzymes not found by ordinary BLAST

Third PSSM Search: Convergence

Just below threshold, another nucleotide metabolism enzyme

Entrez Domains (CDD)

Single Domains
• Pfam (pfam01234; Sanger): Pfam-A seeds, HMM based models representing a wide variety of functional domains derived from SWISS-PROT
• SMART (smart00123; EMBL): HMM based models originally concentrating on eukaryotic signaling domains, now expanding
• CD (cd01234; NCBI): NCBI curated domains based on sequence and structural alignments

Protein Families
• COG (COG0123; NCBI): BLAST based alignments derived from complete proteomes of prokaryotes

Protein Links: Domains

Results of a CD-Search

CD

SMART

Pfam

Click on a colored bar to align your sequence to the CD

CDD Record – heme peroxidases

aligned query

red = high conservation

blue = low conservation

Curated CD Record

Curated CDs (cd12345) are based on sequence and structure alignments

Annotated features

Structural evidence

aligned query

Blink: Sequence to Structure

related structures

Related Structures

Cn3D

Entrez Structure

MMDB: Molecular Modeling Database

• Derived from experimentally determined PDB records
• Add value to PDB records by:
– Adding explicit chemical bonding information
– Validating and indexing the sequences
– Annotating 3D domains and secondary structure
– Adding links to CDD, Taxonomy, PubMed
– Converting PDB data to ASN.1
• Structure neighbors determined by Vector Alignment Search Tool (VAST)

Structure Summary Page

Conserved Domains

VAST Neighbors for chain C (domain 0)

Cn3D

VAST Neighbors for domain 2

VAST: Structure Neighbors

Vector Alignment Search Tool

For each 3D domain, locate SSEs (secondary structure elements), and represent them as individual vectors.

[Figure: Human IL-4 with SSEs numbered 1 to 6]

VAST uses 3D Domains only! Whole polypeptides are assigned 3D domain 0 (zero).

VAST Neighbors

1D2V

1D2V

1Q4G

3D domains!

Viewing a VAST Alignment

• RMSD in Angstroms
• Sequence percent identity
• VAST P value

Cn3D

Submitting a PDB File to VAST (New!)

• Redesigned interface!
• This is the best way to convert PDB into MMDB format!

Entrez PubChem (New!)

• PC Substance: Primary database of chemical samples
• PC Compound: Derived database of known chemicals from PC Substance records
• PC BioAssay: Primary database of bioactivity screens of samples in PC Substance

Links from Structure

N-acetylglucosamine

heme

mannose

fucose

Search for thyroxine

ChemID     24
KEGG        4
DTP-NCI     3
NIST        3
Biocyc      2
BIND        2
Chembank    2
NIAID       1
TOTAL      41

Sequence Polymorphisms

SNP (General Polymorphisms)
• Primary database of submitted SNPs
• Curated database of reference SNPs
• Contains more than just SNPs:
– True SNPs
– MNP (multiple nucleotide)
– Insertions
– Deletions
– Microsatellites
– Mixed
– No variation (constant)

OMIM (Human Phenotypes)
• Clinical literature database
• Curated at Johns Hopkins Univ
• Links human genes and genetic disorders to human disease
• Lists allelic variants that have clinical consequences

Variations in SNP are not necessarily in OMIM, and vice versa!

Linking to SNP

Links to SNP are also available from Nucleotide and Protein

Entrez Gene - TPO

Entrez SNP

primary data: ss#

SNP UID: rs#

Find Non-synonymous SNPs

#7 AND coding nonsynon[Function Class]

Function Class

Non-synonymous TPO SNPs

Link to Map Viewer

View all SNPs in locus

Link to related 3D structures

GeneView in dbSNP

Links to OMIM

Links to SNP are also available from Nucleotide and Protein

Entrez Gene - TPO

OMIM Record

Explore a Disease SNP

799

Curated CD Record

E799

Cn3D

For More Information…

E-mail addresses
• General Help: info@ncbi.nlm.nih.gov
• BLAST: blast-help@ncbi.nlm.nih.gov

The NCBI Education Page
http://www.ncbi.nih.gov/Education/index.html

The (free!) NCBI Newsletter
http://www.ncbi.nih.gov/About/newsletter.html

The NCBI Handbook
Follow the link from the NCBI Home Page