NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.


Page 1: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.


NCBI Molecular Biology Resources

A Field Guide, part 2

August 2-3, 2005

Page 2: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Web Access

BLAST

VAST

Entrez

Text

Sequence

Structure

Page 3: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Why do we need similarity searching?

To identify and annotate sequences with…
• incomplete (or no) annotations (GenBank)
• incorrect annotations

To assemble genomes

To explore evolutionary relationships by…
• finding homologous molecules
• developing phylogenetic trees

NOTE: Similar sequences may NOT have similar function!

Searching with Sequences

Page 4: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Basic Local Alignment Search Tool

• Widely used similarity search tool
• Heuristic approach based on Smith Waterman algorithm
• Finds best local alignments
• Provides statistical significance
• All combinations (DNA/Protein) query and database.
  – DNA vs DNA
  – DNA translation vs Protein
  – Protein vs Protein
  – Protein vs DNA translation
  – DNA translation vs DNA translation
• www, standalone, and network clients

Page 5: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Global vs Local Alignment

[Figure: Seq 1 and Seq 2 shown twice, once under global alignment and once under local alignment]

Page 6: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Global vs. Local Alignment

BLASTp

H: 15 IAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREGVKAGTKLSLMPWFH 84
+A + + + DL F K D+L I+ T+ W+ GR G IP+NYV + + +++ PW+
Worm: 63 VALFQYDARTDDDLSFKKDDILEILNDTQGDWWFARHKATGRTGYIPSNYVAREKSIES------QPWYF 125

H: 85 GKITREQAERLLYPP--ETGLFLVRESTNYPGDYTLCVSCDGKVEHYRI-MYHASKLSIDEEVYFENLMQ 151
GK+ R AE+ L E G FLVR+S + D +L V + V+HYRI + H I F L
Worm: 126 GKMRRIDAEKCLLHTLNEHGAFLVRDSESRQHDLSLSVRENDSVKHYRIQLDHGGYF-IARRRPFATLHD 194

H: 152 LVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGWALNMKELKLLQTIGKGEFGDVMLGDYRGN-KVA 220
L+ HY +ADGLC L P Y W ++ + ++L++ IG G+FG+V G + N VA
Worm: 195 LIAHYQREADGLCVNLGAPCAKSEAPQTTTFTYDDQWEVDRRSVRLIRQIGAGQFGEVWEGRWNVNVPVA 264

H: 221 VKCIK-NDATAQAFLAEASVMTQLRHSNLVQLLGVIVEEKGGLYIVTEYMAKGSLVDYLRSRGRSVLGGD 289
VK +K A FLAEA +M +LRH L+ L V ++ + IVTE M + +L+ +L+ RGR
Worm: 265 VKKLKAGTADPTDFLAEAQIMKKLRHPKLLSLYAVCTRDE-PILIVTELMQE-NLLTFLQRRGRQCQMPQ 332

H: 290 CLLKFSLDVCEAMEYLEGNNFVHRDLAARNVLVSEDNVAKVSDFGLT----KEASSTQDTG-KLPVKWTA 353
L++ S V M YLE NF+HRDLAARN+L++ K++DFGL KE TG + P+KWTA
Worm: 333 -LVEISAQVAAGMAYLEEMNFIHRDLAARNILINNSLSVKIADFGLARILMKENEYEARTGARFPIKWTA 401

H: 354 PEALREKKFSTKSDVWSFGILLWEIYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMKNCWH 423
PEA +F+TKSDVWSFGILL EI +FGR+PYP + +V+ +V+ GY+M P GCP +Y++M+ CW
Worm: 402 PEAANYNRFTTKSDVWSFGILLTEIVTFGRLPYPGMTNAEVLQQVDAGYRMPCPAGCPVTLYDIMQQCWR 471

H: 424 LDAAMRPSFLQLREQLEHI 443
D RP+F L+ +LE +
Worm: 472 SDPDKRPTFETLQWKLEDL 492

Align program (Lipman and Pearson)

human M--------------SAIQ----------------------AAWPSGT------------ECIAKYNFHG
M S .. AA SG. . .A ... .
worm MGSCIGKEDPPPGATSPVHTSSTLGRESLPSHPRIPSIGPIAASSSGNTIDKNQNISQSANFVALFQYDA
1 20 40 60

440 450
human REQLEHI--------KTHELHL
. .:: . : ...
worm QWKLEDLFNLDSSEYKEASINF
500

Page 7: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Nucleotide Words

Query: GTACTGGACATGGACCCTACAGGAA

Word Size = 11

GTACTGGACAT
TACTGGACATG
ACTGGACATGG
CTGGACATGGA
TGGACATGGAC
GGACATGGACC
GACATGGACCC
ACATGGACCCT
...........

Make a lookup table of words

Minimum word size = 7
blastn default = 11
megablast default = 28
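The word list on this slide can be reproduced with a few lines of Python. This is a minimal sketch of the idea only, not BLAST's actual data structure; the function name `build_word_table` is made up for illustration.

```python
from collections import defaultdict


def build_word_table(query: str, word_size: int = 11) -> dict:
    """Map every overlapping word of the query to its 0-based start positions."""
    table = defaultdict(list)
    for start in range(len(query) - word_size + 1):
        table[query[start:start + word_size]].append(start)
    return dict(table)


query = "GTACTGGACATGGACCCTACAGGAA"  # query sequence from the slide
table = build_word_table(query, word_size=11)  # 11 is the blastn default quoted above

# Print the words in query order, as the slide lists them.
for word, positions in sorted(table.items(), key=lambda item: item[1][0]):
    print(positions[0], word)
```

A 25-base query yields 25 - 11 + 1 = 15 overlapping words; the slide shows the first eight.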

Page 8: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Protein Words

Query: GTQITVEDLFYNIATRRKALKN

Word Size = 3

GTQ
TQI
QIT
ITV
TVE
VED
EDL
DLF
...

Neighborhood Words

LTV, MTV, ISV, LSV, etc.

Make a lookup table of words

Word Size can be 2 or 3 (default = 3)
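To see why words such as LTV or ISV sit in the neighborhood of ITV, one can score each candidate against the query word with the substitution matrix. The sketch below assumes the handful of BLOSUM62 values it needs (copied from the BLOSUM62 slide later in this deck) and uses illustrative names (`BLOSUM62_SUBSET`, `word_score`). It does not apply any cutoff; which candidates actually count as neighbors depends on the score threshold BLAST is run with.

```python
# Only the BLOSUM62 entries needed for this example (the matrix is symmetric).
BLOSUM62_SUBSET = {
    ("I", "I"): 4, ("I", "L"): 2, ("I", "M"): 1,
    ("T", "T"): 5, ("T", "S"): 1,
    ("V", "V"): 4,
}


def pair_score(a: str, b: str) -> int:
    """Look up a residue pair in either orientation."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def word_score(word: str, candidate: str) -> int:
    """Sum the pairwise scores of two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(word, candidate))


query_word = "ITV"
candidates = ["ITV", "LTV", "MTV", "ISV", "LSV"]
scores = {c: word_score(query_word, c) for c in candidates}

for candidate, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(candidate, score)
```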

Page 9: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Initial Matches and Extensions

Protein BLAST requires two neighboring matches (neighborhood words) within 40 aa

GTQITVEDLFYNI
<---- SEI YYN ---->   (two matches)

Nucleotide BLAST requires one exact match (exact word match)

ATCGCCATGCTTAATTGGGCTT
<--- CATGCTTAATT ----->   (one match)

Page 10: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.


An alignment that BLAST can’t find

1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG

|| | || || || | || || || || | ||| |||||| | | || | ||| |

1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT

| || || || ||| || | |||||| || | |||||| ||||| | |

61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC

|||| || ||||| || || | | |||| || |||

121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC

Page 11: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.


An Alignment BLAST Can Make

Solution: compare protein sequences; BLASTX

BLAST 2 Sequences (blastx) output:

Score = 290 bits (741), Expect = 7e-77
Identities = 147/331 (44%), Positives = 206/331 (61%), Gaps = 8/331 (2%)
Frame = +3

Page 12: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Scoring Systems - Nucleotides

Identity matrix

     A   G   C   T
A   +1  -3  -3  -3
G   -3  +1  -3  -3
C   -3  -3  +1  -3
T   -3  -3  -3  +1

CAGGTAGCAAGCTTGCATGTCA
|| ||||||||||||  |||||     raw score = 19-9 = 10
CACGTAGCAAGCTTG-GTGTCA
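The raw score on this slide can be checked by walking down the alignment columns. The sketch below uses the +1/-3 identity matrix from the slide; the gap cost of -3 is an assumption, chosen because it is the value that reproduces the slide's 19 - 9 = 10 (19 matches, two mismatches and one gapped column).

```python
MATCH = 1       # from the identity matrix on the slide
MISMATCH = -3   # from the identity matrix on the slide
GAP = -3        # assumption: a gapped column is charged like a mismatch


def raw_score(top: str, bottom: str) -> int:
    """Sum column scores of a pairwise alignment given as two gapped strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += GAP
        elif a == b:
            score += MATCH
        else:
            score += MISMATCH
    return score


score = raw_score("CAGGTAGCAAGCTTGCATGTCA", "CACGTAGCAAGCTTG-GTGTCA")
print(score)
```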

Page 13: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Scoring Systems - Proteins

Position Independent Matrices

PAM Matrices (Percent Accepted Mutation)
• Derived from observation; small dataset of alignments
• Implicit model of evolution
• All calculated from PAM1
• PAM250 widely used

BLOSUM Matrices (BLOck SUbstitution Matrices)
• Derived from observation; large dataset of highly conserved blocks
• Each matrix derived separately from blocks with a defined percent identity cutoff
• BLOSUM62 - default matrix for BLAST

Position Specific Score Matrices (PSSMs)

PSI- and RPS-BLAST

Page 14: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

BLOSUM62

A   4
R  -1  5
N  -2  0  6
D  -2 -2  1  6
C   0 -3 -3 -3  9
Q  -1  1  0  0 -3  5
E  -1  0  0  2 -4  2  5
G   0 -2  0 -1 -3 -2 -2  6
H  -2  0  1 -1 -3  0  0 -2  8
I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
X   0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X

Common amino acids have low weights

Rare amino acids have high weights

Negative for less likely substitutions

Positive for more likely substitutions

Page 15: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Gapped Alignments

• Gapping provides more biologically realistic alignments
• Statistical behavior is not completely understood for gapped alignments
• Gapped BLAST parameters must be found by simulations for each matrix
• Affine gap costs = -(a+bk)
  a = gap open penalty
  b = gap extend penalty
  k = gap length
  A gap of length 1 receives the score -(a+b)
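The affine gap formula is easy to tabulate. In this sketch the function name `gap_score` is illustrative, and the values a = 11, b = 1 are assumptions picked because they give -12 for a one-residue gap, which is the gap score shown for BLOSUM62 on the Scores slide.

```python
def gap_score(length: int, gap_open: int, gap_extend: int) -> int:
    """Affine gap score -(a + b*k) for a gap of k residues."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return -(gap_open + gap_extend * length)


# Assumed parameters: a = 11 (open), b = 1 (extend).
for k in (1, 2, 5, 10):
    print(k, gap_score(k, gap_open=11, gap_extend=1))
```

Opening a gap is expensive and lengthening it is cheap, so one gap of ten residues (-21 here) costs far less than ten separate one-residue gaps (-120).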

Page 16: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Scores

              V    D    S    -    C    Y
              V    E    T    L    C    F
BLOSUM62     +4   +2   +1  -12   +9   +3   = 7
PAM30        +7   +2    0  -10  +10   +2   = 11

Simply add the scores for each pair of aligned residues

Different matrices produce different scores!
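The two totals are plain column sums. A minimal check, using only the per-column values printed on the slide (the variable names are illustrative):

```python
# Per-column scores for VDS-CY aligned to VETLCF, as printed on the slide.
column_scores = {
    "BLOSUM62": [4, 2, 1, -12, 9, 3],
    "PAM30": [7, 2, 0, -10, 10, 2],
}

totals = {matrix: sum(scores) for matrix, scores in column_scores.items()}
print(totals)
```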

Page 17: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Local Alignment Statistics

High scores of local alignments between two random sequences follow the Extreme Value Distribution

[Figure: number of alignments plotted against score]

(applies to ungapped alignments)

$E = K m n e^{-\lambda S}$

$E = m n 2^{-S'}$

K = scale for search space
λ = scale for scoring system
S' = bit score = $(\lambda S - \ln K)/\ln 2$

Expect Value
E = number of database hits you expect to find by chance

[Labels on the equation: size of database, your score, expected number of random hits]
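The two forms of the Expect value are the same quantity written with raw and bit scores, which a short numeric check makes concrete. This is a sketch under stated assumptions: the values of K, λ, m and n below are illustrative placeholders, not parameters taken from the slides, and the function names are made up.

```python
import math


def bit_score(raw: float, lam: float, K: float) -> float:
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw - math.log(K)) / math.log(2)


def evalue_from_raw(raw: float, m: float, n: float, lam: float, K: float) -> float:
    """E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * raw)


def evalue_from_bits(bits: float, m: float, n: float) -> float:
    """E = m * n * 2 ** (-S')."""
    return m * n * 2.0 ** (-bits)


# Illustrative values only (assumptions, not from the slides).
lam, K = 0.318, 0.134
m, n = 250, 1_000_000_000   # query length and database length
raw = 92

bits = bit_score(raw, lam, K)
e_raw = evalue_from_raw(raw, m, n, lam, K)
e_bits = evalue_from_bits(bits, m, n)
print(round(bits, 1), e_raw, e_bits)
```

Because the bit score absorbs K and λ, bit scores from searches run with different scoring systems can be compared directly, while raw scores cannot.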

Page 18: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Advanced BLAST Options: Nucleotide

Example Entrez Queries
nucleotide all[Filter] NOT mammalia[Organism]
green plants[Organism]
biomol mrna[Properties]
gbdiv est[Properties] AND rat[organism]

Other Advanced
-e 10000 expect value
-v 2000 descriptions
-b 2000 alignments

Page 19: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Advanced BLAST Options: Protein

Matrix Selection
• PAM30 -- most stringent
• BLOSUM45 -- least stringent

Example Entrez Queries
proteins all[Filter] NOT mammalia[Organism]
green plants[Organism]
srcdb refseq[Properties]

Other Advanced
-e 10000 expect value
-v 2000 descriptions
-b 2000 alignments

Limit by taxon
Mus musculus[Organism]
Mammalia[Organism]
Viridiplantae[Organism]

Page 20: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Low Complexity Filtering

Filtered / Unfiltered

sp|P27476|NSR1_YEAST NUCLEAR LOCALIZATION SEQUENCE BINDING PROTEIN (P67)
Length = 414

Score = 40.2 bits (92), Expect = 0.013
Identities = 35/131 (26%), Positives = 56/131 (42%), Gaps = 4/131 (3%)

Query: 362 STTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDAFLQPLSKPLS---SQPQAIVTEDKTD 418
S++S SSS+S SS + + ++S + + S S S+ + E K
Sbjct: 29 SSSSSESSSSSSSSSESESESESESESSSSSSSSDSESSSSSSSDSESEAETKKEESKDS 88

Page 21: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Low Complexity Filter

>gi|20140146|sp|Q96RF0|SNXI_HUMAN Sorting nexin 18
Length = 628

Score = 1048 bits (2710), Expect = 0.0
Identities = 528/628 (84%), Positives = 528/628 (84%)

Query: 1   MALRARALYDFRSENPGEISLREHEVLSLCSEQDIEGWLEGVNSRGDRGLFPASYVQVIR 60
           MALRARALYDFRSENPGEISLREHEVLSLCSEQDIEGWLEGVNSRGDRGLFPASYVQVIR
Sbjct: 1   MALRARALYDFRSENPGEISLREHEVLSLCSEQDIEGWLEGVNSRGDRGLFPASYVQVIR 60

Query: 61  XXXXXXXXXXXXXXXXXXXNVPPGGFEXXXXXXXXXXXXXXXXXXXXXXXXXXXXXSTFQ 120
                              NVPPGGFE                             STFQ
Sbjct: 61  APEPGPAGDGGPGAPARYANVPPGGFEPLPVAPPASFKPPPDAFQALLQPQQAPPPSTFQ 120

. . .

[Callout: the runs of X in the query mark low complexity sequence]

Page 22: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.


Neighbors: Precomputed BLAST

Nucleotide

Protein

Entrez Related Sequences produces a list of sequences sorted by BLAST score, but with no alignment details.

Page 23: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Blink – Protein BLAST Alignments

• Lists only 200 hits
• List is nonredundant

Page 24: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.


Blink – Best Hits

Page 25: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Megablast: NCBI’s Genome Annotator

• Long alignments of similar DNA sequences
• Greedy algorithm
• Concatenation of query sequences
• Faster than blastn; less sensitive

Page 26: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

MegaBLAST

AI217550
AI251192
AI254381
BE645079

C:\seq\hs.4.fsa

> 1133045 gnl|UG|Hs#S1133045 qd43b11.x1 Homo sapiens cDNA, 3' end
CATGTAAGCCATTTATTGGTTTGTTTTAAAAATATGTATTTTATTTATACATGAAGTTTGGTGAGAAGTGCTCGATTAGTTCAGACAACATCTGGCACTTGATGTCTGTCCTTCCCTCCTTTTTCCTACTCTCTTCTCCCCTCCTGCTGGTCATTGTGCAGTTCTGGAAATTAAAAAGGTGACAGCCAGGCTAAAAGCTAAGGGTTGGGTCTAGCTCACCTCCCACCCCCAACCACACCGTCTGCAGCCAGCCCCAGGCACCTGTCTCAAAGCTCCCGGGCTGTCCACACACACAAAAACCACAGTCTCCTTCCGGCCAGCTGGGCTGGCAGCCCGACCTGC
> 1141828 gnl|UG|Hs#S1141828 qv37f11.x1 Homo sapiens cDNA, 3' end
GAGAAGACGACAGAAGGGGAGAAGAGAGTAGGAAAAAGGAGGGAAGGACAGACATCAAGTGCCAGATGTTGTCTGAACTAATCGAGCACTTCTCACCAAACTTCATGTATAAATAAAATACATATTTTTAAAACAAACCAATAAATGGCTTACATCAAAAAAAAAAAAAAAAAAAAAAAAGTCGTATCGATGT
> 1145899 gnl|UG|Hs#S1145899 qv33c06.x1 Homo sapiens cDNA, 3' end
GAGAAGACGACAGAAGGGGAGAAGAGAGTAGGAAAAAGGAGGGAAGGACAGACATCAAGTGCCAGATGTTGTCTGAACTAATCGAGCACTTCTCACCAAACTTCATGTATAAATAAAATACATATTTTTAAAACAAACCAATAAATGGCTTACATCAAAAAAAAAAAAAAAAAAAAAAAAGTCGTATCGATGT
> 2291670 gnl|UG|Hs#S2291670 7e65f04.x1 Homo sapiens cDNA, 3' end
TTTCATGTAAGCCATTTATTGGTTTGTTTTAAAAATATGTATTTTATTTATACATGAAGTTTGGTGAGAAGTGCTCGATTAGTTCAAACAACATCTGGCACTTGATGTCTGTCCTTCCCTCCTTTTTCCTACTCTCTTCTCCCCTCCTGCTGGTCATTGTGCAGTTCTGGAAATTAAAAAGGTGACAGCCAGGCTAAAAGCTAAGGGTTGGGTCTAGCTCACCTCCCACCCCCAACCACACCGTCTGCAGCCAGCCCCAGGCACCTGTCTCAAAGCTCCCGGGCTGTCCACACACACAAAAACCACAGTCTCCTTCCGGCCAGCTGGGCTGGCAGCCCGACCTGCCTCCCAACCGCATTCCTGCCTGTGTAGCAGGCGGTGAGCACCCAGAAGGGGCACATACCTCTCCAAGCCTTGAAAGCAAAGCATGGAGATCTACAAAAATAGGATTTCCACTTGGAGAAATGTCGCTGGGACAGT

Page 27: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.


Discontiguous Megablast

• Uses discontiguous word matches

• Better for cross-species comparisons

Page 28: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Templates for Discontiguous MegaBLAST

W = 11, t = 16, coding:     1101101101101101
W = 11, t = 16, non-coding: 1110010110110111
W = 12, t = 16, coding:     1111101101101101
W = 12, t = 16, non-coding: 1110110110110111
W = 11, t = 18, coding:     101101100101101101
W = 11, t = 18, non-coding: 111010010110010111
W = 12, t = 18, coding:     101101101101101101
W = 12, t = 18, non-coding: 111010110010110111
W = 11, t = 21, coding:     100101100101100101101
W = 11, t = 21, non-coding: 111010010100010010111
W = 12, t = 21, coding:     100101101101100101101
W = 12, t = 21, non-coding: 111010010110010010111

Ma, B., Tromp, J., Li, M., "PatternHunter: faster and more sensitive homology search", Bioinformatics 2002 Mar;18(3):440-5
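A template can be read as a mask over a window of t bases: under the usual spaced-seed convention (assumed here), a 1 marks a position that must match and a 0 marks a position that is ignored, so W counts the 1s. The sketch below applies the first coding template to two windows that differ only at a masked position; the function name and the toy sequences are illustrative.

```python
def discontiguous_word(seq: str, start: int, template: str) -> str:
    """Return the bases of seq[start:start+t] that fall under a '1' in the template."""
    window = seq[start:start + len(template)]
    if len(window) < len(template):
        raise ValueError("window runs past the end of the sequence")
    return "".join(base for base, flag in zip(window, template) if flag == "1")


CODING_W11_T16 = "1101101101101101"  # first template on the slide

seq_a = "ACGTACGTACGTACGT"
seq_b = "ACTTACGTACGTACGT"  # differs from seq_a only at index 2, a '0' position

word_a = discontiguous_word(seq_a, 0, CODING_W11_T16)
word_b = discontiguous_word(seq_b, 0, CODING_W11_T16)
print(word_a, word_b, word_a == word_b)
```

In the coding templates most 0s fall on every third position, so a difference confined to those positions does not break the word match, which fits the slide's point that discontiguous words work better for cross-species comparisons.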

Page 29: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Nucleotide vs. Protein BLAST

        aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc
Human:   N  R  V  T  V  V  L  G  A  Q  W  G  D  E  G
         +  +  V  +     V  L  G     Q  W  G  D  E  G
A.th.:   S  Q  V  S  G  V  L  G  C  Q  W  G  D  E  G
        agtcaagtatctggtgtactcggttgccaatggggagatgaaggt

Comparing ADSS from H. sapiens and A. thaliana

BLASTp finds three matching words

BLASTn finds no match, because there are no 7 bp words

Protein searches are generally more sensitive than nucleotide searches.
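The comparison on this slide can be replayed in code: translate both DNA fragments, then look for exact words the two sequences share at the protein and nucleotide level. This is a sketch, not BLAST itself. It assumes the standard genetic code (written in the compact TCAG-ordered form) and counts only exact shared words, whereas BLASTp also accepts neighborhood words; the helper names are illustrative.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' = stop).
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate DNA in frame +1, ignoring any trailing partial codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def shared_words(a: str, b: str, size: int) -> set:
    """Exact words of the given size that occur in both sequences."""
    words_a = {a[i:i + size] for i in range(len(a) - size + 1)}
    words_b = {b[i:i + size] for i in range(len(b) - size + 1)}
    return words_a & words_b


human_dna = "aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc"
ath_dna = "agtcaagtatctggtgtactcggttgccaatggggagatgaaggt"

human_protein = translate(human_dna)
ath_protein = translate(ath_dna)
protein_hits = shared_words(human_protein, ath_protein, 3)
nucleotide_hits = shared_words(human_dna, ath_dna, 7)

print(human_protein, ath_protein)
print("shared 3-residue words:", sorted(protein_hits))
print("shared 7-base words:", sorted(nucleotide_hits))
```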

Page 30: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Translated BLAST

Program   Query                     Database
blastx    nucleotide (translated)   protein
tblastn   protein                   nucleotide (translated)
tblastx   nucleotide (translated)   nucleotide (translated)

Particularly useful for nucleotide sequences without protein annotations, such as ESTs or genomic DNA

Page 31: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Genomic BLAST

• These pages provide customized nucleotide and protein databases for each genome
• If a Map Viewer is available, the BLAST hits can be viewed on the maps

Page 32: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.


BLAST the Chicken Genome

Program

Accession for human TPO mRNA

Page 33: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.


BLAST Hit on the Genome

Page 34: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.


BLASTn Hit on the Map Viewer

Page 35: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.


TBLASTN Results Using NP_000538

Page 36: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Linking Protein Sequence, Structure, and Function

sequence function (pfam, smart)

Structure

PSI-BLAST, RPS-BLAST

VAST

BLASTp sequence structure

structure structure

sequence structure + function (cd)

Page 37: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Position Specific Substitution Rates

Active site serine

Weakly conserved serine

Page 38: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Position Specific Score Matrix (PSSM)

        A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
206 D   0 -2  0  2 -4  2  4 -4 -3 -5 -4  0 -2 -6  1  0 -1 -6 -4 -1
207 G  -2 -1  0 -2 -4 -3 -3  6 -4 -5 -5  0 -2 -3 -2 -2 -1  0 -6 -5
208 V  -1  1 -3 -3 -5 -1 -2  6 -1 -4 -5  1 -5 -6 -4  0 -2 -6 -4 -2
209 I  -3  3 -3 -4 -6  0 -1 -4 -1  2 -4  6 -2 -5 -5 -3  0 -1 -4  0
210 S  -2 -5  0  8 -5 -3 -2 -1 -4 -7 -6 -4 -6 -7 -5  1 -3 -7 -5 -6
211 S   4 -4 -4 -4 -4 -1 -4 -2 -3 -3 -5 -4 -4 -5 -1  4  3 -6 -5 -3
212 C  -4 -7 -6 -7 12 -7 -7 -5 -6 -5 -5 -7 -5  0 -7 -4 -4 -5  0 -4
213 N  -2  0  2 -1 -6  7  0 -2  0 -6 -4  2  0 -2 -5 -1 -3 -3 -4 -3
214 G  -2 -3 -3 -4 -4 -4 -5  7 -4 -7 -7 -5 -4 -4 -6 -3 -5 -6 -6 -6
215 D  -5 -5 -2  9 -7 -4 -1 -5 -5 -7 -7 -4 -7 -7 -5 -4 -4 -8 -7 -7
216 S  -2 -4 -2 -4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4  7 -2 -6 -5 -5
217 G  -3 -6 -4 -5 -6 -5 -6  8 -6 -8 -7 -5 -6 -7 -6 -4 -5 -6 -7 -7
218 G  -3 -6 -4 -5 -6 -5 -6  8 -6 -7 -7 -5 -6 -7 -6 -2 -4 -6 -7 -7
219 P  -2 -6 -6 -5 -6 -5 -5 -6 -6 -6 -7 -4 -6 -7  9 -4 -4 -7 -7 -6
220 L  -4 -6 -7 -7 -5 -5 -6 -7  0 -1  6 -6  1  0 -6 -6 -5 -5 -4  0
221 N  -1 -6  0 -6 -4 -4 -6 -6 -1  3  0 -5  4 -3 -6 -2 -1 -6 -1  6
222 C   0 -4 -5 -5 10 -2 -5 -5  1 -1 -1 -5  0 -1 -4 -1  0 -5  0  0
223 Q   0  1  4  2 -5  2  0  0  0 -4 -2  1  0  0  0 -1 -1 -3 -3 -4
224 A  -1 -1  1  3 -4 -1  1  4 -3 -4 -3 -1 -2 -2 -3  0 -2 -2 -2 -3

Serine is scored differently in these two positions

Active site nucleophile
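Unlike BLOSUM62, a PSSM is indexed by position as well as by residue, so the same amino acid can earn different scores at different sites. The sketch below copies two rows of the table above. Which two serine rows the slide highlights is not recoverable from the transcript, so rows 211 and 216 are an assumption (216 sits in the G-D-S-G-G-P stretch and carries the higher serine score); the function name is illustrative.

```python
COLUMNS = "ARNDCQEGHILKMFPSTWYV"  # column order of the PSSM above

# Two rows copied from the table; choosing 211 and 216 is an assumption.
PSSM_ROWS = {
    211: [4, -4, -4, -4, -4, -1, -4, -2, -3, -3, -5, -4, -4, -5, -1, 4, 3, -6, -5, -3],
    216: [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6, -6, -3, -5, -6, -4, 7, -2, -6, -5, -5],
}


def pssm_score(position: int, residue: str) -> int:
    """Score for placing `residue` at `position` according to the PSSM rows."""
    return PSSM_ROWS[position][COLUMNS.index(residue)]


for position in (211, 216):
    row = {residue: pssm_score(position, residue) for residue in "SAT"}
    print(position, row)
```

At position 211 serine, alanine and threonine score about the same, so substitutions are tolerated; at position 216 only serine scores well.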

Page 39: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

PSI-BLAST

Create your own PSSM: Confirming relationships of purine nucleotide metabolism proteins

[Diagram labels: query, BLOSUM62, PSSM, Alignment, Alignment]

Page 40: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.


PSI BLAST

>gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE AMINOH
MAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGFLAKFDYYVIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVDLVNQGLQEQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAYEGAVKNGRTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGAWDPKTTHVRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKKELLERLY

e value cutoff for PSSM

Page 41: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.


PSI Results: Initial BLAST Run

Page 42: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

First PSSM Search

Other purine nucleotide metabolizing enzymes not found by ordinary BLAST

Page 43: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Third PSSM Search: Convergence

Just below threshold, another nucleotide metabolism enzyme

Page 44: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Entrez Domains (CDD)

Pfam (Sanger), e.g. pfam01234
Pfam-A seeds: HMM based models representing a wide variety of functional domains derived from SWISS-PROT

SMART (EMBL), e.g. smart00123
HMM based models originally concentrating on eukaryotic signaling domains, now expanding

COG (NCBI), e.g. COG0123
BLAST based alignments derived from complete proteomes of prokaryotes

CD (NCBI), e.g. cd01234
NCBI curated domains based on sequence and structural alignments

Categories shown: Single Domains, Protein Families

Page 45: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.


Protein Links: Domains

Page 46: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.


Results of a CD-Search

CD

SMART

Pfam

Click on a colored bar to align your sequence to the CD

Page 47: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.


CDD Record – heme peroxidases

aligned query

red = high conservation

blue = low conservation

Page 48: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.


Curated CD Record

Curated CDs (cd12345) are based on sequence and structure alignments

Annotated features

Structural evidence

aligned query

Page 49: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.


Blink: Sequence to Structure

related structures

Page 50: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Related Structures

Cn3D

Page 51: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Entrez Structure

• Derived from experimentally determined PDB records
• Add value to PDB records by:
  – Adding explicit chemical bonding information
  – Validating and indexing the sequences
  – Annotating 3D domains and secondary structure
  – Adding links to CDD, Taxonomy, Pubmed
  – Converting PDB data to ASN.1
• Structure neighbors determined by Vector Alignment Search Tool (VAST)

MMDB: Molecular Modeling Database

Structure

Page 52: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.


Structure Summary Page

Conserved Domains

VAST Neighbors for chain C (domain 0)

Cn3D

VAST Neighbors for domain 2

Page 53: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

VAST: Structure Neighbors

Vector Alignment Search Tool

For each 3D domain, locate SSEs (secondary structure elements), and represent them as individual vectors.

[Figure: Human IL-4, with secondary structure elements numbered 1 to 6]

VAST uses 3D Domains only! Whole polypeptides are assigned 3D domain 0 (zero).

Page 54: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.


VAST Neighbors

1D2V

1D2V

1Q4G

3D domains!

Page 55: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Viewing a VAST Alignment

RMSD in Angstroms

Sequence percent identity

VAST P value

Cn3D

Page 56: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Submitting a PDB File to VAST

New!

• Redesigned interface!
• This is the best way to convert PDB into MMDB format!

Page 57: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Entrez PubChem

New!

PC Substance: Primary database of chemical samples

PC Compound: Derived database of known chemicals from PC Substance records

PC BioAssay: Primary database of bioactivity screens of samples in PC Substance

Page 58: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.


Links from Structure

N-acetylglucosamine

heme

mannose

fucose

Page 59: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Search for thyroxine

ChemID     24
KEGG        4
DTP-NCI     3
NIST        3
Biocyc      2
BIND        2
Chembank    2
NIAID       1
TOTAL      41

Page 60: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Sequence Polymorphisms

SNP (General Polymorphisms)
• Primary database of submitted SNPs
• Curated database of reference SNPs
• Contains more than just SNPs:
  • True SNPs
  • MNP (multiple nucleotide)
  • Insertions
  • Deletions
  • Microsatellites
  • Mixed
  • No variation (constant)

OMIM (Human Phenotypes)
• Clinical literature database
• Curated at Johns Hopkins Univ
• Links human genes and genetic disorders to human disease
• Lists allelic variants that have clinical consequences

Variations in SNP are not necessarily in OMIM, and vice versa!

Page 61: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Linking to SNP

Links to SNP are also available from Nucleotide and Protein

Entrez Gene - TPO

Page 62: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.


Entrez SNP

primary data: ss#

SNP UID: rs#

Page 63: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.


Find Non-synonymous SNPs

#7 AND coding nonsynon[Function Class]

Function Class

Page 64: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.


Non-synonymous TPO SNPs

Link to Map Viewer

View all SNPs in locus

Link to related 3D structures

Page 65: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.


GeneView in dbSNP

Page 66: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

Links to OMIM

Links to SNP are also available from Nucleotide and Protein

Entrez Gene - TPO

Page 67: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.


OMIM Record

Page 68: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.


Explore a Disease SNP

799

Page 69: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.


Curated CD Record

E799

Cn3D

Page 70: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.


For More Information…

Page 71: NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005.

For More Information…

E-mail addresses
• General Help [email protected]
• BLAST [email protected]

The NCBI Education Page
http://www.ncbi.nih.gov/Education/index.html

The (free!) NCBI Newsletter
http://www.ncbi.nih.gov/About/newsletter.html

The NCBI Handbook
Follow the link from the NCBI Home Page