Pattern databases
Gopalan Vivek


Pattern databases - topics

Definition
Applications
Classifications
Common Databases
Conclusions


Pattern databases – definition

Secondary databases of conserved patterns, derived from multiple sequence alignments of sequences held in primary databases such as GenBank, EMBL, DDBJ, SWISS-PROT/TrEMBL, PIR, etc.

Primary databases (SWISS-PROT - protein; GenBank - DNA): millions of sequences
-> Pattern extraction by multiple sequence alignment
-> Pattern databases: thousands of patterns


Pattern Databases - Applications

Function prediction for protein/nucleotide sequences, even when sequence similarity is low (<25%).

Useful for classifying protein sequences into families.

Searching a pattern database takes less time than searching a primary database, since a pattern is a compact representation of features shared by many sequences.

Pattern databases - classification

Multiple Sequence Alignment (MSA)

Family-based databases – consider the full MSA

Motif-based databases – consider local conserved regions of the MSA (labelled Motif-1, Motif-3 in the figure)

Pattern Databases – Protein

Motif based: PROSITE, PRINTS, BLOCKS

Family based: ProDom, PIR-ALN, ProtoMap, DOMO, ProClass, Pfam, SMART, TIGRFAMs, SBASE, SYSTERS

InterPro - integrated resource of protein families and sites (PROSITE, PRINTS, BLOCKS, Pfam, ProDom)

Pattern databases - common databases

– PROSITE, PRINTS, BLOCKS & SMART (motif based)
– MetaFam, InterPro (integrated databases)

Databases – General Tips

1. Source

2. Input formats & parameters

3. Output formats

4. Quality of the data

5. Other details – updates, coverage, speed, download, reference, methods etc.

Focus

To search the pattern databases for the enzyme “alkaline phosphatase” using their text or keyword search options.

To analyze the quality of the results from each of these databases – sensitivity, specificity.

Sequence and pattern searches – in the afternoon’s practical.
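The two quality measures named in the focus points can be computed from the hit counts reported for a pattern. The Python sketch below uses the standard definitions; the function names and the counts are made up for illustration and are not results for any real database entry. Precision is included because "specificity" is sometimes used loosely to mean TP / (TP + FP) when pattern hits are discussed.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true family members that the pattern detects: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of non-members that the pattern correctly rejects: TN / (TN + FP)."""
    return tn / (tn + fp)


def precision(tp: int, fp: int) -> float:
    """Fraction of reported hits that are true family members: TP / (TP + FP)."""
    return tp / (tp + fp)


if __name__ == "__main__":
    # Hypothetical counts, chosen only to show the arithmetic.
    tp, fp, fn, tn = 45, 5, 5, 945
    print(f"sensitivity = {sensitivity(tp, fn):.3f}")
    print(f"specificity = {specificity(tn, fp):.3f}")
    print(f"precision   = {precision(tp, fp):.3f}")
```

A pattern that misses family members lowers sensitivity; a pattern that also matches unrelated sequences lowers specificity and precision.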

PROSITE http://www.expasy.org/prosite/

PROSITE consists of biologically significant protein sites, patterns and profiles that help to reliably identify which known protein family (if any) a new sequence belongs to.

Based on SWISSPROT/TrEMBL

Text Search

Sequence Scanner

ID and text Search

Details about the pattern/profile

PROSITE ID

PROSITE Pattern

Result: PROSITE Documentation page

[IV]-x-D-S-[GAS]-[GASC]-[GAST]-[GA]-T [S is the active site residue]
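A pattern written this way can be translated into a regular expression and used to scan a sequence. The Python sketch below covers the common PROSITE pattern elements (a single residue, `x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` repeats, and the `<` / `>` terminal anchors). The function names and the test sequence are made up for illustration; this is not the scanner PROSITE itself uses.

```python
import re

# One pattern element: optional '<', a core, an optional repeat, optional '>'.
ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = match.groups()
        if core == "x":
            regex = "."                      # any residue
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"  # any residue except these
        else:
            regex = core                     # single residue or [allowed set]
        if low:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + regex + ("$" if end else ""))
    return "".join(parts)


def scan(pattern: str, sequence: str) -> list:
    """Return (1-based start, matched segment) for each non-overlapping hit."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence.upper())]


if __name__ == "__main__":
    pattern = "[IV]-x-D-S-[GAS]-[GASC]-[GAST]-[GA]-T"
    print(prosite_to_regex(pattern))
    # Made-up sequence containing one segment that fits the pattern.
    print(scan(pattern, "MKTVADSAASATPL"))
```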

Numerical Results

PROSITE Pattern

Detailed View - page 1

Detailed View - page 2

True Positives

False Positives

View entry in raw text format (no links)

Raw Text Format – PROSITE Format

ID   Identification
AC   Accession number
DT   Date
DE   Short description
PA   Pattern
MA   Matrix/profile
RU   Rule
NR   Numerical results
CC   Comments
DR   Cross-references to SWISS-PROT
3D   Cross-references to PDB
DO   Pointer to the documentation file
//   Termination line
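The line codes listed above make the raw text format easy to read programmatically. The Python sketch below assumes each line starts with its two-character code followed by spaces and the data, and that `//` closes an entry. The sample entry is made up (its ID and accession are placeholders); only the pattern line is taken from the slide above.

```python
LINE_CODES = {
    "ID": "Identification",
    "AC": "Accession number",
    "DT": "Date",
    "DE": "Short description",
    "PA": "Pattern",
    "MA": "Matrix/profile",
    "RU": "Rule",
    "NR": "Numerical results",
    "CC": "Comments",
    "DR": "Cross-references to SWISS-PROT",
    "3D": "Cross-references to PDB",
    "DO": "Pointer to the documentation file",
}

# Made-up entry for illustration; ID and AC are placeholders.
SAMPLE_ENTRY = """\
ID   EXAMPLE_SITE; PATTERN.
AC   PS00000;
DE   Made-up entry used only to illustrate the layout.
PA   [IV]-x-D-S-[GAS]-[GASC]-[GAST]-[GA]-T.
//
"""


def parse_entries(text: str) -> list:
    """Split raw text into entries; each entry maps a line code to its lines."""
    entries, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("//"):          # termination line closes the entry
            if current:
                entries.append(current)
            current = {}
            continue
        code, value = line[:2], line[2:].strip()
        if code not in LINE_CODES:
            raise ValueError(f"unknown line code: {code!r}")
        current.setdefault(code, []).append(value)
    if current:                            # tolerate a missing final '//'
        entries.append(current)
    return entries


if __name__ == "__main__":
    for entry in parse_entries(SAMPLE_ENTRY):
        for code, values in entry.items():
            print(f"{LINE_CODES[code]}: {' '.join(values)}")
```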

PROSITE Profiles

Highly degenerate protein structural and functional domains – immunoglobulin domains, SH2 and SH3 domains.

Consensus sequences of repetitive DNA elements – SINEs, LINEs.

Basic gene expression signals – promoter elements, RNA processing signals, translational initiation sites.

DNA-binding protein motifs.

Protein and nucleic acid compositional domains – glutamine-rich activation domains, CpG islands.

PROSITE - features

Completeness
High specificity
Documentation
Periodic reviewing
Parallel update with SWISS-PROT (primary database)

1. Multiple sequence alignment; the conserved region (motif) is:
   cydeggis
   cyedggis
   cyeeggit
   cyhgdggs
   cyrgdgnt
2. Find 4-5 functionally conserved residues
3. Core pattern: C-Y-x(2)-[DG]-G-x-[ST]
4. Scan SWISS-PROT with the pattern
5. More false positives?
   YES - increase the sequence length of the pattern and scan again
   NO - add the pattern to the PROSITE DB
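The step from aligned sequences to a core pattern can be sketched in Python. The sketch assumes the 40-letter string on the slide is five ungapped 8-residue segments, and it uses a simple rule that is an assumption for illustration only: a fully conserved column becomes that residue, a column with at most two different residues becomes a bracketed set, and anything more variable becomes `x`. Real PROSITE patterns are curated by hand, so this is not the PROSITE procedure itself. Repeats are written `x(2)`, the PROSITE notation for what the slide shows as `x2`.

```python
def column_to_element(column: str, max_class_size: int = 2) -> str:
    """Turn one alignment column into a pattern element."""
    residues = sorted(set(column.upper()))
    if len(residues) == 1:
        return residues[0]                      # fully conserved position
    if len(residues) <= max_class_size:
        return "[" + "".join(residues) + "]"    # small set of alternatives
    return "x"                                  # too variable: any residue


def derive_pattern(alignment: list, max_class_size: int = 2) -> str:
    """Build a PROSITE-style pattern from ungapped, equal-length segments."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    elements = [
        column_to_element("".join(column), max_class_size)
        for column in zip(*alignment)
    ]
    merged, run = [], 0
    for element in elements + [None]:           # None flushes a trailing x-run
        if element == "x":
            run += 1
            continue
        if run:
            merged.append("x" if run == 1 else f"x({run})")
            run = 0
        if element is not None:
            merged.append(element)
    return "-".join(merged)


if __name__ == "__main__":
    alignment = ["cydeggis", "cyedggis", "cyeeggit", "cyhgdggs", "cyrgdgnt"]
    print(derive_pattern(alignment))
```

With these five segments the rule reproduces the core pattern on the slide. Raising `max_class_size` makes the pattern stricter about which residues it lists and so shifts the balance between false positives and false negatives.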

PRINTS http://bioinf.man.ac.uk/dbbrowser/PRINTS/

Protein fingerprint database.

Fingerprint - a set of motifs that represents the most conserved regions of a multiple sequence alignment.

Better diagnostic reliability than single-motif methods.

Source – SWISSPROT/TrEMBL

1. Multiple sequence alignment
2. Identification of ALL the conserved regions (motifs), for example:
   cydeggis
   cyedggis
   cyeeggit
   cyhgdggs
3. Creation of frequency matrices, one per motif; the set of motifs forms the fingerprint
4. Iterative scanning of SWISS-PROT/TrEMBL with the frequency matrices until convergence
5. PRINTS DB
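A frequency matrix for one motif, and a single scanning pass with it, can be sketched in Python as follows. The sketch assumes the 32-letter string on the slide is four ungapped 8-residue segments. The scoring rule (sum of per-column residue frequencies) and the query sequence are assumptions for illustration; PRINTS uses its own scoring, and the iterative step (adding new true matches to the motif and rebuilding the matrix until no new matches appear) is not shown.

```python
from collections import Counter


def frequency_matrix(motif_alignment: list) -> list:
    """One dict per column, mapping residue -> relative frequency."""
    n = len(motif_alignment)
    columns = zip(*(seq.upper() for seq in motif_alignment))
    return [
        {residue: count / n for residue, count in Counter(column).items()}
        for column in columns
    ]


def score_window(matrix: list, window: str) -> float:
    """Sum the frequency of the observed residue at each motif position."""
    return sum(col.get(res, 0.0) for col, res in zip(matrix, window.upper()))


def best_hit(matrix: list, sequence: str) -> tuple:
    """Return (0-based start, score) of the best-scoring window."""
    width = len(matrix)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the motif")
    scored = [
        (start, score_window(matrix, sequence[start:start + width]))
        for start in range(len(sequence) - width + 1)
    ]
    return max(scored, key=lambda item: item[1])


if __name__ == "__main__":
    motif = ["cydeggis", "cyedggis", "cyeeggit", "cyhgdggs"]
    matrix = frequency_matrix(motif)
    # Made-up query sequence containing one motif-like segment.
    print(best_hit(matrix, "MKACYEEGGISLP"))
```

A fingerprint would hold one such matrix per motif, and a sequence would be assigned to the family only when the motifs match in the expected order and spacing.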

Database ID, no. of motifs and text search

Motif scanner (for searching a sequence or pattern against the PRINTS database)

Page 1 for ‘alkaline phosphatase’ entry in PRINTS

Documentation, links & references

Page 2

Fingerprint details

Sequence summary

Page 3

Motif no. 1

Motif no. 2

“Raw” motif

SWISSPROT IDs

Start and interval between motifs in the fingerprint

BLOCKS http://blocks.fhcrc.org/blocks/

Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins.

The BLOCKS database is a collection of blocks representing known protein families that can be used to compare a protein or DNA sequence with documented families of proteins.

Blocks Making

Blocks are produced by the automated PROTOMAT system (Henikoff and Henikoff, 1991), which applies a robust motif-finder to a set of related protein sequences.

http://blocks.fhcrc.org/blocks/blocksdiag.jpg

Sequence, no. of blocks and text searches

Blocks Maker

Page 1

Summary

Search methods using blocks

Page 2

BLOCK - 1

Represents the start position of the block

SWISSPROT ID

Weak blocks - strength < 1100
Strong blocks - strength >= 1100

SMART http://smart.embl-heidelberg.de/

Contains more than 500 domain families found in signaling, extracellular and chromatin-associated proteins.

Each domain is extensively annotated with phyletic distributions, functional class, tertiary structures and functionally important residues.

ID and text Search

ID & sequence Search

Domain & GO search

Alkaline Phosphatase

Results – Alkaline phosphatase “signatures”

PROSITE – represented as a single motif.
PRINTS – represented as 5 motif regions.
BLOCKS – represented as 6 block regions.
SMART – represented as a single profile.

Composite Pattern Databases

MetaFam
InterPro
CDD (Conserved Domain Database)
iProClass

MetaFam & PANAL

MetaFam - http://metafam.ahc.umn.edu/

PANAL – Protein ANALysis tool page of MetaFam http://mgd.ahc.umn.edu/panal/

Protein family classification built with Blocks+, DOMO, Pfam, PIR-ALN, PRINTS, Prosite, ProDom, SBASE, SYSTERS.

PANAL

InterPro http://www.ebi.ac.uk/interpro

Built from PROSITE, PRINTS, Pfam, ProDom, SMART, TIGRFAMs, SWISS-PROT and TrEMBL.

Text- and sequence-based searches.

Member databases shown: PRINTS, PROSITE, Pfam, ProDom, SMART

Detailed View - page 1

Detailed View - page 2

BLOCKS database link

PR – PRINTS
PS – PROSITE
PF – Pfam
PD – ProDom
SM – SMART

T – True Positive
F – False Positive

Range of the motif
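The two-letter prefixes and the T/F flags in the detailed view can be decoded with a simple lookup. The Python sketch below uses only the codes listed above; the function name and the accessions in the example are placeholders, not real entries.

```python
# Prefix of a signature accession -> member database (codes from the slide).
MEMBER_DB_PREFIXES = {
    "PR": "PRINTS",
    "PS": "PROSITE",
    "PF": "Pfam",
    "PD": "ProDom",
    "SM": "SMART",
}

# Status flag shown next to each matched protein (codes from the slide).
MATCH_STATUS = {"T": "True Positive", "F": "False Positive"}


def member_database(accession: str) -> str:
    """Name the member database a signature accession comes from."""
    prefix = accession.strip()[:2].upper()
    return MEMBER_DB_PREFIXES.get(prefix, "unknown")


if __name__ == "__main__":
    # Placeholder accessions, used only to show the lookup.
    for accession in ["PS00000", "PR00000", "PF00000", "XX00000"]:
        print(accession, "->", member_database(accession))
```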


CONCLUSION

Pattern databases are diverse, ranging from short patterns to profiles to complex HMMs.

Each has different strengths and weaknesses.

Each has a different database format.

Best to combine and analyze results from different pattern databases.
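One simple way to combine results is to count how many databases support each family assignment for a query sequence. The Python sketch below assumes the hits from each database have already been mapped to a common family label (for example through an integrated resource); the hit sets and the label "family X" are hypothetical, and the function name is made up for illustration.

```python
from collections import defaultdict


def tally_support(hits_by_database: dict) -> list:
    """Return (family, supporting databases), best-supported family first."""
    support = defaultdict(set)
    for database, families in hits_by_database.items():
        for family in families:
            support[family].add(database)
    # Most databases first; ties are broken alphabetically by family name.
    return sorted(support.items(), key=lambda item: (-len(item[1]), item[0]))


if __name__ == "__main__":
    # Hypothetical hits for one query sequence.
    hits = {
        "PROSITE": {"alkaline phosphatase"},
        "PRINTS": {"alkaline phosphatase"},
        "BLOCKS": {"alkaline phosphatase", "family X"},
    }
    for family, databases in tally_support(hits):
        print(f"{family}: {len(databases)} database(s): {sorted(databases)}")
```

An assignment reported by several independent databases is more convincing than one reported by a single database, which may be a false positive.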