Anastasia Nikolskaya Lai-Su Yeh Protein Information Resource Georgetown University Medical Center


Page 1

Anastasia Nikolskaya, Lai-Su Yeh

Protein Information Resource, Georgetown University Medical Center, Washington, DC

PIR: a comprehensive resource for functional analysis of protein sequences and families

Page 2

PIR Web Site

NEW web site, soon to become public (http://pir.georgetown.edu is currently an old version)

PIR and UniProt web sites interlinked and cross-navigable

PIR-specific features

Text Search, Sequence Search, Classification Database Search

Page 3

iProClass: Integrated Protein Knowledgebase

• Integration of protein family, function, structure
• Rich links (executive summary + hypertext links) to > 90 databases
• Value-added reports for 1.96 million UniProtKB protein entries

Databases linked from iProClass, by category:

• Disease/Variation: OMIM, HapMap, …
• Ontology: GO
• Protein Sequence: UniProt, UniRef, UniParc, RefSeq, GenPept, …
• Gene/Genome: GenBank/EMBL/DDBJ, LocusLink, UniGene, MGI, TIGR
• Gene Expression: GEO, GXD, ArrayExpress, CleanEx, SOURCE
• Structure: PDB, SCOP, CATH, PDBSum, MMDB
• Family: PIRSF, InterPro, Pfam, Prosite, COG
• Interaction: DIP, BIND
• Taxonomy: NCBI Taxon, NEWT
• Protein Expression: Swiss-2DPAGE, PMG
• Literature: PubMed
• Function/Pathway: EC-IUBMB, KEGG, BioCarta, EcoCyc, WIT, …
• Modification: RESID, PhosphoBase

http://pir.georgetown.edu/iproclass

Page 4

Example

Want to find information on chorismate mutases. Specifically:

Start with Bacillus subtilis P19080 = CHMU_BACSU

Relatedness to other chorismate mutases:
- Homology
- Domain architecture
- Is it related to E. coli P07022, a well-studied bifunctional enzyme (P-protein, chorismate mutase/prephenate dehydratase)?

Page 5

iProClass Sequence Report

Page 6

What can we find about “chorismate mutase”?

Protein Analysis: I. Text Search iProClass

Page 7

Text Search Results (I)

UniProt ID

Page 8

Text Search Results (II)

Display options: add or remove columns

Page 9

Text Search Results (III)

Find chorismate mutase(s) from B. subtilis

Page 10

Determining Protein Homology

• Is B. subtilis CM P19080 homologous to E. coli P-protein P07022? To B. subtilis AroA(G) P39912?
• Which domains, if any, in multidomain chorismate mutases does it correspond to?
• What kinds of domain architecture exist in chorismate mutases?

Page 11

Retrieve Proteins by UID in Batch Mode

ID mapping option: can use various non-UniProt IDs

Batch Retrieval

Page 12

Determining Protein Homology: Sequence Search

BLAST, FASTA, SSEARCH

Page 13

BLAST Search Results

A BLAST query with UniProt sequence P19080 returns PIRSF005965 family members as best hits

Page 14

Pre-compiled Related Sequences: saves time

Page 15

BLAST/SSEARCH Results

SSEARCH Alignment

BLAST Alignment

Page 16

Determining Protein Homology: Peptide Search

Page 17

Peptide Search Results

Page 18

Family Classification System: One-Stop Platform for Protein Analysis

• Protein families reflect evolutionary relationships
• Function often follows along the family lines
• Therefore, matching a protein sequence to a protein family provides information about the protein (this requires a highly curated and annotated family)
• Matching against families is faster and often more accurate than searching against a protein database
• Protein classification facilitates sequence and functional analysis of proteins and is used for accurate automatic annotation (PIRSF is used for UniProt annotation)

Page 19

PIRSF Classification System

PIRSF: reflects evolutionary relationships of full-length proteins

Definitions:
• Basic unit = Homeomorphic Family
• Homologous: common ancestry, inferred by sequence similarity
• Homeomorphic: full-length sequence similarity and common domain architecture
• Hierarchy: flexible number of levels with varying degrees of sequence conservation
• Network structure: multiple domain parents

Advantages:
• Annotation of both generic biochemical and specific biological functions
• Accurate propagation of annotation and development of standardized protein nomenclature and ontology

Page 20

Classification levels shown in the diagram:

• Domain Superfamily: one common Pfam domain
• PIRSF Superfamily: 0 or more levels; one or more common domains
• PIRSF Homeomorphic Family: exactly one level; full-length sequence similarity and common domain architecture
• PIRSF Homeomorphic Subfamily: 0 or more levels; functional specialization

Example nodes, grouped by Pfam domain:

PF02153: Prephenate dehydrogenase (PDH)
• PIRSF001499: Bifunctional CM/PDH (T-protein)
• PIRSF006786: PDH, feedback inhibition-insensitive
• PIRSF005547: PDH, feedback inhibition-sensitive

PF01817: Chorismate mutase (CM)
• PIRSF017318: CM of AroQ class, eukaryotic type
• PIRSF001501: CM of AroQ class, prokaryotic type
• PIRSF026640: Periplasmic CM
• PIRSF001500: Bifunctional CM/PDT (P-protein)
• PIRSF001499: Bifunctional CM/PDH (T-protein)

PF02735: Ku70/Ku80 beta-barrel domain
• PIRSF800001: Ku70/80 autoantigen
• PIRSF003033: Ku70 autoantigen
• PIRSF016570: Ku80 autoantigen
• PIRSF006493: Ku, prokaryotic type

PF00219: Insulin-like growth factor binding protein (IGFBP)
• PIRSF001969: IGFBP
• PIRSF500001: IGFBP-1
• PIRSF500006: IGFBP-6
• PIRSF018239: IGFBP-related protein, MAC25 type

PIRSF Classification System

A protein may be assigned to only one homeomorphic family, which may have zero or more child nodes and zero or more parent nodes. Each homeomorphic family may have as many domain superfamily parents as its members have domains.
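The assignment rules above can be sketched as a small directed acyclic graph. This is only an illustration, not the PIR implementation: the `Classification` class and its method names are hypothetical, and the assignment of P07022 to PIRSF001500 is assumed from the "P-protein" label used in the slides. The edges use node identifiers that appear in the diagram, where PIRSF001499 is listed under both PF01817 and PF02153.

```python
from collections import defaultdict


class Classification:
    """Toy DAG of classification nodes plus a protein -> family assignment.

    Hypothetical structure for illustration; not the PIRSF database schema.
    """

    def __init__(self):
        self.parents = defaultdict(set)   # node id -> set of parent node ids
        self.children = defaultdict(set)  # node id -> set of child node ids
        self.family_of = {}               # protein accession -> homeomorphic family id

    def add_edge(self, parent, child):
        # A node may receive any number of parents and children.
        self.parents[child].add(parent)
        self.children[parent].add(child)

    def assign(self, protein, family):
        # Enforce "a protein may be assigned to only one homeomorphic family".
        current = self.family_of.get(protein)
        if current is not None and current != family:
            raise ValueError(f"{protein} is already assigned to {current}")
        self.family_of[protein] = family

    def ancestors(self, node):
        # Walk upward through every parent; several paths may exist in a DAG.
        seen, stack = set(), [node]
        while stack:
            for parent in self.parents[stack.pop()]:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


dag = Classification()
# The bifunctional CM/PDH family has two domain superfamily parents.
dag.add_edge("PF01817", "PIRSF001499")
dag.add_edge("PF02153", "PIRSF001499")
dag.add_edge("PF01817", "PIRSF001500")
# Assumption: E. coli P-protein P07022 belongs to the P-protein family.
dag.assign("P07022", "PIRSF001500")

print(sorted(dag.ancestors("PIRSF001499")))  # prints ['PF01817', 'PF02153']
```

Storing parents as a set per node, rather than a single parent pointer, is what lets one family sit under several domain superfamilies at once.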

Page 21

[Diagram: classification and curation workflow, with steps numbered 1 to 8]

Protein sets: Unclassified UniProtKB Proteins; Uncurated Homeomorphic Clusters; Orphans; Preliminary Homeomorphic Families; Final Families, Subfamilies, Superfamilies; New Proteins; Unassigned Proteins

Procedures: Automatic Procedure; Computer-assisted Manual Curation

Steps: Automatic Clustering; Map Domains on Clusters; Merge/Split Clusters; Add/Remove Members; Name, Refs, Abstract, Domain Arch.; Hierarchies (Superfamilies/Subfamilies); Protein Name Rules/Site Rules; Build and Test HMMs; Automatic Placement

Page 22

[Diagram: classification and curation workflow, with steps numbered 1 to 8]

Protein sets: Unclassified UniProtKB proteins; Uncurated Homeomorphic Clusters; Orphans; Preliminary Homeomorphic Families; Final Families, Subfamilies, Superfamilies; New proteins; Unassigned proteins

Procedures: Automatic Procedure; Computer-assisted Manual Curation

Steps: Automatic clustering; Map domains on clusters; Merge/split clusters; Add/remove members; Name, refs, abstract, domain arch.; Hierarchies (superfamilies/subfamilies); Protein Name Rule/Site Rule; Build and test HMMs; Automatic placement

Page 23

Tool: Curator’s Decision Maker

Page 24

Classification Tool: BlastClust

• Curator-guided clustering
• Single-linkage clustering using BLAST
• Retrieve all proteins sharing a common domain
• Iterative BlastClust (fixed length coverage)
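Single-linkage clustering, as named above, can be illustrated with a short sketch. It does not run BLAST and is not the BlastClust program: it assumes pairwise hits have already been computed and are supplied as `(id_a, id_b, percent_identity, coverage)` tuples. The function name, the tuple layout, and the threshold values are all hypothetical.

```python
def single_linkage(ids, hits, min_identity=30.0, min_coverage=0.9):
    """Group sequences so that any qualifying pairwise hit joins two clusters.

    ids:  iterable of sequence identifiers
    hits: iterable of (id_a, id_b, percent_identity, coverage) tuples,
          assumed to come from an earlier all-against-all search
    """
    ids = list(ids)
    parent = {i: i for i in ids}  # union-find forest: each id starts alone

    def find(x):
        # Follow parent links to the cluster representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity, coverage in hits:
        # Fixed thresholds: a hit counts only if both are met.
        if identity >= min_identity and coverage >= min_coverage:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # one link is enough to merge

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in clusters.values())


# Made-up identifiers and scores, for illustration only.
example_hits = [
    ("A", "B", 45.0, 0.95),
    ("B", "C", 38.0, 0.92),
    ("C", "D", 80.0, 0.40),  # high identity but short coverage: ignored
]
print(single_linkage(["A", "B", "C", "D"], example_hits))
# prints [['A', 'B', 'C'], ['D']]
```

A and C end up together although they have no direct hit, because each links to B. That chaining is the defining property of single linkage, and the coverage threshold is what keeps a partial, domain-only match such as C to D from merging proteins with different overall architecture.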

Page 25

Family Analysis of Homologous Proteins

1. Fully Curated Protein Family
• Especially important when the protein of interest is underannotated or misannotated (happens often!)
• Evidence types: Characterized (validated), Predicted (by computational methods), or Uncharacterized

2. Preliminary or Uncurated Family
• Have to do some analysis OR contact PIR and ask to prioritize this family

3. No Family Classification
• Have to do some analysis OR contact PIR and ask to prioritize this family

iProClass search PIRSF - blank

Page 26

Underannotated Proteins

Search iProClass with PIRSF005965

Providing more information

Page 27

PIRSF SCAN (sequence search)

UniProt sequence Q8Y5X7 is automatically classified as chorismate mutase of the AroH class (PIRSF005965)

Returns only matches to fully curated PIRSFs

Page 28

PIRSF Family Report: Curated Protein Family Information

The taxonomic distribution of a PIRSF can be used to infer the evolutionary history of its proteins

Phylogenetic tree and alignment views allow further sequence analysis

Page 29

PIRSF Family Report (II)

Integrated value-added information from other databases

Mapping to other protein classification databases

Page 30

Chorismate Mutase: Results from iProClass Analysis

• A BLAST search with CM from B. subtilis (P19080) does not retrieve B. subtilis AroA(G) or E. coli P-protein (or related proteins)
• It contains a different Pfam domain
• Identical conserved motifs are not found
• Conclusion: NOT homologous

Use PIRSF family database for the same analysis:

• PIRSF reports: abstracts contain most of this info
• PIRSF domain architecture (curated or uncurated): Pfam and newly defined domains
• Structure information (PDB links)
• Hierarchy in DAG (under development)
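The domain-architecture part of this comparison amounts to checking whether two proteins share any Pfam domain. The sketch below makes that check explicit. The table is a hand-written assumption based on the identifiers in these slides (PF07736 for PIRSF005965 members such as P19080, PF01817 for the chorismate mutase domain of the P-protein); it is not retrieved from iProClass, other domains of P07022 are omitted, and the function name is hypothetical.

```python
# Assumed Pfam annotations, typed in by hand for illustration.
pfam_domains = {
    "P19080": {"PF07736"},  # B. subtilis CM (AroH class)
    "P07022": {"PF01817"},  # E. coli P-protein, CM domain only
}


def shared_domains(acc_a, acc_b, table):
    """Return the set of Pfam domains annotated on both proteins."""
    return table[acc_a] & table[acc_b]


common = shared_domains("P19080", "P07022", pfam_domains)
print(common)  # prints set()
if not common:
    print("No shared Pfam domain: consistent with the proteins not being homologous")
```

An empty intersection is only one line of evidence. The slides reach the "not homologous" conclusion by combining it with the BLAST results and the absence of identical conserved motifs.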

Page 31

PIRSF Text Search

New domain

AroA(G)

Page 32

Chorismate Mutase Convergent Evolution: EC 5.4.99.5 (Non-Orthologous Gene Displacement)

Two Distinct Sequence/Structure Types
• AroQ Class: SCOP (all α), core: 6 helices, bundle
• AroH Class: SCOP (α+β), core: beta-alpha-beta-alpha-beta(2)

Two Pfam Domains: PF01817, PF07736 (new Pfam domain)

[Structure images: AroQ, AroH]

Page 33

Developing DAG Viewer

Before: all chorismate mutase proteins and families hit PF01817, including PIRSF005965 (not homologous to the rest)

Subfamily

The network structure (in DAG) for the PIRSF family classification system reflects the PIRSF family hierarchy, which is based on evolutionary relationships

Page 34

DAG Viewer (II)

After: Pfam created a new domain, PF07736, which is found in PIRSF005965 members

“Orphans”: no family classification

Page 35

PIR Team

Dr. Cathy Wu, Director

Protein Classification team: Dr. Winona Barker, Dr. Lai-Su Yeh, Dr. Anastasia Nikolskaya, Dr. Darren Natale, Dr. Zhang-Zhi Hu, Dr. Raja Mazumder, Dr. CR Vinayaka, Dr. Sona Vasudevan, Dr. Cecilia Arighi

Informatics team: Dr. Hongzhan Huang, Dr. Peter McGarvey, Baris Suzek, M.S., Sehee Chung, M.S., Dr. Leslie Arminski, Dr. Hsing-Kuo Hua, Yongxing Chen, M.S., Jian Zhang, M.S., Dr. Xin Yuan

Students: Christina Fang, Vincent Hermoso, Natalia Petrova

UniProt is supported by the National Institutes of Health, grant # 1 U01 HG02712-01