Protein-protein interactions: structure and systems approaches to analyze diverse genomic data

Anna R. Panchenko

Benjamin A. Shoemaker

NCBI / NIH

Importance of protein-protein interactions.

• Many cellular processes are regulated by multiprotein complexes.

• Distortions of protein interactions can cause diseases.

• Protein function can be predicted by knowing functions of interacting partners (“guilt by association”).

Adapted from S. Fields, FEBS, 2005

A comparison of sequence (GenBank) and protein-protein interaction data (DIP database)

Outline

• Introduction

• Experimental techniques

• Protein and domain interaction databases

• Prediction of protein/domain interactions

• Homology modeling of interaction interfaces

• Protein interaction networks

Example: interaction of guanine-nucleotide binding domain with different effectors.

Adapted from Vetter & Wittinghofer, Science 2001

Common properties of protein-protein interactions.

• The majority of protein complexes have a buried surface area of ~1600±400 Å² (“standard size” patch).

• Complexes of “standard size” do not involve large conformational changes while large complexes do.

• Protein recognition site consists of a completely buried core and a partially accessible rim.

• Trp and Tyr are abundant in the core, but Ser and Thr, Lys and Glu are particularly disfavored.

[Figure: interface between a top molecule and a bottom molecule, with the rim and core regions labeled]

Different types of protein-protein interactions.

• Permanent and transient.

• External are between different chains; internal are within the same chain.

• Homo- and hetero-oligomers depending on the similarity between interacting subunits.

• Interface type can be predicted from amino acid composition (Ofran and Rost 2003).

Conservation of protein-protein interactions.

• Conservation of protein interfaces is weak compared to the rest of a protein

• Conservation of domain-domain interactions: at the SCOP Family level (red) interactions are conserved; at the Fold level (blue) they are not.

Adapted from Aloy et al, J. Mol. Biol., 2003

Experimental methods

Experimental methods for identifying protein-protein interactions.

• Yeast two hybrid

• Mass spectrometry

• TAP purification

• Gene co-expression

Yeast two-hybrid experiments.

• Many transcription factors have two domains; one that binds to a promoter DNA sequence (BD) and another that activates transcription (AD).

• A transcription factor cannot activate transcription unless its DNA-binding domain is physically associated with an activating domain (Fields and Song, 1989).

Adapted from B. Causier, Mass Spectrometry Reviews, 2004

Gal4/LacZ Y2H system

Adapted from A. Traven et al, EMBO Reports, 2006

• Target proteins are fused to the BD and AD of the GAL4 protein, which activates the LacZ gene.

• If there is no galactose, GAL80 binds to GAL4 and blocks transcription.

• If galactose is present, GAL4 can activate transcription of beta-galactosidase.

Development and variations of Y2H system.

• Developing yeast strains that carry several reporter genes.

• Developing haploid yeast strains of opposite mating types. Diploid cells are produced by mating strains containing baits and preys.

• One-hybrid system detects interactions between a prey protein and a known DNA sequence (bait).

Adapted from B. Causier, Mass Spectrometry Reviews, 2004

Development and variations of Y2H system.

• RNA yeast three-hybrid system detects interactions between RNAs and proteins.

• Protein yeast three-hybrid system detects the formation of complexes between several proteins.

Adapted from B. Causier, Mass Spectrometry Reviews, 2004

Genome-wide analysis by Y2H.

• Matrix approach: a matrix of prey clones is added to the matrix of bait clones. Diploids where X and Y interact are selected based on the expression of a reporter gene.

• Library approach: one bait X is screened against an entire library. Positives are selected based on their ability to grow on specific substrates.

Uetz et al 2000, Ito et al 2001: 692-840 interactions detected using the library-based approach in yeast.

Adapted from B. Causier, Mass Spectrometry Reviews, 2004

Drawbacks of Y2H.

• The interactions cannot be tested if a target protein can initiate transcription.

• Fusion of a protein into chimeras can change the structure of a target.

• Protein interactions can be different in yeast and other organisms.

• It is difficult to target extracellular proteins.

• Proteins that interact in two-hybrid experiments may never interact in a cell.

Advantages of Y2H.

• In vivo technique; a good approximation of processes that occur in higher eukaryotes.

• Transient interactions can be detected, and the affinity of an interaction can be predicted.

• Fast and efficient.

Mass spectrometry.

• Ionization (e.g. electrospray ionization) produces peptide ions in the gas phase.

• Detection and recording of sample ions: mass-to-charge ratios are assigned to different peaks of the spectra.

• Analysis of MS spectra and protein identification: search a sequence database with the mass fingerprint and find correlations between theoretical and experimental spectra.

Ionization.

- Electrospray ionization (John Fenn, 2002): the solvent evaporates rapidly in a vacuum and peptides become ionized.

- Matrix-assisted laser desorption (K. Tanaka, 2002): the laser ionizes protein molecules embedded in the matrix.

From www.nobelprize.org

Detection.

• Peptide fragments are separated based on mass-to-charge ratios.

• Accuracy of < 1 Da per unit charge.

Tandem affinity purification method (TAP).

• The target protein ORF is fused with the DNA sequence encoding the TAP tag.

• Tagged ORFs are expressed in yeast cells and form native complexes.

• The complexes are purified by the TAP method.

• Components of each complex are identified by gel electrophoresis, MS and bioinformatics methods.

Tandem affinity purification method (TAP).

The TAP tag consists of two IgG-binding domains of Staphylococcus protein A and a calmodulin-binding peptide.

7123 interactions can be clustered into 547 complexes (Krogan et al, 2006).

O. Puig et al, Methods, 2001

Differences and similarities between Y2H and MS-TAP.

• Both methods generate many false positives; only ~50% of interactions are biologically significant.

• Y2H produces binary interactions and lacks information about protein complexes, but can detect transient interactions.

• Y2H is an in vivo technique.

• MS can detect large stable complexes and networks of interactions.

Correlation between gene expression and protein interactions.

• There should be a relationship between the gene expression levels of subunits in a complex, so protein-protein interactions can be verified with coexpression data.

• Methods are tested on protein complexes: ribosome, proteasome, RNA Polymerase II Holoenzyme and replication complexes.

Correlation between gene expression and protein interactions.

Jansen, Greenbaum & Gerstein, Genome Research, 2002

• Expression profiles were taken from: cell cycle experiments and expression ratios for overall yeast genome for 300 cell states.

• Difference between absolute expression levels can be calculated as

\[ D = \frac{|E_i - E_j|}{E_i + E_j} \]

• Difference between relative expression levels – Pearson correlation coefficient.
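The two measures on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes the absolute-level difference is normalized as |E_i − E_j| / (E_i + E_j), and the function names `normalized_difference` and `pearson` are hypothetical.

```python
import math


def normalized_difference(e_i, e_j):
    """Normalized difference between two absolute expression levels (assumed form)."""
    return abs(e_i - e_j) / (e_i + e_j)


def pearson(x, y):
    """Pearson correlation coefficient between two expression-ratio profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)
```

A small D means two genes are expressed at similar absolute levels; a Pearson coefficient near 1 means their relative profiles rise and fall together across conditions.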

Results of gene coexpression analysis.

Jansen, Greenbaum & Gerstein, Genome Research, 2002

• Subunits from the same complex show coexpression.

• Expression correlation is strong for permanent complexes.

• Transient complexes have weaker correlation.

Verification of experimental protein-protein interactions.

• Protein localization method.

• Expression profile reliability method.

• Paralogous verification method.

Protein localization method.

Sprinzak, Sattath, Margalit, J Mol Biol, 2003

A – A3: Y2H; B: physical methods; C: genetics; E: immunological

True positives:

- Proteins which are localized in the same cellular compartment

- Proteins with a common cellular role

Deane, C. M. (2002) Mol. Cell. Proteomics 1: 349-356

Expression profile reliability method.

Expression profile reliability method.

Deane et al, Molecular & Cellular Proteomics, 2002

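The EPR method is based on the observation that interacting proteins are coexpressed. The distance between the expression profiles of two proteins A and B:

\[ d_{AB}^2 = \sum_i \left( \log(e_i^A / e_{ref}^A) - \log(e_i^B / e_{ref}^B) \right)^2 \]

The observed distribution of distances is modeled as a mixture of the distributions for interacting (i) and non-interacting (n) pairs:

\[ \rho(d_{AB}^2) = \alpha \, \rho_i(d_{AB}^2) + (1 - \alpha) \, \rho_n(d_{AB}^2) \]

Parameter α characterizes the accuracy of the given data; 1 − α corresponds to the fraction of false positives.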
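As a rough sketch of how the EPR quantities could be computed: the first function evaluates the squared distance between two log-ratio expression profiles, and the second estimates α by least squares, assuming the observed distance histogram is a mixture α·ρ_i + (1 − α)·ρ_n of an interacting and a non-interacting reference histogram. The function names, the least-squares fit and the binned-histogram inputs are illustrative assumptions, not the published implementation.

```python
import math


def squared_distance(profile_a, profile_b, ref_a, ref_b):
    """Sum over conditions of squared differences between log expression ratios."""
    return sum(
        (math.log(a / ref_a) - math.log(b / ref_b)) ** 2
        for a, b in zip(profile_a, profile_b)
    )


def estimate_alpha(rho_exp, rho_i, rho_n):
    """Least-squares alpha for rho_exp ~ alpha * rho_i + (1 - alpha) * rho_n.

    All three arguments are histograms over the same distance bins.
    """
    num = sum((e - n) * (i - n) for e, i, n in zip(rho_exp, rho_i, rho_n))
    den = sum((i - n) ** 2 for i, n in zip(rho_i, rho_n))
    return num / den
```

The closed form follows from minimizing the squared residual over α; it requires that the two reference histograms differ in at least one bin.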

Deane, C. M. (2002) Mol. Cell. Proteomics 1: 349-356

Paralogous verification method.

The PVM method is based on the observation that if two proteins interact, their paralogs are also likely to interact. It calculates the number of interactions between two families of paralogous proteins.
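The counting step can be illustrated as follows. This is a simplified sketch under stated assumptions: paralog families are given as a precomputed mapping from protein to family label, and the function name `paralogous_support` and the data layout are hypothetical.

```python
def paralogous_support(pair, family_of, interactions):
    """Count other observed interactions that link the same two paralog families."""
    a, b = pair
    target = {family_of[a], family_of[b]}
    count = 0
    for x, y in interactions:
        if {x, y} == {a, b}:
            continue  # skip the interaction being verified
        if {family_of[x], family_of[y]} == target:
            count += 1
    return count
```

A pair with a higher count has more paralogous interactions supporting it; any threshold for accepting an interaction would have to be chosen separately.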

Comparing large scale data of protein-protein interactions.

C. Von Mering et al, Nature, 2002:

• All methods except for Y2H and the synthetic lethality technique are biased toward abundant proteins.

• PPI are biased toward certain cellular localizations.

• Evolutionarily conserved proteins have much better coverage in Y2H than the proteins restricted to a certain organism.

Functional organization of yeast proteome.

Gavin et al, Nature, 2002

• 589 protein assemblies,

• 232 multiprotein complexes,

• new cellular roles for 344 proteins.

Functional organization of yeast proteome: network of protein complexes.

A. Gavin et al, Nature, 2002

• orthologous proteins interact with complexes enriched by orthologs;

• essential gene products are more likely to interact with essential rather than nonessential proteins

Interaction databases

• Experiment (E)

• Structure detail (S)

• Predicted
  – Physical (P)
  – Functional (F)

• Curated (C)

• Homology modeling (H)

• *IMEx consortium

Protein interaction databases

• Protein-protein interaction databases

• Domain-domain interaction databases

DIP database

• Documents protein-protein interactions from experiment
  – Y2H, protein microarrays, TAP/MS, PDB

• 55,733 interactions between 19,053 proteins from 110 organisms.

Organism      # proteins   # interactions
Fruit fly     7052         20,988
H. pylori     710          1425
Human         916          1407
E. coli       1831         7408
C. elegans    2638         4030
Yeast         4921         18,225
Others        985          401

DIP database

Duan et al., Mol Cell Proteomics, 2002

• Assess quality
  – Via proteins: PVM, EPR
  – Via domains: DPV

• Search by BLAST or identifiers / text

• Map expression data

DIP/LiveDIP

Duan et al., Mol Cell Proteomics, 2002

• Records biological state
  – Post-translational modifications
  – Conformational changes
  – Cellular location

DIP/Prolinks database

Bowers et al., Genome Biol, 2004.

• Records functional association using prediction methods:
  – Gene neighbors
  – Rosetta Stone
  – Phylogenetic profiles
  – Gene clusters

Other functional association databases

• Phydbac2 (Claverie)

• Predictome (DeLisi)

• ArrayProspector (Bork)

BIND database

• Records experimental interaction data

• 83,517 protein-protein interactions

• 204,468 total interactions

• Includes small molecules, NAs, complexes

Alfarano et al., Nucleic Acids Res, 2005

BIND database

• Displays unique icons of functional classes

MPact/MIPS database

• Records yeast protein-protein interactions

• Curates interactions:
  – 4,300 PPI
  – 1,500 proteins

Guldener et al., Nucleic Acids Res, 2006

STRING database

• Records experimental and predicted protein-protein interactions using methods:
  – Genomic context
  – High-throughput
  – Coexpression
  – Database/literature mining

von Mering et al., Nucleic Acids Res., 2005

STRING database

• Graphical interface for each of the evidence types

• Benchmark against KEGG pathways for rankings

STRING database

• 736,429 proteins in 179 species

• Uses COGs and homology to transfer annotation

More interaction databases

• IntAct (Valencia)
  – Open source interaction database and analysis
  – 68,165 interactions from literature or user submissions

• MINT (Cesareni)
  – 71,854 experimental interactions mined from literature by curators
  – Uses IntAct data model

• BioGRID (Tyers)
  – 116,000 protein and genetic interactions

Protein interaction databases

• Protein-protein interaction databases

• Domain-domain interaction databases

InterDom database

• Predicts domain interactions (~30000) from PPIs

• Data sources:
  – Domain fusions
  – PPI from DIP
  – Protein complexes
  – Literature

• Scores interactions

Ng et al., Nucleic Acids Res, 2003

Pibase database

• Records domain interactions from PDB and PQS

• Domains defined with SCOP and CATH

• Domains with inter-domain or inter-chain distances within 6 Å are considered interacting domains

• From interacting domain pairs, create a list of interfaces with buried solvent accessible area > 300 Å²

PIBASE

• Query by PDB, domain, interface

• 1,946 interacting SCOP domains

• 2,387 unique interaction types

Davis et al., Bioinformatics, 2005

PIBASE

• Redundancy removed within a structure

• Properties listed

PIBASE/ModBase

• Protein structure models

• Predict interfaces with Pibase

Pieper et al., Nucleic Acids Res, 2006

3did database

• Defines domains using Pfam

• Data source: Protein structure data

• 3,304 unique interaction types

• 2,247 interacting domains

• Display linkages and chain locations

Stein et al., Nucleic Acids Res, 2005

3did

• List structures

• Visualize interfaces

• View interface overlap distribution

• GO annotation

3did

• Show domain linkages on a given structure

InterPReTS

Aloy & Russell, Bioinformatics, 2003

• Structure prediction of interfaces

• Uses 3did

NCBI CBM database

• To retrieve interactions:
  – Record interactions
  – Use VAST structural alignments to compare binding surfaces
  – Study recurring domain-domain interactions

• CBM – database of interacting structural domains exhibiting Conserved Binding Modes

Shoemaker et al., Protein Sci, 2006.

Definition of CBM

• Interacting domain pair – if at least 5 residue-residue contacts between domains (contacts – distance of less than 8 Å)

• Structure-structure alignments between all proteins corresponding to a given pair of interacting domains

• Clustering of interface similarity, those with >50% equivalently aligned positions are clustered together

• Clusters with more than 2 entries define conserved binding mode.
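The first criterion (at least 5 residue-residue contacts closer than 8 Å) could be checked along these lines. This is a sketch only: it assumes each residue is represented by a single 3D point, which the slide does not specify, and `count_contacts` and `is_interacting` are hypothetical names.

```python
import math


def count_contacts(domain_a, domain_b, cutoff=8.0):
    """Count residue pairs, one residue from each domain, closer than the cutoff (in Å)."""
    return sum(
        1 for p in domain_a for q in domain_b if math.dist(p, q) < cutoff
    )


def is_interacting(domain_a, domain_b, min_contacts=5, cutoff=8.0):
    """Apply the 'at least 5 contacts within 8 Å' rule to two sets of residue points."""
    return count_contacts(domain_a, domain_b, cutoff) >= min_contacts
```

The all-against-all loop is quadratic in the number of residues; for large structures a spatial index would be the usual refinement.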

Number of interacting pairs and binding modes

• 833 conserved interaction types

• 1,798 total domain interaction types

• Up to 24 CBMs per interaction type

• Classify complicated domain pairs by CBMs

• Globin example:
  – 630 pairs
  – 2 CBMs account for majority

CBM   Structures   Species
1     154          Jawed vertebrates
2     112          Jawed vertebrates
3     17           Clam, earthworm
4     4            Lamprey
5     4            V. stercoraria
6     2            Rice, soybeans
7     2            Human
8     2            Lamprey

CBMs distinguish biologically relevant interactions

• Non-biological interactions (e.g. crystal packing) are not conserved among different structures.

• Interaction networks become clearer

iPfam database

Finn et al., Bioinformatics, 2005

• View Pfam interactions on PDB structures

• View individual structures and sequence plots

DIMA database

Pagel et al., Bioinformatics, 2005

• Phylogenetic profiles of Pfam domain pairs

• Uses structural info from iPfam

• Works well for moderate information content

Methods of prediction of functional association and protein interactions.

• Phylogenetic profile method.

• Rosetta Stone approach.

• Gene neighborhood method.

• Gene cluster method.

• Co-evolution methods.

• Classification methods.

Phylogenetic profile method.

Pellegrini et al, PNAS 1999

Functionally linked and putative interacting proteins should have orthologs in the same subset of fully sequenced organisms
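A phylogenetic profile can be represented as a presence/absence vector over genomes, and candidate functional links are pairs of proteins with matching vectors. The sketch below assumes ortholog assignments are already available as sets of genome names; the function names and the `max_mismatch` tolerance are illustrative.

```python
from itertools import combinations


def profile(protein, orthologs, genomes):
    """Presence (1) or absence (0) of an ortholog of the protein in each genome."""
    return tuple(1 if g in orthologs[protein] else 0 for g in genomes)


def linked_pairs(orthologs, genomes, max_mismatch=0):
    """Protein pairs whose profiles differ in at most max_mismatch genomes."""
    profiles = {p: profile(p, orthologs, genomes) for p in orthologs}
    pairs = []
    for a, b in combinations(sorted(profiles), 2):
        mismatches = sum(x != y for x, y in zip(profiles[a], profiles[b]))
        if mismatches <= max_mismatch:
            pairs.append((a, b))
    return pairs
```

Profiles that are all ones or all zeros carry no information, so a practical version would filter those out before comparing.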

Rosetta Stone approach.

Marcotte et al, Science, 1999

• Some pairs of interacting domains have homologs which are fused into one protein chain – “Rosetta Stone” protein.

• In E. coli, ~6809 pairs of non-homologous proteins were found in which both proteins of each pair could be mapped to a single protein from some other genome.
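The search for Rosetta Stone links can be sketched as follows, assuming homology hits have already been computed and are given as a mapping from each query protein to the regions it matches on proteins from other genomes. The data layout, the non-overlap test and the name `rosetta_links` are assumptions made for illustration; the coordinates in the usage example are made up.

```python
from itertools import combinations


def rosetta_links(hits):
    """Find query pairs that match non-overlapping regions of the same target protein.

    hits maps query protein -> {target protein: (start, end)}.
    """
    links = []
    for a, b in combinations(sorted(hits), 2):
        for target in sorted(hits[a].keys() & hits[b].keys()):
            start_a, end_a = hits[a][target]
            start_b, end_b = hits[b][target]
            if end_a < start_b or end_b < start_a:
                links.append((a, b, target))
    return links


# Illustrative input with invented coordinates.
example_hits = {
    "gyrA": {"topoII": (500, 1200)},
    "gyrB": {"topoII": (1, 450)},
    "recA": {"rad51": (1, 300)},
}
example_links = rosetta_links(example_hits)
```

Requiring non-overlapping regions is what separates a fusion of two different proteins from two homologs hitting the same domain.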

Gene neighborhood method.

• Gene pairs from conserved gene clusters encode proteins which are functionally related and possibly interact.

• Conservation of gene order can be used to predict gene function.

Adapted from Bowers et al, Genome Biology, 2004

Gene cluster method.

• Bacterial genes of related function are often transcribed simultaneously – operon.

• Identification of operons is based on intergenic distances.

Adapted from Bowers et al, Genome Biology, 2004
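One simple way to turn intergenic distances into operon predictions is to chain adjacent genes on the same strand whose gap is below a threshold. The sketch below does exactly that; the 100 bp default, the tuple layout `(name, start, end, strand)` and the name `predict_operons` are assumptions for illustration, not values taken from the cited work.

```python
def predict_operons(genes, max_gap=100):
    """Group adjacent same-strand genes separated by at most max_gap base pairs."""
    operons = []
    current = []
    for gene in sorted(genes, key=lambda g: g[1]):
        name, start, end, strand = gene
        same_strand = bool(current) and current[-1][3] == strand
        if same_strand and start - current[-1][2] <= max_gap:
            current.append(gene)
        else:
            if current:
                operons.append([g[0] for g in current])
            current = [gene]
    if current:
        operons.append([g[0] for g in current])
    return operons
```

Genes that end up in the same predicted operon are then treated as functionally linked.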

Coevolution of interacting proteins – “mirrortree” methods.

Goh et al. J Mol Biol, 2000; Pazos and Valencia, J Mol Biol, 2001

• Interacting proteins may co-evolve and their phylogenetic trees show similarity.

• Similarity between phylogenetic trees can be quantified by correlation coefficient between distance matrices used to construct trees.

Adapted from Goh et al, J.Mol.Biol.,2000
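The tree-similarity score can be sketched as a Pearson correlation over the upper triangles of two distance matrices whose rows and columns refer to the same species in the same order. The function names are hypothetical, and building the distance matrices from alignments is outside the scope of this sketch.

```python
import math


def upper_triangle(matrix):
    """Flatten the entries above the diagonal of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def mirrortree_score(dist_a, dist_b):
    """Pearson correlation between two species-by-species distance matrices."""
    x = upper_triangle(dist_a)
    y = upper_triangle(dist_b)
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)
```

A score close to 1 indicates that the two protein families have accumulated changes in parallel across the shared species.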

Predicting interacting partners from interacting protein families.

Problem:

given interacting protein families A {a1,…,an} and B {b1,…,bm}:

- Find corresponding paralogs ai and bi that interact.

- Predict interaction specificity.

- Predict one-to-many correspondence between interacting partners.

Methods of predicting interacting partners.

Ramani & Marcotte, J. Mol. Biol., 2003, Gertz et al, Bioinformatics, 2003

• Proteins are clustered to find one-to-many correspondence.

• Similarity matrices are aligned minimizing difference between their elements.

• Interactions are predicted between proteins corresponding to aligned columns.

Adapted from Ramani & Marcotte, J. Mol. Biol., 2003

Problems of matrix permutation methods:

• N! – permutations (N – number of proteins in a family) – large search space

• maximal agreement between similarity matrices does not mean correct pairing of proteins on phylogenetic tree.
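The size of the search space is easy to see in a brute-force version of matrix alignment, which tries every one of the N! pairings and keeps the one with the smallest squared difference between matrix elements. This is an illustrative baseline with hypothetical names and a squared-difference cost chosen for simplicity; it is not any of the published methods, and it is only feasible for very small families.

```python
from itertools import permutations


def best_pairing(sim_a, sim_b):
    """Exhaustively pair proteins of family A with family B (equal sizes assumed).

    Returns (cost, perm), where perm[i] is the index in B paired with protein i in A.
    """
    n = len(sim_a)
    best = None
    for perm in permutations(range(n)):
        cost = sum(
            (sim_a[i][j] - sim_b[perm[i]][perm[j]]) ** 2
            for i in range(n)
            for j in range(i + 1, n)
        )
        if best is None or cost < best[0]:
            best = (cost, perm)
    return best
```

With 10 proteins per family the loop already visits 3,628,800 permutations, which is why heuristic searches are used instead.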

Methods of predicting interacting partners.

MORPH method, Jothi et al, Bioinformatics, 2005

• Search space is reduced by swapping whole isomorphic subtrees instead of a single column - avoid local minima.

• Uses information encoded in the phylogenetic trees themselves.

Tree of life (TOL) assists in prediction of protein interactions.

Pazos et al, J. Mol. Biol., 2005

Sato et al, Bioinformatics, 2005

• There is “background” similarity between trees of any proteins, no matter if they interact or not.

• “Background” tree is constructed from 16S rRNA sequences.

• rRNA-based distances are subtracted from distances of original phylogenetic tree.

Adapted from Pazos et al, J. Mol.Biol., 2005
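The background correction can be sketched by subtracting the rRNA-based distance from each protein distance before correlating the two families. This is a deliberate simplification: it assumes all three matrices are on a comparable scale and share the same species order, and it omits any rescaling the published methods may apply. The function names are hypothetical.

```python
import math


def subtract_background(dist, background):
    """Element-wise difference between a protein distance matrix and the rRNA matrix."""
    n = len(dist)
    return [[dist[i][j] - background[i][j] for j in range(n)] for i in range(n)]


def matrix_correlation(dist_a, dist_b):
    """Pearson correlation over the upper triangles of two square matrices."""
    n = len(dist_a)
    x = [dist_a[i][j] for i in range(n) for j in range(i + 1, n)]
    y = [dist_b[i][j] for i in range(n) for j in range(i + 1, n)]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def tol_mirrortree_score(dist_a, dist_b, rrna_dist):
    """Correlation between two families after removing the shared species signal."""
    return matrix_correlation(
        subtract_background(dist_a, rrna_dist),
        subtract_background(dist_b, rrna_dist),
    )
```

Two families that merely follow the species tree score high before the correction and can score low or negative after it.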

Performance of “mirrortree” methods.

Adapted from Pazos et al, J. Mol. Biol., 2005

• Test set of 512 physically interacting proteins from E. coli

• “TOL-mirrortree” method (blue) finds half of the real interacting proteins at a 6.4% false positive rate, compared to a 16.5% false positive rate with the “mirrortree” method (black).

Classification methods: Random Decision Forest.

• Training set: interacting protein pairs + non-interacting pairs;

• Each pair – vector of features (domain types) of dimension N.

• Values of vector: 0, if neither protein in the pair contains the feature; 1, if exactly one protein in the pair contains the feature; 2, if both proteins contain the feature.

Example with domain types D1 D2 D3 D4 D5: the protein pair (D1 D3, D4 D3) = { 1, 0, 2, 1, 0 }
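The encoding above can be sketched in a few lines. The function name and the list-of-strings representation of a protein's domains are assumptions made for illustration.

```python
def pair_features(domains_a, domains_b, all_domains):
    """Encode a protein pair as a vector over all domain types:
    0 = neither protein has the domain, 1 = exactly one has it, 2 = both."""
    a, b = set(domains_a), set(domains_b)
    # Booleans add as integers, so the sum is 0, 1 or 2.
    return [(d in a) + (d in b) for d in all_domains]


all_domains = ["D1", "D2", "D3", "D4", "D5"]
vector = pair_features(["D1", "D3"], ["D4", "D3"], all_domains)
print(vector)  # [1, 0, 2, 1, 0]
```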

Page 78: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

78

Classification methods: Random Decision Forest.

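[Figure: decision tree rooted at feature D20 with three branches (none, one, both); lower nodes D21, D26, D10, D30, D15, D22, D2 and leaf labels 0 and 1]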

• Choose feature randomly (D20), get values of all pairs in a given position corresponding to this feature.

• Divide all pairs into three groups: both proteins have the feature, only one protein has it, or neither has it.

Page 79: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

79

Classification methods: Random Decision Forest.

[Figure: decision tree rooted at feature D20; lower nodes D21, D26, D10, D30, D15, D22, D2 and leaf labels 0 and 1]

• Repeat splitting at next node and stop when node impurity is small.

• To classify a new protein pair – traverse along the tree.

Node impurity = # interacting proteins / # non-interacting proteins
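A minimal sketch of the two operations on these slides: splitting the training pairs on one feature, and classifying a new pair by walking a finished tree. The tree representation (a leaf is an integer label, an internal node is a tuple of a feature index and a dict of three children) is an assumption for illustration, not the authors' data structure.

```python
def split_by_feature(vectors, labels, index):
    """Group training labels by the value (0, 1 or 2) of one feature."""
    groups = {0: [], 1: [], 2: []}
    for vector, label in zip(vectors, labels):
        groups[vector[index]].append(label)
    return groups


def classify(tree, vector):
    """Walk the tree from the root until a leaf label (0 or 1) is reached."""
    while not isinstance(tree, int):
        index, children = tree
        tree = children[vector[index]]
    return tree


# Hypothetical tree: test feature 2 first; if exactly one protein has it,
# test feature 0 next.
tree = (2, {0: 0, 1: (0, {0: 0, 1: 1, 2: 1}), 2: 1})
print(classify(tree, [1, 0, 2, 1, 0]))  # 1 (predicted to interact)
```

A random decision forest repeats this with many trees built from randomly chosen features and combines their votes.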

Page 80: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

80

Predicting domain interactions from protein interactions

• Association method

• Maximum likelihood estimation method

• Domain Pair Exclusion Analysis

• Random decision forests

• Calculating P-values

• Integrative method

Page 81: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

81

Predicting domain interactions from protein interactions

• Study interaction networks by identifying protein pairs from high-throughput experiments without structural detail

• Protein sequence search of Pfam, SCOP or CDD domains.

• Train on high-throughput PPI experimental data

• Evaluate using structures or MIPS

• Predict protein-protein interactions using domain frequencies

Page 82: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

82

Association method

Sprinzak & Margalit, J Mol Biol, 2001

• Correlated sequence-signatures (domains) predict new protein interactions

• Yeast data (MIPS & DIP) used to calculate log-odds scores

• Domains defined with InterPro

Page 83: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

83

Association method

Sprinzak & Margalit, J Mol Biol, 2001

• Record domains

• List interacting protein pairs

• Tabulate domain pairs from protein pairs

• Compute log-odds values

Page 84: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

84

Association method

Sprinzak & Margalit, J Mol Biol, 2001

• Log-odds value: log2(Pij / (Pi Pj))

• Pij is the frequency of domain pair (i, j) among the interacting protein pairs; Pi is the frequency of domain i in the data

• Information content of the table:

• H = 2.48 bits – significant correlation with protein pairs
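The log-odds calculation can be sketched as follows. The normalizations are assumptions for illustration: Pij is taken as the share of a domain pair among all domain pairs tabulated from interacting proteins, and Pi as the fraction of proteins carrying domain i. The function name and input layout are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log2


def log_odds_scores(protein_domains, interacting_pairs):
    """Return log2(Pij / (Pi * Pj)) for every domain pair seen in
    at least one interacting protein pair."""
    n_proteins = len(protein_domains)
    domain_count = Counter(
        d for domains in protein_domains.values() for d in set(domains)
    )
    pair_count = Counter()
    for a, b in interacting_pairs:
        for di, dj in product(set(protein_domains[a]), set(protein_domains[b])):
            pair_count[tuple(sorted((di, dj)))] += 1
    total = sum(pair_count.values())
    scores = {}
    for (di, dj), count in pair_count.items():
        p_ij = count / total
        p_i = domain_count[di] / n_proteins
        p_j = domain_count[dj] / n_proteins
        scores[(di, dj)] = log2(p_ij / (p_i * p_j))
    return scores


proteins = {"P1": ["A"], "P2": ["B"], "P3": ["C"], "P4": ["A"]}
print(log_odds_scores(proteins, [("P1", "P2"), ("P4", "P2")]))
# {('A', 'B'): 3.0}
```

A positive score means the domain pair occurs in interacting proteins more often than expected from the individual domain frequencies.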

Page 85: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

85

Association method

Sprinzak & Margalit, J Mol Biol, 2001

• 2,286 domain pairs

• 1,141 pairs > 2 bits

• 40 pairs with > 2 bits & count of 5

• No experimental error

Page 86: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

86

Random decision forests

Chen and Liu, Bioinformatics, 2005

• Discussed earlier as protein interaction prediction method

• 3,000 domain pairs predicted

• No experimental error

• Doesn’t assume independence

• Accounts for non-interactions

– Riley et al. note that this makes it harder to find specific paralogous interactions

Page 87: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

87

Maximum likelihood estimation method

Deng et al., Genome Res, 2002

• Include competing domains in a given pair of interacting proteins

• Incorporate experimental errors in scores

• Domain-domain interactions independent

• Proteins interact with at least one domain pair

Page 88: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

88

Maximum likelihood estimation method

Deng et al., Genome Res, 2002

To maximize likelihood function of parameter set Θ(λmn, fn, fp) indirectly using EM algorithm:

1. Use initial parameters to get the expectation of complete dataset.

2. Get maximum likelihood estimator of parameter set, Θ.

3. Iterate until convergence.
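The three steps can be sketched under the assumptions listed on the previous slide: domain pairs interact independently, and two proteins interact if at least one of their domain pairs does. The update rule below is one common way to write the EM step for that model and should be read as an assumption rather than the exact published recursion; the function name, the input layout and the fixed iteration count are also illustrative.

```python
from itertools import product


def em_domain_interactions(protein_domains, observations, fn, fp,
                           iterations=100, initial=0.5):
    """Estimate lambda_mn, the probability that domains m and n interact.

    observations maps a protein pair (a, b) to 1 (interaction observed)
    or 0 (not observed); fn and fp are the assumed false negative and
    false positive rates of the experiment.
    """
    candidates = {
        pair: {tuple(sorted(dp))
               for dp in product(protein_domains[pair[0]],
                                 protein_domains[pair[1]])}
        for pair in observations
    }
    lam = {dp: initial for dps in candidates.values() for dp in dps}
    for _ in range(iterations):
        weight_sum = {dp: 0.0 for dp in lam}
        n_pairs = {dp: 0 for dp in lam}
        for pair, observed in observations.items():
            # Proteins fail to interact only if every domain pair fails.
            p_none = 1.0
            for dp in candidates[pair]:
                p_none *= 1.0 - lam[dp]
            p_interact = 1.0 - p_none
            p_obs1 = (1.0 - fn) * p_interact + fp * (1.0 - p_interact)
            p_obs = p_obs1 if observed else 1.0 - p_obs1
            weight = ((1.0 - fn) if observed else fn) / p_obs
            for dp in candidates[pair]:
                weight_sum[dp] += weight
                n_pairs[dp] += 1
        lam = {dp: lam[dp] * weight_sum[dp] / n_pairs[dp] for dp in lam}
    return lam
```

With error-free data (fn = fp = 0) and a single candidate domain pair, the estimate reduces to the fraction of protein pairs carrying that domain pair that were observed to interact, which is a useful sanity check. The sketch does not guard against a zero observation probability, which can occur at boundary values of the parameters.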

Page 89: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

89

Maximum likelihood estimation method

Deng et al., Genome Res, 2002

• 43% specificity, 78% sensitivity

• Predictions using MIPS - 100x > random assignments

• But, only 0.68% predicted

• Predictions had higher gene co-expression correlations than random

Page 90: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

90

Domain Pair Exclusion Analysis

• Extend MLE method to detect specific, infrequent interactions

• Other methods emphasize non-specific, promiscuous domain interactions

• Compute an additional E score: ratio of probability proteins interact given two domains interact over the same probability given two domains do not interact

Page 91: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

91

Domain Pair Exclusion Analysis

Riley et al., Genome Biol, 2005.

1. Sij frequencies

2. MLE of Θij starting with Sij

3. Recalculate Θij with interaction probability ij fixed to zero. Get Eij from difference.

Page 92: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

92

Promiscuous domains

Riley et al., Genome Biol, 2005.

• Paralogous interactions mask log-odds scores

• Θ shows promiscuity of interaction

• E shows propensity of interaction

• Screen for low Θ values, high E scores to find specific interactions

Page 93: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

93

Domain Pair Exclusion Analysis

Riley et al., Genome Biol, 2005.

• E discriminates best predictions 71x random

• Θ and S are ineffectual particularly with modular domains

Page 94: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

94

Calculating P-values to predict domain interactions

Nye et al., Bioinformatics, 2005.

• P-values for domain pairs responsible for PPIs

• Shuffle domains on sequences for null hypothesis

• Domain architectures considered

• SCOP superfamilies, 3 sets of yeast data

• Tested with domain interactions from PQS

Page 95: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

95

Calculating P-values to predict domain interactions

Nye et al., Bioinformatics, 2005.

• Predicts better at higher number of interacting partners

• Random predictions are better for majority of protein pairs

Page 96: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

96

Integrative method

Ng et al., Bioinformatics, 2003

• Add scores from three sources:

– DIP – odds ratio score

– Protein complexes – odds ratio score

– Domain fusions – simple constant

• 20-fold cross validation

– Major change from DIP to DIP + complexes

– TPs: 39% to 58%, FPs: 8% to 12%

Page 97: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

97

Limitations of domain interaction prediction methods

• Assume domain pairs interact independently

• Repeated domains not scored to distinguish contacts

• Missing domain assignments give false negatives and positives

• Many proteins have no domain assignments

• Assume domain pairs, though may require higher order assemblies

Page 98: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

98

Homology modeling of protein interactions

• Comparison to modeling single proteins

• General procedure

• Automated methods

• CAPRI docking contest

• Designed interfaces

Page 99: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

99

Homology modeling of single proteins

• Structures solved quickly with current techniques

• Decent coverage of major genomes expected

• Structure prediction:

– Find homologous template to query

– Make query model based on template

Page 100: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

100

Homology modeling of protein interactions

• ~2,000 out of 10,000 interaction types known

• Limited number of protein-protein complexes in PDB

• 30% sequence similarity required

Page 101: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

101

Homology modeling of protein interactions

• Large complex structure determination has technical challenges not readily overcome in general

• Likely path involves multiple experimental methods with homology modeling and docking of structural subunits

Page 102: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

102

Support for modeling protein interactions

Shoemaker et al., Protein Sci, 2006.

Globin example:

• Interfaces between different functional subfamilies poorly conserved

• Within the same subfamily well conserved

• Supports homology modeling of interaction interfaces

Page 103: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

103

General procedure for homology modeling (using domains)

• Start with high-throughput (Y2H, TAP/MS) protein interaction data

• Search proteins for homologous domains

• Evaluate likelihood of domain-domain interactions

• Search for template structures similar to query proteins/protein domains

Page 104: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

104

General procedure for homology modeling

• Template structures might be

– Complete complexes (rare)

– Interacting domain dimers (sometimes)

– Single domains (most often)

• Put together structural pieces, avoiding steric hindrance and maximizing domain complementarity

• Docking potentials score orientations of two interacting domains

• Success depends strongly on similarity and completeness of homologous structures

Page 105: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

105

Modeling of structure of yeast complexes.

P. Aloy et al, Science, 2004

• Example application

• Overlap of methods

• Structural elements

Page 106: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

106

Modeling of structure of yeast complexes.

• Combination of techniques

• Structural pieces with EM map

Page 107: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

107

InterPReTS

Aloy & Russell, Bioinformatics, 2003

• Search for Pfam domains on target sequences

• Find structural templates matching the same Pfam domains

• Score putative interactions with empirical pair potentials

• Good results except for peptidase / inhibitor class

Page 108: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

108

Multiprospector

Grimm et al., Proteins, 2006.

• Separately thread sequences X and Y against protein dimer database

• For X/Y matches to the same dimer, assess fitness by rethreading with an interface score derived from the dimer database

Page 109: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

109

Multiprospector

Grimm et al., Proteins, 2006.

• On yeast genome, 7,321 interactions were predicted from 304 complexes

• Ranked 3rd amongst large-scale prediction methods

– No bias towards abundant proteins

– Provides atomic detail of interaction surfaces

Page 110: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

110

CAPRI contest

Mendez et al., Proteins, 2005.

• Build atomic models of complexes given structures of the unbound proteins

• Bound/unbound differ by up to 12 Å

• “Acceptable” to “highly accurate” predictions made

• Restraints help a lot

Page 111: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

111

Limitations from CAPRI contest affecting homology modeling

• Proteins can undergo significant conformation changes upon binding

• Docking potentials require more accuracy

• Specific and non-specific protein interactions are not adequately distinguished

Page 112: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

112

Interface design

Kortemme & Baker, Curr Opin Chem Biol, 2005

• Computationally alter interface to modify function

• Create useful complexes

• Better understand prediction

Page 113: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

113

Interface design

Kortemme & Baker, Curr Opin Chem Biol, 2005

• Alter oligomeric state in helical bundle

• Increase specificity of promiscuous domains

• Novel interactions

• Automated the process

Page 114: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

114

Basic notions of networks.

Network (graph) – a set of nodes connected via edges.

The degree of a node (connectivity) – total number of connections of a node.

Page 115: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

115

Characteristics of networks.

P(k) – degree distribution, k – degree of a node;

Diameter – max of distances between nodes taken over all node pairs.

[Figure: example network with node degrees k = 2, 2, 3 and 1]

• Diameter of biological networks is small compared to network size (“small world property”).
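Both characteristics can be computed directly from an adjacency list. The sketch below assumes an undirected, connected network stored as a dict of neighbor sets, measures distance as the number of edges on a shortest path, and uses a small made-up network with node degrees 3, 2, 2 and 1.

```python
from collections import Counter, deque


def degree_distribution(adjacency):
    """P(k): fraction of nodes that have exactly k connections."""
    n = len(adjacency)
    counts = Counter(len(neighbors) for neighbors in adjacency.values())
    return {k: count / n for k, count in counts.items()}


def diameter(adjacency):
    """Largest shortest-path distance over all node pairs (connected graph)."""
    longest = 0
    for source in adjacency:
        # Breadth-first search gives shortest distances from `source`.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in distance:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
        longest = max(longest, max(distance.values()))
    return longest


network = {"A": {"B", "C", "D"}, "B": {"A", "C"}, "C": {"A", "B"}, "D": {"A"}}
print(degree_distribution(network))  # P(3) = 0.25, P(2) = 0.5, P(1) = 0.25
print(diameter(network))             # 2
```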

Page 116: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

116

Different network models: Barabasi-Albert.

Barabasi & Albert, Science, 1999

Model of preferential attachment.

• At each step, a new node is added to the graph.

• The new node is attached to one of old nodes with probability proportional to the vertex degree.

[Plot: ln(P(k)) versus ln(k)]

$p(k) \sim k^{-\gamma}$

Degree distribution – power law distribution.
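Preferential attachment is easy to simulate. The sketch below assumes the simplest variant, in which the graph starts from two connected nodes and each new node attaches to exactly one existing node; the function name is hypothetical.

```python
import random
from collections import Counter


def preferential_attachment(n_nodes, seed=None):
    """Grow a graph one node at a time; each new node links to an old node
    chosen with probability proportional to that node's degree."""
    rng = random.Random(seed)
    edges = [(1, 0)]
    # Every node appears in `endpoints` once per incident edge, so a uniform
    # draw from this list is a degree-proportional draw over nodes.
    endpoints = [1, 0]
    for new in range(2, n_nodes):
        old = rng.choice(endpoints)
        edges.append((new, old))
        endpoints.extend((new, old))
    return edges


edges = preferential_attachment(1000, seed=0)
degree = Counter(node for edge in edges for node in edge)
histogram = Counter(degree.values())  # number of nodes with each degree k
```

Plotting the logarithm of the histogram counts against ln(k) for a large simulated graph should give a roughly straight line, which is the signature of a power law.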

Page 117: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

117

Properties of scale-free networks.

$p(k) \sim k^{-\gamma}$

$p(\alpha k) = \alpha^{-\gamma} p(k)$

Multiplying k by a constant does not change the shape of the distribution – scale free distribution.

From T. Przytycka

• Small diameter

• Tolerance to errors and attacks

But: subnetworks can be scale-free while underlying degree distribution is not.

Page 118: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

118

Difference between scale-free and random graph models.

Random networks are homogeneous: most nodes have about the same number of links.

Scale-free networks have a number of highly connected vertices.

Adapted from Jeong et al, Nature, 2000

Page 119: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

119

Multiple-species gene co-expression networks.

Stuart et al, Science, 2003

• Metagene – a set of similar genes across different organisms (yeast, worm, fly, human).

• Multiple-species network maps genes with orthologs.

• Multiple-species network has been constructed by identifying pairs of genes with the correlated gene expression in different organisms.

• Multiple-species network performs better than single-species network in linking together functionally related metagenes.

Page 120: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

120

Multiple-species gene co-expression networks.

Stuart et al, Science, 2003

True positives – metagenes from the same KEGG functional category; accuracy - % links connecting two members of the same category;

coverage - % metagenes connected to at least one metagene in the category.

Page 121: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

121

Aligning protein interaction networks.

PATHBLAST (Kelley et al., PNAS, 2003, Sharan et al, PNAS, 2005).

• The method searches for high-scoring pathway alignments between two networks, where proteins are paired based on their sequence similarity.

[Figure: pathway A, B, C, D, E in one network aligned with pathway a, b, d, e in the other network]

Page 122: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

122

Aligning protein interaction networks.

PATHBLAST (Kelley et al., PNAS, 2003, Sharan et al, PNAS, 2005).

• The network alignment between worm, yeast and fly detected 71 network regions that were conserved between all three species.

Page 123: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

123

Comparing networks by their connectivities.

Hoffmann & Valencia, TRENDS in Genetics, 2003

• Correlation coefficient between protein connectivities of two networks quantifies the agreement between the networks.

• Significant correlations between different experimental and theoretical methods: gene neighborhood method (GN) correlates with both experimental and in silico methods.
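The comparison can be sketched as a Pearson correlation between the degrees that each protein has in the two networks. One assumption is made explicit here: proteins present in only one network are given degree 0 in the other, which may differ from how the original analysis handled missing proteins. Function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def connectivities(edges):
    """Degree of every protein in an undirected edge list."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree


def connectivity_correlation(edges_a, edges_b):
    """Pearson correlation between protein degrees in two networks."""
    deg_a, deg_b = connectivities(edges_a), connectivities(edges_b)
    proteins = sorted(set(deg_a) | set(deg_b))
    x = [deg_a[p] for p in proteins]  # a Counter returns 0 for absent proteins
    y = [deg_b[p] for p in proteins]
    n = len(proteins)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


network_1 = [("A", "B"), ("A", "C"), ("A", "D")]
network_2 = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
r = connectivity_correlation(network_1, network_2)  # about 0.82
```

A value near 1 means that the same proteins act as hubs in both networks, even if the individual edges differ.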

Page 124: Protein-protein interactions: structure and systems approaches to analyze diverse genomic data Anna R. Panchenko Benjamin A. Shoemaker NCBI / NIH.

124

Summary.

• Many biological networks are scale-free.

• Essential proteins have high connectivities.

• Multiple-species network performs better than single-species network in linking together functionally related genes.