Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by...

32
Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by Combined Analysis of 3D Structural and Sequence Residues Aligned % Sequence Identity Homologous relationships established by both 3D structure and sequence: Homologous Non-homologous pted from work by Sanders and co-workers

Transcript of Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by...

Page 1: Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by Combined Analysis of 3D Structural and Sequence Residues.

Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by Combined Analysis of 3D Structural and

Sequence

Residues Aligned

% S

equ

ence

Id

enti

ty

Homologous relationshipsestablished by both 3Dstructure and sequence:

Homologous

Non-homologous

Adapted from work by Sanders and co-workers

Page 2: Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by Combined Analysis of 3D Structural and Sequence Residues.

Structure can often provide valuable clues to biochemical and biophysical

aspects of protein function

Structure-based Functional Genomics

Page 3: Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by Combined Analysis of 3D Structural and Sequence Residues.

Biological Functionsof Genes and Proteins

• Genetic Function / Phenotype

• Cellular Function

• Biochemical Function

• Detailed Atomic Mechanism

•Biochemical Function

•Detailed Atomic Mechanism

Page 4: Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by Combined Analysis of 3D Structural and Sequence Residues.

An Important Approach to the Protein Folding Problem is to

Characterize the “Natural Language of Proteins”

Representative 3D Structure from Each of Several Thousand Sequence Families of Domains

Page 6: Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by Combined Analysis of 3D Structural and Sequence Residues.

• Structure provides information on function and will aid in the design of experiments

• Development of better therapeutic targets from comparisons of protein structures from:– Pathogens vs. hosts– Diseased vs. normal tissues

Expected PSI Expected PSI BenefitsBenefits

J. Norvell

Page 7: Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by Combined Analysis of 3D Structural and Sequence Residues.

• Collection of structures will address key biochemical and biophysical problems– Protein folding, prediction, folds, evolution, etc.

• Benefits to biologists– Technology developments – Structural biology facilities– Availability of reagents and materials– Experimental outcome data on protein production

and crystallization

PSI Benefits (con’t)PSI Benefits (con’t)

J. Norvell

Page 8: Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by Combined Analysis of 3D Structural and Sequence Residues.

PSI Pilot Phase

• 5-year pilot phase, September, 2000• Pilot phase Goals

– Development of high throughput structure genomics pipeline to produce unique, non-redundant protein structures

– Pilots for testing all facets and strategies of structural genomics

• PSI target selection policy – Representatives of protein sequence families – Public release of all targets, progress, results, and

structures

J. Norvell

Page 9: Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by Combined Analysis of 3D Structural and Sequence Residues.

PSI Pilot Research Centers

•Seven research centers funded in FY2000

•Two additional research centers funded in FY2001

•Co-funding by NIAID for two of the nine research centers

•Many subprojects

J. Norvell

Page 10: Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by Combined Analysis of 3D Structural and Sequence Residues.

PSI Pilot Phase -- Lessons Learned

• Structural genomics pipelines can be constructed and scaled-up

• High throughput operation works for many proteins

• Genomic approach works for structures• Bottlenecks remain for some proteins• A coordinated, 5-year target selection policy

must be developed• Homology modeling methods need

improvement

J. Norvell

Page 11: Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by Combined Analysis of 3D Structural and Sequence Residues.

Bioinformatics

Barry Honig, Columbia University Mark Gerstein, Yale UniversitySharon Goldsmith, Columbia UniversityChern Goh, Yale UniversityIgor Jurisica, Ontario Cancer Inst.Andrew Laine, Columbia UniversityJessica Lau, Rutgers UniversityJinfeng Liu, Columbia UniversityDiana Murray, Cornell Medical SchoolBurkhard Rost, Columbia UniversityMike Wilson, Yale University

X-ray Crystallography

Wayne Hendrickson, Columbia UniversityPeter Allen, Columbia UniversityGeorge DeTitta, Hauptman-WoodwardJohn Hunt, Columbia University Rich Karlin, Columbia University Joe Luft, Hauptman-WoodwardAlex Kuzin, Columbia University Phil Manor, Columbia UniversityLiang Tong, Columbia UniversityKalyan Das, Rutgers University

Protein Production / Biophysics

Gaetano Montelione, Rutgers University Thomas Acton, Rutgers UniversityStephen Anderson, Rutgers UniversityCheryl Arrowsmith, Ontario Cancer Inst.YiWen Chiang, Rutgers UniversityNatasha Dennisova, Rutgers UnivedrsityMasayori Inouye, RWJMS - UMDNJLichung Ma, Rutgers UniversityRong Xiao, Rutgers UniversityAdlinda Yee, Ontario Cancer Instit

Protein NMR

Thomas Szyperski, SUNY BuffaloJames Aramani, Rutgers University Cheryl Arrowsmith, Ontario Cancer Inst.John Cort, Pacific Northwest Natl LabsMichael Kennedy, Pacific Northwest Natl Labs Gaouhua Liu , SUNY Buffalo Theresa Ramelot, Pacific Northwest Natl LabsJanet Huang, Rutgers UniversityGaetano Montelione, Rutgers UniversityGVT Swapna, Rutgers UniversityBin Wu, Ontario Cancer Inst.

Northeast Structural Genomics Consortium:A SG Research Network

Page 12: Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by Combined Analysis of 3D Structural and Sequence Residues.

Goals of the NESG Consortium

Short TermDevelop a Scalable Platform for Structural and Functional Proteomics of Prokaryotic and Eukaryotic Proteins

Long TermCharacterize the repertoire of eukaryotic protein structural domain families

Page 13: Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by Combined Analysis of 3D Structural and Sequence Residues.

The NESG Publication Network

PubNetDouglas, Montelione, GersteinBioinformatics, 2005 in press

Page 14: Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by Combined Analysis of 3D Structural and Sequence Residues.

Target Selection Strategy

Page 15: Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by Combined Analysis of 3D Structural and Sequence Residues.

Target Selection for Structural ProteomicsC. Orengo, Snowbird, UT 4.17.04

How many protein families can we identify in the How many protein families can we identify in the genomes with/without structural genomes with/without structural

representatives?representatives?

Which families should we target to maximise Which families should we target to maximise the structural coverage of the genomes?the structural coverage of the genomes?

Can we select families to optimise function Can we select families to optimise function coverage?coverage?

Page 16: Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by Combined Analysis of 3D Structural and Sequence Residues.

Rost Clusters: Structural Genomics Targets

• Protein domain families / clusters

• Full length proteins < 340 amino acids

• No member > 30% identity to PDB structures

• No regions of low complexity

• Not predicted to be membrane associated

Target genomes Reagent genomes (prokaryotes):Eukaryotes Eubacteria Archea

Arabidopsis thaliana (A) Aquifex aeolicus (Q) Aeropyrum pernix (X)Caenorhabditis elegans (W) Bacillus subtilis (S) Archaeoglobus fulgidus (G)Drosophila melanogaster (F) Escherichia coli (E) Methanobacterium thermoautotrophicum (T)

Homo sapiens (H) Haemophilus influenzae (I) Pyrococcus horikoshii (J)Saccharomyces cerevisiae (Y) Helicobacter pylori (P)

(Mus musculus) Staphylococcus aureus (Z)Thermotoga maritima (V)Campylobacter jejuni (B)

Neisseria meningitides (M)Thermus thermophilus (U)

~ 20,000 “NESG Clusters”

Page 17: Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by Combined Analysis of 3D Structural and Sequence Residues.

NESG Domain Clusters• Protein domain families / clusters

• Full length proteins < 340 amino acids

• No member > 30% identity to PDB structures

• No regions of low complexity

• Not predicted to be membrane associated

Aeropyrum pernixAquifex aeolicusArabidopsis thalianaArchaeglobus fulgidisBacillus subtilisBrucella melitensisCaenorhabditis elegansCampylobacter jejuniCaulobacter crescentus

Deinococcus radioduransDrosophila melanogaster

Escherichia coliFusobacterium nucleatumHaemophilus influenzaeHelicobacter pyloriHomo sapiens

Human cytomegalovirus Lactococcus lactisM. thermoautotrophicumNeisseria meningitidisOtherPyrococcus furiosusPyrococcus horikoshi

Saccharomyces cerevisiaeStaphylococcus aureusStreptococcus pyogenesStreptomyces coelicolor

Thermoplasma acidophilumThermotoga maritimaThermus thermophilusVibrio cholerae

Liu, Hegi, Acton, Montelione, & Rost PROTEINS 2004. 56: 188-200 Wunderlich et al. PROTEINS 2004 56: 181-187 Acton et al. Meths Enzymol. 2005 in press

1 Euka: 2 Proka

Cloned / Expressed> 1000 Human Proteins

WR41ET8

Page 18: Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by Combined Analysis of 3D Structural and Sequence Residues.

Protein StructureProduction

Page 19: Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by Combined Analysis of 3D Structural and Sequence Residues.

Primer Prímer Program

http://www-nmr.cabm.rutgers.edu/bioinformatics/index.html

Page 20: Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by Combined Analysis of 3D Structural and Sequence Residues.

DNA Mini-preps PCR ReactionSet up-96 well

PCR Purification

Res

tric

tio

n D

iges

t

Qiaquick Purify

Lig

atio

nT

ran

sfo

rm C

olo

ny

PC

R

Cycle Sequencing

Big Dye removal

Auto-Steps with the Biorobot 8000

Page 21: Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by Combined Analysis of 3D Structural and Sequence Residues.

96- Well Expression

Overnight culture

24 Well Blocks

2 ml of MJ9

Transfer ~200 ul of overnight culture to appropriate well

Page 22: Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by Combined Analysis of 3D Structural and Sequence Residues.

HR969

HSQC and HetNOE Screening

Amenability to Structural Determination by NMR

Is Determined on NiNTA-Purified Samples

# Targets Good Excellent314 60 25

20% 8%

Page 23: Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by Combined Analysis of 3D Structural and Sequence Residues.

Some 30% of full-length, expressed, soluble eukaryotic proteinsfrom the Rost Clusters produced in E. coli by NESG are DISORDERED based on Heteronuclear 1H-15N NOE Data

Critical NMR Observation From SPiNE

It may not be possible to determine 3D structures of a large portion of the Rost domain families in isolation!

Page 24: Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by Combined Analysis of 3D Structural and Sequence Residues.

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Sample Optimization - Buffer Screening

Microdialysis Buttons- Optimization for NMR

Buffer NaCl DTT Arginine

50 mM Ammonium Acetate pH 5.0 0 0 0

50 mM Ammonium Acetate pH 5.0 0 10 mM 0

50 mM Ammonium Acetate pH 5.0 0.1 M 10 mM 0

50 mM Ammonium Acetate pH 5.0 0 10 mM 0.1 M

50mM MES pH 6.0 0 0 0

50mM MES pH 6.0 0 10 mM 0

50mM MES pH 6.0 0.1 M 10 mM 0

50mM MES pH 6.0 0 10 mM 0.1 M

50mM Bis.Tris pH 6.5 0 0 0

50mM Bis.Tris pH 6.5 0 10 mM 0

50mM Bis.Tris pH 6.5 0.1 M 10 mM 0

50mM Bis.Tris pH 6.5 0 10 mM 0.1 M

Vary Buffer Conditions - Stability

Screen for ppt.

100 mM Arginine Small sample mass

(50 ug/button)

Bagby S, Tong KI, Liu D, Alattia JR, Ikura M. 1997. J Biomol NMR.

Page 25: Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by Combined Analysis of 3D Structural and Sequence Residues.

Monodisperse Conditions

Aggregation Screening - Crystallization

Analytical Gel Filtration with Light Scattering

Proterion - 96 Well

Less Sample

More Conditions

Philip Manor, Roland Satterwhite and John Hunt

LS

RI

Page 26: Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by Combined Analysis of 3D Structural and Sequence Residues.

5 hours

12 hours

ÄKTAxpress™

4 modules in parallel 16 samples AC-GF

AC

AC/GF

Affinity Chromatography (AC)HiTrap™ Chelating HP, 1 and 5 ml

Gel Filtration (GF)HiLoad 16/60 Superdex 200 pg

Page 27: Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by Combined Analysis of 3D Structural and Sequence Residues.

Solubility / 2004 Stats

Organism Cloned % Sol* PDBA. aeolicus (Q) 85 46 3A. thaliana (A) 35 29 1A. fulgidis (G) 23 74 2B. subtilis (S) 158 49 4B. melitensis (L) 15 67 0C. elegans (W) 90 50 6C. jejuni (B) 20 55 0D. melanogaster (F) 113 15 1E. faecalis (Ef) 23 100 0E. coli ( E) 118 50 12H. influenzae (I) 101 57 4H. pylori (P) 75 21 1H. sapiens (H) 548 43 4N. meningitidis (M) 22 54 1P. furiosus (Pf) 48 46 2P. horikosh i (J) 19 63 1S. pyogenes (D) 12 50 1

*defined as greater than 60% soluble by SDS-PAGE analysis

Many HR (Human) proteins in advanced stages of NMR

3 HR Crystal structures

Total Week GoalCloned 511 51 50Fermented 183 20 ~20-24Purified 180 20 ~20-24

2004 ProductionSolubility vs Organism

2004 HR Success

T. Acton et al

Page 28: Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by Combined Analysis of 3D Structural and Sequence Residues.

Internet-based Data Management

Page 29: Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by Combined Analysis of 3D Structural and Sequence Residues.

Cloned Targets 4,220

Purified Targets 1,458

Crystal Structures in PDB 84

NMR Structures in PDB 72

Structures In PDB 147

Total Structures 160

In Refinement (NMR + Xray) 13

Intrinsicly Unfolded Proteins > 70

New Folds 12

Publications 209

NESG PROGRESS SUMMARY Jan 1, 2005

Intrinsically Disordered ProteinsFull-length Proteins Produced in E. coli

Organism % UnfoldedE. coli 8%yeast 18%fly / worm 25%human 35%

Page 30: Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by Combined Analysis of 3D Structural and Sequence Residues.

Phylogenetic Distribution of 160 NESG Structures

Most (>95%) completed NESG structures are

members of eukaryotic protein domain families

Eukar

yotic

Eubacteria

Archea

Some 35 (~20%) NESG structures submitted to the PDB are eukaryotic

proteins

Page 31: Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by Combined Analysis of 3D Structural and Sequence Residues.

Uniqueness of NESG Structures

Page 32: Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by Combined Analysis of 3D Structural and Sequence Residues.

Leverage of NESG Structures

lower panel: number of proteins for which the sequence-unique structures experimentally determined (red) by each consortium could be used to buildhomology models (light green).

upper panel shows the number of new models that could be built for ten entirely sequenced eukaryotes (tan) and for the human genome (green)

Total Leverage ~20,000 Structures Novel Leverage ~ 4,000 Structures

Liu and Rost