Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by...

Sequence Similarity Analysis Often Misses Evolutionary Relationships Which Can Be Detected by Combined Analysis of 3D Structural and Sequence

[Figure: % sequence identity plotted against number of residues aligned. Legend: homologous relationships established by both 3D structure and sequence: homologous / non-homologous]

Adapted from work by Sanders and co-workers

Structure can often provide valuable clues to biochemical and biophysical aspects of protein function

Structure-based Functional Genomics

Biological Functions of Genes and Proteins

• Genetic Function / Phenotype

• Cellular Function

• Biochemical Function

• Detailed Atomic Mechanism

• Biochemical Function

• Detailed Atomic Mechanism

An Important Approach to the Protein Folding Problem is to Characterize the “Natural Language of Proteins”

Representative 3D Structure from Each of Several Thousand Sequence Families of Domains

National Institutes of Health Protein Structure Initiative (PSI)

Long-Range Goal

To make the three-dimensional atomic level structures of most proteins easily available from knowledge of their corresponding DNA sequences

http://www.nigms.nih.gov/psi.html/

J. Norvell

http://www.pnas.org/content/vol97/issue2/images/large/pq0203510001.jpeg






Expected PSI Benefits

• Structure provides information on function and will aid in the design of experiments

• Development of better therapeutic targets from comparisons of protein structures from:
– Pathogens vs. hosts
– Diseased vs. normal tissues

J. Norvell

PSI Benefits (cont’d)

• Collection of structures will address key biochemical and biophysical problems
– Protein folding, prediction, folds, evolution, etc.

• Benefits to biologists
– Technology developments
– Structural biology facilities
– Availability of reagents and materials
– Experimental outcome data on protein production and crystallization

J. Norvell

PSI Pilot Phase

• 5-year pilot phase, September 2000

• Pilot phase goals
– Development of a high-throughput structural genomics pipeline to produce unique, non-redundant protein structures
– Pilots for testing all facets and strategies of structural genomics

• PSI target selection policy
– Representatives of protein sequence families
– Public release of all targets, progress, results, and structures

J. Norvell

PSI Pilot Research Centers

• Seven research centers funded in FY2000

• Two additional research centers funded in FY2001

• Co-funding by NIAID for two of the nine research centers

• Many subprojects

J. Norvell

PSI Pilot Phase -- Lessons Learned

• Structural genomics pipelines can be constructed and scaled up

• High-throughput operation works for many proteins

• Genomic approach works for structures

• Bottlenecks remain for some proteins

• A coordinated, 5-year target selection policy must be developed

• Homology modeling methods need improvement

J. Norvell

Bioinformatics

Barry Honig, Columbia University
Mark Gerstein, Yale University
Sharon Goldsmith, Columbia University
Chern Goh, Yale University
Igor Jurisica, Ontario Cancer Inst.
Andrew Laine, Columbia University
Jessica Lau, Rutgers University
Jinfeng Liu, Columbia University
Diana Murray, Cornell Medical School
Burkhard Rost, Columbia University
Mike Wilson, Yale University

X-ray Crystallography

Wayne Hendrickson, Columbia University
Peter Allen, Columbia University
George DeTitta, Hauptman-Woodward
John Hunt, Columbia University
Rich Karlin, Columbia University
Joe Luft, Hauptman-Woodward
Alex Kuzin, Columbia University
Phil Manor, Columbia University
Liang Tong, Columbia University
Kalyan Das, Rutgers University

Protein Production / Biophysics

Gaetano Montelione, Rutgers University
Thomas Acton, Rutgers University
Stephen Anderson, Rutgers University
Cheryl Arrowsmith, Ontario Cancer Inst.
YiWen Chiang, Rutgers University
Natasha Dennisova, Rutgers University
Masayori Inouye, RWJMS - UMDNJ
Lichung Ma, Rutgers University
Rong Xiao, Rutgers University
Adlinda Yee, Ontario Cancer Inst.

Protein NMR

Thomas Szyperski, SUNY Buffalo
James Aramani, Rutgers University
Cheryl Arrowsmith, Ontario Cancer Inst.
John Cort, Pacific Northwest Natl Labs
Michael Kennedy, Pacific Northwest Natl Labs
Gaouhua Liu, SUNY Buffalo
Theresa Ramelot, Pacific Northwest Natl Labs
Janet Huang, Rutgers University
Gaetano Montelione, Rutgers University
GVT Swapna, Rutgers University
Bin Wu, Ontario Cancer Inst.

Northeast Structural Genomics Consortium: A SG Research Network

Goals of the NESG Consortium

Short Term: Develop a Scalable Platform for Structural and Functional Proteomics of Prokaryotic and Eukaryotic Proteins

Long Term: Characterize the repertoire of eukaryotic protein structural domain families

The NESG Publication Network

PubNet: Douglas, Montelione, Gerstein. Bioinformatics, 2005, in press

Target Selection Strategy

Target Selection for Structural Proteomics

C. Orengo, Snowbird, UT 4.17.04

How many protein families can we identify in the genomes with/without structural representatives?

Which families should we target to maximise the structural coverage of the genomes?

Can we select families to optimise function coverage?

Rost Clusters: Structural Genomics Targets

• Protein domain families / clusters

• Full length proteins < 340 amino acids

• No member > 30% identity to PDB structures

• No regions of low complexity

• Not predicted to be membrane associated
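The selection criteria listed above can be sketched as a simple filter. This is only an illustration, not the NESG or Rost pipeline: the `Candidate` fields and the function name are hypothetical, the low-complexity and membrane flags are assumed to come from predictors run beforehand, and the 30% identity rule is interpreted here as applying to the whole family (one member above the cutoff excludes the family).

```python
from dataclasses import dataclass
from typing import List

# Thresholds taken from the slide; everything else is an illustrative assumption.
MAX_LENGTH = 340            # full-length proteins must be shorter than this
MAX_PDB_IDENTITY = 30.0     # percent identity to any PDB structure


@dataclass
class Candidate:
    name: str
    length: int                 # residues in the full-length protein
    max_pdb_identity: float     # highest % identity to any PDB structure
    has_low_complexity: bool    # assumed output of a low-complexity predictor
    predicted_membrane: bool    # assumed output of a membrane-protein predictor


def eligible_members(family: List[Candidate]) -> List[Candidate]:
    """Return the members of a family that satisfy all listed criteria.

    If any member exceeds the PDB identity cutoff, the family is treated
    as already structurally covered and no member is returned.
    """
    if any(c.max_pdb_identity > MAX_PDB_IDENTITY for c in family):
        return []
    return [
        c for c in family
        if c.length < MAX_LENGTH
        and not c.has_low_complexity
        and not c.predicted_membrane
    ]


if __name__ == "__main__":
    family = [
        Candidate("a", 120, 12.0, False, False),
        Candidate("b", 400, 10.0, False, False),   # too long
        Candidate("c", 200, 25.0, True, False),    # low-complexity region
    ]
    print([c.name for c in eligible_members(family)])
```

Treating the identity cutoff at the family level keeps the filter aligned with the stated goal of finding families without a structural representative; a per-member reading would be an equally simple variation.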

Target genomes (Eukaryotes):
Arabidopsis thaliana (A)
Caenorhabditis elegans (W)
Drosophila melanogaster (F)
Homo sapiens (H)
Saccharomyces cerevisiae (Y)
(Mus musculus)

Reagent genomes (prokaryotes), Eubacteria:
Aquifex aeolicus (Q)
Bacillus subtilis (S)
Escherichia coli (E)
Haemophilus influenzae (I)
Helicobacter pylori (P)
Staphylococcus aureus (Z)
Thermotoga maritima (V)
Campylobacter jejuni (B)
Neisseria meningitidis (M)
Thermus thermophilus (U)

Reagent genomes (prokaryotes), Archaea:
Aeropyrum pernix (X)
Archaeoglobus fulgidus (G)
Methanobacterium thermoautotrophicum (T)
Pyrococcus horikoshii (J)

~ 20,000 “NESG Clusters”

NESG Domain Clusters

• Protein domain families / clusters

• Full length proteins < 340 amino acids

• No member > 30% identity to PDB structures

• No regions of low complexity

• Not predicted to be membrane associated

[Figure legend, organisms: Aeropyrum pernix, Aquifex aeolicus, Arabidopsis thaliana, Archaeoglobus fulgidus, Bacillus subtilis, Brucella melitensis, Caenorhabditis elegans, Campylobacter jejuni, Caulobacter crescentus, Deinococcus radiodurans, Drosophila melanogaster, Escherichia coli, Fusobacterium nucleatum, Haemophilus influenzae, Helicobacter pylori, Homo sapiens, Human cytomegalovirus, Lactococcus lactis, M. thermoautotrophicum, Neisseria meningitidis, Other, Pyrococcus furiosus, Pyrococcus horikoshii, Saccharomyces cerevisiae, Staphylococcus aureus, Streptococcus pyogenes, Streptomyces coelicolor, Thermoplasma acidophilum, Thermotoga maritima, Thermus thermophilus, Vibrio cholerae]

Liu, Hegi, Acton, Montelione, & Rost. PROTEINS 2004, 56: 188-200
Wunderlich et al. PROTEINS 2004, 56: 181-187
Acton et al. Meths Enzymol. 2005, in press

1 Euka: 2 Proka

Cloned / Expressed > 1000 Human Proteins

WR41

ET8

Protein Structure Production

Primer Prímer Program

http://www-nmr.cabm.rutgers.edu/bioinformatics/index.html

[Figure: cloning workflow. Steps shown: DNA Mini-preps; PCR Reaction Set-up (96-well); PCR Purification; Restriction Digest; Qiaquick Purify; Ligation; Transform; Colony PCR; Cycle Sequencing; Big Dye Removal]

Auto-Steps with the Biorobot 8000

96-Well Expression

Overnight culture

24 Well Blocks

2 ml of MJ9

Transfer ~200 ul of overnight culture to appropriate well

HR969

HSQC and HetNOE Screening

Amenability to Structural Determination by NMR Is Determined on Ni-NTA-Purified Samples

# Targets: 314
Good: 60 (20%)
Excellent: 25 (8%)

Some 30% of full-length, expressed, soluble eukaryotic proteins from the Rost Clusters produced in E. coli by NESG are DISORDERED based on Heteronuclear 1H-15N NOE Data

Critical NMR Observation From SPiNE

It may not be possible to determine 3D structures of a large portion of the Rost domain families in isolation!


Sample Optimization - Buffer Screening

Microdialysis Buttons- Optimization for NMR

Buffer                          NaCl    DTT     Arginine
50 mM Ammonium Acetate pH 5.0   0       0       0
50 mM Ammonium Acetate pH 5.0   0       10 mM   0
50 mM Ammonium Acetate pH 5.0   0.1 M   10 mM   0
50 mM Ammonium Acetate pH 5.0   0       10 mM   0.1 M
50 mM MES pH 6.0                0       0       0
50 mM MES pH 6.0                0       10 mM   0
50 mM MES pH 6.0                0.1 M   10 mM   0
50 mM MES pH 6.0                0       10 mM   0.1 M
50 mM Bis-Tris pH 6.5           0       0       0
50 mM Bis-Tris pH 6.5           0       10 mM   0
50 mM Bis-Tris pH 6.5           0.1 M   10 mM   0
50 mM Bis-Tris pH 6.5           0       10 mM   0.1 M
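The screen above is a small grid: three buffers, each combined with the same four additive sets. As a rough sketch of how such a 12-condition layout might be generated for a plate map or sample sheet, the following enumerates the conditions in slide order. The names `BUFFERS`, `ADDITIVES`, and `buffer_screen` are hypothetical and not part of any NESG software.

```python
from itertools import product

# Buffers and additive sets transcribed from the slide's table.
BUFFERS = [
    "50 mM Ammonium Acetate pH 5.0",
    "50 mM MES pH 6.0",
    "50 mM Bis-Tris pH 6.5",
]

# (NaCl, DTT, Arginine) for each of the four additive sets, in slide order.
ADDITIVES = [
    ("0", "0", "0"),
    ("0", "10 mM", "0"),
    ("0.1 M", "10 mM", "0"),
    ("0", "10 mM", "0.1 M"),
]


def buffer_screen():
    """Return the screening conditions as a list of dicts, buffer-major order."""
    return [
        {"buffer": buf, "NaCl": nacl, "DTT": dtt, "Arginine": arg}
        for buf, (nacl, dtt, arg) in product(BUFFERS, ADDITIVES)
    ]


if __name__ == "__main__":
    for i, cond in enumerate(buffer_screen(), start=1):
        print(i, cond["buffer"], cond["NaCl"], cond["DTT"], cond["Arginine"], sep="\t")
```

Keeping buffers and additives as separate lists makes it easy to extend the screen with another buffer or additive set without retyping every row.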

Vary Buffer Conditions - Stability

Screen for ppt.

100 mM Arginine

Small sample mass (50 µg/button)

Bagby S, Tong KI, Liu D, Alattia JR, Ikura M. 1997. J Biomol NMR.

Monodisperse Conditions

Aggregation Screening - Crystallization

Analytical Gel Filtration with Light Scattering

Proterion - 96 Well

Less Sample

More Conditions

Philip Manor, Roland Satterwhite and John Hunt

[Figure: gel filtration traces from light scattering (LS) and refractive index (RI) detectors]

5 hours

12 hours

ÄKTAxpress™

4 modules in parallel 16 samples AC-GF

AC

AC/GF

Affinity Chromatography (AC): HiTrap™ Chelating HP, 1 and 5 ml

Gel Filtration (GF): HiLoad 16/60 Superdex 200 pg

Solubility / 2004 Stats

Organism              Cloned   % Sol*   PDB
A. aeolicus (Q)           85       46     3
A. thaliana (A)           35       29     1
A. fulgidus (G)           23       74     2
B. subtilis (S)          158       49     4
B. melitensis (L)         15       67     0
C. elegans (W)            90       50     6
C. jejuni (B)             20       55     0
D. melanogaster (F)      113       15     1
E. faecalis (Ef)          23      100     0
E. coli (E)              118       50    12
H. influenzae (I)        101       57     4
H. pylori (P)             75       21     1
H. sapiens (H)           548       43     4
N. meningitidis (M)       22       54     1
P. furiosus (Pf)          48       46     2
P. horikoshii (J)         19       63     1
S. pyogenes (D)           12       50     1

*defined as greater than 60% soluble by SDS-PAGE analysis
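For readers who want to work with these numbers, the table can be summarized with a few lines of code. The sketch below totals the cloned targets and computes a clone-weighted average solubility; it assumes that each "% Sol" value is a percentage of that organism's cloned targets, which the slide does not state explicitly. The variable and function names are illustrative only.

```python
# (organism, cloned, percent_soluble, structures_in_pdb), transcribed from the slide's table.
STATS = [
    ("A. aeolicus", 85, 46, 3),
    ("A. thaliana", 35, 29, 1),
    ("A. fulgidus", 23, 74, 2),
    ("B. subtilis", 158, 49, 4),
    ("B. melitensis", 15, 67, 0),
    ("C. elegans", 90, 50, 6),
    ("C. jejuni", 20, 55, 0),
    ("D. melanogaster", 113, 15, 1),
    ("E. faecalis", 23, 100, 0),
    ("E. coli", 118, 50, 12),
    ("H. influenzae", 101, 57, 4),
    ("H. pylori", 75, 21, 1),
    ("H. sapiens", 548, 43, 4),
    ("N. meningitidis", 22, 54, 1),
    ("P. furiosus", 48, 46, 2),
    ("P. horikoshii", 19, 63, 1),
    ("S. pyogenes", 12, 50, 1),
]


def total_cloned(stats):
    """Sum of cloned targets over all organisms."""
    return sum(cloned for _, cloned, _, _ in stats)


def weighted_percent_soluble(stats):
    """Average solubility weighted by the number of cloned targets.

    Assumes '% Sol' is a fraction of cloned targets (an assumption, see lead-in).
    """
    soluble = sum(cloned * pct / 100 for _, cloned, pct, _ in stats)
    return 100 * soluble / total_cloned(stats)


if __name__ == "__main__":
    print("Total cloned:", total_cloned(STATS))
    print("Weighted % soluble: %.1f" % weighted_percent_soluble(STATS))
    # Organisms ranked from most to least soluble.
    for name, cloned, pct, _ in sorted(STATS, key=lambda row: row[2], reverse=True):
        print(f"{name:18s} {pct:3d}%  (n={cloned})")
```

Weighting by the number of cloned targets keeps small sets, such as the 23 E. faecalis clones at 100%, from dominating the overall figure.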

Many HR (Human) proteins in advanced stages of NMR

3 HR Crystal structures

            Total   Week   Goal
Cloned      511     51     50
Fermented   183     20     ~20-24
Purified    180     20     ~20-24

2004 Production

Solubility vs Organism

2004 HR Success

T. Acton et al

Internet-based Data Management

Cloned Targets 4,220

Purified Targets 1,458

Crystal Structures in PDB 84

NMR Structures in PDB 72

Structures In PDB 147

Total Structures 160

In Refinement (NMR + Xray) 13

Intrinsically Unfolded Proteins > 70

New Folds 12

Publications 209

NESG PROGRESS SUMMARY Jan 1, 2005

Intrinsically Disordered Proteins

Full-length Proteins Produced in E. coli

Organism      % Unfolded
E. coli       8%
yeast         18%
fly / worm    25%
human         35%

Phylogenetic Distribution of 160 NESG Structures

Most (>95%) completed NESG structures are members of eukaryotic protein domain families

[Figure legend: Eukaryotic; Eubacteria; Archaea]

Some 35 (~20%) NESG structures submitted to the PDB are eukaryotic proteins

Uniqueness of NESG Structures

Leverage of NESG Structures

Lower panel: number of proteins for which the sequence-unique structures experimentally determined (red) by each consortium could be used to build homology models (light green).

Upper panel: number of new models that could be built for ten entirely sequenced eukaryotes (tan) and for the human genome (green).

Total Leverage ~20,000 Structures

Novel Leverage ~4,000 Structures

Liu and Rost
