Protein Structure Prediction Xiaole Shirley Liu And Jun Liu STAT115.

Protein Structure Prediction

Xiaole Shirley Liu

Jun Liu

STAT115

Protein Structure PredictionRam Samudrala

University of Washington

STAT1153

Outline

• Motivations and introduction

• Protein 2nd structure prediction

• Protein 3D structure prediction– CASP– Homology modeling– Fold recognition– ab initio prediction– Manual vs automation

• Structural genomics

STAT1154

Protein Structure

• Sequence determines structure, structure determines function

• Most proteins can fold by itself very quickly

• Folded structure: lowest energy state

Protein Structure• Main forces for considerations

– Steric complementarity– Secondary structure preferences (satisfy H

bonds)– Hydrophobic/polar patterning– Electrostatics

Rationale for understanding protein structure and function

Protein sequence

-large numbers of sequences, including whole genomes

Protein function

- rational drug design and treatment of disease- protein and genetic engineering- build networks to model cellular pathways- study organismal function and evolution

structure determination structure prediction

homologyrational mutagenesisbiochemical analysis

model studies

Protein structure

- three dimensional- complicated- mediates function

STAT1157

Protein Databases

• SwissProt: protein knowledgebase

• PDB: Protein Data Bank, 3D structure

View Protein Structure

• Free interactive viewers

• Download 3D coordinate file from PDB

• Quick and dirty:– VRML– Rasmol– Chime

• More powerful– Swiss-PdbViewer

Compare Protein Structures

• Structure is more conserved than sequence• Why compare?

– Detect evolutionary relationships– Identify recurring structural motifs– Predicting function based on structure– Assess predicted structures

• Protein structure comparison and classification– Manual: SCOP– Automated: DALI

Compare protein structures

• Need ways to determine if two protein structures are related and to compare predicted models to experimental structures

• Commonly used measure is the root mean square deviation (RMSD) of the Cartesian atoms between two structures after optimal superposition (McLachlan, 1979):

• Usually use C atoms

i i iidx dy dz

3.6 Å 2.9 Å

NK-lysin (1nkl) Bacteriocin T102/as48 (1e68) T102 best model

• Other measures include contact maps and torsion angle RMSDs

STAT11511

SCOP• Compare protein

structure, identify

recurring structural

motifs, predict function• A. Murzin et al, 1995

– Manual classification

– A few folds are highly

populated

– 5 folds contain 20% of all homologous superfamilies

– Some folds are multifunctional

STAT11512

Determine Protein Structure

• X-ray crystallography (gold standard)– Grow crystals, rate limiting, relies on the repeating

structure of a crystalline lattice

– Collect a diffraction pattern

– Map to real space electron density, build and refine structural model

– Painstaking and time consuming

STAT11513

Protein Structure Prediction

• Since AA sequence determines structure, can we predict protein structure from its AA sequence?= predicting the three angles, unlimited DoF!

• Physical properties that determine fold– Rigidity of the protein backbone

– Interactions among amino acids, including• Electrostatic interactions

• van der Waals forces

• Volume constraints

• Hydrogen, disulfide bonds

– Interactions of amino acids with water

unfolded

Protein folding landscape

Large multi-dimensional space of changing conformationsfr

folding reaction

molten globule

J=10-8 s

native

J=10-3 s

barrierheight

Protein primary structure

twenty types of amino acids

two amino acids join by forming a peptide bond

H NCα

each residue in the amino acid main chain has two degrees of freedom (and

the amino acid side chains can have up to four degrees of freedom 1-4

STAT11516

2nd Structure Prediction helix, sheet, turn/loop

STAT11517

2nd Structure Prediction

• Chou-Fasman 1974• Base on 15 proteins (2473 AAs) of known

conformation, determine P, P from 0.5-1.5

• Empirical rules for 2nd struct nucleation– 4 H or h out of 6 AA, extends to both dir,

P > 1.03, P > P, no breakers– 3 H or h out of 5 AA, extends to both dir, P

> 1.05, P > P, no breakers

• Have ~50-60% accuracy

)20//( sj

STAT11518

P and P

STAT11519

• Garnier, Osguthorpe, Robson, 1978• Assumption: each AA influenced by flanking

positions

• GOR scoring tables (problem: limited dataset)

• Add scores, assign 2nd with highest score

STAT11520

• D. Eisenberg, 1986– Plot hydrophobicity as function of sequence

position, look for periodic repeats– Period = 3-4 AA, (3.6 aa / turn)– Period = 2 AA, sheet

• Best overall JPRED by Geoffrey Barton, use many different approaches, get consensus– Overall accuracy: 72.9%

STAT11521

3D Protein Structure Prediction

• CASP contest: Critical Assessment of Structure Prediction

• Biannual meeting since 1994 at Asilomar, CA• Experimentalists: before CASP, submit sequence

of to-be-solved structure to central repository• Predictors: download sequence and minimal

information, make predictions in three categories• Assessors: automatic programs and experts to

evaluate predictions quality

STAT11522

CASP Category I• Homology Modeling (sequences with high

homology to sequences of known structure)

• Given a sequence with homology > 25-30% with known structure in PDB, use known structure as starting point to create a model of the 3D structure of the sequence

• Takes advantage of knowledge of a closely related protein. Use sequence alignment techniques to establish correspondences between known “template” and unknown.

STAT11523

CASP Category II

• Fold recognition (sequences with no sequence identity (<= 30%) to sequences of known structure

• Given the sequence, and a set of folds observed in PDB, see if any of the sequences could adopt one of the known folds

• Takes advantage of knowledge of existing structures, and principles by which they are stabilized (favorable interactions)

STAT11524

CASP Category III• Ab initio prediction (no known homology with any

sequence of known structure)

• Given only the sequence, predict the 3D structure from “first principles”, based on energetic or statistical principles

• Secondary structure prediction and multiple alignment techniques used to predict features of these molecules. Then, some method necessary for assembling 3D structure.

STAT11525

Structure Prediction Evaluation

• Hydrophobic core similar? • 2nd struct identified?• Energy: minimized? H-bond contacts?• Compare with solved crystal structure: gold

standard

yxyxNRMSD Ni

2||||),;(

Comparative modelling of protein structure

KDHPFGFAVPTKNPDGTMNLMNWECAIPKDPPAGIGAPQDN----QNIMLWNAVIP** * * * * * * * **

… …

scanalign

build initial modelconstruct non-conserved

side chains and main chains

refine

STAT11527

Homology Modeling Results

• When sequence homology is > 70%, high resolution models are possible (< 3 Å RMSD)

• MODELLER (Sali et al)– Find homologous proteins with known

structure and align– Collect distance distributions between atoms in

known protein structures– Use these distributions to compute positions for

equivalent atoms in alignment– Refine using energetics

STAT11528

• Many places can go wrong:– Bad template - it doesn’t have the same

structure as the target after all– Bad alignment (a very common problem)– Good alignment to good template still gives

wrong local structure– Bad loop construction– Bad side chain positioning

STAT11529

• Use of sensitive multiple alignment (e.g. PSI-BLAST) techniques helped get best alignments

• Sophisticated energy minimization techniques do not dramatically improve upon initial guess

STAT11530

Fold Recognition Results

• Also called protein threading

• Given new sequence and library of known folds, find best alignment of sequence to each fold, returned the most favorable one

STAT11531

Fold Recognition with Dynamic Programming

• Environmental class for each AA based on known folds (buried status, polarity, 2nd struct)

STAT11532

Protein Folding with Dynamic Programming

• D. Eisenburg 1994• Align sequence to each fold (a string of

environmental classes)

• Advantages: fast and works pretty well• Disadvantages: do not consider AA contacts

STAT11533

• Each predictor can submit N top hits

• Every predictor does well on something

• Common folds (more examples) are easier to recognize

• Fold recognition was the surprise performer at CASP1. Incremental progress at CASP2, CASP3, CASP4…

STAT11534

• Alignment (seq to fold) is a big problem

STAT11535

ab initio

• Predict interresidue contacts and then compute structure (mild success)

• Simplified energy term + reduced search space (phi/psi or lattice) (moderate success)

• Creative ways to memorize sequence structure correlations in short segments from the PDB, and use these to model new structures: ROSETTA

Ab initio prediction of protein structuresample conformational space such that

native-like conformations are found

astronomically large number of conformations5 states/100 residues = 5100 = 1070

select

hard to design functionsthat are not fooled by

non-native conformations(“decoys”)

Sampling conformational space – continuous approaches

• Most work in the field- Molecular dynamics- Continuous energy minimization (follow a valley)- Monte Carlo simulation- Genetic Algorithms

• Like real polypeptide folding process

• Cannot be sure if native-like conformations are sampled

energy

Molecular dynamics

• Force = -dU/dx (slope of potential U); acceleration, force = m ×a(t)

• All atoms are moving so forces between atoms are complicated functions of time

• Analytical solution for x(t) and v(t) is impossible; numerical solution is trivial

• Atoms move for very short times of 10-15 seconds or 0.001 picoseconds (ps)

x(t+t) = x(t) + v(t)t + [4a(t) – a(t-t)] t2/6

v(t+t) = v(t) + [2a(t+t)+5a(t)-a(t-t)] t/6

Ukinetic = ½ Σ mivi(t)2 = ½ n KBT

• Total energy (Upotential + Ukinetic) must not change with time

new positionold position

new velocity

old velocity acceleration

n is number of coordinates (not atoms)

Energy minimization

For a given protein, the energy depends on thousands of x,y,z Cartesian atomic coordinates; reaching a deep minimum is not trivial

Furthermore, we want to minimize the free energy, not just the potential energy.

energy

number of steps deep minimum

starting conformation

Monte Carlo Simulation• Propose moves in torsion or Cartesian conformation

space• Evaluate energy after every move, compute E• Accept the new conformation based on

• If run infinite time, the simulated conformation follows the Boltzmann distribution

• Many variations, including simulated annealing and other heuristic approaches.

ΔEP exp

E(C)( ) exp

Scoring/energy functions

• Need a way to select native-like conformations from non-native ones

• Physics-based functions: electrostatics, van der Waals, solvation, bond/angle terms.

• Knowledge-based scoring functions: – Derive information about atomic properties

from a database of experimentally determined conformations

– Common parameters include pairwise atomic distances and amino acid burial/exposure.

STAT11542

Rosetta

• D. Baker, U. Wash• Break sequence into short segments (7-9 AA)• Sample 3D from library of known segment

structures, parallel computation• Use simulated annealing (metropolis-type

algorithm) for global optimization– Propose a change, if better energy, take; otherwise take

at smaller probability

• Create 1000 structures, cluster and choose one representative from each cluster to submit

STAT11543

Manual Improvements and Automation

• Very often manual examination could improve prediction– Catch errors– Need domain knowledge– A. Murzin’s success at CASP2

• CAFASP: Critical Assessment of Fully Automated Structure Prediction– Murzin Can’t play!!

• MetaServers: combine different methods to get consensus

STAT11544

CAFASP Evaluation

STAT11545

Structural Genomics

• With more and more solved structures and novel folds, computational protein structure prediction is going to improve

• Structural genomics: – Worldwide initiative to high throughput

determine many protein structures– Especially, solve structures that have no

homology

STAT11546

Summary• Protein structures: 1st, 2nd, 3rd, 4th

– Different DB: SwissProt, PDB and SCOP– Determine structure: X-ray crystallography

• Protein structure prediction:– 2nd structure prediction– Homology modeling– Fold recognition– Ab initio– Evaluation: energy, RMSD, etc– CASP and CAFASP contest

• Manual improvement and combination of computational approaches work better

• Structural Genomics, still very difficult problem…

STAT11547

Acknowledgement

• Amy Keating

• Michael Yaffe

• Mark Craven

• Russ Altman

Protein Structure Prediction Xiaole Shirley Liu And Jun Liu STAT115.

Documents

Transcript of Protein Structure Prediction Xiaole Shirley Liu And Jun Liu STAT115.

STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Yang Li Lin Liu Jan 29, 2014 1.

1 Hidden Markov Model Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520.

Xu, Yang; Liu, Jia; Shen, Yulong; Liu, Jun; Jiang ...

Genome & Exome Sequencing Read Mapping Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520.

Stat 115 Lecture Notes 12 - math.wsu.edumath.wsu.edu/faculty/xchen/stat115/LectureNotes6/LectureNotes12_notes.pdfStat 115 Lecture Notes 12 Xiongzhi Chen Washington State University

Tracing Tuples Across Dimensions A Comparison of Scatterplots and Parallel Coordinate Plots Xiaole Kuang (Master student, NUS) Haimo Zhang (PhD student,

Gene Set Enrichment Analysis Microarray Classification STAT115 Jun S. Liu and Xiole Shirley Liu.

24 1 Deploying Wireless Sensors to Achieve Both Coverage and Connectivity Xiaole Bai Santosh Kumar Dong…

STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

RNA-Seq Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520.

Fast Sequence Search Multiple Sequence Alignment Xiaole Shirley Liu STAT115/STAT215, 2010.

Immuno and Epigenetic Therapies Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520.

STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology STAT115 Lab 3 PART I Homework Q8 The Dot Matrix Method.

Microarray Normalization Xiaole Shirley Liu STAT115 / STAT215.

LncRNA Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520.

Guangwu Liu , Yangang Liu - Atlantis Press

Teacher : Liu Chunling Liu Chunling Please introduce yourself!

Genome Sequencing and Assembly High throughput Sequencing Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520.

STAT115 STAT215 BIO512 BIST298 Introduction to Computational Biology and Bioinformatics Spring 2015 Xiaole Shirley Liu Please Fill Out Student Sign In.

More on TF Motif Finding ChIP-chip / seq Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520.