Crystallography -- Lecture 22 Refinement and Validation.

Refinement

Steps after initial modeling:

(1) Rigid body refinement.

(2) Density modification.

(3) Difference maps.

(4) Least squares, protein coordinates + overall B-factor.

(5) Add waters, ions. More least squares.

(6) Least squares, protein coordinates + atomic B-factors.

(7) Least squares, multiple occupancy and anisotropic B-factors.

(8) Validation. Publication!

Initial model to final model

Rigid body refinement

(1) Rigid body refinement. After molecular replacement only, to get the precise orientation of the molecule relative to the crystal axes. Whole molecule treated as a rigid group. Model may be cut into domains. If so, then each domain is rigid-body refined.

Density modification

(2) Density modification. Coordinate-free refinement. The map is modified directly, then new phases are calculated. This step may be skipped for good starting models.

Density modification cycle (diagram): initial phases → map → modified map → Fc’s and new phases → Fo’s and (new) phases → map

Solvent Flattening: Make the water part of the map flat.

(1) Draw envelope around protein part

(2) Set solvent to <ρ> (the mean solvent density) and back transform.
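The solvent-flattening cycle can be sketched on a toy one-dimensional map. This is a minimal illustration, not the algorithm of any particular program: the function names, the 1D grid and the ready-made protein mask are all assumptions made for the example, and a slow direct Fourier sum stands in for an FFT.

```python
import cmath

def structure_factors(density):
    """Forward transform of a 1D map: F(h) = sum over x of rho(x) * exp(2*pi*i*h*x/n)."""
    n = len(density)
    return [sum(rho * cmath.exp(2j * cmath.pi * h * x / n)
                for x, rho in enumerate(density)) for h in range(n)]

def density_map(coefficients):
    """Back transform from structure factors to a real 1D map."""
    n = len(coefficients)
    return [sum(f * cmath.exp(-2j * cmath.pi * h * x / n)
                for h, f in enumerate(coefficients)).real / n for x in range(n)]

def flatten_solvent(density, protein_mask):
    """Set every grid point outside the protein envelope to the mean solvent density."""
    solvent = [rho for rho, inside in zip(density, protein_mask) if not inside]
    mean_solvent = sum(solvent) / len(solvent)
    return [rho if inside else mean_solvent
            for rho, inside in zip(density, protein_mask)]

def density_modification_cycle(f_obs, phases, protein_mask):
    """One cycle: Fo's + current phases -> map -> flattened map -> new phases."""
    current_map = density_map([a * cmath.exp(1j * p) for a, p in zip(f_obs, phases)])
    modified = flatten_solvent(current_map, protein_mask)
    new_phases = [cmath.phase(f) for f in structure_factors(modified)]
    return modified, new_phases
```

The next cycle would combine the observed amplitudes with `new_phases`; the observed amplitudes themselves are never changed.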

Skeletonization:

(1) Calculate map.

(2) Skeletonize the map

(3) Make the skeleton “protein-like”

(4) Back transform the skeleton.

Protein-like means: (a) no cycles, (b) no islands

Difference maps

(Fo-Fc) = Difference map. Fc is calculated from the coordinates. This map shows missing or wrongly placed atoms.

(2Fo-Fc) = This is a “native” map (Fo) plus a difference map (Fo-Fc). This map should look like the corrected model.

(X) means “maps calculated using amplitudes X”

Omit map = Difference map or 2Fo-Fc after removing suspicious coordinates. Removes “phase bias” density that results from least-squares refinement using wrong coordinates.

(3) Difference maps are used throughout the refinement process after a model has been built.
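As a small sketch of what "maps calculated using amplitudes X" means for one reflection: the amplitude is Fo-Fc or 2Fo-Fc, and the phase is the calculated one. The function name and the `kind` argument are made up for this illustration.

```python
import cmath

def map_coefficient(f_obs, f_calc, phase_calc, kind="Fo-Fc"):
    """Complex map coefficient for one reflection, using the calculated phase (radians)."""
    if kind == "Fo-Fc":
        amplitude = f_obs - f_calc          # difference map
    elif kind == "2Fo-Fc":
        amplitude = 2.0 * f_obs - f_calc    # "native" map plus difference map
    else:
        raise ValueError("kind must be 'Fo-Fc' or '2Fo-Fc'")
    return amplitude * cmath.exp(1j * phase_calc)
```

Back-transforming these coefficients over all reflections gives the map. For an omit map the same function applies, with Fc and its phase calculated after the suspicious atoms are removed.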

Féthière et al., Protein Science (1996), 5: 1174-1183.

Omit maps

Two inhibitor peptides in two different crystals of the protease thrombin.

The inhibitor coordinates were omitted from the model before calculating Fc.

Then maps were made using Fo-Fc amplitudes and Fc phases.

(stereo images)

Least-squares refinement

•The partial derivative of the R-factor with respect to each atomic position can be calculated, because we know the change in amplitudes with change in coordinates.

•A 3D derivative is a “gradient”. Each atom is moved down-hill along the gradient.

•“Restraints” may be imposed to maintain good stereochemistry.

$\frac{\partial R_{factor}}{\partial \vec{r}_i}$

(4) Least squares, protein coordinates + overall B-factor.
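The idea of moving each atom downhill along the gradient can be sketched on a toy 1D "crystal" of point atoms. Everything here is an assumption made for illustration: the least-squares residual on amplitudes, the numerical derivative, the step-halving rule, and the absence of restraints. Real refinement programs use analytical derivatives and add stereochemical restraint terms.

```python
import cmath

def amplitudes(xs, hs):
    """|Fc(h)| for unit point atoms at fractional positions xs."""
    return [abs(sum(cmath.exp(2j * cmath.pi * h * x) for x in xs)) for h in hs]

def residual(xs, f_obs, hs):
    """Least-squares residual: sum over h of (|Fo| - |Fc|)^2."""
    return sum((fo - fc) ** 2 for fo, fc in zip(f_obs, amplitudes(xs, hs)))

def gradient(xs, f_obs, hs, eps=1e-6):
    """Numerical partial derivative of the residual with respect to each coordinate."""
    grad = []
    for i in range(len(xs)):
        up, down = list(xs), list(xs)
        up[i] += eps
        down[i] -= eps
        grad.append((residual(up, f_obs, hs) - residual(down, f_obs, hs)) / (2 * eps))
    return grad

def refine(xs, f_obs, hs, cycles=200, step=1e-3):
    """Move all atoms downhill; halve the step whenever the residual would rise."""
    xs = list(xs)
    current = residual(xs, f_obs, hs)
    for _ in range(cycles):
        grad = gradient(xs, f_obs, hs)
        trial = [x - step * g for x, g in zip(xs, grad)]
        trial_residual = residual(trial, f_obs, hs)
        if trial_residual < current:
            xs, current = trial, trial_residual
        else:
            step *= 0.5
    return xs, current
```

Starting from slightly wrong positions, `refine` lowers the residual. Started far from the truth it can stall in a local minimum, which is the radius-of-convergence problem discussed later.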

Restraint types: bond lengths, bond angles, torsion angles, planar groups, van der Waals

Stereochemical constraints

(Figure labels: bond lengths, bond angles, planar groups)

• Bond lengths, angles, and planar groups may be fixed (frozen) to their ideal values during refinement.

• Using constraints, Ser has 3 parameters, Phe 4, and Arg 6.

• There are an average of 3.5 torsion angles per residue.

• Papain has ~700 torsion angle parameters.

data/parameter ratio = 25,000/700 ≈ 35

Constraints reduce the effective number of parameters.

Adding waters, ions.

(5) Add waters, ions. More least squares.

Calculate difference map

Place waters (just an oxygen) at a positive density peak if

(1) there is no atom there,

(2) there is an atom nearby,

(3) the density or shape does not suggest an ion or ligand.
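The first two criteria are distance checks and can be sketched as below. The cutoffs of 2.4 Å and 3.5 Å are assumed values chosen for illustration, not numbers from the lecture, and criterion (3) is left to visual inspection of the map.

```python
import math

def is_water_candidate(peak, atoms, min_dist=2.4, max_dist=3.5):
    """peak and atoms are (x, y, z) positions in Angstroms.

    min_dist and max_dist are hypothetical cutoffs:
    no atom closer than min_dist (criterion 1),
    at least one atom within max_dist (criterion 2).
    """
    distances = [math.dist(peak, atom) for atom in atoms]
    nothing_there = all(d >= min_dist for d in distances)
    neighbour_nearby = any(d <= max_dist for d in distances)
    return nothing_there and neighbour_nearby
```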

Atomic B-factor refinement

Restraint: Atoms that are bonded to each other should not have large differences in B.

B = “temperature factor” = Gaussian, d^-2-dependent scale factor

Gaussian equation: $e^{-B\left(\frac{\sin^2\theta}{\lambda^2}\right)} = e^{-\frac{B}{4d^2}}$

The derivative of the R-factor with respect to B can be calculated, since B affects the amplitudes.

Because the high-resolution amplitudes depend on B more than low-resolution amplitudes, high resolution (2.5 Å or better) is required to refine atomic B-factors.

FT: $\vec{F}(\vec{h}) = \sum_{g\,=\,\text{all atoms}} f(g)\, e^{2\pi i\, \vec{h} \cdot \vec{r}(g)}\, e^{-B_g\left(\frac{\sin^2\theta}{\lambda^2}\right)}$

(6) Least squares, protein coordinates + atomic B-factors.
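The Fourier sum with per-atom B-factors can be written out directly. In this sketch the atom tuple layout and the function name are assumptions, the scattering factor f is treated as a constant, and the damping uses sin²θ/λ² = 1/(4d²).

```python
import cmath
import math

def structure_factor(hkl, atoms, d_spacing):
    """F(h) for atoms given as (f, (x, y, z) fractional, B in A^2); d_spacing in A."""
    total = 0j
    for f, r, b in atoms:
        phase = 2.0 * math.pi * sum(h * x for h, x in zip(hkl, r))
        damping = math.exp(-b / (4.0 * d_spacing ** 2))  # Gaussian B-factor term
        total += f * cmath.exp(1j * phase) * damping
    return total
```

For the same B, a reflection at d = 2 Å is damped much more than one at d = 4 Å, which is why the high-resolution data carry most of the information about B.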

Multiple Occupancy


         1         2         3         4         5         6         7         8
12345678901234567890123456789012345678901234567890123456789012345678901234567890
ATOM    145  N   VAL A  25      32.433  16.336  57.540  1.00 11.92      A1   N
ATOM    146  CA  VAL A  25      31.132  16.439  58.160  1.00 11.85      A1   C
ATOM    147  C   VAL A  25      30.447  15.105  58.363  1.00 12.34      A1   C
ATOM    148  O   VAL A  25      29.520  15.059  59.174  1.00 15.65      A1   O
ATOM    149  CB AVAL A  25      30.385  17.437  57.230  0.28 13.88      A1   C
ATOM    150  CB BVAL A  25      30.166  17.399  57.373  0.72 15.41      A1   C
ATOM    151  CG1AVAL A  25      28.870  17.401  57.336  0.28 12.64      A1   C
ATOM    152  CG1BVAL A  25      30.805  18.788  57.449  0.72 15.11      A1   C
ATOM    153  CG2AVAL A  25      30.835  18.826  57.661  0.28 13.58      A1   C
ATOM    154  CG2BVAL A  25      29.909  16.996  55.922  0.72 13.25      A1   C

PDB “ATOM” lines showing altloc indicators (A or B) in column 17 and occupancy in cols 55-60.

(7) Least squares, multiple occupancy and anisotropic B-factors. Only possible with high-resolution data and a high-quality model. Some atoms (Ser or Val sidechains) may have more than one location. Multiple alternative locations may be defined for these cases.
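A quick consistency check on alternate locations is that the occupancies of one atom add up to 1. The sketch below reads the fixed PDB columns (atom name 13-16, altloc 17, occupancy 55-60); the function name and the choice of dictionary key are assumptions for the example, and the four sample lines are the CB and CG1 records of Val 25 shown above.

```python
from collections import defaultdict

SAMPLE = [
    "ATOM    149  CB AVAL A  25      30.385  17.437  57.230  0.28 13.88",
    "ATOM    150  CB BVAL A  25      30.166  17.399  57.373  0.72 15.41",
    "ATOM    151  CG1AVAL A  25      28.870  17.401  57.336  0.28 12.64",
    "ATOM    152  CG1BVAL A  25      30.805  18.788  57.449  0.72 15.11",
]

def occupancy_sums(pdb_lines):
    """Sum occupancy over all altlocs of each atom, keyed by (chain, resSeq, resName, atom)."""
    sums = defaultdict(float)
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        atom_name = line[12:16].strip()   # columns 13-16
        res_name = line[17:20]            # columns 18-20 (altloc in column 17 is skipped)
        chain = line[21]                  # column 22
        res_seq = int(line[22:26])        # columns 23-26
        occupancy = float(line[54:60])    # columns 55-60
        sums[(chain, res_seq, res_name, atom_name)] += occupancy
    return dict(sums)
```

Here the A and B conformers carry occupancies 0.28 and 0.72, so each atom sums to 1.00.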

Anisotropic B-factors

PDB “ANISOU” lines follow “ATOM” or “HETATM” lines.

(7) Least squares, multiple occupancy and anisotropic B-factors.

Atom motions are probably not isotropic. The cloud of density for each atom can be better modeled by an ellipsoidal Gaussian. (6 parameters)

         1         2         3         4         5         6         7         8
12345678901234567890123456789012345678901234567890123456789012345678901234567890
ATOM    107  N   GLY    13      12.681  37.302 -25.211 1.000 15.56           N
ANISOU  107  N   GLY    13     2406   1892   1614    198    519   -328       N
ATOM    108  CA  GLY    13      11.982  37.996 -26.241 1.000 16.92           C
ANISOU  108  CA  GLY    13     2748   2004   1679    -21    155   -419       C
ATOM    109  C   GLY    13      11.678  39.447 -26.008 1.000 15.73           C
ANISOU  109  C   GLY    13     2555   1955   1468     87    357   -109       C
ATOM    110  O   GLY    13      11.444  40.201 -26.971 1.000 20.93           O
ANISOU  110  O   GLY    13     3837   2505   1611    164   -121    189       O
ATOM    111  N   ASN    14      11.608  39.863 -24.755 1.000 13.68           N
ANISOU  111  N   ASN    14     2059   1674   1462     27    244    -96       N
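The six ANISOU numbers are U11, U22, U33, U12, U13, U23 in units of 10^-4 Å². As a sketch of how they relate to the isotropic B on the ATOM line, the standard relation B = 8π²U applied to the mean of the three diagonal terms reproduces the B column above. The function name is an assumption for the example.

```python
import math

def equivalent_b(u11, u22, u33):
    """Equivalent isotropic B (A^2) from ANISOU diagonal terms given in 1e-4 A^2."""
    u_eq = (u11 + u22 + u33) / 3.0 * 1e-4   # mean-square displacement in A^2
    return 8.0 * math.pi ** 2 * u_eq
```

For atom 107 (N of Gly 13), `equivalent_b(2406, 1892, 1614)` rounds to 15.56, the value in the B-factor column of its ATOM line.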

Molecular dynamics w/ Xray refinement

MD samples conformational space while maintaining good geometry (low residual in restraints).

E = (residual of restraints) + (R-factor). dE/dxi is calculated for each atom i, then we move i downhill. Random vectors are added, proportional to temperature T.

The simulated annealing MD method:

(1) start the simulation “hot”

(2) “cool” slowly, trapping structure in lowest minimum.
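The hot-start, slow-cool idea can be sketched on a one-parameter toy target. Note the swap: this uses Metropolis Monte Carlo moves rather than molecular dynamics, and the energy function, cooling schedule and step size are all invented for the illustration. It is not how X-plor works internally.

```python
import math
import random

def energy(x):
    """Toy target: local minimum near x = +1, global minimum near x = -1."""
    return (x * x - 1.0) ** 2 + 0.3 * x

def anneal(x, t_start=2.0, t_end=0.01, steps=5000, seed=0):
    """Start hot, cool geometrically, and return the best (x, energy) visited."""
    rng = random.Random(seed)
    e = energy(x)
    best_x, best_e = x, e
    for i in range(steps):
        t = t_start * (t_end / t_start) ** (i / (steps - 1))  # slow cooling
        trial = x + rng.gauss(0.0, 0.5)                        # random move
        trial_e = energy(trial)
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if trial_e <= e or rng.random() < math.exp(-(trial_e - e) / t):
            x, e = trial, trial_e
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e
```

Started in the wrong well at x = 1.0, plain downhill minimization would stay near +1. The hot phase lets the search cross the barrier to the deeper minimum.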

“X-plor” Axel Brünger et al

Radius of convergence

= How far away from the truth can the model be and still find the truth?

(Figure: total residual plotted against parameter space.)

Radius of convergence depends on data and method.

More data = fewer false (local) minima

Better method = one that can overcome local minima

The final model

www.rcsb.org

Errors and Validation

Sources of error

• Error is broadly defined as the difference between your model and reality.

• Sources of error can be in the data (the crystal itself or the processing of the data) or in the molecular model.

• If the model is at fault, errors may be localized to certain parts of a model, or spread throughout.

Sources of error in crystal structures

Data: X-rays, Crystal, Detector

Model

X-rays: polarization, variable flux, collimation, filtering/monochromator

Experimental sources of error

Polarization: a vertical graphite monochromator gives horizontally polarized X-rays, with weaker scatter vertically.

Solution: zonal scaling. Scale factors are calculated in evenly sampled zones of reciprocal space.

Experimental sources of error

Variable flux: A problem for synchrotron X-rays. Solution: Use an external flux meter. Scaling.

Collimation: A large collimator means high background, large spots, and spot overlap if cell dimensions are large. A small collimator means longer exposures.

Variable wavelength: Spots may be radially smeared. Solution: Use a monochromator instead of direct X-rays.


Crystal sources of error: mosaicity, twinning, absorption, decay, non-isomorphism


Solutions for crystal errors: separate multiple crystals; clean and dry the crystal; get a better crystal; give up, start over; freeze the crystal


Detector sources of error: saturation limit, machining, pixel size

Solutions: sue; shorter exposures; back up, you’re too close

Computational Sources of error

Model errors and their checks:

data/parameter ratio: Luzzati or … plot will estimate errors. Real-space R.

phase bias: Omit maps, 2Fo-Fc maps.

bad geometry: PROCHECK

Cross-validation: The free R-factor

The R-factor measures the residual difference between observed and calculated amplitudes.

Free R is summed on a “test set”. Test set data were not used for refinement.

Free R asks: “How well does your model predict the data it hasn’t been fit to?”

$R_{free} = \frac{\sum_{h \in T} \left| F_{obs}(h) - k F_{calc}(h) \right|}{\sum_{h \in T} F_{obs}(h)}$

Note: T = independent test set of F’s.
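In code, the only difference between the working R-factor and the free R-factor is which reflections enter the sums. The sketch below assumes amplitudes are supplied as plain lists and that a boolean flag marks the test set T; the function names are made up for the example.

```python
def r_factor(f_obs, f_calc, k=1.0):
    """R = sum |Fobs - k*Fcalc| / sum Fobs over the reflections given."""
    numerator = sum(abs(fo - k * fc) for fo, fc in zip(f_obs, f_calc))
    return numerator / sum(f_obs)

def r_work_and_free(f_obs, f_calc, in_test_set, k=1.0):
    """Split reflections by the test-set flag and return (R_work, R_free)."""
    work = [(fo, fc) for fo, fc, t in zip(f_obs, f_calc, in_test_set) if not t]
    free = [(fo, fc) for fo, fc, t in zip(f_obs, f_calc, in_test_set) if t]
    r_work = r_factor([fo for fo, _ in work], [fc for _, fc in work], k)
    r_free = r_factor([fo for fo, _ in free], [fc for _, fc in free], k)
    return r_work, r_free
```

The test-set flags must be fixed before refinement starts and never changed, otherwise the test reflections stop being independent.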

What is over-fitting?

If you have three points, you can fit them to a quadratic equation (3 parameters) with zero residual, but is it right?

(Figure: observed data and the calculated curve. R-factor = 0.000!!)

Fitting unseen data, as a test

Fit is correct if additional data, not used in fitting the curve, fall on the curve.

Low residual in the “test set” validates the fit.

(Figure label: residual ≠ 0)
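The three-point example can be made concrete. In this sketch the data values are invented for illustration: three noisy points near the line y = x, plus one held-out point. The quadratic fits the three points exactly, the two-parameter line does not, yet the line predicts the held-out point better.

```python
def fit_quadratic(points):
    """Exact quadratic through three (x, y) points (Lagrange interpolation)."""
    (x0, y0), (x1, y1), (x2, y2) = points
    def model(x):
        return (y0 * (x - x1) * (x - x2) / ((x0 - x1) * (x0 - x2))
                + y1 * (x - x0) * (x - x2) / ((x1 - x0) * (x1 - x2))
                + y2 * (x - x0) * (x - x1) / ((x2 - x0) * (x2 - x1)))
    return model

def fit_line(points):
    """Least-squares straight line (2 parameters)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    return lambda x: mean_y + slope * (x - mean_x)

def residual(model, points):
    """Sum of absolute differences between observed and calculated values."""
    return sum(abs(y - model(x)) for x, y in points)

# Hypothetical data: working set used for fitting, test set held out.
WORKING_SET = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.3)]
TEST_SET = [(4.0, 4.0)]
```

Comparing `residual(model, WORKING_SET)` with `residual(model, TEST_SET)` for the two models shows the over-fitted quadratic winning on the data it was fit to and losing on the data it never saw.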

Cross-validation

Means: measuring the residual on data (a “test set”) that were not used to refine (or fit) the model.

The residual on test data is likely to be small if data/parameters is large.

(Figure label: a line has 2 parameters)

Parameters versus Data

Example from Drenth, Ch 13:

Papain crystal structure has 25,000 reflections.

Papain has 2000 non-H atoms

times 4 parameters each (x, y, z, B)

equals 8000 parameters

data/parameters = 25,000/8000 ≈ 3 <-- this is too small!
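The papain arithmetic is easy to repeat for different parameterizations. This sketch uses the reflection and atom counts quoted above and the ~700 torsion-angle figure from the constraints slide; the count of 9 parameters per atom for anisotropic refinement (x, y, z plus 6 ellipsoid terms) is an inference added here, not a number from the lecture.

```python
def data_to_parameter_ratio(n_reflections, n_parameters):
    """Observations per refined parameter."""
    return n_reflections / n_parameters

REFLECTIONS = 25_000
ATOMS = 2_000

ratios = {
    "x, y, z, B per atom": data_to_parameter_ratio(REFLECTIONS, ATOMS * 4),
    "torsion angles only (constraints)": data_to_parameter_ratio(REFLECTIONS, 700),
    # Assumption: 3 coordinates + 6 anisotropic terms per atom.
    "x, y, z + anisotropic B": data_to_parameter_ratio(REFLECTIONS, ATOMS * 9),
}
```

The ratios come out at about 3.1, 35.7 and 1.4, which is why constraints or restraints are needed at this resolution and why anisotropic refinement needs many more reflections.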

Phase error

Every reflection has a phase error, which is the difference between the calculated phase and the true (unknown) phase.

Free R-factor correlates with phase error

(Plot: free R versus <phase error>)

Thought experiment

What is the phase error for 4Å resolution reflections if the average coordinate error is 1Å?

Coordinate error causes phase error

If the error in atomic position is 1 Å, and the Bragg plane separation is 4 Å, then the error in phase is ≤ (1/4) × 360° = 90°.

If the error is a Gaussian in real space, then the phase error is also a Gaussian. (The projection of a 3D Gaussian on the normal to the Bragg planes is a 1D Gaussian.)
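The thought experiment generalizes to any resolution: a shift of one full plane spacing is a 360° phase change, so the bound scales as error/d. A one-line sketch, with a made-up function name:

```python
def max_phase_error_deg(coordinate_error, d_spacing):
    """Upper bound on the phase error (degrees) for a shift along the plane normal."""
    return 360.0 * coordinate_error / d_spacing
```

The same 1 Å error that costs at most 90° at 4 Å costs up to 180° at 2 Å, so high-resolution reflections are the most sensitive to coordinate error.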

Luzzati plot

Data is divided into shells in S (=1/d).

The R-factor for each shell is calculated and plotted.

The plot is matched to the theoretical R vs S for a model with randomly distributed errors of a given average size.

P.S. Luzzati did this in 1952, long before computers!

Map evaluator: Real space R-factor

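Electron density “residual”, summed over real-space positions r:

$R_{realspace} = \frac{\sum_{r} \left| \rho_{obs} - \rho_{calc} \right|}{\sum_{r} \left| \rho_{obs} + \rho_{calc} \right|}$

Reciprocal space R:

$R = \frac{\sum_{hkl} \left| F_{obs} - k F_{calc} \right|}{\sum_{hkl} F_{obs}}$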
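The real-space R is simple to compute once both maps are sampled on the same grid points. This sketch assumes the densities are passed as flat lists covering the region of interest, for example the grid points around one residue; the function name is made up for the example.

```python
def real_space_r(rho_obs, rho_calc):
    """R_realspace = sum |rho_obs - rho_calc| / sum |rho_obs + rho_calc|."""
    numerator = sum(abs(o - c) for o, c in zip(rho_obs, rho_calc))
    denominator = sum(abs(o + c) for o, c in zip(rho_obs, rho_calc))
    return numerator / denominator
```

Evaluating it residue by residue, rather than over the whole map, is what makes it a local diagnostic.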

Real space R-factor as a diagnostic

High B-factors or real-space R may indicate places where the model is locally wrong.

In class exercise: Procheck

http://www.biochem.ucl.ac.uk/~roman/procheck/procheck.html

To run PROCHECK on MODLAB machines:

validation -f 8dfr.pdb -o 0

(-o 0 [zero] means PDB format. This is the default, so you can omit it.)

Read procheck.out using the vi editor, or jot, or the more command. This has a summary of the output files, including their names.

Use “showps” to look at .ps files: showps xxxxx.ps

Ramachandran Plot: energy of local steric interactions

Ramachandran angle regions are

(A,B,L) most favored (red)

(a,b,l,p) allowed (yellow)

(~a,~b,~l,~p) generously allowed (beige?)

disallowed (white)

Preferred sidechain angles
