Genetic Threading By J.Yadgari and A.Amir Published: special issue on Bioinformatics in Journal of...

Genetic Threading

By J.Yadgari and A.Amir

Published: special issue on Bioinformatics in Journal of Constraints, June 2001

Alexandre Tchourbanov

University of Nebraska at Lincoln

CSCE 421-821

December 4, 2001

Structure of the presentation

• Introduction to protein native structure

• Methods of finding a native structurePhysicalComputational

Common methods and principlesProtein threading method

Protein threading using genetic approach

Problem of protein structure prediction

• Proteins are key molecules in all life processes

• The function of a protein directly related to its three dimensional structure

• Knowing and understanding the structure of proteins will have a tremendous impact on understanding of biological processes, medical discoveries, and biotechnological inventions

Problem of protein structure preduction

• Given a sequence of amino acids, predict the unique 3D folding of molecule minimizing its free energy

Lys

Gly

Leu

1 2Computational

Methods of prediction

Physical methods of prediction

3

Practicaluse of the

3D structural knowledge

Primary structure

Protein structure

• A protein is built up from a chain of amino acids linked by peptide bonds

• There are 20 amino acids that can be divided into several classes based on size and other chemical and physical properties

• Depending on type of a residue, protein could be either hydrophilic (water loving) or hydrophobic (water hating)

General structure of an amino acid• Each amino acid consists of:

1. Common main chain part, containing the heavy atoms N, C, O, C forming amide plane

2. Chain residue of size 0 – 10 additional atoms

Common part

Chain residue

Peptide bond

• Peptide bond connects carboxyl group of the first amino acid with amino group of the second acid

• Peptide bonds are planar and rigid

Sequence of amino acids

• Sequence of amino acids, connected by peptide bonds, form protein

• There is no flexibility for rotation around peptide bond

• There is more flexibility for protein to rotate around N-C-bond (called the -angle) and around C-C-bond (-angle)

• These angles are restricted to small regions in natural proteins

Part of Protein (…|Phe|Asp|Ala|…)

Protein folding• Using the freedom of rotations, the protein can

fold into a specific and unique three dimensional structure (called conformation), forming a native structure

Physical methods of determiningprotein native structure

• X-ray crystallography requires significant amounts of purified protein molecules (1014) to grow a crystal and protein needs to crystallize

• NMR method applicable to proteins of small and average size, which do not crystallize

• Both methods are expensive and give coherent results on the same protein, proving to be correct

• Structure of many important proteins is still unknown

Physical methodsX-ray crystallography

NMR (Nuclear Magnetic Resonance)

Protein structure in X-ray crystallography

• X-ray diffraction pattern is recorded and processed using FFT to form electron density map

• Regions of map with the highest electron density reveal the location of atomic nuclei

Family of structures in NMR method

• Absorption of radio frequency energy is recorded as a 2D spectrum

• Possible 3D structures are constructed by computer according to NMR signal

Computational methods to find a protein structure

• The unique 3D arrangement of protein corresponds to lowest free energy conformation

• Most computational approaches for solving the protein folding problem look for the lowest free energy conformation

• Two principal methods are currently in use for computing the lowest energy conformation:

1. Molecular dynamics 2. Monte Carlo

Molecular dynamics

• Forces acting on each atom at a particular state of the system are calculated using an empirical force field

• Atoms allowed to move with accelerations resulting from forces, changing conformation

• Once atom moved significantly, acting forces are recalculated (every 10-15 sec)

• Even super computers can simulate only 10-9 sec of folding time, which is insufficient

Monte Carlo method

• Used with simplified model of protein (does not consider structure of every amino acid)

• Procedure makes random move from current conformation and evaluates resulting energy changes

• If new conformation is better, it replaces old one with newly generated, and process repeats

• Method is not powerful enough to find an optimal conformation even for simple cases

Protein threading • Many proteins in nature are homologous, having

different primary structure, but forming the same conformation to carry out the same functionality in a living matter and having the same evolutionary origin

• Most protein share the secondary structure motifs:

1. Helices

2. Extended strands forming sheets

3. Specific turns4. Random coils

Protein threading

• Threading means mapping a given sequence to a given structure

• To assign a structure to a sequence one would then need to thread the sequence through all known conformations, evaluating compatibility, and assign the most compatible structure to the sequence

• Upon discovery of completely different structure from any known, enter it into database of structures

Protein threading

• Structure is presented by the black trace

• Sequence (at the top) is threaded through the structure, encoding an alignment (at the bottom)

• Zero means structure deletion, values greater that one mean sequence deletion, while one is a fit

Protein threading

• The size of the search space to thread sequence of length k into structure of size n could be found as a selection with repetition

• Search space is huge and problem appears to be NP-complete [Unger,R., Moult,J. (1993)]

!)!1()!1(1

knkn

k

kn

Protein threading

• In order to reduce complexity of search task, (m –1) core and m non-core regions are introduced

• Usually -helices and -sheets are core regions, connected by loops

• Total number of amino acids in core regions is c

m loops (non-core)

m-1 core regions

)!()!1()!1(1

cknckm

ck

ckm

Protein threading• Although suffering from some inherent

limitations (such as prediction of the right structure with completely wrong threading), method became a significant tool in protein structure prediction

• Any threading procedure must contain two major components:1. An alignment algorithm to position a

sequence on a structure 2. Score function to evaluate the “energy”

of the sequence in given conformation

Protein threading possible implementations

• Protein threading could be implemented using:

1. Enumeration for small problems, 2. Dynamic programming to find core regions

to “freeze”, 3. Monte Carlo variants with Gibbs sampling4. Branch and bound search

• Genetic programming with constraints seems to be a decent alternative in comparison with other methods

Protein threading using genetic programming

• Genetic Algorithms are parallel computational tools that are based on the principle of diversity and selection

• Solutions are represented as strings, for example 11111100111311

• Sum of all terms in the string needs to be equal to the number of amino acids in the sequence, as well as length of the string equal to the length of the structure

Protein threading using genetic programming

• These strings are maintained as a population that undergoes evolutionary process via generic operators such as: – Replication (copying of the string to the next

generation) – Mutation (changing bits in the string) – Crossover (concatenating a prefix of one string

with suffix of another)

• Energy function is a good candidate to evaluate fit of an offspring

Energy function

• Energy functions are subject to minimizations

• Energy functions are calculated by extracting from the structural database frequencies of interactions between pairs of residues as a function of amino acids types and distance

• Tendency of certain hydrophilic residues to be on the surface can be approximated by energy term related to the position

Implementing mutation

• An example of mutation could be transformation of 1111100111311 into 11111100211211, which is also a valid encoding

• We need to have validity check every time we do mutation and compensate for problems

• Reverting of substrings is especially interesting mutation, since it does not violate a valid structure of the solution

Implementing crossovers

1 1 2 0 1 1 2 0 1 1 1 1 1 1

1 1 1 1 1 1 0 0 1 1 1 3 1 1

1 1 2 0 1 1 2 0 1 1 1 3 1 1

Parent 1

Parent 2

Offsprings

1 1 1 1 1 1 0 0 1 1 1 1 1 1

Following issues were addressed

• The linear trade-off between population size and the number of generations

• Optimal level of mutation rate

• Locality of mutation operator

• Locality of the crossover operator

• Regular mutations versus reverse mutations

• Magnitude of the mutation operation

• Quality control of the crossover operation

Results

• For author’s examples, the optimal performance is achieved with population size of 300 solutions and duration of 1000 generations

• The optimal rate of mutations is 0.25 to 0.3 of the populations

The minimal energy of threading runs

The average energy of the population during threading

Structural comparisons

Structural alignment

Most similar threading alignment

Least similar threading alignment

Difference between sequence deletions andstructure deletions plots

Maximal mutation magnitude

Average score of5 runs after 600 generations

Average score of5 runs after 2000 generations

Summary

• The running time of a GA depends linearly on the number of solutions in the population (i.e. population size) and also depends linearly on the number of generations the process is repeated

• Genetic algorithms method is a feasible and efficient approach to threading

• It is especially encouraging that the threading alignments are quite similar, quantitatively, to the structural alignments

Summary

• Changing the locality of the mutation and crossover operation does not show a consistent change in the performance of the algorithm

• Mutations of high magnitude are counterproductive, probably because changes between the template and the assigned structure do not tend to concentrate in single position

• Using crossover under strict quality control was shown not to be effective, since genetic mechanism has quality control itself

Summary

• The success of the reverse mutation is quite surprising and should be further explored

Future work

• Threading algorithms should be tested on their ability to assign a conformation for new and unknown sequence

• Authors plan to implement the genetic algorithm in a complete threading package, with all the necessary components and to test it in a realistic prediction setup.

Genetic Threading By J.Yadgari and A.Amir Published: special issue on Bioinformatics in Journal of...

Documents

Transcript of Genetic Threading By J.Yadgari and A.Amir Published: special issue on Bioinformatics in Journal of...