Genetic Threading
By J.Yadgari and A.Amir
Published: special issue on Bioinformatics in Journal of Constraints, June 2001
Alexandre Tchourbanov
University of Nebraska at Lincoln
CSCE 421-821
December 4, 2001
Structure of the presentation
• Introduction to protein native structure
• Methods of finding a native structure
– Physical
– Computational
   Common methods and principles
   Protein threading method
• Protein threading using genetic approach
Problem of protein structure prediction
• Proteins are key molecules in all life processes
• The function of a protein is directly related to its three-dimensional structure
• Knowing and understanding the structure of proteins will have a tremendous impact on understanding of biological processes, medical discoveries, and biotechnological inventions
Problem of protein structure prediction
• Given a sequence of amino acids, predict the unique 3D fold of the molecule that minimizes its free energy
[Figure: a primary structure (Lys, Gly, Leu) is turned into 3D structural knowledge by physical or computational methods of prediction, followed by practical use of that knowledge]
Protein structure
• A protein is built up from a chain of amino acids linked by peptide bonds
• There are 20 amino acids that can be divided into several classes based on size and other chemical and physical properties
• Depending on the type of its residue, an amino acid can be either hydrophilic (water loving) or hydrophobic (water hating)
General structure of an amino acid
• Each amino acid consists of:
1. A common main-chain part, containing the heavy atoms N, Cα, C, O, which form the amide plane
2. A chain residue (side chain) of 0 to 10 additional atoms
[Figure: an amino acid with the common part and the chain residue labelled]
Peptide bond
• A peptide bond connects the carboxyl group of the first amino acid with the amino group of the second
• Peptide bonds are planar and rigid
Sequence of amino acids
• A sequence of amino acids, connected by peptide bonds, forms a protein
• There is no flexibility for rotation around the peptide bond
• There is more flexibility for the protein to rotate around the N-Cα bond (called the φ-angle) and around the Cα-C bond (the ψ-angle)
• These angles are restricted to small regions in natural proteins
Protein folding
• Using the freedom of rotations, the protein can fold into a specific and unique three-dimensional structure (called a conformation), forming a native structure
Physical methods of determining protein native structure
• X-ray crystallography requires significant amounts of purified protein molecules (10^14) to grow a crystal, and the protein needs to crystallize
• The NMR method is applicable to proteins of small and average size that do not crystallize
• Both methods are expensive; applied to the same protein they give consistent results, which supports their correctness
• Structure of many important proteins is still unknown
Physical methods
• X-ray crystallography
• NMR (Nuclear Magnetic Resonance)
Protein structure in X-ray crystallography
• X-ray diffraction pattern is recorded and processed using FFT to form electron density map
• Regions of map with the highest electron density reveal the location of atomic nuclei
Family of structures in NMR method
• Absorption of radio frequency energy is recorded as a 2D spectrum
• Possible 3D structures are constructed by computer according to NMR signal
Computational methods to find a protein structure
• The unique 3D arrangement of protein corresponds to lowest free energy conformation
• Most computational approaches for solving the protein folding problem look for the lowest free energy conformation
• Two principal methods are currently in use for computing the lowest energy conformation:
1. Molecular dynamics
2. Monte Carlo
Molecular dynamics
• Forces acting on each atom at a particular state of the system are calculated using an empirical force field
• Atoms are allowed to move with the accelerations resulting from these forces, which changes the conformation
• Once the atoms have moved significantly, the acting forces are recalculated (every 10^-15 sec)
• Even supercomputers can simulate only about 10^-9 sec of folding time, which is insufficient
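The update loop described on this slide can be sketched as follows. This is only an illustration under stated assumptions: a single particle on a harmonic spring stands in for the empirical force field, the integrator is velocity Verlet (one common choice, not named on the slide), and all function names and constants are hypothetical.

```python
def harmonic_force(x, k=1.0):
    """Force from an assumed toy potential E(x) = 0.5 * k * x**2."""
    return -k * x


def velocity_verlet(x, v, dt, steps, mass=1.0, force=harmonic_force):
    """Advance position x and velocity v by `steps` time steps of size dt."""
    f = force(x)
    for _ in range(steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # move the atom
        f = force(x)                       # recalculate the force at the new position
        v = v_half + 0.5 * dt * f / mass   # complete the velocity update
    return x, v


def total_energy(x, v, mass=1.0, k=1.0):
    """Kinetic plus potential energy of the toy system."""
    return 0.5 * mass * v * v + 0.5 * k * x * x


x, v = velocity_verlet(x=1.0, v=0.0, dt=0.01, steps=1000)
print(total_energy(x, v))  # stays close to the initial energy of 0.5

# Force recalculations needed to cover 1e-9 s with 1e-15 s steps.
print(round(1e-9 / 1e-15))
```

The last line makes the cost argument concrete: with the step sizes quoted on the slide, a nanosecond of folding already takes about a million force recalculations over all atoms.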
Monte Carlo method
• Used with simplified model of protein (does not consider structure of every amino acid)
• Procedure makes random move from current conformation and evaluates resulting energy changes
• If the new conformation is better, it replaces the old one, and the process repeats
• Method is not powerful enough to find an optimal conformation even for simple cases
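A minimal sketch of the procedure as the slide states it, where a random move is kept only if it lowers the energy. The conformation is simplified to a list of angles and `toy_energy` is an assumed stand-in for a real energy function; both are hypothetical and chosen only to keep the example runnable.

```python
import random


def toy_energy(conformation):
    """Assumed toy energy: lowest (zero) when every angle equals 60 degrees."""
    return sum((angle - 60.0) ** 2 for angle in conformation)


def monte_carlo(conformation, energy, steps, max_step=10.0, seed=0):
    """Make random moves; keep a move only if it lowers the energy."""
    rng = random.Random(seed)
    current = list(conformation)
    current_energy = energy(current)
    for _ in range(steps):
        candidate = list(current)
        i = rng.randrange(len(candidate))                 # pick one angle at random
        candidate[i] += rng.uniform(-max_step, max_step)  # perturb it
        candidate_energy = energy(candidate)
        if candidate_energy < current_energy:             # keep only improvements
            current, current_energy = candidate, candidate_energy
    return current, current_energy


start = [0.0, 120.0, -45.0, 90.0]
final, final_energy = monte_carlo(start, toy_energy, steps=2000)
print(toy_energy(start), final_energy)
```

Because only improvements are accepted, this variant can get stuck in a local minimum on a rugged energy surface. The widely used Metropolis variant also accepts some uphill moves with a temperature-dependent probability; that refinement is not shown here.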
Protein threading
• Many proteins in nature are homologous: they have different primary structures but form the same conformation, carry out the same function in living matter, and share the same evolutionary origin
• Most proteins share the secondary structure motifs:
1. Helices
2. Extended strands forming sheets
3. Specific turns
4. Random coils
Protein threading
• Threading means mapping a given sequence to a given structure
• To assign a structure to a sequence one would then need to thread the sequence through all known conformations, evaluating compatibility, and assign the most compatible structure to the sequence
• When a structure completely different from any known one is discovered, it is entered into the database of structures
Protein threading
• Structure is presented by the black trace
• Sequence (at the top) is threaded through the structure, encoding an alignment (at the bottom)
• Zero means structure deletion, values greater than one mean sequence deletion, while one is a fit
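The encoding can be illustrated with a short decoder. This is a sketch under assumptions: each value is read as the number of sequence residues consumed at one structure position, the function name and the example sequence are hypothetical, and how the residues inside a group larger than one are placed (which one is aligned and which are skipped) is left open because the slide does not say.

```python
def decode_threading(encoding, sequence):
    """Return, for each structure position, the residues assigned to it.

    encoding[i] is the number of sequence residues consumed at structure
    position i: 0 = structure deletion, 1 = fit, >1 = sequence deletion.
    """
    if sum(encoding) != len(sequence):
        raise ValueError("encoding must consume the whole sequence")
    assigned, cursor = [], 0
    for count in encoding:
        assigned.append(sequence[cursor:cursor + count])
        cursor += count
    return assigned


# Hypothetical 5-residue sequence threaded onto a 4-position structure.
print(decode_threading([1, 0, 3, 1], "KGLAV"))  # ['K', '', 'GLA', 'V']
```

The empty string marks a structure position that received no residue, and the three-letter group marks a position where extra sequence residues had to be absorbed.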
Protein threading
• The size of the search space to thread sequence of length k into structure of size n could be found as a selection with repetition
• Search space is huge and problem appears to be NP-complete [Unger,R., Moult,J. (1993)]
$$\binom{n+k-1}{k} = \frac{(n+k-1)!}{(n-1)!\,k!}$$
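The count of selections with repetition can be checked numerically. The sketch below assumes that a threading is a string of n non-negative integers summing to k, as in the encoding used later in the slides; the function names are hypothetical. It compares the closed form with a brute-force enumeration for a small case.

```python
from itertools import product
from math import comb


def threading_count(n, k):
    """Number of ways to thread k residues onto n structure positions."""
    return comb(n + k - 1, k)


def brute_force_count(n, k):
    """Count length-n strings of non-negative integers that sum to k."""
    return sum(1 for s in product(range(k + 1), repeat=n) if sum(s) == k)


print(threading_count(4, 3), brute_force_count(4, 3))  # 20 20
print(threading_count(100, 100))
```

The second print shows how quickly the count grows: already for a sequence and a structure of 100 positions each, exhaustive enumeration is out of reach.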
Protein threading
• In order to reduce the complexity of the search task, (m − 1) core and m non-core regions are introduced
• Usually α-helices and β-sheets are core regions, connected by loops
• The total number of amino acids in core regions is c
[Figure: a structure drawn as m loops (non-core) alternating with m − 1 core regions]
$$\binom{m+k-c-1}{k-c} = \frac{(m+k-c-1)!}{(m-1)!\,(k-c)!}$$
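To see how much the core/non-core split shrinks the search space, the two counts can be compared directly. The sketch assumes the reduced count is the binomial coefficient C(m+k-c-1, k-c), that is, the k − c non-core residues distributed over the m loops; the example sizes and function names are hypothetical.

```python
from math import comb


def full_search_space(n, k):
    """Threadings of k residues onto n structure positions."""
    return comb(n + k - 1, k)


def core_search_space(m, k, c):
    """Ways to distribute the k - c non-core residues over m loops."""
    return comb(m + k - c - 1, k - c)


# Hypothetical sizes: 100 residues, 100 structure positions,
# 6 loops around 5 core regions that hold 70 residues in total.
print(full_search_space(100, 100))
print(core_search_space(m=6, k=100, c=70))
```

With these assumed sizes the reduced space has a few hundred thousand candidates, compared with an astronomically large unrestricted space.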
Protein threading
• Although it suffers from some inherent limitations (such as predicting the right structure with a completely wrong threading), the method has become a significant tool in protein structure prediction
• Any threading procedure must contain two major components:
1. An alignment algorithm to position a sequence on a structure
2. A score function to evaluate the “energy” of the sequence in the given conformation
Protein threading possible implementations
• Protein threading could be implemented using:
1. Enumeration for small problems
2. Dynamic programming to find core regions to “freeze”
3. Monte Carlo variants with Gibbs sampling
4. Branch and bound search
• Genetic programming with constraints appears to be a viable alternative to these methods
Protein threading using genetic programming
• Genetic Algorithms are parallel computational tools that are based on the principle of diversity and selection
• Solutions are represented as strings, for example 11111100111311
• The sum of all terms in the string must equal the number of amino acids in the sequence, and the length of the string must equal the length of the structure
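The two constraints translate into a short validity check. This is a sketch: the function names are hypothetical, and parsing one digit per structure position is an assumption that only works while every value is below 10.

```python
def parse_encoding(text):
    """Read one digit per structure position (assumes all values are < 10)."""
    return [int(ch) for ch in text]


def is_valid_encoding(encoding, sequence_length, structure_length):
    """Check the length and sum constraints stated on the slide."""
    return (len(encoding) == structure_length
            and all(value >= 0 for value in encoding)
            and sum(encoding) == sequence_length)


example = parse_encoding("11111100111311")
print(len(example), sum(example))          # 14 14
print(is_valid_encoding(example, 14, 14))  # True
```

The example string from the slide therefore threads a 14-residue sequence onto a 14-position structure.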
Protein threading using genetic programming
• These strings are maintained as a population that undergoes an evolutionary process via genetic operators such as:
– Replication (copying the string to the next generation)
– Mutation (changing values in the string)
– Crossover (concatenating a prefix of one string with a suffix of another)
• The energy function is a good candidate for evaluating the fitness of an offspring
Energy function
• Energy functions are subject to minimization
• Energy functions are calculated by extracting from the structural database frequencies of interactions between pairs of residues as a function of amino acids types and distance
• Tendency of certain hydrophilic residues to be on the surface can be approximated by energy term related to the position
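A pair-based energy of this kind could be sketched as below. The slides do not give the formula, so the conversion from frequencies to energies uses the inverse Boltzmann relation E = −ln(f_observed / f_reference) as an assumption; the counts, distance bins, and function names are all hypothetical, and the position-dependent surface term is left out.

```python
from math import log


def pair_energy_table(observed, reference):
    """Convert pair counts per (type_a, type_b, distance_bin) to energies.

    Assumes the inverse Boltzmann relation; frequent pairs get low energy.
    """
    total_obs = sum(observed.values())
    total_ref = sum(reference.values())
    return {
        key: -log((observed[key] / total_obs) / (reference[key] / total_ref))
        for key in observed
    }


def threading_energy(residues, distance_bins, table):
    """Sum pair energies over the residue pairs placed on the structure.

    residues[i] is the amino acid type at structure position i;
    distance_bins[(i, j)] is the distance bin between positions i and j.
    """
    energy = 0.0
    for (i, j), dbin in distance_bins.items():
        a, b = sorted((residues[i], residues[j]))  # unordered pair key
        energy += table.get((a, b, dbin), 0.0)
    return energy


# Made-up counts for two residue types in a single distance bin.
observed = {("L", "L", 0): 30, ("K", "L", 0): 10, ("K", "K", 0): 10}
reference = {("L", "L", 0): 20, ("K", "L", 0): 20, ("K", "K", 0): 10}
table = pair_energy_table(observed, reference)
bins = {(0, 1): 0, (0, 2): 0, (1, 2): 0}
print(threading_energy(["L", "L", "K"], bins, table))
```

In this toy table the L-L contact is over-represented and receives a negative (favourable) energy, while the K-L contact is under-represented and is penalised.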
Implementing mutation
• An example of a mutation is the transformation of 11111100111311 into 11111100211211, which is also a valid encoding
• Every mutation needs a validity check, and any violation has to be compensated for
• Reversing a substring is an especially interesting mutation, since it changes neither the length nor the sum of the string and so keeps the encoding valid
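Both kinds of mutation can be sketched so that validity is preserved by construction. This is one possible reading, not necessarily the authors' operator: `mutate` compensates every decrement with an increment elsewhere (the slide's example moves one unit between two positions in the same way), and `reverse_mutation` reverses a random substring. All names and the `magnitude` parameter are hypothetical.

```python
import random


def mutate(encoding, rng, magnitude=1):
    """Move `magnitude` residues from one position to another.

    Decrementing one value and incrementing another keeps both the
    length and the sum of the string unchanged.
    """
    child = list(encoding)
    donors = [i for i, v in enumerate(child) if v >= magnitude]
    if len(child) < 2 or not donors:
        return child                      # nothing can be moved
    src = rng.choice(donors)
    dst = rng.choice([i for i in range(len(child)) if i != src])
    child[src] -= magnitude
    child[dst] += magnitude
    return child


def reverse_mutation(encoding, rng):
    """Reverse a randomly chosen substring of the encoding."""
    child = list(encoding)
    i, j = sorted(rng.sample(range(len(child) + 1), 2))
    child[i:j] = reversed(child[i:j])
    return child


rng = random.Random(1)
parent = [int(ch) for ch in "11111100111311"]
print(mutate(parent, rng))
print(reverse_mutation(parent, rng))
```

Neither operator needs a separate repair step, because the sum and the length never change.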
Implementing crossovers
Parent 1:    1 1 2 0 1 1 2 0 | 1 1 1 1 1 1
Parent 2:    1 1 1 1 1 1 0 0 | 1 1 1 3 1 1
Offspring 1: 1 1 2 0 1 1 2 0 | 1 1 1 3 1 1
Offspring 2: 1 1 1 1 1 1 0 0 | 1 1 1 1 1 1
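The crossover on this slide is a one-point crossover after the eighth position, and it can be reproduced in a few lines. Note that the offspring sums come out as 16 and 12 instead of 14, so they need some form of compensation. The `repair` function below is a hypothetical way to do that and is not taken from the slides.

```python
def one_point_crossover(parent1, parent2, point):
    """Swap the suffixes of two parents after position `point`."""
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2


def repair(encoding, target_sum):
    """Hypothetical repair for a non-empty encoding.

    Removes excess from the right end, or adds any deficit to the
    last position, until the sum equals target_sum.
    """
    child = list(encoding)
    excess = sum(child) - target_sum
    i = len(child) - 1
    while excess > 0:
        take = min(child[i], excess)
        child[i] -= take
        excess -= take
        i -= 1
    if excess < 0:
        child[-1] += -excess
    return child


parent1 = [1, 1, 2, 0, 1, 1, 2, 0, 1, 1, 1, 1, 1, 1]
parent2 = [1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 3, 1, 1]
child1, child2 = one_point_crossover(parent1, parent2, point=8)
print(child1, sum(child1))  # sum is 16
print(child2, sum(child2))  # sum is 12
print(repair(child1, 14), repair(child2, 14))
```

Other repair strategies are possible, for example spreading the correction over random positions; which one the authors used is not stated here.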
The following issues were addressed
• The linear trade-off between population size and the number of generations
• Optimal level of mutation rate
• Locality of mutation operator
• Locality of the crossover operator
• Regular mutations versus reverse mutations
• Magnitude of the mutation operation
• Quality control of the crossover operation
Results
• For the authors’ examples, the optimal performance is achieved with a population size of 300 solutions and a duration of 1000 generations
• The optimal rate of mutations is 0.25 to 0.3 of the population
Structural comparisons
[Figure panels: structural alignment; most similar threading alignment; least similar threading alignment]
Difference between sequence deletions and structure deletions plots
Maximal mutation magnitude
[Figure: average score of 5 runs after 600 generations; average score of 5 runs after 2000 generations]
Summary
• The running time of a GA depends linearly on the number of solutions in the population (i.e. population size) and also depends linearly on the number of generations the process is repeated
• The genetic algorithm method is a feasible and efficient approach to threading
• It is especially encouraging that the threading alignments are quite similar, quantitatively, to the structural alignments
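The linear running time can be read off a minimal GA loop, sketched here under several assumptions: `toy_energy` is a stand-in score, selection keeps the fitter half of the population, crossover is left out for brevity, and the mutation rate is applied per child. None of these choices is taken from the paper, and all names are hypothetical.

```python
import random


def toy_energy(encoding):
    """Assumed stand-in score: penalise every deviation from a perfect fit."""
    return sum((v - 1) ** 2 for v in encoding)


def random_encoding(n, k, rng):
    """Drop k residues onto n positions at random (length n, sum k)."""
    enc = [0] * n
    for _ in range(k):
        enc[rng.randrange(n)] += 1
    return enc


def mutate(enc, rng):
    """Move one residue between positions; assumes sum(enc) >= 1."""
    child = list(enc)
    src = rng.choice([i for i, v in enumerate(child) if v > 0])
    dst = rng.randrange(len(child))
    child[src] -= 1
    child[dst] += 1
    return child


def run_ga(n, k, population_size, generations, mutation_rate=0.3, seed=0):
    """Return the best encoding found and the number of energy evaluations."""
    rng = random.Random(seed)
    population = [random_encoding(n, k, rng) for _ in range(population_size)]
    evaluations = 0
    for _ in range(generations):
        scored = sorted(population, key=toy_energy)   # one evaluation per solution
        evaluations += population_size
        survivors = scored[:max(1, population_size // 2)]  # replication
        children = []
        while len(survivors) + len(children) < population_size:
            parent = rng.choice(survivors)
            if rng.random() < mutation_rate:
                children.append(mutate(parent, rng))
            else:
                children.append(list(parent))
        population = survivors + children
    return min(population, key=toy_energy), evaluations


best, evaluations = run_ga(n=14, k=14, population_size=20, generations=50)
print(best, toy_energy(best), evaluations)
```

The loop performs one energy evaluation per solution per generation, so the evaluation count is the product of population size and number of generations, which is the linear dependence stated above.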
Summary
• Changing the locality of the mutation and crossover operation does not show a consistent change in the performance of the algorithm
• Mutations of high magnitude are counterproductive, probably because changes between the template and the assigned structure do not tend to concentrate in a single position
• Using crossover under strict quality control was shown not to be effective, since the genetic mechanism provides quality control itself