Bioinformatics : A Primer · 2019-12-02 · 1 Bioinformatics: Introduction Structural and...


Bioinformatics A PRIMER

Copyright © 2005, New Age International (P) Ltd., Publishers
Published by New Age International (P) Ltd., Publishers

All rights reserved.

No part of this ebook may be reproduced in any form, by photostat, microfilm, xerography, or any other means, or incorporated into any information retrieval system, electronic or mechanical, without the written permission of the publisher. All inquiries should be emailed to [email protected]

PUBLISHING FOR ONE WORLD

NEW AGE INTERNATIONAL (P) LIMITED, PUBLISHERS
4835/24, Ansari Road, Daryaganj, New Delhi - 110002
Visit us at www.newagepublishers.com

ISBN (13) : 978-81-224-2646-5

Dedicated to my parents, in their memory, and to my wife and son, for their constant encouragement, support and patience.

Preface

Associations of elements, molecules, their complexes and aggregates have physicochemical information content that is of chemical and biological importance. There is therefore a tendency to address a wide variety of physicochemical interactions under “Bioinformatics”. Such an approach would unfortunately dilute the focus of the main objectives of bioinformatics.

“Bioinformatics” is the study (by experimental and computational methods) of biological information, from its storage sites (DNA/RNA) in the genome to the various gene products in the cell. In the case of life processes, to the realm of which bioinformatics logically belongs, the fundamental building blocks of living systems are nucleic acids, proteins, carbohydrates, lipids and their complexes. The central aim of bioinformatics, therefore, is to elucidate (by experimental methods) and understand the structural (primary, secondary, tertiary etc.) features of these biological entities, and to correlate these structural features with the physicochemical interactions, functions and pathways among these molecular entities in the cell (experimentally as well as computationally).

Bioinformatics is a multi-disciplinary subject, with immense scope in molecular biology, biotechnology, and the pharmaceutical and medical fields, e.g. genome and protein sequencing, structure prediction and molecular modeling, drug design and the development of novel molecules and drugs (molecular engineering), and medical diagnostics and therapeutics. With advances in the experimental methodologies of molecular structure determination (X-ray diffraction; NMR spectroscopy), biology has become data-rich, with a considerable amount of experimental data being made available on complex biomolecular structures. Tremendous progress has also been taking place in computation, in terms of data storage capacity, processing and visual display. These developments have made it possible to address a complex array of biological systems and interactions in a systematic and quantitative way.

The study of bioinformatics, as understood and pursued by computational and biomedical scientists, is generally treated as computational biology, which relies primarily on computational (theoretical) methods as applied to the biological and medical sciences, in areas such as genomics, proteomics and drug design. In this kind of approach there is a tendency to underplay the crucial importance of experimental methods of data acquisition and structure-function interpretation. But the availability of structural data from experimental methods is the core of rational molecular drug design (molecular engineering) and validation. Lack of such data, or a poor understanding of them, would lead to spurious and inconsistent molecular models and would thus undermine the very objective of bioinformatics.

The major objectives of bioinformatics remain the same:

(i) Development of computer databases and algorithms to analyze biological data.

(ii) Processing and interpretation of these complex physicochemical databases on the molecular interactions, structures and functions of biomolecules.

(iii) Computational methods for structure prediction and molecular modeling (molecular engineering/drug design), based on available experimental data, including cases where the existing experimental techniques are too time-consuming, or are unable to provide structural information because of inherent operational constraints.

Until recently, progress in bioinformatics (quantitative biology) was initiated and nurtured by physical scientists (crystallographers; NMR specialists) and biologists. However, this situation is changing, with bioinformatics relying more and more upon computation-oriented problem-solving protocols. Biologists may know biological systems and their functions, but when they take up computational bioinformatics the emphasis should be on the structural and functional aspects of biological molecules vis-à-vis their physical and chemical characteristics. On the other hand, computational personnel may be versatile in the operational aspects of computer programs and algorithms, but the real handicap arises if they lack basic knowledge about the structures and functions of the biological systems whose complexity they are supposed to unravel. This situation is akin to that of a driver who is adept at driving automobiles but lacks a basic knowledge of automobile engineering. Therefore, it is imperative that both groups have an operational knowledge of the essentials of molecular biophysics, molecular biochemistry and structural biology when undertaking the task of molecular modeling and design.

With these broad objectives in mind, the contents of this book, “Bioinformatics: A Primer”, are organized under molecular biophysics, experimental methods of structure elucidation, database search, data mining and analysis, computational methods of structure prediction, and rational molecular/drug design and validation, with easy interfaces between these areas and the various chapters. Ample tables and figures, culled from the Protein Data Bank (PDB) and other sources, are intended to give the reader insight into structure-function features at the molecular level. The exercise modules and bibliography for each chapter, and the glossary, are aimed at giving the reader a wider perception of the subject matter and of its scientific and technical terms. An index is also provided for easy access to the words and topics of the subject matter.

P. Narayanan

Acknowledgement

Thanks are due to Ms. Swarna Murthy, Bhabha Atomic Research Centre, Mumbai, for rendering bibliographic help.

P. Narayanan

Contents

Preface

Acknowledgement

1. Bioinformatics: Introduction
1.1 “Bioinformatics”
1.2 The Objectives of Bioinformatics

Section I: Biomolecular Structure (Molecular Biophysics)

2. Atoms and Molecules
2.1 Atomic Structure
2.2 Molecules

3. Features of Nucleic Acids
3.1 Constituents of Nucleic Acids
3.2 Polynucleotides
3.3 The Genome Projects

4. Features of Proteins
4.1 Amino Acids
4.2 Proteins
4.3 Forces Stabilizing the Molecular Structure

Section II: Experimental Methods of Structure Elucidation (Bioinformatics-I)

5. Physicochemical Characterization of Biomolecules
5.1 Hydrodynamical Methods
5.2 Chromatographic Methods
5.3 Electrophoretic Methods
5.4 Blotting Techniques

6. Primary Structure (Sequence) Determination of Biomolecules
6.1 Primary Structure Determination of Nucleic Acids
6.2 Primary Structure Determination of Proteins

7. Spatial Structure Determination of Biomolecules
7.1 X-Ray Diffraction Methods
7.2 Nuclear Magnetic Resonance (NMR) Spectroscopy
7.3 Imaging Methods

8. Protein-Ligand Interactions
8.1 Protein-Nucleic Acid Interactions
8.2 Protein-Protein Interactions
8.3 Protein-Carbohydrate Interactions
8.4 Protein-Lipid Interactions

Section III: Towards Structure Prediction (Bioinformatics-II)

9. The Protein-folding Problem
9.1 Genomics Analysis
9.2 Proteomic Analysis

10. Computational Methods in Structure Prediction
10.1 Protein Folding Rules
10.2 Structure Prediction of Fibrous Proteins
10.3 Structure Prediction of Globular Proteins
10.4 Application of Structure Prediction Programs

Section IV: Database Search, Analysis and Modeling (Bioinformatics-III)

11. Database Search
11.1 Primary Structure (Sequence)
11.2 Databases
11.3 Genome Database Search
11.4 Protein Database Search

12. Data Mining, Analysis and Modeling
12.1 Sequence Alignment Analysis
12.2 Pair-wise Sequence Alignment
12.3 Multiple Sequence Alignment (MSA)
12.4 Phylogenetic Analysis
12.5 Secondary Structure Analysis
12.6 Motifs, Domains and Profiles
12.7 Pattern Recognition
12.8 Protein Classification and Modeling

13. Medico- and Pharmacoinformatics
13.1 Disease Gene Identification
13.2 Genetic Variations and Genetic Diseases
13.3 Genetic Testing and Therapy

14. Molecular Engineering
14.1 Genomics and Proteomics Analyses
14.2 Rational Design
14.3 Validation

Glossary

Index

1. Bioinformatics: Introduction

Structural and physicochemical characteristics of atoms, molecules and their complexes in the cells of living organisms form the essence of information content and its use. Genetic materials, that is, genes and gene products (e.g. proteins), are the basis of the life processes. Understanding the intricate processes of information storage, retrieval and transmission in genes and gene products in the cell is the first logical step towards understanding the complex life processes. A better understanding of these biochemical processes would help in understanding the structure-function relationships and biochemical pathways in the life processes. An understanding of the behavior of biological systems at each level of their organization can only be achieved by careful study of the complex dynamical interactions between the components of these systems. For this understanding to be quantitative, it is necessary to develop structural, biophysical and biochemical mathematical models. Once developed, these models can be simulated, analyzed, and visualized through the application of modern engineering and computational approaches.

The rapid acquisition of high-throughput biological data (e.g. from genome projects) has ushered in computer-intensive data analysis (in silico analysis). Computational methods are used to obtain meaningful data from gene expression microarrays (cDNA microarray-based RNA quantitation), proteomics, mass spectrometry (MS), two-dimensional gel electrophoresis (2-DE), protein-ligand interaction studies, and other experiments, in order to establish biological pathways. It is hoped that this would ultimately help, in addition to showing how various factors are interconnected, in improving genes and gene products, designing better and new molecular species, identifying disease-susceptibility genes, defining and diagnosing disease on a molecular basis, and identifying targets for therapeutic intervention and for the design of new drugs.

1.1 “BIOINFORMATICS”

Progress in structural biology has been closely associated with the emergence of a new area in quantitative biology, currently known as “bioinformatics”. Elucidation of the three-dimensional structures of biomolecular model complexes (those of nucleotide and peptide complexes), primarily by X-ray crystallography, made it possible to address the dynamics of protein-nucleic acid interactions and the conformational studies of biomolecules (e.g. Ramachandran analysis). Technical advances in structural biology, together with the development of fast computers with vast storage capacity for processing voluminous data, ushered in the era of macromolecular crystallography, which has immensely enriched structural and quantitative biology and was also the harbinger of bioinformatics. In addition, laboratory automation and the integration of improved technologies in the biological and allied sciences led to the rapid accumulation of vast amounts of genome sequence information, functional expression analysis and other types of experimental data.

The considerable “algorithmic complexity” of biological systems requires a vast amount of detailed information at the cellular and molecular levels for their complete description. Thus, the need for systematic organization and analysis, towards an integrated view of biology, necessitated the introduction of computers, resulting in the evolution of a new area of quantitative biology, namely “bioinformatics”, dedicated to the computational “mining” or in silico analysis of these experimental data.

Until recently, experimental biologists focused on accumulating more data from elaborate experimental approaches, while quantitative biologists concentrated on developing algorithms for the interpretation of these data, with minimal interaction between the two groups. The comprehensive introduction of computers into the natural sciences has completely changed the “mindset” of both experimental and quantitative biologists, and has universalized the outlook on the approach and methodology of scientific research. This change in attitude has also, in part, ushered in greater acceptance of scientists from diverse fields: physicists, chemists, molecular biologists, biomedical and pharmacological personnel, and computer scientists.

“Bioinformatics” is thus a generic term that encompasses the application of computational tools and approaches to the study of information content, organization, and processing in biological systems (genes and gene products) and their utilization at the molecular level, e.g. the study of diagnostic, therapeutic and prognostic features (biomedinformatics), and of structural and chemical features in molecular and drug design (cheminformatics). It is a symbiotic relationship between the computational and biological sciences, with emphasis on the computational aspects. In general, bioinformatics is understood and pursued as the study of information content and information flow in biological systems and processes. It is a bridge between observation (experimental data) in diverse biology-related disciplines and the extrapolation of information, by computational analysis, about how the systems and processes function. It also envisages the subsequent application of the knowledge and insights thus gained to the rational design and synthesis of new molecules (e.g. drugs, insulin and other biological compounds) tailor-made to desired specifications or conditions. Thus, the aims of bioinformatics are broad and diverse, and require trans-disciplinary collaboration between molecular biophysicists, drug chemists, molecular biologists and computer-aided modeling experts.

1.1.1 Information Content and Transmission

Double-helical DNA is, in general, the molecule in which the genetic information (genetic code) is stored. The information is stored chemically by sequence-specific base-pairing (A=T and G≡C). This base-pair complementarity implies that the strands of nucleic acids act as templates for replication (duplication) of the genetic code. The genetic code is transcribed into a messenger RNA (mRNA) strand, which is complementary to its DNA template. The mRNA is the information-carrying link between the DNA and the synthesis (translation) of proteins in ribosomes. The tRNA molecules carry their respective amino acids to the ribosomal sites and interact with the mRNA (codon-anticodon interactions) in synthesizing proteins.
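The base-pairing and transcription rules described above can be illustrated with a short Python sketch. It is only an illustration under simplifying assumptions: the function names and the example template are hypothetical, and strand polarity (5′/3′ direction) is ignored, so both strands are simply written left to right.

```python
# Watson-Crick pairing rules for DNA (A pairs with T, G pairs with C).
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement_strand(template: str) -> str:
    """Return the complementary DNA strand, base by base (polarity ignored)."""
    return "".join(DNA_COMPLEMENT[base] for base in template.upper())


def transcribe(template: str) -> str:
    """Return the mRNA complementary to a DNA template strand (U replaces T)."""
    return complement_strand(template).replace("T", "U")


template = "TACCTCTGA"  # hypothetical template strand, for illustration only
print(complement_strand(template))  # ATGGAGACT
print(transcribe(template))         # AUGGAGACU
```

Taking the complement twice returns the original strand, which is the sense in which each strand can serve as a template for the other during replication.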

The genetic code is the relationship between the sequence of bases in DNA (and its mRNA template) and the sequence of amino acids in proteins. Each amino acid is coded by a set of three contiguous bases, called a codon. The genetic code is degenerate, because more than one codon can code for the same amino acid (Fig. 1.1).

A = Adenine; G = Guanine; C = Cytosine; U = Uracil; ## = Stop signal
(Source: Narayanan, P. (1998) J Uni Mumbai, 55(82); 41).

Fig. 1.1 Dictionary of the Genetic Code
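The degeneracy of the code can be checked by counting codons per amino acid. The sketch below builds the standard genetic code from a compact 64-letter string (one-letter amino acid codes, `*` for a stop signal, bases taken in the order U, C, A, G); this encoding and the variable names are choices made for the example and are not the layout of Fig. 1.1.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# One-letter amino acid codes for the 64 codons UUU, UUC, UUA, UUG, UCU, ...
# ('*' marks a stop signal); this is the standard genetic code.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of codons that code for each amino acid (or for a stop signal).
degeneracy = Counter(CODON_TABLE.values())

most_degenerate = sorted(aa for aa, n in degeneracy.items() if n == 6)
single_codon = sorted(aa for aa, n in degeneracy.items() if n == 1)

print(most_degenerate)   # ['L', 'R', 'S']  (Leu, Arg, Ser: six codons each)
print(single_codon)      # ['M', 'W']       (Met, Trp: one codon each)
print(degeneracy["*"])   # 3 stop codons
```

With 64 codons available for 20 amino acids and the stop signals, most amino acids are necessarily specified by more than one codon.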

1.1.2 Structural Defects and Genetic Significance

Genes are the units of heredity that provide the blueprint for our physical body, determining not only how we may live, but also the quality of that life. The quality of life can be drastically altered by disease, and genetic disease is perhaps the purest illustration of the relationship between our genes and our health.

Since the genetic code in the mRNA template is “read” sequentially in triplets of nucleotides (codons) and translated into the synthesis of proteins, any defect in the nucleotides, or any shift in the reading frame (frame-shift), will result in the synthesis of an altered protein (the basis of mutation). While genetic mutations are fundamental to evolution, they are also the cause of genetic (hereditary) diseases. There is a direct link, at the molecular level, between improper folding of protein tertiary structure and genetic diseases. Some genetic disorders can be debilitating or lethal. In the case of a single-point mutation, that is, a single nucleotide polymorphism (SNP), predisposition to the disease is directly associated with the presence of a single gene allele. To quote a few examples:

Sickle cell anemia is due to a single-point mutation (single nucleotide polymorphism, SNP) affecting the 6th position of the β-chain of hemoglobin, with the substitution of a hydrophobic valine residue in place of glutamic acid, which is acidic. This change of a single amino acid alters the structure of hemoglobin in such a way that the deoxygenated protein (dHbS) polymerizes and precipitates within the erythrocyte, leading to the characteristic sickle shape. Similarly, a single-point mutation with substitution of a critical threonine by lysine in antithrombin results in thrombosis. In the cystic fibrosis transmembrane-regulator (CFTR) protein (1480 residues and 170 kilodaltons), a three-nucleotide deletion that removes the crucial amino acid phenylalanine at position 508 leads to misfolding of the protein and to defective trafficking. A single-point mutation in rhodopsin (Pro → His at position 23) leads to retinal degeneration and blindness. The characteristic repeating unit of collagen monomers is (Gly-X-Y)n, and the glycine residue is crucial in the formation of collagen helices. Therefore, any mutation that results in the replacement of glycine by any other amino acid would lead to a host of pathological disorders.
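The difference between a single-point mutation and a frame-shift can be made concrete with a small translation sketch. The mRNA fragment below is illustrative only: it is chosen to encode Val-His-Leu-Thr-Pro-Glu-Glu, so that its sixth codon is GAG (Glu), and the GAG to GUG change mirrors the Glu to Val substitution of sickle cell hemoglobin. The helper `translate` and the variable names are assumptions of this example.

```python
from itertools import product

# Standard genetic code, one-letter amino acid codes, '*' = stop signal.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Read codons from the first base; stop at a stop codon or at the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


normal = "GUGCAUCUGACUCCUGAGGAG"          # illustrative fragment, 7 codons
point = normal[:16] + "U" + normal[17:]   # GAG -> GUG in the 6th codon
shifted = normal[:3] + normal[4:]         # one base deleted: frame-shift

print(translate(normal))   # VHLTPEE
print(translate(point))    # VHLTPVE  (one residue changed: Glu -> Val)
print(translate(shifted))  # VI       (every later codon altered; early stop)
```

The point mutation changes exactly one residue, whereas the single-base deletion changes every downstream codon and, in this fragment, happens to create a premature stop signal.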

In many neurodegenerative diseases, such as Alzheimer’s, Parkinson’s and Huntington’s, the aggregation of soluble proteins into fibers is related to tri-nucleotide repeat expansion (TNRE), that is, to a critical number of a particular amino acid residue (e.g. glutamine residues). In prion diseases, an induced conformational change is propagated from an abnormal conformer to its normal counterpart.

For common diseases such as cancer, the situation is less clear, and depends on both genetic and environmental factors. A number of proteins can contribute to cellular transformation and carcinogenesis when their normal structure is altered by mutations in their genes. These genes are termed proto-oncogenes. For some of these proteins (e.g. protein p21 of the c-ras gene), a single-point mutation at position 12 (or 61) replacing glycine makes them oncogenic. It is hoped that bioinformatics operations on the Human Genome Project (HGP) and other genomic data would shed light on many common diseases, and subsequently on their control. One such example is the development of inhibitors of HIV-1 protease, in which molecular modeling was used in the optimization of drug candidates.

1.2 THE OBJECTIVES OF BIOINFORMATICS

As stated earlier, bioinformatics deals with the (computational) methods of storing, retrieving, and analyzing biological information (genetic code) as it passes from its storage sites (DNA/RNA) in the genome to the sites of synthesis of the various gene products (e.g. synthesis of proteins at the ribosomal sites), and with structure-function relationships and their effects in cells and organisms.

The availability of the three-dimensional (tertiary) structure of a biomolecule (say, a protein) is vital for understanding its structure-function interactions at the molecular level, and for undertaking any molecular design task. While primary structure (genomic or proteomic sequence) data are obtainable experimentally in a faster and more ‘routine’ way, the acquisition of secondary and tertiary structural data by experimental methods is still a time-consuming and tedious task. Therefore, theoretical “structure prediction” methods, which employ computational tools and algorithms to predict the secondary and tertiary structures of biomolecules from their primary structure (sequence) data, are an attractive alternative, and are the major objective of bioinformatics (molecular bioinformatics). The rationale of molecular bioinformatics is to conceptualize biology in terms of molecules and to apply “informatics” techniques to understand and organize the information associated with these biomolecules in cells and organisms.

Many novel genes are being uncovered through the systematic searching of available genomic sequence data, and their putative functions are being assigned through sequence identity algorithms.
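As a rough illustration of what a sequence identity measure computes, the sketch below scores two sequences that are assumed to be already aligned (equal length, with `-` marking gaps). The function name, the two peptide strings and the choice of denominator (positions where neither sequence has a gap) are assumptions of this example; production tools first compute the alignment itself and may define identity differently.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of identical residues over the ungapped aligned positions."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Keep only the columns in which neither sequence has a gap character.
    columns = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


# Hypothetical aligned peptide fragments, for illustration only.
print(percent_identity("MKTAYIAK", "MKTGYIAR"))  # 75.0
print(percent_identity("MK-AY", "MKTAY"))        # 100.0
```

A high identity to a gene of known function suggests, but does not prove, a similar function for the newly found gene, which is why such assignments are described as putative.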

A general approach in bioinformatics is–

(1) Application of computer software tools for creating computer databases (both genomicand proteomic databases).

(2) Development of algorithms to utilize and manage these databases in knowledge-based analysis.

(3) Utilization of databases and computational methods in “structure prediction” methods.

(4) Use of (primary, secondary and tertiary) databases and “structure prediction” algorithms in rational molecular design, to repair defective biomolecular species in the cell and/or to synthesize better molecular species/drugs, tailor-made to desired requirements (genetic engineering/molecular engineering).

Nothing in the field of quantitative biology is really new. The paradigm shift is towards a large-scale, automated, and integrated approach to molecular biology and the medical sciences. Two developments distinguish bioinformatics from the classical biological and allied sciences:

(1) Integration of advanced physical techniques (lasers, better sequencers, mass spectrometers etc.).

(2) The central role of computer-assisted operations in data acquisition and analysis. Data collection is automated and directly connected to laser-based detectors. At the same time, data storage, retrieval and exchange operations are almost completely computer-based (in silico biological analysis).

Automation in data acquisition and processing has resulted in a tremendous increase in throughput. Automated data acquisition enables scientists to spend more of their time on data analysis and interpretation.

Experimental research in molecular biology has in recent years yielded a wealth of information, and a large amount of gene expression data is being added at a fast rate. Without functional assignment, the true goal of any genome project, which is to understand how genomes are organized and expressed, along with other functional features of the genome, cannot be achieved. That is, “structure mining”, not just “sequence mining”, should be the objective of bioinformatics. Therefore, the structural aspects of molecules (X-ray and NMR structural data) will become more and more important in future genomic research.

However, three-dimensional structure information from experimental methods (X-ray crystallography and NMR spectroscopy) is lagging behind the gene sequence data. Therefore, structure prediction by theoretical methods is a viable option in bioinformatics. The main objective of bioinformatics will therefore be to combine experimental structural data (mainly from X-ray crystallography and NMR spectroscopy) with theoretical methods of structure prediction, in order to understand the structure-function relationships in biomolecular complexes and to utilize such ‘knowledge’ in the rational design of new molecular species, drugs and therapeutic agents.

In addition, a deeper understanding of complex biological systems will need a more quantitative type of biology that is closely integrated with the physical sciences (the aim of quantitative biology). With the availability of large-scale genome sequence data, modern biology has become more data-rich and is faced with organizing and analyzing the sequence and other

EXERCISE MODULES

1. What are the objectives of Bioinformatics?

2. What is the genetic code and how is it stored and transmitted?

3. Why is the genetic code degenerate?

4. Give some examples of amino acids coded by multiple codons.

5. Which are the amino acids coded by maximum number of codons?

6. Which are the amino acids coded by one codon?

7. What is the molecular basis of hereditary diseases?

8. What is mutation, and what are single-point mutations and frame-shift mutations?

9. What are the objectives of bioinformatics?

10. What is the importance of rational design of molecules?

BIBLIOGRAPHY

1. Baum, J. & Brodsky, B. (1999), Curr Opin Struct Biol., 9(1); 122. “Folding of peptide models of collagen and misfolding in disease”.

2. Bentley, D.R. (2000), Med Res Rev., 20; 189. “The Human Genome Project–an overview”.

3. Broder, S. & Venter, J.C. (2000), Curr Opin Biotechnol., 11; 581. “Whole genomes: the foundation of new biology and medicine”.

4. Carrell, R.W. & Gooptu, B. (1998), Curr Opin Struct Biol., 8(6); 799. “Conformational changes and disease–serpins, prions and Alzheimer’s”.

5. Davies, K.E. & Reid, A.P. (1988), IRL Press: New York. “Molecular basis of Inherited Diseases”.

6. Dickerson, R.E. & Geis, I. (1983), Benjamin-Cummings: Menlo Park/CA. “Hemoglobin: Structure, Function, Evolution and Pathology”.

7. Dobson, C.M. (1999), Trends Biochem Sci., 24; 329. “Protein misfolding, evolution and disease”.

8. Harrison, P.M., et al. (1997), Curr Opin Struct Biol., 7(1); 53. “The Prion folding problem”.

9. Lee, P.S. & Lee, K.H. (2000), Curr Opin Struct Biol., 11; 171. “Genomic analysis”.

10. Narayanan, P. (1998), J Uni Mumbai, 55(82); 41. “Influence of base-stacking interactions on the variable degeneracy of the genetic code”.

11. Narayanan, P. (2001), Bhalani Pubs: Mumbai. “Clinical Biophysics: Principles and Techniques”.

12. Narayanan, P. (2003), New Age Intl Pubs: New Delhi. “Essentials of Biophysics” (2nd Print).

13. Perutz, M.F. (1991), W.H. Freeman: New York. “Protein Structure and Function”.

14. Perutz, M.F. (1992), W.H. Freeman: New York. “Protein Structure: New Approaches to Disease and Therapy”.

15. Stryer, L. (1995), Freeman Press: New York. “Biochemistry”, 4th Edn.

16. Watson, J.D., et al. (1987), Benjamin-Cummings: Menlo Park/CA. “Molecular Biology of the Gene”, 4th Edn.

17. Weinberg, R.A., et al. (1985), Sci Amer., 253(4); 48. “The molecules of Life”.

18. Wilson, J.M. (1993), Nature, 365; 691. “Cystic fibrosis: vehicles for gene therapy”.

19. Wlodawer, A. & Vondrasek, J. (1998), Annu Rev Biophys Biomol Struct., 27; 249. “Inhibitors of HIV-1 protease: a major success of structure-assisted drug design”.

Section I

Biomolecular Structures (Molecular Biophysics)

2. Atoms and Molecules

Matter is composed of individual entities, called elements, which are the basic building blocks of all molecules and chemical compounds. Each element is distinguishable from the others by the physical and chemical properties of its basic component, the atom. Molecules are formed from clusters of atoms held together by chemical bonds. There is a variety of covalent chemical bonds (single, double, and triple), resulting in a variety of molecular species. A basic knowledge of these chemical bonds and their stereochemical features is necessary for understanding the essential features of molecular structures. In bioinformatics, molecules and their interactions (DNA-DNA, DNA-protein and protein-protein interactions) should be addressed at the atomic and molecular levels. Therefore, in order to evaluate sequence and structural information, one must have a basic understanding of the representation of molecules in terms of atoms and bonds, known as a “chemical graph” (atomic structure without coordinate information).
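A chemical graph can be held in a very small data structure: a list of atoms (nodes) and a list of bonds (edges), with no spatial coordinates. The sketch below is one possible representation, not a standard file format; the class name `ChemicalGraph` and its methods are hypothetical, and formaldehyde (H2C=O) is used only because it contains both single and double bonds.

```python
from dataclasses import dataclass, field


@dataclass
class ChemicalGraph:
    """Atoms as nodes, bonds as edges; no 3D coordinates are stored."""
    atoms: list[str]  # element symbol for each atom index
    bonds: list[tuple[int, int, int]] = field(default_factory=list)  # (i, j, order)

    def neighbors(self, i: int) -> list[int]:
        """Indices of the atoms directly bonded to atom i."""
        return sorted(b if a == i else a for a, b, _ in self.bonds if i in (a, b))

    def bond_order_sum(self, i: int) -> int:
        """Sum of the bond orders at atom i (its valence in this graph)."""
        return sum(order for a, b, order in self.bonds if i in (a, b))


# Formaldehyde, H2C=O: one C=O double bond and two C-H single bonds.
formaldehyde = ChemicalGraph(
    atoms=["C", "O", "H", "H"],
    bonds=[(0, 1, 2), (0, 2, 1), (0, 3, 1)],
)

print(formaldehyde.neighbors(0))       # [1, 2, 3]
print(formaldehyde.bond_order_sum(0))  # 4 (carbon is tetravalent)
print(formaldehyde.bond_order_sum(1))  # 2 (oxygen is divalent)
```

Because only connectivity and bond order are stored, two different conformations of the same molecule have the same chemical graph.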

2.1 ATOMIC STRUCTURE

Atomic structure consists of a central nucleus with positive charge, where practically all the mass (protons and neutrons) is concentrated, and negatively charged electrons distributed in different orbits around the nucleus. From the viewpoint of classical physics, the concept is analogous to planetary motion around a central massive star.

However, the concepts of classical physics cannot explain either the stability of the atom or the occurrence of discrete spectral lines. According to the laws of classical physics, an orbiting charged particle would radiate energy; electrons orbiting a nucleus would therefore be unstable, spiralling down and collapsing into the nucleus. Niels Bohr (1885-1962) resolved this difficulty, and also ushered in quantum physics, with his ad hoc proposition that electrons revolving in their orbits do not radiate energy, and that spectral lines are due to electron transitions between the orbits (Fig. 2.1).

According to the concepts of quantum physics, the state of an electron is described by a wave function (orbital). Orbitals represent regions in space in which a particle of particular energy is most likely to be found. Orbitals for electrons are characterized by four quantum numbers: the principal quantum number, n; the angular (azimuthal) quantum number, l; the magnetic quantum number, m; and the spin quantum number, s. The orbitals belonging to the same principal quantum number (n) constitute a “shell”. Capital letters of the alphabet denote the shells.


n = 1, 2, 3, 4…. (K, L, M, N…. shells) (2.1)

Orbitals of the same shell but with different azimuthal values are called “subshells”. These subshells are denoted by lower-case letters of the alphabet.

l = 0, 1, 2, 3, …. (s, p, d, f,….. subshells) (2.2)

The magnetic quantum number, m, takes integer values from –l to +l. Its significance is realized when a particle is placed in a magnetic field (e.g. nuclear magnetic resonance, NMR). The spin quantum number, s, can have only two values, either +½ (↑) or –½ (↓).

The quantum numbers, and the physical attributes to which they correspond, combined with Pauli's exclusion principle, which states that no two electrons in an atom can have the same four quantum numbers, provide a rational approach towards determining the electronic configuration of atoms (Table 2.1), and thereby the physical and chemical periodicity of atoms (the periodic table of the elements). The order of filling of electronic orbitals of an atom of atomic number, Z, is

1s. 2s2p. 3s3p. 4s3d4p. 5s4d5p. 6s4f5d6p. 7s5f6d.
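A filling order of this kind can be generated with the (n + l) ordering rule, a common textbook heuristic: subshells fill in order of increasing n + l, and for equal n + l the subshell with the lower n fills first. The sketch below is illustrative only; the function name `filling_order` is hypothetical, and the rule has well-known exceptions in real atoms (for example chromium and copper).

```python
def filling_order(max_n=7):
    """Return subshell labels sorted by (n + l), ties broken by lower n.

    Only s, p, d and f subshells (l = 0..3) are generated, since higher
    subshells are not occupied in the ground states of known elements.
    """
    letters = "spdf"
    subshells = [(n, l)
                 for n in range(1, max_n + 1)
                 for l in range(min(n, len(letters)))]
    subshells.sort(key=lambda nl: (nl[0] + nl[1], nl[0]))
    return [f"{n}{letters[l]}" for n, l in subshells]


order = filling_order()
print(" ".join(order[:19]))  # subshells from 1s up to 7p
```

Sorting on the pair (n + l, n) encodes both parts of the rule in a single key, which keeps the example short.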

Table 2.1 The Electronic Configuration of Atomic Structure

Shell (n)   s (l=0)   p (l=1)   d (l=2)   f (l=3)   g (l=4)   Total orbitals   Total electrons
1 (K)       1                                                 1                2
2 (L)       1         3                                       4                8
3 (M)       1         3         5                             9                18
4 (N)       1         3         5         7                   16               32
5 (O)       1         3         5         7         9         25               50
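The totals in Table 2.1 follow directly from the allowed ranges of the quantum numbers: for a given n, l runs from 0 to n – 1, m runs from –l to +l, and s is +½ or –½. A minimal sketch that enumerates these combinations is given below; the function names are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def shell_states(n):
    """List every allowed (n, l, m, s) combination for principal number n."""
    half = Fraction(1, 2)
    return [(n, l, m, s)
            for l in range(n)             # l = 0 .. n-1
            for m in range(-l, l + 1)     # m = -l .. +l
            for s in (half, -half)]       # spin up or spin down


def shell_counts(n):
    """Return (number of orbitals, number of electrons) for shell n."""
    states = shell_states(n)
    orbitals = {(l, m) for _, l, m, _ in states}
    return len(orbitals), len(states)


for n, name in zip(range(1, 6), "KLMNO"):
    orbitals, electrons = shell_counts(n)
    print(f"{n} ({name}): {orbitals} orbitals, {electrons} electrons")
```

Because each state is a distinct set of four quantum numbers, counting the states is equivalent to applying the exclusion principle; the counts come out as n² orbitals and 2n² electrons per shell.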

Fig. 2.1 Origin of the Atomic Line Spectra


2.2 MOLECULES

The smallest entity of a chemical compound is a molecule. Ionic bonds do not lead to the formation of single molecules, but to the formation of conglomerates (e.g. NaCl). Single molecular structures are formed from the association of atoms by chemical (covalent) bonds. A covalent bond is formed between two atoms when they share an electron pair between them. There are several types of chemical bonds: single bonds, double bonds, conjugated bonds and coordinate bonds.

Single bond (σ-bond) molecules are formed from the combination of s-orbitals. The σ-bond is cylindrically symmetrical. Some examples of such molecules are aliphatic organic molecules, ring-structured mono-sugars, and saturated fats.

Double bond (and triple bond) molecules are formed from the combination of s- and p-orbitals. The three p-orbitals are directed orthogonally along the three Cartesian axes (p_x, p_y, and p_z). Whereas the σ-bond is an axial covalent bond, electrons in π-molecular orbitals reside only above and below the bond axis. π-orbital structures are planar; aromatic molecules (benzene, anthracene, phenylalanine, etc.) exhibit π-bond characteristics.

Metals form coordinate complexes with molecules, forming tetrahedral, square planar, trigonal pyramidal and octahedral coordinated moieties.

There are four major classes of biological macromolecules in cells: nucleic acids, proteins, carbohydrates and lipids (Table 2.2). Nucleic acids are involved in storage and transmission of genetic information. Proteins are involved in a wide range of biological and biochemical activities, structural as well as functional. Nucleic acids and proteins are linear polymers of nucleotides and peptides, respectively. Carbohydrates are linear as well as branched-chain polymers and they are involved in structural support, energy storage and cell-cell communication. Biological membranes (lipids) are macromolecules, but are not polymers, and they are involved in energy storage functions.

Table 2.2 Biological Macromolecules and their Functions

Macromolecule(s)   Function                                           Examples
Nucleic acids      Storage and transmission of genetic information    DNA; mRNA; tRNA
Proteins           Structural and biochemical functions               Globular proteins (hemoglobin); fibrous proteins (fibrin; silk)
Carbohydrates      Structural support; energy storage                 Cellulose; starch; glucose
Lipids             Cell membranes; energy storage                     Cell membranes; fats; cholesterol

EXERCISE MODULES

1. Why does classical physics fail to explain the stability of the atom?

2. What are the essential features of quantum physics?

3. What are orbitals and quantum numbers?

4. Explain the various kinds of chemical bonds.

5. Draw formulas of some aliphatic molecules (linear and ring-structured), aromatic molecules and metal coordinate complexes (Help: consult any organic chemistry textbook).


BIBLIOGRAPHY

1. Atkins, P.W. (1998), Oxford University Press: Oxford. “Physical Chemistry”, 6th Edn.

2. Hallet, F.R., et al. (1982), Metheusen Pubs: Toronto. “Physics for the Biological Sciences: A Topical Approach to Biophysical Concepts”.

3. Narayanan, P. (2003), New Age Intl Pubs: New Delhi. “Essentials of Biophysics”(2nd Print).

4. Pauling, L. (1960), Cornell University Press: New York. “Nature of the Chemical Bond”.

5. Rae, A.I.M. (1981), McGraw Hill: London. “Quantum Mechanics”.

6. Tinoco, I. Jr., et al. (1985), Prentice-Hall: Englewood Cliffs/NJ. “Physical Chemistry: Principles and Applications in Biological Sciences”, 2nd Edn.


3 Features of Nucleic Acids

Nucleic acids, proteins, carbohydrates, and lipids and membranes are the major biological macromolecules that are of importance in the study of cell structure and function. However, from the standpoint of bioinformatics, nucleic acids and proteins and their complexes are the most important biological macromolecules. Therefore, an understanding of their structural features, vis-à-vis the structural characteristics of their constituents, is important for understanding their functional characteristics.

Nucleic acids, DNAs and RNAs (except tRNAs), are long thread-like macromolecules that play central roles in all the hereditary processes: storage of the genetic code, replication, transcription and translation into protein synthesis. Coded in the nucleic acid (DNA/RNA) is biological (chemical) information that can be stored, replicated, transcribed and translated in protein synthesis processes.

3.1 CONSTITUENTS OF NUCLEIC ACIDS

Generally, the basic unit of a nucleic acid is a nucleotide (it is a dinucleotide in the left-handed Z-form of nucleic acids). A nucleotide comprises (i) a purine or pyrimidine base, (ii) a ribose or deoxyribose sugar and (iii) a phosphate group (Fig. 3.1).

3.1.1 Nucleic Acid Bases

Nucleic acid bases are hetero-atom ring compounds (containing N and O atoms), and there are two classes of nucleic acid bases: (i) purines (R): adenine (A) and guanine (G), and (ii) pyrimidines (Y): thymine (T), uracil (U) and cytosine (C). The bases are planar, exhibiting aromatic nature with polar characteristics (due to nitrogen and oxygen moieties), and they exist in amino (NH2) and keto (C=O) tautomeric forms. Only those bases with “proper” tautomeric forms (amino (NH2) and keto (C=O) forms) can form correct hydrogen-bonded base-pairing (Watson-Crick base-pairing) patterns, which are essential for storage and transmission of genetic information.

In all naturally occurring nucleic acids, a nucleic acid base is covalently linked to a sugar moiety by the β-glycosyl bond (C1'–N bond), formed between the anomeric carbon atom (C1') of the ribose sugar and the N9 of a purine or N1 of a pyrimidine base. A nucleic acid base with a sugar unit is referred to as a nucleoside. The conformation of the base with respect to the sugar moiety is represented by the torsion angle (χ) around the glycosyl bond (C1'–N).


3.1.2 The Sugars

The sugars in nucleic acids are five-membered furanose rings: D-ribose in RNAs and 2-deoxy-D-ribose in DNAs (Fig. 3.2). The furanose sugar rings are puckered. Generally, the prominent sugar puckers are C2'-endo (in DNAs) and C3'-endo (in RNAs and DNA-RNA hybrids).

3.1.3 The Phosphate Group

The phosphate group is linked to the sugar at the C5'-position. A nucleoside with a phosphate group attached is called a nucleotide.

3.2 POLYNUCLEOTIDES

Nucleic acids are linear polymers of nucleotides formed by the condensation of two or more nucleotides linked via phosphodiester bonds. That is, they are polynucleotides. The formation of phosphodiester bonds in nucleic acids exhibits directionality. The conformation of the ribose-phosphate backbone of a nucleotide unit is represented by six torsion angles. In nucleic acids, all torsion angles are correlated; that is, structural changes follow a concerted motion.

In living cells, the genetic information is stored in DNA (RNA in some viruses), transcribed onto messenger RNA (mRNA) and then translated into proteins in ribosomes. The Watson-Crick hypothesis for the DNA double helix provides a rational explanation for storage of the genetic code, replication and transcription processes.

1. DNA (B-form) is a right-handed double helix of ~20 Å diameter, formed by two antiparallel right-handed helical strands wound around each other.

2. The ribose-phosphate backbone, formed via 3'-5' phosphodiester linkages, forms the periphery of the double helix structure.

3. The structure is stabilized by non-bonded interactions: H-bonding between bases of opposite strands (polar) and base stacking (non-polar) (Fig. 3.3).
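The antiparallel, base-paired arrangement described in the points above means that one strand fully determines the other. The sketch below is a minimal illustration of this, assuming sequences are written as plain strings in the 5'→3' direction; the names `PAIRS`, `H_BONDS`, `complementary_strand` and `hydrogen_bonds` are hypothetical.

```python
# Watson-Crick partners, and the number of hydrogen bonds in each pair
# (A=T has two, G≡C has three).
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}
H_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}


def complementary_strand(strand_5to3):
    """Return the partner strand, also written 5'->3'.

    The strands are antiparallel, so the input is read backwards
    while each base is swapped for its Watson-Crick partner.
    """
    return "".join(PAIRS[base] for base in reversed(strand_5to3))


def hydrogen_bonds(strand_5to3):
    """Count inter-strand hydrogen bonds in the fully paired duplex."""
    return sum(H_BONDS[base] for base in strand_5to3)


print(complementary_strand("ATGC"))
print(hydrogen_bonds("ATGC"))
```

Applying `complementary_strand` twice returns the original sequence, which is a convenient sanity check on the pairing table.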

Fig. 3.2 Chemical Structures of Ribose and 2-Deoxyribose

Fig. 3.1 Constituents of a Nucleotide
Base + Pentose sugar = Nucleoside
Base + Pentose sugar + Phosphate group = Nucleotide


flexibility and can exist in distinct tertiary structure folding (Figs. 3.6 & 3.7). The quaternary structure of nucleic acids generally refers to nucleic acid-ligand association. DNA-histone complexes and other nucleic acid-protein complexes are some examples (see protein-nucleic acid interactions in Chapter 8).

3.2.1 The Nucleic Acid Families

Nucleic acids exist in several structural forms, depending upon relative humidity, uniqueness of the base sequences, and solvent and salt concentrations. They are classified under A-, B- and Z-families, based on their structural and conformational features. Due to subtle variations in their structural parameters, helical arrangements in different families are macroscopically different and distinct. These structural families are interconvertible, depending on humidity and salt concentrations.

3.2.1.1 The A-Family

To the A-family belong RNAs and DNA-RNA hybrids. The A-family exists in right-handed double helical form, exhibits structural conservatism and is uniform in overall shape. The helix axis is pushed into the major groove (D = 4.5 Å) and the polynucleotide chains wrap around the helix axis like a ribbon. As a consequence, there is a deep and narrow major groove, but a shallow and wide minor groove. The A-family has both inter-strand and intra-strand base overlap.

Fig. 3.4 Double-helical Structure of DNA

(Ref: US Dept of Energy: Genomes to Life Project; http://doegenomes.to.org)

Fig. 3.5 Watson-Crick Type Base-pairing in Nucleic Acids (Adenine-Thymine and Guanine-Cytosine pairs)


Fig. 3.8(a) Double-helical Structure of B-DNA (CPK Model)

Fig. 3.8(b) Tertiary Structure of B-DNA (Dodecanucleotide) Unit
(Ref: Drew, H.R., et al. (1981), Proc Natl Acad Sci, USA, 78; 2179. Source: Protein Data Bank; 1BNA.pdb)

3.2.1.2 The B-Family

B-, C-, D- and T-DNA forms belong to the B-family. The B-form of double-helical DNA prevails under normal physiological conditions of low ionic strength and high relative humidity (> 90%). In the B-form the helix axis passes through the center of the base pairs (D = 0 Å). Therefore, the major and minor grooves are of equal depth, but unequal widths (Fig. 3.8). The B-form is structurally more flexible and is sensitive to sequence, base composition and environmental conditions (solvent and salt). When the relative humidity is reduced to < 75%, the B-form undergoes a reversible transformation to the A-form. The transformation is dependent on base composition.

3.2.1.3 The Z-Family

At higher salt concentrations, poly(dG-dC) regions of the right-handed double helical B-DNA structure can transform into the left-handed Z-DNA form (zig-zag form). The structural features of the Z-form differ greatly from those of the B-form. These are:

(i) The double helix is left-handed.
(ii) The repeat unit is a dinucleotide (dG + dC).
(iii) The glycosyl bond conformation and the sugar pucker are different.
(iv) There is no major groove.
(v) There is no proper base stacking.


Fig. 3.8(c) Space-filling Model of B-DNA Repeat Unit

3.3 THE GENOME PROJECTS

A gene is a segment of DNA that contributes to phenotype or function. Genes that do not appear to encode protein products (pseudogenes) can be characterized by sequence, transcription, or homology to another gene. The primary objectives of genome projects, pursued through gene amplification and gene sequencing, are twofold: (1) making a series of descriptive diagram maps (gene mapping) of each chromosome (human and other organisms) at increasingly finer resolution, and (2) high-resolution gene mapping, production of large quantities of proteins, and inference of amino acid sequences from the gene sequence data (see Chapter 9 for further details).

3.3.1 Gene Expression

A gene is a specific sequence of nucleotides that carries the information required for protein synthesis. Gene expression, the process of transmitting genetic information for protein synthesis, is carried out in three distinct stages: (1) replication, (2) transcription and (3) translation. A host of regulatory features and factors is involved in realizing these processes.

3.3.1.1 Replication

DNA replication is based upon the complementarity of the genetic (chemical) information stored in base-pairing. During replication the DNA double helix is unwound by helicases (Fig. 3.9), with each single strand becoming a template for synthesis of a new, complementary strand. The replication is semi-conservative; each daughter molecule consists of one parent strand and one newly synthesized strand (Fig. 3.10). The replication process occurs at the replication forks, regions where DNA is unwound exposing single strands that act as templates (at specific DNA sequences called origins of replication) (Fig. 3.11).

Fig. 3.9 Structure of Rep Helicase + Single-stranded DNA Complex
(Ref: Korolev, S., et al. (1997), Cell, 90; 635) (Source: Protein Data Bank: 1UAA.pdb)

Fig. 3.10 Semi-conservative Nature of DNA Replication
(Parent DNA; daughter DNA molecules; light = newly synthesized)

RNA primes the DNA synthesis; that is, it initiates the DNA synthesis process (Fig. 3.12). An RNA polymerase (called primase) synthesizes a short stretch of RNA (~10 bases) that is complementary to one of the DNA template strands. This RNA chain serves as the primer for synthesis of the new DNA molecule, and the chain elongation is catalyzed by DNA polymerases (Fig. 3.13). DNA polymerases are template-directed. They catalyze the formation of a phosphodiester bond only if the base on the incoming nucleotide is complementary to the base on the template strand.

In the case of DNA, DNA polymerase III (DNA pol III) catalyzes the DNA synthesis (in the 5'→3' direction) (Fig. 3.14). The polymerization process is bi-directional. The leading strand is polymerized continuously, and the lagging strand discontinuously (Okazaki fragments). As a result, while one RNA primer is required for the leading strand, each Okazaki fragment requires an RNA primer. The RNA portion of the RNA-DNA hybrid is hydrolyzed by DNA pol I (Fig. 3.15). DNA ligase joins the Okazaki fragments. The process occurring at the replication fork and the various enzyme complexes involved are schematically represented in Fig. 3.16.


Fig. 3.11 DNA Synthesis at the Replication Fork

Fig. 3.12 Initiation of DNA Synthesis by an RNA Primer

Fig. 3.13 Chain-elongation Reaction catalyzed by DNA Polymerases

3.3.1.2 Transcription

The templates for protein synthesis are RNAs––

DNA → (Transcription) → RNA → (Translation) → Protein


synthesis is catalyzed by RNA polymerases (RNA pol). Whereas prokaryotes have one RNA polymerase, eukaryotes have three (protein-encoding genes are transcribed by RNA polymerase II in eukaryotes). Transcription is initiated in the promoter region by a complex of different factors. The process is similar to DNA replication.

Transcription involves three distinct stages: (i) initiation, (ii) elongation and (iii) termination. (i) The RNA polymerase binds at the DNA promoter site, unwinds the DNA double helix, and initiates the synthesis of a transcript. The promoter sequences are not transcribed. (ii) RNA pol moves along the DNA, maintaining the transcription “bubble” to expose the DNA template strand, and catalyzes the 3' elongation of the transcript. (iii) Formation of a hairpin loop in the nascent RNA transcript results in the release of the RNA strand and RNA pol from the DNA template.
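The elongation stage can be pictured as a base-by-base copying of the template into RNA. The sketch below is a simplified illustration that ignores promoters, the transcription bubble and termination signals; it assumes the template strand is supplied as a string written in the 3'→5' direction, so that the transcript comes out 5'→3'. The names `RNA_PAIR` and `transcribe` are hypothetical.

```python
# Pairing rules for RNA synthesis on a DNA template:
# the transcript carries uracil (U) opposite adenine (A).
RNA_PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3to5):
    """Return the mRNA (5'->3') complementary to a DNA template (3'->5')."""
    return "".join(RNA_PAIR[base] for base in template_3to5)


print(transcribe("TACCGA"))
```

Each new nucleotide is added at the 3' end of the growing transcript, which is why the output string is built left to right in the 5'→3' sense.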

3.3.1.3 Translation

Translation is the unidirectional process that takes place on the ribosomes whereby the genetic information present in an mRNA is converted into a corresponding sequence of amino acids in a protein. After transcription, the single-stranded mRNA is moved from the nucleus to the cellular cytoplasm, to the ribosome, the protein synthesis apparatus. Activated tRNA molecules (aminoacyl-tRNAs) carrying specific amino acids are also brought to the ribosomes. The sequence of amino acids in a protein is determined by the sequence of codons (contiguous nucleotide triplets) in mRNA as ‘read’ by anticodons of tRNAs. The anticodon of a tRNA binds with a specific codon in mRNA by complementary base-pairing (base-specific codon-anticodon hydrogen bonding).

Translation also follows initiation, elongation and termination steps. There are two tRNA-binding sites in the ribosome: the A-site (for entry of aminoacyl-tRNA) and the P-site (for peptidyl-tRNA, carrying the growing polypeptide). Initiation results in the binding of the initiator tRNA to the start signal of mRNA. In bacteria, the first amino acid is always N-formylmethionine (fMet) and the initiation codon is preceded by the Shine-Dalgarno sequence. In eukaryotes, the first amino acid is methionine (Met) and the mRNA carries a 5'-cap. The initiator tRNA occupies the peptidyl (P) site on the ribosome.

Elongation consists of three steps: (i) binding of aminoacyl-tRNA (codon recognition), (ii) peptide bond formation and (iii) translocation (Fig. 3.17). Elongation starts with the binding of an aminoacyl-tRNA to the A-site (aminoacyl site). A peptide bond is then formed between the fMet-tRNA and the aminoacyl-tRNA. The resulting dipeptidyl-tRNA is then translocated from the A-site to the P-site, while the other tRNA (uncharged tRNA) molecule leaves the P-site. The mRNA moves a distance of three nucleotides and a new aminoacyl-tRNA binds to the empty A-site to start another round of elongation.

The elongation process is assisted by various elongation proteins (elongation factors) (Fig. 3.18). Encountering a “stop” codon, recognized by a protein release factor, leads to termination of the translation process and release of the polypeptide from the ribosome.
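The codon-by-codon reading of mRNA can be sketched as a table lookup. The example below is a simplified illustration under stated assumptions: it uses the standard genetic code, starts at the first AUG it finds, reads three nucleotides at a time, and stops at the first stop codon. It ignores the Shine-Dalgarno sequence, the 5'-cap, tRNA charging and elongation factors. The names `CODON_TABLE` and `translate` are hypothetical.

```python
from itertools import product

# Standard genetic code, listed with codon positions ordered U, C, A, G.
# '*' marks the three stop codons (UAA, UAG, UGA).
BASES = "UCAG"
AMINO = ("FFLLSSSSYY**CC*W"   # U..
         "LLLLPPPPHHQQRRRR"   # C..
         "IIIMTTTTNNKKSSRR"   # A..
         "VVVVAAAADDEEGGGG")  # G..
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(mrna):
    """Translate an mRNA string (5'->3') into one-letter amino acid codes."""
    start = mrna.find("AUG")          # initiation codon
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":         # stop codon: release the polypeptide
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("GGAUGGCUUUUUAAGG"))
```

In this example the reading frame is set by the position of AUG, so the two leading G nucleotides are skipped and the codons read are AUG, GCU, UUU and then the stop codon UAA.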

3.3.2 Gene Amplification

Gene amplification is the selective increase in the number of copies of a specific gene coding for a specific protein without a proportional increase in other genes. Gene amplification is necessary for obtaining sufficient quantities of a desired gene or gene fragment for further analysis and for production of a desired protein in large quantities for sequencing (see Chapter 6) and other analysis. Cloning and the polymerase chain reaction (PCR) are two molecular biological techniques that are in use for gene amplification.


3.3.2.1 Gene Cloning

Gene cloning involves the use of recombinant DNA technology to propagate DNA fragments, isolated from chromosomes using restriction enzymes, inside a foreign host. Following introduction into suitable host cells, the DNA fragments can then be reproduced along with the host cell DNA. Cloning procedures are routinely employed to produce unlimited material for experimental study.

3.3.2.2 Polymerase Chain Reaction (PCR)

Polymerase chain reaction (PCR) is a very versatile in vitro gene amplification method that has brought tremendous progress in molecular biology and genetics. PCR can amplify a desired DNA sequence of any origin hundreds of millions of times in hours.

In gene amplification by polymerase chain reaction (PCR) methods, a desired cDNA clone is synthesized using mRNA as a template. Suitable primers are used to hybridize to the corresponding sequences, and they are extended in a chain synthesis reaction by DNA polymerases, using the inserted sequence as the template. The PCR mixture contains DNA bases (four types) and two primers (~20 bases long). The mixture is (i) heated to denature thermally the double-stranded target molecule and separate the strands of the target sequence, (ii) cooled (annealing) to allow the primers to bind to their complementary sequences on the separated strands, and (iii) incubated to allow the polymerase to extend the primers into new complementary strands. Repeated heating and cooling cycles multiply the target DNA exponentially, since each new double strand separates to become two templates for further synthesis. The reaction is efficient, specific, and extremely sensitive.

The nucleotide that the polymerase attaches will be complementary to the base in the corresponding position on the template strand (e.g., if the adjacent template base is C, the polymerase attaches G). The polymerase chain reaction proceeds with two primers, bound to the opposite strands of the gene target, with their 3'-ends pointing at each other. The reaction is terminated by the incorporation of dideoxynucleotides. The result is a series of fragments of different lengths for each primer.
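The exponential character of PCR can be checked with simple arithmetic: if every cycle doubles the target, n cycles give 2^n copies per starting molecule. The sketch below is an idealized model only; the `efficiency` parameter (the fraction of templates copied per cycle) is an assumption introduced for illustration, and real reactions eventually plateau. The function names are hypothetical.

```python
import math


def copies_after(cycles, initial=1, efficiency=1.0):
    """Copies after a number of cycles; efficiency=1.0 means perfect doubling."""
    return initial * (1 + efficiency) ** cycles


def cycles_needed(fold, efficiency=1.0):
    """Smallest whole number of cycles giving at least `fold` amplification."""
    return math.ceil(math.log(fold) / math.log(1 + efficiency))


print(copies_after(30))        # ideal doubling over 30 cycles
print(cycles_needed(1e8))      # cycles for a hundred-million-fold increase
```

Under perfect doubling, fewer than thirty cycles are enough for a hundred-million-fold amplification, which is consistent with amplification being achievable within hours.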

3.3.3 Gene Separation

Cutting genomic DNA at specific sites by suitable restriction enzymes generates DNA fragments. The fragments are amplified either by cloning or by polymerase chain reaction (PCR) methods. Electrophoresis techniques are used to separate the fragments. Small-diameter capillary array gel electrophoresis permits application of high electric fields, thus providing significantly faster separation than traditional slab gels (Chapter 5). While conventional electrophoresis is applicable to separating fragments < 40 kilobases, pulsed-field gel electrophoresis (PFGE) techniques have improved the separation of larger fragments (~10 megabases). This technique employs multiple electrodes, placed orthogonally with respect to the gel, and short pulses of alternating current are passed through the gel.
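Fragment generation by a restriction enzyme can be sketched as a search for a recognition site followed by a cut at a fixed offset within it. The example below assumes the EcoRI recognition site GAATTC, cut between G and A, and follows one strand only; the function name `digest` and the input sequence are hypothetical, chosen for illustration.

```python
def digest(seq, site="GAATTC", cut_offset=1):
    """Cut `seq` at every occurrence of `site`, `cut_offset` bases into it.

    Returns the fragments in their original order along the sequence.
    """
    fragments = []
    start = 0
    pos = seq.find(site)
    while pos != -1:
        fragments.append(seq[start:pos + cut_offset])
        start = pos + cut_offset
        pos = seq.find(site, pos + 1)
    fragments.append(seq[start:])
    return fragments


fragments = digest("AAGAATTCTTTTGAATTCA")
print(fragments)
# Electrophoresis separates by size, so list the fragment lengths in order.
print(sorted(len(fragment) for fragment in fragments))
```

Sorting by length mimics the readout of a gel, where fragments are distinguished by how far they migrate rather than by where they came from in the genome.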

3.3.4 Gene Sequencing

Genome sequences are assembled from DNA sequence fragments of approximately 500 base pairs in length. Conventional (1st generation) gene sequencing employed the Maxam-Gilbert and Sanger methods. The Maxam-Gilbert method uses chemicals to cleave DNA at specific bases, resulting in fragments of different lengths. The Sanger sequencing method (dideoxy method) uses an enzymatic procedure to synthesize DNA chains that terminate at positions occupied by one of the four bases, and then determines the resulting fragment lengths (see Chapter 6). The multiplex sequencing procedure enables analysis of ~40 clones on a single DNA-sequencing gel.
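The logic of reading a sequence from fragment lengths can be sketched as follows. In this idealized model, each of four reactions terminates chains at every position occupied by one particular base, so the set of fragment lengths in a reaction records where that base occurs; ordering all fragments by length then recovers the sequence. The function names and the example sequence are hypothetical, and real data involve noise and limited read length.

```python
def sanger_fragments(new_strand):
    """Map each base to the lengths of chains terminated at that base."""
    lengths = {base: [] for base in "ACGT"}
    for position, base in enumerate(new_strand, start=1):
        lengths[base].append(position)
    return lengths


def read_sequence(lengths):
    """Rebuild the sequence by ordering all fragments from short to long."""
    ladder = sorted((n, base) for base, ns in lengths.items() for n in ns)
    return "".join(base for _, base in ladder)


ladder = sanger_fragments("GATTACA")
print(ladder)
print(read_sequence(ladder))
```

The shortest fragment identifies the first base of the synthesized strand and the longest identifies the last, which corresponds to reading a sequencing gel from bottom to top.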

Developments in gene sequencing techniques (2nd and 3rd generation) are: ultra-thin electrophoresis, resonance ionization spectroscopy to detect suitable isotope labels, laser-induced fluorescence, gel-less flow cytofluorimetry, scanning-tunneling or atomic force microscopy, and mass spectrometry (see Chapter 6).

3.3.5 Genetic Mapping

A genome map describes the relative positions of genes and other markers and the spacing between them on each chromosome. Mapping involves (i) dividing the chromosomes into smaller fragments by restriction enzymes, and (ii) mapping the fragments to their respective locations on the chromosomes. Low-resolution maps are genetic linkage maps, which depict the relative chromosomal location of DNA markers along the chromosome. Physical maps describe the chemical characteristics of the DNA molecule itself. Physical maps can be low-resolution or high-resolution maps. Low-resolution chromosomal maps are based on the banding patterns (light and dark bands reflecting regional variations in the amounts of A-T versus G-C) observed in light microscopy of stained chromosomes. High-resolution physical maps provide the complete base-pair sequence of each chromosome in the genome. Determination of base-pair sequences of genes (high-resolution physical mapping) is necessary for inferring the amino acid sequences (primary structure) of the corresponding proteins.

EXERCISE MODULES

1. Build chemical structures of nucleic acid bases.
2. Build Watson-Crick base pairs (A = T; G ≡ C).
3. Build a nucleoside; rotate around the glycosyl (C1'–N) bond and observe the orientation of the base with respect to the sugar moiety.
4. Build a nucleotide and rotate the C5'–O5' bond and observe the orientation of the phosphate group with respect to the sugar moiety.
5. What are the structural features of nucleic acids?
6. Describe the essential features of the Watson-Crick model of the DNA double helix.
7. Build a few turns of DNA double helix.
8. What are the primary, secondary and tertiary structural features of nucleic acids?
9. Describe the various forms of nucleic acids.
10. What are the objectives of the genome project?
11. What is gene expression?
12. What is replication?
13. Why is replication semi-conservative?
14. What is a replication fork?
15. What is the role of a primer?
16. What are the functions of DNA pol I and DNA pol III?
17. What is transcription and what are its salient features?
18. What is translation and where does it occur?
19. What are the main features of translation?
20. What is the role of tRNAs in translation?
21. What is the relevance of genome projects?
22. What is gene amplification and why is it necessary?
23. What is gene cloning and how is it done?
24. What is polymerase chain reaction (PCR)?
25. What are the gene separation methods?
26. What are the gene-sequencing methods?
27. What is genetic mapping?

BIBLIOGRAPHY

1. Baltimore, D. & Berg, A.A. (1995), Nature, 373; 287. “DNA-binding proteins”.
2. Berman, H.M., et al. (2000), Nucleic Acids Res., 28; 235. “Protein Data Bank”.
3. Blackburn, G.M. & Gait, M.J. (Eds) (1990), Oxford University Press: Oxford. “Nucleic Acids in Chemistry and Biology”.
4. Calladine, C.R. & Drew, H.R. (1992), Academic Press: New York. “Understanding DNA”.
5. Conn, G.L. & Draper, D.E. (1998), Curr Opin Struct Biol., 8(3); 278. “RNA structure”.
6. Darnell, J.E. Jr. (1985), Sci Amer., 253(4); 68. “RNA”.
7. Dickerson, R.E., et al. (1982), Science, 216; 475. “The anatomy of A-, B- and Z-DNA”.
8. Dickerson, R.E. (1983), Sci Amer., 249(6); 86. “DNA helix and how it is read”.
9. Felsenfeld, G. (1985), Sci Amer., 253(4); 58. “DNA”.
10. Innis, M., et al. (1990), Academic Press: San Diego, CA. “PCR Protocols: A Guide to Methods and Applications”.
11. Johnson, P.F. & McKnight, S.L. (1989), Annu Rev Biochem., 58; 799. “Eukaryotic transcriptional regulatory proteins”.
12. Kornberg, A. & Baker, T.A. (1992), W.H. Freeman: New York. “DNA Replication”.
13. Lodish, H., et al. (1995), Sci Amer Books: New York. “Molecular Cell Biology”, 3rd Edn.
14. Narayanan, P. (2001), Bhalani Pubs: Mumbai. “Clinical Biophysics: Principles and Techniques”.
15. Narayanan, P. (2003), New Age Intl Pubs: New Delhi. “Essentials of Biophysics” (2nd Print).
16. Ptashne, M. (1988), Nature, 335; 683. “How eukaryotic transcriptional activators work”.
17. Saenger, W. (1984), Springer-Verlag: Berlin. “Principles of Nucleic Acid Structure”.
18. Tjian, R. (1995), Sci Amer., 272(3); 7. “Molecular machines that control genes”.
19. Walker, J.M. & Gaastra, W. (Eds) (1983), Croom Helm: London. “Techniques in Molecular Biology”.
20. Watson, J.D. & Crick, F.H.C. (1953), Nature, 171; 737. “The molecular structure of nucleic acids”.
21. Watson, J.D. & Crick, F.H.C. (1953), Nature, 171; 964. “Genetic implications of the structure of deoxyribonucleic acid”.
22. Watson, J.D., et al. (1988), Benjamin-Cummings: New York. “Molecular Biology of the Gene”.
23. ———— (1997), Human Genome Project, US Dept of Energy: Washington, DC. “A Primer on Molecular Genetics”.


4 Features of Proteins

Proteins are important constituents of all biological systems. All biologically relevant proteins are linear polypeptides (except where cystine disulfide bridges cross-link the chain), built from a repertoire of twenty L-amino acids. A polypeptide (protein) consists of repeating peptide (main-chain) units carrying different side-chain residues (R-groups). The physicochemical characteristics of the side chains and the tertiary folding of the polypeptide backbone make each protein unique, structurally and functionally. Protein structures are organized at four structural levels. Non-bonded interactions (ionic, hydrogen bonding and van der Waals) stabilize the secondary, super-secondary, tertiary and quaternary structures of macromolecules.

4.1 AMINO ACIDS

There are twenty L-amino acids that are the basic structural units of all naturally occurring proteins and enzymes. They all have (except proline, which is an imino acid) an amino group (NH3+), a carboxylate group (COO–), a hydrogen atom and a substituent group, R, called a side chain, all covalently bonded to the central tetrahedral α-carbon atom (Fig. 4.1). All amino acids except glycine carry four different substituents at the α-carbon and therefore exhibit chirality (L- or D-form). Amino acids differ from each other structurally and functionally owing to the structure and the chemical nature of their side chains.

4.1.1 Characteristics of Amino Acids

Ionization states of amino acids are pH-dependent (Fig. 4.2). They exist in the zwitterionic state under physiological pH (~7) conditions.

Amino acids can be characterized on the basis of the shape, size and chemical nature of their side chains (Fig. 4.3 and Table 4.1). Whereas nucleic acids have only four constituents, falling into two subclasses with similar structural and physicochemical characteristics, amino acids differ greatly in the size, shape and chemical characteristics of their side chains.

From the chemical nature of the side chains under physiological conditions, there are:

(i) Five amino acids with charged R-groups: two acidic amino acids, aspartic acid and glutamic acid, and three basic amino acids, arginine, lysine and histidine.

Fig. 4.1 Chemical Formula (Representation) of L-Amino Acids


Fig. 4.3 Chemical Structures of L-Amino Acids found in Proteins

Table 4.1 Amino Acids with their Side-chains (–R) and their Characteristics

Amino acid | Abbreviation | Side-chain (–R group) | pKa (α-NH3+; α-COO–; R-group) | Characteristics
Alanine | Ala (A) | –CH3 | 9.8; 2.4 | Small; Hydrophobic/Ambivalent
Arginine | Arg (R) | –CH2–(CH2)2–NH–C–(NH2)2+ | 9.0; 1.8; 12.5 | Large; Hydrophilic (Basic)
Asparagine | Asn (N) | –CH2–(C=O)–NH2 | 8.7; 2.1 | Small; Hydrophilic (Neutral)
Aspartic Acid | Asp (D) | –CH2–COO– | 9.9; 2.0; 3.9 | Small; Hydrophilic (Acidic)
Cysteine | Cys (C) | –CH2–SH | 10.7; 1.9; 8.4 | Medium; Hydrophilic/Ambivalent
Glutamine | Gln (Q) | –(CH2)2–(C=O)–NH2 | 9.1; 2.2 | Small; Hydrophilic (Neutral)
Glutamic Acid | Glu (E) | –CH2–CH2–COO– | 9.4; 2.1; 4.1 | Small; Hydrophilic (Acidic)
Glycine | Gly (G) | –H | 9.8; 2.4 | Small; Hydrophobic/Ambivalent
Histidine | His (H) | –CH2–imidazole ring | 9.3; 1.8; 6.0 | Large; Hydrophilic (Basic)
Isoleucine | Ile (I) | –CH(CH3)–CH2–CH3 | 9.8; 2.3 | Medium; Hydrophobic (branched)
Leucine | Leu (L) | –CH2–CH–(CH3)2 | 9.7; 2.3 | Medium; Hydrophobic (linear)
Lysine | Lys (K) | –CH2–(CH2)3–NH3+ | 9.1; 2.2; 10.6 | Large; Hydrophilic (Basic)
Methionine | Met (M) | –CH2–CH2–S–CH3 | 9.3; 2.2 | Medium; Hydrophobic/Ambivalent
Phenylalanine | Phe (F) | –CH2–phenyl ring | 9.3; 2.1 | Large; Hydrophobic (aromatic)
Proline | Pro (P) | Imino ring | 10.6; 2.0 | Medium; Hydrophobic/Ambivalent
Serine | Ser (S) | –CH2–OH | 9.2; 2.2; ~13 | Small; Hydrophilic/Ambivalent
Threonine | Thr (T) | –CH(OH)–CH3 | 9.1; 2.1; ~13 | Small; Hydrophilic/Ambivalent
Tryptophan | Trp (W) | –CH2–indole ring | 9.4; 2.4 | Large; Hydrophobic/Ambivalent
Tyrosine | Tyr (Y) | –CH2–phenolic ring | 9.2; 2.2; 10.4 | Large; Hydrophobic/Ambivalent
Valine | Val (V) | –CH–(CH3)2 | 9.7; 2.3 | Medium; Hydrophobic


(ii) Five polar (neutral/ambivalent) amino acids: asparagine, glutamine, cysteine, serine and threonine.

(iii) Ten non-polar (hydrophobic/ambivalent) amino acids: alanine, glycine, phenylalanine, isoleucine, leucine, methionine, proline, tryptophan, tyrosine (uncharged) and valine.
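The pH dependence of the ionization state, and the classification into neutral, acidic and basic side chains, can be illustrated numerically. The sketch below applies the Henderson-Hasselbalch relation to pKa values taken from Table 4.1 for three free amino acids. The function names and the choice of examples are illustrative assumptions, and treating each ionizable group as independent is a simplification.

```python
# pKa values from Table 4.1: (alpha-NH3+, alpha-COOH, side chain or None).
# The side-chain entry also records whether the group is acidic or basic.
PKA = {
    "Ala": (9.8, 2.4, None),
    "Asp": (9.9, 2.0, (3.9, "acid")),
    "Lys": (9.1, 2.2, (10.6, "base")),
}


def fraction_protonated(pka, ph):
    """Henderson-Hasselbalch: fraction of a group that still carries its proton."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


def net_charge(name, ph=7.0):
    """Approximate net charge of a free amino acid, treating groups independently."""
    pka_amino, pka_carboxyl, side = PKA[name]
    charge = fraction_protonated(pka_amino, ph)            # NH3+ is +1 when protonated
    charge -= 1.0 - fraction_protonated(pka_carboxyl, ph)  # COO- is -1 when deprotonated
    if side is not None:
        pka_side, kind = side
        if kind == "base":
            charge += fraction_protonated(pka_side, ph)
        else:
            charge -= 1.0 - fraction_protonated(pka_side, ph)
    return charge


for amino_acid in PKA:
    print(amino_acid, round(net_charge(amino_acid), 2))
```

At pH 7 the two backbone charges of alanine cancel almost exactly (the zwitterion), while the side chains leave aspartic acid with a net charge close to –1 and lysine close to +1.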

4.2 PROTEINS

As stated, naturally occurring peptides and proteins are linear polymers, synthesized from a library of twenty L-amino acids that are covalently linked through peptide (C–N) bonds, formed by condensation (elimination of water) between the α-carboxyl and α-amino groups of successive amino acids (Fig. 4.4). A polypeptide consists of repeating peptide backbone (main-chain) units with side chains (R1, R2, R3, …). The repeating unit along the polypeptide backbone is (HN–CαH–CO). Typically, a peptide consists of fewer than 50 amino acids, while a protein has more than 50 amino acids.

Fig. 4.4 Formation of Polypeptides from Amino Acids via Peptide (C–N) bonds

Operationally, the structural and functional features of proteins and protein complexes are addressed at four levels of hierarchical structural organization: (1) primary structure, (2) secondary structure, (3) tertiary structure and (4) quaternary structure.


4.2.1 The Primary Structure (1°-Structure)

The number and linear order of the amino acids present in a peptide or protein constitute the primary structure (1°-structure). By convention, the sequence is written with the N-terminal end (free α-amino group) on the left, as amino acid number 1, and the C-terminal end (the residue with a free α-carboxyl group) on the right. The single-letter nomenclature of amino acids is used in amino acid sequence data (Fig. 4.5). Determination of the primary structure of a protein (see Chapter 6) is essential to understand the mechanisms of biochemical reactions, to trace evolutionary paths and to carry out computational structure prediction from sequence homologies with related proteins. The primary structure of a protein is generally a prerequisite for three-dimensional structure determination by physical techniques.

Fig. 4.5 Amino Acid Sequence of a Disulfide-containing Protein

Amino acid sequences of proteins can be inferred from the base sequences of the corresponding nucleic acids. With the development of highly efficient gene sequencing methods, gene sequencing is faster and easier than protein sequencing, and this alternative is followed wherever it is feasible. However, there are certain ambiguities and limitations in inferring an amino acid sequence from a gene sequence. These are:

(i) Degeneracy of codons (more than one codon coding for the same amino acid) leads to ambiguities.

(ii) The genetic code is not universal.

(iii) Deletion and insertion of nucleotide(s) can lead to an erroneous reading frame for the amino acids.

(iv) Post-translationally modified proteins and disulfide-containing proteins can be characterized only by direct protein sequencing.
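Points (i) and (iii) can be made concrete with a short sketch. It builds the standard genetic code table, translates a short DNA sequence, lists the codons for leucine, and shows how a single-nucleotide deletion shifts the reading frame. The example sequence is hypothetical, and the standard code is assumed; as point (ii) notes, some organisms and organelles deviate from it.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding-strand DNA string in frame 0; a trailing partial codon is ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def codons_for(amino_acid):
    """All codons that encode the given one-letter amino acid."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)


gene = "ATGGCCAAATGA"            # hypothetical coding sequence
print(translate(gene))           # reading frame intact
print(codons_for("L"))           # degeneracy: several codons, one amino acid
mutant = gene[:3] + gene[4:]     # delete one nucleotide after the start codon
print(translate(mutant))         # downstream codons are now read out of frame
```

Because several codons map to one amino acid, the mapping is unambiguous from gene to protein in a known reading frame, but a protein sequence alone does not determine the gene sequence.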

4.2.2 The Secondary Structure (2°-Structure)

Ordered local segments (helices and sheets), reverse turns and loops, and the local hydrogen bonding of the polypeptide backbone constitute the secondary structure of proteins. The secondary structure elements are the building blocks of the folding units in globular proteins. The secondary structure of a polypeptide (protein) is determined largely by local sequence information.

According to Pauling's hypothesis on the peptide unit:

(i) The peptide bond (C–N bond) is rigid, and the amide group (O=C–NH) is consequently planar (Fig. 4.6).

(ii) There are only two degrees of rotational freedom per peptide unit, namely about the single bonds N–Cα (φ) and Cα–C (ψ). These conformational angles are referred to as Ramachandran angles.

(iii) Hydrogen bonds play a crucial role in stabilizing the polypeptide chain conformation.

(iv) The trans configuration is preferred across the peptide bond.

Angles around C and N atoms are ~120°; the O=C–NH group is planar
Color code: C = black; O = red; N = blue; H = white

Fig. 4.6 Stereo-chemical Features of the Peptide Unit

Pauling postulated two ordered structures that should occur in polypeptides, namely the α-helix and the β-sheet.

4.2.2.1 α-Helix

The α-helix (3.6₁₃-helix) is a common secondary structure encountered in proteins of the globular class. It is a right-handed, rod-like helical segment, stabilized by intra-molecular hydrogen bonds that run parallel to the helix axis and form between the NH and C=O groups of peptide units spaced four residues apart (Fig. 4.7(a)). There are many proteins with a predominance of α-helices, such as myoglobin, melittin and cytochrome C′ (Fig. 4.7(b)).

Other types of helices (3₁₀-helix, π-helix) (Table 4.2) also occur in proteins, as does the L-polyproline type helix, which is the conformation of collagen monomers.

Table 4.2 Structural Parameters of Ordered Segments in Polypeptides

Type | φ (°) (Cα–N) | ψ (°) (Cα–C) | n | d (Å) | P (Å) | Atoms in the loop
2₇-ribbon | 105 (–75) | 240 (60) | 2.0 | 2.8 | 5.6 | 7
3₁₀-helix | 130 (–50) | 155 (–25) | 3.0 | 2.0 | 6.0 | 10
3.6₁₃-helix (αR-helix) | 122 (–58) | 133 (–47) | 3.6 | 1.5 | 5.4 | 13
4.4₁₆-helix (π-helix) | 120 (–60) | 110 (–70) | 4.4 | 0.8 | 3.5 | 16
β-sheet (↑↑) | 60 (–120) | 295 (115) | 2.0 | 3.47 | 6.95 | -
β-sheet (↑↓) | 40 (–140) | 315 (135) | 2.0 | 3.47 | 6.95 | -
L-helix (Collagen) | 105 (–75) | 330 (150) | –3.0 | 3.12 | 9.36 | -

P = pitch of the helix; d = rise per residue; n = number of residues per turn (P = n × d; a negative n denotes a left-handed helix).
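The three helical parameters in Table 4.2 are not independent: the pitch is the number of residues per turn multiplied by the axial rise per residue. The sketch below checks this relation against the tabulated values and uses it to estimate the length of a helical segment. The dictionary layout and function names are illustrative assumptions.

```python
# (n = residues per turn, d = axial rise per residue in Å, P = pitch in Å),
# values as listed in Table 4.2. A negative n denotes a left-handed helix.
SEGMENTS = {
    "2_7 ribbon": (2.0, 2.8, 5.6),
    "3_10 helix": (3.0, 2.0, 6.0),
    "alpha helix": (3.6, 1.5, 5.4),
    "pi helix": (4.4, 0.8, 3.5),
    "beta sheet": (2.0, 3.47, 6.95),
    "collagen helix": (-3.0, 3.12, 9.36),
}


def pitch(n, d):
    """Pitch (Å) from residues per turn and rise per residue."""
    return abs(n) * d


def segment_length(residues, d):
    """Approximate axial length (Å) of an ordered segment of the given size."""
    return residues * d


for name, (n, d, tabulated) in SEGMENTS.items():
    print(f"{name:15s} n*d = {pitch(n, d):5.2f} Å   tabulated P = {tabulated:5.2f} Å")

# A hypothetical 10-residue alpha-helical segment
print(segment_length(10, SEGMENTS["alpha helix"][1]), "Å")
```

The small differences between n × d and the tabulated pitch for the π-helix and the β-sheet are rounding in the table.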


Thick lines indicate the Polypeptide Backbone. Cα-Carbon atoms are numbered (Substituents at the Cα-atoms are deleted for clarity)

Fig. 4.7(a) Features of Right-handed (3.6₁₃) α-Helix with Intra-molecular Hydrogen Bonds along the Helix Axis

Fig. 4.7(b) Example of a Protein (Cytochrome C′) Structure with Predominance of α-Helices

(Ref: Dobs, A.J., et al. (1996), Acta Cryst D, Biol Crystallogr; 52; 356)
(Source: Protein Data Bank: 1CGO.pdb)


4.2.2.2 β-Sheet

The β-sheet is a pleated sheet structure that results from inter-strand (intermolecular) hydrogen bonds, perpendicular to the strand axis, between the NH and C=O groups of neighbouring strands (Fig. 4.8(a)). The polypeptide chain is extended (Table 4.2). β-Sheets occur with either parallel (↑↑) or antiparallel (↑↓) strands. Many proteins have a predominance of β-sheet structure (Fig. 4.8(b)).

Thick lines indicate the Polypeptide Backbone. (Substituents at the Cα-Carbon atoms are deleted for clarity)

Fig. 4.8 (a) Features of β-Sheet Pleated Structure, with extended Conformation (Inter-molecular Hydrogen Bonds are transverse to the Strand Axis)


Fig. 4.8 (b) Example of a Protein (Retinol-binding Protein, RBP) Structure with Predominance of β-Sheets

(Ref: Zanotti, G., et al. (2001), Biochim Biophys Acta; 64; 1550)
(Source: Protein Data Bank: 1IIU.pdb)

4.2.2.3 Turns and Loops

Helices and sheets are ordered segments because their residues have repeating backbone torsion angles, φ and ψ, and their hydrogen bonding patterns are periodic; turns and loops do not exhibit such regular secondary structural features. Turns are those regions in a protein where the polypeptide backbone folds back and changes the overall direction of the polypeptide chain by nearly 180°. There are protein structures with a predominance of turns and loops (Fig. 4.9).

4.2.2.4 Conformation Analysis

The conformational flexibility of a polypeptide is restricted to certain regions because of steric hindrance between moieties. The sterically allowed and disallowed conformations of the polypeptide backbone can be determined from Ramachandran conformation plots (Ramachandran φ, ψ maps). Ramachandran analysis demonstrates that the conformations of non-glycine polypeptides are severely restricted to certain φ-ψ regions only (Fig. 4.10).
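Each point on a Ramachandran map is a pair of torsion (dihedral) angles computed from four consecutive backbone atoms: φ from C(i–1), N(i), Cα(i), C(i), and ψ from N(i), Cα(i), C(i), N(i+1). The sketch below computes a signed torsion angle from four Cartesian coordinates. The helper names and the test coordinates are hypothetical, chosen only so that the cis (0°), trans (180°) and +90° cases are easy to verify by hand.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed torsion angle (degrees) about the p1-p2 bond, in the range -180 to 180."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)   # normals of the two planes
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Hypothetical four-atom arrangements around a central bond along z
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)))    # cis
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1)))   # trans
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))    # quarter turn
```

Applied along a real backbone, the same function would yield one (φ, ψ) pair per residue, which is what a Ramachandran map plots.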

Ramachandran maps do not provide information on three-dimensional protein folding, that is, on amino acid residues that are close neighbours in space although distant in sequence. Folding information can be obtained from “distance” plots or “diagonal” plots. A distance plot gives the distance from each amino acid residue to every other residue of a polypeptide. Inspection of distance plots helps in discerning tertiary structural features of proteins and can provide a correlation of the structural units of a protein with its DNA exonic regions, even in protein structures that do not show an apparent domain structure.
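A distance plot is simply the matrix of pairwise distances between residues, usually taken between Cα atoms. The sketch below builds such a matrix and lists residue pairs that are close in space but far apart in sequence. The coordinates are hypothetical (a chain folded back on itself), and the 8 Å cutoff and minimum sequence separation of four residues are illustrative assumptions rather than values from the text.

```python
import math


def distance_matrix(ca_coords):
    """Pairwise Cα-Cα distances (Å); entry [i][j] is the distance between residues i and j."""
    return [[math.dist(a, b) for b in ca_coords] for a in ca_coords]


def long_range_contacts(matrix, cutoff=8.0, min_separation=4):
    """Residue pairs within the cutoff distance although well separated in sequence."""
    n = len(matrix)
    return [(i, j)
            for i in range(n)
            for j in range(i + min_separation, n)
            if matrix[i][j] <= cutoff]


# Hypothetical Cα coordinates (Å): four residues out, four residues back
ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0),
      (11.4, 4.8, 0.0), (7.6, 4.8, 0.0), (3.8, 4.8, 0.0), (0.0, 4.8, 0.0)]

matrix = distance_matrix(ca)
for row in matrix:
    print(" ".join(f"{d:5.1f}" for d in row))
print(long_range_contacts(matrix))
```

Contacts far from the diagonal of the matrix, such as the first and last residues here, are the signature of chain segments that are brought together by folding.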


Fig. 4.9 Example of a Protein (Rubredoxin Mutant G10A) Structure with Predominance of Turns and Loops

(Ref: Maher, M.J., et al. (1999) Acta Cryst D, Biol Crystallogr., 55; 962)
(Source: Protein Data Bank: 1B13.pdb)

Fig. 4.10 Ramachandran (φ, ψ) Map


Fig. 4.11 Ribbon Diagram of the Tertiary Structure of Protein Phosphatase

(Ref: Barford, D., Flint, A. and Tonks, N)(http://biop.ox.ac.uk/www/mol_of_life/pdb/phos.html)

4.2.3 The Tertiary Structure (3°-Structure)

The three-dimensional structure (spatial folding) of a protein is referred to as the tertiary structure (Fig. 4.11). The tertiary structure is unique and specific to each protein. The structural and functional features of proteins, such as the binding sites of ligands (protein-drug interactions), the active sites of enzymes, or the binding sites for other proteins (protein-protein associations), depend on their tertiary structures. Knowledge of the spatial folding of a protein is therefore a prerequisite for understanding its structural and biochemical functions. This knowledge can also aid our understanding of how particular mutations or variations in the gene that encodes a protein lead to changes in the behavior of that protein, which can result in disease or in differences in drug interactions among individuals.

4.2.4 The Quaternary Structure (4°-Structure)

The structural organization of single-polypeptide chain (monomeric) proteins, such as myoglobin, trypsin and insulin, is complete at the tertiary level. However, in proteins containing two or more polypeptide chains (oligomeric proteins), such as hemoglobin (Fig. 4.12), cytochrome oxidase and ATPase, there exists a quaternary level of organization. The quaternary structure deals with the specific arrangement of the subunits with respect to one another in the protein complex. The same non-bonded interactions that come into play in the stabilization of tertiary structure folding (e.g. hydrogen bonding and van der Waals interactions) are also responsible for quaternary interactions. Hydrophobic interactions (non-directional and entropy-driven) play a major role in the higher-order structural organization and stability of quaternary structures in macromolecular complexes.

Fig. 4.12 Ribbon Diagram of Hemoglobin, a tetrameric (α2β2-subunit) Protein

(Ref: Tame, J. & Vallone, B)
(Source: Protein Data Bank: 1A3N.pdb)

4.3 FORCES STABILIZING THE MOLECULAR STRUCTURE

The structure is stabilized by various non-bonded molecular interactions, such as (i) electrostatic, (ii) van der Waals, (iii) hydrogen bonding and (iv) hydrophobic interactions (Table 4.3).


(Uij = Madelung energy; N = number of molecules or 2N ions; α = Madelung constant; q = charge; rij = distance between ions i and j; z = number of nearest neighbors of an ion; λ and ρ = empirical parameters).

Electrostatic interactions (salt bridges) occur between the oppositely charged R-groups of residues such as lysine, arginine, aspartic acid and glutamic acid. The majority of the amino acids found on the exterior surfaces of globular proteins (at loops and turns) contain charged or polar R-groups. Salt bridges play a structural role in allosteric cooperativity in multi-subunit proteins (e.g. hemoglobin, aspartate transcarbamylase).

4.3.2 Van der Waals Interactions

Van der Waals forces are weak non-bonded interactions that act at contact distances (> 3.0 Å) between atoms, molecules and moieties. These forces, both attractive and repulsive, arise from dipole-dipole and dipole-induced dipole interactions. They play a significant role in stabilizing the correct folding of proteins, hydration and solvent structure.

4.3.2.1 Dipole/Dipole Interactions

Atoms and molecules with zero net charge (q = 0) but with charge separation (the centers of positive and negative charge do not coincide) are permanent dipoles (polar entities), and they can generate electric fields. The dipole moment, μ, is

$$\vec{\mu} = q\,\vec{d} \tag{4.2}$$

where d is the vector separating the centers of negative and positive charge. Since the energy of a charged species is affected by an electric field, the energy of an ion or dipole will be affected by the presence of neighboring ions or dipoles. The interaction energy between permanent dipoles (polar molecules, μ ≠ 0) depends on the spatial orientation of the dipoles (Fig. 4.14).

Fig. 4.14 Non-bonded Interactions between Permanent Dipoles (μ ≠ 0)


$$U = -\frac{2\mu_1\mu_2}{D r^{3}} \quad \text{for prolate spheroids} \tag{4.3}$$

$$U = -\frac{\mu_1\mu_2}{D r^{3}} \quad \text{for oblate spheroids}$$

The average energy of interaction for randomly distributed dipoles is given by the Keesom equation:

$$U = -\frac{2}{3}\,\frac{\mu_1^{2}\mu_2^{2}}{D r^{6} kT} \quad \text{Keesom equation} \tag{4.4}$$

(D = dielectric constant; r = distance between two dipoles; k = Boltzmann constant; T = absolute temperature).

4.3.2.2 Dipole/Induced-Dipole Interactions

Ions or dipoles can induce dipole character in a neutral (μ = 0), polarizable molecule (Fig. 4.15).

$$\mu_{ind} = \alpha E \tag{4.5}$$

The energy of interaction is

$$U = -\frac{2\alpha\mu^{2}}{D r^{6}} \tag{4.6}$$

Fig. 4.15 Schematic of Induced-dipole Interactions

4.3.2.3 Induced Dipole/Induced-Dipole Interactions

The cohesive forces between neutral and non-polar species, called dispersive forces or London interactions, are due to the asymmetric temporal charge distribution around atoms, which creates an induction in neighboring atoms. Dispersive forces are always attractive. For the induced-dipole/induced-dipole interaction, the quantum mechanical treatment gives

$$U = -\frac{3}{4}\,\frac{I\alpha^{2}}{r^{6}} \quad \text{London equation} \tag{4.7}$$

(α = polarizability tensor; E = electric field; I = first ionization potential)

Total fusion of atoms is prevented by the repulsive forces that arise when the nuclei and electron clouds of different atoms overlap at very close range. The total energy of interaction (attractive and repulsive) is represented by the Lennard-Jones potential:

$$U = \underbrace{-\frac{A}{r^{6}}}_{\text{Attraction}} + \underbrace{\frac{B}{r^{12}}}_{\text{Repulsion}} \quad \text{Lennard-Jones potential} \tag{4.8}$$

(A and B are empirical constants).
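The balance between attraction and repulsion can be explored numerically. The sketch below assumes the common 6-12 form of the Lennard-Jones potential, U(r) = –A/r⁶ + B/r¹² (an assumption about the exact form intended here), for which setting dU/dr = 0 gives an equilibrium separation r_min = (2B/A)^(1/6) and a well depth of A²/(4B). The constants A = B = 1 are arbitrary illustrative values in reduced units.

```python
def lennard_jones(r, a, b):
    """6-12 potential: attractive -A/r^6 term plus repulsive +B/r^12 term."""
    return -a / r**6 + b / r**12


def equilibrium_distance(a, b):
    """Separation at which dU/dr = 0 (minimum of the potential)."""
    return (2.0 * b / a) ** (1.0 / 6.0)


def well_depth(a, b):
    """Depth of the potential minimum, -U(r_min) = A^2 / (4B)."""
    return a * a / (4.0 * b)


A, B = 1.0, 1.0   # arbitrary constants in reduced units (illustrative only)
r_min = equilibrium_distance(A, B)
print("r_min =", round(r_min, 4))
print("U(r_min) =", lennard_jones(r_min, A, B))
for r in (0.9, 1.0, r_min, 1.5, 2.5):
    print(f"r = {r:.3f}  U = {lennard_jones(r, A, B):+.4f}")
```

At short range the r⁻¹² term dominates and the energy is positive (repulsion); beyond r_min the r⁻⁶ term dominates and the energy rises slowly back towards zero.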

4.3.3 Hydrogen Bonding

The hydrogen bond is a special case of a permanent dipole attractive (polar) interaction. A hydrogen bond is formed whenever a polar donor group containing a hydrogen atom (e.g. O–H, N–H) interacts (at a distance of 2.5–3.0 Å) with an electronegative acceptor atom such as O, N, Cl or F. The hydrogen atom has only one electron. When that electron is used to form a covalent bond with an electronegative atom (e.g. O, N, Cl), the electron cloud is pulled towards the electronegative atom, and the nucleus is partially unshielded. Consequently, the proton can interact directly with another electronegative atom nearby. The hydrogen bond is linear and directional.

Though the hydrogen bond is a weak non-bonded interaction (~20 kJ/mol), it plays a crucial role in determining the physicochemical properties, structural stability and function of many compounds. For example, the unique properties of water are due to hydrogen bond networks. Ordered segments in proteins, and protein and other macromolecular complexes, are stabilized by intra- and intermolecular hydrogen bonds. Storage of genetic information in nucleic acids relies on base-specific hydrogen bond patterns.

In polypeptides, hydrogen bonds are formed between the NH and C=O groups of the polypeptide backbone. Eleven amino acids (out of 20) can form hydrogen bonds through their side chains.

(i) H-bond donors only: Arg (guanidinium group) and Trp (indole group).

(ii) H-bond donors and acceptors: the side chains of Asn, Gln, Ser and Thr can serve both as H-bond donors and as acceptors.

(iii) pH-dependent H-bonding: the hydrogen bonding potential of the side chains of Asp, Glu, His, Lys and Tyr is pH-dependent. These groups can serve as both donors and acceptors of H-bonds over a certain pH range, and as either acceptors or donors (but not both) at other pH values.

4.3.4 Hydrophobic Interactions

Hydrophobic (nonpolar) interactions are weak and non-directional. The formation of nonpolar associations is one of the most important factors in macromolecular folding, stacking, and higher-order (tertiary and quaternary) structure assembly. The driving force for the formation of a hydrophobic environment is entropy (positive entropy change).

EXERCISE MODULES

1. Explain (pictorially) the difference between an L- and a D-amino acid (Hint: Apply the Fischer convention).
2. Build models of all 20 L-amino acids.
3. Identify the amino acids according to size and shape (small, medium and large; linear or branched).
4. Identify the amino acids according to charge (hydrophilic: acidic, basic and polar neutral; hydrophobic: aliphatic and aromatic).
5. What is the primary structure of proteins?
6. There is no free rotation about the peptide bond. Why?
7. Build a decapeptide from the amino acid library (exclude Cys and Pro).
8. What is the secondary structure of proteins?
9. What are the structural parameters of an α-helix? What are the features of its H-bonding?
10. What are the structural parameters of a β-sheet? What are the features of H-bonding in parallel and anti-parallel β-sheets?
11. What is a reverse turn and what is its importance in globular proteins?
12. Which are the preferred residues in turns?
13. What is conformation analysis? Comment on Ramachandran conformational plots.
14. What is the tertiary structure of proteins?
15. What is the quaternary structure of proteins?
16. Which are the non-bonded interactions stabilizing macromolecular structures?
17. What are salt bridges?
18. Comment on van der Waals interactions.
19. What is hydrogen bonding?
20. What is the importance of hydrogen bonding in biology?
21. What are hydrophobic interactions and how are they important?

BIBLIOGRAPHY

1. Baum, S.J. & Scaife, C.W. (1987), Macmillan: New York. “Chemistry: A Life Science Approach”, 3rd Edn.
2. Berman, H.M., et al. (2000), Nucleic Acids Res., 28; 235. “Protein Data Bank”.
3. Branden, C-I. & Tooze, J. (1999), Garland Pubs: Philadelphia. “Introduction to Protein Structure”, 2nd Edn.
4. Chothia, C. (1984), Annu Rev Biochem., 53; 537. “Principles that determine the structure of proteins”.
5. Creighton, T.E. (1993), Freeman Press: New York. “Proteins: Structures and Molecular Properties”, 2nd Edn.
6. Dickerson, R.E. & Geis, I. (1983), Benjamin-Cummings: Menlo Park, CA. “Hemoglobin: Structure, Function, Evolution and Pathology”.
7. Doolittle, R.F. (1985), Sci Amer., 253(4); 88. “Proteins”.
8. Klotz, I.M., et al. (1970), Annu Rev Biochem., 39; 25. “Quaternary structure of proteins”.
9. Kyte, J. (1994), Garland Pubs: New York. “Structure in Protein Chemistry”.
10. Narayanan, P. (2003), New Age Intl Pubs: New Delhi. “Essentials of Biophysics” (2nd Print).
11. Sackheim, G. (1991), Addison-Wesley: New York. “Introduction to Chemistry for Biology Students”, 4th Edn.
12. Stryer, L. (1995), W.H. Freeman: New York. “Biochemistry”, 4th Edn.
13. Voet, D. & Voet, J.G. (1990), John Wiley: New York. “Biochemistry”.
14. Walker, J.M. & Gaastra, W. (Eds) (1983), Croom Helm: London. “Techniques in Molecular Biology”.
15. Weinberg, R.A., et al. (1985), Sci Amer., 253(4); 48. “The molecules of Life”.


Section II

Experimental Methods of Structure Elucidation

(Bioinformatics-I)


5 Physicochemical Characterization of Biomolecules

Analyses of the structure-function aspects of biomolecules are carried out at various levels of organization (primary, secondary, tertiary and quaternary structures) and by various methods: biochemical (physicochemical), molecular biology and biophysical.

(i) Biochemical (physicochemical) methods comprise isolation, purification, identification and physicochemical characterization of hydrodynamic parameters (molecular mass, shape and size) and reaction mechanisms. Sequence (primary structure) analysis, which includes molecular biology methods, is also addressed under this category.

(ii) Biophysical methods encompass physicochemical characterization and structure determination by X-ray crystallography and NMR spectroscopy, augmented by other spectroscopic methods.

For any experimental structure elucidation of a biological macromolecule (say, a protein), the molecule should be available in pure (homogeneous) form and in sufficient quantities for biophysical characterization. Therefore, protein purification is the very first step to be undertaken. Protein purification is carried out by (i) physicochemical methods (chromatography, electrophoresis, etc.) and/or (ii) molecular genetics methods (gene selection and amplification, etc.). The next step is to determine the primary structure (amino acid sequence) of the protein. This can be achieved by amino acid sequencing and/or by gene sequencing methods. The final step, and the most challenging experimental task, is to determine the spatial (three-dimensional) structure of the protein. The only physical techniques available to date for determining three-dimensional structures at atomic and molecular levels are single-crystal X-ray diffraction and multi-dimensional NMR spectroscopy.

A general protocol for the structure analysis of biological macromolecules is given in the flowchart (Figure 5.1).

Biochemical characterization of biomolecules includes purification, identification and determination of molecular mass, shape and size by various physicochemical methods. These include (1) hydrodynamic, (2) chromatographic and (3) electrophoretic methods. Blotting techniques, which come under molecular biology, can also be treated under the purview of biochemical characterization.


The diffusion coefficient (D) can be determined from the optical Doppler effect of the laser beam scattered by the moving particles. Schlieren optics can be employed in a centrifuge to measure the concentration gradient (∂C/∂x) across the boundary, as the solute behaves like a prism. The refractive index gradient (concentration gradient) can also be measured by the interference method.

5.1.2 Rotational Diffusion

Non-spherical particles undergo tumbling motion (rotational diffusion), and the shape of macromolecules can be determined by the rotational diffusion method.

$$\Omega = \frac{RT}{N_A\zeta} \tag{5.4}$$

$$\zeta = 8\pi\eta r^{3} \quad \text{Stokes equation} \tag{5.5}$$

$$\tau = \frac{4\pi\eta N_A\, a b^{2}}{RT} \quad \text{for ellipsoids} \tag{5.6}$$

(Ω = rotational diffusion coefficient; R = gas constant; T = absolute temperature; NA = Avogadro number; ζ = rotational frictional coefficient (frictional torque per unit angular velocity); r = radius of a spherical particle; τ = relaxation time; η = viscosity coefficient; a and b = major and minor axes of the ellipsoid).
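The three relations above can be combined in a short numerical sketch. It assumes the Stokes form ζ = 8πηr³ for a sphere of radius r and the ellipsoid relation τ = 4πηN_A·a·b²/(RT); for a sphere (a = b = r) the latter reduces to τ = 1/(2Ω). The particle radius, viscosity and temperature are illustrative assumptions (a 2 nm sphere in water at about 20 °C), not values from the text.

```python
import math

R = 8.314462618       # gas constant, J mol^-1 K^-1
N_A = 6.02214076e23   # Avogadro number, mol^-1


def rotational_friction_sphere(eta, r):
    """Stokes rotational frictional coefficient of a sphere (eq. 5.5), SI units."""
    return 8.0 * math.pi * eta * r**3


def rotational_diffusion(zeta, temperature):
    """Rotational diffusion coefficient (eq. 5.4), in s^-1."""
    return R * temperature / (N_A * zeta)


def relaxation_time_ellipsoid(eta, a, b, temperature):
    """Rotational relaxation time of an ellipsoid with axes a and b (eq. 5.6), in s."""
    return 4.0 * math.pi * eta * N_A * a * b**2 / (R * temperature)


# Illustrative assumptions: 2 nm sphere, water viscosity ~1.0e-3 Pa s, 293.15 K
eta, radius, temperature = 1.0e-3, 2.0e-9, 293.15
zeta = rotational_friction_sphere(eta, radius)
omega = rotational_diffusion(zeta, temperature)
tau = relaxation_time_ellipsoid(eta, radius, radius, temperature)
print(f"zeta  = {zeta:.3e} N m s")
print(f"Omega = {omega:.3e} s^-1")
print(f"tau   = {tau:.3e} s")
```

Because τ scales with a·b², elongating a particle at constant minor axis lengthens its relaxation time in proportion, which is why relaxation measurements carry shape information.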

The rotational diffusion coefficient (Ω) can be determined by the flow birefringence method, which measures the orientation of elongated molecules in a velocity gradient produced by a mechanical shearing force. This method is applicable to the analysis of asymmetric rod-like molecules (e.g. DNA, collagen). The fluorescence depolarization method can also be used to determine the relaxation time, from which the size and shape of macromolecules can be estimated.

5.1.3 Light Scattering

Treating molecules as point dipoles, the dipole moment, μ, induced by an electric field, E, on dipoles of polarizability, α, is

$$\mu = \alpha E \tag{5.7}$$

An oscillating dipole emits radiation, and the ratio of the scattered (Iθ) to the incident (I0) intensity is given by the Rayleigh equation:

$$\frac{I_\theta}{I_0} = \frac{16\pi^{4}\alpha^{2} N \sin^{2}\theta}{\lambda^{4} r^{2}} \quad \text{Rayleigh equation} \tag{5.8}$$

The Rayleigh ratio, Rθ, is

$$R_\theta = \frac{8\pi^{4}\alpha^{2} N}{\lambda^{4}} \tag{5.9}$$

(N = number of particles per unit volume; θ = angle of scattering; λ = wavelength of radiation; r = specimen to detector distance).

The polarizability (α) is related to the refractive index, n, and concentration, C, of theparticles.


$$n^2 - 1 = 4\pi\alpha N \tag{5.10}$$

Therefore,

$$R_\theta = \frac{2\pi^2 n_0^2}{\lambda^4 N_A}\left(\frac{dn}{dC}\right)^2 C\,M \tag{5.11}$$

$$\frac{\Lambda C}{R_\theta} = \frac{1}{M} \quad \text{for ideal solutions} \tag{5.12}$$

$$\frac{\Lambda C}{R_\theta} = \frac{1}{M} + \frac{2BC}{M^2} \quad \text{for real solutions} \tag{5.13}$$

where

$$\Lambda = \frac{2\pi^2 n_0^2}{\lambda^4 N_A}\left(\frac{dn}{dC}\right)^2$$

(M = molecular mass; n_0 = refractive index of solvent; B = second virial coefficient).

Scattering by the solvent and by the solution (solvent + solute) can be measured experimentally and the difference taken. (dn/dC) is determined with a differential refractometer. The quantity (ΛC/R_θ) is plotted as a function of concentration (C). The molecular mass is obtained from the intercept of the plot, which equals 1/M, and the slope as C → 0 gives 2B/M², from which the second virial coefficient (B) can be determined.
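The plot described above can be sketched numerically as a straight-line fit of ΛC/R_θ against C, taking the intercept as 1/M and the slope as 2B/M² (Eq. 5.13). The function names and the synthetic data below are illustrative assumptions; a real analysis would use measured values and an appropriate unit system.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mass_and_virial(concentrations, lambda_c_over_r):
    """Intercept = 1/M and slope = 2B/M^2, following Eq. 5.13."""
    slope, intercept = linear_fit(concentrations, lambda_c_over_r)
    M = 1.0 / intercept
    B = slope * M ** 2 / 2.0
    return M, B

# Synthetic data generated from M = 5e4 and B = 2.5e5 (made-up values)
concs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0 / 5.0e4 + (2.0 * 2.5e5 / 5.0e4 ** 2) * c for c in concs]
print(mass_and_virial(concs, ys))
```

Fitting only the low-concentration points respects the requirement that the slope be taken as C → 0.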

The spectral width of the scattered radiation (due to Doppler broadening) can also be used to obtain information on translational and rotational diffusion coefficients, from which the radius of gyration and molecular shape can be determined.

5.1.4 Sedimentation

The sedimentation rate of a particle (in a centrifuge) through a solution is related to the net force acting on the particle. The centrifugal force, F_c, acting on a particle is opposed by the frictional force, F′, and at equilibrium F_c = F′.

$$(1 - V\rho)\,m\,(\omega^2 r) = f\,s\,(\omega^2 r) \tag{5.14}$$

$$N_A f = \frac{RT}{D} \quad \text{(Einstein equation)} \tag{5.1}$$

$$\therefore\; M = \frac{RTs}{D\,(1 - V\rho)} \tag{5.15}$$

(V = specific volume of the particle; ρ = density of the solution; m = mass of the particle; f = frictional coefficient; ω = angular velocity; r = distance of the particle from the revolving axis in a centrifuge; N_A = Avogadro number; M = molecular mass; s = sedimentation coefficient; 10⁻¹³ s is called a Svedberg).

The specific volume (V) of the particle can be determined from the variation of the density of the solution with solute concentration. The sedimentation coefficient (s) can be measured in a centrifuge by the optical Doppler effect of laser light scattered by the moving particle. The molecular mass (M) can then be calculated if the diffusion coefficient (D) of the particle is known.
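A minimal sketch of this calculation, using M = RTs/[D(1 − Vρ)] from Eq. 5.15 in SI units, is given below. The input values are illustrative assumptions for a mid-sized protein in water, not data from the text.

```python
R = 8.314            # gas constant, J mol^-1 K^-1
SVEDBERG = 1.0e-13   # one Svedberg unit, in seconds

def molecular_mass_svedberg(s, D, v_bar, rho, T):
    """Eq. 5.15: M = R*T*s / (D * (1 - v_bar*rho)).

    s in s, D in m^2/s, v_bar in m^3/kg, rho in kg/m^3, T in K.
    Returns M in kg/mol (multiply by 1000 for g/mol).
    """
    buoyancy = 1.0 - v_bar * rho
    if buoyancy <= 0:
        raise ValueError("particle does not sediment: 1 - V*rho must be positive")
    return R * T * s / (D * buoyancy)

# Illustrative (assumed) values: s = 4.3 S, D = 6.0e-11 m^2/s,
# V = 0.73e-3 m^3/kg, rho = 1000 kg/m^3, T = 293 K
M = molecular_mass_svedberg(4.3 * SVEDBERG, 6.0e-11, 0.73e-3, 1000.0, 293.0)
print(round(M * 1000), "g/mol")
```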

Molecular mass can be determined without knowledge of the diffusion coefficient by balancing diffusion against sedimentation. This is realized in the sedimentation equilibrium method: at equilibrium, the rate of diffusion is equal to the rate of sedimentation.


$$C\,\frac{dr}{dt} = D\,\frac{dC}{dr} \tag{5.16}$$

Considering concentrations C_1 and C_2 at positions r_1 and r_2 from the rotor axis of the centrifuge,

$$M = \frac{2RT\,\ln(C_2/C_1)}{(1 - V\rho)\,\omega^2\,(r_2^2 - r_1^2)} \tag{5.17}$$
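The sedimentation equilibrium relation can be evaluated directly. The sketch below assumes Eq. 5.17 in the form M = 2RT ln(C2/C1) / [(1 − Vρ) ω² (r2² − r1²)] with SI units; the function name and the sample numbers are hypothetical.

```python
import math

R = 8.314  # gas constant, J mol^-1 K^-1

def molecular_mass_equilibrium(c1, c2, r1, r2, omega, v_bar, rho, T):
    """Eq. 5.17 (assumed form); concentrations need only share the same unit.

    r1, r2 in m, omega in rad/s, v_bar in m^3/kg, rho in kg/m^3, T in K.
    Returns M in kg/mol.
    """
    numerator = 2.0 * R * T * math.log(c2 / c1)
    denominator = (1.0 - v_bar * rho) * omega ** 2 * (r2 ** 2 - r1 ** 2)
    return numerator / denominator

# Hypothetical reading: concentration ratio C2/C1 = 3 between 6.0 cm and 7.0 cm
# from the rotor axis at omega = 1000 rad/s
print(molecular_mass_equilibrium(1.0, 3.0, 0.06, 0.07, 1000.0, 0.73e-3, 1000.0, 293.0))
```

Only the ratio C2/C1 enters, so any concentration-proportional signal (for example absorbance) could be used in place of absolute concentrations.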

A linear gradient of sucrose (~5–20%) or CsCl is employed in analytical centrifugation methods to determine the sedimentation of macromolecules. Macromolecules are layered atop the gradient in a centrifuge tube and then subjected to centrifugal fields in excess of 10⁵ g. The sizes of unknown macromolecules can then be determined by comparing their migration distances in the gradient with those of known substances.

5.2 CHROMATOGRAPHIC METHODS

Chromatography is a physical technique used to separate mixtures of substances based on differences in the relative affinities of the substances for mobile and stationary phases. A mobile phase (liquid or gas) passes through a column containing a stationary phase of porous solid, or of liquid coated on a solid support.

Liquid chromatographic (LC) methods, where the mobile phase is a liquid, are extensively used in the physicochemical characterization of compounds that are thermo-labile (e.g. biomolecules) because of their simplicity of operation and versatility. High Performance Liquid Chromatographic (HPLC) methods are aimed at increasing the efficiency of liquid chromatography to a high level of sensitivity (10⁻¹² g/L). The advent of HPLC has resulted in tremendous progress in the separation of a wide variety of inorganic and organic compounds. Reversed-phase HPLC (based on the hydrophobicity of compounds) has become the workhorse for purification and characterization of molecules of biological importance.

Denaturing high-performance liquid chromatography (dHPLC) is used in single nucleotide polymorphism (SNP) detection. The method discriminates between perfectly matched and mismatched hybrids by measuring melting temperatures: mismatching results in a lowered melting temperature. It involves passing the hybridization mixture through a chromatographic column several times at different temperatures and deriving the melting temperature from changes in the chromatogram.

Chromatographic separation methods rely on differences in physicochemical properties such as adsorption, solubility, size, affinity, and ionic mobility between solutes and solvents.

5.2.1 Thin Layer Chromatography (TLC)

In thin layer chromatography (TLC), the stationary phase is a thin coat of cellulose, alumina or silica on a plate (sorbent plate). Spots of known and unknown samples are applied along a line at the edge of the sorbent plate, and the plate is kept in a closed chamber that contains the mobile phase (solvent) in such a way that the line of spots is just above the solvent level. As the mobile phase moves up the plate, the samples also migrate with it, the migration distances depending on their relative affinities for the adsorbent and the solvent. The plate is removed and dried just before the solvent front reaches the top edge of the plate. Spots are identified by photometric or other suitable methods. The relative migration, R_f, is characteristic of the analyte.

$$R_f = \frac{\text{Distance of migration of the analyte from the origin}}{\text{Distance of migration of the solvent from the origin}} \tag{5.18}$$
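Eq. 5.18 is simple enough to compute by hand, but a short sketch makes the comparison of unknown and known spots explicit. The helper names and the standards table below are hypothetical.

```python
def retention_factor(analyte_distance, solvent_distance):
    """Eq. 5.18: R_f = analyte migration / solvent-front migration (same units)."""
    if solvent_distance <= 0:
        raise ValueError("solvent front must have moved from the origin")
    if not 0 <= analyte_distance <= solvent_distance:
        raise ValueError("analyte cannot travel further than the solvent front")
    return analyte_distance / solvent_distance

def closest_standard(rf_unknown, standards):
    """Return the name of the standard whose R_f is nearest to the unknown."""
    return min(standards, key=lambda name: abs(standards[name] - rf_unknown))

# Hypothetical plate: solvent front at 6.0 cm, unknown spot at 3.0 cm
rf = retention_factor(3.0, 6.0)
print(rf, closest_standard(rf, {"standard A": 0.20, "standard B": 0.55}))
```

A nearest-R_f match is only suggestive; R_f values are comparable only for spots run on the same plate under the same conditions.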

5.2.2 Size-exclusion (Molecular-sieve) Chromatography

Size-exclusion chromatography separates molecules on the basis of molecular size and is particularly applicable to macromolecules of high molecular mass. The column packing generally consists of glucose polymers in bead form with the required pore size. Molecules larger than the pores cannot enter them, so they are excluded and eluted first. Smaller molecules that can enter the pores of the beads are retarded in their rate of travel and are eluted last. Molecules of intermediate size are eluted in between, in order of decreasing molecular size. Beads of different pore sizes can be used, depending upon the desired protein size separation profile.

5.2.3 Ion-exchange Chromatography

Proteins (and nucleic acids) can be separated on the basis of their net charge by the ion-exchange method. The sorbent column consists of a cellulose polymer that is either negatively charged (cation exchanger) or positively charged (anion exchanger). Separation is achieved in two steps: (i) adsorption (binding) of analytes of opposite charge to the sorbent as a solution of analytes is passed through the column, and (ii) elution of the analytes, one at a time and separated from each other, by an ion gradient. Separations are generally performed under conditions in which one of the ionic species predominates in both phases.

5.2.4 Affinity Chromatography

Proteins have high affinity for their substrates, cofactors or receptors, or for antibodies raised against them. This specific characteristic is exploited in protein purification by affinity chromatography. The sorbent column consists of beads with a specific affinity chemical group (X) bound to them. When a solution consisting of a mixture of proteins is passed through the column, the protein species with affinity for X is bound to the column. The bound proteins are then eluted from the column by passing a buffer solution containing a high concentration of the chemical group (X).

5.2.5 Supercritical Fluid Chromatography (SFC)

Supercritical fluid chromatography (SFC) is a separation method that combines the flexibility and versatility of other conventional extraction methods. It uses a supercritical fluid (high-purity CO2) under very high pressure as the mobile phase. The SFC technique exploits the liquid-like solvating property and gas-like mobility of supercritical fluids for better separation. SFC has been shown to offer increased resolution, shorter purification and sample dry-down times, and purification capabilities complementary to those of HPLC.

5.3 ELECTROPHORETIC METHODS

Proteins and nucleic acids can also be characterized according to size and charge by electrophoretic methods. The fundamental physical principle on which all electrophoretic techniques are based is the migration of charged particles towards the electrode of opposite polarity under the influence of an applied electric field. The migration of a charged particle in the field depends on its electrophoretic mobility, µ, the frictional drag in the medium and other physicochemical parameters. Electrophoresis is a prelude to the blotting techniques employed in molecular biology and immunobiology.

The force, F, acting on a particle of charge q in an applied electric field, E, is F = qE. It is opposed by the frictional drag, F′ = fv, and at equilibrium F = F′.

$$F = qE = F' = fv = (6\pi\eta r)\,v \quad \text{(Stokes equation)} \tag{5.19}$$

Electrophoretic mobility, µ, is defined as

$$\mu = \frac{v}{E} = \frac{q}{6\pi\eta r} \tag{5.20}$$

(f = frictional coefficient; η = viscosity coefficient; v = terminal velocity; r = radius of the particle).

If µ is positive (pH < the isoelectric point), the particle moves towards the cathode. If µ is negative (pH > the isoelectric point), the particle moves towards the anode.
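The mobility relation µ = q/(6πηr) can be evaluated as follows. This is a sketch for an idealized spherical particle in SI units; expressing the charge as an integer number z of elementary charges, and the function names, are assumptions made for illustration.

```python
import math

E_CHARGE = 1.602176634e-19  # elementary charge, C

def mobility(z, eta, r):
    """Eq. 5.20: mu = q / (6*pi*eta*r), with q = z * e; result in m^2 V^-1 s^-1."""
    return z * E_CHARGE / (6.0 * math.pi * eta * r)

def terminal_velocity(z, eta, r, field):
    """v = mu * E for a field strength given in V/m; result in m/s."""
    return mobility(z, eta, r) * field

def migration_target(mu):
    """Positive mobility moves towards the cathode, negative towards the anode."""
    if mu > 0:
        return "cathode"
    if mu < 0:
        return "anode"
    return "stationary"

# Illustrative values: net charge -5, water-like viscosity, 3 nm radius, 1000 V/m
mu = mobility(-5, 1.0e-3, 3.0e-9)
print(mu, terminal_velocity(-5, 1.0e-3, 3.0e-9, 1000.0), migration_target(mu))
```

In a gel the effective friction is larger than the free-solution Stokes value, so this estimate is an upper bound on the migration speed.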

The physicochemical parameters that influence the electrophoretic mobility of a particle in an electric field are:

(i) The ionizable groups present on the surface of the particle.
(ii) Shape and rigidity of the particle.
(iii) Pore size of the separation matrix.
(iv) Characteristics of the buffer medium (pH, concentration etc.).

Optimization of these parameters leads to greater flexibility and versatility of the electrophoresis methods.

Any electrophoretic setup consists of at least two components: an electrophoresis unit and a power pack. The choice of the separating medium depends on the purpose.

Paper: Cheap and easy to use, but resolution is poor.
Cellulose acetate: Minimal adsorption, with faster and clearer separation. Better than paper and routinely used in clinical analysis of serum and other body fluid samples.
Agarose gels: Invariably used in macromolecular separation (nucleic acids and proteins). The gel is transparent, so photoscanning is possible.
Polyacrylamide gels: Offer many advantages. Used in protein purification and characterization. Photoscanning is possible.

Electrophoresis can be carried out with a buffer solution or with a gel soaked in the buffer solution. Adverse effects of diffusion and convection are minimal in a solid support and, in the case of gels, pore size can be varied by varying the gel concentration, so that separation is dependent not only on the charge of the particle, but also on its size and shape (its physical features). Therefore, gel electrophoresis has become the standard method for separation and characterization of macromolecular constituents.

Both slab gels and capillary gels are in use to accomplish sizing. Slab gel matrices are generally cross-linked (4–6% polyacrylamide gel), whereas capillary gel matrices are non-cross-linked.


5.3.1 Horizontal Electrophoresis of Nucleic Acids

Nucleic acid constituents are large particles and, therefore, gels of larger pore size are required for their separation. Agarose gels meet this requirement, but they do not have the structural strength for a vertical setup. Therefore, a horizontal setup with agarose gels is invariably used in the separation of nucleic acid constituents (Fig. 5.2).

Fig. 5.2 Horizontal Electrophoresis Setup (labelled parts: lid, gel, wick, electrodes (+ and –), buffer, cooling plate)

The net charge of a nucleic acid is independent of the pH of the medium. That is, the charge/mass ratio is almost the same for all nucleic acids. Therefore, the electrophoretic separation of nucleic acids is based solely on the difference in their molecular mass, M. The molecular mass (M) of a nucleic acid can be determined from its electrophoretic mobility by running standard nucleic acid markers of known M on the same gel, with proper pore size. The pore size of an agarose gel depends on the gel concentration. General guidelines are given in Table 5.1.

Table 5.1 Molecular Separations as a Function of Agarose Gel Concentration

Agarose gel concentration    Molecular separation size
0.3%                         DNA duplexes of 5–60 kb
0.4–0.5%                     Viral nucleic acids and plasmids
0.8%                         Larger restriction fragments (0.5–10 kb)
1–1.2%                       Smaller restriction fragments (< 5 kb)
~2%                          Smaller fragments (0.1–3 kb)
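The size ranges in Table 5.1 can be encoded as a small lookup, which is one way to pick a gel concentration for a fragment of known approximate size. The sketch below is illustrative: the row for viral nucleic acids and plasmids is omitted because the table gives no size range for it, and "< 5 kb" is encoded as 0 to 5 kb.

```python
# (gel concentration label, smallest fragment in kb, largest fragment in kb),
# transcribed from Table 5.1
AGAROSE_GUIDE = [
    ("0.3%", 5.0, 60.0),
    ("0.8%", 0.5, 10.0),
    ("1-1.2%", 0.0, 5.0),   # "< 5 kb" in the table, encoded here as 0 to 5 kb
    ("~2%", 0.1, 3.0),
]

def suitable_gels(fragment_kb):
    """Return the gel concentrations whose listed range covers the fragment size."""
    return [conc for conc, low, high in AGAROSE_GUIDE if low <= fragment_kb <= high]

print(suitable_gels(20.0))  # a 20 kb duplex
print(suitable_gels(2.0))   # a 2 kb restriction fragment
```

Where several ranges overlap, the choice depends on the other fragments that must be resolved on the same gel.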

Microchip electrophoresis is an electrophoretic method that has had considerable impact on DNA separations, in large part because it offers some clear advantages over slab gel electrophoresis in automation, speed, and quantitative capability.

5.3.2 Column Electrophoresis

Both slab gel and capillary gel (and free-solution) matrices can be employed to separate nucleic acids and proteins. In order to accomplish sizing, a sieving matrix is prepared and loaded either between two glass plates (slab gel) or into glass capillaries.


5.3.2.1 Slab Gel Electrophoresis

Electrophoretic separation of proteins is a function of the pore size and sieving effects of the gel matrix and the physicochemical properties of the buffer medium. Judicious optimization is necessary in all protein purification and characterization protocols.

The vertical slab gel setup, with polyacrylamide gel supports, has become standard and universal in protein electrophoresis procedures. A vertical slab gel unit has two reservoirs of buffer (each containing an electrode) separated by the gel. A thin slab of gel is formed between two glass plates that are clamped together but held apart by plastic spacers. The gel slab consists of two phases: the lower part with higher gel concentration (separation gel) and the top part with lower gel concentration (stacking gel). The stacking gel, with larger pore size, concentrates the sample so that it enters the separation gel as a concentrated band. A plastic comb, placed in the stacking gel (and removed after polymerization), provides loading wells for samples. The lower part of the separation gel is dipped in the electrophoresis tank that contains the buffer (Fig. 5.3).

Fig. 5.3 Schematic of Slab Gel Electrophoresis Apparatus

5.3.2.2 Polyacrylamide Gel Electrophoresis (PAGE)

Molecular separation by native PAGE is according to the net charge/mass ratio of the molecule and the sieving effects of the gel matrix. The gel is prepared by polymerizing acrylamide (CH2=CH.CO.NH2) with a small quantity of the cross-linker bisacrylamide (N,N′-methylenebisacrylamide, CH2(NH.CO.CH=CH2)2). Ammonium persulfate and tetramethylethylenediamine (TEMED) are added as initiator and catalyst of polymerization. Varying the concentration of the monomer and the cross-linker in the gel solution controls the pore size of the gel.

5.3.2.3 SDS-PAGE

Conventional electrophoresis (PAGE, called native PAGE) is employed for the separation of proteins and not for the characterization of molecular mass, shape and size. The electrophoretic mobility of a protein depends simultaneously on (i) its net charge, (ii) its size (molecular mass) and shape and (iii) its structural rigidity. These factors vary with the experimental conditions. In order to establish a quantitative relation between one of the parameters and the electrophoretic

(Figure labels: electrodes, glass plate, sandwich gel, buffer solution, detector assembly)


5.3.2.4 Pore Gradient Gels

In this system, the slab gel is not of uniform pore size but has a linear gradient, established by varying the acrylamide gel concentration. A particle loaded on the gel migrates rapidly through the dilute (large pore) gel region until it reaches a region of smaller pores, through which it moves only extremely slowly. The advantages are:

(i) Proteins over a much greater range of M values can be separated.
(ii) Proteins of very similar M values (isoenzymes) can be resolved.
(iii) Choice of buffer and electrical conditions is not critical.
(iv) Inherent self-limiting of the migration of proteins.

Estimation of M by native PAGE (without SDS) can be achieved in a linear gradient gel (~3–20%). The distance, D, traveled by a protein in time, t, is given by

$$t = (aD + b)^2 \tag{5.23}$$

5.3.2.5 Immunoelectrophoresis

Immunoelectrophoresis is a combination of gel electrophoresis and immunodiffusion methods. Serum (antigen) samples are placed in wells made in agar plates and electrophoresis is carried out to separate proteins according to their charge. After electrophoresis, longitudinal trenches are cut, antiserum (antibody) samples are introduced into the trenches, and the plates are incubated in a humid chamber. Proteins and antiserum samples diffuse, and precipitin lines are formed wherever the proteins (antigens) and antiserum samples (antibodies) meet (Fig. 5.5).

Fig. 5.5 Schematic of Immunoelectrophoretic Diffusion Setup

5.3.2.6 Isoelectric Focusing (IEF)

The isoelectric point of a protein is the pH at which the net charge of the protein is equal to zero. Polyacrylamide gel electrophoresis can also be used to separate amphoteric molecules (proteins) according to differences in their isoelectric points (pI). This method is extremely useful for separating isoenzymes and for studying micro-heterogeneity in a protein (e.g. a protein may show a single band in SDS-PAGE, but three bands in IEF if it exists in mono-, di- and triphosphate forms).

The apparatus usually consists of a narrow tube containing a mixture of polyacrylamide gel and ampholytes (which are small molecules with positive and negative charges). The ampholytes have a wide range of isoelectric points, and when an electric field is applied, those with low pI migrate towards the anode, whereas those with high pI migrate towards the cathode. In this process a pH gradient is set up from one end of the gel to the other, as each ampholyte comes to rest at a position coincident with its isoelectric point. Proteins introduced into such a column migrate through it in the electric field until each one reaches a point at which the pH in the column exactly equals its own isoelectric point.

Isoelectric focusing can also be achieved with immobilized pH gradient (IPG) strips. These are made by co-polymerizing acrylamide with a set of monomers that carry ampholyte functionality. By changing the concentration of the different monomers along the strip, pH gradients are covalently immobilized and stabilized in the gel. In this way IPG strips with various pH gradients (3–12) can be fabricated.

5.3.3 2-D Gel Electrophoresis (2-DE)

With the completion of genomic sequence analysis, attention is shifting towards understanding gene functions vis-à-vis the study of the different biomolecules involved in cellular functions. Proteins are among the fundamental biomolecules involved in various cellular functions. The term proteomics was coined to describe the large-scale study of the proteins related to a genome. The rapidly expanding field of proteomics relies heavily upon two-dimensional (2-D) electrophoresis as the best method to separate the large number of proteins present in biological extracts. 2-DE, followed by mass spectrometric analysis of isolated proteins, is the workhorse tool for proteomic analysis.

Proteomic studies can be performed at the level of a complex tissue. The first and probably the most important experimental step in a proteomic study is the extraction of the proteome from its medium. Once the protein extract has been isolated from the lysate, a separation technique is required to separate the individual proteins. 2-D gel electrophoresis (2-DE) has been the method of choice for the large-scale purification of proteins in proteomic studies.

The 2-DE technique combines IEF, which separates proteins according to their charge (pI), with the size-separation method of SDS-PAGE. The first dimension is carried out in polyacrylamide (PA) gels in narrow tubes in the presence of ampholytes, 8 M urea and a non-ionic detergent. The denatured proteins separate in the gel according to their pIs.

The full potential of the 2-DE method is better realized with the IPG strip. The protein extract is used to re-swell the IPG strip prior to focusing. Once the strip is swollen, an electric current is applied across it. The proteins that are positively charged (that is, in the area of the strip with pH below their pI) move toward the cathode and encounter an increasing pH until they reach their pI, at which point they are neutral. The proteins that are negatively charged (that is, in the area of the strip with pH above their pI) move toward the anode and encounter a decreasing pH until they reach their pI. The end result is that every protein is concentrated and focused at its pI.
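The focusing behaviour described above can be mimicked with a toy simulation: a protein on a strip with a linear pH gradient steps towards the cathode while the local pH is below its pI and towards the anode while it is above. The linear 3 to 12 gradient, the step size and all names below are illustrative assumptions, not a model of real focusing kinetics.

```python
def ph_at(x, ph_low=3.0, ph_high=12.0):
    """pH at fractional position x on the strip (0 = anode end, 1 = cathode end)."""
    return ph_low + (ph_high - ph_low) * x

def migration_direction(ph, pi):
    """Below its pI a protein is positive (to cathode); above it, negative (to anode)."""
    if ph < pi:
        return "cathode"
    if ph > pi:
        return "anode"
    return "focused"

def focus(pi, x_start, step=0.001, tol=0.01, max_steps=100000):
    """Move the protein in small steps until the local pH is within tol of its pI.

    Assumes the pI lies inside the pH range of the strip.
    """
    x = x_start
    for _ in range(max_steps):
        ph = ph_at(x)
        if abs(ph - pi) <= tol:
            break
        x += step if migration_direction(ph, pi) == "cathode" else -step
    return x

# A protein with pI 7.5 ends at the same place from either end of the strip
print(focus(7.5, 0.0), focus(7.5, 1.0))
```

The final position does not depend on where the protein starts, which is why the sample can simply be used to re-swell the whole strip.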

The second dimension of the 2-DE separation is formed by polymerizing an SDS-PAGE gel between two glass plates. The PA gel, incubated in a buffer, or the IPG strip is placed at the top of the SDS-PAGE gel. An electric field is then applied across the gel, and the proteins migrate into the second dimension, where they are separated according to their molecular mass (Mr). The proteins are detected by a staining protocol that creates a two-dimensional set of spots (Fig. 5.6). The location of each spot is determined by the protein’s pI and Mr. The intensity of each spot


separation medium. The narrow diameter of the capillary minimizes the thermal effects associated with the electric field. As a result, a very high electric potential (~30 kV) can be applied across the capillary without degradation of the separation due to heating. Separation time is inversely proportional to the applied potential and, therefore, application of high fields results in rapid and efficient separation. Unlike slab gels, where separated molecular species are visualized after a fixed run time, capillary electrophoresis is a “finish-line” technique, where they are detected after traversing a fixed distance.

Combining capillary electrophoresis with mass spectrometry allows laser-induced fluorescence (LIF) detection to be used to quantify the amount of protein in the sample, with each LIF peak then analyzed by electrospray ionization time-of-flight mass spectrometry (ESI-TOF-MS).

5.3.4.1 Free-solution Electrophoresis

Capillary free-solution electrophoresis does not have an analog in slab electrophoresis. A capillary is filled with a buffer and proteins are separated according to their mobility in the buffer, which is related to the charge/size ratio of the protein: small, highly charged proteins have the highest mobility.

5.3.4.2 2-D Capillary Electrophoresis

In the two-dimensional capillary electrophoresis method, a mixture of fluorescently labeled proteins is introduced into the first capillary. An electric field is applied for a period of time. The aliquot that migrates out of the first capillary during that period is injected into a second capillary, whose entrance abuts the first capillary. The electric field across the first capillary is then switched off. The fraction injected into the second capillary is separated by applying an electric field across the second capillary. The process is repeated until every fraction in the first capillary has passed into the second capillary and been separated. This procedure provides a two-dimensional electropherogram as a raster image built from successive separations of the first capillary’s fractions.

2-D capillary electrophoresis, with SDS-PEO gel separation in the first capillary and free-solution separation in the second capillary, would allow orthogonal separation. This method would make it possible to resolve complex mixtures into their constituents. It produces separations that are similar to IEF/SDS-PAGE, in a fully automated system. When combined with laser-induced fluorescence detection, it provides orders of magnitude higher sensitivity and dynamic range than IEF/SDS-PAGE analysis.

5.3.4.3 Capillary Zone Electrophoresis (CZE)

Capillary zone electrophoresis (CZE) is an alternative to liquid chromatography (LC) on account of its ability to separate ionizable compounds. Separation of analytes in CZE is based on differences in their electrophoretic mobilities, which can be modified by adding suitable soluble reagents. CZE separations occur entirely in a liquid phase. As no mass transport is involved between mobile and stationary phases, no peak broadening results from this source. Therefore, the separation power is much greater than for comparable LC. Thus, CZE is more efficient, expeditious and selective than LC.

In CZE, a capillary tube is filled with separation buffer, and one end of the tube is inserted into the microelectrospray interface of a mass spectrometer. A sample reservoir, containing a protein digest, is introduced at the other end of the capillary tube, and an electric field is applied from the injection end to the microelectrospray interface. The applied electric field generates a bulk flow of liquid toward the microelectrospray interface (electroosmotic flow). The analytes present in the sample are driven into the capillary tube by the electric field. The sample reservoir is then replaced by a buffer reservoir, and an electric field is applied across the capillary tube. A bulk flow of liquid moves toward the microelectrospray interface, due to electroosmotic pumping, while the analytes are separated by electrophoresis. The separated analytes are then successively led into the tandem mass spectrometer (MS-MS), and tandem mass spectra are acquired.

5.3.5 Control Parameters and Detection

Selection of the buffer, its ionic strength and its pH is of utmost importance in the electrophoresis of proteins. The ionic strength of the buffer should be ~0.05–0.15 M. At low ionic strength there is rapid migration but greater diffusion; at high ionic strength sharp bands are obtained but more heat is produced. Conductivity is determined by the nature of the component ions; e.g., Na⁺, K⁺ and phosphate buffers have higher conductivity than Tris-HCl owing to the ions present. Tris buffers are ideal for fractionation of basic proteins, and cacodylate buffers (anions) for acidic proteins at neutral pH values.
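The ionic strength referred to here is conventionally defined as I = ½ Σ cᵢzᵢ², summed over all ions of molar concentration cᵢ and charge zᵢ. The definition is standard, but the text does not spell it out, so the sketch below is an assumption about which quantity is meant; it checks a buffer composition against the ~0.05–0.15 M range, and the example compositions are illustrative.

```python
def ionic_strength(ions):
    """I = 0.5 * sum(c_i * z_i**2) for (molar concentration, charge) pairs."""
    return 0.5 * sum(conc * charge ** 2 for conc, charge in ions)

def in_recommended_range(strength, low=0.05, high=0.15):
    """True if the ionic strength lies in the ~0.05-0.15 M range quoted above."""
    return low <= strength <= high

# 0.1 M NaCl: Na+ and Cl- each at 0.1 M
nacl = [(0.1, +1), (0.1, -1)]
# 0.1 M MgCl2: Mg2+ at 0.1 M and Cl- at 0.2 M
mgcl2 = [(0.1, +2), (0.2, -1)]

for name, ions in (("NaCl", nacl), ("MgCl2", mgcl2)):
    strength = ionic_strength(ions)
    print(name, strength, in_recommended_range(strength))
```

Multivalent ions raise the ionic strength much faster than monovalent ions at the same molarity, because the charge enters as its square.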

Tracking dyes are added to the starting solution to keep track of ions that migrate in the same direction as the proteins to be fractionated. The mobility of buffer ions that migrate in the same direction as the macromolecules is important, and resolution is enhanced if the mobilities of both are comparable. In alkaline and neutral buffers, acidic proteins become negatively charged and migrate towards the anode and, therefore, negatively charged dyes are used (Bromophenol Blue as a marker for Cl⁻ and Gly⁻ ions). For electrophoresis in acidic medium, with proteins migrating towards the cathode, positively charged dyes are used (Methyl Green Pyronin).

As the charge/mass ratio of proteins can be altered by the pH of the buffer, the optimal pH value of a running buffer is that which ensures the maximum difference in the charge of the component proteins. For acidic proteins, optimal pH values fall in the neutral or slightly alkaline regions, and such proteins migrate towards the anode. Tris buffers at pH ~8.9 are preferred. For basic proteins, it is preferable to choose a slightly acidic pH (4–5). K⁺-acetate and Tris-acetate (pH ~4.5) are most suitable.

Detection and localization of protein bands are performed by staining gels with Coomassie Blue R-250 (CBB) and fixing them with trichloroacetic acid (TCA). Detection is also done by fluorescence or silver staining methods, and by autoradiography if the sample is radiolabeled. Nucleic acids are identified by staining the gels with various dyes: ethidium bromide, acridine orange, Pyronin etc. Methyl Green and DAPI selectively stain DNA but not RNA. For visible region fluorescence (blue or green excitation), fluorophores in the fluorescein and rhodamine families are most commonly used. For near-IR fluorescence, the most common dyes are from the polymethine carbocyanine family. Mass spectrometry (MS) is the most sought-after method in genomic and proteomic analysis (identification and sequence determination; see Chapter 6).

All the fluorescence dye approaches require a fluorescence detector system to determine the position of protein/DNA spots on the gel. In principle, a fluorescence detector system has three basic components: (i) an exciting energy source, (ii) a fluorescent sample and (iii) a fluorescence detector (Fig. 5.7). The laser/detector assembly consists of a laser source (laser diode) as the excitation source and a detector assembly (photomultiplier tube (PMT) or avalanche diode). The laser source is placed at an angle such that the focused polarized


Fig. 5.8 Schematic of a Blotting Apparatus (labelled parts: weight, glass plate, paper towels, filter paper, membrane, gel, wick, support, transfer buffer)

Fig. 5.9 Immunodetection of Protein Blots by Enzyme-linked Antibodies

probes under the appropriate hybridization conditions. Radioactive DNA molecules (cDNA) can be employed as probes in Southern and Northern blotting. Specific antibodies can be used to examine individual protein bands in Western blotting (Fig. 5.9) or by other methods, such as by fluorescence (Fig. 5.10).


Fig. 5.10 Schematic of Detection of Proteins in Western Blots by Fluorescence (figure labels: enzyme (peroxidase); fluorescent molecule (luminol); nonfluorescent species (aminophthalic acid); fluorescence)

EXERCISE MODULE

1. Which are the two branches of structure determination of biomolecules and what are their objectives?
2. Which are the physicochemical methods of structure elucidation clubbed under biochemical characterization?
3. Which are the hydrodynamical methods of structure elucidation and what information do they provide?
4. Which are the chromatographic methods that are employed in the physicochemical characterization of biomolecules?
5. What is the reason for the extensive use of electrophoretic methods in separation and characterization of biomolecules?
6. What are the procedures followed in the electrophoretic separation of (i) nucleic acids, and (ii) proteins?
7. Which are the important physicochemical methods employed in protein separation and characterization?
8. What kinds of complementary structural data do PAGE and SDS-PAGE provide?
9. Explain the use of immunoelectrophoretic methods in biomedical sciences.
10. What are the applications of 2-D gel electrophoresis (2-DE)?
11. What are the advantages of capillary electrophoretic methods?
12. What are the detection methods in gel electrophoresis?
13. What is the relevance of blotting techniques in genome analysis?



6 Primary Structure (Sequence) Determination of Biomolecules

Once a biological macromolecule (protein or nucleic acid) is obtained in a purified form, the next step is to determine its primary structure (sequence). Determination of the primary structure (sequence) of biological molecules is the starting point for generating computer databases and for computational analyses in bioinformatics studies. Availability of the primary structure of a macromolecule is a prerequisite for the determination of its spatial structure by physical techniques, such as X-ray diffraction and NMR spectroscopic methods.

Molecular separation and amplification, sequence determination of component fragments and analysis of consensus sequences are the main steps in the primary structure determination of nucleic acids and proteins.

6.1 PRIMARY STRUCTURE DETERMINATION OF NUCLEIC ACIDS

DNA sequencing technology is a major component of all genomic studies. Gene amplification, separation and sequencing are hierarchical steps in primary structure determination of nucleic acids. High-throughput sequencing methods (required for large-scale genome projects) use robotics, automated DNA-sequencing machines and computers to achieve speed and to manage data on a large scale. The “assembly line” purification platforms replicate the steps used in the manual template purification protocol.

6.1.1 Gene Amplification

Cutting genomic DNA at specific sites by suitable restriction enzymes generates DNA fragments. Gene amplification is necessary for obtaining sufficient quantities of the desired gene or gene fragment for further analysis, and for production of the desired protein in large quantities for analysis. The fragments are then amplified either by cloning or by polymerase chain reaction (PCR) methods.

6.1.1.1 Gene Cloning

Gene cloning involves the use of recombinant DNA technology to propagate DNA fragments, isolated from chromosomes using restriction enzymes, inside a foreign host. Following introduction into suitable host cells, the DNA fragments can then be reproduced along with the host cell DNA. Cloning procedures are routinely employed to produce unlimited material for experimental study.


6.1.1.2 Polymerase Chain Reaction (PCR)

Polymerase chain reaction (PCR) is a very versatile gene amplification method that has brought tremendous progress in molecular biology and genetics. It is an in vitro method of amplifying a desired DNA sequence of any origin hundreds of millions of times within hours. The procedure involves cycles of steps in which the double-stranded target sequence is denatured, oligonucleotide primers bordering the region to be amplified are annealed, and the primers are extended by thermo-stable polymerases and dNTPs.

In DNA amplification by PCR, a desired cDNA clone is synthesized using mRNA as a template. Suitable primers are used to hybridize to the corresponding sequences, and they are extended in a chain synthesis reaction by thermo-stable DNA polymerases, using the inserted sequence as the template. The PCR mixture contains the four types of DNA bases (as dNTPs) and two primers (~ 20 bases long). The mixture is heated to separate the strands of the target sequence and then cooled (annealing) to allow (i) the primers to bind to their complementary sequences on the separated strands, and (ii) the polymerase to extend the primers into new complementary strands. Repeated heating and cooling cycles multiply the target DNA exponentially, since each new double strand separates to become two templates for further synthesis.
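The exponential growth described above can be made concrete with a short calculation. The sketch below is only illustrative: the function name `pcr_copies` and the `efficiency` parameter (the fraction of templates copied in each cycle) are assumptions introduced here, and ideal doubling corresponds to an efficiency of 1.

```python
def pcr_copies(initial_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Expected number of target copies after the given number of PCR cycles.

    efficiency = 1.0 means every template is copied in every cycle (ideal
    doubling); lower values model incomplete annealing or extension.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must lie between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


if __name__ == "__main__":
    # Copies obtained from a single template molecule under ideal doubling.
    for n in (10, 20, 30):
        print(f"{n} cycles: {pcr_copies(1, n):.3e} copies")
```

Under ideal doubling a single template yields about a thousand copies after 10 cycles and about a billion after 30, which is why a few hours of thermal cycling suffice.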

The nucleotide that the polymerase attaches will be complementary to the base in the corresponding position on the template strand (e.g. if the adjacent template base is C, the polymerase attaches G). The polymerase chain reaction proceeds with two primers, bound to the opposite strands of the gene target, with their 3’-ends pointing at each other. The reaction is terminated by the incorporation of dideoxynucleotides. The result is a series of fragments of different lengths for each primer.

6.1.2 Gene Separation

Electrophoresis techniques are used to separate the fragments. Small-diameter capillary array gel electrophoresis permits application of high electric fields, thus providing significantly faster separation than traditional slab gels. While conventional electrophoresis is applicable to the separation of fragments < 40 kilobases, the pulsed-field gel electrophoresis (PFGE) technique has improved the separation of larger fragments (~10 megabases). This technique employs multiple electrodes, placed orthogonally with respect to the gel, and short pulses of current are passed through the gel, alternating back and forth between two directions. Automated electrophoretic methods with laser-induced detection have revolutionized gene sequencing methodologies.

6.1.3 Gene Sequencing

Genome sequences are assembled from DNA sequence fragments of approximately 500 base pairs in length. Knowing the sequence of a DNA molecule is vital for making predictions about its function and for facilitating manipulation of that molecule. Conventional (1st generation) gene sequencing employed the Maxam-Gilbert and Sanger methods. The Maxam-Gilbert method (also called the chemical degradation method) uses chemicals to cleave DNA at specific bases, resulting in fragments of different lengths. The Sanger sequencing method (dideoxy method) uses an enzymatic procedure to synthesize chains of varying length in four different reactions, stopping the DNA replication at positions occupied by one of the four bases, and then determines the resulting fragment lengths. This method is automated, and is now by far the most widely used technique for sequencing DNA. The multiplex sequencing procedure enables analysis of ~ 40 clones on a single DNA-sequencing gel.


The principles of Sanger sequencing method are:

(i) An enzymatic procedure synthesizes DNA chains of varying lengths, terminating replication at positions occupied by one of the four bases.

(ii) The fragments are then separated on gels by electrophoresis, and the identity and order of the nucleotides are determined from the sizes of the fragments.

A DNA polymerase extends an oligonucleotide primer annealed to a unique location on a DNA template by incorporating deoxynucleotides (dNTPs) complementary to the template. The 5’-end of every DNA fragment within a sample begins with the same priming sequence. Synthesis of the new DNA strand continues until the reaction is randomly terminated by the inclusion of a dideoxynucleotide (ddNTP). These nucleotide analogs are incapable of chain elongation, since the ribose moiety of a ddNTP lacks the 3’-OH necessary for forming a phosphodiester bond with the next incoming dNTP. This results in a population of truncated sequencing fragments of varying length.

The identity of the chain-terminating nucleotide at each position is specified by running four separate base-specific reactions, each of which contains a different dideoxynucleotide: ddATP, ddCTP, ddGTP or ddTTP. The four fragment sets are loaded in adjacent lanes of a polyacrylamide gel and separated by electrophoresis, according to fragment size. By this method, DNA fragments differing in length by one nucleotide can be resolved (Fig. 6.1).
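The way a sequence is read from four base-specific lanes can be sketched in a few lines of code. This is an idealized illustration, not a model of a real sequencer: it assumes that every possible terminated fragment is present in detectable amount, it counts fragment lengths from the end of the primer, and the function names and the short template are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def synthesize(template: str) -> str:
    """Base-by-base complement of the template, in the order the polymerase copies it."""
    return "".join(COMPLEMENT[base] for base in template)


def sanger_lanes(template: str) -> dict:
    """Fragment lengths expected in each of the four ddNTP reactions.

    A fragment of length n appears in lane X when the n-th base of the
    newly synthesized strand is X (chain terminated by ddX at that position).
    """
    lanes = {base: set() for base in "ACGT"}
    for length, base in enumerate(synthesize(template), start=1):
        lanes[base].add(length)
    return lanes


def read_gel(lanes: dict) -> str:
    """Read the new strand from the shortest fragment to the longest."""
    longest = max(max(lengths) for lengths in lanes.values() if lengths)
    calls = []
    for length in range(1, longest + 1):
        for base, lengths in lanes.items():
            if length in lengths:
                calls.append(base)
                break
    return "".join(calls)


if __name__ == "__main__":
    lanes = sanger_lanes("TACGGATC")   # hypothetical template
    print(lanes)
    print(read_gel(lanes))             # sequence of the synthesized strand
```

Because each fragment length occurs in exactly one lane, stepping through the lengths in increasing order reproduces the order of bases in the synthesized strand.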

Detection can be achieved by (i) autoradiography, if a radioactive label is introduced into the sequencing reaction products, or by (ii) fluorescence, if the reaction products are labeled with an appropriate fluorescent dye. Fluorescence detection employs direct methodology, which is simple, sensitive and amenable to automation. The use of laser-induced fluorescence (LIF) technology, which can be coupled to computerized detection systems, has replaced most of the radioactive techniques in genomic studies, allowing automation of sequencing methods. Automated DNA sequencing by fluorescence techniques accomplishes real-time detection of DNA fragments as they move through a portion of the electrophoresis gel that is irradiated by a laser beam: DNA is separated on an automated DNA sequencer and the fluorescent image of DNA migrating through the acrylamide gel is captured by CCD (charge-coupled device) cameras similar to those found in common home video recorders.

Fig. 6.1 Electropherogram of a DNA Sequence


6.1.3.1 Latest Gene Sequencing Methods

Developments in gene sequencing techniques (2nd and 3rd generation) include ultra-thin electrophoresis, resonance ionization spectroscopy to detect suitable isotope labels, gel-less flow cytofluorimetry (single-molecule detection in flowing solution), laser-induced fluorescence, confocal (LSCM), near-field scanning optical (NSOM), scanning-tunneling (STM) or atomic force microscopy (AFM), and mass spectrometry (MS).

Single-molecule detection can be achieved with flow cytofluorimetry, which monitors fluorescently tagged macromolecules. A small sample volume in a flowing solution can be obtained by introducing the sample from a capillary tube inserted into a flow cell (Fig. 6.2). While a fluorescent molecule transits a focused laser beam, it undergoes cycles of photon absorption and emission, so that its presence is signaled by a burst of emitted photons, which makes it possible to distinguish the signal from the background. The background from Rayleigh and Raman scattering can be drastically reduced using a pulsed laser beam and the single-photon timing technique.

Fig. 6.2 Schematic of Single-molecule Detection by Flow-cytofluorimetry

Confocal microscopy is well suited for fluorescently tagged single-molecule detection. Either a fluorescence spectroscopic or a microscopic approach can be used for sequencing of nucleic acids (DNA or RNA). In this approach, a nucleic acid strand (DNA/RNA) is replicated by a polymerase using nucleotides linked to fluorophores via a linker arm. The fluorescently tagged nucleic acid strand is attached to a solid support (e.g. a latex bead), and suspended in a flowing stream. The nucleic acid bases are then sequentially cleaved by an exonuclease. The released labeled nucleotides are detected and identified by their fluorescence signature: either by their spectral characteristics, if the tagged fluorophores are different, or by their different lifetimes, if the same fluorophore is tagged (Fig. 6.3). Thus, in high-throughput sequencing (several hundred bases per second), the sequence of a nucleic acid can be determined from the order in which the labeled nucleotides pass through the laser beam.


Fig. 6.3 Sequencing of Nucleic Acids by Fluorescently-labeled Single-molecule Detection Method

6.1.4 Genome Sequencing

Genome sequencing projects face several technical problems. The selection of a sequencing strategy usually depends on the size of the target DNA molecule. Since present experimental methods can provide data only on fragments of ~ 500 base pairs, determination of larger genomic sequences requires a strategy to assemble overlapping sequence fragments. The “shotgun” sequencing strategy (a sequencing method that involves randomly sequencing tiny cloned pieces of the genome, with no foreknowledge of where on a chromosome each piece originally came from) is employed in most of the current large-scale genome sequencing projects.

The strategy employed in the “whole genome shotgun” DNA sequencing method is as follows (Fig. 6.4):

(i) Chromosomes are first separated by pulsed-field gel electrophoresis, after which each DNA molecule is broken into random DNA fragments.

Legend to Fig. 6.3: (I) = Synthesis of complementary strand with fluorescently labeled nucleotides; (II) = Attachment of strand to a support and suspension in a flow sample stream; (III) = Sequential cleavage by an exonuclease and detection.


(ii) Purified DNA fragments are used to construct small-insert shotgun libraries, which are cloned and sequenced from each end. Sufficient DNA sequencing is performed so that each nucleotide of DNA in the genome is covered numerous times in fragments of ~ 500 bp.

(iii) After sequencing, scaffolds of DNA sequences (series of contigs that are in the right order but are not necessarily connected in one continuous stretch of sequence) are assembled by identifying overlapping stretches of DNA sequences, to reconstruct the complete genome.
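How much sequencing is "sufficient" in step (ii) can be estimated with simple arithmetic. The sketch below assumes reads of equal length placed uniformly at random on the genome, so that the number of reads covering a base is approximately Poisson distributed; this idealized model and the function names are assumptions made for illustration, not part of the protocol above.

```python
import math


def shotgun_coverage(genome_length: int, read_length: int, n_reads: int) -> float:
    """Average number of times each base of the genome is sequenced."""
    return n_reads * read_length / genome_length


def fraction_unsequenced(coverage: float) -> float:
    """Poisson estimate of the fraction of bases not covered by any read."""
    return math.exp(-coverage)


def reads_needed(genome_length: int, read_length: int, coverage: float) -> int:
    """Number of reads required to reach a target average coverage."""
    return math.ceil(coverage * genome_length / read_length)


if __name__ == "__main__":
    genome, read = 1_000_000, 500          # hypothetical 1 Mb genome, ~500 bp reads
    n = reads_needed(genome, read, 10)     # reads for tenfold coverage
    c = shotgun_coverage(genome, read, n)
    print(n, c, fraction_unsequenced(c))
```

Under these assumptions, tenfold coverage of a 1 Mb genome needs 20,000 reads of 500 bp and leaves well under 0.01 % of the bases unsequenced, whereas onefold coverage would leave more than a third of the genome uncovered.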

Fig. 6.4 Schematic of the “Whole Genome Shotgun (WGS)” DNA Sequencing Protocol

A large segment of target DNA is randomly fragmented by physical shearing or enzymatic digestion to fragment sizes in the range of 1 - 5 kb. Chromosomes of a target organism are purified, fragmented, and sub-cloned in fragments of ~ kilo base pairs. They are further sub-cloned as smaller fragments in plasmid vectors for DNA sequencing. First, the gene fragments are sequenced to determine the order of the bases in each sequence. Next, overlapping fragments are built up in a multiple alignment, a process known as sequence assembly, from which a consensus sequence for the clone, called a “contig”, is obtained. Contig assembly is one of the most difficult and critical functions in DNA sequence analysis. Full chromosomal sequences are then assembled from the overlapping sequences in a highly redundant set of fragments.
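The idea of building a contig from overlapping fragments can be sketched with a simple greedy procedure that repeatedly merges the two fragments sharing the longest suffix-prefix overlap. This is a toy illustration under strong assumptions (error-free reads, all reads from the same strand, no repeats); production assemblers use considerably more elaborate methods, and the function names and fragments here are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len: int = 3):
    """Merge fragments pairwise, always choosing the longest available overlap."""
    contigs = list(dict.fromkeys(fragments))      # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break                                  # no overlaps left: several contigs remain
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


if __name__ == "__main__":
    reads = ["CGTTAGC", "ATGCGTAC", "GTACGTTA"]    # hypothetical overlapping reads
    print(greedy_assemble(reads))
```

When no further overlaps of at least `min_len` bases exist, the procedure stops and returns several separate contigs, which mirrors the gaps that remain between contigs in a scaffold.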

In genomic studies, the earliest stages of genome analysis are performed by automatic methods. The genome sequences are then annotated. More detailed information is collected by laboratory experiments and a closer examination of the sequence data.

6.1.4.1 cDNA Sequencing

(1) Isolation of mRNA from human tissues, whose exon coding sequences are obtained as cDNA. This allows rapid isolation and annotation of putative gene sequences.

(2) Grouping of ESTs into consensus groups, based on sequence overlaps.

6.1.4.2 Nanopore Sequencing

Nanopore (solid-state and membrane) probe technology enables sequencing of a single DNA molecule: probing of a single DNA molecule and direct conversion of sequence information


2-mercaptoethanol or dithiothreitol (Cleland reagent) to reduce the S–S bonds to sulfhydryl (SH) groups. The resultant free SH groups are alkylated by iodoacetic acid to prevent reoxidation.

Table 6.1 Chemical Reagents for Cleavage of Polypeptides

Reagent                          Specificity
Cyanogen bromide (CNBr)          Highly specific; carboxyl side of Met or Trp.
Hydroxylamine                    Specific; Asn–Gly bonds.
2-Nitro-5-thiocyanobenzoate      Specific; cysteine residues.

Table 6.2 Endopeptidases for Cleavage of Polypeptides

Enzyme          Specificity
Pepsin          Non-specific; cleaves N-terminal to F, L, W, Y only when next to P.
Thermolysin     Small neutral residues; cleaves N-terminal to F, I, M, V, W, Y, but not if next to P.
Elastase        Cleaves C-terminal to A, G, S, V, but not if next to P.
Trypsin         Highly specific for positively charged residues; cleaves C-terminal to R, K, but not if next to P.
Chymotrypsin    Prefers bulky hydrophobic residues; cleaves C-terminal to F, Y, W, but not if next to P.

Table 6.3 Exopeptidases in C-terminal Cleavage of Polypeptides

Enzyme                 Specificity
Carboxypeptidase-A     Cleaves all C-terminal residues except R, K or P; does not cleave if a P residue is next to the terminal residue.
Carboxypeptidase-B     Cleaves when the C-terminal residue is K or R, and not when a P residue is next to the terminal residue.
Carboxypeptidase-C     Cleaves all free C-terminal residues.
Carboxypeptidase-Y     Cleaves all free C-terminal residues.

6.2.2 Sequence Analysis

For sequence analysis, the cleaved fragments are separated and purified by various fractionation methods (reversed-phase HPLC, TLC, electrophoresis; see Chapter 5). Amino acid sequencing is carried out by the Edman degradation method or, more efficiently, by the tandem mass spectrometric (MS-MS) method. In the Edman method, a polypeptide is sequentially cleaved, one residue at a time, from the amino end. The Edman reagent, phenylisothiocyanate (PITC), reacts with the terminal amino group of the peptide to form a phenylthiocarbamoyl (PTC) derivative. Under mildly acidic conditions the derivative forms a PTH-amino acid, which is released, leaving the rest of the peptide intact. The PTH-amino acid is the end product of one cycle of Edman degradation. The remaining polypeptide chain may now be subjected to further cycles of the Edman degradation. Amino acids are purified and identified by chromatographic methods (TLC, HPLC). Unlike the Edman method, the MS-MS approach can be carried out with a mixture of cleavage fragments (no chemical separation and identification).

Information about the order of the cleaved and sequenced fragments is obtained by comparison of fragment overlaps. Chymotrypsin cleaves preferentially on the carboxyl side of aromatic and other bulky non-polar residues. Because these chymotryptic peptides overlap two or more tryptic peptides, they can be used to establish their sequential order.
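The use of overlapping digests to order peptides can be illustrated with a small in silico example. The sketch below applies the simplified cleavage rules of Table 6.2 (trypsin after K or R, chymotrypsin after F, Y or W, never before P) to a hypothetical 13-residue sequence, then orders the tryptic peptides by finding chymotryptic peptides that span a junction. The function names and the sequence are assumptions made for illustration, and the ordering step is a naive search that would not cope with repeated or ambiguous peptides.

```python
def digest(protein: str, cleave_after: str) -> list:
    """Cut C-terminal to any residue in cleave_after, unless the next residue is proline."""
    peptides, start = [], 0
    for i, residue in enumerate(protein[:-1]):
        if residue in cleave_after and protein[i + 1] != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    peptides.append(protein[start:])
    return peptides


def order_peptides(tryptic: list, chymotryptic: list) -> list:
    """Order tryptic peptides using chymotryptic peptides that bridge two of them."""
    successor = {}
    for a in tryptic:
        for b in tryptic:
            # A bridging peptide lies in a+b but in neither a nor b alone.
            if a != b and any(c in a + b and c not in a and c not in b
                              for c in chymotryptic):
                successor[a] = b
    first = next(p for p in tryptic if p not in successor.values())
    ordered = [first]
    while ordered[-1] in successor:
        ordered.append(successor[ordered[-1]])
    return ordered


if __name__ == "__main__":
    protein = "GAKFLRSWEKMYR"                    # hypothetical sequence
    tryptic = digest(protein, "KR")
    chymotryptic = digest(protein, "FYW")
    print(tryptic, chymotryptic)
    # Shuffle the tryptic peptides, then recover their order from the overlaps.
    print(order_peptides(["SWEK", "MYR", "GAK", "FLR"], chymotryptic))
```

In the experiment the sequence is of course unknown; the two peptide sets are obtained by sequencing the fragments of two separate digests, and only the ordering step is carried out on paper or by computer.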

Amino acid sequence data of proteins can be inferred from the base sequence of the corresponding nucleic acids. With the development of highly efficient gene sequencing methods, gene sequencing has become faster and easier than protein sequencing, and this alternative is followed wherever it is feasible. However, there are certain ambiguities and limitations in inferring amino acid sequence from gene sequencing. These are:

(i) Degeneracy of codons (more than one codon coding for the same amino acid) leads to ambiguities.

(ii) The genetic code is not universal.

(iii) Deletion and insertion of nucleotide(s) can lead to an erroneous reading frame for the amino acids.

(iv) The sequences of post-translationally modified proteins and disulfide-containing proteins can be determined only by direct protein sequencing.
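Points (i) and (iii) can be demonstrated with a few lines of code. The sketch below builds the standard genetic code table (an assumption; as noted in point (ii), some organisms and organelles use variant codes), translates a short hypothetical DNA sequence, lists the synonymous codons of an amino acid, and shows how a single-base deletion changes the reading frame. The function names are introduced here for illustration only.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna: str, frame: int = 0) -> str:
    """Translate the coding strand from the given frame; '*' marks a stop codon."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def synonymous_codons(amino_acid: str) -> list:
    """All codons that encode the given amino acid (one-letter code)."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)


if __name__ == "__main__":
    gene = "ATGCTGAAATGGTAA"                  # hypothetical coding sequence
    print(translate(gene))                    # intact reading frame
    print(synonymous_codons("L"))             # degeneracy: several codons for Leu
    deleted = gene[:3] + gene[4:]             # delete one nucleotide after the start codon
    print(translate(deleted))                 # shifted frame gives a different product
```

Leucine is encoded by six codons and tryptophan by only one, so a protein sequence determines the gene sequence only up to this degeneracy, while the gene sequence determines the protein sequence uniquely once the reading frame is known.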

Also, comparison of amino acid sequence data with DNA sequence data is necessary to identify initiation, termination and intron-coding regions. In addition, DNA sequence data do not provide information on post-translational modifications (methylation, glycosylation, phosphorylation) of proteins.

6.2.3 Mass Spectrometry (MS) in Sequence Analysis

Until recently, protein sequencing by the Edman degradation method was the method of choice for the identification of proteins. The situation has changed drastically with the integration of advanced physical technologies, such as mass spectrometric methods, for protein identification and sequence analysis. Of particular relevance to the “Genomes-to-Life” program is the unique ability of MS to identify a protein unambiguously, establishing the amino acid sequence (the order in which these building blocks of proteins are arranged) and determining the presence of post-translational modifications that can affect the protein’s function.

Mass spectrometers generally couple three devices: (i) an ionization device, (ii) a mass analyzer (a device that separates a mixture of ions by their mass-to-charge ratios), and (iii) a detector.

Mass spectrometric methods require the transfer of the analyte into the gas phase. Incorporation of mass spectrometry in biological studies was made possible by the development of gentle ionization techniques: matrix-assisted laser desorption ionization (MALDI) and electrospray ionization (ESI). MALDI and ESI methods permit the efficient transfer of proteins and peptides into the gas phase and into the mass spectrometer. Generally, the MS analysis consists of individually excising the spot of interest from a 2-DE gel. The spot is rinsed and subjected to enzymatic digestion (trypsin). The most commonly used mass analyzers for protein biochemistry applications are time-of-flight (TOF), triple-quadrupole, quadrupole-TOF, and ion-trap instruments.

6.2.3.1 MALDI-TOF-MS

The purified proteins, isolated as spots from 2-D gel electrophoresis, are individually excised, digested with a protease (e.g. trypsin) that cuts the protein at predictable positions, and spotted on a MALDI plate for co-crystallization with a standard matrix solution. The plate is inserted into the vacuum chamber of the MS apparatus. A selected spot is then illuminated using a focused pulsed laser beam of wavelength tuned to the absorbance wavelength of the matrix. Positively charged peptides in the gas phase are attracted toward the orifice of the flight tube, which is kept at a negative bias (Fig. 6.6).

All the peptides are subjected to the same electric field over the same plate-to-orifice distance (L), and reach the orifice with a velocity proportional to $\sqrt{z/m}$. The time-of-flight mass spectrometer is an arrangement based on the fact that ions of different mass-to-charge ratio need different times to travel a certain distance in a field-free region after they have all initially been given the same translational energy. Time of flight through the tube correlates directly with mass, with lighter molecules having a shorter time of flight than heavier ones. By the time the particles leave the time-of-flight (TOF) tube, they are separated according to their m/z ratio, and they reach the detector at different times. The mass analyzer records the intensity observed at the detector as a function of time of flight, which correlates with the m/z ratio. Peaks in an MS spectrum correspond to the masses of each of the peptides analyzed (“mass fingerprinting”).
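The dependence of flight time on mass can be worked out from energy conservation. Assuming an ion of mass m and charge ze is accelerated from rest through a potential difference V, its kinetic energy is zeV = ½mv², so that v = √(2zeV/m) and the drift time over a field-free tube of length D is t = D√(m/(2zeV)). The sketch below evaluates this expression; the accelerating voltage (20 kV), tube length (1 m) and peptide masses are hypothetical values chosen only for illustration.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19   # coulomb
DALTON = 1.66053906660e-27            # kilogram


def flight_time(mass_da: float, charge: int, accel_voltage: float,
                tube_length: float) -> float:
    """Drift time in seconds of an ion through a field-free flight tube.

    Assumes the ion starts from rest and gains kinetic energy z*e*V
    before entering the tube.
    """
    mass_kg = mass_da * DALTON
    velocity = math.sqrt(2 * charge * ELEMENTARY_CHARGE * accel_voltage / mass_kg)
    return tube_length / velocity


if __name__ == "__main__":
    for mass in (1000.0, 2000.0, 4000.0):          # singly charged peptides
        t = flight_time(mass, 1, 20_000.0, 1.0)
        print(f"m = {mass:6.0f} Da  ->  t = {t * 1e6:.2f} microseconds")
```

Since t is proportional to √(m/z), a fourfold increase in mass doubles the flight time, and a measured flight time can be converted back to m/z once the instrument is calibrated.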

Fig. 6.6 Schematic of MALDI-TOF-MS Analysis of Peptide Mass Fingerprinting

MALDI-TOF-MS, in combination with protein database searches by peptide mass fingerprinting, is a valuable, sensitive and high-throughput screening approach in proteomics. By comparing the peptide masses observed in the MS spectrum with predicted peptide masses, derived from all known protein sequence databases, it is possible in principle to identify the protein of interest.
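A database search by peptide mass fingerprinting can be sketched as follows. The example computes monoisotopic peptide masses from standard residue masses, counts how many observed masses match the predicted tryptic peptides of each candidate protein within a tolerance, and reports the best-scoring entry. The two-entry "database", the simulated observed masses and the 0.01 Da tolerance are hypothetical; real search engines use far more elaborate scoring.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
MONOISOTOPIC = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def peptide_mass(peptide: str) -> float:
    """Neutral monoisotopic mass: sum of residue masses plus one water."""
    return sum(MONOISOTOPIC[residue] for residue in peptide) + WATER


def score(observed, peptides, tol: float = 0.01) -> int:
    """Number of observed masses matching any predicted peptide within tol (Da)."""
    predicted = [peptide_mass(p) for p in peptides]
    return sum(any(abs(m - t) <= tol for t in predicted) for m in observed)


def identify(observed, database, tol: float = 0.01) -> str:
    """Name of the database entry whose predicted peptides match the most masses."""
    return max(database, key=lambda name: score(observed, database[name], tol))


# Hypothetical database: protein name -> predicted tryptic peptides.
database = {
    "protein_A": ["GAK", "FLR", "SWEK", "MYR"],
    "protein_B": ["AGK", "LFR", "TVEK", "MWR"],
}
# Simulated measurement: three peptides of protein_A with a small mass error.
observed = [peptide_mass(p) + 0.002 for p in ("GAK", "SWEK", "MYR")]

if __name__ == "__main__":
    print(identify(observed, database))
```

Note that the peptides GAK and AGK have the same composition and therefore the same mass, so protein_B also scores one match; mass alone cannot distinguish such peptides, which is one reason a fingerprint relies on several matching masses.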

Surface-enhanced laser desorption ionization (SELDI)-TOF-MS is an affinity-MS, high-throughput proteomic fingerprinting tool that shares its basic principle with MALDI-TOF and can be used for protein purification, identification, and target discovery and validation. Differential protein expression via SELDI brings the research team one step closer to the ultimate drug target than gene expression does.


Biomolecular interaction analysis mass spectrometry is a two-dimensional, chip-based, analytical technique for rapid and sensitive analysis of biomolecules. It represents a synergy of two individual technologies: surface plasmon resonance (SPR) sensing and MALDI-TOF mass spectrometry.

6.2.3.2 ESI-MS/MS

As peptide mass fingerprinting by MALDI-TOF-MS does not always give unambiguous protein identification, there has been an increased emphasis on using tandem mass spectrometry (MS/MS), equipped with collision-induced dissociation. Tandem mass spectrometry (MS/MS) is an arrangement in which ions are subjected to two or more sequential stages of analysis (which may be separated spatially or temporally) according to their mass-to-charge ratio. One of the most popular types of tandem MS instrument is the triple quadrupole mass spectrometer. Electrospray ionization (ESI) allows the transfer of non-volatile analytes (such as proteins and nucleic acids) at atmospheric pressure from the liquid phase to the gas phase. The basis of MS/MS is collision-induced dissociation, and ESI-based MS has several advantages over MALDI-TOF: (i) it can be easily coupled to different sample separation and sample introduction techniques, and (ii) the quality of MS/MS spectra generated from multiply charged analytes is higher.
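Because ESI produces multiply charged ions, the same analyte appears at several mass-to-charge values. The sketch below assumes that every charge is carried by an added proton, so that an ion of neutral mass M and charge z is observed at m/z = (M + z·1.007276)/z, and it shows how two adjacent charge-state peaks suffice to recover both z and M. The protein mass used is a hypothetical value and the function names are introduced here for illustration.

```python
PROTON = 1.007276   # mass of a proton in Da


def mz(neutral_mass: float, charge: int) -> float:
    """m/z of an ion carrying `charge` protons."""
    return (neutral_mass + charge * PROTON) / charge


def neutral_mass_from_adjacent_peaks(mz_low: float, mz_high: float):
    """Recover charge and neutral mass from two adjacent charge-state peaks.

    mz_high belongs to charge z and mz_low to charge z + 1, so that
    z = (mz_low - PROTON) / (mz_high - mz_low).
    """
    z = round((mz_low - PROTON) / (mz_high - mz_low))
    return z, z * (mz_high - PROTON)


if __name__ == "__main__":
    mass = 16951.5                              # hypothetical protein mass in Da
    peaks = [mz(mass, z) for z in (10, 11)]
    print(peaks)
    print(neutral_mass_from_adjacent_peaks(peaks[1], peaks[0]))
```

Multiple charging brings large proteins into the m/z range of common analyzers, and the series of charge states gives several independent estimates of the same neutral mass.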

Nanoelectrospray-MS/MS is a newer adaptation of ESI methodology in conjunction with MS/MS. Only a very small amount of the unseparated peptide mixture is sprayed directly into the MS machine. Nanospray/HPLC has become the method of choice for introducing protein digests into ESI-MS/MS. Full automation is possible: protein digests are loaded on a reversed-phase HPLC column, and the analytes, separated according to their hydrophobicity, are analyzed by MS.

The method can be used for de novo protein sequencing and the study of post-translational modifications. Protein database search algorithms are used to reveal the identity of the protein. Peptide amino acid sequence information contained in the MS/MS spectra is compared with known sequences in protein/genome databases. A single confident match between a peptide MS/MS spectrum and a protein sequence entry from a database can be enough to identify a protein or a family of proteins.

De novo sequencing (determination of sequences of genes or amino acids whose sequence is not yet known) can be pursued with LC/MS/MS or nanoelectrospray MS/MS methods. These physical techniques (e.g. SELDI-TOF) can be used to find specific proteomic patterns that can distinguish healthy from diseased patients. The ability of the pattern itself to become the diagnostic represents a new paradigm for the application of proteomics to clinical specimen analysis and disease diagnosis.

When more than two stages are involved, the technique is called multi-dimensional MS. Multiple mass spectrometry (MS/MS/MS) provides even greater certainty of identification, and more characterization information, than electrospray ionization/tandem mass spectrometry.

Photoionization Mass Spectrometry (PI-MS) is another emerging technology that could become an important tool for high-throughput pharmaceutical analysis. PI-MS meets the requirements of many applications where ESI and atmospheric pressure chemical ionization (APCI) underperform. Multi-isotope Imaging Mass Spectrometry (MI-MS) is a cutting-edge technology that enables visual and quantitative assessment of intra- and trans-cellular metabolic pathways, signal transduction, virus penetration, and localization of drugs.

Remarkable advances are taking place in the study of protein expression via sequence analysis and identification, but the field still faces major hurdles. 2-DE methods are cumbersome, and all the steps in a protein expression study must become much more easily reproducible and more affordable before they can significantly further our knowledge of protein expression. Another major challenge is to improve the quantification of proteins. It is not sufficient to find that a protein is expressed; one must also know how much is expressed to be able to identify important patterns.

EXERCISE MODULES

1. What are the main steps in the primary structure determination of nucleic acids and proteins?

2. Explain the procedure followed in the nucleic acid sequence determination.

3. What is the objective of gene amplification?

4. What is gene cloning and what is its role in sequence determination of nucleic acids?

5. What is PCR and what is its important role in molecular biology?

6. Which are the physicochemical methods used in gene separation?

7. Which electrophoretic methods are used in gene separation?

8. Which are the standard methods used in gene sequencing?

9. What are the physicochemical methods of analysis applied in genomic projects?

10. What is the importance of primary structure determination of proteins?

11. What are the step-by-step methods used in the primary structure determination of proteins?

12. Write on the Edman method in protein sequencing.

13. What are the practical strategies employed in the separation of small polypeptides, large proteins and disulfide-containing proteins?

14. Write on the use of mass spectrometry in protein sequencing.

15. What are the applications and advantages of MALDI-TOF, and ESI-MS/MS?


7 Spatial Structure Determination of Biomolecules

The most challenging experimental task in the structure elucidation of macromolecules is the determination of their three-dimensional (spatial) structure. To date, only two physical techniques are capable of providing spatial (secondary and tertiary) structural information on macromolecules (e.g. nucleic acids, proteins, carbohydrates, lipids and their complexes): X-ray diffraction and Nuclear Magnetic Resonance (NMR) spectroscopy. In addition, scanning and imaging techniques, such as fluorescence imaging, microscopies and tomographies, provide gel patterns and morphological and functional features of cells and organelles.

7.1 X-RAY DIFFRACTION METHODS

X-ray crystallography is, to date, the most important quantitative method for elucidating the three-dimensional architecture of matter in the crystalline state at molecular and atomic resolution. This method has therefore become an indispensable analytical tool in the biological sciences. There seems to be no limit to the size of the molecule, nor to its structural complexity, for which structure determination can be attempted. Indeed, a very substantial part of what is known of biomolecular structures is due to X-ray crystallography. The primary condition is that the matter should be in the single-crystalline state. Further, the power and scope of X-ray diffraction have been enhanced manifold by advances in the technology of production and detection of X-rays. The use of synchrotron radiation, a highly intense, highly collimated source with a continuous wavelength spectrum, has helped in (i) reducing the data collection time and (ii) resolving the “phase problem”. X-ray detectors with increased sensitivity, increased dynamic range, and increased counting rate have enabled fast and accurate data collection. These developments have enhanced the power of X-ray crystallography in solving macromolecular structures of immense complexity and have ushered in the study of real-time structural dynamics of biological molecules (time-resolved X-ray crystallography) at the molecular level.

X-rays are produced when high-energy electrons are decelerated by a target, giving rise to an X-ray continuum (Bremsstrahlung) and, depending on the applied voltage, characteristic line spectra. The wavelengths of the characteristic X-ray lines employed in X-ray diffraction studies (e.g. CuKα = 1.54 Å) are comparable to interatomic distances. Therefore these X-ray lines are used in structure elucidation by X-ray diffraction.

7.1.1 Principles of X-ray Diffraction

The basic principles of X-ray diffraction are similar to those of image formation in light and electron microscopy. In light and electron microscopes the scattered rays are focused by lens systems. There is no proper lens system to focus X-rays (the refractive index for X-rays is very nearly 1 in all media). What is recorded in X-ray diffraction methods is therefore the scattered X-rays themselves (diffraction patterns, recorded on film or electronically). The operational procedures involved in transforming the diffraction patterns for “image reconstruction” are quite intricate.

Matter in the solid state can be amorphous, semi-crystalline or single-crystalline. X-ray diffraction is a scattering phenomenon. Though the amount of diffracted energy is the same for a substance in the amorphous or the crystalline state, the information content obtainable is not the same in all states. The information content is least in the amorphous state and greatest in the single-crystalline state. A single crystal is a manifestation of a regular and periodic arrangement of atoms, molecules or clusters of molecules in all three dimensions (crystal lattice). Because the diffracting elements are arranged periodically, the scattering is coherent and produces observable diffraction patterns. In addition, the periodic arrangement channels the diffracted rays into certain directions only, greatly enhancing the signal-to-noise ratio (information content).

Scattering is random in the amorphous state and gives rise to diffuse scattering.

Scattering by semi-crystalline state gives rise to diffraction lines and arcs.

Scattering by single-crystalline state gives rise to discrete diffraction spots.

7.1.2 Determination of Unit Cell Morphology

As X-rays interact only weakly with matter (hence their high penetrating power), the scattering from a single molecule is far too weak to be measured. The single-crystalline state of matter is therefore a prerequisite for three-dimensional structure determination by X-ray diffraction methods.

7.1.2.1 Unit Cell and Space Group

In crystalline substances, the crystal lattices act as diffraction gratings for X-rays. A unit cell is a parallelepiped (an imaginary box) that contains one basic unit of the structure; translational repetition of the unit cell in all three dimensions builds up the crystal.

Each reflection (spot) of a diffraction pattern can be identified by a triple of indices, hkl, called Miller indices. The indices h, k and l give the number of parts into which the corresponding set of lattice planes divides the a, b and c axes of the unit cell, respectively.

The underlying principle of X-ray diffraction can be understood from the intuitive interpretation offered by Lawrence Bragg. It states that a beam of X-rays incident upon a stack of parallel, equally spaced lattice planes appears to be reflected by these planes (Fig. 7.1). According to Bragg’s law, the condition for constructive interference is

2d sinθ = nλ Bragg’s equation (7.1)

(where d = interplanar spacing; θ = angle of incidence; λ = wavelength of the X-rays; n = integer 1, 2, 3, …). Bragg’s equation is used to determine the unit cell parameters (cell lengths and angles) from the spatial separation of the diffraction spots.
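Bragg’s equation, 2d sinθ = nλ, can be tried out numerically. The short Python sketch below is only an illustration: the function names and the example angle are assumptions introduced here, and the default wavelength is the CuKα value of 1.54 Å quoted earlier.

```python
import math


def bragg_d_spacing(theta_deg, wavelength=1.54, n=1):
    """Interplanar spacing d (in Å) from Bragg's equation 2 d sin(theta) = n lambda."""
    return n * wavelength / (2.0 * math.sin(math.radians(theta_deg)))


def bragg_angle(d, wavelength=1.54, n=1):
    """Angle theta (in degrees) at which planes of spacing d (in Å) reflect."""
    s = n * wavelength / (2.0 * d)
    if s > 1.0:
        raise ValueError("no reflection possible: n * lambda exceeds 2d")
    return math.degrees(math.asin(s))


# Hypothetical reflection observed at theta = 15 degrees with CuK-alpha radiation
d = bragg_d_spacing(15.0)
print(f"d = {d:.3f} Å")
```

The `ValueError` branch reflects the fact that sinθ cannot exceed 1, so planes with spacing smaller than λ/2 give no reflection at that wavelength.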

θ = angle of incidence; d = interplanar spacing

Fig. 7.1 Bragg’s Interpretation of X-ray Scattering

7.1.2.2 Unit Cell Content (Z)

The unit cell content, Z, which is the number of molecules in the cell, can be determined from the unit cell parameters:

$$Z = \frac{V \times d_o \times N_A}{F} \qquad (7.2)$$

(V = volume of the unit cell; do = measured density of the crystal; NA = Avogadro’s number; F = formula mass).
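As a worked check, the sketch below evaluates the unit cell content under the usual relation Z = V × do × NA / F, where NA is Avogadro’s number. The function name is hypothetical, and the sodium chloride values (cell edge about 5.64 Å, density about 2.165 g/cm³, formula mass about 58.44) are rounded textbook figures supplied here as an illustration.

```python
AVOGADRO = 6.02214076e23  # formula units per mole


def unit_cell_content(volume_A3, density_g_cm3, formula_mass):
    """Number of formula units per unit cell, Z = V * d_o * N_A / F.

    volume_A3 is given in cubic angstroms; 1 Å^3 = 1e-24 cm^3.
    """
    volume_cm3 = volume_A3 * 1e-24
    return volume_cm3 * density_g_cm3 * AVOGADRO / formula_mass


# Sodium chloride: cubic cell, so V = a^3
z = unit_cell_content(5.64 ** 3, 2.165, 58.44)
print(f"Z = {z:.2f}")
```

Because Z must be a whole number, the computed value is rounded to the nearest integer; a result far from an integer usually points to a wrong density, formula mass (e.g. unaccounted solvent) or cell volume.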

7.1.3 Structure Determination

Bragg’s equation is useful only for obtaining the unit cell parameters and geometry (external morphology) from the spatial separations of the diffraction spots (recorded photographically, by counter or by other methods). It gives no information about the structural architecture, namely, the positional coordinates of the atoms and molecule(s) in the unit cell.

Structure determination requires a deep understanding of wave properties, as X-ray diffraction is a wave phenomenon. A wave is characterized by two parameters, (i) amplitude and (ii) phase, and both quantities must be available for each reflection, hkl (diffraction spot), for the image reconstruction process.

7.1.3.1 The “Phase Problem”

All X-ray diffraction techniques record only intensity data, that is, the squares of the amplitudes of the reflections (hkl), but not the phases. The amplitudes of the diffraction spots, hkl, can thus be obtained from the intensities, but X-ray diffraction data cannot provide phase information, and resolving the “phase problem” is central and crucial in X-ray crystallography. As the phases of the diffracted spots cannot be measured directly, mathematical procedures are invoked to resolve the “phase problem” in the image reconstruction. The transformation of a diffraction pattern (DP) into an image of the object is carried out by Fourier transformation (FT) methods.

FT of the DP ⇔ Structure (7.3)

The phase difference, α, between a ray reflected by an atom at the origin of the unit cell and one reflected by the jth atom at position (xj, yj, zj) is given by

α = 2π(hxj + kyj + lzj) (7.4)

If the scattering power of the jth atom is fj, then the sum of the contributions from all N atoms in the unit cell to the Bragg reflection, hkl, is given by the structure factor Fhkl

$$F_{hkl} = \sum_{j=1}^{N} f_j \exp\left[2\pi i\,(hx_j + ky_j + lz_j)\right] \qquad (7.5)$$

Fhkl is a complex quantity representing both the amplitude and phase of individual reflections, hkl (Fig. 7.2).

$$\vec{F}_{hkl} = |F_{hkl}| \exp(i\alpha_{hkl}) \qquad (7.6)$$

(|Fhkl| = amplitude; αhkl = phase of the reflection hkl).
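The structure factor can be evaluated directly from a list of atoms by summing fj exp[2πi(hxj + kyj + lzj)] over the unit cell. The Python sketch below does this for a hypothetical body-centred cell with two identical atoms; the function name, the scattering power f = 10 and the chosen reflections are assumptions made for illustration only.

```python
import cmath


def structure_factor(hkl, atoms):
    """F_hkl = sum_j f_j exp[2 pi i (h x_j + k y_j + l z_j)].

    atoms is a list of (f, x, y, z) tuples with fractional coordinates.
    """
    h, k, l = hkl
    return sum(f * cmath.exp(2j * cmath.pi * (h * x + k * y + l * z))
               for f, x, y, z in atoms)


# Hypothetical body-centred cell: identical atoms at the corner and at the centre
atoms = [(10.0, 0.0, 0.0, 0.0), (10.0, 0.5, 0.5, 0.5)]

for hkl in [(1, 1, 0), (1, 0, 0), (2, 1, 1)]:
    F = structure_factor(hkl, atoms)
    amplitude = abs(F)
    if amplitude > 1e-9:
        print(hkl, f"|F| = {amplitude:.2f}, phase = {cmath.phase(F):.2f} rad")
    else:
        print(hkl, "|F| = 0.00 (reflection absent)")
```

In this arrangement the two contributions add when h + k + l is even and cancel when it is odd, which shows how the positions of atoms in the cell are encoded in the amplitudes and phases of the reflections.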

Fig. 7.2 Representation of the Structure Factor, Fhkl

X-rays are scattered by electrons, and the representation of the structure is in essence the electron density distribution of the molecule(s) in the unit cell. Since the electron density in a crystal varies continuously and periodically, the electron density, ρxyz, for a unit cell of volume, V, is

$$\rho_{xyz} = \frac{1}{V} \sum_{h} \sum_{k} \sum_{l} |F_{hkl}| \exp(i\alpha_{hkl}) \times \exp\left[-2\pi i\,(hx + ky + lz)\right] \qquad (7.7)$$

Accordingly, the electron density at any point in the unit cell can be computed from a Fourier series whose coefficients are the structure factors of the diffraction pattern. Both the amplitude |Fhkl| and the phase αhkl of each reflection (hkl) are needed to evaluate equation (7.7). The major effort in X-ray crystallography is to obtain the ‘phase’ information and to carry out structure determination by iterative methods. Patterson methods, direct methods, molecular replacement, and anomalous dispersion are some of the standard methods for resolving the phase problem.
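The need for phases can be seen in a one-dimensional toy calculation. The sketch below, which is only an illustration under assumed values (two hypothetical atoms in a cell of unit length, reflections h = −10 to +10), first computes structure factors from the atoms, then rebuilds the density by a Fourier series, once with the full complex coefficients and once with amplitudes alone.

```python
import cmath

ATOMS = [(6.0, 0.20), (8.0, 0.55)]   # hypothetical (f_j, x_j) pairs in a 1D cell
H_MAX = 10                           # reflections h = -10 ... +10
GRID = 100                           # sampling points along the cell


def structure_factor(h):
    """1D structure factor F_h = sum_j f_j exp(2 pi i h x_j)."""
    return sum(f * cmath.exp(2j * cmath.pi * h * x) for f, x in ATOMS)


def density(coeffs, length=1.0):
    """1D Fourier synthesis rho(x) = (1/length) sum_h F_h exp(-2 pi i h x)."""
    rho = []
    for i in range(GRID):
        x = i / GRID
        total = sum(F * cmath.exp(-2j * cmath.pi * h * x) for h, F in coeffs.items())
        rho.append(total.real / length)
    return rho


def peak(rho):
    """Fractional coordinate of the highest point of the map."""
    return max(range(GRID), key=lambda i: rho[i]) / GRID


F = {h: structure_factor(h) for h in range(-H_MAX, H_MAX + 1)}
rho_true = density(F)                                        # amplitudes and phases
rho_nophase = density({h: abs(Fh) for h, Fh in F.items()})   # amplitudes only

print("highest peak with phases:    x =", peak(rho_true))
print("highest peak without phases: x =", peak(rho_nophase))
```

With the phases included, the highest peak falls on the heavier atom; with all phases discarded, the map instead peaks at the origin and no longer shows the atoms at their true positions.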

7.1.3.1.1 Patterson Method: In the Patterson method (also known as the heavy-atom method), the intensities, Ihkl, of the reflections (hkl) are used as coefficients to compute Patterson syntheses, which need no phase information (in effect, the phase of every term is set to zero). The method is useful for structures that contain a few heavy atoms. In macromolecular crystallography it is applied in the form of isomorphous replacement.

7.1.3.1.2 Direct Methods: In these methods, starting phase sets are computed mathematically (statistically) from the intensity data, by establishing internal phase relationships between the reflections of the diffraction pattern. These methods have become standard in the structure elucidation of small molecules (structures with ~50 non-hydrogen atoms).

7.1.3.1.3 Molecular Replacement Method: This method is often employed in macromolecular crystallography. A molecule (or part of one) whose structure is known is rotated and translated in the unit cell to obtain the spatial position (initial coordinates) of the molecule whose structure is to be determined.

7.1.3.1.4 Anomalous Dispersion Method: Though the feasibility of the anomalous dispersion method for structure elucidation of macromolecules was proposed by Ramachandran and his group (Madras group), its full potential in resolving the phase problem has been realized only with the latest developments in macromolecular crystallography. From the wide-band wavelength spectrum obtainable from a synchrotron source, any desired wavelength(s) can be selected to suit the absorption edge(s) of the absorber(s) and so enhance the effectiveness of the anomalous dispersion method. This is the basis of the multi-wavelength anomalous dispersion (MAD) procedure for resolving the phase problem in macromolecular crystallography.

7.1.4 Structure Refinement

With a set of calculated phases, αc, electron density maps are computed with the observed amplitudes, |Fo|, and the model is refined (by least-squares or other procedures) to obtain a better fit between the observed (Fo) and calculated (Fc) structure factors. The procedure is iterative. The goodness of fit of the refinement is monitored by several standard statistical measures, one of them being the reliability index, R. A low R value is an indication of a well-refined structure.

$$R = \frac{\sum \bigl|\,|F_o| - |F_c|\,\bigr|}{\sum |F_o|} \qquad (7.8)$$
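The reliability index is conventionally the sum of the absolute differences between observed and calculated amplitudes, divided by the sum of the observed amplitudes. A minimal Python sketch of that calculation follows; the function name and the three amplitude pairs are hypothetical values chosen for illustration.

```python
def r_factor(f_obs, f_calc):
    """Reliability index R = sum| |Fo| - |Fc| | / sum|Fo| over all reflections."""
    if len(f_obs) != len(f_calc):
        raise ValueError("need one calculated amplitude per observed amplitude")
    numerator = sum(abs(abs(fo) - abs(fc)) for fo, fc in zip(f_obs, f_calc))
    denominator = sum(abs(fo) for fo in f_obs)
    return numerator / denominator


# Hypothetical amplitudes for three reflections
f_obs = [100.0, 50.0, 25.0]
f_calc = [90.0, 55.0, 25.0]
print(f"R = {r_factor(f_obs, f_calc):.3f}")
```

A perfect model would give R = 0; in an iterative refinement the value is recomputed after each cycle and should fall as the model improves.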

The flowchart (Fig. 7.3) highlights the stepwise procedures followed in single-crystal X-ray crystallography.

7.1.5 Fibrous Macromolecules

A major class of biomolecules (e.g. DNA, RNA, collagen and keratin group proteins) occurs in the fibrous (not single-crystalline) state. As fibers exhibit periodicity only along the fiber axis (z-axis), the information content of fiber diffraction data is meager. The entire intensity data (comprising layer lines of diffuse spots and streaks) can be recorded on a single photograph, and the data are insufficient to solve the structure ab initio.

Many fibrous molecules are helical, with general structural characteristics: radius, rise per residue and pitch. The repeating units (averaged out) along the fiber axis are nucleic acid bases in nucleic acids, amino acid side-chains in proteins and protein subunits in viruses. The theory of helical fiber diffraction, represented in cylindrical coordinates, leads to a Fourier-Bessel transformation of the diffraction pattern representing the structure.

(iii) Intensity maxima are confined to layer lines only (no discrete reflections).
(iv) Due to mirror and cylindrical symmetries, the diffraction photograph shows a characteristic X pattern (e.g. DNA diffraction photograph).

The stepwise procedures followed in X-ray structure determination of fibrous molecules are as in the flowchart (Fig. 7.4).

Fig. 7.4 Flowchart for X-ray Structure Determination of Fibrous Molecules

7.2 NUCLEAR MAGNETIC RESONANCE (NMR) SPECTROSCOPY

Atomic nuclei with odd mass number have half-integral spin (I = ½, 3/2, 5/2, …) and give rise to nuclear magnetic resonance (NMR). When subjected to an external magnetic field, NMR-sensitive atomic nuclei, such as 1H, 13C, 15N, and 31P, behave like tiny bar magnets, precessing around the axis (z-axis) of the magnetic field with frequency, ν.

$$\nu = \frac{\gamma B_0}{2\pi} \qquad \text{Larmor frequency} \quad (7.11)$$

NMR transitions between adjacent nuclear spin states are induced by the application of an oscillating magnetic field in the radio frequency region (r.f. field), perpendicular to the z-axis (in the azimuthal x-y plane), and resonance absorption occurs when

$$\Delta E = h\nu = \frac{\gamma h B_0}{2\pi} \qquad (7.12)$$

(B0 = applied magnetic field; γ = nuclear gyromagnetic ratio; ∆E = energy difference; h = Planck constant).
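For a sense of scale, the sketch below evaluates the Larmor relation ν = γB0/2π and the corresponding energy gap ∆E = hν for protons. The function names and the 14.1 T field are illustrative assumptions; the proton gyromagnetic ratio and the Planck constant are standard values.

```python
import math

PLANCK = 6.62607015e-34      # Planck constant, J s
GAMMA_1H = 2.6752218744e8    # proton gyromagnetic ratio, rad s^-1 T^-1


def larmor_frequency(b0, gamma=GAMMA_1H):
    """Precession frequency nu = gamma * B0 / (2 pi), in Hz, for a field b0 in tesla."""
    return gamma * b0 / (2.0 * math.pi)


def transition_energy(b0, gamma=GAMMA_1H):
    """Energy gap between adjacent spin states, delta E = h * nu, in joules."""
    return PLANCK * larmor_frequency(b0, gamma)


b0 = 14.1  # tesla; roughly the field of a nominal "600 MHz" spectrometer
print(f"nu = {larmor_frequency(b0) / 1e6:.1f} MHz")
print(f"dE = {transition_energy(b0):.3e} J")
```

The resonance falls in the radio frequency range, and the energy gap is minute compared with thermal energy at room temperature, which is one reason why NMR is an inherently insensitive technique.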

The strength of an NMR signal depends not only on the spin of the isotope but also on its natural abundance. With 99.98% abundance, 1H is the most sensitive nucleus, which makes 1H-NMR spectroscopy the most sensitive NMR technique.

7.2.1 The “Chemical Shift”

Atomic nuclei are shielded from the external magnetic field by the surrounding electrons, and the resulting displacement of the resonance frequency is called the “chemical shift”. In molecules the nuclei of an element exist in different environments, and thus give rise to different chemical shifts. For example, the distribution of particular nuclei (say 1H) among different chemical environments (in CH, CH2, CH3 groups etc.) in a molecule is revealed by their chemical shifts. Chemical shift is the basis of NMR spectroscopy and is the most important parameter in NMR structural studies. Chemical shift, δ, is quoted with respect to a standard reference frequency.

$$\delta = \frac{\nu_s - \nu_r}{\nu_r} \times 10^6 \approx (\sigma_r - \sigma_s) \times 10^6 \qquad (7.13)$$

(νs and νr are the resonance frequencies, and σs and σr the shielding constants, of the sample and reference compounds, respectively).
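The chemical shift is a frequency difference expressed as a fraction of the reference frequency, in parts per million. The following sketch, with a hypothetical function name and made-up frequencies, shows why this scale is independent of the spectrometer field.

```python
def chemical_shift_ppm(nu_sample, nu_reference):
    """delta = (nu_s - nu_r) / nu_r * 1e6, in parts per million."""
    return (nu_sample - nu_reference) / nu_reference * 1e6


# Hypothetical proton resonating 1200 Hz above the reference on a 600 MHz instrument
print(f"delta = {chemical_shift_ppm(600.0e6 + 1200.0, 600.0e6):.2f} ppm")

# The same proton on a 300 MHz instrument is offset by only 600 Hz
print(f"delta = {chemical_shift_ppm(300.0e6 + 600.0, 300.0e6):.2f} ppm")
```

Both calls describe the same nucleus: the offset in hertz doubles with the field, but the value in ppm is unchanged, so shifts measured on different instruments can be compared directly.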

7.2.2 NMR Spectra

Multidimensional NMR spectroscopy has emerged as a complementary technique to X-ray crystallography in the three-dimensional structure determination of biomolecules. NMR analysis of a stable pure protein (< 20,000 Daltons) in solution is possible, but the results are more ambiguous than those from X-ray crystallography. While X-ray based studies provide a direct map of the electron densities in different parts of a molecule in a crystal, in NMR this information is obtained in an indirect manner, through measurement of “chemical shifts” and coupling constants.

An NMR spectrum is a plot of resonance frequencies and their intensities. Many of the effects that occur in NMR experiments are determined by the behavior of the magnetization vector, M, which represents the net magnetization. The magnetization vector has two components: the longitudinal component, Mz, and the azimuthal (transverse) component, Mxy. At thermal equilibrium, Mz = M0 and Mxy = 0 (Fig. 7.5). Applying an rf pulse (in the x-y plane) perturbs the equilibrium state of magnetization; after a 90° pulse, for example, Mz = 0 and Mxy ≠ 0. Once the rf field is switched off, M returns to its thermal equilibrium state, and the process is called relaxation (spin-lattice and spin-spin). By choosing the time duration, t, of the pulse, it is possible to turn the magnetization vector M in any desired direction. This is the underlying principle of all NMR spectroscopic experiments.

Fig. 7.5 Magnetization Vector, M, at Thermal Equilibrium

7.2.2.1 Free Induction Decay (FID)

The sensitivity of NMR spectra can be increased dramatically if short, intense radio frequency pulses replace the slow frequency sweep. The pulses cause a signal to be emitted by the nuclei. This signal is measured as a function of time after the pulse. The decay profile of the pulsed NMR signal, which is time-dependent (signal amplitudes as a function of time), is called the free induction decay (FID). FIDs can be converted by Fourier transformation to a frequency-domain spectrum, giving amplitudes as a function of frequency. This is the basis of modern NMR spectroscopy, called FT-NMR.
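The conversion of an FID into a spectrum can be sketched in a few lines of Python. The example below is a toy calculation, not an account of how any particular spectrometer works: it synthesizes one exponentially decaying complex oscillation (all parameter values are hypothetical, and quadrature detection is assumed) and applies a plain discrete Fourier transform.

```python
import cmath
import math

N = 256        # number of sampled points
DT = 0.001     # sampling interval, s
F0 = 125.0     # hypothetical resonance offset, Hz
T2 = 0.1       # hypothetical decay time constant, s

# Time-domain signal: a decaying complex oscillation
fid = [cmath.exp(2j * math.pi * F0 * n * DT) * math.exp(-n * DT / T2)
       for n in range(N)]


def dft(signal):
    """Plain discrete Fourier transform; O(N^2), but needs no external library."""
    size = len(signal)
    return [sum(s * cmath.exp(-2j * math.pi * k * n / size)
                for n, s in enumerate(signal))
            for k in range(size)]


spectrum = dft(fid)

# Bin k (for k < N/2) corresponds to a frequency of k / (N * DT) Hz
peak_bin = max(range(N), key=lambda k: abs(spectrum[k]))
peak_hz = peak_bin / (N * DT)
print(f"peak at {peak_hz:.1f} Hz")
```

The faster the FID decays (shorter T2), the broader the line in the frequency domain; practical software uses the fast Fourier transform rather than this direct summation.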

7.2.3 NMR Experiments

The spectra of complex biomolecules contain a large number of peaks, many of which are close together or overlap. Higher-field magnets, that is, higher-frequency (e.g. 600 MHz) instruments, offer better peak resolution, enabling analysis of larger and larger molecules. Also, in NMR, sensitivity increases almost with the square of the magnetic field, so when the magnetic field strength is doubled, sensitivity increases about fourfold. Data can thus be acquired faster or, alternatively, samples can be run at lower concentrations in the same experimental time, which is particularly important in the study of large biomolecules.

NMR spectra contain information about the structure of molecules through the chemical shift and spin-spin coupling, which are sensitive to the local electronic environment, and through the nuclear Overhauser effect (NOE), which arises from the interaction between the dipole moments of two nuclei in spatial proximity and thus provides information about the distance between nuclei and is sensitive to the positions of neighboring spins. There is a variety of NMR experiments (1D-, 2D- and 3D-NMR) for elucidating the three-dimensional structures of macromolecules (< 20,000 Daltons) in solution (Fig. 7.6).

Fig. 7.6 Schematic of Processes in NMR Experiments

In 1D-NMR, the only variable is the acquisition time, and the spectrum is a plot of amplitude as a function of frequency, ν. Multi-dimensional NMR experiments substantially reduce the overlap problems encountered in 1D-NMR spectra by effectively spreading the signals over several dimensions. To extend the analysis to multi-dimensional NMR, additional time-periods are introduced. Thus, multi-dimensional NMR experiments are composed of four sections: preparation, evolution, mixing, and detection (Fig. 7.6). The time-domain signals (FIDs) are Fourier transformed into frequency-domain spectra S(ω1, ω2) and are plotted as contour maps.

In 2D-NMR, a second time-period is added. The evolution time, t1, is increased in equal increments, and for each t1, a complete signal is collected as a function of the detection time, t2 (t1 < t2). That is, the process is repeated for several values of t1, and a set of FIDs is collected instead of the single FID of 1D-NMR. The 2D-NMR spectrum is thus a plot of amplitude as a function of two frequencies, ν1 and ν2. In a 2D-NMR spectrum, the diagonal peaks correspond to a normal 1D-NMR spectrum. The off-diagonal peaks result from interactions between hydrogen atoms (protons) that are close to each other in space. In a 3D-NMR spectrum there are correlations of three different frequencies generated through the two different mixing times of the experiment.

7.2.3.1 Mixing Mechanisms

The variety of 2D- and 3D-NMR experiments arises from the different mixing mechanisms. There are two fundamentally different mixing mechanisms employed in NMR experiments: COSY and NOESY (Fig. 7.7).

Fig. 7.7 2D-COSY and NOESY Pulse Sequences

Correlated spectroscopy (COSY) relies on the coherent transfer of information (spin-spin coupling) through chemical bonds. A COSY spectrum gives peaks between hydrogen atoms (protons) that are covalently connected. A very useful modification of COSY is to select only coherence that was transferred at some point in the experiment. This is achieved by applying a ‘double quantum filter’ (double quantum filtered COSY, DQF-COSY).

In the solution phase, because of rapid Brownian motion, the dipolar interaction averages to zero on the NMR time scale. Thus there is no splitting of lines due to this interaction. However, the dipole-dipole interaction does contribute to the relaxation properties of the spin systems, and affects the line widths and the transfer of polarization from one spin system to another when a spin system is perturbed. This is the genesis of the nuclear Overhauser effect (NOE), which describes the change in intensity of one resonance line when another line in the spectrum is perturbed by RF radiation. A NOESY spectrum gives peaks between hydrogen atoms (protons) that are close together in space. The magnitude of the NOE is proportional to d–6, where d is the internuclear distance, and for an NOE to occur the nuclei should be close to one another in space (< 5.0 Å apart). That is, nuclear Overhauser effect spectroscopy (NOESY) is a through-space, relaxation-mediated information transfer process, which provides internuclear distance information (< 5.0 Å).
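Because the NOE falls off as the inverse sixth power of the internuclear distance, an unknown distance can be estimated from the ratio of its NOE intensity to that of a proton pair of known separation. The Python sketch below illustrates this calibration; the function name, the 2.5 Å reference distance and the intensities are hypothetical, and in practice such estimates are treated as approximate upper bounds rather than exact distances.

```python
def noe_distance(intensity, ref_intensity, ref_distance):
    """Estimate an internuclear distance from the d^-6 dependence of the NOE.

    I / I_ref = (d_ref / d)^6   =>   d = d_ref * (I_ref / I)^(1/6)
    """
    if intensity <= 0 or ref_intensity <= 0:
        raise ValueError("intensities must be positive")
    return ref_distance * (ref_intensity / intensity) ** (1.0 / 6.0)


# Hypothetical calibration: a proton pair 2.5 Å apart gives unit intensity
for intensity in (1.0, 0.25, 1.0 / 64.0):
    d = noe_distance(intensity, 1.0, 2.5)
    print(f"relative NOE {intensity:.4f} -> d = {d:.2f} Å")
```

The steep distance dependence is why the effect is observable only for nuclei closer than about 5 Å: doubling the distance reduces the NOE by a factor of 64.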

7.2.4 Structure Determination

While X-ray based structure elucidation procedures provide a direct map of electron densities in different parts of a molecule in the crystalline state, in structure elucidation by NMR this information is obtained in an indirect manner, through measurements of chemical shifts and coupling constants. The electron densities depend on the nature of the chemical bonds (single, double or triple, polar or non-polar). In diamagnetic systems (isolated molecules with no unpaired electrons), the chemical shift of a nucleus reflects the electron density distribution around the nucleus. Nuclei separated by a double or triple bond have greater s-electron densities and consequently larger coupling constants than those separated by a single bond. Similarly, a proton in the vicinity of an electron-withdrawing group has lower electron density around it and is consequently less shielded.

The structure determination rests on the accurate assignment of resonances, and on fitting a model structure to the experimental data from these contour maps by various algorithms. NMR spectroscopy can generate a variety of distance constraints, which can be used to compute the three-dimensional structures of biomolecules. The general strategy for three-dimensional structure determination by multi-dimensional NMR consists of basically two steps: (i) sequence-specific resonance assignments, and (ii) quantification of the interactions between the assigned nuclei. Multi-dimensional NMR spectra contain information about the interactions of protons (H) that are covalently linked (COSY spectra, in which the off-diagonal peaks of a 2D spectrum give the correlations between protons in different parts of the molecule), or that are not covalently linked but are close in space (NOESY spectra, which provide internuclear distance information).

The process of associating specific spins in the molecule with specific resonances is called the sequence-specific assignment of resonances. This is carried out from a combination of COSY, DQF-COSY, and NOESY spectral data. The first stage of analysis is the identification of spin systems by COSY. NOESY is used to identify intra-residue, neighboring-residue and distant (in sequence) contacts. COSY and NOESY data are used to construct and refine model structures by different algorithms (distance geometry, energy minimization etc.).
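Refinement against NOE data amounts to penalizing model structures whose assigned proton pairs lie farther apart than the NOE allows. The following is a minimal sketch of such a penalty, assuming simple upper-bound restraints and a squared-violation score; the atom names, coordinates and bounds are hypothetical, and real refinement programs use more elaborate target functions.

```python
import math


def restraint_penalty(coords, restraints):
    """Sum of squared violations of upper-bound distance restraints.

    coords:     dict mapping atom name -> (x, y, z) in Å
    restraints: list of (atom_i, atom_j, upper_bound_in_Å)
    """
    total = 0.0
    for atom_i, atom_j, upper in restraints:
        excess = math.dist(coords[atom_i], coords[atom_j]) - upper
        if excess > 0:  # only distances beyond the bound are penalized
            total += excess ** 2
    return total


# Hypothetical three-proton model and two NOE-derived upper bounds.
model = {"H1": (0.0, 0.0, 0.0), "H2": (3.0, 0.0, 0.0), "H3": (0.0, 7.0, 0.0)}
noe_restraints = [("H1", "H2", 5.0), ("H1", "H3", 5.0)]

print(restraint_penalty(model, noe_restraints))
```

A minimizer would adjust the coordinates to drive this penalty toward zero while keeping bond lengths and angles chemically reasonable.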

For biological macromolecules, 1H-NMR is the method of choice, because

1. Protons (H) are present at many sites in biological molecules.
2. High abundance of 1H for each site.
3. 1H nucleus is the most sensitive to detect.


In nucleic acids, identification of protons that belong to an individual base or sugar is done with COSY. The association of specific bases with sugars is done using the NOESY spectrum.

In proteins, the primary information for recognizing the type of amino acid associated with each resonance comes from couplings established in DQF-COSY spectra, and identification of sequentially neighboring amino acid 1H spin systems comes from the sequential NOE connectivities. The chemical shifts of NH and CH protons can be indicative of regular secondary structure. In general, the proton resonances from the backbone (main-chain directed assignments) and from different types of protons on side chains are grouped, based on chemical shifts, and analyzed. NOESY is used for sequence-specific assignments in helix, sheet and turn regions.

Much of protein NMR spectroscopy relies on spectral correlation techniques (Heteronuclear Multiple Quantum Correlation, HMQC) using 13C or 15N nuclei (heteronuclei). Spectral editing allows a subset of an entire spectrum to be observed. Normally, one observes a subset of the 1H spectrum that has been selected based upon which nucleus the protons are attached to. The same techniques involved in spectral editing allow the measurement of heteronuclear correlations (which, for example, allows one to know which 1H nuclei are attached to which 13C or 15N).

Isotope substitution and hetero-nuclear (13C-, 15N- and 31P-NMR) methods can be used to simplify 1H-NMR spectra and also to obtain more information for sequence assignments. 13C or 15N can be used to select only the protons attached to them. 31P-NMR is also used in nucleic acid studies.

The flowchart (Fig. 7.8) gives the general steps in structure determination by NMR spectroscopy.

NMR spectroscopy can provide structural details of averaged-out conformations of molecules (< 20,000 Daltons) in solution (Fig. 7.9a). The results are more ambiguous, in contrast with the static and clearer picture of structures that would be available from X-ray crystallography (Fig. 7.9b).

Fig. 7.8 Flowchart of Steps in Structure Determination by NMR Spectroscopy


Fig. 7.9(a) Averaged Structure of Fibronectin by NMR Spectroscopy
(Ref: Bocquier, A.A. (1999), Structure, 7; 1451)
(Source: Protein Data Bank: 1QO6.pdb)

Fig. 7.9(b) Structure of Fibronectin by X-ray Crystallography
(Ref: Leahy, D.J., Aukhil, I. & Erickson, H.P. (1998), Cell, 84; 155)
(Source: Protein Data Bank: 1FNF.pdb)

Though NMR generally gives a lower-resolution structure than X-ray crystallography does, it does not require crystallization. In addition, NMR spectra can be used to establish a quantitative spectral data-activity relationship (QSDAR) between molecules and their biological binding activity. All QSDAR models yield a relationship that may be used to predict the binding activity of a molecule from its experimental or simulated spectral data alone. Binding characteristics help determine how well a drug works: how effective and selective it is, and whether it can be administered in reasonable quantities.

NMR studies can also be used for structure-based, site-specific screening. In this approach, two amino acid types are labeled with 13C and 15N. If this pair of amino acids occurs only once in the sequence, there will be only one peak in a one-dimensional/two-dimensional HNCO-type NMR spectrum. This technique allows screening for only those binders that interact with a specific site of the receptor.

7.3 IMAGING METHODS

Understanding a complex living system will require a thorough comprehension of the interactions of cells and tissues in the organism. The molecular machinery of life must be studied at all size scales from atoms to complete organisms. Extensive information about the proteins that make up the cells’ functional units can be obtained through the use of molecular biology, crystallography, and computational biology. But understanding their function within their natural environment will require examining these proteins within the cell, through all phases of cell behavior. Imaging is a very powerful unifying tool for such studies.

Scanning technology, such as that used for scanning a fluorescence-labeled gel, is conceptually quite simple. A light source excites the labeled samples, and a detector system measures and records the emitted fluorescence.

Cellular and molecular imaging will be a key tool in translating structural knowledge into better ways of diagnosing, treating, and preventing disease. Imaging can identify the kinds of molecular structures/receptors that cover the surface of a tumor, information that potentially can predict how it may behave and respond to certain treatments. Monitoring the processes and pathways inside a cell as the cell transforms from normal to cancerous will allow us to detect this change in people earlier in the cancer process, perhaps before a tumor has even had the chance to become fully malignant.

Smart contrast agents can be used as tumor markers in imaging techniques. When smart contrast agents are injected into the body, they are undetectable. However, when they come into contact with tumor-associated enzymes called proteases, the smart agents change shape and become fluorescent. The fluorescent signal can then be detected using sophisticated imaging devices.

7.3.1 Structural Imaging

Optical microscopy (e.g. confocal microscopy) and electron microscopy (e.g. scanning (SEM) and transmission (TEM)) are used in morphological imaging and cellular studies. Tomographies, namely X-ray computed tomography (CT), magnetic resonance imaging (MRI), magnetic resonance microscopy (MRM), and single photon emission computed tomography (SPECT), are new additions to the growing number of imaging technologies at cellular and molecular levels. Scanning-probe microscopies, such as scanning-tunneling and atomic-force microscopies, do not come under “microscopy” in the conventional sense. They are surface-scanning probe techniques that probe surface topography at atomic resolution.

7.3.1.1 Microscopies

Optical, scanning-probe and other novel imaging techniques, and electron microscopic methods come under microscopies.

7.3.1.1.1 Optical Microscopies: Among optical microscopies, laser scanning confocal microscopy (LSCM) is the one that is widely used in biological studies. It is a light microscopic technique in which only a small spot is illuminated and observed at a time. An image is constructed through point-by-point scanning (raster scanning) of the field (a section of the specimen), and section by section in stepwise increments along an axis (z-axis), which enables 3-D image reconstruction. Thus, confocal microscopy permits analysis of the 3-D architecture of a specimen, which cannot be achieved by conventional light microscopy. Confocal microscopy can produce three-dimensional (3-D) images of fluorescently tagged gene products to determine their distribution in the cell during different stages of the cell cycle or under various environmental conditions.


Bio-photonic imaging (employing confocal microscopy) is a novel approach to functional genomics, target validation, drug screening and preclinical testing. It uses a bioluminescent reporter gene to tag a target of interest, which can be a gene, a cell, or a microorganism, in a whole mouse. Because light passes through tissue, the labeled mouse can be anesthetized and photographed with a camera capable of detecting the bioluminescence. This method can be used to label bacteria, infect an organism, and study the effect of antibiotics on the infection, or the effects of various drugs that can modify the response to infection. In oncology, such molecular and cell-based imaging can impact directly on cancer treatment and diagnosis, and the development and testing of new molecular-based therapies would benefit substantially from advances in our ability to image specific molecular and cell processes.

Two-photon or multiphoton excitation mode fluorescence scanning microscopy is based on excitation resulting from successive or simultaneous (coherent) absorption of two or more photons by an atom or molecular entity, the energy of excitation being the sum of the energies of the photons. The capability of using near-IR excitation wavelengths gives multiphoton excitation scanning microscopy many advantages over single-photon confocal microscopy, e.g. clear three-dimensional images of biological tissues in real time with much-reduced photobleaching and photodamage to cells, since there are few intrinsic near-IR absorbing chromophores.
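Since the excitation energy is the sum of the photon energies, and photon energy is inversely proportional to wavelength, the single-photon wavelength equivalent to a multiphoton excitation follows from adding reciprocal wavelengths. A minimal sketch of this arithmetic (the function name and the example wavelengths are illustrative assumptions):

```python
def equivalent_one_photon_wavelength(wavelengths_nm):
    """Wavelength (nm) of a single photon carrying the summed energy.

    Uses E = h*c / wavelength, so the energies add as reciprocal wavelengths:
    1 / lambda_eq = sum(1 / lambda_i).
    """
    if not wavelengths_nm or any(w <= 0 for w in wavelengths_nm):
        raise ValueError("wavelengths must be positive")
    return 1.0 / sum(1.0 / w for w in wavelengths_nm)


# Two near-IR photons (assumed 800 nm each) versus three photons at 1050 nm.
print(equivalent_one_photon_wavelength([800.0, 800.0]))
print(equivalent_one_photon_wavelength([1050.0, 1050.0, 1050.0]))
```

Two 800 nm photons therefore excite a transition that would otherwise need light near 400 nm, which is how near-IR illumination can drive fluorophores that normally absorb in the blue or near-UV.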

In fluorescence-based microscopy, specimens are stained with fluorescent materials, which emit light when exposed to light. Immunofluorescence microscopy utilizes antibodies that are labeled with fluorescent dye. Fluorescence lifetime imaging microscopy (FLIM) is an imaging technique in which the mean fluorescence lifetime of a chromophore is measured at each spatially resolvable element of a microscope image. Imaging of fluorescence lifetimes enables biochemical reactions to be followed at each microscopically resolvable location within the cell.

7.3.1.1.2 Scanning-probe Microscopies (SPM): Scanning-probe microscopy (SPM) is essentially a surface-probe technique. There are no lenses in scanning-probe microscopic (SPM) methods. Instead, a “probe” tip is brought very close to the specimen surface, and the interaction of the tip with the region of the specimen immediately below it is measured. The type of interaction measured defines the type of SPM. The technique is called atomic-force microscopy (AFM) if the interaction measured is the force between the atoms at the end of the “probe” tip and the atoms in the specimen surface; it is called scanning-tunneling microscopy (STM) if the quantum mechanical current (tunneling current) is measured. These techniques provide topographic maps of the sample at atomic resolution. They are employed for characterizing surface features, biomolecular complexes, and molecular interactions.

In atomic force microscopy (AFM), a probe (force transducer) systematically rides across the surface of a sample being scanned in a raster pattern. The vertical position is recorded optically as a spring attached to the probe rises and falls in response to peaks and valleys on the surface. AFM can be used to scan conducting as well as non-conducting surfaces.

Scanning-tunneling microscopy is based on the principle of quantum mechanical tunneling of electrons: the tunneling current is measured as a function of the distance between the tip of the probe and the specimen surface. One of the disadvantages of STM is that the specimen should have a conducting surface.
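The sensitivity of the tunneling current to the tip-surface distance can be illustrated with the standard one-dimensional rectangular-barrier model, in which the current falls off as exp(-2κd). This model and the 4 eV barrier height are assumptions introduced here for illustration; they are not taken from the text.

```python
import math

HBAR = 1.054571817e-34           # reduced Planck constant, J*s
ELECTRON_MASS = 9.1093837015e-31  # kg
EV = 1.602176634e-19             # joules per electron-volt


def decay_constant(barrier_ev):
    """Decay constant kappa (per Å) for a barrier height given in eV."""
    return math.sqrt(2.0 * ELECTRON_MASS * barrier_ev * EV) / HBAR * 1e-10


def relative_current(gap_angstrom, barrier_ev=4.0):
    """Tunneling current relative to zero gap: exp(-2 * kappa * d)."""
    return math.exp(-2.0 * decay_constant(barrier_ev) * gap_angstrom)


# Assumed 4 eV barrier: compare the current at gaps of 4 Å and 5 Å.
ratio = relative_current(5.0) / relative_current(4.0)
print(f"kappa = {decay_constant(4.0):.3f} per Å, current ratio = {ratio:.3f}")
```

Under these assumptions the current drops by nearly an order of magnitude for each additional ångström of gap, which is what gives STM its height sensitivity at the atomic scale.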


Near-field scanning optical microscopy (NSOM), also known as scanning near-field optical microscopy (SNOM), is an aperture-less imaging method. It employs near-field probing rather than beam focusing, and thus circumvents the resolution limit imposed by diffraction effects, which are common to all aperture-based imaging techniques.

7.3.1.1.3 Novel Imaging Techniques: X-ray microscopy is an imaging technique that employs Fresnel zone plates for focusing scattered X-rays. The resolution (~ 100 Å) is about ten times better than in optical microscopy, and is ideal for imaging cells and sub-cellular particles.

Coherent anti-Stokes Raman scattering (CARS) microscopy, based on Raman spectroscopy, allows one to localize specific types of molecules inside living cells without tagging them with fluorescent dyes or genetic modifications. In this technique, two laser beams are sent into the cell, at frequencies that differ by exactly the frequency at which a particular type of chemical bond (e.g. C–H bond) in the cell is excited. The excited bond vibrates, emitting a frequency characteristic of its vibrational mode, thus enabling a point-by-point chemical map of the cell to be imaged (chemical microscopy).

Magnetic resonance microscopy (MRM) is a noninvasive and nondestructive imaging tool. MRM allows live cells to be studied simultaneously, providing a necessary link between cellular response and molecular information on proteins and other molecules involved in a certain cellular event.

Surface plasmon resonance (SPR) microscopy is potentially useful for the study of molecular events in cell membranes, such as transport and trafficking processes involving the membrane.

7.3.1.1.4 Electron Microscopies: Electron microscopic methods provide ultrastructural details of biological specimens. Whereas the wavelength of visible light limits the resolution of light microscopy to hundreds of nanometers, the wavelength of intermediate-voltage electrons is only a fraction of an angstrom. It is, therefore, possible to determine the 3-D structure of proteins in a biological specimen, such as supramolecular assemblies, organelles, cells, and tissues, using low-dose electron beam intensity and recordings from a large number of view angles in a transmission electron microscope (TEM). This technology is termed electron tomography. Electron tomography works essentially like a CT scan: a computer constructs a 3-D image of a flash-frozen cell from a series of image “slices” created by penetrating electron beams. Scanning electron microscopy (SEM) is essentially a scanning-probe microscopy (with electron optics). The specimen is scanned in raster fashion by a fine-focus electron beam, and the back-scattered secondary electrons are collected and processed electronically in image reconstruction methodologies. Unlike in TEM, the specimen preparation in SEM is not that stringent, and SEM can be used to analyze thicker specimens.
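The claim that electron wavelengths are a fraction of an angstrom can be checked with the relativistic de Broglie relation, λ = h / sqrt(2·m·eV·(1 + eV / (2·m·c²))). The sketch below evaluates it for a few accelerating voltages; the particular voltages are chosen for illustration and are not specified in the text.

```python
import math

PLANCK = 6.62607015e-34           # J*s
ELECTRON_MASS = 9.1093837015e-31  # kg
ELEMENTARY_CHARGE = 1.602176634e-19  # C
LIGHT_SPEED = 299792458.0         # m/s


def electron_wavelength_angstrom(voltage):
    """Relativistic de Broglie wavelength (Å) for an accelerating voltage in volts."""
    kinetic = ELEMENTARY_CHARGE * voltage  # kinetic energy in joules
    rest = ELECTRON_MASS * LIGHT_SPEED ** 2  # rest energy in joules
    momentum = math.sqrt(2.0 * ELECTRON_MASS * kinetic * (1.0 + kinetic / (2.0 * rest)))
    return PLANCK / momentum * 1e10


# Illustrative accelerating voltages (assumed values).
for kilovolts in (100, 200, 300):
    wavelength = electron_wavelength_angstrom(kilovolts * 1e3)
    print(f"{kilovolts} kV -> {wavelength:.4f} Å")
```

The wavelengths come out at a few hundredths of an angstrom, so in practice TEM resolution is limited by lens aberrations and radiation damage rather than by the electron wavelength itself.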

7.3.1.2 Tomographies

The molecular imaging approach fuses the disciplines of molecular biology, genetic engineering, immunology, cytology, and biochemistry with imaging. Integration of imaging techniques with computers has paved the way for three-dimensional description of the physiological and biochemical processes in healthy and defective organs. Advances in MRI/MR spectroscopy, MR microscopy (MRM), positron emission tomography (PET) and single photon emission computed tomography (SPECT) are used to evaluate normal and abnormal tissue metabolism and perfusion in response to genetic, physiological, or therapeutic challenges.


X-ray computed tomography (CT) images regional structure or anatomy by combining X-ray imaging with computer processing and presentation, to provide 3-D structural images of internal organs with greater clarity.

Magnetic Resonance Imaging (MRI) is a non-invasive method of imaging internal anatomy. The method rests on spatially localizing the nuclear spins by applying a linear magnetic field gradient along an axis, so that the resonance frequency varies with position. The imaging offers both near-cellular (i.e. 50 micron) resolution and whole-body imaging capability.
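Spatial localization by a field gradient can be sketched with the Larmor relation f = (γ/2π)(B0 + G·z): a linear gradient makes the proton resonance frequency a linear function of position, so a measured frequency maps back to a location. The field strength and gradient below are assumed, illustrative values.

```python
GAMMA_BAR_1H = 42.577e6  # Hz per tesla: proton gyromagnetic ratio divided by 2*pi


def larmor_frequency(b0_tesla, gradient_t_per_m, position_m):
    """Proton resonance frequency (Hz) at a position along the gradient axis."""
    return GAMMA_BAR_1H * (b0_tesla + gradient_t_per_m * position_m)


def position_from_frequency(freq_hz, b0_tesla, gradient_t_per_m):
    """Invert the relation: recover the position (m) from an observed frequency."""
    return (freq_hz / GAMMA_BAR_1H - b0_tesla) / gradient_t_per_m


B0 = 1.5         # tesla (assumed main field)
GRADIENT = 0.01  # tesla per metre (assumed gradient strength)

for z in (-0.10, 0.0, 0.10):
    f = larmor_frequency(B0, GRADIENT, z)
    print(f"z = {z:+.2f} m -> f = {f / 1e6:.4f} MHz")
```

With these numbers, spins 10 cm apart differ in frequency by about 42.6 kHz, and frequency analysis of the received signal separates them.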

Single-photon emission computed tomography (SPECT) is a tomography method that uses radionuclides, which emit a single photon of a given energy. The camera is rotated 180 or 360 degrees around the patient to capture images at multiple positions along the arc. The computer is then used to reconstruct the transaxial, sagittal, and coronal images from the 3-dimensional distribution of radionuclides in the organ. SPECT can be used to observe biochemical and physiological processes as well as the size and volume of the organ.

7.3.2 Functional/Metabolic Imaging

Few techniques are available for investigating the molecular bases of human brain pathophysiology in vivo. The ability to observe both the structures and also which structures participate in specific functions is due to two new techniques called functional magnetic resonance imaging (fMRI) and positron emission tomography (PET).

fMRI provides high-resolution, noninvasive reports of neural activity detected by a blood oxygen level-dependent signal. fMRI techniques, with or without contrast agents, provide imaging researchers with the ability to study organ and cellular perfusion and to map cerebral function. The ability of functional/metabolic imaging studies to monitor pharmacological alterations may provide the basis for future testing of new drugs for the treatment of many diseases: malignancies, heart disease, Alzheimer disease, cerebrovascular diseases, multiple sclerosis, AIDS and others.

The positron emission tomography (PET) technique is based on the principle that the position of a positron-emitting radionuclide can be precisely determined, because annihilation of the positron-electron pair results in the emission of two gamma photons in diametrically opposite directions. PET builds images by detecting energy given off by decaying radioactive isotopes. The technique is complementary to anatomic imaging modalities such as computed tomography (CT) and magnetic resonance imaging (MRI). PET is used to study brain activity with a suitably labeled radionuclide. A rapid sequence of PET images of the brain would reveal the response of the brain to various chemical stimuli, and pinpoint areas of abnormal activity.

EXERCISE MODULES

1. Which are the physical techniques that provide spatial structural information?
2. What are the essential principles of three-dimensional structure determination by X-ray diffraction, and what are the advantages and limitations and why?
3. Why is the single-crystalline state imperative in X-ray diffraction studies?
4. What are the steps in the determination of unit cell parameters by X-ray crystallography?
5. What is the “phase problem” in X-ray crystallography?
6. Name some methodologies employed in resolving the “phase problem” in X-ray structure determination.
7. Explain the procedural steps followed in the X-ray structure determination of fibrous molecules.
8. What is nuclear magnetic resonance (NMR)?
9. What is “chemical shift” and what is its importance in NMR spectroscopy?
10. What kinds of data are obtained from 1D- and multi-dimensional NMR experiments?
11. What are the essentials of structure analysis by NMR spectroscopy, and what are the advantages and limitations in comparison with the X-ray diffraction methods?
12. What are the optical microscopic methods in imaging?
13. What are the advantages of multi-photon excitation microscopy over confocal microscopy?
14. What is the principle on which scanning microscopies are based?
15. What are the applications of atomic force and scanning tunneling microscopies in biology?
16. What are the uses and advantages of NSOM and CARS microscopies?
17. Which are the tomographic imaging methods used in biology and medicine?
18. Which are the two important techniques in functional imaging studies in medicine?



8 Protein-Ligand Interactions

Molecular associations/interactions are the basis of the transformation and regulation of genetic information and of all cellular actions and biochemical reactions, such as cell-cell recognition, neuronal signaling, hormonal action, and protein and enzyme functions. Central to such molecular associations/interactions are protein-nucleic acid, protein-carbohydrate, and protein-lipid interactions. Knowledge of protein structure often plays a crucial role in functional identification and characterization.

8.1 PROTEIN-NUCLEIC ACID INTERACTIONS

Molecular interactions/associations are governed by:

(a) The size and shape of the ligand that imposes steric hindrance.
(b) Electronic potential at the site of interaction.

Molecular interactions that occur in protein-nucleic acid complexes are between

(i) Protein side chains and DNA backbone (50%).
(ii) Protein side chains and DNA side chains (30%).
(iii) Protein backbone and DNA backbone.
(iv) Protein backbone and DNA side chain (1%).

These types of interactions can also serve as general types of interactions in all macromolecular associations. The interactions can be non-specific, such as those found in histone-DNA association in chromatin, as well as specific, as found in restriction endonuclease-DNA complexes. While non-specific interactions enable docking of the interacting molecular moieties, the specific interactions enable sequence-specific associations. Molecular interactions between functional groups can be classified under (1) electrostatic, (2) hydrogen bonding and (3) intercalation interactions.

8.1.1 Electrostatic Interactions

Electrostatic interactions in protein-nucleic acid complexes occur between positively charged side chains of proteins (e.g. lysyl, arginyl) and negatively charged phosphate groups of the nucleic acid backbone. Electrostatic interactions are also mediated by metal ions. The electrostatic potential of the base-pair moiety plays an important role in protein-nucleic acid interactions in the major and minor grooves (Fig. 8.1).
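The strength of such a charge-charge contact can be estimated with Coulomb's law in the units commonly used in molecular modeling, E = 332.06·q1·q2 / (ε·r) kcal/mol with r in Å. The sketch below treats a lysine side chain and a phosphate group as point charges of +1 and −1; the 3 Å separation and the dielectric constants are illustrative assumptions.

```python
COULOMB_CONSTANT = 332.06  # kcal*Å/(mol*e^2), Coulomb constant in modeling units


def coulomb_energy(q1, q2, distance_angstrom, dielectric=80.0):
    """Point-charge interaction energy in kcal/mol.

    q1, q2: charges in units of the elementary charge
    dielectric: relative permittivity of the medium (80 approximates bulk water)
    """
    if distance_angstrom <= 0:
        raise ValueError("distance must be positive")
    return COULOMB_CONSTANT * q1 * q2 / (dielectric * distance_angstrom)


# Hypothetical Lys(+1)...phosphate(-1) contact at 3 Å in two environments.
for eps in (80.0, 4.0):
    energy = coulomb_energy(+1, -1, 3.0, dielectric=eps)
    print(f"dielectric {eps:4.1f}: E = {energy:.2f} kcal/mol")
```

The same contact is far more stabilizing in a low-dielectric environment than in bulk water, which is one reason the local environment of a binding interface matters.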


In protein-nucleic acid complexes, the aromatic side chains, such as Phe, Trp and Tyr, can interact in the major and minor grooves of nucleic acids by intercalation and are further stabilized by hydrogen bonding between the peptide and nucleic acid moieties. Such a combination of intercalation and hydrogen bonding imparts specificity in protein-nucleic acid interactions.

8.1.4 DNA-regulatory Proteins

DNA-regulatory proteins are associated with transcriptional control. These proteins bind to specific DNA sequences and thus help in switching on or off genetic coding as required. Most of these proteins bind in the major groove of the DNA. Many of them have an ordered organization of secondary structures (super-secondary structures) that form distinct structural motifs (e.g. helix-turn-helix, zinc-finger and leucine-zipper motifs).

8.1.4.1 “Helix-turn-Helix” Motif

Many of the prokaryotic transcriptional regulatory proteins have the helix-turn-helix (HTH) structural motif (Figs. 8.2, 8.3, 8.4 & 8.5). The HTH motif is approximately 20 residues long, with a seven-residue helix, a short turn, and a nine-residue helix (the recognition helix). The recognition helix fits into the major groove of B-DNA. The specificities of the various helix-turn-helix motifs for binding to different DNA sequences arise primarily from the different amino acid side chains that emanate from the recognition helix. The other helix lies across the major groove and makes non-specific contacts with DNA.

Fig. 8.2 Schematic of “Helix-turn-Helix” Structural Motif


Fig. 8.3 Example of “Helix-turn-Helix” Motif found in DNA-regulatory Proteins

Fig. 8.4 Structure of Lambda Repressor-Operator Complex

(Ref: Beamer, L.J. and Pabo, C.O. (1992) J Mol Biol., 227; 177)
(Source: Protein Data Bank: 1LMB.pdb)


Fig. 8.5 Structure of Phage 434 Cro Protein with “Helix-turn-Helix” Structural Motif

(Ref: Mondragon, A., Wolberger, C. & Harrison, S.C. (1989), J Mol Biol., 205; 179)
(Source: Protein Data Bank: 2CRO.pdb)

8.1.4.2 “Zinc-finger” Motifs

Several types of zinc-finger motifs have been identified. In these, the Zn2+ ion is coordinated by Cys/His residues of the protein (Fig. 8.6). Zinc-finger motifs are found not only in DNA-binding proteins but in proteins in general (Fig. 8.7), where they are involved in protein-protein interactions.

Fig. 8.6 Structure of Zif268 Protein-DNA Complex with “Zinc-finger” Structural Motif

(Ref: Erickson, M.E., et al. (1996) Structure, 4; 1171)
(Source: Protein Data Bank: 1AAY.pdb)


Fig. 8.7 Example of “Zinc-finger” Structural Motif found in Proteins

(Ref: Kowalski et al. (1999) J Biomol NMR, 13; 249)

8.1.4.3 “Leucine-zipper” Motif

The leucine-zipper motif has been found in several eukaryotic transcriptional regulatory proteins (Fig. 8.8). The motif (~ 30 amino acid residues) consists of Leu or Ile at seven-residue intervals (heptad spacing). The basic motif is

(–L–X6–L–X6–L–X6–L–X6–L–X6–)
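The heptad spacing lends itself to a simple sequence scan: look for Leu or Ile recurring at every seventh position. The sketch below uses a regular expression for this; the minimum of four repeats, the function name and the test sequence are assumptions made for illustration, and a hit only flags a candidate, since heptad Leu/Ile spacing alone does not establish a leucine zipper.

```python
import re


def find_heptad_repeats(sequence, min_leucines=4):
    """Return 0-based start positions where Leu/Ile recur every 7th residue.

    A hit at position p means L or I is found at p, p+7, p+14, ...
    for at least `min_leucines` consecutive heptad positions.
    """
    # One L/I, then (six arbitrary residues + L/I) repeated; the lookahead
    # makes overlapping candidates visible.
    pattern = re.compile(r"(?=[LI](?:.{6}[LI]){%d})" % (min_leucines - 1))
    return [match.start() for match in pattern.finditer(sequence.upper())]


# Made-up sequence with Leu at positions 3, 10, 17, 24 and 31 (0-based).
test_sequence = "MKQ" + "LEDKAEE" * 4 + "LGER"
print(find_heptad_repeats(test_sequence))
```

Raising `min_leucines` makes the scan stricter, while lowering it increases chance matches, because Leu and Ile are common residues.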

Fig. 8.8 (i) Examples of “Leucine-zipper” Structural Motif found in DNA-regulatory Proteins

(Ref: Vinson, C.R., et al. (1989) Science, 246; 211)

8.1.5 Other DNA-binding Proteins

Histones, DNA-polymerases and restriction endonucleases are some of the proteins associated with nucleic acids and their functions. Histones are basic proteins that interact with DNA


Fig. 8.10 Structure of Human TBP Core Domain-DNA Complex

(Ref: Nikolov, D.B., et al. (1996) Proc Natl Acad Sci, USA. 93; 4862)
(Source: Protein Data Bank: 1CDW.pdb)

Fig. 8.11 Structure of EcoRV Endonuclease-DNA Complex

(Ref: Kostrewa, D. & Winkler, F.K. (1995) Biochemistry 34; 683)
(Source: Protein Data Bank: 1RVA.pdb)

Protein-DNA interactions can be detected by DNA footprinting, gel shift analysis, yeast one-hybrid assays or Southwestern blots. They can also be analyzed by genetic analysis and X-ray crystallography.

Theoretical methods of analyzing molecular dynamics (MD) trajectories may potentially reveal factors governing DNA recognition by proteins. This approach presupposes that MD trajectories accurately reflect the behavior of DNA molecules in solution. Therefore, a prerequisite for accurate prediction of protein binding is that MD simulations be validated by comparison with experimental dynamics data. Nuclear magnetic resonance (NMR) experiments provide a powerful way to obtain insight into the dynamics of molecules in solution. Thus, DNA structural and dynamic features derived from NMR data are compared with MD simulations to extract motional modes that may have functional significance.

8.2 PROTEIN-PROTEIN INTERACTIONS

Protein-protein interactions underlie biological pathways, regulatory systems and signaling cascades. They play a major role in almost all relevant physiological processes occurring in living organisms, including DNA replication and transcription, RNA splicing, protein biosynthesis, and signal transduction. The molecular interactions that occur in protein-protein associations are the same as those that occur in protein-nucleic acid complexes, namely non-bonded interactions: ionic, hydrogen bonding, van der Waals, and hydrophobic interactions. Structure-function aspects can be determined by X-ray crystallography and NMR spectroscopy (Chapter 7). Physicochemical and biomolecular methods include phage display, protein arrays, immunoprecipitation assays, and the yeast two-hybrid system (a screening technique to identify genes encoding interacting proteins).

Yeast two-hybrid is an approach to studying protein-protein interactions. The basic format involves the creation of two hybrid molecules, one in which a “bait” protein is fused with one domain of a transcription factor (typically its DNA-binding domain), and one in which a “prey” protein is fused with the complementary domain (typically its transcription-activation domain). If the bait and prey proteins indeed interact, then the two domains fused to these proteins are also brought into proximity with each other. As a result, a specific signal is produced (expression of a reporter gene), indicating that an interaction has taken place.

Yeast three-hybrid is a modification of the yeast two-hybrid system. The third hybrid component may be an RNA or a small molecule that is a cell-permeable chemical inducer of dimerization. The three-hybrid system enables the detection of RNA-protein interactions in yeast using simple phenotypic assays.

8.3 PROTEIN-CARBOHYDRATE INTERACTIONS

Many proteins are covalently conjugated with carbohydrates by post-translational modification. These proteins, called glycoproteins, are classified as O-linked if the sugars are attached to the -OH groups of serine or threonine, and as N-linked if the sugars are attached to the amide nitrogen of the asparagine side chain. Glycoproteins are involved in a wide variety of biological functions. For example, variability in the composition of the carbohydrate moieties of erythrocyte glycoproteins determines blood group specificity. Carbohydrates of glycoproteins appear to act as recognition markers in various cellular processes.

8.4 PROTEIN-LIPID INTERACTIONS

Protein-lipid interactions are predominantly hydrophobic in character. The major function of lipoproteins is to aid in the storage and transport of lipid and cholesterol.

EXERCISE MODULES

1. What are the various types of protein-nucleic acid interactions?
2. What are the physicochemical parameters that govern these interactions?
3. Which are the amino acid residues involved in electrostatic interactions?
4. How do metals mediate in electrostatic interactions?


5. What are the features of protein-nucleic acid interactions in the major groove and which amino acids can take part in these interactions?
6. What are the features of protein-nucleic acid interactions in the minor groove and which amino acids can take part in these interactions?
7. Which is the major component in the specificity of protein-nucleic acid interactions?
8. What is the importance of oligodentate hydrogen bonding?
9. What is intercalation and what is its role in protein-nucleic acid interactions?
10. Which are the structural motifs in DNA-regulatory proteins?
11. What are the essential features of helix-turn-helix structural motif?
12. What are the essential features of zinc-finger structural motif?
13. What are the essential features of leucine-zipper structural motif?
14. What is the role of glycoproteins?
15. What is the role of lipoproteins?



Section III

Towards Structure Prediction (Bioinformatics-II)


9 The Protein-folding Problem

The first half of the genetic code, namely how a gene sequence is translated into a polypeptide, is fairly straightforward and has been resolved (vide: works of Khorana, Nirenberg and Matthaei). A gene sequence is read in triplets of nucleotides (codons) and translated into a protein. However, the second half of the genetic code, namely inferring the three-dimensional folding of a protein (tertiary structure) from its amino acid sequence (primary structure), is still an unresolved problem. This problem of predicting the tertiary structure of a protein from its amino acid sequence data is called the protein-folding problem; it is also known under various names, bioinformatics being the latest and most sought-after word.
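The "first half" of the code described above can be sketched in a few lines of Python. This is only an illustration: the names `CODON_TABLE` and `translate` are chosen here and are not from the text, and the sketch assumes the standard genetic code, a coding-strand DNA sequence, and a reading frame that starts at the first base.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G codon ordering.
# "*" marks the three stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding DNA strand codon by codon until a stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":  # a stop codon ends translation
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCTGGTAA"))  # Met, Ala, Trp, then a stop codon
```

No comparably short function exists for the "second half": mapping the resulting one-letter string to a three-dimensional fold is exactly the unresolved problem this chapter discusses.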

The protein-folding problem is a central theme in molecular biology and bioinformatics. There is a direct correlation between protein folding and genetic diseases. Therefore, proper knowledge of protein-folding mechanisms is fundamental to our understanding of protein function vis a vis its structure, and of genetic diseases, life processes and evolution at the molecular level.

A rational approach towards molecular modeling (molecular engineering/drug design), namely, the design of novel molecules to suit desired requirements, rests on the availability of the three-dimensional structures of the proteins that are to be engineered. The only physical techniques available to date to determine three-dimensional structures at atomic and molecular levels are single-crystal X-ray diffraction and multi-dimensional NMR spectroscopic methods. The ‘bottleneck’ encountered in bioinformatics (by experimental methods) arises out of the difficulties and delays in obtaining the tertiary structure data of macromolecules by these experimental methods, due to inherent constraints and operational restraints. A single-crystalline state of matter is a prerequisite for initiating X-ray diffraction studies, while the size (mass) of the molecule is the limiting factor for NMR spectroscopic methods. Many proteins fail to crystallize and/or cannot be obtained or dissolved in large enough quantities for NMR spectroscopic studies. In addition, these methods involve elaborate technical procedures, and the operational constraints make the existing experimental techniques of macromolecular structure determination a challenging and time-consuming task. However, these experimental techniques are crucial and are being pursued, and constant efforts are being made to minimize the operational constraints and enhance their applicability.

The primary structure (amino acid sequence) data are obtainable faster and in a more “routine” way than the tertiary structure data. With the advent of gene cloning and fast gene sequencing techniques (e.g. laser-activated fluorescence scanning), the availability of sequence data is growing at a very fast rate compared to the structural data (sequence/structure deficit ratio ~ 1,500:1). Therefore, theoretical prediction of tertiary structures of proteins from their primary structure data is an alternative approach towards molecular engineering (molecular bioinformatics). This approach, though highly complex and challenging, is an attractive and desirable one for multiple reasons:

1. To realize the full potential of rapidly growing gene sequence data.
2. For rational molecular design (drugs).
3. For molecular (protein) engineering (rationally altered proteins).
4. De novo synthesis (design) of proteins.

All the structural information necessary for proper tertiary folding is in the primary structure data. The molecular forces involved in tertiary structure stability are the same non-bonded interactions: van der Waals interactions, hydrogen bonding and hydrophobic interactions.

However, the task of predicting the tertiary structure of a protein a priori from its amino acid sequence data is not at all simple and straightforward. This is because protein folding is a highly cooperative process, involving various molecular interactions. That is, the second half of the genetic code, namely, inferring the unique spatial folding of proteins from their amino acid sequence data, still remains an unsolved problem. This complexity is referred to as the protein-folding problem.

The protein-folding problem can be addressed either by genomic or proteomic-based analyses, or by methods.

9.1 GENOMICS ANALYSIS

The genome is the sum of genes and intergenic sequences of the haploid cell. The complex and richly structured data from genomics pose the greatest encoding problem. From this perspective, the sequence data from human and other genomes provide great opportunities for theorists interested in the establishment of the complete genetic information and their structures in diverse organisms. Availability of genome sequences also provides an opportunity to explore genetic variability between organisms as well as within individual organisms. Genome sequences covering all genes of an organism enable one to identify the genes that influence metabolism, cell division, disease processes, evolutionary modeling, etc. (Fig. 9.1), as well as to manipulate gene expression.

Genomic analysis comprises both structural genomics and functional genomics. A newly identified gene in an organism can be compared to the existing database of information to find another that has similar function. Tracing the phylogenetic history of genes, gene domains and gene linkages in diverse organisms is one of the most challenging aspects of genome analysis (bioinformatics).

9.1.1 Structural Genomics

As traditionally defined, the term structural genomics refers to the use of sequencing and mapping technologies, with bioinformatic support, to develop complete genome maps (genetic, physical, and transcript maps) and to elucidate genomic sequences for different organisms, particularly humans. It is concerned with the gene structure and relative positions of the genes in a chromosome (gene mapping). Polymorphisms (sequence variations) that are close to a trait are seldom separated from one generation to the other. Therefore, these polymorphisms may be used for mapping and identifying important genes.


inheritance. Distance is measured in base-pairs. Physical maps describe the chemical characteristics of the DNA molecule itself. Physical maps are particularly important when searching for disease genes by positional cloning strategies and for DNA sequencing.

Physical maps can be low-resolution or high-resolution maps. Low-resolution chromosomal maps are based on the banding patterns (light and dark bands reflecting regional variations in the amounts of A-T versus G-C) observed in light microscopy of stained chromosomes.

A cDNA map shows the position of expressed DNA regions (exons), the regions transcribed into mRNA, relative to particular chromosomal regions. A cDNA map can provide the chromosomal location for genes whose functions are currently unknown. For disease-gene hunters, the map can also suggest a set of candidate genes to test when the approximate location of a disease gene has been mapped by genetic linkage techniques.

High-resolution physical maps (e.g. by shotgun sequencing and sequence assembly) provide complete base-pair sequences of each chromosome in the genome. Determination of base-pair sequences of genes (high-resolution physical mapping) is necessary for inferring the amino acid sequences (primary structure) of corresponding proteins.

The two current approaches in high-resolution physical mapping are termed top-down (producing a macrorestriction map) and bottom-up (resulting in a contig map). With either strategy the maps represent ordered sets of DNA fragments that are generated by cutting genomic DNA with restriction enzymes. The fragments are then amplified by cloning or by polymerase chain reaction (PCR) methods. Electrophoretic techniques are used to separate the fragments according to size into different bands, which can be visualized by direct DNA staining or by hybridization with DNA probes of interest.
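The cutting and size-separation steps can be illustrated with a minimal in-silico digest. The function name `digest` and the toy sequence are assumptions made for this sketch. It uses the EcoRI recognition site GAATTC, which is cut after the first G, and considers one strand only, ignoring sticky ends and partial digestion.

```python
def digest(dna: str, site: str = "GAATTC", cut_offset: int = 1) -> list[str]:
    """Cut dna at every occurrence of site, cut_offset bases into the site."""
    fragments = []
    start = 0
    pos = dna.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(dna[start:cut])
        start = cut
        pos = dna.find(site, pos + 1)
    fragments.append(dna[start:])  # the piece after the last cut
    return fragments


genome = "TTAGGAATTCCGATGAATTCAAGT"  # toy sequence with two EcoRI sites
fragments = digest(genome)

# Mimic the gel: list fragments by length, longest first
for frag in sorted(fragments, key=len, reverse=True):
    print(len(frag), frag)
```

Sorting by length stands in for electrophoresis, where shorter fragments migrate further through the gel.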

DNA fingerprinting is a method of assembling overlapping cloned DNA molecules (contig maps), based on restriction fragment analysis or Southern blot hybridization patterns: digestion of the DNA with restriction enzymes followed by electrophoresis and visualization by hybridization with probes specific for repetitive sequences. The introduction of sequence tagged sites, which are short DNA segments defined by their unique sequences, allowed the use of PCR in contig assembly. Another large-scale physical mapping approach is radiation hybrid mapping, or site-specific chromosome fragmentation. The introduction of pulsed field gel electrophoresis (PFGE) and fluorescence in situ hybridization (FISH) are major technology advances in sequence analysis. Contig maps are important because they provide the ability to study a complete, and often large, segment of the genome by examining a series of overlapping clones which then provide an unbroken succession of information about that region.

Current mapping methods leave a large number of gaps, to be filled by other experimental methods. Chromosome walking is one strategy for filling in gaps. It involves hybridizing a primer of known sequence to a clone from an unordered genomic library and synthesizing a short complementary strand. The complementary strand is then sequenced and its end is used as the next primer for further walking.

Genome mapping is a reconstruction of the entire set of chromosomes for a given organism, showing the relative position of every gene. Genome control maps would identify all the components of the transcriptional machinery that have roles at any particular promoter and the contribution those specific components make to coordinate regulation of genes. The map will facilitate modeling of the molecular mechanisms that regulate gene expression and implicate components of the transcription apparatus in functional interactions with gene-specific regulators.

Genome sequencing projects do face several technical problems. Since present experimental methods can provide data only on stretches of ~ 500 base pairs, determination of larger genomic sequences requires a strategy to assemble overlapping sequence fragments. The “shotgun” strategy is employed in most of the current large-scale genome sequencing projects. Chromosomes of a target organism are purified, fragmented, and sub-cloned in fragments of ~ kilo basepairs. They are further sub-cloned as smaller fragments in plasmid vectors for DNA sequencing. First, the gene fragments are sequenced to determine the order of the bases in each sequence. Next, overlapping fragments are built up in a multiple alignment, a process known as sequence assembly, from which a consensus sequence for the clone is obtained. Full chromosomal sequences are then assembled from the overlap sequences in a highly redundant set of fragments.
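The idea of building a longer sequence from overlapping fragments can be sketched with a greedy merge. This is a simplified stand-in rather than the procedure used by real assemblers: it assumes error-free reads from one strand, exact suffix-prefix overlaps, and no repeats, and the names `overlap` and `greedy_assemble` are chosen only for this illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:  # nothing overlaps any more: keep separate contigs
            break
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


reads = ["ATGGCGTAC", "GTACGTTAG", "TTAGCCA"]
print(greedy_assemble(reads))  # a single contig spanning all three reads
```

When no overlaps remain, the function returns several contigs, which corresponds to the gaps that mapping and walking strategies are meant to close.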

In genomic studies, the earliest stages of genome analysis are performed by automatic methods. The genome sequences are then annotated. More detailed information is collected by laboratory experiments and a closer examination of the sequence data.

9.1.2 Genome Annotation and Comparative Genomics

Each fragment of DNA contains unique features. A DNA fragment may encode a portion of a gene or a gene control sequence, or the fragment may be a portion of a genome that has no apparent function. The raw genomic sequence (basepairs) data is meaningless without analyzing the various factors/regions that constitute the sequence. The elucidation and description of biologically relevant features in the sequence is essential in order for genome data to be useful. The goal of genomic annotation is to extract biologically relevant information from raw genomic sequence data. The quality with which annotation is done will have a direct impact on the value of the sequence. The process is iterative and requires finding putative coding regions, identifying what each region codes for, and using the available evidence to refine the coding and control regions. Genome annotation requires a spectrum of bioinformatics tools, each tuned to the genome under analysis. Annotation is done at two levels: at the DNA sequence level and at the predicted protein sequence level.

Genome sequence databases contain an assortment of data types that cannot be treated alike. These include untranslated regions (UTRs), coding region sequences (CDRs), introns (intervening sections of DNA which occur almost exclusively within a eukaryotic gene, but which are not translated to amino acid sequences in the gene product), exons, ribosome-binding sites, and translational termination sites (Fig. 9.2).

DNA sequence analysis estimates boundaries between coding and non-coding sequences and gene structures, translates protein-coding genes into protein sequences, and characterizes the conditions under which different forms of a gene may be expressed, along with a host of other structural and functional information. Once the individual coding sequences are discerned, genome annotation constructs systems of genes and gene products by combining knowledge gained from sequence analysis with knowledge from other data sources.

Once a properly annotated DNA sequence is available, it is possible to infer the roles of all of the gene products, how they are controlled and interact, and their possible involvement in disease.


Shine-Dalgarno sequences. Gene identification in prokaryotes is simplified by their lacking introns.
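Because prokaryotic genes lack introns, a first pass at gene identification can simply scan for open reading frames. The sketch below is a deliberately minimal version: it assumes ATG as the only start codon, looks at the three forward reading frames only, and does not check for ribosome-binding sites. The name `find_orfs` and the `min_codons` cut-off are assumptions made for this illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3) -> list[tuple[int, int, str]]:
    """Return (start, end, sequence) for ORFs on the forward strand.

    An ORF here runs from an ATG to the first in-frame stop codon.
    min_codons counts the start codon but not the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i + 3] == "ATG":
                for j in range(i + 3, len(dna) - 2, 3):
                    if dna[j:j + 3] in STOP_CODONS:
                        if (j - i) // 3 >= min_codons:
                            orfs.append((i, j + 3, dna[i:j + 3]))
                        i = j  # resume scanning after this stop codon
                        break
            i += 3
    return orfs


for start, end, seq in find_orfs("CCATGAAATTTGGGTAACC"):
    print(start, end, seq)
```

A practical gene finder would also scan the reverse complement and weigh additional signals such as the ribosome-binding site upstream of the start codon.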

Eukaryotic genes, on the other hand, are commonly organized as coding regions (exons) and non-coding regions (introns, transposable elements, pseudogenes, repeat elements, and UTRs) (Fig. 9.2), and hence may comprise several disjoint ORFs, and the gene products may be of different lengths. Non-coding regions contain regulatory elements such as promoters and transcription factor-binding sites. The introns are removed from the precursor mRNA (pre-mRNA) through a process called splicing, which leaves the exons untouched, to form an active mRNA. The main task of gene identification (in eukaryotic DNA) involves coding region recognition (intron/exon discrimination) and detection of splice sites (boundaries between exons and introns).
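Splicing can be mimicked on a sequence once the intron coordinates are known. In the sketch below the coordinates are supplied by hand, whereas predicting them is the hard part described above. The check that an intron begins with GT and ends with AG (the common "GT-AG rule") is brought in here as an illustration and is not stated in the text; the function names are likewise assumptions.

```python
def splice(pre_mrna: str, introns: list[tuple[int, int]]) -> str:
    """Remove introns given as half-open (start, end) intervals; join the exons."""
    exons = []
    pos = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[pos:start])
        pos = end
    exons.append(pre_mrna[pos:])
    return "".join(exons)


def follows_gt_ag_rule(pre_mrna: str, intron: tuple[int, int]) -> bool:
    """Check whether an intron starts with GT and ends with AG (DNA alphabet)."""
    start, end = intron
    seq = pre_mrna[start:end]
    return seq.startswith("GT") and seq.endswith("AG")


# Toy gene: exon 1, one intron, exon 2 (written in the DNA alphabet)
gene = "ATGGCC" + "GTAAGTTTTCAG" + "TGGTAA"
introns = [(6, 18)]

print(follows_gt_ag_rule(gene, introns[0]))
print(splice(gene, introns))  # the two exons joined into one reading frame
```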

A central problem in bioinformatics is the assignment of function to sequenced open reading frames (ORFs). The most common approach is based on inferred homology using statistically based sequence similarity methods (SIM) (e.g. BLAST, PSI-BLAST). The open reading frame expressed sequence tags (ORESTES) approach provides sequence information along the whole length of each transcript, rather than just the ends. The method involves low-stringency PCR to produce cDNA libraries, samples of which are then sequenced. Lately, alternative non-SIM based bioinformatic methods are becoming popular. One such method is Data Mining Prediction (DMP), which is based on combining evidence from amino-acid attributes, predicted structure and phylogenetic patterns, and uses a combination of Inductive Logic Programming data mining and decision trees to produce prediction rules for functional class. DMP predictions are more general than is possible using homology.

For many eukaryotic organisms, the complete genome sequence is not available. What is available for some of the organisms is a large collection of partial gene sequence data from randomly chosen cDNA clones, called expressed sequence tags (ESTs). This approach provides a rapid way to identify genes in any organism. The general strategy for an EST project involves construction of cDNA libraries from a variety of tissues at different stages of development, and the subsequent large-scale single-pass sequencing of clones from these libraries. Due to single-pass sequencing, the error rate in the EST sequences can be high. All the same, EST libraries are useful for preliminary identification of genes by database similarity searches. Using ESTs and cDNAs helps to refine gene boundaries (Fig. 9.3).

Fig. 9.3 EST and cDNA Alignment to help resolve Exon Boundaries

EST searches can be done at the DNA level with BLASTn or FASTA, or at the protein level with tBLASTn, tBLASTx or FASTAx. Motif identification can be done with searches against BLOCKS, PFAM, and PROSITE (Chapter 12). If DNA or protein level sequence similarities are significant and informative, then their functional information can be transferred to the genomic query sequence as a putative function annotation.

In eukaryotes, after transcription, the mRNA for a gene may be alternatively spliced (alternative transcripts). This phenomenon adds to the complexity of identifying intron-exon and exon-intron splice junctions. However, alternative splicing can be deduced if sufficient EST information is available for a predicted gene. Therefore, EST information generated by cDNA sequencing projects is critical to annotate and interpret a eukaryotic genome completely. EST sequencing projects aim to create as much data as possible for the genome. They sample libraries of cDNA molecules prepared from a variety of tissues. Intron/exon prediction draws on consensus sequences at the intron-exon and exon-intron splice junctions, base composition and codon usage. UTRs occur both in DNA and RNA. They are portions of the sequences flanking coding region sequences (CDRs) but are not themselves translated. Availability of protein sequences would help refine intron/exon boundaries (Fig. 9.4).
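Two of the statistical signals mentioned above, base composition and codon usage, are simple to compute for a candidate coding sequence. The sketch below shows the raw counts only; how such counts are turned into an intron/exon score is not covered here. The function names and the toy sequence are assumptions for illustration.

```python
from collections import Counter


def gc_content(dna: str) -> float:
    """Fraction of bases that are G or C."""
    dna = dna.upper()
    return (dna.count("G") + dna.count("C")) / len(dna)


def codon_usage(cds: str) -> Counter:
    """Count codons in a coding sequence read in frame from its first base."""
    cds = cds.upper()
    return Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))


cds = "ATGGCCGCCGCTTAA"
print(round(gc_content(cds), 2))
for codon, count in codon_usage(cds).most_common():
    print(codon, count)
```

Coding regions of an organism tend to show a characteristic codon bias, so comparing such counts against those of known genes is one way a predictor can separate exons from introns.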

Fig. 9.4 Protein Sequence Comparison to refine Intron/Exon Boundaries

Complete CDRs are rarely sequenced in one reaction. So variable-length, overlapping fragments are aligned in a multiple sequence alignment (sequence assembly) to obtain a consensus sequence. This is also to minimize cloning errors. Once coding sequences are identified and extracted from the genome, evidence for each coding region sequence (CDR) is collected from a variety of sequence similarity and motif search tools (BLAST and FASTA, PFAM). When a search of the databases reveals several ESTs, computational methods can be used in clustering ESTs by sequence similarity to known genes (Chapter 12). This allows each predicted gene to be compared against an array of EST sequences, enabling more effective and informative annotations.
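Deriving a consensus from aligned, overlapping fragments can be illustrated with a column-wise majority vote. The sketch assumes the fragments have already been aligned and padded with "-" to a common length; the function name `consensus` and the toy alignment are assumptions, and ties are broken arbitrarily.

```python
from collections import Counter


def consensus(aligned: list[str], gap: str = "-") -> str:
    """Majority-vote consensus over the columns of an alignment."""
    length = len(aligned[0])
    if any(len(row) != length for row in aligned):
        raise ValueError("all aligned rows must have the same length")
    result = []
    for column in zip(*aligned):
        counts = Counter(base for base in column if base != gap)
        result.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(result)


# Three overlapping fragments; the third carries one sequencing error (A for T)
alignment = [
    "ATGGCGT-----",
    "---GCGTACG--",
    "-----GAACGTT",
]
print(consensus(alignment))
```

The error in the third fragment is outvoted by the two fragments that agree at that column, which is why redundant coverage helps to minimize errors.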

ESTs are not only incomplete, but also to a certain degree inaccurate. Nevertheless, since gene prediction methods are only partially accurate, partial cDNA copies of expressed genes (ESTs) serve to confirm that a predicted gene is transcribed. The use of experimentally determined EST data in gene structure prediction and refinement underscores the importance of the integrated approach in bioinformatics: combining experimental data with computational tools of analysis.

Comparative genomics is the comparison of all the proteins in two or more proteomes and of the relative locations of related genes in separate genomes. This includes a comparison of gene number, gene content, and gene location on chromosomes. Availability of complete genome sequences makes possible a comparison of all of the proteins encoded by a genome (the proteome of the organism) with those of another. Comparison of the DNA sequences of two organisms provides information on gene relationships, on conserved sequences that identify gene coding regions, and on evolutionary history. A comparison of each protein in the proteome with all other proteins distinguishes unique proteins from proteins that have arisen from gene


for the genome by revealing which genes are expressed at a particular stage of the cell cycle, and genetic variation. With an increasing number of completely sequenced genomes, it is now possible to make DNA microarrays in which all the genes of an organism are represented, and to simultaneously assess the expression of all these genes. The main applications of DNA microarray chips are in gene expression profiling, mutational analysis, detecting single nucleotide polymorphisms (SNPs), and pharmacogenomics. In this analysis, all the genes of an organism are represented by oligonucleotide sequences (non-redundant ESTs), spread in an array on a microscope slide. Fluorescent labels are then attached to the oligonucleotides. The oligonucleotides are collectively hybridized to a labeled cDNA library prepared by reverse-transcribing mRNA from cells. The amount of label binding to each oligonucleotide spot is a reflection of the amount of mRNA in the cell. Labeled (fluorescent) probes are scanned with a microscope or imaging equipment. From this analysis, a set of genes that responds in an identical manner may be identified.

Messenger RNA expression arrays immobilize stretches of mRNA and are used to measure the concentration of mRNA species in a sample as a function of tissue type, cell cycle and other environmental conditions. The gene-transcript profile technique (referred to as transcript imaging) is particularly appealing because RNA transcripts represent the primary output of the genome, and RNA-based and protein-based measurements are complementary. For disease classification, a ‘molecular signature’ of a tissue may be obtained by allowing tissue RNA to react with a DNA microarray. This information may help to refine disease classifications, to guide the choice of therapy and to find new therapeutic targets.

Serial analysis of gene expression (SAGE) is an alternate approach to microarrays for gene identification and quantification and for monitoring global gene expression. It is a sequence-based method, which involves the production of short nucleotide tags from expressed genes (10-16 bp) that are then concatenated and sequenced sequentially, revealing the identity of multiple tags simultaneously. The study of global gene expression, using DNA microarrays or SAGE, relies on the observed phenotype changes due to dynamic changes in a particular mRNA population. An automated hybrid of SAGE and differential display enables complete elucidation of gene expression patterns for a given tissue or cell. It requires a complex series of steps involving multiplex PCR, cDNA cloning, in vitro transcription, cDNA construction, sequencing gel analysis, and quantification. These tools provide researchers with a powerful platform for exploring novel gene-disease relationships.
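The counting step of SAGE can be sketched as follows. The sketch is heavily simplified: it assumes a concatemer made of fixed-length 10-bp tags joined end to end with nothing between them, whereas real concatemers contain punctuation sequences and require more careful parsing. The tag-to-gene dictionary, the gene names and the function names are hypothetical.

```python
from collections import Counter


def count_tags(concatemer: str, tag_len: int = 10) -> Counter:
    """Split a concatemer into consecutive fixed-length tags and count them."""
    tags = [
        concatemer[i:i + tag_len]
        for i in range(0, len(concatemer) - tag_len + 1, tag_len)
    ]
    return Counter(tags)


def expression_profile(concatemer: str, tag_to_gene: dict[str, str]) -> Counter:
    """Sum tag counts per gene; tags without a known gene are grouped as 'unknown'."""
    profile = Counter()
    for tag, n in count_tags(concatemer).items():
        profile[tag_to_gene.get(tag, "unknown")] += n
    return profile


# Hypothetical lookup from tag sequence to gene name
tag_to_gene = {"AAACCCGGGT": "geneA", "TTTGGGCCCA": "geneB"}
concatemer = "AAACCCGGGT" + "TTTGGGCCCA" + "AAACCCGGGT" + "AAACCCGGGT"

print(expression_profile(concatemer, tag_to_gene))
```

The relative tag counts serve as an estimate of the relative abundance of the corresponding transcripts.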

Genome annotation is a tripartite, iterative procedure.

(1) First, the coding regions of a genomic sequence must be discerned from the non-coding regions. The genomic sequence is split into overlapping contiguous sequences (contigs: DNA sequences which have been assembled solely on the basis of direct sequencing information). Repeat regions are identified.

(2) Once a gene is identified, or predicted, the next step is to identify putative functions and possible homologues in other organisms and within the genome. These tasks require the use of multiple bioinformatics tools. Each coding sequence is compared with sequences in the databases at both the DNA and protein levels. Gene finding tools assist in identifying genes. ESTs, cDNA sequences, and known protein sequences are compared to predict DNA sequences. Analysis of gene products (protein sequences translated from each coding sequence) by sequence similarity, profile and other search methods allows identification of functions. Once a set of nucleotide sequences is available it is possible to ascribe


known and predicted genes of two or more genomes. Each protein of one proteome is then selected in turn as a query sequence of the proteome of another organism or the combined proteome of a group of organisms.
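The query-by-query comparison of two proteomes can be sketched as a loop that finds the best-scoring partner for each protein. In practice the score would come from an alignment program such as BLAST or FASTA. Here Python's `difflib.SequenceMatcher` ratio is used purely as a stand-in similarity measure, and the protein names, sequences and the 0.5 threshold are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import Optional


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score between 0 and 1 (not an alignment score)."""
    return SequenceMatcher(None, a, b).ratio()


def best_hits(
    queries: dict[str, str],
    targets: dict[str, str],
    threshold: float = 0.5,
) -> dict[str, Optional[str]]:
    """For each query protein, name its most similar target, or None if too weak."""
    hits = {}
    for q_name, q_seq in queries.items():
        best_name = max(targets, key=lambda t: similarity(q_seq, targets[t]))
        score = similarity(q_seq, targets[best_name])
        hits[q_name] = best_name if score >= threshold else None
    return hits


proteome_a = {"a1": "MKTAYIAKQR", "a2": "MLSPADKTNV", "a3": "WWWWWWWWWW"}
proteome_b = {"b1": "MLSPADKTNV", "b2": "MKTAYIAKQR", "b3": "GGGGGGGGGG"}

for query, hit in best_hits(proteome_a, proteome_b).items():
    print(query, "->", hit)
```

Queries that return no hit above the threshold are candidates for proteins unique to the first proteome.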

9.2.1 Amino Acid Folding Propensity Parameters

The physicochemical characteristics of amino acids, namely their shape, size and charge (Table 9.1), influence the tertiary folding of a protein. Other parameters that influence folding are pH, side chains and solvent interactions. Knowledge of physicochemical and positional characteristics in protein complexes is essential in understanding the functional features of proteins vis a vis their structure, in ab initio methods of structure prediction, which incorporate the physicochemical properties of the amino acids in a protein to literally calculate the 3-D structure (lowest energy model) based on protein folding, and in applications of computation-based proteomics studies.

Table 9.1 Structural and Physicochemical Properties of Amino Acids

Amino acid          Mass       Volume   HP scale   Surface area
                    (dalton)   (Å³)     (K & D)#   buried (%)##

Alanine (A)         71.1       67       1.8        74
Arginine (R)        156.2      148      -4.5       64
Asparagine (N)      114.1      96       -3.5       63
Aspartic acid (D)   115.1      91       -3.5       62
Cysteine (C)        103.2      86       2.5        91
Glutamine (Q)       128.1      114      -3.5       62
Glutamic acid (E)   129.1      109      -3.5       62
Glycine (G)         57.1       48       -0.4       72
Histidine (H)       137.1      118      -3.2       78
Isoleucine (I)      113.2      124      4.5        88
Leucine (L)         113.2      124      3.8        85
Lysine (K)          128.2      135      -3.9       52
Methionine (M)      131.2      124      1.9        85
Phenylalanine (F)   147.2      135      2.8        88
Proline (P)         97.1       90       -1.6       64
Serine (S)          87.1       73       -0.8       66
Threonine (T)       101.1      93       -0.7       70
Tryptophan (W)      186.2      163      -0.9       85
Tyrosine (Y)        163.2      141      -1.3       76
Valine (V)          99.1       105      4.2        86

(#) Hydrophobicity of amino acid side chains (Source: Kyte & Doolittle (1982)).
(##) Mean surface area (%) buried (Source: Rose et al. (1985)).

Fig. 9.7 A Flowchart of Proteomic Analysis


Most amino acids play more than one structural and functional role. On the basis of size and shape, glycine is the smallest amino acid, with only an H atom for a side chain. This allows Gly to have greater conformational flexibility and to fit in where other residues would be too bulky. Proline lacks a free amide hydrogen, which prevents it from donating a main-chain hydrogen bond. It also imposes an additional structural constraint on the backbone relative to any other amino acid. Both Gly and Pro are helix disrupters, which is an important factor in globular proteins. The SH group of cysteine is the most reactive group of any amino acid side chain. The formation of disulfide (S–S) bonds between cysteines present within proteins is important to the formation of active structural domains in a large number of proteins.

On the basis of hydrophobicity/hydrophilicity, a protein assumes a structure in which polar, hydrogen-bonding and non-polar interactions are maximized simultaneously. Thus, the buried surface area of an amino acid side chain is often used as a measure of its contribution to protein folding from the hydrophobic effect. Non-polar amino acids (Ile, Leu, Phe and Met) are generally found in the interior, whereas charged residues (Arg, His, Lys, Asp and Glu) tend to be exposed on the surface of proteins. Ala, Ile, Leu, Met and Val are the aliphatic residues that fit together in the hydrophobic interiors of proteins. Neutral polar residues (Asn, Gln, Cys, Ser and Thr) are found on the surface as well as inside the protein. Ser and Thr have OH groups and play important roles as active-site residues.
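The residue groupings described above can be turned into a quick composition check for a sequence. The following is only a minimal sketch: the function name, the merged "aliphatic/non-polar" class and the example peptide are illustrative assumptions, and residues that the text does not assign to a class (Gly, Pro, Trp, Tyr) are simply counted as "other".

```python
# Residue classes as grouped in the text (one-letter codes).
# Assumption: the non-polar (I, L, F, M) and aliphatic (A, I, L, M, V)
# lists are merged into one class for simplicity.
RESIDUE_CLASSES = {
    "aliphatic/non-polar": set("AILMVF"),
    "charged": set("RHKDE"),
    "neutral polar": set("NQCST"),
}


def class_composition(sequence):
    """Return the fraction of residues falling in each class."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = {name: 0 for name in RESIDUE_CLASSES}
    counts["other"] = 0  # G, P, W, Y and any non-standard symbol
    for residue in seq:
        for name, members in RESIDUE_CLASSES.items():
            if residue in members:
                counts[name] += 1
                break
        else:
            counts["other"] += 1
    return {name: n / len(seq) for name, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical 10-residue peptide, for illustration only.
    for name, fraction in class_composition("MKVLAAGDES").items():
        print(f"{name:20s} {fraction:.2f}")
```

A high non-polar fraction only hints at a large buried core; it says nothing about where those residues sit in the fold.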

Reverse turns exhibit characteristic structural and physicochemical features:

1. Reverse turns are polar due to the presence of NH and C=O groups, so they are usually situated at the molecular surface (the boundary separating protein and solvent), in contact with the solvent, water.

2. Backbone and side chain hydrogen bonds are disrupted in turn regions.

3. Solvent participation is an important influence in turn stabilization.

4. Regions adjacent to the reverse turn have many hydrophobic residues.

5. Turns in a hydrophobic environment (e.g. the interior of proteins) are rare. In such cases, bound water molecules and hydrogen bonds between side chains and the polypeptide backbone are involved in neutralizing the hydrophilicity.

9.2.2 Protein Folding Kinetics Methods

No one has the technology in place to solve hundreds of crystal structures for hundreds of new proteins. Therefore, the current experimental recourse is to screen protein-folding conditions (intermediate structures). The protein-folding problem can be analyzed experimentally by kinetic methods, that is, by analyzing the folding pathways. Partially folded protein intermediates are chemically trapped and physically separated, and their structures are analyzed by biophysical and biochemical methods. Introducing mutations and analyzing their effects on the kinetics of folding can also be used to study folding pathways. The results of these studies are:

(i) The structural elements collapse as a compact unit (molten globule) and later reorganize towards the native structure.

(ii) All but the rate-limiting steps are readily reversible, so that the initial folding process is rapid.

(iii) Inter-conversions of the molten globule state with the fully unfolded state are rapid and non-cooperative, whereas those with the fully folded state are slow and cooperative.


(iv) Partially folded intermediates are inherently unstable.

(v) The transition state has ordered segments largely intact.

(vi) Local sequence alterations and environment can influence the overall structure of proteins.

(vii) Folding does not proceed by stepwise acquisition of the native structure.

(viii) There is a preferential accumulation of the most stable intermediate in the folding pathway, because the productive pathway for refolding leads from this intermediate.

Molecular dynamics (MD) simulations can be used to study the dynamics of protein folding and to predict protein-folding rates. Prediction of such stable intermediate folds is one of the important procedures in structure prediction methods (see Chapter 12).

9.2.3 Phylogenetic (Evolution) Methods

The study of relationships between groups of organisms is called taxonomy. The branch of taxonomy that deals with genetic sequences is called phylogenetics or molecular evolution. Molecular evolution covers the self-assembly of molecules into higher structures and the subsequent evolution of proteins up to living organisms. The concept of molecular self-replication is based on the application of chemical kinetics to template-induced polynucleotide replication and translation.

The protein-folding problem can be addressed by phylogenetic methods via either the phenotypic approach or the “cladistic” approach. The phenotypic approach relies heavily on sequence data, while the cladistic approach relies on knowledge of ancestral relationships as well as sequence data (see Chapter 8). Rational design of new proteins is based on insights obtained from the basic mechanisms of evolution: mutation and recombination.

9.2.3.1 Evolution and Function

The basic mechanisms for evolution are mutation, recombination and natural selection. These processes are closely related to genes and the dynamics of their replication and translation. The study of evolutionary aspects of proteins provides insights into important and conserved regions in protein folding. The dynamic processes in the evolution of new proteins are modification of side chains and insertion and deletion of amino acid residues, all of which affect the folding. Generally, changes in the interior of proteins are conservative, so that the main-chain folding is not drastically affected. Homology means a relationship between nucleic acid or protein sequences that are descended from a common ancestral sequence. Homology (evolutionary relationship) can be inferred from sequence similarity results. From the observation of similarity, one might be able to infer whether the sequence similarity relates to functional similarity. Sequences that have diverged widely during evolution cannot be detected by simple sequence similarity search methods. In such cases, computational methods based on profile searches, which go beyond simple pair-wise sequence similarity, should also be tried.
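Sequence similarity, from which homology is inferred, is often summarized as percent identity over an existing pairwise alignment. The sketch below assumes the two sequences have already been aligned (gaps written as "-") and counts identity over ungapped columns only. This is one of several conventions in use, and the function name and example strings are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over the ungapped columns of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # skip columns containing a gap in either sequence
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Toy alignment: 8 ungapped columns, 6 of them identical.
    print(percent_identity("MKT-AYIAK", "MKSQAYLAK"))
```

Percent identity is a measure of similarity only. As the text notes, whether it reflects common descent or shared function has to be argued separately.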

Sequence alignments are intended to unravel evolutionary pathways and/or structural homology between two proteins. However, sequence homology does not necessarily indicate functional homology. Phylogenetic homology does not necessarily imply structural homology, and neither of them necessarily implies functional homology. Comparisons of tertiary structures of homologous proteins (proteins related by divergence from a common ancestor) have shown that three-dimensional structures have been better conserved during evolution


than the primary structures. The conservation of main-chain folding underscores the fact that (i) conservation of structural and physicochemical features imposes stringent conditions on folding pathways, and (ii) proteins that serve similar functions in all organisms retain strong structural homology. Mutations and gene duplications affect the expression of the genetic code in proteins. Accumulation of point mutations by natural drift (phenotype mutations) results in structural changes in homologous proteins, although their functional properties remain similar. These evolutionary changes are continuous and small. Gene fusion and gene duplication, on the other hand, introduce drastic and discontinuous structural changes. Evolution of completely new proteins/enzymes by point mutations is comparatively slow, and their emergence can be explained by gene duplication. Proteins with new functions are produced by gene duplication: through mutation and natural selection, one of the copies can develop a new function. Following speciation, a newly derived genome will inherit the families of the ancestor organisms, but will also develop new ones to meet evolutionary challenges.

One example of gene duplication and divergent evolution is the α/β-barrel domain found in several classes of proteins. Gene duplication led to several copies of the gene, and gene fusion led to proteins with different functions that share a common architecture. Combination of two dissimilar polypeptides by gene fusion generally results in the formation of a new polypeptide with altered functional features (e.g. α-lactalbumin). Different proteins can be generated by juxtaposition of different combinations of exons. For example, a combination of α-lactalbumin and an enzyme, transferase, has resulted in the formation of a new enzyme, lactose synthetase.

9.2.3.2 Evolutionary Trends in Protein Structures

A common mechanism for the evolution of new proteins is gene duplication to form more copies; the new copies then evolve independently by point mutation and by amino acid insertions and deletions to yield proteins with new functions. In related organisms, the gene content of the genome and the gene order on the chromosome are likely to be conserved. As the relationship between the organisms becomes more distant, local groups of genes remain clustered together, but chromosomal rearrangements move the clusters to other locations.

Evolution proceeds via either divergent evolution or convergent evolution. Production of different protein species by mutation of genes descended from a common ancestral gene is called divergent evolution (e.g. myoglobins, hemoglobins, cytochrome c and insulins). Convergent evolution is the acquisition of similar functional features by a class of proteins with dissimilar primary structures and structural conformations (e.g. subtilisins and pancreatic proteinases). These proteins have different tertiary structures, but similar functions.

Analysis of three-dimensional structures of proteins shows that proteins with high primary structure homology have closely similar tertiary structures, and hence similar functions. The converse is not always true. Closely related sequences, which may be assumed to share a common structure, may not share the same function (e.g. lysozymes and lactalbumins; insulin and relaxin; plasma albumin and fetoprotein; ovalbumin and antithrombin). For example, hen egg-white lysozyme shares 50 percent homology with α-lactalbumin, but their functions are different. The β-barrel structural motif, found in many protein classes, may have different functions. Also, proteins with different primary structures (analogous proteins from convergent evolution) can have similar tertiary structures and functions. This reinforces the fact that during evolutionary changes (divergent or convergent evolution) the tertiary structures are conserved more strongly than the primary structures.


9.2.3.3 Molecular Structure and Evolution

Evolution can be studied from the molecular perspective, through the comparative analysis of tertiary structures of proteins from various organisms and environments. In divergent evolution the functions appear to be the same, but there are changes in the tertiary structures to adapt to different environments. Therefore, sequence comparisons of orthologous proteins (proteins that perform the same functions in different species, e.g. insulin, hemoglobins, lysozyme etc.) open the way to the construction of phylogenetic trees. Such phylogenetic studies (e.g. cytochrome c, globins) provide valuable structural information on protein folding dynamics, conserved regions, invariant residues crucial for proper folding/function and regions prone to additions and deletions. Similarly, sequence comparison studies of paralogous proteins, which are proteins with different but related functions in an organism (e.g. G-proteins in the signal-transduction process), provide valuable insights into the structure-function relationships in proteins.

Interacting pairs of proteins co-evolve to maintain functional and structural complementarity. Consequently, such a pair of protein families shows similarity between their phylogenetic trees. Evaluation of the degree of co-evolution of family pairs by the global protein structural interactome map (PSIMAP, a map of all the structural domain–domain interactions in the PDB) would improve the accuracy of prediction based on ‘homologous interaction’.
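One simple way to put a number on the similarity between the trees of two protein families, without building the trees explicitly, is to correlate their pairwise evolutionary distance matrices over the same set of species. This is a generic "mirror-tree" style comparison and is offered here only as an illustrative sketch. It is not claimed to be the procedure used with PSIMAP, and the function names and toy distances below are hypothetical.

```python
from math import sqrt


def upper_triangle(matrix):
    """Flatten the strict upper triangle of a square distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("distances must not all be identical")
    return cov / sqrt(var_x * var_y)


def coevolution_score(dist_a, dist_b):
    """Correlate two distance matrices that share the same species order."""
    if len(dist_a) != len(dist_b):
        raise ValueError("matrices must cover the same species")
    return pearson(upper_triangle(dist_a), upper_triangle(dist_b))


if __name__ == "__main__":
    # Toy distances for families A and B in four species (same order).
    family_a = [[0.0, 0.1, 0.4, 0.5],
                [0.1, 0.0, 0.3, 0.6],
                [0.4, 0.3, 0.0, 0.2],
                [0.5, 0.6, 0.2, 0.0]]
    family_b = [[2 * d for d in row] for row in family_a]
    print(coevolution_score(family_a, family_b))
```

A score close to 1 indicates that the two families have diverged in step across the species considered. Part of any such correlation merely reflects the shared species phylogeny, so the score needs a suitable background for comparison.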

EXERCISE MODULES

1. What is the genesis of the “protein-folding” problem?
2. Why is structure prediction so important in bioinformatics?
3. What are the parameters influencing the tertiary folding?
4. What is a genome?
5. What is structural genomics?
6. What is genetic linkage mapping?
7. What is physical mapping?
8. What are EST databanks and how are they prepared?
9. What is the importance of annotation and comparative genomics?
10. What are gene chips and what are their applications in genomic studies?
11. What is functional genomics?
12. Why is the protein-folding problem so complex?
13. What is a hydrophobicity scale?
14. What is a proteome?
15. What is intra-proteomic analysis?
16. What is inter-proteomic analysis?
17. What are the kinetic methods of protein folding analysis?
18. What is taxonomy?
19. What is molecular evolution?
20. What is the phenotypic approach to evolutionary study?
21. What is the cladistic approach to evolutionary study?
22. How do you correlate the evolution of proteins to their functions?
23. What are the evolutionary trends in protein structure?
24. How do you study the evolution of proteins from their 3-D structure?


BIBLIOGRAPHY

1. Baldwin, R.L. (1989), TiBS, 14; 291. “How does protein folding get started?”
2. Baxevanis, A.D. & Ouellette, B.F. (Eds). (1998), Wiley & Sons: New York. “Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins”.
3. Birren, B., et al. (Eds). (1997), Cold Spring Harbor Laboratory Press: New York. “Genome Analysis: A Laboratory Manual”.
4. Blake, C.C.F. (1985), Int Rev Cytology, 93; 149. “Exons and the evolution of proteins”.
5. Bork, P. & Koonin, E.V. (1996), Curr Opin Struct Biol., 6(3); 366. “Protein sequence motifs”.
6. Branden, C. & Tooze, J. (1999), Garland Pubs: Philadelphia. “Introduction to Protein Structure”, 2nd Edn.
7. Brennan, R.G. & Matthews, B.W. (1989), Trends Biochem Sci., 14; 286. “Structural basis of DNA-protein recognition”.
8. Brown, T.A. (1999), Wiley-Liss: New York. “Genomes”.
9. Chee, M., et al. (1996), Science, 274; 610–14. “Accessing genetic information with high-density DNA arrays”.
10. Chothia, C. (1984), Annu Rev Biochem., 53; 537. “Principles that determine the structure of proteins”.
11. Chothia, C. & Lesk, A.M. (1986), EMBO J., 5; 823. “The relation between the divergence of sequence and structure in proteins”.
12. Cohen, C. & Parry, D.A.D. (1993), TiBS, 11; 245. “α-helical coiled coils – a widespread motif in proteins”.
13. Creighton, T.E. (1978), Prog Biophys Mol Biol., 33; 231. “Experimental studies of protein folding and unfolding”.
14. Creighton, T.E. (1993), Freeman Press: New York. “Proteins: Structures and Molecular Properties”, 2nd Edn.
15. Danchin, A. (1999), Curr Opin Struct Biol., 9(3); 363. “From protein sequence to function”.
16. Dickerson, R.E. (1980), Sci Amer., 243(3); 136. “Cytochrome c and the evolution of energy metabolism”.
17. Dickerson, R.E. & Geis, I. (1983), Benjamin-Cummings: Menlo Park, CA. “Hemoglobin: Structure, Function, Evolution and Pathology”.
18. Dobson, C.M. & Karplus, M. (1999), Curr Opin Struct Biol., 9(1); 92. “The fundamentals of protein folding: bringing together theory and experiment”.
19. Dunham, I.N., et al. (2000), Nature, 404; 904. “The DNA sequence of human chromosome”.
20. Dutt, M.J. & Lee, K.H. (2000), Curr Opin Biotechnol., 11; 176. “Proteomic analysis”.
21. Eisen, M.B. & Brown, P.O. (1999), Methods Enzymol., 303; 179. “DNA arrays for analysis of gene expression”.
22. Eisenhaber, F., Persson, B. & Argos, P. (1995), Crit Rev Biochem Mol Biol., 30; 1. “Protein structure prediction: recognition of primary, secondary, and tertiary structural features from amino acid sequence”.
23. Englander, S.W. (1993), Science, 262; 848. “In pursuit of protein folding”.
24. Farber, G. & Petsko, G.A. (1990), TiBS, 15; 228. “The evolution of α/β-barrel enzymes”.
25. Fickett, J.W. (1996), Trends Genet., 12; 316. “Finding genes by computer: the state of the art”.
26. Gilbert, W., de Souza, S.J. & Long, M. (1997), Proc Natl Acad Sci USA, 94; 7698. “Origin of genes”.
27. Hegde, P., et al. (2000), Biotechniques, 29; 548. “A concise guide to cDNA microarray analysis”.
28. Johnson, M.S. & Overington, J.P. (1993), J Mol Biol., 233; 716. “A structural basis for sequence comparison”.
29. Kim, W.K., Bolser, D.M. & Park, J.H. (2004), Bioinformatics, 20(7). “Large-scale co-evolution analysis of protein structural interlogues using the global protein structural interactome map (PSIMAP)”.
30. Kleywegt, G.J. (1999), J Mol Biol., 285; 1887. “Recognition of spatial motifs in protein structures”.
31. Kyte, J. (1994), Garland Pubs: New York. “Structure in Protein Chemistry”.


32. Kyte, J. & Doolittle, R.F. (1982), J Mol Biol., 157; 105. “A simple method for displaying the hydropathic character of a protein”.
33. Lee, P.S. & Lee, K.H. (2000), Curr Opin Struct Biol., 11; 171. “Genomic analysis”.
34. Lesk, A.M. (1991), IRL Press: London. “Protein Architecture: A Practical Approach”.
35. Lesk, A.M. & Chothia, C. (1980), J Mol Biol., 136; 225. “How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of globins”.
36. Levitt, M. & Chothia, C. (1976), Nature, 261; 552. “Structural patterns in globular proteins”.
37. Li, W.H. (1997), Sinauer Associates: Sunderland, MA. “Molecular Evolution”.
38. Lilley, D.M.J. (Ed). (1995), IRL Press: Oxford. “DNA-Protein: Structural Interactions”.
39. Luthy, R., Bowie, J.U. & Eisenberg, D. (1992), Nature, 356; 83. “Assessment of protein models with three-dimensional profiles”.
40. Marcotte, E.M., et al. (1999), Science, 285; 751. “Detecting protein function and protein-protein interactions from genome sequences”.
41. Marshall, A. & Hodgson, J. (1998), Nature Biotechnol., 16; 27. “DNA chips: An array of possibilities”.
42. Martin, A., et al. (1998), Structure, 6; 875. “Protein folds and functions”.
43. Mount, D.W. (2001), Cold Spring Harbor Lab Press: New York. “Bioinformatics: Sequence and Genome Analysis”.
44. Narayanan, P. (2003), New Age Intl Pubs: New Delhi. “Essentials of Biophysics” (2nd Print).
45. Orengo, C.A., et al. (1994), Nature, 372; 631. “Protein superfamilies and domain superfolds”.
46. Orengo, C.A., et al. (1999), Curr Opin Struct Biol., 9(3); 374. “From protein structure to function”.
47. Overington, J.P. (1992), Curr Opin Struct Biol., 2; 394. “Comparison of three-dimensional structures of homologous proteins”.
48. Pabo, C.O. & Sauer, R.T. (1992), Annu Rev Biochem., 61; 1053. “Transcription factors: structural families and principles of DNA recognition”.
49. Pandey, A. & Mann, M. (2000), Nature, 405; 837. “Proteomics to study genes and genomes”.
50. Perutz, M.F. (1991), W.H. Freeman: New York. “Protein Structure and Function”.
51. Qian, N. & Sejnowski, T.J. (1988), J Mol Biol., 202; 865. “Predicting the secondary structure of globular proteins using neural network models”.
52. Richardson, J.S. (1981), Adv Prot Chem., 34; 167. “The anatomy and taxonomy of protein structure”.
53. Richardson, J.S. (1985), Methods Enzymol., 115; 349. “Describing patterns of protein tertiary structure”.
54. Rose, G.D., et al. (1985), Science, 229; 834. “Hydrophobicity of amino acid residues in globular proteins”.
55. Sackheim, G. (1991), Addison-Wesley: New York. “Introduction to Chemistry for Biology Students”, 4th Edn.
56. Sanchez, R. & Sali, A. (1997), Curr Opin Struct Biol., 7; 206. “Advances in comparative protein structure modeling”.
57. Sensen, C.W. (Ed). (2001), Wiley-VCH: Weinheim. “Biotechnology (5b): Genomics and Bioinformatics” (2nd Edn).
58. Struhl, K. (1989), Trends Biochem Sci., 14; 137. “Helix-turn-helix, zinc-finger, and leucine-zipper motifs for eukaryotic transcriptional regulatory proteins”.
59. Sun, Z. & Jiang, B. (1996), J Prot Chem., 15; 675. “Conformation of commonly occurring super-secondary structures (basic motifs) in protein databank”.
60. Todd, A.E., Orengo, C.A. & Thornton, J.M. (1999), Curr Opin Chem Biol., 3(5); 548. “Evolution of protein function, from a structural perspective”.
61. Vriend, G. & Sander, C. (1991), Proteins, 11; 52. “Detection of common three-dimensional structures in proteins”.
62. Zaidi, F.N., Nath, U. & Udgaonkar, J.B. (1997), Nature Struct Biol., 4; 1016. “Multiple intermediates and transition states during protein unfolding”.


10 Computational Methods in Structure Prediction

Theoretical (computational) methods of tertiary structure prediction and model building of proteins (and other biological molecules), on the basis of the known three-dimensional structures of their homologues, are at present the alternative ways to obtain structural information (molecular bioinformatics). By virtue of genome projects, the sequence databases are growing faster than the structural databases, and there is a sequence/structure deficit (1,500:1). Although the primary structure databases of proteins are increasing at a fast rate, there is a paucity of three-dimensional structural data, owing to inherent limitations and procedural constraints of the existing experimental methods (X-ray crystallography and NMR spectroscopy). All the same, all structure prediction methods (statistical, physical and simulation methods) are empirical, with inherent limitations, and they all rely in one way or another on experimental data for validation.

10.1 PROTEIN FOLDING RULES

Simply stated, the ‘protein-folding’ problem is to predict the 3-D folding of a protein from its primary structure (amino acid sequence) data. Empirical rules for the protein-folding problem have been formulated from the “knowledge” culled from experimental data obtained by structure determination with physical techniques, molecular biology and quantitative biochemistry. From the experimental data, it is proposed that the native folding of a protein proceeds through stages: (i) rapid collapse of the random-coil state to the “molten globule” state, (ii) a slow passage through semi-compact states to a transition state, (iii) fast folding from the transition state to the final native state, and (iv) preferential accumulation of the most stable intermediate in the folding pathway. From the “knowledge-based” studies, the empirical rules that govern the packing that occurs between and among secondary structural elements to form structural motifs, modules, domains, and tertiary structures are:

1. Residues that become buried in the interior of a protein close-pack. Close packing, exclusion of water and burial of hydrophobic groups in the interior are the major determinants of tertiary folding.

2. Polar and charged residues are predominantly found at the reverse turns and surfaces of protein molecules.

3. The Ramachandran conformation angles of the main-chain and side-chains of the polypeptides lie in the narrowly allowed regions.


4. Secondary structural elements in proteins retain conformations close to the minimum free energy conformations of the isolated secondary structures. They interact in a manner that induces no appreciable steric strain.

5. Packing interactions (non-bonded interactions) contribute to folding stability and, therefore, tend to be conserved in protein folding processes.

6. Structural motifs, modules, domains and topologies are more important than the amino acid sequence homology in the evolution of protein structures. That is, the 3-D structure of proteins is more faithfully conserved than is the underlying sequence.

7. In proteins with natural physical constraints, such as disulfide-containing proteins, the S–S bridges are found as integral parts of structural motifs, influencing the tertiary folding.

8. In disulfide-containing proteins, the S–S moieties render not only structural stability but also functional features. There exists a hierarchy of S–S bridges in stabilizing structural motifs/moieties.

10.2 STRUCTURE PREDICTION OF FIBROUS PROTEINS

Proteins are classified under two broad categories: (1) fibrous proteins and (2) globular proteins. Fibrous proteins are long and thread-like (e.g. collagen), while globular proteins (e.g. myoglobin, insulin) are compact globules, due to reverse turns of the polypeptide backbone. Protein-folding rules are relatively simple in the case of fibrous proteins. Many of them are helical (or sheet), and in many cases several helical polypeptide monomers entwine to form coiled-coil super-helical structures that exist in relatively simple structural motifs.

10.2.1 The Keratin Group Proteins

The primary structures of α-keratin group (k-m-e-f) proteins (e.g. hair, wool, myosin, and fibrinogen) show a general trend towards a heptad-residue repeat with a preponderance of non-polar residues at the 1st and 4th positions and charged residues at the 5th and 7th positions. The residues at the 2nd, 3rd and 6th positions are generally polar. The heptad-residue repeat of the helix leads to the alignment of the 1st and 4th non-polar residues, forming a non-polar strip on one side of the helix. The non-polar faces of several helices associate to form a coiled-coil structure.

Silk (β-keratin) has an antiparallel β-pleated sheet structure with a six-residue repeat unit, (Gly-Ser-Gly-Ala-Gly-Ala)n. Individual sheets are packed together so that Gly faces Gly and Ala (or Ser) faces Ala (or Ser).
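The heptad-repeat rule for the α-keratin group lends itself to a simple frame scan: for each of the seven possible registers, count how often the 1st and 4th positions of the heptad are occupied by non-polar residues. The sketch below is a toy illustration only. The set of residues treated as non-polar, the function names and the idealized test sequence are all assumptions, and real coiled-coil predictors use position-specific statistics rather than a yes/no rule.

```python
# Assumption: residues counted as non-polar for the purpose of this scan.
NONPOLAR = set("AILMFVW")


def heptad_scores(sequence):
    """For each of the 7 frames, return the fraction of heptad
    positions 1 and 4 that are occupied by non-polar residues."""
    seq = sequence.upper()
    scores = []
    for frame in range(7):
        hits = 0
        total = 0
        for i, residue in enumerate(seq):
            position = (i - frame) % 7 + 1  # heptad position, 1..7
            if position in (1, 4):
                total += 1
                if residue in NONPOLAR:
                    hits += 1
        scores.append(hits / total if total else 0.0)
    return scores


def best_heptad_frame(sequence):
    """Return (frame, score) for the best-scoring register."""
    scores = heptad_scores(sequence)
    frame = max(range(7), key=lambda f: scores[f])
    return frame, scores[frame]


if __name__ == "__main__":
    # Idealized heptad: non-polar at 1 and 4, charged at 5 and 7.
    ideal = "LKELEEK" * 4
    print(best_heptad_frame(ideal))
```

A score near 1.0 in one frame and markedly lower scores in the others suggests a consistent non-polar strip along one face of the helix.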

10.2.2 The Collagen Group Proteins

The structural unit of collagen, tropocollagen, comprises three left-handed polyproline-type helical monomers. The amino acid sequence of each polypeptide monomer is (Gly-X-Y)n, with n = 333. Generally X is proline and Y is 4-hydroxyproline. Because every third residue is glycine, and because of the stereochemical constraints this imposes, it is straightforward to predict the tertiary structures of collagen group proteins a priori from the amino acid sequence data.
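The Gly-X-Y rule is simple enough to test directly on a sequence: in one of the three possible reading frames, every third residue should be glycine. The following sketch, with a hypothetical function name and made-up test peptides, reports that frame if it exists. The `min_fraction` parameter is an assumption added to tolerate occasional interruptions of the repeat.

```python
def gly_xy_frame(sequence, min_fraction=1.0):
    """Return the frame (0, 1 or 2) in which glycine occupies every third
    position in at least `min_fraction` of the triplets, or None."""
    seq = sequence.upper()
    best = None
    for frame in range(3):
        every_third = seq[frame::3]  # residues at frame, frame+3, ...
        if not every_third:
            continue
        fraction = every_third.count("G") / len(every_third)
        if fraction >= min_fraction and (best is None or fraction > best[1]):
            best = (frame, fraction)
    return None if best is None else best[0]


if __name__ == "__main__":
    print(gly_xy_frame("GPPGPAGKPGPP"))   # collagen-like, Gly at 0, 3, 6, 9
    print(gly_xy_frame("AGPPGPPGPP"))     # same repeat, shifted by one
    print(gly_xy_frame("MKVLAAGDES"))     # no Gly-X-Y periodicity
```

Hydroxyproline is a post-translational modification and does not appear in a gene-derived sequence, so the check relies on the glycine periodicity alone.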


10.3 STRUCTURE PREDICTION OF GLOBULAR PROTEINS

In the case of globular proteins, the structure prediction rules are inadequate. Many different amino acid sequences give similar three-dimensional structures. We do not yet fully understand the rules of protein folding. The protein-folding problem is highly complex in globular proteins because:

(i) All twenty amino acids can be involved in the construction of the polypeptide.

(ii) There are no repeating units that would have simplified the problem.

(iii) Globular proteins are not linear; the direction of the backbone folding changes many times to give a compact globular shape.

Therefore, the protein-folding problem in globular proteins is attempted at several levels of complexity, incorporating empirical protein folding rules culled from known 3-D structures, physicochemical data and “knowledge” accumulated from other sources. The success rate depends on the structural features of the class of proteins addressed.

Structural topologies (motifs and domains), and not amino acid sequence homologies, are better conserved in evolution. Different amino acid sequences can give rise to similar protein structures. That is, three-dimensional spatial architecture is more important in protein folding than amino acid sequence homologies. Therefore, the protein-folding problem is better addressed by classifying protein structures by shapes, motifs and modules and then carrying out molecular modeling procedures. Structure prediction methods will have better success if pattern matching is given more importance. That is, instead of matching the amino acid sequence of a protein with the sequence data of its homologues, the objective should be to match the amino acid sequence of a test protein with a given topology/shape/profile. This ‘inverse folding’ method, where proteins are identified by structural motifs and shapes and amino acid sequences are aligned to fit the structural motifs, is a recent development in structure prediction by statistical methods.

The complexity of the protein-folding problem can be minimized in the case of certain classes of proteins, such as immunoglobulins and disulfide-containing proteins, where the inherent natural constraints influence the tertiary folding interactions.

10.3.1 Secondary Structure Prediction

The secondary structure elements (helix, sheet and turn) constitute the building blocks of the folding units in globular proteins. The aim of secondary structure prediction is, in essence, to identify and locate the helices, strands and random coil segments within a protein from its amino acid sequence data. Therefore, prediction of the secondary structure of a protein is often used as the first step in an attempt to predict its tertiary structure. A variety of empirical and statistical methods, namely (i) Chou-Fasman and GOR, (ii) neural network and (iii) nearest neighbor, are available to predict secondary structures of proteins from the amino acid sequence data. These methods try to predict the propensity of each amino acid to be part of a helix, a sheet or a coil region in a protein. Protein sequences are processed as sliding windows of fixed-length segments, usually ranging from 7 to 17 amino acids. The central residue of each window is then assigned one of the states, namely helix, sheet or coil, depending on its propensity (Table 10.1).

Table 10.1 2-D Structure Propensity Chart for Amino Acid Residues

Amino acid           α-helix   β-sheet   Reverse turn
Alanine (A)           1.40      0.75      0.80
Arginine (R)          1.20      0.90      0.90
Asparagine (N)        0.78      0.66      1.54
Aspartic acid (D)     1.00      0.66      1.40
Cysteine (C)          0.95      1.10      1.00
Glutamine (Q)         1.17      1.00      0.95
Glutamic acid (E)     1.45      0.51      0.77
Glycine (G)           0.63      0.85      1.60
Histidine (H)         1.12      0.83      0.90
Isoleucine (I)        1.00      1.57      0.47
Leucine (L)           1.30      1.20      0.60
Lysine (K)            1.22      0.70      1.02
Methionine (M)        1.30      1.12      0.60
Phenylalanine (F)     1.15      1.23      0.60
Proline (P)           0.50      0.60      1.50
Serine (S)            0.72      0.95      1.40
Threonine (T)         0.78      1.43      0.50
Tryptophan (W)        1.03      1.26      0.90
Tyrosine (Y)          0.74      1.40      1.00
Valine (V)            0.96      1.66      0.50

Source: Chou, P.Y. & Fasman, G.D. (1978); Fasman, G.D. (Ed.) (1990).

The Chou-Fasman method is based on analyzing the frequency of each of the twenty amino acids in α-helices, β-sheets and turns. The frequency f_i of an amino acid in a given conformation is divided by the frequency of all residues in that conformation, so that a value above 1.0 marks a residue that favors the conformation. The sequence is first scanned to find a short stretch of amino acids that has a high probability of starting a nucleation event that could form one type of secondary structure. For an α-helix, a prediction is made when four of six contiguous amino acids have a propensity > 1.0 for being in an α-helix. For a β-strand, the presence of three amino acids, out of five residues, with a propensity > 1.0 for being in a β-strand is considered a nucleation region. These nucleation regions are extended along the sequence in each direction until the average predicted value for four contiguous amino acids drops below 1.0. Turns are modeled as tetrapeptides.
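The nucleation and extension steps can be illustrated with a short Python sketch. It is a simplified reading of the rules described here, not the full Chou-Fasman procedure: it uses the helix and sheet columns of Table 10.1, ignores turn prediction, and does not resolve regions that qualify as both helix and sheet. The function names (nucleation_sites, extend_region) are hypothetical.

```python
# Helix and sheet propensities copied from Table 10.1 (one-letter codes).
P_HELIX = {
    "A": 1.40, "R": 1.20, "N": 0.78, "D": 1.00, "C": 0.95,
    "Q": 1.17, "E": 1.45, "G": 0.63, "H": 1.12, "I": 1.00,
    "L": 1.30, "K": 1.22, "M": 1.30, "F": 1.15, "P": 0.50,
    "S": 0.72, "T": 0.78, "W": 1.03, "Y": 0.74, "V": 0.96,
}
P_SHEET = {
    "A": 0.75, "R": 0.90, "N": 0.66, "D": 0.66, "C": 1.10,
    "Q": 1.00, "E": 0.51, "G": 0.85, "H": 0.83, "I": 1.57,
    "L": 1.20, "K": 0.70, "M": 1.12, "F": 1.23, "P": 0.60,
    "S": 0.95, "T": 1.43, "W": 1.26, "Y": 1.40, "V": 1.66,
}


def nucleation_sites(seq, table, window, needed):
    """Start indices of windows in which at least `needed` residues
    have a propensity strictly greater than 1.0."""
    sites = []
    for start in range(len(seq) - window + 1):
        segment = seq[start:start + window]
        favourable = sum(1 for aa in segment if table[aa] > 1.0)
        if favourable >= needed:
            sites.append(start)
    return sites


def mean_propensity(segment, table):
    """Average propensity of the residues in `segment`."""
    return sum(table[aa] for aa in segment) / len(segment)


def extend_region(seq, table, start, window, span=4):
    """Grow a nucleus in both directions while the mean propensity of the
    `span` residues at the growing edge stays at or above 1.0.
    Returns (left, right) with `right` exclusive."""
    left, right = start, start + window
    while right < len(seq) and \
            mean_propensity(seq[right - span + 1:right + 1], table) >= 1.0:
        right += 1
    while left > 0 and \
            mean_propensity(seq[left - 1:left - 1 + span], table) >= 1.0:
        left -= 1
    return left, right


if __name__ == "__main__":
    seq = "GGPGAELKMAEGPG"  # made-up test sequence
    helix_nuclei = nucleation_sites(seq, P_HELIX, window=6, needed=4)
    print("helix nuclei start at:", helix_nuclei)
    print("extended helix region:", extend_region(seq, P_HELIX, 4, 6))
```

For helices the sketch uses a window of six residues with four favourable residues, and for strands a window of five with three, matching the thresholds given in the text.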

Prediction of secondary structure can be aided by examining the periodicity of amino acids with hydrophobic side chains in the protein structure. Hydrophobicity tables that give hydrophobicity values for each amino acid are used to locate the most hydrophobic regions of the protein. A sliding window is moved across the sequence and the average hydrophobicity value of the amino acids within the window is plotted. A further aid is the hydrophobic moment display, in which the tendency of hydrophobic amino acids to segregate to one side of a structure is plotted against the angle of rotation from one residue to the next along the protein chain.
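The sliding-window average can be sketched in a few lines of Python. The text does not name a particular hydrophobicity table; the sketch assumes the widely used Kyte-Doolittle hydropathy scale and an odd window length of 7, both of which are illustrative choices. The function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values (assumed scale; the text names none).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=7):
    """Return (centre_index, mean_hydropathy) pairs for every full window.
    `window` is expected to be odd so that each window has a centre."""
    half = window // 2
    profile = []
    for centre in range(half, len(seq) - half):
        segment = seq[centre - half:centre + half + 1]
        mean = sum(KYTE_DOOLITTLE[aa] for aa in segment) / window
        profile.append((centre, mean))
    return profile


def most_hydrophobic_window(seq, window=7):
    """Centre index and mean hydropathy of the most hydrophobic window."""
    return max(hydropathy_profile(seq, window), key=lambda item: item[1])


if __name__ == "__main__":
    seq = "DDDDIIIIIIIDDDD"  # made-up sequence with a hydrophobic core
    for centre, value in hydropathy_profile(seq):
        print(centre, round(value, 2))
    print("peak:", most_hydrophobic_window(seq))
```

Plotting the second element of each pair against the first gives the hydropathy plot described above; peaks mark candidate buried or membrane-spanning segments.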

Whichever procedure is followed, the general approach of secondary structure prediction programs is to predict ordered regions, reverse turns and loops, and to formulate further empirical rules to predict how these secondary structure elements fold further into compact super-secondary and tertiary structures. For example, the hydrogen bond network in an α-helix is intra-molecular, and the H-bonds of helices can be localized to intra-segment partners. Thus, it is presumed that helical secondary structure depends mainly on local sequence information (short-range interactions), and not on the final folding, and can therefore be predicted from neighboring amino acid sequence data. On the other hand, in β-sheet formation the H-bond network is formed between different strands, and thus requires long-range interactions.

Prediction methods for secondary structure have low accuracy. Such empirical assumptions have no physical meaning and sometimes may lead to erroneous results. This is because tertiary folding is highly cooperative, and limited knowledge of a few aspects of a structure does not necessarily give enough insight to predict the whole structure. Interactions between sequentially distant residues override the intrinsic conformational propensities of individual residues to achieve proper tertiary folding. In essence, piecemeal information does not provide complete information on protein folding.

The direct approach to the protein-folding problem is to predict the three-dimensional structure of a protein from its amino acid sequence data. This is carried out by direct search methods (e.g. pair-wise sequence search methods). The converse approach to the protein-folding problem asks: for a particular type of folding, what are the compatible amino acid sequences?

10.3.2 Super-secondary Structure Classification

Proteins have an ordered organization of secondary structures that form distinct structural motifs and domains. These motifs and domains (e.g. Greek-key, helix-turn-helix, zinc-finger and leucine-zipper) are termed super-secondary structures. One way of approaching the protein-folding problem is to classify protein structures by structural motifs and domains and incorporate this information in structure prediction procedures. This approach is one or two steps below predicting the final tertiary structure in the hierarchy. Family search methods (e.g. the template method) are based on comparison of protein motifs, domains or families. The underlying rationale of these programs rests on two assumptions: that proteins with similar amino acid sequences will have similar structures (not necessarily true!), and that domains and tertiary folding, rather than amino acid sequences, are conserved during evolution.

Well-defined types of folding units (super-secondary structures) can be classified under (i) all α-, (ii) all β-, (iii) α/β and α+β, and (iv) other types of structure classes.

10.3.2.1 All α-motif Structures

In globular proteins, α-helices are packed to form various structural motifs, from simple ‘up-down’ helices (e.g. melittin, myohemerythrin, cytochrome c′ and Cyt562) (Fig. 10.1), to more complex helix bundles (e.g. ribonuclease inhibitor protein (Fig. 10.2)) and transmembrane helix motifs and domains in membrane proteins (e.g. rhodopsin), and to more diverse helical folds, as in lysozyme (Fig. 10.3) and the ‘globin fold’, where the α-helices of the bundle are wrapped around the core in different directions so that sequentially adjacent α-helices are usually not adjacent to each other (e.g. myoglobin) (Fig. 10.4). These structural units are maintained by preserving the hydrophobic interior core during evolution.

Fig. 10.1(i) Structure of Melittin (Example of “all α-Helix” Structural Motif)
(Ref: Eisenberg, D., Gribskov, M. & Terwilliger, T.C.); (Source: Protein Data Bank; 2MLT.pdb)

Fig. 10.1(ii) Structure of Myohemerythrin (Example of “Up-down α-Helices Bundle” Structural Motif)
(Ref: Sheriff, S., Hendrickson, W.H. & Smith, J.L. (1987)); (Source: Protein Data Bank; 2MHR.pdb)

Fig. 10.2 Structure of Ribonuclease Inhibitor Protein (Example of “Multiple Helices” Structural Motif)
(Ref: Kobe, B. & Deisenhofer, J. (1996) J Mol Biol., 264; 1028) (Source: Protein Data Bank: 1BNH.pdb)

Fig. 10.3 Structure of Lysozyme (Example of a “Diverse Helices” Structural Motif)
(Ref: Weaver, L.H., Grutter, M.G. & Matthews, B.W. (1995) J Mol Biol., 245; 54) (Source: Protein Data Bank: 153L.pdb)

Coiled-coil structures typically comprise two or three α-helices coiled around each other, forming a super-coiled structure with seven residues every second turn. Coiled-coil regions in proteins may be identified by searching for the 7-residue (heptad) periodicity (a-b-c-d-e-f-g) repeated along the sequence. The a and d residues are usually hydrophobic amino acids.
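A search for heptad periodicity can be sketched as follows. This is only a toy scoring scheme under stated assumptions, not a published coiled-coil predictor: the set of residues counted as hydrophobic and the function names are hypothetical choices made for illustration. For each of the seven possible registers, the sketch reports the fraction of a and d positions occupied by hydrophobic residues.

```python
# Residues treated as hydrophobic (an assumption made for this sketch).
HYDROPHOBIC = set("AILMFVWY")


def heptad_score(seq, register):
    """Fraction of a and d positions that are hydrophobic when the residue
    at index `register` is taken as heptad position a."""
    hits = 0
    total = 0
    for i, aa in enumerate(seq):
        position = (i - register) % 7  # 0 = a, 3 = d
        if position in (0, 3):
            total += 1
            if aa in HYDROPHOBIC:
                hits += 1
    return hits / total if total else 0.0


def best_register(seq):
    """Register (0-6) with the highest a/d hydrophobic fraction, and its score."""
    scores = [heptad_score(seq, register) for register in range(7)]
    best = max(range(7), key=lambda register: scores[register])
    return best, scores[best]


if __name__ == "__main__":
    seq = "LKELEDK" * 4  # made-up sequence with Leu at positions a and d
    print(best_register(seq))
```

A register whose score stays close to 1.0 over several consecutive heptads is a candidate coiled-coil region; real sequences would need a threshold and a minimum length, which are left out here.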

The leucine-zipper motif is typically made of two parallel α-helices held together by interactions between hydrophobic leucine residues located at every 7th position in each helix, that is, at approximately every two turns of the α-helix. The zipper holds protein subunits together. The bound subunits form a “scissor-grip”-like structure with ends that lie in the major groove of the DNA double helix. The predicted motif is a coiled-coil structure.

Membrane α-helices are like α-helices that are buried in the structural core of a protein.

10.3.2.2 All β-motif Structures

The “β-turn-β” motif structures have antiparallel β-strands joined by a turn/loop (Fig. 10.5). The β-strands have a right-handed twist, and packing of β-strands gives a barrel-like structure. β-Motif structures range from simple “strand-turn-strand” motifs (e.g. T cell coreceptor protein CD8 and superoxide dismutase) (Fig. 10.6) to more complex Greek-key, β-barrel and “swiss roll” structural motifs (Fig. 10.7). The structural motif in immunoglobulins, the “immunoglobulin-fold”, which is also found in human cell adhesion protein CD2 (Fig. 10.8), is a version of the Greek-key structural motif.

Fig. 10.5 Schematic of “β-turn-β” Structural Motif

Fig. 10.4 Structure of Myoglobin (Example of Helical-wheel “Globin-fold” Structural Motif)
(Ref: Watson, S.C. & Kendrew, J.C.) (Source: Protein Data Bank: 1MBN.pdb)

Fig. 10.6(i) Structure of Human T Cell Coreceptor Protein CD8 (Example of “Strand-turn-Strand” Structural Motif)
(Ref: Leahy, D.J., Axel, R. & Hendrickson, W.A. (1992) Cell, 68; 1145) (Source: Protein Data Bank: 1CD8.pdb)

Fig. 10.6(ii) Structure of Superoxide Dismutase (Example of “Strand-turn-Strand” Structural Motif)
(Ref: Carugo, K.D., et al. (1996) Acta Crystallogr D: Biol Crystallogr., 52; 176) (Source: Protein Data Bank: 1XSO.pdb)

Fig. 10.7(i) Structure of Bovine Lens γ-Crystallin Protein (Example of “β-Barrel” Structural Motif)
(Ref: Najmudin, S., et al. (1993) Acta Crystallogr D: Biol Crystallogr., 49; 223) (Source: Protein Data Bank: 4GCR.pdb)

Fig. 10.7(ii) Structure of Transthyretin (Example of “β-Barrel” Structural Motif)
(Ref: Sunde, M., et al. (1996) Eur J Biochem., 236; 491) (Source: Protein Data Bank: 1TFP.pdb)

Fig. 10.8 Structure of Human Cell Adhesion Protein CD2 (Example of a Structure with “Immunoglobulin-fold” Structural Motif)
(Ref: Bodian, D.L., et al. (1994) Structure, 2; 755) (Source: Protein Data Bank: 1HNF.pdb)

10.3.2.3 (α + β) and (α/β)-Motif Structures

Examples of (α + β) structures are insulin, trypsin inhibitor, ribonuclease, glutaredoxin, p21 and bacteriochlorophyll-containing protein (Fig. 10.9). α/β-Motif structures (e.g. Rossmann fold) are built up from the β-α-β structural motif (Fig. 10.10). Such structures (e.g. flavodoxin, carbonic anhydrase, adenylate kinase and triose phosphate isomerase) consist of parallel β-strands at the center, pointing in different directions like arrows (β-barrel), with α-helices wound around them (Fig. 10.11).

Fig. 10.9(i) Structure of T4 Glutaredoxin Protein (Example of (α+β) Structural Motif)
(Ref: Eklund, H., et al. (1992) J Mol Biol., 228; 596) (Source: Protein Data Bank: 1ABA.pdb)

Fig. 10.9(ii) Structure of Protein p21 (Example of (α+β) Structural Motif)
(Ref: Wittinghofer, F., et al. (1991) Environ Health Perspect., 93; 11) (Source: Protein Data Bank: 121P.pdb)

Fig. 10.9(iii) Structure of Bacteriochlorophyll-containing Protein (Example of (α + β) Structural Motif)
(Ref: Tronrud, D.E. & Matthews, B.W. (1993) Photosyn Reaction Center, 1:13) (Source: Protein Data Bank: 4BCL.pdb)

Fig. 10.10 Schematic of “β-α-β” (“Rossmann-fold”) Structural Motif

Fig. 10.11(i) Structure of Carbonic Anhydrase (Example of (α/β) Structural Motif)
(Ref: Nair, S.K. & Christianson, D.W. (1991) J Amer Chem Soc., 113; 9455) (Source: Protein Data Bank: 1HCA.pdb)

Fig. 10.11(ii) Structure of Adenylate Kinase (Example of (α/β) Structural Motif)
(Ref: Berry, M.B. & Phillips, Jr. G.N. (1998) Proteins, 32; 276) (Source: Protein Data Bank: 1ZIN.pdb)

Fig. 10.11(iii) Structure of Triose Phosphate Isomerase (Example of (α/β) Structural Motif)
(Ref: Lolis, E., et al. (1990) Biochemistry, 29; 6609) (Source: Protein Data Bank: 1YPI.pdb)

10.3.2.4 Other Structural Motifs

There are many super-secondary structures that do not fall under the more common structural classes. In the structural motif common to all scorpion venom toxins and many snake venom toxins, the disulfide bonds impose natural constraints and stabilize the helix and sheet regions of the motif (Fig. 10.12). The EF-hand structural motif is found in many calcium-binding proteins (Fig. 10.13). Motifs such as ‘helix-turn-helix’, ‘zinc-finger’ and ‘leucine-zipper’ are found in nucleic acid regulatory proteins (Fig. 10.14). These structural motifs influence the spatial folding, and such structural data are helpful in tertiary structure prediction analyses.

Fig. 10.12 Structure of Erabutoxin (A Snake Venom Toxin; Disulfide Bonds Stabilize the Structure)
(Ref: Nastopoulos, L., et al. (1998) Acta Crystallogr D: Biol Crystallogr., 54; 964) (Source: Protein Data Bank: 1QKE.pdb)

Fig. 10.13 Structure of Calmodulin (“EF-Hand” Structural Motif)
(Ref: Wilson, M.A. & Brunger, A.T. (2000) J Mol Biol., 301; 1237) (Source: Protein Data Bank: 1EXR.pdb)

Fig. 10.14 Structure of Zif268-DNA Complex (“Zinc-finger” Structural Motif)
(Ref: Pavletich, N.P. & Pabo, C.O. (1991) Science, 252; 809) (Source: Protein Data Bank: 1ZAA.pdb)

10.4 APPLICATION OF STRUCTURE PREDICTION PROGRAMS

The stated goal of structural genomics and proteomics is to generate a set of structures representative of most of the possible motifs, profiles and folds for specific proteins, and then to solve the structures of new proteins based on known motifs, profiles and other fold-structure relationships. Many empirical procedures, programs, and structure prediction and model building packages are available: single and multiple-sequence alignment, protein family classification, inverted protein prediction and pattern recognition protocols, to name a few (see Chapter 12). Sequence comparison and database searching are the pre-eminent approaches in these methods (see Chapter 11). Protein family classification provides an effective means of understanding structure and function. The fold recognition procedure (e.g. GenTHREADER) has become an important approach to the protein structure prediction problem.

The complexity of the structure prediction problem can be minimized in certain classes of proteins, where structural (geometrical) constraints strongly influence the tertiary folding interactions. Such classes of proteins are immunoglobulins, disulfide-containing proteins, metallo-proteins (EF-hand, zinc-finger motifs) and other classes of proteins with distinctly identifiable motifs (e.g. helix-turn-helix, leucine-zipper). In disulfide-containing proteins, the S–S bridges impose natural constraints and influence the folding. S–S bridge moieties contain both secondary and tertiary structure features, and such moieties help in minimizing the cooperative processes in tertiary structure folding. In these classes of proteins, simplification of the structure prediction problem can be achieved by incorporating (i) ‘knowledge’ governing the packing interactions between and among various structural elements, and (ii) ‘knowledge’ about the structural motifs, their hierarchies and the roles of the S–S bridges in these structures.

Artificial neural network (ANN) procedures are increasingly applied to gene recognition, secondary structure prediction, protein family classification and molecular design. The principles are based on an analogy with the functioning of biological neural networks, with interfaced networks of inputs (dendrites), processing algorithms (soma) and outputs (axons) (Fig. 10.15). In the neural network approach, computer programs are trained to recognize amino acid sequence patterns that are located in known structures and to distinguish these patterns from other patterns not located in these structures.

Fig. 10.15 Flowchart of Neural Network Protocols in Bioinformatics
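Before a network can be trained on sequence patterns, each sequence window has to be turned into numbers for the input layer. The text does not specify an encoding; the sketch below assumes a common choice, a sparse one-hot vector per residue with one extra "spacer" unit for window positions that run off either end of the chain. The window length of 13 is an arbitrary value within the 7 to 17 residue range given in Section 10.3.1, and the function name is hypothetical.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
UNITS_PER_RESIDUE = len(AMINO_ACIDS) + 1  # 20 amino acids + 1 spacer unit


def encode_window(seq, centre, window=13):
    """One-hot encode the window centred on `centre`.
    Returns a flat list of window * 21 zeros and ones."""
    half = window // 2
    vector = []
    for pos in range(centre - half, centre + half + 1):
        unit = [0] * UNITS_PER_RESIDUE
        if 0 <= pos < len(seq):
            unit[INDEX[seq[pos]]] = 1   # residue present at this position
        else:
            unit[-1] = 1                # position lies beyond the chain end
        vector.extend(unit)
    return vector


if __name__ == "__main__":
    inputs = encode_window("ACDEFGHIK", centre=0)
    print(len(inputs), sum(inputs))
```

Each encoded window would be paired with the known state (helix, sheet or coil) of its central residue to form one training example; the network architecture and training procedure are outside the scope of this sketch.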

EXERCISE MODULES

1. Which are the physical techniques employed to elucidate 3-D structures of macromolecules?

2. Why are structure prediction methods an alternative route to experimental methods?

3. What are the protein folding mechanisms?

4. What are the differences between fibrous and globular proteins?

5. What are the general structural characteristics of fibrous molecules?

6. Why is the structure prediction of fibrous proteins simpler than that of globular proteins?

7. What are the salient features of collagen group protein?

8. Try to build the coiled-coil structure of an α-keratin protein from the amino acid sequence and hydrophobic propensity of amino acids.

9. Why are the protein folding rules more complex in the case of globular proteins?

10. Which are the constituents of secondary structure in proteins?

11. Which are the physicochemical parameters that assist in the prediction of secondary structure?

12. Why are the secondary structure prediction methods not that reliable?

13. Model the secondary structure of a globular protein from its amino acid sequence data (obtain the sequence data from a Web site).

14. What are the various classes of folding units in proteins?

15. In which cases can the complexity of structure prediction be minimized, and why and how?

16. Download proteins of various classes (all α-, all β-, α/β, zinc-finger, helix-turn-helix) and observe their domains, topologies and structures.

BIBLIOGRAPHY

1. Attwood, T.K. & Parry-Smith, D.J. (2002), Pearson (Education): Delhi. “Introduction to Bioinformatics”.

2. Bajorath, J., Stenkamp, R. & Aruffo, A. (1993), Protein Sci., 2; 1798. “Knowledge-based model building of proteins: concepts and examples”.

3. Baldwin, R.L. (1989), TiBS, 14; 291. “How does protein folding get started?”

4. Barton, G.J. (1995), Curr Opin Struct Biol., 5(3); 372. “Protein secondary structure prediction”.

5. Berman, H.M., et al. (2000), Nucleic Acids Res., 28; 235. “Protein Data Bank”.

6. Blundell, T.L., et al. (1987), Nature, 326; 347. “Knowledge-based prediction of protein structures and the design of novel molecules”.

7. Bork, P. & Koonin, E.V. (1996), Curr Opin Struct Biol., 6(3); 366. “Protein sequence motifs”.

8. Bowie, J.U. & Eisenberg, D. (1993), Curr Opin Struct Biol., 3; 437. “Inverted protein prediction”.

9. Branden, C. & Tooze, J. (1999), Garland Press: New York. “Introduction to Protein Structure”, 2nd Edn.

10. Burge, C. & Karlin, S. (1997), J Mol Biol., 268; 78–94. “Prediction of gene structures in human genomic DNA”.

11. Chothia, C. (1984), Annu Rev Biochem., 53; 537. “Principles that determine the structure of proteins”.

12. Chothia, C. & Lesk, A.M. (1986), EMBO J., 5; 823. “The relation between the divergence of sequence and structure in proteins”.

13. Chou, P.Y. & Fasman, G.D. (1978), Annu Rev Biochem., 47; 251. “Empirical predictions of protein conformation”.

14. Creighton, T.E. (1992), Freeman Press: New York. “Protein Folding”.

15. Creighton, T.E. (1993), Freeman Press: New York. “Proteins: Structures and Molecular Properties”, 2nd Edn.

16. Cuff, J.A. & Barton, G.J. (2000), Proteins, 40; 502. “Application of multiple sequence alignment profiles to improve protein secondary structure prediction”.

17. Danchin, A. (1999), Curr Opin Struct Biol., 9(3); 363. “From protein sequence to function”.

18. Dobson, C.M. & Karplus, M. (1999), Curr Opin Struct Biol., 9(1); 92. “The fundamentals of protein folding: bringing together theory and experiment”.

19. Dubchak, I., Holbrook, S.R. & Kim, S-H. (1993), Proteins, 16; 79. “Prediction of protein folding class from amino acid composition”.

20. Eisenhaber, F., Persson, B. & Argos, P. (1995), Crit Rev Biochem Mol Biol., 30; 1. “Protein structure prediction: recognition of primary, secondary, and tertiary structural features from amino acid sequence”.

21. Engel, J. (1991), Curr Opin Cell Biol., 3; 779. “Common structural motifs in proteins of the extracellular matrix”.

22. Englander, S.W. (1993), Science, 262; 848. “In pursuit of protein folding”.

23. Farber, G. & Petsko, G.A. (1990), TIBS, 15; 228. “The evolution of α/β barrel enzymes”.

24. Fasman, G.D. (Ed). (1990), Plenum Press: New York. “Prediction of Protein Structure and the Principles of Protein Conformation”.

25. Frishman, D. & Argos, P. (1995), Proteins, 23; 566. “Knowledge-based protein secondary structure assignment”.

26. Garnier, J., Gibrat, J.F. & Robson, B. (1996), Methods Enzymol., 266; 540. “GOR method for predicting protein secondary structure from amino acid sequence”.

27. Gribskov, M. & Veretnik, S. (1996), Methods Enzymol., 266; 198. “Identification of sequence pattern with profile analysis”.

28. Guex, N., Diemand, A. & Peitsch, M.C. (1999), Trends Biochem Sci., 24; 364–67. “Protein modeling for all”.

29. Hadley, C. & Jones, D.T. (1999), Struct Fold Desn., 7; 1099–1112. “A systematic comparison of protein structure classifications: SCOP, CATH and FSSP”.

30. Janin, J. & Chothia, C. (1980), J Mol Biol., 143; 95. “Packing of α-helices onto β-pleated sheets and the anatomy of α/β-proteins”.

31. Johnson, M.S. & Overington, J.P. (1993), J Mol Biol., 233; 716. “A structural basis for sequence comparison”.

32. Jones, D.T. (1999), J Mol Biol., 292; 195–202. “Protein secondary structure prediction based on position-specific scoring matrices”.

33. Kleywegt, G.J. (1999), J Mol Biol., 285; 1887–97. “Recognition of spatial motifs in protein structures”.

34. Lesk, A.M. (1991), IRL Press: London. “Protein Architecture: A Practical Approach”.

35. Lesk, A.M. & Chothia, C. (1980), J Mol Biol., 136; 225. “How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of globins”.

36. Levitt, M. & Chothia, C. (1976), Nature, 261; 552. “Structural patterns in globular proteins”.

37. Lilley, D.M.J. (Ed) (1995), IRL Press: Oxford. “DNA-Protein: Structural Interactions”.

38. Lohmann, R., Schneider, G. & Behrens, D. (1994), Protein Sci., 3; 1597. “A neural network model for the prediction of membrane-spanning amino acid sequences”.

39. Lupas, A. (1996), Methods Enzymol., 266; 513. “Prediction and analysis of coiled-coil structure”.

40. Luthy, R., Bowie, J.U. & Eisenberg, D. (1992), Nature, 356; 83. “Assessment of protein models with three-dimensional profiles”.

41. Merz, K.M. & Le Grand, S.M. (1994), Birkhauser: Boston/MA. “The Protein Folding Problem and Tertiary Structure Prediction”.

42. Moult, J. (1999), Curr Opin Biotech., 10(6); 583. “Predicting protein three-dimensional structures”.

43. Mount, D.W. (2001), Cold Spring Harbor Lab Press: New York. “Bioinformatics: Sequence and Genome Analysis”.

44. Narayanan, P. (2003), New Age Intl Pubs: New Delhi. “Essentials of Biophysics” (2nd Print).

45. Narayanan, P. & Lala, K. (1992), Life Sciences, 50; 683. “Prediction of tertiary structure in ‘scorpion toxin’ type structures”.

46. Orengo, C.A., et al. (1994), Nature, 372; 631. “Protein superfamilies and domain superfolds”.

47. Orengo, C.A., et al. (1994), Curr Opin Struct Biol., 4(3); 429. “Classification of protein folds”.

48. Perutz, M.F. (1991), W.H. Freeman: New York. “Protein Structure and Function”.

49. Qian, N. & Sejnowski, T.J. (1988), J Mol Biol., 202; 865. “Predicting the secondary structure of globular proteins using neural network models”.

50. Richardson, J.S. (1985), Methods Enzymol., 115; 349. “Describing patterns of protein tertiary structure”.

51. Ripley, B.D. (1996), Cambridge University Press: Cambridge. “Pattern Recognition and Neural Networks”.

52. Rost, B., Schneider, R. & Sander, C. (1997), J Mol Biol., 270; 471. “Protein fold recognition by prediction-based threading”.

53. Struhl, K. (1989), TiBS, 14; 137. “Helix-turn-helix, zinc-finger, and leucine-zipper motifs for eucaryotic transcriptional regulatory proteins”.

54. Sun, Z. & Jiang, B. (1996), J Prot Chem., 15; 675. “Conformation of commonly occurring super-secondary structures (basic motifs) in protein databank”.

55. Todd, A.E., Orengo, C.A. & Thornton, J.M. (1999), Curr Opin Chem Biol., 3(5); 548. “Evolution of protein function, from a structural perspective”.

56. Vriend, G. & Sander, C. (1991), Proteins, 11; 52. “Detection of common three-dimensional structures in proteins”.

57. Wu, C.H. & McLarty, J.W. (2000), Elsevier: Amsterdam. “Neural Networks and Genome Informatics”.

58. Zuker, M. (2000), Curr Opin Struct Biol., 10; 303. “Calculating nucleic acid secondary structure”.

Section IV

Database Search, Analysis and Modeling

(Bioinformatics-III)

11 Database Search

The first step towards protein structure prediction, starting from the beginning, is isolating and purifying the desired protein, say “MyProtein”, by experimental methods. The next step is determination of the amino acid sequence of “MyProtein”, either by gene sequencing methods or by physicochemical methods (refer to Chapters 4, 5 & 6). Once the primary structure of “MyProtein” is determined by experimental methods, the rest of the protocols (database searches for sequence similarity, alignment, analysis and modeling) are all computation-based procedures.

Databases are of several types:

(1) Primary databases contain one principal kind of information such as gene sequence data.

(2) Secondary databases contain one principal kind of information, such as sequence alignment (e.g. motifs, profiles, domains), derived from other databases.

(3) Knowledge databases contain structural and functional information from many sources (e.g. hydrophobicity, pH, active sites etc.).

11.1 PRIMARY STRUCTURE (SEQUENCE)

Both genetic (gene selection and amplification) and physicochemical methods (chromatography and electrophoresis) can be used for isolation and purification of the desired protein, “MyProtein”. With the advent of gene cloning and polymerase chain reaction (PCR) techniques, it is possible to purify defined fragments of DNA in large quantities. Beginning with a single molecule of DNA, the PCR method can generate millions of copies of the DNA fragment in a short span of time. Wherever feasible, gene-cloning methods can be employed to obtain large quantities of homogeneous protein at a faster rate.

The primary structure (amino acid sequence data) can be obtained either by protein sequencing or by gene sequencing methods (see Fig. 1.2 of Chapter 1). Laser-activated fluorescence technology has increased the speed of gene sequencing methods, and the amino acid sequence of a protein can be inferred from its gene sequence (refer to Chapter 4). However, inferring the amino acid sequence from the gene sequence has some pitfalls and ambiguities, which should be taken into account. Some of the factors leading to these ambiguities are:

1. A set of three contiguous nucleotides (codon) codes for an amino acid. Any frame-shift error in reading the gene sequence would lead to inferring the wrong amino acid sequence.

2. The genetic code is degenerate: in most cases more than one codon codes for the same amino acid (see Chapter 4). Therefore, there are ambiguities in such cases.

3. Genomic DNA sequences contain an assortment of data types (e.g. untranslated sequences, coding and non-coding, transcription and translation regions).

In addition, information gathered at the mRNA level fails to represent the changes occurring at the protein level. This is due to the numerous regulation mechanisms in place during protein expression and post-expression.
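The first two pitfalls listed above can be made concrete with a small Python sketch. It builds the standard genetic code table, translates a made-up DNA string in each of the three forward reading frames, and lists the synonymous codons for an amino acid. The function names and the example sequence are illustrative assumptions; stop codons are shown as "*".

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(dna, frame=0):
    """Translate `dna` starting at offset `frame` (0, 1 or 2).
    Trailing bases that do not fill a codon are ignored."""
    dna = dna.upper()
    protein = []
    for i in range(frame, len(dna) - 2, 3):
        protein.append(CODON_TABLE[dna[i:i + 3]])
    return "".join(protein)


def synonymous_codons(amino_acid):
    """All codons that code for the given one-letter amino acid."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)


if __name__ == "__main__":
    gene = "ATGGCCAAATAA"  # made-up example sequence
    for frame in range(3):
        print(frame, translate(gene, frame))
    print("Leu codons:", synonymous_codons("L"))
    print("Met codons:", synonymous_codons("M"))
```

Shifting the frame by a single base changes every inferred residue, and an amino acid such as leucine maps back to several possible codons, which is why a protein sequence does not determine a unique gene sequence.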

11.2 DATABASES

Database search and analysis are of two categories. (1) Genomics analysis includes analysis of nucleic acid composition, restriction enzyme cleavage sites, transcriptional factors, promoter sites, secondary structure and sequence similarity searches. (2) Proteomics analysis includes determination of amino acid composition, sequence alignment, phylogenetic analysis, sequence similarity searches, and prediction of secondary structure, motifs, profiles, domains and tertiary structure (Fig. 11.1).

Fig. 11.1 Organization of Biological Databases

Searching sequence databases is one of the most common tasks with a newly discovered protein or nucleic acid. The search is used to find (i) whether the sequence is already in a database, (ii) if it is new, its likely structure (secondary and tertiary) and function, and (iii) the presence of active sites, substrate-binding sites etc.

Databases are effectively electronic filing cabinets, a convenient and efficient method of storing vast amounts of information. The process of database search starts with retrieval of sequences that are similar to that of “MyProtein” for further analysis. For example, if “MyProtein” has been purified from snake venom, it could be a toxin, lipase, phosphodiesterase or one of the other proteins found in snake venoms. Depending on the class of protein and the number of amino acids in “MyProtein”, a database search can be initiated to obtain sequences of neurotoxins, cytotoxins, or other classes of proteins from various species. This can be achieved by various approaches and programs, and from various databases (genomic and proteomic). Most of the databases (databanks) are Web-based. Other sources are journals, authors, research groups, and institutions.

11.2.1 Search Sites and Search Engines

WWW search engines are good tools to get started from the Web sites. Some of the Web sites and engines are:

Altavista

Google

Yahoo

Infoseek

Medline

Meta Crawler

Web Crawler

Research Index

Shared

What’s New Tool

Pedro’s Biomolecular Research Tools.

GAC : (http://compbio.ornl.gov/gac/index.shtml/). (Genome Annotation Consortium).

COG : (http://www.ncbi.nlm.nih.gov/cog/). A gene classification system: clusters of orthologous groups.

DDBJ : (http://www.ddbj.nig.ac.jp/). (DNA Databank of Japan). A nucleic acid database.

DSSP : Database of secondary structures of proteins from PDB.

EBI : (http://www.ebi.ac.uk/). (European Bioinformatics Institute; UK, an outstation of the EMBL).

EMBL : (http://www.ebi.ac.uk/). (European Molecular Biology Laboratory; Germany).

ExPASy : (http://www.expasy.ch/). Expert Protein Analysis System, a Molecular Biology Server, Switzerland, with SWISS-PROT, PROSITE, 2D-PAGE, and other proteomics tools. Key site for protein sequence and structure information.

GDB : (http://www.gdb.org). The genome databank.

GenBank : (http://www.ncbi.nlm.nih.gov/Web/GenBank/). GenBank, the genetic sequence database of the National Institutes of Health (NIH, USA), is an annotated collection of all publicly available DNA sequences. GenBank is a part of the International Nucleotide Sequence Database, which comprises the DNA Databank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL, Germany) and GenBank at NCBI, USA.

GenomeNet : (http://www.genome.ad.j/). (Genome database, Japan).

GOLD : Genomes on-line database. Provides list of all genome projects worldwide.

GRAIL : (http://compbio.ornl.gov/Grail-1.3/). Gene Recognition and Assembly Internet Link software. A suite of tools designed to provide analysis and putative annotation of DNA sequences both interactively and through the use of automated computation.


GSDB : (http://www.seqim.ncgr.org/). (Genome sequence database, USA).

HGP : (http://www.ornl.gov/TechResources/HumanGenome/). (Human Genome Project, USA).

NCBI : (http://www.ncbi.nlm.nih.gov/). (National Center for Biotechnology Information; NIH, USA).

NDB : Nucleic acid structure database.

NRL 3D : Sequence/structure database derived from PDB (at Johns Hopkins University, Baltimore, USA).

OMIM : (http://www3.ncbi.nlm.nih.gov/Omim). On-line Mendelian Inheritance in Man (for human genes and genomics at NCBI).

PDB : (http://www2.ebi.ac.uk/pdb/index.shtml). (Protein Data Bank; Brookhaven National Laboratory; USA).

PDBFINDER : A database comprising PDB, DSSP and HSSP.

PEDANT : (http://www2.ebi.ac.uk/pdb/index.shtml). A protein extraction, description and analysis tool.

Sanger Center : (http://www.sanger.ac.uk/DataSearch/). Genomic sequencing, and genomics analysis server (UK).

SRS : (http://www.srs.hgmp.mrc.ac.uk/). (Sequence Retrieval System).

STACK : (http://www.sanbi.ac.za/Dbases.html). Sequence tag alignment and consensus knowledge database.

TRANSFAC : (http://transfac.gbf.de/TRANSFAC/index.html). Transcription factor database, for transcription factors and transcription factor-binding sites.

EMBASE : (http://www.ncbi.embase.com/). Bibliographic index to biomedical and pharmacological literature.

PUBMED : (http://www.ncbi.nlm.nih.gov/PubMed/). Covers mainly medical literature.

SWISS-PROT : (http://expasy.hcuge.ch/sprot/sprot-top.html). A protein sequence database (Switzerland).

TIGR : (http://www.tigr.org/). (The Institute for Genomic Research).

Other sites : (http://www.hgmp.mrc.ac.uk/GenomeWeb/). (A list of other sites).

Journals : (Nature–Molecular Biology; Molecular Biology; Proteins; Protein Science; Nucleic Acid Research; Current Opinion in Structure Biology; Bioinformatics).

11.2.2 Sequence Retrieval Programs

Various sequence retrieval programs are available. Some of the programs and sources are:

BLAST : Basic Local Alignment Search Tool (Home Page: NCBI, USA). A sequence retrieval and sequence similarity search engine, which consists of a suite of programs: BLASTN (nucleotide BLAST), BLASTP (Protein BLAST), BLASTX (Translated BLAST), PhyloBLAST and PIR-BLAST.

CLEVER : Command-line ENTREZ Version from NCBI. It is an interactive tool to browse the ENTREZ database using only text input/output.


ENTREZ : (http://www3.ncbi.nlm.nih.gov/ENTREZ/). ENTREZ is a powerful search engine, a part of the NCBI server. NCBI holds all the nucleotide and protein sequences in GenBank, as well as Medline. The program allows one to start with only a tentative set of keywords, or a sequence identified in the laboratory, and rapidly access a list of relevant records and a list of related database sequences.

FASTA : (http://www2.igh.cnrs.fr/bin/fasta-guess.cgi). Sequence retrieval and similarity search program.

FETCH : FETCH is a sequence retrieval program that retrieves sequences from GenBank and other databases. The program requires the exact locus name or accession number of a sequence.

NetFETCH : This is a sequence retrieval program that retrieves sequences from NCBI’s NetENTREZ Web server. Sequences can be retrieved by name or accession number.

LOOKUP : LOOKUP is a sequence retrieval program that uses SRS (Sequence Retrieval System) and is useful if the accession number is not known, but one wishes to download sequences of all proteins related to the query protein. LOOKUP identifies sequences by name, accession number, keyword, title, reference, feature or date. The output is a list of sequences.

E-mail Servers : These servers are useful for those who do not have full access to the Internet with a graphical WWW browser. NCBI has an e-mail query service: ([email protected]). The query format is:

DATALIB $$ Titles MaxDocs # BEGIN

The words DATALIB and BEGIN are mandatory.

$$ = gb (for GenBank); e (for EMBL); sp (for Swiss-Prot).

# = Number of sequences needed.

EMBL Get : EMBL sequences can be obtained via e-mail: ([email protected]). The input format is: get nuc: ##; get prot: ##. (## = Accession number).

11.3 GENOME DATABASE SEARCH

Database searches can be carried out either on gene sequence databases, called genome informatics, or on protein sequence (proteome) databases. Genome informatics includes (i) functional genomics, which deals with interpretation of the function of nucleotide sequences on a genomic scale, and (ii) structural genomics, which deals with classification and prediction of protein structures from gene sequence data.

There is a vast amount of gene sequence data available (e.g. from genome sequencing projects). The two main types of database widely used for novel gene discovery are high-throughput genomic databases and expressed sequence tag (EST) databases. ESTs are single-pass, partial sequences of 50-500 nucleotides from cDNA libraries. They provide a direct window onto the expressed genome. EST sequences are generated by the shotgun sequencing method. The sequencing is random, so a sequence can be generated several times and can be inaccurate.


The major nucleic acid sequence databases are:

COMPEC : (http://compec.bionet.nsc.ru/). A database that contains protein-DNA and protein-protein interactions for composite regulatory elements.

EID : (http://golgi.harvard.edu/gilbert/eid/). An Exon-Intron database, derived from GenBank.

MAGPIE : Multipurpose Automated Genome Interpretation Environment. It is a genome analysis and annotation system that adds graphical representation to the results.

dbEST : Database of Expressed Sequence Tags (EST) at NCBI.

DDBJ : Nucleotide database, Japan.

EMBL : Information can be retrieved from EMBL using the SRS system.

GenBank : DNA database from NCBI. Information can be retrieved using Entrez. GenBank is available via FTP.

GSDB : The Genome Sequence Database, from the National Center for Genome Resources.

MEDLINE : The facility provides abstracts from the original published articles.

11.3.1 Gene Structure and Gene Sequences

Genome sequence databases are used for codon usage, restriction maps, identification of coding regions (exons), repeats, translation (protein coding) and motif identification. Pattern recognition methods play an important role in elucidating the location and significance of genes throughout the genome.

Genome sequence databases contain an assortment of data types that cannot be treated alike. The raw sequence (base pair) data is meaningless without analyzing (annotating) the various factors/regions that constitute the sequence. These include untranslated regions (UTRs), coding region sequences (CDRs), introns and exons, ribosome-binding sites, and translational termination sites (see Chapter 9 and Fig. 9.2). Gene identification in prokaryotes is simplified by their lack of introns. In eukaryotes, intron-exon and exon-intron splice junctions have to be identified. Intron/exon prediction draws on consensus sequences at the intron-exon and exon-intron splice junctions, base composition and codon usage. UTRs occur both in DNA and RNA. They are portions of the sequence flanking the coding region sequences (CDRs) but are not themselves translated.

The task of transcriptional and translational signal recognition involves prediction of the promoter sites that function in the initiation of transcription (CCAAT box; GC box; TATA box) and of the signals that mark the start and end of translation. Translation normally starts at an ATG codon, and in eukaryotes the transcription start site typically lies about 30 base pairs downstream of the TATA box (consensus TATAAA).

In an arbitrary DNA sequence, it is not known whether the first base marks the start of the coding sequence. So it is essential to carry out a six-frame translation (three forward and three reverse). Thus, for any piece of DNA sequence, the result of six-frame translation is six potential protein sequences.
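The six-frame translation described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard genetic code and an unambiguous A/C/G/T sequence; the function names are hypothetical and do not come from any program listed in this chapter. Stop codons are shown as "*" and unrecognized codons as "X".

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: no
# organism-specific or mitochondrial code is needed).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Return a dict of the three forward (+1..+3) and three reverse (-1..-3) frames."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in sorted(six_frame_translation("ATGGCCTAA").items()):
        print(name, protein)
```

Each of the six strings is only a candidate; deciding which frame is real requires the ORF and coding-potential criteria discussed next.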

The simplest method of finding DNA sequences that code for proteins is to search for the correct reading frame by locating open reading frames (ORFs). An ORF is normally the longest reading frame uninterrupted by a stop codon (TAA, TAG or TGA). The coding regions (CDRs) can be recognized from (i) sufficient ORF length, (ii) presence of a flanking Kozak sequence, (iii) patterns of codon usage, and (iv) presence of ribosome-binding sites (Shine-Dalgarno sequences) upstream of the start codon.
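As a rough sketch of the ORF search just described, the following Python function scans the three forward frames for stretches that begin with ATG and end at the first in-frame stop codon. It applies only the length criterion; the Kozak, codon-usage and Shine-Dalgarno criteria are not modeled. The function name and the `min_codons` threshold are illustrative assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Find ATG-to-stop ORFs on the forward strand.

    Returns a list of (start, end, frame) tuples, longest first, where
    start/end are 0-based slice positions (end includes the stop codon)
    and frame is 1, 2 or 3. min_codons counts codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame + 1))
                start = None                   # look for the next ORF
    return sorted(orfs, key=lambda orf: orf[1] - orf[0], reverse=True)


if __name__ == "__main__":
    seq = "CCATGAAATTTGGGTAACC"
    for start, end, frame in find_orfs(seq):
        print(frame, start, end, seq[start:end])
```

To cover all six frames, the same function can be run on the reverse complement of the sequence.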

While the coding region is a single open reading frame (ORF) in prokaryotes, eukaryotic genes are commonly organized as exons (coding regions) and introns (non-coding regions), and hence may comprise several disjoint ORFs, and the gene products may be of different lengths. The main task of gene identification (in eukaryotic DNA) involves coding region recognition (intron/exon discrimination) and splice site detection.

Complete CDRs are rarely sequenced in one reaction. So variable-length, overlapping fragments are aligned in a multiple sequence alignment (sequence assembly) to obtain a consensus sequence. This also minimizes the effect of cloning errors.
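The idea of deriving a consensus from overlapping fragments can be illustrated with a simple majority vote per column. The sketch below assumes the fragments have already been placed on a common coordinate system (each given as an offset plus its sequence), which in practice is the hard part of assembly; the function name and input layout are hypothetical.

```python
from collections import Counter


def consensus(aligned_fragments):
    """Majority-vote consensus of pre-aligned, ungapped fragments.

    aligned_fragments: list of (offset, sequence) pairs, where offset is the
    0-based position of the fragment's first base in the assembled sequence.
    Positions covered by no fragment are reported as 'N'.
    """
    length = max(offset + len(seq) for offset, seq in aligned_fragments)
    columns = [Counter() for _ in range(length)]
    for offset, seq in aligned_fragments:
        for i, base in enumerate(seq.upper()):
            columns[offset + i][base] += 1
    return "".join(
        column.most_common(1)[0][0] if column else "N" for column in columns
    )


if __name__ == "__main__":
    fragments = [
        (0, "ATGGCC"),
        (3, "GCGTTA"),   # contains one error (G instead of C) at position 5
        (5, "CTTAGG"),
    ]
    print(consensus(fragments))
```

In this example the single erroneous base is outvoted by the two fragments that agree, which is how redundancy reduces the impact of sequencing and cloning errors.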

The majority of the DNA sequence data in the databases consists of partial sequences, Expressed Sequence Tags (ESTs), obtained by random sequencing of cDNA copies of cell mRNA sequences. EST libraries are useful for preliminary identification of genes by database similarity searches. Screening the predicted protein sequences against an expressed sequence tag (EST) library confirms the prediction and expression of the gene. Cloning and sequencing the intact cDNA may then be used to make a more detailed analysis. An EST database of an organism can be analyzed for the presence of gene families, orthologs and paralogs.

Since gene prediction methods are only partially accurate, partial cDNA copies of expressed genes (ESTs) are used to confirm that a predicted gene is transcribed. ESTs are not only incomplete, but also to a certain degree inaccurate. When a search of the databases reveals several ESTs, EST analysis protocols are used. These include sequence similarity, sequence assembly (multiple sequence alignment) and sequence clustering algorithms (see Chapter 12).

A large number of regulatory sequences (promoters, enhancers etc.) have been identified and collected into databases.

EPD : Eukaryotic Promoter Database. Provides a comprehensive compilation of eukaryotic transcriptional sites (promoters).

FINDPATTERNS : Searches DNA sequences for the occurrence of transcription initiation sites.

FRAMES : (Open Reading Frames; ORFs).

GEMS : (Genomatrix, Germany) provides an output of mono-, di-, and trinucleotide frequencies.

GSDB : Genome sequence database (from GenBank).

HTD : Human Transcription Database. It provides information related to RNA molecules that have been sequenced.

MAP : (ORFs).

Mat Inspector : (uses TransFac Database).

Model-it : Produces pictures of DNA (needs the RasMol program to view the *.pdb file).

ORF Finder : (http://www.ncbi.nlm.nih.gov/gorf/gorf.html). Provides ORFs in a sequence as colored bars (NIH/NCBI, USA).


Plotorf : Plotting of open reading frames (ORFs). It displays ORFs in a sequence, defined as the longest sequences starting with a “start” codon (usually ATG) and ending with a “stop” codon. The largest ORFs are likely to be true genes.

Promoter Scan : Searches for promoter sites (NIHBIMS, USA).

Protein Back translation : Gives DNA sequence from protein sequence.

Reverse Translate a Protein : Gives DNA sequence from protein sequence.

Signal Scan : Searches for promoter sites (NIHBIMS, USA).

TESS (USA) : Transcription Element Search Software.

TransFac (Germany) : Transcription Factor Database.

TRRD : (http://www.mgs.bronet.nsc.ru/mgs/dbases). Transcription Regulatory Regions Database. Provides information about DNA regulatory regions (DNA-protein interactions).

VecScreen : Screens the DNA sequence for potential vector sequence (NCBI, USA).

Coding regions (CDRs) can be found by searching with:

CODONPREFERENCE : Statistical algorithm that measures codon usage. Identifies protein-coding sequences.

Codon Usage (USA) : Analysis of different ORFs in a gene sequence.

Frame Plot (Japan) : Permits selection of the maximum size of the ORF, and of the start codon, which can be used in similarity searches.

GENLANG : Pattern recognition program.

GenMark : Provides a family of programs for ORF analysis.

GRAIL : An ORF identification tool. Provides analysis of the protein coding potential of a DNA sequence.

REPuter (Germany) : Provides maximal repeats in complete genomes.

TESTCODE:
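Several of the tools above rest on codon-usage statistics. As a hedged illustration of the underlying measurement (not of the algorithm used by CODONPREFERENCE or any other listed program), the sketch below counts how often each codon occurs in an in-frame coding sequence; the function name is hypothetical.

```python
from collections import Counter


def codon_usage(cds):
    """Return the relative frequency of each codon in an in-frame sequence.

    cds is read from position 0 in steps of three; a trailing partial codon
    is ignored. The result maps codon -> fraction of all codons.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


if __name__ == "__main__":
    usage = codon_usage("ATGGCCGCCTAA")
    for codon, fraction in sorted(usage.items()):
        print(codon, round(fraction, 2))
```

Comparing such frequencies for a candidate ORF with those of known genes from the same organism is one way to judge whether the ORF is likely to be protein-coding.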

Genome sequence comparison can be carried out by several methods.

Core Genes : Determines the core set of genes.

Gene Builder : Gene Builder is a versatile, multi-module operation system. Each module is executed independently, with various options. Some of the modules with options are:

1. Mode : There are two options, GENE and EXON. The GENE option is used for predicting the full gene model. Potential coding fragments (PCFs) are used to construct the gene models with the maximum coding potential by using dynamic programming methods. The EXON option is used for selecting only the exons with the best scores. The EXON option can be useful for long genomic sequences with an unknown number of potential genes, since there is very little over-prediction.


2. Sequencing error correction : This option corrects potential sequencing errors due to frame-shifts and substitutions in the “stop” codons. The predicted gene model can be substantially improved if these errors are eliminated.

3. Splice sites : Classification analyses combined with the weight-matrix method are used for splice site prediction.

4. Potential coding regions : Potential coding regions are found by combining the protein coding potential, calculated by using the dicodon statistic, and the splicing signals. With the “All” option, all potential coding exons will be used for gene construction. The “All” option is useful as the first step of sequence analysis when no information about gene content is available for a query sequence. With the “Good” and “Excellent” options, only the exons having “good” and “excellent” quality will be used for gene prediction. If the “Protein similarity” option is selected, only the exons with similarity to a selected homologous protein will be used for gene reconstruction.

5. First and Last coding exons : This option is useful where several genes are present in a query sequence and the gene location can be confirmed by using homology with a chosen protein.

6. EST mapping : A homology search is performed against the EST database and the position of the homologous EST sequences in relation to the query sequence is given in the output.

GeneFinder : The algorithm first predicts all possible potential exons, and then uses dynamic programming to search for the optimal combination of these exons and construct a gene model.

GenHacker (Japan) : Predicts gene structure in microbial genomes using a hidden Markov model (HMM).

GenScan : (http://genes.nit.edu/genscan.html/). GenScan is a general-purpose gene identification program. It determines the most likely gene structure for each sequence, based on probabilistic models. The score of a predicted feature (e.g. exon) is a log-odds measure of the quality of the feature based on local sequence properties (refer to Chapter 12 for details). Forward and backward recursions are performed that allow determination of the most likely gene structure in the sequence and the probability of each exon, together with the corresponding predicted amino acid sequences.

Pairwise FLAG : Alignment of small genomes (gigabases). Performs local alignment for two different DNA sequences.

PipMaker : DNA sequence alignment tool.

SCAN2 : Provides color-coded graphical alignment of genome-length DNAs.


VISTA : Visualization tools for alignments. Allows alignment of two genome-length sequences.

WebGene : A multi-package system for gene structure analysis and prediction.
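Gene Builder and GeneFinder, as described in the list above, use dynamic programming to pick the best combination of candidate exons. The following sketch shows the general idea in its simplest form: choosing the highest-scoring set of non-overlapping exons. It is an assumption-laden toy (it ignores reading-frame compatibility, splice signals and intron length limits) and is not the algorithm of either program; the function name and the tuple layout are hypothetical.

```python
from bisect import bisect_right


def best_exon_chain(exons):
    """Pick the non-overlapping exons with the highest total score.

    exons: list of (start, end, score) with end exclusive.
    Returns (total_score, chain), chain ordered by position.
    """
    exons = sorted(exons, key=lambda exon: exon[1])
    ends = [exon[1] for exon in exons]
    # best[k] holds the optimal (score, chain) using only the first k exons.
    best = [(0.0, [])]
    for k, (start, end, score) in enumerate(exons):
        # j = number of earlier exons that end at or before this exon's start
        j = bisect_right(ends, start, 0, k)
        take = (best[j][0] + score, best[j][1] + [(start, end, score)])
        skip = best[k]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1]


if __name__ == "__main__":
    candidates = [
        (0, 100, 5.0),
        (50, 150, 6.0),    # overlaps both neighbours
        (120, 200, 4.0),
        (210, 300, 3.0),
    ]
    total, chain = best_exon_chain(candidates)
    print(total, chain)
```

Note that the single best-scoring exon (score 6.0) is not part of the optimal chain here, which is why a global method such as dynamic programming is preferred over picking exons greedily.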

11.4 PROTEIN DATABASE SEARCH

Protein sequence similarity search is more sensitive than DNA sequence similarity search, because

1. DNA has only four bases as compared to twenty amino acids.

2. Pair-wise comparison of DNA bases is scored as “match” or “mismatch”, whereas two amino acids can share varying degrees of similarity, based on their physicochemical properties.

3. Proteins have database information at various levels (primary, secondary and tertiary level databases).

But, for phylogenetic analysis, DNA sequence is better suited, because

(i) The pattern of mutations, insertions and deletions at nucleotide level is definitive.

(ii) Silent mutations, that is, mutations at the DNA level that do not result in an amino acid substitution at the protein level because of the redundancy of the genetic code, can be detected only in the DNA sequence.
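The point about silent mutations can be made concrete with a small example. The sketch below, which assumes the standard genetic code and uses hypothetical function names, counts the differences between two short coding sequences at the DNA level and again after translation.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate complete codons from position 0 ('X' for unknown codons)."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def count_differences(a, b):
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    gene_1 = "GCTCTGAAA"
    gene_2 = "GCCTTGAAG"
    print("DNA differences:    ", count_differences(gene_1, gene_2))
    print("Protein differences:",
          count_differences(translate(gene_1), translate(gene_2)))
```

The two genes differ at three nucleotide positions yet encode the same peptide, so the divergence between them is visible only at the DNA level.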

Many reputed genome databases (EBI; NCBI; SWISS-PROT) also have protein databases. Some others are:

CluSTr : (http://www.ebi.ac.uk/clustr). A database built from the SWISS-PROT and TrEMBL protein databanks. It can be used for (i) search for new protein families, (ii) annotation of newly sequenced proteins, (iii) prediction of functions of new proteins, and (iv) proteome analysis.

DIP : (http://dip.doe-mbi.ucla.edu/). Database of Interacting Proteins. Contains information on protein-protein interactions. Gives the experimental methods used for determining interactions.

HSSP : (http://www.sander.embl-heidelberg.de/hssp/). A database of homology-derived secondary structure of proteins (EMBL, Germany).

MIPS : (Martinsried Institute for Protein Sequences). European partner of PIR for genomic and protein data analysis.

MMDB : (http://www.ncbi.nlm.nih.gov/Structure/). Molecular modeling database. An NCBI source that contains all the experimentally determined 3-D structural data in the PDB.

NRL-3D : Produced by PIR from sequences in the PDB database.

PDB : (Protein Data Bank, Brookhaven, USA).

PIR : (http://www_nbrf.georgetown.edu/pir/). (Protein Information Resource, USA). PIR produces the protein sequence database (PSD) of functionally annotated protein sequences.


PIR-NREF : A comprehensive database for sequence searching and protein identification.

PIR-PSD : (http://www_nbrf.georgetown.edu/pir/). (Protein Information Resource, USA). PIR produces the protein sequence database (PSD) of functionally annotated protein sequences.

SRS : Sequence Retrieval System.

SWISS-PROT : (http://expasy.hcuge.ch/sprot/sprot-top.html). A database of protein sequences and structures, translated from the EMBL genomic database.

TrEMBL : (http://www.ebi.ac.uk). TrEMBL is a computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL nucleotide sequences.

UCL : (University College, London). Protein sequence and structure analysis.

11.4.1 Sequence Similarity Search

BLAST, FASTA, and other search tools are used to carry out sequence similarity searches. Some of these servers are:

BLAST : (http://www.ncbi.nlm.nih.gov/blast3.html/). A suite of Basic Local Alignment Search Tools (NCBI, USA) for sequence similarity search.

CINEMA : A Color Interactive Editor for Multiple Alignments for nucleic acids and proteins.

CLUSTAL : Multiple sequence alignment tool based on clustering algorithms.

Consensus : Takes the output of CLUSTAL or other multiple sequence alignment programs and calculates the consensus sequence.

DiAlign : Constructs pair-wise and multiple alignments by comparing whole segments of the sequences.

FASTA : (http://www2.ebi.ac.uk/fasta3/). Sequence similarity search tool.

LALIGN : A tool for matching two sequences.

SAPS : Statistical Analysis of Protein Sequences, which includes analysis of amino acid composition, charge, hydrophobicity, transmembrane segments etc.

Sanger Center : (http://www.sanger.ac.uk/DataSearch/). The Sanger Center for Database Search Services.

Various servers/databases are available to carry out sequence similarity search and phylogenetic analysis. Some of these servers and databases are:

DISTANCES : Calculates pair-wise distances between groups of sequences for phylogenetic analysis.

PARSIMONY : A cladistic algorithm for constructing ancestral relationships.


PAUP : (http://www.lms.si.edu/PAUP/). Phylogenetic Analysis Using Parsimony.

PHYLIP : A collection of many programs: UPGMA, Parsimony, Neighbor-joining and Maximum likelihood algorithms.

PhyloBLAST : Compares the query protein to the SWISS-PROT/TrEMBL database and carries out phylogenetic analysis.

PILEUP : PILEUP uses UPGMA to create its dendrogram of DNA sequences, and then uses this dendrogram to guide its multiple alignment algorithm.

Tree base & Tree view : For graphical representation of phylogenetic trees.

Tree Gen : Tree generation from distance data.

UPGMA : Unweighted Pair Group Method with Arithmetic mean. A clustering algorithm that uses arithmetic averages.
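Since UPGMA underlies several of the tools listed above, a compact sketch of the procedure may help. It repeatedly merges the two closest clusters and recomputes distances to the merged cluster as a size-weighted arithmetic average. Branch lengths are omitted for brevity, the input is assumed to be a symmetric distance matrix, and the function name is hypothetical; this is not the code of PHYLIP or PILEUP.

```python
def upgma(names, matrix):
    """Cluster taxa by UPGMA and return the tree as nested tuples.

    names: list of taxon labels; matrix: symmetric list-of-lists of
    pair-wise distances in the same order as names.
    """
    n = len(names)
    trees = dict(enumerate(names))            # cluster id -> subtree
    sizes = {i: 1 for i in range(n)}          # cluster id -> number of leaves
    dist = {(i, j): matrix[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(trees) > 1:
        i, j = min(dist, key=dist.get)        # closest pair of clusters
        new = next_id
        next_id += 1
        for k in trees:
            if k not in (i, j):
                d_ik = dist[tuple(sorted((i, k)))]
                d_jk = dist[tuple(sorted((j, k)))]
                # size-weighted average keeps every leaf pair equally weighted
                dist[(k, new)] = (sizes[i] * d_ik + sizes[j] * d_jk) / (
                    sizes[i] + sizes[j]
                )
        trees[new] = (trees.pop(i), trees.pop(j))
        sizes[new] = sizes.pop(i) + sizes.pop(j)
        dist = {p: d for p, d in dist.items() if i not in p and j not in p}
    return next(iter(trees.values()))


if __name__ == "__main__":
    labels = ["A", "B", "C", "D"]
    distances = [
        [0, 2, 6, 10],
        [2, 0, 6, 10],
        [6, 6, 0, 10],
        [10, 10, 10, 0],
    ]
    print(upgma(labels, distances))
```

The nested tuples correspond to the dendrogram: A and B join first, then C, then D.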

11.4.2 Secondary Structure Search

There are many secondary structure prediction tools/servers available. The prediction programs rely on the propensity parameters of amino acids and their physicochemical characteristics, such as hydrophobicity, charge and solvation.

BCM Search Launcher : Provides access to a large collection of secondary structure prediction tools.

CoDe : Secondary structure consensus prediction.

DAS : Transmembrane prediction server (Sweden).

DSC : (http://www.bmm.icnet.uk/dsc/). A linear discriminator secondary structure prediction program.

JPRED : Consensus method of secondary structure prediction.

PHDsec : Prediction of secondary structure (EMBL, Germany).

PHDhtm : Transmembrane helix location prediction and topology.

PREDATOR : (http://www.embl-heidelberg.de/). A program that can predict the secondary structure of a single sequence, or of a number of related sequences.

PROFsec : Secondary structure prediction server.

ProScale : Predicts hydrophobicity, α-helix, β-sheet and other features.

PSA (Boston, USA) : Protein sequence analysis database with a protein secondary structure prediction server.

PSI-Pred (UK) : PSI-BLAST profiles.

ReDe (USA) : Transmembrane prediction server.

SSCP : Prediction based on amino acid composition input information.

SSPRED : Prediction of secondary structure from SWISS-PROT database.

TMHMM : Prediction of transmembrane helices in proteins (Denmark).

TMPred : Prediction of transmembrane regions and orientation (ISREC, Switzerland).
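To illustrate how a physicochemical parameter such as hydrophobicity can be turned into a prediction signal, the sketch below computes a sliding-window hydropathy profile. It uses the Kyte-Doolittle scale as one widely used example; the choice of scale and the window length are assumptions made for illustration, and none of the servers above is claimed to work this way.

```python
# Kyte-Doolittle hydropathy values (higher = more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=7):
    """Return the mean hydropathy of each window along the sequence.

    The i-th value covers residues i .. i+window-1. Raises KeyError for
    residues outside the 20 standard amino acids.
    """
    scores = [KYTE_DOOLITTLE[aa] for aa in protein.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


if __name__ == "__main__":
    profile = hydropathy_profile("LLLLLLLDDDDDDD", window=7)
    for position, value in enumerate(profile):
        print(position, round(value, 2))
```

Long stretches with a high average value are candidate membrane-spanning segments; transmembrane predictors typically use longer windows than the default chosen here.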


11.4.3 Motifs, Domains and Profiles Search

Many proteins are organized into structural motifs (super-secondary structures) and domains that are highly conserved. Profiles are mathematical representations of conserved regions, encompassing domain alignments. Some of the motifs, domains and profiles servers are:

BLOCKS : (http://www.ebi.ac.uk). Search tool for motifs and protein classification.

CDD-Search : Conserved domain database search.

COILSCAN : Identifies coiled-coil regions.

HTHSCAN : Helix-turn-helix motifs scan.

InterPro Search : Meta site profile scan server (USA).

Leucine Zippers : Leucine-zipper motifs scan.

MAST : Motif Alignment and Search Tool for searching sequence databases for sequences that contain one or more groups of motifs.

MeMe (USA) : Motif elicitation tool.

MOTIF : Meta site motif search (Japan).

PFAM : (http://www.sanger.ac.uk/pfam). Encodes sequence conservation within aligned families.

PRINTS : (http://www.biochem.ucl.ac.uk/bsm/dbbrowser/PRINTS). Search of motifs and protein classification.

PRODOM : A protein domain database (France).

PROFILES : A profile search database.

Profile Scan : Meta site profile scan server (ISREC, Switzerland) that searches a sequence against a library of profiles.

PROSITE : (http://www2.ebi.ac.uk/ppsearch/). Secondary structure database from EBI. It is the best starting point for motif search.

SMART : (http://smart.emol-heidelberg.de/). Genetically mobile protein domain search server.

TOPITS : Fold recognition by prediction-based threading.
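The statement that a profile is a mathematical representation of a conserved region can be illustrated with a position-specific scoring matrix built from an ungapped alignment. The sketch below is a simplified illustration: it assumes a uniform background distribution and a fixed pseudocount, and it is not the scoring scheme of any server listed above. All function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Build log-odds scores per column from equal-length, ungapped sequences."""
    background = 1.0 / len(AMINO_ACIDS)       # assumption: uniform background
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile, segment):
    """Sum the column scores for a segment as long as the profile."""
    return sum(column[aa] for column, aa in zip(profile, segment))


def best_match(profile, sequence):
    """Return the start index of the highest-scoring window in sequence."""
    width = len(profile)
    return max(
        range(len(sequence) - width + 1),
        key=lambda i: score(profile, sequence[i:i + width]),
    )


if __name__ == "__main__":
    motif_alignment = ["GKST", "GKSS", "GKTT", "AKST"]
    profile = build_profile(motif_alignment)
    query = "MLLGKSTPP"
    start = best_match(profile, query)
    print(start, query[start:start + len(profile)])
```

Unlike a fixed consensus string, the profile tolerates variation in proportion to what was observed in the alignment, so a near-match such as "AKST" still scores well.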

11.4.4 Pattern Recognition Search

Pattern recognition programs follow the reverse process of sequence analysis. Rather than predict how a sequence will fold, they predict how well a fold will match a sequence. Some of the pattern recognition servers are:

BLASTPAT : BLAST-based patterns database search.

EPAT : Pattern search (for PDB; SWISS-PROT; PIR databases).

FASTAPAT : FASTA-based patterns database search.

FINDPATTERNS :

PRATT : A pattern recognition tool. Searches for patterns conserved in a set of protein or nucleic acid sequences. It is able to discover patterns conserved in sites of unaligned protein sequences.

PatternP :

PATScan : Search for patterns conserved in a set of protein or nucleic acid databases.
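Pattern searches of the kind listed above are often expressed in PROSITE-style syntax, which maps closely onto regular expressions. The sketch below converts the basic elements of that syntax (x for any residue, [..] for allowed residues, {..} for excluded residues, (n) or (n,m) for repeats, and < or > for the sequence ends) into a Python regular expression. It is a simplified converter written for illustration and is not code from any of the programs named here.

```python
import re


def prosite_to_regex(pattern):
    """Convert a basic PROSITE-style pattern to a Python regex string."""
    regex = ""
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # exclusions
        element = element.replace("(", "{").replace(")", "}")   # repeat counts
        element = element.replace("x", ".")                      # any residue
        element = element.replace("<", "^").replace(">", "$")   # termini
        regex += element
    return regex


def find_pattern(pattern, protein):
    """Return the 0-based start positions of non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [match.start() for match in regex.finditer(protein)]


if __name__ == "__main__":
    # N-glycosylation site pattern: Asn, not Pro, Ser or Thr, not Pro.
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
    print(find_pattern("N-{P}-[ST]-{P}", "MNGSANPTQ"))
```

Because `re.finditer` reports non-overlapping matches only, overlapping occurrences of a pattern would need a lookahead-based search instead.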

11.4.5 Protein Classification

Proteins can be classified under various categories based on their structural similarities.

Some of the protein classification databases are:

ASTRAL : (http://astral.stanford.edu/). Compendium for sequence and structure analysis. It is partially derived from, and augments, the SCOP database. Most of the resources provided here depend upon the coordinate files maintained and distributed by the Protein Data Bank (PDB).

CATH : (http://www.biochem.ucl.ac.uk/bsm/CATH). The Class, Architecture, Topology & Homology database is a hierarchical domain classification of protein structures.

GeneFind : An integrated neural network protein classification database.

iProClass (USA) : Integrated protein classification resource that provides comprehensive family relationships and structure/functional features of proteins.

ProClass : Provides a summary description of protein family, structure and function for PIR-PSD, SWISS-PROT and TrEMBL.

SCOP : Structural Classification of Proteins. A database that provides a hierarchical structural classification.

11.4.6 Tertiary Structure Modeling

Tertiary structures are predicted by homology modeling methods. Some of the homology-based modeling databases are:

3DinSight : (http://www.rtc.riken.go.jp/jouhou/3dinsight/3DinSight.html) (Japan). An integrated database and search tool for structure, property and function of biomolecules. The structural data, functional data (motifs, mutations, protein-nucleic acid binding, protein-ligand binding etc.) and property data (amino acid property and thermodynamic data of proteins) of biomolecules are implemented in a relational database, so that flexible searches can be done by a combination of queries (SQL).

3D-JIGSAW (UK) : A homology protein-modeling tool.

3D-PSSM : Protein fold recognition tool.

DALI : The server searches the Protein Data Bank (PDB) for structural homologues of a query protein.

Decipher (USA) : A modeling server with a variety of nucleic acid and protein analysis tools.


FAMS : Fully Automated Modeling System (Japan) based upon the homology modeling method, including a structural optimization process.

FSSP : Database of families of structurally similar proteins derived from PDB at EMBL.

MODELLER : (http://guitar.rockefeller.edu/modeller/). A 3D-structure modeling server (New York University, USA).

PDBSum : The database provides summaries of structural analyses of PDB data files.

Predict Protein : The server (EMBL, Germany) is used to find structural homologues of a query protein sequence.

Predict Protein Server : (http://www.embl-heidelberg.de/predictprotein/predictprotein.html). A neural network-based prediction server used to find structural homologues of a query protein sequence.

ProSAL (Sweden) : A Meta site for protein analysis and characterization.

SDSC1 : San Diego Supercomputer Center Protein Structure Homology Modeling.

SWISS MODEL Server : 3D-structure by homology modeling (> 50% homology).

WHATIF : (http://www.umbi.kun.nl/whaif/). A web interface (EMBL, Germany) that provides tools for examining PDB files.

11.4.7 Knowledge DatabasesKnowledge databases contain structural and functional information from many sources (e.g.hydrophobicity, pH, actives sites etc.), for mining sequence databases in conjunction withmass spectrometry, and other data. Some of these are:

PeptIdent : Protein identification using pI, Mr, and peptide mass fingerprinting data at ExPASy.

ProteinProspector : Proteomics tools for mining sequence databases in conjunction with mass spectrometry.

ProtParam : Resource for amino acid composition, Mr, pI, and extinction coefficients (at ExPASy).

ProtScale : Resource for hydrophobicity and other conformational parameters (at ExPASy).

Prowl : Resource for protein chemistry and mass spectrometry.

EXERCISE MODULES

1. What are the general experimental methods of determining the primary structure of proteins?

2. Name some methods that have enhanced gene sequencing.

3. What are the pitfalls of inferring protein sequences from gene sequences?

4. What is a database?

5. What are the two major categories of database searches?

6. What are search sites and search engines?

7. What is a sequence retrieval program?

8. Comment on genome database search.

9. What are the genome database searches used for?

10. What are the database types obtainable from genome database searches?

11. What are open reading frames (ORFs)?

12. What are coding regions (CDRs), and expressed sequence tags (ESTs)?

13. Why is protein database search more sensitive than gene database search?

14. Name a few programs used in sequence similarity search.

15. What are the secondary structure prediction tools based upon?

16. Define motifs, domains and profiles.

17. What is the basis for tertiary structure prediction?


12 Data Mining, Analysis and Modeling

Once the database search is complete, the next course of action in protein structure prediction methodology rests on data mining, analysis, and modeling procedures. It involves (i) primary sequence alignment, (ii) secondary and tertiary structure prediction, (iii) homology modeling, and (iv) X-ray diffraction pattern analyses. Currently, there is no reliable de novo predictive method for protein 3D-structure determination. The success of this approach depends rather on the sequence similarity (or lack of it) of "MyProtein" with other sequences from the database search (Fig. 12.1).

Fig. 12.1 A Flowchart of Protein Informatics Procedures

[Flowchart boxes: Purified protein; Homology found in the database; No homology found in the database; Alignment of sequence to 3-D structure (3-D modeling protocols); Secondary structure prediction; Tertiary structure prediction; Model structure. The branches are labeled Path 1, Path 2 and Path 3.]


Data mining and analysis aim at nontrivial extraction, by computational means (in silico methods), of previously unknown and potentially useful information from data, or at the search for relationships and patterns that exist in databases. The structure prediction of a protein from its amino acid sequence data by data mining and analysis procedures depends on the amount of structural information available from various types of databases. Three situations can arise:

1. High degree of sequence similarity, and three-dimensional structure data of homologous protein(s) are available.

2. Poor sequence similarity from database search.

3. No sequence similarity found from database search.

If “MyProtein” has a high degree of sequence similarity with the protein sequences from the database search, and if the three-dimensional structure(s) of homologous protein(s) in the series is (are) available, then structure prediction of the test protein is relatively simple and reliable, and a fairly accurate tertiary structure model can be generated with computational algorithms, based upon the tertiary structures of the homologous protein(s) (Path 1 of Fig. 12.1).

For modeling of (putative) proteins, alignment of sequences is followed by insertions, deletions and replacements in the three-dimensional structure(s) of the homologous protein(s). Initial modeled proteins are refined using energy minimization and other procedures to give final structures without appreciable steric hindrances. Alternatively, modeling can be carried out on the basis of each available known tertiary structure, and the resulting models tested for packing of side chains, solvent accessibility and other physicochemical parameters. Simultaneously, but selectively, the information from all the known tertiary structures of the homologous family can be used in modeling procedures.

Once a 3-D model of a protein (“MyProtein”) is available, its structure can be viewed from various directions to visualize the tertiary folding, and used towards the rational design of new versions of the protein.

More elaborate alignment and structure prediction procedures are required to arrive at plausible model structure(s) if the sequence similarity between “MyProtein” and the database proteins is poor (Path 2). Computational procedures include secondary structure prediction, fold recognition, alignment of motifs and profiles, and finally 3D-structure modeling.

If for some reason the database search does not find any proteins with sequence similarity to the “MyProtein” sequence (it could be a new class of protein), secondary and tertiary structure prediction procedures are based solely on statistical methods (Path 3). The correctness (validity) of the modeled structure(s) should be treated with due caution and skepticism. The real test of the model structure(s) is to crystallize the protein and determine its three-dimensional structure by X-ray diffraction methods. The overall analysis approach towards protein structure prediction, depending on the sequence similarity, is given in Table 12.1.

Table 12.1 Protein Structure Approach depending on Sequence Identity

Sequence similarity    Approach                                    Path taken (Fig. 12.1)
> 80%                  Sequence alignment                          Path 1
50–80%                 Sequence alignment; pair-wise alignment     Path 1 & Path 2
25–50%                 Consensus methods; profile methods          Path 2
< 25%                  De novo structure prediction methods        Path 3
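The decision rule of Table 12.1 can be sketched as a small lookup. This is only an illustration: the function name `approach_for_similarity` is hypothetical, and the handling of the exact boundary values (80%, 50%, 25%) is an assumption, since the table does not say to which row they belong.

```python
def approach_for_similarity(percent):
    """Return (approach, path) for a sequence similarity given in percent.

    Boundary handling is an assumption: 80 falls in the 50-80% row,
    50 in the 50-80% row, and 25 in the 25-50% row.
    """
    if not 0 <= percent <= 100:
        raise ValueError("similarity must be between 0 and 100")
    if percent > 80:
        return ("Sequence alignment", "Path 1")
    if percent >= 50:
        return ("Sequence alignment; pair-wise alignment", "Path 1 & Path 2")
    if percent >= 25:
        return ("Consensus methods; profile methods", "Path 2")
    return ("De novo structure prediction methods", "Path 3")


for value in (90, 60, 30, 10):
    print(value, approach_for_similarity(value))
```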


12.1 SEQUENCE ALIGNMENT ANALYSIS

Sequence alignment is the process of lining up two or more sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology. Sequence similarity analysis is the single most powerful method for structural and functional inference available in databases. It allows the inference of homology between proteins, and homology can help one to infer whether similarity in sequence implies similarity in function. Methods of analysis can be grouped into two categories: (i) sequence alignment-based search, and (ii) profile-based search.

Fundamentally, sequence-based alignment searches are string-matching procedures. A sequence of interest (the query sequence) is compared with sequences (targets) in a databank, either pair-wise (two at a time) or with multiple target sequences, by searching for a series of individual characters. Two sequences are aligned by writing them across a page in two rows. Identical or similar characters are placed in the same column, and non-identical characters can be placed opposite a gap in the other sequence. A gap is a space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another. In an optimal alignment, non-identical characters and gaps are placed to bring as many identical or similar characters as possible into vertical register.

The objective of sequence alignment analysis is to analyze sequence data to make reliable predictions of protein structure, function and evolution vis-a-vis the three-dimensional structure. Such studies include detection of orthologous (same function in different species) and paralogous (different but related functions within an organism) features. Analysis procedures include various statistical algorithms for sequence alignment, pattern matching and prediction of structure directly from sequence.

Sequences that have diverged greatly during evolution cannot be detected by simple sequence similarity search methods. In such cases, computational methods comprising multiple sequence alignment and profile-matching searches, which go beyond simple pair-wise sequence similarity methods, are tried for meaningful results.

A polypeptide of n amino acids can have $20^n$ different sequences, and the problem of protein structure prediction becomes astronomical even for a small protein of 100 amino acids. One of the methods of minimizing this problem is to rely on statistical methods to search for structural similarities (protein families), based on the sequence similarities, from the probabilities calculated from the observed frequencies of amino acids in the family classes. Thus, sequence similarity analysis is the cornerstone of bioinformatics. It is useful for discovering structural, functional and evolutionary information in sequences. The sequence alignments indicate the changes that could have occurred between two homologous sequences and a common ancestral sequence during evolution. Sequence alignment from the database search is the operation upon which all other computational procedures are based:

1. It is necessary for inferring phylogenetic relationships.

2. Sequence similarity analysis is the starting point for predicting the secondary structure of proteins.

3. It is a prerequisite for all “knowledge-based” protein family classification and tertiary structure prediction.
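To give a feel for the size of the sequence space mentioned above, the count of possible sequences for a 100-residue protein can be computed directly. The helper name `sequence_space_size` is hypothetical; Python integers are arbitrary precision, so the exact value is available.

```python
import math


def sequence_space_size(n, alphabet_size=20):
    """Number of distinct sequences of length n over the given alphabet."""
    return alphabet_size ** n


n = 100
count = sequence_space_size(n)
# The exact integer is too long to read comfortably, so report its magnitude.
print(f"20^{n} has {len(str(count))} decimal digits")
print(f"log10(20^{n}) = {n * math.log10(20):.1f}")
```

The result is a number with more than a hundred decimal digits, which is why exhaustive enumeration is not an option and statistical, similarity-based methods are used instead.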


Sequence similarity search algorithms rest on the premise that if two sequences are sufficiently similar, almost invariably they have similar biological functions, and will be descended from a common ancestor.

Sequence conservatism → Structural conservatism → Functional conservatism

Two proteins that have a certain number of amino acids in common at aligned positions are said to be identical to that degree (for example, 25% identical for 40 common amino acids in a 160-residue sequence). That a stretch of two sequences is nearly identical does not imply that they are homologous, that is, related by divergence from a common ancestor. Homology is not synonymous with similarity, and the difference is important. Similarity is a value between 0 and 100%. On the other hand, there are no degrees of homology: sequences are either homologous or not. Nevertheless, a high level of sequence similarity is a strong indication of homology, implying a common divergent evolutionary relationship.
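The percent identity described above can be sketched for two sequences that are already aligned (equal length, with "-" marking gaps). The function name and the choice of the full alignment length as the denominator are assumptions made for illustration; other conventions (for example, dividing by the shorter sequence length) are also in use.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of aligned positions that carry the same residue.

    Assumes both strings come from the same alignment (equal length).
    A gap opposite a gap is not counted as an identity.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    common = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    return 100.0 * common / len(aligned_a)


# Artificial example mirroring the text: 40 identical positions out of 160.
seq_a = "A" * 40 + "C" * 120
seq_b = "A" * 40 + "G" * 120
print(percent_identity(seq_a, seq_b))  # 25.0
```

Note that the number says nothing about homology by itself; it is only the similarity value from which homology may be inferred.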

The sequence similarity analysis can be stated as follows: given two sequences, how does one find the best alignment that can be obtained by sliding one sequence along the other? A major complication arises due to insertions or gaps in the alignment of sequences. To prevent the accumulation of too many gaps in an alignment, introduction of a gap causes the deduction of a fixed amount (the gap score) from the alignment score. Extension of the gap to encompass additional nucleotides or amino acids is also penalized in the scoring of an alignment. Usually, gap penalties (the cost of inserting and extending gaps) are chosen to be length dependent. Typically, the cost of extending a gap (gap elongation) is 5-10 times lower than the cost of introducing a gap (gap open). The process of alignment can be measured in terms of the number and length of gaps introduced, and the number of mismatches remaining in the alignment. A matrix relating such parameters represents the distance between two sequences. Various methodologies, such as mutation matrices (scoring matrices), dotplots, global and local sequence alignments and other algorithms, are available to address the sequence alignment problem.
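A length-dependent gap penalty of the kind described above can be sketched as follows. The numeric defaults (`gap_open=10.0`, `gap_extend=1.0`) are illustrative assumptions chosen so that extension is ten times cheaper than opening; they are not values prescribed by the text.

```python
def gap_penalty(length, gap_open=10.0, gap_extend=1.0):
    """Penalty for a single gap spanning `length` positions.

    The first gapped position costs `gap_open`; every further position
    costs `gap_extend`. Both defaults are illustrative assumptions.
    """
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


# One gap of four positions is penalized less than four separate gaps,
# which discourages scattering many short gaps through an alignment.
print(gap_penalty(4))       # 13.0
print(4 * gap_penalty(1))   # 40.0
```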

12.1.1 Similarity/Distance Matrices

A sequence can be described in terms of the number of bits needed to specify its message. The correspondence between two aligned sequences can be expressed in terms of a similarity/identity score. Scoring penalties are introduced to minimize the number of gaps. The total alignment score is then a function of the identity between aligned residues and the gap penalties incurred. A compilation of the similarity scores in pair-wise alignment into a matrix is called a scoring matrix. Such matrices are constructed for

1. Evaluating match/mismatch between any two characters (residues).

2. A score for insertion/deletion.

3. Optimization of total score.

4. Evaluating the significance of the alignment.

12.1.2 Construction of Scoring Matrix

Scoring matrices implicitly represent a particular theory of evolution. Elements of a matrix specify the weight to be assigned to a given comparison (i) by the measure of similarity for replacing one residue with another (similarity matrix), or (ii) by the cost of the replacement (distance matrix). Similarity matrices are used for database searching, while distance matrices are naturally used for phylogenetic tree construction.


The distance score (D) is usually calculated as the number of mismatches in an alignment divided by the total number of matches and mismatches, ignoring gaps. It represents the fraction of positions that must change to convert one sequence into the other.

$$D = \frac{\text{Mismatches}}{\text{Matches} + \text{Mismatches}} \qquad (12.1)$$

Similarity (S) and distance (D) matrices are inter-convertible (S = 1 - D). An understanding of the theories underlying scoring matrices can aid in making proper choices.
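The distance score and its conversion to similarity can be sketched as below, following the verbal definition given above (mismatches divided by matches plus mismatches, with gapped columns ignored). The function name is hypothetical and the input is assumed to be a pair of already aligned sequences.

```python
def distance_and_similarity(aligned_a, aligned_b, gap="-"):
    """Return (D, S) for two aligned sequences, ignoring gapped columns.

    D = mismatches / (matches + mismatches) and S = 1 - D.
    """
    matches = 0
    mismatches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # gapped columns do not contribute to D
        if x == y:
            matches += 1
        else:
            mismatches += 1
    total = matches + mismatches
    if total == 0:
        raise ValueError("no ungapped columns to compare")
    d = mismatches / total
    return d, 1.0 - d


# Four matches, one mismatch, one gapped column (skipped).
d, s = distance_and_similarity("ACGT-A", "ACCTGA")
print(d, s)  # 0.2 0.8
```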

By determining the number of mutational changes by sequence alignment methods, a quantitative measure can be obtained of the distance between any pair of sequences. These values can be used to reconstruct a phylogenetic tree, which describes the relationship between the gene sequences. The more mutations required to change one sequence into the other, the more unrelated the sequences and the lower the probability that they share a recent common ancestor sequence. Conversely, the more alike a pair of sequences, the fewer the changes required to change one sequence into the other, and the greater the likelihood that they share a recent common ancestor sequence.

Distances between DNA sequences are relatively simple to compute as the sum between two sequences, that is, the least number of steps required to change one sequence into the other (D = X + Y). DNA-level distance is preferable in phylogenetic tree analysis because (i) the patterns of mutations, insertions and deletions at the nucleotide level are definitive and (ii) silent mutations at the DNA level do not result in an amino acid substitution (because of the redundancy of the genetic code). A simple matrix of the frequencies of the 12 possible types of replacement (each base can be replaced by any of the three other bases) can be used. Differences due to insertions/deletions are generally given a larger score than substitutions.

In the distance method, all possible pairs of sequences are aligned to determine which pairs are the most similar or closely related. The alignment provides a measure of the genetic distance between the sequences. The distance measurements are then used to predict the evolutionary relationship. A matrix of distance scores among all of the sequences is first made. A scoring matrix is a tool to quantify how well a certain model is represented in the alignment of two sequences. The less similar the sequences, the higher the distance score between them. The method is based on the assumption (Markov model) that proteins evolve through a succession of independent point mutations (a change from one state to another does not depend on the previous history of the state) that are accepted in a population and subsequently can be observed in the sequence pool. The degree of match between two letters (residues) can be represented in a matrix. The score is

$$\text{Score} = \frac{p_{ij}}{q_i \, q_j} \qquad (12.2)$$

($p_{ij}$ = probability that residue i is substituted by residue j; $q_i$ and $q_j$ = background probabilities for residues i and j, respectively).
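Equation (12.2) can be evaluated directly once the three probabilities are known. In the sketch below the probability values are made up for illustration, and the function names are hypothetical. The second helper takes the logarithm of the ratio, which is the form in which substitution matrices such as PAM and BLOSUM are usually tabulated; that step is an addition to the equation as stated above.

```python
import math


def odds_ratio(p_ij, q_i, q_j):
    """Score of Eq. (12.2): observed pair probability over chance expectation."""
    if q_i <= 0 or q_j <= 0:
        raise ValueError("background probabilities must be positive")
    return p_ij / (q_i * q_j)


def log_odds(p_ij, q_i, q_j, base=2.0):
    """Logarithm of the odds ratio (an extra step, not part of Eq. 12.2)."""
    return math.log(odds_ratio(p_ij, q_i, q_j), base)


# Hypothetical numbers: the pair is seen twice as often as chance predicts.
ratio = odds_ratio(p_ij=0.02, q_i=0.1, q_j=0.1)
print(round(ratio, 6), round(log_odds(0.02, 0.1, 0.1), 6))
```

A ratio above 1 (positive log-odds) means the substitution is observed more often than expected by chance; a ratio below 1 means it is observed less often.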

The simplest matrix in use is the identity (similarity) matrix. If two letters are the same they are given +1, and 0 if they are not the same. For DNA sequences, the identity matrix is given in Table 12.2.


Table 12.2 Identity Scoring Matrix for DNA sequences

     A   T   G   C
A    1   0   0   0
T    0   1   0   0
G    0   0   1   0
C    0   0   0   1

Replacement matrix, Rij = 1 (for i = j); Rij = 0 (for i ≠ j).

Nucleotide bases fall into two classes depending on the ring structure of the bases: two-ring purine bases (A and G) and single-ring pyrimidine bases (T and C). A mutation that conserves the ring number (A ↔ G; or T ↔ C) is called a transition, and a mutation that changes the ring number (purine ↔ pyrimidine) is called a transversion. Use of a transition/transversion matrix with weighted scores reduces noise in comparisons of distantly related sequences (Table 12.3).

Table 12.3 Transition/Transversion Scoring Matrix for DNA sequences

     A   T   G   C
A    0   5   1   5
T    5   0   5   1
G    1   5   0   5
C    5   1   5   0
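A transition/transversion matrix can be generated from the purine/pyrimidine classification given in the text (A ↔ G and T ↔ C are transitions; any purine ↔ pyrimidine change is a transversion). The sketch below uses the weights 0, 1 and 5 that appear in Table 12.3; the function names are hypothetical.

```python
PURINES = {"A", "G"}      # two-ring bases
PYRIMIDINES = {"T", "C"}  # single-ring bases


def substitution_type(x, y):
    """Classify a base replacement as identity, transition or transversion."""
    for base in (x, y):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"unknown base: {base!r}")
    if x == y:
        return "identity"
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def build_matrix(bases="ATGC", transition=1, transversion=5):
    """Distance matrix as a nested dict: matrix[x][y] is the cost of x -> y."""
    cost = {"identity": 0, "transition": transition, "transversion": transversion}
    return {x: {y: cost[substitution_type(x, y)] for y in bases} for x in bases}


matrix = build_matrix()
print("    " + "  ".join("ATGC"))
for x in "ATGC":
    print(x + "   " + "  ".join(str(matrix[x][y]) for y in "ATGC"))
```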

Distances between amino acid sequences are more difficult to calculate, because

1. Some amino acids can be changed by replacement of a single DNA base (single-point mutation), while replacement of other amino acids requires two or three base changes within the DNA sequence.

2. While conservative mutations of amino acids do not have much effect on structure and function, other replacements can be functionally lethal.

Since all point mutations arise from nucleotide changes, the probability that an observed amino acid pair is related by chance, rather than by inheritance, should depend only on the number of point mutations necessary to transform one codon into the other. A matrix resulting from this model would define the distance between two amino acids by the minimal number of nucleotide changes required (genetic code matrix). It may be more useful to compare the sequences at the purine (R)-pyrimidine level. To this can be added other physicochemical attributes of amino acids (hydrophobicity matrix, size and volume matrix etc.).
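The genetic code matrix described above can be sketched by comparing every codon of one amino acid with every codon of the other and keeping the smallest number of differing bases. The codon table below is the standard genetic code written in the usual compact form (bases ordered T, C, A, G; `*` marks stop codons); the function name `min_base_changes` is hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TTT, TTC, TTA, ... GGG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid (one-letter code) to the list of its codons.
CODONS = {}
for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS):
    CODONS.setdefault(amino_acid, []).append("".join(codon))


def min_base_changes(aa1, aa2):
    """Minimal number of nucleotide changes converting a codon of aa1 to aa2."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in CODONS[aa1]
        for c2 in CODONS[aa2]
    )


print(min_base_changes("F", "L"))  # 1  (e.g. TTT -> CTT)
print(min_base_changes("M", "W"))  # 2  (ATG -> TGG)
print(min_base_changes("D", "W"))  # 3  (GAT or GAC -> TGG)
```

Filling a 20 x 20 table with these values gives a distance matrix whose entries are 0, 1, 2 or 3.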

12.2 PAIR-WISE SEQUENCE ALIGNMENT

Pair-wise alignment is a fundamental process in sequence comparison analysis. Pair-wise alignment of two sequences (DNA or protein) is a relatively straightforward computational problem. In a pair-wise comparison, if gaps or local alignments are not considered (i.e. fixed-length sequences), the optimal alignment method can be tried, and the number of computations required for two sequences is roughly proportional to the square of the average length, as is the case in dotplot comparison. The problem becomes complicated, and not feasible by the optimal alignment method, when gaps and local alignments are considered.

That a program may align two sequences is not proof that a relationship exists between them. Statistical values are used to indicate the level of confidence that should be attached to an alignment. A maximum match between two sequences is defined to be the largest number of amino acids from one protein that can be matched with those of another protein, while allowing for all possible deletions. A penalty is introduced to provide a barrier to arbitrary gap insertion.

Dotplots, dynamic programming and word (k-tuple) methods are the common pair-wise alignment procedures.

12.2.1 Dotplot Analysis

Dotplot analysis is essentially a signal-to-noise graph, used in visual comparison of two sequences and to detect regions of close similarity between them (Fig. 12.2). The concept of similarity between two sequences can be discerned by dotplots. Two sequences are written along the x- and y-axes, and dots are plotted at all positions where identical residues are observed, that is, at the intersection of every row and column that has the same letter in both sequences. Within the dotplot, a diagonal unbroken stretch of dots indicates a region where the two sequences are identical. Two similar sequences will be characterized by a broken diagonal, the interrupted region indicating the location of sequence mismatch. A pair of distantly related sequences, with fewer similarities, has a noisier plot (Fig. 12.3). Isolated dots that are not on the diagonal represent random matches that are probably not related to any significant alignment.

Detection of matching regions may be improved by filtering out random matches in a dotplot. Filtering (overlapping, fixed-length windows etc.) can be used to place a dot only when a group of successive nucleic acid bases (10–15) or amino acid residues (2–3) match, to minimize noise.
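A dotplot with the window filter described above can be sketched in a few lines. This is a minimal illustration rather than the procedure of any particular program: a dot is placed at (i, j) only when the `window` residues starting at position i of one sequence are identical to those starting at position j of the other. The function names and the text rendering are assumptions.

```python
def dotplot(seq_x, seq_y, window=1):
    """Return the set of (i, j) positions where a window of residues matches."""
    if window < 1:
        raise ValueError("window must be at least 1")
    dots = set()
    for i in range(len(seq_x) - window + 1):
        for j in range(len(seq_y) - window + 1):
            if seq_x[i:i + window] == seq_y[j:j + window]:
                dots.add((i, j))
    return dots


def render(seq_x, seq_y, dots):
    """Draw the plot as text: seq_x across the top, seq_y down the side."""
    lines = ["  " + seq_x]
    for j, residue in enumerate(seq_y):
        row = "".join("*" if (i, j) in dots else "." for i in range(len(seq_x)))
        lines.append(residue + " " + row)
    return "\n".join(lines)


seq = "GATTACA"
print(render(seq, seq, dotplot(seq, seq, window=1)))  # diagonal plus noise
print()
print(render(seq, seq, dotplot(seq, seq, window=3)))  # noise filtered out
```

With `window=1` the self-comparison shows the main diagonal plus scattered off-diagonal dots from chance matches of single letters; with `window=3` only the diagonal remains.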

12.2.2 Dynamic Programming Methods

The best solution for the pair-wise sequence alignment problem seems to be an approach called dynamic programming. Dynamic programming methods assure the optimal global (Needleman-Wunsch method) or local (Smith-Waterman method) alignment by, in effect, considering all possible alignments and choosing the best. These methods allow the introduction of artificial gaps in aligned sequences to create an optimal alignment (Fig. 12.4). The divide-and-conquer principle is used extensively in dynamic programming: a problem that is too large to be computed directly is subdivided into smaller problems that can be computed efficiently, and the information is then assembled to give a solution to the large problem. Scores for each comparison are stored in a table, like a spreadsheet, inside the program. These individual scores are then used to build an alignment score, stepping through the table from beginning to end. A key part of alignment methods is the scoring method for insertions and deletions (gaps). Three scores are involved:

(i) The first is a comparison matrix, which gives a single score for every possible match and mismatch between two bases.

(ii) The second score is a penalty to be subtracted each time a gap is made in one sequence so that two other matching regions can be better aligned.

(iii) The third score is a penalty to be subtracted each time a gap is extended by another residue.


Fig. 12.4 Global and Local Sequence Alignments

12.2.2.1 Global Alignment

Global alignment is an alignment of two nucleic acid or protein sequences over their entire length. The Needleman-Wunsch algorithm (GAP program) is one of the methods to carry out pair-wise global alignment of sequences by comparing a pair of residues at a time. Comparisons are made from the smallest unit of significance, a pair of amino acids, one from each protein. All possible pairs are represented by a two-dimensional array (one sequence along the x-axis and the other along the y-axis), and pathways through the array represent all possible comparisons (every possible combination of match, mismatch, insertion and deletion). Statistical significance is determined by employing a scoring system: match = 1 and mismatch = 0 (or any other relative scores), and a penalty for a gap. Each cell in the matrix is examined, the maximum score along any path leading to the cell is added to its present contents, and the summation is continued. In this way the maximum match (maximum sequence similarity) pathway is constructed. The maximum match is the largest number that would result from summing the cell score values of every pathway, which is defined as the optimal alignment. Leaps to non-adjacent diagonal cells in the matrix indicate the need for gap insertion, to bring the sequences into register. Complete diagonals of the array contain no gaps.
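The table-filling and traceback steps described above can be sketched as follows. This is a simplified illustration of Needleman-Wunsch style global alignment, not the GAP program itself: it uses match = 1 and mismatch = 0 as in the text, and a single linear gap penalty of -1 per gapped position, which is an assumption made to keep the example short (no separate gap-open and gap-extend scores).

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap  # b[:j] aligned against gaps only

    def pair_score(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # align two residues
                score[i - 1][j] + gap,                   # gap in b
                score[i][j - 1] + gap,                   # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


total, row_a, row_b = needleman_wunsch("ACGT", "AGT")
print(total)
print(row_a)
print(row_b)
```

For the two short sequences in the example the result is a score of 2 with `ACGT` aligned over `A-GT`: three matches and one gapped position. When several alignments share the best score, the traceback order (diagonal first) decides which one is reported.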

The Needleman-Wunsch algorithm creates a global alignment. That is, it tries to take all the characters of one sequence and align them with all the characters of a second sequence. The Needleman-Wunsch algorithm works well for sequences that show similarity across most of their lengths. Globally optimal alignment is a difficult problem (biological sequences may have gaps and insertion sequences relative to each other). There are limitations to global alignment methods:

1. Global alignment algorithms are often not effective for highly diverged sequences and do not reflect the biological reality that two sequences may only share limited regions of conserved sequence.

2. The influence of global properties for local properties is not valid for all biological sequences.

3. Short and highly similar sub-sequences may be missed in the alignment because the rest of the sequence outweighs them.

12.2.2.2 Local Alignment

Local alignment is an alignment of some portion of two nucleic acid or protein sequences. The Smith-Waterman algorithm is a variation of the dynamic programming approach to generate


forms which occur in nature. The phrase also refers to an actual sequence, which approximates the theoretical consensus. A known conserved sequence set is represented by a consensus sequence. Commonly observed supersecondary protein structures are often formed by conserved sequences. Sequences are aligned optimally by bringing the greatest number of similar characters into register in the same column of the alignment, just as for the alignment of two sequences.

Table 12.4 (a) Multiple Sequence Alignment of a Highly Conserved Region of a Protein Family

Sequence     1    2    3    4    5    6    7

I            G    A    G    G    V    G    K
II           G    G    G    S    G    G    L
III          G    A    R    G    V    G    K
IV           G    A    S    G    V    G    K
V            G    G    A    G    V    G    K
VI           G    A    G    E    S    G    K
VII          G    G    G    G    S    G    F
VIII         G    A    C    G    V    G    K

Consensus    G   A/G   g    g    v    G    k

Table 12.4 (b) Multiple Sequence Alignment of a Disulfide-containing Protein Family

Sequence     1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16

1            G  C  K  Y  G  C  L  K  L  G  E  N  E  G  C  D
2            G  C  K  K  T  C  Y  K  L  G  E  N  D  F  C  N
3            N  C  K  Y  E  C  K  K  –  –  –  D  D  Y  C  N
4            G  C  K  V  W  C  V  I  N  –  –  N  E  E  C  G
5            D  C  V  Y  E  C  Y  N  P  K  G  –  S  Y  C  N
6            G  C  K  L  S  C  F  –  I  R  P  S  G  Y  C  G
7            G  C  T  V  S  C  G  T  –  –  –  –  –  –  C  –

Consensus    g  C  k  x  x  C  x  x  x  g  x  n  x  x  C  n

(Note: In a consensus sequence, if the same residue occurs throughout a column, an upper case letter is used; if a particular residue occurs most of the time, a lower case letter is used; if several residues are present in equal numbers, all are mentioned; and if the distribution in a column is random, the letter x is used. In disulfide-containing proteins, retention of S-S bridges is a priority, and accordingly cysteines are kept in register, with manual intervention if necessary.)
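The consensus rules in the note above can be sketched in a few lines of code. The note does not say how often a residue must occur to count as "most of the time", so the 50% threshold below is an assumption, as are the function name and the choice to ignore gap characters; with a different threshold some columns (for example column 2 of Table 12.4 (a)) would be reported differently from the printed table.

```python
from collections import Counter

def consensus(rows, threshold=0.5):
    """Derive a consensus string from equal-length aligned sequences."""
    n = len(rows)
    out = []
    for column in zip(*rows):
        counts = Counter(c for c in column if c != "-")  # gaps are ignored
        if not counts:
            out.append("-")
            continue
        top = max(counts.values())
        leaders = sorted(r for r, k in counts.items() if k == top)
        if top == n:
            out.append(leaders[0].upper())        # invariant column
        elif top / n >= threshold:
            if len(leaders) > 1:
                out.append("/".join(leaders))     # equally frequent residues
            else:
                out.append(leaders[0].lower())    # majority residue
        else:
            out.append("x")                       # no clear preference
    return " ".join(out)

# The eight sequences of Table 12.4 (a).
TABLE_A = ["GAGGVGK", "GGGSGGL", "GARGVGK", "GASGVGK",
           "GGAGVGK", "GAGESGK", "GGGGSGF", "GACGVGK"]
print(consensus(TABLE_A))
```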

Alignment of a large number of sequences by pair-wise dynamic programming is almost impossible, because the problem grows exponentially with the number of sequences involved. So, shortcut progressive methods, based on a heuristic approach, are available for multiple alignment of sequences. Most of the available multiple alignment programs (PILEUP, CLUSTAL etc.) use an incremental method that makes pair-wise alignments of the most related sequences, and then progressively adds less related sequences or groups of sequences to the aligned group. BLAST and FASTA are programs that very quickly find the “best” diagonal between a pair of sequences. Both BLAST and FASTA make use of amino acid substitution matrices, PAM250 or BLOSUM62, to score and assess pair-wise sequence alignments, and are particularly good at identifying matches with 25%–100% sequence similarity to the query sequence. PSI-BLAST can identify matches with 15%–25% sequence identity.

Once a multiple sequence alignment has been found, the number or types of changes in the aligned sequence residues may be used for phylogenetic analyses. BLAST is now a widely used sequence alignment tool for proteins and nucleic acids. FASTA is more sensitive than BLAST in detecting distantly related protein sequences.

• BLITZ : (http://www.ebi.ac.uk/searchs/blitz.html). Fast comparison of protein sequences against SWISS-PROT.

12.3.1 BLAST Suite of Algorithms

Basic Local Alignment Search Tool (BLAST) is from NCBI/GenBank (USA). It consists of a suite of algorithms that provide fast, accurate and sensitive database searching. BLOSUM62 is the default scoring matrix. BLAST works better on protein sequence databases. A general operational procedure is:

1. It takes each word from the query sequence, optionally filtered to remove low-complexity regions, and locates all similar words in the current test sequence. It initially throws away all database sequences that do not have a similar match.

2. If similar words are found (3 amino acids or 11 nucleotides), BLAST tries to expand the alignment to the adjacent words (gaps not allowed).

3. High-scoring segment pairs are generated. An HSP consists of two sequence fragments of arbitrary but equal length whose alignment is locally maximal and for which the alignment score is above the threshold score.

4. After all words are tested, a set of high-scoring segment pairs (HSPs) is chosen for that database sequence. Two sequences, a scoring system, and a threshold score define a set of HSPs.

5. Several non-overlapping HSPs may be combined in a statistical test to create a longer, more significant match.
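The word-matching and ungapped-extension steps above can be illustrated with a small sketch. It is a simplification, not the BLAST implementation: only exact word matches are used as seeds (BLAST also accepts similar words scored with a substitution matrix), scores are plain match/mismatch values, and the drop-off rule, threshold and function names are assumptions made for the example.

```python
def find_seeds(query, subject, w=3):
    """Return (query_pos, subject_pos) for every exact w-letter word match."""
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)
    return [(i, j)
            for i in range(len(query) - w + 1)
            for j in index.get(query[i:i + w], [])]


def extend_seed(query, subject, i, j, w=3, match=1, mismatch=-1, xdrop=2):
    """Extend one seed in both directions without gaps.

    Returns (score, query_start, subject_start, length) of the segment pair.
    """
    def walk(qi, sj, step):
        score = best = best_len = length = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score >= xdrop:
                break  # score fell too far below the best seen so far
            qi += step
            sj += step
        return best, best_len

    right, right_len = walk(i + w, j + w, +1)
    left, left_len = walk(i - 1, j - 1, -1)
    return (w * match + left + right, i - left_len, j - left_len,
            w + left_len + right_len)


def high_scoring_pairs(query, subject, w=3, threshold=4):
    """Keep the distinct extended segments whose score reaches the threshold."""
    hits = {extend_seed(query, subject, i, j, w) for i, j in find_seeds(query, subject, w)}
    return sorted(h for h in hits if h[0] >= threshold)
```

For the toy strings "XXHELLOYY" and "ZZZHELLOWW", all three seeds (HEL, ELL, LLO) extend to the same segment pair covering HELLO, which is why the result is collected in a set.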

The BLAST suite of programs includes:

• BLAST : Un-gapped BLAST. The program may miss the similarity if two sequences do not have a single highly conserved region.

• Gapped BLAST : Seeks only one of the un-gapped alignments that make up a significant match. Dynamic programming is used to extend a central pair of aligned residues in both directions to yield the final gapped alignment.

• PSI-BLAST : Position-Specific Iterated BLAST is a generalized BLAST algorithm that incorporates both pair-wise and multiple sequence alignment methods. It is used for the identification of weak sequence similarities. It uses a position-specific score matrix in place of the query sequence.


1. It takes as input a protein sequence and compares it to protein databanks, constructs a multiple alignment from a Gapped BLAST search, and generates a “profile” from any significant local alignments.

2. The profile is compared to the protein databases, again seeking the best possible local alignments. PSI-BLAST estimates the statistical significance of the local alignments found, using “significant” hits to extend the profile search until convergence.

• BLASTN : Compares the nucleotide query sequence against all nucleotide sequences in the non-redundant databases (DNA → DNA). Suited for high-scoring matches; not suited for distant relationship matching.

• BLASTP : Compares a protein query sequence against all protein sequences (gapped) in the non-redundant databases (Protein → Protein). Suited for finding homologies.

• BLASTX : The query nucleotide sequence is translated in all six reading frames (each frame gapped) and the conceptual translation products are compared against all protein sequences in the non-redundant databases (DNA (translated) → Protein). Suited for ESTs and for searching new DNA sequences for novel proteins.

• TBLASTN : Compares a protein query sequence against nucleotide sequence databases, dynamically translated in all six reading frames (each frame gapped) (Protein → DNA (translated)). Suited for finding ESTs and novel proteins.

• TBLASTX : Compares the six-frame translation of a nucleotide query sequence against the six-frame (ungapped) translation of nucleotide sequence databases (DNA (translated) → DNA (translated)). Suited for ESTs and gene structure annotations.

• BEAUTY : (http://dot.imgen.bcm.tmc.edu/seq-search/protein-search.html). BLAST Enhanced Alignment Utility that predicts the function of the protein being tested. It adds information on sequence family membership, the location of the conserved domains, and the locations of any annotated domains and sites directly into BLAST search results. These enhancements make it much easier to detect weak, but functionally significant, matches in BLAST database searches. The BLOCKS server offers a variety of BLAST searches that use as a query sequence a consensus sequence derived from multiple sequence alignment of a set of related proteins. The consensus sequence is called a cobbler sequence.

• BLAST-2 : (http://www.ch.embnet.org/software/frameBLAST.html). A newer release of BLAST that allows insertions or deletions in the aligned sequences. Gapped alignments may be more biologically significant. Synonymous with gapped BLAST. Compares two sequences against one another, employing BLASTN, BLASTP, BLASTX, TBLASTN and TBLASTX.

• PIR-BLAST : Provides general BLASTP against the entire non-redundant reference protein database (PIR-NREF).

• PhyloBLAST : Compares the query protein sequence to the SWISS-PROT and TrEMBL databases using WU-BLAST and then performs phylogenetic analysis.

The help manual is available on the Web at: (http://www.ncbi.nlm.nih.govt/BLAST/blast-help.html).
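The six-frame conceptual translation used by BLASTX, TBLASTN and TBLASTX, as described in the list above, can be sketched as follows. The code uses the standard genetic code; the function names and the use of "X" for codons containing unrecognized bases are assumptions made for the illustration.

```python
BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(dna[p:p + 3], "X")
                   for p in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return the three forward (+1..+3) and three reverse (-1..-3) translations."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames
```

Each of the six translated strings can then be compared with protein sequences in the same way as an ordinary protein query.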

12.3.2 FASTA Algorithm

FASTA is a program for rapid alignment of pairs of DNA or protein sequences. Rather than comparing individual residues in the two sequences, the FASTA algorithm is based on the idea of identifying short words (k-tuples) common to both sequences under comparison. In a dotplot, regions of similarity between two sequences show up as diagonals. Comparison of k-tuples between the two sequences can be viewed as focussing on diagonal matches in a dynamic programming matrix. FASTA calculates the sum of these dots along each diagonal in the following way.

1. Match identical words from each list and then create diagonals by joining adjacent matches (non-overlapping words).

2. Find the sum of identical words.

3. Rescale using a PAM matrix and retain the top-scoring segments.

4. Join segments using gaps, and eliminate other segments.

5. Use (Smith-Waterman) local dynamic programming to create an optimum alignment.
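The first two steps above, matching k-tuples and summing the hits on each diagonal, can be sketched as below. This covers only the word-matching stage; the rescoring, joining and dynamic programming steps are not shown. The function name and the convention that a diagonal is identified by (query position minus subject position) are assumptions made for the example.

```python
from collections import Counter

def diagonal_hits(query, subject, k=2):
    """Count shared k-tuples on each diagonal (query position - subject position)."""
    positions = {}
    for j in range(len(subject) - k + 1):
        positions.setdefault(subject[j:j + k], []).append(j)
    diagonals = Counter()
    for i in range(len(query) - k + 1):
        for j in positions.get(query[i:i + k], []):
            diagonals[i - j] += 1  # same offset means same dotplot diagonal
    return diagonals

hits = diagonal_hits("GATTACA", "TTGATTACA", k=3)
print(hits.most_common(1))  # the diagonal carrying the most k-tuple matches
```

A diagonal with many hits corresponds to a run of dots in the dotplot and is the region that later steps rescore and extend.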

The related algorithms, FASTX and FASTY, translate a query DNA sequence in all three forward reading frames and compare all three frames to a protein sequence database. TFASTX and TFASTY compare a query protein sequence to a DNA sequence database, translating each DNA sequence in all six reading frames. This is generally the best way to scan EST databases.

• FASTA : (http://www2.igh.cnrs.fr/bin/fasta-guess.cgi).

• FASTA3 : (http://www2.ebi.ac.uk/fasta3/).

• Octopus : It is a program for rapid interpretation of BLAST, BLAST-2, and FASTA output text files.

12.3.3 PILEUP Algorithm

The PILEUP algorithm estimates the best alignment for a group of sequences using a pair-wise approach. PILEUP uses a global alignment procedure (GCG GAP program).

1. First, similarity scores are calculated between all sequences to be aligned, and they are clustered into a tree structure by the UPGMA method.

2. Next, the most similar pairs of sequences are aligned and averages (similar to consensus sequences) are calculated for the aligned pairs.

3. The final multiple alignment is performed by a series of progressive, pair-wise alignments between sequences and clusters of sequences.


12.3.4 CLUSTAL Algorithm

The CLUSTAL algorithm uses a local alignment program (GCG BESTFIT program), which can be advantageous for aligning highly diverged sequences that share some regions of homology but are dissimilar in other regions. The CLUSTAL algorithm is based on the premise that similar sequences are likely to be evolutionarily related. Thus, the method aligns sequences in pairs, following the branching order of a phylogenetic tree of the family. Similar sequences are aligned first, and more distantly related sequences are added later. Once the pair-wise alignment scores have been calculated, they are used to cluster the sequences into groups.

Some of the advantages of CLUSTAL are:

1. A gap and its length are distinct quantities, and different weights are given to each.

2. Different weights are given to different types of mismatches. E.g. a transition (Leu - Val) is more probable than a transversion (Leu - Asp), and hence is treated with a different weight.

3. It can use a rapid alignment method (FASTA), or the slower and more accurate Smith-Waterman method.

4. It can add individual sequences to an existing alignment or align two groups of pre-aligned sequences with each other.

5. It can realign selected sequences or selected regions of the alignment, leaving the unselected portions of the alignment constant.

6. Secondary structure features, such as regions of hydrophobicity and proximity to other groups, are incorporated.
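The first advantage above, separate weights for the existence of a gap and for its length, is commonly expressed as an affine gap penalty. The sketch below is only an illustration of that idea; the numerical weights and the function name are hypothetical and are not CLUSTAL defaults.

```python
def gap_penalty(length, gap_open=10, gap_extend=1):
    """Affine penalty: one charge for opening a gap plus a charge per gapped position."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length

# One gap of three positions is cheaper than three separate one-position gaps.
print(gap_penalty(3), 3 * gap_penalty(1))
```

With such a scheme a single long insertion or deletion is penalized less than the same number of gapped positions scattered along the alignment.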

12.3.5 Strategies for Sequence Similarity Search

It is necessary to decide at the outset whether to search nucleic acid or protein databases. Whether to use a protein or a nucleic acid query sequence depends upon the biological information desired. If the sequence is a protein, or the gene sequence codes for a protein, then the search should almost always be performed at the protein level, because proteins, with a 20-letter alphabet, allow one to detect far more distant similarity than do nucleic acids, with a 4-letter alphabet.

• The initial search should be done with the heuristic algorithm programs (e.g. BLAST and FASTA).

• If the query sequence is an unknown sequence, then matching a gene fragment will probably not contribute much useful information.

• It is possible to automatically translate a DNA sequence into amino acid sequence in all six reading frames (BLASTX) and compare it to a protein sequence database; or to compare a protein sequence to the six reading frame translation of all DNA sequences (TFASTA and TBLASTN).

Some of the sequence similarity alignment servers/databases are:

• ALIGN : Applies the BLOSUM50 matrix to deduce the optimal alignment between two sequences.

• CINEMA : (http://www.biochem.ucl.ac.uk/bsm/dbbrowser/CINEMA2.1/). A Color Interactive Editor for Multiple Alignments for nucleic acids and proteins.

• CLUSTAL : Multiple sequence alignment tool based on clustering algorithms.

• Consensus : Takes the output of CLUSTAL or other multiple sequence alignment programs and calculates the consensus sequence.

• DCA : Multiple sequence alignment tool, better suited to distantly related sequences.

• DiAlign : Constructs pair-wise and multiple alignments by comparing whole segments of the sequences. Suitable for local similarity search.

• FASTA : Sequence similarity search tool.

• LALIGN : A tool for matching of two sequences.

• MULTALIN : (http://www.toulouse.inra.fr/multalin.html). A multiple sequence alignment server with hierarchical clustering algorithm. Sequence alignment is highlighted with color code for easy visualization.

• PSSM : Position-specific Scoring Matrix. Analysis of multiple sequence alignment for conserved blocks. It represents an alignment of sequences of the same length (no gaps). Sliding the matrix along the sequence one position at a time scores every possible sequence position. The amino acid substitution scores in each column of the PSSM are used to evaluate each sequence position.

• SAPS : Statistical Analysis of Protein Sequences, which includes analysis of amino acid composition, charge, hydrophobicity and transmembrane segments.

• T-COFFEE : Multiple sequence alignment tool.

• USC Server : Aligns two sequences with dynamic programming.

• VSNS : (http://www.techfak.uni-bielfeld.de/bed/Curric/MulAli/). An excellent, comprehensive resource for multiple sequence alignment, software and tutorials.
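The PSSM entry in the list above describes sliding a matrix along a sequence and scoring every position. A minimal sketch of that procedure follows. The log-odds form of the scores, the uniform background frequency, the pseudocount of 1 and the function names are all assumptions made for the illustration; real PSSMs use observed background frequencies and substitution-matrix-based weighting.

```python
import math
from collections import Counter

def build_pssm(block, alphabet, pseudocount=1.0):
    """Build one column of log2-odds scores per position of an ungapped block."""
    background = 1.0 / len(alphabet)  # assumed uniform background frequency
    n = len(block)
    pssm = []
    for column in zip(*block):
        counts = Counter(column)
        scores = {}
        for residue in alphabet:
            freq = (counts[residue] + pseudocount) / (n + pseudocount * len(alphabet))
            scores[residue] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def scan(pssm, sequence):
    """Slide the matrix along the sequence; return one score per start position."""
    width = len(pssm)
    return [sum(pssm[k][sequence[start + k]] for k in range(width))
            for start in range(len(sequence) - width + 1)]


pssm = build_pssm(["GAT", "GAT", "GCT"], alphabet="ACGT")
scores = scan(pssm, "TTGATCC")
print(scores.index(max(scores)))  # start position of the best-scoring window
```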

BLAST, FASTA and related programs are statistically based sequence similarity search (SIM) methods. Lately, alternative non-SIM-based bioinformatic methods are becoming popular. One such method is Data Mining Prediction (DMP), which is based on combining evidence from amino acid attributes, predicted structure and phylogenetic patterns, and uses a combination of Inductive Logic Programming data mining and decision trees to produce prediction rules for functional class. DMP predictions are more general than is possible using homology.

12.4 PHYLOGENETIC ANALYSIS

Phylogenetic analysis of a family of related sequences is a determination of how the family might have been derived during evolution. Placing the sequences as outer branches on a tree depicts the evolutionary relationships among the sequences. The branching relationships on the inner part of the tree then reflect the degree to which different sequences are related.

12.4.1 The Dayhoff Mutation Data Matrix

Once the evolutionary relationship of two sequences is established, the residues that did exchange are similar (conservative mutations). This is the underlying principle behind the Dayhoff mutation data matrix compilation. The Dayhoff mutation data matrix is based on the concept of the point accepted mutation (PAM). Proteins are organized into families based on the degree of sequence similarity. From aligned sequences, a phylogenetic tree is derived showing graphically which sequences are most related and therefore share a common branch on the tree. After the construction of the evolutionary trees, they are used with scoring matrices to evaluate the amino acid changes that occurred during evolution of the genes for the proteins in the organisms from which they originated. Subsequently, a set of tables (matrices) giving the percentage of amino acid mutations accepted by evolutionary selection, known as PAM tables, is determined. PAM tables show which amino acids are most conserved at the corresponding positions in two sequences during evolution. Steps in the construction of the mutation matrix are:

1. Align sequences that are at least 85% identical and determine pair exchange frequencies.

2. Compute frequencies of occurrence.

3. Compute relative mutabilities.

4. Compute a mutation probability matrix.

5. Compute evolutionary distance scale.

6. Calculate log-odds matrix.

1st Step: Pair-exchange frequencies

PAM (Point Accepted Mutation) is a unit of evolutionary distance between two amino acid sequences of closely related proteins. PAM1 = 1 accepted point mutation (no insertions or deletions) event per 100 amino acids. The PAM1 matrix can be multiplied by itself N times. PAM250 = 250 point mutations/100 amino acids (mutations occur multiple times at any given position, with identity score = 20%).

Tally replacements “accepted” by natural selection, in all pair-wise comparisons.

Aij = Number of times amino acid, j, is replaced by amino acid, i, in all comparisons.

If score > 0; functionally equivalent and/or easily inter-mutable.

If score < 0; two amino acids that are seldom inter-changeable.

2nd Step: Frequencies of occurrence, fi

$$ f_i = \frac{\text{observations of the } i\text{th amino acid}}{\text{observations of any amino acid}} \tag{12.3} $$

$$ \sum_i f_i = 1 \tag{12.4} $$

3rd Step: Relative mutabilities

The amino acids that do not mutate are to be taken into account. This is the relative mutability of the amino acid.

$$ m_j = f_j \times (\text{number of times amino acid } j \text{ is observed to change}) $$

4th Step: Mutation probability matrix, Mij

Mij = probability that amino acid, i, in row i of the matrix will replace an amino acid in column j.

$$ M_{ij} = \frac{m_j A_{ij}}{\sum_{i=1}^{20} A_{ij}} \tag{12.5} $$

$$ M_{ii} = 1 - m_i \tag{12.6} $$

The diagonal elements represent the probability that the amino acid will remain unchanged.

5th Step: The evolutionary distance scale

An evolutionary distance between two sequences is the number of point mutations that were necessary to evolve one sequence into the other (the distance is the minimum number of mutations). Since $M_{ii}$ represents the probabilities for amino acids to remain unchanged, scaling the mutabilities by a constant $\lambda$ gives the matrix the evolutionary distance of PAM1. The mutation probability is

$$ M_{ij} = \frac{\lambda\, m_j A_{ij}}{\sum_{i=1}^{20} A_{ij}} \tag{12.7} $$

$$ M_{ii} = 1 - \lambda\, m_i $$

In the framework of this model, a mutation probability matrix for any distance can be obtained by multiplying the PAM1 matrix by itself the required number of times.
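The construction of the mutation probability matrix and its extrapolation to larger distances can be illustrated on a toy alphabet. The sketch below follows equations (12.5) to (12.7) under stated assumptions: the diagonal of the count matrix is excluded from the column sum so that each column of M sums to 1, the three-letter alphabet, the counts and the mutabilities are invented for the example, and the function names are hypothetical.

```python
def mutation_matrix(A, m, lam=1.0):
    """M[i][j]: probability that residue j is replaced by residue i (columns sum to 1)."""
    n = len(A)
    M = [[0.0] * n for _ in range(n)]
    for j in range(n):
        column_total = sum(A[i][j] for i in range(n) if i != j)
        for i in range(n):
            if i == j:
                M[i][j] = 1.0 - lam * m[j]               # residue j stays unchanged
            else:
                M[i][j] = lam * m[j] * A[i][j] / column_total
    return M


def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matrix_power(M, times):
    """Multiply M by itself the required number of times (PAM1 -> PAM-N)."""
    result = [[float(i == j) for j in range(len(M))] for i in range(len(M))]
    for _ in range(times):
        result = matmul(result, M)
    return result


# Toy data for a three-letter alphabet (illustrative values only).
A = [[0, 3, 1],
     [3, 0, 2],
     [1, 2, 0]]
m = [0.01, 0.02, 0.03]
M1 = mutation_matrix(A, m)
M250 = matrix_power(M1, 250)
```

After many multiplications the diagonal entries shrink, reflecting the growing chance that a position has changed at least once.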

6th Step: Log-odds matrix

The probability that some event is observed by random chance, $p_i^{\mathrm{ran}}$, is

$$ p_i^{\mathrm{ran}} = f_i $$

The relatedness is

$$ R_{ij} = \frac{M_{ij}}{f_i} \tag{12.8} $$

The log-odds matrix, $S_{ij}$, is the log-odds ratio of two probabilities: the probability that two amino acid residues are aligned by evolutionary descent to the probability that they are aligned by chance.

$$ S_{ij} = \log (R_{ij}) \tag{12.9} $$
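Equations (12.8) and (12.9) can be applied directly to a mutation probability matrix. The sketch below uses a two-letter toy alphabet; the matrix values, the frequencies, the base-10 logarithm and the function name are assumptions made for the example (the text does not state which base is used).

```python
import math

def log_odds(M, f, base=10):
    """S[i][j] = log(M[i][j] / f[i]): observed replacement versus chance."""
    n = len(M)
    return [[math.log(M[i][j] / f[i], base) for j in range(n)] for i in range(n)]

# Toy two-letter example (illustrative values only).
M = [[0.9, 0.2],
     [0.1, 0.8]]
f = [0.6, 0.4]
S = log_odds(M, f)
```

Pairs that occur more often than chance (R > 1) receive positive scores, and pairs that occur less often than chance receive negative scores.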

12.4.1.1 Limitations of the PAM Model

The PAM model is built on assumptions that are imperfect.

1. The Markov model, in which replacement at any site depends only on the amino acid at that site and the probability given by the table, is an imperfect representation of evolution. Replacement is not equally probable over the entire sequence (e.g. local conserved sequences).

2. The assumption that each amino acid position is equally mutable is incorrect. Sites vary considerably in their degree of mutability.

3. Many sequences depart from average amino acid composition.

4. Errors in PAM1 are magnified in extrapolation to PAM250.

5. The model is devised using the most mutable positions rather than the most conserved positions, which reflect chemical and structural properties of importance.

6. Distant relationships can only be inferred.

12.4.2 Blocks Model

Other scoring matrix models are all based on the basic concepts of Dayhoff. In the Blocks substitution matrix (BLOSUM) method, the starting data are conserved blocks, aligned in order to represent distant relationships more explicitly. In this method, the sequences of the individual proteins in each of the families are aligned in the regions defined by the blocks. Each column in the aligned sequences then provides a set of possible amino acid substitutions. The types of substitutions are then scored for all aligned patterns in the database and used to prepare a scoring matrix, the “BLOSUM” matrix, indicating the frequency of each type of substitution. More common (conservative) substitutions should represent a closer relationship between two amino acids in related proteins, and thus receive a more favorable score in sequence alignment. Conversely, radical substitutions should be less favored. Patterns of different identities are grouped in different groups: 60% identical patterns are grouped under one substitution matrix, BLOSUM60, those 80% alike under BLOSUM80, and so on. BLOSUM matrix values are given as log-odds scores of the ratio of the observed frequency of amino acid substitution divided by the frequency expected by chance. While the PAM matrix is designed to track evolutionary origins of proteins, the BLOSUM model is designed to find their conserved domains. The better reliability of the blocks method is due to the following:

1. Many sequences from aligned families are used to generate matrices.

2. Any potential bias introduced by counting multiple contributions from identical residue pairs is removed by clustering sequence segments on the basis of minimum percentage identity.

3. Clusters are treated as single sequences (BLOSUM60; BLOSUM80 etc.).

4. Log-odds matrix is calculated from the frequencies, Aij, of observing residue, i, in one cluster aligned against residue, j, in another cluster.

5. Derived from data representing highly conserved sequence segments from divergent proteins rather than data based on very similar sequences (as is the case with PAM matrices).

6. Detects distant similarities more reliably than Dayhoff matrices.

• ALIGN : (http://www2.igh.cnrs.fr/bin/align-guess.cgi). Applies the BLOSUM50 matrix to deduce the optimal alignment between two sequences.

• BLOCKS : Multiple alignment of ungapped segments corresponding to the most highly conserved regions of proteins.
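The log-odds calculation behind a BLOSUM-type matrix can be sketched for a single ungapped block. This is a simplified illustration: the clustering of similar sequences described in points 2 and 3 above is omitted, scores are left unrounded in half-bit units, and the toy block and the function name are assumptions made for the example.

```python
import math
from collections import Counter
from itertools import combinations

def blosum_like_scores(block):
    """Return log-odds scores for residue pairs observed in an ungapped block."""
    pairs = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pairs[tuple(sorted((a, b)))] += 1
    total = sum(pairs.values())
    q = {pair: count / total for pair, count in pairs.items()}  # observed pair frequencies

    # Frequency of each residue implied by the pair frequencies.
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    scores = {}
    for (a, b), freq in q.items():
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]  # chance expectation
        scores[(a, b)] = 2 * math.log2(freq / expected)        # half-bit units
    return scores

# Toy block: four aligned two-residue segments (illustrative only).
scores = blosum_like_scores(["AS", "AS", "AS", "AA"])
```

In this toy block the identical pairs score positive and the A/S substitution scores negative, because it is seen less often than its composition would predict.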

12.4.3 Clustering Algorithms

Clustering, or grouping, procedures for large data sets are a set of statistical methods (a major subgroup of numerical analysis), based on similarity criteria for appropriately scaled variables that represent the data of interest. Sequence clustering algorithms take a large number of sequences and subdivide them into clusters, based on the extent of shared sequence identity in a minimum overlap region. These algorithms use evolutionary distances to build phylogenetic trees. The tree construction is based solely on the relative number of similarities and differences between a set of sequences. Cluster methods construct a tree by linking the least distant pairs of taxa, followed by successively more distant taxa. The Kohonen self-organizing method is one such clustering algorithm, based on neural networks, for construction of phylogenetic trees from sequence information.

Clustering algorithms can also be used for cluster analysis in other problems, such as clustering of ESTs by sequence similarity to known genes. This allows each predicted gene to be compared against an array of EST sequences, enabling more effective annotation.

12.4.4 Distance Method

The distance method calculates the pair-wise distances between a group of sequences to produce a phylogenetic tree of the group by the program GROWTREE, based either on UPGMA or the neighbor-joining method. The sequence pairs that have the smallest number of sequence changes between them are termed “neighbors”. On a tree, these sequences share a common node or a common ancestral position and are each joined to that node by a branch. Finding the closest neighbors among a group of sequences by the distance method is often the first step in producing a multiple sequence alignment. CLUSTALW uses the neighbor-joining distance method as a guide to multiple sequence alignment. The score between two sequences is the number of mismatched positions in the alignment or the number of sequence positions that must be changed to generate the other sequence. A general approach is:

1. Find the most closely related sequences A and B.

2. Treat the rest of the sequences as a single composite sequence.

3. Calculate the average distance from A to all other sequences, and B to all other sequences.

4. Use these values to calculate distances (a and b).

5. Now treat A and B as a single composite sequence AB, calculate the average distance between AB and each of the other sequences, and make a new distance table.

6. Repeat the steps.
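The repeated "join the closest pair, then average the distances" cycle above can be sketched with UPGMA, the simpler of the two clustering methods named in this section. The input format (a dictionary keyed by pairs of leaf names), the toy distances and the function name are assumptions made for the illustration; the branch-length calculation of step 4 is reduced here to the node height (half the distance between the joined clusters).

```python
def upgma(distances):
    """Cluster leaves by UPGMA.

    distances maps frozenset({a, b}) -> distance between leaves a and b.
    Returns a list of (cluster_a, cluster_b, node_height) in merge order.
    """
    d = dict(distances)
    size = {leaf: 1 for pair in d for leaf in pair}
    merges = []
    while len(size) > 1:
        pair = min(d, key=d.get)                 # the two closest clusters
        a, b = sorted(pair, key=str)
        merged = (a, b)
        merges.append((a, b, d[pair] / 2))
        for other in size:
            if other == a or other == b:
                continue
            da = d[frozenset((a, other))]
            db = d[frozenset((b, other))]
            # Size-weighted average keeps every original leaf equally weighted.
            d[frozenset((merged, other))] = (size[a] * da + size[b] * db) / (size[a] + size[b])
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        size[merged] = size.pop(a) + size.pop(b)
    return merges


toy = {frozenset(p): v for p, v in {
    ("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
    ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}.items()}
tree = upgma(toy)
```

For the toy distances, A and B are joined first, C is added next and D last, giving the nested grouping (((A, B), C), D).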

Some of the phylogenetic programs are:

• Phylip : (http://evolution.genetics.washington.edu/Phylip.html). Phylogenetic inference package is a collection of many programs: UPGMA, Parsimony, Neighbor-joining, Maximum likelihood algorithms.

• PhyloBLAST : Compares the query protein to the SWISS-PROT/TrEMBL database and carries out phylogenetic analysis.

• PILEUP : PILEUP uses UPGMA to create its dendrogram of DNA sequences, and then uses this dendrogram to guide its multiple alignment algorithm.

• TreeBase : For graphical representation of phylogenetic trees.

• Tree Gen : Tree generation from distance data.

• TreeView : (http://taxonomy.zoology.gla.ac.uk/rod/treeview.html). For graphical representation of phylogenetic trees.


• UPGMA : Unweighted Pair Group Method with Arithmetic mean is a clustering algorithm using arithmetic averages. It calculates branch lengths between the most closely related sequences, and then averages the distance between this pair and the other sequences until all the sequences are included in the tree.

These similarity/distance matrix comparison methods and other statistical algorithms are based on phenotypic similarities of the species, without taking into account the evolutionary history that brought the species to the current phenotypes. Computer algorithms based on the phenotypic models rely heavily on sequence data in calculating evolutionary distances.

12.4.5 Cladistic Methods

Cladistic methods of phylogenetic analysis rely on current data as well as knowledge of ancestral relationships. They are based on the explicit assumption that sets of sequences (proteins) have evolved from a common ancestor by a process of mutation and selection without mixing. Evolutionary trees reconstructed via these methods are called ‘cladograms’. Computer algorithms based on the cladistic model generally rely on PARSIMONY or maximum likelihood methods for the calculation of relationships and building trees. Parsimony uses position-specific information in a multiple sequence alignment. The maximum likelihood method takes into account every sequence, every sequence change, and a specific model of sequence evolution.

• PARSIMONY : PARSIMONY is the most popular algorithm for constructing ancestral relationships. It allows the use of all known evolutionary information in tree building. It involves evaluating all possible trees and giving each tree a score based on the number of evolutionary changes that are needed to explain the observed data. For each aligned position (vertical column in the multiple sequence alignment), phylogenetic trees that require the smallest number of evolutionary changes to produce the observed sequence changes are identified. The most parsimonious tree is the one that requires the fewest evolutionary changes for all sequences to derive from a common ancestor. This method is best suited to sequences that are similar and to small numbers of sequences.

• Maximum Likelihood Method : This algorithm attempts to reconstruct a phylogenetic tree using an explicit model of evolution: let all sites be selectively neutral and let them mutate spontaneously at a constant rate per gamete per generation.

• PAUP : (http://onyx.si.edu/PAUP). Phylogenetic Analysis Using Parsimony. Starting with a set of aligned sequences, PAUP can search for phylogenetic trees that are optimal according to parsimony, distance, or maximum likelihood criteria using heuristic, branch-and-bound or exhaustive tree searching algorithms.

• PAUPSEARCH : Calculates trees.

• PAUPDISPLAY : Produces graphical version of PAUPSEARCH tree files.
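The parsimony criterion described above, counting the evolutionary changes a tree needs for each alignment column, can be illustrated with Fitch's small-parsimony counting procedure. This is one standard way to score a given tree and is offered as a sketch only; the text does not say which procedure the listed programs use. The tree representation (nested pairs of leaf names), the toy alignment and the function names are assumptions made for the example.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for one alignment column.

    tree is a leaf name or a pair (left_subtree, right_subtree);
    states maps each leaf name to its character in this column.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_states, left_changes = fitch(tree[0], states)
    right_states, right_changes = fitch(tree[1], states)
    shared = left_states & right_states
    if shared:
        return shared, left_changes + right_changes
    # No state in common: one change is needed on a branch below this node.
    return left_states | right_states, left_changes + right_changes + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all columns of the alignment."""
    length = len(next(iter(alignment.values())))
    return sum(fitch(tree, {name: seq[col] for name, seq in alignment.items()})[1]
               for col in range(length))


alignment = {"A": "GA", "B": "GA", "C": "AA", "D": "AT"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(parsimony_score(tree_1, alignment), parsimony_score(tree_2, alignment))
```

The tree with the lower total is the more parsimonious explanation of the toy data; a full search would score every candidate tree in this way.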


boring amino acids. This method examines a sequence window of ~13–17 residues and assumes a central amino acid in the window will adopt a conformation that is determined by the side chains of all the amino acids in the window. Amino acid segments with the highest local scores for a particular secondary structure are assigned to that structure. However, secondary structure does not depend only on the conformational preferences of individual amino acids. Distant interactions within the amino acid sequence may influence local secondary structure. Vector methods provide a fast and reliable way to align structures. It is a much simpler computational problem to compare vector representations of secondary structures than to compare positions of all Cα or Cβ atoms in those structures.

The most sophisticated methods for secondary structure prediction use neural network algorithms. In the neural network approach, computer programs are trained to recognize amino acid sequence patterns that are located in known secondary structures and to distinguish these patterns from other patterns not located in these structures.

Once individually aligning sets of secondary structural elements have been identified, they are clustered into large alignment groups. This clustering generates a large number of possible groups of secondary structural elements from which the most likely ones must be selected. One of the methods is to align the atomic coordinates of a helix (or a sheet) in one protein with those of the matched helix (sheet) in the second structure and calculate the root mean square deviation. Some protein secondary structure prediction programs are:

• BCM Search Launcher : Provides access to a large collection of secondary structure prediction tools.
• CoDe : Secondary structure consensus prediction.
• DAS : Transmembrane prediction server (Sweden).
• DSC : (http://www.bmm.icnet.uk/dsc/). Linear discrimination secondary structure prediction server.
• JPRED (EBI, UK) : Consensus method of secondary structure prediction server, based upon PHD, PREDATOR, DSC, ZPRED and MulPred programs.
• PHDsec : Prediction of secondary structure (EMBL, Germany).
• PHDhtm : Transmembrane helix location prediction and topology.
• PREDATOR : (http://www.embl-heidelberg.de/). Program that can predict the secondary structure of a single sequence or of a number of related sequences. It is based on an analysis of amino acid patterns in structures that form hydrogen bond interactions between adjacent β-strands and between n and n + 4 residues in α-helices.
• PROFsec : Secondary structure prediction server.
• ProScale : Predicts hydrophobicity, α-helix, β-sheet and other parameters.
• PSA (Boston, USA) : Protein sequence analysis database with a protein secondary structure prediction server. Predicts probable secondary structures and folding classes for a given amino acid sequence.
• PSI-Pred (UK) : PSI-BLAST profiles to predict secondary structure, transmembrane topology and fold recognition.


• ReDe (USA) : Transmembrane prediction server.
• SAPS : Statistical Analysis of Protein Sequences: analysis of amino acid composition, charge, hydrophobicity, transmembrane regions and other parameters.
• SOSUI : Classification and secondary structure prediction of membrane proteins.
• SSCP : Prediction of content of helix, strand and coil for a query protein using the amino acid composition as the only input information.
• SSPRED : A three-state secondary structure prediction routine based on the SWISS-PROT database.
• TMHMM : Prediction of transmembrane helices in proteins (Denmark).
• TMPred : Prediction of transmembrane regions and orientation (Switzerland).
• tRNA Scan-SE : Provides cloverleaf diagram of the tRNA molecules.
• VAST : Statistical method similar to BLAST. VAST score is the number of superimposable secondary structural elements found in comparing two structures.

Secondary structure prediction can be carried out with statistical predictive methods, with manual intervention wherever necessary. The principle behind manual intervention methods is to look for patterns of residue conservation that are indicative of secondary structures.

12.5.1.1 α-Helices

The α-helix has a periodicity of 3.6 residues per turn. So, in α-helices with one face buried in the protein core and the other exposed to solvent, the residues at positions i, i + 3, i + 4 and i + 7 (where i is a residue in the helix) will lie on one face of the helix. Thus, patterns showing such conservation are indicative of α-helical regions.
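The geometry behind this rule can be checked with a short calculation. The sketch below assumes an ideal helix (3.6 residues per turn, hence 100° of rotation per residue) and an arbitrary cut-off of ±60° for "the same face"; the function names and the cut-off are illustrative choices, not part of any published method.

```python
# Angular position of helix residues on a helical wheel.
# Assumption: ideal alpha-helix with 3.6 residues per turn (100 degrees per residue).
RESIDUES_PER_TURN = 3.6
DEGREES_PER_RESIDUE = 360.0 / RESIDUES_PER_TURN


def wheel_angle(offset):
    """Angle in degrees (range -180..180) of residue i+offset relative to residue i."""
    angle = (offset * DEGREES_PER_RESIDUE) % 360.0
    return angle - 360.0 if angle > 180.0 else angle


def same_face(offsets, half_width=60.0):
    """True if every offset lies within +/- half_width degrees of residue i.

    half_width is an illustrative cut-off, not a standard value.
    """
    return all(abs(wheel_angle(k)) <= half_width + 1e-9 for k in offsets)


for k in range(8):
    print(f"i+{k}: {wheel_angle(k):7.1f} degrees")

print(same_face([0, 3, 4, 7]))  # offsets named in the text
print(same_face([0, 2, 5]))     # offsets that fall on other faces
```

Under these assumptions the offsets 3, 4 and 7 land at -60°, +40° and -20° relative to residue i, which is why they cluster on one side of the helix, whereas offsets 2 and 5 land on the opposite side.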

12.5.1.2 β-Strands

β-Strands that are half buried in the protein core will tend to have hydrophobic residues at positions i, i + 2, i + 4, i + 6, etc., and polar residues at positions i + 1, i + 3, i + 5, etc.
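A simple way to look for this alternation is to slide a window along the sequence and count how many positions fit the pattern. The following is a minimal sketch; the set of residues treated as hydrophobic, the window width and the function names are assumptions made for illustration only.

```python
# Assumption: residues treated as hydrophobic for this illustration.
HYDROPHOBIC = set("AVLIMFWC")


def alternation_score(segment):
    """Fraction of positions fitting: hydrophobic at even offsets, non-hydrophobic at odd offsets."""
    if not segment:
        return 0.0
    hits = 0
    for offset, residue in enumerate(segment):
        is_hydrophobic = residue in HYDROPHOBIC
        if is_hydrophobic == (offset % 2 == 0):
            hits += 1
    return hits / len(segment)


def best_strand_window(sequence, width=6):
    """Return (start, score) of the first window with the highest alternation score."""
    best = (0, 0.0)
    for start in range(len(sequence) - width + 1):
        score = alternation_score(sequence[start:start + width])
        if score > best[1]:
            best = (start, score)
    return best


# Hypothetical test sequence: an alternating segment flanked by glycines.
print(best_strand_window("GGVTVSIKGG"))
```

A high score only flags a candidate region; as the text notes, such patterns are indicative rather than conclusive.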

12.5.1.3 Reverse Turns

As reverse turns exhibit polar character, they are usually found at the molecular surface regions of proteins. Hydrophobic residues are found in regions adjacent to the turns. A pattern of occurrence of helix- and sheet-breakers and hydrophobic residues is a sign of the presence of reverse turns. Prediction of loop regions based on sequence alone is difficult, because loop regions vary in length, sequence and conformation. A better approach is to align the available tertiary structures and use distance geometry to obtain various classes of loop conformations, followed by choosing the best conformer on the basis of an ‘energy minimization’ procedure.

12.6 MOTIFS, DOMAINS AND PROFILES

Search and analysis for structural motifs and domains are part of pattern recognition protocols. A motif is an aggregation of secondary structural elements. A protein may contain a single motif or multiple motifs. A structural domain is a segment of polypeptide chain that can fold into a spatially separable entity in globular proteins. Profiles encompass full domain alignments by defining which residues are allowed at given positions in the sequence, which positions are highly conserved and which positions or regions tolerate insertions.

12.6.1 Motifs

The function of a protein is in large part a consequence of regions of local structural elements in the amino acid sequence, called motifs. Motifs are components of a more fundamental unit of structure and function, namely, the protein module. Proteins may have modules corresponding to different units of function, and these modules may be present in different order. Structural motifs, once identified in one protein structure, can be used as templates to search the entire database of protein structures.

One approach to pattern recognition is to characterize a family by means of a single conserved motif reduced to a consensus pattern (e.g. E-F hand, helix-turn-helix, and zinc-finger motifs). The motif search programs ignore all but invariant positions in an alignment and describe only the key residues that are conserved and define the family. For example, H–[FW]–X–[LIVM]–X–G–X(5)–[LV]–H–X(3)–[DE] describes the motif found in a family of DNA-binding proteins. An element of tolerance is introduced to motif search by grouping amino acids according to their physicochemical properties.
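A pattern written in this notation maps naturally onto a regular expression. The sketch below is a minimal translator, assuming the usual PROSITE-style conventions (X for any residue, square brackets for allowed residues, curly braces for excluded residues, a count in parentheses for repeats) and plain hyphens as separators; the function name and the test sequence are hypothetical.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression string."""
    parts = []
    for element in pattern.strip("-").split("-"):
        element = element.strip()
        if not element:
            continue
        # Split off an optional repeat count such as (5) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.group(1), match.group(2), match.group(3)
        if core.upper() == "X":
            regex = "."                      # any residue
        elif core.startswith("[") and core.endswith("]"):
            regex = core                     # one of the listed residues
        elif core.startswith("{") and core.endswith("}"):
            regex = "[^" + core[1:-1] + "]"  # any residue except those listed
        else:
            regex = re.escape(core)          # a single required residue
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)


pattern = "H-[FW]-X-[LIVM]-X-G-X(5)-[LV]-H-X(3)-[DE]"
regex = prosite_to_regex(pattern)
print(regex)

# Hypothetical sequence built to contain one occurrence of the motif.
example = "MK" + "HFALAG" + "A" * 5 + "LH" + "A" * 3 + "D" + "GG"
hit = re.search(regex, example)
print(hit.start(), hit.group())
```

Because only the key positions are constrained, such a pattern either matches or does not; it carries no score, which is the limitation that fingerprints and profiles address.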

There are proteins that have multiple motifs. These motifs (conserved regions) can be used to create a “fingerprint”, so that in a database search there is a better chance of identifying a distant relative. One approach is to excise groups of motifs from alignments and convert the sequence information they contain into unweighted scoring matrices, which together form the fingerprint.

The BLOCKS algorithm, which searches for conserved amino acids in a family of proteins, is an alternative method of multiple motif search. In this method, each cluster is treated as a single segment, each with a score that gives a measure of its relatedness. Blocks within a family are converted to position-specific substitution matrices (PSSMs), which are used to make database searches.
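The conversion of a block into a position-specific matrix can be sketched as follows. This is a simplified illustration, not the BLOCKS implementation: it assumes uniform background frequencies, a pseudocount of 1 per residue and no sequence weighting, and the block and query sequence are hypothetical.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(block, pseudocount=1.0):
    """Build per-column log-odds scores from an ungapped block of equal-length sequences."""
    # Assumption: uniform background frequencies; real searches use observed ones.
    background = 1.0 / len(ALPHABET)
    length = len(block[0])
    if any(len(seq) != length for seq in block):
        raise ValueError("all sequences in a block must have the same length")
    pssm = []
    for col in range(length):
        counts = {aa: pseudocount for aa in ALPHABET}
        for seq in block:
            counts[seq[col]] += 1.0
        total = sum(counts.values())
        pssm.append({aa: math.log2((counts[aa] / total) / background) for aa in ALPHABET})
    return pssm


def score_segment(pssm, segment):
    """Sum of column scores for a segment of the same width as the matrix."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


def best_hit(pssm, sequence):
    """Return (score, start) of the highest-scoring ungapped window in the sequence."""
    width = len(pssm)
    return max(
        (score_segment(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


block = ["HFAL", "HWAL", "HFGL"]  # hypothetical block of three aligned segments
pssm = build_pssm(block)
print(best_hit(pssm, "GGGHFALGGG"))
```

Unlike a consensus pattern, the matrix gives every residue a score at every position, so near-matches are ranked rather than rejected.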

• BLOCKS : (http://www.blocks.org). Search of motifs and protein classification. Motifs or blocks are created automatically, by detecting the most highly conserved regions of each protein family (USA).
• BlockMaker : Finds conserved blocks in a group of two or more unaligned, related protein sequences (USA).
• COILSCAN : Identifies coiled-coil regions.
• Gibbs Motif Sampler : Identification of conserved motifs in nucleic acid or protein sequences. It searches for the statistically most probable motifs and can find the optimal width and the number of these motifs in each sequence.
• HTHSCAN : Helix-turn-helix motifs scan.
• HSSP : Database of homology-derived secondary structure of proteins.
• Leucine Zippers : Searches for leucine-zipper motifs (Germany).
• MAST : Motif Alignment and Search Tool for searching sequence databases for sequences that contain one or more groups of motifs.


• MEME : Multiple EM for Motif Elicitation and search tool. MEME locates one or more ungapped patterns in sequences. A search is conducted for a range of possible motif widths, and the most likely width for each profile is chosen after one iteration of the EM algorithm. The EM then iterates to find the best EM estimate for the width.
• MOTIF : Meta site motifs search server (Japan).
• Multi Coil : Predicts the location of coiled-coil regions in amino acid sequences.
• PRINTS : (http://www.biochem.ucl.ac.uk/bsm/dbbrowser/PRINTS). Search of motifs and protein classification. Motifs are encoded as ungapped, unweighted local alignments.
• PROSITE : (http://expasy.ch/sprot/prosite.html). Secondary structure/domain database from SWISS-PROT. It is based on highly conserved residues in a protein family. It contains a comprehensive list of documented protein domains. It is the best starting point for motif search. Motifs in proteins are encoded as patterns, which are good at identifying enzyme classes by their active site motif.

BLOCKS and PRINTS are two motif databases that represent protein or domain families by one or more ungapped multiple alignment fragments.

12.6.2 Domains

Globular proteins exhibit domain structure, a domain being an independently folding unit within the protein. Large proteins (> 50,000 daltons) tend to fold into structural domains. Structural domains are contiguous stretches of 100-150 amino acid residues that have a globular fold. They cannot be divided into smaller units, and they represent fundamental building blocks that can be used to understand the function and evolution of proteins. Single or several structural domains form functional domains with functions (and evolutionary significance) that are distinct from other parts of a protein.

As protein domains are often the evolutionarily conserved fragments of proteins, at both the sequence and structural levels, there are many advantages to organizing databases by domain classification to enhance protein structure prediction and modeling procedures. Fold comparison allows detection of many distant relationships and extends protein families. Similarity of the protein fold goes beyond divergence of the amino acid sequence. If one can identify, through sequence analysis, the location or presence of domains, it is often possible to gain greater insight, not only into the probable function but also into the evolutionary history of that protein. Long stretches of repeated amino acid residues, particularly Pro, Gln, Ser and Thr, often indicate linker sequences and are usually a good place to split a protein into domains. Internal protein domains can indicate whether the protein is likely to be involved in signaling or to be a transmembrane protein. Transmembrane segments are also very good dividing points, since they can easily separate extracellular from intracellular domains and consist of ordered segments (e.g. α-helices) in the intracellular domains (Fig. 12.9). Profiles, hidden Markov models and other profile search algorithms are used to search domain databases.


Fig. 12.9 Structure of Integral Membrane Light Harvesting Protein Complex with Multiple α-Helices

(Ref: Prince, S.M., et al. (1997) J Mol Biol., 268; 412) (Source: Protein Data Bank: 1KZU.pdb)

Interacting pairs of proteins co-evolve to maintain functional and structural complementarity. Consequently, such a pair of protein families shows similarity between their phylogenetic trees. Evaluation of the degree of co-evolution of family pairs by a global protein structural interactome map (PSIMAP, a map of all the structural domain–domain interactions in the PDB) would improve the accuracy of prediction based on ‘homologous interaction’.

Some database sites and programs for domain search are:

• CDD-Search : (http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml). Conserved domain database search, collected from Pfam and Search.
• INTERPRO : (http://www.ebi.ac.uk/interpro/). Integrated resource of protein domains and functional sites; a combination of the PFAM, PRINTS, ProSite and SWISSProt/TrEMBL packages.
• PRODOM : (http://protein.toulouse.inra.fr/prodom.html). Groups of sequence segments or domains from similar sequences found in the SWISSPROT database by BLASTP multiple sequence alignment.


multiple aligned sequence data to identify additional members of the family, (ii) or by using the sequence to search (by Pfam) against publicly available HMM profiles. In the first case, a model of a sequence family is first produced and initialized with prior information about the sequences. The trained model may then be used to produce the most probable multiple sequence alignment as posterior information. Using a publicly available HMM profile is convenient if the domain is already present in the database.

In the HMM, each column in the model represents the probability of a match, insert or delete in each column of the multiple sequence alignment. Each state generates an observation and has a table of amino acid emission probabilities. There are also transition probabilities for moving from state to state. A protein is represented as a sequence of probabilities, represented by a path through the model. The HMM generates a protein sequence by emitting amino acids as it progresses through a series of interconnecting match, mismatch, delete or insert states. The object is to calculate the best HMM for a group of sequences by optimizing the transition probabilities between states and the amino acid compositions of each match state in the model.

The HMM is a probabilistic representation of a section of a multiple sequence alignment, and has position-dependent character distributions and position-dependent insertion and deletion gap penalties. A protein sequence can be generated from the HMM by starting at the beginning (Fig. 12.11), following any one of many pathways from one type of sequence variation to another (states) along the state transition arrows, and terminating at the end. Each sequence is a match state. The insert state (hexagon box) produces random amino acid letters for insertions between aligned columns, and the delete state (circle) produces a deletion in the alignment with probability 1. The scores show the probability that an amino acid occurs in a particular state. The “hidden” aspect of the model arises from the fact that the state sequence is not directly observed. Instead, one must infer the state sequence from a sequence of observed data using the probability model.

Fig. 12.11 Schematic of Hidden Markov Model (HMM)

A general procedure is:

1. The model is initialized with estimates of the transition probabilities and of the amino acid composition of each match and insert state.

2. All possible pathways through the model for generating each sequence in turn are examined.

3. A new version of the HMM is produced that uses the results found (in step 2) to generate new transition probabilities and match/insert state compositions.

4. Steps 2 and 3 are repeated about ten more times, until the parameters do not change significantly.

5. The trained model is used to provide the most likely path for each sequence (Viterbi algorithm).
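Step 5 can be illustrated with a generic Viterbi routine. The sketch below is not a profile HMM with match, insert and delete states; it uses a hypothetical two-state model ("core" and "loop") that emits a reduced two-letter alphabet (h for hydrophobic, p for polar), and all probabilities are made-up values chosen only to show how the most likely state path is recovered.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (log probability, most likely state path) for an observation sequence."""
    # best[t][s] = best log probability of any path that ends in state s at position t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return best[-1][last], path


# Hypothetical toy model; every number below is illustrative only.
states = ["core", "loop"]
start_p = {"core": 0.5, "loop": 0.5}
trans_p = {"core": {"core": 0.8, "loop": 0.2}, "loop": {"core": 0.2, "loop": 0.8}}
emit_p = {"core": {"h": 0.9, "p": 0.1}, "loop": {"h": 0.1, "p": 0.9}}

log_prob, path = viterbi("hhhppp", states, start_p, trans_p, emit_p)
print(path)
```

Working in log space avoids numerical underflow on long sequences; the same recursion applies to a profile HMM once the state set and allowed transitions are extended.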

Limitations of HMM models are:

(i) The HMM is a linear model and is unable to capture higher order correlation among amino acids in proteins. In reality, amino acids in globular proteins, far apart in linear sequence, may be physically close to each other in protein folding.

(ii) The Markovian assumption that the future is independent of the past, given the present, is not strictly applicable in biology (e.g. clustering of hydrophobic amino acids in proteins; and conserved regions).

Some profile search programs/tools/servers/sites are:

• eMATRIX Search : Meta site profile search (USA). Includes BLOCKS, DOMO, PRINTS, PRODOM, and PROSITE databases.
• InterPro Search : Meta site profile scan server (USA) that includes PFAM, PRINTS and PROSITE.
• PANT : Meta site server (USA). Searches PROSITE patterns and PROFILES, BLOCKS, PFAM and PRINTS.
• PFAM : (http://www.sanger.ac.uk/software/pfam). Protein family database that contains curated multiple sequence alignments for each protein family. It contains functional annotation, databank links for each family and literature references. These multiple sequence alignments are used to construct HMM profiles. A library of these profiles is used in turn to identify protein domains in uncharacterized query sequences. PFAM excels in extracellular domain search.
• PROFILES : Weighted matrices provide a sensitive means of detecting distant sequence relationships, where only few residues are conserved.


• Profile Scan : Meta site profile scan server (ISREC, Switzerland) that searches a sequence against a library of profiles. The server includes PROSITE, PFAM, and Gribskov.
• SAM : Protein family search tool based on hidden Markov model.
• TOPITS : Fold recognition by prediction-based threading.
• UCLA-DOE : Protein fold recognition server (USA).

Because of their information-rich descriptions, PROFILE, PRODOM, and PFAM databases are able to detect even very distant instances of a motif not otherwise detectable.

12.7 PATTERN RECOGNITION

Pattern recognition programs follow the reverse process of sequence analysis. Rather than predict how a sequence will fold, they predict how well a fold will match a sequence. That is, they match a sequence to a given topology rather than search for a topology for a given sequence. Pattern recognition methods attempt to detect similarities between 3-D structures that are not accompanied by any significant sequence similarity. The general approach involves calculating a table of propensities that gives the probability of each type of amino acid being found in a given environment. For a given structure, each position can be assigned to one of the environments. Dynamic programming is then used to find the best match of the sequence to the pattern of environments found in a given fold. Some of the common programs are:

• BLASTPAT : BLAST-based patterns database search.
• EPAT : Pattern search (for PDB; SWISS-PROT; PIR databases).
• FASTAPAT : FASTA-based patterns database search.
• FINDPATTERNS :
• Meta-MEME : The program uses HMM method to find motifs (conserved sequence domains) in a set of related protein domains.
• PRATT : A pattern recognition tool. Searches for patterns conserved in a set of protein or nucleic acid sequences. It is able to discover patterns conserved in sites of unaligned protein sequences.
• PATScan : Search for patterns conserved in a set of protein or nucleic acid databases.
• PatternP : Search for patterns in protein sequences in a PDB file.

12.8 PROTEIN CLASSIFICATION AND MODELING

The most general description of the three-dimensional structure of a protein is in terms of its spatial fold, that is, the topology of its polypeptide chain. This classification disregards local differences and is mostly concerned with the secondary structure elements and their mutual disposition.


12.8.1 Protein Classification

Proteins are classified into families according to their sequence similarity (PSSM), secondary structure (class), motif (architecture), profile (HMM) and homology. Protein classification methods are based on the premise that structural similarities shared by proteins reflect common evolutionary origins. Many proteins are made up of modules (regions of conserved amino acid patterns comprising one or more motifs). Proteins from widely divergent biological sources may share several such modules, but the modules may not be in the same order. Protein families whose members have the same domains in the same order, but also have dissimilar regions, are designated as homeomorphic families. Proteins with the same biochemical functions have been examined for the presence of structurally conserved amino acid patterns that represent an active site or other important feature (Prosite catalog). Proteins can also be classified by clustering methods. These have been used to identify groups of proteins that lack a relative with known structure and hence are suitable for structure analysis.

Some of the protein classification databases are:

• 3D-PSSM : (http://www.bmm.icnet.uk/3dpssm/). Database based on structure similarity in the SCOP.
• CATH : (http://www.biochem.ac.uk/bsm/CATH/). Class, Architecture, Topology and Homology database is a hierarchical domain classification of protein structures.
• FSSP : (http://www2.embl-ebi.ac.uk/dali/fssp/). Fold classification based on structure alignment of proteins. The database is based on a structural alignment of all pair-wise combinations of the proteins in the PDB structural database by the structural alignment program DALI.
• GeneFind : An integrated neural network protein classification database. It is based on the MOTIFIND neural networks and the ProClass family database. GeneFind uses a multilevel filter system, with MOTIFIND, BLAST and Smith-Waterman pair-wise alignment programs.
• LPFC : (http://www-canis.stanford.edu/projects/helix/LPFC/). A library of protein family cores based on multiple sequence alignment of protein cores using amino acid substitution matrices based on structure.
• MMDB : Molecular modeling database contains PDB structures that have been categorized into structurally related groups by the vector alignment search tool (VAST).
• iProClass (USA) : Integrated protein classification resource that provides comprehensive family relationships and structure/functional features of proteins.
• ProClass : (http://www-nbrf.georgetown.edu/gfserver/proclass.html). Provides summary description of protein family, structure and function for PIR-PSD, SWISS-PROT and TrEMBL.
• Prosite : (http://www.expasy.ch/prosite/). Database of groups of proteins with similar biochemical functions, derived on the basis of amino acid patterns.


• SCOP : (http://www.mrc.lmb.ac.uk/SCOP/). Clustering algorithm for proteins with sequence identity > 30%. Provides hierarchical structural classification.

It is important in protein classification analysis that experimental 3-D structure data are available for at least one representative protein of every family of homologous proteins, for validation of predicted models. The quality of such predicted models is related to the level of structural similarity between the query protein, whose sequence is available, and the representative protein, whose 3-D structure is also available.

12.8.2 Tertiary Structure Modeling

There are a number of methods for predicting the 3-D structure of a protein from its amino acid sequence data. The best approach is to locate, by sequence analysis, a link between the sequence of a query protein and a protein of known 3-D structure. If the query protein sequence shows significant homology to another protein of known 3-D structure, then a fairly accurate model of the 3-D structure of the query protein can be obtained via homology modeling (Path 1 of Fig. 12.1), by superposition of the query sequence on the sequence of a related protein whose 3-D structure has been experimentally determined. There are ~ 500 common structural folds for ~ 12,000 3-D structures. That is, many different sequences can adopt the same fold. Thus, there are many combinations of amino acids that can assemble together into the same 3-D conformation. This implies that while significant sequence similarity is an indicator of evolutionary relationship between sequences, significant structural similarity may or may not be an indicator of evolutionary relationship. Structural homology modeling methods provide a means to expand this number by building models of other proteins that show some level of similarity to a protein with known 3-D structure. If a portion of the sequence matches a domain of a protein of known 3-D structure, a 3-D homology model can be constructed from the protein.

When a global sequence alignment shows > 45% homology, the amino acids should be quite superimposable in the 3-D structures of the proteins. For comparison of two structures, the positions of atoms in the two 3-D structures are compared. These methods initially examine the positions of secondary structural elements (α-helices and β-strands) within a protein domain to determine whether or not the number, type and relative positions of these structural elements are similar.
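The comparison of atom positions is commonly summarized as a root mean square deviation (RMSD) over matched atoms. The following is a minimal sketch that assumes the two coordinate sets are already optimally superposed and listed in matching order; it does not perform the superposition itself, and the coordinates shown are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) coordinates.

    Assumption: the structures are already superposed and atoms are paired by index.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Hypothetical matched atom positions, in angstroms.
helix_1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
helix_2 = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(helix_1, helix_2))
```

A small RMSD indicates that the matched elements occupy nearly the same positions; the value is only meaningful for the set of atoms that were paired.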

Stabilizing the secondary structure elements to maximize the hydrophobicity of the core is an important feature in prediction and modeling of protein folds. Even when no homologue of known 3-D structure is found, it may be possible to predict a model from fold recognition methods (SCOP; CATH). Steps in building a homology-based model are:

1. Obtain a relative of the query sequence.
2. Build a template structure in the protein structure database.
3. Ensure that conserved residues predicted to be buried/exposed are aligned to those known to be buried/exposed in the template structure (use PHD server).
4. Align the backbone first and then add the side chains of the query sequence.
5. Line up every secondary structure element with its appropriate counterpart.
6. Ensure that H-bonding patterns are not disturbed in secondary structure.
7. Conserve residue properties (size, polarity, hydrophobicity etc.).
8. Optimize the structure using energy minimization.

• Dali : (http://www.embl-ebi.ac.uk/dali/). The Dali server is a network service for comparing protein structures in 3-D. Dali compares the coordinate list of a query protein structure against those in the Protein Data Bank (PDB). In favorable cases, comparing 3-D structures may reveal biologically interesting similarities that are not detectable by comparing sequences.

Instead of exhaustive database searching, in which a 3-D query structure is compared to each and every structure in the database, a rapid 3-D protein structure retrieval system (“ProtDex2”) can be used to perform rapid database searching without having access to every 3-D structure in the database. This retrieval process is based on the inverted-file index algorithm, constructed on the feature vectors of the relationships between the secondary structure elements (SSEs) of all the 3-D protein structures in the database. The ProtDex2 algorithm is faster than other protein comparison algorithms, such as DALI.

12.8.2.1 Distance Matrix Method

The method is similar to dotplot analysis. It uses a graphic procedure to identify the atoms that lie most closely together in the 3-D structure. If two proteins have a similar structure, the graphs of these structures will be superimposable. Distances between Cα atoms along the polypeptide chain can be compared by a 2-D matrix representation of the structure. The distance matrix compares geometric relationships between the structures without regard to alignment. The sequence of the protein is listed both across the top and down the side of the matrix. Each matrix position represents the distance between the corresponding Cα atoms in the 3-D structure. The program DALI (distance alignment tool) uses this method to align protein structures.
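The matrix itself is straightforward to compute. The sketch below builds a Cα distance matrix from a list of coordinates and compares two matrices of equal size by their mean absolute difference; this comparison measure and the coordinates are illustrative assumptions and are not the scoring scheme used by DALI.

```python
import math


def distance_matrix(ca_coords):
    """Square matrix of pairwise distances between C-alpha coordinates."""
    n = len(ca_coords)
    return [[math.dist(ca_coords[i], ca_coords[j]) for j in range(n)] for i in range(n)]


def matrix_difference(matrix_a, matrix_b):
    """Mean absolute difference between two distance matrices of equal size."""
    n = len(matrix_a)
    if n == 0 or n != len(matrix_b):
        raise ValueError("matrices must be non-empty and of equal size")
    total = sum(abs(matrix_a[i][j] - matrix_b[i][j]) for i in range(n) for j in range(n))
    return total / (n * n)


# Hypothetical coordinates; the second set is the first one translated by (1, 1, 1).
structure_a = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
structure_b = [(x + 1.0, y + 1.0, z + 1.0) for x, y, z in structure_a]

print(distance_matrix(structure_a)[0][1])
print(matrix_difference(distance_matrix(structure_a), distance_matrix(structure_b)))
```

Because internal distances do not change when a structure is rotated or translated, two copies of the same fold give the same matrix without any prior superposition, which is what makes this representation convenient for comparison.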

12.8.2.2 Structure Profile Method

The structure profile method is an environmental template method: it uses a profile to predict which amino acids might be able to fit into a given structural position. The environment of each amino acid in each known structural core is determined, including the secondary structure and the buried surface of side chains. On the basis of these physicochemical parameters at each site, the position is classified into one of eighteen types: six types representing increasing levels of residue burial and fraction of surface covered by polar atoms, combined with three classes of secondary structure. Each amino acid is then assessed for its ability to fit into that type of site in the structure. The sequence of the protein is then aligned with a series of such environmentally defined positions in the structure to see whether a series of amino acids in the sequence can be aligned with the assigned structural environments of a given core. The procedure is then repeated for each protein core in the structural database, and the best matches of the query sequence to the core are identified.

The structural 3-D profile is a table of scores with one row for each amino acid position in the core, a column for each possible amino acid substitution at that position and two columns for deletion penalties at that site (Fig. 12.12). Each position in the core is assigned to one of 18 classes of structural environment. The scores in each row reflect the suitability of a given amino acid for that particular environment. The penalty at each core position reflects the acceptability of an insertion or deletion of one or more amino acids at that position in the structure. If the position is within the core, these penalties are generally high, reflecting incompatibility with the structure, but the penalties are lower for positions on the surface of the core and within the loop regions. Dynamic programming is used to identify an optimal, best-scoring alignment. If a target structure is found to have a significantly high score, then the query sequence is predicted to have a fold similar to that of the target core.

Fig. 12.12 3-D Structure Profile Scheme
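The dynamic programming step for scoring a sequence against such a profile can be sketched as below. This is a simplified global alignment, assuming a single gap penalty per core position (the profile described above has two penalty columns) and a score of zero for residues missing from a row; the three-position profile and its values are hypothetical.

```python
def align_to_profile(sequence, profile, gap_penalties):
    """Best global alignment score of a sequence against a 3-D profile.

    profile[i][aa]   : score for placing amino acid aa at core position i
    gap_penalties[i] : penalty for an insertion or deletion at core position i
    Assumption: the profile has at least one position.
    """
    n, m = len(profile), len(sequence)
    # score[i][j] = best score aligning the first i positions with the first j residues
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] - gap_penalties[i - 1]
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] - gap_penalties[0]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            row = profile[i - 1]
            match = score[i - 1][j - 1] + row.get(sequence[j - 1], 0.0)
            skip_position = score[i - 1][j] - gap_penalties[i - 1]  # core position left empty
            extra_residue = score[i][j - 1] - gap_penalties[i - 1]  # residue not placed in the core
            score[i][j] = max(match, skip_position, extra_residue)
    return score[n][m]


# Hypothetical profile: buried positions favor L, the middle (surface) position favors K
# and has a lower gap penalty.
profile = [{"L": 2, "K": -1}, {"K": 2, "L": -1}, {"L": 2, "K": -1}]
gap_penalties = [3, 1, 3]

print(align_to_profile("LKL", profile, gap_penalties))
print(align_to_profile("LL", profile, gap_penalties))
```

In a fold recognition search, a score of this kind would be computed for every core in the structural database and the resulting scores compared, as described in the text.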

12.8.2.3 3D-1D Profile Sum Method

In this method (Fig. 12.13), sequences are screened to prepare a 3-D profile, a discrete list of scores for matching a 1-D sequence to a 3-D structure. The procedure takes into account amino acid neighbors, main-chain conformations and secondary structural features of each residue in the structure.

12.8.2.4 Contact Potential Method

In the contact potential method, each structural core is represented as a 2-D contact matrix (similar to the distance matrix method of the program DALI). A matrix is produced with the amino acids in the structure listed across the rows and down the columns. In each matrix position, the distance between the corresponding pair of amino acids in the structure is placed. A group of amino acids in closest contact produces recognizable patterns. The object is to superimpose sets of amino acid pairs in the query sequence on to the distance matrix of the core. To find the best combinations, the approximate conformational energies of each predicted pair are summed to predict the conformational stability of the predicted structure. Contact energies can be used to choose the correct core in a structural database.


• MODELLER : (http://guitar.rockefeller.edu/modeller/). Dynamic programming alignment of sequences and 3D-structure modeling.
• PDBSum : The database provides summaries of structural analyses of PDB data files, structural classification and programs to plot schematic diagrams of protein-ligand interactions, protein structure motifs etc.
• Predict Protein : The Meta-server (EMBL, Germany) is used to find structural homologues of a query protein sequence, to detect functional motifs and domains, and to predict secondary structure based on a single sequence or multiple sequence alignment.
• ProSAL : A Meta site for protein analysis and characterization (Sweden).
• ProtDex2 : ([email protected]). Rapid 3-D protein structure database searching using information retrieval techniques.
• SDSC1 : San Diego Supercomputer Center Protein Structure Homology Modeling. This is a site to try if the query sequence does not show any sequence homology to existing proteins in databases.
• SRS : (http://srs.hgmp.mrc.ac.uk/). Sequence retrieval system.
• SWISS MODEL : (http://www.expasy.ch/swissmodel/). Sequence alignment of a query sequence with a known 3D-structure by homology modeling (> 50% homology).
• WHATIF : (http://www.umbi.kun.nl/whatif/). A web interface (EMBL, Germany) that provides tools for examining PDB files.

12.8.3 3-D Structure Viewing

Once the 3-D model is available, various programs can be used to view the 3-D structure. They may also be used to compare it with other homologous structures in databases (e.g. the PDB database) by superposition. The superposed structures can be viewed with CHIME or RasMol.

Some of the structure-viewing programs are:

• CHIME : (http://www.umass.edu/microbio/chime/). Good for lecture presentation.
• Cn3D : (http://www.ncbi.nih.gov/structure/). Provides viewing of 3-D structures from Entrez.
• ExPASy :
• GRASP :
• LOCK (USA) : Hierarchical protein structure superposition tool.
• MolMol :
• Prepi :
• RasMol : (http://www.umass.edu/microbio/rasmol/). Most commonly used viewer program for Windows.
• Swiss 3D Image : (http://www.expasy.ch/sw3d/) (ExPASy, Switzerland). An image database that provides high quality pictures of biological macromolecules with known three-dimensional structures.


EXERCISE MODULES

1. What is sequence alignment?
2. What is the basis of sequence similarity search algorithms?
3. What are the similarity/distance matrices meant for?
4. What is meant by pair-wise sequence alignment?
5. What is dotplot analysis?
6. What is dynamic programming?
7. What are the differences between global and local alignments of sequences?
8. What is multiple sequence alignment and what is its objective?
9. Which are the programs for multiple sequence alignment?
10. Name the BLAST suite and FASTA algorithms for multiple sequence alignment.
11. What are the best strategies for sequence similarity search?
12. What is phylogenetic analysis?
13. What is the Dayhoff mutation data matrix?
14. What are the limitations of Dayhoff’s PAM model?
15. What are the features of the BLOSUM algorithm?
16. What are distance method algorithms?
17. What are the cladistic methods in evolutionary analysis?
18. What is an evolutionary tree and how is it constructed?
19. What are the protocols for secondary structure prediction?
20. What are the methods for helix prediction?
21. What are the methods for strand prediction?
22. How are turns and loops predicted?
23. What are the methods and programs for predicting motifs, domains, and profiles?
24. What is the relevance of Hidden Markov models in multiple sequence alignments and pattern recognition analysis?
25. What are the protocols for pattern recognition?
26. Which are the programs used in protein classification?
27. What are the various methods in protein tertiary structure prediction and modeling?
28. What is sequence threading?
29. Name some servers/databases for protein modeling.
30. Which are the programs for 3-D structure viewing?


13 Medico- and Pharmacoinformatics

Genes are units of heredity that provide the blueprint for our physical body and our well-being. Quality of life can be drastically altered by disease, and genetic disease is perhaps the purest illustration of the relationship between our genes and our health. In this context, the major impact of the new era of genomic biology (applications of genome sequencing projects), which combines experimental data from gene expression microarrays, electrophoresis, mass spectrometry and other experimental techniques with computational methods, has been and is going to be felt in the medical and pharmacological fields. Vast amounts of genome sequence data (ESTs) with refined annotations are available, and these databases, together with analysis of tissue-specific assays, will be the sources of information for understanding the molecular basis of genetic diseases, for gene discovery (some of the genes may carry disease propensity), for genetic screening, and for finding new methods of diagnosis, design of novel genes, and personalized gene-based therapies. A general approach towards genetic disease management is: (i) identification of the disease-causing gene, (ii) diagnostic/screening tests, and (iii) therapeutic protocols (Fig. 13.1).

Fig. 13.1 Flowchart of Databank-based Genetic Disease Management

13.1 DISEASE GENE IDENTIFICATION

Chromosome abnormalities are responsible for a significant portion of genetic disorders that appear to arise de novo. Chromosomal abnormalities can occur within the same chromosome (deletions, insertions, duplications and inversions) or between chromosomes (translocations), and can also involve changes in chromosome number (ploidy). Chromosomal abnormalities can now be detected during pregnancy through amniocentesis and cytogenetic analyses: chromosome banding and fluorescence in situ hybridization (FISH) analysis, a type of in situ hybridization in which target sequences are stained with fluorescent dye so that their location and size can be determined using fluorescence microscopy.

13.1.1 Linkage Analysis and Positional Cloning

Once a mutant, a marker allele (one of several alternate forms of a gene which occur at the same locus on homologous chromosomes and govern the same biochemical and developmental process), or a susceptibility region of a chromosome has been identified that is associated with a disease, specific genetic and biochemical assays for detecting the physiological and metabolic changes caused by the disease can be developed for diagnostic screening. However, for the vast majority of individuals, these data are insufficient.

An alternative strategy is linkage mapping analysis and isolation of disease genes, based on their chromosomal location (“positional cloning”). The positional cloning approach relies on a three-step process: (i) localizing a disease gene to a chromosomal subregion, generally by using traditional linkage analysis, (ii) searching databases for an attractive candidate gene within that subregion, and (iii) testing the candidate gene for disease-causing mutations.

First, the position of the gene must be mapped. Linkage analysis can be used to pinpoint which chromosomal neighborhoods contain disease alleles by scanning the genomes of family members in affected pedigrees for alleles that appear to be linked to the disease phenotype. DNA fingerprinting methods that rely on enzymatic cleavage of DNA, followed by electrophoresis and visualization by hybridization with probes specific for repetitive sequences, namely restriction fragment length polymorphisms (RFLPs) and variable number of tandem repeats (VNTRs), have been used in linkage analysis. Chromosomal aberrations associated with the disease may be detected by cytogenetic methods (e.g. FISH analysis).
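The text does not name a statistic for deciding whether a marker allele is "linked" to the disease phenotype. One standard measure in linkage analysis is the LOD score, the base-10 logarithm of the odds of the observed family data under linkage at recombination fraction θ versus no linkage (θ = 0.5). The sketch below assumes the simplest situation, in which recombinant and non-recombinant meioses can be counted directly; the function name and the example counts are hypothetical.

```python
import math


def lod_score(recombinants: int, meioses: int, theta: float) -> float:
    """LOD score for a marker and a disease locus in a phase-known pedigree.

    recombinants -- informative meioses in which marker and disease separated
    meioses      -- total number of informative meioses scored
    theta        -- assumed recombination fraction, 0 < theta <= 0.5
    """
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must lie in the interval (0, 0.5]")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must be between 0 and meioses")
    k, n = recombinants, meioses
    likelihood_linked = (theta ** k) * ((1.0 - theta) ** (n - k))
    likelihood_unlinked = 0.5 ** n  # free recombination, i.e. no linkage
    return math.log10(likelihood_linked / likelihood_unlinked)


# Hypothetical family data: 1 recombinant observed in 10 informative meioses.
example_lod = lod_score(recombinants=1, meioses=10, theta=0.1)
```

By convention a LOD score of 3 or more is usually taken as evidence for linkage, which is one reason why large families with several affected individuals are needed for this approach.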

After linkage mapping, the next step is to look through the base pairs that have co-segregated with the known genetic markers. The genes in this region must be identified, and a specific mutation in one of the genes must be shown to cause the disease. This process is termed positional cloning.

Linkage analysis is useful where the family is quite large, such that DNA samples can be obtained from several affected individuals in the family. Traditional linkage maps are not useful in identifying the genes responsible for the majority of complex human diseases.

13.1.2 Impact of Genomics on Disease Gene Identification

An understanding of the fundamental nature of the mechanisms of disease will permit the causes rather than the symptoms of disease to be addressed. Concomitant with our fundamental understanding of disease will come improved intervention through insight into disease predisposition, earlier disease detection and disease characterization. The development of new mapping technologies, associated with the genome projects, has accelerated the pace of identification of disease genes, and the underlying principles of mutational mechanisms. This has also introduced the computational tools to search or “mine” DNA and protein sequence


sequence amplification by PCR: PCR-based genotyping of DNA microsatellites and single nucleotide polymorphisms (SNPs) technology.

Gene chip technology can also be used to analyze transcription by providing a snapshot of gene expression for all the genes expressed in a given cell or tissue (gene expression profiles). It can also be applied to genotyping the DNA of an individual for the major alleles of each gene using SNPs. For example, cDNA can be prepared from a tumor biopsy and used to create an expression profile for that tumor, to detect common oncogenes. Detection is carried out by annealing fluorescent-labeled DNA to the microarrays, followed by computer-aided scanning and scoring. Serial analysis of gene expression (SAGE) is another method that can be used instead of microarrays for expression profiling.

This gene chip/SNP detection technology has also been extended to protein analysis through proteomics, using mass spectrometry, to probe protein-protein interactions occurring within the cell. High-throughput screening of SNPs can be accomplished by peptide-nucleic acid (PNA) detection of the polymorphisms, coupled to MALDI-TOF-MS.

13.2.2 Monogenic Diseases

Rare genetic diseases, such as sickle cell anemia and cystic fibrosis (CF), are often Mendelian, monogenic traits. They result from mutations in which predisposition for the disease is directly associated with the presence of a single gene allele.

Sickle cell anemia is a classic example of a monogenic disease. It was one of the first inherited diseases for which the molecular basis was established: a connection between a single gene mutation, namely a single nucleotide change (single nucleotide polymorphism, that is, single-point mutation) leading to the substitution of valine for glutamic acid at position 6 in the β-subunit of the hemoglobin protein complex, and a disease phenotype.
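The effect of such a single-point mutation can be illustrated with a few lines of code. The sketch below uses a short DNA fragment assumed to encode residues 5 to 8 of the β-globin chain (Pro-Glu-Glu-Lys) and a deliberately partial codon table; the fragment, the function names and the table are simplifications chosen for illustration. The sickle cell change itself is the replacement of the codon GAG (glutamic acid) by GTG (valine).

```python
# Partial codon table: only the codons needed for this fragment.
CODONS = {"CCT": "Pro", "GAG": "Glu", "GTG": "Val", "AAG": "Lys"}


def translate(dna: str) -> list:
    """Translate a coding fragment, codon by codon, using the partial table."""
    usable = len(dna) - len(dna) % 3
    return [CODONS[dna[i:i + 3]] for i in range(0, usable, 3)]


def point_mutation(dna: str, index: int, new_base: str) -> str:
    """Return a copy of the sequence with one base replaced (0-based index)."""
    return dna[:index] + new_base + dna[index + 1:]


normal_fragment = "CCTGAGGAGAAG"  # assumed codons for residues 5-8
sickle_fragment = point_mutation(normal_fragment, 4, "T")  # A -> T in codon 6

normal_peptide = translate(normal_fragment)
sickle_peptide = translate(sickle_fragment)
```

A single base change in the second position of the codon is enough to replace a charged residue with a hydrophobic one, which is the molecular basis referred to above.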

Cystic fibrosis (CF) is another example of a monogenic disease. Defects in the cystic fibrosis transmembrane conductance regulator (CFTR) locus directly predispose an individual to cystic fibrosis, most commonly through deletion of phenylalanine at position 508 (ΔF508).

Some monogenic diseases exhibit allele heterogeneity with multiple mutations for the underlying disease gene. Examples are: Duchenne muscular dystrophy (DMD), a common muscle wasting disease, and osteogenesis imperfecta (OI), in which a mutation of either of the two genes that make up type-I collagen results in brittle or malformed bones.

For monogenic diseases, a genetic component for the disease becomes evident from pedigree analysis. Once the genetic component has been suggested from pedigree analysis, then a search for common alleles between affected members can be carried out, in which one progresses from a disease phenotype to a candidate gene.

For monogenic diseases, genetic mapping for single-gene (Mendelian) traits can be conducted by linkage analysis. This involves the identification of DNA markers, previously placed on the genetic map, that co-segregate with the disease in one or more families.

Mutation screening is another approach, in which DNA samples from multiple individuals are tested for the presence of a specific single mutation. Methods include single nucleotide primer extension assays, oligonucleotide microarrays (gene chips) for SNP detection, and a combination of both.
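In computational terms, screening for one known mutation amounts to checking a fixed position in each individual's two alleles. The following sketch assumes that both allele sequences of each individual are already available as strings; the sample identifiers, the sequences and the function name are hypothetical, and real assays report signals rather than clean sequences.

```python
from typing import Dict, Tuple


def screen_for_mutation(samples: Dict[str, Tuple[str, str]],
                        position: int,
                        ref_base: str,
                        mut_base: str) -> Dict[str, str]:
    """Classify each individual by the base found at one known position.

    samples  -- sample id mapped to the two allele sequences of that individual
    position -- 0-based index of the screened nucleotide
    """
    labels = {0: "non-carrier", 1: "heterozygous carrier", 2: "homozygous mutant"}
    results = {}
    for sample_id, alleles in samples.items():
        bases = [allele[position] for allele in alleles]
        if any(base not in (ref_base, mut_base) for base in bases):
            # A base other than the two expected ones needs follow-up analysis.
            results[sample_id] = "unexpected base"
        else:
            results[sample_id] = labels[bases.count(mut_base)]
    return results


# Hypothetical individuals screened for an A -> T change at index 4.
cohort = {
    "P1": ("CCTGAGGAG", "CCTGAGGAG"),
    "P2": ("CCTGAGGAG", "CCTGTGGAG"),
    "P3": ("CCTGTGGAG", "CCTGTGGAG"),
    "P4": ("CCTGCGGAG", "CCTGAGGAG"),
}
calls = screen_for_mutation(cohort, position=4, ref_base="A", mut_base="T")
```

Because only one position is inspected, other changes in the gene are not detected by this kind of test unless they fall on the screened nucleotide.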


13.2.3 Polygenic Diseases

Most common genetic diseases, such as diabetes, asthma, cancer, and heart disease, are often polygenic (involving multiple genes) and multifactorial: multiple genes together with environmental factors predispose the individual to disease. In multifactorial diseases, multiple genes, each having a small effect on the phenotype, are involved. Determining the genomic component(s) of non-Mendelian characters is more difficult, because their phenotypic expression often depends on the interaction of a myriad of genetic, social and environmental factors.

Also, several modes of inheritance depend on either the sex of the individual or the sex of the parent transmitting the trait. The majority of X-linked recessive diseases affect predominantly males, and X-linked dominant diseases are more frequent in females. The phenotypic expression of the common genetic diseases that are time-dependent relies on a mechanism known as “imprinting”, which describes the dependence of the disease on the parent transmitting the trait. Environmental factors, such as diet or exposure to infectious agents, may also affect the expression of the disease phenotype and thus the penetrance of the trait.
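The reason X-linked recessive diseases affect mainly males can be shown by enumerating the possible offspring of a cross. The sketch below assumes full penetrance and a simple notation in which "A" is the normal allele and "a" the recessive disease allele on the X chromosome; the function name and the notation are chosen only for this illustration.

```python
from fractions import Fraction
from typing import Dict


def x_linked_recessive_risk(mother: str, father_x: str) -> Dict[str, Fraction]:
    """Fraction of sons and of daughters expected to be affected.

    mother   -- her two X-linked alleles, e.g. "Aa" for a carrier
    father_x -- the single X-linked allele carried by the father
    """
    # A son receives Y from the father, so one maternal "a" is enough.
    sons_affected = Fraction(mother.count("a"), 2)
    # A daughter is affected only if both parents transmit "a".
    if father_x == "a":
        daughters_affected = sons_affected
    else:
        daughters_affected = Fraction(0)
    return {"sons": sons_affected, "daughters": daughters_affected}


# Carrier mother and unaffected father: the most common situation.
carrier_cross = x_linked_recessive_risk("Aa", "A")
```

With a carrier mother and an unaffected father, half of the sons are expected to be affected and none of the daughters, although half of the daughters will themselves be carriers.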

It is these polygenic, non-Mendelian, complex diseases that have the greatest impact on the human population. The structural and functional information from the HGP and other genome projects will have a much greater impact on the elucidation of the etiology of common, multifactorial diseases. From the HGP data, the reference sequence of a normal gene will provide the starting point for detecting disease-causing mutations. Documentation of “normal” variations will provide a basis for the subsequent identification of pathogenic variations. Comparative genomics in mammals will provide both the opportunity to develop animal models for human diseases and a chance to increase our understanding of how genes and gene families have evolved.

13.3 GENETIC TESTING AND THERAPY

The purpose of genetic testing/screening is to identify carriers of genetic disorders that could predispose the carrier, or the progeny, to an inherited disease. Clinical programs are aimed at detecting the symptoms of the disease at an early stage. Therapeutic procedures are aimed at improving the present therapeutic protocols and/or developing new and more effective therapies.

13.3.1 Genetic Testing/Screening

Genetic testing/screening can be carried out by (i) linkage analysis, and (ii) direct detection of the mutants by mutation screening or other methods (Fig. 13.3). Once it is established which gene is responsible for the disease in a given family, linkage analysis can be used to predict whether a person at risk has in fact inherited the mutation-bearing chromosome.

13.3.1.1 Biomarkers

Biomarkers play an important role in disease diagnostics and drug discovery and monitoring. Biological markers are measurable and quantifiable biological parameters (e.g. specific enzyme concentration, specific hormone concentration, specific gene phenotype distribution in a population, presence of biological substances), which serve as indices for health- and physiology-related assessments, such as disease risk, environmental exposure and its effects, disease diagnosis, and metabolic processes. They provide the basis for developing new diagnostic products,

structure-function information rather than the traditional trial-and-error method, and targeted at specific sites in the body and at particular biochemical events leading to disease, promise to have fewer side effects than many of today’s medicines. New methods in protein profiling include surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF-MS), and antibody arrays.

13.3.1.2 Mutation Screening

Mutation scanning refers to the process of analyzing DNA sequences or genes for the presence of any possible mutation. This can be carried out by (i) single-stranded conformational analysis (SSCA), or by (ii) hetero-duplex analysis.

The SSCA method is based on the sequence-dependent mobility of single-stranded DNA molecules in non-denaturing polyacrylamide gel electrophoresis. The DNA to be analyzed is amplified by PCR, the two strands of the DNA molecule are separated, and electrophoretic separation is carried out on a non-denaturing polyacrylamide gel. The two strands will usually migrate at different rates, even though they have the same molecular mass. Any change in the base composition, as a result of mutations, may further modify the mobility of the fragment in the gel.

Hetero-duplex analysis involves comparing the electrophoretic mobility of normal double-stranded DNA with “heteroduplex” DNA that contains one normal strand and a complementary strand containing the mutated sequence. The target DNA is amplified by PCR, and the strands of the DNA double helix are separated by raising the temperature to 95°C. During subsequent cooling, the complementary DNA strands re-anneal. The mobility of hetero-duplex DNA will be different from that of the homo-duplex DNA. Large base-pair mismatches may also be analyzed by using electron microscopy to visualize heteroduplex regions.
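The pairing logic behind hetero-duplex analysis can be illustrated with a minimal Python sketch. The sequences and the helper names (`complement`, `mismatch_positions`) are hypothetical and chosen only for illustration; the sketch pairs a normal strand with the complement of either the normal or the mutated sequence and lists the positions at which base pairing fails. It says nothing about gel mobility itself.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand):
    """Return the base-by-base complement of a DNA strand (same orientation)."""
    return "".join(COMPLEMENT[base] for base in strand)

def mismatch_positions(strand, partner):
    """Return 0-based positions where partner does not pair with strand."""
    if len(strand) != len(partner):
        raise ValueError("strands must have the same length")
    return [i for i, (a, b) in enumerate(zip(strand, partner))
            if COMPLEMENT[a] != b]

normal = "ATGGCCTAC"   # hypothetical normal sequence
mutant = "ATGGACTAC"   # same sequence with one substitution (C -> A)

# Homoduplex: normal strand paired with its own complement.
homoduplex = mismatch_positions(normal, complement(normal))
# Heteroduplex: normal strand paired with the complement of the mutant.
heteroduplex = mismatch_positions(normal, complement(mutant))
print(homoduplex, heteroduplex)
```

In this toy case the homoduplex has no mismatches, while the heteroduplex has a single unpaired position at the site of the substitution.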

High-throughput genotyping methods (determination of relevant nucleotide-base sequences in each of the two parental chromosomes) are available for diagnosis, drug efficacy, and toxicity studies. These methods utilize genomic DNA that, after digestion, reacts with a SNP array to obtain an individual SNP pattern. These variations can, for instance, provide information about the diagnosis of a certain disease, or the effectiveness or side effects of a certain drug. The examination of single chromosome sets (haploid sets), as opposed to the usual chromosome pairings (diploid sets), is important because mutations in one copy of a chromosome pair can be masked by normal sequences present on the other copy. In diploid organisms, such as humans, the linkage of particular SNP genotypes on each chromosome in a homologous pair (the haplotype) may provide additional information not available from SNP genotyping alone.
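Why haplotypes carry information beyond the SNP genotype can be seen in a small Python sketch. The function name `possible_haplotype_pairs` and the example genotypes are assumptions made for illustration: given unphased diploid genotypes, the code enumerates every pair of haplotypes that is consistent with them.

```python
from itertools import product

def possible_haplotype_pairs(genotypes):
    """Enumerate haplotype pairs consistent with unphased diploid genotypes.

    genotypes: list of 2-character strings, one per SNP, e.g. ["AG", "CT"].
    Returns a set of frozensets, each holding the two haplotypes of one
    possible phasing (a single haplotype if both copies are identical).
    """
    pairs = set()
    # For each SNP, choose which of its two alleles sits on the first chromosome.
    for choice in product((0, 1), repeat=len(genotypes)):
        hap1 = "".join(g[c] for g, c in zip(genotypes, choice))
        hap2 = "".join(g[1 - c] for g, c in zip(genotypes, choice))
        pairs.add(frozenset((hap1, hap2)))
    return pairs

# Two heterozygous SNPs: the genotype alone cannot distinguish the phasings.
ambiguous = possible_haplotype_pairs(["AG", "CT"])
for pair in sorted(sorted(p) for p in ambiguous):
    print(" / ".join(pair))
```

For two heterozygous sites the sketch reports two candidate phasings (AC / GT and AT / GC), which is the ambiguity that direct haplotype determination resolves.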

13.3.1.3 Differential Gene Expression Profiling

The ability to profile the differences between biological samples is of fundamental importance in biology. Differential gene expression (DGE) and differential protein expression (DPE) are screening methods that are widely used for target validation.

Current technologies available for gene expression analysis are DNA microarrays and SAGE. Gene expression profiling (determination of the pattern of genes expressed under specific circumstances or in a specific cell) has been used extensively to examine biological models and disease tissues in an effort to understand the disease process and identify therapeutic targets. This involves studying the expression (as mRNA) of thousands of genes in a cell or tissue, and how gene expression changes under various conditions. Genome-wide expression profiling of disease states and treatment conditions represents a significant advance in the areas of discovery of molecular disease markers, therapeutic target validation, predictive toxicology, and clinical monitoring of patients. It allows a comprehensive high-throughput screening of the effects of an insult (genetic, physiologic, pathologic, etc.) on gene expression in tissues and specific cell populations of interest. These techniques may aid in determining the function of a newly discovered gene or discovering new biomarkers and therapeutics for patients with disease.
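A common way to summarize how expression changes between two conditions is the log2 fold change. The Python sketch below is a minimal illustration under stated assumptions: the gene names and expression values are invented, the pseudocount of 1.0 and the threshold of 1.0 (a two-fold change) are arbitrary choices, and no statistical testing is performed.

```python
import math

def log2_fold_changes(control, treated, pseudocount=1.0):
    """Return {gene: log2((treated + p) / (control + p))} for shared genes."""
    return {
        gene: math.log2((treated[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in control
        if gene in treated
    }

def differentially_expressed(fold_changes, threshold=1.0):
    """Keep genes whose absolute log2 fold change meets the threshold."""
    return {g: fc for g, fc in fold_changes.items() if abs(fc) >= threshold}

# Hypothetical expression values (arbitrary units) for three made-up genes.
control = {"geneA": 99.0, "geneB": 49.0, "geneC": 199.0}
treated = {"geneA": 399.0, "geneB": 49.0, "geneC": 49.0}

changes = log2_fold_changes(control, treated)
selected = differentially_expressed(changes)
print(selected)
```

Here geneA is up-regulated four-fold, geneC is down-regulated four-fold, and geneB is unchanged and therefore filtered out. Real analyses would add replicates and a significance test.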

Molecular profiling (MP) is analysis carried out on homogeneous cell samples instead of larger tissues that may contain mixed cell populations. It is a dynamic new discipline, capable of generating a global view of mRNA, protein patterns, and DNA alterations in various cell types and disease processes. MP integrates the expanding genetic databases from the Human Genome Project with newly developed expression analysis technologies and holds great promise to help us to (i) understand the molecular anatomy of normal cells and cells in various stages of disease, (ii) develop new diagnostic and therapeutic targets for clinical intervention, and (iii) explain the relationship between genotype and phenotype in humans, which is still largely unknown.

Human genic bi-allelic sequences (HGBASE), a database of intra-genic (promoter to end of transcription) sequence polymorphism, facilitates genotype-phenotype association studies based upon the rapidly growing number of known, gene-related, single nucleotide polymorphisms (SNPs). HGBASE includes intra-genic sequence variants found in ‘normal’ individuals. HGBASE is not limited to bi-allelic polymorphisms but covers all types of intra-genic variation. Polymorphisms that are probably functionally important (e.g. codon changes) and others (e.g. intron variations) are all included because all can potentially be employed as surrogate markers for unknown nearby functional variants due to linkage disequilibrium.
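Linkage disequilibrium between two bi-allelic sites is commonly quantified by the coefficient D and the squared correlation r². The Python sketch below uses the standard definitions, D = p(AB) − p(A)p(B) and r² = D² / [p(A)(1 − p(A)) p(B)(1 − p(B))]; the function name and the haplotype frequencies are assumptions chosen for illustration.

```python
def linkage_disequilibrium(hap_freqs):
    """Compute D and r^2 for two bi-allelic sites.

    hap_freqs maps the haplotypes "AB", "Ab", "aB", "ab" to frequencies
    that sum to 1 (A/a are alleles at site 1, B/b at site 2).
    """
    p_a = hap_freqs["AB"] + hap_freqs["Ab"]   # frequency of allele A
    p_b = hap_freqs["AB"] + hap_freqs["aB"]   # frequency of allele B
    d = hap_freqs["AB"] - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, r2

# Hypothetical frequencies: complete association versus independence.
complete = linkage_disequilibrium({"AB": 0.5, "Ab": 0.0, "aB": 0.0, "ab": 0.5})
independent = linkage_disequilibrium({"AB": 0.25, "Ab": 0.25, "aB": 0.25, "ab": 0.25})
print(complete, independent)
```

When r² is close to 1, the allele at one site predicts the allele at the other, which is what allows a typed SNP to act as a surrogate marker for an untyped functional variant nearby.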

A key component of future genomic research and drug development will be the study of epigenetic imprinting (drug- or environment-induced changes in gene expression) indicative of disease and/or pharmacological or environmental exposure.

13.3.1.4 Differential Protein Expression Profiling

The importance of the protein-based methods is that they measure the final expression product rather than an intermediate. In addition, some of them enable the detection of post-translational protein modifications (e.g., phosphorylation, glycosylation, carboxylation) and protein complexes, and in some cases, yield information about protein localization. Protein-based methods are important as they measure observables that are not readily detected in other ways. It is likely that expression proteomics will be a useful tool in drug target discovery and in studying the effects of various biological stimuli on the cell.

Protein expression can be profiled in much the same way as gene expression, and protein expression profiling is typically used for target discovery, toxicological studies or disease marker discovery. This assay provides an indication of the relative levels of protein expression between two different conditions, whether they are disease vs. health, tissue vs. tissue, or normal vs. drug treated. Antibodies can be used to tag the profiled proteins, or the proteins themselves can be hapten derivatized, which in turn become targets for the immuno-RCA signal amplification complex. Hapten derivatization of the profiled proteins is one way to make this a universal assay.

Current technologies for protein profiling and the search for biomarkers rely on traditional proteomic approaches, such as 2-dimensional electrophoresis (2-DE) with quantification via mass spectrometry. However, 2-DE is not ideally suited to rapid, large-scale protein expression screening. The physical process of separating proteins via 2-DE remains long, multi-step, labor intensive, and often results in irreproducible data. These attempts have proven less useful, especially for biomarkers that are serum-based. Alternative methods are sought to bypass 2-D gels, using combinations of protein arrays (protein chips), super-critical fluid chromatography (SFC), capillary electrophoresis, and mass spectrometry for protein analysis.

Other new approaches combine the power of artificial intelligence-based algorithms and high-throughput proteomic fingerprinting tools such as SELDI-TOF (artificial intelligence-based bioinformatics) to find specific proteomic patterns that can distinguish healthy from diseased patients. The ability of the pattern itself to become the diagnostic represents a new paradigm for the application of proteomics to clinical specimen analysis and disease diagnosis. Direct profiling of expressed proteins via SELDI brings the research team one step closer to the ultimate drug target than gene expression. Differences in expression patterns between distinct biological states (healthy/diseased tissue) should allow direct detection of biomarker patterns.
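The notion that a proteomic pattern can itself act as the diagnostic can be sketched with a very simple pattern classifier. The Python example below uses a nearest-centroid rule as a stand-in; it is not the algorithm used by any particular SELDI-based method, and the peak intensities, class labels and function names are all invented for illustration.

```python
def centroid(profiles):
    """Average a list of equal-length intensity vectors."""
    n = len(profiles)
    return [sum(values) / n for values in zip(*profiles)]

def squared_distance(a, b):
    """Squared Euclidean distance between two intensity vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(profile, centroids):
    """Return the label whose centroid lies closest to the profile."""
    return min(centroids, key=lambda label: squared_distance(profile, centroids[label]))

# Made-up peak intensities at three m/z positions for two training groups.
training = {
    "healthy": [[1.0, 5.0, 2.0], [1.2, 4.8, 2.2]],
    "diseased": [[4.0, 1.0, 2.1], [3.8, 1.2, 1.9]],
}
centroids = {label: centroid(rows) for label, rows in training.items()}

prediction = classify([3.5, 1.5, 2.0], centroids)
print(prediction)
```

The unknown profile is assigned to whichever group's average pattern it resembles most; practical methods use many more peaks, careful preprocessing and independent validation sets.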

Protein expression analysis can indicate what proteins are expressed, but it is also important to know where proteins are expressed, and where they go over time (as with secreted proteins). By mapping relative distribution of proteins, abundance, tissue specificity, and movement (in healthy versus diseased tissue and in control versus treated tissue), one can gain a greater understanding of these proteins’ functions and determine which are likely to be the best drug targets. Fluorescence microscopy may be employed for protein localization studies.

13.3.1.5 Metabolic Engineering Approach

Metabolomics is the analysis of cellular metabolites, and provides a powerful tool for gaining insight into functional biology. Monitoring of the level of numerous small molecules within a cell, and how those levels change under different conditions, is complementary to gene expression and proteomic studies. Metabolic profiles of bodily fluids such as plasma, cerebrospinal fluid and urine reflect both normal variation and the physiological impact of disease and pharmaceuticals on organ systems. The metabolic engineering (ME) approach is a “knowledge-based” alteration (by recombinant DNA and other genetic techniques) of the metabolic pathways found in an organism, in order to better understand and use cellular pathways for chemical transformation, energy transduction, and supramolecular assembly.

Metabolic profiling is actively being applied to studies of drug toxicity, drug efficacy and model organisms, as well as humans and plants. In vitro screening of the metabolic characteristics of a new chemical entity would help to mimic in vivo conditions. When integrated with gene-expression profiles, metabolic profiling provides plausible explanations and testable hypotheses for the interactions regulating the observed expression changes.

13.3.2 Therapy

Most prescription drugs have side effects, and a certain percentage of patients do not get the desired benefit from the drug treatment. Inherited genetic differences (represented by SNPs) not only alter our susceptibility to disease but can also affect the response of individual patients to administered drugs. Genetic variations that affect the efficacy of a drug involve polymorphisms of many genes related to (i) drug metabolism, (ii) drug transport, and (iii) drug targets.

[Drug metabolism] + [Drug transport] + [Drug targets] = [Drug Effects]

Pharmacogenomics attempts to identify disease genes, as well as to find new and individualized therapies, based on the knowledge of gene polymorphisms. Gene therapy, the medical procedure that involves replacing, manipulating, or supplementing nonfunctional genes with healthy genes to treat human disease, is one such attempt. In future, a genetic blueprint will allow screening of genetic sequence variations to tailor individualized treatment, so that therapies are safer and more effective.

EXERCISE MODULES

1. What are the goals of protein engineering?

2. What are the aims of disease gene identification?

3. What is the impact of genome projects on disease gene identification?

4. How do you identify monogenic diseases?

5. What are the difficulties in identifying polygenic diseases?

6. How does genomic data help in assessing polygenic diseases?

7. What are the protocols for genetic screening?

8. Comment on future therapeutic methods in disease control.

14 Molecular Engineering

Mutation in the evolutionary process is Nature’s answer to the need to produce molecules with altered characteristics that can survive and propagate in altered environmental situations. But the timeframe of the evolutionary process is large, and it is essentially a trial-and-error process. Therefore, the major objective of all molecular engineering methods is to produce molecular species (mutant species), on a laboratory time-scale, with altered (improved) characteristics to suit required situations. The objective of selective mutations, by laboratory techniques, is to have better proteins. The same rationale is behind rational drug design procedures. The goal of protein engineering methods is, therefore, to apply computational procedures (computational biology) to design proteins with desired characteristics, to obtain these structural mutants by experimental methods of molecular biology, and to test (validate) their functions.

The focus of molecular medicine is translating the understanding of health and disease at the cellular and molecular level into the development of new and novel therapies and diagnostics (e.g. gene therapy, DNA-based testing, vaccine design). Research in comparative genomics has yielded valuable insight into the mechanisms of transcription, and the function of non-coding DNA. This new level of understanding will enable drug discovery researchers to put genomic information in context, and link sequence to downstream biologic events within a broader biological context. The molecular basis of targeted therapies will enable a new class of compounds that will be more effective and less toxic than traditional classes, and deliver on the promise of genomics.

14.1 GENOMICS AND PROTEOMICS ANALYSES

Traditional methods (pharmaceutical and medicinal chemistry) of drug design/discovery have been ligand/target-centric (Fig. 14.1); but with the availability of a vast amount of genomic sequence and annotated structural and functional data, there has been a shift towards gene-centric drug design/discovery procedures (Fig. 14.2).

14.1.1 Genomics Approach

The genomic approach is rapidly transforming the ways in which new drugs are discovered, developed, and ultimately, prescribed to the patients in need. By applying an understanding of the human genome to the study of disease, the genomics-based approach has reduced the major bottlenecks in drug discovery that beset the conventional methods. Genomics-based drug discovery has the potential to identify those target molecules that underlie disease processes themselves, as opposed to symptoms. However, good methods of target identification and validation will be necessary to realize this potential. The major trend in lead optimization is the movement toward in silico (computational) and in vitro high-throughput screening (HTS) approaches.

Technical advances (e.g. PCR and blotting methods, gene chips, supercritical fluid chromatography (SFC), and 2-D electrophoresis-MS), the availability of genetic markers, annotated genomic and proteomic data, and high-resolution SNP maps generated by the genome project studies have revolutionized the study of human genetic variations and rational drug design/discovery procedures. The latest approaches, such as artificial intelligence-based algorithms and high-throughput proteomic fingerprinting tools (e.g. SELDI-TOF), in which the MS pattern itself becomes the diagnostic, enable the application of proteomics to clinical specimen analysis and therapy.

Fig. 14.1 Flowchart of Ligand/Target-centric Drug Design/Discovery

Fig. 14.2 Flowchart of Genome-centric Drug Design/Discovery

As sequencing technologies progressed, the genome databases facilitated the rapid cloning of novel genes, and the inference of putative functions from the comparison of expressed sequence tags (ESTs: complementary DNA (cDNA) sequences generated from the mRNAs of genes expressed in the cells of various organisms). Such database information has provided (and provides) a framework from which to develop more rational treatment strategies, such as genotype-specific therapies. The general approach in rational drug design, utilizing genomic data, is:

1. Identifying variations within specific genes that cause or predispose to disease.

2. Identifying gene-environment (DNA-DNA, and DNA-protein) interactions that might have pharmacogenomic implications.

3. Identifying variations in immune response genes, which have implications for vaccine development.

14.1.2 Proteomics Approach

The general concept of ascribing function to new proteins by discovering small molecule ligands (such as drugs, nutrients and toxins) is referred to as chemical proteomics. The chemical proteomics approach makes use of synthetic small molecules that can be used to covalently modify a set of related enzymes and subsequently allow their purification and/or identification as valid drug targets. Furthermore, such methods enable rapid biochemical analysis and small-molecule screening of targets, thereby accelerating the often-difficult process of target validation and drug discovery. The method uses labeled irreversible protease inhibitors to isolate or identify active proteases in complex mixtures by two-dimensional (2-D) gel electrophoresis or by using protease-activity chips with MALDI–TOF or MALDI–quadrupole–TOF (MALDI–Q–TOF) mass spectrometric identification of the captured proteases.

In proteomics analysis, characterization of novel proteins to the sub-family level (e.g. signal transduction proteins etc.) is of paramount importance in rational drug discovery studies. Computational tools, such as sequence alignment, profile matching, HMMs, homology modeling, and neural networks, are employed to bring out many hidden patterns and relationships in sequence, 2-D gel and MS databases. Sequence similarity in places along the polypeptide chain where the conservation is highest, like the active site, can be used to predict substrate (ligand) specificity, analyze structure-function relationships, and design inhibitor analogs/drugs. Even a 3-D global view of a protein is helpful for mapping residues proximal in space that are far apart in sequence and for utilization of site-directed mutagenesis results, and it provides a structural context for the analysis of structural mutants.
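Locating the places along the chain where conservation is highest can be illustrated with a minimal Python sketch. It assumes a multiple sequence alignment is already available; the toy alignment, the gap symbol "-" and the function name `column_conservation` are assumptions made for illustration, and the score used (fraction of sequences sharing the most common residue) is only one of several possible conservation measures.

```python
from collections import Counter

def column_conservation(alignment):
    """Return, per column, the fraction of sequences sharing the most common residue."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]   # ignore gap characters
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores

# Toy alignment of four short, already aligned fragments.
alignment = ["GDSGGP", "GDSGAP", "GDSGGA", "GESGGP"]
scores = column_conservation(alignment)
conserved = [i for i, s in enumerate(scores) if s == 1.0]
print(conserved)
```

Columns that score 1.0 are fully conserved in this toy alignment and would be the first candidates to inspect when looking for residues tied to function.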

Knowledge of the 3-D structure of a protein, and in particular of the site of interaction with ligands, allows computational screening of large libraries of compounds for their binding potential to a specific site on the protein. Screening the 3-D structures of proteins with unknown functions (from databases) against substrates and co-factors, together with molecular modeling, would help in the optimization of drug candidates and the identification of potential lead compounds.

Knowledge of amino acid composition, and bulk properties (pI, Mr, and shape) of proteins can be particularly useful in isolation, purification, and characterization of any newly identified proteins. There are several specific databases available for protein property prediction, based on physicochemical properties, shape and function. Computational chemistry plays an important role in these studies.
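Two of the bulk properties mentioned above, amino acid composition and Mr, can be estimated directly from the sequence. The Python sketch below is a minimal illustration: it uses rounded average residue masses for the 20 standard amino acids, adds one water molecule for the free termini, and ignores post-translational modifications. The function names and the example peptide are assumptions made for illustration.

```python
from collections import Counter

# Rounded average residue masses (Da) of the 20 standard amino acids.
AVERAGE_RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER_MASS = 18.02  # one water per chain accounts for the free termini

def amino_acid_composition(sequence):
    """Return {residue: fraction of the sequence} for residues that occur."""
    counts = Counter(sequence)
    return {aa: counts[aa] / len(sequence) for aa in sorted(counts)}

def approximate_mass(sequence):
    """Estimate the average molecular mass (Da) of an unmodified chain."""
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in sequence) + WATER_MASS

peptide = "GAGK"  # hypothetical tetrapeptide
print(amino_acid_composition(peptide))
print(round(approximate_mass(peptide), 2))
```

Estimating pI would additionally require pKa values for the ionizable groups and is left out of this sketch.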

Certain amino acids remain highly conserved even among diverse members of protein families. These highly conserved sequence patterns are called “signature sequences”, and in many cases they define the active site of a protein (the PROSITE database is an excellent choice). Post-translational modifications of proteins (glycosylation, phosphorylation) greatly affect the structure and function of proteins. Identification of these sites can be quite useful in understanding structure-function relationships.
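Signature patterns of the PROSITE kind can be searched for with ordinary regular expressions. The Python sketch below converts a simple PROSITE-style pattern into a regular expression and scans a sequence with it. It handles only the basic elements (`x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, and repeat counts such as `x(2,4)`), not the `<` and `>` terminal anchors; the function name and the test sequence are assumptions made for illustration. The example pattern, `N-{P}-[ST]-{P}`, is the N-glycosylation site motif.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split off an optional repeat count such as x(2,4) or A(3).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any residue except those listed
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)

regex = prosite_to_regex("N-{P}-[ST]-{P}")
sequence = "MKNGSAANPSL"  # hypothetical sequence
hits = [(m.start(), m.group()) for m in re.finditer(regex, sequence)]
print(regex, hits)
```

In the toy sequence only the first asparagine is reported, because the second is followed by proline, which the pattern excludes. A match of this kind marks a potential site only; it does not show that the site is actually modified.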

In general, protein structure prediction methods employ high-resolution 3-D structural features of biologically relevant sites. However, attempts are underway to utilize low-resolution protein structural data for biochemical function assignment: methods that automatically generate a library of 3-D functional descriptors for the structure-based prediction of enzyme active sites, based on functional and structural information automatically extracted from public databases. Many interaction databases are available for such molecule-molecule interactions. Some of these are:

Aminoacyl-tRNA synthetase database : (http://rose.man.poznam.pl/aars/index.html). Contains aminoacyl-tRNA synthetase (AARS) sequences for many organisms. Collected pairs of AARS + tRNA can be used to create RNA-protein interaction records.

BIND : (http://bioinfo.mshri.on.ca/BIND/BIND_prop/index.html). Biomolecular Interaction Network Database. With BIND, computer simulations of whole-cell models of disease processes spanning medicine to agriculture will be possible.

BRENDA : (http://www.brenda.uni-koeln.de/). A database for enzymes that contains annotated information on enzymes: structure, reaction, specificity, post-translational modifications, and cross-references to structure databanks.

COMPEL : (http://compel.bionet.nic.ru/). Databank containing protein-DNA and protein-protein interactions of composite regulatory elements (CREs).

DIP : (http://dip.doe_mbi.ula.edu). Database of Interacting Proteins. Contains information on protein-protein interactions. Also gives the experimental methods used to determine the interactions.

EMP : (http://wit.mcs.aml.gov/EMP/). An enzyme database that is chemical reaction-based.

ENZYME : (http://www.expasy.ch/enzyme/). This database is an annotated extension, linked to SWISS-PROT, and deals with enzyme structure-function features.

FIMM : (http://sdmc.krdl.org.sg.8080/fimm/). Provides information on protein interactions that are immunologically important.

GeneNet : (http://www.mgs.bionet.nsc.ru/systems/mgl/genenet/). Describes genetic networks from gene through cell to organism level using a chemical reaction-based formalism.

KEGG : (http://www.genome.ad.jp/kegg/). Kyoto Encyclopedia of Genes and Genomes. The databank provides information on many metabolic and regulatory pathways, some of them as graphical diagrams.

PFBP : (http://www.ebi.ac.uk/research/pfmp). Protein Function and Biochemical Networks. The aim of the databank is to provide details on metabolism, gene regulation, transport, and signal transduction. It uses a graph abstraction for the interaction data and can describe chemical reaction pathways.

WIT : (http://wit.mcs.anl.gov/WIT2). The What Is There (WIT) database is aimed at reconstructing metabolic pathways in newly sequenced genomes by comparing predicted proteins with proteins in known metabolic networks.

14.2 RATIONAL DESIGN

The rapidly growing body of structural information emerging as a result of genomics-derived targets and the industrialization of protein structure determination is dramatically altering computer-assisted rational molecular/drug design approaches, in particular direct structure-based drug design, which combines structural biology with computational and medicinal chemistry (e.g. 3D-QSAR) in order to design drugs rather than merely selecting drugs that modulate a protein target of interest. Combining medicinal chemistry, computer-aided design, and biochemistry enables accelerated progress from “target-to-hit,” “hit-to-lead,” or “lead-to-candidate.”

The first step towards molecular engineering rests on the availability of three-dimensional structure information on the molecule under consideration (or on its close structural homologues). Ab initio prediction of three-dimensional structures of proteins from the primary structure data is not possible at present. Therefore, knowledge of the three-dimensional structure of a protein or its homologue(s) is a prerequisite for rational design of improved “mutants”. The involved technical procedures and constraints of the experimental methods of structure determination (primarily X-ray diffraction and NMR spectroscopy), and the time they require, are hindrances to obtaining experimental structural data. While matter in the single-crystalline state is a prerequisite for X-ray analysis, the size of the protein is the limiting factor in the case of NMR spectroscopic methods.

In the absence of experimental data on the tertiary structures of proteins, computer-aided model building, on the basis of the known three-dimensional structure of a homologous protein, is at present the only reliable alternative for obtaining structural information for the design of new proteins. In silico modeling, that is, modeling of biological pathways and processes, and predictive simulation of cellular processes hold the potential to enhance both efficacy and efficiency throughout the drug discovery and development process.

Though molecular/drug design is generally computer-aided, incorporation of the pharmacokinetics and molecular properties of intended drug molecules would greatly help in design protocols. The de novo design of bioactive compounds, by incremental construction of a ligand model within a model of the receptor or enzyme active site, the structure of which is known from X-ray or NMR data, is becoming a valuable and integral part of drug discovery. The input of biocomputing in drug discovery is twofold: (i) the computer may help to optimize the pharmacological profile of existing drugs by guiding the synthesis of new and “better” compounds; (ii) as more and more structural information on possible protein targets and their biochemical role in the cell becomes available, completely new therapeutic concepts can be developed. Computer analysis helps in both steps: to find out about possible biological functions of a protein by comparing its amino acid sequence to databases of proteins with known function, and to understand the molecular workings of a given protein structure. Understanding the biological or biochemical mechanism of a disease then often suggests the types of molecules needed for new drugs.

The general strategy of all “computer-aided”, “rational-design” procedures is to incorporate experimental data, sequence homologies, and information from structural databases and physicochemical parameters as “input data” for designing “new molecules” (designed structural mutants). The experimentally determined structures, or modeled structures that are validated with a high level of confidence, are used as “lead molecules” for rational design of desired molecules.

The efficacy of these “designed structural mutants” for their altered functions can be determined by obtaining the molecular species (or some of them) experimentally (e.g. by gene manipulation, site-directed mutagenesis, or de novo synthesis) and testing their functions vis-à-vis their altered structures. The procedure is an iterative process of “experimental data – theoretical structure prediction – experimental validation” (Fig. 14.3).

Fig. 14.3 Operational Procedures in Molecular Engineering

14.2.1 Information from Experimental and Structural Databases

Three of the most prominent uses of modern molecular modeling applications are (i) structure analysis, (ii) homology modeling, and (iii) docking. Information from experimental methods consists of three-dimensional structural data from X-ray crystallography and NMR spectroscopy, and partial structural and functional information from other spectroscopic, physicochemical (e.g. electrophoresis and mass spectrometry), and biomolecular (e.g. gene expression microarrays) methods. All procedures in computational biology, structure prediction methods, and computer-aided design depend on the information from experimental and structural databases as inputs for computer-aided modeling (“knowledge-based” computer modeling) of possible tertiary structures, which are then selectively redesigned and tested to validate the prediction.

Bioinformatics offers a means to get to a structure through sequence, while structure-aided molecular design offers a means to get to a molecule/drug through structure. In essence, it is a blend of computational chemistry with computational biology to create software that will aid protein chemists in understanding, evaluating and predicting the structure, function and activity of medically and industrially important proteins/molecules. The procedure incorporates as inputs the propensity profiles of amino acids in the tertiary structure folding of proteins, the invariant, semi-invariant and variant residues in the sequence, and other structural and functional information. Computational methods of design involve optimizing the types of mutations, their positions in the sequence, their stereochemical compatibility (e.g. Ramachandran conformational maps), and overall thermodynamic stability by energy minimization and other mathematical procedures. Rationally designed proteins with expected functions can be synthesized by experimental methods that involve molecular genetics protocols.

The aim of all computer-aided rational drug/molecular design (CAMD) protocols is to use computer-assisted techniques to discover, design and optimize compounds with desired structure and properties. One such technique, three-dimensional quantitative structure-activity relationships (3D-QSAR), analyzes the quantitative relationship between the biological activity of a set of compounds (with a putative use as drugs) and their three-dimensional properties using statistical correlation methods. The general approach is:

(i) Lead optimization by considering both receptor binding and pharmacologically important properties.

(ii) Conformational search in a protein’s binding site to find the optimal positioning of ligands. This is carried out by subdividing molecules into fragments, searching the conformations of the fragments, and assembling the fragment conformations into molecular conformations (combinatorial chemistry).

(iii) Computational screening of ligands for desired drug-like properties, in which a scoring function is used to rate the binding affinity or activity of each trial ligand.
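
The statistical-correlation idea behind QSAR can be illustrated with a deliberately reduced sketch. It is not a 3D-QSAR implementation: it assumes a single numerical descriptor per compound (here called a hydrophobicity index) and fits a straight line to measured activity by ordinary least squares. The descriptor values, the activity values and all function names are hypothetical and serve only to show the form of the calculation.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Ordinary least squares fit of activity = slope * descriptor + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training set: one descriptor value and one measured
# activity per compound (illustrative numbers, not experimental data).
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0]
activity = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(descriptor, activity)
r = pearson_r(descriptor, activity)


def predict(descriptor_value):
    """Predicted activity of an untested compound from the fitted line."""
    return slope * descriptor_value + intercept


print(f"slope={slope:.2f} intercept={intercept:.2f} r={r:.3f}")
print(f"predicted activity at descriptor 6.0: {predict(6.0):.2f}")
```

In practice many descriptors (steric, hydrophobic, electronic) are correlated with activity at once, and the fitted model is validated on compounds that were not used for the fit.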

14.2.1.1 Screening

A host of screening methods is available for drug/molecular design: bioassays, chemical assays (e.g. ELISA, radioimmunoassays), electron microscopy, and structure-based screening methods. The combination of combinatorial chemistry and high-throughput screening (HTS) has greatly accelerated drug discovery and development.

Immunoassays are ligand-binding assays. There are three basic components in any immunoassay: (i) a specific antigen or antibody capable of binding to the analyte, (ii) the antigen to be detected and/or quantified, and (iii) a system to measure the amount of the antigen in the sample. The antibody can be linked to a radioisotope (e.g. radioimmunoassay, RIA), or to an enzyme that catalyses a monitored reaction (enzyme-linked immunosorbent assay, ELISA), or to a fluorescent compound by which the location of an antigen can be visualized (immunofluorescence).

Electron microscopy-based drug screening enables direct imaging of minimally perturbed cells, tissues, and microbes at molecular resolution. Automation in electron microscopy opens up a new sphere of high-content drug assays.

Structure-based screening methods for drug discovery targets have recently emerged as an alternative and complementary tool to conventional high-throughput bioassay-based screening. They combine the power of NMR spectroscopy, X-ray crystallography, and automatic docking, which provide the means to apply structural information (from NMR, X-ray, and modeling) to identify hits, select targets, and optimize the hits in terms of their affinities and specificities. While chemistry on structure-based hits can be aided by X-ray crystal structures of ligand-target complexes, such complexes are often difficult to crystallize. Therefore, structure-based NMR screening is a better alternative for identifying drug-like small-molecule hits from customized libraries. NMR-detected hits are turned into leads through chemical optimization that is guided by 3-D structural data.

14.2.1.2 Virtual Screening

Though the three-dimensional molecular structure is one of the foundations of structure-based drug design, the data available are often for the shape of a protein and a drug separately, but not for the two together. Docking programs perform automated docking of ligands (small molecules such as a candidate drug) to their macromolecular targets (usually proteins, sometimes DNA). These programs select molecules predicted to be highly complementary to the receptor structure and can screen many such ligands against the protein (virtual screening). Virtual screening (in silico screening) technology offers the ability to screen many more compounds at once than the traditional laboratory-based method.
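
The screening loop described above (score every ligand in a library against one receptor, then keep the best-ranked candidates) can be sketched as follows. The scoring function is a toy stand-in for a real docking score: it assumes the binding site has been reduced to a list of feature types, one per sub-pocket, and simply counts complementary contacts. The feature vocabulary, the ligand names and the function names are all hypothetical.

```python
# Pairs (site feature, ligand feature) treated as complementary in this toy model.
COMPLEMENTARY = {
    ("donor", "acceptor"),
    ("acceptor", "donor"),
    ("hydrophobic", "hydrophobic"),
}


def toy_score(site, ligand):
    """Count sub-pockets in which the ligand feature complements the site feature."""
    return sum((s, l) in COMPLEMENTARY for s, l in zip(site, ligand))


def virtual_screen(site, library, top_n):
    """Score every ligand and return the top_n as (name, score), best first."""
    scored = [(toy_score(site, features), name) for name, features in library.items()]
    # Highest score first; ties are broken alphabetically by ligand name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, score) for score, name in scored[:top_n]]


# Hypothetical receptor site with four sub-pockets.
site = ["donor", "hydrophobic", "acceptor", "hydrophobic"]

# Hypothetical ligand library: the feature each ligand places in each sub-pocket.
library = {
    "lig-1": ["acceptor", "hydrophobic", "donor", "hydrophobic"],
    "lig-2": ["donor", "hydrophobic", "donor", "acceptor"],
    "lig-3": ["acceptor", "acceptor", "acceptor", "hydrophobic"],
    "lig-4": ["hydrophobic", "donor", "hydrophobic", "donor"],
}

hits = virtual_screen(site, library, top_n=2)
print(hits)
```

A real docking program also searches over ligand poses and conformations and uses an energy-like score; only the ranking-and-selection step is represented here.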

14.3 VALIDATION

The validity of these “designed structural mutants” for their altered functions can be verified by obtaining the molecular species (or some of them) experimentally (e.g. by gene manipulation, site-directed mutagenesis, or de novo synthesis) and testing their functions vis-à-vis their altered structures (Fig. 14.3). Their three-dimensional structures can be determined by the existing physical techniques for structure determination, namely NMR spectroscopy and X-ray crystallography (by difference-Fourier techniques).

Structural features (folding) of a protein may not be grossly altered by mutation, but mutation(s) of the crucial residue(s) may lead to drastic changes in the functional features of that protein. Mutations at a particular site are determined by the changes that occur at the neighboring sites in the core; that is, steric constraints play an important part in specifying the detailed properties of a protein. Nevertheless, there are many examples where a single-point mutation at the crucial position of a protein (or an enzyme) leads to drastic effects on its function. Protein engineering (molecular tinkering) at the genetic level is aimed at changing the functions of the protein under consideration by altering the amino acid residue(s) at crucial positions. Such studies can probe the structure-function relationships of macromolecules at the molecular level.

Site-directed mutagenesis is one of the strategies in molecular biology (genetics) to modify the functions of proteins by rational replacement of amino acid residue(s) at crucial site(s) by single-point mutation or by cassette mutation (randomly altering a set of selected residues). This is a novel strategy to inspect reaction mechanisms, alter chemical behavior, and improve the structural (thermo-stability) and chemical (functional) characteristics of proteins. Functions of enzymes can be analyzed by altering amino acid residue(s) at the crucial position(s). Such methods are a harbinger of protocols for designing enzyme activation/inactivation and control. Site-directed mutagenesis, combined with electrophysiological methods, opens up the possibility of detailed analysis of voltage-dependent gating of membrane proteins and, perhaps, their eventual design.
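
At the sequence level, a single-point replacement of this kind amounts to a simple bookkeeping operation, sketched below. The sketch assumes the common shorthand in which a mutation is written as wild-type residue, 1-based position, new residue (for example K2A); the function name and the short peptide are hypothetical. Checking the wild-type residue before substituting guards against applying a mutation to the wrong position or the wrong sequence.

```python
import re

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def apply_point_mutation(sequence, mutation):
    """Return the mutant sequence for a mutation written like 'K2A'.

    Raises ValueError if the notation is malformed, the position is out of
    range, or the stated wild-type residue does not match the sequence.
    """
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"malformed mutation: {mutation!r}")
    wild = match.group(1)
    position = int(match.group(2))
    new = match.group(3)
    if wild not in AMINO_ACIDS or new not in AMINO_ACIDS:
        raise ValueError(f"unknown residue in {mutation!r}")
    if not 1 <= position <= len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    if sequence[position - 1] != wild:
        raise ValueError(
            f"expected {wild} at position {position}, found {sequence[position - 1]}"
        )
    return sequence[: position - 1] + new + sequence[position:]


# Hypothetical eight-residue peptide used only for illustration.
wild_type = "MKTAYIAK"
mutant = apply_point_mutation(wild_type, "K2A")
print(wild_type, "->", mutant)
```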

Automated primer generators are now available that analyze the original nucleotide sequence and the desired amino acid sequence and design a primer that has a new restriction enzyme site. These advances in gene manipulation techniques have made the design of novel proteins easier and faster, and have facilitated the study of reaction mechanisms and structure-function relationships.

De novo synthesis is another procedure in the rational design and synthesis protocols of proteins.

14.3.1 Limitations of Virtual Screening

While experimentally determined structure-function data are necessary and highly desirable for validation in computational methods of molecular/drug design, many pharmaceutical companies, on account of time and cost factors, prefer computational methods. Many statistical algorithms, predictive biosimulation, molecular dynamics and conformational analysis, and other structure-function relationship analyses are employed, grouped under virtual screening (in silico screening). 3D-QSAR methods employ statistical correlation methods, incorporating molecular parameters such as structural (steric), hydrophobic, hydrogen-bonding, and electronic features. Conformational analysis consists of the exploration of energetically favorable spatial arrangements (shapes) and molecular conformations using molecular dynamics calculations, simulation procedures (e.g. the Monte Carlo method) that randomly sample the conformational space of a molecule, or analysis of experimentally determined structural data (from NMR and X-ray structure analysis).
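
The Monte Carlo sampling of conformational space mentioned above can be illustrated on the smallest possible case, a single torsion angle. The sketch below uses the Metropolis acceptance rule, which is one common way of carrying out such sampling and is chosen here as an assumption. The energy function is a toy threefold torsional potential, not a force field, and the temperature factor, step size and function names are hypothetical.

```python
import math
import random


def torsion_energy(phi_deg):
    """Toy threefold torsional potential; minima (E = 0) at 60, 180 and 300 degrees."""
    return 1.0 + math.cos(math.radians(3.0 * phi_deg))


def metropolis_search(n_steps, kT=0.6, max_step_deg=30.0, seed=0):
    """Randomly sample one torsion angle and return the lowest-energy angle found."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    phi = 0.0  # start at an energy maximum
    energy = torsion_energy(phi)
    best_phi, best_energy = phi, energy
    for _ in range(n_steps):
        # Propose a random change of the angle, wrapped into [0, 360).
        trial = (phi + rng.uniform(-max_step_deg, max_step_deg)) % 360.0
        trial_energy = torsion_energy(trial)
        delta = trial_energy - energy
        # Metropolis rule: always accept downhill moves; accept uphill
        # moves with probability exp(-delta / kT).
        if delta <= 0.0 or rng.random() < math.exp(-delta / kT):
            phi, energy = trial, trial_energy
            if energy < best_energy:
                best_phi, best_energy = phi, energy
    return best_phi, best_energy


best_phi, best_energy = metropolis_search(n_steps=5000)
print(f"lowest-energy angle found: {best_phi:.1f} degrees (E = {best_energy:.4f})")
```

Accepting some uphill moves is what allows the walk to cross barriers between minima; a real molecule has many coupled torsion angles, so the space to be sampled grows rapidly with molecular size.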

Modern pharmaceutical research is underpinned by chemical genomics, a high-throughput approach to biology and chemistry. However, virtual screening methods are still beset with problems and uncertainties in validation protocols. The large and growing number of approaches for conducting virtual screening points both to the complexity of the process and to the difficulty of creating ideal solutions. Many virtual high-throughput screening (HTS) methods generate many hits but little, or unreliable, information. Given the large number of possible targets that are being identified by genomics, it is inconceivable that high-throughput screening (the brute-force approach) of synthesized compound libraries will be able to meet the challenge of identifying “a small molecule for every protein.” Therefore, screening should be limited to generating enough information to start chemical programs, and the goal should be to screen the fewest possible compounds. Towards this goal, computational filters and analysis are being applied for smarter design of libraries and for better selection and prioritization of compounds for screening, as well as for structure-based design and lead optimization.
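
A computational filter of the kind referred to above can be as simple as a set of property thresholds applied before any compound is screened. The sketch below uses Lipinski's rule of five (molecular weight not above 500, logP not above 5, at most 5 hydrogen-bond donors and at most 10 hydrogen-bond acceptors) as the example criterion; the rule is not named in this chapter and is chosen here purely for illustration. The compound names, their property values and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Compound:
    name: str
    mol_weight: float   # molecular weight in daltons
    logp: float         # octanol-water partition coefficient (log P)
    h_donors: int       # number of hydrogen-bond donors
    h_acceptors: int    # number of hydrogen-bond acceptors


def rule_of_five_violations(c):
    """Number of Lipinski rule-of-five criteria the compound violates."""
    return sum(
        [
            c.mol_weight > 500,
            c.logp > 5,
            c.h_donors > 5,
            c.h_acceptors > 10,
        ]
    )


def prioritize(library, max_violations=0):
    """Names of compounds with no more than max_violations rule violations."""
    return [c.name for c in library if rule_of_five_violations(c) <= max_violations]


# Hypothetical library; the property values are illustrative, not measured.
library = [
    Compound("cmpd-A", 320.4, 2.1, 2, 5),
    Compound("cmpd-B", 612.7, 4.3, 3, 9),
    Compound("cmpd-C", 455.0, 6.2, 6, 11),
]

print(prioritize(library))
print(prioritize(library, max_violations=1))
```

The max_violations parameter is left adjustable because the strictness of such a filter is a design choice: a stricter setting screens fewer compounds, at the risk of discarding a useful one.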

EXERCISE MODULES

1. What are the goals of protein engineering?

2. Why has the emphasis of drug design shifted towards a genome-centric approach?

3. Write about genomics and proteomics approaches towards drug design.

4. Comment on gene manipulation methods.

5. Which are the physical techniques for structural data information?

6. What are the essentials of rational design of proteins?

7. What are the inputs for the computer-aided rational design protocols?

8. What are the various experimental methods in screening and validation?

9. What is virtual screening?

10. What are the protocols for validation of designed structures?

11. What is site-directed mutagenesis and what is its importance in molecular engineering?

12. What are the procedures for de novo synthesis?

13. What are the limitations of virtual screening?

Glossary

Adhesion Force of attraction between unlike molecules.

Algorithm A methodical, logical sequence of steps, typically involving a repetition of operations, by which a task can be performed.

Alignment The result of comparison of two or more sequences to determine the degree of their similarity.

Alignment score Computed sum based on the number of matches, insertions, and deletions within an alignment.

Alleles Mutually exclusive forms of the same gene, occupying the same locus on homologous chromosomes, and governing the same biochemical and developmental process.

Allosteric Refers to a change in the properties (usually including shape) of a protein following the binding of another molecule to the protein.

Amphiphilic Molecules containing both polar (hydrophilic) and apolar (hydrophobic) groups.

Ampholyte Small molecule with positive and negative charges.

Analogue Non-homologous proteins that have similar structural folding architecture but have arisen through convergent evolution.

Annotation Adding pertinent information, comments, or notations (such as the genes coded for or the amino acid sequences) to sequence data.

Anticodon A triplet of contiguous nucleotides on tRNA that binds to the triplet of contiguous nucleotides (codon) on mRNA.

Atomic-force microscopy A type of scanning-probe imaging technique that provides high-resolution topological maps.

Base-pairing Complementary hydrogen-bonded base pairs (e.g. A=T; C≡G) in nucleic acids.

Bayesian Technique A stochastic procedure used to estimate parameters of a distribution based on an observed distribution.

Biochips Miniaturized arrays of a large number of oligonucleotides (DNA microarrays).

Bioinformatics An interdisciplinary subject concerned with analyzing the sequence data of nucleic acids and proteins and predicting the structure and function of biological macromolecular complexes.

Biomarkers Measurable and quantifiable biological parameters (e.g. specific enzymes or biological substances).

Biphotonic excitation The simultaneous (coherent) absorption of two photons (of either the same or different wavelengths), the energy of excitation being the sum of the energies of the two photons.

BLAST Basic Local Alignment Search Tool; used in genome informatics for similarity search of DNA and protein sequences.

BLOCKS Conserved, ungapped, aligned sequence segments in a set of related proteins.

CARS microscopy A “chemical microscopy” that is based on Raman spectroscopy.

Catalyst Substance that accelerates the rate of a chemical reaction without being used up in the process.

cDNA Complementary DNA that is synthesized in the laboratory from an mRNA template using reverse transcriptase.

CDR sequence Coding region sequence: an uninterrupted nucleic acid sequence that codes for a protein.

“Chemical shift” Shielding of atomic nuclei from the influence of the external magnetic field.

Chromatography A physical technique used to separate mixtures of substances based on differences in the relative affinities of the substances for mobile and stationary phases.

Chromosome Self-replicating genetic structure in cells, which contains DNA sequences that constitute genes.

Clone An exact copy of a gene, a cell or an organism (obtained asexually).

Cloning The process of generating identical copies of a DNA fragment from a single template DNA.

Cluster analysis A procedure of grouping together a set of objects from a large group of related objects.

Cobbler A single sequence that represents the most conserved regions in a multiple sequence alignment.

Coding sequence (CDS) A region of DNA or RNA that codes for the sequence of amino acids in a protein.

Codon A triplet of contiguous nucleotides on mRNA that codes for an amino acid.

COG Clusters of orthologous groups that include orthologs and paralogs.

Confocal microscopy A type of aperture-based light microscopy.

Conformation Spatial arrangement of structural moieties due to rotation around a single bond.

Consensus sequence A pseudo-sequence that summarizes the residue information contained in a multiple sequence alignment.

Conserved sequence A base sequence in DNA that has remained essentially unaltered throughout evolution.

Contig A length of contiguous sequence assembled from partial, overlapping sequences generated in a “shotgun” sequencing project.

Correlation A statistical measure that indicates the extent to which two factors vary together and thus how well either factor predicts the other.

Dalton Unit of mass that is equivalent to one-twelfth the mass of an atom of carbon-12 (approximately the mass of a hydrogen atom).

Databases Computer-based organization of sequence and structural data of biomolecules.

Data mining Search of non-trivial information or relationships from the databases.

Dendrogram A graphical representation of an evolutionary tree from the output of a hierarchical clustering method.

Differential Gene Expression Screening technologies that are widely used for target validation.

Directed mutagenesis Alteration of DNA at a specific site and its reintroduction into an organism to study any effects of the change.

DNA A complex molecule containing the genetic information that makes up the chromosomes.

DNA fingerprinting A procedure in which multilocus band patterns of a DNA sample are generated by digestion of the DNA with restriction enzymes followed by electrophoresis and visualization by hybridization with probes specific for repetitive sequences.

DNA footprinting A method for determining the sequence specificity of DNA-binding proteins. The method utilizes a DNA damaging agent, which cleaves DNA at every base-pair; DNA cleavage is inhibited where the ligand binds to DNA.

Docking A type of virtual screening drug design technology to evaluate binding of ligands to macromolecular targets.

Domain A discrete portion of a protein with a unique function.

Doppler effect Apparent change of frequency due to motion of the source relative to the observer.

Dotplot A graphical comparison of two sequences.

Dynamic programming A mathematical method of solving a complex problem by combining solutions to sub-problems.

Electrophoresis Molecular separation method, based on the migration of charged particles under the influence of an applied electric field.

ELISA Enzyme-Linked Immunosorbent Assay: an immunoassay utilizing an antibody labeled with an enzyme marker.

Encoding The processing of information into the memory system, for example by extracting meaning.

EST Expressed Sequence Tag: a partial sequence of a cDNA clone, which can act as an identifier of a gene.

Evolution Diversification and mutation of living organisms.

Exon The protein-coding region of a DNA sequence.

Expressed Sequence Tag (EST) A partial sequence of a DNA molecule that is a part of a cDNA molecule and can act as an identifier of a gene.

FASTA A sequence similarity search algorithm.

Fingerprint A group of ungapped motifs characteristic of a family member.

FISH Fluorescence in situ hybridization: a method in which target sequences are stained with fluorescent dye so that their location and size can be determined using fluorescence microscopy.

Frame-shift An alteration in the reading frame of DNA resulting from an inserted or deleted base.

Frame-shift mutation A type of mutation in which a number of nucleotides not divisible by three is deleted from or inserted into a coding sequence.

Free induction decay The time-dependent decay profile (signal amplitudes as a function of time) of a pulsed NMR signal.

Gap Mismatch in the alignment of two sequences caused by insertion or deletion.

Genes The fundamental structural and functional units (nucleic acids) of heredity that code for proteins.

Gene duplication A particular kind of mutation: production of one or more copies of any piece of DNA, including a gene or even an entire chromosome.

Gene expression The processes (transcription and translation) of converting genetic code into amino acid sequences.

Gene informatics Database searches to analyze sequence information and predict structure and function from the sequence data.

Genetic code The library of contiguous triplet nucleotides (codons) that code for the 20 amino acids and stop signals.

Genetic map A linear arrangement of the relative positions of genes along a chromosome.

Gene marker A gene or other identifiable portion of DNA, in one haploid set of chromosomes, whose inheritance can be followed.

Genome The complete set of all genetic material present in the chromosomes of an organism (measured as number of base pairs).

Genomics Study of gene structure and functions.

Genotype Genetic composition of an individual.

Genotyping The determination of relevant nucleotide-base sequences in each of the two parental chromosomes.

Global alignment The alignment of two nucleic acid or protein sequences over their entire length.

Half-life The time needed for (1) half the atoms of a radioactive substance to decay or (2) half the amount of a substance (e.g., a drug) to be metabolized or excreted.

H-T-H motif Helix-turn-helix structural motif found in proteins (e.g. in many of the prokaryotic transcriptional regulatory proteins).

Heuristic Empirical procedure (“rule-of-thumb” strategy) to solve a problem based on “tested-and-correct” rules.

Homology Relationship between two or more biological species, systems or molecules that are related by divergent evolution from a common ancestor.

Hydrolysis Decomposition of a substance by the insertion of water molecules between certain of its bonds.

Hydrophilic Water-loving.

Hydrophobic Water-repelling.

Image reconstruction The mathematical procedures involved in the transformation of the diffraction patterns for structure determination.

Immunoassay A ligand-binding assay that uses a specific antigen or antibody, capable of binding to the analyte, to identify and quantify substances. The antibody can be linked to a radioisotope (radioimmunoassay, RIA), or to an enzyme (ELISA).

Immunoelectrophoresis Combination of gel electrophoresis and immunodiffusion methods.

Indel An insertion or deletion in sequence alignment.

Information theory Procedure that collects information in terms of bits, the minimal amount of structural complexity needed to encode a given piece of information.

In silico analysis Analysis by computational (theoretical) methods, in contrast to experimental methods of data acquisition and interpretation.

Intron A region of DNA in a gene that does not encode protein.

Ion Atom or group of atoms that has an electrical charge arising from the gain or loss of electrons.

Ionic bond Chemical bond formed between ions of opposite charge.

Isoelectric point pH at which a polar (charged) molecule has a zero net charge.

Isomer Molecule with the same molecular formula as another but with a different structural formula (e.g., glucose and fructose).

Isotonic Having the same concentration of water as the solution under comparison.

Isotope Atom that differs in weight from other atoms of the same element because of a different number of neutrons in its nucleus.

Iterative A sequence of operations that is performed repeatedly.

Kohonen map An unsupervised self-organization (clustering) algorithm.

K-tuple Identical short stretch of sequences, also called words.

Leucine-zipper A structural motif found in DNA-regulatory proteins.

Ligand A molecule that binds to another molecule or to a cell.

Local alignment The alignment of some portion of two nucleic acid or protein sequences.

MALDI-TOF-MS A mass spectrometric technique that is used for the analysis of biological macromolecules.

Mapping Determination of the physical location of a gene on a chromosome.

Markov model A mathematical procedure based upon states and transition probabilities used in sequence alignment.

Microarrays Arrays in which nucleic acids representing genes are spotted on a substrate and then tested against a sample to evaluate mRNA levels, and thus gene expression.

Molecular engineering Computational design (drug design) of molecular species with required structural and functional characteristics.

Molecule Smallest particle of a covalently bonded element or compound that retains the properties of that substance.

Motif An aggregate of secondary structural elements forming a super-secondary structure.

MRI Magnetic resonance imaging–a non-invasive imaging technique based on locating nuclearspins in a specimen.

Multiple sequence alignment An alignment of three or more sequences with gaps inserted inthe sequences such that residues with common structural positions are aligned in the samecolumn.

Multiplex sequencing A sequencing approach that uses several pooled samples, greatlyincreasing sequencing speed. Used in high-throughput sequencing.

Mutation A heritable change in the nucleotide sequence of genomic DNA (or RNA in RNAviruses).

Neural Networks Artificial neural networks (ANN) are a collection of information-process-ing mathematical models that draw on the analogies of adaptive biological learning.

NMR spectroscopy A versatile spectroscopy method that is used in structure determinationof biomolecules.

NOE Nuclear Overhauser Effect–interaction between the dipole moments of two nuclei inspatial proximity–provides information about the distance between nuclei.

NOESY An NMR technique used to help determine protein structures. It reveals how closedifferent protons (hydrogen nuclei) are to each other in space.

Northern blotting An analysis technique for the identification of RNA.

NSOM Near-field Scanning Optical Microscopy—scanning probe imaging technique thatcircumvents diffraction limit, thus improves image resolution.

Odds score The ratio of the likelihood of two events or outcomes. In sequence alignment, it isthe ratio of the frequency with which two characters are aligned in related sequencesdivided by the frequency with which those same two characters align by chance alone.

Open reading frame (ORF) The sequence of DNA or RNA located between the start-code (initiation codon) and the stop-code (termination codon) that encodes a gene. An ORF is potentially able to encode a polypeptide.
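
As a rough illustration of the ORF definition above, the following sketch scans the three forward reading frames of a DNA string for a stretch that begins with ATG and ends at the first in-frame stop codon. The function name `find_orfs`, the `min_codons` parameter, and the example sequence are hypothetical; the reverse strand and alternative start codons are deliberately ignored.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def find_orfs(dna, min_codons=1):
    """Return (start, end, sequence) for ORFs on the forward strand.

    start is the index of the A in ATG; end is one past the stop codon.
    min_codons counts codons before the stop codon, including ATG.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the frame codon by codon.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first initiation codon seen in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None  # look for the next ORF in this frame
    return orfs


# Made-up sequence with one complete ORF.
print(find_orfs("CCATGAAATGGTAACC"))
```

An ATG that is never followed by an in-frame stop codon is not reported, since the definition requires both ends.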

Operator The DNA sequence in prokaryotes to which a repressor or activator protein binds, turning on (or off) transcription of the associated structural genes of the operon.

Optimal alignment Statistically the best possible alignment of given characters.

Orbital Wave function in space.

Ortholog Homologous protein (or gene) that performs the same function in different species.

PAM matrix Percent Accepted Mutation table that describes the probability that one base or amino acid has changed during the course of evolution.

Paralog Homologous protein (or gene) that performs different but related functions within the same organism.

PCR method Polymerase chain reaction (PCR), a gene amplification method.

Peptide mass fingerprinting A method of protein identification by mass spectrometry.

“Phase problem” Mathematical methods for obtaining the phase information for diffraction spots (hkl) in X-ray structure determination.

Phenotype Observable traits (characteristics) of an organism.

Phylogenetic analysis Study of evolutionary relationships.

Physical map A linearly ordered set of DNA fragments encompassing the genome or region of interest.

Polymorphisms Genetic variations, encompassing any of the many types of variations in DNA sequence that are found within a given population (mutations, point-mutations, and SNPs).

Positional cloning Use of genetic maps to determine the location of a disease gene.

Primary structure The linear sequence of amino acids in a protein.

Primer Short preexisting polynucleotide chain to which a polymerase enzyme can add new nucleotides.

Profile A matrix representation of a conserved region in a multiple sequence alignment that allows for gaps in the alignment.

Promoter Region on a DNA molecule involved in RNA-polymerase binding to initiate transcription.

Protein engineering A technique used to produce proteins with altered or novel amino acid sequences.

Protein folding Spatial structure that is unique to an individual protein.

Protein-folding problem The problem of determining the tertiary structure of a protein from its amino acid sequence data.

Proteome The entire collection of proteins that are coded by the genome of an organism.

Pseudogenes Genes that do not code for proteins or silent genes that contain elements that control gene expression.

QSAR Quantitative structure-activity relationship: a mathematical approach towards linking molecular structure to its function and activity.

Quaternary structure Arrangement of protein monomers (by non-covalent forces) in a multimeric protein complex.

Radioimmunoassay (RIA) A radiolabeled immunoassay method.

Ramachandran map A plot of sterically allowed and not-allowed conformations of a polypeptide backbone.

Rayleigh scattering Elastic scattering of light by molecules.

Reading frame A sequence of codons beginning with an initiation codon and ending with a termination codon.

Regression analysis The procedure of fitting a mathematical model to data.
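
One simple instance of the procedure described above is an ordinary least-squares fit of a straight line. The sketch below is only an illustration under that assumption; the function name `fit_line` and the data points are hypothetical, and other models or fitting criteria are equally valid forms of regression.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept to paired data."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations in x, and of cross deviations in x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data lying exactly on y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)
```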

Reverse turn A secondary structure element in proteins where the polypeptide backbone turns sharply on itself.

Scaffolds Ordered set of contigs placed on the chromosome.

Scanning-probe microscopy Surface-probe imaging technique that provides high-resolution topological maps.

Scanning-tunneling microscopy A surface-probe imaging technique that is based on quantum mechanical tunneling current. Provides high-resolution topological maps.

Secondary structure Helix, sheet and loop segments in a protein.

Sequence alignment A linear comparison of nucleic acid or amino acid sequences (with insertions) to bring equivalent positions in adjacent sequences into the correct register.

“Shotgun” sequencing High-throughput sequencing method which involves randomly sequencing tiny cloned pieces of the genome, with no foreknowledge of where on a chromosome the piece originally came from.

Silent mutation A nucleotide substitution that does not result in an amino acid substitution in the protein, because of the redundancy of the genetic code.
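
The redundancy mentioned above can be checked directly against the standard genetic code. The sketch below builds the 64-codon table in the conventional T, C, A, G ordering and compares the amino acids encoded before and after a substitution; the names `CODON_TABLE` and `is_silent` are hypothetical, and only DNA codons of the standard code are handled.

```python
BASES = "TCAG"
# Amino acids of the standard genetic code, ordered TTT, TTC, TTA, TTG, TCT, ...
# ('*' marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def is_silent(codon, mutated_codon):
    """True if the two codons differ but encode the same amino acid."""
    codon, mutated_codon = codon.upper(), mutated_codon.upper()
    return (codon != mutated_codon
            and CODON_TABLE[codon] == CODON_TABLE[mutated_codon])


# GAA and GAG both encode glutamate (E), so the substitution is silent.
print(is_silent("GAA", "GAG"))
# GAG to GTG replaces glutamate with valine (V), the change in sickle cell anemia.
print(is_silent("GAG", "GTG"))
```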

Single crystalline Matter wherein the internal organization of atoms, molecules, or clusters of molecules is regular and periodic in all three dimensions (crystal lattice).

SNPs Single nucleotide polymorphisms (SNPs) are single base-pair positions in genomic DNA at which different sequence alternatives (alleles) exist in normal individuals in some population(s), wherein the least frequent allele has an abundance of 1% or greater.

Southern blotting An analysis technique for the identification of DNA.

Splice sites Boundaries between exons and introns.

Substrate Substance that is acted upon by an enzyme.

Tertiary structure Three-dimensional structure of a molecule (e.g. protein).

Threading A procedure of aligning the sequence of an unknown protein with a known three-dimensional structure.

Transcription Synthesis of an RNA copy from a sequence of a DNA.

Transcription factor A regulatory protein that is required to initiate transcription of a gene in eukaryotes.

Translation The process of converting the template information in mRNA into synthesis of a protein.

Transmembrane protein Protein that passes one or more times through the lipid bilayer of a cell membrane.

Unidentified reading frame (URF) An open reading frame encoding a protein of undefined function.

Unit cell A parallelepiped (an imaginary box) that contains one basic unit of the structure; translational repetition of the unit cell in all three dimensions generates the crystal.

Valence Number of electrons gained, lost, or shared by an atom in bonding to one or more other atoms.

Virtual screening Validation of drug design by computational (in silico) methods.

Western blotting An analysis technique for the identification of proteins or peptides that have been electrophoretically separated and transferred by blotting to strips of nitrocellulose paper. Radiolabeled antibody probes or fluorophores are used to detect the blots.

Zinc-finger motif A structural motif found in DNA-regulatory proteins.

Index

α-helix, 36, 192
β-sheet, 38, 192
3D-QSAR, 231, 240
Agarose gel, 58
Amino acids, 30
  characteristics of, 30
  chemical formula of, 30
  folding propensity of, 126
  hydrophilic, 32
  hydrophobic, 33
  ionization states of, 31
  neutral, 34
  side chains, 32
  zwitterionic state of, 30
Ampholytes, 61, 234
Anion exchanger, 56
Autoradiography, 72

Base-pairing, 15, 234
  Watson-Crick type, 15, 18
Bessel function, 87
Bioinformatics, 1, 4, 52, 234
  medico-, 212
  molecular, 4, 116, 133
  objectives of, 4
  pharmaco-, 212
Biology
  computational, 6, 223
  molecular, 5
  quantitative, 5
  structural, 6
  systems, 6
Biomarkers, 216, 234
Biomedinformatics, 2
Biomolecules, 51-69
  physicochemical characterization of, 51
  spatial structure determination of, 83-101
Biophysics
  molecular, 6, 9-48
Blotting techniques
  Northern, 66, 239
  Southern, 66, 241
  Western, 66, 68, 241
Bonds (chemical)
  conjugated, 13
  coordination, 13
  double, 13
  glycosyl, 15
  peptide, 34
  phosphodiester, 16
  single, 13
Bragg, 84
Bragg’s equation, 84, 85
Bremsstrahlung, 83
Brewster’s angle, 66

Cation exchanger, 56
cDNA sequencing, 75
Cellulose acetate, 57
Charge-coupled device (CCD), 72
Chemical graph, 11
Chemical shift, 90, 235
Cheminformatics, 2
Chromatography, 55, 235
  affinity, 56
  dHPLC, 55
  HPLC, 55
  ion-exchange, 56
  liquid, 55
  reversed-phase, 55
  size-exclusion, 56
  supercritical fluid (SFC), 56
  thin layer (TLC), 55

Chromosomal maps, 117, 235
  banding patterns, 28
  genetic, 117
  linkage maps, 117, 213
  physical, 117
  positional cloning, 213
Coding region (CDR), 120
Codon, 3, 25, 235
Codon usage, 120
Conformation analysis, 39, 235
  Ramachandran plots, 39
Contig, 75, 118, 235
Coomassie Blue, 65
Crystal lattice, 84
Cystic fibrosis, 4, 215
Cystic fibrosis transmembrane regulator (CFTR), 4, 215
Cytofluorimetry, 73

Data mining, 168-211
  analysis, 168
  prediction, 121
Database search, 151-167
  genome, 155
  motifs, domains and profiles, 163, 192, 194
  pattern recognition, 163, 199
  primary structure, 151
  protein, 160
  search engines, 153
  secondary structure, 162
  sequence similarity, 161, 183
Databases, 152, 236
  BIND, 226
  BLOCKS, 163, 187, 193, 235
  BRENDA, 226
  COILSCAN, 163, 193
  COMPEC, 156, 226
  DIP, 226
  EMBL, 156
  ENZYME, 226
  ExPASy, 153
  genomic, 152, 155
  HTHSCAN, 193
  knowledge, 151, 165
  NCBI, 154
  Parsimony, 161, 189
  PDB, 160
  PFAM, 198
  PILEUP, 162, 188
  primary, 151
  PRINTS, 163, 194
  PRODOM, 163, 195
  PROFILES, 163, 198
  PROSITE, 163, 194
  proteomic, 152
  secondary, 151, 152
  structural, 228
  SWISS-PROT, 154
De novo synthesis, 116, 231
Dideoxynucleotide (ddNTP), 72
Diffusion, 52
  rotational, 53
  translational, 52
Dipole moment, 53
Disease gene
  identification, 212
  impact of genomics, 213
DNA, 18, 236
  A-form, 18
  B-form, 20
  replication, 21
  transcription, 23
  Z-form, 20
DNA fingerprinting, 118, 236
DNA microarrays, 124
DNA-histone complex, 108
Doppler effect, 53
Dotplot analysis, 174, 236
Drug discovery, 224
  genome-centric, 224
  “knowledge-based”, 228
  ligand/target-centric, 224

Einstein, 52
Electron density, 86
Electrophoretic methods, 56, 236
  2-D capillary, 64
  2-D gel, 62
  capillary, 63
  capillary zone (CZE), 64
  column, 58
  free-solution, 64
  horizontal, 58
  IEF/SDS-PAGE, 63, 64
  immunoelectrophoresis, 61, 238
  microchip, 58
  polyacrylamide gel (PAGE), 59
  pore gradient, 61
  pulse-field gel (PFGE), 71, 118
  SDS-PAGE, 59, 60

  slab gel, 59
Electrophoretic mobility, 57
Electrospray ionization (ESI), 78, 80
Elongation factors, 25
Endopeptidases, 77
Energy
  Madelung, 44
  potential, 43
Eukaryote, 120, 121
Evolution, 236
  and function, 128
  and structure, 130
  convergent, 129
  divergent, 129
  molecular, 130
Evolutionary tree, 190
Exon, 121, 236
Exopeptidases, 77
Expressed sequence tag (EST), 121, 157, 236

FISH, 118, 213, 216
Fourier transformation, 83
Free induction decay (FID), 90, 237
Frictional coefficient, 52
Frictional drag, 57
Functional informatics, 125

Gene, 237
  amplification, 25, 70
  cloning, 27, 70
  duplication, 237
  expression, 21, 218, 237
  mapping, 28
  replication, 21
  separation, 27, 71
  sequencing, 27, 71, 73
  structure, 156
Gene chips, 215
Gene expression analysis, 123
  DNA microarrays, 124
  gene chips, 215
  SAGE, 124, 215
Gene sequencing, 27, 71
  dideoxy method, 71
  Maxam Gilbert method, 71
  multiplex, 71
  Sanger method, 71
Gene therapy, 221
Genetic code, 3, 237
  dictionary of, 3
Genetic diseases, 3, 214
  Alzheimer’s, 4
  common, 216
  cystic fibrosis (CF), 215
  Huntington’s, 4
  imprinting, 216
  Mendelian trait, 215
  monogenic, 215
  non-Mendelian trait, 216
  Parkinson’s, 4
  polygenic, 216
  sickle cell anemia, 3, 215
  single nucleotide polymorphism (SNP), 3, 214
Genetic engineering, 5
Genetic maps, 117
Genetic testing, 216, 217
  biomarkers, 216
  differential gene expression profiling, 218
  differential protein expression profiling, 219
  linkage analysis, 213
  molecular profiling (MP), 219
  mutation screening, 218
Genetic variations, 214
Genome annotation, 119
Genome sequencing, 74
  “shotgun” strategy, 74, 241
Genomics, 116, 237
  analysis, 116, 223
  comparative, 119
  functional, 123
  structural, 116
Glycoproteins, 110

Hidden Markov Model (HMM), 196, 238
HIV-1 protease, 4, 233
Human genome project (HGP), 21
Hydrogen bonding, 46, 103
  sequence-specific, 103

Image reconstruction, 84, 238
Imaging methods, 95
  functional/metabolic, 99
  microscopies, 96
  structural, 96
  tomographies, 98
Immobilized pH gradient (IPG), 62
Immunoassay, 236
  ELISA, 229
  RIA, 229, 240
In silico analysis, 5, 227, 238

Information content, 2, 84
Interactions
  codon-anticodon, 2
  dipole/dipole, 43, 44
  dipole/induced-dipole, 43, 45
  electrostatic, 43, 102
  hydrogen bonding, 46
  hydrophobic, 46, 237
  induced-dipole/induced-dipole, 43, 45
  ionic, 43
  molecular, 102
  non-bonded, 43
  protein-carbohydrate, 110
  protein-ligand, 102-111
  protein-lipid, 110
  protein-nucleic acid, 102
  protein-protein, 110
  Van der Waals, 43, 44
Intron, 121, 238
Intron/exon boundary, 122
Isoelectric focusing (IEF), 61
Isoelectric point (pI), 57, 61, 238

Keesom equation, 45
Kozak sequence, 120

Larmor frequency, 89
Laser-induced fluorescence (LIF), 72
Lennard-Jones potential, 46
Light scattering, 53
Linkage analysis, 117, 213
Lipoproteins, 110
London equation, 46

Macromolecule
  fibrous, 87
  globular, 134, 135
Madelung formula, 43
Magnetization vector, 90
Mass fingerprinting, 79
Mass spectrometry (MS), 78
  ESI-MS/MS, 80
  MALDI-TOF-MS, 78, 79
  MI-MS, 80
  SELDI-TOF-MS, 80
  tandem (MS/MS), 80
Metabolic engineering, 220
Metabolomics, 220
Methyl green pyronin, 65
Microscopy
  atomic-force (AFM), 73, 97, 234
  biphotonic excitation confocal, 97, 234
  chemical, 98
  coherent anti-Stokes Raman scattering (CARS), 98, 235
  confocal, 73, 96, 235
  electron, 96
  fluorescence, 97
  fluorescence lifetime imaging (FLIM), 97
  magnetic resonance (MRM), 98
  NSOM, 73, 98
  optical, 96
  scanning-probe (SPM), 97, 240
  scanning-tunneling (STM), 73, 97, 241
  surface plasmon resonance (SPR), 98
  transmission electron (TEM), 98
  X-ray, 98
Mining
  data, 121, 168, 236
  sequence, 5
  structure, 5
Molecular
  mass, 52
  shape, 53
  size, 53
Molecular dynamics (MD), 109
Molecular engineering, 223-233, 238
  genomic approach, 223
  proteomics approach, 225
  rational design, 227
Molecular structure
  forces stabilizing, 42
Molten globule, 133
Mutation
  screening, 218
  single-point, 3
  structural, 223
Mutation screening, 215, 218

Nanopore sequencing, 75
Neutrons, 11
Niels Bohr, 11
NMR Spectra, 90
  COSY, 92
  DFQ-COSY, 92
  HMQC, 94
  NOESY, 92
NMR spectroscopy, 89
  multi-dimensional, 91

Nucleic acid bases, 15
  adenine, 15
  cytosine, 15
  guanine, 15
  thymine, 15
  uracil, 15
Nucleic acid families, 18
  A-form, 18
  B-form, 20
  Z-form, 20
Nucleic acids, 15-29
  cDNA, 75, 121, 235
  constituents of, 15
  DNA, 16, 18
  double helical structure, 17, 18
  families of, 18
  mRNA, 24
  polynucleotides, 16
  primary structure of, 70
  RNA, 16
  tRNA, 19
Nucleoside, 15, 16
Nucleotide, 16

Okazaki fragments, 22
Oligonucleotides, 16
Open reading frame (ORF), 120, 156, 157
Orbitals, 11

Pattern recognition, 163
Pauli, 12
Pauli’s exclusion principle, 12
Peptide fragmentation, 76
Peptide unit, 36
  Pauling’s hypothesis of, 35
Pharmacogenomics, 221
Phase problem, 83
  anomalous dispersion method, 87
  direct methods, 87
  molecular replacement method, 87
  Patterson method, 86
Phylogenetic analysis, 184
  BLOCKS model, 187, 235
  cladistic methods, 189
  Dayhoff method, 184
  distance method, 186, 188
  log-odds matrix, 186
  PAM model, 185
Polyethylene oxide (PEO), 64
Polymerase
  DNA, 22
  RNA, 22
Polymerase chain reaction (PCR), 27, 71
Polymorphisms, 3
Polynucleotides, 16
Polypeptide, 34
  fragmentation, 76
Primary structure, 70-82
  of nucleic acids, 70
  of proteins, 31, 76
Prokaryote, 120
Protein classification, 200
  CATH, 164, 200
  SCOP, 164, 201
Protein folding, 115
  cladistic approach, 128
  kinetics methods, 127
  phylogenetic methods, 128
  problem, 115-132
  rules, 133
Protein P21, 4
Protein sequencing
  2-DE/MS method, 79
  Edman method, 77
Protein structure
  α/β-barrel domains, 129, 142
  domains, 194
  evolutionary trends in, 129
  motifs, 193
  prediction, 116
  profiles, 196
Protein trafficking, 4
Proteins, 30-48
  amino acid sequencing, 76
  classification, 34, 164
  collagen group, 134
  disulfide-containing, 144
  DNA-regulatory, 104
  fibrous, 87, 134
  globular, 134, 135
  homologous, 169
  keratin group, 134
  monomeric, 41
  oligomeric, 41
  orthologous, 130, 170
  paralogous, 130, 170
  primary structure of, 35
  purification of, 77
  quaternary structure of, 41

  secondary structure of, 35
  spatial structure of, 70
  tertiary structure modeling, 201
  tertiary structure of, 41, 201
  three-dimensional structure of, 41
Proteomics, 62, 225
  analysis, 125
  inter-proteome, 125
  intra-proteome, 125
Protons, 11
Pseudogenes, 21, 240

Quantum number
  angular, 11
  azimuthal, 11
  magnetic, 11
  principal, 11
  spin, 11

Ramachandran
  analysis, 39
  angles, 40
  map, 39, 240
Rational design, 227
  CAMD, 229
Rayleigh equation, 53
Rayleigh scattering, 53, 73, 240
Refractive index, 53
Relative mobility, 60
Relaxation, 90
  spin-lattice (T1), 90
  spin-spin (T2), 90
Replication, 21
RNA
  translation, 25

SAGE, 124, 215
Schlieren optics, 53
Scoring matrix, 171
Search engines, 153
Secondary structure elements
  α-helices, 192
  β-sheets, 192
  reverse turns, 192
Sedimentation, 54
Separation gel, 59
Sequence alignment analysis, 170
  BLAST, 180
  Blocks method, 187
  cladistic methods, 189
  CLUSTAL algorithm, 183, 184
  clustering algorithms, 183, 187
  consensus sequence, 75, 122, 157, 179
  distance method, 188
  dotplot, 173, 174
  dynamic programming, 174
  FASTA, 182
  global alignment method, 176
  k-tuple method, 177
  local alignment method, 176
  multiple sequence alignment, 178
  Needleman-Wunsch algorithm, 176
  pair-wise alignment, 173, 177
  PAUP, 189
  phylogenetic analysis, 184
  PILEUP algorithm, 182
  sequence assembly, 122
  Smith-Waterman algorithm, 176
  strategies for, 183
Sequence retrieval programs, 154
  BLAST, 154
  ENTREZ, 155
  FASTA, 155
  FETCH, 155
Shine-Dalgarno sequence, 25, 120
Sickle cell anemia, 3
Single nucleotide polymorphism (SNP), 3, 214
Site-directed mutagenesis, 230
Space group, 84
Spectra
  atomic, 12
  frequency-domain, 91
  line, 12
  NMR, 90
Stacking gel, 59
Stokes, 53
Stokes-Einstein equation, 53
Structural motif
  α/β, 141, 142
  β-barrel, 140, 142
  β-turn-β, 139
  all α-helices, 137, 138
  all β-strands, 139
  EF-hand, 143
  Greek key, 139
  helix-turn-helix, 104, 143
  immunoglobulin-fold, 139, 141
  leucine-zipper, 107
  Rossmann (βαβ) fold, 141
  swiss roll, 139

  zinc-finger, 106, 143, 144, 241
Structure
  atomic, 11, 12
  determination, 83, 85, 93
  factor, 86
  prediction, 133
  primary, 35, 70, 76, 240
  quaternary, 41, 240
  refinement, 87
  secondary, 35, 190, 241
  spatial, 41, 83
  tertiary, 41, 241
  three-dimensional, 41, 205
Structure classification, 200
  domains, 194
  profiles, 196
  super-secondary, 137
Structure factor, 86
Structure prediction, 134
  “knowledge-based”, 135
  Chou-Fasman method, 136
  computational methods of, 133-147
  contact potential method, 203
  distance matrix method, 171, 202
  inverse folding method, 135
  of fibrous proteins, 134
  of globular proteins, 135
  pattern recognition, 163, 199
  profile sum method, 203
  secondary, 135, 190
  sequence threading method, 204
  structure profile method, 202
  tertiary, 164, 201
Sugar
  deoxyribose, 16
  furanose, 16
  oxyribose, 16
Sugars
  N-linked, 110
  O-linked, 110
Svedberg, 54
Systems biology, 6

TATA box, 120, 156
Tautomer, 15
Tomography, 98
  CT, 98, 99
  electron, 98
  fMRI, 99
  magnetic resonance imaging (MRI), 99
  PET, 99
  SPECT, 98, 99
Transcription, 23
Translation, 25
Tri-nucleotide repeat expansion (TNRE), 4
tRNA, 19
  aminoacyl, 25
  fMet, 25
Turns and loops, 39

Unit cell, 84, 241
Unit cell content, 85
Untranslated region (UTR), 120, 121

Validation, 230
Virial coefficient, 54
Virtual screening, 230, 241

Watson-Crick, 1
Watson-Crick hypothesis, 16, 18
Wave functions, 11

X-ray crystallography, 1
  macromolecular, 1
  time-resolved, 83
X-ray diffraction, 83
  principles of, 84
  single-crystal, 84