Intelligent Bioinformatics The application of artificial intelligence techniques to bioinformatics problems Edward Keedwell and Ajit Narayanan School of Engineering, Computer Science and Mathematics University of Exeter, UK

  • JWBK023-FM JWBK023-Keedwell April 5, 2005 23:30 Char Count= 0

    Copyright © 2005 John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ, England

    Telephone (+44) 1243 779777

    Email (for orders and customer service enquiries): [email protected]

    Visit our Home Page on www.wileyeurope.com or www.wiley.com

    All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except under the terms of the Copyright, Designs and Patents Act 1988 or under the terms of a licence issued by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London W1T 4LP, UK, without the permission in writing of the Publisher. Requests to the Publisher should be addressed to the Permissions Department, John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ, England, or emailed to [email protected], or faxed to (+44) 1243 770620.

    Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The Publisher is not associated with any product or vendor mentioned in this book.

    This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the Publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

    Other Wiley Editorial Offices

    John Wiley & Sons Inc., 111 River Street, Hoboken, NJ 07030, USA

    Jossey-Bass, 989 Market Street, San Francisco, CA 94103-1741, USA

    Wiley-VCH Verlag GmbH, Boschstr. 12, D-69469 Weinheim, Germany

    John Wiley & Sons Australia Ltd, 33 Park Road, Milton, Queensland 4064, Australia

    John Wiley & Sons (Asia) Pte Ltd, 2 Clementi Loop #02-01, Jin Xing Distripark, Singapore 129809

    John Wiley & Sons Canada Ltd, 22 Worcester Road, Etobicoke, Ontario, Canada M9W 1L1

    Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

    Library of Congress Cataloguing-in-Publication Data

    British Library Cataloguing in Publication Data

    A catalogue record for this book is available from the British Library

    ISBN 0 470 02175 6

    Typeset in 10.5/13.5pt Sabon by TechBooks, New Delhi, India

    Printed and bound in Great Britain by TJ International Ltd., Padstow, Cornwall

    This book is printed on acid-free paper responsibly manufactured from sustainable forestry in which at least two trees are planted for each one used for paper production.

    Contents

    Preface

    Acknowledgement

    PART 1 INTRODUCTION

    1 Introduction to the Basics of Molecular Biology
        1.1 Basic cell architecture
        1.2 The structure, content and scale of deoxyribonucleic acid (DNA)
        1.3 History of the human genome
        1.4 Genes and proteins
        1.5 Current knowledge and the ‘central dogma’
        1.6 Why proteins are important
        1.7 Gene and cell regulation
        1.8 When cell regulation goes wrong
        1.9 So, what is bioinformatics?
        1.10 Summary of chapter
        1.11 Further reading

    2 Introduction to Problems and Challenges in Bioinformatics
        2.1 Introduction
        2.2 Genome
        2.3 Transcriptome
        2.4 Proteome
        2.5 Interference technology, viruses and the immune system
        2.6 Summary of chapter
        2.7 Further reading

    3 Introduction to Artificial Intelligence and Computer Science
        3.1 Introduction to search
        3.2 Search algorithms
        3.3 Heuristic search methods
        3.4 Optimal search strategies
        3.5 Problems with search techniques
        3.6 Complexity of search
        3.7 Use of graphs in bioinformatics
        3.8 Grammars, languages and automata
        3.9 Classes of problems
        3.10 Summary of chapter
        3.11 Further reading

    PART 2 CURRENT TECHNIQUES

    4 Probabilistic Approaches
        4.1 Introduction to probability
        4.2 Bayes’ Theorem
        4.3 Bayesian networks
        4.4 Markov networks
        4.5 Summary of chapter
        4.6 References

    5 Nearest Neighbour and Clustering Approaches
        5.1 Introduction
        5.2 Nearest neighbour method
        5.3 Nearest neighbour approach for secondary structure protein folding prediction
        5.4 Clustering
        5.5 Advanced clustering techniques
        5.6 Application guidelines
        5.7 Summary of chapter
        5.8 References

    6 Identification (Decision) Trees
        6.1 Method
        6.2 Gain criterion
        6.3 Over fitting and pruning
        6.4 Application guidelines
        6.5 Bioinformatics applications
        6.6 Background
        6.7 Summary of chapter
        6.8 References

    7 Neural Networks
        7.1 Method
        7.2 Application guidelines
        7.3 Bioinformatics applications
        7.4 Background
        7.5 Summary of chapter
        7.6 References

    8 Genetic Algorithms
        8.1 Single-objective genetic algorithms – method
        8.2 Single-objective genetic algorithms – example
        8.3 Multi-objective genetic algorithms – method
        8.4 Application guidelines
        8.5 Genetic algorithms – bioinformatics applications
        8.6 Summary of chapter
        8.7 References and further reading

    PART 3 FUTURE TECHNIQUES

    9 Genetic Programming
        9.1 Method
        9.2 Application guidelines
        9.3 Bioinformatics applications
        9.4 Background
        9.5 Summary of chapter
        9.6 References

    10 Cellular Automata
        10.1 Method
        10.2 Application guidelines
        10.3 Bioinformatics applications
        10.4 Background
        10.5 Summary of chapter
        10.6 References and further reading

    11 Hybrid Methods
        11.1 Method
        11.2 Neural-genetic algorithm for analysing gene expression data
        11.3 Genetic algorithm and k nearest neighbour hybrid for biochemistry solvation
        11.4 Genetic programming neural networks for determining gene–gene interactions in epidemiology
        11.5 Application guidelines
        11.6 Conclusions
        11.7 Summary of chapter
        11.8 References and further reading

    Index

    Preface

    It is widely recognized that the field of biology is in the midst of a ‘data explosion’. A series of technical advances in recent years has increased the amount of data that biologists can record about different aspects of an organism at the genomic, transcriptomic and proteomic levels. This data is, of course, vital to advancing our knowledge. In recent years, the discipline of bioinformatics has allowed biologists to make full use of the advances in computer science and computational statistics in analysing this data. However, as the volume of data grows, the techniques used must become more sophisticated to cater for large-scale data and noise. Also, given the growth in biological data, there is a need to extract information that was not previously known from these databases to supplement current knowledge. Large databases may contain interesting patterns that, if identified and authenticated by further laboratory and clinical work, can lead to novel theories about the causes of various diseases and also possibly to new drugs for their treatment. The discipline of bioinformatics has reached the end of its first phase, and the motivation behind this book is to characterize the principles that may underlie second phase bioinformatics. That is, second phase bioinformatics is when the discipline, instead of being informed by just computer science and computational statistics, is also informed by artificial intelligence techniques.

    As we show in this book, there are problems in bioinformatics and many other sciences that cannot be solved satisfactorily even with the fastest computers. Clearly, a more ‘intelligent’ approach is required to solve these increasingly difficult bioinformatics problems, such as gene expression analysis and protein structure prediction. This book attempts to address this by looking at the latest advances in artificial intelligence technology as applied to computational problems in biology. Artificial intelligence methods are often based on the ways in which humans solve search and optimization problems, or how nature has solved its own problems, for example by using the principles of ‘survival of the fittest’ in evolutionary computation.

    This book is divided into three parts, each containing a number of chapters. These parts are designed to allow readers to access the material most relevant to them. The first part, Introduction, introduces the material necessary to understand the technology and biology included in the later chapters. We recognize that bioinformatics is highly cross-disciplinary and therefore some, all or none of these chapters may be relevant to the reader, depending on their background. The next part, Current Techniques, describes the established artificial intelligence techniques in bioinformatics including probabilistic, nearest neighbour and genetic algorithm approaches. The final part, Future Techniques, is intended to give the reader an impression of the latest thinking in the area of intelligent bioinformatics. Some of these approaches may not have been widely applied to problems in bioinformatics, but algorithms such as genetic programming and various hybrid approaches can be expected to make a big impact in this domain if experience in other areas of science and technology is anything to go by.

    In short, this book has been written to engage and interest readers from many disciplines. Biologists are provided for in that there is a full introduction to the challenges for computer science, and computer scientists should also find the chapters on biology and bioinformatics informative. Practicing bioinformaticians are also likely to find the book enlightening, as much of the material has previously only been included in specialist publications and a collection such as this provides a single resource for many intelligent problem-solving techniques in bioinformatics. However, as with any book of this type, not every technique can be included due to space restrictions and apologies are offered to researchers whose own favourite analytical techniques are not covered in this book.

    Edward Keedwell
    Ajit Narayanan

    Acknowledgements

    The authors would like to thank everyone involved with producing this book including staff at the Department of Computer Science and Centre for Water Systems at the University of Exeter, in particular Godfrey Walters, Dragan Savic and Soon-Thiam Khu. In addition to this, we would like to thank Bjorn Olsson for his contribution to the tutorials on which this book is based, and Laetitia Jourdan for her helpful comments. Also, we would like to thank the many MSc students on the Bioinformatics programme at the University of Exeter, who contributed towards some of the material for this book. Finally we would also like to thank the editorial and production staff at Wiley, in particular Joan Marsh, Andrea Baier and Robert Hambrook for making this book possible.

    We are grateful to Wolters Kluwer Health for permission to adapt and re-use Figures 2.10, 6.3, 7.1, 7.2 and 7.3 and Table 5.1 from ‘Artificial intelligence techniques for bioinformatics’, A. Narayanan, E. C. Keedwell and B. Olsson, Applied Bioinformatics 2002: 1(4) 191–222.

    Dedications

    Ed Keedwell – This book is dedicated to my family Rob, Lyn, Rich and Loveday, to Kate, and in memory of Alex Larigo.

    Ajit Narayanan – This book is dedicated to Lucy, Belinda and Kieran, my mother Janaki, my brother Ramesh and sister Seetha.

  • JWBK023-01 JWBK023-Keedwell March 23, 2005 10:21 Char Count= 0

    Part 1 Introduction

    1 Introduction to the Basics of Molecular Biology

    1.1 Basic cell architecture

    A cell, typically 10–30 millionths of a metre (10–30 µm) across for humans, contains many specialized structures called organelles (Figure 1.1). The cell membrane controls the passage of substances into and out of the cell and encloses cell organelles as well as cell substances; the cytoplasm serves as a fluid container for cell organelles and other cell substances as well as helping in the transport of substances within the cell; the nucleus directs all cell activity and carries hereditary information; the endoplasmic reticulum serves as a transport network and storage area for substances within the cell; the ribosome manufactures different kinds of cell protein; the Golgi apparatus packages protein for storage or transport out of the cell; the lysosome digests or breaks down food materials into simpler parts and removes waste materials from the cell; the mitochondria serve as the power supply of the cell by producing ATP – adenosine triphosphate – which is the source of energy for all cell activities; microtubules serve as the support system or skeleton of the cell; and microfilaments assist in cell motility. Each organelle performs one or more special task(s) to keep the cell alive.

    In addition to this intracellular (within cell) architecture, there is also an intercellular (between cell) architecture: cells form tissue (aggregations of similar cells that perform some subfunction), which in turn combines with other tissues to form organs (aggregation of subfunctions to perform an overall function), which in turn together form an organism (aggregation of all functions to keep the multicellular organism alive). The rest of this chapter deals with just two of these organelles, the nucleus and the ribosomes, and the processes within a cell that link them together.

    Figure 1.1 An overview of a typical human cell, showing the nucleus, DNA, mRNA, ribosome and Golgi apparatus: (a) transcription; (b) translation into a polypeptide chain (20 amino acids); (c) enzymes and proteins (hundreds of amino acids)

    1.2 The structure, content and scale of deoxyribonucleic acid (DNA)

    DNA and chromosomes

    All the information directing every cell function is stored in large DNA molecules found in the nucleus. A cell cannot function without DNA. The information it contains must somehow be made available to the rest of the cell as well as being passed on to all new cells. Although each cell contains the full complement of DNA, through some process which is not yet clearly understood certain parts of the DNA are switched on or off within cells, resulting in different types of cell producing different proteins for normal growth and functioning of the organism as a whole.

    The human body consists of between 30 and 80 trillion cells, where one trillion = 10¹², i.e. one thousand billion, where one billion equals one thousand million. What is shown in Figure 1.1 is a eukaryote cell, which has a membrane-bound nucleus. The human body has about 200 different types of eukaryote cell. The process of transcription (Figure 1.1(a)) starts with the double-stranded DNA opening up to reveal bases coding for a gene. A copy of the gene, called messenger RNA (mRNA), is made and leaves the nucleus. The double-stranded DNA closes after transcription. At ribosomes, the process of translation starts (Figure 1.1(b)) whereby three copied bases at a time (a codon) are mapped onto one amino acid. The mRNA is broken up and may re-enter the nucleus for further mRNA transcription. The growing sequence of amino acids (polypeptide sequence) may be amended by the Golgi apparatus before the final production of enzymes, proteins and other translated products (Figure 1.1(c)).
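    The transcription and translation steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than a model of the cellular machinery: the function names `transcribe` and `translate`, the example template strand and the compact construction of the standard codon table are assumptions introduced here, and the sketch ignores regulation, splicing and any later modification of the polypeptide.

```python
from itertools import product

# Standard genetic code, built compactly (an illustrative construction).
# Codons are enumerated with each base in the order U, C, A, G;
# '*' marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Base pairing used when copying a DNA template strand into mRNA.
DNA_TO_MRNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5: str) -> str:
    """Return the mRNA (5' to 3') copied from a DNA template written 3' to 5'."""
    return "".join(DNA_TO_MRNA[base] for base in template_3_to_5)


def translate(mrna: str) -> str:
    """Map codons (three bases at a time) to one-letter amino acid codes.

    Translation begins at the first AUG and ends at the first stop codon.
    """
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("TACCGGATT")  # hypothetical template strand
print(mrna)                     # AUGGCCUAA
print(translate(mrna))          # MA
```

    The one-letter codes follow the usual convention (M for methionine, A for alanine); a real polypeptide would of course be hundreds of residues long.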

    The DNA in the nucleus takes the form of large molecules called chromosomes made up of combinations of four types of nucleotides – adenine, guanine, thymine and cytosine (labelled ‘A’, ‘G’, ‘T’ and ‘C’, respectively). Chromosomal structure can be described at different levels. At the lowest level, single strands of DNA are paired with their complementary bases to form double strands (about two billionths of a metre (2 nm) wide). These double strands form strings of chromatin about 11 nm wide that are packed tightly into 30 nm-wide chromatin fibre. Chromatin fibre is itself densely packed into a section of chromosome about 300 nm wide which again is packed into condensed sections of chromosome about 700 nm wide. Finally, chromosome sections are joined together at the centromere to form an entire chromosome about 1400 nm (1.4 µm, or 0.0014 mm) wide.

    The extremely small scale of DNA and its structure means that it cannot be observed directly. Since the largest magnification that can be seen through an optical microscope is 400× and the closest that two distinct spots can be resolved is 0.2 µm, if a chromosome can be seen at all through an optical microscope using artificial or natural light it will be as a fuzzy image. Light waves with shorter wavelengths (such as blue or ultraviolet) can be used to increase resolution (the resolution limit is about 0.45 times the wavelength), but then special techniques are required to capture the image, since such short wavelengths are beyond visual capability. Light microscopy can be used to observe a cell but still cannot make out the organelles with clarity. One of the most popular techniques is transmission electron microscopy (TEM), where electrons are beamed through the sample and an image produced resulting from the interaction of the electrons with the sample. TEM can resolve organelles and other subcellular structures but not the content of chromosomes. In other words, it is likely that chromosome content will not be observed directly at the level of bases, which means that DNA sequences will never be observed directly. Instead, indirect methods for observing and measuring DNA must be used.

    It is estimated that the DNA in each human cell contains about six or seven billion nucleotides, spread across 46 chromosomes (discrete molecular structures of DNA), each one of which takes the shape of a double helix. If all the DNA in one cell were stretched end to end, the length is estimated to be about 2 m. That is, each DNA chromosome is about 50 000 times shorter than its extended length.
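    The scale figures quoted above can be checked with some simple arithmetic. The sketch below uses the 2 m total length, 46 chromosomes and 50 000-fold compaction from the text. The cross-check at the end rests on two assumptions that are not in the text: that the quoted count refers to positions along the double helix (base pairs), and that adjacent base pairs are spaced roughly 0.34 nm apart, a commonly cited figure for B-form DNA.

```python
# Figures quoted in the text.
TOTAL_DNA_LENGTH_M = 2.0      # all DNA in one cell, stretched end to end
CHROMOSOMES_PER_CELL = 46
COMPACTION_FACTOR = 50_000    # extended length divided by packed length

# Assumptions for the cross-check (not from the text).
BASE_PAIRS_PER_CELL = 6.5e9   # midpoint of "six or seven billion"
RISE_PER_BASE_PAIR_NM = 0.34  # approximate spacing in B-form DNA

# Average extended length of one chromosome, in centimetres.
extended_per_chromosome_cm = TOTAL_DNA_LENGTH_M / CHROMOSOMES_PER_CELL * 100

# Implied average packed length of one chromosome, in micrometres.
packed_per_chromosome_um = (
    TOTAL_DNA_LENGTH_M / CHROMOSOMES_PER_CELL / COMPACTION_FACTOR * 1e6
)

# Independent estimate of the total extended length, in metres.
estimated_total_length_m = BASE_PAIRS_PER_CELL * RISE_PER_BASE_PAIR_NM * 1e-9

print(f"{extended_per_chromosome_cm:.2f} cm per extended chromosome")  # 4.35
print(f"{packed_per_chromosome_um:.2f} um per packed chromosome")      # 0.87
print(f"{estimated_total_length_m:.2f} m estimated total length")      # 2.21
```

    Under these assumptions the independent estimate lands close to the 2 m quoted in the text, which suggests the figures are mutually consistent to within rounding.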

    Nucleotides are conventionally portrayed as shapes that lock onto each other when paired on the two strands that make up the double-helix structure of a chromosome. Complementary base pairing is represented in Figure 1.2(a) and (b), with T on one strand always being paired with A on the other strand, and C with G. Each strand has directionality (the direction in which nucleotides code for genes), known as 5′ (5-prime) or 3′. That is, the strands run in the opposite direction to each other and are ‘anti-parallel’. In Figure 1.2(c), the nucleotides making up a gene have a direction from the 5′ to the 3′ end (left-to-right for the ‘top’ strand, right-to-left for the ‘bottom’ strand). Each nucleotide is a molecule consisting of a five-carbon sugar (deoxyribose for DNA), a phosphate group, and a nitrogenous base (a ring compound containing nitrogen), with each carbon in the sugar being given a number 1′ to 5′. Nucleotides form a chain when phosphodiester linkages are formed between the sugar portions of the molecules. As a result of the phosphates being linked from the 5′ carbon on one sugar to the 3′ carbon on the next, the DNA sequence has a free 5′ carbon (no nucleotide attached) at one end and a free 3′ carbon at the other end. These free carbon numbers are then used to signify the directionality of the sequence.

    Figure 1.2 The double-helix structure (panels (a)–(c); labels: T–A and G–C base pairs, 5′ and 3′ strand ends, double helix)
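    Complementary pairing and anti-parallel directionality together mean that one strand fully determines the other. A minimal sketch, assuming strands are written as plain strings over A, C, G and T in the 5′ to 3′ direction (the function names and the example strand are illustrative, not from the text):

```python
# Complementary base pairing: T pairs with A, and C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base partner of a strand, in the same left-to-right order."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand_5_to_3: str) -> str:
    """Return the opposite strand, also written 5' to 3'.

    Because the two strands are anti-parallel, the partner strand has to be
    reversed before it can be read in its own 5' to 3' direction.
    """
    return complement(strand_5_to_3)[::-1]


top = "ATGGCA"                  # hypothetical 'top' strand, 5' to 3'
print(complement(top))          # TACCGT  (bottom strand as drawn, 3' to 5')
print(reverse_complement(top))  # TGCCAT  (bottom strand read 5' to 3')
```

    Applying `reverse_complement` twice returns the original strand, which is a convenient sanity check when handling sequence data.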

    Types of cell

    Eukaryotic cells mainly appear in multicellular organisms (e.g. plants, animals) and are distinguished by having a clearly designated nucleus containing DNA structured into chromosomes, while prokaryotic cells (single cell organisms) have no such nucleus and their DNA is stored in one, usually circular, molecule. Prokaryotic cells are usually called bacteria and represent the simplest life forms. There are three classes of bacteria. Eubacteria are the most common type and can cause disease in humans either by directly producing toxins harmful to us or by being infected by bacterial viruses that then cause the bacteria in us to produce harmful toxins. In addition to the trillions of eukaryotic cells that make up a human, human bodies also tolerate a large number of bacteria that produce useful proteins, e.g. for breaking down some types of food, that human DNA could not otherwise manufacture. Archaebacteria are typically found in hostile (usually hot, acidic and oxygenless) environments and are assumed to be, or to be descended from, some of the oldest living organisms on this planet, since the early Earth would not have contained oxygen and would have been a hot place. Cyanobacteria use photosynthesis (the process of converting energy in sunlight into chemicals used by living systems) and are believed to be the source of chloroplasts in plants. The remainder of this book will concentrate on eukaryotic cells, such as those found in multicellular creatures.

    The human body is made up of large numbers of about 200 different types of eukaryotic cell, such as nerve cells (neurons) for communication and control, muscle cells for producing mechanical force, and sensory cells such as those in the eye and skin. Since all humans (and other multicellular organisms) start as one fertilized egg cell, it is one of the mysteries of modern biology as to why, after division like a prokaryotic cell, the subsequent cells remain together to cooperate for further division and specialization into all of the different types of cell, until a full-grown human develops. Most prokaryotic cells, after division, go their own way and lead independent lives.

    DNA, the genome and genes

    For human and other eukaryotic cells, two polynucleotide chains (that is, two sequences consisting of many different occurrences of the four nucleotides) form the DNA double helix, with all the bases on the inside of the helix and the sugar-phosphate backbones on the outside (Figure 1.2). Under normal cellular conditions, adenine and guanine (purines) always pair with thymine and cytosine (pyrimidines), respectively and vice versa. The complete set of DNA in an organism’s cell is its genome. A eukaryotic nucleus contains a number of chromosomes, each of which is a double-helix containing hundreds of thousands of bases on anti-parallel strands. In other words, while the strands are parallel in a chromosome, they run in an opposite direction to each other. One strand is read from ‘left-to-right’, or ‘top-to-bottom’, and its complement is read from ‘right-to-left’, or ‘bottom-to-top’.

    So far, the assumption is that a eukaryotic cell contains the full set of chromosomes, and this is true for about 99.99 per cent of all cells in the human body. However, before a normal eukaryotic cell can come into being it has to be created. A sex cell for humans contains 23 chromosomes, consisting of about 3.5 billion bases in total. A sex cell (haploid) is different from a normal cell in that it contains only half the complement of chromosomes required to form a normal (diploid) cell. Only when two sex cells merge will a normal cell consisting of 46 chromosomes result. A sex cell for goldfish contains 47 chromosomes (94 chromosomes for a normal goldfish cell), for rice 12, for a fruit fly four, for a guinea pig 32.

    A gene is defined to be a sequence of DNA or bases that codes for a specific function/protein. However, a gene can have more than one form or version. So, while there may be a gene for, say, producing hair of a certain colour (a gross oversimplification), that gene will have different alleles, such as producing brown hair or blonde hair. A gene is like a variable that can take different values, to use a computational metaphor. It is not known for sure how many human genes are capable of having different allelic values or how many different allelic values exist for those genes that can vary. Some of these differences in allelic values are strongly associated with diseases, such as one particular type of diabetes where a gene which contributes to the production of insulin for breaking down glucose in the blood has a different form to the normal form or version of the gene. Other differences in allelic values provide normal variation between individuals, however. It is difficult to identify a genuine allelic difference when comparing the same gene across two individuals; the difference can be just one base in a multithousand-base gene sequence. Since the content of genes cannot be observed directly, only indirect ways of identifying differences between individuals for the same gene can be used, which leads to problems of knowing where the differences may be and finding methods for checking for the existence of these differences.
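    The computational metaphor of a gene as a variable with allelic values can be made concrete with a small comparison routine. The sketch below assumes two alleles of the same gene are already available as aligned, equal-length strings, which sidesteps the measurement and alignment problems the text describes. The sequences and the function name `differing_positions` are made up for illustration.

```python
def differing_positions(allele_a: str, allele_b: str) -> list[tuple[int, str, str]]:
    """List (index, base_in_a, base_in_b) for every position where two alleles differ."""
    if len(allele_a) != len(allele_b):
        raise ValueError("alleles must be aligned to the same length")
    return [
        (index, base_a, base_b)
        for index, (base_a, base_b) in enumerate(zip(allele_a, allele_b))
        if base_a != base_b
    ]


# Two hypothetical versions of the same short stretch of a gene.
common_form = "ATGGCATTGACCGGA"
variant_form = "ATGGCATTGTCCGGA"

print(differing_positions(common_form, variant_form))  # [(9, 'A', 'T')]
```

    In a real gene the same single-base difference would be hidden among thousands of identical positions, which is why finding it is hard in practice.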

    As stated earlier it is estimated that there are several trillion (between 30 trillion and 80 trillion) cells in the human body (for skin, muscles, liver, blood, heart, brain, etc.). Each such cell contains the full set of 46 chromosomes inherited from the mother and father (23 in each case, via sex cells). It is also estimated that one set of 23 chromosomes codes for about 30 000 genes for humans. On average, about 100 000–150 000 bases are required for coding a gene, although this figure varies greatly from a few hundred to a few hundred thousand. Several thousand genes will on average reside on each chromosome. A genome is defined to be the complete set of chromosomes inherited from one parent.

    1.3 History of the human genome

    The task of sequencing all the bases of the human genome is called the human genome project, which originated in the early 1980s with GenBank when US Department of Energy technicians entered sequences of As, Gs, Cs and Ts from journals into databases using special keyboards. New protocols subsequently allowed researchers to enter sequences via telephone, and later GenBank was transferred to the National Center for Biotechnology Information (NCBI). In 1990, the Human Genome Project (HGP) was launched as a publicly-funded consortium consisting of four large sequencing centres in the USA, the Sanger Centre in Cambridge, UK, and various laboratories in Japan, France, Germany and China. Before the project was completed, in Spring 2000 Celera Genomics announced that they had a complete draft of the human genome. While the HGP adopted a systematic method for ‘sequencing’ (identifying the nucleotides along all the chromosomes of) the human genome section by section, Celera adopted a ‘shotgun’ approach, whereby they fragmented the genome into small, easily sequenced stretches and then reconstructed the genome through proprietary algorithms. Increases in computational power through the 1990s made Celera’s approach possible. Celera used just one anonymous person’s DNA, whereas the HGP required cross-checking with several people’s DNA. Also, Celera repeated the sequencing three times, whereas HGP required more repetitions.
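    The idea behind the shotgun approach, reconstructing a long sequence from short overlapping fragments, can be illustrated with a toy example. Celera’s actual algorithms were proprietary and are not described in the text; greedy overlap merging is swapped in here purely as an illustration, and the fragments, function names and `min_len` threshold are all assumptions. The sketch ignores sequencing errors, repeated regions and reverse-strand reads, all of which make real assembly far harder.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is also a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the largest overlap.

    Returns the remaining sequences; a single element means every fragment
    was joined into one contiguous sequence.
    """
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    # Drop fragments wholly contained in another fragment.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # no remaining overlaps: leave the pieces separate
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Hypothetical fragments of the made-up sequence ATGGCGTACGTTAGC.
reads = ["ACGTTAGC", "ATGGCGTA", "GCGTACGT"]
print(greedy_assemble(reads))  # ['ATGGCGTACGTTAGC']
```

    The nested loops compare every pair of fragments on every pass, so the cost grows quickly with the number of fragments, which gives some feel for why the approach depended on increases in computational power.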

    Initially and during the early 1990s, it was thought that the HGP would find 80 000 genes. As the HGP progressed, this figure was revised down to 20 000 to 30 000 genes. A rough calculation indicates that, if there are 3.5 billion bases on 23 chromosomes and 30 000 genes, then about 120 000 bases are required per gene on average. However, it is now estimated that 98–99 per cent of DNA in humans is ‘redundant’ (does not code for any function). Also, it is estimated that up to 99.9 per cent of one person’s genes match another random person’s perfectly. That is, any two people taken at random share the very same DNA sequence (allelic values) for nearly every single one of their genes, but the remaining 0.1 per cent vary. If 30 000 genes are assumed, then 0.1 per cent is 30, that is, there are still over a billion ways (2³⁰ = 1 073 741 824) that two people can differ from each other. This is assuming that each gene has a binary function (on/off, high/low, dark/fair, etc.), whereas genes can be expected to be multivariate (take many values). For instance, if there are on average three different forms for each gene, there are over 205 trillion ways (3³⁰ = 205 891 132 094 649) that two people can differ from each other, more than enough to code for a difference between any two humans currently living (the world’s population is currently estimated to be about six billion). Also, if the estimate of how many genes humans share identically with each other is just a fraction lower, say 99.8 per cent, then there is even more genetic variability possible. These differences between values for a specific gene are called polymorphisms and the physical location of a specific gene on chromosomes is called its locus.
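    The rough calculations in this paragraph are easy to reproduce. The sketch below uses only the figures quoted in the text (3.5 billion bases, 30 000 genes, 0.1 per cent of genes varying) and Python’s exact integer arithmetic; the variable names are illustrative.

```python
BASES_PER_CHROMOSOME_SET = 3_500_000_000  # bases on 23 chromosomes
GENES = 30_000

# Average number of bases available per gene.
bases_per_gene = BASES_PER_CHROMOSOME_SET / GENES
print(round(bases_per_gene))  # 116667, i.e. roughly 120 000

# 0.1 per cent of the genes are assumed to vary between two people.
varying_genes = GENES // 1000
print(varying_genes)          # 30

# Number of distinct combinations if each varying gene has two forms.
print(2 ** varying_genes)     # 1073741824

# Number of distinct combinations if each varying gene has three forms.
print(3 ** varying_genes)     # 205891132094649

# Doubling the varying fraction to 0.2 per cent (99.8 per cent shared)
# squares the number of combinations.
print(2 ** (2 * varying_genes) == (2 ** varying_genes) ** 2)  # True
```

    The last line shows why a small drop in the shared fraction matters so much: the number of combinations grows exponentially with the number of varying genes.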

    There is also increasing interest in the ‘redundant’ or ‘junk’ DNA, that is, DNA which is believed not to code for any protein. It is not clear whether such sections of DNA are the remains of previously useful DNA that now have no function, or whether non-coding DNA provides a structural aid to help stabilize chromosomes and the nucleus.

    1.4 Genes and proteins

    Genes code for various products that are used by the cells making up the tissue of the organism. These products are called proteins and they have two primary functions: structural, such as helping to form muscle, hair and microtubules, and enzymatic, such as the production of enzymes for starting various chemical reactions in the cell. Proteins therefore contribute to biological structure and function. Proteins also have three other functions: they can carry signals, they can transport molecules such as oxygen and they can regulate cell processes, such as defence mechanisms. The process by which genes are made into proteins is started by RNA polymerase coming into contact with a chromosome and identifying the start point of a gene. These molecules open up the double helix structure to expose the DNA strand making up the gene, and a complementary copy of the gene is made in the direction in which the gene is meant to be read. The process of copying genes into mRNA is called transcription, and the process of converting the mRNA into protein is called translation.

    Transcription starts with the double helix unwinding (Figure 1.3(a)) and exposing bases that represent the start of a gene. mRNA is then formed, whereby a complementary copy of the gene is made. Since transcription proceeds along the template strand in the 3′ to 5′ direction (more details follow later), the mRNA has opposite ‘polarity’, that is, the start of the gene is now at the 5′ end of the mRNA (Figure 1.3(b)). Introns, or parts of the gene that do not code for a protein, are removed, typically by the mRNA folding over itself and forming loops that are cut off, leaving exons in the transcript. These transcripts containing exons only can be further edited (Figure 1.3(c)) so that alternative splice pathways for the same gene are formed, i.e. one gene can give rise to many different transcripts.

    Transcription

    [Figure 1.3 The process of transcription. (a) Free-floating polymerase II; the polymerase unzips the double helix, leaving exposed bases. (b) mRNA is formed and passes to the ribosomes. (c) mRNA splicing: a splicing enzyme removes the intron from the primary transcript (left exon, intron, right exon), and splice pathways 1 and 2 join the left and right exons. In mRNA, uracil takes the place of one of the bases A, G, C, T.]

    The transcription process consists of three stages: initiation, elongation and termination. Regions of DNA which signal initiation are termed promoters and lie ‘upstream’ of the start of the actual gene (Figure 1.4). Initiation starts with molecules such as polymerase II enzymes finding promoter regions upstream (towards the 3′ end of the template strand) of a gene. These regions consist of specific patterns of bases, known as the CAAT box and TATA box. The start point of a gene is typically 25 bases downstream of the TATA box for eukaryotes. It is believed that there are two regions of promoters. RNA polymerase II enzymes scan the helix looking for these regions and, when found, bind tightly to the region further away from the initiation point. The enzyme then binds to the second region closer to the start point and opens up the helix while at the same time releasing a factor which signals that mRNA should be formed. Elongation is the process by which an mRNA copy of the genetic information is actually made on the unravelled stretch of helix. Certain sequences may cause a pause during this process. Termination is caused in one of two ways. The first is a repeated sequence of bases that causes the mRNA to fold over itself and therefore terminate the transcription process. Typically, a GC-rich (rich in guanine and cytosine) sequence is sufficient to terminate transcription. The second way is for a terminating factor to be released.
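    The idea of scanning for a promoter element and placing the start point about 25 bases downstream can be sketched in code. The sketch below makes several assumptions that are not in the text: the sequence is given as the coding strand read 5′ to 3′ (so ‘downstream’ means a higher index), the TATA box is approximated by the simplified pattern TATA(A/T)A, and the 25-base offset is measured from the first base of the box. The function name predict_start_points is hypothetical, and real promoter recognition is far less clear-cut than a single pattern match.

```python
import re

# Simplified TATA-box pattern, TATA(A/T)A: an assumption for illustration only.
TATA_BOX = re.compile(r"TATA[AT]A")

# Approximate distance in bases from the TATA box to the start point (from the text).
START_OFFSET = 25


def predict_start_points(coding_strand: str) -> list[int]:
    """Return 0-based indices of putative start points.

    Each putative start point lies START_OFFSET bases downstream of the
    first base of a TATA-box match, provided it falls inside the sequence.
    """
    seq = coding_strand.upper()
    starts = []
    for match in TATA_BOX.finditer(seq):
        start = match.start() + START_OFFSET
        if start < len(seq):
            starts.append(start)
    return starts


# Toy sequence: four filler bases, a TATA box, then thirty filler bases.
toy = "GGCC" + "TATAAA" + "C" * 30
print(predict_start_points(toy))
```

    A fuller model would also look for the CAAT box further upstream and would score near-matches, since promoter elements are consensus patterns and not fixed strings.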

    [Figure 1.4 The initiation stage. Sequence elements of the promoter in animals and lower eukaryotes: other control elements at positions up to −200, the CAAT box at about −80, the TATA box (TATA(A/T)A) at about −25, and the start point (+1) followed by the rest of the gene. N indicates that any of the four nucleotides may occur with approximately equal frequency at that site.]

    The process of transcription therefore results in a complementary copy of the gene, but there is one complication. C (cytosine) in DNA is transcribed as G (guanine), and G as C. However, while T (thymine) is transcribed as A (adenine), A is not transcribed as T. Instead, for transcription, a fifth base called uracil (U; its nucleoside is uridine), which is functionally equivalent to thymine (T), is used. Faithful complementary base copying is used instead for another process, replication, whereby the entire genetic material of a cell is copied for cell division and the production of a new copy of the cell (cloning), such as when a new skin cell is required from an existing skin cell. Transcription therefore differs from replication in that transcription involves the use of a fifth base, uracil, which is the complementary base to adenine (A). U does not occur in replicated DNA, and T does not occur in mRNA.

    As previously mentioned, at each nucleotide position along the double-stranded DNA molecule, the nucleotides are complementary. This is because, chemically, A forms two hydrogen bonds with T and C forms three hydrogen bonds with G. There is, however, a peculiar relationship between the directionality of DNA strands and the type of strand involved. One of the strands holds the information that represents a gene. This strand is called the template or anti-sense strand (containing anti-codons, to be described below). The other strand is called, confusingly, the coding or sense strand. The ‘sense’ and ‘anti-sense’ strands represent the two strands of the double helix (Figure 1.5). Transcription uses the anti-sense, or template, strand. Note that in replication a faithful copy of the sense strand produces the anti-sense strand with appropriate direction, and vice versa. The sense strand can therefore be regarded as containing ‘DNA codons’ (to be described later), and the anti-sense strand ‘DNA anti-codons’. DNA codons and anti-codons are not to be confused with mRNA codons, which result from the transcription of the template strand and use U rather than T. There are therefore three ways that a gene can be described: through the template or anti-sense strand, through the coding or sense strand, and through the mRNA that is transcribed from the template or anti-sense strand.

    [Figure 1.5 The difference between replication and transcription. The sense (coding) strand carries the DNA codons and runs 5′ to 3′ (AATTGGCCTGCATCCAAGG in the figure); the anti-sense (template) strand carries the DNA anti-codons and runs 3′ to 5′. Replication copies one DNA strand into the other; transcription of the template strand yields the mRNA codons, running 5′ to 3′.]
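    The difference between replication and transcription can be made concrete with a short sketch. It assumes strands are plain strings over A, C, G, T, written so that paired bases share the same index; the function names replicate and transcribe are illustrative only. The example sequence is the sense strand shown in Figure 1.5.

```python
# Base-pairing rules: DNA pairs A-T and C-G; in mRNA, U pairs with A in place of T.
DNA_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
RNA_COMPLEMENT = {"A": "U", "T": "A", "C": "G", "G": "C"}


def replicate(strand: str) -> str:
    """Faithful complementary DNA copy, base by base (the copy runs in the opposite direction)."""
    return "".join(DNA_COMPLEMENT[base] for base in strand)


def transcribe(template: str) -> str:
    """Complementary mRNA copy of the template (anti-sense) strand, using U where DNA would use T."""
    return "".join(RNA_COMPLEMENT[base] for base in template)


sense = "AATTGGCCTGCATCCAAGG"   # coding strand, written 5' to 3'
template = replicate(sense)      # anti-sense strand, aligned base by base, so it reads 3' to 5'
mrna = transcribe(template)      # mRNA, aligned with the template, so it reads 5' to 3'

print(template)
print(mrna)
```

    Under these rules the mRNA carries the same sequence as the sense strand with U in place of T, which is why the sense strand is also called the coding strand even though the template strand is the one actually read.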

    Spliceosome and transcriptome

    Just because a gene has been transcribed into mRNA does not mean that the task of making a copy of a gene has finished. Genes contain ‘coding’ and ‘non-coding’ regions. These regions are different from the ‘junk DNA’ mentioned earlier, which refers to the DNA between genes rather than within a gene. A coding region in a gene is that sequence of nucleotides within a gene that is actually used for making a protein. Even within a gene there are non-coding regions – nucleotide sequences that are not used for making a protein. These non-coding regions have to be removed from the mRNA, which is nothing but a faithful copy of a gene from beginning to end, including non-coding regions. After the mRNA has been ‘edited’ to remove introns, there is still another process that is only recently being understood. The remaining exons in the mRNA can themselves be ‘edited’ so that some exons are removed (Figure 1.3(b)) or shuffled to form alternative ‘splice pathways’ (that is, alternative ways that the remaining coding regions make up the final mRNA, Figure 1.3(c)). The study of how mRNA is formed from genes is called ‘transcriptomics’ and the total set of mRNA transcripts is called the ‘transcriptome’. The transcriptome provides information as to which genes are being transcribed and which are not, depending on the cell type and various conditions experienced by the cell. The study of how mRNA is edited after initial transcription is called ‘spliceosomics’ and the total set of alternative splice pathways for all genes is called the ‘spliceosome’. Recent advances in microarray technology have made transcriptomics and spliceosomics possible, as will be seen later.
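    To give a feel for how one gene can yield many transcripts, the sketch below enumerates the transcripts obtained when any subset of exons is removed while the rest keep their original order. This is a deliberate simplification and an assumption of the example: it ignores exon shuffling, and in a real cell only some of these combinations occur. The function name splice_pathways and the exon labels are hypothetical.

```python
from itertools import combinations


def splice_pathways(exons: list[str], min_exons: int = 1) -> list[tuple[str, ...]]:
    """Enumerate transcripts formed by keeping at least min_exons exons in their original order."""
    pathways = []
    # Start with the full transcript, then drop one exon, two exons, and so on.
    for kept in range(len(exons), min_exons - 1, -1):
        pathways.extend(combinations(exons, kept))
    return pathways


for pathway in splice_pathways(["E1", "E2", "E3"]):
    print("-".join(pathway))
```

    With n exons there are 2^n - 1 non-empty combinations, so even a gene with ten exons has over a thousand candidate transcripts under this simplified model.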

    There is growing interest in those regions of DNA within a gene which indicate exon/intron boundaries to try to understand the transcriptome in more detail. Introns, for eukaryotes including humans, average in length from about 200 to 400 nucleotides, but this figure can vary greatly (from 50 to about 30 000). Some of the longer introns may contain other genes, each with their own introns. Analysis of exon/intron boundaries reveals, with very few exceptions, a GT/AG rule, whereby the occurrence of GT towards the 5′ end of a DNA sequence indicates the start of an intron and the occurrence of AG towards the 3′ end indicates the end of the intron. It appears that internal splicing mechanisms recognize the mRNA counterparts to these duplets and remove the intervening sequence from the transcribed mRNA (called ‘pre-mRNA’). Interest is also growing in alternative splicing models that capture alternative pathways for the removal of introns. Any DNA segment can therefore be an exon or an intron, depending on whether it is retained or removed during processing of pre-mRNA. Once all editing has taken place, the result is ‘mature mRNA’ which is ready for translation into a polypeptide chain.
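    The GT/AG rule can be illustrated with a naive search for candidate introns. The sketch below lists every stretch of a DNA sequence that begins with GT and ends with AG and whose length lies within given bounds; the default bounds of 50 and 30 000 are taken from the range of intron lengths quoted above. The function names candidate_introns and splice_out are hypothetical, and the approach is an assumption for illustration: GT and AG occur by chance throughout any sequence, so the rule on its own yields many false candidates.

```python
def candidate_introns(dna: str, min_len: int = 50, max_len: int = 30_000) -> list[tuple[int, int]]:
    """Return (start, end) pairs such that dna[start:end] begins with GT and ends with AG."""
    seq = dna.upper()
    # Positions where a candidate intron could start (the G of a GT duplet).
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    # Positions just past where a candidate intron could end (after the G of an AG duplet).
    acceptors = [j + 2 for j in range(len(seq) - 1) if seq[j:j + 2] == "AG"]
    return [(i, j) for i in donors for j in acceptors if min_len <= j - i <= max_len]


def splice_out(dna: str, intron: tuple[int, int]) -> str:
    """Remove one candidate intron, joining the flanking sequences."""
    start, end = intron
    return dna[:start] + dna[end:]


# Toy gene: exon, a short 'intron' delimited by GT ... AG, then a second exon.
toy_gene = "ATGCCC" + "GT" + "C" * 10 + "AG" + "CCCTAA"
introns = candidate_introns(toy_gene, min_len=4)
print(introns)
print(splice_out(toy_gene, introns[0]))
```

    The toy intron is far shorter than real ones, which is why min_len is lowered for the example; with the default bounds the toy gene has no candidates at all.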

    Translation and the proteome

    The mature mRNA leaves the nucleus and is transported to ribosomes, where translation into proteins takes place with the help of transfer RNA (tRNA). The nucleotides of the mRNA enter the ribosome sequentially from beginning to end and form groups of three bases, called codons (Figure 1.6). When a codon enters the ribosome, free-floating tRNA molecules consisting of a matching element and an amino acid attempt

    [Figure 1.6 The process of translation at a ribosome. The mRNA is grouped into three bases at a time (codons); free-floating tRNA, each carrying an amino acid, is matched against the codons at the ribosome; the growing polypeptide chain (protein) folds into a complex structure; ‘spent’ tRNA is released and codons are broken up for reuse of mRNA bases.]