Understanding Bioinformatics
BIF Prelims 5th proofs.qxd 18/7/07 10:34 Page i
In memory of Arno Siegmund Baum
Understanding Bioinformatics
Marketa Zvelebil & Jeremy O. Baum
Senior Publisher: Jackie Harbor
Editor: Dom Holdsworth
Development Editor: Eleanor Lawrence
Illustrations: Nigel Orme
Typesetting: Georgina Lucas
Cover design: Matthew McClements, Blink Studio Limited
Production Manager: Tracey Scarlett
Copyeditor: Jo Clayton
Proofreader: Sally Livitt
Accuracy Checking: Eleni Rapsomaniki
Indexer: Lisa Furnival
Vice President: Denise Schanck
© 2008 by Garland Science, Taylor & Francis Group, LLC
This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with permission, and sources are indicated. Every attempt has been made to source the figures accurately. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or for the consequences of their use.
All rights reserved. No part of this book covered by the copyright herein may be reproduced or used in any format in any form or by any means (graphic, electronic, or mechanical, including photocopying, recording, taping, or information storage and retrieval systems) without permission of the publisher.
10-digit ISBN 0-8153-4024-9 (paperback) 13-digit ISBN 978-0-8153-4024-9 (paperback)
Library of Congress Cataloging-in-Publication Data
Zvelebil, Marketa J.
Understanding bioinformatics / Marketa Zvelebil & Jeremy O. Baum.
p. ; cm.
Includes bibliographical references and index.
ISBN-13: 978-0-8153-4024-9 (pbk.)
ISBN-10: 0-8153-4024-9 (pbk.)
1. Bioinformatics.
[DNLM: 1. Computational Biology--methods. QU 26.5 Z96u 2008] I. Baum, Jeremy O. II. Title.
QH324.2.Z84 2008
572.80285--dc22
2007027514
Published by Garland Science, Taylor & Francis Group, LLC, an informa business
270 Madison Avenue, New York, NY 10016, USA, and 2 Park Square, Milton Park, Abingdon, OX14 4RN, UK.
Printed in the United States of America.
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
Visit our Web site at http://www.garlandscience.com
Taylor & Francis Group, an informa business
Preface
The analysis of data arising from biomedical research has undergone a revolution over the last 15 years, brought about by the combined impact of the Internet and the development of increasingly sophisticated and accurate bioinformatics techniques. All research workers in the areas of biomolecular science and biomedicine are now expected to be competent in several areas of sequence analysis and often, additionally, in protein structure analysis and other more advanced bioinformatics techniques.
When we began our research careers in the early 1980s all of the techniques that now comprise bioinformatics were restricted to specialists, as databases and user-friendly applications were not readily available and had to be installed on laboratory computers. By the mid-1990s many datasets and analysis programs had become available on the Internet, and the scientists who produced sequences began to take on tasks such as sequence alignment themselves. However, there was a delay in providing comprehensive training in these techniques. At the end of the 1990s we started to expand our teaching of bioinformatics at both undergraduate and postgraduate level. We soon realized that there was a need for a textbook that bridged the gap between the simplistic introductions available, which concentrated on results almost to the exclusion of the underlying science, and the very detailed monographs, which presented the theoretical underpinnings of a restricted set of techniques. This textbook is our attempt to fill that gap.
Therefore on the one hand we wanted to include material explaining the program methods, because we believe that to perform a proper analysis it is not sufficient to understand how to use a program and the kind of results (and errors!) it can produce. It is also necessary to have some understanding of the technique used by the program and the science on which it is based. But on the other hand, we wanted this book to be accessible to the bioinformatics beginner, and we recognized that even the more advanced students occasionally just want a quick reminder of what an application does, without having to read through the theory behind it.
From this apparent dilemma was born the division into Applications and Theory Chapters. Throughout the book, we wrote dedicated Applications Chapters to provide a working knowledge of bioinformatics applications, quick and easy to grasp. In most places, an Applications Chapter is then followed by a Theory Chapter, which explains the program methods and the science behind them. Inevitably, we found this created a small amount of duplication between some chapters, but to us this was a small sacrifice if it left the reader free to choose at what level they could engage with the subject of bioinformatics.
We have created a book that will serve as a comfortable introduction to any new student of bioinformatics, but which they can continue to use into their postgraduate studies. The book assumes a certain level of understanding of the background biology, for example gene and protein structure, where it is important to appreciate the variety that exists and not only know the canonical examples of first-year textbooks. In addition, to describe the techniques in detail a level of mathematics is required which is more appropriate for more advanced students. We are aware that many postgraduate students of bioinformatics have a background in areas such as computer science and mathematics. They will find many familiar algorithmic approaches presented, but will see their application in unfamiliar territory. As they read the book they will also appreciate that to become truly competent at bioinformatics they will require knowledge of biomedical science.
There is a certain amount of frustration inherent in producing any book, as the writing process seems often to be as much about what cannot be included as what can. Bioinformatics as a subject has already expanded to such an extent that we had to be careful not to diminish the book’s teaching value by trying to squeeze every possible topic into it. We have tried to include as broad a range of subjects as possible, but some have been omitted. For example, we do not deal with the methods of constructing a nucleotide sequence from the individual reads, nor with a number of more specialized aspects of genome annotation.
The final chapter is an introduction to the even-faster-moving subject of systems biology. Again, we had to balance the desire to say more against the practical constraints of space. But we hope this chapter gives readers a flavor of what the subject covers and the questions it is trying to answer. The chapter will not answer every reader’s every query about systems biology, but if it prompts more of them to inquire further, that is already an achievement.
We wish to acknowledge many people who have helped us with this project. We would almost certainly not have got here without the enthusiasm and support of Matthew Day who guided us through the process of getting a first draft. Getting from there to the finished book was made possible by the invaluable advice and encouragement from Chris Dixon, Dom Holdsworth, Jackie Harbor, and others from Garland Science. We also wish to thank Eleanor Lawrence for her skills in massaging our text into shape, and Nigel Orme for producing the wonderful illustrations. We received inspiration and encouragement from many others, too many to name here, but including our students and those who read our draft chapters.
Finally, we wish to thank the many friends and family members who have had to suffer while we wrote this book. In particular JB wishes to thank his wife Hilary for her encouragement and perseverance. MZ wishes to specially thank her parents, Martin Scurr, Nick Lee, and her colleagues at work.
Marketa Zvelebil
Jeremy O. Baum
May 2007
A Note to the Reader
Organization of this Book
Applications and Theory Chapters
Careful thought has gone into the organization of this book. The chapters are grouped in two ways. Firstly, the chapters are organized into seven parts according to topic. Within the parts, there is a second, less traditional, level of organization: most chapters are designated as either Applications or Theory Chapters. This book is designed to be accessible both to students who wish to obtain a working knowledge of the bioinformatics applications and to students who want to know how the applications work and maybe write their own. So at the start of most parts, there are dedicated Applications Chapters, which deal with the more practical aspects of the particular research area, and are intended to act as a useful hands-on introduction. Following these are Theory Chapters, which explain the science, theory, and techniques employed in generally available applications. These are more demanding and should preferably be read after having gained a little experience of running the programs. In order to become truly proficient in the techniques you need to read and understand these more technical aspects. On the opening page of each chapter, and in the Table of Contents, it is clearly indicated whether it is an Applications or a Theory Chapter.
Part 1: Background Basics
Background Basics provides three introductory chapters to key knowledge that will be assumed throughout the remainder of the book. The first two chapters contain material that should be well-known to readers with a background in biomedical science. The first chapter describes the structure of nucleic acids and some of the roles played by them in living systems, including a brief description of how the genomic DNA is transcribed into mRNA and then translated into protein. The second chapter describes the structure and organization of proteins. Both of these chapters present only the most basic information required, and should not in any way be regarded as an adequate grounding in these topics for serious work. The intention is to provide enough information to make this book self-sufficient. The third chapter in this part describes databases, again at a very introductory level. Many biomedical research workers have large datasets to analyze, and these need to be stored in a convenient and practical way. Databases can provide a complete solution to this problem.
Part 2: Sequence Alignments
Sequence Alignments contains three chapters that deal with a variety of analyses of sequences, all relating to identifying similarities. Chapter 4 is a practical introduction to the area, following some examples through different analyses and showing some potential problems as well as successful results. Chapters 5 and 6 deal with several of the many different techniques used in sequence analysis. Chapter 5 focuses on the general aspects of aligning two sequences and the specific methods employed in database searches. A number of techniques are described in detail, including dynamic programming, suffix trees, hashing, and chaining. Chapter 6 deals with methods involving many sequences, defining commonly occurring patterns, defining the profile of a family of related proteins, and constructing a multiple alignment. A key technique presented in this chapter is that of hidden Markov models (HMMs).
Part 3: Evolutionary Processes
Evolutionary Processes presents the methods used to obtain phylogenetic trees from a sequence dataset. These trees are reconstructions of the evolutionary history of the sequences, assuming that they share a common ancestor. Chapter 7 explains some of the basic concepts involved, and then shows how the different methods can be applied to two different scientific problems. In Chapter 8 details are given of the techniques involved and how they relate to the assumptions made about the evolutionary processes.
Part 4: Genome Characteristics
Genome Characteristics deals with the analysis required to interpret raw genome sequence data. Although by the time a genome sequence is published in the research journals some preliminary analysis will have been carried out, often the unanalyzed sequence is available before then. This part describes some of the techniques that can be used to try to locate genes in the sequence. Chapter 9 describes some of the range of programs available, and shows how complex their output can be and illustrates some of the possible pitfalls. Chapter 10 presents a survey of the techniques used, especially different Markov models and how models of whole genes can be built up from models of individual components such as ribosome-binding sites.
Part 5: Secondary Structures
Secondary Structures provides two chapters on methods of predicting secondary structures based on sequence (or primary structure). Chapter 11 introduces the methods of secondary structure prediction and discusses the various techniques and ways to interpret the results. Later sections of the chapter deal with prediction of more specialized secondary structure such as protein transmembrane regions, coiled coil and leucine zipper structures, and RNA secondary structures. Chapter 12 presents the underlying principles and details of the prediction methods from basic concepts to in-depth understanding of techniques such as neural networks and Markov models applied to this problem.
Part 6: Tertiary Structures
Tertiary Structures extends the material in Part 5 to enable the prediction and modeling of protein tertiary and quaternary structure. Chapter 13 introduces the reader to the concepts of energy functions, minimization, and ab initio prediction. It deals in more detail with the method of threading and focuses on homology modeling of protein structures, taking the student in a stepwise fashion through the process. The chapter ends with example studies to illustrate the techniques. Chapter 14 contains methods and techniques for further analysis of structural information and describes the importance of structure and function relationships. This chapter deals with how fold prediction can help to identify function, as well as giving an introduction to ligand docking and drug design.
Part 7: Cells and Organisms
Cells and Organisms consists of two chapters that deal in some detail with expression analysis and an introductory chapter on systems biology. Chapter 15 introduces the techniques available to analyze protein and gene expression data. It shows the reader the information that can be learned from these experimental techniques as well as how the information could be used for further analysis. Chapter 16 presents some of the clustering techniques and statistics that are touched upon in Chapter 15 and are commonly used in gene and protein expression analysis. Chapter 17 is a standalone chapter dealing with the modeling of systems processes. It introduces the reader to the basic concepts of systems biology, and shows what this exciting and rapidly growing field may achieve in the future.
Appendices
Three appendices are provided that expand on some of the concepts mentioned in the main part of this book. These are useful for the more inquisitive and advanced reader. Appendix A deals with probability and Bayesian analysis, Appendix B is mainly associated with Part 6 and deals with molecular energy functions, while Appendix C describes function optimization techniques.
Organization of the Chapters
Learning Outcomes
Each chapter opens with a list of learning outcomes which summarize the topics to be covered and act as a revision checklist.
Flow Diagrams
Within each chapter every section is introduced with a flow diagram to help the student to visualize and remember the topics covered in that section. A flow diagram from Chapter 5 is given below, as an example. Those concepts which will be described in the current section are shown in yellow boxes with arrows to show how they are connected to each other. For example two main types of optimal alignments will be described in this section of the chapter: local and global. Those concepts which were described in previous sections of the chapter are shown in grey boxes, so that the links can easily be seen between the topics of the current section and what has already been presented. For example, creating alignments requires methods for scoring gaps and for scoring substitutions, both of which have already been described in the chapter. In this way the major concepts and their inter-relationships are gradually built up throughout the chapter.
[Figure: flow diagram from Chapter 5, "Pairwise Sequence Alignment and Database Searching". Box labels: scoring gaps; scoring substitutions; residue properties; log-odds scores; PAM scoring matrices; BLOSUM scoring matrices; alignments; potentially nonoptimal; band or X-drop; optimal alignments; suboptimal alignments; global; local; Needleman–Wunsch; Smith–Waterman.]
Mind Maps
Each chapter has a mind map, which is a specialized pedagogical feature, enabling the student to visualize and remember the steps that are necessary for specific applications. The mind map for Chapter 4 is given below, as an example. In this example, four main areas of the topic ‘producing and analyzing sequence alignments’ have been identified: measuring matches, database searching, aligning sequences, and families. Each of these areas, colored for clarity, is developed to identify the key concepts involved, creating a visual aid to help the reader see at a glance the range of the material covered in discussing this area. Occasionally there are important connections between distinct areas of the mind map, as here in linking BLAST and PHI-BLAST, with the latter method being derived directly from the former, but having a quite different function, and thus being in a different area of the mind map.
Illustrations
Each chapter is illustrated with four-color figures. Considerable care has been put into ensuring simplicity as well as consistency of representation across the book. Figure 4.16 is given below, as an example.
[Figure: mind map for Chapter 4, "producing and analyzing sequence alignments". Labels recovered by area: database searching (pairwise alignment, BLAST, SSEARCH, FASTA); families (patterns, PHI-BLAST, PRATT, PROSITE, MEME, domains, Pfam, others); aligning sequences (pairwise, multiple, global, local); measuring matches (conservation, gap penalty, % identity, scoring, substitution matrices, BLOSUM, PAM, others).]
[Figure 4.16, panels (A) to (C): aligned sequence segments from p110d, p110b, p110g, and p110a, with motif-search output listing name, combined p-value, and motifs; the entries include PRKD human and P11G pig.]
Further Reading
It is not possible to summarize all current knowledge in the confines of this book, let alone anticipate future developments in this rapidly developing subject. Therefore at the end of each chapter there are references to research literature and specialist monographs to help readers continue to develop their knowledge and skills. We have grouped the books and articles according to topic, such that the sections within the Further Reading correspond to the sections in the chapter itself: we hope this will help the reader target their attention more quickly onto the appropriate extension material.
List of Symbols
Bioinformatics makes use of numerous symbols, many of which will be unfamiliar to those who do not already know the subject well. To help the reader navigate the symbols used in this book, a comprehensive list is given at the back which quotes each symbol, its definition, and where its most significant occurrences in the book are located.
Glossary
All technical terms are highlighted in bold where they first appear in the text and are then listed and explained in the Glossary. Further, each term in the Glossary also appears in the Index, so the reader can quickly gain access to the relevant pages where the term is covered in more detail. The book has been designed to cross-reference in as thorough and helpful a way as possible.
Garland Science Website
Garland Science has made available a number of supplementary resources on its website, which are freely available and do not require a password. For more details, go to www.garlandscience.com/gs_textbooks.asp and follow the link to Understanding Bioinformatics.
Artwork
All the figures in Understanding Bioinformatics are available to download from the Garland Science website. The artwork files are saved in zip format, with a single zip file for each chapter. Individual figures can then be extracted as jpg files.
Additional Material
The Garland Science website has some additional material relating to the topics in this book. For each of the seven parts a pdf is available, which provides a set of useful weblinks relevant to those chapters. These include weblinks to relevant and important databases and to file format definitions, as well as to free programs and to servers which permit data analysis on-line. In addition to these, the sets of data which were used to illustrate the methods of analysis are also provided. These will allow the reader to reanalyze the same data, reproducing the results shown here and trying out other techniques.
List of Reviewers
The Authors and Publishers of Understanding Bioinformatics gratefully acknowledge the contribution of the following reviewers in the development of this book:
Stephen Altschul National Center for Biotechnology Information, Bethesda, Maryland, USA
Petri Auvinen Institute of Biotechnology, University of Helsinki, Finland
Joel Bader Johns Hopkins University, Baltimore, USA
Tim Bailey University of Queensland, Brisbane, Australia
Alex Bateman Wellcome Trust Sanger Institute, Cambridge, UK
Meredith Betterton University of Colorado at Boulder, USA
Andy Brass University of Manchester, UK
Chris Bystroff Rensselaer Polytechnic Institute, Troy, USA
Charlotte Deane University of Oxford, UK
John Hancock MRC Mammalian Genetics Unit, Harwell, Oxfordshire, UK
Steve Harris University of Oxford, UK
Steve Henikoff Fred Hutchinson Cancer Research Center, Seattle, USA
Jaap Heringa Free University, Amsterdam, Netherlands
Sudha Iyengar Case Western Reserve University, Cleveland, USA
Sun Kim Indiana University Bloomington, USA
Patrice Koehl University of California Davis, USA
Frank Lebeda US Army Medical Research Institute of Infectious Diseases, Fort Detrick, Maryland, USA
David Liberles University of Bergen, Norway
Peter Lockhart Massey University, Palmerston North, New Zealand
James McInerney National University of Ireland, Maynooth, Ireland
Nicholas Morris University of Newcastle, UK
William Pearson University of Virginia, Charlottesville, USA
Marialuisa Pellegrini-Calace European Bioinformatics Institute, Cambridge, UK
Mihaela Pertea University of Maryland, College Park, Maryland, USA
David Robertson University of Manchester, UK
Rob Russell EMBL, Heidelberg, Germany
Ravinder Singh University of Colorado, USA
Deanne Taylor Brandeis University, Waltham, Massachusetts, USA
Jen Taylor University of Oxford, UK
Iosif Vaisman University of North Carolina at Chapel Hill, USA
Contents in Brief
PART 1 Background Basics
Chapter 1: The Nucleic Acid World
Chapter 2: Protein Structure
Chapter 3: Dealing With Databases
PART 2 Sequence Alignments
Chapter 4: Producing and Analyzing Sequence Alignments (Applications Chapter)
Chapter 5: Pairwise Sequence Alignment and Database Searching (Theory Chapter)
Chapter 6: Patterns, Profiles, and Multiple Alignments (Theory Chapter)
PART 3 Evolutionary Processes
Chapter 7: Recovering Evolutionary History (Applications Chapter)
Chapter 8: Building Phylogenetic Trees (Theory Chapter)
PART 4 Genome Characteristics
Chapter 9: Revealing Genome Features (Applications Chapter)
Chapter 10: Gene Detection and Genome Annotation (Theory Chapter)
PART 5 Secondary Structures
Chapter 11: Obtaining Secondary Structure from Sequence (Applications Chapter)
Chapter 12: Predicting Secondary Structures (Theory Chapter)
PART 6 Tertiary Structures
Chapter 13: Modeling Protein Structure (Applications Chapter)
Chapter 14: Analyzing Structure–Function Relationships (Applications Chapter)
PART 7 Cells and Organisms
Chapter 15: Proteome and Gene Expression Analysis
Chapter 16: Clustering Methods and Statistics
Chapter 17: Systems Biology
APPENDICES Background Theory
Appendix A: Probability, Information, and Bayesian Analysis
Appendix B: Molecular Energy Functions
Appendix C: Function Optimization
Contents
Preface
A Note to the Reader
List of Reviewers
Contents in Brief
Part 1 Background Basics
Chapter 1 The Nucleic Acid World
1.1 The Structure of DNA and RNA
  DNA is a linear polymer of only four different bases
  Two complementary DNA strands interact by base pairing to form a double helix
  RNA molecules are mostly single stranded but can also have base-pair structures
1.2 DNA, RNA, and Protein: The Central Dogma
  DNA is the information store, but RNA is the messenger
  Messenger RNA is translated into protein according to the genetic code
  Translation involves transfer RNAs and RNA-containing ribosomes
1.3 Gene Structure and Control
  RNA polymerase binds to specific sequences that position it and identify where to begin transcription
  The signals initiating transcription in eukaryotes are generally more complex than those in bacteria
  Eukaryotic mRNA transcripts undergo several modifications prior to their use in translation
  The control of translation
1.4 The Tree of Life and Evolution
  A brief survey of the basic characteristics of the major forms of life
  Nucleic acid sequences can change as a result of mutation
Summary
Further Reading
Chapter 2 Protein Structure
2.1 Primary and Secondary Structure
  Protein structure can be considered on several different levels
  Amino acids are the building blocks of proteins
  The differing chemical and physical properties of amino acids are due to their side chains
  Amino acids are covalently linked together in the protein chain by peptide bonds
  Secondary structure of proteins is made up of α-helices and β-strands
  Several different types of β-sheet are found in protein structures
  Turns, hairpins and loops connect helices and strands
2.2 Implication for Bioinformatics
  Certain amino acids prefer a particular structural unit
  Evolution has aided sequence analysis
  Visualization and computer manipulation of protein structures
2.3 Proteins Fold to Form Compact Structures
  The tertiary structure of a protein is defined by the path of the polypeptide chain
  The stable folded state of a protein represents a state of low energy
  Many proteins are formed of multiple subunits
Summary
Further Reading
Chapter 3 Dealing with Databases
3.1 The Structure of Databases
  Flat-file databases store data as text files
  Relational databases are widely used for storing biological information
  XML has the flexibility to define bespoke data classifications
  Many other database structures are used for biological data
  Databases can be accessed locally or online and often link to each other
3.2 Types of Database
  There’s more to databases than just data
  Primary and derived data
  How we define and connect things is very important: Ontologies
3.3 Looking for Databases
  Sequence databases
  Microarray databases
  Protein interaction databases
  Structural databases
3.4 Data Quality
  Nonredundancy is especially important for some applications of sequence databases
  Automated methods can be used to check for data consistency
  Initial analysis and annotation is usually automated
  Human intervention is often required to produce the highest quality annotation
  The importance of updating databases and entry identifier and version numbers
Summary
Further Reading
Part 2 Sequence Alignments
APPLICATIONS CHAPTER
Chapter 4 Producing and Analyzing Sequence Alignments
4.1 Principles of Sequence Alignment
  Alignment is the task of locating equivalent regions of two or more sequences to maximize their similarity
  Alignment can reveal homology between sequences
  It is easier to detect homology when comparing protein sequences than when comparing nucleic acid sequences
4.2 Scoring Alignments
  The quality of an alignment is measured by giving it a quantitative score
  The simplest way of quantifying similarity between two sequences is percentage identity
  The dot-plot gives a visual assessment of similarity based on identity
  Genuine matches do not have to be identical
  There is a minimum percentage identity that can be accepted as significant
  There are many different ways of scoring an alignment
4.3 Substitution Matrices
  Substitution matrices are used to assign individual scores to aligned sequence positions
  The PAM substitution matrices use substitution frequencies derived from sets of closely related protein sequences
  The BLOSUM substitution matrices use mutation data from highly conserved local regions of sequence
  The choice of substitution matrix depends on the problem to be solved
4.4 Inserting Gaps
  Gaps inserted in a sequence to maximize similarity require a scoring penalty
  Dynamic programming algorithms can determine the optimal introduction of gaps
4.5 Types of Alignment
  Different kinds of alignments are useful in different circumstances
  Multiple sequence alignments enable the simultaneous comparison of a set of similar sequences
  Multiple alignments can be constructed by several different techniques
  Multiple alignments can improve the accuracy of alignment for sequences of low similarity
  ClustalW can make global multiple alignments of both DNA and protein sequences
  Multiple alignments can be made by combining a series of local alignments
  Alignment can be improved by incorporating additional information
4.6 Searching Databases
  Fast yet accurate search algorithms have been developed
  FASTA is a fast database-search method based on matching short identical segments
  BLAST is based on finding very similar short segments
  Different versions of BLAST and FASTA are used for different problems
  PSI-BLAST enables profile-based database searches
  SSEARCH is a rigorous alignment method
4.7 Searching with Nucleic Acid or Protein Sequences
  DNA or RNA sequences can be used either directly or after translation
  The quality of a database match has to be tested to ensure that it could not have arisen by chance
  Choosing an appropriate E-value threshold helps to limit a database search
  Low-complexity regions can complicate homology searches
  Different databases can be used to solve particular problems
4.8 Protein Sequence Motifs or Patterns
  Creation of pattern databases requires expert knowledge
  The BLOCKS database contains automatically compiled short blocks of conserved multiply aligned protein sequences
4.9 Searching Using Motifs and Patterns
  The PROSITE database can be searched for protein motifs and patterns
The pattern-based program PHI-BLAST searches for both homology and matching motifs 108
Patterns can be generated from multiple sequences using PRATT 108
The PRINTS database consists of fingerprints representing sets of conserved motifs that describe a protein family 109
The Pfam database defines profiles of protein families 109
4.10 Patterns and Protein Function 109
Searches can be made for particular functional sites in proteins 109
Sequence comparison is not the only way of analyzing protein sequences 110
Summary 111
Further Reading 112
THEORY CHAPTER
Chapter 5 Pairwise Sequence Alignment and Database Searching
5.1 Substitution Matrices and Scoring 117
Alignment scores attempt to measure the likelihood of a common evolutionary ancestor 117
The PAM (MDM) substitution scoring matrices were designed to trace the evolutionary origins of proteins 119
The BLOSUM matrices were designed to find conserved regions of proteins 122
Scoring matrices for nucleotide sequence alignment can be derived in similar ways 125
The substitution scoring matrix used must be appropriate to the specific alignment problem 126
Gaps are scored in a much more heuristic way than substitutions 126
5.2 Dynamic Programming Algorithms 127
Optimal global alignments are produced using efficient variations of the Needleman–Wunsch algorithm 129
Local and suboptimal alignments can be produced by making small modifications to the dynamic programming algorithm 135
Time can be saved with a loss of rigor by not calculating the whole matrix 139
5.3 Indexing Techniques and Algorithmic Approximations 141
Suffix trees locate the positions of repeats and unique sequences 141
Hashing is an indexing technique that lists the starting positions of all k-tuples 143
The FASTA algorithm uses hashing and chaining for fast database searching 144
The BLAST algorithm makes use of finite-state automata 147
Comparing a nucleotide sequence directly with a protein sequence requires special modifications to the BLAST and FASTA algorithms 150
5.4 Alignment Score Significance 153
The statistics of gapped local alignments can be approximated by the same theory 156
5.5 Aligning Complete Genome Sequences 156
Indexing and scanning whole genome sequences efficiently is crucial for the sequence alignment of higher organisms 157
The complex evolutionary relationships between the genomes of even closely related organisms require novel alignment algorithms 159
Summary 159
Further Reading 161
THEORY CHAPTER
Chapter 6 Patterns, Profiles, and Multiple Alignments
6.1 Profiles and Sequence Logos 167
Position-specific scoring matrices are an extension of substitution scoring matrices 168
Methods for overcoming a lack of data in deriving the values for a PSSM 171
PSI-BLAST is a sequence database searching program 176
Representing a profile as a logo 177
6.2 Profile Hidden Markov Models 179
The basic structure of HMMs used in sequence alignment to profiles 180
Estimating HMM parameters using aligned sequences 185
Scoring a sequence against a profile HMM: The most probable path and the sum over all paths 187
Estimating HMM parameters using unaligned sequences 190
6.3 Aligning Profiles 193
Comparing two PSSMs by alignment 193
Aligning profile HMMs 195
6.4 Multiple Sequence Alignments by Gradual Sequence Addition 196
The order in which sequences are added is chosen based on the estimated likelihood of incorporating errors in the alignment 198
Many different scoring schemes have been used in constructing multiple alignments 200
The multiple alignment is built using the guide tree and profile methods and may be further refined 204
6.5 Other Ways of Obtaining Multiple Alignments 207
The multiple sequence alignment program DIALIGN aligns ungapped blocks 207
The SAGA method of multiple alignment uses a genetic algorithm 209
6.6 Sequence Pattern Discovery 211
Discovering patterns in a multiple alignment: eMOTIF and AACC 213
Probabilistic searching for common patterns in sequences: Gibbs and MEME 215
Searching for more general sequence patterns 217
Summary 218
Further Reading 219
Part 3 Evolutionary Processes
APPLICATIONS CHAPTER
Chapter 7 Recovering Evolutionary History
7.1 The Structure and Interpretation of Phylogenetic Trees 225
Phylogenetic trees reconstruct evolutionary relationships 225
Tree topology can be described in several ways 230
Consensus and condensed trees report the results of comparing tree topologies 232
7.2 Molecular Evolution and its Consequences 235
Most related sequences have many positions that have mutated several times 236
The rate of accepted mutation is usually not the same for all types of base substitution 236
Different codon positions have different mutation rates 238
Only orthologous genes should be used to construct species phylogenetic trees 239
Major changes affecting large regions of the genome are surprisingly common 247
7.3 Phylogenetic Tree Reconstruction 248
Small ribosomal subunit rRNA sequences are well suited to reconstructing the evolution of species 249
The choice of the method for tree reconstruction depends to some extent on the size and quality of the dataset 249
A model of evolution must be chosen to use with the method 251
All phylogenetic analyses must start with an accurate multiple alignment 255
Phylogenetic analyses of a small dataset of 16S RNA sequence data 255
Building a gene tree for a family of enzymes can help to identify how enzymatic functions evolved 259
Summary 264
Further Reading 265
THEORY CHAPTER
Chapter 8 Building Phylogenetic Trees
8.1 Evolutionary Models and the Calculation of Evolutionary Distance 268
A simple but inaccurate measure of evolutionary distance is the p-distance 268
The Poisson distance correction takes account of multiple mutations at the same site 270
The Gamma distance correction takes account of mutation rate variation at different sequence positions 270
The Jukes–Cantor model reproduces some basic features of the evolution of nucleotide sequences 271
More complex models distinguish between the relative frequencies of different types of mutation 272
There is a nucleotide bias in DNA sequences 275
Models of protein-sequence evolution are closely related to the substitution matrices used for sequence alignment 276
8.2 Generating Single Phylogenetic Trees 276
Clustering methods produce a phylogenetic tree based on evolutionary distances 276
The UPGMA method assumes a constant molecular clock and produces an ultrametric tree 278
The Fitch–Margoliash method produces an unrooted additive tree 279
The neighbor-joining method is related to the concept of minimum evolution 282
Stepwise addition and star-decomposition methods are usually used to generate starting trees for further exploration, not the final tree 285
8.3 Generating Multiple Tree Topologies 286
The branch-and-bound method greatly improves the efficiency of exploring tree topology 288
Optimization of tree topology can be achieved by making a series of small changes to an existing tree 288
Finding the root gives a phylogenetic tree a direction in time 291
8.4 Evaluating Tree Topologies 293
Functions based on evolutionary distances can be used to evaluate trees 293
Unweighted parsimony methods look for the trees with the smallest number of mutations 297
Mutations can be weighted in different ways in the parsimony method 300
Trees can be evaluated using the maximum likelihood method 302
The quartet-puzzling method also involves maximum likelihood in the standard implementation 305
Bayesian methods can also be used to reconstruct phylogenetic trees 306
8.5 Assessing the Reliability of Tree Features and Comparing Trees 307
The long-branch attraction problem can arise even with perfect data and methodology 308
Tree topology can be tested by examining the interior branches 309
Tests have been proposed for comparing two or more alternative trees 310
Summary 311
Further Reading 312
Part 4 Genome Characteristics
APPLICATIONS CHAPTER
Chapter 9 Revealing Genome Features
9.1 Preliminary Examination of Genome Sequence 318
Whole genome sequences can be split up to simplify gene searches 319
Structural RNA genes and repeat sequences can be excluded from further analysis 319
Homology can be used to identify genes in both prokaryotic and eukaryotic genomes 322
9.2 Gene Prediction in Prokaryotic Genomes 322
9.3 Gene Prediction in Eukaryotic Genomes 323
Programs for predicting exons and introns use a variety of approaches 323
Gene predictions must preserve the correct reading frame 324
Some programs search for exons using only the query sequence and a model for exons 327
Some programs search for genes using only the query sequence and a gene model 332
Genes can be predicted using a gene model and sequence similarity 334
Genomes of related organisms can be used to improve gene prediction 336
9.4 Splice Site Detection 337
Splice sites can be detected independently by specialized programs 338
9.5 Prediction of Promoter Regions 338
Prokaryotic promoter regions contain relatively well-defined motifs 339
Eukaryotic promoter regions are typically more complex than prokaryotic promoters 340
A variety of promoter-prediction methods are available online 340
Promoter prediction results are not very clear-cut 341
9.6 Confirming Predictions 342
There are various methods for calculating the accuracy of gene-prediction programs 342
Translating predicted exons can confirm the correctness of the prediction 343
Constructing the protein and identifying homologs 343
9.7 Genome Annotation 346
Genome annotation is the final step in genome analysis 347
Gene ontology provides a standard vocabulary for gene annotation 348
9.8 Large Genome Comparisons 353
Summary 354
Further Reading 355
THEORY CHAPTER
Chapter 10 Gene Detection and Genome Annotation
10.1 Detection of Functional RNA Molecules Using Decision Trees 361
Detection of tRNA genes using the tRNAscan algorithm 361
Detection of tRNA genes in eukaryotic genomes 362
10.2 Features Useful for Gene Detection in Prokaryotes 364
10.3 Algorithms for Gene Detection in Prokaryotes 368
GeneMark uses inhomogeneous Markov chains and dicodon statistics 368
GLIMMER uses interpolated Markov models of coding potential 371
ORPHEUS uses homology, codon statistics, and ribosome-binding sites 372
GeneMark.hmm uses explicit state duration hidden Markov models 373
EcoParse is an HMM gene model 376
10.4 Features Used in Eukaryotic Gene Detection 377Differences between prokaryotic and eukaryotic genes 377
Introns, exons, and splice sites 379
Promoter sequences and binding sites for transcription factors 381
10.5 Predicting Eukaryotic Gene Signals 381
Detection of core promoter binding signals is a key element of some eukaryotic gene-prediction methods 381
A set of models has been designed to locate the site of core promoter sequence signals 383
Predicting promoter regions from general sequence properties can reduce the numbers of false-positive results 387
Predicting eukaryotic transcription and translation start sites 389
Translation and transcription stop signals complete the gene definition 389
10.6 Predicting Exon/Intron Structure 389
Exons can be identified using general sequence properties 390
Splice-site prediction 392
Splice sites can be predicted by sequence patterns combined with base statistics 393
GenScan uses a combination of weight matrices and decision trees to locate splice sites 394
GeneSplicer predicts splice sites using first-order Markov chains 394
NetPlantGene uses neural networks with intron and exon predictions to predict splice sites 395
Other splicing features may yet be exploited for splice-site prediction 396
Specific methods exist to identify initial and terminal exons 396
Exons can be defined by searching databases for homologous regions 397
10.7 Complete Eukaryotic Gene Models 397
10.8 Beyond the Prediction of Individual Genes 399
Functional annotation 400
Comparison of related genomes can help resolve uncertain predictions 403
Evaluation and reevaluation of gene-detection methods 405
Summary 405
Further Reading 406
Part 5 Secondary Structures
APPLICATIONS CHAPTER
Chapter 11 Obtaining Secondary Structure from Sequence
11.1 Types of Prediction Methods 413
Statistical methods are based on rules that give the probability that a residue will form part of a particular secondary structure 414
Nearest-neighbor methods are statistical methods that incorporate additional information about protein structure 414
Machine-learning approaches to secondary structure prediction mainly make use of neural networks and HMM methods 415
11.2 Training and Test Databases 416
There are several ways to define protein secondary structures 417
11.3 Assessing the Accuracy of Prediction Programs 417
Q3 measures the accuracy of individual residue assignments 417
Secondary structure predictions should not be expected to reach 100% residue accuracy 418
The Sov value measures the prediction accuracy for whole elements 419
CAFASP/CASP: Unbiased and readily available protein prediction assessments 419
11.4 Statistical and Knowledge-Based Methods 421
The GOR method uses an information theory approach 422
The program Zpred includes multiple alignment of homologous sequences and residue conservation information 425
There is an overall increase in prediction accuracy using multiple sequence information 426
The nearest-neighbor method: The use of multiple nonhomologous sequences 428
PREDATOR is a combined statistical and knowledge-based program that includes the nearest-neighbor approach 428
11.5 Neural Network Methods of Secondary Structure Prediction 430
Assessing the reliability of neural net predictions 432
Several examples of Web-based neural network secondary structure prediction programs 432
PROF: Protein forecasting 434
PSIPRED 434
Jnet: Using several alternative representations of the sequence alignment 434
11.6 Some Secondary Structures Require Specialized Prediction Methods 435
Transmembrane proteins 436
Quantifying the preference for a membrane environment 437
11.7 Prediction of Transmembrane Protein Structure 438
Multi-helix membrane proteins 439
A selection of prediction programs to predict transmembrane helices 441
Statistical methods 443
Knowledge-based prediction 443
Evolutionary information from protein families improves the prediction 444
Neural nets in transmembrane prediction 445
Predicting transmembrane helices with hidden Markov models 446
Comparing the results: What to choose 447
What happens if a non-transmembrane protein is submitted to transmembrane prediction programs 448
Prediction of transmembrane structure containing β-strands 448
11.8 Coiled-coil Structures 451
The COILS prediction program 452
PAIRCOIL and MULTICOIL are an extension of the COILS algorithm 453
Zipping the Leucine zipper: A specialized coiled coil 453
11.9 RNA Secondary Structure Prediction 455
Summary 458
Further Reading 459
THEORY CHAPTER
Chapter 12 Predicting Secondary Structures
12.1 Defining Secondary Structure and Prediction Accuracy 463
The definitions used for automatic protein secondary structure assignment do not give identical results 464
There are several different measures of the accuracy of secondary structure prediction 469
12.2 Secondary Structure Prediction Based on Residue Propensities 472
Each structural state has an amino acid preference which can be assigned as a residue propensity 473
The simplest prediction methods are based on the average residue propensity over a sequence window 476
Residue propensities are modulated by nearby sequence 479
Predictions can be significantly improved by including information from homologous sequences 484
12.3 The Nearest-Neighbor Methods are Based on Sequence Segment Similarity 485
Short segments of similar sequence are found to have similar structure 487
Several sequence similarity measures have been used to identify nearest-neighbor segments 488
A weighted average of the nearest-neighbor segment structures is used to make the prediction 490
A nearest-neighbor method has been developed to predict regions with a high potential to misfold 491
12.4 Neural Networks Have Been Employed Successfully for Secondary Structure Prediction 492
Layered feed-forward neural networks can transform a sequence into a structural prediction 494
Inclusion of information on homologous sequences improves neural network accuracy 502
More complex neural nets have been applied to predict secondary and other structural features 503
12.5 Hidden Markov Models Have Been Applied to Structure Prediction 504
HMM methods have been found especially effective for transmembrane proteins 506
Nonmembrane protein secondary structures can also be successfully predicted with HMMs 509
12.6 General Data Classification Techniques Can Predict Structural Features 510
Support vector machines have been successfully used for protein structure prediction 511
Discriminants, SOMs, and other methods have also been used 512
Summary 514
Further Reading 515
Part 6 Tertiary Structures
APPLICATIONS CHAPTER
Chapter 13 Modeling Protein Structure
13.1 Potential Energy Functions and Force Fields 524
The conformation of a protein can be visualized in terms of a potential energy surface 525
Conformational energies can be described by simple mathematical functions 525
Similar force fields can be used to represent conformational energies in the presence of averaged environments 526
Potential energy functions can be used to assess a modeled structure 527
Energy minimization can be used to refine a modeled structure and identify local energy minima 527
Molecular dynamics and simulated annealing are used to find global energy minima 528
13.2 Obtaining a Structure by Threading 529
The prediction of protein folds in the absence of known structural homologs 531
Libraries or databases of nonredundant protein folds are used in threading 531
Two distinct types of scoring schemes have been used in threading methods 531
Dynamic programming methods can identify optimal alignments of target sequences and structural folds 533
Several methods are available to assess the confidence to be put on the fold prediction 534
The C2-like domain from the Dictyostelia: A practical example of threading 535
13.3 Principles of Homology Modeling 537
Closely related target and template sequences give better models 539
Significant sequence identity depends on the length of the sequence 540
Homology modeling has been automated to deal with the numbers of sequences that can now be modeled 541
Model building is based on a number of assumptions 541
13.4 Steps in Homology Modeling 542
Structural homologs to the target protein are found in the PDB 543
Accurate alignment of target and template sequences is essential for successful modeling 543
The structurally conserved regions of a protein are modeled first 544
The modeled core is checked for misfits before proceeding to the next stage 545
Sequence realignment and remodeling may improve the structure 545
Insertions and deletions are usually modeled as loops 545
Nonidentical amino acid side chains are modeled mainly by using rotamer libraries 547
Energy minimization is used to relieve structural errors 548
Molecular dynamics can be used to explore possible conformations for mobile loops 548
Models need to be checked for accuracy 549
How far can homology models be trusted? 551
13.5 Automated Homology Modeling 552
The program MODELLER models by satisfying protein structure constraints 553
COMPOSER uses fragment-based modeling to automatically generate a model 553
Automated methods available on the Web for comparative modeling 554
Assessment of structure prediction 554
13.6 Homology Modeling of PI3 Kinase p110α 557
Swiss-Pdb Viewer can be used for manual or semi-manual modeling 557
Alignment, core modeling, and side-chain modeling are carried out all in one 558
The loops are modeled from a database of possible structures 559
Energy minimization and quality inspection can be carried out within Swiss-Pdb Viewer 559
MolIDE is a downloadable semi-automatic modeling package 560
Automated modeling on the Web illustrated with p110α kinase 561
Modeling a functionally related but sequentially dissimilar protein: mTOR 563
Generating a multidomain three-dimensional structure from sequence 564
Summary 564
Further Reading 565
APPLICATIONS CHAPTER
Chapter 14 Analyzing Structure–Function Relationships
14.1 Functional Conservation 568
Functional regions are usually structurally conserved 569
Similar biochemical function can be found in proteins with different folds 570
Fold libraries identify structurally similar proteins regardless of function 571
14.2 Structure Comparison Methods 574
Finding domains in proteins aids structure comparison 574
Structural comparisons can reveal conserved functional elements not discernible from a sequence comparison 576
The CE method builds up a structural alignment from pairs of aligned protein segments 576
The Vector Alignment Search Tool (VAST) aligns secondary structural elements 577
DALI identifies structure superposition without maintaining segment order 578
FATCAT introduces rotations between rigid segments 579
14.3 Finding Binding Sites 580
Highly conserved, strongly charged, or hydrophobic surface areas may indicate interaction sites 582
Searching for protein–protein interactions using surface properties 584
Surface calculations highlight clefts or holes in a protein that may serve as binding sites 585
Looking at residue conservation can identify binding sites 586
14.4 Docking Methods and Programs 587
Simple docking procedures can be used when the structure of a homologous protein bound to a ligand analog is known 588
Specialized docking programs will automatically dock a ligand to a structure 588
Scoring functions are used to identify the most likely docked ligand 590
The DOCK program is a semirigid-body method that analyzes shape and chemical complementarity of ligand and binding site 590
Fragment docking identifies potential substrates by predicting types of atoms and functional groups in the binding area 591
GOLD is a flexible docking program, which utilizes a genetic algorithm 591
The water molecules in binding sites should also be considered 592
Summary 593
Further Reading 594
Part 7 Cells and Organisms
Chapter 15 Proteome and Gene Expression Analysis
15.1 Analysis of Large-scale Gene Expression 601
The expression of large numbers of different genes can be measured simultaneously by DNA microarrays 602
Gene expression microarrays are mainly used to detect differences in gene expression in different conditions 602
Serial analysis of gene expression (SAGE) is also used to study global patterns of gene expression 604
Digital differential display uses bioinformatics and statistics to detect differential gene expression in different tissues 605
Facilitating the integration of data from different places and experiments 606
The simplest method of analyzing gene expression microarray data is hierarchical cluster analysis 606
Techniques based on self-organizing maps can be used for analyzing microarray data 608
Self-organizing tree algorithms (SOTAs) cluster from the top down by successive subdivision of clusters 610
Clustered gene expression data can be used as a tool for further research 610
15.2 Analysis of Large-scale Protein Expression 612
Two-dimensional gel electrophoresis is a method for separating the individual proteins in a cell 613
Measuring the expression levels shown in 2D gels 614
Differences in protein expression levels between different samples can be detected by 2D gels 615
Clustering methods are used to identify protein spots with similar expression patterns 615
Principal component analysis (PCA) is an alternative to clustering for analyzing microarray and 2D gel data 618
The changes in a set of protein spots can be tracked over a number of different samples 618
Databases and online tools are available to aid the interpretation of 2D gel data 620
Protein microarrays allow the simultaneous detection of the presence or activity of large numbers of different proteins 621
Mass spectrometry can be used to identify the proteins separated and purified by 2D gel electrophoresis or other means 621
Protein-identification programs for mass spectrometry are freely available on the Web 622
Mass spectrometry can be used to measure protein concentration 623
Summary 623
Further Reading 624
Chapter 16 Clustering Methods and Statistics
16.1 Expression Data Require Preparation Prior to Analysis 626
Data normalization is designed to remove systematic experimental errors 627
Expression levels are often analyzed as ratios and are usually transformed by taking logarithms 628
Sometimes further normalization is useful after the data transformation 630
Principal component analysis is a method for combining the properties of an object 631
16.2 Cluster Analysis Requires Distances to be Defined Between all Data Points 633
Euclidean distance is the measure used in everyday life 634
The Pearson correlation coefficient measures distance in terms of the shape of the expression response 635
The Mahalanobis distance takes account of the variation and correlation of expression responses 636
16.3 Clustering Methods Identify Similar and Distinct Expression Patterns 637
Hierarchical clustering produces a related set of alternative partitions of the data 639
k-means clustering groups data into several clusters but does not determine a relationship between clusters 641
Self-organizing maps (SOMs) use neural network methods to cluster data into a predetermined number of clusters 644
Evolutionary clustering algorithms use selection, recombination, and mutation to find the best possible solution to a problem 646
The self-organizing tree algorithm (SOTA) determines the number of clusters required 648
Biclustering identifies a subset of similar expression level patterns occurring in a subset of the samples 649
The validity of clusters is determined by independent methods 650
16.4 Statistical Analysis can Quantify the Significance of Observed Differential Expression 651
t-tests can be used to estimate the significance of the difference between two expression levels 654
Nonparametric tests are used to avoid making assumptions about the data sampling 656
Multiple testing of differential expression requires special techniques to control error rates 657
16.5 Gene and Protein Expression Data Can be Used to Classify Samples 659
Many alternative methods have been proposed that can classify samples 660
Support vector machines are another form of supervised learning algorithms that can produce classifiers 661
Summary 662
Further Reading 664
Chapter 17 Systems Biology
17.1 What is a System? 669
A system is more than the sum of its parts 669
A biological system is a living network 670
Databases are useful starting points in constructing a network 671
To construct a model more information is needed than a network 672
There are three possible approaches to constructing a model 674
Kinetic models are not the only way in systems biology 678
17.2 Structure of the Model 679
Control circuits are an essential part of any biological system 680
The interactions in networks can be represented as simple differential equations 680
17.3 Robustness of Biological Systems 683
Robustness is a distinct feature of complexity in biology 684
Modularity plays an important part in robustness 685
Redundancy in the system can provide robustness 686
Living systems can switch from one state to another by means of bistable switches 688
17.4 Storing and Running System Models 689
Specialized programs make simulating systems easier 691
Standardized system descriptions aid their storage and reuse 692
Summary 692
Further Reading 693
APPENDICES Background Theory
Appendix A: Probability, Information, and Bayesian Analysis
Probability Theory, Entropy, and Information 695
Mutually exclusive events 695
Occurrence of two events 696
Occurrence of two random variables 696
Bayesian Analysis 697
Bayes’ theorem 697
Inference of parameter values 698
Further Reading 699
Appendix B: Molecular Energy Functions
Force Fields for Calculating Intra- and Intermolecular Interaction Energies 701
Bonding terms 702
Nonbonding terms 704
Potentials used in Threading 706
Potentials of mean force 706
Potential terms relating to solvent effects 707
Further Reading 708
Appendix C: Function Optimization
Full Search Methods 710
Dynamic programming and branch-and-bound 710
Local Optimization 710
The downhill simplex method 711
The steepest descent method 711
The conjugate gradient method 714
Methods using second derivatives 714
Thermodynamic Simulation and Global Optimization 715
Monte Carlo and genetic algorithms 716
Molecular dynamics 718
Simulated annealing 719
Summary 719
Further Reading 719
List of Symbols 721
Glossary 734
Index 751
2ZIP method, 453–4, 455F
3₁₀-helices, 435
defining, for prediction algorithms, 464–5
3D-Coffee, 203
3DEE library, 574
3Djigsaw, 563
3D-PSSM, 533, 534F
3′ end, 6, 12T, 17–18
3-patterns, 217
5′ end, 6, 12T, 14, 19
–10 motif, 16
16S RNA sequences, 249
evolutionary model selection, 254F, 255, 256T
phylogenetic analysis, 249, 251, 255–8, 257F, 258F
–35 motif, 16
123D+ program, 535–6, 536F
α/β-fold proteins, 421F, 423F, 573F, 574
α-helices, 33–5, 413F
amino acid preferences, 37
Chou–Fasman propensities, 474–5, 474F, 475F
coiled coil formation, 451, 451F
defining, for prediction algorithms, 464–5, 466–7
hydrogen bonding, 34, 35F
length distributions, 467, 468F
prediction, 413–14, 428–9, 429F
see also secondary structure prediction
based on residue propensities, 477–8
neural network methods, 501, 501F
transmembrane proteins, 438, 439–48
sequence–structure correlations, 487–8, 487F
transmembrane proteins see transmembrane helices
turns, hairpins and loops connecting, 36–7
α-lactalbumin, 538–9, 539F
βαβ repeat, 40B
β-barrels, transmembrane see transmembrane β-barrels
β-bulges, 463, 465
β-lactamase family, 573F
β-meander, 40B
β-sheets, 34–6, 36F
defining, for prediction algorithms, 465, 465F
transmembrane proteins, 436
types, 35–6
β-Spider, 466, 467F
β-strands, 34–6, 36F, 413F
amino acid preferences, 37
Chou–Fasman propensities, 474–5, 474F
defining, for prediction algorithms, 465–6, 466–7
distortions, 463
length distributions, 467, 468F
prediction, 413–14, 428–9, 429F
see also secondary structure prediction
based on residue propensities, 477–8
transmembrane proteins, 448–51, 450F
variability, 467, 467F
turns, hairpins and loops connecting, 36–7
β-turns, 36, 37F, 413F
Chou–Fasman propensities, 475, 476T
defining, prediction algorithms, 465
prediction, 413–14, 478, 503
π-helices, 435
defining, for prediction algorithms, 464–5
φ angles see under torsion angles
ψ angles see under torsion angles
A
A (accepted point mutation matrix), 120
AACC, 214–15, 214F
AAINDEX, 84
AAindex, 476
AAT program, 331T, 332T, 335, 336
ab initio approach, modeling protein structure, 522, 523B
accepted mutations, 84
accepted point mutation matrix (A), 120
acceptor splice sites, 18F, 380F, 392
acetolactate synthase (ALS) family, 259B, 262
activators, 16–17
adaptive systems, 667–8
additive trees, 228–9, 229F, 230
adenosine (A), 6, 6F
affine gap penalty, 127, 128, 133–4, 139
Affymetrix GeneChip® arrays, 602
Akaike information criterion (AIC), 253–5
ALDH10 gene, 324–5
annotation, 351–2
exon prediction
accuracy, 345, 345–6
different programs, 331–2, 331T, 333–4, 334F, 335, 336
experimental results compared, 327, 328F
using related organisms, 336–7
gene structure, 327B
interspecies comparisons, 353, 353F, 354F
pathway approach to identifying, 348, 349–50F
promoter prediction, 341, 341T
start codon, 327, 330F
alignment, sequence see sequence alignment
Alix, Alain, 475
all α-fold proteins, 421F, 422F, 573F, 574
Note: Entries which are simply page numbers refer to the main text. Other entries have the following abbreviations immediately after the page number: B, box; F, figure; FD, flow diagram; MM, mind map; T, table.
INDEX
End matter 6th proofs.qxd 19/7/07 12:17 Page 751
all b-fold proteins, 421F, 422F, 573F, 574
alternative splicing, 19, 380–1Alu elements, 337BAlzheimer’s disease, 491AMAS program, 93AMBER program, 526, 701amino acid(s) (residues), 11, 27–33
chemical structure, 28Fconservation, to identify binding
sites, 586–7, 587Fconservation values (Zpred), 426,
427F, 428F, 429Thydrophobicity scales, 437–8, 450,
475, 477Tpeptide bonds, 29–33, 31Fphysicochemical properties, 28–9,
28T, 30Famino acid propensities, 37, 472–85,
472FDsee also Chou–Fasman propensitiesaveraged over sequence windows,
476–9derivation and calculations, 473–6nearby sequence effects, 479–84,
480Famino acid sequences, 13, 25, 29
see also protein sequencesevolutionary conservation, 38short segments with structural
correlations, 487–8, 487Famino acid side chains, 28F
modeling, 547–8, 548F, 558–9, 561
physicochemical properties, 28–9, 30F
torsion angles (χ1, χ2, etc), 547, 548F
amino (N) terminus, 29
amphipathic helix, 439–41
amyloidogenic proteins, 486, 487,
491–2, 492F, 493Fanalogous enzymes, 244, 244Fanalysis of covariance (ANCOVA), 659analysis of variance (ANOVA), 659ancestral states, 226anchor points, 546, 546FAnfinsen, Christian, 412, 412Fannotation, 357
automated, 64–5database, 53data errors or omissions, 64gene, 348–52genome see genome annotationmanual, 65
ANOLEA program, 550–1, 551Tantibiotic synthesis, 643Bantibodies, 381, 555B
modeling, 555–6Banticoding strand, 11anticodons, 13–14, 14F
antigen-binding site, 555–6Bantigens, 555Bantisense strand, 11apoptotic pathway, 681Fapproximate correlation coefficient
(AC), 366BArabidopsis thaliana, 328, 330B
gene duplications, 241BRha1 gene prediction, 393Fsplice sites, 380F, 396vs rice, 335B
Archaea, 21, 21Fhorizontal gene transfer, 246F, 247sequenced genomes, 324T
architecturedatabase, 45network, 676, 677F
Argos, Patrick, 171ArrayExpress, 58, 606, 611ArrayExpress Data Warehouse, 58arrhythmia, cardiac, modeling, 677,
678FATG start codons see start codonsatomic charges, 704atomic mean force potential (AMFP),
551AUG codon, 13, 19, 367AU (approximately unbiased) method,
309average conditional probability (ACP),
366B
B
backbone (protein), 29, 32
models, 39, 39F
back-propagation method, 497B
backward algorithm, 190–1
bacteria, 21, 21F
see also Escherichia coli; prokaryotes16S RNA, 249horizontal gene transfer, 246F, 247sequenced genomes, 324T
balanced training, 498BBaldi, Pierre, 191BAliBase, 92, 93Fballoting probabilities, 501Barton, Geoff, 206base-pairing, 7–9, 8F
RNA, 456wobble, 14
bases, 5–7, 6Fbase sequences see nucleotide
sequencesBaum–Welch expectation
maximization algorithm, 191–3Bayesian information criterion (BIC),
254–5Bayesian methods, 697–8
dealing with lack of replicates, 657B
phylogenetic tree reconstruction, 250, 251T, 253, 306–7
Bayes’ theorem, 697–8Benjamini, Yoav, 659Berkeley Drosophila Genome Project
(BDGP), 340, 341TBetaturns method, 503biased mutation pressure, 239biclustering, 649–50, 650Fbidirectional recurrent neural network
(BRNN), 504, 505FBifidobacterium longum, 348, 350Fbifurcating (branching) pattern, 226–7binding sites, protein see protein
binding sitesbiochemical pathways see metabolic
pathwaysBioEdit program, 260bioinformatics, 3
protein structure and, 37–9, 38FDBioModels Database, 692Biomolecular Interaction Network
(BIND), 58, 671, 673Fbistable switches, 688–9, 689FBLAST program, 95–6
algorithmic approximations, 141comparing nucleotide with protein
sequences, 150–3Conserved Domain Database (CDD)
search, 99F, 100dealing with low-complexity
regions, 101–2E-values, 98–100, 99F, 156gapped method, 147–50, 178TGenScan modification using, 397restriction of matrix coverage, 140suffix trees, 141–3use of finite-state automata,
147–50, 147F, 148Fversions available, 95–7whole genome alignments, 157–9
blastx program, 96, 97, 150, 343BLAT program, 158BLOCKS database, 58
Dirichlet mixture from, 174–5, 174F
searching, 105–7, 106Fsubstitution matrices from, 122
BLOSUM matrices, 83F, 84alignment scoring, 82derivation, 122–5, 123F, 124Fselection, 84, 85summary score measures, 125F, 126
Blundell, Tom, 532Boltzmann factor, 706bond angle energy, 703bond energy, 702bonding terms, 525–6, 701, 702–4,
702FBonferroni correction, 658
bootstrap analysis, 310Bassessing tree topology, 309–10comparing tree topologies, 233–4,
233Fcomparing two or more trees, 311parametric, 310Bpractical example, 258, 259F
bootstrap interior branch test, 310bottom-up approach, modeling
biological systems, 674–6, 676Fbovine spongiform encephalopathy
(BSE), 37B, 101Bbranch-and-bound method, 288, 710branches, 226, 227Fbranch length calculations, 293–7,
295F, 296Fassessing reliability, 309–10parsimony methods, 299–300
branch swapping techniques, 289–91, 290F
BRCA2, 78, 79FBrenner, Steven, 480Brudno, Michael, 209Bryant, David, 296, 296FBTPRED method, 503Bucher weight matrix method, 383–4,
384FBurset, Moises, 365–6B, 392BBVSPS program, 551T
C
C2-like domain, Dictyostelia, 535–7, 536F, 537F
Cα atoms, 28, 28F, 29, 417
analysis of geometry, for prediction algorithms, 466, 466F
torsion angles see under torsion angles
Cα models, 39, 39F
Caenorhabditis elegans, 399
CAFASP (Critical Assessment of Fully
Automated Structure Prediction), 419, 554–6
cAMP PK see cyclic AMP-dependent protein kinase
canonical ensemble, 718Cantor, Charles, 271capping, RNA, 18cap signal (initiator signal, Inr), 389
Bucher weight matrix, 383, 384, 384F
GenScan prediction method, 385, 385F
NNPP prediction method, 385–6, 386F
carboxy (C) terminus, 29Casadio, Rita, 479–80cascade-correlation neural network,
503–4
CASP (Critical Assessment of Structure Prediction), 419, 554–6
CATH database, 531, 574causal dependencies, 668Cbl protein, 575–80, 576FCCAAT box, detection algorithms, 383,
384–5CDK10 gene, 324–5
DNA sequence, 326–7Bexon prediction, 329F, 330–1, 332T,
336–7translation of predicted exons, 344F
cDNA (complementary DNA)exon prediction using, 397gene-prediction programs using,
334, 335microarrays, 602sequence databases, 56
Celera, 376Bcell-division cycle, 688–9Cell Markup Language (CellML), 692CellML Model Repository, 692cellular modeling
heart, 685Tinternational projects, 668programs, 691–2, 691F
CE (Combinatorial Extension) method, 576–7, 578F
central dogma, 10–14, 10F, 10FDcentroid, 711centroid method, hierarchical
clustering, 640, 641Fchaining, 144–6chameleon sequences, 37B, 488CHAOS algorithm, 209CHARMM program, 526, 701ChiClust program, 617, 618–19ChiMap program, 618–20, 619Fchloroplasts, 22, 292BChou, Peter, 472Chou–Fasman propensities, 414, 415F,
472, 474–6applied to GOR, 483calculated values, 474F, 476Tmeasures of accuracy, 424Tnearest-neighbor methods, 489periodic variation, 474–5, 475Ftransmembrane helices, 475–6,
478Fwindow sizes, 477–8
chromatography, 600, 623chromosomes, 10, 21–2
rearrangements, 248Churchill, Gary, 275chymosin B, 486, 487F, 490Fchymotrypsin, 243–4, 244FCINEMA program, 93cis conformation, 32, 33Fclades, 256Cladist program, 608–9, 609F
cladogram, 228, 229FClustalW, 90, 91–2
progressive alignment method, 205scoring scheme, 201–2, 201F, 202Fvs other alignment methods, 92,
93Fcluster analysis, 625–64, 626MM
data preparation, 626–33, 627F, 627FD
defining distances, 633–7, 634FD, 636F
evaluating validity of clusters, 650–1
hierarchical see hierarchical clustering
hydrophobic (HCA), 110–11, 110Fsequence alignment, 90–1, 90F, 126
clustering methodssee also specific methodscomparison between, 643Bgene expression microarray data,
606–11, 611Fidentifying expression patterns,
637–51, 637FDphylogenetic tree construction,
276–9, 277FDprotein expression data, 615–17,
617F, 618FClusters of Orthologous Groups (COG)
database, 103, 243, 245BCMISS modeling tool, 692COACH method, 195, 203coding, 11, 12–13coding strand, 11–12codon-pairs see dicodonscodons, 13
see also start codons; stop codonsfrequency of occurrence, 367, 367Fgenetic code, 12Tmutation rates at different, 238–9,
238Fstatistics, use by ORPHEUS, 372–3
co-expressed genes or proteins, 600, 638
COFFEE scoring system, 200, 203, 204F
COG (Clusters of Orthologous Groups) database, 103, 243, 245B
Cohen, Stanley, 643Bcoiled coils, 413, 435
geometry, 451, 451Fprediction, 451–4, 452FD, 478–9,
510, 510FCOILS program, 452–3, 454F, 478–9collagen, 452common evolutionary ancestor,
measuring likelihood, 117–19comparative modeling see homology
modelingCOMPASS method, 195
complementary DNA see cDNAcomplementary DNA strands, 7–8complete linkage clustering, 640,
641Fcomplexity
see also low-complexity regionsbiological systems, 684–5compositional, 151–2B
COMPOSER program, 546, 553–4compositional complexity, 151–2Bconcatamers, 605condensation reaction, 29, 31Fcondensed trees, 233–4, 233Fconditioned reconstruction, 292Bconfidence index, 432conformation, 27, 41
see also quaternary conformationenergies, 524–9, 524FDside chains, 547–8
conformational flexible docking, 590
conformers, 547conjugate gradient method, 528, 713F,
714conjugate prior, 698consensus features, 234consensus method, pattern or motif
creation, 105consensus sequences, 16consensus trees, 234–5, 234F, 291Conserved Domain Database (CDD)
search, 99F, 100CONSOLV program, 593ConSurf program, 587, 587Fcontact capacity potential (CCP), 533,
707–8, 708Fcontext strings, 371control circuits, biological systems,
680, 680Fconvergent evolution, 74–5, 75B,
243–4, 244Fcooperativity, 701COPASI modeling tool, 692Corbin, Kendall, 270CorePromoter program, 340, 341T,
388, 389Fcore promoters, 17, 319
see also promoter predictiondetection of binding signals, 339,
381–9models designed to locate, 383–7
Cost, Scott, 489, 491covalent bonds, 32B, 33B
energetics, 525–6, 701, 702–4CPHmodels, 554, 563creatine kinase, 42F, 43Creutzfeldt–Jakob disease (CJD), 101,
101Bvariant (vCJD), 101B
Crick, Francis, 7
Critical Assessment of Fully Automated Structure Prediction (CAFASP), 419, 554–6
Critical Assessment of Structure Prediction (CASP), 419, 554–6
Crooks, Gavin, 480C terminus, 29Cy5/Cy3 label gene expression
microarrays, 602–3, 603Fcyclic AMP-dependent protein kinase
(cAMP PK)inserting gaps, 86, 86Flocal and global alignment, 89,
89Fmultiple alignment, 91–2, 92F
cytochrome c oxidase I, 249cytosine (C), 6, 6F
D
Dali library, 574
DALI program, 578–9, 579F
Darwinian concept of evolution, 235
DAS (Distributed Annotation System), 348–51, 351F
DAS (dense alignment surface) program, 442F, 444–5, 445F, 447
data, 53
checking for consistency, 63–4derived (secondary), 53–4log transformation, 629–30, 630Fnormalization, 627–31, 628F, 630Fprimary, 53–4quality, 61–6, 62FD
database management system (DBMS), 48
Database of Interacting Proteins (DIP), 58
databases, 45–66, 46MMaccess to, 52categories (by content), 55–61,
56Fcenters, 55content of entries, 53data quality, 61–6, 62FDdistributed, 48, 52entry identifiers/version numbers,
65–6first computerized, 48, 48Fflat-file, 47, 47F, 48–9links between, 52, 53looking for, 55–61nonredundancy, 62–3ontologies, 54–5, 54Frelational, 48, 49–50, 49Fstructure, 46–52, 47FDfor systems biology, 671–2, 675Ttraining and test, 416–17types, 52–5, 53FD, 55FDupdating, 65–6
data classification, 637–8, 638Fsee also sample classificationsecondary structure prediction,
510–14, 511FDdata warehouses, 48, 51F, 52Davies, Graham P., 420BDayhoff, Margaret, 82, 119Dayhoff mutation data matrices
(MDMs) see PAM matricesdbEST, 56, 321BDEAD-box motif, 420Bdecision trees
detection of functional RNA molecules, 361–3, 363F
sample classification, 661splice site prediction, 394
DEFINE, 417degenerate (genetic code), 13degrees of freedom (df), 654, 655deletions
accounting for, in sequence alignment, 85–7
alignment scoring schemes, 117, 126–7
homology modeling, 542, 545–6, 545F
threading and, 532, 537denatured proteins, 42dendrograms, 636, 636F
gene expression data, 606F, 607, 607F, 608
hierarchical cluster analysis, 639, 640, 640F, 641F
dense alignment surface (DAS) program, 442F, 444–5, 445F, 447
deoxyribonucleic acid see DNAdeoxyribonucleotides, 6deoxyribose, 5–6DESTRUCT method, 503–4, 505Fdeterministic finite-state automaton,
147F, 148–50diagonals
DIALIGN method, 92, 207–9, 208F
FASTA scoring, 95labeling of matrix, 144F, 145restricting matrix coverage to,
139–41, 139F, 140FDIALIGN program, 92, 93F, 207–9DIAL program, 575, 576, 576Fdichotomous (branching) pattern,
226–7dicodons (hexamers), 328, 367
exon prediction using, 390gene detection methods using,
368–72promoter prediction using, 387–8
Dictyostelia, C2-like domain, 535–7, 536F, 537F
dielectric constant, 704
differential equations, modeling biological systems, 680–3, 682F
digital differential display (DDD), 605–6, 605F
dihedral angles see torsion anglesdihydrofolate reductase (DHFR)
ligand docking, 592, 592Fpocket identification, 585–6, 586F
dimers, 43directed acyclic graph (DAG), 512directional information, 423, 482Dirichlet distribution densities, 174Dirichlet mixture, 174–5, 174F, 176Fdiscriminant analysis
see also linear discriminant analysis; quadratic discriminant analysis
gene prediction, 340, 388, 389F, 396–7
sample classification, 661secondary structure prediction,
512–13distance, 81
see also evolutionary distance; p-distance
definitions for cluster analysis, 633–7, 634FD, 636F
phylogenetic tree reconstruction, 249–50, 251, 251T
distance correction, 236Distributed Annotation System (DAS),
348–51, 351Fdistributed databases, 48, 52divergent evolution, 75Bdivide-and-conquer method (multiple
alignment), 91, 91Fvs other alignment methods, 92,
93FDNA, 4
central dogma concept, 10, 10F, 10FD
complementary see cDNAdouble helix formation, 7–9, 8Fmutations see mutationsnoncoding see junk DNAstrands, 7–9, 8F, 11–12structure, 5–9, 5FD, 8Ftranscription see transcription
DNA gyrases (GyrA and GyrB), 249DNA microarrays, 9, 600, 601–4
basic principle, 602databases see microarray databasesdata clustering methods, 606–10,
643Bdata sharing and integration, 606gene expression studies, 602–4,
603Fprincipal component analysis of
data, 618two-color, 602–3, 603Fuses of clustered data, 610–11, 611F
DNA polymerase, 8DNA repeats, 22B
see also repeat sequencesdetection, 152Bexclusion from analysis, 319–21
DNA replication, 8, 8FDNA sequence databases, 56, 57F
nomenclature for base uncertainty, 63, 63T
DNA sequencesalignment scoring matrices, 124F,
125detecting homology, 75–6gene prediction from see gene
predictionmultiple alignments, 92nucleotide bias, 275–6phylogenetic tree reconstruction,
249preliminary examination, 318–22,
319FDsearching with, 97
docking, 587–93, 588FDaccounting for water molecules,
592–3conformational flexible, 590fragment, 591scoring functions, 590simple strategies, 588specialized programs, 588–92, 592F
DOCK program, 590–1domains
protein, 41see also multidomain proteinsfamilies, 259Bidentifying, 574–6, 576Fshuffling, 570
taxonomic, 21donor splice sites, 18F, 380F, 392dot-plots, 77–8, 77F, 79F
low-complexity regions, 101–2, 102F
double dynamic programming, 534downhill simplex method, 711, 712Fdownstream sequences, 16d-patterns, 217drawhca program, 110F, 111drug design, rational, 588, 589BDSC method, 512–13DSSP program, 417
defining secondary structures, 464–6, 465F, 465T, 467, 467F
length distributions of secondary structures, 467, 468F
nearby sequence effects, 479–80, 480F
duplicationchromosome and genome, 248gene see gene duplicationsequence, 158F, 245
Durbin, Richard, 363DUST program, 152Bdynamic programming algorithms
double, 534gene model, 399, 402Fglobal–local, 533pairwise alignment, 86–7
database searching, 95–7discarding intermediate
calculations, 138Bextension to multiple alignment,
198function optimization, 710local and suboptimal, 135–9optimal global, 129–35principles and methods, 127–41,
128FDtime methods, 139–41, 139F,
140FSankoff algorithm for weighted
parsimony, 300–2, 301Fthreading, 533–4, 534F
E
E-Cell Project, 668
EcoCyc database, 671, 673F
EcoKI restriction enzyme, 420B
EcoParse gene model, 375F, 376–7
Eddy, Sean, 293, 362, 363
edges see branches
Efron, Bradley, 310B
EGFR see epidermal growth factor receptor
eigensamples, 633
Eisenberg hydrophobicity scale, 450
Elber, Ron, 532
electronic resonance, 31
electrostatic interactions, 33B, 704
EMAP modeling tool, 692
emergent properties, 669
emissions, 179, 181–2
eMOTIF, 213–15, 214F
end state, 179, 180, 182–3, 183F
energies
free see free energymolecular, 700–8potential see potential energy
energy gradient, 528energy minima, global, 524, 528–9energy minimization, 527–8, 528F
applied to homology modeling, 548, 559–60
Ensembl, 103, 403enthalpy see potential energyentropy, 695–7
component of free energy, 525relative, 125F, 126, 697Shannon, 695–6
enzymes, 40analogous, 244, 244Fconvergent evolution, 243–4, 244Fphylogenetic analysis, 259–63simulation modeling, 690F, 691–2,
691Fepidermal growth factor receptor
(EGFR), 436, 436Bmitogen-activated protein kinase
system, 683Fpathway modeling, 681, 682F, 690
epitope, 555Bergodic systems, 717, 718–19errors
random, 627–8systematic, 625, 627–8type I, 653, 658types and rates, 657–8
Erwinia carotovora, 262Escherichia coli, 21, 378
detection of tRNA genes, 320–1, 320F
EcoCyc database, 671, 673FEcoParse gene model, 375F, 376–7engineered OROlac promoter, 676,
676Fgene classification by codon usage,
370GeneMark.hmm gene model,
375–6genome segment annotations, 322,
323Fheat shock response, 680, 680Flength distributions of
coding/noncoding regions, 374F, 375
promoters, 339–40pyruvate formate-lyase, 467Fpyruvate kinase, 480Frobustness, 684start codons, 366F, 367
ESPript, 93ESTs see expressed sequence tagsESyPred3D, 554, 563, 563TEuclidean distance, 634–5, 636FEukarya see eukaryoteseukaryotes, 14, 21–2, 21F
control of translation, 19exon prediction see exon predictiongene detection, 323–37, 323FD, 360
finding correct start codon, 327, 330F
homology searching, 322with only query sequence,
327–32with query sequence and gene
model, 332–4sequence features used, 377–81,
378FDseries of steps, 346T
using correct reading frame, 325–7, 325T, 328F, 329F
using gene control signals, 381–9, 382FD
using gene model and sequence similarity, 334–6
using genomes of related organisms, 336–7
variety of approaches, 324–5vs methods used in prokaryotes,
377–9gene models, 397–9, 398FDgene structure, 319, 325Fintron prediction see intron
predictionmRNA modifications, 18–19origins, 292Bpromoter prediction, 339, 340–2
indefinite nature of results, 341, 341T
online methods, 340–1theoretical basis, 381–9
regulation of transcription, 15, 17–18, 17F
splice site detection see splice sites, detection
tRNA gene detection, 362–3Eukaryotic Promoter Database (EPD),
339, 340European Bioinformatics Institute
(EMBL-EBI), 52, 55, 606databases, 55–6, 60
E-values, 98cut-off thresholds, 98–100, 99F,
101FPSSM construction, 176statistical significance, 156
EVA program, 551Tevolution, 5, 20–3, 20FD
aiding sequence analysis, 38basic concepts of molecular,
235–48, 235FDconvergent, 74–5, 75B, 243–4, 244FDarwinian concept, 235divergent, 75Bgene level, 239–47genome level, 247–8minimum see minimum evolutionnucleotide level, 236–9
evolutionary clustering algorithms, 646–7, 646F
evolutionary distance, 81, 199, 224–5see also p-distanceadditive phylogenetic trees, 228,
229Fcalculation, 268–76, 269Fevaluating tree topologies using,
293–7PAM matrices and, 84sources of errors, 277
tree construction, 251–2, 276–9, 277FD
evolutionary historyphylogenetic trees see phylogenetic
treesrecovering, 223–64, 224MM
evolutionary modelspractical application, 251–5, 253Tselection of appropriate, 253–5,
254F, 256Tsequence alignment, 117–19theoretical basis, 268–76time-reversible, 302
evolutionary trace method, identifying binding sites, 586–7, 587F
exclusive classification, 637–8, 638Fexon prediction, 319, 323–37
assessing accuracy, 343–6, 343F, 344F, 392B
with only query sequence, 327–32
with query sequence and gene model, 332–4
theoretical basis, 379–81, 389–97, 391FD
using correct reading frame, 325–7, 325T, 328F, 329F, 391–2
using gene model and sequence similarity, 334–6
using general sequence properties, 390–2
using genomes of related organisms, 336–7
using homology searches, 397variety of approaches, 324–5
exons, 18, 18F, 19initial and terminal, detection, 390,
396–7length distributions, 379, 379Ftranslating predicted, 343, 344Fuse of term, 379–80
ExPASy program, 345, 412, 620expectation maximization (EM), 191,
216expectation values see E-valuesexpected number of offspring (EO),
209expected score, 119, 126
see also E-valuesexplicit state duration hidden Markov
model (HMM), 374expressed (genes), 11
see also gene expressionexpressed sequence tags (ESTs), 321B
databases, 56, 103digital differential display (DDD),
605–6, 605Fexon prediction using, 397gene-prediction methods using,
334–5
expression level ratios, 628–30, 629F, 630F
in different samples, 652log transformation, 629–30, 630F
eXtensible Markup Language (XML), 50–1
external nodes, 226, 227Fextracellular matrix (ECM), modeling
tumor invasion, 677, 677FExtreme Pathways, 678extreme-value distribution, 97–8,
155–6, 155Fextrinsic classification, 638extrinsic gene detection methods, 361,
368FDeye, gene expression patterns, 607F,
608
F
false discovery error rate (FDR), 658, 659
false negatives
in gene prediction, 365B
in sequence analysis, 212
false positives
in gene prediction, 365B
in sequence analysis, 212
statistical tests, 653
families, protein see protein familiesfamily-wise error rate (FWER), 658,
659Fano definition of mutual information,
481Fasman, Gerald, 472FASTA program, 95
algorithmic approximations, 141chaining, 144–6comparing nucleotide with protein
sequences, 150–3database searching method, 143,
144–6, 145FE-values, 98, 100, 101F, 156restriction of matrix coverage, 140versions available, 95–6, 96Twhole genome alignments, 157–9
fast Fourier transform (FFT), 206FATCAT program, 579–80, 580Ffeedback control, 680, 680Ffeedforward control, 680, 680FFelsenstein, Joseph, 253, 275Felsenstein 81 (F81) model, 253, 253T,
254F, 256TFelsenstein zone (long-branch
attraction), 292, 308–9, 309FFerrell, J.E., 689FFGENESH program, 332, 333–4, 334F
comparative results, 331T, 332T, 333F
rice genome prediction, 335B
fibrin, 451–2fibrous proteins, 41, 435fields (database), 46–7fingerprints, multiple motif, 109finite-state automata (FSA), 147–50,
147F, 148Fvs hidden Markov models, 147, 179,
180–1FirstEF, 332, 396–7Fitch algorithm see post-order traversalFitch–Margoliash method, 250, 251T
evaluating tree topologies, 293, 297
generating single trees, 279–80,
280F, 281Fvs neighbor-joining, 282, 284F, 285
fitness, 235evolutionary clustering, 646–7,
646Fflavin adenine dinucleotide (FAD),
259B, 260, 261F, 262flavodoxin family, 573FFletcher–Reeves formula, 714Flicker program, 614, 620, 620FFlux Balance Analysis (FBA), 678FoldIndex method, 513folding, protein see protein foldingfolding funnel, 525fold recognition see threadingfolds, protein see protein foldsforce fields, 522, 524–9, 701–5
additive, 701class I and II, 702nonadditive, 701
forward algorithm, 190fractional alignment difference, 269frameshift, 150Franklin, Rosalind, 7, 7Ffree energy
folded proteins, 41–2RNA secondary structures, 456,
457–8surface, molecular systems, 525,
525Ffree insertion modules (FIMs), 184–5fructose-1,6-bisphosphate aldolases
(FBPAs), 569F, 570, 570FFSSP database, 574, 578–9Fuchs, Patrick, 475FUGUE program, 532, 535–6, 536Ffully resolved trees, 227function (protein and gene), 40–1
see also structure–function relationships
conservation, 568–74, 568FDevolution, 242, 243–4genome annotation, 400–3orthologs, 239, 243patterns and, 109–11phylogenetic trees for predicting,
262
protein folding and, 40–1, 41Fusing orthologs to predict, 245
functional homology, 569–70, 569F, 570F
function optimization see optimization, function
FunSiteP algorithm, 340, 341, 341Tfusion
gene, 72genome, 292B
G
Gamma distance (correction), 239, 269F, 270
Gamma distribution (Γ), 269F, 270
evolutionary model variation, 253T, 254F
gap extension penalty (GEP), 85, 127gap insertion operator, 210–11, 211Fgap opening penalty (GOP), 127, 202,
202Fgap penalties, 85–6, 87, 126–7
global alignments, 131F, 132–5, 132F, 134F
local alignments, 137manual adjustment, 93multiple alignments, 202, 205, 206position-specific scoring matrices,
170, 177suboptimal alignments, 137F, 139
gaps, 74inserting, 85–7in multiple alignments, 204, 205Fscoring, 126–7
Garnier, J, 422Gaussian distributions see normal
distributionsGAZE program, 399, 402FGC box, detection algorithms, 383,
384–5GC content
bacterial genomes, 238F, 239evolutionary models and, 273promoter prediction using, 386,
387Fregions of different (isochores), 275,
378GenBank, 55–6, 102–3
flat-file format, 47, 47Fsample extract, 57F
gene(s), 5, 10–11evolution, 239–47families see protein familiesfunction see functionfusion, 72nested, 399nonfunctional, 242overlapping, 12, 12F, 360prokaryotic vs eukaryotic, 377–9
structure and control, 14–20, 15FD, 318–19
structure in eukaryotes, 319, 324GeneBee program, 457F, 458GeneBuilder program, 331T, 332T, 335,
336GeneCluster2 program, 608gene duplication, 73, 239–42, 242F
acetolactate synthase (ALS), 262, 263F
effects on phylogenetic analyses, 245
identified from synonymous mutations, 241B
phylogenetic trees, 226, 231Fstructure–function relationships,
570use for rooting trees, 292–3
gene expression, 11co-expression, 600databases, 58digital differential display (DDD),
605–6, 605Fmicroarrays, 602–4, 603F
see also DNA microarrayspatterns, 638, 639FSAGE method, 604–5, 604Fsample classification, 659–62,
660FDuses of clustered data, 610–11,
611Fgene expression analysis, 599–600,
600MM, 601–11, 601FDclustering methods see clustering
methodsdata preparation for, 626–33, 627F,
627FDstatistics, 652–9
gene loss, 242–3, 243Feffects on phylogenetic analyses,
245GeneMark algorithm, 328–9, 368–70
comparative results, 331–2, 331T, 332T
GeneMark.hmm algorithm, 373–6, 374F
gene models, eukaryotic, 397–9, 398FD
Gene Ontology, 54, 348gene ontology
evaluating validity of clusters, 651genome annotation, 348–52, 402
gene prediction (detection), 317–46, 318MM
assessing accuracy, 342–6, 342FDat exon level, 343, 344F, 392Bat nucleotide level, 343, 343F,
365–6Bat protein level, 343–6, 345F
eukaryotes see under eukaryotes
evaluation and reevaluation of methods, 405
exon prediction see exon predictionfurther analysis, 399–405, 400FDintrinsic and extrinsic methods,
361, 368FDintron prediction see intron
predictionpotential for errors, 65preliminary steps, 318–22, 319FDprokaryotes see under prokaryotespromoter region, 338–42, 381–9splice site detection see splice sites,
detectiontheoretical basis, 357–99, 358MM
general time-reversible model (GTR or REV), 253T, 255, 262
general transcription initiation factors, 17
see also transcription factorsgeneration, 209GeneSplicer program, 394–5genetic algorithms
cluster analysis, 646–7, 646Fdocking, 591–2, 592Ffunction optimization, 709, 716–18,
716Fmultiple sequence alignment
(SAGA), 209–11, 210F, 211Fgenetic code, 11, 12–13, 12T
degeneracy, 13genetic distance, 224–5, 232F
see also evolutionary distancegene (phylogenetic) trees, 226, 230,
231Fcombined with species trees, 243,
244Freconstruction example, 259–63,
261F, 263FGeneWalker program, 331T, 332T,
335–6GeneWise program, 345–6Genie program, 329F, 386Geno3D program, 554, 563, 563Tgenome(s), 4, 10
comparisons see genome sequence alignments
completely sequenced, 71databases, 56, 103evolution, 247–8fusion, 292Bidentifying features, 317–54,
318MMknown prokaryotic, 324Tproblems of defining, 23B
genome annotation, 65, 399–405see also gene predictioncomparing genomes to check
accuracy, 353–4, 353F, 354F, 403–5, 403F, 404F
E. coli segment, 322, 323Fevaluation and reevaluation, 405functional, 400–3pathway information aiding, 348,
349–50Fpipeline approach, 319practical aspects, 346–52, 347FDquality of information used, 403role of gene ontology, 348–52, 402theoretical basis, 357–9, 358MM
Genome Browser, 352, 352FGenomeNet, 84GenomeScan, 397genome sequence alignments
to verify annotation, 353–4, 353F, 354F, 403–5, 403F, 404F
whole genomes, 156–9, 157FDgenome sequences
excluding noncoding regions, 319–21
gene prediction from see gene prediction
preliminary examination, 318–22, 319FD
splitting, 319genome sequencing, 71
multiple genomes, 376Bshotgun procedure, 376B
genomic imprinting, 7genomics
functional, 600role in systems biology, 668structural, 569
GenScan program, 334comparative results, 331T, 332T, 336exon detection, 390promoter detection, 385, 385Fsplice site prediction, 394, 395F,
396transcription stop signal detection,
389translation start site detection, 389use of gene models, 398–9, 401Fuse of homology searches, 397
GenTHREADER, 532–3, 534–5, 535F, 536F
GEPASI, 691–2, 691FGES (Goldman, Engelman and Steitz)
hydrophobicity scale, 438, 475, 477T
Gibbs program, 215–17Gleevec®, 593GLIMMER program, 323, 371–2global alignments, 88–9, 89F
large genome sequences, 352F, 353optimal, 128, 129–35, 129F, 130F,
131Fscore significance, 154time saving methods of deriving,
139–41, 139F, 140F
global–local dynamic programming, 533
globular proteins, 41length distributions of secondary
structures, 467, 468Fsecondary structure prediction,
509secondary structures, 463
gluconeogenesis pathway, 348, 349–50F
glycolytic pathway, 671, 672FE. coli, 673Finteractions, 673Fmodularity, 686F, 687F
glycosylphosphatidylinositol (GPI) anchors, 513–14, 513F
Godzik, Adam, 491Gojobori, Takashi, 240BGOLD program, 591–2, 592FGOR methods, 414, 422–5, 425F, 472–3
accuracy, 422, 423, 424T, 484derivation, 480–4, 482Fversion III, 483, 484Fversion IV, 423–5, 427F, 483version V, 423–5, 425–6, 426F, 483
Gotoh, Osamu, 206GPI-SOM method, 513–14, 513FG-protein-coupled receptors, 436,
436BGrailEXP program, 331T, 332T, 334–5,
336Grail program, 323, 386, 387F, 389,
399greedy alignment methods, 199greedy permutation encoding method,
646–7Greek Key structure, 40BGRID program, 591GRIN program, 591Grishin, Nick, 466growth factors, 616–17, 617Fguanine (G), 6, 6Fguide tree, 90, 199–200
construction, 204–6, 205Fmultiple alignment from, 206, 206Fpattern discovery, 214
Guigo, Roderic, 365–6B, 392BGumbel extreme-value distribution see
extreme-value distribution
H
HbP method, 491–2, 492F, 493F
Haemophilus influenzae, 371
hairpins, 36–7
harmonic approximation, 526, 702–3, 702F
hashing, 95
theoretical basis, 143–6whole genome sequences, 158
heartcellular modeling, 685Tmodeling of function, 677, 678F
heat shock response, E. coli, 680, 680Fhelical wheels, 439F, 440–1, 448helices, 435
see also 3₁₀-helices; α-helices; π-helices; transmembrane helices
helix tails, 441hemagglutinin, 34, 486, 486Fhemoglobin, 43, 43FHenikoff, Steven and Jorja, 122, 171Fheptads, 451, 451F, 510Hessian, 714–15hexamers (hexanucleotides) see
dicodonsHHsearch, 195F, 196hidden layers, 431, 431F, 494, 499hidden Markov models (HMMs), 166,
179, 179FDwith duration, or explicit state
duration, 374–6EcoParse gene model, 375F, 376–7exon prediction, 328, 332GAZE gene model, 402FGeneMark.hmm algorithm, 374–6,
374Fgenome annotation, 359GenScan gene model, 399, 401Fmultiple sequence alignments, 200,
203–4profile see profile hidden Markov
modelssecondary structure prediction,
504–10, 506FDtransmembrane protein prediction,
446–7, 446F, 451vs finite-state automata, 147, 179,
180–1hidden neural networks (HNN), 509hierarchical clustering, 638, 639–41
see also UPGMA methodgene expression microarray data,
606–8, 606F, 607Fprotein expression data, 616–17,
617F, 618Fvs other clustering methods, 643B
hierarchical likelihood ratio test (hLRT), 253, 254F, 255
Higgins, Desmond, 209high-scoring segment pairs (HSPs),
141, 149Hinton diagram, 499Fhistone deposition protein, 571FHIV (HIV-1), 337B
drug design, 589Bprotease (HIV-PR), 551–2, 552F
HKY85 model, 253T, 254F, 256T, 273HMM see hidden Markov models
HMMER2 program, 185HMMGene program, 331T, 332, 332T,
333HMMTOP program, 441F, 446–7, 448F,
506–7, 507FHochberg, Yosef, 659Hollerith, Herman, 48, 48Fhomolog methods see nearest-
neighbor methodshomologous genes
chicken, human and puffer fish genomes, 245, 246F
evolution, 239–42, 242Fhomologous proteins, 38
see also protein familiesalignment, 38, 74secondary structure prediction,
416, 418–19, 419Fhomologous sequences
see also sequence alignmentcut-off points for identifying, 81identifying, 74–6inserting gaps, 85–7scoring alignments, 76–81searching databases see searching
sequence databasessecondary structure prediction
using, 425–6, 484–5, 489–90, 502–3
homology
    exon prediction based on, 397
    functional, 569–70, 569F, 570F
    gene prediction based on, 320F, 321B, 322, 372–3
homology modeling (3D protein structure), 522–3, 537–64, 538FD
    assumptions, 541–2
    automated, 541, 552–6, 553FD, 561–3
    checking for accuracy, 549–51, 550F, 551T, 560, 560F
    energy minimization, 548, 559–60
    history, 538–9, 538F
    loops, 545–6, 546F, 547F, 559, 559F
    manual or semi-manual, 557–61
    molecular dynamics, 548
    mTOR protein, 563, 563T
    multidomain proteins, 564
    PI3 kinase p110α, 557–63
    principles, 537–42
    sequence length cut-offs, 540–1, 542F
    sequence similarity thresholds, 539–40, 541F
    steps, 540F, 542–52, 543FD
    structurally conserved regions (SCRs), 544–5, 545F, 554
    trustworthiness, 551–2
    Web-based servers, 554, 561–3
homoplasy, 244
horizontal gene transfer (HGT), 246–7, 246F, 247F, 292B
Hsp60, 249
HSSP database, 490
HTML (hypertext markup language), 50–1
human immunodeficiency virus see HIV
Hutchinson, Gail, 475
Hutchinson, Gordon, 387
hybridization, 9, 602
hydrogen bonds
DNA, 7, 8energetics, 525–6, 701peptide bonds, 29, 32, 32Bprotein folds, 42RNA, 456secondary protein structure, 34, 35,
35F, 36Fdefining, for prediction
algorithms, 464–5, 465Fnonidealized patterns, 463–4
hydropathic (hydrophobicity) profiles, 439, 442
hydrophilic amino acid residues, 29, 30F
transmembrane proteins, 439F, 440–1
hydrophilic regions, folded proteins, 41
hydrophobic amino acid residues, 29, 30F
hydrophobic cluster analysis (HCA), 110–11, 110F
hydrophobicity scales, 437–9, 450, 475, 477T
hydrophobic moment, 440hydrophobic regions
folded proteins, 41, 42indicating binding sites, 583transmembrane proteins, 437–41,
439Fhyperplanes, separating, 661, 662,
662Fhypertext markup language (HTML),
50–1HyPhy program, 255hypothetical proteins, 65, 348
conserved, 348
I
identity, 76
    percent/percentage see percent/percentage identity
    visual assessment, 77–8, 77F, 79F
immunoglobulin folds, 571F
immunoglobulins, 381, 555–6B
importin α, 480F
imprinting, genomic, 7
indels, 85, 117see also deletions; insertions
indexing techniques, 141–6see also hashing; suffix treeswhole genome sequences, 157–9
influenza virushemagglutinin, 34, 486, 486Frational drug design, 589B, 591
informationdirectional, 423, 482mutual, 697pair, 423, 482Shannon entropy and, 696
information theory approach, secondary structure prediction, 422–5, 480–4
informative sites, 298ingroups, 230inhomogeneous Markov chain (IMC)
models, 328, 368–70initiator (Inr) see cap signalinput, 431, 494input layer, 430, 494insertions
accounting for, in sequence alignment, 85–7
alignment scoring schemes, 117, 126–7
homology modeling, 542, 545–6, 545F
threading and, 532, 537integral membrane proteins see
transmembrane proteinsintegrative approach, 670Fintermediate alignment, 198, 204–5,
205Fintermediate sequences, 97internal nodes, 226, 227FInternet, access to databases via, 52interpolated Markov models, 371–2,
388intrinsic classification, 638intrinsic gene detection methods, 361,
368FDintron prediction, 319, 323, 379–81
approaches used, 324–5theoretical basis, 389–97, 391FD
introns, 18–19, 18Fsee also splice sitesAT–AC or U12, 19, 392branch point, 18–19, 396length distributions, 379, 379F
invariable sites, 298inverse protein folding, 530–1inversion, sequence, 158FI-sites library, 487–8, 487Fisochores, 275, 378isoelectric focusing (IEF), 613iterated sequence search (ISS), 168iterative alignment, 198, 206, 206F
J
Jarnac, 690F, 691–2
JC model see Jukes–Cantor model
Jnet program, 424T, 434, 435F
Jones, David, 276, 503
JTEF program, 397
JTT matrix, 276
Jukes, Thomas, 271
Jukes–Cantor (JC) model, 253T, 271–2
evaluation using maximum likelihood, 302, 303–4
example distance corrections, 252Fexamples of constructed trees, 256,
258F, 261F, 262Gamma distribution applied to
(JC+G), 273more complex models based on,
272–3synonymous/nonsynonymous
mutations, 241Btesting for suitability, 253, 254F,
256Tjunk DNA, 22B, 336, 378–9jury decision neural networks, 432,
501jury voting technique, 485JWS Online Cellular System Modeling,
692
K
Kabat database, 103
Kabsch, Wolfgang, 464–5
Katoh, Kazutaka, 206
KD hydrophobicity scale, 475, 477T, 479F
Kendrew, John Cowdery, 538F
keratins, 451
keys, 49, 49F
Kihara, Daisuke, 480
Kimura-two-parameter (K2P or K80) model, 253, 253T, 272–3
    practical application, 261F, 262
    transition/transversion ratio calculation, 274–5B
Kimura-three-parameter (K3P or K81) model, 253, 253T
kinetic energy, 718
kinetic models, 678, 690F
kinetic parameters, biological networks, 674
k-means clustering, 608, 641–2, 642F
    vs other clustering methods, 643B
k-mers, 141, 147, 199–200
k-nearest-neighbor method, sample classification, 660–1
knockout mice, 688
knowledge-based methods
    modeling 3D protein structure see homology modeling
secondary structure prediction, 414–15, 421–30
transmembrane protein prediction, 443
knowledge-based scoring, 590KOG database, 243, 245BKohonen networks see self-organizing
mapsKrebs cycle see tricarboxylic acid cycleKrogh, Anders, 500, 501F, 502–3k-tuples, 95, 141, 143–4, 147
whole genome sequences, 158–9Kullback–Leibler distance see relative
entropykuru, 101, 101BKyoto Encyclopedia of Genes and
Genomes (KEGG), 348, 671, 672F
Kyte–Doolittle (KD) hydrophobicity scale, 475, 477T, 479F
L
L2L tool, 611
Laboratory Information Management System (LIMS), 600
LAGAN method, 352F, 353
Lake, James, 292B
LAMA program, 106
    alignment of PSSMs, 193–5, 194F
Lander, Eric, 488, 491
lariat RNA, 18–19, 18F
last common ancestor, 227, 227F
last universal common ancestor, 293
lateral gene transfer (LGT) see horizontal gene transfer
layers, neural networks, 430–1, 431F, 494–5
learning
supervised, 497B, 638unsupervised, 638, 644
learning rate, 497Bleast-squares method, 250
Bryant and Waddell version, 296, 296F
evaluating tree topologies, 294–6, 295F, 297
leaves, 226, 227F
LEGO® system, 686, 688F
length distributions
    α-helices and β-strands, 467, 468F
    prokaryotic coding/noncoding regions, 374–5, 374F
    vertebrate introns and exons, 379, 379F
Lennard–Jones terms, 705, 705F
leucine zipper, 413, 451
    prediction, 453–4, 455F
Levitt, Michael, 195
LIBRA, 536, 537F
library extension, COFFEE scoring scheme, 203, 204F
ligandsdocking procedures, 587–93, 588FDdrug design methods, 588, 589Bidentifying candidate, 590
likelihood ratio test, hierarchical (hLRT), 253, 254F, 255
linear discriminant analysis (LDA)promoter prediction, 340, 388, 389Fsecondary structure prediction,
512–13linear gap penalties, 126–7
global alignments, 131F, 132–3local alignments, 137suboptimal alignments, 137F, 139
links (in databases), 52, 53lipopolysaccharide (LPS), 608, 609F,
674Fliquid chromatography, 623local alignments, 88–9, 89F
dynamic programming algorithm, 135–9, 136F
gapped, score statistics, 153, 156multiple alignment using, 92–3, 93Foptimal, 135–7, 136Fprofile hidden Markov model,
183–4, 184Fsuboptimal, 137–9, 137Fungapped, score statistics, 153,
155–6log-likelihoods
amino acid propensities, 476, 478Fevolutionary models, 254F, 256Tmultiple alignments, 192, 216
log-odds ratio, 118–19, 169–70log-odds scores, 188–90logos, 177
aligned HMMs, 196, 196Fpatterns, 213PSSMs, 106F, 177–8, 178F
log ratiosdefining distances between, 634–7expression data, 629–30, 630Ft-test, 654z-test, 653–4
long-branch attraction see Felsenstein zone
LOOPP program, 532–3, 533F, 535–6, 536F
loops, 36–7amino acid residue preferences,
202homology modeling and, 542,
545–6, 546F, 547F, 559, 559Ftransmembrane proteins,
prediction, 506, 508Loopy program, 561low-complexity regions, 100–2, 151–2B
see also repeat sequences
Lowe, Todd, 362lowess normalization, 630–1, 631FLUDI program, 591lysozyme, 538–9, 539F
M
M (mutation probability matrix), 120, 121–2
machine-learning methods, 430
see also neural network methodssecondary structure prediction,
414, 415–16Macromolecular Structure Database
(MSD), 52, 60, 64macrophages, 608, 609F, 674FMAFFT method, 199–200, 206Mahalanobis distance, 636–7main chain see backboneMajor, Francois, 466major histocompatibility complex
(MHC) proteins, 593majority-rule consensus trees, 234F,
235majority voting technique, 485Mann–Whitney U test, 656–7MAO (multiple alignment ontology),
54F, 55MARCOIL, 510, 510FMarkov chain Monte Carlo (MCMC),
307Markov chains, 368–9
first order, splice site prediction, 394–5
Markov models, 179see also hidden Markov models;
inhomogeneous Markov chain models
fifth order, 368–70, 370Finterpolated, 371–2, 388splice site prediction, 394–5used by GeneMark, 369, 370F
MASCOT program, 622–3mass spectrometry (MS), 600, 621–3
protein identification, 621–3, 622F
protein quantitation, 623MAST program, 106mathematical modeling of biological
systems, 689–92, 689FDapproaches, 674–7, 676Fmodel databases, 692model structure, 679–83, 679FDspecialized programs, 690F, 691–2,
691Fstandardized languages, 692
Matthews correlation coefficient, 469–70
maximal dependence decomposition (MDD), 394, 395F
maximal segment pair (MSP), 141, 149maximum likelihood (ML), 250, 251T,
286evaluating tree topologies, 302–5,
302F, 303F, 304Fhidden Markov model
parameterization, 191inference of parameter values, 698measure of optimality, 287practical application, 255–6, 257F,
262, 263Ftesting for suitability, 253
maximum parsimony, 250, 251T, 286branch-and-bound technique, 288long-branch attraction problem,
309, 309Fmeasure of optimality, 287unweighted, 297–300, 299Fweighted, 300–2, 300F, 301F
McClintock, Barbara, 337BMcPromoter program, 388mean(s), 626, 652
comparison between two, 652–5MEGA3 program, 250, 260Melanie program, 614–15, 620membrane proteins, 436–7, 462
see also transmembrane proteinsinteractions with membrane, 437,
437Fsecondary structure prediction, 468
MEME program, 105–7, 107F, 215–17MEMSAT program, 443, 475–6, 479messenger RNA (mRNA), 11
analysis of transcribed see gene expression analysis
capping, 18genetic code, 12–13, 12Tpolyadenylation, 18reading frames, 13, 13Fsecondary structure, 455splicing see RNA splicingsynthesis see transcriptiontranslation see translation
metabolic models, 678metabolic pathways
databases as sources, 671, 672F, 673F
modeling interactions, 681–3, 682Fmodularity, 685, 686F, 687Fsimulation programs, 690F, 691–2,
691Fmethylation, 6–7MFOLD program, 457F, 458MIAME (Minimum Information About
a Microarray Experiment), 64, 606
Michener, Charles, 278microarray databases, 58, 60F
applications, 610–11, 612Fdata standards, 64, 606
Microarray Gene Expression Data (MGED), 54–5, 606
MicroArray Quality Control (MAQC) project, 606
microarrays, 602DNA see DNA microarraysprotein, 621
middle-out approach, modeling biological systems, 677, 678F
midnight zone, 81minimum evolution, 250, 282
methods, 250, 251T, 297MIRIAM standard, 692mitochondria, 22, 292B, 367modeling biological systems see
mathematical modeling of biological systems
modeling (tertiary) protein structure, 521–65, 522MM
ab initio approach, 522, 523Bassessment of predicted structure,
554–6comparative, homology or
knowledge-based see homology modeling
potential energy functions and forcefields, 524–9, 524FD
ROSETTA/HMMSTR method, 523Bthreading (fold recognition) see
threadingMODELLER program, 535, 541, 552,
553, 554Fmodel surgery, 182ModelTest, 255modularity, biological systems, 685–6modules, 680, 681F, 685–6, 686FMolecular Biology Database
Collection, 55, 56Fmolecular clock, 229–30, 278
hypothesis, 250molecular configuration, 33Bmolecular dynamics, 528–9
function optimization, 718–19homology modeling, 548
molecular energy functions, 700–8see also bonding terms; nonbonding
termsforce fields for intra- and
intermolecular interactions, 701–5
potentials used in threading, 706–8molecular evolution, 235–48, 235FDMolecular INTeraction database
(MINT), 58molecular mechanics, 524–9molecular modeling, ligand binding,
588, 589Bmolecular models, 39, 39FMolIDE, 542, 557–8, 560–1, 561FMolProbity program, 527, 549, 551T
monophyletic (groups), 231, 255–6, 258
Monte Carlo methodssee also Markov chain Monte Carlodocking, 590function optimization, 716–18,
716Fmodeling protein structure, 523B
Morse potential, 702F, 703MOTIF program, 217motifs, 103–9, 412
see also patternsautomated generation, 105–7, 106F,
107Fcreating databases, 104–5searching for, 103–4, 107–8
MrAIC script, 255mRNA see messenger RNAmTOR protein, 563, 563TMULTICOIL program, 453multidomain proteins, 41
3D structural modeling, 537, 564sequence alignment, 88, 88F
multifurcating trees, 227, 233, 233FMulti-LAGAN method, 353multiple alignment, 89–93
applications, 90construction methods, 90–1,
196–211discovering patterns, 213–15divide-and-conquer method, 91,
91Fby gradual sequence addition,
196–206, 197FDmanual refinement, 93methods not using pairwise
alignment, 207–11, 207FDphylogenetic tree reconstruction
using, 250–1, 255, 260secondary structure prediction
using, 425–7, 427Ffrom series of local alignments,
92–3, 93Ftheory, 165–7, 166MMtransmembrane protein prediction
using, 444, 445value for sequences of low
similarity, 91–2, 92Fvs pairwise alignments, 90, 166–7
multiple alignment ontology (MAO), 54F, 55
multiple linear regression, 514MUMmer method, 159MUSCLE method, 199–200, 206mutation data matrices (MDMs),
Dayhoff see PAM matricesmutation probability matrix (M), 120,
121–2mutation rates
codon position and, 238–9, 238F
estimating and predicting, 236, 237F
type of base substitution and, 236–8, 238F
mutations, 22–3accepted, 84masking sequence similarities, 72,
73–4selective pressures on, 240–1Bsynonymous/nonsynonymous, 238,
240–1B, 245transition and transversion, 237–8,
238Fmutual information, 697Mycoplasma, 684myoglobin, sperm whale, 538Fmyosin II, 451MZEF program, 328
comparative results, 331–2, 331T, 332T
scores used, 331T
N
N-acetylneuraminate lyase gene, 247F
National Center for Biotechnology Information (NCBI), 52, 55
    dbEST, 56, 321B
    GEO, 606
    Protein Database, 56–8
    SAGE analysis programs, 605
    UniGene database, 103, 605–6, 605F
native structure or state (of proteins), 522
NCBI see National Center for Biotechnology Information
nearest-neighbor interchange (NNI) method, 289–90, 289B
nearest-neighbor methods, 414–15, 428–30, 485–92, 485FD
misfolding proteins, 491–2, 492F, 493F
outline, 486, 487Fsample classification, 660–1similarity measures used, 488–90,
490Fweighting of predictions, 490–1
Needleman, S.B., 87, 128Needleman–Wunsch algorithm, 87,
128database search programs using, 95discarding intermediate
calculations, 138Bextension to multiple alignments,
199illustration of original, 135, 135Fmore efficient variations, 129–35,
129F, 130Fnegative selection, 240–1BNei, Masatoshi, 240B, 282
neighbor-joining (NJ) method, 250, 251T, 252–3
generating single trees, 282–5, 282F, 284F
multiple alignment, 199, 200practical application, 261F, 262variants, 285
Nei–Gojobori method, 240–1BNeisseria meningitidis, 348, 350Fnested genes, 399NetPhos server, 110NetPlantGene program, 390–1, 393F,
395–6networks
see also neural networks; systems, biological
architectures, 676, 677Fbiological, 670–1information for constructing, 671–4kinetic models, 678mathematical modeling
approaches, 674–7mathematical representation of
interactions, 680–3scale-free, 676
neural network methodsexon prediction, 334–5, 390–1genome annotation, 359promoter prediction, 340, 385–6,
386F, 387Fsecondary structure prediction,
415–16, 430–4, 430FDassessing reliability, 432Qian and Sejnowski studies,
496–9, 499F, 500FRiis and Krogh methods, 500–1,
501F, 502–3theoretical basis, 492–504,
493FDtransmembrane proteins, 445using homologous sequences,
502–3Web-based programs using,
432–4splice site prediction, 395–6
neural networks, 430–2GenTHREADER, 534–5, 535FKohonen see self-organizing mapslayered feed-forward, 494–502,
495Fmore complex, 503–4, 504F, 505Fmultilayer, 431, 431Ftraining process, 496, 497–8Btwo-layered, 430–1, 431F
neuraminidase, 589BNevill-Manning, Craig, 213Newick or New Hampshire format,
231–2Newton–Raphson method, 528NMR see nuclear magnetic resonance
NNPP program, 340, 341T, 385–6, 386F
NNSSP program, 424T, 433, 488–9, 490, 491
nodesneural networks see units, neural
networkphylogenetic trees, 226, 227Fself-organizing maps, 608, 608F,
644, 644Fself-organizing tree algorithms, 648,
648Fnonbonding terms, 525–6, 701, 704–5noncoding DNA see junk DNAnoncoding RNA (ncRNA) genes,
detection, 319–21, 361–3noncoding strand, 11nonlinearity, 667nonparametric tests, 656–7nonrandom model, sequence
alignment, 117–19nonredundant database, 63nonsynonymous mutations, 239,
240–1B, 245normal distributions, 626, 628F, 698
statistical tests, 653–5normalization
data, 627–31, 628F, 630Flowess, 630–1, 631F
Notredame, Cedric, 209N terminus, 29nuclear magnetic resonance (NMR),
411, 521nucleic acid sequences see nucleotide
sequencesNucleic Acids Research (NAR), 55, 56Fnucleic acid world, 3–23, 4MM
see also DNA; RNAnucleotides, 5–6, 6Fnucleotide sequences, 5, 6
see also DNA sequences; RNA sequences
base composition variations, 275–6
comparison with protein sequences, 150–3
databases, 55–6, 57F, 58derivation of scoring matrices,
124F, 125detection of homology, 75–6evolutionary changes, 236–9evolutionary models, 271–2large-scale rearrangements see
rearrangements, large-scalelow-complexity regions, 151Bscoring of alignment, 76–7, 80–1searching with, 97–103
null distribution, 656null model, 189–90NVT ensemble, 718
O
object-oriented databases, 48, 51
odds ratio, 118
Ohler, Uwe, 388
oligomeric proteins, 42–3
one-tailed test, 653
Online Mendelian Inheritance in Man (OMIM) Web site, 352
ontologies, 54–5, 54F, 64
    gene see gene ontology
open reading frames (ORFs), 13, 318, 367
    compared to eukaryotic genes, 377–8
    hypothetical proteins, 348
    identifying, 318–19, 359–60
        practical aspects, 322–3
        theoretical basis, 364, 371, 372–3
    minimum and maximum sizes, 328, 405
    orphan (ORFans), 405
    potential, 364
operational taxonomic units (OTUs), 225
operons, 19–20, 19F, 319, 341optimal alignments, 76, 128
extreme-value distribution, 155, 155F
global, 128, 129–35, 129F, 130F, 131Flocal, 135–7, 136Fscore significance, 153–6, 154FD
optimization, function, 709–19, 709F
    full search methods, 710
    global, 715–19, 715F
    local, 710–15
ordinary differential equations (ODEs), 683
ORFs see open reading frames
Organismic System Theory, 667
orphan ORFs (ORFans), 405
ORPHEUS program, 323, 372–3
orthogonal encoding, 496
orthologous genes, 239, 242F
chicken, human and puffer fish genomes, 245, 246F
to construct species trees, 239–47identifying, 243, 245Blarge-scale rearrangements and,
248orthologous sequences (orthologs),
223Osguthorpe, David, 422outgroups, 229F, 230, 258, 291–2output, 680output expansion, 500output layer, 430, 494overall alignment score, 80overlapping classification, 638overlapping genes, 12, 12F, 360overtraining, neural networks, 498B
OWL database, 109
oxygen, molecular (O₂), 684–5
P
p53 protein, 580–2, 581F, 582F
identifying interaction sites, 584–5, 584F, 587, 587F
module, apoptotic pathway, 680, 681F
Pacific Northwest National Laboratory (PNNL), 668
PAIRCOIL program, 453paired-site tests, 311pair information, 423, 482pairwise alignment, 89, 115–61,
116MMalignment score significance, 153–6complete genome sequences,
156–9discarding intermediate
calculations, 138Bdynamic programming algorithms,
127–41, 128FDindexing techniques and
algorithmic approximations, 141–53, 142FD
inserting gaps, 86, 86Fmultiple alignments based on,
196–206, 197FDsecondary structure prediction
method using, 430substitution matrices and scoring,
117–27, 117FDvs multiple alignment, 90, 166–7
pairwise contact potential (PCP), 533PALSSE method, 466, 466F, 467, 467F,
468PAM matrices, 82–4, 83F
derivation, 119–22, 119Fevolutionary model incorporation,
276PET91 version of PAM250, 121F,
122selection, 84, 85summary score measures, 125F,
126vs percentage identity, 120F, 121
paralogous genes (paralogs), 239–42, 242F
identifying, 243, 245Bparameters
Bayesian inference, 698system, 678, 679, 679F
Parisien, Marc, 466parsimony methods see unweighted
parsimonypartially resolved tree, 227partitional classification, 638partition function, 706, 707, 716
partitionssee also splitsclustering methods, 637, 638hierarchical clustering, 639–41k-means clustering, 641–2phylogenetic trees, 231
parvalbumin (1B8C), 421F, 422Fpath, 179pathogenicity islands, 342, 402–3pathways, metabolic see metabolic
pathwayspatristic distances, 294PatternHunter program, 159patterns, 103–11, 104FD, 151B
see also motifsautomated generation, 105–7, 106F,
107Fcreating databases, 104–5discovery, 165, 166MM, 211–18,
212FDprotein function and, 109–11searching for, 103–4, 107–9, 108F,
109FPavesi, Angelo, 362–3PDB see Protein Data BankPDB_SELECT, 416–17, 473p-distance, 236, 237F, 268–9
effects of correction, 252FGamma correction, 269F, 270, 270Fphylogenetic tree reconstruction,
251–2Poisson correction, 269F, 270
Pearson, William, 144Pearson correlation coefficient, 194,
635–6, 636Fpeptide bonds, 29–33, 31F
trans and cis conformations, 32, 33F
percent/percentage identity, 76–7BLOSUM matrices and, 84homology modeling and, 540–1,
541F, 542Flimitations, 79–81minimum acceptable, 81PAM matrices, 120F, 121
percent similarity, 80–1perceptrons, 430–1, 494per-comparison error rate (PCER), 658per-family error rate (PFER), 658periodicity, 151BPET91 matrix, 121F, 122Petersen, Thomas, 499–500, 501Pfam database, 109phages, sequenced genomes, 324TPHAT matrix, 84PHDhtm program, 442F, 445PHD program, 424T, 432, 432FPHDsec program, 499, 501–2, 503PHI-BLAST program, 108Phobius method, 509
phosphatidylinositol 3-OH kinase (PI3 kinase) p110α subunit, 557
    alignment, 86, 86F
    homology modeling, 557–64, 563T
    local and global alignment, 89, 89F
    multiple alignment, 91–2, 92F
    protein family profile, 109
    searching sequence databases, 99F, 100, 101F
phosphatidylinositol 3-OH kinase (PI3 kinase) p110γ subunit, 557, 557F, 563, 563T, 564
phosphatidylinositol 3-OH kinases (PI3 kinases), 87B, 557
multidomain nature, 88, 88Fpatterns and motifs, 106–9, 106F,
107F, 109Fphosphatidylinositol-4-OH kinases
(PI4-kinases), 87B, 88Fpatterns and motifs, 106, 107–8
phosphoinositol kinase, 439F, 441phospholipid kinases, 87Bphosphopeptide-binding proteins,
570–1, 572Fphosphorylation sites, predicting, 110phosphotyrosine-binding (PTB)
domain, 571, 572Fphylogenetic tree reconstruction,
248–64assessing tree feature reliability,
307–10, 308FDchoice of method, 249–51, 251Tclustering methods, 276–85, 277FDdata choice, 249evaluating topologies, 293–307,
294FDevolutionary model choice, 251–5multiple alignment as starting
point, 255, 260multiple topologies, 286–93, 287FDpractical examples, 255–8, 257F,
258Fsingle trees, 276–86, 277FDstarting trees for further
exploration, 285–6theoretical basis, 267–311, 268MM
phylogenetic trees, 223–4
    see also guide tree
    additive, 228–9, 229F, 230
    comparing two or more alternative, 310–11
    condensed, 233–4, 233F
    consensus, 234–5, 234F, 291
    fully resolved, 227
    gene see gene trees
    measuring difference between two, 289B
    multifurcating (polytomous), 227, 233, 233F
    partially resolved, 227
    reconciled, 243, 244F
    rooted see rooted trees
    scoring multiple alignments, 200–1, 200F
    species see species trees
    strict consensus, 234–5, 234F
    structure and interpretation, 225–35, 226FD
    substitution matrix derivation from, 82–3, 119F, 120
    topologies see tree topologies
    ultrametric, 229–30, 229F
    unrooted see unrooted trees
phylogenomics, 262PHYML program, 251, 255PHYRE program, 535–6, 536FPI3 kinase see phosphatidylinositol
3-OH kinasePISSLRE see CDK10 genePKN/PRK1 protein kinase, 452, 452F,
453F, 454Fplasmids, 21platelet-derived growth factor (PDGF),
616–17, 617Fpleckstrin homology (PH) domain,
571, 572FPocket-Finder program, 585–6, 586Fpoint accepted mutations matrices see
PAM matricesPoisson corrected distance, 269F,
270polyadenylation, 18
signal detection, 389polycystein-1-protein, 571Fpolypeptide chain, 29, 31–2
conformational flexibility, 32, 32Fpolytomous (multifurcating) trees,
227, 233, 233Fporins, 35, 436
secondary structure prediction, 450–1, 450F
position-specific scoring matrices (PSSMs), 96, 166, 168–78
see also profilesaligning, 193–5, 194Fconstruction, 168–71overcoming lack of data, 171–5,
176Frepresentation as logos, 177–8,
178Fsecondary structure prediction
using, 503, 504, 505F, 514sequence weighting schemes, 171,
171Fusing PSI-BLAST, 176–7, 177F
positive-inside rule, 441positive selection, 240–1Bposterior probability, 698post-order traversal, 298–9, 298F,
300–1
potential energy, 522, 524, 525see also force fieldscalculations, 525–6functions, 522, 524–9, 706–8surface, 525
potentials of mean force, 532–3, 706–7PPI-PRED program, 584–5, 584FPRATT program, 108, 109F, 217–18Predator Multiple Seq., 424TPREDATOR program, 414, 424T,
428–30prediction confidence level (PCL), 432prediction filtering, 484PRED-TMBB method, 509Pribnow box, 339, 340primary structure, 26–7, 27F, 29–33principal component analysis (PCA)
application, 618, 619Fprinciple, 631–3, 632F, 633F
PRINTS database, 109prion proteins (PrP), 101B
chameleon sequences, 37Bhydrophobic cluster analysis, 110F,
111low-complexity regions, 101–2,
102Fprior distribution, 172prior probabilities, 307, 698probabilistic approaches
alignment scoring, 117–19pattern discovery, 215–17secondary structure prediction, 414
probabilityconditional, 696marginal, 696posterior, 698prior, 307, 698statistical tests, 652–3, 653F
probability theory, 695–7ProbCons method, 200, 203–4, 206PROCHECK program, 527, 549, 550F,
551TProdom database, 58profile hidden Markov models
(HMMs), 109, 179–93, 374aligning, 195–6, 195F, 196Fbasic structure, 180–5, 181F, 183F,
184Fparameterization
using aligned sequences, 185–7using unaligned sequences,
191–3path lengths, 185, 185F, 186Fscoring sequences against, 187–91
profiles, 96, 165–96, 166MMsee also position-specific scoring
matricesaligning, 193–6, 193FDdefining, 167–78, 167FDrepresentation as logos, 177–8, 178F
PROF program, 424T, 433F, 434prof-sim method, 195PROFtmb program, 450F, 451, 508F,
509progressive alignment, 198, 204–6,
205Fprokaryotes, 21, 21F
see also bacteria16S RNA, 249control of translation, 19–20gene detection, 359–60
algorithms, 368–77, 368FDhomology searching, 322practical aspects, 322–3, 322FD,
323Fsequence features used, 364–8,
364FDvs methods used in eukaryotes,
377–9gene structure, 318–19genomes, 324Tpromoter prediction, 339–40,
341–2regulation of transcription, 15–17,
16FtRNA gene detection, 361–2, 362F,
363FProMate, 584, 584FPromFind program, 387–8Promoter 2.0 algorithm, 340, 341TPromoterInspector program, 341,
341T, 388promoter prediction, 338–42, 381–9
eukaryotes see under eukaryotesindefinite nature of results, 341,
341Tonline methods, 340–1prokaryotes, 339–40, 341–2
Promoter Recognition Profile, 341promoters, 15–16, 319
core (basal) see core promoterseukaryotic, 17, 17F, 381
ProScan program, 341, 341T, 386–7PROSITE database, 105, 107–8, 108F,
109protease, HIV (HIV-PR), 551–2, 552Fprotein(s), 4–5
concentration measurement, 623conformation see conformationdenatured, 42function see functionhypothetical, 65, 348identification of purified, 621–3,
622Finteractions between atoms, 32Blocalization signals, 111, 111Bphylogenetic trees, 226, 230stability of folded, 41–2synthesis see translation
protein backbone see backbone
protein binding sitesdocking procedures, 587–93, 588FDfinding, 580–7, 581FD
highlighting clefts or holes, 585–6, 585F, 586F
residue conservation for, 586–7, 587F
surface properties for, 584–5, 585F
useful features for, 582–4types, 582water molecules, 592–3
Protein Data Bank (PDB), 60, 62F, 102–3, 531
finding target protein homologs, 543, 557
    PDB_SELECT, 416–17, 473
Protein Domain Parser (PDP), 575, 576
protein expression
    2D gel electrophoresis see two-dimensional gel electrophoresis
    analysis, 612–23, 612FD
        cluster analysis, 615–17, 617F, 618F
        data preparation for, 626–33, 627F, 627FD
        differential, 615, 616F, 617F
        methods, 614–20
        online tools, 620
        principal component analysis, 618, 619F
        statistics, 652–9
        tracking changes over different samples, 618–20, 619F
    clustering methods and statistics, 625–64, 626MM
    databases, 58, 620
    identification of purified proteins, 621–3, 622F
    quantitation, 623
    sample classification, 659–62, 660FD
protein families, 259B
phylogenetic tree reconstruction, 259–63, 261F, 263F
profiles of, 109protein fold libraries, 573topological, 573F, 574
protein folding, 40–1, 41F, 412alternative, 486, 491–2, 492F, 493Finverse, 530–1
protein fold recognition see threadingprotein folds, 40, 41, 411
classification, 573–4, 573Flibraries, 531, 532F, 571–4prediction in absence of known
homologs, 531recognition see threadingstructurally different, with similar
functions, 570–1, 572F
structurally similarwith different functions, 570,
571F, 572Funrelated molecules, 529, 530F
protein interaction(s), 580–2databases, 58–9interactive Web sites, 671–2, 673F,
674Fmaps, 610, 611Fsites see protein binding sites
protein kinases, 86, 87BcAMP-dependent see cyclic
AMP-dependent protein kinasecatalytic subunit (PRKD), 107–8,
107Fmicroarrays, 621PKN/PRK1, 452, 452F, 453F, 454F
protein microarrays, 621ProteinProspector program, 622–3protein–protein interactions
see also protein interaction(s)analysis using clustered data, 610,
611Fsearching for, 584–5, 584F
protein sequence databases, 56–8, 59Fnomenclature for amino acid
uncertainty, 63protein sequences
see also amino acid sequencescomparison with nucleotide
sequences, 150–3constructing predicted, 343–6, 345detection of homology, 75–6evolutionary models, 276low-complexity regions, 100–2,
151Bmultiple alignments, 92obtaining secondary structure from
see secondary structure prediction
phylogenetic tree reconstruction, 249
scoring of alignment, 76–7, 79–80searching for motifs or patterns,
103–4searching with, 97–103substitution matrices, 82–5, 117–25
protein structure, 25–43, 26FD, 26MMclassification, 421F, 573–4, 573Fcomparison methods, 574–80,
575FDimplications for bioinformatics,
37–9, 38FDlow secondary structure content
(low SS), 573F, 574modeling see modeling protein
structuremolecular representations, 39, 39Fnative, 522primary see primary structure
quaternary see quaternary conformation
secondary see secondary structuresupersecondary, 40B, 529tertiary see tertiary protein structurethree-dimensional see tertiary
protein structurevisualization and computer
manipulation, 38–9, 39Fprotein subunits, 27, 42–3Proteobacteria, 249, 255–8, 257F, 258Fproteome, 600, 612
see also protein expressionanalysis, 612–23, 612FD
proteomics, 600–1applications, 601Trole in systems biology, 668
protocols, 686ProtScale, 110prrp program, 206pseudocounts, 172–3, 176Fpseudo-energy functions, 526–7pseudogenes, 22B, 73, 73B, 242pseudoknots, 457pseudo-torsion angles, 703PSI-BLAST program, 96–7, 108
comparative effectiveness, 177, 178T
homology modeling, 560–1PSSM construction, 176–7, 177Fsecondary structure prediction,
433F, 502, 503, 504PSIMLR method, 514PSIPRED program, 433F, 434, 434F, 503
accuracy, 424T, 469, 469F, 472, 503homology modeling, 560–1
PSORT programs, 111PSSMs see position-specific scoring
matricespSTIING, 58–9, 671–2
analysis of clustered genes, 610, 611F
protein interaction networks, 61F, 674F
purifying (negative) selection, 240Bpurines, 6, 6Fpyridoxal phosphate-dependent
aminotransferases, 570pyrimidines, 6, 6Fpyruvate formate-lyase, 467Fpyruvate kinase, 480F
Q
Q3, 417–19, 418F, 469
    compared to Sov, 470T, 471–2
    different methods compared, 422, 424T
    GOR method, 422, 423, 484
    nearest-neighbor methods, 491
    neural network methods, 499, 501, 503, 504
    range of values, 469, 469F
Qian, Ning, 496–9, 499F
Q-SiteFinder, 585–6, 586F
quadratic discriminant analysis (QDA), 340, 388, 389F, 396
quality match scores, 200, 203–4
quantum mechanics, 700
quartet-puzzling method, 251T, 305–6, 306F
quaternary conformation, 27, 27F, 42–3, 42F, 43F
R
Ramachandran plots, 33, 34F, 525
  PI3 kinase p110α model, 560, 560F
random error, 627–8
random model, sequence alignment, 117–19
rank-sum test, 656–7
reaction rates, 679–80
reading frames, 13, 13F
  see also open reading frames
  exon prediction and, 325–7, 328F, 329F, 391–2
rearrangements, large-scale, 248
  examples, 158F
  identifying, 156–7, 158F, 159
  rat and mouse X chromosomes, 403–4, 403F
receptor tyrosine kinases (RTKs), 436B
Reciprocal Net database, 52
reconciled trees, 243, 244F
RECON program, 347
records, database, 46–7
reductionist approach, 670F
redundancy, biological systems, 686–8
redundant data, 63
regulatory elements, 15
relational databases, 48, 49–50, 49F
relative entropy, 697
  substitution matrices, 125F, 126
relative mutability, 120
Relenza®, 589B, 591
reliability (confidence index), 432
RELL method, 311
repeated elements, 337B
RepeatMasker program, 347, 378–9
repeat sequences
  see also DNA repeats; low-complexity regions
  annotation, 347
  dot-plots for identifying, 77–8, 79F
  exclusion from analysis, 319–21, 360, 378–9
  SEG for identifying, 151–2B
repressors, 16–17
resolution, 64
response function, 495, 496F
restriction enzymes, type I, 420B
retrotransposons, 337B
REV+G model, 254F, 255–6, 256T
REV (GTR) model, 253T, 255, 262
R factor, 64
Rhodopseudomonas blastica, 450
rhodopsin, 440–1, 440F
  helical wheel representation, 439F, 441
  secondary structure prediction, 441F, 442F, 443, 447F
ribonuclease (RNase), 412
ribonucleic acid see RNA
ribonucleotides, 6
ribose, 5–6
Ribosomal Database Project (RDP) database, 255
ribosomal RNA (rRNA), 13
  see also 16S RNA sequences
  sequences, identifying, 361
  small ribosomal subunit, 249
ribosome-binding sites (RBS), 366F, 380
  absence in eukaryotes, 380, 389
  GeneMark.hmm, 375
  ORPHEUS scoring scheme, 372–3
ribosomes, 13–14, 14F
rice genome, 335B
Riis, Søren, 500–1, 501F, 502–3
RING-finger domains, 575
ring of life, 292B
ritonavir, 589B
Rivera, Maria, 292B
RMSD see root mean square deviation
RNA, 4
  central dogma concept, 10, 10F, 10FD
  functions, 13
  noncoding, detection, 319–21, 361–3, 361FD
  structure, 5, 5FD, 6F, 9–10, 9F
  transcription see transcription
RNA capping, 18
RNAfold, 457F, 458
RNA polymerase II, 17
  promoters, detection, 383–7, 387F
  subunit, 582, 582F
RNA polymerases, 11
  bacterial, 15–17, 339
  eukaryotic, 17–18, 383
RNA secondary structure, 9, 435, 455–6
  prediction, 455–8, 455FD, 456F
  types, 456, 456F
RNA sequences
  databases, 56
  searching with, 97
RNA splicing, 18–19, 18F
  alternative, 19, 380–1
Robinson–Foulds difference see symmetric difference
Robson, Barry, 422, 480
robustness
  biological systems, 683–9, 684FD
  characterization, 690
  as feature of complexity, 684–5
Rocke, David, 627–8
roll, 573F
root, 227, 227F
rooted trees, 227, 227F
  construction, 291–3
root mean square deviation (RMSD), 542
  domain identification, 577
  modeling of loops, 546, 547F
  practical application, 563, 563T
ROSETTA/HMMSTR method, 523B
Rost, Burkhard, 470
rotamer libraries, 547–8
rRNA see ribosomal RNA
Rychlewski, Leszek, 491
S
Saccharomyces cerevisiae, 324, 404, 405
  cDNA array data analysis, 632F
  gene expression microarray database, 611, 612F
SAGA multiple alignment method, 209–11, 210F, 211F
SAGE (serial analysis of gene expression), 604–5, 604F
SAGEmap, 605
Saitou, Naruya, 282
Salzberg, Steven, 489, 491
SAM (significance analysis of microarray method), 656
sample classification, 659–62, 660FD
  see also data classification
  biclustering, 649–50, 650F
  methods available, 660–1
  principal component analysis, 631–3, 632F, 633F
  support vector machines, 661–2, 662F, 663F
sample classifier, 660
SAM program, 182, 184
Sander, Christian, 464–5
sandwich, 573F
Sanger, Frederick, 45
Sanger Institute, 55
Sankoff algorithm, 300–2, 301F
SATCHMO program, 200, 203
scatterplots, protein expression data, 615, 617F
Scherf, Matthias, 388
Schneider, Thomas, 178
SCOP database, 531, 532F, 572–4
scores (alignment), 76, 117
  derivation, 117–19
  expected, 119, 126
  overall, 80
  statistical significance, 153–6, 154FD
scoring schemes/matrices, 75, 76–81
  see also position-specific scoring matrices; substitution matrices
  constructing multiple alignments, 200–4
  selection of appropriate, 126
  theoretical basis, 117–27, 117FD
  threading, 531–3
scrapie, 101B
SCWRL3, 561
searching sequence databases, 93–111, 94FD
  assessing quality of match, 97–100, 99F
  database selection, 102–3
  dealing with low-complexity regions, 100–2
  exon prediction, 397
  patterns and protein function, 109–11
  programs, 94–7
  protein sequence motifs or patterns, 103–7
  using motifs and patterns, 107–9
secondary RNA structure see RNA secondary structure
secondary structure, 27, 27F, 33–6
  see also α-helices; β-strands
  alternative conformations, 486, 486F
  common types, 413–14, 413F
  databases, 60–1
  defining, for prediction algorithms, 463–8
  length distributions, 467, 468, 468F
  local sequence effects, 479–84, 480F
  sequence correlations, 487–8, 487F
secondary structure prediction, 37, 411–59, 412MM
  assessing accuracy, 417–19, 418FD, 469–72
  based on residue propensities, 472–85, 472FD
  coiled coils, 451–4, 452FD
  defining secondary structure, 463–8, 464FD
  expected accuracy, 468
  general data classification techniques, 510–14, 511FD
  hidden Markov models, 504–10, 506FD
  methods of defining structures, 417, 417F
  nearest-neighbor methods see nearest-neighbor methods
  neural network methods see under neural network methods
  specialized methods, 435–58, 435FD
  statistical and knowledge-based methods, 421–30, 421FD
  successful application, 420B
  theoretical basis, 461–514, 462MM
  training and test databases, 416–17, 416FD
  transmembrane proteins, 438–51, 438FD
  types of methods available, 413–16, 413FD
second derivative methods, function optimization, 714–15
SEG program, 151–2B
Sejnowski, Terrence, 496–9, 499F
selective pressures, 240–1B
self-information, 423, 482
self-organizing maps (SOMs), 644–6, 644F, 645F
  basic principle, 608, 608F
  biclustering, 650, 650F
  gene expression microarray data, 608–9, 609F, 610
  secondary structure prediction, 513–14, 513F
  vs other clustering methods, 643B
self-organizing tree algorithms (SOTA), 648–9, 648F
  evaluating validity of clusters, 651
  gene expression microarray data, 610, 610F
semiglobal alignment, 132F, 133
semi-Markov model, 374
sense strand, 11–12
sensitivity (Sn)
  exon prediction, 343, 392B
  gene prediction at nucleotide level, 365–6B
separating hyperplane, 661, 662, 662F
sequence alignment, 71–112, 72MM
  see also global alignments; local alignments
  applications, 72
  detection of homology, 74–6
  genome sequences see genome sequence alignments
  homology modeling, 543–4, 544F, 558–9
  inserting gaps, 85–7
  multiple see multiple alignment
  optimal see optimal alignments
  pairwise see pairwise alignment
  principles, 72–6, 73FD
  progressive, 198, 204–6, 205F
  scores see scores (alignment)
  scoring see scoring schemes/matrices
  searching databases see searching sequence databases
  suboptimal, 76
  substitution matrices, 81–5
  types, 87–93, 88FD
sequence analysis, 71, 72MM
  evolutionary conservation and, 38
sequence databases, 55–8
  automated data analysis, 64–5
  gene prediction using, 334–6
  nonredundancy, 62–3
  searching see searching sequence databases
  selecting, 102–3
sequence length
  compositional complexity and, 151B
  homology modeling and, 540–1, 542F
  substitution matrix choice and, 85
sequence motifs see motifs
sequence ontology project (SOP), 55
Sequences Annotated by Structure (SAS), 103
sequence similarity see similarity, sequence
sequence–structure correlations, 487–8, 487F
sequence-to-structure networks, 432, 499–500, 500F
serial analysis of gene expression (SAGE), 604–5, 604F
serine proteases, 570
serotonin N-acetyltransferase, 421F
  secondary structure prediction, 423F
SH2 domains, 78B, 571, 572F
  Cbl protein, 575, 576F
  dot-plot assessment, 77F, 78
  identification, 576–80
  searching sequence databases, 98–100
  sequence alignments, 92, 93F
SH3 domains, 529, 530F
Shannon entropy, 695–6
Shigella flexneri, 262
Shine–Dalgarno sequence, 19, 373
shotgun genome sequencing procedure, 376B
SH test, 311
shuffle test, 534
Sibbald, Peter, 171
side chains, amino acid see amino acid side chains
sigma factors (σ), 339
signaling pathways, 110
  modeling interactions, 681–3, 682F
  network models, 678
signal peptide, 508–9
signal sequences, protein localization, 111, 111B
significance, statistical, 653
SigPath, 692
silent states, 180, 181, 183–4, 184F
similarity, sequence, 74
  dot-plots for assessing, 77–8, 77F
  gene prediction using, 334–6
  homology modeling and, 539–40, 541F
  percent, 80–1
  percent identity for quantifying, 76–7
  scoring, 80, 81
  secondary structure prediction, 488–90
Simon, István, 506–7
SIMPA96 scoring method, 488, 490, 491
simple sequences, 151–2B
  see also low-complexity regions
simplex, 711, 712F
SIM program, 554
simulated annealing, 528–9
  function optimization, 719
single linkage clustering, 640, 641F
singleton sites, 298
Sippl, Manfred, 534, 706
Sippl test, 534
Sjögren–Larsson syndrome (SLS), 351, 351B
Sjölander, Kimmen, 174, 174F
SLAGAN program, 158F, 159
SLIM matrices, 84
small ribosomal subunit rRNA, 249
Smith, Randal, 214
Smith, Temple, 214
Smith–Waterman algorithm, 88–9, 136–7
  database search programs using, 95, 97, 145–6
  discarding intermediate calculations, 138B
  vs PSI-BLAST, 178T
Söding, Johannes, 195F, 196
sodium dodecyl sulfate (SDS), 613
softmax, 495–6
Sokal, Robert, 278
solvation potential, 533
solvents
  see also water molecules
  omission from energetics calculations, 700
  potential terms relating to, 526–7, 707–8
SOMs see self-organizing maps
SOSUI program, 442F, 443, 444F, 447
SOTA see self-organizing tree algorithms
Sov, 417, 419, 419F
  compared to Q3, 470T, 471–2
  derivation, 470–2
  different methods compared, 422, 424T
  GOR method, 423
  range of values, 469F, 472
spaced seed method, 158–9
spacer unit, 496, 500
speciation duplication inference (SDI), 293
speciation events, 226, 239, 242F
species
  reconstructing evolution, 249
  specific databases, 103
species (phylogenetic) trees, 225–30, 227F, 229F
  combined with gene trees, 243, 244F
  effects of gene loss/missing gene data, 242–3, 243F
  orthologous genes for constructing, 239–47, 242F
  vs gene trees, 230, 231F
specificity (Sp)
  exon prediction, 343, 392B
  gene prediction at nucleotide level, 365–6B
spliceosomes, 18
SplicePredictor program, 393–4
splice sites, 18–19
  detection, 337–8, 338F, 379–81, 390
    theoretical basis, 392–6, 395F
  donor and acceptor, 18F, 380F, 392
  variability, 379, 380F
splice variants, 380–1
SpliceView program, 338, 339F
splicing, RNA see RNA splicing
splits
  assessing accuracy, 309
  differences between two trees, 289B
  multiple alignment guide trees, 206, 206F
  phylogenetic trees, 231–2, 232F
Src-homology domains see SH2 domains; SH3 domains
SSAHA program, 158
SSEARCH program, 96T, 97, 100
SSPAL method, 489, 490, 490F, 491
SSpro method, 504, 505F
standard deviation, 652, 653F
  dealing with lack of replicates, 657B
Stanford Microarray Database (SMD), 58, 60F, 611
star decomposition, 285–6
start codons, 13, 19, 318, 367
  E. coli, 366F, 367
  predicting correct, 327, 330F, 333–4, 389
star tree, 200F, 201
start state, 179, 182–3, 183F
states (hidden Markov models), 179, 180, 181
state variables, 679–80
statistical methods
  secondary structure prediction, 414, 415F, 421–30
  transmembrane protein prediction, 443
statistical tests, 625, 626MM, 651–62, 651FD
  importance of variance, 652, 652F
  multiple, controlling error rates, 657–9, 658T
  nonparametric, 656–7
steady state, 690
steepest descent method, 528, 711–13, 713F
step-down Holm method, 658
Stephens, Michael, 178
step-up Hochberg method, 659
stepwise addition, 285–6
steric hindrance, 32
Sternberg, Michael, 206
stop codons, 12T, 13, 19, 318, 367
  detection, 389
Streptococcus protein G, 484F
Streptomyces coelicolor, 643B
strict consensus trees, 234–5, 234F
STRIDE program, 417
STR matrix, 84
Structural Bioinformatics Protein Databank see Protein Data Bank
structural databases, 59–61
  automated data analysis, 64
  checking for data consistency, 63–4
structure, protein see protein structure
Structured Query Language (SQL), 49–50
structure–function relationships, 567–93, 568MM
  docking methods and programs, 587–93, 588FD
  finding binding sites, 580–7, 581FD
  functional conservation, 568–74, 568FD
  structure comparison methods, 574–80, 575FD
structure-to-structure network, 432, 499
Student’s t-distribution, 654, 655
suboptimal alignments, 76, 135–9, 137F
substitution groups, 213
substitution matrices, 81–5
  see also BLOSUM matrices; PAM matrices
  evolutionary models and, 276
  position-specific scoring matrices and, 168–71
  selection of appropriate, 126
  theoretical background, 117–27, 117FD
  threading, 532
subtilisin, 243–4, 244F
subtree pruning and regrafting (SPR), 289B, 290, 290F
subtrees, 230
subunits, protein, 27, 42–3
suffix, 142
suffix trees, 141–3, 143F
  whole genome sequences, 158
sum-of-pairs (SP), scoring multiple alignments, 200F, 201
superfamilies, 259, 259B
  phylogenetic tree reconstruction, 259–63, 261F, 263F
  protein fold libraries, 573
superkingdoms, 21
supersecondary structures, 40B, 529
supervised learning, 497B, 638
support vector machines (SVMs)
  sample classification, 661–2, 662F, 663F
  secondary structure prediction, 511–12, 512F, 513F
survivin, 583, 583F
S-values
  branch-and-bound method, 288
  maximum-likelihood methods, 287
  minimum evolution method, 297
  optimizing tree topologies, 288, 290, 291, 291F, 293
  parsimony methods, 287, 293, 297–9, 301
  starting trees, 286
SWISS-2D-PAGE, 620
Swiss Institute for Bioinformatics (SIB), 620
Swiss-Model, 552, 554, 561–3, 562F
Swiss-Pdb Viewer, 542, 557–60, 558F, 559F, 562–3
Swiss-Prot database, 54, 56–8, 59F, 102–3
  manual annotation, 65
  pattern and motif searching, 105, 106–8
  searching, 98–100, 99F, 101F
  vs PSI-BLAST, 178T
switches, bistable, 688–9, 689F
symmetric difference, 289, 289B, 291
SYM model, 253T
synonymous mutations, 238, 240–1B, 245
syntenic regions, 248, 403–4, 404F
systematic errors, 625, 627–8
systems, biological, 669–78, 669FD
  see also networks
  bistable switches, 688–9, 689F
  concept, 669–70, 670F, 671F
  control circuits, 680, 680F
  information needed to construct, 671–4
  mathematical modeling approaches, 674–7, 676F
  mathematical representation of interactions, 680–3
  modularity, 685–6
  network properties, 670–1
  redundancy, 686–8
  robustness, 683–9, 684FD
  standardized description, 692
  storing and running models, 689–92, 689FD
systems biology, 667–93, 668MM
  model types used, 678
  structure of model, 679–83, 679FD
  system properties, 683–9, 684FD
  Web-based tool and databases, 671–2, 675T
Systems Biology Markup Language (SBML), 692
T
Tamura-Nei (TN) model, 253T
target protein, 527
  alignment with template, 543–4, 544F
  finding structural homologs, 543, 557
  similarity to template, 539–40
TATA-binding protein (TBP), 17
TATA box, 17, 383
  Bucher weight matrix, 383, 384, 384F
  detection, 383–7, 389
  genes lacking, 381, 383
  GenScan prediction method, 385, 385F
  NNPP prediction method, 385–6, 386F
taxa, 225
Taylor, Willie, 276
tblastx, 96, 150
T-Coffee program, 203, 204F
temperature
  biological systems, 679–80
  molecular dynamics simulations, 718
  simulated annealing, 529, 719
template protein, 527, 542–3
  alignment with target, 543–4, 544F
  locating, 543, 557
  similarity to target, 539–40
terminator signal, 16
tertiary contact (TC) measure, 491–2, 492F
tertiary protein structure, 27, 27F, 40–2
  see also protein folds
  analyzing function from see structure–function relationships
  experimental methods of determining, 521
  modeling see modeling protein structure
  visualization and computer manipulation, 38–9, 39F
test dataset, 416–17
test statistic, 652, 653F
tetramers, 43
thermodynamic simulation, and global optimization, 715–19, 715F
thermodynamic stability, folded proteins, 41–2
thiamine diphosphate (TDP), 259B, 260
Thornton, Janet, 276, 475
THREADER program, 707
threading (fold recognition), 523–4, 529–37, 530FD
  assessing confidence of prediction, 534–5, 535F
  dynamic programming methods, 533–4, 534F
  libraries of protein folds, 531
  potentials used, 706–8
  practical example, 535–7, 536F, 537F
  procedure, 530–1, 531F
  pseudo-energy functions, 527
  scoring schemes, 531–3
three-dimensional protein structure see tertiary protein structure
thymine (T), 6, 6F
Tie, Jien-Ke, 449B
TIM barrel folds, 570, 570F, 573F
  differing functions, 570, 572F
time-delay neural network (TDNN), 385–6, 386F
TMAP program, 442F, 444, 447
TMbase, 443
TMHMM server, 446, 446F, 447F, 507–9
  assessing accuracy, 471F, 472
  comparative results, 442F
TMpred program, 442F, 443
Toll-like receptor, 608
top-down approach, modeling biological systems, 676–7, 677F
topological families, 573F, 574
topological models, 678
TopPred program, 441, 442
torsion angle potential, 703, 703F
torsion (dihedral) angles, 29–33
  amino acid side chains (χ1, χ2, etc), 547, 548F
  Cα chain (φ, ψ), 29–32, 32F
    ideal β-strands, 36F
    Ramachandran plots, 33, 34F
    secondary structure prediction, 417, 466, 466F, 503–4, 504F, 505F
  improper, 703
  peptide bond (ω), 31–2, 32F
traceback, 132, 136, 138B, 300
training, neural networks, 496, 497–8B
training dataset, 416–17
trans conformation, 32, 33F
transcription, 11–12, 11F
  regulation, 15–18, 16F, 17F
  stop signals, detection, 389
transcription (initiation) factors
  binding sites, 381, 386
    detection algorithm, 386–7
  general, 17
  leucine zipper, 413, 451
transcription start site (TSS), 15–16, 16F, 17F
  prediction, 338–9, 340, 381–9
transcriptome, 600
transfer function see response function
transfer RNA (tRNA), 13
  base modifications, 7
  function in translation, 13–14
  gene detection methods, 320–1, 320F, 361–3
  secondary structure prediction, 457F, 458
  structure, 14F
transition mutations, 237–8, 238F
transitions, hidden Markov models, 179, 180, 181, 181F
transition/transversion ratio (R), 237–8
  calculation, 274–5B
  weighted parsimony method, 300, 300F
translation, 13–14, 14F
  control, 19–20
  genetic code, 12–13, 12T
  predicted exons, 343, 344F, 345, 345
  start sites, prediction, 389
  stop signals see stop codons
translation initiation factor 5A (1BKB), 421F
  secondary structure prediction, 422F
TRANSLATOR program, 345
translocation, 158F
transmembrane β-barrels, prediction, 448–50, 450F, 508F, 509
transmembrane helices, 436
  amino acid propensities, 475–6, 478F
  helical wheel diagrams, 439F, 440–1
  length distribution, 468, 468F
  prediction, 439–48
    algorithms available, 441–7
    assessing accuracy, 471F, 472
    based on residue propensities, 477–8, 479, 479F
    comparing results, 447–8
    example, 449B
    hidden Markov models, 506–9, 507F
    using evolutionary information, 444–5
  three-dimensional structure, 440F
transmembrane proteins, 435, 436–51
  7-transmembrane spanning superfamily, 436B
  bitopic and polytopic, 437, 437F
  functional importance, 436B
  hydrophobicity scales and, 437–8
  prediction, 438–51, 438FD
    example, 449B
    hidden Markov models, 506–9
  structural elements, 437T
transmissible spongiform encephalopathies, 101B
transport systems, 669–70, 670F
transposons, 22B, 336, 337B
transversion mutations, 237–8, 238F
transversion parsimony, 300
tree bisection and reconnection (TBR), 289B, 290–1
tree methods, multiple alignment, 90–1, 90F, 200–1
tree of life, 20–3, 20F, 21F, 38F
  horizontal gene transfer within, 246F, 247
  origins, 292B
tree topologies, 227–8, 228B
  comparing, 232–5, 233F, 234F
  describing, 230–2, 232F
  evaluating, 293–307, 294FD
  generating initial, 285–6
  generating multiple, 286–93, 287FD
  interior branch examination, 309–10
  measuring difference between two, 289B
TrEMBL, 102–3
tricarboxylic acid (TCA) cycle, 685, 686F, 687F
trimers, 43
tRNA see transfer RNA
tRNAscan algorithm, 321, 361–2, 362F, 363F
tRNAscan-SE algorithm, 362–3
TSSG algorithm, 340, 341T
TSSW algorithm, 340, 341, 341T
t-statistic, 654, 655
t-test, 654–5, 656T
  modifications, 657–9
tumors
  invasion, mathematical modeling, 676–7, 677F
  sample classification, 662, 663F
turns, 36–7, 37F
  see also β-turns
  amino acid preferences, 37
Tusnády, Gábor, 506–7
twilight zone, 81
TWINSCAN program, 331T, 332T, 336–7
two-dimensional (2D) gel electrophoresis, 600, 613–20
  see also protein expression
  analysis of data, 614–20
    clustering, 615–17, 617F, 618F
    differential protein expression, 615, 616F, 617F
    measuring expression levels, 614–15
    principal component analysis, 618, 619F
  identification of separated proteins, 621–3, 622F
  spot detection, 614, 614F
  technique, 613–14, 613F
two-hit method, 149
two-tailed test, 653, 653F
type I error, 653, 658
U
ubiquitin ligases, 575
UGA codon, 23
ultrametric trees, 229–30, 229F
UniGene database, 103, 605–6, 605F
UniProtKB, 56–8, 65
units
  see also nodes
  neural network, 430–1, 494–5, 495F
unrooted trees, 227, 227F
  generation, 286–91
unsupervised learning, 638, 644
untranslated regions (UTRs), 325F, 379
  detection, 390, 396–7
unweighted parsimony, 297–300, 299F
UPGMA method, 199, 250, 251T, 608, 639
  practical application, 256–8, 258F
  theoretical basis, 278–9, 279F, 640
  vs Fitch–Margoliash, 280
UPGMC method, 640
upstream sequences, 16
URL, 53
UTRs see untranslated regions
Uzzell, Thomas, 270
V
van der Waals interactions, 32B
van der Waals terms, 705
variable region, 555B
variance, 626, 652, 653–4
  importance in statistical testing, 652, 652F
Vector Alignment Search Tool (VAST), 577–8, 579F
Venn diagram, amino acid conservation, 426, 428F
Venter, J. Craig, 376B
Viagra, 589B
virtual heart project, 677
virulence factors, 341–2
viruses, 21
  overlapping genes, 12, 360
  sequenced genomes, 324T
VISTA program, 353–4, 353F, 354F
vitamin K epoxide reductase (VKOR), 449B
Viterbi algorithm, 188–9
von Bertalanffy, Ludwig, 667
von Heijne, G., 441, 442
W
Waddell, Peter, 296, 296F
Waterman, M.S., 136, 154
water molecules, 700
  see also solvents
  ligand–protein docking and, 592–3
Watson, James, 7
Watson–Crick base-pairing, 7–9, 8F
weight matrices
  Bucher, 383–4, 384F
  splice site prediction, 394
weight sharing, neural networks, 500–1, 501F
Welch’s t-test, 655
WHAT_CHECK program, 549–50
WHAT-IF program, 549, 551T
whole-genome alignment, 156–9, 157FD
  see also genome sequence alignments
Wilcoxon test, 656–7
Wilkins, Maurice, 7, 7F
windows (sequence), 476–9
  GOR methods, 422–3
  nearest-neighbor methods, 428, 486, 487F, 489
  neural network methods, 431
  support vector machines, 511
winner takes all strategy, 495
wobble base-pairing, 14
Woese, Carl, 249
Wood, Valerie, 405
words, 95, 141
WormBase, 399
Wu-BLAST, 95
Wunsch, C.D., 87, 128
X
X chromosomes, mouse and rat, 403–4, 403F
X-drop method, 139F, 140–1, 140F
xenologous genes, 247
XHTML (eXtensible hypertext markup language), 50–1
XML (eXtensible markup language), 50–1
xProfiler, 605
Xquery, 51
X-ray crystallography, 411, 521
X-SITE program, 591, 592F
Y
YASPIN, 509, 509F
YBL036C hypothetical protein (1CT5), 421F
  secondary structure prediction, 423F
Yi, Tau-Mu, 488, 491
Yona, Golan, 195
Z
Zmasek, Christian, 293
Zpred program, 425–7, 484, 485
  accuracy, 424T
  amino acid properties used, 426, 428F, 429T
  conservation values, 426, 427F, 428F, 429T
z-statistic, 577, 578F, 654
z-test, 309, 653–4
Zvelebil conservation number, 426
Zviling hydrophobicity scale, 477T