The genome race is on the road


... amounts of data that will be generated over the course of the project. Low-resolution genetic (2-5 cM) and physical maps of the whole human genome, and completion of several pilot sequencing projects (both human and model genomes), should also be achieved by 1995. Genetic and physical maps of simple model genomes should be completed, including a high-resolution map (1-2 cM) of the mouse genome.

The second phase (1995-2000) should see further improvement in technology, which will enable production of refined physical and genetic (1-2 cM) maps of the human genome, large-scale human sequencing projects, and completion of sequencing of simple model genomes.

By 2005, the sequencing of the human genome (with the exception of tandem repeat regions) is scheduled to have been completed. During this final stage, a major increase in terms of understanding human disease should result from analysis of human polymorphisms, and knowledge of the complete sequence of medically important areas of the human genome and corresponding regions of model genomes. By 2005, the integration of databases to create a unified information source that can correlate sequence and map to known biological characteristics should have been achieved.

Whether these goals can be achieved on time will depend very much on the rate at which the required technology develops, as well as on co-operation between, and financial commitment from, all participants.

What does HGP involve?

Mapping - finding the way

Maps are linear representations that describe the organization of a set of landmarks, using a defined system of measurement based on co-ordinates.

Physical maps can either be cytogenetic or molecular in basis. Cytogenetic maps order genetic loci based on their relative position and order along a chromosome, often in relation to the banding patterns obtained after differential staining. Fluorescence in situ hybridization (FISH) (see J. Korenberg, pp. 27-32 in this issue of TIBTECH) is a major source of input data for this type of map. Molecularly based physical maps use landmarks based on sequence features such as restriction-endonuclease cleavage sites, sequence-tagged sites (STSs) or single-copy probes (SCPs). STSs, short, single-copy DNA sequences that can be detected using the polymerase chain reaction (PCR), have been advocated as providing a common language between all types of mapping: a region of DNA can be mapped by determining the order of a series of STSs and measuring the distances between them.

Long-range physical maps are usually constructed by measuring the length of large DNA fragments which carry a DNA marker, using pulsed-field gel electrophoresis (PFGE) separation. These fragments are generated using rare-cutter restriction endonucleases (i.e. enzymes that cleave DNA only infrequently). Cloned DNA in the form of yeast artificial chromosomes (YACs), cosmids or phage vectors carrying genomic inserts can then be characterized in terms of their content of physical markers, such that their ordered relationship is established and groups of overlapping clones or 'contigs' are generated.

'STS-content mapping' of clones involves screening by PCR to detect which clones contain particular STSs, and then using the presence of other STSs to order and define the overlaps between clones. The map must be in a form which provides clear and easy access for the entire mapping community, and it seems likely that the idea of using STSs, which is compatible with strategies whereby probes and primer pairs are aligned along YAC contigs (see below), will be adopted.
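The STS-content idea can be illustrated with a short sketch. This is only a toy illustration, not a published mapping algorithm: the clone and STS names and the helpers `shared_sts` and `order_clones` are hypothetical, and the sketch assumes error-free PCR results and a single unbranched contig.

```python
from itertools import combinations


def shared_sts(clones):
    """Return {(a, b): STSs present in both clones} for every overlapping pair."""
    overlaps = {}
    for a, b in combinations(sorted(clones), 2):
        common = clones[a] & clones[b]
        if common:
            overlaps[(a, b)] = common
    return overlaps


def order_clones(clones):
    """Greedy chain through the overlap graph (hypothetical helper).

    Starts at a clone with a single neighbour (a contig end) and keeps
    stepping to the unvisited neighbour that shares the most STSs.
    """
    if not clones:
        return []
    neighbours = {name: {} for name in clones}
    for (a, b), common in shared_sts(clones).items():
        neighbours[a][b] = len(common)
        neighbours[b][a] = len(common)
    ends = [n for n in sorted(clones) if len(neighbours[n]) == 1]
    current = ends[0] if ends else sorted(clones)[0]
    path, seen = [current], {current}
    while True:
        candidates = [(weight, name)
                      for name, weight in neighbours[current].items()
                      if name not in seen]
        if not candidates:
            return path
        current = max(candidates)[1]  # most shared STSs; ties broken by name
        path.append(current)
        seen.add(current)


# Made-up PCR screening result: which STSs were detected in which clone.
example = {
    "yacA": {"sts1", "sts2"},
    "yacB": {"sts2", "sts3", "sts4"},
    "yacC": {"sts4", "sts5"},
}
print(order_clones(example))
```

Real data contain false positives, false negatives and chimaeric clones, so a practical implementation would need to tolerate inconsistent overlaps rather than follow a single greedy chain.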

Genetic linkage maps represent the genetic distance between markers: linkage maps base distance on centimorgan (cM) units, a measure of the frequency of recombination between the specified genes. Genetic distance does not have a constant relationship with physical distance, and, whereas physical distances are additive, genetic distances are not. Although the order of loci in physical and genetic maps will be the same, there is no easy way of converting from one to the other. A map unit of 1 cM indicates a 1% chance of recombination, but recombination frequencies vary with chromosomal region and sex, as well as with the genome size.
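One way to see why observed recombination frequencies do not simply add is to work through a map function. The sketch below assumes Haldane's map function (no crossover interference), which the article itself does not name; the function names are illustrative only.

```python
import math


def haldane_cm(r):
    """Map distance in cM for a recombination fraction r (0 <= r < 0.5),
    assuming Haldane's model of no interference."""
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * r)


def recombination_fraction(d_cm):
    """Inverse of haldane_cm: expected recombination fraction at d_cm."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))


def combine(r_ab, r_bc):
    """Recombination fraction between A and C for loci ordered A-B-C,
    assuming independent crossovers in the two intervals."""
    return r_ab + r_bc - 2.0 * r_ab * r_bc


r_ab = r_bc = 0.20
r_ac = combine(r_ab, r_bc)
print(r_ac, r_ab + r_bc)                   # observed fractions do not add
print(haldane_cm(r_ac), 2 * haldane_cm(r_ab))  # converted distances do
```

Under this assumption the recombination fractions for adjacent intervals sum to more than the fraction observed across both, whereas the converted map distances are additive. For small values the two scales nearly coincide, which is why 1 cM is commonly read as a 1% chance of recombination.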

The initial phase of the HGP (1991-1995) has focused on establishing human maps - both physical and genetic - of landmarks such as polymorphisms, genes and specific DNA sequences. The process of establishing these maps in conjunction with sequencing efforts will yield the start of a composite DNA sequence.

Sequencing and mapping large genomes are inextricably linked, whether a 'top-down' or 'bottom-up' approach is considered. The top-down strategy starts at the level of intact genomes, and proceeds, for example, by the separation by PFGE of large DNA fragments, the physical and genetic linkage of DNA markers, and the construction of long-range maps. The construction of these maps leads to the generation of cloned regions of DNA, which can then be sequenced. The bottom-up strategy, on the other hand, starts at the level of nucleotide sequence in a large number of random cosmid or bacteriophage clones, and proceeds by assembling longer-range sequence by identifying overlapping sets of cloned sequences (contigs).

Quite a range of techniques have been developed for linking individual clones. Contigs can be assembled simply by identifying overlap between clones at the level of sequence. Alternatively, overlapping clones can be identified by 'fingerprinting' methods. These can depend on identifying common restriction enzyme cleavage patterns in the overlaps, or identifying sequence overlap through the use of hybridization probes (e.g. repetitive sequences, STSs, synthetic oligonucleotides or whole clones).

Repetitive screening of large libraries of clones with many different probes can be a tedious and time-consuming process, and a range of methods have been proposed for arranging the clones at high density in two-dimensional arrays to form a matrix prior to screening. Examples include: the ordering of phage or cosmid clones by detecting identifying sequences with oligonucleotide probes; multiplexing (ref. 7), the simultaneous analysis with many probes of a whole library of cosmid clones (the pairwise comparison of data generated by the mixed probes can be decoded using algorithms to predict the order and linkage of all clones in the collection into contigs); and the sequential screening of complex libraries with 50-100 oligonucleotide probes, followed by computational analysis of the similarities of hybridization characteristics of each clone to order the clones. These techniques are amenable to automation, which should help to speed up considerably the process of ordering and mapping clones.

TIBTECH JAN/FEB 1992 (VOL 10)
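The restriction-fingerprint approach to detecting overlaps described above can be sketched as follows. This is a simplified illustration under stated assumptions: each clone is represented only by a list of restriction-fragment sizes, two fragments are taken to match if they agree within a relative tolerance, and the thresholds (`tol`, `min_shared`) and clone names are hypothetical rather than taken from any published method.

```python
def shared_fragments(a, b, tol=0.02):
    """Count fragments of clone a that match a distinct fragment of clone b
    within a relative size tolerance (assumed value, for illustration)."""
    remaining = sorted(b)
    count = 0
    for size in sorted(a):
        for i, other in enumerate(remaining):
            if abs(size - other) <= tol * max(size, other):
                count += 1
                del remaining[i]  # each fragment of b may be used once
                break
    return count


def build_contigs(fingerprints, min_shared=3):
    """Group clones into contigs: connected components of the graph whose
    edges join clones sharing at least min_shared fragments."""
    names = sorted(fingerprints)
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if shared_fragments(fingerprints[a], fingerprints[b]) >= min_shared:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values())


# Made-up fragment sizes (bp) for four clones.
fingerprints = {
    "c1": [1200, 3400, 560, 7800, 910],
    "c2": [3400, 560, 7800, 2300, 150, 4100],
    "c3": [2300, 150, 4100, 660],
    "c4": [9900, 80, 1750],
}
print(build_contigs(fingerprints))
```

The sketch only groups clones; ordering them within a contig, and estimating the chance that fragments match by coincidence, would need further work.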

In addition to the ordering of random libraries, mapping also involves the directed search from one clone to clone the sequence corresponding to immediately adjacent insert DNA, i.e. 'chromosome walking'. Two major approaches can be used for walking from one clone to the next: (1) screening clone collections using end-clones of vector-inserts as hybridization probes, or (2) PCR techniques that remove the need for end-cloning (end-sequencing of inserts to design PCR primers, inverse PCR, Alu-Alu PCR, and Alu-vector PCR) (see Glossary, and A. Rosenthal, pp. 44-48 in this issue of TIBTECH).

A problem which persists is that, whether a top-down or bottom-up approach provides the starting point for a project, there is a gap in terms of scale between the resolution that can be obtained by genetic-mapping techniques and the maximum length of contig sequence that can be assembled - as attributed to David Botstein, the gap between 'the cosmid and the centimorgan'.

The standard recombinant-DNA cloning technology was limited until recently to small plasmid, cosmid and viral constructs - able to carry exogenous DNA inserts of not more than 50 kb. Although such systems are suitable for the manipulation and analysis of genes and small gene clusters where the information is tightly packed, most genes from higher organisms span great distances of DNA. Standard techniques may be used for the cloning of such genetic material in a large number of overlapping clones. The subsequent use of such clones, however, is unwieldy, error-prone and time-consuming. Cloning systems which enable greater lengths of sequence to be included in fewer clones improve efficiency and accuracy as well as the continuity of the final map.

The development of YAC vectors, which are capable of carrying much larger inserts and can be assembled into contigs spanning megabase lengths, but can also be manipulated by conventional molecular biology techniques to obtain the nucleotide sequence, seems to be closing the gap. The centimorgan represents the resolution obtainable by linkage-mapping studies, and on average corresponds to approximately 1 Mb of the human genome. Until the development of YACs, contigs derived from phage or cosmid clones usually only spanned 100-200 kb; individual YACs can carry several hundred kilobases of insert, such that YAC contigs approach the megabase range (see R. Anand, pp. 35-40 in this issue of TIBTECH). Are YACs the answer to all cloning needs then? Probably not, since for many manipulations, such as sequencing, there is a need to subclone into vectors such as phage or cosmids to obtain more manageable-sized fragments. In addition, YAC-based mapping is not without its problems: multiple fragments of chromosomal DNA may be co-cloned within the same YAC, and certain chromosomal regions appear to be inherently unclonable, yielding clones which are deleted for some segments of DNA.
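The effect of insert size on the number of clones that must be handled can be estimated with a short calculation. The sketch below uses the standard library-size estimate N = ln(1 - P) / ln(1 - f), where f is the fraction of the genome carried by one insert and P is the desired probability that a given sequence is represented; this formula is not given in the article. The genome size (3 x 10^9 bp) and the insert sizes (40 kb for a cosmid, 300 kb for a YAC) are assumed round figures chosen to be consistent with the ranges quoted above.

```python
import math


def clones_needed(genome_bp, insert_bp, probability=0.99):
    """Estimated number of random clones needed so that any given sequence
    is present with the stated probability: N = ln(1 - P) / ln(1 - f)."""
    fraction = insert_bp / genome_bp
    return math.ceil(math.log(1.0 - probability) / math.log1p(-fraction))


GENOME_BP = 3e9   # assumed human genome size
COSMID_BP = 40e3  # assumed cosmid insert size
YAC_BP = 300e3    # assumed YAC insert size

cosmids = clones_needed(GENOME_BP, COSMID_BP)
yacs = clones_needed(GENOME_BP, YAC_BP)
print(cosmids, yacs, cosmids / yacs)
```

With these assumptions the YAC library is several-fold smaller than the cosmid library for the same coverage probability, in line with the argument that larger inserts mean fewer clones to order.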

Genome-sequencing projects will probably proceed by integrating the capabilities of a range of vector systems (see P. Little, pp. 33-35 in this issue of TIBTECH). Mapping genomic spans in the megabase-plus range will continue to involve the use of PFGE and rare-cutter enzymes, though other new techniques which can link cytogenetics and molecular biology are being developed. An example is the use of laser 'microtechnology' for cleaving whole chromosomes visible under the microscope, and then physically manipulating the resulting fragments, enabling the cloning of megabase-sized fragments from defined chromosomal regions (see K. O. Greulich, pp. 48-51 in this issue of TIBTECH).

Sequencing technology

Sequencing technologies will need to improve dramatically in speed and capacity to handle the inevitable increase in DNA-sequencing activity associated with the world-wide genome initiative. Current methodologies are just not suited to large-scale sequencing projects. Until about five years ago, practical DNA-sequencing techniques were limited to the manual radioactive methods described by Sanger, and by Maxam and Gilbert. The introduction, in 1986, of a fluorescence-based modification of Sanger dideoxy sequencing enabled the DNA sequencing steps to be automated and computerized analysis of the gel-based information to be introduced; these were subsequently developed into commercially available systems.

Motivation for developing novel sequencing approaches is strong - sequencing is a tedious and slow process. The two basic approaches to speeding up the progress are increased automation (robotics) and developing alternative technologies or new chemistries. New technologies, such as solid-phase sequencing systems (see M. Uhlen, pp. 52-55 in this issue of TIBTECH), offer the opportunity for semi-automation. Radically different new approaches, which are based on novel sequencing reactions as well as altered analytical methods, such as the flow-cytometry sequencing devised by J. Harding (pp. 55-57 in this issue of TIBTECH), may not be implemented for some time. The introduction of capillary electrophoresis and other 'microsequencing' technology may permit automation to follow a different route and thereby facilitate increased throughput. Although reduced sample volumes save on costs by reducing the amount of template and reagents required for sequencing, automation of such sequencing approaches may not prove to be straightforward. Many current 'automated' procedures do not extend beyond gel-reading equipment, and attempts to increase automation and integrate different stages of the process are hampered by the sequencing chemistry, which was developed for small-scale projects. Thus, for advances in automation to succeed, they will probably need to be developed largely in conjunction with new sequencing technologies.

A range of techniques has been proposed for DNA sequencing using hybridization techniques (depending on the observation that mismatched and mismatch-free hybridization of oligonucleotides of 8-20 bases can be distinguished): sequencing by hybridization (SBH), fragmentation sequencing (FS) and oligonucleotide hybridization sequencing (OHS). All these techniques are based on the idea that the sequence of long fragments of DNA or even entire genomes could be built up from overlapping shorter sequences; all obviate the need for the enzymatic or chemical reactions and electrophoresis steps which are part of traditional sequencing methodologies.
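The principle of building a long sequence from overlapping shorter ones can be sketched in code. The example below is a toy illustration, not any of the cited methods: it assumes an error-free list of all k-base oligonucleotides that hybridize (the 'spectrum'), and it assumes that no (k-1)-base stretch occurs twice, so that each oligonucleotide has at most one possible successor. The function names are hypothetical.

```python
def spectrum(seq, k):
    """All distinct k-base subsequences of seq (idealized hybridization data)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def reconstruct(kmers):
    """Rebuild a sequence by chaining k-mers that overlap by k-1 bases.

    Raises ValueError when the spectrum is ambiguous (repeats) or does not
    form a single chain, which is where real data become difficult.
    """
    kmers = set(kmers)
    by_prefix = {}
    for kmer in kmers:
        if kmer[:-1] in by_prefix:
            raise ValueError("ambiguous spectrum: repeated (k-1)-mer")
        by_prefix[kmer[:-1]] = kmer
    suffixes = {kmer[1:] for kmer in kmers}
    starts = [kmer for kmer in kmers if kmer[:-1] not in suffixes]
    if len(starts) != 1:
        raise ValueError("cannot identify a unique starting k-mer")
    current = starts[0]
    sequence, used = current, {current}
    while current[1:] in by_prefix:
        current = by_prefix[current[1:]]
        if current in used:
            raise ValueError("cycle in spectrum")
        used.add(current)
        sequence += current[-1]  # each step contributes one new base
    if used != kmers:
        raise ValueError("spectrum does not form a single chain")
    return sequence


target = "ATGGCGTGCA"
print(reconstruct(spectrum(target, 4)) == target)
```

Repetitive sequence breaks the uniqueness assumption, which is one reason longer oligonucleotides or additional information would be needed for real genomes.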

An even more ambitious idea is the direct sequencing of DNA under the microscope. Recent advances in scanning tunnelling microscopy (STM) have enabled atomic resolution images of DNA to be obtained, and, coupled with single-molecule spectroscopy, could possibly be used for DNA sequencing.


editorial

Genome Mapping and Sequencing Glossary

Alu sequence - Member of a family of interspersed repetitive dimeric DNA sequences (~300 bp long) in the human genome. Alu sequences form the major family of human short interspersed repeat sequences and are dispersed throughout the genome, with an average of 4 kb between copies.

Alu-PCR - The use of PCR primers conforming to two Alu sequences in inverted orientation permits direct amplification of sequences between the Alu repeats, from complex backgrounds (such as human chromosomes in hybrid cells), as well as from YAC, cosmid or lambda clones - this technique is known as Alu-Alu PCR. A related technique, Alu-vector PCR, also uses PCR primers, one derived from an Alu repeat and one derived from vector sequence. (See PCR.)

cDNA - Single-stranded complementary DNA. cDNA is synthesized from an mRNA template by reverse transcriptase.

cDNA cloning - Molecular cloning of the coding sequence of a gene from its transcript - does not contain any introns (non-coding sequences).

Centimorgan - (cM) - Measure of genetic distance (the distance that separates two genes between which there is a 1% chance of recombination). The corresponding physical distance is dependent on the size of the genome (e.g. Arabidopsis: 1 cM = 139 kb; human: 1 cM = 1108 kb).

Centromere - Site of chromosomal attachment to the spindle, required for accurate distribution of chromatids to daughter cells.

Chromosome banding - Differential staining of metaphase chromosomes by a variety of techniques. Aids in chromosomal identification.

Chromosome flow sorting - A method for sorting individual chromosomes based on quantification of laser-induced fluorescence of stained chromosomes.

Chromosome walking - A procedure for cloning an ordered array of overlapping contiguous regions of chromosomes, based on the systematic isolation of successive clones by using the sequence of the end of one clone to probe by hybridization for the adjacent, overlapping clone.

Cloning vector - Any DNA molecule capable of autonomous replication within a host cell into which exogenous DNA sequences can be inserted.

Contig - A group of cloned DNA sequences which are contiguous.

Cosmid - A hybrid bacteriophage cloning vector used for cloning long DNA fragments. Characterized by possessing the lambda cos site, which is required for packaging into the lambda capsid. After introduction into the host E. coli, the vector replicates as a plasmid. Cosmids are useful for constructing a gene library because only hybrid plasmids with large DNA inserts are packaged.

Cot value - Parameter used in analysis of renaturation of DNA genomes (original DNA concentration x time). Highly repetitive sequences renature at low Cot values, unique sequences renature at high Cot values.

DNA library - A store of cloned DNA in recombinant cloning vectors containing chromosome-specific fragments. May be a cDNA or genome library.

Genetic mapping - Production of a representation of the relative positioning and genetic distance separating genes, based on the frequency of recombination.

Genome mapping - Production of an ordered set of overlapping clones that cover the entire genome.

Hybridization probe - A small labelled nucleic acid molecule used to detect complementary sequences through base pairing (hybridization).

HTF islands - 'Hpa II tiny fragments' islands (also termed CpG islands). Single-copy, unmethylated loci in vertebrate genomes, containing a high density of Hpa II restriction enzyme sites, in which the dinucleotide CpG is abundant - associated with transcribed regions (i.e. genes).

Intron - Any intervening sequence in eukaryotic genes that interrupts the coding sequence (exon) and which is transcribed but processed out of the precursor RNA to yield the mature RNA - does not therefore code for a protein product.

Inverse PCR - The use of end-sequences in outward orientation as PCR primers, useful for amplifying unknown sequences adjacent to known sequences. (See Alu-PCR.)

Junk DNA - In eukaryotes, any DNA sequences that have no apparent function.

So what should be sequenced?

With the HGP well underway, the focus is turning to improving efficiency and maximizing returns (in terms of biologically and clinically relevant data) on the investment. Careful budgeting and international organization (see C. Cantor, pp. 6-8 in this issue of TIBTECH) to avoid duplication of effort is one target area. The increased automation of most aspects of genome mapping and sequencing is essential to ensure a reasonable rate of progress, since the majority of procedures call for handling of vast numbers of component samples, whether at the level of initial sequence analysis, repetitive screening, or the management of ordered reference libraries of clones. Use of the vast amounts of data being generated necessitates efficient means of storing, communicating, analysing and cross-referencing. Current technology cannot cope, and new systems are being devised (see C. Fields, pp. 58-61, and G. Cameron, pp. 61-66, in this issue of TIBTECH).

Another consideration is that the technology being applied to the HGP five years from now will make today's methods look primitive. It therefore makes sense to target those regions which will yield information of greatest immediate value (such as coding and regulatory regions), and leave the remaining sequence to a later date when it can, hopefully, be obtained with less effort. Various strategies have been proposed for identifying and isolating coding sequences from complex mammalian genomic DNA, e.g. exon amplification, whereby exons are isolated from cloned genomic DNA by selecting for functional 5' and 3' splice sites, and cDNA cloning (see C. Venter et al., pp. 9-11 in this issue of TIBTECH). Although major advances in developing machine learning and neural-network techniques for identifying functionally significant regions within 'naive' sequence are being made (see R. Mural et al., pp. 66-69 in this issue of TIBTECH), the techniques are not yet sufficiently advanced to replace a selective approach by 'blanket' sequencing.
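As a much simpler point of comparison with the computational approaches mentioned above, the glossary definition of an open reading frame (an initiation codon, a series of triplet codons and a termination codon) can be turned into a naive scan. This is a sketch only: it is not the method of any of the cited groups, it ignores introns and the reverse strand, and the `min_codons` threshold is an arbitrary assumption.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for each ATG...stop run on the given strand.

    start is the index of the A of ATG; end is one past the stop codon.
    min_codons counts the codons before the stop, including the ATG.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first initiation codon in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


example = "CCATGAAACCCGGGTAGTT"
for start, end, frame in find_orfs(example):
    print(frame, example[start:end])
```

In mammalian genomic DNA, where coding sequence is interrupted by introns, a scan of this kind finds only fragments of genes at best, which is part of why selective strategies such as exon amplification and cDNA cloning are attractive.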

Each milestone along the route of the HGP which is passed will increase our understanding of gene function and of the basis of genetic disease. Not only will the knowledge gained alter our perspective of the

Genome Mapping and Sequencing Glossary (continued)

LINE - (Long Interspersed Nucleotide Element) - Found in the chromosomal DNA of eukaryotes (~6-7 kb in length, and ~10^4 copies in mammalian genomes).

Map - Representation of the relative positions of genes or restriction sites and the distance between them. (See Genetic mapping.)

Map distance - The distance, in terms of percentage recombination, between linked genes. Map distance is measured in centimorgans (see Centimorgan).

ORF - (Open Reading Frame) - A stretch of nucleotide sequence with an initiation codon at one end, a series of triplet codons and a termination codon at the other end: potentially capable of coding for an as yet unidentified peptide or protein.

PCR - (Polymerase Chain Reaction) - A technique for the enzymatic in vitro amplification of specific nucleotide sequences between two convergent primers that hybridize to the opposite strands. The product of strand synthesis using one primer acts as template for synthesis using the other primer; repeated cycles of denaturation, primer annealing and strand synthesis result in an exponential increase in copies of the sequence bounded by the primers. Inverse PCR acts by the same mechanism to amplify DNA sequences flanking a known sequence by use of primers oriented in the reverse orientation. (See Alu-PCR.)

Primer - A short oligonucleotide which pairs with a complementary strand of DNA, providing a free 3' OH terminus for extension of the other strand.

Restriction mapping - A method for rapid mapping of large segments of DNA by identifying the positions of restriction enzyme cleavage sites, and any insertions or deletions which alter the pattern of cleavage relative to a standard sequence.

RFLP - (Restriction Fragment Length Polymorphism) - Variation in the length of restriction fragments due to the insertion or deletion of restriction sites or intervening sequence, or rearrangements which affect the length of sequence between sites. Can be used to construct genetic linkage maps (RFLP maps) to follow inheritance of specific mutations and genetic diseases. Linkage of a RFLP with a specific gene permits subsequent identification and ordering of other RFLPs that straddle the gene by observing recombination characteristics.

RFLP marker - Any marker resulting in changes in the length of genomic DNA produced by digestion with specific restriction enzymes.

Shuttle vector - Any multifunctional vector which can replicate in two or more organisms and can be used to transfer genes between organisms.

Somatic cell hybrid clone panel - A panel of hybrid cell clones used for mapping human genes. Each clone contains a unique combination of the 24 human chromosomes. By correlating the presence/absence of a particular human gene to be mapped with the clones in the panel, the chromosome location of that gene may be assigned. Mapping to a specific region of the chromosome is possible when the panel of hybrid clones contains different segments of a particular chromosome.

STSs - (Sequence Tagged Sites) - Short, single-copy DNA sequences that characterize mapping landmarks on the genome and can be detected by PCR. Advocated as 'common language' of physical mapping projects (Refs: Green, E. D. and Olson, M. V. (1990) Science 250, 94-98; Olson, M., Hood, L., Cantor, C. and Botstein, D. (1989) Science 245, 1434-1435). A region of a genome can be mapped by determining the order of a series of STSs.

Subcloning - A procedure whereby smaller DNA fragments are cloned from a large DNA fragment insert which has already been cloned.

Targeted gene transfer - The directed transfer of gene sequences to a specific site in the genome by homologous recombination between sequences in the vector and at the site of insertion in the genome.

Telomere - The sequence/structure at the molecular ends of eukaryotic linear chromosomes that stabilizes the chromosome, prevents fusion with other sequences and permits chromosome replication without loss of chromosomal sequence.

human evolutionary genetic inheritance, but the technological and biological advances made as part of the HGP initiative over the next 15 years will foster even closer links between the diverse research disciplines of biotechnology.
