Low-Cost/High-Accuracy Microbial Genome Synthesis and Monitoring

Thanks to: DARPA & DOE-GtL

Agencourt, Ambergen, Atactic, BeyondGenomics, Caliper, Genomatica, Genovoxx, Helicos, MJR, NEN, Nimblegen, Xeotron/Invitrogen

For more info see: arep.med.harvard.edu

1-Feb-2005 9:15-10 MITRE


Synthetic - homologous recombination

testing of DNA motifs

RNA Ratio (motif- to wild type) for each flanking gene: 1.3, 2.4 (1.3 in argR); 1.1, 1.3; 0.7, 2.5; 0.2, 1.4; 1.4, 3.5

Bulyk, McGuire, Masuda, Church. Genome Res. 14:201–208

Synthetic Genomes & Proteomes. Why?

• Test or engineer cis-DNA/RNA-elements
• Access to any protein (complex) including post-transcriptional modifications
• Affinity agents for the above.
• Protein design, vaccines, solubility screens
• Utility of molecular biology DNA -- RNA -- Protein

in vitro "kits" (e.g. PCR -- T7 -- Roche)

Toward these goals design a chassis:
• 115 kbp genome. 150 genes.
• Nearly all 3D structures known.
• Comprehensive functional data.

(PURE) translation utility

Removing tRNA-synthetases, translational release-factors,RNases & proteases

Selection of scFvs[antibodies] specific for HBV DNA polymerase using ribosome display. Lee et al. 2004 J Immunol Methods. 284:147

Programming peptidomimetic syntheses by translating genetic codes designed de novo. Forster et al. 2003 PNAS 100:6353

High level cell-free expression & specific labeling of integral membrane proteins. Klammt et al. 2004 Eur J Biochem 271:568

Cell-free translation reconstituted with purified components. Shimizu et al. 2001 Nat Biotechnol. 19:751-5.

in vitro genetic codes

[Figure: in vitro genetic code designed de novo. Panels: mRNA/anticodon pairing diagram (5' to 3') for a peptide containing the unnatural amino acids mS, yU and eU; codon table (second base); time course of 3H-E dpm (0-3500) vs. time (min., 30-80) for fM yU mS eU E.]

Forster, et al. (2003) PNAS 100:6353-7

80% average yield per unnatural coupling.

eU = 2-amino-4-pentenoic acid
yU = 2-amino-4-pentynoic acid
mS = O-methylserine
gS = O-GlcNAc–serine
bK = biotinyl-lysine

Escherichia coli | Mycoplasma | 3D structure
Coliphage φ29 DNA polymerase | + | +
Coliphage P1 Cre recombinase | - | +
>Coliphage Lox/Cre recombinase site | - | +
Coliphage T7 RNA polymerase | + | +
>Coliphage T7 RNA polymerase initiation site | + | +
>Coliphage T7 RNA polymerase termination site | + | +
RNase P RNA | + | -
RNase P protein | + | +
>RNase P site/RNA primer for DNA polymerase | + | +
Small subunit 16S ribosomal RNA | + | +
All 21 small subunit ribosomal proteins (1-21) | + except 1, 21 | +
Large subunit 5S ribosomal RNA | + | +
Large subunit 23S ribosomal RNA | + | +
Large subunit 23S rRNA G2445>m2G methylase: unknown | ? | -
Large subunit 23S rRNA U2449>dihydroU synthetase: unknown | ? | -
Large subunit 23S rRNA U2457>pseudoU synthetase | ? | -
Large subunit 23S rRNA C2498>Cm methylase: unknown | ? | -
Large subunit 23S rRNA A2503>m2A methylase: unknown | ? | -
Large subunit 23S rRNA U2504>pseudoU synthetase | ? | -
All 33 large subunit ribosomal proteins (1-7, 9-11, 13-25, 27-36) | + except 25, 30 | +
Translational initiation factor 1 | + | +
Translational initiation factor 2 | + | +
Translational initiation factor 3 | + | +
Translational elongation factor Tu | + | +
Translational elongation factor Ts | + | +
Translational elongation factor G | + | +
Translational release factor 1 | + | +
Translational release factor 2 | - | +
Translational release factor Gln methylase | + | +
Translational release factor 3 | - | +
Ribosome recycling factor | + | +
33/45 Transfer RNAs (see Fig. 2) | 29/33 | +
tRNA(I) C34>lysidine synthetase | ? | +
tRNA(R) A34>I deaminase | ? | +
tRNA(ASV) U34>cmo5U (=V) synthetase: unknown | - | -
tRNA(R) U34>2sU Cys desulfurase | - | +
tRNA(R) nm5U34 methylase | ? | +
tRNA(R) U34>cmnm5U GTPase | ? | +
tRNA(R) U34>cmnm5U synthetase | ? | +
tRNA(R) cmnm5U34>nm5U,mnm5U synthetase | ? | -
tRNA(R) G37 N1-methylase | + | +
tRNA(RNIKM) A37>t6A N6-threonylcarbamoyl-A synthetase: unknown | + | -
tRNA(CLFSWY) A37>i6A synthetase | - | +
tRNA(CLFSWY) i6A37>s2i6A(ms2i6A) synthetase | - | +
All 22 aminoacyl-tRNA synthetase subunits (20 enzymes) | + except G subunit, Q | + except G subunit
Met-tRNA formyltransferase | + | +
Chaperonin DnaK | + | +
Chaperonin GroEL | + | +
Chaperonin GroES | + | +

Total genes = 150

Forster & Church

Oligos for 150 & 776 synthetic genes (for E. coli minigenome & M. mobile whole genome, respectively)

Up to 760K Oligos/Chip

18 Mbp for $700 raw (6-18K genes)

<1K Oxamer: Electrolytic acid/base
8K Atactic/Xeotron/Invitrogen: Photo-Generated Acid (Sheng, Zhou, Gulari, Gao; U. Houston)
24K Agilent: Ink-jet standard reagents
48K Febit
100K Metrigen
380K Nimblegen: Photolabile 5' protection (Nuwaysir, Smith, Albert)

Tian, Gong, Church

Improve DNA Synthesis Cost

Synthesis on chips in pools is 5000X less expensive per oligonucleotide, but amounts are low (1e6 molecules rather than the usual 1e12), and bimolecular kinetics slow with the square of the concentration decrease.

Solution: Amplify the oligos then release them.
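The kinetic penalty noted above can be made concrete with a rough calculation. This is a minimal sketch, assuming equal reaction volumes (so molecule counts scale like concentrations) and a simple second-order rate law in which both strands are diluted equally; the function name is illustrative and not from the talk.

```python
def relative_annealing_rate(molecules_chip: float, molecules_standard: float) -> float:
    """Ratio of bimolecular annealing rates (chip vs. standard synthesis).

    Assumes rate is proportional to [A][B] and that both partners are
    diluted by the same factor, so the rate scales with the square of
    the fold change in amount.
    """
    fold_change = molecules_chip / molecules_standard
    return fold_change ** 2


chip_molecules = 1e6       # per oligo, from the slide
standard_molecules = 1e12  # per oligo, from the slide

rate_ratio = relative_annealing_rate(chip_molecules, standard_molecules)
slowdown = 1 / rate_ratio
print(f"Annealing is roughly {slowdown:.0e}-fold slower at chip-scale amounts")
```

Under these assumptions a 1e6-fold drop in amount slows annealing by about 1e12-fold, which is the motivation for amplifying the oligos before assembly.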

10 50 10 => ss-70-mer (chip)

20-mer PCR primers with restriction sites at the 50mer junctions

Tian, Gong, Sheng, Zhou, Gulari, Gao, Church

=> ds-90-mer

=> ds-50-mer
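The length bookkeeping for this amplify-then-release scheme can be sketched as follows. The 10/50/10 layout, the 20-mer primers and the 70/90/50 lengths come from the slide; the assumption that each primer anneals to a 10-nt flank and adds a 10-nt 5' tail, and that the restriction enzymes cut exactly at the 50-mer junctions, is mine and only one way to reconcile the numbers.

```python
FLANK = 10    # primer-binding flank on each side of the chip oligo (slide)
PAYLOAD = 50  # construction oligo to be released (slide)
PRIMER = 20   # PCR primer length (slide)

# Single-stranded oligo as synthesized on the chip.
chip_oligo = FLANK + PAYLOAD + FLANK

# Assumption: each primer overlaps its flank and overhangs by the remainder.
primer_tail = PRIMER - FLANK

# Double-stranded PCR product carries both primer tails.
amplicon = chip_oligo + 2 * primer_tail

# Assumption: digestion at the payload junctions removes flank + tail per side.
released = amplicon - 2 * (FLANK + primer_tail)

print(chip_oligo, amplicon, released)
```

The three lengths reproduce the ss-70-mer, ds-90-mer and ds-50-mer on the slide, but other primer geometries could give the same totals.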

Improve DNA Synthesis Accuracy via mismatch selection

Tian & Church

Other mismatch methods: MutS (&H,L)

Genome assembly

Moving forward:
1. Tandem, inverted and dispersed repeats (hierarchical assembly, size-selection and/or scaffolding)
2. Reduce mutations (goal <1e-6 errors) to reduce # of intermediates
3. 15kb to 5Mb by homologous recombination (Nick Reppas)
4. Phage integrase site-specific recombination, also for counters.

Stemmer et al. 1995. Gene 164:49-53; Mullis 1986 CSHSQB.

50, 75, 125, 225, 425, 825 … 100*2^(n-1)
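One reading of the size series 50, 75, 125, 225, 425, 825 is pairwise joining of equal-length fragments that share a fixed overlap, so each round roughly doubles the length. The sketch below assumes a 25-bp overlap, which is inferred from the numbers and not stated on the slide; the function name is hypothetical.

```python
def assembly_sizes(start_bp: int = 50, overlap_bp: int = 25, rounds: int = 5) -> list[int]:
    """Fragment length after each round of pairwise overlap assembly.

    Joining two fragments of length L that overlap by `overlap_bp`
    gives a product of length 2 * L - overlap_bp.
    """
    sizes = [start_bp]
    for _ in range(rounds):
        sizes.append(2 * sizes[-1] - overlap_bp)
    return sizes


sizes = assembly_sizes()
print(sizes)
```

With these assumed parameters the recurrence reproduces the listed sizes, and the growth per round approaches the doubling expressed by 100*2^(n-1).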

All 30S-Ribosomal-protein DNAs (codon re-optimized)

Tian, Gong, Sheng, Zhou, Gulari, Gao, Church

[Figure labels: 1.7 kb; 0.3 kb; s19 0.3 kb; Nimblegen 95K chip; Atactic <4K chip]

Improving synthesis accuracy

Method | Bp/error
Chip assembly only | 160
Hybridization-selection | 1,400
MutS-gel-shift | 10,000
MutHLS cleavage | 30,000 (10X better than PCR)

Tian & Church 2004; Carr & Jacobson 2004; Smith & Modrich 1997
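To see what these bp/error figures mean for a finished construct, one can estimate the fraction of molecules that carry no error at all. This is a minimal sketch, assuming errors are independent and uniformly distributed along the sequence; the 1,000-bp construct length is a hypothetical example, not a value from the talk.

```python
# Bp per error for each method, as listed on the slide.
BP_PER_ERROR = {
    "Chip assembly only": 160,
    "Hybridization-selection": 1_400,
    "MutS-gel-shift": 10_000,
    "MutHLS cleavage": 30_000,
}


def error_free_fraction(length_bp: int, bp_per_error: float) -> float:
    """Probability that a construct of `length_bp` contains no error,
    assuming each base is independently wrong with rate 1 / bp_per_error."""
    return (1 - 1 / bp_per_error) ** length_bp


GENE_LENGTH_BP = 1_000  # hypothetical construct length for illustration

for method, bp_per_error in BP_PER_ERROR.items():
    fraction = error_free_fraction(GENE_LENGTH_BP, bp_per_error)
    print(f"{method}: {fraction:.4f} of {GENE_LENGTH_BP}-bp constructs error-free")
```

Under this simple model, few unselected chip-assembled constructs of that length would be error-free, while most constructs would be after the most stringent selection, which bears on how many intermediates must be sequenced.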

Extreme mRNA makeover for protein expression in vitro

RS-2, 4, 5, 6, 9, 10, 12, 13, 15, 16, 17, and 21 detectable initially.

RS-1, 3, 7, 8, 11, 14, 18, 19, 20 initially weak or undetectable.

Solution: Iteratively resynthesize all mRNAs with less mRNA structure.

Tian & Church

[Figure: Western blot based on His-tags. Lanes: 20w, 20m, 17w, 17m, 16w, 16m; size marker: 10 kd. W: wild-type; M: modified.]

Safety Proposals

Church, G.M. A synthetic biohazard non-proliferation proposal. http://arep.med.harvard.edu/SBP/Church_Biohazard04c.doc (2004)

1. Monitor oligo synthesis via expansion of Controlled substances, Select Agents, &/or Recombinant DNA

2. Computational tools for the above

3. System modeling checks for synthetic biology projects

4. Multi-auxotroph, novel genetic code for the host genome, prevents functional transfer of DNA to other cells.


Why sequence?

• Synthetic biology & laboratory selections
• Pathogen "weather map", biowarfare sensors
• Cancer: mutation sets for individual clones, loss-of-heterozygosity
• RNA splicing & chromatin modification patterns.
• Antibodies or "aptamers" for any protein
• B & T-cell receptor diversity: Temporal profiling, clinical
• Preventative medicine & genotype–phenotype associations
• Cell-lineage during development
• Phylogenetic footprinting, biodiversity

Shendure et al. 2004 Nature Rev Gen 5, 335.

Personal genomics & cancer therapy

Mutations G719S, L858R, Del746ELREA in red.

EGFR Mutations in lung cancer: correlation with clinical response to gefitinib [Iressa] therapy. Paez, … Meyerson (Apr 2004) Science 304: 1497

Lynch … Haber, N Engl J Med. (Apr 2004) 350:2129.

Pao .. Mardis,Wilson,Varmus H, PNAS (Aug 2004) 101:13306-11.

Dulbecco R. (1986) A turning point in cancer research: sequencing the human genome. Science 231:1055-6.

Why 'single molecule' sequencing?

(1) Single-cells: Preimplantation (PGD), uncultivatable

(2) Co-occurrence on a molecule, complex, cell: RNA splice-forms & DNA haplotypes

(3) Cost: $1K-100K "personal genomes" http://grants.nih.gov/grants/guide/rfa-files/RFA-HG-04-003.html

(4) Precision: Counting 1e9 RNA tags (to reduce variance)

(~5e5 RNAs per human cell)

Method | EST | SAGE | MPSS | Polony-FISSeq (polymerase colony)
Fixed costs | 5e3 | 5e4 | 5e6 | 5e9 (goal)
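Why counting more tags reduces variance can be illustrated with a Poisson sampling model. This is a sketch under stated assumptions: tag counts are Poisson distributed, the values 5e3 to 5e9 listed for each method are treated as numbers of tags counted (the slide does not spell this out), and the example transcript is present at one copy among the ~5e5 RNAs in a cell.

```python
import math


def counting_cv(total_tags: float, transcript_fraction: float) -> float:
    """Coefficient of variation of a Poisson tag count.

    The expected count is total_tags * transcript_fraction and the
    CV of a Poisson variable is 1 / sqrt(expected count).
    """
    expected = total_tags * transcript_fraction
    return 1 / math.sqrt(expected)


RNAS_PER_CELL = 5e5                 # from the slide
single_copy = 1 / RNAS_PER_CELL     # one-copy-per-cell transcript (assumed example)

TAGS = {"EST": 5e3, "SAGE": 5e4, "MPSS": 5e6, "Polony-FISSeq (goal)": 5e9}

for method, n_tags in TAGS.items():
    cv = counting_cv(n_tags, single_copy)
    print(f"{method}: expected count {n_tags * single_copy:g}, CV {cv:.3g}")
```

In this model a single-copy transcript is essentially unobservable at 5e3 tags (expected count 0.01) but is measured to about 1% precision at 5e9 tags.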

CD44 Exon Combinatorics (Zhu & Shendure)

• Alternatively Spliced Cell Adhesion Molecule
• Specific variable exons are up-or-down-regulated in various cancers (>2000 papers)
• v6 & v7 enable direct binding to chondroitin sulfate, heparin…

Zhu J, Shendure J, Mitra RD, Church GM. Single molecule profiling of alternative pre-mRNA splicing. Science 301:836-8.

EXON PATTERN | Eph4 | Eph4bDD | TOTAL | RATIO | P-V
------------7-8-9-10 | 609 | 764 | 1373 | 1.17 | 1E-4
--------------8-9-10 | 320 | 390 | 710 | 1.13 | 3E-2
----------6-7-8-9-10 | 431 | 251 | 682 | -1.85 | 4E-18
------4-5-6-7-8-9-10 | 218 | 216 | 434 | -1.08 | 2E-1
----------------9-10 | 68 | 143 | 211 | 1.96 | 7E-7
--------5-6-7-8-9-10 | 86 | 39 | 125 | -2.37 | 2E-6
----3-4-5-6-7-8-9-10 | 40 | 56 | 96 | 1.30 | 9E-2
------4-5---7-8-9-10 | 16 | 74 | 90 | 4.30 | 2E-9
--2-3-4-5-6-7-8-9-10 | 44 | 28 | 72 | -1.69 | 1E-2
1-2-3-4-5-6-7-8-9-10 | 22 | 5 | 27 | -4.73 | 3E-4
--------5---7-8-9-10 | 5 | 19 | 24 | 3.53 | 3E-3
----3-4-5---7-8-9-10 | 1 | 15 | 16 | 13.95 | 4E-4
--2-3-4-5---7-8-9-10 | 1 | 10 | 11 | 9.30 | 5E-3

Eph4 = murine mammary epithelial cell line

Eph4bDD = stable transfection of Eph4 with MEK-1 (tumorigenic)

CD44 RNA isoforms

Chromosome-wide haplotyping

[Figure: Human Chr. 7 (150 Mb); markers IL6-3572 : A and CD36-4366 : A/T; 60-Mb; labels A..A, A..T; values 73, 3, 1]

Convergence on non-electrophoretic tag-sequencing methods?

Tag (bp): EST >400; SAGE 14-26; MPSS 20; 454 100; Polony-Seq 26 (2-ends); Ronaghi
• Single-molecule vs. amplified single molecule.
• Array vs. bead packing vs. random
• Rapid scans vs. long scans (chemically limited, 454)
• Number of immobilized primers:
  0: Chetverin'97 "Molecular Colonies"
  1: Mitra'99 > Agencourt "Bead Polonies"
  2: Kawashima'88, Adams'97 > Lynx/Solexa: "Clusters"

http://arep.med.harvard.edu/Polonator/Plone.htm

Bead Polony Sequencing Pipeline

In vitro libraries via paired tag manipulation

Bead polonies via emulsion PCR

[Dre03]

Monolayered immobilization in acrylamide

Enrichment of amplified beads

SOFTWARE

Images → Tag Sequences

Tag Sequences → Genome

FISSEQ or “wobble” sequencing

Epifluorescence Scope with Integrated Flow Cell

Polony Fluorescent In Situ Sequencing Libraries

Greg Porreca, Abraham Rosenbaum

[Figure labels: 1 to 100 kb genomic; M; L, R; PCR bead; sequencing primers; selector bead]

2x20bp after MmeI (BceAI, AcuI)

Dressman et al PNAS 2003 emulsion

Cleavable dNTP-Fluorophore (& terminators)

Mitra,RD, Shendure,J, Olejnik,J, Olejnik,EK, and Church,GM (2003) Fluorescent in situ Sequencing on Polymerase Colonies. Analyt. Biochem. 320:55-65

Reduce or photo-cleave

Polony-FISSeq: up to 2 billion beads/slide

Cy5 primer (570nm); Cy3 dNTP (666nm)

Jay Shendure

Self Organizing Monolayer

• # of bases sequenced (total) 23,703,953

• # bases sequenced (unique) 73

• Avg fold coverage 324,711 X

• Pixels used per bead (analysis) ~3.6

• Read Length per primer 14-15 bp

• Insertions 0.5%

• Deletions 0.7%

• Substitutions (raw) 4e-5

• Throughput: 360,000 bp/min

Polony FISSeq Stats

Current capillary sequencing 1400 bp/min (600X speed/cost ratio, ~$5K/1X)

(This may omit: PCR, homopolymer, context errors)

Shendure
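The coverage and speed figures in these stats can be cross-checked with simple arithmetic. This is only a sketch over the numbers quoted on the slide; the variable names are mine, and the throughput ratio computed here is raw bp/min only, so it is not expected to equal the quoted 600X speed/cost ratio, which presumably also folds in cost.

```python
total_bases = 23_703_953   # bases sequenced (total), from the slide
unique_bases = 73          # bases sequenced (unique), from the slide

# Average fold coverage = total bases read / unique bases covered.
fold_coverage = total_bases / unique_bases

polony_bp_per_min = 360_000    # Polony FISSeq throughput, from the slide
capillary_bp_per_min = 1_400   # capillary sequencing throughput, from the slide

# Raw throughput ratio, ignoring cost.
raw_speedup = polony_bp_per_min / capillary_bp_per_min

print(f"fold coverage ~ {fold_coverage:,.0f} X")
print(f"raw throughput ratio ~ {raw_speedup:.0f} X")
```

The computed coverage agrees with the 324,711 X on the slide once the fractional part is dropped.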

High accuracy special case: homopolymers (e.g. AAA, CC, etc.)

• Use "compressed" tags, ACG = ACCG = ACCCG
• Quantitate incorporation
• Reversible terminators
• FRET between adjacent 3' bases
• Wobble sequencing

All five of these work.

• Maintenance of amplification fidelity using linear amplification from initial genomic fragment
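The "compressed" tag idea from the list above (ACG = ACCG = ACCCG) amounts to collapsing every homopolymer run to a single base before tags are compared. A minimal sketch, with a hypothetical function name; it illustrates the string operation only, not the authors' software:

```python
from itertools import groupby


def compress_homopolymers(seq: str) -> str:
    """Collapse each run of identical bases to one base.

    groupby yields one (base, run) pair per maximal run of equal
    characters, so keeping only the keys removes run lengths.
    """
    return "".join(base for base, _run in groupby(seq))


for tag in ("ACG", "ACCG", "ACCCG"):
    print(tag, "->", compress_homopolymers(tag))
```

Tags that differ only in homopolymer length map to the same compressed tag, so run-length miscalls no longer cause mismatches, at the cost of losing the run-length information itself.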

Degenerate (aka “wobble”) sequencing

“single tipped” vs “double tipped”

length of anchoring sequence

natural vs. universal nucleotides (i.e. deoxyinosine)

single fluor vs. four-color fluor mixtures of dNTPs for extensions

Sequenase vs Klenow vs BST

Exonuclease stripping vs heat stripping

CTAGCGAGCTAGNNNNNNNNA
CTAGCGAGCTAGNNNNNNNNG
CTAGCGAGCTAGNNNNNNNNC
CTAGCGAGCTAGNNNNNNNNT

anchor | degenerate | “tip”
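The probe layout shown here (fixed anchor, a run of degenerate positions, then a defined "tip" base) can be generated programmatically. This is a sketch assuming the single-tipped design with the 12-nt anchor and 8 degenerate positions read off the slide; the function name and defaults are illustrative.

```python
ANCHOR = "CTAGCGAGCTAG"  # anchoring sequence as shown on the slide


def wobble_probes(anchor: str = ANCHOR, n_degenerate: int = 8, tips: str = "AGCT") -> list[str]:
    """Build one probe per tip base: anchor + N...N + tip."""
    return [anchor + "N" * n_degenerate + tip for tip in tips]


probes = wobble_probes()
for probe in probes:
    print(probe)

# Each 'N' stands for a mix of four bases, so every probe is a pool of
# 4 ** n_degenerate distinct sequences.
pool_complexity = 4 ** 8
print(f"{pool_complexity} sequences per probe pool")
```

Changing `n_degenerate` or `anchor` corresponds to the design variables listed on the slide (length of anchoring sequence, degenerate region), and the pool complexity grows fourfold with each added degenerate position.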

Wobble vs Simple base-extension

1/4 vs 2.5/4 base/cycle

>8 vs 14-200 base reads

3e-3 vs 4e-5 non-homopolymer errors

3e-3 vs 1e-1 homopolymer errors

40' per cycle, 60 hr per 20 cycles

Sequencing single molecules

Ecosystem studies need single-cell amplification because of multiple chromosomes (& RNAs) per cell. Many cells are hard to grow. Microbes exchange genome subsets.

(Even an 80% genome coverage is better than 100 kb BACs)

Many input molecules required to sequence one molecule vs. one molecule sufficient to sequence via many copies of it.

Single cell sequencing

φ29 real-time amplification

No template control

Affymetrix quantitation of independent amplifications
