Design and Optimization of Universal DNA Arrays

48
Design and Optimization of Universal DNA Arrays Ion Mandoiu Computer Science & Engineering Department University of Connecticut http://www.engr.uconn.edu/~ion/

description

Design and Optimization of Universal DNA Arrays. Ion Mandoiu Computer Science & Engineering Department University of Connecticut http://www.engr.uconn.edu/~ion/. DNA Arrays. Exploit Watson-Crick complementarity to simultaneously perform a large number of substring tests - PowerPoint PPT Presentation

Transcript of Design and Optimization of Universal DNA Arrays

Page 1: Design and Optimization of  Universal DNA Arrays

Design and Optimization of Universal DNA Arrays

Ion MandoiuComputer Science & Engineering Department

University of Connecticut

http://www.engr.uconn.edu/~ion/

Page 2: Design and Optimization of  Universal DNA Arrays

DNA Arrays• Exploit Watson-Crick complementarity to simultaneously

perform a large number of substring tests

• Used in a variety of high-throughput genomic analyses

– Transcription (gene expression) analysis

– Single Nucleotide Polymorphism (SNP) genotyping

– Alternative splicing, ChIP-on-chip, tiling arrays, genomic-based species identification, point-of-service diagnosis, …

• Common array formats involve direct hybridization between labeled DNA/RNA sample and DNA probes attached to a glass slide

Page 3: Design and Optimization of  Universal DNA Arrays

Universal DNA Arrays

• “Programable” arrays– Array consists of application independent

oligonucleotides– Analysis carried by a sequence of reactions

involving application specific primers– Flexible AND cost effective

• Universal array architectures: tag arrays, APEX arrays, SBE/SBH arrays,…

Page 4: Design and Optimization of  Universal DNA Arrays

Overview

Tag Array Design- Tag Set Design

- Tag Assignment Algorithms

SBE/SBH Assays- Decoding and Multiplexing Algorithms

Conclusions

Page 5: Design and Optimization of  Universal DNA Arrays

SNP Genotyping with Tag ArraysTag

+

PrimerG

A

G

G

A

G

G

A

G

TG CA

T

C

Cantitag

TC C

1. Mix reporter probes with unlabeled genomic DNA

2. Solution phase hybridization

3. Single-Base Extension (SBE)4. Solid phase hybridization

Page 6: Design and Optimization of  Universal DNA Arrays

Tag Array Advantages

• Cost effective– Same array used in many analyses can be mass produced

• Easy to customize– Only need to synthesize new set of reporter probes

• Reliable– Solution phase hybridization better understood than

hybridization on solid support

Page 7: Design and Optimization of  Universal DNA Arrays

Tag Set Design Problem

(H1) Tags hybridize strongly to complementary antitags

(H2) No tag hybridizes to a non-complementary antitag

(H3) Tags do not cross-hybridize to each other

t1t1 t2t2 t1 t2t1

Tag Set Design Problem: Find a maximum cardinality set of tags satisfying (H1)-(H3)

Page 8: Design and Optimization of  Universal DNA Arrays

Hybridization Models

• Melting temperature Tm: temperature at which 50% of duplexes are in hybridized state

• 2-4 rule

Tm = 2 #(As and Ts) + 4 #(Cs and Gs)

• More accurate models exist, e.g., the near-neighbor model

Page 9: Design and Optimization of  Universal DNA Arrays

• Hamming distance model, e.g., [Marathe et al. 01]– Models rigid DNA strands

• LCS/edit distance model, e.g., [Torney et al. 03] – Models infinitely elastic DNA strands

• c-token model [Ben-Dor et al. 00]:– Duplex formation requires formation of nucleation complex

between perfectly complementary substrings

– Nucleation complex must have weight c, where wt(A)=wt(T)=1, wt(C)=wt(G)=2 (2-4 rule)

Hybridization Models (contd.)

Page 10: Design and Optimization of  Universal DNA Arrays

c-h Code Problem• c-token: left-minimal DNA string of weight c, i.e.,

– w(x) c

– w(x’) < c for every proper suffix x’ of x

• A set of tags is a c-h code if(C1) Every tag has weight h

(C2) Every c-token is used at most once

c-h Code Problem [Ben-Dor et al.00]

Given c and h, find maximum cardinality c-h code

Page 11: Design and Optimization of  Universal DNA Arrays

Algorithms for c-h Code Problem

• [Ben-Dor et al.00] approximation algorithm based on DeBruijn

sequences

• Alphabetic tree search algorithm

- Enumerate candidate tags in lexicographic order, save tags whose c-tokens are not used by previously selected tags

- Easily modified to handle various combinations of constraints

• [MT 05, 06] Optimum c-h codes can be computed in practical time for small values of c by using integer programming

• Practical runtime using Garg-Koneman approximation and LP-rounding

Page 12: Design and Optimization of  Universal DNA Arrays

Token Content of a Tag

c=4 CCAGATT CC CCA CAG AGA GAT GATT

Tag sequence of c-tokensEnd pos: 2 3 4 5 6 7 c-token: CCCCACAGAGAGATGATT

Page 13: Design and Optimization of  Universal DNA Arrays

Layered c-token graph for length-l tags

st

c1

cN

ll-1c/2 (c/2)+1 …

Page 14: Design and Optimization of  Universal DNA Arrays

Integer Program Formulation [MPT05]

• Maximum integer flow problem w/ set capacity constraints

•O(hN) constraints & variables, where N = #c-tokens

Page 15: Design and Optimization of  Universal DNA Arrays

Number of c-tokensc Num c-tokens

5 208

6 568

7 1552

8 4240

9 11584

10 31648

Page 16: Design and Optimization of  Universal DNA Arrays

Periodic Tags [MT05]

• Key observation: c-token uniqueness constraint in c-h code formulation is too strong– A c-token should not appear in two different tags, but

can be repeated in a tag

• A tag t is called periodic if it is the prefix of () for some “period” – Periodic strings make best use of c-tokens

Page 17: Design and Optimization of  Universal DNA Arrays

c-token factor graph, c=4 (incomplete)

CC

AAG AAC

AAAA

AAAT

Page 18: Design and Optimization of  Universal DNA Arrays

Vertex-disjoint Cycle Packing Problem

• Given directed graph G, find maximum number of vertex disjoint directed cycles in G

• [MT 05] APX-hard even for regular directed graphs with in-degree and out-degree 2– h-c/2+1 approximation factor for tag set design problem

• [Salavatipour and Verstraete 05] – Quasi-NP-hard to approximate within (log1- n)

– O(n1/2) approximation algorithm

Page 19: Design and Optimization of  Universal DNA Arrays

Cycle Packing Algorithm1. Construct c-token factor graph G

2. T{}

3. For all cycles C defining periodic tags, in increasing order of

cycle length,

• Add to T the tag defined by C

• Remove C from G

4. Perform an alphabetic tree search and add to T tags consisting

of unused c-tokens

5. Return T

– Gives an increase of over 40% in the number of tags compared to previous methods

Page 20: Design and Optimization of  Universal DNA Arrays

Experimental Results

h

Page 21: Design and Optimization of  Universal DNA Arrays

Antitag-to-Antitag Hybridization

• Additional constraint: antitags do not cross-hybridize, including self

• Formalization in c-token hybridization model:(C3) No two tags contain complementary substrings of

weight c

• Cycle packing and tree search extend easily

Page 22: Design and Optimization of  Universal DNA Arrays

Results w/ Extended Constraints

h

Page 23: Design and Optimization of  Universal DNA Arrays

More Hybridization Constraints…

• Enforced during tag assignment by- Leaving some tags unassigned and distributing primers across multiple arrays [Ben-Dor et al. 03]

- Exploiting availability of multiple primer candidates [MPT05]

t1 t2t1

Page 24: Design and Optimization of  Universal DNA Arrays

Assignable Primers• If primer p hybridizes to the complement of tag t’, at

most one of the assignments (p,t’), (p,t) and (p’,t’) can be made

p

t’

t

p’

• Set P of primers is assignable to a set T of tags if the condition above is satisfied for every p,p’ and t,t’

Page 25: Design and Optimization of  Universal DNA Arrays

Characterization of Assignable Sets

• conflict graph: – G=(T P,E), where (t,p) ∈ E if t and p hybridize– X = number of primers adjacent to a degree 1 tag– Y = number of degree 0 tags

Y=2

X=1

• [Ben-Dor 04] Set P is assignable to T iff

X+Y |P|

Page 26: Design and Optimization of  Universal DNA Arrays

Finding Assignable Primer Sets

Maximum Assignable Primer Set Problem: given primer set P and tag set T, find a maximum size assignable subset of P

• Both problems are NP-hard [Ben-Dor 04]

Multiplexing Problem: given primer set P and tag set T, find partition of P into minimum number of assignable sets

Page 27: Design and Optimization of  Universal DNA Arrays

Integration with Primer Selection

• In practice, several primer candidates with equivalent functionality– In SNP genotyping, can pick primer from either forward and

reverse strand

– In gene expression/identification applications, many primers have desired length, Tm, etc.

Page 28: Design and Optimization of  Universal DNA Arrays

Pooled Array Multiplexing Problem

Pooled Multiplexing Problem: Given set of primer pools P and tag set T, find a primer from each pool and a partition of selected primers into minimum number of assignable sets

Page 29: Design and Optimization of  Universal DNA Arrays

Pooled Multiplexing Algorithms

1. Primer-Del = greedy deletion for pools similar to [Ben-Dor et al 04]

Repeatedly delete primer of maximum potential until X+Y #pools, where Potential of tag t is 2-deg(t)

Potential of primer p is sum of potentials of conflicting tags Subtract ½ if primer adjacent to a tag of degree 1

Page 30: Design and Optimization of  Universal DNA Arrays

Pooled Multiplexing Algorithms

1. Primer-Del = greedy deletion for pools similar to [Ben-Dor et al 04]

2. Primer-Del+ = same but never delete last primer from pool unless no other choice

3. Min-Pot = select primer with min potential from each pool, then run Primer-Del

4. Min-Deg = select primer with min degree, then run Primer-Del

5. Iterative ILP = iteratively find a maximum assignable pool set using integer linear program

Page 31: Design and Optimization of  Universal DNA Arrays

Results: GenFlex Tags, c=8

Page 32: Design and Optimization of  Universal DNA Arrays

Herpes B Gene Expression Assay

Tm # poolsPool size

500 tags 1000 tags 2000 tags

# arrays % Util. # arrays % Util. # arrays % Util.

60 14461 4 94.06 2 97.20 1 72.30

5 4 96.13 2 100.00 1 72.30

67 15601 4 96.53 2 98.70 1 78.00

5 4 98.00 2 99.90 1 78.00

70 15221 4 96.73 2 98.90 1 76.10

5 4 97.80 2 99.80 1 76.10

Tm # poolsPool size

500 tags 1000 tags 2000 tags

# arrays % Util. # arrays % Util. # arrays % Util.

60 14461 4 82.26 3 65.35 2 57.05

5 4 88.26 3 70.95 2 63.55

67 15601 4 86.33 3 69.70 2 61.15

5 4 91.86 3 76.00 2 67.20

70 15221 4 88.46 3 73.65 2 65.40

5 4 92.26 2 91.10 2 70.30

GenFlex Tags

Periodic Tags

Page 33: Design and Optimization of  Universal DNA Arrays

Overview

Tag Array Design- Tag Set Design

- Tag Assignment Algorithms

SBE/SBH Assays- Decoding and Multiplexing Algorithms

Conclusions

Page 34: Design and Optimization of  Universal DNA Arrays

TTGCA

GATAA

T

A

T

AA AC CC CA

AT AG CG CT

TT TG GG GT

TA TC GC GA

Primers

CCATT

T

A

T

T

A

T

hybridization to k-mer array (SBH)

SBE/SBH Assay [MP 06]

single-base extension (SBE)

Page 35: Design and Optimization of  Universal DNA Arrays

Some notations

• P set of primers, X set of probes

• Ep {A,C,T,G}⊆ the set of possible extensions for primer p

• The spectrum of primer p, SpecX(p), is the set of probes hybridizing with p

• The extended spectrum of primer p with extension set Ep,)(),( peSpecEpSpec XpX Eep

Page 36: Design and Optimization of  Universal DNA Arrays

Decodable primer sets

• Four parallel single-color SBE/SBH experiments one type of extension in each SBE experiment

– P is weakly decodable with respect to extension e if for every primer p

• One SBE/SBH experiment with 4 colors (4 extensions)

– P is weakly decodable if for every primer p and every extension e ∈ Ep

)'(\)( ' epSpecpeSpec XX pp

)'(\)( ',' epSpecpeSpec XpX Eepp

Page 37: Design and Optimization of  Universal DNA Arrays

Strongly r-decodable primer sets

• Hybridization involving labeled nucleotide is less predictable

Informative probes should not rely on it

• Signal from one SNP may obscure signal from another when read at the same probe due to differences in DNA amplification efficiency

Informative probes cannot be shared between SNPs

• P is strongly r-decodable if for every primer p

where r = redundancy parameter

rEpSpecpSpec ppp XX |),'(\)(| ''

Page 38: Design and Optimization of  Universal DNA Arrays

MPPP

Minimum Pool Partitioning Problem (MPPP)Given:

• primer pools set P and extensions sets Ep, for every primer p

• probe set X• redundancy r Find:

partition of P into the min number of strongly r-decodable subsets

A set of primer pools P ={P1,…,Pn } is strongly r-decodable iff there is a primer pi in each pool Pi such that {p1,…,pn} is strongly r-decodable.

Page 39: Design and Optimization of  Universal DNA Arrays

MDPSP

Maximum r-Decodable Pool Subset Problem (MDPSP)Given:

• primer pools set P and extensions sets Ep, for every primer p

• probe set X• redundancy r Find:

• strongly r-decodable subset of P of maximum size

Page 40: Design and Optimization of  Universal DNA Arrays

Min-Greedy Algorithm for Maximum Induced Matching in General Graphs• Pick a vertex u of min degree • Pick a vertex v of min degree from among u’s

neighbors• Add edge (u,v) to the matching• Delete all neighbors of u and v from the graph• Repeat the above steps until the graph becomes empty

• [Duckworth 05] d-1 approximation factor for d-regular graphs

Page 41: Design and Optimization of  Universal DNA Arrays

Min-Greedy Algorithms for MDPSP• Bipartite hybridization graph G:

– Primers in left side, probes in right side– Two types of edges:

• N+(p)=SpecX(p)• N-(p)=SpecX(p,Ep) \ SpecX(p)

• Two algorithm variants:– MinPrimerGreedy: pick primer first– MinProbeGreedy: pick probe first

• Delete primer/probe if N+ degree drops below r/1

Page 42: Design and Optimization of  Universal DNA Arrays

Experimental results for k-mers (|Ep|=4, primer length=20)

Page 43: Design and Optimization of  Universal DNA Arrays

MDPSP Size vs Primer Lengthk=10

0

20000

40000

60000

80000

100000

120000

140000

160000

10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

primer length

r=2, n=200k, MinProbe

r=2, n=200k, Sequential

r=2, n=200k, MinPrimer

r=5, n=200k, MinProbe

r=5, n=200k, Sequential

r=5, n=200k, MinPrimer

Page 44: Design and Optimization of  Universal DNA Arrays

Experimental results for c-tokens (|Ep|=4, primer length=20)

Page 45: Design and Optimization of  Universal DNA Arrays

MDPSP Size vs Primer Lengthc=13

0

10000

20000

30000

40000

50000

60000

70000

80000

10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

primer length

r=2, n=200k, MinProbe

r=2, n=200k, Sequential

r=2, n=200k, MinPrimer

r=5, n=200k, MinProbe

r=5, n=200k, Sequential

r=5, n=200k, MinPrimer

Page 46: Design and Optimization of  Universal DNA Arrays

Overview

Tag Array Design- Tag Set Design

- Tag Assignment Algorithms

SBE/SBH Assays- Decoding and Multiplexing Algorithms

Conclusions

Page 47: Design and Optimization of  Universal DNA Arrays

Conclusions and Ongoing Work

• Combinatorial algorithms yield significant increases in multiplexing rates of universal arrays– New SBE/SBH architecture particularly promising based on

preliminary simulation results

• Ongoing work: – Extend methods to more accurate hybridization models, e.g.,

use NN melting temperature models

– More complex (e.g., temperature dependent) DNA tag set non-interaction requirements for DNA self/mediated assembly

– Probabilistic decoding in presence of hybridization errors

– Application to novel domains, e.g., DNA barcoding

Page 48: Design and Optimization of  Universal DNA Arrays

Acknowledgments

• Claudia Prajescu and Dragos Trinca

• Funding from NSF (Awards 0546457 and 0543365) and UCONN Research Foundation