# Combinatorial Optimization Methods for Reliable Genomic...

1

Combinatorial Optimization Methods for Reliable Genomic-Based Detection Systems

Ion Mandoiu, University of Connecticut

Computer Science & Engineering Department

2

Motivation

• Early detection, early response: rapid identification of pathogens causing epidemic outbreaks enables faster containment
• Emerging large-scale systems for infectious agent detection:
  – BioWatch [DHS]
  – Human Virome project [Anderson et al. 03]
• Genomic-based assays are becoming the method of choice for early detection and identification
  – Sequence data increasingly available
  – Broad detection spectrum, fast, easy to automate
  – Reduced deployment and update overhead
• Besides resolving numerous technological challenges, novel bioinformatics tools will be needed to assist in assay design and optimization

3

Can Computer Scientists Really Help?

• They’ve done it before: BLAST, Human Genome assembly
• Computer virus detection
  – More than 68,000 viruses detected in real time
  – Daily updates of computer virus signatures
  – Techniques used by computer anti-virus programs can be used to speed up genomic-based detection assays

4

Overview

• Generic Detection System Architecture
• The String Barcoding Problem
• Primer Set Selection for Multiplex PCR
• Conclusions

5

Detection System Requirements

• Fast, highly specific pathogen detection and identification without compromising sensitivity (low false alarm rate)
• Ability to work with trace amounts of genetic material
• Fully automated operation: should require minimal human intervention
• Parallel detection of a large number of pathogens
• Discrimination between pathogens and non-pathogenic organisms
• Low operating cost
• Easy to upgrade

6

Key System Components

• Selection of distinguishing oligonucleotides based on available genomic sequences
• Selective amplification of distinguishing sequences from environmental sample
• Hybridization-based detection of present distinguishers
• Pathogen identification by comparison with stored signatures/barcodes of known pathogens

7

Generic System Architecture

[Figure: system pipeline. A sample containing minute traces of pathogen genetic material and a mixture of (degenerate) primers enter multiplex PCR on a PCR machine, yielding amplified DNA sequences from the sample. These undergo single-base extension with fluorescent nucleotides and hybridization with a universal tag array, using probes obtained by ligating distinguisher reporters and anti-tags. The output is the barcodes of the pathogens present in the sample.]

8

SBE & Hybridization with Universal Tag Arrays

9

Overview

• Generic Detection System Architecture
• The String Barcoding Problem
  – Problem formulation
  – Integer program
  – Fast heuristics
• Primer Set Selection for Multiplex PCR
• Conclusions

10

Motivation

• Need for rapid virus detection
  – Given
    • Virus with unknown identity
    • Database of known viruses
  – Problem
    • Identify unknown virus quickly
  – Ideal solution
    • Have sequence of
      – Viruses in database
      – Unknown virus
• Solution
  – Use BLAST (or any sequence similarity program/algorithm)

11

Real World

• Only have sequence for pathogens in database
  – Not possible to quickly sequence an unknown virus
• Can quickly test for presence of short substrings in unknown virus (substring tests) using, e.g., hybridization + SBE
• New idea (Borneman et al.’01, Rash&Gusfield’02)
  – String Barcoding: use substring tests to uniquely identify each virus in the database

12

Problem Definition

Given: Genomic sequences g₁, …, gₙ

Find: Minimum number of distinguisher strings t₁, …, tₖ

Such that: For every gᵢ ≠ gⱼ, there exists a string tₗ that is the Watson-Crick complement of a substring of gᵢ or gⱼ, but not of both

– At least log₂ n distinguishers needed (k distinguishers yield at most 2ᵏ distinct barcodes)
– Fingerprints ⇒ n distinguishers
– Far fewer than n distinguishers needed in practice (close to log₂ n)

13

Small Example

• Given sequences:
  1. cagtgc
  2. cagttc
  3. catgga
• Feasible set of distinguishers: {tg, atgga}

| Sequence | atgga | tg |
|---|---|---|
| cagtgc | 0 | 1 |
| cagttc | 0 | 0 |
| catgga | 1 | 1 |

0/1 row vectors: unique barcode for each pathogen
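The barcode table in this example can be reproduced with a few lines of Python. This is only a sketch: it assumes a substring test is plain string containment (as the example itself does), rather than modeling Watson-Crick hybridization, and the function name `barcode` is illustrative.

```python
def barcode(sequence, distinguishers):
    """Return a 0/1 tuple: entry is 1 if the distinguisher occurs in the sequence."""
    return tuple(int(t in sequence) for t in distinguishers)


sequences = ["cagtgc", "cagttc", "catgga"]
distinguishers = ["atgga", "tg"]  # same column order as the table

barcodes = {g: barcode(g, distinguishers) for g in sequences}

# A distinguisher set is feasible when every sequence gets a different barcode.
is_feasible = len(set(barcodes.values())) == len(sequences)

for g, code in barcodes.items():
    print(g, code)
print("feasible:", is_feasible)
```

A set of distinguishers is accepted exactly when no two rows of the resulting 0/1 matrix coincide.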

14

Problem Complexity

• Unknown if NP-hard to find optimum solution when size of distinguishers is not bounded
• Max-Length String Barcoding
  – Additional parameter k = maximum distinguisher length
  – This variant is NP-hard by reduction from Minimum Testing Set (Garey, Johnson, 1979)
  – In practice, this means it may be difficult to find an optimum solution

15

Integer Program Formulation

• Basic idea (Rash&Gusfield’02)
  – Write problem as minimization of a linear function subject to linear constraints
  – Variables restricted to take 0/1 values
• For our problem
  – One variable for each candidate distinguisher
    • Value = 1 ⇒ candidate is selected
    • Value = 0 ⇒ candidate is not selected
  – One constraint for each pair of strings in S
    • At least one good distinguisher chosen for each pair
  – Objective function
    • Minimize sum of variables (#selected candidates)

16

Practical Implementation

• Key point: runtime needed to solve integer program depends on #variables
• Lots of variables can be removed:
  – Candidates that appear in all sequences
  – Sufficient to keep a single candidate among those that appear in the same set of strings
• How to remove useless variables?
  – Rash&Gusfield’s method: use suffix trees
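The two pruning rules above can be illustrated without a suffix tree. The sketch below is not Rash and Gusfield's method: it enumerates all substrings naively (fine for toy inputs, far too slow for real genomes), groups them by the set of sequences they occur in, drops the group that occurs everywhere, and keeps one representative per group. The function name and the choice of the shortest substring as representative are assumptions made for illustration.

```python
def reduced_candidates(sequences):
    """Map each occurrence set (1-based sequence IDs) to one representative substring."""
    substrings = {
        g[i:j]
        for g in sequences
        for i in range(len(g))
        for j in range(i + 1, len(g) + 1)
    }
    representative = {}
    # Visit shorter substrings first so the kept representative is the shortest one.
    for s in sorted(substrings, key=lambda s: (len(s), s)):
        occ = frozenset(idx for idx, g in enumerate(sequences, start=1) if s in g)
        if len(occ) == len(sequences):
            continue  # occurs in every sequence, so it separates no pair
        representative.setdefault(occ, s)
    return representative


candidates = reduced_candidates(["cagtgc", "cagttc", "catgga"])
for occ, s in candidates.items():
    print(sorted(occ), s)
```

On the three example sequences this leaves five candidates, one per distinct occurrence set, which is the number of variables an integer program would need.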

17

Suffix Trees

• Key properties of the suffix tree built for a set of strings S:
  – Rooted tree with character sequences labeling edges
  – Tree nodes labeled with a subset of the original string IDs
  – Every substring of original input set appears as a tree walk from root exactly once

18

Suffix Tree Example

• Strings:
  1. cagtgc
  2. cagttc
  3. catgga

v1 - {1,2,3} v2 - {1,2,3} v3 - {3} v4 - {1} v5 - {3}

v6 - {1,2} v7 - {2} v8 - {1} v9 - {1,2,3} v10 - {1,2,3}

v11 - {1,2} v12 - {1} v13 - {2} v14 - {3} v15 - {1,2,3}

v16 - {2} v17 - {2} v18 - {1,3} v19 - {1} v20 - {3}

v21 - {1,2,3} v22 - {3} v23 - {2} v24 - {1,2} v25 - {1}

19

Integer Program

Minimize
  V18 + V22 + V11 + V17 + V8        #objective function
Such that
  V18 + V17 + V8 >= 1               #constraint to cover pair 1,2
  V22 + V11 + V8 >= 1               #constraint to cover pair 1,3
  V18 + V22 + V11 + V17 >= 1        #constraint to cover pair 2,3
Binaries                            #all variables are 0/1
  V18 V22 V11 V17 V8
End

| Sequence | atgga (V22) | tg (V18) |
|---|---|---|
| cagtgc | 0 | 1 |
| cagttc | 0 | 0 |
| catgga | 1 | 1 |
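For an instance this small, the integer program can be checked by exhaustive search instead of an IP solver. The sketch below takes the occurrence sets of the five variables from the suffix tree example (V18 = {1,3}, V22 = {3}, V11 = {1,2}, V17 = {2}, V8 = {1}) and tries subsets in order of increasing size. The helper names are illustrative, and brute force is used only because the example is tiny.

```python
from itertools import combinations

# Occurrence sets of the candidate distinguishers, as labeled in the suffix tree example.
occurrence = {"V18": {1, 3}, "V22": {3}, "V11": {1, 2}, "V17": {2}, "V8": {1}}
sequence_ids = [1, 2, 3]


def is_feasible(chosen, occurrence, ids):
    """True if every pair of sequences is separated by at least one chosen candidate."""
    return all(
        any((a in occurrence[v]) != (b in occurrence[v]) for v in chosen)
        for a, b in combinations(ids, 2)
    )


def min_distinguisher_set(occurrence, ids):
    """Smallest feasible subset of candidates, found by trying all subsets by size."""
    names = sorted(occurrence)
    for size in range(1, len(names) + 1):
        for chosen in combinations(names, size):
            if is_feasible(chosen, occurrence, ids):
                return chosen
    return None


best = min_distinguisher_set(occurrence, sequence_ids)
print(best, len(best))
print(is_feasible(("V18", "V22"), occurrence, sequence_ids))
```

The optimum has two variables set to 1. The solution {V18, V22} shown in the table (tg and atgga) is one such optimum; the search may return a different pair of the same size.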

20

Limitations of Integer Program Method

• Works only for moderately sized datasets
  – 50-150 sequences
  – Average length ~1000 characters
  – Over 4 hours needed to come within 20% of optimum
• Scalable heuristics?

21

Information Content Heuristic

• [Berman et al. 2004]
  – Keep track of the partition defined by distinguishers selected so far

[Figure: sequences 1, 2, 3, …, n−1, n split into classes by Distinguisher 1 and Distinguisher 2]

23

Information Content Heuristic

• [Berman et al. 2004]
  – Keep track of the partition defined by distinguishers selected so far
  – In every step, choose candidate that reduces partition entropy by largest amount
    • Initial entropy = log₂(n!) ≈ n log₂ n
    • Final entropy = 0
• Theorem: The Information Content Heuristic always finds a number of distinguishers within a factor of 1 + ln(n) of the optimum
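A minimal sketch of the heuristic follows. The slides give only the initial and final entropy values, so the entropy definition used here, the sum of log₂(|class|!) over the classes of the current partition, is an assumption chosen to match both (log₂(n!) for a single class of n sequences, 0 when every class is a singleton). The candidate occurrence sets are those of the small example, and all function names are illustrative.

```python
import math


def entropy(partition):
    """Assumed partition entropy: sum of log2(|block|!) over all blocks."""
    return sum(math.log2(math.factorial(len(block))) for block in partition)


def refine(partition, occ):
    """Split every block into the sequences inside and outside the occurrence set."""
    refined = []
    for block in partition:
        inside, outside = block & occ, block - occ
        refined.extend(b for b in (inside, outside) if b)
    return refined


def information_content_heuristic(ids, occurrence):
    partition = [frozenset(ids)]
    chosen = []
    while entropy(partition) > 0:
        # Pick the candidate whose refinement leaves the lowest entropy.
        best = min(occurrence, key=lambda v: entropy(refine(partition, occurrence[v])))
        new_partition = refine(partition, occurrence[best])
        if entropy(new_partition) >= entropy(partition):
            raise ValueError("remaining sequences cannot be distinguished")
        partition = new_partition
        chosen.append(best)
    return chosen


occurrence = {"V18": {1, 3}, "V22": {3}, "V11": {1, 2}, "V17": {2}, "V8": {1}}
selected = information_content_heuristic([1, 2, 3], occurrence)
print(selected)
```

Ties are broken by dictionary order here; any tie-breaking rule fits the description on the slide.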

24

Limitations of ICH

• Real genomic data has degenerate nucleotides
  – Ambiguous sequencing
  – Single nucleotide polymorphisms
• For sequences with degenerate nucleotides there are three possibilities for distinguisher hybridization
  – Sure hybridization
  – Sure mismatch
  – Uncertain hybridization

⇒ No partition to work with!

25

Simpler Greedy Heuristic

• Setcover greedy:
  – In every step, choose candidate that distinguishes the largest number of not yet distinguished pairs
• Distinguisher selection as setcover problem:
  – Elements to be covered are the pairs of sequences
  – Each candidate distinguisher defines a set of pairs that it separates
  – Problem: find minimum number of sets that cover all elements
• By a classical result, setcover greedy gives a 2 ln(n) approximation; in practice as good as ICH
• Runtime is a few seconds for Rash&Gusfield datasets
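The setcover greedy described above can be sketched as follows, using the occurrence sets from the small example as input. The function name and the input encoding (candidate name mapped to the set of sequence IDs it occurs in) are assumptions for illustration. Degenerate nucleotides, which motivated this heuristic, are not modeled.

```python
from itertools import combinations


def setcover_greedy(ids, occurrence):
    """Repeatedly pick the candidate separating the most still-unseparated pairs."""
    uncovered = set(combinations(sorted(ids), 2))
    chosen = []
    while uncovered:
        def gain(name):
            occ = occurrence[name]
            return sum((a in occ) != (b in occ) for a, b in uncovered)

        best = max(occurrence, key=gain)
        if gain(best) == 0:
            raise ValueError("some pairs cannot be separated by any candidate")
        occ = occurrence[best]
        # Keep only the pairs that the chosen candidate fails to separate.
        uncovered = {(a, b) for a, b in uncovered if (a in occ) == (b in occ)}
        chosen.append(best)
    return chosen


occurrence = {"V18": {1, 3}, "V22": {3}, "V11": {1, 2}, "V17": {2}, "V8": {1}}
picked = setcover_greedy([1, 2, 3], occurrence)
print(picked)
```

Because the elements are pairs rather than a partition, the same loop still works when a candidate's effect on a pair is only known as "separates" or "does not surely separate".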

26

Overview

• Generic Detection System Architecture
• The String Barcoding Problem
• Primer Set Selection for Multiplex PCR
  – Problem formulation
  – Greedy and LP-rounding algorithm for primer set selection with uniqueness constraints
  – Experimental results
• Conclusions

27

The Polymerase Chain Reaction

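[Figure: PCR schematic. Primer 1 and Primer 2 anneal to opposite strands flanking the target sequence (5’ and 3’ ends labeled); polymerase extends the primers; the cycle is repeated 20-30 times.]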

28

Primer Pair Selection Problem

• Given:
  – Genomic sequence around amplification locus
  – Primer length k
  – Amplification upper bound L
• Find: Forward and reverse primers of length k that hybridize within a distance of L of each other and optimize amplification efficiency (melting temperatures, secondary structure, cross hybridization, etc.)

[Figure: forward and reverse primers on opposite strands (5' and 3' ends labeled) flanking the amplification locus, at most L apart]

29

Multiplex PCR

• Multiplex PCR (MP-PCR)
  – Multiple DNA fragments amplified simultaneously
  – Boundaries of each amplification fragment still defined by two oligonucleotide primers
  – A primer may participate in the amplification of multiple targets
• Primer set selection
  – Typically done by time-consuming trial and error
  – An important objective is to minimize number of primers
    • Reduced assay cost
    • Higher effective concentration of primers ⇒ higher amplification efficiency
    • Reduced unintended amplification

30

Other Applications of Multiplex PCR

• Spotted microarray synthesis [Fernandes&Skiena’02]
  – Need unique pair of primers for each one of the n amplification products, but primers can be used multiple times
  – Potential to reduce #primers from O(n) to O(n^(1/2))
• SNP Genotyping
  – Thousands of SNPs that must be genotyped using hybridization-based methods (e.g., single-base extension)
  – Selective PCR amplification needed to improve accuracy of detection steps (whole-genome amplification less appropriate)
  – No need for unique amplification!
  – Primer minimization is critical
    • Reduced cost
    • Fewer multiplex PCR reactions, less mispriming

31

Primer Set Selection Problem

• Given:
  – Genomic sequences around each amplification locus
  – Primer length k
  – Amplification upper bound L
• Find:
  – Minimum size set of primers S of length k such that, for each amplification locus, there are two primers in S hybridizing to the forward and reverse sequences within a distance of L of each other
  – For applications requiring uniqueness: S should contain a unique pair of primers amplifying each locus

32

Previous Work

• Well-studied problem: [Pearson et al. 96], [Linhart & Shamir’02], [Souvenir et al.’03], etc.
• Almost all problem formulations decouple selection of forward and reverse primers
  – Cannot directly enforce constraints on amplification product length!
  – To enforce bound of L on amplification length, select only primers that hybridize within L/2 bases of desired target
  – In worst case, this method can increase the number of primers by a factor of O(n) compared to the optimum
• Greedy set cover algorithm gives O(ln n) approximation factor for the “decoupled” formulation
  – Cannot find better approximation unless P=NP

33

Previous Work (contd.)

• [Fernandes&Skiena’02] model primer set selection with uniqueness constraints as a minimum multicolored subgraph problem:
  – Vertices of the graph correspond to candidate primers
  – There is an edge colored by color i between primers u and v if they hybridize within a distance of L of each other around i-th amplification locus
  – Goal is to find minimum size set of vertices inducing edges of all colors
• Can be used to model length amplification constraints
• [Lancia et al.’02] Trivial approximation algorithm: select 2 primers for each amplification target
  – O(n^(1/2)) approximation since at least n^(1/2) primers required by every feasible solution
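The n^(1/2) lower bound can be read as a counting argument. The following is a sketch under the assumption that each of the n loci must be amplified by its own pair of two distinct primers; p denotes the number of primers in a feasible solution (a symbol introduced here). A set of p primers contains at most p(p−1)/2 unordered pairs, so

$$
\binom{p}{2} = \frac{p(p-1)}{2} \ge n
\quad\Longrightarrow\quad
p^2 > p(p-1) \ge 2n
\quad\Longrightarrow\quad
p > \sqrt{2n} \ge \sqrt{n}.
$$

The trivial solution uses 2n primers, so its ratio to any feasible solution is below $2n / \sqrt{n} = 2\sqrt{n}$, which is $O(n^{1/2})$.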

34

Integer Program Formulation

• Variable xᵤ for every vertex (candidate primer) u
  – xᵤ set to 1 if u is selected, and to 0 otherwise
• Variable yₑ for every edge e
  – yₑ set to 1 if corresponding primer pair selected to amplify corresponding target
• Objective: minimize sum of xᵤ’s
• Constraints:
  – For each i, sum of yₑ’s over all e’s amplifying locus i is at least 1
  – yₑ ≤ xᵤ for every e incident to u

35

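Linear Program Relaxation

• Integer program hard to solve exactly
• Can still efficiently solve the linear programming relaxation, in which variables are allowed to take fractional values

$$
\begin{aligned}
\text{Minimize:} \quad & \sum_{v} x_v \\
\text{Subject to:} \quad & \sum_{e \in E_\chi} y_e \ge 1 && \text{for every color } \chi \\
& \sum_{e \in E_\chi,\; v \in e} y_e \le x_v && \text{for every } v \text{ and } \chi \\
& x_v,\; y_e \ge 0
\end{aligned}
$$

(Here v ranges over the candidate primers and E_χ denotes the set of edges of color χ, i.e. the primer pairs amplifying the corresponding locus. The second constraint implies yₑ ≤ x_v for every edge e incident to v.)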

36

LP-Rounding Algorithm

• Theorem [Konwar et al.’04]: The LP-rounding algorithm finds a feasible solution at most O(m^(1/2) ln n) times larger than the optimum, where m is the maximum color class size and n is the number of nodes
• For primer selection, m ≤ L² ⇒ approximation factor is O(L ln n)
• Better approximation?
  – Unlikely for minimum multi-colored subgraph problem

(1) Solve linear programming relaxation

(2) Select node u with probability xᵤ

(3) Repeat step 2 O(ln(n)) times and return selected nodes
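Steps (2) and (3) can be sketched in Python as below. Step (1) is assumed to have been carried out already with some LP solver, so the fractional values are passed in as a dictionary. The function names, the edge encoding `(u, v, color)`, the default of ⌈ln n⌉ rounds (the slides only say O(ln n)), and the fixed random seed are all assumptions made for illustration.

```python
import math
import random


def lp_round(x_frac, num_rounds=None, rng=None):
    """Randomized rounding: in each round, select node u with probability x_frac[u]."""
    rng = rng or random.Random(0)  # fixed seed only to make the sketch reproducible
    n = len(x_frac)
    if num_rounds is None:
        num_rounds = max(1, math.ceil(math.log(n)))  # assumed constant for O(ln n)
    selected = set()
    for _ in range(num_rounds):
        for node, value in x_frac.items():
            if rng.random() < value:
                selected.add(node)
    return selected


def covers_all_colors(selected, edges):
    """Check feasibility: every color has an edge with both endpoints selected."""
    colors = {c for _, _, c in edges}
    covered = {c for u, v, c in edges if u in selected and v in selected}
    return covered == colors


# Toy instance: a hypothetical fractional solution and two colored edges.
x_frac = {"a": 1.0, "b": 1.0, "c": 0.0}
edges = [("a", "b", 0), ("a", "b", 1), ("b", "c", 1)]
chosen = lp_round(x_frac)
print(sorted(chosen), covers_all_colors(chosen, edges))
```

Since the rounding is randomized, a practical implementation would check feasibility with something like `covers_all_colors` and repeat if a color is left uncovered.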

37

Selection w/o Uniqueness Constraints

• Can be seen as a “simultaneous set covering” problem:
  – The ground set is partitioned into n disjoint sets, each with 2L elements
  – The goal is to select a minimum number of sets (== primers) that cover at least half of the elements in each partition
• Naïve modifications of the greedy set cover algorithm do not work
• Key idea: use potential function Φ to measure progress towards feasibility. For primer selection, potential function counts the total number of elements that remain to be covered
• Initially, Φ = nL
• For feasible solutions, Φ = 0

38

Greedy Approximation Algorithm

• Theorem: The greedy algorithm returns a feasible primer set whose size is at most 1 + ln Δ times larger than the optimum, where Δ is the maximum potential value decrease caused by a single primer
• For primer selection, Δ is equal to nL in the worst case, and is much smaller in practice
  – The number of primers selected by the greedy algorithm is at most a factor of 1 + ln(nL) larger than the optimum

Potential-Function Driven Greedy Algorithm

• Select a primer that decreases potential function Φ by the largest amount (breaking ties arbitrarily)
• Repeat until feasibility is achieved
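The greedy loop can be sketched on an abstract "simultaneous set covering" instance. The slides do not spell out how primers map to ground-set elements, so the sketch takes that mapping as input: `primer_cover` maps each primer to the `(group, element)` pairs it covers and `need` gives the number of elements required per group (L in the primer setting). Both names, and the toy data, are hypothetical.

```python
def potential(covered, need):
    """Total number of elements that still remain to be covered across all groups."""
    return sum(max(0, need[g] - len(covered[g])) for g in need)


def potential_greedy(primer_cover, need):
    """Pick, in each step, the primer that lowers the potential the most."""
    covered = {g: set() for g in need}
    chosen = []
    phi = potential(covered, need)
    while phi > 0:
        best, best_phi, best_cov = None, phi, None
        for primer, elems in primer_cover.items():
            trial = {g: set(s) for g, s in covered.items()}
            for g, e in elems:
                trial[g].add(e)
            p = potential(trial, need)
            if p < best_phi:
                best, best_phi, best_cov = primer, p, trial
        if best is None:
            raise ValueError("no primer decreases the potential; instance is infeasible")
        chosen.append(best)
        covered, phi = best_cov, best_phi
    return chosen


# Toy instance: two loci, each needing 2 covered elements (initial potential 4).
need = {"locus1": 2, "locus2": 2}
primer_cover = {
    "p1": {("locus1", 1), ("locus1", 2), ("locus2", 1)},
    "p2": {("locus2", 2)},
    "p3": {("locus1", 1)},
}
result = potential_greedy(primer_cover, need)
print(result)
```

Capping each group's contribution at its requirement is what distinguishes this from plain greedy set cover: elements covered beyond the required number in a group do not count as progress.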

39

Experimental Setting

• Datasets
  – Extracted from NCBI databases
  – Randomly generated using uniform distribution
• Compared algorithms
  – G-FIX: greedy primer cover algorithm of Pearson et al.
    • Primers restricted to be within L/2 bases of amplification locus
  – G-VAR: naïve modification of G-FIX
    • For each locus, first selected primer can be up to L bases away
    • If first selected primer is L₁ bases away from amplification locus, opposite sequence is truncated to a length of L − L₁
  – MIPS-PT: iterative beam-search heuristic of Souvenir et al.
  – G-POT: potential function driven greedy algorithm

40

Experimental Results, NCBI tests

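| # Targets | k | G-FIX (Pearson et al.) #Primers | G-FIX CPU sec | G-VAR (G-FIX with dynamic truncation) #Primers | G-VAR CPU sec | MIPS-PT (Souvenir et al.) #Primers | MIPS-PT CPU sec | G-POT (potential-function greedy) #Primers | G-POT CPU sec |
|---|---|---|---|---|---|---|---|---|---|
| 20 | 8 | 7 | 0.04 | 7 | 0.08 | 8 | 10 | 6 | 0.10 |
| 20 | 10 | 9 | 0.03 | 10 | 0.08 | 13 | 15 | 9 | 0.08 |
| 20 | 12 | 14 | 0.04 | 13 | 0.08 | 18 | 26 | 13 | 0.11 |
| 50 | 8 | 13 | 0.13 | 15 | 0.30 | 21 | 48 | 10 | 0.32 |
| 50 | 10 | 23 | 0.22 | 24 | 0.36 | 30 | 150 | 18 | 0.33 |
| 50 | 12 | 31 | 0.14 | 32 | 0.30 | 41 | 246 | 29 | 0.28 |
| 100 | 8 | 17 | 0.49 | 20 | 0.89 | 32 | 226 | 14 | 0.58 |
| 100 | 10 | 37 | 0.37 | 37 | 0.72 | 50 | 844 | 31 | 0.75 |
| 100 | 12 | 53 | 0.59 | 48 | 0.84 | 75 | 2601 | 42 | 0.61 |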

41

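[Figure: #primers as a percentage of 2n, plotted against n (l = 8)]

42

[Figure: #primers as a percentage of 2n, plotted against n (l = 10)]

43

[Figure: #primers as a percentage of 2n, plotted against n (l = 12)]

44

[Figure: CPU seconds plotted against n (l = 10)]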

45

Overview

• Generic Detection System Architecture
• The String Barcoding Problem
• Primer Set Selection for Multiplex PCR
• Conclusions

46

Conclusions

• Building the next generation of pathogen detection systems will require novel bioinformatics tools for genomic assay design, built around accurate mathematical models and powerful algorithmic techniques
• We have given improved algorithms for two critical optimizations: distinguisher selection for string barcoding, and primer set selection for multiplex PCR

47

Ongoing Work

• String Barcoding
  – Probe mixtures as distinguishers
  – Redundancy and error correcting properties
  – Simultaneous detection of multiple pathogens
• Primer Set Selection
  – Improved hybridization models
  – Practical validation
  – Degenerate primers
• Universal Tag array design
  – Tag selection (Ben-Dor’00)
  – Tag placement and embedding
  – Assignment of reporter probes to anti-tags
• Partitioning into multiple multiplexed PCR reactions and multiple Universal Tag array hybridizations (Aumann et al. WABI’03)

48

Acknowledgments

• B. DasGupta, K. Konwar, A. Russell, A. Shvartsman
• UCONN Research Foundation