Coding For DNA Computing: Combinatorial and Biophysical Aspects
Olgica Milenkovic, University of Colorado, Boulder
A Joint Work with Navin Kashyap, Queen’s University, Kingston
Outline
• The DNA Computing Paradigm
• Applications
• Error-Control Coding for DNA Computing
• Constrained Coding: DNA Secondary and Tertiary Structure
• Statistical Mechanics of DNA/RNA Folding
• Results and Open Problems
Molecular Biology: Terminology
• DNA Double Helix
• Watson-Crick Complements: A→T, G→C, T→A, C→G
• RNA: Single-Stranded, T Replaced by U
• Helix Denaturation (Governed by Ambient Temperature)
• DNA Oligonucleotide Sequences
• DNA Hybridization
• DNA Enzymes: Functional Proteins Operating on DNA
DNA Computing: Adleman’s Experiment (1994)
The Problem: An “Unremarkable” Instance of the Directed Traveling Salesman Problem on a Graph with Seven Nodes
Figures from Adleman, Scientific American, 1998
The Method: Remarkable Oligonucleotide DNA Hybridization Technique
• Cities (Nodes): Miami (CTACGG), NY (ATGCCG)
• Route (Edge): Second Half of Codeword for Miami (CGG) Followed by First Half of Codeword for NY (ATG): CGGATG
• Take the Complement of this Word: GCCTAC
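The edge encoding from the Miami/NY example can be sketched in a few lines. This is only an illustration of the slide's example; the helper names (`complement`, `reverse_complement`, `edge_word`) are assumptions introduced here, not part of Adleman's protocol.

```python
# Watson-Crick complement of each DNA base.
WC = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(word):
    """Base-by-base Watson-Crick complement (no reversal)."""
    return "".join(WC[base] for base in word)

def reverse_complement(word):
    """Complement read in the opposite direction."""
    return complement(word)[::-1]

def edge_word(src, dst):
    """Second half of the source codeword followed by the first half
    of the destination codeword (hypothetical helper)."""
    return src[len(src) // 2:] + dst[:len(dst) // 2]

miami, ny = "CTACGG", "ATGCCG"
edge = edge_word(miami, ny)
print(edge, complement(edge))  # CGGATG GCCTAC
```

The complemented edge word is what bridges the two city codewords during hybridization: its first half pairs with the end of the Miami strand and its second half with the start of the NY strand.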
DNA Computing: The Benefits
• Not a von Neumann Architecture: Stochastic Mechanism with Massive Parallelism: 1/50th of a Teaspoon, 10^14 Paths per Second
• Extremely Low Power Consumption: 1 Joule for 2 · 10^19 Operations
• Storage Capacity: Vol(1 g of DNA) = 1 cm^3, Information = 1 Trillion CDs; 18 Mb per Inch of Length (0.35 nm Between Base Pairs)
• Versatility of Applications, Only Plausible Option in Many Cases
Drawbacks:
• First Implementations not Interactive
• 3-Day Processing Delay
• VERY LOW RELIABILITY OF COMPUTATION
Applications of DNA Computers
Combinatorial Problems:
• Directed Traveling Salesman (Adleman ‘94)
• 3-SAT (Braich et al., ‘02). Input: a 20-Variable, 24-Clause Boolean Function in 3-Conjunctive Normal Form (3-CNF). For each Variable, two Length-15 DNA Sequences are Assigned, one Representing the Variable, the other Representing its Complement (Operon Technology, Alameda, CA; Integrated DNA Technologies, Skokie, IL)
• Non-Attacking Knights (Faulhammer, ’00): Configurations of Knights that can be Placed on an n×n Chess Board so that no Knight is Attacking any other Knight on the Board
Novel Designs of DNA Computers
DNA Logic and Automata: Interactive Systems
• DNA Transistors (Stojanovic, Stefanovic ‘03)
• DNA Game-Playing Machines (Stojanovic, Stefanovic ‘03)
Meet MAYA… (Stojanovic, Stefanovic 2003)
• MAYA Consists of Nine Wells (Tubes) Representing the 3x3 Tic-Tac-Toe Board
• Tubes Contain Mixtures of Enzymes: Network of 23 Molecular Logic Gates
• “Human Player” has Nine Different DNA Strands, each Specific to one Square on the Board
• Player Selects one Square to Play: DNA Strand Representing that Square gets Added to all Nine Wells
• MAYA “Analyzes” Play Through Biochemical Reactions Occurring in Wells
Figure: http://www.cs.unm.edu/~bandrews/ttt-applet/
The “Killer Application”: SMART DRUGS
E. Shapiro et al. (Weizmann Institute, Israel), Nature, Science 2003; Quintana et al. 2002
In Vitro DNA-Based Computer “Programmed” to Diagnose Cancer and “Order” Self-Destruction of Cells
• Software: DNA, Hardware: DNA Enzymes
• Identifies RNA Cancer Fingerprint Molecules: Cancer Leaves its own “Chemical Fingerprint” in the Body, Including Over-Producing or Under-Producing Specific RNA Sequences (Analysis Based on Regulatory Networks of Gene Interactions, Shmulevich et al., 2002) (Milenkovic and Vasic, DIMACS’2004, ITW’2004)
• Responds Appropriately by Releasing Short, Active DNA Strand
• Interferes with Tumors by Suppressing Key Cancer Genes, Making Diseased Cells Self-Destruct
• Experiments: Prostate and Lung Cancer Cells
Sensing, Storing, Nano-Scale Mechanics…
• Biosensing: DNA Fingerprinting of Bacteria/Viruses, Roco et al. 2004
• DNA-Based Storage Systems: Mansuripur et al., DIMACS’2004
• Nucleic Acid Nanostructures and Topology, DNA Self-Assembly, DNA Nanoscale Mechanical Devices, Seeman et al. 1998-2002
RELIABILITY ISSUES FOR ALL DESCRIBED SYSTEMS UNRESOLVED
Error Control Coding
Constrained Coding
Graph Theory/Combinatorics/Pseudo-Knot Theory
Statistical Mechanics
The Biggest Obstacles…
• DNA Oligonucleotide Secondary and Tertiary Structure Formation: DNA Oligonucleotide Sequences are Chemically Active, Tend to Assume Thermodynamically Most Stable Form!
• Unwanted Hybridization: DNA Sequences can Bind to Partially Complementary Sequences as Well!
DNA/RNA Secondary and Tertiary Structure
Secondary Structure Pseudoknots (Tertiary Structure)
Mneimneh, 2003 (Figures from Web Lecture Notes)
DNA Hairpins
DNA/RNA Hairpin Structures Participate in Important Biological Functions:
• Regulation of Gene Expression (Zazopoulos et al., 1997);
• DNA Recombination (Froelich-Ammon et al., 1994);
• Facilitation of Mutagenic Events (Trinh and Sinden, 1993): in a Living Cell, after Breaking of Intermolecular Pairing in Double Helix DNA, Loose Strands Form a DNA Hairpin;
• Potential Antisense Drug (Tang et al., 1993): Injecting into a Living Cell a Hairpin with Nucleic Acid Bases Complementary to an mRNA of a Disease Gene Blocks its Expression
DNA/RNA Knots
RNA Secondary Structure Influences Function of RNA: Knots are Special “Regulators”
Figures: Haslinger, 2001; Craven, 2001
Mathematical Formulation
Definition 1 (Haslinger, 2001): A Secondary Structure S is a Vertex-Labeled Graph on n Vertices, for which the Adjacency Matrix $A = (a_{i,j})$ has the following properties:
1. $a_{i,i+1} = 1$ for $1 \le i \le n-1$;
2. For each $i$ there is at most one $k \ne i-1, i+1$ such that $a_{i,k} = 1$;
3. If $a_{i,j} = a_{k,l} = 1$ and $i < k < j$, then $i < l < j$.
An Edge (i,j), |i-j|>1, is Called a Base-Pairing.
A Secondary Structure Can Consist of the Following Structural Elements:
1. A Stack Consists of Subsequent Base Pairs (p-k,q+k), (p-k+1,q+k-1),…,(p,q); k is the Length of the Stack
2. A Loop Consists of all Unpaired Vertices which are Immediately Interior to some Terminal Base Pair
3. An External Vertex is an Unpaired Vertex which does not Belong to a Loop
If Definition 1, Part 3 is Violated for a Base Pairing, then the Resulting Formation is Referred to as a Pseudoknot
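Part 3 of Definition 1 says that two base pairs (i,j) and (k,l) may be nested or disjoint but may not cross. A minimal sketch of that check follows; the function name and the representation of a structure as a list of index pairs are assumptions made for this illustration.

```python
def has_pseudoknot(pairs):
    """Return True if two base pairs cross, i.e. i < k < j < l.

    `pairs` is a list of (i, j) base pairings with |i - j| > 1 and
    each position used at most once (assumed, not checked here).
    """
    norm = sorted((min(i, j), max(i, j)) for i, j in pairs)
    for a, (i, j) in enumerate(norm):
        for k, l in norm[a + 1:]:
            # Pairs are sorted by opening position, so k >= i here.
            if i < k < j < l:
                return True
    return False

nested = [(1, 9), (2, 8), (4, 7)]   # a stack closing a hairpin
crossing = [(1, 5), (3, 8)]         # violates Part 3
print(has_pseudoknot(nested), has_pseudoknot(crossing))  # False True
```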
With Information about Energy of Pairings and Additional Measurements Regarding the DNA Backbone, Determining Stable Secondary Structures Becomes a Purely Combinatorial Problem
Secondary Structure Prediction: Dynamic Programming Approach, Polynomial Time: Nussinov’s and Zuker’s Algorithms
Pseudoknots: NP-Complete, Except for Special Class of H-Knots (Rivas, Eddy 2003)
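Nussinov’s Folding Algorithm
Sequence: $c_1, \ldots, c_n$, $c_i \in \{A, T, G, C\}$
Pairing Energy: $\alpha(c_i, c_j) \le 0$, with $\alpha(c_i, c_j) < 0$ for WC Complements; Simplest Choice: $\alpha(c_i, c_j) = -1$ if $(c_i, c_j)$ are WC Complements, $\alpha(c_i, c_j) = 0$ if not
Free Energy of Secondary Structure S: $E_{\text{free}} = E_{\text{av}}[S] - T\,H[S] = -T \log Z$
$E_{i,j}$: Free Energy of Secondary Structure Limited to Positions i, i+1, …, j
Recursion: $E_{i,j} = \min\{\, E_{i+1,j-1} + \alpha(c_i, c_j),\ \min_{i < k \le j} (E_{i,k-1} + E_{k,j}) \,\}$
Free Energy Table: Sequence CCCAAATGG
Figure: Mneimneh, 2003; Bundschuh, 2004
Feynman Diagrams for RNA Structure Prediction (Eddy, Rivas 2001)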
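A minimal sketch of a Nussinov-style table fill is given below, assuming the simplest energy model (pairing energy -1 for Watson-Crick complements, 0 otherwise) and the base-pairing condition |i-j|>1. Entry E[i][j] is the minimum energy of the subsequence from position i to j, obtained either by pairing the two end bases or by splitting the interval in two. Function and variable names are assumptions, and indices are 0-based.

```python
WC_PAIRS = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}

def alpha(a, b):
    """Pairing energy: -1 for Watson-Crick complements, 0 otherwise."""
    return -1 if (a, b) in WC_PAIRS else 0

def nussinov_table(seq):
    """Fill the upper-triangular free-energy table E[i][j], i <= j."""
    n = len(seq)
    E = [[0] * n for _ in range(n)]
    for span in range(1, n):              # span = j - i
        for i in range(n - span):
            j = i + span
            # Split the interval into [i, k-1] and [k, j].
            best = min(E[i][k - 1] + E[k][j] for k in range(i + 1, j + 1))
            # Pair the end bases (only allowed when |i - j| > 1).
            if span >= 2:
                best = min(best, E[i + 1][j - 1] + alpha(seq[i], seq[j]))
            E[i][j] = best
    return E

table = nussinov_table("CCCAAATGG")
print(table[0][-1])  # -3: two C-G pairs and one A-T pair
```

The fill takes O(n^3) operations for a sequence of length n, since each of the O(n^2) entries examines O(n) split points.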
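Statistical Physics: DNA Ensemble Analysis
Bundschuh, Hwa 2004
Partition Function: $Z(n) = \sum_{S \in \mathcal{S}(n)} \exp(-\beta E[S])$, $E[S] = \sum_{(i,j) \in S} \alpha(c_i, c_j)$, $\beta = 1/(k_B T)$, where $\mathcal{S}(n)$ is the Set of Secondary Structures on n Bases and $\alpha(c_i, c_j)$ is the Pairing Energy of Bases $c_i$ and $c_j$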
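The sum over all secondary structures can be evaluated by a recursion on the last base of an interval: either it is unpaired, or it pairs with some earlier base k and splits the interval into two independent parts. The sketch below assumes that only Watson-Crick complements pair, that every pair contributes the same energy, and that paired bases satisfy |i-j|>1; the function name and default parameters are illustrative choices, not values from the talk.

```python
import math
from functools import lru_cache

WC_PAIRS = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}

def partition_function(seq, beta=1.0, pair_energy=-1.0):
    """Sum of exp(-beta * E[S]) over all pseudoknot-free structures S."""
    weight = math.exp(-beta * pair_energy)   # Boltzmann factor of one pair

    @lru_cache(maxsize=None)
    def Z(i, j):
        if j - i < 2:                 # too short to hold any base pair
            return 1.0
        total = Z(i, j - 1)           # base j unpaired
        for k in range(i, j - 1):     # base j paired with base k <= j-2
            if (seq[k], seq[j]) in WC_PAIRS:
                total += Z(i, k - 1) * weight * Z(k + 1, j - 1)
        return total

    return Z(0, len(seq) - 1)

# With beta = 0 every structure has weight 1, so Z counts structures.
print(partition_function("AAT", beta=0.0))  # 2.0
```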
Bundschuh, Hwa 2004: Statistics of Secondary Structures in Ensemble of Long Random DNA Sequences
Why?
• Detection of Important Structural Components in mRNAs, Functional RNAs
• Characterization of the Response of Long Oligonucleotide DNA Molecule to Pulling Forces
Random DNA = Problem of Disordered Systems
Molten Phase: Absence of Disorder
$\theta_0 = 3/2$, $z_0(q) = 1 + 2 q^{1/2}$, $A_0(q) = [(1 + 2 q^{1/2}) / (4 \pi q^{3/2})]^{1/2}$
Thermodynamic Ensemble: Large Number of Different Secondary Structures with Equal Energy
Stability of Molten Phase: Use N-Replica Method
Glassy Phase: Few Low Energy Configurations in Thermodynamic Limit
Droplet Theory (Huse and Fisher): “Large-Scale Low-Energy Excitations” About Ground State
• Impose Deformation over a Length Scale L >> 1, Monitor Minimal Free Energy Cost of Deformation;
• Cost Expected to Scale as L^ω for Large L: Positive ω Indicates Deformation Cost Grows with Increasing Size. Negative ω Indicates Deformation Cost Decays: there is a Large Number of Configurations with Low Overlap with Ground State, whose Energies are Similar to the Ground State Energy in the Thermodynamic Limit (Zero-Temperature Behavior not Stable to Thermal Fluctuations; No Thermodynamic Glass Phase can Exist at any Finite Temperature)
Related Analysis: A. Pagnani, G. Parisi, and F. Ricci-Tersenghi, 2000/2001
The Stability of a Particular Secondary Structure is a Function of Several Constraints:
1) Number of GC versus AT/GT Base Pairs (Larger Number of Hydrogen Bonds Forms more Stable Structures)
2) Number of Base Pairs Forming a Stem Region (Presence of Long Subsequence and its Reverse Complement Leads to Stabilization)
3) Number of Base Pairs in a Hairpin (More than 15 or less than 4-7 Bases put “Stress” on the Loop)
4) Number of Unpaired Bases (More Unpaired Bases Lead to less Stable Structure)
Hybridization Constraints
Individual Sequence Constraints (Wood, Tsaftaris etc.):
IP1) The consecutive-bases constraint. Long Runs of the Same Base Forbidden.
IP2) The constant GC-content constraint. Introduced to Achieve Parallelized Operations on DNA Sequences; Assures Similar Thermodynamic (Melting Temperature) Characteristics of all Codewords. GC-Content Usually in the Range of 30-50% of Code Length.
Joint Sequence Constraints:
JP1) The Hamming distance constraint. Limits Unwanted Hybridizations between Codewords. Requirement is that all Distinct Pairs of Codewords p,q in C be at Hamming Distance at Least dmin. To Limit Undesired Hybridization between a Codeword and the Reverse-Complement of any other Codeword (including itself) the Reverse-Complement Hamming Distance has to be at Least dRCmin.
JP2) The frame-shift constraint. Applies Only to Limited Number of Problems. Refers to Requirement that Concatenation of Two or More Codewords should not Properly Contain Another Codeword.
JP3) The forbidden subsequence constraint. Specifies that a Class of Substrings Must not Occur in any Codeword or Concatenation of Codewords.
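The individual constraints IP1 and IP2 and the forbidden-subsequence constraint JP3 are simple to test mechanically. The sketch below is one possible set of checks; the function names, the two example codewords and the forbidden patterns are hypothetical, and JP3 is only checked on single codewords and on concatenations of two codewords.

```python
from itertools import groupby, product

def max_run_length(word):
    """Length of the longest run of one repeated base (IP1)."""
    return max((len(list(run)) for _, run in groupby(word)), default=0)

def gc_content(word):
    """Number of G and C bases in the word (IP2)."""
    return sum(base in "GC" for base in word)

def violates_forbidden(code, forbidden):
    """True if a forbidden substring occurs in a codeword or in a
    concatenation of two codewords (JP3, pairs only)."""
    words = list(code) + [a + b for a, b in product(code, repeat=2)]
    return any(pattern in word for word in words for pattern in forbidden)

code = ["ACGTCC", "TTGCAG"]                 # hypothetical codewords
print([max_run_length(w) for w in code])    # [2, 2]
print([gc_content(w) for w in code])        # [4, 3]
print(violates_forbidden(code, ["CT"]))     # True, at a junction
```

Checking pairs of codewords is enough whenever no forbidden pattern is longer than one codeword plus one base; longer patterns would need longer concatenations.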
Code Construction
• Approach I: Binary Mapping
• Approach II: Extended, Cyclic Goppa Codes over GF(4)
• Approach III: Hadamard Matrices with Cyclic Core
WHY Cyclic? Will Show that Computational Complexity for Nussinov’s Algorithm is Significantly Reduced in this Case
PRIOR WORK:
• Addressed 1/2/3 Requirements; No Families of Codes Given (Length Limited to 20);
• No Attempt Whatsoever to Consider Secondary Structure Constraints;
• References: Condon et al. 2000-2004; King 2003; Rykov 2003; Gaborit and King 2004; Ghrayeb et al. 2004
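Terminology
$q = q_1 q_2 \ldots q_n$, $p = p_1 p_2 \ldots p_n$, $q_i, p_i \in Q = \{A, T, G, C\}$
Reverse: $q^R = q_n q_{n-1} \ldots q_1$
Reverse Complement: $q^{RC} = \bar{q}_n \bar{q}_{n-1} \ldots \bar{q}_1$, where $\bar{q}_i$ is the Watson-Crick Complement of $q_i$
GC Content: $GC\_Content(p) = \#\{i : p_i \in \{G, C\}\}$
Watson-Crick Distance: $d_{WC}(p, q) = |\{i : p_i \ne \bar{q}_i\}|$
DNA Code C: Set of Codewords over Alphabet Q
Minimum Hamming, Reverse and Reverse-Complement Hamming Distance:
$d_H = \min_{p, q \in C,\ p \ne q} d_H(p, q)$
$d_H^R = \min_{p, q \in C} d_H(p, q^R)$
$d_H^{RC} = \min_{p, q \in C} d_H(p, q^{RC})$
$d_{WC} = \min_{p, q \in C,\ p \ne q} d_{WC}(p, q)$
Constant GC Content Code: $\forall p \in C: GC\_Content(p) = w$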
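The distances used throughout the talk can be computed directly: the Hamming distance counts differing positions, the reverse and reverse-complement distances compare a word with the reversed or reverse-complemented second word, and the Watson-Crick distance counts positions where one base is not the complement of the other. The sketch below uses assumed function names; the four Watson-Crick values in the tests are the ones quoted later in the slides for cyclic shifts of CCCAAATGG and GACAAAGGT.

```python
WC = {"A": "T", "T": "A", "G": "C", "C": "G"}

def hamming(p, q):
    """Number of positions in which two equal-length words differ."""
    assert len(p) == len(q)
    return sum(a != b for a, b in zip(p, q))

def complement(q):
    return "".join(WC[base] for base in q)

def reverse_complement(q):
    return complement(q)[::-1]

def wc_distance(p, q):
    """Number of positions i where p_i is not the complement of q_i."""
    return hamming(p, complement(q))

def min_distances(code):
    """Minimum Hamming, reverse and reverse-complement distances."""
    pairs = [(p, q) for p in code for q in code]
    d_h = min(hamming(p, q) for p, q in pairs if p != q)
    d_r = min(hamming(p, q[::-1]) for p, q in pairs)
    d_rc = min(hamming(p, reverse_complement(q)) for p, q in pairs)
    return d_h, d_r, d_rc

print(wc_distance("CCCAAATGG", "GCCCAAATG"))  # 7
print(min_distances(["AACC", "GGTT"]))        # (4, 4, 0)
```

The second example shows why the reverse-complement distance matters: AACC and GGTT are far apart in Hamming distance, yet each is the reverse complement of the other, so the two strands would hybridize perfectly.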
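Binary Mapping Approach
A → 00, T → 01, C → 10, G → 11
b(q): Binary Image of q, $b(q) = b_0 b_1 \ldots b_{2n-1}$
e(q): Even Subsequence of b(q), $b_0 b_2 \ldots b_{2n-2}$
o(q): Odd Subsequence of b(q), $b_1 b_3 \ldots b_{2n-1}$
E(C) = {e(q): q ∈ C}, O(C) = {o(q): q ∈ C}
Example: q = ACGTCC
b(q) = 001011011010
e(q) = 011011
o(q) = 001100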
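The mapping in the example (A→00, T→01, C→10, G→11, with e(q) taken from the first bit of every base and o(q) from the second) can be reproduced as follows. The function names are assumptions made for this sketch. Under this mapping the first bit of a base is 1 exactly for G and C, so the weight of e(q) equals the GC content of q.

```python
BITS = {"A": "00", "T": "01", "C": "10", "G": "11"}
BASES = {bits: base for base, bits in BITS.items()}

def binary_image(q):
    """b(q): concatenation of the two-bit images of the bases of q."""
    return "".join(BITS[base] for base in q)

def even_subsequence(q):
    """e(q): bits b_0, b_2, ... (the first bit of every base)."""
    return binary_image(q)[0::2]

def odd_subsequence(q):
    """o(q): bits b_1, b_3, ... (the second bit of every base)."""
    return binary_image(q)[1::2]

def from_even_odd(e, o):
    """Rebuild the DNA word from its even and odd subsequences."""
    return "".join(BASES[a + b] for a, b in zip(e, o))

q = "ACGTCC"
print(binary_image(q), even_subsequence(q), odd_subsequence(q))
# 001011011010 011011 001100
```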
Code D: [n,k,d],
Contains All-Ones Word
Construction:
DNA Code: Number of Codewords, Length 2n, Hamming and Reverse-Complement Hamming Distance at Least d
Longest Length Codes…
Bounds on A (Based on Bounds by Ashikhmin et al., 2005)
[Equations: bounds on A in terms of n and d; not recoverable from the extracted slide text]
Binary Mapping: Subcodes of Simplex Codes (All-Zero Not Allowed) -- EVEN
Special Subset of Codewords from Melas/Zetterberg Codes -- ODD
[Equations: recursively defined generator matrices G_i; not recoverable from the extracted slide text]
Extended Cyclic Goppa Codes
Approach:
• Take a Family of Reversible ($c \in C \Rightarrow c^R \in C$) Cyclic Codes
• Eliminate all Self-Reversible Codewords
• From Each Remaining Pair $(c, c^R)$ Retain Exactly One Codeword
• Complement Second Half of Each Codeword
Let $L = \{\alpha_1, \alpha_2, \ldots, \alpha_n\} \subseteq GF(q^m)$, for q a Power of a Prime, and Let g(z) be a Polynomial of Degree $\delta$ over $GF(q^m)$ such that g(z) has no Root in L. The Goppa Code $\Gamma(L)$ Consists of all Words $c = (c_1, c_2, \ldots, c_n)$, $c_i \in GF(q)$, such that
$\sum_{i=1}^{n} \frac{c_i}{z - \alpha_i} \equiv 0 \pmod{g(z)}$
$\Gamma(L)$ is a Code of Length n, Dimension $k \ge n - m\delta$ and Minimum Distance $d \ge \delta + 1$.
$g(z) = ((z - \beta_1)(z - \beta_2))^a$
Zhang et al., 1988
DNA Codes and Goppa Codes
A Reversible Cyclic Code of Dimension k over GF(q) Contains $q^{\lceil k/2 \rceil}$ Self-Reversible Codewords.
For Arbitrary Positive Integers a, m, there Exist DNA Codes D of Length n with M Codewords, with Guaranteed Lower Bounds on $d_H(D)$ and $d_H^{RC}(D)$.
[Equations: expressions for n and M in terms of a and m, and the bounds on $d_H(D)$ and $d_H^{RC}(D)$; not recoverable from the extracted slide text]
Choose Constant GC Content Subset of Codewords
Example: CGTTC, CAAAT, CTCCA, GCCTT, GGAGA, ACTAA
Complex (Generalized) Hadamard Matrices
$H \equiv H(n, C_m)$: Matrix of Dimension n×n over the Set of m-th Roots of Unity, $C_m = \{e^{2 \pi i l / m},\ l = 0, \ldots, m-1\}$, with Property $H H^* = n I$
Exponent Matrix: $E(n, Z_p)$ over $Z_p = \{0, 1, 2, \ldots, p-1\}$
Choose p=3, and Use only One of G/C
For any $k \in Z^+$, there Exist DNA Codes D with $M = 3^k - 1$ Codewords of Length $n = 3^k - 1$, with Constant GC-Content Equal to $3^{k-1}$ and
$d_H(D) = 2 \cdot 3^{k-1}$, $d_H^{RC}(D) \ge 3^{k-1}$
Each Codeword of such a Code is a Cyclic Shift of a Fixed Generator Codeword g.
Theorem [Heng et al., 02]: Let $N = p^k - 1$ for p Prime and k a Positive Integer. Let $g(x) = c_0 + c_1 x + c_2 x^2 + \ldots + c_{N-k} x^{N-k}$ be a Monic Polynomial over $Z_p$, of Degree N-k, such that $g(x) h(x) = x^N - 1$ over $Z_p$, for some Monic Irreducible Polynomial h(x) in $Z_p[x]$. Suppose that the Vector $(0, c_0, c_1, c_2, \ldots, c_{N-1})$, with $c_i = 0$ for $N-k < i < N$, has the Property that it Contains each Element of $Z_p$ the Same Number of Times. Then the N Cyclic Shifts of the Vector $(c_0, c_1, c_2, \ldots, c_{N-1})$ Form the Core of the Exponent Matrix of some Hadamard Matrix $H(p^k, C_p)$
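A small instance of this construction can be worked out for p=3, k=2, N=8. The sketch below divides x^8 - 1 by an assumed monic irreducible polynomial h(x) = x^2 + x + 2 over Z_3, pads the quotient g(x) with zeros to length N, and takes the N cyclic shifts as codewords. The choice of h(x), the symbol map 0→A, 1→T, 2→G and all function names are assumptions made for illustration.

```python
from itertools import product

def divide_by_monic(num, den, p):
    """Divide polynomials over Z_p (coefficients listed lowest degree
    first); the divisor must be monic. Returns (quotient, remainder)."""
    rem = [c % p for c in num]
    quot = [0] * (len(num) - len(den) + 1)
    for shift in range(len(quot) - 1, -1, -1):
        coef = rem[shift + len(den) - 1]
        quot[shift] = coef
        for t, d in enumerate(den):
            rem[shift + t] = (rem[shift + t] - coef * d) % p
    return quot, rem[:len(den) - 1]

p, k = 3, 2
N = p ** k - 1                               # N = 8
h = [2, 1, 1]                                # assumed h(x) = 2 + x + x^2
g, remainder = divide_by_monic([-1] + [0] * (N - 1) + [1], h, p)
generator = g + [0] * (N - len(g))           # c_i = 0 for N-k < i < N
shifts = [generator[-s:] + generator[:-s] for s in range(N)]

TO_DNA = "ATG"                               # assumed map 0->A, 1->T, 2->G
WC = {"A": "T", "T": "A", "G": "C", "C": "G"}
code = ["".join(TO_DNA[x] for x in word) for word in shifts]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def rev_comp(word):
    return "".join(WC[base] for base in reversed(word))

d_h = min(hamming(a, b) for a, b in product(code, repeat=2) if a != b)
d_rc = min(hamming(a, rev_comp(b)) for a, b in product(code, repeat=2))
gc = {sum(base in "GC" for base in word) for word in code}
print(code[0], d_h, d_rc, gc)
```

Because C is never used, the reverse complement of any codeword carries a C wherever the codeword had a G, which is what keeps the reverse-complement distance at or above the GC content.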
Hadamard and Vienna…
Vienna Package: T = 37 °C
http://www.tbi.univie.ac.at/~ivo/RNA/
Based on Nussinov’s Algorithm
Gives one Minimum Free Energy Secondary Structure
MFOLD (Zuker et al., 2000)
Why Cyclic Codes?
Let a DNA Code Consist of the Cyclic Shifts of a Codeword $(c_1, c_2, \ldots, c_n)$.
Provided that the free-energy table of $(c_1, c_2, \ldots, c_n)$ is known, the free-energy tables of all other codewords can be computed with a total of $O(n^3)$ operations only. More precisely, the free-energy table of the codeword $(c_n, c_1, \ldots, c_{n-1})$ can be obtained from the table of $(c_1, c_2, \ldots, c_n)$ in $O(n^2)$ steps.
C C C A A A T G G
C 0 0 0 0 0 0 -1 -2 -3
C 0 0 0 0 0 0 -1 -2 -2
C 0 0 0 0 0 0 -1 -2 -2
A 0 0 0 0 0 0 -1 -1 -1
A 0 0 0 0 0 0 -1 -1 -1
A 0 0 0 0 0 0 -1 -1 -1
T 0 0 0 0 0 0 0 0 0
G 0 0 0 0 0 0 0 0 0
G 0 0 0 0 0 0 0 0 0
G A C A A A G G T
G 0 0 -1 -1 -1 -1 -1 -1 -2
A 0 0 0 0 0 0 -1 -1 -2
C 0 0 0 0 0 0 -1 -1 -1
A 0 0 0 0 0 0 0 0 -1
A 0 0 0 0 0 0 0 0 -1
A 0 0 0 0 0 0 0 0 -1
G 0 0 0 0 0 0 0 0 0
G 0 0 0 0 0 0 0 0 0
T 0 0 0 0 0 0 0 0 0
d_WC(CCCAAATGG, GCCCAAATG) = 7; d_WC(GACAAAGGT, TGACAAAGG) = 9
d_WC(CCCAAATGG, GGCCCAAAT) = 6; d_WC(GACAAAGGT, GTGACAAAG) = 7
T1: Free Energy: -0.24 kcal/mol; T2: Free Energy: -0.19 kcal/mol
Energies Obtained from Vienna RNA Folding Package (I. Hofacker)
Why Binary Mapping?
    1  1  1  0  0  0  0  1  1
1   0 -1 -1 -1 -2 -2 -3 -4 -4
1   0  0 -1 -1 -2 -2 -3 -3 -4
1   0  0  0  0 -1 -1 -2 -3 -3
0   0  0  0  0 -1 -1 -2 -2 -3
0   0  0  0  0  0 -1 -1 -1 -2
0   0  0  0  0  0  0 -1 -1 -2
0   0  0  0  0  0  0  0  0 -1
1   0  0  0  0  0  0  0  0 -1
1   0  0  0  0  0  0  0  0  0

    1  0  1  0  1  0  1  1  0
1   0  0 -1 -1 -2 -2 -3 -3 -4
0   0  0  0 -1 -1 -2 -2 -3 -3
1   0  0  0  0 -1 -1 -2 -2 -3
0   0  0  0  0  0 -1 -1 -2 -2
1   0  0  0  0  0  0 -1 -1 -1
0   0  0  0  0  0  0  0 -1 -2
1   0  0  0  0  0  0  0 -1 -1
1   0  0  0  0  0  0  0  0  0
0   0  0  0  0  0  0  0  0  0
d_WC(101010110, 010101011) = 8, d_WC(101010110, 101010101) = 2

    1  1  0  0  1  0  1  0  0
1   0 -1 -1 -2 -2 -2 -3 -3 -4
1   0  0  0 -1 -2 -2 -2 -3 -3
0   0  0  0 -1 -1 -1 -2 -2 -3
0   0  0  0  0  0 -1 -1 -2 -2
1   0  0  0  0  0  0 -1 -1 -2
0   0  0  0  0  0  0  0 -1 -1
1   0  0  0  0  0  0  0  0 -1
0   0  0  0  0  0  0  0  0 -1
0   0  0  0  0  0  0  0  0  0
d_WC(110010100, 011001010) = 6, d_WC(110010100, 001100101) = 6
What Type of Sequences Minimize the Entry $E_{1,n}$? Cyclic Shifts with a Minimized Set $\{i : WC(c_i) = c_{i+k},\ k = 1, 2, \ldots, m\}$
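The Cyclic Distance (Binary Case)
$S = (s_1, s_2, \ldots, s_n)$, $S^{(i)} = (s_{n-i+1}, \ldots, s_n, s_1, \ldots, s_{n-i})$
$d_{cyc}(S) = \min_{1 \le i < n} d_H(S, S^{(i)})$
Known (Peng, 1998):
$d_{cyc}(S) \le n/2$ for $n = 4k$; $(n-1)/2$ for $n = 4k+1$; $(n-2)/2$ for $n = 4k+2$; $(n+1)/2$ for $n = 4k+3$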
Achieved: Maximum Length Shift Register (MLSR) Sequences (Pseudo-Random Sequences in General)
Sequence Weight: w = n/2, n even; w = (n-1)/2, n odd
What are the Reversal Distance Properties of MLSR Sequences?
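The cyclic distance of a binary sequence, the smallest Hamming distance between the sequence and any of its nontrivial cyclic shifts, can be checked numerically on a short maximum-length shift register sequence. The sketch below generates a length-7 sequence from the recurrence s[t+3] = s[t+1] XOR s[t]; that recurrence, the seed and the function names are assumptions chosen for this example.

```python
def cyclic_distance(s):
    """Minimum Hamming distance between s and its nontrivial cyclic shifts."""
    n = len(s)
    return min(
        sum(a != b for a, b in zip(s, s[-i:] + s[:-i]))
        for i in range(1, n)
    )

def mlsr_sequence(length=7, seed=(1, 0, 0)):
    """One period of the sequence s[t+3] = s[t+1] XOR s[t]."""
    s = list(seed)
    while len(s) < length:
        s.append(s[-2] ^ s[-3])
    return s

m_seq = mlsr_sequence()
print(m_seq, cyclic_distance(m_seq))        # [1, 0, 0, 1, 0, 1, 1] 4
print(cyclic_distance([1, 1, 1, 0, 0, 0, 0]))  # 2
```

For n = 7 the shift register sequence reaches cyclic distance 4 = (n+1)/2, while a sequence of the same length made of one block of ones and one block of zeros only reaches 2.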
Watson-Crick Distance: Plotkin-Type of Bound
Code with M Codewords of Length n and Minimum Watson-Crick Distance $d_{WC}$:
$M(M-1)\, d_{WC} \le \sum_{u \ne v} d_{WC}(u, v)$
Column Contents: $x_A^i$, $x_T^i$, $x_G^i$, $x_C^i$ = A, T, G, C Content of Column i
[Equations: evaluation of the sum through the column contents and the resulting bound on M; not recoverable from the extracted slide text]
Classical: $M \le 2d/(2d - n)$, $d > n/2$
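The Free Energy of a DNA Strand $(u_1, u_2, \ldots, u_n)$ can be Approximated According to Breslauer’s Formula
$E_{\text{free}} = \text{correction} + \sum_{i=1}^{n-1} \omega(u_i, u_{i+1})$
Much more Accurate:
$E_{\text{free}} = \text{correction} + \sum_{i=1}^{n-1} \omega_1(u_i, u_{i+1}) + \sum_{i=1}^{n-2} \omega_2(u_i, u_{i+2}) + \ldots + \sum_{i=1}^{n-m} \omega_m(u_i, u_{i+m})$, where $\omega_1, \ldots, \omega_m$ are Weights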
Other Coding Problems
• Generalized de Bruijn Sequences
• Association Schemes for Hamming/RC Hamming/Constant GC Content
• Binary Mapping Approach with Runlength Constraints
• Forbidden Pattern Constraints (Enumeration Techniques by Goulden and Jackson…)
• Catalan Numbers: $CN(m) = \frac{1}{m+1} \binom{2m}{m}$
  b=1: CN(1)=1: ( )
  b=2: CN(2)=2: ( ) ( ), ( ( ) )
  b=3: CN(3)=5: ( ) ( ) ( ), ( ( ) ( ) ), ( ( ) ) ( ), ( ) ( ( ) ), ( ( ( ) ) )
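The Catalan numbers and the bracket patterns listed for b = 1, 2, 3 can be generated as follows. Each balanced bracket string corresponds to one non-crossing way of pairing up 2m positions, which is the link to pseudoknot-free structures. The function names are assumptions, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

def catalan(m):
    """CN(m) = C(2m, m) / (m + 1)."""
    return comb(2 * m, m) // (m + 1)

def balanced(m):
    """Yield every balanced bracket string with m pairs."""
    if m == 0:
        yield ""
        return
    for k in range(m):
        # The first bracket encloses k pairs; the rest follow it.
        for inner in balanced(k):
            for rest in balanced(m - 1 - k):
                yield "(" + inner + ")" + rest

print([catalan(m) for m in (1, 2, 3)])  # [1, 2, 5]
print(sorted(balanced(2)))              # ['(())', '()()']
```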