The Good Aspects Rates of Profile Vectors Recent Papers
Notes on Recent Works on DNA Coding
Frederic Ezerman
Senior Research Fellow
School of Physical and Mathematical Sciences
Nanyang Technological University, Singapore.
Workshop at Univ. Gadjah Mada, Indonesia
14 March 2018
The Good Aspects
Rates of Profile Vectors
Recent Papers
A Promotional Video
Let’s start by watching a promotional video.
Link: https://www.youtube.com/watch?v=r8qWc9X4f6k
Capacity
Challenges not Mentioned
1. Designing Random Access Systems.
2. Reducing Synthesis and Sequencing Costs.
3. Improving Security in Data Processing.
Making Memory
Nature vol. 537, Sept. 2016, p. 24.
1. We start with introductory material from Han Mao Kiah’s slides. More materials are available on his webpage http://www.ntu.edu.sg/home/hmkiah/publications.html.
2. Then we summarize the contribution of “Rates of DNA Sequence Profiles for Practical Values of Read Lengths.”
DNA-based Storage Architecture
Digital Information
DNA Coding
DNA codewords or templates
Synthesis
DNA Strings for Storage
Editing / Reading via High Throughput Sequencing
Church et al. (2012) “Next-Generation Digital Information Storage in DNA”
Encoding
Binary (ASCII) strings to DNA codewords
Avoids homopolymer runs of length greater than three
Balances GC content and avoids secondary structures
Goldman et al. (2013) “Towards practical, high-capacity, low-maintenance information storage in synthesized DNA”
Encoding
Binary (ASCII) strings to ternary strings to DNA codewords
Avoids homopolymers
Four-fold coverage and single parity-check: error-correction
Code design
Coding philosophy: do not use all possible words; choose a subset that satisfies certain criteria. Choose words that
• are “far from each other”, e.g. single parity-check codes
• satisfy certain “constraints”, e.g. avoid homopolymers, balanced GC content…
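The two kinds of constraints named above can be checked mechanically. The following is a minimal sketch, not taken from the slides: the function names are hypothetical, the run limit of three follows the Church et al. slide, and the GC window of 40% to 60% is an assumed illustration value.

```python
from itertools import groupby


def max_homopolymer_run(word: str) -> int:
    """Length of the longest run of identical consecutive symbols."""
    return max((len(list(run)) for _, run in groupby(word)), default=0)


def gc_fraction(word: str) -> float:
    """Fraction of symbols that are G or C (0.0 for the empty word)."""
    if not word:
        return 0.0
    return sum(base in "GC" for base in word) / len(word)


def satisfies_constraints(word: str, max_run: int = 3,
                          gc_low: float = 0.4, gc_high: float = 0.6) -> bool:
    """True if the word has no long homopolymer run and a balanced GC content.

    max_run, gc_low and gc_high are illustration values (assumptions).
    """
    return (max_homopolymer_run(word) <= max_run
            and gc_low <= gc_fraction(word) <= gc_high)


if __name__ == "__main__":
    for candidate in ["ACGTACGT", "AAAACGCG", "ATATATAT"]:
        print(candidate, satisfies_constraints(candidate))
```

A code in the sense of the slide would then be a subset of the words that pass this filter, chosen so that its members are also far apart.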
DNA code design
Error control
Random access
“Efficient” sequence assembly
Error control for nanopore sequencers
and other possibilities??
Sequence Assembly
Sequence Assembly Problem
Problem: Sequencing is computationally demanding. Need to stitch together many short reads to obtain the original sequence.
Objective: Design a code that uses the information on short reads / substrings directly, without the need to stitch them together.
Design of Words for Sequence Assembly
Design criteria (Kiah et al. 2015):
Choose words whose profiles are far from each other
Choose words whose profiles obey some constraints
[Figure: the words ACGCCGCCGACACGACACCA and GTACTTACTTTTACTTTACT, each shown with its overlapping 3-grams, followed by an output profile of 3-grams.]
Design of Words for Sequence Assembly
Design criteria (Kiah et al. 2015):
Choose words whose profiles are far from each other
Choose words whose profiles obey some constraints
Compute the number of words with distinct profiles
(asymptotic up to a constant factor) and the number
of words whose profiles are at a fixed distance apart.
Enumeration: Ehrhart polynomials (Ehrhart, 1962),
Code construction: Varshamov codes for asymmetric
channels (Varshamov, 1973)
Open problem: Enumeration of profiles for a different
regime of parameters.
Compute bounds on the number of words with
distinct profiles (asymptotic up to a constant factor)
and the number of words whose profiles are at a
fixed distance apart.
NEXT: Enumeration of profiles for a different regime
of parameters.
Let’s get technical…
Notation
Example word ACGCCGCCGACACGACACCA, with substring CCG
n = length of word (here n = 20)
ℓ = length of substring / gram, aka ℓ-gram (here ℓ = 3)
q = size of alphabet (here q = 4)
Profile Vector
x = ACGCCGCCGACACGACACCA, with q = 4, n = 20, ℓ = 3
[Figure: the word x shown with its 18 overlapping 3-grams.]
p(x, ℓ) = multiplicity vector that represents this unordered set of substrings
p(x, ℓ) = (ACA: 2, ACC: 1, ACG: 2, CAC: 2, CCA: 1, CCG: 2, CGA: 2, CGC: 2, GAC: 2, GCC: 2)
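The profile vector of the example word can be recomputed directly. This is a minimal sketch; the function name `profile` is hypothetical, and the vector is stored as a `Counter` that lists only the ℓ-grams that actually occur.

```python
from collections import Counter


def profile(word: str, ell: int) -> Counter:
    """Multiplicity of every length-ell substring (ell-gram) of word."""
    return Counter(word[i:i + ell] for i in range(len(word) - ell + 1))


if __name__ == "__main__":
    x = "ACGCCGCCGACACGACACCA"
    p = profile(x, 3)
    # A word of length n has n - ell + 1 overlapping ell-grams.
    print(sum(p.values()))
    for gram in sorted(p):
        print(gram, p[gram])
```

The order in which the 3-grams occur in x is discarded, which is exactly why two different words can share a profile.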
Different Words may have the same Profile Vector
x = ACAC, y = CACA, with q = 4, n = 4, ℓ = 3
p(x, ℓ) = (ACA: 1, CAC: 1)
p(y, ℓ) = (ACA: 1, CAC: 1)
Equivalence Classes
Define an equivalence relation on q-ary words of length n by x ~ y if p(x, ℓ) = p(y, ℓ).
$P_q(n, \ell)$ = number of equivalence classes
$R_q(n, \ell) = \frac{\log_q P_q(n, \ell)}{n}$ = rate
Previous Result (Jacquet et al. 2012, Kiah et al. 2015): for fixed q and ℓ,
$P_q(n, \ell) = \Theta\left(n^{q^{\ell} - q^{\ell-1}}\right)$,
$\lim_{n \to \infty} R_q(n, \ell) = 0$.
What is $P_q(n, n)$?
$P_q(n, n) = q^n$, or $R_q(n, n) = 1$.
Example: for q = 4 and n = ℓ = 3, the word ACA has the single 3-gram ACA.
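For very small parameters the number of equivalence classes and the rate can be found by exhaustive enumeration. The sketch below is only an illustration under that assumption (it visits all q^n words, so it is not the method of the papers cited); the names `profile_key`, `count_profiles` and `rate` are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log


def profile_key(word, ell):
    """Hashable form of the profile vector of a word (string or tuple)."""
    counts = Counter(word[i:i + ell] for i in range(len(word) - ell + 1))
    return tuple(sorted(counts.items()))


def count_profiles(q: int, n: int, ell: int) -> int:
    """Number of distinct profile vectors over all q-ary words of length n."""
    return len({profile_key(w, ell) for w in product(range(q), repeat=n)})


def rate(q: int, n: int, ell: int) -> float:
    """log_q of the number of profile vectors, divided by n."""
    return log(count_profiles(q, n, ell), q) / n


if __name__ == "__main__":
    print(count_profiles(2, 4, 3), rate(2, 4, 3))
    print(count_profiles(4, 3, 3), rate(4, 3, 3))
```

For q = 2, n = 4, ℓ = 3 the count is 15 rather than 16, because 0101 and 1010 share a profile, the binary analogue of ACAC and CACA.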
Assumptions and Known Results
Fix q ≥ 2.
1. Let µ be the Möbius function.
2. Let b + 1 < a and let $R_q(n, b)$ be the set of all q-ary strings of length n that have no run of b consecutive zeroes.
3. Codes satisfying this constraint have been extensively studied. In particular, the size of $R_q(n, b)$ and the recursive formula for $F_q(n, b) \triangleq |R_q(n, b)|$ are well known:
$$F_q(n, b) = \begin{cases} q^n, & \text{if } n < b, \\ (q-1) \sum_{i=1}^{b} F_q(n-i, b), & \text{otherwise.} \end{cases}$$
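The recursion for the number of q-ary strings with no run of b consecutive zeroes is easy to evaluate and to cross-check. A minimal sketch, assuming b ≥ 1; the function names `F` and `F_brute` are hypothetical, and the brute-force count is only feasible for small n.

```python
from functools import lru_cache
from itertools import product


@lru_cache(maxsize=None)
def F(q: int, n: int, b: int) -> int:
    """Number of q-ary strings of length n with no run of b zeroes (b >= 1)."""
    if n < b:
        return q ** n  # too short to contain a run of b zeroes
    return (q - 1) * sum(F(q, n - i, b) for i in range(1, b + 1))


def F_brute(q: int, n: int, b: int) -> int:
    """Same count by direct enumeration, for checking small cases."""
    zeros = (0,) * b
    return sum(
        all(w[i:i + b] != zeros for i in range(n - b + 1))
        for w in product(range(q), repeat=n)
    )


if __name__ == "__main__":
    for n in range(8):
        print(n, F(2, n, 2), F_brute(2, n, 2))
```

For q = 2 and b = 2 the values follow the Fibonacci pattern 1, 2, 3, 5, 8, ..., as expected for binary strings without two consecutive zeroes.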
Main Contributions
Efficient encoding and decoding algorithms are given for all results below, except the one in Eq. (4), where the bound is established by counting the number of partial de Bruijn sequences.
• If $\ell \le n < 2\ell$, then
$$P_q(n, \ell) = q^n - \sum_{r \mid n-\ell+1} \sum_{t \mid r} \left(\frac{r-1}{r}\right) \mu\left(\frac{r}{t}\right) q^t > q^{n-1}(q-1). \quad (1)$$
• Choose a and m such that $2a \le \ell$ and $m \le q^{a-1}$. If $n = m\ell$, then
$$P_q(n, \ell) \ge (q-1)^{m(\ell-a)}. \quad (2)$$
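The exact count in Eq. (1) can be evaluated with integer arithmetic. The sketch below reads Eq. (1) as q^n minus the sum, over divisors r of n − ℓ + 1 and divisors t of r, of ((r − 1)/r) µ(r/t) q^t; that reading, and the helper names `mobius`, `divisors` and `exact_profile_count`, are assumptions of this example.

```python
def mobius(n: int) -> int:
    """Moebius function by trial division."""
    result = 1
    p = 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0  # n has a squared prime factor
            result = -result
        p += 1
    if n > 1:
        result = -result
    return result


def divisors(n: int) -> list:
    """All positive divisors of n, in increasing order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def exact_profile_count(q: int, n: int, ell: int) -> int:
    """Value of Eq. (1), valid only for ell <= n < 2 * ell."""
    if not ell <= n < 2 * ell:
        raise ValueError("Eq. (1) requires ell <= n < 2 * ell")
    total = q ** n
    for r in divisors(n - ell + 1):
        inner = sum(mobius(r // t) * q ** t for t in divisors(r))
        # inner is r times the number of aperiodic necklaces of length r,
        # so the division by r below is exact.
        total -= (r - 1) * inner // r
    return total


if __name__ == "__main__":
    for q, n, ell in [(2, 3, 2), (2, 4, 3), (2, 5, 3), (4, 20, 20)]:
        print(q, n, ell, exact_profile_count(q, n, ell))
```

For n = ℓ only r = 1 contributes, and its term vanishes, which recovers the case where every word has its own profile.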
Main Contributions Continued
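• Choose b, a, and m such that $2a \le \ell$, $b + 1 < a$, and $m \le F_q(a-b-1, b)$. If $n = m\ell$, then
$$P_q(n, \ell) \ge \left(F_q(n-a-1, b)\,(q-1)\right)^m. \quad (3)$$
• If $n \ge \ell$, then
$$P_q(n, \ell) > \frac{1}{n}\left[\sum_{t \mid n} \mu\left(\frac{n}{t}\right) q^t - \binom{n}{2} q^{n-\ell+1}\right]. \quad (4)$$
Suppose further that $\ell \ge 2\log_q n + 2$ and $n \ge 8$. Then $P_q(n, \ell) > q^{n-1}/n$. Hence, for all $0 < \kappa < 1$, we have $P_q(n, \ell) \ge q^{\kappa n}$ for sufficiently large n.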
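The last bound can be checked numerically for practical parameters. This sketch assumes Eq. (4) is read with the binomial term inside the 1/n factor, i.e. n times the bound equals the sum of µ(n/t) q^t over divisors t of n, minus C(n, 2) q^(n−ℓ+1); the function names are hypothetical.

```python
from math import comb, log


def mobius(n: int) -> int:
    """Moebius function by trial division."""
    result = 1
    p = 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0
            result = -result
        p += 1
    return -result if n > 1 else result


def n_times_bound(q: int, n: int, ell: int) -> int:
    """n times the right-hand side of Eq. (4), as an exact integer."""
    aperiodic = sum(mobius(n // t) * q ** t
                    for t in range(1, n + 1) if n % t == 0)
    return aperiodic - comb(n, 2) * q ** (n - ell + 1)


def rate_lower_bound(q: int, n: int, ell: int):
    """log_q of the bound divided by n, or None if the bound is not positive."""
    value = n_times_bound(q, n, ell)
    if value <= 0:
        return None
    return (log(value) - log(n)) / (n * log(q))


if __name__ == "__main__":
    # ell = 12 satisfies ell >= 2 * log_4(1000) + 2, which is about 11.97.
    print(n_times_bound(4, 1000, 12) > 4 ** 999)
    print(rate_lower_bound(4, 1000, 12))
```

Comparing `n_times_bound` with q^(n−1) tests the claim P_q(n, ℓ) > q^(n−1)/n without any floating-point arithmetic.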
Rate for fixed ℓ = 20
The rates $R_4(n, 20)$ for $20 \le n \le 10^8$
Rate for fixed ℓ = 100
The rates $R_4(n, 100)$ for $100 \le n \le 10^8$
Rate for fixed n = 1000
The rates $R_4(10^3, \ell)$ for $1 \le \ell \le 10^3$
Rate for fixed n = 10^6
The rates $R_4(10^6, \ell)$ for $1 \le \ell \le 10^6$
Fountain Codes meet DNA
LT Codes (M. Luby, 2002), Raptor Codes (A. Shokrollahi, 2006).
Erlich and Zielinski, Science vol. 355, 2 March 2017.
Portable and Error-Free
Yazdi, Gabrys, and Milenkovic, Scientific Reports, July 2017.
Encoding Movie into Living! Bacteria
Shipman, Nivala, Macklis, and Church, Nature vol. 547, 20 July 2017.
Synthetic DNA as Attack Vectors in Crypto
P. Ney, K. Koscher, L. Organick, L. Ceze, T. Kohno, “Computer Security, Privacy, and DNA Sequencing: Compromising Computers with Synthesized DNA, Privacy Leaks, and More,” USENIX Security Symposium, 2017.
• Use synthetic DNA to load malicious code into computer systems.
• The target systems are ones that process DNA sequencing data.
• The code took control of the host system.
• Not so practical yet, but a proof of concept of such attacks.
Some References
• O. Milenkovic, Lecture video at https://www.youtube.com/watch?v=N7zJLSEZKYQ.
• Z. Chang, J. Chrisnata, M. F. Ezerman, and H. M. Kiah, “Rates of DNA Sequence Profiles for Practical Values of Read Lengths,” IEEE Trans. Inform. Theory, vol. 63, no. 11, pp. 7166–7177, Nov. 2017.
• G. M. Church, Y. Gao, and S. Kosuri, “Next-generation digital information storage in DNA,” Science, vol. 337, no. 6102, p. 1628, 2012.
• N. Goldman, P. Bertone, S. Chen, C. Dessimoz, E. M. LeProust, B. Sipos, and E. Birney, “Towards practical, high-capacity, low-maintenance information storage in synthesized DNA,” Nature, vol. 494, pp. 77–80, 2013.
More References
• R. N. Grass, R. Heckel, M. Puddu, D. Paunescu, and W. J. Stark, “Robust chemical preservation of digital information on DNA in silica with error-correcting codes,” Angewandte Chemie Int. Ed., vol. 54, no. 8, pp. 2552–2555, 2015.
• S. Yazdi, Y. Yuan, J. Ma, H. Zhao, and O. Milenkovic, “A rewritable, random-access DNA-based storage system,” Scientific Reports, vol. 5, no. 14138, 2015.
• S. Yazdi, H. M. Kiah, E. R. Garcia, J. Ma, H. Zhao, and O. Milenkovic, “DNA-based storage: Trends and methods,” IEEE Trans. Molecular, Biological, Multi-Scale Commun., vol. 1, no. 3, pp. 230–248, 2015.
Still More...
• J. Bornholt, R. Lopez, D. M. Carmean, L. Ceze, G. Seelig, and K. Strauss, “A DNA-based archival storage system,” in Proc. 21st Int. Conf. Architectural Support for Prog. Languages and Operating Systems, 2016, pp. 637–649.
• Y. Erlich and D. Zielinski, “DNA Fountain enables a robust and efficient storage architecture,” Science, vol. 355, no. 6328, pp. 950–954, 2017.
• S. Yazdi, R. Gabrys, and O. Milenkovic, “Portable and error-free DNA-based data storage,” Scientific Reports, vol. 7, no. 5011, 2017.
• H. M. Kiah, G. J. Puleo, and O. Milenkovic, “Codes for DNA sequence profiles,” IEEE Trans. Inform. Theory, vol. 62, no. 6, pp. 3125–3146, 2016.