Notes on Recent Works on DNA Coding - NTU Singapore

The Good Aspects Rates of Profile Vectors Recent Papers Notes on Recent Works on DNA Coding Frederic Ezerman Senior Research Fellow School of Physical and Mathematical Sciences Nanyang Technological University, Singapore. [email protected] Workshop at Univ. Gadjah Mada Indonesia 14 March 2018


Notes on Recent Works on DNA Coding

Frederic Ezerman

Senior Research Fellow
School of Physical and Mathematical Sciences
Nanyang Technological University, Singapore.

[email protected]

Workshop at Univ. Gadjah Mada, Indonesia
14 March 2018


The Good Aspects

Rates of Profile Vectors

Recent Papers


A Promotional Video

Let’s start by watching a promotional video.

Link: https://www.youtube.com/watch?v=r8qWc9X4f6k


Capacity


Challenges not Mentioned

1. Designing Random Access Systems.

2. Reducing Synthesis and Sequencing Costs.

3. Improving Security in Data Processing.


Making Memory

Nature vol. 537, Sept. 2016, p. 24.


1. We start with introductory material from Han Mao Kiah’s slides. More materials are available on his webpage http://www.ntu.edu.sg/home/hmkiah/publications.html.

2. Then we summarize the contribution of “Rates of DNA Sequence Profiles for Practical Values of Read Lengths.”


DNA-based Storage Architecture

Digital Information → DNA Coding → DNA codewords or templates → Synthesis → DNA Strings for Storage → Editing / Reading via High Throughput Sequencing


Church et al. (2012) “Next-Generation Digital Information Storage in DNA”

Encoding

• Binary (ASCII) strings to DNA codewords

• Avoids homopolymer runs of length greater than three

• Balances GC content and avoids secondary structures
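The homopolymer constraint can be illustrated with a small sketch. It assumes a one-bit-per-base mapping (0 to A or C, 1 to G or T) in the spirit of the Church et al. scheme; the function names, the random tie-breaking and the fixed seed are illustrative assumptions, and GC balance and secondary structures are not handled here.

```python
import random
from itertools import groupby

# Assumed mapping: each bit has two candidate bases, so a forbidden
# base can always be avoided.
CHOICES = {"0": "AC", "1": "GT"}
BASE_TO_BIT = {"A": "0", "C": "0", "G": "1", "T": "1"}


def encode_bits(bits, max_run=3, seed=0):
    """Map a bit string to DNA with no homopolymer run longer than max_run."""
    rng = random.Random(seed)
    out = []
    for bit in bits:
        options = list(CHOICES[bit])
        # If the last max_run bases are identical, exclude that base.
        if len(out) >= max_run and len(set(out[-max_run:])) == 1:
            options = [base for base in options if base != out[-1]]
        out.append(rng.choice(options))
    return "".join(out)


def decode_bases(dna):
    """Recover the bit string; the choice between the two bases carries no data."""
    return "".join(BASE_TO_BIT[base] for base in dna)


def longest_run(s):
    """Length of the longest run of identical symbols in s."""
    return max((len(list(group)) for _, group in groupby(s)), default=0)


if __name__ == "__main__":
    bits = "0" * 12 + "1" * 12
    dna = encode_bits(bits)
    print(dna, longest_run(dna), decode_bases(dna) == bits)
```

The price of this simplicity is the rate: each base carries one bit instead of the two bits a quaternary symbol could hold.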


Goldman et al. (2013) “Towards practical, high-capacity, low-maintenance information storage in synthesized DNA”

Encoding

• Binary (ASCII) strings to ternary strings to DNA codewords

• Avoids homopolymers

• Error correction: four-fold coverage and a single parity check
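The binary-to-ternary-to-DNA idea can be sketched as follows. Each ternary digit selects one of the three bases that differ from the previous base, so no base is ever repeated. The rotation table `NEXT`, the starting base and the fixed six-trits-per-byte conversion are assumptions made for this illustration, not details taken from the slides.

```python
# For each previous base, the three allowed next bases indexed by trit 0, 1, 2.
# Assumed table; any table that never repeats the previous base would do.
NEXT = {"A": "CGT", "C": "GTA", "G": "TAC", "T": "ACG"}
TRITS_PER_BYTE = 6  # 3**6 = 729 >= 256, a simplification assumed here


def bytes_to_trits(data):
    """Write every byte as six base-3 digits, most significant first."""
    trits = []
    for byte in data:
        digits = []
        for _ in range(TRITS_PER_BYTE):
            byte, digit = divmod(byte, 3)
            digits.append(digit)
        trits.extend(reversed(digits))
    return trits


def trits_to_dna(trits, prev="A"):
    """Each trit picks a base different from the previous one."""
    out = []
    for trit in trits:
        prev = NEXT[prev][trit]
        out.append(prev)
    return "".join(out)


def dna_to_trits(dna, prev="A"):
    """Invert trits_to_dna by looking up each base in the row of its predecessor."""
    trits = []
    for base in dna:
        trits.append(NEXT[prev].index(base))
        prev = base
    return trits


def trits_to_bytes(trits):
    """Invert bytes_to_trits."""
    out = bytearray()
    for i in range(0, len(trits), TRITS_PER_BYTE):
        value = 0
        for digit in trits[i:i + TRITS_PER_BYTE]:
            value = value * 3 + digit
        out.append(value)
    return bytes(out)


if __name__ == "__main__":
    dna = trits_to_dna(bytes_to_trits(b"DNA"))
    print(dna, trits_to_bytes(dna_to_trits(dna)))
```

Because consecutive bases always differ, homopolymers are excluded by construction rather than checked afterwards.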


Code design

Coding philosophy: do not use all possible words; choose a subset that satisfies certain criteria. Choose words that

• are “far from each other”, e.g. single parity-check codes

• satisfy certain “constraints”, e.g. avoid homopolymers, balanced GC content…

DNA code design

• Error control

• Random access

• “Efficient” sequence assembly

• Error control for nanopore sequencers

• and other possibilities?
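As a small illustration of the “far from each other” criterion, the sketch below builds a single parity-check code over an alphabet of size 4 and confirms by exhaustive search that any two codewords differ in at least two positions. The symbol-to-base map `BASES` and the function names are assumptions made for this example.

```python
from itertools import combinations, product

BASES = "ACGT"  # assumed identification of 0, 1, 2, 3 with nucleotides


def parity_check_code(q, k):
    """All words of length k + 1 over Z_q whose symbols sum to 0 modulo q."""
    return [word + ((-sum(word)) % q,) for word in product(range(q), repeat=k)]


def hamming(u, v):
    """Number of positions in which u and v differ."""
    return sum(a != b for a, b in zip(u, v))


def min_distance(code):
    """Smallest Hamming distance between two distinct codewords."""
    return min(hamming(u, v) for u, v in combinations(code, 2))


if __name__ == "__main__":
    code = parity_check_code(4, 3)
    print(len(code), min_distance(code))
    print(["".join(BASES[s] for s in word) for word in code[:4]])
```

A minimum distance of two means a single substitution is always detected, although it cannot be corrected.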


Sequence Assembly


Sequence Assembly Problem

Problem: Sequencing is computationally demanding. Many short reads need to be stitched together to obtain the original sequence.

Objective: Design a code that uses the information on short reads / substrings directly, without the need to stitch them together.


Design of Words for Sequence Assembly

Design criteria (Kiah et al. 2015):

Choose words whose profiles are far from each other

Choose words whose profiles obey some constraints

[Figure: two example words, ACGCCGCCGACACGACACCA and GTACTTACTTTTACTTTACT, shown with their 3-gram reads and the resulting output profile.]


Compute the number of words with distinct profiles (asymptotically, up to a constant factor) and the number of words whose profiles are at a fixed distance apart.

Enumeration: Ehrhart polynomials (Ehrhart, 1962).

Code construction: Varshamov codes for asymmetric channels (Varshamov, 1973).

Open problem: Enumeration of profiles for a different regime of parameters.


Compute bounds on the number of words with distinct profiles (asymptotically, up to a constant factor) and the number of words whose profiles are at a fixed distance apart.

NEXT: Enumeration of profiles for a different regime of parameters.


Let’s get technical…


Notation

$\mathbf{x}$ = ACGCCGCCGACACGACACCA, with CCG as an example substring

$n$ = length of word

$\ell$ = length of substring / gram, aka $\ell$-gram

$q$ = size of alphabet

For the example word: $n = 20$, $\ell = 3$, $q = 4$


Profile Vector

$\mathbf{x}$ = ACGCCGCCGACACGACACCA

$q = 4$, $n = 20$, $\ell = 3$

[Figure: $\mathbf{x}$ broken into its eighteen overlapping 3-grams.]

$\mathbf{p}(\mathbf{x}, \ell)$ = multiplicity vector that represents this unordered set of substrings

$\mathbf{p}(\mathbf{x}, \ell)$ = (ACA: 2, ACC: 1, ACG: 2, CAC: 2, CCA: 1, CCG: 2, CGA: 2, CGC: 2, GAC: 2, GCC: 2)
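The profile vector is simple to compute. The sketch below counts every length-$\ell$ substring of a word with a sliding window; the function name `profile` is an illustrative choice, and only the nonzero entries of the vector are stored.

```python
from collections import Counter


def profile(word, ell):
    """Multiset of all overlapping substrings of length ell in word."""
    return Counter(word[i:i + ell] for i in range(len(word) - ell + 1))


if __name__ == "__main__":
    x = "ACGCCGCCGACACGACACCA"
    for gram, count in sorted(profile(x, 3).items()):
        print(gram, count)
```

A word of length $n$ has $n - \ell + 1$ such substrings, so the entries of the example profile sum to 18.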


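Different Words may have the same Profile Vector

$q = 4$, $n = 4$, $\ell = 3$

$\mathbf{x}$ = ACAC has the 3-grams ACA and CAC, so $\mathbf{p}(\mathbf{x}, \ell)$ = (ACA: 1, CAC: 1)

$\mathbf{y}$ = CACA has the 3-grams CAC and ACA, so $\mathbf{p}(\mathbf{y}, \ell)$ = (ACA: 1, CAC: 1)

Define an equivalence relation on $q$-ary words of length $n$ by $\mathbf{x} \sim \mathbf{y}$ if $\mathbf{p}(\mathbf{x}, \ell) = \mathbf{p}(\mathbf{y}, \ell)$.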


Equivalence Classes

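$P_q(n, \ell)$ = number of equivalence classes

$R_q(n, \ell) = \frac{\log_q P_q(n, \ell)}{n}$ = rate

Here $q$-ary words of length $n$ are equivalent, $\mathbf{x} \sim \mathbf{y}$, if $\mathbf{p}(\mathbf{x}, \ell) = \mathbf{p}(\mathbf{y}, \ell)$.

Previous Result (Jacquet et al. 2012, Kiah et al. 2015):

$P_q(n, \ell) = \Theta\left(n^{q^{\ell} - q^{\ell-1}}\right)$,

$\lim_{n \to \infty} R_q(n, \ell) = 0$.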
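For very small parameters the number of equivalence classes can be found by exhaustive search, which helps to make the definitions concrete. The sketch below enumerates all $q^n$ words and collects their distinct profiles; the function names are illustrative, and the approach is only feasible for tiny $q$ and $n$.

```python
from collections import Counter
from itertools import product
from math import log


def count_profiles(q, n, ell):
    """Number of distinct ell-gram profiles among all q-ary words of length n."""
    seen = set()
    for word in product(range(q), repeat=n):
        grams = Counter(word[i:i + ell] for i in range(n - ell + 1))
        seen.add(frozenset(grams.items()))  # hashable form of the multiset
    return len(seen)


def rate(q, n, ell):
    """log_q of the number of profiles, divided by n."""
    return log(count_profiles(q, n, ell), q) / n


if __name__ == "__main__":
    print(count_profiles(2, 4, 3))  # 0101 and 1010 share a profile
    print(count_profiles(4, 3, 3), rate(4, 3, 3))
```

With $q = 2$, $n = 4$, $\ell = 3$ the only collision is between 0101 and 1010, so 16 words give 15 classes.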



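What is $P_q(n, n)$?

$P_q(n, n) = q^n$, or $R_q(n, n) = 1$, since each word is its own only $n$-gram.

Example: with $q = 4$ and $n = \ell = 3$, the word ACA has the single 3-gram ACA.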


Assumptions and Known Results

Fix $q \geq 2$.

1. Let $\mu$ be the Möbius function.

2. Let $b + 1 < a$ and let $R_q(n, b)$ be the set of all $q$-ary strings of length $n$ that have no run of $b$ consecutive zeroes.

3. Codes satisfying this constraint have been extensively studied. In particular, the size $F_q(n, b) \triangleq |R_q(n, b)|$ and its recursive formula are well known:

$$F_q(n, b) = \begin{cases} q^n, & \text{if } n < b, \\ (q - 1) \sum_{i=1}^{b} F_q(n - i, b), & \text{otherwise.} \end{cases}$$
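The recursion can be checked against a direct count for small parameters. The sketch below evaluates $F_q(n, b)$ bottom-up and compares it with an exhaustive enumeration of strings that contain no run of $b$ zeroes; the function names are illustrative, and $b \geq 1$, $n \geq 0$ are assumed.

```python
from itertools import product


def run_free_count(q, n, b):
    """F_q(n, b) computed bottom-up from the recursion (assumes b >= 1, n >= 0)."""
    values = []
    for m in range(n + 1):
        if m < b:
            values.append(q ** m)
        else:
            values.append((q - 1) * sum(values[m - i] for i in range(1, b + 1)))
    return values[n]


def run_free_brute(q, n, b):
    """Count q-ary strings of length n without b consecutive zeroes by enumeration."""
    forbidden = (0,) * b
    count = 0
    for word in product(range(q), repeat=n):
        has_run = any(word[i:i + b] == forbidden for i in range(n - b + 1))
        count += not has_run
    return count


if __name__ == "__main__":
    for q, n, b in [(2, 3, 2), (3, 5, 2), (4, 6, 3)]:
        print(q, n, b, run_free_count(q, n, b), run_free_brute(q, n, b))
```

For $q = 2$ and $b = 2$ the recursion reduces to the Fibonacci recurrence, which gives a quick sanity check.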


Main Contributions

Efficient encoding and decoding algorithms are available for all results below, except the one in Eq. (4), where the bound is established by counting the number of partial de Bruijn sequences.

• If $\ell \leq n < 2\ell$, then

$$P_q(n, \ell) = q^n - \sum_{r \mid n-\ell+1} \sum_{t \mid r} \left(\frac{r-1}{r}\right) \mu\left(\frac{r}{t}\right) q^t > q^{n-1}(q-1). \quad (1)$$

• Choose $a$ and $m$ such that $2a \leq \ell$ and $m \leq q^{a-1}$. If $n = m\ell$, then

$$P_q(n, \ell) \geq (q-1)^{m(\ell-a)}. \quad (2)$$
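Eq. (1) can be checked numerically for tiny parameters. The sketch below evaluates the closed form and compares it with an exhaustive count of distinct profiles; the helper names are illustrative, and the Möbius function is computed by plain trial division, which is adequate only for small arguments.

```python
from collections import Counter
from itertools import product


def mobius(n):
    """Moebius function by trial division."""
    result, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0  # a squared prime factor
            result = -result
        p += 1
    return -result if n > 1 else result


def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]


def profiles_formula(q, n, ell):
    """Right-hand side of Eq. (1), valid for ell <= n < 2 * ell."""
    assert ell <= n < 2 * ell
    correction = 0
    for r in divisors(n - ell + 1):
        # The inner sum is a multiple of r, so the division below is exact.
        inner = sum(mobius(r // t) * q ** t for t in divisors(r))
        correction += (r - 1) * inner // r
    return q ** n - correction


def profiles_brute(q, n, ell):
    """Count distinct profiles by enumerating all q**n words."""
    seen = set()
    for word in product(range(q), repeat=n):
        grams = Counter(word[i:i + ell] for i in range(n - ell + 1))
        seen.add(frozenset(grams.items()))
    return len(seen)


if __name__ == "__main__":
    for q, n, ell in [(2, 4, 3), (2, 5, 3), (3, 4, 3)]:
        print(q, n, ell, profiles_formula(q, n, ell), profiles_brute(q, n, ell))
```

For $q = 2$, $n = 5$, $\ell = 3$ the correction term is 4: the three rotations of 001 and of 011, each extended to length 5, collapse into one class apiece.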


Main Contributions Continued

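• Choose $b$, $a$, and $m$ such that $2a \leq \ell$, $b + 1 < a$ and $m \leq F_q(a-b-1, b)$. If $n = m\ell$, then

$$P_q(n, \ell) \geq \left(F_q(n-a-1, b)(q-1)\right)^m. \quad (3)$$

• If $n \geq \ell$,

$$P_q(n, \ell) > \frac{1}{n} \sum_{t \mid n} \mu\left(\frac{n}{t}\right) q^t - \binom{n}{2} q^{n-\ell+1}. \quad (4)$$

Suppose further that $\ell \geq 2\log_q n + 2$ and $n \geq 8$. Then $P_q(n, \ell) > q^{n-1}/n$. Hence, for all $0 < \kappa < 1$, we have $P_q(n, \ell) \geq q^{\kappa n}$ for sufficiently large $n$.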
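To get a feel for the last statement, the bound $P_q(n, \ell) > q^{n-1}/n$ translates into the rate bound $R_q(n, \ell) > (n - 1 - \log_q n)/n$. The sketch below evaluates this, together with the smallest read length that satisfies $\ell \geq 2\log_q n + 2$; the function names are illustrative, and the condition is tested in the equivalent integer form $q^{\ell-2} \geq n^2$ to avoid floating-point rounding.

```python
from math import log


def min_read_length(q, n):
    """Smallest ell with ell >= 2 * log_q(n) + 2, i.e. q**(ell - 2) >= n**2."""
    ell = 2
    while q ** (ell - 2) < n * n:
        ell += 1
    return ell


def rate_lower_bound(q, n):
    """log_q(q**(n - 1) / n) / n, the rate implied by P_q(n, ell) > q**(n - 1) / n."""
    assert n >= 8, "the slide states the bound for n >= 8"
    return (n - 1 - log(n, q)) / n


if __name__ == "__main__":
    for n in (10 ** 3, 10 ** 6):
        print(n, min_read_length(4, n), rate_lower_bound(4, n))
```

For $q = 4$ and $n = 1000$, reads of length 12 already suffice and the rate exceeds 0.99, which is the sense in which the rate approaches one for practical read lengths.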


Rate for fixed $\ell = 20$

The rates $R_4(n, 20)$ for $20 \leq n \leq 10^8$


Rate for fixed $\ell = 100$

The rates $R_4(n, 100)$ for $100 \leq n \leq 10^8$


Rate for fixed $n = 1000$

The rates $R_4(10^3, \ell)$ for $1 \leq \ell \leq 10^3$


Rate for fixed $n = 10^6$

The rates $R_4(10^6, \ell)$ for $1 \leq \ell \leq 10^6$


Fountain Codes meet DNA

LT Codes (M. Luby, 2002), Raptor Codes (A. Shokrollahi, 2006).

Erlich and Zielinski, Science vol. 355, 2 March 2017.


Portable and Error-Free

Yazdi, Gabrys, and Milenkovic, Scientific Reports, July 2017.


Encoding Movie into Living! Bacteria

Shipman, Nivala, Macklis, and Church, Nature vol. 547, 20 July 2017.


Synthetic DNA as Attack Vectors in Crypto

P. Ney, K. Koscher, L. Organick, L. Ceze, T. Kohno, “Computer Security, Privacy, and DNA Sequencing: Compromising Computers with Synthesized DNA, Privacy Leaks, and More,” USENIX Security Symposium, 2017.

• Use synthetic DNA to load malicious code into computer systems.

• The targeted systems typically process DNA sequencing data.

• The code took over control of the host system.

• Not so practical yet, but a proof of concept for such attacks.


Some References

• O. Milenkovic, Lecture video at https://www.youtube.com/watch?v=N7zJLSEZKYQ.

• Z. Chang, J. Chrisnata, M. F. Ezerman, and H. M. Kiah, “Rates of DNA Sequence Profiles for Practical Values of Read Lengths,” in IEEE Trans. Inform. Theory, vol. 63, no. 11, pp. 3125–3146, Nov. 2017.

• G. M. Church, Y. Gao, and S. Kosuri, “Next-generation digital information storage in DNA,” Science, vol. 337, no. 6102, pp. 1628–1628, 2012.

• N. Goldman, P. Bertone, S. Chen, C. Dessimoz, E. M. LeProust, B. Sipos, and E. Birney, “Towards practical, high-capacity, low-maintenance information storage in synthesized DNA,” Nature, vol. 494, pp. 77–80, 2013.


More References

• R. N. Grass, R. Heckel, M. Puddu, D. Paunescu, and W. J. Stark, “Robust chemical preservation of digital information on DNA in silica with error-correcting codes,” Angewandte Chemie Int. Ed., vol. 54, no. 8, pp. 2552–2555, 2015.

• S. Yazdi, Y. Yuan, J. Ma, H. Zhao, and O. Milenkovic, “A rewritable, random-access DNA-based storage system,” Scientific Reports, vol. 5, no. 14138, 2015.

• S. Yazdi, H. M. Kiah, E. R. Garcia, J. Ma, H. Zhao, and O. Milenkovic, “DNA-based storage: Trends and methods,” IEEE Trans. Molecular, Biological, Multi-Scale Commun., vol. 1, no. 3, pp. 230–248, 2015.


Still More...

• J. Bornholt, R. Lopez, D. M. Carmean, L. Ceze, G. Seelig, and K. Strauss, “A DNA-based archival storage system,” in Proc. 21st Int. Conf. Architectural Support for Prog. Languages and Operating Systems, 2016, pp. 637–649.

• Y. Erlich and D. Zielinski, “DNA Fountain enables a robust and efficient storage architecture,” Science, vol. 355, no. 6328, pp. 950–954, 2017.

• S. Yazdi, R. Gabrys, and O. Milenkovic, “Portable and error-free DNA-based data storage,” Scientific Reports, vol. 7, no. 5011, 2017.

• H. M. Kiah, G. J. Puleo, and O. Milenkovic, “Codes for DNA sequence profiles,” IEEE Trans. Inform. Theory, vol. 62, no. 6, pp. 3125–3146, 2016.