A DNA-Based Archival Storage Systembornholt/papers/dnastorage-asplos16.sli… · Chunking data...
Transcript of A DNA-Based Archival Storage Systembornholt/papers/dnastorage-asplos16.sli… · Chunking data...
A DNA-Based Archival Storage SystemJames Bornholt* Randolph Lopez* Douglas M. Carmean† Luis Ceze* Georg Seelig* Karin Strauss†
* University of Washington † Microsoft Research
Facebook cold storage facility 1 exabyte (109 GB) 66,000 square feet
DNA molecules as storage
DNA molecules as storage
Extremely dense Theory: 1 exabyte in 1 in3
DNA molecules as storage
Extremely dense Theory: 1 exabyte in 1 in3
Extremely durable Half life > 500 years
Store
A DNA-based archival storage system
Write Read
Store
A DNA-based archival storage system
Write Read
Redundancy and density
Store
A DNA-based archival storage system
Write Read
Redundancy and density
Efficient retrieval
Store
A DNA-based archival storage system
Write Read
Redundancy and density
Efficient retrieval
Wet lab experiments
DNA manipulation
DNA moleculesFour nucleotides:
A
C
G
T
Adenine
Cytosine
Guanine
Thymine
DNA moleculesFour nucleotides:
A
C
G
T
Adenine
Cytosine
Guanine
Thymine
DNA strand (oligonucleotide) is a linear sequence of these nucleotides
G A C A C C TG A C A C C T
DNA moleculesFour nucleotides:
A
C
G
T
Adenine
Cytosine
Guanine
Thymine
DNA strand (oligonucleotide) is a linear sequence of these nucleotides
G A C A C C T
Two strands can bind to each other if they are complementary:
C T G T G G A
G A C A C C T
DNA moleculesFour nucleotides:
A
C
G
T
Adenine
Cytosine
Guanine
Thymine
DNA strand (oligonucleotide) is a linear sequence of these nucleotides
G A C A C C T
Two strands can bind to each other if they are complementary:
C T G T G G A
G A C A C C T
C T G T G G A
G A C A C C T
DNA moleculesFour nucleotides:
A
C
G
T
Adenine
Cytosine
Guanine
Thymine
DNA strand (oligonucleotide) is a linear sequence of these nucleotides
G A C A C C T
Two strands can bind to each other if they are complementary:
C T G T G G A
G A C A C C T
C T G T G G A
G A C A C C T
C, G are complementary
A, T are complementary
DNA moleculesFour nucleotides:
A
C
G
T
Adenine
Cytosine
Guanine
Thymine
DNA strand (oligonucleotide) is a linear sequence of these nucleotides
G A C A C C T
Two strands can bind to each other if they are complementary:
C T G T G G A
G A C A C C T
C T G T G G A
G A C A C C T
C, G are complementary
A, T are complementary
T
Partial errors allowed
DNA manipulationSynthesis: manufacturing DNA strands
GACACCT G A C A C C T
• Chemical synthesis process appends one nucleotide at a time • Maximum practical length ~200 nts • Typically produces thousands of copies of the strand
DNA manipulationSynthesis: manufacturing DNA strands
GACACCT G A C A C C T
• Chemical synthesis process appends one nucleotide at a time • Maximum practical length ~200 nts • Typically produces thousands of copies of the strand
Sequencing: reading DNA strandsGACACCTG A C A C C T
• Produces many reads of a strand • Much higher throughput than synthesis
An archival storage system
System overviewArchival storage system structured as a key-value store
System overviewArchival storage system structured as a key-value store
put(key, value)
System overviewArchival storage system structured as a key-value store
put(key, value) get(key)
System overviewArchival storage system structured as a key-value store
put(key, value) get(key)
System overviewArchival storage system structured as a key-value store
put(key, value) get(key)
01001…
cat.jpg
System overviewArchival storage system structured as a key-value store
put(key, value) get(key)
01001…
ACGAT…
cat.jpg
System overviewArchival storage system structured as a key-value store
put(key, value) get(key)
01001…
ACGAT…
Synthesis
cat.jpg
System overviewArchival storage system structured as a key-value store
put(key, value) get(key)
01001…
ACGAT…
Synthesis
cat.jpg
System overviewArchival storage system structured as a key-value store
put(key, value) get(key)
01001…
ACGAT…
Synthesis
Pool
cat.jpg
System overviewArchival storage system structured as a key-value store
put(key, value) get(key)
01001…
ACGAT…
Synthesis
Pool
cat.jpg
cat.jpg
System overviewArchival storage system structured as a key-value store
put(key, value) get(key)
01001…
ACGAT…
Synthesis
Pool
cat.jpg
cat.jpg
cat.jpg
System overviewArchival storage system structured as a key-value store
put(key, value) get(key)
01001…
ACGAT…
Synthesis
Pool
cat.jpg
cat.jpg
cat.jpg
System overviewArchival storage system structured as a key-value store
put(key, value) get(key)
01001…
ACGAT…
Synthesis
Pool
cat.jpg
cat.jpg
cat.jpg
Sequencing
System overviewArchival storage system structured as a key-value store
put(key, value) get(key)
01001…
ACGAT…
Synthesis
Pool
cat.jpg
cat.jpg
cat.jpg
Sequencing
System overviewArchival storage system structured as a key-value store
put(key, value) get(key)
01001…
ACGAT…
Synthesis
Pool
cat.jpg
cat.jpg
cat.jpg
Sequencing
Organizing written data
System overviewArchival storage system structured as a key-value store
put(key, value) get(key)
01001…
ACGAT…
Synthesis
Pool
cat.jpg
cat.jpg
cat.jpg
Sequencing
Organizing written data
Efficient retrieval
System overviewArchival storage system structured as a key-value store
put(key, value) get(key)
01001…
ACGAT…
Synthesis
Pool
cat.jpg
cat.jpg
cat.jpg
Sequencing
Organizing written data
Efficient retrieval
Coping with errors
Writing data to DNAThe easy way: convert base 2 to base 4
10100011 10010001 11100111 11000101 10010100 10111101
Writing data to DNAThe easy way: convert base 2 to base 4
10100011 10010001 11100111 11000101 10010100 10111101
2 2 0 3 2 1 0 1 3 2 1 3 3 0 1 1 2 1 1 0 2 3 3 1
Writing data to DNAThe easy way: convert base 2 to base 4
G C A C
10100011 10010001 11100111 11000101 10010100 10111101
2 2 0 3 2 1 0 1 3 2 1 3 3 0 1 1 2 1 1 0 2 3 3 1
G G A T T G C T T A C C G C C A G T T C
Writing data to DNAThe easy way: convert base 2 to base 4
But this approach isn’t feasible for more than a few bytes
G C A C
10100011 10010001 11100111 11000101 10010100 10111101
2 2 0 3 2 1 0 1 3 2 1 3 3 0 1 1 2 1 1 0 2 3 3 1
G G A T T G C T T A C C G C C A G T T C
Writing data to DNAThe easy way: convert base 2 to base 4
G
But this approach isn’t feasible for more than a few bytes
G C A C
10100011 10010001 11100111 11000101 10010100 10111101
2 2 0 3 2 1 0 1 3 2 1 3 3 0 1 1 2 1 1 0 2 3 3 1
G G A T T G C T T A C C G C C A G T T C
Writing data to DNAThe easy way: convert base 2 to base 4
G G
But this approach isn’t feasible for more than a few bytes
G C A C
10100011 10010001 11100111 11000101 10010100 10111101
2 2 0 3 2 1 0 1 3 2 1 3 3 0 1 1 2 1 1 0 2 3 3 1
G G A T T G C T T A C C G C C A G T T C
Writing data to DNAThe easy way: convert base 2 to base 4
G G
But this approach isn’t feasible for more than a few bytes
P[Attach] = 99%
G C A C
10100011 10010001 11100111 11000101 10010100 10111101
2 2 0 3 2 1 0 1 3 2 1 3 3 0 1 1 2 1 1 0 2 3 3 1
G G A T T G C T T A C C G C C A G T T C
Writing data to DNAThe easy way: convert base 2 to base 4
G G
But this approach isn’t feasible for more than a few bytes
P[Attach] = 99%
99%
G C A C
10100011 10010001 11100111 11000101 10010100 10111101
2 2 0 3 2 1 0 1 3 2 1 3 3 0 1 1 2 1 1 0 2 3 3 1
G G A T T G C T T A C C G C C A G T T C
Writing data to DNAThe easy way: convert base 2 to base 4
G G A
But this approach isn’t feasible for more than a few bytes
P[Attach] = 99%
99% 98%
G C A C
10100011 10010001 11100111 11000101 10010100 10111101
2 2 0 3 2 1 0 1 3 2 1 3 3 0 1 1 2 1 1 0 2 3 3 1
G G A T T G C T T A C C G C C A G T T C
Writing data to DNAThe easy way: convert base 2 to base 4
G G A T G C A
But this approach isn’t feasible for more than a few bytes
P[Attach] = 99%
99% 98% 97% 96.1% 95.1% 94.2%
G C A C
10100011 10010001 11100111 11000101 10010100 10111101
2 2 0 3 2 1 0 1 3 2 1 3 3 0 1 1 2 1 1 0 2 3 3 1
G G A T T G C T T A C C G C C A G T T C
Writing data to DNAThe easy way: convert base 2 to base 4
G G A T G C A
But this approach isn’t feasible for more than a few bytes
P[Attach] = 99%
99% 98% 97% 96.1% 95.1% 94.2%
100 nts
36.6%
…
G C A C
10100011 10010001 11100111 11000101 10010100 10111101
2 2 0 3 2 1 0 1 3 2 1 3 3 0 1 1 2 1 1 0 2 3 3 1
G G A T T G C T T A C C G C C A G T T C
Writing data to DNAThe easy way: convert base 2 to base 4
G G A T G C A
But this approach isn’t feasible for more than a few bytes
P[Attach] = 99%
99% 98% 97% 96.1% 95.1% 94.2%
100 nts
36.6%
200 nts
13.4%
… …
G C A C
10100011 10010001 11100111 11000101 10010100 10111101
2 2 0 3 2 1 0 1 3 2 1 3 3 0 1 1 2 1 1 0 2 3 3 1
G G A T T G C T T A C C G C C A G T T C
G C A C
Chunking dataBreak binary data into chunks stored in separate strands
10100011 10010001 11100111 11000101 10010100 10111101
2 2 0 3 2 1 0 1 3 2 1 3 3 0 1 1 2 1 1 0 2 3 3 1
G G A T T G C T T A C C G C C A G T T C
G C A C
Chunking dataBreak binary data into chunks stored in separate strands
10100011 10010001 11100111 11000101 10010100 10111101
2 2 0 3 2 1 0 1 3 2 1 3 3 0 1 1 2 1 1 0 2 3 3 1
G G A T
T G C T T A C C
G C C A G T T C
G C A C
Chunking dataBreak binary data into chunks stored in separate strands
10100011 10010001 11100111 11000101 10010100 10111101
2 2 0 3 2 1 0 1 3 2 1 3 3 0 1 1 2 1 1 0 2 3 3 1
G G A T
T G C T T A C C
G C C A G T T C
A A A A
A A A C
A A A G
Addresses within the value
C A T C C
G C A C
Chunking dataBreak binary data into chunks stored in separate strands
10100011 10010001 11100111 11000101 10010100 10111101
2 2 0 3 2 1 0 1 3 2 1 3 3 0 1 1 2 1 1 0 2 3 3 1
G G A T
T G C T T A C C
G C C A G T T C
A A A A
A A A C
A A A G
C A T C C
C A T C C
A T G T T
A T G T T
A T G T T
Addresses within the value
Key identifiers (“primers”)
C A T C C G C C A G T T C A A A G A T G T T
Efficient reads
G C A C G G A T
T G C T T A C C
A A A A
A A A C
C A T C C
C A T C C
A T G T T
A T G T T
Addresses within the value
Key identifiers (“primers”)
A T T TG CCTACGAAACTTGACCG
Efficient reads
A T T TG CCTACAAAACACGTAGG
A T T TG CCTACCAAACCATTCGT
Addresses within the value
Key identifiers (“primers”)
A T T TG CCTACGAAACTTGACCG
Efficient reads
A T T TG CCTACAAAACACGTAGG
A T T TG CCTACCAAACCATTCGT
Pool containing stored strands for all keys & values!
Addresses within the value
Key identifiers (“primers”)
A T T TG CCTACGAAACTTGACCG
Efficient reads
A T T TG CCTACAAAACACGTAGG
A T T TG CCTACCAAACCATTCGT
Pool containing stored strands for all keys & values! get(key)
cat.jpg
Addresses within the value
Key identifiers (“primers”)
A T T TG CCTACGAAACTTGACCG
Random access
A T T TG CCTACAAAACACGTAGG
A T T TG CCTACCAAACCATTCGT
AddressPrimers
A T T TG CCTACGAAACTTGACCG
Random access
A T T TG CCTACAAAACACGTAGG
A T T TG CCTACCAAACCATTCGT
AddressPrimers
Strands with 3 different primers
A T T TG CCTACGAAACTTGACCG
Random access
A T T TG CCTACAAAACACGTAGG
A T T TG CCTACCAAACCATTCGT
AddressPrimers
PCR
Selectively amplify strands based on their primer
Strands with 3 different primers
A T T TG CCTACGAAACTTGACCG
Random access
A T T TG CCTACAAAACACGTAGG
A T T TG CCTACCAAACCATTCGT
AddressPrimers
PCR
Selectively amplify strands based on their primer
Strands with 3 different primers
A T T TG CCTACGAAACTTGACCG
Random access
A T T TG CCTACAAAACACGTAGG
A T T TG CCTACCAAACCATTCGT
AddressPrimers
PCR
Selectively amplify strands based on their primer
SampleStrands with 3 different primers
Almost all strands have desired primer
A T T TG CCTACGAAACTTGACCG
Random access
A T T TG CCTACAAAACACGTAGG
A T T TG CCTACCAAACCATTCGT
AddressPrimers
PCR
Selectively amplify strands based on their primer
SampleStrands with 3 different primers
Almost all strands have desired primer
Reads are destructive, so replenish when necessary
Error correctionBoth synthesis and sequencing are error prone:
G G A T A G C
G G A T G C A
G G A T G A
G G A T C C A
Insertions
Deletions
Substitutions
A
Error rates ~1% per nucleotide!
Logical redundancy
Logical redundancy
AddressPrimer Data
Logical redundancy
AddressPrimer Data
Logical redundancy
AddressPrimer Data
Logical redundancy
AddressPrimer Data
XOR redundancy provides simple error correction
Logical redundancy
AddressPrimer Data
XOR redundancy provides simple error correction
Reserved address space to indicate redundancy data
Wet lab results
The process
The process
The process
The process
catcatgg
The process
catcatgg
The process
catcatgg
The process
catcatgg
The process
catcatgg
The process
catcatgg
catcatgc
The process
catcatgg
catcatgc
The process
catcatgg
catcatgc
Throughput MBs/week
DecodingEncoded and synthesized 3 files (151 kB):
Photo: Tara Brown / UW
DecodingEncoded and synthesized 3 files (151 kB):
DecodingEncoded and synthesized 3 files (151 kB):
Selected and PCRed one file for random access (42 kB):
DecodingEncoded and synthesized 3 files (151 kB):
Selected and PCRed one file for random access (42 kB):
Sequenced and decoded the resulting amplified pool:
DecodingEncoded and synthesized 3 files (151 kB):
Selected and PCRed one file for random access (42 kB):
Sequenced and decoded the resulting amplified pool:
Recovered every bit despite errors in synthesis and sequencing
The importance of redundancy
AddressPrimer Data
The importance of redundancy
AddressPrimer Data
The importance of redundancy
AddressPrimer Data
0
25
50
75
0 2500 5000 7500Number of copies
Freq
uenc
y
If we ignore redundancy data, we cannot recover the file.
The importance of redundancy
AddressPrimer Data
0
25
50
75
0 2500 5000 7500Number of copies
Freq
uenc
y Some strands are missing entirely
If we ignore redundancy data, we cannot recover the file.
Store
A DNA-based archival storage system
Write Read
Redundancy and density
Efficient retrieval
Wet lab experiments
Store
A DNA-based archival storage system
Write Read
Redundancy and density
Efficient retrieval
Wet lab experiments
Also in the paper: • Reliability-density
trade-off • Simulation of decay
over time • Error analysis • Model of truncated
strands
MBs/week GBs/second
DNA productivity is growing
102
104
106
108
1010
1970 1980 1990 2000 2010Year
Prod
uctiv
ity
Transistors on ChipReading DNAWriting DNA
Source: Robert Carlson
DNA technology is miniaturizing
We’ve just barely scratched the surface
0%
25%
50%
75%
100%
0.01% 0.1% 1% 10%Reads used
Accu
racy
Our community has seen these challenges before
Simulation
Scheduling
Spatial addressing
Programming with errors
Error correction
Latency-hiding optimizations
Cache locality
Circuit design
Photo: Tara Brown / UW