Information Theory of DNA Sequencing
David Tse, Dept. of EECS, U.C. Berkeley
ITA 2012, Feb. 10
Research supported by NSF Center for Science of Information.
Guy Bresler, Abolfazl Motahari
DNA sequencing
DNA: the blueprint of life
Problem: to obtain the sequence of nucleotides.
…ACGTGACTGAGGACCGTGCGACTGAGACTGACTGGGTCTAGCTAGACTACGTTTTATATATATATACGTCGTCGTACTGATGACTAGATTACAGACTGATTTAGATACCTGACTGATTTTAAAAAAATATT…
Impetus: Human Genome Project
1990: Start
2001: Draft
2003: Finished
3 billion base pairs
$3 billion
Sequencing Gets Cheaper and Faster
Cost of one human genome:
• HGP: $3 billion
• 2004: $30,000,000
• 2008: $100,000
• 2010: $10,000
• 2011: $4,000
• 2012-13: $1,000
• ???: $300
Time to sequence one genome: years/months → hours/days
Massive parallelization.
But many genomes to sequence
100 million species (e.g. phylogeny)
7 billion individuals (SNP, personal genomics)
10^13 cells in a human (e.g. somatic mutations such as HIV, cancer)
Whole Genome Shotgun Sequencing
Reads are assembled to reconstruct the original DNA sequence.
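As a toy illustration of assembling reads, here is a minimal Python sketch of greedy merging: repeatedly join the two fragments with the largest suffix-prefix overlap. The function names `overlap` and `greedy_assemble`, the read length of 8, and the evenly spaced read positions are assumptions made for this example only. Real shotgun reads start at random positions and real assemblers are far more elaborate.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str]) -> str:
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while len(frags) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:  # ties keep the first pair found
                        best_k, best_i, best_j = k, i, j
        # Glue fragment j onto fragment i, skipping the shared overlap.
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags[0]


# Toy example (hypothetical data): length-8 reads taken every 4 positions.
dna = "ACGTGACTGAGGACCGTGCG"
reads = [dna[s:s + 8] for s in range(0, len(dna) - 7, 4)]
assembled = greedy_assemble(reads)
print(assembled == dna)
```

In this toy case the overlaps are unambiguous, so the merge recovers the sequence. With repeated substrings longer than the overlaps, greedy merging can join the wrong fragments.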
A Gigantic Jigsaw Puzzle
Computation versus Information View
• Many proposed assembly algorithms for many sequencing technologies.
• But what is the minimum number of reads required for reliable reconstruction?
• How much intrinsic information does each read provide about the DNA sequence?
• This depends on the sequencing technology but not on the assembly algorithm.
Communication and Sequencing: An Analogy
Communication: max. communication rate = \( C_{\text{channel}} / H_{\text{source}} \) source symbols / sec.
Sequencing: source sequence \( S_1, S_2, \ldots, S_G \) → reads \( R_1, R_2, \ldots, R_N \); sequencing rate = \( G/N \) DNA symbols / read.
Question: what is the maximum sequencing rate such that reliable reconstruction is possible?
The read channel
• Capacity depends on
– read length: L
– DNA length: G
• L ↑ ⇒ C ↑
• G ↑ ⇒ C ↓
• Normalized read length: \( \bar{L} := L / \log G \)
• E.g. \( L = 100,\ G = 3 \times 10^9 \): \( \bar{L} = 4.6 \)
[Figure: the read channel maps the sequence AGCTTATAGGTCCGCATTACC to the read AGGTCC]
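The example figure on this slide can be checked with a few lines of Python. The sketch assumes the logarithm is the natural log, which is the choice that reproduces the quoted value of 4.6. The function name `normalized_read_length` is made up for this example.

```python
import math


def normalized_read_length(L: int, G: float) -> float:
    """Read length divided by the natural log of the DNA length."""
    return L / math.log(G)


# Slide example: 100-base reads from a 3-billion-base genome.
L_bar = normalized_read_length(100, 3e9)
print(round(L_bar, 1))  # 4.6
```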
Result: Sequencing Capacity
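\( H_2(p) \) is the (Rényi) entropy rate of the DNA sequence:
\[ H_2 = \lim_{\ell \to \infty} -\frac{1}{\ell} \log P(X^\ell = Y^\ell) \]
The higher the entropy, the easier the problem!
[Figure: sequencing capacity C versus normalized read length \( \bar{L} \). Labels: C = 0; \( C = \bar{L} \); no coverage (Lander-Waterman 88); duplication (Arratia et al. 96); greedy algorithm]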
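A small numerical sketch of this result follows. It makes two assumptions that this transcript does not state. First, the bases are i.i.d. with distribution p, in which case two independent length-ℓ strings match with probability (Σ p_i²)^ℓ and the entropy rate reduces to −log Σ p_i². Second, the switch between C = 0 and C = L̄ happens at L̄ = 2/H₂, the threshold from the published i.i.d. analysis. The function names are hypothetical.

```python
import math


def renyi_entropy_2(p):
    """Order-2 Renyi entropy (in nats) of an i.i.d. base distribution p."""
    return -math.log(sum(q * q for q in p))


def sequencing_capacity(L_bar, h2):
    """Assumed threshold form: 0 below L_bar = 2/H2, L_bar above it."""
    return L_bar if L_bar > 2.0 / h2 else 0.0


h2_uniform = renyi_entropy_2([0.25] * 4)           # uniform bases: ln 4
h2_skewed = renyi_entropy_2([0.7, 0.1, 0.1, 0.1])  # lower entropy

# Lower entropy raises the assumed threshold 2/H2, so longer reads are needed.
print(2.0 / h2_uniform < 2.0 / h2_skewed)
print(sequencing_capacity(4.6, h2_uniform))
```

Under these assumptions the lower-entropy source needs a larger normalized read length before the capacity becomes positive, which is the sense in which higher entropy makes the problem easier.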
Complexity is in the eyes of the beholder
Low entropy: easier to communicate, harder jigsaw puzzle
High entropy: harder to communicate, easier jigsaw puzzle
Conclusion
• DNA sequencing is an important problem.
• Many new technologies and new applications.
• An analogy between sequencing and communication is drawn.
• A notion of sequencing capacity is formulated.
• A principled design framework?