Information Theory of DNA Sequencing

Information Theory of DNA Sequencing David Tse Dept. of EECS U.C. Berkeley ITA 2012 Feb. 10 Research supported by NSF Center for Science of Information. Guy Bresler Abolfazl Motahari


Transcript of Information Theory of DNA Sequencing

Page 1: Information Theory of DNA Sequencing

Information Theory of DNA Sequencing

David Tse, Dept. of EECS, U.C. Berkeley

ITA 2012, Feb. 10

Research supported by NSF Center for Science of Information.

Guy Bresler, Abolfazl Motahari

Page 2: Information Theory of DNA Sequencing

DNA sequencing

DNA: the blueprint of life

Problem: to obtain the sequence of nucleotides.

…ACGTGACTGAGGACCGTGCGACTGAGACTGACTGGGTCTAGCTAGACTACGTTTTATATATATATACGTCGTCGTACTGATGACTAGATTACAGACTGATTTAGATACCTGACTGATTTTAAAAAAATATT…

Page 3: Information Theory of DNA Sequencing

Impetus: Human Genome Project

1990: Start

2001: Draft

2003: Finished

3 billion base pairs

$3 billion

Page 4: Information Theory of DNA Sequencing

Sequencing Gets Cheaper and Faster

Cost of one human genome

• HGP: $3 billion

• 2004: $30,000,000

• 2008: $100,000

• 2010: $10,000

• 2011: $4,000

• 2012-13: $1,000

• ???: $300

Time to sequence one genome: years/months → hours/days

Massive parallelization.

Page 5: Information Theory of DNA Sequencing

But many genomes to sequence

100 million species (e.g. phylogeny)

7 billion individuals (SNP, personal genomics)

$10^{13}$ cells in a human (e.g. somatic mutations such as HIV, cancer)

Page 6: Information Theory of DNA Sequencing

Whole Genome Shotgun Sequencing

Reads are assembled to reconstruct the original DNA sequence.
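To make the assembly step concrete, here is a minimal Python sketch that merges reads by their largest suffix-prefix overlap. It is a toy illustration under stated assumptions, not the assembler used in the talk: the reads are error-free, all come from the same strand, and no read is contained inside another. The helper names `overlap` and `greedy_assemble` are hypothetical, and the example reads are hand-picked windows of the short sequence shown on the read channel slide.

```python
def overlap(a, b):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads):
    """Repeatedly merge the pair of fragments with the largest overlap.

    Assumes error-free reads with no read contained in another (hypothetical
    toy setting). Returns the remaining fragments; a single fragment means
    every read was merged into one contig.
    """
    frags = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlaps left: the reads do not connect
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Four overlapping windows of AGCTTATAGGTCCGCATTACC.
reads = ["AGCTTATA", "TTATAGGT", "AGGTCCGC", "CCGCATTACC"]
contigs = greedy_assemble(reads)
print(contigs)
```

Greedy merging can go wrong when the sequence contains long repeats, because a repeated segment makes two different merges look equally good.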

Page 7: Information Theory of DNA Sequencing

A Gigantic Jigsaw Puzzle

Page 8: Information Theory of DNA Sequencing

Computation versus Information View

• Many proposed assembly algorithms for many sequencing technologies.

• But what is the minimum number of reads required for reliable reconstruction?

• How much intrinsic information does each read provide about the DNA sequence?

• This depends on the sequencing technology but not on the assembly algorithm.

Page 9: Information Theory of DNA Sequencing

Communication and Sequencing: An Analogy

Communication:

max. communication rate = $C_{\text{channel}} / H_{\text{source}}$ source sym / sec.

Sequencing:

source sequence $S_1, S_2, \ldots, S_G$ → reads $R_1, R_2, \ldots, R_N$

sequencing rate = $G/N$ DNA sym / read

Question: what is the max. sequencing rate such that reliable reconstruction is possible?

Page 10: Information Theory of DNA Sequencing

The read channel

read channel: AGCTTATAGGTCCGCATTACC → AGGTCC

• Capacity depends on

– read length: L (L ↑ ⇒ C ↑)

– DNA length: G (G ↑ ⇒ C ↓)

• Normalized read length: $\bar{L} := L / \log G$

• E.g. $L = 100$, $G = 3 \times 10^9$: $\bar{L} = 4.6$
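The slide's numerical example can be checked with a few lines of Python. The base of the logarithm is not stated on the slide; the natural logarithm is assumed here because it reproduces the quoted value of 4.6. The function name `normalized_read_length` is hypothetical.

```python
import math


def normalized_read_length(read_length, genome_length):
    """Normalized read length L / log G (natural log assumed)."""
    return read_length / math.log(genome_length)


# Example from the slide: L = 100, G = 3 x 10^9.
l_bar = normalized_read_length(100, 3e9)
print(round(l_bar, 1))  # 4.6
```

Because G enters only through its logarithm, a tenfold longer genome lowers the normalized read length only slightly, while doubling L doubles it.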

Page 11: Information Theory of DNA Sequencing

Result: Sequencing Capacity

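$H_2(p)$ is the (Rényi) entropy rate of the DNA sequence:

$H_2 = \lim_{\ell \to \infty} -\frac{1}{\ell} \log P(X^\ell = Y^\ell)$

The higher the entropy, the easier the problem!

[Figure: capacity $C$ versus normalized read length $\bar{L}$, with regions $C = 0$ and $C = \bar{L}$. Labels: no coverage (Lander-Waterman 88), duplication (Arratia et al 96), greedy algorithm.]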
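As a small worked case of the entropy definition, assume the bases are drawn i.i.d. with probabilities $p_1, \ldots, p_4$ (a simplifying assumption made for this sketch). Two independent length-$\ell$ segments then match with probability $(\sum_i p_i^2)^\ell$, so the limit reduces to $H_2(p) = -\log \sum_i p_i^2$. The Python sketch below evaluates this with natural logarithms; the function name `renyi2_entropy_iid` and the skewed distribution are illustrative choices, not values from the talk.

```python
import math


def renyi2_entropy_iid(base_probs):
    """Order-2 Renyi entropy -log(sum p_i^2) of an i.i.d. base distribution.

    Assumes i.i.d. bases and natural logarithms (both are assumptions of
    this sketch).
    """
    if abs(sum(base_probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    collision = sum(q * q for q in base_probs)  # P(two independent bases match)
    return -math.log(collision)


uniform = renyi2_entropy_iid([0.25, 0.25, 0.25, 0.25])
skewed = renyi2_entropy_iid([0.7, 0.1, 0.1, 0.1])
print(uniform, skewed)
```

The uniform distribution gives the largest value, $\log 4$, and a skewed distribution gives a smaller one. A lower entropy means segments match by chance more often, which is the slide's point that low-entropy sequences are harder to reconstruct.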

Page 12: Information Theory of DNA Sequencing

Complexity is in the eyes of the beholder

Low entropy: easier to communicate, harder jigsaw puzzle

High entropy: harder to communicate, easier jigsaw puzzle

Page 13: Information Theory of DNA Sequencing

Conclusion

• DNA sequencing is an important problem.

• Many new technologies and new applications.

• An analogy between sequencing and communication is drawn.

• A notion of sequencing capacity is formulated.

• A principled design framework?