DNA CODES BASED ON HAMMING STEM SIMILARITIES

A.G. Dyachkov, A.N. Voronina
Dept. of Probability Theory, MechMath., Moscow State University, Russia


OUTLINE

1. DNA background

2. Modeling the hybridization energy

3. DNA codes

4. Example of code construction

5. Bounds on the rate of DNA codes

a. Lower Gilbert-Varshamov bound

b. Upper bounds

c. Graphs

6. On sphere sizes

7. Further generalizations

8. Bibliography


DNA STRANDS

■ DNA strands consist of nucleotides, each composed of a sugar-phosphate backbone unit and 1 base

■ There are 4 types of bases: A (adenine), C (cytosine), G (guanine), T (thymine)

[Figure: a single DNA strand from the 5’ end to the 3’ end, with the sugar-phosphate backbone, the bases, and a nucleotide labeled]

■ Base A is complementary to T, and C is complementary to G

■ DNA strands are oriented. Thus, for example, strand AATG is different from strand GTAA

■ 2 oppositely directed strands containing complementary bases at corresponding positions are called reverse-complement strands. For example, the strands AACG and CGTT are reverse-complement (the strands have different directions):

5’-A A C G-3’
3’-T T G C-5’
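The reverse-complement rule above can be sketched in a few lines of Python. The function name `reverse_complement` and the dictionary `COMPLEMENT` are illustrative names, not part of the slides; strands are assumed to be written in the 5’ to 3’ direction.

```python
# Complement map for the four DNA bases (A <-> T, C <-> G).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the reverse complement of a strand written 5' -> 3'."""
    # Reverse the strand, then replace every base by its complement.
    return "".join(COMPLEMENT[base] for base in reversed(strand))


print(reverse_complement("AACG"))  # CGTT
```

Applying the function twice returns the original strand, which mirrors the symmetry of the reverse-complement relation.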


HYBRIDIZATION

■ 2 oppositely directed DNA strands are capable of coalescing into a duplex, or double helix

■ The process of forming a duplex is referred to as hybridization

■ The basis of this process is the formation of hydrogen bonds between complementary bases

■ A duplex formed by reverse-complement strands is called a Watson-Crick duplex. Here is an example:

5’-A A C G-3’
3’-T T G C-5’


CROSS-HYBRIDIZATION AND ENERGY OF HYBRIDIZATION

■ However, hybridization is not a perfect process: non-complementary strands can also hybridize

■ This is one example of cross-hybridization:

[Figure: a duplex of two strands that are not reverse-complement; callouts mark the positions where the opposite bases are not complementary]

■ The indicator of the “strength”, or stability, of a formed duplex is its energy of hybridization. Its value depends on the total number of bonds formed

■ Thus, the hybridization energy is greatest when a Watson-Crick duplex is formed, and smaller in the case of cross-hybridization


LONE BONDS AND “PAIRWISE” METRIC

■ If a pair of bases is bonded but neither of its neighboring pairs forms a bond, it is called a lone bond

[Figure: a duplex with callouts: a lone bond does not contribute to the hybridization energy; a pair of adjacent bonds adds 1 to the total hybridization energy; a triplet of adjacent bonds is counted as 2 adjacent pairs; hybridization energy = 3]

■ A lone bond is too “weak” to form a strong connection, so it does not contribute much to the total energy of hybridization

■ Moreover, the energy of hybridization in fact depends not on the number of bonds formed, but on the number of pairs of adjacent bonds

■ Thus, if we suppose that the hybridization energy is equal to the number of pairs of adjacent bonds, then in the example above it is equal to 3, not 5 or 6
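The “pairwise” counting rule can be illustrated with a short Python sketch. The bond pattern below is hypothetical: it is chosen only to match the callouts (one lone bond, one pair, one triplet), not copied from the figure, and `pairwise_energy` is an illustrative name.

```python
def pairwise_energy(bonds):
    """Count pairs of adjacent bonds; bonds[i] is truthy if position i is bonded."""
    return sum(1 for left, right in zip(bonds, bonds[1:]) if left and right)


# Hypothetical pattern: a lone bond, a gap, a pair, a gap, a triplet.
pattern = [1, 0, 1, 1, 0, 1, 1, 1]

print(sum(pattern))              # 6 bonds in total
print(pairwise_energy(pattern))  # 3 pairs of adjacent bonds
```

The lone bond contributes nothing, the pair contributes 1, and the triplet contributes 2, giving an energy of 3 even though 6 bonds are formed.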


NOTATIONS

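General notations

■ Let $q = 2, 4, \dots$ be an arbitrary even integer

■ Denote by $\mathcal{A} = \{0, 1, \dots, q-1\}$ the standard alphabet of size $q$

■ Denote by $\lfloor u \rfloor$ ($\lceil u \rceil$) the largest (smallest) integer $\le u$ ($\ge u$)

Reverse-complementation

■ For any letter $x \in \mathcal{A}$, define $\bar{x} = (q-1) - x$, the complement of the letter $x$

■ For any q-ary sequence $\mathbf{x} = (x_1, x_2, \dots, x_n) \in \mathcal{A}^n$, define its reverse complement $\tilde{\mathbf{x}} = (\bar{x}_n, \bar{x}_{n-1}, \dots, \bar{x}_1)$

Note that if $\mathbf{y} = \tilde{\mathbf{x}}$, then $\mathbf{x} = \tilde{\mathbf{y}}$ for any $\mathbf{x} \in \mathcal{A}^n$.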


STEM HAMMING SIMILARITY

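For 2 q-ary sequences of length $n$,

$\mathbf{x} = (x_1, x_2, \dots, x_n)$ and $\mathbf{y} = (y_1, y_2, \dots, y_n)$,

the stem Hamming similarity is equal to

$$S(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n-1} s_i(\mathbf{x}, \mathbf{y}),$$

where

$$s_i(\mathbf{x}, \mathbf{y}) = \begin{cases} 1, & \text{if } x_i = y_i \text{ and } x_{i+1} = y_{i+1}, \\ 0, & \text{otherwise.} \end{cases}$$

■ $S(\mathbf{x}, \mathbf{y})$ is equal to the total number of common 2-blocks containing adjacent symbols in the longest common Hamming subsequence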
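A minimal Python sketch of this definition follows, assuming the similarity counts the positions $i$ at which both $x_i = y_i$ and $x_{i+1} = y_{i+1}$. The function name and the sample sequences are illustrative.

```python
def stem_similarity(x, y):
    """Number of positions i with x[i] == y[i] and x[i+1] == y[i+1]."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(
        x[i] == y[i] and x[i + 1] == y[i + 1]
        for i in range(len(x) - 1)
    )


x = (0, 1, 2, 3, 0, 1)
y = (0, 1, 2, 0, 0, 1)

# The sequences agree at positions 0, 1, 2, 4, 5, so the common 2-blocks
# start at positions 0, 1 and 4.
print(stem_similarity(x, y))  # 3
print(stem_similarity(x, x))  # 5, i.e. n - 1
```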


HAMMING VS. STEM HAMMING

■ Hamming similarity is element-wise while stem Hamming similarity is pair-wise (though still additive)

■ Re-ordering the elements in the sequence does not influence Hamming similarity, but may change stem Hamming similarity

Example
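The effect of re-ordering can be checked numerically. The sketch below is not the example from the original slide; it uses hypothetical sequences and applies the same permutation of positions to both of them, assuming the pair-counting definition of stem Hamming similarity.

```python
def hamming_similarity(x, y):
    """Number of positions where the two sequences agree."""
    return sum(a == b for a, b in zip(x, y))


def stem_similarity(x, y):
    """Number of adjacent position pairs where the two sequences agree."""
    return sum(
        x[i] == y[i] and x[i + 1] == y[i + 1]
        for i in range(len(x) - 1)
    )


def permute(seq, order):
    """Re-order a sequence: the new position k holds seq[order[k]]."""
    return tuple(seq[i] for i in order)


x, y = (0, 1, 2, 3), (0, 1, 0, 0)
order = (0, 2, 1, 3)  # swap the two middle positions
px, py = permute(x, order), permute(y, order)

print(hamming_similarity(x, y), hamming_similarity(px, py))  # 2 2
print(stem_similarity(x, y), stem_similarity(px, py))        # 1 0
```

The two matching positions are adjacent before the permutation and separated after it, so the Hamming similarity is unchanged while the stem Hamming similarity drops from 1 to 0.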


STEM HAMMING DISTANCE

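■ Note that $S(\mathbf{x}, \mathbf{y}) = S(\mathbf{y}, \mathbf{x}) \le S(\mathbf{x}, \mathbf{x}) = n - 1$, and $S(\mathbf{x}, \mathbf{y}) = n - 1$ if and only if $\mathbf{x} = \mathbf{y}$

■ The stem Hamming distance between $\mathbf{x}$ and $\mathbf{y}$ is $D(\mathbf{x}, \mathbf{y}) = S(\mathbf{x}, \mathbf{x}) - S(\mathbf{x}, \mathbf{y}) = n - 1 - S(\mathbf{x}, \mathbf{y})$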

ExampleLet and

■ The longest common Hamming subsequence is

■ Stem Hamming similarity is equal to

■ Stem Hamming distance is equal to


MOTIVATION

■ Study of DNA codes was motivated by the needs of DNA computing and biomolecular nanotechnology

■ In these applications, one must form a collection of DNA strands that will serve as markers, while the collection of their reverse-complement DNA strands will be utilized for reading, or recognition

Coding strands for ligation   Probing complement strands for reading
TACGCGACTTTC                  GAAAGTCGCGTA
ATCAAACGATGC                  GCATCGTTTGAT
TGTGTGCTCGTC                  GACGAGCACACA
ATTTTTGCGTTA                  TAACGCAAAAAT
CACTAAATACAA                  TTGTATTTAGTG
GAAAAAGAAGAA                  TTCTTCTTTTTC

1. Collection of mutually reverse-complement pairs

2. No self-reverse complement words

3. No cross-hybridization


DNA CODE

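■ $X$ is a code of length $n$ and size $N$

■ $X = \{\mathbf{x}(1), \mathbf{x}(2), \dots, \mathbf{x}(N)\}$, where $\mathbf{x}(j) \in \mathcal{A}^n$, $j = 1, \dots, N$, are the codewords of code $X$

$X$ is called a DNA $(n, D)$-code based on stem Hamming similarity if the following 2 conditions are fulfilled:

1. For any $j$, there exists $j' \ne j$ such that $\mathbf{x}(j') = \tilde{\mathbf{x}}(j) \ne \mathbf{x}(j)$

2. For any $j \ne j'$, $S(\mathbf{x}(j), \mathbf{x}(j')) \le n - 1 - D$

■ Let $N_q(n, D)$ be the maximal size of DNA $(n, D)$-codes. The quantity

$$R_q(d) = \limsup_{n \to \infty} \frac{\log_q N_q(n, \lfloor nd \rfloor)}{n}$$

is called the rate of DNA codes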
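The two conditions can be checked mechanically for a small candidate code. The sketch below assumes the following reading of the definition: the complement of a letter $x$ is $(q-1) - x$, the code is closed under reverse complementation with no self-reverse-complement words, and every pair of distinct codewords has stem Hamming similarity at most $n - 1 - D$. All function names are illustrative.

```python
from itertools import combinations


def reverse_complement(word, q):
    """Reverse the word and replace each letter s by (q - 1) - s (assumed complement)."""
    return tuple(q - 1 - s for s in reversed(word))


def stem_similarity(x, y):
    """Number of adjacent position pairs where the two words agree."""
    return sum(
        x[i] == y[i] and x[i + 1] == y[i + 1]
        for i in range(len(x) - 1)
    )


def is_dna_code(code, q, D):
    """Check the two conditions of a DNA (n, D)-code as read above."""
    words = set(code)
    if not words:
        return True
    n = len(next(iter(words)))
    # Condition 1: every word has a distinct reverse complement in the code.
    for w in words:
        rc = reverse_complement(w, q)
        if rc == w or rc not in words:
            return False
    # Condition 2: the similarity of distinct codewords is at most n - 1 - D.
    return all(
        stem_similarity(u, v) <= n - 1 - D
        for u, v in combinations(words, 2)
    )


code = [(0, 0, 0, 0), (3, 3, 3, 3), (0, 2, 0, 2), (1, 3, 1, 3)]
print(is_dna_code(code, q=4, D=3))                   # True
print(is_dna_code(code + [(0, 1, 2, 3)], q=4, D=3))  # False: (0,1,2,3) is self-reverse-complement
```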


Q-ARY REED-MULLER CODES

■ q-ary Reed-Muller code: Let

Define mapping , with

Reed-Muller code of order is the image

■ The Reed-Muller code of order 1 satisfies the condition of reverse-complementarity

■ It may contain self-reverse-complement words, which should be excluded from the final construction


EXAMPLE OF CODE

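Let q=4 and m=1. The 16 codewords, arranged in rows and columns labeled 0, 1, 2, 3:

      0        1        2        3
0  0 0 0 0  0 1 2 3  0 2 0 2  0 3 2 1
1  1 1 1 1  1 2 3 0  1 3 1 3  1 0 3 2
2  2 2 2 2  2 3 0 1  2 0 2 0  2 1 0 3
3  3 3 3 3  3 0 1 2  3 1 3 1  3 2 1 0

Legend: each codeword is marked as either self-reverse complement or mutually-reverse complement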
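The 16 words of this example can be regenerated and classified with a short script. Two assumptions are made here and are not taken from the slide text: the codewords are the value tables of $a + b \cdot t \pmod 4$ for $t = 0, 1, 2, 3$ (this reproduces the 16 words listed), and the complement of a letter $x$ is $3 - x$.

```python
Q = 4


def reverse_complement(word):
    """Reverse the word and replace each letter s by 3 - s (assumed complement)."""
    return tuple(Q - 1 - s for s in reversed(word))


# Assumed generation rule: evaluate a + b*t (mod 4) at t = 0, 1, 2, 3.
codewords = [
    tuple((a + b * t) % Q for t in range(Q))
    for a in range(Q)
    for b in range(Q)
]

# Words equal to their own reverse complement.
self_rc = [w for w in codewords if reverse_complement(w) == w]

# Unordered pairs {w, reverse complement of w} with two different members.
mutual_pairs = sorted(
    {tuple(sorted((w, reverse_complement(w)))) for w in codewords
     if reverse_complement(w) != w}
)

print(self_rc)
print(len(mutual_pairs))
```

Under these assumptions the script finds 4 self-reverse-complement words, (0,1,2,3), (1,0,3,2), (2,3,0,1) and (3,2,1,0), and 6 mutually reverse-complement pairs; removing the 4 self-reverse-complement words leaves 12 codewords.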


RANDOM CODING

■ $\mathbf{x}$ and $\mathbf{y}$ are independent identically distributed random sequences with uniform distribution on $\mathcal{A}^n$

■ Define

■ Probability distribution of

■ Sum of


GILBERT-VARSHAMOV BOUND

■ Let . Introduce

■ We construct a random code as a collection of independent variables and their reverse-complements. This leads to the necessity of a special random coding technique for DNA codes

■ One can check, that

■ Random coding bound (Gilbert-Varshamov bound):

if then


CALCULATION OF THE BOUND

■ are dependent variables: and both depend on and

■ do not constitute a Markov chain:

vs.

■ are deterministic functions of Markov chain :

and

■ We cannot apply the standard technique used in the case of Hamming similarity

■ We have to use the Large Deviations Principle for Markov chains for
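The dependence structure can be made concrete for finite $n$. The sketch below is not the large-deviations argument of the slides; it is an exact dynamic program under one reading of the setup: for independent uniform sequences the position-wise match indicators are independent with probability $1/q$, and the stem similarity is the number of adjacent pairs of matches, so it can be tracked with a two-state chain (last position matched or not). The function name is illustrative.

```python
from fractions import Fraction


def stem_similarity_distribution(n, q):
    """Exact distribution of the stem similarity of two independent uniform
    q-ary sequences of length n; entry s is P(similarity = s), s = 0..n-1."""
    p = Fraction(1, q)  # probability that a single position matches
    # State: (last position matched?, similarity so far) -> probability.
    dist = {(1, 0): p, (0, 0): 1 - p}
    for _ in range(n - 1):
        new = {}
        for (last, s), prob in dist.items():
            # Next position does not match: similarity unchanged.
            new[(0, s)] = new.get((0, s), 0) + prob * (1 - p)
            # Next position matches: similarity grows only if the last one matched too.
            new[(1, s + last)] = new.get((1, s + last), 0) + prob * p
        dist = new
    result = [Fraction(0)] * n
    for (_, s), prob in dist.items():
        result[s] += prob
    return result


dist = stem_similarity_distribution(n=5, q=4)
print([float(v) for v in dist])
print(sum(s * v for s, v in enumerate(dist)))  # mean (n - 1) / q**2 = 1/4
```

Exact fractions are used so that the probabilities sum to exactly 1; the asymptotic exponents needed for the bound still require the large-deviations machinery named on the slide.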


GILBERT-VARSHAMOV BOUND

■ Introduce

■ Gilbert-Varshamov lower bound on the rate :

If then , where

and is a decreasing -convex function with


UPPER BOUNDS

■ Plotkin upper bound:

If , then and

if

■ Elias upper bound:

If , then , where is presented by parametric equation

■ Elias bound improves Plotkin bound for small values of . We calculated and .


BOUNDS ON THE RATE (Q=2)

[Figure: bounds on the rate of DNA codes for q=2, plotted on a horizontal axis from 0 to 1 and a vertical axis from 0 to 1.2. Curves: Gilbert-Varshamov bound, Plotkin bound, Hamming bound, Elias bound. The value 0.75 is marked on the plot.]


BOUNDS ON THE RATE (Q=4)

[Figure: bounds on the rate of DNA codes for q=4, plotted on a horizontal axis from 0 to 1 and a vertical axis from 0 to 1.2. Curves: Gilbert-Varshamov bound, Plotkin bound, Hamming bound, Elias bound. The value 0.9375 is marked on the plot.]


FIBONACCI NUMBERS

■ q-ary Fibonacci numbers are defined by the recurrence equation

with initial conditions

■ q-ary Fibonacci numbers may also be calculated as the sum

■ A q-ary Fibonacci number may be interpreted as the number of q-ary sequences of length , which do not contain 2-stems of the form (0,0)
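The combinatorial interpretation can be checked by brute force. The recurrence and the indexing used below are assumptions, since the formulas of the slide are not available: writing $a(n)$ for the number of q-ary sequences of length $n$ with no adjacent pair (0,0), a sequence either ends in a nonzero letter or ends in a nonzero letter followed by 0, which gives $a(n) = (q-1)\,(a(n-1) + a(n-2))$ with $a(0) = 1$ and $a(1) = q$.

```python
from itertools import product


def count_bruteforce(n, q):
    """Count q-ary sequences of length n with no adjacent pair (0, 0)."""
    return sum(
        1
        for seq in product(range(q), repeat=n)
        if all(not (a == 0 and b == 0) for a, b in zip(seq, seq[1:]))
    )


def count_recurrence(n, q):
    """Same count via a(n) = (q - 1) * (a(n-1) + a(n-2)), a(0) = 1, a(1) = q."""
    prev, cur = 1, q
    if n == 0:
        return prev
    for _ in range(n - 1):
        prev, cur = cur, (q - 1) * (cur + prev)
    return cur


print([count_recurrence(n, 2) for n in range(7)])  # [1, 2, 3, 5, 8, 13, 21]
print([count_recurrence(n, 4) for n in range(5)])  # [1, 4, 15, 57, 216]
```

For q=2 the counts are the classical Fibonacci numbers, which motivates the name; the slide's own indexing may be shifted relative to this sketch.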


COMBINATORIAL CALCULATION

■ Space with metric is homogeneous, i.e., the volume of a sphere does not depend on its center

■ Define

for any

■ Consider a sphere with center . Any sequence must have no common 2-stems (pairs) with . In other words, it must have no 2-stems of type (0,0). Thus,

■ Sphere sizes for other may be obtained using the same technique with some corresponding modifications
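Both claims, homogeneity and the link to sequences without (0,0) pairs, can be verified by exhaustive enumeration for small parameters. The sketch assumes the pair-counting definition of stem Hamming similarity and groups all sequences by their similarity to a chosen center; function names are illustrative.

```python
from collections import Counter
from itertools import product


def stem_similarity(x, y):
    """Number of adjacent position pairs where the two sequences agree."""
    return sum(
        x[i] == y[i] and x[i + 1] == y[i + 1]
        for i in range(len(x) - 1)
    )


def similarity_profile(center, q):
    """Map each similarity value s to the number of sequences y with S(center, y) = s."""
    n = len(center)
    return Counter(
        stem_similarity(center, y) for y in product(range(q), repeat=n)
    )


# Homogeneity: the profile is the same for every center (checked here for q=2, n=4).
profiles = [similarity_profile(c, 2) for c in product(range(2), repeat=4)]
print(all(p == profiles[0] for p in profiles))  # True

# Sequences sharing no 2-stem with the all-zero center of length 4.
print(profiles[0][0])  # 8
```

For the all-zero center and q=2, a sequence shares no 2-stem with the center exactly when it has no adjacent pair (0,0), and there are 8 such binary sequences of length 4.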


GRAPH OF PROBABILITIES

[Figure: probability distribution, plotted on a horizontal axis from 0 to 15 and a vertical axis from 0 to 1, with one curve for each of n = 5, 10, 20, 30, 40]


B-STEM HAMMING SIMILARITY

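■ $b$-stem Hamming similarity: instead of counting the number of 2-stems (pairs), calculate the number of $b$-stems

$$S_b(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n-b+1} s_i^{(b)}(\mathbf{x}, \mathbf{y}),$$

where

$$s_i^{(b)}(\mathbf{x}, \mathbf{y}) = \begin{cases} 1, & \text{if } x_{i+k} = y_{i+k} \text{ for all } k = 0, 1, \dots, b-1, \\ 0, & \text{otherwise.} \end{cases}$$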
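A minimal sketch of the b-stem generalization, assuming a b-stem is a window of b consecutive positions at which the two sequences agree; the function name and the sample sequences are illustrative.

```python
def b_stem_similarity(x, y, b):
    """Number of windows of b consecutive positions where x and y agree."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    if b < 1:
        raise ValueError("b must be a positive integer")
    return sum(
        all(x[i + k] == y[i + k] for k in range(b))
        for i in range(len(x) - b + 1)
    )


x = (0, 1, 2, 3, 0, 1)
y = (0, 1, 2, 0, 0, 1)

# b = 1 gives the Hamming similarity, b = 2 the stem Hamming similarity.
print([b_stem_similarity(x, y, b) for b in (1, 2, 3, 4)])  # [5, 3, 1, 0]
```

Larger values of b only reward longer runs of consecutive matches, so the similarity is non-increasing in b.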


WEIGHTED STEM HAMMING SIMILARITY

■ Weighted stem Hamming similarity: assign weight to each type of q-ary pairs and take it into account while calculating the sum

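■ Let $w = w(a, b) > 0$, $a, b \in \mathcal{A}$, be a weight function such that $w(a, b) = w(\bar{b}, \bar{a})$

■ Similarity is defined as follows:

$$S_w(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n-1} s_i^{w}(\mathbf{x}, \mathbf{y}),$$

where

$$s_i^{w}(\mathbf{x}, \mathbf{y}) = \begin{cases} w(x_i, x_{i+1}), & \text{if } x_i = y_i \text{ and } x_{i+1} = y_{i+1}, \\ 0, & \text{otherwise.} \end{cases}$$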
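A sketch of the weighted variant, assuming each common 2-stem contributes the weight of its pair of letters instead of 1. The weight table below is hypothetical and chosen only for illustration; it is given for q=2 and satisfies w(0,0) = w(1,1), which is what a reverse-complement symmetry condition would require for q=2.

```python
def weighted_stem_similarity(x, y, weight):
    """Sum of weight[(x[i], x[i+1])] over adjacent position pairs where x and y agree."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(
        weight[(x[i], x[i + 1])]
        for i in range(len(x) - 1)
        if x[i] == y[i] and x[i + 1] == y[i + 1]
    )


# Hypothetical weights for the four binary pairs.
weight = {(0, 0): 1, (1, 1): 1, (0, 1): 2, (1, 0): 3}
unit = {pair: 1 for pair in weight}

x = (0, 1, 1, 0)
y = (0, 1, 1, 1)

# Common 2-stems: (0, 1) at the first position and (1, 1) at the second.
print(weighted_stem_similarity(x, y, weight))  # 3
print(weighted_stem_similarity(x, y, unit))    # 2, the unweighted stem similarity
```

With all weights equal to 1 the function reduces to the unweighted stem Hamming similarity.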


INSERTION-DELETION STEM SIMILARITY

■ Insertion-deletion stem similarity: allow loops and shifts in the DNA duplex

■ is a common block subsequence between and , if is an ordered collection of non-overlapping common ( , )-blocks of length

1. common ( , )-block of length , is a subsequence of and , consisting of consecutive elements of and

■ is the set of all common block subsequences between and

■ is the minimal number of blocks of consecutive elements of and in the given subsequence

■ Similarity is defined as follows:

[Figure: a DNA duplex illustrating a shift and a loop]


BIBLIOGRAPHY

Probability theory and Large Deviation Principle

■ V.N. Tutubalin, The Theory of Probability and Random Processes. Moscow: Publishing House of Moscow State University, 1992 (in Russian).

■ A. Dembo, O. Zeitouni, Large Deviations Techniques and Applications. Boston, MA: Jones and Bartlett, 1993.

DNA codes

■ A.G. D'yachkov, A.J. Macula, D.C. Torney, P.A. Vilenkin, P.S. White, I.K. Ismagilov, R.S. Sarbayev, On DNA Codes. Problemy Peredachi Informatsii, 2005, V. 41, N. 4, P. 57-77 (in Russian). English translation: Problems of Information Transmission, 2005, V. 41, N. 4, P. 349-367.

■ M.A. Bishop, A.G. D'yachkov, A.J. Macula, T.E. Renz, V.V. Rykov, Free Energy Gap and Statistical Thermodynamic Fidelity of DNA Codes. Journal of Computational Biology, 2007, V. 14, N. 8, P. 1088-1104.

■ A. D'yachkov, A. Macula, T. Renz, V. Rykov, Random Coding Bounds for DNA Codes Based on Fibonacci Ensembles of DNA Sequences. Proc. of 2008 IEEE International Symposium on Information Theory, Toronto, Canada, 2008, in print.