DNA Compression (Encoded using Huffman Encoding Method)

Page 1: DNA Compression (Encoded using Huffman Encoding Method)

DNA compression (Encoded using Huffman Coding Method)

Marwa K. Al-Rikaby, University of Babylon / College of Information Technology

Page 2: DNA Compression (Encoded using Huffman Encoding Method)

DNA

DNA is one of the building blocks of living organisms. It consists of four chemical bases:

• Adenine (A)
• Thymine (T)
• Cytosine (C)
• Guanine (G)

DNA bases pair up with each other, A with T and C with G, to form units called base pairs.

Human DNA contains around 3 billion bases, and about 99% of those bases are the same in any two people.

Page 3: DNA Compression (Encoded using Huffman Encoding Method)

DNA Compression Bases

Goal: analysis, and saving space and time.

DNA sequences are built from the alphabet {A, T, C, G} and contain various repeats, which are usually approximate rather than exact.

Only lossless algorithms are valid.

A DNA compression model should preferably:

• Be based on biological knowledge.
• Give compression.
• Be simple, with few parameters.
• Give per-symbol information content.
• Use an efficient algorithm.

Page 4: DNA Compression (Encoded using Huffman Encoding Method)

How to compress DNA?

Since DNA sequences contain only the four bases {a, c, g, t}, they can be stored using two bits per input symbol.

Standard compression tools, such as gzip and bzip2, usually fail to achieve any compression, since they use more than two bits per symbol.

When compressing 229354 bases (57338 bytes at two bits per base), we get:

• HEHCMVCG: 57338 bytes (without compression).
• gzip: 66741 bytes (negative compression).
• bzip2: 62169 bytes (negative compression).
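The two-bits-per-base baseline can be sketched as follows. This is a minimal illustration, not the method used to produce the figures above; the bit assignment (A=00, C=01, G=10, T=11) and the function names `pack` and `unpack` are assumptions made for the example.

```python
# Hypothetical 2-bit code for each base; any fixed assignment would work.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASE = "ACGT"  # inverse mapping: index -> base


def pack(seq):
    """Pack a DNA string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for ch in chunk:
            byte = (byte << 2) | CODE[ch]
        # Left-align a final partial chunk by padding with zero bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data, n):
    """Recover the first n bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASE[(byte >> shift) & 0b11])
    return "".join(bases[:n])
```

The number of bases has to be stored alongside the packed bytes, because the padding in the last byte is otherwise indistinguishable from real "A" symbols.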

Page 5: DNA Compression (Encoded using Huffman Encoding Method)

How to compress DNA?

In the case of multiple genomes from the same species, associated with ‘resequencing’ technologies, the flat text file approach is clearly wasteful, since for the most part the sequences are identical.

A simple approach is to store a reference sequence and then, for each other sequence, encode only the differences (or ‘deltas’) with respect to the reference sequence.

Consider the sequences AACGACTAGTAATTTG and CACGTCTAGTAATGTG, which are identical except for substitutions at positions 1 (A→C), 5 (A→T) and 14 (T→G). Each SNP (single-nucleotide polymorphism) can be encoded by a pair (i, X), where i is an integer encoding the position and X represents the value of the substitution relative to the reference.
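A minimal sketch of this reference-plus-deltas idea, assuming both sequences have the same length and differ only by substitutions. The function names `encode_deltas` and `apply_deltas` are illustrative; positions are 1-based to match the example.

```python
def encode_deltas(reference, sequence):
    """Return the (position, base) pairs where sequence differs from reference."""
    if len(reference) != len(sequence):
        raise ValueError("this sketch handles substitutions only")
    return [
        (i, x)
        for i, (r, x) in enumerate(zip(reference, sequence), start=1)
        if r != x
    ]


def apply_deltas(reference, deltas):
    """Rebuild a sequence from the reference and its substitution pairs."""
    bases = list(reference)
    for i, x in deltas:
        bases[i - 1] = x  # convert the 1-based position to a list index
    return "".join(bases)


reference = "AACGACTAGTAATTTG"
other = "CACGTCTAGTAATGTG"
deltas = encode_deltas(reference, other)
print(deltas)  # [(1, 'C'), (5, 'T'), (14, 'G')]
```

Applying `apply_deltas(reference, deltas)` gives back the second sequence, so only the three pairs need to be stored for it.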

Page 6: DNA Compression (Encoded using Huffman Encoding Method)

How to compress DNA?

Although the basic idea is easy to understand, and not new, a precise implementation requires addressing a number of important technical issues:

• One can use local relative addresses, i.e. intervals, rather than absolute addresses. Using intervals, the above example ‘1C5T14G’ becomes ‘0C4T9G’. With intervals, the dynamic range of the integers to be encoded may be considerably smaller than with absolute addresses. The relatively modest price to pay is that intervals must be added up to recover absolute coordinates.
• If the positions at which variations occur in the population are fixed and form a relatively small subset of all possible positions, then additional savings may result from focusing only on those positions.
• The choice of the reference sequence.
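The conversion between absolute addresses and intervals can be sketched as below. The text does not state what the first interval is measured from; the ‘0C4T9G’ example is consistent with measuring it from position 1, so that is the assumption here (the `origin` parameter). The function names are illustrative.

```python
from itertools import accumulate


def to_intervals(positions, origin=1):
    """Replace each absolute position by its distance from the previous one."""
    intervals = []
    previous = origin
    for p in positions:
        intervals.append(p - previous)
        previous = p
    return intervals


def from_intervals(intervals, origin=1):
    """Sum the intervals to recover the absolute positions."""
    return [origin + total for total in accumulate(intervals)]


print(to_intervals([1, 5, 14]))   # [0, 4, 9]
print(from_intervals([0, 4, 9]))  # [1, 5, 14]
```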

Page 7: DNA Compression (Encoded using Huffman Encoding Method)

How to compress DNA?

All applications of the basic ideas hinge on a fundamental technical problem: how to encode integers, representing for instance absolute or relative genomic addresses or read lengths, into binary strings?

We are interested in binary encoding schemes for sequences of integers that can be parsed automatically and that, consistently with information theory, are entropy efficient, in the sense that fewer bits are used to encode more frequent events.
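As one illustration of a code that can be parsed without separators, here is a sketch of Elias gamma coding. The slides do not name this scheme; it is brought in only as a well-known example, and it is entropy efficient only under the assumption that small integers are the frequent ones. Because it cannot encode 0, intervals such as 0, 4, 9 are shifted by one before encoding in this sketch.

```python
def gamma_encode(n):
    """Elias gamma code of an integer n >= 1, as a string of '0'/'1'."""
    if n < 1:
        raise ValueError("Elias gamma encodes positive integers only")
    binary = bin(n)[2:]
    # A run of zeros announces how many bits follow the leading 1.
    return "0" * (len(binary) - 1) + binary


def gamma_decode_all(bits):
    """Parse a concatenation of gamma codes back into integers."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


intervals = [0, 4, 9]
stream = "".join(gamma_encode(k + 1) for k in intervals)  # shift by one
decoded = [v - 1 for v in gamma_decode_all(stream)]
```

Smaller integers get shorter codewords (1 is a single bit, 5 takes five bits), and the stream is decoded with no delimiters between codewords.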

Page 8: DNA Compression (Encoded using Huffman Encoding Method)

How to compress DNA?

Common components of most DNA compression algorithms:

1. Finding the candidate repeat segments.
2. Considering approximate repeats.
3. Selecting the best subset of compatible repeats.
4. Encoding of the repeat segments.
5. Encoding of the non-repeat segments.

Page 9: DNA Compression (Encoded using Huffman Encoding Method)

How to compress DNA?

Suppose we have the following DNA sequence:

TGATAGGTGATAGATATGATTGATAGATGATAGAAGATTGATAGATGATAGGGATATCACGTAGTCCCTAGCTCTTGGCGCTGGATGGGGCGGACGGTAAGGGAAATCGACCGTTGATAGTCCAAATTCGGTCGTATGATAGAAATTTCGAATGGAAATTCTGATACATAGGTGATAGTAGATGTAAGATGATAGATGATAGATAGATAGATGATAGACAGATTGATAGATGATAGAGAGA

Page 10: DNA Compression (Encoded using Huffman Encoding Method)

How to compress DNA?

1. Finding the candidate repeat segments:

Let “TGATAG” be a candidate segment, so we’ll find its repetitions in the example sequence.


Page 20: DNA Compression (Encoded using Huffman Encoding Method)

How to compress DNA?

The total number of “TGATAG” occurrences is 14.

The repetitions of every segment should be identified in this way.

The resulting counts are kept for use in the encoding.
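Counting the exact repetitions of a candidate segment can be sketched as a simple scan. To keep the example short, it runs on the first 51 bases of the example sequence rather than the whole of it; the function name `find_occurrences` is illustrative, and positions are 0-based.

```python
def find_occurrences(sequence, segment):
    """Return the 0-based start positions of every occurrence of segment."""
    positions = []
    start = sequence.find(segment)
    while start != -1:
        positions.append(start)
        # Resume one base further on, so overlapping matches are also found.
        start = sequence.find(segment, start + 1)
    return positions


# First 51 bases of the example sequence.
PREFIX = "TGATAGGTGATAGATATGATTGATAGATGATAGAAGATTGATAGATGATAG"

hits = find_occurrences(PREFIX, "TGATAG")
print(len(hits), hits)  # 6 [0, 7, 20, 27, 38, 45]
```

Running the same function over the full sequence gives the occurrence count that is carried into the encoding step.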

Page 21: DNA Compression (Encoded using Huffman Encoding Method)

How to compress DNA?

2. Considering approximate repeats:

Scan the sequence for similar segments, i.e. segments that become identical after applying one of the four basic operations:

• Insertion: “AAATTCG” becomes “AAATTCTG” after Ins(T,6).
• Deletion: “AAATTCG” becomes “AAATTG” after Del(5,1).
• Replacement: “AAATTCG” becomes “AATTTCG” after Rep(2,T).
• Reverse: “AAATTCG” becomes “GCTTAAA” after Rev().
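The four operations can be sketched as small string functions. The slides do not say whether positions are counted from 0 or from 1; all three positional examples work out with 0-based positions, so that is the assumption here. The function names are illustrative, and the argument order follows the notation Ins(T,6), Del(5,1), Rep(2,T).

```python
def ins(segment, base, pos):
    """Ins(base, pos): insert base so that it ends up at index pos."""
    return segment[:pos] + base + segment[pos:]


def delete(segment, pos, length):
    """Del(pos, length): remove length bases starting at index pos."""
    return segment[:pos] + segment[pos + length:]


def rep(segment, pos, base):
    """Rep(pos, base): overwrite the base at index pos."""
    return segment[:pos] + base + segment[pos + 1:]


def rev(segment):
    """Rev(): reverse the segment."""
    return segment[::-1]


print(ins("AAATTCG", "T", 6))   # AAATTCTG
print(delete("AAATTCG", 5, 1))  # AAATTG
print(rep("AAATTCG", 2, "T"))   # AATTTCG
print(rev("AAATTCG"))           # GCTTAAA
```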

Page 22: DNA Compression (Encoded using Huffman Encoding Method)

How to compress DNA?

2. Considering approximate repeats:

Let “ATATGA” be a reference segment. Then “ATATCA” matches it once the “G” in the reference is replaced by “C”.

Page 23: DNA Compression (Encoded using Huffman Encoding Method)

How to compress DNA?

2. Considering approximate repeats:

“ATAGA” matches “ATATGA” once the “T” at position 3 of “ATATGA” is deleted.

Page 24: DNA Compression (Encoded using Huffman Encoding Method)

How to compress DNA?

2. Considering approximate repeats:

“GGCGC” matches “GGCGG” once the “C” at position 4 is replaced by “G”.

Page 25: DNA Compression (Encoded using Huffman Encoding Method)

How to compress DNA?

2. Considering approximate repeats:

“AATGG” matches “GGTAA” once it is reversed.

Page 26: DNA Compression (Encoded using Huffman Encoding Method)

How to compress DNA?

3. Selecting the best subset of compatible repeats:

Choosing the reference segments is a major and very sensitive step, since the design of the reference sequence affects not only the variants to be recorded but also the intervals. It must therefore also take into account any constraints that a particular implementation places on the intervals and their encodings.

In our example, each detected segment is assigned an integer that points to its index in the reference table.

Page 27: DNA Compression (Encoded using Huffman Encoding Method)

The reference table contains:

• The four basic symbols {A, T, G, C}.
• The candidate segments.
• The basic operations, each one with the available parameters applied on the sequence.

Segment    Index
A          0
T          1
C          2
G          3
TGATAG     4
ATATGA     5
AAATTCG    6
GGTAA      7
GGCGC      8
RepC       9
Del        10
InsT       11
Rev        12
RepG       13
RepT       14


Page 35: DNA Compression (Encoded using Huffman Encoding Method)

4. Encoding of the repeat segments:

Initially, the exact repetitions of each candidate segment must be counted in the same way shown in step 1.

Segment    Index  Repetitions
A          0
T          1
C          2
G          3
TGATAG     4      14
ATATGA     5      1
AAATTCG    6      1
GGTAA      7      1
GGCGC      8      1
RepC       9
Del        10
InsT       11
Rev        12
RepG       13
RepT       14


Page 50: DNA Compression (Encoded using Huffman Encoding Method)

5. Encoding of the non-repeat segments:

The approximate segments must be processed in the same way shown in step 2. Each approximate repeat adds one to the count of its reference segment and one to the count of the operation applied to it. The single bases A, T, C and G are counted last.

Segment    Index  Repetitions
A          0      28
T          1      19
C          2      14
G          3      25
TGATAG     4      14
ATATGA     5      3
AAATTCG    6      4
GGTAA      7      2
GGCGC      8      2
RepC       9      1
Del        10     2
InsT       11     2
Rev        12     1
RepG       13     1
RepT       14     1

Page 51: DNA Compression (Encoded using Huffman Encoding Method)

Encoding by Huffman method

First, find the probability of each segment:

Segment    Index  Repetitions  Probability
A          0      28           28/119
T          1      19           19/119
C          2      14           14/119
G          3      25           25/119
TGATAG     4      14           14/119
ATATGA     5      3            3/119
AAATTCG    6      4            4/119
GGTAA      7      2            2/119
GGCGC      8      2            2/119
RepC       9      1            1/119
Del        10     2            2/119
InsT       11     2            2/119
Rev        12     1            1/119
RepG       13     1            1/119
RepT       14     1            1/119

No. of segments = 119
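The probabilities are simply each repetition count divided by the total number of segments. A short sketch of that arithmetic, using the counts from the table (the names `COUNTS`, `total` and `probabilities` are illustrative):

```python
# Repetition counts taken from the table above.
COUNTS = {
    "A": 28, "T": 19, "C": 14, "G": 25,
    "TGATAG": 14, "ATATGA": 3, "AAATTCG": 4, "GGTAA": 2, "GGCGC": 2,
    "RepC": 1, "Del": 2, "InsT": 2, "Rev": 1, "RepG": 1, "RepT": 1,
}

total = sum(COUNTS.values())  # number of segments
probabilities = {segment: n / total for segment, n in COUNTS.items()}

for segment, n in COUNTS.items():
    print(f"{segment:8s} {n}/{total} = {probabilities[segment]:.4f}")
```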

Page 52: DNA Compression (Encoded using Huffman Encoding Method)

Encoding by Huffman method

Arrange the segments in non-decreasing order of probability:

Index  Probability
14     1/119
13     1/119
12     1/119
9      1/119
11     2/119
10     2/119
8      2/119
7      2/119
5      3/119
6      4/119
4      14/119
2      14/119
1      19/119
3      25/119
0      28/119

Page 53: DNA Compression (Encoded using Huffman Encoding Method)

Encoding by Huffman method

Build the Huffman coding tree from the segments sorted by probability.


Page 69: DNA Compression (Encoded using Huffman Encoding Method)

Encoding by Huffman method

Repeatedly merging the two nodes with the lowest probabilities creates the internal nodes of the tree, in this order:

2/119, 2/119, 4/119, 4/119, 4/119, 7/119, 8/119, 11/119, 19/119, 28/119, 38/119, 53/119, 66/119, 119/119

Next, starting from the root and working back towards the leaves, label the branch leading to the smaller value with 0 and the branch leading to the larger value with 1.
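The tree construction and branch labelling can be sketched with a priority queue. This is a generic Huffman sketch under stated assumptions, not the slides' exact tree: the repetition counts are taken from the probability table, ties between equal weights are broken by insertion order (which the slides do not specify), and the function name `huffman_codes` is illustrative. Individual codewords therefore may differ from the slides, although the sequence of merged weights does not depend on how ties are broken.

```python
import heapq
from itertools import count

# Repetition counts from the probability table (weights out of 119).
COUNTS = {
    "A": 28, "T": 19, "C": 14, "G": 25,
    "TGATAG": 14, "ATATGA": 3, "AAATTCG": 4, "GGTAA": 2, "GGCGC": 2,
    "RepC": 1, "Del": 2, "InsT": 2, "Rev": 1, "RepG": 1, "RepT": 1,
}


def huffman_codes(counts):
    """Return (codes, merges): a codeword per symbol and the merged weights."""
    tiebreak = count()  # unique counter, so nodes themselves are never compared
    heap = [(w, next(tiebreak), sym) for sym, w in counts.items()]
    heapq.heapify(heap)
    merges = []
    while len(heap) > 1:
        w0, _, small = heapq.heappop(heap)  # smaller weight -> branch 0
        w1, _, large = heapq.heappop(heap)  # larger weight  -> branch 1
        merges.append(w0 + w1)
        heapq.heappush(heap, (w0 + w1, next(tiebreak), (small, large)))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):  # internal node: (small child, large child)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:  # leaf: a symbol name
            codes[node] = prefix or "0"

    walk(heap[0][2], "")
    return codes, merges


codes, merges = huffman_codes(COUNTS)
print(merges)
for symbol in sorted(codes, key=lambda s: len(codes[s])):
    print(f"{symbol:8s} {codes[symbol]}")
```

The printed `merges` list can be compared with the internal node values listed above. Frequent symbols such as A and G end up near the root with short codewords, while the rare operation symbols sit deepest in the tree.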

Page 70: DNA Compression (Encoded using Huffman Encoding Method)

Encoding by Huffman method14131291110875642130

1/1191/1191/1191/1192/1192/1192/1192/1193/1194/11914/11914/11919/11925/11928/119

2/119

2/119

4/119

4/119

4/119

7/119

8/119

11/119

19/119

28/119

38/119

53/119

66/119

119/119

Next, beginning from the root and backing to the leaves, give each branch with the small value the value (0) and the large the value (1).

10

Page 71: DNA Compression (Encoded using Huffman Encoding Method)

Encoding by Huffman method

[Figure: the Huffman tree during branch labeling; 2 of the 14 branch pairs are labeled 0/1.]

Page 72: DNA Compression (Encoded using Huffman Encoding Method)

Encoding by Huffman method

[Figure: the Huffman tree during branch labeling; 3 of the 14 branch pairs are labeled 0/1.]

Page 73: DNA Compression (Encoded using Huffman Encoding Method)

Encoding by Huffman method

[Figure: the Huffman tree during branch labeling; 4 of the 14 branch pairs are labeled 0/1.]

Page 74: DNA Compression (Encoded using Huffman Encoding Method)

Encoding by Huffman method

[Figure: the Huffman tree during branch labeling; 5 of the 14 branch pairs are labeled 0/1.]

Page 75: DNA Compression (Encoded using Huffman Encoding Method)

Encoding by Huffman method

[Figure: the Huffman tree during branch labeling; 6 of the 14 branch pairs are labeled 0/1.]

Page 76: DNA Compression (Encoded using Huffman Encoding Method)

Encoding by Huffman method

[Figure: the Huffman tree during branch labeling; 7 of the 14 branch pairs are labeled 0/1.]

Page 77: DNA Compression (Encoded using Huffman Encoding Method)

Encoding by Huffman method

[Figure: the Huffman tree during branch labeling; 8 of the 14 branch pairs are labeled 0/1.]

Page 78: DNA Compression (Encoded using Huffman Encoding Method)

Encoding by Huffman method

[Figure: the Huffman tree with all 14 branch pairs labeled 0/1.]

Page 79: DNA Compression (Encoded using Huffman Encoding Method)

Encoding by Huffman method

[Figure: the fully labeled Huffman tree.]

Finally, encode each segment by reading the branch labels along the path from the root to that segment's leaf.

Page 80: DNA Compression (Encoded using Huffman Encoding Method)

Encoding by Huffman method

[Figure: the fully labeled Huffman tree, tracing the path from the root to the leaf of segment 0 (A).]

Code(0) = Code(A) = 10

Page 81: DNA Compression (Encoded using Huffman Encoding Method)

Encoding by Huffman method

[Figure: the fully labeled Huffman tree, tracing the path from the root to the leaf of segment 3 (G).]

Code(0) = Code(A) = 10
Code(3) = Code(G) = 00

Page 82: DNA Compression (Encoded using Huffman Encoding Method)

Encoding by Huffman method

[Figure: the fully labeled Huffman tree, tracing the path from the root to the leaf of segment 9 (RepC).]

Code(0) = Code(A) = 10
Code(3) = Code(G) = 00
Code(9) = Code(RepC) = 1100111

Page 83: DNA Compression (Encoded using Huffman Encoding Method)

Encoding by Huffman method

The final reference table is:

Segment   Index  Repetitions  Probability  Code
A         0      28           28/119       10
T         1      19           19/119       111
C         2      14           14/119       011
G         3      25           25/119       00
TGATAG    4      14           14/119       010
ATATGA    5      3            3/119        110110
AAATTCG   6      4            4/119        110111
GGTAA     7      2            2/119        110101
GGCGC     8      2            2/119        110100
RepC      9      1            1/119        1100111
Del       10     2            2/119        110001
InsT      11     2            2/119        110000
Rev       12     1            1/119        1100110
RepG      13     1            1/119        1100101
RepT      14     1            1/119        1100100

Keep in mind that only the segments and the codes are important for the decoder.
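To illustrate why the decoder needs nothing beyond the segments and their codes, here is a small Python sketch of encoding and decoding with the table above. The names `CODEBOOK`, `encode`, and `decode` are hypothetical, and the sample message is invented for the example. Operation segments such as RepC or Del are treated as opaque symbols here; any parameters they carry in the full scheme are outside this sketch.

```python
# Codebook copied from the reference table in the slides (segment -> code).
CODEBOOK = {
    "A": "10", "T": "111", "C": "011", "G": "00", "TGATAG": "010",
    "ATATGA": "110110", "AAATTCG": "110111", "GGTAA": "110101",
    "GGCGC": "110100", "RepC": "1100111", "Del": "110001",
    "InsT": "110000", "Rev": "1100110", "RepG": "1100101",
    "RepT": "1100100",
}


def encode(segments):
    """Concatenate the code of each segment into one bit string."""
    return "".join(CODEBOOK[seg] for seg in segments)


def decode(bits):
    """Read bits left to right, emitting a segment whenever a code matches.

    This works without separators because no code is a prefix of another.
    """
    inverse = {code: seg for seg, code in CODEBOOK.items()}
    segments, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:
            segments.append(inverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("trailing bits do not form a complete code")
    return segments


# Invented sample message, used only to exercise the codebook.
message = ["A", "G", "RepC", "TGATAG", "T"]
bits = encode(message)
print(bits)          # 10001100111010111
print(decode(bits))  # ['A', 'G', 'RepC', 'TGATAG', 'T']
```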

Page 84: DNA Compression (Encoded using Huffman Encoding Method)

Encoding by Huffman method

The coding above satisfies both the prefix property and the information-theoretic principle behind Huffman coding:
• No segment's code is a prefix of another segment's code.
• The shortest codes are given to the most frequent segments, while longer codes are assigned to the less frequent ones.
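Both properties can be checked mechanically against the reference table. The following Python sketch is one way to do so; the names `TABLE` and `is_prefix_free` are hypothetical, and the comparison with a 4-bit fixed-length code (the minimum for 15 distinct segments) is added here for illustration and does not come from the slides.

```python
from fractions import Fraction

# (segment, repetitions, code) rows copied from the slides' reference table.
TABLE = [
    ("A", 28, "10"), ("T", 19, "111"), ("C", 14, "011"), ("G", 25, "00"),
    ("TGATAG", 14, "010"), ("ATATGA", 3, "110110"), ("AAATTCG", 4, "110111"),
    ("GGTAA", 2, "110101"), ("GGCGC", 2, "110100"), ("RepC", 1, "1100111"),
    ("Del", 2, "110001"), ("InsT", 2, "110000"), ("Rev", 1, "1100110"),
    ("RepG", 1, "1100101"), ("RepT", 1, "1100100"),
]


def is_prefix_free(codes):
    """True if no code is a prefix of another.

    After sorting, any code that is a prefix of another code is immediately
    followed by a code that starts with it, so adjacent pairs suffice.
    """
    ordered = sorted(codes)
    return not any(b.startswith(a) for a, b in zip(ordered, ordered[1:]))


codes = [code for _, _, code in TABLE]
prefix_free = is_prefix_free(codes)

# Kraft sum: equals 1 when the code tree has no unused branches.
kraft = sum(Fraction(1, 2 ** len(code)) for code in codes)

# A more frequent segment never gets a longer code than a less frequent one.
monotone = all(
    len(ci) <= len(cj)
    for _, ni, ci in TABLE
    for _, nj, cj in TABLE
    if ni > nj
)

total_segments = sum(n for _, n, _ in TABLE)             # 119
total_bits = sum(n * len(code) for _, n, code in TABLE)  # 365
average = Fraction(total_bits, total_segments)           # about 3.07 bits
fixed_bits = 4 * total_segments                          # 476 bits at 4 bits each

print(prefix_free, kraft, monotone)
print(total_bits, float(average), fixed_bits)
```

Under these counts the Huffman codes spend 365 bits on the 119 segments, about 3.07 bits per segment, compared with 476 bits for a fixed 4-bit code per segment.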

Page 85: DNA Compression (Encoded using Huffman Encoding Method)

Thank you