FireμSat: An Algorithm to Detect Tandem Repeats in DNA

Corne de Ridder (a), Derrick G. Kourie (b), Bruce W. Watson (b)

(a) School of Computing, University of South Africa, Pretoria 0003, South Africa

(b) Fastar Research Group, Department of Computer Science, University of Pretoria, Pretoria 0002, South Africa

Introduction

• What are tandem repeats in DNA?

• How are we going to detect tandem repeats in DNA?

• Why would anybody want to detect tandem repeats in DNA?

Genetic sequences

• DNA consists of four different nucleotides, namely:

Adenine (A), Guanine (G), Cytosine (C) and Thymine (T)

• Genetic databanks, e.g. GenBank, EMBOSS and Entrez, store DNA sequences as concatenated single-letter codes in FASTA format.
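To make the input format concrete, the following is a minimal Python sketch of reading FASTA text into plain nucleotide strings. The function name read_fasta and the two example records are illustrative assumptions, not part of FireμSat.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header -> sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new record.
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines are concatenated; letters are upper-cased.
            records[header].append(line.upper())
    return {h: "".join(parts) for h, parts in records.items()}


# Hypothetical input, made up for illustration only.
example = """>seq1 hypothetical record
ACGACGACG
ACGACG
>seq2
ttgaca
"""
sequences = read_fasta(example)
```

Each value in sequences is one concatenated string over A, C, G and T, which is the form a TR detector would scan.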

Tandem Repeats (TR’s) in genome sequences


• DNA molecules are subject to numerous mutational events. One of the consequences of these events that can be detected by computationally analyzing genome sequences is tandem duplication.

• A TR, or TR-zone, is a string of nucleotides characterized by a motif that introduces the string and is contiguously followed by a number of ‘copies’ of the motif, e.g. ACGACGACGACGACG.

Tandem Repeats

• A TR is a perfect tandem repeat (PTR) if the copies are exact, e.g. ACGACGACGACGACG, which consists of five copies of the motif ACG.

• A TR is an approximate tandem repeat (ATR) if the copies of the motif include non-exact copies, so that mutational events have most likely occurred, e.g. ACGACACGAGGACGAG.

• In the absence of further qualification, reference to a tandem repeat should be construed as a reference to either a PTR or an ATR.
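The PTR definition can be checked mechanically. The sketch below, with the hypothetical helper name count_perfect_copies, returns the number of exact copies when a string is a PTR of a given motif and 0 otherwise.

```python
def count_perfect_copies(sequence, motif):
    """Return p if sequence equals motif repeated p >= 1 times, else 0."""
    if not motif or len(sequence) % len(motif) != 0:
        return 0
    p = len(sequence) // len(motif)
    # Compare against the motif concatenated p times.
    return p if p >= 1 and motif * p == sequence else 0


ptr_copies = count_perfect_copies("ACGACGACGACGACG", "ACG")
atr_copies = count_perfect_copies("ACGACACGAGGACGAG", "ACG")
```

The first call reports five copies for the PTR example; the second reports 0 because the ATR example contains non-exact copies.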

Tandem Repeat Elements

• A PTR element (PTRE) is a TR element that matches the motif. If the motif is for example ACG then the PTRE will also be ACG.

• An ATR element (ATRE) is a TR element similar to the motif but not an exact copy thereof. If the motif is ACG then an ATRE may, for example, be AC.

Microsatellites

• The length of PTRE’s may vary, which distinguishes satellites, minisatellites and microsatellites

• Microsatellites are a subset of TR’s with 2 ≤ |motif| ≤ 5 (conforming to Benson, Delgrange, Rivals & Abajian)

Formal problem statement

A PTR whose motif ρ is repeated p times, where p ≥ 1, is denoted by ρ^p. An ATR u that is derived from this PTR ρ^p must always have the motif ρ as its prefix. It therefore has the form ρu_2…u_p, where each ATRE u_k (k = 2…p) is the result of at most ε mutations on ρ. Here ε is the so-called motif error.

Besides the restrictions on the motif error, threshold values are also introduced that constrain the attributes of the detected TR.
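As an illustration of this definition, the sketch below tests whether a string can be split into the motif followed by elements that are each within ε edits of the motif. It uses plain Levenshtein distance and dynamic programming over split points, which is an assumption made for clarity and not the automaton-based method of FireμSat; the names edit_distance and is_atr are hypothetical. The motif on its own is accepted.

```python
def edit_distance(a, b):
    """Levenshtein distance: deletions, insertions and mismatches cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # drop ca
                            curr[j - 1] + 1,            # add cb
                            prev[j - 1] + (ca != cb)))  # match or mismatch
        prev = curr
    return prev[-1]


def is_atr(u, motif, eps):
    """True if u = motif u_2 ... u_p with every u_k within eps edits of motif."""
    if not u.startswith(motif):
        return False
    n, m = len(u), len(motif)
    reachable = [False] * (n + 1)   # reachable[i]: u[:i] splits into valid elements
    reachable[m] = True
    for start in range(m, n):
        if not reachable[start]:
            continue
        # An element within eps edits has length between m - eps and m + eps.
        for length in range(max(1, m - eps), m + eps + 1):
            end = start + length
            if end <= n and edit_distance(u[start:end], motif) <= eps:
                reachable[end] = True
    return reachable[n]
```

For example, ACGACACGAGGACGAG splits as ACG, AC, ACG, AGG, ACG, AG, so it is accepted for the motif ACG with ε = 1 and rejected with ε = 0.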

Tolerated error types

• Errors regarding the motif or PTRE (motif errors):
  • deletions
  • mismatches
  • insertions

• Errors related to the detected TR (TR errors):
  • the ratio between PTRE’s and ATRE’s
  • the minimum number of TRE’s to be reported
  • the maximum number of consecutive ATRE’s

Motif errors

Maximum of 50% error toleration

• If |ρ| = 2 or |ρ| = 3 then ε = 0 or ε = 1 (default = 1)

• If |ρ| = 4 or |ρ| = 5 then ε = 0, ε = 1 or ε = 2 (default = 2)

Consider the motif ACGTT: ACT is then an ATRE in which two deletions have occurred.
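The listed limits coincide with rounding half the motif length down. The sketch below encodes that reading; the formula and the function names are assumptions that merely reproduce the four cases given above.

```python
def max_motif_error(motif_length):
    """Largest tolerated motif error: at most 50% of the motif length."""
    if not 2 <= motif_length <= 5:
        raise ValueError("motif length is expected to lie between 2 and 5")
    return motif_length // 2


def allowed_motif_errors(motif_length):
    """All permitted values of the motif error for a given motif length."""
    return list(range(max_motif_error(motif_length) + 1))


limits = [max_motif_error(n) for n in (2, 3, 4, 5)]
```

For the motif ACGTT of length 5 the limit is 2, which is why ACT, obtained by two deletions, still qualifies as an ATRE.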

Motif errors: Types of Mutations

• Deletion: refers to the absence of a base pair in the motif.

• Insertion: an ATRE with up to ε base pairs inserted into any position of the PTRE.

• Mismatch: refers to the replacement of a base pair in the motif by another.
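The three mutation types can be illustrated by enumerating every string that lies exactly one mutation away from a motif. This is a small sketch with hypothetical function names; it handles a single mutation only, not the general case of up to ε mutations.

```python
ALPHABET = "ACGT"


def deletions(motif):
    """All strings obtained by removing one base from the motif."""
    return {motif[:i] + motif[i + 1:] for i in range(len(motif))}


def insertions(motif):
    """All strings obtained by inserting one base at any position."""
    return {motif[:i] + base + motif[i:]
            for i in range(len(motif) + 1) for base in ALPHABET}


def mismatches(motif):
    """All strings obtained by replacing one base with a different base."""
    return {motif[:i] + base + motif[i + 1:]
            for i in range(len(motif)) for base in ALPHABET
            if base != motif[i]}
```

For the motif ACG, deletions yields AC, AG and CG. Different insertions can produce the same string (inserting A before or after the leading A both give AACG), which is why sets are used.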

Detected TR errors: the substring error

• The substring error may not exceed the maximum substring error allowed, where

substring error = (n_d × p_d) + (n_i × p_i) + (n_m × p_m) − n_ptre

and

n_d: number of deletions
n_i: number of insertions
n_m: number of mismatches
p_d: penalty allocated to deletions
p_i: penalty allocated to insertions
p_m: penalty allocated to mismatches
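The formula can be evaluated directly, as in the sketch below. Two assumptions are made that the slide does not state: n_ptre is read as the number of PTRE's in the detected TR, and all penalties default to 1. The function names are hypothetical.

```python
def substring_error(n_d, n_i, n_m, n_ptre, p_d=1, p_i=1, p_m=1):
    """Weighted count of mutations, reduced by the number of exact elements."""
    return n_d * p_d + n_i * p_i + n_m * p_m - n_ptre


def within_substring_error(max_error, n_d, n_i, n_m, n_ptre, **penalties):
    """True if the substring error does not exceed the allowed maximum."""
    return substring_error(n_d, n_i, n_m, n_ptre, **penalties) <= max_error


# ACG AC ACG AGG ACG AG: two deletions, one mismatch, three exact copies.
example_error = substring_error(n_d=2, n_i=0, n_m=1, n_ptre=3)
```

With unit penalties the example evaluates to 0; doubling the deletion penalty raises it to 2, which shows how the penalties weight one mutation type against another.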

Detected TR errors: the minimum number of TRE’s

• tn_tre = tn_ptre + tn_atre

• tn_tre must reach a minimum threshold before a TR is reported

• the default value for this threshold is 2

• this prevents the output of unwanted data

Detected TR errors: the maximum number of consecutive ATRE’s

• tn_atreC may not exceed the maximum number of consecutive ATRE’s allowed

• tn_atreC is incremented for every ATRE read

• tn_atreC is set to zero whenever a PTRE is read

• the default of tn_atreC is 0
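The counter updates can be sketched as a single pass over the elements of a candidate TR. In this sketch the elements are assumed to be already split out, the function name is hypothetical, and max_consecutive_atre is a caller-supplied limit for which no default is assumed.

```python
def tr_is_reportable(elements, motif, max_consecutive_atre, min_tre=2):
    """Apply the TR-level counters to a list of TR elements in reading order."""
    tn_ptre = 0
    tn_atre = 0
    tn_atre_c = 0                       # corresponds to tn_atreC
    for element in elements:
        if element == motif:
            tn_ptre += 1
            tn_atre_c = 0               # a PTRE resets the consecutive count
        else:
            tn_atre += 1
            tn_atre_c += 1              # every ATRE read increments it
            if tn_atre_c > max_consecutive_atre:
                return False
    tn_tre = tn_ptre + tn_atre
    return tn_tre >= min_tre            # too few elements are not reported
```

With a limit of 1, the elements ACG, AC, ACG, AGG, ACG, AG pass, whereas ACG, AC, AG, ACG fail because two ATRE's follow each other.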

Deletion

Refers to the absence of a base pair in the motif.

Figure: finite automaton FA_D(ACG,1)

Mismatch

Refers to the replacement of a base pair in the motif by another.

Figure: finite automaton FA_m(ACG,1)
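One way to picture such automata is as a trie built from the words they accept. The sketch below assumes that the deletion and mismatch automata for ACG with one error accept the motif itself plus every word one deletion, respectively one mismatch, away. That reading, the trie construction and all names are assumptions for illustration, not the authors' construction.

```python
def build_automaton(words):
    """Build a trie-shaped deterministic automaton accepting exactly `words`."""
    transitions = [{}]                  # transitions[state][symbol] -> state
    accepting = set()
    for word in words:
        state = 0
        for symbol in word:
            if symbol not in transitions[state]:
                transitions.append({})
                transitions[state][symbol] = len(transitions) - 1
            state = transitions[state][symbol]
        accepting.add(state)
    return transitions, accepting


def accepts(automaton, word):
    """Run the automaton on `word` from the start state 0."""
    transitions, accepting = automaton
    state = 0
    for symbol in word:
        state = transitions[state].get(symbol)
        if state is None:               # no transition: reject
            return False
    return state in accepting


motif = "ACG"
deletion_words = {motif} | {motif[:i] + motif[i + 1:] for i in range(len(motif))}
mismatch_words = {motif} | {motif[:i] + b + motif[i + 1:]
                            for i in range(len(motif)) for b in "ACGT"}
fa_d = build_automaton(deletion_words)
fa_m = build_automaton(mismatch_words)
```

Here fa_d accepts ACG, AC, AG and CG, while fa_m accepts ACG and its nine single-mismatch variants such as ATG.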

High-level Description of FireμSat

• generateWords(ρ,ε) generates a set of all words of length ρLength from the alphabet Σ = {A,C,G,T}.

• createFATR(ρ,ε) returns FATR(ρ,ε) as discussed.

• findIndices(gSeq, FATR, τ, α, β, p_m, p_d, p_i) returns a set of index pairs in gSeq, each delimiting an identified TR.

• Each TR is such that it complies with the constraints specified by τ, α, β. Various counters have to be updated to ensure correct output.
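A heavily simplified sketch of two of these steps follows. generate_words enumerates all words of a given length over Σ. find_indices here is a naive stand-in that reports only perfect repeats of one motif as half-open index pairs; it uses no automaton and ignores τ, α, β and the penalties. The signatures and the min_copies parameter are assumptions made for illustration.

```python
from itertools import product

SIGMA = "ACGT"


def generate_words(length):
    """All words of the given length over the alphabet SIGMA."""
    return ["".join(letters) for letters in product(SIGMA, repeat=length)]


def find_indices(g_seq, motif, min_copies=2):
    """Return (start, end) pairs of maximal runs of exact motif copies."""
    pairs = []
    m = len(motif)
    i = 0
    while i + m <= len(g_seq):
        copies = 0
        # Count contiguous exact copies of the motif starting at i.
        while g_seq[i + copies * m: i + (copies + 1) * m] == motif:
            copies += 1
        if copies >= min_copies:
            pairs.append((i, i + copies * m))
            i += copies * m             # continue after the reported run
        else:
            i += 1
    return pairs


candidate_motifs = generate_words(3)
hits = find_indices("TTACGACGACGTT", "ACG")
```

Running find_indices for every candidate motif would report the same region several times under rotated motifs (ACG, CGA, GAC), so a real implementation needs some way to merge or rank such hits.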

Why does anybody want to detect TR’s in DNA?


• The cause of several human diseases can be traced to having too many copies of a certain nucleotide triplet.

• TR’s play a role in the development of immune system cells.

• TR’s serve as genetic markers in plant and animal species.

• Tandem repeats play a role in gene regulation and contribute to the breeding of disease resistant cultivars.

Conclusion

A new theoretical approach to detect TR’s in DNA has been introduced. The time complexity of FireµSat is linear in |gSeq|.

The practical implementation of FireµSat is in progress. The following matters constitute a future research agenda:

• the performance of FireµSat

• the possibility of reducing FATR

• if these investigations are successful, the results could suggest ways of adapting FireµSat to detect minisatellites and satellites as well.