Algorithms to Search Position Specific Scoring Matrices in Biosequences Cinzia Pizzi Dipartimento di...

Algorithms to Search Position Specific Scoring Matrices in Biosequences

Cinzia Pizzi

Dipartimento di Ingegneria dell’Informazione

Università degli Studi di Padova

C. Pizzi, DEI – Univ. of Padova (Italy)

Outline
• Weighted patterns in Biology
• The problem of profile matching
• The look-ahead method
• Suffix based Algorithms
• Aho-Corasick Extension (ACE)
• Look-ahead Filtration Algorithm (LFA)
• Superalphabet (NS)
• Some experimental results

What are Motifs?
• Motifs are biologically significant elements that are responsible for common structures or functions
• Motifs are statistically significant substrings in biosequences
• Assumption: if two entities share the same function or structure, common over-represented elements might be responsible for the observed similarity


Motif Discovery
• Take a set of co-expressed genes
• Compare their promoter regions
• Common over-represented substrings are good candidates for TFBS (transcription factor binding sites)
• Need counted/expected frequency
[Figure: promoters of co-expressed genes]


Motif Discovery
• TFBS, DNA motifs
• Motifs = binding sites = substrings
• Intrinsic variability of biological sequences
• Mismatches, indels, wildcards, superalphabets...
[Figure: promoters of co-expressed genes]

Motif Representation
• Binding sites of the same factor are not exactly the same in all sequences

ACATACCCGAATATGCATGCCTACTCCAAATTCGAAACGGACTCCTATGCCCACTCGGAA

[Figure: profile with rows A, C, G, T and columns 1 to 6]

Profile -> matrix representation

Motif Representation
• Protein classification: each family is modeled by a matrix

ACDEHNPVACCCDEGAMMATATHCATVVST

[Figure: three family profiles, each with rows A, C, D, ... and columns 1 to 6, shown with the sequence WVDEHNPVAC]

Profile
• Weighted pattern p of length m defined over alphabet Σ
• A |Σ| x m matrix defines the scores

1 2 3 4 5 6
A 0.3 0.0 0.1 0.2 1.0 0.3
C 0.1 0.8 0.5 0.2 0.0 0.4
G 0.2 0.0 0.4 0.3 0.0 0.0
T 0.4 0.2 0.0 0.3 0.0 0.3


Segment Score

S = s_1 s_2 … s_m

Each symbol s_i is scored in column i of the profile matrix M:

Score(S) = \sum_{i=1}^{m} M[s_i, i]


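Meaning of the score

Score(S) = \sum_{i=1}^{m} M[s_i, i] = \sum_{i=1}^{m} \ln \frac{f_{s_i,i}}{p_{s_i}} = \ln \prod_{i=1}^{m} \frac{f_{s_i,i}}{p_{s_i}} = \ln \frac{P(S|M)}{P(S|B)}

where f_{s,i} is the frequency of symbol s at position i under the motif model M, and p_s is the frequency of s under the background model B.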
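The log-odds reading of the score can be checked numerically. The sketch below uses made-up frequencies for a motif of length 3 and a uniform background (both are illustrative assumptions, not data from the slides) and compares the summed matrix score with the log-likelihood ratio.

```python
import math

# Hypothetical position frequencies f[s][i] (each column sums to 1)
# and a uniform background p[s]; illustrative values only.
f = {
    "A": [0.7, 0.1, 0.2],
    "C": [0.1, 0.6, 0.2],
    "G": [0.1, 0.2, 0.5],
    "T": [0.1, 0.1, 0.1],
}
p = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Log-odds profile: log_odds[s][i] = ln(f[s][i] / p[s])
log_odds = {s: [math.log(f[s][i] / p[s]) for i in range(3)] for s in f}


def score(segment):
    """Sum of the matrix entries selected by the segment."""
    return sum(log_odds[s][i] for i, s in enumerate(segment))


def log_likelihood_ratio(segment):
    """ln( P(S|M) / P(S|B) ) computed directly from the two models."""
    p_motif = math.prod(f[s][i] for i, s in enumerate(segment))
    p_background = math.prod(p[s] for s in segment)
    return math.log(p_motif / p_background)


print(math.isclose(score("ACG"), log_likelihood_ratio("ACG")))  # True
```

Real matrices are usually built from counts, where zero frequencies need some smoothing before taking logarithms; that step is left out here.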

Segment Score Example

S = G T A C A C

Score = 0.2 + 0.2 + 0.1 + 0.2 + 1.0 + 0.4 = 2.1
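A minimal sketch of the segment score, using the example profile from the slides. The matrix is stored in tenths (0.3 becomes 3) so that the sums are exact integers; that scaling and the helper name `segment_score` are choices made for this illustration.

```python
# Example profile from the slides, in tenths (0.3 -> 3) to avoid float rounding.
M = {
    "A": [3, 0, 1, 2, 10, 3],
    "C": [1, 8, 5, 2, 0, 4],
    "G": [2, 0, 4, 3, 0, 0],
    "T": [4, 2, 0, 3, 0, 3],
}


def segment_score(matrix, segment):
    """Sum M[s_i, i] over the positions of the segment."""
    return sum(matrix[symbol][i] for i, symbol in enumerate(segment))


print(segment_score(M, "GTACAC") / 10)  # 2.1
```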


Profile Matching Problem
• Text T of length n defined over Σ
• Profile p (|Σ| x m)
• Score threshold th
• Score S_i of the segment of length m starting at position i
• Find all positions i in T where S_i ≥ th


Example: th = 2, T = CGTACACTCGGTA

Pos Segment Score Result
1 CGTACA 0.6 Not a match!
2 GTACAC 2.1 Match at pos 2!
3 TACACT 1.4 Not a match!
4 ACACTC 1.8 Not a match!
5 CACTCG 0.9 Not a match!
6 ACTCGG 1.3 Not a match!
7 CTCGGT 1.4 Not a match!
8 TCGGTA 2.2 Match at pos 8!
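The walk over the text in this example corresponds to the naive O(nm) algorithm: score every window of length m and keep those reaching th. A small sketch, with the example profile stored in tenths (an assumption of this illustration, so th = 2 becomes 20) and 1-based positions as on the slides:

```python
# Example profile from the slides, in tenths (0.3 -> 3); th = 2.0 -> 20.
M = {
    "A": [3, 0, 1, 2, 10, 3],
    "C": [1, 8, 5, 2, 0, 4],
    "G": [2, 0, 4, 3, 0, 0],
    "T": [4, 2, 0, 3, 0, 3],
}


def naive_search(matrix, text, th):
    """Return (1-based position, score) for every window scoring >= th."""
    m = len(next(iter(matrix.values())))
    hits = []
    for start in range(len(text) - m + 1):
        score = sum(matrix[text[start + j]][j] for j in range(m))
        if score >= th:
            hits.append((start + 1, score))
    return hits


print(naive_search(M, "CGTACACTCGGTA", 20))  # [(2, 21), (8, 22)]
```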


Scenarios of applications
• Online algorithms (no indexing)
  • Database of profile matrices (e.g. TRANSFAC, JASPAR for TFBS)
  • Input sequence to be searched
• Offline algorithms (indexing)
  • Sequence or set of sequences
  • Input matrix to search for matches


Summary of current methods
• Look-ahead method LA (Wu et al., 00)
• Offline methods based on LA:
  • Suffix tree (Dorohonceanu et al., 00)
  • Suffix array (Beckstette et al., 04, 06)
  • Truncated suffix tree (Pizzi and Favaretto, 10)
• Online methods based on LA:
  • Aho-Corasick, filtering (Pizzi et al., 07, 09)

Summary of current methods
• Pattern matching
  • Shift-Add (Salmela and Tarhio, 08)
  • KMP (Liefooghe et al., 09)
• Matrix partitioning (Liefooghe et al., 06; Pizzi et al., 07, 09)
• FFT based (Rajasekaran et al., 02)
• Compression based (Freschi et al., 05)


The look-ahead approach

max[i] = \sum_{k=i}^{m} \max_{s \in \Sigma} M[s, k]

P_{th}[i] = th - max[i+1]

1 2 3 4 5 6
A 0.3 0.0 0.1 0.2 1.0 0.3
C 0.1 0.8 0.5 0.2 0.0 0.4
G 0.2 0.0 0.4 0.3 0.0 0.0
T 0.4 0.2 0.0 0.3 0.0 0.3
max 3.0 2.2 1.7 1.4 0.4
Pth -1.0 -0.2 0.3 0.6 1.6 2.0

(The max row lists max[2] to max[6], the best score still obtainable to the right of columns 1 to 5; max[7] = 0. Here th = 2.)

Example: window C G T A C A
• Prefix scores: C = 0.1, CG = 0.1, CGT = 0.1
• 0.1 < Pth[3] = 0.3, so the window cannot reach th
• Don’t need to compare the remaining symbols (A C A)!
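The partial thresholds and the early-abandon scan can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the profile is kept in tenths (so th = 2 becomes 20), indices are 0-based in the code, and the function names are made up here.

```python
# Example profile from the slides, in tenths (0.3 -> 3); th = 2.0 -> 20.
M = {
    "A": [3, 0, 1, 2, 10, 3],
    "C": [1, 8, 5, 2, 0, 4],
    "G": [2, 0, 4, 3, 0, 0],
    "T": [4, 2, 0, 3, 0, 3],
}


def partial_thresholds(matrix, th):
    """pth[j] = th minus the best score obtainable from the columns right of j."""
    m = len(next(iter(matrix.values())))
    col_max = [max(row[j] for row in matrix.values()) for j in range(m)]
    pth = [0] * m
    remaining = 0  # sum of column maxima to the right of column j
    for j in range(m - 1, -1, -1):
        pth[j] = th - remaining
        remaining += col_max[j]
    return pth


def lookahead_search(matrix, text, th):
    """Scan every window but stop as soon as a prefix falls below its threshold."""
    pth = partial_thresholds(matrix, th)
    m = len(pth)
    hits, comparisons = [], 0
    for start in range(len(text) - m + 1):
        score = 0
        for j in range(m):
            score += matrix[text[start + j]][j]
            comparisons += 1
            if score < pth[j]:
                break  # the window can no longer reach th
        else:
            hits.append(start + 1)  # 1-based match position
    return hits, comparisons


print(partial_thresholds(M, 20))  # [-10, -2, 3, 6, 16, 20]
print(lookahead_search(M, "CGTACACTCGGTA", 20)[0])  # [2, 8]
```

The second return value counts symbol comparisons, which makes the saving over the naive scan (8 windows of 6 symbols each) easy to inspect.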


The suffix tree of T
• The suffix tree, Tree(T), is a compacted trie that represents all the suffixes of string T
• Linear size: |Tree(T)| = O(|T|)
• Can be constructed in linear time O(|T|)


Suffix trie and suffix tree

[Figure: Trie(abaab) and Tree(abaab) side by side]


• Tree(T) is of linear size
  • only the internal branching nodes and the leaves are represented explicitly
  • edges are labeled by substrings of T
• v = node(α) if the path from root to v spells α
• one-to-one correspondence of leaves and suffixes
• |T| leaves, hence < |T| internal nodes

Tree(hattivatti)

[Figure: suffix tree of hattivatti, with the suffixes hattivatti, attivatti, ttivatti, tivatti, ivatti, vatti, atti, tti, ti, i]

Tree(hattivatti)

[Figure: the same suffix tree with leaves numbered 1 to 10 by suffix start position and edge labels stored as position pairs such as 6,10, 2,5, 4,5 and 3,3]


Tree(T) is a full text index

[Figure: Tree(T) with the path for P leading to leaves 31 and 8]

• P occurs in T at locations 8, 31, …
• P occurs in T ⇔ P is a prefix of some suffix of T ⇔ a path for P exists in Tree(T)
• All occurrences of P in time O(|P| + #occ)

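LA over a Suffix Tree

[Figure: suffix tree with branches labeled CG, T and TCC, G]

• Score(CG) = 0.2 > -0.2 = Th(2)
• Score(CGT) = 0.2 < 0.3 = Th(3): skip the subtree

LA over a Suffix Tree

[Figure: the same tree, following the branch TCC, G]

• Score(TCC) = 1.9 > 0.3 = Th(3)
• Score(TCCG) = 2.2 > 2 = Th(6): match, report all the subtree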

Suffix array: example

suffix array = lexicographic order of the suffixes

Suffixes of hattivatti in text order: hattivatti, attivatti, ttivatti, tivatti, ivatti, vatti, atti, tti, ti, i, ε

Sorted suffix Start
ε 11
atti 7
attivatti 2
hattivatti 1
i 10
ivatti 5
ti 9
tivatti 4
tti 8
ttivatti 3
vatti 6



Suffix array
• suffix array SA(T) = an array giving the lexicographic order of the suffixes of T
• practitioners like suffix arrays (simplicity, space efficiency)
• theoreticians like suffix trees (explicit structure)
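A small sketch of the suffix array for the hattivatti example, built by directly sorting the suffixes (a simple quadratic construction chosen for brevity, not one of the linear-time algorithms). It also shows the full-text-index property: the suffixes starting with a pattern P form one contiguous block, found here by binary search. Function names are made up for the illustration.

```python
def suffix_array(text):
    """0-based start positions of all suffixes (including the empty one), sorted."""
    return sorted(range(len(text) + 1), key=lambda i: text[i:])


def occurrences(text, sa, pattern):
    """1-based positions where pattern occurs, via two binary searches on sa."""
    k = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose k-prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(sa)
    while lo < hi:  # first suffix whose k-prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(start + 1 for start in sa[first:lo])


sa = suffix_array("hattivatti")
print([i + 1 for i in sa])  # [11, 7, 2, 1, 10, 5, 9, 4, 8, 3, 6]
print(occurrences("hattivatti", sa, "atti"))  # [2, 7]
```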


LA over a Suffix Array

• In terms of suffix trees, skp[i] is the lexicographically next leaf that does not occur in the subtree below the branching node corresponding to the longest common prefix of S_{suf[i-1]} and S_{suf[i]}.

skp[i] = \min(\{n + 1\} \cup \{ j \in [i + 1, n] \mid lcp[i] > lcp[j] \})

LA over Truncated ST
• Build TST with truncation factor h
• L = max length of a matrix in the DB
• if h = L, simply work as ST
• if h < L, filtering:
  • if a leaf is reached, take the corresponding positions (p1, p2, …, pt)
  • for each pi, check positions pi+j, h < j ≤ m, with look-ahead


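LA over Truncated ST

[Figure: occurrences p1, p2, p3 reported by a TST leaf at depth h; for each occurrence the remaining L-h symbols, starting at p1+h, p2+h and p3+h, are checked in the text]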

Space Occupation TruST


Running Time TruST


Online Profile Matching
• Aho-Corasick Expansion (ACE)
  • Pattern matching + LA
• Lookahead Filtration Algorithm (LFA)
  • Score for fixed length prefix as a filter + LA
• Naive Superalphabet (NS)
  • Encode k-mers in superalphabet symbol


The Aho-Corasick Algorithm

[Figure: a trie for D = {he, she, his, hers}]

The Aho-Corasick algorithm
• Add failure links his -- she
• Time O(n+m)
• Space depends on D
• m = sum of word lengths
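For reference, a compact sketch of the classic Aho-Corasick automaton (trie, failure links, merged outputs) on the dictionary D = {he, she, his, hers}. The test text "ushers" and all function names are choices made for this illustration.

```python
from collections import deque


def build_automaton(words):
    """Return (goto, fail, out): trie transitions, failure links, words per state."""
    goto, fail, out = [{}], [0], [[]]
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]  # inherit shorter matches
    return goto, fail, out


def find_all(text, words):
    """Report (1-based start, word) for every dictionary word occurring in text."""
    goto, fail, out = build_automaton(words)
    state, hits = 0, []
    for pos, ch in enumerate(text, start=1):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((pos - len(word) + 1, word))
    return hits


print(find_all("ushers", ["he", "she", "his", "hers"]))
# [(2, 'she'), (3, 'he'), (3, 'hers')]
```

The start position is computed as p - |word| + 1 from the end position p, the same arithmetic the ACE example uses to report a match.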


The Fast Aho-Corasick

[Figure: automaton for D = {he, she, his, hers} with states 0 to 9 and explicit transitions for the symbols h, e, i, r, s]

• Time O(n)
• Space depends on D and Σ


AC and profile matching
• Build AC automaton for all the words that are a match for the matrix
  • LA partial threshold limits the number of words to those that actually match
  • O(|D||Σ|m + m|Σ|) pre-processing
  • |D| ≤ |Σ|^m depends on matrix and threshold
• Search the text with AC automaton
  • O(n) search


AC-Extension by LA

1 2 3 4 5 6
A 0.3 0.0 0.1 0.2 1.0 0.3
C 0.1 0.8 0.5 0.2 0.0 0.4
G 0.2 0.0 0.4 0.3 0.0 0.0
T 0.4 0.2 0.0 0.3 0.0 0.3
Pth -1.0 -0.2 0.3 0.6 1.6 2.0

• First position: [C,0.1] [G,0.2] [A,0.3] [T,0.4]
• Second position (children of C): [A,0.1] [G,0.1] [T,0.3] [C,0.9]
• Third position (children of CA): [G,0.5] [C,0.6]
  • CAA (0.2) and CAT (0.1) are below Pth[3] = 0.3, so those branches are not created
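The level-by-level expansion can be sketched as a breadth-first enumeration of prefixes that survive every partial threshold. The set of full-length survivors is the dictionary D on which the automaton would be built; the automaton construction itself is omitted. Scores are in tenths and the function name is hypothetical.

```python
# Example profile and partial thresholds from the slides, in tenths.
M = {
    "A": [3, 0, 1, 2, 10, 3],
    "C": [1, 8, 5, 2, 0, 4],
    "G": [2, 0, 4, 3, 0, 0],
    "T": [4, 2, 0, 3, 0, 3],
}
PTH = [-10, -2, 3, 6, 16, 20]  # Pth row for th = 2.0


def qualifying_prefixes(matrix, pth, length):
    """Return {prefix: score} for prefixes of `length` passing all partial thresholds."""
    frontier = {"": 0}
    for depth in range(length):
        next_frontier = {}
        for prefix, score in frontier.items():
            for symbol, row in matrix.items():
                new_score = score + row[depth]
                if new_score >= pth[depth]:  # look-ahead pruning
                    next_frontier[prefix + symbol] = new_score
        frontier = next_frontier
    return frontier


level3 = qualifying_prefixes(M, PTH, 3)
print({w: s for w, s in level3.items() if w.startswith("CA")})  # {'CAC': 6, 'CAG': 5}

D = qualifying_prefixes(M, PTH, 6)  # all words that match the matrix
print("GTACAC" in D, "CGTACA" in D)  # True False
```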


ACE Example

CGTACACTCGGTA

[Figure: the automaton with edges labeled by symbols (g, t, a, c, ...); the text is scanned one symbol per step, steps 1 to 7]

Match at p-m+1 = 7-6+1 = 2


Minimum Gain for ACE
• Dual concept of look-ahead
• Compute for every prefix the minimum contribution of the remaining positions in the pattern
• If current_score(i) + min_gain(i) > Th, report a match
• Advantage: in the automaton, save a full subtree of height m-i
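A sketch of the minimum-gain test on the running example profile (in tenths, th = 2 becomes 20). Two assumptions are made here: the comparison uses ≥ so that it agrees with the match condition S_i ≥ th, and the helper `decisive_depth` is a made-up name that returns the shortest prefix length at which a match is already guaranteed.

```python
# Example profile from the slides, in tenths (0.3 -> 3); th = 2.0 -> 20.
M = {
    "A": [3, 0, 1, 2, 10, 3],
    "C": [1, 8, 5, 2, 0, 4],
    "G": [2, 0, 4, 3, 0, 0],
    "T": [4, 2, 0, 3, 0, 3],
}


def min_gain(matrix):
    """gain[h] = least score the positions after a prefix of length h can add."""
    m = len(next(iter(matrix.values())))
    col_min = [min(row[j] for row in matrix.values()) for j in range(m)]
    gain = [0] * (m + 1)
    for h in range(m - 1, -1, -1):
        gain[h] = gain[h + 1] + col_min[h]
    return gain


def decisive_depth(matrix, th, word):
    """Shortest prefix length h with score + min_gain >= th, or None if no match."""
    gain = min_gain(matrix)
    score = 0
    for h, symbol in enumerate(word, start=1):
        score += matrix[symbol][h - 1]
        if score + gain[h] >= th:
            return h  # every extension of this prefix is a match
    return None


print(min_gain(M))  # [3, 2, 2, 2, 0, 0, 0]
print(decisive_depth(M, 20, "TCCGAA"))  # 4
```

With this profile the prefix TCCG already guarantees a match, so the subtree of height 2 below it would not need to be stored.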


Example: M0003, MSS = 0.85

[Figure: automaton path built one node at a time: [G,18500], [C,37000], [C,55500]]

• GCC is sufficient to detect a match (h = 3)

\sum_{i=1}^{m-h} |A|^i

• Save 5464 nodes out of 5468

Minimum Gain ACE


Look-ahead Filtration
• Compute the scores for all words of fixed length k and store them
  • O(|Σ|^k) pre-processing
• Sliding window of size k
  • When score ≥ Pth[k], check remaining symbols with LA (up to m-k)
• O(n + (m-k)r) search; k is the prefix length, r is avg number of full scoring


Look-ahead Filtration Example (k = 3)

Score table (|Σ|^k entries):

3-mer Score
AAA 0.4
... ...
ATT 0.5
CAA 0.2
... ...
CGT 0.1
CTT 0.3
GAA 0.3
... ...
GTA 0.5
... ...
GTT 0.4
TAA 0.5
... ...
TTT 0.6

Pth[3] = 0.3

CGTACACTCGGTA

• Score(CGT) = 0.1 < Pth[3]: shift and concatenate to obtain the next 3-mer
• Score(GTA) = 0.5 > Pth[3]: check at most m-k remaining symbols
  • Score(GTAC) = 0.7 > Pth[4]
  • Score(GTACA) = 1.7 > Pth[5]
  • Score(GTACAC) = 2.1 > th
  • Match!
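A sketch of the filtration scheme on the running example, with scores in tenths. One simplification is assumed: the k-mer table is a dictionary keyed by substring, rather than the shift-and-concatenate integer code mentioned on the slide. Function names are made up.

```python
from itertools import product

# Example profile from the slides, in tenths (0.3 -> 3); th = 2.0 -> 20.
M = {
    "A": [3, 0, 1, 2, 10, 3],
    "C": [1, 8, 5, 2, 0, 4],
    "G": [2, 0, 4, 3, 0, 0],
    "T": [4, 2, 0, 3, 0, 3],
}


def partial_thresholds(matrix, th):
    """pth[j] = th minus the best score obtainable from the columns right of j."""
    m = len(next(iter(matrix.values())))
    pth, remaining = [0] * m, 0
    for j in range(m - 1, -1, -1):
        pth[j] = th - remaining
        remaining += max(row[j] for row in matrix.values())
    return pth


def kmer_table(matrix, k):
    """Scores of all |alphabet|^k words against the first k columns."""
    return {
        "".join(word): sum(matrix[c][j] for j, c in enumerate(word))
        for word in product(sorted(matrix), repeat=k)
    }


def lfa_search(matrix, text, th, k):
    pth = partial_thresholds(matrix, th)
    m = len(pth)
    table = kmer_table(matrix, k)
    hits = []
    for start in range(len(text) - m + 1):
        score = table[text[start:start + k]]
        if score < pth[k - 1]:
            continue  # filtered out by the k-symbol prefix
        for j in range(k, m):  # look-ahead on the remaining m-k symbols
            score += matrix[text[start + j]][j]
            if score < pth[j]:
                break
        else:
            hits.append(start + 1)
    return hits


print(len(kmer_table(M, 3)))  # 64
print(lfa_search(M, "CGTACACTCGGTA", 20, 3))  # [2, 8]
```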


More on ACE and LF
• It is possible to combine both methods
  • Automaton built on qualifying prefixes only
• Multi-matrix version


Super-Alphabet
• Code words of length k to super-alphabet symbols
  • |Σ|^k symbols are needed
• Code the matrix M into matrix M’ (|Σ|^k x m/k)
• Run the naive algorithm on the sequence
  • O(nm/k)


SuperAlphabet Example (k = 2)

Table M’ (|Σ|^k entries per column block):

2-mer SCORE 1-2 SCORE 3-4 SCORE 5-6
AA 0.3 0.3 1.3
AC 1.1 0.3 1.4
AG 0.3 0.4 1.0
AT 0.3 0.4 1.3
CA 0.1 0.7 0.3
CC 0.9 0.7 0.4
CG 0.1 0.8 0.0
CT 0.3 0.8 0.3
GA 0.2 0.6 0.3
GC 1.0 0.6 0.4
GG 0.2 0.7 0.0
GT 0.4 0.7 0.3
TA 0.4 0.2 0.3
TC 1.2 0.2 0.4
TG 0.4 0.3 0.0
TT 0.6 0.3 0.3

CGTACACTCGGTA

• Position 1: CG TA CA, Score = 0.1 + 0.2 + 0.3 = 0.6 < Th
• Position 2: GT AC AC, Score = 0.4 + 0.3 + 1.4 = 2.1, match!
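A sketch of the superalphabet encoding on the running example (scores in tenths, th = 2 becomes 20). It assumes m is a multiple of k and uses dictionaries keyed by k-mer strings in place of integer superalphabet symbols; the function names are illustrative.

```python
from itertools import product

# Example profile from the slides, in tenths (0.3 -> 3); th = 2.0 -> 20.
M = {
    "A": [3, 0, 1, 2, 10, 3],
    "C": [1, 8, 5, 2, 0, 4],
    "G": [2, 0, 4, 3, 0, 0],
    "T": [4, 2, 0, 3, 0, 3],
}


def encode_matrix(matrix, k):
    """Build M': one {k-mer: score} table per block of k consecutive columns."""
    m = len(next(iter(matrix.values())))
    if m % k != 0:
        raise ValueError("this sketch assumes m is a multiple of k")
    blocks = []
    for first in range(0, m, k):
        blocks.append({
            "".join(word): sum(matrix[c][first + j] for j, c in enumerate(word))
            for word in product(sorted(matrix), repeat=k)
        })
    return blocks


def ns_search(blocks, k, text, th):
    """Naive scan that adds m/k table lookups per window instead of m."""
    m = k * len(blocks)
    hits = []
    for start in range(len(text) - m + 1):
        score = sum(
            block[text[start + b * k:start + (b + 1) * k]]
            for b, block in enumerate(blocks)
        )
        if score >= th:
            hits.append(start + 1)
    return hits


blocks = encode_matrix(M, 2)
print(blocks[0]["CG"], blocks[1]["TA"], blocks[2]["CA"])  # 1 2 3
print(ns_search(blocks, 2, "CGTACACTCGGTA", 20))  # [2, 8]
```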


Experiments
• JASPAR database: 123 TFBS matrices (DNA)
• PRINTS database (proteins)
• Test sequence: about 50M bases
• P-value defines threshold
• 3 GHz Intel Pentium IV processor with 2 gigabytes of main memory, running under Linux


DNA – avg running times per matrix


DNA- matrix length


DNA – window width


Proteins, avg time per matrix


Proteins - matrix length


MOODS – Motif Occurrence Detection Suite


Conclusions
• Matrix search is a core step for many bioinformatics applications (searching, discovery, classification…)
• Several approaches have been developed in recent years
• Online methods based on filtering are currently the most efficient


References
• C. Pizzi, P. Rastas, E. Ukkonen. Fast Search Algorithms for Position Specific Scoring Matrices. In Proc. of the 1st Conference on Bioinformatics Research and Development (BIRD 07), Berlin, Germany, March 2007, LNCS/LNBI 4414, pp. 239-250.
• C. Pizzi, E. Ukkonen. Fast Profile Matching Algorithms - a survey. Theoretical Computer Science, 395(2-3), 2008, pp. 137-157, Special Issue SAIL: String Algorithms, Information and Learning.
• C. Pizzi, P. Rastas, E. Ukkonen. Finding significant matches of position weight matrices in linear time. Accepted for publication by IEEE Transactions on Computational Biology and Bioinformatics, 2009.
• J. Korhonen, P. Martinmaki, C. Pizzi, P. Rastas, E. Ukkonen. MOODS: fast search for position weight matrix matches in DNA sequences. Bioinformatics 2009, 25(23):3181-3182.


Thanks


Acknowledgements
• Esko Ukkonen, Pasi Rastas, Janne Korhonen, P. Martinmaki
• Academy of Finland grant “From Data to knowledge”
• EU Project “Regulatory Networks”
• Premio di Ricerca `Avere Trent’Anni’
  • Univ. Padova, Parco Scientifico Galileo, Il Mattino, Giovani Confindustria, Scuola Galileiana di Studi Superiori

Length 100

NA = Naïve Algorithm
LSA = Look-ahead Search Algorithm
LFA = Look-ahead Filter Algorithm (k=7)
NS = Naïve Superalphabet (k=7)

• 13 patterns obtained by concatenating Jaspar matrices

• MSS: Matrix Similarity Score (% of maximal score)


Multiple Matrices Search


Running Time per matrix


Length 0 to 15 (108 matrices)

NA = Naïve Algorithm
LSA = Look-ahead Search Algorithm
ACE = Aho-Corasick Expansion
LFA = Look-ahead Filtration Algorithm (k=7)
NS = Naïve Super-alphabet (k=7)


Running Time per matrix


Length 16 to 30 (15 matrices)

NA = Naïve Algorithm
LSA = Look-ahead Search Algorithm
LFA = Look-ahead Filtration Algorithm
NS = Naïve Super-alphabet


Length 100 (13 patterns obtained by concatenating Jaspar matrices)

Algorithm P=10^-5 P=10^-4 P=10^-3 P=10^-2
NA 10.234 10.244 10.434 11.080
LSA 11.835 12.675 13.335 15.118
LFA 9.955 10.347 11.096 12.965
NS 3.576 3.677 4.593 9.918


Motif Representation
• Instances of a biological signal are different

ACATACCCGAATATGCATGCCTACTCCAAATTCGAAACGGACTCCTATGCCCACTCGGAA

TCC(G|T)AC

[Figure: profile with rows A, C, G, T and columns 1 to 6]

Consensus -> pattern representation

Profile -> matrix representation

