
Wavelet analysis of DNA and of proteins

Georgios Choudalakis∗

(Dated: May 12, 2005)

Wavelet Transform (WT) is a powerful tool for spectral analysis, applicable to bioinformatics. Its main advantage is that it returns a spectrum containing period and location information. This feature may be useful in the search for features, such as histones, along the DNA, or secondary structure in proteins. In this work WT is presented and applied, along with analytic methods aiming to evaluate the significance of the features of the spectra found. The conclusion is that we see features resembling what we would expect from histones and secondary structure. The validity of this interpretation remains to be checked when experimental data become available.

I. MOTIVATION

After the completion of the sequencing of the human genome [1, 2], we still need to better understand the dynamics of DNA as a macromolecule, the function of its parts and the factors which have affected its formation. Treating it as a sequence of bases, we can search for patterns along DNA that would potentially be suggestive of its nature.

Similar regularities can be sought in the structure of proteins, seen as sequences of amino acids.

One approach to analyzing a data sequence is to search for periodicities in it. It would be interesting, for example, to know if there are pieces of DNA sharing some common property which reappears at regular intervals.

There are two main spectral analysis methods: the Fourier Transform (FT) and the Wavelet Transform (WT). I am going to use WT, mainly because:

• Unlike FT, WT is local in nature, thus WT reveals periodicities around each location in the data series. This allows one to see how the spectrum varies along a piece of DNA or a protein, hence to understand structural differences between different locations, which may be related to different functions.

• The full amount of information in DNA is huge, no matter how one quantifies it and transforms it into a numerical sequence. Applying either FT or WT would be feasible only on segments of DNA. The use of WT is then more sensible, since WT is local by construction, so its results for one location are not affected by the fact that the data before and after that location are not taken into account.

II. WAVELETS

Consider all ‘nice’ functions as vectors |f〉 in a Hilbert space H. H has some complete orthonormal basis, call it {|t〉}. Any |f〉 can be analyzed as:

$$|f\rangle = \sum_t \langle t|f\rangle\,|t\rangle \qquad (1)$$

where f(t) ≡ 〈t|f〉 is the function |f〉 in its usual and most intuitive representation as a time-series. In general, t does not have to be time. It is the generalized position along some data sequence f(t). In this work, t is the position in the numerical sequence being analyzed, which can be directly related to some physical position along DNA or some protein.

∗Physics Dpt. of the M.I.T.; Electronic address: [email protected]; URL: http://www.mit.edu/~gchouda

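The fundamental idea in FT and in WT is that one can define another complete and orthonormal basis {|k〉}, so

$$|f\rangle = \sum_k \langle k|f\rangle\,|k\rangle \qquad (2)$$

$$\langle t|f\rangle = \sum_k \langle k|f\rangle\,\langle t|k\rangle \qquad (3)$$

$$\langle k|f\rangle = \sum_t \langle t|f\rangle\,\langle k|t\rangle \qquad (4)$$

In the usual FT we transform f(t) as

$$f(t) = \frac{1}{\sqrt{2\pi}} \int \tilde{f}(k)\,e^{ikt}\,dk \qquad (5)$$

$$\tilde{f}(k) = \frac{1}{\sqrt{2\pi}} \int f(t)\,e^{-ikt}\,dt \qquad (6)$$

Comparing eq. (3) and eq. (4) to eq. (5) and eq. (6), one sees that FT is nothing but the choice of {|k〉} to be $\langle t|k\rangle = e^{ikt}/\sqrt{2\pi}$. This function has a constant modulus, independent of t, so it is not localized anywhere along the sequence, which explains why FT is insensitive to location. FT only provides the overall contribution of each frequency in the data.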

In an analogous way, wavelets define a complete basis {|ψs,τ〉}. They may be any family of functions that combine the properties of:

1. orthonormality: $\int \psi^{*}_{s',\tau'}(t)\,\psi_{s,\tau}(t)\,dt = \delta_{ss'}\,\delta_{\tau\tau'}$. Normality is important so that all the coefficients 〈ψs,τ|f〉 have the same weight. However, I will relax the demand of orthogonality, because I am going to apply Continuous WT (CWT), where neighboring wavelets overlap, so there is some redundancy, which is not a concern [4].

2. zero mean: $\int \psi_{s,\tau}(t)\,dt = 0$

3. compact support: ψs,τ(t) decays exponentially (or faster) outside some bounded region.

There are infinitely many wavelet families. One that is very commonly used is the Morlet wavelet:

$$\psi_{s,\tau}(t) = C_{\omega_0}\,\frac{1}{\sqrt{s}}\,e^{-\frac{((t-\tau)/s)^2}{2}}\,e^{i\omega_0\frac{t-\tau}{s}} \qquad (7)$$

The factor s is called ‘scale’ and τ ‘translation’, because they respectively stretch and translate ψs,τ(t). The term Cω0/√s guarantees normality ∀s, the Gaussian envelope provides t-locality and the complex phase provides frequency locality. In order to make the interpretation of the weight

$$W(s,\tau) = \sum_t f(t)\,\psi_{s,\tau}(t) \qquad (8)$$

easier, I will use ω0 ≡ 2π, so Cω0 = π^(−1/4). By substituting this ω0 in eq. (7) we see that W(s, τ) is the weight of the wavelet which is localized around location τ and period s.

A. Algorithm of CWT

The method I use is called Continuous WT, because it lets s and τ vary continuously [5]. For each pair of those values, W(s, τ) is found using eq. (8). Exploiting the compact support of ψ, I gain speed by not summing over all t’s, but only over t ∈ [max{1, τ − 5s}, min{tmax, τ + 5s}], where tmax is the length of the sequence being analyzed. Finally, the contour plot of |W(s, τ)|/√s is given and this is the CWT spectrum.
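The procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's program: the function names are hypothetical, the sequence is stored in a plain list whose element 0 holds f(t = 1), and eq. (8) is evaluated exactly as written (without complex conjugation of the wavelet).

```python
import cmath
import math

OMEGA0 = 2.0 * math.pi        # the text's choice, so that s reads as a period
C_OMEGA0 = math.pi ** -0.25   # normalization constant quoted in the text


def morlet(t, s, tau):
    """Morlet wavelet of eq. (7) evaluated at position t."""
    u = (t - tau) / s
    envelope = math.exp(-0.5 * u * u)
    return C_OMEGA0 / math.sqrt(s) * envelope * cmath.exp(1j * OMEGA0 * u)


def cwt_coefficient(f, s, tau):
    """W(s, tau) of eq. (8), summed only over t within 5*s of tau.

    f is a list; f[0] holds f(t = 1) (an indexing assumption of this sketch).
    """
    t_max = len(f)
    first = max(1, math.ceil(tau - 5 * s))
    last = min(t_max, math.floor(tau + 5 * s))
    return sum(f[t - 1] * morlet(t, s, tau) for t in range(first, last + 1))


def spectrum_value(f, s, tau):
    """|W(s, tau)| / sqrt(s), the quantity shown in the contour plots."""
    return abs(cwt_coefficient(f, s, tau)) / math.sqrt(s)
```

Evaluating `spectrum_value` on a grid of (s, τ) values gives the numbers that would go into the contour plot. Because the sum is truncated to a window of width 10s, the cost per point grows with s rather than with the full sequence length.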

The reason that I decided to plot |W(s, τ)|/√s instead of |W(s, τ)|², which is usually met in the literature [5], is that the relative values of |W(s, τ)|/√s reflect the relative amplitudes found in the time series. This will become clear in the following example.
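A rough check of this claim, which is not part of the original text: take a pure cosine f(t) = A cos(2πt/P), set s = P and ω0 = 2π, and replace the sum of eq. (8) by an integral (a continuum approximation, assumed valid when s is much larger than the sampling step).

```latex
\begin{align*}
W(P,\tau) &\approx \int A\cos\!\left(\tfrac{2\pi t}{P}\right)
             \frac{C_{\omega_0}}{\sqrt{P}}\,
             e^{-\frac{1}{2}\left(\frac{t-\tau}{P}\right)^{2}}
             e^{2\pi i\,\frac{t-\tau}{P}}\,dt \\
          &= A\,C_{\omega_0}\sqrt{P}\int
             \cos\!\left(2\pi u + \tfrac{2\pi\tau}{P}\right)
             e^{-u^{2}/2}\,e^{2\pi i u}\,du
             \qquad \left(u = \tfrac{t-\tau}{P}\right) \\
          &= \frac{A\,C_{\omega_0}\sqrt{P}}{2}\,\sqrt{2\pi}
             \left(e^{-2\pi i\tau/P} + e^{2\pi i\tau/P}\,e^{-8\pi^{2}}\right) \\
\Rightarrow\quad
\frac{|W(P,\tau)|}{\sqrt{P}} &\approx A\,\pi^{-1/4}\sqrt{\frac{\pi}{2}}
             = \frac{\pi^{1/4}}{\sqrt{2}}\,A \approx 0.94\,A .
\end{align*}
```

Under these assumptions |W|/√s at the matching scale is linear in the amplitude A and independent of the period, whereas |W|² would scale as s·A², so plotting it would favor long periods.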

B. A toy example of CWT

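Consider, for example, the analysis of the number series:

$$f(t) = \begin{cases}
2\cos\left(\frac{2\pi t}{10}\right), & 1 \le t < 200 \\
\cos\left(\frac{2\pi t}{20}\right), & 201 \le t < 500 \\
3\cos\left(\frac{2\pi t}{30}\right), & 501 \le t < 700 \\
\cos\left(\frac{2\pi t}{40}\right), & 701 \le t < 1000 \\
\cos\left(\frac{2\pi t}{10}\right) + 0.5\cos\left(\frac{2\pi t}{30}\right), & 1001 \le t < 1500
\end{cases} \qquad (9)$$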

The output of CWT is shown in Fig. (1) and Fig. (2). Remarkably, the correct periodicities (s) are detected at the correct domains (τ) of the sequence. Naturally, where the period changes we have ‘edge’ effects, i.e. many periodicities seem to contribute there. It’s also clear that we cannot avoid the limitation of the uncertainty principle; larger scales s are resolved worse in τ, which is why in Fig. (1) the curve for s = 10 is much sharper than the curve for s = 40. Finally, thanks to the choice to plot |W|/√s, the heights of the curves in Fig. (1) are proportional to the amplitudes in eq. (9).

FIG. 1: |W(s, τ)|/√s as a function of τ for s = (10, 20, 30, 40), for the toy sequence of eq. (9).

FIG. 2: |W(s, τ)|/√s for the toy sequence of eq. (9), as a contour plot with τ on the horizontal axis and s on the vertical axis.
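For readers who want to reproduce the toy input, the sequence of eq. (9) can be generated as in the following sketch. The names are hypothetical, and one assumption is made: each piece is taken to run up to the start of the next one, which closes the one-point gaps (t = 200, 500, 700, 1000) left by the strict inequalities in eq. (9).

```python
import math

# (first t, last t, [(amplitude, period), ...]) for each piece of eq. (9).
# Assumption of this sketch: the upper limits are inclusive up to the next
# piece, so that every t from 1 to 1499 has a value.
PIECES = [
    (1, 200, [(2.0, 10)]),
    (201, 500, [(1.0, 20)]),
    (501, 700, [(3.0, 30)]),
    (701, 1000, [(1.0, 40)]),
    (1001, 1499, [(1.0, 10), (0.5, 30)]),
]


def toy_value(t):
    """Value of the toy sequence at integer position t."""
    for first, last, terms in PIECES:
        if first <= t <= last:
            return sum(a * math.cos(2.0 * math.pi * t / p) for a, p in terms)
    raise ValueError("t is outside the range covered by eq. (9)")


# The full sequence; toy[0] corresponds to t = 1.
toy = [toy_value(t) for t in range(1, 1500)]
```

Feeding `toy` to any implementation of eq. (8) and scanning s over 10, 20, 30 and 40 should give curves comparable to Fig. (1).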

III. DATA

First, the human DNA, as found in [6], is used. The numerical sequence generated and analyzed is:

1. Consensus Bendability [7, 8] along the first 217,988 bases, calculated with a window of size 31 bp.

The reason I focus on bendability is that I wish to test the hypothesis that this quantity is related to the position of histones. DNA is wound around histones with a characteristic loop size of about 146 bp. Between successive loops, a rod-like segment of ∼ 50 bp is interposed. If bendability is strongly related to the formation of nucleosomes, then we would expect a periodicity of ∼ 200 bp to be significant where histones are.


TABLE I: Properties of amino acids, borrowed from [11]

AA    M      pI (a)   Hydropathy (b)
g     75     5.97     -0.4
a     89     6.01      1.8
v     117    5.97      4.2
l     131    5.98      3.8
i     131    6.02      4.5
m     149    5.74      1.9
f     165    5.48      2.8
y     181    5.66     -1.3
w     204    5.89     -0.9
s     105    5.68     -0.8
p     115    6.48      1.6
t     119    5.87     -0.7
c     121    5.07      2.5
n     132    5.41     -3.5
q     146    5.65     -3.5
k     146    9.74     -3.9
h     155    7.59     -3.2
r     174    10.76    -4.5
d     133    2.77     -3.5
e     147    3.22     -3.5

(a) The characteristic pH at which the net electric charge is zero.
(b) A scale combining hydrophobicity and hydrophilicity of R groups; it can be used to measure the tendency of an amino acid to seek an aqueous environment (− values) or a hydrophobic environment (+ values) [12].

Secondly, I analyze five proteins [9], to which I will refer using the following numbering:

1. gi 56118546 ref NP 001007904.1 hd-prov protein, Xenopus tropicalis

2. gi 55742658 ref NP 999129.1 huntingtin Sus scrofa

3. gi 55622140 ref XP 517079.1 PREDICTED: similar to Huntingtin (Huntingtons disease protein) (HD protein) Pan troglodytes

4. gi 50747290 ref XP 420822.1 PREDICTED: similar to Huntingtin (Huntingtons disease protein) (HD protein) Gallus gallus

5. gi 48096509 ref XP 392476.1 similar to Huntingtin (Huntingtons disease protein) (HD protein) Apis mellifera

Each of those 5 proteins was translated into 3 numerical sequences, according to Tab. (I):

1. Hydropathy

2. Isoelectric point (pI)

3. Molecular mass
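The translation of a protein into the three numerical sequences amounts to a table lookup per residue. A minimal sketch follows; the dictionary and function names are hypothetical, and the numbers are copied as they appear in Tab. (I).

```python
# One-letter code -> (molecular mass, pI, hydropathy), copied from Tab. (I).
AA_PROPERTIES = {
    "g": (75, 5.97, -0.4), "a": (89, 6.01, 1.8), "v": (117, 5.97, 4.2),
    "l": (131, 5.98, 3.8), "i": (131, 6.02, 4.5), "m": (149, 5.74, 1.9),
    "f": (165, 5.48, 2.8), "y": (181, 5.66, -1.3), "w": (204, 5.89, -0.9),
    "s": (105, 5.68, -0.8), "p": (115, 6.48, 1.6), "t": (119, 5.87, -0.7),
    "c": (121, 5.07, 2.5), "n": (132, 5.41, -3.5), "q": (146, 5.65, -3.5),
    "k": (146, 9.74, -3.9), "h": (155, 7.59, -3.2), "r": (174, 10.76, -4.5),
    "d": (133, 2.77, -3.5), "e": (147, 3.22, -3.5),
}

# Position of each property inside the tuples above.
COLUMN = {"mass": 0, "pI": 1, "hydropathy": 2}


def protein_to_series(protein, prop):
    """Translate a string of one-letter amino acid codes into numbers.

    prop is one of "mass", "pI" or "hydropathy". Unknown residue letters
    raise KeyError, so unexpected symbols are not silently skipped.
    """
    col = COLUMN[prop]
    return [AA_PROPERTIES[aa][col] for aa in protein.lower()]
```

For instance, `protein_to_series("GAV", "hydropathy")` looks up glycine, alanine and valine in turn, and the resulting list can be passed directly to the CWT.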

IV. RESULTS

A. CWT on DNA Bendability

A general characteristic of the contour plots of |W(s, τ)|/√s is that they mostly consist of thin stripes, localized in τ and extending over long ranges of s (Fig. (3)). Such a stripe is the result of a spike in the data sequence at τ. As in FT a spike in the data sequence transforms into many contributing frequencies, the same happens here, with the only difference that in WT we know where the spike is, which is the τ where the stripe appears. The bigger the spike is in the data, the more frequencies (s) are excited, therefore the longer the corresponding stripe along the s-axis.

The fact that the area s < 30 is darker is due to the size of the window used to compute bendability, as tests with different window sizes have shown.

In Fig. (3) we notice the existence of many bright islands, especially in the range 50 < s < 250. Those could be locations where histones are formed. In Fig. (4) we zoom closer to some of those islands. Interestingly, they appear in couples; at the same τ there is usually an island around s ∼ 90 and one around s ∼ 200. The reason for that will be discussed in Sec. (IV A 1).

Since stripes extend over long ranges in s, it could be that there is nothing special about periods of 50 < s < 250. However, those islands are on average significantly brighter than others (Fig. (6)). One way to quantify this is to plot 〈|W(s)|〉 (Fig. (7)), defined as:

$$\langle |W(s)| \rangle \equiv \frac{1}{t_{max} - 10s} \int_{5s}^{t_{max}-5s} \frac{|W(s,\tau)|}{\sqrt{s}}\,d\tau \qquad (10)$$

By this definition, I only integrate over values of τ far enough from the ends of the sequence, to avoid including unreliable values in which a significant part of the wavelet extends beyond the ends of the data sequence.
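In practice the integral of eq. (10) has to be discretized. The sketch below approximates it by a plain mean over integer values of τ between 5s and tmax − 5s; this step size, the function name and the `spectrum` callable (assumed to return |W(s, τ)|/√s) are choices made for this illustration only.

```python
import math


def mean_spectrum(spectrum, s, t_max):
    """Discrete approximation of eq. (10).

    spectrum(s, tau) is assumed to return |W(s, tau)| / sqrt(s).
    Only tau values at least 5*s away from both ends are averaged.
    """
    first = math.ceil(5 * s)
    last = math.floor(t_max - 5 * s)
    if last <= first:
        raise ValueError("sequence too short for this scale")
    taus = range(first, last + 1)
    return sum(spectrum(s, tau) for tau in taus) / len(taus)
```

The explicit error for short sequences reflects the restriction in the text: for tmax ≤ 10s there is no τ whose wavelet lies sufficiently inside the data.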

Generally, if the series analyzed is characterized by intense variation from one element to the next, then 〈|W(s)|〉 increases for smaller s. This is understandable, as it reflects the fact that data points at neighboring t’s (small s) vary by a great amplitude. For example, see Fig. (8). In Sec. (II B) we didn’t have such a violently varying sequence, but the same is not true for DNA bendability. Taking this effect into account, Fig. (7) is not sufficiently convincing by itself. We need a way to evaluate the significance of the islands around s ∼ 80 and s ∼ 190, taking into account that 〈|W(s)|〉 would increase at smaller s even for a random sequence, like in Fig. (8).

A useful test of the significance of islands around s ∼ 80 and s ∼ 190 is to apply CWT on the bendability series computed for the same piece of DNA, after uniform random shuffling of the bases. The result is shown in Fig. (9), which is directly comparable to Fig. (4). Islands seem to be less regular and bright after the shuffling, which is an indication that in the real DNA those islands probably have some physical significance.

Another way to check the effect of the shuffling of the DNA is to compare 〈|W(s)|〉 before and after it (Fig. (10)). 〈|W(s < 30)|〉 is significantly smaller in both cases and this is one additional verification that this is due to the size of the window used to calculate bendability. 〈|W^shuff(s)|〉 < 〈|W^real(s)|〉, which means that the islands after shuffling are on average less wide and/or bright than before.

1. Interpretation discussion

Let’s assume that histones are correlated to bendability in a specific way. Then I will apply CWT on some sequences which will be constructed based on this assumption. By testing the response of CWT to designed sequences we will gain a better understanding of the actual results.

Let’s assume that a nucleosome corresponds to a bump in bendability that spans x = 146 bp, followed by a straight piece of length y = y0 ± σy where the bendability is zero. The artificial bendability series is constructed by pulling separations y from a Gaussian distribution with mean y0 and σ = σy, while the bump is assumed to be of the shape $\sin^2\left(2\pi\,\frac{t - t_0}{2x}\right)$, where t0 is the position in the sequence where the bump starts. To make the model more realistic, a layer of noise can be superimposed, by pulling random numbers uniformly from zero to a maximum noise level N.
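The construction of the artificial series can be sketched as follows. The function name is hypothetical; the defaults y0 = 70 and σy = 10 are the values quoted in the caption of Fig. (11); rounding the drawn separation to an integer and clipping negative draws at zero are assumptions of this sketch that the text does not specify.

```python
import math
import random


def pseudo_histone_series(n_bumps, x=146, y0=70.0, sigma_y=10.0,
                          noise=0.0, seed=0):
    """Artificial bendability: sin^2 bumps separated by flat linkers."""
    rng = random.Random(seed)
    series = []
    for _ in range(n_bumps):
        # Bump sin^2(2*pi*(t - t0)/(2*x)) for t - t0 = 0 .. x - 1.
        series.extend(math.sin(math.pi * k / x) ** 2 for k in range(x))
        # Linker of zero bendability; its length is drawn from a Gaussian,
        # rounded to an integer and clipped at zero (assumption).
        gap = max(0, round(rng.gauss(y0, sigma_y)))
        series.extend([0.0] * gap)
    if noise > 0:
        # Uniform noise between 0 and the maximum level N of the text.
        series = [v + rng.uniform(0.0, noise) for v in series]
    return series
```

With `sigma_y=0` the bumps repeat exactly every x + y0 = 216 positions, which is the main period s1 discussed in the text; a finite σy smears that period slightly.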

The first test is to try different numbers of pseudo-histones to see how this affects the spectrum, removing the noise completely (Fig. (11)). The main periodicity is always around s1 = x + y0 = 146 + 70 = 216 bp, varying slightly because of the finite σy, which causes some bumps to aggregate in some positions and to be sparser in others. The first harmonic is also significant, appearing as an island around s2 = s1/2. The third row of Fig. (11) shows that there is even a very small contribution from higher order harmonics, like the hardly visible s3 = s1/3 = 72.

The more pseudo-histones, the weaker the contribution from s > s1. The reason that for very few pseudo-histones there is some contribution from s > s1 is that the sequence analyzed tends to be like a spike; no periodic pattern can be defined, but there tends to be only a single feature, which is synthesized by a wide spectrum of contributing frequencies.

Next, it’s useful to check the performance of the CWT on a noisy sequence. First, noise of amplitude equal to the signal is added, namely N = 1 (Fig. (12)). Now small periods contribute, which are clearly due to the rapidly fluctuating noisy background. However, the signal is still very distinguishable in the spectrum.

In the tests with noise it is useful to also calculate 〈|W(s)|〉. When N increases to 5 and 10, the recognition of the signal is still possible, but with more difficulty (Fig. (13) and Fig. (14)). In the case of 30 pseudo-histones (third column) the island at s1 is fragmented, but on average (fourth row) s1 is still visible. Of course, if the noisy sequence continued for long without any pseudo-histones, then the average 〈|W(s)|〉 would tend to be mostly determined by the noise.

The more intense the noise is, and the longer a portion of the sequence it exclusively occupies, the more it dominates 〈|W(s)|〉, which tends to assume the shape seen in Fig. (8). The existence of features appears as bumps along the curve that 〈|W(s)|〉 would have for a completely random sequence.

In the light of those observations, let’s try to interpret the results obtained from the DNA.

The appearance of s1 and s2 is a strong indication that the pairs of islands seen in Fig. (4) are probably not two different features, as in the last part of the sequence of eq. (9). The fact that the lower island is at about half the period of the upper one implies that at the location of the pair some feature exists, with period equal to the s of the upper island. Of course, though it seems unlikely, it is not impossible that there is also a second feature there, which is linearly superimposed on the first and has half its period. By directly inspecting, for example, Fig. (5), it is not evident that such a superposition is there. Instead, we see there a few sharp spikes, about 2 and 5 for those specific islands.

Also, in Fig. (10), 〈|W^shuff(s)|〉 drops smoothly for s > 90, while 〈|W^real(s)|〉 clearly doesn’t. It increases again and exhibits a second local maximum at s ∼ 160, equally high with the one at s ∼ 80. However, the difference 〈|W^real(s)|〉 − 〈|W^shuff(s)|〉 is maximized at s ∼ 200 bp and that peak is clearly higher than the one at s ∼ 80. This I regard as the strongest indication that those islands are physically significant and that, where they lie, the period s ∼ 200 contributes most.

Fig. (10) contains no information about location. The plotted averages include all the features of the DNA, not only histones. It is possible that the comparatively bigger average contribution of s ∼ 200 is not because of histones, but because of some other feature with the same periodicity. There are many reasons for which histones would not appear in Fig. (10):

• If histones are not related to bendability in a prescribed way, similar for all histones. Then, spectra like the one in Fig. (3) contain no information about histones’ locations.

• If histones are not grouped in groups of at least 3. Then, 1 or 2 histones would not be enough to clearly see a periodic pattern at the locations where they would be. Then, spectra like the one in Fig. (3) would not clearly have any islands where histones would be.

• If the bendability patterns corresponding to histones are of much smaller amplitude than the amplitude caused by other features of the DNA. In that case the histones would not be visible in spectra like the one in Fig. (3) either.

• If none of the above is true, but histones are very few in total. Then, their patterns would not be able to noticeably influence 〈|W|〉, which would be dominated by other features. However, histone islands would then be visible in spectra like the one in Fig. (3).


It remains to be checked whether the locations indicated by CWT correspond to histones indeed. This will be possible as soon as information on the actual locations of histones is available. If histones are not the reason for islands such as those in Fig. (4), then it would be interesting to investigate what actually causes those features in the bendability spectrum.

B. CWT on Proteins

Application of CWT on proteins yields contour plots with many bright islands, especially at small periods (s). Like in the analysis in Sec. (IV A), it’s necessary to evaluate the significance of those islands.

Reproducing Fig. (10) for proteins does not help (Fig. (15)). We get 〈|W^real(s)|〉 ≈ 〈|W^shuff(s)|〉 in most cases, unlike what we had in Fig. (10). The reason 〈|W(s)|〉 does not reveal the significance of different s is that proteins are much shorter than DNA, so the average |W(s)|/√s is computed over only a few islands. A random rearrangement of the data results in a similar 〈|W(s)|〉 even if the islands are at different positions and of different intensities.

Since 〈|W(s)|〉 doesn’t help, it’s necessary to focus on each island separately and test its significance. The algorithm used is the following:

1. Produce N random rearrangements of the protein’s amino acids. I used N = 100. Then, obtain the hydropathy, mass and pI.

2. Calculate the average |W(s, τ)|/√s of all N shuffled proteins, where W_n denotes the transform of the n-th shuffled protein:

$$\langle |W(s,\tau)| \rangle \equiv \frac{1}{N} \sum_{n=1}^{N} \frac{|W_n(s,\tau)|}{\sqrt{s}} \qquad (11)$$

3. Define Significance as the complement of the Poisson probability to observe the actual |W(s, τ)|/√s, given the average 〈|W(s, τ)|〉:

$$S(s,\tau) = 1 - \frac{\langle |W(s,\tau)| \rangle^{\,|W(s,\tau)|/\sqrt{s}}}{\left(|W(s,\tau)|/\sqrt{s}\right)!}\; e^{-\langle |W(s,\tau)| \rangle} \qquad (12)$$

If |W(s, τ)|/√s followed the Poisson distribution for randomly shuffled proteins, then 1 − S(s, τ) would be the probability that a random sequence would give the actually observed |W(s, τ)|/√s. However, since there is no reason to assume that |W(s, τ)|/√s follows Poisson statistics, S and 1 − S should not be strictly considered as probabilities. Significance (S) is just a measure of the deviation of the actual |W(s, τ)|/√s from its average, as calculated for N shuffles of the protein.

4. Superimpose the contour plot of the actual |W(s, τ)|/√s and contours of certain S(s, τ), for example S = 0.8 and S = 0.9.
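Steps 2 and 3 can be sketched for a single point (s, τ) as below. The function names are hypothetical. Since |W|/√s is generally not an integer, the factorial in eq. (12) is evaluated here through the Gamma function, k! = Γ(k + 1); this continuation is an assumption of the sketch, as the text does not say how the factorial was computed.

```python
import math


def mean_over_shuffles(values):
    """Eq. (11) at one (s, tau): mean of |W_n|/sqrt(s) over the N shuffles."""
    if not values:
        raise ValueError("need at least one shuffled value")
    return sum(values) / len(values)


def significance(observed, expected):
    """S of eq. (12) at one (s, tau).

    observed: the actual |W(s, tau)| / sqrt(s)
    expected: the shuffled average <|W(s, tau)|> of eq. (11)
    The factorial is continued to non-integers with lgamma (assumption).
    """
    if expected <= 0 or observed < 0:
        raise ValueError("expected must be > 0 and observed >= 0")
    log_p = (observed * math.log(expected)
             - math.lgamma(observed + 1.0)
             - expected)
    return 1.0 - math.exp(log_p)
```

Working with the logarithm of the Poisson term avoids overflow of the power and the factorial when the observed value is large.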

The result can be seen in Fig. (16) and Fig. (17). Mass sequences yield contour plots that are full of significant areas. Not only islands are characterized by high significance, but so are certain dark spots; the lack of contribution of certain periods at certain positions is far from average.

On the other hand, hydropathy and pI contour plots have significant domains mostly at low s < 6, with a few exceptions, like in the hydropathy of proteins 1 and 3, where we observe a significant (0.8 < S < 0.9) extended island at s ∼ 21 in both proteins. Low periods in hydropathy are probably related to the locations of α-helices (s = 3) and β-sheets (s = 2).

To distinguish islands at small s we need to zoom in to shorter ranges of τ, to avoid merging of islands due to limited contour plot resolution (Fig. (18)).

As mentioned, significance is not a probability. However, we would like to know the probability that some shuffling of the protein would yield a |W(s, τ)|/√s of the same or higher significance than the actual |W(s, τ)|/√s. We can find this probability experimentally:

1. Produce M random rearrangements of the protein’s amino acids. I used M = 200 [13].

2. Using the same 〈|W(s, τ)|〉 found earlier, calculate S(s, τ) for each one of the M sequences.

3. Calculate P(s, τ) ≡ fraction of the M sequences that had S(s, τ) < S_actual(s, τ). P(s, τ) is approximately the probability that the actual |W(s, τ)|/√s is more significant than what a random rearrangement would give.

4. Superimpose the contour plot of the actual |W(s, τ)|/√s and contours where P = 0.8 and P = 0.9.
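Steps 1 to 3 of this second procedure can be sketched for a single point (s, τ) as follows. All names are hypothetical: `statistic` stands for whatever routine returns |W(s, τ)|/√s for a given numerical sequence, `expected` is the shuffled average 〈|W(s, τ)|〉 found earlier, and the Gamma-function form of the factorial in eq. (12) is again an assumption.

```python
import math
import random


def significance(observed, expected):
    """S of eq. (12), with k! continued to non-integers via lgamma."""
    log_p = (observed * math.log(expected)
             - math.lgamma(observed + 1.0)
             - expected)
    return 1.0 - math.exp(log_p)


def empirical_p(values, statistic, expected, m=200, seed=0):
    """Fraction of m shuffles whose S is below the S of the real sequence.

    values:    the numerical sequence of the real protein
    statistic: callable mapping a sequence to |W(s, tau)|/sqrt(s) at the
               point of interest (assumed to be supplied by the caller)
    expected:  the shuffled average <|W(s, tau)|> at the same point
    """
    rng = random.Random(seed)
    s_actual = significance(statistic(values), expected)
    below = 0
    for _ in range(m):
        shuffled = list(values)
        rng.shuffle(shuffled)
        if significance(statistic(shuffled), expected) < s_actual:
            below += 1
    return below / m
```

With M shuffles the resolution of P is 1/M, so a larger M sharpens the estimate at a proportional cost in computation, since the transform has to be recomputed for every shuffle.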

One such result can be seen in Fig. (19). It seems that P(s, τ) tends to be greater than S(s, τ), but similar islands are highlighted by both S and P.

V. SUMMARY AND CONCLUSION

A powerful tool for the application of CWT was built. Its analytic power was demonstrated in a toy example, which also made clear the meaning of the output of the method.

Bright islands are observed in the spectrum of the bendability of the human DNA. On average, the most physically interesting periodicity was found to be s ∼ 200 bp. This could be attributed to histones, as comparative analysis has shown, based on a simple assumption for the way histones may relate to bendability and applying CWT on clusters of pseudo-histones. It remains to be checked whether the features seen are indeed due to histones or something else.

CWT was also applied on proteins, where another statistical method was developed to evaluate the point-by-point significance of the spectra. Islands at very small s were found significant in hydropathy, which could indicate secondary structure. This remains to be checked.


[1] The International Human Genome Sequencing Consortium (IHGSC), in the 15 February 2001 issue of Nature.

[2] Celera Genomics, in the 16 February 2001 issue of Science.

[3] Ingrid Daubechies. Orthonormal bases of compactly supported wavelets. Comm. Pure Appl. Math., 41:909-996, 1988.

[4] http://paos.colorado.edu/research/wavelets/faq.html#orthogonal

[5] http://users.rowan.edu/~polikar/WAVELETS/WTpart3.html

[6] ftp://ftp.ncbi.nih.gov/genbank/genomes/H sapiens/hs phase3.fna.gz

[7] Gabrielian, A., Simoncsits, A. and Pongor, S. (1996): "Distribution of Bending Propensity in DNA Sequences", FEBS Letters, 393, 124-130.

[8] Munteanu, M. G., Vlahovicek, K., Parthasaraty, S., Simon, I. and Pongor, S. (1998): "Rod models of DNA: sequence-dependent anisotropic elastic modelling of local bending phenomena", Trends Biochem. Sci. 23 (9), 341-346.

[9] http://www.pubmed.com

[10] Sarai, A., Mazur, J., Nussinov, R. and Jernigan, R. (1989): "Sequence dependence of DNA conformational flexibility", Biochemistry, 28, 7842-7848.

[11] www.sghms.ac.uk/depts/ndu/Teaching/Downloads/AAandPeptides-lect03.pdf

[12] Chapter 12, Kyte J. & Doolittle, R.F. (1982) J. Mol. Biol. 157, 105-132.


[13] For M = 200, and by calculating 50 positions in τ and 10 in s, Fig. (19) took 7 hours of computation. That’s why I didn’t select a bigger M, which would have given P more accurately.

VI. FIGURES

FIG. 3: |W(s, τ)|/√s for bendability of DNA. Notice that s is on the horizontal axis.


FIG. 4: |W(s, τ)|/√s for bendability of DNA (τ on the horizontal axis, s on the vertical axis). Brings closer a piece of Fig. (3), making the islands clearer, confirming that they are not ‘fake’ islands resulting from limited plotting resolution. On top is the piece of the bendability sequence corresponding to the same interval.

FIG. 5: The bendability sequence at the bright islands shown in Fig. (4). The red lines indicate the locations of the brightest areas of the islands.


FIG. 6: |W(s, τ)|/√s for bendability of DNA (τ on the horizontal axis, s on the vertical axis), showing that islands of greater period s tend to be darker, i.e. less significant.

FIG. 7: 〈|W(s)|〉 for bendability of DNA. The points from left to right refer to values of s = {2, 52, 102, 152, 202, ..., 452, 502, 752, 1002, 1252, ..., 5002}. Periods 100 < s < 200 contribute significantly more.

FIG. 8: 〈|W(s)|〉 for 10,000 uniformly distributed random numbers in the range [0, 1].


FIG. 9: |W(s, τ)|/√s for bendability of shuffled DNA (τ on the horizontal axis, s on the vertical axis).

FIG. 10: Left: 〈|W(s)|〉 for bendability of DNA. The upper curve is for the real sequence, just a magnification of the range s < 500 of Fig. (7). The lower curve is the same quantity resulting from the bendability series of the same piece of DNA after uniform, random shuffling of the bases. Right: The difference 〈|W^real|〉 − 〈|W^shuff|〉.


[Figure: three columns of panels. Top row: f(t) versus t. Second row: contour plots, s (0 to 500) versus τ. Third row: |W(s, τ)|/√s versus s at fixed τ = 1500, 1900 and 4000.]

FIG. 11: Spectra of 3, 5 and 30 pseudo-histones, in the first, second and third column respectively, with N = 0 and (y0, σy) = (70, 10). In the top row are the three artificial sequences, and under each sequence is the corresponding CWT spectrum. In the third row there are sections of the above contour plots, showing |W(s, τ)|/√s as a function of s for fixed τ. The last row helps to compare the intensities of the contributing harmonics.


[Figure: three columns of panels. Top row: f(t) versus t. Second row: contour plots, s (0 to 500) versus τ. Third row: |W(s, τ)|/√s versus s at fixed τ = 1500, 1700 and 4000. Fourth row: ⟨|W|⟩ versus s.]

FIG. 12: Same as Fig. (11), but with N = 1. The fourth row shows ⟨|W(s)|⟩.


[Figure: three columns of panels. Top row: f(t) versus t. Second row: contour plots, s (0 to 500) versus τ. Third row: |W(s, τ)|/√s versus s at fixed τ = 1500, 1700 and 4000. Fourth row: ⟨|W|⟩ versus s.]

FIG. 13: Same as Fig. (12), but with N = 5.


[Figure: three columns of panels. Top row: f(t) versus t. Second row: contour plots, s (0 to 500) versus τ. Third row: |W(s, τ)|/√s versus s at fixed τ = 1500, 1700 and 4000. Fourth row: ⟨|W|⟩ versus s.]

FIG. 14: Same as Fig. (12), but with N = 10.


[Figure: five rows by three columns of panels, each showing ⟨|W|⟩ versus s.]

FIG. 15: Each row corresponds to one protein, starting from protein 1 at the top row and going down to protein 5. From left to right, the plots correspond to the numerical sequences of hydropathy, mass and pI respectively. The red marked curve is ⟨|W_real(s)|⟩ and the blue marked is ⟨|W_shuff(s)|⟩.


[Figure: five rows by three columns of contour plots, s versus τ.]

FIG. 16: Alignment is the same as in Fig. (15). The shaded contour plots are of |W(s, τ)|/√s of the actual proteins. The blue marked contour is where S = 0.8 and the red marked is where S = 0.9.


[Figure: five rows by three columns of contour plots, s (0 to 10) versus τ.]

FIG. 17: Same as Fig. (16), zooming in to 1 < s < 10.


[Figure: two contour plots, s versus τ, covering 100 ≤ τ ≤ 150 and 800 ≤ τ ≤ 900.]

FIG. 18: Two close-ups in the spectrum of hydropathy of protein 1. The blue marked contour is where S = 0.8 and the red marked is where S = 0.9.

[Figure: contour plot, s (1 to 4) versus τ (100 to 150).]

FIG. 19: Close-up in the spectrum of hydropathy of protein 1, with contours P = 0.8 (blue) and P = 0.9 (red) superimposed. Comparable to the left panel of Fig. (18).