A u to re gr essiv e Modeling a nd Fe atur e A naly sis of ... · A u to re gr essiv e Modeling a...

22
EURASIP Journal on Applied Signal Processing 2004:1, 13–28 c 2004 Hindawi Publishing Corporation Autoregressive Modeling and Feature Analysis of DNA Sequences Niranjan Chakravarthy Department of Electrical Engineering, Arizona State University, Tempe, AZ 85287-5706, USA Email: [email protected] A. Spanias Department of Electrical Engineering, Arizona State University, Tempe, AZ 85287-5706, USA Email: [email protected] L. D. Iasemidis Harrington Department of Bioengineering, Arizona State University, Tempe, AZ 85287-9709, USA Email: [email protected] K. Tsakalis Department of Electrical Engineering, Arizona State University, Tempe, AZ 85287-5706, USA Email: [email protected] Received 28 February 2003; Revised 15 September 2003 A parametric signal processing approach for DNA sequence analysis based on autoregressive (AR) modeling is presented. AR model residual errors and AR model parameters are used as features. The AR residual error analysis indicates a high specificity of coding DNA sequences, while AR feature-based analysis helps distinguish between coding and noncoding DNA sequences. An AR model-based string searching algorithm is also proposed. The eect of several types of numerical mapping rules in the proposed method is demonstrated. Keywords and phrases: DNA, autoregressive modeling, feature analysis. 1. INTRODUCTION The complete understanding of cell functionalities depends primarily on the various cell activities carried out by pro- teins. Information for the formation and activity of these proteins is coded in the deoxyribonucleic acid (DNA) se- quences. For detection purposes, the vast amount of genomic data makes it necessary to define models for DNA segments such as the protein coding regions. Such models can also facilitate our understanding of the stored information and could provide a basis for the functional analysis of the DNA. Since the DNA is a discrete sequence, it can be interpreted as a discrete categorical or symbolic sequence and hence, digital signal processing (DSP) techniques could be used for DNA sequence analysis. The DNA sequence analysis problem can be considered as analogous to some forms of speech recog- nition problems. That is, coding and noncoding regions in DNA need to be identified from long nucleotide sequences, a process that bears some similarities to the problem of iden- tifying phonemes from long sequences of speech signal sam- ples. Currently proposed DSP techniques include the study of the spectral characteristics [1, 2, 3, 4] and the correlation structure [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18] of DNA sequences. The measurement of spectra in most cases has been characterized by nonparametric Fourier transform techniques [1]. In some of the most common cases, the pres- ence of a spectral peak [1] was used to characterize protein- coding regions in the DNA. On the other hand, correlations have been often characterized on the basis of the extent of power-law (long-range) behavior and the persistence of the power-law correlation sequence [6, 8]. Attempts have been also made to parameterize these correlations in terms of the scale of the power law [6]. In this paper, we propose the use of parametric spectral methods for the analysis of DNA sequences. Parametric spec- tral analysis techniques have been widely used to study time series of speech, seismic, and other types of signals. Specif- ically, we investigate the use of autoregressive (AR) spectral

Transcript of A u to re gr essiv e Modeling a nd Fe atur e A naly sis of ... · A u to re gr essiv e Modeling a...

  • EURASIP Journal on Applied Signal Processing 2004:1, 13–28c© 2004 Hindawi Publishing Corporation

    Autoregressive Modeling and Feature Analysisof DNA Sequences

    Niranjan ChakravarthyDepartment of Electrical Engineering, Arizona State University, Tempe, AZ 85287-5706, USAEmail: [email protected]

    A. SpaniasDepartment of Electrical Engineering, Arizona State University, Tempe, AZ 85287-5706, USAEmail: [email protected]

    L. D. IasemidisHarrington Department of Bioengineering, Arizona State University, Tempe, AZ 85287-9709, USAEmail: [email protected]

    K. TsakalisDepartment of Electrical Engineering, Arizona State University, Tempe, AZ 85287-5706, USAEmail: [email protected]

    Received 28 February 2003; Revised 15 September 2003

    A parametric signal processing approach for DNA sequence analysis based on autoregressive (AR) modeling is presented. ARmodel residual errors and AR model parameters are used as features. The AR residual error analysis indicates a high specificity ofcoding DNA sequences, while AR feature-based analysis helps distinguish between coding and noncoding DNA sequences. An ARmodel-based string searching algorithm is also proposed. The effect of several types of numerical mapping rules in the proposedmethod is demonstrated.

    Keywords and phrases: DNA, autoregressive modeling, feature analysis.

    1. INTRODUCTION

    The complete understanding of cell functionalities dependsprimarily on the various cell activities carried out by pro-teins. Information for the formation and activity of theseproteins is coded in the deoxyribonucleic acid (DNA) se-quences. For detection purposes, the vast amount of genomicdata makes it necessary to define models for DNA segmentssuch as the protein coding regions. Such models can alsofacilitate our understanding of the stored information andcould provide a basis for the functional analysis of the DNA.Since the DNA is a discrete sequence, it can be interpreted asa discrete categorical or symbolic sequence and hence, digitalsignal processing (DSP) techniques could be used for DNAsequence analysis. The DNA sequence analysis problem canbe considered as analogous to some forms of speech recog-nition problems. That is, coding and noncoding regions inDNA need to be identified from long nucleotide sequences, aprocess that bears some similarities to the problem of iden-

    tifying phonemes from long sequences of speech signal sam-ples. Currently proposed DSP techniques include the studyof the spectral characteristics [1, 2, 3, 4] and the correlationstructure [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18] ofDNA sequences. The measurement of spectra in most caseshas been characterized by nonparametric Fourier transformtechniques [1]. In some of the most common cases, the pres-ence of a spectral peak [1] was used to characterize protein-coding regions in the DNA. On the other hand, correlationshave been often characterized on the basis of the extent ofpower-law (long-range) behavior and the persistence of thepower-law correlation sequence [6, 8]. Attempts have beenalso made to parameterize these correlations in terms of thescale of the power law [6].

    In this paper, we propose the use of parametric spectralmethods for the analysis of DNA sequences. Parametric spec-tral analysis techniques have been widely used to study timeseries of speech, seismic, and other types of signals. Specif-ically, we investigate the use of autoregressive (AR) spectral

    mailto:[email protected]:[email protected]%20mailto:[email protected]:[email protected]

  • 14 EURASIP Journal on Applied Signal Processing

    estimation tools for DNA sequence analysis. AR models ef-fectively capture spectral peaks and model the correlation insequences [19]. After the model fit, the AR model parame-ters, and AR related signals such as the prediction residual,can be used as features of the DNA sequences. The studiesthat we carried on AR models include the following. First,we explored the use of linear prediction residuals to com-pare coding and noncoding regions as well as distinguish be-tween different genes. Different numerical mapping rules forthe representation of nucleotides were considered. Second,we used the AR parameters as DNA sequence features.

    The paper is organized as follows. A few basic biolog-ical properties of the DNA are described in Section 2. Anoverview of DNA sequence analysis techniques based on cor-relation functions and DSP-based methods is presented inSection 3. The motivation for the use of parametric spectralanalysis methods for DNA analysis and its various imple-mentation aspects are presented in Section 4. Results fromthe application of AR model-based analysis to DNA se-quences are presented in Section 5. A discussion of the re-sults and possible extensions to these techniques are given inSection 6.

    2. DNA STRUCTURE AND FUNCTION

    DNA is the basic information storehouse in living cells. Var-ious cell activities are carried out by proteins which are pro-duced based on information stored in genes. DNA is a poly-mer formed from 4 basic subunits or nucleotides, namely,adenine (A), cytosine (C), thymine (T), and guanine (G).A single DNA strand is formed by the covalent bonds be-tween the sugar phosphate groups of the nucleotides. TwoDNA strands are then weakly bonded by hydrogen bonds be-tween the nucleotides. Since the nucleotide A forms such abond only with T, and G only with C, the two DNA strandsare complementary to each other and each of them is used asa template during cell division to transfer information. Usu-ally, two complementary DNA strands form a double helix.The synthesis of proteins is governed by certain regions in theDNA called protein coding regions or genes. The 64 possiblenucleotide triplets ((nucleotide alphabet size)word length = 43),called codons, are mapped into 20 amino acids that bond to-gether to form proteins. Certain codons known as start andstop codons indicate the beginning and end of a gene. TheDNA also consists of regions that store information for reg-ulatory functions. In advanced organisms, the protein cod-ing regions are not generally continuous and are separatedinto several smaller subregions called exons. The regions be-tween the exons are known as introns. During the proteincoding process, these introns are eliminated and the exonsare spliced together. The splicing can be carried out in a num-ber of different ways depending on the cell function. Splic-ing thus also determines the type of protein synthesis andhence genes can be used for the production of a variety ofproteins. The central dogma (Figure 1) in cellular biologydescribes the information transfer from the DNA to the ri-bonucleic acid (RNA) and the production of proteins. Theformation of proteins takes place in two stages, namely, tran-

    Protein Arg-Gly-Tyr-Thr-Phe

    Translation

    mRNA CGU-GGA-UCA-ACU-UUU

    Transcription

    DNA CGT-GGA-TCA-ACT-TTT

    GCA-CCT-AGT-TGA-AAA

    Figure 1: Central dogma; the information transfer from DNA toproteins.

    scription and translation. During transcription, the genes inthe DNA sequence are used as templates to form the pre-messenger RNA (pre-mRNA). The pre-mRNA is a polymerformed from 4 basic subunits, namely, A, C, G, and uracil(U). Next, the exons in the pre-mRNA are spliced together toform a polymer of only coding regions known as the mRNA.The mRNA along with the transfer RNA (tRNA) controlsprotein formation. The complete process is controlled andcatalyzed by a number of enzymes. Almost all cells in a livingsystem have the same DNA structure and information con-tent. The gene expression depends on the cell requirements.Microarray technology basically captures the amount of ex-pression of various genes. The structure and organization ofthe DNA and various cell functions are explained in [20].

    One of the relevant problems in bioinformatics is to ac-curately identify the protein coding regions and thus predictthe protein that will be generated using the information inthese segments. In addition, some effort is expended in un-derstanding the role of noncoding regions. It is therefore ofcentral interest to analyze and characterize various DNA re-gions such as coding and noncoding sequences.

    3. REVIEW OF METHODS FOR DNASEQUENCE ANALYSIS

    A primary objective of DNA sequence analysis is to automat-ically interpret DNA sequences and provide the location andfunction of protein coding regions. Methods to locate genes,and various coding measures are described in [21]. The geneidentification problem is challenging especially in eukary-otic DNA sequences in which the coding regions are sepa-rated into several exons. An overview of standard techniquesfor gene identification is provided in [22]. Computationaltechniques for gene identification are classified into templatemethods and lookup methods. Template methods attemptto model prototype objects or sequences and identify genesbased on these models. On the other hand, lookup methodsuse exactly known gene sequences and search for similar seg-ments in a database. Computational techniques, to accom-plish the above, include identification measures like Fourierspectra and sequence similarity measures. An overview of the

  • Autoregressive Modeling and Feature Analysis of DNA Sequences 15

    standard coding measures and their accuracy in identifyinggenes is also given in [22]. A discussion on the regulation ofgene expression, techniques to integrate various gene models,for example, hidden Markov models (HMM), and methodsfor efficient computation are presented in [22] as well.

    3.1. Correlations in DNA sequences

    Correlation functions have been widely used to study the sta-tistical properties of DNA sequences. The autocorrelation ofa stationary and ergodic numerical sequence x at lag m is de-fined as

    rxx(m) = E[x(n + m)x(n)

    ]

    = limN→∞

    12N + 1

    N∑

    n=−Nx(n + m)x(n),

    (1)

    where E[·] is the statistical expectation operator and N is thelength of the window over which the averaging is performed.A typical statistically well-behaved estimator for the autocor-relation is

    r̂b(m) =1N

    N−|m|−1∑

    n=0x(n + |m|

    )x(n). (2)

    The power spectrum of a signal is the Fourier transform ofits correlation [19]. To use (2) in DNA analysis, one has toassign numerical values to the nucleotides A, T, C, and G.One of the early analyses of the correlation structure in theDNA was done in [6]. Binary indicator sequences are usedtherein to calculate correlations in the DNA sequence. Thepower spectra of the sequences are shown to have a power-law behavior. The spectra are reported to change according tothe evolutionary categories of the DNA sequences analyzed.Similar analysis is also presented in [11], wherein a simplemodel, called expansion-modification model, is consideredto exhibit correlations similar to those present in the DNA.Results are therein presented based on three correlation mea-sures, that is, the mutual information function, the powerspectrum to calculate the correlations, and a cumulative ap-proach (similar to a DNA walk). Various issues of the DNAcorrelation structure and its interpretation are also discussed.

    The calculation and relation between correlation func-tions and mutual information of symbol sequences areexplained in [5]. Correlation functions and mutual infor-mation function differ in quantifying statistical dependen-cies. While correlations measure only the linear dependen-cies in sequences, the mutual information function detectsother statistical dependencies (e.g., nonlinear) in the signalas well. The correlation measurements depend on the assign-ment of numbers to the symbols in the sequence, whereasthe mutual information is independent of such coordinatetransformations. The binary mapping rules used in [7] carrycertain biological interpretations and are used in the calcu-lation of the autocorrelation and the other related statisti-cal dependencies. A study on the statistical correlations inthe DNA sequence is presented in [8], in which possible er-rors in estimating correlations from short DNA sequences

    is also described. The direct measure of correlations fromlong sequences is advocated to be better than measures ob-tained through detrended fluctuation analysis (DFA) [10],indirect autocorrelation computation from the power spec-tra, and correlation estimates from the mutual informationfunction [11]. The DFA technique removes heterogeneitiesin the DNA sequence, but since it has been reported that im-portant details of the correlation structure in the DNA maybe due to these heterogeneities [23], the use of the DFA tech-nique is questioned. The autocorrelation function is consid-ered to be useful in measuring the compositional heterogene-ity. A series of studies on the use of correlation in DNA anal-ysis is also given in [9, 14, 15, 16, 17, 18]. Other methods forDNA analysis include DNA walk [24] and Markov chains ofvarious orders.

    Observed correlation properties have also been inter-preted in terms of the underlying biology [11, 12, 13, 18].One of the important characteristics of protein coding seg-ments in DNA sequences is the presence of persistent cor-relations with a pronounced period of three. It is shown in[12] that these correlations arise due to the nonuniform us-age of codons in the coding regions. This nonuniformity isconsidered to exist due to a number of factors including themany-to-one mapping of codons to amino acids, the use ofcertain amino acids for protein formation, the preferentialcoding of codons into amino acids, and the correlations be-tween the G + C contents in the third codon positions withG + C contents in the surrounding DNA. These factors maycause the concentrations of nucleotides in the three codonpositions to be different. Such a positional asymmetry is be-lieved to be the cause of the pronounced period-three patternin the coding segment correlations and mutual information.The pronounced periodicity mentioned in [12] has also beenused to differentiate coding and noncoding DNA segments[25]. Covariance matrix decay is used for analysis of correla-tion functions in [13]. The observations of long-range corre-lations and the various periodicities in the observed correla-tions are related to biological facts in genomes.

    The characterization of coding and noncoding regionsbased on the mutual information function is describedin [25]. That paper basically explores the existence ofphylogenetic origin-free statistical features in coding andnoncoding regions. The mutual information function decaysto zero for noncoding DNA, whereas it oscillates for cod-ing DNA with a period of three. Gene identification basedon the mutual information function is reported to performbetter than traditional techniques which require training ondatasets [26]. A number of other information theory mea-sures have also been used for coding segment characteriza-tion [5, 18, 23, 27, 28, 29, 30, 31]. A measure for sequencecomplexity is presented in [23]. The sequence compositionalcomplexity is based on an entropic segmentation methodto divide a sequence into homogenous segments. The com-plexity measure is compared for coding and noncoding seg-ments and is related to the correlation structure. An entropicsegmentation method is also used in finding borders be-tween coding and noncoding regions [27]. A 12-letter alpha-bet or mapping rule is used, which takes into account the

  • 16 EURASIP Journal on Applied Signal Processing

    differential base composition at each codon position. This isused to find different compositional domains for coding andnoncoding regions. General statistical properties of codingregions are used in the segmentation, and this method is re-ported to be highly accurate in identifying borders. Anotherinformation theory tool which has been reported to be use-ful in the analysis of DNA sequences is given in [28]. Thisis the Jensen-Shannon divergence which quantifies the dif-ference between different statistical distributions. A descrip-tion of statistical properties of the divergence measure is fol-lowed by the application to the analysis of DNA sequences.The segmentation method based on the divergence measureis reported to segment a nonstationary sequence into station-ary subsequences, and is also applied to DNA. Finally, a goodoverview on information theory and applications to molec-ular biology can be found in [32].

    3.2. DSP techniques for DNA sequence analysisThe string of nucleotides in the DNA sequence is a categori-cal or symbolic sequence. Each of the nucleotides is assigneda numerical value, in order to apply DSP methods. Examplesof such numerical assignment techniques are the binary in-dicator sequences [6] or the assignment of the integers 1, 2,3, and 4 to A, C, G, and T, respectively [33]. The numericalsequences thus obtained are analyzed using DSP methods.Tiwari et al. [1] identify coding regions in DNA sequences bycomputing the Fourier spectra of a moving window acrossthe sequence. The value of the spectrum at f = 1/3, is usedto clarify the DNA regions as either coding or noncoding.The relative strength of the periodicity is used as the codingmeasure (ratio of the spectral value at f = 1/3 to the av-erage spectrum). The effectiveness of the GeneScan methodin identifying coding regions is also discussed. The methodis robust to sequencing errors resulting from frameshift er-rors; the computations are simple and training is not re-quired, which is an additional advantage. Anastassiou [2] ex-tends on the ideas from [1, 3] and provides a method to dif-ferentiate coding and noncoding regions based on weightedspectra. Two numerical assignment schemes, namely, binaryand complex number assignments are used for analysis in[2]. A procedure to compute the protein sequence from thecoding regions, based on the principles of finite impulse re-sponse filters and quantization, is also described. Methodsto calculate DNA spectrograms, and the use of power spec-tra to identify coding regions, are given. The paper also de-scribes the method for the identification of reading framesand summarizes the uses of DSP-based techniques in DNAsequence analysis. Analysis of chromosome genomic signalshas also been carried out using a complex numerical repre-sentation of nucleotides [34]. Therein, a model of the struc-ture of the chromosome has been presented through tech-niques such as phase analysis, two- and three-dimensionalsequence path analysis, and statistical analysis. The signalprocessing of symbolic sequences has also been addressedin [35, 36]. In [35], binary indicator sequences are used forDNA sequence analysis. For any mapping rule, a symbolicsequence is mapped to a numerical sequence by assigning aweight to each symbol. This mapping can be represented as

    a matrix multiplication. The subsequent linear transforma-tion of the numerical sequence can also be represented bya matrix multiplication operation. Since linear transforma-tions are performed, the weights can be optimized to obtaina required property in the transformed signal. These opera-tions are explained in the case of discrete Fourier transforms(DFTs). The computation of linear transforms for symbolicsignals is also explained in [36]. Spectral and wavelet analy-ses of symbolic sequences are explained and applied to DNAsequences, and results are presented for “pseudo DNA” se-quences and E. Coli DNA.

    Concepts from digital IIR filtering were used in [4] todetect coding regions. This paper uses antinotch IIR filtersto identify these regions. This is achieved by designing a fil-ter which has a sharp frequency response peak at 2π/3. Onpassing the nucleotide sequence through this filter, if the se-quence is from a coding region, the output will have a pro-nounced frequency peak at 2π/3. The authors explain vari-ous tradeoffs in the design of the IIR filter and efficient designprocedures. They conclude with examples where the outputof the antinotch filter has a more discernible spectral peak at2π/3 when coding sequences are analyzed.

    Two DSP-based approaches to genome sequences anal-ysis are explained in [24]. The methods are the three-dimensional DNA walks and Gauss wavelet-based analy-sis, and Huffman-based encoding technique. The three-dimensional DNA walk is used as a tool to visualize changesin nucleotide composition, base pair patterns, and evolutionalong the DNA sequence. The proposed DNA walk modelis reported to provide similar results as those obtained froma purine-pyrimidine walk, in terms of long-range correla-tions. Gauss wavelet analysis is then used to analyze the frac-tal structure of the three-dimensional DNA walk. With theuse of Huffman coding, the transformation of the DNA se-quence into an encoded domain can help visualize the se-quences from a new perspective.

    The spectral analysis of a categorical time series is ex-plained in [37, 38]. In [37], the statistical theory for ana-lyzing a categorical time series in the frequency domain isdiscussed, and the methodology that is developed is appliedto DNA sequences. A discussion on the application of thespectral envelope methodology to a number of sequences, in-cluding the DNA, is given in [38]. Various spectral peaks inthe sequence can be observed in the spectral envelope that isobtained through this technique. Techniques based on time-frequency and wavelet analysis have also been used to analyzeDNA and protein sequences [18, 39, 40, 41].

    3.3. Numerical mapping of nucleotides

    Numerical mapping can be broadly classified into two types,namely, fixed mapping as in [1, 2, 4, 5, 6, 7, 8, 13, 16, 17,24, 33] and a mapping based on some optimality criterionas in [36, 37]. Fixed mappings include binary [8], integer[33], and complex representations [2]. In this work, we use areal-number mapping rule based on the complement prop-erty of the complex mapping in [2]. The real-number rep-resentation is A = −1.5; T = 1.5; C = 0.5; and G = −0.5.

  • Autoregressive Modeling and Feature Analysis of DNA Sequences 17

    G = −1 + j

    C = −1− j

    A = 1 + j

    T = 1− j

    (a)

    A = −1.5

    G = −0.5

    T = 1.5

    C = 0.5

    (b)

    Figure 2: A constellation diagram for (a) complex-number representation and (b) real-number representations.

    The complement of a sequence of nucleotides can be ob-tained by changing the sign of the equivalent number se-quence and reversing the sequence. For example, CTGAA:0.5; 1.5;−0.5;−1.5;−1.5 → Change Sign and Reverse Se-quence → 1.5; 1.5; 0.5;−1.5;−0.5: TTCAG. In the computa-tion of correlations, real representations are preferred overcomplex representations. Furthermore, it is interesting tonote that the complex, real, and integer representations canalso be viewed as constellation diagrams, which are widelyused in digital communications. Figure 2 shows the constel-lation diagram for the complex and real representations. Thecomplex constellation is similar to that of the quadraturephase shift keying (QPSK) scheme, and the real represen-tation is similar to the pulse amplitude modulation (PAM)scheme. The constellation diagram helps visualize the DNAsequence in the context of digital communications, wherea symbol mapping is followed by transmission of informa-tion. Analysis of DNA sequences using digital communica-tions techniques could reveal certain aspects of the DNA likeerror-correcting capability. An information theory perspec-tive of information transmission in the DNA, namely, thecentral dogma, is explained in [32].

    4. AR MODEL-BASED DNA SEQUENCE ANALYSIS

    The aforementioned DNA sequence analysis techniques canbe divided into two main categories. In the first category, cor-relations within coding and noncoding sequences are char-acterized and used thereafter. In the second category, theFourier transform of sequences is used to observe spec-tral characteristics that could distinguish between codingand noncoding DNA regions. The typical spectral signaturefound in a coding region is a spectral peak [1], and AR spec-tral estimators are effective in modeling spectral peaks ofshort sequences [19]. AR spectral parameters can also re-flect the underlying difference in the correlation structure be-tween coding and noncoding regions. Since correlations havebeen related to biological properties of the DNA, AR modelscould also be used as models of biological functions. Hence,it is a logical extension to use AR spectral estimators to ana-lyze DNA sequences.

    4.1. AR modeling

    The AR modeling of DNA sequences can be performed usinglinear prediction techniques. In the linear prediction anal-

    Nucleotidesequence

    x(n) A(z)(Linear combiner)

    Residualsignal

    Figure 3: AR process and linear prediction; A(z) is the filter poly-nomial.

    ysis, a sample in a numerical sequence is approximated bya linear combination of either preceding or future sequencevalues [42]. The forward linear prediction operation is givenby

    e(n) = x(n)− a1x(n− 1)− a2x(n− 2)− · · ·− apx(n− p),(3)

    where x is the numerical sequence, n is the current sam-ple index, a1, a2, . . . , ap are the linear prediction parameters,and e(n) is the linear prediction error. Equation (3) repre-sents forward linear prediction since the current sample ispredicted by a linear combination of previous samples. Simi-larly, in backward linear prediction, a sample is predicted as alinear combination of future samples. The linear predictioncoefficients are calculated by minimizing the mean squarederror. The linear prediction polynomial is given by

    A(z) = 1−p∑

    i=1aiz−i. (4)

    Figure 3 depicts the DNA linear prediction in the context ofAR processes.

    The output of the linear combiner is known as the resid-ual signal. In speech processing, linear prediction has beenused for efficient modeling with a considerable level of suc-cess [43]. The AR Yule-Walker and Burg algorithms arewidely used to compute the AR model parameters. The in-volved autocorrelation matrix values are typically calculatedusing the biased estimate in (2). Issues related to the ARmodeling of DNA sequences are discussed in Section 4.2.

    4.2. Proposed AR model-based DNA sequence analysis

    The AR modeling of a DNA sequence is done by first map-ping the sequence into the numerical domain and then cal-culating the AR parameters of the resulting numerical se-quence. Since the numerical mapping of the DNA affects

  • 18 EURASIP Journal on Applied Signal Processing

    DNAsequence 1

    Numericalmapping Equivalent

    numericalsequence

    Modelestimation AR model

    parameters

    DNAsequence 2

    Numericalmapping Equivalent

    numericalsequence

    Linearprediction filter

    Residualerror

    Figure 4: Block diagram of AR model-based residual signal analysis of DNA segments.

    the correlation function [5], the AR parameters, which arederived from the correlation values, also depend on thenumerical assignment. In this paper, the real, integer, and bi-nary mapping rules [8] have been used for analysis. Anotherimportant issue pertains to the application of AR modelingto DNA sequences. As mentioned in Section 4.1, the calcula-tion of AR parameters from the linear prediction model in-volves minimizing the error between the current signal sam-ple and a linear combination of past samples. This defini-tion pertains to causal AR modeling. In the case of DNA se-quences, there appears to be no constraint to consider only acausal AR model, since the nucleotides in a spatial series neednot be constrained to depend on the ones positioned beforethem only. However, the protein coding information is storedin nucleotide triplets and certain codons signal the start andstop of these gene regions. The start/stop codons and thetranscription of the nucleotide triplets implicitly confer di-rectionality to the nucleotide sequences in the genes. Hence,a causal AR model appears to be more appropriate for mod-eling gene sequences. The fact that the polymerase enzymewhich is responsible for reading the information from thegenes physically reads this DNA information from the startto the stop codons augurs our assumption. However, it needsto be noted that no such directionality apparently exists innoncoding regions and it would thus be of considerable in-terest to analyze both coding and noncoding DNA regionswith causal versus noncausal models, respectively.

    AR models of DNA sequences were used to perform twobasic kinds of analyses. In the first analysis, the residual errorvariance of DNA sequences was used as a measure to indi-cate the “goodness” of the AR fit. In other words, AR modelsof various DNA segments were compared based on their ARresidual signal. That is, suppose that signals s1(n) and s2(n)are modeled using respective AR models. When s1(n) is in-put to the linear predictor defined by the parameters of theAR model of s2(n), the residual signal error would be lowerif s1(n) and s2(n) are described by similar AR models thanif described by different AR models. The residual signal canthus be used as a measure of similarity between two signals(e.g., two DNA regions). Furthermore, it is evident that theresidual error (a one-dimensional measure) alone is not suf-ficient to parameterize multidimensional signals, that is, dif-ferent signals may yield similar residual error values. Thus,the inadequacy of the residual error was one of the moti-vations to use AR model parameters as sequence features.

    For example, if the parameters a1, a2, . . . ,ap are obtained byAR analysis of a gene segment, the vector [1,a1,a2, . . . ,ap]T

    is used as the segment feature. This is similar to the analysisof speech signals, where the AR model parameters or theirderivatives, such as cepstral parameters, are used as featurevectors. Furthermore, by representing DNA sequences of dif-ferent lengths with AR models of equal order, their compar-ison becomes possible by many simple measures such as Eu-clidean distance and vector correlations. Subsequently, ARfeatures of coding and noncoding DNA sequences were an-alyzed using techniques such as feature space distributionanalysis. Finally, we did not use the AR spectrum to distin-guish between coding and noncoding features. This is due tothe fact that working with high-order AR models, spuriousspectral peaks were observed.

    4.3. Analyzed DNA sequencesThe analyses presented herein were performed on the Saccha-romyces cerevisiae, Caenorhabditis elegans, and Streptococcusagalactiae genomes. The S. cerevisiae genome has 16 chro-mosomes and its complete length is approximately 12 mil-lion bp. C. elegans and C. cerevisiae are eukaryotes, while S.agalactiae is a prokaryotic organism.

    Prokaryotes are single-celled organisms while eukary-otes can be single- or multicelled. Major differences betweenprokaryotic and eukaryotic genomes are that the genome sizeof prokaryotes is typically less than that of eukaryotes, andthat prokaryotic DNA has a higher percentage of genetic in-formation content in contiguous gene segments than eukary-otic DNA. Furthermore, the number of repetitive sequencesin eukaryote DNA sequences is larger than the number ofrepeats in prokaryote DNA. The above-mentioned genomescan be obtained from the National Center for BiotechnologyInformation (NCBI) public database.

    5. RESULTS

    5.1. Residual error analysis

    We will first discuss the AR residual error-based DNA anal-ysis. Results only from the analysis of S. cerevisiae chromo-some 4 DNA sequence are presented herein. The binary SWmapping rule [8] and the real-number mapping rule wereused. The analysis’ block diagram is shown in Figure 4. ARmodels of coding and noncoding DNA regions were com-pared based on their AR residual errors as follows.

  • Autoregressive Modeling and Feature Analysis of DNA Sequences 19

    Order0 50 100 150 200

    Res

    idua

    lerr

    or

    0.18

    0.2

    0.22

    0.24

    0.26

    0.28

    0.3

    (a)

    Order0 50 100 150 200

    Res

    idua

    lerr

    or

    0.18

    0.2

    0.22

    0.24

    0.26

    0.28

    0.3

    (b)

    Order0 50 100 150 200

    Res

    idua

    lerr

    or

    0.15

    0.2

    0.25

    0.3

    0.35

    (c)

    Order0 50 100 150 200

    Res

    idua

    lerr

    or

    0.15

    0.2

    0.25

    0.3

    0.35

    (d)

    Figure 5: AR model of gene 1 of S. cerevisiae is used to perform residual signal analysis on its other genes using binary mapping. Residualsignal variance versus AR model for gene 1 ( ◦—) and other genes ( •—) from chromosome 4, (a) error in gene 1 and genes 3–9; (b) error ingene 1 and genes 11–18; (c) error in gene 1 and genes 20–35; and (d) error in gene 1 and genes 36–50. Genes of length less than 150 bp werenot considered since they cannot be modeled using high-order AR models.

    First, the AR models were computed for each gene. Then,these AR model parameters were used to perform linear pre-diction and obtain the residual signal variances when appliedto other genes. Genes of shorter length for which higher-order AR models could not be computed were not consid-ered. The residual signal variances from 47 genes obtainedwith the AR model of gene 1 are shown in Figure 5. It canbe noted that with increasing AR model order, the residualsignal variance in gene 1 decreases. This is in conformancewith the well-known fact from statistical signal processingthat when a signal is modeled using AR models of increas-ing order, the residual signal error for that signal decreasesmonotonically [19]. On the other hand, it is interesting tonote that for the other gene sequences, the residual error vari-

    ance increases with increasing AR model order (see Figure 5).A similar result was observed when the real mapping rule wasused (see Figure 6). This observation implies that with in-creasing model order, the similarity between the AR modelsof different genes decreases due to the increased specificity ofthe AR models to genes. The specificity could be due to theabsence of redundancy between the analyzed genes and em-phasizes the idea that, since different genes typically code fordifferent amino acid sequences, they may not contain a lot ofsimilar or redundant information.

    Next, noncoding segments were compared with codingsegments. Gene 1 in chromosome 4 of S. cerevisiae was mod-eled using an AR model, and the model parameters wereused to compute the residual error variances of 50 noncoding

  • 20 EURASIP Journal on Applied Signal Processing

    Order0 50 100 150 200

    Res

    idua

    lerr

    or

    1.2

    1.4

    1.6

    1.8

    2

    (a)

    Order0 50 100 150 200

    Res

    idua

    lerr

    or

    1.2

    1.4

    1.6

    1.8

    2

    (b)

    Order0 50 100 150 200

    Res

    idua

    lerr

    or

    1.2

    1.4

    1.6

    1.8

    2

    (c)

    Order0 50 100 150 200

    Res

    idua

    lerr

    or

    1.2

    1.4

    1.6

    1.8

    2

    (d)

    Figure 6: AR model of gene 1 of of S. cerevisiae is used to perform residual signal analysis on its other genes using real-number mapping.Residual signal variance versus AR model for gene 1 ( ◦—) and other genes ( •—) from chromosome 4, (a) error in gene 1 and genes 3–9;(b) error in gene 1 and genes 11–18; (c) error in gene 1 and genes 20–35; and (d) error in gene 1 and genes 36–50.

    segments. Similarly, gene 17 was modeled using an AR modeland the model parameters were used to compute the residualerror variances of 50 noncoding segments. The residual er-ror variances of 50 noncoding segments when the AR modelfrom gene 1 and gene 17 was applied are depicted in Fig-ures 7 and 8, respectively. It can be observed that the resid-ual signal variance values for a few noncoding sequences aresmaller than the ones for gene 1, for the full range of modelorders. This implies the existence of similarities between cod-ing and noncoding segments. Similar observations were alsoobtained when real mapping was applied.

    It is evident from the above observations that the classi-fication of an analyzed sequence to either a coding or non-coding region based on the residual signal alone is difficult asdifferent regions may have similar residual errors for a range

    of AR model orders. The above results also show that whenAR models are used to parameterize DNA segments basedon the residual error, higher-order models may be requiredto model the characteristics and capture their differences.

    5.2. AR feature-based analysis

    One of the important problems in DNA sequence analysisis identifying regions with similar nucleotide compositions.This is then typically applied in studies such as identifyingconserved regions across different organisms. A number ofalgorithms, such as BLAST, have been developed to performstring searches and template matching. These string search-ing tools are typically based on dynamic programming con-cepts, wherein the actual template or query string is com-pared with segments of a long DNA sequence. In this paper,

  • Autoregressive Modeling and Feature Analysis of DNA Sequences 21

    Order0 50 100 150 200

    Res

    idua

    lerr

    or

    0.2

    0.25

    0.3

    (a)

    Order0 50 100 150 200

    Res

    idua

    lerr

    or

    0.2

    0.25

    0.3

    (b)

    Order0 50 100 150 200

    Res

    idua

    lerr

    or

    0.2

    0.25

    0.3

    (c)

    Order0 50 100 150 200

    Res

    idua

    lerr

    or

    0.2

    0.25

    0.3

    (d)

    Figure 7: AR model of gene 1 is used for linear prediction on 50 noncoding segments using binary mapping. (a) Error in noncoding segments1–12; (b) error in noncoding segments 13–25; (c) error in noncoding segments 26–38; and (d) error in noncoding segments 39–50.

    the AR model parameters of the template nucleotide se-quence are used as features to identify similar segments ina long DNA sequence. AR models capture the global spectralcharacteristics of the modeled sequences. Thus, the identifi-cation is based on similar spectral characteristics (AR) ratherthan one-to-one nucleotide matching (dynamic program-ming techniques).

    The analysis was performed on a segment of the S. cere-visiae genome using binary, real-number, and integer map-ping. The template matching procedure was performed asfollows. First, a segment of nucleotides of length L was cho-sen as the template. The AR model of this template was es-timated for various orders, and the model parameters wereused as template features. Second, the AR features were cal-culated over the whole DNA sequence from overlappingmoving windows of the same length L as the template. Third,

    the feature vectors obtained from each moving window werecompared with the template feature vector by computing theEuclidean distance between them.

    It was observed that using the real mapping, similarsegments to either the template, its reversed sequence, itscomplementary sequence, or its reversed complementarysequence are detected. One such example is presented inTable 1, wherein the template and its complement were iden-tified. Using integer mapping, the DNA locations where sim-ilar features were found are cited in Table 2. In this case, thefeatures of the template sequence alone was detected. Usingbinary SW mapping, although the actual template occurredonly once in the complete sequence, other segments alsoyielded the same features (see Table 3). Here the template andthe matched sequences differ in the actual nucleotide but ona closer look, they have a similar sequence of strong and weak

  • 22 EURASIP Journal on Applied Signal Processing

    Order0 50 100 150 200

    Res

    idua

    lerr

    or

    0.18

    0.2

    0.22

    0.24

    0.26

    (a)

    Order0 50 100 150 200

    Res

    idua

    lerr

    or

    0.18

    0.2

    0.22

    0.24

    0.26

    (b)

    Order0 50 100 150 200

    Res

    idua

    lerr

    or

    0.18

    0.2

    0.22

    0.24

    0.26

    (c)

    Order0 50 100 150 200

    Res

    idua

    lerr

    or

    0.18

    0.2

    0.22

    0.24

    0.26

    (d)

    Figure 8: AR model of gene 17 is used for linear prediction of 50 noncoding segments using binary mapping. (a) Error in noncodingsequences 1–12; (b) error in noncoding sequences 13–25; (c) error in noncoding sequences 26–38; and (d) error in noncoding sequences39–50.

    hydrogen bonds. Analysis with the binary RY mapping rule[8] yielded similar results, that is, segments with a similarsequence of purines and pyrimidines as the one in the tem-plate.

    In the aforementioned analysis, the mapping rule usedplayed an important role in identifying matches. The real-and integer-number mapping rules yielded different stringmatches. This is due to the inherent complementary prop-erty of the real mapping rule and the noncomplementaryproperty of the integer mapping rule. The difference is fur-ther elucidated through the following exercise. Say, for ex-ample, the occurrences of the template 5′-TACGTGC-3′

    need to be found in a long DNA string. The correspondingnumerical sequence obtained through real mapping wouldbe 5′-1.5,−1.5, 0.5,−0.5, 1.5,−0.5, 0.5-3′. The following nu-merical sequences will have the same AR parameters as the

    above template:

    (i) 5′- −1.5, 1.5,−0.5, 0.5,−1.5, 0.5,−0.5-3′ =5′-ATGCACG-3′: (reversed complement of the template);

    (ii) 5′-0.5,−0.5, 1.5,−0.5, 0.5,−1.5, 1.5-3′ =5′-CGTGCAT-3′: (reversed template);

    (iii) 5′- −0.5, 0.5,−1.5, 0.5,−0.5, 1.5,−1.5-3′ =5′-GCACGTA-3′: (complement of the template).

    This is due to the fact that (a) the sign-reversed numericalsequence and the actual numerical sequence have the samelinear dependence and hence the same AR parameters, and(b) minimizing the forward or the backward linear predic-tion error would theoretically yield the same AR model. Thisis observed with the Burg algorithm AR estimation, wherein

  • Autoregressive Modeling and Feature Analysis of DNA Sequences 23

    Table 1: Detection of repeats of DNA segments via AR modeling.Real mapping rule and second-order AR model features are used;the template is 8 bp long. There are 5 repeats in the whole sequence.Identification of complementary and reversed sequences is obtainedas well.

    Position with the same features DNA segment

    210–217 (template) CTCACATT

    5174–5181 CTCACATT

    12572–12579 CTCACATT

    19278–19285 AATGTGAG

    29624–29631 CTCACATT

    36387–36394 AATGTGAG

    55805–55812 AATGTGAG

    63106–63113 CTCACATT

    Table 2: Detection of repeats of DNA segments via AR modeling.Integer mapping rule and second-order AR model features are used;the template is 8 bp long. There are 5 repeats in the whole sequence.The template is exactly identified.

    Position with the same features DNA segment

    210–217 (template) CTCACATT

    5174–5181 CTCACATT

    12572–12579 CTCACATT

    29624–29631 CTCACATT

    63106–63113 CTCACATT

    Table 3: Detection of repeats of DNA segments via AR modeling.Binary SW mapping rule and fourth-order AR model features areused; the template is 14 bp long and it has one occurrence in thewhole sequence. Identification of DNA with similar sequences ofstrong and weak hydrogen bonds is obtained. Nucleotides C andG (mapped to one), A and T (mapped to zero) are highlighted

    differently.

    Position with the same features DNA segment

    210–221 (template) C T C A C ATTA CCC TA7424–7435 C T C T G AAAT GCC AT9283–9294 G A C T G ATAA GGG TT

    80726–80737 C A G T G ATAT CGG TA

    both the forward and backward linear prediction errors areminimized together. In the case of the integer mapping rule(A = 1, C = 2, G = 3, T = 4), the corresponding numeri-cal sequence of the template is 5′-4, 1, 2, 3, 4, 3, 2-3′. The re-versed sequence, namely, 2, 3, 4, 3, 2, 1, 4, has the same ARmodel parameters as the template (by minimizing the for-ward and reverse prediction errors). On the other hand, thesequence corresponding to the complement of the templatemay not have the same AR model. Hence, using the integer

    mapping rule, the exact template and its reversed sequenceare matched.

    The features of the nucleotide segments are also af-fected by the use of the binary mapping rule. This isexplained through the following example. The sequence5′-TGACAAGC-3′ is mapped to 5′-0, 1, 0, 1, 0, 0, 1, 1-3′ usingthe binary SW mapping rule. The above numerical sequencealso corresponds to 5′-ACACATGG-3′, and a number ofother nucleotide combinations. The AR model parameters ofall these combinations are the same, and hence, it is possibleto identify sequences with certain similar chemical propertieslike similar sequences of strong and weak hydrogen bonds.

    The above observations are of great interest becausethey show that identification of regions with similar biolog-ical/chemical properties may be possible using AR feature-based template matching under different mapping rules. Forexample, the ability to identify a template and its comple-ment can help in identifying genes in complementary strandsas well, which may not be possible in a single “run” using tra-ditional string searching tools. The AR model string searchmethod can be used as an analytical tool to reveal additionalinformation about the interrelations between different DNAsequences. The knowledge acquired by this analysis could beused in knowledge or rule-based methods. Two DNA signalswith similar AR spectra are more related in a global mannerthan in a one-to-one nucleotide basis. In this sense, the abovemethod can provide clues about similarities between appar-ently nonidentical DNA sequences that could then be used inthe identification of the underlying biochemical mechanismsof such similarities. The results of AR model-based analy-sis are related to fast Fourier transform (FFT)-based meth-ods. The pros and cons have to do with the well-known ad-vantages and disadvantages of using parametric versus non-parametric signal processing methods (e.g., ability to analyzeshort versus long segments, computational speed, etc).

    The above algorithm was also applied to gene searches ina long string of DNA. It was observed that the distance be-tween the feature vectors is zero at the exact location of thegene even with an AR model of an order as low as 2. The dis-tance between the gene sequence AR feature vector and themoving window AR feature vector is plotted for various fea-ture dimensions (AR model orders) in Figure 9. It was alsoobserved that the average distance between the gene featurevector and features of the moving windows increased withAR model order. It can be typically expected that the averagedistance between vectors tends to increase with increasing di-mension. Nevertheless, in conjunction with our previous ob-servations from the residual signal-based analysis, it appearsthat the increasing average distance of the gene features withthe AR model orders may mainly be due to the greater speci-ficity of the AR modeling to the presence of genes. To furtherinvestigate the above observations, a study of the distribu-tions of coding and noncoding AR features was undertaken.

    The complete S. cerevisiae genome with all coding andnoncoding sequences was considered. We mapped the DNAsegments into the numerical domain using the binary SWmapping rule. Then, the AR model parameters of all seg-ments were calculated and used as the DNA segment features.

  • 24 EURASIP Journal on Applied Signal Processing

    Nucleotide position

    0 0.5 1 1.5 2∗ 2.5 3 3.5 4×104

    Dis

    tanc

    e

    00.10.20.30.4

    (a)

    Nucleotide position

    0 0.5 1 1.5 2∗ 2.5 3 3.5 4×104

    Dis

    tanc

    e

    00.20.40.60.8

    (b)

    Nucleotide position

    0 0.5 1 1.5 2∗ 2.5 3 3.5 4×104

    Dis

    tanc

    e

    00.20.40.60.8

    (c)

    Figure 9: The distance between the feature vector of a gene sequence (position denoted by ∗) and the corresponding features within amoving window segments over the analyzed DNA sequence from S. cerevisiae for AR model orders (a) 10, (b) 25, and (c) 50 (real mappingused). It can be noticed that the average distance between the gene feature and the features of the moving windows increases with AR modelorder, and it is minimal (zero) at the position of the gene.

    The analysis was also performed using the real mapping rule.For a particular AR model of order p, the centroid of all cod-ing region feature vectors was calculated, and the Euclideandistance of the feature vectors from the centroid was com-puted. The distances were similarly computed for noncod-ing region features from their centroid as well. The distri-bution density of these distance measures was obtained. Theprocess was repeated for increasing model orders. The dis-tributions from the coding region and noncoding regionswere then compared using the Kolmogorov-Smirnov test[44]. Figure 10 shows the distribution densities for S. cere-visiae coding and noncoding regions for AR model orders15 and 35, using binary SW mapping. The distribution den-sities obtained by using real-number mapping are depictedin Figure 11. Both coding and noncoding features are con-centrated near their respective centroids. The noncoding fea-tures appear to be more concentrated around their centroidthan the coding features.

    The p values from the Kolmogorov-Smirnov test of thedistributions of the coding and noncoding features using bi-nary SW and real-number mapping, are shown in Figure 12.It is observed that the threshold p = 0.05 used in the hy-pothesis testing is achieved with an AR model order of 21 forthe binary SW mapping and only 16 for the real mapping.Thus, it appears that such distance distributions can be used

    to further classify a DNA segment as coding or noncoding. Italso appears that the real mapping is more effective than thebinary SW mapping in this analysis.

    6. CONCLUSION

    A brief survey of the research on the analysis of DNA se-quences from a signal processing perspective was presented.The use of nonparametric classical DSP tools like Fouriertransforms and time-frequency analysis have been effectivein studying DNA sequences of coding and noncoding re-gions. The use of parametric spectral analysis to capture cer-tain spectral characteristics of such DNA regions was hereinintroduced. We applied the AR spectral analysis tools to ana-lyze DNA sequences.

    The analyses were of two basic types. First, the AR modelparameters of the analyzed DNA segments were used to per-form linear prediction analysis. The residual error was sub-sequently used to compare the analyzed segments. An ob-servation of particular interest was that the AR model wasvery specific to the coding DNA sequences. This specificityincreased with increasing model orders. Though the resid-ual error analysis methodology could be used to compare ARmodels of different DNA segments, it was found not to beadequate for the characterization of these sequences. The AR

  • Autoregressive Modeling and Feature Analysis of DNA Sequences 25

    Distance0 0.1 0.2 0.3 0.4 0.5 0.6 0.7

    Den

    sity

    0

    10

    20

    30

    40

    50

    60

    70

    CDS featuresNCDS features

    (a)

    Distance0 0.1 0.2 0.3 0.4 0.5 0.6 0.7

    Den

    sity

    0

    5

    10

    15

    20

    25

    30

    CDS featuresNCDS features

    (b)

    Figure 10: Distribution density of distances of coding segment (CDS) AR feature vectors and noncoding segment (NCDS) AR featurevectors from their respective centroids for AR model orders (a) 15 and (b) 35 (binary SW mapping used).

    Distance0 0.1 0.2 0.3 0.4 0.5 0.6 0.7

    Den

    sity

    0

    10

    20

    30

    40

    50

    60

    CDS featuresNCDS features

    (a)

    Distance0 0.1 0.2 0.3 0.4 0.5 0.6 0.7

    Den

    sity

    0

    5

    10

    15

    20

    25

    30

    CDS featuresNCDS features

    (b)

    Figure 11: Distribution density of distances of coding segment (CDS) AR feature vectors and noncoding segment (NCDS) AR featurevectors from their respective centroids for AR model orders (a) 15 and (b) 35 (real mapping used).

    model parameters themselves were then used as features forDNA string searches.

    Depending on the type of the numerical mapping ruleused, the AR feature-based string searching technique washighly effective in identifying all repeats of the query string,along with the locations of its complementary sequence.It was also possible to locate regions with similar chemi-cal structures, for example, sequences of similar strong and

    weak hydrogen bonds. Thus different mapping rules can beused depending on the objective of the analysis. For example,the use of SW or RY mapping rules was necessary to locateregions of similar strong-weak hydrogen bonds or purine-pyrimidine structure. It was observed that modeling witha low-order AR model and working in the generated fea-ture space was sufficient to locate the occurrence of com-plete genes in a long DNA sequence. Further analysis of the

  • 26 EURASIP Journal on Applied Signal Processing

    AR order10 20 30 40 50 60

    pva

    lue

    −0.2

    0

    0.2

    0.4

    0.6

    0.8

    (a)

    AR order10 20 30 40 50 60

    pva

    lue

    −0.2

    0

    0.2

    0.4

    0.6

    0.8

    (b)

    Figure 12: p values obtained from the Kolmogorov-Smirnov test, comparing the distribution of coding and noncoding AR features for (a)binary SW mapping and (b) real mapping. The 5% threshold used in the hypothesis testing is also plotted as a dotted horizontal line.

    distribution of the coding and noncoding AR features re-vealed that these distributions differed significantly for high-dimension AR features. It would be of great interest to fur-ther investigate the biological implications of differences inthe distributions of coding and noncoding region AR fea-tures.

    The proposed analytical scheme can also be used for theanalysis of other biochemical molecules, in addition to DNA,such as amino acid sequences. Further, like in speech recog-nition, AR features and their derivatives, such as cepstral fea-tures, could also be incorporated in an HMM-based gene-finding tool. Analysis of more genomic sequences along thelines proposed herein is underway.

    ACKNOWLEDGMENTS

    This work is partially supported by the National Institutesof Health through a Bioengineering Research PartnershipGrant NS39687 to Dr. L. D. Iasemidis. Portions of the ed-ucational components of this work have been supported bythe National Science Foundation Grant NSF0089075 to Dr.A. Spanias.

    REFERENCES

    [1] S. Tiwari, S. Ramachandran, A. Bhattacharya, S. Bhat-tacharya, and R. Ramaswamy, “Prediction of probable genesby Fourier analysis of genomic sequences,” Computer Appli-cations in the Biosciences, vol. 13, no. 3, pp. 263–270, 1997.

    [2] D. Anastassiou, “Genomic signal processing,” IEEE SignalProcessing Magazine, vol. 18, no. 4, pp. 8–10, 2001.

    [3] B. D. Silverman and R. Linsker, “A measure of DNA period-icity,” Journal of Theoretical Biology, vol. 118, pp. 295–300,1986.

    [4] P. P. Vaidyanathan and B.-J. Yoon, “Gene and exon predictionusing allpass-based filters,” in Proc. Workshop on Genomic Sig-nal Processing and Statistics (GENSIPS ’02), Raleigh, NC, USA,October 2002.

    [5] H. Herzel and I. Grosse, “Measuring correlations in symbolsequences,” Physica A, vol. 216, no. 4, pp. 518–542, 1995.

    [6] R. F. Voss, “Evolution of long-range fractal correlations and1/f noise in DNA base sequences,” Phys. Rev. Lett., vol. 68, no.25, pp. 3805–3808, 1992.

    [7] S. V. Buldyrev, A. L. Goldberger, S. Havlin, et al., “Long-range correlation properties of coding and noncoding DNAsequences: GenBank analysis,” Phys. Rev. E, vol. 51, no. 5, pp.5084–5091, 1995.

    [8] P. Bernaola-Galván, P. Carpena, R. Román-Roldán, and J. L.Oliver, “Study of statistical correlations in DNA sequences,”Gene, vol. 300, no. 1-2, pp. 105–115, 2002.

    [9] O. Weiss and H. Herzel, “Correlations in protein sequencesand property codes,” Journal of Theoretical Biology, vol. 190,no. 4, pp. 341–353, 1998.

    [10] C. K. Peng, S. V. Buldyrev, S. Havlin, M. Simons, H. E. Stan-ley, and A. L. Goldberger, “Mosaic organization of DNA nu-cleotides,” Phys. Rev. E, vol. 49, no. 2, pp. 1685–1689, 1994.

    [11] W. Li, T. Marr, and K. Kaneko, “Understanding long-rangecorrelations in DNA sequences,” Physica D, vol. 75, no. 1–3,pp. 392–416, 1994.

    [12] H. Herzel and I. Grosse, “Correlations in DNA sequences: Therole of protein coding segments,” Phys. Rev. E, vol. 55, no. 1,pp. 800–810, 1997.

    [13] H. Herzel, E. N. Trifonov, O. Weiss, and I. Grosse, “Interpret-ing correlations in biosequences,” Physica A, vol. 249, no. 1–4,pp. 449–459, 1998.

    [14] W. Li, “The study of correlation structures of DNA sequences:a critical review,” Computers & Chemistry, vol. 21, no. 4, pp.257–272, 1997.

    [15] L. Luo, W. Lee, L. Jia, F. Ji, and L. Tsai, “Statistical correlationof nucleotides in a DNA sequence,” Phys. Rev. E, vol. 58, no.1, pp. 861–871, 1998.

    [16] D. Holste, I. Grosse, and H. Herzel, “Statistical analysis of theDNA sequence of human chromosome 22,” Phys. Rev. E, vol.64, no. 4, pp. 1–9, 2001.

    [17] A. K. Mohanty and A. V. S. S. Narayana Rao, “Long range cor-relations in DNA sequences,” preprint, 2002, http://arXiv.org/abs/physics/0202075.

    [18] B. Audit, C. Thermes, C. Vaillant, Y. d’Aubenton-Carafa, J. F.Muzy, and A. Arneodo, “Long-range correlations in genomic

    http://arXiv.org/abs/physics/0202075http://arXiv.org/abs/physics/0202075

  • Autoregressive Modeling and Feature Analysis of DNA Sequences 27

    DNA: a signature of the nucleosomal structure,” Phys. Rev.Lett., vol. 86, no. 11, pp. 2471–2474, 2001.

    [19] L. S. Marple, Digital Spectral Analysis with Applications,Prentice-Hall, Englewood Cliffs, NJ, USA, 1987.

    [20] B. Alberts, D. Bray, A. Johnson, et al., Essential Cell Biology,Garland Publishing, NY, USA, 1998.

    [21] J. W. Fickett, “Recognition of protein coding regions in DNAsequences,” Nucleic Acids Research, vol. 10, no. 17, pp. 5303–5318, 1982.

    [22] J. W. Fickett, “The gene identification problem: an overviewfor developers,” Computers & Chemistry, vol. 20, no. 1, pp.103–118, 1996.

    [23] R. Román-Roldán, P. Bernaola-Galván, and J. L. Oliver, “Se-quence compositional complexity of DNA through an en-tropic segmentation method,” Phys. Rev. Lett., vol. 80, no. 6,pp. 1344–1347, 1998.

    [24] J. A. Berger, S. K. Mitra, M. Carli, and A. Neri, “New ap-proaches to genome sequence analysis based on digital sig-nal processing,” in Proc. Workshop on Genomic Signal Process-ing and Statistics (GENSIPS ’02), Raleigh, NC, USA, October2002.

    [25] I. Grosse, H. Herzel, S. V. Buldyrev, and H. E. Stanley, “Speciesindependence of mutual information in coding and noncod-ing DNA,” Phys. Rev. E, vol. 61, no. 5, pp. 5624–5629, 2000.

    [26] J. W. Fickett and C. S. Tung, “Assessment of protein codingmeasures,” Nucleic Acids Research, vol. 20, no. 24, pp. 6441–6450, 1992.

    [27] P. Bernaola-Galván, I. Grosse, P. Carpena, J. L. Oliver,R. Román-Roldán, and H. E. Stanley, “Finding borders be-tween coding and noncoding DNA regions by an entropic seg-mentation method,” Phys. Rev. Lett., vol. 85, no. 6, pp. 1342–1345, 2000.

    [28] I. Grosse, P. Bernaola-Galván, P. Carpena, R. Román-Roldán,J. L. Oliver, and H. E. Stanley, “Analysis of symbolic sequencesusing the Jensen-Shannon divergence,” Phys. Rev. E, vol. 65,pp. 041905-1–041905-16, 2002.

    [29] M. Crochemore and R. Vérin, “Zones of low entropy in ge-nomic sequences,” Computers & Chemistry, vol. 23, no. 3-4,pp. 275–282, 1999.

    [30] E. E. May, M. A. Vouk, D. L. Bitzer, and D. I. Rosnick, “A cod-ing theory framework for genetic sequence analysis,” in Proc.Workshop on Genomic Signal Processing and Statistics (GEN-SIPS ’02), Raleigh, NC, USA, October 2002.

    [31] H. P. Yockey, “An application of information theory to thecentral dogma and the sequence hypothesis,” Journal of The-oretical Biology, vol. 46, pp. 369–406, 1974.

    [32] H. P. Yockey, Information Theory and Molecular Biology, Cam-bridge University Press, Cambridge, UK, 1992.

    [33] A. A. Tsonis, J. B. Elsner, and P. A. Tsonis, “Periodicity in DNAcoding sequences: implications in gene evolution,” Journal ofTheoretical Biology, vol. 151, pp. 323–331, 1991.

    [34] P. D. Cristea, “Analysis of chromosome genomic signals,” inProc. 7th International Symposium on Signal Processing and ItsApplications (ISSPA ’03), vol. 2, pp. 49–52, Paris, France, July2003.

    [35] D. H. Johnson and W. Wang, “Symbolic signal processing,”in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing(ICASSP ’99), pp. 1361–1364, Phoenix, Ariz, USA, March1999.

    [36] W. Wang and D. H. Johnson, “Computing linear transformsof symbolic signals,” IEEE Trans. Acoustics, Speech, and SignalProcessing, vol. 50, no. 3, pp. 628–634, 2002.

    [37] D. S. Stoffer, D. E. Tyler, and A. J. McDougall, “Spectral anal-ysis for categorical time series: Scaling and the spectral enve-lope,” Biometrika, vol. 80, no. 3, pp. 611–622, 1993.

    [38] D. S. Stoffer, D. E. Tyler, and D. A. Wendt, “The spectral en-velope and its applications,” Statistical Science, vol. 15, no. 3,pp. 224–253, 2000.

    [39] A. Arneodo, E. Bacry, P. V. Graves, and J. F. Muzy, “Character-izing long-range correlations in DNA sequences from waveletanalysis,” Phys. Rev. Lett., vol. 74, no. 16, pp. 3293–3296, 1995.

    [40] K. Bloch and G. R. Arce, “Time-frequency analysis of proteinsequence data,” in Proc. IEEE-EURASIP Workshop on Non-linear Signal and Image Processing (NSIP ’01), Baltimore, Md,USA, June 2001.

    [41] J. Song, T. Ware, and S.-L. Liu, “Test of origin site (oriC) andterminus (terC) of replication by wavelet analysis in bacteria,”in Proc. Workshop on Genomic Signal Processing and Statistics(GENSIPS ’02), Raleigh, NC, USA, October 2002.

    [42] J. Makhoul, “Linear prediction: A tutorial review,” Proceedingsof the IEEE, vol. 63, no. 4, pp. 561–580, 1975.

    [43] J. D. Markel and A. H. Gray Jr., Linear Prediction of Speech,Springer-Verlag, NY, USA, 1976.

    [44] D. J. Sheskin, Handbook of Parametric and Nonparametric Sta-tistical Procedures, Chapman & Hall/CRC Press, Boca Raton,Fla, USA, 2nd edition, 2000.

    Niranjan Chakravarthy received the B.E.degree in electronics and communicationengineering from the Government Collegeof Engineering, Tamil Nadu, India, in 2001and the M.S. degree in electrical engineer-ing from the Arizona State University in2003. He is currently working towards thePh.D. degree with the Department of Elec-trical Engineering at Arizona State Univer-sity. He is a Research Assistant at the DigitalSignal Processing and Brain Dynamics Laboratories and currentlypursues research on the prediction and control of epileptic seizuresand genomic signal processing. His research interests include digi-tal signal processing, time series modeling, and systems theory andapplications to physical and biological systems.

    A. Spanias is a Professor of electrical en-gineering at Fulton School of Engineering,Arizona State University. His research inter-ests are in adaptive signal processing andspeech processing. He received the 2003Teaching Award from the IEEE PhoenixSection for the development of J-DSP. Heis a member of the IEEE-CAS Society DSPTechnical Committee and has served as aMember in the Technical Committee onStatistical Signal and Array Processing of the IEEE Signal Process-ing Society (SPS). He has served as an Associate Editor of the IEEETransactions on Signal Processing, General Cochair of the 1999 In-ternational Conference on Acoustics Speech and Signal Processing(Phoenix), IEEE Signal Processing Vice President for Conferences,and Chair of the Conference Board. He served as a Member in theIEEE Signal Processing Executive Committee and as an AssociateEditor of IEEE Signal Processing Letters. He is currently servingas a Member in the IEEE SPS Publications Board, and Member-at-Large of the IEEE SPS Conference Board. He has been Chair ofthe Phoenix IEEE Communications and Signal Processing Chapter,and is a Member in Eta Kappa Nu and Sigma Xi. Andreas Spanias iscorecipient of the 2002 IEEE Donald G. FinkPaper Award, and wasrecently elected as a Fellow of the IEEE. He is appointed as 2004Distinguished Lecturer of the IEEE SPS.

  • 28 EURASIP Journal on Applied Signal Processing

    L. D. Iasemidis received the Diploma inelectrical and electronics engineering fromthe National Technical University of Athensin 1982, M.S. in Physics, M.S. and Ph.D.in biomedical engineering from the Univer-sity of Michigan, Ann Arbor, Mich in 1985,1986, and 1991, respectively. Dr. Iasemidisis currently an Associate Professor of Bio-engineering at the Arizona State University,Tempe, Ariz, and Director and Founder ofthe ASU Brain Dynamics Laboratory. Dr. Iasemidis is recognizedas an expert in dynamics of epileptic seizures, and his researchand publications have stimulated an international interest in theprediction and control of epileptic seizures, and understanding ofthe mechanisms of epileptogenesis. He is currently on the EditorialBoard of Epilepsia and IEEE Transactions on Biomedical Engineer-ing, and is a Reviewer of NIH. He has reviewed articles for morethan 10 scientific journals. His research interests are in the areas ofbiomedical and genomic signal processing, complex systems theoryand nonlinear dynamics, neurophysiology, monitoring and analy-sis of the electrical and magnetic activity of the brain in epilepsyand other brain dynamical disorders, intervention and control ofthe CNS, neuroplasticity, rehabilitation, and neuroprosthesis. Dr.Iasemidis’ research has been funded by NIH, VA, DARPA and theWhitaker Foundation.

    K. Tsakalis received his Ph.D. degree inelectrical engineering from the University ofSouthern California. He is currently a Pro-fessor of electrical engineering at ArizonaState University. His interests are in robustadaptive control, time varying systems, ap-plications of control, identification, and op-timization in semiconductor manufactur-ing problems, and, more recently, the ap-plication of adaptive systems theory on theprediction and control of epileptic seizures.

  • EURASIP JOURNAL ON APPLIED SIGNAL PROCESSING

    Special Issue on

    Advanced Signal Processing and ComputationalIntelligence Techniques for Power Line Communications

    Call for PapersIn recent years, increased demand for fast Internet access andnew multimedia services, the development of new and fea-sible signal processing techniques associated with faster andlow-cost digital signal processors, as well as the deregulationof the telecommunications market have placed major em-phasis on the value of investigating hostile media, such aspowerline (PL) channels for high-rate data transmissions.

    Nowadays, some companies are offering powerline com-munications (PLC) modems with mean and peak bit-ratesaround 100 Mbps and 200 Mbps, respectively. However,advanced broadband powerline communications (BPLC)modems will surpass this performance. For accomplishing it,some special schemes or solutions for coping with the follow-ing issues should be addressed: (i) considerable differencesbetween powerline network topologies; (ii) hostile propertiesof PL channels, such as attenuation proportional to high fre-quencies and long distances, high-power impulse noise oc-currences, time-varying behavior, and strong inter-symbolinterference (ISI) effects; (iv) electromagnetic compatibilitywith other well-established communication systems work-ing in the same spectrum, (v) climatic conditions in differ-ent parts of the world; (vii) reliability and QoS guarantee forvideo and voice transmissions; and (vi) different demandsand needs from developed, developing, and poor countries.

    These issues can lead to exciting research frontiers withvery promising results if signal processing, digital commu-nication, and computational intelligence techniques are ef-fectively and efficiently combined.

    The goal of this special issue is to introduce signal process-ing, digital communication, and computational intelligencetools either individually or in combined form for advancingreliable and powerful future generations of powerline com-munication solutions that can be suited with for applicationsin developed, developing, and poor countries.

    Topics of interest include (but are not limited to)

    • Multicarrier, spread spectrum, and single carrier tech-niques

    • Channel modeling

    • Channel coding and equalization techniques• Multiuser detection and multiple access techniques• Synchronization techniques• Impulse noise cancellation techniques• FPGA, ASIC, and DSP implementation issues of PLC

    modems• Error resilience, error concealment, and Joint source-

    channel design methods for video transmissionthrough PL channels

    Authors should follow the EURASIP JASP manuscript for-mat described at the journal site http://asp.hindawi.com/.Prospective authors should submit an electronic copy of theircomplete manuscripts through the EURASIP JASP man-uscript tracking system at http://www.hindawi.com/mts/, ac-cording to the following timetable:

    Manuscript Due October 1, 2006

    Acceptance Notification January 1, 2007

    Final Manuscript Due April 1, 2007

    Publication Date 3rd Quarter, 2007

    GUEST EDITORS:

    Moisés Vidal Ribeiro, Federal University of Juiz de Fora,Brazil; [email protected]

    Lutz Lampe, University of British Columbia, Canada;[email protected]

    Sanjit K. Mitra, University of California, Santa Barbara,USA; [email protected]

    Klaus Dostert, University of Karlsruhe, Germany;[email protected]

    Halid Hrasnica, Dresden University of Technology, Ger-many [email protected]

    Hindawi Publishing Corporationhttp://asp.hindawi.com

  • EURASIP JOURNAL ON APPLIED SIGNAL PROCESSING

    Special Issue on

    Numerical Linear Algebra in Signal ProcessingApplications

    Call for PapersThe cross-fertilization between numerical linear algebra anddigital signal processing has been very fruitful in the lastdecades. The interaction between them has been growing,leading to many new algorithms.

    Numerical linear algebra tools, such as eigenvalue and sin-gular value decomposition and their higher-extension, leastsquares, total least squares, recursive least squares, regulariza-tion, orthogonality, and projections, are the kernels of pow-erful and numerically robust algorithms.

    The goal of this special issue is to present new efficient andreliable numerical linear algebra tools for signal processingapplications. Areas and topics of interest for this special issueinclude (but are not limited to):

    • Singular value and eigenvalue decompositions, in-cluding applications.

    • Fourier, Toeplitz, Cauchy, Vandermonde and semi-separable matrices, including special algorithms andarchitectures.

    • Recursive least squares in digital signal processing.• Updating and downdating techniques in linear alge-

    bra and signal processing.• Stability and sensitivity analysis of special recursive

    least-squares problems.• Numerical linear algebra in:

    • Biomedical signal processing applications.• Adaptive filters.• Remote sensing.• Acoustic echo cancellation.• Blind signal separation and multiuser detection.• Multidimensional harmonic retrieval and direc-

    tion-of-arrival estimation.• Applications in wireless communications.• Applications in pattern analysis and statistical

    modeling.• Sensor array processing.

    Authors should follow the EURASIP JASP manuscriptformat described at http://www.hindawi.com/journals/asp/.Prospective authors should submit an electronic copy oftheir complete manuscript through the EURASIP JASP man-uscript tracking system at http://www.hindawi.com/mts/, ac-cording to the following timetable:

    Manuscript Due October 1, 2006

    Acceptance Notification February 1, 2007

    Final Manuscript Due May 1, 2007

    Publication Date 3rd Quarter, 2007

    GUEST EDITORS:

    Shivkumar Chandrasekaran, Department of Electricaland Computer Engineering, University of California, SantaBarbara, USA; [email protected]

    Gene H. Golub, Department of Computer Science, StanfordUniversity, USA; [email protected]

    Nicola Mastronardi, Istituto per le Applicazioni del Calcolo“Mauro Picone,” Consiglio Nazionale delle Ricerche, Bari,Italy; [email protected]

    Marc Moonen, Department of Electrical Engineering,Katholieke Universiteit Leuven, Belgium;[email protected]

    Paul Van Dooren, Department of Mathematical Engineer-ing, Catholic University of Louvain, Belgium;[email protected]

    Sabine Van Huffel, Department of Electrical Engineering,Katholieke Universiteit Leuven, Belgium;[email protected]

    Hindawi Publishing Corporationhttp://www.hindawi.com

  • EURASIP JOURNAL ON APPLIED SIGNAL PROCESSING

    Special Issue on

    Human-Activity Analysis in Multimedia Data

    Call for PapersMany important applications of multimedia revolve aroundthe detection of humans and the interpretation of human be-havior, for example, surveillance and intrusion detection, au-tomatic analysis of sports videos, broadcasts, movies, ambi-ent assisted living applications, video conferencing applica-tions, and so forth. Success in this task requires the integra-tion of various data modalities including video, audio, andassociated text, and a host of methods from the field of ma-chine learning. Additionally, the computational efficiency ofthe resulting algorithms is critical since the amount of data tobe processed in videos is typically large and real-time systemsare required for practical implementations.

    Recently, there have been several special issues on the hu-man detection and human-activity analysis in video. Theemphasis has been on the use of video data only. This specialissue is concerned with contributions that rely on the use ofmultimedia information, that is, audio, video, and, if avail-able, the associated text information.

    Papers on the following and related topics are solicited:

    • Video characterization, classification, and semanticannotation using both audio and video, and text (ifavailable).

    • Video indexing and retrieval using multimedia infor-mation.

    • Segmentation of broadcast and sport videos based onaudio and video.

    • Detection of speaker turns and speaker clustering inbroadcast video.

    • Separation of speech and music/jingles in broadcastvideos by taking advantage of multimedia informa-tion.

    • Video conferencing applications taking advantage ofboth audio and video.

    • Human mood detection, and classification of interac-tivity in duplexed multimedia signals as in conversa-tions.

    • Human computer interaction, ubiquitous computingusing multimedia.

    • Intelligent audio-video surveillance and other securi-ty-related applications.

    Authors should follow the EURASIP JASPmanuscript format described at the journal site bellowhttp://www.hindawi.com/GetJournal.aspx?journal=ASP.Prospective authors should submit an electronic copyof their complete manuscript through the EURASIPJASP manuscript tracking system at the following sitehttp://www.hindawi.com/mts/, according to the followingtimetable:

    Manuscript Due February 1, 2007

    Acceptance Notification June 1, 2007

    Final Manuscript Due October 1, 2007

    Publication Date 1st Quarter, 2008

    GUEST EDITORS:

    A. Enis Cetin, Department of Electrical and Electron-ics Engineering, Bilkent University, Ankara 06800, Turkey;[email protected]

    Eric Pauwels, Signals and Images Research Group, Centrefor Mathematics and Computer Science (CWI), 1098 SJ Am-sterdam, The Netherlands; [email protected]

    Ovidio Salvetti, Institute of Information Science and Tech-nologies (ISTI), Italian National Research Council (CNR),56124 Pisa, Italy; [email protected]

    Hindawi Publishing Corporationhttp://www.hindawi.com

  • EURASIP JOURNAL ON APPLIED SIGNAL PROCESSING

    Special Issue on

    Advanced Signal Processing and Pattern RecognitionMethods for Biometrics

    Call for PapersBiometric identification has established itself as a very im-portant research area primarily due to the pronounced needfor more reliable and secure authentication architectures inseveral civilian and commercial applications. The recent in-tegration of biometrics in large-scale authentication systemssuch as border control operations has further underscoredthe importance of conducting systematic research in biomet-rics. Despite the tremendous progress made over the past fewyears, biometric systems still have to reckon with a numberof problems, which illustrate the importance of developingnew biometric processing algorithms as well as the consid-eration of novel data acquisition techniques. Undoubtedly,the simultaneous use of several biometrics would improvethe accuracy of an identification system. For example the useof palmprints can boost the performance of hand geome-try systems. Therefore, the development of biometric fusionschemes is an important area of study. Topics related to thecorrelation between biometric traits, diversity measures forcomparing multiple algorithms, incorporation of multiplequality measures, and so forth need to be studied in more de-tail in the context of multibiometrics systems. Issues relatedto the individuality of traits and the scalability of biometricsystems also require further research. The possibility of us-ing biometric information to generate cryptographic keys isalso an emerging area of study. Thus, there is a definite needfor advanced signal processing, computer vision, and patternrecognition techniques to bring the current biometric sys-tems to maturity and allow for their large-scale deployment.

    This special issue aims to focus on emerging biometrictechnologies and comprehensively cover their system, pro-cessing, and application aspects. Submitted articles must nothave been previously published and must not be currentlysubmitted for publication elsewhere. Topics of interest in-clude, but are not limited to, the following:

    • Fusion of biometrics• Analysis of facial/iris/palm/fingerprint/hand images• Unobtrusive capturing and extraction of biometric

    information from images/video• Biometric identification systems based on

    face/iris/palm/fingerprint/voice/gait/signature

    • Emerging biometrics: ear, teeth, ground reactionforce, ECG, retina, skin, DNA

    • Biometric systems based on 3D information• User-specific parameterization• Biometric individuality• Biometric cryptosystems• Quality measure of biometrics data• Sensor interoperability• Performance evaluation and statistical analysis

    Authors should follow the EURASIP JASP manuscriptformat described at http://www.hindawi.com/journals/asp/.Prospective authors should submit an electronic copy oftheir complete manuscript through the EURASIP JASP man-uscript tracking system at http://www.hindawi.com/mts/, ac-cording to the following timetable:

    Manuscript Due May 1, 2007

    Acceptance Notification September 1, 2007

    Final Manuscript Due December 1, 2007

    Publication Date 1st Quarter, 2008

    GUEST EDITORS:

    Nikolaos V. Boulgouris, Department of ElectronicEngineering, Division of Engineering, King’s CollegeLondon, London WC2R 2LS, UK;[email protected]

    Juwei Lu, EPSON Edge, EPSON Canada Ltd., Toronto,Ontario M1W 3Z5, Canada; [email protected]

    Konstantinos N. Plataniotis, The Edward S. Rogers Sr.Department of Electrical and Computer Engineering,University of Toronto, Toronto, Ontario, Canada, M5S 3G4;[email protected]

    Arun Ross, Lane Department of Computer Science &Electrical Engineering, West Virginia University,Morgantown WV, 26506, USA; [email protected]

    Hindawi Publishing Corporationhttp://www.hindawi.com

  • EURASIP JOURNAL ON BIOINFORMATICS AND SYSTEMS BIOLOGY

    Special Issue on

    Information Theoretic Methods for Bioinformatics

    Call for Papers

    Information theoretic methods for modeling are at the cen-ter of the current efforts to interpret bioinformatics data.The high pace at which new technologies are developed forcollecting genomic and proteomic data requires a sustainedeffort to provide powerful methods for modeling the dataacquired. Recent advances in universal modeling and mini-mum description length techniques have been shown to bewell suited for modeling and analyzing such data. This spe-cial issue calls for contributions to modeling of data arisingin bioinformatics and systems biology by information theo-retic means. Submissions should address theoretical develop-ments, computational aspects, or specific applications. Suit-able topics for this special issue include but are not limitedto:

    • Normalized maximum-likelihood (NML) universalmodels

    • Minimum description length (MDL) techniques• Microarray data modeling• Denoising of genomic data• Pattern recognition• Data compression-based modeling

    Authors should follow the EURASIP JBSB manuscriptformat described at http://www.hindawi.com/journals/bsb/.Prospective authors should submit an electronic copy of theircomplete manuscript through the EURASIP JBSB’s manu-script tracking system at http://www.hindawi.com/mts/, ac-cording to the following timetable.

    Manuscript Due February 1, 2007

    Acceptance Notification May 1, 2007

    Final Manuscript Due July 1, 2007

    Publication Date 3rd Quarter, 2007

    GUEST EDITORS:

    Jorma Rissanen, Computer Learning Research Center,University of London, Royal Holloway, TW20 0EX, UK;[email protected]

    Peter Grünwald, Centrum voor Wiskunde en Informatica(CWI), National Research Institute for Mathematics andComputer Science, P.O. Box 94079, 1090 GB Amsterdam,The Netherlands; [email protected]

    Jukka Heikkonen, Laboratory of ComputationalEngineering, Helsinki University of Technology, P.O. Box9203, 02015 HUT, Finland; [email protected]

    Petri Myllymäki, Department of Computer Science,University of Helsinki, P.O. Box 68 (Gustaf Hällströmin katu2b), 00014, Finland; [email protected]

    Teemu Roos, Complex Systems Computation Group,Helsinki Institute for Information Technology, University ofHelsinki, P.O.Box 68, 00014, Finland; [email protected]

    Juho Rousu, Department of Computer Science, Universityof Helsinki, P.O. Box 68 (Gustaf Hällströmin katu 2b),00014, Finland; [email protected]

    Hindawi Publishing Corporationhttp://www.hindawi.com

    file:[email protected]:[email protected]:[email protected]:[email protected]:[email protected]

  • IEEE ICME 2007 Call for Papers 2007 International Conference on Multimedia

    & Expo (ICME) July 2-5, 2007

    Beijing International Convention Center, Beijing, China

    Sponsored by: Circuits and Systems Society, Communications Society, Computer Society, and Signal Processing Society.

    General Co-Chairs

    Xinhua Zhuang, U. Missouri-Columbia, USA

    Wen Gao, Peking University, China

    Technical Program Chair

    Yun Q. Shi, NJIT, USA

    Technical Program Vice-Chairs

    Mark Liao (Circ. & Sys. Society) Acad. Sinica

    Yu-Hen Hu (Comm. Society) U. Wisconsin, USA

    Philip Sheu (Comp. Society) UC Irvine, USA Joern Ostermann (Sig. Pr. Soc.) LUH, Germany

    Conference Manager

    Hong Man, Stevens Inst. Tech., USA

    Special Session Chairs

    John Aa. Sorenson, ECC, Denmark

    Shipeng Li, Microsoft Research Asia, China

    Tutorial Chairs

    Ming-Ting Sun, University of Washington, USA

    Oscar Au, HKUST, China

    Finance Chairs

    Belle Tseng, NEC Lab America, USA Shiqiang Yang, Tsinghua University, China

    Publicity Chairs Bo Li, Beihang University, China

    Ed Wong, Brooklyn Poly. Univ., USA

    Registration Chair

    Yun He, Tsinghua University, China

    Hong Man, Stevens Inst. Tech., USA

    Electronic Media Chairs

    Zicheng Liu, Microsoft Research, USA

    Chiren Shyu, U. Missouri-Columbia, USA

    Publications Chairs

    Wenjun Zeng, U. Missouri-Columbia, USA

    Yuanchun Shi, Tsinghua University, China

    Demo-Exhibit Chairs

    Jian Lu, Vobile Inc., USA

    Feng Wu, Microsoft Research Asia, China

    Local Arrangement Chairs

    Hanqing Lu, IA of CAS, China

    Xilin Chen, ICT of CAS, China

    North America Liaison

    Heather Yu, Panasonic, USA

    Yong Rui, Microsoft, China

    Europe Liaison

    Murat Kunt, EPFL, Switzerland

    Jean-Luc Dugelay, EUROCOM, France

    IEEE International Conference on Multimedia & Expo is a major annual international conference with the objective of bringing together researchers, developers, and practitioners from academia and industry working in all areas of multimedia. ICME serves as a forum for the dissemination of state-of-

    the-art research, development, and implementations of multimedia systems, technologies and applications.

    ICME is co-sponsored by four IEEE societies including the Circuits and Systems Society, the

    Communications Society, the Computer Society, and the Signal Processing Society. The conference will

    feature world-class plenary speakers, exhibits, special sessions, tutorials, and paper presentations.

    Prospective authors are invited to submit a four-page paper in double-column format including authors'

    names, affiliations, and a short abstract. Only electronic submissions will be accepted. Topics include but

    are not limited to:

    • Audio, image, video processing

    • Virtual reality and 3-D imaging

    • Signal processing for media integration

    • Multimedia communications and networking

    • Multimedia security and content protection

    • Multimedia human-machine interface and interaction

    • Multimedia databases

    • Multimedia computing systems and appliances

    • Hardware and software for multimedia systems

    • Multimedia standards and related issues

    • Multimedia applications

    • Multimedia and social media on the Internet

    A number of awards will be presented to the Best Papers and Best Student Papers at the conference.

    Participation for special sessions and tutorial proposals are encouraged.

    SCHEDULE ! Special Session Proposals Due: December 1, 2006

    ! Tutorial Proposals Due: December 1, 2006

    ! Regular Paper Submissions Due: January 5, 2007

    ! Notification of Acceptance: March 19, 2007

    ! Camera-Ready Papers Due: April 16, 2007

    Check the conference website http://www.icme2007.org for updates.

    International Advisory Board

    Sadaoki Furu i, Tokyo Inst. Tech., Japan (Chair)

    Ming Liou, HKUST, China (Co-Chair)

    Peter Pirsch, LUH, Germany (Co-Chair)

    Jan Biemond, Delft Univ. Tech., Netherlands

    Shih-Fu Chang, Columbia Univ., USA

    Rama Chellappa, University of Maryland, USA

    Chang-Wen Chen, Florida Inst. Tech., USA

    Liang-Gee Chen, National Taiwan University

    Robert M. Haralick, City Univ. of New York, USA

    T. S. Huang, UIUC, USA

    Anil Jain, Michigan State University, USA

    Ramesh Jain, UC Irvine, USA

    Chung-Sheng Li, IBM Watson Research, USA

    Xing-Gang Lin, Tsinghua Univ., China

    K. J. Ray Liu, University of Maryland, USA

    Songde Ma, Ministry of Science and Technology, China

    Timothy K. Shih, Tamkang University

    T. Sikora, Technical Univ. Berlin, Germany

    Ming-Ting Sun, Univ. Washington, USA

    Qi Tian, Institute for Inforcomm Research, Singapore

    B. W. Wah, UIUC, USA

    Hong-Jiang Zhang, Microsoft, China

    Ya-Qin Zhang, Microsoft, China