In-silico analysis of simple and imperfect microsatellites in diverse tobamovirus genomes



Gene 530 (2013) 193–200

Contents lists available at ScienceDirect

Gene

journal homepage: www.elsevier.com/locate/gene

In-silico analysis of simple and imperfect microsatellites in diverse tobamovirus genomes

Chaudhary Mashhood Alam a, Avadhesh Kumar Singh b, Choudhary Sharfuddin a, Safdar Ali b,⁎

a Department of Botany, Patna University, Bihar 800005, India
b Department of Biomedical Sciences, SRCASW, University of Delhi, Vasundhara Enclave, New Delhi 110096, India

Abbreviations: SSR, Simple sequence repeat; cSSR, Compound simple sequence repeat; IMEx, Imperfect Microsatellite Extraction; RD, Relative density; RA, Relative abundance; CP, Coat protein; MP, Movement protein.

⁎ Corresponding author at: Department of Biomedical Sciences, SRCASW, University of Delhi, New Delhi, 110096, India. Tel.: +91 11 22623503; fax: +91 11 22623504.

E-mail addresses: [email protected], [email protected] (S. Ali).

0378-1119/$ – see front matter © 2013 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.gene.2013.08.046

Article info

Article history: Accepted 13 August 2013; Available online 25 August 2013

Keywords: Tobamovirus; Simple sequence repeats; IMEx; dMAX; Correlation studies

Abstract

An in-silico analysis of simple sequence repeats (SSRs) in 30 species of tobamoviruses was done. SSRs (mono to hexa) were present with variant frequency across species. Compound microsatellites, primarily of variant motifs, accounted for up to 11.43% of the SSRs. Motif duplications were observed for A, T, AT, and ACA repeats. (AG)–(TC) was the most prevalent SSR-couple. SSRs were differentially localized in the coding region with ~54% on the 128 kDa protein while 20.37% was exclusive to 186 kDa protein. Characterization of such variations is important for elucidating the origin, sequence variations, and structure of these widely used, but incompletely understood sequences.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

Tobamovirus, with 33 species, accounts for 64% of the species in the family Virgoviridae (King et al., 2012). These crop-damaging pathogens infect cucumber, tobacco, tomato as well as wild plant species and are easily spread by contact. Each tobamovirus genome contains a single-stranded positive-sense RNA of ~7000 nucleotides, terminal untranslated regions and, between them, a single open reading frame (ORF) that is translated into a polyprotein which is cleaved post-translation into at least four proteins by virus-encoded proteinases. Though diverse at the nucleotide sequence level, the tobamoviruses possess similar genome organization. There are four proteins in tobamoviruses, namely the 128 kDa, 186 kDa, movement and coat proteins, whose sizes vary slightly across species. Moreover, the encoding sequence for the 128 kDa protein has a read-through stop codon for expression of the 186 kDa protein.

Simple sequence repeats (SSRs), also called micro- or minisatellites, are tandem repetitions of relatively short DNA motifs. Their presence in animal viruses like Hepatitis C virus (HCV) (Chen et al., 2009) and Human cytomegalovirus (HCMV) (Picone et al., 2005) confirms their existence beyond prokaryotes and eukaryotes (Mrazek et al., 2007; Tóth et al., 2000), contrary to what was previously believed. The repeat number, length, and motif size influence microsatellite mutability. For instance, an increased number of repeats leads to higher mutability (Pearson et al., 2005). Microsatellite instability due to strand slippage and unequal recombination leads to insertions/deletions of one or several repeat units. Owing to their high instability, microsatellites are vital in genome evolution (Deback et al., 2009) and are the predominant source of genetic diversity (Kashi and King, 2006). Variable length of microsatellites affects local DNA structure or the encoded proteins (Mrazek et al., 2007). Their role in gene regulation, transcription and protein function has been elucidated in a few cases (Kashi and King, 2006; Usdin, 2008). Genome features such as size and GC content influence the occurrence of microsatellites (Coenye and Vandamme, 2005; Dieringer and Schlotterer, 2003) and the polymorphism therein (Kelkar et al., 2008). However, this correlation is not universal, and therefore no single rule can be framed for predicting their occurrence and density.

Based on the presence of interruptions, microsatellites may be interrupted, pure, compound, interrupted compound, complex or interrupted complex (Chambers and MacAvoy, 2000). The present study primarily focuses on pure and compound microsatellites (two or more microsatellites adjacent to each other). Their presence has been reported in diverse taxa across viruses, prokaryotes and eukaryotes (Alam et al., 2013; Chen et al., 2012; Gur-Arie et al., 2000; Kofler et al., 2008). Interestingly, microsatellites are more abundant in coding regions than in non-coding regions in eukaryotes (Metzgar et al., 2000; Tóth et al., 2000) and in some prokaryotes (Gur-Arie et al., 2000; Li et al., 2004), possibly due to increased selection in coding regions (Ellegren, 2004; Karaoglu et al., 2005). Microsatellite accumulation in the coding regions of viral genomes is due to high coding density (Chen et al., 2009; George et al., 2012).

Furthermore, compound microsatellites constitute ~10% of SSRs in the human genome (Weber, 1990), including highly polymorphic compound repeats such as (dC-dA)n(dG-dT)n (Bull et al., 1999). Eukaryotic genomes such as Homo sapiens, Macaca mulatta, Mus musculus and Rattus norvegicus have 4–25% compound microsatellites (Kofler et al., 2008). Further, 22 complete Escherichia coli genomes had a frequency of 1.75–2.85%, while 81 HIV type-1 genomes had 24.24% (Chen et al., 2012) and 45 potyvirus genomes had 15.15% compound microsatellites (Alam et al., 2013), indicating significant variation across genomes. An exhaustive study of the diversification in satellite sequences would provide insight into the imperfections and evolution of microsatellites.

Viral microsatellites have the potential for generating genomic diversity and phenotypic changes (Li et al., 2004). Although microsatellites have been studied with respect to origin, distribution and evolution (Archak et al., 2007), their presence and possible functional significance in plant viruses have been recognized only recently (George et al., 2012; Xiangyan et al., 2011). Here, we systematically analyzed the occurrence, size, and density of different microsatellites in the highly divergent tobamoviruses, a possible model for understanding functional aspects, evolutionary relationships, and adaptation to divergent hosts.

2. Materials and methods

2.1. Genome sequences

Complete genome sequences of 30 tobamovirus species were accessed from NCBI (http://www.ncbi.nlm.nih.gov/) and analyzed for simple and compound microsatellites. The studied genomes ranged from 4683 nt (Acc No- U47034) to 6794 nt (Acc No- DQ356949), as summarized in Table 1.

2.2. Microsatellite identification and investigation

The microsatellite search using IMEx software (Mudunuri and Nagarajaram, 2007) with parameters for eukaryotes and E. coli genomes did not yield any results owing to the relatively smaller size of tobamovirus genomes. Subsequently, the search was performed employing the ‘Advance-Mode’ of IMEx with parameters as reported for HIV (Chen et al., 2012): Type of Repeat: perfect; Repeat Size: all; Minimum Repeat Number: 6, 3, 3, 3, 3, 3 (for mono- to hexa-nucleotide motifs, respectively); maximum distance allowed between any two SSRs (dMAX): 10. Other parameters were set as default. Compound microsatellites were not standardized in order to determine real composition.

Table 1
Overview of simple microsatellites in complete tobamovirus genome sequences.

S. no   Name                                       Accession number
T1      Bell pepper mottle virus                   DQ355023
T2      Brugmansia mild mottle virus               AM398436
T3      Cactus mild mottle virus                   EU043335
T4      Clitoria yellow mottle virus               JN566124
T5      Cucumber fruit mottle mosaic virus         AF321057
T6      Cucumber green mottle mosaic virus         DQ767631
T7      Cucumber mottle virus                      AB261167
T8      Frangipani mosaic virus                    HM026454
T9      Hibiscus latent Fort Pierce virus          FJ196834
T10     Hibiscus latent Singapore virus            AF395898
T11     Kyuri green mottle mosaic virus            AB015145
T12     Maracuja mosaic virus                      DQ356949
T13     Obuda pepper virus                         L11665
T14     Odontoglossum ringspot virus               X82130
T15     Paprika mild mottle virus                  AB089381
T16     Passion fruit mosaic virus                 HQ389540
T17     Pepper mild mottle virus                   M81413
T18     Rattail cactus necrosis-associated virus   JF729471
T19     Rehmannia mosaic virus                     JX575184
T20     Ribgrass mosaic virus                      JQ319720
T21     Sammons's Opuntia virus                    AY366209
T22     Streptocarpus flower break virus           AM040955
T23     Sunn-hemp mosaic virus                     U47034
T24     Tobacco mild green mosaic virus            JX534224
T25     Tobacco mosaic virus                       JX993906
T26     Tomato mosaic virus                        AJ243571
T27     Turnip vein-clearing virus                 JN205074
T28     Wasabi mottle virus                        AB017504
T29     Youcai mosaic virus                        AB261175
T30     Zucchini green mottle mosaic virus         AJ252189
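The search criteria of Section 2.2 (perfect repeats, minimum repeat numbers of 6, 3, 3, 3, 3, 3 for mono- to hexa-nucleotide motifs) can be illustrated with a minimal Python sketch. This is not the IMEx algorithm; it is a simple left-to-right scanner written for illustration, and the names `MIN_REPEATS`, `is_primitive` and `find_perfect_ssrs` are hypothetical.

```python
# Minimum repeat numbers for motif sizes 1..6, taken from Section 2.2.
MIN_REPEATS = {1: 6, 2: 3, 3: 3, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """True if the motif is not a repetition of a shorter unit (e.g. 'ATAT' is not primitive)."""
    n = len(motif)
    return not any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))


def find_perfect_ssrs(seq, min_repeats=MIN_REPEATS):
    """Scan left to right and return (start, end, motif, repeats) tuples.

    Coordinates are 0-based and end-exclusive. Assumption: at each position the
    shortest qualifying motif wins and the scan resumes after the repeat, so
    reported SSRs never overlap. IMEx itself may resolve such cases differently.
    """
    seq = seq.upper()
    hits = []
    i = 0
    while i < len(seq):
        found = None
        for size, min_rep in sorted(min_repeats.items()):
            motif = seq[i:i + size]
            if len(motif) < size or set(motif) - set("ACGTU") or not is_primitive(motif):
                continue
            count = 1
            while seq[i + count * size:i + (count + 1) * size] == motif:
                count += 1
            if count >= min_rep:
                found = (i, i + count * size, motif, count)
                break
        if found:
            hits.append(found)
            i = found[1]
        else:
            i += 1
    return hits
```

For example, `find_perfect_ssrs("GGAAAAAACCTGTGTGCC")` reports one mononucleotide (A)6 and one dinucleotide (TG)3 repeat, whereas a run of only five A residues falls below the mononucleotide threshold and is ignored.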

2.3. Statistical analysis

We used Microsoft Office Excel 2007 for the statistical analysis. Linear regression was used for correlation studies.
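The regression itself was run in Excel; as an illustration only, an ordinary least-squares fit and its coefficient of determination (R²) can be sketched in Python as follows. The function name `linear_regression` is hypothetical, and the sketch does not compute P values.

```python
def linear_regression(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept.

    Returns (slope, intercept, r_squared), where r_squared is the squared
    Pearson correlation between x and y.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2 points)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("both variables must vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared
```

In the setting of this study, `x` would hold genome sizes or GC contents and `y` the SSR counts, relative abundances or relative densities of the 30 genomes.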

3. Results

3.1. Number, relative abundance and relative density of SSRs and cSSRs

Genome-wide scan of tobamoviruses revealed 765 SSRs and 39 cSSRs distributed unevenly (Tables 1–2, Supplementary Tables 1–2). The incidence of SSRs ranged from a minimum of 11 (T15) to a maximum of 36 (T12) (Table 1, Supplementary Table S1, Fig. 1A). Their relative density varied from 10.96 bp/kb (T15) to 37.28 bp/kb (T18), whereas relative abundance ranged between 1.65 (T15) and 5.51 SSRs/kb (T18) (Table 1, Figs. 1B–C).

The percentage of individual microsatellites being part of a cSSR (cSSRs-%) varied from 0 (T19) to 11.43% (T18) with each having 35 SSRs (Fig. 2A). Interestingly, 7 tobamoviruses lacked cSSRs (relative abundance and relative density of zero) whereas a maximum of 4 cSSRs was present in T18 (Table 2 and Supplementary Table S2). Moreover, the highest relative density and relative abundance were 10.65 bp/kb in T26 and 0.63 cSSRs/kb in T18, respectively (Table 2, Figs. 2B–C).
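The quantities reported in Section 3.1 follow directly from their definitions (relative abundance: repeats per kb; relative density: repeat length in bp per kb). A minimal Python sketch is given below; the function names are hypothetical, and the total SSR length of 207 bp used in the usage note is back-calculated from the tabulated relative density of T1 rather than taken from the source. The tabulated cSSR-% values appear to equal the number of cSSRs divided by the number of SSRs (for example 4/35 = 11.43% for T18), so the sketch uses that ratio.

```python
def relative_abundance(n_repeats, genome_length_nt):
    """Number of repeats per kb of sequence analyzed."""
    return n_repeats / (genome_length_nt / 1000.0)


def relative_density(total_repeat_length_bp, genome_length_nt):
    """Total length (bp) covered by repeats per kb of sequence analyzed."""
    return total_repeat_length_bp / (genome_length_nt / 1000.0)


def cssr_percent(n_cssrs, n_ssrs):
    """cSSR-% as it appears to be tabulated: cSSR count over SSR count, in percent."""
    if n_ssrs == 0:
        return 0.0
    return 100.0 * n_cssrs / n_ssrs
```

For T1 (32 SSRs, 3 cSSRs, 6375 nt), `relative_abundance(32, 6375)` rounds to 5.02 SSRs/kb and `relative_abundance(3, 6375)` to 0.47 cSSRs/kb, matching the tabulated values; `relative_density(207, 6375)` rounds to 32.47 bp/kb under the back-calculated length assumption.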

3.2. dMAX and cSSRs

dMAX is the maximum distance between two adjacent microsatellites; if the distance separating two microsatellites is ≤dMAX, these microsatellites are classified as a cSSR (Kofler et al., 2008). For IMEx, the dMAX value can be between 0 and 50 (Mudunuri and Nagarajaram, 2007). T1, T6, T12, T18, T24 and T30 (randomly selected) revealed an increase in cSSRs-% with higher dMAX in all genomes except T1 (Fig. 3).

Table 1 (continued)
Genome size, GC content, SSR count, relative abundance (RA) and relative density (RD) of the genomes, listed in the same order as the serial numbers T1–T30.

S. no   Genome size (nt)   GC content (%)   SSR   RA      RD
T1      6375               42.93            32    5.02    32.47
T2      6381               43.33            25    3.92    25.54
T3      6449               42.89            24    3.57    22.48
T4      6514               43.17            23    3.53    23.03
T5      6562               43.61            22    3.35    23.93
T6      6424               43.14            20    3.11    22.10
T7      6485               43.76            27    4.16    28.37
T8      6643               39.32            26    3.91    25.89
T9      4941               40.42            23    4.65    31.57
T10     6474               41.80            25    3.86    36.61
T11     6515               45.07            18    2.76    17.34
T12     6794               44.95            36    5.30    33.71
T13     6506               41.39            28    4.30    28.13
T14     6618               39.51            18    2.72    17.98
T15     6524               49.87            35    3.75    25.21
T16     6791               41.13            27    1.69    11.19
T17     6357               45.06            35    5.15    34.90
T18     6506               42.30            26    4.25    27.69
T19     6395               44.68            25    5.38    35.97
T20     6311               43.39            25    4.07    26.90
T21     6663               43.88            11    3.96    26.14
T22     6271               41.65            27    4.31    29.34
T23     4683               44.69            13    2.78    20.29
T24     6356               41.11            21    3.30    22.50
T25     6394               43.37            33    5.16    33.94
T26     6383               41.75            32    5.01    34.31
T27     6311               44.00            28    4.44    31.37
T28     6297               43.48            26    4.13    28.59
T29     6304               43.40            24    3.81    26.49
T30     6513               44.11            22    3.38    22.26

Table 2
Overview of compound microsatellites in tobamovirus whole genome sequences.

S. no   SSR   cSSR(a)   RA(c)   RD(b)   cSSR-%(d)   di(e)   tri(e)
T1      32    3         0.47    8.16    9.38        1       0
T2      25    2         0.31    6.11    8.00        1       0
T3      24    1         0.16    2.48    4.35        1       0
T4      23    1         0.15    2.00    4.35        1       0
T5      22    0         0.00    0.00    0.00        0       0
T6      20    0         0.00    0.00    0.00        0       0
T7      27    0         0.00    0.00    0.00        0       0
T8      26    1         0.15    1.96    3.85        1       0
T9      23    2         0.40    5.87    8.70        1       0
T10     25    0         0.00    0.00    0.00        0       0
T11     18    1         0.15    2.00    5.56        1       0
T12     36    2         0.29    5.15    5.56        1       0
T13     28    2         0.31    6.76    7.14        1       0
T14     18    1         0.15    3.78    5.56        1       0
T15     11    0         0.00    0.00    0.00        0       0
T16     35    1         0.15    4.29    2.86        1       0
T17     27    1         0.15    2.95    3.70        1       0
T18     35    4         0.63    10.54   11.43       1       0
T19     26    0         0.00    0.00    0.00        0       0
T20     25    2         0.31    4.07    8.00        1       0
T21     25    0         0.00    0.00    0.00        0       0
T22     27    2         0.32    6.05    7.41        1       0
T23     13    1         0.21    4.06    7.69        1       0
T24     21    1         0.16    2.83    4.76        1       0
T25     33    2         0.31    3.60    6.06        1       0
T26     32    3         0.47    10.65   9.38        0       1
T27     28    2         0.32    7.45    7.14        0       1
T28     26    1         0.16    2.70    3.85        1       0
T29     24    1         0.16    2.70    4.17        1       0
T30     22    1         0.15    2.30    4.55        1       0

(a) Number of compound microsatellites.
(b) Relative density is defined as the total length (bp) contributed by each compound microsatellite per kb of sequence analyzed.
(c) Relative abundance: number of compound microsatellites present per kb of the genome.
(d) cSSRs-% is the percentage of individual microsatellites being part of a compound microsatellite.
(e) Compound microsatellite complexity (number of individual microsatellites in a compound microsatellite).
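The dMAX rule of Section 3.2 can be sketched as a simple grouping step over position-sorted SSRs. The Python below is an illustration under stated assumptions, not the IMEx implementation: SSRs are given as 0-based, end-exclusive (start, end) pairs, the gap is counted as the number of bases between two neighbours, and the names `group_cssrs` and `percent_in_cssrs` are hypothetical.

```python
def group_cssrs(ssrs, dmax=10):
    """Group SSRs whose gap to the previous SSR is <= dmax into compound SSRs.

    `ssrs` is an iterable of (start, end) pairs (0-based, end-exclusive).
    Returns a list of groups; only groups with two or more SSRs are kept.
    """
    groups = []
    current = []
    for ssr in sorted(ssrs):
        if current and ssr[0] - current[-1][1] <= dmax:
            current.append(ssr)
        else:
            if len(current) > 1:
                groups.append(current)
            current = [ssr]
    if len(current) > 1:
        groups.append(current)
    return groups


def percent_in_cssrs(ssrs, dmax=10):
    """Percentage of individual SSRs that fall inside a compound SSR."""
    ssrs = list(ssrs)
    if not ssrs:
        return 0.0
    in_cssr = sum(len(group) for group in group_cssrs(ssrs, dmax))
    return 100.0 * in_cssr / len(ssrs)
```

With four hypothetical SSRs at (2, 8), (10, 16), (100, 106) and (130, 136), a dMAX of 10 yields one cSSR (gap of 2 bases), whereas a dMAX of 30 also joins the last two (gap of 24 bases), illustrating why cSSR-% tends to rise with dMAX.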

3.3. Correlation studies

We tested for correlation between genome size/GC content and number/relative abundance/relative density of SSRs and cSSRs. Incidence of SSRs is non-significantly correlated (R² = 0.089, P > 0.01) with genome size but significantly correlated with GC content (R² = 0.07, P < 0.05). However, relative density (R² = 0.07, P > 0.01) and relative abundance (R² = 0.002, P > 0.01) were non-significantly correlated with genome size but significantly correlated with GC content (R² = 0.13, P < 0.05; and R² = 0.10, P < 0.05, respectively). Similarly, the regression analysis of cSSR for cSSR-% (R² = 0.12, P > 0.01), relative density (R² = 0.04, P > 0.05) and relative abundance (R² = 0.06, P > 0.05) shows non-significant correlation with genome size. GC content is significantly correlated with cSSR-% (R² = 0.01, P < 0.05) and relative density (R² = 0.01, P < 0.05), but for relative abundance (R² = 0.09, P < 0.01) the correlation is weak but significant.

3.4. Motifs types in analyzed genomes

The incidence of mononucleotide repeats exhibited an overall prevalence of poly (A/T) repeats (87.7%) and also a higher frequency of poly (A/T) repeats in individual genomes (Supplementary Table S1, Fig. 4), possibly due to (A/T) rich genomes (Chambers and MacAvoy, 2000). However, (A/T) content being only slightly higher than G/C content in the analyzed sequences suggests a weak influence on the occurrence of mononucleotide repeats. A maximum of 79 mononucleotide A repeats was observed in Hibiscus latent Singapore virus (T10).

The present study showed variant incidence of di-nucleotide repeats: AG/GA, GT/TG, AC/CA, CT/TC, AT/TA, and CG/GC (Supplementary Table S1). The most prevalent GT/TG repeats were ~9 times more abundant than the least represented CG/GC. Among the tri-nucleotide repeats, which were the third most abundant SSR class, ACA (threonine) had the highest frequency, followed by CAA/AAC coding for glutamine/asparagine, respectively (Fig. 4, Supplementary Table S1). Tetra- and hexa-nucleotide repeat motifs AATT (T20), AGTA (T27) and ACATTT (T23) were also present, whereas there were no penta-nucleotide repeats.
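Di-nucleotide motifs are reported above in pairs such as AG/GA or GT/TG because both members describe the same tandem repeat read from a shifted start. One way to tally such classes is to map every motif to its smallest cyclic rotation, as in the hedged Python sketch below (function names are hypothetical). Note that this grouping would also merge ACA with CAA/AAC, which the paper reports separately by codon, so it is suited to the di-nucleotide counts only.

```python
from collections import Counter


def rotation_class(motif):
    """Return the alphabetically smallest cyclic rotation of a motif (e.g. 'GA' -> 'AG')."""
    motif = motif.upper()
    return min(motif[i:] + motif[:i] for i in range(len(motif)))


def tally_motif_classes(motifs):
    """Count motifs after collapsing cyclic rotations into one class."""
    return Counter(rotation_class(m) for m in motifs)
```

For instance, `tally_motif_classes(["GT", "TG", "AG", "GA", "TG"])` counts three members of the GT/TG class and two of the AG/GA class.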

3.5. SSRs/cSSRs in coding regions

The incidence of SSRs and cSSRs in non-coding regions is very low. About 54.5% of SSRs were localized on the 128 kDa protein while 20.37% were exclusive to the 186 kDa protein (Fig. 5). The frequency of mono-, di- and tri-nucleotide repeats was highest in the 128 kDa protein (Fig. 6). The variation and possible functionality of SSRs on the 128 kDa and 186 kDa proteins are currently unknown. Moreover, 59% of cSSRs were localized on the 128 kDa protein while 18% were exclusive to the 186 kDa protein (Fig. 7).

3.6. Motif complexity in analysed genomes

Motifs of the form [m1]n-xn-[m2]n are termed ‘SSR-couples’ of motif m1–m2. The extracted microsatellites, when analyzed for SSR-couples, revealed (A)-x-(A) and (AG)-x-(TC) to be the most prevalent (each present thrice). Interestingly, 86% of SSR-couple motifs are specific to a particular species (Table 3). Self-complementary motifs were also found, such as (A)-x-(T), (AC)-x-(TG), (AG)-x-(TC) and (CA)-x-(TG)-x-(CTG). Motif duplication, i.e. the same motif on both ends of the spacer sequence, for example (CA)n-(X)y-(CA)z, was observed in cSSRs of 6 tobamoviruses. The most common duplication motifs were the mononucleotides A and T, and AT, followed by ACA (Table 3).
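The duplication (D) and complementary (C) labels of SSR-couples can be reproduced with a short rule. The Python sketch below is an assumption inferred from the examples in the text, not a rule stated by the authors: a couple is labelled D when both motifs are identical, and C when the second motif equals the base-wise complement of the first, read in either direction. Function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def classify_couple(m1, m2):
    """Label an SSR-couple m1-x-m2 as 'D' (duplication), 'C' (complementary) or '-'."""
    m1, m2 = m1.upper(), m2.upper()
    if m1 == m2:
        return "D"
    comp = m1.translate(_COMPLEMENT)
    # Accept the complement read in either direction, e.g. AG/TC and CA/TG.
    if m2 == comp or m2 == comp[::-1]:
        return "C"
    return "-"


def classify_cssr(motifs):
    """Label a compound SSR by the first adjacent motif pair that is D or C."""
    for left, right in zip(motifs, motifs[1:]):
        label = classify_couple(left, right)
        if label != "-":
            return label
    return "-"
```

Under this rule `classify_couple("AG", "TC")` and `classify_couple("CA", "TG")` return `"C"`, `classify_couple("ACA", "ACA")` returns `"D"`, and `classify_couple("GT", "TG")` returns `"-"`.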

4. Discussion

With thousands of prokaryotic genomes sequenced so far and the fact that a typical bacterial genome can be sequenced in less than a day (Flicek and Birney, 2009; Lagesen et al., 2010; Reeves et al., 2009), the onus is now on extracting the maximum possible information from these sequences. One of the primary tasks therein is to discern the distribution of the various types of sequences present in the genome: coding and non-coding, simple and compound microsatellites and other satellite sequences, to name a few. Amongst these, simple sequence repeats or microsatellites have evoked great interest owing to their increasingly established presence across different species in both coding and non-coding regions. The repeat sequences may either be present as part of the coding regions or be in the vicinity of genes (Jasinska and Krzyzosiak, 2004). These repeats are known to be responsible for regulation of gene expression at the transcriptional and translational level or for gene silencing (Li et al., 2004; Rocha et al., 2002; Vergnaud and Denoeud, 2000). The presence of SSRs in genomes assumes clinical and evolutionary significance as well, since variations in copy number due to expansion and contraction of these repeats are established risk factors for Huntington's disease, myotonic dystrophy and several other genetic diseases (Borstnik and Pumpernik, 2002; Di Prospero and Fischbeck, 2005; Dushlaine et al., 2005; Richards, 2001; Sutherland and Richards, 1995). Moreover, their presence and variation have been studied in reference to genome evolution in 257 viruses, revealing a strong correlation between type of repeat, genome size and host adaptability (Zhao et al., 2012). Thus, owing to their ubiquitous presence and instability, SSRs play a positive role in adaptive evolution both within and across species.


Fig. 1. Analysis of SSRs (A) Distribution of SSRs; (B) Relative density: Total length covered by SSR per kb of genome; (C) Relative abundance: SSRs present per kb of genome.

Here we look into the presence, abundance, and composition of SSRs in 30 tobamovirus genomes. The SSR incidence is proportional to genome size, with the tobamoviruses having 11–36 SSRs, lower than potyviruses (23–45 SSRs) (Xiangyan et al., 2011) or human immunodeficiency virus isolates (22–48 SSRs) (Chen et al., 2009) but higher than geminiviruses (4–19 SSRs) (George et al., 2012). The sequence composition of repeats determines the abundance of microsatellites. In tobamoviruses, GT/TG repeats predominate whereas GC/CG repeats are rare, in contrast to geminivirus, human immunodeficiency virus (Karaoglu et al., 2005; Kim et al., 2008; Morgante et al., 2002) and some eukaryotes (Hong, 2007), in which the AT/TA repeat was most prevalent. Notably, no significant correlation was observed between genome size and relative density/relative abundance of microsatellites, in agreement with observations in E. coli and HIV-1.

Fig. 2. Analysis of cSSRs (A) Analysis of cSSR-%: Percentage of individual microsatellites being part of a compound microsatellite; (B) Relative density of cSSR: Total length covered by cSSR per kb of genome; (C) Relative abundance of cSSR: Number of cSSR present per kb of genome.

Though the significance of cSSRs in tobamovirus is unclear, they are reportedly involved in regulation of gene expression and at the functional level of proteins (Chen et al., 2011). The cSSR-% increased with higher dMAX, though not linearly, in all but one of the 6 analyzed species. Approximately 95% of cSSRs comprised two motifs only. The largest cSSR was composed of 3 motifs, compared to prokaryotes (4 motifs) and eukaryotes (>8 motifs). The taxonomy of tobamoviruses shows no comparable congruence with host taxonomy, and species from the same lineage may have quite unrelated hosts (Gibbs and Ohshima, 2010), suggesting viral microsatellites to be organism specific rather than host specific. Accordingly, cSSRs from viruses infecting a common host do not possess similar numbers/motifs. Surprisingly, 7 tobamovirus species did not possess any cSSR. However, for the tobamovirus species in which cSSRs are present, large numbers of strains and isolates have been identified, the possible significance of which is yet to be elucidated.

Fig. 3. Frequency of cSSR-% (percentage of individual microsatellites being part of a compound microsatellite) in relation to varying dMAX (10 to 50) across six randomly selected tobamovirus species.

Fig. 5. Differential distribution of SSRs (%) in coding/non-coding regions of tobamovirus.

The majority of the SSRs and compound microsatellites were found in coding regions, suggesting that compound microsatellites limit the expansion of perfect microsatellites in the coding region, which would otherwise have deleterious consequences. SSRs present in protein coding regions are known to be associated with functions such as social behavior in voles, sporulation efficiency and cell adhesion in yeast, skeletal morphology in domestic dogs, and adaptive divergence in barley and wheat populations (Kashi and King, 2006).

Genetic RNA recombination is vital for emergence of new viral strains/species (Tan et al., 2005). In RNA viruses, homologous recombination based on copy choice and template switching strongly depends on composition and RNA secondary structures (Nagy and Bujarski, 1996). Studies in potyvirus indicate that recombination occurred in particular hot-spots (Xiaojun et al., 2009). Though the functional role of tandem repeats in viruses remains to be elucidated, they are reportedly recombination hot-spots (Jeffreys et al., 1998). We thereby hypothesize that viral SSRs are involved in recombination, replication, and repair mechanisms, leading to sequence diversity that drives adaptation. The diversity of microsatellites in tobamovirus genomes may be a model for understanding genetic diversity, evolutionary biology, and strain/species demarcation.

Fig. 4. Average distribution of (A) Mono- or di-nucleotide repeat motifs and (B) Tri-nucleotide repeat motifs.

5. Conclusion

The tobamovirus species exhibited a highly divergent incidence of polymorphic microsatellites. All the studied tobamovirus genomes had SSRs/cSSRs mostly associated with 128 kDa/186 kDa proteins. Though the recombination signal in the tobamovirus genome is yet to be deciphered, the extracted repeats may form the basis for the search.

Supplementary data to this article can be found online at http://dx.doi.org/10.1016/j.gene.2013.08.046.

Conflict of interests

The authors declare that they have no conflicts of personal, communication or financial interests.


Fig. 6. Distribution of mono-, di- and tri-nucleotide SSR motifs (%) across coding/non-coding regions of tobamovirus.

Fig. 7. Presence of cSSR(%) in coding/non-coding regions of tobamovirus.

Table 3
Characterization of SSR-couples and distribution of complementary and duplication motifs in tobamovirus genomes.

Motif                 Number   Genome serial number   Duplication (D)/complementary (C) motif
(AG)-x-(TC)           3        T20, T25 and T26       C
(A)-x-(A)             3        T8, T14 and T18        D
(A)-x-(T)             2        T9 and T9              C
(ACA)-x-(ACA)         2        T13 and T16            D
(ACA)-x-(CA)          2        T21 and T24            –
(CAA)-x-(ACA)         2        T1 and T26             –
(GT)-x-(T)            2        T21 and T29            –
(T)-x-(T)             2        T2 and T13             D
(TG)-x-(T)            2        T27 and T28            –
(ACA)-x-(AG)          1        T17                    –
(AC)-x-(GA)           1        T2                     –
(AC)-x-(TG)           1        T20                    C
(AG)-x-(G)            1        T25                    –
(AG)-x-(GA)           1        T12                    –
(CA)-x-(TG)-x-(CTG)   1        T27                    C
(CG)-x-(TC)           1        T11                    –
(CT)-x-(AT)           1        T1                     –
(CT)-x-(CA)           1        T3                     –
(GA)-x-(GT)           1        T18                    –
(GC)-x-(AT)-x-(AT)    1        T26                    D
(GT)-x-(AGA)          1        T12                    –
(GT)-x-(GA)           1        T4                     –
(GT)-x-(TA)           1        T22                    –
(GT)-x-(TG)           1        T30                    –
(T)-x-(AC)            1        T1                     –
(TAG)-x-(T)           1        T18                    –
(TG)-x-(GA)           1        T18                    –
(AT)-x-(TC)           1        T23                    –

Acknowledgments

We thank the Department of Botany, Patna University, and Department of Biomedical Sciences, Shaheed Rajguru College of Applied Sciences for Women (SRCASW), University of Delhi for the financial and infrastructural supports provided.

References

Alam, C.M., George, B., Sharfuddin, C., Jain, S.K., Chakraborty, S., 2013. Occurrence and analysis of imperfect microsatellites in diverse potyvirus genomes. Gene 521 (2), 238–244 (1).

Archak, S., Meduri, E., Kumar, P.S., Nagaraju, J., 2007. InSatDb: a microsatellite database of fully sequenced insect genomes. Nucleic Acids Res. 35, D36–D39.

Borstnik, B., Pumpernik, D., 2002. Tandem repeats in protein coding regions of primate genes. Genome Res. 12, 909–915.

Bull, L.N., Pabon-Pena, C.R., Freimer, N.B., 1999. Compound microsatellite repeats: practical and theoretical features. Genome Res. 9, 830–838.

Chambers, G.K., MacAvoy, E.S., 2000. Microsatellites: consensus and controversy. Comp. Biochem. Physiol. Biochem. Mol. Biol. 126, 455–476.


Chen, M., et al., 2009. Similar distribution of simple sequence repeats in diverse completed human immunodeficiency virus type 1 genomes. FEBS Lett. 583, 2959–2963.

Chen, M., et al., 2011. Compound microsatellites in complete Escherichia coli genomes. FEBS Lett. 585, 1072–1076.

Chen, M., Tan, Z., Zeng, G., Zhuotong, Z., 2012. Differential distribution of compound microsatellites in various human immunodeficiency virus type 1 complete genomes. Infect. Genet. Evol. 12, 1452–1457.

Coenye, T., Vandamme, P., 2005. Characterization of mononucleotide repeats in sequenced prokaryotic genomes. DNA Res. 12, 221–233.

Deback, C., et al., 2009. Utilization of microsatellite polymorphism for differentiating herpes simplex virus type 1 strains. J. Clin. Microbiol. 47, 533–540.

Di Prospero, N.A., Fischbeck, K.A., 2005. Therapeutic development for triplet repeat expansion diseases. Nat. Rev. Genet. 6, 756–765.

Dieringer, D., Schlotterer, C., 2003. Two distinct modes of microsatellite mutation processes: evidence from the complete genomic sequences of nine species. Genome Res. 13, 2242–2251.

Dushlaine, C.T.O., Edwards, R.J., Park, S.D., Shields, D.C., 2005. Tandem repeat copy number variation in protein-coding regions of the human genes. Genome Biol. 6 (8), R69.

Ellegren, H., 2004. Microsatellites: simple sequences with complex evolution. Nat. Rev. Genet. 5, 435–445.

Flicek, P., Birney, E., 2009. Sense from sequence reads: methods for alignment and assembly. Nat. Methods 6 (Suppl. 11), S6–S12.

George, B., Mashhood, A.C., Jain, S.K., Sharfuddin, C., Chakraborty, S., 2012. Differential distribution and occurrence of simple sequence repeats in diverse geminivirus genomes. Virus Genes 45, 556–566.

Gibbs, A.J., Ohshima, K., 2010. Potyviruses and the digital revolution. Annu. Rev. Phytopathol. 48, 205–223.

Gur-Arie, R., Cohen, C.J., Eitan, Y., 2000. Simple sequence repeats in Escherichia coli: abundance, distribution, composition, and polymorphism. Genome Res. 10, 62–71.

Hong, C., 2007. Genomic distribution of simple sequence repeats in Brassica rapa. Mol. Cells 3, 349–356.

Jasinska, A., Krzyzosiak, W.J., 2004. Repetitive sequences that shape the human transcriptome. FEBS Lett. 567, 136–141.

Jeffreys, J., Murray, J., Neumann, R., 1998. High-resolution mapping of crossovers in human sperm defines a minisatellite-associated recombination hotspot. Mol. Cells 2, 267–273.

Karaoglu, H., Lee, C.M., Meyer, W., 2005. Survey of simple sequence repeats in completed fungal genomes. Mol. Biol. Evol. 22, 639–649.

Kashi, Y., King, D.G., 2006. Simple sequence repeats as advantageous mutators in evolution. Trends Genet. 22, 253–259.

Kelkar, Y.D., Tyekucheva, S., Chiaromonte, F., Makova, K.D., 2008. The genome-wide determinants of human and chimpanzee microsatellite evolution. Genome Res. 18, 30–38.

Kim, T.S., et al., 2008. Simple sequence repeats in Neurospora crassa: distribution, polymorphism and evolutionary inference. BMC Genomics 9, 31.

King, A.M.Q., Adams, M.J., Carstens, E.B., Lefkowitz, E.J., 2012. Virus taxonomy: classification and nomenclature of viruses. Ninth Report of the International Committee on Taxonomy of Viruses. Elsevier, San Diego.

Kofler, R., Schlotterer, C., Luschutzky, E., Lelley, T., 2008. Survey of microsatellite clustering in eight fully sequenced species sheds light on the origin of compound microsatellites. BMC Genomics 9, 612.

Lagesen, K., Ussery, D.W., Wassenaar, T.M., 2010. Genome update: the 1000th genome – a cautionary tale. Microbiology 156, 603–608.

Li, Y.C., Korol, A.B., Fahima, T., Nevo, E., 2004. Microsatellites within genes: structure, function, and evolution. Mol. Biol. Evol. 21, 991–1007.

Metzgar, D., Bytof, J., Wills, C., 2000. Selection against frameshift mutations limits microsatellite expansion in coding DNA. Genome Res. 10, 72–80.

Morgante, M., Hanafey, M., Powell, W., 2002. Microsatellites are preferentially associated with nonrepetitive DNA in plant genomes. Nat. Genet. 30, 194–200.

Mrazek, J., Guo, X., Shah, A., 2007. Simple sequence repeats in prokaryotic genomes. Proc. Natl. Acad. Sci. U. S. A. 104, 8472–8477.

Mudunuri, S.B., Nagarajaram, H.A., 2007. IMEx: imperfect microsatellite extractor. Bioinformatics 23, 1181–1187.

Nagy, P.D., Bujarski, J.J., 1996. Homologous RNA recombination in brome mosaic virus: AU-rich sequences decrease the accuracy of crossovers. J. Virol. 70, 415–426.

Pearson, C.E., Nichol Edamura, K., Cleary, J.D., 2005. Repeat instability: mechanisms of dynamic mutations. Nat. Rev. Genet. 6, 729–742.

Picone, O., Ville, Y., Costa, J.M., Rouzioux, C., Leruez-Ville, M., 2005. Human cytomegalovirus (HCMV) short tandem repeats analysis in congenital infection. J. Clin. Virol. 32, 254–256.

Reeves, G.A., Talavera, D., Thornton, J.M., 2009. Genome and proteome annotation: organization, interpretation and integration. J. R. Soc. Interface 6, 129–147.

Richards, R.I., 2001. Dynamic mutations: a decade of unstable expanded repeats in human genetic disease. Hum. Mol. Genet. 10, 2187–2194.

Rocha, E.P., Matic, I., Taddei, F., 2002. Over-representation of repeats in stress response genes: a strategy to increase versatility under stressful conditions? Nucleic Acids Res. 30, 1886–1894.

Sutherland, G.R., Richards, R.I., 1995. Simple tandem repeats and human genetic disease. Proc. Natl. Acad. Sci. U. S. A. 92, 3636–3641.

Tan, Z., Gibbs, A.J., Tomitaka, Y., Sanchez, F., Ponz, F., Ohshima, K., 2005. Mutations in turnip mosaic virus genomes that have adapted to Raphanus sativus. J. Gen. Virol. 86, 501–510.

Tóth, G., Gáspári, Z., Jurka, J., 2000. Microsatellites in different eukaryotic genomes: survey and analysis. Genome Res. 10, 967–981.

Usdin, K., 2008. The biological effects of simple tandem repeats: lessons from the repeat expansion diseases. Genome Res. 18, 1011–1019.

Vergnaud, G., Denoeud, F., 2000. Minisatellites: mutability and genome architecture. Genome Res. 10, 899–907.

Weber, J.L., 1990. Informativeness of human (dC-dA)n. (dG-dT)n polymorphisms. Genomics 7, 524–530.

Xiangyan, Z., et al., 2011. Microsatellites in different potyvirus genomes: survey and analysis. Gene 488, 52–56.

Xiaojun, Hu., Alexander, V.K., Celeste, J.B., Jim, H.L., 2009. Sequence characteristics of potato virus Y recombinants. J. Gen. Virol. 90, 3033–3041.

Zhao, X., et al., 2012. Coevolution between simple sequence repeats (SSRs) and virus genome size. BMC Genomics 13, 435.