Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode...

58
ACTA UNIVERSITATIS UPSALIENSIS UPPSALA 2009 Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 589 Computational Approaches to the Identification and Characterization of Non-Coding RNA Genes PONTUS LARSSON ISSN 1651-6214 ISBN 978-91-554-7386-0 urn:nbn:se:uu:diva-9518

Transcript of Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode...

Page 1: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

ACTAUNIVERSITATIS

UPSALIENSISUPPSALA

2009

Digital Comprehensive Summaries of Uppsala Dissertationsfrom the Faculty of Science and Technology 589

Computational Approaches to theIdentification and Characterizationof Non-Coding RNA Genes

PONTUS LARSSON

ISSN 1651-6214ISBN 978-91-554-7386-0urn:nbn:se:uu:diva-9518

Page 2: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

���������� �������� �� ������ �������� � �� �������� ������� � ���� ������������������ ��������� �� ������� ������� ����� �!� �!!" �� �!#!! $� �%� ������ $ ����$ &%����%�' (%� �������� )��� �� ������� � *����%'

��������

+����� &' �!!"' ���������� ,�����%�� � �%� -����$����� �� �%��������.��� $/0���� 1/, 2���' ,��� ����������� ���������' ������� ��� � ���� ����� �� ������� ���� ������� �� �� ������� � ��� �� ��� � ������ 34"' 35 ��' ������'-6�/ "540"�033705�480!'

/0���� 1/,� 9�1/,�: %��� ������� �� %��%�� ������� �� �)��$�� ��� ������� � �%������ �%� ���� $ ������������ ����� $�� ������.�� �������� �������� � ��� ������������� ��� ����� ���%����� � ���� %��%�� �����$�� ��������� $ ��� ��������' ($���� �������� �%� $������ ����$����� $ �1/,�� �� �� $ �������� �������� � �����$��� �%��������.� �%� ��������� $ �1/,� � �%� ����' &���������� ����� ����0)��� ������ �����$� �1/,� %�� �������� ����� ������ $ ��������� �1/,� �� $�� �����$����������0�����$�� �1/, $������� $ ��) $����' 1���� �����; ��������� �%��%0�%���%��� ��<����� ���%�<��� ����������� �$$����� �� �������� ���%�� $����������� �����$����� �� ����� $ ����' , ��=� ��� � �%� )�� �������� �%���%���� %�� ��� � ������ �� ��� ���������� ��� $� �%� �����$����� ���%��������.��� $ �1/, ����'

>� ���� ���������� ������%�� � ������� )��% ����������� ���%�� � ����� �%��1/, ��������� $ �%� ���� ������ ������� ��� ������ �' >� ����� �1/, ���������� � )���0�%��������.�� ��� $������� �� )��� �� ��������� ��) �� ����������������0�����$�� �1/, $�������' (%� ���������� ���� $ � �� �1/, ��� ��������)�� �������$���� ��������� �� �������� � ���%� $� �������� �������0����� ����������� ���� �������0����� ������� ���� �� �������� ��������� �����������'

>� ��� ����� � ���������� %���������� ��� %��� ���������� �1/,�' /��%����� ������� �� ��/, ����� �� )��� �� ���$�������� ������� $ �������� ������������������ ����� �������� � ����� ����� $ ��������� �1/,�' - ����������� � �1/,������� )��% ������� �������� ����������� �%�� ���� ��������� %��� �������� �$$���� ������ ���� ������� )��� �����$���'

- ������� )� %��� �� ���� ���������� ������%�� ������ )��% ������������������ �����$��� � ���% �� ������� �1/, ��������� � �%� ��������� �� ������ � ��� ���� ��' (%� ��������� ��������� ��� �%� �1/,� � �� ���� �� �������� �$������ �������� � ������� $ 0������ ���� �� �������� $ ��������1/, �������'

� ������ �1/,� �1/,� �� ������ ����� ���������� �������� ������������� ���������������� ������� ����

���� !�����" � ���� �� � � �� ��� #� ����� $����" $% &'(" ������� ���� �����"�)*+&,-. �������" �� � �

? &��� +���� �!!"

-66/ �83�08��7-6�/ "540"�033705�480!��#�#��#��#����0"3�4 9%���#@@��'��'��@������A��B��#�#��#��#����0"3�4:

Page 3: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

Till mina föräldrar

Page 4: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through
Page 5: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

List of Publications

This thesis is based on the following papers which will be referred to in the

text by their Roman numerals:

I. †Aspegren, A., †Hinas, A., Larsson, P., Larsson, A. and Söderbom F.,

2004. Novel non-coding RNAs in Dictyostelium discoideum and their

expression during development. Nucleic Acids Research, 32(15),

4646-56.

II. †Hinas, A., †Larsson P., Avesson, L., Kirsebom, L.A., Virtanen, A. and

Söderbom, F., 2006. Identification of the major spliceosomal RNAs in

Dictyostelium discoideum reveals developmentally regulated U2 vari-

ants and polyadenylated snRNAs. Eukaryotic Cell, 5(6), 924-34.

III. Larsson, P., Hinas, A., Ardell, D.H., Kirsebom, L.A., Virtanen, A. and

Söderbom, F., 2008. De novo search for non-coding RNA genes in the

AT-rich genome of Dictyostelium discoideum: performance of

Markov-dependent genome feature scoring. Genome Research, 18(6),

888-99.

IV. †Kyriakopoulou, C., †Larsson, P., †Liu, L., †Schuster, J., Söderbom, F.,

Kirsebom, L.A. and Virtanen A., 2006. U1-like snRNAs lacking com-

plementarity to canonical 5' splice sites. RNA, 12(9), 1603-11.

V. Larsson, P., Kirsebom, L.A. and Virtanen, A., 2008. Computational

identification and analysis of non-canonical snRNAs.

† These authors contributed equally

Reprints were made with permission from the publishers.

Page 6: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through
Page 7: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

Contents

Introduction ....................................................................................................9

Chemical properties of RNA .......................................................................9

Non-coding RNA genes .............................................................................11

History .....................................................................................................11

The ncRNA families .................................................................................12

Biogenesis ................................................................................................15

Structure ..................................................................................................15

Function ...................................................................................................16

Protein-coding genes .................................................................................16

Splicing ......................................................................................................17

Alternative splicing ...................................................................................19

Regulation of alternative splicing ..............................................................19

Alternative splicing and the proteome ........................................................20

Splicing and disease ..................................................................................21

Post-genomic era .......................................................................................21

Genome evolution ....................................................................................21

Comparative genomics .............................................................................22

Identification of functional units ...............................................................23

Experimentally based methods ..................................................................23

Computationally based methods ................................................................25

Present Investigations ...................................................................................29

Non-coding RNAs in Dictyostelium discoideum ......................................29

Experimentally based methods ..................................................................30

Computationally based methods ................................................................30

Analysis of D. discoideum ncRNAs ..........................................................34

Characterization of human spliceosomal snRNA variants ........................37

Biochemical identification and characterization of U1 snRNAs ..................37

Identification of expressed snRNAs using exon microarrays .......................38

Characterization of expressed snRNA variants ...........................................39

Conclusions and future perspectives ............................................................43

Acknowledgements ......................................................................................45

Swedish summary .........................................................................................47

References ....................................................................................................50

Page 8: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

Abbreviations

A adenine

AS alternative splicing

bp base pair

BP branch point

C cytosine

cDNA complementary DNA

CM covariance model

DNA deoxyribonucleic acid

DUSE Dictyostelium upstream sequence element

EST expressed sequence tag

FDR false discovery rate

G guanine

HMM hidden Markov model

miRNA micro RNA

mRNA messenger RNA

ncRNA non-coding RNA

nt nucleotide

ORF open reading frame

piRNA piwi-interacting RNA

PSE proximal sequence element

RISC RNA-induced silencing complex

RNA ribonucleic acid

RNase P ribonuclease P

RNP ribonucleoprotein

rRNA ribosomal RNA

scaRNA small cajal body-specific RNA

siRNA small interfering RNA

snoRNA small nucleolar RNA

snRNA small nuclear RNA

SRE splicing regulatory element

SRP signal recognition particle

SS splice site

T thymine

TMG tri-methylated guanosine

tRNA transfer RNA

U uracil

USE upstream sequence element

Page 9: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

Introduction

Non-protein coding RNA molecules (ncRNA), as the name implies, do not

encode proteins and are never translated. Instead, they are functional at the

level of RNA as e.g. regulators, structural scaffolds or catalytic factors. Al-

though a number of abundant ncRNA families have been well studied and

found to be crucial to core functions in cells, research advancements over the

last ten years have made it clear that the number and diversity of ncRNAs

have been massively underestimated. The structure, transcription and pro-

cessing of ncRNA genes are fundamentally different from protein-coding

genes which in part explain why ncRNA genes have been largely over-

looked. Other methods must often be employed to identify ncRNA genes

than can be used to identify protein-coding genes, in particular when it

comes to computational prediction of genes in sequence data. Computational

methods that are able to accurately and reliably identify ncRNA and protein-

coding genes in sequence data are becoming increasingly important as se-

quencing techniques are becoming cheaper, faster and capable of producing

enormous amounts of high-throughput sequence data.

In this thesis I will discuss mainly computational strategies to identify

ncRNA genes. I will present our results on ncRNA genes identified in the

model organism Dictyostelium discoideum using a combination of experi-

mental and computational strategies as well as an in-depth study of the het-

erogeneity among human spliceosomal RNA genes.

Chemical properties of RNA

Deoxyribonucleic acid (DNA) and ribonucleic acid (RNA) are highly similar

molecules. They are linear polymers built up by nucleotides (nt) which are

linked together. A nucleotide is a five-carbon sugar with an organic base and

a phosphate group attached at the 1' and 5' carbons, respectively (Figure 1A).

In the polymer, the 3' carbon of one nucleotide is linked to the 5' carbon of

the next nucleotide via a phosphodiester bond, giving the polymer a direc-

tionality. When writing out the sequence of a nucleic acid, a 5' to 3' direc-

tionality is implied unless specified otherwise. In DNA nucleotides, the sug-

ar is deoxyribose while in RNA it is ribose. The difference is that in ribose, a

hydrogen on the 2' carbon is replaced by a hydroxyl group. This subtle

9

Page 10: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

change has important consequences as it provides RNA with a chemically re-

active hydroxyl group that can take part in RNA-mediated catalysis. The hy-

droxyl group can also attack and break the phosphodiester bond between nu-

cleotides. This makes the RNA molecule unstable compared to DNA.

The bases attached to the sugar are adenine (A), guanine (G), cytosine (C),

thymine (T) or uracil (U). While A, C and G are common to DNA and RNA,

T is exclusive to DNA and is replaced by U in RNA. Through hydrogen

bond formation between the organic bases, inverted nucleotides are able to

pair with each other, forming so-called base pairs (bp). In the classical Wat-

son-Crick bp's, G pairs with C and A pairs with T or U. The Watson-Crick

pairs are said to be complementary. In RNA, G can also form a bp with U,

the so-called G-U wobble. The base pairing capabilities means that comple-

mentary nucleic acid strands can pair with each other with high specificity to

form a double-stranded structure (Figure 1B). Because of the steric proper-

ties of the linked nucleotides, the double-stranded nucleic acids wind around

each other to form a double helix with the paired bases meeting in the center

and the sugars forming a backbone on the outside (Figure 1C). It is worth-

while to mention that besides the classical two Watson-Crick bp's there are at

least 26 additional possible bp's that involve at least two hydrogen bonds.

An unpaired (single-stranded) nucleic acid can also fold back onto itself

and form base pairs between distant complementary nucleotides within the

same molecule. This gives the molecule a distinct structure which in the case

of RNA is often important for its function. It is often the case that a nu-

cleotide substitution that destroys a base pair is compensated by a second nu-

cleotide substitution that restores the base pairing between nucleotides again,

a so-called compensatory base pair change (Figure 1D). While the primary

structure of a nucleic acid describes the internal order of the nucleotides, the

secondary structure describes the internal order but also which nucleotides

are single-stranded and which nucleotides base-pair with each other.

10

Page 11: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

Non-coding RNA genes

History

A few discoveries of high significance mark the history of ncRNA research.

The ribosome was known to mainly consist of RNA already in the 40's

(Claude, 1943) and transfer RNA (tRNA) molecules functioning as the link

between the genetic nucleic acids and proteins were discovered in the late

50's (Hoagland et al., 1958). In the late seventies the concept of split genes

and RNA splicing was independently discovered by Phillip A. Sharp and

Richard Roberts (Berget et al., 1977; Chow et al., 1977). Subsequently, in

the early eighties, ribonucleoprotein (RNP) particles containing a set of

small nuclear RNAs (snRNA) were linked to splicing of messenger RNA

(mRNA) (Lerner et al., 1980; Yang et al., 1981; Padgett et al., 1983). At

about the same time, RNA was shown to have the ability to catalyze chemi-

11

Page 12: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

cal reactions, a property previously only credited to proteins. Thomas Cech

and co-workers found that an intron in the endogenous ribosomal RNA

(rRNA) from Tetrahymena thermophila was able to catalyze splicing of itself

in cis, in absence of proteins (Kruger et al., 1982; Cech, 1990). Shortly

thereafter, Sidney Altman and colleagues were able to show that the cleavage

of precursor tRNA molecules into mature tRNA is catalyzed by the RNA

component of ribonuclease P (RNase P) in trans (Guerrier-Takada et al.,

1983; Altman, 1990). Cech and Altman shared the Nobel prize in chemistry

in 1989 for the discovery of catalytic RNA and Sharp and Roberts were

awarded the Nobel prize in medicine 1993 for their discovery of split genes

in eukaryotes. The discovery that exogenous double-stranded RNA injected

into the nematode Caernohabditis elegans could specifically target and si-

lence gene expression through highly specific anti-sense interactions with

the target mRNA (Fire et al., 1998) together with the discovery that an abun-

dant family of endogenous small RNAs, depicted micro RNAs (miRNA),

could silence gene expression in a similar manner (Lee et al., 1993; Wight-

man et al., 1993; Reinhart et al., 2000; Lagos-Quintana et al., 2001; Lee and

Ambros, 2001; Lau et al., 2001), clearly demonstrated the potential and in-

volvement of ncRNAs in gene regulation. Consequently, the interest in ncR-

NAs exploded. For the discovery of gene silencing through small interfering

RNAs (siRNA), Fire and Mello received the 2006 Nobel prize in medicine.

In 2000, the large subunit of the ribosome from the bacterium Haloarculamarismortui was published at such high resolution that the long suspected

fact that the peptidyl transfer reaction is catalyzed by the rRNAs could be

concluded (Ban et al., 2000; Cech et al., 2006). In addition to these ground

breaking discoveries, many other functions have been linked to RNA over

the years.

The ncRNA families

The wealth and diversity of ncRNA families presently known makes a com-

plete listing practically impossible. Here I will just focus on a few of the

most abundant and widely occurring ncRNA families.

Ribosomal RNA (rRNA)The rRNAs are present in all kingdoms of life. Together with ~50-80 pro-

teins, they make up the ribosome where protein synthesis in the cell takes

place. It had long been suspected that the ribosome might actually be a ri-

bozyme, i.e. the peptidyltransferase reaction that forms the peptide bonds on

the growing peptide would be catalyzed by the rRNAs (Crick, 1968; Moore

and Steitz, 2006), but it wasn't until the solution of a high-resolution crystal

structure of the large subunit of the ribosome from the bacterium H. maris-mortui in 2000 that this suspicion could be confirmed (Ban, 2000).

12

Page 13: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

Transfer RNA (tRNA)tRNAs are present in all kingdoms of life and provide the link between the

nucleic acids of the genetic material and the amino acids of proteins. Each

tRNA folds into a highly conserved secondary structure which exposes a

three-nucleotide anticodon sequence that base pair with a complementary se-

quence on the mRNA known as a codon during protein synthesis (Figure

2A). An amino acid corresponding to the anticodon sequence, dictated by the

genetic code, is attached to the tRNA. Upon base pairing between the anti-

codon and the correct codon on the mRNA at the A site of the ribosome, the

peptidyltransferase reaction links the amino acid to the growing peptide

(Moore and Steitz 2006).

Spliceosomal Small Nuclear RNA (snRNA)The spliceosomal snRNAs are only present in eukaryotes and are involved in

the process of removing introns from pre-mRNA (splicing). Together with

protein factors they assemble into RNPs (snRNPs) and constitutes the core

of the spliceosome. During the splicing process, multiple interactions involv-

ing base pairing between the snRNAs and sequence motifs on the pre-mRNA

confer specificity to the reaction (Figure 2B) (Staley and Guthrie, 1998;

Burge et al., 1999; Tycowski et al., 2006). It is frequently speculated that the

catalytic steps involved in splicing are catalyzed by the snRNAs (Cech,

1986; Valadkhan, 2007).

Small nucleolar RNA (snoRNA)SnoRNAs associate with proteins to form RNP complexes (snoRNPs).

SnoRNPs are mainly involved in chemical modifications of a target RNA.

13

Page 14: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

The RNA component interacts with the target RNA through base pairing and

thereby brings the target nucleotide into the active site of the associated pro-

teins which catalyze the modification (Figure 2C). Based on sequence motifs

within the RNAs, the snoRNPs are divided into two classes. C/D-box

snoRNPs perform 2'-O-methylation of nucleotides while H/ACA-box

snoRNPs perform pseudouridylation of uridine nucleotides (Bachellerie et

al., 1995; Kiss-La�zló et al., 1996; Smith and Steitz, 1997; Tollervey and

Kiss, 1997; Decatur and Fournier, 2003). The exact function of the modifica-

tions is unclear but they are likely to increase stability within the molecule.

The main targets for snoRNPs are rRNA molecules. An snRNA molecule

closely related to snoRNA is the small cajal body-specific RNA (scaRNA).

ScaRNAs are structurally and functionally related to snoRNAs and assemble

into scaRNPs. Many scaRNAs are a hybrid between the two classes and con-

tain both C/D-box and H/ACA-box domains. ScaRNPs reside within cajal

bodies in the nucleus where they modify primarily snRNAs and snoRNAs

(Tycowski et al., 2006).

Micro RNA (miRNA)MiRNAs were first observed in the nematode Caernohabditis elegans where

they were found to regulate switching between developmental stages by

specifically down-regulating expression of key proteins by a mechanism de-

pendent on base pairing between miRNA and mRNA (Lee et al., 1993;

Wightman et al., 1993; Reinhart et al., 2000; Lagos-Quintana et al., 2001;

Lee and Ambros, 2001; Lau et al., 2001). MiRNAs are processed from

longer hairpin precursors into ~21-24 nt single stranded RNAs that are incor-

porated into a larger protein complex known as RNA-induced silencing

complex (RISC). The miRNAs are thought to have appeared somewhere

close to the base of the metazoan lineage (Grimson et al., 2008). Interesting-

ly, miRNAs are also present in plants although the structure, biogenesis and

target interaction are substantially different from the animal counterparts.

These differences together with the fact that no miRNAs have been detected

in fungi or other intervening lineages have led to the hypothesis that the

miRNA machineries in animals and plants have appeared independently

(Jones-Rhoades et al., 2006).

Other RNAsBeside the few ncRNA families mentioned above, numerous other examples,

many with regulatory functions, have been characterized (for review, see e.g.Costa, 2005; Hüttenhofer et al., 2005; Mattick and Makunin, 2006). These

include the already mentioned RNase P RNA, involved in tRNA processing

(Altman, 1990). The signal recognition particle (SRP) is a ribonucleoprotein

which include the SRP RNA. It is involved in directing secreted or mem-

brane bound proteins to the cellular membranes (Walter and Blobel, 1982;

Egea et al., 2005). Xist RNA is a large mRNA-like transcript with no pro-

14

Page 15: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

tein-coding potential that has been shown to initiate silencing of gene ex-

pression from the extra X chromosome in females (dosage compensation) by

associating with the chromosome (Clemson et al., 1996). The piwi-interact-

ing RNAs (piRNAs) are germ-line specific ~25-30 nucleotide RNAs that are

involved in defense against transposons (Aravin et al., 2007). In addition, a

large number of apparently species-specific (or limited to closely related

species) ncRNA families have been identified (see e.g. Ruby et al., 2006;

Pollard et al., 2006; Missal et al., 2006; Parrott and Mathews, 2007; Mourier

et al., 2008 for a few examples).

Biogenesis

The ncRNA genes do not have a general biogenesis pathway. Which RNA

polymerase that transcribes a ncRNA depends on the family, e.g. most

rRNAs are transcribed by RNA polymerase I, most snRNAs are transcribed

by RNA polymerase II and tRNAs are transcribed by RNA polymerase III.

While e.g. spliceosomal snRNAs are transcribed as independent units, this is

not true for all families of ncRNAs. Most rRNAs are transcribed as a contin-

uous precursor transcript that is cleaved and processed post-transcriptionally

(Fatica and Tollervey, 2002). Many snoRNAs and some miRNAs are encod-

ed within introns of protein-coding genes and are co-transcribed together

with the host gene and processed out of the spliced introns by exonucleases

(Kiss and Filipowicz, 1995; Kim, 2005). The Xist RNA is transcribed,

spliced and polyadenylated just like a normal protein-coding mRNA, yet it

lacks an open reading frame (ORF) and is never translated (Clemson et al.,

1996).

Structure

Many ncRNAs adopt an evolutionarily conserved secondary structure, indi-

cating that the structure is of critical importance for the function of the ncR-

NA. Most ncRNAs associate with RNA-binding proteins to form active RNP

particles . The recognition of the RNA by proteins is often dependent on the

structure or a combination of structure and short sequence motifs (Tycowski

et al. 2006, Cech et al., 2006). For this reason, unlike protein-coding genes

which are often under negative selection to preserve the reading frame of

ORFs, amino-acid sequence and three-dimensional fold of the protein prod-

uct, ncRNAs are free to diverge as long as relevant sequence motifs and sec-

ondary structures are preserved, possibly through compensatory base pair

changes (Figure 1D). Thus, the nucleotide sequence of ncRNA genes can

differ substantially even between relatively closely related organisms (Eddy,

2002a).

15

Page 16: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

Function

The high specificity with which an RNA molecule can interact with other

RNA molecules via base-pairing make them ideally suited for functions re-

quiring specific recognition of a particular target sequence. This opens up the

possibility of having a generic protein machinery with some function that as-

sociates with an interchangeable RNA molecule which guide the machinery

to a target RNA with high specificity. Most ncRNA genes, e.g. snoRNAs,

miRNAs and piRNAs incorporate into RNPs and function exactly like this

(Tycowski et al., 2006; Petersen et al., 2006; Aravin et al., 2007). This brings

an enormous flexibility to an organism as new specificities can rapidly

evolve without having to alter the protein machinery. A genomic duplication

of the ncRNA gene followed by divergence of the sequence bringing speci-

ficity would be sufficient (Eddy, 2001).

Protein-coding genes

The DNA in the nucleus contains the information needed for making all the

proteins an organism requires. When a specific protein product is required by

the cell, the piece of DNA encoding the instructions for making it is tran-

scribed into an RNA copy, known as messenger RNA (mRNA). In eukary-

otes, DNA encoding protein-coding genes are transcribed by RNA poly-

merase II (reviewed in e.g. Roeder, 2005; Kornberg, 2007). Beside the splic-

ing process, reviewed below, the mRNA is modified at the 5' and 3' ends be-

fore export into the cytoplasm and translation. The unprocessed mRNA

molecule is commonly referred to as pre-mRNA. The mRNA is “capped” at

the 5' end by the addition of a reversed guanosine nucleotide by a special

capping enzyme. The cap protects the mRNA from degradation from the 5'

end but it is also important for splicing, transport and translation. At the 3'

end, the pre-mRNA is cleaved at the poly(A)-site by a protein complex that

recognizes a number of cis-encoded elements within the mRNA. Subse-

quently, another protein complex, the poly(A)-polymerase, adds a number of

adenosine residues, typically 50-250, to the end of the pre-mRNA. This is

known as polyadenylation and the added poly(A)-tail is important for the

transport, translation and stability of the mRNA.

The amino-acid sequence of the protein is encoded in the open reading

frame of the mRNA in nucleotide triplets known as codons. The genetic code

dictates which codon corresponds to which amino acid. Since there are 43 =

64 possible codons but only 20 amino acids, several different codons can en-

code the same amino acid. Different codons that encode the same amino acid

are said to be synonymous and conversely, codons that encode different

amino acids are said to be non-synonymous. With very few exceptions, the

genetic code is universal and shared between all living organisms. A number

16

Page 17: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

of codons have special meanings: AUG encodes the amino acid methionine

as well as being the start codon that indicates where translation should start.

Less frequently, CUG, GUG or UUG are used as start codons. UAA, UAG

and UGA do not encode amino acids but are stop codons that will cause

translation to terminate when encountered. Thus, these special codons im-

pose restrictions on the ORF, i.e. the ORF begins with a start codon and no

stop codons are allowed within the ORF since that would cause a premature

termination of translation to occur. In addition, because the codons are read

in triplets in a fixed so-called reading frame, the length of the ORF must be a

multiple of three and the reading frame must not be shifted. If e.g. a muta-

tion in the DNA or a mistake in the transcription into RNA that inserts or

deletes a single nucleotide would occur, a frameshift could result which

would cause the ORF to be misread and the wrong amino acids to be incor-

porated into the polypeptide.

Unlike prokaryotes where the ORF is encoded as a continuous nucleotide

sequence in the DNA, eukaryotic ORFs are often interrupted by stretches of

nucleotides that should not be translated. These non-protein coding se-

quences are known as introns. The introns have to be excised from the tran-

script and the coding parts, the exons, have to be put together with high pre-

cision as the ORF must not be altered or frameshifted. This process is known

as splicing.

Splicing

In the splicing process, the introns are removed from the pre-mRNA and the

coding pieces, the exons, are joined. This occurs through a two-step transes-

terification reaction (Figure 3). In the first step, the 2'-hydroxyl of the adeno-

sine residue at the branch point (BP), located close to the 3' end of the intron,

attacks and breaks the phosphodiester bond at the 5' splice site (5' SS), be-

tween the last nucleotide of the exon and the first nucleotide of the intron.

This results in the transesterification of the 5' end of the intron to the BP

adenosine, creating a lariat intermediate and a free 3'-hydroxyl at the 5' exon.

In the second step, this 3'-hydroxyl attacks and breaks the phosphodiester

bond at the 3' splice site (3' SS), between the last nucleotide of the intron and

the first nucleotide of the downstream exon, followed by ligation of the 5'

exon to the 3' exon through a transesterification reaction and release of the

lariat intron (Burge et al., 1999). The nucleotide sequences around the 5' SS,

3' SS usually matches a well-defined consensus sequence motif (Mount,

1982). Although the exact sequence is somewhat variable, the two first nu-

cleotides of the intron are always GU, with very few exceptions. Similarly,

the last two nucleotides of the intron are almost always AG.

17

Page 18: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

The splicing reaction is catalyzed and coordinated by the spliceosome, a

large RNP complex composed of the five snRNAs U1, U2, U4, U5 and U6

together with more than 200 protein factors (Jurica and Moore, 2003; Will

and Lührmann, 2006). The snRNAs are assembled into ribonucleoprotein

particles (snRNPs) that are essential for the splicing reaction (Yang et al.,

1981; Padgett et al., 1983). The formation of the spliceosome on the pre-

mRNA is directed by the interaction of the U1 snRNP with the 5' SS through

base pairing to the U1 snRNA and U2 snRNP binding to the branch point

through interaction with protein factors and base pairing between the U2

snRNA and the BP sequence (Burge et al., 1999). The two-step transesterifi-

cation reaction of the splicing process is identical to the reaction by which

the group II self-splicing introns splice themselves. In addition, RNA-RNA

interactions within the spliceosome closely mimics base pairing interactions

within group II introns (Tycowski et al., 2006) and it has been speculated

that the spliceosome is in fact a ribozyme (Cech, 1986; Valadkhan et al.,

2007; Valadkhan, 2007).

In the shadow of this dominant spliceosome, known as the major spliceo-

some or U2-dependent spliceosome, there also exists a second spliceosome,

known as the minor spliceosome or U12-dependent spliceosome, which cat-

alyzes the removal of a distinct set of introns (Jackson, 1991; Hall and Pad-

gett, 1996; Tarn and Steitz, 1996). The minor spliceosome is present in most

18

Page 19: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

but not all eukaryotic organisms. Although the mechanisms of the splicing

reaction and a great number of protein factors are similar between the two

spliceosomes, the minor spliceosome contain a different set of snRNAs

known as U11, U12, U4atac and U6atac. The U5 snRNA is shared between

the spliceosomes (Patel and Steitz, 2003; Will and Lührmann, 2005). The in-

trons spliced by the major and minor spliceosomes are referred to as U2-type

and U12-type introns, respectively. It is estimated that approximately one in

300 introns is spliced by the minor spliceosome (Levine and Durbin, 2001).

Alternative splicing

Through alternative splicing (AS), the same pre-mRNA can be spliced dif-

ferently in a modular fashion, e.g. introns can be left in the pre-mRNA (in-

tron retention), exons can be spliced out (exon skipping) or alternative 5' and

3' splice sites can be utilized (Figure 4). This way, one transcription unit in

the genome can give rise to a large number of different protein products. The

most extreme example known is the Dscam gene in Drosophilamelanogaster (fruit fly) which is involved in neuronal wiring in the nervous

system. AS of four of the exons in the gene potentially gives rise to more

than 36,000 different protein products (Schmucker et al., 2000, Schmucker,

2007).

Regulation of alternative splicing

Observations of tissue-specific and developmentally regulated alternative

splicing events indicated that trans-acting factors were involved in regulat-

ing AS. It was hypothesized that the snRNPs could be such trans-acting fac-

tors and that variant snRNPs could produce different splicing products (Las-

ki et al., 1986; Padgett, 1986). This hypothesis was supported by reports of

developmentally regulated and heterogeneous snRNPs in Xenopus oocytes,

mouse, sea urchin and drosophila (Forbes et al., 1984; Lund et al., 1985;

Santiago and Marzluff, 1989; Lo and Mount, 1990). Later discoveries of cis-

acting elements, so-called splicing regulatory elements (SRE), which have

19

Page 20: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

been shown to be major regulators of splicing by recruiting various members

of the SR protein family, have provided a robust and comprehensive mecha-

nism for AS regulation (Graveley, 2000; Black, 2003; Matlin et al., 2005;

Wang and Burge, 2008). However, it is also known that the interaction be-

tween the U1 snRNA and the 5' SS influence splicing efficiency and that

changes in the strength of the interaction can shift splicing of exons towards

constitutive or alternative (Zhuang and Weiner, 1986; Zhuang et al., 1987;

Stamm et al., 1994; Sorek et al., 2004). A study of introns with non-canoni-

cal 5' SS having a GC-dinucleotide instead of a GU-dinucleotide at the first

positions of the intron, found that a large fraction of introns with a non-

canonical 5' SS are alternatively spliced and that non-canonical 5'SS are

overrepresented among alternatively spliced introns (Thanaraj and Clark,

2001). Thus, regulation of AS seems to depend on a delicate interplay be-

tween cis-acting elements in the pre-mRNA and their associated protein fac-

tors and snRNP complexes interacting with sequence motifs on the pre-

mRNA.

Alternative splicing and the proteome

It has been clear for a long time that the morphological complexity of an or-

ganism, measured e.g. as the number of distinct cell types in the body, is not

correlated with the size of its genome. For example, the unicellular protist

Amoeba dubia has a genome of approximately 670 Gbp which is more than

200 times larger than the human genome (Gregory and Hebert, 1999). Nei-

ther does the number of protein-coding genes seem to correlate with the or-

ganismal complexity. For example, the nematode Caernohabditis elegansconsists of roughly 1000 cells distributed over 24 distinct cell types (Altun,

Z. F. and Hall, D. H. 2005. Handbook of C. elegans Anatomy. In WormAtlas.

http://www.wormatlas.org/handbook/contents.htm). In contrast, the number

of cells in the human body is in the order of 1014

and the number of distinct

cell types in mammals are around 100 (Bell and Mooers, 1997). In the July

2008 release of Ensembl (Release 50), the number of known and novel hu-

man protein-coding genes annotated is 21,528 (Flicek et al., 2008;

http://www.ensembl.org). In the same release the number of protein coding

genes in the nematode is 20,176. Thus, despite the vast differences in num-

ber of cells and cell types between nematodes and humans, we have about

the same number of protein-coding genes.

Alternative splicing has been suggested as an explanation for this paradox

since AS is a powerful mechanism by which the protein repertoire of an or-

ganism can be expanded beyond the number of transcription units encoded

within the genome (Black, 2003; Blencowe, 2006). A study using exon junc-

tion microarrays estimate that some 74% of human multi-exon genes are al-

ternatively spliced (Johnson et al., 2003). Even more astonishing, a recent

study using massive parallel sequencing conclude that 98-100% of human

20

Page 21: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

multi-exon genes undergo AS events, most of which seem to be regulated

between tissues, lending support to the idea that AS is an important contribu-

tor to phenotypic complexity (Wang et al., 2008a).

Splicing and disease

Among known disease-causing mutations, it is estimated that 20-30% affect

pre-mRNA splicing, causing either aberrant splicing or shifting ratios be-

tween transcript isoforms. The severity of the disease phenotype depends on

the level of correctly spliced transcripts or the ratio between isoforms (see

Nissim-Rafinia and Kerem, 2002; Faustino and Cooper, 2003; Wang and

Cooper, 2007 for review). Approximately 10-15% of these mutations disrupt

intronic splicing motifs e.g. splice sites, SREs or creating decoy splice sites

(Krawczak et al., 1992; Faustino and Cooper, 2003).

Post-genomic era

In 1995 the first complete genome sequence of a free-living organism,

Haemophilus influenzae, was published (Fleischmann et al., 1995). Since

then, the number of completely sequenced genomes has increased exponen-

tially while the cost and time required to sequence have decreased steadily.

In October 2008, the number of genome sequences in NCBI's GenBank that

are completed or with an available draft assembly was close to 1500 and

roughly another 1000 genome sequencing projects are in progress. Approxi-

mately 85% represent prokaryotic organisms (http://www.ncbi.nlm.nih.-

gov/). This calls for large-scale methods capable of more or less automatical-

ly analyzing and handling the exponentially growing amount of genome

data.

Genome evolution

A genome is not static but is constantly changing due to mutations. Muta-

tions provide the raw material for the evolutionary forces of natural selection

and genetic drift to work on. Mutations can arise through e.g. mistakes in the

cellular processes of DNA replication or exposure to mutagenic substances

or radiation. Mutations can be e.g. nucleotide substitutions where one nu-

cleotide is changed into another or nucleotides can be inserted or deleted.

Other types of mutations include larger structural rearrangement events that

can be due to e.g. the action of transposons, non-homologous recombination

during crossing-over or chromosomal breakage during replication. While the

vast majority of mutations are neutral, i.e. they do not have any significant

effect on the phenotype of an individual, a small fraction is likely to be

21

Page 22: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

harmful and a small fraction can be beneficial. For example, insertions or

deletions in an ORF most likely cause frameshifts and lead to an erroneous

protein being synthesized, which is probably harmful to the organism. Con-

versely, a nucleotide substitution that changes a codon into another synony-

mous codon would most likely have no effect on the organism and therefore

be completely neutral.

If a genomic mutation occurs in the germ-line, the mutation can be inherit-

ed by the offspring and passed on to subsequent generations. While severely

harmful mutations are unlikely to spread through the population since off-

spring carrying them are likely to die or otherwise be prevented from breed-

ing, neutral mutations can spread and give rise to genetic variation within a

population. Recent genome sequencing projects have shown that the

genomes of different human individuals differ from each other by roughly

0.5% (Levy et al., 2007; Wheeler et al., 2008; Wang et al., 2008b).

Comparative genomics

Since all living organisms are related to each other through common descent

and since their genomes are shaped by evolutionary forces; comparing the

footprints these forces leave in the genomes reveal crucial information. If the

corresponding regions in two genomes are significantly well conserved over

large evolutionary distances, that suggests selective constraints to keep the

regions unchanged which would indicate functionality. Such comparative ap-

proaches can also identify elements potentially involved in making a species

unique. As an example, comparing the human and mouse genomes might

help in identifying genes and regulatory motifs that are of such importance

to the organism that they haven't changed substantially in the approximately

80 million years since humans and mice last shared a common ancestor. On

the other hand, human and chimpanzee are so closely related that we expect

very little to have changed between the genomes (roughly 98.8% of aligned

nucleotides are identical) (Mikkelsen et al., 2005). Therefore, it is in the

parts of the genomes that are different we would expect to find the genetic

variations that set these species apart.

On an even finer scale, having the complete genomes from several indi-

viduals of the same species allows for even more detailed analyses such as

mapping the genetic cause for various heritable diseases. Recently, a large-

scale sequencing initiative, The 1000 genomes project, has been undertaken

and the goal is to compile a catalog of human genetic variation that is

present at a frequency of one percent or more. The complete genome se-

quences from at least 1000 individuals will be determined and made publicly

available (http://www.1000genomes.org).

22

Page 23: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

Identification of functional units

Having knowledge of the complete genome sequence is extremely valuable

when trying to understand how an organism functions since the genome es-

sentially contains all information that defines the organism physically. How-

ever, just knowing the genome sequence, i.e. the internal order of DNA nu-

cleotides, is not very useful by itself. The next great challenge is to interpret

and decode the genome to create a map of all functional units, i.e. the genes

and the regulatory elements that control their expression. This can be

achieved through a wide array of complementary methods which can be sep-

arated into experimentally based and computationally based methods. I will

mainly focus on identification of RNA genes and to some extent on protein-

coding genes.

In general, experimental methods have the advantage of being capable of

de novo-identification while not suffering so much from false positives (with

the exception of microarray technologies). On the other hand, experimental

approaches are often expensive, work intensive and may suffer from large

amounts of false negatives depending on growth conditions, expression lev-

els etc. Bioinformatic approaches generally have the advantage of being

cheap, often scales well with increasing data amounts and are independent of

expression levels. However, bioinformatic methods always have to struggle

with a trade-off between false positives and false negatives and they are

mostly limited to discovering already known ncRNA families. Typically,

genes that have been identified computationally have to be verified experi-

mentally in order to be fully trusted.

Experimentally based methods

Complementary DNA library preparationPreparation of complementary DNA (cDNA) libraries by reverse transcrip-

tion of mature mRNAs extracted from cells have been extensively used to

map protein-coding genes. In this process, an oligo-dT primer complemen-

tary to the poly(A)-tail of the mature mRNAs, or alternatively a random hex-

anucleotide primer, is used as a primer for reverse transcriptase enzyme

which transcribes the mRNA into cDNA. The cDNA can then be sequenced

from the 5' or 3' end, generating transcripts, so-called expressed sequence

tags (ESTs), 100-800 nucleotides in length. Since the cDNA is constructed

from mature mRNAs, introns will be absent from the resulting sequence and

a subsequent mapping back to the genome (if available) will reveal valuable

information on the exon-intron structure of the gene. Since this method dis-

criminates against less abundant and non-polyadenylated RNAs, ncRNAs

will only very rarely be incorporated into these cDNA libraries.

23

Page 24: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

In 2001, Hüttenhofer and colleagues focused on large scale identification

and mapping of ncRNAs by constructing cDNA libraries specifically en-

riched for ncRNAs (Hüttenhofer et al., 2001). In their approach, total RNA

was first size fractionated by electrophoresis and only RNA in the size range

50-500 nucleotides, where the fraction of mRNAs are expected to be very

low, was extracted. Subsequently, a poly(C)-tail was added and a cDNA li-

brary could then be constructed by reverse transcription using an oligo-dG

primer. Subsequent refinements of the technique entails ligating known

adapter sequences to the 5' and 3' ends of the RNAs in order to generate full-

length cDNAs. RNA can also be isolated by performing immunoprecipita-

tion of RNP complexes followed by RNA extraction and adapter ligation.

A drawback with this method is that absence or underrepresentation of

certain ncRNA families in the constructed cDNA library may result from low

abundance or interference with adapter ligation and reverse transcription into

cDNA due to secondary structures or chemical modifications of nucleotides

at the ends of RNAs. In addition, unless specifically eliminated, the majority

of cDNAs in the library are expected to represent highly abundant RNAs

(Hüttenhofer and Vogel, 2006).

Microarray analysisMicroarray techniques usually involve spotting 25-70 nucleotide long DNA

probes on the silicon or glass surface of microarrays. RNA is then extracted

from the cell and either hybridized to the microarray probes directly or

chemically labeled and possibly reverse transcribed into cDNA before hy-

bridization. The hybridization patterns can then be visualized through the la-

beling or if unlabeled RNA was used, through antibodies recognizing the

DNA-RNA interaction. A microarray can contain a large number of probes

and the design varies depending on the application. Arrays with probes inter-

rogating known or predicted exon junctions have been used to asses alterna-

tive splicing events (Johnson et al., 2003). Arrays with probes interrogating

known or predicted exons also allow AS events or variations in expression

levels to be assessed (Kapur et al., 2007; Clark et al., 2007). Yet another type

of array is known as tiling array and consists of overlapping nucleotide

probes that span entire chromosomes or even genomes. Here, no a prioriknowledge of exons or transcripts is required. Tiling microarrays have re-

vealed a surprising amount of transcription from regions of the human

genome assumed to be transcriptionally silent (Bertone et al., 2004; Kampa

et al., 2004; Birney et al., 2007).

In the study of ncRNAs, microarrays are usually targeted at verifying or

comparing expression levels of transcripts from known or computationally

predicted ncRNAs (see e.g. Babak et al., 2004; Hu et al., 2006). However,

there are many factors influencing and complicating analysis of the microar-

ray data such as probe specific and array specific hybridization efficiency

and varying levels of background noise. Subsequent Northern blot verifica-

24

Page 25: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

tion of transcripts detected on microarray have also failed in many instances

so ncRNAs identified on microarray must in many cases be further verified

by downstream analyses (Hüttenhofer and Vogel, 2006).

Direct sequencingRecently, powerful high-throughput sequencing platforms such as Roche

454, Illumina Genome Analyzer and Applied Biosystems SOLiD sequencer,

capable of producing millions of sequence reads in a short amount of time

and at a low cost have emerged (for a recent review, see Mardis, 2008).

These techniques are extremely promising and offers the possibility to create

a snapshot view of the transcriptome content of a sample. For example,

Wang and colleagues performed deep sequencing of the transcriptome of a

wide array of human tissues and cell lines and reported that approximately

92-94% of all human mRNAs are alternatively spliced (Wang et al., 2008a).

In another study, Lu et al. performed deep sequencing of small RNAs ex-

tracted from the plant Arabidopsis thaliana and were able to identify a vast

repertoire of small RNAs including previously known and unknown miR-

NAs and siRNAs (Lu et al., 2005). Numerous studies have followed their ap-

proach to identify small RNAs (see e.g. Ruby et al., 2006; Landgraf et al.,

2007; Fahlgren et al., 2007; Grimson et al., 2008). Similar to the construc-

tion of full-length cDNA libraries, these methods rely on an initial prepara-

tion step where adapter sequences are ligated to the ends of the RNA

molecules. Hence the same bias in representation may arise.

Computationally based methods

When describing computationally based methods for gene discovery and an-

notation, it is convenient to distinguish between situations where we are

searching for a particular family of genes where known examples exist and

situations where we are searching for genes de novo, i.e. where we are trying

to discover new gene families.

Searching for known gene familiesThe most commonly used method for annotation of functional units in a se-

quenced genome is probably using an already known sequence from another

species to do a sequence similarity search. This is primarily useful for pro-

tein-coding genes which are generally under negative selection to preserve

amino acid sequence and hence change relatively slowly during evolution. In

particular, the amino acid sequence of a known protein can be used to search

a genome by translating the genome sequence into amino acids in all six po-

tential ORFs, thereby increasing the sensitivity compared to just doing nu-

cleotide comparisons. The Basic Local Alignment Search Tool (BLAST)

program suite is widely used for this purpose (Altschul et al., 1990; Altschul

25

Page 26: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

et al., 1997). Other useful tools include the BLAST derivative Washington

University BLAST (WU-BLAST) (Gish, W. (1996-2004)

http://blast.wustl.edu), The BLAST-like Alignment Tool (BLAT) (Kent,

2002) and FASTA (Pearson and Lipman, 1988; Pearson et al., 1997). For

ncRNA genes, searching using nucleotide sequence similarity is not as reli-

able since many ncRNAs are very much dependent on their structure rather

than sequence. Therefore, unless the species with which the comparison is

made is closely related, there is a great risk that the nucleotide sequences of

homologous genes have diverged substantially (Eddy, 2002a).

A more promising approach to search for homologous ncRNA genes

would be to search a database for a similar secondary structure. RScan isone program that implements such an approach using only structural infor-

mation to search a database (Xue and Liu, 2007). Klein and Eddy have de-

veloped the RSEARCH software that is capable of searching a database us-

ing both sequence and structure information (Klein and Eddy, 2003). Anoth-

er option is to manually supply a sequence and structure pattern which can

be searched for using e.g. PatSearch (Grillo et al., 2003), HyPa (Gräf et al.,

2001; http://www.biophys.uni-duesseldorf.de/hypa) or RNAMotif (Macke et

al., 2001).

In cases where homologous or paralogous sequences are known and a reli-

able multiple alignment is available or can be constructed, the Infernal(Eddy, 2002b) software package can be used to create a so-called covariance

model (CM) from the multiple alignment. A CM is a probabilistic model that

represents both structural information, inferred from correlation between

columns in the multiple alignment, and sequence information (Eddy and

Durbin, 1994). The constructed CM can then be used by Infernal to search

an input sequence for occurrences of subsequences that receives significant

scores according to the CM. The database Rfam is a large-scale effort to col-

lect all available knowledge on a large number of ncRNA families in a com-

munity-driven approach. This includes manually curated multiple alignments

and pre-computed CMs of ncRNA families (Griffiths-Jones et al., 2005;

http://rfam.sanger.ac.uk/). Both Infernal and RSEARCH have been evaluated

and shown to perform very well on well-defined ncRNA families but suffer

from being computationally expensive (Freyhult et al., 2007).

For many known families of ncRNAs, specially tailored software capable

of predicting genes belonging to the family exist. A small selection of exam-

ples include tRNAscan-SE (Lowe and Eddy, 1997) and ARAGORN(Laslett and Canbäck, 2004) for tRNAs, snoscan (Lowe and Eddy, 1999),

snoGPS (Schattner et al., 2004) and Fisher (Edvardsson et al., 2003) for

snoRNAs, MiRscan (Lim et al., 2003), miR-abela (Sewer et al., 2005) and

RNAmicro (Hertel and Stadler, 2006) for miRNAs and SRPscan (Regalia et

al., 2002) for SRP RNA. Each of these programs is dedicated to the identifi-

cation of a single family of ncRNAs, often using a combination of sequence

and structure motifs known to be crucial for the function of the RNA.

26

Page 27: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

De novo gene predictionIn contrast to the situation above where a well-known family of genes or

where homologs to a specific gene are searched for, de novo gene prediction

means that we want to identify genes without any specific prior knowledge

on sequence or structure features. For protein-coding genes, a number of fea-

tures, such as start codon, open reading frame, splice signals etc., must be

present in order for the gene to function properly in a cellular context. Taken

together, these features can be combined to yield de novo protein-coding

gene predictions of high accuracy, even though things like e.g. alternative

splicing and short ORFs complicate the exact prediction of protein products.

A corresponding set of general features for ncRNA genes is as far as we

know not available. The approach will also depend on whether the genome

sequences of any closely related species are available. In that case it might

be possible to do a comparative study.

Single genome search

Without genome sequences from closely related species, the problem of denovo ncRNA gene prediction becomes even harder. We have already seen

that secondary structure is important for the function of many ncRNAs so

one approach could be to search for locally stable secondary structures with-

in a genomic sequence. RNALfold (Hofacker et al., 2004), RNAplfold(Bernhart et al., 2006) and Rfold (Kiryu et al., 2008) are examples of pro-

grams with this capability. However, a number of studies have demonstrated

that although secondary structure stability of ncRNAs is statistically signifi-

cant compared to random nucleotide sequences for certain highly structured

ncRNA families, it is not generally true for ncRNA genes (Workman and

Krogh, 1999; Rivas and Eddy, 2000; Clote et al., 2005). Hence, this ap-

proach is biased towards ncRNAs with stable secondary structures and

would probably generate many false positives so further analyses of the pre-

dictions would be necessary.

Another possible approach would be to focus on the genomic signals im-

portant for transcription and processing of a ncRNA gene. Such approaches

have had success in prokaryotes where the promoter and transcription termi-

nator signals are well characterized (Argaman et al., 2001; Chen et al., 2002;

Yachie et al., 2006). However, in eukaryotes this approach is complicated by

the fact that transcription of ncRNAs can be carried out by any of the poly-

merases and their promoter structures are poorly understood. In addition,

ncRNA genes come in a variety of genomic structures, they are not necessar-

ily independently transcribed. An attempt to correlate human structured ncR-

NA gene predictions made with the RNAz software (see below) with tran-

scription factor binding sites could not detect any difference between struc-

tured ncRNA gene predictions and shuffled background (Backofen et al.,

2007).

27

Page 28: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

Yet another option suggested is if there exists a bias in nucleotide compo-

sition between the genes encoding ncRNAs and the rest of the genome, re-

gions with a nucleotide composition similar to typical ncRNA genes can be

identified (Rivas and Eddy, 2000). This approach is obviously extremely

limited to the special cases where such a bias exists. Klein and co-workers as

well as Schattner both successfully conducted studies identifying ncRNA

genes in hyperthermophiles with extremely AT-rich genomes using this ap-

proach independently (Schattner, 2002; Klein et al., 2002).

Comparative study

In case the genome sequence from one or more closely related species is

available, the conditions for de novo ncRNA gene prediction are somewhat

improved. The basic idea is then that orthologous regions between the

genomes are aligned and the substitution pattern can be evaluated to identify

regions where nucleotides are substituted in a way that is expected for a ncR-

NA gene. Since secondary structure is often important to the function of

ncRNAs whereas nucleotide sequence is not, it would be expected that evo-

lution preserves the secondary structure through compensatory nucleotide

substitutions in base paired regions (Figure 1D). In contrast, for proteins

which are very much dependent on their amino acid sequence, the protein-

coding genes are expected to have a substitution pattern with mostly synony-

mous nucleotide changes and no insertions or deletions that would cause

frameshift mutations. In order for such an analysis to be successful, the

aligned genomes can not be too closely related since there might be too few

nucleotide substitutions. On the other hand, they can not be too distantly re-

lated, they must be sufficiently close so that a reliable alignment where each

column represents an orthologous position can be constructed. Rivas and

colleagues pioneered this approach in the software QRNA and demonstrated

that they could detect bacterial ncRNAs with high sensitivity and specificity

(Rivas and Eddy, 2001; Rivas et al., 2001). Evofold is a similar software ca-

pable of considering multiple sequence alignments and taking the underlying

phylogenetic relationships into account (Pedersen et al., 2006). Like QRNA,

it is based on secondary structure predictions using a probabilistic frame-

work. RNAz also considers multiple sequence alignments and uses thermo-

dynamic parameters for secondary structure prediction (Washietl et al.,

2005a). Both Evofold and RNAz have been used to generate a large amount

of predicted elements with evolutionarily conserved stable secondary struc-

tures, many of which correspond to ncRNA genes (Pedersen et al., 2006;

Washietl et al., 2005b).

28

Page 29: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

Present Investigations

To understand the processes and mechanisms governing biological systems,

it is crucial to have a comprehensive knowledge about the key elements in-

volved. NcRNAs have emerged as highly abundant and diverse factors in

e.g. cellular processes and regulation of gene expression. Thus, to determine

the ncRNA repertoire is of critical importance to our understanding of bio-

logical systems. The aim of my thesis project has been to develop and use

computational tools for the purpose of identifying and characterizing ncRNA

genes.

Non-coding RNAs in Dictyostelium discoideum (PapersI-III)

Dictyostelium discoideum is a slime mold belonging to the amoebozoa lin-

eage that branched out early in eukaryotic evolution, after the plant-animal

split but before the fungi-animal split (Baldauf and Doolittle, 1997; Bapteste

et al., 2002). D. discoideum lives as single cells feeding on bacteria. Upon

starvation, up to 105 cells come together and differentiate into a multicellu-

lar-like organism. D. discoideum has for over 50 years been a model organ-

ism to study different cellular processes, e.g. cell signaling, multicellular de-

velopment and host-pathogen interactions. The genome sequence has been

determined and was published in 2005 (Eichinger et al., 2005). The genome

which is extremely rich in A and T nucleotides, 78% on average, is approxi-

mately 34 Mbp in size, organized into six chromosomes and contains rough-

ly 12,500 protein-coding genes. Approximately 10% of the genome consists

of interspersed complex repeats (Glöckner et al., 2001). At the start of this

study, despite being a well-studied model organism, very little was known

about the ncRNA repertoire of D. discoideum beside the rRNAs, tRNAs and

a few more RNAs (for a review, see Hinas and Söderbom, 2007). We have

used a number of experimentally and computationally based techniques to

identify and characterize many new ncRNAs in D. discoideum.

29

Page 30: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

Experimentally based methods (Paper I)

In order to get a picture of the ncRNA repertoire of D. discoideum, we con-

structed full-length cDNA libraries of small RNAs (50-500 nt) from D. dis-coideum. Total RNA from cells that had been developing for 16h (slug stage)

was size fractionated followed by C-tail addition. To ensure that the cDNAs

would represent full-length RNAs of both processed and primary transcripts,

an RNA oligo was ligated to the 5' end of the RNAs after treatment with

TAP. The RNA was subsequently amplified using RT-PCR and cloned. Be-

fore sequencing, colonies were screened to minimize the number of clones

corresponding to rRNA sequences. Of the remaining clones, 36 unique se-

quences were found to represent previously unknown D. discoideum ncR-

NAs. Among the sequenced small RNAs, 18 snoRNAs, 1 U2 snRNA and 1

SRP RNA could be identified. In addition and most interestingly, two highly

abundant classes (~14% of the cloned RNAs) of a previously unknown small

ncRNA family were isolated and named Class I and Class II RNAs. We note

the absence of some of the abundant ncRNA families, e.g. tRNAs and the

snRNAs except for U2. This can most likely be attributed to the ligation of

the RNA oligo to the 5' end of RNAs as similar studies that perform reverse

transcription without the ligation step frequently detect these ncRNAs (Hüt-

tenhofer et al., 2001; Yuan et al., 2003). The ligation step is known to be in-

efficient and higher-order structures in the RNA as well as chemical modifi-

cations near the ends can interfere with ligation and reverse transcription

(Hüttenhofer and Vogel, 2006). The cDNA library approach enabled a first

glimpse of the ncRNA repertoire of D. discoideum and allowed us to identify

a highly abundant and, to our knowledge, species specific family of ncR-

NAs. One should bear in mind however, that the method failed to pick up

certain known families of ncRNAs where ligation of oligos or reverse tran-

scription into cDNA are problematic due to extensive structure.

Computationally based methods (Papers II-III)

Known ncRNA families (Paper II)We were confident that D. discoideum contained the snRNA components of

the splicing machinery since early work had demonstrated the presence of

spliceosomal snRNAs on polyacrylamide gels (Wise and Weiner, 1981;

Takeishi and Kaneda, 1981). In Paper I, we managed to identify the U2 snR-

NA. However, we failed to detect any of the other snRNAs associated with

the spliceosome. We therefore set out to identify the spliceosomal snRNAs

in D. discoideum by computational means.

The U5 and U6 snRNAs could be readily identified through a sensitive se-

quence similarity search with the BLAST software using known U5 and U6

snRNAs as query. However, this approach failed to detect the U1 and U4

snRNAs. We therefore refined the search based on existing knowledge con-

30

Page 31: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

cerning the interactions between the snRNA and pre-mRNA as well as be-

tween the snRNAs in the spliceosome. This was combined with knowledge

about the evolutionarily conserved secondary structures known to be re-

quired for the spliceosome function (Madhani and Guthrie, 1994; Tycowski

et al., 2006). This led to the identification of five genomic loci corresponding

to U1 snRNA and three loci corresponding to U4 snRNA. In addition, we lo-

cated a total of seven loci encoding U2 snRNA-like genes by sequence simi-

larity searches using the U2 snRNA identified in Paper I. We did not manage

to identify any sequences corresponding to snRNAs belonging to the minor

spliceosome, indicating that the minor spliceosome machinery may have

been lost in D. discoideum. This observation is also supported in a recent

study (Dávila López et al., 2008).

The computationally predicted snRNA genes (18 in total) were extensive-

ly verified experimentally and expression from 17 loci was detected. This

approach illustrates a very cost-efficient and powerful method for locating

ncRNAs belonging to known and well-defined families where extensive

knowledge of sequence and structure features can be utilized.

De novo ncRNA gene prediction (Paper III)The identification of previously unknown ncRNA families potentially specif-

ic to D. discoideum in Paper I prompted us to attempt a computational effort

to identify ncRNAs de novo. Presently, successful methods for de novo ncR-

NA gene prediction have been limited to studies based on comparative ge-

nomics or transcriptional signals (e.g. Rivas et al., 2001; Pedersen et al.,

2006; Washietl et al., 2005b; Argaman et al., 2001; Chen et al., 2002; Yachie

et al., 2006). However, at the time of the study, no genome sequence was

available for any sufficiently close D. discoideum relative. In combination

with the variability in ncRNA gene structure and eukaryotic transcription

signals being poorly characterized, these approaches seemed inapplicable.

Our characterization of the isolated ncRNAs in D. discoideum showed that

most genes are situated in between protein-coding genes where the A and T

content is very high. However, the ncRNA genes were not particularly

skewed towards A and T nucleotides. This fact opened up the possibility for

de novo ncRNA gene prediction using base composition as a criterion. This

approach has previously been successfully used to identify ncRNA genes in

hyperthermophiles with extremely AT-rich genomes (Klein et al., 2002;

Schattner, 2002).

While the studies by Klein and colleagues and Schattner used a hidden

Markov model (HMM) framework and a sliding window approach, respec-

tively, we chose to implement a method based on high scoring segments

among partial sums. With this approach, each nucleotide in a sequence is

given a score value. Typically, nucleotides that frequently occur in the type

of sequences we are looking for, in this case ncRNAs, will receive a positive

31

Page 32: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

score while nucleotides that are not frequent are given a negative score. In

this particular case, G and C nucleotides that are unusual in the background

genome but relatively frequent in ncRNAs will receive a positive score while

A and T nucleotides, which are common in the background genome, will re-

ceive a negative score. By summing up scores from consecutive nucleotides

we can define a high scoring segment so that the cumulative sum between

the start and end point of the segment is maximized. Such high scoring seg-

ments would then represent a sequence with a base composition more similar

to the known ncRNAs than to the background genome. This method has sev-

eral advantages. Firstly, the only parameters needed are the scoring scheme

used to assign a score value to each nucleotide. These scores can be designed

arbitrarily to detect specific features of interest or be derived from e.g. a

training set of positive examples and a background set representing negative

data. Secondly, there is a solid body of theory developed regarding the statis-

tical distribution of scores of maximal scoring segments which enables the

statistical significance of maximal scoring segments to be evaluated analyti-

cally (Karlin and Altschul, 1990). In addition, when considering secondary

structures of ncRNAs, it is well known that interactions between neighboring

nucleotides are important, e.g. stacking interactions between base pairs stabi-

lize helices (Mathews et al., 1999; Workman and Krogh, 1999; Clote et al.,

2005). For that reason we implemented a scoring scheme assuming a first or-

der Markov dependence between nucleotides which we refer to as the M1-

model with the intention to evaluate any added benefit compared to a scoring

scheme considering nucleotides to be independent, the M0-model (Figure 5).

The theoretical results regarding the statistical distribution of maximal scor-

ing segments have been extended to sequences with a Markov dependence

(Karlin and Dembo, 1992).

As practically none of the known ncRNAs overlap annotated exons or in-

terspersed repeats, we masked those sequences from the D. discoideumgenome. We refer to the resulting data set as the non-repetitive, intergenic

genome. We used a scaled log ratio between the nucleotide frequency from a

“target” set and a “background” set as score for each corresponding nu-

cleotide. The target set consisted of all known ncRNAs filtered for redundan-

cy and the background set consisted of the non-repetitive, intergenic genome

with known ncRNAs removed. For the M0-model we scored each nucleotide

independently using scores derived from the respective nucleotide frequency.

For the M1-model, we scored each overlapping dinucleotide using scores de-

rived from the conditional frequency of the second nucleotide given the first

nucleotide. Evaluating the benefit of using the M1-model over the M0-mod-

el, we found a slight increase in sensitivity and specificity when using the

M1-model. However, the largest part of this increase is most likely due to a

characteristic dinucleotide content in the background genome of D. dis-coideum rather than in the ncRNAs.

32

Page 33: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

We searched the non-repetitive, intergenic genome of D. discoideum using

the M1-model and identified 94% of previously known ncRNAs as well as a

large number of high-scoring segments possibly corresponding to previously

unknown ncRNA genes. In order to further refine the predictions we used a

few filters which we believed were motivated. The first filter is based on the

observation we had made during our earlier work (Papers I-II) that genes

from the same ncRNA family are frequently present in multiple copies

throughout the genome. Therefore, we clustered the predictions into families

according to sequence similarity and experimentally verified expression

from sequences belonging to four out of eight families tested. Another obser-

vation we had made in previous studies (Papers I-II) was the presence of a

conserved sequence element at a fixed distance upstream of many of the

ncRNAs identified. We had named this element Dictyostelium Upstream Se-quence Element (DUSE) because of its similarity to the upstream sequence

element (USE) commonly found upstream of snRNA genes in other organ-

isms (Hernandez, 2001). The USE acts as a promoter element in other or-

ganisms and we assume that it plays a similar role in D. discoideum. Thus,

we implemented a transcription signal based filter that searches for the pres-

ence of a DUSE element upstream of the predictions. Two predictions were

experimentally assayed and expression was detected from both of them. One

of these was found to correspond to a previously reported selenocysteine

tRNA (Shrimali et. al, 2005). In addition, predictions belonging to two of the

experimentally verified clusters from the first filter were also found to be

preceded by DUSE elements. We conclude that in the special case of D. dis-coideum, we can successfully exploit the bias in nucleotide composition be-

tween ncRNAs and the rest of the genome for de novo ncRNA gene predic-

tion. Further refinements of the predictions can be made by employing filters

based on existing knowledge of the organism. We note however, just like in

previous base composition studies done in hyperthermophiles, that the

method seems biased towards detecting ncRNAs with extensive secondary

structure (Klein et al., 2002; Schattner, 2002). This manifests itself most

clearly through the failure to detect most of the less structured C/D-box

snoRNAs and Class I RNAs.

33

Page 34: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

Analysis of D. discoideum ncRNAs

An upstream sequence elementWhen analyzing genomic sequences upstream of the isolated ncRNAs (Paper

I), we identified an eight nucleotide long sequence element present in front

of most Class I RNAs, the U2 snRNA, the SRP RNA and some of the identi-

fied snoRNAs. The sequence element is similar to the upstream sequence el-

ement (USE, or sometimes known as proximal sequence element (PSE))

found in association with spliceosomal snRNA genes in other organisms

(Hernandez 2001). We named this sequence motif Dictyostelium upstreamsequence element (DUSE). The element is also present at a fixed distance of

approximately 63 nt upstream of all snRNAs (Paper II) with no distinction

between the U6 snRNA, which is transcribed by RNA polymerase III, and

the other snRNAs which are transcribed by RNA polymerase II. In addition,

a closer inspection of the genomic sequences preceding RNaseP RNA,

RNase MRP RNA and D2 RNA (Marquez et al., 2005; Piccinelli et al., 2005;

34

Page 35: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

Wise and Weiner, 1981; Takeishi and Kaneda, 1981), identified the DUSE

~63 nt from the predicted start of transcription. The importance of this ele-

ment was corroborated in Paper III where we used the presence of a DUSE

upstream of a high-scoring segment as a filter to refine predictions. Two of

the predictions were experimentally tested as well as sequences from two

clusters preceded by a DUSE and they were all found to be expressed. Typi-

cally, we find the DUSE in association with ncRNA genes but not in associa-

tion with protein-coding genes. Hence, the DUSE seems to be a promoter el-

ement specific for a wide range of ncRNA gene families in D. discoideum.

Small nucleolar RNAsIn Paper I, 18 of the sequenced ncRNAs were found to contain sequence mo-

tifs characteristic of snoRNAs. 17 of the D. discoideum snoRNAs could be

classified as the C/D-box type. We subsequently used a bioinformatic screen

to search for target sequences, i.e. D. discoideum rRNA sequences comple-

mentary to the C/D-box snoRNA guide sequences (Cavaillé and Bachellerie,

1998). Candidate targets could be assigned for 12 of the snoRNAs and one

of these was experimentally tested and verified to be methylated.

Spliceosomal small nuclear RNAs (snRNAs)We identified and characterized 18 loci encoding snRNA like genes (Paper I-

II) and we detected expression from 17 loci. All snRNA classes except U6

are present in two or more copies throughout the genome. These copies are

frequently arranged in a pairwise manner where two genes coding for the

same snRNA class are typically located close to each other and in all possi-

ble relative orientations. The same organization has been reported for e.g. D.discoideum tRNA genes and is believed to reflect a so far poorly understood

mechanism of gene duplication (Eichinger et al., 2005).

The U2 gene family contains seven copies which can be further divided

into two subfamilies based on sequence similarity. The first subfamily con-

tains three copies which are highly similar to each other and most similar to

the conserved U2 snRNA sequence of other species. The second subfamily

consists of four copies that are similar to each other but surprisingly diver-

gent from the first subfamily except for the highly conserved sequence mo-

tifs known to be involved in intermolecular interactions within the spliceo-

some. We also observe compensatory base pair substitutions between the

subfamilies, maintaining the predicted secondary structure. Upon closer in-

spection, the U2 snRNAs in the second subfamily also contain a ~40 nu-

cleotide 5' extension that is predicted to fold into a stem loop structure. Ex-

perimental data show that the snRNAs of this second subfamily are down-

regulated during development and are predominately located in the cyto-

plasm. This is unexpected since splicing is known to take place in the nucle-

us and the snRNAs normally only go through a brief cytoplasmic phase dur-

35

Page 36: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

ing the maturation process before being re-imported into the nucleus (Will

and Lührmann, 2001).

Most snRNAs could be detected as polyadenylated transcripts in the cyto-

plasm. This is also surprising as polyadenylation in eukaryotes is associated

with mRNAs where it is involved in transcript turnover and translation effi-

ciency. A quality control mechanism in yeast involving polyadenylation of

ncRNAs followed by degradation by exonucleolytic activity has recently

been reported (LaCava et al., 2005; Vanácová et al., 2005). Whether a simi-

lar mechanism exists in D. discoideum is so far not known.

In accordance with what is known for other eukaryotes, all snRNAs ex-

cept for U6 can be immunoprecipitated with an antibody specific for tri-

methylated cap, indicative of transcription by RNA polymerase II followed

by further processing (Will and Lührmann, 2001). In agreement with tran-

scription by RNA polymerase III, the U6 snRNA gene ends in a run of T nu-

cleotides which is typically a signal for termination of transcription.

A new family of ncRNAs Class I and Class II

Two highly abundant classes of a previously unknown small ncRNA family

were isolated and named Class I and Class II RNAs (Paper I). Transcripts

from both classes were found to be mainly cytoplasmic and Class I was

found to be developmentally regulated. Class I is a large family with 14 iso-

lated unique sequences mapping to 17 loci in the genome. Sequence similari-

ty searches in the genome revealed 24 additional loci. The distinguishing

features for the 55-65 nt Class I RNAs are highly conserved 5' and 3' ends,

16 and 8 nucleotides, respectively and a variable sequence in between. The

5' and 3' ends are predicted to form a helix. Class II sequences are present in

two highly similar genomic copies and are predicted to form a similar sec-

ondary structure as Class I. The Class I and Class II RNAs share an eleven

nucleotide sequence motif. The function of these unknown RNA classes is so

far not known. However, one of the Class I RNA genes has been knocked-

out, generating a strain with a mild phenotype where the developing multi-

cellular aggregates are reduced in size (Avesson and Söderbom, unpub-

lished).

Computationally identified ncRNAs

In a computational screen for ncRNA genes we clustered candidates into

families according to sequence similarity (Paper III). Out of eight experi-

mentally tested families, expression could be detected from four. One family

in particular is interesting. There are 12 candidates in the family, nine of

which are preceded by a DUSE. Through, 5'- and 3'-RACE experiments, we

could verify expression from at least three and possibly five individual se-

quences from the family and the mature 5' end is located at the expected 63

36

Page 37: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

nt downstream of the DUSE motifs. Secondary structure prediction using

RNAalifold (Hofacker et al., 2002) suggests a conserved, stable secondary

structure supported by several compensatory base pair substitutions. The

function of this group of ncRNAs is so far unknown and we have not been

able to detect similar sequences in other species.

Characterization of human spliceosomal snRNAvariants (Papers IV-V)

Heterogeneity among the expressed snRNA repertoire, often in combination

with tissue specificity and/or developmental regulation, has been reported in

many species, e.g. Xenopus, mouse, sea urchin, D. Melanogaster, T. ther-mophila, silk worm and D. discoideum (Forbes et al., 1984; Lund et al.,

1985; Santiago and Marzluff, 1989; Alonso et al., 1984; Lo and Mount,

1990; Ørum et al., 1992; Gao and Herrera, 1995; Paper II). Variant snRNAs

have also been previously reported in human (Sontheimer and Steitz, 1992).

The function of this heterogeneity has not been determined but an involve-

ment in tissue-specific or developmentally regulated alternative splicing is

often suggested (Mattaj and Hamm, 1989). Most organisms are known to

contain a large amount of snRNA-like sequences in their genomes (see e.g.Paper II; Adams et al., 2000; Lander et al., 2001; Waterston et al., 2002;

Manser and Gesteland, 1982). The publication of the human genome (Lan-

der et al., 2001) opened up the possibility to get a complete picture of the

snRNA heterogeneity on the genomic level. In order to characterize the het-

erogeneity among human snRNAs, we have in Papers IV and V used both

biochemical procedures as well as computational analyses for this purpose,

initially focusing mainly on the U1 snRNA.

Biochemical identification and characterization of U1 snRNAs

(Paper IV)

We first compiled a list of human genomic locations encoding U1 snRNA-

like sequences through sequence similarity searches with BLAT using the

well-characterized, dominant U1 snRNA sequence (U1A) as query (Branlant

et al., 1980). For completeness, we also extracted positions of interspersed

repeats classified as U1 snRNA from RepeatMasker annotations of the hu-

man genome (Smit AFA., Hubley R. and Green P. RepeatMasker Open-3.0.

1996-2004; http://www.repeatmasker.org). After manual inspection of up-

stream and downstream sequences as well as the RNA gene sequence, a sub-

set of 27 genes were selected as promising candidates for further analysis.

Northern blot analysis indicated that eight of these were expressed in HeLa

cells and 5' and 3' RACE and molecular cloning confirmed the expression of

37

Page 38: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

three snRNAs. We labeled these variants U1A5, U1A6 and U1A7. Further

biochemical characterization of these variants revealed that they are ubiqui-

tously expressed among a wide range of tissues and estimation of the relative

abundance levels are at 2-7% of the level of U1A which is in the same range

as the U11 snRNA of the minor spliceosome (Montzka and Steitz, 1988). We

could show that the variants all posses a tri-methylated guanosine (TMG)

cap at the 5' end and that they were associated with Sm proteins. This is in-

dicative of transcription by RNA polymerase II followed by export of the

transcript into the cytoplasm where Sm proteins assemble on the RNA fol-

lowed by hypermethylation of the cap (Kiss, 2004). We could also show that

the variants are present in nuclear RNP complexes, consistent with a re-im-

port of the processed snRNA variants into the nucleus. Taken together, these

observations indicate that the expressed variant snRNAs undergo the same

biogenesis pathway as functional snRNAs (Kiss, 2004).

Identification of expressed snRNAs using exon microarrays

(Paper V)

In order to take a large-scale approach to identify expressed snRNA genes in

human tissues, we analyze publicly available microarray data. The

Affymetrix Human Exon 1.0 ST microarray platform

(http://www.affymetrix.com) contains probesets interrogating over one mil-

lion exon clusters of known and predicted exons and transcribed regions in

the human genome. Each probeset consists of up to four individual 25 nt

probes that are combined to give a single intensity value for the probeset. Of

particular interest to us is the fact that the array also includes probesets inter-

rogating known and predicted RNA genes. Hybridization data for RNA ex-

tracted from eleven different human tissues has been made publicly available

and can be downloaded from the Affymetrix website. From this dataset we

extracted probesets corresponding to snRNA gene predictions. After filtering

away cross-hybridizing probesets and probesets with low individual probe

coverage we extracted Detection Above Background p-values. Briefly, this

p-value has been calculated by comparing each probe intensity against the

distribution of intensities from roughly 1000 background probes of the same

GC-content. Based on these p-values and setting a false discovery rate

(FDR) threshold at 0.05 we assigned predicted snRNA genes as expressed or

not expressed.

We find expressed variants for all spliceosomal snRNAs except U12 and

U6atac. U6 and U1 show the largest number of expressed variants, 28 and

17, respectively. When comparing how the expressed variants are distributed

over the eleven tissue types, we find that cerebellum is significantly overrep-

resented (p << 0.0001) while heart, muscle and lung are slightly underrepre-

sented (0.05 < p < 0.0025; Figure 6). The same general pattern of tissue ex-

38

Page 39: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

pression is observed for U1 and U6 snRNAs. It is particularly interesting

that we find by far the highest number of expressed snRNA variants in brain

tissue which has been fingered in several studies as being one of the tissues

containing the most alternative splicing events (Xu et al., 2002; Yeo et al.,

2004). In addition, many families of ncRNAs have been reported to be over-

represented in the nervous system and suggested to be involved in the regu-

lation of nervous system development (see Cao et al., 2006; Mehler and

Mattick, 2006 and references therein).

Characterization of expressed snRNA variants

Complementarity to pre-mRNA 5' splice sitesA striking feature of the U1 snRNA variants we identified in Paper IV is that

they all have at least one nucleotide substitution in the 5' end located motif

that interacts with the 5' SS of the pre-mRNA (Figure 7). In particular, they

all lack perfect complementarity to the highly conserved GU dinucleotide at

the first positions of the intron. In vivo data suggests that the U1 snRNA in-

teraction with the 5' SS can tolerate mismatches to the GU dinucleotide pro-

vided that the loss of base pairing is compensated for by increasing base

pairing throughout the rest of the interaction motif (Aebi et al., 1986; Jack-

son, 1991). As the identified variants U1A5 and U1A6 deviate from the

canonical U1A 5' SS interaction motif sequence at several positions, a puta-

tive interaction with the 5' SS consensus would be suboptimal in this respect.

We therefore hypothesized that these variants could function by recognizing

non-consensus 5' splice sites and be involved in alternative splicing events.

Similarly, of the 17 U1 snRNA variants we found to be expressed in Paper

V, all but three have nucleotide substitutions in the 5' SS recognition motif

compared to U1A and seven variants (one of which correspond to the U1A6

39

Page 40: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

variant in Paper IV) do not perfectly match the consensus GU dinucleotide

interaction motif. Thus, a fairly large number of variant U1 snRNAs are ex-

pressed in human tissues, most of which have a different 5' SS recognition

motif than U1A. Provided that the variant U1 snRNAs are assembled into

snRNPs and take part in the splicing reaction, it seems plausible that they

could give specificity for a subset of splice sites which are not optimally rec-

ognized by the U1A snRNA. Similar situations where compensatory muta-

tions in the 5' SS recognition motif of U1 snRNA can suppress 5' splice site

mutations and shift pre-mRNA splicing patterns in vivo have been reported.

(Zhuang and Weiner, 1986; Zhuang et al., 1987; Siliciano and Guthrie, 1988;

Cohen et al., 1993; Lo et al., 1994; Hitomi et al., 1998; Zahler et al., 2004).

Evolutionary conservationIn Paper IV, we looked at orthologous positions to the human U1 snRNA

variants in other primate species (chimpanzee, orangutan, rhesus macaque,

common marmoset) as well as in dog and cow. The variants are well con-

served between human and chimpanzee. U1A5 is also present in rhesus and

marmoset, although in marmoset there is a large deletion in the middle that

would probably inactivate the gene. U1A6 is present in orangutan but has a

large deletion within the gene. U1A7 is present in orangutan and rhesus but

the rhesus sequence has a deletion. Surprisingly, we find that the dog and

cow orthologs to U1A6 and U1A7 correspond to perfect U1A sequences.

This is interesting since it suggests that the variants have evolved from U1A

while still being expressed and subjected to the snRNA biogenesis pathway

in the cell. Since the emergence of snRNA variants with a preference for a

different set of splice sites could have dramatic impacts on the global splic-

ing pattern of pre-mRNAs, it is tempting to speculate that the large internal

deletions observed in some of the primate species variants represent a posi-

tively selected inactivation of the genes.

40

Page 41: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

To investigate whether more similar situations exist, i.e. where some

species have snRNA genes that have retained the original sequence while

other species have an snRNA gene variant that has substantially diverged in

the same locus, we use the 12-way whole-genome multiple alignment avail-

able through the Ensembl databases (Flicek et al., 2008; Paten et al., 2008) to

locate orthologous snRNA gene loci between human and non-human species

(Paper V). Beyond the primate lineage, we are only able to locate ortholo-

gous loci for a few genes. However, we identify 2-3 orthologs to human U1

snRNA genes in rodents, horse, dog, cow and platypus, all of which are lo-

41

Page 42: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

cated within the first intron of an evolutionarily conserved testis expressed

gene, TEX14, in human located on chromosome 17. Two of these corre-

spond to the U1A6 and U1A7 variants we identified in Paper IV. With the

exception of one cow sequence, all correspond to the original U1A sequence

in non-primate species while the primate orthologs have diverged sequences

in the corresponding positions. A homolog of TEX14 is present in the chick-

en and Xenopus genomes but no U1-like sequences can be located within its

introns in those species (Figure 8). Taken together, these results indicate that

the first intron of TEX14 harbor a cluster of U1 snRNA genes that were ac-

tive in an ancestral species close to the root of the mammalian lineage.

Somewhere along the primate lineage it seems like there has been a relax-

ation of negative selection, allowing the primate genes to diverge while the

cluster is still under negative selection in non-primate mammals.

42

Page 43: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

Conclusions and future perspectives

My Ph. D. studies have been aimed at the identification and characterization

of ncRNAs using computational methods.

Through a combination of computational and experimental approaches we

have identified a large number of ncRNAs belonging to many known as well

as several previously unknown gene families in the model organism Dic-tyostelium discoideum. In addition, we developed a new approach to use nu-

cleotide composition as a means to identify ncRNA genes de novo and we

demonstrated the successful use of this approach. We have also identified a

large heterogeneity among human spliceosomal snRNAs and we speculate

about a possible link to regulation of alternative splicing. But more knowl-

edge spawns more questions. A few lines of inquiry that might shed light on

issues of abundance, function and relevance of the ncRNAs we've encoun-

tered and which we should be able to follow up in the not too distant future

are sketched below.

Building CM-models of the ncRNA families that seem to be specific

to D. discoideum would enable more distantly related genomes to be

searched for homologs that might have been missed by sequence ho-

mology searches.

Genome sequencing of D. purpureum, a close relative to D. dis-coideum is underway. This opens up the possibility for doing denovo ncRNA gene discovery using comparative approaches such as

RNAz as well as study homologs to already identified ncRNA fami-

lies.

Introns that could potentially be spliced by variant U1 snRNAs re-

main to be found. Available databases that collect and extract infor-

mation on splice sites could be searched for this purpose. In particu-

lar introns with poor complementarity to U1A snRNA could consti-

tute promising candidates.

High-throughput sequencing of primate RNA enriched for snRNAs

are underway and is likely to yield large amounts of valuable infor-

mation on the primate snRNA transcriptome.

Another level of heterogeneity is offered by allelic variations (poly-

morphisms). Large amounts of data will soon become available from

the 1000 genomes project which will shed light on this issue.

43

Page 44: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

In order to test different models of evolution, reliable orthology data

is needed. This is notoriously hard to obtain between non-protein

coding genomic regions of distantly related organisms because this

genomic sequence is mainly evolving neutrally. One approach po-

tentially worth exploring could be to align “genome structure” rather

than sequence. For example, annotations of ancestral repeats could

be used to find orthologous regions sharing a repeat pattern that is

consistent with divergence times and the age of repeats.

As a final point, this work highlights how experimental and computational

methods may efficiently complement each other. Experimental data generate

knowledge which can guide the design of computational approaches that in

turn can be used to search vast amounts of sequence data and generate new

predictions. In contrast to experimental methods, computational approaches

are generally cheap and fast. In addition, they are always reproducible and

are not sensitive to expression levels or growth conditions.

44

Page 45: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

Acknowledgements

I would like to deeply thank my supervisor Anders Virtanen for accepting

me as a Ph. D. student, for your support and for everything you have taught

me. I have really appreciated how you are always full of ideas and how you

can always find the time to discuss projects. Your optimism and your passion

for science have been true sources of inspiration!

I would also like to thank my second supervisor Leif Kirsebom for all your

valuable help, advice and encouragement throughout my Ph. D. studies and

for giving me many opportunities to exciting collaborations.

Many special thanks to Fredrik Söderbom for exciting and fruitful collabo-

rations in the slime mold business! Thanks for enduring those insane work-

ing hours prior to submission, looking out for me and being a great friend.

My deepest gratitude also to David Ardell for all the support. You have

taught me a lot about science and bioinformatics and your encouragement

has meant a lot to me. Also, thanks for great collaborations, your ideas and

guidance on Paper III were invaluable!

My fellow members of the Virtanen group: Lei, the snRNA expert. It's been

really nice working together with you. Thanks for all the discussions about

snRNAs and for teaching me about experimental procedures. Good luck with

your projects! Per, for always taking the time to answer my questions, the

countless parties in your garden and your endless generosity. Good luck in

Japan! Niklas, your turn next! Thanks for all the good times, on conferences,

in the (well ventilated) office, in Uppsala and on the west coast. See you in

Gravarne!

Thanks to collaborators in the Dicty group at SLU: Andrea, for rewarding

collaborations and for making sense out of my predictions. Welcome back to

Sweden! Lotta, thanks for collaborations, nice company in Chicago and for

making the coffee breaks with me and Johan slightly less geeky (but only

slightly).

45

Page 46: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

Collaborators at ICM: Santanu, Jaydip, Nurul M. Islam, Ulrika, Bhupi,Hava and Fredrik P.

Thanks to Gerhart Wagner for many interesting discussions concerning

manuscripts and RNA in general.

Thanks also to collaborators and co-authors outside ICM: Christina Kyri-akopoulou, Jens Schuster, Anders Aspegren and Vincent Moulton.

I gratefully acknowledge the Lennanders Foundation at Uppsala Universi-

ty for financial support during the conclusion of my Ph. D. studies.

Thanks to all present and former colleagues at ICM! The administrators:Sigrid, Sofia, Anders, and Åsa for all their help and assistance over the

years. Akiko for computer support. Christer for technical support. Micro:

Johan, I've really enjoyed all our discussions, coffee breaks, going to con-

ferences, playing games, working out, etc. etc. (the list reaches from here to

Fyrisfjädern). Thanks man! Helene for tough badminton games. I'm dread-

ing a squash revenge. Karin T and Emma R for helpful discussions on mi-

croarrays (and the great parties of course!). Hava for sharing nuggets of pop

culture, teaching me english language and showing what a pool party is sup-

posed to be like. Cia – thanks for great travel company in Chicago. Maaike,

Erik H, Aurelie, Ulrika, Fredrik P, Shiying, Henrik T, Sonchita, Bhupi,Pernilla, Jon, Marie W, Disa, Linda M (and Mattias). Molbio: Magnus L,

Harri, Doreen, David F, Ikue, Magnus J, Chandu, Prune, Brian, Paul,Petter. The Kilimann group: Greta, Nicole and Christoph. Immunology:

Kajsa and Sofia.

Tack till oslagbara vänner, som håller det mikrokulära på avstånd: Anders L, Pontus H, Erik S, Lisa KW, Janne, Johan L, Daniel L, Per N,

Daniel V och Micaela. Fredrik, Patrik och Mats med familjer. The Stray-

dogs: Pontus G, Jonas, Ola och Erik.

Min familj:Ert stöd har varit ovärderligt!

Pia, Daniel, Moa och Ebba. Pernilla, Henrik, Olivia, Agnes och Alice.

Och naturligtvis, allra störst tack till Mamma och Pappa som alltid ställer

upp och som alltid har sett till att jag haft möjlighet att göra det jag velat.

Sist men också minst: Last.FM som stått för soundtracket till det här

kalaset!

46

Page 47: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

Swedish summary

Under 1990-talet gjordes en revolutionerande upptäckt när man fann att

dubbelsträngat RNA, som injicerats i rundmasken Caernohabditis elegansmed mycket stor precision kunde stänga av uttrycket (tillverkningen) av

specifika proteiner. Kort därefter upptäckte man att en stor mängd små icke

protein-kodande RNA-molekyler (ncRNA), med liknande funktion förekom-

mer naturligt, hos i stort sett alla djur och även i växter. Denna upptäckt, som

benämns RNA-interferens (RNAi), ledde till att man fick upp ögonen för hur

vanligt förekommande och avgörande reglering via ncRNA tycktes vara. För

att fullt ut kunna undersöka och kartlägga omfattningen och funktionerna

hos ncRNA är det mycket viktigt att man har så omfattande information som

möjligt om den uppsättning ncRNA-gener som uttrycks i en organism.

En viktig klass av ncRNA utgörs av små nukleära RNA (snRNA), vilka är

involverade i den s.k. splitsningen av mRNA. Ett eukaryot mRNA innehåller

ofta sekvenser, introner, som inte kodar för det färdiga proteinet. Intronerna

är insprängda mellan sekvenserna som kodar för proteinet, exonerna. Innan

ett mRNA kan translateras måste således intronerna klippas ut och exonerna

klistras ihop igen. Denna process kallas splitsning. Splitsningen utförs av ett

maskineri bestående av ett litet antal snRNA molekyler och ett stort antal as-

socierade proteiner. Cellen kan använda sig av splitsningsmaskineriet för att

producera olika, men ändå närbesläktade proteiner. I dessa fall klipper och

klistrar maskineriet ihop RNA-delar på olika sätt. Det kallas alternativ split-

sning och är mycket vanligt förekommande.

Eftersom människan är en högt utvecklad organism med bl.a. en stor och

komplex hjärna, har man trott att det skulle behövas en stor mängd proteiner

för att bygga upp och hålla igång en individ. När den mänskliga arvsmassan

var färdigsekvenserad i början av 2000-talet, blev man därför väldigt förvå-

nad över att upptäcka, att människan har ungefär lika många proteinkodande

gener som den betydligt mindre utvecklade rundmasken. Man tror nu, att det

snarare är små skillnader i hur de olika proteinerna uttrycks, som bidrar till

den ökade komplexiteten hos människan, jämfört med t.ex. rundmasken.

Eftersom olika ncRNA är mycket effektiva regulatorer av genuttryck, tror

man att dessa spelar en stor roll i detta sammanhang. Ytterligare en

mekanism som kan öka komplexiteten hos uppsättningen proteiner en organ-

47

Page 48: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

ism har tillgång till, är den ovan nämnda alternativa splitsningen. Också den

processen är i hög grad reglerad av ncRNA.

Att sekvensera arvsmassan (eller genomet), dvs att bestämma ordningsfölj-

den hos DNA-nukleotiderna, hos en organism är av stor betydelse när man

vill förstå hur organismen är uppbyggd och fungerar. Att ta reda på nuk-

leotidsekvensen i arvsmassan är dock bara ett första steg. För att ha någon

nytta av den kunskapen måste man veta vad sekvensen betyder, bl.a. måste

man kartlägga bitarna som kodar för gener: både gener som kodar för pro-

teiner och gener som kodar för ncRNA. Att göra detta för ett helt genom är

ett stort projekt och till stor del måste man därför förlita sig på datorbaserade

metoder, som kan analysera hela genomet och märka ut generna.

I mitt doktorandarbete har jag med hjälp av datorbaserade metoder identi-

fierat och karaktäriserat ncRNA utifrån sekvensdata. Vi har också studerat

den mångfald som finns bland mänskliga splitsnings-RNA och hur detta kan

påverka uppsättningen proteiner hos människan.

I min avhandling presenterar jag strategier vi använt oss av för att identifiera

och karaktärisera ncRNA gener i amöban Dictyostelium discoideum, på

svenska kallad slemsvamp. Slemsvamp är en viktig modellorganism som har

studerats under de senaste 50 åren. Trots att den studerats under så lång tid,

har man känt till mycket lite om dess ncRNA repertoar. Vi använde oss av

flera metoder, både experimentellt baserade och datorbaserade, för att identi-

fiera ncRNAs i slemsvamp.

Vi kunde identifiera många ncRNA tillhörande dels väldokumenterade

ncRNA-klasser, som dock inte varit kartlagda i slemsvamp tidigare, men

också klasser av tidigare okända ncRNA, som eventuellt är specifika för den

här organismen. Funktionen hos dessa okända ncRNA är inte kartlagd.

Genom att vi hittade många ncRNA kunde vi upptäcka speciella egenskaper

som var ganska utmärkande för ncRNA-gener i slemsvamp. Vi kunde därför,

med hjälp av dessa egenskaper, skapa en helt datorbaserad sökmetod som vi

använde för att hitta ännu fler okända ncRNA.

Vi har också studerat den naturligt förekommande variationen i splitsnings-

RNA (snRNA) hos människan. Eftersom en del av dessa snRNAs är in-

volverade i lokaliseringen av splitsningsställen i mRNA har det förekommit

spekulationer om att varianter av snRNA skulle kunna påverka t.ex. alterna-

tiv splitsning. Vi har visat, både genom datorbaserade och experimentella

metoder, att ett flertal varianter av snRNA, som skulle kunna påverka

lokaliseringen av splitsningsställen, förekommer I mänskliga celler. Vi tror

därför att snRNA-varianter skulle kunna påverka alternativ splitsning.

48

Page 49: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

Sammantaget har vi genom att kombinera datorbaserade och experimentella

metoder identifierat en rik och varierad ncRNA-repertoar i slemsvamp och i

människa. Denna kartläggning kan tjäna som ett första steg för att mer in-

gående studera funktionen hos de olika ncRNA-generna. Variationen vi ser i

mänskliga snRNA-gener skulle kunna ha betydelse för splitsningen av

mRNA och därigenom bidraga till att öka antalet olika proteinprodukter, som

cellen är kapabel att tillverka.

49

Page 50: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

References

Adams, M.D. et al., 2000. The genome sequence of Drosophila melanogaster. Science,

287(5461), 2185-95.

Aebi, M. et al., 1986. Sequence requirements for splicing of higher eukaryotic nuclear pre-

mRNA. Cell, 47(4), 555-65.

Alonso, A. et al., 1984. Divergence of U2 snRNA sequences in the genome of D.

melanogaster. Nucleic Acids Research, 12(24), 9543-50.

Altman, S., 1990. Ribonuclease P. Postscript. The Journal of Biological Chemistry, 265(33),

20053-6.

Altschul, S.F. et al., 1990. Basic local alignment search tool. Journal of Molecular Biology,

215(3), 403-10.

Altschul, S.F. et al., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein

database search programs. Nucleic Acids Research, 25(17), 3389-402.

Aravin, A.A., Hannon, G.J. & Brennecke, J., 2007. The Piwi-piRNA pathway provides an

adaptive defense in the transposon arms race. Science, 318(5851), 761-4.

Argaman, L. et al., 2001. Novel small RNA-encoding genes in the intergenic regions of Es-

cherichia coli. Current Biology: CB, 11(12), 941-50.

Babak, T. et al., 2004. Probing microRNAs with microarrays: tissue specificity and functional

inference. RNA, 10(11), 1813-9.

Bachellerie, J.P. et al., 1995. Antisense snoRNAs: a family of nucleolar RNAs with long com-

plementarities to rRNA. Trends in Biochemical Sciences, 20(7), 261-4.

Backofen, R. et al., 2007. RNAs everywhere: genome-wide annotation of structured RNAs.

Journal of Experimental Zoology. Part B. Molecular and Developmental Evolution,

308(1), 1-25.

Baldauf, S.L. & Doolittle, W.F., 1997. Origin and evolution of the slime molds (Mycetozoa).

Proceedings of the National Academy of Sciences of the United States of America,

94(22), 12007-12.

Ban, N. et al., 2000. The complete atomic structure of the large ribosomal subunit at 2.4 A res-

olution. Science, 289(5481), 905-20.

Bapteste, E. et al., 2002. The analysis of 100 genes supports the grouping of three highly di-

vergent amoebae: Dictyostelium, Entamoeba, and Mastigamoeba. Proceedings ofthe National Academy of Sciences of the United States of America, 99(3), 1414-9.

Bell, G. & Mooers, A.O., 1997. Size and complexity among multicellular organisms. Biologi-cal Journal of the Linnean Society, 60(3), 345-363.

Berget, S.M., Moore, C. & Sharp, P.A., 1977. Spliced segments at the 5' terminus of aden-

ovirus 2 late mRNA. Proceedings of the National Academy of Sciences of the Unit-ed States of America, 74(8), 3171-5.

Bernhart, S.H., Hofacker, I.L. & Stadler, P.F., 2006. Local RNA base pairing probabilities in

large sequences. Bioinformatics, 22(5), 614-5.

Bertone, P. et al., 2004. Global identification of human transcribed sequences with genome

tiling arrays. Science, 306(5705), 2242-6.

Birney, E. et al., 2007. Identification and analysis of functional elements in 1% of the human

genome by the ENCODE pilot project. Nature, 447(7146), 799-816.

Black, D.L., 2003. Mechanisms of alternative pre-messenger RNA splicing. Annual Review ofBiochemistry, 72, 291-336.

50

Page 51: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

Blencowe, B.J., 2006. Alternative splicing: new insights from global analyses. Cell, 126(1),

37-47.

Branlant, C. et al., 1980. Nucleotide sequences of nuclear U1A RNAs from chicken, rat and

man. Nucleic Acids Research, 8(18), 4143-54.

Burge, C.B., Tuschl, T. & Sharp, P.A., 1999. Splicing of Precursors to mRNAs by the Spliceo-

somes. In The RNA World. Second Edition. New York: Cold Spring Harbor Labora-

tory Press, pp. 525-560.

Burset, M., Seledtsov, I.A. & Solovyev, V.V., 2001. SpliceDB: database of canonical and non-

canonical mammalian splice sites. Nucleic Acids Research, 29(1), 255-9.

Cao, X. et al., 2006. Noncoding RNAs in the Mammalian Central Nervous System. AnnualReview of Neuroscience. Available at:

http://www.ncbi.nlm.nih.gov/pubmed/16602910 [Accessed December 4, 2008].

Cavaillé, J. & Bachellerie, J.P., 1998. SnoRNA-guided ribose methylation of rRNA: structural

features of the guide RNA duplex influencing the extent of the reaction. NucleicAcids Research, 26(7), 1576-87.

Cech, T.R., 1990. Self-splicing of group I introns. Annual Review of Biochemistry, 59, 543-68.

Cech, T.R., 1986. The generality of self-splicing RNA: relationship to nuclear mRNA splic-

ing. Cell, 44(2), 207-10.

Cech, T.R. et al., 2006. The RNP World. In The RNA World. Third Edition. New York: Cold

Spring Harbor Laboratory Press, p. 287.

Chen, S. et al., 2002. A bioinformatics based approach to discover small RNA genes in the Es-

cherichia coli genome. Bio Systems, 65(2-3), 157-77.

Chow, L.T. et al., 1977. An amazing sequence arrangement at the 5' ends of adenovirus 2 mes-

senger RNA. Cell, 12(1), 1-8.

Clark, T. et al., 2007. Discovery of tissue-specific exons using comprehensive human exon

microarrays. Genome Biology, 8(4), R64.

Claude, A., 1943. The constitution of protoplasm. Science, 97(2525), 451-456.

Clemson, C.M. et al., 1996. XIST RNA paints the inactive X chromosome at interphase: evi-

dence for a novel RNA involved in nuclear/chromosome structure. The Journal ofCell Biology, 132(3), 259-75.

Clote, P. et al., 2005. Structural RNA has lower folding energy than random RNA of the same

dinucleotide frequency. RNA, 11(5), 578-91.

Cohen, J.B., Broz, S.D. & Levinson, A.D., 1993. U1 small nuclear RNAs with altered speci-

ficity can be stably expressed in mammalian cells and promote permanent changes

in pre-mRNA splicing. Molecular and Cellular Biology, 13(5), 2666-76.

Costa, F.F., 2005. Non-coding RNAs: new players in eukaryotic biology. Gene, 357(2), 83-94.

Crick, F.H., 1968. The origin of the genetic code. Journal of Molecular Biology, 38(3),

367-79.

Crooks, G.E. et al., 2004. WebLogo: a sequence logo generator. Genome Research, 14(6),

1188-90.

Dávila López, M., Rosenblad, M.A. & Samuelsson, T., 2008. Computational screen for

spliceosomal RNA genes aids in defining the phylogenetic distribution of major and

minor spliceosomal components. Nucleic Acids Research, 36(9), 3001-10.

Decatur, W.A. & Fournier, M.J., 2003. RNA-guided nucleotide modification of ribosomal and

other RNAs. The Journal of Biological Chemistry, 278(2), 695-8.

Eddy, S.R., 2002b. A memory-efficient dynamic programming algorithm for optimal align-

ment of a sequence to an RNA secondary structure. BMC Bioinformatics, 3, 18.

Eddy, S.R., 2002a. Computational genomics of noncoding RNA genes. Cell, 109(2), 137-40.

Eddy, S.R., 2001. Non-coding RNA genes and the modern RNA world. Nature Reviews. Ge-netics, 2(12), 919-29.

Eddy, S.R. & Durbin, R., 1994. RNA sequence analysis using covariance models. NucleicAcids Research, 22(11), 2079-88.

Edvardsson, S. et al., 2003. A search for H/ACA snoRNAs in yeast using MFE secondary

structure prediction. Bioinformatics, 19(7), 865-73.

51

Page 52: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

Egea, P.F., Stroud, R.M. & Walter, P., 2005. Targeting proteins to membranes: structure of the

signal recognition particle. Current Opinion in Structural Biology, 15(2), 213-20.

Eichinger, L. et al., 2005. The genome of the social amoeba Dictyostelium discoideum. Na-ture, 435(7038), 43-57.

Fahlgren, N. et al., 2007. High-Throughput Sequencing of Arabidopsis microRNAs: Evidence

for Frequent Birth and Death of MIRNA Genes. PLoS ONE, 2(2), e219.

Fatica, A. & Tollervey, D., 2002. Making ribosomes. Current Opinion in Cell Biology, 14(3),

313-318.

Faustino, N.A. & Cooper, T.A., 2003. Pre-mRNA splicing and human disease. Genes & De-velopment, 17(4), 419-37.

Fire, A. et al., 1998. Potent and specific genetic interference by double-stranded RNA in

Caenorhabditis elegans. Nature, 391(6669), 806-11.

Fleischmann, R.D. et al., 1995. Whole-genome random sequencing and assembly of

Haemophilus influenzae Rd. Science, 269(5223), 496-512.

Flicek, P. et al., 2008. Ensembl 2008. Nucleic Acids Research, 36(Database issue), D707-14.

Forbes, D.J. et al., 1984. Differential expression of multiple U1 small nuclear RNAs in

oocytes and embryos of Xenopus laevis. Cell, 38(3), 681-9.

Freyhult, E.K., Bollback, J.P. & Gardner, P.P., 2007. Exploring genomic dark matter: a critical

assessment of the performance of homology search methods on noncoding RNA.

Genome Research, 17(1), 117-25.

Gao, J.P. & Herrera, R.J., 1995. U1 snRNA variants coexist in Bombyx mori cells. InsectMolecular Biology, 4(3), 193-202.

Glöckner, G. et al., 2001. The complex repeats of Dictyostelium discoideum. Genome Re-search, 11(4), 585-94.

Gräf, S. et al., 2001. HyPaLib: a database of RNAs and RNA structural elements defined by

hybrid patterns. Nucleic Acids Research, 29(1), 196-8.

Graveley, B.R., 2000. Sorting out the complexity of SR protein functions. RNA, 6(9),

1197-211.

Gregory, T.R. & Hebert, P.D., 1999. The modulation of DNA content: proximate causes and

ultimate consequences. Genome Research, 9(4), 317-24.

Griffiths-Jones, S. et al., 2005. Rfam: annotating non-coding RNAs in complete genomes.

Nucleic Acids Research, 33(Database issue), D121-4.

Grillo, G. et al., 2003. PatSearch: A program for the detection of patterns and structural motifs

in nucleotide sequences. Nucleic Acids Research, 31(13), 3608-12.

Grimson, A. et al., 2008. Early origins and evolution of microRNAs and Piwi-interacting

RNAs in animals. Nature, 455(7217), 1193-7.

Guerrier-Takada, C. et al., 1983. The RNA moiety of ribonuclease P is the catalytic subunit of

the enzyme. Cell, 35(3 Pt 2), 849-57.

Hall, S.L. & Padgett, R.A., 1996. Requirement of U12 snRNA for in vivo splicing of a minor

class of eukaryotic nuclear pre-mRNA introns. Science, 271(5256), 1716-8.

Hernandez, N., 2001. Small nuclear RNA genes: a model system to study fundamental mech-

anisms of transcription. The Journal of Biological Chemistry, 276(29), 26733-6.

Hertel, J. & Stadler, P.F., 2006. Hairpins in a Haystack: recognizing microRNA precursors in

comparative genomics data. Bioinformatics, 22(14), e197-202.

Hinas, A. & Söderbom, F., 2007. Treasure hunt in an amoeba: non-coding RNAs in Dic-

tyostelium discoideum. Current Genetics, 51(3), 141-59.

Hitomi, Y., Sugiyama, K. & Esumi, H., 1998. Suppression of the 5' splice site mutation in the

Nagase analbuminemic rat with mutated U1snRNA. Biochemical and BiophysicalResearch Communications, 251(1), 11-6.

Hoagland, M.B. et al., 1958. A soluble ribonucleic acid intermediate in protein synthesis. TheJournal of Biological Chemistry, 231(1), 241-57.

Hofacker, I.L., Priwitzer, B. & Stadler, P.F., 2004. Prediction of locally stable RNA secondary

structures for genome-wide surveys. Bioinformatics, 20(2), 186-90.

Hofacker, I.L., Fekete, M. & Stadler, P.F., 2002. Secondary structure prediction for aligned

RNA sequences. Journal of Molecular Biology, 319(5), 1059-66.

52

Page 53: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

Hüttenhofer, A. et al., 2001. RNomics: an experimental approach that identifies 201 candi-

dates for novel, small, non-messenger RNAs in mouse. The EMBO Journal, 20(11),

2943-53.

Hüttenhofer, A., Schattner, P. & Polacek, N., 2005. Non-coding RNAs: hope or hype? Trendsin Genetics, 21(5), 289-97.

Hüttenhofer, A. & Vogel, J., 2006. Experimental approaches to identify non-coding RNAs.

Nucleic Acids Research, 34(2), 635–646.

Hu, Z. et al., 2006. An antibody-based microarray assay for small RNA detection. NucleicAcids Research, 34(7), e52.

Jackson, I.J., 1991. A reappraisal of non-consensus mRNA splice sites. Nucleic Acids Re-search, 19(14), 3795-8.

Johnson, J.M. et al., 2003. Genome-wide survey of human alternative pre-mRNA splicing

with exon junction microarrays. Science, 302(5653), 2141-4.

Jones-Rhoades, M.W., Bartel, D.P. & Bartel, B., 2006. MicroRNAS and their regulatory roles

in plants. Annual Review of Plant Biology, 57, 19-53.

Jurica, M.S. & Moore, M.J., 2003. Pre-mRNA splicing: awash in a sea of proteins. MolecularCell, 12(1), 5-14.

Kampa, D. et al., 2004. Novel RNAs Identified From an In-Depth Analysis of the Transcrip-

tome of Human Chromosomes 21 and 22. Genome Research, 14(3), 331–342.

Kapur, K. et al., 2007. Exon arrays provide accurate assessments of gene expression. GenomeBiology, 8(5), R82.

Karlin, S. & Altschul, S.F., 1990. Methods for assessing the statistical significance of molecu-

lar sequence features by using general scoring schemes. Proceedings of the Nation-al Academy of Sciences of the United States of America, 87(6), 2264-8.

Karlin, S. & Dembo, A., 1992. Limit distributions of maximal segmental score among

Markov-dependent partial sums. Advances in applied probability, 24(1), 113-140.

Kent, W.J., 2002. BLAT--the BLAST-like alignment tool. Genome Research, 12(4), 656-64.

Kim, V.N., 2005. MicroRNA biogenesis: coordinated cropping and dicing. Nature Reviews.Molecular Cell Biology, 6(5), 376-85.

Kiryu, H., Kin, T. & Asai, K., 2008. Rfold: an exact algorithm for computing local base pair-

ing probabilities. Bioinformatics, 24(3), 367-73.

Kiss-László, Z. et al., 1996. Site-specific ribose methylation of preribosomal RNA: a novel

function for small nucleolar RNAs. Cell, 85(7), 1077-88.

Kiss, T. & Filipowicz, W., 1995. Exonucleolytic processing of small nucleolar RNAs from

pre-mRNA introns. Genes & Development, 9(11), 1411-24.

Kiss, T., 2004. Biogenesis of small nuclear RNPs. Journal of Cell Science, 117(Pt 25),

5949-51.

Klein, R.J. & Eddy, S.R., 2003. RSEARCH: finding homologs of single structured RNA se-

quences. BMC Bioinformatics, 4, 44.

Klein, R.J., Misulovin, Z. & Eddy, S.R., 2002. Noncoding RNA genes identified in AT-rich

hyperthermophiles. Proceedings of the National Academy of Sciences of the UnitedStates of America, 99(11), 7542-7.

Kornberg, R.D., 2007. The molecular basis of eukaryotic transcription. Proceedings of theNational Academy of Sciences of the United States of America, 104(32), 12955-61.

Krawczak, M., Reiss, J. & Cooper, D.N., 1992. The mutational spectrum of single base-pair

substitutions in mRNA splice junctions of human genes: causes and consequences.

Human Genetics, 90(1-2), 41-54.

Kruger, K. et al., 1982. Self-splicing RNA: autoexcision and autocyclization of the ribosomal

RNA intervening sequence of Tetrahymena. Cell, 31(1), 147-57.

LaCava, J. et al., 2005. RNA degradation by the exosome is promoted by a nuclear

polyadenylation complex. Cell, 121(5), 713-24.

Lagos-Quintana, M. et al., 2001. Identification of novel genes coding for small expressed

RNAs. Science, 294(5543), 853-8.

Lander, E.S. et al., 2001. Initial sequencing and analysis of the human genome. Nature,

409(6822), 860-921.

53

Page 54: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

Landgraf, P. et al., 2007. A mammalian microRNA expression atlas based on small RNA li-

brary sequencing. Cell, 129(7), 1401-14.

Laski, F.A., Rio, D.C. & Rubin, G.M., 1986. Tissue specificity of Drosophila P element trans-

position is regulated at the level of mRNA splicing. Cell, 44(1), 7-19.

Laslett, D. & Canback, B., 2004. ARAGORN, a program to detect tRNA genes and tmRNA

genes in nucleotide sequences. Nucleic Acids Research, 32(1), 11-6.

Lau, N.C. et al., 2001. An abundant class of tiny RNAs with probable regulatory roles in

Caenorhabditis elegans. Science, 294(5543), 858-62.

Lee, R.C. & Ambros, V., 2001. An extensive class of small RNAs in Caenorhabditis elegans.

Science, 294(5543), 862-4.

Lee, R.C., Feinbaum, R.L. & Ambros, V., 1993. The C. elegans heterochronic gene lin-4 en-

codes small RNAs with antisense complementarity to lin-14. Cell, 75(5), 843-54.

Lerner, M.R. et al., 1980. Are snRNPs involved in splicing? Nature, 283(5743), 220-4.

Levine, A. & Durbin, R., 2001. A computational scan for U12-dependent introns in the human

genome sequence. Nucleic Acids Research, 29(19), 4006-13.

Levy, S. et al., 2007. The diploid genome sequence of an individual human. PLoS Biology,

5(10), e254.

Lim, L.P. et al., 2003. The microRNAs of Caenorhabditis elegans. Genes & Development,17(8), 991-1008.

Lo, P.C. & Mount, S.M., 1990. Drosophila melanogaster genes for U1 snRNA variants and

their expression during development. Nucleic Acids Research, 18(23), 6971-9.

Lo, P.C., Roy, D. & Mount, S.M., 1994. Suppressor U1 snRNAs in Drosophila. Genetics,

138(2), 365-78.

Lowe, T.M. & Eddy, S.R., 1999. A computational screen for methylation guide snoRNAs in

yeast. Science, 283(5405), 1168-71.

Lowe, T.M. & Eddy, S.R., 1997. tRNAscan-SE: a program for improved detection of transfer

RNA genes in genomic sequence. Nucleic Acids Research, 25(5), 955-64.

Lu, C. et al., 2005. Elucidation of the Small RNA Component of the Transcriptome. Science,

309(5740), 1567-1569.

Lund, E., Kahan, B. & Dahlberg, J.E., 1985. Differential control of U1 small nuclear RNA ex-

pression during mouse development. Science, 229(4719), 1271-4.

Macke, T.J. et al., 2001. RNAMotif, an RNA secondary structure definition and search algo-

rithm. Nucleic Acids Research, 29(22), 4724-35.

Madhani, H.D. & Guthrie, C., 1994. Dynamic RNA-RNA interactions in the spliceosome. An-nual Review of Genetics, 28, 1-26.

Manser, T. & Gesteland, R.F., 1982. Human U1 loci: genes for human U1 RNA have dramati-

cally similar genomic environments. Cell, 29(1), 257-64.

Mardis, E.R., 2008. The impact of next-generation sequencing technology on genetics. Trendsin Genetics, 24(3), 133-41.

Marquez, S.M. et al., 2005. Structural implications of novel diversity in eucaryal RNase P

RNA. RNA, 11(5), 739-51.

Mathews, D.H. et al., 1999. Expanded sequence dependence of thermodynamic parameters

improves prediction of RNA secondary structure. Journal of Molecular Biology,

288(5), 911-40.

Matlin, A.J., Clark, F. & Smith, C.W.J., 2005. Understanding alternative splicing: towards a

cellular code. Nature Reviews. Molecular Cell Biology, 6(5), 386-98.

Mattaj, I.W. & Hamm, J., 1989. Regulated splicing in early development and stage-specific U

snRNPs. Development, 105(2), 183-9.

Mattick, J.S. & Makunin, I.V., 2006. Non-coding RNA. Human Molecular Genetics, 15 Spec

No 1, R17-29.

Mehler, M.F. & Mattick, J.S., 2006. Non-coding RNAs in the nervous system. The Journal ofPhysiology, 575(Pt 2), 333-41.

Mikkelsen, T.S., 2005. Initial sequence of the chimpanzee genome and comparison with the

human genome. Nature, 437(7055), 69-87.

54

Page 55: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

Missal, K. et al., 2006. Prediction of structured non-coding RNAs in the genomes of the ne-

matodes Caenorhabditis elegans and Caenorhabditis briggsae. Journal of Experi-mental Zoology. Part B. Molecular and Developmental Evolution, 306(4), 379-92.

Montzka, K.A. & Steitz, J.A., 1988. Additional low-abundance human small nuclear ribonu-

cleoproteins: U11, U12, etc. Proceedings of the National Academy of Sciences ofthe United States of America, 85(23), 8885-9.

Moore, P.B. & Steitz, T.A., 2006. The Roles of RNA in the Synthesis of Protein. In RNAWorld. Third Edition. New York: Cold Spring Harbor Laboratory Press, p. 287.

Mount, S.M., 1982. A catalogue of splice junction sequences. Nucleic Acids Research, 10(2),

459-72.

Mourier, T. et al., 2008. Genome-wide discovery and verification of novel structured RNAs in

Plasmodium falciparum. Genome Research, 18(2), 281-92.

Nissim-Rafinia, M. & Kerem, B., 2002. Splicing regulation as a potential genetic modifier.

Trends in Genetics: TIG, 18(3), 123-7.

Orum, H., Nielsen, H. & Engberg, J., 1992. Structural organization of the genes encoding the

small nuclear RNAs U1 to U6 of Tetrahymena thermophila is very similar to that of

plant small nuclear RNA genes. Journal of Molecular Biology, 227(1), 114-21.

Padgett, R.A. et al., 1986. Splicing of messenger RNA precursors. Annual Review of Bio-chemistry, 55, 1119-50.

Padgett, R.A. et al., 1983. Splicing of messenger RNA precursors is inhibited by antisera to

small nuclear ribonucleoprotein. Cell, 35(1), 101-7.

Parrott, A.M. & Mathews, M.B., 2007. Novel rapidly evolving hominid RNAs bind nuclear

factor 90 and display tissue-restricted distribution. Nucleic Acids Research, 35(18),

6249-58.

Patel, A.A. & Steitz, J.A., 2003. Splicing double: insights from the second spliceosome. Na-ture Reviews. Molecular Cell Biology, 4(12), 960-70.

Paten, B. et al., 2008. Enredo and Pecan: genome-wide mammalian consistency-based multi-

ple alignment with paralogs. Genome Research, 18(11), 1814-28.

Pearson, W.R. & Lipman, D.J., 1988. Improved tools for biological sequence comparison.

Proceedings of the National Academy of Sciences of the United States of America,

85(8), 2444-8.

Pearson, W.R. et al., 1997. Comparison of DNA sequences with protein sequences. Genomics,

46(1), 24-36.

Pedersen, J.S. et al., 2006. Identification and classification of conserved RNA secondary

structures in the human genome. PLoS Computational Biology, 2(4), e33.

Petersen, C.P. et al., 2006. The Biology of Short RNAs. In RNA World. Third Edition. New

York: Cold Spring Harbor Laboratory Press, p. 287.

Piccinelli, P., Rosenblad, M.A. & Samuelsson, T., 2005. Identification and analysis of ribonu-

clease P and MRP RNA in a broad range of eukaryotes. Nucleic Acids Research,

33(14), 4485-95.

Pollard, K.S. et al., 2006. An RNA gene expressed during cortical development evolved rapid-

ly in humans. Nature, 443(7108), 167-72.

Regalia, M., Rosenblad, M.A. & Samuelsson, T., 2002. Prediction of signal recognition parti-

cle RNA genes. Nucleic Acids Research, 30(15), 3368-77.

Reinhart, B.J. et al., 2000. The 21-nucleotide let-7 RNA regulates developmental timing in

Caenorhabditis elegans. Nature, 403(6772), 901-6.

Rivas, E. & Eddy, S.R., 2001. Noncoding RNA gene detection using comparative sequence

analysis. BMC Bioinformatics, 2, 8.

Rivas, E. & Eddy, S.R., 2000. Secondary structure alone is generally not statistically signifi-

cant for the detection of noncoding RNAs. Bioinformatics, 16(7), 583-605.

Rivas, E. et al., 2001. Computational identification of noncoding RNAs in E. coli by compar-

ative genomics. Current Biology: CB, 11(17), 1369-73.

Roeder, R.G., 2005. Transcriptional regulation and the role of diverse coactivators in animal

cells. FEBS Letters, 579(4), 909-15.

55

Page 56: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

Ruby, J.G. et al., 2006. Large-scale sequencing reveals 21U-RNAs and additional microRNAs

and endogenous siRNAs in C. elegans. Cell, 127(6), 1193-207.

Santiago, C. & Marzluff, W.F., 1989. Expression of the U1 RNA gene repeat during early sea

urchin development: evidence for a switch in U1 RNA genes during development.

Proceedings of the National Academy of Sciences of the United States of America,

86(8), 2572-6.

Schattner, P., 2002. Searching for RNA genes using base-composition statistics. Nucleic AcidsResearch, 30(9), 2076-82.

Schattner, P. et al., 2004. Genome-wide searching for pseudouridylation guide snoRNAs:

analysis of the Saccharomyces cerevisiae genome. Nucleic Acids Research, 32(14),

4281-96.

Schmucker, D. et al., 2000. Drosophila Dscam is an axon guidance receptor exhibiting ex-

traordinary molecular diversity. Cell, 101(6), 671-84.

Schmucker, D., 2007. Molecular diversity of Dscam: recognition of molecular identity in neu-

ronal wiring. Nature Reviews. Neuroscience, 8(12), 915-20.

Sewer, A. et al., 2005. Identification of clustered microRNAs using an ab initio prediction

method. BMC Bioinformatics, 6, 267.

Shrimali, R.K. et al., 2005. Selenocysteine tRNA identification in the model organisms Dic-

tyostelium discoideum and Tetrahymena thermophila. Biochemical and BiophysicalResearch Communications, 329(1), 147-51.

Siliciano, P.G. & Guthrie, C., 1988. 5' splice site selection in yeast: genetic alterations in base-

pairing with U1 reveal additional requirements. Genes & Development, 2(10),

1258-67.

Smith, C.M. & Steitz, J.A., 1997. Sno storm in the nucleolus: new roles for myriad small

RNPs. Cell, 89(5), 669-72.

Sontheimer, E.J. & Steitz, J.A., 1992. Three novel functional variants of human U5 small nu-

clear RNA. Molecular and Cellular Biology, 12(2), 734-46.

Sorek, R. et al., 2004. Minimal conditions for exonization of intronic sequences: 5' splice site

formation in alu exons. Molecular Cell, 14(2), 221-31.

Staley, J.P. & Guthrie, C., 1998. Mechanical devices of the spliceosome: motors, clocks,

springs, and things. Cell, 92(3), 315-26.

Stamm, S. et al., 1994. A sequence compilation and comparison of exons that are alternatively

spliced in neurons. Nucleic Acids Research, 22(9), 1515-26.

Takeishi, K. & Kaneda, S., 1981. Isolation and characterization of small nuclear RNAs from

Dictyostelium discoideum. Journal of Biochemistry, 90(2), 299-308.

Tarn, W.Y. & Steitz, J.A., 1996. A novel spliceosome containing U11, U12, and U5 snRNPs

excises a minor class (AT-AC) intron in vitro. Cell, 84(5), 801-11.

Thanaraj, T.A. & Clark, F., 2001. Human GC-AG alternative intron isoforms with weak donor

sites show enhanced consensus at acceptor exon positions. Nucleic Acids Research,

29(12), 2581-93.

Tollervey, D. & Kiss, T., 1997. Function and synthesis of small nucleolar RNAs. CurrentOpinion in Cell Biology, 9(3), 337-42.

Tycowski, K.T. et al., 2006. The Ever-Growing World of Small Nuclear Ribonucleoproteins.

In RNA World. Third Edition. New York: Cold Spring Harbor Laboratory Press, p.

327.

Valadkhan, S., 2007. The spliceosome: a ribozyme at heart? Biological Chemistry, 388(7),

693-7.

Valadkhan, S. et al., 2007. Protein-free spliceosomal snRNAs catalyze a reaction that resem-

bles the first step of splicing. RNA, 13(12), 2300-11.

Vanácová, S. et al., 2005. A new yeast poly(A) polymerase complex involved in RNA quality

control. PLoS Biology, 3(6), e189.

Walter, P. & Blobel, G., 1982. Signal recognition particle contains a 7S RNA essential for pro-

tein translocation across the endoplasmic reticulum. Nature, 299(5885), 691-8.

56

Page 57: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

Wang, E.T. et al., 2008a. Alternative isoform regulation in human tissue transcriptomes. Na-ture. Available at: http://dx.doi.org/10.1038/nature07509 [Accessed November 21,

2008].

Wang, G. & Cooper, T.A., 2007. Splicing in disease: disruption of the splicing code and the

decoding machinery. Nature Reviews. Genetics, 8(10), 749-61.

Wang, J. et al., 2008b. The diploid genome sequence of an Asian individual. Nature,

456(7218), 60-5.

Wang, Z. & Burge, C.B., 2008. Splicing regulation: from a parts list of regulatory elements to

an integrated splicing code. RNA, 14(5), 802-13.

Washietl, S. et al., 2005b. Mapping of conserved RNA secondary structures predicts thou-

sands of functional noncoding RNAs in the human genome. Nature Biotechnology,

23(11), 1383-90.

Washietl, S., Hofacker, I.L. & Stadler, P.F., 2005a. Fast and reliable prediction of noncoding

RNAs. Proceedings of the National Academy of Sciences of the United States ofAmerica, 102(7), 2454-9.

Waterston, R.H. et al., 2002. Initial sequencing and comparative analysis of the mouse

genome. Nature, 420(6915), 520-62.

Wheeler, D.A. et al., 2008. The complete genome of an individual by massively parallel DNA

sequencing. Nature, 452(7189), 872-6.

Wightman, B., Ha, I. & Ruvkun, G., 1993. Posttranscriptional regulation of the heterochronic

gene lin-14 by lin-4 mediates temporal pattern formation in C. elegans. Cell, 75(5),

855-62.

Will, C.L. & Lührmann, R., 2005. Splicing of a rare class of introns by the U12-dependent

spliceosome. Biological Chemistry, 386(8), 713-24.

Will, C.L. & Lührmann, R., 2001. Spliceosomal UsnRNP biogenesis, structure and function.

Current Opinion in Cell Biology, 13(3), 290-301.

Will, C.L. & Luhrmann, R., 2006. Spliceosome Structure and Function. In RNA World. ThirdEdition. New York: Cold Spring Harbor Laboratory Press, p. 369.

Wise, J.A. & Weiner, A.M., 1981. The small nuclear RNAs of the cellular slime mold Dic-

tyostelium discoideum. Isolation and characterization. The Journal of BiologicalChemistry, 256(2), 956-63.

Workman, C. & Krogh, A., 1999. No evidence that mRNAs have lower folding free energies

than random sequences with the same dinucleotide distribution. Nucleic Acids Re-search, 27(24), 4816-22.

Xue, C. & Liu, G., 2007. RScan: fast searching structural similarities for structured RNAs in

large databases. BMC Genomics, 8, 257.

Xu, Q., Modrek, B. & Lee, C., 2002. Genome-wide detection of tissue-specific alternative

splicing in the human transcriptome. Nucleic Acids Research, 30(17), 3754-66.

Yachie, N. et al., 2006. Prediction of non-coding and antisense RNA genes in Escherichia coli

with Gapped Markov Model. Gene, 372, 171-81.

Yang, V.W. et al., 1981. A small nuclear ribonucleoprotein is required for splicing of adenovi-

ral early RNA sequences. Proceedings of the National Academy of Sciences of theUnited States of America, 78(3), 1371-5.

Yeo, G. et al., 2004. Variation in alternative splicing across human tissues. Genome Biology,

5(10), R74.

Yuan, G. et al., 2003. RNomics in Drosophila melanogaster: identification of 66 candidates

for novel non-messenger RNAs. Nucleic Acids Research, 31(10), 2495-507.

Zahler, A.M., Tuttle, J.D. & Chisholm, A.D., 2004. Genetic suppression of intronic +1G muta-

tions by compensatory U1 snRNA changes in Caenorhabditis elegans. Genetics,

167(4), 1689-96.

Zhuang, Y. & Weiner, A.M., 1986. A compensatory base change in U1 snRNA suppresses a 5'

splice site mutation. Cell, 46(6), 827-35.

Zhuang, Y., Leung, H. & Weiner, A.M., 1987. The natural 5'splice site of simian virus 40 large

T antigen can be improved by increasing the base complementarity to U1 RNA.

Molecular and Cellular Biology, 7(8), 3018-3020.

57

Page 58: Computational Approaches to the Identification and ...173095/FULLTEXT01.pdf · into the nematode Caernohabditis elegans could specifically target and si-lence gene expression through

Acta Universitatis UpsaliensisDigital Comprehensive Summaries of Uppsala Dissertationsfrom the Faculty of Science and Technology 589

Editor: The Dean of the Faculty of Science and Technology

A doctoral dissertation from the Faculty of Science andTechnology, Uppsala University, is usually a summary of anumber of papers. A few copies of the complete dissertationare kept at major Swedish research libraries, while thesummary alone is distributed internationally through theseries Digital Comprehensive Summaries of UppsalaDissertations from the Faculty of Science and Technology.(Prior to January, 2005, the series was published under thetitle “Comprehensive Summaries of Uppsala Dissertationsfrom the Faculty of Science and Technology”.)

Distribution: publications.uu.seurn:nbn:se:uu:diva-9518

ACTAUNIVERSITATIS

UPSALIENSISUPPSALA

2009