Bioinformatic jc 08_14_2013_formal

Genome-wide variation of alternative polyadenylation in sense and antisense transcription in Arabidopsis accessions Li Lei Plant Pathology, KSU [email protected] August 14, 2013


Li Lei (KSU) Poly-A read mapping in Arabidopsis. https://sites.google.com/site/toomajianlab/Home/people

Transcript of Bioinformatic jc 08_14_2013_formal

Page 1: Bioinformatic jc 08_14_2013_formal

Genome-wide variation of alternative polyadenylation in sense and antisense transcription in Arabidopsis accessions  

Li Lei Plant Pathology, KSU

[email protected] August 14, 2013

Page 2: Bioinformatic jc 08_14_2013_formal

Outline

•  Background
    Ø  Pre-mRNA processing & polyadenylation
    Ø  Alternative polyadenylation (APA)
    Ø  APA in plants & unknown questions
•  Objective
•  Method
    Ø  Approach
    Ø  PALMapper: map RNA-seq reads to reference
    Ø  How I retrieved the poly(A) reads
•  Result
    Ø  Evidence for APA
    Ø  Poly(A) site location & related gene annotation
•  Conclusion
•  Outlook
•  Acknowledgements

Page 3: Bioinformatic jc 08_14_2013_formal

Background

Eukaryotic pre-mRNA processing & polyadenylation

Poly(A) site (PAS)
•  For some genes, all mRNAs are polyadenylated at a single PAS
•  For other genes, mRNAs are polyadenylated at different PASs

Freitag, et al. 2012

[Slide figure: page 523 of Freitag, et al. 2012 (Letter, Nature, vol. 485, 24 May 2012). © 2012 Macmillan Publishers Limited. All rights reserved.]

Figure 1 | Cryptic peroxisomal targeting signals in Gapdh and Pgk1 in U. maydis. a, Schematic drawing of the gapdh gene of U. maydis. Different transcripts resulting from alternative polyadenylation and splicing are shown. pA1 and pA2 indicate alternative poly(A) sites. (I) and (II) denote alternative 5′ splice sites. The C-terminal tripeptide of the peroxisomal isoform is highlighted in blue. Different transcripts were quantified by EST and rtPCR analysis. The PTS1 scores of C-terminal dodecamers (underlined) were obtained using the PTS1-predictor program (ref. 14). Positive values indicate high probability of peroxisomal targeting. The C-terminal dodecamer of the splice variant (I) was fused to GFP and co-expressed with mCherry–SKL in U. maydis. DIC, differential interference contrast. Scale bar, 10 μm. b, Schematic drawing of the pgk1 gene of U. maydis. Protein sequences resulting from normal termination (1) and translational read-through (2) are indicated. The C-terminal tripeptide is highlighted in blue. The relative fraction of protein was determined by western blot analysis. Peroxisomal-targeting efficiency of the C-terminal dodecamer was determined as described in a. Scale bar, 10 μm.

Figure 2 | Dual targeting of GAPDH and PGK occurs in many fungi. a, Occurrence of dual targeting of glycolytic enzymes in different fungi (panel categories: alternative splicing, translational read-through, gene duplication, no evidence; NS, no sequence available). Peroxisomal localization of predicted PTS1 motifs was validated by expression of GFP fusion proteins in S. cerevisiae. b, Protein sequences resulting from normal translational termination and translational read-through of A. nidulans gpdA (GAPDH) are indicated. The C-terminal tripeptide is highlighted in blue. The PTS1 scores of C-terminal dodecamers (underlined) were obtained using the PTS1-predictor tool (ref. 14). The C-terminal dodecamer of sequence (2) was expressed as an RFP fusion in a yeast strain carrying a GFP–Sps19 fusion protein. Scale bar, 5 μm. c, Schematic drawing of the A. nidulans pgkA gene. Protein sequences resulting from polyadenylation within the intron (pA1) or from splicing (pA2) are indicated. The C-terminal tripeptide is highlighted in blue. ESTs of spliced pgkA transcripts were generated by RT–PCR and confirmed by sequencing. Peroxisomal-targeting efficiency of the C-terminal dodecamer was determined as described in b. Scale bar, 5 μm.

Alternative polyadenylation (APA): Different mRNAs transcribed from the same gene have different PASs
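As a minimal sketch of the definition above (not the analysis used in this talk), the Python below groups poly(A) site records by gene and reports the genes with more than one distinct PAS. The record layout, gene names, and coordinates are hypothetical, and a real analysis would first merge nearby sites to account for microheterogeneity.

```python
from collections import defaultdict


def genes_with_apa(pas_records):
    """Return {gene: sorted PAS positions} for genes with more than one distinct PAS.

    pas_records is an iterable of (gene_id, position) pairs; this layout is an
    assumption made for illustration only.
    """
    sites = defaultdict(set)
    for gene, position in pas_records:
        sites[gene].add(position)
    # A gene shows APA here if its mRNAs end at two or more different positions.
    return {gene: sorted(pos) for gene, pos in sites.items() if len(pos) > 1}


# Hypothetical records: geneA always ends at 1500, geneB ends at two places.
records = [("geneA", 1500), ("geneA", 1500), ("geneB", 900), ("geneB", 1320)]
print(genes_with_apa(records))  # {'geneB': [900, 1320]}
```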

Page 4: Bioinformatic jc 08_14_2013_formal

Alternative polyadenylation (APA)

Background

[Slide figure: page 225 of Mueller, Cheung and Rando, "All's well that ends well", Current Opinion in Cell Biology 2013, 25:222–232 (www.sciencedirect.com).]

Table 1 (Continued)
•  Method: EST analysis. Cell or tissue type: mouse and human samples. Major findings: 54% of human genes and 37% of mouse genes have multiple PASs; orthologs between the two species display similar polyadenylation patterns. Key references: [8]
•  Method: Global Study of Poly(A) Site Usage by Gene-based EST Vote (GAUGE). Cell or tissue type: 42 human tissues from polyA_DB. Major findings: systemic differences in PAS usage among tissues and identification of potential cis-regulatory elements associated with PASs in the brain; development of the polyA_DB database of mammalian mRNA polyadenylation. Key references: [34,35]
•  Method (listed under "Other"): Digital Gene Expression (DGE) on the basis of Massively Parallel Signature Sequencing (MPSS) and Illumina Sequencing by Synthesis (SBS), with analysis similar to GAUGE. Cell or tissue type: Arabidopsis and rice of various developmental stages and environmental exposures. Major findings: approximately 60% of Arabidopsis genes and 47–82% of rice genes contain multiple PASs, with 49–66% mapping within the coding region; genes that show differential PAS usage in different developmental stages make up 10% of the transcriptome. Key references: [36]

Figure 1. Major categories of APA. This model refers to a hypothetical gene with three exons and two PASs. (a) When both PASs are located in the 3′ UTR, identical proteins are produced. Because the 3′ UTR often contains elements regulating transcript stability, degradation, or localization, the quantity of protein produced may be altered depending upon PAS choice. (b) When one PAS is located in the coding region, a truncated protein is produced when the proximal PAS is chosen. Ex = exon, PAS = polyadenylation site; thick lines = UTR regions, thin lines = intronic regions.

Mueller, et al. 2012
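The two categories in the Mueller et al. figure (both PASs in the 3′ UTR versus one PAS in the coding region) can be sketched as a simple classification. This is an illustrative sketch only: it assumes a plus-strand gene, transcript-style coordinates in which cds_end is the last base of the stop codon, and the labels "UTR-APA" and "CR-APA" are names chosen here for the two cases.

```python
def classify_apa(cds_end, pas_positions):
    """Classify a gene's APA pattern from its PAS coordinates (plus strand assumed).

    cds_end: coordinate of the last base of the stop codon (assumption).
    pas_positions: cleavage positions of the gene's poly(A) sites.
    """
    distinct = sorted(set(pas_positions))
    if len(distinct) < 2:
        return "no APA"
    # A site upstream of the end of the coding sequence truncates the protein
    # when it is used (case b in the figure).
    if distinct[0] < cds_end:
        return "CR-APA"
    # All sites lie in the 3' UTR: same protein, different UTR length (case a).
    return "UTR-APA"


print(classify_apa(1000, [1200, 1450]))  # UTR-APA
print(classify_apa(1000, [700, 1450]))   # CR-APA
print(classify_apa(1000, [1200]))        # no APA
```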

Tian, et al. 2013

[Slide figure: page 315 of Tian, et al. 2013 (Review, Trends in Biochemical Sciences, June 2013, Vol. 38, No. 6).]

Figure 1. Alternative cleavage and polyadenylation sites (pAs) in a gene. (A) Alternative cleavage and polyadenylation (APA) in the 3′-most exon. A hypothetical gene is shown, with two pAs located in the 3′-most exon. The top gray line is genomic DNA with exons boxed, and bottom lines are mRNAs. Coding sequence (CDS) and 3′ untranslated region (UTR) are shown as thick and thin blue lines, respectively, as indicated in the graph, splicing as bent line, and pAs as arrowheads. AAAn indicates the poly(A) tail. (B) APA in upstream regions. The type of terminal exon is indicated. The top mRNA shows only splicing. Skipped, skipped terminal exon; composite, composite (internal/terminal) exon; exonic, upstream exon.

Figure 2. Regulation of cis elements in 3′ untranslated regions (UTRs) by alternative cleavage and polyadenylation (APA). Two mRNA isoforms are shown. The 3′ UTR region upstream of the proximal cleavage and polyadenylation site (pA) is called the constitutive UTR (cUTR), and the downstream region is called the alternative UTR (aUTR). RNA-binding protein (RBP) and miRNA targeting to the aUTR are shown. Impacts on mRNA localization, translation, and degradation are indicated. CDS, coding sequence.
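The cUTR and aUTR defined in the Tian et al. Figure 2 caption can be made concrete with a little arithmetic. The sketch below is an assumption-laden illustration: it uses spliced-transcript coordinates on the plus strand, takes cds_end as the last base of the stop codon, and all numbers are hypothetical.

```python
def utr_segments(cds_end, proximal_pa, distal_pa):
    """Return (cUTR length, aUTR length) for a gene with two 3' UTR pAs.

    cUTR: 3' UTR sequence shared by both isoforms (CDS end to proximal pA).
    aUTR: extra 3' UTR present only in the long isoform (proximal to distal pA).
    """
    if not cds_end <= proximal_pa < distal_pa:
        raise ValueError("expected cds_end <= proximal_pa < distal_pa")
    return proximal_pa - cds_end, distal_pa - proximal_pa


c_utr, a_utr = utr_segments(cds_end=1000, proximal_pa=1150, distal_pa=1600)
print(c_utr, a_utr)  # 150 450
```

Regulatory elements such as miRNA or RBP binding sites that fall in the aUTR are lost when the proximal pA is used, which is the point the figure makes.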


Adapted from Tress et al. 2007

Protein  isoforms  

[Slide figure: page 501 of Elkon, et al. 2013 (Reviews, Nature Reviews Genetics, volume 14, July 2013). © 2013 Macmillan Publishers Limited. All rights reserved.]

Figure 3 | Biological processes that have been linked with broad APA modulation. A schematic showing the biological processes and diseases that alternative polyadenylation (APA) has been linked with. In addition, the tendency towards distal or proximal poly(A) site usage is shown.

Panel labels: Global APA. Biological processes: development and cellular differentiation, neuron activity, proliferation. Connections to disease: cancer, oculopharyngeal muscular dystrophy. Favour distal poly(A) site usage; favour proximal poly(A) site usage.

Elkon, et al. 2013

Page 5: Bioinformatic jc 08_14_2013_formal

APA in plants and unknown questions?

Background

Although polyadenylation has been investigated genome-wide in a single Arabidopsis accession, we still do not know:

1.  How much does polyadenylation site usage vary across Arabidopsis accessions? What is the genetic basis for such variation? Cis regulation? Trans regulation?

2.  Is Arabidopsis an outlier for any of the trends of polyadenylation site usage compared with related species? How has APA evolved across related species?

Genome-wide landscape of polyadenylation inArabidopsis provides evidence for extensivealternative polyadenylationXiaohui Wua,b, Man Liua, Bruce Downiec, Chun Lianga, Guoli Jib, Qingshun Q. Lia,b,1, and Arthur G. Huntd,1

aDepartment of Botany, Miami University, Oxford, OH 45056; bDepartment of Automation, Xiamen University, Xiamen, Fujian 361005, People’s Republic ofChina; and cDepartment of Horticulture and Seed Biology Group, and dDepartment of Plant and Soil Sciences, University of Kentucky, Lexington,KY 40546-0312.

Edited by David C. Baulcombe, University of Cambridge, Cambridge, United Kingdom, and approved June 8, 2011 (received for review January 14, 2011)

Alternative polyadenylation (APA) has been shown to play animportant role in gene expression regulation in animals andplants. However, the extent of sense and antisense APA at thegenome level is not known. We developed a deep-sequencingprotocol that queries the junctions of 3!UTR and poly(A) tails andconfidently maps the poly(A) tags to the annotated genome. Theresults of this mapping show that 70% of Arabidopsis genes usemore than one poly(A) site, excluding microheterogeneity. Analy-sis of the poly(A) tags reveal extensive APA in introns and codingsequences, results of which can significantly alter transcript se-quences and their encoding proteins. Although the interplay ofintron splicing and polyadenylation potentially defines poly(A)site uses in introns, the polyadenylation signals leading to theuse of CDS protein-coding region poly(A) sites are distinct fromthe rest of the genome. Interestingly, a large number of poly(A)sites correspond to putative antisense transcripts that overlapwith the promoter of the associated sense transcript, a mode pre-viously demonstrated to regulate sense gene expression. Ourresults suggest that APA plays a far greater role in gene expres-sion in plants than previously expected.

alternative processing | antisense transcription | nonstop mRNAs

The polyadenylation of mRNA in eukaryotes is an importantstep in gene expression in eukaryotes. With few exceptions,

mature eukaryotic mRNAs possess a poly(A) tract, that in turnfunctions to facilitate transport of the mRNA to the cytoplasmand its subsequent stabilization and translation. The poly(A) tailcontributes regulatory information to each of these processesthrough interactions with RNA processing factors and poly(A)-binding proteins. The process of polyadenylation also contributesto regulation by “determining” the composition of the mRNAapart from the poly(A) tail. Thus, the position along the genewhere the pre-mRNA is processed and polyadenylated deter-mines the sequence content in terms of exons and regulatorymotifs. If a gene possesses more than one polyadenylation site,then the nature of the expressed mRNA can be altered via dif-ferential choice of these sites, a process that is called alternativepolyadenylation, or APA. That APA may be important is sug-gested by the observations that more than 50% of human andplant genes have multiple poly(A) sites (1–5). APA may be animportant factor in the regulation of genes associated with can-cer and with early embryo development in animals (6–8). APAhas also been implicated in global control of gene expression inneuronal cells in humans (9), and in the responses of genes tostress and developmental cues in Caenorhabditis elegans (10).In plants, there are many documented cases of APA (11–13).

Perhaps the best-studied example of APA in plants involves thenetwork of genes that control flowering time in Arabidopsis. Oneregulatory factor, FY, is a core polyadenylation complex subunit;this protein acts in concert with an RNA-binding protein, FCA,to promote polyadenylation within an intron in transcriptsencoded by the FCA gene (14). Two other core polyadenylationfactor subunits, CstF77 and CstF64, and a novel RNA-bindingprotein, FPA, control APA of antisense transcripts encoded by

the FLC gene (15, 16); these antisense transcripts are involved intranscriptional regulation of sense FLC mRNAs through chro-matin modifications in the vicinity of the sense FLC promoter.The regulation of these two genes thus provides examples of twomodes of APA, involving intronic polyadenylation and 3! endprocessing of antisense transcripts.Plant poly(A) site datasets (3, 17) have been assembled from

the analysis and curation of the results of EST and full-lengthcDNA sequencing projects. Unfortunately, these projects are notspecially targeted to the identification of poly(A) sites, nor arethey high-throughput. With this consideration in mind, a strategydesigned to specifically query the mRNA-poly(A) junction ona transcriptome-wide basis was developed and used to studypoly(A) site choice in Arabidopsis leaves and seeds. The resultsobtained using this strategy reveal an extensive network of po-tential APA in Arabidopsis, including unanticipated and novelmodes of APA. In addition, the results corroborate other reportssuggestive of wide-spread antisense transcription in Arabidopsis,and provide a dataset of poly(A) sites associated with antisensetranscripts. Finally, they provide evidence for tissue-specificpoly(A) site choice.

ResultsPreparation and Characterization of cDNA Tags That Query Poly-adenylation Sites. To study Arabidopsis poly(A) sites on a genome-wide basis, short DNA tags that include the mRNA-poly(A) sitejunction [called poly(A) tags, or PATs hereafter] were preparedand sequenced; the starting materials for these samples wereRNA isolated from dry seeds and the leaves of young seed-lings. The initial sequences were processed and mapped to theArabidopsis reference genome. After removing potential internalpriming candidates and eliminating tags that mapped to chlo-roplast and mitochondria genomes and to miscellaneous RNAs(primarily rRNAs), a collection of tags that defined more than280,000 individual poly(A) sites were obtained (Table S1). Be-cause poly(A) site microheterogeneity is ubiquitous in plants (3,4), poly(A) sites in the same gene that are located within 24 nt ofeach other were clustered so as to define a poly(A) site cluster(PAC). The results of this process were more than 71,000 PACswith an average of 54 PATs per PAC (Table S1). Of these PACs,57,473 were in the “sense” orientation with respect to an anno-


Data deposition: The sequence reported in this paper has been deposited in the National Center for Biotechnology Information Short Reads Archive (accession no. SRA028410).

¹To whom correspondence may be addressed. E-mail: [email protected] or [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1019732108/-/DCSupplemental.

www.pnas.org/cgi/doi/10.1073/pnas.1019732108 PNAS | July 26, 2011 | vol. 108 | no. 30 | 12533–12538


NATURE STRUCTURAL & MOLECULAR BIOLOGY VOLUME 19 NUMBER 8 AUGUST 2012 845

RESOURCE

Arabidopsis thaliana is an important model system that has had a critical role in discoveries essential to our understanding of plant biology and of generically important processes such as RNA interference (RNAi). Although the A. thaliana genome was sequenced more than a decade ago, challenges remain in resolving the RNAs that it encodes and determining their functional significance. Establishing where transcripts end is essential in genome annotation and for understanding gene function. Alternative cleavage and polyadenylation (APA) defines different 3′ ends within pre-mRNA transcribed from the same gene, and this can affect function by determining coding potential or the inclusion of regulatory sequence elements [1,2]. This regulation of RNA 3′-end formation is considerably more widespread than previously thought [1,2], and RNA-binding proteins that enable A. thaliana flowering provide important examples of the biological impact of this control [3]. Defective 3′-end formation and transcription termination at tandem or convergent gene pairs can result in transcription interference or RNAi [4,5], revealing that these processes normally partition the genome and maintain expression of neighboring genes [6]. Accordingly, such consequences of uncontrolled 3′-end formation also emphasize the critical nature of gene arrangement along a eukaryotic chromosome.

As a prelude to the analysis of regulators of 3′-end formation, we set out to map A. thaliana RNA 3′ ends genome-wide. Previous high-throughput A. thaliana transcriptome studies have depended on the copying of RNA into complementary DNA (cDNA) with reverse transcriptase [7–10]. However, the intrinsic template switching [11] and DNA-dependent DNA-polymerase [12] activities of reverse transcriptases, together with oligo(dT)-dependent internal priming [13], cause well-established artifacts that can affect the identification of authentic antisense RNAs [14,15], splicing events [14] and RNA 3′ ends [13,16].

Different strategies have been developed to address these problems, making strand-specific RNA sequencing an increasingly powerful tool for the analysis of transcriptomes. However, a recent comparison of several such methods showed marked differences not only in strand specificity but also in a range of criteria that influence transcriptome interpretation [17]. Therefore, as an alternative, we used direct RNA sequencing (DRS) to identify polyadenylated A. thaliana RNAs [18]. This approach is direct in the sense that native RNA is used as the sequencing template, but the sequence is read by imaging complementary fluorescent nucleotides incorporated by a polymerase. In this true single-molecule sequencing (tSMS) procedure, the site of RNA cleavage and polyadenylation is defined with an accuracy of 2 nucleotides (nt) in the absence of errors induced by reverse transcriptase, ligation or amplification [18].

RESULTS

Mapping A. thaliana RNA 3′ ends

Total RNA purified from A. thaliana seedlings was subjected to DRS, and a computational procedure to align reads uniquely to the most recent A. thaliana genome release (currently TAIR10) was developed. The initial mapping analysis revealed that the vast majority of reads (89.60%) aligned to protein-coding genes, which is consistent with the idea that this approach can identify authentic sites of mRNA cleavage and polyadenylation (Fig. 1a). These data define extremely heterogeneous patterns of RNA 3′-end formation (Fig. 1b) that differ markedly from those of human mRNAs analyzed in the same way (Supplementary Fig. 1a) [18].

Although nontemplated base addition between cleavage sites and the poly(A) tail has been reported from analysis of A. thaliana expressed-sequence-tag (EST) data [19], we found no evidence for this phenomenon

¹College of Life Sciences, University of Dundee, Dundee, UK. ²Department of Cell and Molecular Sciences, James Hutton Institute, Invergowrie, Dundee, UK. ³Helicos BioSciences Corporation, Cambridge, Massachusetts, USA. Correspondence should be addressed to G.G.S. ([email protected]) or G.J.B. ([email protected]).

Received 16 February; accepted 19 June; published online 22 July 2012; doi:10.1038/nsmb.2345

Direct sequencing of Arabidopsis thaliana RNA reveals patterns of cleavage and polyadenylation

Alexander Sherstnev¹, Céline Duc¹, Christian Cole¹, Vasiliki Zacharaki¹, Csaba Hornyik², Fatih Ozsolak³, Patrice M Milos³, Geoffrey J Barton¹ & Gordon G Simpson¹,²

It has recently been shown that RNA 3′-end formation plays a more widespread role in controlling gene expression than previously thought. To examine the impact of regulated 3′-end formation genome-wide, we applied direct RNA sequencing to A. thaliana. Here we show the authentic transcriptome in unprecedented detail and describe the effects of 3′-end formation on genome organization. We reveal extreme heterogeneity in RNA 3′ ends, discover previously unrecognized noncoding RNAs and propose widespread reannotation of the genome. We explain the origin of most poly(A)+ antisense RNAs and identify cis elements that control 3′-end formation in different registers. These findings are essential to understanding what the genome actually encodes, how it is organized and how regulated 3′-end formation affects these processes.


WIREs RNA Polyadenylation and gene expression regulation

initial efforts to express Bt toxin in plants were not successful as a result of truncated mRNA generated by cleavage and polyadenylation on the bacterial gene transcripts [47]. To predict unwanted polyadenylation sites, computer algorithms were designed based on the characteristics of plant poly(A) signals [48,49]. A generalized hidden Markov model was used to build a poly(A) site sleuth (PASS) program [35,49]. Its prediction results could be correlated with the data from the mutation analysis performed in tobacco. Recently, the same group designed a new and more versatile classifier-based model in which polyadenylation signal parameters from different species may be more easily incorporated [50]. The computational implementations of these algorithms were also made available [51].

These algorithms may also be applied to predict alternative poly(A) sites of known genes or poly(A) sites of genes of newly sequenced genomes. Indeed, large-scale prediction of the Arabidopsis chromosome segment matched most known poly(A) sites, whereas other high-score sites may predict realistic, but unconfirmed, poly(A) sites [49].

APA IN PLANTS

Because the site of poly(A) addition marks the end of a transcript, selection of the poly(A) site may determine the gene coding information. The use of an alternative poly(A) site could alter the coding capacity and/or change the inclusion of key sequence elements for mRNA stability, location, or suppression of translation. Thus, it is conceivable that APA could be used for gene expression regulation. Indeed, increasing evidence supports this notion, as mentioned in Introduction section.

APA has long been documented in a variety of plant species. A few examples include multiple polyadenylation sites that were identified from genes encoding phosphoenolpyruvate carboxylase and α1-tubulin in maize [52,53]; β-glucanase isozyme GV in barley [54]; Agamous and rbohA in Arabidopsis [55,56]; chloroplast ascorbate peroxidase in spinach and tobacco [57]; and a set of soybean genes [46]. More importantly, the functionality of APA was also demonstrated in a range of biological processes. In Brassica, the S locus genes are essential for self-incompatibility. The APA of the pre-mRNA of the S locus receptor kinase gene within intron-3 produced a 1.6-kb transcript encoding a putative secreted protein, whereas the removal of the intron resulted in a 1.8-kb transcript encoding a predicted membrane-anchored protein [58,59]. In peach, two ethylene receptor-related polypeptides, with and without the C-terminal receiver domain, were postulated to be derived from

[Figure 2 labels: FCA, FPA, FLC, OXT6 (AtCPSF30 and AtCPSF30*-YT521B transcripts); P = proximal poly(A) site, D = distal poly(A) site; panels a, b, c]

FIGURE 2 | Schematic representation of alternative polyadenylation of sense transcripts of FCA, FPA, CPSF30, and antisense transcripts of FLC. (a) Gene structure; (b) The transcript derived from proximal poly(A) site (P); (c) The transcript derived from the distal poly(A) site (D). The open boxes denote exons, the solid lines denote introns, and the dotted lines indicate joining exons. The filled black boxes represent the gene segment being exon (and/or part of UTR) when the proximal poly(A) site is used, but being intron when the distal poly(A) site is used. The double backward slash lines denote exons that are not shown.

the APA of a single gene [60]. In cotton, the APA of lysine-ketoglutarate reductase (LKR) pre-mRNA generated two transcripts encoding monofunctional and bifunctional LKR polypeptides, respectively, which was believed to be important for enabling efficient flux of lysine catabolism under specific conditions [61]. The APA of LKR pre-mRNA was also found to be conserved in Arabidopsis [62]. Analogous to the processing of LKR pre-mRNA, the pre-mRNA of BIO3-BIO1 composite locus, which was required for biotin biosynthesis in Arabidopsis, could also be alternatively polyadenylated to produce both BIO3 and BIO3-BIO1 isoforms encoding mono- and bifunctional polypeptides [63]. Notwithstanding such documentation supporting the function and biological relevance of these APAs, the molecular mechanisms underlying their regulation and, hence, their biological effects remain to be elucidated. Perhaps the best studied APAs in plants involve flowering time control and oxidative stress response of Arabidopsis (Figure 2). The following sections will focus on these two aspects.

APA and Flowering Time Regulation

In Arabidopsis flowering time control, there are three genes, FCA, FPA, and FLC, whose transcripts are subjected to APA. FCA and FPA are autonomous

Volume 2, May/June 2011 © 2010 John Wiley & Sons, Ltd. 449

Xing, et al. 2012

[Diagram: one gene with two poly(A) sites, PAS1 and PAS2, giving rise to Transcript1 and Transcript2]

Page 6: Bioinformatic jc 08_14_2013_formal

Investigate genome-wide variation of alternative polyadenylation in sense and antisense transcription across a set of Arabidopsis thaliana accessions  

Objective

•  Is variation in APA as prevalent across genotypes as across tissue types?

•  Is there a genetic basis for variation in APA related to trans as well as cis regulation?

•  Does a gene’s proximity to neighboring genes constrain polyadenylation site choice and limit variation?

Page 7: Bioinformatic jc 08_14_2013_formal

Approach

Method

19 accessions (genome sequenced)

Seedling, root, floral bud

RNA extraction & library construction with barcode

82 bp strand-specific RNA-seq

Map reads to each corresponding genome (PALMapper)

Transform read positions from each transcriptome into a common coordinate system based on a multiple-genome alignment

Retrieve poly(A)-containing reads, cluster across all accessions and identify poly(A) sites (PASs)

Generate read counts for each PAS for each accession

Compare PASs genome-wide across accessions

Page 8: Bioinformatic jc 08_14_2013_formal

PALMapper: map RNA-seq reads to reference

•  PALMapper (Jean, et al. 2010)

•  A combination of:

the spliced alignment method QPALMA (De Bona, et al. 2008)

the short read alignment tool GenomeMapper (Schneeberger, et al. 2009)

Version 0.5 released:

http://ftp.raetschlab.org/software/palmapper/palmapper-0.5.tar.gz

Method

Adapted from Kahles, et al. 2013 talk

Why Another Mapper?

Andre Kahles (SKI, New York) PALMapper HiTSeq, July 20, 2013 1

Memorial Sloan-Kettering Cancer Center

Advantages:

•  Alignments with variants, e.g. mismatches, indels

•  Accurate spliced alignments using computational splice site predictions

•  More accurate than TopHat (e.g. C. elegans 47% & 81%, respectively)

•  Fast alignments (about 10 million reads/hour)

•  Softtrimming for poly(A) tail of each read

Page 9: Bioinformatic jc 08_14_2013_formal

Softtrimming

•  The sequence remains in the BAM file

•  Annotated with the CIGAR “S” operation

•  Ignored by many tools such as IGV
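The softtrimmed bases can be read back out of an alignment record by walking its CIGAR string. The following is a minimal sketch, not the method used in the talk: the function names `parse_cigar` and `soft_clipped_ends` are hypothetical, and it assumes the record's sequence and CIGAR have already been pulled from the SAM/BAM file as plain strings.

```python
import re

# One CIGAR operation = a length followed by an operation letter (SAM spec).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def soft_clipped_ends(seq, cigar):
    """Return the (left, right) soft-clipped parts of the stored read sequence.

    Hard clips (H) are skipped because hard-clipped bases are not stored in seq.
    """
    core = [(n, op) for n, op in parse_cigar(cigar) if op != "H"]
    left = core[0][0] if core and core[0][1] == "S" else 0
    right = core[-1][0] if len(core) > 1 and core[-1][1] == "S" else 0
    return seq[:left], seq[len(seq) - right:]
```

For example, a read stored as `ACGTACGTAAAAAAAA` with CIGAR `8M8S` yields an empty left clip and `AAAAAAAA` as the right clip, which is the part a poly(A) filter would then inspect.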

Page 10: Bioinformatic jc 08_14_2013_formal

How did I retrieve the poly(A) reads?

Method

The mapped sam file with softtrimmed poly(A)

[Diagram: an RNA-seq read aligned 5′ to 3′ against the + strand of the genome. Two cases are labelled: softtrimming of the read's 3′ run of As (AAAAAAAA), and huge splicing (splicing length >= 1500 bp)]

Perl programming to pick up poly(A) reads:

•  Consecutive As in 3′ end of reads >= 8 bp

•  Quality score of each A >= 40
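The two read-level criteria on this slide (at least 8 consecutive As at the 3′ end, each with quality >= 40) can be sketched in a few lines. The talk used a Perl script that is not shown; the Python below is only an illustration under stated assumptions: qualities are Phred+33 encoded, the read is given in its original 5′ to 3′ orientation, and a low-quality A ends the run. The function names are hypothetical.

```python
def phred_scores(qual_string, offset=33):
    """Decode an ASCII quality string into integer Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in qual_string]


def poly_a_tail_length(seq, qual_string, min_qual=40):
    """Count As at the 3' end of the read, stopping at the first base that is
    not an A or whose quality is below min_qual (an assumed reading of the rule)."""
    run = 0
    for base, score in zip(reversed(seq.upper()), reversed(phred_scores(qual_string))):
        if base != "A" or score < min_qual:
            break
        run += 1
    return run


def is_poly_a_read(seq, qual_string, min_run=8, min_qual=40):
    """True if the read ends in at least min_run high-quality As."""
    return poly_a_tail_length(seq, qual_string, min_qual) >= min_run
```

Reads aligned to the minus strand are stored reverse-complemented in SAM, so a real pipeline would need to look for leading Ts in that case; that step is left out of this sketch.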

Page 11: Bioinformatic jc 08_14_2013_formal

Defining poly(A) clusters (PAS)

Result

Identify poly(A) reads across accessions: 2,203,313

Cluster poly(A) reads: 75,532 PASs

•  In the same orientation

•  Within 10bp of each other across all accessions

•  Total cluster interval spanning <= 24bp
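The clustering rules can be illustrated with a greedy single pass over sorted positions. This is a sketch under assumptions, not the procedure actually used: it treats "within 10bp" as the gap to the previous site in the cluster, treats the span as last position minus first position, and the function names are hypothetical.

```python
def cluster_sites(positions, max_gap=10, max_span=24):
    """Greedily cluster poly(A) positions from one chromosome and one strand.

    A site joins the current cluster if it lies within max_gap of the previous
    site and keeps the cluster's total span (last - first) within max_span.
    """
    clusters = []
    for pos in sorted(positions):
        current = clusters[-1] if clusters else None
        if current and pos - current[-1] <= max_gap and pos - current[0] <= max_span:
            current.append(pos)
        else:
            clusters.append([pos])
    return clusters


def cluster_by_strand(sites):
    """Cluster (chrom, strand, pos) tuples separately for each chromosome and strand,
    so that only sites in the same orientation are merged."""
    grouped = {}
    for chrom, strand, pos in sites:
        grouped.setdefault((chrom, strand), []).append(pos)
    return {key: cluster_sites(values) for key, values in grouped.items()}
```

With positions 100, 105, 112, 120, 126 and 200, the first four form one cluster; 126 is within 10 bp of 120 but would stretch the span to 26 bp, so it starts a new cluster.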

Map PASs to genic regions (±120bp to the annotated range):

•  93.4% PASs map to genic regions

•  6.6% PASs further away from genic regions

Consider the sense & antisense PASs:

•  Poly(A) reads orientation relative to the gene orientation

•  6581 genes with >= 20 sense poly(A) reads across accessions

•  1473 genes with >= 10 antisense poly(A) reads across accessions
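Assigning a PAS to a gene and labelling it sense or antisense can be sketched as an interval check with a ±120 bp flank followed by a strand comparison. The `Gene` record, the function name and the example coordinates are hypothetical, and the sketch assumes the PAS strand already reflects the strand of the transcript (which depends on the strand-specific library protocol).

```python
from collections import namedtuple

# Hypothetical minimal gene annotation record (1-based inclusive coordinates).
Gene = namedtuple("Gene", "name chrom start end strand")


def classify_pas(chrom, pos, strand, genes, flank=120):
    """Return (gene_name, label) pairs for every gene whose annotated range,
    extended by flank bp on both sides, contains the PAS position."""
    hits = []
    for gene in genes:
        if gene.chrom == chrom and gene.start - flank <= pos <= gene.end + flank:
            label = "sense" if strand == gene.strand else "antisense"
            hits.append((gene.name, label))
    return hits or [(None, "intergenic")]
```

A linear scan over all genes is fine for illustration; a genome-wide run would normally use sorted intervals or an interval tree instead.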

Page 12: Bioinformatic jc 08_14_2013_formal

Reads mapping to the major and non-major poly(A) cluster within gene

Result

•  Major PAS: the PAS with the most reads across all accessions for each gene

•  p = proportion of total reads in gene mapping to major PAS

•  q = 1-p = proportion of total reads in gene mapping to non-major PASs
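The definitions of p and q translate directly into a small calculation. The sketch below assumes per-gene read counts are available as a dictionary keyed by PAS identifier; the function names and the example counts are hypothetical.

```python
def major_pas(read_counts):
    """Return the PAS with the most reads, given {pas_id: reads summed across accessions}."""
    return max(read_counts, key=read_counts.get)


def major_and_non_major_proportions(read_counts, major=None):
    """Return (p, q): the proportion of the gene's reads at the major PAS and q = 1 - p.

    Pass `major` explicitly to score a single accession against the major PAS
    that was chosen from the counts summed across all accessions.
    """
    total = sum(read_counts.values())
    if total == 0:
        return None  # gene has no poly(A) reads in this sample
    if major is None:
        major = major_pas(read_counts)
    p = read_counts.get(major, 0) / total
    return p, 1.0 - p
```

For counts of 60, 30 and 10 reads at three PASs, the major PAS carries p = 0.6 of the reads and q = 0.4 goes to the non-major PASs.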

Page 13: Bioinformatic jc 08_14_2013_formal

The distribution of the proportion of reads mapping to non-major sense & antisense poly(A) clusters per gene

Genes with a proportion of non-major cluster reads equal to or greater than 0.4 (indicated with gray dashed lines) were considered to contain alternative poly(A) sites and were chosen for further polymorphism analysis

Result

6581 genes with sense PASs

1471 genes with antisense PASs

Page 14: Bioinformatic jc 08_14_2013_formal

Pairwise difference in the proportion of reads mapping to non-major poly(A) clusters across accessions

Result

•  For the ith and jth accessions $A_i$ and $A_j$, we can calculate the absolute difference of the proportion of reads mapping to non-major poly(A) clusters, here called $D_{ij}$:

$D_{ij} = |q_{A_i} - q_{A_j}|$

•  Average pairwise difference:

$D = \frac{1}{n} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} D_{ij}$, where $n = 19$

•  Maximum pairwise difference:

$D_{max} = \max\{D_{ij}\}$
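The pairwise statistics for one gene can be sketched as follows, given the non-major proportion q for each accession. The function names and accession labels are hypothetical. The sketch reports the sum of pairwise differences divided by n, as the formula on this slide is written, and also the mean over the n(n-1)/2 pairs as an alternative normalisation for comparison.

```python
from itertools import combinations


def pairwise_differences(q_by_accession):
    """Return {(acc_i, acc_j): |q_i - q_j|} for every unordered pair of accessions."""
    return {
        (a, b): abs(q_by_accession[a] - q_by_accession[b])
        for a, b in combinations(sorted(q_by_accession), 2)
    }


def summarize_gene(q_by_accession):
    """Summarise pairwise differences in q for one gene (needs >= 2 accessions)."""
    diffs = pairwise_differences(q_by_accession)
    n = len(q_by_accession)
    total = sum(diffs.values())
    return {
        "D_sum_over_n": total / n,               # normalised by n, as on the slide
        "D_mean_over_pairs": total / len(diffs),  # assumed alternative: mean over pairs
        "D_max": max(diffs.values()),
    }
```

With four accessions whose q values are 0.0, 0.5, 1.0 and 0.5, the six pairwise differences sum to 3.0, giving 0.75 when divided by n, 0.5 when averaged over pairs, and a maximum of 1.0.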

Page 15: Bioinformatic jc 08_14_2013_formal

Pairwise difference in the proportion of reads mapping to non-major poly(A) clusters across accessions

3074 genes with sense PAS

Result

Average pairwise difference Maximum pairwise difference Dmax

Page 16: Bioinformatic jc 08_14_2013_formal

Pairwise difference in the proportion of reads mapping to non-major poly(A) clusters across accessions

544 genes with antisense PAS

Result

Maximum pairwise difference Dmax Average pairwise difference

Page 17: Bioinformatic jc 08_14_2013_formal

Gene position and antisense PAS

Result

Nearby gene: a gene whose distance from its adjacent gene is <= 2 kb

Groups | Fraction of genes in each group | Fraction of genes with sense poly(A) reads >=20 | Fraction of genes with proportion of non-major sense PASs >0.4 | Fraction of genes with antisense poly(A) reads >=10 | Fraction of genes with proportion of non-major antisense PASs >0.4
A | 57.87% | 62.92% | 62.94% | 96.91% | 97.79%
B | 20.48% | 21.30% | 20.59% | 1.65% | 0.74%
C | 21.64% | 15.77% | 16.46% | 1.43% | 1.47%
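The "nearby gene" criterion (an adjacent gene within 2 kb) can be sketched as a pass over genes sorted by start coordinate. This is an illustration only: it does not reproduce how groups A, B and C were defined, it handles a single chromosome, it approximates adjacency by start order (which can be imprecise for nested genes), and the function name and coordinates are hypothetical.

```python
def nearby_gene_flags(genes, max_dist=2000):
    """Flag each gene that has an adjacent gene within max_dist bp.

    genes: list of (name, start, end) tuples on one chromosome.
    Overlapping neighbours are treated as distance 0.
    """
    ordered = sorted(genes, key=lambda g: g[1])
    flags = {}
    for i, (name, start, end) in enumerate(ordered):
        gaps = []
        if i > 0:
            gaps.append(max(0, start - ordered[i - 1][2]))  # gap to upstream neighbour
        if i < len(ordered) - 1:
            gaps.append(max(0, ordered[i + 1][1] - end))     # gap to downstream neighbour
        flags[name] = bool(gaps) and min(gaps) <= max_dist
    return flags
```

Adding each neighbour's strand to the tuples would let the same pass record whether the nearby gene is convergent, divergent or tandem, which is the kind of orientation information the antisense comparison would need.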

Page 18: Bioinformatic jc 08_14_2013_formal

Conclusion

•  For genes with more sense & antisense poly(A) reads, half use non-major PAS at least 40% of the time

•  Pairwise comparison across all accessions helped to identify the best candidate genes for polymorphism in the usage or position of major PASs

Page 19: Bioinformatic jc 08_14_2013_formal

Outlook

•  Combine all tissues & all accessions, calculate & its variance

•  Associate with gene categories, poly(A) site location of genes, etc.

•  Examine the trans/cis poly(A) QTL with the MAGIC lines’ data

•  Check the relationship between the antisense poly(A) site & the orientation of nearby genes, and the relationship this may have with expression level

•  Check the data from related species, Capsella rubella & A. lyrata, to look at APA usage & its evolution between species

•  Ask whether A. thaliana is an outlier for any of the trends observed, and whether APA is derived in A. thaliana

Page 20: Bioinformatic jc 08_14_2013_formal

Acknowledgements

Kansas State University
Dr. Chris Toomajian

University of Utah
Dr. Richard Clark
Dr. Joshua Steffen
Edward J. Osborne
Robert Greenhalgh

Wellcome Trust Centre for Human Genetics, University of Oxford
Dr. Richard Mott

Memorial Sloan-Kettering Cancer Center
Dr. Gunnar Raetsch
Philipp Drewe
Andre Kahles

Page 21: Bioinformatic jc 08_14_2013_formal
Page 22: Bioinformatic jc 08_14_2013_formal

Alternative polyadenylation (APA)

Background

expressed than their longer counterparts [22]. Several mechanisms can explain why changes in 3′ UTR length may affect protein abundance. One of the best-characterized processes is that of microRNA (miR)-mediated degradation. In studies of myogenic [43,44••], hematopoietic [28], and cancer [45] cells, transcripts bearing shorter 3′ UTRs contained fewer miRNA-binding sites, thus allowing these transcripts to evade miRNA-mediated degradation. Transcripts are also subject to length-dependent degradation by the nonsense-mediated decay (NMD) pathway [46,47]. In NMD, Upf1 binds to the 3′ UTR in a length-dependent manner, thus eliciting degradation of longer transcripts more rapidly [48•].

The 3′ UTR contains elements that affect not only transcript degradation but also stability. In a genome-wide computational analysis of sequence and stability

All’s well that ends well Mueller, Cheung and Rando 225

Table 1 (Continued)

Method | Cell or tissue type | Major findings | Key references
EST Analysis | Mouse and human samples | 54% of human genes and 37% of mouse genes have multiple PASs. Orthologs between the two species display similar polyadenylation patterns | [8]
Global Study of Poly(A) Site Usage by Gene-based EST Vote (GAUGE) | 42 human tissues from polyA_DB | Systemic differences in PAS usage among tissues and identification of potential cis-regulatory elements associated with PASs in the brain. Development of polyA_DB database of mammalian mRNA polyadenylation | [34,35]
Other
Digital Gene Expression (DGE) on the basis of Massively Parallel Signature Sequencing (MPSS) and Illumina Sequencing by Synthesis (SBS) and analysis similar to GAUGE | Arabidopsis and rice of various developmental stages and environmental exposures | Approximately 60% of Arabidopsis genes and 47–82% of rice genes contain multiple PASs with 49–66% mapping within the coding region. Genes that show differential PAS usage in different developmental stages make up 10% of the transcriptome | [36]

Figure 1. Major categories of APA. This model refers to a hypothetical gene with three exons and two PASs. (a) When both PASs are located in the 3′ UTR, then identical proteins are produced. Because the 3′ UTR often contains elements regulating transcript stability, degradation, or localization, the quantity of protein produced may be altered depending upon PAS choice. (b) When one PAS is located in the coding region, a truncated protein is produced when the proximal PAS is chosen. Ex = exon, PAS = polyadenylation site; thick lines = UTR regions, thin lines = intronic regions.

www.sciencedirect.com Current Opinion in Cell Biology 2013, 25:222–232

Adapted from Tress et al. 2007

Protein  isoforms  

Page 23: Bioinformatic jc 08_14_2013_formal

Outlook

•  Combine all tissues and all accessions, take each tissue as subset, calculate and its variance

•  For each tissue, associate with gene categories according to GO analysis & gene families

•  Compare the distribution of from different tissues, and PAS usage patterns among tissues or accessions

•  Check Ka/Ks for genes with high/low in all tissues

•  Check the poly(A) site location for genes with high , e.g. 3'UTR, CDS, 5'UTR or intron

•  Compare the location across accessions

•  Look at the relationship of location with gene expression level

•  Examine the cis poly(A) QTL with the MAGIC lines’ RNA-seq data

•  Check the relationship between the antisense poly(A) site and the orientation of nearby genes for each tissue subset, and the relationship this may have with expression level

•  Check the data from Capsella rubella and A. lyrata to look at APA usage and its evolution between species

•  Ask whether A. thaliana is an outlier for any of the trends observed, and whether APA is derived in A. thaliana

Page 24: Bioinformatic jc 08_14_2013_formal

Tian, et al. 2013

differentiated cells are reprogrammed to ES cell-like induced pluripotent stem (iPS) cells [41]. A notable exception, however, has been observed with spermatogonial germ cells, whose reprogramming to ES cells involves 3′ UTR lengthening [41]. Notably, this is in line with the fact that germ cells are more proliferative than ES cells. Similar trends of 3′ UTR length regulation have been reported for comparisons of ES cells versus neural stem/progenitor (NSP) cells or neurons [42]. Although these studies have all pointed to a connection between 3′ UTR length and cell proliferation, cardiac hypertrophy, in which myocytes grow in size rather than in number, has also been found to involve 3′ UTR shortening [43]. Thus, a general rule may be that APA regulation is correlated with cell growth.

Cancer

Cancer cells are of course highly proliferative. In keeping with this, and consistent with the above, cancer cells have been found to express, in general, mRNAs with shortened 3′ UTRs, as first shown in transformed cell lines [44] and in mouse B-cell leukemia/lymphoma models [45], and more recently in human colorectal carcinomas [46] and breast and lung cancers [47]. In the study by Singh et al. [45], the APA profile was found to be informative in separating tumor subtypes with different survival consequences, indicating its relevance to cancer development and utility as a diagnostic marker. One key question concerning APA regulation in cancer is whether proliferation or transformation is the major driver of APA. Meta-analysis of microarray data from transformed and nontransformed cells with similar predicted proliferation rates has led to the conclusion that cell transformation has a significant role in 3′ UTR regulation [44]. However, a recent study has shown that, by comparing the same cells (BJ primary fibroblast and mammary epithelial cell line MCF10A) in proliferating, arrested, and transformed states, proliferation is a more important determinant of 3′ UTR length [48]. Adding to the complexity of 3′ UTR regulation in cancer, Fu et al. [49] have reported that, compared to MCF10A, breast cancer cell lines MCF7 and MB231 show shortened and lengthened 3′ UTRs, respectively. Notably, it has also been reported that, contrary to the general trend, some gene groups, such as cell–cell adhesion genes, tend to express mRNAs with lengthened 3′ UTRs in cancer cells [45,46]. Therefore, it remains to be fully delineated how APA of different transcripts is regulated in different cancer types and at different stages.

APA is modulated by multiple mechanisms

Regulation of core C/P factor expression

The core components of the mammalian C/P machinery include ~15 polypeptides, most of which exist in multisubunit subcomplexes (Box 3 and Figure 3). Regulation of

Figure 1. Alternative cleavage and polyadenylation sites (pAs) in a gene. (A) Alternative cleavage and polyadenylation (APA) in 3′-most exon. A hypothetical gene is shown, with two pAs located in the 3′-most exon. The top gray line is genomic DNA with exons boxed, and bottom lines are mRNAs. Coding sequence (CDS) and 3′ untranslated region (UTR) are shown as thick and thin blue lines, respectively, as indicated in the graph, splicing as bent line, and pAs as arrowheads. AAAn indicates the poly(A) tail. (B) APA in upstream regions. The type of terminal exon is indicated. The top mRNA shows only splicing. Skipped, skipped terminal exon; composite, composite (internal/terminal) exon; exonic, upstream exon.

Figure 2. Regulation of cis elements in 3′ untranslated regions (UTRs) by alternative cleavage and polyadenylation (APA). Two mRNA isoforms are shown. The 3′ UTR region upstream of the proximal cleavage and polyadenylation site (pA) is called the constitutive UTR (cUTR), and the downstream region is called the alternative UTR (aUTR). RNA-binding protein (RBP) and miRNA targeting to the aUTR are shown. Impacts on mRNA localization, translation, and degradation are indicated. CDS, coding sequence.

Review Trends in Biochemical Sciences June 2013, Vol. 38, No. 6

315

Page 25: Bioinformatic jc 08_14_2013_formal

Alternative polyadenylation (APA)

Background

Mueller, et al. 2012

Tian, et al. 2013

differentiated cells are reprogrammed to ES cell-like induced pluripotent stem (iPS) cells [41]. A notable exception, however, has been observed with spermatogonial germ cells, whose reprogramming to ES cells involves 3′ UTR lengthening [41]. Notably, this is in line with the fact that germ cells are more proliferative than ES cells. Similar trends of 3′ UTR length regulation have been reported for comparisons of ES cells versus neural stem/progenitor (NSP) cells or neurons [42]. Although these studies have all pointed to a connection between 3′ UTR length and cell proliferation, cardiac hypertrophy, in which myocytes grow in size rather than in number, has also been found to involve 3′ UTR shortening [43]. Thus, a general rule may be that APA regulation is correlated with cell growth.

Cancer

Cancer cells are of course highly proliferative. In keeping with this, and consistent with the above, cancer cells have been found to express, in general, mRNAs with shortened 3′ UTRs, as first shown in transformed cell lines [44] and in mouse B-cell leukemia/lymphoma models [45], and more recently in human colorectal carcinomas [46] and breast and lung cancers [47]. In the study by Singh et al. [45], the APA profile was found to be informative in separating tumor subtypes with different survival consequences, indicating its relevance to cancer development and utility as a diagnostic marker. One key question concerning APA regulation in cancer is whether proliferation or transformation is the major driver of APA. Meta-analysis of microarray data from transformed and nontransformed cells with similar predicted proliferation rates has led to the conclusion that cell transformation has a significant role in 3′ UTR regulation [44]. However, a recent study has shown that, by comparing the same cells (BJ primary fibroblast and mammary epithelial cell line MCF10A) in proliferating, arrested, and transformed states, proliferation is a more important determinant of 3′ UTR length [48]. Adding to the complexity of 3′ UTR regulation in cancer, Fu et al. [49] have reported that, compared to MCF10A, breast cancer cell lines MCF7 and MB231 show shortened and lengthened 3′ UTRs, respectively. Notably, it has also been reported that, contrary to the general trend, some gene groups, such as cell–cell adhesion genes, tend to express mRNAs with lengthened 3′ UTRs in cancer cells [45,46]. Therefore, it remains to be fully delineated how APA of different transcripts is regulated in different cancer types and at different stages.

APA is modulated by multiple mechanisms

Regulation of core C/P factor expression

The core components of the mammalian C/P machinery include ~15 polypeptides, most of which exist in multisubunit subcomplexes (Box 3 and Figure 3). Regulation of

Figure 1. Alternative cleavage and polyadenylation sites (pAs) in a gene. (A) Alternative cleavage and polyadenylation (APA) in the 3′-most exon. A hypothetical gene is shown, with two pAs located in the 3′-most exon. The top gray line is genomic DNA with exons boxed, and bottom lines are mRNAs. Coding sequence (CDS) and 3′ untranslated region (UTR) are shown as thick and thin blue lines, respectively, as indicated in the graph, splicing as bent line, and pAs as arrowheads. AAAn indicates the poly(A) tail. (B) APA in upstream regions. The type of terminal exon is indicated. The top mRNA shows only splicing. Skipped, skipped terminal exon; composite, composite (internal/terminal) exon; exonic, upstream exon.

Figure 2. Regulation of cis elements in 3′ untranslated regions (UTRs) by alternative cleavage and polyadenylation (APA). Two mRNA isoforms are shown. The 3′ UTR region upstream of the proximal cleavage and polyadenylation site (pA) is called the constitutive UTR (cUTR), and the downstream region is called the alternative UTR (aUTR). RNA-binding protein (RBP) and miRNA targeting to the aUTR are shown. Impacts on mRNA localization, translation, and degradation are indicated. CDS, coding sequence.

Review, Trends in Biochemical Sciences, June 2013, Vol. 38, No. 6

Adapted from Tress et al. 2007

Protein  isoforms  

targets, indicating reduced cleavage at the proximal poly(A) site [42]. Similarly, ELL2, another transcription elongation factor, was suggested to promote loading of the polyadenylation factor CSTF on the transcription machinery, thereby enhancing usage of the proximal poly(A) site of IgM [43], providing an additional mechanism for the switch of membrane-bound IgM to the secreted IgM form. Furthermore, a global positive correlation between gene expression level and relative usage of proximal poly(A) sites was observed in the human and mouse transcriptomes, and it was demonstrated using reporter assays that enhancement of transcriptional activity results in increased cleavage at proximal sites [44].

The second emerging principle for the interplay between transcription and APA involves kinetic coupling (FIG. 4b). As the proximal poly(A) sites are transcribed first and are therefore encountered first by the 3′-end-processing machinery, they have an advantage for being used over distal poly(A) sites [17]. That is, use of proximal poly(A) sites should positively correlate with the distance between consecutive poly(A) sites and should negatively correlate with transcription elongation rate. In accordance with this expectation, using a D. melanogaster strain with a lower transcriptional elongation rate, it was shown that reduced RNA Pol II elongation kinetics results in increased usage of proximal poly(A) sites in a number of transcripts [45]. Of note, this kinetic coupling resembles a mechanism for alternative splicing regulation, in which slow kinetics of RNA Pol II leads to preferential inclusion of otherwise skipped alternative exons [46]. Little is known about mechanisms that regulate transcription elongation rates, and it remains to be seen whether this kinetic coupling is used to regulate APA in physiological conditions.

APA and chromatin. Recent results have suggested that chromatin and epigenetic modifications affect APA. It was observed that poly(A) sites are strongly depleted of nucleosomes, whereas regions downstream of these sites are enriched for nucleosomes [47]. To some extent, nucleosome depletion at poly(A) sites is explained by base composition of sequences in these regions, which are A- and T-rich, as poly(dA:dT) DNA stretches have a low nucleosome affinity. Interestingly, examination of genes with multiple poly(A) sites showed that stronger poly(A) sites are associated with more pronounced nucleosome depletion at the site and more pronounced enrichment downstream from it, suggesting that nucleosome positioning might influence PAS use by, for example, affecting the rate of polymerase elongation. Yet at this stage, these observations are only correlative, and experimental studies are required in order to test this model and to establish a cause–effect relationship between nucleosome occupancy and poly(A) site selection.

Another way in which chromatin has been proposed to affect APA is through DNA methylation. This epigenetic effect on APA was first suggested using mouse tissues, in two cases of retrogenes (namely, Mcts2 and Napl15), which are located within the introns of host genes (namely, H13 and Herc3, respectively) [48,49]. In both cases, the promoters of the retrogenes are imprinted and are therefore silenced on the maternal allele, whereas they are unmethylated and active on the paternal allele. It was shown that when the retrogene is transcribed, an upstream intronic poly(A) is used by the host gene, whereas a downstream distal poly(A) site is used by the host gene in the allele on which the retrogene is silenced. These observations support a model in which transcriptional interference affected poly(A) site choice by the host gene but cannot exclude the involvement of a polyadenylation factor (or factors) that is sensitive to the methylation status of the DNA in the vicinity of the poly(A) site.

Interplay between splicing and APA. Numerous studies have reported multiple links between the splicing and 3′-end-processing machineries and have demonstrated that physical interactions between splicing and polyadenylation factors occurring at terminal introns of precursor mRNAs (pre-mRNAs) enhance cleavage efficiency at 3′UTR poly(A) sites [50,51]. Two types of APA events are affected by the interplay between splicing and 3′ end processing, namely alternative terminal exons and intronic APA (FIG. 2), and recent studies are shedding light on underlying regulatory mechanisms. A first global analysis on this interplay used EST databases to identify events of intronic polyadenylation in hundreds of human genes [52]. Importantly, these events were associated with weak 5′ splicing sites (5′ss) and long introns, suggesting a dynamic competition between splicing and polyadenylation. In agreement with this model, increased cleavage at intronic poly(A) sites was observed in conditions that were associated with increased usage of 3′UTR proximal poly(A) sites, including proliferation [26], whereas decreased intronic cleavage was observed during development and differentiation [18]. Another indication for interplay between splicing and APA regulation was provided by an RNA-seq study that examined the transcriptomes of a diverse panel of human tissues and cell lines [14]. A strong correlation between patterns of alternative splicing and APA across the probed samples was observed, suggesting coordinated regulation of these processes. Furthermore, strong enrichment of well-known splicing-related regulatory motifs was also detected in 3′UTRs, suggesting that the factors binding these motifs function in the regulation of both splicing and APA.

Figure 3 | Biological processes that have been linked with broad APA modulation. A schematic showing the biological processes and diseases that alternative polyadenylation (APA) has been linked with. In addition, the tendency towards distal or proximal poly(A) site usage is shown. Panel labels: Global APA; Biological processes (development and cellular differentiation, neuron activity, proliferation); Connections to disease (cancer, oculopharyngeal muscular dystrophy); Favour distal poly(A) site usage; Favour proximal poly(A) site usage.

Nature Reviews Genetics, Volume 14, July 2013, p. 501

Elkon, et al. 2013

Page 26: Bioinformatic jc 08_14_2013_formal

Alternative polyadenylation (APA)

Background

expressed than their longer counterparts [22]. Several mechanisms can explain why changes in 3′ UTR length may affect protein abundance. One of the best-characterized processes is that of microRNA (miR)-mediated degradation. In studies of myogenic [43,44], hematopoietic [28], and cancer [45] cells, transcripts bearing shorter 3′ UTRs contained fewer miRNA-binding sites, thus allowing these transcripts to evade miRNA-mediated degradation. Transcripts are also subject to length-dependent degradation by the nonsense-mediated decay (NMD) pathway [46,47]. In NMD, Upf1 binds to the 3′ UTR in a length-dependent manner, thus eliciting degradation of longer transcripts more rapidly [48].

The 3′ UTR contains elements that affect not only transcript degradation but also stability. In a genome-wide computational analysis of sequence and stability


Page 27: Bioinformatic jc 08_14_2013_formal


Page 28: Bioinformatic jc 08_14_2013_formal

Why Another Mapper?

Andre Kahles (SKI, New York), PALMapper, HiTSeq, July 20, 2013

Memorial Sloan-Kettering Cancer Center

Advantages:

•  Alignments with variants, e.g. mismatches, indels

•  Accurate spliced alignments using computational splice site predictions

•  More accurate than TopHat (e.g. C. elegans 47% & 81%, respectively)

•  Fast alignments (about 10 million reads/hour)

•  Soft trimming of the poly(A) tail of each read

Page 29: Bioinformatic jc 08_14_2013_formal

How did I retrieve the poly(A) reads?

Input: the mapped SAM file with soft-trimmed poly(A)

Two kinds of reads are retrieved:

•  Reads with a soft-trimmed end & consecutive As at the end

•  Reads with a long splicing length & consecutive As at the end

Criteria shown in the diagram (RNA-seq reads aligned 5’ to 3’ against the + strand of the genome):

•  Soft trimming: consecutive As >= 8 & quality score of each soft-trimmed bp >= 40

•  Splicing plus soft trimming: splicing length >= 1500 bp; consecutive As >= 8 & quality score of each soft-trimmed bp >= 40

•  Two splicings (Splicing1 < Splicing2): Splicing2 length >= 1500 bp; consecutive As >= 8 & quality score of each soft-trimmed bp >= 40

•  Splicing only: splicing length >= 1500 bp; consecutive As >= 8 & quality score of each soft-trimmed bp >= 40

These criteria were implemented in a Perl script.

Method
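The selection rules on this slide can be sketched in a few lines. The sketch below is in Python rather than the Perl used for the talk, and it is only one possible reading of the diagram: it assumes the soft-trimmed tail shows up as a trailing `S` operation in the SAM CIGAR string, that a "splicing" is an `N` operation, that qualities are Phred+33 encoded, and that the tail appears as trailing As in the SEQ field. The function and category names are hypothetical.

```python
import re

# SAM CIGAR operations: length followed by an operation letter.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string such as '10M2000N15M5S' into (length, op) pairs."""
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]


def trailing_run(seq, base="A"):
    """Count how many consecutive `base` characters end the sequence."""
    count = 0
    for ch in reversed(seq):
        if ch != base:
            break
        count += 1
    return count


def classify_read(seq, qual, cigar, min_as=8, min_qual=40,
                  min_splice=1500, phred_offset=33):
    """Return 'softtrimmed', 'long_splice', or None for one SAM record.

    Assumed interpretation of the slide, not the original script:
    the read must end in >= min_as As, and then either carry a trailing
    soft clip whose bases all have quality >= min_qual, or contain a
    spliced gap (CIGAR 'N') of at least min_splice bp.
    """
    ops = parse_cigar(cigar)
    if not ops or trailing_run(seq) < min_as:
        return None
    last_len, last_op = ops[-1]
    if last_op == "S":
        clipped = [ord(c) - phred_offset for c in qual[-last_len:]]
        if all(q >= min_qual for q in clipped):
            return "softtrimmed"
    if any(op == "N" and n >= min_splice for n, op in ops):
        return "long_splice"
    return None
```

Reads mapped to the reverse strand would need the complementary check (leading Ts and a leading soft clip); that case is left out to keep the sketch short.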

Page 30: Bioinformatic jc 08_14_2013_formal

Defining poly(A) clusters (PAS)

•  2,203,313 poly(A) reads were identified across accessions

•  The poly(A) site of each poly(A) read was calculated with a Perl script

•  75,532 PASs were defined by clustering poly(A) reads in the same orientation and within 10 bp of each other across all accessions, with the total cluster interval spanning no more than 24 bp

•  93.4% of clusters map to genic regions; the remaining 6.6% lie farther away from genic regions

•  6581 genes have at least 20 sense poly(A) reads across accessions

•  1473 genes have at least 10 antisense poly(A) reads across accessions

•  The major sense PAS of each gene is defined, across all accessions, as the sense PAS with the most reads

•  p = proportion of total reads in gene mapping to major PAS

•  q = 1-p = proportion of total reads in gene mapping to non-major PASs

Result
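The clustering rule in the list above (same orientation, neighbouring sites within 10 bp, total cluster interval of at most 24 bp) could be implemented with a single sorted pass. The Python sketch below is an illustration only, not the Perl script used in the study: the greedy left-to-right strategy, the tuple layout, and the function name `cluster_sites` are all assumptions.

```python
from collections import defaultdict


def cluster_sites(sites, max_gap=10, max_span=24):
    """Group poly(A) sites into clusters.

    sites: iterable of (chrom, strand, position), one entry per poly(A) read.
    Returns a list of (chrom, strand, start, end, read_count) tuples.
    A site joins the current cluster when it lies within max_gap bp of the
    previous site and keeps the cluster interval within max_span bp;
    otherwise it opens a new cluster (assumed greedy rule).
    """
    by_key = defaultdict(list)
    for chrom, strand, pos in sites:
        by_key[(chrom, strand)].append(pos)

    clusters = []
    for (chrom, strand), positions in sorted(by_key.items()):
        positions.sort()
        current = [positions[0]]
        for pos in positions[1:]:
            if pos - current[-1] <= max_gap and pos - current[0] <= max_span:
                current.append(pos)
            else:
                clusters.append((chrom, strand, current[0], current[-1], len(current)))
                current = [pos]
        clusters.append((chrom, strand, current[0], current[-1], len(current)))
    return clusters
```

Sites on opposite strands are never merged, which is what allows sense and antisense PASs of the same gene to be counted separately.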

Page 31: Bioinformatic jc 08_14_2013_formal

The distribution of the proportion of reads mapping to non-major sense and antisense poly(A) clusters per gene

Genes in which the proportion of non-major cluster reads was equal to or greater than 0.4 (indicated with gray dashed lines) were considered to contain alternative poly(A) sites and were chosen for further polymorphism analysis

Result
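As a small worked example of the p and q definitions and the 0.4 cutoff, the Python sketch below computes q from the read counts of a gene's PAS clusters. The function names and the example counts are hypothetical and are not taken from the study's data.

```python
def non_major_proportion(cluster_counts):
    """q = 1 - p, where p is the share of reads in the largest (major) cluster.

    cluster_counts: read counts of all PAS clusters of one gene,
    in one orientation (sense or antisense).
    """
    total = sum(cluster_counts)
    if total == 0:
        raise ValueError("gene has no poly(A) reads")
    major = max(cluster_counts)
    # Computed as (total - major) / total to avoid rounding from 1 - p.
    return (total - major) / total


def has_alternative_pas(cluster_counts, q_threshold=0.4):
    """Apply the slide's cutoff: q >= 0.4 marks a gene as having APA."""
    return non_major_proportion(cluster_counts) >= q_threshold
```

A gene with cluster counts of 60, 30 and 10 reads has p = 0.6 and q = 0.4, so it just meets the cutoff; a gene with counts of 90 and 10 has q = 0.1 and does not.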

Page 32: Bioinformatic jc 08_14_2013_formal

Pairwise difference in the proportion of reads mapping to non-major poly(A) clusters across accessions

3074 genes with sense PAS

Result
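The pairwise comparison named in this slide's title can be illustrated as follows. This Python sketch assumes q has already been computed separately for each accession of a gene; the accession labels are placeholders, and using the absolute difference is an assumption about how the comparison was made.

```python
from itertools import combinations


def pairwise_q_differences(q_by_accession):
    """Absolute difference in q for every pair of accessions of one gene.

    q_by_accession: dict mapping accession name -> proportion of reads
    in non-major poly(A) clusters for that accession.
    Returns a dict keyed by (accession_a, accession_b), names sorted.
    """
    return {
        (a, b): abs(q_by_accession[a] - q_by_accession[b])
        for a, b in combinations(sorted(q_by_accession), 2)
    }


# Placeholder accessions and values, for illustration only.
example = pairwise_q_differences({"acc1": 0.5, "acc2": 0.25, "acc3": 0.0})
```

For n accessions this yields n(n-1)/2 values per gene, which is the kind of input a distribution plot across the 3074 genes would need.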

Page 33: Bioinformatic jc 08_14_2013_formal

Gene position and antisense PAS

Result


Nearby gene: a gene whose distance from its adjacent gene is <= 2 kb
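The "nearby gene" definition can be checked with a simple scan over sorted gene coordinates. The Python sketch below is a simplified illustration: it treats genes on one chromosome, ignores strand, measures the gap between consecutive genes in coordinate order, and counts overlapping genes as distance 0. Whether the original analysis made the same choices is not stated on the slide, and the gene identifiers are hypothetical.

```python
def nearby_genes(genes, max_distance=2000):
    """Return the IDs of genes lying within max_distance bp of a neighbour.

    genes: list of (gene_id, start, end) on a single chromosome.
    Only consecutive genes in coordinate order are compared, which is a
    simplification when genes are nested inside one another.
    """
    ordered = sorted(genes, key=lambda g: (g[1], g[2]))
    nearby = set()
    for left, right in zip(ordered, ordered[1:]):
        gap = max(0, right[1] - left[2])  # overlapping genes count as 0 bp apart
        if gap <= max_distance:
            nearby.add(left[0])
            nearby.add(right[0])
    return nearby
```

With genes at 1000-2000, 3500-5000 and 9000-10000, the first two are 1500 bp apart and are flagged as nearby, while the third is 4000 bp from its neighbour and is not.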