Introduction to the Eukaryotic Promoter Database (EPD) and Signal Search Analysis (SSA)
Giovanna Ambrosini, Christoph Schmid
Workshop on Regulatory Sequence Motif Discovery, November 10th 2006. The Linnaeus Centre for Bioinformatics, SLU-UU, Sweden.
Components of transcriptional regulation
[Figure, after Wasserman 5, 276-287 (2004). Labels: distal transcription-factor binding sites (enhancer); cis-regulatory modules]
EPD: The Eukaryotic Promoter Database
Current Release 88 (SEPT-2006)
• founded in 1986 (Bucher and Trifonov; Nucleic Acids Res, 14, 10009-10026)
• originally exclusively based on literature, carefully maintained and regularly updated
• in recent years, has started to take mass sequencing data into account
• aims at high-precision mapping of transcription start sites (+/- 5 bp)
• promoter sequences of 139 different species, still relatively low coverage (e.g. 1871 human entries)
• format of annotation of TSS:
DR EMBL; ZZ999999.1; HS28BP; [-19, 9].
Position ruler: -15, -10, -5, 0, 5
Sequence: acccgcctgcacccgattcATGTGAGAA
• one or several alternative transcription start sites per gene
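The bracketed range in the DR line above can be read programmatically. The following is a minimal sketch, assuming that `[-19, 9]` means 19 bases upstream of the TSS followed by 9 bases starting at the TSS, as the lowercase/uppercase split of the example sequence suggests. The function names `parse_dr_range` and `split_at_tss` are hypothetical and not part of any EPD tool.

```python
import re

def parse_dr_range(dr_line):
    """Extract database, accession and the [from, to] range from a DR line."""
    match = re.search(
        r"DR\s+(\w+);\s*([^;]+);.*\[\s*(-?\d+)\s*,\s*(-?\d+)\s*\]", dr_line
    )
    if match is None:
        raise ValueError("no [from, to] range found in DR line")
    database, accession, start, end = match.groups()
    return database, accession.strip(), int(start), int(end)

def split_at_tss(sequence, start):
    """Split a sequence into the part upstream of the TSS and the rest.

    Assumption: `start` is negative and -start bases precede the TSS.
    """
    n_upstream = -start
    return sequence[:n_upstream], sequence[n_upstream:]

dr = "DR EMBL; ZZ999999.1; HS28BP; [-19, 9]."
seq = "acccgcctgcacccgattcATGTGAGAA"

database, accession, start, end = parse_dr_range(dr)
upstream, downstream = split_at_tss(seq, start)
print(database, accession, start, end)
print(upstream, downstream)
```

With the example from the slide, the split falls exactly at the lowercase/uppercase boundary of the sequence.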
EPD format
ID   HS_RPS3 standard; multiple; VRT.
XX
AC   EP74176;
XX
DT   10-JAN-2003 (Rel. 73, created)
DT   13-SEP-2004 (Rel. 80, Last annotation update).
XX
DE   Ribosomal protein S3.
OS   Homo sapiens (human).
XX
HG   none.
AP   none.
NP   none.
XX
DR   GENOME; NT_033927.7; NT_033927; [-5333322, 12577805]. [ ENSEMBL; UCSC HapMap ]
DR   CLEANEX; HS_RPS3.
DR   EMBL; AP000744.4; [-90138, 35862]. [ EMBL; GenBank; DDBJ ]
DR   SWISS-PROT; P23396; RS3_HUMAN.
DR   RefSeq; NM_001005 [ DBTSS ].
DR   MIM; 600454.
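An entry like the one above is easy to read into a dictionary keyed by line code. This is a minimal sketch, assuming each line starts with a two-letter code followed by whitespace and that `XX` lines are empty separators. The exact column layout of the real flat file is an assumption here, the entry is abridged from the slide, and `parse_entry` is a hypothetical helper.

```python
from collections import defaultdict

# Abridged copy of the HS_RPS3 entry shown on the slide (spacing is assumed).
ENTRY = """\
ID   HS_RPS3 standard; multiple; VRT.
XX
AC   EP74176;
XX
DE   Ribosomal protein S3.
OS   Homo sapiens (human).
XX
DR   CLEANEX; HS_RPS3.
DR   SWISS-PROT; P23396; RS3_HUMAN.
DR   MIM; 600454.
"""

def parse_entry(text):
    """Group the lines of one entry by their leading line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, _, value = line.partition(" ")
        value = value.strip()
        if code == "XX" or not value:
            continue  # skip separator lines and empty values
        fields[code].append(value)
    return dict(fields)

entry = parse_entry(ENTRY)
# Index the cross-references by the database name before the first ';'.
cross_refs = {value.split(";")[0]: value for value in entry["DR"]}
print(entry["ID"][0])
print(sorted(cross_refs))
```

Keeping every line code as a list preserves repeated lines such as DR without special cases.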
[Figure: frequency of full-length transcripts versus genomic position, for regions R84046905-84046987 and R84047148-84047231; annotated widths 10 bp and 45 bp]
TSS determined by modelling Gaussian distributions (MADAP)
The Eukaryotic Promoter Database EPD: the impact of in silico primer extension. Schmid, C.D., Praz, V., Delorenzi, M., Perier, R. and Bucher, P. (2004) Nucleic Acids Res, 32, D82-85.
Source             [-10;10]   [-400;400]   (unlabelled)
EPD 70             0.83       136 [values not separable in source]
RefSeq mRNA        0.32       0.95         933
Genome annot.      0.31       0.95         890
DBTSSv1 (human)    0.13       0.68         933
Eponine            0.12       0.46         494
Superior precision of in silico primer extension (ISPE)
New data sources for EPD
ChIP-chip: Kim et al. (2005) Nature, 436, 876-880; GEO: GSE2672 (remapped!)
[Figure: ENSEMBL chromosome 12: 6.8-6.94 Mb; y-axis: virtual counts, (2 ** log ratio) - 1. Second panel: distribution of TSS, frequency (0.0 to 3.0) versus genomic position (6831200 to 6832000)]
FP Hs USP5 :+R EU:NC_000012.10 1+ 6831557; 74339.
ChIP-chip data with insufficient resolution
EPD webserver: http://www.epd.isb-sib.ch/
• find EPD entry(-ies) using gene symbols, ...
– extraction of promoter sequences in user-defined ranges
– direct transfer to Signal Search Analysis (SSA)
• download of complete (reference!) promoter sets http://www.epd.isb-sib.ch/seq_download.html
SSA: Signal Search Analysis
Giovanna Ambrosini
ISREC, Swiss Institute for Experimental Cancer Research
History: Signal Search Analysis is a method developed by P. Bucher in the early eighties (Bucher, P. and Bryan, B. (1984) Nucleic Acids Res, 12(1 Pt 1), 287-305)
Purpose: to discover and characterize sequence motifs that occur at constrained distances from physiologically defined sites in nucleic acid sequences.
Signal search analysis programs:
1. CPR: generates a “constraint profile” for the neighborhood of a functional site
2. SList: generates lists of over- and under-represented motifs in particular regions relative to a functional site
3. OProf: generates a “signal occurrence profile” for a particular motif
4. PatOP: optimizes a weight matrix description of a locally over-represented sequence motif
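To make the idea of a "signal occurrence profile" concrete, the sketch below computes, for each window along a set of aligned segments, the percentage of segments with at least one motif match starting in that window. It is an illustration only and not the OProf implementation. It uses exact string matching instead of a weight matrix, and the parameter names (`window`, `step`, `site_index`) and the toy sequences are assumptions.

```python
def occurrence_profile(segments, site_index, motif, window, step):
    """Return (window centre relative to the site, percent of segments hit).

    A segment counts as hit if the motif starts inside the window.
    All segments are assumed to have the same length.
    """
    length = len(segments[0])
    profile = []
    for left in range(0, length - window + 1, step):
        hits = 0
        for segment in segments:
            # Extend the slice so a motif starting at the last window
            # position is still fully covered.
            region = segment[left:left + window + len(motif) - 1]
            if motif in region.upper():
                hits += 1
        centre = left + window // 2 - site_index
        profile.append((centre, 100.0 * hits / len(segments)))
    return profile

# Toy aligned segments; the functional site is the 'A' at index 9.
segments = ["gTATAAgccAtg", "cgTATAAgcAcc", "gcgcgcgcgAgc"]
profile = occurrence_profile(segments, site_index=9, motif="TATAA",
                             window=4, step=4)
for centre, percent in profile:
    print(centre, round(percent, 1))
```

Plotting the percentages against the window centres gives the kind of curve shown in the signal occurrence profile figures later in the deck.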
Recent events: Adaptation of software to new environment, SSA web server, application to promoters and translational start sites
Locally Over-represented Sequence Motifs
Definition of a Locally Over-represented Sequence Motif
Components of the formal motif description
1. A weight matrix or consensus sequence defining the motif
2. A cut-off value determining which subsequence constitutes a motif match
3. A preferred region of occurrence defined by 5' and 3' borders relative to a functional site, e.g. a transcription initiation site
Concept
A motif which preferentially occurs at a characteristic distance (range) from a certain type of functional position
Example: the TATA-box is a locally over-represented sequence motif of the -30 region of eukaryotic POL II transcription initiation sites
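The three components of the formal description (matrix, cut-off, preferred region) can be combined in a few lines. The following is a minimal sketch with a made-up four-column weight matrix. The weights, the cut-off of 7 and the region borders are illustrative assumptions and not the published TATA-box matrix.

```python
# Toy weights for a 4-column motif resembling "TATA"; values are invented.
WEIGHTS = [
    {"A": -2, "C": -2, "G": -2, "T": 2},
    {"A": 2, "C": -2, "G": -2, "T": -2},
    {"A": -2, "C": -2, "G": -2, "T": 2},
    {"A": 2, "C": -2, "G": -2, "T": 1},   # T is tolerated in the last column
]

def score(window):
    """Sum the column weights; unknown symbols get the lowest weight (-2)."""
    return sum(col.get(base, -2) for col, base in zip(WEIGHTS, window))

def matches_in_region(segment, site_index, cutoff, five_prime, three_prime):
    """Relative start positions whose score reaches the cut-off.

    Only starts between the 5' and 3' borders (inclusive, relative to
    the functional site) are considered.
    """
    seq = segment.upper()
    width = len(WEIGHTS)
    hits = []
    for rel in range(five_prime, three_prime + 1):
        start = site_index + rel
        if start < 0 or start + width > len(seq):
            continue
        if score(seq[start:start + width]) >= cutoff:
            hits.append(rel)
    return hits

# Toy segment: functional site at index 40, "TATA" starting at -30.
segment = "g" * 10 + "TATA" + "c" * 26 + "A" + "g" * 9
in_region = matches_in_region(segment, 40, cutoff=7,
                              five_prime=-35, three_prime=-25)
out_of_region = matches_in_region(segment, 40, cutoff=7,
                                  five_prime=-20, three_prime=-10)
print(in_region, out_of_region)
```

The same segment counts as a motif occurrence only when the match falls inside the preferred region, which is what makes the motif "locally" over-represented.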
Locally Over-represented Sequence Motifs
Input Data Structure
Primary experimental data (Functional Position Set): annotated functional positions in DNA sequences stored in a database
Work data (a DNA sequence matrix): a set of fixed-length sequence segments with an experimentally defined site at a fixed internal position
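The step from a functional position set to a DNA sequence matrix can be sketched as follows. This is an illustration under stated assumptions: positions are 0-based, sites on the minus strand are reverse-complemented so that every row reads 5' to 3', and sites too close to a sequence end are dropped. The function names and the toy sequence are hypothetical.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string, preserving case."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]

def extract_segment(sequence, position, strand, upstream, downstream):
    """Cut a segment with the functional site at index `upstream`.

    Returns None if the requested range runs past the sequence ends.
    """
    if strand == "+":
        start, end = position - upstream, position + downstream + 1
    else:
        start, end = position - downstream, position + upstream + 1
    if start < 0 or end > len(sequence):
        return None
    segment = sequence[start:end]
    return segment if strand == "+" else reverse_complement(segment)

genome = "AAACCCGGGTTTACGT"
# (position, strand) pairs standing in for a functional position set.
sites = [(6, "+"), (9, "-"), (1, "+")]

matrix = []
for position, strand in sites:
    row = extract_segment(genome, position, strand, upstream=3, downstream=2)
    if row is not None:
        matrix.append(row)

print(matrix)
```

Every row has the same length and the site sits in the same column, which is the property the signal search programs rely on.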
The Motif Search Problem
Statement
For a given DNA sequence matrix, find a locally optimal combination of
1. Quantitative motif description
2. Cut-off value
3. Region of preferential occurrence
using a given quality criterion
TATA-box Signal Occurrence Profile for EPD and ENSEMBL Drosophila Promoters
CCAAT-box Signal Occurrence Profile for Vertebrate and ENSEMBL Drosophila Promoters
SSA webserver: http://www.isrec.isb-sib.ch/ssa
• Provides access to precompiled functional position sets
– Collections of transcription initiation sites (promoters) from eukaryotic species
– Collections of translation initiation sites from a large variety of prokaryotic genomes
• Provides access to the four signal search analysis programs
Application to a bacterial translational control signal: the Shine-Dalgarno ribosome binding-site motif
Compare the strength and location of the Shine-Dalgarno mRNA-rRNA interaction motif in E. coli and B. subtilis in a qualitative manner.
Result: the Shine-Dalgarno interaction motif is stronger in B. subtilis than in E. coli and is centered about two bases further upstream in B. subtilis. More than a hundred bacterial genomes are now available for this type of analysis.
Studying transcription regulatory processes with specialized bioinformatics resources: an example
Biological question:
Do genes that are generally up-regulated in cancer cells have different types of promoters?
Procedure:
1. Define cancer up- and down-regulated gene sets using CleanEx
2. Extract corresponding promoter regions from EPD
3. Analyse the signal content of the two promoter sequence sets using SSA
Comparative analysis of cancer up- and down-regulated promoters
Signals considered:
Signal       Preferred position   Approx. frequency
Initiator    0                    25% - 50%
TATA-box     -30 to -25           ~30%
GC-box       -200 to 0            ~50%
CCAAT-box    -200 to -50          ~20%
Positional distribution of Initiator motif in cancer up- and down-regulated promoters
Positional distribution of TATA-boxes in cancer up- and down-regulated promoters
Positional distribution of GC-boxes in cancer up- and down-regulated promoters
Positional distribution of CCAAT-boxes in cancer up- and down-regulated promoters
Comparative analysis of cancer up- and down-regulated promoters: Summary of results
Signal content
Signal       Frequency in cancer-up genes   Frequency in cancer-down genes
Initiator    no change                      no change
TATA-box     up                             down
GC-box       no change                      no change
CCAAT-box    up                             down
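Whether a difference in signal frequency between two promoter sets is larger than expected by chance can be checked with a standard contingency test. The slides do not say which test, if any, was used. The sketch below uses a one-sided Fisher's exact test as an assumed choice, and the counts are invented purely for illustration (requires Python 3.8+ for `math.comb`).

```python
from math import comb

def fisher_one_sided(hits_a, n_a, hits_b, n_b):
    """P-value for set A containing at least `hits_a` motif-positive promoters.

    Null hypothesis: motif presence is independent of set membership
    (hypergeometric distribution with fixed margins).
    """
    total_hits = hits_a + hits_b
    total = n_a + n_b
    upper = min(total_hits, n_a)
    tail = sum(
        comb(total_hits, k) * comb(total - total_hits, n_a - k)
        for k in range(hits_a, upper + 1)
    )
    return tail / comb(total, n_a)

# Hypothetical counts: promoters with a TATA-box match in each set.
p_value = fisher_one_sided(hits_a=40, n_a=100, hits_b=20, n_b=100)
print(f"one-sided p = {p_value:.4g}")
```

A small p-value would support an "up" call for set A. With several signals tested, a multiple-testing correction would also be needed.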
Next questions:
Are TATA-box and CCAAT-box binding factors dysregulated in cancer cells?
Or do cancer-specific transcription factors (binding to adjacent sites) preferentially interact with TATA-box and CCAAT-box binding factors?
Concluding remarks
Signal search analysis has played an instrumental role in the characterization of eukaryotic promoter elements
The method was originally developed for the analysis of eukaryotic promoters but has a much broader application potential (e.g. Shine-Dalgarno signal analysis)
Rapidly growing collection of complete genomes and high-throughput methods for genomic analysis increase the statistical power to discover new motifs, or better characterize already known control signals
Aligning sequence sets with respect to a well characterized motif might allow the detection of binding sites of cooperating transcription factors positionally correlated with the known motif
Confirm or challenge commonly accepted hypotheses originally derived from small sets