Comparative analysis of eukaryotic genes Mar Albà genomics.imim.es/evolgenome

Comparative analysis of eukaryotic genes

Mar Albà

http://genomics.imim.es/evolgenome

Barcelona Biomedical Research Park

Genome Projects

GOLD: Genomes Online Database (www.genomesonline.org)

Genome Browsers

- NCBI Map Viewer: http://www.ncbi.nlm.nih.gov/mapview/

- Ensembl: http://www.ensembl.org

- UCSC Genome Browser: http://genome.cse.ucsc.edu

The three databases use the same genome assembly, which is generated by NCBI.

Ensembl

- genomic regions: alignments with syntenic sequences

- genes: homologs, SNPs

- transcripts: EMBL mRNAs, ESTs, expression

- proteins: Gene Ontology (function), protein domains, disease associations

Ensembl BioMart: retrieval of information on gene datasets

Gene comparative sequence analysis

Genome and transcriptome projects have generated a vast amount of information on protein-coding and non-coding gene sequences.

Identification of conserved sequences in different genes can help us understand gene evolution and identify functional regions.

[Figure: promoter and coding regions of N genes (orthologs) compared across species 1, species 2, ..., species m]

Non-coding sequences in vertebrate genomes

- only 1.2% of the human genome codes for proteins

- but 5% exhibits high sequence conservation levels, compatible with negative selection (MGSC, 2002)

- non-coding sequences include:
  - transcription regulatory regions
  - introns
  - non-protein coding exons/genes (miRNAs, etc.)
  - repetitive elements (Alus, etc.)
  - ultra-conserved elements

Gene transcription regulatory sequences

Maston et al. (2006). Annu. Rev. Genomics Hum. Genet. 7: 29-59

Frequently-found metazoan motifs in the core promoter

Maston et al., 2006

Wray et al. (2003), Mol. Biol. Evol. 20(9):1377-1419.

Eukaryotic promoter diversity

High evolvability of regulatory sequences

- most of the changes in regulatory networks are likely to occur in cis; changes in trans (transcription factors) may often have effects that are too strong

- a single mutation may lead to the acquisition of a new DNA-factor interaction (rapid turnover)

- expression in one tissue may evolve independently of expression in another tissue (modular organization of promoters)

Wray et al. (2003) The Evolution of Transcriptional Regulation in Eukaryotes. Mol. Biol. Evol. 20(9):1377-1419.

Transcription factor binding sites (TFBS) are short and imprecise

-short sequence motifs (6-12 bp)

- some positions of the motif are variable

- sometimes different transcription factors can recognize the same sequence motif

TATAAA
TATAGA
TATAAA
TATAAA
GATAAA
TATAAA
TATAAA
TATAAT
 ***

TATA box

Transcription factor binding sites (TFBS)

Weight matrices

TATAAA
TATAGA
TATAAA
TATAAA
GATAAA
TATAAA
TATAAA
TATAAT
 ***

     1  2  3  4  5  6
     -  -  -  -  -  -
A    0  8  0  8  7  7
C    0  0  0  0  0  0
G    1  0  0  0  1  0
T    7  0  8  0  0  1

-> can be used to search for putative motifs in sequences
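
The count matrix on this slide can be rebuilt from the eight aligned TATA-box sites and then used to scan a sequence. The following is a minimal Python sketch, not the scoring scheme of any particular tool: the log-odds transformation, the pseudocount, the uniform background and the score threshold are all assumptions chosen for illustration.

```python
import math

# The eight aligned sites shown on the slide.
SITES = ["TATAAA", "TATAGA", "TATAAA", "TATAAA",
         "GATAAA", "TATAAA", "TATAAA", "TATAAT"]
BASES = "ACGT"


def count_matrix(sites):
    """Count each base at each position of the aligned sites."""
    length = len(sites[0])
    counts = {base: [0] * length for base in BASES}
    for site in sites:
        for pos, base in enumerate(site):
            counts[base][pos] += 1
    return counts


def log_odds_matrix(counts, pseudocount=0.5, background=0.25):
    """Turn counts into log2-odds weights (pseudocount and background are assumptions)."""
    length = len(counts["A"])
    n_sites = sum(counts[base][0] for base in BASES)
    total = n_sites + 4 * pseudocount
    return {
        base: [math.log2((counts[base][i] + pseudocount) / total / background)
               for i in range(length)]
        for base in BASES
    }


def scan(sequence, weights, threshold):
    """Return (start, window, score) for every window scoring at or above the threshold."""
    width = len(weights["A"])
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in weights for base in window):
            continue  # skip windows with ambiguous bases such as N
        score = sum(weights[base][i] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, round(score, 2)))
    return hits


counts = count_matrix(SITES)
weights = log_odds_matrix(counts)
hits = scan("GGCTATAAAGGCGC", weights, threshold=5.0)  # toy sequence, assumed threshold
print(counts)
print(hits)
```

Because the motif is only six positions long, lowering the threshold quickly produces many chance matches in real promoters, which is why such predictions need additional filtering.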

Transcription factor binding site databases

- TRANSFAC: http://transfac.gbf.de/TRANSFAC/ and http://www.biobase.de

- TRRD: http://www.bionet.nsc.ru/trrd/

- PLACE: http://www.dna.affrc.go.jp/htdocs/PLACE/

- ooTFD / rTFD: http://www.ifti.org/cgi-bin/ifti/ootfd.pl

- SCPD: http://cgsigma.cshl.org/jian/

- RegulonDB: http://regulondb.ccg.unam.mx/

TFBS prediction using weight matrices

PROMO

Farré, D., et al. (2003). Nucleic Acids Research 31: 1739-1748.

http://promo.lsi.upc.edu

High false positive rate in TFBS prediction

Test sequences: 200 vertebrate promoter sequences, 607 experimentally verified sites

Blanco, E., et al. (2006). Nucleic Acids Research 34: D63-D67.

Predictions: TRANSFAC v.6.4

Sensitivity: 46%

Specificity: 2% (very low!)
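
To get a feeling for what these two percentages imply, here is a back-of-the-envelope calculation in Python. It rests on an assumption the slide does not state: that "specificity" is used in the sense common in sequence-annotation benchmarks, i.e. the fraction of predictions that are correct, TP / (TP + FP), and that sensitivity is TP / (TP + FN). The resulting counts are rough implied values, not figures reported by Blanco et al.

```python
def implied_counts(n_true_sites, sensitivity, precision):
    """Derive approximate prediction counts from sensitivity and precision.

    Assumes sensitivity = TP / (TP + FN) and precision = TP / (TP + FP).
    """
    true_positives = sensitivity * n_true_sites
    total_predictions = true_positives / precision
    false_positives = total_predictions - true_positives
    return true_positives, total_predictions, false_positives


# 607 verified sites, 46% sensitivity, 2% "specificity" (treated here as precision).
tp, predicted, fp = implied_counts(607, 0.46, 0.02)
print(f"true positives ~ {tp:.0f}")
print(f"total predictions ~ {predicted:.0f}")
print(f"false positives per true positive ~ {fp / tp:.0f}")
```

Under this reading, roughly 49 of every 50 predicted sites would be false positives.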

Comparative approaches are necessary

Select those motifs or regions that are shared by:

- orthologous sequences: phylogenetic footprinting

- co-expressed genes: shared regulatory motifs

Boffelli D, Nobrega MA, Rubin EM. (2004) Nat Rev Genet. 5:456-65

Phylogenetic footprinting

Highly conserved enhancer in gene DACH1

Phylogenetic footprinting
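
One simple way to look for footprints is to slide a window along a pairwise alignment of orthologous non-coding sequence and keep the windows whose identity exceeds a threshold. The sketch below assumes the alignment has already been computed (gaps written as "-"); the toy sequences, the window size and the identity threshold are illustrative assumptions, not values from Boffelli et al.

```python
def window_identity(aln_a, aln_b, window):
    """Fraction of identical, non-gap columns in each window of a pairwise alignment."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    matches = [int(a == b and a != "-") for a, b in zip(aln_a, aln_b)]
    return [sum(matches[i:i + window]) / window
            for i in range(len(matches) - window + 1)]


def conserved_regions(aln_a, aln_b, window=10, min_identity=0.9):
    """Merge overlapping high-identity windows into regions (alignment columns, half-open)."""
    regions = []
    for start, identity in enumerate(window_identity(aln_a, aln_b, window)):
        if identity < min_identity:
            continue
        end = start + window
        if regions and start <= regions[-1][1]:
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions


# Hypothetical aligned fragments from two species.
species_1 = "ACGTTTGACGTCATCGGA--TTACG"
species_2 = "ATGCATGACGTCATCGCTAAGTCCG"
footprints = conserved_regions(species_1, species_2, window=10, min_identity=0.9)
print(footprints)
```

Predicted binding sites falling outside such regions could then be discarded, at the cost of missing sites that have turned over between the species compared.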

Proximal promoter

pre-initiation complex

Motif positional bias

Signal Search Analysis Server (SIB)

Why should some motifs show positional bias?

- promoter structure

- protein-protein interaction positional constraints

[Figure: a proximal promoter (labels: TSS, PIC, ACT, TFBS 1) and a regulatory module (labels: TFBS 1, TFBS 2, TF1, TF2). Legend: predicted element; reference element (known).]

PEAKS: identification of motif positional bias

[Figure: positions of predicted TFBS relative to the TSS in functionally related sequences (e.g. co-expressed) versus random sequences, highlighting over-representation. Legend: predicted element; reference element (known).]

PEAKS

Step 1. Construct motif frequency profile

[Figure: predicted motif occurrences in four sequences (seq1 to seq4) are counted with a sliding window to build a profile. Legend: predicted element; reference element (known).]

PEAKS

Step 1. Construct motif frequency profile

308 housekeeping genes, TRANSFAC v.6.4 matrix library

[Figure: motif frequency profiles relative to the TSS]
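
Step 1 can be sketched in a few lines of Python. This is an illustrative reading of the slide rather than the PEAKS implementation: it assumes that motif predictions are already available as start positions relative to the TSS, that the profile value is the number of predicted motifs falling inside each window, and that the window slides one position at a time. The window width and the toy positions are hypothetical.

```python
def motif_profile(hits_per_sequence, region_start, region_end, window=20):
    """Count predicted motifs inside each sliding window.

    hits_per_sequence: one list per sequence with motif start positions
    relative to the TSS. Returns a list of (window_start, count) pairs.
    """
    profile = []
    for win_start in range(region_start, region_end - window + 1):
        win_end = win_start + window
        count = sum(1 for hits in hits_per_sequence
                    for pos in hits if win_start <= pos < win_end)
        profile.append((win_start, count))
    return profile


# Hypothetical predictions for one matrix in four promoters (seq1 to seq4).
hits = [[-30, -150], [-28], [-31, 40], [-90]]
profile = motif_profile(hits, region_start=-200, region_end=100, window=20)
peak_start, peak_count = max(profile, key=lambda item: item[1])
print(peak_start, peak_count)
```

In this toy input three of the four sequences carry the motif close to position -30, so the profile peaks in the windows covering that position.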

PEAKS

Step 2. Measure significance of peaks

For each matrix:

Score (max peak) = Sa x Sb x Sc

Sa = max peak / num motif
Sb = max peak / num seq
Sc = max peak / average num motifs

[Figure: CAAT-box profile (axis labels +675 and -325) showing the maximum peak, the average signal and the difference between them]

PEAKS

Step 2. Measure significance of peaks

- determine random expectation score cut-off for different levels of significance using 1000 random datasets

- define significant signal range

[Figure: CAAT-box profile showing the maximum peak, the average signal and the 0.005 cut-off]
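
A hedged sketch of Step 2 in Python follows. The slides give the score as a product of three ratios but do not define the terms precisely, so the interpretation here is an assumption: "num motif" is taken as the total number of predicted motifs, "num seq" as the number of sequences, and "average num motifs" as the mean value of the profile. How the random datasets are generated is also not specified; this sketch simply re-places each motif uniformly at random within the region.

```python
import random


def profile_counts(hits_per_sequence, region_start, region_end, window):
    """Number of predicted motifs in each sliding window."""
    return [sum(1 for hits in hits_per_sequence
                for pos in hits if start <= pos < start + window)
            for start in range(region_start, region_end - window + 1)]


def peak_score(hits_per_sequence, region_start, region_end, window=20):
    """Score = Sa x Sb x Sc, under the interpretation stated above."""
    counts = profile_counts(hits_per_sequence, region_start, region_end, window)
    n_motifs = sum(len(hits) for hits in hits_per_sequence)
    n_seqs = len(hits_per_sequence)
    average = sum(counts) / len(counts)
    if n_motifs == 0 or average == 0:
        return 0.0
    max_peak = max(counts)
    return (max_peak / n_motifs) * (max_peak / n_seqs) * (max_peak / average)


def random_cutoff(hits_per_sequence, region_start, region_end, window=20,
                  n_random=1000, alpha=0.005, seed=0):
    """Empirical score cut-off from randomly re-positioned motifs (assumed null model)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_random):
        shuffled = [[rng.randrange(region_start, region_end) for _ in hits]
                    for hits in hits_per_sequence]
        scores.append(peak_score(shuffled, region_start, region_end, window))
    scores.sort()
    index = min(len(scores) - 1, int((1 - alpha) * len(scores)))
    return scores[index]


# Hypothetical dataset: 12 promoters, each with one motif close to position -30.
observed_hits = [[-30 + (i % 5)] for i in range(12)]
observed = peak_score(observed_hits, -200, 100)
cutoff = random_cutoff(observed_hits, -200, 100)
print(observed, cutoff, observed > cutoff)
```

With every motif clustered within a few nucleotides, the observed score is far above what uniformly scattered motifs produce, so the peak would be called significant at the 0.005 level under this null model.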

PEAKS

Step 3. Build “promoter type”

52 genes regulated by NFkB, p < 0.5%

[Figure: significant positional signals for TATA, Sp1, NFkB and BACH1]

PEAKS server: http://genomics.imim.es/peaks/

Bellora, Farré and Albà (2007). Bioinformatics 23, 243-4.

PEAKS results: human promoter sequences

TRANSFAC vertebrate matrices

[Figure: two panels, 308 housekeeping genes and 52 NFkB regulated genes. Motif labels, in the order they appear: TATA, CAAT, GC-box, YY, TATA, NFkB, GC-box, BACH1]

PEAKS results: promoters from yeast genes, amino acid metabolism (86 genes)

- 54 yeast weight matrices tested

- significant regions detected by the method show significant enrichment in experimentally-validated sites

Measuring promoter sequence divergence

Divergence (non-aligned promoter fraction, or dSM); Castillo-Davis et al., 2004

[Figure: two examples of promoter alignments between species 1 and species 2, with divergence 0.8 and 0.4]

1. highly divergent -> fewer constraints

2. highly conserved -> more constraints
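
As a minimal illustration of the non-aligned promoter fraction, the Python sketch below takes the coordinates of locally aligned blocks on one promoter and returns the fraction of positions not covered by any block. The blocks would normally come from a local alignment between the two orthologous promoters; here they are hypothetical inputs, and the 0-based half-open coordinate convention is an assumption of this sketch.

```python
def non_aligned_fraction(promoter_length, aligned_blocks):
    """Fraction of promoter positions not covered by any aligned block.

    aligned_blocks: (start, end) pairs, 0-based and half-open; overlapping
    blocks are counted once.
    """
    covered = [False] * promoter_length
    for start, end in aligned_blocks:
        for i in range(max(0, start), min(promoter_length, end)):
            covered[i] = True
    return 1 - sum(covered) / promoter_length


# Hypothetical 2 Kb promoter with three aligned blocks, two of them overlapping.
blocks = [(0, 300), (250, 600), (1500, 1700)]
divergence = non_aligned_fraction(2000, blocks)
print(divergence)
```

Here 800 of the 2000 positions are aligned, giving a divergence of 0.6.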

Variability in promoter sequence divergence

8385 human-mouse orthologues, 2 Kb from transcription start site

Average divergence = 70%

[Figure: histogram of promoter divergence in bins of 0.1, from 0-0.1 to 0.9-1, plus a separate bin for 1]

Regulatory genes contain more conserved promoters than structural/metabolic genes

Functional classes enriched in high score promoter alignments

Lee et al. (2006). BMC Genomics 6: 188

- consistent with results by Iwama and Gojobori (2004)

Structural/metabolic genes contain fewer highly conserved promoters than regulatory genes

Functional classes enriched in low score promoter alignments

Lee et al. (2006). BMC Genomics 6: 188

Comparison: neurogenesis versus ribosomal

[Figure: two panels, neurogenesis and ribosomal]

Lee et al. (2006). BMC Genomics 6: 188

Is expression breadth related to promoter sequence divergence?

[Figure: histogram of the number of genes (y axis, 0 to 1200) by expression breadth (number of tissues; x axis, bins 01-05 to 51-55) for human-mouse orthologues, with genes classed as tissue-specific, intermediate or housekeeping]

Expression data from Zhang et al. (2004)

Measure sequence divergence

Divergence = non-aligned promoter fraction

Promoter (2 Kb), species 1 versus species 2, for three gene classes:

- tissue-specific
- intermediate
- housekeeping

Relationship between promoter divergence and expression breadth

Zhang dataset, 3983 genes, l=16

number of tissues          N     dSM    Ka     Ks     Ka/Ks
01-10 (tissue-specific)    1006  0.688  0.108  0.764  0.148
11-50 (intermediate)       1931  0.702  0.080  0.746  0.114
51-55 (housekeeping)       1046  0.734  0.050  0.678  0.078

Coding sequence evolutionary rate (Ka, Ka/Ks) decreases with expression breadth, but promoter divergence (dSM) increases.

Relationship between promoter divergence and expression breadth

- divergence measured in 100 nt bins

[Figure: % conservation (y axis, 0 to 70) per 100 nt bin from -2000 to -100 relative to the TSS (x axis), for housekeeping and non-housekeeping genes]
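
The per-bin measurement can be sketched as follows in Python. This assumes, as a simplification, that each promoter is described by a list of aligned blocks in 0-based half-open coordinates counted from the upstream end (position 0 corresponds to -2000 and position 2000 to the TSS); the example blocks are hypothetical and the function name is illustrative.

```python
def percent_conservation_by_bin(promoters, promoter_length=2000, bin_size=100):
    """Percentage of aligned positions in each bin, pooled over all promoters.

    promoters: one list of aligned (start, end) blocks per promoter.
    Returns (bin_start_relative_to_TSS, percent_aligned) pairs.
    """
    n_bins = promoter_length // bin_size
    aligned = [0] * n_bins
    for blocks in promoters:
        covered = [False] * promoter_length
        for start, end in blocks:
            for i in range(max(0, start), min(promoter_length, end)):
                covered[i] = True
        for b in range(n_bins):
            aligned[b] += sum(covered[b * bin_size:(b + 1) * bin_size])
    total = bin_size * len(promoters)
    return [(b * bin_size - promoter_length, 100.0 * aligned[b] / total)
            for b in range(n_bins)]


# Two hypothetical promoters: both conserved near the TSS, one also far upstream.
example = [[(1900, 2000)], [(1950, 2000), (0, 100)]]
conservation = percent_conservation_by_bin(example)
print(conservation[0], conservation[-1])
```

Running the same function separately on housekeeping and non-housekeeping promoters would give two curves of the kind plotted on this slide.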

Promoter divergence and gene function

highly divergent promoter:

- RNA binding
- ligase activity
- hydrolase activity
- catalytic activity

highly conserved promoter:

- receptor binding
- signal transducer activity
- receptor activity
- structural molecule activity
- transcription regulator activity
- transcription factor activity
- DNA binding

GO class > 50 genes, p-value < 0.01

Promoter divergence and gene function

[Figure: divergence (y axis, 0.5 to 0.8) for organ development, transcription factor, receptor and protein metabolism genes, split into housekeeping and non-housekeeping]

Summary

- the prediction of transcription factor binding sites is very noisy, so we need to use comparative genomics

- some motifs show positional bias; this property can help us understand the structure of promoters and improve motif predictions

- promoter sequence conservation is related to gene function and to gene expression breadth. The fact that housekeeping genes contain less conserved promoters may reflect simpler regulation of gene expression

The team

Evolutionary Genomics Group
Universitat Pompeu Fabra, Barcelona
http://genomics.imim.es/evolgenome

Nicolas Bellora
Domènec Farré
Loris Mularoni
Macarena Toll
Medya Shikhagaie