Page 1: Comparative genomics of 24 mammals Manolis Kellis MIT MIT Computer Science & Artificial Intelligence Laboratory Broad Institute of MIT and Harvard.

Comparative genomics of 24 mammals

Manolis Kellis

MIT

MIT Computer Science & Artificial Intelligence Laboratory

Broad Institute of MIT and Harvard

Page 2:

Sequencing the mammalian phylogeny

#    Species        Center      Coverage
H1   Human          Done        Full
H2   Chimp          Done        Full
H3   Rhesus         Done        Full
H4   Mouse          Done        Full
H5   Rat            Done        Full
H6   Dog            Done        Full
H7   Cow            Done        Full
1    Elephant       Broad       1.94x
2    Armadillo      Broad       1.98x
3    Tenrec         Broad       1.90x
4    Rabbit         Broad       1.95x
5    Guinea Pig     Broad       1.92x
6    Hedgehog       Broad       1.86x
7    Shrew          Broad       1.92x
8    Microbat       Broad       1.84x
9    Tree Shrew     Broad       1.89x
10   Squirrel       Broad       1.90x
11   Bushbaby       Broad       1.87x
12   Pika           Broad       1.92x
13   Mouse Lemur    Broad       1.93x
14   Horse          Broad       5.36x
15   Cat            Agencourt   1.87x
16   Dolphin        Baylor      2.59x
17   Hyrax          Baylor      2.19x
18   Kangaroo Rat   Baylor      1.85x
19   Megabat        Baylor      ~2x
20   Alpaca         WashU       2.34x
21   Tarsier        WashU       1.88x
22   Sloth          WashU       2.10x
23   Pangolin       x           x
24   Flying lemur   x           x

Kerstin Lindblad-Toh, Sante Gnerre, Federica Di Palma

Broad, Baylor, WashU, Arachne, UCSC

Page 3:

Comparative genomics of mammalian species

• Goal 1: Discover regions of increased selection
  – Detect functional elements by their increased conservation
  – More genomes: detect smaller elements, subtle selection

• Goal 2: Discover different classes of functional elements
  – Patterns of change distinguish different types of functional elements
  – Specific function → selective pressures → patterns of mutation/insertion/deletion

• Develop evolutionary signatures characteristic of each function

Page 4:

Protein-coding genes

Mike Lin

Page 5:

Evolutionary signatures for protein-coding genes

• Same conservation levels, distinct patterns of divergence
  – Gaps are multiples of three (preserve amino acid translation)
  – Mutations are largely 3-periodic (silent codon substitutions)
  – Specific triplets exchanged more frequently (conservative substitutions)
  – Conservation boundaries are sharp (pinpoint individual splicing signals)

[Figure labels: non-synonymous substitutions; synonymous codon substitutions; frame-shifting gaps; gaps are multiples of 3]
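The first two signatures above (gap lengths in multiples of three, 3-periodic mutations) can be checked on a pairwise alignment with a few lines of code. The sketch below is only an illustration, not the scoring used in the talk: the function names and the toy sequences are hypothetical, and the reference row is assumed to start in frame at codon position 0.

```python
import re

def gap_lengths(aligned_seq):
    """Lengths of every run of '-' in one row of an alignment."""
    return [len(m.group()) for m in re.finditer(r"-+", aligned_seq)]

def frame_preserving_fraction(aligned_seq):
    """Fraction of gaps whose length is a multiple of three (1.0 if no gaps)."""
    gaps = gap_lengths(aligned_seq)
    if not gaps:
        return 1.0
    return sum(1 for g in gaps if g % 3 == 0) / len(gaps)

def mismatch_phase_counts(ref, other):
    """Count mismatches by codon position (0, 1, 2) of the reference row.

    Columns with a gap in either row are not counted as mismatches; the
    codon position advances only on ungapped reference bases.
    """
    counts = {0: 0, 1: 0, 2: 0}
    pos = 0
    for a, b in zip(ref, other):
        if a == "-":
            continue
        if b != "-" and a != b:
            counts[pos % 3] += 1
        pos += 1
    return counts

# Toy alignment (made up): one 3-base gap, mismatches only at third positions.
ref = "ATGGCTAAAGGTCTGTAA"
other = "ATGGCCAAG---CTATAA"
print(frame_preserving_fraction(other))
print(mismatch_phase_counts(ref, other))
```

A coding-like region would be expected to show most gaps in frame and most mismatches at codon position 2; a conserved non-coding region would not show that bias.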

Page 6:

Protein-coding evolution vs nucleotide conservation

• Evolutionary signatures specific to each function
  – Distinguish protein-coding from non-coding conservation
  – Genome-wide run (CSF only): 81% sensitivity, 91% precision
  – Incorporating additional signatures: RFC, single-species…

[Figure panels: protein-coding exons; highly conserved non-coding elements]

Page 7:

Many new genes confirmed by chromatin domains

• Several hundred new exons, many in clusters
  – Example: MM14qC3

• Supported by chromatin signatures (Guttman et al)

Mikkelsen et al

[Figure labels: missed exon; alt. spliced exon]

Page 8:

Genome-wide curation / experimental follow-up

• Novel candidate genes and exons
  – Experimental cDNA sequencing and validation
  – Curation of gene structures integrating evidence

• Revising existing annotations
  – Identify dubious genes with non-protein-like evolution
  – Refine boundaries and exon sets of existing genes
  – Curation: evaluate evidence supporting that annotation

• Unusual gene structures
  – Evolutionary evidence in absence of primary signals
  – Reveal new and unusual biological mechanisms

G PI: Tim Hubbard, Sanger Center.

HAVANA curators, experimental validation.

Page 9:

Unusual protein-coding events

Mike Lin

Page 10:

When primary sequence signals are ignored

• Unusual gene (GPX2). Protein-coding signal continues past the stop.

• GPX2 is a known selenoprotein! Additional candidates found.

• Typical gene (MEF2A). Evolutionary signal stops at the stop codon.

Page 11:

Translational read-through in neuronal proteins

• New mechanism of post-transcriptional control.
  – Conserved in both mammals (~5 candidates) and flies (~150 candidates)
  – Strongly enriched for neurotransmitters and brain-expressed proteins
  – Read-through stop codon (& surrounding) shows increased conservation

• Many questions remain
  – Role of editing? Cryptic splice sites? RNA secondary structure?

[Figure labels: protein-coding conservation; stop codon read-through; continued protein-coding conservation; 2nd stop codon; no more conservation]

Lin et al, Genome Research 2007

Novel candidate: OPRL1 neurotransmitter

Page 12:

Measuring excess constraint within protein-coding exons

Typical protein-coding exon (Numerous mutations, at each column)

Excess-conservation exon: conserved above and beyond the call of duty

Likely to have additional functions, overlapping selective pressures

Page 13:

Searching for excess-constraint coding sequence

(1) Build a model for expected substitution counts
• Syn. subs. correlate with degeneracy & CpG
• Distribution for each ancestral codon

(2) Score windows for depletion in syn. subst.
• Z-score: P(obs. subst | expected for each codon)

(3) Top candidate exons with excess constraint
• PCPB2: derived from ancestral transposon
• Hox B5 gene start: 52 AA before 1 syn. subst
• C6orf111: predicted ORF on chr. 6
• EIF4G2: overlaps spliced EvoFold prediction
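The window scoring of step (2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the model from the talk: the per-codon expected counts and variances are hypothetical inputs that would come from the substitution model of step (1), and codons are treated as independent so that means and variances add across a window.

```python
import math

def window_z_scores(observed, expected, variance, window=10):
    """Z-score of synonymous substitution counts in each sliding window.

    observed, expected, variance: per-codon lists of equal length.
    Strongly negative scores indicate depletion of synonymous
    substitutions relative to the model, i.e. excess constraint.
    """
    scores = []
    for start in range(len(observed) - window + 1):
        end = start + window
        obs = sum(observed[start:end])
        exp = sum(expected[start:end])
        var = sum(variance[start:end])
        scores.append((obs - exp) / math.sqrt(var) if var > 0 else 0.0)
    return scores

# Toy input (made up): four codons with no observed synonymous changes
# where the model expects one per codon.
print(window_z_scores([0, 0, 0, 0], [1, 1, 1, 1], [1, 1, 1, 1], window=4))
```

Windows would then be ranked by their z-score, with the most negative windows reported as candidates.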

Page 14:

Examples: Top candidate exons showing increased selection

• HoxB5: 52 amino acids before the first synonymous substitution
• Overlaps highly conserved RNA secondary structure

• C6orf11: Predicted ORF, protein-coding, extremely conserved

• EIF4G2: Several consecutive exons, conserved RNA struct.

Page 15:

microRNA genes

Alex Stark

Pouya Kheradpour

Page 16:

Evolutionary signatures for microRNA genes

Combine with 10 other features: 4,500-fold enrichment

(1) Conservation profile

Page 17:

Novel miRNAs validated by sequencing reads

Ruby, Bartel, Lai

[Figure labels: 348 reads; 16 reads]

• In fly genome: 101 hairpins above 0.95 cutoff
  – 60 of 74 (81%) known Rfam miRNAs rediscovered
  – + 24 novel, expression-validated by 454 & Solexa (Bartel/Hannon)
  – + 17 additional candidates show diverse evidence of function

• In mammals: combine experimental & evolutionary info
  – Rely on reads for discovery, use evolutionary signal to study function

Stark et al, Genome Research (GR) 2007. Ruby et al, GR 2007.

Page 18:

Surprise 1: microRNA & microRNA* function

• Both hairpin arms of a microRNA can be functional
  – High scores, abundant processing, conserved targets
  – Hox miRNAs miR-10 and miR-iab-4 as master Hox regulators

Stark et al, Genome Research 2007

Drosophila Hox

Page 19:

Surprise 2: microRNA-anti-sense function

• A single miRNA locus transcribed from both strands
• The two transcripts show distinct expression domains (mutually exclusive)
• Both processed to mature miRNAs: mir-iab-4, miR-iab-4AS (anti-sense)

[Figure labels: sense; anti-sense]

Stark et al, Genes & Development 2007

Highly conserved Hox targets

Page 20:

miR-iab-4AS leads to homeotic transformations

• Mis-expression of mir-iab-4S & AS: halteres → wings homeotic transformation
• Stronger phenotype for AS miRNA
• Sense/anti-sense pairs as general building blocks for miRNA regulation
• 10 sense/anti-sense miRNAs in mouse

[Figure panels: WT, sense, antisense; labels: haltere, wing, sensory bristles, wing with bristles. Note: C, D, E same magnification]

Stark et al, Genes&Development 2007

Page 21:

Function of miRNA* arms and anti-sense miRNAs

• Denser Hox miRNA targeting network

Page 22:

Measuring selection

Michele Clamp

Manuel Garber

Xiaohui Xie

Page 23:

Detecting Purifying Selection (ω)

[Figure panels: neutral sequence; constrained sequence]

Estimating intensity of constraint (ω):
• Probabilistic evolutionary model
• Maximum Likelihood (ML) estimation of ω
  - sitewise (evaluate every k-long window)
  - windows-based (increased power)

• Reports ω, and its log odds score (LODS).
• Theoretical p-value (LODS is distributed as χ² with df = 1)
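The theoretical p-value mentioned above follows from the upper tail of a χ² distribution with one degree of freedom, which has a closed form in terms of the complementary error function. The sketch below assumes the reported statistic is already on the χ² scale, as the slide states; if a tool reports a raw log-likelihood difference instead, it would typically need to be doubled first. The function name is hypothetical.

```python
import math

def chi2_df1_pvalue(statistic):
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom.

    For X ~ chi2(1), P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)),
    where Z is a standard normal variable.
    """
    if statistic <= 0:
        return 1.0
    return math.erfc(math.sqrt(statistic / 2.0))

# The familiar 5% critical value for df = 1 is about 3.84.
print(chi2_df1_pvalue(3.841458820694124))
```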

Manuel Garber, Michele Clamp, Xiaohui Xie

Page 24:

Detecting other constraint signatures (π)

[Figure: per-site scores along an example alignment: 0 0 0.8 0.5 0.6 3.2 0 0]

• Repeated CG transversion

• Has happened at least 4 times.

• Very unlikely given neutral model.

• Goal: Identify sites with unlikely substitution pattern.

• Approach: Probabilistic method to detect a stationary distribution that is different from background.

• Solution: Implement ML estimator (π) of this vector:
  – Provides a Position Weight Matrix for any given k-mer in the genome.
  – Scores every base in the genome (LODS).
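To make the idea of a site-specific base distribution concrete, here is a deliberately simplified sketch. It is not the estimator from the talk: it ignores the phylogeny entirely and treats the bases in one alignment column as independent draws, so the ML estimate of π is just the observed base frequencies, and the log-odds compares that fit against a background distribution. The uniform background and the function names are assumptions made for illustration.

```python
import math
from collections import Counter

# Assumed uniform background; a real analysis would estimate it from neutral sequence.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def ml_frequencies(column):
    """Tree-free ML estimate of base frequencies for one alignment column."""
    counts = Counter(b for b in column.upper() if b in BACKGROUND)
    total = sum(counts.values())
    if total == 0:
        return dict(BACKGROUND)
    return {b: counts[b] / total for b in BACKGROUND}

def column_lods(column, background=BACKGROUND):
    """Log-odds of the column under its own ML frequencies vs. the background."""
    pi_hat = ml_frequencies(column)
    counts = Counter(b for b in column.upper() if b in background)
    return sum(n * math.log(pi_hat[b] / background[b]) for b, n in counts.items())

print(ml_frequencies("CCGG"))  # a column restricted to C and G
print(column_lods("CCGG"))     # positive: unlike the background
print(column_lods("ACGT"))     # zero: matches the background
```

Stacking the estimated vectors for k adjacent columns gives a k-column frequency matrix, which is the sense in which such an estimate yields a position weight matrix.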

Manuel Garber, Michele Clamp, Xiaohui Xie

Page 25:

Estimation of genome-wide constraint

10.5% conserved; 6% above FDR cutoff

Across entire genome: 5% under selection. Same as for Human-Mouse. What’s different?

Pilot Encode Regions (1%):

9.4% conserved; 5.7% above FDR cutoff

Genome-wide:

Manuel Garber, Michele Clamp, Xiaohui Xie

Page 26:

More mammals: We can actually tell which 5% it is!

[Figure: 4 mammals vs. 21 mammals, with constraint calculated over a 12mer and over a 50mer; annotations: 5% FDR; >40% FDR]

Michele Clamp

Page 27:

Individual conserved elements match known TF sites

Binding site resolution, even without known motif model

Example: TNNC1 (Troponin C)

[Figure: promoter alignment (5’) with constraint score track and known TF binding sites: TATA, SP-1, CEF-2, CEF1]

Michele Clamp

Page 28:

Binding sites for known regulators

Pouya Kheradpour

Alex Stark

Page 29:

Computing Branch Length Score (BLS)

[Figure: CTCF motif instance across the species tree, BLS = 2.23 sps (78%); annotations: mutations, missing short branches, movement]

Allows for:

1. Mutations permitted by motif degeneracy

2. Misalignment/movement of motifs within window (up to hundreds of nucleotides)

3. Missing motif in dense species tree
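One common way to compute a branch length score is to sum the branch lengths of the smallest subtree that connects all species in which the motif is found, and to report it as a fraction of the total tree length. The sketch below follows that reading; it is an assumption about the definition, not code from the talk, and the tree, branch lengths, and names are made up for illustration.

```python
def branch_length_score(parent, length, species_with_motif):
    """Total branch length of the minimal subtree spanning the given leaves.

    parent: maps every node to its parent (the root maps to None).
    length: maps every non-root node to the length of the branch above it.
    """
    hits = set(species_with_motif)
    if len(hits) < 2:
        return 0.0
    # For every node, count the motif-containing species at or below it.
    below = {node: 0 for node in parent}
    for leaf in hits:
        node = leaf
        while node is not None:
            below[node] += 1
            node = parent[node]
    total = len(hits)
    # A branch is part of the connecting subtree when motif-containing
    # species lie on both sides of it.
    return sum(length[n] for n in parent
               if parent[n] is not None and 0 < below[n] < total)

# Toy tree (made up): ((human:0.1, mouse:0.3)anc:0.2, dog:0.4)root
parent = {"root": None, "anc": "root", "dog": "root", "human": "anc", "mouse": "anc"}
length = {"anc": 0.2, "dog": 0.4, "human": 0.1, "mouse": 0.3}
tree_length = sum(length.values())

bls = branch_length_score(parent, length, ["human", "dog"])
print(bls, bls / tree_length)
```

Because only the species that contain the motif contribute, a species with a missing or misaligned instance lowers the score without zeroing it, which matches the tolerance described on the slide.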

Page 30:

Branch Length Score Confidence

1. Use motif-specific shuffled control motifs to determine the expected number of instances at each BLS by chance alone (or due to non-motif conservation)

2. Compute Confidence Score as fraction of instances over noise at a given BLS (= 1 – false discovery rate)

3. Many species are needed to confidently predict instances
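The confidence computation in steps 1 and 2 can be sketched as below. This is an illustration under assumptions: instances are counted at or above a BLS cutoff, the noise level is the average count over several shuffled control motifs, and the function name and toy scores are hypothetical.

```python
def confidence_at(motif_bls, control_bls_sets, cutoff):
    """Confidence = 1 - (expected control instances / motif instances) at a BLS cutoff.

    motif_bls: BLS values of all instances of the real motif.
    control_bls_sets: one list of BLS values per shuffled control motif.
    """
    n_motif = sum(1 for s in motif_bls if s >= cutoff)
    if n_motif == 0 or not control_bls_sets:
        return 0.0
    n_noise = sum(
        sum(1 for s in control if s >= cutoff) for control in control_bls_sets
    ) / len(control_bls_sets)
    return max(0.0, 1.0 - n_noise / n_motif)

# Toy scores (made up): the real motif keeps 3 instances at BLS >= 0.5,
# the two shuffled controls keep 1 and 0.
motif = [0.9, 0.8, 0.7, 0.2, 0.1]
controls = [[0.9, 0.1, 0.1], [0.2, 0.1, 0.1]]
print(confidence_at(motif, controls, cutoff=0.5))
```

With few species the BLS values of real and control instances overlap heavily, so no cutoff reaches high confidence; this is one way to read point 3 above.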

Page 31:

Performance on vertebrate Transfac motifs

1. Most motifs have confident instances, reaching 90% confidence with 18 mammals

2. Substantial increase in the number of instances compared to only human, mouse, rat, and dog.

[Figure: median number of instances (at fixed confidence); annotated increases: 2.5x, 3.5x, 6.5x]

Page 32:

Intersection with CTCF ChIP-Seq regions

ChIP-Seq and ChIP-Chip technologies allow for identifying binding sites of a motif experimentally

1. Conserved CTCF motif instances highly enriched in ChIP-Seq sites

2. High enrichment does not require low sensitivity

3. Many motif instances are verified

ChIP data from Barski, et al., Cell (2007)

[Figure annotations: ≥ 50% of regions with a motif; 50% motifs verified; 50% confidence]

Page 33:

Enrichment also found for other factors

We can accurately identify targets for many factors

[Figure panels, labeled by data source: Barski, et al., Cell (2007); Odom, et al., Nature Genetics (2007); Lim, et al., Molecular Cell (2007); Robertson, et al., Nature Methods (2006); Wei, et al., Cell (2006); Zeller, et al., PNAS (2006); Lin, et al., PLoS Genetics (2007)]

Page 34:

Enrichment increases in conserved bound regions

H: Barski, et al., Cell (2007)

Mouse: Bernstein, unpublished

1. ChIP bound regions may not be conserved

2. For CTCF we also have binding data in mouse

3. Enrichment in intersection is dramatically higher

Page 35:

Enrichment increases in conserved bound regions

H: Barski, et al., Cell (2007)

Mouse: Bernstein, unpublished

Odom, et al., Nature Genetics (2007)

1. ChIP bound regions may not be conserved

2. For CTCF we also have binding data in mouse

3. Enrichment in intersection is dramatically higher

4. Trend persists for other factors where we have multi-species ChIP data

Page 36:

Motif discovery

Pouya Kheradpour

Alex Stark

Page 37:

Using confidence for motif discovery

1. Use motif-specific shuffled control motifs to determine the expected number of instances at each BLS by chance alone (or due to non-motif conservation)

2. Compute Confidence Score as fraction of instances over noise at a given BLS (= 1 – false discovery rate)

Page 38:

Motif discovery pipeline

1. Enumerate motif seeds
• Six non-degenerate characters with variable size gap in the middle

2. Score seed motifs
• Use a conservation ratio corrected for composition and small counts to rank seed motifs

3. Expand seed motifs
• Use expanded nucleotide IUPAC alphabet to fill unspecified bases around seed using hill climbing

4. Cluster to remove redundancy
• Using sequence similarity

[Figure: example seed with six specified bases (G, T, C, A, G, T) around a central gap, and its expansion with IUPAC characters (R, R, Y, S, W)]
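Step 1 of the pipeline, seed enumeration, can be sketched in a few lines. The sketch makes two assumptions that the slide does not state: the six specified bases are split as two 3-mers around the gap, and the gap ranges from 0 to a hypothetical `max_gap` of 10. Function names are illustrative only.

```python
import itertools
import re

def enumerate_seeds(max_gap=10):
    """Yield (left, gap, right): two 3-mers separated by `gap` unspecified bases."""
    triplets = ["".join(p) for p in itertools.product("ACGT", repeat=3)]
    for left, right in itertools.product(triplets, repeat=2):
        for gap in range(max_gap + 1):
            yield left, gap, right

def seed_to_regex(left, gap, right):
    """Compile a seed into a regular expression; gap positions match any base."""
    return re.compile(f"{left}[ACGT]{{{gap}}}{right}")

def count_instances(seed_regex, sequence):
    """Count seed occurrences, including overlapping ones, via a lookahead."""
    lookahead = re.compile(f"(?={seed_regex.pattern})")
    return sum(1 for _ in lookahead.finditer(sequence))

n_seeds = sum(1 for _ in enumerate_seeds())
print(n_seeds)  # 4**6 seeds per gap size, times the number of gap sizes

pattern = seed_to_regex("GTC", 2, "AGT")
print(count_instances(pattern, "AAGTCTTAGTAA"))
```

Each seed would then be scored for conservation (step 2) before the gap and flanking positions are filled with IUPAC characters (step 3).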

Page 39:

Motif discovery in enhancer regions

• Collaboration with Ren, White, Posakony labs
  – Predict novel enhancer / promoter / insulator elements
  – Identify motifs associated with these regions
  – Validate predicted regions for in vivo function

• Initial results in human genome
  – Motif combinations predictive of enhancer regions (5X)

Heintzman et al, Bing Ren’s lab

Page 40:

Motif discovery in 3’UTRs

1. Perform motif discovery by ranking 7-mers in 3’UTRs by the highest confidence they reach with 100 instances.

Page 41:

Summary

• Measuring increased selection
  – Scaling of branch lengths: ω
  – Non-random stationary distribution: π
  – Increased resolution: individual binding sites

• Protein-coding genes
  – Distinct evolutionary signatures
  – Novel genes, revised genes
  – Unusual structures: read-through, increased selection

• microRNAs
  – Function of miRNA/miRNA* and sense/anti-sense pairs
  – Dense miRNA targeting network for Hox cluster

• Regulatory motifs
  – Measure increased selection, derive confidence score
  – High sensitivity / high specificity for known motifs
  – Use enumeration/confidence metric for motif discovery

Page 42:

Acknowledgements

Alex Stark, Pouya Kheradpour, Mike Lin, Matt Rasmussen, Michele Clamp, Xiaohui Xie, Kerstin Lindblad-Toh, Manuel Garber

MIT Computer Science and AI Lab; Broad Institute of MIT and Harvard

Sequencing: Baylor, WashU, Agencourt. Funding: NHGRI

miRNAs: Julius Brennecke, Graham Ruby, Greg Hannon, David Bartel

iab-4AS: Natascha Bushati, Steve Cohen, Julius, Greg Hannon

Sante Gnerre, David Jaffe, Issao Fujiwara, Federica Di Palma, Arachne Assembly Team, Broad Sequencing Platform, Eric Lander