Comparative genomics of 24 mammals
Manolis Kellis
MIT
MIT Computer Science & Artificial Intelligence Laboratory
Broad Institute of MIT and Harvard
Sequencing the mammalian phylogeny
#    Species        Center     Covg
H1   Human          Done       Full
H2   Chimp          Done       Full
H3   Rhesus         Done       Full
H4   Mouse          Done       Full
H5   Rat            Done       Full
H6   Dog            Done       Full
H7   Cow            Done       Full
1    Elephant       Broad      1.94x
2    Armadillo      Broad      1.98x
3    Tenrec         Broad      1.90x
4    Rabbit         Broad      1.95x
5    Guinea Pig     Broad      1.92x
6    Hedgehog       Broad      1.86x
7    Shrew          Broad      1.92x
8    Microbat       Broad      1.84x
9    Tree Shrew     Broad      1.89x
10   Squirrel       Broad      1.90x
11   Bushbaby       Broad      1.87x
12   Pika           Broad      1.92x
13   Mouse Lemur    Broad      1.93x
14   Horse          Broad      5.36x
15   Cat            Agencourt  1.87x
16   Dolphin        Baylor     2.59x
17   Hyrax          Baylor     2.19x
18   Kangaroo Rat   Baylor     1.85x
19   Megabat        Baylor     ~2x
20   Alpaca         WashU      2.34x
21   Tarsier        WashU      1.88x
22   Sloth          WashU      2.10x
23   Pangolin       x          x
24   Flying lemur   x          x
Kerstin Lindblad-Toh, Sante Gnerre, Federica Di Palma
Broad, Baylor, WashU, Arachne, UCSC
Comparative genomics of mammalian species
• Goal 1: Discover regions of increased selection
– Detect functional elements by their increased conservation
– More genomes: detect smaller elements, subtle selection
• Goal 2: Discover different classes of functional elements
– Patterns of change distinguish different types of functional elements
– Specific function → Selective pressures → Patterns of mutation/insertion/deletion
• Develop evolutionary signatures characteristic of each function
Protein-coding genes
Mike Lin
Evolutionary signatures for protein-coding genes
• Same conservation levels, distinct patterns of divergence
– Gaps are multiples of three (preserve amino acid translation)
– Mutations are largely 3-periodic (silent codon substitutions)
– Specific triplets exchanged more frequently (conservative substitutions)
– Conservation boundaries are sharp (pinpoint individual splicing signals)
Figure labels: non-synonymous substitutions; synonymous codon substitutions; frame-shifting gaps; gaps are multiples of 3
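Two of the signatures listed above (gap lengths in multiples of three, and 3-periodic mutations) can be checked on a toy alignment. The following is a minimal sketch only, not the CSF method from the talk; the function names and the toy sequences are hypothetical.

```python
import re

def gap_lengths(aligned_seq):
    """Lengths of each run of '-' in one aligned sequence."""
    return [len(run) for run in re.findall(r"-+", aligned_seq)]

def frame_preserving_gap_fraction(alignment):
    """Fraction of gap runs whose length is a multiple of three."""
    lengths = [n for seq in alignment for n in gap_lengths(seq)]
    if not lengths:
        return None  # no gaps at all, so the signature is uninformative
    return sum(1 for n in lengths if n % 3 == 0) / len(lengths)

def mismatches_by_codon_position(reference, other, frame=0):
    """Count mismatches at codon positions 1, 2, 3 (gapped columns are skipped)."""
    counts = [0, 0, 0]
    pos = 0  # ungapped coordinate in the reference
    for r, o in zip(reference, other):
        if r == "-":
            continue
        if o != "-" and r != o:
            counts[(pos - frame) % 3] += 1
        pos += 1
    return counts

# Toy data (hypothetical): all three mismatches fall on the third codon position.
print(mismatches_by_codon_position("ATGGCTAAAGGT", "ATGGCCAAGGGA"))
print(frame_preserving_gap_fraction(["ATG---GCT", "ATGAAAGCT", "ATGAA-GCT"]))
```

A coding region would be expected to show mismatches concentrated at one codon position and a high frame-preserving gap fraction; a conserved non-coding region would not.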
Protein-coding evolution vs nucleotide conservation
• Evolutionary signatures specific to each function
– Distinguish protein-coding from non-coding conservation
– Genome-wide run (CSF only): 81% sensitivity, 91% precision
– Incorporating additional signatures: RFC, single-species…
Figure labels: protein-coding exons; highly conserved non-coding elements
Many new genes confirmed by chromatin domains
• Several hundred new exons, many in clusters
Example: MM14qC3
• Supported by chromatin signatures (Guttman et al)
Mikkelsen et al
Missed exon
Alt. spliced exon
Genome-wide curation / experimental follow-up
• Novel candidate genes and exons
– Experimental cDNA sequencing and validation
– Curation of gene structures integrating evidence
• Revising existing annotations
– Identify dubious genes with non-protein-like evolution
– Refine boundaries and exon sets of existing genes
– Curation: evaluate evidence supporting that annotation
• Unusual gene structures
– Evolutionary evidence in absence of primary signals
– Reveal new and unusual biological mechanisms
G PI: Tim Hubbard, Sanger Center.
HAVANA curators, experimental validation.
Unusual protein-coding events
Mike Lin
When primary sequence signals are ignored
• Unusual gene (GPX2). Protein-coding signal continues past the stop.
• GPX2 is a known selenoprotein! Additional candidates found.
• Typical gene (MEF2A). Evolutionary signal stops at the stop codon.
Translational read-through in neuronal proteins
• New mechanism of post-transcriptional control.
– Conserved in both mammals (~5 candidates) and flies (~150 candidates)
– Strongly enriched for neurotransmitters and brain-expressed proteins
– Read-through stop codon (& surrounding) shows increased conservation
• Many questions remain
– Role of editing? Cryptic splice sites? RNA secondary structure?
Figure labels: protein-coding conservation; stop codon read-through; continued protein-coding conservation; 2nd stop codon; no more conservation
Lin et al, Genome Research 2007
Novel candidate: OPRL1 neurotransmitter
Measuring excess constraint within protein-coding exons
Typical protein-coding exon (Numerous mutations, at each column)
Excess-conservation exon: conserved above and beyond the call of duty
Likely to have additional functions, overlapping selective pressures
Searching for excess-constraint coding sequence
(1) Build a model for expected substitution counts
• Syn. subs. correlate w/ degeneracy & CpG
• Distribution for each ancestral codon
(2) Score windows for depletion in syn. subst.
• Z-score: P(obs. subst | expected for each codon)
(3) Top candidate exons with excess constraint
• PCPB2: derived from ancestral transposon
• Hox B5 gene start: 52 AA before 1 syn. subst
• C6orf111: predicted ORF on chr. 6
• EIF4G2: overlaps spliced EvoFold prediction
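The window-scoring step can be illustrated with a small sketch. It assumes, purely for illustration, that the model yields an expected synonymous-substitution count per codon and that counts are independent and Poisson-like (variance equal to the mean); the slides do not state the actual distribution, and the function name and inputs are hypothetical.

```python
import math

def window_z_scores(expected, observed, window):
    """Z-score of observed vs. expected synonymous substitutions per sliding window.

    expected[i]: expected substitution count at codon i under the null model
    observed[i]: observed substitution count at codon i
    Assumption: independent Poisson-like counts, so the window variance
    equals the sum of the expected counts.
    """
    scores = []
    for start in range(len(expected) - window + 1):
        exp_sum = sum(expected[start:start + window])
        obs_sum = sum(observed[start:start + window])
        if exp_sum == 0:
            scores.append(0.0)  # nothing expected, nothing to test
        else:
            scores.append((obs_sum - exp_sum) / math.sqrt(exp_sum))
    return scores

# Hypothetical example: one substitution expected per codon, none observed mid-exon.
scores = window_z_scores([1.0] * 8, [1, 1, 0, 0, 0, 0, 1, 1], window=4)
print(min(scores))  # the most negative window is the strongest depletion
```

Strongly negative scores mark windows with fewer synonymous substitutions than the model expects, which is the excess-constraint signal described above.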
Examples: Top candidate exons showing increased selection
• HoxB5: 52 amino acids before the first synonymous substitution
• Overlaps highly conserved RNA secondary structure
• C6orf11: Predicted ORF, protein-coding, extremely conserved
• EIF4G2: Several consecutive exons, conserved RNA struct.
microRNA genes
Alex Stark
Pouya Kheradpour
Evolutionary signatures for microRNA genes
Combine with 10 other features → 4,500-fold enrichment
(1) Conservation profile
Novel miRNAs validated by sequencing reads
Ruby, Bartel, Lai
348 reads
16 reads
• In fly genome: 101 hairpins above 0.95 cutoff
– 60 of 74 (81%) known Rfam miRNAs rediscovered
– + 24 novel expression-validated by 454 & Solexa (Bartel/Hannon)
– + 17 additional candidates show diverse evidence of function
• In mammals: combine experimental & evolutionary info
– Rely on reads for discovery, use evolutionary signal to study function
Stark et al, Genome Research (GR) 2007. Ruby et al, GR 2007
Surprise 1: microRNA & microRNA* function
• Both hairpin arms of a microRNA can be functional
– High scores, abundant processing, conserved targets
– Hox miRNAs miR-10 and miR-iab-4 as master Hox regulators
Stark et al, Genome Research 2007
Drosophila Hox
Surprise 2: microRNA-anti-sense function
• A single miRNA locus transcribed from both strands
• The two transcripts show distinct expression domains (mutually exclusive)
• Both processed to mature miRNAs: mir-iab-4, miR-iab-4AS (anti-sense)
Figure labels: sense; anti-sense
Stark et al, Genes&Development 2007
Highly conserved Hox targets
miR-iab-4AS leads to homeotic transformations
• Mis-expression of mir-iab-4S & AS: halteres → wings homeotic transformation
• Stronger phenotype for AS miRNA
• Sense/anti-sense pairs as general building blocks for miRNA regulation
• 10 sense/anti-sense miRNAs in mouse
Figure panels: WT; sense; antisense. Labels: haltere; wing; sensory bristles; wing w/ bristles.
Note: C, D, E same magnification
Stark et al, Genes&Development 2007
Function of miRNA* arms and anti-sense miRNAs
• Denser Hox miRNA targeting network
Measuring selection
Michele Clamp
Manuel Garber
Xiaohui Xie
Detecting Purifying Selection (ω)
Neutral sequence Constrained sequence
Estimating intensity of constraint (ω):
• Probabilistic evolutionary model
• Maximum Likelihood (ML) estimation of ω
- sitewise (evaluate every k-long window)
- windows-based (increased power)
• Reports ω, and its log odds score (LODS).
• Theoretical p-value (LODS distributes as χ² with df = 1)
Manuel Garber, Michele Clamp, Xiaohui Xie
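To make the ω estimate concrete, here is a minimal sketch of ML estimation of a branch-length scaling factor for one window. It is not the authors' implementation: it assumes a star phylogeny, the Jukes-Cantor (JC69) substitution model, a simple grid search, and that twice the natural-log likelihood ratio is the χ²-distributed statistic. All names and inputs are hypothetical.

```python
import math

def jc69_probs(t):
    """JC69 probabilities of (same base, one specific different base) after branch length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay, 0.25 - 0.25 * decay

def window_log_likelihood(columns, branch_lengths, omega):
    """Log-likelihood of alignment columns on a star tree with branches scaled by omega."""
    total = 0.0
    for column in columns:  # one string per column, one character per species
        site_likelihood = 0.0
        for ancestor in "ACGT":
            prob = 0.25  # uniform prior on the ancestral base
            for base, t in zip(column, branch_lengths):
                same, diff = jc69_probs(omega * t)
                prob *= same if base == ancestor else diff
            site_likelihood += prob
        total += math.log(site_likelihood)
    return total

def estimate_omega(columns, branch_lengths, grid_step=0.01, omega_max=3.0):
    """Grid-search ML estimate of omega, its LODS versus omega = 1, and a p-value."""
    steps = int(round(omega_max / grid_step))
    grid = [grid_step * k for k in range(1, steps + 1)]
    best = max(grid, key=lambda w: window_log_likelihood(columns, branch_lengths, w))
    lods = (window_log_likelihood(columns, branch_lengths, best)
            - window_log_likelihood(columns, branch_lengths, 1.0))
    lods = max(lods, 0.0)
    # Assumed statistic: 2 * LODS ~ chi-square(df = 1), whose survival function
    # is erfc(sqrt(x / 2)); with x = 2 * LODS this is erfc(sqrt(LODS)).
    p_value = math.erfc(math.sqrt(lods))
    return best, lods, p_value

# Hypothetical 12-column window, 4 species, neutral branch length 0.3 each.
print(estimate_omega(["AAAA"] * 12, [0.3] * 4))
```

A fully conserved window drives the estimate toward the smallest ω on the grid, while a highly divergent window pushes it above 1. A real analysis would use the actual species tree and a fitted neutral model rather than a star tree.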
Detecting other constraint signatures (π)
0 0 0.8 0.5 0.6 3.2 0 0
• Repeated CG transversion
• Has happened at least 4 times.
• Very unlikely given neutral model.
• Goal: Identify sites with unlikely substitution pattern.
• Approach: Probabilistic method to detect a stationary distribution that is different from background.
• Solution: Implement ML estimator (π) of this vector:
• Provides a Position Weight Matrix for any given k-mer in the genome.
• Scores every base in the genome (LODS).
Manuel Garber, Michele Clamp, Xiaohui Xie
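The idea of scoring a site by how far its base distribution departs from the background can be sketched as follows. This is a deliberate simplification: it estimates π from raw base counts in one alignment column and ignores the phylogeny, whereas the method described above estimates a stationary distribution under an evolutionary model. The uniform background, the pseudocount, and the function names are assumptions.

```python
import math

BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed uniform background

def estimate_pi(column, pseudocount=0.5):
    """Estimate base frequencies at one site from observed bases, with pseudocounts."""
    counts = {base: pseudocount for base in "ACGT"}
    for base in column:
        if base in counts:  # skip gaps and ambiguity codes
            counts[base] += 1
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGT"}

def site_lods(column, background=BACKGROUND):
    """Log-odds of the observed bases under the site-specific pi versus the background."""
    pi = estimate_pi(column)
    return sum(math.log(pi[b] / background[b]) for b in column if b in pi)

# Hypothetical columns: a site restricted to C/G versus a site using all four bases.
print(site_lods("CGCGCGCG"), site_lods("ACGTACGT"))
```

Stacking the per-site π estimates across k adjacent positions gives a matrix with the same shape as a position weight matrix, which is the sense in which the slide says the estimator provides a PWM for any k-mer.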
Estimation of genome-wide constraint
10.5% conserved
6% above FDR cutoff
Across entire genome: 5% under selection. Same as for Human-Mouse. What’s different?
Pilot Encode Regions (1%):
9.4% conserved
5.7% above FDR cutoff
Genome-wide:
Manuel Garber, Michele Clamp, Xiaohui Xie
More mammals: We can actually tell which 5% it is!
Figure panels (4 mammals vs. 21 mammals):
– Constraint calculated over a 12mer
– Constraint calculated over a 50mer
Panel labels: 5% FDR; >40% FDR
Michele Clamp
Individual conserved elements match known TF sites
Binding site resolution, even without known motif model
Figure tracks (5’ end marked): promoter alignment; constraint score; known TF binding sites (TATA, SP-1, CEF-2, CEF1)
Michele Clamp
Example: TNNC1 (Troponin C)
Binding sites for known regulators
Pouya Kheradpour
Alex Stark
Computing Branch Length Score (BLS)
CTCF
BLS = 2.23 sps (78%)
Allows for:
1. Mutations permitted by motif degeneracy
2. Misalignment/movement of motifs within window (up to hundreds of nucleotides)
3. Missing motif in dense species tree
Figure labels: mutations; movement; missing short branches
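The tree part of the score can be sketched as the total branch length of the smallest subtree connecting the species in which a motif instance is found, reported both in absolute terms and as a fraction of the whole tree (presumably the meaning of the percentage on the slide). The tree encoding, the function name, and the toy branch lengths below are hypothetical, and the sketch does not cover motif degeneracy or movement within the window.

```python
def branch_length_score(parents, species_with_motif):
    """Total branch length of the smallest subtree connecting the given species.

    parents maps each non-root node to (parent_name, branch_length).
    Returns (score, score as a fraction of total tree length).
    """
    selected = set(species_with_motif)
    below = {}  # node -> number of selected species at or below it
    for leaf in selected:
        node = leaf
        while node in parents:
            below[node] = below.get(node, 0) + 1
            node = parents[node][0]
    # A branch lies on a connecting path only if selected species sit on both sides of it.
    bls = sum(parents[node][1] for node, n in below.items() if n < len(selected))
    total = sum(length for _, length in parents.values())
    return bls, bls / total

# Toy four-species tree with made-up branch lengths (substitutions per site).
toy_tree = {
    "human": ("primates_rodents", 0.1),
    "mouse": ("rodents", 0.3),
    "rat": ("rodents", 0.3),
    "rodents": ("primates_rodents", 0.2),
    "primates_rodents": ("root", 0.1),
    "dog": ("root", 0.4),
}
print(branch_length_score(toy_tree, ["human", "mouse", "dog"]))
```

Because only branches between the selected species are counted, a motif missing from one short-branch species costs little, which matches point 3 above.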
Branch Length Score Confidence
1. Use motif-specific shuffled control motifs to determine the expected number of instances at each BLS by chance alone (or due to non-motif conservation)
2. Compute Confidence Score as the fraction of instances over noise at a given BLS (= 1 – false discovery rate)
3. Many species are needed to confidently predict instances
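The confidence calculation amounts to comparing real and control instance counts above a BLS cutoff. A minimal sketch, assuming the BLS values of real and shuffled-control instances are already available as lists; the function name, the `n_controls` parameter, and the example values are hypothetical.

```python
def confidence_at_bls(real_scores, control_scores, cutoff, n_controls=1):
    """Confidence = 1 - (expected control instances / real instances) at BLS >= cutoff.

    n_controls: number of shuffled control motifs pooled into control_scores,
    used to average the control count down to a per-motif expectation.
    """
    real = sum(1 for s in real_scores if s >= cutoff)
    if real == 0:
        return 0.0
    noise = sum(1 for s in control_scores if s >= cutoff) / n_controls
    return max(0.0, 1.0 - noise / real)

# Hypothetical BLS values for one motif and one shuffled control.
real = [0.9, 0.8, 0.8, 0.5, 0.2, 0.1]
control = [0.8, 0.3, 0.2, 0.1, 0.1, 0.05]
print(confidence_at_bls(real, control, cutoff=0.5))
```

Raising the cutoff removes control instances faster than real ones when the motif is under selection, so confidence rises with BLS; more species give a finer range of BLS values to choose from.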
Performance on vertebrate Transfac motifs
1. Most motifs have instances reaching 90% confidence with 18 mammals
2. Substantial increase in the number of instances compared to using only human, mouse, rat, and dog.
Figure: median number of instances (at fixed confidence). Annotated increases: 2.5x; 3.5x; 6.5x
Intersection with CTCF ChIP-Seq regions
ChIP-Seq and ChIP-Chip technologies allow for identifying binding sites of a motif experimentally
1. Conserved CTCF motif instances highly enriched in ChIP-Seq sites
2. High enrichment does not require low sensitivity
3. Many motif instances are verified
ChIP data from Barski, et al., Cell (2007)
Figure annotations: ≥ 50% of regions with a motif; 50% motifs verified; 50% confidence
Enrichment also found for other factors
Barski, et al., Cell (2007)
We can accurately identify targets for many factors
Odom, et al., Nature Genetics (2007)
Lim, et al., Molecular Cell (2007)
Robertson, et al., Nature Methods (2006)
Wei, et al., Cell (2006)
Zeller, et al., PNAS (2006)
Lin, et al., PLoS Genetics (2007)
Enrichment increases in conserved bound regions
Human: Barski, et al., Cell (2007)
Mouse: Bernstein, unpublished
1. ChIP bound regions may not be conserved
2. For CTCF we also have binding data in mouse
3. Enrichment in intersection is dramatically higher
Odom, et al., Nature Genetics (2007)
4. Trend persists for other factors where we have multi-species ChIP data
Motif discovery
Pouya Kheradpour
Alex Stark
Using confidence for motif discovery
1. Use motif-specific shuffled control motifs to determine the expected number of instances at each BLS by chance alone (or due to non-motif conservation)
2. Compute Confidence Score as the fraction of instances over noise at a given BLS (= 1 – false discovery rate)
Motif discovery pipeline
1. Enumerate motif seeds
• Six non-degenerate characters with variable size gap in the middle
2. Score seed motifs
• Use a conservation ratio corrected for composition and small counts to rank seed motifs
3. Expand seed motifs
• Use expanded nucleotide IUPAC alphabet to fill unspecified bases around seed using hill climbing
4. Cluster to remove redundancy
• Using sequence similarity
Figure: example seed (GTC, gap, AGT) and its expansion with IUPAC codes (R, Y, S, W)
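Step 1 of the pipeline can be sketched as enumerating every pair of fixed 3-mers separated by a gap of unspecified bases, then counting instances of a seed in a sequence. The maximum gap size is not given on the slide, so `max_gap` is an assumption, as are the function names; scoring, expansion, and clustering (steps 2 to 4) are not covered.

```python
import itertools
import re

def enumerate_seeds(max_gap=10, half=3):
    """Yield (left, gap, right) seeds: two fixed 3-mers separated by `gap` unspecified bases."""
    kmers = ["".join(p) for p in itertools.product("ACGT", repeat=half)]
    for gap in range(max_gap + 1):
        for left in kmers:
            for right in kmers:
                yield left, gap, right

def seed_pattern(left, gap, right):
    """Compile a seed to a regex; the lookahead lets overlapping instances be counted."""
    return re.compile("(?=%s[ACGT]{%d}%s)" % (left, gap, right))

def count_instances(seed, sequence):
    """Number of (possibly overlapping) occurrences of a seed in a sequence."""
    return sum(1 for _ in seed_pattern(*seed).finditer(sequence))

# The seed from the figure, with a hypothetical gap of 2, on a made-up sequence.
print(count_instances(("GTC", 2, "AGT"), "GTCAAAGTCTTAGT"))
```

With four bases and six fixed positions there are 4,096 seeds per gap size, so exhaustive enumeration stays cheap even for a generous gap range.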
Motif discovery in enhancer regions
• Collaboration with Ren, White, Posakony labs
– Predict novel enhancer / promoter / insulator elements
– Identify motifs associated with these regions
– Validate predicted regions for in vivo function
• Initial results in human genome
– Motif combinations predictive of enhancer regions (5X)
Heintzman et al, Bing Ren’s lab
Motif discovery in 3’UTRs
1. Perform motif discovery by ranking 7-mers in 3’UTRs by the highest confidence they reach with 100 instances.
Summary
• Measuring increased selection
– Scaling of branch lengths: ω
– Non-random stationary distribution: π
– Increased resolution: individual binding sites
• Protein-coding genes
– Distinct evolutionary signatures
– Novel genes, revised genes
– Unusual structures: read-through, increased selection
• microRNAs
– Function of miRNA/miRNA* and sense/anti-sense pairs
– Dense miRNA targeting network for Hox cluster
• Regulatory motifs
– Measure increased selection, derive confidence score
– High sensitivity / high specificity for known motifs
– Use enumeration/confidence metric for motif discovery
Acknowledgements
MIT Computer Science and AI Lab; Broad Institute of MIT and Harvard
Alex Stark
Pouya Kheradpour
Mike Lin
Matt Rasmussen
Michele Clamp
Xiaohui Xie
Kerstin Lindblad-Toh
Manuel Garber
Sante Gnerre, David Jaffe
Issao Fujiwara
Federica Di Palma
Arachne Assembly Team
Broad Sequencing Platform
Eric Lander
Sequencing: Baylor, WashU, Agencourt. Funding: NHGRI
miRNAs: Julius Brennecke, Graham Ruby, Greg Hannon, David Bartel
iab-4AS: Natascha Bushati, Steve Cohen, Julius, Greg Hannon