TITLE OF PRESENTATION Board of Scientific Counselors January 2007 Your Name.
COMPARATIVE GENOMICS
Manolis Kellis
Board of Scientific Counselors
January 2007
TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATACATATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTCAGTAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTCCGTGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACTAGCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATGATAATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAATTGTTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAATTCTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGGATTTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGATTTTGATATGCTTTGCGCCGTCAAAGTTTTGAACGAGAAAAATCCATCCATTACCTTAATAAATGCTGATCCCAAATTTGCTCAAAGGAAGTTCGATTTGCCGTTGGACGGTTCTTATGTCACAATTGATCCTTCTGTGTCGGACTGGTCTAATTACTTTAAATGTGGTCTCCATGTTGCTCACTCTTTTCTAAAGAAACTTGCACCGGAAAGGTTTGCCAGTGCTCCTCTGGCCGGGCTGCAAGTCTTCTGTGAGGGTGATGTACCAACTGGCAGTGGATTGTCTTCTTCGGCCGCATTCATTTGTGCCGTTGCTTTAGCTGTTGTTAAAGCGAATATGGGCCCTGGTTATCATATGTCCAAGCAAAATTTAATGCGTATTACGGTCGTTGCAGAACATTATGTTGGTGTTAACAATGGCGGTATGGATCAGGCTGCCTCTGTTTGCGGTGAGGAAGATCATGCTCTATACGTTGAGTTCAAACCGCAGTTGAAGGCTACTCCGTTTAAATTTCCGCAATTAAAAAACCATGAAATTAGCTTTGTTATTGCGAACACCCTTGTTGTATCTAACAAGTTTGAAACCGCCCCAACCAACTATAATTTAAGAGTGGTAGAAGTCACTACAGCTGCAAATGTTTTAGCTGCCACGTACGGTGTTGTTTTACTTTCTGGAAAAGAAGGATCGAGCACGAATAAAGGTAATCTAAGAGATTTCATGAACGTTTATTATGCCAGATATCACAACATTTCCACACCCTGGAACGGCGATATTGAATCCGGCATCGAACGGTTAACAAAGATGCTAGTACTAGTTGAAGAGTCTCTCGCCAATAAGAAACAGGGCTTTAGTGTTGACGATGTCGCACAATCCTTGAATTGTTCTCGCGAAGAATTCACAAGAGACTACTTAACAACATCTCCAGTGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAATCTTTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATGAACGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCT
TGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATCATATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAAAAGAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCAGCATTGGGCAGCTGTCTATATGAATTATAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACTTTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAAAAAGAGATTGCCGTCTTGAAACTTTTTGTCCTTTTTTTTTTCCGGGGACTCTACGAGAACCCTTTGTCCTACTGATTAATTTTGTACTGAATTTGGACAATTCAGATTTTAGTAGACAAGCGCGAGGAGGAAAAGAAATGACAGAAAAATTCCGATGGACAAGAAGATAGGAAAAAAAAAAAGCTTTCACCGATTTCCTAGACCGGAAAAAAGTCGTATGACATCAGAATGAAAAATTTTCAAGTTAGACAAGGACAAAATCAGGACAAATTGTAAAGATATAATAAACTATTTGATTCAGCGCCAATTTGCCCTTTTCCATTTTCCATTAAATCTCTGTTCTCTCTTACTTATATGATGATTAGGTATCATCTGTATAAAACTCCTTTCTTAATTTCACTCTAAAGCATACCCCATAGAGAAGATCTTTCGGTTCGAAGACATTCCTACGCATAATAAGAATAGGAGGGAATAATGCCAGACAATCTATCATTACATTTAAGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAAGAGTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATACAGCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACAACCAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATCAACACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGTTGGTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCTTCTCTTTATGGCCCGTTATTAACAGAGTCGTCATGGCCATCGTTTGGTATAGTGTCCAAGCTTATATTGCGGCAACTCCCGTATCATTAATGCTGAAATCTATCTTTGGAAAAGATTTACAA
Genes: encode proteins
Regulatory motifs: control gene expression
32 mammals
9 yeasts
12 flies
The power of comparative genomics
• Comparative genomics reveals selection
– Functional elements mostly conserved
– Non-functional regions mostly diverged
Functional regions stand out
• Comparative genomics reveals function
– Each type of function under unique constraints (proteins, RNA, motifs each evolve differently)
– Discover them by their distinct evolutionary patterns
Evolutionary signatures for each type of element
human mouse rat chimp dog
8 Candida
Comparative genomics leads to…
1. Genome interpretation
– Decode the human genome
– Discover all functional elements
The building blocks
2. Cell circuitry
– Discover all control constructs
– Regulatory network properties
The interconnections
3. Evolutionary innovation
– Emergence of new functions
– Genome and network duplication
The dynamics
Distinguishing genes from non-coding regions
Dmel TGTTCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dsec TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dsim TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dyak TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGCCTTCTACCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dere TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-CTTAGCCATGCGGAGTGCCTCCTGCCATTGCCGTGCGGGCGAGCATGT---GGCTCCAGCATCTTT
Dana TGTCCATAAATAAA-----TCTACAACATTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGACCGTTCATG------CGGCCGTGA---GGCTCCATCATCTTA
Dpse TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGGCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATCATTTTC
Dper TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGCCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATTATTTTC
Dwil TGTTCATAAATGAA-----TTTACAACACTTAACTGAGTTAGCCAAGCCGAGTGCCGCCGGCCATTAGTATGCAAACGACCATGG---GGTTCCATTATCTTC
Dmoj TGATTATAAACGTAATGCTTTTATAACAATTAGCTG-GTTAGCCAAGCCGAGTGGCGCC------TGCCGTGCGTACGCCCCTGTCCCGGCTCCATCAGCTTT
Dvir TGTTTATAAAATTAATTCTTTTAAAACAATTAGCTG-GTTAGCCAGGCGGAATGGCGCC------GTCCGTGCGTGCGGCTCTGGCCCGGCTCCATCAGCTTC
Dgri TGTCTATAAAAATAATTCTTTTATGACACTTAACTG-ATTAGCCAGGCAGAGTGTCGCC------TGCCATGGGCACGACCCTGGCCGGGTTCCATCAGCTTT
***** * * ** *** *** *** ******* ** ** ** * * ** * ** ** ** ** **** * **
• Protein-coding genes have specific evolutionary constraints
– Gaps are multiples of three (preserve amino acid translation)
– Mutations are largely 3-periodic (silent codon substitutions)
– Specific triplets exchanged more frequently (conservative substitutions)
– Conservation boundaries are sharp (pinpoint individual splicing signals)
• Encode as ‘evolutionary signatures’
– Computational test for each of them
– Combine and score systematically
Figure labels: Splice; Frame-shifting indels; Periodic mutations; Synonymous substs.
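Two of the constraints listed above (gap lengths and 3-periodic mutations) lend themselves to very simple computational tests. The following is a minimal sketch, not the scoring used in the talk: the function names and the toy alignments are invented for illustration, and the reference row is assumed to start in frame.

```python
import re

def gap_run_lengths(aligned_seq):
    """Lengths of each consecutive run of '-' in one aligned sequence."""
    return [len(run) for run in re.findall(r"-+", aligned_seq)]

def frame_preserving_fraction(alignment):
    """Fraction of gap runs whose length is a multiple of three (None if no gaps)."""
    lengths = [n for seq in alignment for n in gap_run_lengths(seq)]
    if not lengths:
        return None
    return sum(n % 3 == 0 for n in lengths) / len(lengths)

def mismatches_by_codon_position(reference, other):
    """Count mismatches at codon positions 0, 1, 2 of the reference.

    Codon position is counted along the ungapped reference (assumed to start
    in frame); columns with a gap in either row are skipped.
    """
    counts = [0, 0, 0]
    ref_pos = 0
    for a, b in zip(reference, other):
        if a == "-":
            continue
        if b != "-" and a != b:
            counts[ref_pos % 3] += 1
        ref_pos += 1
    return counts

# Toy alignments, invented for illustration (not taken from the slide).
coding_like = ["ATGGCTAAAGGT", "ATGGCCAAAGGA", "ATGGCA---GGT"]
noncoding_like = ["ATGGCTAAAGGT", "AT-GCTAAAGGT", "ATGGCTA--GGT"]

print(frame_preserving_fraction(coding_like))
print(frame_preserving_fraction(noncoding_like))
print(mismatches_by_codon_position(coding_like[0], coding_like[1]))
```

In the coding-like toy example every gap preserves the reading frame and both mismatches fall on the third codon position; the non-coding-like example has only frame-shifting gaps. A real test would score these counts against a background model rather than report them raw.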
Power of evolutionary signatures
Signatures much more precise than level of conservation
– Before: parsing a genome into high-conservation / low-conservation
– Now: parse into protein-coding conservation / RNA-like / motif-like, etc.
Probabilistic framework
• Hidden Markov Models (HMMs)
– Generative model; learn emission and transition probabilities
– Easy to train, hard to integrate long-range signals
• Conditional Random Fields (CRFs)
– Discriminative dual of HMMs; learn weights on features
– Easy to integrate diverse signals; gradient ascent for training
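To make the HMM half of the framework concrete, here is a minimal two-state Viterbi sketch that parses a sequence of per-column observations into coding and non-coding stretches. Everything specific in it is an assumption made for illustration: the two states, the three-letter observation alphabet (C = conserved column, S = synonymous-like change, N = non-synonymous-like change) and all probabilities are hypothetical, and in practice the emission and transition probabilities would be learned from annotated genes.

```python
import math

STATES = ("coding", "noncoding")

# Hypothetical parameters, chosen only for illustration.
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"C": 0.6, "S": 0.3, "N": 0.1},
    "noncoding": {"C": 0.3, "S": 0.2, "N": 0.5},
}

def viterbi(observations):
    """Most probable state path for a non-empty observation string (log space)."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
             for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        previous = score
        score, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for ending in state s at this column.
            best = max(STATES, key=lambda p: previous[p] + math.log(TRANS[p][s]))
            score[s] = (previous[best] + math.log(TRANS[best][s])
                        + math.log(EMIT[s][obs]))
            pointers[s] = best
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

print(viterbi("CSCSCSNNNNNN"))
```

With these toy parameters the path switches from coding to non-coding where the run of N observations begins. A CRF would replace the emission and transition tables with learned weights on arbitrary features of the alignment, which is what makes it easier to combine several signatures.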
Known genes stand out
Substitution typical of protein-coding regions
Substitution typical of intergenic regions
Previously-annotated start codon Newly-identified start codon
Ability to identify subtle events
ATG ATG
• Translation start corrected for 200 genes
Protein-coding conservation
Continued protein-coding conservation
No more conservation
• Hundreds of read-through regions identified
• New mechanism of post-transcriptional control. Many questions remain.
• Enriched in brain proteins, ion channels. Under ADAR control.
Stop codon read-through
2nd stop codon
• Towards a revised genome annotation
– Curation: FlyBase integrates prediction with cDNA, protein, literature
– Experimentation: BDGP large-scale functional validation of novel exons
• High-accuracy reannotation
– Ability to detect small genes & exons (40AA: 95|99|99%, 20AA: 87|96|99%)
– Detect subtle events: sequencing errors, start/stop and splice site changes
– Recognize unusual gene structures: read-through, uORFs, RNA editing
D. simulans
D. erecta
D. persimilis
D. melanog.
Summary: Revisiting fly genome annotation
(…)
454 genes 800 genes 668 genes 12,000 genes
Confirmed Dubious Novel Refined
Powerful approach for comprehensive genome annotation
sen | pre | spe sen | pre | spe
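The accuracy triples quoted for small exons are labelled sen | pre | spe, which presumably stands for sensitivity, precision and specificity (an assumption; the slide does not spell the abbreviations out, nor the level at which predictions are counted). A minimal sketch of the standard definitions, with invented counts:

```python
def sen_pre_spe(tp, fp, tn, fn):
    """Standard definitions from confusion-matrix counts.

    sensitivity = TP / (TP + FN)   fraction of real elements recovered
    precision   = TP / (TP + FP)   fraction of predictions that are real
    specificity = TN / (TN + FP)   fraction of negatives correctly rejected
    """
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    return sensitivity, precision, specificity

# Hypothetical counts, not taken from the talk.
sen, pre, spe = sen_pre_spe(tp=95, fp=1, tn=99, fn=5)
print(f"{sen:.0%} | {pre:.0%} | {spe:.0%}")
```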
Comparative genomics
1. Genome interpretation
– Decode the human genome
– Discover all functional elements
The building blocks
2. Cell circuitry
– Discover all control constructs
– Regulatory network properties
The interconnections
3. Evolutionary innovation
– Emergence of new functions
– Genome and network duplication
The dynamics
The regulatory code
• Multiple levels of regulation
– Temporal and spatial regulation, disease, development
– Chromatin, pre- / post-transcriptional, splicing, translational
• Combinatorial coding of individual motifs
– The core: a relatively small number of regulatory motifs
– Regions: diverse motif combinations specify diverse functions
• Regulatory motifs
– Summarize information across thousands of sites
• Distinguish: regulatory motifs vs. motif instances
– Challenging to discover
• Small (6-8 nucleotides), subtle (frequent degenerate positions), dispersed (act at a distance), diverse (sequence composition)
Enhancer regions
5’-UTR
Promoter motifs
3’-UTR
Splicing signals Motifs at RNA level
Regulatory motif discovery
Study known motifs
Derive conservation rules
Discover novel motifs
Known motifs are preferentially conserved
• In multi-species alignments: known motifs correspond to conservation islands
– Conserved biology: conserved regulatory code, same words are functional
– Preferential conservation: stand out from surrounding nucleotides
– Good signal for identifying individual instances of known motifs
• Need additional power for motif discovery:
– Conservation not limited to exact binding site: additional bases would be found
– Weakly constrained positions can diverge: real motifs will be missed
– How do we discover motifs de novo? Use basic property of regulatory motifs
Evaluate genome-wide conservation over thousands of instances
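The idea of evaluating conservation over all instances of a motif can be sketched as follows. This is a simplified illustration under stated assumptions, not the scoring behind the MCS column in the talk: an instance counts as conserved only if every other species matches the consensus in the same alignment columns, and the enrichment is summarized as a binomial z-score against a background conservation rate that the caller supplies. The function names are hypothetical; the four aligned rows are the human, dog, mouse and rat sequences shown on the slide, and TGACCTTG is read off their conserved block.

```python
import math
import re

# IUPAC degenerate nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "K": "[GT]", "M": "[AC]",
    "S": "[CG]", "W": "[AT]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def count_conserved(alignment, consensus):
    """Return (total, conserved) for motif instances found in the first row.

    An instance is 'conserved' if every other row also matches the consensus
    in exactly the same alignment columns (gaps never match).
    """
    body = "".join(IUPAC[c] for c in consensus)
    exact = re.compile(body)
    overlapping = re.compile("(?=" + body + ")")  # find overlapping instances
    reference, others = alignment[0], alignment[1:]
    k = len(consensus)
    total = conserved = 0
    for match in overlapping.finditer(reference):
        window = slice(match.start(), match.start() + k)
        total += 1
        if all(exact.fullmatch(row[window]) for row in others):
            conserved += 1
    return total, conserved

def conservation_z(total, conserved, background_rate):
    """Binomial z-score of the conserved count against a background rate."""
    expected = total * background_rate
    sd = math.sqrt(total * background_rate * (1 - background_rate))
    return (conserved - expected) / sd

alignment = [
    "CGGGTAGGCCTGGCCGAAAATCTCTCCCGCGCGCCTGACCTTGGGTTGCCCCAGCCAGGC",  # human
    "CAGGC---CCGGGCTGCAGACCTGCCCTGAGGGAATGACCTTGGGCGGCCGCAGCGGGGC",  # dog
    "--------------CACAAGCCTGTGGCGCGC-CGTGACCTTGGGCTGCCCCAGGCGGGC",  # mouse
    "--------------CACAAGTTTCTC---TGC-CCTGACCTTGGGTTGCCCCAGGCGAG-",  # rat
]
print(count_conserved(alignment, "TGACCTTG"))

# Hypothetical genome-wide counts and background rate, for illustration only.
print(round(conservation_z(total=1000, conserved=300, background_rate=0.1), 2))
```

A single locus says little on its own; the signal comes from summing such counts over thousands of instances genome-wide and comparing against the conservation rate of comparable random patterns.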
Errα
human CGGGTAGGCCTGGCCGAAAATCTCTCCCGCGCGCCTGACCTTGGGTTGCCCCAGCCAGGC
dog   CAGGC---CCGGGCTGCAGACCTGCCCTGAGGGAATGACCTTGGGCGGCCGCAGCGGGGC
mouse --------------CACAAGCCTGTGGCGCGC-CGTGACCTTGGGCTGCCCCAGGCGGGC
rat   --------------CACAAGTTTCTC---TGC-CCTGACCTTGGGTTGCCCCAGGCGAG-
                       *    *       *    **********  *** ***    *
Gabpa
Human
Dog
Mouse
Rat
Errα Errα Errα
Consensus MCS Matches to known Expression enrichment Promoters Enhancers
1 CTAATTAAA 65.6 engrailed (en) 25.4 2
2 TTKCAATTAA 57.3 reversed-polarity (repo) 5.8 4.2
3 WATTRATTK 54.9 araucan (ara) 11.7 2.6
4 AAATTTATGCK 54.4 paired (prd) 4.5 16.5
5 GCAATAAA 51 ventral veins lacking (vvl) 13.2 0.3
6 DTAATTTRYNR 46.7 Ultrabithorax (Ubx) 16 3.3
7 TGATTAAT 45.7 apterous (ap) 7.1 1.7
8 YMATTAAAA 43.1 abdominal A (abd-A) 7 2.2
9 AAACNNGTT 41.2 20.1 4.3
10 RATTKAATT 40 3.9 0.7
11 GCACGTGT 39.5 fushi tarazu (ftz) 17.9
12 AACASCTG 38.8 broad-Z3 (br-Z3) 10.7
13 AATTRMATTA 38.2 19.5 1.2
14 TATGCWAAT 37.8 5.8 2
15 TAATTATG 37.5 Antennapedia (Antp) 14.1 5.4
16 CATNAATCA 36.9 1.8 1.7
17 TTACATAA 36.9 5.4
18 RTAAATCAA 36.3 3.2 2.8
19 AATKNMATTT 36 3.6 0
20 ATGTCAAHT 35.6 2.4 4.6
21 ATAAAYAAA 35.5 57.2 -0.5
22 YYAATCAAA 33.9 5.3 0.6
23 WTTTTATG 33.8 Abdominal B (Abd-B) 6.3 6
24 TTTYMATTA 33.6 extradenticle (exd) 6.7 1.7
25 TGTMAATA 33.2 8.9 1.6
26 TAAYGAG 33.1 4.7 2.7
27 AAAKTGA 32.9 7.6 0.3
28 AAANNAAA 32.9 449.7 0.8
29 RTAAWTTAT 32.9 gooseberry-neuro (gsb-n) 11 0.8
30 TTATTTAYR 32.9 Deformed (Dfd) 30.7
Systematically discover regulatory motifs
Functional clustering of motifs and tissues
Motif discovery in human enhancer regions
• Can identify 40% of enhancers with 50 motifs
– 3X enrichment (vs. 15% of intergenic regions)
• Motif combinations further improve performance
– 5X enrichment for top 30 motif combinations
Chromatin signatures of enhancer regions Motif signatures of enhancer regions
74 Enhancers
208 Promoters
H3K4me3 RNAPII
p300H3K4me1
Evolutionary signatures for microRNA genes
• Genome-wide discovery of miRNAs
– 41 novel miRNA genes. Rediscover 81% of known (61 of 74). Reject 4 dubious.
– 454 sequencing of small RNAs confirms 27 of 41 novel miRNAs (66%).
• Genomic properties:
– Introns of known genes, including several transcription factors
– Genomic clustering of known and novel miRNAs: poly-cistronic precursors
– Two ‘dubious’ protein-coding genes are in fact miRNAs
Improved annotation of miRNA genes
Functional properties of microRNA targets
• Refine annotation of known miRNA genes
– Start adjustments suggested by the evolutionary signatures, confirmed by sequencing
– Small change in start (+2 nucleotides) implies great change in target spectrum (>95%)
• miRNA targets
– Novel miRNAs include many novel families: distinct groupings of genes.
– Targets of novel miRNAs show large overlap with targets of known miRNAs: denser miRNA network
• miR10* as a master Hox regulator
– For three genes, both miRNA+ and miRNA* seem functional by evolution and sequencing.
– For miR-10, the star shows stronger signal, more sequencing reads, more predicted targets.
– Both miR-10+ and miR-10* target several Hox genes, more than any other miRNA.
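Why a 2-nucleotide change in the annotated start can alter nearly the whole target spectrum follows from how targets are usually predicted: by pairing to the seed, conventionally nucleotides 2-8 counted from the 5' end of the mature miRNA. The sketch below illustrates this with an invented RNA sequence (not a real miRNA); the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")

def seed(mature_arm, start=0):
    """Seed = nucleotides 2-8 (1-based) counted from the annotated 5' start."""
    return mature_arm[start + 1:start + 8]

def target_site(seed_seq):
    """Reverse complement of the seed: the 7-mer a target site must contain."""
    return seed_seq.translate(COMPLEMENT)[::-1]

# Hypothetical sequence, invented for illustration.
arm = "AUGCGUACCUAGGAUCCAUUGC"

for offset in (0, 2):
    s = seed(arm, start=offset)
    print(offset, s, target_site(s))
```

Shifting the start by two nucleotides produces a different 7-mer seed, so the set of transcripts carrying a matching site is largely different even though the mature sequences overlap almost entirely.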
Comparative genomics
1. Genome interpretation
– Decode the human genome
– Discover all functional elements
The building blocks
2. Cell circuitry
– Discover all control constructs
– Regulatory network properties
The interconnections
3. Evolutionary innovation
– Emergence of new functions
– Genome and network duplication
The dynamics
Resolving power in mammals, flies, fungi
10 mammals, 17 yeasts, 12 flies; 8 Candida, 9 Yeasts
• Neutral: 2.57 subs/site (opp: 0.62, 32sps: 4.87)
• Coding: 1.16 subs/site
• Detect: 6-mer at FP 10^-6
• Neutral: 4.13 subs/site
• Coding: 1.65 subs/site
• Detect: 6-mer at 10^-11
• Neutral: 15.5 subs/site (Yeast: 6.5, Candida: 6.5)
• Coding: 7.91 subs/site
• Detect: 3-mer at 10^-21
Tree figure labels: Post-duplication, Diploid, Haploid, Pre-dup
Scale bars: 0.3 sub/site, 0.1 sub/site, 0.8 sub/site
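The link between neutral branch length and the smallest detectable element can be illustrated with a back-of-the-envelope model. This is an assumption for illustration, not necessarily the calculation behind the figures above: if substitutions at a neutral site follow a Poisson process with total branch length b subs/site, the site is unchanged across the whole tree with probability exp(-b), and k independent sites are all unchanged by chance with probability exp(-k·b). The panel labels are placeholders for the three groups of figures in the order listed.

```python
import math

def chance_conservation(branch_length, k):
    """Probability that k neutral sites all show zero substitutions.

    Assumes independent sites and Poisson-distributed substitutions with
    mean `branch_length` (subs/site summed over the tree).
    """
    return math.exp(-branch_length * k)

# (label, neutral subs/site, k-mer length) in the order the figures are listed.
panels = [("panel 1", 2.57, 6), ("panel 2", 4.13, 6), ("panel 3", 15.5, 3)]

for label, b, k in panels:
    p = chance_conservation(b, k)
    print(f"{label}: {k}-mer fully conserved by chance with p = {p:.1e}")
```

Under this model the second and third panels come out on the order of 10^-11 and 10^-21, matching the quoted figures, while the first comes out somewhat below the quoted 10^-6. The model captures the qualitative point: more total branch length lets shorter elements be detected at the same false-positive rate.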