Post on 28-Dec-2015
AACC
GGTT
The Medicago truncatula genome: a progress report
Dr. Bruce A. Roe
Advanced Center for Genome Technology
Department of Chemistry and Biochemistry
University of Oklahoma
broe@ou.edu www.genome.ou.edu
Plant and Animal Genome, San Diego, January 11, 2004
Photos by Steve Hughes, Genetic Resource Centre (PIRSA-SARDI), Adelaide, Australia. http://www.fao.org/ag/AGP/AGPC/doc/gallery/pictures/meditrunc/meditrunc.htm
Why sequence the Medicago genome?
• An important forage crop
• A genetically tractable model legume
• A relatively small (~500 Mbp) diploid genome
• Active legume research community
• Medicago Research Consortium
• Large collection of ESTs
• Excellent BAC library
• Integrated physical and genetic map
• Large number of BAC-end sequences
DNA GenBank
Sequence Pipeline at the University of Oklahoma Genome Center, OU-ACGT
DNA shearing (Hydroshear™)
Colony picking (QPixII™)
Growing subclones (HiGro™)
Subclone isolation I (Mini-Staccato™)
Subclone isolation II (VPrep™)
Thermocycling (ABI 9700)
Sequencing (ABI 3700)
Data assembly and analysis
Primer synthesis
Miscellaneous liquid handling
Closure
Subclone Isolation (Mini-Staccato™)
• This Zymark robot has a 384-cannula array, four built-in shakers, three attached storage racks, built-in barcoding and a Twister II robotic arm.
• This automation has allowed us to perform the DNA isolation completely unattended from as many as eighty 384-well plates of bacterial cells per day.
Subclone Isolation (Mini-Staccato™)
• Once all three solutions have been added, the plates are transferred from the SciClone workspace deck to a storage rack by the Twister II robotic arm.
Subclone Isolation and Sequencing Reaction Pipetting (Velocity 11 VPrep)
• Liquid handling station with 384-channel pipettor head
• Four movable shelves on either side of the pipettor head
• Used for subclone isolation, sequencing reaction set-up and clean-up.
Data assembly and Analysis
Sun V880 server with 32 GB RAM running Solaris 8 OS and 3 TB of data stored on RAID-5 arrays with autoloader tape backup
Also:
• 12 workstations each with 1 GB RAM
Phred/Phrap/Consed
Exgap
Initial WGS Skimming for the ~500 Mb Medicago truncatula genome
• Collected ~25,000 end-sequences from ~12,500 plasmid-based WGS clones.
• Of these ~25,000 sequences, ~1,000 have homology with Medicago truncatula ESTs.
• URL: http://www.genome.ou.edu/medicago.html
Phrap assembly of our Medicago truncatula whole genome shotgun survey sequencing data at 0.005-fold genomic sequence coverage
DotPlot of a Phrap assembled whole genome shotgun contig showing multiple repeated regions
[Figure: dot plot; both axes in bases, 0 to 700]
DotPlot of a Phrap assembled whole genome shotgun contig showing 4 repeated blocks of ~600 bases
[Figure: dot plot; both axes in bases, 0 to 1000]
Yet another genomic contig showing extensive repeated regions
Contig 1931
[Figure: dot plot; both axes in bases, 0 to 600]
>Contig1931 TTTACGTCCCCGTAGTGAACTATTTCCTAAGTTGACTAGTCAATTAGGTGATAGTTCGTCCGGATGACGTACCGCCGTGAACCCGATATGAGAATTTCATGTGGTGCATCCTTCTATGTTTGATAAGGTCATTTTGAACGGTCGGATTGAACGTGGCTGGTGTCGTTCACGATAGAGGCACGTTTAGGTCCCTACGGTGAACTAGTTCCTAAGTTGACTAGTCAATTAGGTGATAGTTTGTCCGGATGACGTACCTCCGTGAACCCGATCTGAGAAATTCAAGTTTCTGCATCCTTCTATGTTTGATAAGGTCATTTTGAACGGTCGGATTGAAGGTGGCTGGTGTTCTTCACATTCTAGGCACGTTTAGGTTCCCGCGGTGAACTAGTTCCTAAGTTGACTAGTCAATTAGGTGATAGTTCGTCCGGATGACCTACCTCCGTGAACCCGATATTAGAAATTCAAGTTTCTGCATCCTTCTATGTTTGATAAGGTCATTTTGAACGGTCAGATTGAACGTGGCTGGTGTCGTTCACGATCTAGGCACGTTTAGGTCCCCGCAGTGAACTAGTTCCTAAGTTGACTAGTCAATTAGGTGATAGTTTGTCCGGATGACGTGACTCCGTAAAGCCAGTATGAGAACTTCTAGTTTCTGCATCCTTTTATGTTTGATAAGGTCATTTTGAACGGTGGGATTGAACGTTGTTGGTGTCGTTCACGATCTAGGCACGTTTAGGTCCCCGCAGTGAACTAGTTCCTTAGTTGACTAGTCAATTAGGTGATAGTTCGTCCGGATGACGTATCTCCGTCAGCCCGATCTGAGAAATTCAAATTTCTGCATCCTTCTATGTTTGATAAGGTCATTTTGAACGGTCGGATTGAACGTGGCTGGTGTCGTGCACGATCAAGGCACGTTTAGGTCCCCGCAGCGAACTAGTTCCTAAGTTGACTAGTCAATTAGGTGATACCTTGTCCGGATGACGTACCTCCGTGAACCCGATCTGAGAAATTCAAGTTTCTGCATCCTTCTATGTTTGATAAGGTCATTTTGAACGGTTGGATTGAACATGGCTGGTGTCGTTCACGATCTAGGCACGTTTAGGTCCCCGCAGTGAACTAGTTCCTAAGTTGACTAGTCAATTAGGTGATAGTTCGTCTGGATGACGTACCTCCTTGAACCCAATATGAGAAATTCAATTTTCTTCATCCTTCTATGTTTGATAAGGTCATTTTGAACGGTCGGATTGAACGTGCCTGGTGTCGTTCACGATCGAGGCACGTTTAGGTCCCCGCAGTGAAC. . .
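The repeat structure seen in the dot plots of these contigs can be explored with a simple self-comparison. The following is a minimal Python sketch, not the tool used to produce the slides: it marks every pair of positions that share an exact word, then counts dots per diagonal offset, where a strong peak suggests a repeat period. The function names, the word size of 12 and the toy sequence are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def self_dotplot(seq: str, word: int = 12):
    """Return sorted (i, j) pairs, i < j, where the same exact word starts at both positions."""
    positions = defaultdict(list)
    for i in range(len(seq) - word + 1):
        positions[seq[i:i + word]].append(i)
    dots = []
    for starts in positions.values():
        # Every pair of occurrences of the same word is one off-diagonal dot.
        for a in range(len(starts)):
            for b in range(a + 1, len(starts)):
                dots.append((starts[a], starts[b]))
    return sorted(dots)


def repeat_offsets(dots):
    """Count dots per diagonal offset (j - i); peaks hint at repeat periods."""
    return Counter(j - i for i, j in dots)


# Toy input (hypothetical, not real data): a 20-base unit repeated three times.
unit = "ACGTTGCAAGGCTTAACCGT"
toy = unit * 3
offsets = repeat_offsets(self_dotplot(toy, word=12))
period = offsets.most_common(1)[0][0]
print(period, dict(offsets))
```

To try it on a real contig, pass the sequence string (without the FASTA header) to self_dotplot. Exact-word matching misses diverged repeat copies, so a shorter word size may be needed for repeats that differ at many positions.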
Summary of our Medicago truncatula WGS Sequencing Assembly with only 0.005-fold Genomic Sequence Coverage
• The largest contig (21,157 bp) contained the 26S rRNA genes
• 19 smaller contigs (105,455 bp total) were from the chloroplast genome
• The remaining ~500 contigs, ranging in size from 2,000 to 12,000 bp, contain highly repetitive DNA, which is unique to Medicago, as it had no significant homology in the GenBank database
• We concluded that a more directed strategy was needed
Mapped BAC approach in collaboration with Doug Cook and DJ Kim at U.C. Davis with funding from the Noble Foundation, Ardmore, OK
The first ~1000 Medicago truncatula BACs
• Initially concentrated on BACs with known biological markers and in regions of biological interest that were supplied to us by the UC Davis group.
• Requests for sequencing specific BACs were directed to Doug Cook and DJ Kim at UC Davis, and they supplied us with the BACs once these BACs had been characterized.
• Once the BACs were received, we created the shotgun libraries, isolated the sequencing templates and obtained the working draft sequence followed by closure and finishing.
• All data was made publicly available in GenBank within 24 hours of sequence assembly.
UC Davis
--------
Oklahoma University
Medicago BAC Sequencing
[Figure: number of bases (0 to 100,000,000) by date, 4/15/02 to 12/15/03; series: Phase 1, Phase 2, Phase 3, Total]
The next ~750 Medicago truncatula BACs
• With recent NSF funding, we will be sequencing BACs from chromosomes 1, 4, 6, and 8 with the goal of completing the sequence of the euchromatic regions of these chromosomes over the next 3 years.
• Chromosomes 2 and 7 will be sequenced at TIGR, chromosome 3 at The Sanger Institute and chromosome 5 at Genoscope.
• All data will be released immediately as before.
www.genome.ou.edu/medicago.html
www.genome.ou.edu/medicago_totals.html
Medicago-specific gene with ESTs but no known homology
Gene density of this BAC is ~1 gene per 10 kb
Medicago-specific gene with ESTs but no known homology
myosin-like protein
Gene density ~1 gene per 10 kb
myosin-like protein
Gene Size Distribution (All Sequence Data) (FgenesH vs. Genscan)
[Figure: histogram of number of genes (0 to 4,500) by gene size range, in bins of 1,000 from 1-1000 to 19001-20000, plus 20001-above; series: FgeneSH, Genscan]
13,396 FgeneSH predicted genes
11,488 Genscan predicted genes
Exon Size Distribution (All Sequence Data) (FgenesH vs. Genscan)
[Figure: histogram of number of exons (0 to 20,000) by exon size range; bins 1-50, 51-100, 101-200 through 901-1000 in steps of 100, then 1001-1500 through 3501-4000 in steps of 500; series: FgeneSH, Genscan]
59,808 FgeneSH predicted exons
55,792 Genscan predicted exons
Intron Size Distribution (All Sequence Data) (FgenesH vs. Genscan)
[Figure: histogram of number of introns (0 to 12,000) by intron size range; bins 1-50, 51-100, 101-200 through 901-1000 in steps of 100, then 1001-1500 through 3501-4000 in steps of 500; series: FgeneSH, Genscan]
46,412 FgeneSH predicted introns
44,305 Genscan predicted introns
Gene Density of the ~450 Mb Medicago truncatula genome
                                             FgeneSH         Genscan
Total number of genes                         13,397          11,488
Total length of genes                     30,793,326      51,687,528
Total exon length                         15,794,243      14,400,445
Total number of exons                         59,808          55,792
Total intron length                       14,999,083      37,287,083
Total number of introns                       46,412          44,305
Base Pairs Sequenced                      87,423,457      87,423,457
Gene Space (Gene Length/BP Sequenced)            35%             59%
Gene Density (Genes/200Mb)                    30,649          26,281
                                       1 gene/6.5 kb   1 gene/7.6 kb
Arabidopsis 25,498 protein coding genes
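The derived rows of this table (intron length, gene space, gene density) follow from the totals by simple arithmetic. The short Python sketch below recomputes them from the slide's own figures as a consistency check; the dictionary layout and function name are illustrative assumptions, and the 200 Mb scaling factor is taken from the "Genes/200Mb" row label.

```python
# Totals copied from the slide's table.
BASES_SEQUENCED = 87_423_457
predictions = {
    "FgeneSH": {"genes": 13_397, "gene_length": 30_793_326, "exon_length": 15_794_243},
    "Genscan": {"genes": 11_488, "gene_length": 51_687_528, "exon_length": 14_400_445},
}


def summarize(p, bases=BASES_SEQUENCED, scale=200_000_000):
    """Recompute the derived rows for one gene predictor."""
    return {
        # Intron length is gene length minus exon length.
        "intron_length": p["gene_length"] - p["exon_length"],
        # Gene space: share of sequenced bases covered by predicted genes.
        "gene_space_pct": 100 * p["gene_length"] / bases,
        # Average spacing: sequenced kb per predicted gene.
        "kb_per_gene": bases / p["genes"] / 1000,
        # Gene count extrapolated to `scale` bases at the same density.
        "genes_per_scale": scale * p["genes"] / bases,
    }


summary = {name: summarize(p) for name, p in predictions.items()}
for name, s in summary.items():
    print(name, s["intron_length"], round(s["gene_space_pct"]),
          round(s["kb_per_gene"], 1), round(s["genes_per_scale"]))
```

The large gap in gene space (35% vs. 59%) comes almost entirely from intron length: the two predictors report similar exon totals, but Genscan's predicted introns span more than twice as many bases.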
Medicago GC Content for ~90 Mb of Genomic BAC Clones Sequenced (mainly from gene rich regions)
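GC content of the kind plotted on this slide can be computed per clone or per window. The Python sketch below is one possible way to do it, not the method used for the slide: the function names and the default window size are assumptions, and ambiguous bases such as N are excluded from the denominator.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    if total == 0:
        raise ValueError("no unambiguous bases in sequence")
    return gc / total


def windowed_gc(seq: str, window: int = 100, step: int = 100):
    """Return (start, GC fraction) for each full window along the sequence."""
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy input (hypothetical): an AT-only half followed by a GC-only half.
profile = windowed_gc("AAAATTTTGGGGCCCC", window=8, step=8)
print(profile)
```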
Metabolic Overview of Medicago: 13,396 FgeneSH predicted genes using the COG Database
DNA Metabolism 23%
Cellular Processes 23%
Metabolism 24%
Poorly Characterized 17%
No Hits 5%
Multiple COG Hits 8%
Metabolic Overview (detailed view) of Medicago: 13,396 FgeneSH predicted genes using the COG Database
No Hits 5%
Translation, ribosomal structure & biogenesis 7%
Transcription 5%
DNA replication, recombination & repair 11%
Multiple COG Hits 8%
Poorly Characterized 17%
Cell division & chromosome partitioning 2%
Posttranslational modification, protein turnover, chaperones 5%
Cell envelope biogenesis, outer membrane 4%
Cell motility & secretion 3%
Inorganic ion transport & metabolism 3%
Signal transduction mechanisms 5%
Energy production & conversion 5%
Carbohydrate transport & metabolism 4%
Amino acid transport & metabolism 5%
Nucleotide transport & metabolism 2%
Coenzyme metabolism 2%
Lipid metabolism 2%
Secondary metabolites biosynthesis, transport & catabolism 3%
Gene Duplication: Three copies of the phosphoglycerate kinase gene in one BAC
Gene Duplication: Three copies of phosphoglycerate kinase in one BAC

AC138448.fg.10 MATKRSVGTLKEAELKGKRVFVRVDLNVPLDDNLNITDDTRIRAAVPTIKYLTGYGAKVILSSHL-----
AC138448.fg.11 MA-KKSVGDLSGAELKGKKVFVRADLNVPLDDNQNITDDTRIRAAIPTIKYLIQNGAKVILSSHL-----
AC138448.fg.8  MATKRSVGTLKEGELKGKRVFVRVDLNVPLDDNLNITDDTRIRAAVPTIKYLTGYGAKVILSSHLEIYKT

AC138448.fg.10 ------------------------------------------GRPKGVTPKYSLKPLVPRLSELLGTQVK
AC138448.fg.11 ------------------------------------------GRPKGVTPKYSLAPLVPRLSELIGIEVI
AC138448.fg.8  EVSVSEYNLAVSEYKLAISDTYRYRIRVRHDSSPFLEYRGSQGRPKGVTPKYSLKPLVPRLSELLETQVK

AC138448.fg.10 IADDSIGEEVEKLVAQIPEGGVLLLENVRFHKEEEKNDPEFAKKLASLADLYVNDAFGTAHRAHASTEGV
AC138448.fg.11 KAEDSIGPEVEKLVASLPDGGVLLLENVRFYKEEEKNDPEHAKKLAALADLYVNDAFGTAHRAHASTEGV
AC138448.fg.8  ISDDCIGEEVEKLVAQIPEGGVLLLENVRFHKEEEKNEPEFAKKLASLADLYVNDAFGTAHRAHASTEGV

AC138448.fg.10 AKYLKPSVAGFLMQKELDYLVGAVSNPKKPFAAIVGGSKVSSKIGVIESLLEKVDILLLGGGMIFTFYKA
AC138448.fg.11 TKYLKPSVAGFLLQKELDYLVGAVSSPKRPFAAIVGGSKVSSKIGVIESLLEKVDILLLGGGMIFTFYKA
AC138448.fg.8  AKYLKPSVAGFLMQKELDYLVGAVSNPKKPFAAIVGGSKVSSKIGVIESLLEKVDILLLGGGMIYTFYKA

AC138448.fg.10 QGYAVGSSLVEEDKLDLATTLIEKAKAKGVSLLLPTDVVIADKFAADANDKIVPASSIPDGWMGLDIGPD
AC138448.fg.11 QGLAVGSSLVEEDKLELATTLIAKAKAKGVSLLLPSDVVIADKFAPDANSQIVPASAIPDGWMGLDIGPD
AC138448.fg.8  QGYSIGSSLVEEDKLDLATSLMEKAKAKGVSLLLPTDVVIADKFSADANDKIVPASSIPDGWMGLDIGPD

AC138448.fg.10 SIKTFNEALDKSQTIIWNGPMGVFEFDKFAAGTEAIAKKLAEVSGKGVTTIIGGGDSVAAVEKVGLADKM
AC138448.fg.11 SIKTFNEALDTTQTIIWNGPMGVFEFDKFAVGTESIAKKLADLSGKGVTTIIGGGDSVAAVEKVGVADVM
AC138448.fg.8  SIKTFNEALDKSQTIIWNGPMGVFEFDKFAAGTEAIAKKLAEVSGKGVTTIIGGGDSVAAVEKVGLADKM

AC138448.fg.10 SHISTGGGASLELLEGKPLPGVLALDDA* 401 amino acids
AC138448.fg.11 SHISTGGGASLELLEGKELPGVLALDEATPVAV* 405 amino acids, differs at 42 positions
AC138448.fg.8  SHISTGGGASLELLEGKPLPGVLALDDA* 448 amino acids, differs at 6 positions
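Differences between aligned copies such as these can be tallied column by column. The Python sketch below is an illustration only: the function name and the choice to count substitutions separately from gap columns are assumptions, and it is not claimed to reproduce the counting rule behind the "differs at" figures on the slide. The demo uses one 70-residue alignment block of AC138448.fg.10 and AC138448.fg.11 copied from the slide.

```python
def compare_aligned(a: str, b: str, gap: str = "-"):
    """Return (substitutions, gap_columns) for two equal-length aligned rows."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    substitutions = 0
    gap_columns = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            # Count a column as a gap column only if exactly one row has a gap.
            if x != y:
                gap_columns += 1
        elif x != y:
            substitutions += 1
    return substitutions, gap_columns


if __name__ == "__main__":
    # One alignment block from the slide (fg.10 vs. fg.11).
    fg10 = "SIKTFNEALDKSQTIIWNGPMGVFEFDKFAAGTEAIAKKLAEVSGKGVTTIIGGGDSVAAVEKVGLADKM"
    fg11 = "SIKTFNEALDTTQTIIWNGPMGVFEFDKFAVGTESIAKKLADLSGKGVTTIIGGGDSVAAVEKVGVADVM"
    print(compare_aligned(fg10, fg11))
```

Summing the two counts over all blocks gives a whole-alignment total; whether gap columns should count as "differing positions" is a reporting choice that should be stated alongside the number.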
Printrepeat Analysis of M. truncatula BAC AC121240 vs. A. thaliana Chr.2
Expansion, Duplication, Repeat Elements
~5 kb region
~25 kb region
PIP of M. truncatula BAC AC121240 vs. A. thaliana Chr.2
Medicago truncatula
Summary and Conclusions
• Average predicted gene density of 1 gene per 6.5 to 7.6 kb by FgeneSH and Genscan, respectively.
• Genome characteristics such as %GC, intron/exon size and conserved unique 5’ splice sites reveal Medicago characteristics
• The sequence of the Medicago truncatula genome shows homology to the sequenced Arabidopsis thaliana genome but expansion, rearrangements and duplications are evident.
Data Release and Preliminary Annotation
• All our sequence data is available through links on our web site to GenBank and on our ftp site at URL: ftp.genome.ou.edu/medicago
• Keyword and BLAST searches can be done on our web site at URL: http://www.genome.ou.edu/medicago.html
• Additional annotation via Genome Browser database is available on our web site at URL: http://www.genome.ou.edu/medicago_table.html
• E-mail suggestions for additional annotation to Bruce Roe at: broe@ou.edu
Three Year Plan
• Obtain the contiguous sequence of the gene-rich regions of four of the 8 Medicago truncatula chromosomes at OU, with the remaining four being completed by our international partners at TIGR, Sanger, and Genoscope.
• This information will serve as a solid foundation for anticipated comparative and functional legume genomics.
Laboratory Organization
Bruce Roe, PI
Informatics
Support Teams
Production
Administration
Jim White, Steve Kenton, Hongshing Lai, Sean Qian
Rose Morales-Diaz*, Mounir Elharam*, Yonas Tesfai, Steve Shaull**, Doug White, Work-study Undergraduates**
Kay Lynn Hale, Dixie Wishnuck, Tami Womack, Mary Catherine Williams
DNA Synthesis
Phoebe Loh*, Sulan Qi, Bart Ford*
Reagents & Equip. Maint.
Mounir Elharam*, Doug White
Axin Hua, Weihong Xu
Jami Milam, Sara Downard**
Limei Yang, Angie Prescott*, Audra Wendt**, Mandi Aycock**
Ziyun Yao, Steve Shaull*, Youngju Yoon
Trang Do, Anh Do, Lily Fu, Yang Ye, James Yu, Tessa Manning**
Fu Ying, Liping Zhou, Ruihua Shi, Junjie Wu
Stephan Deschamps, Shelly Oommen, Christopher Lau, Yanhong Li
Research Teams
Doris Kupfer, Julia Kim*, Sun So, Graham Wiley**, Lauren Ritterhouse**
Lin Song, Ying Ni, Huarong Jiang
ShaoPing Lin, Honggui Jia, Hongming Wu, Baifang Qin, Peng Zhang
Fares Najar, Chunmei Qu, Keqin Wang, Carson Qu, Shuling Li
Funding from the Noble Foundation, DOE, and NSF
Collaborators at Univ. Minnesota, UC Davis, TIGR, Sanger, Genoscope, and the Noble Foundation
* Previous undergraduate research student
** Present undergraduate research student
The AACCGGTT Team
Conserved Intron/Exon Boundary Features by a FELINES** Analysis of 181,444 Medicago truncatula ESTs in GenBank vs Genomic Sequence

           Size Range       Mean Length
Exons      6 - 5,789 nt     268 nt
Introns    20 - 3,921 nt    429 nt

Intron Conserved Splice Site Sequence Elements    Percent
Introns w/ 5’ GU                                  99.21%
Introns w/ 5’ GC                                  0.36%*
Introns w/ 5’ AU                                  0.31%
Introns w/ U12 branch sites instead of A12        0.13%

*Compared to 0.5 - 2.5% in fungi, and 0.5% in mammals with an EST minimum identity of 90%
** S. Drabenstot, D. Kupfer, J. White, D. Dyer, B. Roe, K. Buchanan and J. Murphy. FELINES: A Utility for Extracting and Examining EST-Defined Introns and Exons. Nucleic Acids Research 31(22), E141 (2003).
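The 5’ splice-site classes in this table amount to tallying introns by their first two bases. The Python sketch below shows that tally on a handful of made-up intron strings; it is not the FELINES implementation, the function names are hypothetical, and it assumes introns are given as DNA in the sense orientation (T is written as U in the labels to match the slide).

```python
from collections import Counter


def classify_5prime(intron: str) -> str:
    """Label an intron by its first two bases, written as RNA (T -> U)."""
    dinuc = intron[:2].upper().replace("T", "U")
    return dinuc if dinuc in ("GU", "GC", "AU") else "other"


def splice_site_percentages(introns):
    """Percentage of introns in each 5' dinucleotide class."""
    counts = Counter(classify_5prime(seq) for seq in introns)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100 * n / total for label, n in counts.items()}


# Toy introns (hypothetical sequences, not taken from the EST analysis).
toy_introns = ["GTAAGTAG", "GTATGCAG", "GCAAGTAG", "ATATCCAC"]
percentages = splice_site_percentages(toy_introns)
print(percentages)
```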
Consensus Logogram of the 5’GU vs the 5’AU Class of Introns in Medicago truncatula determined by FELINES
AU intron consensus
GU intron consensus