REVIEW-Crop Genome Sequencing2


REPORT | CROP GENOME ANALYSIS VOLUME 01 | APRIL 2012

The plant genome: a socio-economic implication

Nomar Espinosa Waminal 1,2

1 Laboratory of Functional Crop Genomics and Biotechnology, Department of Plant Science, College of Agriculture and Life Sciences, Seoul National University, 151-742 Seoul, Korea

2 Plant Biotechnology Institute, Department of Life Science, Sahmyook University, 139-742 Seoul, Korea

Email: [email protected]

노이완

Abstract | Crop genomics has received much attention recently, especially since the completion of the Arabidopsis thaliana genome sequencing project. Coupled with ever-advancing DNA sequencing technologies, genomics has vast resources to offer the scientific community. With several plant genomes now sequenced, novel knowledge has been gained about plant mechanisms for disease resistance and the biosynthesis of desired traits, areas that directly affect the economics of agriculture. Here, a detailed review of the genome sequencing and assembly of three socio-economically important plant species is presented.

INTRODUCTION

Plants existed long before humans and animals appeared. Gifted with intelligence and learning through trial and error, humans gradually developed tools for survival, from the crudest technology of early human existence to the ever-evolving sophistication of our post-modern era. Alongside this technological evolution came an agricultural evolution, arguably of greater significance, because maintaining life (i.e. society) requires food, which comes mainly from plants [1]. Humans learned the value of plants and, through experience, selected those that were beneficial in a process called domestication [2].

Plants whose economic importance has led to their cultivation are generally called crops [3]. Crops are valued for their use as food, medicine, materials, industrial inputs, landscaping, and so on. To enhance crop quality and yield, breeders and scientists have worked hard to develop methods that address these objectives. Such methods depend on a proper understanding of the anatomy, physiology, genetics, and ultimately all the plant mechanisms responsible for growth and development, as well as the environmental factors that affect them. In short, the more we know about a plant and the intricate interplay of the biological factors that limit or promote its optimal performance, the easier it is to manipulate particular parameters to obtain the desired phenotype.

We have come a long way in understanding the biological factors and mechanisms that contribute to plant growth and development, from identifying the hereditary molecules, to isolating the first gene [4], to recent studies of the genome, the transcriptome, the proteome, and the metabolome. Despite these astonishing advances in the study of plants, we still have a long way to go in using this vast amount of information to improve human health and lifestyle and to address recent international concerns about global warming.

The concerted efforts of scientists from various fields have contributed enormously to the understanding of the plant genome, its structure, and its function. Unlocking the genomic DNA sequence and understanding the interplay between DNA and other biomolecules have a profound impact on downstream research and applications. The range of downstream applications of a genome sequence is so wide that attempting to enumerate them risks limiting its possibilities. Nevertheless, some of the apparent direct and indirect contributions of the genome sequence to the scientific community include: (i) access to the relatively complete gene catalogue of a species, the regulatory elements that control gene function, and a foundation for understanding genome variation; (ii) understanding the structure, function, and evolution of organisms; (iii) understanding biochemical pathways; (iv) development of molecular markers to speed up genetic analysis, gene discovery, and breeding programs for crop improvement; and (v) a framework for further structural and functional genomics studies of model plants, essential food crops, animal feed, and energy crops. To date, about 26 plant genomes have been sequenced [5].

Which plant genome is sequenced first depends on several factors, such as sequencing cost, genome size, and genome complexity. These factors weigh more heavily in the decision than the direct economic significance of the species. For example, Zea mays L. could have been sequenced right after Arabidopsis and rice [6], but it was sequenced later, after Vitis vinifera [7] and Populus trichocarpa [8], because of the huge amount of repetitive elements in its genome [9]. Highly repetitive elements make genome assembly computationally difficult [9], especially with next-generation sequencing (NGS) technologies that produce short reads [10]. However, scientists have developed approaches that combine long reads (Sanger sequencing) with NGS reads to produce more reliable results [11]. To date, several important crop species have been sequenced using whole genome shotgun (WGS) or BAC-by-BAC sequencing approaches (Table 1), and this number is increasing rapidly as sequencing technologies and bioinformatics tools are refined.

Growing knowledge of the plant genome, greatly spurred by the completion of the Arabidopsis thaliana genome sequence in 2000 [5], has been accompanied by the incremental evolution of DNA sequencing technologies. The Sanger method dominated the DNA sequencing industry for nearly two decades and contributed to the sequencing of many genomes, including the monumental completion of the human genome [12]. However, the limitations of this technology (mainly low throughput and high cost) fueled the need for more advanced technologies that could produce enormous amounts of sequence data in less time and at lower cost. The result is a shift from the traditional ‘first-generation’ technology of automated Sanger sequencing to the more advanced ‘next-generation’ sequencing [12]. Recent development and refinement of these technologies have initiated ‘third-generation’ sequencing, with high accuracy, longer read lengths, very high coverage, and fast data acquisition [13].

Table 1. Overview of plant genomes that have been sequenced (adapted from Trends in Plant Science, February 2011, Vol. 16, No. 2, and "List of sequenced eukaryotic genomes" (2012, April 20), Wikipedia, retrieved April 29, 2012)

Organism* | Relevance | Genome (Mb) | Chrom. no. (n) | Predicted genes | Sequencing strategy | Organization | Year of completion
Dicots
Arabidopsis lyrata | Model plant | 207 | 8 | 32,670 | WGS | DOE-JGI and Max Planck Institute for Developmental Biology | 2011
Arabidopsis thaliana | Model plant | 119 | 5 | 25,498; 27,400; 31,670 | BAC-by-BAC | Arabidopsis Genome Initiative | 2000
Brassica rapa | Crop and model organism | 284 | 10 | 41,174 | WGS | Multicenter collaboration | 2011
Cannabis sativa | Hemp and marijuana production | 534 | 10 | 30,074 | WGS | Multicenter collaboration | 2011
Cucumis sativus | Vegetable crop | 367 | 7 | 26,682 | WGS | Chinese Academy of Agricultural Sciences, Beijing | 2009
Fragaria vesca | Fruit crop | 280 | 7 | 25,050 | WGS | Multicenter collaboration | 2011
Glycine max | Protein and oil crop | 1,100 | 20 | 46,430 | WGS | Purdue University | 2010
Jatropha curcas | Biodiesel crop | 410 | 11 | 40,929 | BAC-by-BAC, WGS | Kazusa DNA Research Institute | 2010
Lotus japonicus | Model legume | 417 | | 30,799 | BAC-by-BAC | Multicenter collaboration | 2008
Malus domestica | Fruit tree | 927 | | 57,000 | BAC-by-BAC | International consortium | 2010
Medicago truncatula | Model organism for legume biology | 375 | | 62,388 | BAC-by-BAC | Multicenter collaboration | 2011
Populus trichocarpa | Carbon sequestration, model tree, timber | 550 | 19 | 45,555 | WGS | The International Poplar Genome Consortium | 2006
Ricinus communis | Oilseed crop | 320 | | 31,237 | WGS | Multicenter collaboration | 2010
Solanum tuberosum | Crop plant | 844 | 12 | 39,031 | WGS | Multicenter collaboration | 2011
Thellungiella parvula | Arabidopsis relative with high salt tolerance | 140 | | 28,901 | WGS | Multicenter collaboration | 2011
Theobroma cacao | Flavoring crop | 430 | 10 | 28,798 | WGS | CIRAD, multiple institutions (separate project, Mars Inc., USDA) | 2010
Vitis vinifera | Fruit crop | 490 | | 30,434 | WGS | The French-Italian Public Consortium for Grapevine Genome Characterization | 2007
Monocots
Brachypodium distachyon | Model monocot (grass) | 272 | 5 | 26,500 | WGS | The International Brachypodium Initiative | 2010
Oryza sativa ssp indica | Crop and model organism | 420 | 12 | 32,000-50,000 | WGS | Beijing Genomics Institute, Zhejiang University and the Chinese Academy of Sciences | 2002
Oryza sativa ssp japonica | Crop and model organism | 466 | 12 | 58,000 | BAC-by-BAC | Syngenta and Myriad Genetics | 2002
Oryza glaberrima | West African species of cultivated rice that was domesticated independently of Asian rice | 316 | 12 | ND† | BAC pooling and WGS | Arizona Genomics Institute | 2010
Phoenix dactylifera | Fruit tree (palm) | 658 | 36 | >25,000 | WGS | Genomics Core, Qatar | 2011
Sorghum bicolor | Crop plant | 730 | 10 | 27,640 | WGS | Multiple institutions | 2009
Zea mays | Cereal crop | 2,800 | 10 | 63,300 | BAC-by-BAC | NSF | 2009

†ND: no data

*The three species discussed in detail in this review are Cucumis sativus, Jatropha curcas, and Theobroma cacao.


Here, I review the genome sequencing results of three recently sequenced crops: Cucumis sativus, Jatropha curcas, and Theobroma cacao. The review covers the sequencing strategies used, genomic structure and arrangement, novel biosynthetic pathways, species-specific genes, implications for species evolution, and other areas of functional genomics.

REVIEWED GENOMES

Cucumis sativus L. (cucumber) belongs to the family Cucurbitaceae, which includes many economically significant species. It has served as a model system for sex determination studies [14]. The cucurbits are also model plants for vascular biology because xylem and phloem sap are easily collected for long-distance signaling studies.

Jatropha curcas L. (jatropha) belongs to the family Euphorbiaceae. It has potential for various uses, including biofuels, because of its high oil yield per unit area, which is second only to that of oil palm. This holds great promise for reducing the problems caused by the continued consumption of fossil fuels, chiefly global warming.

Theobroma cacao L. (cocoa tree), Criollo variety, is an important crop for chocolate production. However, fine-flavor cocoa accounts for less than 5% of global cocoa production, mainly because fine-flavor varieties are susceptible to fungal, oomycete, and viral diseases and to insect pests. Breeding of improved Criollo varieties is needed for sustained production of fine-flavor cocoa.

Despite the great economic significance of these crops, genomic resources for them are limited, which slows research toward their respective objectives. Unlocking their genomic sequences should open new frontiers for understanding their structure, function, and the control of desired traits. Consequently, independent organizations have initiated and completed the sequencing of their genomes (Table 1).

STRUCTURAL GENOMICS

Sequencing and assembly

A whole genome shotgun (WGS) sequencing approach was used for all three genomes, with different platforms and methods applied to improve the quality of the assembled sequences. The cucumber genome was sequenced by WGS combining Sanger and NGS (GA by Illumina) reads. This combination produced longer contig and scaffold N50 values than separate assemblies of the reads from each technology (Table 2). For J. curcas, a combination of BAC-end sequencing and shotgun sequencing was employed, using the conventional Sanger method and NGS (GS-FLX by Roche/454 and GA by Illumina). For T. cacao, the strategy was WGS incorporating Sanger and NGS platforms (GS-FLX by Roche/454 and GA by Illumina). Different software was used to assemble the genomic sequences of the three species (Table 3). Combining conventional Sanger and NGS sequencing proved superior to using either technology alone, because each compensates for the shortcomings of the other, allowing high-quality sequences to be obtained at lower cost and in less time; this has made the combined strategy popular for sequencing eukaryotes [15].

Table 2. Genome assembly statistics of C. sativus

Assembly | Contig N50 (kb) | Contig total (Mb) | Scaffold N50 (kb) | Scaffold total (Mb)
Sanger | 2.6 | 204 | 19 | 238
Illumina GA | 12.5 | 190 | 172 | 200
Sanger + Illumina GA | 19.8 | 226.5 | 1,140 | 243.5

N50 is the sequence length such that sequences of this length or longer contain half of the total length of the sequence set.

Table 3. Summary of information about the genome, sequencing strategy and genome assembly statistics of the three reviewed species

Parameter | Jatropha curcas | Theobroma cacao | Cucumis sativus
Genome size (Mb) | ~410 [15] | ~430 [18] | ~367 [11]
Ploidy, chrom. no. | 2n=2x=22 | 2n=2x=20 | 2n=2x=14
Date of completion | 2010 | 2010 | 2009
Date of publication, journal | 2011, DNA Research | 2011, Nature Genetics | 2009, Nature Genetics
Sequencing/funding institution | Kazusa DNA Research Institute Foundation, Japan | The International Cocoa Genome Sequencing Consortium (ICGS), coordinated by CIRAD | Chinese Academy of Agricultural Sciences, Beijing, China
Sequencing strategy | BAC-by-BAC and WGS | WGS | WGS
Sequencing method | Sanger (shotgun libraries and BAC ends); NGS: GS-FLX (Roche, USA), GA II (Illumina, USA) | Sanger (BAC ends only); NGS: GS-FLX, GA II | Sanger (BAC, plasmid, and fosmid sequencing); NGS: GA II
Assembly program | PCAP.rep and MIRA | Newbler version 2.3, SOAP | RePS2
Total length of assembled genome | 285.9 Mb | 326.9 Mb | 243.5 Mb
Percent of genome covered by the assembly | ~70% (if based on ~410 Mb genome); ~75% (if based on ~380 Mb genome) | ~76% (based on genotype B97-61/B2, 430 Mb) | ~66% (based on ‘Chinese long’ inbred line 9930)
Coverage depth of raw data | ND | 61.1x | Total: 72.2x
Gene space covered | 95% | 97.8% | 96.8%
Sequence anchored to chromosomes | ND | 67% | 72.8%
Contig: total number | 120,586 | 25,912 | 62,410
Contig: total length (Mb) | 276.7 | 291.4 | 226.4
Contig: average length (Kb) | 2.3 | 11.2 | ND
Contig: longest (Kb) | 29.7 | 190 | ND
Contig: N50 (Kb) | 3.8 | 19.8 | 19.8
Scaffold: total number | 15,300 | 4,792 | 47,837
Scaffold: total length (Mb) | 129.3 | 326.9 | 243.5
Scaffold: average length (Kb) | 8.4 | 68.2 | ND
Scaffold: longest (Kb) | 56 | 3,145 | ND
Scaffold: N50 (Kb) | ND | 473.8 | 1,144

ND: no data
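The N50 values and the "percent of genome covered" figures quoted for these assemblies follow from simple arithmetic, which the Python sketch below illustrates. It is a minimal sketch rather than the pipeline used by any of the sequencing groups: the function names and the toy contig list are hypothetical, and only the assembled lengths and genome sizes are taken from Table 3.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total length of the set.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


def assembly_coverage(assembled_mb, genome_mb):
    """Percent of the estimated genome size covered by the assembly."""
    return 100.0 * assembled_mb / genome_mb


if __name__ == "__main__":
    # Hypothetical toy contig lengths (kb), for illustration only.
    toy_contigs_kb = [2, 3, 4, 5, 6]
    print("toy N50 (kb):", n50(toy_contigs_kb))

    # (assembled length in Mb, estimated genome size in Mb) from Table 3.
    table3 = {
        "J. curcas": (285.9, 410),
        "T. cacao": (326.9, 430),
        "C. sativus": (243.5, 367),
    }
    for species, (assembled, genome) in table3.items():
        pct = assembly_coverage(assembled, genome)
        print(f"{species}: {pct:.1f}% of genome covered")
```

Sorting from longest to shortest and stopping at the first length where the running total reaches half of the assembly is the usual way to read the definition of N50; a higher N50 indicates that more of the assembly sits in long contigs or scaffolds.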

Linkage analysis

The consensus genetic map of T. cacao was created using two mapping populations, while 77 recombinant inbred lines from an inter-subspecific cross between Gy14 and PI183967 were used for cucumber. No linkage analysis data were provided for J. curcas. About the same percentage of molecular markers was aligned to the newly assembled genomic sequences of C. sativus and T. cacao (Table 4). Interestingly, comparison of the genetic and physical maps of cucumber revealed regions of recombination suppression: two 10-Mb regions at either end of chromosome 4, a 20-Mb region on chromosome 5, and an 8-Mb region on chromosome 7 (Fig. 1a). Further FISH analysis revealed a segmental inversion on chromosome 5 between Gy14 and PI183967 (Fig. 1b). This chromosomal inversion helps explain the recombination suppression and adds insight into cucumber evolution during domestication.

Gy14 is a North American processing market-type cucumber cultivar. PI183967 is an accession of C. sativus var. hardwickii originating from India.

Table 4. Summary of the linkage analysis

Species | Mapping population | Total length (cM) | No. mol. markers | Aligned markers | Anchored sequences to chromosomes (%)
T. cacao | 2 populations | 750.6 | 1,259 | 1,192 (94%) | 67
C. sativus | 77 recombinant inbred lines | 581 | 1,885 | 1,763 (93.5%) | 72.8
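The aligned-marker percentages in Table 4 can be recomputed from the marker counts, as in the short Python sketch below. The function name and the dictionary layout are hypothetical; the counts are those given in Table 4.

```python
def percent_aligned(aligned, total):
    """Percent of molecular markers aligned to the assembled sequence."""
    if total <= 0:
        raise ValueError("total number of markers must be positive")
    return 100.0 * aligned / total


# (aligned markers, total markers) as listed in Table 4.
TABLE4_MARKERS = {
    "T. cacao": (1192, 1259),
    "C. sativus": (1763, 1885),
}

if __name__ == "__main__":
    for species, (aligned, total) in TABLE4_MARKERS.items():
        print(f"{species}: {percent_aligned(aligned, total):.1f}% of markers aligned")
```

Both species come out within about one percentage point of each other, which is the basis for the statement that about the same percentage of markers was aligned in C. sativus and T. cacao.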

Figure 1. The integrated genetic and physical maps of cucumber. (a) Genetic distance vs. physical distance of the seven cucumber chromosomes. The brackets denote the regions of recombination suppression. (b) Detection of segmental inversion on chromosome 5 between Gy14 and PI183967 through FISH (12-7 and 12-2 are fosmid clones used as probes). Bar = 5 μm.

FISH: Fluorescence in situ hybridization is a cytogenetic technique used to detect and localize the presence or absence of specific DNA sequences on chromosomes.


Repetitive sequences

Relative to genome size, J. curcas has the largest proportion of transposable elements (36.6% of the 410 Mb genome), followed by C. sativus (24% of the 367 Mb genome) and T. cacao (24% of the 430 Mb genome) (Table 5). In all three species, Class I transposable elements (retrotransposons) make up the majority of the repeat sequences. In J. curcas, gypsy-type retrotransposons outnumber copia-type, the opposite of the pattern in T. cacao. In T. cacao, a copia-like LTR retrotransposon named Gaucho, 11,297 bp long and repeated approximately 1,100 times, was identified and, when hybridized through FISH, found to occupy most of the interstitial regions (regions between centromeres and telomeres) of chromosome arms (Fig. 2b). Additionally, a 212 bp repeat named ThCen was confirmed by FISH to be centromere-specific (Fig. 2a) and may have contributed to the genome size variation of T. cacao.

Table 5. Summary of the transposable elements identified in the three species

Element class | C. sativus: number of elements | C. sativus: fraction of the genome (%) | J. curcas: number of elements | J. curcas: fraction of the genome (%) | T. cacao: number of elements | T. cacao: fraction of the genome (%)
Class I | 119,339 | 12.16 | 113,047 | 29.91 | 49,942 |
LTR: all families* | (91,109) | (10.43) | | | |
LTR: copia | | | 31,740 | 8.03 | 18,060 |
LTR: gypsy | | | 67,658 | 19.6 | 12,622 |
LTR: other | | | 13,454 | 2.23 | 19,260 |
Others | 20,119 | 1.75 | 195 | 0.05 | ND |
Class II | 16,972 | 1.24 | 25,977 | 2.04 | 21,882 | ND
Others | 135,464 | 11.64 | 28,069 | 5.22 | ND | ND
TOTAL | 266,232 | 24.01 | 152,805 | 36.6 | 67,575 | ~24

*Values in parentheses denote total value of all LTR families.
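To make the repeat proportions more tangible, the Python sketch below converts the genome fractions quoted in the text into approximate megabases and estimates the span of the Gaucho family. The function names are hypothetical, and the Gaucho estimate assumes that all of the roughly 1,100 copies are full-length (11,297 bp), which is an assumption made here for illustration and not a claim from the reviewed papers.

```python
def repeat_content_mb(genome_mb, fraction_percent):
    """Approximate repeat content in Mb from a genome size and a percentage."""
    return genome_mb * fraction_percent / 100.0


def family_span_mb(element_bp, copies):
    """Approximate span of a repeat family in Mb, assuming full-length copies."""
    return element_bp * copies / 1_000_000


if __name__ == "__main__":
    # (genome size in Mb, transposable element fraction in %) from the text.
    species = {
        "J. curcas": (410, 36.6),
        "C. sativus": (367, 24.0),
        "T. cacao": (430, 24.0),
    }
    for name, (genome_mb, pct) in species.items():
        mb = repeat_content_mb(genome_mb, pct)
        print(f"{name}: about {mb:.0f} Mb of transposable elements")

    # Gaucho in T. cacao: 11,297 bp, about 1,100 copies (full length assumed).
    gaucho_mb = family_span_mb(11_297, 1_100)
    share = 100.0 * gaucho_mb / 430
    print(f"Gaucho: about {gaucho_mb:.1f} Mb, roughly {share:.1f}% of a 430 Mb genome")
```

Under these assumptions a single retrotransposon family such as Gaucho would account for a few percent of the T. cacao genome, which is consistent with its broad FISH signal along chromosome arms.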

Figure 2. FISH analysis of T. cacao repetitive sequences. (a) T. cacao chromosomes counterstained with DAPI (blue) with ThCen (red) used as probe. (b) Hybridization of ThCen (red) and Gaucho LTR retrotransposon probes (green).

Gene content

Not all classes of RNA-encoding genes were reported for the three species (Table 6). Because of the limitations of the sequencing methods, the number of genes, especially ribosomal RNA-encoding genes, may be largely underestimated. In T. cacao, only six fragments of rRNA genes were recovered, far fewer than the average number of repeats found in most eukaryotes, as can be observed through FISH analysis using rDNA as probes [16].

Table 6. Summary of RNA-coding genes in the three species

RNA type | C. sativus | J. curcas | T. cacao
rRNA | 292 | ND | 6
tRNA | 699 | 597 | 473
miRNA | 171 | ND | 83
snoRNA | 238 | ND | ND
snRNA | 192 | 65 | ND

MicroRNAs (miRNAs) are short non-coding RNAs that regulate gene expression transcriptionally or post-transcriptionally. Many miRNAs have roles in plant development and stress response. In T. cacao, most of the predicted miRNAs have sequence homology to transcription factor genes, suggesting that miRNAs are major regulators of gene expression in T. cacao.

Three gene-prediction methods were used to identify protein-coding genes in the three species: ab initio prediction with gene-finder software, cDNA-EST alignment, and homology searches against public databases (Table 7). Comparison of gene families with other sequenced genomes yielded 682 T. cacao-specific and 4,362 C. sativus-specific gene families, while 1,529 genes were found to be specific to the family Euphorbiaceae, to which J. curcas belongs.

Table 7. Summary of the gene-prediction analysis of the three sequenced genomes

Species | Gene-prediction methods | Protein-coding region search programs | Similarity search databases | No. of predicted genes | Mean coding sequence size (bp) | Mean exon size (bp) | Mean intron size (bp) | Mean exons per gene
C. sativus | ab initio, homology search, cDNA-EST | GlimmerHMM, Genscan, Augustus, BGF, SNAP | Arabidopsis, papaya, poplar, grapevine, rice | 26,682 | 1,046 | 238 | 483 | 4.39
J. curcas | ab initio, homology search, cDNA-EST | GeneMark.hmm, Genscan | UniRef, TrEMBL | 40,929 | 3,064 | 227 | 356 | ND
T. cacao | ab initio, homology search, cDNA-EST | EUGene, SpliceMachine | Swiss-Prot, TAIR, Malvaceae, GenBank, Glycine max, T. cacao EST | 28,798 | 3,346 | 231 | 6,319 | 5.03

FUNCTIONAL GENOMICS

Disease resistance-related genes

Resistance genes (R genes) are subdivided into two classes: the nucleotide-binding site leucine-rich repeat (NBS-LRR) class and the receptor protein kinase (RPK) class [17]. A total of 297, 92, and 61 NBS genes were identified in T. cacao, J. curcas, and C. sativus, respectively. Three major resistance gene families were identified in T. cacao (NBS, LRR-RLK, and NPR1), and all three have been mapped onto the ten chromosomes.

C. sativus may rely on alternative mechanisms to confer resistance to pathogens. The relatively small number of NBS genes in its genome, compared with the two other species and with Arabidopsis (200), poplar (398), and rice (600) [8], is compensated by the expansion of its lipoxygenase (LOX) pathway, which produces short-chain aldehydes and alcohols involved in plant defense. In addition, eukaryotic translation initiation factors are associated with recessive resistance to plant viral infections. Three EIF4E and EIF4G genes, which encode the eIF4E and eIF4G proteins, respectively, have been identified in the C. sativus genome and may represent another mechanism compensating for its reduced NBS-mediated pathogen resistance.

Genes for desired traits in respective species

One of the foremost goals of genome sequencing is to identify the putative gene families responsible for the quality traits that give a species its economic significance. For T. cacao, gene families directly responsible for the fine quality and high yield of cocoa are of great interest. For J. curcas, gene families responsible for oil production are highly valued. For C. sativus, a model for plant vascular biology and for the study of sex expression, among others, the gene families underlying these traits are of great interest.

Genes for triacylglycerol (TAG) biosynthesis contribute to biodiesel production, and J. curcas can biosynthesize and accumulate considerable amounts of TAGs in its seeds. To improve Jatropha oil quality for biodiesel, fatty acid synthesis can be modified by altering the genes involved, as predicted in the newly sequenced genome. Moreover, J. curcas is known to produce tumor-promoting phorbol esters. Lowering the expression of the genes responsible for phorbol ester production in high oil-yielding lines will promote the safe use of J. curcas for biodiesel production.

Oils, proteins, starch, flavonoids, alkaloids, and terpenoids are the principal components affecting the flavor and quality of cocoa. The unique fatty acid profile of cocoa butter enhances the aroma of chocolates and confectioneries. In total, 84 orthologous genes potentially involved in lipid biosynthesis, 96 genes involved in flavonoid biosynthesis, and 57 genes encoding terpene synthases were discovered.

Cucurbits are known for producing cucurbitacin, a secondary metabolite that repels insects but also attracts specific insects for pollination. Four oxidosqualene cyclase (OSC) genes responsible for cucurbitacin production were identified in C. sativus. Moreover, 137 cucumber genes related to the biosynthesis of ethylene, a compound that stimulates femaleness in cucumber, have been identified. Auxin also regulates sex expression, and six auxin-related genes were identified in the C. sativus genome. Finally, three short-chain dehydrogenase/reductase genes homologous to the ts2 sex-determination gene in maize (Zea mays) have been identified.

The discovery of these genes should spur downstream studies directed at improving varieties, which would eventually address social and economic issues. A relevant example is global warming.


Genome evolution and comparative genomics

Eudicots are known to have undergone a paleo-hexaploidization event followed by lineage-specific whole genome duplication (WGD) events. It was suggested that the T. cacao genome underwent 11 major chromosome fusions, from the 21 chromosomes of the paleo-hexaploid ancestor, to produce the present 10 chromosomes (Fig. 3). In contrast, collinear gene-order analysis of C. sativus revealed no recent WGD but some segmental duplication events. Additionally, comparative genomics between C. sativus and its close relative C. melo (melon) suggests possible fusions of pairs among ten ancestral chromosomes to form five (chromosomes 1, 2, 3, 5, and 6) of the seven present chromosomes of C. sativus (Fig. 4). In T. cacao, seven blocks of duplicated genes were characterized after alignment of its gene models onto its genome (Fig. 5).

Figure 3. Evolutionary model of T. cacao. The eudicot ancestor chromosomes are presented in seven colors. Several lineage-specific shuffling events have shaped the present eudicot genomes. R: rounds of WGD, F: chromosomal fusions.

Although attributing genomic evolutionary history to common ancestry may be enticing, perhaps because chromosomal segments are rearranged after breeding, as in the case of C. sativus, I suggest that an alternative approach be considered. Given that nobody has lived a thousand years (let alone a million years) and that homologous segments do not always imply common ancestry but may instead reflect common design and function, I recommend further unbiased research into genomic history, to uncover the mechanisms that control genomic fusion, inversion, translocation, etc., and to test how far these events limit the sustenance of life. Do these similarities really mean common ancestry, or common functions and/or regulatory mechanisms?

Figure 4. Comparative genomics between melon and cucumber, showing chromosomes 1, 2, 3, 5, and 6 of cucumber largely syntenic to two chromosomes of melon.

Microsynteny with other genomes

As expected, greater degree of syntenic relationship (53% of the assembled scaffolds)

was observed between J. curcas and Ricinus communis, both in the Euphorbiaceae family,

but less synteny was observed between distantly-related (or

functionally less related) species like Glycine max (11%) and

Arabidopsis thaliana (16%). Meanwhile, 54% of the BAC

sequences of melon were aligned to C. sativus. Moreover,

628, 540, 1,106, 772, and 795 syntenic blocks were identified

between C. sativus and A. thaliana, Carica papaya, Populus

trichocarpa, Vitis vinefera, and Oryza sativa, respectively.

The highly syntenic genomes of C. sativus and C. melo

will facilitate the genetic analysis of C. melo, now that the

genome of C. sativus has been sequenced. They will

also help advance studies of phylogenetic relationships. Collectively, syntenic

relationships among dicots, and among plants in a broader sense, will help in gene

prediction and eventually aid in understanding the relationship between sequence

similarity and function, and the limitations of the theory of ancestry as the sole

explanation for sequence similarity.

CONCLUSION

Recent advances in sequencing technologies have revolutionized our experimental

approaches to the study of plants. They have also shifted major scientific questions from ‘How do we

sequence a genome?’ to ‘Which platform is best for sequencing a particular

genome of interest?’ They have allowed scientists to study crops holistically at the genomic,

transcriptomic, proteomic, and metabolomic levels. They will aid crop

improvement and the understanding of phylogenies and metabolic pathways, among others. The

direct and indirect consequences of genome sequencing ultimately come down to

improvements in the economy and lifestyle of people, which would hopefully be well

distributed globally. The opportunities these advanced technologies open to science

appear limitless. These technologies should be used for the benefit

of both humans and Mother Earth, not solely for humans.

Figure 5. Duplicated gene segments of

T. cacao. The seven colors represent the

seven ancestral eudicot linkage groups.

REFERENCES (Those in bold font refer to the main articles for the three species reviewed here.)

1. Yang TS. 2012. Plant and Culture: Another Interpretation of Human History. Journal of Jishou

University (Social Sciences) 33(1): 1-7.

2. Hirst KK. Plant Domestication: Table of Dates and Places.

http://archaeology.about.com/od/domestications/a/plant_domestic.htm

3. Crop. 2012. March 30. In Wikipedia, The Free Encyclopedia. Retrieved 04:36, April 27, 2012, from

http://en.wikipedia.org/w/index.php?title=Crop&oldid=484675363

4. Shapiro J, Machattie L, Eron L, Ihler G, Ippen K and Beckwith J. 1969. Isolation of Pure lac Operon

DNA. Nature 224, 768 – 774.

5. Feuillet C, Leach JE, Rogers J, Schnable PS and Eversole K. 2011. Crop genome sequencing:

lessons and rationales. Trends in Plant Science 16:77-88.

6. Messing J, Bharti AK, Karlowski WM, Gundlach H, Kim HR, Yu Y, Wei F, Fuks G, Soderlund CA,

Mayer KFX, and Wing RA. 2004. Sequence composition and genome organization of maize.

PNAS 101: 14349–14354.

7. Jaillon O, Aury JM, Noel B, et al. 2007. The grapevine genome sequence suggests ancestral

hexaploidization in major angiosperm phyla. Nature 449 (7161): 463–467.

8. Tuskan GA, Difazio S, Jansson S, et al. 2006. The genome of black cottonwood, Populus

trichocarpa (Torr. & Gray). Science 313 (5793): 1596–604.

9. Haberer G, Young S, Bharti AK, Gundlach H, Raymond C, Fuks G, Butler E, Wing RA, Rounsley S,

Birren B, Nusbaum C, Mayer KF, and Messing J. 2005. Structure and architecture of the maize

genome. Plant Physiol. 139:1612-1624.

10. Schatz MC et al. 2010. Assembly of large genomes using second-generation sequencing. Genome

Res. 20: 1165–1173.

11. Huang S, Li R, Zhang Z, Li L, Gu X, Fan W, et al., 2009. The genome of the cucumber,

Cucumis sativus L. Nature Genetics 41:1275–1281.

12. Metzker ML. 2010. Sequencing technologies—the next generation. Nature Reviews: Genetics

11:31-46.

13. Hayden EC. 2009. Genome sequencing: the third generation. Nature 457:768-769.

14. Tanurdzic M and Banks JA. 2004. Sex-determining mechanisms in land plants. Plant Cell 16, S61–

S71.

15. Sato S et al., 2011. Sequence analysis of the genome of an oil-bearing tree, Jatropha curcas

L. DNA Research 18:65-76.

16. Waminal NE, Kim NS and Kim HH. 2011. Dual‐color FISH karyotype analyses using rDNAs in

three Cucurbitaceae species. Genes and Genomics. 33: 517-524.

17. Afzal AJ, Wood AJ and Lightfoot DA. 2008. Plant receptor-like serine threonine kinases: roles in

signaling and plant defense. Mol. Plant Microbe Interact. 21, 507–517.

18. Argout X, et al., 2011. The genome of Theobroma cacao. Nature Genetics 43:101-108.