Introduction to Genomics and the Tree of Life Chapter 13.
Introduction to Genomics and the Tree of Life
Chapter 13
Extra-Reading
• Next generation sequencer: What can next-generation sequencers do for genetics/genomics research?
• Compar_genomics: What can we learn from comparative genomics?
Outline of today’s lecture
Introduction: 5 perspectives, history of life
Genome-sequencing projects: chronology
Genome analysis: criteria, resequencing, metagenomics
DNA sequencing technologies: Sanger, 454, Solexa
Process of genome sequencing: centers, repositories
Genome annotation: features, prokaryotes, eukaryotes
Five approaches to genomics
As we survey the tree of life, consider these perspectives:
Approach I: cataloguing genomic information
Genome size; number of chromosomes; GC content; isochores; number of genes; repetitive DNA; unique features of each genome
Approach II: cataloguing comparative genomic information
Orthologs and paralogs; COGs; lateral gene transfer
Approach III: function; biological principles; evolution
How genome size is regulated; polyploidization; birth and death of genes; neutral theory of evolution; positive and negative selection; speciation
Approach IV: Human disease relevance
Approach V: Bioinformatics aspects
Algorithms, databases, websites
Page 519
Introduction
Lessons learned from comparative genomics
• What have we learned about genes by comparing genomic sequences?
• What have we learned about regulation?
• About 5% of the human genome is under purifying selection
• Positively regulated regions
• Mechanisms and history of mammalian evolution
• Nonuniformity of neutral evolutionary rates within species
• Nonuniformity of evolution along the branches of phylogeny
Learning more from existing data
• Choice of species
• Choice of tools
Future of comparative genomics
Levels of analysis in genomics
level         topics                   databases
DNA           genes, chromosomes       GenBank
RNA           ESTs, ncRNA              UniGene, GEO
protein       ORFs, composition        UniProt
complexes     binary, multimeric       BIND
pathways                               COGs, KEGG
organelles
organs
individuals   variation and disease    HapMap
species       speciation               TaxBrowser; SGD
genus                                  JAX mouse
phylum                                 FishBase
kingdom                                TOL
Definitions of terms
Genomics is the study of genomes (the DNA comprising an organism) using the tools of bioinformatics.
Bioinformatics is the study of proteins, genes, and genomes using computer algorithms and databases.
Systematics is the scientific study of the kinds and diversity of organisms and of any and all relationships among them.
Classification is the ordering of organisms into groups on the basis of their relationships. The relationships may be evolutionary (phylogenetic) or may refer to similarities of phenotype (phenetic).
Taxonomy is the theory and practice of classifying organisms.
Fig. 13.1, page 521
Pace (2001) described a tree of life based on small subunit rRNA sequences.
This tree shows the main three branches described by Woese and colleagues.
Historically, trees were generated primarily using characters provided by morphological data. Molecular sequence data are now commonly used, including sequences (such as small-subunit RNAs) that are highly conserved.
Visit the European Small Subunit Ribosomal RNA database for 20,000 SSU rRNA sequences.
Molecular sequences as basis of trees
Page 523
http://www.zo.utexas.edu/faculty/antisense/Download.html
Tree of life from David Hillis’ lab (based on ~3000 rRNAs)
[Figure labels: animals, plants, fungi, protists, bacteria, archaea; "you are here" marker]
Ribosomal RNA Database
Ribosomal Database Project: http://rdp.cme.msu.edu/index.jsp
Santos, S. R. and Ochman, H. Identification and phylogenetic sorting of bacterial lineages with universally conserved genes and proteins. Environmental Microbiology. 2004 Jul;6(7):754-9.
► Download fusA (translation elongation factor 2 [EF-2])
► Obtain DNA in the FASTA format
► Align by ClustalW in MEGA
► Create a neighbor-joining tree
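Neighbor-joining starts from a matrix of pairwise distances computed on the alignment. The sketch below illustrates only that distance step and is not a description of what MEGA does internally. It assumes the sequences are already aligned to equal length and uses the simple p-distance (proportion of differing sites, skipping gapped columns). The function names and the three short sequences are invented for illustration and are not real fusA data.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differences / compared


def distance_matrix(alignment):
    """Return {(name1, name2): p-distance} for every pair of sequences."""
    return {
        (n1, n2): p_distance(alignment[n1], alignment[n2])
        for n1, n2 in combinations(sorted(alignment), 2)
    }


# Hypothetical toy alignment (assumption: not real fusA sequences).
toy_alignment = {
    "taxon_A": "ATGGCTAA-T",
    "taxon_B": "ATGGCTAAGT",
    "taxon_C": "ATGACTTAGT",
}

if __name__ == "__main__":
    for pair, d in distance_matrix(toy_alignment).items():
        print(pair, round(d, 3))
```

A tree-building method such as neighbor-joining would then take this matrix as input. Real analyses usually apply a substitution model to correct for multiple hits rather than using raw p-distances.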
Page 524
European Small Subunit Ribosomal RNA database (http://www.psb.ugent.be/rRNA/ssu/)
[Figure: neighbor-joining tree with one leaf per genome, each labeled by abbreviated species, strain, and gene name (for example, E coli K12 fusA; Strep pneumoniae R6 fusA; Mycobac tubercu H37Rv fusA). Scale bar: 0.05. Labeled groups: Rickettsia, Treponema, Mycobacterium, Aquifex aeolicus, Yersinia pestis, Clostridium, Mycoplasma, Bacillus anthracis.]
Neighbor-joining tree of ~150 fusA (GTPase) DNA sequences
History of life on earth
4.55 BYA: formation of earth (violent 100 MY period)
4.4-3.8 BYA: last ocean-evaporating impacts
3.9 BYA: oldest dated rocks
3.8 BYA: sun brightened to 70% of today’s luminosity
Ammonia, methane, or carbon dioxide atmosphere.
Earliest life: RNA, protein
Source: Schopf J.W. (ed.), Life’s Origins (U. Calif. Press, 2002)
Page 521
Timeline (1000 to 0 MYA): deuterostome/protostome; echinoderm/chordate; Cambrian explosion; land plants; insects; Age of Reptiles ends. Proterozoic eon; Phanerozoic eon.
Page 522
Timeline (100 to 0 MYA): mass extinction; dinosaurs extinct; mammalian radiation; human/chimp divergence.
Page 522
Timeline (10 to 0 MYA): Homo sapiens/chimp divergence; Australopithecus (Lucy); earliest stone tools; emergence of Homo erectus.
Page 522
Timeline (1,000,000 years ago to present): Homo erectus emerges in Africa; mitochondrial Eve.
Page 523
Timeline (100,000 years ago to present): emergence of anatomically modern H. sapiens; Neanderthal and Homo erectus disappear.
Page 523
Timeline (10,000 years ago to present): “Ice Man” from Alps; earliest pyramids; Aristotle.
Page 523
Timeline (1,000 years ago to present): algebra; Gutenberg; calculus; Darwin, Mendel.
Page 523
We will next summarize the major achievements in genome sequencing projects from a chronological perspective.
Chronology of genome sequencing projects
Page 525
1976: first viral genome
Fiers et al. sequence bacteriophage MS2 (3,569 base pairs, accession NC_001417).
1977: Sanger et al. sequence bacteriophage φX174. This virus is 5,386 base pairs (encoding 11 genes). See accession J02482; NC_001422.
Chronology of genome sequencing projects
Page 527
1981: Human mitochondrial genome
16,500 base pairs (encodes 13 proteins, 2 rRNA, 22 tRNA)
Today (10/09), over 1800 mitochondrial genomes sequenced
1986: Chloroplast genome
156,000 base pairs (most are 120 kb to 200 kb)
Chronology of genome sequencing projects
Page 527
[Figure labels: mitochondrion; chloroplast; lack mitochondria (?)]
http://www.ncbi.nlm.nih.gov/genomes/ORGANELLES/organelles.html
Entrez Genomes organelle resource at NCBI
There are >2100 eukaryotic organelles (10/09)
http://megasun.bch.umontreal.ca/gobase/
GOBASE: resource for organelle genomes
http://www-lecb.ncifcrf.gov/mitoDat/
MitoDat: resource for organelle genomes
“This database is dedicated to the nuclear genes specifying the enzymes, structural proteins, and other proteins, many still not identified, involved in mitochondrial biogenesis and function. MitoDat highlights predominantly human nuclear-encoded mitochondrial proteins.”
Not updated recently.
http://www.mitomap.org/
MitoMap: resource for organelle genomes
It is possible to map mutations in human mitochondrial DNA that are responsible for disease
1995: first genome of a free-living organism, the bacterium Haemophilus influenzae
Chronology of genome sequencing projects
Page 530
1996: first eukaryotic genome
The complete genome sequence of the budding yeast Saccharomyces cerevisiae was reported. We will describe this genome soon.
Also in 1996, TIGR reported the sequence of the first archaeal genome, Methanococcus jannaschii.
Chronology of genome sequencing projects
Page 532
1997: More bacteria and archaea
Escherichia coli: 4.6 megabases, 4200 proteins (38% of unknown function)
1998: first multicellular organism
Nematode Caenorhabditis elegans: 97 Mb; 19,000 genes.
1999: first human chromosome
Chromosome 22 (49 Mb, 673 genes)
Chronology of genome sequencing projects
Page 532
1999: Human chromosome 22 sequenced
2000: Fruit fly Drosophila melanogaster (13,000 genes)
Plant Arabidopsis thaliana
Human chromosome 21
2001: draft sequence of the human genome (public consortium and Celera Genomics)
Chronology of genome sequencing projects
Page 534
2000
• Selection of genomes for sequencing
• Sequence one individual genome, or several?
• How big are genomes?
• Genome sequencing centers
• Sequencing genomes: strategies
• When has a genome been fully sequenced?
• Repository for genome sequence data
• Genome annotation
Overview of genome analysis
Page 537
Applications of Genome Sequencing
Purpose              Template                                     Example
De novo sequencing   Genome sequencing                            Sequencing >1000 influenza genomes
                     Ancient DNA                                  Extinct Neanderthal genome
                     Metagenomics                                 Human gut
Resequencing         Whole genomes                                Individual humans
                     Genomic regions                              Assessment of genomic rearrangements or disease-associated regions
                     Somatic mutations                            Sequencing mutations in cancer
Transcriptome        Full-length transcripts                      Defining regulated messenger RNA transcripts
                     Serial Analysis of Gene Expression (SAGE)
                     Noncoding RNAs                               Identifying and quantifying microRNAs in samples
Epigenetics          Methylation changes                          Measuring methylation changes in cancer
Table 13.15, p. 538
Fig. 13.8, p. 539
Overview of genome analysis
Criteria include:
• genome size (some plants are >>> human genome)
• cost
• relevance to human disease (or other disease)
• relevance to basic biological questions
• relevance to agriculture
Criteria for selecting genomes for sequencing
Page 538
Recent projects:
• Chicken
• Chimpanzee
• Cow
• Dog
• Fungi (many)
• Honey bee
• Sea urchin
• Rhesus macaque
Page 540
Criteria for selecting genomes for sequencing
Selection of genomes for sequencing is based on specific criteria.
For an overview, see a series of white papers posted on the National Human Genome Research Institute (NHGRI) website: http://www.genome.gov/10002154
For a description of NHGRI selection criteria, visit: http://www.genome.gov/10001495
Selection criteria
Page 540
Sequence one individual genome, or several?
Try one…
-- Each genome center may study one chromosome from an organism
-- It is necessary to measure polymorphisms (e.g. SNPs) in large populations
For viruses, thousands of isolates may be sequenced.
For the human genome, cost is the impediment.
Page 540
Criteria for selecting genomes for sequencing
How big are genomes?
Viral genomes: 1 kb to 350 kb (Mimivirus: 1181 kb)
Bacterial genomes: 0.5 Mb to 13 Mb
Eukaryotic genomes: 8 Mb to 686 Gb (human: ~3 Gb)
Diversity of genome sizes
Page 540
[Figure: ranges of genome sizes in nucleotide base pairs, on a logarithmic axis from 10^4 to 10^11, for viruses, plasmids, bacteria, fungi, plants, algae, insects, mollusks, bony fish, amphibians, reptiles, birds, and mammals.]
The size of the human genome is ~3 × 10^9 bp; almost all of its complexity is in single-copy DNA.
The human genome is thought to contain ~30,000-40,000 genes.
http://www3.kumc.edu/jcalvet/PowerPoint/bioc801b.ppt
Genus, species           Subgroup        Size (Mb)   #chr   Common name
Macropus eugenii         Mammals         3800        8      tammar wallaby
Oryctolagus cuniculus    Mammals         3500        22     rabbit
Cavia porcellus          Mammals         3400        31     guinea pig
Pan troglodytes          Mammals         3100        24     chimpanzee
Homo sapiens             Mammals         3038        23     human
Bos taurus               Mammals         3000        30     cow
Dasypus novemcinctus     Mammals         3000        32     nine-banded armadillo
Loxodonta africana       Mammals         3000        28     African savanna elephant
Sorex araneus            Mammals         3000               European shrew
Rattus norvegicus        Mammals         2750        21     rat
Canis familiaris         Mammals         2400        39     dog
Zea mays                 Land Plants     2365        10     corn
Aplysia californica      Other Animals   1800        17     California sea hare
Danio rerio              Fishes          1700        25     zebrafish
Gallus gallus            Birds           1200        40     chicken
Triphysaria versicolor   Land Plants     1200               plant parasite
16 eukaryotic genome projects > 1000 megabases
Ancient DNA projects
Special challenges:
• Ancient DNA is degraded by nucleases
• The majority of DNA in samples derives from unrelated organisms such as bacteria that invaded after death
• The majority of DNA in samples is contaminated by human DNA
• Determination of authenticity requires special controls, and analysis of multiple independent extracts
Page 542
Metagenomics projects
Two broad areas:
• Environmental (ecological) e.g. hot spring, ocean, sludge, soil
• Organismal e.g. human gut, feces, lung
Page 543
Outline of today’s lecture
Introduction: 5 perspectives, history of life: time lines
Genome-sequencing projects: chronology
Genome analysis: criteria, resequencing, metagenomics
DNA sequencing technologies: Sanger, 454, Solexa
Process of genome sequencing: centers, repositories
Genome annotation: features, prokaryotes, eukaryotes
20 genome sequencing centers contributed to the public sequencing of the human genome.
Many of these are listed at the Entrez genomes site. (Or see Table 19.3, page 803.)
Overview of genome analysis
Page 548
Whole genome shotgun sequencing (Celera)
Hierarchical shotgun sequencing (public consortium)
Two approaches to genome sequencing
Whole Genome Shotgun (from the NCBI website)
An approach used to decode an organism's genome by shredding it into smaller fragments of DNA which can be sequenced individually. The sequences of these fragments are then ordered, based on overlaps in the genetic code, and finally reassembled into the complete sequence. The 'whole genome shotgun' (WGS) method is applied to the entire genome all at once, while the 'hierarchical shotgun' method is applied to large, overlapping DNA fragments of known location in the genome.
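The idea of ordering fragments by their overlaps can be illustrated with a toy greedy assembler. This is only a sketch under strong assumptions that the slides do not make: reads are error-free, all come from the same strand, and the sequence contains no repeats longer than the minimum overlap. It is not the algorithm used by Celera or the public consortium, and the function names and the example reads are invented for illustration.

```python
def overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    max_len = min(len(left), len(right))
    for size in range(max_len, min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap.

    Returns the list of contigs left when no overlaps remain.
    """
    fragments = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(fragments) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(fragments):
            for j, right in enumerate(fragments):
                if i == j:
                    continue
                size = overlap(left, right, min_len)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing overlaps any more: keep separate contigs
        merged = fragments[best_i] + fragments[best_j][best_size:]
        fragments = [
            f for k, f in enumerate(fragments) if k not in (best_i, best_j)
        ]
        fragments.append(merged)
    return fragments


if __name__ == "__main__":
    # Hypothetical reads sampled from the made-up sequence ATGCGTACGTTAGC.
    toy_reads = ["GTACGTTA", "ATGCGTAC", "GTTAGC"]
    print(greedy_assemble(toy_reads))
```

With real data, repeats and sequencing errors make a purely greedy merge unreliable, which is one reason the hierarchical approach first assigns fragments to known chromosomal locations.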
Page 548
Two approaches to genome sequencing
Human genome project: strategies
Whole genome shotgun sequencing (Celera)
-- given the computational capacity, this approach is far faster than hierarchical shotgun sequencing
-- the approach was validated using Drosophila
Hierarchical shotgun method
Assemble contigs from various chromosomes, then sequence and assemble them. A contig is a set of overlapping clones or sequences from which a sequence can be obtained. The sequence may be draft or finished.
A contig is thus a chromosome map showing the locations of those regions of a chromosome where contiguous DNA segments overlap. Contig maps are important because they provide the ability to study a complete, and often large segment of the genome by examining a series of overlapping clones which then provide an unbroken succession of information about that region.
Two approaches to genome sequencing
Page 548
Hierarchical shotgun sequencing (public consortium)
-- 29,000 BAC clones
-- 4.3 billion base pairs
-- it is helpful to assign chromosomal loci to sequenced fragments, especially in light of the large amount of repetitive DNA in the genome
-- individual chromosomes assigned to centers
Two approaches to genome sequencing
Fig. 19.8, page 804. Source: IHGSC (2001)
Sequenced-clone contigs are merged to form scaffolds of known order and orientation
A typical goal is to obtain five to ten-fold coverage.
Finished sequence: a clone insert is contiguously sequenced with a high quality standard of error rate 0.01%. There are usually no gaps in the sequence.
Draft sequence: clone sequences may contain several regions separated by gaps. The true order and orientation of the pieces may not be known.
When has a genome been fully sequenced?
Page 549
When has a genome been fully sequenced?
Fold coverage   % sequenced
0.25            22
0.5             39
0.75            53
1               63
2               87.5
3               95
4               98.2
5               99.4
6               99.75
7               99.91
8               99.97
9               99.99
10              99.995
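The coverage figures above are close to what a simple Poisson model predicts. The sketch below rests on an assumption that the slides do not state: reads land uniformly at random on the genome, so the expected fraction of bases covered at least once at fold coverage c is 1 - e^(-c). The function names and the example numbers in the usage note are illustrative only.

```python
import math


def fraction_sequenced(coverage):
    """Expected fraction of bases covered at least once: 1 - e^(-coverage).

    Assumes reads are placed uniformly at random (Poisson model).
    """
    return 1.0 - math.exp(-coverage)


def fold_coverage(num_reads, read_length, genome_size):
    """Fold coverage = (number of reads x read length) / genome size."""
    return num_reads * read_length / genome_size


if __name__ == "__main__":
    for c in (0.25, 0.5, 0.75, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10):
        print(f"{c:>5}  {100 * fraction_sequenced(c):.3f}")
```

For example, `fold_coverage(1000, 500, 100000)` gives 5.0 for a hypothetical project with 1000 reads of 500 bases on a 100 kb target. Under this model, that coverage leaves well under 1% of bases unsequenced, which is consistent with the typical goal of five- to ten-fold coverage.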
When has a genome been fully sequenced?
Page 551
Raw data from many genome sequencing projects are stored at the trace archive at NCBI or EBI (main NCBI page, bottom right).
Also visit: http://trace.ensembl.org/
As of October 2008, the Trace Archive had ~2 billion traces.
As of October 2009 it has ~2,108,000,000 traces.
Trace repository for genome sequence data
Page 552
Fig. 13.12, page 553
http://www.jgi.doe.gov/education/
http://www.youtube.com/watch?v=RLsb0pMx_oU&feature=channel_page
A Howard Hughes Medical Institute (HHMI) video production describing the Whole Genome Shotgun Sequencing process at the JGI. This video is viewable on YouTube in three parts: Part 1 (chapters 1-5), Part 2 (chapters 6-8), Part 3 (chapters 9-14).
Role of comparative genomics
Phylogenetic footprinting
Phylogenetic shadowing
Population shadowing
Page 552
Fig. 13.13, page 554
Fig. 13.14, page 555
Information content in genomic DNA includes:
-- nucleotide composition (GC content)
-- repetitive DNA elements
-- protein-coding genes, other genes
Genome annotation
Page 555
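Nucleotide composition is the simplest of these features to compute. The following is a minimal sketch: GC content is taken here as the percentage of G and C among the unambiguous bases, and the sliding-window helper is one hypothetical way to look for regional variation such as isochores. The function names, window size, and step are illustrative assumptions, not a prescribed method.

```python
def gc_content(sequence):
    """Percentage of G and C among the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return 100.0 * gc / (gc + at)


def windowed_gc(sequence, window, step):
    """GC percentage in sliding windows; returns (start, gc_percent) pairs.

    Raises ValueError if a window contains no unambiguous bases.
    """
    return [
        (start, gc_content(sequence[start:start + window]))
        for start in range(0, len(sequence) - window + 1, step)
    ]


if __name__ == "__main__":
    # Made-up sequence: an AT-rich half followed by a GC-rich half.
    toy = "ATATTAATAT" + "GCGGCCGCGC"
    print(gc_content(toy))
    print(windowed_gc(toy, window=10, step=10))
```

Ambiguous characters such as N are excluded from the denominator so that unsequenced stretches do not lower the reported percentage.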
[Figure: histograms of the number of species in each GC class versus GC content (%), from 20 to 80, for vertebrates, invertebrates, plants, and bacteria.]
GC content varies across genomes
Fig. 13.15, page 556
Gene prediction tools
• http://bioinformatics.ca/links_directory/?subcategory_id=39
• http://www.geneprediction.org/
Common tools
GenScan: http://genes.mit.edu/GENSCAN.html
HMMgene: http://www.cbs.dtu.dk/services/HMMgene/
Microbial: http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi
Fungal: http://www.cbcb.umd.edu/software/GlimmerHMM/