Eukaryotic Genomes: From Parasites to Primates (Part 1 of 2) Monday, November 3, 2003 Introduction...

59
Eukaryotic Genomes: From Parasites to Primates (Part 1 of 2) Monday, November 3, 2003 Introduction to Bioinformatics ME:440.714 J. Pevsner [email protected]

Transcript of Eukaryotic Genomes: From Parasites to Primates (Part 1 of 2) Monday, November 3, 2003 Introduction...

Eukaryotic Genomes:From Parasites to Primates

(Part 1 of 2)

Monday, November 3, 2003

Introduction to BioinformaticsME:440.714J. Pevsner

[email protected]

Many of the images in this powerpoint presentationare from Bioinformatics and Functional Genomicsby J Pevsner (ISBN 0-471-21004-8). Copyright © 2003 by Wiley.

These images and materials may not be usedwithout permission from the publisher.

Visit http://www.bioinfbook.org

Copyright notice

Today: Eukaryotic genomesWednesday Nov. 5: Human genomeFriday Nov. 7: computer lab

Monday Nov. 10: Human diseaseWednesday Nov. 12: Final exam (in class)

Announcements

Outline of today’s lecture

1. General features of eukaryotic genomes C value paradox and genome sizes noncoding DNA; repetitive DNA genes organization of DNA in chromosomes

2. Individual eukaryotic genomes protozoans (e.g. trypanosomes, Plasmodium) plants (Arabidopsis, rice) metazoans (nematodes, insects) vertebrates (fish, mouse, primates)

Page 539

Introduction to the eukaryotes

Eukaryotes are single-celled or multicellular organismsthat are distinguished from prokaryotes by the presenceof a membrane-bound nucleus, an extensive system ofintracellular organelles, and a cytoskeleton.

We will explore the eukaryotes using a phylogenetic treeby Baldauf et al. (Science, 2000). This tree was made byconcatenating four protein sequences: elongation factor 1a, actin, -tubulin, and -tubulin.

Page 541

Eukaryotes(after Baldauf et al., 2000)

General features of the eukaryotes

Some of the general features of eukaryotes that distinguishthem from prokaryotes are:

• eukaryotes include many multicellular organisms, in addition to unicellular organisms.• eukaryotes have [1] a membrane-bound nucleus, [2] intracellular organelles, and [3] a cytoskeleton• Most eukaryotes undergo sexual reproduction• The genome size of eukaryotes spans a wider range than that of most prokaryotes• Eukaryotic genomes have a lower density of genes• Prokaryotes are haploid; eukaryotes have varying ploidy• Eukaryotic genomes tend to be organized into linear chromosomes with a centromere and telomeres.

Page 541

C value paradox: why eukaryotic genome sizes vary

The haploid genome size of eukaryotes, called the C value,varies enormously.

Small genomes include:Encephalotiozoon cuniculi (2.9 Mb)A variety of fungi (10-40 Mb)Takifugu rubripes (pufferfish)(365 Mb)(same number of genes as other fish or as the human genome, but 1/10th the size)

Large genomes include:Pinus resinosa (Canadian red pine)(68 Gb)Protopterus aethiopicus (Marbled lungfish)(140 Gb)Amoeba dubia (amoeba)(690 Gb)

Page 543

C value paradox: why eukaryotic genome sizes vary

The range in C values does not correlate well with thecomplexity of the organism. This phenomenon is calledthe C value paradox.

The solution to this “paradox” is that genomes are filledwith large tracts of noncoding, often repetitive DNA sequences.

Page 543

Britten & Kohne’s analysis of repetitive DNA

In the 1960s, Britten and Kohne defined the repetitivenature of genomic DNA in a variety of organisms.They isolated genomic DNA, sheared it, dissociatedthe DNA strands, and measured the rates of DNAreassociation.

For dozens of eukaryotes—but not bacteria or viruses—large amount of DNA reassociates extremely rapidly.This represents repetitive DNA.

Page 544-545

Fig. 16.2Page 545

Britten and Kohne (1968) identified repetitive DNA classes

Five main classes of repetitive DNA

1. Interspersed repeats 2. Processed pseudogenes3. Simple sequence repeats4. Segmental duplications5. Blocks of tandem repeats

Page 546-550

Five main classes of repetitive DNA

Page 546

1. Interspersed repeats (transposon-derived repeats)constitute ~45% of the human genome. They involveRNA intermediates (retroelements) or DNA intermediates(DNA transposons).

• Long-terminal repeat transposons (RNA-mediated)• Long interspersed elements (LINEs);

these encode a reverse transcriptase• Short interspersed elements (SINEs)(RNA-mediated);

these include Alu repeats• DNA transposons (3% of human genome)

Five main classes of repetitive DNA

Table 16.5Page 546

1. Interspersed repeats (transposon-derived repeats)

Examples include retrotransposed genes that lack introns,such as:

ADAM20 NM_003814 14q (original gene on 8p)Cetn1 NM_004066 18p (original gene on Xq)Glud2 NM_012084 Xq (original gene on 10q)Pdha2 NM_005390 4q (original gene on Xp)

Five main classes of repetitive DNA

Page 547

2. Processed pseudogenes

These genes have a stop codon or frameshift mutationand do not encode a functional protein. They commonly arise from retrotransposition, or following gene duplicationand subsequent gene loss.

For a superb on-line resource,visit Mark Gerstein’s website, http://www.pseudogene.org

Five main classes of repetitive DNA

Page 546

3. Simple sequence repeats

Microsatellites: from one to a dozen base pairsExamples: (A)n, (CA)n, (CGG)n

These may be formed by replication slippage.Minisatellites: a dozen to 500 base pairs

Simple sequence repeats of a particular length andcomposition occur preferentially in different species.In humans, an expansion of triplet repeats such as CAGis associated with at least 14 disorders (includingHuntington’s disease).

Example of a simple sequence repeat (CCCA or GGGT) in human genomic DNA

Five main classes of repetitive DNA

Page 547

4. Segmental duplications

These are blocks of about 1 kilobase to 300 kb that arecopied intra- or interchromosomally. Evan Eichler andcolleagues estimate that about 5% of the human genomeconsists of segmental duplications. Duplicated regionsoften share very high (99%) sequence identity.

As an example, consider a group of lipocalin geneson human chromosome 9.

Fig. 16.3Page 548

Successive tandem gene duplications(after Lacazette et al., 2000)

observed today

Fig. 16.3Page 548

Successive tandem gene duplications(after Lacazette et al., 2000)

Fig. 16.3Page 548

Successive tandem gene duplications(after Lacazette et al., 2000)

Fig. 16.3Page 548

Successive tandem gene duplications(after Lacazette et al., 2000)

Five main classes of repetitive DNA

Page 549

5. Blocks of tandem repeats

These include telomeric repeats (e.g. TTAGGG inhumans) and centromeric repeats (e.g. a 171 base pairrepeat of satellite DNA in humans).

Such repetitive DNA can span millions of base pairs, and it is often species-specific.

Fig. 16.4Page 549

Example of telomeric repeats (obtained by tblastn searching TTAGGG4)

Five main classes of repetitive DNA

Page 550

5. Blocks of tandem repeats

In two exceptional cases, chromosomes lack satellite DNA:

• Saccharomyces cerevisiae (very small centromeres)• Neocentromeres (an ectopic centromere; 60 have been described in human, often associated with disease)

Software to detect repetitive DNA

It is essential to identify repetitive DNA in eukaryoticgenomes. RepBase Update is a database of knownrepeats and low-complexity regions.

RepeatMasker is a program that searches DNA queriesagainst RepBase. There are many RepeatMasker sitesavailable on-line.

We will use 100,000 base pairs from human chromosome10 as an example. This region (from NT_008769)includes the retinol-binding protein 4 gene.

Page 550

Fig. 16.5Page 551

Fig. 16.7Page 553

Fig. 16.8Page 554

RepeatMasker identifies simple sequence repeats

RepeatMasker identifies simple sequence repeats

Fig. 16.8Page 554

Fig. 16.8Page 554

RepeatMasker identifies Alu repeats

RepeatMaskermasksrepetitive DNA(FASTA format)

Fig. 16.9Page 555

Finding genes in eukaryotic DNA

Two of the biggest challenges in understanding anyeukaryotic genome are

• defining what a gene is, and • identifying genes within genomic DNA

Page 551

Finding genes in eukaryotic DNA

Types of genes include

• protein-coding genes• pseudogenes• functional RNA genes

--tRNA transfer RNA--rRNA ribosomal RNA--snoRNA small nucleolar RNA--snRNA small nuclear RNA--miRNA microRNA

Page 552

Finding genes in eukaryotic DNA

RNA genes have diverse and important functions.However, they can be difficult to identify in genomic DNA,because they can be very small, and lack open reading frames that are characteristic of protein-coding genes.

tRNAscan-SE identifies 99 to 100% of tRNA molecules,with a rate of 1 false positive per 15 gigabases.Visit http://www.genetics.wustl.edu/eddy/tRNAscan-SE/

Page 553

Finding genes in eukaryotic DNA

Protein-coding genes are relatively easy to find in prokaryotes, because the gene density is high (aboutone gene per kilobase). In eukaryotes, gene density is lower, and exons are interrupted by introns.

There are several kinds of exons:-- noncoding-- initial coding exons-- internal exons-- terminal exons-- some single-exon genes are intronless

Page 553

Fig. 16.12Page 558

Eukaryotic gene prediction algorithms distinguish several kinds of exons

Finding genes in eukaryotic DNA

Algorithms that find protein-coding genes areextrinsic or intrinsic (refer to Chapter 12, Completedgenomes, Figure 12.17).

Page 555

Gene-finding algorithms

Homology-based searches (“extrinsic”) Rely on previously identified genes

Algorithm-based searches (“intrinsic”)Investigate nucleotide composition, open-reading frames, and other intrinsic properties of genomic DNA

Page 556

DNA

RNA

Mature RNA

protein

intron

Page 556

DNA

RNA

RNA

protein

Extrinsic, homology-based searching: compare genomic DNA to expressed genes (ESTs)

intron

Page 556

DNA

RNA

Intrinsic, algorithm-based searching: Identify open reading frames (ORFs). Compare DNA in exons (unique codon usage) to DNA in introns (unique splices sites)and to noncoding DNA.

Page 556

Finding genes in eukaryotic DNA

While ESTs are very helpful in finding genes, beware ofseveral caveats.

-- The quality of EST sequence is sometimes low-- Highly expressed genes are disproportionately represented in many cDNA libraries-- ESTs provide no information on genomic location

Page 557

Finding genes in eukaryotic DNA

Both intrinsic and extrinsic algorithms vary in their ratesof false-positive and false-negative gene identification. Programs such as GENSCAN and Grail account for features such as the nucleotide composition of codingregions, and the presence of signals such as promoter elements. Try using the on-line genome annotation pipelineoffered by Oak Ridge National Laboratory. Google ORNL pipeline, or visithttp://compbio.ornl.gov/tools/pipeline/

Page 557

Fig. 16.13Page 560

Oak Ridge National Laboratory (ORNL)offers an on-line annotation pipeline

Fig. 16.14Page 561

Fig. 16.15Page 561

Fig. 16.16Page 562

Finding genes in eukaryotic DNA

We used 100,000 base pairs of human DNA. The pipelinecorrectly identified several exons of RBP4, but failed togenerate a complete gene model.

As another example, initial annotation of the rice genomeyielded over 75,000 gene predictions, only 53,000 of whichwere complete (having initial and terminal exons). Also,it is very difficult to accurately identify exon-intron boundaries.

Estimates of gene content improve dramatically whenfinished (rather than draft) sequence is analyzed.

Page 561

Protein-coding genes in eukaryotic DNA:a new paradox

The C value paradox is answered by the presence of noncoding DNA.

Why are the number of protein-coding genes about the samefor worms, flies, plants, and humans?

This has been called the N-value paradox (number of genes)or the G value paradox (number of genes).

Page 562

Transcription factor databases

In addition to identifying repetitive elements and genes,it is also of interest to predict the presence of genomicDNA features such as promoter elements and GC content.

See Table 16.10 (p. 564) for a list of websites that predicttranscription factor binding sites and related sequences.

Page 563

Eukaryotic genomes are organized into chromosomes

Genomic DNA is organized in chromosomes. The diploid number of chromosomes is constant in each species(e.g. 16 in S. cerevisiae, 46 in human). Chromosomesare distinguished by a centromere and telomeres.

The chromosomes are routinely visualized by karyotyping(imaging the chromosomes during metaphase, when each chromosome is a pair of sister chromatids).

Page 564

Fig. 16.19Page 565

Eukaryotic chromosomes can be dynamic

Chromosomes can be highly dynamic, in several ways.

• Whole genome duplication (autopolyploidy) can occur, as in yeast (Chapter 15) and some plants.• The genomes of two distinct species can merge, as in the mule (male donkey, 2n = 62 and female horse, 2n = 64)• An individual can acquire an extra copy of a chromosome (e.g. Down syndrome, TS13, TS18)• Chromosomes can fuse; e.g. human chromosome 2 derives from a fusion of two ancestral primate chromosomes• Chromosomal regions can be inverted (hemophilia A)• Portions of chromosomes can be deleted (e.g. 11q syndrome)• Segmental and other duplications occur• Chromatin diminution can occur (Ascaris)

Page 565

Comparison of eukaryotic DNA: PipMaker and VISTA

We studied pairwise sequence alignment at the beginningof the course. In studying genomes, it is important to alignlarge segments of DNA.

PipMaker and VISTA are two tools for sequence alignmentand visualization. They show conserved segments, includingthe order and orientation of conserved elements. They also display large-scale genomic changes (inversions, rearrangements, duplications).

Try VISTA (http://www-gsd.lbl.gov/vista) or PipMaker(http://bio.cse.psu.edu/pipmaker) with genomic DNAfrom Hs10 and Mm19 (containing RBP4).

Page 566

Fig. 16.20Page 568

VISTA output for an alignment of human and mousegenomic DNA (including RBP4)

Fig. 16.20Page 568

VISTA output for an alignment of human and mousegenomic DNA (including RBP4)

This lecture continues with a discussion of individual eukaryotic genomes (part 2)