Unilag workshop complex genome analysis
COMPLEX GENOME ANALYSIS
Ikhide Imumorin, PhD
Assistant Professor of Animal Molecular & Quantitative Genetics
Cornell University
Ithaca, NY 14853
USA
GENOMICS ANALYSIS
The genomes of living organisms vary enormously in size.
Genomicists look at two basic features of genomes: sequence and polymorphism.
• Major challenges in determining the sequence of each chromosome in a genome and identifying its many polymorphisms:
– How does one sequence a 500 Mb chromosome 600 bp at a time?
– How accurate should a genome sequence be?
• DNA sequencing error rate is about 1% per 600 bp.
– How does one distinguish sequence errors from polymorphisms?
• Rate of polymorphism in diploid human genome is about 1 in 500 bp.
– Repeat sequences may be hard to place.
– Unclonable DNA cannot be sequenced.
• Up to 30% of genome is heterochromatic DNA that cannot be cloned.
Divide and conquer strategy meets most challenges.
• Chromosomes are broken into small overlapping pieces and cloned.
• Ends of clones sequenced and reassembled into original chromosome strings
• Each piece is sequenced multiple times to reduce error rate.
– 10-fold sequence coverage achieves an error rate of less than 1/10,000.
Fig. 10.2
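The claim that 10-fold coverage brings the error rate below 1/10,000 can be checked with a rough calculation. The sketch below reads the quoted 1% as a per-base error rate, assumes errors in different reads are independent, and calls the consensus by simple majority vote. It is pessimistic in treating every erroneous read as agreeing on the same wrong base and in counting ties as errors. Real assemblers weight base qualities, so this is an illustration only, and the function name is hypothetical.

```python
from math import comb


def consensus_error(per_read_error: float, coverage: int) -> float:
    """Upper bound on the chance that a majority vote over `coverage`
    independent reads calls the wrong base (ties counted as errors)."""
    # The vote can only fail if at least half of the reads are wrong.
    min_wrong = (coverage + 1) // 2
    return sum(
        comb(coverage, k)
        * per_read_error ** k
        * (1 - per_read_error) ** (coverage - k)
        for k in range(min_wrong, coverage + 1)
    )


if __name__ == "__main__":
    for cov in (1, 4, 10):
        print(f"{cov:>2}-fold coverage: error bound {consensus_error(0.01, cov):.2e}")
```

Under these assumptions the bound at 10-fold coverage is far below 1/10,000, which is consistent with the figure on the slide.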
Techniques for mapping and cloning
• Cloning
– Library of DNA fragments 500 – 1,000,000 bp
– Insert into one of a variety of vectors
• Hybridization
– Location of a particular DNA sequence within the library of fragments
• PCR amplification
– Direct amplification of a particular region of DNA ranging from 1 bp to > 20 kb
• DNA sequencing
– Automated DNA sequencer using Sanger method determines sequences 600 bp at a time.
• Computational tools
– Programs for identifying matches between a particular sequence and a large population of previously sequenced fragments
– Programs for identifying overlaps of DNA fragments
– Programs for estimating error rates
– Programs for identifying genes in chromosomal sequences
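As an illustration of what a program for identifying overlaps of DNA fragments does, here is a minimal sketch that finds the longest exact suffix-prefix overlap between two reads and merges them. The function names and the `min_len` threshold are hypothetical, and exact matching is a simplification: production assemblers tolerate sequencing errors and use indexed data structures rather than this brute-force scan.

```python
from typing import Optional


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.
    Returns 0 if no overlap of at least `min_len` bases exists."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def merge(a: str, b: str, min_len: int = 3) -> Optional[str]:
    """Join two overlapping fragments into one, or return None."""
    n = overlap(a, b, min_len)
    return a + b[n:] if n else None


if __name__ == "__main__":
    left, right = "ACGTTGCA", "TGCAGGTC"
    print(overlap(left, right))  # shared bases at the junction
    print(merge(left, right))    # reconstructed longer fragment
```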
Making a large scale linkage map
• Types of DNA polymorphisms used for large-scale mapping:
– Single nucleotide polymorphisms (SNPs) – 1 per 500 – 1,000 bp across genome
– Simple sequence repeats (SSRs) – 1 per 20 – 40 kb across genome
• A unit of 2-5 nucleotides is repeated 4-50 or more times.
• Most SNPs and SSRs have little or no effect on the organism.
• Serve as DNA markers across the chromosomes
• Must be rapidly identifiable and assayable in populations of 100s to 1000s of individuals
Fig. 10.3
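The SSR definition above (a unit of 2-5 nucleotides repeated four or more times) translates directly into a pattern search. The following is a minimal sketch using Python's standard `re` module; the function name is hypothetical, and the pattern is deliberately naive. For example, a run of a single base such as AAAAAAAA would be reported as a repeat of the unit "AA".

```python
import re

# A unit of 2-5 bases followed by at least three more copies of itself,
# i.e. four or more copies in total. The lazy quantifier prefers the
# shortest unit, so CACACACA is reported as (CA)4 rather than (CACA)2.
SSR_PATTERN = re.compile(r"([ACGT]{2,5}?)\1{3,}")


def find_ssrs(seq: str):
    """Return a list of (start, unit, copies) for each SSR found."""
    hits = []
    for match in SSR_PATTERN.finditer(seq.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


if __name__ == "__main__":
    example = "GGATC" + "CA" * 6 + "TTGAC"
    print(find_ssrs(example))
```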
Genome wide identification of genetic markers
• Initial genetic maps used SSRs which are highly polymorphic.
• Identified by screening DNA libraries with SSR probes
• Amplified by PCR and length differences assayed
• SNPs – millions identified more recently by comparing orthologous regions of cDNA clones from different individuals
• Homologous – genes with enough sequence similarity to be related somewhere in evolutionary history
• Orthologous – genes in two different species that arose from the same gene in the two species’ common ancestor
• Paralogous – arise by duplication within same species
• Orthologous genes are always homologous, but homologous genes are not always orthologous.
SNPs and SSRs for genome coverage
• Until recently, maps were constructed from about 500 SSRs evenly spaced across genome (1 SSR every 6 Mb).
• SNPs provide more than 500,000 DNA markers across the genome.
Genome wide typing of genetic markers
• Two-stage assay for simple sequence repeats
– PCR amplification
– Size separation
Fig. 10.4
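The two-stage SSR assay works because alleles with different repeat counts give PCR products of different lengths. The sketch below mimics this in silico: it locates the two primer sites on a template and reports the product length. The flanking sequences, primers, and repeat counts are invented for illustration, and exact primer matching is a simplifying assumption.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Hypothetical flanking sequences around a (CA)n repeat.
FLANK_LEFT = "ATGGCTTAGC"
FLANK_RIGHT = "GGATCCTTAA"


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


# The forward primer matches the left flank; the reverse primer is the
# reverse complement of the right flank.
FWD_PRIMER = FLANK_LEFT
REV_PRIMER = revcomp(FLANK_RIGHT)


def amplicon_length(template: str, fwd_primer: str, rev_primer: str):
    """Length of the PCR product, or None if either primer has no site."""
    start = template.find(fwd_primer)
    if start == -1:
        return None
    site = revcomp(rev_primer)
    end = template.find(site, start + len(fwd_primer))
    if end == -1:
        return None
    return end + len(site) - start


if __name__ == "__main__":
    for copies in (12, 15):
        allele = FLANK_LEFT + "CA" * copies + FLANK_RIGHT
        print(copies, "repeats ->", amplicon_length(allele, FWD_PRIMER, REV_PRIMER), "bp")
```

The size-separation step then distinguishes the two alleles by the 6 bp difference in product length.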
Long range physical maps: karyotypes and genomic libraries position markers on chromosomes.
• Physical map
– Overlapping DNA fragments, ordered and oriented, that span each of the chromosomes
– Based on direct analysis of DNA rather than on recombination, on which linkage maps are based
– Charts actual number of bp, kb, or Mb that separate a locus from its neighbors
– Linkage vs. physical maps
• 1 cM ≈ 1 Mb in humans
• 1 cM ≈ 2 Mb in mice
Vectors used to clone large inserts for physical mapping
• YACs (yeast artificial chromosomes)
– Insert size 100 – 1,000 kb
• BACs (bacterial artificial chromosomes)
– Insert size 50 – 300 kb
– More stable and easier to purify from host DNA than YACs
How to determine order of clones across genome
• Overlapping inserts help align cloned fragments.
– Bottom-up approach – overlaps among tens of thousands of clones determined by restriction site analysis or sequence-tagged sites (STSs)
– Top-down approach – insert is hybridized against karyotype of entire genome.
Identifying and isolating a set of overlapping fragments from a library
• Two approaches:
– Linkage maps used to derive a physical map
• Set of markers less than 1 cM apart
• Use markers to retrieve fragments from library by hybridization.
• Construct contigs – two or more partially overlapping cloned fragments.
• Chromosome walk by using ends of unconnected contigs to probe library for fragments in unmapped regions
– Physical mapping techniques:
• Direct analysis of DNA
• Overlapping clones aligned by restriction mapping
• Sequence-tagged sites (STSs)
Physical mapping by analysis of STSs
Bottom-up approach
Each STS represents a unique segment of the genome amplified by PCR.
Fig. 10.5
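The bottom-up logic can be sketched in a few lines: if two clones test positive for the same STS, their inserts must overlap, so clones linked by shared STSs belong to one contig. The clone and STS names below are invented, and the function is a hypothetical illustration that only groups clones; ordering them within a contig would need the marker order as well.

```python
from collections import defaultdict


def build_contigs(sts_hits):
    """Group clones into contigs. `sts_hits` maps each clone name to the
    set of STSs that amplify from it; clones sharing an STS overlap."""
    clones_by_sts = defaultdict(set)
    for clone, markers in sts_hits.items():
        for sts in markers:
            clones_by_sts[sts].add(clone)

    contigs, seen = [], set()
    for clone in sts_hits:
        if clone in seen:
            continue
        # Walk outward through every clone reachable via shared STSs.
        stack, members = [clone], set()
        while stack:
            current = stack.pop()
            if current in members:
                continue
            members.add(current)
            for sts in sts_hits[current]:
                stack.extend(clones_by_sts[sts] - members)
        seen |= members
        contigs.append(sorted(members))
    return contigs


if __name__ == "__main__":
    hits = {
        "BAC1": {"sts_a", "sts_b"},
        "BAC2": {"sts_b", "sts_c"},
        "BAC3": {"sts_c"},
        "BAC4": {"sts_x"},
    }
    print(build_contigs(hits))
```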
Human Karyotype
• (a) Complete set of human chromosomes stained with Giemsa dye shows bands.
• (b) Ideograms show idealized banding pattern.
Fig. 10.6 a, b
Chromosome 7 at three levels of resolution
Fig. 10.6 c
FISH protocol for top-down approach
Fig. 10.8
Sequence maps show order of nucleotides in cloned piece of DNA.
• Two strategies for sequencing the human genome:
– Hierarchical shotgun approach
– Whole-genome shotgun approach
• Shotgun – randomly generated overlapping insert fragments:
– Fragments from BACs
– Fragments from shearing whole genome
• Shearing DNA with sonication
• Partial digestion with restriction enzymes
Hierarchical shotgun strategy
Used in publicly funded effort to sequence human genome
• Shear 200 kb BAC clone into ~2 kb fragments
• Sequence ends of inserts to 10-fold coverage
• Need about 1700 plasmid inserts per BAC and about 20,000 BACs to cover genome
• Data from linkage and physical maps used to assemble sequence maps of chromosomes
• Significant work to create libraries of each BAC and physically map BAC clones
Fig. 10.9
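The figures of about 1700 plasmid inserts per BAC and about 20,000 BACs can be reproduced approximately with back-of-the-envelope arithmetic. The sketch below uses the 200 kb BAC size and 10-fold coverage from the slide and the 600 bp read length quoted earlier. Two inputs are assumptions not stated here: a genome size of about 3,000 Mb (consistent with 500 SSRs spaced 6 Mb apart) and one read from each end of every plasmid insert.

```python
BAC_SIZE_BP = 200_000           # one BAC insert (from the slide)
COVERAGE = 10                   # each base read about 10 times (from the slide)
READ_LENGTH_BP = 600            # Sanger read length quoted earlier
GENOME_SIZE_BP = 3_000_000_000  # assumption: roughly 3,000 Mb


def plasmid_inserts_per_bac(bac_size=BAC_SIZE_BP, coverage=COVERAGE,
                            read_length=READ_LENGTH_BP, reads_per_insert=2):
    """Inserts needed to reach the target coverage of one BAC when each
    insert contributes `reads_per_insert` end reads."""
    reads_needed = bac_size * coverage / read_length
    return reads_needed / reads_per_insert


def minimum_bacs(genome_size=GENOME_SIZE_BP, bac_size=BAC_SIZE_BP):
    """BACs needed if clones tiled the genome end to end with no overlap."""
    return genome_size / bac_size


if __name__ == "__main__":
    print(round(plasmid_inserts_per_bac()))  # close to the ~1700 quoted
    print(round(minimum_bacs()))             # lower bound on the BAC count
```

The no-overlap lower bound comes out at 15,000 BACs. The quoted figure of about 20,000 is plausible once neighboring clones are required to overlap so that they can be ordered.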
Whole-genome shotgun sequencing
Used by Celera Genomics to sequence whole human genome.
• Whole genome randomly sheared three times
– Plasmid library constructed with ~2 kb inserts
– Plasmid library with ~10 kb inserts
– BAC library with ~200 kb inserts
• Computer program assembles sequences into chromosomes.
• No physical map construction
• Only one BAC library
• Overcomes problems of repeat sequences
Fig. 10.10
Limitations of whole genome sequencing
• Some DNA cannot be cloned.
– e.g., heterochromatin
• Some sequences rearrange or sustain deletions when cloned.
• Future large genome sequencing will use both shotgun approaches.
Sequencing of the human genome
• Most of the draft sequencing took place during the last year of the project.
– Instrument improvements – 500,000 bp/day per instrument
– Automated factory-like production line generated sufficient DNA to supply sequencers on a daily basis.
– Large sequencing centers with 100-300 instruments – 150,000,000 bp/day
Integration of linkage, physical, and sequence maps
• Provides check on the correct order of each map against other two
• SSR and SNP DNA linkage markers readily integrated into physical map by PCR analysis across insert clones in physical map
• SSR, SNP (linkage maps), and STS markers (physical maps) have unique sequences of 20 bp or more, allowing placement on sequence map.
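A quick calculation suggests why about 20 bp is enough for a marker sequence to be placed uniquely. The sketch below estimates how many times a given k-base sequence would occur by chance on one strand of a genome, assuming a random sequence with equal base frequencies and a genome of about 3,000 Mb. Both are assumptions, and real genomes are far from random (more than half is repeat sequence), so markers also have to avoid repeats. The function name is hypothetical.

```python
def expected_chance_matches(k: int, genome_size: float = 3e9) -> float:
    """Expected number of exact occurrences of one specific k-mer in a
    random genome of `genome_size` bases (single strand)."""
    return genome_size / 4 ** k


if __name__ == "__main__":
    for k in (12, 16, 20):
        print(f"{k} bp: about {expected_chance_matches(k):.4g} chance matches")
```

Under these assumptions a 12 bp sequence is expected to occur by chance well over a hundred times, while a 20 bp sequence is expected far less than once.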
Changes in biology, genetics and genomics from human genome sequence
• Genetics parts list
• Speeds gene-finding and gene-function analysis
– Sequence identification in second organism through homology
– Gene function in one organism helps understand function in another for orthologous and paralogous genes
– Genes often encode one or more protein domains
• Allows guess at function of new protein by comparison of protein sequence in databases of all known domains
– Ready access to identification of known human polymorphisms
– Speeds mapping of new organisms by comparison
• e.g., mouse and human have high similarity in gene content and order
Major insights from human and model organism sequences
• Approximately 25,000 human genes
• Genes encode noncoding RNA or proteins.
• Repeat sequences are > 50% of genome.
• Distinct types of gene organization:
– Gene families
– Gene-rich regions
– Gene deserts
• Combinatorial strategies amplify genetic information and increase diversity.
• Evolution by lateral transfer of genes from one organism to another
• Males have twofold higher mutation rate than females.
• Human races have very few unique distinguishing genes.
• All living organisms evolved from a common ancestor.
Conserved segments of syntenic blocks in human and mouse genomes
Fig. 10.12
Noncoding RNA genes
• Transfer RNAs (tRNAs) – adaptors that translate triplet code of RNA into amino acid sequence of proteins
• Ribosomal RNAs (rRNAs) – components of ribosome
• Small nucleolar RNAs (snoRNAs) – RNA processing and base modification in nucleolus
• Small nuclear RNAs (snRNAs) – components of spliceosomes
Protein coding genes generate the proteome.
• Proteome – collective translation of 30,000 protein coding genes into proteins
• Complexity of proteome increases from yeast to humans.
– More genes
– Shuffling, increase, or decrease of functional modules
– More paralogs
– Alternative RNA splicing – humans exhibit significantly more
– Chemical modification of proteins is higher in humans.
Protein coding genes generate the proteome
How transcription factor protein domains have expanded in specific lineages
Fig. 10.11
Examples of domain accretions in chromatin proteins
Fig. 10.13
Number of distinct domain architectures in four eukaryotic genomes
Fig. 10.14
Repeat sequences fall into five classes.
• Transposon-derived repeats
• Processed pseudogenes
• SSRs
• Segmental duplications of 10-300 kb
• Blocks of repeated sequences at centromere, telomeres and other chromosomal features
Repeat sequences constitute more than 50% of the genome.
Fig. 10.15
Gene Organization of Genome
• Gene families
– Closely related genes clustered or dispersed
• Gene-rich regions
– Functional or chance events?
• Gene deserts
– Span 144 Mb or 3% of genome
– Contain regions difficult to identify?
• e.g., big genes – nuclear transcript spans 500 kb or more with very large introns (exons < 1% of DNA)
Genome has a distinct organization.
Gene family – olfactory receptor gene family
Class II region of human major histocompatibility complex contains 60 genes in 700 kb
Fig. 10.17
Combinatorial strategies
• At DNA level – T-cell receptor genes are encoded by a multiplicity of gene segments.
• At RNA level – splicing of exons in different combinations (alternative splicing)
Fig. 10.19a
Fig. 10.18
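The gain from combinatorial strategies is multiplicative, which a short calculation makes concrete. The segment counts and exon number below are hypothetical values chosen only for illustration, not measured figures for any real gene. The first function counts the ways to pick one segment from each pool at the DNA level. The second counts transcripts if each optional exon is independently kept or skipped at the RNA level.

```python
from math import prod

# Hypothetical pool sizes for the gene segments of one receptor chain.
SEGMENT_COUNTS = {"V": 50, "D": 2, "J": 12}


def dna_level_combinations(segment_counts) -> int:
    """Ways to choose exactly one segment from every pool."""
    return prod(segment_counts.values())


def rna_level_combinations(optional_exons: int) -> int:
    """Transcripts possible if each optional exon is kept or skipped."""
    return 2 ** optional_exons


if __name__ == "__main__":
    print(dna_level_combinations(SEGMENT_COUNTS))
    print(rna_level_combinations(5))
```

With these made-up numbers, 64 gene segments yield 1,200 distinct joined genes, and five optional exons yield 32 possible transcripts.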
Lateral transfer of genes
• > 200 human genes may have arisen by transfer from organisms such as bacteria.
• Lateral transfer is direct transfer of genes from one species into the germ line of another.
Twofold higher mutation rate in males
• Comparison of X and Y chromosomes
• Same may be true for autosomes, but difficult to measure.
• Majority of human mutations arise in males.
• Males give rise to more defects, but also more diversity.
Human races have similar genes.
• Genome sequence centers have sequenced significant portions of at least three races.
• Range of polymorphisms within a race can be much greater than the range of differences between any two individuals of different races.
• Very few genes are race specific.
• Genetically, humans are a single race.
All living organisms are a single race.
• All living organisms have remarkably similar genetic components.
• Life evolved once and we are descendants of that event.
• Analysis of appropriate biological systems in model organisms provides fundamental insight into corresponding human systems.
In the future, other features of chromosomes will become increasingly important.
• Chemical modification of bases
– Understand DNA methylation now
– Others may be discovered
• Interaction of various proteins with chromosome
• Three-dimensional structure of proteins in nucleus
– May determine interactions of chromosomal regions with regions of nuclear envelope
• More effective tools need to be developed to examine chromosome features.
High-throughput instruments
DNA sequencer
Fig. 10.20
High-throughput instruments
e.g., microarrays
Fig. 10.21
Two-color DNA microarray
Fig. 10.22
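Two-color microarray data are usually summarized per spot as the log2 ratio of the two channel intensities, so that equal up- and down-regulation give values of equal size and opposite sign. The sketch below shows that calculation. The channel names, the example intensities, and the `floor` parameter (which avoids taking the log of zero) are assumptions for illustration, and real analyses also subtract background and normalize between channels.

```python
from math import log2


def log_ratios(red, green, floor: float = 1.0):
    """log2(red / green) for each spot. Intensities below `floor` are
    raised to `floor` so that the ratio is always defined."""
    return [
        log2(max(r, floor) / max(g, floor))
        for r, g in zip(red, green)
    ]


if __name__ == "__main__":
    red_channel = [800, 100, 250]    # hypothetical test-sample intensities
    green_channel = [200, 400, 250]  # hypothetical reference intensities
    print(log_ratios(red_channel, green_channel))
```

A positive value means the spot is brighter in the red channel, a negative value means it is brighter in the green channel, and zero means no difference.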
Analysis of genomic and RNA sequences
• Quantitative analysis of mRNA levels
– Serial analysis of gene expression (SAGE)
• Small cDNA tags of 15 bp from 3’ ends of mRNA are linked and sequenced.
– Massively parallel signature sequencing (MPSS)
• Transcriptome – population of mRNAs expressed in a single cell or cell type
• MPSS allows identification of most of cell’s rarely expressed mRNAs
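The counting step behind SAGE can be sketched simply: once the linked tags have been sequenced, the read is cut back into tags and each tag is tallied, and the tally is taken as a measure of that mRNA's abundance. The sketch assumes tags of exactly 15 bp joined directly end to end, which is a simplification of the real protocol, and the tag sequences are invented.

```python
from collections import Counter

TAG_LENGTH = 15  # tag size quoted on the slide


def count_tags(concatemer: str, tag_length: int = TAG_LENGTH) -> Counter:
    """Split a sequenced chain of linked tags into fixed-length tags and
    count how often each one occurs. A trailing partial tag is ignored."""
    last_start = len(concatemer) - tag_length
    tags = (
        concatemer[i:i + tag_length]
        for i in range(0, last_start + 1, tag_length)
    )
    return Counter(tags)


if __name__ == "__main__":
    tag_a = "ACGTACGTACGTACG"  # hypothetical tag from an abundant mRNA
    tag_b = "TTTTGGGGCCCCAAA"  # hypothetical tag from a rarer mRNA
    print(count_tags(tag_a + tag_b + tag_a).most_common())
```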
Lynx Therapeutics sequencing strategy for MPSS
Fig. 10.24
Systems Biology – the global study of multiple components of biological systems and their interactions
• New approach to studying biological systems, made possible by:
– Sequencing genomes
– High-throughput platform development
– Development of powerful computational tools
– The use of model organisms
– Comparative genomics
Human Genome Project has changed the potential for predictive/preventive medicine.
• Provided access to DNA polymorphisms underlying human variability
– Makes possible identification of genes predisposing to disease
– Understanding of defective genes in context of biological systems
– Circumvent limitations of defective genes
• Novel drugs
• Environmental controls
• Approaches such as stem-cell transplants or gene therapy
Social, ethical, and legal issues
• Privacy of genetic information
• Limitations on genetic testing
• Patenting of DNA sequences
• Society’s view of older people
• Training of physicians
• Human genetic engineering
– Somatic gene therapy – inserting replacement genes
– Germ-line therapy – modifications of human germ line