Genomic and comparative genomic analysis
![Page 1: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/1.jpg)
Genomic and comparative genomic analysis
BIO520 Bioinformatics Jim Lund
![Page 2: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/2.jpg)
Comparative genomics delivers
• Clues as to human disease genes and evolutionary history
• Evidence of general trends in genome evolution
• Previously unknown regulatory strategies
• “Natural history” of species as apparent in genome records
• Surprises
![Page 3: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/3.jpg)
Difference is in Scale and Direction
Other “omics”:
• One or several genes compared against all other known genes.
• Use information from many genomes to learn more about the individual genes.
Comparative:
• Entire genome compared to other entire genomes.
• Use genome to inform us about the entire organism.
![Page 4: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/4.jpg)
What are some questions that comparative genomics can address?
How has the organism evolved?
What differentiates species?
Which non-coding regions are important?
Which genes are required for organisms to survive in a certain environment? (prokaryotes)
![Page 5: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/5.jpg)
Genomic characteristics observed in recently diverged species
[Figure: phylogenetic tree of species A to F on a time axis (My) marked 0, -10, -80, -150, -200]
• Organism-specific differences in gene regulation more apparent than differences in genome sequence or structure
• Relatively small amount of neutral drift
• Apparent positive selection
• Some chromosomal rearrangement
• Minimal species-specific gene innovation
![Page 6: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/6.jpg)
Genomic characteristics observed in species that have diverged ~80 MYA
[Figure: phylogenetic tree of species A to F on a time axis (My) marked 0, -10, -80, -150, -200]
• Chromosomal rearrangements dominate organizational change.
• Changes in chromosome number likely.
• Conservation of synteny regions within rearrangements.
• High conservation features indicate purifying selection against drift background, therefore important genomic features in common.
• Protein domain arrangements largely conserved among orthologs.
• Species-specific gene duplication, divergence, and/or loss.
![Page 7: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/7.jpg)
Genomic characteristics observed between species that have diverged ~1 BYA
[Figure: phylogenetic tree of species A, E, F, and G on a time axis (My) marked 0, -500, -1000]
• Genome structure has no resolvable large or small-scale homology.
• Cis-regulatory regions do not correspond.
• Greatest conservation at the functional level in some protein domains and functional RNA.
• Different strategies in gene organization and regulation.
• Apparent homology in shared-ancestral systems, such as energy processing and storage.
![Page 8: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/8.jpg)
Different Questions Require Different Comparisons
From: Hardison. Plos Biology. Vol 1 (2): 156-160
![Page 9: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/9.jpg)
What is compared?
• Gene location
• Gene structure
  – Exon number
  – Exon lengths
  – Intron lengths
  – Sequence similarity
• Gene characteristics
  – Splice sites
  – Codon usage
  – Conserved synteny
![Page 10: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/10.jpg)
[Figure: timescale in millions of years]
From: Miller et al. Annu. Rev. Genom. Human. Genet. 2004. 5:15-56.
![Page 11: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/11.jpg)
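Reminder: Orthologues & Paralogues
[Figure: globin gene tree along a time axis t (0 to 3). An early globin gene undergoes a first duplication event, giving the alpha chain and beta chain lineages. A second duplication event (speciation) gives frog alpha, human alpha, human beta, and frog beta. Orthologues (e.g. frog alpha and human alpha) and paralogues (e.g. human alpha and human beta) are marked.]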
![Page 12: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/12.jpg)
Figure 1 Regions of the human and mouse homologous genes: Coding exons (white), noncoding exons (gray), introns (dark gray), and intergenic regions (black). Corresponding strong (white) and weak (gray) alignment regions of GLASS are shown connected with arrows. Dark lines connecting the alignment regions denote very weak or no alignment. The predicted coding regions of ROSETTA in human, and the corresponding regions in mouse, are shown (white) between the genes and the alignment regions.
![Page 13: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/13.jpg)
Example
Functional elements:
• Gene regulation?
• Chromatin structure?
![Page 14: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/14.jpg)
Terminologies (Cont’d)
– Synteny
  • Two or more genes that are located on the same chromosome.
  • Relevant within a species.
– Conserved synteny
  • Orthologs of genes that are syntenic in one species are also located on a single chromosome in a second species.
  • Gene order is irrelevant.
– Conserved segments/linkages
  • In a segment of DNA, the order of multiple orthologous genes is the same in two species.
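The three definitions above can be checked mechanically once gene positions and an ortholog mapping are available. The following is a minimal sketch, not a standard tool: the map layout (gene name mapped to a chromosome and a position index), the function names, and the decision to accept an inverted segment as "same order" are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical gene map: gene name -> (chromosome, position index on that chromosome).
GeneMap = Dict[str, Tuple[str, int]]


def conserved_synteny(genes: List[str], orthologs: Dict[str, str],
                      map_a: GeneMap, map_b: GeneMap) -> bool:
    """Genes share a chromosome in species A and their orthologs share one in B.

    Gene order is ignored, as in the definition of conserved synteny.
    """
    chroms_a = {map_a[g][0] for g in genes}
    chroms_b = {map_b[orthologs[g]][0] for g in genes}
    return len(chroms_a) == 1 and len(chroms_b) == 1


def conserved_segment(genes: List[str], orthologs: Dict[str, str],
                      map_a: GeneMap, map_b: GeneMap) -> bool:
    """Conserved synteny plus the same gene order in both species.

    Assumption: a segment read in the opposite orientation (fully reversed
    order) still counts as the same order.
    """
    if not conserved_synteny(genes, orthologs, map_a, map_b):
        return False
    ordered_a = sorted(genes, key=lambda g: map_a[g][1])
    positions_b = [map_b[orthologs[g]][1] for g in ordered_a]
    return (positions_b == sorted(positions_b)
            or positions_b == sorted(positions_b, reverse=True))
```

With three genes on one chromosome in species A whose orthologs sit on one chromosome in species B but in shuffled order, `conserved_synteny` holds while `conserved_segment` does not.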
![Page 15: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/15.jpg)
Image credit: U.S. Department of Energy Human Genome Program
From: http://www.macdevcenter.com/pub/a/mac/2004/06/29/bioinformatics.html
![Page 16: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/16.jpg)
Q: Why do gene pairs in syntenic regions have more significant E scores?
![Page 17: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/17.jpg)
VISTA
A genomic alignment and visualization program
http://genome.lbl.gov/vista/index.shtml
• VISTA automatically finds an orthologue for your input sequence and performs a VISTA similarity plot
• Example: Rat BAC: gj (AC097115)
• For alignment, uses the AVID or LAGAN programs
• Quickly aligns 100’s of kb
• Can handle sequence in draft format
• Uses HMM-like algorithm to find strong anchors from a collection of maximal matches
• Uses VISTA browser – sequence alignment visualization tool
• Allows easy visualization of areas with high similarity.
• Visualization is scalable – allows you to zoom in/out.
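A similarity plot of this kind can be approximated by computing percent identity in a sliding window over a gapped pairwise alignment. The sketch below only illustrates the idea and is not VISTA's implementation; the function name, the default window of 100 columns, and the choice to count gap columns as mismatches are assumptions.

```python
from typing import List, Tuple


def window_identity(aln_a: str, aln_b: str,
                    window: int = 100, step: int = 1) -> List[Tuple[int, float]]:
    """Return (window start, percent identity) points along a pairwise alignment.

    aln_a and aln_b are the two rows of an alignment, with '-' for gaps.
    Assumption: a column counts as a match only if both rows carry the same
    non-gap character.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("alignment rows must have equal length")
    matches = [int(x == y and x != "-")
               for x, y in zip(aln_a.upper(), aln_b.upper())]
    if not matches:
        return []
    points = []
    last_start = max(len(matches) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = matches[start:start + window]
        points.append((start, 100.0 * sum(chunk) / len(chunk)))
    return points
```

Plotting the returned points with position on the x-axis and percent identity on the y-axis gives a curve whose peaks mark candidate conserved regions; a larger window smooths the curve at the cost of resolution.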
![Page 18: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/18.jpg)
![Page 19: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/19.jpg)
![Page 20: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/20.jpg)
Gene: CARP – cardiac ankyrin repeat protein
![Page 21: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/21.jpg)
There are many genomic alignment and visualization tools:
• BLASTZ/PipMaker : http://bio.cse.psu.edu/
• AVID/VISTA: http://www-gsd.lbl.gov/vista/
• LAGAN/Multi-LAGAN: http://lagan.stanford.edu
• AVID: http://baboon.math.berkeley.edu/mAVID
• BLAT: http://www.genome.ucsc.edu/
• SSAHA: http://bioinfo.sarang.net/wiki/SSAHA
• CONREAL: http://conreal.niob.knaw.nl/
• MUMmer: http://www.tigr.org/software/mummer.
![Page 22: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/22.jpg)
Example output from PipMaker
![Page 23: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/23.jpg)
Genomic view of simple sequence categories
Q: What general patterns can be seen?
Q: Why do some of the factors correlate w/ gene density?
![Page 24: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/24.jpg)
Multi-species conservation
![Page 25: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/25.jpg)
Conserved Non-Coding Sequences
![Page 26: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/26.jpg)
What are those MCS?
• Regulatory
  – Transcription factor binding sites
  – miRNAs or miRNA target sites
  – Chromosome structure
  – Insulator sequences
• Structural
  – Replication
  – Recombination
  – Chromosome structure
![Page 27: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/27.jpg)
Between-proteome comparisons
Used to identify orthologs.
Protein alignments involving a search of one protein from species A against the proteome of a species B
Several different bioinformatic approaches have been used to make the comparison.
• High scoring reciprocal best hits.
• COGs (and KOGs)
• Genome-wide phylogenetic analysis
![Page 28: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/28.jpg)
Using High scoring reciprocal best hits
• High scoring reciprocal best hits with the same domain structure are most likely orthologs
  – share common ancestry
  – likely to have the same function
  – Function likely to be more essential (replication, etc)
  – Genes are not unique to either organism.
  – E-value should be <0.01 and alignment should stretch over >60% of each protein
• High scoring hits with slightly different domain structures may be orthologous, but it is difficult to tell due to common, conserved domains that have complicated histories
• Cluster analysis can help sort this out
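The reciprocal best hit criterion can be sketched in a few lines, assuming the search results for A-versus-B and B-versus-A have already been parsed into records. The `Hit` record, its field names, and the use of bit score to rank hits are assumptions for illustration; the E-value and coverage thresholds follow the slide above.

```python
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class Hit(NamedTuple):
    """One parsed search hit (hypothetical record layout)."""
    query: str
    subject: str
    evalue: float
    bitscore: float
    query_cov: float    # fraction of the query protein covered by the alignment
    subject_cov: float  # fraction of the subject protein covered


def best_hits(hits: Iterable[Hit], max_evalue: float = 0.01,
              min_cov: float = 0.6) -> Dict[str, str]:
    """Map each query to its highest-scoring subject among hits passing the filters."""
    best: Dict[str, Hit] = {}
    for h in hits:
        # Keep hits with E-value < 0.01 that cover > 60% of each protein.
        if h.evalue >= max_evalue or min(h.query_cov, h.subject_cov) <= min_cov:
            continue
        if h.query not in best or h.bitscore > best[h.query].bitscore:
            best[h.query] = h
    return {query: h.subject for query, h in best.items()}


def reciprocal_best_hits(a_vs_b: Iterable[Hit],
                         b_vs_a: Iterable[Hit]) -> Set[Tuple[str, str]]:
    """Pairs (a, b) where b is a's best hit in B and a is b's best hit in A."""
    ab = best_hits(a_vs_b)
    ba = best_hits(b_vs_a)
    return {(a, b) for a, b in ab.items() if ba.get(b) == a}
```

Pairs returned this way are candidate orthologs only; checking that both proteins have the same domain structure, as the slide recommends, is a separate step not shown here.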
![Page 29: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/29.jpg)
Worm v. yeast sequences

| Cut-off p-value | <e-10 | <e-20 | <e-50 | <e-100 |
|---|---|---|---|---|
| Total num seq groups | 1171 | 984 | 552 | 236 |
| Num groups w/ > 2 members | 560 | 442 | 230 | 79 |
| Num (%) of all (6217) yeast proteins in groups | 2697 (40) | 1848 (30) | 888 (14) | 330 (5) |
| Num (%) of all worm proteins in groups | 3653 (19) | 2497 (13) | 1094 (6) | 370 (2) |
![Page 30: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/30.jpg)
What is COG?
• The database of Clusters of Orthologous Groups of proteins (COGs) represents an attempt at a phylogenetic classification of the proteins encoded in complete genomes.
• Each COG group consists of individual orthologous proteins or orthologous sets of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain.
• http://www.ncbi.nlm.nih.gov/COG
![Page 31: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/31.jpg)
A shortcut for identifying orthologs: the genome-specific best hit (BeT)
• Given a gene from one genome, the gene from another genome with the highest sequence similarity (the BeT) is taken to be the ortholog.
![Page 32: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/32.jpg)
Algorithm of clustering orthologous groups (overview)
[Flowchart: Input protein sequences → All-against-all sequence comparison (gapped-BLAST) → paralogs → Graph of BeTs → Ortholog triangle → Merge triangles → Quality control → COG database]
![Page 33: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/33.jpg)
The ortholog triangle
[Figure: triangle with vertices A(a), B(b), and C(c)]
• Multiple alignment
• Comparing pairwise alignments of AC and AB, we deduce the alignment of BC.
• The calculated and deduced alignments of BC are then compared: if the two alignments are consistent, the BeT triangle is a triangle of orthologs and can initiate a new COG group.
![Page 34: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/34.jpg)
Algorithm – merging triangles
• Merge triangles that have a common side until no new ones can be joined.
A simple COG with two yeast paralogs: isoleucyl-tRNA synthetase
This detects the candidate orthologous sets.
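The triangle step can be sketched as follows, assuming the BeT graph is already available as a list of undirected gene pairs. This is a simplified illustration, not the COG pipeline itself: the input layout and function names are assumptions, and the paralog handling and quality control steps from the overview are omitted.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Tuple


def find_triangles(genome_of: Dict[str, str],
                   bet_edges: Iterable[Tuple[str, str]]) -> List[FrozenSet[str]]:
    """Find gene trios from three different genomes that are pairwise linked by BeTs.

    genome_of maps each gene to its genome. bet_edges are treated as
    undirected BeT relationships (an assumption of this sketch).
    """
    edges = {frozenset(e) for e in bet_edges}
    triangles = []
    for trio in combinations(sorted(genome_of), 3):
        if len({genome_of[g] for g in trio}) < 3:
            continue
        if all(frozenset(pair) in edges for pair in combinations(trio, 2)):
            triangles.append(frozenset(trio))
    return triangles


def merge_triangles(triangles: List[FrozenSet[str]]) -> List[List[str]]:
    """Join triangles that share a side (two genes) until none can be joined."""
    parent = list(range(len(triangles)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(triangles)), 2):
        if len(triangles[i] & triangles[j]) == 2:  # common side
            parent[find(i)] = find(j)

    groups: Dict[int, set] = {}
    for i, tri in enumerate(triangles):
        groups.setdefault(find(i), set()).update(tri)
    return sorted(sorted(group) for group in groups.values())
```

If one E. coli gene and one H. influenzae gene each form a triangle with two different yeast genes, the two triangles share the E. coli/H. influenzae side and merge into a single group with two yeast paralogs, the situation shown for isoleucyl-tRNA synthetase.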
![Page 35: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/35.jpg)
Functional and phylogenetic patterns
E, E. coli; H, H. influenzae; G, M. genitalium; P, M. pneumoniae; C, Synechocystis sp.; M, M. jannaschii; Y, S. cerevisiae.
![Page 36: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/36.jpg)
Phyletic patterns of COGs (2003)
• 74% of COGs show a scattered distribution, which reflects frequent lineage-specific gene loss and horizontal gene transfer in prokaryotic evolution.
~500 COGs
![Page 37: Genomic and comparative genomic analysis](https://reader033.fdocuments.in/reader033/viewer/2022061616/56814914550346895db64bef/html5/thumbnails/37.jpg)
Representation of the 7 analyzed eukaryotic species in KOGs
• KOG: eukaryotic orthologous groups