Comparative Genomics
Overview
I. Comparing genome sequences
• Concepts and terminology
• Methods
  - Whole-genome alignments
  - Quantifying evolutionary conservation (PhastCons, PhyloP)
  - Identifying conserved elements
• Available datasets at UCSC
II. Comparative analyses of function
• Evolutionary dynamics of gene regulation
• Case studies
• Insights into regulatory variation within and across species
Distribution of evolutionary constraint in the human genome
Lindblad-Toh et al. Nature 478:476 (2011)
• 4.2% of the genome is putatively constrained
• ~1 million putative regulatory elements
Goals of comparative genomics
• Infer the course of past evolution using statistical models of sequence evolution
• Identify sequence elements evolving more slowly or more rapidly than neutral
• Evaluate the precise degree of constraint on specific positions
• Predict the functional effects of nucleotide or amino acid mutations in constrained sequences
Vertebrate genomes available for comparative studies
[Figure: phylogeny of sequenced genomes with clade labels Primates, Mammals, Tetrapods, Vertebrates]
Commonly used (and misused) terms
Mutation vs. Substitution
• Mutations occur in individuals, segregate in populations
• Substitutions are mutations that have become fixed
• Mutations = within species; substitutions = between species
Conservation vs. Constraint
• Conservation = an observation of sequence similarity
• Constraint = a hypothesis about the effect of purifying selection
Homology, Orthology and Paralogy
• Homologous sequences = derived from a common ancestor
• Orthologous sequences = homologous sequences separated by a speciation event (e.g., human HOXA and mouse Hoxa)
• Paralogous sequences = homologous sequences separated by gene duplication (e.g., human HOXA and human HOXB)
Basic premises in comparative sequence analysis
Most mutations that affect function are eliminated by purifying selection
• Constrained elements have lower substitution rates than expected from the neutral rate
• Contingent on the effect of the mutation and degree of constraint on the function
• Manifests as sequence conservation, even among distant species
Beneficial mutations may be driven to fixation by positive selection
• May be detected as “faster-than-neutral” substitution rate
• Expected to be rare
Most sequence differences among genomes are neutral
• Involve substitutions with minimal or no functional impact
• Fixed by random genetic drift
• For neutral mutations, the fixation (substitution) rate is equal to the mutation rate
• Genomes become more dissimilar with greater phylogenetic distance
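The equality of the neutral fixation rate and the mutation rate follows from a standard population-genetic argument, sketched here for a diploid population. The symbols N (population size), μ (neutral mutation rate per site per generation) and k (substitution rate) are introduced for illustration and do not appear on the slides.

```latex
\begin{aligned}
\text{new neutral mutations per site per generation} &= 2N\mu \\
\text{probability that any one of them drifts to fixation} &= \frac{1}{2N} \\
k &= 2N\mu \cdot \frac{1}{2N} = \mu
\end{aligned}
```

Because N cancels, the expected neutral substitution rate does not depend on population size, which is why the neutral rate can serve as a baseline for detecting constraint.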
Phylogenies
Phylogenetic trees show two things:
• Evolutionary relationships among species or sequences: branching order
• Evolutionary distance (e.g., degree of similarity or divergence): branch length
[Figure: example tree with labels for an internal node, a terminal node, and a branch]
[Figure: species tree compared with gene tree]
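As a small illustration of how branch lengths encode evolutionary distance, the sketch below sums branch lengths along the path that joins two terminal nodes through their most recent common ancestor. The tree, the node names, and the branch lengths are hypothetical values chosen for the example, not measured data.

```python
# Hypothetical toy tree: child -> (parent, branch length in substitutions per site).
# Species names and branch lengths are illustrative, not measured values.
TREE = {
    "human": ("primates", 0.01),
    "chimp": ("primates", 0.01),
    "primates": ("root", 0.08),
    "mouse": ("root", 0.35),
}


def path_to_root(node):
    """Return [(node, cumulative distance from the start node), ...] up to the root."""
    path, dist = [(node, 0.0)], 0.0
    while node in TREE:
        parent, length = TREE[node]
        dist += length
        path.append((parent, dist))
        node = parent
    return path


def tree_distance(a, b):
    """Sum branch lengths on the path joining a and b via their common ancestor."""
    dist_from_a = dict(path_to_root(a))
    for ancestor, dist_from_b in path_to_root(b):
        if ancestor in dist_from_a:
            return dist_from_a[ancestor] + dist_from_b
    raise ValueError("nodes are not in the same tree")


print(tree_distance("human", "chimp"))
print(tree_distance("human", "mouse"))
```

The branching order is captured by the parent links, and the distance by the summed lengths, mirroring the two things a phylogenetic tree shows.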
Orthologs and paralogs in gene trees
Capra et al. 2013
[Figure: gene tree of HMGCS1 and HMGCS2. Labels: a duplication node separating the paralogs; orthologs within each of the two clades (1:1 orthologs); a 1:2 relationship involving human HMGCS1 and human HMGCS2]
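The ortholog/paralog distinction can be read off a gene tree by asking which kind of event sits at the most recent common ancestor of two genes. The sketch below assumes a tree whose internal nodes are already annotated as speciation or duplication; the node names and the tree shape are hypothetical and only loosely modeled on the HMGCS1/HMGCS2 example.

```python
# Hypothetical gene tree: node -> parent. Node names are illustrative only.
PARENT = {
    "human_HMGCS1": "spec_1", "mouse_Hmgcs1": "spec_1",
    "human_HMGCS2": "spec_2", "mouse_Hmgcs2": "spec_2",
    "spec_1": "dup", "spec_2": "dup",
}
# Event assumed at each internal node (an annotation supplied by hand here).
EVENT = {"spec_1": "speciation", "spec_2": "speciation", "dup": "duplication"}


def ancestors(node):
    """List the ancestors of a node, nearest first."""
    out = []
    while node in PARENT:
        node = PARENT[node]
        out.append(node)
    return out


def relationship(gene_a, gene_b):
    """Classify two genes by the event at their most recent common ancestor."""
    ancestors_b = set(ancestors(gene_b))
    for node in ancestors(gene_a):
        if node in ancestors_b:
            return "orthologs" if EVENT[node] == "speciation" else "paralogs"
    raise ValueError("genes do not share an ancestor in this tree")


print(relationship("human_HMGCS1", "mouse_Hmgcs1"))
print(relationship("human_HMGCS1", "human_HMGCS2"))
```

In this toy tree, genes joined at a speciation node are reported as orthologs and genes joined at the duplication node as paralogs, matching the definitions given earlier.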
Ortholog assignments at Ensembl
Steps in sequence comparisons
Sequence alignment
• Global vs. local
• Whole-genome vs. genome segments (e.g., genes)
• Identify sites that are homologous (not necessarily identical)
Measure similarity and divergence of sequences
• Sequence similarity: level of conservation
• Rates of change among sequences: divergence
Infer degree of evolutionary constraint
• Are the sequences more conserved than expected from neutral evolution?
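The "measure similarity and divergence" step can be sketched for a pair of already-aligned sequences. The example computes the proportion of differing sites (p-distance) and applies the Jukes-Cantor correction for multiple substitutions at the same site. Jukes-Cantor is used here only because it is the simplest substitution model; the slides do not say which model the author uses, and the sequences are made up.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of aligned, ungapped sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped aligned sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor_distance(p):
    """Estimated substitutions per site under the Jukes-Cantor model."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Made-up aligned sequences differing at one of ten sites.
p = p_distance("ACGTACGTAC", "ACGTTCGTAC")
print("identity:", 1.0 - p)
print("p-distance:", p)
print("JC distance:", jukes_cantor_distance(p))
```

The corrected distance is always at least as large as the observed p-distance, because some substitutions are hidden by later changes at the same site.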
Rates of sequence change are estimated using models of the substitution process
[Figure: transition probabilities of the substitution model, shown alongside the phylogeny]
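To make "transition probabilities" concrete, the sketch below computes them for the Jukes-Cantor (JC69) model, in which every base changes to each other base at the same rate. This is a stand-in chosen for simplicity; the model shown on the original slide is not recoverable from the text, and the parameter names `rate` and `t` are assumptions of this example.

```python
import math

BASES = "ACGT"


def jc69_transition_matrix(rate, t):
    """P[x][y] = probability that base x is base y after time t under JC69.

    `rate` is the substitution rate from a base to each specific other base.
    """
    decay = math.exp(-4.0 * rate * t)
    same = 0.25 + 0.75 * decay   # probability the base is unchanged
    diff = 0.25 - 0.25 * decay   # probability of each specific change
    return {x: {y: (same if x == y else diff) for y in BASES} for x in BASES}


short_branch = jc69_transition_matrix(rate=0.1, t=0.5)
long_branch = jc69_transition_matrix(rate=0.1, t=50.0)
print(short_branch["A"])
print(long_branch["A"])
```

On a short branch most of the probability stays on the original base; on a very long branch each base approaches probability 0.25, which is the sense in which sequences lose similarity with distance.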
Substitution rates are calculated for each lineage in a sequence phylogeny
Conserved sequences are identified by local reductions in substitution rate
[Figure: substitution rate along aligned positions, comparing the local rate with the neutral rate]
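The idea of a local reduction in substitution rate can be illustrated with a deliberately crude sliding-window scan. The sketch treats the fraction of variable alignment columns in a window as a proxy for the local rate and flags windows that fall well below an assumed neutral value. This is not how PhastCons or PhyloP work; the alignment, the `neutral_rate` value, and the `ratio` threshold are all hypothetical.

```python
def variable_columns(alignment):
    """Return 1 for each alignment column with more than one distinct base, else 0."""
    return [1 if len(set(column)) > 1 else 0 for column in zip(*alignment)]


def conserved_windows(alignment, window, neutral_rate, ratio=0.5):
    """Flag windows whose fraction of variable columns is below ratio * neutral_rate."""
    variable = variable_columns(alignment)
    hits = []
    for start in range(len(variable) - window + 1):
        local_rate = sum(variable[start:start + window]) / window
        if local_rate < ratio * neutral_rate:
            hits.append((start, start + window, local_rate))
    return hits


# Made-up three-species alignment: the first seven columns are invariant.
ALIGNMENT = [
    "ACGTACGTACGT",
    "ACGTACGAATGA",
    "ACGTACGTCCTT",
]

for start, end, rate in conserved_windows(ALIGNMENT, window=4, neutral_rate=0.5):
    print(start, end, rate)
```

A model-based method replaces the column count with per-lineage substitution rates on the phylogeny, but the comparison of a local rate against a neutral expectation is the same.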
Tools for quantifying evolutionary conservation across genomes
Alignment: Multiz
• Generates multiple species alignment relative to a base genome
• Constructed from pairwise alignment of individual genomes to reference
• 46-way and 100-way alignment to hg19; 30-way to mm9; 60-way to mm10
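Multiz alignments are distributed by UCSC in MAF (multiple alignment format), where each block starts with an `a` line and each aligned sequence is an `s` line. The sketch below is a minimal reader for such blocks, assuming the standard `s` line fields (source, start, size, strand, source size, aligned text). The sample block is invented for illustration and is not taken from a real alignment.

```python
# Invented sample block; coordinates and sequences are illustrative only.
SAMPLE_MAF = """##maf version=1
a score=1000.0
s hg19.chr1 1000 10 + 249250621 ACGTACGTAC
s mm10.chr4 2000 9 - 156508116 ACGT-CGTAC

"""


def parse_maf(text):
    """Parse MAF text into a list of blocks; each block is a list of row dicts."""
    blocks, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "a" or line.startswith("a "):
            current = []                 # a new alignment block begins
            blocks.append(current)
        elif line.startswith("s ") and current is not None:
            _, src, start, size, strand, src_size, seq = line.split()
            current.append({
                "src": src,              # e.g. assembly.chromosome
                "start": int(start),     # zero-based start on the given strand
                "size": int(size),       # number of non-gap bases in this row
                "strand": strand,
                "src_size": int(src_size),
                "text": seq,             # aligned text, '-' marks gaps
            })
        elif not line:
            current = None               # a blank line ends the block
    return blocks


blocks = parse_maf(SAMPLE_MAF)
for row in blocks[0]:
    print(row["src"], row["start"], row["text"])
```

The first row of each block is the base (reference) genome, which is why coordinates in downstream conservation tracks are expressed relative to it.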
100-way Multiz alignment in hg19
Green = level of sequence similarity at each site
Conservation of synteny: “net” alignments
• Conservation of genome segments
• Order and orientation of genes and regulatory sequences
• Synteny is frequently conserved on megabase scales
Tools for quantifying evolutionary conservation across genomes
PhastCons
• Estimates the probability that a nucleotide belongs to a conserved element
• Sensitive to ‘runs’ of conserved sites, so effective for identifying conserved blocks
• For hg19, elements are calculated at three phylogenetic scopes (Vertebrate, Placental Mammal, Primate)
PhyloP
• Measures conservation independently at individual positions
• Provides per-base conservation scores (-log P value under the hypothesis of neutrality)
• Positive scores suggest constraint; negative scores suggest accelerated evolution
Identifying conserved elements: PhastCons
PhastCons scores
PhastCons elements
lod score = log probability under conserved model - log probability under neutral model
Score: normalized lod score on 0-1000 scale
Use scores to rank elements by estimated constraint
Example element: lod: 882; Score: 694
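The lod definition above amounts to a difference of two log probabilities, which can then be used to rank elements. The sketch below applies it to invented log-probability values; the element names and numbers are hypothetical. The 0-1000 normalization is not reproduced because the slides do not give its formula.

```python
def lod_score(logp_conserved, logp_neutral):
    """Log-odds score: log probability under the conserved model minus the neutral model."""
    return logp_conserved - logp_neutral


# Hypothetical elements: name -> (log P under conserved model, log P under neutral model).
ELEMENTS = {
    "element_a": (-120.0, -180.0),
    "element_b": (-300.0, -310.0),
    "element_c": (-90.0, -250.0),
}

# Rank elements from most to least strongly supported by the conserved model.
ranked = sorted(ELEMENTS, key=lambda name: lod_score(*ELEMENTS[name]), reverse=True)
for name in ranked:
    print(name, lod_score(*ELEMENTS[name]))
```

A larger lod means the alignment is better explained by the conserved model than by the neutral one, so sorting by lod orders elements by estimated constraint.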
PhastCons elements estimated at 3 phylogenetic scopes
[Figure tracks: Primate, Placental, Vertebrate]
Level of conservation decays with increasing evolutionary distance
PhyloP: measuring basewise conservation
PhyloP scores
• Scores are calculated independently for each base
• Scores are -log P values under hypothesis of neutral evolution
• Positive scores = constraint
• Negative scores = acceleration
Per-site phyloP conservation scores
4.49 1.77 -0.96
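The sign convention can be made explicit with a small helper that turns a per-site score into a direction and an approximate P value. The sketch assumes the logarithm is base 10, which the slides do not state, and the function name is hypothetical. The three scores are the example values shown above.

```python
def interpret_phylop(score):
    """Return (label, p_value) for a per-site score, assuming scores are -log10 P values."""
    p_value = 10.0 ** (-abs(score))  # base-10 logarithm is an assumption of this sketch
    if score > 0:
        label = "constraint"
    elif score < 0:
        label = "acceleration"
    else:
        label = "neutral"
    return label, p_value


for score in (4.49, 1.77, -0.96):
    label, p_value = interpret_phylop(score)
    print(score, label, p_value)
```

The magnitude of the score carries the strength of evidence against neutrality, while the sign carries the direction of the departure.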
Use PhastCons to identify conserved elements
Use phyloP to evaluate individual sites within elements
Accessing conservation data
Multiple genome alignments and conservation metrics are calculated independently for each reference genome
Orthologous region in mouse:
30-way multiz alignment
Conservation identifies critical binding sites in regulatory elements
[Figure: genome browser view with regulatory information (ENCODE) tracks above conservation tracks]
Important binding sites and variants that affect function will be here