May 3rd, 2012 IRMACS 10900 Facilitator: Richard Bruskiewich Adjunct Professor, MBB
NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator:...
-
Upload
daphne-beesley -
Category
Documents
-
view
212 -
download
0
Transcript of NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator:...
NGS Bioinformatics Workshop1.5 Genome Annotation
April 4th, 2012IRMACS 10900, SFU
Facilitator: Richard BruskiewichAdjunct Professor, MBB
Acknowledgment:This week’s slides partly courtesy of Professor Fiona Brinkman, MBB,
with material from Wyeth W. Wasserman and Shannan Ho Sui
OverviewGeneral observations about genomesRepeat maskingGene findingGene regulation and promoter analysis
SOME GENERAL OBSERVATIONS ABOUT GENOMES
NGS Bioinformatics Workshop - 1.5 Genome Annotation
3
General Variables of GenomesProkaryote versus Eukaryote versus OrganelleGenome size:
Number of chromosomesNumber of base pairsNumber of genes
GC/AT relative contentRepeat contentGenome duplications and polyploidyGene content
See: Genomes, 2nd edition Terence A Brown. ISBN-10: 0-471-25046-5See NCBI Bookshelve: http://www.ncbi.nlm.nih.gov/books/NBK21128/
Eukaryote versus Prokaryote Genomes
Eukaryote Prokaryote
Size
Large (10 Mb – 100,000 Mb)
There is not generally a relationship between organism complexity and its genome size (many plants have larger genomes than human!)
Generally small (<10 Mb; most < 5Mb)
Complexity (as measured by # of genes and metabolism) generally proportional to genome size
Content Most DNA is non-coding DNA is “coding gene dense”
Telomeres/ Centromeres
Present (Linear DNA)
Circular DNA, doesn't need telomeres
Don’t have mitosis, hence, no centromeres.
Number of chromosomes
More than one, (often) including those discriminating sexual identity
Often one, sometimes more, -but plasmids, not true chromosome.
Chromatin Histone bound (which serves as a genome regulation point)
No histones
Uses supercoiling to pack genome
Eukaryote versus Prokaryote Genomes
Eukaryote Prokaryote
Genes
Often have introns
Intraspecific gene order and number generally relatively stable
many non-coding (RNA) genes
There is NOT generally a relationship between organism complexity and gene number
No introns
Gene order and number may vary between strains of a species
Gene regulation
Promoters, often with distal long range enhancers/silencers, MARS, transcriptional domains
Generally mono-cistronic
Promoters
Enhancers/silencers rare
Genes often regulated as polycistronic operons
Repetitive sequences Generally highly repetitive with genome wide
families from transposable element propagation
Generally few repeated sequences
Relatively few transposons
Organelle (subgenomes)
Mitochondrial (all)
chloroplasts (in plants) Absent
Genome SizePhysical:
Amount of DNA / number of base pairsNumber of chromosomes/linkage groupsInformation resources:
NCBI: http://www.ncbi.nlm.nih.gov/genomeAnimals: http://www.genomesize.com/Plants: http://data.kew.org/cvalues/Fungi: http://www.zbi.ee/fungal-genomesize/
Genetic: Number of genes in the genome
Gregory TR. 2002. Genome size and developmental complexity. Genetica. May;115(1):131-46.
Size of Organelle GenomesSpecies Type of organism Genome size (kb)
Mitochondrial genomes
Plasmodium falciparum Protozoan (malaria parasite) 6
Chlamydomonas reinhardtii Green alga 16
Mus musculus Vertebrate (mouse) 16
Homo sapiens Vertebrate (human) 17
Metridium senile Invertebrate (sea anemone) 17
Drosophila melanogaster Invertebrate (fruit fly) 19
Chondrus crispus Red alga 26
Aspergillus nidulans Ascomycete fungus 33
Reclinomonas americana Protozoa 69
Saccharomyces cerevisiae Yeast 75
Suillus grisellus Basidiomycete fungus 121
Brassica oleracea Flowering plant (cabbage) 160
Arabidopsis thaliana Flowering plant (vetch) 367
Zea mays Flowering plant (maize) 570
Cucumis melo Flowering plant (melon) 2500
Chloroplast genomes
Pisum sativum Flowering plant (pea) 120
Marchantia polymorpha Liverwort 121
Oryza sativa Flowering plant (rice) 136
Nicotiana tabacum Flowering plant (tobacco) 156
Chlamydomonas reinhardtii Green alga 195
http://www.ncbi.nlm.nih.gov/books/NBK21120/table/A5511
Size of Prokaryote GenomesSpecies DNA molecules Size (Mb) Number of genes
Escherichia coli K-12 One circular molecule 4.639 4397
Vibrio cholerae El Tor N16961 Two circular molecules
Main chromosome 2.961 2770
Megaplasmid 1.073 1115Deinococcus radiodurans R1 Four circular molecules
Chromosome 1 2.649 2633 Chromosome 2 0.412 369 Megaplasmid 0.177 145 Plasmid 0.046 40
Borrelia burgdorferi B31 seven or eight circular molecules, 11 linear molecules
Linear chromosome 0.911 853
Circular plasmid cp9 0.009 12
Circular plasmid cp26 0.026 29
Circular plasmid cp32* 0.032 Not known
Linear plasmid lp17 0.017 25
Linear plasmid lp25 0.024 32
Linear plasmid lp28-1 0.027 32
Linear plasmid lp28-2 0.030 34
Linear plasmid lp28-3 0.029 41
Linear plasmid lp28-4 0.027 43
Linear plasmid lp36 0.037 54
Linear plasmid lp38 0.039 52
Linear plasmid lp54 0.054 76
Linear plasmid lp56 0.056 Not known
http://www.ncbi.nlm.nih.gov/books/NBK21120/table/A5524
Size of Eukaryote GenomesSpecies Genome size (Mb)
Fungi
Saccharomyces cerevisiae 12.1
Aspergillus nidulans 25.4
Protozoa
Tetrahymena pyriformis 190
Invertebrates
Caenorhabditis elegans 97
Drosophila melanogaster 180
Bombyx mori (silkworm) 490
Strongylocentrotus purpuratus (sea urchin) 845
Locusta migratoria (locust) 5000
Vertebrates
Takifugu rubripes (pufferfish) 400
Homo sapiens 3200
Mus musculus (mouse) 3300
Plants
Arabidopsis thaliana (vetch) 125
Oryza sativa (rice) 430
Zea mays (maize) 2500
Pisum sativum (pea) 4800
Triticum aestivum (wheat) 16 000
Fritillaria assyriaca (fritillary) 120 000
http://www.ncbi.nlm.nih.gov/books/NBK21120/table/A5471
http://en.wikipedia.org/wiki/Genome_sizehttp://en.wikipedia.org/wiki/Genome#Comparison_of_different_genome_sizes
Number of GenesSpecies Ploidy Cs Size (Mb) No. Genes
Saccharomyces cerevisiae 2 16 12 6,281
Plasmodium falciparum 2 14 23 5,509
Caenorhabditis elegans 2 6 100 21,175
Drosophila melanogaster 2 6 139 15,016
Oryza sativa 2 12 410 30,294
Canis lupus familaris 2 39 2,445 24,044
Homo sapiens 2 24 3,100 36,036
Zea mays 2 10 2,046 42,000-56,000 (*)
Protopterus aethiopicus ? ? 130,000 ?
Paris japonica 8 40 150,000 ?
Polychaos dubium ? ? 670,000 ?
http://www.ncbi.nlm.nih.gov/genome
(*) Haberer et al., Structure and architecture of the maize genome. Plant Physiol. 2005 Dec139(4):1612-24
AT/GC content Regional variations correlates with genomic content
and function like transposable element distribution, gene density, gene regulation, methylation, etc.
Often introduces bias in sequencing processes (e.g. library yields, PCR amplification, NGS sequencing)
Species GC%Streptomyces coelicolor A3(2) 72Plasmodium falciparum 20Arabidopsis thaliana 36Saccharomyces cerevisiae 38Arabidopsis thaliana 36Homo sapiens 41 (35 – 60)
Romiguier et al. 2010. Contrasting GC-content dynamics across 33 mammalian genomes: Relationship with life-history traits and chromosome sizes. Genome Res. 20: 1001-1009
Repeat Content Large genomes generally reflect evolutionary expansion of
large families of repetitive DNA (by RNA/DNA transposon amplification/insertion, genetic recombination)
Repeats drive genome mutational processes: Recombination resulting in insertion, deletion, translocation,
segmental duplication of DNA Insertional mutagenesis, possibly including de novo creation of genes Insert novel regulatory signals
Repeats generally confound genome sequence assembly (especially for NGS, due to short reads). Gene annotation can also be problematic as transposons mimic gene structures.
Jurka et al. 2007. Repetitive sequences in complex genomes: structure and evolution. Annu Rev Genomics Hum Genet. 2007;8:241-59.
Genome Duplications/PolyploidySegmental duplications (i.e. by recombination)
Tandem: direct and invertedWhole genome duplication & loss, e.g.
Ancestral vertebrate: 2 roundso HOX gene clusters…
Polyploidy - ~70% of all angiospermsGenomic hybridization (allopolyploids)Can lead to immediate and extensive changes in
gene expressionMapping of homeologous gene loci can be tricky
Dehal P and Boore JL.2005. Two Rounds of Whole Genome Duplication in the Ancestral Vertebrate. PLoS Biol 3(10) : e314. doi:10.1371
Adams and Wendel. 2005. Polyploidy and genome evolution in plants. Curr. Opin. Plant Biol. 8(2):135–141
Case Study: Rice 10 duplicated blocks identified on all 12 chromosomes
of Oryza sativa subspecies indica, that contained 47% of the total predicted genes.
Possible genome duplication occurred ~70 million years ago, supporting the polyploidy origin of rice.
Additional segmental duplication identified involving chromosomes 11 and 12, ~5 million years ago.
Following the duplications, there have been large-scale chromosomal rearrangements and deletions. About 30–65% of duplicated genes were lost shortly after the duplications, leading to a rapid diploidization.
Wang et al. 2005. Duplication and DNA segmental loss in the rice genome:implications for diploidization New Phytologist 165: 937–946
Gene Content
http://www.ncbi.nlm.nih.gov/books/NBK21120/figure/A5501
The Bottom Line
All of these genomic variables:Type of organism: i.e. prokaryote versus eukaryoteGenome sizeGC/AT relative contentRepeat contentGenome duplications and polyploidyGene content
are important factors that can drive the strategy, expected outcome and efficacy of genome sequence assembly and annotation.
GENOME REPEAT MASKING
NGS Bioinformatics Workshop - 1.5 Genome Annotation
19
Genomic (DNA) Sequence Repeat Masking
Classic approach: search against repeat libraries RepeatMasker
http://www.repeatmasker.org/Uses a previously compiled library of repeat familiesUses (user configured) external sequence search programComputationally intensive but……the project web site also provides “pre-masked” genomic
data for many completed genomes, complete with some statistical characterization.
More Repeat Masking … de novo identification and classification:
RECON: http://www.genetics.wustl.edu/eddy/reconRepeatGluer: http://nbcr.sdsc.edu/euler/PILER: http://www.drive5.com/piler
Repeat databases:Repbase: http://www.girinst.org/repbase/index.htmlplants: http://plantrepeats.plantbiology.msu.edu/
Related algorithms:“probability clouds”
Gu et al. 2008. Identification of repeat structure in large genomes using repeat probability clouds. Anal Biochem. 380(1): 77–83
GENE FINDING…OR “WHAT IS A GENE?”
NGS Bioinformatics Workshop - 1.5 Genome Annotation
23
24
Objectives
Review of differences in prokaryotic and eukaryotic gene organization.
Understand consequences and challenges for gene finding algorithms for Prokaryotes and Eukaryotes.
Appreciate HMM as powerful tool (in many areas of computational biology!)
Be reminded that not all genes encode proteins and predictions of such genes have their own computational challenges.
Gene Annotation QuestionsWhich genes are present?How did they get there (evolution)?Are the genes present in more than one copy?Which genes are not there that we would
expect to be present?What order are the genes in, and does this
have any significance?How similar is the genome of one organism to
that of another?
26
Why Gene-finding?
• Whole-genome annotation– Genome sequence does not give you list of all
genes• Fully characterizing Yfg (“your favourite gene”)
– example: A disease is associated with a SNP in a location in the human genome. BLAST finds similarity to a protein coding gene in the area, but its only similar to part of the whole protein. What’s the whole gene?
After completing the human genome we faced 3 Gigabytes of this
Not immediately apparent where the genes are…
What is a gene?
29
A distinct functional unit encoded (at minimum transcribed) in the genome
Eukaryotes: The region that is transcribed – the mRNA defines it
Prokaryotes: The coding region – because multiple genes may be on one transcript as an operon
Both eukaryotes and prokaryotes: non-coding RNAs are treated relatively the same
30
Prokaryotes• High gene density• mRNA transcription-
translation is coupled
• Genes are usually contiguous stretches of coding DNA
• mRNAs often polycistronic gene
____________________
• Low gene density• mRNA transcribed then
transported to cytoplasm for translation
• Genes’ coding DNA often split by non-coding introns
• mRNAs are generally monocistronic gene
___________
Eukaryotes
Raw Biological Materials
Great real-time Transcription-Translation video: http://www.youtube.com/watch?v=41_Ne5mS2ls
transcript
How many genes in human genome?• 2000: must be at least 100,000 (Rice has ~40,000,
C. elegans has ~19,000)
• 2001: only 35,000?
• 2005: Ensembl NCBI 35 release: 22,218 genes (33,869 transcripts)
• 2006: Ensembl NCBI 36 release: 23,710 protein coding genes, plus 4421 RNA genes (48,851 transcripts)
• Today: Ensembl 64 release, Sept 2011, is 20,900 coding genes + 14,266 RNA genes - but with alternative splicing these produce likely many more…
How many genes in human genome?• 2000: must be at least 100,000 (Rice has ~40,000,
C. elegans has ~19,000)
• 2001: only 35,000?
• 2005: Ensembl NCBI 35 release: 22,218 genes (33,869 transcripts)
• 2006: Ensembl NCBI 36 release: 23,710 protein coding genes, plus 4421 RNA genes (48,851 transcripts)
• Today: Ensembl 64 release, Sept 2011, is 20,900 coding genes + 14,266 RNA genes - but with alternative splicing these produce likely >100,000 proteins (178,191 currently annotated in Ensembl)
1 gene in how many basepairs?...a. 1:10,000,000b. 1:1,000,000c. 1:100,000 roughly for humand. 1:10,000 (1:5000 for C. elegans)e. 1:1000 roughly for most bacteriaf. 1:100g. 1:10
33
Gene Density
34
Approaches to Finding Genes
• AB INITIO (‘from the beginning’):
– search by content: find genes by statistical properties that distinguish protein-coding DNA from non-coding DNA
– Search by signals/sites: splice donor/acceptor sites, promoters, etc.
• HOMOLOGY:– search by sequence similarity to homologous sequences
state-of-the-art gene finding combines these strategies and also uses other data like EST/RNA-seq/microarray data
35
Gene-finding in Prokaryotes:Easy? ….or not?
• ORF Finder– Open reading frame (ORF) from methionine
codon to first Stop codon– ORFs linked to BLAST– http://www.ncbi.nlm.nih.gov/gorf/gorf.html
Problem: All ORFs are not genes. Why? How can this be improved?
36
Gene Finding: Ab Initio Search by Content
• Protein code affects the statistical properties of a DNA sequence – some amino acids are used more frequently than others
(Leu more popular than Trp)– different numbers of codons for different amino acids
(Leu has 6, Trp has 1)– for a given amino acid, usually one codon is used more frequently than others
• this is termed codon preference • these preferences vary by species
37
Gene-finding in Prokaryotes:Improving predictions…
Common way to search by content• build Markov models of coding & noncoding regions
apply to ORFs or fixed-sized sequence windowsMarkov Model approaches: prokaryotic gene prediction• Glimmer
– http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi– http://cbcb.umd.edu/software/glimmer/– open source
• GeneMark– http://opal.biology.gatech.edu/GeneMark/– not open source but slightly more accurate
38Reproduced with permission,Chris Burge
i.e. If its sunny then the next weather state is not likely to become suddenly rainy.
If its cloudy then it is more likely the next weather state will be rainy.
39Reproduced with permission,Chris Burge
For gene prediction:
Certain bases occur after others in coding sequence versus non-coding sequence
What’s Hidden in a Hidden Markov Model?
• The HMM is a model of what actually happened (the true gene structure) but all you can see is the sequence
• Each nucleotide is not labeled with its state; it’s hidden
• Based on training data – which is critical!
Glimmer version 3
• annotates microbial genomes with a sensitivity of 99% and accuracy of >98%
• resolves overlapping genes• use appropriate gene dataset for training program• no annotation of rRNA and tRNA genes• employs interpolated Markov model
42NAR Vol.26 p1107-1115
43
GeneMark.hmm
• HMM plus ribosome binding site signals in primary nucleotide sequence
• training also required• Most accurate prokaryotic gene predictor (by a bit over
Glimmer)• GeneMarkS version - translation start prediction: 83.2% of
Bacillus subtilis genes 94.4% of Escherichia coli genes
44
Main issues with prokaryotic gene prediction
• Which ATG or GTG start to choose? Not that accurate• Errors in sequence frame shifts that lead to errors in gene
prediction• Very small genes either not predicted at all (most prokaryotic
genome projects don’t annotate any genes < 90 bp) or poorly predicted
• Non-coding RNA genes tend to be ignored!
• RNA-seq data being incorporated now…
45
RNA-seq
• Sequence cDNA from RNA using “next generation” sequencing
• Map sequence reads to ref genome
• Count number of sequence reads (depth of reads) for a given window of ref sequence
Rne (Rnase E) Upstream Region in P. aeruginosa
Rne (Rnase E) Upstream Region
• Upstream region of Rne in E. coli forms stem loop to regulate mRNA synthesis
• Region aligns reasonably well between spp.• Also predicted to encode a small non-coding RNA using QRNA
and RNAz
E. coli PAO1
Gene prediction in eukaryotes
• Overview of eukaryotic gene structure will illustrate the gene-finding problem
• Components described are some of those that gene-finding tools must model
• These signals are never 100% reliable because there are (almost) as many exceptions as rules
• Signals (and noise) vary with organism– 3% of human genome is “coding” vs. 25% in fugu
49
The problem with eukaryotes:the easy stuff?
The problem with eukaryotes:the hard stuff
• ~50% of human genes undergo alternative splicing
• Genes can be nested: overlapping on same or opposite strand or inside an intron
• Pseudogenes: “dead” genes
Fig 10.1 Baxevanis and Ouellette, 2000
Raw Biological Materials: Eukaryotes
Transcription Start
core promoter
TSS – transcription start site
. . .
Fig 11.25 from Introduction to Genetic Analysis, 7th ed. Griffiths et al. New York: W H Freeman & Co; c1999.http://www.ncbi.nlm.nih.gov/books/bv.fcgi?call=bv.View..ShowSection&rid=iga.figgrp.d1e46911
• ~70% of promoter regions have TATA box
End Modification
• 5’cap added at 5’ end of transcript• polyA-tail added at AATAAA consensus site
Splice site recognition
Logos from:http://genio.informatik.uni-stuttgart.de/GENIO/splice/
exon intron5’ (a.k.a. donor) 3’ (a.k.a. acceptor)
intron exon
The problem with splice signals
• ~0.5% of splice sites are non-canonical, i.e. not GT-AG
• ~5% of human genes may have a non-canonical splice site
• ~50% of human genes are alternatively spliced
Intron Phase
CCA
CAGAGG
exon intron exon
Phase 0:Phase 1:Phase 2:
A codon can be interrupted by an intron in one of three placesand still code for the same amino acid
Example CAG -> glutamine
Translation signals
Start AUG
Note: Bacteria also use GUG (~10% of starts) and UUG (rarely)
StopUAA, UAG, or UGA
Translation signals
• Translation Start: “usually” first AUG after 5’ cap
• But Kozak consensus sequence aids the selection so the 2nd or 3rd AUG might be used
• Translation Stop: first in-frame stop codon (simple!)
Gene-finding Strategies
• Similarity-based a.k.a. lookup method– genome:protein– genome:genome
• ab initio a.k.a. template method• Experimental
– microarrays– RNA Sequencing
Similarity – genome:protein
• blastx finds similarity between translated genomic region and known proteins
• Problem: on human chr22 only 50% of genes had similarity to known proteins when first annotated
Similarity – genome:genome
• Functional sequences will be conserved• Learn different things from different evolutionary
distances (signal:noise ratio)– coding sequences– regulatory sequences
• Compare mouse:human, pufferfish:human (Exofish)• Problems:
– you need something “similar”– won’t find precise boundaries
ab initio Gene-finding
• Based on intrinsic information• Coding statistics plus signal sensors• No prior knowledge of similar genes needed• Problem: gene models can only approach the
biological truth (recall eukaryotic gene structure variability)
Coding Statistics
• Exon/intron base composition differences indicate protein-coding regions
• Use frequency of occurrence of hexamers
Signal Sensors
• Model features recognized by transcriptional, splicing and translational machinery– Promoter elements– Splice sites– PolyA consensus sites– Start and stop codons
• Use pattern recognition methods as signal detectors– Weight matrices– Neural networks– Decision trees
What don’t (coding) gene finders do?
• Explicit alternative splicing models• Overlapping or nested genes• Non-protein-coding genes
Tools for finding genes…
Check out the Bioinformatics Links Directory at Bioinformatics.ca
http://bioinformatics.ca/links_directory/
Some tools for finding genes…
http://bioinformatics.ca/links_directory/category/dna/gene-prediction
Tools for eukaryotic gene prediction
• Ab Initio– genscan, fgenesh
• Homology-based prediction– Genewise (uses predicted proteins from one
genome/seq to drive exon finding in a second genome/seq, combined with splice-site model)
– Twinscan (builds two genome predictions together, using conserved DNA signal as one criterion)
Gene Finding in General: Training Sets
• ab initio gene-finders must be trained on real sequences
• Mostly single genes from specific groups of organisms (or the organism being studied)
• Bias of training set will affect a tool’s prediction accuracy
• Example: HMR195 has 195 single- or multi-exon murine genes with typical splice sites and confirmed by cDNA (Rogic et al, 2001)
GENSCAN Dissection
Reproduced with permission,Chris Burge
Reproduced with permission,Chris Burge
72
Genscan HSMM
• Circles and diamonds are states e.g. exon, phase 0 on plus strand
• Arrows are probabilities of transition between states
• Sequence you observe is generated from a series of states
Reproduced with permission, Chris Burge
Reproduced with permission,Chris Burge
74
Reproduced with permission,Chris Burge
75
Reproduced with permission,Chris Burge
76
Genscan HSMM
• Circles and diamonds are states e.g. exon, phase 0 on plus strand
• Arrows are probabilities of transition between states
• Sequence you observe is generated from a series of states
Reproduced with permission, Chris Burge
Signal Sensors
• Model features recognized by transcriptional, splicing and translational machinery– Start codon
• 12bp WMM (Weight Matrix Model); probabilities estimated from GenBank CDS feature annotations
– 5’ splice site• MDD (Maximal Depedence Decomposition) developed
by Burge and Karlin (1997) to allow for dependencies between adjacent and non-adjacent nucleotides
Coding Statistics
• Exon/intron base composition differences indicate protein-coding regions
• Use frequency of occurrence of hexamers• Model with 5th order HMMs
– Non-coding: homogeneous – Coding: heterogeneous
• different model, depending on whether it’s 1st, 2nd, 3rd codon position
The predicted gene structure is
the optimal combination of the aboveReproduced with permission,Chris Burge
GENSCAN
• Strengths– Works well from fly to human (and plants with
some tweaks)• Weaknesses
– Practical• Initial exons; HMMgene does (a bit) better• Gene boundaries; tends to join or split genes
– Models• Promoter and polyA site models• Transcription start sites (TSS)
Homology Based Approach
• Use evolutionary divergence as a tool• Based on the principal that introns and other
non-coding regions usually diverge much faster over time
• Helped by the fact that many splice patterns have been determined experimentally (cDNA)
• Fails when the divergence of distant sequences is too extreme (<50%; due to rapidly evolving proteins or large evolutionary distances)
GeneWise (Ewan Birney)• GeneWise aligns a protein sequence (or HMM) to
genomic DNA taking into account splicing information
See also “Exonerate”
February 2001 December 2002
The birth of comparative genomics…
Comparative Genefinders• Twinscan -based on Genscan (Korf, 2001)• ROSETTA – global alignment of orthologous
regions (Batzoglou et al, 2000)• SLAM (Alexandersson) and Doublescan
(Meyer)– jointly align while finding predicting genes– computationally expensive (extra search
dimension) and hard to parameterise– still effectively requires a full alignment
• SGP2 (Parra, 2003)– Combines ab initio gene prediction with TBLASTX
Paul Flicek et al. Genome Res. 2003: TWINSCAN pred (red), GENSCAN (green), and an aligned RefSeq transcript (blue). Yellow: low-complexity regions, black: mouse
alignments
Evaluation of Twinscan
Paul Flicek et al. (Genome Res. 2003): Accuracy of GENSCAN and TWINSCAN by the exact gene, exact exon, and coding nucleotide measures
Evaluation of Twinscan
Twinscan (N-SCAN) in UCSC genome browser
Region of seq1 (from HMR195 dataset) presentedTwinscan URL: http://ardor.wustl.edu
What about non-coding RNAs (ncRNAs)?
88
What are non-coding RNAs (ncRNAs)?
89
ncRNA genes make transcripts that function directly as RNA, rather than encoding proteins.– ribosomal RNA, transfer RNA, RNase P, (tmRNA)– E.coli translational regulatory RNAs, e.g. OxyS– Eukaryotes: snRNA, snoRNA (& Archaea)
Example predictor: QRNA - probabilistic model III
• RNA: pair stochastic context-free grammar – base stacking, loop length (MFOLD)
1. Identification of ncRNAs with known RNA familiesRFAM: a database of RNA families (Griffiths-Jones et al.,
2005)
2. Identification of novel ncRNAsObtain orthologous sequences using BLASTNMultiple sequence alignment using MAFFTncRNA prediction using RNAz and QRNAAnalyze overlapping predictions and filter false positivesNeed lab validation (RT-PCR, RNA-seq…)
Prediction of non-coding RNAs
1. Sean Eddy (HHMI) RFAM, QRNA http://selab.janelia.org/
2. Cenk Sahinalp (SFU) TaveRNA suite for structural analysis and interaction predictionhttp://compbio.cs.sfu.ca/
Analysis of non-coding RNAs: Examples of Research Leaders
Rne (Rnase E) Upstream Region in P. aeruginosa
Rne (Rnase E) Upstream Region
• Upstream region of Rne in E. coli forms stem loop to regulate mRNA synthesis
• Region aligns reasonably well between spp.• Also predicted to encode a small non-coding RNA using QRNA
and RNAz
E. coli PAO1
Summary
• Genes are complex structures, which are difficult to predict
• Different approaches to gene finding:– Ab Initio : i.e. GenScan (use appropriate training
dataset!)– Homology guided: i.e. GeneWise
Outstanding Issues
Most Gene finders cannot handle untranslated regions (UTRs)
~40% of human genes have non-coding 1st exons (UTRs)
Most gene finders cannot handle alternative splicing
Most gene finders cannot handle overlapping or nested genes
Bottom Line
Gene finding in eukaryotes is not solvedAccuracy of the best methods approaches 80%
at the exon level (90% at the nucleotide level) in coding-rich regions (much lower for whole genomes)
Gene predictions should always be verified by other means (cDNA sequencing RNA-seq/EST, BLAST search)
98
Whole Genome Annotation
• A lot of work has been done for you!
• NCBI Map Viewers• Ensembl• UCSC Genome Browser
Top 10 Future ChallengesEwan Birney, Chris Burge, Jim Fickett in Genome Technology
1. Precise, predictive model of transcription initiation and termination: ability to predict where and when transcription will occur in a genome
2. Precise, predictive model of RNA splicing/ alternative splicing: ability to predict the splicing pattern of any primary transcript in any tissue
3. Precise, quantitative models of signal transduction pathways: ability to predict cellular responses to external stimuli
4. Determining effective protein:DNA, protein:RNA and protein:protein recognition codes
5. Accurate ab initio protein structure prediction
5 minute break?
GENE REGULATION AND PROMOTER ANALYSIS
NGS Bioinformatics Workshop1.5 Genome Annotation
With material from Wyeth W. Wasserman and Shannan Ho Sui
ObjectivesNote: Focus on Eukaryotic cells and Polymerase II Promoters
• Overview of transcription– Structure of eukaryotic promoter– Laboratory methods for identifying binding motifs
• Prediction of transcription factor binding sites using binding profiles (“Discrimination”)– TFBS scanning – a great example of how bioinformatics
can be tweaked when we don’t fully understand the biology!
• Interrogation of sets of co-expressed genes to identify mediating transcription factors– TFBS Over-representation
• Detection of novel motifs (TFBS) over-represented in regulatory regions of co-expressed genes (“Discovery”)– Motif Discovery!
Transcription Simplified
TATATFBS
TF Pol-II
Three-step Process:1. Transcription factor (TF; a
protein) binds to transcription factor binding site (TFBS; in DNA)
2. TF catalyzes recruitment of polII complex
3. Production of RNA from transcription start site (TSS)
TSS
Anatomy of Transcriptional RegulationWARNING: Terms vary widely in meaning between scientists
• Core Promoter – Sufficient for initiation of transcription; orientation dependent (i.e. immediately upstream, functions in only one orientation)
• TSR – transcription start region– Refers to a region rather than specific start site (TSS)
• TFBS – single transcription factor binding site• Regulatory Regions
• Proximal/Distal – vague reference to distance from TSR• May be positive (enhancing) or negative (repressing)• Orientation independent and location independent (generally)• Modules – Sets of TFBS within a region that function together
EXONTFBS TATA
TSR
TFBSTFBS
Core Promoter/Initiation Region (Inr)
TFBSTFBS
Distal Regulatory Region Proximal Regulatory Region
EXONTFBS TFBS
Distal Regulatory region
105
Complexity in Transcription
Distal enhancer
Distal enhancerProximal enhancer Core Promoter
Chromatin
Refer again to Transcription in video: http://www.youtube.com/watch?v=41_Ne5mS2ls
106
Transcription Factors
• Bind to enhancers found in vicinity of gene
• These proteins promote transcription by (i) binding DNA and (ii) activating transcription– These two functions usually reside on separate structural domains of
the transcription factor protein– Activation occurs via interactions with other transcription factors
and /or RNA polymerase
Structural motifs within different types of transcription factors.
See also http://www.youtube.com/watch?v=7EkSBBDQmpE
DNA Binding Domains – Conserved Motifs
Zinc Finger Motif
Helix-turn-Helix MotifHelix-loop-Helix
Motif
Leucine Zipper Motif
Lab Discovery of TF Binding Sites
LUCIFERASE
LUCIFERASE
LUCIFERASE
LUCIFERASE
LUCIFERASE
LUCIFERASE
LUCIFERASE
Reporter Gene Activity
mutation
0% 100%
Identify functional regulatory region within a sequence and delineate specific TFBS through mutagenesis (and in vitro binding studies)
EMSA/Gel Shift Assays to Identify Binding Proteins
http://www.biomedcentral.com/content/figures/1741-7015-4-28-8.jpg
TF + DNA
DNA
High-throughput Methods
• SELEX– mix random ds DNA oligonucleotides with TF protein,
recover TF-DNA complexes and sequence DNA
• Protein Binding Arrays (UniProbe Database)– prepare arrays with ds DNA attached, label protein with a
fluorescent mark and observe DNA bound by protein
• ChIP (Chromatin Immunoprecipitation)– covalently link proteins to DNA in cell, shear DNA, recover
protein-DNA complexes and identify DNA (PCR, array or sequencing)
Representing Binding Sites for a TF
• A set of sites represented as a consensus• VDRTWRWWSHD (IUPAC degenerate DNA)
• A single site• AAGTTAATGA
Set of binding
sitesAAGTTAATGACAGTTAATAAGAGTTAAACACAGTTAATTAGAGTTAATAACAGTTATTCAGAGTTAATAACAGTTAATCAAGATTAAAGAAAGTTAACGAAGGTTAACGAATGTTGATGAAAGTTAATGAAAGTTAACGAAAATTAATGAGAGTTAATGAAAGTTAATCAAAGTTGATGAAAATTAATGAATGTTAATGAAAGTAAATGAAAGTTAATGAAAGTTAATGAAAATTAATGAAAGTTAATGAAAGTTAATGAAAGTTAATGAAAGTTAATGA
Set of binding
sitesAAGTTAATGACAGTTAATAAGAGTTAAACACAGTTAATTAGAGTTAATAACAGTTATTCAGAGTTAATAACAGTTAATCAAGATTAAAGAAAGTTAACGAAGGTTAACGAATGTTGATGAAAGTTAATGAAAGTTAACGAAAATTAATGAGAGTTAATGAAAGTTAATCAAAGTTGATGAAAATTAATGAATGTTAATGAAAGTAAATGAAAGTTAATGAAAGTTAATGAAAATTAATGAAAGTTAATGAAAGTTAATGAAAGTTAATGAAAGTTAATGA
IUPAC (International Union of Pure and Applied Chemistry) nomenclature
• A way to represent ambiguity when more than one value is possible at a particular position
A = adenineC = cytosineG = guanineT = thymineR = G or A (purine)Y = T or C (pyrimidine)K = G or T (keto)M = A or C (amino)S = G or C (strong bonds)
W = A or T (weak bonds)B = G, T or C (all but A)D = G, A or T (all but C)H = A, C or T (all but G)V = G, C or A (all but T)N = A, G, C or T (any)
For DNA
Representing Binding Sites for a TF
A 14 16 4 0 1 19 20 1 4 13 4 4 13 12 3C 3 0 0 0 0 0 0 0 7 3 1 0 3 1 12G 4 3 17 0 0 2 0 0 9 1 3 0 5 2 2T 0 2 0 21 20 0 1 20 1 4 13 17 0 6 4
• A matrix describing a set of sites:
• A set of sites represented as a consensus• VDRTWRWWSHD (IUPAC degenerate DNA)
• A single site• AAGTTAATGA
Set of binding
sitesAAGTTAATGACAGTTAATAAGAGTTAAACACAGTTAATTAGAGTTAATAACAGTTATTCAGAGTTAATAACAGTTAATCAAGATTAAAGAAAGTTAACGAAGGTTAACGAATGTTGATGAAAGTTAATGAAAGTTAACGAAAATTAATGAGAGTTAATGAAAGTTAATCAAAGTTGATGAAAATTAATGAATGTTAATGAAAGTAAATGAAAGTTAATGAAAGTTAATGAAAATTAATGAAAGTTAATGAAAGTTAATGAAAGTTAATGAAAGTTAATGA
Set of binding
sitesAAGTTAATGACAGTTAATAAGAGTTAAACACAGTTAATTAGAGTTAATAACAGTTATTCAGAGTTAATAACAGTTAATCAAGATTAAAGAAAGTTAACGAAGGTTAACGAATGTTGATGAAAGTTAATGAAAGTTAACGAAAATTAATGAGAGTTAATGAAAGTTAATCAAAGTTGATGAAAATTAATGAATGTTAATGAAAGTAAATGAAAGTTAATGAAAGTTAATGAAAATTAATGAAAGTTAATGAAAGTTAATGAAAGTTAATGAAAGTTAATGA
Logo – A graphical representation of frequency matrix. Y-axis is information content , which reflects the strength of the pattern in each column of the matrix
113
TGCTG = 0.9
Conversion of PFMs to Position Specific Scoring Matrices (PSSM)
PSSMs also known as Position Weight Matrices(PWMs)
Add the following features to the matrix profile:1. Correct for nucleotide frequencies in genome2. Weight for the confidence (depth) in the pattern3. Convert to log-scale probability for easy arithmetic
A 5 0 1 0 0C 0 2 2 4 0G 0 3 1 0 4T 0 0 1 1 1
A 1.6 -1.7 -0.2 -1.7 -1.7 C -1.7 0.5 0.5 1.3 -1.7 G -1.7 1.0 -0.2 -1.7 1.3T -1.7 -1.7 -0.2 -0.2 -0.2
pfm pssm
Log2 ( )f(b,i) + s(n)p(b)
f(b,i) = frequency of base b at position is(n) = pseudocount correction (optionally applied to small samples of binding sites)p(b) = background probability of base b
PSSM Scoring Scales
• Raw scores• Sum of values from indicated cells of the matrix
• Relative Scores (most common)• Normalize the scores to range of 0-1 or 0%-100%
• Empirical p-values• Based on distribution of scores for some DNA
sequence, determine a p-value (see next slide)
Detecting binding sites in a single sequenceRaw Scores
A [-0.2284 0.4368 -1.5 -1.5 -1.5 0.4368 -1.5 -1.5 -0.2284 0.4368 ]C [-0.2284 -0.2284 -1.5 -1.5 1.5128 -1.5 -0.2284 -1.5 -0.2284 -1.5 ]G [ 1.2348 1.2348 2.1222 2.1222 0.4368 1.2348 1.5128 1.7457 1.7457 -1.5 ]T [ 0.4368 -0.2284 -1.5 -1.5 -0.2284 0.4368 0.4368 0.4368 -1.5 1.7457 ]
ACCCTCCCCAGGGGCGGGGGGCGGTGGCCAGGACGGTAGCTCC Sp1
Relative ScoresA [-0.2284 0.4368 -1.5 -1.5 -1.5 0.4368 -1.5 -1.5 -0.2284 0.4368 ]C [-0.2284 -0.2284 -1.5 -1.5 1.5128 -1.5 -0.2284 -1.5 -0.2284 -1.5 ]G [ 1.2348 1.2348 2.1222 2.1222 0.4368 1.2348 1.5128 1.7457 1.7457 -1.5 ]T [ 0.4368 -0.2284 -1.5 -1.5 -0.2284 0.4368 0.4368 0.4368 -1.5 1.7457 ]
A [-0.2284 0.4368 -1.5 -1.5 -1.5 0.4368 -1.5 -1.5 -0.2284 0.4368 ]C [-0.2284 -0.2284 -1.5 -1.5 1.5128 -1.5 -0.2284 -1.5 -0.2284 -1.5 ]G [ 1.2348 1.2348 2.1222 2.1222 0.4368 1.2348 1.5128 1.7457 1.7457 -1.5 ]T [ 0.4368 -0.2284 -1.5 -1.5 -0.2284 0.4368 0.4368 0.4368 -1.5 1.7457 ]
Max_score = 15.2 (sum of highest column scores)
Min_score = -10.3 (sum of lowest column scores)
93%
100%10.3)(15.2
(-10.3)-13.4
% 100Min_score - Max_scoreMin_score - Abs_score
Rel_score
Empirical p-value Scores
Abs_score = 13.4 (sum of column scores)
0.3
0.2
0.1
0.0
Freq
uenc
y0.0 0.2 0.4 0.6 0.8 1.0
Relative Score
Area to right of value
Area under entire curve
116
JASPAR:
AN OPEN-ACCESS DATABASE
OF TF BINDING PROFILES( jaspar.genereg.net )
(Some other databases with TF profiles include Transfac, TRRD, mPromDB, SCPD (yeast),
dbTBS and EcoTFS (bacteria))
The Good…
• Tronche (1997) tested 50 predicted HNF1 TFBS using an in vitro binding test and found that 96% of the predicted sites were bound!
• Stormo and Fields (1998) found in detailed biochemical studies that the best weight matrices produce scores highly correlated with in vitro binding energy
PSSM SCORE
BIN
DIN
GE
NE
RG
Y
…the Bad…
• Fickett (1995) found that a profile for the myoD TF made predictions at a rate of 1 per ~500bp of human DNA sequence
– This corresponds to an average of 20 sites / gene (assuming 10,000 bp as average human gene size)
…and the Ugly!Human Cardiac a-Actin gene analyzed
with a set of profiles(each line represents a TFBS prediction)
Futility Conjecture: TFBS predictions, as a collective group,
are almost always wrong!
True binding sites are defined by properties not incorporated into
the PSSM profile scores
Red boxes are protein coding exons - TFBS predictions excluded in this analysis
What have we learned?• PSSMs accurately reflect in vitro binding properties of
DNA binding proteins
• Suitable binding sites occur at a rate far too frequent to reflect in vivo function
• Bioinformatics methods that use PSSMs for binding site studies must incorporate additional information to enhance specificity
• Unfiltered predictions are too noisy for most applications• Note: Organisms with short regulatory sequences are less
problematic (e.g. yeast and bacteria)
Example of a bioinformatics challenge that needs more biological information to make predictions
USING PHYLOGENETIC FOOTPRINTING TO IMPROVE TFBS DISCRIMINATION
70,000,000 years of evolution can reveal regulatory regions
% Id
en
tity
Actin gene compared between human and mouse200 bp Window Start Position (human sequence)
Phylogenetic Footprinting
• Align orthologous gene sequences (e.g. use LAGAN). For 1st window of x bp (i.e. 200 bp), of sequence#1, determine % identity with sequence#2. Step across (slide window across) the first sequence, recording % identity in each window with the second sequence.
• Observe high identity with exons, lower identity in 5’ and 3’ UTRs• Additional conserved region could be regulatory region
Multi-species Phylogenetic Footprinting
124
Phylogenetic Footprinting Dramatically Reduces Spurious Hits
Human
Mouse
Actin, alpha cardiac
TFBS Prediction with Human & Mouse Pairwise Phylogenetic Footprinting
• Testing set: 40 experimentally defined sites in 15 well studied genes (Replicated with 100+ site set)
• 75-80% of defined sites detected with “sequence conservation filter” (incorporating orthologous sequence data), while only 11-16% of total predictions retained
SELECTIVITY SENSITIVITY
126
1kbp insulin receptor promoter screened with footprinting
Choosing the ”right” species for pairwise comparison...
COW
MOUSE
CHICKEN
HUMAN
HUMAN
HUMAN
ConSite
http://consite.genereg.net/
TFBS Discrimination Tools I
• Phylogenetic Footprinting Servers– FOOTER http://biodev.hgen.pitt.edu/footer_php/Footerv2_0.php
– CONSITE http://asp.ii.uib.no:8090/cgi-bin/CONSITE/consite/ or
http://consite.genereg.net/
– rVISTA http://rvista.dcode.org/
– ORCAtk http://burgundy.cmmt.ubc.ca/cgi-bin/OrcaTK/orcatk
• SNPs in TFBS Analysis– RAVEN http://burgundy.cmmt.ubc.ca/cgi-bin/RAVEN/a?rm=home
TFBS Discrimination Tools II• Prokaryotes or Yeast
– PRODORIC http://prodoric.tu-bs.de/– YEASTRACT http://www.yeastract.com/index.php
• Software Packages– TOUCAN http://homes.esat.kuleuven.be/~saerts/software/toucan.php
• Programming Tools– TFBS http://tfbs.genereg.net/– ORCAtk http://burgundy.cmmt.ubc.ca/cgi-bin/OrcaTK/orcatk
• TFBS discrimination coupled with phylogenetic footprinting has greater specificity with tolerable loss of sensitivity
• As with any purification process, some true binding sites will be lost
• There are several online resources that provide precomputed analyses using phylogenetic footprinting
What have we learned?
TFBS Over-representation
• We seek to determine if a set of co-expressed genes contains an over-abundance of predicted binding sites for a known TF
• Phylogenetic footprinting to reduce false prediction rate
• Similar to looking for functional over-representation (i.e. identify annotations of gene function in sets of co-expressed genes that are over-represented)
Two Examples of TFBS Over-Representation
Foreground
Background
More Genes with TFBS
Foreground
Background
More Total TFBS
Statistical Methods for Identifying Over-represented TFBS
• Binomial test (Z scores)– Based on the number of occurrences of the TFBS relative
to background– Normalized for sequence length– Simple binomial distribution model
• Fisher exact probability scores– Based on the number of genes containing the TFBS relative
to background– Hypergeometric probability distribution
oPOSSUM Server
Validation using Reference Gene Sets
TFs with experimentally-verified sites in the reference sets.
A. Muscle-specific (23 input; 16 analyzed)
B. Liver-specific (20 input; 12 analyzed)
Rank Z-score Fisher Rank Z-score Fisher
SRF 1 21.41 1.18e-02 HNF-1 1 38.21 8.83e-08
MEF2 2 18.12 8.05e-04 HLF 2 11.00 9.50e-03
c-MYB_1 3 14.41 1.25e-03 Sox-5 3 9.822 1.22e-01
Myf 4 13.54 3.83e-03 FREAC-4 4 7.101 1.60e-01
TEF-1 5 11.22 2.87e-03 HNF-3beta 5 4.494 4.66e-02
deltaEF1 6 10.88 1.09e-02 SOX17 6 4.229 4.20e-01
S8 7 5.874 2.93e-01 Yin-Yang 7 4.070 1.16e-01
Irf-1 8 5.245 2.63e-01 S8 8 3.821 1.61e-02
Thing1-E47 9 4.485 4.97e-02 Irf-1 9 3.477 1.69e-01
HNF-1 10 3.353 2.93e-01 COUP-TF 10 3.286 2.97e-01
Structurally-related TFs with Indistinguishable TFBS
• Most structurally related TFs bind to highly similar patterns– Zn-finger is a
big exception
• TFBS over-representation analysis can identify increased number of sites in a set of co-expressed genes, or increased number of genes in a co-expressed gene set that have a given site.
• Generally best performance has been with data directly linked to a transcription factor
• Highly dependent on the experimental design – cannot overcome noisy data from poor design
Note that the identity of a mediating TF may not be apparent when many proteins can bind to the same motif
Notes
de novo Pattern Discovery
• String-based – e.g. YMF (Sinha & Tompa)– Generalization: Identify over-represented oligomers in
comparison of “+” and “-” (or complete) promoter collections
– Used often for yeast promoter analysis
• Profile-based – e.g. AnnSpec (Workman & Stormo) or MEME (Bailey &
Elkin)– Generalization: Identify strong patterns (position weight
matrices) in “+” promoter collection vs. background model of expected sequence characteristics
String-based methods(1)
How likely are X words in a set of sequences, given background
sequence characteristics?CCCGCCGGAATGAAATCTGATTGACATTTTCC >EP71002 (+) Ce[IV] msp-56 B; range -100 to -75 TTCAAATTTTAACGCCGGAATAATCTCCTATT >EP63009 (+) Ce Cuticle Col-12; range -100 to -75 TCGCTGTAACCGGAATATTTAGTCAGTTTTTG >EP63010 (+) Ce Cuticle Col-13; range -100 to -75 TATCGTCATTCTCCGCCTCTTTTCTT >EP11013 (+) Ce vitellogenin 2; range -100 to -75 GCTTATCAATGCGCCCGGAATAAAACGCTATA >EP11014 (+) Ce vitellogenin 5; range -100 to -75 CATTGACTTTATCGAATAAATCTGTT >EP11015 (-) Ce vitellogenin 4; range -100 to -75 ATCTATTTACAATGATAAAACTTCAA >EP11016 (+) Ce vitellogenin 6; range -100 to -75 ATGGTCTCTACCGGAAAGCTACTTTCAGAATT >EP11017 (+) Ce calmodulin cal-2; range -100 to -75 TTTCAAATCCGGAATTTCCACCCGGAATTACT >EP63007 (-) Ce cAMP-dep. PKR P1+; range -100 to -75 TTTCCTTCTTCCCGGAATCCACTTTTTCTTCC >EP63008 (+) Ce cAMP-dep. PKR P2; range -100 to -75 ACTGAACTTGTCTTCAAATTTCAACACCGGAA >EP17012 (+) Ce hsp 16K-1 A; range -100 to -75 TCAATGCCGGAATTCTGAATGTGAGTCGCCCT >EP55011 (-) Ce hsp 16K-1 B; range
String-based methods(2)
Find all words of length n in the yeast promoters (e.g. n=7)
Make a lookup table:
AAACCTTT 456TTTTTTTT 57788GATAGGCA 589
Etc...
GTCTTTATCTTCAAAGTTGTCTGTCCAAGATTTGGACTTGAAGGACAAGCGTGTCTTCTCAGAGTTGACTTCAACGTCCCATTGGACGGTAAGAAGATCACTTCTAACCAAAGAATTGTTGCTGCTTTGCCAACCATCAAGTACGTTTTGGAACACCACCCAAGATACGTTGTCTTGTTCTCACTTGGGTAGACCAAACGGTGAAAGAAACGAAAAATACTCTTTGGCTCCAGTTGCTAAGGAATTGCAATCATTGTTGGGTAAGGATGTCACCTTCTTGAACGACTGTGTCGGTCCAGAAGTTGAAGCCGCTGTCAAGGCTTCTGCCCCAGGTTCCGTTATTTTGTTGGAAAACTGCGTTACCACATCGAAGAAGAAGGTTCCAGAAAGGTCGATGGTCAAAAGGTCAAGGCTCAAGGAAGATGTTCAAAAGTTCAGACACGAATTGAGCTCTTTGGCTGATGTTTACATCACGATGCCTTCGGTACCGCTCACAGAGCTCACTCTTCTATGGTCGGTTTCGACTTGCCAACGTGCTGCCGGTTTCTTGTTGGAAAAGGAATTGAAGTACTTCGGTAAGGCTTTGGAGAACCCAACCAGACCATTCTTGGCCATCTTAGGTGGTGCCAAGGTTGCTGACAAGATTCAATTGATTGACAACTTGTTGGACAAGGTCGACTCTATCATCATTGGTGGTGGTATGGCTTTCCCTTCAAGAAGGTTTTGGAAAACACTGAAATCGGTGACTCCATCTTCGACAAGGCTGGTGCTGAAATCGTTCCAAAGTTGATGGAAAAGGCCAAGGCCAAGGGTGTCGAAGTCGTCTTGCAGTCGACTTCATCATTGCTGATGCTTTCTCTGCTGATGCCAACACCAAGACTGTCACTGACAAGGAAGGTATTCCAGCTGGCTGGCAAGGGTTGGACAATGGTCCAGAATCTAGAAAGTGTTTGCTGCTACTGTTGCAAAGGCTAAGACCATTGTCTGGAACGGTCCACCAGGTGTTTTCGAATTCGAAAAGTTCGCTGCTGGTACTAAGGCTTTGTTAGACGAAGTTGTCAAGAGCTCTGCTGCTGGTAACACCGTCATCATTGGTGGTGGTGACACTGCCA
Xw: Instances of a word w within our set of X genes
E[Xw]: Average number of instances of w based on number of genes in our set
Var[Xw]: Variance – how much deviation from the average is expected for w
w
www XVar
XEXZ
String-based methods(3)
Limitations of String-based Methods• Longer word lengths not possible
– Number of possible strings of length n is 4n without any ambiguities and more if we allow ambiguous bases. Thus, as n increases, the computational time notably increases.
• While degeneracy codes can be used, TFBS are not words – we lose quantitation for variable positions with consensus sequences– Imagine column in PFM with 7 A’s and 1 T --- in a consensus
sequence we would represent as W or throw out the instance with T
• Recently the string-based method has found renewed utility in the analysis of 3’UTRs for the presence of microRNA target sequences...
Aside: microRNA Target Sequences
• Lim et al expressed miRNAs in cells and observed that the overall pattern of gene expression shifted toward the pattern of expression observed in cells which naturally express the miRNA
• The genes with reduced expression in response to miRNA exposure shared 7nt motifs the 3’UTR of their transcripts
• Nice website tutorial:– http://www.ambion.com/main/explorations/mirna.html
Motivation:
TFBS are not words, but rather sequence patterns
Efficiency – can handle longer patterns than string-based methods
Can be intentionally influenced to reflect prior knowledge
Overview: Find a local alignment of width x of sites that maximizes information content (or related measure) in reasonable time
Usually by Gibbs sampling (i.e. Gibbs Motif Sampler) or Expectation Maximization methods (i.e. MEME)
Probabilistic Methods
Ab initio motif finding: Gibbs sampling
• Popular “Markov chain Monte Carlo” algorithm for motif discovery
• Motif model: Position Weight Matrix• Local search algorithm
Gibbs sampling: basic idea
Current motif = PWM formedby circled substrings
Gibbs sampling: basic idea
Delete one substring
Gibbs sampling: basic idea
Try a replacement:Compute its score, Accept the replacementdepending on the score.
Gibbs sampling: basic idea
New motif
Gibbs Sampling(grossly over-simplified)
tgacttcctgctacct
agacctcactgtagtgacgcatct
cgatacgcttcgctcc
1 2 3 4 5 6 7 8A 2 0 2 2 2 1 0 1C 0 2 3 3 2 1 6 2G 0 4 1 0 1 0 1 1T 4 1 1 2 2 5 0 2
i.e. Gibbs Motif Sampler: http://bayesweb.wadsworth.org/gibbs/gibbs.html
http://www.youtube.com/watch?v=_4yrlzVTr_k
Pattern Discovery
• Gibbs sampling is guaranteed to return an optimal pattern if repeated sufficiently often– Procedure is fast, so running many 1000s of times
is feasible• Problems if the TFBS’s are not strongly over-
represented relative to other patterns…
Applied Pattern Discovery is Acutely Sensitive to Noise
True Mef2 Binding Sites
10
12
14
16
18
0 100 200 300 400 500 600
SEQUENCE LENGTH
PATTE
RN
SIM
ILA
RIT
Yvs.
TR
UE
ME
F2 P
RO
FILE
Pink line is negative control with no Mef2 sites included
Ab initio motif finding: Expectation Maximization
• Popular algorithm for motif discovery• Motif model: Position Weight Matrix• Local search algorithm
– Move from current choice of motif to a new similar motif, so as to improve the score
– Keep doing this until no more improvement is obtained: converge to local optima
Basic idea of iteration
PWMCurrent motif1.
Scan sequence for good matches to the current motif.
2.
3. Build a new PWM out of these matches, and make it the new motif
The basic idea can be formalized in the language of probability
MEME
• Popular motif finding program that uses Expectation-Maximization
• Web sitehttp://meme.sdsc.edu/meme/website/meme.html
Four Approaches to Improve Sensitivity• Better background models
-Higher-order properties of DNA
• Phylogenetic Footprinting– Human:Mouse comparison eliminates ~75% of sequence
• Regulatory Modules – Architectural rules
• Limit the types of binding profiles allowed– TFBS patterns are NOT random
(Also, for prokaryotes in particular, accurate ID of transcription start sites can greatly aid identification of promoter sequences)
Pattern Discovery Summary
• Pattern discovery methods can recover over-represented patterns in the promoters of co-expressed genes
• Methods are acutely sensitive to noise – notable problem if the signal we seek is weak
• TFs tolerate great variability between binding sites
• Supplementary information/approaches (such as phylogenetic footprinting) may be required to over-come the noise
REFLECTIONS• TFBS prediction
– Futility Theorem – Essentially predictions of individual TFBS have no relationship to an in vivo function
– Current successful bioinformatics methods incorporate additional information (clusters, sequence conservation)
• TFBS over-representation– TFBS over-representation is powerful way to identify TFs
likely to contribute to observed patterns of co-expression• Novel motif discovery (also useful in general)
– Pattern discovery methods are restricted by the Signal-to-Noise problem
• Observed patterns must be carefully considered– Again, successful methods incorporate additional
information (conservation, structural constraints on TFs, and (for prokaryotes) positional information relative to TSS)