NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator:...

NGS Bioinformatics Workshop1.5 Genome Annotation

April 4th, 2012IRMACS 10900, SFU

Facilitator: Richard BruskiewichAdjunct Professor, MBB

Acknowledgment:This week’s slides partly courtesy of Professor Fiona Brinkman, MBB,

with material from Wyeth W. Wasserman and Shannan Ho Sui

OverviewGeneral observations about genomesRepeat maskingGene findingGene regulation and promoter analysis

SOME GENERAL OBSERVATIONS ABOUT GENOMES

NGS Bioinformatics Workshop - 1.5 Genome Annotation

3

General Variables of GenomesProkaryote versus Eukaryote versus OrganelleGenome size:

Number of chromosomesNumber of base pairsNumber of genes

GC/AT relative contentRepeat contentGenome duplications and polyploidyGene content

See: Genomes, 2nd edition Terence A Brown. ISBN-10: 0-471-25046-5See NCBI Bookshelve: http://www.ncbi.nlm.nih.gov/books/NBK21128/

Eukaryote versus Prokaryote Genomes

Eukaryote Prokaryote

Size

Large (10 Mb – 100,000 Mb)

There is not generally a relationship between organism complexity and its genome size (many plants have larger genomes than human!)

Generally small (<10 Mb; most < 5Mb)

Complexity (as measured by # of genes and metabolism) generally proportional to genome size

Content Most DNA is non-coding DNA is “coding gene dense”

Telomeres/ Centromeres

Present (Linear DNA)

Circular DNA, doesn't need telomeres

Don’t have mitosis, hence, no centromeres.

Number of chromosomes

More than one, (often) including those discriminating sexual identity

Often one, sometimes more, -but plasmids, not true chromosome.

Chromatin Histone bound (which serves as a genome regulation point)

No histones

Uses supercoiling to pack genome

Eukaryote versus Prokaryote Genomes

Eukaryote Prokaryote

Genes

Often have introns

Intraspecific gene order and number generally relatively stable

many non-coding (RNA) genes

There is NOT generally a relationship between organism complexity and gene number

No introns

Gene order and number may vary between strains of a species

Gene regulation

Promoters, often with distal long range enhancers/silencers, MARS, transcriptional domains

Generally mono-cistronic

Promoters

Enhancers/silencers rare

Genes often regulated as polycistronic operons

Repetitive sequences Generally highly repetitive with genome wide

families from transposable element propagation

Generally few repeated sequences

Relatively few transposons

Organelle (subgenomes)

Mitochondrial (all)

chloroplasts (in plants) Absent

Genome SizePhysical:

Amount of DNA / number of base pairsNumber of chromosomes/linkage groupsInformation resources:

NCBI: http://www.ncbi.nlm.nih.gov/genomeAnimals: http://www.genomesize.com/Plants: http://data.kew.org/cvalues/Fungi: http://www.zbi.ee/fungal-genomesize/

Genetic: Number of genes in the genome

Gregory TR. 2002. Genome size and developmental complexity. Genetica. May;115(1):131-46.

Size of Organelle GenomesSpecies Type of organism Genome size (kb)

Mitochondrial genomes

Plasmodium falciparum Protozoan (malaria parasite) 6

Chlamydomonas reinhardtii Green alga 16

Mus musculus Vertebrate (mouse) 16

Homo sapiens Vertebrate (human) 17

Metridium senile Invertebrate (sea anemone) 17

Drosophila melanogaster Invertebrate (fruit fly) 19

Chondrus crispus Red alga 26

Aspergillus nidulans Ascomycete fungus 33

Reclinomonas americana Protozoa 69

Saccharomyces cerevisiae Yeast 75

Suillus grisellus Basidiomycete fungus 121

Brassica oleracea Flowering plant (cabbage) 160

Arabidopsis thaliana Flowering plant (vetch) 367

Zea mays Flowering plant (maize) 570

Cucumis melo Flowering plant (melon) 2500

Chloroplast genomes

Pisum sativum Flowering plant (pea) 120

Marchantia polymorpha Liverwort 121

Oryza sativa Flowering plant (rice) 136

Nicotiana tabacum Flowering plant (tobacco) 156

Chlamydomonas reinhardtii Green alga 195

http://www.ncbi.nlm.nih.gov/books/NBK21120/table/A5511

http://www.ncbi.nlm.nih.gov/books/n/genomes/A9089/def-item/A9469/




Size of Prokaryote GenomesSpecies DNA molecules Size (Mb) Number of genes

Escherichia coli K-12 One circular molecule 4.639 4397

Vibrio cholerae El Tor N16961 Two circular molecules

Main chromosome 2.961 2770

Megaplasmid 1.073 1115Deinococcus radiodurans R1 Four circular molecules

Chromosome 1 2.649 2633 Chromosome 2 0.412 369 Megaplasmid 0.177 145 Plasmid 0.046 40

Borrelia burgdorferi B31 seven or eight circular molecules, 11 linear molecules

Linear chromosome 0.911 853

Circular plasmid cp9 0.009 12

Circular plasmid cp26 0.026 29

Circular plasmid cp32* 0.032 Not known

Linear plasmid lp17 0.017 25


Linear plasmid lp28-1 0.027 32







Linear plasmid lp56 0.056 Not known



Size of Eukaryote GenomesSpecies Genome size (Mb)

Fungi

Saccharomyces cerevisiae 12.1

Aspergillus nidulans 25.4

Protozoa

Tetrahymena pyriformis 190

Invertebrates

Caenorhabditis elegans 97

Drosophila melanogaster 180

Bombyx mori (silkworm) 490

Strongylocentrotus purpuratus (sea urchin) 845

Locusta migratoria (locust) 5000

Vertebrates

Takifugu rubripes (pufferfish) 400

Homo sapiens 3200

Mus musculus (mouse) 3300

Plants

Arabidopsis thaliana (vetch) 125

Oryza sativa (rice) 430

Zea mays (maize) 2500

Pisum sativum (pea) 4800

Triticum aestivum (wheat) 16 000

Fritillaria assyriaca (fritillary) 120 000


http://en.wikipedia.org/wiki/Genome_sizehttp://en.wikipedia.org/wiki/Genome#Comparison_of_different_genome_sizes

Number of GenesSpecies Ploidy Cs Size (Mb) No. Genes

Saccharomyces cerevisiae 2 16 12 6,281

Plasmodium falciparum 2 14 23 5,509

Caenorhabditis elegans 2 6 100 21,175

Drosophila melanogaster 2 6 139 15,016

Oryza sativa 2 12 410 30,294

Canis lupus familaris 2 39 2,445 24,044

Homo sapiens 2 24 3,100 36,036

Zea mays 2 10 2,046 42,000-56,000 (*)

Protopterus aethiopicus ? ? 130,000 ?

Paris japonica 8 40 150,000 ?

Polychaos dubium ? ? 670,000 ?

http://www.ncbi.nlm.nih.gov/genome

(*) Haberer et al., Structure and architecture of the maize genome. Plant Physiol. 2005 Dec139(4):1612-24

AT/GC content Regional variations correlates with genomic content

and function like transposable element distribution, gene density, gene regulation, methylation, etc.

Often introduces bias in sequencing processes (e.g. library yields, PCR amplification, NGS sequencing)

Species GC%Streptomyces coelicolor A3(2) 72Plasmodium falciparum 20Arabidopsis thaliana 36Saccharomyces cerevisiae 38Arabidopsis thaliana 36Homo sapiens 41 (35 – 60)

Romiguier et al. 2010. Contrasting GC-content dynamics across 33 mammalian genomes: Relationship with life-history traits and chromosome sizes. Genome Res. 20: 1001-1009

Repeat Content Large genomes generally reflect evolutionary expansion of

large families of repetitive DNA (by RNA/DNA transposon amplification/insertion, genetic recombination)

Repeats drive genome mutational processes: Recombination resulting in insertion, deletion, translocation,

segmental duplication of DNA Insertional mutagenesis, possibly including de novo creation of genes Insert novel regulatory signals

Repeats generally confound genome sequence assembly (especially for NGS, due to short reads). Gene annotation can also be problematic as transposons mimic gene structures.

Jurka et al. 2007. Repetitive sequences in complex genomes: structure and evolution. Annu Rev Genomics Hum Genet. 2007;8:241-59.

Genome Duplications/PolyploidySegmental duplications (i.e. by recombination)

Tandem: direct and invertedWhole genome duplication & loss, e.g.

Ancestral vertebrate: 2 roundso HOX gene clusters…

Polyploidy - ~70% of all angiospermsGenomic hybridization (allopolyploids)Can lead to immediate and extensive changes in

gene expressionMapping of homeologous gene loci can be tricky

Dehal P and Boore JL.2005. Two Rounds of Whole Genome Duplication in the Ancestral Vertebrate. PLoS Biol 3(10) : e314. doi:10.1371

Adams and Wendel. 2005. Polyploidy and genome evolution in plants. Curr. Opin. Plant Biol. 8(2):135–141

Case Study: Rice 10 duplicated blocks identified on all 12 chromosomes

of Oryza sativa subspecies indica, that contained 47% of the total predicted genes.

Possible genome duplication occurred ~70 million years ago, supporting the polyploidy origin of rice.

Additional segmental duplication identified involving chromosomes 11 and 12, ~5 million years ago.

Following the duplications, there have been large-scale chromosomal rearrangements and deletions. About 30–65% of duplicated genes were lost shortly after the duplications, leading to a rapid diploidization.

Wang et al. 2005. Duplication and DNA segmental loss in the rice genome:implications for diploidization New Phytologist 165: 937–946

Gene Content

http://www.ncbi.nlm.nih.gov/books/NBK21120/figure/A5501

The Bottom Line

All of these genomic variables:Type of organism: i.e. prokaryote versus eukaryoteGenome sizeGC/AT relative contentRepeat contentGenome duplications and polyploidyGene content

are important factors that can drive the strategy, expected outcome and efficacy of genome sequence assembly and annotation.

GENOME REPEAT MASKING


19

Genomic (DNA) Sequence Repeat Masking

Classic approach: search against repeat libraries RepeatMasker

http://www.repeatmasker.org/Uses a previously compiled library of repeat familiesUses (user configured) external sequence search programComputationally intensive but……the project web site also provides “pre-masked” genomic

data for many completed genomes, complete with some statistical characterization.

More Repeat Masking … de novo identification and classification:

RECON: http://www.genetics.wustl.edu/eddy/reconRepeatGluer: http://nbcr.sdsc.edu/euler/PILER: http://www.drive5.com/piler

Repeat databases:Repbase: http://www.girinst.org/repbase/index.htmlplants: http://plantrepeats.plantbiology.msu.edu/

Related algorithms:“probability clouds”

Gu et al. 2008. Identification of repeat structure in large genomes using repeat probability clouds. Anal Biochem. 380(1): 77–83

GENE FINDING…OR “WHAT IS A GENE?”


23

24

Objectives

Review of differences in prokaryotic and eukaryotic gene organization.

Understand consequences and challenges for gene finding algorithms for Prokaryotes and Eukaryotes.

Appreciate HMM as powerful tool (in many areas of computational biology!)

Be reminded that not all genes encode proteins and predictions of such genes have their own computational challenges.

Gene Annotation QuestionsWhich genes are present?How did they get there (evolution)?Are the genes present in more than one copy?Which genes are not there that we would

expect to be present?What order are the genes in, and does this

have any significance?How similar is the genome of one organism to

that of another?

26

Why Gene-finding?

• Whole-genome annotation– Genome sequence does not give you list of all

genes• Fully characterizing Yfg (“your favourite gene”)

– example: A disease is associated with a SNP in a location in the human genome. BLAST finds similarity to a protein coding gene in the area, but its only similar to part of the whole protein. What’s the whole gene?

After completing the human genome we faced 3 Gigabytes of this

Not immediately apparent where the genes are…

What is a gene?

29

A distinct functional unit encoded (at minimum transcribed) in the genome

Eukaryotes: The region that is transcribed – the mRNA defines it

Prokaryotes: The coding region – because multiple genes may be on one transcript as an operon

Both eukaryotes and prokaryotes: non-coding RNAs are treated relatively the same

30

Prokaryotes• High gene density• mRNA transcription-

translation is coupled

• Genes are usually contiguous stretches of coding DNA

• mRNAs often polycistronic gene

____________________

• Low gene density• mRNA transcribed then

transported to cytoplasm for translation

• Genes’ coding DNA often split by non-coding introns

• mRNAs are generally monocistronic gene

___________

Eukaryotes

Raw Biological Materials

Great real-time Transcription-Translation video: http://www.youtube.com/watch?v=41_Ne5mS2ls

transcript

How many genes in human genome?• 2000: must be at least 100,000 (Rice has ~40,000,

C. elegans has ~19,000)

• 2001: only 35,000?

• 2005: Ensembl NCBI 35 release: 22,218 genes (33,869 transcripts)

• 2006: Ensembl NCBI 36 release: 23,710 protein coding genes, plus 4421 RNA genes (48,851 transcripts)

• Today: Ensembl 64 release, Sept 2011, is 20,900 coding genes + 14,266 RNA genes - but with alternative splicing these produce likely many more…

How many genes in human genome?• 2000: must be at least 100,000 (Rice has ~40,000,

C. elegans has ~19,000)

• 2001: only 35,000?

• 2005: Ensembl NCBI 35 release: 22,218 genes (33,869 transcripts)

• 2006: Ensembl NCBI 36 release: 23,710 protein coding genes, plus 4421 RNA genes (48,851 transcripts)

• Today: Ensembl 64 release, Sept 2011, is 20,900 coding genes + 14,266 RNA genes - but with alternative splicing these produce likely >100,000 proteins (178,191 currently annotated in Ensembl)

1 gene in how many basepairs?...a. 1:10,000,000b. 1:1,000,000c. 1:100,000 roughly for humand. 1:10,000 (1:5000 for C. elegans)e. 1:1000 roughly for most bacteriaf. 1:100g. 1:10

33

Gene Density

34

Approaches to Finding Genes

• AB INITIO (‘from the beginning’):

– search by content: find genes by statistical properties that distinguish protein-coding DNA from non-coding DNA

– Search by signals/sites: splice donor/acceptor sites, promoters, etc.

• HOMOLOGY:– search by sequence similarity to homologous sequences

state-of-the-art gene finding combines these strategies and also uses other data like EST/RNA-seq/microarray data

35

Gene-finding in Prokaryotes:Easy? ….or not?

• ORF Finder– Open reading frame (ORF) from methionine

codon to first Stop codon– ORFs linked to BLAST– http://www.ncbi.nlm.nih.gov/gorf/gorf.html

Problem: All ORFs are not genes. Why? How can this be improved?

36

Gene Finding: Ab Initio Search by Content

• Protein code affects the statistical properties of a DNA sequence – some amino acids are used more frequently than others

(Leu more popular than Trp)– different numbers of codons for different amino acids

(Leu has 6, Trp has 1)– for a given amino acid, usually one codon is used more frequently than others

• this is termed codon preference • these preferences vary by species

37

Gene-finding in Prokaryotes:Improving predictions…

Common way to search by content• build Markov models of coding & noncoding regions

apply to ORFs or fixed-sized sequence windowsMarkov Model approaches: prokaryotic gene prediction• Glimmer

– http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi– http://cbcb.umd.edu/software/glimmer/– open source

• GeneMark– http://opal.biology.gatech.edu/GeneMark/– not open source but slightly more accurate

38Reproduced with permission,Chris Burge

i.e. If its sunny then the next weather state is not likely to become suddenly rainy.

If its cloudy then it is more likely the next weather state will be rainy.

39Reproduced with permission,Chris Burge

For gene prediction:

Certain bases occur after others in coding sequence versus non-coding sequence

What’s Hidden in a Hidden Markov Model?

• The HMM is a model of what actually happened (the true gene structure) but all you can see is the sequence

• Each nucleotide is not labeled with its state; it’s hidden

• Based on training data – which is critical!

Glimmer version 3

• annotates microbial genomes with a sensitivity of 99% and accuracy of >98%

• resolves overlapping genes• use appropriate gene dataset for training program• no annotation of rRNA and tRNA genes• employs interpolated Markov model

42NAR Vol.26 p1107-1115

43

GeneMark.hmm

• HMM plus ribosome binding site signals in primary nucleotide sequence

• training also required• Most accurate prokaryotic gene predictor (by a bit over

Glimmer)• GeneMarkS version - translation start prediction: 83.2% of

Bacillus subtilis genes 94.4% of Escherichia coli genes

44

Main issues with prokaryotic gene prediction

• Which ATG or GTG start to choose? Not that accurate• Errors in sequence frame shifts that lead to errors in gene

prediction• Very small genes either not predicted at all (most prokaryotic

genome projects don’t annotate any genes < 90 bp) or poorly predicted

• Non-coding RNA genes tend to be ignored!

• RNA-seq data being incorporated now…

45

RNA-seq

• Sequence cDNA from RNA using “next generation” sequencing

• Map sequence reads to ref genome

• Count number of sequence reads (depth of reads) for a given window of ref sequence

Rne (Rnase E) Upstream Region in P. aeruginosa

Rne (Rnase E) Upstream Region

• Upstream region of Rne in E. coli forms stem loop to regulate mRNA synthesis

• Region aligns reasonably well between spp.• Also predicted to encode a small non-coding RNA using QRNA

and RNAz

E. coli PAO1

Gene prediction in eukaryotes

• Overview of eukaryotic gene structure will illustrate the gene-finding problem

• Components described are some of those that gene-finding tools must model

• These signals are never 100% reliable because there are (almost) as many exceptions as rules

• Signals (and noise) vary with organism– 3% of human genome is “coding” vs. 25% in fugu

49

The problem with eukaryotes:the easy stuff?

The problem with eukaryotes:the hard stuff

• ~50% of human genes undergo alternative splicing

• Genes can be nested: overlapping on same or opposite strand or inside an intron

• Pseudogenes: “dead” genes

Fig 10.1 Baxevanis and Ouellette, 2000

Raw Biological Materials: Eukaryotes

Transcription Start

core promoter

TSS – transcription start site

. . .

Fig 11.25 from Introduction to Genetic Analysis, 7th ed. Griffiths et al. New York: W H Freeman & Co; c1999.http://www.ncbi.nlm.nih.gov/books/bv.fcgi?call=bv.View..ShowSection&rid=iga.figgrp.d1e46911

• ~70% of promoter regions have TATA box

End Modification

• 5’cap added at 5’ end of transcript• polyA-tail added at AATAAA consensus site

Splice site recognition

Logos from:http://genio.informatik.uni-stuttgart.de/GENIO/splice/

exon intron5’ (a.k.a. donor) 3’ (a.k.a. acceptor)

intron exon

The problem with splice signals

• ~0.5% of splice sites are non-canonical, i.e. not GT-AG

• ~5% of human genes may have a non-canonical splice site

• ~50% of human genes are alternatively spliced

Intron Phase

CCA

CAGAGG

exon intron exon

Phase 0:Phase 1:Phase 2:

A codon can be interrupted by an intron in one of three placesand still code for the same amino acid

Example CAG -> glutamine

Translation signals

Start AUG

Note: Bacteria also use GUG (~10% of starts) and UUG (rarely)

StopUAA, UAG, or UGA

Translation signals

• Translation Start: “usually” first AUG after 5’ cap

• But Kozak consensus sequence aids the selection so the 2nd or 3rd AUG might be used

• Translation Stop: first in-frame stop codon (simple!)

Gene-finding Strategies

• Similarity-based a.k.a. lookup method– genome:protein– genome:genome

• ab initio a.k.a. template method• Experimental

– microarrays– RNA Sequencing

Similarity – genome:protein

• blastx finds similarity between translated genomic region and known proteins

• Problem: on human chr22 only 50% of genes had similarity to known proteins when first annotated

Similarity – genome:genome

• Functional sequences will be conserved• Learn different things from different evolutionary

distances (signal:noise ratio)– coding sequences– regulatory sequences

• Compare mouse:human, pufferfish:human (Exofish)• Problems:

– you need something “similar”– won’t find precise boundaries

ab initio Gene-finding

• Based on intrinsic information• Coding statistics plus signal sensors• No prior knowledge of similar genes needed• Problem: gene models can only approach the

biological truth (recall eukaryotic gene structure variability)

Coding Statistics

• Exon/intron base composition differences indicate protein-coding regions

• Use frequency of occurrence of hexamers

Signal Sensors

• Model features recognized by transcriptional, splicing and translational machinery– Promoter elements– Splice sites– PolyA consensus sites– Start and stop codons

• Use pattern recognition methods as signal detectors– Weight matrices– Neural networks– Decision trees

What don’t (coding) gene finders do?

• Explicit alternative splicing models• Overlapping or nested genes• Non-protein-coding genes

Tools for finding genes…

Check out the Bioinformatics Links Directory at Bioinformatics.ca

http://bioinformatics.ca/links_directory/

Some tools for finding genes…

http://bioinformatics.ca/links_directory/category/dna/gene-prediction

Tools for eukaryotic gene prediction

• Ab Initio– genscan, fgenesh

• Homology-based prediction– Genewise (uses predicted proteins from one

genome/seq to drive exon finding in a second genome/seq, combined with splice-site model)

– Twinscan (builds two genome predictions together, using conserved DNA signal as one criterion)

Gene Finding in General: Training Sets

• ab initio gene-finders must be trained on real sequences

• Mostly single genes from specific groups of organisms (or the organism being studied)

• Bias of training set will affect a tool’s prediction accuracy

• Example: HMR195 has 195 single- or multi-exon murine genes with typical splice sites and confirmed by cDNA (Rogic et al, 2001)

GENSCAN Dissection

Reproduced with permission,Chris Burge

72

Genscan HSMM

• Circles and diamonds are states e.g. exon, phase 0 on plus strand

• Arrows are probabilities of transition between states

• Sequence you observe is generated from a series of states

Reproduced with permission, Chris Burge

74


75


76

Genscan HSMM

• Circles and diamonds are states e.g. exon, phase 0 on plus strand

• Arrows are probabilities of transition between states

• Sequence you observe is generated from a series of states

Reproduced with permission, Chris Burge

Signal Sensors

• Model features recognized by transcriptional, splicing and translational machinery– Start codon

• 12bp WMM (Weight Matrix Model); probabilities estimated from GenBank CDS feature annotations

– 5’ splice site• MDD (Maximal Depedence Decomposition) developed

by Burge and Karlin (1997) to allow for dependencies between adjacent and non-adjacent nucleotides

Coding Statistics

• Exon/intron base composition differences indicate protein-coding regions

• Use frequency of occurrence of hexamers• Model with 5th order HMMs

– Non-coding: homogeneous – Coding: heterogeneous

• different model, depending on whether it’s 1st, 2nd, 3rd codon position

The predicted gene structure is

the optimal combination of the aboveReproduced with permission,Chris Burge

GENSCAN

• Strengths– Works well from fly to human (and plants with

some tweaks)• Weaknesses

– Practical• Initial exons; HMMgene does (a bit) better• Gene boundaries; tends to join or split genes

– Models• Promoter and polyA site models• Transcription start sites (TSS)

Homology Based Approach

• Use evolutionary divergence as a tool• Based on the principal that introns and other

non-coding regions usually diverge much faster over time

• Helped by the fact that many splice patterns have been determined experimentally (cDNA)

• Fails when the divergence of distant sequences is too extreme (<50%; due to rapidly evolving proteins or large evolutionary distances)

GeneWise (Ewan Birney)• GeneWise aligns a protein sequence (or HMM) to

genomic DNA taking into account splicing information

See also “Exonerate”

February 2001 December 2002

The birth of comparative genomics…

Comparative Genefinders• Twinscan -based on Genscan (Korf, 2001)• ROSETTA – global alignment of orthologous

regions (Batzoglou et al, 2000)• SLAM (Alexandersson) and Doublescan

(Meyer)– jointly align while finding predicting genes– computationally expensive (extra search

dimension) and hard to parameterise– still effectively requires a full alignment

• SGP2 (Parra, 2003)– Combines ab initio gene prediction with TBLASTX

Paul Flicek et al. Genome Res. 2003: TWINSCAN pred (red), GENSCAN (green), and an aligned RefSeq transcript (blue). Yellow: low-complexity regions, black: mouse

alignments

Evaluation of Twinscan

Paul Flicek et al. (Genome Res. 2003): Accuracy of GENSCAN and TWINSCAN by the exact gene, exact exon, and coding nucleotide measures

Evaluation of Twinscan

Twinscan (N-SCAN) in UCSC genome browser

Region of seq1 (from HMR195 dataset) presentedTwinscan URL: http://ardor.wustl.edu

What about non-coding RNAs (ncRNAs)?

88

What are non-coding RNAs (ncRNAs)?

89

ncRNA genes make transcripts that function directly as RNA, rather than encoding proteins.– ribosomal RNA, transfer RNA, RNase P, (tmRNA)– E.coli translational regulatory RNAs, e.g. OxyS– Eukaryotes: snRNA, snoRNA (& Archaea)

Example predictor: QRNA - probabilistic model III

• RNA: pair stochastic context-free grammar – base stacking, loop length (MFOLD)

1. Identification of ncRNAs with known RNA familiesRFAM: a database of RNA families (Griffiths-Jones et al.,

2005)

2. Identification of novel ncRNAsObtain orthologous sequences using BLASTNMultiple sequence alignment using MAFFTncRNA prediction using RNAz and QRNAAnalyze overlapping predictions and filter false positivesNeed lab validation (RT-PCR, RNA-seq…)

Prediction of non-coding RNAs

1. Sean Eddy (HHMI) RFAM, QRNA http://selab.janelia.org/

2. Cenk Sahinalp (SFU) TaveRNA suite for structural analysis and interaction predictionhttp://compbio.cs.sfu.ca/

Analysis of non-coding RNAs: Examples of Research Leaders

Rne (Rnase E) Upstream Region in P. aeruginosa

Rne (Rnase E) Upstream Region

• Upstream region of Rne in E. coli forms stem loop to regulate mRNA synthesis

• Region aligns reasonably well between spp.• Also predicted to encode a small non-coding RNA using QRNA

and RNAz

E. coli PAO1

Summary

• Genes are complex structures, which are difficult to predict

• Different approaches to gene finding:– Ab Initio : i.e. GenScan (use appropriate training

dataset!)– Homology guided: i.e. GeneWise

Outstanding Issues

Most Gene finders cannot handle untranslated regions (UTRs)

~40% of human genes have non-coding 1st exons (UTRs)

Most gene finders cannot handle alternative splicing

Most gene finders cannot handle overlapping or nested genes

Bottom Line

Gene finding in eukaryotes is not solvedAccuracy of the best methods approaches 80%

at the exon level (90% at the nucleotide level) in coding-rich regions (much lower for whole genomes)

Gene predictions should always be verified by other means (cDNA sequencing RNA-seq/EST, BLAST search)

98

Whole Genome Annotation

• A lot of work has been done for you!

• NCBI Map Viewers• Ensembl• UCSC Genome Browser

Top 10 Future ChallengesEwan Birney, Chris Burge, Jim Fickett in Genome Technology

1. Precise, predictive model of transcription initiation and termination: ability to predict where and when transcription will occur in a genome

2. Precise, predictive model of RNA splicing/ alternative splicing: ability to predict the splicing pattern of any primary transcript in any tissue

3. Precise, quantitative models of signal transduction pathways: ability to predict cellular responses to external stimuli

4. Determining effective protein:DNA, protein:RNA and protein:protein recognition codes

5. Accurate ab initio protein structure prediction

5 minute break?

GENE REGULATION AND PROMOTER ANALYSIS

NGS Bioinformatics Workshop1.5 Genome Annotation

With material from Wyeth W. Wasserman and Shannan Ho Sui

ObjectivesNote: Focus on Eukaryotic cells and Polymerase II Promoters

• Overview of transcription– Structure of eukaryotic promoter– Laboratory methods for identifying binding motifs

• Prediction of transcription factor binding sites using binding profiles (“Discrimination”)– TFBS scanning – a great example of how bioinformatics

can be tweaked when we don’t fully understand the biology!

• Interrogation of sets of co-expressed genes to identify mediating transcription factors– TFBS Over-representation

• Detection of novel motifs (TFBS) over-represented in regulatory regions of co-expressed genes (“Discovery”)– Motif Discovery!

Transcription Simplified

TATATFBS

TF Pol-II

Three-step Process:1. Transcription factor (TF; a

protein) binds to transcription factor binding site (TFBS; in DNA)

2. TF catalyzes recruitment of polII complex

3. Production of RNA from transcription start site (TSS)

TSS

Anatomy of Transcriptional RegulationWARNING: Terms vary widely in meaning between scientists

• Core Promoter – Sufficient for initiation of transcription; orientation dependent (i.e. immediately upstream, functions in only one orientation)

• TSR – transcription start region– Refers to a region rather than specific start site (TSS)

• TFBS – single transcription factor binding site• Regulatory Regions

• Proximal/Distal – vague reference to distance from TSR• May be positive (enhancing) or negative (repressing)• Orientation independent and location independent (generally)• Modules – Sets of TFBS within a region that function together

EXONTFBS TATA

TSR

TFBSTFBS

Core Promoter/Initiation Region (Inr)

TFBSTFBS

Distal Regulatory Region Proximal Regulatory Region

EXONTFBS TFBS

Distal Regulatory region

105

Complexity in Transcription

Distal enhancer

Distal enhancerProximal enhancer Core Promoter

Chromatin

Refer again to Transcription in video: http://www.youtube.com/watch?v=41_Ne5mS2ls

http://www.youtube.com/watch?v=41_Ne5mS2ls

106

Transcription Factors

• Bind to enhancers found in vicinity of gene

• These proteins promote transcription by (i) binding DNA and (ii) activating transcription– These two functions usually reside on separate structural domains of

the transcription factor protein– Activation occurs via interactions with other transcription factors

and /or RNA polymerase

Structural motifs within different types of transcription factors.

See also http://www.youtube.com/watch?v=7EkSBBDQmpE

DNA Binding Domains – Conserved Motifs

Zinc Finger Motif

Helix-turn-Helix MotifHelix-loop-Helix

Motif

Leucine Zipper Motif

http://www.youtube.com/watch?v=7EkSBBDQmpE

Lab Discovery of TF Binding Sites

LUCIFERASE

LUCIFERASE

LUCIFERASE

LUCIFERASE

LUCIFERASE

LUCIFERASE

LUCIFERASE

Reporter Gene Activity

mutation

0% 100%

Identify functional regulatory region within a sequence and delineate specific TFBS through mutagenesis (and in vitro binding studies)

EMSA/Gel Shift Assays to Identify Binding Proteins

http://www.biomedcentral.com/content/figures/1741-7015-4-28-8.jpg

TF + DNA

DNA

High-throughput Methods

• SELEX– mix random ds DNA oligonucleotides with TF protein,

recover TF-DNA complexes and sequence DNA

• Protein Binding Arrays (UniProbe Database)– prepare arrays with ds DNA attached, label protein with a

fluorescent mark and observe DNA bound by protein

• ChIP (Chromatin Immunoprecipitation)– covalently link proteins to DNA in cell, shear DNA, recover

protein-DNA complexes and identify DNA (PCR, array or sequencing)

Representing Binding Sites for a TF

• A set of sites represented as a consensus• VDRTWRWWSHD (IUPAC degenerate DNA)

• A single site• AAGTTAATGA

Set of binding

sitesAAGTTAATGACAGTTAATAAGAGTTAAACACAGTTAATTAGAGTTAATAACAGTTATTCAGAGTTAATAACAGTTAATCAAGATTAAAGAAAGTTAACGAAGGTTAACGAATGTTGATGAAAGTTAATGAAAGTTAACGAAAATTAATGAGAGTTAATGAAAGTTAATCAAAGTTGATGAAAATTAATGAATGTTAATGAAAGTAAATGAAAGTTAATGAAAGTTAATGAAAATTAATGAAAGTTAATGAAAGTTAATGAAAGTTAATGAAAGTTAATGA

Set of binding


IUPAC (International Union of Pure and Applied Chemistry) nomenclature

• A way to represent ambiguity when more than one value is possible at a particular position

A = adenineC = cytosineG = guanineT = thymineR = G or A (purine)Y = T or C (pyrimidine)K = G or T (keto)M = A or C (amino)S = G or C (strong bonds)

W = A or T (weak bonds)B = G, T or C (all but A)D = G, A or T (all but C)H = A, C or T (all but G)V = G, C or A (all but T)N = A, G, C or T (any)

For DNA

Representing Binding Sites for a TF

A 14 16 4 0 1 19 20 1 4 13 4 4 13 12 3C 3 0 0 0 0 0 0 0 7 3 1 0 3 1 12G 4 3 17 0 0 2 0 0 9 1 3 0 5 2 2T 0 2 0 21 20 0 1 20 1 4 13 17 0 6 4

• A matrix describing a set of sites:

• A set of sites represented as a consensus• VDRTWRWWSHD (IUPAC degenerate DNA)

• A single site• AAGTTAATGA

Set of binding


Set of binding


Logo – A graphical representation of frequency matrix. Y-axis is information content , which reflects the strength of the pattern in each column of the matrix

113

TGCTG = 0.9

Conversion of PFMs to Position Specific Scoring Matrices (PSSM)

PSSMs also known as Position Weight Matrices(PWMs)

Add the following features to the matrix profile:1. Correct for nucleotide frequencies in genome2. Weight for the confidence (depth) in the pattern3. Convert to log-scale probability for easy arithmetic

A 5 0 1 0 0C 0 2 2 4 0G 0 3 1 0 4T 0 0 1 1 1

A 1.6 -1.7 -0.2 -1.7 -1.7 C -1.7 0.5 0.5 1.3 -1.7 G -1.7 1.0 -0.2 -1.7 1.3T -1.7 -1.7 -0.2 -0.2 -0.2

pfm pssm

Log2 ( )f(b,i) + s(n)p(b)

f(b,i) = frequency of base b at position is(n) = pseudocount correction (optionally applied to small samples of binding sites)p(b) = background probability of base b

PSSM Scoring Scales

• Raw scores• Sum of values from indicated cells of the matrix

• Relative Scores (most common)• Normalize the scores to range of 0-1 or 0%-100%

• Empirical p-values• Based on distribution of scores for some DNA

sequence, determine a p-value (see next slide)

Detecting binding sites in a single sequenceRaw Scores

A [-0.2284 0.4368 -1.5 -1.5 -1.5 0.4368 -1.5 -1.5 -0.2284 0.4368 ]C [-0.2284 -0.2284 -1.5 -1.5 1.5128 -1.5 -0.2284 -1.5 -0.2284 -1.5 ]G [ 1.2348 1.2348 2.1222 2.1222 0.4368 1.2348 1.5128 1.7457 1.7457 -1.5 ]T [ 0.4368 -0.2284 -1.5 -1.5 -0.2284 0.4368 0.4368 0.4368 -1.5 1.7457 ]

ACCCTCCCCAGGGGCGGGGGGCGGTGGCCAGGACGGTAGCTCC Sp1

Relative ScoresA [-0.2284 0.4368 -1.5 -1.5 -1.5 0.4368 -1.5 -1.5 -0.2284 0.4368 ]C [-0.2284 -0.2284 -1.5 -1.5 1.5128 -1.5 -0.2284 -1.5 -0.2284 -1.5 ]G [ 1.2348 1.2348 2.1222 2.1222 0.4368 1.2348 1.5128 1.7457 1.7457 -1.5 ]T [ 0.4368 -0.2284 -1.5 -1.5 -0.2284 0.4368 0.4368 0.4368 -1.5 1.7457 ]

A [-0.2284 0.4368 -1.5 -1.5 -1.5 0.4368 -1.5 -1.5 -0.2284 0.4368 ]C [-0.2284 -0.2284 -1.5 -1.5 1.5128 -1.5 -0.2284 -1.5 -0.2284 -1.5 ]G [ 1.2348 1.2348 2.1222 2.1222 0.4368 1.2348 1.5128 1.7457 1.7457 -1.5 ]T [ 0.4368 -0.2284 -1.5 -1.5 -0.2284 0.4368 0.4368 0.4368 -1.5 1.7457 ]

Max_score = 15.2 (sum of highest column scores)

Min_score = -10.3 (sum of lowest column scores)

93%

100%10.3)(15.2

(-10.3)-13.4

% 100Min_score - Max_scoreMin_score - Abs_score

Rel_score

Empirical p-value Scores

Abs_score = 13.4 (sum of column scores)

0.3

0.2

0.1

0.0

Freq

uenc

y0.0 0.2 0.4 0.6 0.8 1.0

Relative Score

Area to right of value

Area under entire curve

116

JASPAR:

AN OPEN-ACCESS DATABASE

OF TF BINDING PROFILES( jaspar.genereg.net )

(Some other databases with TF profiles include Transfac, TRRD, mPromDB, SCPD (yeast),

dbTBS and EcoTFS (bacteria))

http://www.genereg.net/

The Good…

• Tronche (1997) tested 50 predicted HNF1 TFBS using an in vitro binding test and found that 96% of the predicted sites were bound!

• Stormo and Fields (1998) found in detailed biochemical studies that the best weight matrices produce scores highly correlated with in vitro binding energy

PSSM SCORE

BIN

DIN

GE

NE

RG

Y

…the Bad…

• Fickett (1995) found that a profile for the myoD TF made predictions at a rate of 1 per ~500bp of human DNA sequence

– This corresponds to an average of 20 sites / gene (assuming 10,000 bp as average human gene size)

…and the Ugly!Human Cardiac a-Actin gene analyzed

with a set of profiles(each line represents a TFBS prediction)

Futility Conjecture: TFBS predictions, as a collective group,

are almost always wrong!

True binding sites are defined by properties not incorporated into

the PSSM profile scores

Red boxes are protein coding exons - TFBS predictions excluded in this analysis

What have we learned?• PSSMs accurately reflect in vitro binding properties of

DNA binding proteins

• Suitable binding sites occur at a rate far too frequent to reflect in vivo function

• Bioinformatics methods that use PSSMs for binding site studies must incorporate additional information to enhance specificity

• Unfiltered predictions are too noisy for most applications• Note: Organisms with short regulatory sequences are less

problematic (e.g. yeast and bacteria)

Example of a bioinformatics challenge that needs more biological information to make predictions

USING PHYLOGENETIC FOOTPRINTING TO IMPROVE TFBS DISCRIMINATION

70,000,000 years of evolution can reveal regulatory regions

% Id

en

tity

Actin gene compared between human and mouse200 bp Window Start Position (human sequence)

Phylogenetic Footprinting

• Align orthologous gene sequences (e.g. use LAGAN). For 1st window of x bp (i.e. 200 bp), of sequence#1, determine % identity with sequence#2. Step across (slide window across) the first sequence, recording % identity in each window with the second sequence.

• Observe high identity with exons, lower identity in 5’ and 3’ UTRs• Additional conserved region could be regulatory region

Multi-species Phylogenetic Footprinting

124

Phylogenetic Footprinting Dramatically Reduces Spurious Hits

Human

Mouse

Actin, alpha cardiac

TFBS Prediction with Human & Mouse Pairwise Phylogenetic Footprinting

• Testing set: 40 experimentally defined sites in 15 well studied genes (Replicated with 100+ site set)

• 75-80% of defined sites detected with “sequence conservation filter” (incorporating orthologous sequence data), while only 11-16% of total predictions retained

SELECTIVITY SENSITIVITY

126

1kbp insulin receptor promoter screened with footprinting

Choosing the ”right” species for pairwise comparison...

COW

MOUSE

CHICKEN

HUMAN

HUMAN

HUMAN

ConSite

http://consite.genereg.net/

TFBS Discrimination Tools I

• Phylogenetic Footprinting Servers– FOOTER http://biodev.hgen.pitt.edu/footer_php/Footerv2_0.php

– CONSITE http://asp.ii.uib.no:8090/cgi-bin/CONSITE/consite/ or

http://consite.genereg.net/

– rVISTA http://rvista.dcode.org/

– ORCAtk http://burgundy.cmmt.ubc.ca/cgi-bin/OrcaTK/orcatk

• SNPs in TFBS Analysis– RAVEN http://burgundy.cmmt.ubc.ca/cgi-bin/RAVEN/a?rm=home

TFBS Discrimination Tools II• Prokaryotes or Yeast

– PRODORIC http://prodoric.tu-bs.de/– YEASTRACT http://www.yeastract.com/index.php

• Software Packages– TOUCAN http://homes.esat.kuleuven.be/~saerts/software/toucan.php

• Programming Tools– TFBS http://tfbs.genereg.net/– ORCAtk http://burgundy.cmmt.ubc.ca/cgi-bin/OrcaTK/orcatk

• TFBS discrimination coupled with phylogenetic footprinting has greater specificity with tolerable loss of sensitivity

• As with any purification process, some true binding sites will be lost

• There are several online resources that provide precomputed analyses using phylogenetic footprinting

What have we learned?

TFBS Over-representation

• We seek to determine if a set of co-expressed genes contains an over-abundance of predicted binding sites for a known TF

• Phylogenetic footprinting to reduce false prediction rate

• Similar to looking for functional over-representation (i.e. identify annotations of gene function in sets of co-expressed genes that are over-represented)

Two Examples of TFBS Over-Representation

Foreground

Background

More Genes with TFBS

Foreground

Background

More Total TFBS

Statistical Methods for Identifying Over-represented TFBS

• Binomial test (Z scores)– Based on the number of occurrences of the TFBS relative

to background– Normalized for sequence length– Simple binomial distribution model

• Fisher exact probability scores– Based on the number of genes containing the TFBS relative

to background– Hypergeometric probability distribution

oPOSSUM Server

Validation using Reference Gene Sets

TFs with experimentally-verified sites in the reference sets.

A. Muscle-specific (23 input; 16 analyzed)

B. Liver-specific (20 input; 12 analyzed)

Rank Z-score Fisher Rank Z-score Fisher

SRF 1 21.41 1.18e-02 HNF-1 1 38.21 8.83e-08

MEF2 2 18.12 8.05e-04 HLF 2 11.00 9.50e-03

c-MYB_1 3 14.41 1.25e-03 Sox-5 3 9.822 1.22e-01

Myf 4 13.54 3.83e-03 FREAC-4 4 7.101 1.60e-01

TEF-1 5 11.22 2.87e-03 HNF-3beta 5 4.494 4.66e-02

deltaEF1 6 10.88 1.09e-02 SOX17 6 4.229 4.20e-01

S8 7 5.874 2.93e-01 Yin-Yang 7 4.070 1.16e-01

Irf-1 8 5.245 2.63e-01 S8 8 3.821 1.61e-02

Thing1-E47 9 4.485 4.97e-02 Irf-1 9 3.477 1.69e-01

HNF-1 10 3.353 2.93e-01 COUP-TF 10 3.286 2.97e-01

Structurally-related TFs with Indistinguishable TFBS

• Most structurally related TFs bind to highly similar patterns– Zn-finger is a

big exception

• TFBS over-representation analysis can identify increased number of sites in a set of co-expressed genes, or increased number of genes in a co-expressed gene set that have a given site.

• Generally best performance has been with data directly linked to a transcription factor

• Highly dependent on the experimental design – cannot overcome noisy data from poor design

Note that the identity of a mediating TF may not be apparent when many proteins can bind to the same motif

Notes

de novo Pattern Discovery

• String-based – e.g. YMF (Sinha & Tompa)– Generalization: Identify over-represented oligomers in

comparison of “+” and “-” (or complete) promoter collections

– Used often for yeast promoter analysis

• Profile-based – e.g. AnnSpec (Workman & Stormo) or MEME (Bailey &

Elkin)– Generalization: Identify strong patterns (position weight

matrices) in “+” promoter collection vs. background model of expected sequence characteristics

String-based methods(1)

How likely are X words in a set of sequences, given background

sequence characteristics?CCCGCCGGAATGAAATCTGATTGACATTTTCC >EP71002 (+) Ce[IV] msp-56 B; range -100 to -75 TTCAAATTTTAACGCCGGAATAATCTCCTATT >EP63009 (+) Ce Cuticle Col-12; range -100 to -75 TCGCTGTAACCGGAATATTTAGTCAGTTTTTG >EP63010 (+) Ce Cuticle Col-13; range -100 to -75 TATCGTCATTCTCCGCCTCTTTTCTT >EP11013 (+) Ce vitellogenin 2; range -100 to -75 GCTTATCAATGCGCCCGGAATAAAACGCTATA >EP11014 (+) Ce vitellogenin 5; range -100 to -75 CATTGACTTTATCGAATAAATCTGTT >EP11015 (-) Ce vitellogenin 4; range -100 to -75 ATCTATTTACAATGATAAAACTTCAA >EP11016 (+) Ce vitellogenin 6; range -100 to -75 ATGGTCTCTACCGGAAAGCTACTTTCAGAATT >EP11017 (+) Ce calmodulin cal-2; range -100 to -75 TTTCAAATCCGGAATTTCCACCCGGAATTACT >EP63007 (-) Ce cAMP-dep. PKR P1+; range -100 to -75 TTTCCTTCTTCCCGGAATCCACTTTTTCTTCC >EP63008 (+) Ce cAMP-dep. PKR P2; range -100 to -75 ACTGAACTTGTCTTCAAATTTCAACACCGGAA >EP17012 (+) Ce hsp 16K-1 A; range -100 to -75 TCAATGCCGGAATTCTGAATGTGAGTCGCCCT >EP55011 (-) Ce hsp 16K-1 B; range


Find all words of length n in the yeast promoters (e.g. n=7)

Make a lookup table:

AAACCTTT 456TTTTTTTT 57788GATAGGCA 589

Etc...

GTCTTTATCTTCAAAGTTGTCTGTCCAAGATTTGGACTTGAAGGACAAGCGTGTCTTCTCAGAGTTGACTTCAACGTCCCATTGGACGGTAAGAAGATCACTTCTAACCAAAGAATTGTTGCTGCTTTGCCAACCATCAAGTACGTTTTGGAACACCACCCAAGATACGTTGTCTTGTTCTCACTTGGGTAGACCAAACGGTGAAAGAAACGAAAAATACTCTTTGGCTCCAGTTGCTAAGGAATTGCAATCATTGTTGGGTAAGGATGTCACCTTCTTGAACGACTGTGTCGGTCCAGAAGTTGAAGCCGCTGTCAAGGCTTCTGCCCCAGGTTCCGTTATTTTGTTGGAAAACTGCGTTACCACATCGAAGAAGAAGGTTCCAGAAAGGTCGATGGTCAAAAGGTCAAGGCTCAAGGAAGATGTTCAAAAGTTCAGACACGAATTGAGCTCTTTGGCTGATGTTTACATCACGATGCCTTCGGTACCGCTCACAGAGCTCACTCTTCTATGGTCGGTTTCGACTTGCCAACGTGCTGCCGGTTTCTTGTTGGAAAAGGAATTGAAGTACTTCGGTAAGGCTTTGGAGAACCCAACCAGACCATTCTTGGCCATCTTAGGTGGTGCCAAGGTTGCTGACAAGATTCAATTGATTGACAACTTGTTGGACAAGGTCGACTCTATCATCATTGGTGGTGGTATGGCTTTCCCTTCAAGAAGGTTTTGGAAAACACTGAAATCGGTGACTCCATCTTCGACAAGGCTGGTGCTGAAATCGTTCCAAAGTTGATGGAAAAGGCCAAGGCCAAGGGTGTCGAAGTCGTCTTGCAGTCGACTTCATCATTGCTGATGCTTTCTCTGCTGATGCCAACACCAAGACTGTCACTGACAAGGAAGGTATTCCAGCTGGCTGGCAAGGGTTGGACAATGGTCCAGAATCTAGAAAGTGTTTGCTGCTACTGTTGCAAAGGCTAAGACCATTGTCTGGAACGGTCCACCAGGTGTTTTCGAATTCGAAAAGTTCGCTGCTGGTACTAAGGCTTTGTTAGACGAAGTTGTCAAGAGCTCTGCTGCTGGTAACACCGTCATCATTGGTGGTGGTGACACTGCCA

Xw: Instances of a word w within our set of X genes

E[Xw]: Average number of instances of w based on number of genes in our set

Var[Xw]: Variance – how much deviation from the average is expected for w

w

www XVar

XEXZ


Limitations of String-based Methods• Longer word lengths not possible

– Number of possible strings of length n is 4n without any ambiguities and more if we allow ambiguous bases. Thus, as n increases, the computational time notably increases.

• While degeneracy codes can be used, TFBS are not words – we lose quantitation for variable positions with consensus sequences– Imagine column in PFM with 7 A’s and 1 T --- in a consensus

sequence we would represent as W or throw out the instance with T

• Recently the string-based method has found renewed utility in the analysis of 3’UTRs for the presence of microRNA target sequences...

Aside: microRNA Target Sequences

• Lim et al expressed miRNAs in cells and observed that the overall pattern of gene expression shifted toward the pattern of expression observed in cells which naturally express the miRNA

• The genes with reduced expression in response to miRNA exposure shared 7nt motifs the 3’UTR of their transcripts

• Nice website tutorial:– http://www.ambion.com/main/explorations/mirna.html

Motivation:

TFBS are not words, but rather sequence patterns

Efficiency – can handle longer patterns than string-based methods

Can be intentionally influenced to reflect prior knowledge

Overview: Find a local alignment of width x of sites that maximizes information content (or related measure) in reasonable time

Usually by Gibbs sampling (i.e. Gibbs Motif Sampler) or Expectation Maximization methods (i.e. MEME)

Probabilistic Methods

Ab initio motif finding: Gibbs sampling

• Popular “Markov chain Monte Carlo” algorithm for motif discovery

• Motif model: Position Weight Matrix• Local search algorithm

Gibbs sampling: basic idea

Current motif = PWM formedby circled substrings


Delete one substring


Try a replacement:Compute its score, Accept the replacementdepending on the score.


New motif

Gibbs Sampling(grossly over-simplified)

tgacttcctgctacct

agacctcactgtagtgacgcatct

cgatacgcttcgctcc

1 2 3 4 5 6 7 8A 2 0 2 2 2 1 0 1C 0 2 3 3 2 1 6 2G 0 4 1 0 1 0 1 1T 4 1 1 2 2 5 0 2

i.e. Gibbs Motif Sampler: http://bayesweb.wadsworth.org/gibbs/gibbs.html

http://www.youtube.com/watch?v=_4yrlzVTr_k

http://www.youtube.com/watch?v=_4yrlzVTr_k

Pattern Discovery

• Gibbs sampling is guaranteed to return an optimal pattern if repeated sufficiently often– Procedure is fast, so running many 1000s of times

is feasible• Problems if the TFBS’s are not strongly over-

represented relative to other patterns…

Applied Pattern Discovery is Acutely Sensitive to Noise

True Mef2 Binding Sites

10

12

14

16

18

0 100 200 300 400 500 600

SEQUENCE LENGTH

PATTE

RN

SIM

ILA

RIT

Yvs.

TR

UE

ME

F2 P

RO

FILE

Pink line is negative control with no Mef2 sites included

Ab initio motif finding: Expectation Maximization

• Popular algorithm for motif discovery• Motif model: Position Weight Matrix• Local search algorithm

– Move from current choice of motif to a new similar motif, so as to improve the score

– Keep doing this until no more improvement is obtained: converge to local optima

Basic idea of iteration

PWMCurrent motif1.

Scan sequence for good matches to the current motif.

2.

3. Build a new PWM out of these matches, and make it the new motif

The basic idea can be formalized in the language of probability

MEME

• Popular motif finding program that uses Expectation-Maximization

• Web sitehttp://meme.sdsc.edu/meme/website/meme.html

http://meme.sdsc.edu/meme/website/meme.html

Four Approaches to Improve Sensitivity• Better background models

-Higher-order properties of DNA

• Phylogenetic Footprinting– Human:Mouse comparison eliminates ~75% of sequence

• Regulatory Modules – Architectural rules

• Limit the types of binding profiles allowed– TFBS patterns are NOT random

(Also, for prokaryotes in particular, accurate ID of transcription start sites can greatly aid identification of promoter sequences)

Pattern Discovery Summary

• Pattern discovery methods can recover over-represented patterns in the promoters of co-expressed genes

• Methods are acutely sensitive to noise – notable problem if the signal we seek is weak

• TFs tolerate great variability between binding sites

• Supplementary information/approaches (such as phylogenetic footprinting) may be required to over-come the noise

REFLECTIONS• TFBS prediction

– Futility Theorem – Essentially predictions of individual TFBS have no relationship to an in vivo function

– Current successful bioinformatics methods incorporate additional information (clusters, sequence conservation)

• TFBS over-representation– TFBS over-representation is powerful way to identify TFs

likely to contribute to observed patterns of co-expression• Novel motif discovery (also useful in general)

– Pattern discovery methods are restricted by the Signal-to-Noise problem

• Observed patterns must be carefully considered– Again, successful methods incorporate additional

information (conservation, structural constraints on TFs, and (for prokaryotes) positional information relative to TSS)

NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator:...

Documents

Transcript of NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator:...