NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator:...

159
NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th , 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment: This week’s slides partly courtesy of Professor Fiona Brinkman, MBB, with material from Wyeth W. Wasserman and Shannan Ho Sui

Transcript of NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator:...

Page 1: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

NGS Bioinformatics Workshop1.5 Genome Annotation

April 4th, 2012IRMACS 10900, SFU

Facilitator: Richard BruskiewichAdjunct Professor, MBB

Acknowledgment:This week’s slides partly courtesy of Professor Fiona Brinkman, MBB,

with material from Wyeth W. Wasserman and Shannan Ho Sui

Page 2: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

OverviewGeneral observations about genomesRepeat maskingGene findingGene regulation and promoter analysis

Page 3: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

SOME GENERAL OBSERVATIONS ABOUT GENOMES

NGS Bioinformatics Workshop - 1.5 Genome Annotation

3

Page 4: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

General Variables of GenomesProkaryote versus Eukaryote versus OrganelleGenome size:

Number of chromosomesNumber of base pairsNumber of genes

GC/AT relative contentRepeat contentGenome duplications and polyploidyGene content

See: Genomes, 2nd edition Terence A Brown. ISBN-10: 0-471-25046-5See NCBI Bookshelve: http://www.ncbi.nlm.nih.gov/books/NBK21128/

Page 5: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Eukaryote versus Prokaryote Genomes

Eukaryote Prokaryote

Size

Large (10 Mb – 100,000 Mb)

There is not generally a relationship between organism complexity and its genome size (many plants have larger genomes than human!)

Generally small (<10 Mb; most < 5Mb)

Complexity (as measured by # of genes and metabolism) generally proportional to genome size

Content Most DNA is non-coding DNA is “coding gene dense”

Telomeres/ Centromeres

Present (Linear DNA)

Circular DNA, doesn't need telomeres

Don’t have mitosis, hence, no centromeres.

Number of chromosomes

More than one, (often) including those discriminating sexual identity

Often one, sometimes more, -but plasmids, not true chromosome.

Chromatin Histone bound (which serves as a genome regulation point)

No histones

Uses supercoiling to pack genome

Page 6: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Eukaryote versus Prokaryote Genomes

Eukaryote Prokaryote

Genes

Often have introns

Intraspecific gene order and number generally relatively stable

many non-coding (RNA) genes

There is NOT generally a relationship between organism complexity and gene number

No introns

Gene order and number may vary between strains of a species

Gene regulation

Promoters, often with distal long range enhancers/silencers, MARS, transcriptional domains

Generally mono-cistronic

Promoters

Enhancers/silencers rare

Genes often regulated as polycistronic operons

Repetitive sequences Generally highly repetitive with genome wide

families from transposable element propagation

Generally few repeated sequences

Relatively few transposons

Organelle (subgenomes)

Mitochondrial (all)

chloroplasts (in plants) Absent

Page 7: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Genome SizePhysical:

Amount of DNA / number of base pairsNumber of chromosomes/linkage groupsInformation resources:

NCBI: http://www.ncbi.nlm.nih.gov/genomeAnimals: http://www.genomesize.com/Plants: http://data.kew.org/cvalues/Fungi: http://www.zbi.ee/fungal-genomesize/

Genetic: Number of genes in the genome

Gregory TR. 2002. Genome size and developmental complexity. Genetica. May;115(1):131-46.

Page 8: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Size of Organelle GenomesSpecies Type of organism Genome size (kb)

Mitochondrial genomes

Plasmodium falciparum Protozoan (malaria parasite) 6

Chlamydomonas reinhardtii Green alga 16

Mus musculus Vertebrate (mouse) 16

Homo sapiens Vertebrate (human) 17

Metridium senile Invertebrate (sea anemone) 17

Drosophila melanogaster Invertebrate (fruit fly) 19

Chondrus crispus Red alga 26

Aspergillus nidulans Ascomycete fungus 33

Reclinomonas americana Protozoa 69

Saccharomyces cerevisiae Yeast 75

Suillus grisellus Basidiomycete fungus 121

Brassica oleracea Flowering plant (cabbage) 160

Arabidopsis thaliana Flowering plant (vetch) 367

Zea mays Flowering plant (maize) 570

Cucumis melo Flowering plant (melon) 2500

Chloroplast genomes

Pisum sativum Flowering plant (pea) 120

Marchantia polymorpha Liverwort 121

Oryza sativa Flowering plant (rice) 136

Nicotiana tabacum Flowering plant (tobacco) 156

Chlamydomonas reinhardtii Green alga 195

http://www.ncbi.nlm.nih.gov/books/NBK21120/table/A5511

Page 9: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Size of Prokaryote GenomesSpecies DNA molecules Size (Mb) Number of genes

Escherichia coli K-12 One circular molecule 4.639 4397

Vibrio cholerae El Tor N16961 Two circular molecules

 Main chromosome 2.961 2770

 Megaplasmid 1.073 1115Deinococcus radiodurans R1 Four circular molecules

 Chromosome 1 2.649 2633 Chromosome 2 0.412 369 Megaplasmid 0.177 145 Plasmid 0.046 40

Borrelia burgdorferi B31 seven or eight circular molecules, 11 linear molecules

 Linear chromosome 0.911 853

 Circular plasmid cp9 0.009 12

 Circular plasmid cp26 0.026 29

 Circular plasmid cp32* 0.032 Not known

 Linear plasmid lp17 0.017 25

 Linear plasmid lp25 0.024 32

 Linear plasmid lp28-1 0.027 32

 Linear plasmid lp28-2 0.030 34

 Linear plasmid lp28-3 0.029 41

 Linear plasmid lp28-4 0.027 43

 Linear plasmid lp36 0.037 54

 Linear plasmid lp38 0.039 52

 Linear plasmid lp54 0.054 76

 Linear plasmid lp56 0.056 Not known

http://www.ncbi.nlm.nih.gov/books/NBK21120/table/A5524

Page 10: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Size of Eukaryote GenomesSpecies Genome size (Mb)

Fungi

Saccharomyces cerevisiae 12.1

Aspergillus nidulans 25.4

Protozoa

Tetrahymena pyriformis 190

Invertebrates

Caenorhabditis elegans 97

Drosophila melanogaster 180

Bombyx mori (silkworm) 490

Strongylocentrotus purpuratus (sea urchin) 845

Locusta migratoria (locust) 5000

Vertebrates

Takifugu rubripes (pufferfish) 400

Homo sapiens 3200

Mus musculus (mouse) 3300

Plants

Arabidopsis thaliana (vetch) 125

Oryza sativa (rice) 430

Zea mays (maize) 2500

Pisum sativum (pea) 4800

Triticum aestivum (wheat) 16 000

Fritillaria assyriaca (fritillary) 120 000

http://www.ncbi.nlm.nih.gov/books/NBK21120/table/A5471

Page 11: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

http://en.wikipedia.org/wiki/Genome_sizehttp://en.wikipedia.org/wiki/Genome#Comparison_of_different_genome_sizes

Page 12: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Number of GenesSpecies Ploidy Cs Size (Mb) No. Genes

Saccharomyces cerevisiae 2 16 12 6,281

Plasmodium falciparum 2 14 23 5,509

Caenorhabditis elegans 2 6 100 21,175

Drosophila melanogaster 2 6 139 15,016

Oryza sativa 2 12 410 30,294

Canis lupus familaris 2 39 2,445 24,044

Homo sapiens 2 24 3,100 36,036

Zea mays 2 10 2,046 42,000-56,000 (*)

Protopterus aethiopicus ? ? 130,000 ?

Paris japonica 8 40 150,000 ?

Polychaos dubium ? ? 670,000 ?

http://www.ncbi.nlm.nih.gov/genome

(*) Haberer et al., Structure and architecture of the maize genome. Plant Physiol. 2005 Dec139(4):1612-24

Page 13: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

AT/GC content Regional variations correlates with genomic content

and function like transposable element distribution, gene density, gene regulation, methylation, etc.

Often introduces bias in sequencing processes (e.g. library yields, PCR amplification, NGS sequencing)

Species GC%Streptomyces coelicolor A3(2) 72Plasmodium falciparum 20Arabidopsis thaliana 36Saccharomyces cerevisiae 38Arabidopsis thaliana 36Homo sapiens 41 (35 – 60)

Romiguier et al. 2010. Contrasting GC-content dynamics across 33 mammalian genomes: Relationship with life-history traits and chromosome sizes. Genome Res. 20: 1001-1009

Page 14: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Repeat Content Large genomes generally reflect evolutionary expansion of

large families of repetitive DNA (by RNA/DNA transposon amplification/insertion, genetic recombination)

Repeats drive genome mutational processes: Recombination resulting in insertion, deletion, translocation,

segmental duplication of DNA Insertional mutagenesis, possibly including de novo creation of genes Insert novel regulatory signals

Repeats generally confound genome sequence assembly (especially for NGS, due to short reads). Gene annotation can also be problematic as transposons mimic gene structures.

Jurka et al. 2007. Repetitive sequences in complex genomes: structure and evolution. Annu Rev Genomics Hum Genet. 2007;8:241-59.

Page 15: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Genome Duplications/PolyploidySegmental duplications (i.e. by recombination)

Tandem: direct and invertedWhole genome duplication & loss, e.g.

Ancestral vertebrate: 2 roundso HOX gene clusters…

Polyploidy - ~70% of all angiospermsGenomic hybridization (allopolyploids)Can lead to immediate and extensive changes in

gene expressionMapping of homeologous gene loci can be tricky

Dehal P and Boore JL.2005. Two Rounds of Whole Genome Duplication in the Ancestral Vertebrate. PLoS Biol 3(10) : e314. doi:10.1371

Adams and Wendel. 2005. Polyploidy and genome evolution in plants. Curr. Opin. Plant Biol. 8(2):135–141

Page 16: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Case Study: Rice 10 duplicated blocks identified on all 12 chromosomes

of Oryza sativa subspecies indica, that contained 47% of the total predicted genes.

Possible genome duplication occurred ~70 million years ago, supporting the polyploidy origin of rice.

Additional segmental duplication identified involving chromosomes 11 and 12, ~5 million years ago.

Following the duplications, there have been large-scale chromosomal rearrangements and deletions. About 30–65% of duplicated genes were lost shortly after the duplications, leading to a rapid diploidization.

Wang et al. 2005. Duplication and DNA segmental loss in the rice genome:implications for diploidization New Phytologist 165: 937–946

Page 17: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Gene Content

http://www.ncbi.nlm.nih.gov/books/NBK21120/figure/A5501

Page 18: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

The Bottom Line

All of these genomic variables:Type of organism: i.e. prokaryote versus eukaryoteGenome sizeGC/AT relative contentRepeat contentGenome duplications and polyploidyGene content

are important factors that can drive the strategy, expected outcome and efficacy of genome sequence assembly and annotation.

Page 19: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

GENOME REPEAT MASKING

NGS Bioinformatics Workshop - 1.5 Genome Annotation

19

Page 20: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Genomic (DNA) Sequence Repeat Masking

Classic approach: search against repeat libraries RepeatMasker

http://www.repeatmasker.org/Uses a previously compiled library of repeat familiesUses (user configured) external sequence search programComputationally intensive but……the project web site also provides “pre-masked” genomic

data for many completed genomes, complete with some statistical characterization.

Page 21: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:
Page 22: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

More Repeat Masking … de novo identification and classification:

RECON: http://www.genetics.wustl.edu/eddy/reconRepeatGluer: http://nbcr.sdsc.edu/euler/PILER: http://www.drive5.com/piler

Repeat databases:Repbase: http://www.girinst.org/repbase/index.htmlplants: http://plantrepeats.plantbiology.msu.edu/

Related algorithms:“probability clouds”

Gu et al. 2008. Identification of repeat structure in large genomes using repeat probability clouds. Anal Biochem. 380(1): 77–83

Page 23: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

GENE FINDING…OR “WHAT IS A GENE?”

NGS Bioinformatics Workshop - 1.5 Genome Annotation

23

Page 24: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

24

Objectives

Review of differences in prokaryotic and eukaryotic gene organization.

Understand consequences and challenges for gene finding algorithms for Prokaryotes and Eukaryotes.

Appreciate HMM as powerful tool (in many areas of computational biology!)

Be reminded that not all genes encode proteins and predictions of such genes have their own computational challenges.

Page 25: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Gene Annotation QuestionsWhich genes are present?How did they get there (evolution)?Are the genes present in more than one copy?Which genes are not there that we would

expect to be present?What order are the genes in, and does this

have any significance?How similar is the genome of one organism to

that of another?

Page 26: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

26

Why Gene-finding?

• Whole-genome annotation– Genome sequence does not give you list of all

genes• Fully characterizing Yfg (“your favourite gene”)

– example: A disease is associated with a SNP in a location in the human genome. BLAST finds similarity to a protein coding gene in the area, but its only similar to part of the whole protein. What’s the whole gene?

Page 27: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

After completing the human genome we faced 3 Gigabytes of this

Page 28: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Not immediately apparent where the genes are…

Page 29: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

What is a gene?

29

A distinct functional unit encoded (at minimum transcribed) in the genome

Eukaryotes: The region that is transcribed – the mRNA defines it

Prokaryotes: The coding region – because multiple genes may be on one transcript as an operon

Both eukaryotes and prokaryotes: non-coding RNAs are treated relatively the same

Page 30: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

30

Prokaryotes• High gene density• mRNA transcription-

translation is coupled

• Genes are usually contiguous stretches of coding DNA

• mRNAs often polycistronic gene

____________________

• Low gene density• mRNA transcribed then

transported to cytoplasm for translation

• Genes’ coding DNA often split by non-coding introns

• mRNAs are generally monocistronic gene

___________

Eukaryotes

Raw Biological Materials

Great real-time Transcription-Translation video: http://www.youtube.com/watch?v=41_Ne5mS2ls

transcript

Page 31: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

How many genes in human genome?• 2000: must be at least 100,000 (Rice has ~40,000,

C. elegans has ~19,000)

• 2001: only 35,000?

• 2005: Ensembl NCBI 35 release: 22,218 genes (33,869 transcripts)

• 2006: Ensembl NCBI 36 release: 23,710 protein coding genes, plus 4421 RNA genes (48,851 transcripts)

• Today: Ensembl 64 release, Sept 2011, is 20,900 coding genes + 14,266 RNA genes - but with alternative splicing these produce likely many more…

Page 32: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

How many genes in human genome?• 2000: must be at least 100,000 (Rice has ~40,000,

C. elegans has ~19,000)

• 2001: only 35,000?

• 2005: Ensembl NCBI 35 release: 22,218 genes (33,869 transcripts)

• 2006: Ensembl NCBI 36 release: 23,710 protein coding genes, plus 4421 RNA genes (48,851 transcripts)

• Today: Ensembl 64 release, Sept 2011, is 20,900 coding genes + 14,266 RNA genes - but with alternative splicing these produce likely >100,000 proteins (178,191 currently annotated in Ensembl)

Page 33: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

1 gene in how many basepairs?...a. 1:10,000,000b. 1:1,000,000c. 1:100,000 roughly for humand. 1:10,000 (1:5000 for C. elegans)e. 1:1000 roughly for most bacteriaf. 1:100g. 1:10

33

Gene Density

Page 34: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

34

Approaches to Finding Genes

• AB INITIO (‘from the beginning’):

– search by content: find genes by statistical properties that distinguish protein-coding DNA from non-coding DNA

– Search by signals/sites: splice donor/acceptor sites, promoters, etc.

• HOMOLOGY:– search by sequence similarity to homologous sequences

state-of-the-art gene finding combines these strategies and also uses other data like EST/RNA-seq/microarray data

Page 35: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

35

Gene-finding in Prokaryotes:Easy? ….or not?

• ORF Finder– Open reading frame (ORF) from methionine

codon to first Stop codon– ORFs linked to BLAST– http://www.ncbi.nlm.nih.gov/gorf/gorf.html

Problem: All ORFs are not genes. Why? How can this be improved?

Page 36: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

36

Gene Finding: Ab Initio Search by Content

• Protein code affects the statistical properties of a DNA sequence – some amino acids are used more frequently than others

(Leu more popular than Trp)– different numbers of codons for different amino acids

(Leu has 6, Trp has 1)– for a given amino acid, usually one codon is used more frequently than others

• this is termed codon preference • these preferences vary by species

Page 37: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

37

Gene-finding in Prokaryotes:Improving predictions…

Common way to search by content• build Markov models of coding & noncoding regions

apply to ORFs or fixed-sized sequence windowsMarkov Model approaches: prokaryotic gene prediction• Glimmer

– http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi– http://cbcb.umd.edu/software/glimmer/– open source

• GeneMark– http://opal.biology.gatech.edu/GeneMark/– not open source but slightly more accurate

Page 38: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

38Reproduced with permission,Chris Burge

i.e. If its sunny then the next weather state is not likely to become suddenly rainy.

If its cloudy then it is more likely the next weather state will be rainy.

Page 39: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

39Reproduced with permission,Chris Burge

For gene prediction:

Certain bases occur after others in coding sequence versus non-coding sequence

Page 40: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

What’s Hidden in a Hidden Markov Model?

• The HMM is a model of what actually happened (the true gene structure) but all you can see is the sequence

• Each nucleotide is not labeled with its state; it’s hidden

• Based on training data – which is critical!

Page 41: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Glimmer version 3

• annotates microbial genomes with a sensitivity of 99% and accuracy of >98%

• resolves overlapping genes• use appropriate gene dataset for training program• no annotation of rRNA and tRNA genes• employs interpolated Markov model

Page 42: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

42NAR Vol.26 p1107-1115

Page 43: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

43

GeneMark.hmm

• HMM plus ribosome binding site signals in primary nucleotide sequence

• training also required• Most accurate prokaryotic gene predictor (by a bit over

Glimmer)• GeneMarkS version - translation start prediction: 83.2% of

Bacillus subtilis genes 94.4% of Escherichia coli genes

Page 44: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

44

Main issues with prokaryotic gene prediction

• Which ATG or GTG start to choose? Not that accurate• Errors in sequence frame shifts that lead to errors in gene

prediction• Very small genes either not predicted at all (most prokaryotic

genome projects don’t annotate any genes < 90 bp) or poorly predicted

• Non-coding RNA genes tend to be ignored!

• RNA-seq data being incorporated now…

Page 45: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

45

RNA-seq

• Sequence cDNA from RNA using “next generation” sequencing

• Map sequence reads to ref genome

• Count number of sequence reads (depth of reads) for a given window of ref sequence

Page 46: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Rne (Rnase E) Upstream Region in P. aeruginosa

Page 47: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Rne (Rnase E) Upstream Region

• Upstream region of Rne in E. coli forms stem loop to regulate mRNA synthesis

• Region aligns reasonably well between spp.• Also predicted to encode a small non-coding RNA using QRNA

and RNAz

E. coli PAO1

Page 48: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Gene prediction in eukaryotes

• Overview of eukaryotic gene structure will illustrate the gene-finding problem

• Components described are some of those that gene-finding tools must model

• These signals are never 100% reliable because there are (almost) as many exceptions as rules

• Signals (and noise) vary with organism– 3% of human genome is “coding” vs. 25% in fugu

Page 49: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

49

The problem with eukaryotes:the easy stuff?

Page 50: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

The problem with eukaryotes:the hard stuff

• ~50% of human genes undergo alternative splicing

• Genes can be nested: overlapping on same or opposite strand or inside an intron

• Pseudogenes: “dead” genes

Page 51: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Fig 10.1 Baxevanis and Ouellette, 2000

Raw Biological Materials: Eukaryotes

Page 52: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Transcription Start

core promoter

TSS – transcription start site

. . .

Fig 11.25 from Introduction to Genetic Analysis, 7th ed. Griffiths et al. New York: W H Freeman & Co; c1999.http://www.ncbi.nlm.nih.gov/books/bv.fcgi?call=bv.View..ShowSection&rid=iga.figgrp.d1e46911

• ~70% of promoter regions have TATA box

Page 53: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

End Modification

• 5’cap added at 5’ end of transcript• polyA-tail added at AATAAA consensus site

Page 54: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Splice site recognition

Logos from:http://genio.informatik.uni-stuttgart.de/GENIO/splice/

exon intron5’ (a.k.a. donor) 3’ (a.k.a. acceptor)

intron exon

Page 55: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

The problem with splice signals

• ~0.5% of splice sites are non-canonical, i.e. not GT-AG

• ~5% of human genes may have a non-canonical splice site

• ~50% of human genes are alternatively spliced

Page 56: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Intron Phase

CCA

CAGAGG

exon intron exon

Phase 0:Phase 1:Phase 2:

A codon can be interrupted by an intron in one of three placesand still code for the same amino acid

Example CAG -> glutamine

Page 57: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Translation signals

Start AUG

Note: Bacteria also use GUG (~10% of starts) and UUG (rarely)

StopUAA, UAG, or UGA

Page 58: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Translation signals

• Translation Start: “usually” first AUG after 5’ cap

• But Kozak consensus sequence aids the selection so the 2nd or 3rd AUG might be used

• Translation Stop: first in-frame stop codon (simple!)

Page 59: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Gene-finding Strategies

• Similarity-based a.k.a. lookup method– genome:protein– genome:genome

• ab initio a.k.a. template method• Experimental

– microarrays– RNA Sequencing

Page 60: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Similarity – genome:protein

• blastx finds similarity between translated genomic region and known proteins

• Problem: on human chr22 only 50% of genes had similarity to known proteins when first annotated

Page 61: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Similarity – genome:genome

• Functional sequences will be conserved• Learn different things from different evolutionary

distances (signal:noise ratio)– coding sequences– regulatory sequences

• Compare mouse:human, pufferfish:human (Exofish)• Problems:

– you need something “similar”– won’t find precise boundaries

Page 62: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

ab initio Gene-finding

• Based on intrinsic information• Coding statistics plus signal sensors• No prior knowledge of similar genes needed• Problem: gene models can only approach the

biological truth (recall eukaryotic gene structure variability)

Page 63: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Coding Statistics

• Exon/intron base composition differences indicate protein-coding regions

• Use frequency of occurrence of hexamers

Page 64: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Signal Sensors

• Model features recognized by transcriptional, splicing and translational machinery– Promoter elements– Splice sites– PolyA consensus sites– Start and stop codons

• Use pattern recognition methods as signal detectors– Weight matrices– Neural networks– Decision trees

Page 65: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

What don’t (coding) gene finders do?

• Explicit alternative splicing models• Overlapping or nested genes• Non-protein-coding genes

Tools for finding genes…

Page 66: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Check out the Bioinformatics Links Directory at Bioinformatics.ca

http://bioinformatics.ca/links_directory/

Some tools for finding genes…

http://bioinformatics.ca/links_directory/category/dna/gene-prediction

Page 67: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Tools for eukaryotic gene prediction

• Ab Initio– genscan, fgenesh

• Homology-based prediction– Genewise (uses predicted proteins from one

genome/seq to drive exon finding in a second genome/seq, combined with splice-site model)

– Twinscan (builds two genome predictions together, using conserved DNA signal as one criterion)

Page 68: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Gene Finding in General: Training Sets

• ab initio gene-finders must be trained on real sequences

• Mostly single genes from specific groups of organisms (or the organism being studied)

• Bias of training set will affect a tool’s prediction accuracy

• Example: HMR195 has 195 single- or multi-exon murine genes with typical splice sites and confirmed by cDNA (Rogic et al, 2001)

Page 69: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

GENSCAN Dissection

Page 70: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Reproduced with permission,Chris Burge

Page 71: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Reproduced with permission,Chris Burge

Page 72: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

72

Genscan HSMM

• Circles and diamonds are states e.g. exon, phase 0 on plus strand

• Arrows are probabilities of transition between states

• Sequence you observe is generated from a series of states

Reproduced with permission, Chris Burge

Page 73: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Reproduced with permission,Chris Burge

Page 74: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

74

Reproduced with permission,Chris Burge

Page 75: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

75

Reproduced with permission,Chris Burge

Page 76: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

76

Genscan HSMM

• Circles and diamonds are states e.g. exon, phase 0 on plus strand

• Arrows are probabilities of transition between states

• Sequence you observe is generated from a series of states

Reproduced with permission, Chris Burge

Page 77: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Signal Sensors

• Model features recognized by transcriptional, splicing and translational machinery– Start codon

• 12bp WMM (Weight Matrix Model); probabilities estimated from GenBank CDS feature annotations

– 5’ splice site• MDD (Maximal Depedence Decomposition) developed

by Burge and Karlin (1997) to allow for dependencies between adjacent and non-adjacent nucleotides

Page 78: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Coding Statistics

• Exon/intron base composition differences indicate protein-coding regions

• Use frequency of occurrence of hexamers• Model with 5th order HMMs

– Non-coding: homogeneous – Coding: heterogeneous

• different model, depending on whether it’s 1st, 2nd, 3rd codon position

Page 79: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

The predicted gene structure is

the optimal combination of the aboveReproduced with permission,Chris Burge

Page 80: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

GENSCAN

• Strengths– Works well from fly to human (and plants with

some tweaks)• Weaknesses

– Practical• Initial exons; HMMgene does (a bit) better• Gene boundaries; tends to join or split genes

– Models• Promoter and polyA site models• Transcription start sites (TSS)

Page 81: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Homology Based Approach

• Use evolutionary divergence as a tool• Based on the principal that introns and other

non-coding regions usually diverge much faster over time

• Helped by the fact that many splice patterns have been determined experimentally (cDNA)

• Fails when the divergence of distant sequences is too extreme (<50%; due to rapidly evolving proteins or large evolutionary distances)

Page 82: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

GeneWise (Ewan Birney)• GeneWise aligns a protein sequence (or HMM) to

genomic DNA taking into account splicing information

See also “Exonerate”

Page 83: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

February 2001 December 2002

The birth of comparative genomics…

Page 84: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Comparative Genefinders• Twinscan -based on Genscan (Korf, 2001)• ROSETTA – global alignment of orthologous

regions (Batzoglou et al, 2000)• SLAM (Alexandersson) and Doublescan

(Meyer)– jointly align while finding predicting genes– computationally expensive (extra search

dimension) and hard to parameterise– still effectively requires a full alignment

• SGP2 (Parra, 2003)– Combines ab initio gene prediction with TBLASTX

Page 85: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Paul Flicek et al. Genome Res. 2003: TWINSCAN pred (red), GENSCAN (green), and an aligned RefSeq transcript (blue). Yellow: low-complexity regions, black: mouse

alignments

Evaluation of Twinscan

Page 86: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Paul Flicek et al. (Genome Res. 2003): Accuracy of GENSCAN and TWINSCAN by the exact gene, exact exon, and coding nucleotide measures

Evaluation of Twinscan

Page 87: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Twinscan (N-SCAN) in UCSC genome browser

Region of seq1 (from HMR195 dataset) presentedTwinscan URL: http://ardor.wustl.edu

Page 88: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

What about non-coding RNAs (ncRNAs)?

88

Page 89: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

What are non-coding RNAs (ncRNAs)?

89

ncRNA genes make transcripts that function directly as RNA, rather than encoding proteins.– ribosomal RNA, transfer RNA, RNase P, (tmRNA)– E.coli translational regulatory RNAs, e.g. OxyS– Eukaryotes: snRNA, snoRNA (& Archaea)

Page 90: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Example predictor: QRNA - probabilistic model III

• RNA: pair stochastic context-free grammar – base stacking, loop length (MFOLD)

Page 91: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

1. Identification of ncRNAs with known RNA familiesRFAM: a database of RNA families (Griffiths-Jones et al.,

2005)

2. Identification of novel ncRNAsObtain orthologous sequences using BLASTNMultiple sequence alignment using MAFFTncRNA prediction using RNAz and QRNAAnalyze overlapping predictions and filter false positivesNeed lab validation (RT-PCR, RNA-seq…)

Prediction of non-coding RNAs

Page 92: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

1. Sean Eddy (HHMI) RFAM, QRNA http://selab.janelia.org/

2. Cenk Sahinalp (SFU) TaveRNA suite for structural analysis and interaction predictionhttp://compbio.cs.sfu.ca/

Analysis of non-coding RNAs: Examples of Research Leaders

Page 93: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Rne (Rnase E) Upstream Region in P. aeruginosa

Page 94: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Rne (Rnase E) Upstream Region

• Upstream region of Rne in E. coli forms stem loop to regulate mRNA synthesis

• Region aligns reasonably well between spp.• Also predicted to encode a small non-coding RNA using QRNA

and RNAz

E. coli PAO1

Page 95: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Summary

• Genes are complex structures, which are difficult to predict

• Different approaches to gene finding:– Ab Initio : i.e. GenScan (use appropriate training

dataset!)– Homology guided: i.e. GeneWise

Page 96: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Outstanding Issues

Most Gene finders cannot handle untranslated regions (UTRs)

~40% of human genes have non-coding 1st exons (UTRs)

Most gene finders cannot handle alternative splicing

Most gene finders cannot handle overlapping or nested genes

Page 97: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Bottom Line

Gene finding in eukaryotes is not solvedAccuracy of the best methods approaches 80%

at the exon level (90% at the nucleotide level) in coding-rich regions (much lower for whole genomes)

Gene predictions should always be verified by other means (cDNA sequencing RNA-seq/EST, BLAST search)

Page 98: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

98

Whole Genome Annotation

• A lot of work has been done for you!

• NCBI Map Viewers• Ensembl• UCSC Genome Browser

Page 99: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Top 10 Future ChallengesEwan Birney, Chris Burge, Jim Fickett in Genome Technology

1. Precise, predictive model of transcription initiation and termination: ability to predict where and when transcription will occur in a genome

2. Precise, predictive model of RNA splicing/ alternative splicing: ability to predict the splicing pattern of any primary transcript in any tissue

3. Precise, quantitative models of signal transduction pathways: ability to predict cellular responses to external stimuli

4. Determining effective protein:DNA, protein:RNA and protein:protein recognition codes

5. Accurate ab initio protein structure prediction

Page 100: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

5 minute break?

Page 101: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

GENE REGULATION AND PROMOTER ANALYSIS

NGS Bioinformatics Workshop1.5 Genome Annotation

With material from Wyeth W. Wasserman and Shannan Ho Sui

Page 102: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

ObjectivesNote: Focus on Eukaryotic cells and Polymerase II Promoters

• Overview of transcription– Structure of eukaryotic promoter– Laboratory methods for identifying binding motifs

• Prediction of transcription factor binding sites using binding profiles (“Discrimination”)– TFBS scanning – a great example of how bioinformatics

can be tweaked when we don’t fully understand the biology!

• Interrogation of sets of co-expressed genes to identify mediating transcription factors– TFBS Over-representation

• Detection of novel motifs (TFBS) over-represented in regulatory regions of co-expressed genes (“Discovery”)– Motif Discovery!

Page 103: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Transcription Simplified

TATATFBS

TF Pol-II

Three-step Process:1. Transcription factor (TF; a

protein) binds to transcription factor binding site (TFBS; in DNA)

2. TF catalyzes recruitment of polII complex

3. Production of RNA from transcription start site (TSS)

TSS

Page 104: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Anatomy of Transcriptional RegulationWARNING: Terms vary widely in meaning between scientists

• Core Promoter – Sufficient for initiation of transcription; orientation dependent (i.e. immediately upstream, functions in only one orientation)

• TSR – transcription start region– Refers to a region rather than specific start site (TSS)

• TFBS – single transcription factor binding site• Regulatory Regions

• Proximal/Distal – vague reference to distance from TSR• May be positive (enhancing) or negative (repressing)• Orientation independent and location independent (generally)• Modules – Sets of TFBS within a region that function together

EXONTFBS TATA

TSR

TFBSTFBS

Core Promoter/Initiation Region (Inr)

TFBSTFBS

Distal Regulatory Region Proximal Regulatory Region

EXONTFBS TFBS

Distal Regulatory region

Page 105: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

105

Complexity in Transcription

Distal enhancer

Distal enhancerProximal enhancer Core Promoter

Chromatin

Refer again to Transcription in video: http://www.youtube.com/watch?v=41_Ne5mS2ls

Page 106: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

106

Transcription Factors

• Bind to enhancers found in vicinity of gene

• These proteins promote transcription by (i) binding DNA and (ii) activating transcription– These two functions usually reside on separate structural domains of

the transcription factor protein– Activation occurs via interactions with other transcription factors

and /or RNA polymerase

Page 107: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Structural motifs within different types of transcription factors.

See also http://www.youtube.com/watch?v=7EkSBBDQmpE

DNA Binding Domains – Conserved Motifs

Zinc Finger Motif

Helix-turn-Helix MotifHelix-loop-Helix

Motif

Leucine Zipper Motif

Page 108: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Lab Discovery of TF Binding Sites

LUCIFERASE

LUCIFERASE

LUCIFERASE

LUCIFERASE

LUCIFERASE

LUCIFERASE

LUCIFERASE

Reporter Gene Activity

mutation

0% 100%

Identify functional regulatory region within a sequence and delineate specific TFBS through mutagenesis (and in vitro binding studies)

Page 109: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

EMSA/Gel Shift Assays to Identify Binding Proteins

http://www.biomedcentral.com/content/figures/1741-7015-4-28-8.jpg

TF + DNA

DNA

Page 110: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

High-throughput Methods

• SELEX– mix random ds DNA oligonucleotides with TF protein,

recover TF-DNA complexes and sequence DNA

• Protein Binding Arrays (UniProbe Database)– prepare arrays with ds DNA attached, label protein with a

fluorescent mark and observe DNA bound by protein

• ChIP (Chromatin Immunoprecipitation)– covalently link proteins to DNA in cell, shear DNA, recover

protein-DNA complexes and identify DNA (PCR, array or sequencing)

Page 111: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Representing Binding Sites for a TF

• A set of sites represented as a consensus• VDRTWRWWSHD (IUPAC degenerate DNA)

• A single site• AAGTTAATGA

Set of binding

sitesAAGTTAATGACAGTTAATAAGAGTTAAACACAGTTAATTAGAGTTAATAACAGTTATTCAGAGTTAATAACAGTTAATCAAGATTAAAGAAAGTTAACGAAGGTTAACGAATGTTGATGAAAGTTAATGAAAGTTAACGAAAATTAATGAGAGTTAATGAAAGTTAATCAAAGTTGATGAAAATTAATGAATGTTAATGAAAGTAAATGAAAGTTAATGAAAGTTAATGAAAATTAATGAAAGTTAATGAAAGTTAATGAAAGTTAATGAAAGTTAATGA

Set of binding

sitesAAGTTAATGACAGTTAATAAGAGTTAAACACAGTTAATTAGAGTTAATAACAGTTATTCAGAGTTAATAACAGTTAATCAAGATTAAAGAAAGTTAACGAAGGTTAACGAATGTTGATGAAAGTTAATGAAAGTTAACGAAAATTAATGAGAGTTAATGAAAGTTAATCAAAGTTGATGAAAATTAATGAATGTTAATGAAAGTAAATGAAAGTTAATGAAAGTTAATGAAAATTAATGAAAGTTAATGAAAGTTAATGAAAGTTAATGAAAGTTAATGA

IUPAC (International Union of Pure and Applied Chemistry) nomenclature

• A way to represent ambiguity when more than one value is possible at a particular position

A = adenineC = cytosineG = guanineT = thymineR = G or A (purine)Y = T or C (pyrimidine)K = G or T (keto)M = A or C (amino)S = G or C (strong bonds)

W = A or T (weak bonds)B = G, T or C (all but A)D = G, A or T (all but C)H = A, C or T (all but G)V = G, C or A (all but T)N = A, G, C or T (any)

For DNA

Page 112: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Representing Binding Sites for a TF

A 14 16 4 0 1 19 20 1 4 13 4 4 13 12 3C 3 0 0 0 0 0 0 0 7 3 1 0 3 1 12G 4 3 17 0 0 2 0 0 9 1 3 0 5 2 2T 0 2 0 21 20 0 1 20 1 4 13 17 0 6 4

• A matrix describing a set of sites:

• A set of sites represented as a consensus• VDRTWRWWSHD (IUPAC degenerate DNA)

• A single site• AAGTTAATGA

Set of binding

sitesAAGTTAATGACAGTTAATAAGAGTTAAACACAGTTAATTAGAGTTAATAACAGTTATTCAGAGTTAATAACAGTTAATCAAGATTAAAGAAAGTTAACGAAGGTTAACGAATGTTGATGAAAGTTAATGAAAGTTAACGAAAATTAATGAGAGTTAATGAAAGTTAATCAAAGTTGATGAAAATTAATGAATGTTAATGAAAGTAAATGAAAGTTAATGAAAGTTAATGAAAATTAATGAAAGTTAATGAAAGTTAATGAAAGTTAATGAAAGTTAATGA

Set of binding

sitesAAGTTAATGACAGTTAATAAGAGTTAAACACAGTTAATTAGAGTTAATAACAGTTATTCAGAGTTAATAACAGTTAATCAAGATTAAAGAAAGTTAACGAAGGTTAACGAATGTTGATGAAAGTTAATGAAAGTTAACGAAAATTAATGAGAGTTAATGAAAGTTAATCAAAGTTGATGAAAATTAATGAATGTTAATGAAAGTAAATGAAAGTTAATGAAAGTTAATGAAAATTAATGAAAGTTAATGAAAGTTAATGAAAGTTAATGAAAGTTAATGA

Logo – A graphical representation of frequency matrix. Y-axis is information content , which reflects the strength of the pattern in each column of the matrix

Page 113: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

113

TGCTG = 0.9

Conversion of PFMs to Position Specific Scoring Matrices (PSSM)

PSSMs also known as Position Weight Matrices(PWMs)

Add the following features to the matrix profile:1. Correct for nucleotide frequencies in genome2. Weight for the confidence (depth) in the pattern3. Convert to log-scale probability for easy arithmetic

A 5 0 1 0 0C 0 2 2 4 0G 0 3 1 0 4T 0 0 1 1 1

A 1.6 -1.7 -0.2 -1.7 -1.7 C -1.7 0.5 0.5 1.3 -1.7 G -1.7 1.0 -0.2 -1.7 1.3T -1.7 -1.7 -0.2 -0.2 -0.2

pfm pssm

Log2 ( )f(b,i) + s(n)p(b)

f(b,i) = frequency of base b at position is(n) = pseudocount correction (optionally applied to small samples of binding sites)p(b) = background probability of base b

Page 114: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

PSSM Scoring Scales

• Raw scores• Sum of values from indicated cells of the matrix

• Relative Scores (most common)• Normalize the scores to range of 0-1 or 0%-100%

• Empirical p-values• Based on distribution of scores for some DNA

sequence, determine a p-value (see next slide)

Page 115: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Detecting binding sites in a single sequenceRaw Scores

A [-0.2284 0.4368 -1.5 -1.5 -1.5 0.4368 -1.5 -1.5 -0.2284 0.4368 ]C [-0.2284 -0.2284 -1.5 -1.5 1.5128 -1.5 -0.2284 -1.5 -0.2284 -1.5 ]G [ 1.2348 1.2348 2.1222 2.1222 0.4368 1.2348 1.5128 1.7457 1.7457 -1.5 ]T [ 0.4368 -0.2284 -1.5 -1.5 -0.2284 0.4368 0.4368 0.4368 -1.5 1.7457 ]

ACCCTCCCCAGGGGCGGGGGGCGGTGGCCAGGACGGTAGCTCC Sp1

Relative ScoresA [-0.2284 0.4368 -1.5 -1.5 -1.5 0.4368 -1.5 -1.5 -0.2284 0.4368 ]C [-0.2284 -0.2284 -1.5 -1.5 1.5128 -1.5 -0.2284 -1.5 -0.2284 -1.5 ]G [ 1.2348 1.2348 2.1222 2.1222 0.4368 1.2348 1.5128 1.7457 1.7457 -1.5 ]T [ 0.4368 -0.2284 -1.5 -1.5 -0.2284 0.4368 0.4368 0.4368 -1.5 1.7457 ]

A [-0.2284 0.4368 -1.5 -1.5 -1.5 0.4368 -1.5 -1.5 -0.2284 0.4368 ]C [-0.2284 -0.2284 -1.5 -1.5 1.5128 -1.5 -0.2284 -1.5 -0.2284 -1.5 ]G [ 1.2348 1.2348 2.1222 2.1222 0.4368 1.2348 1.5128 1.7457 1.7457 -1.5 ]T [ 0.4368 -0.2284 -1.5 -1.5 -0.2284 0.4368 0.4368 0.4368 -1.5 1.7457 ]

Max_score = 15.2 (sum of highest column scores)

Min_score = -10.3 (sum of lowest column scores)

93%

100%10.3)(15.2

(-10.3)-13.4

% 100Min_score - Max_scoreMin_score - Abs_score

Rel_score

Empirical p-value Scores

Abs_score = 13.4 (sum of column scores)

0.3

0.2

0.1

0.0

Freq

uenc

y0.0 0.2 0.4 0.6 0.8 1.0

Relative Score

Area to right of value

Area under entire curve

Page 116: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

116

JASPAR:

AN OPEN-ACCESS DATABASE

OF TF BINDING PROFILES( jaspar.genereg.net )

(Some other databases with TF profiles include Transfac, TRRD, mPromDB, SCPD (yeast),

dbTBS and EcoTFS (bacteria))

Page 117: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

The Good…

• Tronche (1997) tested 50 predicted HNF1 TFBS using an in vitro binding test and found that 96% of the predicted sites were bound!

• Stormo and Fields (1998) found in detailed biochemical studies that the best weight matrices produce scores highly correlated with in vitro binding energy

PSSM SCORE

BIN

DIN

GE

NE

RG

Y

Page 118: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

…the Bad…

• Fickett (1995) found that a profile for the myoD TF made predictions at a rate of 1 per ~500bp of human DNA sequence

– This corresponds to an average of 20 sites / gene (assuming 10,000 bp as average human gene size)

Page 119: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

…and the Ugly!Human Cardiac a-Actin gene analyzed

with a set of profiles(each line represents a TFBS prediction)

Futility Conjecture: TFBS predictions, as a collective group,

are almost always wrong!

True binding sites are defined by properties not incorporated into

the PSSM profile scores

Red boxes are protein coding exons - TFBS predictions excluded in this analysis

Page 120: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

What have we learned?• PSSMs accurately reflect in vitro binding properties of

DNA binding proteins

• Suitable binding sites occur at a rate far too frequent to reflect in vivo function

• Bioinformatics methods that use PSSMs for binding site studies must incorporate additional information to enhance specificity

• Unfiltered predictions are too noisy for most applications• Note: Organisms with short regulatory sequences are less

problematic (e.g. yeast and bacteria)

Example of a bioinformatics challenge that needs more biological information to make predictions

Page 121: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

USING PHYLOGENETIC FOOTPRINTING TO IMPROVE TFBS DISCRIMINATION

70,000,000 years of evolution can reveal regulatory regions

Page 122: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

% Id

en

tity

Actin gene compared between human and mouse200 bp Window Start Position (human sequence)

Phylogenetic Footprinting

• Align orthologous gene sequences (e.g. use LAGAN). For 1st window of x bp (i.e. 200 bp), of sequence#1, determine % identity with sequence#2. Step across (slide window across) the first sequence, recording % identity in each window with the second sequence.

• Observe high identity with exons, lower identity in 5’ and 3’ UTRs• Additional conserved region could be regulatory region

Page 123: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Multi-species Phylogenetic Footprinting

Page 124: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

124

Phylogenetic Footprinting Dramatically Reduces Spurious Hits

Human

Mouse

Actin, alpha cardiac

Page 125: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

TFBS Prediction with Human & Mouse Pairwise Phylogenetic Footprinting

• Testing set: 40 experimentally defined sites in 15 well studied genes (Replicated with 100+ site set)

• 75-80% of defined sites detected with “sequence conservation filter” (incorporating orthologous sequence data), while only 11-16% of total predictions retained

SELECTIVITY SENSITIVITY

Page 126: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

126

1kbp insulin receptor promoter screened with footprinting

Page 127: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Choosing the ”right” species for pairwise comparison...

COW

MOUSE

CHICKEN

HUMAN

HUMAN

HUMAN

Page 128: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

ConSite

http://consite.genereg.net/

Page 129: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

TFBS Discrimination Tools I

• Phylogenetic Footprinting Servers– FOOTER http://biodev.hgen.pitt.edu/footer_php/Footerv2_0.php

– CONSITE http://asp.ii.uib.no:8090/cgi-bin/CONSITE/consite/ or

http://consite.genereg.net/

– rVISTA http://rvista.dcode.org/

– ORCAtk http://burgundy.cmmt.ubc.ca/cgi-bin/OrcaTK/orcatk

• SNPs in TFBS Analysis– RAVEN http://burgundy.cmmt.ubc.ca/cgi-bin/RAVEN/a?rm=home

Page 130: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

TFBS Discrimination Tools II• Prokaryotes or Yeast

– PRODORIC http://prodoric.tu-bs.de/– YEASTRACT http://www.yeastract.com/index.php

• Software Packages– TOUCAN http://homes.esat.kuleuven.be/~saerts/software/toucan.php

• Programming Tools– TFBS http://tfbs.genereg.net/– ORCAtk http://burgundy.cmmt.ubc.ca/cgi-bin/OrcaTK/orcatk

Page 131: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

• TFBS discrimination coupled with phylogenetic footprinting has greater specificity with tolerable loss of sensitivity

• As with any purification process, some true binding sites will be lost

• There are several online resources that provide precomputed analyses using phylogenetic footprinting

What have we learned?

Page 132: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

TFBS Over-representation

• We seek to determine if a set of co-expressed genes contains an over-abundance of predicted binding sites for a known TF

• Phylogenetic footprinting to reduce false prediction rate

• Similar to looking for functional over-representation (i.e. identify annotations of gene function in sets of co-expressed genes that are over-represented)

Page 133: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Two Examples of TFBS Over-Representation

Foreground

Background

More Genes with TFBS

Foreground

Background

More Total TFBS

Page 134: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Statistical Methods for Identifying Over-represented TFBS

• Binomial test (Z scores)– Based on the number of occurrences of the TFBS relative

to background– Normalized for sequence length– Simple binomial distribution model

• Fisher exact probability scores– Based on the number of genes containing the TFBS relative

to background– Hypergeometric probability distribution

Page 135: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

oPOSSUM Server

Page 136: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Validation using Reference Gene Sets

TFs with experimentally-verified sites in the reference sets.

A. Muscle-specific (23 input; 16 analyzed)

B. Liver-specific (20 input; 12 analyzed)

Rank Z-score Fisher Rank Z-score Fisher

SRF 1 21.41 1.18e-02 HNF-1 1 38.21 8.83e-08

MEF2 2 18.12 8.05e-04 HLF 2 11.00 9.50e-03

c-MYB_1 3 14.41 1.25e-03 Sox-5 3 9.822 1.22e-01

Myf 4 13.54 3.83e-03 FREAC-4 4 7.101 1.60e-01

TEF-1 5 11.22 2.87e-03 HNF-3beta 5 4.494 4.66e-02

deltaEF1 6 10.88 1.09e-02 SOX17 6 4.229 4.20e-01

S8 7 5.874 2.93e-01 Yin-Yang 7 4.070 1.16e-01

Irf-1 8 5.245 2.63e-01 S8 8 3.821 1.61e-02

Thing1-E47 9 4.485 4.97e-02 Irf-1 9 3.477 1.69e-01

HNF-1 10 3.353 2.93e-01 COUP-TF 10 3.286 2.97e-01

Page 137: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Structurally-related TFs with Indistinguishable TFBS

• Most structurally related TFs bind to highly similar patterns– Zn-finger is a

big exception

Page 138: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

• TFBS over-representation analysis can identify increased number of sites in a set of co-expressed genes, or increased number of genes in a co-expressed gene set that have a given site.

• Generally best performance has been with data directly linked to a transcription factor

• Highly dependent on the experimental design – cannot overcome noisy data from poor design

Note that the identity of a mediating TF may not be apparent when many proteins can bind to the same motif

Notes

Page 139: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

de novo Pattern Discovery

• String-based – e.g. YMF (Sinha & Tompa)– Generalization: Identify over-represented oligomers in

comparison of “+” and “-” (or complete) promoter collections

– Used often for yeast promoter analysis

• Profile-based – e.g. AnnSpec (Workman & Stormo) or MEME (Bailey &

Elkin)– Generalization: Identify strong patterns (position weight

matrices) in “+” promoter collection vs. background model of expected sequence characteristics

Page 140: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

String-based methods(1)

How likely are X words in a set of sequences, given background

sequence characteristics?CCCGCCGGAATGAAATCTGATTGACATTTTCC >EP71002 (+) Ce[IV] msp-56 B; range -100 to -75 TTCAAATTTTAACGCCGGAATAATCTCCTATT >EP63009 (+) Ce Cuticle Col-12; range -100 to -75 TCGCTGTAACCGGAATATTTAGTCAGTTTTTG >EP63010 (+) Ce Cuticle Col-13; range -100 to -75 TATCGTCATTCTCCGCCTCTTTTCTT >EP11013 (+) Ce vitellogenin 2; range -100 to -75 GCTTATCAATGCGCCCGGAATAAAACGCTATA >EP11014 (+) Ce vitellogenin 5; range -100 to -75 CATTGACTTTATCGAATAAATCTGTT >EP11015 (-) Ce vitellogenin 4; range -100 to -75 ATCTATTTACAATGATAAAACTTCAA >EP11016 (+) Ce vitellogenin 6; range -100 to -75 ATGGTCTCTACCGGAAAGCTACTTTCAGAATT >EP11017 (+) Ce calmodulin cal-2; range -100 to -75 TTTCAAATCCGGAATTTCCACCCGGAATTACT >EP63007 (-) Ce cAMP-dep. PKR P1+; range -100 to -75 TTTCCTTCTTCCCGGAATCCACTTTTTCTTCC >EP63008 (+) Ce cAMP-dep. PKR P2; range -100 to -75 ACTGAACTTGTCTTCAAATTTCAACACCGGAA >EP17012 (+) Ce hsp 16K-1 A; range -100 to -75 TCAATGCCGGAATTCTGAATGTGAGTCGCCCT >EP55011 (-) Ce hsp 16K-1 B; range

Page 141: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

String-based methods(2)

Find all words of length n in the yeast promoters (e.g. n=7)

Make a lookup table:

AAACCTTT 456TTTTTTTT 57788GATAGGCA 589

Etc...

GTCTTTATCTTCAAAGTTGTCTGTCCAAGATTTGGACTTGAAGGACAAGCGTGTCTTCTCAGAGTTGACTTCAACGTCCCATTGGACGGTAAGAAGATCACTTCTAACCAAAGAATTGTTGCTGCTTTGCCAACCATCAAGTACGTTTTGGAACACCACCCAAGATACGTTGTCTTGTTCTCACTTGGGTAGACCAAACGGTGAAAGAAACGAAAAATACTCTTTGGCTCCAGTTGCTAAGGAATTGCAATCATTGTTGGGTAAGGATGTCACCTTCTTGAACGACTGTGTCGGTCCAGAAGTTGAAGCCGCTGTCAAGGCTTCTGCCCCAGGTTCCGTTATTTTGTTGGAAAACTGCGTTACCACATCGAAGAAGAAGGTTCCAGAAAGGTCGATGGTCAAAAGGTCAAGGCTCAAGGAAGATGTTCAAAAGTTCAGACACGAATTGAGCTCTTTGGCTGATGTTTACATCACGATGCCTTCGGTACCGCTCACAGAGCTCACTCTTCTATGGTCGGTTTCGACTTGCCAACGTGCTGCCGGTTTCTTGTTGGAAAAGGAATTGAAGTACTTCGGTAAGGCTTTGGAGAACCCAACCAGACCATTCTTGGCCATCTTAGGTGGTGCCAAGGTTGCTGACAAGATTCAATTGATTGACAACTTGTTGGACAAGGTCGACTCTATCATCATTGGTGGTGGTATGGCTTTCCCTTCAAGAAGGTTTTGGAAAACACTGAAATCGGTGACTCCATCTTCGACAAGGCTGGTGCTGAAATCGTTCCAAAGTTGATGGAAAAGGCCAAGGCCAAGGGTGTCGAAGTCGTCTTGCAGTCGACTTCATCATTGCTGATGCTTTCTCTGCTGATGCCAACACCAAGACTGTCACTGACAAGGAAGGTATTCCAGCTGGCTGGCAAGGGTTGGACAATGGTCCAGAATCTAGAAAGTGTTTGCTGCTACTGTTGCAAAGGCTAAGACCATTGTCTGGAACGGTCCACCAGGTGTTTTCGAATTCGAAAAGTTCGCTGCTGGTACTAAGGCTTTGTTAGACGAAGTTGTCAAGAGCTCTGCTGCTGGTAACACCGTCATCATTGGTGGTGGTGACACTGCCA

Page 142: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Xw: Instances of a word w within our set of X genes

E[Xw]: Average number of instances of w based on number of genes in our set

Var[Xw]: Variance – how much deviation from the average is expected for w

w

www XVar

XEXZ

String-based methods(3)

Page 143: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Limitations of String-based Methods• Longer word lengths not possible

– Number of possible strings of length n is 4n without any ambiguities and more if we allow ambiguous bases. Thus, as n increases, the computational time notably increases.

• While degeneracy codes can be used, TFBS are not words – we lose quantitation for variable positions with consensus sequences– Imagine column in PFM with 7 A’s and 1 T --- in a consensus

sequence we would represent as W or throw out the instance with T

• Recently the string-based method has found renewed utility in the analysis of 3’UTRs for the presence of microRNA target sequences...

Page 144: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Aside: microRNA Target Sequences

• Lim et al expressed miRNAs in cells and observed that the overall pattern of gene expression shifted toward the pattern of expression observed in cells which naturally express the miRNA

• The genes with reduced expression in response to miRNA exposure shared 7nt motifs the 3’UTR of their transcripts

• Nice website tutorial:– http://www.ambion.com/main/explorations/mirna.html

Page 145: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Motivation:

TFBS are not words, but rather sequence patterns

Efficiency – can handle longer patterns than string-based methods

Can be intentionally influenced to reflect prior knowledge

Overview: Find a local alignment of width x of sites that maximizes information content (or related measure) in reasonable time

Usually by Gibbs sampling (i.e. Gibbs Motif Sampler) or Expectation Maximization methods (i.e. MEME)

Probabilistic Methods

Page 146: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Ab initio motif finding: Gibbs sampling

• Popular “Markov chain Monte Carlo” algorithm for motif discovery

• Motif model: Position Weight Matrix• Local search algorithm

Page 147: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Gibbs sampling: basic idea

Current motif = PWM formedby circled substrings

Page 148: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Gibbs sampling: basic idea

Delete one substring

Page 149: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Gibbs sampling: basic idea

Try a replacement:Compute its score, Accept the replacementdepending on the score.

Page 150: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Gibbs sampling: basic idea

New motif

Page 151: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Gibbs Sampling(grossly over-simplified)

tgacttcctgctacct

agacctcactgtagtgacgcatct

cgatacgcttcgctcc

1 2 3 4 5 6 7 8A 2 0 2 2 2 1 0 1C 0 2 3 3 2 1 6 2G 0 4 1 0 1 0 1 1T 4 1 1 2 2 5 0 2

i.e. Gibbs Motif Sampler: http://bayesweb.wadsworth.org/gibbs/gibbs.html

http://www.youtube.com/watch?v=_4yrlzVTr_k

Page 152: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Pattern Discovery

• Gibbs sampling is guaranteed to return an optimal pattern if repeated sufficiently often– Procedure is fast, so running many 1000s of times

is feasible• Problems if the TFBS’s are not strongly over-

represented relative to other patterns…

Page 153: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Applied Pattern Discovery is Acutely Sensitive to Noise

True Mef2 Binding Sites

10

12

14

16

18

0 100 200 300 400 500 600

SEQUENCE LENGTH

PATTE

RN

SIM

ILA

RIT

Yvs.

TR

UE

ME

F2 P

RO

FILE

Pink line is negative control with no Mef2 sites included

Page 154: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Ab initio motif finding: Expectation Maximization

• Popular algorithm for motif discovery• Motif model: Position Weight Matrix• Local search algorithm

– Move from current choice of motif to a new similar motif, so as to improve the score

– Keep doing this until no more improvement is obtained: converge to local optima

Page 155: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Basic idea of iteration

PWMCurrent motif1.

Scan sequence for good matches to the current motif.

2.

3. Build a new PWM out of these matches, and make it the new motif

The basic idea can be formalized in the language of probability

Page 156: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

MEME

• Popular motif finding program that uses Expectation-Maximization

• Web sitehttp://meme.sdsc.edu/meme/website/meme.html

Page 157: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Four Approaches to Improve Sensitivity• Better background models

-Higher-order properties of DNA

• Phylogenetic Footprinting– Human:Mouse comparison eliminates ~75% of sequence

• Regulatory Modules – Architectural rules

• Limit the types of binding profiles allowed– TFBS patterns are NOT random

(Also, for prokaryotes in particular, accurate ID of transcription start sites can greatly aid identification of promoter sequences)

Page 158: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

Pattern Discovery Summary

• Pattern discovery methods can recover over-represented patterns in the promoters of co-expressed genes

• Methods are acutely sensitive to noise – notable problem if the signal we seek is weak

• TFs tolerate great variability between binding sites

• Supplementary information/approaches (such as phylogenetic footprinting) may be required to over-come the noise

Page 159: NGS Bioinformatics Workshop 1.5 Genome Annotation April 4 th, 2012 IRMACS 10900, SFU Facilitator: Richard Bruskiewich Adjunct Professor, MBB Acknowledgment:

REFLECTIONS• TFBS prediction

– Futility Theorem – Essentially predictions of individual TFBS have no relationship to an in vivo function

– Current successful bioinformatics methods incorporate additional information (clusters, sequence conservation)

• TFBS over-representation– TFBS over-representation is powerful way to identify TFs

likely to contribute to observed patterns of co-expression• Novel motif discovery (also useful in general)

– Pattern discovery methods are restricted by the Signal-to-Noise problem

• Observed patterns must be carefully considered– Again, successful methods incorporate additional

information (conservation, structural constraints on TFs, and (for prokaryotes) positional information relative to TSS)