Comp. Genomics

Comp. Genomics Recitation 7 2/4/09 PSSMs+Gene finding Partially based on slides by Irit Gat-Viks and Metsada Pasmanik-Chor


Page 1: Comp. Genomics

Comp. Genomics

Recitation 7, 2/4/09: PSSMs + Gene finding

Partially based on slides by Irit Gat-Viks and Metsada Pasmanik-Chor

Page 2: Comp. Genomics

Biological Motifs

• Biological units with common functions frequently exhibit similarities at the sequence level. These include very short “motifs”, such as:
  – Gene splice sites
  – DNA regulatory binding sites (bound by transcription factors)
• Often it is desirable to model such motifs, to enable searching for new ones. Probabilistic models are very useful. Today we deal with the PSSM, the simplest such model.

Page 3: Comp. Genomics

E. Coli Promoters

Page 4: Comp. Genomics

Regulation of Genes

[Figure labels: Gene, Regulatory Element, RNA polymerase (protein), Transcription Factor (protein), DNA]

Page 5: Comp. Genomics

Regulation of Genes

[Figure labels: Gene, RNA polymerase, Transcription Factor (protein), Regulatory Element, DNA]

Page 6: Comp. Genomics

Regulation of Genes

[Figure labels: Gene, RNA polymerase, Transcription Factor, Regulatory Element, DNA, New protein]

Page 7: Comp. Genomics

Motif Logo

• Motifs can mutate on less important bases.
• The five motif instances listed below have mutations in positions 3 and 5.
• Representations called motif logos illustrate the conserved regions of a motif.

Position: 1234567
          TGGGGGA
          TGAGAGA
          TGGGGGA
          TGAGAGA
          TGAGGGA

http://weblogo.berkeley.edu
http://fold.stanford.edu/eblocks/acsearch.html

Page 8: Comp. Genomics

Example: Calmodulin-Binding Motif (calcium-binding proteins)

Page 9: Comp. Genomics

PSSM Starting Point

• Start from a gap-less MSA of known instances of a given motif.
• Represent the motif by either:
  – A consensus sequence.
  – A Position Specific Scoring Matrix (PSSM).

Page 10: Comp. Genomics

Usage of a PSSM

• For a putative k-mer, e.g. GTGC, multiply the position-specific probabilities: p1(G)·p2(T)·p3(G)·p4(C)
• This gives the likelihood of the k-mer under the PSSM model.

[Figure: TATA box motif]
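The product of per-position probabilities can be sketched in a few lines of Python. This is only an illustration: it builds the matrix from the five 7-mers on the motif logo slide, and it adds a pseudocount of 1 per base so that unseen bases do not force the likelihood to zero. The pseudocount and the function names (`build_pssm`, `likelihood`, `best_hit`) are assumptions of this sketch and do not come from the slides.

```python
from collections import Counter

# The five 7-mers from the motif logo slide, treated as a gap-less MSA.
INSTANCES = ["TGGGGGA", "TGAGAGA", "TGGGGGA", "TGAGAGA", "TGAGGGA"]
ALPHABET = "ACGT"


def build_pssm(instances, pseudocount=1.0):
    """Return one {base: probability} dict per alignment column.

    The pseudocount is an assumption of this sketch, not part of the slides.
    """
    length = len(instances[0])
    if any(len(s) != length for s in instances):
        raise ValueError("instances must form a gap-less alignment")
    pssm = []
    for column in zip(*instances):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        pssm.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return pssm


def likelihood(kmer, pssm):
    """Multiply the position-specific probabilities p1(x1) * p2(x2) * ..."""
    if len(kmer) != len(pssm):
        raise ValueError("k-mer length must match the PSSM length")
    p = 1.0
    for base, column in zip(kmer, pssm):
        p *= column[base]
    return p


def best_hit(sequence, pssm):
    """Return (likelihood, start) of the highest-scoring window in sequence."""
    k = len(pssm)
    return max(
        (likelihood(sequence[i:i + k], pssm), i)
        for i in range(len(sequence) - k + 1)
    )


pssm = build_pssm(INSTANCES)
score, start = best_hit("CCCTGAGAGACC", pssm)
```

In practice the likelihood is usually compared against a background model (a log-odds score) and summed in log space to avoid underflow on long motifs; the plain product is kept here because it matches the formula on the slide.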

Page 11: Comp. Genomics

Gene finding

• Only part of the genome encodes proteins:
  – 80-90% in bacteria, about 2% in humans
• Goal: given a genome sequence, identify gene boundaries

Page 12: Comp. Genomics

The genetic code

• A protein-coding gene contains an open reading frame (ORF), which begins with an ATG start codon and ends with one of the three stop codons (TAA, TAG, TGA)

Page 13: Comp. Genomics

Prokaryotic genes

• The ‘easy’ problem
• Difficulty: not all possible ORFs are actually genes
  – In E. coli: 6500 ORFs, while there are 4290 genes.
• Additional “handles” are needed

Page 14: Comp. Genomics

Handle #1: Long ORFs

• In random DNA, a stop codon appears once every 64/3 ≈ 21 codons on average (3 of the 64 codons are stops).
• The average protein is ~300 codons long.
• => Search for long ORFs.
• Problems:
  – Short genes
  – Overlapping long ORFs on opposite strands
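A long-ORF scan along these lines might look like the sketch below. It assumes the simple ORF definition used in this recitation (ATG ... stop, in frame), scans the three frames of each strand, and reports coordinates on the strand that was scanned. The function names and the `min_codons` threshold are illustrative choices, not part of the slides.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=100):
    """Yield (strand, frame, start, end) for each ATG...stop ORF.

    start/end are 0-based, end-exclusive positions on the scanned strand
    (for strand "-" that is the reverse complement of seq). min_codons
    counts codons from the ATG up to, but not including, the stop codon.
    """
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        yield strand, frame, start, i + 3
                    start = None  # look for the next ORF in this frame


toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
orfs = list(find_orfs(toy, min_codons=3))
```

Raising `min_codons` removes most spurious ORFs, at the cost of the short genes mentioned above. Overlapping ORFs on opposite strands are reported as separate hits and would need another handle to resolve.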

Page 15: Comp. Genomics

Handle #2: Codon frequencies

• Coding DNA is not random:
  – In random DNA, expect a Leu : Ala : Trp ratio of 6 : 4 : 1
  – In real proteins, the ratio is 6.9 : 6.5 : 1
• Different species have different codon frequencies.
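The 6 : 4 : 1 expectation follows from counting codons. If all 64 codons are equally likely, an amino acid's expected frequency is proportional to the number of codons that encode it. The sketch below checks this against the standard genetic code, written as the usual 64-letter string with bases ordered T, C, A, G. The variable names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated as TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of codons per amino acid ("*" marks a stop codon).
codons_per_aa = Counter(CODON_TABLE.values())

# Under uniform random DNA, Leu : Ala : Trp is the ratio of codon counts.
leu_ala_trp = (codons_per_aa["L"], codons_per_aa["A"], codons_per_aa["W"])
```

The same table gives the three stop codons behind the 64/3 figure for random DNA.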

Page 16: Comp. Genomics

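Using Codon Frequencies/Usage

• Assume each codon is independent.
• For codon abc, calculate its frequency f(abc) in coding regions.
• Given a coding sequence $a_1 b_1 c_1, \ldots, a_{n+1} b_{n+1} c_{n+1}$
• Calculate, for each of the three reading frames:

$$
\begin{aligned}
p_1 &= f(a_1 b_1 c_1)\, f(a_2 b_2 c_2) \cdots f(a_n b_n c_n) \\
p_2 &= f(b_1 c_1 a_2)\, f(b_2 c_2 a_3) \cdots f(b_n c_n a_{n+1}) \\
p_3 &= f(c_1 a_2 b_2)\, f(c_2 a_3 b_3) \cdots f(c_n a_{n+1} b_{n+1})
\end{aligned}
$$

• The probability that the ith reading frame is the coding region:

$$
P_i = \frac{p_i}{p_1 + p_2 + p_3}
$$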
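As a rough sketch of this calculation: for each of the three frame offsets, multiply the coding-region frequencies f(abc) of n consecutive codons to get p1, p2, p3, then normalise so that P_i = p_i / (p1 + p2 + p3). The code below works in log space to avoid underflow on long windows. The frequency table in the example is made up for illustration, and the `floor` value used for codons missing from the table is an assumption of this sketch.

```python
import math


def frame_probabilities(seq, codon_freq, floor=1e-6):
    """Return [P1, P2, P3], the probability that each frame is coding.

    codon_freq maps codon -> frequency f(abc) observed in coding regions.
    Codons absent from the table get the small `floor` frequency
    (an assumption of this sketch, to keep the logarithm finite).
    """
    n = (len(seq) - 2) // 3  # codons available in every one of the 3 frames
    if n < 1:
        raise ValueError("sequence too short: need at least 5 bases")
    log_p = []
    for offset in range(3):
        total = 0.0
        for j in range(n):
            codon = seq[offset + 3 * j: offset + 3 * j + 3]
            total += math.log(codon_freq.get(codon, floor))
        log_p.append(total)
    # Normalise: P_i = p_i / (p_1 + p_2 + p_3), computed stably in log space.
    top = max(log_p)
    weights = [math.exp(lp - top) for lp in log_p]
    norm = sum(weights)
    return [w / norm for w in weights]


# Hypothetical frequencies, for illustration only.
toy_freq = {"GCT": 0.5, "CTG": 0.1, "TGC": 0.05}
probs = frame_probabilities("GCT" * 5, toy_freq)
```

Gene finders typically evaluate this over a sliding window along the sequence, so that each position gets three frame probabilities that can be plotted against each other.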

Page 17: Comp. Genomics

Handle #3: G+C content

• C+G content (“isochore”) has a strong effect on gene density, gene length, etc.
  – < 43% C+G: 62% of genome, 34% of genes
  – > 57% C+G: 3-5% of genome, 28% of genes
• Gene density in C+G-rich regions is 5 times higher than in moderate C+G regions and 10 times higher than in A+T-rich regions
  – The amount of intronic DNA is 3 times higher in A+T-rich regions (both intron length and number).
• Etc…
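A windowed G+C profile is simple to compute. The sketch below uses the 43% and 57% cut-offs quoted above to label windows. The window and step sizes, the class labels, and the function names are illustrative assumptions.

```python
def gc_content(seq):
    """Fraction of G and C among the A/C/G/T bases of seq (N etc. ignored)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return (seq.count("G") + seq.count("C")) / acgt


def gc_windows(seq, window=1000, step=500):
    """Yield (start, gc_fraction) for each full window along seq."""
    for start in range(0, len(seq) - window + 1, step):
        yield start, gc_content(seq[start:start + window])


def isochore_class(gc_fraction):
    """Label a G+C fraction using the 43% / 57% cut-offs from the slide."""
    if gc_fraction < 0.43:
        return "low"
    if gc_fraction > 0.57:
        return "high"
    return "intermediate"


profile = list(gc_windows("GGGGAAAA", window=4, step=4))
```

Real isochores span very long stretches of DNA, so much larger windows than the toy values here would be needed to reproduce the genome-wide percentages above.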

Page 18: Comp. Genomics

Handle #4: Promoter motifs

• Transcription depends on regulatory regions.
• A common regulatory region: the promoter
• RNA polymerase binds tightly to a specific DNA sequence in the promoter

Page 19: Comp. Genomics

Gene prediction programs

Scan the sequence in all 6 reading frames:
1. Start and stop codons
2. Long ORF
3. Codon usage
4. GC content
5. Gene features: promoter, terminator, poly A sites, exons and introns, …

[Figure: Frame +1, Frame +2, Frame +3]

Page 20: Comp. Genomics

Moving to eukaryotes

• Less of the genome is protein coding + introns are a (very) serious headache

Page 21: Comp. Genomics

Eukaryote gene structure

• Gene length: 30kb, coding region: 1-2kb
• Binding site: ~6bp; ~30bp upstream of TSS
• Average of 6 exons, 150bp long
• Huge variance:
  – Dystrophin: 2.4Mb long
  – Blood coagulation factor: 26 exons, 69bp to 3106bp; intron 22 contains another unrelated gene

Page 22: Comp. Genomics

Splicing

• Splicing: the removal of the introns.
• Performed by complexes called spliceosomes, containing both proteins and snRNA.
• The snRNA recognizes the splice sites through RNA-RNA base-pairing
• Recognition must be precise: a 1nt error can shift the reading frame, making nonsense of the message.
• Many genes have alternative splicing, which changes the protein created.
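The effect of a 1nt splicing error can be seen in a toy example. The sequences below are invented for illustration (written as DNA, with an intron that starts with GT and ends with AG), and the helper names are hypothetical. Removing the intron exactly keeps the downstream codons intact, while cutting one base too far shifts every codon after the junction.

```python
def splice(pre_mrna, intron_start, intron_end):
    """Remove pre_mrna[intron_start:intron_end] and join the flanking exons."""
    return pre_mrna[:intron_start] + pre_mrna[intron_end:]


def codons(seq):
    """Split seq into complete in-frame codons, dropping any trailing bases."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


# Toy pre-mRNA: exon 1, a 12-base intron, exon 2 (all invented).
exon1, intron, exon2 = "ATGGCT", "GTAAGTTTTCAG", "GAACTGTAA"
pre_mrna = exon1 + intron + exon2

correct = codons(splice(pre_mrna, 6, 18))   # intron removed exactly
shifted = codons(splice(pre_mrna, 6, 19))   # one base too many removed
```

Here `correct` reads ATG GCT GAA CTG TAA, while `shifted` reads ATG GCT AAC TGT: every codon after the junction changes and the original stop codon is no longer in frame.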

Page 23: Comp. Genomics

Splice Sites