DNA Motif Finding 2010
DNA Motif Finding
Stewart MacArthur
Bioinformatics Core
March 11th, 2010
Stewart MacArthur (Bioinformatics Core) DNA Motif Finding March 11th, 2010 1 / 33

Introduction
What is a DNA Motif?
DNA motifs are short, recurring patterns that are presumed to have a biological function.
• sequence-specific binding sites
  • transcription factors
  • nucleases
• ribosome binding
• mRNA processing
  • splicing
  • editing
  • polyadenylation
• transcription termination

Representing a motif
How to represent a DNA motif?
How can we represent the binding specificity of a protein, such that we can reliably predict its binding to any given sequence?
Restriction enzyme sites can be written as a simple DNA sequence, e.g. GAATTC for EcoRI:
5’-G A A T T C-3’
3’-C T T A A G-5’
These sequences can incorporate ambiguity using the IUPAC code, e.g. GTYRAC for HincII:
GTYRAC
Y = C or T
R = A or G
All matching sites will be cut by the restriction enzyme.
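
A minimal sketch of how an IUPAC site such as GTYRAC could be searched for in a sequence: each code is expanded to a regular-expression character class. The function names and the test sequence are hypothetical, chosen for illustration. Only the given strand is scanned, which is enough here because both example sites are palindromic.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def iupac_to_regex(site):
    """Translate an IUPAC site such as GTYRAC into a regex string."""
    return "".join(IUPAC[code] for code in site)

def find_sites(site, sequence):
    """Return 0-based start positions of every match, overlaps included."""
    # The lookahead keeps overlapping matches from being skipped.
    pattern = re.compile("(?=" + iupac_to_regex(site) + ")")
    return [m.start() for m in pattern.finditer(sequence)]

# Hypothetical sequence with one EcoRI site and two HincII sites.
seq = "AAGAATTCGGGTTAACCCGTCGACTT"
print("EcoRI  GAATTC:", find_sites("GAATTC", seq))
print("HincII GTYRAC:", find_sites("GTYRAC", seq))
```

Expanding to a character class mirrors the slide's point: every sequence the ambiguous pattern allows counts as a site, with no notion of a better or worse match.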

Transcription Factors are different...
• Regulatory motifs are often degenerate: variable but similar.
• Transcription factors are often pleiotropic, regulating several genes that may need to be expressed at different levels.
• A side effect of this degeneracy is spurious binding, where the protein has affinity at positions in the genome other than its functional sites.
  • Degeneracy in restriction enzyme binding would be lethal
  • Non-specific binding competes for protein and requires more protein to be produced than would be required otherwise

Representing a motif: Consensus
The Consensus Sequence
• A consensus binding site is often used to represent transcription factor binding
• Refers to a sequence that matches all examples of the binding site closely but not exactly
• There is a trade-off between the ambiguity in the consensus and its sensitivity
TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT

The Consensus Sequence: Example
Consensus: TATAAT

Site     Mismatches to consensus
TACGAT   2
TATAAT   0
TATAAT   0
GATACT   2
TATGAT   1
TATGTT   2

Mismatches allowed   Sites found   Expected frequency
0                    2/6           1 site every 4 kb
at most 1            3/6           1 site every 200 bp
up to 2              6/6           1 site every 30 bp
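
The sites-found counts for the consensus example can be checked with a short sketch that counts mismatches between each site and the consensus TATAAT. The six sites are the ones listed on the consensus slide; the function names are hypothetical.

```python
SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
CONSENSUS = "TATAAT"

def mismatches(site, consensus):
    """Number of positions at which the site differs from the consensus."""
    return sum(1 for a, b in zip(site, consensus) if a != b)

def sites_found(sites, consensus, max_mismatches):
    """How many sites lie within max_mismatches of the consensus."""
    return sum(1 for s in sites if mismatches(s, consensus) <= max_mismatches)

for allowed in range(3):
    found = sites_found(SITES, CONSENSUS, allowed)
    print(f"up to {allowed} mismatches: {found}/{len(SITES)} sites")
```

Loosening the threshold recovers more of the known sites, but it also admits many more sequences that are not sites, which is the trade-off the slide describes.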

Representing a motif: IUPAC
IUPAC codes
Code     Meaning
A        Adenine
C        Cytosine
G        Guanine
T        Thymine
R        A or G
Y        C or T
S        G or C
W        A or T
K        G or T
M        A or C
B        C or G or T
D        A or G or T
H        A or C or T
V        A or C or G
N        any base
. or -   gap

The Consensus Sequence: Example
Sites: TACGAT, TATAAT, TATAAT, GATACT, TATGAT, TATGTT
IUPAC consensus: TATRNT

Mismatches allowed   Sites found   Expected frequency
0 (exact match)      4/6           1 site every 500 bp
up to 1              6/6           1 site every 30 bp
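
A sketch of the same check for the IUPAC consensus TATRNT, plus a rough estimate of how often a pattern is expected to match by chance. The spacing estimate assumes random sequence with equal base frequencies (an assumption of this sketch, not something the slides state) and simply enumerates all 6-mers; the function names are hypothetical.

```python
from itertools import product

# Each IUPAC code mapped to the set of bases it allows.
IUPAC_SETS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]

def iupac_mismatches(site, pattern):
    """Count positions where the base is not allowed by the IUPAC code."""
    return sum(1 for base, code in zip(site, pattern)
               if base not in IUPAC_SETS[code])

def expected_spacing(pattern, max_mismatches):
    """Average distance (bp) between chance matches in uniform random DNA."""
    width = len(pattern)
    hits = sum(1 for kmer in product("ACGT", repeat=width)
               if iupac_mismatches(kmer, pattern) <= max_mismatches)
    return 4 ** width / hits

for allowed in (0, 1):
    found = sum(1 for s in SITES if iupac_mismatches(s, "TATRNT") <= allowed)
    spacing = expected_spacing("TATRNT", allowed)
    print(f"up to {allowed} mismatches: {found}/6 sites, "
          f"about 1 chance match every {spacing:.0f} bp")
```

Under these assumptions the estimates come out in the same range as the rounded figures on the slides, for example 512 bp for an exact match to TATRNT.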

Representing a motif: Matrix
The Matrix
• A position weight matrix (PWM)
  • also called position-specific weight matrix (PSWM)
  • also called position-frequency matrix (PFM)
  • also called position-specific scoring matrix (PSSM)
  • or just matrix
• Alternative to the consensus.
• There is a matrix element for all possible bases at every position.

    1   2   3   4   5   6   7   8   9  10  11
A   4  13   5   3   0   0   0   0  17   0   6
C   4   1   2   0   0   0   0   0   0   1   0
G   3   3   0   0  18   0   0   0   1   4   3
T   7   1  11  15   0  18  18  18   0  13   9

Matrix Formats
Counts
A   4  13   5   3   0   0   0   0  17   0   6
C   4   1   2   0   0   0   0   0   0   1   0
G   3   3   0   0  18   0   0   0   1   4   3
T   7   1  11  15   0  18  18  18   0  13   9

Frequency
A  0.2  0.7  0.3  0.2  0.0  0.0  0.0  0.0  0.9  0.0  0.3
C  0.2  0.1  0.1  0.0  0.0  0.0  0.0  0.0  0.0  0.1  0.0
G  0.2  0.2  0.0  0.0  1.0  0.0  0.0  0.0  0.1  0.2  0.2
T  0.4  0.1  0.6  0.8  0.0  1.0  1.0  1.0  0.0  0.7  0.5

Weight (log odds)
A  -0.1   1.0   0.1  -0.4  -2.9  -2.9  -2.9  -2.9   1.3  -2.9   0.3
C  -0.1  -1.3  -0.7  -2.9  -2.9  -2.9  -2.9  -2.9  -2.9  -1.3  -2.9
G  -0.4  -0.4  -2.9  -2.9   1.3  -2.9  -2.9  -2.9  -1.3  -0.1  -0.4
T   0.4  -1.3   0.9   1.2  -2.9   1.3   1.3   1.3  -2.9   1.0   0.7
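
A sketch of how the three formats relate. The slides do not say how the weights were computed; the settings below (natural logarithm, a total pseudocount of 1 spread evenly over the four bases, and a uniform background of 0.25) are assumptions that happen to reproduce the weight values shown to one decimal place. Function names are hypothetical.

```python
import math

BASES = "ACGT"
COUNTS = {
    "A": [4, 13, 5, 3, 0, 0, 0, 0, 17, 0, 6],
    "C": [4, 1, 2, 0, 0, 0, 0, 0, 0, 1, 0],
    "G": [3, 3, 0, 0, 18, 0, 0, 0, 1, 4, 3],
    "T": [7, 1, 11, 15, 0, 18, 18, 18, 0, 13, 9],
}

def column_totals(counts):
    """Number of sites contributing to each column."""
    width = len(counts["A"])
    return [sum(counts[b][i] for b in BASES) for i in range(width)]

def frequencies(counts):
    """Counts divided by the column total (no pseudocount)."""
    totals = column_totals(counts)
    return {b: [c / t for c, t in zip(counts[b], totals)] for b in BASES}

def log_odds(counts, pseudocount=1.0, background=0.25):
    """ln(corrected frequency / background); assumed settings, see lead-in."""
    totals = column_totals(counts)
    return {
        b: [math.log((c + pseudocount * background) / (t + pseudocount) / background)
            for c, t in zip(counts[b], totals)]
        for b in BASES
    }

FREQ = frequencies(COUNTS)
WEIGHTS = log_odds(COUNTS)
for b in BASES:
    print(b, [round(w, 1) for w in WEIGHTS[b]])
```

The pseudocount is what keeps a zero count from producing a weight of minus infinity; with these settings it yields the -2.9 entries instead.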

Sequence Logos
• A visual representation of the motif
• Each column of the matrix is represented as a stack of letters whose size is proportional to the corresponding residue frequency
• The total height of each column is proportional to its information content.

A   4  13   5   3   0   0   0   0  17   0   6
C   4   1   2   0   0   0   0   0   0   1   0
G   3   3   0   0  18   0   0   0   1   4   3
T   7   1  11  15   0  18  18  18   0  13   9

Information Theory
• Information theory is a branch of applied mathematics involved with the quantification of information
• It has been applied to DNA motifs in order to determine the amount of uncertainty at each position in a site
• Uncertainty is measured in bits of information, which is on a log2 scale.
• Information is a decrease in uncertainty
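
Information theory
• 1 base occurs every time: 2 bits
• 2 bases each occur 50% of the time: 1 bit
• 4 bases occur equally: 0 bits

A   4  13   5   3   0   0   0   0  17   0   6
C   4   1   2   0   0   0   0   0   0   1   0
G   3   3   0   0  18   0   0   0   1   4   3
T   7   1  11  15   0  18  18  18   0  13   9

Example
With f_{b,i} the frequency of base b at position i, the information content of position i is
$$I_i = 2 + \sum_{b} f_{b,i} \log_2 f_{b,i}$$
For a position where two bases each occur half of the time:
$$1 = 2 + 0.5 \times \log_2(0.5) + 0.5 \times \log_2(0.5)$$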
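
The formula can be applied to each column of the count matrix with a few lines of code. This is a minimal sketch: terms with zero frequency are taken as contributing nothing, and no small-sample correction is applied (both are choices of this sketch). The function name is hypothetical.

```python
import math

COUNTS = {
    "A": [4, 13, 5, 3, 0, 0, 0, 0, 17, 0, 6],
    "C": [4, 1, 2, 0, 0, 0, 0, 0, 0, 1, 0],
    "G": [3, 3, 0, 0, 18, 0, 0, 0, 1, 4, 3],
    "T": [7, 1, 11, 15, 0, 18, 18, 18, 0, 13, 9],
}

def information_content(column_counts):
    """I = 2 + sum_b f_b * log2(f_b), in bits, for one matrix column."""
    total = sum(column_counts)
    bits = 2.0
    for count in column_counts:
        if count > 0:  # a base that never occurs adds nothing to the sum
            f = count / total
            bits += f * math.log2(f)
    return bits

width = len(COUNTS["A"])
for i in range(width):
    column = [COUNTS[b][i] for b in "ACGT"]
    print(f"position {i + 1}: {information_content(column):.2f} bits")
```

The three cases on the slide come out as expected: a column with a single base gives 2 bits, two equally frequent bases give 1 bit, and four equally frequent bases give 0 bits.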

Why do we want to find them?
Expression Microarrays
• Find co-regulated genes
• Suggest Pathways
ChIP-seq / ChIP-chip
• Determine binding preferences
• Find co-factors

Two Methods
Pattern Matching: finding known motifs
• Does protein X bind upstream of my genes?
• Does it bind more than expected by chance?
e.g. Patser, Pscan, MAST...
Pattern Discovery: finding unknown motifs
• What motifs are upstream of my genes?
• What are these motifs?
e.g. MEME, Weeder, MDScan...

Databases of Motifs
Where can we find known motifs?
Online databases
• Multicellular Eukaryotes
  • Jaspar
  • Transfac
  • Pazar
• Yeast
  • Yeastract
  • SCPD
• Prokaryotes
  • RegulonDB
  • Prodoric
• Other
  • UniProbe
Finding known motifs
How do we find them?
TATATTGTTTATTTTCATGACTTCATGTCGCATGTATTGTTAATTAACACATGTCTCATGTACTGGACCATGTCTAAGGGGTGTAAGGGTACTAACGAATCGTAGCATGTCCAGAGGTGCGGAGTACGTAAGGAGGGTGCCCATACATGTCCGTTTCATATGAGCCTGCATTAATGTACCAACCTTCAACCATGTCTCAACATGTCGCGGGTGTGCCTCCACGTACGAGCCGGAAGTCGACTCGCATGTCTGTCAGTATTATCCAAAGCATGTCGACCTCTTCATGTCAGCGAACGCAAGATCTTCATATGAGCCTGCATTAATGTACC

Pattern Matching
Counts
A   4  13   5   3   0   0   0   0  17   0   6
C   4   1   2   0   0   0   0   0   0   1   0
G   3   3   0   0  18   0   0   0   1   4   3
T   7   1  11  15   0  18  18  18   0  13   9

Frequency
A  0.2  0.7  0.3  0.2  0.0  0.0  0.0  0.0  0.9  0.0  0.3
C  0.2  0.1  0.1  0.0  0.0  0.0  0.0  0.0  0.0  0.1  0.0
G  0.2  0.2  0.0  0.0  1.0  0.0  0.0  0.0  0.1  0.2  0.2
T  0.4  0.1  0.6  0.8  0.0  1.0  1.0  1.0  0.0  0.7  0.5

Weight (log odds)
A  -0.1   1.0   0.1  -0.4  -2.9  -2.9  -2.9  -2.9   1.3  -2.9   0.3
C  -0.1  -1.3  -0.7  -2.9  -2.9  -2.9  -2.9  -2.9  -2.9  -1.3  -2.9
G  -0.4  -0.4  -2.9  -2.9   1.3  -2.9  -2.9  -2.9  -1.3  -0.1  -0.4
T   0.4  -1.3   0.9   1.2  -2.9   1.3   1.3   1.3  -2.9   1.0   0.7

Pattern Matching
A  -0.1   1.0   0.1  -0.4  -2.9  -2.9  -2.9  -2.9   1.3  -2.9   0.3
C  -0.1  -1.3  -0.7  -2.9  -2.9  -2.9  -2.9  -2.9  -2.9  -1.3  -2.9
G  -0.4  -0.4  -2.9  -2.9   1.3  -2.9  -2.9  -2.9  -1.3  -0.1  -0.4
T   0.4  -1.3   0.9   1.2  -2.9   1.3   1.3   1.3  -2.9   1.0   0.7

Sequence:  TATATTGTTTATTTTCATGACTTCATGTCGCATGTATTGTTAATTAA
Window 1:  TATATTGTTTA
Window 2:   ATATTGTTTAT
Window 3:    TATTGTTTATT

Pattern Matching
TA TATTGTTTATT TTCATGACTTCATGTCGCATG TATTGTTAATT AA
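
The sliding-window scan on these slides can be sketched as follows: the weight matrix is moved along the sequence one base at a time and each 11-base window is scored by summing the weight of the observed base at every position. The matrix and sequence are taken from the slides; the function name is hypothetical, only the given strand is scanned, and no score threshold is applied (real tools such as those named in the talk handle both).

```python
WEIGHTS = {
    "A": [-0.1, 1.0, 0.1, -0.4, -2.9, -2.9, -2.9, -2.9, 1.3, -2.9, 0.3],
    "C": [-0.1, -1.3, -0.7, -2.9, -2.9, -2.9, -2.9, -2.9, -2.9, -1.3, -2.9],
    "G": [-0.4, -0.4, -2.9, -2.9, 1.3, -2.9, -2.9, -2.9, -1.3, -0.1, -0.4],
    "T": [0.4, -1.3, 0.9, 1.2, -2.9, 1.3, 1.3, 1.3, -2.9, 1.0, 0.7],
}
SEQUENCE = "TATATTGTTTATTTTCATGACTTCATGTCGCATGTATTGTTAATTAA"

def scan(sequence, weights):
    """Score every window; returns (0-based start, window, score) tuples."""
    width = len(weights["A"])
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(weights[base][i] for i, base in enumerate(window))
        hits.append((start, window, score))
    return hits

hits = scan(SEQUENCE, WEIGHTS)
best = max(hits, key=lambda hit: hit[2])

# Show the highest-scoring windows first.
for start, window, score in sorted(hits, key=lambda hit: hit[2], reverse=True)[:3]:
    print(f"start {start:2d}  {window}  score {score:5.1f}")
```

The top-scoring window is TATTGTTTATT, the third window shown in the walk-through. The second site set off on the final slide, TATTGTTAATT, scores lower because its eighth base is an A where every known site has a T.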

Pattern Discovery
Introduction to de-novo motif finding
de-novo or ab-initio motif finding refers to finding motifs “from the beginning”, i.e. without previous knowledge
Various Methods
• Word-based algorithms, e.g. Oligo-Analysis, Weeder
• Expectation-Maximization methods, e.g. MEME
• Gibbs sampling methods, e.g. Gibbs sampler, MotifSampler
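
To give a feel for the word-based idea, here is a toy sketch that counts every word of length k across a set of sequences and reports the most frequent one. It is not the Oligo-Analysis or Weeder algorithm: those tools assess over-representation against a background model, which this sketch leaves out. The sequences and function name are hypothetical.

```python
from collections import Counter

def count_words(sequences, k):
    """Count every overlapping word of length k across all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

# Hypothetical upstream sequences sharing one planted 7-mer.
toy_sequences = ["AAGATTACAGG", "CCGATTACATT", "GATTACACCCC"]

word, occurrences = count_words(toy_sequences, 7).most_common(1)[0]
print(word, occurrences)
```

Raw counts alone are misleading on real data, because common words such as poly-A runs occur often by chance; comparing observed counts with expected counts is the step that makes a word-based method useful.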

Guidelines
• If possible, remove repeat patterns from the target sequences
• Use multiple motif prediction algorithms
• Run probabilistic algorithms multiple times
• Return multiple motifs
• Try a range of motif widths and expected number of sites
“... we do not recommend to trust pattern discovery results with vertebrate genomes.”
Jacques van Helden

Recommended Tools
Pattern Matching
• RSAT
• Pscan
• Galaxy
• MotifMogul
Pattern Discovery
• RSAT
• MEME
• Weeder
• WebMOTIFS

Recommended Tools: RSA Tools
Regulatory Sequence Analysis Tools
http://rsat.ulb.ac.be/rsat/
Modular computer programs specifically designed for the detection of regulatory signals in non-coding sequences.

Regulatory Sequence Analysis Tools
Nature Protocols Series: Volume 3 No 10 2008
• Using RSAT to scan genome sequences for transcription factor binding sites and cis-regulatory modules
• Using RSAT oligo-analysis and dyad-analysis tools to discover regulatory signals in nucleic sequences
• Analyzing multiple data sets by interconnecting RSAT programs via SOAP Web services - an example with ChIP-chip data
• Network Analysis Tools: from biological networks to clusters and pathways

Example Workflow
Problem
I have some differentially expressed genes from a microarray experiment. I would like to know if P53 binds in their promoter regions, and if so where.
Workflow
• BioMart: Convert Gene IDs, if necessary
• RSAT: retrieve sequence
• JASPAR: Get PWM (MA0106.1)
• RSAT: matrix-scan
• RSAT: feature map

Recommended Tools: Pscan
Pscan
“Finding over-represented transcription factor binding site motifs in sequences from co-regulated or co-expressed genes”

Example Workflow
Problem
I have some differentially expressed genes from a microarray experiment. I would like to know which transcription factors bind to their promoters.
Workflow
• BioMart: Convert Gene IDs, if necessary
• Pscan: retrieve sequence

Recommended Tools: Galaxy
Galaxy
http://main.g2.bx.psu.edu
“Galaxy allows you to do analyses you cannot do anywhere else without the need to install or download anything. You can analyze multiple alignments, compare genomic annotations, profile metagenomic samples and much much more...”
• Collection of online tools
• Modular
• Can create workflows
• Saved Histories
• Reproducible analysis
• Shared histories
• In house version
• Easily extendable
http://kinchie/galaxy

Recommended Tools: MEME Suite
MEME Suite
Suite of web based tools for motif discovery
• MEME - de-novo motif finding
• MAST - find matches to known motifs (MEME output)
• TOMTOM - Compare motifs to TRANSFAC and Jaspar

Further Reading
• Stormo GD. DNA binding sites: representation and discovery. Bioinformatics. 2000 Jan;16(1):16-23. Review. PubMed PMID: 10812473.
• D’haeseleer P. How does DNA sequence motif discovery work? Nat Biotechnol. 2006 Aug;24(8):959-61. Review. PubMed PMID: 16900144.
• Das MK, Dai HK. A survey of DNA motif finding algorithms. BMC Bioinformatics. 2007 Nov 1;8 Suppl 7:S21. Review. PubMed PMID: 18047721; PubMed Central PMCID: PMC2099490.
• Tompa M, Li N et al. Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol. 2005 Jan;23(1):137-44. PubMed PMID: 15637633.
Practical
Practical Session