Introduction to Bioinformatics Research Project The Feb 28 entry on the calendar (and the Research Project topic page) links to advice on how to choose a research group and how to begin a protein-centered research project, including how to find useful articles. Here I present an example of a DNA-centered research project, beginning mostly after the useful-article stage

Page 1:

Introduction to Bioinformatics
Research Project

The Feb 28 entry on the calendar (and the Research Project topic page) links to advice on how to choose a research group and how to begin a protein-centered research project, including how to find useful articles.

Here I present an example of a DNA-centered research project, beginning mostly after the useful-article stage.

Page 2:

How to choose your research group

Mueser TC et al (2010) Virol J 7:359

In this alternate universe, I’m in the DNA replication group.

I’m particularly interested in the DNA sequences that determine the initiation of DNA replication.

I’ve even read an article or two about them, discovering...

Page 3:

Origin of DNA replication

[Figure: circular, dsDNA genome with the origin marked]

...that DNA in prokaryotes and their phages is primarily circular.

To replicate it, the circle has to be opened at some point. That point is called the origin of replication.

Page 4:

Origin of DNA replication

[Figure: circular, dsDNA genome; origin; bidirectional initiation]

Opening the circle at the origin exposes two single strands. Both are replicated, with the replication forks moving in both directions, away from the origin.

Page 5:

Origin of DNA replication

[Figure: circular, dsDNA genome; origin; bidirectional initiation; elongation; separation]

Eventually, two separate daughter circles are formed.

...But enough chatting. The issue is: how is the starting point chosen?

Page 6:

Origin of DNA replication

[Figure label: Origin]

Zooming in on the origin, we see the two intertwined strands at oriC (i.e., the Origin of the Chromosome)

Page 7:

Origin of DNA replication

[Figure label: Origin]

What makes the origin special is that it binds proteins essential for initiating replication. The picture shows green DnaA protein binding to the origin, along with a protein called FIS (more on this in a moment).

Page 8:

Origin of DNA replication

[Figure label: Origin]

DnaA proteins bind not only to DNA but also to each other. With the help of a second DNA-binding protein, IHF (keep waiting), the bound DnaA proteins form a blob that distorts the DNA.

The two strands of DNA separate at a nearby AT-rich region (you may recall that AT-rich regions are less stable than GC-rich regions).

Page 9:

Origin of DNA replication

[Figure labels: Origin; FIS = Factor for Inversion Stimulation in Phage Mu]

That’s the general idea.

For the rest of this project, I’m going to focus on DnaA, but before leaving the other proteins behind... (I hate throwing around undefined acronyms...)

FIS was first discovered as a protein important in gene regulation by a phage.

Page 10:

Origin of DNA replication

[Figure labels: Origin; IHF = Integration Host Factor for lysogeny of Phage Lambda]

Same with IHF. It was first found as a protein used by a phage to integrate its genome into the bacterial genome.

It’s amazing how many things were first found in phages.

Page 11:

Origin of DNA replication

[Figure label: Origin]

How to recognize an origin of replication?

But back to the main question at hand.

I want to learn how to recognize origins of replication. If I build a tool that can find known bacterial origins, maybe I can use the tool to search for origins in bacteriophages.

Do phages have the same sorts of origins? Don’t know.

Page 12:

Origin of DNA replication

[Figure label: Origin]

How to recognize an origin of replication?

But how to tell?

One thing that distinguishes origins is their ability to bind DnaA protein. If DnaA binds to a specific sequence, then origins must have multiple copies of that sequence in close proximity.

Does DnaA bind to a specific sequence?

Page 13:

Origin of DNA replication

Kaguni (2006) Annu Rev Microbiol 60:351-371.

DnaA binding site

Is DnaA binding to DNA specific? I found an article that says the answer is yes. The E. coli origin of replication, pictured above, has five specific binding sites for DnaA.

I need to learn more about that sequence. Orange colored boxes are nice, but at this point, I need to get closer to the truth, closer to the sequence.

Page 14:

Origin of DNA replication

Kaguni (2006) Annu Rev Microbiol 60:351-371.

Fuller et al (1984) Cell 38:889-900.

DnaA binding site

Here’s the sequence of the E. coli origin region. R1-R4 represent the sequences protected by DnaA when it binds. Are they all the same sequence?

Page 15:

Origin of DNA replication

Kaguni (2006) Annu Rev Microbiol 60:351-371.

Fuller et al (1984) Cell 38:889-900.

DnaA binding site

For example R1 and R2... Are they the same sequence? Why are there two sets of nucleotides in each box?

Page 16:

Origin of DNA replication

Kaguni (2006) Annu Rev Microbiol 60:351-371.

Fuller et al (1984) Cell 38:889-900.

DnaA binding site

If you notice that both strands of the DNA are shown, then you can make more sense of the boxes.

Page 17:

Origin of DNA replication

Kaguni (2006) Annu Rev Microbiol 60:351-371.

Fuller et al (1984) Cell 38:889-900.

DnaA binding site

Putting all the boxes together (choosing one of the two strands arbitrarily), I begin to see a pattern. Kaguni said there was also R5 (M). Where’s that?

R1 TTATCCACA
R2 TTATACACA
R3 TTATCCAAA
R4 TTATCCACA
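One way to make the pattern explicit is to line up the four boxes and see which positions vary. The following is a minimal Python sketch (not BioBIKE code); the helper name `column_pattern` is made up for illustration, and the box sequences are copied from the list on this slide.

```python
# The four 9-nucleotide boxes as transcribed on this slide.
boxes = {
    "R1": "TTATCCACA",
    "R2": "TTATACACA",
    "R3": "TTATCCAAA",
    "R4": "TTATCCACA",
}

def column_pattern(sequences):
    """Collapse equal-length sequences into a character-class pattern."""
    pattern = []
    for column in zip(*sequences):
        letters = sorted(set(column))
        if len(letters) == 1:
            # Every box has the same letter here: the position is invariant.
            pattern.append(letters[0])
        else:
            # The boxes disagree here: list the alternatives in brackets.
            pattern.append("[" + "".join(letters) + "]")
    return "".join(pattern)

summary = column_pattern(boxes.values())
print(summary)
```

A summary like this only describes the boxes that went into it. Whether to keep every variable position or tighten the pattern is a judgment call that has to be tested against the genome.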

Page 18:

Origin of DNA replication

Fuller et al (1984) Cell 38:889-900.

Enough orange boxes! Even enough paper sequences!

If I’m going to make an origin-finding tool, I need to test it on a known case – Why not this case? Can I find the E. coli origin by DnaA-binding sequences?

R1 TTATCCACA
R2 TTATACACA
R3 TTATCCAAA
R4 TTATCCACA

Page 19:

My goal is to make a general origin-finding tool, using the E. coli origin as a test case.

I therefore need to find the coordinates of the E. coli origin, so I can tell if my tool is working.

Since I'm going to build the tool in BioBIKE, I need the coordinates known to BioBIKE. There's no point finding the origin in GenBank or anywhere else. PhAnToMe is where you’ll find E. coli and phage sequences.

Page 20:

How do I find the E. coli origin in E. coli?

My general origin-finding tool will look for DnaA-binding sites. I think that will work to find the E. coli origin, but I don't know that it will work.

I need the coordinates of the E. coli origin so I can test my unproven tool with a known case.

So, how can I find the E. coli origin with absolute certainty?

What do I have in hand to enable me to find it?

Page 21:

What do I have in hand to enable me to find the origin?

Of course I have the sequence. That's essentially foolproof, so long as I have the E. coli genome sequence available to search through.

Looking for the sequence is much more certain than looking for DnaA boxes or some region annotated as “the origin”.

Page 22:

One strategy is to display the sequence of E. coli K12 (which is the standard laboratory strain).

Page 23:

Searching for some portion of the published origin sequence should get me to the right place in the genome. It doesn’t matter much which part of the origin I choose.

Page 24:

The search finds nothing. How could that be?!? I recheck the sequence... No problem.

Page 25:

When some strategy fails for no apparent reason and defies your best efforts to understand why, it is generally a good idea to try something completely different, even though the different strategy may not sound any more promising.

It is the worm that wiggles that gets off the hook.

So I try searching the E. coli genome for the same sequence, using a high threshold (expect value of 10, which would allow even rare random matches to sneak through).

Page 26:

That was informative!

The first match goes from the beginning to end (Q-start=1, Q-end=30) of the 30-nucleotide sequence I gave it, but the match was only 96.67%. There must be a mismatch somewhere!

The other matches are very partial with poor E-values. I’ll ignore them.
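The 96.67% figure is easy to check by hand: 29 matching positions out of 30 aligned. A small Python sketch of that arithmetic (the function name `percent_identity` is made up for illustration):

```python
def percent_identity(matching_positions, aligned_length):
    """Percentage of aligned positions at which query and subject agree."""
    return 100.0 * matching_positions / aligned_length

# A 30-nucleotide query that matches everywhere except one position.
identity = percent_identity(29, 30)
print(round(identity, 2))
```

So a full-length hit at 96.67% is what a single disagreement in a 30-nucleotide query looks like.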

Page 27:

Where is the mismatch?

The ALIGNMENT-OF function allows me to compare the 30-nucleotide query sequence with the actual sequence from E. coli. I used the coordinates provided by SEQUENCE-SIMILAR-TO to pick out the relevant portion of the genome.
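Outside BioBIKE, the same kind of comparison can be roughed out with Python's standard `difflib` module. This is only a sketch of the idea, not what ALIGNMENT-OF does internally, and the two short sequences below are made up for illustration (they are not the origin sequence).

```python
from difflib import SequenceMatcher

def differences(query, subject):
    """List the edit operations needed to turn query into subject."""
    matcher = SequenceMatcher(None, query, subject, autojunk=False)
    # Keep only the stretches where the two sequences disagree.
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]

toy_query = "GATCGGTA"   # made-up "published" sequence
toy_genome = "GATCGTA"   # made-up "genome" sequence

diffs = differences(toy_query, toy_genome)
for tag, q1, q2, s1, s2 in diffs:
    print(tag, repr(toy_query[q1:q2]), "->", repr(toy_genome[s1:s2]))
```

Each reported operation gives the coordinates of the disagreement in both sequences, which is what I need to decide which of the two sources to trust.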

Page 28:

Ah! The original article from which I got the origin sequence had an error in it, an extra G! This is not so surprising.

In 1984 (the year of the article), all sequencing was done by hand with little redundancy.

In any event, I think I found the origin, around coordinate 3923300.

Page 29:

Note how I got to this region: Clearing the Search field, entering the coordinate in the Go To field, and clicking Go.

Don’t be concerned about the blank lines on the top and the mayhem on the right. The E. coli genome happens to have lots of sequence features that people have annotated, and the Sequence Viewer doesn’t handle them very well.

Page 30:

First to confirm: Is this the right sequence? The first 30 nucleotides should match, of course (except for one). What about the rest? I’ll check the first 80... Check!

Page 31:

Does the region have the DnaA-binding motifs?

I could search for each individual sequence, but it’s more efficient to search for the pattern that encompasses all of them.

...Why only two? What happened to the other two? (You might want to look several slides back at the sequence.)

R1 TTATCCACA
R2 TTATACACA
R3 TTATCCAAA
R4 TTATCCACA

Page 32:

I can't depend on my own eyes. I need to automate the process.

MATCHES-OF-PATTERN will search for the same DnaA-binding pattern but return all the results at once.

DnaA has no preference for which of the two strands it binds to, so I specify BOTH-STRANDS.
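For readers without BioBIKE at hand, here is a rough Python equivalent of a both-strands pattern search. It is a sketch, not the implementation of MATCHES-OF-PATTERN: the function names are made up, the pattern TTAT[CA]CACA is the one listed on the algorithm slide later in this deck, and the test sequence is a toy built to contain one site on each strand.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def find_sites(seq, pattern):
    """Return (start, end, strand) for every match, in 1-based forward coordinates."""
    regex = re.compile(pattern)
    n = len(seq)
    hits = []
    for m in regex.finditer(seq):
        hits.append((m.start() + 1, m.end(), "+"))
    for m in regex.finditer(reverse_complement(seq)):
        # Convert positions on the reverse strand back to forward coordinates.
        hits.append((n - m.end() + 1, n - m.start(), "-"))
    return hits

# Toy sequence: TTATCCACA on the forward strand, TTATACACA on the reverse strand.
toy = "GG" + "TTATCCACA" + "CC" + "TGTGTATAA" + "G"
sites = find_sites(toy, "TTAT[CA]CACA")
print(sites)
```

Searching the reverse complement and converting the coordinates back keeps all results in one coordinate system, which matters later when the sites are sorted.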

Page 33:

Note that the results are shown formatted in a popup window for immediate gratification and also in the result pane for further use.

There are a lot of sequences matching the pattern!

How many? And how many would you expect by chance?

Page 34:

How many? That’s the easy one. I just counted the list (using * to indicate the previous result).

How many expected by chance? Not much harder. You’ve done this sort of calculation many times in the past and will do so many times in the future.

You should reach the conclusion that most of the matches are garbage.
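As a reminder of how the chance calculation goes, here is a Python sketch under loudly stated assumptions: all four nucleotides are equally frequent, the pattern is TTAT[CA]CACA (eight fixed positions and one two-way position, as listed on the algorithm slide), both strands are searched, and the genome length is taken as roughly 4.6 million nucleotides for E. coli K12. The function name is made up.

```python
def expected_matches(genome_length, fixed_positions, two_way_positions, strands=2):
    """Expected number of chance matches to a pattern in random sequence."""
    # Assumption: each nucleotide occurs with probability 1/4 at every position.
    p_site = (1 / 4) ** fixed_positions * (1 / 2) ** two_way_positions
    return genome_length * strands * p_site

# Assumed genome length: about 4.6 million nucleotides.
expected = expected_matches(4_600_000, fixed_positions=8, two_way_positions=1)
print(round(expected, 1))
```

Comparing this expectation with the number of matches actually counted shows how much of the list could be explained by chance alone. Real genomes are not random sequence, so the estimate is only a baseline.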

Page 35:

If a mere match to a DnaA-binding sequence is not informative, then how can we recognize an origin?

What’s distinctive about the origin is that it contains a cluster of DnaA-binding sites.

Unfortunately, it is difficult to recognize clusters of sites because the sites’ coordinates are not sorted.

That’s the next step. (And then to clean up the screen.)

Page 36:

That’s much better! With the sorted list, I can see the cluster of four DnaA-binding sites at the known origin of E. coli (at coordinate ~3923000).

Maybe there are other clusters? I’m not sure I’m up to peering through the entire list. However, I can see how I’d do it, examining each line with respect to its neighbors and keeping only those sites that are close to other sites.

I need to automate this process to create the tool that can scan hundreds of genomes looking for origins of replication.
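The neighbor-by-neighbor idea can be sketched in a few lines of Python. This is one possible approach, not a finished tool: the function name, the 500-nucleotide gap, and the minimum of three sites are all arbitrary choices for illustration, the coordinates are made up, and the sketch ignores the fact that a circular genome wraps around.

```python
def find_clusters(coordinates, max_gap=500, min_sites=3):
    """Group sorted coordinates into runs whose neighbors lie within max_gap."""
    clusters = []
    current = []
    for pos in sorted(coordinates):
        if current and pos - current[-1] > max_gap:
            # The gap to the previous site is too large: close the current run.
            if len(current) >= min_sites:
                clusters.append(current)
            current = []
        current.append(pos)
    # Do not forget the final run.
    if len(current) >= min_sites:
        clusters.append(current)
    return clusters

# Made-up site coordinates: one tight group and two isolated sites.
toy_sites = [90000, 5130, 100, 5000, 5200, 5040]
clusters = find_clusters(toy_sites)
print(clusters)
```

Both thresholds are things to experiment with: too loose and chance matches form clusters, too strict and a real origin with a weak site is missed.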

Page 37:

Automation of this sort of thing will come later. Can't do everything at once.

For now, I'll package the progress I've made to enable me to experiment easily.

I'll take the steps I've developed and put them into a function.

Page 38:

My function consists of no more than what I did step by step. Now it has a name.

Also, I generalized it to work with any genome, not just E. coli.

Does it work?

Page 39:

Yes! Executing the function (now on my FUNCTION button) with E. coli as the argument gives exactly the same result as I got before.

Will it work with other organisms?

Page 40:

Maybe! I tried it on Yersinia pestis (causative agent of the plague) and got a very provocative result. What are the odds that five DnaA sites would come up in the first 2000 nucleotides by chance?

(do the calculation)
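One way to set up the calculation is to treat chance matches as a Poisson process. The Python sketch below rests on assumptions that are mine, not the slide's: equally frequent nucleotides, the pattern TTAT[CA]CACA, both strands searched, and independence between positions. The function name is made up.

```python
import math

def prob_at_least(k, window, p_site, strands=2, terms=50):
    """Poisson probability of k or more chance matches within a window."""
    lam = window * strands * p_site  # expected number of chance matches
    # Sum the upper tail directly; the terms shrink very quickly.
    return sum(
        math.exp(-lam) * lam ** i / math.factorial(i)
        for i in range(k, k + terms)
    )

# Chance that a random 9-mer fits a pattern with 8 fixed and 1 two-way position.
p_site = (1 / 4) ** 8 * (1 / 2)
p_five_or_more = prob_at_least(5, window=2000, p_site=p_site)
print(f"{p_five_or_more:.2e}")
```

Keep in mind that this is the probability for one pre-chosen window. Scanning a whole genome, or many genomes, gives chance more opportunities, so the number should be read with that in mind.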

Page 41:

With this function in hand, I can experiment, checking whether my method is any good. I will undoubtedly find that it could be improved in lots of ways. The ability to do quick experiments and gain rapid feedback enables my ideas to evolve.

Page 42:

Origin of DNA replication
Algorithm (where it stands)

* Search genome sequence for DnaA-binding sites
  - TTAT[CA]CACA
  - (not perfect – allow one mismatch?)
  - Use MATCHES-OF-PATTERN

* Sort sites by coordinate
  - Use SORT

* Look for clusters of sites
  - (How???)

(Eventually) Apply to all phage genomes
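The "allow one mismatch?" idea from the first step can be tried out with a simple sliding-window scan. The Python below is a sketch of one possible approach, not part of the BioBIKE workflow: the function names are made up, the two variants are just the pattern TTAT[CA]CACA written out in full, only the forward strand is scanned, and the test sequence is a toy.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def approximate_sites(seq, variants, max_mismatches=1):
    """Return (1-based start, window, mismatches) for near matches to any variant."""
    width = len(variants[0])
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        # Score the window against its closest variant.
        best = min(hamming(window, v) for v in variants)
        if best <= max_mismatches:
            hits.append((start + 1, window, best))
    return hits

# TTAT[CA]CACA written out as its two concrete sequences.
VARIANTS = ["TTATCCACA", "TTATACACA"]

toy = "GG" + "TTATCCAAA" + "GG"   # contains a one-mismatch site
near_sites = approximate_sites(toy, VARIANTS)
print(near_sites)
```

Loosening the match this way also raises the number of matches expected by chance, so the chance calculation would need to be redone for the relaxed pattern.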

Page 43:

Morals of the Story

* Make problem tangible

Abstractions can give you a comforting big picture, but you won't make any progress unless you can connect the abstractions to reality.

Page 44:

Morals of the Story

* Make problem tangible

* Test ideas by experimentation

Develop your methods using cases where the answer is already known.

Page 45:

Morals of the Story

* Make problem tangible

* Test ideas by experimentation

* Package your insights into functions

Start with an imperfect function and let it evolve as you gain more experience.

Page 46:

Morals of the Story

* Make problem tangible

* Test ideas by experimentation

* Package your insights into functions

* Test the limits of your method

Try weird cases. Figure out why the method fails (if it fails) and what would make it not work (if it works).

Do lots of experiments.

Page 47:

Morals of the Story

* Make problem tangible

* Test ideas by experimentation

* Package your insights into functions

* Test the limits of your method

* When things don't work (inevitable), cope

Try something different. Try lots of somethings different.

Page 48:

Morals of the Story

* Make problem tangible

* Test ideas by experimentation

* Package your insights into functions

* Test the limits of your method

* When things don't work (inevitable), cope

* When things continue not to work, talk with others

Sometimes pooled confusion can lead to light.

Page 49:

TATTCAAAATGAATTATATCGGTAAATATCTGCAACTTTAAACCTGAATGA

GGATTTAGTATTGCTGGGCCAGCCCAAAGTTTAGAATTTTCATCAACTTTGCACAATGATGGAAAACGTGAATTCAAAAGGATTGCTATATATTATTAAGAAAACATTTGGAATTCGAGAACCGGAATATGGCATTCCGCAAATTAGAGAACGGAATAGGTATTCCTAAAAAAACACATTCTCTGCAATTTTTAAGATGAGTATTATACCTGCACTAACTTTGTGGGACGCAATATCAGAGCAACCCTATCATTTAAAACCTCAAAATACTTATCAGACTTGGGGAACATTCTGACCGTTTAGTAGAACGTTTCCGGCATATAAAATGGGGTGAAGTGGTAATGGTGAATTATCAAACAAATCATATGATCAGAATAATCGCCGTTTAAATCCATCCTTTTCAACATCGAAATTTAACAGCCCGTGAAGGAGCTAGAATCCAATCTTTTCCAGGAAGAAAGATTTGATGAAAAATTTCTTTGTCAATATAATCAAATCGGTAATGCTGTACCCCCTCTTCTCGCTAGTGCATGGATCAAATCTTGAACAAAAAGAGAATCATCGTACAAAATACAGAGATACTGAAAGCAGGACTTTCCTTAGAGAAAT

CAAGATGATTCAATTATTACTCAAAGAGTGGAACTTCTCACTAAATATAAAGATTTTTTAGATCAGCAGCATTATGCAGAAAAATTTGATTCAAGATCCAACCGCTCATAATCCTTACTGAGACGACGGTACTGGTTTAACCAGCCAAATGTTCTTTCTACTACCCACCGTTTGGGCAAAACCTGAAATTCTTGATTAGTACGCCGGATTACCTCAACATGAGCTTGAATCTAGGCGGCAAGTAATCTTTCTCCAGCATTTGCTTCACTTACAACCACTTTTAACAAAAGTCCCAGACTATCAACCAAAGTTTGCCGCTTTCGTCCTTTTACCTTCTTGCCACCATCAAAACCGTACACATCCCCCTTTTTTCAGTCGTTTTTACCGACTGGCTGTCTGCCGCGATCGCCGTGGGTTGAGTTGACTTCCCCATTTTTTGACGAACTTGATCGCGCAAAGTATGATTCATTTCAGTTGAACTAGGAGGAAAATCCCCTGGAAGCATATCCCACTGAATTCGAATTCGAATTCGAATTCGAATTCGACAACCTGTTTTCAGATGGTAGTAGATAGCGTTGCATACTTCTCGCATATCAGTTGTTCGGGGATGCCCACCGCATTTAGCGGGTGGAATCAAAGGAGCTAAAATTGCCCATTCTGAGTCATTAAGGTCTGTAGAATAAGACTTTCGTCTCATTGTTTCCTATGTAAATACACTCTACAAACAGTATCTTATCGCTGCCTTTTTATCTTAGCTCTCCTTTAGATTTACTTTATAAATAGCCTCTTAGAAGAATTTCTTTATTATTTATTTAAAGATTTAGTACAAGATTTCGGGCAGAACGCTCTTATTGGTAAGTCACACACGTTCAAAGATATTTTCTTCGTACCACCAAAATATTCTGAAATGCTCAAGCGACCTTATGCGCGAATTGAGAGAAAAGATCATGATTTCGTAATTGGTGCAACTGTTCAAGCATCGCTTGAAGCAGCACCTCCTCCAGAACAAAACCATGCTTGAGGGATCTTCACGCGCAGCAGAGGATTTAAAAGCGAGAAATCCTAACAGTTTATACCTTGTGGTTATGGAATGGATAAAACT

GACCAATGATGTAAATTTACGAAAATATAAAGTTGATCAAATTTATGTACTACGTCAGCAAAAAAATACTGATAGAGAGTTTAGGTATGAGTCAACTTACATAAAAAAT