Introduction to Bioinformatics Research Project The Feb 28 entry on the calendar (and the Research...

Introduction to Bioinformatics Research Project

The Feb 28 entry on the calendar (and the Research Project topic page) links to advice on how to choose a research group and how to begin a protein-centered research project, including how to find useful articles.

Here I present an example of a DNA-centered research project, beginning mostly after the useful-article stage.

How to choose your research group

Mueser TC et al (2010) Virol J 7:359

In this alternate universe, I’m in the DNA replication group.

I’m particularly interested in the DNA sequences that determine the initiation of DNA replication.

I’ve even read an article or two about them, discovering...

Origin of DNA replication

[Figure: circular, dsDNA genome with the origin marked]

...that DNA in prokaryotes and their phages is primarily circular.

To replicate it, the circle has to be opened at some point. That point is called the origin of replication.


[Figure: genome showing bidirectional initiation at the origin]

Opening the circle at the origin exposes two single strands. Both are replicated, with the replication fork moving in both directions, away from the origin.


[Figure: stages of genome replication: bidirectional initiation at the origin, elongation, separation]

Eventually, two separate daughter circles are formed.

...But enough chatting. The issue is: how is the starting point chosen?

Origin of DNA replication

[Figure: close-up of the origin]

Zooming in on the origin, we see the two intertwined strands at oriC (i.e., the Origin of the Chromosome)



What makes the origin special is that it binds proteins essential for initiating replication. The picture shows green DnaA protein binding to the origin – also a protein called FIS (more on this in a moment).



DnaA proteins bind not only to DNA but also to each other. With the help of a second DNA-binding protein, IHF (keep waiting), the bound DnaA proteins form a blob that distorts the DNA.

The two strands of DNA separate at a nearby AT-rich region (you may recall that AT-rich regions are less stable than GC-rich regions).



FIS: Factor for Inversion Stimulation in Phage Mu

That’s the general idea.

For the rest of this project, I’m going to focus on DnaA, but before leaving the other proteins behind... (I hate throwing around undefined acronyms...)

FIS was first discovered as a protein important in gene regulation by a phage.



IHF: Integration Host Factor for lysogeny of Phage Lambda

Same with IHF. It was first found as a protein used by a phage to integrate its genome into the bacterial genome.

It’s amazing how many things were first found in phages.



How to recognize origin of replication?

But back to the main question at hand.

I want to learn how to recognize origins of replication. If I build a tool that can find known bacterial origins, maybe I can use the tool to search for origins in bacteriophages.

Do phages have the same sorts of origins? Don’t know.



How to recognize origin of replication?

But how to tell?

One thing that distinguishes origins is their ability to bind DnaA protein. If DnaA binds to a specific sequence, then origins must have multiple copies of it in close proximity.

Does DnaA bind to a specific sequence?

Origin of DNA replication

Kaguni (2006) Annu Rev Microbiol 60:351-371. DnaA binding site

Is DnaA binding to DNA specific? I found an article that says the answer is yes. The E. coli origin of replication, pictured above, has five specific binding sites for DnaA.

I need to learn more about that sequence. Orange colored boxes are nice, but at this point, I need to get closer to the truth, closer to the sequence.


Kaguni (2006) Annu Rev Microbiol 60:351-371.

Fuller et al (1984) Cell 38:889-900.

DnaA binding site

Here’s the sequence of the E. coli origin region. R1-R4 represent the sequences protected by DnaA when it binds. Are they all the same sequence?



Fuller et al (1984) Cell 38:889-900.

DnaA binding site

For example R1 and R2... Are they the same sequence? Why are there two sets of nucleotides in each box?



Fuller et al (1984) Cell 38:889-900.

DnaA binding site

If you notice that both strands of the DNA are shown, then you can make more sense of the boxes.
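The two sets of nucleotides in each box are the same site read from the two strands. A minimal sketch of that relationship, using the R1 box sequence given in this presentation (the function name is my own, not a BioBIKE function):

```python
# Sketch: read a boxed site from the opposite strand.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the opposite strand of seq, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]

site = "TTATCCACA"  # R1 as listed in this presentation
print(site, "<->", reverse_complement(site))
```

Either reading describes the same stretch of double-stranded DNA, which is why choosing one strand arbitrarily loses nothing.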



Fuller et al (1984) Cell 38:889-900.

DnaA binding site

Putting all the boxes together (choosing one of the two strands arbitrarily), I begin to see a pattern. Kaguni said there was also R5 (M). Where’s that?

R1 TTATCCACA
R2 TTATACACA
R3 TTATCCAAA
R4 TTATCCACA
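One way to make the pattern explicit is to line up the four boxes and note which positions vary. This is only a sketch of that idea (the helper name is hypothetical), not the method used in the presentation:

```python
def pattern_from_sites(sites):
    """Collapse equal-length sites into one pattern, bracketing variable positions."""
    if len({len(s) for s in sites}) != 1:
        raise ValueError("all sites must have the same length")
    parts = []
    for column in zip(*sites):
        letters = sorted(set(column))
        # A single letter is kept as is; alternatives go in brackets.
        parts.append(letters[0] if len(letters) == 1 else "[" + "".join(letters) + "]")
    return "".join(parts)

boxes = {"R1": "TTATCCACA", "R2": "TTATACACA", "R3": "TTATCCAAA", "R4": "TTATCCACA"}
print(pattern_from_sites(list(boxes.values())))
```

Collapsing all four boxes this way gives a pattern slightly broader than the TTAT[CA]CACA listed in the algorithm summary, because it also admits the A at position 8 of R3.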


Fuller et al (1984) Cell 38:889-900.

Enough orange boxes! Even enough paper sequences!

If I’m going to make an origin-finding tool, I need to test it on a known case – Why not this case? Can I find the E. coli origin by DnaA-binding sequences?


My goal is to make a general origin-finding tool, using the E. coli origin as a test case.

I therefore need to find the coordinates of the E. coli origin, so I can tell if my tool is working.

Since I'm going to build the tool in BioBIKE, I need the coordinates known to BioBIKE. There's no point finding the origin in GenBank or anywhere else. PhAnToMe is where you’ll find E. coli and phage sequences.

How do I find the E. coli origin in E. coli?

My general origin-finding tool will look for DnaA-binding sites. I think that will work to find the E. coli origin, but I don't know it will work.

I need the coordinates of the E. coli origin so I can test my unproven tool with a known case.

So, how can I find the E. coli origin with absolute certainty?

What do I have in hand to enable me to find it?

What do I have in hand to enable me to find the origin?

Of course I have the sequence. That's essentially foolproof, so long as I have available the E. coli genome sequence to search through.

Looking for the sequence is much more certain than looking for DnaA boxes or some region annotated as “the origin”

One strategy is to display the sequence of E. coli K12 (which is the standard laboratory strain).

Searching for some portion of the published origin sequence should get me to the right place in the genome. It doesn’t matter much which part of the origin I choose.


How could that be?!? I recheck the sequence... No problem.

When some strategy fails for no apparent reason and defies your best efforts to understand why, it is generally a good idea to try something completely different, even though the different strategy may not sound any more promising.

It is the worm that wiggles that gets off the hook.

So I try searching the E. coli genome for the same sequence, using a high threshold (expect value of 10, which would allow even rare random matches to sneak through).

That was informative!

The first match goes from the beginning to end (Q-start=1, Q-end=30) of the 30-nucleotide sequence I gave it, but the match was only 96.67%. There must be a mismatch somewhere!
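A quick check of the arithmetic: 96.67% of 30 is 29, which fits a single column that does not match. A minimal sketch, assuming identity is the fraction of alignment columns with identical letters (how the search tool actually computes it is not stated here), run on made-up sequences rather than the real origin:

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns with identical letters (gaps written as '-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * same / len(aligned_a)

query = "ACGT" * 7 + "AC"                  # hypothetical 30-nt query
target = query[:14] + "C" + query[15:]     # same sequence with one base changed
print(round(percent_identity(query, target), 2))
```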

The other matches are very partial with poor E-values. I’ll ignore them.

Where is the mismatch?

The ALIGNMENT-OF function allows me to compare the 30-nucleotide query sequence with the actual sequence from E. coli. I used the coordinates provided by SEQUENCE-SIMILAR-TO to pick out the relevant portion of the genome.

Ah! The original article from which I got the origin sequence had an error in it, an extra G! This is not so surprising.

In 1984 (the year of the article), all sequencing was done by hand with little redundancy.
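Outside BioBIKE, the same kind of comparison can be sketched with Python's standard `difflib` module. This is not how ALIGNMENT-OF works internally (I don't know that), and the two sequences below are invented to mimic a published sequence carrying one extra G:

```python
from difflib import SequenceMatcher

def differences(query: str, target: str):
    """List (operation, query_text, target_text, query_index) for each non-matching stretch."""
    matcher = SequenceMatcher(None, query, target, autojunk=False)
    return [
        (tag, query[i1:i2], target[j1:j2], i1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

published = "GATCGTACA"  # hypothetical published sequence with an extra G
genomic = "GATCTACA"     # hypothetical genome sequence
for tag, q_text, t_text, index in differences(published, genomic):
    print(tag, repr(q_text), repr(t_text), "at query index", index)
```

A "delete" operation here means the query has a base that the genome lacks, which is the situation described above.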

In any event, I think I found the origin – around coordinate 3923300

Note how I got to this region: Clearing the Search field, entering the coordinate in the Go To field, and clicking Go.

Don’t be concerned about the blank lines on the top and the mayhem on the right. The E. coli genome happens to have lots of sequence features that people have annotated, and the Sequence Viewer doesn’t handle them very well.

First to confirm: Is this the right sequence? The first 30 nucleotides should match, of course (except for one). What about the rest? I’ll check the first 80... Check!

Does the region have the DnaA-binding motifs?

I could search for each individual sequence, but it’s more efficient to search for the pattern that encompasses all of them.

...Why only two? What happened to the other two? (You might want to look several slides back at the sequence.)


I can't depend on my own eyes. I need to automate the process.

MATCHES-OF-PATTERN will search for the same DnaA-binding pattern but return all the results at once.

There’s no preference which of the two strands a DnaA protein will bind to, so I specify BOTH-STRANDS.
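For readers without BioBIKE at hand, here is a rough Python analogue of a both-strands pattern search. It is a sketch under my own assumptions (function names, 1-based forward-strand coordinates, the tuple layout of the results) and makes no claim about what MATCHES-OF-PATTERN returns. The toy genome is invented.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    return seq.upper().translate(COMPLEMENT)[::-1]

def matches_on_both_strands(pattern: str, genome: str):
    """Return (start, end, strand, text), 1-based inclusive, in forward-strand coordinates."""
    genome = genome.upper()
    n = len(genome)
    regex = re.compile("(?=(" + pattern + "))")  # lookahead also finds overlapping sites
    hits = [(m.start(1) + 1, m.end(1), "+", m.group(1)) for m in regex.finditer(genome)]
    for m in regex.finditer(reverse_complement(genome)):
        # Convert positions on the reverse complement back to forward coordinates.
        hits.append((n - m.end(1) + 1, n - m.start(1), "-", m.group(1)))
    return hits

# Toy genome: one site on the forward strand, one on the reverse strand.
toy = "GG" + "TTATCCACA" + "GGG" + reverse_complement("TTATACACA") + "CC"
for hit in matches_on_both_strands("TTAT[CA]CACA", toy):
    print(hit)
```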

Note that the results are shown formatted in a popup window for immediate gratification and also in the result pane for further use.

There are a lot of sequences matching the pattern!

How many? And how many would you expect by chance?

How many? That’s the easy one. I just counted the list (using * to indicate the previous result)

How many expected by chance? Not much worse. You’ve done this sort of calculation many times in the past and will do so many times in the future.
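If you want to check your own calculation, here is one way to sketch it. The assumptions are mine: all four bases equally likely and independent, the pattern TTAT[CA]CACA from the algorithm summary, both strands searched, and a genome of roughly 4.6 million nucleotides for E. coli K12.

```python
def match_probability(pattern: str) -> float:
    """Chance that a random position matches, assuming equal base frequencies."""
    prob = 1.0
    i = 0
    while i < len(pattern):
        if pattern[i] == "[":
            close = pattern.index("]", i)
            prob *= len(set(pattern[i + 1:close])) / 4  # several letters allowed here
            i = close + 1
        else:
            prob *= 1 / 4                               # exactly one letter allowed
            i += 1
    return prob

def expected_matches(pattern: str, genome_length: int, strands: int = 2) -> float:
    """Expected number of chance matches, ignoring edge effects."""
    return strands * genome_length * match_probability(pattern)

APPROX_GENOME_LENGTH = 4_600_000  # assumption: approximate size of E. coli K12
print(round(expected_matches("TTAT[CA]CACA", APPROX_GENOME_LENGTH), 1))
```

Compare the number you get with the number of matches actually counted in the list.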

You should reach the conclusion that most of the matches are garbage.

If a mere match to a DnaA-binding sequence is not informative, then how can we recognize an origin?

What’s distinctive about the origin is that it contains a cluster of DnaA-binding sites.

Unfortunately, it is difficult to recognize clusters of sites because the sites’ coordinates are not sorted.

That’s the next step. (And then to clean up the screen.)

That’s much better! With the sorted list, I can see the cluster of four DnaA-binding sites at the known origin of E. coli (at coordinate ~3923000).

Maybe there are other clusters? I’m not sure I’m up to peering through the entire list. However, I can see how I’d do it, examining each line with respect to its neighbors and keeping only those sites that are close to other sites.
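The neighbor-by-neighbor idea can be sketched as follows. This is one possible approach, not the tool this project ends up with; the gap and cluster-size thresholds are arbitrary assumptions, and the coordinates are made up (loosely placed near the known origin for flavor).

```python
def find_clusters(coordinates, max_gap=100, min_sites=3):
    """Group sorted site coordinates into runs whose neighbors lie within max_gap."""
    clusters, current = [], []
    for pos in sorted(coordinates):
        if current and pos - current[-1] > max_gap:
            # Gap too large: close the current run, keeping it only if big enough.
            if len(current) >= min_sites:
                clusters.append(current)
            current = []
        current.append(pos)
    if len(current) >= min_sites:
        clusters.append(current)
    return clusters

# Hypothetical site coordinates, not real search results.
sites = [4100050, 3923090, 120500, 3923230, 3923010, 4100000, 3923150]
print(find_clusters(sites))
```

Isolated sites and pairs drop out, leaving only the dense run. How far apart sites may be, and how many make a cluster, would have to be tuned against known origins.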

I need to automate this process to create the tool that can scan hundreds of genomes looking for origins of replication.

Automation of this sort of thing will come later. Can't do everything at once.

For now, I'll package the progress I've made to enable me to experiment easily.

I'll take the steps I've developed and put them into a function.

My function consists of no more than what I did step by step. Now it has a name.

Also, I generalized it to work with any genome, not just E. coli.

Does it work?

Yes! Executing the function (now on my FUNCTION button) with E. coli as the argument gives exactly the same result as I got before.

Will it work with other organisms?







Maybe! I tried it on Yersinia pestis (causative agent of the plague) and got a very provocative result. What are the odds that five DnaA sites would come up in the first 2000 nucleotides by chance?

(do the calculation)
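One way to set up that calculation is as a Poisson tail. Everything here is an assumption on my part: equal and independent base frequencies, the pattern TTAT[CA]CACA (per-position chance 2/4^9), both strands searched, and a window of 2000 positions.

```python
import math

def prob_at_least(k: int, window_length: int, per_position_prob: float,
                  strands: int = 2, terms: int = 40) -> float:
    """Poisson estimate of the chance of k or more matches in a window."""
    lam = strands * window_length * per_position_prob  # expected matches in the window
    # Sum the upper tail directly; later terms shrink quickly when lam is small.
    return sum(math.exp(-lam) * lam ** n / math.factorial(n)
               for n in range(k, k + terms))

p_site = 2 / 4 ** 9  # chance that one position matches TTAT[CA]CACA
print(prob_at_least(5, 2000, p_site))
```

Keep in mind that the window was noticed after looking at the results. A genome contains thousands of such windows, so the chance of seeing a cluster somewhere is larger than this single-window figure.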







With this function in hand, I can experiment, checking whether my method is any good. I will undoubtedly find that it could be improved in lots of ways. The ability to do quick experiments and gain rapid feedback enables my ideas to evolve.

Origin of DNA replication: Algorithm (where it stands)

* Search genome sequence for DnaA-binding sites
  - TTAT[CA]CACA
  - (not perfect – allow one mismatch?)
  - Use MATCHES-OF-PATTERN

* Sort sites by coordinate
  - Use SORT

* Look for clusters of sites
  - (How???)

(Eventually) Apply to all phage genomes

Morals of the Story

* Make problem tangible

Abstractions can give you a comforting big picture, but you won't make any progress unless you can connect the abstractions to reality.

* Test ideas by experimentation

Develop your methods using cases where the answer is already known.

* Package your insights into functions

Start with an imperfect function and let it evolve as you gain more experience.

* Test the limits of your method

Try weird cases. Figure out why the method fails (if it fails) and what would make it not work (if it works).

Do lots of experiments.

* When things don't work (inevitable), cope

Try something different. Try lots of somethings different.

* When things continue not to work, talk with others

Sometimes pooled confusion can lead to light.

TATTCAAAATGAATTATATCGGTAAATATCTGCAACTTTAAACCTGAATGA

GGATTTAGTATTGCTGGGCCAGCCCAAAGTTTAGAATTTTCATCAACTTTGCACAATGATGGAAAACGTGAATTCAAAAGGATTGCTATATATTATTAAGAAAACATTTGGAATTCGAGAACCGGAATATGGCATTCCGCAAATTAGAGAACGGAATAGGTATTCCTAAAAAAACACATTCTCTGCAATTTTTAAGATGAGTATTATACCTGCACTAACTTTGTGGGACGCAATATCAGAGCAACCCTATCATTTAAAACCTCAAAATACTTATCAGACTTGGGGAACATTCTGACCGTTTAGTAGAACGTTTCCGGCATATAAAATGGGGTGAAGTGGTAATGGTGAATTATCAAACAAATCATATGATCAGAATAATCGCCGTTTAAATCCATCCTTTTCAACATCGAAATTTAACAGCCCGTGAAGGAGCTAGAATCCAATCTTTTCCAGGAAGAAAGATTTGATGAAAAATTTCTTTGTCAATATAATCAAATCGGTAATGCTGTACCCCCTCTTCTCGCTAGTGCATGGATCAAATCTTGAACAAAAAGAGAATCATCGTACAAAATACAGAGATACTGAAAGCAGGACTTTCCTTAGAGAAAT

CAAGATGATTCAATTATTACTCAAAGAGTGGAACTTCTCACTAAATATAAAGATTTTTTAGATCAGCAGCATTATGCAGAAAAATTTGATTCAAGATCCAACCGCTCATAATCCTTACTGAGACGACGGTACTGGTTTAACCAGCCAAATGTTCTTTCTACTACCCACCGTTTGGGCAAAACCTGAAATTCTTGATTAGTACGCCGGATTACCTCAACATGAGCTTGAATCTAGGCGGCAAGTAATCTTTCTCCAGCATTTGCTTCACTTACAACCACTTTTAACAAAAGTCCCAGACTATCAACCAAAGTTTGCCGCTTTCGTCCTTTTACCTTCTTGCCACCATCAAAACCGTACACATCCCCCTTTTTTCAGTCGTTTTTACCGACTGGCTGTCTGCCGCGATCGCCGTGGGTTGAGTTGACTTCCCCATTTTTTGACGAACTTGATCGCGCAAAGTATGATTCATTTCAGTTGAACTAGGAGGAAAATCCCCTGGAAGCATATCCCACTGAATTCGAATTCGAATTCGAATTCGAATTCGACAACCTGTTTTCAGATGGTAGTAGATAGCGTTGCATACTTCTCGCATATCAGTTGTTCGGGGATGCCCACCGCATTTAGCGGGTGGAATCAAAGGAGCTAAAATTGCCCATTCTGAGTCATTAAGGTCTGTAGAATAAGACTTTCGTCTCATTGTTTCCTATGTAAATACACTCTACAAACAGTATCTTATCGCTGCCTTTTTATCTTAGCTCTCCTTTAGATTTACTTTATAAATAGCCTCTTAGAAGAATTTCTTTATTATTTATTTAAAGATTTAGTACAAGATTTCGGGCAGAACGCTCTTATTGGTAAGTCACACACGTTCAAAGATATTTTCTTCGTACCACCAAAATATTCTGAAATGCTCAAGCGACCTTATGCGCGAATTGAGAGAAAAGATCATGATTTCGTAATTGGTGCAACTGTTCAAGCATCGCTTGAAGCAGCACCTCCTCCAGAACAAAACCATGCTTGAGGGATCTTCACGCGCAGCAGAGGATTTAAAAGCGAGAAATCCTAACAGTTTATACCTTGTGGTTATGGAATGGATAAAACT

GACCAATGATGTAAATTTACGAAAATATAAAGTTGATCAAATTTATGTACTACGTCAGCAAAAAAATACTGATAGAGAGTTTAGGTATGAGTCAACTTACATAAAAAAT
