
Computational Biology and Genomics Workshop
Todos Santos Center

April 18-22, 2016

VISUALIZING GENOMIC DATA AND MAKING FIGURES WITH CIRCOS

Background. One of the biggest challenges in dealing with large genomic datasets is being able to visualize them in an effective way. Generating informative and attractive figures is one of the most important things you can do to make your presentations and publications more impactful. If you look through genome papers, you will find that it is very common to generate circular images and summaries of genomic data. See the figure below summarizing two related bacterial genomes. When done properly, this can be very effective even in cases where you are working with genomes that are not circular themselves.

Objectives. The goal of this exercise will be to use the program Circos to generate publication-quality images from genomic data.

Software and Dependencies

Circos (installed and in your PATH)
BLAST+ executables (installed and in your PATH)
Custom Perl scripts
o gc_skew_for_circos.pl
o blast_repeats_for_circos.pl
TextWrangler

Protocol

1. Generate the Circos figure from the example dataset distributed with the software. Before we get into the details of Circos, let’s see what it can do and confirm that the software is properly installed by generating the “example” figure based on the data distributed with the software.

First, open a Terminal session and enter the command to move into the directory with the relevant files for this exercise.

cd ~/TodosSantos/circos/example


Then run Circos by simply entering the name of the program. All the configuration files for this dataset are already in place in this directory, so no additional information is necessary. We will go through how to set up these configuration files later.

circos

The program should take about a minute to run and report a number of status updates along the way. It will return to the command prompt when finished. To see the figure that was generated, enter the following command.

open circos.png

You should see something like this:


Whoa. One could argue whether a figure that “busy” could ever be informative. But this is meant to give you a sense of the large amount of data that can be incorporated into circular figures and the many different ways Circos allows you to visualize those data. Let’s work through some of the steps required to convert genomic data into the input files Circos uses to generate figures.

2. Set up circos.conf and karyotype files. The heart of a Circos run is the configuration file. When you call Circos, it will assume that a file with the name circos.conf is present in your current working directory. Otherwise, you can specify the name and location of your configuration file with the -conf option when you call Circos from the command line.
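If you end up re-running Circos many times with different configuration files, a small wrapper can build the command for you. The following is only a sketch: the function names and the example path configs/ecoli.conf are hypothetical, and the only Circos option used is the -conf option described above.

```python
import shutil
import subprocess


def circos_command(conf_path=None):
    """Build the argument list for a Circos run.

    With no conf_path, Circos is left to look for circos.conf in the
    current working directory.
    """
    cmd = ["circos"]
    if conf_path is not None:
        cmd += ["-conf", conf_path]
    return cmd


def run_circos(conf_path=None, workdir="."):
    """Run Circos if it is on the PATH; return its exit code, or None if absent."""
    if shutil.which("circos") is None:
        print("circos not found on PATH; skipping run")
        return None
    return subprocess.run(circos_command(conf_path), cwd=workdir).returncode


# Hypothetical path, shown only to illustrate the resulting command line.
print(" ".join(circos_command("configs/ecoli.conf")))
```

Calling run_circos() with no arguments should behave like typing circos in the current directory.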

From your Terminal window, move into the exercise directory:

cd ~/TodosSantos/circos/exercise

Use less to view the contents of the configuration file.

less circos.conf

This file contains some standard features that will always be present, including references to some additional configuration files that are distributed with Circos. It also contains parameters that you can set to alter the appearance of your figure. One key parameter is the one that defines the name and location of the karyotype file. That file defines the chromosomes that will be used for the figure.

Exit the circos.conf file by typing q, and then view the contents of the karyotype file.

less karyotype.human.txt

You should see that each chromosome is defined by one line of text in the following format:

chr - ID LABEL START END COLOR

The columns should be largely self-explanatory. Note that the difference between ID and LABEL is that ID is what you will refer to in other data and configuration files, whereas LABEL is the text that will be put on the figure. START and END give the coordinates spanning the chromosome's length (typically in bp), and COLOR defines the color used for that chromosome in the figure.
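If you ever need a karyotype file for your own genome, lines in this format can be generated from sequence lengths. The sketch below is a minimal illustration, not part of the workshop scripts: the function names are hypothetical, START is assumed to be 0, and the ID is reused as the LABEL.

```python
def karyotype_line(chrom_id, label, length, color):
    """Format one chromosome definition: chr - ID LABEL START END COLOR."""
    # START is assumed to be 0 and END the sequence length in bp.
    return f"chr - {chrom_id} {label} 0 {length} {color}"


def fasta_lengths(lines):
    """Return {sequence name: length} from an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            fields = line[1:].split()
            # Use the first word of the header as the sequence name.
            name = fields[0] if fields else "unnamed"
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


# Toy FASTA records standing in for a real genome file.
example = [">main toy genome", "ACGTACGT", "ACGT"]
for name, length in fasta_lengths(example).items():
    print(karyotype_line(name, name, length, "green"))
```

For a real genome you would pass an open file handle instead of the toy list and redirect the printed lines to a karyotype file.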

Exit the karyotype file by typing q.

3. Generate a circular representation of the human genome. Run Circos in your current directory by re-entering the circos command. Once the run has completed, re-enter the open circos.png command. You should see that it updates to the following image. These are the 24 human chromosomes (the 22 autosomes plus the X and Y chromosomes) that were defined in the karyotype file.


4. Modify thickness of the chromosome. Now let’s try making some modifications. Launch TextWrangler and open the following file: ~/TodosSantos/circos/exercise/circos.conf

Many of the parameters that define how the figure is displayed are found in the <ideogram> block. Change the thickness parameter so that it reads:

thickness = 20p

Save that change and then re-run the circos command from the Terminal command line. You should see the thickness of the chromosomes has been reduced in the updated figure (if you still have the circos.png file open in Preview, it should update automatically if you simply click on the preview window).

5. Add labels for each of the chromosomes. In TextWrangler, paste the following text into the circos.conf anywhere between the <ideogram> and </ideogram> lines.

show_label = yes
label_font = default
label_radius = 1r + 75p
label_size = 50
label_parallel = yes

Save the file and re-run Circos from the command-line and verify that your figure has been updated to add labels for each chromosome.


6. Visualize the single chromosome of a bacterial genome and add label positions. In TextWrangler, change the karyotype line in the circos.conf file so that it reads:

karyotype = karyotype.ecoli.txt

If you save that change and re-run Circos, you should see a rather dull-looking green circle. Let’s add some tick marks to the outside of the genome to indicate position. Use less to view the contents of the ticks.conf file in the current directory.

less ticks.conf

This file sets various parameters for how to display ticks around the circle. These particular settings will display big, labeled marks every 1 Mb and small, unlabeled marks every 50 kb. Exit the ticks file by typing q. Let’s add these ticks to our figure, but instead of copying the whole body of text into our main configuration file, we can just add the following line at the top of circos.conf.

<<include ticks.conf>>

Note that this is a general strategy. Rather than have one configuration file get bigger and bigger, you can refer to additional files in your main configuration file to keep your project more organized. If you save this change and re-run Circos, you should see the following updated image.


7. Add GC skew data to the plot. Our bacterial genome plot is still pretty boring. Let’s add some information about the genome. GC skew is a measure of strand-biased nucleotide composition that is defined as (G - C)/(G + C), where G and C are the number of guanines and cytosines in one strand of DNA. GC skew can be very valuable in identifying the origin of replication in bacterial genomes.

Let’s calculate GC skew across the entire genome. To do so, run the following script. The script is written in Perl. It reads in a genome in FASTA format to calculate GC skew. You can investigate the contents of the script if you are interested, but we will only need the output for this exercise. The following command should be entered on a single line.

./gc_skew_for_circos.pl Ecoli.genome.fas 5000 1000 main blue orange > gc_skew.txt

This script has calculated GC skew in a 5-kb sliding window with a 1-kb step size. It has formatted the output for Circos such that positive GC skew values will be shown in blue and negative values will be shown in orange. The “main” term simply refers to the name of the chromosome in the E. coli karyotype file. View the contents of the output file…

less gc_skew.txt

You should see that each line specifies a location in the genome and a corresponding GC skew value. For example, here are a few lines from the middle of the file…

main 235001 235001 -0.0106518960374947 fill_color=orange
main 236001 236001 0.0111856823266219 fill_color=blue
main 237001 237001 0.0122448979591837 fill_color=blue
main 238001 238001 -0.00313199105145414 fill_color=orange
main 239001 239001 -0.00617828773168579 fill_color=orange
main 240001 240001 -0.0260223048327138 fill_color=orange
main 241001 241001 -0.0463576158940397 fill_color=orange
main 242001 242001 -0.0265222263728054 fill_color=orange
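To make the sliding-window calculation concrete, here is a minimal Python sketch that produces lines in the same layout. It is not the workshop's Perl script, and several details are assumptions: the function names are hypothetical, the reported position is taken to be the 1-based start of each window, a skew of exactly zero is colored as positive, and the sequence is treated as linear (windows do not wrap around the end of a circular genome).

```python
def gc_skew(seq):
    """Return (G - C) / (G + C) for one stretch of sequence."""
    g = seq.count("G")
    c = seq.count("C")
    # A window with no G or C has undefined skew; report 0.0 for it.
    return (g - c) / (g + c) if (g + c) else 0.0


def skew_windows(seq, window, step, chrom, pos_color, neg_color):
    """Yield one Circos data line per sliding window along seq."""
    seq = seq.upper()
    for start in range(0, len(seq) - window + 1, step):
        value = gc_skew(seq[start:start + window])
        color = pos_color if value >= 0 else neg_color
        pos = start + 1  # assumed convention: 1-based window start
        yield f"{chrom} {pos} {pos} {value} fill_color={color}"


# Toy sequence with a G-rich half followed by a C-rich half.
toy = "GGGAGGTGGACCCTCCACCC"
for line in skew_windows(toy, 10, 5, "main", "blue", "orange"):
    print(line)
```

With a real genome, the window and step arguments would correspond to the 5000 and 1000 values given on the command line above.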

Exit the GC skew file by typing q. We can visualize these data in our figure by adding a “plots” block to our main configuration file. In TextWrangler, paste the following lines after the </ideogram> line in the circos.conf file.

<plots>
<plot>
type = histogram
file = gc_skew.txt
extend_bin = yes
thickness = 0
r0 = 0.6r
r1 = 1.0r
orientation = out
min = -0.2
max = 0.2
</plot>
</plots>


If you save these changes, and re-run Circos, you should see the following updated figure.

Do you have a guess as to where the origin of replication and termination of replication are in E. coli?

8. Identify repeats within the genome with links. Circular genome representations can be very useful for drawing connections between different parts of a genome. For example, they are very effective at visualizing repeated sequences. Let’s show the locations of repeats within the E. coli genome. First, we need to find the repeats. To do that, we will use BLAST. Run the following two commands to make a BLAST database and run a BLASTN search.

makeblastdb -in Ecoli.genome.fas -dbtype nucl

blastn -evalue 1e-10 -db Ecoli.genome.fas -query Ecoli.genome.fas -out self_blast.txt

Note that we are BLASTing the same sequence against itself. That may seem like a waste of time, but it is a very common way to identify repeats. In this case we are using the default MEGABLAST algorithm within blastn, so we will only identify very similar repeats (word size = 28). Running the following command (all on one line) will call a BioPerl script that parses the BLAST output and summarizes it in a format that Circos can read.

./blast_repeats_for_circos.pl self_blast.txt 1000 0.9 main > blast_repeats.txt

The command line parameters instruct the script to summarize all BLAST hits that are at least 1000 bp long and have at least 90% nucleotide sequence identity. Once again “main” simply refers to the name of the chromosome in the E. coli karyotype file. A sample of the blast_repeats.txt output file looks like this:

repeat1 main 4166317 4166317
repeat1 main 2731499 2731499
repeat2 main 2726052 2726052
repeat2 main 4171773 4171773
repeat3 main 3423658 3423658
repeat3 main 228885 228885
repeat4 main 223467 223467
repeat4 main 3429065 3429065
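The filtering step can be sketched in Python as follows. This is an illustration of the idea only, not a description of what blast_repeats_for_circos.pl actually does. The function name and the input tuple layout are hypothetical, each hit is assumed to be reduced to its query start, subject start, alignment length, and identity, and the sketch assumes the trivial match of a region to itself should be skipped. It does not merge reciprocal hits, so a real self-BLAST would report each repeat pair in both directions.

```python
def repeat_links(hits, min_length, min_identity, chrom):
    """Yield Circos link lines for hits passing the length and identity cutoffs.

    hits: iterable of (q_start, s_start, length, identity) tuples, with
    identity given as a fraction between 0 and 1 (hypothetical layout).
    """
    count = 0
    for q_start, s_start, length, identity in hits:
        if length < min_length or identity < min_identity:
            continue
        if q_start == s_start:
            # Assumed: a region matching itself is not a repeat.
            continue
        count += 1
        link_id = f"repeat{count}"
        # Two lines sharing one ID describe the two ends of a link.
        yield f"{link_id} {chrom} {q_start} {q_start}"
        yield f"{link_id} {chrom} {s_start} {s_start}"


# Made-up hits chosen to exercise each filter.
toy_hits = [
    (1, 1, 50000, 1.0),          # self-hit, skipped
    (4000, 27000, 5400, 0.99),   # kept
    (100, 9000, 300, 0.95),      # too short
    (500, 12000, 2000, 0.85),    # identity too low
]
for line in repeat_links(toy_hits, 1000, 0.9, "main"):
    print(line)
```

The 1000 and 0.9 arguments mirror the cutoffs passed to the Perl script on the command line above.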

Each pair of lines specifies the connection points for the given repeat pair. We can visualize these data with Circos by adding a “links” block to our main configuration file. In TextWrangler, paste the following lines after the </ideogram> line in the circos.conf file.

<links>
radius = 0.8r
bezier_radius = -0.1r
bezier_radius_purity = 0.9
crest = 0.5
perturb = yes
perturb_crest = 0.9,1.1
perturb_bezier_radius = 0,2
perturb_bezier_radius_purity = 0
<link singles>
z = 0
show = yes
color = black
thickness = 2
file = blast_repeats.txt
</link>
</links>

If you save these changes, and re-run Circos, you should see the following updated figure.


9. Explore the Circos tutorials. We have barely “scratched the surface” of the diverse visualization options in Circos. Even for the code we have used in some of these examples, we have not gone into detail on many of the parameters (e.g., the Bezier and crest parameters above control the “bendiness” of the links you just drew). As time allows, try out some of the tutorials described on the Circos website. The files are all available in the following directory:

/Applications/circos-0.69-2/circos-tutorials-0.67/