Using GC content to distinguish Phytophthora sequences from tomato sequences
Mission #1
Calculate the GC content of each sequence in the Phytophthora-tomato interactome
We will use a Perl script to accomplish the mission.
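The class script gc.pl is not reproduced in this transcript. As a rough idea of the kind of calculation involved, the sketch below computes per-sequence GC content with awk rather than Perl. It assumes the input is FASTA (header lines start with ">"), that GC content is reported as a percentage of all bases, and that each output line is the name, a tab, and the GC value. The function name gc_content and the demo file are hypothetical, and gc.pl itself may differ in its details.

```shell
# Hypothetical helper: print "name<TAB>GC%" for each FASTA record in file $1.
gc_content() {
  awk '
    # Print the finished record (if any) as name, tab, GC percentage.
    function emit() {
      if (name != "") printf "%s\t%.1f\n", name, (len > 0 ? 100 * gc / len : 0)
    }
    /^>/ {                       # a header line starts a new record
      emit()
      name = substr($0, 2)       # drop the leading ">"
      sub(/\r$/, "", name)       # drop a Windows carriage return, if present
      gc = 0; len = 0
      next
    }
    {                            # sequence line: count G/C and total bases
      seq = toupper($0)
      gsub(/[^A-Z]/, "", seq)    # keep letters only
      len += length(seq)
      gc  += gsub(/[GC]/, "", seq)
    }
    END { emit() }
  ' "$1"
}

# Tiny made-up input, not the class data set.
printf '>seq1\nGGCC\nAATT\n>seq2\nGCGC\n' > demo.fa
gc_content demo.fa
```

For the made-up input, seq1 has 4 G/C bases out of 8 and seq2 has 4 out of 4, so the two output lines carry 50.0 and 100.0.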
Preparation
• Download the Perl script (gc.pl) from the class web site and store it in the C:/BioDownload folder
• Open Cygwin, or Command Prompt (Vista users), or Terminal (Mac users)
• Change directory (cd) to the BioDownload folder
Running the script
Type a single space between the parts of each command.
perl gc.pl PhytophSeq1.txt phytoph_gc.out
In Cygwin (Windows users) or Terminal (Mac users), check the output:
grep --perl-regexp "\t" -c phytoph_gc.out
grep ">" -c PhytophSeq1.txt
You should get the same number from the two commands.
The number should be 3921.
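The two counts can also be compared in one step. The sketch below is a variant of the check above, not part of the class material: it counts header lines that start with ">" and counts output lines that contain a tab, using a literal tab from printf instead of --perl-regexp, which some grep builds (for example the BSD grep shipped with macOS) may not support. The function name check_counts and the demo files are hypothetical.

```shell
# Hypothetical helper: compare the number of FASTA headers in $1
# with the number of tab-separated lines in $2.
check_counts() {
  tab=$(printf '\t')                      # a literal tab character
  n_seq=$(grep -c '^>' "$1")              # header lines in the sequence file
  n_out=$(grep -c "$tab" "$2")            # output lines containing a tab
  if [ "$n_seq" -eq "$n_out" ]; then
    echo "OK: $n_seq sequences, $n_out output lines"
  else
    echo "MISMATCH: $n_seq sequences, $n_out output lines"
    return 1
  fi
}

# Made-up demo files, not the class data.
printf '>a\nACGT\n>b\nGGCC\n' > demo_seq.txt
printf 'a\t50.0\nb\t100.0\n' > demo_gc.out
check_counts demo_seq.txt demo_gc.out
```

With the class files, the call would be check_counts PhytophSeq1.txt phytoph_gc.out.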
Results
The output file has two tab-separated columns:
• Name column
• GC content column
Mission #2
Build a histogram of the values of GC content
We will use the R program to accomplish this mission.
Download R from http://www.r-project.org (Mac users; all Windows users, including XP and Vista users)
getwd() to know which folder you are in now
setwd("c:/BioDownload") to change the working directory to C:/BioDownload
setwd("/path/to/biodownload") for Mac users
data<-read.table("phytoph_gc.out",sep="\t",header=FALSE)
to read in the data in the file phytoph_gc.out (your file name may be different)
data[1:10,]
to see the first 10 rows of the data frame “data”
gc<-data[,2]
to assign the values from the 2nd column of “data” to a new vector “gc”
summary(gc)
to get the summary of the values in the vector “gc”
hist(gc,breaks=58)
to draw a histogram of the values in “gc” vector
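breaks sets roughly how many cells (bins) you want for the histogram. It was calculated as 78.7 (max) - 21.2 (min) = 57.5, rounded up to 58, so each bin is about 1 GC unit wide. R treats a single number given to breaks as a suggestion, so the actual number of bins may differ slightly.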
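The max minus min arithmetic can be double-checked outside R. The sketch below reads the second column of a tab-separated file, finds the smallest and largest values, and rounds the range up to a whole number of 1-unit bins. It is only an illustration: the function name gc_bins and the demo file are hypothetical, and the three demo values are made up apart from reusing the minimum and maximum quoted above.

```shell
# Hypothetical helper: print min, max, and the number of 1-unit-wide bins
# needed to cover column 2 of the tab-separated file $1.
gc_bins() {
  awk -F '\t' '
    NR == 1 { min = $2 + 0; max = $2 + 0 }        # start from the first value
    {
      v = $2 + 0                                  # force numeric comparison
      if (v < min) min = v
      if (v > max) max = v
    }
    END {
      range = max - min
      bins = int(range)
      if (bins < range) bins++                    # round up
      printf "min=%.1f max=%.1f bins=%d\n", min, max, bins
    }
  ' "$1"
}

# Made-up demo file, not the class data.
printf 'x\t21.2\ny\t50.0\nz\t78.7\n' > demo_gc2.out
gc_bins demo_gc2.out
```

With the class output, the call would be gc_bins phytoph_gc.out, and the bins value is the number to pass as breaks.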
hist(gc,breaks=58,xlab="GC content",ylim=range(c(0,400)),main="Histogram of GC content of sequences\nin Phytophthora-tomato interactome")
to make the histogram look better
pdf("gc_histogram.pdf")
hist(gc,breaks=58,xlab="GC content",ylim=range(c(0,400)),main="Histogram of GC content of sequences\nin Phytophthora-tomato interactome")
dev.off()
To output the histogram to a PDF file.