STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology
STAT115
Lab 3 PART I
Homework Q8 The Dot Matrix Method
The Dot Matrix Method
Gets you started thinking about sequence alignment in general.
Provides a ‘Gestalt’ of all possible alignments between two sequences.
To begin, I will use a very simple 1, 0 (match, no-match) identity scoring function without any windowing. As you will see later today, more complex scoring functions will normally be used in sequence analysis (especially with amino acid sequences).
A general way to see similarities in pair-wise comparisons:
Since this is a comparison between two of the same sequences, an intra-sequence comparison, the most obvious feature is the main identity diagonal. Two short perfect palindromes can also be seen as crosses directly off the main diagonal; they are “ANA” and “SIS.”
The biggest asset of dot matrix analysis is that it allows you to visualize the entire comparison at once, not concentrating on any one ‘optimal’ region, but rather giving you the ‘Gestalt’ of the whole thing.
Here you can easily see the effect of a sequence ‘insertion’ or ‘deletion.’ It is impossible to tell whether the evolutionary event that caused the discrepancy between the two sequences was an insertion or a deletion, and hence this phenomenon is called an ‘indel.’ A jump or shift in the register of the main diagonal on a dotplot clearly points out the existence of an indel (again using the 1:0 identity scoring function).
Check out the ‘mutated’ inter-sequence comparison below:
Another phenomenon that is very easy to visualize with dot matrix analysis is duplication, or direct repeats. These are shown in the following example:
The ‘duplication’ here is seen as a distinct column of diagonals; whenever you see either a row or a column of diagonals in a dotplot, you are looking at direct repeats.
Now consider the more complicated ‘mutation’ in the following comparison:
Again, notice the diagonals. However, they have now been displaced off the center diagonal of the plot and, in this example, show the occurrence of a ‘transposition.’ Dot matrix analysis is one of the only sensible ways to locate such transpositions in sequences. Inverted repeats still show up as lines perpendicular to the diagonals; they are just no longer at the center of the plot. The ‘deletion’ of ‘PRIMER’ is shown by the lack of a corresponding diagonal.
Reconsider the same plot. Notice the extraneous dots that indicate neither runs of identity between the two sequences nor inverted repeats. These merely contribute ‘noise’ to the plot and are due to the ‘random’ occurrence of the letters in the sequences, that is, to the composition of the sequences themselves.
How can we ‘clean up’ the plots so that this noise does not detract from our interpretations? Consider the implementation of a filtered windowing approach; a dot will only be placed if some ‘stringency’ is met.
This means that a dot is placed at the middle of a window of some defined size if, and only if, some defined criterion is met within that window. Then the window is shifted one position and the entire process is repeated. This very successfully rids the plot of unwanted noise.
Filtered Windowing
In this plot a window of size three and a stringency of two is used to considerably improve the signal-to-noise ratio (remember, I am using a 1:0 identity scoring function).
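The filtered windowing idea can be sketched in R roughly as follows. This is only a minimal illustration under stated assumptions, not the code that produced the plots in these slides: the function name `dot.matrix`, its arguments `window` and `stringency`, and the sample string "ANALYSIS" are all made up for the example. A dot is placed at the middle of each window whose diagonal holds at least `stringency` matches.

```r
# Dot matrix with a 1:0 identity score and optional filtered windowing.
# dot.matrix, window and stringency are illustrative names (assumptions).
dot.matrix <- function(seq1, seq2, window = 1, stringency = 1) {
  a <- strsplit(seq1, "")[[1]]
  b <- strsplit(seq2, "")[[1]]
  # 1 where the two letters are identical, 0 otherwise
  id <- outer(a, b, "==") * 1
  dots <- matrix(0, length(a), length(b))
  half <- (window - 1) %/% 2            # offset of the middle of the window
  k <- 0:(window - 1)
  for (i in seq_len(length(a) - window + 1)) {
    for (j in seq_len(length(b) - window + 1)) {
      # count the matches along the diagonal inside the window
      score <- sum(id[cbind(i + k, j + k)])
      if (score >= stringency) dots[i + half, j + half] <- 1
    }
  }
  dots
}

raw      <- dot.matrix("ANALYSIS", "ANALYSIS")
filtered <- dot.matrix("ANALYSIS", "ANALYSIS", window = 3, stringency = 2)
sum(raw)        # number of dots without windowing
sum(filtered)   # number of dots with a window of 3 and a stringency of 2

# Draw the filtered plot; sequence 1 runs along the x axis
image(seq_len(nrow(filtered)), seq_len(ncol(filtered)), filtered,
      col = c("white", "black"), xlab = "sequence 1", ylab = "sequence 2")
```

With `window = 1` and `stringency = 1` the function falls back to the plain match/no-match plot, which makes it easy to compare the two settings on the same pair of sequences.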
TUTORIAL I
LAB 3
Alejandro Quiroz-Zárate
Daniel Fernandez
A little history
R is a dialect of the S language
• Essentially we work with a 40-year-old technology!
• R is divided into 2 parts
  – The BASE system
    • What comes with the download from CRAN (Comprehensive R Archive Network)
  – The packages that you download
    • Based on your needs!!!
• Over 1000 packages on CRAN
  – http://www.r-project.org/
• Last but NOT least
  – R is FREE!!!!!!
Outline
• The Console and the Script
  – Workspace management
• Objects
  – Classes and Mode
  – Some Classes:
    • Vectors, Matrices and data.frames
  – Some Modes:
    • Lists, strings
• Loops and conditional statements
• Functions
  – R functions
  – My own functions
• Handling data
  – Reading and writing!
• Plotting!
• Libraries
• Exercises
Getting started
The Console
Essentially where the commands are executed
The Script
Where the code is written
An R session
Type code here
Adjust/Extend code
Output appears
Workspace Management
• Before jumping into R, it is important to ask ourselves
  – Where am I?
    • getwd()
  – I want to be there…
    • setwd("C:/")
  – Who am I with?
    • dir() # lists all the files in the working directory
  – Who can I count on?
    • ls() # lists all the variables in the current session
Workspace Management (2)
• Saving
  – save(x,file="name.RData")
    • Saves specific objects
  – save.image("name.RData")
    • Saves the whole workspace
• Loading
  – load("name.RData")
• ‘?function’ and ‘??function’
  – ? To get the documentation of the function
  – ?? To find functions related to the query
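The workspace commands from the last two slides can be tried together in a short session like the one below. It is only a sketch: moving into `tempdir()` is an assumption made so that the example does not touch your own folders, and the object `x` is arbitrary.

```r
# Where am I, and what is here?
old.dir <- getwd()             # remember the current working directory
setwd(tempdir())               # move to a scratch directory (assumption for this demo)

x <- c(10, 5, 3, 6)
ls()                           # variables in the current session, includes "x"

save(x, file = "name.RData")   # save one specific object
rm(x)                          # remove it from the workspace
load("name.RData")             # bring it back
x

dir()                          # files in the working directory, includes "name.RData"
setwd(old.dir)                 # return to where we started
```

Use `save.image("name.RData")` in place of `save()` when the whole workspace should be kept rather than a single object.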
R Objects
• Almost all things in R are OBJECTS!
  – Functions, datasets, results, etc… (graphs NO)
• OBJECTS are classified by two criteria
  – MODE: How objects are stored in R
    • Character, numeric, logical, list, function…
    • To obtain the mode of an object
      – mode(object)
  – CLASS: How objects are treated by functions
    • Vector, matrix, array, factor, data.frame,…
    • To obtain the class of an object
      – class(object)
R Objects (2)
[Figure: a table with columns x1 to x6 and rows 1 to 8]
MODE: Is determined by the type of things stored (numbers, characters, Boolean, …)
If only numbers: numeric
If it is a mixture: list
CLASS: Is determined by how functions deal with this object.
If only numbers: matrix
If it is a mixture: data.frame
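A quick way to see the difference between mode and class is to ask R about a few small objects. The objects below (`v`, `m`, `d`, `l`) are made up for this illustration.

```r
v <- c(10, 5, 3, 6)
m <- matrix(1:8, nrow = 2)
d <- data.frame(id = 1:3, name = c("a", "b", "c"))
l <- list(numbers = 1:5, flags = c(TRUE, FALSE))

mode(v); class(v)   # only numbers: mode numeric
mode(m); class(m)   # still mode numeric, but functions treat it as a matrix
mode(d); class(d)   # a mixture: stored as a list, treated as a data.frame
mode(l); class(l)   # a list in both senses
mode(mean)          # functions are objects too
```

Note that a data.frame is stored as a list of columns, which is why its mode is list even though functions treat it as a table.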
Some classes
• Vectors!!!
  – x=c(10,5,3,6)
  – Calculations on vectors are performed on each entry
    • y=c(log(x),x,x^2)
  – Vectors do not need to have the same length in operations (the shorter one is recycled)!
    • w=sqrt(x)+2
    • z=c(pi,exp(1),sqrt(2))
    • x+z
  – Logical vectors
    • aux=x<7
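The commands on this slide can be run as a short script to see recycling and logical indexing at work. The extra lines (`x + c(1, 2)`, `x[aux]`, `sum(aux)`) are additions for illustration and are not on the slide.

```r
x <- c(10, 5, 3, 6)
w <- sqrt(x) + 2          # the single number 2 is recycled to every entry
y <- c(log(x), x, x^2)    # three length-4 results glued into one vector
length(y)

x + c(1, 2)               # length 2 recycles cleanly into length 4

z <- c(pi, exp(1), sqrt(2))
x + z                     # works, but warns: 4 is not a multiple of 3

aux <- x < 7              # logical vector, one TRUE/FALSE per entry
x[aux]                    # keep only the entries below 7
sum(aux)                  # TRUE counts as 1, so this counts them
```

The warning on `x + z` is R's way of flagging that the shorter vector did not fit a whole number of times into the longer one.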
Some classes (2)
• Matrices !!!
  – x=1:8
  – dim(x)=c(2,4)
  – y=matrix(1:8,2,4,byrow=F)
  – Operations are applied on each element
    • x*x, max(x)
    • x=matrix(1:28,ncol=4), y=7:10 so then x*y is…?
  – y=matrix(1:8,ncol=2)
    • y%*%t(y)
Some classes (3)
• Extracting info
  – y[1,] or y[,1]
• Extending matrices
  – cbind(y,seq(101,104))
  – rbind(y,c(102,109))
• apply is a useful function!
  – apply(y,2,mean)
  – apply(y,1,log)
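To answer the "x*y is…?" question and check the shapes produced by the other matrix commands, a small session like this one may help. It reuses the objects from the slides; the `dim()` calls are added only to show the resulting shapes.

```r
x <- matrix(1:28, ncol = 4)    # 7 rows, 4 columns, filled column by column
y <- 7:10
# x * y is element-wise: y is recycled down the columns of x,
# so the first column is multiplied by 7, 8, 9, 10, 7, 8, 9
(x * y)[1:5, 1]

y <- matrix(1:8, ncol = 2)     # 4 rows, 2 columns
dim(y %*% t(y))                # a true matrix product: 4 x 4

dim(cbind(y, seq(101, 104)))   # one more column: 4 x 3
dim(rbind(y, c(102, 109)))     # one more row: 5 x 2

apply(y, 2, mean)              # column means
dim(apply(y, 1, log))          # results come back as columns, so 2 x 4
```

The last line is a common surprise: when `apply` works over rows and the function returns a vector, each row's result becomes a column of the output.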
Some classes (4)
• data.frame!!!
  – Creation
    • Several ways to create a data frame
      – 1)
        » logical=sample(c(T,F),size=20,replace=T)
        » numeric=rnorm(20)
        » my.df=data.frame(logical, numeric)
      – 2)
        » test=matrix(rnorm(21),7,3)
        » test=data.frame(test)
    • class(my.df[1,])
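Running the first recipe shows what `class(my.df[1,])` returns and how rows and columns behave differently. The call to `set.seed(1)` is an arbitrary choice added here only so that the random numbers are reproducible.

```r
set.seed(1)   # arbitrary seed, only for reproducibility
logical <- sample(c(TRUE, FALSE), size = 20, replace = TRUE)
numeric <- rnorm(20)
my.df <- data.frame(logical, numeric)

class(my.df[1, ])      # a single row is still a data.frame (mixed types)
class(my.df[, 1])      # a single column drops to a plain logical vector
class(my.df$numeric)   # columns can also be reached by name

str(my.df)             # compact summary of the column types
dim(my.df)             # 20 rows, 2 columns
```

A row has to stay a data.frame because it mixes a logical and a number, while a column holds one type only and can be returned as an ordinary vector.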
A mode
• Lists!!!
  – Is like a vector
    • An element of a list can be an object of any type and structure
  – x1=1:5
  – x2=c(T,T,F,T,F)
  – y=list(numbers=x1,questions=x2)
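Once the list exists, its elements can be reached in several ways. The sketch below uses the list from the slide; the element name `label` is made up to show that a list accepts any type.

```r
x1 <- 1:5
x2 <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
y <- list(numbers = x1, questions = x2)

y$numbers           # access an element by name
y[["questions"]]    # the same idea with double brackets
y[[1]][3]           # third entry of the first element
class(y[1])         # single brackets return a (shorter) list
y$label <- "demo"   # add an element of a different type
length(y)
```

The difference between `y[1]` and `y[[1]]` matters in practice: the first is still a list, the second is the vector stored inside it.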
Functions!
• My own functions
  – function.name=function(arg1,arg2,…,argN)
    {
      Body of the function
    }
  – fun.plot=function(y,z)
    {
      y=log(y)*z-z^3+z^2
      plot(z,y)
    }
  – z=seq(-11,10)
  – y=seq(11,32)
  – fun.plot(y,z)
Functions! (2)
• The ‘…’ argument
  – Can be used to pass arguments from one function to another
    • Without the need to specify arguments in the header
fun.plot=function(y,z,...)
{
  y=log(y)*z-z^3+z^2
  plot(z,y,...)
}
fun.plot(y,z,type="l",col="red")
fun.plot(y,z,type="l",col="red",lwd=4)
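The same mechanism works for any function, not only for plots. As a small non-graphical sketch, the hypothetical function `my.summary` below hands whatever arrives in `...` on to `mean()` and `median()`.

```r
# my.summary is a made-up name; extra arguments travel through '...'
my.summary <- function(x, ...) {
  c(mean = mean(x, ...), median = median(x, ...))
}

v <- c(1, 2, NA, 4)
my.summary(v)                 # the missing value propagates to both results
my.summary(v, na.rm = TRUE)   # na.rm is handed on to mean() and median()
```

The header of `my.summary` never mentions `na.rm`, yet the caller can still set it, which is exactly the point of the `...` argument.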
Handling data I/O
• Reading files
  – read.csv("filename.csv") # reads csv files into a data.frame
  – read.table("filename.txt") # reads txt files in a table format into a data.frame
  – scan(filename) # not friendly for matrices or tables!!!
• Writing to files
  – write(x,file="filename") # writes the object x to filename
  – write.table(x,filename) # writes the object x to filename in a table format
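A round trip through a file is a simple way to check that reading and writing agree. The sketch below writes a small made-up data frame to temporary files and reads it back; the file paths come from `tempfile()` and `write.csv()` is used as the csv counterpart of `write.table()`, neither of which appears on the slide.

```r
df <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))
csv.file <- tempfile(fileext = ".csv")   # temporary paths, assumed for this demo
txt.file <- tempfile(fileext = ".txt")

write.csv(df, csv.file, row.names = FALSE)
back.csv <- read.csv(csv.file)           # the header is read by default

write.table(df, txt.file, row.names = FALSE)
back.txt <- read.table(txt.file, header = TRUE)   # here the header must be requested

all.equal(df, back.csv)
all.equal(df, back.txt)

write(df$score, file = txt.file)         # write() dumps a plain vector
scan(txt.file)                           # and scan() reads it back as a vector
```

Setting `row.names = FALSE` keeps the row labels out of the file, so the data frame read back has the same columns as the one written.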
Plotting!
• x.data=rnorm(1000)
• y.data=x.data^3-10*x.data^2
• z.data=-0.5*y.data-90
• plot(x.data,y.data,main="Title of the graph",xlab="x label",ylab="y label")
• points(x.data,z.data,col="red")
• legend(-2,2,legend=c("Black points","Red points"),col=c("black","red"),pch=1,text.col=c("black","red"))
Plotting! (2)
• You can export graphs in many formats
  – To check the formats that are available in your R installation
    • capabilities()
  – png
    • png("Lab2_plot.png",width=520,height=440)
    • plot(x.data,y.data,main="Title of the graph",xlab="x label",ylab="y label")
    • points(x.data,z.data,col="red")
    • legend(-2,2,legend=c("Black points","Red points"),col=c("black","red"),pch=1,text.col=c("black","red"))
    • dev.off()
  – eps
    • postscript("Lab2_plot.eps",width=7,height=6) # for postscript, width and height are in inches, not pixels
Libraries!!
• Collection of R functions that together perform a specialized analysis or task.
• Install packages from CRAN
  – install.packages("PackageName")
• Loading libraries
  – library(LibraryName)
• Getting the documentation of a library
  – library(help=LibraryName)
• Listing all the available packages
  – library()
Exercise 1 – Probability Transform
We know that , and we want to know the probability associated with
(a) Plot the theoretical pdf and cdf of X.
(b) Generate 10,000,000 observations of the random variable X
(c) Compute Y=3X^5+4X^2-7
(d) Estimate the probability that
(e) Plot histogram and empirical CDF of Y
Exercise 2 – The empire strikes back: GOOG versus BAIDU
Plot historical stock price time series using prices from Yahoo Finance.
(a) Download and install the tseries package.
(b) Include the tseries package as a library in your code.
(c) Use get.hist.quote to download GOOG and BAIDU historical data.
(d) Plot both time series in the same panel and add a legend to the plot.
Exercise 3 – Challenging Challenger
On January 28, 1986, the Space Shuttle Challenger exploded in the early stages of its flight. Feynman, along with a committee, determined that the explosion was due to low temperatures and the failure of the O-ring seals on the booster rockets. The ambient temperature was 36 degrees on the morning of the launch. The scientists had data (temperature, number of failures) from previous flights.
Question 3 – Challenging Challenger
(a) Plot the number of failures versus the temperature for flights with one or more O-ring failures. Is there any evidence that temperature affects O-ring performance?
(b) Plot the number of failures versus temperature for all the flights. Is there any evidence that temperature affects O-ring performance?
(c) What’s your conclusion? What do you think the scientists plotted before taking the decision to fly that day? Just as a historical curiosity: who played a central role in discovering the causes of the failure, and how did he announce it?