CS 898N – Advanced World Wide Web Technologies Lecture 8: PERL
Welcome to lecture 4: An introduction to modular PERL
IGERT – Sponsored Bioinformatics Workshop Series
Michael Janis and Max Kopelevich, Ph.D.
Dept. of Chemistry & Biochemistry, UCLA
Last time…
• We covered a bit of material…
• Try to keep up with the reading – it’s all in there!
• We’ve covered variables, control structures, data structures, functions…
– Now we’ll cover modular programming
– We’ll create libraries of our own to use
– We’ll take an example of a biological problem that incorporates everything we’ve learned and used so far…
Gene Finding (A very simplified example)
How to find a gene given a sequence?
Conversely, what is the likelihood that a given region of sequence is a coding region?
Note that we used the term ‘likelihood’:
– A simplified statistical approach
• Codon usage throughout an organism’s genome is non-uniform
• Non-coding regions differ from coding regions in their codon usage
• We can use this information to test putative ORFs
– A more traditional approach
• Makes use of homology of putative regions to other known protein sequences
• Does not require prior information regarding codon bias
• Suffers from problems inherent in homology-based analysis
How to find a gene given a sequence?
We don’t have to choose. We can use both (heuristics). We’ll sometimes be dealing with complex and contradictory information.
Genes are non-linear:
– There may be many methionine codons present
– There are different reading frames possible
– There are intron / exon combinations (alt. splicing)
[Figure: a sequence annotated with possible start codons (ATG) and possible stop codons; an in-frame start-to-stop stretch is a putative ORF]
Note that we used the term ‘likelihood’:
We need to introduce, briefly, some Probability and statistics
Just a little Evil…
Permutations
Groups of ordered arrangements of things
How many 3-letter permutations of the letters a, b, & c are there?
abc, acb, bac, bca, cba, cab: 6 total
– General formula: $P(n,k) = \frac{n!}{(n-k)!}$
– n = total number of things
– k = size of the groups you’re taking (k ≤ n)
– Here: 3!/(3-3)! = 6
– Can use the Basic Principle of Counting:
• 3*2*1 = 6
Permutations
What if some of the things are identical?
How many permutations of the letters a, a, b, c, c & c are there?
$\frac{n!}{n_1!\, n_2! \cdots n_r!}$, where n1, n2, … nr are the numbers of objects that are alike
6! / (3!2!) = 60
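The two permutation counts above can be checked with a few lines of Perl. This is only a sketch: the subroutine names (factorial, permutations, permutations_with_repeats) are made up for illustration and are not part of the workshop code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# n! = n * (n-1) * ... * 1, with 0! defined as 1
sub factorial {
    my ($n) = @_;
    my $result = 1;
    $result *= $_ for 2 .. $n;
    return $result;
}

# ordered arrangements of k things taken from n: n! / (n-k)!
sub permutations {
    my ($n, $k) = @_;
    return factorial($n) / factorial($n - $k);
}

# arrangements of n things when groups of them are alike:
# n! / (n1! * n2! * ... * nr!); pass the group sizes as a list
sub permutations_with_repeats {
    my @groupSizes = @_;
    my $n = 0;
    $n += $_ for @groupSizes;
    my $arrangements = factorial($n);
    $arrangements /= factorial($_) for @groupSizes;
    return $arrangements;
}

print permutations(3, 3), "\n";                  # a, b, c taken 3 at a time: prints 6
print permutations_with_repeats(2, 1, 3), "\n";  # a,a / b / c,c,c: prints 60
```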
Combinations
Groups of things (order doesn’t matter)
How many 3-letter combinations of the letters a, b, & c are there?
1: abc
How many 2-letter combinations of the letters a, b, & c are there?
3: ab, ac, bc
ab = ba; ac = ca; bc = cb *Order doesn’t matter
– General formula: $\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$ (“n choose k”)
– n = total number of things
– k = size of the groups you’re taking (k ≤ n)
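A small sketch of "n choose k" in Perl. The name choose is an assumption for illustration. The loop multiplies and divides step by step instead of computing three factorials, which keeps the intermediate numbers small.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# number of unordered groups of size k taken from n things
sub choose {
    my ($n, $k) = @_;
    return 0 if $k < 0 || $k > $n;     # no such groups exist
    my $result = 1;
    for my $i (1 .. $k) {
        # after step i, $result equals (n-k+i) choose i
        $result = $result * ($n - $k + $i) / $i;
    }
    return $result;
}

print choose(3, 3), "\n";   # abc only: prints 1
print choose(3, 2), "\n";   # ab, ac, bc: prints 3
```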
Set Theory
Sample space of an experiment is the set of all possible values/outcomes of the experiment
S = {a, b, c, d, e, f, g, h, i, j}
S = {Heads, Tails}
S = {1, 2, 3, 4, 5, 6}
E = {a, b, c, d}   F = {b, c, d, e, f, g}
E ⊂ S   F ⊂ S
Ec = {e, f, g, h, i, j}   Fc = {a, h, i, j}
E ∪ F = {a, b, c, d, e, f, g}
E ∩ F = EF = {b, c, d}
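Perl has no built-in set type, but a hash used as a membership lookup does the job. The sketch below reproduces the union, intersection and complement for the example sets E and F; the variable names are chosen here for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @S = ('a' .. 'j');           # sample space
my @E = qw(a b c d);
my @F = qw(b c d e f g);

# membership lookups: the key exists if the element is in the set
my %in_E = map { $_ => 1 } @E;
my %in_F = map { $_ => 1 } @F;

my @union        = grep { $in_E{$_} || $in_F{$_} } @S;
my @intersection = grep { $in_E{$_} && $in_F{$_} } @S;
my @E_complement = grep { !$in_E{$_} } @S;

print "E union F:        @union\n";
print "E intersection F: @intersection\n";
print "E complement:     @E_complement\n";

# with equally likely outcomes, P(E) is just a ratio of counts
printf "P(E) = %d/%d\n", scalar(@E), scalar(@S);
```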
Venn Diagrams
[Figure: Venn diagram of events E, F and G inside the sample space S]
Simple Probability
Frequent assumption: all outcomes equally likely to occur
The probability of an event E is simply:
P(E) = (number of possible outcomes in E) / (number of total possible outcomes)
S = {a, b, c, d, e, f, g, h, i, j}
E = {a, b, c, d}   F = {b, c, d, e, f, g}
P(E) = 4/10   P(F) = 6/10
P(S) = 1   0 ≤ P(E) ≤ 1   P(Ec) = 1 – P(E)
P(E ∪ F) = P(E) + P(F) – P(EF)
Independence
Two events E & F are independent if the outcome of neither depends on the outcome of the other.
So if E & F are independent, then:
P(EF) = P(E)*P(F)
If E, F & G are independent, then:
P(EFG) = P(E)*P(F)*P(G)
Conditional Probability
Given E, the probability of F is:
$P(F \mid E) = \frac{P(EF)}{P(E)}$
• Similarly:
$P(E \mid F) = \frac{P(EF)}{P(F)}$
[Figure: Venn diagram of events E and F in S, with overlap EF]
ASSUME
Here’s a simple question. I come from a family of two children (prior information states: I am a male). What’s the probability that my sibling is a sister?
• Outcome (sex of offspring) is equally likely
• Is it 0.5? Something else?
• What is the question really asking?
Assumptions
Here’s a simple question. I come from a family of two children. What’s the probability that my sibling is a sister?
• Sample space is actually four pairs of possible siblings (in order of birth): {(B,B),(B,G),(G,B),(G,G)}
• Let U be the event “one child is a girl”
• Let V be the event “one child is Mike”
• We want to calculate P(U|V)
Assumptions…
Here’s a simple question. I come from a family of two children. What’s the probability that my sibling is a sister?
• P(U|V) = P(U ∩ V)/P(V)
• = P(one child is B, one is G)/P(at least one is B)
• = (2/4) / (3/4) = 2/3
• biologists cringe now…
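If the 2/3 feels wrong, count it. A minimal sketch that enumerates the four equally likely birth orders and applies the conditional probability ratio; the variable names are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @sexes = ('B', 'G');
my ($countV, $countUV) = (0, 0);

# walk the sample space {(B,B),(B,G),(G,B),(G,G)}
for my $first (@sexes) {
    for my $second (@sexes) {
        my $pair    = $first . $second;
        my $hasBoy  = ($pair =~ /B/);   # event V: at least one child is a boy
        my $hasGirl = ($pair =~ /G/);   # event U: at least one child is a girl
        $countV++  if $hasBoy;
        $countUV++ if $hasBoy && $hasGirl;
    }
}

# P(U|V) = P(U and V) / P(V); the common denominator 4 cancels
my $pSister = $countUV / $countV;
printf "P(U|V) = %d/%d = %.4f\n", $countUV, $countV, $pSister;   # prints P(U|V) = 2/3 = 0.6667
```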
Random Variables
Definition: A variable that can have different values
Each value has its own probability
X = Result of coin toss
Heads 50%, Tails 50%
Y = Result of die roll
1, 2, 3, 4, 5, 6 each 1/6
Discrete vs. Continuous
Discrete random variables can only take a finite (or countably infinite) set of different values.
Die roll, coin flip
Continuous random variables can take on an infinite number of (real) values
Time of day of event, height of a person
Probability Density Function
Many problems don’t have simple probabilities. For those the probabilities are expressed as a function…
aka “pdf”
Plug a into some function, e.g. $2a^2 - a^3$
Some Useful pdf’s
Simple cases (like fair/loaded coin/dice, etc…)
Uniform random variable (“equally likely”)
f(a) = 1/2 for a = Heads
f(a) = 1/2 for a = Tails
pdf of a Binomial
Very useful function!
$P(X = k) = \binom{n}{k} p^k q^{n-k}$
Where p = P(success) & q = P(failure)
P(success) + P(failure) = 1
n choose k is the total number of possible ways to get k successes in n attempts
Hypergeometric distribution
Tends towards the binomial distribution when N is large
We can use combinatorics to test for enrichment; i.e. is the number found greater than expected by chance?
Hypergeometric distribution
Our microarray of 9300 probesets (genes with some duplication) yields 200 upregulated genes in response to substance X.
We use gene ontology to cluster these genes into 4 biological process clusters: 160 genes in mitosis, 80 in oncogenesis, 60 in cell proliferation, and 40 in glucose transport.
Is substance X related to cancer?
Need to account for total number of genes queried by microarray in each category…
An enrichment problem (obs genes M, total number of genes N, the number of categorical genes x on the array, and the number of regulated genes K).
Source: Data Analysis Tools for DNA Microarrays. Sorin Draghici, 2003.
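A hedged sketch of how such an enrichment test could be coded. The subroutine names and the parameter order (observed, total, inCategory, selected) are chosen here for readability and are not the slide's notation. The slide does not say how many glucose transport genes are on the array, so the 465 below is a made-up number for illustration only. Logs are summed so that 9300-choose-200 does not overflow.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# log of (n choose k), built from a sum of logs
sub log_choose {
    my ($n, $k) = @_;
    my $sum = 0;
    $sum += log($n - $k + $_) - log($_) for 1 .. $k;
    return $sum;
}

# P(exactly $observed category genes among the $selected regulated genes),
# when $inCategory of the $total genes on the array belong to the category
sub hypergeom_pmf {
    my ($observed, $total, $inCategory, $selected) = @_;
    return 0 if $observed < 0 || $observed > $inCategory || $observed > $selected
             || $selected - $observed > $total - $inCategory;
    return exp( log_choose($inCategory, $observed)
              + log_choose($total - $inCategory, $selected - $observed)
              - log_choose($total, $selected) );
}

# P($observed or more): chance of seeing at least this many by luck alone
sub hypergeom_upper_tail {
    my ($observed, $total, $inCategory, $selected) = @_;
    my $max = $inCategory < $selected ? $inCategory : $selected;
    my $p = 0;
    $p += hypergeom_pmf($_, $total, $inCategory, $selected) for $observed .. $max;
    return $p;
}

# 40 glucose transport genes among 200 regulated genes on a 9300-probeset array;
# 465 category genes on the array is a HYPOTHETICAL value, not from the slide
printf "p = %.3g\n", hypergeom_upper_tail(40, 9300, 465, 200);
```

A small p-value says the category shows up among the regulated genes more often than chance alone would explain.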
Hypergeometric distribution
We may find that the inferred effect of substance X is very different from our initial response…
Glucose transport 4x more than expected by chance; oncogenesis not better than chance…
Maybe correlation is with diabetes instead?
Using the p.d.f.
What is the probability of getting 3 Heads in 5 coin tosses? (Same as 2T in 5 tosses)
n = 5 tosses   k = 3 Heads
p = P(H) = .5   q = P(T) = .5
P(3H in 5 tosses) = $\binom{5}{3} p^3 q^2 = 10 p^3 q^2$
= $10 \cdot P(H)^3 \cdot P(T)^2$
= $10 (.5)^3 (.5)^2 = 0.3125$
Notice how these are Binomials…
What is the probability of winning the lottery in 2 of your next 3 tries?
n = 3 tries   k = 2 wins
Assume P(win) = $10^{-7}$   P(lose) = $1 - 10^{-7}$
P(win 2 of 3 lotto) = $\binom{3}{2} P(win)^2 P(lose) = 3 (10^{-7})^2 (1 - 10^{-7})$
= ~ $3 \times 10^{-14}$
That’s about a 3 in 100 trillion chance. Good Luck!
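Both worked examples (coin tosses and lottery) are the same calculation, so one subroutine covers them. A minimal sketch; the names choose and binomial_pmf are assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# n choose k: the number of ways to place k successes among n attempts
sub choose {
    my ($n, $k) = @_;
    return 0 if $k < 0 || $k > $n;
    my $result = 1;
    $result = $result * ($n - $k + $_) / $_ for 1 .. $k;
    return $result;
}

# P(exactly k successes in n attempts), with p = P(success) and q = 1 - p
sub binomial_pmf {
    my ($n, $k, $p) = @_;
    my $q = 1 - $p;
    return choose($n, $k) * $p**$k * $q**($n - $k);
}

printf "%.4f\n", binomial_pmf(5, 3, 0.5);    # 3 Heads in 5 tosses: prints 0.3125
printf "%.3g\n", binomial_pmf(3, 2, 1e-7);   # 2 lottery wins in 3 tries, roughly 3e-14
```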
Expectation of a Discrete Random Variable
Weighted average of a random variable
$E[X] = \sum_i x_i \, p(x_i)$
…Of a function
$E[g(X)] = \sum_i g(x_i) \, p(x_i)$
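As a quick illustration, here is the expectation of a fair die roll, and of the function g(X) = X squared, computed as weighted sums. The hash %pdf (value => probability) is a layout assumed here, not something from the workshop code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# fair die: each face 1..6 has probability 1/6
my %pdf = map { $_ => 1/6 } 1 .. 6;

my ($mean, $meanSquare) = (0, 0);
for my $value (keys %pdf) {
    $mean       += $value * $pdf{$value};        # E[X]: value times its probability
    $meanSquare += $value**2 * $pdf{$value};     # E[g(X)] with g(X) = X squared
}

printf "E[X]   = %.4f\n", $mean;         # prints E[X]   = 3.5000
printf "E[X^2] = %.4f\n", $meanSquare;   # prints E[X^2] = 15.1667
```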
Measures of central tendency
• Sample mean: the sum of measurements divided by the number of subjects.
• Sample median: the measurement that falls at the middle of the ordered sample.
Variance
Variation, or spread of the values of a random variable
$Var(X) = E[(X - \mu)^2]$
Where μ = E[X]
Variance and standard deviation: measures of variation in statistics:
• Variance ($s^2$): the mean of the squared deviations for a sample.
• Standard deviation ($s$): the square root of the variance, or the root mean squared deviation.
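These summary statistics make good first subroutines. The sketch below follows the wording above and divides by the number of measurements; many textbooks divide by n - 1 for a sample, so treat the divisor as an assumption. The subroutine names and the data values are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# sum of the measurements divided by how many there are
sub mean {
    my @values = @_;
    my $sum = 0;
    $sum += $_ for @values;
    return $sum / scalar(@values);
}

# middle of the ordered sample (average of the two middle values if even)
sub median {
    my @sorted = sort { $a <=> $b } @_;
    my $mid = int(@sorted / 2);
    return @sorted % 2 ? $sorted[$mid]
                       : ($sorted[$mid - 1] + $sorted[$mid]) / 2;
}

# mean of the squared deviations from the mean (divides by n, see note above)
sub variance {
    my @values = @_;
    my $mu = mean(@values);
    my $sumSquares = 0;
    $sumSquares += ($_ - $mu)**2 for @values;
    return $sumSquares / scalar(@values);
}

sub std_dev { return sqrt(variance(@_)); }

my @data = (2, 4, 4, 4, 5, 5, 7, 9);   # illustrative measurements
printf "mean %.2f, median %.2f, variance %.2f, sd %.2f\n",
    mean(@data), median(@data), variance(@data), std_dev(@data);
# prints mean 5.00, median 4.50, variance 4.00, sd 2.00
```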
Statistics of populations
The equations so far are for sample statistics
a statistic is a single number estimated from a sample
We use the sample to make inferences about the population.
a parameter is a single number that summarizes some quality of a variable in a population.
the term for the population mean is μ (mu), and Ybar is a sample estimator of μ.
the term for the population standard deviation is σ (sigma), and s is a sample estimator of σ.
Note that μ and σ are both elements of the normal probability curve.
Source: http://www.bsos.umd.edu/socy/smartin/601/
Measuring probabilities under the normal curve
We can make transformations by scaling everything with respect to the mean and standard deviation.
Let z = the number of standard deviations above or below the population mean.
z = 0   y = μ
z = 1   y = μ +/- σ   (p=0.68)
z = 2   y = μ +/- 2σ  (p=0.95)
z = 3   y = μ +/- 3σ  (p=0.997)
Did rounding occur?
Ordered Array (radix sort) yields stem and leaf plots
Difficult to integrate… But probabilities have been mapped out to this curve. Transformations from other curves possible…
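Box plots (box and whiskers plots, Tukey, 1977)
[Figure: box plot labelled, top to bottom: outliers, upper fence / whisker, Q3, median, Q1, lower fence / whisker; the box spans the IQR from Q1 to Q3]
Upper fence / whisker: min(Q3 + 1.5(IQR), largest X)
Lower fence / whisker: max(Q1 – 1.5(IQR), smallest X)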
Back to our task – gene finding
Let’s start with a simple model that utilizes codon bias
What we need:
– A routine for reading and accessing the data
– A statistical construct for evaluating all possible codons within the data
– A way to reuse segments of our code when appropriate
[Figure: a sequence annotated with possible start codons (ATG) and possible stop codons; an in-frame start-to-stop stretch is a putative ORF]
Codon bias assumptions
• Codons are independent of each other
• So if E & F are independent, then:
P(EF) = P(E)*P(F)
• Codon frequencies are not uniform across the genome
Conditional Probability
• Given E, the probability of F is:
$P(F \mid E) = \frac{P(EF)}{P(E)}$
(this is the likelihood)
[Figure: Venn diagram of events E and F in S, with overlap EF]
• We can evaluate competing likelihoods through a ratio, called the log-odds ratio, or LOD
Conditional Probability
• Our LOD is culled from the following information:
[Figure: Venn diagram of events E and F in S, with overlap EF]
Our model
• To get our codon model, we need a TRAINING SET of data for known coding regions…
• We then simply count the frequencies of each codon occurrence
• We can often get this information from genomic databases in the form of ORF-only FASTA files…
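Counting codon frequencies from a training set could look roughly like this. It is a sketch only: the subroutine name codon_frequencies and the two tiny training sequences are made up, and it assumes every ORF starts in frame at its first base.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# count in-frame codons over a list of coding sequences and
# turn the counts into relative frequencies (codon => fraction of all codons)
sub codon_frequencies {
    my @orfs = @_;
    my %count;
    my $total = 0;
    for my $seq (@orfs) {
        my $dna = uc $seq;
        for (my $i = 0; $i + 3 <= length($dna); $i += 3) {
            my $codon = substr($dna, $i, 3);
            next unless $codon =~ /^[ACGT]{3}$/;   # skip codons with N or other symbols
            $count{$codon}++;
            $total++;
        }
    }
    my %freq;
    return %freq unless $total;                    # nothing usable: empty table
    $freq{$_} = $count{$_} / $total for keys %count;
    return %freq;
}

# two made-up ORFs standing in for a real training set
my @training = ('ATGGCTGCTTAA', 'ATGGCTAAATGA');
my %codonFreq = codon_frequencies(@training);
printf "%s\t%.3f\n", $_, $codonFreq{$_} for sort keys %codonFreq;
```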
Our model
• To get our random model, it is typical to model noncoding sequences as random DNA (uniform distribution)
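With a coding model and a uniform random model in hand, a log-odds score can be sketched as below. The exact form of the LOD is not shown on the slides, so this uses one common form as an assumption: because codons are treated as independent, the score is the sum over codons of log( coding frequency / random frequency ), with the random frequency set to 1/64. The frequencies in %codingFreq, the floor value for unseen codons, and the choice of natural log are all illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# HYPOTHETICAL coding-model frequencies; real ones come from a training set
my %codingFreq = (ATG => 0.25, GCT => 0.375, AAA => 0.125, TAA => 0.125, TGA => 0.125);
my $randomFreq = 1 / 64;    # uniform model: all 64 codons equally likely
my $floor      = 1e-6;      # assumed stand-in for codons never seen in training

# sum of per-codon log-odds; positive favours the coding model
sub lod_score {
    my ($seq) = @_;
    my $dna = uc $seq;
    my $lod = 0;
    for (my $i = 0; $i + 3 <= length($dna); $i += 3) {
        my $codon = substr($dna, $i, 3);
        my $p = $codingFreq{$codon} || $floor;
        $lod += log($p / $randomFreq);
    }
    return $lod;
}

printf "%.3f\n", lod_score('ATGGCTTAA');   # codons common in the coding model
printf "%.3f\n", lod_score('CCCCCCCCC');   # codons absent from the coding model
```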
We need to deal with the sequence
• It’s in a FASTA file: we need to build a ‘reader’ of sorts to load the data into useful data structures
– Recall that our grep search of FASTA had problems
– Sequence read is across many lines
– Only one strand present
– WE CAN SOLVE THIS BY USING A HASH
#!/usr/bin/perl -w
use strict;
open(IN, "chr.fsa") or die "Cannot open chr.fsa: $!";
while (<IN>) {
    chomp;
    # load the fasta file into a hash
    # the header will be the key
    # the sequence will be the value
}
close(IN);
(fasta file looks like this)
>one
CTAAACAAAGTGCTGCCACCCCGAATTGCCAATATAAT…
There are multiple FASTA files
• Each chromosomal sequence has its own FASTA file
• We need a training set of data to get our LODs for evaluation
• We can build a complex data structure (HoH or AoH)
• My approach: HoH (Major, minor; or outer, inner hashes)
foreach my $file (@chrFiles) {
    chomp($file);          # get rid of metacharacters, newlines from filenames
    %fastaSeqs = ();
    $header = '';
    open(IN, "$file") or die "Cannot open $file: $!";   # create a file handle for the file being processed
    $file =~ s/\.fsa//;
    while (<IN>) {
        chomp;
        # INSERT YOUR CODE TO READ IN A FASTA FILE HERE
        # (Hint: use the hash function you learned about)
    }
    close IN;              # close that file, filehandle. we'll need to use it for the next file
    ### you'll need to update the hash...
}
We can build up our program piecemeal
First, let’s write a fasta file reader
On the first pass, write it for one small file
– Then build it for multiple files
We’ll also need some functions…
– This is a good time to introduce modular programming
– At the very least, we should incorporate subroutines
• We might need functions to:
– Reverse complement a sequence (you did this already!!! Now we’ll just make it a function so we can call it whenever we want – it’s like a control structure that’s ALWAYS AVAILABLE!)
– Translate a sequence to amino acids (much like the revcom)
– Calculate LOD scores for codons
– Count and get frequencies of nucleotides in the sequence
– We may add more… such as creating a random sequence that preserves the nucleotide composition of the original sequence… This will come in handy later
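Three of the helpers on this list are short enough to sketch right away. The subroutine names (revcom, nucleotide_frequencies, shuffle_sequence) are suggestions, not the names used in the workshop's starting code; shuffle comes from the core List::Util module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(shuffle);

# reverse the sequence, then swap each base for its complement
sub revcom {
    my ($seq) = @_;
    my $rc = reverse $seq;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# fraction of the sequence made up by each nucleotide (symbol => fraction)
sub nucleotide_frequencies {
    my ($seq) = @_;
    my %freq;
    my $len = length($seq);
    return %freq unless $len;
    $freq{$_}++ for split //, uc $seq;
    $freq{$_} /= $len for keys %freq;
    return %freq;
}

# random sequence with exactly the same nucleotide composition as the input
sub shuffle_sequence {
    my ($seq) = @_;
    return join '', shuffle(split //, $seq);
}

print revcom('ATGC'), "\n";                    # prints GCAT
my %composition = nucleotide_frequencies('AACG');
printf "%s %.2f\n", $_, $composition{$_} for sort keys %composition;
print shuffle_sequence('AACGTTTT'), "\n";      # order differs from run to run
```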
Let’s begin
First, let’s write a fasta file reader
On the first pass, write it for one small file
– We’ll build it for our test file, which contains two sequences
• We’ll evaluate these sequences for the propensity of ORFs using our statistical model
• We’ll revisit this problem with a more traditional homology search problem
– We’ll write our own aligner!!!!!!!
– The starting code and sequences are available for you
• http://www.chem.ucla.edu/~mjanis/biohackers2005.html
• Remember you can use wget!!!
Where to start?
I’ve written an outline for you to follow
#!/usr/bin/perl -w
use strict;
# this is straight out of your reading:
# a simple hash to map codons to their
# corresponding amino acids
my %geneticCode = (
    'TCA'=>'S', # Serine
    'TCC'=>'S',
    'TCG'=>'S',
    'TCT'=>'S',
    'TTC'=>'F', # Phenylalanine
    'TTT'=>'F',
    'TTA'=>'L', # Leucine
    'TTG'=>'L',
    'TAC'=>'Y', # Tyrosine
    'TAT'=>'Y',
    'TAA'=>'-', # STOP CODON
    'TAG'=>'-', # STOP CODON
    'TGA'=>'-', # STOP CODON
    'TGC'=>'C', # Cysteine
    'TGT'=>'C',
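The slide cuts the table off after cysteine. As a hedged sketch of where this is heading, the block below builds the full standard genetic code in a compact way (not the way the outline does it) and uses it in a translate subroutine. The compact construction, the name translate, and the 'X' for unrecognised codons are assumptions made here; '-' marks a stop codon, as in the outline.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# standard genetic code, amino acids listed for codons in TCAG order
# (TTT, TTC, TTA, TTG, TCT, ...); '-' marks a stop codon
my @bases      = qw(T C A G);
my $aminoAcids = 'FFLLSSSSYY--CC-WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';

my %geneticCode;
my $index = 0;
for my $first (@bases) {
    for my $second (@bases) {
        for my $third (@bases) {
            $geneticCode{"$first$second$third"} = substr($aminoAcids, $index++, 1);
        }
    }
}

# translate a DNA sequence codon by codon, starting at its first base
sub translate {
    my ($seq) = @_;
    my $dna = uc $seq;
    my $protein = '';
    for (my $pos = 0; $pos + 3 <= length($dna); $pos += 3) {
        my $codon = substr($dna, $pos, 3);
        # 'X' for codons containing N or other symbols (an assumption)
        $protein .= exists $geneticCode{$codon} ? $geneticCode{$codon} : 'X';
    }
    return $protein;
}

print translate('ATGTTTTCATACTGCTAA'), "\n";   # prints MFSYC-
```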
My fasta file loader
my %fastaSeqs;   # declaration of the minor hash (re)used to hold the fasta data files
my $header;      # declaration of scalar to hold the fasta header for each instance
my %chrList;     # declaration of master hash, a hash of hashes, for every file
my @chrFiles = `ls -1 *.fsa`;    # a way to use wildcards to load all fasta files in the CWD
foreach my $file (@chrFiles) {   # process each file in turn; here we populate each minor hash,
                                 # then pass that hash to the master hash
    chomp($file);       # get rid of metacharacters, newlines from filenames
    %fastaSeqs = ();    # clear out the minor hash from the last instance
    $header = '';       # clear out the header scalar
    open(IN, "$file") or die "Cannot open $file: $!";   # create a file handle for the file being processed
    $file =~ s/\.fsa//; # we've used the filename to create the filehandle;
                        # we don't need the filename any longer, so we'll
                        # remove the .fsa extension and use the filename as
                        # the major hash key
    while (<IN>) {      # use a while loop to go through the file one line at a time
        chomp;          # remove newlines/metacharacters
        if ($_ =~ /^\s*$/) {     # let's ignore blank lines in the file that may exist
            next;
        } elsif ($_ =~ /^>/) {   # here we grab the header; note that the key - value pair
                                 # is empty at this point in the minor hash
            $header = $_;        # just take the whole header
            $header =~ s/>//;    # strip out the leading > from the header
        } else {                 # here's where we grab the sequence;
                                 # (if it's not a header, it's sequence in our fasta)
            $fastaSeqs{$header} .= $_;   # we simply concatenate all the sequence lines together
                                         # into a cohesive sequence.
        }
    }
    close IN;           # close that file, filehandle! we'll need to use it for
                        # next file
    $chrList{$file} = {%fastaSeqs};   # finally, for each minor hash created, we append it to
                                      # major hash (called chrList).
}
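Once the loader has run, getting sequences back out of the hash of hashes takes two keys: the file name (major key) and the FASTA header (minor key). A small sketch with a hand-filled %chrList standing in for real data; the keys and sequences below are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# stand-in for what the loader builds: file name => { header => sequence }
my %chrList = (
    chr01 => { one => 'CTAAACAAAGTG', two => 'ATGGCTTAA' },
    chr02 => { three => 'GGGCCC' },
);

# a single lookup: major key first, then minor key
print $chrList{chr01}{two}, "\n";

# walk every sequence in every file
for my $chr (sort keys %chrList) {
    for my $header (sort keys %{ $chrList{$chr} }) {
        my $seq = $chrList{$chr}{$header};
        printf "%s\t%s\t%d bp\n", $chr, $header, length($seq);
    }
}
```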
Hopefully your fasta file reader makes sense… Now let’s build some functionality for the sequences we’ve loaded…
First steps in modular programming
An analogy for programmers - procedural C++
A bit about the C language family
C is the basis of most OS
C is a compiled language
C is portable
C extends functionality of many programs well (especially if they are slow)
C++ is C, ++
C++ is the OOC
ANSI C/C++ is handled by gcc/g++ compiler
Of course emacs has edit/compile/debug functions for both!
C/C++ are modular, based on libraries – this is the hardest part of learning C, remembering all the libraries!
Procedural C++
We’ll look at procedural C++, and leave the OOP for later…
• C++ is a great language to start with
– grammar similar to PERL (although syntax isn’t)
– forces the programmer to declare variables and clean memory
• these are good programming basics to learn!
A simple C++ program
#include <iostream>
using namespace std;
int main() {
    int numberGenes;
    cout << "Enter number of genes\n";
    cin >> numberGenes;
    return 0;
}
C works by function calls
#include <iostream>
using namespace std;
int main() {
    int numberGenes;
    cout << "Enter number of genes\n";
    cin >> numberGenes;
    return 0;
}
C uses arrays in functions
No matrices; arrays of arrays are used instead
double a[10] = {1.2, 3.3, 4.4};
char b[3] = {'a', 'b', 'c'};
for (int i = 0; i < 10; i++) {
    cout << a[i] << endl;
}
An example function
Write a function that accepts an integer array and returns the sum of the array.
int sumOfArray(int a[], int size) {
    int sum = 0;
    for (int j = 0; j < size; j++)
        sum = sum + a[j];
    return (sum);
}
Perl functions
Perl has functions as well… subroutines and Modules.
This is the basis of modern bioinformatics programming – modularity
The environment gives C or PERL the language references it should use – kind of like locality or accent for a language
Functions
• We can access these functions by a call:
– functionName();
• Likewise we can pass parameters to these functions:
– functionName(x) or functionName(int x);
• And we can return results from these functions:
– return(y);
First steps - subroutines
We have been using a form of subroutines all along. Perl functions are basically built-in subroutines. You call (or "invoke") a function by typing its name and giving it one or more arguments.
Subroutines
Perl gives you the opportunity to define your own functions, called "subroutines". In the simplest sense, subroutines are named blocks of code that can be reused as many times as you wish.
Subroutines
sub hypotenuse {
    my ($a,$b) = @_;
    return sqrt($a**2 + $b**2);
}
sub E {
    return 2.71828182845905;
}
Calling subroutines
$y = 3;
$x = hypotenuse($y,4); # $x now contains 5
$x = hypotenuse((3*$y),12); # $x now contains 15
$value_e = E(); # $value_e now contains 2.71828182845905
Subroutines
This way of using subroutines makes them look suspiciously like functions.
Note: Unlike a function, you must use parentheses when calling a subroutine in this manner, even if you are giving it no arguments.
The Magic Array @_
Perhaps the most important concept to understand is that values are passed to the subroutine in the default array @_. This array springs magically into existence (like the scalar $_ we learned about earlier), and contains the list of values that you gave to the subroutine (within the parentheses).
The Magic Array @_
sub Add_two_numbers {
    my ($number1) = shift;    # get first argument from @_
                              # and put it in $number1
    my ($number2) = shift;    # get second argument from @_
                              # and put it in $number2
    my $sum = $number1 + $number2;
    return $sum;
}
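Because @_ is an ordinary array inside the subroutine, it can also be handled as a whole, which lets one subroutine accept any number of arguments. A minimal sketch (the name sum_numbers is made up for illustration and is not from the lecture):

```perl
use strict;
use warnings;

# Hypothetical helper: add up however many numbers the caller passes in.
sub sum_numbers {
    my @numbers = @_;          # copy the argument list out of @_
    my $total   = 0;
    foreach my $n (@numbers) {
        $total += $n;
    }
    return $total;
}

my $two  = sum_numbers(1, 1);
my $many = sum_numbers(1, 2, 3, 4);
my $none = sum_numbers();      # @_ is empty, so the loop never runs
print "$two $many $none\n";    # prints "2 10 0"
```

Copying @_ into a my array first, as above, keeps the subroutine from changing the caller's values by accident.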
"my" Variables (scoping, strict)
• Variables that you use in a subroutine should be made private to that subroutine with the my operator.
  – This avoids accidentally overwriting similarly-named variables in the main program.
• If you have included use strict at the top of your program, perl will check that all variables are declared before use, normally with my.
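A small sketch of why this matters, assuming a made-up subroutine count_bases: the $count inside the subroutine is a separate variable from the $count in the main program, so the outer value survives the call.

```perl
use strict;
use warnings;

my $count = 10;                       # variable in the main program

# Hypothetical subroutine with its own private $count.
sub count_bases {
    my ($sequence) = @_;
    my $count = length $sequence;     # private; the outer $count is untouched
    return $count;
}

my $bases = count_bases('GATTACA');
print "bases = $bases, count = $count\n";   # prints "bases = 7, count = 10"
```

Without the my inside count_bases, the assignment would overwrite the main program's $count.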
Introducing References
Sometimes you need a more complex data structure! (We’ll be using a HoH!)

Examples:
* An array of arrays (can do the job of a 2-dimensional matrix).

DATA:
Spot_num  Ch1-BKGD    Ch1  Ch2-BKGD   Ch2
000       0.124      43.2  0.102     80.4
001       0.113      60.7  0.091     22.6
002       0.084     112.2  0.144     35.3

my @spotarray = ([0.124,  43.2, 0.102, 80.4],
                 [0.113,  60.7, 0.091, 22.6],
                 [0.084, 112.2, 0.144, 35.3]);
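Individual cells of such an array of arrays are reached with two indexes, row first and then column. A brief sketch using the spot data above; the background subtraction is only an illustrative calculation, not something the lecture prescribes:

```perl
use strict;
use warnings;

my @spotarray = (
    [0.124,  43.2, 0.102, 80.4],
    [0.113,  60.7, 0.091, 22.6],
    [0.084, 112.2, 0.144, 35.3],
);

# Row 2, column 1: the Ch1 value for spot 002.
my $ch1_spot2 = $spotarray[2][1];

# Ch1 minus its background for every spot (illustrative only).
my @ch1_net;
for my $i (0 .. $#spotarray) {
    push @ch1_net, $spotarray[$i][1] - $spotarray[$i][0];
}

print "Ch1 of spot 002 = $ch1_spot2\n";
printf "spot %03d net Ch1 = %.3f\n", $_, $ch1_net[$_] for 0 .. $#ch1_net;
```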
What Is A Reference?
Well, first, what is a variable? Think of a variable as a (named) box that holds a value. The name of the box is the name of the variable.

After
$x = 1;
we have
    +---+
$x: | 1 |
    +---+

After
@y = (1, 'a', 23);
we have
    +--------------+
@y: | (1, 'a', 23) |
    +--------------+
Making References To Variables' Values
$list_ref = \@array;
$map_ref = \%hash;
$c_ref = \$count;

Refs to subroutines:
$sub_ref = \&subroutine;

A reference is an additional, rather different way to name the variable.

Ex: from $ref_to_y = \@y we have
                     +--------------+
             +-> @y: | (1, 'a', 23) |
             |       +--------------+
             |
           +-|-+
$ref_to_y: | * |
           +---+

$ref_to_y contains a reference (pointer) to @y. print @y yields 1a23, and print $ref_to_y yields something like ARRAY(0x80cd6ac); the exact address differs from run to run.
Getting At The Value ('de-referencing')
@{array_reference}
%{hash_reference}
${scalar_reference}
print @{$ref_to_y} yields 1a23.
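The forms above can be tried side by side. This is a sketch with made-up data (%gc and its values are invented for illustration); it also shows the arrow notation for reaching a single element and a call through a subroutine reference.

```perl
use strict;
use warnings;

my @y     = (1, 'a', 23);
my %gc    = (seq1 => 0.41, seq2 => 0.58);   # made-up values
my $count = 3;

my $ref_to_y = \@y;
my $map_ref  = \%gc;
my $c_ref    = \$count;
my $sub_ref  = sub { return scalar @_ };    # reference to an anonymous subroutine

my @copy    = @{$ref_to_y};         # the whole array
my $second  = ${$ref_to_y}[1];      # one element...
my $same    = $ref_to_y->[1];       # ...the same element, arrow notation
my @names   = sort keys %{$map_ref};
my $seq2_gc = $map_ref->{seq2};
my $n_args  = $sub_ref->(10, 20);   # call through the code reference

${$c_ref}++;                        # changes $count itself
push @{$ref_to_y}, 'new';           # changes @y itself
print "count = $count, y has ", scalar(@y), " elements\n";
```

Changes made through a reference affect the original variable, while @copy stays an independent copy of the three original elements.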
Scripting Example: Hash of Hashes
#!/usr/bin/perl -w
use strict;

@ARGV = '/home/mako/DATA/sequences.txt' unless @ARGV;
$/ = ">";    # read one FASTA record (everything up to the next ">") at a time
my %DATA;
while (<>) {
    chomp;
    my ($id_line, @rest) = split "\n";
    next unless defined $id_line and $id_line =~ /^(\S+)/;
    my $id       = $1;
    my $sequence = join '', @rest;
    my $length   = length $sequence;
    next unless $length;    # skip records with no sequence (avoids dividing by zero)
    my $gc_count   = $sequence =~ tr/gcGC/gcGC/;
    my $gc_content = $gc_count/$length;
    $DATA{$id} = { sequence   => $sequence,
                   length     => $length,
                   gc_content => sprintf("%3.2f",$gc_content)
                 };
}

my @ids = sort { $DATA{$a}->{gc_content} <=> $DATA{$b}->{gc_content}
               } keys %DATA;

foreach my $id (@ids) {
    print "$id\n";
    print "\tgc content = $DATA{$id}->{gc_content}\n";
    print "\tlength = $DATA{$id}->{length}\n";
    print "\n";
}
Using a Module
• After writing some of our functions, we can see that they might be really useful to other programs as well
  – Handling sequence pattern matching, for example
  – Cleans up the main portion of our program’s code
• A module is a package of useful subroutines and variables that someone (you?) has put together.
  – Modules extend the abilities of Perl.
File::Basename Module
The File::Basename module is a standard module that is distributed with Perl. When you load the File::Basename module, you get new functions, including basename and dirname.
basename takes a long UNIX path name and returns the file name at the end. dirname takes a long UNIX path name and returns the directory part.
The use File::Basename; line is the same syntax you use to load any module you create yourself.
But you might have to tell perl where you put it…
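One way to tell perl where a home-made module lives is to add its directory to @INC, the list of places perl searches. The sketch below is self-contained only because it first writes a tiny module into a scratch directory; the module name SeqTools, its gc_content subroutine, and the path in the comment are all hypothetical.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Write a tiny, hypothetical module (SeqTools.pm) into a scratch directory
# so this sketch can run on its own. Normally the file would already exist.
my $lib_dir = tempdir(CLEANUP => 1);
my $pm_file = File::Spec->catfile($lib_dir, 'SeqTools.pm');
open my $out, '>', $pm_file or die "Cannot write $pm_file: $!";
print {$out} <<'END_MODULE';
package SeqTools;
use strict;
use warnings;
use Exporter 'import';
our @EXPORT_OK = qw(gc_content);

# Fraction of G and C bases in a sequence.
sub gc_content {
    my ($sequence) = @_;
    return 0 unless length $sequence;
    my $gc = ($sequence =~ tr/gcGC//);
    return $gc / length $sequence;
}

1;    # a module file must end with a true value
END_MODULE
close $out or die "Cannot close $pm_file: $!";

# Tell perl where the module lives, then load it. With a fixed directory
# you would normally write:  use lib '/path/to/my/modules';  use SeqTools;
unshift @INC, $lib_dir;
require SeqTools;
SeqTools->import('gc_content');

my $gc = gc_content('GGCCAATT');
printf "gc content = %3.2f\n", $gc;
```

The runtime unshift/require/import sequence is used here only because the directory is created while the script runs; use lib does the same job at compile time when the path is known in advance.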
File::Basename Module
#!/usr/bin/perl
# file: basename.pl
use strict;
use File::Basename;

my $path = '/home/mako/DATA/chrT.fsa';
my $base = basename($path);
my $dir  = dirname($path);

print "The base is $base and the directory is $dir.\n";
Using a Module
Each module will automatically import a different set of variables and subroutines when you use it. You can control what gets imported by providing use with a list of what to import.
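For example, giving use a list restricts the import to the names you ask for. In this sketch only basename is imported from File::Basename; dirname is still available, but only under its full package name.

```perl
use strict;
use warnings;
use File::Basename qw(basename);   # import only basename

my $path = '/home/mako/DATA/chrT.fsa';
my $base = basename($path);

# dirname was not imported, so call it by its full package name.
my $dir = File::Basename::dirname($path);

print "The base is $base and the directory is $dir.\n";
```

Listing the imports makes it obvious where each function in your program comes from and avoids name clashes with your own subroutines.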
Finding out What Modules are Installed
To find out what modules come with perl, look in Appendix A of the Perl 5 Pocket Reference, or use the perldoc command from the UNIX shell. All the Perl documentation is available through perldoc; the list of standard modules is in perlmodlib:
% perldoc perlmodlib
To learn more about a module, run perldoc with the module's name:
% perldoc File::Basename
Installing Modules
You can find thousands of Perl Modules on CPAN, the Comprehensive Perl Archive Network:
http://www.cpan.org
Installing Modules Using the CPAN Shell
Perl has a CPAN module installer built into it. You run it like this:
% perl -MCPAN -e shell
cpan shell -- CPAN exploration and modules installation (v1.59_54)
ReadLine support enabled

cpan>
cpan> install Text::Wrap
Object-Oriented Modules
Some modules are object-oriented. Instead of importing a series of subroutines that are called directly, these modules define a series of object types that you can create and use.
We’ll see what OOP is and why we want to use it next time…