Perl for Bioinformatics

Perl for Bioinformatics Lecture 4


Transcript of Perl for Bioinformatics

Page 1: Perl for Bioinformatics

Perl for Bioinformatics

Lecture 4

Page 2: Perl for Bioinformatics

Variables - review

• A variable name starts with a $
• It contains a number or a text string
• Use my to define a variable
• Use = to assign a value
• Use \ to stop the variable being interpolated
• Take care with variable names and with changing the contents of variables
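As a quick refresher, the sketch below shows interpolation and the backslash escape side by side. The variable names and the gene name are made up for illustration.

```perl
use strict;
use warnings;

my $gene = "BRCA1";                  # hypothetical example value

my $interpolated = "Gene: $gene";    # $gene is replaced by its contents
my $escaped      = "Gene: \$gene";   # the backslash keeps the literal text $gene

print "$interpolated\n";             # prints: Gene: BRCA1
print "$escaped\n";                  # prints: Gene: $gene
```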

Page 3: Perl for Bioinformatics

Conditional Blocks, review

• An if test can be used to control a command in a conditional block, according to the outcome of a decision made by comparing variables.

• It’s important to keep track of whether variables are strings or numbers. Numbers are compared with ==, strings with eq.

• It’s usual to indent the block to make it easier to read the code
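A small sketch of why the choice between == and eq matters. The two values are invented for the example: they are the same number but different strings.

```perl
use strict;
use warnings;

my $number = 10;
my $text   = "10.0";    # same number, different text

my $numeric_match = 0;
my $string_match  = 0;

if ($number == $text) {
    $numeric_match = 1;   # true: 10 and 10.0 are numerically equal
}
if ($number eq $text) {
    $string_match = 1;    # not reached: "10" and "10.0" differ as text
}

print "numeric: $numeric_match, string: $string_match\n";
```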

Page 4: Perl for Bioinformatics

Arrays

• An array can store multiple pieces of data.
• They are essential for the most useful functions of Perl. They can store data such as:
– the lines of a text file (e.g. primer sequences)
– a list of numbers (e.g. BLAST e values)

• Arrays are designated with the symbol @

my @bases = ("A", "C", "G", "T");

Page 5: Perl for Bioinformatics

Converting a variable to an array

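split breaks a string into parts and puts them in an array. The pattern between the slashes marks where to split; an empty pattern splits between every character.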

my $dnastring = "ACGTGCTA";

my @dnaarray = split //, $dnastring;

@dnaarray is now (A, C, G, T, G, C, T, A)

@dnaarray = split /T/, $dnastring;

@dnaarray is now (ACG, GC, A)
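One detail worth knowing when splitting on a letter: by default split drops empty pieces at the end of the result but keeps an empty piece at the start. The short sequences below are made up to show this.

```perl
use strict;
use warnings;

my @middle   = split /T/, "ACGTGCTA";   # ("ACG", "GC", "A")
my @trailing = split /T/, "ACGT";       # ("ACG"): the empty piece after the last T is dropped
my @leading  = split /T/, "TACG";       # ("", "ACG"): the empty piece before the first T is kept

print scalar(@middle), " ", scalar(@trailing), " ", scalar(@leading), "\n";
```

This matters if you count fragments: a site at the very end of a sequence does not produce an extra fragment.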

Page 6: Perl for Bioinformatics

Converting an array to a variable

• join combines the elements of an array into a single scalar variable (a string)

$dnastring = join('', @dnaarray);

The first argument is the spacer placed between the elements (empty here); the second is the array to join.
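A short sketch of join with spacers other than the empty string, using the three fragments produced by the earlier split on T.

```perl
use strict;
use warnings;

my @dnaarray = ("ACG", "GC", "A");

my $rejoined = join('T', @dnaarray);   # puts the T back between the pieces
my $dashed   = join('-', @dnaarray);   # any spacer string can be used

print "$rejoined\n";   # prints: ACGTGCTA
print "$dashed\n";     # prints: ACG-GC-A
```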

Page 7: Perl for Bioinformatics

Loops

• A loop repeats a set of statements until it is done. The statements are placed in a BLOCK – some code delimited with curly brackets {}

• Loops are really useful with arrays.

• The “foreach” loop is probably the most useful of all:

foreach my $base (@dnaarray) {
    print "$base ";
}

Page 8: Perl for Bioinformatics

Comparing strings

• String comparison (is the text the same?)
• eq (equal)
• ne (not equal)

There are others but beware of them!

Page 9: Perl for Bioinformatics

Getting part of a string

• substr takes characters out of a string

$letter = substr($dnastring, $position, 1);

The first argument is the string, the second is where in the string to start (counting from zero), and the third is how many letters to take.
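A minimal sketch of substr on the lecture's example string. The positions chosen are arbitrary; note that counting starts at zero.

```perl
use strict;
use warnings;

my $dnastring = "ACGTGCTA";

my $first = substr($dnastring, 0, 1);   # position 0 is the first letter
my $three = substr($dnastring, 3, 3);   # three letters starting at position 3

print "$first\n";   # prints: A
print "$three\n";   # prints: TGC
```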

Page 10: Perl for Bioinformatics

Combining strings

• Strings can be concatenated (joined).

• Use the dot . operator

$seq1 = "ACTG";
$seq2 = "GGCTA";
$seq3 = $seq1 . $seq2;
print $seq3;

This prints ACTGGGCTA

Page 11: Perl for Bioinformatics

Making Decisions - review

• The if statement is generally used together with numerical or string comparison operators, inside an (expression).

numerical: ==, !=, >, <, >=, <=
strings: eq, ne

• You can make decisions on each member of an array using a loop which puts each part of the array through the test, one at a time
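As an illustration of testing each member of an array, the sketch below counts how many values fall under a cutoff. The e-values and the threshold are invented for the example.

```perl
use strict;
use warnings;

my @evalues   = (1e-50, 0.003, 2.5, 1e-10);   # hypothetical BLAST e values
my $threshold = 0.001;                        # assumed cutoff for "significant"
my $significant = 0;

foreach my $evalue (@evalues) {
    if ($evalue < $threshold) {    # numbers, so use < rather than a string operator
        $significant++;
    }
}

print "$significant hits below $threshold\n";
```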

Page 12: Perl for Bioinformatics

More healthy exercise

• Write a program that asks the user for a DNA restriction site, and then tells them whether that particular sequence matches the site for the restriction enzyme EcoRI, BamHI, or HindIII.

• Site for EcoRI: GAATTC
• BamHI: GGATCC
• HindIII: AAGCTT

Page 13: Perl for Bioinformatics

Pseudocode

• Read in restriction site to variable
• Remove newline character
• Check if variable contains “GAATTC”
• Check if variable contains “GGATCC”
• ..etc.

Page 14: Perl for Bioinformatics

What about longer sequences?

• Read in sequence to variable
• Remove newline character
• Split sequence in variable to array using “GAATTC”.
• Count and report number of fragments.
• Measure length of fragments and report site positions, adding six for each site that split removed
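One possible way to turn these steps into code is sketched below. The test sequence is made up, and the sketch assumes every site lies between two fragments; a site at the very end of the sequence would be missed, because split drops trailing empty pieces.

```perl
use strict;
use warnings;

# Hypothetical sequence built from pieces so the site positions are easy to check
my $sequence = "AAAA" . "GAATTC" . "CCCCC" . "GAATTC" . "GG";

my @fragments = split /GAATTC/, $sequence;
my $fragment_count = scalar @fragments;

my @positions;     # 1-based start position of each site
my $offset = 0;    # number of letters already passed
for my $i (0 .. $#fragments - 1) {
    $offset += length $fragments[$i];   # letters in the fragment before the site
    push @positions, $offset + 1;       # the site starts on the next letter
    $offset += 6;                       # add back the six letters split removed
}

print "$fragment_count fragments, sites at: @positions\n";
```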

Page 15: Perl for Bioinformatics

Arrays and loops - review

• An array starts with @. It contains multiple bits of data in a list-like format.

• @bases = ("A", "C", "G", "T");

• You can make decisions on each member of an array using a foreach loop which puts each part of the array through the test, one at a time

Page 16: Perl for Bioinformatics

Test time, again

• Remember –

keep track of what’s in a variable

don’t over-write a variable with another value, unless you intend to

syntax and case are critical

lines end with a semicolon

brackets and quotes must match.

Page 17: Perl for Bioinformatics

Opening and closing files

• To handle large amounts of data, Perl has to read data out of files, and write results into output files

• This is done in two steps: open the file, then read from it or write to it

• First, you must give the file a name within the script - this is known as a filehandle

• Use the open command:

open MYFILE, 'exampleprotein.txt';
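The open command as written on the slide says nothing if the file is missing. A common alternative style, sketched here, stores the filehandle in a my variable, states the mode ('<' for reading) separately, and checks whether open worked. The filename is a hypothetical one assumed not to exist, so that the failure branch runs.

```perl
use strict;
use warnings;

my $filename = 'no_such_file_for_demo.txt';   # hypothetical name, assumed to be absent

# open returns a true value on success and a false value on failure
my $opened = open(my $fh, '<', $filename) ? 1 : 0;

if ($opened) {
    print "Opened $filename\n";
    close $fh;
} else {
    print "Could not open $filename: $!\n";   # $! holds the system's error message
}
```

In a real script it is usual to write open(...) or die "message: $!"; so that the program stops straight away instead of reading from a file that never opened.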

Page 18: Perl for Bioinformatics

Reading a file

• Once the file is open, you can read from it, line by line, using the readline <> operator again
– (put the filehandle between the angle brackets)

• Perl reads files one line at a time. Each time you input data from the file, the next line is read:

open FILE1, 'exampleprotein.txt';
$line1 = <FILE1>;
chomp $line1;
$line2 = <FILE1>;

Page 19: Perl for Bioinformatics

Using loops to read in a file

• The while loop keeps running its block for as long as the test expression is true. So it will keep reading lines from the file until it runs out.

my $longsequence = "";
open FILE, 'exampleprotein.txt';
while (my $line = <FILE>) {
    chomp $line;
    $longsequence = $longsequence . $line;
}
close FILE;

• This reads the whole file, and appends each line to the variable $longsequence one at a time.

• Think about what happens to the ID line….
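One way to deal with the ID line is to read it into its own variable before the loop starts, as in this sketch. To keep the example self-contained it reads from a string instead of a file on disk (Perl allows open on a reference to a string); the sequence data is invented.

```perl
use strict;
use warnings;

# Hypothetical file contents, held in a string so the example needs no real file
my $filecontents = ">seq1 hypothetical protein\nMKVL\nAGTW\n";

open(my $fh, '<', \$filecontents) or die "Cannot open in-memory file: $!";

my $id = <$fh>;          # the first read takes the ID line
chomp $id;

my $longsequence = "";
while (my $line = <$fh>) {   # the loop carries on from the second line
    chomp $line;
    $longsequence = $longsequence . $line;
}
close $fh;

print "$id\n";             # prints: >seq1 hypothetical protein
print "$longsequence\n";   # prints: MKVLAGTW
```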

Page 20: Perl for Bioinformatics

Now More Fun Exercises

• Read a DNA sequence from a fasta format file
• Calculate the GC content.
• What about the non-DNA characters in the file?
– > header lines with the name of the sequence
– carriage returns !! You know this one.
– blank spaces
– N’s or X’s or unexpected letters

Page 21: Perl for Bioinformatics

Pseudocode

• Open file
• Load ID line and sequence into different variables
• Split the sequence into an array of individual letters
• For each letter in the array:
– if A, increment “A” counting variable
– if T, increment “T” counting variable
– …etc
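The pseudocode could be sketched roughly as follows. This is only one possible solution: it reads from an in-memory string with invented contents, uses the built-in uc function (not covered in the lecture) to fold lowercase letters to uppercase, and counts anything that is not A, C, G or T as "other" so it stays out of the GC calculation.

```perl
use strict;
use warnings;

# Hypothetical fasta contents with a space, N's, lowercase letters and a blank line
my $fasta = ">demo sequence\nGGCC AT\nNNat\n\n";
open(my $fh, '<', \$fasta) or die "Cannot open in-memory file: $!";

my $header = <$fh>;    # ID line goes into its own variable
chomp $header;

my ($count_a, $count_c, $count_g, $count_t, $count_other) = (0, 0, 0, 0, 0);

while (my $line = <$fh>) {
    chomp $line;
    foreach my $base (split //, uc $line) {
        if    ($base eq "A") { $count_a++; }
        elsif ($base eq "C") { $count_c++; }
        elsif ($base eq "G") { $count_g++; }
        elsif ($base eq "T") { $count_t++; }
        else                 { $count_other++; }   # spaces, N, X, anything unexpected
    }
}
close $fh;

my $total = $count_a + $count_c + $count_g + $count_t;
my $gc_percent = 0;
if ($total > 0) {      # avoid dividing by zero on an empty sequence
    $gc_percent = 100 * ($count_g + $count_c) / $total;
}

printf "GC content: %.1f%% (%d other characters ignored)\n", $gc_percent, $count_other;
```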

Page 22: Perl for Bioinformatics

Writing to a File

• Writing to a file is similar to reading from it
• Put > in front of the filename to open a file for writing:

open OUTPUT, '>/home/class30/output.txt';

• This creates a new file with that name, or overwrites an existing file
• Use >> to append text to an existing file
• print to the file using the filehandle:

print OUTPUT $myoutputdata;
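A small sketch showing the difference between > and >>. It writes to a temporary file created with the core File::Temp module, so it does not depend on any particular directory existing; the mode is given as a separate argument to open, which is an alternative to putting it inside the filename string.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Temporary file that is deleted when the script exits
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

open(my $out, '>', $path) or die "Cannot write to $path: $!";
print $out "first line\n";     # > starts the file from scratch
close $out;

open($out, '>>', $path) or die "Cannot append to $path: $!";
print $out "second line\n";    # >> adds to the end
close $out;

open(my $in, '<', $path) or die "Cannot read $path: $!";
my @lines = <$in>;             # in list context, <> reads all the lines at once
close $in;

print @lines;
```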

Page 23: Perl for Bioinformatics

Some more stuff you need to know

else

• Instead of just letting the script carry on when an if test fails, you can get it to execute a second block of code if the condition in brackets isn’t true.

Page 24: Perl for Bioinformatics

elsif

• You can string a lot of “if”s together using elsif

if ($site eq "GAATTC") {
    print "EcoRI site\n";
} elsif ($site eq "GGATCC") {
    print "BamHI site\n";
} elsif ($site eq "AAGCTT") {
    print "HindIII site\n";
} else {    # only happens if none of the preceding are true
    die "I can't find any of the sites I know\n";
}

Page 25: Perl for Bioinformatics

Subscripts

• Bioinformatics data often can be made into array format:
– multi-line sequence files
– Microarray or statistics data in “tab delimited” format

• You can address part of the array as if it was a variable using a subscript

@numbers = (8, 8, 8, 23984092, 8);
print "$numbers[3]\n";

Please note – the first element is number zero! Second is 1! So this prints 23984092.
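Subscripts are handy for picking columns out of tab delimited data. In this sketch the row of data is invented; it also shows two extras not on the slide: a negative subscript counts from the end of the array, and $#columns gives the subscript of the last element.

```perl
use strict;
use warnings;

my $row = "geneA\t2.5\t0.01";     # hypothetical tab delimited record
my @columns = split /\t/, $row;

my $name       = $columns[0];     # first element is number zero
my $last       = $columns[-1];    # -1 means the last element
my $last_index = $#columns;       # subscript of the last element

print "$name $last $last_index\n";   # prints: geneA 0.01 2
```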

Page 26: Perl for Bioinformatics

Regular Expressions

• Sounds odd, doesn’t it? It means a pattern that the computer can match, in a standard format.

• Very useful in bioinformatics work

• DNA patterns
– restriction sites
– promoters/transcription factor binding sites
– intron splice sites

• Protein patterns
– conserved domains (motifs)
– active sites
– structural motifs (membrane spanning, signal peptide, etc.)

Page 27: Perl for Bioinformatics

The Binding and Match Operators: =~ / /

• The =~ operator binds a string to a pattern match
• The // operator matches things to patterns

Together they can be translated as “contains”

• The forward slashes contain the pattern to be matched, like this:

if ($dnaseq =~ /GAATTC/) {
    print "EcoRI site found\n";
}

Page 28: Perl for Bioinformatics

A regular expression..is a joy forever. And a pattern to match:

can be just a text string, such as: /GATC/

it can have alternative characters: /G[AT]TC/

or contain a wildcard that matches any character: /G.TC/

Or be something bizarre: /\/[^\/]*\/\.\./
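To see how these patterns differ, the sketch below runs the first three against a few made-up four-letter strings and counts the matches. The last test shows what the odd-looking pattern is for: a slash, any run of non-slash characters, another slash, then two literal dots.

```perl
use strict;
use warnings;

my @candidates = ("GATC", "GTTC", "GCTC");   # hypothetical test strings
my ($plain, $alternatives, $wildcard) = (0, 0, 0);

foreach my $candidate (@candidates) {
    if ($candidate =~ /GATC/)     { $plain++; }          # only the exact text
    if ($candidate =~ /G[AT]TC/)  { $alternatives++; }   # A or T in the second position
    if ($candidate =~ /G.TC/)     { $wildcard++; }       # anything in the second position
}

print "$plain $alternatives $wildcard\n";   # prints: 1 2 3

# The "bizarre" pattern matches a directory name followed by /..
my $path = "/home/../etc";                  # hypothetical path
my $has_dotdot = ($path =~ /\/[^\/]*\/\.\./) ? 1 : 0;
print "$has_dotdot\n";                      # prints: 1
```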

Page 29: Perl for Bioinformatics

Perl Regular Expressions

• It never ceases to amaze me what people can do with regular expressions, but you can match pretty much anything you can think of and a lot you can’t:

#man perlrequick

Page 30: Perl for Bioinformatics

Alternative Characters

• Square brackets within the match expression allow for alternative characters:

if ($dna =~ /CAG[AT]CAG/)

• This will match any DNA string that contains CAG, followed by A or T in the 4th position, followed by another CAG.

• A vertical line within the /expression/ means “or”; it allows you to look for either of two completely different patterns:

if ($dna =~ /GAAT|ATTC/)
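A pattern can match anywhere inside the string unless it is tied to the start with ^. The sketch below uses invented sequences to show this, and to show the vertical line matching through its second alternative.

```perl
use strict;
use warnings;

my $dna = "TTCAGTCAGAA";    # hypothetical sequence with the motif in the middle

my $anywhere = ($dna =~ /CAG[AT]CAG/)  ? 1 : 0;   # found: the match can start anywhere
my $at_start = ($dna =~ /^CAG[AT]CAG/) ? 1 : 0;   # not found: the string starts with TT

my $either = ("GGATTCGG" =~ /GAAT|ATTC/) ? 1 : 0; # found: contains ATTC but not GAAT

print "$anywhere $at_start $either\n";   # prints: 1 0 1
```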

Page 31: Perl for Bioinformatics

Special characters

• Perl has a large set of special characters to use in regular expressions:

– the dot (.) matches any character (except a newline)
– \d matches any digit (a number from 0-9)
– \w matches any “word” character (a letter, a number or an underscore, not punctuation or space)
– \s matches one white space character (use \s+ for any amount)
– \t matches a tab (useful for tab delimited files)
– ^ matches the beginning of a line
– $ matches the end of a line

– Knowing this makes you lots of fun at parties.

Page 32: Perl for Bioinformatics

“Special” characters

• What if you need to match text that contains a special character?

• Aren’t there dots at the end of sentences?

• Now you have to use a backslash (\) to “escape” the special meaning of that character:

if ($onewordsentence =~ /\w+\./)

– This would match any text that has one or more word characters, followed by a dot.

Finished.
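The sketch below shows what goes wrong if the backslash is left out. The ^ and $ anchors are added here so that the whole string has to fit the pattern.

```perl
use strict;
use warnings;

my $with_dot    = ("Finished." =~ /^\w+\.$/) ? 1 : 0;   # ends in a real dot: matches
my $without_dot = ("Finished"  =~ /^\w+\.$/) ? 1 : 0;   # no dot at the end: no match
my $unescaped   = ("Finished"  =~ /^\w+.$/)  ? 1 : 0;   # bare . matches any character, here the final d

print "$with_dot $without_dot $unescaped\n";   # prints: 1 0 1
```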

Page 33: Perl for Bioinformatics

Bringing it together

• So now, when you think about it, you can:

– Open a file
– Check whether each line of the file contains a particular pattern
– Recover part of that line
– Write it out to another file

So.. isn’t that what you wanted to know? But really, it’s very useful combined with the UNIX command line.
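Those four steps might look something like the sketch below, which pulls the sequence names out of fasta header lines. It relies on two things not covered in the lecture: parentheses inside a pattern save the matched text in $1, and \S means any character that is not white space. In-memory strings stand in for the input and output files so the example runs on its own; in practice you would open real filenames.

```perl
use strict;
use warnings;

my $input  = ">seq1 human\nACGT\n>seq2 mouse\nGGCC\n";   # hypothetical file contents
my $output = "";                                          # stands in for the output file

open(my $in,  '<', \$input)  or die "Cannot open input: $!";
open(my $out, '>', \$output) or die "Cannot open output: $!";

while (my $line = <$in>) {
    chomp $line;
    if ($line =~ /^>(\S+)/) {    # header line: > at the start, then the name
        print $out "$1\n";       # $1 holds the part matched inside the parentheses
    }
}

close $in;
close $out;

print $output;   # prints seq1 and seq2 on separate lines
```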

Page 34: Perl for Bioinformatics

A last exercise?...

• Now we’re getting up to speed with Perl, let’s try something more fun:

• Open up a BLAST output file

• Spit out the name of the query sequence, the top hit, and how many hits there were.

Page 35: Perl for Bioinformatics

Only the beginning

• Sadly, there is much, much more than this to the Power of Perl.

• You can create websites, and download other people’s
• Make Linux and Windows graphical programs
• Do almost anything on the internet
• Interact with databases
• And much much more

Page 36: Perl for Bioinformatics

Why won’t I teach you more stuff?

• Whoa!
• Programming takes time to learn properly
• You’ve got the tools now to get started on a programming project

• We will go through some more Perl functions in the later classes, especially modules such as BioPerl.

Page 37: Perl for Bioinformatics

Practice makes perfect

• You can now practice your Perl skills and understand a lot of the books and help files, which are probably more useful.

#man perlintro

#man perlrequick

#perldoc bioperl

• Also, check out Radhika’s resource page

http://compbio.sph.harvard.edu/chb/training/training-resources