Post on 13-Jan-2016
description
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
Why Life is Beautiful
James Tisdall
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
Biocomputing AssociatesPO Box 9965Philadelphia PA 19118 USA(610)933-1200http:/www.biocomputingassociates.com
And
DuPont Experimental StationJames.Tisdall@DuPont.com
James Tisdall (Jim)
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
Who am I?
Musician Bell Laboratories, Information Principles Research Human Genome Project
early Perl program DNA WorkBench Mercator Genetics, Computational Biologist Fox Chase Cancer Center, Bioinformatics Manager Biocomputing Associates, consulting DuPont Genetics Research Author of "Beginning Perl for Bioinformatics" and
"Mastering Perl for Bioinformatics" from O'Reilly
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
Bioinformatics means using computers for biology research, and
Perl is a popular computer language for bioinformatics programming
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
OUTLINE
Perl resources Overview of basic Perl The art of programming Sequences and strings Motifs and loops Subroutines and bugs
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
OUTLINE (Continued)
Mutations and Randomization The Genetic Code Restriction Maps and Regular Expressions GenBank Protein Data Bank BLAST Further Topics
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
Advantages of Perl
1. Ease of Programming
• Simplify common programming tasks. A lot of free software for bioinformatics, web programming, …
2. Rapid Prototyping
• Often less programming time than other languages
3. Portability
• Most operating systems: Windows, Mac, Unix, etc…
• Free
• Run speed - suitable for most tasks
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
Biological Sequence Data
Bases and amino acids are represented as letters in Perl, just as in biology literature:
DNA and RNA
A C G T (or U) Amino acids
A C D E F G H I K L M N P Q R S T V W X Y
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
A Small Bioinformatics Program
Sequences are represented as Strings
# First we store the DNA in a variable called $DNA
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
# Next, we print the DNA onto the screen
print $DNA;
# Finally, we'll specifically tell the program to exit.
exit;
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
Virtual Transcription: DNA to RNA
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
print "$DNA\n\n";
# Transcribe the DNA to RNA by substituting all T's with U's.
$RNA = $DNA;
$RNA =~ s/T/U/g;
print "$RNA\n";
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
Reverse Complement
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
$revcom = reverse $DNA;
# Next substitute all bases by their complements,
# A->T, T->A, G->C, C->G, uppercase or lowercase
$revcom =~ tr/ACGTacgt/TGCAtgca/;
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
Mutations and Randomization
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
Modeling mutations randomly
How to randomly select a position in a string How to randomly select an index in an array Randomly select a base in DNA Mutate a base to some other random base Generate random DNA data sets Repeatedly mutate DNA
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
A program to simulate DNA mutation
Pseudocode:
1. Select a random position in the DNA
2. Choose a random nucleotide
3. Substitute the random nucleotide in the random position
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
A program to simulate DNA mutation
1. Select a random position in the DNA
$position = int(rand(length($DNA)))
2. Choose a random nucleotide
@nucleotides = ('A', 'C', 'G', 'T');
$new = $nucleotides[rand @nucleotides]
3. Substitute the random nucleotide in the random position
substr($DNA, $position, 1, $new)
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
Generating random DNA
sub make_random_DNA {
# Collect arguments, declare variables
my($length) = @_;
my $dna;
for (my $i=0 ; $i < $length ; ++$i) {
$dna .= $nucleotides[rand @nucleotides];
}
return $dna;
}
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
The Genetic Code
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
The genetic code
DNA encodes amino acids
3 bases of DNA = a codon Each codon represents an amino acid or a "stop" There are 64 codons, but 20 amino acids Most amino acids are coded for by multiple codons,
which makes the genetic code a redundant code
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
Translating codons to amino acids
sub codon2aa {
my($codon) = @_;
$codon = uc $codon;
my(%genetic_code) = (
'TCA' => 'S', # Serine
'TCC' => 'S', # Serine
'TCG' => 'S', # Serine
'TCT' => 'S', # Serine
'TTC' => 'F', # Phenylalanine
… et cetera …
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
Translating DNA into proteins
sub dna2peptide {
my($dna) = @_;
# Initialize variables
my $protein = '';
# Translate each three-base codon to an amino acid, and append to a protein
for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
$protein .= codon2aa( substr($dna,$i,3) );
}
return $protein;
}
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
Reading DNA from FASTA files
FASTA format is the most common way that DNA is stored in files (but there are many others)
Example:> sample dna (This is a typical fasta header.)
agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
tgggacgccgccgagtggtctgtgcag
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
Reading DNA from FASTA files
> sample dna (This is a typical fasta header.)
agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
Tgggacgccgccgagtggtctgtgcag
Notice that first line starts with a > symbol, and contains annotation.
Each other line just contains sequence data.
Simple!
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
Reading FASTA files: core code
From textbook, p. 170
foreach my $line (@fasta_file_data) {
if($line =~ /^>/) {# discard fasta header line
next;
} else {# keep line, add to sequence string
$sequence .= $line;
}
}
# remove non-sequence data (in this case, whitespace) from $sequence string
$sequence =~ s/\s//g;
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
Regular expressions
Regular expressions are a language within the Perl language that describe many kinds of patterns in strings, e.g. motifs in DNA.
Regular expressions permit you to find and alter many patterns with relative ease.
The excellent regular expressions in Perl are a major reason for Perl's success as a bioinformatics programming language.
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
Restriction enzymes, sites, and maps
A restriction enzyme is a protein that cuts DNA at short, specific sequences
EcoRI cuts at the restriction site GAATTC, between the G and A, on both complementary strands (the reverse complement of GAATTC is GAATTC), leaving "sticky ends" for insertion into a vector for cloning and sequencing.
A restriction map is the list of locations where a given restriction site occurs in the sequence
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
Finding restriction sites
while ( $sequence =~ /$regexp/ig ) {
push ( @positions, pos($sequence) - length($&) + 1);
}
while loop conditional is a pattern match
ig: i stands for "case insensitive"
g stands for "global", will keep searching entire string
pos is a built-in function that gives the ending position of the last successful match; subtract the length of the match ($&) to get to the beginning of the match; add 1 because biologists call the first base in DNA as position 1, while Perl calls the first letter in a string as position 0
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
GenBank
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
GenBank
http://ncbi.nlm.nih.gov/Genbank
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
GenBank Libraries
Examples:
PRI: primate sequences
ROD: rodent sequences
PHG: bacteriophage sequences
GSS: genome survey sequences
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
Protein Data Bank
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
PDB - Protein Data Bank
The primary depository for 3D protein structures http://www.rcsb.org/pdb/
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
BLAST
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
BLAST
Basic Local Alignment Search Tool Most popular bioinformatics program Discovers sequence similarity between your
sequence and large databases Voluminous output Need to process output by program
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
Similarity vs Homology
String matching is computer science term for finding sequence similarity
Exact Approximate : percent identity
Homology Sequences are or are not related evolutionarily Yes or no; no percentages
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
Parsing BLAST output
Many bioinformatics projects require frequent BLAST searches
Need to automate collection and processing BLAST output does not label lines by their
function, so use regular expressions Several BLAST parsers are available as Perl
modules: see for example Bioperl modules Bio::Tools::BPlite and Bio::Tools::Blast::*
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
1. Books• Bioinformatics: Sequence and Genome Analysis by Mount
• Developing Bioinformatics Computer Skills – Gibas & Jambeck
• Introduction to Computational Biology: Maps,Sequences and Genomes, Waterman
• Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. Baxecvanis & Ouellette, editors
Bioinformatics Resources
2. Government/Public Information Resources• National Center for Biotechnology Information (NCBI) – U.S. Government
Center – www.ncbi.nlm.nih.gov
• European Molecular Biology Laboratory (EMBL). European Union Laboratory – www.embl.org
• European Bioinformatics Institute (EBI)- From UK, part of EMBL- www.ebi.ac.uk
• UCSC Genome Resource- serves constructed genome data – www.genome.ucsc.edu
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
Bioinformatics Resources (cont.)
3. Conferences• ISMB: Intellegent Systems for Molecular Biology, www.iscb.org/ismb2004/
• Bioinformatics Open Source Conference, www.bioinformatics.org
• RECOMB: Conference on Computational Molecular Biology, recomb04.sdsc.edu/
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
Molecular Biology
1. Helpful Reference Books for Programmers• Recombinant DNA, 2nd edition, Watson et al.
• Molecular Biology of the Gene, Watson et al.
• Molecular Cell Biology, Lodish, et al.
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
Approximate string matching
As an example of advanced data structures, following is an algorithm that relies on a 2 dimensional array.
It finds the best match for a (shorter) pattern in a (longer) text.
It is one of several important dynamic programming algorithms in bioinformatics
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
Approximate string matching 1 of 4my $pattern = 'EIQADEVRL';print "PATTERN:\n$pattern\n";my $text =
'SVLQDRSMPHQEILAADEVLQESEMRQQDMISHDE';print "TEXT:\n$text\n";
my $TLEN = length $text;my $PLEN = length $pattern;
# D is the Distance matrix, which shows the "edit distance" between# substrings of the pattern and the text.# It is implemented as a reference to an anonymous array.my $D = [];
# The rows correspond to the text# Initialize row 0 of D.for (my $t=0; $t <= $TLEN ; ++$t) { $D->[$t][0] = 0;}
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
Approximate string matching 2 of 4# Compute the edit distances.for (my $t=1; $t <= $TLEN ; ++$t) { for (my $p=1; $p <= $PLEN ; ++$p) { $D->[$t][$p] = # Choose whichever alternative has the least cost min3( # Text and pattern may or may not match at this
character ... substr($text, $t-1, 1) eq substr($pattern, $p-1, 1) ? $D->[$t-1][$p-1] # If match, no increase in edit distance! : $D->[$t-1][$p-1] + 1, # If the text is missing a character $D->[$t-1][$p] + 1, # If the pattern is missing a character $D->[$t][$p-1] + 1 ) }}
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
Approximate string matching 3 of 4# Print D, the resulting edit distance arrayfor (my $p=0; $p <= $PLEN ; ++$p) { for (my $t=0; $t <= $TLEN ; ++$t) { print $D->[$t][$p], " "; } print "\n";}
# Find the best match(es).# The edit distances appear in the last row.my @matches = ();my $bestscore = 10000000;for (my $t=1 ; $t <= $TLEN ; ++$t) { if( $D->[$t][$PLEN] < $bestscore) { $bestscore = $D->[$t][$PLEN]; @matches = ($t); }elsif( $D->[$t][$PLEN] == $bestscore) { push(@matches, $t); }}
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
Approximate string matching 4 of 4
# Report the best match(es).print "\nThe best match for the pattern $pattern\n";print "has an edit distance of $bestscore\n";print "and appears in the text ending at location";print "s" if ( @matches > 1);print " @matches\n";
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
Approximate string matching: Output
PATTERN:EIQADEVRLTEXT:SVLQDRSMPHQEILAADEVLQESEMRQQDMISHDE0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 0 2 2 2 2 2 2 2 2 2 2 2 2 1 0 1 2 2 2 1 1 2 2 1 1 1 1 2 2 2 2 2 1 2 2 2 1 3 3 3 3 2 3 3 3 3 3 3 2 2 1 1 2 3 3 2 2 2 2 2 2 2 2 2 2 2 3 3 2 2 3 3 2 4 4 4 4 3 3 4 4 4 4 4 3 3 2 2 1 2 3 3 3 3 3 3 3 3 3 3 3 3 3 4 3 3 3 4 3 5 5 5 5 4 3 4 5 5 5 5 4 4 3 3 2 2 2 3 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 3 4 6 6 6 6 5 4 4 5 6 6 6 5 4 4 4 3 3 3 2 3 4 5 4 5 4 5 5 5 5 4 4 5 5 5 4 3 7 7 6 7 6 5 5 5 6 7 7 6 5 5 5 4 4 4 3 2 3 4 5 5 5 5 6 6 6 5 5 5 6 6 5 4 8 8 7 7 7 6 5 6 6 7 8 7 6 6 6 5 5 5 4 3 3 4 5 6 6 6 5 6 7 6 6 6 6 7 6 5 9 9 8 7 8 7 6 6 7 7 8 8 7 7 6 6 6 6 5 4 3 4 5 6 7 7 6 6 7 7 7 7 7 7 7 6
The best match for the pattern EIQADEVRLhas an edit distance of 3and appears in the text ending at location 20
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
BioperlBioperl is a collection of Perl modules for
bioinformatics. It has been developed by a large group of volunteers.
Bioperl has become a fairly stable platform with which to develop software; it is growing and changing rapidly, but its core is relatively mature.
Bioperl is a must for anyone interested in writing bioinformatics software in Perl
Don't reinvent the wheel! (Unless you like to, for the fun or interest of it.)
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
What can Bioperl do?
Sequence manipulation (DNA, RNA, proteins)
Several kinds of sequence objects Sequence format conversion Translation Revcom Sequence statistics (mol. weights, etc.) Restriction enzyme maps Signal cleavage sites
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
What can Bioperl do?
Sequence alignment
Interfaces with many alignment programs such as fasta, clustalw, gcg
Extract subalignment, consensus string Alignment statistics
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
What can Bioperl do?
Biological databases
Retrieving Genbank, EMBL, acedb, etc. Retrieving from remote databases Indexing and retrieving from local databases Parsing program reports from Blast and
variants, HMMer, genefinding programs
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
What can Bioperl do?
Sequence annotation
Adding annotation Parsing annotation from sequence databases
04/21/23
JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho
What can Bioperl do?
Protein structure
Read in PDB files Extract chains and their residues