Why Life is Beautiful James Tisdall

50
06/27/22 JOIN2004 Universidade do Minho JOIN2004 Universidade do Minho Why Life is Beautiful James Tisdall

description

Why Life is Beautiful James Tisdall. James Tisdall (Jim). Biocomputing Associates PO Box 9965 Philadelphia PA 19118 USA (610)933-1200 http:/www.biocomputingassociates.com And DuPont Experimental Station [email protected]. Who am I?. Musician - PowerPoint PPT Presentation

Transcript of Why Life is Beautiful James Tisdall

Page 1: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

Why Life is Beautiful

James Tisdall

Page 2: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

Biocomputing AssociatesPO Box 9965Philadelphia PA 19118 USA(610)933-1200http:/www.biocomputingassociates.com

And

DuPont Experimental [email protected]

James Tisdall (Jim)

Page 3: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

Who am I?

Musician Bell Laboratories, Information Principles Research Human Genome Project

early Perl program DNA WorkBench Mercator Genetics, Computational Biologist Fox Chase Cancer Center, Bioinformatics Manager Biocomputing Associates, consulting DuPont Genetics Research Author of "Beginning Perl for Bioinformatics" and

"Mastering Perl for Bioinformatics" from O'Reilly

Page 4: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

Bioinformatics means using computers for biology research, and

Perl is a popular computer language for bioinformatics programming

Page 5: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

OUTLINE

Perl resources Overview of basic Perl The art of programming Sequences and strings Motifs and loops Subroutines and bugs

Page 6: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

OUTLINE (Continued)

Mutations and Randomization The Genetic Code Restriction Maps and Regular Expressions GenBank Protein Data Bank BLAST Further Topics

Page 7: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

Advantages of Perl

1. Ease of Programming

• Simplify common programming tasks. A lot of free software for bioinformatics, web programming, …

2. Rapid Prototyping

• Often less programming time than other languages

3. Portability

• Most operating systems: Windows, Mac, Unix, etc…

• Free

• Run speed - suitable for most tasks

Page 8: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

Biological Sequence Data

Bases and amino acids are represented as letters in Perl, just as in biology literature:

DNA and RNA

A C G T (or U) Amino acids

A C D E F G H I K L M N P Q R S T V W X Y

Page 9: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

A Small Bioinformatics Program

Sequences are represented as Strings

# First we store the DNA in a variable called $DNA

$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

# Next, we print the DNA onto the screen

print $DNA;

# Finally, we'll specifically tell the program to exit.

exit;

Page 10: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

Virtual Transcription: DNA to RNA

$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

print "$DNA\n\n";

# Transcribe the DNA to RNA by substituting all T's with U's.

$RNA = $DNA;

$RNA =~ s/T/U/g;

print "$RNA\n";

Page 11: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

Reverse Complement

$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

$revcom = reverse $DNA;

# Next substitute all bases by their complements,

# A->T, T->A, G->C, C->G, uppercase or lowercase

$revcom =~ tr/ACGTacgt/TGCAtgca/;

Page 12: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

Mutations and Randomization

Page 13: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

Modeling mutations randomly

How to randomly select a position in a string How to randomly select an index in an array Randomly select a base in DNA Mutate a base to some other random base Generate random DNA data sets Repeatedly mutate DNA

Page 14: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

A program to simulate DNA mutation

Pseudocode:

1. Select a random position in the DNA

2. Choose a random nucleotide

3. Substitute the random nucleotide in the random position

Page 15: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

A program to simulate DNA mutation

1. Select a random position in the DNA

$position = int(rand(length($DNA)))

2. Choose a random nucleotide

@nucleotides = ('A', 'C', 'G', 'T');

$new = $nucleotides[rand @nucleotides]

3. Substitute the random nucleotide in the random position

substr($DNA, $position, 1, $new)

Page 16: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

Generating random DNA

sub make_random_DNA {

# Collect arguments, declare variables

my($length) = @_;

my $dna;

for (my $i=0 ; $i < $length ; ++$i) {

$dna .= $nucleotides[rand @nucleotides];

}

return $dna;

}

Page 17: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

The Genetic Code

Page 18: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

The genetic code

DNA encodes amino acids

3 bases of DNA = a codon Each codon represents an amino acid or a "stop" There are 64 codons, but 20 amino acids Most amino acids are coded for by multiple codons,

which makes the genetic code a redundant code

Page 19: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

Translating codons to amino acids

sub codon2aa {

my($codon) = @_;

$codon = uc $codon;

my(%genetic_code) = (

'TCA' => 'S', # Serine

'TCC' => 'S', # Serine

'TCG' => 'S', # Serine

'TCT' => 'S', # Serine

'TTC' => 'F', # Phenylalanine

… et cetera …

Page 20: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

Translating DNA into proteins

sub dna2peptide {

my($dna) = @_;

# Initialize variables

my $protein = '';

# Translate each three-base codon to an amino acid, and append to a protein

for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {

$protein .= codon2aa( substr($dna,$i,3) );

}

return $protein;

}

Page 21: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

Reading DNA from FASTA files

FASTA format is the most common way that DNA is stored in files (but there are many others)

Example:> sample dna (This is a typical fasta header.)

agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg

tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct

gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc

tgggacgccgccgagtggtctgtgcag

Page 22: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

Reading DNA from FASTA files

> sample dna (This is a typical fasta header.)

agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg

tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct

gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc

Tgggacgccgccgagtggtctgtgcag

Notice that first line starts with a > symbol, and contains annotation.

Each other line just contains sequence data.

Simple!

Page 23: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

Reading FASTA files: core code

From textbook, p. 170

foreach my $line (@fasta_file_data) {

if($line =~ /^>/) {# discard fasta header line

next;

} else {# keep line, add to sequence string

$sequence .= $line;

}

}

# remove non-sequence data (in this case, whitespace) from $sequence string

$sequence =~ s/\s//g;

Page 24: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

Regular expressions

Regular expressions are a language within the Perl language that describe many kinds of patterns in strings, e.g. motifs in DNA.

Regular expressions permit you to find and alter many patterns with relative ease.

The excellent regular expressions in Perl are a major reason for Perl's success as a bioinformatics programming language.

Page 25: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

Restriction enzymes, sites, and maps

A restriction enzyme is a protein that cuts DNA at short, specific sequences

EcoRI cuts at the restriction site GAATTC, between the G and A, on both complementary strands (the reverse complement of GAATTC is GAATTC), leaving "sticky ends" for insertion into a vector for cloning and sequencing.

A restriction map is the list of locations where a given restriction site occurs in the sequence

Page 26: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

Finding restriction sites

while ( $sequence =~ /$regexp/ig ) {

push ( @positions, pos($sequence) - length($&) + 1);

}

while loop conditional is a pattern match

ig: i stands for "case insensitive"

g stands for "global", will keep searching entire string

pos is a built-in function that gives the ending position of the last successful match; subtract the length of the match ($&) to get to the beginning of the match; add 1 because biologists call the first base in DNA as position 1, while Perl calls the first letter in a string as position 0

Page 27: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

GenBank

Page 28: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

GenBank

http://ncbi.nlm.nih.gov/Genbank

Page 29: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

GenBank Libraries

Examples:

PRI: primate sequences

ROD: rodent sequences

PHG: bacteriophage sequences

GSS: genome survey sequences

Page 30: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

Protein Data Bank

Page 31: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

PDB - Protein Data Bank

The primary depository for 3D protein structures http://www.rcsb.org/pdb/

Page 32: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

BLAST

Page 33: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

BLAST

Basic Local Alignment Search Tool Most popular bioinformatics program Discovers sequence similarity between your

sequence and large databases Voluminous output Need to process output by program

Page 34: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

Similarity vs Homology

String matching is computer science term for finding sequence similarity

Exact Approximate : percent identity

Homology Sequences are or are not related evolutionarily Yes or no; no percentages

Page 35: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

Parsing BLAST output

Many bioinformatics projects require frequent BLAST searches

Need to automate collection and processing BLAST output does not label lines by their

function, so use regular expressions Several BLAST parsers are available as Perl

modules: see for example Bioperl modules Bio::Tools::BPlite and Bio::Tools::Blast::*

Page 36: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

1. Books• Bioinformatics: Sequence and Genome Analysis by Mount

• Developing Bioinformatics Computer Skills – Gibas & Jambeck

• Introduction to Computational Biology: Maps,Sequences and Genomes, Waterman

• Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. Baxecvanis & Ouellette, editors

Bioinformatics Resources

2. Government/Public Information Resources• National Center for Biotechnology Information (NCBI) – U.S. Government

Center – www.ncbi.nlm.nih.gov

• European Molecular Biology Laboratory (EMBL). European Union Laboratory – www.embl.org

• European Bioinformatics Institute (EBI)- From UK, part of EMBL- www.ebi.ac.uk

• UCSC Genome Resource- serves constructed genome data – www.genome.ucsc.edu

Page 37: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

Bioinformatics Resources (cont.)

3. Conferences• ISMB: Intellegent Systems for Molecular Biology, www.iscb.org/ismb2004/

• Bioinformatics Open Source Conference, www.bioinformatics.org

• RECOMB: Conference on Computational Molecular Biology, recomb04.sdsc.edu/

Page 38: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

Molecular Biology

1. Helpful Reference Books for Programmers• Recombinant DNA, 2nd edition, Watson et al.

• Molecular Biology of the Gene, Watson et al.

• Molecular Cell Biology, Lodish, et al.

Page 39: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

Approximate string matching

As an example of advanced data structures, following is an algorithm that relies on a 2 dimensional array.

It finds the best match for a (shorter) pattern in a (longer) text.

It is one of several important dynamic programming algorithms in bioinformatics

Page 40: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

Approximate string matching 1 of 4my $pattern = 'EIQADEVRL';print "PATTERN:\n$pattern\n";my $text =

'SVLQDRSMPHQEILAADEVLQESEMRQQDMISHDE';print "TEXT:\n$text\n";

my $TLEN = length $text;my $PLEN = length $pattern;

# D is the Distance matrix, which shows the "edit distance" between# substrings of the pattern and the text.# It is implemented as a reference to an anonymous array.my $D = [];

# The rows correspond to the text# Initialize row 0 of D.for (my $t=0; $t <= $TLEN ; ++$t) { $D->[$t][0] = 0;}

Page 41: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

Approximate string matching 2 of 4# Compute the edit distances.for (my $t=1; $t <= $TLEN ; ++$t) { for (my $p=1; $p <= $PLEN ; ++$p) { $D->[$t][$p] = # Choose whichever alternative has the least cost min3( # Text and pattern may or may not match at this

character ... substr($text, $t-1, 1) eq substr($pattern, $p-1, 1) ? $D->[$t-1][$p-1] # If match, no increase in edit distance! : $D->[$t-1][$p-1] + 1, # If the text is missing a character $D->[$t-1][$p] + 1, # If the pattern is missing a character $D->[$t][$p-1] + 1 ) }}

Page 42: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

Approximate string matching 3 of 4# Print D, the resulting edit distance arrayfor (my $p=0; $p <= $PLEN ; ++$p) { for (my $t=0; $t <= $TLEN ; ++$t) { print $D->[$t][$p], " "; } print "\n";}

# Find the best match(es).# The edit distances appear in the last row.my @matches = ();my $bestscore = 10000000;for (my $t=1 ; $t <= $TLEN ; ++$t) { if( $D->[$t][$PLEN] < $bestscore) { $bestscore = $D->[$t][$PLEN]; @matches = ($t); }elsif( $D->[$t][$PLEN] == $bestscore) { push(@matches, $t); }}

Page 43: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

Approximate string matching 4 of 4

# Report the best match(es).print "\nThe best match for the pattern $pattern\n";print "has an edit distance of $bestscore\n";print "and appears in the text ending at location";print "s" if ( @matches > 1);print " @matches\n";

Page 44: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

Approximate string matching: Output

PATTERN:EIQADEVRLTEXT:SVLQDRSMPHQEILAADEVLQESEMRQQDMISHDE0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 0 2 2 2 2 2 2 2 2 2 2 2 2 1 0 1 2 2 2 1 1 2 2 1 1 1 1 2 2 2 2 2 1 2 2 2 1 3 3 3 3 2 3 3 3 3 3 3 2 2 1 1 2 3 3 2 2 2 2 2 2 2 2 2 2 2 3 3 2 2 3 3 2 4 4 4 4 3 3 4 4 4 4 4 3 3 2 2 1 2 3 3 3 3 3 3 3 3 3 3 3 3 3 4 3 3 3 4 3 5 5 5 5 4 3 4 5 5 5 5 4 4 3 3 2 2 2 3 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 3 4 6 6 6 6 5 4 4 5 6 6 6 5 4 4 4 3 3 3 2 3 4 5 4 5 4 5 5 5 5 4 4 5 5 5 4 3 7 7 6 7 6 5 5 5 6 7 7 6 5 5 5 4 4 4 3 2 3 4 5 5 5 5 6 6 6 5 5 5 6 6 5 4 8 8 7 7 7 6 5 6 6 7 8 7 6 6 6 5 5 5 4 3 3 4 5 6 6 6 5 6 7 6 6 6 6 7 6 5 9 9 8 7 8 7 6 6 7 7 8 8 7 7 6 6 6 6 5 4 3 4 5 6 7 7 6 6 7 7 7 7 7 7 7 6

The best match for the pattern EIQADEVRLhas an edit distance of 3and appears in the text ending at location 20

Page 45: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

BioperlBioperl is a collection of Perl modules for

bioinformatics. It has been developed by a large group of volunteers.

Bioperl has become a fairly stable platform with which to develop software; it is growing and changing rapidly, but its core is relatively mature.

Bioperl is a must for anyone interested in writing bioinformatics software in Perl

Don't reinvent the wheel! (Unless you like to, for the fun or interest of it.)

Page 46: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

What can Bioperl do?

Sequence manipulation (DNA, RNA, proteins)

Several kinds of sequence objects Sequence format conversion Translation Revcom Sequence statistics (mol. weights, etc.) Restriction enzyme maps Signal cleavage sites

Page 47: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

What can Bioperl do?

Sequence alignment

Interfaces with many alignment programs such as fasta, clustalw, gcg

Extract subalignment, consensus string Alignment statistics

Page 48: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

What can Bioperl do?

Biological databases

Retrieving Genbank, EMBL, acedb, etc. Retrieving from remote databases Indexing and retrieving from local databases Parsing program reports from Blast and

variants, HMMer, genefinding programs

Page 49: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

What can Bioperl do?

Sequence annotation

Adding annotation Parsing annotation from sequence databases

Page 50: Why Life is Beautiful James Tisdall

04/21/23

JOIN2004 Universidade do MinhoJOIN2004 Universidade do Minho

What can Bioperl do?

Protein structure

Read in PDB files Extract chains and their residues