Perl for Bioinformatics, 140.636
F. Pineda

1) Please fill out the questionnaire while waiting for class to start
2) If you did not get permission from the instructor to register for credit, please talk to me after class... it could end badly...


Course mechanics

- The course website is authoritative
  - http://www.pinedalab.org/perl
  - Check the schedule regularly for updates
  - Lecture notes, homework, etc. will all be online
- Homework
  - Read and understand the homework policies on the course website
  - Homework submission consists of:
    - HTML page: description of code and results
    - Code
  - Homework is NOT accepted on paper or via email
- Today you must fill out the questionnaire
  - Needed to get a user account
  - Only registered students get user accounts
  - Mark Miller for system issues (not help with class)

This afternoon

- Mark Miller will get you oriented on the cluster and will give out passwords

Historical background

- Perl (Practical Extraction & Report Language) was originally developed by Larry Wall as a system administration tool. His tasks were 90% text manipulation and 10% everything else. Version 1.0 was released in 1987.

- Linux (open source Unix) was originally developed by Linus Torvalds (CS graduate student in Finland). The first kernel was released in 1991. Allegedly done for the fun of it.

- Both released under GNU “open source” license

Some typical bioinformatics tasks

- I have a huge output file from a BLAST search. What are all the Mus musculus matches with high scores?

- I want to digest (with trypsin) each of the ~900 proteins in a FASTA file and calculate the mass of each fragment. Oh, I also want the results in a format that I can feed to the MASCOT proteomics search engine.

- What’s the most common 6-mer consisting of only A’s and T’s in the P. falciparum genome (14 chromosomes and about 22.9 million nucleotides)?

- I want to automatically update the database on our local BLAST server every weekend with the latest sequences deposited at NCBI.

- Search for potential RNA hairpins in the 14 chromosomes of the Plasmodium falciparum genome.

The common characteristic of these tasks is that they involve ~90% text manipulation and ~10% everything else.

A quickie task

Question:

What 6-mer containing only A’s and T’s occurs most commonly in the P. falciparum genome?

Answer:

Scan through the roughly 22.9 million nucleotides of the 14 chromosomes using a 6-nt window and find that ATATAT occurs most frequently.

    ---------
    A: 40.31 %
    T: 40.30 %
    C:  9.70 %
    G:  9.69 %
    total(ATCG only ) = 22853497 nucleotides
    total(everything) = 22853764 nucleotides
    ---------
    1       ATATAT  570577
    2       TATATA  529826
    3       AAAAAA  420232
    4       TTTTTT  417439
    5       TATTAT  147506
    6       ATAATA  146948
    7       TATATT  146711
    8       AATATA  146695

    -- snip --

    57      ATTTAA  50617
    58      ATTAAT  49856
    59      TTAAAT  49731
    60      TAATTA  45781
    61      AATTAA  45278
    62      TTAATT  45212
    63      AATTTA  45101
    64      TAAATT  44553

Runs in a minute or so on a slow laptop
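A minimal sketch of the 6-nt window scan described in the answer, assuming the sequence is already held in one string. The subroutine name count_at_kmers and the toy sequence are made up for illustration; this is not the script that produced the output on this slide, and a real run would first read the chromosomes from FASTA files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Slide a window of length $k along the sequence, one nucleotide at a
# time, and count every window that consists only of A's and T's.
sub count_at_kmers {
    my ($seq, $k) = @_;
    my %count;
    $seq = uc $seq;
    for my $i (0 .. length($seq) - $k) {
        my $kmer = substr($seq, $i, $k);
        $count{$kmer}++ if $kmer =~ /^[AT]+$/;
    }
    return %count;
}

# Toy sequence standing in for a chromosome (hypothetical data).
my %count = count_at_kmers("ATATATATGCTTTTTTAT", 6);

# Print the 6-mers from most to least frequent (ties in alphabetical order).
for my $kmer (sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count) {
    print "$kmer\t$count{$kmer}\n";
}
```

A hash keyed by the 6-mer keeps the counting to one pass over the sequence, which is one plausible reason a scan like this can finish quickly even on a small machine.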

A bigger task: “middleware” for a “pipeline”

- Applications for processing raw data from DNA sequencers
  1. A trace editor to analyze, display, and allow biologists to edit the short DNA read chromatograms from DNA-sequencing machines.
  2. A read assembler to find overlaps between the reads and assemble them in long contiguous sections.
  3. An assembly editor to view the assemblies and make changes in places where the assembler went wrong.
  4. A database to keep track of everything.

- Write a set of scripts that runs the applications sequentially and converts the output format of one application into the input format of the subsequent application.

See e.g. L. Stein, “How Perl Saved the Human Genome Project”

MBL/TIGR Assembly Pipeline
From Advanced Genomics and Bioinformatics course

F. Pineda, 10/31/2002

[Figure: flowchart of the assembly pipeline. The legend distinguishes files, applications, and format conversions. Labels in the chart: Lab, sample DNA, PUC19, PUC19.splice, ABI 3700 sequencer, PreTA (Lucy, Phred), phd2fasta, RunTA / TIGR Assembler, ta2ace, Consed, ace2contig, Traceviewer, RepeatFinder, Bambus, Closure, a “Closed?” yes/no decision, Finished sequence, and “To database for annotation”. File types shown: .chr, .x, .y, .qual, .seq, .asm, .ace, .contig, .fasta, .repeats, .stats, .details, .dot, .mates.]

Step 1: Got perl?

- Unix, Linux and MacOS systems come with a Perl interpreter.
  - Open a shell window for entering command lines
  - Find the perl interpreter by typing “whereis perl” (remember this)
  - To find out which version of perl you have, type “perl -v”

- If you insist on working on a Windows PC, you have two choices (neither of which we will support):
  - Cygwin (a “linux-like” environment with a limited set of tools)
    - http://www.cygwin.com/
  - ActiveState Perl (a native perl interpreter)
    - http://www.activestate.com/activeperl/

Invoking the perl interpreter

- Perl is an application that processes perl statements.
- Feeding perl a one-line command:

    perl -e 'print "hello world\n"'

- Feeding perl a sequence of commands:
  - Step 1. Invoke perl
  - Step 2. Write the statement(s)
  - Step 3. Tell perl you are finished writing statements (control-D); this causes perl to interpret the statements
- Example: use a print statement to print a haiku:

    an old pond
    the sound of a frog jumping
    into water
        - Basho

    print "an old pond\nthe sound of a frog jumping\ninto water\n"

Perl application = Compiler + Virtual Machine

- The perl interpreter is an application consisting of a compiler and a “virtual machine.”
- The perl interpreter accepts input consisting of perl instructions and executes them in a two-step process:
  - A compiler converts source code into instructions for a “virtual machine”
    - Bottom-up parser
    - Top-down optimizer and peephole optimizer
    - Generates opcodes (instructions)
    - Code is compiled each time it is executed; no binaries are produced
  - The Perl “virtual machine” executes the instructions
    - Executes the opcodes
    - Emulates a stack-based computer (like an HP calculator or Forth)
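One way to see the two-step process for yourself is a BEGIN block, which perl runs during compilation, before any ordinary statement is executed. The sketch below is only an illustration; the array name @order is made up for this example.

```perl
use strict;
use warnings;

our @order;    # records the order in which things happen

# An ordinary statement: executed in the run phase, after the whole
# file has been compiled.
push @order, "ordinary statement (run phase)";

# A BEGIN block is executed as soon as the compiler has parsed it,
# even though it appears later in the file.
BEGIN { push @order, "BEGIN block (compile phase)"; }

print "$_\n" for @order;
```

The BEGIN entry is recorded first even though it appears second in the source, because the whole file is compiled before the first ordinary statement runs.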

Step 2: Got editor?

- The more normal way of using perl is to put perl statements into a text file and then tell perl to process the text file.

- Create programs using a text editor, which allows you to create plain text (ASCII) documents.

- On UNIX and Linux computers use a powerful text editor like vi, vim, or emacs. If you are working in a GUI environment, you can use gedit. Alternatively, if you are in a command-line environment, you can use a basic text editor like pico or nano.

- On a Mac, I highly recommend TextWrangler
  - http://www.barebones.com

- Under Microsoft Windows, use notepad or editplus
  - http://www.editplus.com

- Do not use Word, WordPerfect, LaTeX or any other word processor for writing your programs. Word processors embed formatting commands into the text.

- Both TextWrangler and editplus will allow you to edit files on the teaching server (recommended).

1st perl program: Hello World

    # my first perl program
    print("hello world\n");

- Text between ‘#’ and the end of the line is a comment
- print() is a function that prints the text string
- Double quotes delimit a string
- The two characters \n are not interpreted literally. Instead they represent a newline.
- Run the program by invoking the perl interpreter with helloworld.pl as an argument:

    bash> perl helloworld.pl

Stand-alone perl script

    #!/usr/bin/perl
    # my first perl script
    print("hello world\n");

- Three things to remember
  - Add a special first line that will tell bash where to find perl (the #! line above)
  - Tell bash it has permission to execute this script (just once)

    bash> chmod u+x helloperl.pl

  - Invoke the script

    bash> ./helloperl.pl

Character data, Sequences & codes

- 99% of what you will do with perl is to manipulate characters and text files, so you might as well get intimately familiar with character data and how it is represented (coded) on computers.

- Many important biological compounds are polymers. These are compounds of usually high molecular weight consisting of up to millions of repeated linked units, each a relatively light and simple molecule. Different single-letter codes are used to represent the simple molecules in different classes of compounds (e.g. DNA, RNA and proteins).

Biological sequences

Representing information contained in biological polymers

[Figure: DNA, RNA and protein (RNA polymerase II transcribing DNA into RNA)]

The fundamental “dogma”

[Figure: DNA is copied to DNA (replication), DNA to RNA (transcription), and RNA to proteins (translation). Added after the discovery of retroviruses: RNA back to DNA (reverse transcription).]

DNA

- Double-stranded molecule with a spiral arrangement
- Nucleotides pair by hydrogen bonding
- Purines (A, G) bind to pyrimidines (T, C)
  - A pairs to T with 2 hydrogen bonds
  - G pairs to C with 3 hydrogen bonds
- The strands run opposite to each other (antiparallel)
- 4-letter alphabet: A, T, G, C
- A human genome is about 3 billion base pairs long, spread across 23 chromosomes
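Because the two strands are complementary and antiparallel, the sequence of one strand determines the other: reverse it and swap A with T and G with C. A minimal sketch, assuming the sequence contains only the four letters A, T, G, C (upper or lower case); the subroutine name reverse_complement is chosen for this example.

```perl
use strict;
use warnings;

# Return the opposite strand, read 5' to 3'.
sub reverse_complement {
    my ($dna) = @_;
    my $rc = reverse $dna;            # antiparallel: read in the opposite direction
    $rc =~ tr/ACGTacgt/TGCAtgca/;     # complementary pairing: A<->T, G<->C
    return $rc;
}

print reverse_complement("ATGCCG"), "\n";    # prints CGGCAT
```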

RNA

- Single-stranded molecule with complex secondary structure
- Nucleotides pair by hydrogen bonding
- Purines bind to pyrimidines
  - A pairs to U with 2 hydrogen bonds
  - G pairs to C with 3 hydrogen bonds
  - G pairs with U (a wobble pair)
- 4-letter alphabet: A, U, G, C

Proteins

20-letter amino acid alphabet of proteins

Summary of the important polymers & their “alphabets”

- DNA
  - Double stranded (helical)
  - Deoxyribose-phosphate backbone
  - 4-letter alphabet (A, T, C, G) representing nucleotides
  - Complementary base pairing (A-T), (C-G)

- RNA
  - Single stranded (messenger RNA)
  - Ribose-phosphate backbone
  - 4-letter alphabet (A, U, C, G)
  - Complementary base pairing (A-U), (C-G), (U-G)

- Proteins
  - Single stranded (complex folding structure)
  - 20-letter alphabet (representing amino acid residues)
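Since each polymer has its own alphabet, a pattern match can make a rough guess at what kind of sequence a string holds. This is only a sketch: the subroutine name guess_polymer is made up, the protein pattern uses the 20 standard one-letter amino acid codes, ambiguity codes such as N are not handled, and a short protein made up only of A, C, G and T would be reported as DNA.

```perl
use strict;
use warnings;

# Guess the polymer type from the letters used in the sequence.
# The tests run from the smallest alphabet to the largest.
sub guess_polymer {
    my ($seq) = @_;
    $seq = uc $seq;
    return "DNA"     if $seq =~ /^[ACGT]+$/;
    return "RNA"     if $seq =~ /^[ACGU]+$/;
    return "protein" if $seq =~ /^[ACDEFGHIKLMNPQRSTVWY]+$/;
    return "unknown";
}

for my $seq ("ATGCCG", "AUGCCG", "MKVLA", "XYZ123") {
    print "$seq\t", guess_polymer($seq), "\n";
}
```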

Notational and directional conventions

[Figure: a “gene” on double-stranded DNA, with a promoter and flanking UTRs. The sense (+) strand is drawn 5’ to 3’ and the anti-sense (-) strand (the template) 3’ to 5’; upstream and downstream directions are marked. RNA polymerase transcribes the DNA into RNA (5’ to 3’), and the ribosome translates the RNA into protein, from the N-terminal (amino end) to the C-terminal (carboxyl end).]

Character sequences in computing

How are characters represented on computers?

‘a’ = 0 1 1 0 0 0 0 1

8 bits = 1 byte = 1 character

Up to 2^8 = 256 characters can be represented with one byte (standard ASCII defines the first 128).

The map from bit patterns to characters is a convention: ASCII
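Perl can show this mapping directly: the built-in ord() gives the numeric code of a character, chr() goes the other way, and sprintf with the %b format prints a number in binary. A small sketch; the variable names are chosen for this example.

```perl
use strict;
use warnings;

my $char = 'a';
my $code = ord($char);                 # numeric ASCII code of 'a' (97)
my $bits = sprintf("%08b", $code);     # the same number as 8 binary digits

print "'$char' = $code = $bits\n";     # prints 'a' = 97 = 01100001

# Going the other way: from a code to its character.
my $letter = chr(65);
print "65 = '$letter'\n";              # prints 65 = 'A'
```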

Mega, Giga, Tera, Peta and all that

- kilobyte: 10^3 bytes
- megabyte: 10^6 bytes
- gigabyte: 10^9 bytes
- terabyte: 10^12 bytes
- petabyte: 10^15 bytes
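These units give a quick feel for how big sequence files are. The sketch below assumes one byte per nucleotide letter and ignores header lines and newlines; the names %bytes_in and size_in are made up for this example, and the two sequence lengths are the ones quoted elsewhere in these slides.

```perl
use strict;
use warnings;

# Decimal (power-of-ten) units, as listed on this slide.
my %bytes_in = (
    kilobyte => 1e3,
    megabyte => 1e6,
    gigabyte => 1e9,
    terabyte => 1e12,
    petabyte => 1e15,
);

# Convert a character count to the given unit, assuming 1 byte per character.
sub size_in {
    my ($n_chars, $unit) = @_;
    return $n_chars / $bytes_in{$unit};
}

printf "human genome:  about %.1f gigabytes\n", size_in(3e9, 'gigabyte');
printf "P. falciparum: about %.1f megabytes\n", size_in(22_853_764, 'megabyte');
```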

ASCII codes

    Decimal  Octal  Hex  Binary    Value
    -------  -----  ---  ------    -----
    048      060    030  00110000  0
    049      061    031  00110001  1
    050      062    032  00110010  2
    051      063    033  00110011  3
    052      064    034  00110100  4
    053      065    035  00110101  5
    054      066    036  00110110  6
    055      067    037  00110111  7
    056      070    038  00111000  8
    057      071    039  00111001  9
    058      072    03A  00111010  : (colon)
    059      073    03B  00111011  ; (semi-colon)
    060      074    03C  00111100  < (less than)
    061      075    03D  00111101  = (equal sign)
    062      076    03E  00111110  > (greater than)
    063      077    03F  00111111  ? (question mark)
    064      100    040  01000000  @ (AT symbol)
    065      101    041  01000001  A
    066      102    042  01000010  B
    067      103    043  01000011  C
    068      104    044  01000100  D
    069      105    045  01000101  E
    070      106    046  01000110  F
    071      107    047  01000111  G
    072      110    048  01001000  H
    073      111    049  01001001  I
    074      112    04A  01001010  J
    075      113    04B  01001011  K
    076      114    04C  01001100  L
    077      115    04D  01001101  M
    078      116    04E  01001110  N
    079      117    04F  01001111  O

ASCII codes

    Decimal  Octal  Hex  Binary    Value
    -------  -----  ---  ------    -----
    000      000    000  00000000  NUL (Null char.)
    001      001    001  00000001  SOH (Start of Header)
    002      002    002  00000010  STX (Start of Text)
    003      003    003  00000011  ETX (End of Text)
    004      004    004  00000100  EOT (End of Transmission)
    005      005    005  00000101  ENQ (Enquiry)
    006      006    006  00000110  ACK (Acknowledgment)
    007      007    007  00000111  BEL (Bell)
    008      010    008  00001000  BS (Backspace)
    009      011    009  00001001  HT (Horizontal Tab)
    010      012    00A  00001010  LF (Line Feed)
    011      013    00B  00001011  VT (Vertical Tab)
    012      014    00C  00001100  FF (Form Feed)
    013      015    00D  00001101  CR (Carriage Return)
    014      016    00E  00001110  SO (Shift Out)
    015      017    00F  00001111  SI (Shift In)
    016      020    010  00010000  DLE (Data Link Escape)
    017      021    011  00010001  DC1 (XON) (Device Control 1)
    018      022    012  00010010  DC2 (Device Control 2)
    019      023    013  00010011  DC3 (XOFF) (Device Control 3)
    020      024    014  00010100  DC4 (Device Control 4)
    021      025    015  00010101  NAK (Negative Acknowledgement)
    022      026    016  00010110  SYN (Synchronous Idle)
    023      027    017  00010111  ETB (End of Trans. Block)
    024      030    018  00011000  CAN (Cancel)
    025      031    019  00011001  EM (End of Medium)
    026      032    01A  00011010  SUB (Substitute)
    027      033    01B  00011011  ESC (Escape)

Newline and end-of-line conventions

- How is newline ‘\n’ represented in ASCII? It depends!
  - DOS: carriage return (ASCII 13) AND linefeed (ASCII 10)
  - Macintosh (classic Mac OS): carriage return only (ASCII 13)
  - UNIX: linefeed only (ASCII 10)

- Moving text files across platforms must be done carefully!
  - It will cause mysterious and unexpected problems unless you “fix” the end-of-lines

Fixing end-of-line conventions

- DOS to UNIX (delete the carriage return)

    bash> tr -d '\r' < input.txt > output.txt

- MAC to UNIX (change carriage return to linefeed)

    bash> tr '\r' '\n' < input.txt > output.txt

- UNIX to MAC (change linefeed to carriage return)

    bash> tr '\n' '\r' < input.txt > output.txt

- MAC to UNIX (using perl)

    bash> perl -pi -e 's/\r/\n/g' input.txt

You will understand these commands in a few weeks; for now just remember the magical incantations. For more details: http://www.answers.com/topic/newline
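The same fix can be written as a small Perl subroutine that accepts any of the three conventions and always returns UNIX linefeeds. This is a sketch only; the name to_unix_newlines is made up for this example, and it works on text already held in a string rather than on a file.

```perl
use strict;
use warnings;

# Convert DOS (CR+LF) and old Macintosh (CR only) line endings to
# UNIX linefeeds. Text that already uses linefeeds is left unchanged.
sub to_unix_newlines {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;    # a CR, optionally followed by LF, becomes one LF
    return $text;
}

my $dos  = "line one\r\nline two\r\n";
my $mac  = "line one\rline two\r";
my $unix = "line one\nline two\n";

for my $text ($dos, $mac, $unix) {
    print to_unix_newlines($text) eq $unix ? "ok\n" : "not ok\n";
}
```

Matching the carriage return together with an optional linefeed in a single pattern avoids turning a DOS line ending into two newlines, which is what a plain s/\r/\n/g would do to a DOS file.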