Welcome to lecture 2: Feeling at home in *nix IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of Chemistry & Biochemistry, UCLA


Welcome to lecture 2: Feeling at home in *nix

IGERT – Sponsored Bioinformatics Workshop Series

Michael Janis and Max Kopelevich, Ph.D.

Dept. of Chemistry & Biochemistry, UCLA


Last time…

• We covered a bit of material…

• Try to keep up with the reading – it’s all in there!

• How’s it coming along?

– BioKnoppix

– Remote logins, navigation

– Unix / linux concepts?

– General questions?


The CLI and YOU

Most of bioinformatics is accomplished through command-line tools

• Command line interaction is easily batched

• Command line interaction is easily integrated

• Command line interaction is a form of PROGRAMMING

• It’s therefore worthwhile to become familiar with your *nix environment in a non-graphical interface


Commands

• In Bioinformatics, we are mostly concerned with TEXT PROCESSING – the CLI is well suited for this type of work

• Specific commands are used to perform functions in the shell

• Each command is itself a program and takes command line arguments

– The syntax order is program [-options] filename

• For help on a specific command type:

man command; apropos topic; command --help


Some review of system tools

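• who (who is logged in)

• w (who is logged in and what they are doing)

• uname (system information)

• pwd (print the current working directory)

• find (search for files)

• top (live view of running processes)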


Another example of a pipe

[Diagram: file → Command 1 (cut) → Pipe → Command 2 (sort) → Stdout]

cut -d: -f1 < /etc/passwd | sort

• The file /etc/passwd stores information about user accounts on the system

• Let’s get a sorted listing of all user names
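The same pipeline can be tried on a small stand-in file without touching the real /etc/passwd. This is only a sketch: the file contents and the three user names are made up for illustration.

```shell
# Build a tiny passwd-style file (hypothetical users, colon-separated fields).
tmpdir=$(mktemp -d)
printf '%s\n' \
  'mary:x:1002:1002:Mary:/home/mary:/bin/bash' \
  'john:x:1001:1001:John:/home/john:/bin/bash' \
  'adam:x:1003:1003:Adam:/home/adam:/bin/bash' > "$tmpdir/passwd_sample"

# cut keeps field 1 (-f1) using ':' as the delimiter (-d:);
# the pipe hands cut's output to sort, which orders the names.
sorted_users=$(cut -d: -f1 < "$tmpdir/passwd_sample" | sort)
echo "$sorted_users"

rm -r "$tmpdir"
```

Each stage only sees text on its standard input, which is why the two commands can be swapped for others without changing the rest of the line.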


Example: redirecting STDOUT

[Diagram: STDIN → Command or Program → STDOUT → OUTPUT_FILE]

cut -d: -f1 < /etc/passwd | sort > output_file

more output_file

“>” is the “redirection operator”


Process Control

• Each specific job / command is called a process

• Each process runs in a shell

– BEFORE: prompt available

– DURING: prompt NOT available

– AFTER: prompt available

• Control keys

– CTRL-C -> stop current command

– CTRL-D -> end of input


Two Ways to monitor Processes

• “top”

– Lists all jobs

– Uses a table format

– Dynamically changes

• “ps”

– man ps

– static content

– Command options


What are you doing, Dave?


Background / Foreground

• Commands running in foreground prevent prompt from being used until command completes

• Commands can also run in BACKGROUND

• “Backgrounded” commands DO NOT AFFECT the prompt


Two Ways to Background jobs

• “&”

– Running a command with “&” automatically sends it to the background

– Backgrounded commands return the prompt

• “bg”

– Once a command is run from the prompt

– Stop the command (CTRL-Z)

– Then background it

• Starts the command again

• Returns the prompt for use
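Both routes can be sketched briefly. The "&" route runs as-is in a script; the "bg" route depends on interactive job control, so it is shown only as comments. The sleep command stands in for any long-running job.

```shell
# Start a command in the background with '&'; the shell continues immediately.
sleep 1 &
bg_pid=$!            # $! holds the process ID of the most recent background job
echo "started background job with PID $bg_pid"

# 'jobs' lists this shell's background jobs; 'wait' blocks until the job ends.
jobs
wait "$bg_pid"
bg_status=$?
echo "background job finished with exit status $bg_status"

# Interactive equivalent of the "bg" route (typed at a prompt, not in a script):
#   sleep 60      # runs in the foreground, prompt not available
#   CTRL-Z        # stop (suspend) the command
#   bg            # restart it in the background; the prompt returns
```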


File System Navigation

• Absolute filepaths begin with the root ‘/’

• Relative filepaths don’t have a preceding slash; they begin from the cwd

• What is the absolute path to cd from john to mary?

• What is the relative path to cd from john to mary?

• Once you are in mary, and your username is john, what are two ways to return to your home directory?


The society for anti-defamation of computer mouses opposes this slide

• There’s very little reason to leave the CLI

• Most tasks can be written within the shell

• The user-friendliness becomes self-limiting


Let’s take an example…

• Suppose you wanted to do some biological analysis – like motif searching through a database of biological sequences… What do you need to do this?

– You need to retrieve the sequences

– You need to describe the motif

– You need to search the sequences


I want to search for zinc-finger motifs genomically in yeast (S.c.)

• I’m going to need the genomic sequence for Saccharomyces cerevisiae (http://www.yeastgenome.org)

• I’m going to need the motif that describes the zinc finger I’d like to search for (ProSite).

• I’m going to need to do this search many times across every chromosome.


A brief overview of some databases / biological information repositories

• NCBI

• Genome-specific databases (SGD…)

• SMD http://genome-www5.stanford.edu/
The Stanford Microarray Database. Repository of microarray analysis from a wide variety.

• PROSITE http://au.expasy.org/prosite/
Used to rapidly search your protein sequences for catalogued motifs.

• SWISSPROT http://www.ebi.ac.uk/swissprot/
SWISSPROT is a "one stop shop" for protein sequence information. Use it to extend your knowledge of your proteins.

• PDB: The Protein Databank http://www.rcsb.org/pdb/
The Protein Data Bank is the single worldwide archive of structural data of biological macromolecules. Structure implies function in general.

• PFAM: http://www.sanger.ac.uk/Software/Pfam/search.shtml
This database is a collection of protein motifs.

• PRODOM http://protein.toulouse.inra.fr/prodom/current/html/home.php
PRODOM is similar to PFAM in that it is a set of curated protein domain families. However, the underlying computational engine is different.

• BLOCKS http://blocks.fhcrc.org/
Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. The blocks for the Blocks Database are made automatically by looking for the most highly conserved regions in groups of proteins documented in InterPro.

• COG http://www.ncbi.nlm.nih.gov/COG/
COG stands for Clusters of Orthologous Groups of proteins. This is a tool for phylogenetic classification of proteins encoded in complete genomes. COGs were delineated by comparing protein sequences encoded in complete genomes, representing major phylogenetic lineages.


Retrieving data


Retrieving data

• You don’t have to leave the CLI. Really.

– If you need to do something, chances are there’s a utility to do so

– Debian is your friend (search packages FIRST!!!)

Introducing wget (the URL is quoted so the shell leaves the * for wget to expand on the server):

>wget 'ftp://genome-ftp.stanford.edu/pub/yeast/data_download/protein_info/hypothetical_peptides/*.gz'

Of course you can use ftp:

>ftp genome-ftp.stanford.edu

- login anonymous; use your email address as passwd

- traverse filesystem like any linux CLI

- bin, get, prompt, mget…


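A note about file archives

• Most files will be compressed, usually with gzip (gunzip uncompresses them).

• Most files will be bundled into a single archive using tar.

Introducing gunzip:

>gunzip *.gz

Introducing tar (tape archive); tar extracts one archive per call, so name the archive rather than using *.tar:

>tar -xvf archive.tar

Or to create a tar

>tar -cvf output.tar *.*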
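A round trip through tar and gzip can be sketched on throwaway files. The file names a.fsa, b.fsa and bundle.tar are invented for this example, and the -C option (change directory before acting) is assumed to be available, as it is in GNU tar and bsdtar.

```shell
# Work in a scratch directory with two small, made-up FASTA files.
workdir=$(mktemp -d)
printf '>seq1\nACGT\n' > "$workdir/a.fsa"
printf '>seq2\nTTGA\n' > "$workdir/b.fsa"

# Bundle both files into one archive (-c create, -f archive name),
# then compress it; gzip replaces bundle.tar with bundle.tar.gz.
tar -cf "$workdir/bundle.tar" -C "$workdir" a.fsa b.fsa
gzip "$workdir/bundle.tar"

# Reverse the steps: gunzip restores bundle.tar, tar -x unpacks it.
mkdir "$workdir/unpacked"
gunzip "$workdir/bundle.tar.gz"
tar -xf "$workdir/bundle.tar" -C "$workdir/unpacked"

unpacked_files=$(ls "$workdir/unpacked")
echo "$unpacked_files"
rm -r "$workdir"
```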


A brief note about the biological file format called FASTA

• In bioinformatics, FASTA format is a file format used to exchange information between genetic sequence databases. Its format looks like this:

>SEQUENCE_1
;comment line 1 (optional)
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEGLVSVKVSDDF
TIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHKIPQFASRKQLSDAILKEAEE

• It consists of a header line (beginning with a '>') which gives a name and/or a unique identifier for the sequence. Many different sequence databases use FASTA files.

• After the header line and comments, one or more sequence lines may follow. Sequences may be protein sequences or DNA sequences

– each sequence line should be shorter than 80 characters and can contain gaps or alignment characters

• FASTA format files often have file extensions like .fa or .fsa

• The simple format of FASTA files makes them easy to manipulate using text processing tools and scripting languages like Perl.

*From http://en.wikipedia.org/wiki/Fasta_format
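Because every record starts with a '>' line, ordinary text tools are enough to summarise a FASTA file. A minimal sketch follows; the second record (SEQUENCE_2) is invented here so that there is more than one entry to count.

```shell
# Write a small two-record FASTA file to a temporary location.
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEGLVSVKVSDDF
TIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHKIPQFASRKQLSDAILKEAEE
>SEQUENCE_2
MSNKFHCDVCSADCTNRV
EOF

# Every record starts with '>', so counting those lines counts the sequences.
n_seqs=$(grep -c '^>' "$fasta")

# Strip the leading '>' to list just the identifiers.
seq_ids=$(grep '^>' "$fasta" | cut -c2-)

echo "$n_seqs sequences:"
echo "$seq_ids"
rm "$fasta"
```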


ProSite motif


Describing the motif - GREP

• “GREP” searches contents of a file or directory of files

– The name comes from the ed editor command g/re/p: globally search for a regular expression and print the matching lines

– File wildcards can be used like with ls

• grep 1sq ~/DATA/*.CEL -> array type used

– We explored this last time (briefly!)


Regular expressions

• A regular expression, often called a pattern, is an expression that describes a set of strings. They are usually used to give a concise description of a set, without having to list all elements.

– For example, the set containing the three strings Mike, Mark, and Matt can be described by the pattern "M(ike|a(rk|tt))"

– Alternatively, it is said that the pattern "M(ike|a(rk|tt))" matches each of the three strings.

– There are usually multiple different patterns describing any given set. Most formalisms provide the following operations to construct regular expressions.
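One way to check which strings a pattern describes is to run candidates through grep. The sketch below uses the pattern M(ike|a(rk|tt)), one of several possible patterns for the set Mike, Mark, Matt; the extra candidates Mary and M are added here as non-members.

```shell
# Candidate strings, one per line; the last two should NOT match.
names=$(printf '%s\n' Mike Mark Matt Mary M)

# -E enables extended regular expressions (|, parentheses);
# -x requires the whole line to match the pattern.
matched_names=$(echo "$names" | grep -E -x 'M(ike|a(rk|tt))')
echo "$matched_names"
```

Wrapping the whole alternation in an optional group, as in "M(ike|ark|att)?", would also accept the bare string M, so the set described would no longer be exactly three names.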


Formalisms of regular expressions

• alternation

– A vertical bar separates alternatives. For example, "gray|grey" matches grey or gray.

• grouping

– Parentheses are used to define the scope and precedence of the operators. For example, "gray|grey" and "gr(a|e)y" are different patterns, but they both describe the set containing gray and grey.

• quantification

– A quantifier after a character or group specifies how often that preceding expression is allowed to occur. The most common quantifiers are ?, *, and +:

– ?

• The question mark indicates that the preceding character may be present at most once. For example, "colou?r" matches color and colour.

– *

• The asterisk indicates that the preceding character may be present zero, one, or more times. For example, "0*42" matches 42, 042, 0042, etc.

– +

• The plus sign indicates that the preceding character must be present at least once. For example, "go+gle" matches the infinite set gogle, google, gooogle, etc. (but not ggle).

• These constructions can be combined to form arbitrarily complex expressions, very much like one can construct arithmetical expressions from the numbers and the operations +, -, * and /.

*From http://en.wikipedia.org/wiki/Regular_expression


The real world is fuzzy and complex…

• What if we just want to search for a string in the format of a phone number;

• E.g.

825 8901 (no area code)

213 487 0353 (area code)

• Obviously we can’t check for each possible phone number (some 10^10 possibilities make for a very long set of statements…).


This is where regular expressions come in…

• Regular expressions describe generalised patterns of strings instead of exact strings.

• (clearly this is a little more complex as an example…)

>grep -E '([0-9]{3} ){0,1}[0-9]{3} [0-9]{4}' filename
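The phone-number pattern can be tried on a few sample lines. This sketch assumes a grep that supports -E (extended regular expressions, needed for the {n,m} counts and the group) and -o (print only the matched text, available in GNU and BSD grep); the sample lines are made up.

```shell
# Four test lines: two contain phone-like strings, two do not.
phone_lines=$(printf '%s\n' \
  'call 825 8901 today' \
  'office: 213 487 0353' \
  'no number here' \
  '12 34 56')

# Optional three-digit area code, then three digits, a space, four digits.
phone_numbers=$(echo "$phone_lines" | grep -E -o '([0-9]{3} ){0,1}[0-9]{3} [0-9]{4}')
echo "$phone_numbers"
```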


Special characters(‘metacharacters’)

‘.’ is a wildcard and matches any single character

>grep '.ed' filename

If file contains “bed” - will find

If file contains “red” - will find

If file contains “head” - will not find

If file contains “edward” - will not find (no character precedes “ed”)
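The four cases can be checked directly by feeding the words to grep, one per line. Note that “edward” does not match '.ed': the dot must consume one character, and nothing precedes “ed” in that word.

```shell
# One candidate word per line; grep prints only the lines that match.
dot_hits=$(printf '%s\n' bed red head edward | grep '.ed')
echo "$dot_hits"
```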


Special characters(‘metacharacters’)

‘*’ means ‘zero or more of the previous character’.

>grep 'be*d' filename

If file contains “bed” - will find

If file contains “red” - will not find

If file contains “beeeed” - will find

If file contains “bd” - will find


Special characters(‘metacharacters’)

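‘+’ means ‘one or more of the previous character’. In grep it needs the -E option (extended regular expressions); without it, ‘+’ is taken literally.

>grep -E 'be+d' filename

If file contains “bed” - will find

If file contains “red” - will not find

If file contains “beeeed” - will find

If file contains “bd” - will not find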
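The difference between ‘*’ and ‘+’ shows up on the string “bd”. A short sketch, using -E for the ‘+’ quantifier:

```shell
words=$(printf '%s\n' bed red beeeed bd)

# '*' allows zero e's, so "bd" matches.
star_hits=$(echo "$words" | grep 'be*d')

# '+' needs at least one e, so "bd" is rejected.
plus_hits=$(echo "$words" | grep -E 'be+d')

echo "be*d matches:"; echo "$star_hits"
echo "be+d matches:"; echo "$plus_hits"
```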


Start and end of line

‘^’ designates the start of the line, ‘$’ the end.

>grep 'bed' filename

If file contains “bed” - will find

If file contains “bedbed” - will find

If file contains “xxxbedxxx” - will find

>grep '^bed$' filename

Only if file contains “bed” on a line by itself - will find

If file contains “bedbed” - will not find

If file contains “xxxbedxxx” - will not find
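The effect of the anchors can be seen by running both patterns over the same three lines:

```shell
lines=$(printf '%s\n' bed bedbed xxxbedxxx)

# Unanchored: every line contains "bed" somewhere, so all three lines count.
unanchored_count=$(echo "$lines" | grep -c 'bed')

# Anchored: only the line that is exactly "bed" matches.
anchored_hits=$(echo "$lines" | grep '^bed$')

echo "lines containing bed: $unanchored_count"
echo "lines that are exactly bed: $anchored_hits"
```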


Grouping with parentheses

Parentheses group characters

>grep -E '^(bed)+$' filename

If a line is “bed” - will find

If a line is “bedbed” - will find

If a line is “beddd” - will not find

(Without the ^ and $ anchors, “beddd” would also be found, because it contains “bed”.)


Character classes

• The square brackets are used to denote whole groups of characters

>grep '[brf]ed' filename

If file contains “bed” - will find

If file contains “red” - will find

If file contains “led” - will not find


Character classes (cont)

• A hyphen designates a range:

>grep '[a-z]ed' filename

If file contains “bed” - will find

If file contains “fed” - will find

If file contains “Bed” - will NOT find (why not?)


Character class shortcuts

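• Some character classes are so common there are in-built shortcuts:

– [0-9] = \d

– [A-Za-z0-9_] = \w

– [ \f\t\n\r] = \s

• These shortcuts come from Perl. GNU grep accepts \d only with the -P option; with grep -E, the POSIX classes [[:digit:]], [[:alnum:]] and [[:space:]] are the portable equivalents.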


Quantifying

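• Curly brackets quantify repeats more precisely than ‘*’ (0+) or ‘+’ (1+)

a{3,5} = three, four or five ‘a’s.

>grep -E 'la{3,5}d' filename

If file contains “laaaad” - will find

If file contains “laaaaaaad” - will not find

(The trailing d matters: 'la{3,5}' alone would still match the first five a’s of “laaaaaaad”.)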


Referencing

• A back-reference matches the substring previously matched by the nth parenthesized subexpression of the regular expression.

– The back-reference is denoted `\n', where n is a single digit

>grep -E '(a)\1' filename

If file contains “laaaad” - will find

If file contains “lad” - will not find
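Back-references can be tried on the two words above and on a short DNA string. This sketch assumes a grep that accepts back-references together with -E (GNU and BSD grep do; POSIX only guarantees them for basic regular expressions). The DNA strings are made up, and “palindrome” here means a textual palindrome of the form xyyx.

```shell
# '(a)\1' means: an 'a', followed by whatever group 1 matched (another 'a').
backref_hits=$(printf '%s\n' laaaad lad | grep -E '(a)\1')
echo "$backref_hits"

# Two captured characters followed by the same two in reverse order,
# i.e. a four-letter textual palindrome such as GTTG.
palindrome_hits=$(printf '%s\n' ACGTTGCA ACGTACGT | grep -E '(.)(.)\2\1')
echo "$palindrome_hits"
```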


Back to our ProSite motif…

• We can use regular expressions to describe the motif

– The motif is actually a REGULAR EXPRESSION!

>grep -n -E --color -B2 'C.{2}C.{4,8}[RHDGSCV][YWFMVIL].[CS].{2,5}[CHEQ].[DNSAGE][YFVLI].[LIVFM]C.{2}C' *.fsa

chr04.peptides.20040928.fsa-4202->Annotated|04:1356055:1357359| frame 1; YDR448W/ADA2; Verified; this gene contains 1 exon
chr04.peptides.20040928.fsa:4203:MSNKFHCDVCSADCTNRVRVSCAICPEYDLCVPCFSQGSYTGKHRPYHDYRIIETNSYPILCPDWGADEELQLIKGAQTL


Did it work?


Let’s try this…

• Download the genomic DNA sequence from SGD

• Search for any variant of the TATA-box promoter

– TATAAA

– TATAAT

– TATATT

– TAATAA

– TAATAT
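One possible approach to the exercise, sketched on made-up DNA lines rather than the real SGD download: list the five variants with alternation, or fold them into a more compact pattern. Both are assumptions about how one might write it, not the only answer. A motif that straddles a line break in a FASTA file would be missed by a line-based search like this.

```shell
# Four invented sequence lines; the third contains no TATA-box variant.
dna=$(printf '%s\n' GGCTATAAAGGC CCGTAATATCCA GGGCGCGCGCCC ATTATATTGCAA)

# Straightforward version: one alternative per variant.
tata_hits=$(echo "$dna" | grep -E 'TATAAA|TATAAT|TATATT|TAATAA|TAATAT')

# Compact version: shared prefixes factored out with grouping and a class.
compact_hits=$(echo "$dna" | grep -E 'TA(TA(AA|AT|TT)|ATA[AT])')

echo "$tata_hits"
```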


More more more

• Many MS tools allow for wildcard searching

• The shell allows variables; interpolation; control structures

– For example, attempt to find a palindrome of length 4 within genomic sequences (hint: use backreferences!)

– Variables allow for persistence and control structures

>myVar=`grep -n -E --color 'C.{2}C.{4,8}[RHDGSCV][YWFMVIL].[CS].{2,5}[CHEQ].[DNSAGE][YFVLI].[LIVFM]C.{2}C' *.fsa`

mako@subi:~$ echo $myVar
chr04.peptides.20040928.fsa:4203:MSNKFHCDVCSADCTNRVRVSCAICPEYDLCVPCFSQGSYTGKHRPYHDYRIIETNSYPILCPDWGADEELQLIKGAQTL


A better variable interpolation

• The variable is allowed to change

• We can set the variable to the Prosite Pattern

mako@subi:~$ myVar=C\.{2}C\.{4,8}[RHDGSCV][YWFMVIL]\.[CS]\.{2,5}[CHEQ]\.[DNSAGE][YFVLI]\.[LIVFM]C\.{2}C

mako@subi:~$ echo $myVar
C.{2}C.{4,8}[RHDGSCV][YWFMVIL].[CS].{2,5}[CHEQ].[DNSAGE][YFVLI].[LIVFM]C.{2}C

mako@subi:~$ grep -n -E --color $myVar *.fsa
chr04.peptides.20040928.fsa:4203:MSNKFHCDVCSADCTNRVRVSCAICPEYDLCVPCFSQGSYTGKHRPYHDYRIIETNSYPILCPDWGADEELQLIKGAQTL
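A variation on the same idea, sketched with quoting instead of backslashes: single quotes around the pattern when assigning, double quotes when using the variable. The variable name motif and the second, motif-free record are invented for this example; the first record reuses the header and sequence line shown in the search result above.

```shell
# A two-record stand-in for one of the .fsa files.
fsa=$(mktemp)
cat > "$fsa" <<'EOF'
>Annotated|04:1356055:1357359| frame 1; YDR448W/ADA2; Verified; this gene contains 1 exon
MSNKFHCDVCSADCTNRVRVSCAICPEYDLCVPCFSQGSYTGKHRPYHDYRIIETNSYPILCPDWGADEELQLIKGAQTL
>hypothetical_record_without_motif
MTEITAAMVKELRESTGAGM
EOF

# Single quotes stop the shell from interpreting any character in the pattern.
motif='C.{2}C.{4,8}[RHDGSCV][YWFMVIL].[CS].{2,5}[CHEQ].[DNSAGE][YFVLI].[LIVFM]C.{2}C'

# Double quotes pass the stored pattern to grep as a single, unchanged argument.
# -n prefixes each hit with its line number; cut keeps only that number.
motif_hit_lines=$(grep -n -E "$motif" "$fsa" | cut -d: -f1)
echo "motif found on line(s): $motif_hit_lines"
rm "$fsa"
```

Quoting "$motif" avoids the shell trying to expand the square brackets against file names in the current directory.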


Variables can be overwritten

• The variable is allowed to change

• We can set the variable to the Prosite Pattern

mako@subi:~$ function afun {
> for i in 1 2 3 4 5
> do
> echo $i
> echo $myVar
> done
> }
mako@subi:~$ afun
1
C.{2}C.{4,8}[RHDGSCV][YWFMVIL].[CS].{2,5}[CHEQ].[DNSAGE][YFVLI].[LIVFM]C.{2}C
2
C.{2}C.{4,8}[RHDGSCV][YWFMVIL].[CS].{2,5}[CHEQ].[DNSAGE][YFVLI].[LIVFM]C.{2}C
3
C.{2}C.{4,8}[RHDGSCV][YWFMVIL].[CS].{2,5}[CHEQ].[DNSAGE][YFVLI].[LIVFM]C.{2}C
4
C.{2}C.{4,8}[RHDGSCV][YWFMVIL].[CS].{2,5}[CHEQ].[DNSAGE][YFVLI].[LIVFM]C.{2}C
5
C.{2}C.{4,8}[RHDGSCV][YWFMVIL].[CS].{2,5}[CHEQ].[DNSAGE][YFVLI].[LIVFM]C.{2}C


Functions

• What if we wanted to search every ProSite pattern against our genomic database?

• We’d have to repeatedly do our search

– This is called a loop

– We have to write this so the computer knows exactly what to repeat, how many times to repeat, and where to find the next ProSite pattern to match

– We would store the what and where in VARIABLES

– We would utilize a CONTROL STRUCTURE to handle the how…
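The idea can be sketched as a loop that reads one pattern per line from a file and searches a sequence file with each. Everything named here is hypothetical: patterns.txt, db.fsa, the two toy patterns and the two toy sequences are stand-ins, not real ProSite entries.

```shell
workdir=$(mktemp -d)

# The "where": a file with one pattern per line (toy patterns for illustration).
printf '%s\n' 'C.{2}C' 'W{3}' > "$workdir/patterns.txt"

# The database to search (two invented records).
printf '>seqA\nMSNKFHCDVCSAD\n>seqB\nMTEITAAMVK\n' > "$workdir/db.fsa"

# The "how": a while loop reads each pattern into a variable and runs grep.
# grep -c prints the number of matching lines; '|| true' keeps a zero count
# (exit status 1) from being treated as an error.
report=$(while read -r pattern; do
  count=$(grep -c -E "$pattern" "$workdir/db.fsa" || true)
  printf '%s %s\n' "$pattern" "$count"
done < "$workdir/patterns.txt")

echo "$report"
rm -r "$workdir"
```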


Control structures

• All our programs so far have run from start to finish. Each line has been executed in turn.

• What if we only want to run some lines some of the time?

• This is where control structures come in.


Control structures

• Programming languages generally have a number of control structures.

• Basic structures:

– if

– while

– for & foreach

• There are others (e.g. unless)


‘for’ example

>afunction() {
for i in 1 2 3 4 5
do
echo "Looping ... number $i"
done
}


Variables can be interpolated

• The command is substituted from the system

• It’s like a pipe, but we are allowed to operate

mako@subi:~$ afun() {
> myvar=$(ls -1 *.fsa)
> for i in $myvar
> do
> echo $i
> done
> }
mako@subi:~$ afun
chr01.fsa
chr01.peptides.20040928.fsa
chr02.peptides.20040928.fsa
chr03.peptides.20040928.fsa
chr04.peptides.20040928.fsa
chr05.peptides.20040928.fsa
chr06.peptides.20040928.fsa
chr07.peptides.20040928.fsa
chr08.peptides.20040928.fsa
chr09.peptides.20040928.fsa
chr10.peptides.20040928.fsa
chr11.peptides.20040928.fsa
…


The ‘while’ control structure (combined with opening files)

• The ‘while’ control structure keeps looping while a given condition is satisfied

• ‘while’ and open files go together very well:

mako@subi:~$ afun() {
> while read f
> do
> echo $f
> done
> }
mako@subi:~$ afun < chrmt.peptides.20040928.fsa
>Notannotated|mt:385:459| frame 1
MNYILLLLLIKLLIIINMKLIKIL
…


Editors

• Shell programming is like a batch file

– Commands are linked together in a procedure

– The procedure is accessed via a file

• We need an editor that will allow us to construct that file

– We’ll use Emacs (or you can use vi, pico, …)

– Comprehensive, extensible working environment

– Complete (arguable!) IDE

– Integration

– Extensible (elisp)


Emacs

• Invoking Emacs is easy: emacs -nw filename

• In many cases, Emacs will work out the mode appropriate for your file (.cpp, .pl, etc…)

– The mode allows Emacs to become sensitive to the task

– There is a biomode for reverse complement, etc….

– You can write your own!

• Emacs has many tools

– Search, replace, cut, paste, mail…

– File navigation, ftp, remote shells…


The Emacs survival guide

• Notation

– Emacs uses the control key and escape key heavily. We write it like this:

• C-x Pronounced "Control-x"

– Hold down the Ctrl key (usually in the lower left corner of the keyboard) while pressing the x key.

– Both Ctrl and x must be down at the same time.

• M-x Pronounced "Meta-x"

– Press the Esc key (usually in the upper left corner of the keyboard), release it, then press the x key.

– Esc and x should not be down at the same time.

• So C-x C-f means hold down the control key, then type x and then f while holding it down. (This is the command to load a file into emacs).

• Typing

– Just type. All the regular keys, arrow keys, delete, backspace, and page up/down keys should work. Alternatively, you can try these commands:

C-f cursor forward, C-b cursor back, C-p previous line, C-n next line, M-v page up, C-v page down.

• Exiting

– Type C-x C-c. If you have any unsaved work, emacs will ask you if you want to save it. Type y.

• Other commands

– Most control or escape sequences are commands. Usually a prompt appears in the command line at the bottom of the window. Here are a few:

C-x C-f Load file, prompt for filename
C-x C-s Save file without exiting
C-x C-c Exit, prompt to save files
C-s Search forward, prompt for search string
C-r Search backward, prompt for search string
C-h ? Show help options, prompt for choice
C-h t Start emacs tutorial

• If you make a mistake or change your mind you can always escape:

– C-g Abandon command and resume typing


Command line editing

• Learning the keybindings can be difficult

– But it will increase your speed

– Faster than using a mouse

– Transferable! The Emacs keybindings for line editing are the default set of commands for command line editing in the Bash shell!


Let’s try it…

• Open up the file that we found contained the ProSite Motif

• Open a second window

• Go to the line that contains the motif (hint: use grep with -n!)

• Copy and paste that line into a new file

• Save and close that file


AWK is your pre-perl friend

• Use to print a subset of fields

• Default field delimiter is “ “ (white space)

• Useful for grabbing a subset of fields

• Useful for rearranging fields

field1 field2 field3 field4 . . .

$1 $2 $3 $4 . . . .


Using AWK

(each command below reads from a pipe, hence the leading “|”)

| awk -F" " '{print $1}'

| awk -F" " '{print $1" "$2}'

| awk -F" " '{print $1"\t"$2}'

\t = TAB

\n = newline
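A short sketch of awk picking and rearranging fields. The input record is invented for illustration; since whitespace is awk's default field separator, the -F option is left out here.

```shell
# A made-up whitespace-separated record with four fields.
row='chr04 4203 ADA2 Verified'

# Print only the first field.
first_field=$(echo "$row" | awk '{print $1}')

# Print field 3 and field 1, in that order, separated by a TAB.
swapped=$(echo "$row" | awk '{print $3 "\t" $1}')

echo "$first_field"
echo "$swapped"
```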


Overwrite versus Append

• > OVERWRITE – delete and replace

• >> APPEND – add to end of existing file
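The difference between the two operators is easy to see on a scratch file (created here with mktemp so nothing real is overwritten):

```shell
out=$(mktemp)

echo "first"  > "$out"     # file now holds: first
echo "second" > "$out"     # '>' replaces the contents: second
echo "third" >> "$out"     # '>>' adds a line at the end: second, third

final_contents=$(cat "$out")
echo "$final_contents"
rm "$out"
```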


Example: microarray data tracking

• grep 1sq ~/DATA/*.CEL (gives array info)

• grep 1sq ~/DATA/*.CEL | awk '{print $12}' (gives array type only)

• grep 1sq ~/DATA/*.CEL | awk '{print $12}' > arrayTypes.txt (store results in file)

• ls ~/DATA/*.DAT | wc (gives a count)