Welcome to lecture 2: Feeling at home in *nix



  • Welcome to lecture 2: Feeling at home in *nix
    IGERT Sponsored Bioinformatics Workshop Series
    Michael Janis and Max Kopelevich, Ph.D.
    Dept. of Chemistry & Biochemistry, UCLA

  • Last time
    We covered a bit of material.
    Try to keep up with the reading; it's all in there!
    How's it coming along?
    BioKnoppix
    Remote logins, navigation
    Unix / Linux concepts?
    General questions?

  • The CLI and YOU
    Most of bioinformatics is accomplished through command-line tools.
    Command-line interaction is easily batched.
    Command-line interaction is easily integrated.
    Command-line interaction is a form of PROGRAMMING.
    It's therefore worthwhile to become familiar with your *nix environment in a non-graphical interface.

  • Commands
    In bioinformatics, we are mostly concerned with TEXT PROCESSING; the CLI is well suited for this type of work.
    Specific commands are used to perform functions in the shell.
    Each command is itself a program and takes command-line arguments.
    The syntax order is: program [-options] filename
    For help on a specific command, type: man command; apropos topic; command --help

  • Some review of system tools
    who
    w
    uname
    pwd
    find
    top

  • Another example of a pipe
    The file /etc/passwd stores information about user accounts on the system.
    Let's get a sorted listing of all user names:
    cut -d: -f1 < /etc/passwd | sort
    Diagram: file -> Command 1 (cut) -> pipe -> Command 2 (sort) -> STDOUT

  • Example: redirecting STDOUT
    cut -d: -f1 < /etc/passwd | sort > output_file
    more output_file
    Here > is the redirection operator, and OUTPUT_FILE receives the sorted names instead of the screen.
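The pipe and the redirection can be tried without touching the real /etc/passwd. This is a minimal sketch, assuming a small made-up file called passwd_sample (a hypothetical name) in the same colon-delimited layout:

```shell
cd "$(mktemp -d)"   # work in a scratch directory

# Build a tiny stand-in for /etc/passwd (made-up accounts, for illustration only).
printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'mary:x:1001:1001::/home/mary:/bin/bash' \
  'john:x:1000:1000::/home/john:/bin/bash' > passwd_sample

# cut keeps field 1 of each colon-delimited line; sort orders the names;
# > sends sort's STDOUT into output_file instead of the terminal.
cut -d: -f1 < passwd_sample | sort > output_file

cat output_file
```

Nothing appears on the terminal from the pipeline itself, because STDOUT was redirected; cat shows what landed in the file.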

  • Process Control
    Each specific job / command is called a process.
    Each process runs in a shell.
    BEFORE: prompt available
    DURING: prompt NOT available
    AFTER: prompt available
    Control keys:
    CTRL-C -> stop current command
    CTRL-D -> end of input

  • Two ways to monitor processes
    top: lists all jobs; uses a table format; dynamically changes
    ps: static content; command options (see man ps)

  • What are you doing, Dave?

  • Background / Foreground
    Commands running in the foreground prevent the prompt from being used until the command completes.
    Commands can also run in the BACKGROUND.
    Backgrounded commands DO NOT AFFECT the prompt.

  • Two ways to background jobs
    &
    Running a command with & automatically sends it to the background.
    Backgrounded commands return the prompt.
    bg
    Once a command has been run from the prompt: stop the command (CTRL-Z), then background it with bg.
    bg starts the command again and returns the prompt for use.
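The & route can be sketched in a script. Here sleep stands in for any long-running command, and bg_pid and state are illustrative variable names, not anything from the slides:

```shell
# Start a placeholder long-running command in the background with &.
sleep 2 &
bg_pid=$!            # $! holds the PID of the most recent background job

# The shell did not wait for sleep, so we can keep working right away.
# kill -0 sends no signal; it only tests whether the process still exists.
if kill -0 "$bg_pid" 2>/dev/null; then
  state="running"
else
  state="finished"
fi
echo "job $bg_pid is $state while the prompt is free"

# wait blocks until the background job ends and returns its exit status.
wait "$bg_pid"
echo "job $bg_pid exited with status $?"
```

The bg route needs an interactive terminal (CTRL-Z, then bg), so it is not shown as a script.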

  • File System Navigation
    Absolute filepaths begin with the root /
    Relative filepaths don't have a preceding slash; they begin from the cwd.
    What is the absolute path to cd from john to mary?
    What is the relative path to cd from john to mary?
    Once you are in mary, and your username is john, what are two ways to return to your home directory?
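The difference between the two kinds of path can be seen in a throwaway directory tree. This is a sketch under assumptions: the layout (a sandbox containing home/john and home/mary) is made up for illustration and may not match the tree drawn on the slide.

```shell
# Hypothetical layout: <sandbox>/home/john and <sandbox>/home/mary.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/home/john" "$sandbox/home/mary"

cd "$sandbox/home/john"      # absolute path: spelled out from the root /
cd ../mary                   # relative path: interpreted from the cwd
relative_result=$(pwd -P)

cd "$sandbox/home/john"
cd "$sandbox/home/mary"      # same destination, spelled absolutely
absolute_result=$(pwd -P)

echo "$relative_result"
echo "$absolute_result"
```

Both cd commands end in the same directory; only the starting point of the path differs.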

  • The society for anti-defamation of computer mouses opposes this slide
    There's very little reason to leave the CLI.
    Most tasks can be written within the shell.
    The user-friendliness becomes self-limiting.

  • Let's take an example
    Suppose you wanted to do some biological analysis, like motif searching through a database of biological sequences.
    What do you need to do this?
    You need to retrieve the sequences.
    You need to describe the motif.
    You need to search the sequences.

  • I want to search for zinc-finger motifs genomically in yeast (S.c.)
    I'm going to need the genomic sequence for Saccharomyces cerevisiae (http://www.yeastgenome.org).
    I'm going to need the motif that describes the zinc finger I'd like to search for (ProSite).
    I'm going to need to do this search many times, across every chromosome.

  • A brief overview of some databases / biological information repositories
    NCBI
    Genome-specific databases (SGD)
    SMD http://genome-www5.stanford.edu/ The Stanford Microarray Database. Repository of microarray analysis from a wide variety.
    PROSITE http://au.expasy.org/prosite/ Used to rapidly search your protein sequences for catalogued motifs.
    SWISSPROT http://www.ebi.ac.uk/swissprot/ SWISSPROT is a "one stop shop" for protein sequence information. Use it to extend your knowledge of your proteins.
    PDB: The Protein Databank http://www.rcsb.org/pdb/ The Protein Data Bank is the single worldwide archive of structural data of biological macromolecules. Structure implies function in general.
    PFAM: http://www.sanger.ac.uk/Software/Pfam/search.shtml This database is a collection of protein motifs.
    PRODOM http://protein.toulouse.inra.fr/prodom/current/html/home.php PRODOM is similar to PFAM in that it is a set of curated protein domain families. However, the underlying computational engine is different.
    BLOCKS http://blocks.fhcrc.org/ Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. The blocks for the Blocks Database are made automatically by looking for the most highly conserved regions in groups of proteins documented in InterPro.
    COG http://www.ncbi.nlm.nih.gov/COG/ COG stands for Clusters of Orthologous Groups of proteins. This is a tool for phylogenetic classification of proteins encoded in complete genomes. COGs were delineated by comparing protein sequences encoded in complete genomes, representing major phylogenetic lineages.

  • Retrieving data

  • Retrieving data
    You don't have to leave the CLI. Really.
    If you need to do something, chances are there's a utility to do so.
    Debian is your friend (search packages FIRST!!!)

    Introducing wget:
    >wget 'ftp://genome-ftp.stanford.edu/pub/yeast/data_download/protein_info/hypothetical_peptides/*.gz'
    Of course you can use ftp:
    >ftp genome-ftp.stanford.edu
    - login anonymous; use your email address as passwd
    - traverse filesystem like any linux CLI
    - bin, get, prompt, mget

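  • A note about file archives
    Most files will be compressed, usually with gzip (uncompress them with gunzip).
    Most files will be bundled into a single archive using tar.

    Introducing gunzip:
    >gunzip *.gz
    Introducing tar (tape archive):
    >tar xvf archive.tar
    (archive.tar stands for your file; tar reads one archive per call, so *.tar only works when it matches a single file)
    Or to create a tar:
    >tar cvf output.tar *.*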
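A round trip through tar and gzip shows how the two tools fit together. This is a small sketch with made-up file names (a.txt, b.txt, bundle.tar), run in a scratch directory:

```shell
cd "$(mktemp -d)"                       # scratch directory

# Two made-up files to bundle.
echo 'first file'  > a.txt
echo 'second file' > b.txt

tar cf bundle.tar a.txt b.txt           # c = create, f = archive file name
gzip bundle.tar                         # compresses to bundle.tar.gz
rm a.txt b.txt

gunzip bundle.tar.gz                    # back to bundle.tar
tar xf bundle.tar                       # x = extract (add v to list each file)
cat a.txt b.txt
```

tar only bundles; gzip only compresses. That is why downloaded archives often carry both extensions, as in .tar.gz.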

  • A brief note about the biological file format called FASTA
    In bioinformatics, FASTA format is a file format used to exchange information between genetic sequence databases. Its format looks like this:

    >SEQUENCE_1
    ;comment line 1 (optional)
    MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEGLVSVKVSDDF
    TIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHKIPQFASRKQLSDAILKEAEE

    It consists of a header line (beginning with a '>') which gives a name and/or a unique identifier for the sequence. Many different sequence databases use FASTA files.
    After the header line and comments, one or more sequence lines may follow. Sequences may be protein sequences or DNA sequences; each line must be shorter than 80 characters and can contain gaps or alignment characters.
    FASTA format files often have file extensions like .fa or .fsa.
    The simple format of FASTA files makes them easy to manipulate using text processing tools and scripting languages like Perl.
    *From http://en.wikipedia.org/wiki/Fasta_format
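Because every record starts with a '>' line, ordinary text tools are enough to summarise a FASTA file. This is a minimal sketch, assuming a made-up two-record file named example.fsa (the record names are illustrative):

```shell
cd "$(mktemp -d)"

# A made-up two-record FASTA file, for illustration only.
cat > example.fsa <<'EOF'
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEGLVSVKVSDDF
TIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHKIPQFASRKQLSDAILKEAEE
>SEQUENCE_2
MNYILLLLLIKLLIIINMKLIKIL
EOF

# Every record has exactly one header line, so counting '>' lines counts sequences.
n_records=$(grep -c '^>' example.fsa)

# Header lines only, with the leading '>' stripped.
names=$(grep '^>' example.fsa | cut -c2-)

echo "$n_records records:"
echo "$names"
```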

  • ProSite motif

  • Describing the motif - GREP
    GREP searches the contents of a file or a directory of files.
    "Get Regex": grep uses regular expressions.
    File wildcards can be used, as with ls.
    grep 1sq ~/DATA/*.CEL -> array type used
    We explored this last time (briefly!)

  • Regular expressions
    A regular expression, often called a pattern, is an expression that describes a set of strings. They are usually used to give a concise description of a set, without having to list all elements. For example, the set containing the three strings Mike, Mark, and Matt can be described by the pattern "M(ike|ark|att)". Alternatively, it is said that the pattern "M(ike|ark|att)" matches each of the three strings. There are usually multiple different patterns describing any given set. Most formalisms provide the following operations to construct regular expressions.

  • Formalisms of regular expressions
    alternation
    A vertical bar separates alternatives. For example, "gray|grey" matches grey or gray.
    grouping
    Parentheses are used to define the scope and precedence of the operators. For example, "gray|grey" and "gr(a|e)y" are different patterns, but they both describe the set containing gray and grey.
    quantification
    A quantifier after a character or group specifies how often that preceding expression is allowed to occur. The most common quantifiers are ?, *, and +:
    ?  The question mark indicates that the preceding character may be present at most once. For example, "colou?r" matches color and colour.
    *  The asterisk indicates that the preceding character may be present zero, one, or more times. For example, "0*42" matches 42, 042, 0042, etc.
    +  The plus sign indicates that the preceding character must be present at least once. For example, "go+gle" matches the infinite set gogle, google, gooogle, etc. (but not ggle).
    These constructions can be combined to form arbitrarily complex expressions, very much like one can construct arithmetical expressions from the numbers and the operations +, -, * and /.

    *From http://en.wikipedia.org/wiki/Regular_expression
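The operations above can be checked directly with grep -E (extended regular expressions). This is a small sketch on a made-up word list; the ^ and $ anchors are added so that each pattern has to match a whole line:

```shell
cd "$(mktemp -d)"
printf '%s\n' Mike Mark Matt Mary gray grey color colour gogle google ggle > words.txt

# Alternation inside a group: exactly the three names.
names=$(grep -E '^M(ike|ark|att)$' words.txt)

# Grouping narrows the alternation to a single letter.
greys=$(grep -E '^gr(a|e)y$' words.txt)

# ? = zero or one, + = one or more of the preceding character.
colours=$(grep -E '^colou?r$' words.txt)
googles=$(grep -E '^go+gle$' words.txt)

printf '%s\n' "$names" "$greys" "$colours" "$googles"
```

Mary and ggle are in the list on purpose: neither is selected by any of the four patterns.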

  • The real world is fuzzy and complex
    What if we just want to search for a string in the format of a phone number?
    E.g.
    825 8901 (no area code)
    213 487 0353 (area code)
    Obviously we can't check for each possible phone number (some 10^10 possibilities makes for a very long set of statements).

  • This is where regular expressions come in
    Regular expressions describe generalised patterns of strings instead of exact strings.
    (clearly this is a little more complex as an example)
    >grep -E '([0-9]{3} ){0,1}[0-9]{3} [0-9]{4}' filename
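The phone-number pattern can be exercised on a few made-up lines. In this sketch the file name phones.txt is hypothetical, and the -x option (match the whole line) is an addition that keeps near misses from slipping through as partial matches:

```shell
cd "$(mktemp -d)"
# Made-up lines: two well-formed numbers plus two near misses.
printf '%s\n' '825 8901' '213 487 0353' '8258901' '21 487 035' > phones.txt

# Optional "NNN " area code, then NNN, a space, and NNNN.
# -x requires the entire line to match the pattern.
hits=$(grep -E -x '([0-9]{3} ){0,1}[0-9]{3} [0-9]{4}' phones.txt)
echo "$hits"
```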

  • Special characters (metacharacters)
    . is a wildcard and matches any character
    >grep '.ed' filename
    If file contains bed - will find
    If file contains red - will find
    If file contains head - will not find
    If file contains edward - will not find (no character precedes the ed)

  • Special characters (metacharacters)
    * means zero or more of the previous character.
    >grep 'be*d' filename
    If file contains bed - will find
    If file contains red - will not find
    If file contains beeeed - will find
    If file contains bd - will find

  • Special characters (metacharacters)
    + means one or more of the previous character (grep needs -E for this).
    >grep -E 'be+d' filename
    If file contains bed - will find
    If file contains red - will not find
    If file contains beeeed - will find
    If file contains bd - will not find

  • Start and end of line
    ^ designates the start of the line, $ the end.
    >grep bed filename
    If file contains bed - will find
    If file contains bedbed - will find
    If file contains xxxbedxxx - will find

    >grep '^bed$' filename
    If file contains bed on a line by itself - will find
    If file contains bedbed - will not find
    If file contains xxxbedxxx - will not find

  • Grouping with parentheses
    Parentheses group characters
    >grep -E '(bed)+' filename
    If file contains bed - will find
    If file contains bedbed - will find
    If file contains beddd - will find (its leading bed matches; anchor the pattern as '^(bed)+$' to reject it)

  • Character classes
    The square brackets are used to denote whole groups of characters
    >grep '[brf]ed' filename
    If file contains bed - will find
    If file contains red - will find
    If file contains led - will not find

  • Character classes (cont)
    A hyphen designates a range:
    >grep '[a-z]ed' filename
    If file contains bed - will find
    If file contains fed - will find
    If file contains Bed - will NOT find (why not?)
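The will find / will not find claims on the metacharacter slides can be tested one line at a time. This sketch uses a helper named found, which is a made-up convenience and not a standard tool. LC_ALL=C is set as an assumption about the environment, because in some locales a range like [a-z] can also cover upper-case letters.

```shell
export LC_ALL=C   # byte-wise ranges, so [a-z] means lower case only

# found PATTERN TEXT: report whether grep -E matches one line of text.
found() {
  if printf '%s\n' "$2" | grep -E -q -- "$1"; then
    echo "will find"
  else
    echo "will not find"
  fi
}

found '.ed'     'red'        # any one character, then ed
found '.ed'     'head'       # "ed" never appears in head
found 'be*d'    'bd'         # zero e's is allowed by *
found 'be+d'    'bd'         # + needs at least one e
found '^bed$'   'bedbed'     # anchors demand the whole line be "bed"
found '[a-z]ed' 'Bed'        # B is upper case, outside a-z
```

Quoting each pattern matters: without the quotes the shell may expand *, [ ] or { } before grep ever sees them.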

  • Character class shortcuts
    Some character classes are so common there are in-built shortcuts (Perl-style; in GNU grep, \d works only with -P):
    [0-9] = \d
    [A-Za-z0-9_] = \w
    [\f\t\n\r ] = \s

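  • Quantifying
    Curly brackets quantify repeats better than * (0+) or + (1+)
    a{3,5} = three, four or five a's.
    >grep -E 'la{3,5}d' filename
    If file contains laaaad - will find
    If file contains laaaaaaad - will not find
    (The trailing d matters: on its own, la{3,5} would still match the first part of laaaaaaad.)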

  • Referencing
    Back-references match the substring previously matched by the nth parenthesized subexpression of the regular expression. The back-reference is denoted `\n', where n is a single digit.
    >grep -E '(a)\1' filename
    If file contains laaaad - will find
    If file contains lad - will not find
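Bounded repeats and back-references can be compared side by side on a made-up word list. This is a sketch assuming a grep that accepts back-references with -E (GNU and BSD grep do; it is an extension beyond POSIX extended regular expressions). A closing d is added to the bounded pattern so that longer runs of a are rejected.

```shell
export LC_ALL=C
cd "$(mktemp -d)"
printf '%s\n' lad laaad laaaad laaaaaaad > words.txt

# Three to five a's between l and d; the d stops longer runs from matching.
bounded=$(grep -E 'la{3,5}d' words.txt)

# \1 repeats whatever group 1 matched, so (a)\1 means "aa".
doubled=$(grep -E '(a)\1' words.txt)

echo "$bounded"
echo "$doubled"
```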

  • Back to our ProSite motif
    We can use regular expressions to describe the motif.
    The motif is actually a REGULAR EXPRESSION!

    >grep -n -E --color -B2 'C.{2}C.{4,8}[RHDGSCV][YWFMVIL].[CS].{2,5}[CHEQ].[DNSAGE][YFVLI].[LIVFM]C.{2}C' *.fsa
    chr04.peptides.20040928.fsa-4202->Annotated|04:1356055:1357359| frame 1; YDR448W/ADA2; Verified; this gene contains 1 exon
    chr04.peptides.20040928.fsa:4203:MSNKFHCDVCSADCTNRVRVSCAICPEYDLCVPCFSQGSYTGKHRPYHDYRIIETNSYPILCPDWGADEELQLIKGAQTL
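ProSite writes its patterns in its own notation (x for any residue, dashes between positions, repeat counts in parentheses), and turning that into a grep pattern is mostly mechanical. The sketch below makes two assumptions: the ProSite-notation string is a reconstruction of the pattern used on the slide, not a copy from the transcript, and the sed conversion covers only this subset of the notation (no {..} exclusions, no < or > anchors).

```shell
# Assumed ProSite-notation form of the slide's pattern.
prosite='C-x(2)-C-x(4,8)-[RHDGSCV]-[YWFMVIL]-x-[CS]-x(2,5)-[CHEQ]-x-[DNSAGE]-[YFVLI]-x-[LIVFM]-C-x(2)-C'

# Drop the '-' separators, turn x into '.', and turn (n) or (n,m) into {n} or {n,m}.
regex=$(printf '%s\n' "$prosite" | sed -e 's/-//g' -e 's/x/./g' -e 's/(\([0-9,]*\))/{\1}/g')
echo "$regex"

# The peptide line that grep reported on the slide.
peptide='MSNKFHCDVCSADCTNRVRVSCAICPEYDLCVPCFSQGSYTGKHRPYHDYRIIETNSYPILCPDWGADEELQLIKGAQTL'

# -o prints only the part of the line that matched the motif.
match=$(printf '%s\n' "$peptide" | grep -E -o "$regex")
echo "$match"
```

The converted string is the same regular expression that was typed by hand in the grep command on the slide.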

  • Did it work?

  • Let's try this
    Download the genomic DNA sequence from SGD.
    Search for any variant of the TATA box promoter:
    TATAAATATAATTATATTTAATAATAATAT

  • More more more
    Many MS tools allow for wildcard searching.
    The shell allows variables; interpolation; control structures.
    For example, attempt to find a palindrome of length 4 within genomic sequences (hint: use backreferences!)
    Variables allow for persistence and control structures.

    >myVar=`grep -n -E --color 'C.{2}C.{4,8}[RHDGSCV][YWFMVIL].[CS].{2,5}[CHEQ].[DNSAGE][YFVLI].[LIVFM]C.{2}C' *.fsa`
    mako@subi:~$ echo $myVar
    chr04.peptides.20040928.fsa:4203:MSNKFHCDVCSADCTNRVRVSCAICPEYDLCVPCFSQGSYTGKHRPYHDYRIIETNSYPILCPDWGADEELQLIKGAQTL
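One possible reading of the palindrome hint, stored in a variable the same way as above. This is a sketch under assumptions: the file toy.fsa and its sequences are made up, and "palindrome" is taken in the plain textual sense (a string that reads the same backwards, such as ATTA), not the reverse-complement sense used for restriction sites.

```shell
cd "$(mktemp -d)"
# Made-up DNA lines, for illustration only.
printf '%s\n' 'GGATTACC' 'ACGTACGT' 'TTGAAGTT' > toy.fsa

# Two captured bases followed by the same two in reverse order:
# a 4-letter textual palindrome such as ATTA or GAAG.
pattern='([ACGT])([ACGT])\2\1'

# Command substitution stores grep's output in a variable for later use.
hits=$(grep -n -E "$pattern" toy.fsa)
echo "$hits"
```

$(...) does the same job as the backticks on the slide and is easier to nest.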

  • A better variable interpolation
    The variable is allowed to change.
    We can set the variable to the ProSite pattern.

    mako@subi:~$ myVar=C\.{2}C\.{4,8}[RHDGSCV][YWFMVIL]\.[CS]\.{2,5}[CHEQ]\.[DNSAGE][YFVLI]\.[LIVFM]C\.{2}C
    mako@subi:~$ echo "$myVar"
    C.{2}C.{4,8}[RHDGSCV][YWFMVIL].[CS].{2,5}[CHEQ].[DNSAGE][YFVLI].[LIVFM]C.{2}C
    mako@subi:~$ grep -n -E --color "$myVar" *.fsa
    chr04.peptides.20040928.fsa:4203:MSNKFHCDVCSADCTNRVRVSCAICPEYDLCVPCFSQGSYTGKHRPYHDYRIIETNSYPILCPDWGADEELQLIKGAQTL

  • Variables can be overwritten
    The variable is allowed to change.
    We can set the variable to the ProSite pattern.

    mako@subi:~$ function afun {
    > for i in 1 2 3 4 5
    > do
    > echo $i
    > echo $myVar
    > done
    > }
    mako@subi:~$ afun
    1
    C.{2}C.{4,8}[RHDGSCV][YWFMVIL].[CS].{2,5}[CHEQ].[DNSAGE][YFVLI].[LIVFM]C.{2}C
    2
    C.{2}C.{4,8}[RHDGSCV][YWFMVIL].[CS].{2,5}[CHEQ].[DNSAGE][YFVLI].[LIVFM]C.{2}C
    3
    C.{2}C.{4,8}[RHDGSCV][YWFMVIL].[CS].{2,5}[CHEQ].[DNSAGE][YFVLI].[LIVFM]C.{2}C
    4
    C.{2}C.{4,8}[RHDGSCV][YWFMVIL].[CS].{2,5}[CHEQ].[DNSAGE][YFVLI].[LIVFM]C.{2}C
    5
    C.{2}C.{4,8}[RHDGSCV][YWFMVIL].[CS].{2,5}[CHEQ].[DNSAGE][YFVLI].[LIVFM]C.{2}C

  • Functions
    What if we wanted to search every ProSite pattern against our genomic database?
    We'd have to repeatedly do our search.
    This is called a loop.
    We have to write this so the computer knows exactly what to repeat, how many times to repeat, and where to find the next ProSite pattern to match.
    We would store the "what" and "where" in VARIABLES.
    We would utilize a CONTROL STRUCTURE to handle the "how".

  • Control structures
    All our programs so far have run from start to finish. Each line has been executed in turn.

    What if we only want to run some lines some of the time?

    This is where control structures come in.

  • Control structures
    Programming languages generally have a number of control structures.
    Basic structures:
    if
    while
    for & foreach

    There are others (e.g. unless)

  • for example
    >afunction() {
    for i in 1 2 3 4 5
    do
      echo "Looping ... number $i"
    done
    }

  • Variables can be interpolated
    The command is substituted from the system.
    It's like a pipe, but we are allowed to operate.

    mako@subi:~$ afun() {
    > myvar=$(ls -1 *.fsa)
    > for i in $myvar
    > do
    > echo $i
    > done
    > }
    mako@subi:~$ afun
    chr01.fsa
    chr01.peptides.20040928.fsa
    chr02.peptides.20040928.fsa
    chr03.peptides.20040928.fsa
    chr04.peptides.20040928.fsa
    chr05.peptides.20040928.fsa
    chr06.peptides.20040928.fsa
    chr07.peptides.20040928.fsa
    chr08.peptides.20040928.fsa
    chr09.peptides.20040928.fsa
    chr10.peptides.20040928.fsa
    chr11.peptides.20040928.fsa

  • The while control structure (combined with opening files)
    The while control structure keeps looping while a given condition is satisfied.
    while and open files go together very well:

    mako@subi:~$ afun() {
    > while read f
    > do
    > echo $f
    > done
    > }
    mako@subi:~$ afun < chrmt.peptides.20040928.fsa
    >Notannotated|mt:385:459| frame 1
    MNYILLLLLIKLLIIINMKLIKIL
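Putting the pieces together, a while loop can feed patterns one at a time to a for loop that visits each sequence file. This is a minimal sketch under assumptions: patterns.txt, chrA.fsa and chrB.fsa are made-up files, and the two short patterns are placeholders, not real ProSite entries.

```shell
cd "$(mktemp -d)"
# Made-up inputs: one pattern per line, and two tiny sequence files.
printf '%s\n' 'C.{2}C' 'H.{3}H' > patterns.txt
printf '%s\n' '>pepA' 'MKCAACLL' > chrA.fsa
printf '%s\n' '>pepB' 'MHQQQHLL' > chrB.fsa

# while read supplies the "what" (one pattern per pass);
# the for loop supplies the "where" (each .fsa file).
report=$(
  while read -r pattern
  do
    for f in *.fsa
    do
      # grep -c prints the number of matching lines; || true keeps a
      # zero-match result from being treated as a failure.
      n=$(grep -c -E "$pattern" "$f" || true)
      echo "$pattern $f $n"
    done
  done < patterns.txt
)
echo "$report"
```

Each output line gives a pattern, a file, and how many lines of that file matched.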

  • Editors
    Shell programming is like a batch file.
    Commands are linked together in a procedure.
    The procedure is accessed via a file.
    We need an editor that will allow us to construct that file.
    We'll use Emacs (or you can use vi, pico, ...)
    Comprehensive, extensible working environment
    Complete (arguable!) IDE
    Integration
    Extensible (elisp)

  • Emacs
    Invoking Emacs is easy: emacs -nw filename
    In many cases, Emacs will work out the mode appropriate for your file (.cpp, .pl, etc.)
    The mode allows Emacs to become sensitive to the task.
    There is a biomode for reverse complement, etc.
    You can write your own!
    Emacs has many tools:
    Search, replace, cut, paste, mail
    File navigation, ftp, remote shells

  • The Emacs survival guide
    Notation
    Emacs uses the control key and escape key heavily. We write it like this:
    C-x  Pronounced "Control-x". Hold down the Ctrl key (usually in the lower left corner of the keyboard) while pressing the x key. Both Ctrl and x must be down at the same time.
    M-x  Pronounced "Meta-x". Press the Esc key (usually in the upper left corner of the keyboard), release it, then press the x key. Esc and x should not be down at the same time.
    So C-x C-f means hold down the control key, then type x and then f while holding it down. (This is the command to load a file into emacs.)
    Typing
    Just type. All the regular keys, arrow keys, delete, backspace, and page up/down keys should work. Alternatively, you can try these commands: C-f cursor forward, C-b cursor back, C-p previous line, C-n next line, M-v page up, C-v page down.
    Exiting
    Type C-x C-c. If you have any unsaved work, emacs will ask you if you want to save it. Type y.
    Other commands
    Most control or escape sequences are commands. Usually a prompt appears in the command line at the bottom of the window. Here are a few:
    C-x C-f  Load file, prompt for filename
    C-x C-s  Save file without exiting
    C-x C-c  Exit, prompt to save files
    C-s      Search forward, prompt for search string
    C-r      Search backward, prompt for search string
    C-h ?    Show help options, prompt for choice
    C-h t    Start emacs tutorial
    If you make a mistake or change your mind you can always escape:
    C-g      Abandon command and resume typing

  • Command line editing
    Learning the keybindings can be difficult.
    But it will increase your speed.
    Faster than using a mouse.
    Transferable! The Emacs keybindings for command line editing are the default set of commands for line editing in the Bash shell!

  • Let's try it
    Open up the file that we found contained the ProSite motif.
    Open a second window.
    Go to the line that contains the motif (hint: use grep with -n!).
    Copy and paste that line into a new file.
    Save and close that file.

  • AWK is your pre-perl friend
    Use to print a subset of fields
    Default field delimiter is white space
    Useful for grabbing a subset of fields
    Useful for rearranging fields

    field1 field2 field3 field4 . . .
    $1     $2     $3     $4     . . .

  • Using AWK
    | awk -F DELIM '{print $1}'
    | awk -F DELIM '{print $1 $2}'
    | awk -F DELIM '{print $1 "\t" $2}'

    | = pipe
    -F DELIM sets the field delimiter to DELIM
    \t = TAB
    \n = newline
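The three awk forms can be compared on a single made-up record. In this sketch the colon delimiter and the sample line (laid out like an /etc/passwd entry) are illustrative choices:

```shell
# A made-up, colon-delimited record in /etc/passwd layout.
line='john:x:1000:1000::/home/john:/bin/bash'

# -F sets the field delimiter; $1 and $7 are the first and seventh fields.
first=$(echo "$line" | awk -F: '{print $1}')

# Adjacent fields are concatenated with nothing in between...
glued=$(echo "$line" | awk -F: '{print $1 $7}')

# ...so put "\t" between them to get a TAB-separated pair.
tabbed=$(echo "$line" | awk -F: '{print $1 "\t" $7}')

printf '%s\n' "$first" "$glued" "$tabbed"
```

The single quotes around the awk program keep the shell from interpreting $1 and $7 as its own variables.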

  • Overwrite versus Append
    >  OVERWRITE: delete and replace
    >> APPEND: add to end of existing file
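The difference shows up as soon as the same file is written more than once. This is a small sketch with a made-up file name, log.txt:

```shell
cd "$(mktemp -d)"
echo 'first run'  >  log.txt    # > creates the file (or empties an existing one)
echo 'second run' >  log.txt    # > again: the first line is gone
echo 'third run'  >> log.txt    # >> keeps what is there and adds to the end
cat log.txt
```

Only the second and third lines survive, because the second > replaced the file's contents before >> added to them.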

  • Example: microarray data tracking
    grep 1sq ~/DATA/*.CEL (gives array info)
    grep 1sq ~/DATA/*.CEL | awk '{print $12}' (gives array type only)
    grep 1sq ~/DATA/*.CEL | awk '{print $12}' > arrayTypes.txt (store results in file)
    ls ~/DATA/*.DAT | wc (gives a count)