IUPAC/IUB Single-Letter Codes Within


APPENDIX 1A
IUPAC/IUB Single-Letter Codes Within Nucleic Acid and Amino Acid Sequences

The International Union of Pure and Applied Chemistry (IUPAC) and the International Union of Biochemistry and Molecular Biology (IUBMB) have established standards for representing nucleic acids and amino acids with single capital letters. Table A.1A.1 summarizes the codes for bases in nucleic acid sequences. Table A.1A.2 summarizes the codes for amino acids in protein sequences. Additional information may be found at the IUPAC Web site: http://www.chem.qmul.ac.uk/iupac.

Contributed by Shonda A. Leonard
Current Protocols in Bioinformatics (2003) A.1A.1
Copyright © 2003 by John Wiley & Sons, Inc.

Table A.1A.1 IUPAC/IUBMB Codes for Nucleic Acid Bases

Code  Nucleic acid base
A     Adenine
C     Cytosine
G     Guanine
T     Thymine
U     Uracil
R     Guanine or adenine (purine)
Y     Thymine or cytosine (pyrimidine)
K     Guanine or thymine (keto group at similar positions)
M     Adenine or cytosine (amino group at similar positions)
S     Guanine or cytosine (strong interaction: 3 hydrogen bonds)
W     Adenine or thymine (weak interaction: 2 hydrogen bonds)
B     Not adenine
D     Not cytosine
H     Not guanine
V     Not thymine
N     Any nucleic acid base
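The ambiguity codes in Table A.1A.1 can be read as sets of allowed bases. The following is a minimal Python sketch of that reading, not part of the IUPAC standard itself; the names IUPAC_BASES, matches, and expand are hypothetical, and treating U as equivalent to T is a simplifying assumption made here.

```python
# Base sets implied by the codes in Table A.1A.1, written in the DNA alphabet.
# Assumption of this sketch: U (uracil) is treated as equivalent to T.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC",
    "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def matches(code, base):
    """Return True if a concrete base is allowed by a single-letter code."""
    return base.upper().replace("U", "T") in IUPAC_BASES[code.upper()]


def expand(pattern):
    """List every concrete DNA sequence an ambiguous pattern can stand for."""
    results = [""]
    for code in pattern:
        results = [prefix + b for prefix in results
                   for b in IUPAC_BASES[code.upper()]]
    return results
```

For example, expand("RY") enumerates the four purine-pyrimidine dinucleotides, and matches("B", "A") is false because B means "not adenine."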

Table A.1A.2 IUPAC/IUBMB Codes for Amino Acids

Code  Amino acid
A     Alanine
C     Cysteine
D     Aspartic acid
E     Glutamic acid
F     Phenylalanine
G     Glycine
H     Histidine
I     Isoleucine
K     Lysine
L     Leucine
M     Methionine
N     Asparagine
P     Proline
Q     Glutamine
R     Arginine
S     Serine
T     Threonine
V     Valine
W     Tryptophan
X     Any amino acid
Y     Tyrosine


APPENDIX 1B
Common File Formats

This appendix discusses a few of the file formats frequently encountered in bioinformatics.

FASTA FILES

FASTA files may contain DNA, RNA, or protein sequences. In each case, the sequence is written in the standard IUPAC single-letter codes (APPENDIX 1A), with the following exceptions:

Lowercase letters are accepted;

A hyphen (-) represents a gap of indeterminate length;

The letter U represents selenocysteine in protein sequences;

An asterisk (*) in a protein sequence indicates a translation stop.

A FASTA file may contain one or more sequences. A file with multiple sequences is called a multi-FASTA file. The first line, or descriptor line (see discussion below), of each new entry begins with a greater-than sign (>), followed by a single-line description of the sequence that follows. This title line may be any length, including simply the greater-than sign followed by no additional characters. Subsequent lines contain the sequence (Fig. A.1B.1). It is recommended that sequence lines be fewer than 80 characters long.
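The layout just described (a descriptor line beginning with >, followed by one or more sequence lines) can be read with a few lines of code. The following Python sketch is one possible reader under those rules; the function name parse_fasta and the toy entries are hypothetical and are not taken from Figure A.1B.1.

```python
def parse_fasta(text):
    """Return a list of (descriptor, sequence) pairs from FASTA-formatted text."""
    records = []
    descriptor, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new descriptor line closes the previous entry, if any.
            if descriptor is not None:
                records.append((descriptor, "".join(chunks)))
            descriptor, chunks = line[1:], []
        elif descriptor is not None:
            chunks.append(line)
    if descriptor is not None:
        records.append((descriptor, "".join(chunks)))
    return records


# Toy multi-FASTA input; the second entry has an empty descriptor line.
example = """>seq1 first toy entry
ACGTACGT
acgt--NN
>
MKV*
"""
records = parse_fasta(example)
```

The sketch keeps lowercase letters, hyphens, and asterisks untouched, since the format accepts all of them; any validation against the IUPAC codes would be a separate step.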

NCBI Descriptor Lines

In typical NCBI descriptor lines, pipe (“|”) characters delineate key fields. For example:

>gi|532319|pir|TVFV2E|TVFV2E envelope protein

The syntax of NCBI sequence FASTA format descriptor lines depends on the database from which each sequence was obtained. Table A.1B.1 lists the identifiers for the databases from which the sequences were derived.

Table A.1B.1 Identifiers to be Used With Sequence Databases

Database name                   Identifier syntax
GenBank                         gb|accession|locus
EMBL Data Library               emb|accession|locus
DDBJ (DNA Database of Japan)    dbj|accession|locus
NBRF PIR                        pir||entry
Protein Research Foundation     prf||name
SWISS-PROT                      sp|accession|entry name
Brookhaven Protein Data Bank    pdb|entry|chain
Patents                         pat|country|number
GenInfo Backbone identifier     bbs|number
General database identifier     gnl|database|identifier
NCBI Reference Sequence         ref|accession|locus
Local sequence identifier       lcl|identifier

Figure A.1B.1 A sample FASTA file that contains the sequences for two homologous proteins, actophorin and yeast cofilin. Note that a greater-than sign (>) designates the beginning of each entry and that each of the lines of sequence contains fewer than 80 characters.

Contributed by Shonda A. Leonard, Timothy G. Littlejohn, and Andreas D. Baxevanis
Current Protocols in Bioinformatics (2006) A.1B.1-A.1B.9
Copyright © 2006 by John Wiley & Sons, Inc.

“gi” identifiers are being assigned by NCBI for all sequences contained within NCBI’s sequence databases. The “gi” identifier provides a uniform and stable naming convention whereby a specific sequence is assigned its unique gi identifier. If a nucleotide or protein sequence changes, however, a new gi identifier is assigned, even if the accession number of the record remains unchanged. Thus gi identifiers provide a mechanism for identifying the exact sequence that was used or retrieved in a given search.

In the example above, 532319 is the gi accession number, TVFV2E is the PIR accession code, and TVFV2E envelope protein is the description of the sequence. Note that in this case there are two accession codes (a gi number and a PIR code).

The gnl (“general”) identifier allows databases not on the above list to be identified with the same syntax. An example here is the PID identifier:

gnl|PID|e1632

PID stands for Protein ID; the e (in e1632) indicates that this ID was issued by EMBL. The pipe (“|”) separates different fields as listed in the above table. In some cases, a field is left empty, even though the original specification called for including this field. To make these identifiers backward compatible with software applications that expect these fields, the empty field is denoted by an additional pipe (“||”).
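Because the pipe character delimits the fields, a descriptor line can be taken apart with a plain string split. The Python sketch below follows the reading of the example given in the text (gi number, PIR code, then description); the function name split_descriptor and the prf entry name are hypothetical illustrations.

```python
def split_descriptor(line):
    """Split a descriptor line on pipe characters, keeping empty fields."""
    body = line[1:] if line.startswith(">") else line
    return body.rstrip("\n").split("|")


fields = split_descriptor(">gi|532319|pir|TVFV2E|TVFV2E envelope protein")
gi_number = fields[1]      # value that follows the "gi" tag
pir_code = fields[3]       # value that follows the "pir" tag
description = fields[4]    # remaining free text, as read in the text above

# A field left empty (written "||") survives as an empty string, so the
# positions of the later fields do not shift. The entry name is made up.
with_empty_field = split_descriptor("prf||hypothetical_name")
```

Keeping the empty string in place is the point of the double pipe: software that expects three fields for a given database still finds three.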

GenBank FLAT FILES

GenBank files summarize pertinent information (e.g., sequence, size, source organism, and key references) for genes and gene products. They are readily available from the NCBI server (http://www.ncbi.nlm.nih.gov). Each file is broken into fields that designate what information is found on the following line(s). New fields are identified by a left-justified field name (given in capital letters) at the beginning of a new line of text. Some fields contain subfields, which are indented on subsequent lines. Table A.1B.2 lists the possible field names and describes the contents of each field. It is important to note that a given GenBank file need not contain every field. Figure A.1B.2 illustrates one example of a GenBank file. Additional information regarding the content and format of GenBank records may be found at http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html.

If you are creating a sequence file in GenBank format, it may contain multiple sequences. Each new sequence begins with a LOCUS field. The other fields are optional, except for the ORIGIN field, which marks the beginning of the sequence. Two slashes (//) mark the end of the sequence (Fig. A.1B.2).
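The three landmarks named above (LOCUS, ORIGIN, and //) are enough to pull the sequences out of a multi-record file. The following Python sketch relies only on those landmarks; the function name read_genbank_sequences and the toy record are hypothetical, and a real parser would also need to handle the remaining fields.

```python
def read_genbank_sequences(text):
    """Yield (locus_name, sequence) for each record in GenBank-formatted text."""
    locus, in_origin, chunks = None, False, []
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            parts = line.split()
            locus = parts[1] if len(parts) > 1 else ""
            in_origin, chunks = False, []
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            if locus is not None:
                yield locus, "".join(chunks)
            locus, in_origin = None, False
        elif in_origin:
            # Sequence lines: a position number followed by groups of bases.
            chunks.extend(tok for tok in line.split() if not tok.isdigit())


# Hypothetical toy record, not a real GenBank entry.
example = """LOCUS       TOY0001   24 bp    DNA   linear   SYN 01-JAN-2006
DEFINITION  Hypothetical toy record.
ORIGIN
        1 acgtacgtac gtacgtacgt acgt
//
"""
records = list(read_genbank_sequences(example))
```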


Table A.1B.2 A Summary of Fields Commonly Found in GenBank Records (Fig. A.1B.2)

Field    Identifier(s) in Figure A.1B.2    Contents

LOCUS    1a: Locus name    Although the locus name was originally intended to identify similar sequences, it no longer carries such significance. Each GenBank file has a unique locus name. Often, it is either the first letter of the genus and species followed by the accession number, or simply the GenBank accession number of the file.

         1b: Sequence length    The number of nucleotide base pairs (bp) or amino acid residues (aa) in the gene or gene product.

         1c: Molecule type    Identifies the type of sequence found in a particular file. Possibilities include: genomic DNA, genomic RNA, precursor RNA, mRNA, rRNA, tRNA, small nuclear RNA, and cytoplasmic RNA.

         1d: Molecular topology    The molecule’s expected topology. The options are linear and circular.

         1e: GenBank division    Each GenBank sequence is currently classified in one of the following 17 subdivisions: PRI, primates; ROD, rodents; MAM, mammals (excluding primates and rodents); VRT, vertebrates (excluding mammals); INV, invertebrates; PLN, plants, fungi, and algae; BCT, bacteria; VRL, viral; PHG, bacteriophages; SYN, synthetic; UNA, unannotated; EST, expressed sequence tag; PAT, patent sequence; STS, sequence tagged sites; GSS, genome survey sequence; HTG, high-throughput genomic sequence; HTC, unfinished high-throughput cDNA sequence. Note that the organismal subdivisions do not coincide with the current NCBI taxonomy. They are purely historical.

         1f: Modification date    Indicates when the file was last revised.

DEFINITION    2    A brief description of the sequence, including the organism source and the gene or protein name.

ACCESSION    3    A unique, stable identifier for the particular file, which is usually a combination of one or two letters with five or six digits.

VERSION    4    Allows users to track multiple incarnations of a given sequence. The version number is the accession number concatenated with a period and a number. For the first version of a particular accession, the number following the period is set to 1. Each time the sequence data are modified, the number following the period is incremented by 1. The example shown in Figure A.1B.2 is the first version of accession number M93361.

This field will also contain a GenInfo Identifier (GI) for nucleotide sequence files. This number uniquely identifies each nucleotide sequence in GenBank, even if two sequences differ by a single nucleotide. Note that, unlike the accession number for a file, the GI number may change.

KEYWORDS    5    A word or phrase describing the sequence. Although frequently found in older GenBank records, this field is generally not present in more recent GenBank files.


SOURCE    6    The first line is a free-format description of the source organism, followed by the molecule type. The subsequent lines contain the subfield ORGANISM, which has the complete scientific name of the source organism and its phylogenetic classification as given by the NCBI Taxonomy Database.

REFERENCE    7    Publications by the authors of the GenBank entry that discuss the molecule. Multiple publications may be listed in chronological order, ending with the most recent. Each reference entry will contain subfields (e.g., AUTHORS, TITLE, JOURNAL, MEDLINE) that are appropriate for the particular publication type.

FEATURES    8    This is essentially a concise summary of the gene or protein annotation. It offers a list of genes, gene products, and regions of biological interest that have been identified within the reported sequence. The first subfield in each FEATURE list is the source subfield, which contains the length of the sequence, the scientific name of the source organism, and the taxon ID number. Additional subfields (e.g., gene, promoter, TATA signal, 5′ UTR, 3′ UTR, and coding sequence [CDS]) are given depending on the features within the sequence. For each feature, the GenBank record provides its location within the sequence and other pertinent information (e.g., the product or gene name, possible function, and protein translation).

BASE COUNT    9    The number of adenine, cytosine, thymine (or uracil), and guanine nucleotide bases within the sequence.

ORIGIN    10    This field is often left blank. In older records, it may contain the experimentally derived restriction cleavage site. Note that the ORIGIN field should be included in every GenBank record, even if it contains no information. Most parsers look for the sequence on the first line after the word ORIGIN.

         11    The sequence data with 60 bases (or residues) per line. The bases on each line are presented in six groups of ten bases per group, with the groups separated by spaces. The sequence ends with two slashes (//).
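The sequence layout described for identifier 11 (60 bases per line, six groups of ten, closed by two slashes) can be produced as in the Python sketch below. The function name format_origin is hypothetical, and the leading position number, right-justified in nine columns, is an assumption of this sketch rather than something stated in the table.

```python
def format_origin(sequence):
    """Lay out a sequence as ORIGIN-style lines: six groups of ten per line."""
    lines = ["ORIGIN"]
    for start in range(0, len(sequence), 60):
        row = sequence[start:start + 60]
        groups = [row[i:i + 10] for i in range(0, len(row), 10)]
        # Assumption: each line starts with the position of its first base,
        # right-justified in nine columns.
        lines.append(f"{start + 1:>9} " + " ".join(groups))
    lines.append("//")
    return "\n".join(lines)


# 64 toy bases: one full line of 60 and one short line of 4.
text = format_origin("acgt" * 16)
```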

PHYLIP FILES

The Phylogeny Inference Package (PHYLIP) is a suite of programs used to infer phylogenies and generate evolutionary trees. The PHYLIP package allows users to generate distance matrices and apply a variety of phylogenetic methods to both nucleotide and protein sequences.

All of the programs in the package that use sequence data as the input (usually the result of a multiple sequence alignment) require that the sequences be formatted in PHYLIP’s own “PHYLIP format.” An example of a PHYLIP-formatted file is shown in Figure A.1B.3. The file begins with a line containing two numbers; the first number represents the number of sequences in the data set (here, 5), and the second number represents the length of the alignment (here, 547 amino acids). The five sequences then follow, in “interleaved” format; this is different from the “sequential” FASTA format described above, where a sequence is listed in its entirety before the next sequence begins. In the first block, each sequence is preceded by an identifier that can be up to ten characters in length.

Figure A.1B.2 A sample GenBank record. Circled numbers identify the fields listed in Table A.1B.2.

PHYLIP files usually take the file extension .phy.
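The interleaved layout can be reassembled into whole sequences by stitching the blocks back together in order. The Python sketch below assumes the strict form of the format, in which the identifier occupies exactly the first ten columns of the first block and later blocks carry no identifiers; the function name and the two-sequence toy alignment are hypothetical.

```python
def parse_phylip_interleaved(text):
    """Return {name: sequence} from an interleaved PHYLIP alignment."""
    lines = [ln.rstrip() for ln in text.splitlines() if ln.strip()]
    n_seqs, length = (int(x) for x in lines[0].split()[:2])
    names, seqs = [], []
    for i, line in enumerate(lines[1:]):
        if i < n_seqs:
            # First block: the first ten columns hold the identifier.
            names.append(line[:10].strip())
            seqs.append(line[10:].replace(" ", ""))
        else:
            # Later blocks repeat the sequences in the same order, unnamed.
            seqs[i % n_seqs] += line.replace(" ", "")
    if any(len(s) != length for s in seqs):
        raise ValueError("sequence length does not match the header line")
    return dict(zip(names, seqs))


# Toy alignment: 2 sequences of length 14, split into blocks of 10 and 4.
example = """ 2 14
seq_one   MKVLAAGIVG
seq_two   MKVLSAGIVA

LLAQ
LLSQ
"""
alignment = parse_phylip_interleaved(example)
```

Checking the assembled lengths against the header line catches the most common formatting slip, a block with a missing or extra row.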

MSF FILES

Multiple sequence files (called “MSF files”) are used by the individual programs within the GCG (Wisconsin) sequence analysis package. The GCG suite (UNIT 3.6) provides a large number of programs designed for sequence comparison, the generation of multiple sequence alignments, gene and secondary structure prediction, and pattern recognition, to name a few.

As with PHYLIP, the programs in the GCG package that use multiple sequence alignment data as the input require that the sequences be provided in MSF format. An example of an MSF-formatted file is shown in Figure A.1B.4. All MSF files begin with either PileUp, !!NA_MULTIPLE_ALIGNMENT, or !!AA_MULTIPLE_ALIGNMENT on the first line of the file. In the line that begins with MSF:, the length of the alignment is given (here, 547 amino acids), followed by the type of alignment (N for nucleotide, P for protein), a checksum number, and two dots, which signify the end of the descriptive header. The next block of lines contains information on the sequences in the alignment, giving their name, length, a checksum, and weight. A double slash then precedes the sequences, which are shown in interleaved format.

Figure A.1B.3 A sample PHYLIP-formatted file. The five sequences shown are HIV-1 and HIV-2 gag proteins from a variety of isolates. See text for details.

Figure A.1B.4 A sample MSF-formatted file. The five sequences shown are HIV-1 and HIV-2 gag proteins from a variety of isolates. See text for details.

MSF files usually take the file extension .msf.
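The MSF: header line carries the alignment length, the alignment type, and a checksum, and ends with two dots. A regular expression can pick these out, as in the Python sketch below. The exact spacing of real header lines varies, so the pattern is deliberately loose; the file name gag.msf and the checksum value 7519 are made up for illustration.

```python
import re

# Loose pattern: length after "MSF:", N or P after "Type:", digits after
# "Check:", then the two dots that close the descriptive header.
MSF_HEADER = re.compile(
    r"MSF:\s*(\d+)\s+Type:\s*([NP])\s+.*?Check:\s*(\d+)\s+\.\."
)


def parse_msf_header(line):
    """Return the length, type, and checksum from an MSF header line, or None."""
    m = MSF_HEADER.search(line)
    if m is None:
        return None
    return {"length": int(m.group(1)),
            "type": m.group(2),
            "check": int(m.group(3))}


# Hypothetical header line for a 547-residue protein alignment.
header = parse_msf_header(" gag.msf  MSF: 547  Type: P  Check: 7519  ..")
```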

NEXUS FILES

A variety of programs use a format known as Nexus format. These include programs such as MacClade, MrBayes, and the PAUP suite of phylogenetics programs (UNITS 6.4 & 6.5).

A sample Nexus-formatted file is shown in Figure A.1B.5. The file begins with the keyword #NEXUS, which is then followed by a block describing the sequences that follow. The DIMENSIONS line gives the number of sequences (or taxa) in the file (NTAX=5, for five sequences), as well as the length of the alignment (NCHAR=547). The FORMAT line indicates whether the alignment is of nucleotide or protein sequences and that they are interleaved. The lines that follow, one per sequence, give the name of each sequence and its length. The sequences then follow under the keyword MATRIX, as shown in the figure. The end of the file is signified by the semicolon and END; on the final two lines of the file.

Nexus files usually take the file extension .nex or .nxs.
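The #NEXUS keyword and the DIMENSIONS line are easy to check programmatically before a file is handed to a phylogenetics program. The Python sketch below extracts NTAX and NCHAR; the function name nexus_dimensions is hypothetical, and the BEGIN DATA, DATATYPE, and INTERLEAVE keywords in the toy input are assumptions about a typical file rather than a transcription of Figure A.1B.5.

```python
import re


def nexus_dimensions(text):
    """Return (ntax, nchar) from the DIMENSIONS line of Nexus-formatted text."""
    if not text.lstrip().upper().startswith("#NEXUS"):
        raise ValueError("not a Nexus file: missing #NEXUS keyword")
    m = re.search(r"DIMENSIONS\b([^;]*);", text, flags=re.IGNORECASE)
    if m is None:
        raise ValueError("no DIMENSIONS line found")
    settings = dict(
        (key.upper(), int(value))
        for key, value in re.findall(r"(\w+)\s*=\s*(\d+)", m.group(1))
    )
    return settings["NTAX"], settings["NCHAR"]


# Toy header only; the MATRIX block with the sequences is omitted.
example = """#NEXUS
BEGIN DATA;
  DIMENSIONS NTAX=5 NCHAR=547;
  FORMAT DATATYPE=PROTEIN INTERLEAVE;
END;
"""
ntax, nchar = nexus_dimensions(example)
```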

Figure A.1B.5 A sample Nexus-formatted file. The five sequences shown are HIV-1 and HIV-2 gag proteins from a variety of isolates. See text for details.


CONVERTING BETWEEN FILE FORMATS

Often, a program writes its output in one format, but a second program can read that output only after the file has been converted to a different format. A number of utilities are available to perform this conversion, and many are available on the Web.

The most widely used format conversion program is ReadSeq, which is described in detail in APPENDIX 1E. ReadSeq currently allows interconversion between 19 different file types, including those discussed in this appendix. Programs such as ClustalW (UNIT 2.3), which are widely used to generate reliable multiple sequence alignments, can provide the alignment output in MSF and PHYLIP formats, among others, facilitating analysis by programs such as PHYLIP and PAUP.
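To make the idea of format conversion concrete, the Python sketch below writes an already-aligned set of sequences in interleaved PHYLIP layout. It is an illustration only and does not reproduce how ReadSeq or ClustalW work internally; the function name to_phylip and its block parameter are hypothetical, and identifiers longer than ten characters are simply truncated, which a real converter would need to handle more carefully.

```python
def to_phylip(records, block=60):
    """Write aligned (name, sequence) pairs as interleaved PHYLIP text."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    length = lengths.pop()
    lines = [f"{len(records)} {length}"]
    for start in range(0, length, block):
        for name, seq in records:
            # Only the first block carries the identifier, padded to ten columns.
            prefix = f"{name[:10]:<10}" if start == 0 else ""
            lines.append(prefix + seq[start:start + block])
        lines.append("")
    return "\n".join(lines).rstrip("\n") + "\n"


# Toy aligned input, written in blocks of ten columns for brevity.
text = to_phylip(
    [("seq_one", "MKVLAAGIVGLLAQ"), ("seq_two", "MKVLSAGIVALLSQ")],
    block=10,
)
```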

DISCLAIMER

This component of this unit by Dr. Andreas D. Baxevanis was written in his private capacity. No official support or endorsement by the National Institutes of Health or the United States Department of Health and Human Services is intended or should be inferred.

INTERNET RESOURCES

http://iubio.bio.indiana.edu/cgi-bin/readseq.cgi
ReadSeq biosequence interconversion tool.

http://www.ebi.ac.uk/clustalw
ClustalW multiple sequence alignment interface.

Contributed by Shonda A. Leonard (discussion of FASTA and GenBank file formats)

Timothy G. Littlejohn (discussion of NCBI descriptor lines)
IBM Life Sciences
St. Leonards, NSW, Australia

Andreas D. Baxevanis (discussion of PHYLIP, MSF, and NEXUS file formats)
Bethesda, Maryland


APPENDIX 1C
Unix Survival Guide

For a mixture of historical and practical reasons, much of the bioinformatics software discussed in this series runs on Linux, Mac OS X, Solaris, or one of the many other Unix variants. This appendix provides the minimum information needed to survive in a Unix environment.

LOGGING IN AND OUT

Unix dates from the time when computers were very expensive, necessitating that multiple users share the same computer hardware. For this reason, a session on a Unix system begins with a login prompt. You provide the system with a username and password in order to gain access to the system’s resources. If your Unix system is managed by a system administrator from your institution’s Information Technology (IT) department, the username and password will have been assigned to you. If you have installed Unix yourself, you will have been prompted for a username-password pair at the time of installation.

There are two common login scenarios. In the first, you are sitting in front of the Unix computer itself and are using its monitor and keyboard directly (a situation sentimentally called “logged in at the console”). In the second, you use a conventional Windows or Macintosh desktop machine to connect via the network to a Unix server located at some remote location.

Logging in at the Console

In the first scenario, you will be presented with a login window. A typical login window is shown in Figure A.1C.1, but because of the great variability in Unix distributions yours will almost certainly look a bit different; however, all login windows have a field for Username and another for Password. Type yours in and press the appropriate button (Go! in the example shown in the figure).

If the username and password are recognized, the system will log you in and display a graphical desktop (Fig. A.1C.2). Like the login prompt, Unix systems vary widely in the appearance and behavior of the desktop. Some, such as the KDE desktop shown in Figure A.1C.2, do a good job of reproducing the familiar experience of a Windows or Macintosh desktop. Others are frustratingly alien. All require some getting used to. Popular Unix desktop systems that you may encounter include the aforementioned K Desktop Environment (KDE), Gnome, and the Common Desktop Environment (CDE).

It would be impractical to give a full tutorial on navigating all the Unix desktop variants here, but a few hints will help you get started. First, many desktops make extensive use of the right mouse button. If in doubt about what to do next, pressing the right mouse button on the desktop, within a window, or in a window title bar often brings up a menu of possible commands. Some desktops also make use of the middle mouse button, which is a standard feature of Unix workstations but is not found on many PC mice. To emulate the middle mouse button, try pressing the left and right buttons simultaneously. Finally, most desktops have a built-in tutorial and help system which can usually be activated without too much flailing.

To log out of the desktop, look for menu items with names like “Log out,” buttons with the power on/off icon found on some electronic appliances, or icons that show a moon and stars.

Contributed by Lincoln D. Stein
Current Protocols in Bioinformatics (2006) A.1C.1-A.1C.24
Copyright © 2006 by John Wiley & Sons, Inc.


Figure A.1C.1 A typical login window for “logging in at the console.”

Figure A.1C.2 K Desktop Environment (KDE) at point of successful login.

Logging in Remotely

If the Unix system you wish to access is located remotely, you will use one of several remote access programs to log into it from your desktop machine. These programs range from extremely bare-bones terminal emulators that provide you with a 24-line by 80-character text-only window to sophisticated graphical emulators that will display the Unix graphical desktop on your PC or Macintosh.


Which terminal emulation program you use depends on the capabilities of your desktop machine, the configuration of your local area network, and what software is installed on the Unix machine. Typically, your system administrator or IT department will tell you what remote access software to use. Common remote login packages are listed below (see Internet Resources).

More information on using the graphical remote access protocols based on VNC and the X Window System is given in APPENDIX 1D. Here we will assume that you will be logging in using a text-only terminal emulator.

Logging into a remote system from a PC

If you are on a Microsoft Windows 95 or higher system, a simple terminal emulator is already installed on your system; however, it is a bit hidden. Select Run Command... from the Start menu, and when prompted type in telnet. This starts up the Telnet program, which displays a plain white window and simple menu bar. From the Telnet window’s Connect menu, select Remote System... to bring up a dialogue box that prompts you for the name of the host with which to connect and the connection settings you wish to use. By and large, the default connection settings will work, so don’t change them. Just enter the name of the Unix machine to which you wish to connect (using its dotted internet name or address) and press Connect.

Telnet will now attempt to connect to the indicated machine. If successful, the terminal window will display a login prompt (Fig. A.1C.3). Type your login name and password, pressing Enter each time. If you successfully log in, the remote host will print a greeting, a status message, and possibly a pithy quote of the day as shown in the figure. The remote host will start the command-line shell, and print an input prompt, which is shown in Figure A.1C.3 as the cryptic series of characters (~) 51%.

Since Unix is a multiuser operating system, you can log into the same system multiple times. Simply repeat the login procedure described above as many times as you wish.

Figure A.1C.3 Successful remote login using Telnet.


Logging into a remote system from a Unix system

If you are using a Unix system and wish to log into another system remotely, open up the command line window. Depending on how your system is configured, this may be called a “shell window,” “command window,” or “terminal window.” Now, use one of the terminal emulators Telnet or SSH to connect to the remote machine. Telnet uses an older communication protocol that sends all text across the network in unencrypted form. SSH uses a newer protocol that encrypts all outgoing and incoming communication. Because of security concerns, we highly recommend SSH if it is available, but using it requires that it be supported by both the local and the remote machines, which is usually, but not always, the case.

To connect to the remote machine named host.example.com using Telnet, type telnet host.example.com at the command line and press return. As described earlier, Telnet will attempt to contact the named machine, and, if successful, will prompt you for your username and password. SSH works similarly; simply type ssh host.example.com.

Logging into a remote system from a Macintosh

Recent versions of the Macintosh run Mac OS X. Mac OS X is itself a Unix system, and most, if not all, of the bioinformatics tools described in Current Protocols in Bioinformatics will run on a Macintosh under this platform. However, some of the tools that have graphical user interfaces will require installation of X Window server software. This is described in APPENDIX 1D, X Window Survival Guide.

In order to log into a remote Unix machine (e.g., Linux or even another Macintosh running under OS X), you will use the command line tool that comes standard with Mac OS X. Using the Finder, look for the Terminal application in the Applications area in the Utilities folder. Double-click the Terminal to launch it, then proceed as described for the Unix login. SSH is standard in Mac OS X, and we highly recommend that you use that application if the remote host supports it.

Logging out

To quit a terminal emulator session, you may either close its window or type logout at the command line prompt. You can also quit the emulator application entirely, but this will have the effect of closing all open sessions.

USING THE COMMAND SHELL

Despite the graphical desktop environments now becoming prevalent, Unix is still very much a command-line oriented system. You issue instructions to the system by typing cryptic commands in a terminal window, and the output of programs is displayed as text inside the same window. Most bioinformatics packages are command-line oriented, and even for those few that use windows, menus, and mouse clicks, you will still have to install and possibly invoke them from the command line.

The Unix program that accepts and processes commands is called the “shell.” It is a simple program that prints out a command-line prompt, waits for you to type a command and press the Enter key, and then runs the command. After the command is complete, the shell again prints the command line prompt, awaiting further instructions.

If you have logged into the system using a terminal emulator, you are already running a shell. Otherwise, if you are using a graphical desktop, you will need to launch a terminal emulator within the desktop environment in order to interact with the shell. To do this, look for a menu command called Shell, Terminal, Console, Xterm, or some variant of the above. Icons that launch terminal emulators take the form of stylized shells or little desktop PCs. Running one of the emulators creates a terminal window similar to those used by the Windows and Macintosh emulator programs. One advantage provided by the desktop environment terminal emulators is that you can resize them at will. You can also launch multiple emulators, and each one will run a different shell session.

Figure A.1C.4 Some shell command-line prompts.

Regardless of whether you have logged in graphically or remotely, the terminal emulator will be displaying a command-line prompt. The exact appearance of this prompt depends on the variant of Unix you are using, which shell program (there are several), and how the system has been configured. A few common command-line prompts are shown in Figure A.1C.4. Prompts typically contain a short amount of status information (e.g., the time of day, your login name, the hostname, or the number of commands you have typed) followed by one of the characters “%”, “>”, or “$”.

Working at the command line will be a foreign experience to many readers. Although it will never be completely painless, a few features do make working at the command line easier. First, most command-line shells offer in-place editing. You can use the left and right cursor keys to move the text insertion point back and forth on the command line in order to insert and delete characters. The backspace key will delete characters to the left of the insertion point, and the delete key, or sometimes Control-D, will delete characters to the right of the insertion point.

If you find yourself repeating many commands with minor variations, the up (↑) and down (↓) cursor keys will activate the shell’s “command history” feature. Pressing the up-cursor key will insert the last-issued command at the prompt. Pressing ↑ again will fetch the command previous to that, and so forth. You can press Enter to reissue the command, or use the cursor keys to edit the command prior to issuing it again.

Most shells also offer a “command completion” feature. With this feature, you can type the first few letters of a command or file name and then press the Tab key. The shell will complete the command for you, or, if what you typed was ambiguous, display a number of alternatives from which to make a selection.

Command Syntax

Unix commands are case-sensitive, meaning that the commands mkdir and Mkdir are not the same. The first command will create a new directory. The second is not recognized on typical Unix systems and will result in a Command not found warning. Unix commands typically take arguments that are separated from the command name by one or more spaces called “whitespace.” As a concrete example, the mkdir command takes a series of arguments giving the names of the directories to create. This command will create three directories named “docs,” “toy,” and “experiments”:

(~) 51% mkdir docs toy experiments


To pass an argument that contains whitespace, surround it with double or single quotes. In contrast to the previous example, this one will create two directories, one named “docs” and one named “toy experiments”:

(~) 51% mkdir docs "toy experiments"

Options

Many Unix commands accept “options” which modify their behavior. Depending on the command, its options may be single-letter codes preceded by a hyphen, as in -v, or fully spelled-out words preceded by two hyphens, as in --verbose. Options come after the command name and before any arguments. For example, to have the mkdir command print out what it is doing, use the --verbose option:

(~) 51% mkdir --verbose docs toy experiments
mkdir: created directory ‘docs’
mkdir: created directory ‘toy’
mkdir: created directory ‘experiments’

Getting Information on Commands

When given the -h or --help options, most commands will print out a brief usage summary. Try -h first, and if that doesn’t work try --help as shown in Figure A.1C.5.

Manual command

For more detailed help the man (manual) command is extremely useful. Invoke it with the name of the command you wish help on (e.g., man mkdir). This will display a page of detailed information on how to use the command. If you don’t know the name of the command for which you are looking, try the apropos command (e.g., “apropos directory”) to generate a list of commands that might have something to do with the function for which you’re looking.

The man command may use a “pager” to display a manual page that is longer than will fit comfortably into a terminal window. The pager is very simple. It displays a single page for your perusal. When you are ready for the next page, hit the Space Bar. When you are done reading, press “q” to quit. Some systems have more sophisticated pagers that will allow you to page up and down a line at a time using the cursor keys, or a page at a time using the Page Up and Page Down keys. Experiment a bit to see if your system supports this.

Figure A.1C.5 To get help on the use of the mkdir command, the (A) -h option is used (unsuccessfully), followed by the longer (B) --help (successful).

Suspending and Killing Commands

At some point while working with the command shell you will issue a command that either produces large amounts of output, takes a long time to run, or does something unexpected. In this case, you can interrupt a command in either of two ways.

To interrupt a command before it has finished running press Control-C. This means to press the Control key (marked Ctrl on most PC keyboards) and simultaneously press the (lowercase) “c” key. In most cases this will interrupt the command and return you to the command prompt. In rare cases you may need a more emphatic type of interruption. Try Control-\ (i.e., backslash while holding down the Control key).

To temporarily suspend a command without killing it entirely, press Control-Z. This will put the command into suspended animation and return you to the command prompt. To resume the command, type fg (foreground). You can suspend and resume a command as many times as you like.

All the Unix commands we have seen so far are short lived. For example, the mkdir command does its work and returns almost instantly. However, other commands are long lived. This is particularly true of commands that launch graphical programs such as Web browsers or text editors. In such cases you will not be able to use the command line until the program has finished executing and the command-line prompt has reappeared.

To avoid losing the use of the command line, you can place an ampersand “&” after the name of a program that will take a long time to execute. This will place the program in the “background” and return you to the command-line prompt immediately. For example, the command netscape & will launch the Netscape Web browser in the background. The Netscape window will appear, and you will be returned to the command-line prompt in the terminal window.

If you forget to add the ampersand and lose your command line, you can temporarily suspend the running program by typing Control-Z in the terminal window. The command-line prompt will reappear. Type bg (background), and the suspended program will be restarted in background mode.
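The background mechanism can be tried safely with a harmless stand-in. The following is a minimal sketch, assuming that the standard sleep utility plays the role of a long-running program; the variable names bg_pid and bg_status are made up for this illustration. The shell variable $! holds the process ID of the most recent background job, and the wait command blocks until that job has finished.

```shell
# Start a long-running command in the background; "sleep 2" is only a
# stand-in for a real program such as a Web browser or a long analysis.
sleep 2 &
bg_pid=$!    # process ID of the job that was just placed in the background

# The shell is free again, so other commands can run in the meantime.
echo "Job $bg_pid is running in the background"

# Block until the background job finishes, then record its exit status.
wait "$bg_pid"
bg_status=$?
echo "Background job finished with status $bg_status"
```

At an interactive prompt, the jobs command lists the programs that are currently suspended or running in the background.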

MANAGING FILES AND DIRECTORIES

Like other operating systems, a fundamental part of Unix is its support for files and directories. A file can contain text, computer code, word processing data, images or sounds, or any other data. A directory, equivalent to the Macintosh and Windows “folder,” contains files and/or other directories.

If you are logging in via a terminal emulator, you will have to learn to work with files via the command line. If you have a graphical login, chances are that the desktop environment provides a file browser. With the browser, you can view the contents of directories, peek into files, create new directories, move existing files and directories around, and so forth. Even so, you will need to learn the basic shell commands for manipulating files and directories.


Unix has the concept of the “current working directory,” the default directory that the various file manipulation commands operate on if not otherwise specified. When you first log in, the current working directory is set to your “home directory,” a directory to which you have full access and where you will normally store your personal files and other data.

List Command

To see the contents of your home directory, issue the ls (list) command (Fig. A.1C.6).

Fancy option

The ls command shows a formatted list of files and directories, but doesn’t provide any indication about which is which. For a more informative display, use the -F (fancy) option (Fig. A.1C.7). The ls -F command shows a marked-up version of the directory listing. Directories end in the slash character “/”, executable files (those that contain computer code) end in an asterisk “*”, symbolic links (a type of alias or shortcut) end in the “at” character “@”, while regular files have no special character at the end.

Some of the files shown in Figure A.1C.7 are text files. An example is INBOX, which contains a list of recent E-mail messages to the author. Others contain image data, such as chloroplast.png and plastid.png, which are both images of genomic annotations of the rice chloroplast. By convention, Unix file types are distinguished by distinctive file name extensions. For example, .png is used for a file that contains portable network graphics image data. Unlike some systems, where file extensions are limited to three characters, Unix extensions can be of any length.

Long version option

Another useful variant of ls is the long version, invoked with ls -lF. This adds detailed information to the listing. This form will tell you how large the file or directory is, which user owns it, and what its access permissions are (Fig. A.1C.8).

The first column of the long listing indicates the file permissions and its interpretation is beyond the scope of this appendix; however, it is handy to know that the d that sometimes appears at the beginning of the column indicates that the corresponding item is a directory.
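The leading d can be checked directly. The sketch below is an illustration only: it works in a scratch directory created with mktemp, uses touch to create an empty file, and uses the -d option of ls (which describes a directory itself rather than its contents). None of these three appear elsewhere in this appendix, and the names docs and INBOX are simply borrowed from the figures.

```shell
# Work in a scratch directory so nothing in the home directory is touched.
workdir=$(mktemp -d)
mkdir "$workdir/docs"      # a directory
touch "$workdir/INBOX"     # an empty regular file

# Long listing with the -F type markers.
ls -lF "$workdir"

# Keep only the first character of the permissions column for each item.
dir_type=$(ls -ld "$workdir/docs" | cut -c1)
file_type=$(ls -ld "$workdir/INBOX" | cut -c1)
echo "docs starts with '$dir_type', INBOX starts with '$file_type'"

# Clean up the scratch directory.
rm -r "$workdir"
```

The directory line begins with d, while the regular file line begins with a hyphen.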

Figure A.1C.6 Example output of the ls (list) command.

Figure A.1C.7 Example output of the ls (list) command with -F (fancy) option.


Figure A.1C.8 Example output of the ls (list) command with -lF (long version) option.

All option

By convention, Unix uses files whose names begin with a period (.) to hold software configuration information. Since there are many of these in your home directory, the ls command skips over these hidden files by default. To force ls to show all files, including those that are ordinarily hidden, use the -a option:

(~) 68% ls -a
.ptksh_history .qtella
.DCOPserver pesto@ .qtella.hosts
.FVWM2-errors .registry*
.FVWM95-errors .rhmapper
.ICEauthority .rhosts
.MCOP-random-seed .rnd
. . .
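The difference between ls and ls -a is easiest to see in a small directory. The following sketch assumes a scratch directory and two made-up file names, notes.txt and .settings; it is an illustration rather than a listing from a real home directory.

```shell
# Create a scratch directory holding one ordinary file and one "dot" file.
workdir=$(mktemp -d)
cd "$workdir"
touch notes.txt .settings

# Capture both listings so that they can be compared.
plain_listing=$(ls)
full_listing=$(ls -a)

echo "ls shows:"
echo "$plain_listing"
echo "ls -a shows:"
echo "$full_listing"

# Leave the scratch directory before deleting it.
cd /
rm -r "$workdir"
```

The plain listing contains only notes.txt, while the -a listing also includes .settings along with the special entries “.” and “..”.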

Directory Paths

To view the contents of a directory, you have several options. One is to use the directory name (e.g., docs) as the argument of the ls command (Fig. A.1C.9). The contents of the docs directory are mostly other directories. We can peek down even further by providing ls with a directory “path.” A path is simply a list of directories separated by slashes (Fig. A.1C.10).

If Unix paths remind you of Web URLs in any way, that isn’t a total coincidence. The Web was originally built on top of Unix.


Figure A.1C.9 Viewing the contents of the docs directory using the ls command.

Figure A.1C.10 Viewing the contents of the (A) talks subdirectory of docs (path: docs/talks) and the (B) networking1 subdirectory of talks (path: docs/talks/networking1) using the ls command. Also see Figure A.1C.8.

Figure A.1C.11 Viewing the contents of the docs directory by first changing the current working directory with the cd command, and then listing the contents with the fancy option using the ls -F command.

Change directory command

Another way of exploring a directory is to make it the current working directory so that ls operates on it by default. You do this with the cd (change directory) command (Fig. A.1C.11).

The cd command takes a single argument: the path of the directory to make the current working directory. The indicated directory then becomes the default directory for ls and other file utilities.

Sometimes the shell prompt will indicate the current working directory. By convention, the home directory is indicated using a tilde (~) symbol, so the prompt (~/docs) indicates that the current working directory is the docs directory inside the home directory.


Print working directory command

If your prompt doesn’t have a working directory indicator, you can find out the current directory with the pwd (print working directory) command:

(~/docs) 72% pwd
/home/lstein/docs

Unlike the shell prompt, pwd doesn’t indicate the home directory with a tilde (~), but prints out the complete path, which in this case is /home/lstein/docs.

Common Commands and Shortcuts

Move/rename command

To move a file or directory from one location to another use the mv (move) command. It takes two arguments: the file or directory to move and the location to move it to. For example, the following command will move the directory networking_tutorial and all its contents into the directory talks:

(~/docs) 72% mv networking_tutorial talks

The mv command can also be used to rename an existing file or directory. This example will rename the file mod_perl_book.tar.gz to modperl_book.tgz:

(~/docs) 73% mv mod_perl_book.tar.gz modperl_book.tgz

The difference between the two commands is that in the first case the second argument was an existing directory, and so was interpreted by the mv command as an instruction to move the first argument into that directory. In the second case, the second argument was not a directory, and so was interpreted by mv as a command to rename the indicated file.
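Both behaviors of mv can be reproduced in a scratch directory. This is a minimal sketch that recreates empty stand-ins for the items named in the examples (the scratch directory and the use of touch to create an empty file are assumptions made for illustration).

```shell
# Set up a scratch directory with stand-ins for the items in the examples.
workdir=$(mktemp -d)
cd "$workdir"
mkdir talks networking_tutorial
touch mod_perl_book.tar.gz

# The second argument is an existing directory, so the first argument
# is moved into it.
mv networking_tutorial talks

# The second argument does not exist, so the file is renamed.
mv mod_perl_book.tar.gz modperl_book.tgz

# Show the result: talks/ now contains networking_tutorial/.
ls -F . talks
```

The scratch directory can be deleted afterwards with rm -r.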

Copy command

The cp command will make a copy of a file (but not a directory). It is simple to use:

(~) 74% cp INBOX INBOX.bak

This creates an identical copy of the file INBOX named INBOX.bak.

Make and remove directory commands

To create a new directory, use the mkdir (make directory) command. This takes a list of one or more directory names and creates them in the current working directory. To remove an empty directory, use rmdir (remove directory). This command will fail if the directory is not empty.

Remove command

To delete a file, use the rm (remove) command. It takes a list of files and deletes them. The deletion is irrevocable; i.e., unlike Windows and Macintosh systems, there is no recycle bin or trashcan from which to retrieve deleted files. A useful variant of rm is rm -r, which will delete a directory and all its contents; however, be careful with this, as it is easy to delete more than you intend.
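The contrast between rmdir and rm -r can be demonstrated without risk in a scratch directory. In the sketch below, the directory experiments and the file run1.txt are made-up names, and 2>/dev/null (which discards the error message printed by rmdir) is an addition not covered in this appendix.

```shell
# Build a scratch directory containing one non-empty subdirectory.
workdir=$(mktemp -d)
cd "$workdir"
mkdir experiments
touch experiments/run1.txt

# rmdir refuses to remove a directory that still contains files.
if rmdir experiments 2>/dev/null; then
    rmdir_result="removed"
else
    rmdir_result="refused"
fi
echo "rmdir on a non-empty directory: $rmdir_result"

# rm -r removes the directory and everything inside it; there is no undo.
rm -r experiments
if [ -e experiments ]; then
    after_rm="still there"
else
    after_rm="gone"
fi
echo "after rm -r: $after_rm"
```

Because rm -r cannot be undone, it is worth running ls on the target first to confirm what is about to be deleted.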

Wild cards

The shell provides several convenient shortcuts. One is wild cards, which allow you to refer to several files or directories at once. An asterisk “*” appearing in a command line argument is treated as a wild card that can match any series of zero or more characters, while a question mark “?” can match any single character.


Figure A.1C.12 Using wildcards with the ls command (fancy option, -F) to display all (A) PNG files and (B) files containing the text “plastid.”

Figure A.1C.13 Listing the contents of a directory using the double dot “..” abbreviation for the parent directory.

Using wild cards, you can refer to all PNG files as shown in Figure A.1C.12A, or to all files that contain the text “plastid” as shown in Figure A.1C.12B.
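The wild card patterns described above can be tried on a handful of empty files. This sketch assumes a scratch directory; chloroplast.png, plastid.png, and INBOX are borrowed from the figures, while plastid_notes.txt is a made-up name added so that the two patterns give different results.

```shell
# Create a scratch directory with a few empty files to match against.
workdir=$(mktemp -d)
cd "$workdir"
touch chloroplast.png plastid.png plastid_notes.txt INBOX

# "*" matches any run of zero or more characters.
png_files=$(ls *.png)             # names ending in .png
plastid_files=$(ls *plastid*)     # names containing "plastid"

# "?" matches exactly one character.
single_char=$(ls INBO?)

echo "PNG files:"
echo "$png_files"
echo "Files containing plastid:"
echo "$plastid_files"
echo "INBO? matches: $single_char"
```

Note that chloroplast.png matches *.png but not *plastid*, since its name contains “plast” but not “plastid”.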

Directory abbreviations

If you are in a nested subdirectory and you want to refer to the directory above the current one, you can refer to this directory with the special name “..” (two dots). For example, the following command, when executed from your home directory, will list the contents of the directory that contains it:

(~) 85% ls -F ..
ftp/ lost+found/ lstein/ siao/ testuser/ todd/ www/

The “..” can be used in a directory path just like any other directory name as shown in Figure A.1C.13.

The symbol “.” stands for the current working directory.

The shell also lets you use the tilde symbol “~” (found in the upper left-hand corner of most keyboards) to refer to your home directory. You can obtain a listing of your home directory like this:

(~/docs) 88% ls -F ~

and return to your home directory from wherever you are by typing:

(~/docs) 89% cd ~

For your convenience, typing cd alone will also return you to your home directory, making it the current working directory.
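The “.” and “..” abbreviations can be combined with cd and pwd to move around a directory tree. The sketch below is an illustration under assumptions: it builds a small docs/talks tree in a scratch directory, and it uses mkdir -p (which creates missing parent directories) and basename (which keeps only the last component of a path), neither of which is covered in this appendix.

```shell
# Build a small nested tree in a scratch directory.
workdir=$(mktemp -d)
mkdir -p "$workdir/docs/talks"

cd "$workdir/docs/talks"

cd ..                                  # up one level, into docs
after_dotdot=$(basename "$(pwd)")

cd ./talks                             # "." is the current directory
after_dot=$(basename "$(pwd)")

cd ../..                               # ".." can appear more than once in a path
after_two_up=$(basename "$(pwd)")

echo "after cd ..      : $after_dotdot"
echo "after cd ./talks : $after_dot"
echo "after cd ../..   : $after_two_up"
```

The last line printed names the scratch directory itself, two levels above talks.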

WORKING WITH TEXT FILES

Creating and manipulating files of text is central to most bioinformatics activities. Unix gives you a large number of ways to manipulate text files.


Figure A.1C.14 Viewing the first 1% of the text file genomic-seq.fasta, located in the directory projects/data, using the more command. Note that some of the screen has been deleted to conserve space.

The fastest and easiest way to view the contents of a text file is with the more command. It takes a list of one or more text file names, and displays them on the screen one page at a time. This works even with very large files. See Figure A.1C.14 for an example.

The --More-- prompt at the bottom of the screen indicates that there is more of the file to display and gives the approximate position of the region that is being displayed, in this case the top 1%. As described earlier, you can page through the file from top to bottom by pressing the Space Bar. Press “q” to stop viewing the file.

If your system has it, the less command is recommended as an improved version of more. It works just like more, but allows you to page upwards as well as downwards by pressing the Page Up and Page Down keys. It also allows you to navigate a line at a time using the up (↑) and down (↓) cursor keys, and to search through the file for words and phrases.

Redirecting Output to a File

Much of the software used in bioinformatics produces large amounts of text data. This information is often written directly to the terminal, and it can be frustrating to see something interesting scroll by into the irretrievable oblivion beyond the top of the terminal window.

One way to deal with this situation is to use the output redirection feature of the Unix shell. The output of any command can be redirected into a file by following the command with a “>” sign followed by the name of the file you wish to create. For example, the blastn command (see UNIT 3.3) will write the results of its search to the terminal window. To redirect this to a file named blastn.out, issue the command as shown in Figure A.1C.15A. After the command completes, you will find its output in blastn.out, which you can then inspect with more, less, or a text editor.

If the file indicated with “>” already exists, it will be overwritten, erasing whatever was there before. If you prefer to append the command output to the file, leaving its previous contents intact, use “>>”, as shown in Figure A.1C.15B.
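The difference between overwriting and appending can be seen with a command that is always available. This sketch assumes that echo stands in for a real program such as blastn, and that results.out is a made-up file name; the “<” used with wc -l feeds the file to the command as input and is an addition not covered in this appendix.

```shell
# Work in a scratch directory so that no existing file is overwritten.
workdir=$(mktemp -d)
cd "$workdir"

echo "first run"  > results.out     # creates results.out
echo "second run" > results.out     # overwrites it; "first run" is lost
lines_after_overwrite=$(wc -l < results.out)

echo "third run" >> results.out     # appends; "second run" is kept
lines_after_append=$(wc -l < results.out)
last_line=$(tail -n 1 results.out)

echo "lines after overwrite: $lines_after_overwrite"
echo "lines after append:    $lines_after_append"
```

The file holds one line after the two “>” commands and two lines after the “>>” command.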

Redirecting Output to More

Another handy alternative, useful for those cases when there is more output from a command than will fit onto a terminal screen, but you don’t need to save the information to a file, is to redirect output directly to the more program. You can do this using the “pipe” or vertical bar symbol “|”:

(~) 107% blastn humseq202 data/genomic-seq.fasta | more


Figure A.1C.15 Redirecting the results of the blastn command to the file blastn.out. Note that (A) using a single greater-than sign “>” causes any previous copy of blastn.out to be overwritten, while (B) using a double greater-than sign “>>” will cause the current output to be appended to the existing blastn.out file.

Table A.1C.1 Graphical Text Editors

Desktop environment    Editor
Gnome                  gedit
KDE                    kedit
CDE                    dtpad

Blastn’s output will now be captured by more and displayed a screen at a time for easy viewing.
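Pipes are not limited to more; the output of one command can feed any command that reads text. The following sketch is an illustration under assumptions: it builds a tiny made-up FASTA-style file (toy.fasta) and uses the standard cat, grep, and head utilities, none of which are covered in this appendix.

```shell
# A tiny FASTA-style file invented for this example.
workdir=$(mktemp -d)
printf '>seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n' > "$workdir/toy.fasta"

# cat writes the file to its output; the pipe feeds that output to grep,
# which counts (-c) the lines that begin with ">" (one per sequence).
sequence_count=$(cat "$workdir/toy.fasta" | grep -c '^>')
echo "toy.fasta contains $sequence_count sequences"

# Another pipe: keep only the header lines, then show the first two.
first_headers=$(grep '^>' "$workdir/toy.fasta" | head -n 2)
echo "$first_headers"

# Clean up the scratch directory.
rm -r "$workdir"
```

Replacing head -n 2 with more in the last pipeline would page through the header lines a screen at a time instead.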

Unix Text Editors

To work effectively with Unix-based bioinformatics software you will need to be able to create and modify text files from scratch. This means becoming proficient with one or more of the Unix text editors.

Unlike the more familiar word processors, Unix text editors produce files that are devoid of any fancy fonts or formatting. They also have a reputation for being unfriendly to novice users. This is only partly true. The graphical desktop environments are each equipped with user-friendly text editors similar in style to the Windows Notepad desk accessory. Table A.1C.1 lists the names of the graphical text editors in each of the three most popular Unix desktop environments. Each one is easily reachable via a menu item or icon.

If you have access to one of the graphical editors, you should have no problem creating and editing text files since everything can be done using mouse clicks and menu commands. If you must interact with Unix via a text-only terminal, life will be slightly more interesting.

pico

If you have never used a Unix text editor before, it is suggested that you begin with the pico text editor. This editor is installed on most (but not all) Unix systems, and has a relatively straightforward user interface. To launch it, type pico at the command line. This will replace the contents of the terminal window with the editor screen shown in Figure A.1C.16. The middle of the screen is the current contents of the text file. Use the cursor keys to move around in the file, the Backspace key to delete text to the left of the insertion point, and the Delete key to delete text to the right of the insertion point.

Various Control key combinations allow you to read files, save files, and exit the program. The currently available commands are listed at the bottom of the pico window using notation in which a caret “^” means the Control key. So ^X Exit means to press Control-X in order to exit the program.


Figure A.1C.16 Pico editor screen.

To create a text file from scratch, launch pico, type the text, and then press Control-O to write out (i.e., output) the file. You will be prompted to type in the name of the file. To read in an existing file, press Control-R. You will be prompted for a file name, which will then be appended to the bottom of whatever is currently on display. Another way to edit an existing file is to give its name to pico on the command line. For example, for the file test.txt, the following will cause pico to open and edit the file:

(~) 108% pico test.txt

Other text editors

Pico has relatively limited abilities. Much more powerful Unix text editors are the aforementioned vi editor, as well as Emacs. Using vi requires learning a set of cryptic keyboard-based commands. Emacs is slightly easier to use in the graphical X Window environment because it provides menus for most common commands, but it is not nearly as straightforward as more familiar word processors. However, if you plan to work heavily in the Unix environment, it is worth investing some time learning one of these editors. Good introductions to vi and Emacs can be found in most general-audience Unix books.

CHANGING THE ENVIRONMENT

Various Unix commands, and several bioinformatics programs, are dependent on “environment variables,” a set of configuration variables that are set up for you each time you log into the system. In this section, we walk through a practical example of changing an environment variable.

VISUAL and SHELL variables

Various Unix commands will automatically invoke a text editor for you when needed (e.g., when examining a configuration file). The default text editor is vi, a powerful but extremely cryptic text-based editor. To make pico your default editor, you must alter the value of an environment variable named VISUAL.


To change the VISUAL environment variable you must edit one of the hidden “dot” files located in your home directory. Which file you edit depends on which shell interpreter you are using. To discover which shell you are using, you will examine the contents of another environment variable named SHELL. Run the command echo $SHELL:

(~) 51% echo $SHELL
/bin/tcsh

The echo command simply echoes back its arguments, which in this case is the contents of the SHELL environment variable. The command will print out the path to the shell program that is currently running, which will most likely be one of /bin/tcsh, /bin/csh, /bin/ksh, /bin/bash, or /bin/sh. If the shell is either tcsh or csh, then the configuration file you will edit is .cshrc. For any other shell, you will instead edit the file .profile.
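The rule just described can be written down as a small shell function. This is a sketch for sh-compatible shells only (it will not run under csh or tcsh); the function name config_file_for_shell is hypothetical, and the basename utility it relies on is not covered in this appendix.

```shell
# Print the name of the startup file to edit for a given shell path,
# following the rule in the text: .cshrc for tcsh and csh, and
# .profile for any other shell.
config_file_for_shell() {
    case "$(basename "$1")" in
        tcsh|csh) echo ".cshrc" ;;
        *)        echo ".profile" ;;
    esac
}

config_file_for_shell /bin/tcsh    # prints .cshrc
config_file_for_shell /bin/bash    # prints .profile

# Apply the rule to the shell recorded in the SHELL environment variable,
# falling back to /bin/sh if SHELL happens to be unset.
echo "Edit ~/$(config_file_for_shell "${SHELL:-/bin/sh}")"
```

The case statement compares only the last component of the path, so a shell installed somewhere other than /bin is handled the same way.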

We will first assume that you are running tcsh or csh, and therefore need to edit the file .cshrc in your home directory. First, make a copy of the current version of .cshrc, using the cp (copy) command. Name the copy cshrc.orig:

(~) 52% cp ~/.cshrc ~/cshrc.orig

Now using pico, edit .cshrc:

(~) 52% pico ~/.cshrc

If the file does not already exist, create it. Scroll to the bottom of the file and add this line:

setenv VISUAL pico

Save the file, and then log out of the shell. Log in again and confirm that VISUAL is now set to pico:

(~) 56% echo $VISUAL
pico

The procedure is slightly different for the bash, ksh, or sh shells. In this case, the file to change is .profile, also located at the top level of your home directory. Create a copy of .profile as described earlier for .cshrc. Using pico (or another text editor), open or create this file, and then add the following two lines:

VISUAL=pico
export VISUAL

Log out and in again, and run echo $VISUAL to confirm that the environment variable has indeed been set.
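The reason for the second line, export VISUAL, can be shown with a short experiment for sh-compatible shells. In this sketch a child shell started with sh -c stands in for any program launched from the command line; the variable names seen_before_export and seen_after_export are made up for the illustration.

```shell
# Start from a clean slate so that the effect of export is visible.
unset VISUAL

# Setting a variable makes it known to the current shell only.
VISUAL=pico
seen_before_export=$(sh -c 'echo "${VISUAL:-unset}"')

# Exporting it places it in the environment passed to child programs.
export VISUAL
seen_after_export=$(sh -c 'echo "${VISUAL:-unset}"')

echo "child sees before export: $seen_before_export"
echo "child sees after export:  $seen_after_export"
```

Only after the export does the child process see the value pico, which is why both lines are needed in .profile.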

Other variables

You can follow this procedure to add or modify any number of environment variables. Just be sure to put each setenv or export command on a separate line. If you make a mistake, your shell may start misbehaving. Don’t panic. Just copy the original back into place:

(~) 56% cp cshrc.orig .cshrc


INSTALLING SOFTWARE

The last topic that we’ll cover in this survival guide is installing and upgrading software. This is a task that is usually best left to a system administrator, but for many readers this is not a viable option.

Most Unix software, bioinformatics included, is distributed in source code form as tar.gz files. The tar program is first applied to the software to archive its many files and directories into a single file, and then the gzip compression program is used to compress the archive for easy transmission over the Internet.

Although there are an infinite number of variations, downloading and installing Unix software follows this general theme:

1. Identify a Web or FTP site that has the desired software and download it to the Unix system.

2. Uncompress and unarchive the package.

3. Read the README and/or INSTALL documentation.

4. Configure the package.

5. Compile the software.

6. Install the software.

As an ordinary user of the system you can perform all but the very last step of this process; however, the final install of software requires that you have write permission to portions of the Unix system that are usually off-limits to ordinary users. To install software in its usual place requires that you log in as the privileged user known as “root,” using the password for the root account; however, if you do not know the root password, you can still install software in your home directory. In the example that follows we will install a new software package as root, and then as an unprivileged user.

The example we will use is the MySQL package, a popular open source relational database that is used as the exemplar of database management systems in Chapter 9.

Downloading (FTP)

The first step is to download the software archive onto the Unix system. You will eventually generate quite a collection of these archives, so create a directory named src under your home directory, and make it your current working directory:

(~) 101% mkdir src
(~) 102% cd src

If the software is located on an FTP site, you will use the ftp command to download it. If the software is located on a Web site, you can use Mozilla Firefox (with the firefox command) if you have a graphical login, or the text-only Web browser lynx if you are using a terminal emulation program. Both applications are self-explanatory.

In the case of MySQL, we will connect to the FTP site ftp.mysql.com using the ftp command. When prompted for a username, we enter the name anonymous, and give our E-mail address when prompted for a password (Fig. A.1C.17).


Figure A.1C.17 Login screen of the MySQL FTP site using anonymous as the login name and the user’s e-mail address as the password.

Figure A.1C.18 (A) Changing the working directory using the cd command, (B) listing files using the ls command, and (C) retrieving the mysql-3.23.46.tar.gz file using the get command, all within the FTP program shell. We chose the mysql-3.23.46.tar.gz file after determining that it was the most recent version of the MySQL distribution. Note that some of the listing has been omitted to conserve space.

Get command

After logging into the remote site, the prompt changes to ftp>, indicating that the command line is now under control of the FTP program. The FTP program contains a miniature shell that recognizes the Unix cd and ls commands, with the difference that these commands operate on the remote FTP site rather than on the local machine. Using these commands we navigate to the desired directory and use the get command to download the file containing the MySQL source distribution (Fig. A.1C.18). We choose the .tar.gz file with the most recent date. Other files in this subdirectory end with the Microsoft Windows extension .zip.


Quit command

When the download is finished, we issue the quit command, and the FTP program exits, returning us to the Unix shell:

ftp> quit
221 Goodbye.
(~/src) 104%

Uncompressing and unarchiving

The next step is to unpack the MySQL distribution. Return to the home directory and create a new directory named build. This will be used as a temporary place in which to build new software prior to installing it:

(~/src) 105% cd ~
(~) 106% mkdir build
(~) 107% cd build

We will now uncompress and unarchive the MySQL distribution in a single step. This uses a trick in which the output of the gunzip program, which uncompresses the archive, is fed directly into the input of the tar program, which unarchives the software. The magical incantation and its results are shown in Figure A.1C.19. Notice how the “~” symbol is used as a shortcut to indicate the home directory. This will create a directory named mysql-3.23.46 containing the unpacked MySQL source code distribution. As each file is unpacked, its name is printed on the terminal.
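The exact command appears only in Figure A.1C.19, so the following is a sketch of one commonly used form of the gunzip-to-tar pipe rather than a transcription of the figure. To keep it self-contained, it first builds a toy archive named toy-1.0.tar.gz in a scratch directory (a made-up stand-in for the MySQL download) and then unpacks it in a separate build directory.

```shell
# Build a toy "distribution" so the example does not depend on a download.
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p src toy-1.0
echo "Toy package" > toy-1.0/README
tar cf - toy-1.0 | gzip > src/toy-1.0.tar.gz    # archive, then compress
rm -r toy-1.0

# Unpack it in a separate build directory, as described in the text.
mkdir build
cd build

# gunzip -c writes the uncompressed archive to its output; "tar xvf -"
# reads the archive from its input ("-") and lists each file it unpacks.
gunzip -c ../src/toy-1.0.tar.gz | tar xvf -

unpacked_readme=$(cat toy-1.0/README)
echo "README says: $unpacked_readme"
```

For the download described above, the same pipe would presumably be run from the build directory with ~/src/mysql-3.23.46.tar.gz as the archive name.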

Reading Documentation

We enter this new directory and look for a file named README, INSTALL, or something similar. In this case there is a README file, which contains a general description of MySQL, and a more specific file named INSTALL-SOURCE which contains step-by-step instructions on building and installing the software.

Configure Package

MySQL is typical of software that is written in the C programming language. You first run a script contained within the distribution directory called configure. This checks that any libraries or other software on which the package depends are installed, and configures the package with values that are appropriate for the variant of Unix that the machine is running. After configure successfully completes, run the make program, which compiles the source code into machine-readable computer code. Finally, you give the command make install to move the compiled code into the appropriate locations for installed software.

Figure A.1C.19 Uncompressing and unarchiving the MySQL distribution in a single step. Note that the file listing, which runs to hundreds of lines, has been truncated in the interest of space.


Figure A.1C.20 Invocation and results of the configure script in the mysql-3.23.46 directory. Note that the middle portion of the output has been omitted.

Figure A.1C.21 Running the make command. Note that most of the output has been omitted.

Configure script

We will step through this process. First, we run the configure script located in the mysql-3.23.46 directory. Since there might be other configure programs installed on your machine, we take care to run the particular one that MySQL comes with by using a path starting with “.” to indicate the configure script located in the current working directory as shown in Figure A.1C.20.

Thankfully the configure script ran to completion. If it had detected that some software package that MySQL depends on was missing from the system, it would have failed part way through and notified us of the problem.

Compile

Make program

We now run make. The make program is a standard part of Unix rather than a MySQL-specific script, so we do not need to specify the current directory in its path (Fig. A.1C.21). Making the package is an involved process that takes several minutes to complete. If any errors are encountered during the process, make will terminate with an obvious **error** message (you can safely ignore any warnings that are issued). If make does fail, your best option is to refer the problem to someone more knowledgeable. Otherwise, you can proceed to make test and make install (see below).

Make test

Some software packages come with a set of tests that you can run to ensure that they have compiled correctly. If such tests are defined, you can invoke them with the command make test as shown in Figure A.1C.22.


Figure A.1C.22 Using the software-included make test command. Note that most of the output has been omitted for brevity.

All tests passed, so we can feel confident that MySQL will function properly. For packages that do not have any tests defined, the make test command will produce an error message similar to Don’t know how to make test. Stop. In this case, just skip the step.
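The configure and make steps described above can be chained so that each one runs only if the previous one succeeded. The following is a minimal sketch under that assumption, not part of the MySQL distribution; the function name build_package is hypothetical.

```shell
#!/bin/sh
# Sketch: run a package's own configure script, then make.
# build_package is a hypothetical helper name chosen for illustration.
build_package() {
    # Run in a subshell so the caller's working directory is unchanged.
    # "./configure" selects the script in the source directory rather
    # than any other configure program on the PATH; "&&" stops at the
    # first step that fails.
    ( cd "$1" && ./configure && make )
}
```

A call such as build_package ~/build/mysql-3.23.46 would leave make test and make install to be run by hand, as the text describes.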

Install

Su command

The last step is to make install. The catch here is that you will have to log in as the root user in order to run this command. Assuming that you know the root password, you can use the su command to temporarily assume the identity of root without having to log out and in again:

(~/build/mysql-3.23.46) 124% su
Password: ********
bash-2.05#

After issuing the su command, the system prompts us for the root password. After entering it, the shell prompt changes, telling us that we are now logged in as the root user (the prompt character “#” is usually, but not necessarily, reserved for root). You can skip this step if the Sudo program is installed and configured, as described later.

Make install command

We then run make install, and wait while the system copies the MySQL software into its installed locations, as shown in Figure A.1C.23.

Exit command

After make install completes, we issue the exit command to return to our normal user privileges.

bash-2.05# exit
(~/build/mysql-3.23.46) 125%

You do not want to remain logged in as root longer than you need to, because as root you have access to commands that can seriously damage the system if issued inadvertently.


Figure A.1C.23 Copying the MySQL software into its installed locations using the make install command. Note that most of the output has been omitted for brevity.

There are now several steps that are specific to MySQL, including setting up databases and user accounts. These steps are described in the INSTALL-SOURCE file. Since they are not applicable to the general case, we won’t cover them here.

Make install with Sudo

On many shared Unix systems you will not have access to the root password. In that case, the system administrator may be able to give you limited root privileges using the Sudo system. You may then be able to skip the su step entirely and run the privileged install step by issuing the command sudo make install. The system will prompt you for your password, become the root user temporarily, run make install, and then return you to your normal user status. Please see your system administrator for help with this. If you are the system administrator, you can read about how to configure Sudo by consulting the sudo and visudo manual pages (see The Manual Command, above).

Installing Software into your Home Directory

What if you don’t have the root password or Sudo privileges? With a little additional effort, you can install the software package in your home directory, something that you don’t need root access to perform.

The key is to pass the optional --prefix= option to the configure script, specifying a location in your home directory after the equal sign. My home directory is /home/lstein and I would like MySQL to install itself in a subdirectory named mysql, so I pass the option --prefix=/home/lstein/mysql, as shown in Figure A.1C.24.

Be sure to use the full path to your home directory here. If you are unsure of the correct value, cd to your home directory and then issue the pwd command.

Now run the make and make test commands as described earlier. If all goes well, run make install. Since you are installing into your home directory, there is no reason to become root.
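The home-directory install described above can be gathered into one helper. This is a sketch only: install_to_home is a hypothetical name, the default subdirectory mysql follows the example in the text, and make test is left out for brevity.

```shell
#!/bin/sh
# Sketch: configure, build, and install under the home directory.
# install_to_home is a hypothetical helper; "mysql" as the default
# subdirectory name follows the example in the text.
install_to_home() {
    src_dir=$1
    prefix="$HOME/${2:-mysql}"    # $HOME expands to a full path, as --prefix needs
    ( cd "$src_dir" &&
      ./configure --prefix="$prefix" &&
      make &&
      make install )              # no su or sudo: the prefix is in our own home
}
```

Because the prefix lies inside the home directory, every step runs with ordinary user privileges.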

MySQL has been installed in your home directory. What now? If you inspect the contents of the ~/mysql directory, you will discover that the installation process created a number of subdirectories (see Managing Files and Directories; Fig. A.1C.25).

By convention the bin subdirectory contains executable files (commands), man contains documentation, and include and lib together contain packages of code for use by software developers. The other directories contain mysql-specific components. The mysql program lives in ~/mysql/bin, and you can run it by typing its complete path:


Figure A.1C.24 Providing a directory into which MySQL can install itself.

Figure A.1C.25 Subdirectories of ∼ created during installation.

(~/build/mysql-3.23.46) 141% ~/mysql/bin/mysql

If you like, you can set up your environment so that ~/mysql/bin is searched automatically whenever you type a command. This involves setting the environment variable PATH, which contains a list of directories to be searched for executables (see List Command).

As described earlier, the procedure to follow depends on which shell you are using. If you are using tcsh or csh, open the file .cshrc and add the following to the bottom:

setenv PATH ~/mysql/bin:$PATH

This will set the PATH environment variable to contain ~/mysql/bin, followed by whatever was on the PATH before. Log out and in again. You should now be able to type mysql without qualifying it with a path.

If you are using the bash, ksh, or sh shells, open .profile and add the following to the bottom:

export PATH=~/mysql/bin:$PATH

Again, check that when you log out and in again, mysql is found automatically.
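For the sh family of shells, a slightly more defensive variant of the PATH line can avoid adding the same directory twice if the startup file is read more than once. The sketch below is one way to do that; prepend_path is a hypothetical helper name, and the duplicate check is an addition not found in the text above.

```shell
#!/bin/sh
# Sketch: put a private bin directory at the front of PATH, once.
# prepend_path is a hypothetical helper name.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present; do nothing
        *) PATH="$1:$PATH" ;;
    esac
    export PATH
}

prepend_path "$HOME/mysql/bin"
```

Calling prepend_path a second time with the same directory leaves PATH unchanged.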

As before, you are advised to make copies of .cshrc or .profile before you do this. If you mess up PATH, the system may not be able to find any commands, including the cp command required to restore the original version of .cshrc or .profile. This isn’t a cataclysm. Simply refer to cp using its explicit path, /bin/cp:

(~) 142% /bin/cp cshrc.orig .cshrc

This will put .cshrc back the way it was before you modified it.
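The backup-and-restore habit can be wrapped in two small helpers. This is a sketch under two assumptions: the names backup_file and restore_file are hypothetical, and the backup is kept next to the original with an .orig suffix, which may differ from the naming you chose earlier.

```shell
#!/bin/sh
# Sketch: keep a backup of a startup file and restore it by calling cp
# through its explicit path, which works even when PATH is unusable.
# backup_file and restore_file are hypothetical helper names; the
# ".orig" suffix is an assumed naming scheme.
backup_file()  { /bin/cp "$1" "$1.orig"; }
restore_file() { /bin/cp "$1.orig" "$1"; }
```

For example, backup_file ~/.profile before editing, and restore_file ~/.profile if the edit goes wrong.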

If you install a program and later move it to another location on your PATH, the system may not be able to find it until you log off and in again. With the csh and tcsh shells, the command rehash may help the system find the command without doing this.

CONCLUSION

Unix will feel alien and intimidating at first. Do not be inhibited, but feel free to explore and experiment with the Unix environment. With experience, you may eventually come to tolerate, if not appreciate, Unix’s alternative take on the world.


KEY REFERENCES

Frisch, A. 1996. Essential System Administration, 2nd Edition. O’Reilly and Associates, Sebastopol, Calif.
A slightly more advanced book that emphasizes troubleshooting.

Nemeth, E., Snyder, G., and Seebass, S. 1995. Unix System Administration Handbook. Prentice-Hall, Englewood Cliffs, NJ.
This is a user-friendly and comprehensive guide to working on Unix systems. Although aimed at system administrators, it is highly recommended for newcomers to the Unix environment.

Raymond, E.S. (ed.) 1996. The New Hacker’s Dictionary, Third Edition. MIT Press, Cambridge, Mass.
An introduction to the Unix culture.

Sobel, M.G. 1998. Hands-On Linux. Addison-Wesley, Reading, Mass.
An introduction to Linux.

Welsh, M. 1999. Running Linux, Third Edition. O’Reilly and Associates, Sebastopol, Calif.
A user-level guide to the Linux operating system.

INTERNET RESOURCES

Login Packages for Windows

http://www.microsoft.com
Microsoft Web site for downloading Telnet. Bare-bones terminal emulator using the Telnet protocol and text-only login. Built into Windows 95 and higher.

http://www.securenetterm.com
NetTerm Web site. More configurable terminal emulator using Telnet and Secure Shell protocols using text-only login.

http://www.vandyke.com/products/crt
CRT Web site. Full-featured terminal emulator using Telnet and rlogin protocols using text-only login.

http://www.chiark.greenend.org.uk/~sgtatham/putty/
PuTTY Web site. Bare-bones terminal emulator using Telnet and Secure Shell protocols using text-only login (freeware).

http://www.uk.research.att.com/vnc
VNCviewer Web site. Graphical login using the lightweight VNC protocol (freeware).

http://www.hummingbird.com/products/nc/exceed/
eXceed Web site. Graphical login using the network-intensive X Windows protocol.

http://www.powerlan-usa.com
WebTermX Web site. Graphical login using the network-intensive X Windows protocol.

http://www.starnet.com/products
X-Win32 Web site. Graphical login using the network-intensive X Windows protocol.

Contributed by Lincoln D. Stein
Cold Spring Harbor Laboratory
Cold Spring Harbor, New York


APPENDIX 1D

X Window System Survival Guide

When you log into a Unix system from the console (APPENDIX 1C), you are typically dropped into a graphical desktop environment that is similar, but not identical, to the Microsoft Windows and Apple Macintosh desktops. From this desktop, you can run windowing applications such as text editors, office productivity tools, and other familiar types of applications. The Unix desktop uses a flexible windowing environment called the X Window System, or X for short.

Some Unix-based bioinformatics applications take advantage of this desktop environment. A good example is David Gordon’s Consed program (UNIT 11.2) for editing sequence assemblies produced by the PHRAP assembler (see UNIT 11.4). However, even if you run text-only bioinformatics tools, it is liberating to be able to run them in an environment in which you can open multiple resizable terminal windows.

A problem arises when you are logging into a Unix system remotely via an MS Windows or Macintosh terminal emulator. This typically limits you to a small text-only window of 24 rows by 80 columns, and any attempt to launch graphical applications will terminate in the error message Can’t open display. For those who would like to keep an MS Windows or Macintosh machine on their desk and log into a Unix server from time to time, this is a major annoyance.

Fortunately, Unix provides a solution. The X Window System makes it possible to run applications on the remote Unix machine and have multiple Unix terminal windows appear on the local desktop. This works equally well for Macintosh and Microsoft Windows computers and for connections from one Unix machine to another.

RUNNING X APPLICATIONS LOCALLY

X Window System applications can be run either locally or remotely. In the first case, the application runs on the same machine to which the screen, keyboard, and mouse are connected. In the latter case, the application runs on another machine located somewhere else on the local area network or the Internet, but its windows and other user interface elements appear on the screen of the local machine.

If you are using Solaris, Linux, or another Unix-like operating system, X is already installed and running when you log in. To launch a bioinformatics application written for X, simply type its name at the command line or click on its icon in your system’s desktop manager.

Things are similar with MacOS X, except that X is not typically installed by default. If you are using MacOS X 10.3 or 10.4, you will need to first install the “X11” application, located on disk #3 of the MacOS X install disks. Alternatively, the X11 application can be downloaded from http://www.apple.com/downloads/macosx/apple/x11formacosx.html. Once X11 is installed, start it by double-clicking on its icon, located under “Applications/Utilities...” This will bring up a simple terminal window named “xterm.” Within xterm, you can launch X Window applications by typing their names at the command line. Alternatively, you can simply double-click on the application’s icon in the Finder and MacOS X will launch both the selected application and X to manage the application’s windows. This works because most graphical bioinformatics applications have been ported to MacOS X.

Contributed by Lincoln D. Stein
Current Protocols in Bioinformatics (2007) A.1D.1-A.1D.11
Copyright © 2007 by John Wiley & Sons, Inc.


Supplement 17


It is more difficult to run bioinformatics X applications locally from within MS Windows systems, because few such applications have been ported to this environment. To access graphical X applications from a PC, you will usually need to run the application remotely on a Unix system and arrange for its windows to be displayed on the local PC.

RUNNING X WINDOW SYSTEM APPLICATIONS REMOTELY

There are two major options for running X applications remotely in such a way that their windows appear on the local machine: the VNC virtual desktop and the X Window System itself. The rest of this appendix describes these options.

Using Virtual Network Computing (VNC)

VNC (Virtual Network Computing) is a lightweight desktop sharing system that was created by the research division of AT&T U.K. and later taken over by RealVNC, Ltd., a U.K. spinoff (http://www.realvnc.com). From this site, you can download a fully functional free version of the software, as well as a pay version that adds encryption and other security features. The information in this appendix describes the free version.

RealVNC does not provide a version of VNC for the Macintosh, but a version of the VNC viewer application that runs well on MacOS X can be found at http://www.redstonesoftware.com/vnc.html. There is also a Java version of the VNC viewer that runs on MacOS X (and all other platforms) at RealVNC.

If you have previously used desktop sharing systems like Timbuktu, the way VNC works will be familiar. On the Unix side of the connection, you install and run a server application called vncserver. The vncserver program runs silently in the background, listening for incoming connections.

On the desktop side of the connection, you run a viewer application called VNCViewer (the capitalization varies slightly among the different operating systems). When you use one of these applications to connect to a machine running vncserver, a graphical window appears on your desktop that contains an image of the Unix desktop. You can create windows, use menus, run programs, and interact with this remote desktop just as if you were using the Unix console directly. However, you might notice some jerkiness in screen updating, depending on the speed of the network connection.

Launching vncserver

Assuming that you have successfully downloaded and installed VNC, the first step is to launch vncserver. Use a terminal emulator to log into the Unix machine and run the following command:

(~) 100% vncserver

The vncserver program must be in a directory specified by your PATH environment variable (see APPENDIX 1C) or it will not be recognized. If vncserver is installed somewhere else, for example, under your home directory in a subdirectory named vnc-unix, you would type:

(~) 100% ~/vnc-unix/vncserver

vncserver will now ask you to provide a password for accessing your desktop remotely:


You will require a password to access your desktops:

Password: *******
Verify: *******

Choose a password that cannot easily be guessed. This can be the same as the password you use to log into Unix, or can be something different. You can change the password later using the vncpasswd application. Before returning you to the command line prompt, vncserver will print out some useful messages:

New 'X' desktop is pesto:1
Starting applications specified in /home/lstein/.vnc/xstartup
Log file is /home/lstein/.vnc/pesto:1.log

The important part of this message is the first line, which gives the name of the Unix machine and the “desktop number” on which the VNC server is listening. In the example, VNC is running on pesto:1, which is interpreted as desktop number 1 on the machine named pesto. If other users are using vncserver on a multiuser machine, you might be assigned a higher-numbered desktop. Remember this information since you will need it to connect.
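If the startup messages have been saved, the host name and desktop number can be picked out mechanically. The sketch below assumes the message has the form shown above; the function names are hypothetical and are not part of VNC.

```shell
#!/bin/sh
# Sketch: pull the host and desktop number out of vncserver's
# "New 'X' desktop is HOST:N" message. The function names are
# hypothetical and are not part of VNC.
desktop_from_message() {
    # Print whatever follows "desktop is " on the first matching line.
    sed -n 's/^New .* desktop is //p' | head -n 1
}
desktop_host()   { echo "${1%:*}"; }     # text before the last colon
desktop_number() { echo "${1##*:}"; }    # text after the last colon
```

The messages are read on standard input, so they can come from a saved file or a pipe.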

You can now log out and shut down the terminal emulator. Vncserver will continue to run until the Unix machine is rebooted or you intentionally shut vncserver down.

Launching VNCViewer

Assuming that you have successfully downloaded and installed VNCViewer, it can now be used to connect to the Unix desktop. Launch VNCViewer from the Start menu (MS Windows) or the desktop (Macintosh). A small dialog box similar to the one shown in Figure A.1D.1 will appear. Type in the name and desktop number that was assigned when you launched vncserver, e.g., pesto:1.

VNCViewer will try to establish a connection. If successful, it will now prompt you to provide a password. Type in the password that you selected when you launched the server. A window that contains a copy of the Unix desktop (Fig. A.1D.2) will appear.

You can work inside the window just as if you were sitting at the Unix console. The MS Windows version of VNCViewer also offers a handy full-screen mode that will temporarily replace your desktop with the Unix desktop. To enter full-screen mode, right-click on VNC’s window or taskbar icon in the MS Windows version and select Full-Screen Mode from the pop-up options menu. To get out of full-screen mode, press Control-Escape, and then the Escape key one more time. This will restore the Windows desktop.

Figure A.1D.1 When you launch VNCViewer on MS Windows or Macintosh desktops, a small dialog box prompts you to enter the host and desktop number for your Unix desktop.


Figure A.1D.2 VNCViewer opens up a single window that contains the Unix desktop and all windows created by Unix applications.

When finished using VNC, just quit the VNCViewer application. The Unix desktop will continue to run, however, so the next time you reconnect to your desktop with VNCViewer, you will find it in exactly the state in which it was left.

To bring down the VNC server completely, log into Unix and run vncserver with the -kill option:

(~) 100% vncserver -kill pesto:1

Notice that you must provide the hostname and desktop number in order for the -kill command to work.

A number of things may go wrong while using VNC. One common problem is that the VNCViewer will report a connection failure when it tries to connect. If you are using VNCViewer across the Internet, you will probably need to use the full Internet address of the Unix machine. In the running example, the full address for “pesto” is pesto.cshl.edu, so replace pesto:1 with pesto.cshl.edu:1.

Another possibility is that vncserver is no longer running. To check this, log into the Unix machine and run the ps -x command:

(~) 100% ps -x

This will list all the programs that are currently running under your user account. If one of the programs listed is Xvnc, then the server is still running. Otherwise, you will have to relaunch the server.
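Scanning a long process listing by eye is error-prone, so the check can be handed to grep. This is a minimal sketch; xvnc_listed is a hypothetical helper that reads the listing on standard input.

```shell
#!/bin/sh
# Sketch: report whether a process listing mentions Xvnc.
# xvnc_listed is a hypothetical helper; it reads the listing on
# standard input, so it can be fed by "ps -x" or by a saved listing.
xvnc_listed() {
    # The pattern [X]vnc matches the text "Xvnc" but not grep's own
    # command line, which would otherwise appear in the ps output.
    if grep -q '[X]vnc'; then
        echo "Xvnc is running"
    else
        echo "Xvnc is not running; relaunch vncserver"
        return 1
    fi
}
```

On the Unix machine this would typically be used as ps -x | xvnc_listed.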

If all the problems described earlier have been checked and you are still unable to connect, it may be that there is a firewall in place between you and the Unix machine. To find out, talk to the network administrator for your organization. In many cases, it is possible for the administrator to create firewall exceptions that will allow VNC to run. Otherwise, the administrator may be able to offer a way to work around the problem using a product like Secure Shell (http://www.openssh.com/).
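When asking for a firewall exception, it helps to name the TCP port involved. By the usual VNC convention, desktop number N is served on port 5900 + N. The sketch below assumes the server uses that default; vnc_port is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: compute the TCP port for a VNC desktop, assuming the
# conventional default of 5900 plus the desktop number.
# vnc_port is a hypothetical helper; it accepts "host:N" or just "N".
vnc_port() {
    echo $((5900 + ${1##*:}))
}
```

For the running example, vnc_port pesto:1 prints 5901.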

Finally, while working with VNCViewer, be careful not to log out of the Unix desktop manager. This will typically leave you unable to reconnect to the desktop again. If you do this inadvertently, simply log into Unix using a terminal emulator, kill the VNC server using vncserver -kill, and restart it.

Customizing VNCServer

The default desktop provided by vncserver is an early, primitive desktop manager called “twm.” Some people grow to like twm, but for most, it is nearly unusable. Fortunately, the VNC desktop can be changed.

To do this, you must have run vncserver at least once before. Kill vncserver, if it is still running, and then use a text editor to open and edit the file ~/.vnc/xstartup. By default, this file contains the following lines:

#!/bin/sh
xrdb $HOME/.Xresources
xsetroot -solid grey
xterm -geometry 80x24+10+10 -ls -title "$VNCDESKTOP Desktop" &
twm &

To change the desktop, replace twm on the last line with the command used to start up the desktop of choice. For example, to start the KDE desktop manager, replace the last line with startkde &.

The trick, of course, is knowing what command to put here. Table A.1D.1 lists a number of popular desktop managers to try. Some may not be installed on the Unix machine that you work with.

If you launch vncviewer and discover that no desktop manager is active (as indicated by “bare” windows without any frame or other decoration), you may need to indicate the full path to the window manager. For example, the Common Desktop Environment’s dtwm can be found at /usr/dt/bin/dtwm on many systems.
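The same edit can be made without opening a text editor. The sketch below writes a modified copy of the startup file and leaves the original alone; set_vnc_desktop is a hypothetical helper, and it assumes the file still contains the default twm & line.

```shell
#!/bin/sh
# Sketch: write a copy of an xstartup file in which the line "twm &"
# is replaced by another desktop command. set_vnc_desktop is a
# hypothetical helper; it leaves the original file untouched.
set_vnc_desktop() {
    xstartup=$1       # e.g. ~/.vnc/xstartup
    desktop_cmd=$2    # e.g. startkde or /usr/dt/bin/dtwm
    # "|" is the sed delimiter so that paths containing "/" are safe;
    # "\&" is a literal ampersand in the replacement.
    sed "s|^twm &\$|$desktop_cmd \&|" "$xstartup" > "$xstartup.new"
}
```

After checking the .new file, copy it over the original with cp, which keeps the original file's permissions.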

Using the X Window System

A more sophisticated way to establish a graphical connection to a remote Unix system is to use the X Window System itself. Using X to connect to an application running on a remote machine is almost exactly the opposite of using VNC. Whereas in VNC you start by launching the VNC server application on the Unix machine and then connect to it using the VNC viewer client, X works by running an application on your desktop machine called the “X server.” When you launch graphical Unix applications, you then tell them to use your desktop machine for their windows, keyboard, and mouse.

As described earlier, X is a standard part of Linux, Solaris, and other Unix systems, but must be installed as an option on MacOS X.

For Microsoft Windows, the free Cygwin/X server, available from http://x.cygwin.com/, is recommended. This application provides bare-bones X functionality. For a richer set of configuration options, you may wish to purchase a commercial X server. The servers that the author has used with the most success on Windows are Hummingbird Exceed (http://www.hummingbird.com) and WRQ ReflectionX (http://www.wrq.com).

Table A.1D.1 Some Popular Desktop Manager Programs

Program          Description
blackbox         An imitation of the NeXT desktop, found on some Linux systems.
dtwm             The Common Desktop Environment, found on many commercial Unix systems (but not Linux).
fvwm2            An uncluttered window manager commonly found on Linux systems.
fvwm95           An imitation of the MS Windows 95/98/ME desktop, commonly found on Linux systems.
gnome-session    The Gnome desktop environment, found on many Linux systems.
mwm              A basic desktop manager, commonly found on older Sun systems.
startkde         The K desktop environment, found on more recent Linux systems.
olwm             The Open Look desktop manager, found on many systems.
wmaker           Another imitation of the NeXT desktop, found on some Linux systems.

Running remote X applications on MacOS X, Linux, and Solaris

To launch a remote X application on a Unix system, open a shell window (the Terminal application on MacOS X). As described in APPENDIX 1C, use SSH to log into the remote machine, but use the -Y option to tell SSH to forward the X session from the remote machine to the local machine:

> ssh -Y login_name@remote_host.example.com

Replace login_name and remote_host.example.com with your login name and the name or IP address of the remote machine. If you use the same account names on the local and remote machines, you can omit the login name. You will be prompted for a password. Once you are logged into the remote server, launch the desired application by typing its name on the command line. If all goes well, the application will start up and display its user interface on your local screen. You can then interact with it as though you were sitting at the remote desktop.

On some older machines, SSH may not be installed. In this case, you can use telnet, but the recipe is more complex. First, determine the IP (Internet) address of your local machine. From the command line, type:

> /sbin/ifconfig

This will display information about each of the network interfaces on the machine. This may include ethernet ports, wireless interfaces, and other interfaces. Embedded somewhere in that information is the IP address of your machine, which has the format XXX.XXX.XXX.XXX. There may be multiple such addresses; if you see an address of 127.0.0.1, you should ignore it (it is used internally) and use another one. For the purposes of this example, we will use 192.168.1.1 as the IP address.
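Picking the address out of the ifconfig output can be automated. The sketch below is an illustration only: local_ipv4 is a hypothetical helper, and it assumes each address appears on a line whose first word is inet, written either as inet addr:192.168.1.1 or as inet 192.168.1.1. The layout of ifconfig output varies between systems, so check the result by eye.

```shell
#!/bin/sh
# Sketch: list the IPv4 addresses in ifconfig output, skipping the
# loopback address. local_ipv4 is a hypothetical helper that reads
# the ifconfig output on standard input.
local_ipv4() {
    awk '$1 == "inet" {
        addr = $2
        sub(/^addr:/, "", addr)            # strip the older addr: prefix
        if (addr !~ /^127\./) print addr   # 127.x.x.x is used internally
    }'
}
```

A typical use would be /sbin/ifconfig | local_ipv4, which prints one address per line.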

Before telnetting to the remote machine, run the xhost command:

> xhost +remote_host.example.com


Figure A.1D.3 In contrast to VNC, the default for most X servers is to open a different window for each running X application and to allow them to be displayed together on the desktop.

This grants access to X applications running on the remote machine. Change remote_host.example.com to the name or address of the machine into which you will be logging.

Now log into the remote machine using telnet, as described in APPENDIX 1C. Once you are logged in, type at the command line:

> setenv DISPLAY 192.168.1.1:0

Replace 192.168.1.1 with the IP address of your desktop. This environment variable tells any X applications that you launch to use the indicated X server. The :0 following the IP address tells the application to use the first X server found running on the machine (on a personal computer it is unlikely that there will ever be more than one X server running at a time).

If you are using bash, ksh, or sh, set the environment variable like this:

> bash$ export DISPLAY=192.168.1.1:0

Do not set DISPLAY if you logged in with SSH. SSH sets DISPLAY for you.
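The rule about leaving DISPLAY alone under SSH can be built into a small helper for sh-family shells. This is a sketch under one assumption: that the SSH server sets the SSH_CONNECTION environment variable for login sessions, as OpenSSH does. The name use_display is hypothetical.

```shell
#!/bin/sh
# Sketch: set DISPLAY for a telnet login, but leave it alone under SSH.
# use_display is a hypothetical helper. It assumes the SSH server has
# set SSH_CONNECTION, as OpenSSH does for login sessions.
use_display() {
    if [ -n "${SSH_CONNECTION:-}" ]; then
        echo "SSH session detected; keeping DISPLAY=${DISPLAY:-}" >&2
        return 0
    fi
    DISPLAY="$1:${2:-0}"    # host or IP address, then X server number
    export DISPLAY
}
```

For the running example, use_display 192.168.1.1 sets DISPLAY to 192.168.1.1:0 in a telnet session and does nothing in an SSH session.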

Whether you logged in via SSH or telnet, you are now ready to launch a graphical application. From the shell, type xclock. This is a simple X application that shows a graphical clock. If everything is working, a clock will appear on your desktop screen. You can try this now with other graphical applications. Commonly installed applications include xterm, a command-line shell window, emacs, a windowing text editor, and Mozilla Firefox, a Web browser. Figure A.1D.3 shows a portion of a Windows desktop after launching the xclock and xterm applications. Notice that although the window frames follow normal MS Windows or Macintosh conventions, the window contents are decidedly Unix-like.

When the author works with X remotely, he usually launches an xterm first and then launches other applications from within the xterm. The advantage of this is that the DISPLAY variable is inherited by the xterm shell, and does not need to be set again.


If your desktop machine has a stable DNS (Internet) name, like yourpc.yourorganization.com, you can use that instead of the numeric IP address when you set the DISPLAY. However, most organizations and almost all home network connections assign IP addresses dynamically each time a personal computer is booted. This means that not only will your machine not have a DNS name, but it may have a different IP address each time you reboot it. In this case, you will have to look up the IP address each time you start a session with X.

If X applications do not run, there are a number of things that might have gone wrong. If the command to launch the application terminates with a connection refused message or hangs indefinitely, chances are either that the IP address in the DISPLAY environment variable is incorrect or that the X server is not running on your desktop.

Another possibility is that there is a firewall system between the Unix host and your desktop machine. Firewalls are typically configured to prevent incoming connections, and this usually includes blocking incoming connections from X applications. The best workaround for this is SSH with the -Y option, which usually circumvents firewall issues. You may need to discuss remote access options with your system administrator.

Launching an X session under Microsoft Windows

Unlike the Macintosh, the Microsoft Windows operating system does not have a default X application, and you will have to install a third-party application. Details of using these X servers vary somewhat depending on the vendor. The examples below show how to set up X connections using Cygwin/X, a free X server. Although the details will differ with other implementations, the general concepts will remain the same.

To install Cygwin/X, go to http://www.cygwin.com and click “Install cygwin now” on the right-hand side of the page. This will take you to an installer for the Cygwin package. This application will prompt you for the location of the files to select (choose Install from Internet), and the place to install Cygwin (keep the default of C:\cygwin). When prompted for the site to download from, choose an FTP site close to your region of the world.

You will next be taken to a screen that prompts you to select which packages to install. The packages are organized hierarchically by topic. You may keep all the default selections, but you will need to add the “openssh” and “X11” packages to the list of packages to be installed. To add openssh, open up the Net section, scroll down until you see “openssh”, and click once on the icon that says Skip. The icon will change to show the version number of the openssh package, and a check box will appear to indicate that openssh will be installed (several other packages will also be activated; these are ones that openssh depends on).

To install X11, scroll to the bottom of the package selection screen until you find the “X11” section. Click on the icon that says Default to change it to Install, then press Next at the bottom of the installer in order to begin installing the packages you selected. At the end of the install, you will be given the option to install icons on the desktop and the Start menu, which will be convenient to have.

Once Cygwin/X is installed, starting the X server is a two-step process. First, double-click on the Cygwin icon on the desktop (or use the Start menu: select Start > All Programs > Cygwin > Cygwin Bash Shell). This will launch a small command window containing a Unix shell prompt on a black background. From this shell, start X by issuing the startx command:


> startx &

Some diagnostic messages will print out, and after a brief pause another Unix shell window will appear. This one will have a white background. This is the xterm program running inside X. You can now safely minimize the black command window and issue all your commands from within the xterm window.

To confirm that everything is working as expected, launch the xclock application from within the xterm window:

> xclock&

This will bring up the graphical X clock. You will find other standard X applications installed as menu items under the Start > All Programs > Cygwin-X menu.

On MS Windows XP systems, the first time X is run, a dialog box will appear warning that the “Xwin” application has tried to open up a network port and that the built-in XP firewall has blocked it. Accept the offered option to unblock Xwin; otherwise remote X sessions will not work. If you are only accessing remote machines on your local area network (i.e., inside your office LAN), then the firewall can be partially reactivated later by making the following selections from the Control Panel: Control Panel > Windows Firewall > Exceptions; next, double-click on Xwin, press the “Change Scope...” button, and then select “My network (subnet) only.” This will allow Xwin to communicate with machines on your local area network, but not with machines elsewhere on the Internet.

Launching an xterm on a remote Unix machine

This example describes how to launch an xterm on a remote Unix machine using ssh. Assume the remote Unix machine is named remote_host.example.com and the user name on the remote machine is yourname. Type the following command at the xterm prompt:

> ssh -f -Y yourname@remote_host.example.com xterm
yourname@remote_host password: ********

This is similar to what was done previously when ssh was used to log into the remote machine from a Unix or Macintosh box, but now the -f switch is used to tell ssh to “fork” (Unix jargon) into the background and then run the “xterm” command on the remote machine. As before, the -Y switch causes the application’s X windows to be forwarded to the local machine. After being prompted for your password on the remote machine, if all goes well, a new xterm window will appear containing the command prompt from a shell running on the remote machine. You can now launch your favorite X Window System applications by typing the appropriate command lines in the xterm window.

Obtaining a Unix desktop

By default, X runs in “multiple window” mode. In this mode, windows opened by remote X applications run on the regular PC desktop, intermixing with windows opened by local applications.

An alternative is to run in “single” or “rooted” window mode. In this mode, the X server running on your local machine creates a single desktop window. You then connect to the remote machine and run a window manager on it. This will create the Unix desktop, its icons, and all the Unix application windows running inside the desktop window. This is handy if you want to access pop-up menus, icons, and other niceties of the Unix desktop environment, but it does impose a noticeable demand upon performance, particularly when you are connecting across a slow network connection.

The following recipe works with Cygwin/X on Microsoft Windows machines. First start Cygwin by clicking on its icon. When the black command window comes up, issue the xinit command rather than startx:

> xinit

A large window filled with an ugly herringbone pattern will appear. This is what the X root window looks like when no window manager is running inside it. The upper left corner of the root window will contain an xterm command window, but the window will lack a frame or title bar. You will now issue the command to start the window manager on the remote machine. Click on the root window to bring it to the front and type the command:

> ssh -f -Y yourname@remote_host.example.com startkde

You will be prompted for your password on the remote machine as usual. After a brief pause, the KDE window manager will take control of the root window and add its menus, icons, and other user-interface elements. You can now launch applications from the start menu, or open up xterms on the remote machine. The original xterm will still be there, but it will take on the frame and title bar given to it by the KDE window manager. You can safely minimize it to get it out of the way (but do not close the xterm window or the whole X server will exit).

You can use the same recipe to launch other window managers. Simply replace startkde in the command above with any of the window managers listed in Table A.1D.1.

On Macintosh and Unix machines, the recipe for obtaining a remote desktop is slightly different. First, open a terminal window on the local machine and type the following command:

> Xnest :1 -geometry 800x600 & xterm -display :1 &

Xnest is an X server “nested” inside the one that runs on your local machine. This command will bring up an 800 × 600-pixel root window with the default herringbone background and a frameless xterm. Switch to this window and type the ssh command to start the desktop manager of your choice on the remote machine:

> ssh -f -Y yourname@remote_host.example.com startkde

As in the Cygwin/X recipe, the remote desktop manager will now take control and create its various icons, windows, and menus. When you are finished, simply click the Close icon on the window title bar.

Unfortunately, the Xnest window cannot be resized once you create it. To work with a larger window, change the -geometry switch to something larger than 800 × 600. Standard monitor resolutions are now 1024 × 768 and 1280 × 1024.

If the Macintosh does not seem to recognize the Xnest or xterm commands, it may be because the directory that contains the corresponding executables is not in the PATH environment variable. Make sure that /usr/X11R6/bin is one of the directories listed in PATH and, if necessary, add it.
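The PATH check described above can be sketched in a few lines of Python. This is only an illustration: the function name ensure_in_path is hypothetical, and the snippet reports what PATH should look like rather than changing your shell configuration, which still has to be edited by hand.

```python
import os


def ensure_in_path(path_value, directory, sep=os.pathsep):
    """Return path_value with directory appended if it is not already listed.

    ensure_in_path is a hypothetical helper written for this example.
    """
    # Split PATH into its directories, ignoring empty entries.
    entries = [entry for entry in path_value.split(sep) if entry]
    if directory in entries:
        return path_value
    return sep.join(entries + [directory])


if __name__ == "__main__":
    x11_bin = "/usr/X11R6/bin"
    current = os.environ.get("PATH", "")
    updated = ensure_in_path(current, x11_bin)
    if updated == current:
        print(x11_bin + " is already in PATH")
    else:
        print("PATH should be set to: " + updated)
```

If the directory is missing, the printed value can be assigned to PATH in your shell startup file (see APPENDIX 1C).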

CONCLUSIONS

Unix provides two popular solutions for running graphical programs across the network. VNC is easier to set up and use, but does not have the stunning array of features of the X Window System. The X Window System is quite powerful, but generally more difficult to set up and use.

Contributed by Lincoln D. Stein
Cold Spring Harbor Laboratory
Cold Spring Harbor, New York

APPENDIX 1E
Sequence File Format Conversion with Command-Line Readseq

One of the major challenges in using bioinformatics software is that there is a wide variety of sequence formats. A few are widely known, e.g., GenBank, EMBL, and FASTA (see APPENDIX 1B). Other widely used formats are derivatives of these, e.g., Swiss-Prot and SP-TrEMBL. It is almost always the case that a sequence or a set of sequences will be in one format but is needed in another. Some bioinformatics suites offer limited conversion utilities, such as the ToFastA and FromFastA programs in the Wisconsin Package (GCG). Some bioinformatics companies have produced software with mutually incompatible sequence formats. The problem can be daunting, even for professional bioinformaticians.

Luckily, a near-solution is available. Readseq is a program that can read and write 18 different formats, although, for a variety of good reasons, it cannot convert every format into every other format.

Readseq is now available in two forms. The first, the command-line version, was written in 1993 and is available for Unix, DOS, and pre-OS X Macintosh operating systems. The 1999 version added seven new formats and was rewritten in the Java programming language. This appendix explains how to use only the command-line, or “classic,” version. (Instructions for using the Java version will be added here once the new user interface has been released.)

Table A.1E.1 lists the formats known to the most widely used version of Readseq, which is now called “Readseq classic” by its author.

Obtaining Readseq

All versions of Readseq are available at http://iubio.bio.indiana.edu/soft/molbio/readseq. Note that there is no “www” in the URL. There are four directories: classic, Java, version 1, and version 2. The latter three are all Java-based.

The classic version. Under the classic directory, there are versions for SGI (formerly Silicon Graphics), Apple Macintosh (pre-OS X), DOS, MS Windows, VAX-VMS, and Solaris. Finally, a source code file, readseq.shar, is available. It can be compiled on almost any computer that has a compiler for the C programming language. The directory contains instructions for using the .shar file and compiling the program. This is valuable if one wishes to have the system administrator download and compile the code.

Contributed by Don Gilbert
Current Protocols in Bioinformatics (2003) A.1E.1-A.1E.4
Copyright © 2003 by John Wiley & Sons, Inc.

Table A.1E.1 Formats Readable or Writeable by Readseq Version 1

 1. IG/Stanford          10. Olsen (in-only)
 2. GenBank/GB           11. PHYLIP 3.2
 3. NBRF                 12. PHYLIP
 4. EMBL                 13. Plain/Raw
 5. GCG                  14. PIR/CODATA
 6. DNA Strider          15. MSF
 7. Fitch                16. ASN.1
 8. Pearson/FASTA        17. PAUP*/NEXUS
 9. Zuker (in-only)      18. Pretty (out-only)

Using Readseq Classic for Unix and DOS

Readseq is used from the command line on these operating systems. The command lines are identical for both operating systems. It is important, however, to be aware that Unix and DOS do not use the same “end-of-line” character. Copying a file from Unix to DOS, for instance via a floppy disk or network drive, will cause trouble because of this difference. Transferring files between operating systems is still best done using the File Transfer Protocol (FTP), typically at a command line, or by any of the graphical versions of FTP available both in the public domain and commercially.

If files are transferred incorrectly, the result will be (at best) a blank line between each line of the sequence file, or (at worst) meaningless characters.
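If a file has already been copied with the wrong line endings, it can usually be repaired afterwards. The following Python sketch shows the idea, assuming the usual conventions (a line feed on Unix, a carriage return followed by a line feed on DOS). The function names are hypothetical, and the code operates on bytes that you have read from the file yourself.

```python
def to_unix_newlines(data: bytes) -> bytes:
    """Convert DOS (CR LF) line endings, and any bare CR, to Unix (LF)."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def to_dos_newlines(data: bytes) -> bytes:
    """Convert any line endings to DOS (CR LF)."""
    # Normalize first so that existing CR LF pairs are not doubled.
    return to_unix_newlines(data).replace(b"\n", b"\r\n")


dos_text = b">seq1\r\nACGT\r\n"
print(to_unix_newlines(dos_text))
```

Read the file in binary mode before converting, so that Python does not translate the line endings on its own.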

Also note that Readseq classic knows a wide variety of sequence formats that are no longer in use. It is not necessary to concern oneself with these unfamiliar formats.

Figure A.1E.1 provides a list of the available commands for Readseq classic. If

readseq

is typed, the program then outputs:

Enter sequence name or ? for help

Entering ? will return the list of commands shown in Figure A.1E.1.

Create a command line by selecting from the commands and options shown in Figure A.1E.1. At minimum, the names of the input and output files must be included, as well as the format to which the input file will be converted. An option that starts with a dash must not contain any spaces. For example, -o= newfile.seq is not valid. The order of the options does not matter.

The command line takes the general form

readseq [-options] in.seq > out.seq

A typical command line might be

readseq -a -v -f8 -o=newfile.seq oldfile.nexus

This command would convert all sequences (-a) in file oldfile.nexus to FASTA format (-f8) and name the output file newfile.seq. It would also provide detailed messages about the conversion (-v).
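When many files need converting, it can help to assemble the command line in a script. The sketch below builds the argument list for the typical command shown above, using only the -a, -v, -f, and -o options described in this appendix. The function build_readseq_command is a hypothetical helper, not part of Readseq, and it only constructs the command; it does not run it.

```python
def build_readseq_command(infile, outfile, fmt, all_sequences=True, verbose=False):
    """Return the argument list for a Readseq classic conversion.

    fmt is a format number from Table A.1E.1 (for example, 8 for Pearson/FASTA).
    This helper is an illustration only.
    """
    # An option that starts with a dash must not contain spaces.
    if " " in outfile:
        raise ValueError("output file name must not contain spaces: %r" % outfile)
    args = ["readseq"]
    if all_sequences:
        args.append("-a")
    if verbose:
        args.append("-v")
    args.append("-f%d" % fmt)
    args.append("-o=" + outfile)
    args.append(infile)
    return args


command = build_readseq_command("oldfile.nexus", "newfile.seq", 8, verbose=True)
print(" ".join(command))
```

The resulting list could be handed to Python's subprocess.run on a machine where the readseq executable is installed and in the PATH.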

Some particularly useful options that one might want to include in the command line are described below.

Selecting sequences: The user can select a single sequence from a file (-i=n, where n is the number of the desired sequence), a set of sequences (e.g., -i=2,11,24), or all sequences (-a or -all).
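To make the effect of item selection concrete, the following Python sketch picks records out of a FASTA file by position. It assumes that items are numbered from 1 in the order they appear in the file. The functions parse_fasta and select_items are hypothetical and do not reproduce Readseq's own code.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            # A new header line closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line.strip())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def select_items(records, items):
    """Return the records at the given positions (assumed to count from 1)."""
    return [records[i - 1] for i in items]


example = ">a\nAC\nGT\n>b\nTT\n>c\nGG\n"
for name, sequence in select_items(parse_fasta(example), [1, 3]):
    print(name, sequence)
```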

Removing gaps: The option “-degap=.” specifies that the gap character “.” is removed from any sequence that contains it. The default gap character is “-”, as indicated in Figure A.1E.1. This is convenient when one has sequences from a multiple sequence alignment and needs to remove the gaps.
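The effect of gap removal on a single sequence can be illustrated in Python. This is a sketch of the result only, not of how Readseq implements the option; the function name degap is chosen here to mirror the option name.

```python
def degap(sequence, gap_char="-"):
    """Return sequence with every occurrence of gap_char removed."""
    return sequence.replace(gap_char, "")


print(degap("AC--GT-A"))     # dashes removed, as with the default gap character
print(degap("AC..GT", "."))  # dots removed, as with -degap=.
```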

Pretty printing: Another convenient feature of the program is the ability to “pretty print” sequences (i.e., produce publication-quality output). Options control the number of characters per line (wid[th]=#), the indentation level (tab=#), the spacing between groups of letters (col[space]=#), and the placement of numbering (the remaining options shown in Figure A.1E.1).
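The roles of the width, tab, and col settings can be seen in a small Python sketch that lays out one sequence in indented lines of spaced blocks. It only approximates the idea: the function pretty_lines and its default values are assumptions made for this example, and Readseq's actual pretty output (names, numbering, match characters) is richer and may be laid out differently.

```python
def pretty_lines(sequence, width=50, col=10, tab=0):
    """Split sequence into lines of width characters.

    Each line is indented by tab spaces and broken into blocks of col
    characters separated by a single space.
    """
    lines = []
    for start in range(0, len(sequence), width):
        row = sequence[start:start + width]
        blocks = [row[i:i + col] for i in range(0, len(row), col)]
        lines.append(" " * tab + " ".join(blocks))
    return lines


for line in pretty_lines("ACGT" * 15, width=50, col=10, tab=5):
    print(line)
```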

Sample Sequence Conversions

The easiest way to learn how to use Readseq is to examine some typical conversions and the corresponding command lines. Some typical conversions are presented below. Be sure to substitute the input file name for infile and the output file name for outfile. Extensions (the part of the file name after the dot) are included here only for clarity.

1. Convert all sequences from a file of sequences in FASTA format to GCG format

readseq -v -a -f5 -o=outfile.gcg infile.fasta

    Readseq (1Feb93), multi-format molbio sequence reader.
    usage: Readseq [-options] in.seq > out.seq

    options:
    -a[ll]                select All sequences
    -c[aselower]          change to lower case
    -C[ASEUPPER]          change to UPPER CASE
    -degap[=-]            remove gap symbols
    -i[tem=2,3,4]         select Item number(s) from several
    -l[ist]               List sequences only
    -o[utput=]out.seq     redirect Output
    -p[ipe]               Pipe (command line, <stdin, >stdout)
    -r[everse]            change to Reverse-complement
    -v[erbose]            Verbose progress
    -f[ormat=]#           Format number for output, or
    -f[ormat=]Name        Format name for output:

         1. IG/Stanford          10. Olsen (in-only)
         2. GenBank/GB           11. Phylip3.2
         3. NBRF                 12. Phylip
         4. EMBL                 13. Plain/Raw
         5. GCG                  14. PIR/CODATA
         6. DNAStrider           15. MSF
         7. Fitch                16. ASN.1
         8. Pearson/Fasta        17. PAUP/NEXUS
         9. Zuker (in-only)      18. Pretty (out-only)

    Pretty format options:
    -wid[th]=#                  sequence line width
    -tab=#                      left indent
    -col[space]=#               column space within sequence line on output
    -gap[count]                 count gap chars in sequence numbers
    -nameleft, -nameright[=#]   name on left/right side [=max width]
    -nametop                    name at top/bottom
    -numleft, -numright         seq index on left/right side
    -numtop, -numbot            index on top/bottom
    -match[=.]                  use match base for 2..n species
    -inter[line=#]              blank line(s) between sequence blocks

Figure A.1E.1 Readseq classic commands. See APPENDIX 1C, Unix Survival Guide, for more details.

2. Convert all sequences from a file of sequences in GCG MSF format to FASTA format, remove gaps (which are denoted by dashes in the MSF file), and convert the sequences to uppercase:

readseq -v -a -f8 -degap="-" -C -o=outfile.fasta infile.msf

3. Convert 8 specific sequences from a file of 200 sequences in FASTA format to NEXUS format (used by PAUP* and other phylogenetic analysis programs), remove gaps, and convert to uppercase:

readseq -v -i=22,35,122,168,169,170,181,199 -f17 -degap -C -o=phylo.nexus phylo.fasta

4. List all the sequences in a file to the screen. If Readseq cannot figure out the sequence file format, it will issue an error message.

readseq -l input.seq

5. Pretty print a file of aligned sequences in FASTA format. Print the sequences 50 characters per line, in 10-character blocks separated by a space. Start each line with 5 spaces, and put the names of the sequences (truncated to 12 characters if the name is longer) on the right side of the printout. Print the sequence numbering above the sequences and include gaps in the numbering. Finally, print 4 blank lines between the sequences and print a . for matching positions in columns. Note that the command line is very long and has been split across two lines for readability. When entered by hand, do not break it into two lines.

readseq -f18 -o=outfile.pretty -tab=5 -width=50 -col=10
-gap -nameright=12 -numtop -match -inter=4 infile.fasta

Contributed by Don Gilbert
Indiana University
Bloomington, Indiana
