Using the Unix Shell There is No ‘Undelete’. The Unix Shell “A Unix shell is a command-line...

Using the Unix Shell

There is No ‘Undelete’

The Unix Shell

“A Unix shell is a command-line interpreter or shell that provides a traditional user interface for the Unix operating system and for Unix-like systems. Users direct the operation of the computer by entering commands as text for a command line interpreter to execute or by creating text scripts of one or more such commands.” - Wikipedia

Things to Keep in Mind

• There is no ‘undelete’
• Shell commands are case-sensitive (CaPitaLizaTIoN mAttErs)
• Do NOT use space, ?, *, \, / or $ in file names because these have special meanings to the shell
• Filenames that begin with . are ‘hidden’
• There is no ‘undelete’
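Two of these points are easy to see for yourself. The sketch below is only an illustration: it works in a throwaway directory made with mktemp, and the file names (visible.txt, .hidden.txt, "my file.txt") are invented for the demonstration.

```shell
# Work in a throwaway directory so nothing important is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir"

touch visible.txt .hidden.txt      # the second name starts with a dot

plain_listing=$(ls)                # dot-files are left out
full_listing=$(ls -A)              # -A also lists dot-files (but not . and ..)

echo "ls    : $plain_listing"
echo "ls -A : $full_listing"

# A name that contains a space has to be quoted every time it is used,
# which is why spaces in file names are best avoided.
touch "my file.txt"
space_count=$(ls | grep -c "my file.txt")
echo "files with a space in the name: $space_count"
```

Without the quotes, touch my file.txt would create two files, my and file.txt.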

The Importance of Being ‘Root’

• ‘Root’ or ‘Superuser’ is the administrator account, which has phenomenal cosmic power.

• The ‘sudo’ command allows you to “do as superuser” from an account with ‘sudo privileges’.

• As root in the shell, you can literally ‘delete’ the operating system or operating system files (like choosing to delete Microsoft Windows while using Windows)… and then watch the stars go out…
– Moral of the story: If you don’t know what a file is… it’s better to ask or leave it alone.
– Installing software can require use of ‘sudo’

Unix Tutorial

• http://www.ee.surrey.ac.uk/Teaching/Unix/

• science.txt file location for tutorial:
– http://www.ee.surrey.ac.uk/Teaching/Unix/science.txt
– Unix command: wget http://www.ee.surrey.ac.uk/Teaching/Unix/science.txt

Additional help/tutorial/walkthrough
• http://software-carpentry.org/4_0/shell/




Grep

• grep science science.txt
• grep science science.txt > newfile1.txt
• grep -B 1 -A 2 science science.txt > newfile1.txt
• Use man grep to learn more about grep

> is a ‘redirect’ symbol that sends output which would normally go to the screen to a text file instead.

-B 1 and -A 2 are command line ‘options’ that change the behavior of the ‘grep’ program, with numerical parameters that specify the new behavior: here, also print 1 line before and 2 lines after each matching line.
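The three grep commands can be tried without downloading anything. This is a minimal sketch that assumes a small made-up file standing in for science.txt; the temporary file names come from mktemp and are not part of the tutorial.

```shell
# Stand-in for science.txt: a five-line file made on the spot.
sample=$(mktemp)
printf '%s\n' 'line one' 'before match' 'science is here' 'after one' 'after two' > "$sample"

# Matching lines only, redirected to a file instead of the screen.
matches=$(mktemp)
grep science "$sample" > "$matches"

# Same search, with 1 line of context before and 2 lines after each match.
context=$(mktemp)
grep -B 1 -A 2 science "$sample" > "$context"

echo "lines without context: $(wc -l < "$matches")"
echo "lines with context:    $(wc -l < "$context")"
```

Note that > replaces the contents of an existing file without asking, which is one more way to lose data that cannot be undeleted.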

Permissions

• Type ls -l (note: those are both lower-case L characters)
• -rw-r--r-- 1 krmerrill staff 358400 Feb 2 13:00 AJB_Merrill-d1100085_au.doc
• drwxr-xr-x 47 krmerrill staff 1598 Jul 17 2011 My Pictures

first character: - means regular file, d means directory, l (lower-case L) means link
first triplet is the user read, write, and execute permissions
second triplet is the group permissions
last triplet is permissions for everyone else, or ‘other’

ls -al shows above information for all files, including hidden files

chmod = change permissions
u = user; g = group; o = other; a = all (user, group, and other)
r = read; w = write; x = execute

chmod u+x filename adds user execute permission on filename
chmod g-wx filename removes group write and execute permissions from filename
Permissions that are not mentioned in this format of the chmod command are not affected
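The effect of the two chmod commands can be watched on a scratch file. In this sketch the file comes from mktemp, and the starting permissions are set with the = form of chmod (u=rw,g=rw,o=r), which is not covered above and is used here only to get a known starting point.

```shell
script=$(mktemp)

# Known starting point: rw for user and group, r for other.
chmod u=rw,g=rw,o=r "$script"
before=$(ls -l "$script" | cut -c1-10)        # keep only the type and permission characters

chmod u+x "$script"                           # add execute for the user
after_add=$(ls -l "$script" | cut -c1-10)

chmod g-wx "$script"                          # remove write and execute from the group
after_remove=$(ls -l "$script" | cut -c1-10)

printf '%s\n%s\n%s\n' "$before" "$after_add" "$after_remove"
```

The user and other triplets are unchanged by the g-wx step, since only the group was mentioned.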

Useful Shell Commands

• See the Linux Command Line Reference document on the course website

• Directory commands
– Change to sub-directory within the current directory: cd xyz
– Change to sub-directory in another part of the directory tree: cd /path/to/directory
– Create directory: mkdir newdir
– Remove empty directory: rmdir xyz
• Wildcard characters: ? matches any single character, * matches zero or more characters
– Example: rm *.txt will remove all files with a name ending in .txt
– rm file?.fastq will remove file1.fastq, file2.fastq, … , filex.fastq
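Because there is no ‘undelete’, it can help to preview what a wildcard matches with echo before giving the same pattern to rm. A small sketch, using a throwaway directory and invented file names:

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch file1.fastq file2.fastq file10.fastq notes.txt

# echo prints the names a pattern expands to, without touching the files.
single=$(echo file?.fastq)     # ? matches exactly one character
any=$(echo file*.fastq)        # * matches zero or more characters

echo "file?.fastq expands to: $single"
echo "file*.fastq expands to: $any"

rm file?.fastq                 # file10.fastq does not match and is left behind
ls
```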

Regular Expressions

• See the RegularExpressions.pdf document on the course website for an overview of literal characters and metacharacters

• Regular expressions are useful within grep, awk, sed and other command-line tools as well as in Java, Perl, Python, and other scripting languages.

• Some text editor programs in Linux also use regular expressions (also called regexps or regex). We will use nedit as an example.

• Replacing a space character with a new-line character in a file of barcodes – find ‘(OWB\d+) ’ and replace with ‘\1\n’ – note the trailing space in the first expression.
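The same replacement can be sketched on the command line with sed. This is an assumption-laden equivalent, not the nedit procedure itself: it assumes GNU sed (for \n in the replacement), writes [0-9] where the editor pattern uses \d, and uses made-up barcode names.

```shell
barcodes=$(mktemp)
printf 'OWB1 OWB22 OWB333\n' > "$barcodes"     # hypothetical barcodes on one line

# (OWB[0-9]+) captures one barcode; the space after it is replaced by a new line.
one_per_line=$(mktemp)
sed -E 's/(OWB[0-9]+) /\1\n/g' "$barcodes" > "$one_per_line"

cat "$one_per_line"
```

The parentheses mark the part of the match that \1 puts back, so only the trailing space is actually changed.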

Command-line example

• Testing analyses on a small random sample of a sequence dataset is a good idea – find and fix problems quickly

• How to randomly sample the same reads from a set of paired-end files?

• A one-line command is saved on the course website to do this.

• time paste Bigfile1.fastq Bigfile2.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t\t");} }' | shuf | head -2000000 | sed 's/\t\t/\n/g' | awk -F '\t' '{print $1 > "Subfile1.fastq"; print $2 > "Subfile2.fastq"}'

• Let’s look at this step by step

time
this tells the system to display the time required to execute the command

paste Bigfile1.fastq Bigfile2.fastq |
this joins two files of paired-end sequence reads as tab-delimited columns, line by line. The files should have the same number of lines, with reads in the same order in both files

awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t\t");} }' |
this uses the ‘awk’ program to convert the four lines of FASTQ format to a single line per sequence record, with a double tab between the four parts (the two reads of a pair stay separated by a single tab)

shuf |
this utility shuffles the lines into a random order

head -2000000 |
this utility takes the first 2 million lines of the re-ordered stream; each line is one record, so this is 2 million read pairs

sed 's/\t\t/\n/g' |
this uses the ‘sed’ stream editor to convert the double-tab delimiters back into new-line characters to restore the 4-line FASTQ format

awk -F '\t' '{print $1 > "Subfile1.fastq"; print $2 > "Subfile2.fastq"}'
this uses ‘awk’, with tab set as the field separator by -F '\t', to split the two tab-delimited columns back into two separate files. Without -F, awk splits on any white space and header lines that contain a space would be cut in the wrong place. The output file names differ from the input file names so that the original files are not overwritten.
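Before running a pipeline like this on millions of reads, it can be tried on a toy dataset. The sketch below makes two tiny paired files with three invented reads each and samples two pairs; it assumes GNU coreutils (shuf) and GNU sed, uses -F '\t' in the last awk step so that headers containing a space stay in one piece, and the file names are illustrative only.

```shell
work=$(mktemp -d)
cd "$work"

# Two tiny paired files, three made-up reads each, same order in both.
for i in 1 2 3; do
  printf '@read%s 1:N:0\nACGT\n+\nIIII\n' "$i" >> Bigfile1.fastq
  printf '@read%s 2:N:0\nTGCA\n+\nIIII\n' "$i" >> Bigfile2.fastq
done

# Same steps as the one-liner, sampling 2 read pairs instead of 2 million.
paste Bigfile1.fastq Bigfile2.fastq |
  awk '{ printf("%s", $0); n++; if (n % 4 == 0) { printf("\n") } else { printf("\t\t") } }' |
  shuf | head -2 |
  sed 's/\t\t/\n/g' |
  awk -F '\t' '{ print $1 > "Subfile1.fastq"; print $2 > "Subfile2.fastq" }'

# Show the header lines side by side: read names should match in each row.
paste Subfile1.fastq Subfile2.fastq | awk -F '\t' 'NR % 4 == 1 { print $1 " | " $2 }'
```

Which two reads are chosen changes from run to run, but each sampled read keeps its mate at the same position in the second file.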

Command-line example

How do you come up with this stuff?

Someone else has probably had this problem

Search for help on SeqAnswers or StackExchange

http://biostar.stackexchange.com/

The Bioinformatics Forum on SeqAnswers: http://seqanswers.com/forums/forumdisplay.php?f=18




SolexaQA.pl

• This Perl script assumes that header lines of sequence files are written in one of several formats

• The code uses regular expressions to sort out formats:

if( $line =~ /\S+\s\S+/ ){
    # Cassava 1.8 variant
    if( $line =~ /^@[\d\w\-\._]+:[\d\w]+:[\d\w]+:[\d\w]+:(\d+)/ ){
        $number_of_tiles = $1 + 1;
    # Sequence Read Archive variant
    }elsif( $line =~ /^@[\d\w\-\._\s]+:[\d\w]+:(\d+)/ ){
        $number_of_tiles = $1 + 1;
    }
# All other variants
}elsif( $line =~ /^@[\d\w\-:\._]*:+\d*:(\d*):[\.\d]+:[\.\/\#\d\w]+$/ ){
    $number_of_tiles = $1 + 1;
}
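To see what the parentheses in these patterns capture, the first inner test can be imitated with sed. This is only a sketch and not part of SolexaQA.pl: the Perl classes are rewritten as POSIX bracket expressions ([\d\w] becomes [A-Za-z0-9_]), and the header is a sample line in the space-containing format.

```shell
header='@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG'

# Four colon-delimited fields, then capture the digits of the fifth field.
# -n with the p flag prints only when the pattern matched.
captured=$(printf '%s\n' "$header" |
  sed -E -n 's/^@[A-Za-z0-9._-]+:[A-Za-z0-9_]+:[A-Za-z0-9_]+:[A-Za-z0-9_]+:([0-9]+).*/\1/p')

# Same arithmetic as the Perl line: $number_of_tiles = $1 + 1
number_of_tiles=$((captured + 1))

echo "captured field:  $captured"
echo "number_of_tiles: $number_of_tiles"
```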

Alternate Formats

• This Perl script assumes that header lines of sequence files are written in one of several formats

• The code uses regular expressions to sort out formats:

if( $line =~ /\S+\s\S+/ ) # does the header line contain a space surrounded by non-space characters? If so, it is tested against two variants in turn:

$line =~ /^@[\d\w\-\._]+:[\d\w]+:[\d\w]+:[\d\w]+:(\d+)/ # Cassava 1.8 variant: does the header line start with a field of letters, digits, - , . , or _ followed by three more colon-delimited fields and then a number?

@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG

$line =~ /^@[\d\w\-\._\s]+:[\d\w]+:(\d+)/ # NCBI SRA variant: does the first field, which may also contain white space, come before one more colon-delimited field and then a number?

@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36

SolexaQA.pl

All other variants (header lines without a space):

$line =~ /^@[\d\w\-:\._]*:+\d*:(\d*):[\.\d]+:[\.\/\#\d\w]+$/ # does the line start with a part made of letters, digits, - , . , : , or _ followed by four colon-delimited fields, the last of which (letters, digits, . , / , or #) runs to the end of the line?

Example header line from GSL sequence file:
@3:1:1006:20321:Y
This would be described by $line =~ /^@\d+:\d+:\d+:\d+:[YN]/
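Patterns like these can be checked against sample headers with grep -E before they go into a script. A minimal sketch, assuming the three example headers are saved in a scratch file, with the first one written with a space between its two parts as in the Casava 1.8 format; POSIX classes stand in for Perl's \S, \s and \d.

```shell
headers=$(mktemp)
printf '%s\n' \
  '@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG' \
  '@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36' \
  '@3:1:1006:20321:Y' > "$headers"

# Perl's /\S+\s\S+/ : a white-space character with non-space text on both sides.
with_space=$(grep -c -E '[^[:space:]]+[[:space:]][^[:space:]]+' "$headers")

# The pattern suggested for the GSL header: four numeric fields, then Y or N.
gsl_style=$(grep -c -E '^@[0-9]+:[0-9]+:[0-9]+:[0-9]+:[YN]' "$headers")

echo "headers containing a space: $with_space"
echo "headers in the GSL style:   $gsl_style"
```

grep -c counts matching lines, so each pattern can be judged by whether it picks out the headers it is meant to and no others.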
