Introduction to Text-Processing - Stanford...
Transcript of Introduction to Text-Processing - Stanford...
![Page 1: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/1.jpg)
Introduction to Text-ProcessingJim Notwell
October 7, 2011
Thank you to Aaron Wenger, Cory McLean, and Gus Katsiapis
1
![Page 2: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/2.jpg)
Stanford UNIX Resources
• Host: cardinal.stanford.edu
• To connect from UNIX / Linux / Mac:
• To connect from Windows
PuTTY (http://goo.gl/s0itD)
2
![Page 3: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/3.jpg)
Many useful UNIX text processing commands
• awk cat cut grep head sed sort tail tee tr uniq wc zcat
• Unix commands work together via text streams
• Example usage and others available at:
http://tldp.org/LDP/abs/html/textproc.html
3
![Page 4: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/4.jpg)
Knowing UNIX Commands Eliminates Having To Reinvent The Wheel
For homework #1 last year, to perform a simple file sort, submissions used:
• 35 lines of Python
• 19 lines of Perl
• 73 lines of Java
• 1 line of UNIX commands
4
![Page 5: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/5.jpg)
Anatomy of a UNIX Command
• options: -n 1 -g -c = -n1 -gc• Output is directed to “standard
output” (stdout)
• If no input file is specified, inut comes from “standard input” (stdin)
• “-” also means stdin in a file list
5
command [options] [file1] [file2]
![Page 6: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/6.jpg)
The real power of UNIX commands comes from combinations through
piping (“|”)• Pipes are used to pass the output of one
program (stdout) as the input (stdin) to another
• Pipe character is <shift>-\
grep “CS273a” grades.txt | sort -k 2,2gr | uniq
Find all lines in the file that contain “CS273a”
6
![Page 7: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/7.jpg)
The real power of UNIX commands comes from combinations through
piping (“|”)• Pipes are used to pass the output of one
program (stdout) as the input (stdin) to another
• Pipe character is <shift>-\
grep “CS273a” grades.txt | sort -k 2,2gr | uniq
Find all lines in the file that contain “CS273a”
Sort those lines by the second column, in numerical order, highest to lowest
7
![Page 8: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/8.jpg)
The real power of UNIX commands comes from combinations through
piping (“|”)• Pipes are used to pass the output of one
program (stdout) as the input (stdin) to another
• Pipe character is <shift>-\
grep “CS273a” grades.txt | sort -k 2,2gr | uniq
Find all lines in the file that contain “CS273a”
Sort those lines by the second column, in numerical order, highest to lowest
Remove duplicates and print to standard output
8
![Page 9: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/9.jpg)
Output redirection (>, >>)
• Instead of writing everything to standard output, we can write (>) or append (>>) to a file
grep “CS273a” allClasses.txt > CS273aInfo.txt
9
cat addInfo.txt >> CS273aInfo.txt
![Page 10: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/10.jpg)
UNIX Commands
10
![Page 11: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/11.jpg)
man, whatis, apropos• UNIX program that invokes the manual written for a particular
program
• man sort• Shows all info about the program sort
• Hit <space> to scroll down, “q” to exit
• whatis sort• Shows short description of all programs that have “sort” in their
names
• apropos sort• Shows all programs that have sort in their names or short
descriptions
11
![Page 12: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/12.jpg)
cat• Concatenates files and prints them to standard output
• cat [option] [file] ...
• Variants for compressed input files:
• zcat (.gz files)
• bzcat (.bz2 files)
12
![Page 13: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/13.jpg)
cat• Concatenates files and prints them to standard output
• cat [option] [file] ...
• Variants for compressed input files:
• zcat (.gz files)
• bzcat (.bz2 files)
13
ABCD
123
![Page 14: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/14.jpg)
cat• Concatenates files and prints them to standard output
• cat [option] [file] ...
• Variants for compressed input files:
• zcat (.gz files)
• bzcat (.bz2 files)
14
ABCD
123
ABCD123
![Page 15: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/15.jpg)
head, tail• head: first ten lines
tail: last ten line
• -n option: number of lines• For tail, -n+K means line K to the end
• head -5 : first five lines
• tail -n73 : last 73 lines
• tail -n+10 | head -n 5 : lines 10-14
15
![Page 16: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/16.jpg)
cut• Prints the selected parts of lines from each file to
standard output
cut [option] ... [file] ...• -d Choose the delimiter between columns (default
tab)
• -f : Fields to print
• -f1,7 : fields 1 and 7
• -f1-4,7,11-13 :fields 1,2,3,4,7,11,12,13
16
![Page 17: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/17.jpg)
cut Example
17
CS 273 aCS.273.aCS 273 a
cut -f1,3 file.txt
cat file.txt | cut -f1,3=
file.txt
![Page 18: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/18.jpg)
cut Example
18
CS 273 aCS.273.aCS 273 a
cut -f1,3 file.txt
cat file.txt | cut -f1,3=
CS 273 aCS.273.aCS 273 a
file.txt
![Page 19: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/19.jpg)
cut Example
19
CS 273 aCS.273.aCS 273 a
cut -d ‘.’ -f1,3 file.txt
file.txt
![Page 20: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/20.jpg)
cut Example
20
CS 273 aCS.273.aCS 273 a
cut -d ‘.’ -f1,3 file.txtCS 273 aCS.aCS 273 a
file.txt
![Page 21: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/21.jpg)
cut Example
21
CS 273 aCS.273.aCS 273 a
cut -d ‘.’ -f1,3 file.txtCS 273 aCS.aCS 273 a
file.txtIn general, you should make sure your file columns are all delimited with the same
character(s) before applying cut!
![Page 22: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/22.jpg)
wc
• Print line, word, and character (byte) counts for each file, and totals of each if more tan one file specified
wc [option]... [file]...• -l Print only line counts
22
![Page 23: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/23.jpg)
sort• Sorts lines in a delimited file (default: tab)
• -k m,n Sorts by columns m to n (1-based)
• -g Sorts by general numerical value (can handle scientific format)
• -r Sorts in descending order
• sort -k1,1gr -k2,3• Sort on field 1 numerically (high to low because of r)
• Break ties on field 2 alphabetically
• Break further ties on field 3 alphabetically
23
![Page 24: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/24.jpg)
uniq
• Discard all but one successive identical lines from input and print to standard output
• -d Only print duplicate lines
• -i Ignore case in comparison
• -u Only print unique lines
24
![Page 25: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/25.jpg)
uniq Example
25
CS 273aCS 273aTA: Jim NotwellCS 273a
uniq file.txt
file.txt
![Page 26: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/26.jpg)
CS 273 aCS.273.aCS 273 a
uniq Example
26
CS 273aCS 273aTA: Jim NotwellCS 273a
uniq file.txtCS 273aTA: Jim NotwellCS 273a
file.txt
![Page 27: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/27.jpg)
uniq Example
27
CS 273aCS 273aTA: Jim NotwellCS 273a
file.txt
uniq -u file.txt
![Page 28: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/28.jpg)
CS 273 aCS.273.aCS 273 a
uniq Example
28
CS 273aCS 273aTA: Jim NotwellCS 273a
uniq -u file.txt TA: Jim NotwellCS 273a
file.txt
![Page 29: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/29.jpg)
uniq Example
29
CS 273aCS 273aTA: Jim NotwellCS 273a
uniq -d file.txt
file.txt
![Page 30: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/30.jpg)
CS 273 aCS.273.aCS 273 a
uniq Example
30
CS 273aCS 273aTA: Jim NotwellCS 273a
uniq -d file.txt CS 273a
file.txt
![Page 31: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/31.jpg)
CS 273 aCS.273.aCS 273 a
uniq Example
31
CS 273aCS 273aTA: Jim NotwellCS 273a
uniq -d file.txt CS 273a
file.txtIn general, you probably want to make sure your file is sorted before applying uniq
![Page 32: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/32.jpg)
grep• Search for lines that contain a word or match a
regular expression
grep [options] PATTERN [file] ...• -i Ignores case
• -v Output lines that do not match
• -E regular expressions
• -f <file> : patterns from a file (1 per line)
32
![Page 33: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/33.jpg)
grep Example
33
CS 273aCS273CS 273cs 273CS 273
grep -E “^CS[[:space:]]+273$” file
file.txt
For lines that start with CS
Then have one or more spaces (or tabs)
And ends with 273
![Page 34: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/34.jpg)
grep Example
34
CS 273aCS273CS 273cs 273CS 273
grep -E “^CS[[:space:]]+273$” file
file.txt
For lines that start with CS
Then have one or more spaces (or tabs)
And ends with 273
CS 273CS 273
![Page 35: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/35.jpg)
tr• Translate or delete characters from standard
input to standard output
tr [option] ... SET1 [SET2]
• -d Delete characters in SET1, don’t translate
35
![Page 36: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/36.jpg)
tr• Translate or delete characters from standard
input to standard output
tr [option] ... SET1 [SET2]
• -d Delete characters in SET1, don’t translate
36
Thisis anExample.
file.txt
cat file.txt | tr ‘\n’ ‘,’
![Page 37: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/37.jpg)
tr• Translate or delete characters from standard
input to standard output
tr [option] ... SET1 [SET2]
• -d Delete characters in SET1, don’t translate
37
Thisis anExample.
file.txt
cat file.txt | tr ‘\n’ ‘,’
This,is an,Example.,
![Page 38: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/38.jpg)
• Most common use is a string replace
sed -e “s/search/replace/g”
sed: stream editor
38
![Page 39: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/39.jpg)
• Most common use is a string replace
sed -e “s/search/replace/g”
sed: stream editor
39
Thisis anExample.
file.txt
sed file.txt | sed -e “s/is/EEE/g”
![Page 40: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/40.jpg)
• Most common use is a string replace
sed -e “s/search/replace/g”
sed: stream editor
40
Thisis anExample.
file.txt
sed file.txt | sed -e “s/is/EEE/g”
ThEEEEEE anExample.
![Page 41: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/41.jpg)
• Join lines of two files on a common field
join [option] ... file1 file2
• -1 Specify which column of file1 to join on
• -2 Specify which column of file2 to join on
• Important: file 1 and file 2 must already be sorted on their join fields
join
41
![Page 42: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/42.jpg)
Shell Scripting
42
![Page 43: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/43.jpg)
Common Shells
43
• Two common shells: bash and tcsh
• Run ps to see which you are using
![Page 44: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/44.jpg)
Multiple UNIX Commands Can Be Combined into a Single Shell Script
44
#!/bin/bashset -beEu -o pipefailcat $1 $2 > tmp.txtsort tmp.txt > $3
script.sh
>chmod u+x script.sh
> ./script.sh file1.txt file2.txtTo Run:
Set script to executable:
http://www.faqs.org/docs/bashman/bashref_toc.html
![Page 45: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/45.jpg)
Multiple UNIX Commands Can Be Combined into a Single Shell Script
45
#!/bin/tcsh -ecat $1 $2 > tmp.txtsort tmp.txt > $3
script.csh
>chmod u+x script.csh
> ./script.csh file1.txt file2.txtTo Run:
Set script to executable:
http://www.the4cs.com/~corin/acm/tutorial/unix/tcsh-help.html
![Page 46: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/46.jpg)
for loop• BASH for loop to print 1,2,3 on separate lines
• TCSH for loop to print 1,2,3 on separate lines
46
for i in `seq 1 3`do echo ${i}done
foreach i ( `seq 1 3` ) echo ${i}end
![Page 47: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/47.jpg)
for loop• BASH for loop to print 1,2,3 on separate lines
• TCSH for loop to print 1,2,3 on separate lines
47
for i in `seq 1 3`do echo ${i}done
foreach i ( `seq 1 3` ) echo ${i}end
Special quote characters, usually left of “1” on
keyboard that indicate we should execute the
command within the quotes
![Page 48: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/48.jpg)
UCSC Kent Source Utilities
48
![Page 49: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/49.jpg)
• Many C programs in this directory that do manipulation of sequences or chromosome ranges
• Run programs with no arguments to see help message
/afs/ir/class/cs273a/bin/@sys/
49
selectFile
overlapSelect [OPTION]… selectFile inFile outFile
inFile
![Page 50: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/50.jpg)
• Many C programs in this directory that do manipulation of sequences or chromosome ranges
• Run programs with no arguments to see help message
/afs/ir/class/cs273a/bin/@sys/
50
selectFile
overlapSelect [OPTION]… selectFile inFile outFile
inFile
outFileOutput is all inFile elements that overlap any selectFile elements
![Page 51: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/51.jpg)
Scripting Languages
51
![Page 52: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/52.jpg)
• A quick-and-easy shell scripting language
• http://www.grymoire.com/Unix/Awk.html
• Treats each line of a file as a record, and splits fields by whitespace
• Fields referenced as $1, $2, $3, … ($0 is entire line)
awk
52
![Page 53: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/53.jpg)
Anatomy of an awk script
awk ‘BEGIN {...} {...} END {...}’
Before the first line Once per line After the last line
53
![Page 54: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/54.jpg)
• Output the lines where column 3 is less than column 5 in a comma-delimited file. Output a summary line at the end.
awk Example
54
awk -F’,’‘BEGIN{ct=0;}{ if ($3 < $5) { print $0; ct=ct+1; } }END { print “TOTAL LINES: “ ct; }’
![Page 55: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/55.jpg)
Useful Things From awk• Make sure fields are delimited with tabs (to be used
by cut, sort, join, etc.)
awk ‘{print $1 “\t” $2 “\t” $3}’ whiteDelim.txt > tabDelim.txt
• Good string processing using substr, index, length functions
awk ‘{print substr($1, 1, 10)}’ longNames.txt > shortNames.txt
55
![Page 56: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/56.jpg)
Useful Things From awk• Make sure fields are delimited with tabs (to be used
by cut, sort, join, etc.)
awk ‘{print $1 “\t” $2 “\t” $3}’ whiteDelim.txt > tabDelim.txt
• Good string processing using substr, index, length functions
awk ‘{print substr($1, 1, 10)}’ longNames.txt > shortNames.txt
56
String to manipulate Start Position Length
![Page 57: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/57.jpg)
57
![Page 58: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/58.jpg)
Miscellaneous
58
![Page 59: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/59.jpg)
Interacting with UCSC Genome Browser MySQL Tables
• Galaxy (a GUI to make SQL commands easy)–http://main.g2.bx.psu.edu/
• Direct interaction with the tables:mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A –Ne “<STMT>“
e.g.mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A –Ne \ “select count(*) from hg18.knownGene“;
+-------+
| 66803 |
+-------+
http://dev.mysql.com/doc/refman/5.1/en/tutorial.html 30
![Page 60: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/60.jpg)
Python• A scripting language with many useful
constructs• Easier to read than Perl• http://wiki.python.org/moin/
BeginnersGuide• http://docs.python.org/tutorial/index.html
• Call a python program from the command line:python myProg.py
3660
![Page 61: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/61.jpg)
Number types• Numbers: int, float>>> f = 4.7>>> i = int(f)>>> j = round(f)>>> i4>>> j5.0>>> i*j20.0>>> 2**i16
3761
![Page 62: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/62.jpg)
Strings>>> dir(“”)
[…, 'capitalize', 'center', 'count', 'decode', 'encode', 'endswith', 'expandtabs', 'find', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'replace', 'rfind', 'rindex', 'rjust', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']
>>> s = “hi how are you?”
>>> len(s)15
>>> s[5:10]
‘w are’
>>> s.find(“how”)3
>>> s.find(“CS273”)
-1
>>> s.split(“ “)[‘hi’, ‘how’, ‘are’, ‘you?’]
>>> s.startswith(“hi”)True
>>> s.replace(“hi”, “hey buddy,”)‘hey buddy, how are you?’
>>> “ extraBlanks ”.strip()‘extraBlanks’
3862
![Page 63: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/63.jpg)
Lists• A container that holds zero or more objects in
sequential order>>> dir([])
[…, 'append', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']
>>> myList = [“hi”, “how”, “are”, “you?”]>>> myList[0]
‘hi’
>>> len(myList)
4
>>> for word in myList:
print word[0:2]
hi
ho
ar
yo
>>> nums = [1,2,3,4]
>>> squares = [n*n for n in nums]>>> squares
[1, 4, 9, 16]3963
![Page 64: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/64.jpg)
Dictionaries• A container like a list, except key can be
anything (instead of a non-negative integer)
>>> dir({})
[…, clear', 'copy', 'fromkeys', 'get', 'has_key', 'items', 'iteritems', 'iterkeys', 'itervalues', 'keys', 'pop', 'popitem', 'setdefault', 'update', 'values']
>>> fruits = {“apple”: True, “banana”: True}
>>> fruits[“apple”]
True
>>> fruits.get(“apple”, “Not a fruit!”)
True
>>> fruits.get(“carrot”, “Not a fruit!”)
‘Not a fruit!’
>>> fruits.items()
[('apple', True), ('banana', True)] 4064
![Page 65: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/65.jpg)
Reading from files
>>> openFile = open(“file.txt”, “r”)
>>> allLines = openFile.readlines()
>>> openFile.close()
>>> allLines[‘Hello, world!\n’, ‘This is a file-reading\n’, ‘\texample.\n’]
41
Hello, world!This is a file-reading example.
file.txt
65
![Page 66: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/66.jpg)
Writing to files>>> writer = open(“file2.txt”, “w”)
>>> writer.write(“Hello again.\n”)
>>> name = “Cory”
>>> writer.write(“My name is %s, what’s yours?\n” % name)
>>> writer.close()
42
Hello again.My name is Cory, what’s yours?
file2.txt
66
![Page 67: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/67.jpg)
Creating functionsdef compareParameters(param1, param2):
if param1 < param2:
return -1
elif param1 > param2:
return 1
else:
return 0
def factorial(n):
if n < 0:
return None
elif n == 0:
return 1 else:
retval = 1
num = 1
while num <= n:
retval = retval*num
num = num + 1
return retval4367
![Page 68: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/68.jpg)
Example program
#!/usr/bin/env python
import sys # Required to read arguments from command line
if len(sys.argv) != 3:
print “Wrong number of arguments supplied to Example.py”
sys.exit(1)
inFile = open(sys.argv[1], “r”)
allLines = inFile.readlines()
inFile.close()
outFile = open(sys.argv[2], “w”)
for line in allLines:
outFile.write(line)
outFile.close()
44
Example.py
68
![Page 69: Introduction to Text-Processing - Stanford Universityweb.stanford.edu/class/cs273a/presentations.aut11/UnixTextProces… · Useful Things From awk • Make sure fields are delimited](https://reader033.fdocuments.in/reader033/viewer/2022053006/5f09f2437e708231d4294437/html5/thumbnails/69.jpg)
Example program
python Example.py file1 file2
sys.argv = [‘Example.py’, ‘file1’, ‘file2’]
45
#!/usr/bin/env pythonimport sys # Required to read arguments from command line
if len(sys.argv) != 3: print “Wrong number of arguments supplied to Example.py” sys.exit(1)
inFile = open(sys.argv[1], “r”)allLines = inFile.readlines()inFile.close()
outFile = open(sys.argv[2], “w”)for line in allLines: outFile.write(line)
outFile.close()
69