Post on 20-Jan-2016

Advanced Text Processing

Lecture Overview

Character manipulation commands: cut, paste, tr

Line manipulation commands: sort, uniq, diff

Regular expressions and grep

Text replacement using sed

Cutting Lines – cut

The cut command extracts sections from each line of the input file

Command line options for cut:
  -c – output only these characters
  -f – output only these fields
  -d – use this character as the field delimiter

cut options [files]

Cutting Lines – cut

With cut, at least one of the selection options (-c or -f) must be specified

The value given with -c or -f can be:
  A number – specifies a single position
  A range – specifies a sequence of positions
  A comma-separated list – specifies multiple positions or ranges

cut – Examples

Given a file called 'my_phones.txt':

ADAMS, Andrew 7583
BARRETT, Bruce 6466
BAYES, Ryan 6585
BECK, Bill 6346
BENNETT, Peter 7456
GRAHAM, Linda 6141
HARMER, Peter 7484
MAKORTOFF, Peter 7328
MEASDAY, David 6494
NAKAMURA, Satoshi 6453
REEVE, Shirley 7391
ROSNER, David 6830

cut – Examples

head -3 my_phones.txt | cut -c3-16

AMS, Andrew 75
RRETT, Bruce 6
YES, Ryan 6585

head -3 my_phones.txt | cut -d" " -f2

Andrew
Bruce
Ryan

head -3 my_phones.txt | cut -c1-3,10,12,15-18

ADAde7583
BARBu 646
BAYa 85

Merging Files – paste

The paste command merges multiple files by concatenating corresponding lines

Command line options for paste:
  -d – provide a list of separator characters
  -s – paste one file at a time instead of in parallel (each file becomes a single line)

paste [options] [files]

paste – Examples

Assume that we are given 3 input files:

first.txt   last.txt    num.txt
Andrew      ADAMS       7583
Bruce       BARRETT     6466
Ryan        BAYES       6585
Bill        BECK        6346
Peter       BENNETT     7456
Linda       GRAHAM      6141
Peter       HARMER      7484
Peter       MAKORTOFF   7328
David       MEASDAY     6494
Satoshi     NAKAMURA    6453

paste – Examples

paste first.txt last.txt num.txt | head -3

Andrew ADAMS 7583
Bruce BARRETT 6466
Ryan BAYES 6585

paste -d" :" first.txt last.txt num.txt | head -3

Andrew ADAMS:7583
Bruce BARRETT:6466
Ryan BAYES:6585

paste -s last.txt first.txt num.txt | cut -f1-5,10

ADAMS BARRETT BAYES BECK BENNETT NAKAMURA
Andrew Bruce Ryan Bill Peter Satoshi
7583 6466 6585 6346 7456 6453

Translating Characters – tr

The tr command is used to translate between one character set and another

Input is read from standard input and written to standard output (no files)

With no options, tr accepts two character sets with equal lengths, and replaces each character with the corresponding one

tr [options] set1 [set2]

Deleting or Squeezing Characters – tr

Sets contain literal characters, or character ranges, such as: 'a-z' or 'DEFa-z'

With command line options, tr can also be used to delete or squeeze characters

Command line options for tr:
  -d – delete characters in set1
  -s – squeeze each run of a repeated character into a single occurrence

Defining Sets for tr

tr has some interpreted sequences to simplify the definition of sets:
  [:alpha:] – all letters
  [:digit:] – all digits
  [:alnum:] – all letters and digits
  [:space:] – all whitespace
  [:punct:] – all punctuation characters
  [CHAR*REPEAT] – REPEAT copies of CHAR
  [CHAR*] – copies of CHAR until set1 length
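The set notations above can be tried on a short string. This is a minimal sketch, assuming a POSIX-compatible tr; the `sample` text is made up for illustration and is not one of the lecture files.

```shell
# Hypothetical sample input, used only for this illustration.
sample='Call  555-1234, ext. 42!'

# Delete every punctuation character.
no_punct=$(printf '%s\n' "$sample" | tr -d '[:punct:]')

# Replace every digit with '#': [#*] repeats '#' up to the length of set1.
masked=$(printf '%s\n' "$sample" | tr '[:digit:]' '[#*]')

# Squeeze each run of whitespace into a single character.
squeezed=$(printf '%s\n' "$sample" | tr -s '[:space:]')

printf '%s\n' "$no_punct" "$masked" "$squeezed"
```

The [CHAR*] form only makes sense in set2, since its length is taken from set1.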

tr – Examples

Change lower case to capital, and replace the digits 6, 7, 8 with the letters x, y, z

head -3 padded_phones.txt

ADAMS     Andrew  7583
BARRETT   Bruce   6466
BAYES     Ryan    6585

head -3 padded_phones.txt | tr 'a-z678' 'A-Zxyz'

ADAMS     ANDREW  y5z3
BARRETT   BRUCE   x4xx
BAYES     RYAN    x5z5

tr – Examples

Delete spaces, and digits 7 and 8:

head -3 padded_phones.txt | tr -d " 78"

ADAMSAndrew53
BARRETTBruce6466
BAYESRyan655

Squeeze sequences of spaces into one:

head -3 padded_phones.txt | tr -s " "

ADAMS Andrew 7583
BARRETT Bruce 6466
BAYES Ryan 6585

Reading from Standard Input

Many UNIX commands accept one or more input files listed in the command line (tr is one of the few that don't)

If no input file is given, these commands will read from the standard input

Alternately, if the file list contains a '-', the standard input will be inserted in its place

Standard Input – Example

cat last.txt | tr "A-Z" "a-z" | \
paste -d"_" first.txt - number.txt | head -10

Andrew_adams_7583
Imelda_aguilar_6518
Daniel_albers_7540
Pierre_amaudruz_7567
Friedhelm_ames_7581
Willy_andersson_6238
Andrei_andreyev_6491
Jonathan_aoki_6820
Donald_arseneau_6295
Danny_ashery_6188

Lecture Overview

Character manipulation commands: cut, paste, tr

Line manipulation commands: sort, uniq, diff

Regular expressions and grep

Text replacement using sed

Sorting Files – sort

The sort command reorders the lines in a file (or files), and sends the result to the standard output

Command line options for sort:
  -f – ignore case (fold lowercase to uppercase)
  -r – sort in reverse order
  -n – sort in numeric order

sort [options] [files]

Sorting Files – sort

With no options given, the input is sorted based on the ASCII code order (this holds in the C/POSIX locale; other locales may use a different collation order)

The sort command has many more options for selecting which fields to sort by, and for changing the way input is treated

As always, you should read the man pages for the full details
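As one illustration of field selection, the sketch below uses the POSIX options -t (field separator) and -k (sort key), which are not listed on the slides. The colon-separated records are made up for this example.

```shell
# Hypothetical records: name:department:extension
records='Ryan:ops:6585
Andrew:dev:7583
Bruce:dev:6466'

# Sort numerically on the third ':'-separated field only.
by_ext=$(printf '%s\n' "$records" | sort -t: -k3,3n)

# Sort by department, then by name within each department.
by_dept=$(printf '%s\n' "$records" | sort -t: -k2,2 -k1,1)

printf '%s\n' "$by_ext" "$by_dept"
```

Writing the key as -k3,3 rather than -k3 limits the key to that single field instead of extending it to the end of the line.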

sort – Example: Using Ignore-Case

Input    sort     sort -f
Bruce    Andrew   Andrew
Ryan     Bruce    bill
peter    Ryan     Bruce
Andrew   bill     peter
bill     peter    Ryan

sort – Example: Sorting Numbers

Input     sort      sort -n
38        1256875   18
18        18        38
1256875   38        66
66        575       575
575       66        1256875

Removing Duplicate Lines – uniq

The uniq command removes adjacent duplicate lines from its input file
  If the input is sorted, this removes all duplicate lines

Command line options for uniq:
  -i – ignore case
  -c – prefix lines by the number of occurrences
  -d – only print duplicate lines
  -u – only print unique lines
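The "adjacent" restriction is easy to miss. A minimal sketch, using a made-up list of names in which the duplicates are not next to each other:

```shell
# Hypothetical unsorted input with non-adjacent duplicates.
names='Peter
David
Peter
Ryan
David
Peter'

# uniq alone only collapses adjacent repeats, so nothing is removed here.
unsorted=$(printf '%s\n' "$names" | uniq)

# Sorting first makes duplicates adjacent, so each name appears once.
deduped=$(printf '%s\n' "$names" | sort | uniq)

# Names that occur more than once.
repeated=$(printf '%s\n' "$names" | sort | uniq -d)

printf '%s\n' "$deduped" "$repeated"
```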

uniq – Example

Input    uniq     uniq -c
Andrew   Andrew   1 Andrew
Bill     Bill     1 Bill
David    David    2 David
David    Peter    3 Peter
Peter    Ryan     1 Ryan
Peter
Peter
Ryan

uniq – Example

Input    uniq -u  uniq -d
Andrew   Andrew   David
Bill     Bill     Peter
David    Ryan
David
Peter
Peter
Peter
Ryan

Example – File Processing Using Pipes

Task – go over the book "War and Peace" and count the appearances of each word

Step 1: remove all punctuation marks

cat war_and_peace.txt | tr -d '[:punct:]'

Step 2: put each word in a separate line

cat war_and_peace.txt | tr -d '[:punct:]' | tr " " "\n"

Step 3: sort words

cat war_and_peace.txt | tr -d '[:punct:]' | tr " " "\n" | sort

Example – File Processing Using Pipes

Step 4: count appearances of each word

cat war_and_peace.txt | tr -d '[:punct:]' | tr " " "\n" | sort | uniq -c

Step 5: sort result by number of appearances

cat war_and_peace.txt | tr -d '[:punct:]' | tr " " "\n" | sort | uniq -c | sort -nr

Step 6: write output to file

cat war_and_peace.txt | tr -d '[:punct:]' | tr " " "\n" | sort | uniq -c | sort -nr > words.txt
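The six steps can be tried end to end on a few lines of text. This is a sketch under stated assumptions: `sample.txt` is a made-up stand-in for the book, and the pipeline adds two things the slides do not do, lower-casing and squeezing repeated separators.

```shell
# Hypothetical stand-in for war_and_peace.txt.
workdir=$(mktemp -d)
printf '%s\n' 'Well, Prince, so Genoa and Lucca' \
              'are now just family estates.' \
              'Well, well.' > "$workdir/sample.txt"

# Same steps as above, plus two assumptions of this sketch:
# lower-casing so "Well" and "well" are counted together, and -s so
# that runs of spaces do not produce empty "words".
tr -d '[:punct:]' < "$workdir/sample.txt" |
  tr 'A-Z' 'a-z' |
  tr -s ' ' '\n' |
  sort | uniq -c | sort -nr > "$workdir/words.txt"

# Most frequent word, with the count padding trimmed.
top=$(head -1 "$workdir/words.txt" | tr -s ' ' | sed 's/^ //')
echo "$top"
```

Reading the file with input redirection instead of cat gives the same result with one process fewer.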

Comparing Text Files – diff

The diff command takes two input files, and compares them

The output contains only the different lines, with their line numbers

Command line options for diff:
  -i – ignore case
  -b – ignore changes in amount of white space
  -B – ignore insertion or deletion of blank lines

diff – Examples

First file:

ADAMS Andrew 7583
BARRETT Bruce 6466
BAYES Ryan 6585
BECK Bill 6346
BENNETT Peter 7456

Second file:

ADAMS Andrew 7583
BARRETT Bruce 3333
BAYES   Ryan   6585
BECK Bill 6346
Bennett peter 7456

Output of diff on the two files:

2,3c2,3
< BARRETT Bruce 6466
< BAYES Ryan 6585
---
> BARRETT Bruce 3333
> BAYES   Ryan   6585
5c5
< BENNETT Peter 7456
---
> Bennett peter 7456

diff – Examples

For the same two files:

Output of diff -b:

2c2
< BARRETT Bruce 6466
---
> BARRETT Bruce 3333
5c5
< BENNETT Peter 7456
---
> Bennett peter 7456

Output of diff -bi:

2c2
< BARRETT Bruce 6466
---
> BARRETT Bruce 3333

Maintaining Output Consistency

During program development, assume that we have reached the correct output

We want to verify that it does not change

Create reference output file:

prog > prog.out

After changing the program, compare output:

prog | diff - prog.out
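Because diff exits with status 0 when its inputs are identical, the comparison can drive a script. A minimal sketch, in which the shell function `prog` is a hypothetical stand-in for the real program:

```shell
# Hypothetical program under test; stands in for "prog" above.
prog() { printf 'result: %s\n' "$((6 * 7))"; }

workdir=$(mktemp -d)

# Record the known-good output once.
prog > "$workdir/prog.out"

# Later: the pipeline's exit status is that of diff, which is 0
# only when the new output matches the reference file.
if prog | diff - "$workdir/prog.out" > /dev/null; then
  status='unchanged'
else
  status='changed'
fi
echo "$status"
```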

Lecture Overview

Character manipulation commands: cut, paste, tr

Line manipulation commands: sort, uniq, diff

Regular expressions and grep

Text replacement using sed

Searching For Matching Patterns – grep

The grep command searches files for patterns, and prints matching lines

The mandatory regexp argument defines a regular expression

A regular expression is a formula for matching strings that follow some pattern

grep [options] regexp [files]

Searching For Matching Patterns – grep

The simplest regular expression is just a sequence of characters

This regular expression matches only a single string – itself

The following command prints all lines from any of the files that contain word:

grep word files

Searching For Matching Patterns – grep

The power of grep lies in using more sophisticated regular expressions

Command line options for grep:
  -v – print all lines that don't match
  -c – print only a count of matched lines
  -n – print line numbers
  -h – don't print file names (for multiple files)
  -l – print file name but not matching line
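The options listed above can be compared side by side on two small files. This is a sketch only; the files `a.txt` and `b.txt` and their contents are made up for illustration.

```shell
workdir=$(mktemp -d)
# Hypothetical input files.
printf '%s\n' 'big boy' 'bad bug' 'better' > "$workdir/a.txt"
printf '%s\n' 'nothing here' 'one bug' > "$workdir/b.txt"

# -c: number of matching lines in one file.
count=$(grep -c 'bug' "$workdir/a.txt")

# -n: matching lines prefixed with their line numbers.
numbered=$(grep -n 'bug' "$workdir/a.txt")

# -v: lines that do NOT match.
inverted=$(grep -v 'bug' "$workdir/a.txt")

# -l: only the names of files that contain a match.
files=$(cd "$workdir" && grep -l 'bug' a.txt b.txt)

# -h: matching lines from several files, without file name prefixes.
no_names=$(cd "$workdir" && grep -h 'bug' a.txt b.txt)

printf '%s\n' "$count" "$numbered" "$inverted" "$files" "$no_names"
```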

Regular Expressions

Regular expressions are a powerful tool for searching and selecting text

Their origin is in the UNIX grep command (and further back in automata theory)

They have since been copied into many other tools and languages such as awk, sed, perl and Java

Regular Expressions vs. Filename Expansion

Note that regular expressions are different from filename expansion

Filename expansion uses some regular expression concepts and symbols, but:
  Filename expansion is done by the shell
  Regular expressions are passed as arguments to specific commands or utilities
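The difference shows up most clearly with '*', which means something different in each notation. A minimal sketch, using three made-up file names in a temporary directory:

```shell
workdir=$(mktemp -d)
# Hypothetical files, created only for this illustration.
touch "$workdir/bag.txt" "$workdir/big.txt" "$workdir/cat.txt"

# Filename expansion: the shell replaces b*.txt before echo ever runs.
globbed=$(cd "$workdir" && echo b*.txt)

# Regular expression: the quotes keep the shell away from the pattern,
# and grep itself interprets '^b.*\.txt$'.
matched=$(ls "$workdir" | grep '^b.*\.txt$')

# In a regular expression '*' means "zero or more of the preceding item",
# so 'b*' also matches names that contain no 'b' at all.
loose=$(ls "$workdir" | grep -c 'b*')

printf '%s\n' "$globbed" "$matched" "$loose"
```

Quoting the pattern matters: an unquoted regular expression containing '*' may be expanded by the shell before grep sees it.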

Matching a Single Character

A period (.) matches any single character

For example:

Regular Expression   Matches               Doesn't Match
b.g                  bag, debug, bigger    brag, bg, bad
U..X                 UNIX                  unix
.                    a, b, c               An empty line

Matching a Character Class

Square brackets ([]) match any single character within the brackets

If the first character following the left bracket is a '^', the expression matches any character not in the brackets

A '-' can be used to indicate a range, such as: [a-z]

Matching a Character Class

Regular Expression   Matches                  Doesn't Match
[Bb]ill              Bill, bill, got billed   Dill, ill, kill
t[aeiou].k           talk, stack, stink       track, take
number [^0-5]        number xxx, number 8:    number 59

Matching a Character Class

The same predefined character classes used for tr can also be used here

For portability reasons, [:alpha:] is always preferable to [A-Za-z]

Note: the brackets are part of the symbolic names, and must be included in addition to the enclosing brackets, i.e. [[:alpha:]]
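A short sketch of the double-bracket form in grep patterns. The four input lines are made up for illustration.

```shell
# Hypothetical mixed input.
lines='ADAMS
7583
Bruce6466
#42'

# Lines made up entirely of letters.
letters=$(printf '%s\n' "$lines" | grep '^[[:alpha:]]*$')

# Number of lines containing at least one digit.
with_digit=$(printf '%s\n' "$lines" | grep -c '[[:digit:]]')

# Lines whose first character is neither a letter nor a digit:
# '^' inside the outer brackets negates the class.
other=$(printf '%s\n' "$lines" | grep '^[^[:alnum:]]')

printf '%s\n' "$letters" "$with_digit" "$other"
```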

Matching Repetitions

An asterisk (*) represents zero or more matches of the regular expression it follows

Regular Expression   Matches                   Doesn't Match
ab*c                 ac, abc, aaabbbc          aba, ca, cb
t.*ing               thing, string, thinking   king

Matching Special Characters

Sometimes we want to literally match a character that has a special meaning, such as '*' or '['

There are two ways to do that:
  Precede the character with a '\'
  Use square brackets – any character inside is taken literally

Matching Special Characters

Regular Expression   Matches                  Doesn't Match
a\.c                 a.c                      abc
\.\.\.*              the end..., more.....    abc, stop.
[*.]                 * start *, Sys.print     Hello world, abc
C:\\bin              C:\bin                   C:\\bin

Matching the Beginning or the End of a Line

A regular expression that begins with a caret (^) can match a string only at the beginning of a line

Similarly, a regular expression that ends with a dollar sign ($) can match a string only at the end of a line

Matching the Beginning or the End of a Line

Regular Expression   Matches                Doesn't Match
^T                   This line, That bug    START, My Tag
^num.*[0-9]$         num5, num99, number 1  my num1, the number 6, num 6a
^t.*k$               talk, track, tk        stack, take

Using Regular Expressions with grep – Examples

cat bugs.txt

big boy
bad bug
bag
bigger bag
better
boogie nights

grep 'b.g' bugs.txt

big boy
bad bug
bag
bigger bag

grep 'b.g.' bugs.txt

big boy
bigger bag

grep 'b.*g.' bugs.txt

big boy
bigger bag
boogie nights

Using Regular Expressions with grep – Examples

cat f.txt

ADAMS,
Andrew
7583
BARRETT,
Bruce
6466
BAYES,
Ryan
6585

grep '[[:alpha:]],' f.txt

ADAMS,
BARRETT,
BAYES,

grep '^[C-Z][[:lower:]]*$' f.txt

Ryan

grep '^[^[:alpha:]0-3]*$' f.txt

6466
6585

Pipes and Regular Expressions – Example

Task: create a file containing the names of all source files in the current directory, sorted by the number of lines in each file

Step 1: count lines in each file

wc -l *

Step 2: leave only '.c' and '.h' files

wc -l * | grep '\.[ch]$'

Step 3: sort in reverse order (largest first)

wc -l * | grep '\.[ch]$' | sort -nr

Pipes and Regular Expressions – Example

Step 4: squeeze leading spaces (into one)

wc -l * | grep '\.[ch]$' | sort -nr | tr -s " "

Step 5: remove number field

wc -l * | grep '\.[ch]$' | sort -nr | tr -s " " | cut -d" " -f3

Step 6: write output to file

wc -l * | grep '\.[ch]$' | sort -nr | tr -s " " | cut -d" " -f3 > sorted_source_files.txt

Which grep to Use?

In addition to grep itself, there are two more variants of it: egrep and fgrep
  Use grep for most standard text finding tasks
  Use egrep for complex tasks, where basic regular expressions are just not enough, and you need to use extended regular expressions
  Use fgrep when only fixed strings are searched, and speed is of the essence
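POSIX grep provides the same two behaviours through the -E (extended) and -F (fixed string) options, which the sketch below uses in place of the egrep and fgrep command names. The three input lines are made up for illustration.

```shell
# Hypothetical input containing regex metacharacters.
text='price: 3.50
price: 3x50
a|b'

# Basic grep: '.' is a wildcard, so both price lines match.
basic=$(printf '%s\n' "$text" | grep -c '3.50')

# Fixed-string search (fgrep behaviour): '.' is an ordinary character.
fixed=$(printf '%s\n' "$text" | grep -F -c '3.50')

# Extended search (egrep behaviour): '|' is the OR operator.
extended=$(printf '%s\n' "$text" | grep -E -c 'x50|[.]50')

# With -F the '|' is literal too, so only the third line matches.
literal_bar=$(printf '%s\n' "$text" | grep -F 'a|b')

printf '%s\n' "$basic" "$fixed" "$extended" "$literal_bar"
```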

Extended Regular Expressions – egrep

Extended regular expressions support all basic regular expression syntax, plus some additional special characters:
  + – similar to '*', but at least one appearance
  ? – similar to '*', but zero or one appearances
  () – grouping
  a|b – the OR operator – matches either regular expression a or regular expression b

Extended Regular Expressions – egrep

Regular Expression   Matches           Doesn't Match
num6+                num666, num654    num566, number
num6?5               num65, num555     num6, num665
Barret|Bennet        Barret, Bennet
B(arr|enn)et         Barret, Bennet
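The operators in the table can be checked with grep -E, the POSIX spelling of egrep. A minimal sketch; the name list and the color/colour words are made up for illustration.

```shell
# Hypothetical list of names.
names='Barret
Bennet
Burnet
Barrret'

# Grouping with alternation: "arr" or "enn" between B and et.
either=$(printf '%s\n' "$names" | grep -E '^B(arr|enn)et$')

# '+' requires at least one 'r' after "Ba".
plus=$(printf '%s\n' "$names" | grep -E -c '^Bar+et$')

# '?' makes the 'u' optional, but does not allow two of them.
optional=$(printf '%s\n' 'color' 'colour' 'colouur' | grep -E -c '^colou?r$')

printf '%s\n' "$either" "$plus" "$optional"
```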

Lecture Overview

Character manipulation commands: cut, paste, tr

Line manipulation commands: sort, uniq, diff

Regular expressions and grep

Text replacement using sed

Stream Editor – sed

sed is a script editor for text streams, which supports basic regular expressions

It performs transformations on an input stream, based on simple instructions

sed has many commands, but the most commonly used is the substitute command:

sed 's/pattern/replacement/[g]' [file]

Stream Editor – sed

pattern is any basic regular expression

replacement is a string that will replace one or more matches of pattern

The optional g flag defines whether the operation is global – without it only the first match in every line is replaced

The special character '&' can be used inside replacement to refer to the matched text

sed – Examples

cat bugs.txt

big boy
bad bug
bag
bigger bag
better

sed 's/b.g/XXX/' bugs.txt

XXX boy
bad XXX
XXX
XXXger bag
better

sed 's/b.g/XXX/g' bugs.txt

XXX boy
bad XXX
XXX
XXXger XXX
better

sed – Examples

head -2 my_phones.txt

ADAMS, Andrew 7583
BARRETT, Bruce 6466

head -2 my_phones.txt | sed 's/ [[:upper:]]/<&>/g'

ADAMS,< A>ndrew 7583
BARRETT,< B>ruce 6466

head -2 my_phones.txt | sed 's/[[:digit:]]*$/###/g'

ADAMS, Andrew ###
BARRETT, Bruce ###

Matching and Reusing Portions of a Pattern in sed

It is also possible to use portions of the matching pattern

Within the pattern, portions should be enclosed between '\(' and '\)'

In replacement, the special sequences '\1', '\2', etc. can be used to refer to the matched portions

Matching and Reusing Portions of a Pattern in sed – Examples

Remove the first name from each line:

head -2 my_phones.txt | sed 's/ [[:upper:]][[:lower:]]* / /'

ADAMS, 7583
BARRETT, 6466

Replace first name with initial:

head -2 my_phones.txt | sed 's/ \([[:upper:]]\)[[:lower:]]* / \1. /'

ADAMS, A. 7583
BARRETT, B. 6466

Matching and Reusing Portions of a Pattern in sed – Examples

Switch between first and last names:

head -2 my_phones.txt | sed 's/\(.*\), \(.*\) /\2 \1 /'

Andrew ADAMS 7583
Bruce BARRETT 6466

Switch names and parenthesize number:

head -2 my_phones.txt | sed 's/\(.*\), \(.*\) \(.*\)/\2 \1: (03-555\3)/'

Andrew ADAMS: (03-5557583)
Bruce BARRETT: (03-5556466)