Unix Text Analysis


    Data Manipulation with UNIX

    Introduction

Who is this course for?

This course is for anyone who sometimes needs to manipulate UNIX data files - that is to say, files in plain text format - without needing all the power of a spreadsheet or database. It is for someone who does not necessarily know much Unix but is comfortable with typing less at the command line to view a file and moving around with cd, maybe piping grep to less to see just some lines of a file. For some elements of the course it will be an advantage to know some programming, but not much: everything we do can be explained.

    Why do this course?

Besides learning that you can do quite a lot of useful data cleansing, manipulation, and simple reporting from the command line, this course will improve your general confidence in using Unix.

    I refer below to two text files which can be downloaded from the ISD training resources website.

    Displaying file contents

In what follows, I assume that we are dealing with record oriented data, where each line of the file analysed is a case or record - a collection of data items that belong together.

First let's check the contents of your files with less or head. This will give you a clue to their format. We will start with a comma delimited file called xresults.csv. Each line has the following structure

Surname, Maths_score, English_score, History_score

There are four fields separated by commas. This is a very common file type and easy to work with, but it has a disadvantage - you may have text fields that contain commas as data. In those cases it is easiest to use another character as a field delimiter when you create the data (you may be able to do this, for example, with Excel - if you cannot, you may have to do some clever data munging).

    You can use either

less xresults.csv

    Or

head xresults.csv

When we view the file with less we see

ADAMS,55,63,65
ALI,52,46,35
BAGAL,51,58,55
BENJAMIN,59,70,68
BLAKEMORE,56,38,40
BUCHAN,45,62,59
CHULANI,63,69,69
CLARK,52,64,65
DALE,50,55,52
DE SOUZA,44,60,62
DENCIK,57,67,65
DOBLE,64,56,65
DRURY,50,50,49
EL-DANA,62,59,60
FREEMAN,52,58,62
FROGGATT,39,57,59
GEORGARA,56,52,50
JAN,62,63,59
JENNER,56,67,65
JUNCO,48,57,55
LEFKARITIS,53,56,59
LUKKA,58,59,55
MILNER,53,62,58
MIYAJI,58,66,60
NICHOLSON,55,55,58
PATEL,60,59,54
PEIRIS,60,52,55
RAMANI,42,43,40
ROSEN,54,55,54
ROWLANDS,47,50,48

(A screen at a time.)

    Counting Data Items

First, let's make some simple counts on this file. We can use wc to count characters, words (anything surrounded by whitespace) and lines

wc xresults.csv
wc -w xresults.csv
wc -c xresults.csv
wc -l xresults.csv

When we are dealing with record oriented data like ours, wc -l will display the number of records.

    Selecting Data Items

    Selecting Rows

    Next, we will select some data using grep. Try the following command

grep '^R' xresults.csv

This will display only three rows of the file. The expression in quotes is the search string. If we wish we can direct the output of this process to a new file, like this

grep '^R' xresults.csv > outputfile.txt

This command line uses the redirect output symbol. In Unix the default output destination is the screen, and it's known as stdout (when it needs naming). The default input source is the keyboard, known as stdin. So when data is coming from or going to anywhere else, we use redirection. We use redirection with > to pass the results of a process to a new output file, or we can append to an existing file by using >> instead of >.
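For example (the output filename here is just an illustration), we could collect the B surnames in a new file and then append the R surnames to it

grep '^B' xresults.csv > some_names.txt    # creates a new file; some_names.txt is an example name

grep '^R' xresults.csv >> some_names.txt   # appends to the same file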

We can use a similar command line to count the rows selected, but this time let's change the grep

    command slightly.

grep '^[RB]' xresults.csv | wc -l

This command line uses the pipe symbol. We use piping with | to pass the results of one process to another process. If we only wanted a count of the lines that match then instead of piping the result to wc we could use the -c option on grep, like this

grep -c '^[RB]' xresults.csv

Also notice that in the cases above we have used the anchor ^ to limit the match by grep to the start of a line. The anchor $ limits the search to matches at the end of the line. We have used the character class, indicated by [ and ], containing R and B, and grep will succeed if it finds any of the characters in the class. We enclose this regular expression in single quotes.
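As a small illustration of the $ anchor, the following matches only those lines whose final score ends in a 5

grep '5$' xresults.csv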

    We use grep in this way to select row data.

More About Searching

    The standard form of a basic grep command is

grep [options] 'search expression' filename

Typically the search expression is a regular expression. The simplest type of expression is a string literal - a succession of characters each treated literally, that is to say standing for themselves and nothing else. If the string literal contains a space, we will need to surround it by single quote marks. In our data we might look for the following

grep 'DE SOUZA' xresults.csv

    The next thing to learn is how to match a class of characters rather than a specific character. Consider

  • 8/8/2019 Unix Text Analysis

    3/9

    Selecting Data Items3

grep '[A-Z]' xresults.csv

    This matches any uppercase alphabetic character. Similarly

grep '[0-9]' xresults.csv

matches any numeric character. In both these cases any single character of the right class causes a successful match. You can specify the class by listing as well. Consider

grep '[perl]' xresults.csv

which matches any character from the list p, e, r, l (the order in which they are listed is immaterial). You can combine a character class and a literal in a search string. Consider

grep 'Grade [BC]' someresults.csv

this search would find lines containing Grade B and lines containing Grade C.

You can also search using special characters as wildcards. The character . for example, used in a search, stands for any single character except the newline character. So the search

grep '.' xresults.csv

succeeds for every non-empty line. (If . matched the newline character it would succeed for empty lines as well.) The character * stands for zero or any number of repetitions of a character. So

grep 'a*' xresults.csv

matches

es

aaaaaa

and so on. Notice the blank line there? Probably not, but it's there. This regular expression matches zero or more instances of the preceding character.

Suppose that I wish to find a string that contains any sequence of characters followed by, for example, m. The grep command would be

grep '.*m' xresults.csv

This is a greedy search: it is not satisfied with the very first successful match, it continues past the first match it finds to match the longest string it can. For now we will just accept this greedy searching, but if you investigate regular expressions further you will discover that some versions have non-greedy matching strategies available.

    Selecting Columns

We can also select columns. Because this is a delimited file we can split it into columns at each delimiter - in this case a comma. This is equivalent to selecting fields from records.

Suppose that we want to extract column two from our data. We do this with the cut command. Here's an example

cut -d, -f2 xresults.csv | head

The first ten lines of the resulting display are

55
52
51
59
56
45
63
52
50
44

    We can display several columns like this

cut -d, -f1-3 xresults.csv

    which displays a contiguous range of columns, or

    cut -d, -f1,3 xresults.csv

  • 8/8/2019 Unix Text Analysis

    4/9

    Transforming Data4

which displays a list of separate columns. The -d option on cut specifies the delimiter (your system will have a default if you don't specify - find out what it is!) and the -f option specifies the column or field number. We use cut in this way to select column data.

    The general form of the cutcommand is

    cut -ddelimiter -ffieldnumbers datafile

So in the examples, we specified comma as the delimiter and used fields 1 and 3 and the range of fields 1 to 3.

    Selecting Columns and Rows

Suppose that we want to select just some columns for only some rows. We do this by first selecting rows with grep and passing this to cut to select columns. You can try

grep '^[AR]' xresults.csv | cut -d, -f1,3 | less

(I put the less in because it's generally a good idea if you're going to squirt data at the screen - it's not doing anything important.) Again, we use piping to pass the results of one process to another. You could also redirect the output to a new file.

    Transforming Data

There is another comma delimited file called results.csv which has the following structure

    Surname, Mean_score, Grade

Currently the grade is expressed as an alphabetic character. You should check this by viewing the surnames and grades from this file. The command is

cut -d, -f1,3 results.csv

We can translate the alphabetic grade into a numeric grade (1=A, 2=B etc) with the command tr. Try this

tr ,A ,1 < results.csv

Notice that tr translates character by character: here the comma is simply mapped to itself and A is mapped to 1. Because tr works on single characters rather than strings, the comma does not restrict the change to the grade field - any A in a surname would be changed too. To anchor a replacement to the end of the line (with A$, say) we need a string matching tool such as sed, which we meet later.

In the example tr gets its input from the file by redirection. You can perform a multiple translation by putting more than one character in each of the two sets. For example

tr ABC 123 < results.csv | less

You can use special characters in a tr command. For example, to search for or replace a tab there are two methods:

1. use the escape string \t to represent the tab

2. at the position in the command line where you want to insert a tab, first type control-v (^v) and then press the tab key.

There are a number of different escape sequences (1 above) and there are different control sequences (2 above) to represent special characters - for example \n (or sometimes ^M) to represent a new line and \s for white space. In general the escape sequence is easier to use.
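For example, to turn a tab delimited file into a comma delimited one we could write something like this (the filename is just an illustration)

tr '\t' ',' < marks.tsv    # marks.tsv is an example name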

    Sorting

    Alphabetically

    Unix sorts alphabetically by default. This means that 100 comes before 11.
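You can see this for yourself without a data file (a quick illustration)

printf '100\n11\n2\n' | sort

lists 100 first, then 11, then 2.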

    On Rows

    You can sort with the command sort. For example

    sort results.csv | less

This sorts the file in UNIX order on each character of the entire line. The default alphanumeric sort order means that the numbers one to ten would be sorted like this


1, 10, 2, 3, 4, 5, 6, 7, 8, 9

    This makes perfect sense but it can be a surprise the first time you see it.

    Descending

    You can sort in reverse order with the option -r. Like this

    sort -r results.csv | less

    Numerically

    To force a numeric sort, use the option -n.

    sort -n results.csv

You can use a sort on numeric data to get maximum and minimum values for a variable. Sort, then pipe to head -1 and tail -1, which will produce the first and last records in the file.
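For example, to see the lowest and highest Maths scores (column two of xresults.csv) we could combine this with cut

cut -d, -f2 xresults.csv | sort -n | head -1

cut -d, -f2 xresults.csv | sort -n | tail -1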

    On Columns

To sort on columns you must specify a delimiter, with -t, and a field number with -k. To sort on the third column of the results data, try this

sort -n -t , -k3 xresults.csv | less

(I've used a slightly more verbose method of specifying the delimiter here). You can select rows after

    sorting, like this

sort -n -t , -k3 xresults.csv | grep '^[A]' | less

which shows those pupils with surnames beginning with A sorted on the third field of the data file.

To sort on multiple columns we use more than one -k parameter. For example, to sort first on Maths score and then on surname we use

sort -t , -k2,2n -k1,1 xresults.csv | less

    Finding Unique Values in Columns

Suppose that you want to know how many different values appear in a particular column. With a little work, you can find this out using the command uniq. Used alone, uniq tests each line against what preceded it before writing it out and ignores duplicate lines.

Before we try to use uniq we need a sorted column with some repeated values. We can use cut to extract one. Test this first

    cut -d, -f2 results.csv | less

This should list just the second column of data, which has a few duplicate values.

We pass the output through sort to uniq

cut -d, -f2 results.csv | sort | uniq | less

    to get data in which the adjacent duplicates have been squeezed to one.

We can now pipe this result to wc -l to get the count of unique values.

    cut -d, -f2 results.csv | sort | uniq | wc -l

    Effectively, we can now calculate frequency results for our data.
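If you want an actual frequency table rather than just a count of distinct values, uniq has a -c option which prefixes each value with the number of times it occurs

cut -d, -f2 results.csv | sort | uniq -c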

    Joining Data Files

There are two UNIX commands that will combine data from different files: paste and join. We will look first at paste.

    Paste

Paste has two modes of operation depending on the option selected. The first operation is simplest: paste takes two files, treats each as column data and appends the second to the first. The command is

    paste first_file second_file


    Consider this file:

one
two
three

    Call this first_file. Then let this

four five six
seven eight nine
ten eleven twelve

    be second_file. The output would be

one	four five six
two	seven eight nine
three	ten eleven twelve

So paste appends the columns from the second file to the first, row by row. As with other commands you can redirect the output to a new file:

    paste first_file second_file > new_file

The other use of paste is to linearize a file. Suppose I have a file in the format

Jim
Tyson
UCL
Information Services

You can create this in a text editor. I can use paste to merge the four lines of data into one line

    Jim Tyson UCL Information Services

    The command is

paste -s file

As well as the -s option, I can add a delimiter character with -d. Try this

paste -d: -s file

    Join

We have seen how to split a data file into different columns, and we can also join two data files together. To do this there must be a column of values that match in each file, and the files must be sorted on the field you are going to use to join them.

    We start with files where for every row in file one there is a row in file two and vice versa.
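If the files are not already in surname order you can sort them first (the output filenames here are just examples)

sort -t, -k1,1 results.csv > results_sorted.csv      # sorted copy; example name

sort -t, -k1,1 xresults.csv > xresults_sorted.csv    # sorted copy; example name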

Consider our two files. xresults.csv has the structure

Surname, Maths_score, English_score, History_score

and results.csv has the structure

Surname, Mean_score, Grade

We can see then that these could be joined on the column surname with ease, since surname is unique. After sorting both files we can do this with the command line

    join -t, -j1 results.csv xresults.csv | less

The option -t specifies the delimiter and -j allows us to specify a single field number where this is the shared field.

If the columns on which to match for joining don't appear in the same position in each file, you can use the -jn m option several times, where in each case n is the numeric file handle (look at the order that you name the files later) and m is the number of the join field. In fact, we could write

    join -t, -j1 1 -j2 1 results.csv xresults.csv | less

for the same result as our previous join command.

Essentially, join matches lines on the chosen fields and adds column data. We could send the resulting output to a new file with > if we wished.


In my example there is (deliberately) one line in file one for each line in file two. There is of course no guarantee that this will be the case. To list all the lines from a file regardless of a match being found, we use the option -a and the file handle number.

    join -t, -a1 -j1 1 -j2 1 results.csv xresults.csv | less

This would list every line of results.csv and only those lines of xresults.csv where a match is found.

The default join is that only items having a matching element in both files are displayed. We can also produce a join where all the rows from the first file named and only the matching rows from the second are selected, as we did above. Finally, we can produce a version where all the rows of the second file are listed with only matching rows from the first, with the following

    join -t, -a2 -j1 1 -j2 1 results.csv xresults.csv | less

    And lastly, we can produce all rows from both files, matching or not with

    join -t, -a1 -a2 -j1 1 -j2 1 results.csv xresults.csv | less

The last thing we should learn about join is how to control the output. The option -o allows us to choose which data fields from each file are displayed. For example

-o 0,1.2,2.3

displays the match column (always denoted 0), the second column from the first file (1.2) and the third column from the second file (2.3).
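Putting this together with our join, something like

join -t, -j1 -o 0,1.2,2.3 results.csv xresults.csv | less

would display the surname, the mean score and the English score (the particular field choice here is just an illustration).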

    sed and AWK - more powerful searching and replacing

    sed

Sed is a powerful Unix tool and there are books devoted to explaining it. The name stands for stream editor, a reminder that it reads and processes files line by line. One of the basic uses of sed is to search a file - much like grep does - and replace the search expression with some other text specified by the user. An example may make this clearer

sed 's/abc/def/g' input

After the command name, we have s for substitute, followed by the search string and then the replace string, surrounded and separated by /, and then g indicating that this operation is global - we are looking to process every occurrence of abc in this file. The filename follows, in this case a file called input.
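We could use sed to do the grade translation from the tr section properly, because sed matches strings and understands anchors (a small sketch using results.csv)

sed 's/,A$/,1/' results.csv | less

This replaces an A grade at the end of a line with 1 and leaves any A inside a surname alone.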

    Some sed Hacks

Rather than pretend to cover sed in any real depth, there follows a very short list of sed tricks that are sometimes useful in processing data files. These are famous sed one-liners and are listed by Eric Pement on his website at http://www.pement.org/sed/sed1line.txt.

    sed G

Double spaces the file. It reads a line and G appends a newline character. Remember that reading in a newline is basic to sed's operation.

    sed '/^$/d;G'

Double spaces a file that already has some blank lines. First remove an empty line, then append a newline.

    sed 'G;G'

    Triple spaces the file.

    sed 'n;d'

This removes double line spacing - and does it in a rather crafty way. Assuming that the first line read is not blank, all even lines should be blank, so alternately printing a line out and deleting a line should result in a single spaced file.

    sed '/regex/{x;p;x;}'

This command puts a blank line before every occurrence of the search string regex.

    sed -n '1~2p'

This command prints only the odd numbered lines of a file (in effect, it deletes the even lines).

    I leave the investigation of more sed wizardry to you.


    AWK

AWK is a programming language developed specifically for text data manipulation. You can write complete programs in AWK and execute them in much the same way as a C or Java program (though AWK is interpreted, not compiled like C or byte code compiled like Java).

AWK allows for some sophisticated command line manipulation and I will use a few simple examples to illustrate.

Because our file is comma delimited, we will invoke AWK with the option -F,. AWK will automatically identify the columns of data and put the fields, a row at a time, into its variables $1, $2 and so on; the variable $NF always identifies the last field of data.

    So, we can try

awk -F, '{print $2, $NF}' results.csv

    We can also find text strings in a particular column for example with

awk -F, '$n ~ /searchtext/' results.csv

    Where n in $n is a column number.

The ~ means matches. The expression !~ means does not match.

Conditional processing in simple cases can be carried out by just stating the condition before the block of code to be executed (the code inside the braces). For example

awk -F, '$2>55 {print $2}' xresults.csv

    And we can create complex conditions

awk -F, '$2 > 50 || $3 < 50 {print $3}' xresults.csv

    The || means OR and && means AND in awk.

But we can construct more complex processes quite easily. The following code won't be difficult to understand if you know any mainstream programming language

cut -d, -f2- xresults.csv | awk -F, '{sum=0; for (i=1; i<=NF; i++) sum+=$i; print sum}'

which totals each pupil's three scores (the example is incomplete in the original, so the loop shown here is one plausible completion).
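In-line Perl - the Swiss army chainsaw of Unix data manipulation

You can run small pieces of Perl directly from the command line with the -e option. As a minimal sketch of the kind of one-liner discussed below (the value given to $number is just for illustration)

perl -e '$number = 3; $number >= 4 ? print "$number\n" : print "less than four\n"'   # 3 is an example value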


The example makes use of the popular but initially puzzling ternary operator, which is a kind of shorthand way of writing a conditional statement. Here the conditional is read

if $number is greater than or equal to four, print $number, else print the string less than four

The real value of in-line programming comes when we learn that we can loop through the output of other command line operations and execute Perl code. We do this with the option -n. Here is an example

cut -d, -f2 results.csv | perl -ne '$_ >= 55 ? print "well done\n" : print "what a shame\n"'

    Or we could do some mathematics

cut -d, -f2 results.csv | perl -ne '$n += $_; END { print "$n\n" }'

    which will sum the column of numbers.

Another very useful Perl function for command line use is split. In Perl, split takes a string and divides it into separate data items at a delimiter character and then it puts the results into an array.

    To illustrate this try the following

less results.csv | perl -ne 'chomp; @fields = split(/,/, $_); print $fields[0], "\t", $fields[1], "\t", $fields[2], "\n"'

Final Exercise

(Remember that > outputs the data from a process to a new file but >> appends it to the end of an existing file.)

3. Take the original xresults.csv and find the average examination mark for each pupil and, on the basis of the following rule, assign them to a stream

If average exam mark is greater than or equal to 60 the student is in stream A, else if average exam mark is greater than or equal to 50 the student is in stream B, else the student is in stream C.

    Create a new file that includes these two new data items for each pupil.

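One possible solution for the last exercise, sketched with awk (the output filename is just an example):

awk -F, '{avg=($2+$3+$4)/3; if (avg>=60) s="A"; else if (avg>=50) s="B"; else s="C"; print $0","avg","s}' xresults.csv > streams.csv   # streams.csv is an example name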

    THE END.