Data Manipulation with AWK
Evangelos Pournaras, Izabela Moise, Dirk Helbing
2015. 3. 3.
AWK
A "Swiss knife" for data manipulation, retrieval, formatting, processing, transformation, prototyping and more...
About AWK
Check www.awk.info.
• A pattern scanning and processing language.
• The name AWK comes from its creators: Alfred V. Aho, Peter J. Weinberger and Brian W. Kernighan
• An evolving yet stable, cross-platform language.
• Written in 1977 at AT&T Bell Laboratories.
• Data-driven language.
– POSIX standard for AWK
– Various implementations: gawk, nawk, mawk, spawk, etc.
"AWK is a convenient and expressive programming language that can be applied to a wide variety of computing and data-manipulation tasks."
What you can do with AWK
• Manage small databases
• Validate data
• Produce indexes & perform document preparation tasks
• Experiment with algorithms you can adapt later to other programming languages
Implementations
• GAWK
– Extract bits and pieces of data for processing
– Sort data
– Perform simple network communications
• MAWK
– Efficiency: byte code interpreter
• JAWK
– Java support
• NAWK, XGAWK, SPAWK, QTAWK, RunAWK, etc.
AWK Advantages
• Very simple
• Easy learning curve
• Standardized
• On-the-fly calculations
• No need to open/close files
• Interpreted, not compiled
– Avoids the edit-compile-test-debug cycle
Programming Philosophy
• Programming in AWK: building a list of rules
• Rules consist of a pattern and an action
– (pattern-1) {action} (pattern-2) {action} ...
• Linear scans, handling one data element at a time
– Resembling the Hadoop philosophy
– Random-access seek times vs. hard drive sizes
• Manipulating delimited text files in a single pass
• By design, a file is divided into records & fields
– Each line is a record
– Fields are delimited by a special character
Every clause is a potential action performed on the current record!
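The rule list described on this slide can be sketched as a small shell session. This is only an illustration: the file name scores.txt, its contents and the variable names are made up here and do not come from the lecture material.

```shell
# Hypothetical sample data: one record per line, fields separated by spaces.
workdir=$(mktemp -d)
printf 'alice 42\nbob 7\ncarol 19\n' > "$workdir/scores.txt"

# Two rules, each a pattern followed by an action.
# Every record is tested against every rule, in order, in a single pass.
rules_out=$(awk '
    $2 > 10    { print $1, "scored above 10" }   # rule 1: numeric comparison on field 2
    $1 ~ /^b/  { print $1, "starts with b" }     # rule 2: regular expression on field 1
' "$workdir/scores.txt")

echo "$rules_out"
```

A record that satisfies no pattern produces no output, and a record that satisfies several patterns triggers each matching action in turn.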
Comparison with other Languages
A case study with converting triplets to sparse matrices:
Source: https://github.com/brendano/awkspeed
Running an AWK program
Three ways to run an AWK program from the command line:
1. >awk 'program' input-file1 input-file2 ...
2. >awk -f program-file input-file1 input-file2 ...
3. Unix script: my-awk-script.sh
#!/usr/bin/awk -f
# awk rules go here
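As a rough sketch, the three invocation styles can be tried side by side with the same one-rule program. The input file, the program file count.awk and the variable names are assumptions made for this example. The interpreter path in the script is looked up with command -v instead of being hard-coded, since awk is not always installed at /usr/bin/awk.

```shell
workdir=$(mktemp -d)
printf 'one\ntwo\nthree\n' > "$workdir/input.txt"   # hypothetical input file

# 1. Program text given directly on the command line.
inline_out=$(awk 'END { print NR }' "$workdir/input.txt")

# 2. Program text read from a file with -f.
printf 'END { print NR }\n' > "$workdir/count.awk"
file_out=$(awk -f "$workdir/count.awk" "$workdir/input.txt")

# 3. Executable script whose first line names the interpreter.
#    The slide uses /usr/bin/awk; here the path is looked up so the sketch
#    also works where awk is installed elsewhere.
printf '#!%s -f\nEND { print NR }\n' "$(command -v awk)" > "$workdir/my-awk-script.sh"
chmod +x "$workdir/my-awk-script.sh"
script_out=$("$workdir/my-awk-script.sh" "$workdir/input.txt")

echo "$inline_out $file_out $script_out"
```

All three runs execute the same rule, so each should report the same record count for the three-line input.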
Program Structure
# Initialization body
BEGIN {
    # initialization actions
}
# Main execution body
{
    # main program actions
}
# Finalization body
END {
    # final actions
}
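A minimal sketch of how the three bodies cooperate, assuming an input of one number per line (the numbers and variable names are invented for this example):

```shell
# Hypothetical input: one number per line.
avg_out=$(printf '4\n8\n15\n' | awk '
    BEGIN { sum = 0; count = 0 }          # runs once, before any input is read
          { sum += $1; count++ }          # runs for every record
    END   { if (count > 0) printf "mean = %.2f\n", sum / count }   # runs once, after the last record
')
echo "$avg_out"
```

The explicit initialization in BEGIN is optional here, because unset AWK variables already behave as 0 in numeric context. It is kept only to show where setup code belongs.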
AWK Demonstration
example-01.awk, example-02.awk
AWK Regular Expressions
A pattern enclosed in slashes ('/') is checked against each input record.
• May contain letters, numbers, or both.
• /foo/ matches any record containing "foo"
• ~ matches
• !~ does not match
• | alternation expression
• ^ matches the beginning of a string
• $ matches the end of a string
• . matches any single character
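The operators listed on this slide can be exercised on a small contact list. This is a sketch only: the records, the file name contacts.txt and the variable names are hypothetical.

```shell
# Hypothetical records: name, phone, e-mail.
workdir=$(mktemp -d)
printf 'Amelia 555-5553 amelia@uni.edu\nBill 555-1675 bill@corp.com\nJulie 555-6699 julie@perso.be\n' > "$workdir/contacts.txt"

# /.../ on its own is matched against the whole record ($0).
whole_record=$(awk '/corp/ { print $1 }' "$workdir/contacts.txt")

# ~ tests a single field; $ anchors at the end, | separates alternatives.
edu_or_be=$(awk '$3 ~ /edu$|be$/ { print $1 }' "$workdir/contacts.txt")

# !~ keeps the records whose field does not match; ^ anchors at the start.
not_j=$(awk '$1 !~ /^J/ { print $1 }' "$workdir/contacts.txt")

# . matches any single character, so B.ll matches Bill.
any_char=$(awk '$1 ~ /^B.ll$/ { print $1 }' "$workdir/contacts.txt")

echo "$whole_record"; echo "$edu_or_be"; echo "$not_j"; echo "$any_char"
```

To match a literal dot, such as the one in ".edu", the dot has to be escaped as \. inside the slashes.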
AWK Demonstration
Scripts
>awk '/\.edu/ {print $0}' mail-list.txt
>awk '$1 ~ /J/' inventory-shipped.txt
>awk '$3 ~ /edu$|be$/' mail-list.txt
>awk '{if (length($0)>max) max=length($0)} END {print max}' mail-list.txt
>awk 'NF>0' inventory-shipped.txt
>awk 'END{print NR}'
>awk 'NR%2==0' mail-list.txt
>awk '$1 ~ /^Jan/ {sum+=$5} END {print sum}' inventory-shipped.txt
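To see what some of these one-liners compute, they can be run against a tiny stand-in file. The data below is invented and is not the lecture's inventory-shipped.txt. It only mimics a layout of a month name followed by four counts, with one blank line added on purpose.

```shell
workdir=$(mktemp -d)
# Hypothetical stand-in data: month plus four counts, with one blank line.
printf 'Jan 13 25 15 115\nFeb 15 32 24 226\n\nJan 21 36 64 620\n' > "$workdir/shipped.txt"

# Length of the longest line.
longest=$(awk '{ if (length($0) > max) max = length($0) } END { print max }' "$workdir/shipped.txt")

# A pattern without an action prints the matching records: here, the non-empty lines.
non_empty=$(awk 'NF > 0' "$workdir/shipped.txt" | wc -l | tr -d ' ')

# Every second record.
even_lines=$(awk 'NR % 2 == 0' "$workdir/shipped.txt")

# Sum of the fifth field over the records whose first field starts with Jan.
jan_total=$(awk '$1 ~ /^Jan/ { sum += $5 } END { print sum }' "$workdir/shipped.txt")

echo "$longest $non_empty $jan_total"
echo "$even_lines"
```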
Variables
• No variable declaration is needed.
• No type declaration is needed.
• Built-in variables:
– NF: number of fields
– NR: current record number
– FS: field separator
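A short sketch of the three built-in variables at work, together with an undeclared user variable. The colon-separated input and the name total are assumptions made for illustration.

```shell
# Hypothetical colon-separated input.
vars_out=$(printf 'root:x:0\ndaemon:x:1:extra\n' | awk '
    BEGIN { FS = ":" }                       # split fields on ":" instead of whitespace
    { total += NF                            # total starts out as 0, no declaration needed
      print "record " NR " has " NF " fields; last = " $NF }
    END { print "fields seen: " total }
')
echo "$vars_out"
```

Since NF holds the number of fields in the current record, $NF is a common idiom for referring to the last field.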
Functions
Specified as follows:
function awkFunction(a, b, c, d) {
    return a + b + c + d
}
Built-in functions:
• Numeric:
– sqrt, log, sin, cos, rand, etc.
• String:
– index, length, match, split, substr, etc.
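The following sketch combines a user-defined function with a few of the built-ins named above. The function hypotenuse and the sample strings are invented for this example.

```shell
func_out=$(awk '
    # User-defined function; the name and body are illustrative only.
    function hypotenuse(a, b) {
        return sqrt(a * a + b * b)            # sqrt is a built-in numeric function
    }
    BEGIN {
        print hypotenuse(3, 4)
        s = "data manipulation"
        print length(s)                       # number of characters
        print index(s, "man")                 # 1-based position of the substring
        print substr(s, 1, 4)                 # 4 characters starting at position 1
        n = split("a,b,c", parts, ",")        # fills parts[1..n] and returns n
        print n, parts[2]
    }
')
echo "$func_out"
```

A program that consists only of a BEGIN rule reads no input, which makes it convenient for trying out functions.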
Arrays
Associative arrays:
• Indices are strings rather than numbers
• arrayname[string]=value
• Multi-dimensional arrays:
– Supported by concatenation of indices into one string
– foo[5,12]="value"
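A minimal sketch of both array styles. The array names capital and grid and their contents are hypothetical. SUBSEP is the built-in separator that AWK places between the indices when it joins them into one string.

```shell
arr_out=$(awk '
    BEGIN {
        capital["Greece"] = "Athens"           # string index
        capital["Poland"] = "Warsaw"
        if ("Greece" in capital) print "Greece ->", capital["Greece"]
        if (!("Spain" in capital)) print "Spain not stored"

        grid[5, 12] = "value"                  # stored under the single key 5 SUBSEP 12
        for (key in grid) {
            split(key, idx, SUBSEP)            # recover the two original indices
            print idx[1], idx[2], grid[key]
        }
        if ((5, 12) in grid) print "grid[5,12] exists"
    }
')
echo "$arr_out"
```

Testing membership with the in operator does not create the element, whereas merely referencing capital["Spain"] would create it with an empty value.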
AWK Demonstration
example-03.awk, example-04.awk
AWK Example - Arrays
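BEGIN {}
{
    # Count how often each value of the fourth field occurs
    letters[$4]++;
}
END {
    # Report every value seen and its count (iteration order is unspecified)
    for (var in letters)
        print var, "exists", letters[var], "times."
    # Membership test that does not create the element
    if ("A" in letters)
        print "A exists"
    else
        print "A does not exist"
}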
Proposed Literature
AWK scripts: https://github.com/data-science-course/lectures/tree/master/awk
A. D. Robbins. Gawk: Effective AWK Programming. Free Software Foundation, Inc., 4.1 edition, April 2014.
How to read the user guide:
• Fast reading: Chapters 1-10
• Practical examples: Chapter 11
What is next?
• SQL and relational databases
• Plotting and visualizing data