Data Manipulation with AWK - ETH Zürich

Evangelos Pournaras, Izabela Moise, Dirk Helbing


AWK

A "Swiss knife" for data manipulation, retrieval, formatting, processing, transformation, prototyping and more...


About AWK

Check www.awk.info.

• A pattern scanning and processing language.

• The name AWK comes from the initials of its creators: Alfred V. Aho, Peter J. Weinberger and Brian W. Kernighan.

• An evolving yet stable, cross-platform language.

– POSIX standard for AWK
– Various implementations: gawk, nawk, mawk, spawk, etc.

• Written in 1977 at AT&T Bell Laboratories.

• Data-driven language.

"AWK is a convenient and expressive programming language that can be applied to a wide variety of computing and data-manipulation tasks."


What you can do with AWK

• Manage small databases

• Validate data

• Produce indexes & perform document preparation tasks

• Experiment with algorithms you can adapt later to other programming languages


Implementations

• GAWK

– Extract bits and pieces of data for processing
– Sort data
– Perform simple network communications

• MAWK

– Efficiency, byte code interpreter

• JAWK

– Java support

• NAWK, XGAWK, SPAWK, QTAWK, RunAWK, etc.


AWK Advantages

• Very simple

• Easy learning curve

• Standardized

• On-the-fly calculations

• No need to open/close files

• Interpreted, not compiled

– Avoiding the edit-compile-test-debug lifecycle


Programming Philosophy

• Programming in AWK: building a list of rules

• Rules consist of a pattern and an action

– (pattern-1) {action} (pattern-2) {action} ...

• Linear scans, handling one data element at a time

– Resembling the Hadoop philosophy
– Random access seek times vs. hard drive sizes

• Manipulating delimited text files in a single pass

• By design, division of a file into records and fields

– Each line is a record
– Fields are delimited by a special character

Every clause is a potential action performed on the current record!
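
A minimal sketch of such a rule list, run from the shell. The input records and their layout (a name followed by a number) are invented for illustration and are not part of the course data.

```sh
# Hypothetical input: one record per line, fields separated by spaces.
rules_out=$(printf 'alice 12\nbob 7\ncarol 30\n' | awk '
    $2 > 10 { print $1, "high" }           # rule 1: the pattern is a comparison on field 2
    /^b/    { print $1, "starts with b" }  # rule 2: the pattern is a regular expression
            { total += $2 }                # rule 3: no pattern, so it runs for every record
    END     { print "total", total }       # runs once, after the last record
')
echo "$rules_out"
```

Every record is tested against every rule in order, so a single record can trigger several actions during the same pass.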


Comparison with other Languages

A case study on converting triplets to sparse matrices:

Source: https://github.com/brendano/awkspeed


Running an AWK program

Three ways to run an AWK program from the command line:

1. >awk 'program' input-file1 input-file2 ...

2. >awk -f program-file input-file1 input-file2 ...

3. Unix script: my-awk-script.sh

#!/usr/bin/awk -f
# awk rules go here
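
The three invocation styles can be tried side by side as sketched below. All file names are hypothetical and live in a scratch directory, and the third variant assumes awk is installed at /usr/bin/awk, as on the slide, so it is skipped when that path is missing.

```sh
# All file names below are hypothetical and created in a scratch directory.
workdir=$(mktemp -d)
printf 'a 1\nb 2\n' > "$workdir/input.txt"

# 1. Program text given inline on the command line.
inline_out=$(awk '{ print $2 }' "$workdir/input.txt")

# 2. Program text read from a file with -f.
printf '{ print $2 }\n' > "$workdir/second-field.awk"
file_out=$(awk -f "$workdir/second-field.awk" "$workdir/input.txt")

# 3. Executable script whose first line names the awk interpreter.
if [ -x /usr/bin/awk ]; then
    printf '#!/usr/bin/awk -f\n{ print $2 }\n' > "$workdir/second-field.sh"
    chmod +x "$workdir/second-field.sh"
    script_out=$("$workdir/second-field.sh" "$workdir/input.txt")
fi

echo "$inline_out"
echo "$file_out"
rm -r "$workdir"
```

The inline form suits one-liners, while -f and the script form keep longer programs in a file that can be versioned and reused.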


Program Structure

# Initialization body
BEGIN {
    # initialization actions
}
# Main execution body
{
    # main program actions
}
# Finalization body
END {
    # final actions
}
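
As a small illustration of this structure, the sketch below sums a column of numbers. The three input values are made up for the example.

```sh
structure_out=$(printf '3\n4\n5\n' | awk '
    BEGIN { print "start"; sum = 0 }   # runs once, before any input is read
          { sum += $1 }                # runs once per input record
    END   { print "records:", NR       # runs once, after the last record
            print "sum:", sum }
')
echo "$structure_out"
```

The BEGIN and END bodies are optional, and the one-liners later in the deck mostly use only a main body or only END.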


AWK Demonstration

example-01.awk, example-02.awk


AWK Regular Expressions

A pattern enclosed in slashes ('/') is tested against each input record.

• Can consist of letters, numbers, or both.

• /foo/ matches any record that contains "foo"

• ~ matches

• !~ does not match

• | alternation

• ^ matches the beginning of a string

• $ matches the end of a string

• . matches any single character
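
The operators above can be combined as in the following sketch. The three input records (a name and a host name) are hypothetical.

```sh
regex_out=$(printf 'anna ucb.edu\nbert skynet.be\ncleo gmail.com\n' | awk '
    /^a/            { print $1, "begins with a" }      # ^ anchors at the start of the record
    $2 ~ /edu$|be$/ { print $1, "ends in edu or be" }  # ~ with alternation and end anchors
    $2 !~ /\.com$/  { print $1, "is not .com" }        # !~ negates the match, \. is a literal dot
    $1 ~ /^c..o$/   { print $1, "matches c..o" }       # each . stands for any single character
')
echo "$regex_out"
```

A bare /regex/ is matched against the whole record, whereas ~ and !~ match against a chosen field or expression.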


AWK Demonstration


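Scripts

# Print every record that contains ".edu"
>awk '/\.edu/ {print $0}' mail-list.txt
# Print records whose first field contains "J"
>awk '$1 ~ /J/' inventory-shipped.txt
# Print records whose third field ends in "edu" or "be"
>awk '$3 ~ /edu$|be$/' mail-list.txt
# Print the length of the longest line
>awk '{if (length($0)>max) max=length($0)} END{print max}' mail-list.txt
# Print every line that has at least one field
>awk 'NF>0' inventory-shipped.txt
# Print the number of records (reads standard input when no file is given)
>awk 'END{print NR}'
# Print every even-numbered line
>awk 'NR%2==0' mail-list.txt
# Sum the fifth field over records whose first field is "Jan"
>awk '$1=="Jan" {sum+=$5} END{print sum}' inventory-shipped.txt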
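
To try some of these one-liners without the course files, the sketch below feeds them a small stand-in for inventory-shipped.txt. The four lines of data (a month followed by four counts, plus one blank line) are invented for illustration.

```sh
# Hypothetical stand-in for inventory-shipped.txt: month plus four counts.
inventory='Jan 13 25 15 115
Feb 15 32 24 226

Jan 21 36 64 620'

# Lines with at least one field (drops the blank line).
nonblank=$(printf '%s\n' "$inventory" | awk 'NF > 0')
# Total number of records, blank line included.
record_count=$(printf '%s\n' "$inventory" | awk 'END { print NR }')
# Every even-numbered line.
even_lines=$(printf '%s\n' "$inventory" | awk 'NR % 2 == 0')
# Sum of the fifth field over the "Jan" records.
jan_total=$(printf '%s\n' "$inventory" | awk '$1 == "Jan" { sum += $5 } END { print sum }')
# Length of the longest line.
longest=$(printf '%s\n' "$inventory" | awk '{ if (length($0) > max) max = length($0) } END { print max }')

echo "$record_count records, Jan total $jan_total, longest line $longest"
```

A pattern without an action, such as NF > 0, prints the matching record by default, which is why several of the one-liners contain no print statement.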


Variables

• No variable declaration is needed.

• No type declaration is needed.

• Built-in variables:

– NF: number of fields
– NR: current record number
– FS: field separator
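
A short sketch of these variables in use, on made-up comma-separated input. The variable name total is chosen for the example and is never declared.

```sh
vars_out=$(printf 'id,name,city\n1,Ada,Zurich\n2,Bob\n' | awk '
    BEGIN { FS = "," }   # split fields on commas instead of whitespace
    {
        total += NF      # total is never declared, it starts out empty and acts as 0
        print NR ": " NF " fields, last = " $NF
    }
    END { print "total fields: " total }
')
echo "$vars_out"
```

Because NF holds the field count of the current record, $NF always refers to the last field, however many fields a line has.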


Functions

Specified as follows:

function awkFunction(a, b, c, d) {
    return a + b + c + d
}

Built-in functions:

• Numeric:

– sqrt, log, sin, cos, rand, etc.

• String:

– index, length, match, split, substr, etc.
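
The sketch below combines a user-defined function with a few of the built-ins listed above. The helper mean and the input string are hypothetical and not taken from the slides.

```sh
func_out=$(printf 'report-2014.txt\n' | awk '
    # Hypothetical helper, not from the slides: mean of two numbers.
    function mean(a, b) { return (a + b) / 2 }
    {
        print mean(3, 6)            # user-defined function
        print sqrt(16)              # numeric built-in
        print length($0)            # number of characters in the record
        print index($0, "-")        # 1-based position of the first "-"
        print substr($0, 8, 4)      # 4 characters starting at position 8
        n = split($0, parts, "-")   # split on "-" into parts[1] .. parts[n]
        print n, parts[2]
    }
')
echo "$func_out"
```

String positions in AWK start at 1, which affects index, substr and the array filled by split.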


Arrays

Associative arrays:

• Indices are strings rather than numbers

• arrayname[string]=value

• Multi-dimensional arrays:

– Supported by concatenating the indices into one string
– foo[5,12]="value"
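
The following sketch shows what the index concatenation means in practice: the parts of a multi-dimensional subscript are joined with the built-in separator SUBSEP. The array names count, grid and idx are chosen for the example.

```sh
array_out=$(awk 'BEGIN {
    count["apple"] = 3                   # string subscript
    count[7] = "seven"                   # numeric subscripts are converted to strings
    grid[5, 12] = "value"                # stored under the single key 5 SUBSEP 12

    if ((5, 12) in grid) print "grid[5,12] =", grid[5, 12]
    if ((5 SUBSEP 12) in grid) print "same key built by hand"
    if ("7" in count) print "count[7] =", count["7"]

    split(5 SUBSEP 12, idx, SUBSEP)      # recover the two index parts from a joined key
    print idx[1], idx[2]
}')
echo "$array_out"
```

Since a multi-dimensional key is one joined string, iterating with for (key in grid) yields the joined form, and split with SUBSEP is the usual way to take it apart again.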


AWK Demonstration

example-03.awk, example-04.awk


AWK Example - Arrays

BEGIN {}
{
    letters[$4]++;
}
END {
    for (var in letters)
        print var, "exists", letters[var], "times."

    if ("A" in letters)
        print "A exists"
    else
        print "A does not exist"
}
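
The counting part of this program can be run on a few hypothetical four-field records (name, phone, e-mail, one-letter category), as sketched below. The order in which for (var in letters) visits the keys is not defined by the language, so the output is piped through sort to make it predictable.

```sh
# Hypothetical four-field records: name, phone, e-mail, one-letter category.
letters_out=$(printf 'Ann 555-0001 ann@example.edu A\nBen 555-0002 ben@example.com F\nCat 555-0003 cat@example.be A\n' | awk '
    { letters[$4]++ }    # count how often each value of field 4 occurs
    END {
        for (var in letters)
            print var, "exists", letters[var], "times."
    }
' | sort)
echo "$letters_out"
```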


Proposed Literature

AWK scripts: https://github.com/data-science-course/lectures/tree/master/awk

A. D. Robbins. Gawk: Effective AWK Programming. Free Software Foundation, Inc., 4.1 edition, April 2014.

How to read the user guide:

• Fast reading: Chapters 1-10

• Practical examples: Chapter 11


What is next?

• SQL and relational databases

• Plotting and visualizing data
