Week4

Code Club(Wrangling Data With Python)

Tony Hirst & Sam Leon@psychemedia / tony.hirst @ okfn.org

Week 4

Wee

k 4

–Le

arn

ing

Ob

ject

ives By the end of this session, you will be able to:

• Recall and make use of what we did previously…

• Be able to merge data from different data frames

• Be familiar with the idea of “tidy data”

• Have taken first steps in being able to reshape datasets (long-wide data transforms, pivots)

• Be able to start cleaning datasets using a variety of strategies

Wee

ks 1

-3 –

Rec

apIn previous weeks, you have learned…

• How to getting data into and out of pandas from a variety of file types (CSV, XSLSX) on your computer and the web, as well as from HTML wen pages and the World Bank Indicators API

• How Python can use lists to represent data and how pandas represents tabular data using a data frame

• How to filter, sort and generally manipulate tabular data using pandas

• How to process data columns and derive new ones

Why Code?

Ref

: Pyt

ho

n f

or

Fin

an

ce

Efficiency and Productivity Through Python

From Prototyping to Production

Financial industry context- “quants” develop proof-of-concept models in

eg Matlab or R- developers translate applications into

production code (C++, Java)

Inefficiencies – prototype code not reusableDiverse skill sets – different programming languages & regimesLegacy code – maintenance and development becomes complex

Ref

: Pyt

ho

n f

or

Fin

an

ce


Shorter time to results

- eg ability to download data directly source API- eg ability to perform vectorised, column based

operations,such as cumulative sum operations

- eg ability to reshape datasets - eg ability to generate charts directly from

appropriately shaped data

?pd.read_csvFi

nd

ing

Hel

p…

pd.<TAB>df. <TAB>pd.read_csv.<TAB>

Ref

: Pyt

ho

n f

or

Fin

an

ce


A consistent technological framework

“Python has the potential to provide a single, powerful, consistent framework with which to streamline end to end development and production efforts…”

Fro

m t

he

com

man

d li

ne… cd “/path/to/my file”

(no ! required.....)

MergingData

.reset_index()

Mer

gio

ng

dat

a…

pd.merge()

Reshaping data

Wide and long format data“Tidy data”

.melt() and .pivot()

Hadley Wickham, “Tidy Data”, Journal of Statistical Softwarehttp://vita.had.co.nz/papers/tidy-data.pdf

Res

hap

ing

dat

a: m

elt pd.melt()

Index COL_A COL_B COL_C

ROW1 VAL_A1 VAL_B1 VAL_C1


Index variable value

ROW1 COL_A VAL_A1

ROW2 COL_A VAL_A2

ROW1 COL_B VAL_B1

ROW2 COL_B VAL_B2

ROW1 COL_C VAL_C1

ROW2 COL_C VAL_C2

pd.melt( df, id_vars=“COL_A”, value_vars=[“COL_B”,”COL_C”])

Res

hap

ing

dat

a: m

elt pd.melt()

Reshape:R rowsM index cols (id_vars=[])N value cols (value_vars=[])

to: M index cols1 variable col1 value colLots of rows…

Res

hap

ing

dat

a: m

elt pd.pivot()

Index COL_A COL_B COL_C



df.pivot( index=‘Index’, variable=‘variable’, value=‘value’)

Index variable value

ROW1 COL_A VAL_A1

ROW2 COL_A VAL_A2

ROW1 COL_B VAL_B1

ROW2 COL_B VAL_B2

ROW1 COL_C VAL_C1

ROW2 COL_C VAL_C2

Res

hap

ing

dat

a: m

elt

& p

ivo

t

CleaningData

Cle

anin

g o

per

atio

ns Handle empty rows/cols

Handle duplicate rows(“deduping”)

Clean up text stringsMap one value to another

Drop empty rowsdf.dropna(axis=0,how=‘all’)

Drop empty columnsdf.dropna(axis=1,how=‘all’)

Emp

ty a

nd

du

plc

iate

dat

a

Remove duplicate rowsdf.drop_duplicates()df.drop_duplicates([columns])

Tid

y w

ith

in c

olu

mn

s

df3['incomeLevel'].str.split(':')

Stri

ng

Op

erat

ion

s

http://pandas.pydata.org/pandas-docs/dev/text.html

A n

ew P

yth

on

dat

a st

ruct

ure

: dic

t()

Dictionaries / “dicts” / {}“associatively addressed lists”

Tran

slat

ing

or

con

vert

ing

valu

es map()

Week4

Documents

Transcript of Week4