Code Club (Wrangling Data With Python)
Tony Hirst & Sam Leon
@psychemedia / tony.hirst @ okfn.org
Week 4
Week 4: Learning Objectives
By the end of this session, you will be able to:
• Recall and make use of what we did previously…
• Merge data from different data frames
• Explain the idea of “tidy data”
• Take first steps in reshaping datasets (long-wide data transforms, pivots)
• Start cleaning datasets using a variety of strategies
Weeks 1-3: Recap
In previous weeks, you have learned…
• How to get data into and out of pandas from a variety of file types (CSV, XLSX) on your computer and the web, as well as from HTML web pages and the World Bank Indicators API
• How Python can use lists to represent data and how pandas represents tabular data using a data frame
• How to filter, sort and generally manipulate tabular data using pandas
• How to process data columns and derive new ones
Ref: Python for Finance
Efficiency and Productivity Through Python
From Prototyping to Production
Financial industry context
- “quants” develop proof-of-concept models in eg Matlab or R
- developers translate applications into production code (C++, Java)
Inefficiencies: prototype code not reusable
Diverse skill sets: different programming languages & regimes
Legacy code: maintenance and development become complex
Ref: Python for Finance
Efficiency and Productivity Through Python
Shorter time to results
- eg ability to download data directly from a source API
- eg ability to perform vectorised, column-based operations, such as cumulative sum operations
- eg ability to reshape datasets
- eg ability to generate charts directly from appropriately shaped data
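As a rough illustration of the "vectorised, column-based operations" point, the sketch below computes a cumulative sum over a whole column without an explicit loop. The data frame and its column names (`day`, `amount`) are made up for the example.

```python
import pandas as pd

# Hypothetical data; the column names are illustrative only.
df = pd.DataFrame({"day": [1, 2, 3, 4], "amount": [10, 20, 30, 40]})

# Vectorised, column-based operation: no explicit loop over rows.
df["running_total"] = df["amount"].cumsum()

# Another derived column, computed from whole columns in one expression.
df["share_of_total"] = df["amount"] / df["amount"].sum()

print(df)
```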
Ref: Python for Finance
Efficiency and Productivity Through Python
A consistent technological framework
“Python has the potential to provide a single, powerful, consistent framework with which to streamline end to end development and production efforts…”
Hadley Wickham, “Tidy Data”, Journal of Statistical Software, http://vita.had.co.nz/papers/tidy-data.pdf
Reshaping data: melt
pd.melt()
Index COL_A COL_B COL_C
ROW1 VAL_A1 VAL_B1 VAL_C1
ROW2 VAL_A2 VAL_B2 VAL_C2
Index variable value
ROW1 COL_A VAL_A1
ROW2 COL_A VAL_A2
ROW1 COL_B VAL_B1
ROW2 COL_B VAL_B2
ROW1 COL_C VAL_C1
ROW2 COL_C VAL_C2
pd.melt(df, id_vars="Index", value_vars=["COL_A", "COL_B", "COL_C"])
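A minimal sketch of the wide-to-long transform shown in the two tables above. It assumes that `Index` is an ordinary column (not the data frame's index) and that it is the identifier to keep, with `COL_A`, `COL_B` and `COL_C` as the value columns.

```python
import pandas as pd

# The wide table from the slide, with "Index" as a normal column (an assumption).
wide = pd.DataFrame({
    "Index": ["ROW1", "ROW2"],
    "COL_A": ["VAL_A1", "VAL_A2"],
    "COL_B": ["VAL_B1", "VAL_B2"],
    "COL_C": ["VAL_C1", "VAL_C2"],
})

# Keep "Index" as the identifier; stack the other columns into variable/value pairs.
long = pd.melt(wide, id_vars="Index", value_vars=["COL_A", "COL_B", "COL_C"])

print(long)
```

If the row labels live in the data frame's index instead, calling `wide.reset_index()` first is one way to turn them into a column before melting.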
Reshaping data: melt
pd.melt()
Reshape:
R rows
M index cols (id_vars=[])
N value cols (value_vars=[])
to:
M index cols
1 variable col
1 value col
Lots of rows…
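The shape bookkeeping above can be checked on a small example. The sketch below uses a made-up frame with R = 3 rows, M = 2 index columns and N = 3 value columns; all column names are hypothetical.

```python
import pandas as pd

# Hypothetical frame: R = 3 rows, M = 2 index cols, N = 3 value cols.
df = pd.DataFrame({
    "country": ["UK", "UK", "FR"],
    "year": [2012, 2013, 2012],
    "x": [1, 2, 3],
    "y": [4, 5, 6],
    "z": [7, 8, 9],
})
id_vars = ["country", "year"]
value_vars = ["x", "y", "z"]

melted = pd.melt(df, id_vars=id_vars, value_vars=value_vars)

# Each original row contributes one output row per value column.
expected_rows = len(df) * len(value_vars)
# M index cols + 1 variable col + 1 value col.
expected_cols = len(id_vars) + 2

print(melted.shape, (expected_rows, expected_cols))
```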
Reshaping data: pivot
pd.pivot()
Index COL_A COL_B COL_C
ROW1 VAL_A1 VAL_B1 VAL_C1
ROW2 VAL_A2 VAL_B2 VAL_C2
df.pivot(index='Index', columns='variable', values='value')
Index variable value
ROW1 COL_A VAL_A1
ROW2 COL_A VAL_A2
ROW1 COL_B VAL_B1
ROW2 COL_B VAL_B2
ROW1 COL_C VAL_C1
ROW2 COL_C VAL_C2
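A minimal sketch of the long-to-wide direction, rebuilding the wide table from the long one. It uses the `index`, `columns` and `values` keyword names that `DataFrame.pivot` accepts, and again assumes `Index` is an ordinary column in the long table.

```python
import pandas as pd

# The long table from the slide.
long = pd.DataFrame({
    "Index": ["ROW1", "ROW2", "ROW1", "ROW2", "ROW1", "ROW2"],
    "variable": ["COL_A", "COL_A", "COL_B", "COL_B", "COL_C", "COL_C"],
    "value": ["VAL_A1", "VAL_A2", "VAL_B1", "VAL_B2", "VAL_C1", "VAL_C2"],
})

# One output row per "Index" value, one output column per "variable" value.
wide = long.pivot(index="Index", columns="variable", values="value")

print(wide)
```

`pivot` expects each (index, columns) pair to appear only once; duplicated pairs raise a `ValueError`.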
Cleaning operations
Handle empty rows/cols
Handle duplicate rows (“deduping”)
Clean up text strings
Map one value to another
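The last two operations might look something like the sketch below. The `country` column and the replacement table are invented for illustration; they are not taken from the session's datasets.

```python
import pandas as pd

# Hypothetical messy column.
df = pd.DataFrame({"country": ["  uk", "UK ", "U.K.", "France"]})

# Clean up text strings: trim surrounding whitespace and normalise case.
df["country_clean"] = df["country"].str.strip().str.upper()

# Map one value to another: a lookup of old value -> new value.
replacements = {"U.K.": "UK", "FRANCE": "FR"}
df["country_clean"] = df["country_clean"].replace(replacements)

print(df)
```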
Empty and duplicate data
Drop empty rows
df.dropna(axis=0, how='all')
Drop empty columns
df.dropna(axis=1, how='all')
Remove duplicate rows
df.drop_duplicates()
df.drop_duplicates([columns])
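A small sketch of these calls applied in sequence to a made-up frame that has one fully empty row, one fully empty column and one duplicated row. The column names are hypothetical, and the column list is passed with the `subset=` keyword for clarity.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 2 is entirely empty, "empty" is an all-NaN column,
# and rows 0 and 1 are duplicates of each other.
df = pd.DataFrame({
    "name": ["a", "a", np.nan, "b"],
    "score": [1.0, 1.0, np.nan, 2.0],
    "empty": [np.nan, np.nan, np.nan, np.nan],
})

# Drop rows where every cell is missing.
no_empty_rows = df.dropna(axis=0, how="all")

# Drop columns where every cell is missing.
no_empty_cols = no_empty_rows.dropna(axis=1, how="all")

# Remove rows that duplicate an earlier row across all columns.
deduped = no_empty_cols.drop_duplicates()

# Remove rows that duplicate an earlier row on selected columns only.
deduped_by_name = no_empty_cols.drop_duplicates(subset=["name"])

print(deduped)
```

Each call returns a new data frame and leaves the original untouched, so the intermediate results can be inspected along the way.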
A new Python data structure: dict()
Dictionaries / “dicts” / {}
“associatively addressed lists”
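To make "associatively addressed" concrete: items in a dict are looked up by key instead of by position. The keys and values below are made up for the example.

```python
# A dict maps keys to values; items are addressed by key, not by position.
capitals = {"UK": "London", "France": "Paris"}

# Add a new key/value pair.
capitals["Germany"] = "Berlin"

# Look up a value by its key.
uk_capital = capitals["UK"]

# .get() returns a fallback instead of raising KeyError for a missing key.
missing = capitals.get("Spain", "unknown")

# Iterate over key/value pairs.
for country, city in capitals.items():
    print(country, city)
```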